# Fake news classifier


***In this notebook, we're about to build a robust classifier with the purpose of identifying fake news. The two datatsets used to build the classifier are taken from Kaggle : https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset. Work will also be split in two notebooks : this one will analyze the data and gather some useful insight, the second one will be for experimenting with different models.***

To start off, let's import some of the libraries we're going to need, then take a look at our data.

In [1]:
import pandas as pd
import numpy as np

In [2]:
true_dataset = pd.read_csv('news/True.csv')
fake_dataset = pd.read_csv('news/Fake.csv')

In [3]:
true_dataset

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"
...,...,...,...,...
21412,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017"
21413,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017"
21414,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017"
21415,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017"


In [4]:
fake_dataset

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"
...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016"
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016"
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016"
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016"


***First and foremost, we can observe the following :***
1.  Both datasets are relatively similar in size  
2.  Both contain same four features(title, text, subject, date)

In [5]:
true_dataset.isna().sum()

title      0
text       0
subject    0
date       0
dtype: int64

In [6]:
fake_dataset.isna().sum()

title      0
text       0
subject    0
date       0
dtype: int64

No missing values in both datasets!

In [7]:
true_dataset['subject'].value_counts()

politicsNews    11272
worldnews       10145
Name: subject, dtype: int64

In [8]:
fake_dataset['subject'].value_counts()

News               9050
politics           6841
left-news          4459
Government News    1570
US_News             783
Middle-east         778
Name: subject, dtype: int64

We will now proceed to merge both datasets, as well as editing them, to an extent. A column denoting whether or not the article is fake is also required. Let's now do all of that.

In [9]:
true_dataset['is_fake'] = 0
fake_dataset['is_fake'] = 1

data = pd.concat([true_dataset, fake_dataset])
data = data.reset_index()
data = data.sample(frac = 1)

In [10]:
data

Unnamed: 0,index,title,text,subject,date,is_fake
29462,8045,Bette Midler’s Brilliant Tweet Perfectly Expo...,"Let s just say, hypothetically, that Donald Tr...",News,"February 17, 2016",1
42068,20651,MAN BRUTALLY ASSAULTED At CA Trump Rally Tells...,This video would be on a 24/7 mainstream media...,left-news,"May 1, 2016",1
982,982,First charges filed in Russia probe led by U.S...,WASHINGTON (Reuters) - A federal grand jury in...,politicsNews,"October 28, 2017",0
2304,2304,Interior Department watchdog to investigate th...,WASHINGTON (Reuters) - The Interior Department...,politicsNews,"August 4, 2017",0
38169,16752,MICHELLE AND BARACK OBAMA Had Time For This “C...,President Barack Obama and First Lady Michelle...,Government News,"Feb 20, 2016",1
...,...,...,...,...,...,...
18143,18143,Factbox: German coalition watch: Merkel seeks ...,BERLIN (Reuters) - Chancellor Angela Merkel wo...,worldnews,"October 6, 2017",0
18570,18570,Barca closed soccer stadium to show support fo...,BARCELONA (Reuters) - Barcelona president Jose...,worldnews,"October 1, 2017",0
1442,1442,U.S. to expel nearly two-thirds of Cuban embas...,WASHINGTON (Reuters) - The Trump administratio...,politicsNews,"October 3, 2017",0
22763,1346,Donald Trump Complains To Europe That They Ma...,"During the presidential campaign, Donald Trump...",News,"May 27, 2017",1


In [11]:
data.duplicated().sum()

0

For this project, I will be dropping both the subject and date columns, as I believe an accurate model could be built without them.

In [12]:
data = data.drop(['subject', 'date'], axis = 1)
data

Unnamed: 0,index,title,text,is_fake
29462,8045,Bette Midler’s Brilliant Tweet Perfectly Expo...,"Let s just say, hypothetically, that Donald Tr...",1
42068,20651,MAN BRUTALLY ASSAULTED At CA Trump Rally Tells...,This video would be on a 24/7 mainstream media...,1
982,982,First charges filed in Russia probe led by U.S...,WASHINGTON (Reuters) - A federal grand jury in...,0
2304,2304,Interior Department watchdog to investigate th...,WASHINGTON (Reuters) - The Interior Department...,0
38169,16752,MICHELLE AND BARACK OBAMA Had Time For This “C...,President Barack Obama and First Lady Michelle...,1
...,...,...,...,...
18143,18143,Factbox: German coalition watch: Merkel seeks ...,BERLIN (Reuters) - Chancellor Angela Merkel wo...,0
18570,18570,Barca closed soccer stadium to show support fo...,BARCELONA (Reuters) - Barcelona president Jose...,0
1442,1442,U.S. to expel nearly two-thirds of Cuban embas...,WASHINGTON (Reuters) - The Trump administratio...,0
22763,1346,Donald Trump Complains To Europe That They Ma...,"During the presidential campaign, Donald Trump...",1


In [13]:
from sklearn.model_selection import train_test_split

train, test = train_test_split(data, test_size = 0.2, random_state = 42)
X_train = train.drop('is_fake', axis = 1)
y_train = train['is_fake']
X_test = test.drop('is_fake', axis = 1)
y_test = test['is_fake']

In [14]:
y_train.value_counts()

1    18755
0    17163
Name: is_fake, dtype: int64

In [15]:
y_test.value_counts()

1    4726
0    4254
Name: is_fake, dtype: int64

In [16]:
train_true = X_train.loc[(y_train == 0), :]
train_fake = X_train.loc[(y_train == 1), :]

***Now, we're about to engineer a several features on our own, which will hopefully aid us in our task to correctly classify fake news.***

## Lexical Diversity of Fake News
Let's define a measure for lexical diversity to find out how many unique vocabs are used in fake news articles.

Let's define the lexical diversity measure as $\frac{\text{number of unique words in one (target)category}}{\text{number of words in both (target)categories}}$

In [17]:
import string

def lexical_diversity(data, extra, feature): # Function for calculating the aforementioned measure
    
    # Getting all the words
    col = data[feature].str.lower()
    text = ''.join(col) # Merge all articles into one big text
    exclude = set(string.punctuation) # Create set of excluded characters
    words = ''.join(char for char in text if char not in exclude) 
    split_words = words.split()
    
    # Analogously
    extra_col = data[feature].str.lower()
    extra_text = ''.join(extra_col)
    extra_exclude = set(string.punctuation)
    extra_words = ''.join(char for char in extra_text if char not in extra_exclude)
    extra_split_words = extra_words.split()
    
    return len(set(split_words)) / (len(split_words) + len(extra_split_words))

print('Lexical diversity of true news: %.5f' % lexical_diversity(train_true, train_fake, 'text'))
print('Lexical diversity of fake news: %.5f' % lexical_diversity(train_fake, train_true, 'text'))

Lexical diversity of true news: 0.00698
Lexical diversity of fake news: 0.01096


And interestingly enough, there is in fact a significant difference in the unique vocabulary used in both types of news.

## Punctuation

Punctuation might play a big role in determining the autheticity of the articles. Things like question and exclamation marks, commas, stops are likely to be prevalent in one of the types of data. The investigation will commence in the following code cells.


In [19]:
import regex as re

# Helper function
def count_punctuation(text):
    
    peri = re.subn(r'\.', '', text)[1]
    comm = re.subn(r'\,', '', text)[1]
    ques = re.subn(r'\?', '', text)[1]
    excl = re.subn(r'\!', '', text)[1]
    
    return [peri, comm, ques, excl]

def get_punct_dataframe(dataset, feature):
    return dataset.apply(lambda row: pd.Series({'_peri' + feature:count_punctuation(row[feature])[0],
                                                'comm_' + feature:count_punctuation(row[feature])[1],
                                                'ques_' + feature:count_punctuation(row[feature])[2],
                                                'excl_' + feature:count_punctuation(row[feature])[3]}), axis = 1)

punctuation_train_title = get_punct_dataframe(train, 'title')
punctuation_test_title = get_punct_dataframe(test, 'title')
punctuation_train_text = get_punct_dataframe(train, 'text')
punctuation_test_text = get_punct_dataframe(test, 'text')

In [20]:
punctuation_train_text

Unnamed: 0,_peritext,comm_text,ques_text,excl_text
24373,16,21,1,0
22139,21,20,0,0
5512,56,34,0,0
13263,18,8,0,0
32936,15,19,0,0
...,...,...,...,...
38832,127,147,8,0
27085,35,31,6,4
9396,17,21,0,0
20611,20,28,0,0


## Text and title length

In [21]:
def create_len_dataframe(dataset):
    return dataset.apply(lambda row: pd.Series({'title_length': len(row['title']), 
                                                'text_length': len(row['text'])}), axis = 1)

train_len = create_len_dataframe(train)
test_len = create_len_dataframe(test)

In [22]:
train_len

Unnamed: 0,title_length,text_length
24373,81,1614
22139,87,2237
5512,54,3710
13263,68,1948
32936,113,2435
...,...,...
38832,91,15818
27085,96,3735
9396,43,2105
20611,66,3052
