# Building a Fake News classifier with Naive Bayes

The idea of this project is to build a fake news classifier with the Naive Bayes algorithm. I'm using the fake and real news dataset by Clément Bisaillon, which contains 23,481 fake news articles and 21,417 real news articles. The files can be found [here](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset?select=True.csv).

In [1]:
import re
import pandas as pd
pd.options.display.max_colwidth=100

fake = pd.read_csv('Fake.csv')
real = pd.read_csv('True.csv')

In [2]:
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing,"Donald Trump just couldn t wish all Americans a Happy New Year and leave it at that. Instead, he...",News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian Collusion Investigation,House Intelligence Committee Chairman Devin Nunes is going to have a bad day. He s been under th...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’,"On Friday, it was revealed that former Milwaukee Sheriff David Clarke, who was being considered ...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES),"On Christmas day, Donald Trump announced that he would be back to work the following day, but ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump During His Christmas Speech,Pope Francis used his annual Christmas Day message to rebuke Donald Trump without even mentionin...,News,"December 25, 2017"


In [3]:
real.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip their fiscal script","WASHINGTON (Reuters) - The head of a conservative Republican faction in the U.S. Congress, who v...",politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits on Monday: Pentagon,WASHINGTON (Reuters) - Transgender people will be allowed for the first time to enlist in the U....,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Mueller do his job',WASHINGTON (Reuters) - The special counsel investigation of links between Russia and President T...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat tip-off: NYT,WASHINGTON (Reuters) - Trump campaign adviser George Papadopoulos told an Australian diplomat in...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much more' for Amazon shipments,SEATTLE/WASHINGTON (Reuters) - President Donald Trump called on the U.S. Postal Service on Frida...,politicsNews,"December 29, 2017"


At first glance, one thing to look at when trying to tell whether a news item is fake or real is the first line of the text: the real articles shown above start with the location and the media outlet where they were written (for example, `WASHINGTON (Reuters) -`). I'm going to create a column in both dataframes with the labels Fake and Real, then concatenate both dataframes and keep only the title and label columns. After that I'll find the percentage of fake and real news.

In [4]:
fake['label'] = 'Fake'
real['label'] = 'Real'

full = pd.concat([fake, real])
full.drop(columns=['text', 'subject', 'date'], inplace=True)
full.head()

Unnamed: 0,title,label
0,Donald Trump Sends Out Embarrassing New Year’s Eve Message; This is Disturbing,Fake
1,Drunk Bragging Trump Staffer Started Russian Collusion Investigation,Fake
2,Sheriff David Clarke Becomes An Internet Joke For Threatening To Poke People ‘In The Eye’,Fake
3,Trump Is So Obsessed He Even Has Obama’s Name Coded Into His Website (IMAGES),Fake
4,Pope Francis Just Called Out Donald Trump During His Christmas Speech,Fake


However, this set is really big, so let's take 10% of it.

In [5]:
data_randomized = full.sample(frac=1, random_state=1)
full_set_index = round(len(data_randomized) * 0.90)
full_set = data_randomized[full_set_index:].reset_index(drop=True)
full_set.shape[0]

4490
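
The cell above shuffles the rows and keeps the last 10% of them. As a minimal sketch, a similar random 10% subset can be drawn in one step with `DataFrame.sample`. The frame `toy` is hypothetical and only stands in for `full`, and this draw does not necessarily select the same rows as the slice above.

```python
import pandas as pd

# Toy stand-in for `full` (hypothetical data, 20 rows)
toy = pd.DataFrame({
    'title': [f'headline {i}' for i in range(20)],
    'label': ['Fake', 'Real'] * 10,
})

# Draw 10% of the rows at random; random_state makes the draw repeatable
subset = toy.sample(frac=0.1, random_state=1).reset_index(drop=True)

print(len(subset))  # 2
```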

In [6]:
full_set['label'].value_counts(normalize=True)

Fake    0.512695
Real    0.487305
Name: label, dtype: float64

The subset is split roughly 51/49 between fake and real news. To test the classifier on headlines it has not seen during training, I'm keeping 80% of the set for training and 20% for testing.

In [7]:
data_randomized = full_set.sample(frac=1, random_state=1)

training_test_index = round(len(data_randomized) * 0.8)

train = data_randomized[:training_test_index].reset_index(drop=True)
test = data_randomized[training_test_index:].reset_index(drop=True)

print(train.shape[0])
print(test.shape[0])

3592
898


In [8]:
train['label'].value_counts(normalize=True)

Fake    0.504454
Real    0.495546
Name: label, dtype: float64

In [9]:
test['label'].value_counts(normalize=True)

Fake    0.545657
Real    0.454343
Name: label, dtype: float64

After sampling, the two sets ended up with slightly different proportions of fake/real news: about 50/50 in the training set and about 55/45 in the test set.
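
The proportions printed for the training and test sets differ by a few points because the split is purely random. If matching proportions matter, one option is a stratified split. The sketch below is an assumption of mine rather than the approach used in this notebook: it relies on `DataFrameGroupBy.sample` (available in pandas 1.1 and later), and `toy`, `train_toy` and `test_toy` are hypothetical names.

```python
import pandas as pd

# Toy stand-in for `full_set` (hypothetical data): 10 fake and 10 real titles
toy = pd.DataFrame({
    'title': [f'headline {i}' for i in range(20)],
    'label': ['Fake'] * 10 + ['Real'] * 10,
})

# Sample 80% of each label separately, so both labels keep their share
train_toy = toy.groupby('label').sample(frac=0.8, random_state=1)
# Everything not drawn for training goes to the test set
test_toy = toy.drop(train_toy.index)

print(train_toy['label'].value_counts().to_dict())
print(test_toy['label'].value_counts().to_dict())
```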

Now let's clean the dataset to train the algorithm

## Cleaning letter case and punctuation for titles only

In [10]:
#Before cleaning
train.head()

Unnamed: 0,title,label
0,Trump says to approve lifting restrictions on South Korea missile payload limits,Real
1,"VANISHED: FBI FILES Related To Mysterious “Suicide” Death Of Hillary’s Trusted WH Counsel, Vince...",Fake
2,"WAR ON WORDS: Facebook Censorship Widens, Website to Curate ‘Favored’ News",Fake
3,Mexico to review need for tax changes after U.S. reform-document,Real
4,Harry Reid Has Brutally Honest Message For The GOP From Their Pals At The NRA (TWEET),Fake


In [11]:
#After cleaning
train['title'] = train['title'].str.replace(r'\W', ' ', regex=True)
train['title'] = train['title'].str.lower()
train.head()

Unnamed: 0,title,label
0,trump says to approve lifting restrictions on south korea missile payload limits,Real
1,vanished fbi files related to mysterious suicide death of hillary s trusted wh counsel vince...,Fake
2,war on words facebook censorship widens website to curate favored news,Fake
3,mexico to review need for tax changes after u s reform document,Real
4,harry reid has brutally honest message for the gop from their pals at the nra tweet,Fake
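
To make the cleaning step concrete, here is a small sketch of the same three operations (replace non-word characters, lowercase, split) on a single title using only the standard library. The helper name `clean_title` is hypothetical.

```python
import re

def clean_title(title):
    # \W matches any character that is not a letter, digit or underscore
    title = re.sub(r'\W', ' ', title)
    # Lowercase, then split on whitespace to get the list of words
    return title.lower().split()

tokens = clean_title('Mexico to review need for tax changes after U.S. reform-document')
print(tokens)
# ['mexico', 'to', 'review', 'need', 'for', 'tax', 'changes', 'after', 'u', 's', 'reform', 'document']
```

Note that abbreviations such as "U.S." are broken into the single-letter tokens `u` and `s`, which is also visible in the cleaned table.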


## Creating vocabulary for title only

In [12]:
train['title'] = train['title'].str.split()
vocabulary = []

for row in train['title']:
    for word in row:
        vocabulary.append(word)
        
vocabulary = list(set(vocabulary))

len(vocabulary)

7716

There are 7716 unique words in the train set. Now let's create a dictionary to count the words in the fake and real news titles.

In [13]:
word_counts_per_title = {unique_word: [0] * len(train['title']) for unique_word in vocabulary}

for index, title in enumerate(train['title']):
    for word in title:
        word_counts_per_title[word][index] += 1
        
word_counts_title = pd.DataFrame(word_counts_per_title)
word_counts_title.head()

Unnamed: 0,screams,scotland,epa,baton,enigma,alleviate,unyielding,fulfilling,boardrooms,columbia,...,driven,arabia,adios,arbitrary,cake,drills,handler,hills,donation,toilet
0,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,0,0,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
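
The table above has one column per vocabulary word and one row per title, so most of its cells are zero. As a lighter alternative sketch (not what this notebook does), the per-label word totals that Naive Bayes needs can be accumulated directly with `collections.Counter`. The list `titles` and the names `counts` and `toy_vocabulary` are hypothetical.

```python
from collections import Counter

# Hypothetical tokenized titles with labels, standing in for `train`
titles = [
    (['trump', 'says', 'tax', 'reform'], 'Real'),
    (['watch', 'trump', 'video'], 'Fake'),
    (['boom', 'trump', 'video', 'video'], 'Fake'),
]

# One running word count per label
counts = {'Fake': Counter(), 'Real': Counter()}
for tokens, label in titles:
    counts[label].update(tokens)

# The vocabulary is every word seen under either label
toy_vocabulary = set(counts['Fake']) | set(counts['Real'])

print(counts['Fake']['video'])  # 3
print(counts['Real']['video'])  # 0 (a Counter returns 0 for unseen words)
print(len(toy_vocabulary))      # 7
```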


In [14]:
train_clean = pd.concat([train, word_counts_title], axis=1)
train_clean.head()

Unnamed: 0,title,label,screams,scotland,epa,baton,enigma,alleviate,unyielding,fulfilling,...,driven,arabia,adios,arbitrary,cake,drills,handler,hills,donation,toilet
0,"[trump, says, to, approve, lifting, restrictions, on, south, korea, missile, payload, limits]",Real,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
1,"[vanished, fbi, files, related, to, mysterious, suicide, death, of, hillary, s, trusted, wh, cou...",Fake,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
2,"[war, on, words, facebook, censorship, widens, website, to, curate, favored, news]",Fake,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
3,"[mexico, to, review, need, for, tax, changes, after, u, s, reform, document]",Real,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0
4,"[harry, reid, has, brutally, honest, message, for, the, gop, from, their, pals, at, the, nra, tw...",Fake,0,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0,0,0


## Calculating constants first

Now I need to find the probability of fake news given a title and of real news given a title. To do that, we first need to find:
* p(fake)
* p(real)
* Nfake
* Nreal
* Nvocabulary

In [15]:
fake_title = train_clean[train_clean['label'] == 'Fake']
real_title = train_clean[train_clean['label'] == 'Real']

In [16]:
p_fake = len(fake_title)/len(train_clean)
p_real = len(real_title)/len(train_clean)

print('p(fake):', p_fake)
print('p(real):', p_real)

p(fake): 0.5044543429844098
p(real): 0.4955456570155902


In [17]:
n_words_per_fake_title = fake_title['title'].apply(len)
n_fake = n_words_per_fake_title.sum()

n_words_per_real_title = real_title['title'].apply(len)
n_real = n_words_per_real_title.sum()

n_vocabulary = len(vocabulary)

alpha = 1

print('Nfake:', n_fake)
print('Nreal:', n_real)
print('Nvocabulary:', n_vocabulary)

Nfake: 28392
Nreal: 18753
Nvocabulary: 7716


## Calculating parameters
Now that we have the constant terms calculated above, we can move on with calculating the parameters P(wi|Fake) and P(wi|Real). Each parameter will thus be a conditional probability value associated with each word in the vocabulary.

In [18]:
parameters_fake = {unique_word: 0 for unique_word in vocabulary}
parameters_real = {unique_word: 0 for unique_word in vocabulary}

for word in vocabulary:
    n_word_fake = fake_title[word].sum()
    p_word_fake = (n_word_fake + alpha) / (n_fake + alpha * n_vocabulary)
    parameters_fake[word] = p_word_fake
    
    n_word_real = real_title[word].sum()
    p_word_real = (n_word_real + alpha) / (n_real + alpha * n_vocabulary)
    parameters_real[word] = p_word_real
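
As a worked example of the smoothing formula used in the loop above, the sketch below plugs in the constants printed earlier (Nfake, Nreal, Nvocabulary) for a word that never occurs in a class. The helper name `smoothed_probability` is hypothetical.

```python
def smoothed_probability(n_word_in_class, n_class, n_vocabulary, alpha=1):
    # Additive (Laplace) smoothing: every word gets alpha extra counts
    return (n_word_in_class + alpha) / (n_class + alpha * n_vocabulary)

# Constants printed earlier in this notebook
n_fake, n_real, n_vocabulary = 28392, 18753, 7716

# A word that never appears in a class still gets a small non-zero probability
p_unseen_fake = smoothed_probability(0, n_fake, n_vocabulary)  # equals 1 / 36108
p_unseen_real = smoothed_probability(0, n_real, n_vocabulary)  # equals 1 / 26469

print(p_unseen_fake)
print(p_unseen_real)
```

Without the `alpha` term, a single word unseen in one class would set that class's whole product to zero.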

## Classifying a new headline

In [19]:
def classify(headline):
    
    headline = re.sub(r'\W', ' ', headline)
    headline = headline.lower()
    headline = headline.split()
    
    p_fake_given_headline = p_fake
    p_real_given_headline = p_real
    
    for word in headline:
        if word in parameters_fake:
            p_fake_given_headline *= parameters_fake[word]
        
        if word in parameters_real:
            p_real_given_headline *= parameters_real[word]
            
    print('P(fake|headline):', p_fake_given_headline)
    print('P(real|headline):', p_real_given_headline)
    
    if p_fake_given_headline > p_real_given_headline:
        print('Label: Fake')
    elif p_fake_given_headline < p_real_given_headline:
        print('Label: Real')
    else:
        print('Equal probabilities, have a human classify this')

In [20]:
classify('Obama issues call to action in eulogy for Lewis, links Trump to foes of civil rights in 1960s')

P(fake|headline): 9.063723844475128e-52
P(real|headline): 8.25304087114931e-51
Label: Real


In [21]:
classify('Biden Campaign Whittles VP Shortlist Down To Either Woman Or Man With Long Hair')

P(fake|headline): 2.9246202514610007e-40
P(real|headline): 1.898196617709361e-41
Label: Fake
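
The products printed above are already as small as 1e-40 to 1e-52 for a single headline. With much longer inputs, a product of many small probabilities could underflow to zero in floating point. A common workaround is to add logarithms instead of multiplying. The sketch below is a variation on `classify`, not the notebook's implementation, and the dictionaries `toy_fake` and `toy_real` hold hypothetical probabilities.

```python
import math

# Hypothetical smoothed word probabilities, standing in for
# parameters_fake and parameters_real
toy_fake = {'boom': 0.02, 'video': 0.03, 'trump': 0.05}
toy_real = {'boom': 0.001, 'video': 0.002, 'trump': 0.04}
prior_fake, prior_real = 0.5, 0.5

def log_score(tokens, prior, parameters):
    # The sum of logs equals the log of the product,
    # but it does not shrink toward zero as more words are added
    score = math.log(prior)
    for word in tokens:
        if word in parameters:
            score += math.log(parameters[word])
    return score

tokens = ['boom', 'trump', 'video']
fake_score = log_score(tokens, prior_fake, toy_fake)
real_score = log_score(tokens, prior_real, toy_real)

label = 'Fake' if fake_score > real_score else 'Real'
print(label)  # Fake
```

Because the logarithm is monotonic, comparing the two log scores gives the same label as comparing the two products.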


## Classifying test samples

In [22]:
def classify_test(headline):
    
    headline = re.sub(r'\W', ' ', headline)
    headline = headline.lower()
    headline = headline.split()
    
    p_fake_given_headline = p_fake
    p_real_given_headline = p_real
    
    for word in headline:
        if word in parameters_fake:
            p_fake_given_headline *= parameters_fake[word]
        
        if word in parameters_real:
            p_real_given_headline *= parameters_real[word]
    
    if p_fake_given_headline > p_real_given_headline:
        return 'Fake'
    elif p_fake_given_headline < p_real_given_headline:
        return 'Real'
    else:
        return 'Needs human classification'

In [23]:
test['predicted'] = test['title'].apply(classify_test)
test.head()

Unnamed: 0,title,label,predicted
0,Saudi-led coalition allows first aid ship into Yemen's Hodeidah port: local officials,Real,Real
1,Trump Administration In FREE FALL After ‘Concrete Evidence’ Linking Trump To Russia Emerges (DE...,Fake,Fake
2,Watch Robert Reich’s Brain Nearly MELT After Republican Says CIA Is ‘Attacking’ Trump (VIDEO),Fake,Fake
3,U.S. Navy may raise current 308-ship target for fleet,Real,Real
4,BOOM! DR ALVEDA KING Scolds Sen. Liz Warren: We won’t accept racist bait and switch [Video],Fake,Fake


In [24]:
print('Test labels:')
print(test['label'].value_counts())
print('-' * 20)
print('Predicted labels:')
print(test['predicted'].value_counts())

Test labels:
Fake    490
Real    408
Name: label, dtype: int64
--------------------
Predicted labels:
Fake    494
Real    404
Name: predicted, dtype: int64


The number of headlines predicted as real is slightly lower than the number labeled real (404 vs. 408), while the number predicted as fake is slightly higher (494 vs. 490). No headline required human classification.

In [25]:
test[(test['label'] == 'Real') & (test['predicted'] == 'Fake')]

Unnamed: 0,title,label,predicted
15,Republican Romney to make 'major speech' on 2016 presidential race: Fox,Real,Fake
35,"For Republican Rubio, a moment of truth in race to lead U.S.",Real,Fake
73,NRA calls for more regulation after Vegas shooting,Real,Fake
74,"Trump says he is 'very, very close' to making Fed chair decision",Real,Fake
93,Syrian militant group releases video of leader apparently uninjured,Real,Fake
108,Martin Luther King's daughter says 'God can triumph over Trump',Real,Fake
112,Rand Paul's accused attacker pleads not guilty to assault,Real,Fake
134,Trump accuses Cruz of stealing Iowa caucuses through 'fraud',Real,Fake
141,"Obama, Clinton scold Trump over proposed Muslim ban",Real,Fake
158,Factbox: Why the Zika virus is causing alarm,Real,Fake


There are 38 real headlines that were classified as fake. I searched for them on Google and found that all of them appear as headlines on the Reuters website and on other sites.

In [26]:
test[(test['label'] == 'Fake') & (test['predicted'] == 'Real')]

Unnamed: 0,title,label,predicted
10,Clinton’s ‘No-Fly Zone’ over Syria Will Not “Save Lives” – It Will Lead to War with Russia,Fake,Real
27,Sanders Campaign Fights Back After Ohio Bans Youth From Primaries,Fake,Real
39,"#CrookedHillary’s Karma! NEW POLL Shows Trump Takes Lead In BLUE STATES: Michigan, Wisconsin, Ne...",Fake,Real
49,NY Lawmakers Compare MMA Fighting To ‘Gay P*rn’ And Police Abuse,Fake,Real
57,"House GOP Holds Zika Bill Hostage, Proposes Plan To Slash Birth Control Funding",Fake,Real
80,Trump Slapped With Lawsuit For Refusing To Release White House Visitor Logs,Fake,Real
83,Trump Considers ‘Terminating NAFTA’ With Executive Order (DETAILS),Fake,Real
111,BOOM! Fed Judge Ruling UNBLOCKS Trump Travel Ban…Asks ACLU Lawyer: “Where Does It Say Muslim Cou...,Fake,Real
115,Trump Owes A Foreign Bank At Least $100 Million – And It’s Fighting U.S. Regulators,Fake,Real
149,"Before Scalia’s Body Is Even Cold, Republicans Vow To Obstruct ANY Obama Nominee",Fake,Real


On the other side, 34 fake news headlines were classified as real. A search on Google showed that the majority of them don't appear at all, and at least 3 of them were clickbait or were mislabeled. For example, for the headline "Report: Facebook’s Zuckerberg Gave FBI’s Mueller Info for Russia Investigation" I found multiple articles on websites like CNBC, WSJ (The Wall Street Journal), politico.com and Vanity Fair that had the same headline or a similar one. Maybe it is the content of the article, rather than the headline, that makes it fake news.

Now let's measure the accuracy of the fake news classifier.

In [27]:
correct = 0
total = test.shape[0]

for _, row in test.iterrows():
    if row['label'] == row['predicted']:
        correct += 1
        
print('Correct:', correct)
print('Incorrect:', total - correct)
print('Accuracy:', correct / total)

Correct: 826
Incorrect: 72
Accuracy: 0.9198218262806236
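
The same accuracy can be computed without a loop, and a cross-tabulation shows how the errors split between the two kinds of mistake. This is a minimal sketch on a hypothetical frame `results` that stands in for `test`. It uses `pd.crosstab`, which the notebook itself does not use.

```python
import pandas as pd

# Hypothetical results frame, standing in for `test`
results = pd.DataFrame({
    'label':     ['Fake', 'Fake', 'Real', 'Real', 'Real'],
    'predicted': ['Fake', 'Real', 'Real', 'Fake', 'Real'],
})

# Fraction of rows where the prediction matches the label
accuracy = (results['label'] == results['predicted']).mean()

# Rows are true labels, columns are predicted labels
confusion = pd.crosstab(results['label'], results['predicted'])

print(accuracy)  # 0.6
print(confusion)
```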


The algorithm has an accuracy of about 92%, which is good. However, it looks like analyzing only the title isn't enough to classify fake/real news properly, and there's a chance that the accuracy might go down when using a bigger portion of the full dataset. An analysis of the article text is required, but doing it with this algorithm requires more processing power. Techniques like stemming and feature selection might also help improve the accuracy.