# Honours Project - Fake News Detection

In [1]:
"""
git code:
git add .
git commit -m "First commit"
git push origin master
"""

'\ngit code:\ngit add .\ngit commit -m "First commit"\ngit push origin master\n'

This project follows CRISP-DM (Cross-Industry Standard Process for Data Mining), a widely used process model for knowledge discovery in data sets.
The process comprises six phases:

    1. Business Understanding
    2. Data Understanding
    3. Data Preparation
    4. Modeling
    5. Evaluation
    6. Deployment

Possible data sets:

https://www.kaggle.com/pontes/fake-news-sample

Real News: https://archive.ics.uci.edu/ml/datasets/News+Aggregator

https://toolbox.google.com/datasetsearch/search?query=fake%20news&docid=sHyIQgRMuTsFH02AAAAAAA%3D%3D

https://github.com/hanselowski/athene_system/tree/master/data

## Step 1 - Business Understanding

TODO

Goals/ Objectives &
Success Criteria

## Step 2 - Data Understanding

### 2 a) - Importing Libraries

In [2]:
import pandas as pd
import numpy as np
import nltk
import sklearn
import string
import time

from sklearn.model_selection import train_test_split

#nltk.download('punkt')

#from sklearn.model_selection import train_test_split
#import sklearn.model_selection as ms
#import sklearn.feature_extraction.text as text
#import sklearn.naive_bayes as nb
#from sklearn.feature_extraction.text import CountVectorizer
#from sklearn.naive_bayes import MultinomialNB
#from sklearn.naive_bayes import BernoulliNB
#from sklearn.naive_bayes import GaussianNB
#from sklearn.metrics import confusion_matrix
#from sklearn.metrics import accuracy_score

### 2 b) - Loading the Data Set

In [3]:
df = pd.read_csv("data/FakeNews-(balanced)/fake_or_real_news.csv", encoding="utf-8")
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [4]:
#preview of a FAKE article
df.iloc[16,2]

'Shocking! Michele Obama & Hillary Caught Glamorizing Date Rape Promoters First lady claims moral high ground while befriending rape-glorifying rappers Infowars.com - October 27, 2016 Comments \nAlex Jones breaks down the complete hypocrisy of Michele Obama and Hillary Clinton attacking Trump for comments he made over a decade ago while The White House is hosting and promoting rappers who boast about date raping women and selling drugs in their music. \nRappers who have been welcomed to the White House by the Obama’s include “Rick Ross,” who promotes drugging and raping woman in his song “U.O.N.E.O.” \nWhile attacking Trump as a sexual predator, Michelle and Hillary have further mainstreamed the degradation of women through their support of so-called musicians who attempt to normalize rape. NEWSLETTER SIGN UP Get the latest breaking news & specials from Alex Jones and the Infowars Crew. Related Articles'

In [5]:
#preview of a REAL article
df.iloc[8,2]

'Hillary Clinton and Donald Trump made some inaccurate claims during an NBC “commander-in-chief” forum on military and veterans issues:\n\n• Clinton wrongly claimed Trump supported the war in Iraq after it started, while Trump was wrong, once again, in saying he was against the war before it started.\n\n•\xa0Trump said that President Obama set a “certain date” for withdrawing troops from Iraq, when that date was set before Obama was sworn in.\n\n•\xa0Trump said that Obama’s visits to China, Saudi Arabia and Cuba were “the first time in the history, the storied history of Air Force One” when “high officials” of a host country did not appear to greet the president. Not true.\n\n•\xa0Clinton said that Trump supports privatizing the Veterans Health Administration. That’s false. Trump said he supports allowing veterans to seek care at either public or private hospitals.\n\n•\xa0Trump said Clinton made “a terrible mistake on Libya” when she was secretary of State. But, at the time, Trump als

In [6]:
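#note: without parentheses this only echoes the bound method (see output); df.info() prints the column summary
df.info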

<bound method DataFrame.info of       Unnamed: 0                                              title  \
0           8476                       You Can Smell Hillary’s Fear   
1          10294  Watch The Exact Moment Paul Ryan Committed Pol...   
2           3608        Kerry to go to Paris in gesture of sympathy   
3          10142  Bernie supporters on Twitter erupt in anger ag...   
4            875   The Battle of New York: Why This Primary Matters   
5           6903                                        Tehran, USA   
6           7341  Girl Horrified At What She Watches Boyfriend D...   
7             95                  ‘Britain’s Schindler’ Dies at 106   
8           4869  Fact check: Trump and Clinton at the 'commande...   
9           2909  Iran reportedly makes new push for uranium con...   
10          1357  With all three Clintons in Iowa, a glimpse at ...   
11           988  Donald Trump’s Shockingly Weak Delegate Game S...   
12          7041  Strong Solar Storm, Tech Ri

In [7]:
columns = df.columns.tolist()

In [8]:
print(columns)

['Unnamed: 0', 'title', 'text', 'label']


In [9]:
df["label"].value_counts()

REAL    3171
FAKE    3164
Name: label, dtype: int64

The data set seems to be well balanced.
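The balance can also be expressed as class shares. The sketch below is a minimal, standard-library illustration that rebuilds the label column from the counts printed above; the helper name `class_shares` is hypothetical and not part of the notebook.

```python
from collections import Counter

def class_shares(labels):
    """Return the fraction of samples that belongs to each label."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# label counts taken from the value_counts() output above
labels = ["REAL"] * 3171 + ["FAKE"] * 3164
shares = class_shares(labels)
for label, share in shares.items():
    print(label, round(share, 4))
```

Both shares are within a tenth of a percentage point of 50 %, which supports the statement that the data set is balanced.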

In [10]:
type(df["title"])

pandas.core.series.Series

In [11]:
type(df["text"])

pandas.core.series.Series

In [12]:
df.dtypes

Unnamed: 0     int64
title         object
text          object
label         object
dtype: object

Since "Unnamed: 0" contains only integers, it appears to be the original index. To confirm that it can serve as an index, the column is checked for duplicates.

In [13]:
print(any(df["Unnamed: 0"].duplicated()))

False


In [14]:
df.isnull().values.any()

False

## Step 3 - Data Preparation

### 3 a) - General Polishing

In [15]:
#rename "Unnamed: 0" and make it the index of the data frame
df.columns = ["index", "title", "text", "label"]
df.set_index("index", inplace=True)

In [16]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [17]:
#order by index
df.sort_index(inplace=True)

In [18]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
2,Study: women had to drive 4 times farther afte...,Ever since Texas laws closed about half of the...,REAL
3,"Trump, Clinton clash in dueling DC speeches","Donald Trump and Hillary Clinton, now at the s...",REAL
5,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,REAL
6,"Despite Constant Debate, Americans' Abortion O...",It's been a big week for abortion news.\n\nCar...,REAL
7,Obama Argues Against Goverment Shutdown Over P...,President Barack Obama said Saturday night tha...,REAL


In [19]:
df.index

Int64Index([    2,     3,     5,     6,     7,     9,    10,    12,    14,
               16,
            ...
            10543, 10545, 10546, 10547, 10548, 10549, 10551, 10553, 10555,
            10557],
           dtype='int64', name='index', length=6335)

The index skips some numbers, for example 8 and 11, so it will be reassigned as a contiguous range starting at 0.

In [20]:
df['index'] = df.reset_index().index

In [21]:
df.set_index("index", inplace=True)
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Study: women had to drive 4 times farther afte...,Ever since Texas laws closed about half of the...,REAL
1,"Trump, Clinton clash in dueling DC speeches","Donald Trump and Hillary Clinton, now at the s...",REAL
2,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,REAL
3,"Despite Constant Debate, Americans' Abortion O...",It's been a big week for abortion news.\n\nCar...,REAL
4,Obama Argues Against Goverment Shutdown Over P...,President Barack Obama said Saturday night tha...,REAL


The data contains escape characters such as \n, \t and \xa0 (a non-breaking space). They will be removed from the data set.

### 3 b) - Normalising The Data

In [22]:
# strip() removes leading and trailing whitespace such as \n; other characters (e.g. \a) are kept
# Example:
s = "\n \a abc \n \n"
print(s.strip())

 abc


In [23]:
s = "\n\nCar"
print(s.strip())

Car


In [24]:
df["text"] = df["text"].apply(lambda x: x.strip())

In [25]:
# characters inside the string are not touched by strip(), so they are removed with replace()
# note: replacing with "" joins the neighbouring words (see "ObamasaidSaturday" in the output below);
# replacing with " " would keep them apart
df["text"] = df["text"].apply(lambda x: x.replace("\n", ""))
df["text"] = df["text"].apply(lambda x: x.replace("\t", ""))
df["text"] = df["text"].apply(lambda x: x.replace("\xa0", ""))

df["title"] = df["title"].apply(lambda x: x.replace("\n", ""))
df["title"] = df["title"].apply(lambda x: x.replace("\t", ""))
df["title"] = df["title"].apply(lambda x: x.replace("\xa0", ""))

In [26]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,Study: women had to drive 4 times farther afte...,Ever since Texas laws closed about half of the...,REAL
1,"Trump, Clinton clash in dueling DC speeches","Donald Trump and Hillary Clinton, now at the s...",REAL
2,"As Reproductive Rights Hang In The Balance, De...",WASHINGTON -- Forty-three years after the Supr...,REAL
3,"Despite Constant Debate, Americans' Abortion O...",It's been a big week for abortion news.Carly F...,REAL
4,Obama Argues Against Goverment Shutdown Over P...,President Barack ObamasaidSaturday night that ...,REAL
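The last row of the output shows a side effect of deleting whitespace characters outright: "Obama\xa0said\xa0Saturday" becomes "ObamasaidSaturday". The following is a minimal sketch of an alternative that collapses every run of whitespace into a single space instead. The helper name `collapse_whitespace` and the sample string are assumptions made for illustration, not part of the notebook.

```python
import re

def collapse_whitespace(text):
    """Replace each run of whitespace (\\n, \\t, \\xa0, ...) with one space."""
    # \s matches Unicode whitespace for str patterns, which includes \xa0
    return re.sub(r"\s+", " ", text).strip()

sample = "President Barack Obama\xa0said\xa0Saturday night\n\nthat"
print(collapse_whitespace(sample))
# prints: President Barack Obama said Saturday night that
```

Applied to a column, this would be `df["text"].apply(collapse_whitespace)`; word boundaries then survive into tokenisation.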


The next step is to remove punctuation.

In [27]:
import re

#build a character class from string.punctuation; re.escape keeps characters like ] and \ literal
punct_pattern = "[{}]".format(re.escape(string.punctuation))

#regex=True is required in recent pandas versions, where str.replace is literal by default
df["title"] = df["title"].str.replace(punct_pattern, "", regex=True)
df["text"] = df["text"].str.replace(punct_pattern, "", regex=True)
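Punctuation can also be removed without a regular expression. The sketch below uses `str.translate` with a deletion table built from `string.punctuation`; the helper name `strip_punctuation` is hypothetical. It also illustrates a limitation that applies to any approach based on `string.punctuation`: only ASCII punctuation is covered, so typographic characters such as the curly apostrophe in "Hillary’s" are left in place.

```python
import string

# translation table that deletes every ASCII punctuation character
PUNCT_TABLE = str.maketrans("", "", string.punctuation)

def strip_punctuation(text):
    """Remove ASCII punctuation; non-ASCII marks such as ’ are kept."""
    return text.translate(PUNCT_TABLE)

print(strip_punctuation("It's been a big week, hasn't it?"))
# prints: Its been a big week hasnt it
print(strip_punctuation("You Can Smell Hillary’s Fear"))
# prints: You Can Smell Hillary’s Fear
```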

In [28]:
#convert every word to lower case - normalising case
df["title"] = df["title"].str.lower()
df["text"] = df["text"].str.lower()
df["label"] = df["label"].str.lower()

In [29]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,study women had to drive 4 times farther after...,ever since texas laws closed about half of the...,real
1,trump clinton clash in dueling dc speeches,donald trump and hillary clinton now at the st...,real
2,as reproductive rights hang in the balance deb...,washington fortythree years after the supreme...,real
3,despite constant debate americans abortion opi...,its been a big week for abortion newscarly fio...,real
4,obama argues against goverment shutdown over p...,president barack obamasaidsaturday night that ...,real


Additionally, the labels will be converted to binary values: 0 for real and 1 for fake.

In [30]:
#map the lower-cased labels to integers: real -> 0, fake -> 1
df["label"] = df["label"].map({"real": 0, "fake": 1})

In [31]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,study women had to drive 4 times farther after...,ever since texas laws closed about half of the...,0
1,trump clinton clash in dueling dc speeches,donald trump and hillary clinton now at the st...,0
2,as reproductive rights hang in the balance deb...,washington fortythree years after the supreme...,0
3,despite constant debate americans abortion opi...,its been a big week for abortion newscarly fio...,0
4,obama argues against goverment shutdown over p...,president barack obamasaidsaturday night that ...,0


In the next step, stop words such as "the" or "a" will be removed, since they carry little information about the content of a sentence.

In [32]:
from nltk.corpus import stopwords
stop_words = stopwords.words('english')
print(stop_words)

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his', 'himself', 'she', "she's", 'her', 'hers', 'herself', 'it', "it's", 'its', 'itself', 'they', 'them', 'their', 'theirs', 'themselves', 'what', 'which', 'who', 'whom', 'this', 'that', "that'll", 'these', 'those', 'am', 'is', 'are', 'was', 'were', 'be', 'been', 'being', 'have', 'has', 'had', 'having', 'do', 'does', 'did', 'doing', 'a', 'an', 'the', 'and', 'but', 'if', 'or', 'because', 'as', 'until', 'while', 'of', 'at', 'by', 'for', 'with', 'about', 'against', 'between', 'into', 'through', 'during', 'before', 'after', 'above', 'below', 'to', 'from', 'up', 'down', 'in', 'out', 'on', 'off', 'over', 'under', 'again', 'further', 'then', 'once', 'here', 'there', 'when', 'where', 'why', 'how', 'all', 'any', 'both', 'each', 'few', 'more', 'most', 'other', 'some', 'such', 'no', 'nor', 'not', 'only', 'own', 'same', 'so', 'than', '

In [33]:
from nltk.tokenize import word_tokenize
import string

#function that tokenises a sentence and removes punctuation, non-alphabetic tokens and stop words
def func_normalise(sentence):
    tokens = word_tokenize(sentence)

    #extend the NLTK stop word list with contraction fragments ("n't", "nt") and "u"
    stop_words = set(stopwords.words("english"))
    stop_words.update(["n't", "nt", "u"])

    #strip punctuation characters from every token
    table = str.maketrans("", "", string.punctuation)
    stripped = [w.translate(table) for w in tokens]

    #keep purely alphabetic tokens (this also drops numbers) that are not stop words
    words = [word for word in stripped if word.isalpha()]
    new_sentence = [w for w in words if w not in stop_words]

    return " ".join(new_sentence)

In [34]:
#testing the function
func_normalise("ever ? since hasn't the / texa*s laws closed about half of the where didn't")

'ever since texas laws closed half'

Now the above function will be applied to the data in order to normalise it.

In [35]:
df["title"] = df["title"].apply(func_normalise)

In [36]:
df["text"] = df["text"].apply(func_normalise)

In [37]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,study women drive times farther texas laws clo...,ever since texas laws closed half states abort...,0
1,trump clinton clash dueling dc speeches,donald trump hillary clinton starting line gen...,0
2,reproductive rights hang balance debate modera...,washington fortythree years supreme court esta...,0
3,despite constant debate americans abortion opi...,big week abortion newscarly fiorinas passionat...,0
4,obama argues goverment shutdown planned parent...,president barack obamasaidsaturday night congr...,0


As seen in the above example the text data has been (successfully) normalised.

### 3 c) - Stemming 

Stemming is the process of reducing words to their root. For example, "playing" and "played" reduce to the stem "play". Stemming therefore shrinks the vocabulary and makes it easier to focus on the meaning of a sentence.

In [38]:
#from nltk.stem.porter import PorterStemmer
#according to the nltk website the snowballstemmer is better than the "original" porter stemmer
#https://www.nltk.org/howto/stem.html

from nltk.stem.snowball import SnowballStemmer

#function that stems words in a sentence
def func_stem(sentence):
    tokens = word_tokenize(sentence)
    snowball_stemmer = SnowballStemmer("english")
    stemmed_sentence = [snowball_stemmer.stem(word) for word in tokens]
    stemmed_sentence_str = " ".join(stemmed_sentence)
    return stemmed_sentence_str

In [39]:
#test
func_stem("playing player play played plays")

'play player play play play'

In [40]:
df["title"] = df["title"].apply(func_stem)

In [41]:
df["text"] = df["text"].apply(func_stem)

In [42]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,studi women drive time farther texa law close ...,ever sinc texa law close half state abort clin...,0
1,trump clinton clash duel dc speech,donald trump hillari clinton start line genera...,0
2,reproduct right hang balanc debat moder drop ball,washington fortythre year suprem court establi...,0
3,despit constant debat american abort opinion r...,big week abort newscar fiorina passion inaccur...,0
4,obama argu gover shutdown plan parenthood,presid barack obamasaidsaturday night congress...,0


### 3 d) - Lemmatising

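According to the literature, lemmatising and stemming are similar. However, stemming cuts off word endings with fixed rules, whereas lemmatising looks each word up in a vocabulary (here WordNet) and returns its dictionary form. To test whether lemmatising makes a difference in accuracy, it will be implemented. Note that it is applied here to text that has already been stemmed, and that WordNetLemmatizer treats every word as a noun unless a part of speech is given, which is why "playing" and "played" are unchanged in the test below.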

In [43]:
"""
import nltk
nltk.download('wordnet')
"""

def func_lemmatise(sentence):
    tokens = word_tokenize(sentence)
    lemmatiser = nltk.WordNetLemmatizer()
    lemmatised_sentence = [lemmatiser.lemmatize(word) for word in tokens]
    lemmatised_sentence_str = " ".join(lemmatised_sentence)
    return lemmatised_sentence_str

In [44]:
#test
func_lemmatise("playing player play played plays")

'playing player play played play'

In [45]:
df["title"] = df["title"].apply(func_lemmatise)

In [46]:
df["text"] = df["text"].apply(func_lemmatise)

In [47]:
df.head()

Unnamed: 0_level_0,title,text,label
index,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
0,studi woman drive time farther texa law close ...,ever sinc texa law close half state abort clin...,0
1,trump clinton clash duel dc speech,donald trump hillari clinton start line genera...,0
2,reproduct right hang balanc debat moder drop ball,washington fortythre year suprem court establi...,0
3,despit constant debat american abort opinion r...,big week abort newscar fiorina passion inaccur...,0
4,obama argu gover shutdown plan parenthood,presid barack obamasaidsaturday night congress...,0


### 3 e) - TF-IDF 

Since ML algorithms require numerical input rather than text, the text will be vectorised using TF-IDF (term frequency - inverse document frequency). TF-IDF is a measure of the originality of a word: it compares the number of times the word appears in a document with the number of documents the word appears in.
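To make the weighting concrete, the sketch below computes TF-IDF by hand for three toy titles. It is an illustration under stated assumptions, not the notebook's method: it assumes the smoothed variant idf(t) = ln((1 + n) / (1 + df(t))) + 1 followed by L2 normalisation of each row, which is what scikit-learn documents as the `TfidfVectorizer` default, and it tokenises by whitespace only. The function name `tfidf_rows` and the sample titles are hypothetical.

```python
import math
from collections import Counter

def tfidf_rows(docs):
    """Return (vocabulary, rows) with one L2-normalised TF-IDF row per document."""
    tokenised = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenised for tok in doc})
    n_docs = len(tokenised)
    # document frequency: in how many documents does each term occur?
    doc_freq = Counter(tok for doc in tokenised for tok in set(doc))
    # smoothed inverse document frequency (assumed scikit-learn default)
    idf = {tok: math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1 for tok in vocab}
    rows = []
    for doc in tokenised:
        counts = Counter(doc)
        weights = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in weights)) or 1.0
        rows.append([w / norm for w in weights])
    return vocab, rows

vocab, rows = tfidf_rows(["trump clinton debat", "clinton email", "trump ralli"])
for tok, weight in zip(vocab, rows[0]):
    print(tok, round(weight, 3))
```

In the first title, "debat" receives a higher weight than "trump" or "clinton" because it occurs in only one of the three documents, which is the "originality" effect described above.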

In [48]:
#remove non-Latin characters (e.g. Arabic script) etc. Maybe activate them later?
df["title"] = df["title"].replace("[^a-zA-Z0-9 ]", "", regex=True)

In [49]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer()
feature_matrix = tfidf.fit_transform(df["title"])

In [50]:
feature_matrix.toarray()

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [51]:
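#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
tfidf.get_feature_names_out().tolist()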

['aap',
 'abandon',
 'abbi',
 'abc',
 'abcwapo',
 'abduct',
 'abdullah',
 'abedin',
 'abil',
 'abl',
 'abnorm',
 'aboard',
 'abolish',
 'abort',
 'abortionrevers',
 'abridg',
 'abroad',
 'abrog',
 'absenc',
 'absente',
 'absolut',
 'abstain',
 'absurd',
 'abu',
 'abus',
 'abyss',
 'aca',
 'accept',
 'access',
 'accid',
 'accident',
 'accomplish',
 'accord',
 'account',
 'accur',
 'accus',
 'acela',
 'acheron',
 'achiev',
 'ackbar',
 'acknowledg',
 'aclu',
 'acquir',
 'acquisit',
 'acquit',
 'acquitt',
 'acr',
 'across',
 'act',
 'action',
 'activ',
 'activist',
 'actor',
 'actual',
 'acupunctur',
 'ad',
 'adam',
 'adapt',
 'add',
 'adderal',
 'addict',
 'addictionher',
 'address',
 'adelson',
 'adequ',
 'adhd',
 'adhm',
 'adjust',
 'admin',
 'administr',
 'admir',
 'admit',
 'adopt',
 'ador',
 'adpr',
 'adpresnet',
 'adult',
 'advanc',
 'advantag',
 'advert',
 'advertis',
 'advic',
 'advis',
 'advisor',
 'advoc',
 'aerodynam',
 'afar',
 'affair',
 'affect',
 'affili',
 'affirm',
 'affo

In [52]:
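#get_feature_names() was removed in scikit-learn 1.2; get_feature_names_out() needs scikit-learn >= 1.0
df_title_tfidf = pd.DataFrame(feature_matrix.toarray(), columns=tfidf.get_feature_names_out())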

In [53]:
df_title_tfidf.head()

Unnamed: 0,aap,abandon,abbi,abc,abcwapo,abduct,abdullah,abedin,abil,abl,...,zika,zimbabw,zion,zionist,zip,zone,ztech,zuckerberg,zuess,zulu
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0


### 3 f) - Most Frequent Words

Most frequent words in "real" news within the data set:

In [54]:
df_real = df.loc[df["label"] == 0]

In [55]:
from collections import Counter
Counter(" ".join(df_real["title"]).split()).most_common(10)

[('trump', 634),
 ('clinton', 399),
 ('obama', 293),
 ('gop', 243),
 ('donald', 185),
 ('hillari', 184),
 ('debat', 167),
 ('republican', 163),
 ('new', 141),
 ('say', 138)]

In [56]:
Counter(" ".join(df_real["text"]).split()).most_common(10)

[('said', 15065),
 ('trump', 13904),
 ('clinton', 9705),
 ('state', 9410),
 ('would', 7747),
 ('republican', 7681),
 ('presid', 6377),
 ('say', 6338),
 ('one', 6203),
 ('peopl', 6055)]

Most frequent words in "fake" news within the data set:

In [57]:
df_fake = df.loc[df["label"] == 1]

In [58]:
Counter(" ".join(df_fake["title"]).split()).most_common(10)

[('trump', 451),
 ('hillari', 398),
 ('clinton', 336),
 ('elect', 211),
 ('u', 200),
 ('new', 139),
 ('russia', 125),
 ('fbi', 122),
 ('video', 122),
 ('america', 115)]

In [59]:
Counter(" ".join(df_fake["text"]).split()).most_common(10)

[('clinton', 6939),
 ('u', 6852),
 ('trump', 6515),
 ('peopl', 5478),
 ('state', 5402),
 ('one', 5203),
 ('would', 4895),
 ('hillari', 4516),
 ('like', 4101),
 ('elect', 4013)]

## Step 4 - Machine Learning

In [60]:
#assign the labels to y to compare them with the predictions made by the model
y = df["label"]

In [61]:
from sklearn.metrics import accuracy_score
from sklearn.metrics import precision_score
from sklearn.metrics import recall_score
from sklearn.metrics import f1_score
from sklearn.metrics import log_loss
from sklearn.metrics import roc_auc_score

#function that compares different test_size splits and reports several metric scores for each
#note: log_loss and roc_auc_score receive hard class labels here, not probabilities,
#so ROC-AUC reduces to balanced accuracy and log loss to a scaled error rate
def func_classifier_metrics(classifier):
    start = time.time()
    print("-start-")
    for x in range(1, 10):
        print(x)
        X_train, X_test, y_train, y_test = train_test_split(
            df_title_tfidf, y, test_size=x/10, random_state=2, shuffle=True)
        classifier.fit(X_train, y_train)

        pred_on_test_data = classifier.predict(X_test)
        acc_score = accuracy_score(y_test.values, pred_on_test_data)
        prec = precision_score(y_test.values, pred_on_test_data, pos_label=1)
        recall = recall_score(y_test.values, pred_on_test_data)
        f1 = f1_score(y_test.values, pred_on_test_data, average="binary")
        l_loss = log_loss(y_test.values, pred_on_test_data)
        roc_auc = roc_auc_score(y_test.values, pred_on_test_data)

        print("Test size: ", x/10,
              "| Accuracy: ", "{0:.4f}".format(acc_score),
              "| Precision: ", "{0:.4f}".format(prec),
              "| Recall: ", "{0:.4f}".format(recall),
              "| F1: ", "{0:.4f}".format(f1),
              "| ROC-AUC: ", "{0:.4f}".format(roc_auc),
              "| Log. Loss: ", "{0:.4f}".format(l_loss))
    print("end of loop")
    print("Time: {} mins".format(round((time.time() - start) / 60, 2)))
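The function above passes hard 0/1 predictions to `log_loss`, which explains why the reported log-loss values are so large. The sketch below is a pure-Python illustration of binary log loss, assuming a clipping constant of 1e-15 (believed to be the default in older scikit-learn releases; treat this as an assumption). The helper name `binary_log_loss` and the toy values are hypothetical.

```python
import math

def binary_log_loss(y_true, y_prob, eps=1e-15):
    """Mean negative log-likelihood of the true labels under y_prob."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1 - eps)  # clip to avoid log(0)
        total += -(y * math.log(p) + (1 - y) * math.log(1 - p))
    return total / len(y_true)

y_true = [1, 0, 1, 0, 1]
hard_labels = [1, 0, 0, 0, 1]              # one wrong hard prediction
probabilities = [0.9, 0.2, 0.4, 0.1, 0.8]  # the same mistake, expressed as P(fake)

hard_loss = binary_log_loss(y_true, hard_labels)
prob_loss = binary_log_loss(y_true, probabilities)
print(round(hard_loss, 4), round(prob_loss, 4))
```

With hard labels, every wrong prediction costs about ln(1e15) ≈ 34.5, so the loss is roughly 34.5 times the error rate. An accuracy of 0.8044, for example, gives about 0.1956 × 34.5 ≈ 6.76, which is in line with the first Multinomial Naive Bayes row below. Passing `classifier.predict_proba(X_test)[:, 1]` instead would make log loss and ROC-AUC reflect how confident the model is, for classifiers that provide `predict_proba`.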

### 4 a) - Naive Bayes Classifier

In [62]:
from sklearn.naive_bayes import MultinomialNB
from sklearn.naive_bayes import BernoulliNB
from sklearn.naive_bayes import GaussianNB

In [63]:
#Multinomial Naive Bayes
#https://stats.stackexchange.com/questions/33185/difference-between-naive-bayes-multinomial-naive-bayes

mNB = MultinomialNB()
func_classifier_metrics(mNB)

-start-
1
Test size:  0.1 | Accuracy:  0.8044 | Precision:  0.8225 | Recall:  0.7700 | F1:  0.7954 | ROC-AUC:  0.8040 | Log. Loss:  6.7553
2
Test size:  0.2 | Accuracy:  0.8066 | Precision:  0.8407 | Recall:  0.7540 | F1:  0.7950 | ROC-AUC:  0.8063 | Log. Loss:  6.6788
3
Test size:  0.3 | Accuracy:  0.8085 | Precision:  0.8357 | Recall:  0.7607 | F1:  0.7964 | ROC-AUC:  0.8078 | Log. Loss:  6.6135
4
Test size:  0.4 | Accuracy:  0.7987 | Precision:  0.8344 | Recall:  0.7414 | F1:  0.7852 | ROC-AUC:  0.7983 | Log. Loss:  6.9514
5
Test size:  0.5 | Accuracy:  0.8081 | Precision:  0.8192 | Recall:  0.7778 | F1:  0.7980 | ROC-AUC:  0.8073 | Log. Loss:  6.6287
6
Test size:  0.6 | Accuracy:  0.7916 | Precision:  0.8016 | Recall:  0.7637 | F1:  0.7822 | ROC-AUC:  0.7911 | Log. Loss:  7.1968
7
Test size:  0.7 | Accuracy:  0.7781 | Precision:  0.8036 | Recall:  0.7312 | F1:  0.7657 | ROC-AUC:  0.7777 | Log. Loss:  7.6632
8
Test size:  0.8 | Accuracy:  0.7573 | Precision:  0.7772 | Recall:  0.717

In [64]:
bNB = BernoulliNB()
func_classifier_metrics(bNB)

-start-
1
Test size:  0.1 | Accuracy:  0.7950 | Precision:  0.8123 | Recall:  0.7604 | F1:  0.7855 | ROC-AUC:  0.7945 | Log. Loss:  7.0822
2
Test size:  0.2 | Accuracy:  0.8090 | Precision:  0.8464 | Recall:  0.7524 | F1:  0.7966 | ROC-AUC:  0.8087 | Log. Loss:  6.5970
3
Test size:  0.3 | Accuracy:  0.8059 | Precision:  0.8395 | Recall:  0.7489 | F1:  0.7916 | ROC-AUC:  0.8050 | Log. Loss:  6.7043
4
Test size:  0.4 | Accuracy:  0.7976 | Precision:  0.8432 | Recall:  0.7271 | F1:  0.7809 | ROC-AUC:  0.7970 | Log. Loss:  6.9923
5
Test size:  0.5 | Accuracy:  0.7973 | Precision:  0.8278 | Recall:  0.7377 | F1:  0.7801 | ROC-AUC:  0.7959 | Log. Loss:  6.9994
6
Test size:  0.6 | Accuracy:  0.7856 | Precision:  0.8171 | Recall:  0.7245 | F1:  0.7680 | ROC-AUC:  0.7844 | Log. Loss:  7.4058
7
Test size:  0.7 | Accuracy:  0.7709 | Precision:  0.8115 | Recall:  0.7008 | F1:  0.7521 | ROC-AUC:  0.7703 | Log. Loss:  7.9124
8
Test size:  0.8 | Accuracy:  0.7536 | Precision:  0.7853 | Recall:  0.693

In [65]:
gNB = GaussianNB()
func_classifier_metrics(gNB)

-start-
1
Test size:  0.1 | Accuracy:  0.6719 | Precision:  0.7354 | Recall:  0.5240 | F1:  0.6119 | ROC-AUC:  0.6701 | Log. Loss:  11.3314
2
Test size:  0.2 | Accuracy:  0.6811 | Precision:  0.7522 | Recall:  0.5349 | F1:  0.6252 | ROC-AUC:  0.6803 | Log. Loss:  11.0132
3
Test size:  0.3 | Accuracy:  0.6860 | Precision:  0.7504 | Recall:  0.5427 | F1:  0.6299 | ROC-AUC:  0.6838 | Log. Loss:  10.8468
4
Test size:  0.4 | Accuracy:  0.6669 | Precision:  0.7162 | Recall:  0.5442 | F1:  0.6184 | ROC-AUC:  0.6660 | Log. Loss:  11.5039
5
Test size:  0.5 | Accuracy:  0.6689 | Precision:  0.7017 | Recall:  0.5576 | F1:  0.6214 | ROC-AUC:  0.6661 | Log. Loss:  11.4367
6
Test size:  0.6 | Accuracy:  0.6669 | Precision:  0.6943 | Recall:  0.5720 | F1:  0.6272 | ROC-AUC:  0.6650 | Log. Loss:  11.5039
7
Test size:  0.7 | Accuracy:  0.6773 | Precision:  0.7115 | Recall:  0.5875 | F1:  0.6436 | ROC-AUC:  0.6766 | Log. Loss:  11.1444
8
Test size:  0.8 | Accuracy:  0.6730 | Precision:  0.6961 | Recall:

Multinomial Naive Bayes seems to provide the best accuracy score with a test size of 0.2
Accuracy:  0.8113654301499605

### 4 b) - Support Vector Machines

In [66]:
from sklearn.svm import SVC
#svc = SVC(gamma='auto', random_state=0)
#svc = SVC(gamma="auto")
svc = SVC(gamma="scale")

In [67]:
"""
func_classifier_acc(svc)
"""

'\nfunc_classifier_acc(svc)\n'

In [68]:
"""
svc = SVC(gamma="auto")
func_classifier_acc(svc)
"""

'\nsvc = SVC(gamma="auto")\nfunc_classifier_acc(svc)\n'

In [69]:
svc_test = SVC(gamma="scale", C=1.5, kernel="poly", degree=2, coef0=0.001)

In [70]:
#https://docs.google.com/spreadsheets/d/1eSNWea1PujxDQeiwCEZZRcOWYB3lpEsqjekS0HRSUBw/edit#gid=0
#func_classifier_metrics(svc_test)

SVM with gamma=scale and a test size of 0.3 yielded the highest accuracy:  0.8211467648605997

### 4 c) - Perceptron

In [71]:
from sklearn.neural_network import MLPClassifier

In [None]:
#https://docs.google.com/spreadsheets/d/1BBmq1wlc3AzKkBj5wE9EmoaM--XYjovsaVtyVbbHxes/edit#gid=0

In [101]:
mlp = MLPClassifier(alpha=0.6, learning_rate="invscaling")
func_classifier_metrics(mlp)

-start-
1
Test size:  0.1 | Accuracy:  0.7997 | Precision:  0.7688 | Recall:  0.8498 | F1:  0.8073 | ROC-AUC:  0.8003 | Log. Loss:  6.9188
2
Test size:  0.2 | Accuracy:  0.8122 | Precision:  0.7908 | Recall:  0.8460 | F1:  0.8175 | ROC-AUC:  0.8123 | Log. Loss:  6.4880
3
Test size:  0.3 | Accuracy:  0.8133 | Precision:  0.7868 | Recall:  0.8515 | F1:  0.8179 | ROC-AUC:  0.8138 | Log. Loss:  6.4500
4
Test size:  0.4 | Accuracy:  0.5576 | Precision:  0.5289 | Recall:  0.9889 | F1:  0.6892 | ROC-AUC:  0.5610 | Log. Loss:  15.2797
5
Test size:  0.5 | Accuracy:  0.8078 | Precision:  0.7828 | Recall:  0.8381 | F1:  0.8095 | ROC-AUC:  0.8085 | Log. Loss:  6.6396
6
Test size:  0.6 | Accuracy:  0.7919 | Precision:  0.7658 | Recall:  0.8287 | F1:  0.7960 | ROC-AUC:  0.7926 | Log. Loss:  7.1877
7
Test size:  0.7 | Accuracy:  0.7838 | Precision:  0.7609 | Recall:  0.8222 | F1:  0.7904 | ROC-AUC:  0.7841 | Log. Loss:  7.4686
8
Test size:  0.8 | Accuracy:  0.7707 | Precision:  0.7474 | Recall:  0.81



Test size:  0.9 | Accuracy:  0.7383 | Precision:  0.7130 | Recall:  0.7953 | F1:  0.7519 | ROC-AUC:  0.7385 | Log. Loss:  9.0376
end of loop
Time: 24.22 mins


Test size: 0.25 | Accuracy:  0.8232 | Precision:  0.7983 | Recall:  0.8613 | F1:  0.8286 | ROC-AUC:  0.8235 | Log. Loss:  6.1054


0.8153603366649133


In [75]:
mlp = MLPClassifier(hidden_layer_sizes=(150,100,50), max_iter=300,activation = 'relu',solver='adam',random_state=1)

In [76]:
X_train, X_test, y_train, y_test = train_test_split(df_title_tfidf, y, test_size=0.3, random_state=2, shuffle=True)
mlp.fit(X_train ,y_train)
pred_on_test_data = mlp.predict(X_test)
acc_score = accuracy_score(pred_on_test_data, y_test)
print(acc_score)

0.787480273540242


In [77]:
from sklearn.linear_model import Perceptron

### 4 d) - Random Forest

In [78]:
from sklearn.ensemble import RandomForestClassifier

In [79]:
rfc = RandomForestClassifier()
func_classifier_metrics(rfc)

-start-
1




Test size:  0.1 | Accuracy:  0.7839 | Precision:  0.7604 | Recall:  0.8211 | F1:  0.7896 | ROC-AUC:  0.7844 | Log. Loss:  7.4635
2
Test size:  0.2 | Accuracy:  0.7885 | Precision:  0.7855 | Recall:  0.7905 | F1:  0.7880 | ROC-AUC:  0.7885 | Log. Loss:  7.3058
3
Test size:  0.3 | Accuracy:  0.7901 | Precision:  0.7771 | Recall:  0.8045 | F1:  0.7906 | ROC-AUC:  0.7903 | Log. Loss:  7.2494
4
Test size:  0.4 | Accuracy:  0.7830 | Precision:  0.7751 | Recall:  0.7924 | F1:  0.7836 | ROC-AUC:  0.7830 | Log. Loss:  7.4967
5
Test size:  0.5 | Accuracy:  0.7835 | Precision:  0.7575 | Recall:  0.8174 | F1:  0.7863 | ROC-AUC:  0.7843 | Log. Loss:  7.4791
6
Test size:  0.6 | Accuracy:  0.7695 | Precision:  0.7431 | Recall:  0.8093 | F1:  0.7748 | ROC-AUC:  0.7703 | Log. Loss:  7.9601
7
Test size:  0.7 | Accuracy:  0.7504 | Precision:  0.7339 | Recall:  0.7790 | F1:  0.7558 | ROC-AUC:  0.7506 | Log. Loss:  8.6212
8
Test size:  0.8 | Accuracy:  0.7326 | Precision:  0.6996 | Recall:  0.8098 | F1:  0

In [80]:
rfc = RandomForestClassifier(n_estimators=150)
func_classifier_metrics(rfc)

-start-
1
Test size:  0.1 | Accuracy:  0.8139 | Precision:  0.7826 | Recall:  0.8626 | F1:  0.8207 | ROC-AUC:  0.8145 | Log. Loss:  6.4284
2
Test size:  0.2 | Accuracy:  0.7987 | Precision:  0.7713 | Recall:  0.8460 | F1:  0.8070 | ROC-AUC:  0.7990 | Log. Loss:  6.9515
3
Test size:  0.3 | Accuracy:  0.7991 | Precision:  0.7658 | Recall:  0.8526 | F1:  0.8069 | ROC-AUC:  0.7999 | Log. Loss:  6.9406
4
Test size:  0.4 | Accuracy:  0.7952 | Precision:  0.7662 | Recall:  0.8449 | F1:  0.8036 | ROC-AUC:  0.7956 | Log. Loss:  7.0741
5
Test size:  0.5 | Accuracy:  0.7951 | Precision:  0.7547 | Recall:  0.8588 | F1:  0.8034 | ROC-AUC:  0.7967 | Log. Loss:  7.0758
6
Test size:  0.6 | Accuracy:  0.7798 | Precision:  0.7350 | Recall:  0.8609 | F1:  0.7930 | ROC-AUC:  0.7814 | Log. Loss:  7.6057
7
Test size:  0.7 | Accuracy:  0.7698 | Precision:  0.7267 | Recall:  0.8586 | F1:  0.7872 | ROC-AUC:  0.7705 | Log. Loss:  7.9514
8
Test size:  0.8 | Accuracy:  0.7545 | Precision:  0.6968 | Recall:  0.896

In [81]:
rfc = RandomForestClassifier(n_estimators=1000)
func_classifier_metrics(rfc)

-start-
1
Test size:  0.1 | Accuracy:  0.8044 | Precision:  0.7708 | Recall:  0.8594 | F1:  0.8127 | ROC-AUC:  0.8051 | Log. Loss:  6.7553
2


KeyboardInterrupt: 

In [None]:
rfc = RandomForestClassifier(n_estimators=100, max_features=15)
func_classifier_metrics(rfc)

### 4 e) - Word2Vec

In [None]:
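import gensim
import multiprocessing
from gensim.models import Word2Vec
#the time module imported in 2 a) is reused; "from time import time" would shadow it
#and break func_classifier_metrics, which calls time.time()

In [None]:
cores = multiprocessing.cpu_count()

In [None]:
#note: gensim >= 4.0 renames the "size" parameter to "vector_size"
w2v_model = Word2Vec(min_count=20, window=2, size=300, sample=6e-5, alpha=0.03, min_alpha=0.0007, negative=20,
                     workers=cores-1)

In [None]:
#Word2Vec expects an iterable of token lists, so each article is split into words first
sentences = df["text"].str.split()

t = time.time()

w2v_model.build_vocab(sentences, progress_per=10000)

print('Time to build vocab: {} mins'.format(round((time.time() - t) / 60, 2)))

In [None]:
t = time.time()

w2v_model.train(sentences, total_examples=w2v_model.corpus_count, epochs=30, report_delay=1)

print('Time to train the model: {} mins'.format(round((time.time() - t) / 60, 2)))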

In [None]:
w2v_model.init_sims(replace=True)

In [None]:
w2v_model.wv.similar_by_word("clinton")

In [None]:
w2v_model.wv.most_similar(positive=["donald"])

In [None]:
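#scratch example of roc_curve; separate names avoid overwriting the label vector y
import numpy as np
from sklearn import metrics
y_demo = np.array([1, 1, 2, 2])
pred_demo = np.array([0.1, 0.4, 0.35, 0.8])
fpr, tpr, thresholds = metrics.roc_curve(y_demo, pred_demo, pos_label=2)
metrics.auc(fpr, tpr)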

In [None]:
#import pd.pandas_profiling
#pandas_profiling.ProfileReport(df)

In [None]:
#mNB.fit(X_train ,y_train)

#pred_on_test_data = mNB.predict(X_test)
#acc_score = accuracy_score(pred_on_test_data, y_test)
#print ("Accuracy in percentage : " , acc_score, "%")

In [None]:
print("Number of training data : ", len(X_train))
print("Percentage of training data: ",len(X_train) * 100 / len(df), "%")
print("----------------")
print("Number of test data : ", len(y_test))
print("Percentage of test data: ",len(y_test) * 100 / len(df),"%")

In [None]:
df.to_csv("data/FakeNews-(balanced)/fake_or_real_news_prepared.csv", encoding="utf-8")

In [None]:
pip install gensim

Since this data set is well balanced, accuracy can be perceived as a reliable metric.
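For reference, the sketch below shows how accuracy relates to the other label-based metrics reported in Step 4, all derived from the four confusion-matrix counts. It is a plain-Python illustration with hypothetical names (`binary_metrics`) and toy labels, where 1 stands for fake as in the notebook.

```python
def binary_metrics(y_true, y_pred):
    """Compute accuracy, precision, recall and F1 for labels in {0, 1}."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

scores = binary_metrics([1, 1, 1, 0, 0, 0], [1, 1, 0, 0, 0, 1])
print(scores)
```

With balanced classes, a classifier that always predicts one label scores about 0.5 accuracy, so accuracy is not inflated by a majority class. Precision and recall remain useful for seeing which kind of error dominates.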

https://www.analyticsvidhya.com/blog/2018/04/a-comprehensive-guide-to-understand-and-implement-text-classification-in-python/

In [None]:
https://www.sv-europe.com/crisp-dm-methodology/
https://www.kaggle.com/aidenloe/data-understanding-using-python
https://towardsdatascience.com/exploratory-data-analysis-in-python-c9a77dfa39ce
https://towardsdatascience.com/exploratory-data-analysis-tutorial-in-python-15602b417445
https://www.learndatasci.com/tutorials/python-pandas-tutorial-complete-introduction-for-beginners/
