# 1 - Introduction

## Domain-specific area / Problem space
### Definition of Fake News
Fake news is a phrase we hear more and more often, but what exactly is it? Simply put, fake news is news that is inaccurate, often deliberately so, and that spreads disinformation and misinformation in order to gain attention, mislead, deceive, or even sway people's voting behavior (https://en.wikipedia.org/wiki/Fake_news).
### Examples of real world consequences
This may seem like a mere annoyance, but it can have serious real-world consequences. One example is the reaction to what is known as "Pizzagate": a piece of fake news turned conspiracy theory that spread online, claiming that certain pizza restaurants had a basement where children were being held and sold for sex. One man became convinced it was true and showed up armed at one such restaurant, threatening violence and demanding that the staff show him the basement so he could rescue the children (https://www.washingtonpost.com/news/local/wp/2016/12/04/d-c-police-respond-to-report-of-a-man-with-a-gun-at-comet-ping-pong-restaurant/).
### Area of NLP contribution
Journalism websites that value fact-checked, quality news would benefit from a fake news detector. The problem is essentially one of text classification, a subfield of natural language processing (NLP) in which text is sorted into categories and subcategories. A fake news detector can help automate the task of filtering out fake news, much as a spam filter filters out spam email, and so help prevent a news or media outlet from accidentally spreading misinformation.
## Objectives
I for one am tired of getting my hopes up that Bigfoot has been found singing karaoke, only to have my hopes and dreams dashed after closer scrutiny.
Therefore, in this notebook, I will attempt to create a classifier that can (hopefully) detect with a relatively high degree of accuracy whether a given text is fake or real news. Using existing Python libraries, I will train a model to sniff out fake news. If successful, this project could hopefully contribute to preventing the spread of false information and help news outlets maintain a high quality of journalism.
### Chosen methods and justification
Because I would like my detector to read the entirety of the text, I have decided to use Term Frequency-Inverse Document Frequency (TF-IDF) to determine the relative importance of the words in it. TF-IDF, which comes from information retrieval, weighs how often a word appears in a document against how common that word is across the whole corpus, so my hope is that it can find patterns in the kinds of words used in fake news versus real news. The scikit-learn implementation, TfidfVectorizer, can also remove stop words, which helps simplify our cleaning of the data.
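To make the idea concrete, here is a minimal sketch of the textbook TF-IDF formula on a toy corpus. The headlines and the `tf_idf` helper are made up for illustration and are not part of the dataset or of scikit-learn; note also that TfidfVectorizer uses a smoothed IDF and normalizes each document vector by default, so its numbers will differ from these.

```python
import math
from collections import Counter

# Toy corpus (hypothetical headlines, not taken from the dataset)
docs = [
    "bigfoot found singing karaoke".split(),
    "senate passes budget bill".split(),
    "bigfoot sighting report denied".split(),
]

def tf_idf(term, doc, corpus):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tf = Counter(doc)[term] / len(doc)                # share of the document taken up by the term
    df = sum(1 for d in corpus if term in d)          # number of documents containing the term
    idf = math.log(len(corpus) / df) if df else 0.0   # rarer across the corpus means a higher weight
    return tf * idf

bigfoot_score = tf_idf("bigfoot", docs[0], docs)  # appears in 2 of 3 documents
karaoke_score = tf_idf("karaoke", docs[0], docs)  # appears in 1 of 3 documents
print(round(bigfoot_score, 3), round(karaoke_score, 3))
```

Both words occur once in the first headline, but "karaoke" gets the higher weight because it is rarer across the corpus.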
### Classifiers
For this I want to try two different classifiers. The first is the one we learned in class, the Multinomial Naive Bayes classifier (MultinomialNB). I am interested to see how a relatively simple Naive Bayes model performs at classifying something as nebulous and difficult as fake news.
As a second classifier, I chose the Passive-Aggressive classifier, which I found after some digging around on the subject of fake news detectors in the following paper:
- ( Gupta, S., Meel, P. (2021). Fake News Detection Using Passive-Aggressive Classifier. In: Ranganathan, G., Chen, J., Rocha, Á. (eds) Inventive Communication and Computational Technologies. Lecture Notes in Networks and Systems, vol 145. Springer, Singapore. https://doi.org/10.1007/978-981-15-7345-3_13 )
So of course I want to test it against Naive Bayes and see how they compare.

## Dataset 
The dataset for this project was obtained from Kaggle as a set of CSV files. There are three: train.csv, which contains a mix of real and fake news articles labeled 0 for a "Real" article and 1 for a "Fake" article, and on which the model will be trained and tested; and two others, Real.csv and Fake.csv, which contain only real and only fake news articles respectively. I will use those to create an accuracy test of my own (i.e. I will see how many of the fake news articles my detector actually thinks are fake).
### Dataset characteristics
The training dataset has the following columns:
- id: the numerical id of the row
- title: the title of the article
- author: the author of the article
- text: the body of the article itself
- label: The label of Real or Fake it is classified as, 0 for Real news, 1 for Fake

train.csv can be found here: https://www.kaggle.com/competitions/fake-news/data?select=train.csv

There is also a Fake.csv data file, sourced from a different dataset, that I will use to perform a "field test" of the model. This dataset has the following columns:

- title: The title of the article
- text: The body text of the news article
- subject: The topic of the news report
- date: The date it was published

Fake.csv data can be found here: https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset?select=True.csv


## Evaluation methods 
### Method 1 - accuracy using python library
Given that the intended use of this is to filter out fake news articles, the most important thing is how many of the actual fake news articles it detects out of how many there were. I will therefore judge the success of the model based on its accuracy score, while keeping an eye on the recall for the "Fake" class, which measures exactly that proportion. I will check this in two ways. First, I will use scikit-learn's metrics module to produce an easily readable report (classification_report) on the testing data held out from train.csv.
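As a quick illustration of why accuracy and "how many of the fakes were caught" can tell different stories, the sketch below computes both by hand. The labels are invented for the example and do not come from the dataset.

```python
# Toy labels (hypothetical, for illustration only): 4 fake and 6 real articles
y_true = ["Fake", "Fake", "Fake", "Fake", "Real", "Real", "Real", "Real", "Real", "Real"]
# A cautious model that almost always answers "Real"
y_pred = ["Fake", "Real", "Real", "Real", "Real", "Real", "Real", "Real", "Real", "Real"]

# Accuracy: share of all predictions that match the true label
correct = sum(t == p for t, p in zip(y_true, y_pred))
accuracy = correct / len(y_true)

# Recall for the Fake class: share of the truly fake articles that were flagged
fake_total = sum(t == "Fake" for t in y_true)
fake_caught = sum(t == "Fake" and p == "Fake" for t, p in zip(y_true, y_pred))
fake_recall = fake_caught / fake_total

print(f"accuracy={accuracy:.2f}, fake recall={fake_recall:.2f}")
```

Here the model is right 70% of the time overall yet catches only a quarter of the fake articles, which is why the per-class figures in classification_report are worth reading alongside the headline accuracy.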
### Method 2 -  A "field test" with the trained model
As a secondary check, I will write my own function that uses the trained model to classify each text in the fake news CSV file individually, then divide the number flagged as fake by the total number of texts. The idea is that the model should recognize all of them as "fake"; if it does not, I can see what percentage it was able to recognize (as a kind of real-world test).
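The field test described above boils down to one ratio. Here is a minimal sketch of it, assuming a fitted vectorizer and classifier; the helper name `percent_flagged_fake` and the tiny training set are hypothetical stand-ins, not the notebook's actual data or function. It classifies all texts in one batch rather than one at a time.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB

def percent_flagged_fake(vectorizer, classifier, texts):
    """Share of `texts` (all known to be fake) that the model labels 'Fake', in percent."""
    predictions = classifier.predict(vectorizer.transform(texts))
    return round(float((predictions == "Fake").mean()) * 100, 2)

# Tiny stand-in training set (hypothetical, not the Kaggle data)
train_texts = ["aliens rigged the vote", "bigfoot elected mayor",
               "senate passes budget bill", "court upholds budget ruling"]
train_labels = ["Fake", "Fake", "Real", "Real"]
toy_vectorizer = TfidfVectorizer()
toy_classifier = MultinomialNB().fit(toy_vectorizer.fit_transform(train_texts), train_labels)

# Pretend both of these are known fakes; the second one reads like the "Real" examples
known_fakes = ["aliens elected bigfoot", "senate passes budget ruling"]
detected = percent_flagged_fake(toy_vectorizer, toy_classifier, known_fakes)
print(f"Percent fake news detected: {detected}%")
```

Because every text in the file is known to be fake, this percentage is the model's recall on that file.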

# 2 - Implementation

## Preprocessing 

### Getting the csv data as a pandas dataframe

In [1]:
# Import pandas and assign dataframes with the csv data
import pandas as pd
df_fake = pd.read_csv("Fake.csv") # the 'field test' data
df_train = pd.read_csv("train.csv") # the training dataset

### Checking the training data

In [2]:
# Looking at the dataset
df_train

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,0
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,0
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,0
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",1


### Changing the labels
I find the 0 for Real and 1 for Fake a bit hard to read, so here I am going to make a dict to replace all of them with the labels "Real" and "Fake".

In [3]:
# A dictionary mapping 0 to "Real" and 1 to "Fake"
conversion_dict = {0:"Real", 1:"Fake"}
# Replace all labels that correspond to 0 or 1 as Real and Fake
df_train['label'] = df_train['label'].replace(conversion_dict)
# Check we did it
df_train

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,Fake
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,Real
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",Fake
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,Fake
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,Fake
...,...,...,...,...,...
20795,20795,Rapper T.I.: Trump a ’Poster Child For White S...,Jerome Hudson,Rapper T. I. unloaded on black celebrities who...,Real
20796,20796,"N.F.L. Playoffs: Schedule, Matchups and Odds -...",Benjamin Hoffman,When the Green Bay Packers lost to the Washing...,Real
20797,20797,Macy’s Is Said to Receive Takeover Approach by...,Michael J. de la Merced and Rachel Abrams,The Macy’s of today grew from the union of sev...,Real
20798,20798,"NATO, Russia To Hold Parallel Exercises In Bal...",Alex Ansary,"NATO, Russia To Hold Parallel Exercises In Bal...",Fake


### Cleaning the data
The data looks fairly clean already, but just to be sure, let's check whether there are any duplicate rows or missing values.

In [4]:
# Drop any duplicates and see if we still have 20,800 rows
df_train.drop_duplicates(inplace=True)
df_train.shape

(20800, 5)

In [5]:
# Check how many missing values, if any, each column has
df_train.isnull().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

In [6]:
# Drop every row that has a missing value in any column
df_train.dropna(axis=0, inplace=True)
df_train.shape

(18285, 5)

Good thing we checked: dropping rows with missing values removed 2,515 rows (20,800 down to 18,285), most of them (1,957) missing an author. Those might have thrown our model off.
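Since only the text column ends up being used as a feature, one alternative worth noting is to drop only the rows whose text is missing, which would keep most of the rows removed above. This is a sketch of that alternative on a made-up frame, not what this notebook does; the results reported below are based on dropping rows with any missing value.

```python
import pandas as pd

# Toy frame (hypothetical) with a missing value in a different column on each row
toy = pd.DataFrame({
    "title": ["a", None, "c"],
    "author": [None, "x", "y"],
    "text": ["body one", "body two", None],
    "label": [1, 0, 1],
})

all_cols = toy.dropna(axis=0)            # drops a row if ANY column is missing
text_only = toy.dropna(subset=["text"])  # drops a row only if its text is missing

print(len(all_cols), len(text_only))
```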

## Benchmarks

According to this paper[1] on fake news detection, the authors achieved an accuracy of 74% with Naive Bayes.

The accuracy achieved with a Passive-Aggressive classifier was reported at 90.8% according to this research[2] (which used the same dataset I trained on).

- [1] Naive Bayes paper: Sharma, D.K., Garg, S. IFND: a benchmark dataset for fake news detection. Complex Intell. Syst. (2021). https://doi.org/10.1007/s40747-021-00552-1

- [2] Passive-Aggressive classifier paper: B. Suganthi, K. Manohari. FAKEDETECTOR: Effective Fake News Detection with Passive Aggressive Algorithm. Science, Technology and Development, Volume X, Issue IX, September 2021. ISSN: 0950-0707 (2021)

## Splitting the data and applying TF-IDF for tokenizing
Now for the fun: let's separate our data into training and testing groups and apply TF-IDF to it. First we will need to install scikit-learn and import some modules.

In [7]:
# Installing scikit-learn (the PyPI package is named scikit-learn; it is imported as sklearn)
import sys
!{sys.executable} -m pip install scikit-learn

# Getting a test training split and feature extraction modules
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer



## Removing stop words, ignoring overly abundant terms and splitting into testing and training groups
Now let's split into training and testing sets. I'm going with 20% for testing, which leaves 80% for training. We are also shuffling the data and setting a random state so that the split is reproducible.
The TfidfVectorizer conveniently has a parameter for stop words, so we will go ahead and fill that in, and give it a max_df of 0.7 (meaning it ignores terms that appear in more than 70% of the documents).
## Term frequency and inverse document frequency
The main reason I chose TF-IDF for vectorizing the words is that it weighs them not only by their frequency within a document, but also by their inverse document frequency, that is to say by how rare, and therefore how "important", they are across the documents.
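To see what the stop_words and max_df settings do to the vocabulary, here is a small sketch on four made-up sentences (not taken from the dataset). With four documents, max_df=0.7 drops any term that appears in three or more of them.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents (hypothetical); "reporter" and "said" appear in 3 of the 4
toy_docs = [
    "the reporter said the vote was counted",
    "the reporter said aliens rigged the vote",
    "the reporter said the budget passed",
    "the mayor opened a bridge",
]

plain = TfidfVectorizer()                                     # no filtering
filtered = TfidfVectorizer(stop_words="english", max_df=0.7)  # same settings as this notebook
plain.fit(toy_docs)
filtered.fit(toy_docs)

# Terms that the filtered vectorizer no longer keeps
dropped = sorted(set(plain.vocabulary_) - set(filtered.vocabulary_))
print("dropped:", dropped)
print("kept:", sorted(filtered.vocabulary_))
```

"the" goes because it is an English stop word, while "reporter" goes because it appears in more than 70% of the documents; a word like "vote", which appears in only half of them, is kept.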

In [8]:
# Get the training and testing data. 
x_train, x_test, y_train, y_test = train_test_split(df_train['text'], df_train['label'], test_size=0.2, random_state=7, shuffle=True)
# Set the vectorizer, removing English stop words. max_df drops any term that
#  appears in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)


## Tokenizing our data 
Now to vectorize our training and testing text so it can be fed to our classifier. I found that make_pipeline doesn't play well with pandas dataframes (or at least I couldn't get it to), so we will be doing this the long way. Notice we had to cast the values to type 'U' (unicode) in order for everything to be ready for our vectorizer and classifier.
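For reference, here is a minimal sketch of what the pipeline route could look like when it is given a single column of strings rather than a whole dataframe. The toy data is made up, and whether this resolves the issue I ran into is an assumption; the rest of the notebook keeps the step-by-step approach.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import make_pipeline

# Toy data (hypothetical, not the Kaggle dataset)
toy = pd.DataFrame({
    "text": ["aliens rigged the vote", "senate passes budget bill",
             "bigfoot elected mayor", "court upholds budget ruling"],
    "label": ["Fake", "Real", "Fake", "Real"],
})

# The pipeline vectorizes and classifies in one object
pipe = make_pipeline(TfidfVectorizer(stop_words="english"), MultinomialNB())

# Pass one column (a Series of strings), not the whole dataframe
pipe.fit(toy["text"].astype(str), toy["label"])

prediction = pipe.predict(["bigfoot rigged the vote"])[0]
print(prediction)
```

A pipeline also guarantees that new text is transformed with the same fitted vectorizer before it reaches the classifier.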

In [9]:
# Fit the vectorizer on the training text and get the TF-IDF vectors for our model
tfidf_vec_train = tfidf_vectorizer.fit_transform(x_train.values.astype('U')) # assign type U for unicode
# Transform the test text with the same fitted vectorizer, to test our model
tfidf_vec_test = tfidf_vectorizer.transform(x_test.values.astype('U'))

## Choosing and comparing our classifiers
For this I have decided to compare two different classifiers and see how they perform. Over the course of the module I was taught about one of them, Multinomial Naive Bayes for text classification, so I thought it would work well for this project. However, I also want to compare it with a different classifier known as the Passive-Aggressive classifier. During my research into building a fake news detector, I found some articles on this classifier and wondered whether it would be a better choice.

### Selected features
The feature I decided to go with was the body of the text itself; using the TF-IDF vectorizer, I removed stop words as well as any words that appear in more than 70% of the documents. In this way I think I can avoid what the texts have in common and focus on the words that tend to be more prevalent in fake news.

### MultinomialNB 
The Multinomial Naive Bayes algorithm is a Bayesian learning technique that guesses the tag of a text, in this case "Real" or "Fake", using Bayes' theorem. It calculates each tag's likelihood for a given sample, treating the words as independent of one another, and outputs the tag with the greatest probability. Let's see how it performs.
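The calculation can be sketched by hand in a few lines. This is a simplified illustration on made-up sentences using raw word counts with add-one smoothing; the helper names are hypothetical, and scikit-learn's MultinomialNB applies the same idea to the TF-IDF weights rather than to raw counts.

```python
import math
from collections import Counter

# Toy training set (hypothetical, not the Kaggle data)
train = [
    ("aliens rigged vote", "Fake"),
    ("bigfoot rigged election", "Fake"),
    ("senate passes budget", "Real"),
    ("court upholds vote", "Real"),
]

vocab = {w for text, _ in train for w in text.split()}
word_counts = {"Fake": Counter(), "Real": Counter()}
doc_counts = Counter()
for text, label in train:
    word_counts[label].update(text.split())
    doc_counts[label] += 1

def log_posterior(text, label, alpha=1.0):
    """Log prior plus the sum of smoothed log likelihoods of each known word."""
    score = math.log(doc_counts[label] / len(train))
    total = sum(word_counts[label].values())
    for w in text.split():
        if w in vocab:  # words never seen in training are ignored
            score += math.log((word_counts[label][w] + alpha) / (total + alpha * len(vocab)))
    return score

def predict(text):
    """Return the tag with the greatest posterior score."""
    return max(("Fake", "Real"), key=lambda label: log_posterior(text, label))

print(predict("bigfoot rigged vote"))
```

"vote" appears in both classes, so the decision is carried by "bigfoot" and "rigged", which were only ever seen in the Fake examples.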

In [10]:
# Import the Naive Bayes classifier, create it and fit it to the training data
from sklearn.naive_bayes import MultinomialNB
NB_classifier = MultinomialNB()
NB_classifier.fit(tfidf_vec_train, y_train)

## Performance of the MultinomialNB
Let's check our performance and see how well we are doing.
First we will need to import the function we need.

In [11]:
# Get the accuracy score function
from sklearn.metrics import accuracy_score

In [12]:
# Set up our prediction, score it and calculate the accuracy, rounded to two decimal places
nb_y_pred = NB_classifier.predict(tfidf_vec_test)
nb_score = accuracy_score(y_test, nb_y_pred)
MNB_accuracy = round(nb_score*100,2)
print(f'NB Accuracy: {MNB_accuracy}%')

NB Accuracy: 77.96%


77.96%, not too shabby considering our benchmark of 74%! Let's take a closer look at how we did.

## Get the classification report and see the detailed results

In [13]:
# See the detailed report
from sklearn.metrics import classification_report
print(classification_report(y_test, nb_y_pred))

              precision    recall  f1-score   support

        Fake       0.99      0.49      0.66      1580
        Real       0.72      1.00      0.84      2077

    accuracy                           0.78      3657
   macro avg       0.86      0.75      0.75      3657
weighted avg       0.84      0.78      0.76      3657



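Accuracy is what we are most concerned with, but the difference between the two classes is interesting: precision on Fake is 0.99, yet recall is only 0.49, so the model rarely mislabels real news as fake but misses about half of the fake articles.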
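The headline accuracy can be roughly reconstructed from the per-class recall and support in the report, which shows how a perfect score on Real props up the overall figure. The numbers below are copied from the report; because they are rounded to two decimals, the results are approximations.

```python
# Figures copied from the classification report above (rounded to two decimals)
fake_recall, fake_support = 0.49, 1580
real_recall, real_support = 1.00, 2077

# Recall times support approximates the number of correct predictions per class
fake_correct = fake_recall * fake_support
real_correct = real_recall * real_support

approx_accuracy = (fake_correct + real_correct) / (fake_support + real_support)
fake_missed = fake_support - fake_correct

print(f"approx. accuracy: {approx_accuracy:.3f}")
print(f"approx. fake articles missed: {fake_missed:.0f} of {fake_support}")
```

This lands at roughly 0.78, in line with the reported accuracy, even though around 800 of the 1,580 fake articles in the test split slip through.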

## Checking with custom algorithm, a.k.a our "field test"
Here I want to see how well the model does when given only the fake news examples loaded at the start (Fake.csv). I will run through each text in the file and calculate the percentage that it correctly tags as "Fake."

In [14]:
# First set all the labels to Fake, as we know this is the fake dataset
df_fake['label']='Fake'
df_fake

Unnamed: 0,title,text,subject,date,label
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017",Fake
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017",Fake
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017",Fake
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017",Fake
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017",Fake
...,...,...,...,...,...
23476,McPain: John McCain Furious That Iran Treated ...,21st Century Wire says As 21WIRE reported earl...,Middle-east,"January 16, 2016",Fake
23477,JUSTICE? Yahoo Settles E-mail Privacy Class-ac...,21st Century Wire says It s a familiar theme. ...,Middle-east,"January 16, 2016",Fake
23478,Sunnistan: US and Allied ‘Safe Zone’ Plan to T...,Patrick Henningsen 21st Century WireRemember ...,Middle-east,"January 15, 2016",Fake
23479,How to Blow $700 Million: Al Jazeera America F...,21st Century Wire says Al Jazeera America will...,Middle-east,"January 14, 2016",Fake


In [15]:
# Simple function using our classifier
# Takes an article text and returns "Real" or "Fake"
def NB_detectFake(text): # text is the raw article text
    # test holds the TF-IDF vector from our fitted feature extractor
    test=tfidf_vectorizer.transform([text])
    # pred is the prediction from our Naive Bayes classifier
    pred=NB_classifier.predict(test)
    return pred[0]

Now let's iterate through the fake CSV file and see how we did.

In [16]:
# Accuracy; how many fake news articles were correctly predicted to be fake
fake_count = 0 # Start the count at 0 (named fake_count so we don't shadow the built-in sum)
for text in df_fake['text']: # Iterate over every article text in the dataset
    if NB_detectFake(text) == 'Fake': # if we return Fake for a text, increase the count
        fake_count += 1

percent = round(fake_count / len(df_fake) * 100, 2) # Divide the number of fakes we found by the number of texts
print(f'Percent fake news detected:{percent}%') # Print out the results

Percent fake news detected:7.62%


Oof, not nearly as good as the metrics would have us believe, eh? Far, far below the accuracy above. It seems Naive Bayes doesn't play well in our field test. Let's try out the Passive-Aggressive classifier.

## Performance of the Passive-Aggressive Classifier
Now let us try the same thing, but this time with a different classifier.

In [17]:
# Import and set classifier
from sklearn.linear_model import PassiveAggressiveClassifier
pac_classifier = PassiveAggressiveClassifier(max_iter=50) # cap training at 50 passes (scikit-learn's default is 1000)
pac_classifier.fit(tfidf_vec_train, y_train)

In [18]:
# Set the prediction, score and accuracy
pac_y_pred = pac_classifier.predict(tfidf_vec_test)
pac_score = accuracy_score(y_test, pac_y_pred)
pac_accuracy = round(pac_score*100,2)
print(f'Pac Accuracy: {pac_accuracy}%')

Pac Accuracy: 96.58%


96%! Wow, well above our benchmark of 90.8% again, but let's take a closer look and then give it a field test. Remember how poorly Naive Bayes did above?

In [19]:
# Get the report
print(classification_report(y_test, pac_y_pred))

              precision    recall  f1-score   support

        Fake       0.96      0.96      0.96      1580
        Real       0.97      0.97      0.97      2077

    accuracy                           0.97      3657
   macro avg       0.97      0.96      0.97      3657
weighted avg       0.97      0.97      0.97      3657



In [20]:
# Same as above, returns Real or Fake, but uses the Passive-Aggressive classifier
def PAC_detectFake(text): # text is the raw article text
    # test holds the TF-IDF vector from our fitted feature extractor
    test=tfidf_vectorizer.transform([text])
    # pred is the prediction from our Passive-Aggressive classifier
    pred=pac_classifier.predict(test)
    return pred[0]

In [21]:
# Accuracy; how many fake news articles were correctly predicted to be fake
fake_count = 0 # Start the count at 0 (named fake_count so we don't shadow the built-in sum)
for text in df_fake['text']: # Iterate over every article text in the dataset
    if PAC_detectFake(text) == 'Fake': # if we return Fake for a text, increase the count
        fake_count += 1

percent = round(fake_count / len(df_fake) * 100, 2) # Divide the number of fakes we found by the number of texts
print(f'Percent fake news detected:{percent}%') # Print out the results

Percent fake news detected:66.35%


A bit underwhelming, no? Only 66% in our field test. It seems the model performs much better on data from the same source it was trained on than on data from elsewhere.

# 3 - Conclusions / Outcome
## Performance
According to the above, the accuracy for the Multinomial Naive Bayes classifier (MultinomialNB) was around 78%; however, when we tested it on the fake-only dataset it did terribly, flagging only about 7.6% of the fake news. The Passive-Aggressive classifier, on the other hand, had a much better score of about 96.6% according to the report, but when we tested it on the fake-only stories its performance was again worse. It is important to note, though, that it still managed to find about 66% of the fake news.
This could be due to the data in the "field test" (a.k.a. Fake.csv) being different in structure from the training data, as they came from different sources.
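One rough way to probe that explanation would be to measure how many of the words in the field-test articles the fitted vectorizer has never seen. The sketch below does this on made-up sentences; the variable names and data are hypothetical, and I have not run this check on the real datasets, so it is a suggestion for follow-up rather than a result.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy stand-ins (hypothetical) for the training text and the field-test text
train_docs = ["senate passes budget bill", "aliens rigged the vote"]
field_docs = ["21st century wire says aliens rigged election"]

vec = TfidfVectorizer(stop_words="english").fit(train_docs)
analyze = vec.build_analyzer()  # same tokenizing and stop-word removal the vectorizer uses

# Tokens in the field-test text, and those missing from the training vocabulary
tokens = [tok for doc in field_docs for tok in analyze(doc)]
unseen = [tok for tok in tokens if tok not in vec.vocabulary_]
oov_rate = len(unseen) / len(tokens)

print(f"{oov_rate:.0%} of field-test tokens were never seen in training")
```

Unseen words contribute nothing to the TF-IDF vector, so a high rate would mean the classifier is deciding on only a fraction of each article.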

## Summary
Clearly, detecting fake news is something that a simple classifier may not be well suited to. Even when the results of the testing came back looking very good, the accuracy quickly dropped off in our pseudo "real world" test.
#### Areas for improvement
In this project, I looked only at the body of the text itself; however, the source of an article is also a very strong indicator of real or fake news, as is the author. Those data points were omitted in an effort to find common patterns in just the body of the text, so perhaps the performance would be greatly boosted by including them. However, it is important to note that new "Fake" sources are popping up all the time, and that could ultimately lead to worse performance, as new unknown sources could skew the results (i.e. a new source might automatically seem more "Real" due to the fitting).
#### Further development
In this case I actually feel that more data would be the most useful: we had only about 20,000 articles in total (18,285 after cleaning) to train and test on, and that may have been an issue when presented with our challenge data. As an even further task, perhaps applying deep learning would yield even more promising results.