<h1><center> Using Machine Learning Techniques in <br> the Detection of Fake News </center></h1>

<h2 style='font-weight:normal'><center> Spring Term Project, <b> Lisanna Lehes </b> </center></h2>

<div style="text-align: right"> 
Universidad de Huelva <br>
Facultad de CC. Experimentales <br>
Grado en Química <br>
Computational Chemistry
 </div>


#### <b>Aim of the project:</b> to create a machine learning model that helps detect fake news, using the scikit-learn library and the Passive Aggressive algorithm.

## Part I

### 1. Importing the necessary libraries 

In [1]:
import pandas as pd
import numpy as np

### 2. Importing the dataset

In [2]:
# Importing dataset
df = pd.read_csv('/Users/Lisanna/Desktop/fake_news/train.csv')

In [3]:
df.shape

(20800, 5)

In [4]:
# Returning the first 5 rows
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


### 3. Converting the '0's and '1's to 'REAL' and 'FAKE'

‘0’ for RELIABLE article <br>
‘1’ for FAKE NEWS

In [5]:
# Accessing a group of rows and columns by label
df.loc[(df['label'] == 1) , ['label']] = 'FAKE'
df.loc[(df['label'] == 0), ['label']] = 'REAL'

In [6]:
df.head()

Unnamed: 0,id,title,author,text,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,FAKE
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,REAL
2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",FAKE
3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,FAKE
4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,FAKE


### 4. Splitting the dataset into test and training data

In [7]:
labels = df.label
labels.head()

0    FAKE
1    REAL
2    FAKE
3    FAKE
4    FAKE
Name: label, dtype: object

In [8]:
labels.value_counts()

FAKE    10413
REAL    10387
Name: label, dtype: int64
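The two classes are almost perfectly balanced, which matters when reading the accuracy figures later: a model that always answered with the more frequent class would already be right about half of the time. A minimal sketch of that baseline, using only the counts printed above (the variable names are illustrative):

```python
# Class counts copied from the value_counts() output above.
counts = {"FAKE": 10413, "REAL": 10387}

total = sum(counts.values())
majority_label = max(counts, key=counts.get)

# Accuracy of a trivial model that always predicts the majority class.
baseline_accuracy = counts[majority_label] / total

print(majority_label, round(baseline_accuracy, 4))
```

Any accuracy well above this baseline therefore reflects what the model has learned, not class imbalance.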

The dataset is split into two subsets: 70% of the entries are used to train the model
and the remaining 30% are used to test the model’s predictive power.

In [9]:
from sklearn.model_selection import train_test_split
# Split arrays or matrices into random train and test subsets

In [10]:
x_train, x_test, y_train, y_test = train_test_split(df['text'].values.astype('str'), labels, test_size = 0.3, random_state = 7)

In [11]:
# random_state seeds the random number generator, so the same split is reproduced on every run
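What the split does can be sketched in plain Python. This is not scikit-learn's implementation, only an illustration of the idea: shuffle the row indices with a seeded generator, then cut off the test part. The helper name `simple_split` and the toy documents are hypothetical.

```python
import random

def simple_split(items, test_size=0.3, seed=7):
    """Shuffle indices with a seeded generator and cut into train/test parts."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = round(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

docs = [f"article {i}" for i in range(10)]
train_a, test_a = simple_split(docs)
train_b, test_b = simple_split(docs)

print(len(train_a), len(test_a))   # 7 training items, 3 test items
print(train_a == train_b)          # same seed, same split
```

Fixing the seed is what makes the later accuracy figures comparable between the models in this notebook, since all of them are evaluated on the same test rows.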

### 5. Using TfidfVectorizer

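__TfidfVectorizer__ converts the raw text into a matrix of TF-IDF features; with `stop_words = 'english'` it first drops common English stop words. <br>
The number of times a word appears in a document is its __Term Frequency__ (TF). <br> A higher value means the term appears more often in that document, so the document is a good match when the term is part of the search terms. <br> The __Inverse Document Frequency__ (IDF) down-weights terms that occur in many documents of the corpus, because such terms carry little information.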
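To make the weighting concrete, here is a minimal plain-Python sketch of TF-IDF on three toy documents. It uses a smoothed inverse document frequency, ln((1 + n) / (1 + df)) + 1, and skips the stop-word removal and row normalization that `TfidfVectorizer` also applies, so the numbers are illustrative only. The documents and the `idf` and `tfidf` helpers are made up for this example.

```python
import math
from collections import Counter

docs = [
    "senate passes budget bill",
    "shocking secret about budget revealed",
    "senate committee reviews bill",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

# Document frequency: in how many documents does each term occur?
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    """Smoothed inverse document frequency."""
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    """Raw term count times IDF for every term in one document."""
    counts = Counter(tokens)
    return {term: count * idf(term) for term, count in counts.items()}

weights = tfidf(tokenized[1])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:10s} {weight:.3f}")
```

In the second document, "budget" also occurs in another document and therefore receives a lower weight than words such as "shocking" that occur only once in the corpus.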


In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer

In [13]:
tfidf_vectorizer = TfidfVectorizer(stop_words = 'english', max_df = 0.7)

__max_df__ = 0.7 --> terms that appear in more than 70% of the documents are ignored; this filters out words that are too frequent to be informative.
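The effect of `max_df` can be sketched in plain Python: count in what fraction of the documents each term occurs and drop the terms above the threshold. This is an illustration with made-up documents, not the library's code.

```python
from collections import Counter

docs = [
    "the senate passes the budget",
    "the secret budget revealed",
    "the committee reviews the bill",
    "the bill fails",
]
max_df = 0.7

# Use sets so that each term is counted once per document.
tokenized = [set(d.split()) for d in docs]
doc_freq = Counter(term for tokens in tokenized for term in tokens)

# Keep a term only if it appears in at most max_df of the documents.
vocabulary = sorted(t for t, df in doc_freq.items() if df / len(docs) <= max_df)
dropped = sorted(t for t, df in doc_freq.items() if df / len(docs) > max_df)

print("kept:   ", vocabulary)
print("dropped:", dropped)
```

Here only "the" occurs in more than 70% of the documents, so it is the only term removed.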

### 6. Fitting and transforming the training and test data sets

In [14]:
tfidf_train = tfidf_vectorizer.fit_transform(x_train)

In [15]:
tfidf_test = tfidf_vectorizer.transform(x_test)

* __transform()__ : applies the parameters learned by fit() (vocabulary and IDF weights) to a data set, here the test set, without learning them again.

* __fit_transform()__ : combines fit() and transform() in one call on the same data set, here the training set.
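The reason for calling `fit_transform()` on the training set but only `transform()` on the test set is that the vocabulary should come from the training data alone. A toy sketch with a hypothetical `TinyCountVectorizer` class (plain counts, no TF-IDF weighting) shows the consequence: words that never occurred during fitting are ignored at transform time.

```python
class TinyCountVectorizer:
    """Toy bag-of-words vectorizer used only to illustrate fit vs. transform."""

    def fit(self, documents):
        # Learn the vocabulary (term -> column index) from these documents.
        terms = sorted({t for doc in documents for t in doc.split()})
        self.vocabulary_ = {term: i for i, term in enumerate(terms)}
        return self

    def transform(self, documents):
        # Count only terms that were seen during fit().
        rows = []
        for doc in documents:
            row = [0] * len(self.vocabulary_)
            for term in doc.split():
                if term in self.vocabulary_:
                    row[self.vocabulary_[term]] += 1
            rows.append(row)
        return rows

    def fit_transform(self, documents):
        return self.fit(documents).transform(documents)

vec = TinyCountVectorizer()
train_matrix = vec.fit_transform(["fake news spreads", "real news matters"])
test_matrix = vec.transform(["fake aliens spreads news"])

print(vec.vocabulary_)
print(test_matrix)   # "aliens" has no column, so it is silently dropped
```

Both matrices have the same number of columns, which is what allows a classifier trained on one to predict on the other.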



### 7. Initializing PassiveAggressiveClassifier

In [16]:
from sklearn.linear_model import PassiveAggressiveClassifier
# Passive: if correct classification, keep the model; 
# Aggressive: if incorrect classification, update to adjust to this misclassified example.

In [17]:
pa_classifier = PassiveAggressiveClassifier(max_iter = 50)
pa_classifier.fit(tfidf_train, y_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

__max_iter__ --> Maximum number of passes over the training data (epochs).
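The passive and aggressive behaviour described in the code comments above can be sketched for a single training example. The snippet follows the PA-I variant of the Passive-Aggressive update (hinge loss, step size capped by `C`); it is a plain-Python illustration with a hypothetical `pa_update` helper and labels encoded as -1 and +1, not scikit-learn's implementation.

```python
def pa_update(weights, x, y, C=1.0):
    """One Passive-Aggressive (PA-I) step for a label y in {-1, +1}.

    Assumes x is not the zero vector.
    """
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss
    if loss == 0.0:
        return weights                      # passive: margin is already large enough
    squared_norm = sum(xi * xi for xi in x)
    tau = min(C, loss / squared_norm)       # aggressive: step size, capped by C
    return [w + tau * y * xi for w, xi in zip(weights, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, x=[1.0, 0.0], y=+1)   # loss is 1, so the weights move
w2 = pa_update(w1, x=[1.0, 0.0], y=+1)   # loss is now 0, so nothing changes

print(w1, w2)
```

The parameter `C` limits how far a single example can move the weights, which makes the update less sensitive to mislabeled articles.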


### 8. Predicting and calculating the accuracy

In [30]:
from sklearn.metrics import accuracy_score

In [19]:
y_pred = pa_classifier.predict(tfidf_test)
score = accuracy_score(y_test, y_pred)

In [20]:
print('Accuracy:', score)

Accuracy: 0.9600961538461539


In [21]:
print(f'Accuracy: {round(score*100, 3)}%')

Accuracy: 96.01%


The accuracy tells us what fraction of the test articles the model classified correctly, <br> but not how the correct and incorrect predictions are distributed over the two classes. <br> To see that, we now build a confusion matrix.

### 9. Building a confusion matrix

In [28]:
from sklearn.metrics import confusion_matrix

In [23]:
confusion_matrix(y_test, y_pred, labels = ['FAKE', 'REAL'])

array([[2996,  116],
       [ 133, 2995]], dtype=int64)
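The layout of this matrix follows the `labels` argument: rows are the true classes and columns the predicted classes, both in the order FAKE, REAL. A small plain-Python sketch with made-up labels shows how the four cells are counted (the helper `confusion_counts` is hypothetical):

```python
def confusion_counts(y_true, y_pred, labels):
    """Nested list where cell [i][j] counts true labels[i] predicted as labels[j]."""
    index = {label: i for i, label in enumerate(labels)}
    matrix = [[0] * len(labels) for _ in labels]
    for truth, pred in zip(y_true, y_pred):
        matrix[index[truth]][index[pred]] += 1
    return matrix

y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL"]
y_pred = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]

matrix = confusion_counts(y_true, y_pred, labels=["FAKE", "REAL"])
(tp, fn), (fp, tn) = matrix   # treating FAKE as the positive class

print(matrix)
print(f"TP={tp} FN={fn} FP={fp} TN={tn}")
```

With FAKE as the positive class, the top-right cell holds the false negatives (fake predicted as real) and the bottom-left cell the false positives (real predicted as fake).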

__Results:__ <br>
* The model correctly predicted 2996 positives. (Is fake and is predicted as fake)
* The model correctly predicted 2995 negatives. (Is real and is predicted as real)
* The model predicted 133 false positives. (Real news was considered fake)
* The model predicted 116 false negatives. (Fake news was considered real)

The classifier is created without a fixed `random_state`, so these counts can shift slightly between runs.

Additionally, we can compute the __F1-score__. <br> 
The __F1-score__ is the harmonic mean of precision and recall: it is 1 for a perfect model and 0 in the worst case.<br>  A good __F1-score__ means that the model has few false positives and few false negatives.


F1 = 2TP / (2TP + FP + FN)

In [24]:
# Counts taken from the confusion matrix above (FAKE is the positive class)
TP, FP, FN = 2996, 133, 116
F1 = 2*TP / (2*TP + FP + FN)
print(round(F1, 4))

0.9601


In [32]:
from sklearn import metrics
print(metrics.classification_report(y_test, y_pred, labels = ['FAKE', 'REAL']))

              precision    recall  f1-score   support

        FAKE       0.96      0.96      0.96      3112
        REAL       0.96      0.96      0.96      3128

    accuracy                           0.96      6240
   macro avg       0.96      0.96      0.96      6240
weighted avg       0.96      0.96      0.96      6240
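The per-class figures in this report can be reproduced from the confusion matrix printed in section 9. A short sketch, with the matrix values copied by hand and a hypothetical `prf` helper:

```python
# Confusion matrix from section 9: rows = true class, columns = predicted class,
# both in the order FAKE, REAL.
matrix = [[2996, 116],
          [133, 2995]]

def prf(matrix, k):
    """Precision, recall and F1 for class index k of a 2x2 confusion matrix."""
    tp = matrix[k][k]
    fn = matrix[k][1 - k]      # true class k, predicted as the other class
    fp = matrix[1 - k][k]      # other class, predicted as class k
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * tp / (2 * tp + fp + fn)
    return precision, recall, f1

for k, name in enumerate(["FAKE", "REAL"]):
    precision, recall, f1 = prf(matrix, k)
    print(f"{name}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
```

The support column of the report is simply the row sum of the matrix: 3112 fake and 3128 real articles in the test set.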



## Part II

__The aim of Part II is to experiment with the model by adding and removing input variables and to see whether the accuracy score is affected, e.g.:__

1. Build a model that would take in __only__ the title of a news article and then predict if it's fake or real.

2. Build a model that would take in __both__, the __text__ of the article and the __title__

### 1. Experimenting with the title of news articles

In [33]:
labels_2 = df.label

In [34]:
x_train_title, x_test_title ,y_train_title ,y_test_title = train_test_split(df['title'].values.astype('str'), labels_2, test_size=0.3, random_state=7)


In [35]:
tfidf_vectorizer_title = TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform train set, transform test set

tfidf_train_title = tfidf_vectorizer_title.fit_transform(x_train_title) 
tfidf_test_title = tfidf_vectorizer_title.transform(x_test_title)

In [36]:
pa_classifier_title = PassiveAggressiveClassifier(max_iter = 50)
pa_classifier_title.fit(tfidf_train_title, y_train_title)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=50, n_iter_no_change=5,
                            n_jobs=None, random_state=None, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

In [37]:
y_pred_title = pa_classifier_title.predict(tfidf_test_title)
score_title = accuracy_score(y_test_title, y_pred_title)
print('Accuracy:', score_title)

Accuracy: 0.9286858974358975


In [38]:
print(f'Accuracy: {round(score_title*100, 3)}%')

Accuracy: 92.869%


In [39]:
confusion_matrix(y_test_title, y_pred_title, labels = ['FAKE', 'REAL'])

array([[2924,  188],
       [ 257, 2871]], dtype=int64)

__Results:__
* The model correctly predicted 2924 positives. (Is fake and was considered fake)
* The model correctly predicted 2871 negatives. (Is real and was considered real)
* The model predicted 257 false positives. (Real news was considered fake)
* The model predicted 188 false negatives. (Fake news was considered real)


In [40]:
print(metrics.classification_report(y_test_title, y_pred_title, labels = ['FAKE', 'REAL']))

              precision    recall  f1-score   support

        FAKE       0.92      0.94      0.93      3112
        REAL       0.94      0.92      0.93      3128

    accuracy                           0.93      6240
   macro avg       0.93      0.93      0.93      6240
weighted avg       0.93      0.93      0.93      6240
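Since both models were evaluated on the same 6240 test articles, their confusion matrices can be compared directly. A small sketch using the two matrices printed in this notebook (values copied by hand; the helper names are illustrative):

```python
# Confusion matrices as printed in the notebook
# (rows = true class, columns = predicted class, order FAKE, REAL).
text_matrix = [[2996, 116], [133, 2995]]
title_matrix = [[2924, 188], [257, 2871]]

def accuracy(matrix):
    """Share of test articles on the diagonal (correct predictions)."""
    correct = matrix[0][0] + matrix[1][1]
    total = sum(sum(row) for row in matrix)
    return correct / total

def errors(matrix):
    """Number of misclassified test articles (off-diagonal cells)."""
    return matrix[0][1] + matrix[1][0]

print(f"text model:  accuracy={accuracy(text_matrix):.4f}, errors={errors(text_matrix)}")
print(f"title model: accuracy={accuracy(title_matrix):.4f}, errors={errors(title_matrix)}")
```

The title-only model misclassifies 445 articles against 249 for the full-text model, which corresponds to an accuracy that is roughly 3 percentage points lower.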



### 2. Prediction model with 2 variables

In this part I try to detect fake news with a model that uses __two variables__, the __text of the article__ and its __title__, to see whether the accuracy score is affected (whether this gives a better model or not).

In [41]:
labels_3 = df.label

In [60]:
# TfidfVectorizer expects one string per document, so the title and the text
# are joined into a single string (missing values are replaced by empty strings)
sampledata = df['title'].fillna('') + ' ' + df['text'].fillna('')

In [62]:
x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split(sampledata.values.astype('str'), labels_3, test_size = 0.3, random_state = 7)

In [48]:
texts = df.text
texts.value_counts()


In [50]:
titles = df.title
titles.value_counts()

Get Ready For Civil Unrest: Survey Finds That Most Americans Are Concerned About Election Violence                           5
The Dark Agenda Behind Globalism And Open Borders                                                                            5
Let’s Be Clear – A Vote For Warmonger Hillary Clinton Is A Vote For World War 3                                              4
The Fix Is In: NBC Affiliate Accidentally Posts Election Results A Week Early: Hillary Wins Presidency 42% to Trump’s 40%    4
What to Cook This Week - The New York Times                                                                                  4
                                                                                                                            ..
Donald Trump on Terror: ‘This Bloodshed Must End’                                                                            1
As More Devices Board Planes, Travelers Are Playing With Fire - The New York Times                             

In [59]:
#x_train_2, x_test_2, y_train_2, y_test_2 = train_test_split([np.transpose(texts, titles)].values.astype('str'), labels_3, test_size = 0.3, random_state = 7)

In [None]:
tfidf_vectorizer_2 = TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform train set, transform test set

tfidf_train_2 = tfidf_vectorizer_2.fit_transform(x_train_2) 
tfidf_test_2 = tfidf_vectorizer_2.transform(x_test_2)

In [None]:
pa_classifier_2 = PassiveAggressiveClassifier(max_iter = 50)
pa_classifier_2.fit(tfidf_train_2, y_train_2)

In [None]:
y_pred_2 = pa_classifier_2.predict(tfidf_test_2)
score_2 = accuracy_score(y_test_2, y_pred_2)
print('Accuracy:', score_2)

In [None]:
print(f'Accuracy: {round(score_2*100, 3)}%')

In [None]:
confusion_matrix(y_test_2, y_pred_2, labels = ['FAKE', 'REAL'])

Results: 

In [None]:
print(metrics.classification_report(y_test_2, y_pred_2, labels = ['FAKE', 'REAL']))