# Fake News Detector

## Libraries

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer #used for the transformation of text data.
from sklearn.linear_model import PassiveAggressiveClassifier #used to train the model
from sklearn.metrics import accuracy_score, confusion_matrix # this is used to analyze the results.

## Data preprocessing

We need to adapt our data to our machine learning model first. We will build a preprocessing pipeline.

In [2]:
# load the data
raw_data = pd.read_csv("news.csv")
# display the raw data; the "Unnamed: 0" column holds the news ID
raw_data

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...,...
6330,4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
6331,8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
6332,8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
6333,4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


We drop the "Unnamed: 0" column because it only holds the news ID, which carries no information for our prediction.

In [3]:
data=raw_data.drop(['Unnamed: 0'],axis=1)
data.head()

,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [4]:
data.shape

(6335, 3)

In [5]:
labels=data["label"]
print(labels.head())
print(labels.value_counts())

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object
REAL    3171
FAKE    3164
Name: label, dtype: int64


In [6]:
data.iloc[:,:-1]

,title,text
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello..."
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T..."
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...
...,...,...
6330,State Department says it can't find emails fro...,The State Department told the Republican Natio...
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...
6333,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene..."


In [7]:
data["title"].shape

(6335,)

In [8]:
data["text"].shape

(6335,)

## Separation of training data and test data

In [9]:
X_train, X_test, y_train, y_test = train_test_split(data["text"], labels, test_size=0.2, random_state=42)

In [10]:
print("X_train -> "+str(X_train.shape))
print("X_test -> "+str(X_test.shape))

X_train -> (5068,)
X_test -> (1267,)


In [11]:
print("y_train -> "+str(y_train.shape))
print("y_test -> "+str(y_test.shape))

y_train -> (5068,)
y_test -> (1267,)


## TfidfVectorizer

This class computes the tf-idf scores of our text data, where:

tf -> term frequency: the number of times a word occurs in a single document.

idf -> inverse document frequency: a weight computed over the whole dataset rather than a single row. The more documents a word appears in, the lower its weight, so very common words count for little.

These scores will be used as the input features for our machine learning model.

(It is important to fit the vectorizer only on our training data; otherwise information from the test data leaks into the model.)
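To make the two terms concrete, here is a minimal pure-Python sketch of a tf-idf computation on three made-up sentences. It assumes the smoothed idf formula `ln((1 + n) / (1 + df)) + 1` followed by L2 normalisation, which is what scikit-learn documents as its default; the whitespace tokenizer, the `tfidf` helper and the toy sentences are illustrative assumptions, not part of the notebook, and the real TfidfVectorizer tokenizes differently.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return one {term: weight} dict per document (smoothed idf, L2-normalised)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokenizer
    n_docs = len(tokenized)
    # document frequency: in how many documents each term appears
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n_docs) / (1 + count)) + 1 for term, count in df.items()}
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)  # term frequency inside this document
        weights = {term: count * idf[term] for term, count in tf.items()}
        norm = math.sqrt(sum(w * w for w in weights.values()))
        rows.append({term: w / norm for term, w in weights.items()})
    return rows

# hypothetical toy corpus
toy_docs = [
    "the senate passed a bill",
    "the shocking secret they hide",
    "the bill was vetoed",
]
toy_rows = tfidf(toy_docs)

# "the" appears in all three documents, "bill" in two, "senate" in one,
# so within the first document the weights are ordered senate > bill > the.
print(sorted(toy_rows[0].items(), key=lambda item: -item[1]))
```

A word that appears in every document still gets a non-zero weight under this formula, but it is the smallest weight in the row, which is the behaviour the idf term is meant to produce.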

In [12]:
# we initialize our TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(use_idf=True)

In [13]:
# fit and transform the training set
tfidf_train=tfidf_vectorizer.fit_transform(X_train) 

# we only transform the test set; fitting on it would leak test information into the vectorizer
tfidf_test=tfidf_vectorizer.transform(X_test)

## PassiveAggressiveClassifier

This is an online learning algorithm, suited to very large streams of data. It can take the input data sequentially, one example at a time, rather than in a single batch, so the model learns step by step.

If the model makes a correct prediction, it stays passive and leaves its weights unchanged; when it makes a wrong prediction, it becomes aggressive and adjusts its weights to correct the mistake.
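The passive/aggressive behaviour can be sketched as a single update step. The snippet below is a simplified illustration of the PA-I variant for binary labels encoded as -1 and +1, with a hinge loss and a step size capped by C. The `pa_update` function and the toy vectors are hypothetical, and this is not scikit-learn's actual implementation. Note that in this formulation "passive" means the example is classified correctly with a margin of at least 1, not merely correctly.

```python
def pa_update(w, x, y, C=0.5):
    """One passive-aggressive (PA-I) step for a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    if loss == 0.0:
        return list(w)                       # passive: nothing to correct
    sq_norm = sum(xi * xi for xi in x)
    if sq_norm == 0.0:
        return list(w)                       # an all-zero input gives no direction to move in
    tau = min(C, loss / sq_norm)             # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# starting from zero weights, the first example is misclassified (loss = 1),
# so the weights move towards it by tau = min(0.5, 1 / 1) = 0.5
w_first = pa_update([0.0, 0.0], [1.0, 0.0], +1)

# an example already classified with margin >= 1 leaves the weights untouched
w_same = pa_update([2.0, 0.0], [1.0, 0.0], +1)

print(w_first, w_same)
```

A larger C allows bigger corrections per mistake, which makes the model react faster but also makes it more sensitive to noisy labels.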

### Build Model

In [14]:
# we build our model: C is the regularization parameter (the maximum step size); it limits how strongly the model is corrected after an incorrect prediction.
model = PassiveAggressiveClassifier(C = 0.5, random_state = 5)

In [15]:
model.fit(tfidf_train,y_train)

PassiveAggressiveClassifier(C=0.5, average=False, class_weight=None,
                            early_stopping=False, fit_intercept=True,
                            loss='hinge', max_iter=1000, n_iter_no_change=5,
                            n_jobs=None, random_state=5, shuffle=True,
                            tol=0.001, validation_fraction=0.1, verbose=0,
                            warm_start=False)

### Predictions

In [16]:
y_pred=model.predict(tfidf_test)
y_pred

array(['FAKE', 'FAKE', 'FAKE', ..., 'REAL', 'REAL', 'REAL'], dtype='<U4')

### Accuracy and Confusion Matrix

In [17]:
cm=confusion_matrix(y_test,y_pred)
acc= accuracy_score(y_test,y_pred)

print("The confusion matrix:")
print(cm)

print("The accuracy of our model:"+ str(acc))


The confusion matrix:
[[585  43]
 [ 35 604]]
The accuracy of our model:0.9384372533543804


In [18]:
# rows are the true labels and columns the predictions, in sorted order (FAKE, REAL); FAKE is treated as the positive class
print("With this model we have "+ str(cm[0][0])+" true positives, "+str(cm[1][1])+" true negatives, "
     +str(cm[1][0])+" false positives and "+str(cm[0][1])+" false negatives.")


With this model we have 585 true positives, 604 true negatives, 35 false positives and 43 false negatives.
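As a worked example, the confusion matrix printed above can be turned into precision, recall and F1 by hand. This sketch assumes the usual scikit-learn layout (rows are true labels, columns are predicted labels, in sorted label order, so FAKE comes first) and treats FAKE as the positive class; that choice of positive class is an assumption made here for illustration.

```python
# confusion matrix copied from the output above
# rows: true FAKE, true REAL; columns: predicted FAKE, predicted REAL
cm_values = [[585, 43],
             [35, 604]]

tp, fn = cm_values[0]   # true FAKE: predicted FAKE / predicted REAL
fp, tn = cm_values[1]   # true REAL: predicted FAKE / predicted REAL

accuracy = (tp + tn) / (tp + tn + fp + fn)
precision = tp / (tp + fp)   # of the articles flagged FAKE, how many really are
recall = tp / (tp + fn)      # of the FAKE articles, how many were caught
f1 = 2 * precision * recall / (precision + recall)

print("accuracy:", accuracy)
print("precision:", precision)
print("recall:", recall)
print("f1:", f1)
```

The accuracy computed this way matches the value reported by accuracy_score, which is a quick check that the matrix is being read in the right orientation.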
