# The Fake News Problem

In this project, we build a model that detects whether a text contains fake news.
Because the data is labeled, this is a supervised learning problem, specifically binary classification: each text is either fake or not fake (real).

In [528]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score
from sklearn.metrics import confusion_matrix
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import LinearSVC


In [478]:
dataset = pd.read_csv('news.csv')
dataset.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [479]:
dataset.shape

(6335, 4)

In [480]:
dataset.nunique(axis=1).sum()

25340

The total equals 6335 × 4, so every row holds four distinct values: no row repeats a value across its own columns.
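
Duplicate rows can also be checked directly. The following is a minimal sketch, assuming `DataFrame.duplicated` is run on a small made-up frame; the `toy` data is hypothetical and not taken from `news.csv`.

```python
import pandas as pd

# Hypothetical toy data: the third row repeats the first one exactly.
toy = pd.DataFrame({
    'title': ['A', 'B', 'A'],
    'text': ['first article', 'second article', 'first article'],
    'label': ['FAKE', 'REAL', 'FAKE'],
})

# Rows that are exact copies of an earlier row.
n_duplicate_rows = int(toy.duplicated().sum())

# Rows whose article body repeats, regardless of the other columns.
n_duplicate_texts = int(toy.duplicated(subset=['text']).sum())

print(n_duplicate_rows, n_duplicate_texts)
```

On the real data the same two calls would be applied to `dataset`; the `subset=['text']` variant is the one that matters most here, since only the article text is used as a feature.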

In [481]:
dataset.isnull().sum()


Unnamed: 0    0
title         0
text          0
label         0
dtype: int64

There is no NaN value.

In [482]:
dataset.groupby('label').size()

label
FAKE    3164
REAL    3171
dtype: int64

**The data looks fine: there are no missing values, the classes are balanced, and no cleaning is needed.**

In [483]:
features = dataset[['title', 'text']]
features

Unnamed: 0,title,text
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello..."
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T..."
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...
...,...,...
6330,State Department says it can't find emails fro...,The State Department told the Republican Natio...
6331,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...
6332,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...
6333,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene..."


In [484]:
label_data = pd.get_dummies(dataset['label'])
label_data

Unnamed: 0,FAKE,REAL
0,1,0
1,1,0
2,0,1
3,1,0
4,0,1
...,...,...
6330,0,1
6331,1,0
6332,1,0
6333,0,1


In [485]:
label_data.pop('REAL')

0       0
1       0
2       1
3       0
4       1
       ..
6330    1
6331    0
6332    0
6333    1
6334    1
Name: REAL, Length: 6335, dtype: uint8

In [486]:
label_data

Unnamed: 0,FAKE
0,1
1,1
2,0
3,1
4,0
...,...
6330,0
6331,1
6332,1
6333,0


In [487]:
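# rename returns a new DataFrame, so assign it back to keep the new column name
label_data = label_data.rename(columns={'FAKE': 'label'})  # 0 = real, 1 = fake
label_data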

Unnamed: 0,label
0,1
1,1
2,0
3,1
4,0
...,...
6330,0
6331,1
6332,1
6333,0
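
The dummy-encoding steps above (`get_dummies`, `pop`, `rename`) can also be collapsed into a single comparison. This is only a sketch of an alternative, not the notebook's method, and the `labels` series is a hypothetical stand-in for `dataset['label']`.

```python
import pandas as pd

# Hypothetical toy labels standing in for dataset['label'].
labels = pd.Series(['FAKE', 'FAKE', 'REAL', 'FAKE', 'REAL'], name='label')

# 1 = fake, 0 = real, the same convention as the FAKE dummy column.
encoded = (labels == 'FAKE').astype(int)

print(encoded.tolist())
```

The result is a one-dimensional Series already named `label`, so it would not need to be flattened with `np.ravel` before being passed to an estimator.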


### Splitting data between train and test using train_test_split

In [529]:
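cv = 42  # random seed, reused as random_state below (not the number of CV folds)
train_X, test_X, train_y, test_y = train_test_split(features['text'], label_data, test_size=0.2, random_state=cv, shuffle=True)
train_y = np.ravel(train_y)  # flatten the single-column DataFrame to a 1-D array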
print(train_X.shape, train_y.shape)

(5068,) (5068,)


**TfidfVectorizer** converts a collection of raw documents to a matrix of TF-IDF features. It is fitted on the training texts only and then applied unchanged to the test texts.

In [520]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(train_X) 
tfidf_test = tfidf_vectorizer.transform(test_X)
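
To make the difference between `fit_transform` and `transform` concrete, here is a small sketch on a hypothetical two-document corpus (not taken from `news.csv`), using the default `TfidfVectorizer` settings rather than the `stop_words` and `max_df` values above.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not taken from news.csv.
train_docs = ['senate passes budget bill', 'aliens endorse budget bill']
test_docs = ['senate endorses aliens']

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them, learns nothing new

vocabulary = sorted(vectorizer.vocabulary_)
print(vocabulary)
print(train_matrix.shape, test_matrix.shape)
```

Both matrices share the same columns, one per term seen during fitting. A word that appears only in the test document (`endorses` here) has no column and is ignored.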

### Decision Tree

In [530]:
decision_tree = DecisionTreeClassifier(random_state=cv)
cross_val_dt = cross_val_score(decision_tree, tfidf_train, train_y,  
                        scoring='accuracy', cv=10)
(cross_val_dt*100).mean()

81.58894060231853

### Linear Support Vector Machine

In [531]:
svc = LinearSVC(random_state=cv)
cross_val_lsvm = cross_val_score(svc, tfidf_train, train_y,  
                        scoring='accuracy', cv=10)
(cross_val_lsvm*100).mean()

93.58682009183681

### Passive Aggressive Classifier

In [532]:
pac = PassiveAggressiveClassifier(random_state=cv)
cross_val_pac = cross_val_score(pac, tfidf_train, train_y,  
                        scoring='accuracy', cv=10)
(cross_val_pac*100).mean()

93.90259684574066
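
In the cross-validation cells above, the vectorizer was fitted on the whole training set beforehand, so each validation fold has already contributed to the vocabulary and the IDF weights. One possible way to keep the folds separate, sketched here as an assumption and not as the notebook's method, is to wrap the vectorizer and the classifier in a scikit-learn pipeline so that both are refitted inside every fold. The toy corpus and labels are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Hypothetical toy corpus: 1 = fake, 0 = real.
docs = [
    'shocking secret cure doctors hate',
    'you will not believe this shocking trick',
    'secret plot revealed by anonymous insider',
    'senate votes on budget bill',
    'city council approves new budget',
    'court rules on senate election case',
]
labels = [1, 1, 1, 0, 0, 0]

# The pipeline takes raw text; TF-IDF is refitted on each training fold.
pipeline = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=42))
scores = cross_val_score(pipeline, docs, labels, scoring='accuracy', cv=3)
print(scores.mean())
```

With the real data, `docs` and `labels` would be `train_X` and `train_y`, and the scores may differ slightly from the ones reported above.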

### Linear Support Vector Machine on the test set

In [533]:
svc = LinearSVC(random_state=cv, loss='hinge')
model_svc = svc.fit(tfidf_train,train_y)
prediction_svc = svc.predict(tfidf_test)
score_svc = accuracy_score(test_y,prediction_svc)
print(f'Accuracy: {round(score_svc*100,2)}%')
confusion_matrix(test_y,prediction_svc, labels=[1,0])

Accuracy: 93.29%


array([[593,  35],
       [ 50, 589]], dtype=int64)
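
Because `labels=[1,0]` was passed, the first row and column of the matrix correspond to FAKE and the second to REAL. The sketch below reads the matrix printed above and derives accuracy, precision and recall for the FAKE class from it; the variable names are illustrative only.

```python
import numpy as np

# Confusion matrix reported above for LinearSVC, with labels=[1, 0]:
# rows are true classes, columns are predicted classes, FAKE (1) first.
cm = np.array([[593, 35],
               [50, 589]])

true_fake, missed_fake = cm[0]   # FAKE articles predicted FAKE / predicted REAL
false_alarm, true_real = cm[1]   # REAL articles predicted FAKE / predicted REAL

accuracy = (true_fake + true_real) / cm.sum()
precision_fake = true_fake / (true_fake + false_alarm)
recall_fake = true_fake / (true_fake + missed_fake)

print(f'Accuracy: {accuracy:.2%}')
print(f'Precision (FAKE): {precision_fake:.2%}')
print(f'Recall (FAKE): {recall_fake:.2%}')
```

The accuracy computed this way matches the 93.29% printed by `accuracy_score`.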

### Passive Aggressive Classifier on the test set

In [525]:
pac = PassiveAggressiveClassifier(max_iter=1000, early_stopping=True, random_state=42)
pac.fit(tfidf_train,train_y)

PassiveAggressiveClassifier(early_stopping=True, random_state=42)

In [526]:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(test_y,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 93.21%


In [527]:
confusion_matrix(test_y,y_pred, labels=[1,0])


array([[584,  44],
       [ 42, 597]], dtype=int64)

### Linear Support Vector Machine was the chosen model (93.29% test accuracy versus 93.21% for the Passive Aggressive Classifier).