# Python Project – Detecting Fake News

#### Objective:
To build a model that accurately classifies a piece of news as REAL or FAKE.

#### About:

This Python project classifies news articles as fake or real. Using sklearn, we fit a TfidfVectorizer on our dataset. Then we initialize a PassiveAggressiveClassifier and fit the model. Finally, the accuracy score and the confusion matrix tell us how well the model performs.
   

__Author__ = 'Rinaldo Gagiano'  
__Email__ = 'Rinaldogagiano@gmail.com'  
__Github__ = 'https://github.com/RinaldoG'  
__Project__ = 'Data-flair'

## Necessary Imports

In [5]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## Load the Data

In [7]:
#Read the data into a dataframe
df = pd.read_csv('news.csv')

## Examine the Data

In [15]:
#Get shape of data
df.shape

(6335, 4)

The news dataset has 6335 observations and 4 variables.

In [17]:
#Get head(quick view) of data
df.head(10)

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |
| 5 | 6903 | Tehran, USA | \nI’m not an immigrant, but my grandparents ... | FAKE |
| 6 | 7341 | Girl Horrified At What She Watches Boyfriend D... | Share This Baylee Luciani (left), Screenshot o... | FAKE |
| 7 | 95 | ‘Britain’s Schindler’ Dies at 106 | A Czech stockbroker who saved more than 650 Je... | REAL |
| 8 | 4869 | Fact check: Trump and Clinton at the 'commande... | Hillary Clinton and Donald Trump made some ina... | REAL |
| 9 | 2909 | Iran reportedly makes new push for uranium con... | Iranian negotiators reportedly have made a las... | REAL |


In [21]:
#Get labels
labels = df.label
labels.head(10)

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
5    FAKE
6    FAKE
7    REAL
8    REAL
9    REAL
Name: label, dtype: object
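
Accuracy is easier to interpret when the two classes are roughly balanced, so it can help to count the labels before modelling. A minimal sketch, assuming a small made-up `sample` DataFrame standing in for `df` (the rows are illustrative, not taken from `news.csv`):

```python
import pandas as pd

# Hypothetical stand-in for df; the real data comes from news.csv
sample = pd.DataFrame({
    "text": ["story a", "story b", "story c", "story d", "story e"],
    "label": ["FAKE", "REAL", "FAKE", "REAL", "REAL"],
})

# Absolute count of each class
label_counts = sample["label"].value_counts()

# Share of each class, as fractions that sum to 1
label_share = sample["label"].value_counts(normalize=True)

print(label_counts)
print(label_share)
```

On the real data, the same two calls on `df['label']` would show whether FAKE and REAL occur in similar numbers.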

## Check for Missing Values

In [28]:
#Count of any NA values within Text column
sum(pd.isnull(df['text']))

0
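
The check above covers only `NaN` values in `text`. Rows whose text is an empty or whitespace-only string would pass it while contributing nothing to the vectorizer. A hedged sketch of a broader check, using a made-up `sample` DataFrame (hypothetical rows, not from `news.csv`):

```python
import pandas as pd

# Hypothetical rows: one missing title, two blank texts
sample = pd.DataFrame({
    "title": ["a", "b", None, "d"],
    "text": ["some article body", "   ", "another body", ""],
    "label": ["REAL", "FAKE", "FAKE", "REAL"],
})

# NaN count for every column at once
na_per_column = sample.isnull().sum()

# Rows whose text is empty once surrounding whitespace is stripped
blank_text = sample["text"].str.strip().eq("")
n_blank = int(blank_text.sum())

print(na_per_column)
print(f"Blank text rows: {n_blank}")
```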

## Create Training and Testing Sets

In [45]:
#Split the dataset
text_train,text_test,labels_train,labels_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
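
With `test_size=0.2`, one fifth of the rows is held out for testing, and `random_state=7` makes the shuffle repeatable. The sketch below illustrates both points on ten made-up samples. The `stratify` argument is an optional addition that the notebook does not use; it is shown here because it keeps the FAKE/REAL ratio the same in both splits.

```python
from sklearn.model_selection import train_test_split

# Hypothetical toy data: ten texts, five per class
texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE", "REAL"] * 5

def split(seed):
    # stratify is an assumption added for illustration, not part of the notebook
    return train_test_split(
        texts, toy_labels, test_size=0.2, random_state=seed, stratify=toy_labels
    )

x_train, x_test, y_train, y_test = split(7)
x_train_again, x_test_again, _, _ = split(7)

print(len(x_train), len(x_test))   # 8 training samples, 2 test samples
print(x_test == x_test_again)      # same seed gives the same split
print(sorted(y_test))              # one sample of each class
```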

### Preview of Training and Testing Sets

In [47]:
text_train.head()

6237    The head of a leading survivalist group has ma...
3722    ‹ › Arnaldo Rodgers is a trained and educated ...
5774    Patty Sanchez, 51, used to eat 13,000 calories...
336     But Benjamin Netanyahu’s reelection was regard...
3622    John Kasich was killing it with these Iowa vot...
Name: text, dtype: object

In [48]:
text_test.head()

3534    A day after the candidates squared off in a fi...
6265    VIDEO : FBI SOURCES SAY INDICTMENT LIKELY FOR ...
3123    It's debate season, where social media has bro...
3940    Mitch McConnell has decided to wager the Repub...
2856    Donald Trump, the actual Republican candidate ...
Name: text, dtype: object

In [49]:
labels_train.head()

6237    FAKE
3722    FAKE
5774    FAKE
336     REAL
3622    REAL
Name: label, dtype: object

In [50]:
labels_test.head()

3534    REAL
6265    FAKE
3123    REAL
3940    REAL
2856    REAL
Name: label, dtype: object

## Initialise a TfidfVectorizer

Stop words are the most common words in a language, which are filtered out before the text is processed. We use the English stop-word list built into scikit-learn. We also cap the maximum document frequency at 0.7, so terms that appear in more than 70% of the documents are discarded.
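
To see both filters at work, the sketch below fits the same vectorizer settings on a made-up four-document corpus (hypothetical sentences). The word "the" is dropped as an English stop word, and "news" is dropped because it appears in every document, which exceeds the 0.7 cap.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical four-document corpus; "news" occurs in every document
docs = [
    "the senate passed the budget news",
    "aliens endorse candidate news",
    "the budget vote news",
    "candidate wins vote news",
]

vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)
matrix = vectorizer.fit_transform(docs)

# vocabulary_ maps each retained term to its column index
vocab = sorted(vectorizer.vocabulary_)

print(vocab)
print(matrix.shape)   # (number of documents, number of retained terms)
```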

In [51]:
#Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

## Fit and Transform Train Set, Transform Test Set

In [52]:
#Fit and transform train set
tfidf_train=tfidf_vectorizer.fit_transform(text_train) 

In [53]:
#Transform test set
tfidf_test=tfidf_vectorizer.transform(text_test)
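
The vectorizer is fitted on the training texts only, so the vocabulary and IDF weights never see the test set. `transform` then maps the test texts onto that same vocabulary and ignores any word it has not seen. A small sketch with made-up sentences (hypothetical data):

```python
from sklearn.feature_extraction.text import TfidfVectorizer

train_texts = ["budget vote delayed", "senate budget approved"]
test_texts = ["budget vote surprises economists"]  # last two words are unseen

vec = TfidfVectorizer()
train_matrix = vec.fit_transform(train_texts)   # learns vocabulary and IDF
test_matrix = vec.transform(test_texts)         # reuses them unchanged

print(train_matrix.shape, test_matrix.shape)    # same number of columns
print("surprises" in vec.vocabulary_)           # unseen word is not a feature
print(test_matrix.nnz)                          # only known words are non-zero
```

Calling `fit_transform` on the test set instead would build a different vocabulary, and the classifier's weights would no longer line up with the columns.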

## Initialise a PassiveAggressiveClassifier

In [63]:
#Initialise a PassiveAggressiveClassifier
pac=PassiveAggressiveClassifier(max_iter=50)
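
A passive-aggressive classifier is an online linear model: it leaves the weights untouched when a sample is classified with enough margin (passive) and otherwise shifts them towards the correct side, with the step size capped by `C` (aggressive). The pure-Python sketch below shows one such update in the PA-I form that corresponds to `loss='hinge'`, with labels encoded as +1/-1. The function name `pa_update` and the label encoding are illustrative assumptions; scikit-learn's own implementation also handles the intercept, shuffling and multiple passes over the data.

```python
def pa_update(weights, features, label, C=1.0):
    """One PA-I step for a single sample; label must be +1 or -1."""
    margin = label * sum(w * x for w, x in zip(weights, features))
    loss = max(0.0, 1.0 - margin)           # hinge loss
    sq_norm = sum(x * x for x in features)
    if loss == 0.0 or sq_norm == 0.0:
        return list(weights)                # passive: nothing to correct
    step = min(C, loss / sq_norm)           # aggressive, capped by C
    return [w + step * label * x for w, x in zip(weights, features)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)   # margin below 1: weights move
w2 = pa_update(w1, [1.0, 0.0], +1)   # margin already 1: no change
print(w1, w2)
```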

## Fitting of Classifier

In [64]:
pac.fit(tfidf_train,labels_train)

PassiveAggressiveClassifier(C=1.0, average=False, class_weight=None,
              early_stopping=False, fit_intercept=True, loss='hinge',
              max_iter=50, n_iter=None, n_iter_no_change=5, n_jobs=None,
              random_state=None, shuffle=True, tol=None,
              validation_fraction=0.1, verbose=0, warm_start=False)

## Predict on Test Set and Calculate Accuracy 

In [66]:
#Predict on the test set and calculate accuracy
test_pred=pac.predict(tfidf_test)
score=accuracy_score(labels_test,test_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.5%


## Confusion Matrix

In [67]:
#Build confusion matrix
confusion_matrix(labels_test,test_pred, labels=['FAKE','REAL'])

array([[586,  52],
       [ 43, 586]])

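Model indications (treating FAKE as the positive class; rows are actual labels, columns are predicted labels):
- 586 true positives (FAKE predicted as FAKE)
- 586 true negatives (REAL predicted as REAL)
- 43 false positives (REAL predicted as FAKE)
- 52 false negatives (FAKE predicted as REAL)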
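
The four counts are enough to recompute accuracy and to derive precision, recall and F1 by hand. The sketch below assumes FAKE is the positive class and reads the matrix with rows as actual labels and columns as predicted labels.

```python
# Counts copied from the confusion matrix, treating FAKE as the positive class
tp, fn = 586, 52   # actual FAKE: predicted FAKE, predicted REAL
fp, tn = 43, 586   # actual REAL: predicted FAKE, predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # share of predicted FAKE that really is FAKE
recall = tp / (tp + fn)      # share of actual FAKE that was caught
f1 = 2 * precision * recall / (precision + recall)

print(f"Accuracy:  {accuracy:.4f}")
print(f"Precision: {precision:.4f}")
print(f"Recall:    {recall:.4f}")
print(f"F1:        {f1:.4f}")
```

The accuracy matches the 92.5% reported earlier (1172 correct out of 1267 test articles), with roughly 93% precision and 92% recall for the FAKE class.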

**Output of the model may vary between runs, because the classifier is created without a fixed `random_state`.**
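
The train/test split is already fixed by `random_state=7`, so the remaining source of variation is the classifier, which shuffles the training data with `random_state=None`. One way to make runs repeatable is to pass a seed to the classifier as well. A sketch on made-up data (the texts and the seed value 7 are illustrative assumptions):

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier

# Hypothetical toy corpus, three texts per class
toy_texts = [
    "aliens endorse candidate", "senate passes budget",
    "miracle cure discovered", "court hears appeal",
    "celebrity clone spotted", "council approves plan",
]
toy_labels = ["FAKE", "REAL", "FAKE", "REAL", "FAKE", "REAL"]

features = TfidfVectorizer().fit_transform(toy_texts)

def fit_once(seed):
    # Same settings as the notebook, plus an explicit seed
    model = PassiveAggressiveClassifier(max_iter=50, random_state=seed)
    return model.fit(features, toy_labels)

first, second = fit_once(7), fit_once(7)

# Identical seeds give identical learned weights
print(np.array_equal(first.coef_, second.coef_))
```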

# Summary

We learned to detect fake news with Python in a Jupyter Notebook. We took a political news dataset, applied a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. We obtained an accuracy of 92.5% on the test set.

**Accuracy may change between runs, because the classifier is created without a fixed `random_state`.**