# Fake News Detection

#### Text Analysis (Natural Language Processing) and Classification
**Problem to solve:** Do you trust all the news you see on social media? Not all of it is real, so how can fake news be detected? A type of yellow journalism, fake news comprises news items that may be hoaxes and that are generally spread through social media and other online media. This is often done to further or impose certain ideas, frequently in service of political agendas. Such news items may contain false and/or exaggerated claims and may be amplified by algorithms, leaving users in a filter bubble.

**Dataset:** The dataset used for this Python project is called news.csv. It has a shape of 6335×4. The first column identifies the news, the second and third are the title and text, and the fourth column has labels denoting whether the news is REAL or FAKE.

In [74]:
#importing necessary libraries

import numpy as np
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.model_selection import train_test_split
import re
from sklearn.metrics import accuracy_score, confusion_matrix

print('Libraries imported!')

Libraries imported!


In [85]:
# get data

column_names = ['news_id', 'title', 'text', 'label']
df = pd.read_csv('news.csv', names=column_names, header=0)

# clean text column: replace non-word characters (punctuation, symbols) with spaces,
# collapse repeated whitespace and lowercase the result
df['cleaned_text'] = df['text'].apply(lambda text: ' '.join(re.sub(r'\W', ' ', text).split()).lower())

# get shape and head
print(df.shape)
df.head()

(6335, 5)


| | news_id | title | text | label | cleaned_text |
|---|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE | daniel greenfield a shillman journalism fellow... |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE | google pinterest digg linkedin reddit stumbleu... |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL | u s secretary of state john f kerry said monda... |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE | kaydee king kaydeeking november 9 2016 the les... |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL | it s primary day in new york and front runners... |
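To see what the cleaning step does to a single string, here is a minimal standalone sketch. The helper name `clean_text` is hypothetical (the notebook uses an inline lambda), and the sample string is the visible part of row 4 above.

```python
import re

def clean_text(text):
    """Hypothetical helper mirroring the notebook's lambda."""
    # \W matches any character that is not a letter, digit or underscore,
    # so punctuation becomes a space; split/join collapses repeated spaces.
    return ' '.join(re.sub(r'\W', ' ', text).split()).lower()

sample = "It's primary day in New York and front-runners"
cleaned = clean_text(sample)
print(cleaned)  # it s primary day in new york and front runners
```

Note that in Python 3 `\W` is Unicode-aware for `str` patterns, so accented letters are kept; only punctuation and symbols are replaced.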


In [75]:
# splitting data into training (80%) and testing sets (20%)

labels = df.label
x_train, x_test, y_train, y_test = train_test_split(df['cleaned_text'], labels, test_size=0.2, random_state=7)

print('x_train size: ', x_train.shape[0])
print('x_test size: ', x_test.shape[0])

x_train size:  5068
x_test size:  1267
6237    FAKE
3722    FAKE
5774    FAKE
336     REAL
3622    REAL
Name: label, dtype: object


### Pre-processing Natural Language Text data

#### Bag-of-Words Model
We cannot work with text directly when using machine learning algorithms. Instead, we need to convert the text to numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW. The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique integer index. Then any document we see can be encoded as a fixed-length vector whose length is the size of the vocabulary of known words. The value in each position of the vector can be a count, a frequency, or a tf-idf weight of the corresponding word in the encoded document.

We use the TfidfVectorizer to learn the vocabulary and inverse document frequencies across the documents (i.e. news text) in the training set, and then to encode those documents.
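The bag-of-words encoding described above can be sketched in a few lines of plain Python. This is an illustration on a made-up two-document corpus, using raw counts rather than tf-idf weights; the names `docs`, `vocab` and `encode` are assumptions for the example, not part of the notebook.

```python
from collections import Counter

# Toy corpus (invented for illustration)
docs = ["the cat sat", "the cat sat on the mat"]

# Learn the vocabulary: every distinct word gets a fixed integer index
words = sorted({word for doc in docs for word in doc.split()})
vocab = {word: index for index, word in enumerate(words)}

def encode(doc):
    """Encode a document as a count vector of length len(vocab)."""
    vector = [0] * len(vocab)
    for word, count in Counter(doc.split()).items():
        if word in vocab:  # words outside the vocabulary are ignored
            vector[vocab[word]] = count
    return vector

vectors = [encode(doc) for doc in docs]
print(vocab)    # {'cat': 0, 'mat': 1, 'on': 2, 'sat': 3, 'the': 4}
print(vectors)  # [[1, 0, 0, 1, 1], [1, 1, 1, 1, 2]]
```

Word order is lost: any permutation of the words in a document yields the same vector.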

In [60]:
# Let’s initialize a TfidfVectorizer with stop words from the English language and a maximum document frequency of 0.7 
# (terms with a higher document frequency will be discarded). Stop words are the most common words in a language that are 
# to be filtered out before processing the natural language data. And a TfidfVectorizer turns a collection of raw documents 
# into a matrix of TF-IDF features.

tfidfvectorizer = TfidfVectorizer(stop_words='english', max_df=0.7, use_idf=True)
# fit the vectorizer on the train set, learning vocabulary and idf from the training set
tfidfvectorizer.fit(x_train)
# encode the training set and test set
tfidfvectorizer_vectors_xtrain = tfidfvectorizer.transform(x_train).toarray()
tfidfvectorizer_vectors_xtest = tfidfvectorizer.transform(x_test).toarray()

type(tfidfvectorizer_vectors_xtrain)

numpy.ndarray

A vocabulary of 61651 words is learned from the documents and each word is assigned a unique integer index in the output vector.

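The inverse document frequencies are calculated for each word in the vocabulary: the more documents a word appears in, the lower its score. With scikit-learn's default smoothing, a word that appears in every document would get the lowest possible score of 1.0.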
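As a rough check on how these scores behave, the sketch below evaluates the smoothed idf formula that scikit-learn documents for its default `smooth_idf=True` setting, `idf(t) = ln((1 + n) / (1 + df(t))) + 1`. The function name `smoothed_idf` is hypothetical, and the document frequencies passed in are illustrative values, not figures taken from the dataset.

```python
import math

def smoothed_idf(n_docs, doc_freq):
    """Smoothed inverse document frequency, as in scikit-learn's default."""
    return math.log((1 + n_docs) / (1 + doc_freq)) + 1

n_docs = 5068  # size of the training set in this notebook

idf_everywhere = smoothed_idf(n_docs, n_docs)  # word in every document
idf_common = smoothed_idf(n_docs, 1000)        # illustrative: fairly common word
idf_rare = smoothed_idf(n_docs, 1)             # illustrative: word in one document

print(idf_everywhere)  # 1.0
print(idf_common < idf_rare)  # True
```

Because the vectorizer above is created with `max_df=0.7`, words that appear in more than 70% of the training documents are dropped from the vocabulary, so the smallest idf actually learned should sit somewhat above 1.0.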

In [61]:
# vocabulary and idf values learned by the vectorizer

vocabulary = tfidfvectorizer.vocabulary_
idf_values = tfidfvectorizer.idf_

print('vocabulary size (learned): ', len(vocabulary))
print(idf_values.shape)

vocabulary size (learned):  61651
(61651,)


In [71]:
# as there are 5068 documents in the training set, the encoded training set will have 5068 rows and 61651 columns 
# corresponding to the learned vocabulary. Every new document (i.e. news post) is encoded using this learned
# vocabulary and its idf values, and the resulting tf-idf vectors are used by the machine learning algorithm later.

print(tfidfvectorizer_vectors_xtrain.shape)

(5068, 61651)


### Predictive modelling

#### Passive Aggressive Classifier
See this [YouTube video](https://www.youtube.com/watch?v=TJU8NfDdqNQ) by Victor Lavrenko. Quoting him: the Passive Aggressive (PA) algorithm is perfect for classifying massive streams of text data (e.g. tweets or, in our case, news text). It's easy to implement and very fast, but does not provide global guarantees like the support-vector machine (SVM).

We consider the online setting, in which we receive examples sequentially. On each round t we receive an instance x_t (x_t is in R^n) and make a prediction using our current hypothesis w_t. We then receive the true target y_t and suffer an instantaneous loss based on the discrepancy between y_t and our prediction. Our goal is to keep the cumulative loss small. Finally, we update the hypothesis according to the previous hypothesis and the current example.
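The round-by-round procedure can be sketched in plain Python. This is a minimal illustration of one common variant (PA-I with hinge loss, for labels in {-1, +1}), not scikit-learn's implementation; the function name `pa_update`, the cap `C` and the toy stream are assumptions made for the example.

```python
def pa_update(w, x, y, C=1.0):
    """One online round: predict with w, observe y, update w if needed."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss on this example
    sq_norm = sum(xi * xi for xi in x)
    if loss == 0.0 or sq_norm == 0.0:
        return w, loss  # passive: margin is large enough, keep w
    tau = min(C, loss / sq_norm)  # aggressive: step size, capped by C
    new_w = [wi + tau * y * xi for wi, xi in zip(w, x)]
    return new_w, loss

# Toy stream of (instance, label) pairs, invented for illustration
stream = [([1.0, 0.0], +1), ([0.0, 1.0], -1), ([1.0, 0.0], +1)]

w = [0.0, 0.0]
cumulative_loss = 0.0
for x, y in stream:
    w, loss = pa_update(w, x, y)
    cumulative_loss += loss

print(w)                # [1.0, -1.0]
print(cumulative_loss)  # 2.0
```

The first two rounds suffer a loss and trigger an update; on the third round the margin is already 1, so the weights are left unchanged.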

In [None]:
# initialize, fit a PassiveAggressiveClassifier on the training set vectors and its labels

pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidfvectorizer_vectors_xtrain, y_train)

# predict on the test set
y_pred = pac.predict(tfidfvectorizer_vectors_xtest)

In [83]:
# model evaluation

score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100, 2)}%')

# to dive deeper into evaluating the classifier, we'll use a confusion matrix - which is a table that is often 
# used to describe the performance of a classification model (or “classifier”) on a set of test data for which the truth
# values are known already. With labels=['FAKE', 'REAL'], FAKE is treated as the negative class and REAL as the positive
# class, so rows are true labels and columns are predicted labels in that order.
cm = confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])
tn, fp, fn, tp = cm.ravel()
cm

Accuracy: 92.9%


array([[595,  43],
       [ 47, 582]], dtype=int64)
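The four cells of the confusion matrix are enough to derive further metrics. The sketch below plugs in the counts printed above, treating FAKE as the negative class and REAL as the positive class (the order given by `labels=['FAKE', 'REAL']`); the variable names are chosen for the example.

```python
# Counts copied from the confusion matrix above
tn, fp, fn, tp = 595, 43, 47, 582

total = tn + fp + fn + tp
accuracy = (tp + tn) / total  # share of all predictions that are correct
precision = tp / (tp + fp)    # of articles predicted REAL, share truly REAL
recall = tp / (tp + fn)       # of truly REAL articles, share predicted REAL
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
print(f'F1 score:  {f1:.4f}')
```

The accuracy computed this way rounds to the 92.9% reported by `accuracy_score`, which is a quick consistency check between the two outputs.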