# Fake News Detection Project

## Objective

To build a model that accurately classifies news articles as REAL or FAKE.

## About this Project

Using scikit-learn, we fit a **TfidfVectorizer** on our dataset. Then we initialize a **PassiveAggressiveClassifier** and fit the model. Finally, the accuracy score and the **confusion matrix** tell us how well the model performs.

## The Dataset

The dataset is provided by DataFlair, an online data science training platform. It is in .csv format and has shape 6335 × 4. The first column identifies the news item, the second and third hold the title and the text, and the fourth holds the label marking the item as REAL or FAKE.

## Prerequisites

We have to install the following libraries to begin this project. 

In [1]:
pip install numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


## Steps for detecting fake news

The following are the steps to detect fake news and to complete this project.

1. Make the necessary imports.

In [2]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

2. Now let's read the data into a DataFrame and get its shape and first 5 rows.

In [3]:
#Read the data
df = pd.read_csv('FN Dataset/news.csv')
#Get the shape
df.shape
df.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [4]:
df.shape

(6335, 4)

The outputs above show the first 5 rows and the shape of the DataFrame.
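The `Unnamed: 0` column in the output is the name pandas assigns to a column whose header cell is empty, which suggests (as an assumption about the file) that the first header cell of `news.csv` is blank. The sketch below reproduces this on a hypothetical two-row stand-in for the file and shows that passing `index_col=0` to `pd.read_csv` turns that column into the index instead. The sample rows and the names `sample_csv`, `raw`, and `indexed` are illustrative only.

```python
import io

import pandas as pd

# Hypothetical stand-in for news.csv: note the empty first header cell.
sample_csv = io.StringIO(
    ",title,text,label\n"
    "8476,Some headline,Some article body,FAKE\n"
    "3608,Another headline,Another article body,REAL\n"
)

# Default read: the unnamed first column becomes "Unnamed: 0".
raw = pd.read_csv(sample_csv)
print(raw.columns.tolist())

# Rewind the buffer and use the first column as the index instead.
sample_csv.seek(0)
indexed = pd.read_csv(sample_csv, index_col=0)
print(indexed.columns.tolist())
print(indexed.shape)
```

With `index_col=0` the identifier column no longer counts toward the column total, so the real dataset would report one column fewer than the shape shown above.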

3. Get the labels from the data.

In [5]:
#Get the labels
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

### Data Sampling

4. Split the data into training and test datasets.

In [6]:
x = df.text
x.head()

0    Daniel Greenfield, a Shillman Journalism Fello...
1    Google Pinterest Digg Linkedin Reddit Stumbleu...
2    U.S. Secretary of State John F. Kerry said Mon...
3    — Kaydee King (@KaydeeKing) November 9, 2016 T...
4    It's primary day in New York and front-runners...
Name: text, dtype: object

In [7]:
y = labels
y.tail()

6330    REAL
6331    FAKE
6332    FAKE
6333    REAL
6334    REAL
Name: label, dtype: object

In [8]:
# Split the dataset: 80% for training, 20% for testing
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.2, random_state=42)
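To make the effect of `test_size` concrete, here is a small sketch on ten made-up samples. It also passes `stratify`, an optional argument that the call above does not use, to keep the FAKE/REAL proportions equal in both parts. The toy data and variable names are assumptions for illustration.

```python
from sklearn.model_selection import train_test_split

# Ten hypothetical articles with perfectly balanced labels.
toy_texts = [f"article {i}" for i in range(10)]
toy_labels = ["FAKE", "REAL"] * 5

# test_size=0.2 reserves 20% of the samples (2 of 10) for testing;
# stratify=toy_labels keeps the class ratio the same in both parts.
toy_x_train, toy_x_test, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

print(len(toy_x_train), len(toy_x_test))
print(sorted(toy_y_test))
```

Applied to the 6335-row dataset, the same `test_size` gives 1267 test rows, which matches the total of the confusion matrix reported at the end.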

### Initializing TfidfVectorizer

5. Let's initialize a **TfidfVectorizer** with English *stop words* and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents are discarded). Stop words are the most common words in a language, which are filtered out before processing natural language data. Then fit and transform the vectorizer on the training set, and use the fitted vectorizer to transform the test set.

In [9]:
#Initializing TfidfVectorizer 
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
#Fit and transform train set and transform test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
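The following sketch illustrates, on a made-up four-document corpus, what `stop_words='english'` and `max_df=0.7` do to the vocabulary, and why the test set is only transformed. The documents and the names `toy_train`, `toy_test`, and `toy_vectorizer` are hypothetical.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical training corpus: "senate" occurs in 3 of 4 documents (75%).
toy_train = [
    "the senate passed the budget",
    "the senate rejected the budget",
    "senate aides leaked a secret memo",
    "a secret memo shocked the voters",
]
toy_test = ["senate leaked the unknown budget"]

toy_vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# fit_transform learns the vocabulary and IDF weights from the training data.
toy_train_matrix = toy_vectorizer.fit_transform(toy_train)

# transform reuses that vocabulary; words unseen in training are ignored.
toy_test_matrix = toy_vectorizer.transform(toy_test)

vocab = toy_vectorizer.vocabulary_
print(sorted(vocab))
print("senate" in vocab, "the" in vocab)
print(toy_train_matrix.shape, toy_test_matrix.shape)
```

Here "the" is removed as a stop word and "senate" is removed because its document frequency exceeds 0.7. Fitting only on the training data keeps information from the test articles out of the learned vocabulary and weights.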

### Initializing PassiveAggressiveClassifier

6. Next, we'll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train. Then we'll predict on the TF-IDF features of the test set (tfidf_test) and calculate the accuracy with accuracy_score() from sklearn.metrics. Finally, we'll print the confusion matrix to see the number of true and false positives and negatives.
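As background on the name, a passive-aggressive learner leaves its weights alone when a sample is classified with enough margin (passive) and otherwise changes them just enough to fix the mistake (aggressive). The sketch below is a minimal NumPy version of the PA-I update rule from the online-learning literature, assuming labels encoded as -1/+1. It is not scikit-learn's implementation, and the function name `pa1_update` is made up for illustration.

```python
import numpy as np


def pa1_update(w, x, y, C=1.0):
    """Apply one PA-I step for a sample x with label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * np.dot(w, x))  # hinge loss on this sample
    sq_norm = np.dot(x, x)
    if loss == 0.0 or sq_norm == 0.0:
        return w  # passive: margin already satisfied (or empty sample)
    tau = min(C, loss / sq_norm)  # step size, capped by C
    return w + tau * y * x  # aggressive: move toward the correct side


w0 = np.zeros(2)
x = np.array([1.0, 0.0])

w1 = pa1_update(w0, x, y=1)   # misclassified, so the weights change
w2 = pa1_update(w1, x, y=1)   # margin is now met, so nothing changes
w3 = pa1_update(w2, x, y=-1)  # same input with the opposite label

print(w1, w2, w3)
```

The cap `C` limits how far a single (possibly mislabeled) sample can move the weights.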

In [10]:
#Initializing the PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)
#Predict on the test set and calculate accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')
#Creating confusion matrix
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])



Accuracy: 93.76%


array([[588,  40],
       [ 39, 600]])
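The confusion matrix can be read further than the single accuracy figure. In scikit-learn's convention, rows are the true labels and columns are the predicted labels, here in the order FAKE, REAL. The sketch below recomputes the accuracy from the reported matrix and derives precision and recall for the FAKE class. The variable names are illustrative.

```python
import numpy as np

# Reported confusion matrix: rows = true label, columns = predicted label,
# both in the order ['FAKE', 'REAL'].
cm = np.array([[588, 40],
               [39, 600]])

fake_as_fake, fake_as_real = cm[0]  # true FAKE articles
real_as_fake, real_as_real = cm[1]  # true REAL articles

total = cm.sum()
accuracy = np.trace(cm) / total  # correct predictions / all predictions

# Of everything flagged FAKE, how much really was FAKE?
precision_fake = fake_as_fake / (fake_as_fake + real_as_fake)
# Of all truly FAKE articles, how many were caught?
recall_fake = fake_as_fake / (fake_as_fake + fake_as_real)

print(f"Test samples: {total}")
print(f"Accuracy: {accuracy * 100:.2f}%")
print(f"FAKE precision: {precision_fake:.4f}")
print(f"FAKE recall: {recall_fake:.4f}")
```

So 40 fake articles were labeled REAL and 39 real articles were labeled FAKE, out of 1267 test samples.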