# Detecting Fake News with Python and Machine Learning



### Do you trust all the news you see on social media?

### Not all news is real, right?

### How will you detect fake news?

### The answer is Python. By working through this advanced Python project on detecting fake news, you will learn to tell real news from fake.

### Before moving ahead in this machine learning project, get familiar with the terms it relies on: fake news, TfidfVectorizer, and PassiveAggressiveClassifier.

#### - Fake News: news items that may be hoaxes, generally spread through social media and other online media.
#### - TfidfVectorizer: the number of times a word appears in a document is its Term Frequency (TF). Inverse Document Frequency (IDF) measures how rare a word is across the whole collection of documents. Multiplying the two gives the TF-IDF weight of a word in a document.
#### - PassiveAggressiveClassifier: Passive Aggressive algorithms are online learning algorithms. Such an algorithm remains passive when a classification is correct and turns aggressive on a misclassification, updating and adjusting its weights.
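
To make the TF-IDF idea concrete, here is a minimal plain-Python sketch on three made-up sentences. It uses the textbook formulas tf = count / document length and idf = log(N / df). scikit-learn's TfidfVectorizer adds smoothing and normalization on top of this, so its numbers will differ. The toy documents and the helper names `term_frequency`, `inverse_document_frequency`, and `tf_idf` are illustrative assumptions, not part of the project.

```python
import math
from collections import Counter

# Three tiny made-up documents (illustrative only, not taken from news.csv)
docs = [
    "aliens endorse candidate",
    "senate passes budget bill",
    "candidate wins senate seat",
]
tokenized = [doc.split() for doc in docs]


def term_frequency(term, tokens):
    """Share of the document's tokens that are `term`."""
    return Counter(tokens)[term] / len(tokens)


def inverse_document_frequency(term, corpus):
    """log(N / df): terms that are rarer across the corpus get a larger weight.

    Assumes `term` occurs in at least one document of `corpus`.
    """
    df = sum(1 for tokens in corpus if term in tokens)
    return math.log(len(corpus) / df)


def tf_idf(term, tokens, corpus):
    return term_frequency(term, tokens) * inverse_document_frequency(term, corpus)


# "aliens" occurs in one document and "candidate" in two,
# so "aliens" gets the higher weight within the first document.
print(tf_idf("aliens", tokenized[0], tokenized))
print(tf_idf("candidate", tokenized[0], tokenized))
```

Both words have the same term frequency in the first document, so the difference in score comes entirely from the IDF part.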

In [2]:
# Step 1: We will start this project by importing the necessary libraries


In [3]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

In [4]:
# Step 2: Now, let’s read the data into a DataFrame, then get the shape of the data and the first 5 records.

In [5]:
df = pd.read_csv('D:/datasets/news.csv',index_col=0)

In [6]:
df

Unnamed: 0,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL
...,...,...,...
4490,State Department says it can't find emails fro...,The State Department told the Republican Natio...,REAL
8062,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,The ‘P’ in PBS Should Stand for ‘Plutocratic’ ...,FAKE
8622,Anti-Trump Protesters Are Tools of the Oligarc...,Anti-Trump Protesters Are Tools of the Oligar...,FAKE
4021,"In Ethiopia, Obama seeks progress on peace, se...","ADDIS ABABA, Ethiopia —President Obama convene...",REAL


In [7]:
## The news dataset has 6335 rows and 3 columns

## Step 3: Let us check for null values, then check the shape of the data

In [8]:
df.isnull().sum()

title    0
text     0
label    0
dtype: int64

#### *From the above it is clear that this dataset does not have any null values.*

In [10]:
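# Only the last expression in a cell is displayed, so df.shape is not shown here;
# wrap it in print() to see it
df.shape
df.head()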

Unnamed: 0,title,text,label
8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [23]:
## Step 4: We will extract the labels from the dataset

# Get the labels
labels=df.label
labels.head()

8476     FAKE
10294    FAKE
3608     REAL
10142    FAKE
875      REAL
Name: label, dtype: object

In [12]:
## Let us split the data set into Training and Testing sets

In [24]:
# Split the dataset
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
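
As a quick sanity check on the split, the sketch below works out how many rows land in each set for 6335 rows and test_size=0.2, and illustrates why a fixed seed makes the split repeatable. It assumes the test share is rounded up and the remainder goes to training, which is my understanding of how train_test_split treats a fractional test_size. The helper `seeded_split` is a hypothetical stand-in and does not reproduce scikit-learn's actual shuffling.

```python
import math
import random

n_rows = 6335        # row count reported for news.csv earlier in the notebook
test_size = 0.2

# Assumption: the test share is rounded up, the rest is used for training
n_test = math.ceil(n_rows * test_size)
n_train = n_rows - n_test
print(n_train, n_test)  # 5068 1267


def seeded_split(items, test_size, seed):
    """Shuffle a copy of `items` with a fixed seed, then cut off the test share."""
    shuffled = list(items)
    random.Random(seed).shuffle(shuffled)
    cut = len(shuffled) - math.ceil(len(shuffled) * test_size)
    return shuffled[:cut], shuffled[cut:]


# The same seed always yields the same partition, which is what random_state is for
train_a, test_a = seeded_split(range(10), 0.2, seed=7)
train_b, test_b = seeded_split(range(10), 0.2, seed=7)
print(train_a == train_b and test_a == test_b)  # True
```

The 1267 test rows agree with the confusion matrix printed later in the notebook, whose four entries (587, 51, 41, 588) sum to 1267.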

#### Let’s initialize a TfidfVectorizer with English stop words and a maximum document frequency of 0.7 (terms that appear in more than 70% of the documents will be discarded). Stop words are the most common words in a language, which are filtered out before the text is processed. A TfidfVectorizer turns a collection of raw documents into a matrix of TF-IDF features.

#### Now, fit and transform the vectorizer on the train set, then use the fitted vectorizer to transform the test set.
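
The following plain-Python sketch illustrates the two ideas in the paragraphs above: filtering by stop words and max_df, and learning the vocabulary from the training set only. It is a simplified stand-in that produces raw counts rather than TF-IDF weights, and it is not how scikit-learn is implemented. The toy corpus, the two-word `STOP_WORDS` set, and the helpers `fit_vocabulary` and `transform` are assumptions made for illustration.

```python
# Tiny made-up corpus (illustrative only)
train_docs = [
    "the senate vote shocked the nation",
    "the aliens vote in secret",
    "the budget vote passed",
    "the budget shocked experts",
]
test_docs = ["the unicorn vote shocked experts"]

STOP_WORDS = {"the", "in"}   # stand-in for a real English stop-word list
MAX_DF = 0.7                 # same threshold as in the notebook


def tokenize(doc):
    """Split on whitespace and drop stop words."""
    return [word for word in doc.split() if word not in STOP_WORDS]


def fit_vocabulary(docs, max_df):
    """Learn the vocabulary, dropping terms found in more than max_df of the docs."""
    doc_freq = {}
    for doc in docs:
        for term in set(tokenize(doc)):
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / len(docs) <= max_df)


def transform(docs, vocabulary):
    """Count only terms from the fitted vocabulary; unseen terms are ignored."""
    return [[tokenize(doc).count(term) for term in vocabulary] for doc in docs]


vocabulary = fit_vocabulary(train_docs, MAX_DF)   # "fit" sees the training set only
print(vocabulary)
print(transform(test_docs, vocabulary))           # "transform" reuses that vocabulary
```

Here "vote" appears in 3 of 4 training documents (0.75 > 0.7) and is dropped, while "unicorn" never appeared in training and therefore contributes nothing to the test vector. Fitting on the training set alone keeps information from the test set out of the features.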

In [25]:
## Step 5 :
# Initialize a TfidfVectorizer
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

In [28]:
# Fit and transform train set, transform test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)

In [16]:
## Step 6:

## Next, we’ll initialize a PassiveAggressiveClassifier and fit it on tfidf_train and y_train.

## Then, we’ll predict on the TF-IDF features of the test set and calculate the accuracy with accuracy_score() from sklearn.metrics.
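
For intuition about what "passive" and "aggressive" mean, here is a minimal sketch of a single update step in the style of the PA-I rule from the passive-aggressive literature. It is a toy for two-dimensional inputs, not scikit-learn's implementation, which may differ in details. The label encoding (FAKE = +1, REAL = -1), the feature vectors, and the helper names `dot` and `pa_update` are assumptions.

```python
def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))


def pa_update(w, x, y, C=1.0):
    """One passive-aggressive step for a label y in {-1, +1}.

    Assumes x is not the zero vector.
    """
    loss = max(0.0, 1.0 - y * dot(w, x))   # hinge loss on this single example
    if loss == 0.0:
        return w                            # passive: margin satisfied, no change
    tau = min(C, loss / dot(x, x))          # aggressive: step size, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]


# Toy feature vectors; assumed encoding: FAKE = +1, REAL = -1
x_fake, x_real = [1.0, 0.0], [0.0, 1.0]

w = [0.0, 0.0]
w = pa_update(w, x_fake, +1)   # loss 1, so the weights move toward x_fake
w = pa_update(w, x_real, -1)   # loss 1, so the weights move away from x_real
before = list(w)
w = pa_update(w, x_fake, +1)   # margin already 1, so nothing changes
print(w, w == before)          # [1.0, -1.0] True
```

The parameter `C` caps how far a single example can move the weights, which limits the influence of noisy or mislabeled samples.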

In [26]:
# Initialize a Passive Aggressive Classifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

# Predict on the test set and calculate accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test,y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.74%


In [None]:
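## We got an accuracy of 92.74% with this model, which we accept. The exact figure can vary slightly between runs because the classifier is not given a random_state.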

In [18]:
## Step 7: Let’s print out a confusion matrix to gain insight into the number of true and false positives and negatives.

In [27]:
##  Build confusion matrix

confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[587,  51],
       [ 41, 588]], dtype=int64)

In [21]:
## From the output above, treating FAKE as the positive class, we have:

## 1) 587 true positives
## 2) 588 true negatives
## 3) 41 false positives
## 4) 51 false negatives
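
The counts can be read straight off the confusion matrix printed above, and the accuracy can be recomputed from them as a cross-check. The sketch below does this in plain Python, assuming the scikit-learn convention that rows are true labels and columns are predictions, and treating FAKE as the positive class. Precision and recall are added here for illustration and are not reported in the original notebook.

```python
# Confusion matrix from the output above; rows are true labels and
# columns are predictions, both ordered ['FAKE', 'REAL']
matrix = [[587, 51],
          [41, 588]]

tp, fn = matrix[0]   # truly FAKE: predicted FAKE / predicted REAL
fp, tn = matrix[1]   # truly REAL: predicted FAKE / predicted REAL

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)   # of the articles flagged FAKE, the share that really are
recall = tp / (tp + fn)      # of the truly FAKE articles, the share that were caught

print(total)                                      # 1267
print(f"Accuracy: {round(accuracy * 100, 2)}%")   # Accuracy: 92.74%
print(f"Precision (FAKE): {precision:.4f}")
print(f"Recall (FAKE): {recall:.4f}")
```

The recomputed accuracy matches the 92.74% printed by accuracy_score() in Step 6.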

## Summary

### We now know how to detect fake news with Python. We took a political dataset, implemented a TfidfVectorizer, initialized a PassiveAggressiveClassifier, and fit our model. We ended up with an accuracy of 92.74%.