## Fake News Detection Using Term Frequency-Inverse Document Frequency (TF-IDF)

##### The news text is first converted into a TF-IDF matrix, which is then used to train a Passive Aggressive Classifier. A word's TF-IDF value is higher the more often the word appears in a given article and the fewer articles in the dataset contain it, and lower in the opposite case.

**Points to be considered**
- Very common words such as "is", "a" and "was" (stop words) occur most frequently but carry little meaning, so we exclude them from the analysis
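
To make the weighting concrete, here is a minimal pure-Python sketch of a TF-IDF computation on a toy corpus. The three sentences, the tiny `STOP_WORDS` set, and the helper name `tfidf` are made up for illustration. The formula follows the smoothed IDF and L2 normalization that scikit-learn documents as the `TfidfVectorizer` defaults, but tokenization is simplified to whitespace splitting, so treat it as a sketch rather than a replica of the library.

```python
import math
from collections import Counter

# Hypothetical toy corpus and stop-word list (not part of the news dataset)
DOCS = [
    "the election is rigged",
    "the election is today",
    "the weather is sunny",
]
STOP_WORDS = {"the", "is", "a", "was"}

def tfidf(docs, stop_words):
    """Return one {term: weight} dict per document."""
    # Lowercase, split on whitespace, and drop stop words
    tokenized = [[w for w in d.lower().split() if w not in stop_words]
                 for d in docs]
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    rows = []
    for tokens in tokenized:
        tf = Counter(tokens)
        # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
        raw = {t: c * (math.log((1 + n_docs) / (1 + df[t])) + 1)
               for t, c in tf.items()}
        # L2-normalize each row; fall back to 1.0 for an empty document
        norm = math.sqrt(sum(v * v for v in raw.values())) or 1.0
        rows.append({t: v / norm for t, v in raw.items()})
    return rows

weights = tfidf(DOCS, STOP_WORDS)
print(weights[0])
```

In the first sentence, "rigged" receives a larger weight than "election" because it appears in only one of the three documents, while the stop words do not appear in the result at all.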

#### Let's start by importing the required libraries

In [3]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.linear_model import PassiveAggressiveClassifier as PAC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix 

#### Let's load the dataset now

In [5]:
data = pd.read_csv('news.csv')
data.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [15]:
data.columns = ['noOfWords', 'title', 'text', 'label']
data.shape

(6335, 4)

#### Splitting the dataset into training and test sets

In [14]:
x_train, x_test, y_train, y_test = train_test_split(data['text'], data['label'], test_size=0.25, random_state=7)

#### Initialize the TF-IDF Vectorizer

In [19]:
tfidf_vec = TfidfVectorizer(stop_words='english', max_df=0.7)
x_train_tfidf = tfidf_vec.fit_transform(x_train)
x_test_tfidf = tfidf_vec.transform(x_test)

#### The news text after transformation into TF-IDF matrices

In [25]:
x_train_tfidf

<4751x59709 sparse matrix of type '<class 'numpy.float64'>'
	with 1254643 stored elements in Compressed Sparse Row format>

In [27]:
x_test_tfidf

<1584x59709 sparse matrix of type '<class 'numpy.float64'>'
	with 402277 stored elements in Compressed Sparse Row format>
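
The printed shapes show why the result is stored as a sparse matrix. The sketch below only redoes the arithmetic on the numbers that appear in the outputs above; the variable names are illustrative and nothing here touches the actual data.

```python
# Numbers copied from the notebook outputs above
train_rows, test_rows, n_terms = 4751, 1584, 59709
train_nonzero, test_nonzero = 1254643, 402277

# The two splits together should account for every row in the dataset
total_rows = train_rows + test_rows
# Share of rows held out for testing (test_size=0.25 was requested)
test_fraction = test_rows / total_rows

# Density: fraction of matrix cells that hold a non-zero TF-IDF value
train_density = train_nonzero / (train_rows * n_terms)
test_density = test_nonzero / (test_rows * n_terms)

print(total_rows, round(test_fraction, 4))
print(f"train density: {train_density:.4%}, test density: {test_density:.4%}")
```

Both matrices have the same number of columns because the vectorizer is fitted on the training text only and then reused to transform the test text. Well under 1% of the cells are non-zero, which is the situation the Compressed Sparse Row format is designed for.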

#### Initializing the Passive Aggressive Classifier object

In [42]:
pac = PAC(max_iter=100)
pac.fit(x_train_tfidf, y_train)

PassiveAggressiveClassifier(max_iter=100)

In [43]:
y_pred = pac.predict(x_test_tfidf)
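
The classifier's name describes its update rule: it stays passive when an example is classified correctly with enough margin, and updates aggressively when it is not. The sketch below is a from-scratch illustration of the PA-I variant with hinge loss, not scikit-learn's implementation. The function name `pa_update`, the aggressiveness parameter `C`, the toy vectors, and the +1/-1 label encoding are all assumptions made for the example.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for weights w, features x, and label y in {+1, -1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)          # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(w)                     # passive: no change needed
    tau = min(C, loss / norm_sq)           # aggressive: capped step size
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

# Hypothetical two-feature example
w = [0.0, 0.0]
w = pa_update(w, [1.0, 0.0], +1)   # misclassified, so the weights move
first = list(w)
w = pa_update(w, [1.0, 0.0], +1)   # margin is now satisfied, so nothing changes
second = list(w)
print(first, second)
```

The cap `C` limits how far a single example can move the weights, which keeps one noisy article from dominating the model.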

#### Observing the results using the evaluation metrics

1. Accuracy score
2. Confusion matrix

In [44]:
accuracy_score(y_test, y_pred)

0.9286616161616161

In [38]:
confusion_matrix(y_test, y_pred)

array([[752,  56],
       [ 62, 714]], dtype=int64)
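
The confusion matrix can be read by hand. The sketch below recomputes accuracy, plus precision and recall for the FAKE class, from the four printed counts; the variable names are illustrative. It relies on scikit-learn ordering string labels alphabetically, so row and column 0 correspond to FAKE and row and column 1 to REAL, with rows as true labels and columns as predictions.

```python
# Confusion matrix copied from the output above.
# Rows are true labels, columns are predicted labels (0 = FAKE, 1 = REAL).
cm = [[752, 56],
      [62, 714]]

total = sum(sum(row) for row in cm)
correct = cm[0][0] + cm[1][1]          # diagonal: correctly classified
accuracy = correct / total

# Of everything predicted FAKE, how much was truly FAKE
precision_fake = cm[0][0] / (cm[0][0] + cm[1][0])
# Of everything truly FAKE, how much was predicted FAKE
recall_fake = cm[0][0] / (cm[0][0] + cm[0][1])

print(total, correct)
print(round(accuracy, 4), round(precision_fake, 4), round(recall_fake, 4))
```

This matrix implies 1466 correct predictions out of 1584, an accuracy of about 0.9255, which is slightly lower than the 0.9287 printed by `accuracy_score`. A plausible explanation is that the two cells were run against different fits: the cell numbers are out of order, and the classifier was created without a fixed `random_state`, so repeated fits can give slightly different results.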