# Fake News 

## Imports

In [1]:
# Necessary imports
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

## Vocabulary

This project uses a few terms that may be unfamiliar, so here is a short explanation of each:
 - Term Frequency (TF): the number of times a word appears in a document.
 - Inverse Document Frequency (IDF): a measure of how rare a term is across the entire corpus of documents. A term that appears in many documents gets a low IDF, and a term that appears in only a few gets a high one.
 - TfidfVectorizer: a scikit-learn class that converts a collection of raw documents into a matrix of TF-IDF features.
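
To make these definitions concrete, here is a minimal sketch that computes TF-IDF weights by hand on a toy corpus. The three sentences are made up for illustration and do not come from news.csv. The IDF formula is the smoothed one that TfidfVectorizer uses by default (`smooth_idf=True`); the row normalization that TfidfVectorizer also applies is left out to keep the arithmetic readable.

```python
import math
from collections import Counter

# Toy corpus (hypothetical, not taken from news.csv)
toy_docs = [
    "the senator gave a speech",
    "the senator denied the report",
    "aliens wrote the report",
]
tokenized = [doc.split() for doc in toy_docs]
n_docs = len(tokenized)

# Document frequency: in how many documents each term appears
doc_freq = Counter(term for tokens in tokenized for term in set(tokens))

def idf(term):
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    return math.log((1 + n_docs) / (1 + doc_freq[term])) + 1

def tfidf(tokens):
    # TF is the raw count of each term in the document
    tf = Counter(tokens)
    return {term: count * idf(term) for term, count in tf.items()}

# Weights for the third document, highest first
weights = tfidf(tokenized[2])
for term, weight in sorted(weights.items(), key=lambda kv: -kv[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the third document, "the" appears in every document of the corpus and gets the lowest weight, while "aliens" appears in only one and gets the highest.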

## Beginning 

First, we read the dataset into a pandas DataFrame.

In [2]:
df = pd.read_csv('news.csv')

In [3]:
df.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


Now let's split the dataset into training and testing sets, holding out 20% of the articles for testing. The fixed random_state makes the split reproducible.

In [5]:
labels = df.label
X_train, X_test, Y_train, Y_test = train_test_split(df['text'], labels, test_size=0.2, random_state=1)
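
As a rough illustration of what `test_size` and `random_state` control, the sketch below shuffles indices with a fixed seed and sets aside a share of them for testing. It is only a sketch of the idea, not scikit-learn's implementation; the function name `simple_split` and the placeholder articles are hypothetical.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=1):
    """Shuffle indices with a fixed seed, then reserve a test_size share for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)      # same seed, same order every run
    n_test = math.ceil(test_size * len(items))
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

toy_articles = [f"article {i}" for i in range(10)]  # placeholder data
toy_train, toy_test = simple_split(toy_articles)
print(len(toy_train), len(toy_test))
```

Calling `simple_split` again with the same seed returns the same split, which is the property that `random_state=1` gives us above.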

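Next, we turn the documents into a matrix of TF-IDF features with TfidfVectorizer. We also remove English stop words, which are very common words such as "the", "a", "is", and "are" that carry little information about a document's content. Setting max_df=0.7 additionally ignores any term that appears in more than 70% of the documents. The vectorizer is fitted on the training set only and then reused to transform the test set, so no information from the test set leaks into the vocabulary.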

In [7]:
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)
tfidf_train = tfidf_vectorizer.fit_transform(X_train)
tfidf_test = tfidf_vectorizer.transform(X_test)
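
The effect of `stop_words` and `max_df` on the vocabulary can be sketched in plain Python. The four headlines and the tiny stop-word set below are made up for illustration; the real English stop-word list in scikit-learn is much longer.

```python
from collections import Counter

# Hypothetical headlines, not taken from news.csv
toy_headlines = [
    "the election results are final",
    "the election is rigged",
    "a new election poll",
    "the senator denied a report",
]
toy_stop_words = {"the", "a", "is", "are"}  # tiny stand-in for the English list
max_df = 0.7

# Step 1: drop stop words from every document
tokenized = [
    [tok for tok in doc.split() if tok not in toy_stop_words]
    for doc in toy_headlines
]

# Step 2: count in how many documents each remaining term appears
doc_freq = Counter(tok for tokens in tokenized for tok in set(tokens))

# Step 3: keep only terms whose document share does not exceed max_df
n_docs = len(toy_headlines)
vocabulary = sorted(t for t, c in doc_freq.items() if c / n_docs <= max_df)
print(vocabulary)
```

Here "election" appears in 3 of 4 headlines (75%), so it is dropped along with the stop words, and only the rarer, more distinctive terms remain.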

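After transforming our data, we want to build a predictive model. We will train a PassiveAggressiveClassifier, an online linear classifier that leaves its weights unchanged when an example is classified correctly with a sufficient margin (passive) and corrects them when it is not (aggressive).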

In [18]:
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, Y_train)

PassiveAggressiveClassifier(max_iter=50)
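
To give an idea of what happens during training, here is a minimal sketch of a single passive-aggressive update for a binary problem. It follows the PA-I rule from the passive-aggressive literature, which to our understanding corresponds to the classifier's default hinge loss. The function name `pa_update`, the encoding of the two classes as -1 and +1, and the two-dimensional vectors are assumptions made for illustration; the real classifier works on the sparse TF-IDF matrix.

```python
def pa_update(w, x, y, C=1.0):
    """One PA-I step for a feature vector x and a label y in {-1, +1}."""
    margin = y * sum(wi * xi for wi, xi in zip(w, x))
    loss = max(0.0, 1.0 - margin)            # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return w                             # passive: nothing to correct
    tau = min(C, loss / norm_sq)             # aggressive step, capped by C
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w0 = [0.0, 0.0]
w1 = pa_update(w0, [1.0, 0.0], +1)   # wrong side of the margin: weights move
w2 = pa_update(w1, [2.0, 0.0], +1)   # margin already large enough: no change
w3 = pa_update(w1, [1.0, 1.0], -1)   # misclassified: weights are corrected
print(w1, w2, w3)
```

The second call leaves the weights untouched, which is the passive part; the first and third calls show the aggressive correction.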

Now, let's see how our model performs on the test data.

In [21]:
y_pred = pac.predict(tfidf_test)
score = accuracy_score(Y_test, y_pred)
print("Accuracy : " + str(score*100) + "%")

Accuracy : 94.47513812154696%


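The model reaches about 94.5% accuracy on the held-out test set, which is a strong result for such a simple pipeline. Because the classifier is created without a random_state, the exact figure can vary slightly from run to run.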
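
Accuracy alone does not show which kind of mistake the model makes: labelling a real article as fake, or a fake one as real. A confusion matrix breaks the predictions down by class. The sketch below builds one by hand on made-up labels, so the numbers are not results from this notebook.

```python
from collections import Counter

# Hypothetical labels for illustration only
toy_true = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE"]
toy_pred = ["FAKE", "REAL", "REAL", "REAL", "FAKE", "FAKE"]
classes = ["FAKE", "REAL"]

# Count every (true label, predicted label) pair
pair_counts = Counter(zip(toy_true, toy_pred))

# Rows are true labels, columns are predicted labels
matrix = [[pair_counts[(t, p)] for p in classes] for t in classes]
for label, row in zip(classes, matrix):
    print(label, row)

# The diagonal holds the correct predictions
toy_accuracy = sum(matrix[i][i] for i in range(len(classes))) / len(toy_true)
print(f"Accuracy : {toy_accuracy:.2f}")
```

On the notebook's own data, `confusion_matrix(Y_test, y_pred, labels=['FAKE', 'REAL'])`, which is already imported above, returns a matrix with the same layout.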