## Detecting Fake News with Python & Machine Learning

- This ML model aims to separate fake news from real news on social media and other online platforms.
- We use a TfidfVectorizer to convert raw documents into a matrix of TF-IDF features.
- TF (term frequency): the number of times a word appears in a document.
- IDF (inverse document frequency): measures how significant a term is across the entire corpus; terms that occur in many documents receive a lower weight.
- PassiveAggressiveClassifier: remains passive when a classification is correct and turns aggressive, updating its weights, when a misclassification occurs.
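
As a rough illustration of the two quantities defined above, the sketch below computes textbook TF-IDF by hand on a made-up three-document corpus. The corpus, the helper name `tf_idf`, and the unsmoothed `log(N / df)` form of IDF are assumptions for illustration; `TfidfVectorizer` by default applies a smoothed IDF and normalizes each row, so its numbers will differ.

```python
import math

# Hypothetical toy corpus, already lowercased and tokenized by whitespace
corpus = [
    "the senate passed the bill",
    "the shocking secret the media hides",
    "the bill faces a vote",
]
docs = [doc.split() for doc in corpus]

def tf_idf(term, doc_tokens, all_docs):
    """Textbook TF-IDF: raw count in the document times log(N / document frequency)."""
    tf = doc_tokens.count(term)
    df = sum(1 for tokens in all_docs if term in tokens)
    if df == 0:
        return 0.0
    return tf * math.log(len(all_docs) / df)

# "the" occurs in every document, so its IDF (and TF-IDF) is zero
score_the = tf_idf("the", docs[0], docs)
# "bill" occurs in two of the three documents, so it gets a modest weight
score_bill = tf_idf("bill", docs[0], docs)
# "senate" occurs in only one document, so it gets the highest weight
score_senate = tf_idf("senate", docs[0], docs)

print(score_the, score_bill, score_senate)
```

Words that show up everywhere carry no signal for telling documents apart, which is why the weighting pushes them toward zero.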

### 1. Installing Libraries

In [1]:
# The PyPI package is named scikit-learn; "sklearn" on PyPI is only a deprecated placeholder
pip install numpy pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [2]:
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix



### 2. Importing Data into a Dataframe

In [6]:
df = pd.read_csv('news.csv')

### 3. Preliminary analysis: first five rows and shape

In [7]:
df.head()

| | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [9]:
df.shape

(6335, 4)

#### 3.1 Extracting the labels from the dataset

In [12]:
labels = df.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

#### 3.2 Splitting the data into training and testing datasets

In [13]:
x_train,x_test,y_train,y_test=train_test_split(df['text'], labels, test_size=0.2, random_state=7)
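
To make the effect of `test_size=0.2` concrete, here is a minimal pure-Python sketch of a seeded 80/20 shuffle split. The helper `simple_split` is hypothetical and does not reproduce the exact shuffling that `train_test_split` performs; it only shows the idea of holding out a fixed fraction of rows with a reproducible seed.

```python
import math
import random

def simple_split(items, test_size=0.2, seed=7):
    """Shuffle indices with a fixed seed and hold out a fraction for testing."""
    indices = list(range(len(items)))
    random.Random(seed).shuffle(indices)
    n_test = math.ceil(len(items) * test_size)
    test_idx, train_idx = indices[:n_test], indices[n_test:]
    return [items[i] for i in train_idx], [items[i] for i in test_idx]

# Stand-in for the 6335 rows of the dataset
rows = list(range(6335))
train_rows, test_rows = simple_split(rows)
print(len(train_rows), len(test_rows))
```

Fixing the seed means the same rows land in the test set on every run, so accuracy figures from different runs stay comparable.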

### 4. Initializing TfidfVectorizer with a max_df of 0.7

In [14]:
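# Ignore English stop words and discard terms that appear in more than 70% of the documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Learn the vocabulary and IDF weights on the training set only, then reuse them on the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)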
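
The `max_df=0.7` cutoff can be pictured with a small sketch: count in what fraction of documents each term occurs and drop the terms above the threshold. The toy documents and the helper `filter_by_max_df` are made up for illustration, and the real vectorizer also handles tokenization and stop words.

```python
def filter_by_max_df(tokenized_docs, max_df=0.7):
    """Keep terms whose document-frequency fraction is at most max_df."""
    n_docs = len(tokenized_docs)
    doc_freq = {}
    for tokens in tokenized_docs:
        for term in set(tokens):  # count each term once per document
            doc_freq[term] = doc_freq.get(term, 0) + 1
    return sorted(t for t, df in doc_freq.items() if df / n_docs <= max_df)

# Hypothetical toy documents
toy_docs = [
    "election results announced today".split(),
    "election fraud claims spread online".split(),
    "election officials certify results".split(),
    "celebrity rumor spreads online".split(),
]

# "election" appears in 3 of 4 documents (0.75 > 0.7), so it is dropped
vocabulary = filter_by_max_df(toy_docs)
print(vocabulary)
```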

#### 4.1 Passive Aggressive Classifier

In [15]:
#initializing a passive aggressive classifier
pac=PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train, y_train)

#predict the test set and calculate the accuracy
y_pred=pac.predict(tfidf_test)
score=accuracy_score(y_test, y_pred)
print(f'Accuracy: {round(score*100,2)}%')

Accuracy: 92.82%
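
The "passive" and "aggressive" behaviour can be sketched with the basic Passive-Aggressive update rule for labels in {-1, +1}: if the hinge loss is zero the weights stay put, otherwise they move just far enough to put the sample at margin 1. This is a simplified version without the regularization parameter `C` that scikit-learn's classifier applies, and the function name `pa_update` and the toy vectors are assumptions.

```python
def pa_update(weights, x, y):
    """One basic Passive-Aggressive step for a label y in {-1, +1}."""
    margin = y * sum(w * xi for w, xi in zip(weights, x))
    loss = max(0.0, 1.0 - margin)  # hinge loss
    norm_sq = sum(xi * xi for xi in x)
    if loss == 0.0 or norm_sq == 0.0:
        return list(weights)       # passive: margin is already at least 1
    tau = loss / norm_sq           # aggressive: smallest step that reaches margin 1
    return [w + tau * y * xi for w, xi in zip(weights, x)]

# Hypothetical two-feature sample with label -1
x, y = [1.0, 1.0], -1

w_start = [0.0, 0.0]
w_after = pa_update(w_start, x, y)   # loss is 1, so the weights change
w_again = pa_update(w_after, x, y)   # margin is now 1, so nothing changes

print(w_after, w_again)
```

Because each update depends only on the current sample, the classifier suits streams of text where new articles keep arriving.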


#### 4.2 Confusion Matrix to gain insight into true and false positives and negatives


In [16]:
confusion_matrix(y_test, y_pred, labels=['FAKE', 'REAL'])

array([[589,  49],
       [ 42, 587]], dtype=int64)
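
The matrix above can be unpacked into the usual summary metrics. The sketch below does the arithmetic by hand and assumes FAKE is treated as the positive class, with rows as true labels and columns as predicted labels in the order `['FAKE', 'REAL']`.

```python
# Confusion matrix copied from the output above
# Rows = true label, columns = predicted label, order ['FAKE', 'REAL']
matrix = [[589, 49],
          [42, 587]]

# Assumption: FAKE is the positive class
tp, fn = matrix[0]
fp, tn = matrix[1]

total = tp + fn + fp + tn
accuracy = (tp + tn) / total
precision = tp / (tp + fp)  # of the articles flagged FAKE, the share that were fake
recall = tp / (tp + fn)     # of the truly fake articles, the share that were caught
f1 = 2 * precision * recall / (precision + recall)

print(f'Accuracy:  {accuracy:.4f}')
print(f'Precision: {precision:.4f}')
print(f'Recall:    {recall:.4f}')
print(f'F1 score:  {f1:.4f}')
```

The accuracy computed this way matches the 92.82% reported by `accuracy_score`.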

### 5. Conclusion

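- Treating FAKE as the positive class, the model produced 589 true positives and 49 false negatives on the 1,267 test articles.
- It also produced 42 false positives and 587 true negatives, which is consistent with the reported accuracy of 92.82%.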