
##Introduction
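In my first ever Machine Learning project, I will be training a model to detect fake news. Using the Scikit-Learn library in Python, I will use TfidfVectorizer, which transforms text into numeric vectors with one element (feature) per vocabulary term. TF-IDF stands for term frequency-inverse document frequency. It is a weight intended to reflect how important a word is to a document: it grows with how often the word appears in that document and shrinks as the word appears in more of the other documents.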
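To make the idea of a TF-IDF weight concrete, here is a minimal hand-rolled sketch on a made-up three-sentence corpus (not taken from news.csv). It uses the smoothed form idf = ln((1 + n) / (1 + df)) + 1, which scikit-learn documents as its default. The helper names `document_frequency` and `tfidf` are hypothetical, and the sketch leaves out the per-document normalization that TfidfVectorizer also applies, so the numbers will not match the library exactly.

```python
import math

# Toy corpus, made up for illustration (not taken from news.csv)
docs = [
    "election fraud claims spread online",
    "election results certified by officials",
    "officials deny fraud claims",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

def document_frequency(term):
    """Number of documents that contain the term at least once."""
    return sum(term in tokens for tokens in tokenized)

def tfidf(term, tokens):
    """Raw term count times a smoothed inverse document frequency."""
    tf = tokens.count(term)
    idf = math.log((1 + n_docs) / (1 + document_frequency(term))) + 1
    return tf * idf

# "election" appears in 2 of the 3 documents and "spread" in only 1,
# so "spread" gets the larger weight within the first document.
weight_election = tfidf("election", tokenized[0])
weight_spread = tfidf("spread", tokenized[0])
print(weight_election, weight_spread)
```

Both words occur once in the first sentence, so the difference in weight comes entirely from how rare each word is across the corpus.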

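Then, I will use a Passive Aggressive Classifier, an online learning algorithm that stays passive when an article is classified correctly with enough margin and updates its weights aggressively when it is not. It does not require a learning rate, because the size of each update is computed from the size of the mistake.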
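As a rough illustration of why no learning rate is needed, here is a minimal sketch of the basic Passive-Aggressive update for a two-class problem with labels -1 and +1. This is not scikit-learn's implementation: the library version also caps the step size with a regularization parameter, which is left out here. The function names `dot` and `pa_update` and the feature values are hypothetical.

```python
def dot(u, v):
    """Dot product of two equal-length lists."""
    return sum(a * b for a, b in zip(u, v))

def pa_update(w, x, y):
    """One basic Passive-Aggressive step for a label y in {-1, +1}."""
    loss = max(0.0, 1.0 - y * dot(w, x))  # hinge loss on this example
    if loss == 0.0:
        return w  # passive: the margin is already satisfied
    tau = loss / dot(x, x)  # aggressive: step size derived from the loss
    return [wi + tau * y * xi for wi, xi in zip(w, x)]

w = [0.0, 0.0]   # initial weights
x = [1.0, 1.0]   # hypothetical feature vector
y = 1            # hypothetical label (+1 standing in for REAL)

w = pa_update(w, x, y)        # first pass: the weights move
w_again = pa_update(w, x, y)  # second pass: no further change
print(w, w_again)
```

The step size `tau` is recomputed for every example from the loss and the length of the feature vector, so there is no fixed learning rate to tune.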

Because the dataset comes with labels that the model learns from, this project is a supervised learning task.


In [0]:
#import necessary libraries
import numpy as np
import pandas as pd
import itertools
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix

##Uploading the Dataset

In [3]:
from google.colab import files
uploaded = files.upload()

Saving news.csv to news.csv


Let's read the data and get its shape.

In [9]:
data = pd.read_csv('news.csv')
data.head()

,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [10]:
data.shape

(6335, 4)

##Getting the Labels of the Data

In [13]:
labels = data.label
labels.head()

0    FAKE
1    FAKE
2    REAL
3    FAKE
4    REAL
Name: label, dtype: object

##Splitting the Dataset Into Training and Testing Sets



In [0]:
x_train,x_test,y_train,y_test=train_test_split(data['text'], labels, test_size=0.2, random_state=7)

##Initializing a TfidfVectorizer That Filters Out Commonly Used Words (aka Stop Words)

In [0]:
tfidf_vectorizer=TfidfVectorizer(stop_words='english', max_df=0.7)

In [0]:
#Fit and transform the train set, and transform the test set
tfidf_train=tfidf_vectorizer.fit_transform(x_train) 
tfidf_test=tfidf_vectorizer.transform(x_test)

English stop words (commonly used words such as "the" and "and"), as well as terms that appear in more than 70% of the documents (max_df=0.7), will be filtered out.
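The effect of a maximum document frequency can be sketched by hand on a made-up four-headline corpus (not taken from news.csv). This is only an illustration of the filtering rule, not the vectorizer's actual code. The variable names are hypothetical.

```python
# Toy corpus, made up for illustration (not taken from news.csv)
docs = [
    "senate vote delayed again",
    "senate vote passes budget",
    "senate hearing on budget",
    "senate vote scheduled friday",
]
tokenized = [set(doc.split()) for doc in docs]
vocabulary = sorted(set().union(*tokenized))

max_df = 0.7  # same threshold as in the notebook

# Fraction of documents in which each term appears
doc_fraction = {
    term: sum(term in tokens for tokens in tokenized) / len(tokenized)
    for term in vocabulary
}

# Terms above the threshold are dropped, the rest are kept
kept = [t for t in vocabulary if doc_fraction[t] <= max_df]
dropped = [t for t in vocabulary if doc_fraction[t] > max_df]
print("dropped:", dropped)
```

Here "senate" appears in every headline and "vote" in three of the four, so both exceed the 0.7 threshold and are dropped, while "budget" (two of four) is kept.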

##Initializing a Passive Aggressive Classifier

In [25]:
#Initialize a Passive Aggressive Classifier and fit it on the training set
pac = PassiveAggressiveClassifier(max_iter=50)
pac.fit(tfidf_train,y_train)

#Predict on the test set and then calculate the accuracy
y_pred = pac.predict(tfidf_test)
score = accuracy_score(y_test,y_pred)
print(f'Accuracy Score: {round(score*100,2)}%')

Accuracy Score: 92.74%


The accuracy score is the fraction of articles in the test set whose label (FAKE or REAL) the model predicted correctly.

## Making a Confusion Matrix 
A confusion matrix tells me how many fake news articles the model correctly predicted to be fake, how many fake articles it incorrectly predicted to be real, how many real articles it incorrectly predicted to be fake, and how many real articles it correctly predicted to be real.



In [27]:
#Creating a Confusion Matrix
#Rows are the true labels and columns are the predicted labels, both in the order FAKE, REAL
confusion_matrix(y_test,y_pred, labels=['FAKE','REAL'])

array([[589,  49],
       [ 43, 586]])

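Each row is a true label and each column is a predicted label. The model correctly predicted 589 fake articles to be fake, incorrectly predicted 49 fake articles to be real, incorrectly predicted 43 real articles to be fake, and correctly predicted 586 real articles to be real.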
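As a quick check, the accuracy score can be recomputed from the four counts in the matrix. The sketch below copies the numbers from the output above and uses plain Python. The variable names are hypothetical.

```python
# Confusion matrix copied from the notebook output. Rows are true labels,
# columns are predicted labels, both ordered ['FAKE', 'REAL'].
cm = [[589, 49],
      [43, 586]]

fake_as_fake, fake_as_real = cm[0]
real_as_fake, real_as_real = cm[1]

total = sum(sum(row) for row in cm)    # every test article
correct = fake_as_fake + real_as_real  # the diagonal of the matrix
wrong = fake_as_real + real_as_fake    # the off-diagonal entries
accuracy = correct / total

print(f"total={total}, correct={correct}, wrong={wrong}")
print(f"accuracy={round(accuracy * 100, 2)}%")
```

The diagonal entries are the correct predictions, so dividing their sum by the total number of test articles reproduces the accuracy reported earlier.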

## Conclusion
In this machine learning project, I already knew exactly which news articles were real or fake because of the labels in the data. However, the computer itself did not know that. Thus, I essentially trained the computer to detect real and fake news, giving it labeled data to learn from. This technique in machine learning is called supervised learning. The computer ended up being very accurate in detecting real and fake news. When it classified the articles in the test set, it got only 7.26% of them wrong. That's only 92 out of 1,267 news articles it got wrong! Pretty impressive!!!

As always, I hope that you found this project fascinating!

Emini Offutt, High School Junior

02 April 2020