Inspiration from [here](https://www.kaggle.com/vanshjatana/text-classification-from-scratch) and [here](https://data-flair.training/blogs/advanced-python-project-detecting-fake-news/)

Dataset from [here](https://www.kaggle.com/clmentbisaillon/fake-and-real-news-dataset)

The dataset is quite usable without much cleaning. The aim is to use different ML models to detect fake news.

**What is a TfidfVectorizer?**

TF (Term Frequency): The number of times a word appears in a document is its Term Frequency. A higher value means the term appears more often in that document, so the document is a better match when the term is one of the search terms.

IDF (Inverse Document Frequency): Words that occur many times in a document, but also occur many times in many other documents, may be irrelevant. IDF measures how significant a term is across the entire corpus: the more documents contain a term, the lower its IDF.

The TfidfVectorizer converts a collection of raw documents into a matrix of TF-IDF features.
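As a rough illustration of the TF and IDF ideas above, the sketch below runs TfidfVectorizer on a tiny made-up corpus (the three sentences are assumptions for illustration, not taken from the dataset). With scikit-learn's default settings, a word that appears in every document ("the") should receive a lower IDF weight than a word that appears in only one ("aliens").

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus, not part of the fake/true news dataset
docs = [
    "the senate passed the budget",
    "the president signed the budget",
    "aliens endorsed the president",
]

vectorizer = TfidfVectorizer()
tfidf = vectorizer.fit_transform(docs)  # sparse matrix: one row per document

vocab = vectorizer.vocabulary_  # maps each word to its column index
idf = vectorizer.idf_           # learned IDF weight for each column

print(tfidf.shape)
print("idf('the')    =", idf[vocab["the"]])
print("idf('aliens') =", idf[vocab["aliens"]])
```

Each row of the matrix is one document and each column is one vocabulary word, so the shape is (number of documents, vocabulary size).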

In [24]:
# Import Libraries
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfTransformer

## Classification
from sklearn.linear_model import LogisticRegression
from sklearn.svm import LinearSVC
from sklearn.naive_bayes import MultinomialNB

## Tree-based, nearest-neighbor and ensemble classifiers
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import RandomForestClassifier

true = pd.read_csv("true.csv")
true.head()

Unnamed: 0,title,text,subject,date
0,"As U.S. budget fight looms, Republicans flip t...",WASHINGTON (Reuters) - The head of a conservat...,politicsNews,"December 31, 2017"
1,U.S. military to accept transgender recruits o...,WASHINGTON (Reuters) - Transgender people will...,politicsNews,"December 29, 2017"
2,Senior U.S. Republican senator: 'Let Mr. Muell...,WASHINGTON (Reuters) - The special counsel inv...,politicsNews,"December 31, 2017"
3,FBI Russia probe helped by Australian diplomat...,WASHINGTON (Reuters) - Trump campaign adviser ...,politicsNews,"December 30, 2017"
4,Trump wants Postal Service to charge 'much mor...,SEATTLE/WASHINGTON (Reuters) - President Donal...,politicsNews,"December 29, 2017"


In [2]:
fake = pd.read_csv("fake.csv")
fake.head()

Unnamed: 0,title,text,subject,date
0,Donald Trump Sends Out Embarrassing New Year’...,Donald Trump just couldn t wish all Americans ...,News,"December 31, 2017"
1,Drunk Bragging Trump Staffer Started Russian ...,House Intelligence Committee Chairman Devin Nu...,News,"December 31, 2017"
2,Sheriff David Clarke Becomes An Internet Joke...,"On Friday, it was revealed that former Milwauk...",News,"December 30, 2017"
3,Trump Is So Obsessed He Even Has Obama’s Name...,"On Christmas day, Donald Trump announced that ...",News,"December 29, 2017"
4,Pope Francis Just Called Out Donald Trump Dur...,Pope Francis used his annual Christmas Day mes...,News,"December 25, 2017"


In [6]:
# Create a column with label for both DFs
fake['label'] = 'fake'
true['label'] = 'true'

# Combine both DFs
## Drop the index. Read more:
## df.reset_index(drop=True) drops the current index of the Dataframe
## and replaces it with an index of increasing integers. It never drops columns.

news = pd.concat([fake, true]).reset_index(drop=True)
news.tail(10)

Unnamed: 0,title,text,subject,date,label
44888,"Mata Pires, owner of embattled Brazil builder ...","SAO PAULO (Reuters) - Cesar Mata Pires, the ow...",worldnews,"August 22, 2017",True
44889,"U.S., North Korea clash at U.N. forum over nuc...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",True
44890,"U.S., North Korea clash at U.N. arms forum on ...",GENEVA (Reuters) - North Korea and the United ...,worldnews,"August 22, 2017",True
44891,Headless torso could belong to submarine journ...,COPENHAGEN (Reuters) - Danish police said on T...,worldnews,"August 22, 2017",True
44892,North Korea shipments to Syria chemical arms a...,UNITED NATIONS (Reuters) - Two North Korean sh...,worldnews,"August 21, 2017",True
44893,'Fully committed' NATO backs new U.S. approach...,BRUSSELS (Reuters) - NATO allies on Tuesday we...,worldnews,"August 22, 2017",True
44894,LexisNexis withdrew two products from Chinese ...,"LONDON (Reuters) - LexisNexis, a provider of l...",worldnews,"August 22, 2017",True
44895,Minsk cultural hub becomes haven from authorities,MINSK (Reuters) - In the shadow of disused Sov...,worldnews,"August 22, 2017",True
44896,Vatican upbeat on possibility of Pope Francis ...,MOSCOW (Reuters) - Vatican Secretary of State ...,worldnews,"August 22, 2017",True
44897,Indonesia to buy $1.14 billion worth of Russia...,JAKARTA (Reuters) - Indonesia will buy 11 Sukh...,worldnews,"August 22, 2017",True
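To see why the combining step uses reset_index(drop=True), here is a small sketch with two made-up DataFrames (the titles are placeholders, not rows from the dataset). Without the reset, the concatenated frame keeps each source frame's own index, so index values repeat.

```python
import pandas as pd

# Hypothetical miniature versions of the two DataFrames
fake = pd.DataFrame({"title": ["a", "b"], "label": "fake"})
true = pd.DataFrame({"title": ["c", "d"], "label": "true"})

stacked = pd.concat([fake, true])
print(stacked.index.tolist())  # each frame keeps its own 0..n-1 index

news = stacked.reset_index(drop=True)
print(news.index.tolist())     # one continuous integer index
print(news.columns.tolist())   # columns are untouched
```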


## Prepare Text Data for ML 

Read original [here](https://machinelearningmastery.com/prepare-text-data-machine-learning-scikit-learn/)

Text data requires special preparation before you can start using it for predictive modeling.

The text must first be parsed into individual words, called **tokenization**. Then the words need to be encoded as integers or floating point values for use as input to a machine learning algorithm, called feature extraction (or **vectorization**).

### Bag-of-Words Model
We cannot work with text directly when using machine learning algorithms.

Instead, we need to convert the text to numbers.

We may want to perform classification of documents, so each document is an **“input”** and a class label is the **“output”** (e.g. fake or true news) for our predictive algorithm. Algorithms take vectors of numbers as input, so we need to convert documents to fixed-length vectors of numbers.

A simple and effective model for thinking about text documents in machine learning is called the Bag-of-Words Model, or BoW.

The model is simple in that it throws away all of the order information in the words and focuses on the occurrence of words in a document.

This can be done by assigning each word a unique number. Then any document we see can be encoded as a fixed-length vector with the length of the vocabulary of known words. The value in each position in the vector could be filled with a count or frequency of each word in the encoded document.

This is the bag of words model, where we are only concerned with encoding schemes that represent what words are present or the degree to which they are present in encoded documents without any information about order.
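The encoding described above can be sketched by hand in a few lines of plain Python. This is only a minimal illustration on two made-up sentences, with a hypothetical `encode` helper and whitespace tokenization; it is not how scikit-learn implements it.

```python
from collections import Counter

# Hypothetical toy documents
docs = ["the cat sat on the mat", "the dog sat"]

# Assign each known word a unique position in the vector
vocabulary = sorted({word for doc in docs for word in doc.split()})
index = {word: i for i, word in enumerate(vocabulary)}

def encode(doc):
    """Return a count vector with one position per vocabulary word."""
    vector = [0] * len(vocabulary)
    for word, count in Counter(doc.split()).items():
        if word in index:  # words outside the vocabulary are ignored
            vector[index[word]] = count
    return vector

print(vocabulary)
print(encode("the cat sat on the mat"))
print(encode("mat the sat cat the on"))  # same words, different order
```

The last two vectors are identical, which is exactly the loss of order information that the model accepts.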

There are many ways to extend this simple method, both by better clarifying what a “word” is and in defining what to encode about each word in the vector.

The scikit-learn library provides 3 different schemes that we can use (CountVectorizer, TfidfVectorizer and HashingVectorizer). We will briefly look at the first two.

**Convert text to word count vectors with CountVectorizer**

The CountVectorizer provides a simple way to tokenize a collection of text documents, build a vocabulary of known words, and encode new documents using that vocabulary.

You can use it as follows:

- Create an instance of the CountVectorizer class.
- Call the fit() function in order to learn a vocabulary from one or more documents.
- Call the transform() function on one or more documents as needed to encode each as a vector.
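The three steps above might look like this on a pair of made-up headlines (the example sentences are assumptions for illustration). Note that the default CountVectorizer lowercases text and silently ignores words it did not see during fit().

```python
from sklearn.feature_extraction.text import CountVectorizer

# Hypothetical training documents
train_docs = ["Fake news spreads fast", "Real news spreads slowly"]

# 1) create the vectorizer, 2) learn the vocabulary
vectorizer = CountVectorizer()
vectorizer.fit(train_docs)
print(sorted(vectorizer.vocabulary_))

# 3) encode a new document with the learned vocabulary
encoded = vectorizer.transform(["Fake fake news about aliens"])
print(encoded.toarray())
```

The columns follow the alphabetical order of the vocabulary, so "fake" is counted twice in the first column, "news" once in the third, and "about" and "aliens" do not appear at all.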

**Convert text to word frequency vectors with TfidfVectorizer**

Word counts are a good starting point, but they are very basic.

One issue with simple counts is that some words like “the” will appear many times and their large counts will not be very meaningful in the encoded vectors.

An alternative is to calculate word frequencies, and by far the most popular method is called TF-IDF. This is an acronym that stands for “Term Frequency – Inverse Document Frequency”, which are the components of the resulting scores assigned to each word.

- Term Frequency: This summarizes how often a given word appears within a document.
- Inverse Document Frequency: This downscales words that appear a lot across documents.

Without going into the math, TF-IDF scores are word frequency scores that try to highlight words that are more interesting, e.g. frequent in a document but not across documents.

The TfidfVectorizer will tokenize documents, learn the vocabulary and inverse document frequency weightings, and allow you to encode new documents. Alternately, if you already have a learned CountVectorizer, you can use it with a TfidfTransformer to just calculate the inverse document frequencies and start encoding documents.

Both classes follow the same fit() and transform() pattern as the CountVectorizer.
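The two routes described above, TfidfVectorizer on its own versus CountVectorizer followed by TfidfTransformer, should give the same matrix when both use default settings. A minimal check on a made-up corpus (the sentences are assumptions for illustration):

```python
import numpy as np
from sklearn.feature_extraction.text import (
    CountVectorizer,
    TfidfTransformer,
    TfidfVectorizer,
)

# Hypothetical toy corpus
docs = [
    "the senate passed the budget",
    "the president signed the budget",
    "aliens endorsed the president",
]

# Route 1: one step
one_step = TfidfVectorizer().fit_transform(docs)

# Route 2: counts first, then TF-IDF weighting of those counts
counts = CountVectorizer().fit_transform(docs)
two_step = TfidfTransformer().fit_transform(counts)

same = np.allclose(one_step.toarray(), two_step.toarray())
print(same)
```

The pipelines later in this notebook use the second route, which keeps the raw counts available as an intermediate step.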

## Supervised Learning -> Classification

- 1) Logistic Regression
- 2) Support Vector Machine (LinearSVC)
- 3) Naive Bayes


TfidfTransformer transforms a count matrix into a normalized tf or tf-idf representation. In the pipelines below it takes the raw counts produced by CountVectorizer and reweights and normalizes them.

In [19]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LogisticRegression())])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("Logistic Regression accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

Logistic Regression accuracy: 98.76%
[[4674   66]
 [  45 4195]]


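Rows of the matrix are the actual labels and columns are the predicted labels, in the order ['fake', 'true']. Treating 'fake' as the positive class, this means that there are

- 4674 True Positives (fake articles predicted as fake)
- 66 False Negatives (fake articles predicted as true)
- 45 False Positives (true articles predicted as fake)
- 4195 True Negatives (true articles predicted as true)

Sum of First Row = all actual fake articles in the test set (4740)

Sum of Second Row = all actual true articles in the test set (4240)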
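The four cells of the matrix are enough to recompute accuracy and to derive precision and recall. The sketch below copies the Logistic Regression matrix printed above and assumes 'fake' is the positive class, matching the labels order passed to confusion_matrix.

```python
import numpy as np

# Confusion matrix printed above (rows: actual, columns: predicted)
cm = np.array([[4674, 66],
               [45, 4195]])

tp, fn = cm[0]  # actual fake: predicted fake, predicted true
fp, tn = cm[1]  # actual true: predicted fake, predicted true

accuracy = (tp + tn) / cm.sum()
precision = tp / (tp + fp)  # of everything flagged fake, how much was fake
recall = tp / (tp + fn)     # of all fake articles, how many were caught

print("accuracy : {:.2%}".format(accuracy))
print("precision: {:.2%}".format(precision))
print("recall   : {:.2%}".format(recall))
```

The accuracy computed this way matches the 98.76% reported by accuracy_score.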

In [20]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', LinearSVC())])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("Support Vector accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

Support Vector accuracy: 99.55%
[[4720   20]
 [  20 4220]]


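With 'fake' as the positive class (rows are actual labels, columns are predicted labels), this means that there are

- 4720 True Positives (fake articles predicted as fake)
- 20 False Negatives (fake articles predicted as true)
- 20 False Positives (true articles predicted as fake)
- 4220 True Negatives (true articles predicted as true)

Sum of First Row = all actual fake articles in the test set (4740)

Sum of Second Row = all actual true articles in the test set (4240)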

In [23]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', MultinomialNB())])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("Multinomial Naive Bayes accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

Multinomial Naive Bayes accuracy: 93.56%
[[4486  254]
 [ 324 3916]]


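With 'fake' as the positive class (rows are actual labels, columns are predicted labels), this means that there are

- 4486 True Positives (fake articles predicted as fake)
- 254 False Negatives (fake articles predicted as true)
- 324 False Positives (true articles predicted as fake)
- 3916 True Negatives (true articles predicted as true)

Sum of First Row = all actual fake articles in the test set (4740)

Sum of Second Row = all actual true articles in the test set (4240)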

## Supervised Learning -> Classification (Tree-Based Models)

- 1) Decision Tree
- 2) Random Forest

Read more about Decision Tree [here](https://scikit-learn.org/stable/modules/generated/sklearn.tree.DecisionTreeClassifier.html)

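Criterion: Function to measure the quality of a split. "gini" for Gini impurity, "entropy" for the information gain

Max Depth: Maximum depth of the tree, i.e. the largest number of splits between the root and any leaf. Choose an int, or leave it as None to let the tree grow until its stopping criteria are met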

Splitter: Strategy used to choose the split at each node. Either "best" or "random"
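To get a feel for these parameters, here is a small sketch on made-up one-dimensional data (the numbers and labels are assumptions for illustration). It suggests that max_depth is only an upper bound: if a single split separates the classes, the tree stops there.

```python
from sklearn.tree import DecisionTreeClassifier

# Hypothetical toy data: the label flips between x = 4 and x = 6
X = [[1], [2], [3], [4], [6], [7], [8], [9]]
y = ["fake"] * 4 + ["true"] * 4

tree = DecisionTreeClassifier(
    criterion="entropy",  # split quality measured by information gain
    max_depth=10,         # upper bound on depth, not a target
    splitter="best",      # evaluate candidate splits and take the best
    random_state=0,
)
tree.fit(X, y)

print(tree.get_depth())
print(tree.predict([[2.5], [7.5]]))
```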

In [29]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([
    ('vect', CountVectorizer()),
    ('tfidf', TfidfTransformer()),
    ('model', DecisionTreeClassifier(criterion = 'entropy', max_depth = 10, splitter = 'best')),
])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("Decision Tree accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

Decision Tree accuracy: 99.53%
[[4722   18]
 [  24 4216]]


In [30]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', RandomForestClassifier())])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("Random Forest accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

Random Forest accuracy: 98.82%
[[4691   49]
 [  57 4183]]


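## Supervised Learning -> Classification (Nearest Neighbors)

KNN is a supervised method: it needs labeled training data, even though it is sometimes confused with clustering.

- 1) KNN Classification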

Read more [here](https://scikit-learn.org/stable/modules/generated/sklearn.neighbors.KNeighborsClassifier.html)

First define the number of neighbors (n_neighbors).

Then define the weights and the algorithm to use.

weights: 
- 'uniform' all points in each neighborhood are weighted equally
- 'distance' weight points by the inverse of their distance, in this case closer neighbors of a query point will have a greater influence than neighbors which are further away

algorithm:
- 'ball_tree' uses a BallTree
- 'kd_tree' uses a KDTree
- 'brute' brute-force search
- 'auto' will attempt to decide the most appropriate algorithm based on the values passed to the fit method
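The effect of the weights option can be seen on a tiny made-up example (the three points and the query are assumptions for illustration). With 'uniform' the two farther "true" points outvote the single nearby "fake" point; with 'distance' the nearby point carries more weight.

```python
from sklearn.neighbors import KNeighborsClassifier

# Hypothetical 1-D training points and their labels
X = [[0.0], [4.0], [5.0]]
y = ["fake", "true", "true"]
query = [[1.0]]  # distances to the training points: 1, 3 and 4

uniform = KNeighborsClassifier(
    n_neighbors=3, weights="uniform", algorithm="brute"
).fit(X, y)
distance = KNeighborsClassifier(
    n_neighbors=3, weights="distance", algorithm="brute"
).fit(X, y)

# Uniform: plain majority vote, 2 "true" vs 1 "fake"
print(uniform.predict(query))

# Distance: votes weighted by 1/d, "fake" gets 1.0, "true" gets 1/3 + 1/4
print(distance.predict(query))
```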

In [31]:
x_train, x_test, y_train, y_test = train_test_split(news['text'], news.label, test_size = 0.2, random_state = 2020)

pipe = Pipeline([('vect', CountVectorizer()), ('tfidf', TfidfTransformer()), ('model', KNeighborsClassifier(n_neighbors = 10, weights = 'distance', algorithm = 'brute'))])

model = pipe.fit(x_train, y_train)
prediction = model.predict(x_test)


print ("KNN accuracy: {}%".format(round(accuracy_score(y_test, prediction)*100,2)))
print (confusion_matrix(y_test, prediction, labels = ['fake', 'true']))

KNN accuracy: 67.59%
[[4695   45]
 [2865 1375]]
