## SVM and Random Forest Classifier

Here we train two models, a Random Forest and a Support Vector Machine (SVM).
We then combine them by averaging their predictions.
Combining the two models will hopefully give us better accuracy for predicting fake news than either would alone, since the models work in different ways:

- Decision trees in a random forest are all trained independently on subsets of the data. Each tree looks at random parts of the text and decides whether it thinks the news is real or fake; the final decision is whatever the majority of trees decided.
    - RFs are less prone to the influence of outliers, as the result depends on the outcomes of many trees
    - The randomness of feature selection, and of the data each tree trains on, helps the model give more accurate predictions for unseen data

- The Support Vector Machine tries to create the best decision boundary it can to separate real and fake news. If all news articles were plotted as dots on a graph, the SVM would try to find the line that separates the real ones from the fake ones while staying as far as possible from the nearest articles on either side.
    - SVMs work well for classifying complex data, as the data can be mapped into a higher dimension, making it easier to find a decision boundary. (We can choose which kernel to use for the model, which decides how the decision boundary is formed.)
    - SVMs also generalize well to unseen data

By combining the two methods we can draw on the strengths of both models and potentially get more accurate predictions.

In [1]:
#Imports
import pandas as pd
import numpy as np
import pickle

from sklearn.ensemble import RandomForestClassifier, VotingClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split

#Import Vectorizers
from sklearn.feature_extraction.text import TfidfVectorizer, CountVectorizer, HashingVectorizer

#TextPreprocess.py provides functions for preprocessing a piece of text and a dataset
from TextPreprocess import preprocessText, preprocessDataset

## Text Preprocessing

In [2]:
# Reading dataset from .csv file
dataset = pd.read_csv('Datasets/fake_or_real_news.csv')

# Separating labels from rest of dataset
labels = dataset.label

dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [3]:
# Preprocessing data

#preprocessDataset((dataset to process), (name of column with text to be processed))
preprocessedData = preprocessDataset(dataset, 'text')

In [5]:
preprocessedData.head()

0    daniel greenfield shillman journalism fellow f...
1    google pinterest digg linkedin reddit stumbleu...
2    u secretary state john f kerry said monday sto...
3    kaydee king kaydeeking november 9 2016 lesson ...
4    primary day new york frontrunners hillary clin...
Name: text, dtype: object

In [6]:
# Split data into training and testing
x_train, x_test, y_train, y_test = train_test_split(preprocessedData, labels, test_size= 0.2, random_state= 7)

## Vectorizing

We need to convert the raw text we want to train our model on into numbers that our model can understand.

#### <u>TF-IDF:</u>
(Term Frequency-Inverse Document Frequency)
- Weights each word by how often it appears in a document and how rare it is across the other documents
- Contains information on the most and least important words in the document
- Based purely on the occurrence of words; does not take into account the positions of words in the text
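As a rough illustration of the weighting described above, here is a minimal sketch of the textbook TF-IDF formula on a made-up three-document corpus. The documents and the helper `tf_idf` are hypothetical, and scikit-learn's `TfidfVectorizer` uses a smoothed IDF and normalizes each row, so its numbers will differ from these.

```python
import math

# Hypothetical toy corpus, not taken from the dataset
docs = [
    "senate passes budget bill",
    "aliens endorse budget bill",
    "senate debates budget",
]
tokenized = [d.split() for d in docs]
n_docs = len(tokenized)

def tf_idf(term, doc_tokens):
    """Textbook TF-IDF: term frequency times log inverse document frequency."""
    tf = doc_tokens.count(term) / len(doc_tokens)
    df = sum(1 for toks in tokenized if term in toks)  # documents containing the term
    idf = math.log(n_docs / df)
    return tf * idf

score_budget = tf_idf("budget", tokenized[1])  # "budget" appears in every document
score_aliens = tf_idf("aliens", tokenized[1])  # "aliens" appears in only one document

print("budget:", score_budget)
print("aliens:", score_aliens)
```

A word that appears in every document gets a weight of zero under this formula, while a word unique to one document gets the highest weight.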

#### <u>Bag of Words:</u>
(Count vectorizer)
- Counts occurrences of words in the document
- Similar to TF-IDF, except it doesn't look at the importance of words

#### <u>Hashing Vectorizer:</u>
- Converts documents to a matrix of token occurrences
- Uses hashing to map token strings to feature indexes, so no vocabulary has to be stored
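The hashing idea can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions: the helper `hashed_counts` and the tiny `n_features=16` are hypothetical, and MD5 is used here purely because it is deterministic and in the standard library. scikit-learn's `HashingVectorizer` uses a different hash function (MurmurHash3) and a much larger feature space.

```python
import hashlib

def hashed_counts(text, n_features=16):
    """Map each token to a column index via a hash and count occurrences."""
    vec = [0] * n_features
    for token in text.lower().split():
        digest = hashlib.md5(token.encode("utf-8")).hexdigest()
        index = int(digest, 16) % n_features  # column for this token
        vec[index] += 1
    return vec

sample = "the senate passed the bill"
hashed_vec = hashed_counts(sample)
print(hashed_vec)
```

Because the column index comes straight from the hash, two different words can land in the same column (a collision); a larger `n_features` makes that less likely.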

#### <u>N-Grams:</u>
- Sequences of N adjacent words
- Counts the occurrences of these N-grams in the document
- Keeps local word order, but there is a tradeoff in choosing N: a smaller N may capture less useful information, while a larger N produces a very large matrix, possibly with too many features
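To show what an N-gram range of 1 to 3 produces, here is a small sketch in plain Python. The helper `word_ngrams` and the example sentence are hypothetical; in this notebook the actual extraction and counting is done by `CountVectorizer` with `ngram_range=(1, 3)`.

```python
from collections import Counter

def word_ngrams(text, n_min=1, n_max=3):
    """Return every run of n_min..n_max adjacent words in the text."""
    tokens = text.lower().split()
    grams = []
    for n in range(n_min, n_max + 1):
        for i in range(len(tokens) - n + 1):
            grams.append(" ".join(tokens[i:i + n]))
    return grams

sentence = "the senate passed the bill"
all_grams = word_ngrams(sentence)
ngram_counts = Counter(all_grams)

print(len(all_grams), "n-grams,", len(ngram_counts), "distinct")
for gram, count in ngram_counts.items():
    print(count, gram)
```

Even this five-word sentence yields 11 distinct features, which hints at how quickly the feature matrix grows on full articles.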


In [16]:
#TFIDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words= 'english', max_df= 0.7)

tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)

In [6]:
#Bag of Words Vectorizer
bow_vectorizer = CountVectorizer(stop_words= 'english')

bow_train = bow_vectorizer.fit_transform(x_train)
bow_test = bow_vectorizer.transform(x_test)

In [14]:
#Hashing Vectorizer
hashing_vectorizer = HashingVectorizer(n_features= 2**20)

hashed_train = hashing_vectorizer.fit_transform(x_train)
hashed_test = hashing_vectorizer.transform(x_test)


In [20]:
#N-Grams
#ngram_range=(1,3) extracts unigrams, bigrams and trigrams, which the count vectorizer then turns into counts

ngram_range = (1,3)

ngramVectorizer = CountVectorizer(ngram_range=ngram_range)

ngram_train = ngramVectorizer.fit_transform(x_train)
ngram_test = ngramVectorizer.transform(x_test)

In [8]:
#N-Grams TF-IDF

ngram_range = (1,3)

ngram_TFIDF_Vec = TfidfVectorizer(ngram_range=ngram_range)

ngram_tfidf_train = ngram_TFIDF_Vec.fit_transform(x_train)
ngram_tfidf_test = ngram_TFIDF_Vec.transform(x_test)

## Models

### SVM

In [17]:
#Setting up the SVM
# Probability set to true so we can use soft voting with the random forest
svm_model = svm.SVC(kernel='linear', probability=True)

### Random Forest

In [18]:
#Setting up the random forest model
rf = RandomForestClassifier()

### Combine Both models using voting
A voting classifier lets us combine the outputs of both models into one classifier.

Soft voting averages the class probabilities predicted by each model and picks the label with the highest average probability.
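The arithmetic behind soft voting can be sketched with made-up numbers. The probabilities below are hypothetical and not produced by the models in this notebook; equal weights are assumed for the two models.

```python
classes = ["FAKE", "REAL"]

# Hypothetical predicted probabilities for one article, in the order of `classes`
svm_proba = [0.40, 0.60]  # SVM leans REAL, but not strongly
rf_proba = [0.85, 0.15]   # random forest is confident the article is FAKE

# Soft voting: average the probabilities per class, then take the largest
avg_proba = [(s + r) / 2 for s, r in zip(svm_proba, rf_proba)]
soft_vote = classes[avg_proba.index(max(avg_proba))]

print(avg_proba)
print(soft_vote)  # -> FAKE
```

With only two models, counting one vote per model would leave this example tied, whereas averaging probabilities lets the more confident model decide.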

**Takes a long time to train (generally 2-3 minutes, and 11 minutes or more with N-grams)**

In [19]:
Combined_Models = VotingClassifier(estimators=[('svm', svm_model), ('randomForest', rf)], voting='soft')

# Fit the ensemble model to our training data

Combined_Models = Combined_Models.fit(tfidf_train, y_train)

## Prediction and Accuracy

Some sample accuracy scores (**there are other metrics besides accuracy that we should look at too**):
- Using TF-IDF = 0.934
- Using BOW = 0.89
- Using hashing = 0.927
- Using N-grams (N = 1 to 3) and BOW = 0.887 [more confidence in predictions than without n-grams]
- Using N-grams (N = 1 to 3) and TF-IDF = 0.922 [more confidence in predictions than without n-grams]
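As one possible way to look beyond accuracy, the sketch below computes a confusion matrix, precision, recall and F1 with `sklearn.metrics`. The label lists are made up for illustration and are not results from the models above; treating `FAKE` as the positive class is an assumption.

```python
from sklearn.metrics import confusion_matrix, precision_score, recall_score, f1_score

# Hypothetical true and predicted labels, for illustration only
y_true = ["FAKE", "FAKE", "FAKE", "REAL", "REAL", "REAL", "REAL", "FAKE"]
y_hat = ["FAKE", "FAKE", "REAL", "REAL", "REAL", "FAKE", "REAL", "FAKE"]

# Rows are true labels, columns are predicted labels, both ordered FAKE then REAL
cm = confusion_matrix(y_true, y_hat, labels=["FAKE", "REAL"])

# With string labels the positive class has to be named explicitly
precision = precision_score(y_true, y_hat, pos_label="FAKE")
recall = recall_score(y_true, y_hat, pos_label="FAKE")
f1 = f1_score(y_true, y_hat, pos_label="FAKE")

print(cm)
print("Precision:", precision, "Recall:", recall, "F1:", f1)
```

On the real data the same calls would take `y_test` and `y_pred_combined` in place of the toy lists.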


In [22]:
#predict labels for testing set 
y_pred_combined = Combined_Models.predict(tfidf_test)

In [23]:
#accuracy_svm = accuracy_score(y_test, y_pred_svm)
#print("SVM Accuracy: ", accuracy_svm)
#accuracy_rf = accuracy_score(y_test, y_pred_rf)
#print("Random Forest Accuracy: ", accuracy_rf)

accuracy_combined = accuracy_score(y_test, y_pred_combined)
print("Combined Accuracy: ", accuracy_combined)

Combined Accuracy:  0.9337016574585635


## Test Our Own Text

In [30]:
# Sample text taken from RTE.ie
filePath = 'Test_Text.txt'

with open(filePath, 'r', encoding= 'utf-8') as file:
    sample_text = file.read()

#preprocessText(Text to be processed)
preprocessedText = preprocessText(sample_text)

# Reuse the vectorizer fitted on the training data so the features line up
tfidf_sampleText = tfidf_vectorizer.transform([preprocessedText])

samplePred_Combined = Combined_Models.predict(tfidf_sampleText)

# predict_proba columns follow the order of Combined_Models.classes_
Combined_confidence = Combined_Models.predict_proba(tfidf_sampleText)
classIndex = {label: i for i, label in enumerate(Combined_Models.classes_)}

print("Predicted Label Combined Models: ", samplePred_Combined)
print("Model Confidence that article is Fake = ", Combined_confidence[0][classIndex['FAKE']])
print("Model Confidence that article is Real = ", Combined_confidence[0][classIndex['REAL']])


Predicted Label Combined Models:  ['FAKE']
Model Confidence that article is Fake =  0.6393507372049535
Model Confidence that article is Real =  0.3606492627950466
