## SVM and Random Forest Classifier

Here we will train two models, a Random Forest and a Support Vector Machine (SVM).
We will then combine and average their outputs.
Combining the two models should hopefully give us better accuracy for predicting fake news than either model alone, because the models work in different ways:

- Decision trees in a random forest are each trained independently on a random subset of the data. Each tree looks at random parts of the text and decides whether it thinks the news is real or fake; the final decision is whatever the majority of trees decided.
    - RFs are less prone to the influence of outliers, as the result depends on the outcomes of many trees
    - The randomness of feature selection, and of the data each tree trains on, reduces overfitting, which helps the model give more accurate predictions on unseen data

- The Support Vector Machine tries to find the best decision boundary it can to separate real and fake news. If all news articles were plotted as dots on a graph, the SVM would try to find the line that separates the real ones from the fake ones while staying as far as possible from the nearest articles on either side.
    - SVMs work well for classifying complex data, as the data can be mapped into a higher-dimensional space, making it easier to find a decision boundary (we can choose which kernel the model uses, which determines how the decision boundary is formed)
    - SVMs also generalize well to unseen data

By combining both methods we can draw on the strengths of each model and potentially get more accurate predictions.
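
To make the two kinds of voting concrete, here is a minimal sketch in plain Python. It shows a majority vote, as used among the trees of a forest, and a simple average of two models' class probabilities. The helper names `majority_vote` and `average_probabilities`, the tree votes, and the probability values are all hypothetical and chosen only for illustration; this is not the combination code of this notebook.

```python
from collections import Counter

def majority_vote(votes):
    """Return the label chosen by most voters (ties go to the label seen first)."""
    return Counter(votes).most_common(1)[0][0]

def average_probabilities(prob_a, prob_b):
    """Average two class-probability rows element by element."""
    return [(a + b) / 2 for a, b in zip(prob_a, prob_b)]

# Hypothetical votes from five trees for a single article
tree_votes = ["REAL", "FAKE", "REAL", "REAL", "FAKE"]
forest_label = majority_vote(tree_votes)

# Hypothetical [P(FAKE), P(REAL)] estimates from two models for the same article
classes = ["FAKE", "REAL"]
rf_probs = [0.34, 0.66]
svm_probs = [0.20, 0.80]

combined = average_probabilities(rf_probs, svm_probs)
# The combined prediction is the class with the highest averaged probability
combined_label = classes[combined.index(max(combined))]

print(forest_label, combined, combined_label)
```

Averaging probabilities only makes sense when both models list their classes in the same order, so it is worth checking that before combining real model outputs.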

In [23]:
#Imports
import pandas as pd
import numpy as np
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn import svm
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer

#TextPreprocess.py provides functions for preprocessing a piece of text and a dataset
from TextPreprocess import preprocessText, preprocessDataset

## Text Preprocessing

In [24]:
# Reading dataset from .csv file
dataset = pd.read_csv('Datasets/fake_or_real_news.csv')

# Separating labels from rest of dataset
labels = dataset.label

dataset.head()

| Unnamed: 0.1 | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [25]:
# Preprocessing data

# preprocessDataset(dataset to process, name of the column containing the text to process)
preprocessedData = preprocessDataset(dataset, 'text')

In [26]:
preprocessedData.head()

0    daniel greenfield shillman journalism fellow f...
1    google pinterest digg linkedin reddit stumbleu...
2    u secretary state john f kerry said monday sto...
3    kaydee king kaydeeking november 9 2016 lesson ...
4    primary day new york frontrunners hillary clin...
Name: text, dtype: object

In [27]:
# Split data into training and testing
x_train, x_test, y_train, y_test = train_test_split(preprocessedData, labels, test_size= 0.2, random_state= 7)

In [28]:
# Vectorize text so we can feed it to algorithms

#TFIDF Vectorizer
tfidf_vectorizer = TfidfVectorizer(stop_words= 'english', max_df= 0.7)

tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)
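
As a small illustration of what these vectorizer settings do, the sketch below runs the same `TfidfVectorizer` configuration on a toy corpus. The documents are made up for this example and are not taken from the dataset. It shows that a term appearing in more than 70% of the training documents is dropped by `max_df=0.7`, and that words never seen during fitting are ignored when the test text is transformed.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy corpus (made up for illustration); "report" appears in every document
train_docs = [
    "report senate passes budget",
    "report aliens land ohio",
    "report senate delays vote",
    "report miracle cure discovered",
]
test_docs = ["report unicorn senate"]

# Same settings as above: drop English stop words and any term that
# appears in more than 70% of the training documents
vectorizer = TfidfVectorizer(stop_words="english", max_df=0.7)

# fit_transform learns the vocabulary and IDF weights from the training text only
train_matrix = vectorizer.fit_transform(train_docs)
# transform reuses that vocabulary; unseen words such as "unicorn" are ignored
test_matrix = vectorizer.transform(test_docs)

vocabulary = vectorizer.vocabulary_
print("report" in vocabulary, "senate" in vocabulary)
print(train_matrix.shape, test_matrix.shape)
```

Fitting on the training split only, as done above with `x_train`, keeps information from the test articles out of the vocabulary and the IDF weights.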

## Models

### SVM

In [29]:
#Setting up and training the SVM
svm_model = svm.SVC(kernel='linear')
svm_model.fit(tfidf_train, y_train)

### Random Forest

In [30]:
#Setting up and training the random forest model on the same data as the SVM
rf = RandomForestClassifier()

rf.fit(tfidf_train, y_train)

## Prediction and Accuracy

In [31]:
#predict labels for testing set from both models

y_pred_svm = svm_model.predict(tfidf_test)

y_pred_rf = rf.predict(tfidf_test)

In [32]:
accuracy_svm = accuracy_score(y_test, y_pred_svm)
print("SVM Accuracy: ", accuracy_svm)

accuracy_rf = accuracy_score(y_test, y_pred_rf)
print("Random Forest Accuracy: ", accuracy_rf)

SVM Accuracy:  0.9265982636148382
Random Forest Accuracy:  0.8950276243093923


## Testing Our Own Text

In [34]:
# Sample text taken from RTE.ie
filePath = 'Test_Text.txt'

with open(filePath, 'r', encoding= 'utf-8') as file:
    sample_text = file.read()

#preprocessText(Text to be processed)
preprocessedText = preprocessText(sample_text)

tfidf_sampleText = tfidf_vectorizer.transform([preprocessedText])

samplePred_SVM = svm_model.predict(tfidf_sampleText)
# predict_proba is only available when the SVC is created with probability=True
#SVM_confidence = svm_model.predict_proba(tfidf_sampleText)

samplePred_RF = rf.predict(tfidf_sampleText)
RF_confidence = rf.predict_proba(tfidf_sampleText)

print("Predicted Label SVM: ", samplePred_SVM)
#print("SVM Confidence = ", SVM_confidence)
print("Predicted Label RF: ", samplePred_RF)
print("RF Confidence = ", RF_confidence)


Predicted Label SVM:  ['REAL']
Predicted Label RF:  ['REAL']
RF Confidence =  [[0.34 0.66]]
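
The RF confidence output has one column per class, in the order given by `rf.classes_`. The SVM confidence line is commented out because an `SVC` created without `probability=True` does not provide `predict_proba`. One possible alternative, sketched below on made-up toy data, is to read the SVM's `decision_function` score, whose sign indicates the predicted side of the boundary and whose magnitude grows with distance from it. The texts, labels, and sample sentence here are hypothetical, and the score is not a probability.

```python
from sklearn import svm
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data, made up for illustration only
texts = [
    "senate passes annual budget bill",
    "court upholds state election law",
    "governor signs education funding bill",
    "aliens secretly control the senate",
    "miracle pill cures every disease overnight",
    "celebrity clone spotted on the moon",
]
labels = ["REAL", "REAL", "REAL", "FAKE", "FAKE", "FAKE"]

vectorizer = TfidfVectorizer(stop_words="english")
features = vectorizer.fit_transform(texts)

svm_model = svm.SVC(kernel="linear").fit(features, labels)
rf = RandomForestClassifier(random_state=0).fit(features, labels)

sample = vectorizer.transform(["senate budget bill delayed"])

# Signed score: positive values point to svm_model.classes_[1],
# negative values to svm_model.classes_[0]
svm_score = svm_model.decision_function(sample)[0]
svm_label = svm_model.predict(sample)[0]

# predict_proba columns follow rf.classes_, so pair them up explicitly
rf_probs = dict(zip(rf.classes_, rf.predict_proba(sample)[0]))

print("SVM:", svm_label, svm_score)
print("RF:", rf_probs)
```

Pairing each probability with its class name avoids having to remember which column of `predict_proba` belongs to `FAKE` and which to `REAL`.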
