## Random Forest

A random forest combines the output of multiple decision trees to produce a single result.
It can be used for both classification and regression.
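
To make "combining the output of multiple trees" concrete, here is a minimal sketch on synthetic data (not the news dataset used later in this notebook). It relies on scikit-learn's documented behaviour that a forest's class probabilities are the mean of its trees' class probabilities. The data and the names `forest`, `tree_probas` and `forest_probas` are illustrative assumptions.

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic data, used only for illustration (assumption: not the notebook's dataset)
X, y = make_classification(n_samples=200, n_features=10, random_state=0)

forest = RandomForestClassifier(n_estimators=25, random_state=0).fit(X, y)

# Average the class probabilities predicted by each individual tree
tree_probas = np.mean(
    [tree.predict_proba(X) for tree in forest.estimators_], axis=0
)

# Compare with the probabilities reported by the forest itself
forest_probas = forest.predict_proba(X)
matches = np.allclose(tree_probas, forest_probas)
print("Forest output equals the mean of its trees:", matches)
```

The predicted label is then the class with the highest averaged probability.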

Each decision tree is made up of nodes, which are decision points where the data is split based on a particular feature or condition.

Hyperparameters we can set:
- Node size
- Number of trees
- Number of features sampled at each split
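
As a rough guide, the three hyperparameters above could map onto `RandomForestClassifier` arguments as sketched below. The mapping of "node size" to `min_samples_leaf` is an assumption (`min_samples_split` is a related control), and the values are illustrative rather than tuned. The name `rf_configured` is hypothetical and is not used elsewhere in this notebook.

```python
from sklearn.ensemble import RandomForestClassifier

# Illustrative values only; none of these were tuned for the news dataset
rf_configured = RandomForestClassifier(
    n_estimators=200,      # number of trees in the forest
    max_features="sqrt",   # number of features considered at each split
    min_samples_leaf=2,    # node size (assumed mapping): minimum samples in a leaf
    random_state=7,        # fixed seed so repeated runs build the same forest
)

print(rf_configured.get_params()["n_estimators"])
```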

In [1]:
#Imports
import pandas as pd
import numpy as np
import pickle

from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import RandomizedSearchCV, train_test_split
from scipy.stats import randint
from sklearn.feature_extraction.text import TfidfVectorizer

# TextPreprocess.py provides functions for preprocessing a piece of text and a dataset
from app import prediction, preprocessDataset


[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\reina\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


## Preprocessing

In [3]:
# Reading dataset from .csv file
dataset = pd.read_csv('Model/Datasets/fake_or_real_news.csv')

# Separating labels from rest of dataset
labels = dataset.label

dataset.head()

|   | Unnamed: 0 | title | text | label |
|---|---|---|---|---|
| 0 | 8476 | You Can Smell Hillary’s Fear | Daniel Greenfield, a Shillman Journalism Fello... | FAKE |
| 1 | 10294 | Watch The Exact Moment Paul Ryan Committed Pol... | Google Pinterest Digg Linkedin Reddit Stumbleu... | FAKE |
| 2 | 3608 | Kerry to go to Paris in gesture of sympathy | U.S. Secretary of State John F. Kerry said Mon... | REAL |
| 3 | 10142 | Bernie supporters on Twitter erupt in anger ag... | — Kaydee King (@KaydeeKing) November 9, 2016 T... | FAKE |
| 4 | 875 | The Battle of New York: Why This Primary Matters | It's primary day in New York and front-runners... | REAL |


In [4]:
# Preprocessing data

# preprocessDataset(dataset to process, name of column with text to be processed)
preprocessedData = preprocessDataset(dataset, 'text')

In [5]:
preprocessedData.head()

0    daniel greenfield shillman journalism fellow f...
1    google pinterest digg linkedin reddit stumbleu...
2    u secretary state john f kerry said monday sto...
3    kaydee king kaydeeking november 9 2016 lesson ...
4    primary day new york frontrunners hillary clin...
Name: text, dtype: object

In [6]:
# Split data into training (80%) and testing (20%) sets
x_train, x_test, y_train, y_test = train_test_split(preprocessedData, labels, test_size=0.2, random_state=7)

In [7]:
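# Vectorize the text so it can be fed to the algorithm

# TF-IDF vectorizer: drops English stop words and terms that appear in more than 70% of documents
tfidf_vectorizer = TfidfVectorizer(stop_words='english', max_df=0.7)

# Fit on the training set only, then reuse the learned vocabulary for the test set
tfidf_train = tfidf_vectorizer.fit_transform(x_train)
tfidf_test = tfidf_vectorizer.transform(x_test)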
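
The difference between `fit_transform` and `transform` is easier to see on a toy corpus. The sketch below uses made-up sentences, not rows from the dataset. It shows that the vocabulary is learned from the training documents only, so words seen only at test time are ignored.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Made-up documents, for illustration only
train_docs = [
    "senate passes budget bill",
    "aliens endorse budget bill",
    "senate debates healthcare",
]
test_docs = ["aliens debate taxes"]

vectorizer = TfidfVectorizer()
train_matrix = vectorizer.fit_transform(train_docs)  # learns vocabulary and IDF weights
test_matrix = vectorizer.transform(test_docs)        # reuses them without refitting

print(sorted(vectorizer.vocabulary_))
print(train_matrix.shape, test_matrix.shape)

# "debate" and "taxes" never appeared in training, so only "aliens" is encoded
print("Non-zero entries in test row:", test_matrix.nnz)
```

Both matrices share the same columns, which is what lets a model trained on one predict on the other.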

## Model

In [8]:
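# Setting up and training the model with scikit-learn's default hyperparameters.
# No random_state is set, so the accuracy below will vary slightly between runs.
rf = RandomForestClassifier()

rf.fit(tfidf_train, y_train)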
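
`RandomizedSearchCV` and `randint` are imported at the top of this notebook but are not used in the cells shown here. One way they might be used to tune the forest is sketched below. It runs on synthetic data so that it is self-contained. The parameter ranges, `n_iter`, `cv` and the name `search` are all assumptions, not settings taken from this project.

```python
from scipy.stats import randint
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the TF-IDF features (assumption: illustration only)
X, y = make_classification(n_samples=300, n_features=20, random_state=7)

# Distributions to sample from; randint(low, high) excludes the upper bound
param_distributions = {
    "n_estimators": randint(20, 100),
    "max_depth": randint(2, 15),
    "min_samples_leaf": randint(1, 5),
}

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=7),
    param_distributions=param_distributions,
    n_iter=5,            # number of random configurations to try
    cv=3,                # 3-fold cross-validation for each configuration
    scoring="accuracy",
    random_state=7,
)
search.fit(X, y)

print("Best parameters:", search.best_params_)
print("Best cross-validated accuracy:", search.best_score_)
```

In this notebook the equivalent call would be `search.fit(tfidf_train, y_train)`, with `search.best_estimator_` taking the place of `rf`.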

## Prediction and Accuracy

In [9]:
# Predict using testing set
y_pred = rf.predict(tfidf_test)

In [10]:
# Accuracy based on testing data
accuracy = accuracy_score(y_test, y_pred)
print("Accuracy: ", accuracy)

Accuracy:  0.8895027624309392
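
Accuracy is the fraction of test samples whose predicted label matches the true label. The sketch below checks that on five made-up labels (not results from this model) and adds a confusion matrix, which shows which kind of mistake was made. The names `y_true` and `y_hat` are illustrative.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up labels, for illustration only
y_true = np.array(["FAKE", "REAL", "REAL", "FAKE", "REAL"])
y_hat = np.array(["FAKE", "REAL", "FAKE", "FAKE", "REAL"])

# Accuracy computed by hand: share of positions where the labels agree
manual_accuracy = np.mean(y_true == y_hat)
library_accuracy = accuracy_score(y_true, y_hat)
print("Manual:", manual_accuracy, "accuracy_score:", library_accuracy)

# Rows are true labels, columns are predicted labels, in the order given
cm = confusion_matrix(y_true, y_hat, labels=["FAKE", "REAL"])
print(cm)
```

Here four of the five predictions are correct, and the single error is a REAL article predicted as FAKE.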


## Test Own Text

In [18]:
# Sample text taken from RTE.ie
filePath = 'Model/Test_Text.txt'

with open(filePath, 'r', encoding= 'utf-8') as file:
    sample_text = file.read()

# prediction(text to be processed): imported from app above; its result is used here as the preprocessed text
preprocessedText = prediction(sample_text)

tfidf_sampleText = tfidf_vectorizer.transform([preprocessedText])

samplePred = rf.predict(tfidf_sampleText)

print("Predicted Label: ", samplePred)

Predicted Label:  ['FAKE']


## Saving Model using pickle

In [21]:
# V1 - 14/03/2024

# Save the model and the fitted vectorizer together so both can be restored for prediction
with open('FakeNewsModel_V1.pkl', 'wb') as f:
    pickle.dump((rf, tfidf_vectorizer), f)
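
Loading the pair back is the mirror image of the save step. The sketch below does a full save-and-load round trip with a tiny toy model and a temporary file, so it runs without `FakeNewsModel_V1.pkl`. The texts, labels and variable names are made up for illustration.

```python
import os
import pickle
import tempfile

from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training data, for illustration only
texts = ["shocking secret cure revealed", "council approves road budget",
         "miracle pill doctors hate", "minister announces school funding"]
toy_labels = ["FAKE", "REAL", "FAKE", "REAL"]

vec = TfidfVectorizer()
features = vec.fit_transform(texts)
model = RandomForestClassifier(n_estimators=10, random_state=0).fit(features, toy_labels)

# Save the (model, vectorizer) tuple, mirroring the cell above
path = os.path.join(tempfile.mkdtemp(), "toy_model.pkl")
with open(path, "wb") as f:
    pickle.dump((model, vec), f)

# Load it back and unpack in the same order it was saved
with open(path, "rb") as f:
    loaded_model, loaded_vec = pickle.load(f)

original_pred = model.predict(features)
restored_pred = loaded_model.predict(loaded_vec.transform(texts))
print("Predictions identical after reload:", (original_pred == restored_pred).all())
```

Only unpickle files from a source you trust, and load them with the same scikit-learn version that created them where possible.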