### Fake News Detection

- Student Name: Foo Fang Khai
- Student ID: 0134196

#### Project Flow:
1. Problem Statement
2. Data Gathering
3. Data Cleaning
4. Data Preprocessing:
    - Tokenization
    - Lower Case
    - Stopwords
    - Lemmatization / Stemming
5. Vectorization
    - TF-IDF
6. Model Building
    - Train Test Split
    - Model Object Initialization
    - Training and Testing of Model
7. Model Evaluation
    - Accuracy Score
    - Confusion Matrix
    - Classification Report
8. Prediction On Given Data
9. Keyword Extraction
    - Data Preprocessing
        - Regular Expression
        - Lower Case
    - Creating IDF
    - TfidfTransformer (Use For Computing IDF)
    - Computing TF-IDF
10. Deployment of Model

## Problem Statement

In natural language processing, fake news detection is a widely studied task. We absorb news from many sources every day, and it can be hard to tell fake news from real news. This project compares three models (Random Forest, Decision Tree, and Logistic Regression) on how well they predict whether a given piece of news is real or fake, and then extracts keywords from the news using TF-IDF.

## Data Gathering

In [58]:
# Import Libraries Needed

import pandas as pd
import numpy as np
import re
import pickle
import nltk
import seaborn as sns
import matplotlib.pyplot as plt
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer
from sklearn.tree import DecisionTreeClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [2]:
# Load dataset which consists of 4 columns, Unnamed: 0 (non-effective variable), title, text, and label
# While a dataset with title, text, and labels is crucial for fake news detection, it's not exclusive to this task
# The title and text provide the necessary textual context, and labels guide the model's learning process

data = pd.read_csv("/Users/_fangkhai/Documents/Academic/Computer Science Semester 7/Natural Language Processing/Assignment Dataset/WELFake_Dataset.csv")

In [3]:
data

Unnamed: 0.1,Unnamed: 0,title,text,label
0,0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,1,,Did they post their votes for Hillary already?,1
2,2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
...,...,...,...,...
72129,72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
72130,72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
72131,72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
72132,72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


## Data Cleaning

In [4]:
# Use .describe() function to have a deeper analysis of our dataset

data.describe()

Unnamed: 0.1,Unnamed: 0,label
count,72134.0,72134.0
mean,36066.5,0.514404
std,20823.436496,0.499796
min,0.0,0.0
25%,18033.25,0.0
50%,36066.5,1.0
75%,54099.75,1.0
max,72133.0,1.0


In [5]:
# Use .info() function to have a deeper analysis of our dataset

data.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 72134 entries, 0 to 72133
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  72134 non-null  int64 
 1   title       71576 non-null  object
 2   text        72095 non-null  object
 3   label       72134 non-null  int64 
dtypes: int64(2), object(2)
memory usage: 2.2+ MB


In [6]:
# Drop non-effective variables for analysis purpose such as Unnamed: 0

data.drop(columns = "Unnamed: 0", inplace = True)

In [7]:
data

Unnamed: 0,title,text,label
0,LAW ENFORCEMENT ON HIGH ALERT Following Threat...,No comment is expected from Barack Obama Membe...,1
1,,Did they post their votes for Hillary already?,1
2,UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...,"Now, most of the demonstrators gathered last ...",1
3,"Bobby Jindal, raised Hindu, uses story of Chri...",A dozen politically active pastors came here f...,0
4,SATAN 2: Russia unvelis an image of its terrif...,"The RS-28 Sarmat missile, dubbed Satan 2, will...",1
...,...,...,...
72129,Russians steal research on Trump in hack of U....,WASHINGTON (Reuters) - Hackers believed to be ...,0
72130,WATCH: Giuliani Demands That Democrats Apolog...,"You know, because in fantasyland Republicans n...",1
72131,Migrants Refuse To Leave Train At Refugee Camp...,Migrants Refuse To Leave Train At Refugee Camp...,0
72132,Trump tussle gives unpopular Mexican leader mu...,MEXICO CITY (Reuters) - Donald Trump’s combati...,0


In [8]:
# Checking the value counts for label to see whether the labels are imbalanced
# For instance, if value 0 had 1000 rows and value 1 had 37000, the trained model would be biased towards value 1,
# since always predicting the majority class would already produce few mistakes

data["label"].value_counts()

1    37106
0    35028
Name: label, dtype: int64
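
To put the counts above in perspective, the class shares and the accuracy of a trivial majority-class predictor can be worked out directly. This is a minimal sketch that assumes only the two counts printed above; the variable names are illustrative and not part of the notebook.

```python
# Label counts copied from the value_counts() output above
counts = {1: 37106, 0: 35028}

total = sum(counts.values())
shares = {label: n / total for label, n in counts.items()}

# A classifier that always predicts the most frequent label would reach
# this accuracy, so it is a useful floor for the models trained later
majority_label = max(counts, key=counts.get)
baseline_accuracy = shares[majority_label]

for label, share in shares.items():
    print(f"label {label}: {share:.2%}")
print(f"majority-class baseline accuracy: {baseline_accuracy:.4f}")
```

The two classes are close to balanced, so plain accuracy is a reasonable headline metric here.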

In [9]:
# Checking the unique count for label to ensure data quality. For instance, a unique count of 3 would mean
# something is wrong, since the label must be either 0 or 1 (real or fake)

data["label"].nunique()

2

In [10]:
# Use .shape to view how many rows and columns are there in our dataset

data.shape

(72134, 3)

In [11]:
# Use .isnull().sum() to check for the total (sum of) null values for each column

data.isnull().sum()

title    558
text      39
label      0
dtype: int64

In [12]:
# Use .dtypes to check the data types for each column

data.dtypes

title    object
text     object
label     int64
dtype: object

In [13]:
null_title = data[data["title"].isnull()]
print(null_title)

null_text = data[data["text"].isnull()]
print(null_text)

      title                                               text  label
1       NaN     Did they post their votes for Hillary already?      1
43      NaN  True. Hillary needs a distraction and what bet...      1
162     NaN  All eyes on Electoral delegates. The People kn...      1
185     NaN                                               Cool      1
269     NaN  A leading US senator: US Supporting War in Syr...      1
...     ...                                                ...    ...
71484   NaN  Another Arab supremacist masturbation fantasy....      1
71521   NaN  I'm sure they drastically changed accounting m...      1
71540   NaN  It's easy to imagine Obama or Kerry pissing hi...      1
71570   NaN  Ever since the powers to be assassinated JFK A...      1
71734   NaN             Hmm, free college, now that's an idea.      1

[558 rows x 3 columns]
                                                   title text  label
2457   Après le succès de « Mariés au premier regard ...  NaN      

In [14]:
# Here the null values are filled with placeholder text ("No Text") instead of dropping the affected rows,
# since dropping rows would discard records whose other column still holds usable data

replacement_text = "No Text"
data = data.fillna(replacement_text)

In [15]:
# Check for null values again after filling the null values with replacement_text

data.isnull().sum()

title    0
text     0
label    0
dtype: int64

In [16]:
data["text"][0]

'No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show, callers clearly call for  lynching  and  killing  of white people.A 2:39 minute clip from the radio show can be heard here. It was provided to Breitbart Texas by someone who would like to be referred to

In [123]:
data["title"][3]

'Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid'

In [17]:
data.columns

Index(['title', 'text', 'label'], dtype='object')

## Data Preprocessing - With Title As Feature Since Noise In The Text Can Make It Harder To Distinguish Between Fake And Genuine News

#### In this section, several operations will be performed, such as:
    - Tokenization
    - Lower Case
    - Stopwords
    - Lemmatization

In [18]:
# Download stopwords from nltk library

nltk.download('stopwords')

[nltk_data] Downloading package stopwords to
[nltk_data]     /Users/_fangkhai/nltk_data...
[nltk_data]   Package stopwords is already up-to-date!


True

In [80]:
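# Load the English stopword list and preview the first 20 entries
# Note: this rebinds the name `stopwords` from the NLTK corpus module to a plain list,
# so re-running this cell without re-importing raises an AttributeError

stopwords = stopwords.words("english")
print(stopwords[0:20])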

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're", "you've", "you'll", "you'd", 'your', 'yours', 'yourself', 'yourselves', 'he', 'him', 'his']


In [81]:
# Lemmatization is used here instead of stemming; the reason is explained in the stemming cell further down

lm = WordNetLemmatizer()
corpus = []

for i in range(len(data)):
    # Intended to strip unwanted characters. Note: without square brackets the pattern
    # only matches the literal text "a-zA-z0-9", so this call leaves the title unchanged
    review = re.sub("a-zA-z0-9", " ", data["title"][i])

    # Converting to lower case
    review = review.lower()

    # Applying tokenization (whitespace split)
    review = review.split()

    # Removing stopwords and lemmatizing the remaining tokens
    review = [lm.lemmatize(y) for y in review if y not in stopwords]

    # Joining the tokens back into a single string
    review = " ".join(review)

    # Appending the processed title to the corpus list
    corpus.append(review)

In [82]:
print("Before Preprocessing:\n", data["title"][0])
print("\nAfter Preprocessing:\n", corpus[0])

# Comparing the text before and after preprocessing, the visible changes are lowercasing, stopword removal,
# and lemmatization (e.g. "threats" becomes "threat")

Before Preprocessing:
 LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]

After Preprocessing:
 law enforcement high alert following threat cop white 9-11by #blacklivesmatter #fyf911 terrorist [video]
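
The output above still contains `#`, `-`, and `[ ]`, which suggests that the pattern passed to `re.sub` is treated as a literal string rather than as a character class. The sketch below illustrates the difference on a shortened title. The bracketed, negated class is an assumption about the intended pattern and is not taken from the notebook.

```python
import re

title = "Threats Against Cops On 9-11By #BlackLivesMatter [VIDEO]"

# Without square brackets the pattern is the literal text "a-zA-z0-9",
# which does not occur in the title, so nothing is replaced
literal = re.sub("a-zA-z0-9", " ", title)

# A negated character class replaces every character that is not a
# letter or digit (assumed intent; note A-Z rather than A-z)
cleaned = re.sub("[^a-zA-Z0-9]", " ", title)

print(literal)
print(cleaned.split())
```

Because `TfidfVectorizer` applies its own token pattern later, the practical effect on the features may be small, but switching patterns would change the recorded outputs and scores.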


In [83]:
# Here, stemming is used for comparison

port_stem = PorterStemmer()
crps = []

for i in range(len(data)):
    # Replacing every character that is not a letter with a space
    con = re.sub("[^a-zA-Z]", " ", data["title"][i])

    # Converting to lower case
    con = con.lower()

    # Applying tokenization (whitespace split)
    con = con.split()

    # Removing stopwords and stemming the remaining tokens
    con = [port_stem.stem(x) for x in con if x not in stopwords]

    # Joining the tokens back into a single string after stemming
    con = " ".join(con)

    # Appending the processed title to the crps list
    crps.append(con)

In [84]:
print("Before Preprocessing:\n", data["title"][0])
print("\nAfter Preprocessing:\n", crps[0])

# Comparing the text before and after preprocessing shows that stemming is less suitable here: it is a more
# aggressive approach than lemmatization and can lose part of a word's meaning and context
# (e.g. "enforcement" becomes "enforc")

Before Preprocessing:
 LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]

After Preprocessing:
 law enforc high alert follow threat cop white blacklivesmatt fyf terrorist video


## Vectorization - Converting Text Data Into Vector Form

In [24]:
# In here TF-IDF will be used to convert the text data into vector form

tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
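
To make the numbers inside such a matrix less opaque, here is a minimal pure-Python sketch of how one TF-IDF row can be computed. It assumes the smoothed IDF formula `ln((1 + n) / (1 + df)) + 1` and L2 normalisation, which scikit-learn documents as the defaults for `TfidfVectorizer`. The toy corpus and the whitespace tokenizer are illustrative simplifications, not the notebook's data.

```python
import math
from collections import Counter

# Toy corpus standing in for the preprocessed titles (hypothetical data)
docs = ["trump wins vote", "vote fraud claim", "trump claim"]
tokenised = [doc.split() for doc in docs]  # simplified whitespace tokenizer
n_docs = len(tokenised)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(term for tokens in tokenised for term in set(tokens))


def tfidf_row(tokens):
    """TF-IDF weights for one document, L2-normalised."""
    term_freq = Counter(tokens)
    # Smoothed IDF: ln((1 + n) / (1 + df)) + 1
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


row = tfidf_row(tokenised[0])
for term, weight in sorted(row.items(), key=lambda kv: -kv[1]):
    print(f"{term:6s} {weight:.3f}")
```

In this toy corpus "wins" occurs in only one document, so it receives a higher weight than "trump" or "vote", which each occur in two.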

In [25]:
y = data["label"]

## Model Building - With A Train Test Split Ratio 7:3

#### In this section, several operations will be performed, such as:
    - Train Test Split
    - Model Object Initialization
    - Training and Testing of Model

#### Train Test Split

In [26]:
# Use train_test_split() function to split our data into 70% for training and 30% for testing

X_train, x_test, Y_train, y_test = train_test_split(x, y, test_size = 0.3, random_state = 50, stratify = y)

In [27]:
len(X_train), len(Y_train)

(50493, 50493)

In [28]:
len(x_test), len(y_test)

(21641, 21641)
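
The lengths printed above follow from the dataset size and `test_size`. A small sketch of the arithmetic, assuming the test count is rounded up and the remainder goes to training; that rounding rule is an assumption here, although it reproduces the numbers shown. `split_sizes` is a hypothetical helper, not a scikit-learn function.

```python
import math


def split_sizes(n_samples, test_size):
    """Return (n_train, n_test) for a fractional test_size."""
    # Assumed rounding: test count rounded up, remainder used for training
    n_test = math.ceil(test_size * n_samples)
    return n_samples - n_test, n_test


n_train, n_test = split_sizes(72134, 0.3)
print(n_train, n_test)
```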

#### Model Object Initialization - RandomForestClassifier()

In [29]:
# Our training data is sent into RandomForestClassifier in this instance to train our model
# RandomForestClassifier is used because it's a popular machine learning model that can be used for a wide range of
# classification tasks, including fake news detection

rf = RandomForestClassifier()
rf.fit(X_train, Y_train)

#### Model Object Initialization - DecisionTreeClassifier()

In [30]:
# Our training data is sent into DecisionTreeClassifier in this instance to train our model
# DecisionTreeClassifier is used because it is a relatively simple model and is less prone to overfitting when the
# depth of the tree is appropriately controlled (no max_depth is set here, so the default unlimited depth applies).

model = DecisionTreeClassifier()
model.fit(X_train, Y_train)

#### Model Object Initialization - LogisticRegression()

In [31]:
# Our training data is sent into LogisticRegression in this instance to train our model
# LogisticRegression is used because of its simplicity and efficiency since it is a simple linear model, which means 
# it is computationally efficient and easy to implement.

lrm = LogisticRegression()
lrm.fit(X_train, Y_train)

## Model Evaluation

#### In this section, several operations will be performed, such as:
    - Accuracy Score
    - Confusion Matrix
    - Classification Report

In [32]:
# Calculating the model accuracy for the first model trained - RandomForestClassifier

y_pred = rf.predict(x_test)
rf.score(x_test, y_test)

0.9063814056651726

In [33]:
# Calculating the model accuracy for the second model trained - DecisionTreeClassifier

y_pre = model.predict(x_test)
model.score(x_test, y_test)

0.8697379973199021

In [34]:
# Calculating the model accuracy for the third model trained - LogisticRegression

pred = lrm.predict(x_test)
lrm.score(x_test, y_test)

0.901298461254101

In [35]:
# Creating a class to evaluate the trained model with three different evaluation metrics, Accuracy Score, Confusion
# Matrix, and Classification Report 

class Evaluation:
    
    def __init__(self, model, X_train, x_test, Y_train, y_test):
        self.model = model
        self.x_train = X_train
        self.x_test = x_test
        self.y_train = Y_train
        self.y_test = y_test
        
    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)
        
        # Calculating the accuracy score based on the test labels and the predicted labels
        acc_scr_test = accuracy_score(self.y_test, y_pred_test)
        print("Accuracy Score On Testing Data Set:", acc_scr_test)
        print()
        
        # Calculating the confusion matrix based on the test labels and the predicted labels
        con_mat_test = confusion_matrix(self.y_test, y_pred_test)
        print("Confusion Matrix On Testing Data Set:\n", con_mat_test)
        print()
        
        # Calculating the classification report based on the test labels and the predicted labels
        class_rep_test = classification_report(self.y_test, y_pred_test)
        print("Classification Report On Testing Data Set:\n", class_rep_test)

In [36]:
# Model evaluation for the first trained model - RandomForestClassifier()

evaluation = Evaluation(rf, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.9063814056651726

Confusion Matrix On Testing Data Set:
 [[ 9385  1124]
 [  902 10230]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90     10509
           1       0.90      0.92      0.91     11132

    accuracy                           0.91     21641
   macro avg       0.91      0.91      0.91     21641
weighted avg       0.91      0.91      0.91     21641
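
The figures in the classification report can be derived by hand from the confusion matrix. The sketch below uses the RandomForestClassifier matrix printed above and assumes the scikit-learn layout, where rows are true labels and columns are predicted labels. The helper functions are illustrative.

```python
# Confusion matrix copied from the RandomForestClassifier output above
# (rows = true label, columns = predicted label)
cm = [[9385, 1124],
      [902, 10230]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total


def precision(cm, k):
    # Correct predictions of class k divided by all predictions of class k
    return cm[k][k] / sum(row[k] for row in cm)


def recall(cm, k):
    # Correct predictions of class k divided by all true members of class k
    return cm[k][k] / sum(cm[k])


def f1(cm, k):
    p, r = precision(cm, k), recall(cm, k)
    return 2 * p * r / (p + r)


print(f"accuracy: {accuracy:.4f}")
for k in (0, 1):
    print(k, round(precision(cm, k), 2), round(recall(cm, k), 2), round(f1(cm, k), 2))
```

Rounded to two decimals, these values agree with the report shown above.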



In [37]:
# Model evaluation for the second trained model - DecisionTreeClassifier()

evaluation = Evaluation(model, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.8697379973199021

Confusion Matrix On Testing Data Set:
 [[9043 1466]
 [1353 9779]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.87      0.86      0.87     10509
           1       0.87      0.88      0.87     11132

    accuracy                           0.87     21641
   macro avg       0.87      0.87      0.87     21641
weighted avg       0.87      0.87      0.87     21641



In [38]:
# Model evaluation for the third trained model - LogisticRegression()

evaluation = Evaluation(lrm, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.901298461254101

Confusion Matrix On Testing Data Set:
 [[ 9287  1222]
 [  914 10218]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.91      0.88      0.90     10509
           1       0.89      0.92      0.91     11132

    accuracy                           0.90     21641
   macro avg       0.90      0.90      0.90     21641
weighted avg       0.90      0.90      0.90     21641



## Model Building - With A Train Test Split Ratio 8:2

#### In this section, several operations will be performed, such as:
    - Train Test Split
    - Model Object Initialization
    - Training and Testing of Model

#### Train Test Split

In [39]:
# Use train_test_split() function to split our data into 80% for training and 20% for testing

x_train, X_test, y_train, Y_test = train_test_split(x, y, test_size = 0.2, random_state = 50, stratify = y)

In [40]:
len(x_train), len(y_train)

(57707, 57707)

In [41]:
len(X_test), len(Y_test)

(14427, 14427)

#### Model Object Initialization - RandomForestClassifier()

In [42]:
# Our training data is sent into RandomForestClassifier in this instance to train our model 
# RandomForestClassifier is used because it's a popular machine learning model that can be used for a wide range of 
# classification tasks, including fake news detection

rfc = RandomForestClassifier()
rfc.fit(x_train, y_train)

#### Model Object Initialization - DecisionTreeClassifier()

In [43]:
# Our training data is sent into DecisionTreeClassifier in this instance to train our model 
# DecisionTreeClassifier is used because it is a relatively simple model and is less prone to overfitting when the 
# depth of the tree is appropriately controlled (no max_depth is set here, so the default unlimited depth applies).

mdl = DecisionTreeClassifier()
mdl.fit(x_train, y_train)

#### Model Object Initialization - LogisticRegression()

In [44]:
# Our training data is sent into LogisticRegression in this instance to train our model 
# LogisticRegression is used because of its simplicity and efficiency since it is a simple linear model, which means 
# it is computationally efficient and easy to implement.

lr = LogisticRegression()
lr.fit(x_train, y_train)

## Model Evaluation

#### In this section, several operations will be performed, such as:
    - Accuracy Score
    - Confusion Matrix
    - Classification Report

In [45]:
# Calculating the model accuracy for the first model trained - RandomForestClassifier

y_prd = rfc.predict(X_test)
rfc.score(X_test, Y_test)

0.9078117418728773

In [46]:
# Calculating the model accuracy for the second model trained - DecisionTreeClassifier

y_pr = mdl.predict(X_test)
mdl.score(X_test, Y_test)

0.8714909544603868

In [47]:
# Calculating the model accuracy for the third model trained - LogisticRegression

prediction = lr.predict(X_test)
lr.score(X_test, Y_test)

0.9027517848478547

In [48]:
# Model evaluation for the first trained model - RandomForestClassifier()

evaluation = Evaluation(rfc, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.9078117418728773

Confusion Matrix On Testing Data Set:
 [[6248  758]
 [ 572 6849]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.92      0.89      0.90      7006
           1       0.90      0.92      0.91      7421

    accuracy                           0.91     14427
   macro avg       0.91      0.91      0.91     14427
weighted avg       0.91      0.91      0.91     14427



In [49]:
# Model evaluation for the second trained model - DecisionTreeClassifier()

evaluation = Evaluation(mdl, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.8714909544603868

Confusion Matrix On Testing Data Set:
 [[6041  965]
 [ 889 6532]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.87      0.86      0.87      7006
           1       0.87      0.88      0.88      7421

    accuracy                           0.87     14427
   macro avg       0.87      0.87      0.87     14427
weighted avg       0.87      0.87      0.87     14427



In [50]:
# Model evaluation for the third trained model - LogisticRegression()

evaluation = Evaluation(lr, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

Accuracy Score On Testing Data Set: 0.9027517848478547

Confusion Matrix On Testing Data Set:
 [[6188  818]
 [ 585 6836]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.91      0.88      0.90      7006
           1       0.89      0.92      0.91      7421

    accuracy                           0.90     14427
   macro avg       0.90      0.90      0.90     14427
weighted avg       0.90      0.90      0.90     14427
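
The six accuracy scores reported so far can be gathered in one place to make the comparison explicit. This is only a convenience sketch: the values are copied from the outputs above, and the dictionary layout is an illustrative choice.

```python
# Accuracy scores copied from the evaluation outputs above
accuracy = {
    ("RandomForestClassifier", "7:3"): 0.9063814056651726,
    ("DecisionTreeClassifier", "7:3"): 0.8697379973199021,
    ("LogisticRegression", "7:3"): 0.901298461254101,
    ("RandomForestClassifier", "8:2"): 0.9078117418728773,
    ("DecisionTreeClassifier", "8:2"): 0.8714909544603868,
    ("LogisticRegression", "8:2"): 0.9027517848478547,
}

# Rank all (model, split) combinations from best to worst
ranking = sorted(accuracy.items(), key=lambda kv: kv[1], reverse=True)
for (name, split), score in ranking:
    print(f"{name:24s} {split}  {score:.4f}")

best = ranking[0][0]
gap_to_second_model = accuracy[("RandomForestClassifier", "8:2")] - accuracy[("LogisticRegression", "8:2")]
print("best:", best, "| gap to LogisticRegression (8:2):", round(gap_to_second_model, 4))
```

The gap between Random Forest and Logistic Regression is about half a percentage point on a single split with one `random_state`, so the ranking of those two should be read with some caution.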



In [51]:
print("\033[1mThe Below Evaluation Will Be Based On A Train Test Split Ratio Of 7:3 \n\033[0m")
# Model evaluation for the first trained model with 7:3 train test split ratio - RandomForestClassifier()
print("\033[1mRandomForestClassifier\033[0m")
evaluation = Evaluation(rf, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()

# Model evaluation for the second trained model with 7:3 train test split ratio - DecisionTreeClassifier()
print("\033[1m\nDecisionTreeClassifier\033[0m")
evaluation = Evaluation(model, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()

# Model evaluation for the third trained model with 7:3 train test split ratio - LogisticRegression()
print("\033[1m\nLogisticRegression\033[0m")
evaluation = Evaluation(lrm, X_train, x_test, Y_train, y_test)
evaluation.test_evaluation()


print("\033[1m\n\nThe Below Evaluation Will Be Based On A Train Test Split Ratio Of 8:2 \n\033[0m")
# Model evaluation for the first trained model with 8:2 train test split ratio - RandomForestClassifier()
print("\033[1mRandomForestClassifier\033[0m")
evaluation = Evaluation(rfc, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

# Model evaluation for the second trained model with 8:2 train test split ratio - DecisionTreeClassifier()
print("\033[1m\nDecisionTreeClassifier\033[0m")
evaluation = Evaluation(mdl, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

# Model evaluation for the third trained model with 8:2 train test split ratio - LogisticRegression()
print("\033[1m\nLogisticRegression\033[0m")
evaluation = Evaluation(lr, x_train, X_test, y_train, Y_test)
evaluation.test_evaluation()

print("\033[1m\n\n∴ We can notice that the model with the highest accuracy is RandomForestClassifier with an 8:2 Train Test Split Ratio \033[0m")

The Below Evaluation Will Be Based On A Train Test Split Ratio Of 7:3

RandomForestClassifier
Accuracy Score On Testing Data Set: 0.9063814056651726

Confusion Matrix On Testing Data Set:
 [[ 9385  1124]
 [  902 10230]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.91      0.89      0.90     10509
           1       0.90      0.92      0.91     11132

    accuracy                           0.91     21641
   macro avg       0.91      0.91      0.91     21641
weighted avg       0.91      0.91      0.91     21641


DecisionTreeClassifier
Accuracy Score On Testing Data Set: 0.8697379973199021

Confusion Matrix On Testing Data Set:
 [[9043 1466]
 [1353 9779]]

Classification Report On Testing Data Set:
               precision    recall  f1-score   support

           0       0.87      0.86      0.87     10509
           1       0.87      0.88      0.87     11132

    accuracy                

## Prediction On Given Data

In [85]:
# Creating a function named lemmatization for Prediction Pipeline

def lemmatization(content):

    # Intended to strip unwanted characters. Note: without square brackets the pattern
    # only matches the literal text "a-zA-z0-9", so this call leaves the input unchanged
    con = re.sub("a-zA-z0-9", " ", content)

    # Converting to lower case
    con = con.lower()

    # Applying tokenization (whitespace split)
    con = con.split()

    # Removing stopwords and lemmatizing the remaining tokens
    con = [lm.lemmatize(z) for z in con if z not in stopwords]

    # Joining the tokens back into a single string
    con = " ".join(con)

    # Return the processed string
    return con

In [89]:
# Save the fitted TF-IDF vectorizer to disk
with open("vector.pkl", "wb") as f:
    pickle.dump(tf, f)

# Save the trained RandomForestClassifier (8:2 split) to disk
with open("rfc.pkl", "wb") as f:
    pickle.dump(rfc, f)

# Load the vectorizer back from its pickle file
with open("vector.pkl", "rb") as f:
    vector_form = pickle.load(f)

# Load the trained model back from its pickle file
with open("rfc.pkl", "rb") as f:
    load_model = pickle.load(f)

In [90]:
# Creating a function to classify whether the news is fake or real

def fake_news(news):
    news = lemmatization(news)
    input_data = [news]
    vector_form1 = vector_form.transform(input_data)
    prediction = load_model.predict(vector_form1)
    return prediction

In [95]:
val = fake_news(""" 'Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid' """)

In [96]:
if val == [0]:
    print("Reliable")
else:
    print("Not Reliable")
    
print("\n∴ It is \033[1mTRUE\033[0m that the news above is authentic")

Reliable

∴ It is TRUE that the news above is authentic


#### Comparing results obtained to existing research
- Compared with the existing research: https://ieeexplore.ieee.org/abstract/document/9797147/metrics#metrics
    1. Of the five algorithms the authors used (Logistic Regression, Naïve Bayes, Support Vector Machine, Random Forest, and Decision Tree), they reported that Naïve Bayes achieved the highest accuracy for predicting fake news, followed by Decision Tree and then Support Vector Machine, which implies that Random Forest and Logistic Regression took the last two places. In this project, by contrast, the best-performing model was RandomForestClassifier with an accuracy of 90.8%, whereas DecisionTreeClassifier scored 87.1%, the lowest of the three models trained.

#### Challenges faced
- Data preprocessing:
    1. When I used stemming instead of lemmatization, I found that it was not suitable, since it is a more aggressive approach than lemmatization and can lose part of a word's meaning and context. Hence, lemmatization was used instead of stemming during the data preprocessing phase.

#### Future work
##### Potential avenues for future research or improvements:
- Multimodal analysis
    1. Expand the fake news detection model to handle multiple modalities, including text, images, and videos. Combining information from different modalities can improve the accuracy of fake news detection, as it can capture inconsistencies between textual content and accompanying media.
- Transfer learning
    1. Leverage pre-trained language models (e.g., BERT) and fine-tune them on the fake news detection task. Transfer learning can help the model perform better, especially when labeled data is limited.

##### Limitations of current approach:
- Model selection 
    1. Random Forest, Logistic Regression, and Decision Tree models have their limitations. They may not capture complex patterns and relationships in the data as effectively as deep learning models.
- Temporal aspect 
    1. Fake news evolves rapidly, and our models may not capture the most recent trends in misinformation. Hence, the models need to be updated and retrained regularly with fresh data.

## Keyword Extraction

#### In this section, several operations will be performed, such as:
    - Data Preprocessing which includes:
        - Regular Expression
        - Lower Case
    - Creating IDF
    - TfidfTransformer (Use For Computing IDF)
    - Computing TF-IDF

## Data Preprocessing

In [110]:
# Creating a function for text pre-processing

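def pre_process(text):

    # Intended to strip unwanted characters. Note: without square brackets the pattern
    # only matches the literal text "a-zA-z0-9", so this call leaves the text unchanged
    text = re.sub("a-zA-z0-9", " ", text)

    # Removing HTML-style tags such as <p> or </div>
    text = re.sub("</?.*?>", " ", text)

    # Converting to lower case
    text = text.lower()

    # Return the processed text
    return text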

In [111]:
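# Join title and body with a space so the last word of the title
# does not run into the first word of the text
data["txt"] = data["title"] + " " + data["text"]
data["txt"] = data["txt"].apply(pre_process)

data["txt"][1]

'no text did they post their votes for hillary already?'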

## Creating IDF - Using CountVectorizer To Create A Vocabulary And Generate Word Counts

In [112]:
# Creating a function to get stop words

def get_stop_words(stop_file_path):
    """ load stop words """
    
    with open(stop_file_path, "r", encoding = "utf-8") as f:
        stopwords = f.readlines()
        stop_set = set(m.strip() for m in stopwords)
        return frozenset(stop_set)

# Getting the text column 
docs=data["text"].tolist()

# Create a vocabulary of words, ignoring stopwords and words that appear in more than 85% of documents

cv = CountVectorizer(max_df = 0.85, stop_words = stopwords)
word_count_vector = cv.fit_transform(docs)

In [113]:
# Using .shape to check the number of rows (documents) and columns (vocabulary terms)

word_count_vector.shape

(72134, 243356)

In [114]:
# Limiting our vocabulary size to 20,000

cv = CountVectorizer(max_df = 0.85, stop_words = stopwords, max_features=20000)
word_count_vector = cv.fit_transform(docs)
word_count_vector.shape

(72134, 20000)

## TfidfTransformer - Use To Compute Inverse Document Frequency (IDF)

In [115]:
tfidf_transformer = TfidfTransformer(smooth_idf = True, use_idf = True)
tfidf_transformer.fit(word_count_vector)

In [116]:
tfidf_transformer.idf_

array([5.62951629, 2.7827752 , 9.00824081, ..., 9.1417722 , 9.00824081,
       8.9281981 ])
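
The IDF values above can be read back as document frequencies. The sketch below assumes the smoothed formula `idf = ln((1 + n) / (1 + df)) + 1`, which scikit-learn documents for `smooth_idf = True`, and uses n = 72134 from the shape of `word_count_vector`. Both helper functions are hypothetical names introduced for illustration.

```python
import math

N_DOCS = 72134  # number of rows in word_count_vector, as reported above


def smooth_idf(df, n_docs=N_DOCS):
    """Smoothed IDF for a term that occurs in df documents."""
    return math.log((1 + n_docs) / (1 + df)) + 1


def df_from_idf(idf, n_docs=N_DOCS):
    """Invert the smoothed IDF formula to recover the document frequency."""
    return round((1 + n_docs) / math.exp(idf - 1) - 1)


# IDF values copied from the array printed above
for value in (5.62951629, 2.7827752, 9.00824081):
    print(value, "->", df_from_idf(value), "documents")
```

Under this assumption an IDF near 9 corresponds to a term found in only a couple of dozen documents, while an IDF below 3 corresponds to a term found in thousands of them.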

## Computing TF-IDF and Extracting Keywords

In [117]:
def sort_coo(coo_matrix):
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)

def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Get the feature names and tf-idf scores of the top n items."""

    # Use only the top n items from the vector
    sorted_items = sorted_items[:topn]

    # Map each feature name to its rounded tf-idf score
    results = {}
    for idx, score in sorted_items:
        results[feature_names[idx]] = round(score, 3)

    return results
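To see how the two helpers work together without loading the full dataset, the sketch below runs them on a tiny stand-in for a COO matrix. The `SimpleNamespace` object, the feature names, and the scores are all hypothetical. It only mimics the `.col` and `.data` attributes that `sort_coo` reads, and `extract_topn_from_vector` is written in a compact form so the block stays self-contained.

```python
from types import SimpleNamespace


def sort_coo(coo_matrix):
    """Sort (column index, score) pairs by score, then index, descending."""
    tuples = zip(coo_matrix.col, coo_matrix.data)
    return sorted(tuples, key=lambda x: (x[1], x[0]), reverse=True)


def extract_topn_from_vector(feature_names, sorted_items, topn=10):
    """Map the top n feature names to their rounded scores."""
    return {feature_names[idx]: round(score, 3) for idx, score in sorted_items[:topn]}


# Hypothetical vocabulary and tf-idf scores for a single document
feature_names = ["cops", "radio", "show", "white"]
toy_vector = SimpleNamespace(col=[0, 1, 2, 3], data=[0.41, 0.2337, 0.41, 0.55])

sorted_items = sort_coo(toy_vector)
keywords = extract_topn_from_vector(feature_names, sorted_items, topn=3)

for word, score in keywords.items():
    print(word, score)
```

Because the sort key is `(score, index)` in reverse order, terms with equal scores are ordered by descending column index, so "show" is listed before "cops" here.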

In [118]:
# Getting the feature names out using cv initialized earlier
feature_names=cv.get_feature_names_out()

# Get the document that I want to extract keywords from
doc = docs[0]

# Generate tf-idf for the given document
tf_idf_vector = tfidf_transformer.transform(cv.transform([doc]))

# Sort the tf-idf vectors by descending order of scores
sorted_items = sort_coo(tf_idf_vector.tocoo())

# Extract only the top n; n here is 15
keywords = extract_topn_from_vector(feature_names, sorted_items, 15)

# Print the results
print("\n=====Title=====")
print(data["title"].iloc[0])
print("\n=====Body=====")
print(doc)
print("\n===Keywords===")
for k in keywords:
    print(k,keywords[k])


=====Title=====
LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]

=====Body=====
No comment is expected from Barack Obama Members of the #FYF911 or #FukYoFlag and #BlackLivesMatter movements called for the lynching and hanging of white people and cops. They encouraged others on a radio show Tuesday night to  turn the tide  and kill white people and cops to send a message about the killing of black people in America.One of the F***YoFlag organizers is called  Sunshine.  She has a radio blog show hosted from Texas called,  Sunshine s F***ing Opinion Radio Show. A snapshot of her #FYF911 @LOLatWhiteFear Twitter page at 9:53 p.m. shows that she was urging supporters to  Call now!! #fyf911 tonight we continue to dismantle the illusion of white Below is a SNAPSHOT Twitter Radio Call Invite   #FYF911The radio show aired at 10:00 p.m. eastern standard time.During the show, callers clearly call for  lynching  and

## Attempted To Do Web Scraping

In [119]:
import requests
from bs4 import BeautifulSoup

# URL of the Stack Overflow page for the "python" tag sorted by votes
url = "https://stackoverflow.com/questions/tagged/python?tab=Votes"

# Send an HTTP GET request to the URL and stop early on an HTTP error status
response = requests.get(url)
response.raise_for_status()

# Parse the HTML content
soup = BeautifulSoup(response.text, "html.parser")

# Find all the question summary elements by their class names
question_summaries = soup.find_all(class_="s-post-summary js-post-summary")

# Loop through the question summaries and extract information
for summary in question_summaries:
    # Extract question title
    title_element = summary.find(class_="s-post-summary--content-title")
    title = title_element.text

    # Print question details
    print("Title:", title)

    # Find and extract the question excerpts (printed under the "Comments:" label)
    comments = summary.find_all(class_="s-post-summary--content")
    if comments:
        print("Comments:")
        for comment in comments:
            comment_text = comment.find(class_="s-post-summary--content-excerpt").text.strip()
            print("- " + comment_text)

    print("\n")

Title: 
What does the "yield" keyword do in Python?

Comments:
- What is the use of the yield keyword in Python? What does it do?
For example, I'm trying to understand this code1:
def _get_child_candidates(self, distance, min_dist, max_dist):
    if self._leftchild ...


Title: 
What does if __name__ == "__main__": do?

Comments:
- What does this do, and why should one include the if statement?
if __name__ == "__main__":
    print("Hello, World!")
If you are trying to close a question where someone should be ...


Title: 
Does Python have a ternary conditional operator?

Comments:
- Is there a ternary conditional operator in Python?


Title: 
What are metaclasses in Python?

Comments:
- What are metaclasses? What are they used for?


Title: 
How do I check whether a file exists without exceptions?

Comments:
- How do I check whether a file exists or not, without using the try statement?


Title: 
How do I merge two dictionaries in a single expression in Python?

Comments:
- I want to 

## Deployment of Model - Using Streamlit