**Fake News Classification with Natural Language Processing Techniques**
Fake news detection is an active research topic in natural language processing. We consume news through many channels every day, and it is often hard to tell which stories are authentic and which are fabricated. The goal here is to build a model that predicts whether a given news item is real or fake.

**Libraries Required**

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

In [None]:
import pandas as pd

# URL of the WELFake dataset on Zenodo
dataset_url = "https://zenodo.org/record/4561253/files/WELFake_Dataset.csv?download=1"

# Read only the first 1000 rows to keep the experiment small
df = pd.read_csv(dataset_url, nrows=1000)

# Save the extracted sample locally; index=False writes the header and one
# line per row without adding an extra index column
output_file_path = "extracted_data.csv"
df.to_csv(output_file_path, index=False)

# Display the first few rows of the extracted dataset
print(df.head())


   Unnamed: 0                                              title  \
0           0  LAW ENFORCEMENT ON HIGH ALERT Following Threat...   
1           1                                                NaN   
2           2  UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...   
3           3  Bobby Jindal, raised Hindu, uses story of Chri...   
4           4  SATAN 2: Russia unvelis an image of its terrif...   

                                                text  label  
0  No comment is expected from Barack Obama Membe...      1  
1     Did they post their votes for Hillary already?      1  
2   Now, most of the demonstrators gathered last ...      1  
3  A dozen politically active pastors came here f...      0  
4  The RS-28 Sarmat missile, dubbed Satan 2, will...      1  


**2. Data Analysis**

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1000 entries, 0 to 999
Data columns (total 4 columns):
 #   Column      Non-Null Count  Dtype 
---  ------      --------------  ----- 
 0   Unnamed: 0  1000 non-null   int64 
 1   title       989 non-null    object
 2   text        1000 non-null   object
 3   label       1000 non-null   int64 
dtypes: int64(2), object(2)
memory usage: 31.4+ KB


In [5]:
df['label'].value_counts()

1    534
0    466
Name: label, dtype: int64
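The counts above translate directly into class proportions, and into the accuracy that a trivial "always predict the majority label" rule would reach. That figure is a useful reference point for any later model score. A minimal sketch using only the two counts printed above (the variable names are illustrative, not from the notebook):

```python
# Label counts copied from the value_counts() output above.
label_counts = {1: 534, 0: 466}

total = sum(label_counts.values())

# Share of each label in the 1000-row sample.
proportions = {label: count / total for label, count in label_counts.items()}

# A rule that always predicts the most frequent label is correct on
# exactly that label's share of the rows.
majority_label = max(label_counts, key=label_counts.get)
baseline_accuracy = proportions[majority_label]

print(proportions)
print(majority_label, baseline_accuracy)
```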

In [7]:
df.isna().sum()

Unnamed: 0     0
title         11
text           0
label          0
dtype: int64
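Eleven titles are missing. How they are handled matters later, because calling `str()` on a missing value yields the literal text `nan`, which would then be treated as a word. A small sketch of that behaviour with a plain Python float, plus one possible alternative that maps missing titles to an empty string (the helper `title_to_text` is hypothetical, not part of the notebook):

```python
import math

missing = float("nan")  # a missing title is represented as a float NaN

# str() does not fail on NaN; it silently produces the word "nan".
as_text = str(missing)

def title_to_text(value):
    """Hypothetical helper: return '' for a missing title, else the title as str."""
    if isinstance(value, float) and math.isnan(value):
        return ""
    return str(value)

titles = ["SATAN 2: Russia unvelis an image", missing]
cleaned = [title_to_text(t) for t in titles]

print(as_text)
print(cleaned)
```

On a DataFrame, `df['title'].fillna('')` achieves the same result in one call.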

In [6]:
df.shape

(1000, 4)

In [11]:
df.reset_index(inplace=True)
df.head()

   index  Unnamed: 0                                              title  \
0      0           0  LAW ENFORCEMENT ON HIGH ALERT Following Threat...   
1      1           1                                                NaN   
2      2           2  UNBELIEVABLE! OBAMA’S ATTORNEY GENERAL SAYS MO...   
3      3           3  Bobby Jindal, raised Hindu, uses story of Chri...   
4      4           4  SATAN 2: Russia unvelis an image of its terrif...   

                                                text  label  
0  No comment is expected from Barack Obama Membe...      1  
1     Did they post their votes for Hillary already?      1  
2   Now, most of the demonstrators gathered last ...      1  
3  A dozen politically active pastors came here f...      0  
4  The RS-28 Sarmat missile, dubbed Satan 2, will...      1  


In [12]:
df['title'][0]

'LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]'

**3. Data Preprocessing**

**1. Tokenization**

In [14]:
sample_data = 'The quick brown fox jumps over the lazy dog'
sample_data = sample_data.split()
sample_data

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']
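`str.split()` separates on whitespace only, so punctuation stays attached to the neighbouring word. A short sketch contrasting it with a regular-expression tokenizer on one of the dataset's titles; the pattern `[A-Za-z0-9]+` is an illustrative choice, not something the notebook prescribes:

```python
import re

headline = "HILLARY WILL LAND IN PRISON, NOT THE OVAL OFFICE"

# Whitespace splitting keeps the comma glued to "PRISON".
whitespace_tokens = headline.split()

# Keeping only runs of letters and digits drops the punctuation.
regex_tokens = re.findall(r"[A-Za-z0-9]+", headline)

print(whitespace_tokens)
print(regex_tokens)
```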

**2. Make Lowercase**

In [15]:
sample_data = [data.lower() for data in sample_data]
sample_data

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

**3. Remove Stopwords**

In [27]:
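import nltk
nltk.download('stopwords')

# Load the English stop-word list. Reading it through nltk.corpus keeps the
# cell re-runnable; later cells use the name `stopwords` for this list.
stopwords = nltk.corpus.stopwords.words('english')
print(stopwords[0:10])
print(len(stopwords))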


['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
179


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
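The cell above only loads the list. Filtering tokens against it is a membership test, sketched here with a small hand-written stop-word set so the example runs without downloading NLTK data (the set `demo_stopwords` is an assumption for illustration; the notebook uses the full 179-word NLTK list):

```python
# Tiny stand-in for the NLTK English stop-word list (assumption for this demo).
demo_stopwords = {"the", "over", "a", "an", "and", "of"}

tokens = ['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

# Keep only tokens that are not stop words; a set makes each lookup cheap.
filtered = [tok for tok in tokens if tok not in demo_stopwords]

print(filtered)
```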


**4. Stemming**

In [19]:
ps = PorterStemmer()
sample_data_stemming = [ps.stem(data) for data in sample_data]
print(sample_data_stemming)

['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazi', 'dog']


**5. Lemmatization**

In [29]:
import nltk
nltk.download('wordnet')  # WordNet data is required by WordNetLemmatizer

lm = WordNetLemmatizer()
sample_data_lemma = [lm.lemmatize(data) for data in sample_data]
print(sample_data_lemma)


[nltk_data] Downloading package wordnet to /root/nltk_data...


['the', 'quick', 'brown', 'fox', 'jump', 'over', 'the', 'lazy', 'dog']


In [31]:
import re
from nltk.stem import WordNetLemmatizer

lm = WordNetLemmatizer()
corpus = []
for i in range(len(df)):
    # str() keeps the loop from failing on missing titles (NaN becomes the string 'nan')
    title = str(df['title'][i])
    # Replace every character that is not a letter or digit with a space
    review = re.sub('[^a-zA-Z0-9]', ' ', title)
    review = review.lower()
    review = review.split()
    # Drop stop words and lemmatize the remaining tokens
    review = [lm.lemmatize(x) for x in review if x not in stopwords]
    review = " ".join(review)
    corpus.append(review)


In [32]:
len(corpus)

1000

In [33]:
df['title'][0]

'LAW ENFORCEMENT ON HIGH ALERT Following Threats Against Cops And Whites On 9-11By #BlackLivesMatter And #FYF911 Terrorists [VIDEO]'

In [34]:
corpus[0]

'law enforcement high alert following threat cop white 9 11by blacklivesmatter fyf911 terrorist video'

**4. Text Data to Vector**

In [35]:
tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])
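To make the numbers inside that matrix less opaque, here is a pure-Python sketch of the weighting that scikit-learn documents for `TfidfVectorizer` with default settings: raw term counts, a smoothed inverse document frequency `ln((1 + n) / (1 + df)) + 1`, and L2 normalisation of each row. The three-document `toy_corpus` and the function name `tfidf_rows` are made up for illustration, and tokenisation here is a plain `split()`, which is simpler than the vectorizer's own token pattern.

```python
import math
from collections import Counter

toy_corpus = ["fake news alert", "news report", "fake alert alert"]

def tfidf_rows(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw))
        rows.append([v / norm for v in raw])
    return vocab, rows

vocab, rows = tfidf_rows(toy_corpus)
print(vocab)
for row in rows:
    print([round(v, 3) for v in row])
```

Terms that appear in fewer documents receive a larger IDF, so within the second toy document "report" outweighs "news".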

In [36]:
y = df['label']
y.head()

0    1
1    1
2    1
3    0
4    1
Name: label, dtype: int64

**Split the Data into Train and Test Sets**

In [37]:
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size=0.3, random_state=10, stratify=y)

In [38]:
len(x_train),len(y_train)

(700, 700)

In [39]:
len(x_test), len(y_test)

(300, 300)
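With `stratify=y`, each split keeps roughly the same label mix as the full sample. The expected per-class counts can be worked out from the label counts shown earlier. The rounded figures below are arithmetic only (scikit-learn's exact allocation could differ by a row in general), though here they agree with the support columns reported in the evaluation section.

```python
label_counts = {0: 466, 1: 534}   # from df['label'].value_counts()
test_size = 0.3

total = sum(label_counts.values())
n_test = round(total * test_size)
n_train = total - n_test

# Under stratification each class contributes its overall share to both splits.
expected_train = {lab: round(cnt * (1 - test_size)) for lab, cnt in label_counts.items()}
expected_test = {lab: label_counts[lab] - expected_train[lab] for lab in label_counts}

print(n_train, n_test)
print(expected_train, expected_test)
```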

**5. Model Building**

In [40]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)
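The vectorizer above was fitted on the whole corpus before the split, so the TF-IDF vocabulary and IDF weights have already seen the test titles. One common way to avoid that is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline` and fit it on training text only. The sketch below uses a made-up six-headline `toy_titles` list and arbitrary hyperparameters (`n_estimators=50`, `random_state=0`); it illustrates the wiring, not the notebook's actual results.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline

# Invented toy data, for illustration only.
toy_titles = [
    "shocking secret video exposed",
    "unbelievable scandal exposed video",
    "shocking truth they hide",
    "senate passes budget bill",
    "court reviews budget ruling",
    "senate committee reviews bill",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# Split the raw text first, so nothing is fitted on the test titles.
train_text, test_text, train_y, test_y = train_test_split(
    toy_titles, toy_labels, test_size=2, random_state=0, stratify=toy_labels
)

pipeline = Pipeline([
    ("tfidf", TfidfVectorizer()),  # fitted on training text only
    ("rf", RandomForestClassifier(n_estimators=50, random_state=0)),
])
pipeline.fit(train_text, train_y)

# Raw strings go in; the pipeline applies the fitted vectorizer itself.
predictions = pipeline.predict(test_text)
print(list(predictions))
```

A side benefit is that prediction code no longer needs a separate, globally shared vectorizer object.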

**6. Model Evaluation**

In [41]:
y_pred = rf.predict(x_test)
accuracy_score_ = accuracy_score(y_test,y_pred)
accuracy_score_

0.78

In [47]:
class Evaluation:

    def __init__(self,model,x_train,x_test,y_train,y_test):
        self.model = model
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test

    def train_evaluation(self):
        y_pred_train = self.model.predict(self.x_train)

        acc_scr_train = accuracy_score(self.y_train,y_pred_train)
        print("Accuracy Score On Training Data Set :",acc_scr_train)
        print()

        con_mat_train = confusion_matrix(self.y_train,y_pred_train)
        print("Confusion Matrix On Training Data Set :\n",con_mat_train)
        print()

        class_rep_train = classification_report(self.y_train,y_pred_train)
        print("Classification Report On Training Data Set :\n",class_rep_train)


    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)

        acc_scr_test = accuracy_score(self.y_test,y_pred_test)
        print("Accuracy Score On Testing Data Set :",acc_scr_test)
        print()

        con_mat_test = confusion_matrix(self.y_test,y_pred_test)
        print("Confusion Matrix On Testing Data Set :\n",con_mat_test)
        print()

        class_rep_test = classification_report(self.y_test,y_pred_test)
        print("Classification Report On Testing Data Set :\n",class_rep_test)


In [48]:
Evaluation(rf,x_train, x_test, y_train, y_test).train_evaluation()

Accuracy Score On Training Data Set : 1.0

Confusion Matrix On Training Data Set :
 [[326   0]
 [  0 374]]

Classification Report On Training Data Set :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00       326
           1       1.00      1.00      1.00       374

    accuracy                           1.00       700
   macro avg       1.00      1.00      1.00       700
weighted avg       1.00      1.00      1.00       700



In [49]:
#Checking the accuracy on testing dataset
Evaluation(rf,x_train, x_test, y_train, y_test).test_evaluation()

Accuracy Score On Testing Data Set : 0.78

Confusion Matrix On Testing Data Set :
 [[ 81  59]
 [  7 153]]

Classification Report On Testing Data Set :
               precision    recall  f1-score   support

           0       0.92      0.58      0.71       140
           1       0.72      0.96      0.82       160

    accuracy                           0.78       300
   macro avg       0.82      0.77      0.77       300
weighted avg       0.81      0.78      0.77       300
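Every figure in the report can be recomputed from the four cells of the test confusion matrix, which makes it easier to see that class 0 is the weak spot: only 81 of its 140 titles are recovered. A short sketch of that arithmetic (rows are true labels, columns are predicted labels, following scikit-learn's convention):

```python
# Test confusion matrix from above: rows = true label, columns = predicted label.
cm = [[81, 59],
      [7, 153]]

total = sum(sum(row) for row in cm)
accuracy = (cm[0][0] + cm[1][1]) / total

def precision_recall_f1(cm, cls):
    """Precision, recall and F1 for class index `cls` of a 2x2 matrix."""
    tp = cm[cls][cls]
    predicted = cm[0][cls] + cm[1][cls]   # column sum: everything predicted as cls
    actual = sum(cm[cls])                 # row sum: everything truly cls
    precision = tp / predicted
    recall = tp / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

print(round(accuracy, 2))
for cls in (0, 1):
    print(cls, [round(v, 2) for v in precision_recall_f1(cm, cls)])
```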



**Prediction pipeline**

In [50]:
class Preprocessing:

    def __init__(self,data):
        self.data = data

    def text_preprocessing_user(self):
        lm = WordNetLemmatizer()
        # Wrap the single input string in a list so the loop sees one document
        pred_data = [self.data]
        preprocess_data = []
        for data in pred_data:
            # Brackets make the pattern a negated character class
            review = re.sub('[^a-zA-Z0-9]',' ', data)
            review = review.lower()
            review = review.split()
            review = [lm.lemmatize(x) for x in review if x not in stopwords]
            review = " ".join(review)
            preprocess_data.append(review)
        return preprocess_data

In [54]:
df['title'][122]

'HILLARY WILL LAND IN PRISON, NOT THE OVAL OFFICE'

In [67]:
import re
import nltk
from nltk.stem import WordNetLemmatizer

class Preprocessing:
    def __init__(self, data):
        self.data = data

    def text_preprocessing_user(self):
        lm = WordNetLemmatizer()
        stopwords = set(nltk.corpus.stopwords.words('english'))

        # Wrap a single headline in a list; looping over a bare string
        # would process it one character at a time
        texts = [self.data] if isinstance(self.data, str) else self.data

        preprocess_data = []
        for data in texts:
            # Same cleaning pattern as the training corpus (letters and digits kept)
            review = re.sub('[^a-zA-Z0-9]', ' ', data)
            review = review.lower()
            review = review.split()
            review = [lm.lemmatize(word) for word in review if word not in stopwords]
            review = " ".join(review)
            preprocess_data.append(review)
        return preprocess_data

data ='HILLARY WILL LAND IN PRISON, NOT THE OVAL OFFICE'
preprocessed_data = Preprocessing(data).text_preprocessing_user()
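
One detail worth checking in any per-document loop like this: iterating over a bare Python string yields single characters, not documents, so a lone headline is normally wrapped in a list first. A minimal sketch of the difference (the helper `as_documents` is hypothetical):

```python
headline = "OVAL OFFICE"

# Looping over the string itself visits one character at a time.
chars = [piece for piece in headline]

def as_documents(data):
    """Hypothetical helper: wrap a single string so loops see whole documents."""
    return [data] if isinstance(data, str) else list(data)

docs_from_str = as_documents(headline)
docs_from_list = as_documents(["OVAL OFFICE", "LAND IN PRISON"])

print(chars[:4])
print(docs_from_str, docs_from_list)
```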


In [66]:
class Prediction:

    def __init__(self, pred_data, model):
        self.pred_data = pred_data
        self.model = model

    def prediction_model(self):
        # Clean the input with the same steps used for the training titles
        preprocess_data = Preprocessing(self.pred_data).text_preprocessing_user()
        # `tf` is the TfidfVectorizer fitted earlier in the notebook
        data = tf.transform(preprocess_data)
        prediction = self.model.predict(data)

        # Label 0 is reported as fake, label 1 as real
        if prediction[0] == 0:
            return "The News Is Fake"

        else:
            return "The News Is Real"

In [68]:
data = 'HILLARY WILL LAND IN PRISON, NOT THE OVAL OFFICE'
Prediction(data,rf).prediction_model()

'The News Is Real'

In [69]:
df['title'][3]

'Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid'

In [70]:
user_data = 'Bobby Jindal, raised Hindu, uses story of Christian conversion to woo evangelicals for potential 2016 bid'
Prediction(user_data,rf).prediction_model()

'The News Is Real'