# Fake News Classification Using NLP Techniques
This project builds a Fake News Classifier through a systematic workflow. <br>It covers problem definition, data collection, and preprocessing with lowercase conversion, tokenization, punctuation and stopword removal, and lemmatization. <br>The cleaned text is then transformed into numeric vectors with techniques such as Bag of Words and TF-IDF (this notebook uses TF-IDF). <br>A random forest classifier is then trained and evaluated using accuracy, a confusion matrix, and a classification report.

## Project Flow:
1. Data Gathering

2. Data Analysis

3. Data Preprocessing: operations applied to the title text
    A. Lowercase
    B. Tokenization
    C. Remove Punctuation
    D. Remove Stopwords
    E. Lemmatization

4. Vectorization (convert text data into vectors):
    A. TF-IDF

5. Data Splitting (train and test sets)

6. Model Building:
    A. Model Object Initialization
    B. Train and Test Model

7. Model Evaluation:
    A. Accuracy Score
    B. Confusion Matrix
    C. Classification Report

8. Model Deployment

9. Prediction on Some Data

# Importing Packages

In [51]:
import pandas as pd
import numpy as np
import string
from nltk.tokenize import word_tokenize
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

# 1. Data Gathering

In [2]:
df = pd.read_csv("News_dataset.csv")
df.head(2)

|   | id | title | author | text | label |
|---|----|-------|--------|------|-------|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |


# 2. Data Analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [4]:
df['label'].value_counts()

label
1    10413
0    10387
Name: count, dtype: int64

In [5]:
df.shape

(20800, 5)

In [6]:
df = df.drop(['id','text','author'],axis = 1)
df.head(2)

|   | title | label |
|---|-------|-------|
| 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | 0 |


In [7]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [8]:
df.isna().sum()

title    558
label      0
dtype: int64

In [9]:
df = df.dropna()  # Handle missing values by dropping rows with a missing title

In [10]:
df.isna().sum()

title    0
label    0
dtype: int64

In [11]:
df.shape

(20242, 2)

# 3. Data Preprocessing

## 1. Make Lowercase

In [12]:
LowerCaseWords = [sentence.lower() for sentence in df['title']]
print(LowerCaseWords[0])

house dem aide: we didn’t even see comey’s letter until jason chaffetz tweeted it


## 2. Tokenization

In [13]:
TokenizeWords = [word_tokenize(sentence) for sentence in LowerCaseWords]

In [14]:
print(TokenizeWords[0])
print(len(TokenizeWords[0]))

['house', 'dem', 'aide', ':', 'we', 'didn', '’', 't', 'even', 'see', 'comey', '’', 's', 'letter', 'until', 'jason', 'chaffetz', 'tweeted', 'it']
19
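
The output above shows `didn’t` split into `'didn'`, `'’'`, `'t'`, because the title uses a typographic apostrophe (U+2019). One possible refinement, which this notebook does not apply, is to normalize curly quotes to ASCII before tokenizing. The sketch below is only an illustration: `normalize_quotes` and `simple_tokenize` are hypothetical helpers, and the regex tokenizer is a dependency-free stand-in for `word_tokenize`, not a replacement for it.

```python
import re

# Map typographic quotes to their ASCII equivalents (illustrative subset).
QUOTE_MAP = str.maketrans({"\u2019": "'", "\u2018": "'", "\u201c": '"', "\u201d": '"'})

def normalize_quotes(text):
    """Replace curly quotes with straight ones so contractions stay intact."""
    return text.translate(QUOTE_MAP)

def simple_tokenize(text):
    """Stand-in tokenizer: words (with internal apostrophes) or single punctuation marks."""
    return re.findall(r"[a-z0-9]+(?:'[a-z0-9]+)*|[^\sa-z0-9]", text.lower())

title = "House Dem Aide: We Didn\u2019t Even See Comey\u2019s Letter"
tokens = simple_tokenize(normalize_quotes(title))
print(tokens)
```

With the quotes normalized, contractions survive as single tokens and can be matched against stopword entries such as `"didn't"` instead of leaving stray `t` and `s` fragments.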


## 3. Remove Punctuation

In [15]:
Punctuations = list(string.punctuation+'’')
print(Punctuations)
print(len(Punctuations))

['!', '"', '#', '$', '%', '&', "'", '(', ')', '*', '+', ',', '-', '.', '/', ':', ';', '<', '=', '>', '?', '@', '[', '\\', ']', '^', '_', '`', '{', '|', '}', '~', '’']
33


In [16]:
FinalWords = [[word for word in sentence if word not in Punctuations] for sentence in TokenizeWords]

In [17]:
print(FinalWords[0])
len(FinalWords[0])

['house', 'dem', 'aide', 'we', 'didn', 't', 'even', 'see', 'comey', 's', 'letter', 'until', 'jason', 'chaffetz', 'tweeted', 'it']


16

## 4. Remove Stopwords

In [18]:
Stopwords = stopwords.words('english')
print(Stopwords[0:10])
print(len(Stopwords))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
179


In [19]:
FinalWords1 = [[word for word in sentence if word not in Stopwords] for sentence in FinalWords]

In [20]:
print(FinalWords1[0])
len(FinalWords1[0])

['house', 'dem', 'aide', 'even', 'see', 'comey', 'letter', 'jason', 'chaffetz', 'tweeted']


10

## 5. Lemmatization

In [21]:
lm = WordNetLemmatizer()
LemmatizerWords = [[lm.lemmatize(word) for word in sentence] for sentence in FinalWords1]

In [22]:
print(LemmatizerWords[0])

['house', 'dem', 'aide', 'even', 'see', 'comey', 'letter', 'jason', 'chaffetz', 'tweeted']


In [23]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [24]:
print(LemmatizerWords[0])

['house', 'dem', 'aide', 'even', 'see', 'comey', 'letter', 'jason', 'chaffetz', 'tweeted']


## 6. Convert List Into String Again

In [26]:
def Conv2STR(sentence):
    Sentence = " ".join(sentence)
    return Sentence

In [27]:
FSentences = []

Sentences = LemmatizerWords

for sentence in Sentences:
    PSentence = Conv2STR(sentence)
    FSentences.append(PSentence)

## Final Results

In [28]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [34]:
print(LemmatizerWords[0])

['house', 'dem', 'aide', 'even', 'see', 'comey', 'letter', 'jason', 'chaffetz', 'tweeted']


In [35]:
print(FSentences[0])

house dem aide even see comey letter jason chaffetz tweeted
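
The separate steps above can also be folded into a single function, which makes it easier to apply identical preprocessing at training and prediction time. The following is a minimal sketch under stated assumptions: `STOPWORDS` is a tiny hand-picked set rather than NLTK's English list, the regex stands in for `word_tokenize`, and lemmatization is omitted because it needs the WordNet data.

```python
import re
import string

# Tiny illustrative stopword set; the notebook uses NLTK's full English list.
STOPWORDS = {"we", "it", "until", "didn", "t", "s", "the", "a", "on", "in"}
# Same idea as the notebook's Punctuations list: ASCII punctuation plus U+2019.
PUNCTUATION = set(string.punctuation) | {"\u2019"}

def preprocess(title):
    """Lowercase, tokenize, drop punctuation and stopwords, then rejoin."""
    tokens = re.findall(r"\w+|[^\w\s]", title.lower())
    kept = [t for t in tokens if t not in PUNCTUATION and t not in STOPWORDS]
    return " ".join(kept)

title = ("House Dem Aide: We Didn\u2019t Even See Comey\u2019s Letter "
         "Until Jason Chaffetz Tweeted It")
print(preprocess(title))
```

For this title the sketch reproduces the string printed by the notebook; other titles may differ because of the reduced stopword set and the missing lemmatization step.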


# 4. Vectorization (Convert Text data into the Vector)

In [36]:
tf = TfidfVectorizer()
X = tf.fit_transform(FSentences).toarray()

In [41]:
print(X[0])

[0. 0. 0. ... 0. 0. 0.]
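
To make the numbers in `X` less opaque, here is a small pure-Python sketch of the weighting that `TfidfVectorizer` applies with its default settings: raw term counts multiplied by a smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row. The `tfidf` helper and the three example documents are illustrative assumptions; unlike scikit-learn's default token pattern, the sketch splits on whitespace and does not drop one-character tokens.

```python
import math
from collections import Counter

def tfidf(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2-normalized rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(tokenized)
    # Document frequency: number of documents that contain each term.
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0
        rows.append([w / norm for w in raw])
    return vocab, rows

docs = ["house dem aide letter", "house letter tweeted", "airstrike civilians killed"]
vocab, rows = tfidf(docs)
first = dict(zip(vocab, rows[0]))
print({t: round(w, 3) for t, w in first.items() if w > 0})
```

Terms that occur in fewer documents (`dem`, `aide`) receive a larger weight than terms shared across documents (`house`, `letter`), and most entries of each row are zero, which is why `X[0]` prints as a row of zeros.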


In [42]:
y = df['label']
print(y.head())

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64


# 5. Splitting Data into Train and Test Sets

In [44]:
x_train, x_test, y_train, y_test = train_test_split(X,y, test_size = 0.3)

In [45]:
len(x_train),len(y_train)

(14169, 14169)

In [46]:
len(x_test), len(y_test)

(6073, 6073)
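
The split above uses no fixed seed, so the exact rows in each set, and therefore the scores further down, change from run to run. In scikit-learn, `train_test_split` accepts `random_state` and `stratify` arguments for this purpose. The sketch below shows the underlying idea in plain Python; `stratified_split` and the example labels are hypothetical and are not part of the notebook.

```python
import random
from collections import defaultdict

def stratified_split(labels, test_size=0.3, seed=42):
    """Return (train_idx, test_idx) that preserve class proportions."""
    rng = random.Random(seed)  # fixed seed makes the split repeatable
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    train_idx, test_idx = [], []
    for label in sorted(by_class):
        indices = by_class[label]
        rng.shuffle(indices)
        n_test = round(len(indices) * test_size)
        test_idx.extend(indices[:n_test])
        train_idx.extend(indices[n_test:])
    return sorted(train_idx), sorted(test_idx)

labels = [1, 0] * 50  # hypothetical balanced labels
train_idx, test_idx = stratified_split(labels)
print(len(train_idx), len(test_idx))
```

Calling the function again with the same seed returns the same indices, and each class contributes the same share to the test set.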

# 6. Building The Model

In [47]:
rf = RandomForestClassifier()
rf.fit(x_train, y_train)



# 7. Model Evaluation

In [48]:
y_pred = rf.predict(x_test)
accuracy_score_ = accuracy_score(y_test,y_pred) 
accuracy_score_



0.9395685822493002

In [60]:
class Evaluation:
    
    def __init__(self,model,x_train,x_test,y_train,y_test):
        self.model = model
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test
        
    def train_evaluation(self):
        y_pred_train = self.model.predict(self.x_train)
        
        acc_scr_train = accuracy_score(self.y_train,y_pred_train)
        print("Accuracy Score On Training Data Set :",acc_scr_train)
        print()
        
        con_mat_train = confusion_matrix(self.y_train,y_pred_train)
        print("Confusion Matrix On Training Data Set :\n",con_mat_train)
        print()
        
        class_rep_train = classification_report(self.y_train,y_pred_train)
        print("Classification Report On Training Data Set :\n",class_rep_train)
        
        
    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)
        
        acc_scr_test = accuracy_score(self.y_test,y_pred_test)
        print("Accuracy Score On Testing Data Set :",acc_scr_test)
        print()
        
        con_mat_test = confusion_matrix(self.y_test,y_pred_test)
        print("Confusion Matrix On Testing Data Set :\n",con_mat_test)
        print()
        
        class_rep_test = classification_report(self.y_test,y_pred_test)
        print("Classification Report On Testing Data Set :\n",class_rep_test)

In [61]:
# Check accuracy, confusion matrix, and classification report on the training set
Evaluation(rf,x_train, x_test, y_train, y_test).train_evaluation()

Accuracy Score On Training Data Set : 1.0

Confusion Matrix On Training Data Set :
 [[7271    0]
 [   0 6898]]

Classification Report On Training Data Set :
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7271
           1       1.00      1.00      1.00      6898

    accuracy                           1.00     14169
   macro avg       1.00      1.00      1.00     14169
weighted avg       1.00      1.00      1.00     14169





In [62]:
# Check accuracy, confusion matrix, and classification report on the testing set
Evaluation(rf,x_train, x_test, y_train, y_test).test_evaluation()

Accuracy Score On Testing Data Set : 0.9395685822493002

Confusion Matrix On Testing Data Set :
 [[2802  314]
 [  53 2904]]

Classification Report On Testing Data Set :
               precision    recall  f1-score   support

           0       0.98      0.90      0.94      3116
           1       0.90      0.98      0.94      2957

    accuracy                           0.94      6073
   macro avg       0.94      0.94      0.94      6073
weighted avg       0.94      0.94      0.94      6073
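
The figures in the classification report follow directly from the confusion matrix, where rows are true labels and columns are predicted labels. As a quick cross-check, the sketch below recomputes accuracy, precision, and recall from the test-set matrix printed above; `report_from_confusion` is a hypothetical helper written for this illustration.

```python
def report_from_confusion(cm):
    """Compute accuracy plus per-class precision and recall from a 2x2 matrix.

    Rows are true labels and columns are predicted labels.
    """
    total = sum(sum(row) for row in cm)
    accuracy = (cm[0][0] + cm[1][1]) / total
    metrics = {}
    for k in (0, 1):
        predicted_k = cm[0][k] + cm[1][k]  # column sum: everything predicted as k
        actual_k = sum(cm[k])              # row sum: everything truly labeled k
        metrics[k] = {
            "precision": cm[k][k] / predicted_k,
            "recall": cm[k][k] / actual_k,
        }
    return accuracy, metrics

cm = [[2802, 314], [53, 2904]]  # test-set confusion matrix from the output above
accuracy, metrics = report_from_confusion(cm)
print(round(accuracy, 4))
for k, m in metrics.items():
    print(k, round(m["precision"], 2), round(m["recall"], 2))
```

The asymmetry in the report comes from the off-diagonal cells: 314 titles labeled 0 were predicted as 1, while only 53 titles labeled 1 were predicted as 0.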





# Prediction Pipeline

In [70]:
class Preprocessing:
    
    def __init__(self,data):
        self.data = data
        
    def text_preprocessing_user(self):
        lm = WordNetLemmatizer()
        pred_data = [self.data]    
        preprocess_data = []
        for data in pred_data:
            review = data.lower()
            review = word_tokenize(review)
            review = [lm.lemmatize(x) for x in review if x not in Stopwords and x not in Punctuations]
            review = " ".join(review)
            preprocess_data.append(review)
        return preprocess_data    

In [71]:
df['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [72]:
data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
Preprocessing(data).text_preprocessing_user()

['flynn hillary clinton big woman campus breitbart']

In [73]:
class Prediction:
    
    def __init__(self,pred_data, model):
        self.pred_data = pred_data
        self.model = model
        
    def prediction_model(self):
        preprocess_data = Preprocessing(self.pred_data).text_preprocessing_user()
        data = tf.transform(preprocess_data)
        prediction = self.model.predict(data)
        
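        # Label mapping assumed by this notebook: 0 -> fake, 1 -> real.
        # Verify it against the dataset's label definition before relying on it.
        if prediction[0] == 0:
            return "The News Is Fake"

        else:
            return "The News Is Real"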
        

In [74]:
data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
Prediction(data,rf).prediction_model()

'The News Is Fake'

In [75]:
df['title'][3]

'15 Civilians Killed In Single US Airstrike Have Been Identified'

In [76]:
user_data = '15 Civilians Killed In Single US Airstrike Have Been Identified' 
Prediction(user_data,rf).prediction_model()

'The News Is Real'

# Thank You