# Fake News Classification with the help of NLP Technique

### Required Libraries

In [1]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from nltk.stem import PorterStemmer, WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

### Data Gathering

*Steps to Import the Dataset Correctly
1]Use the Raw URL: To read the CSV file directly, you need to use the raw content URL rather than the GitHub page URL. The raw URL for the dataset can be constructed by replacing github.com with raw.githubusercontent.com and removing the blob/ part.

2]Read the CSV File: Use pd.read_csv() with the corrected URL.

3]Additional Considerations
Check for Delimiters: If the CSV file uses a different delimiter (like a semicolon ; instead of a comma ,), you can specify the delimiter in the read_csv function using the sep parameter:
python
df = pd.read_csv(url, sep=';')

4]Inspect the Dataset: If you continue to encounter issues, consider downloading the dataset manually and inspecting it to understand its structure. This can help identify any irregularities in the data format.

5]Handling Bad Lines: If there are problematic lines in the CSV, you can use the error_bad_lines parameter (deprecated in newer versions, use on_bad_lines instead):
python
df = pd.read_csv(url, on_bad_lines='skip')



In [2]:
import pandas as pd

# Use the raw URL
url = 'https://raw.githubusercontent.com/AI-ML-LEARNING/Fake-News-Classification_NLP/main/News_dataset.csv'

# Attempt to read the CSV file
try:
    df = pd.read_csv(url)
    print(df.head(10))
except pd.errors.ParserError as e:
    print("ParserError:", e)
except Exception as e:
    print("An error occurred:", e)

   id                                              title  \
0   0  House Dem Aide: We Didn’t Even See Comey’s Let...   
1   1  FLYNN: Hillary Clinton, Big Woman on Campus - ...   
2   2                  Why the Truth Might Get You Fired   
3   3  15 Civilians Killed In Single US Airstrike Hav...   
4   4  Iranian woman jailed for fictional unpublished...   
5   5  Jackie Mason: Hollywood Would Love Trump if He...   
6   6  Life: Life Of Luxury: Elton John’s 6 Favorite ...   
7   7  Benoît Hamon Wins French Socialist Party’s Pre...   
8   8  Excerpts From a Draft Script for Donald Trump’...   
9   9  A Back-Channel Plan for Ukraine and Russia, Co...   

                         author  \
0                 Darrell Lucus   
1               Daniel J. Flynn   
2            Consortiumnews.com   
3               Jessica Purkiss   
4                Howard Portnoy   
5               Daniel Nussbaum   
6                           NaN   
7               Alissa J. Rubin   
8                       

### Data Analysis

In [3]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20800 entries, 0 to 20799
Data columns (total 5 columns):
 #   Column  Non-Null Count  Dtype 
---  ------  --------------  ----- 
 0   id      20800 non-null  int64 
 1   title   20242 non-null  object
 2   author  18843 non-null  object
 3   text    20761 non-null  object
 4   label   20800 non-null  int64 
dtypes: int64(2), object(3)
memory usage: 812.6+ KB


In [4]:
df['label'].value_counts()

label
1    10413
0    10387
Name: count, dtype: int64

In [5]:
df.shape

(20800, 5)

In [6]:
df.isna().sum()

id           0
title      558
author    1957
text        39
label        0
dtype: int64

*if the data set has numerical features or column with missing value they can be replaced by mean,median and mode

*but if dataset  is categorical(like yes,no,true,flase) in that case we replace with mode values

*but the dataset that i used is text(NLP),most of the time available data is text,so we cannot replace with mode or anything.In   that case we need to drop those columns



##### Handeled missing values

In [7]:
df = df.dropna()   #this drops the missing value rows

In [8]:
df.isna().sum()

id        0
title     0
author    0
text      0
label     0
dtype: int64

In [9]:
df.shape

(18285, 5)

In [10]:
df.reset_index(inplace=True)
df.head()

Unnamed: 0,index,id,title,author,text,label
0,0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,Darrell Lucus,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",Daniel J. Flynn,Ever get the feeling your life circles the rou...,0
2,2,2,Why the Truth Might Get You Fired,Consortiumnews.com,"Why the Truth Might Get You Fired October 29, ...",1
3,3,3,15 Civilians Killed In Single US Airstrike Hav...,Jessica Purkiss,Videos 15 Civilians Killed In Single US Airstr...,1
4,4,4,Iranian woman jailed for fictional unpublished...,Howard Portnoy,Print \nAn Iranian woman has been sentenced to...,1


df.reset_index(inplace=True)
This line resets the index of the DataFrame df. By default, when you create a DataFrame, it assigns an integer index starting from 0. However, if you have performed any operations that changed the index, such as filtering or grouping, this line will reset the index to a new sequence of integers.
The inplace=True parameter tells the reset_index() function to modify the DataFrame in place, rather than creating a new DataFrame. This means that the original DataFrame df will be updated with the new index.
After running this line, the DataFrame df will have a new index, and any previous index will be added as a new column.

In [11]:
#text is too much.For now we will work on title
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [12]:
df = df.drop(['id','text','author'],axis = 1)  #axis 1 means columns and 0 means rows
df.head()

Unnamed: 0,index,title,label
0,0,House Dem Aide: We Didn’t Even See Comey’s Let...,1
1,1,"FLYNN: Hillary Clinton, Big Woman on Campus - ...",0
2,2,Why the Truth Might Get You Fired,1
3,3,15 Civilians Killed In Single US Airstrike Hav...,1
4,4,Iranian woman jailed for fictional unpublished...,1


### Data  Pre-Processing

Repairing and cleaning the test data for machines to be abel to analyze it. This puts data in workable form and highlights features in the text that an algorithm can work on

##### 1.Tokenization

Token helps in understanding the context,also they help in interpreting the meaning of the text by analyzing the sequence of words

Two types of tokenization: word tokenization and sentence tokenization.

So here we see **word tokenization**.This is used to break the sentence into the separate words or tokens/ it is the process of splitting a string into a text or list of tokens



In [13]:
# take one sample data
sample_data = 'The quick brown fox jumps over the lazy dog'
sample_data = sample_data.split()
print(sample_data)
len(sample_data)

['The', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']


9

##### 2.Make Lower Case

Say for egs in our sample data, "the is repeated 2 times",one with T(capital) and t(small).

So if we go for next preprocessing without doing any alterations(maybe convert text data into numerics) so 2 different feature is created.Word is same but one  T(capital) is used and t(small) is used,that is why 2 separate feature is created..

Or sometimes there might be an typo in our dataset (the,The,THe,etc),here 3 different feature is created

To avoid this all are converted to small letter.Basically we use **list compherension** technique

In [14]:
sample_data = [data.lower() for data in sample_data]
sample_data

['the', 'quick', 'brown', 'fox', 'jumps', 'over', 'the', 'lazy', 'dog']

##### 3.Remove Stop Words

There are many words that occur frequently like:is,the,and etc

But in NLP pipeline we flag this words as stopwords

Stopwords are those words in the text which does not add any meaning to the sentence and their removal will not affect our preprocessing of the text.

Why do we remove this?They are removed from vocabulary to reduce the noise and also to reduce the dimension of the features.

In [15]:
# pip install nltk

In [16]:
# import nltk
# nltk.download('stopwords')

In [17]:
stopwords = stopwords.words('english')
print(stopwords[0:10])
print(len(stopwords))

['i', 'me', 'my', 'myself', 'we', 'our', 'ours', 'ourselves', 'you', "you're"]
179


In [18]:
sample_data = [data for data in sample_data if data not in stopwords ]
print(sample_data)
print(len(sample_data))

['quick', 'brown', 'fox', 'jumps', 'lazy', 'dog']
6


After removing the stopwords it reduced to 6 from 9(see 11th cell)

##### 4.Stemming

This is used to normalize the words into its base or root forms.

One problem with stemming is that sometimes it produces the root word which may not have any meaning at all 

When doing data pre processing either we go with stemming or we go with lemmetization

lemmetization is quite similar to stemming but it is used to group different inflected form of the words.It is called as lemma

the main main difference between stemming and lemmetization is that it produce the root word which has a meaning.But when speed is compared stemming is faster than lemmetization.Because in lemmetization scan a corgpus which consumes time and the pre processing

if building level application,then we go with lemmetization

In [19]:
# stemmig library used PorterStemmer
# create a object for stemming
ps = PorterStemmer()
sample_data_stemming = [ps.stem(data) for data in sample_data]
print(sample_data_stemming)

['quick', 'brown', 'fox', 'jump', 'lazi', 'dog']


##### 5.Lemmatization 

In [20]:
# pip install nltk

In [21]:
# import nltk
# nltk.download('wordnet')

In [22]:
# lemmetization library used WordNetLemmatizer
# create word lemetizer objects
lm = WordNetLemmatizer()
sample_data_lemmatizer = [lm.lemmatize(data) for data in sample_data]
print(sample_data_lemmatizer)

['quick', 'brown', 'fox', 'jump', 'lazy', 'dog']


**The above 5 operations are performed on our main dataset which is created with title and the label column**

In [23]:
df.shape    

(18285, 3)

In [24]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

Our for loop will be running for 18285 times

In [25]:
# now 
lm = WordNetLemmatizer()
corpus = [] #empty list created with corpus object
for i in range(len(df)):
    review = re.sub('^a-zA-Z0-9',' ',df['title'][i])#regular expression
    review = review.lower()
    review = review.split()
    review = [lm.lemmatize(x) for x in review if x not in stopwords]
    review = " ".join(review)
    corpus.append(review)

In [26]:
len(corpus)

18285

In [27]:
df['title'][0]

'House Dem Aide: We Didn’t Even See Comey’s Letter Until Jason Chaffetz Tweeted It'

In [28]:
corpus[0]

'house dem aide: didn’t even see comey’s letter jason chaffetz tweeted'

### Vectorizer (Convert Text data into Vector)

In [29]:
tf = TfidfVectorizer()
x = tf.fit_transform(corpus).toarray()
x

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

In [30]:
y = df['label']
y.head()

0    1
1    0
2    1
3    1
4    1
Name: label, dtype: int64

### Data Split into train and test

In [31]:
x_train,x_test,y_train,y_test = train_test_split(x,y,test_size=0.3,random_state=10,stratify=y)

In [32]:
len(x_train),len(y_train)  #xtrain is input(independent variable) &&&&&& ytrain is output(dependent)

(12799, 12799)

In [33]:
len(x_test),len(y_test)

(5486, 5486)

### Model Building

In [34]:
rf = RandomForestClassifier()
rf.fit(x_train,y_train)

### Model Evaluation

In [35]:
y_pred = rf.predict(x_test)
accuracyy = accuracy_score(y_test,y_pred)
accuracyy

0.9362012395187751

In [48]:
class Evaluation:
    def __init__(self,model,x_train,x_test,y_train,y_test):
        self.model = model
        self.x_train = x_train
        self.x_test = x_test
        self.y_train = y_train
        self.y_test = y_test
    
    
    def train_evaluation(self):
        y_pred_train = self.model.predict(self.x_train)
        
        acc_scr_train = accuracy_score(self.y_train,y_pred_train)
        print("Accuracy score on the training data set:\n",acc_scr_train)
        print()
        
        con_mat_train = confusion_matrix(self.y_train,y_pred_train)
        print("Confusion MAtrix on the training data set:\n",con_mat_train)
        print()
        
        class_rep_train = classification_report(self.y_train,y_pred_train)
        print("Classification Report on the training data set:\n",class_rep_train)
        print()
        
        
    
    def test_evaluation(self):
        y_pred_test = self.model.predict(self.x_test)
        
        acc_scr_test = accuracy_score(self.y_test,y_pred_test)
        print("Accuracy score on the testing data set:\n",acc_scr_test)
        print()
        
        con_mat_test = confusion_matrix(self.y_test,y_pred_test)
        print("Confusion MAtrix on the testing data set:\n",con_mat_test)
        print()
        
        class_rep_test = classification_report(self.y_test,y_pred_test)
        print("Classification Report on the testing data set:\n",class_rep_test)
        print()
        

##### Checking Accuracy on training dataset

In [49]:
Evaluation(rf,x_train,x_test,y_train,y_test).train_evaluation()

Accuracy score on the training data set:
 1.0

Confusion MAtrix on the training data set:
 [[7252    0]
 [   0 5547]]

Classification Report on the training data set:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      7252
           1       1.00      1.00      1.00      5547

    accuracy                           1.00     12799
   macro avg       1.00      1.00      1.00     12799
weighted avg       1.00      1.00      1.00     12799




#####  Checking Accuracy on testing dataset

In [50]:
Evaluation(rf,x_train,x_test,y_train,y_test).test_evaluation()

Accuracy score on the testing data set:
 0.9362012395187751

Confusion MAtrix on the testing data set:
 [[2820  289]
 [  61 2316]]

Classification Report on the testing data set:
               precision    recall  f1-score   support

           0       0.98      0.91      0.94      3109
           1       0.89      0.97      0.93      2377

    accuracy                           0.94      5486
   macro avg       0.93      0.94      0.94      5486
weighted avg       0.94      0.94      0.94      5486




### Prediction Pipelining


In [52]:
# class Preprocessing:
#     def __init__(self,data):
#         self.data = data
    
#     def text_preprocessing_user(self):
#         lm = WordNetLemmatizer()
#         pred_data = [self.data] 
#         preprocess_data = []
#         for data in pred_data:
#             review = re.sub('^a-zA-Z0-9',' ',data)
#             review = review.lower()
#             review = review.split()
#             review = [lm.lemmatize(x) for x in review if x not in stopwords]
#             review = " ".join(review)
#             preprocess_data.append(review)
#         return preprocess_data
    
class Preprocessing:
    def __init__(self, data):
        self.data = data
    
    def text_preprocessing_user(self):
        lm = WordNetLemmatizer()
        pred_data = [self.data]
        preprocess_data = []
        
        # Use the correct regex pattern to remove unwanted characters
        for data in pred_data:
            review = re.sub('[^a-zA-Z0-9]', ' ', data)  # Corrected regex
            review = review.lower()
            review = review.split()
            review = [lm.lemmatize(x) for x in review if x not in stopwords.words('english')]
            review = " ".join(review)
            preprocess_data.append(review)
        
        return preprocess_data

In [54]:
df['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [56]:
data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
Preprocessing(data).text_preprocessing_user()

['flynn: hillary clinton, big woman campus - breitbart']

In [62]:
# class Prediction:
#     def __init__(self,pred_data,model):
#         self.pred_data = pred_data
#         self.model = model
        
#     def prediction_model(self):
#         process_data = Preprocessing(self.pred_data).text_preprocessing_user
#         data = tf.transform(preprocess_data)
#         prediction = self.model.predict(data)
        
#         if prediction [0] == 0 :
#             return "The news is Fake"
#         else:
#             return "The news is Real"
        

class Prediction:
    def __init__(self, pred_data, model):
        self.pred_data = pred_data
        self.model = model
        
    def prediction_model(self):
        # Correct the method name here
        preprocess_data = Preprocessing(self.pred_data).text_preprocessing_user()
        
        # Assuming tf is a TfidfVectorizer or similar
        data = tf.transform(preprocess_data)  # Ensure tf is defined and initialized properly
        prediction = self.model.predict(data)
        
        if prediction[0] == 0:
            return "The news is Fake"
        else:
            return "The news is Real"

In [63]:
data = 'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'
Prediction(data,rf).prediction_model()

'The news is Fake'