---

# **Fake News Classifier - Google Colab Notebook**

## **Problem Statement:**
In today’s world, the spread of fake news has become a significant issue, especially with the rapid growth of social media and online news platforms. Fake news can cause harm to individuals, organizations, and even entire societies by spreading false information. Detecting fake news with a high level of accuracy is essential to mitigate its impact.

This project builds a machine learning model that classifies news articles as either "real" or "fake". The notebook uses only each article's title as input. The model is a Recurrent Neural Network (RNN) with an LSTM (Long Short-Term Memory) layer, an architecture well suited to natural language processing (NLP) tasks such as text classification.

## **Approach:**

The approach to solving this problem consists of the following steps:

1. **Data Preprocessing:**
   - Load the dataset.
   - Check for missing values and drop the rows that contain them.
   - Extract independent features (X) and dependent features (y).

2. **Text Preprocessing:**
   - Tokenize the text data to prepare it for deep learning models.
   - Remove stopwords and apply stemming to reduce words to their base form.
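   - Convert text to a numerical representation: map each word to an integer index with Keras `one_hot` (a hashing step, not a one-hot vector), pad the sequences to a fixed length, and let an Embedding layer learn the word vectors.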

3. **Model Building:**
   - Use an LSTM-based neural network model to classify the news as fake or real.
   - The model consists of an embedding layer to transform text into vector representations, an LSTM layer to capture the sequential nature of text, and a Dense layer for the final classification.

4. **Model Training:**
   - Split the dataset into training and testing sets.
   - Train the model on the training data while validating it on the test data.
   - Use binary cross-entropy as the loss function and Adam optimizer for efficient training.

5. **Model Evaluation:**
   - Evaluate the model using accuracy, precision, recall, and F1-score metrics.
   - Display the results in a classification report.

---

### **Code Implementation:**

#### 1. **Import Libraries and Load Dataset:**

```python
import pandas as pd
df = pd.read_csv('/content/train.csv')
df.head()
df.shape
```

#### 2. **Data Preprocessing:**

- Checking for missing values and dropping rows with NaN values:

```python
# Checking missing Values
df.isnull().sum()

# Drop NaN Values
df.dropna(inplace=True)
df.head()
```

- Extracting independent (X) and dependent (y) features:

```python
# Get the Independent Features
X = df.drop('label', axis=1)

# Get the Dependent Features
y = df['label']
```

#### 3. **Text Preprocessing:**

- Cleaning the text by removing non-alphabetical characters and converting to lowercase:

```python
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

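# Initialize the stemmer and build the stopword set once (fast lookups)
from nltk.stem.porter import PorterStemmer
ps = PorterStemmer()
stop_words = set(stopwords.words('english'))

# dropna() leaves gaps in the index, so reset it before looking up rows by position
messages = X.copy().reset_index(drop=True)
corpus = []

for i in range(len(messages)):
    # Keep letters only, lowercase, and split into words
    review = re.sub('[^a-zA-Z]', ' ', messages['title'][i])
    review = review.lower().split()

    # Drop stopwords and stem the remaining words
    review = [ps.stem(word) for word in review if word not in stop_words]
    corpus.append(' '.join(review))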
```
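
To make the effect of the cleaning step concrete, here is a minimal standard-library sketch applied to one title from the dataset. The `TINY_STOPWORDS` set and the `clean_title` helper are hypothetical stand-ins for NLTK's English stopword list and the loop above, and stemming is left out so the snippet runs without NLTK.

```python
import re

# Hypothetical, deliberately tiny stand-in for stopwords.words('english')
TINY_STOPWORDS = {"on", "the", "a", "an", "of", "to", "in", "we", "you"}

def clean_title(title):
    """Keep letters only, lowercase, and drop stopwords (no stemming here)."""
    letters_only = re.sub(r"[^a-zA-Z]", " ", title)
    words = letters_only.lower().split()
    return " ".join(w for w in words if w not in TINY_STOPWORDS)

title = "FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart"
cleaned = clean_title(title)
print(cleaned)  # flynn hillary clinton big woman campus breitbart
```

Seven words survive the cleaning, which is consistent with the seven integer indices the notebook produces for this title.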

#### 4. **Text Representation:**

- Applying One-Hot Encoding and Padding for embedding:

```python
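from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.preprocessing.sequence import pad_sequences

# Vocabulary size: each word is hashed to an integer index below this value
voc_size = 5000

# "One-hot" representation: a list of integer word indices per title
onehot_repr = [one_hot(words, voc_size) for words in corpus]

# Left-pad (or truncate) every sequence to a fixed length
sent_length = 20
embedded_docs = pad_sequences(onehot_repr, padding='pre', maxlen=sent_length)
print(embedded_docs)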
```
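
Despite its name, `one_hot` returns integer word indices produced by hashing, so two different words can share an index. The sketch below imitates that idea and the pre-padding step in plain Python. `hash_index` and `pad_pre` are hypothetical helpers, and MD5 is an assumption chosen for reproducibility, so the indices will not match the ones in the notebook output.

```python
import hashlib

def hash_index(word, voc_size):
    """Map a word to an integer in [1, voc_size - 1] with a stable MD5 hash."""
    digest = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16)
    return digest % (voc_size - 1) + 1  # index 0 stays free for padding

def pad_pre(seq, maxlen):
    """Left-pad with zeros, or keep only the last maxlen items if too long."""
    trimmed = seq[-maxlen:]
    return [0] * (maxlen - len(trimmed)) + trimmed

voc_size, sent_length = 5000, 20
words = "flynn hillary clinton big woman campus breitbart".split()
indices = [hash_index(w, voc_size) for w in words]
padded = pad_pre(indices, sent_length)
print(padded)  # 13 zeros followed by 7 word indices
```

Padding on the left keeps the real words at the end of the sequence, closest to the LSTM's final state.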

#### 5. **Building the LSTM Model:**

- Defining and compiling the LSTM-based model:

```python
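from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Input, Embedding, LSTM, Dense

embedding_vector_features = 40  # Dimension of each learned word vector
model = Sequential([
    Input(shape=(sent_length,)),  # Specify input shape (sentence length)
    Embedding(input_dim=voc_size, output_dim=embedding_vector_features),
    LSTM(100),
    Dense(1, activation='sigmoid')
])

model.compile(loss='binary_crossentropy', optimizer='adam', metrics=['accuracy'])
model.summary()  # summary() prints the table itself and returns None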
```
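
As a sanity check on the model size, the trainable parameter counts can be worked out by hand with the standard formulas for each layer type. This is a sketch under the assumption that the layers use their default settings (biases enabled, a standard LSTM cell); the helper names are hypothetical.

```python
def embedding_params(vocab_size, embed_dim):
    # One vector of length embed_dim per vocabulary index
    return vocab_size * embed_dim

def lstm_params(input_dim, units):
    # 4 gates, each with input weights, recurrent weights, and a bias
    return 4 * ((input_dim + units) * units + units)

def dense_params(input_dim, units):
    # Weight matrix plus one bias per output unit
    return input_dim * units + units

voc_size, embedding_vector_features, lstm_units = 5000, 40, 100
counts = {
    "embedding": embedding_params(voc_size, embedding_vector_features),
    "lstm": lstm_params(embedding_vector_features, lstm_units),
    "dense": dense_params(lstm_units, 1),
}
total = sum(counts.values())
print(counts, total)
```

Under these assumptions the embedding layer holds 200,000 of the 256,501 parameters, so the vocabulary size and the embedding dimension dominate the model size.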

#### 6. **Model Training:**

- Splitting the data into training and testing sets and training the model:

```python
import numpy as np

X_final = np.array(embedded_docs)
y_final = np.array(y)

# Train test split
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X_final, y_final, test_size=0.33, random_state=42)

# Model training
model.fit(X_train, y_train, validation_data=(X_test, y_test), epochs=10, batch_size=64)
```
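
The training log from this notebook run shows validation loss bottoming out early and then rising, which is the usual signal for early stopping. The sketch below replays a patience-based rule over the logged `val_loss` values for epochs 1 to 6. `early_stop_epoch` is a hypothetical helper written for illustration, not part of the notebook.

```python
# val_loss for epochs 1-6, copied from the training log of this notebook run
val_loss = [0.1922, 0.1888, 0.2038, 0.2493, 0.2890, 0.3172]

def early_stop_epoch(losses, patience):
    """Return (best_epoch, stop_epoch), both 1-based.

    stop_epoch is None if the rule never triggers.
    """
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return best_epoch, epoch
    return best_epoch, None

best_epoch, stop_epoch = early_stop_epoch(val_loss, patience=2)
print(best_epoch, stop_epoch)  # 2 4
```

With a patience of 2, training would have stopped after epoch 4 and the best weights would be those from epoch 2. In Keras, the same behavior is available through the `EarlyStopping` callback (for example with `monitor='val_loss'`, `patience=2`, and `restore_best_weights=True`) passed to `model.fit` via `callbacks=`.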

#### 7. **Model Evaluation:**

- Predicting and evaluating the model's performance:

```python
y_pred = model.predict(X_test)
y_pred = np.where(y_pred > 0.5, 1, 0)  # Threshold probabilities at 0.5 to get class labels

from sklearn.metrics import accuracy_score
accuracy = accuracy_score(y_test, y_pred)

# Display the accuracy as a percentage
accuracy_percentage = accuracy * 100
print(f'Accuracy: {accuracy_percentage:.2f}%')

from sklearn.metrics import classification_report
print(classification_report(y_test, y_pred))
```

#### **Results:**
- **Accuracy:** 91.88%
- **Precision:** 0.95 (for class 0), 0.88 (for class 1)
- **Recall:** 0.90 (for class 0), 0.94 (for class 1)
- **F1-Score:** 0.93 (for class 0), 0.91 (for class 1)
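
These figures can be reproduced from the confusion matrix that the notebook run reports, `[[3074, 345], [145, 2471]]`. The sketch below recomputes them in plain Python, assuming scikit-learn's convention that rows are true classes and columns are predicted classes; `per_class_metrics` is a hypothetical helper.

```python
# Confusion matrix from the notebook run: rows = true class, columns = predicted class
cm = [[3074, 345],
      [145, 2471]]

def per_class_metrics(cm, cls):
    """Return (precision, recall, f1) for one class of a square confusion matrix."""
    tp = cm[cls][cls]
    fp = sum(cm[r][cls] for r in range(len(cm))) - tp  # predicted cls, true other
    fn = sum(cm[cls]) - tp                             # true cls, predicted other
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

total = sum(map(sum, cm))
accuracy = (cm[0][0] + cm[1][1]) / total
print(f"Accuracy: {accuracy:.2%}")  # Accuracy: 91.88%
for cls in (0, 1):
    p, r, f1 = per_class_metrics(cm, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")
```

The 345 class 0 titles predicted as class 1 are what pull class 1 precision down to 0.88, while the 145 errors in the other direction keep class 1 recall at 0.94.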

---

### **Conclusion:**

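The Fake News Classifier reaches 91.88% accuracy on the held-out test split (6,035 titles, 33% of the data), using only article titles as input. The pipeline combines stopword removal and stemming with integer word indices from `one_hot`, a learned embedding, and an LSTM for sequence modeling. The notebook does not compare against a baseline, so the contribution of each step is not measured. Validation loss is lowest after epoch 2 and rises in later epochs while training accuracy keeps climbing, which suggests that the model overfits with longer training.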

---


In [1]:
import pandas as pd

In [2]:
df=pd.read_csv('/content/train.csv')

In [3]:
df.head()

|  | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [4]:
df.shape

(20800, 5)

In [5]:
#Checking missing Values
df.isnull().sum()

| column | missing values |
|---|---|
| id | 0 |
| title | 558 |
| author | 1957 |
| text | 39 |
| label | 0 |


In [6]:
#Drop NaN Values
df.dropna(inplace=True)

In [7]:
df.head()

|  | id | title | author | text | label |
|---|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... | 1 |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... | 0 |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... | 1 |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... | 1 |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... | 1 |


In [8]:
#Get the Independent Features
X=df.drop('label',axis=1)

In [9]:
#get the dependent Features
y=df['label']

In [10]:
X.shape

(18285, 4)

In [11]:
y.shape

(18285,)

In [12]:
import tensorflow as tf

In [27]:
#pylint: disable=import-error
from tensorflow.keras.layers import Embedding, Input
from tensorflow.keras.preprocessing.sequence import pad_sequences
from tensorflow.keras.models import Sequential
from tensorflow.keras.preprocessing.text import one_hot
from tensorflow.keras.layers import LSTM
from tensorflow.keras.layers import Dense

In [18]:
#Vocabulary Size
voc_size=5000

In [19]:
#One Hot representation
messages=X.copy()

In [20]:
messages['title'][1]

'FLYNN: Hillary Clinton, Big Woman on Campus - Breitbart'

In [21]:
messages

|  | id | title | author | text |
|---|---|---|---|---|
| 0 | 0 | House Dem Aide: We Didn’t Even See Comey’s Let... | Darrell Lucus | House Dem Aide: We Didn’t Even See Comey’s Let... |
| 1 | 1 | FLYNN: Hillary Clinton, Big Woman on Campus - ... | Daniel J. Flynn | Ever get the feeling your life circles the rou... |
| 2 | 2 | Why the Truth Might Get You Fired | Consortiumnews.com | Why the Truth Might Get You Fired October 29, ... |
| 3 | 3 | 15 Civilians Killed In Single US Airstrike Hav... | Jessica Purkiss | Videos 15 Civilians Killed In Single US Airstr... |
| 4 | 4 | Iranian woman jailed for fictional unpublished... | Howard Portnoy | Print \nAn Iranian woman has been sentenced to... |
| ... | ... | ... | ... | ... |
| 20795 | 20795 | Rapper T.I.: Trump a ’Poster Child For White S... | Jerome Hudson | Rapper T. I. unloaded on black celebrities who... |
| 20796 | 20796 | N.F.L. Playoffs: Schedule, Matchups and Odds -... | Benjamin Hoffman | When the Green Bay Packers lost to the Washing... |
| 20797 | 20797 | Macy’s Is Said to Receive Takeover Approach by... | Michael J. de la Merced and Rachel Abrams | The Macy’s of today grew from the union of sev... |
| 20798 | 20798 | NATO, Russia To Hold Parallel Exercises In Bal... | Alex Ansary | NATO, Russia To Hold Parallel Exercises In Bal... |


In [23]:
import nltk
import re
from nltk.corpus import stopwords
nltk.download('stopwords')

[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.


True

In [24]:
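##Dataset Preprocessing
from nltk.stem.porter import PorterStemmer
ps=PorterStemmer()
stop_words=set(stopwords.words('english'))  #build the stopword set once
#dropna() leaves gaps in the index, so reset it before looking up rows by position
messages=messages.reset_index(drop=True)
corpus=[]

for i in range(0,len(messages)):
  review=re.sub('[^a-zA-Z]',' ',messages['title'][i])
  review=review.lower()
  review=review.split()

  review=[ps.stem(word) for word in review if word not in stop_words]
  review=' '.join(review)
  corpus.append(review)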


In [25]:
## One Hot Representation
onehot_repr=[one_hot(words,voc_size) for words in corpus]
onehot_repr

[[412, 3526, 252, 295, 3902, 4750, 2085, 4771, 1184, 3206],
 [819, 3186, 829, 1590, 497, 56, 4047],
 [3944, 1262, 3482, 4547],
 [2540, 3564, 2563, 2086, 2283, 3048],
 [3069, 497, 2155, 38, 310, 1790, 497, 4592, 1621, 3734],
 [3375,
  56,
  2213,
  4800,
  20,
  4156,
  4364,
  2705,
  3456,
  980,
  1015,
  1374,
  1709,
  4173,
  4047],
 [2441, 2534, 4408, 3365, 653, 231, 3464, 688, 4775, 2230, 4836],
 [1120, 1658, 4478, 3294, 3386, 2131, 4156, 357, 4775, 2230, 4836],
 [2923, 3821, 2905, 2208, 3162, 4232, 2565, 1341, 4156, 2620],
 [4947, 2375, 3448, 1831, 1237, 3912, 231, 3598],
 [2728, 4786, 4632, 3163, 1128, 363, 3597, 3489, 1454, 1192, 3029],
 [2086, 1886, 3902, 4232, 4156, 3386],
 [2945, 4126, 1211, 4918, 2406, 1065, 1009, 1583, 530],
 [3646, 3397, 4266, 2747, 4260, 3672, 2778, 4775, 2230, 4836],
 [747, 602, 333, 3199, 1744, 4775, 2230, 4836],
 [3074, 1019, 683, 4896, 395, 2811, 3493, 1230, 506, 3126],
 [610, 1639, 3186],
 [1169, 3852, 2033, 572, 4156, 1013, 1563, 4047],
 [1555, 4

In [26]:
#Pad the sequences to a fixed length before the Embedding layer
sent_length=20
embedded_docs=pad_sequences(onehot_repr,padding='pre',maxlen=sent_length)
print(embedded_docs)

[[   0    0    0 ... 4771 1184 3206]
 [   0    0    0 ...  497   56 4047]
 [   0    0    0 ... 1262 3482 4547]
 ...
 [   0    0    0 ... 4775 2230 4836]
 [   0    0    0 ... 2635 1416 3861]
 [   0    0    0 ... 2686  784 4772]]


In [34]:
# Creating the Embedding Model
embedding_vector_features = 40
# Define the model using Input layer
model = Sequential([
    Input(shape=(sent_length,)),  # Specify input shape (sentence length)
    Embedding(input_dim=voc_size, output_dim=embedding_vector_features),
    LSTM(100),
    Dense(1, activation='sigmoid')
])

In [35]:
model.compile(loss='binary_crossentropy',optimizer='adam',metrics=['accuracy'])
print(model.summary())

None


In [36]:
len(embedded_docs),y.shape

(18285, (18285,))

In [37]:
import numpy as np
X_final=np.array(embedded_docs)
y_final=np.array(y)

In [38]:
X_final.shape,y_final.shape

((18285, 20), (18285,))

In [39]:
#Train test Split
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test=train_test_split(X_final,y_final,test_size=0.33,random_state=42)

In [40]:
#Model Training
model.fit(X_train,y_train,validation_data=(X_test,y_test),epochs=10,batch_size=64)


Epoch 1/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 12s 39ms/step - accuracy: 0.7963 - loss: 0.4293 - val_accuracy: 0.9183 - val_loss: 0.1922
Epoch 2/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 10s 52ms/step - accuracy: 0.9391 - loss: 0.1520 - val_accuracy: 0.9228 - val_loss: 0.1888
Epoch 3/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 8s 38ms/step - accuracy: 0.9660 - loss: 0.0948 - val_accuracy: 0.9210 - val_loss: 0.2038
Epoch 4/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 10s 53ms/step - accuracy: 0.9797 - loss: 0.0630 - val_accuracy: 0.9206 - val_loss: 0.2493
Epoch 5/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 17s 38ms/step - accuracy: 0.9838 - loss: 0.0473 - val_accuracy: 0.9150 - val_loss: 0.2890
Epoch 6/10
192/192 ━━━━━━━━━━━━━━━━━━━━ 11s 43ms/step - accuracy: 0.9878 - loss: 0.0358 - val_accuracy: 0.9140 - val_loss: 0.3172
Epoch 7/10
19

<keras.src.callbacks.history.History at 0x7ce87bd3e3d0>

In [41]:
#Performance metrics
y_pred=model.predict(X_test)


[1m189/189[0m [32m━━━━━━━━━━━━━━━━━━━━[0m[37m[0m [1m2s[0m 8ms/step


In [43]:
y_pred=np.where(y_pred>0.5,1,0) #Threshold probabilities at 0.5 to get class labels


In [44]:
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test,y_pred)

array([[3074,  345],
       [ 145, 2471]])

In [46]:
from sklearn.metrics import accuracy_score

# Calculate the accuracy
accuracy = accuracy_score(y_test, y_pred)

# Display the accuracy as a percentage
accuracy_percentage = accuracy * 100
print(f'Accuracy: {accuracy_percentage:.2f}%')

Accuracy: 91.88%


In [48]:
from sklearn.metrics import classification_report
print(classification_report(y_test,y_pred))

              precision    recall  f1-score   support

           0       0.95      0.90      0.93      3419
           1       0.88      0.94      0.91      2616

    accuracy                           0.92      6035
   macro avg       0.92      0.92      0.92      6035
weighted avg       0.92      0.92      0.92      6035

