📰 Fake News Detection using NLP  
Author: Rakhi Mahendrasingh Rajput   
Institution: SANJAY GHODAWAT UNIVERSITY  
Date: 17 March 2025


📌 Introduction  
In today's digital era, fake news spreads rapidly, influencing public opinions and decisions. Our project aims to develop a machine learning model that detects fake news using Natural Language Processing (NLP).  

🎯 Objectives  
- Analyze news articles to determine authenticity.  
- Use NLP techniques to preprocess textual data.  
- Train a machine learning model for classification.  

In [43]:
import numpy as np
import pandas as pd
import tensorflow as tf

from tensorflow.keras.preprocessing.text import Tokenizer
from tensorflow.keras.preprocessing.sequence import pad_sequences
from sklearn import preprocessing

# TensorFlow 2.x runs eagerly by default, so eager execution
# does not need to be disabled and re-enabled here.






📂 Dataset  
 Name: Fake News Dataset  
 Source: Kaggle / Real-time web scraping  

1️⃣ Data Collection  
- Loaded the dataset from a local CSV file (fake_or_real_news.csv).  
- Dropped the redundant index column left over from the CSV export.  


In [44]:
# Reading the data
data = pd.read_csv(r"E:\fake news detection\fake_or_real_news.csv")
data.head()

Unnamed: 0.1,Unnamed: 0,title,text,label
0,8476,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,10294,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,3608,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,10142,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,875,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


In [45]:
data = data.drop(["Unnamed: 0"], axis=1)
data.head(5)


Unnamed: 0,title,text,label
0,You Can Smell Hillary’s Fear,"Daniel Greenfield, a Shillman Journalism Fello...",FAKE
1,Watch The Exact Moment Paul Ryan Committed Pol...,Google Pinterest Digg Linkedin Reddit Stumbleu...,FAKE
2,Kerry to go to Paris in gesture of sympathy,U.S. Secretary of State John F. Kerry said Mon...,REAL
3,Bernie supporters on Twitter erupt in anger ag...,"— Kaydee King (@KaydeeKing) November 9, 2016 T...",FAKE
4,The Battle of New York: Why This Primary Matters,It's primary day in New York and front-runners...,REAL


2️⃣ Data Preprocessing  
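- Encoded the labels as integers (FAKE = 0, REAL = 1).  
- Tokenized the titles, lowercasing them and stripping punctuation and special characters (Keras Tokenizer defaults).  
- Vectorized text using pre-trained word embeddings (GloVe, 50-dimensional).  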


In [46]:
# encoding the labels
le = preprocessing.LabelEncoder()
le.fit(data['label'])
data['label'] = le.transform(data['label'])
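
As a rough illustration of what the label encoding above produces, the pure-Python sketch below mimics the mapping on the five labels shown in the preview. It relies on the fact that `LabelEncoder` sorts class names, so FAKE becomes 0 and REAL becomes 1. The names `raw_labels`, `classes`, and `class_to_id` are hypothetical and not part of the notebook.

```python
# Labels of the first five rows shown in the data preview
raw_labels = ["FAKE", "FAKE", "REAL", "FAKE", "REAL"]

# LabelEncoder assigns ids in sorted order of the class names
classes = sorted(set(raw_labels))
class_to_id = {name: i for i, name in enumerate(classes)}

# Integer-encoded labels, as the model will see them
encoded = [class_to_id[name] for name in raw_labels]
print(class_to_id)
print(encoded)
```

This ordering matters later: a sigmoid output close to 1 means the model predicts REAL.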


In [47]:
embedding_dim = 50
max_length = 54
trunc_type = 'post'
padding_type = 'post'
oov_tok = "<OOV>"
training_size = 3000
test_portion = .1


Tokenization:-

Tokenization splits a piece of continuous text into distinct units, or tokens. Here the title and text columns are collected separately, and only the titles are tokenized and passed to the model in the cells below.

In [48]:
# Keep the first `training_size` rows of each column as plain Python lists
title = data['title'][:training_size].tolist()
text = data['text'][:training_size].tolist()
labels = data['label'][:training_size].tolist()


In [49]:
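# Build the vocabulary from the titles and convert each title to word ids
tokenizer1 = Tokenizer()
tokenizer1.fit_on_texts(title)
word_index1 = tokenizer1.word_index
vocab_size1 = len(word_index1)
sequences1 = tokenizer1.texts_to_sequences(title)

# Pad/truncate to max_length so the shape matches the Embedding layer
padded1 = pad_sequences(
    sequences1, maxlen=max_length, padding=padding_type, truncating=trunc_type)

# The first 10% of rows form the test set; the rest are used for training
split = int(test_portion * training_size)
training_sequences1 = padded1[split:training_size]
test_sequences1 = padded1[0:split]
test_labels = labels[0:split]
training_labels = labels[split:training_size]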
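
To make the tokenization and padding steps more concrete, here is a simplified pure-Python sketch. It only approximates the Keras `Tokenizer` and `pad_sequences` (it skips punctuation filtering, for instance), and the helper names `build_word_index` and `texts_to_padded` are made up for this illustration.

```python
from collections import Counter

def build_word_index(texts):
    """Map each word to an integer id, most frequent first; 0 is kept for padding."""
    counts = Counter(w for t in texts for w in t.lower().split())
    return {w: i + 1 for i, (w, _) in enumerate(counts.most_common())}

def texts_to_padded(texts, word_index, maxlen):
    """Convert texts to fixed-length id lists with 'post' truncation and padding."""
    rows = []
    for t in texts:
        seq = [word_index[w] for w in t.lower().split() if w in word_index]
        seq = seq[:maxlen]                      # truncating='post'
        seq = seq + [0] * (maxlen - len(seq))   # padding='post'
        rows.append(seq)
    return rows

toy_titles = ["Kerry to go to Paris", "Paris primary matters"]
toy_index = build_word_index(toy_titles)
toy_padded = texts_to_padded(toy_titles, toy_index, maxlen=6)
print(toy_index)
print(toy_padded)
```

Every title ends up with the same length, which is what lets the model consume them as one rectangular array.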


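Generating Word Embeddings:-

Word embeddings give words with similar meanings similar representations: each word is mapped to a real-valued vector in a predefined vector space. Here we load pre-trained 50-dimensional GloVe vectors and copy the vector of every word in the title vocabulary into an embedding matrix; words missing from GloVe keep an all-zero row.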

In [50]:
# Load the pre-trained GloVe vectors: each line is a word followed by its coefficients
embeddings_index = {}
with open(r'C:\Users\vanra\Downloads\glove.6B.50d.txt', encoding='utf-8') as f:
    for line in f:
        values = line.split()
        word = values[0]
        coefs = np.asarray(values[1:], dtype='float32')
        embeddings_index[word] = coefs

# Build the embedding matrix; row i holds the vector of the word with id i
# (row 0 is reserved for padding, and unknown words stay all-zero)
embeddings_matrix = np.zeros((vocab_size1+1, embedding_dim))
for word, i in word_index1.items():
    embedding_vector = embeddings_index.get(word)
    if embedding_vector is not None:
        embeddings_matrix[i] = embedding_vector
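
The lookup above can be tried on a toy vocabulary. The sketch below uses made-up 3-dimensional vectors (not real GloVe values) and hypothetical names such as `toy_vectors` and `coverage`; it also reports what fraction of the vocabulary was found, which is a quick sanity check worth running on the real matrix.

```python
import numpy as np

toy_dim = 3
# Made-up "pre-trained" vectors; "kerry" is deliberately missing
toy_vectors = {
    "to": np.array([0.1, 0.2, 0.3], dtype="float32"),
    "paris": np.array([0.4, 0.5, 0.6], dtype="float32"),
}
toy_word_index = {"to": 1, "paris": 2, "kerry": 3}

# Row 0 is reserved for padding, hence the +1
toy_matrix = np.zeros((len(toy_word_index) + 1, toy_dim))
for word, i in toy_word_index.items():
    vec = toy_vectors.get(word)
    if vec is not None:
        toy_matrix[i] = vec

# Fraction of vocabulary words that received a pre-trained vector
coverage = sum(w in toy_vectors for w in toy_word_index) / len(toy_word_index)
print(toy_matrix)
print(f"coverage: {coverage:.0%}")
```

Words with an all-zero row carry no information into the frozen embedding layer, so low coverage would limit what the model can learn.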

3️⃣ Model Selection  
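- Implemented a neural network that stacks a frozen GloVe embedding layer, dropout, a 1D convolution, max pooling, and an LSTM, with a sigmoid output for binary classification.
- Evaluated the model on accuracy, precision, recall, and F1-score.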

In [51]:
model = tf.keras.Sequential([
    tf.keras.layers.Embedding(vocab_size1+1, embedding_dim,
                              input_length=max_length, weights=[
                                  embeddings_matrix],
                              trainable=False),
    tf.keras.layers.Dropout(0.2),
    tf.keras.layers.Conv1D(64, 5, activation='relu'),
    tf.keras.layers.MaxPooling1D(pool_size=4),
    tf.keras.layers.LSTM(64),
    tf.keras.layers.Dense(1, activation='sigmoid')
])
model.compile(loss='binary_crossentropy',
              optimizer='adam', metrics=['accuracy'])
model.summary()
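
To help read the summary, the following sketch works out by hand the sequence length after each layer and the number of trainable parameters, assuming the Keras defaults used above ('valid' convolution padding, stride 1, pooling stride equal to the pool size, biases enabled). The helper names are hypothetical.

```python
def conv1d_out_len(n, kernel, stride=1):
    """Output length of a 1D convolution with 'valid' padding."""
    return (n - kernel) // stride + 1

def maxpool1d_out_len(n, pool):
    """Output length of max pooling with stride equal to the pool size."""
    return (n - pool) // pool + 1

seq_len, emb_dim = 54, 50          # max_length and embedding_dim
filters, kernel, pool = 64, 5, 4   # Conv1D(64, 5) and MaxPooling1D(pool_size=4)
lstm_units = 64

after_conv = conv1d_out_len(seq_len, kernel)
after_pool = maxpool1d_out_len(after_conv, pool)

# Trainable parameters (the embedding layer is frozen, so it adds none)
conv_params = kernel * emb_dim * filters + filters
lstm_params = 4 * ((filters + lstm_units) * lstm_units + lstm_units)
dense_params = lstm_units + 1

print(after_conv, after_pool)
print(conv_params, lstm_params, dense_params)
```

The LSTM therefore sees only 12 time steps per title, each a 64-dimensional feature vector produced by the convolution.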




4️⃣ Model Training & Evaluation  
- Split the first 3,000 rows into training (90%, 2,700 titles) and testing (10%, 300 titles) sets.  
- Trained the model for 50 epochs, using the test set as validation data.  
- Achieved 74% accuracy on the test set.  
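
The split sizes follow directly from the hyperparameters defined earlier (`training_size = 3000`, `test_portion = .1`). The short check below reproduces them, and also the number of batches per epoch under the assumption that `model.fit` and `model.predict` use the Keras default batch size of 32.

```python
import math

training_size = 3000
test_portion = 0.1
batch_size = 32  # assumption: Keras default, since the notebook sets none

n_test = int(test_portion * training_size)
n_train = training_size - n_test

# Batches per training epoch and per prediction pass
train_steps = math.ceil(n_train / batch_size)
test_steps = math.ceil(n_test / batch_size)

print(n_train, n_test)
print(train_steps, test_steps)
```

These step counts agree with the `85/85` and `10/10` progress counters in the training and prediction output.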


In [56]:
num_epochs = 50

training_padded = np.array(training_sequences1)
training_labels = np.array(training_labels)
testing_padded = np.array(test_sequences1)
testing_labels = np.array(test_labels)

history = model.fit(training_padded, training_labels, epochs=num_epochs, validation_data=(testing_padded, testing_labels), verbose=2)

print("Training Complete")

Epoch 1/50
85/85 - 11s - 129ms/step - accuracy: 0.6167 - loss: 0.6421 - val_accuracy: 0.6900 - val_loss: 0.5597
Epoch 2/50
85/85 - 10s - 112ms/step - accuracy: 0.7070 - loss: 0.5711 - val_accuracy: 0.7000 - val_loss: 0.5368
Epoch 3/50
85/85 - 9s - 111ms/step - accuracy: 0.7330 - loss: 0.5246 - val_accuracy: 0.7167 - val_loss: 0.5125
Epoch 4/50
85/85 - 10s - 113ms/step - accuracy: 0.7693 - loss: 0.4827 - val_accuracy: 0.7300 - val_loss: 0.4899
Epoch 5/50
85/85 - 9s - 110ms/step - accuracy: 0.8137 - loss: 0.4189 - val_accuracy: 0.7067 - val_loss: 0.5686
Epoch 6/50
85/85 - 10s - 117ms/step - accuracy: 0.8156 - loss: 0.4012 - val_accuracy: 0.7633 - val_loss: 0.4798
Epoch 7/50
85/85 - 10s - 112ms/step - accuracy: 0.8389 - loss: 0.3665 - val_accuracy: 0.7667 - val_loss: 0.5119
Epoch 8/50
85/85 - 10s - 114ms/step - accuracy: 0.8619 - loss: 0.3327 - val_accuracy: 0.7467 - val_loss: 0.4883
Epoch 9/50
85/85 - 10s - 112ms/step - accuracy: 0.8830 - loss: 0.2947 - val_accuracy: 0.7367 - val_loss: 0.5153
Epoch
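
In the epochs logged above, training accuracy keeps rising while validation loss stops improving after epoch 6, which suggests overfitting. One common remedy is early stopping. The sketch below applies a simple patience rule to the logged `val_loss` values in plain Python; the function name `early_stop_epoch` and the patience of 3 are assumptions for illustration. In Keras, the `tf.keras.callbacks.EarlyStopping` callback (with `monitor`, `patience`, and `restore_best_weights`) offers comparable behavior.

```python
# val_loss values copied from epochs 1-9 of the training log
val_losses = [0.5597, 0.5368, 0.5125, 0.4899, 0.5686,
              0.4798, 0.5119, 0.4883, 0.5153]

def early_stop_epoch(losses, patience):
    """Return (stop_epoch, best_epoch); stop_epoch is None if never triggered."""
    best, best_epoch, wait = float("inf"), 0, 0
    for epoch, loss in enumerate(losses, start=1):
        if loss < best:
            best, best_epoch, wait = loss, epoch, 0
        else:
            wait += 1
            if wait >= patience:
                return epoch, best_epoch
    return None, best_epoch

stop_epoch, best_epoch = early_stop_epoch(val_losses, patience=3)
print(stop_epoch, best_epoch)
```

With this rule, training would halt at epoch 9 and the weights from epoch 6 would be the ones to keep, rather than running all 50 epochs.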

In [61]:
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# Predicted probability of the positive class (label 1 = REAL) for each test title
y_pred_probs = model.predict(testing_padded)

# Threshold at 0.5 and flatten from shape (n, 1) to (n,) to match testing_labels
y_pred = (y_pred_probs > 0.5).astype(int).ravel()

# Compute accuracy, precision, recall, and F1-score
accuracy = accuracy_score(testing_labels, y_pred)
precision = precision_score(testing_labels, y_pred)
recall = recall_score(testing_labels, y_pred)
f1 = f1_score(testing_labels, y_pred)

# Print results
print("✅ Model Evaluation Results:")
print(f"📌 Accuracy: {accuracy:.2%}")
print(f"📌 Precision: {precision:.2%}")
print(f"📌 Recall: {recall:.2%}")
print(f"📌 F1-Score: {f1:.2%}")

# Store results in a dictionary
results = {
    "Model": "LSTM",
    "Accuracy": f"{accuracy:.2%}",
    "Precision": f"{precision:.2%}",
    "Recall": f"{recall:.2%}",
    "F1-Score": f"{f1:.2%}"
}

# Display results in a table format
df_results = pd.DataFrame([results])
print(df_results)


10/10 ━━━━━━━━━━━━━━━━━━━━ 1s 53ms/step
✅ Model Evaluation Results:
📌 Accuracy: 74.33%
📌 Precision: 75.48%
📌 Recall: 75.00%
📌 F1-Score: 75.24%
  Model Accuracy Precision  Recall F1-Score
0  LSTM   74.33%    75.48%  75.00%   75.24%
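
To show how the four metrics relate, the sketch below recomputes them from confusion-matrix counts. The counts are not taken from the notebook: they are one set of values inferred to be consistent with the reported percentages on 300 test titles, treating REAL (label 1) as the positive class.

```python
# Inferred counts (an assumption, chosen to match the reported percentages)
tp, fp, fn, tn = 117, 38, 39, 106
total = tp + fp + fn + tn

accuracy = (tp + tn) / total            # share of all titles classified correctly
precision = tp / (tp + fp)              # of titles predicted REAL, share truly REAL
recall = tp / (tp + fn)                 # of truly REAL titles, share found
f1 = 2 * precision * recall / (precision + recall)  # harmonic mean of the two

print(f"Accuracy: {accuracy:.2%}")
print(f"Precision: {precision:.2%}")
print(f"Recall: {recall:.2%}")
print(f"F1-Score: {f1:.2%}")
```

Under these counts the model makes roughly as many false positives as false negatives, so it is not strongly biased toward either class.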


Model Evaluation and Prediction:-

With the detection model trained in TensorFlow, we can now test it on a sample headline and predict whether the news is real or fake.

In [57]:
# Sample headline to classify
X = "Karry to go to France in gesture of sympathy"

# Tokenize and pad the headline the same way as the training titles
sequences = tokenizer1.texts_to_sequences([X])
sequences = pad_sequences(sequences, maxlen=max_length,
                          padding=padding_type,
                          truncating=trunc_type)

# LabelEncoder sorted the classes alphabetically (FAKE = 0, REAL = 1),
# so a sigmoid output of 0.5 or more means REAL
if model.predict(sequences, verbose=0)[0][0] >= 0.5:
    print("This news is True")
else:
    print("This news is false")


This news is false


🚀 Future Scope  
- Improve the model with transformer-based architectures (e.g., GPT-based models).  
- Extend the dataset for better generalization.  
- Integrate with social media monitoring tools.

📌 Conclusion  
Our fake news detection model classifies news headlines as real or fake using NLP techniques, reaching about 74% accuracy on 300 held-out headlines. The project shows how AI can help combat misinformation in the digital age.  

🔗 [https://github.com/Rakhii24]  
📧 [rajputrakhi2409@gmail.com] 