# Fake News Detection System

This notebook implements a complete pipeline for detecting fake news using Natural Language Processing (NLP) and Machine Learning.

## Workflow
1.  **Data Loading**: Load and merge True and Fake news datasets.
2.  **Preprocessing**: Clean text using NLTK and BeautifulSoup.
3.  **EDA**: Visualize class distributions and text characteristics.
4.  **Feature Engineering**: Convert text to TF-IDF vectors.
5.  **Model Training**: Train Logistic Regression, Random Forest, and SVM.
6.  **Evaluation**: Evaluate models using accuracy, confusion matrix, and ROC curves.
7.  **Explainability**: Use LIME to explain individual predictions.

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.express as px
import plotly.graph_objects as go
import re
import warnings
import nltk
from bs4 import BeautifulSoup
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix, roc_curve, auc
from lime.lime_text import LimeTextExplainer
from sklearn.pipeline import make_pipeline
import joblib

# Suppress warnings for cleaner output
warnings.filterwarnings('ignore')

# Download NLTK data
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package wordnet to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package wordnet is already up-to-date!
[nltk_data] Downloading package omw-1.4 to
[nltk_data]     C:\Users\moham\AppData\Roaming\nltk_data...
[nltk_data]   Package omw-1.4 is already up-to-date!


True

## 1. Data Loading

In [2]:
# Load datasets
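try:
    true_df = pd.read_csv('True.csv')
    fake_df = pd.read_csv('Fake.csv')
except FileNotFoundError as err:
    # Stop here: the cells below need both DataFrames, so printing and continuing would only defer the failure
    raise FileNotFoundError(
        "CSV files not found. Please ensure 'True.csv' and 'Fake.csv' are in the same directory."
    ) from err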

# Add labels: 1 for True, 0 for Fake
true_df['label'] = 1
fake_df['label'] = 0

# Merge datasets
df = pd.concat([true_df, fake_df], axis=0).reset_index(drop=True)

# Shuffle dataset (fixed seed so reruns produce the same order)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print(f"Total records: {df.shape[0]}")
df.head()

Total records: 2026


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | U.S. diplomats accuse Tillerson of breaking ch... | WASHINGTON (Reuters) - A group of about a doze... | politicsNews | November 21, 2017 | 1 |
| 1 | Trump supports Republican tax overhaul bill: a... | WASHINGTON (Reuters) - U.S. President Donald T... | politicsNews | November 2, 2017 | 1 |
| 2 | BREAKING: Someone Else Connected To Trump Is ... | Today, more bad news for Trump broke as yet an... | News | September 13, 2017 | 0 |
| 3 | Flynn's lawyers cut talks with Trump team, sig... | WASHINGTON (Reuters) - Lawyers for Michael Fly... | politicsNews | November 23, 2017 | 1 |
| 4 | 'He's such a dreamer:' Skepticism dogs U.S. en... | WASHINGTON/SEOUL (Reuters) - Saddled with the ... | politicsNews | November 3, 2017 | 1 |


## 2. Preprocessing

In [3]:
stop_words = set(stopwords.words('english'))
lemmatizer = WordNetLemmatizer()

def clean_text(text):
    # Remove HTML tags
    text = BeautifulSoup(text, "html.parser").get_text()
    
    # Remove non-alphabetic characters and lower case
    text = re.sub(r'[^a-zA-Z]', ' ', text)
    text = text.lower()
    
    # Tokenize and remove stopwords & lemmatize
    words = text.split()
    words = [lemmatizer.lemmatize(word) for word in words if word not in stop_words]
    
    return ' '.join(words)

# Apply cleaning (this may take a moment).
# Only the 'text' column (the article body) is used for prediction;
# 'title' could be concatenated with it if headlines prove informative.
df['clean_text'] = df['text'].apply(clean_text)

df[['text', 'clean_text', 'label']].head()

| | text | clean_text | label |
|---|---|---|---|
| 0 | WASHINGTON (Reuters) - A group of about a doze... | washington reuters group dozen u state departm... | 1 |
| 1 | WASHINGTON (Reuters) - U.S. President Donald T... | washington reuters u president donald trump su... | 1 |
| 2 | Today, more bad news for Trump broke as yet an... | today bad news trump broke yet another person ... | 0 |
| 3 | WASHINGTON (Reuters) - Lawyers for Michael Fly... | washington reuters lawyer michael flynn presid... | 1 |
| 4 | WASHINGTON/SEOUL (Reuters) - Saddled with the ... | washington seoul reuters saddled toughest job ... | 1 |
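
The cleaning steps can be traced on a single sentence. The sketch below is a dependency-free approximation, not the notebook's `clean_text`: it assumes a regex in place of BeautifulSoup for tag stripping, a tiny hypothetical stopword set (`TOY_STOPWORDS`) in place of NLTK's English list, and it skips lemmatization. It illustrates why an abbreviation such as `U.S.` is reduced to the lone token `u`, as seen in the `clean_text` column above.

```python
import re

# Hypothetical miniature stopword list; NLTK's English list is much longer.
TOY_STOPWORDS = {"the", "a", "of", "s", "to", "and"}

def clean_text_sketch(text):
    """Approximate the cleaning steps without NLTK or BeautifulSoup."""
    text = re.sub(r"<[^>]+>", " ", text)    # crude tag stripping (assumption)
    text = re.sub(r"[^a-zA-Z]", " ", text)  # digits and punctuation become spaces
    tokens = text.lower().split()
    # Drop stopwords; "s" is what remains of "U.S." after punctuation removal
    return " ".join(tok for tok in tokens if tok not in TOY_STOPWORDS)

example = "<p>The U.S. Senate voted 51-49!</p>"
cleaned = clean_text_sketch(example)
print(cleaned)  # u senate voted
```

Because digits are discarded as well, the vote count `51-49` leaves no trace in the cleaned text.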


## 3. Exploratory Data Analysis (EDA)

In [4]:
# Class Distribution
fig = px.histogram(df, x='label', title='Class Distribution (0=Fake, 1=True)', 
                   color='label', labels={'label': 'News Type'})
fig.update_layout(bargap=0.2)
fig.show()

# Text Length Distribution
df['text_len'] = df['clean_text'].apply(lambda x: len(x.split()))
fig_len = px.histogram(df, x='text_len', color='label', nbins=100, 
                       title='Text Length Distribution by Class', barmode='overlay')
fig_len.show()

## 4. Feature Engineering

In [5]:
X = df['clean_text']
y = df['label']

# Train-Test Split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# TF-IDF Vectorization
tfidf = TfidfVectorizer(max_features=5000)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)

print("TF-IDF Matrix Shape:", X_train_tfidf.shape)

TF-IDF Matrix Shape: (1620, 5000)
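
As a rough illustration of what `TfidfVectorizer` computes with its default settings, the sketch below reproduces the smoothed IDF, `ln((1 + n) / (1 + df)) + 1`, followed by L2 normalization of each row, in plain Python. The helper `tfidf_vectors` and the toy corpus are hypothetical, and whitespace tokenization is a simplification: the real vectorizer applies its own token pattern and, in this notebook, the `max_features=5000` cap.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, rows) using smoothed IDF and L2 row normalization."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents containing each token
    df = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {tok: math.log((1 + n_docs) / (1 + df[tok])) + 1 for tok in vocab}
    rows = []
    for toks in tokenized:
        counts = Counter(toks)
        raw = [counts[tok] * idf[tok] for tok in vocab]
        norm = math.sqrt(sum(w * w for w in raw)) or 1.0  # guard empty documents
        rows.append([w / norm for w in raw])
    return vocab, rows

toy_docs = ["trump tax bill", "trump twitter rant", "senate tax vote"]
vocab, rows = tfidf_vectors(toy_docs)
for tok, weight in zip(vocab, rows[0]):
    if weight > 0:
        print(f"{tok}: {weight:.3f}")
```

In the first toy document, `bill` appears in only one document and therefore receives a larger weight than `trump` or `tax`, which each appear in two.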


## 5. Model Training

In [6]:
# Initialize models
models = {
    "Logistic Regression": LogisticRegression(),
    "Random Forest": RandomForestClassifier(n_estimators=100, random_state=42),
    "SVM": SVC(kernel='linear', probability=True)  # probability=True enables predict_proba, which the ROC cell requires
}

# Train models
for name, model in models.items():
    print(f"Training {name}...")
    model.fit(X_train_tfidf, y_train)
    print(f"{name} trained.")

Training Logistic Regression...
Logistic Regression trained.
Training Random Forest...
Random Forest trained.
Training SVM...
SVM trained.


## 6. Model Evaluation

In [7]:
results = []

for name, model in models.items():
    y_pred = model.predict(X_test_tfidf)
    acc = accuracy_score(y_test, y_pred)
    
    print(f"--- {name} ---")
    print(classification_report(y_test, y_pred))
    
    # Confusion Matrix
    cm = confusion_matrix(y_test, y_pred)
    fig_cm = px.imshow(cm, text_auto=True, title=f'Confusion Matrix - {name}',
                       labels=dict(x="Predicted", y="Actual"), x=['Fake', 'True'], y=['Fake', 'True'])
    fig_cm.show()
    
    results.append({'Model': name, 'Accuracy': acc})

--- Logistic Regression ---
              precision    recall  f1-score   support

           0       0.99      0.98      0.99       189
           1       0.98      1.00      0.99       217

    accuracy                           0.99       406
   macro avg       0.99      0.99      0.99       406
weighted avg       0.99      0.99      0.99       406



--- Random Forest ---
              precision    recall  f1-score   support

           0       1.00      1.00      1.00       189
           1       1.00      1.00      1.00       217

    accuracy                           1.00       406
   macro avg       1.00      1.00      1.00       406
weighted avg       1.00      1.00      1.00       406



--- SVM ---
              precision    recall  f1-score   support

           0       0.99      1.00      1.00       189
           1       1.00      1.00      1.00       217

    accuracy                           1.00       406
   macro avg       1.00      1.00      1.00       406
weighted avg       1.00      1.00      1.00       406
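
Scores this close to 1.00 on a 406-article test set are worth a sanity check for source markers that give the label away. In the `clean_text` preview above, every article labelled 1 contains a `reuters` dateline token, which may be such a marker. The sketch below is a hypothetical, dependency-free check (the function name `class_marker_tokens` and the 0.9 / 0.1 thresholds are assumptions): it flags tokens that occur in almost every document of one class and almost none of the other. It assumes both classes are present.

```python
from collections import Counter

def class_marker_tokens(texts, labels, high=0.9, low=0.1):
    """Return tokens present in >= `high` of one class and <= `low` of the other."""
    doc_counts = {0: Counter(), 1: Counter()}
    class_sizes = Counter(labels)
    for text, label in zip(texts, labels):
        doc_counts[label].update(set(text.split()))  # count each token once per document
    markers = []
    vocab = set(doc_counts[0]) | set(doc_counts[1])
    for tok in sorted(vocab):
        rate0 = doc_counts[0][tok] / class_sizes[0]
        rate1 = doc_counts[1][tok] / class_sizes[1]
        if (rate1 >= high and rate0 <= low) or (rate0 >= high and rate1 <= low):
            markers.append((tok, rate0, rate1))
    return markers

# Toy corpus (hypothetical): 1 = True, 0 = Fake
toy_texts = [
    "washington reuters senate passed tax bill",
    "london reuters minister announced budget plan",
    "watch trump twitter rant goes viral",
    "breaking senate insider reveals shocking plan",
]
toy_labels = [1, 1, 0, 0]
print(class_marker_tokens(toy_texts, toy_labels))  # [('reuters', 0.0, 1.0)]
```

On the notebook's data this could be called as `class_marker_tokens(X_train, y_train)`. Any flagged token could then be removed inside `clean_text` before retraining, to see how much the scores depend on it.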



### ROC Curves

In [8]:
fig_roc = go.Figure()

for name, model in models.items():
    if hasattr(model, "predict_proba"):
        y_prob = model.predict_proba(X_test_tfidf)[:, 1]
        fpr, tpr, _ = roc_curve(y_test, y_prob)
        roc_auc = auc(fpr, tpr)
        
        fig_roc.add_trace(go.Scatter(x=fpr, y=tpr, mode='lines', name=f'{name} (AUC = {roc_auc:.2f})'))

fig_roc.update_layout(title='ROC Curve Comparison', xaxis_title='False Positive Rate', yaxis_title='True Positive Rate')
fig_roc.show()
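
To make the ROC computation concrete, the sketch below rebuilds the curve by sweeping the decision threshold from the highest score down and integrates it with the trapezoidal rule, which is the same idea behind `roc_curve` and `auc`. The helper names and the four toy scores are hypothetical, and the code assumes both classes are present.

```python
def roc_points(y_true, scores):
    """Sweep thresholds from high to low and return (fpr, tpr) lists."""
    pos = sum(y_true)
    neg = len(y_true) - pos
    ranked = sorted(zip(scores, y_true), key=lambda pair: -pair[0])
    fpr, tpr = [0.0], [0.0]
    tp = fp = 0
    i = 0
    while i < len(ranked):
        threshold = ranked[i][0]
        # Consume every sample tied at this score before emitting a point
        while i < len(ranked) and ranked[i][0] == threshold:
            if ranked[i][1] == 1:
                tp += 1
            else:
                fp += 1
            i += 1
        fpr.append(fp / neg)
        tpr.append(tp / pos)
    return fpr, tpr

def trapezoid_auc(fpr, tpr):
    """Area under the piecewise-linear curve through the (fpr, tpr) points."""
    return sum(
        (fpr[k] - fpr[k - 1]) * (tpr[k] + tpr[k - 1]) / 2
        for k in range(1, len(fpr))
    )

toy_labels = [0, 0, 1, 1]
toy_scores = [0.1, 0.4, 0.35, 0.8]
fpr, tpr = roc_points(toy_labels, toy_scores)
toy_auc = trapezoid_auc(fpr, tpr)
print(toy_auc)  # 0.75
```

The value 0.75 also equals the fraction of (positive, negative) pairs in which the positive example receives the higher score: three of the four pairs here.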

## 7. Explainability with LIME

In [10]:
# Create a pipeline for LIME (needs raw text input)
c = make_pipeline(tfidf, models['Logistic Regression'])

explainer = LimeTextExplainer(class_names=['Fake', 'True'])

# Pick a fixed sample from the test set
idx = 10  # Change the index to inspect a different example
sample_text = X_test.iloc[idx]
true_label = y_test.iloc[idx]

print("True Label:", "True" if true_label == 1 else "Fake")
print("Text Snippet:", sample_text[:200], "...")

exp = explainer.explain_instance(sample_text, c.predict_proba, num_features=10)
# exp.show_in_notebook(text=True) failed in this environment with
# "ImportError: cannot import name 'display' from 'IPython.core.display'",
# so render the explanation HTML through IPython.display instead.
from IPython.display import HTML, display
display(HTML(exp.as_html(text=True)))

True Label: Fake
Text Snippet: incapable putting f cking phone important overseas trip donald trump embarrassed country going tirade twitter hour ahead meeting russian dictator vladimir putin trump chose post rant twitter rather pr ...
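
Where LIME's notebook rendering is unavailable, a cruder check is to delete one word at a time and watch how the predicted probability moves. This is an occlusion-style sketch, not LIME itself (LIME fits a local linear model over many random perturbations). `predict_proba_fn` stands for any callable that maps a list of strings to per-class probabilities, such as `c.predict_proba` above. The `toy_predict_proba` scorer is purely hypothetical so that the example runs on its own.

```python
def word_occlusion_scores(text, predict_proba_fn, class_index=1):
    """Score each distinct word by the probability drop when it is removed."""
    words = text.split()
    base = predict_proba_fn([text])[0][class_index]
    scores = {}
    for word in dict.fromkeys(words):  # distinct words, original order
        reduced = " ".join(w for w in words if w != word)
        prob = predict_proba_fn([reduced])[0][class_index]
        scores[word] = base - prob  # positive: the word pushed toward class_index
    return sorted(scores.items(), key=lambda kv: abs(kv[1]), reverse=True)

def toy_predict_proba(texts):
    """Hypothetical stand-in for a fitted pipeline's predict_proba."""
    out = []
    for text in texts:
        tokens = text.split()
        p_true = 0.5 + 0.4 * ("reuters" in tokens) - 0.3 * ("rant" in tokens)
        out.append([1.0 - p_true, p_true])
    return out

ranking = word_occlusion_scores("washington reuters trump rant", toy_predict_proba)
for word, delta in ranking:
    print(f"{word}: {delta:+.2f}")
```

With the toy scorer, `reuters` ranks first with a positive score (it pushes toward class 1, True) and `rant` second with a negative one. The other two words have no effect.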

## 8. Save Models

In [11]:
# Save models and vectorizer for Streamlit app
joblib.dump(models, 'models.joblib')
joblib.dump(tfidf, 'vectorizer.joblib')
print("Models and vectorizer saved successfully.")

Models and vectorizer saved successfully.
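
When the saved artifacts are loaded elsewhere, new text has to pass through the same cleaning and the same fitted vectorizer before prediction. One way to reduce the risk of a mismatch is to persist the vectorizer and a classifier together as a single scikit-learn `Pipeline`. The sketch below is an assumption about how that could look, not the notebook's method: it trains on a four-document toy corpus and writes to a temporary directory, and the file name `pipeline.joblib` is hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Toy, already-cleaned corpus standing in for df['clean_text'] (assumption)
toy_texts = [
    "washington reuters senate passed tax bill",
    "london reuters minister announced budget plan",
    "watch trump twitter rant goes viral",
    "breaking insider reveals shocking secret",
]
toy_labels = [1, 1, 0, 0]  # 1 = True, 0 = Fake

# Vectorizer and classifier are fitted and saved as one object
pipeline = make_pipeline(TfidfVectorizer(), LogisticRegression())
pipeline.fit(toy_texts, toy_labels)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "pipeline.joblib")  # hypothetical file name
    joblib.dump(pipeline, path)
    restored = joblib.load(path)

# The restored pipeline accepts cleaned text directly
new_texts = ["reuters senate tax vote", "viral twitter rant"]
original_pred = pipeline.predict(new_texts)
restored_pred = restored.predict(new_texts)
print(restored_pred)
```

Text passed to the restored pipeline would still need the same `clean_text` step applied first, since cleaning happens outside the vectorizer in this notebook.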
