# 📰 Fake News Detection Using NLP




## Step 1: Environment Setup & Library Installation


In [1]:
!pip install pandas numpy scikit-learn nltk transformers torch

import pandas as pd
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.svm import LinearSVC
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
import nltk
import torch
from IPython.display import display, HTML

print("All packages imported successfully!")

All packages imported successfully!


## Step 2: Dataset Loading and Column Verification

In [5]:
import os
print("Current folder:", os.getcwd())

fake = pd.read_csv("Fake.csv")
true = pd.read_csv("True.csv")

print("Fake news columns:", fake.columns)
print("Real news columns:", true.columns)


Current folder: C:\Fake-News-Detection
Fake news columns: Index(['title', 'text', 'subject', 'date'], dtype='object')
Real news columns: Index(['title', 'text', 'subject', 'date'], dtype='object')


## Step 3: Label Assignment (0 = Fake, 1 = Real)

In [6]:
fake['label'] = 0  # Fake news
true['label'] = 1  # Real news

print(fake[['title','label']].head())
print(true[['title','label']].head())


                                               title  label
0   Donald Trump Sends Out Embarrassing New Year’...       0
1   Drunk Bragging Trump Staffer Started Russian ...      0
2   Sheriff David Clarke Becomes An Internet Joke...      0
3   Trump Is So Obsessed He Even Has Obama’s Name...       0
4   Pope Francis Just Called Out Donald Trump Dur...      0
                                               title  label
0  As U.S. budget fight looms, Republicans flip t...      1
1  U.S. military to accept transgender recruits o...      1
2  Senior U.S. Republican senator: 'Let Mr. Muell...      1
3  FBI Russia probe helped by Australian diplomat...      1
4  Trump wants Postal Service to charge 'much mor...      1


## Step 4: Dataset Balancing and Merging


In [7]:
# Balance the classes by downsampling fake news to the number of real articles
fake_sampled = fake.sample(n=len(true), random_state=42)

# Merge datasets
df = pd.concat([fake_sampled, true], axis=0)
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

print("\nLabel counts after balancing:")
print(df['label'].value_counts())



Label counts after balancing:
label
0    21417
1    21417
Name: count, dtype: int64
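The balancing step above relies on the fake set being the larger one, since `sample(n=len(true))` raises an error when asked for more rows than exist. As a minimal sketch (the helper name `balance_by_downsampling` and the toy frame are illustrative, not part of the notebook), the same idea can be written so that it works whichever class is larger:

```python
import pandas as pd

def balance_by_downsampling(df, label_col="label", random_state=42):
    """Downsample every class to the size of the smallest class (hypothetical helper)."""
    n_min = df[label_col].value_counts().min()
    parts = [
        group.sample(n=n_min, random_state=random_state)
        for _, group in df.groupby(label_col)
    ]
    # Shuffle so the two classes are interleaved, as in the notebook
    merged = pd.concat(parts, axis=0)
    return merged.sample(frac=1, random_state=random_state).reset_index(drop=True)

# Toy data: three "fake" rows and two "real" rows
toy = pd.DataFrame({
    "text": ["a", "b", "c", "d", "e"],
    "label": [0, 0, 0, 1, 1],
})
balanced = balance_by_downsampling(toy)
counts = balanced["label"].value_counts().to_dict()
print(counts)
```

Downsampling discards data from the larger class; it is used here only because the notebook takes the same approach.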


## Step 5: Text Cleaning and Preprocessing

In [8]:
import re

def clean_text(text):
    text = str(text).lower()
    text = re.sub(r'\[.*?\]', '', text)
    text = re.sub(r'https?://\S+|www\.\S+', '', text)
    text = re.sub(r'<.*?>+', '', text)
    text = re.sub(r'[^a-z\s]', '', text)
    return text

# Clean text column
df['text'] = df['text'].apply(clean_text)
print(df['text'].head())


0                                                     
1    washington reuters  special counsel robert mue...
2    the republican national convention starts this...
3    trump didn t stop with names of potential cabi...
4    fox news never disappoints when it operates mo...
Name: text, dtype: object
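To see what each substitution in `clean_text` does, the sketch below runs the same function on a made-up sentence (the sample strings and the toy frame are assumptions for illustration). It also shows that text with no letters collapses to an empty string, which may be what happened to row 0 in the output above, and one possible way to filter such rows out:

```python
import re
import pandas as pd

def clean_text(text):
    text = str(text).lower()                           # lowercase everything
    text = re.sub(r'\[.*?\]', '', text)                # drop [bracketed] fragments
    text = re.sub(r'https?://\S+|www\.\S+', '', text)  # drop URLs
    text = re.sub(r'<.*?>+', '', text)                 # drop HTML tags
    text = re.sub(r'[^a-z\s]', '', text)               # keep only letters and whitespace
    return text

# Made-up input, not taken from the dataset
sample = "WASHINGTON (Reuters) - Read <b>more</b> at https://example.com [video]"
cleaned = clean_text(sample)
print(cleaned.split())

# Text without any letters becomes an empty string
toy = pd.DataFrame({"text": [sample, "2017!!!"]})
toy["text"] = toy["text"].apply(clean_text)
non_empty = toy[toy["text"].str.strip() != ""]
print(len(toy), "rows before,", len(non_empty), "rows after dropping empty text")
```

Whether empty rows should be dropped before training is a judgment call; the notebook keeps them.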


## Step 6: Train–Test Data Split

In [9]:
X = df['text']
y = df['label']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

print("Training samples:", len(X_train))
print("Testing samples:", len(X_test))


Training samples: 34267
Testing samples: 8567


## Step 7: Feature Extraction using TF-IDF


In [10]:
vectorizer = TfidfVectorizer(
    stop_words='english',
    max_features=20000,  # Top 20k features
    max_df=0.9,
    min_df=5,
    ngram_range=(1,2)
)

X_train_vec = vectorizer.fit_transform(X_train)
X_test_vec = vectorizer.transform(X_test)

print("TF-IDF FULL shape:", X_train_vec.shape)


TF-IDF FULL shape: (34267, 20000)


## Step 8: Model Training using Support Vector Machine (SVM)


In [11]:
model = LinearSVC()
model.fit(X_train_vec, y_train)

print("Final model trained")


Final model trained
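Steps 7 and 8 fit the vectorizer and the classifier as two separate objects. An optional alternative, sketched here on invented toy sentences rather than the real dataset, is to wrap both in a scikit-learn pipeline so that raw text goes in and a label comes out in one call:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import make_pipeline
from sklearn.svm import LinearSVC

# Invented training sentences: 0 = fake, 1 = real
toy_texts = [
    "shocking secret celebrity scandal exposed",
    "you will not believe this shocking secret",
    "senate votes on tax plan",
    "officials confirm senate tax vote",
]
toy_labels = [0, 0, 1, 1]

# The pipeline applies TF-IDF and then the linear SVM in sequence
pipe = make_pipeline(TfidfVectorizer(), LinearSVC(random_state=42))
pipe.fit(toy_texts, toy_labels)

pred = pipe.predict(["senate tax plan"])
print("Predicted label:", pred[0])
```

A single fitted pipeline is also easier to save and reload than a vectorizer and model that must be kept in sync.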


## Step 9: Model Evaluation


In [12]:
y_pred = model.predict(X_test_vec)

print("Accuracy:", accuracy_score(y_test, y_pred))
print("\nClassification Report:\n", classification_report(y_test, y_pred))


Accuracy: 0.9971985525855025

Classification Report:
               precision    recall  f1-score   support

           0       1.00      1.00      1.00      4284
           1       1.00      1.00      1.00      4283

    accuracy                           1.00      8567
   macro avg       1.00      1.00      1.00      8567
weighted avg       1.00      1.00      1.00      8567
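When the report rounds every score to 1.00, the raw error counts can be more informative. A minimal sketch using `confusion_matrix` on made-up labels (not the notebook's actual predictions) shows how to read off the four cells:

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration: 0 = fake, 1 = real
y_true_toy = [0, 0, 0, 1, 1, 1]
y_pred_toy = [0, 0, 1, 1, 1, 0]

cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
# With labels=[0, 1], rows are true classes and columns are predicted classes
tn, fp, fn, tp = cm.ravel()
print("fake predicted fake:", tn)
print("fake predicted real:", fp)
print("real predicted fake:", fn)
print("real predicted real:", tp)
```

In the notebook, the same call would take `y_test` and `y_pred` from the cell above.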



## Step 10: Interactive Prediction with Colored Output


In [13]:
from IPython.display import display, HTML

def predict_news(text):
    cleaned = clean_text(text)
    vec = vectorizer.transform([cleaned])
    pred = model.predict(vec)[0]
    
    if pred == 1:
        display(HTML("<h3 style='color:green'>✅ REAL NEWS</h3>"))
    else:
        display(HTML("<h3 style='color:red'>❌ FAKE NEWS</h3>"))

# User input
user_text = input("Enter news text: ")
predict_news(user_text)


Enter news text:   WASHINGTON (Reuters) - A senior U.S. lawmaker announced on Thursday that he would not seek re-election next year, citing personal reasons and a desire to spend more time with his family.
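Because `predict_news` only displays HTML, its result is hard to reuse or check outside a notebook. One possible refactoring, sketched below with hypothetical helper names (`format_prediction` and `prediction_html` are not in the original), separates the label-to-markup mapping from the display call:

```python
def format_prediction(pred):
    """Map a numeric label (0 = fake, 1 = real) to a (text, color) pair."""
    if pred == 1:
        return "REAL NEWS", "green"
    return "FAKE NEWS", "red"

def prediction_html(pred):
    """Build the HTML snippet that the notebook would pass to display(HTML(...))."""
    label, color = format_prediction(pred)
    return f"<h3 style='color:{color}'>{label}</h3>"

real_markup = prediction_html(1)
fake_markup = prediction_html(0)
print(real_markup)
print(fake_markup)
```

Inside the notebook, `display(HTML(prediction_html(pred)))` would render the same colored heading.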


## Step 11: Testing Model on Sample Dataset News


In [14]:
# Test on dataset examples
real_text = df[df['label'] == 1]['text'].iloc[10]
fake_text = df[df['label'] == 0]['text'].iloc[10]

print("Real news prediction:")
predict_news(real_text)

print("Fake news prediction:")
predict_news(fake_text)


Real news prediction:


Fake news prediction:
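The two examples above are drawn from `df`, which holds both training and test rows, so they may have been seen during training. A sketch of picking examples from the held-out split instead, shown here on an invented toy frame with the same `stratify` setting as Step 6:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Invented toy data: ten articles, alternating fake (0) and real (1)
toy = pd.DataFrame({
    "text": [f"article {i}" for i in range(10)],
    "label": [0, 1] * 5,
})

X_train, X_test, y_train, y_test = train_test_split(
    toy["text"], toy["label"], test_size=0.2, random_state=42, stratify=toy["label"]
)

# X_test and y_test share an index, so a boolean mask on y_test selects matching texts
real_example = X_test[y_test == 1].iloc[0]
fake_example = X_test[y_test == 0].iloc[0]
print("Held-out real example:", real_example)
print("Held-out fake example:", fake_example)
```

In the notebook, the same indexing on the existing `X_test` and `y_test` would give examples the model has not been fitted on.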


## Conclusion

This project implements a fake news detection system using Natural Language Processing and Machine Learning.
TF-IDF over unigrams and bigrams was used for feature extraction, and a linear Support Vector Machine (`LinearSVC`) trained on 34,267 articles reached about 99.7% accuracy on the 8,567 held-out test articles.
The `predict_news` helper classifies user-supplied text as fake or real and shows the result in green or red, which makes the notebook a practical starting point for real-world applications.
