<a href="https://colab.research.google.com/github/Madhusudan3223/fake-news-detection-v2/blob/main/Fake_News_Detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# 📰 Fake News Detection


Project Summary

📌 Objective:
To build a machine learning model that can automatically detect whether a news article is real or fake using text data.

🗂️ Dataset:
Used two labeled datasets: True.csv and Fake.csv

Combined and labeled:

True → label 1 (Real News)

Fake → label 0 (Fake News)

Total entries: ~44,900 (the 20% test split contains 8,980 articles)


🧹 Data Preprocessing:
Combined and shuffled both datasets

Cleaned the news text:

Lowercased the text

Removed digits and punctuation

Removed stopwords (e.g., "the", "is", "and")

Applied WordNet lemmatization with its default noun setting (e.g., "supports" → "support"; verb forms such as "trying" are left unchanged)


🔎 Feature Extraction:
Used TF-IDF vectorization, keeping the 5,000 most frequent terms (max_features=5000), to convert the cleaned text into numerical features for model training.


🤖 Model Training:
Algorithm used: Logistic Regression

Train-Test Split: 80% training, 20% testing

Achieved:

✅ Accuracy: ~98.8% on test data

📊 Precision, recall, and F1 of 0.98 to 0.99 for both classes; the confusion matrix shows 107 misclassified articles out of 8,980

💾 Model Export:
Exported the trained model as: fake_news_model_v2.pkl

Exported the fitted TF-IDF vectorizer as: tfidf_vectorizer_v2.pkl

Downloaded both files for deployment in a Streamlit web app

💡 Outcome:
This project demonstrates a real-world application of Natural Language Processing (NLP) and machine learning to combat misinformation by detecting fake news articles with high accuracy.



📥Import Required Libraries

In [3]:
import pandas as pd

📁  Upload Dataset File from Local

In [4]:
from google.colab import files
uploaded = files.upload()

Saving True.csv (1).zip to True.csv (1).zip
Saving Fake.csv (4).zip to Fake.csv (4).zip


📦  Unzip the Dataset Files

In [5]:
import zipfile

# Unzip True.csv
with zipfile.ZipFile("True.csv (1).zip", 'r') as zip_ref:
    zip_ref.extractall()

# Unzip Fake.csv
with zipfile.ZipFile("Fake.csv (4).zip", 'r') as zip_ref:
    zip_ref.extractall()
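
The archive names above carry suffixes such as ` (1)` that may differ from one upload to the next. A minimal sketch of a name-independent alternative, assuming every `.zip` in the working directory should be extracted (the helper name `extract_all_zips` is hypothetical):

```python
import glob
import zipfile

def extract_all_zips(pattern="*.zip", dest="."):
    """Extract every archive matching `pattern` into `dest`; return the extracted member names."""
    extracted = []
    for path in sorted(glob.glob(pattern)):
        with zipfile.ZipFile(path, "r") as zip_ref:
            zip_ref.extractall(dest)
            extracted.extend(zip_ref.namelist())
    return extracted
```

Calling `extract_all_zips()` after the upload cell would then unpack both archives regardless of the suffix in their file names.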


🧹 Load, Label, Combine, and Shuffle the Dataset

In [6]:
import pandas as pd

# Load CSVs
true_df = pd.read_csv("True.csv")
fake_df = pd.read_csv("Fake.csv")

# Add labels
true_df["label"] = 1  # Real
fake_df["label"] = 0  # Fake

# Combine both
df = pd.concat([true_df, fake_df], ignore_index=True)

# Shuffle the combined dataset
df = df.sample(frac=1, random_state=42).reset_index(drop=True)

# Preview
df.head()


| | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | BREAKING: GOP Chairman Grassley Has Had Enoug... | Donald Trump s White House is in chaos, and th... | News | July 21, 2017 | 0 |
| 1 | Failed GOP Candidates Remembered In Hilarious... | Now that Donald Trump is the presumptive GOP n... | News | May 7, 2016 | 0 |
| 2 | Mike Pence’s New DC Neighbors Are HILARIOUSLY... | Mike Pence is a huge homophobe. He supports ex... | News | December 3, 2016 | 0 |
| 3 | California AG pledges to defend birth control ... | SAN FRANCISCO (Reuters) - California Attorney ... | politicsNews | October 6, 2017 | 1 |
| 4 | AZ RANCHERS Living On US-Mexico Border Destroy... | Twisted reasoning is all that comes from Pelos... | politics | Apr 25, 2017 | 0 |
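
Before cleaning, it can be worth checking class balance and repeated article bodies in the combined frame. The sketch below is one way to do that; the helper name `summarize_labels` and the toy rows are assumptions for illustration, and in the notebook the real `df` would be passed instead.

```python
import pandas as pd

def summarize_labels(frame, label_col="label", text_col="text"):
    """Return per-class row counts and the number of repeated article bodies."""
    counts = frame[label_col].value_counts().to_dict()
    n_duplicates = int(frame.duplicated(subset=[text_col]).sum())
    return counts, n_duplicates

# Toy frame, invented for illustration; in the notebook you would pass `df`
toy = pd.DataFrame({
    "text": ["a real story", "a fake story", "a fake story"],
    "label": [1, 0, 0],
})
counts, n_duplicates = summarize_labels(toy)
print(counts)        # {0: 2, 1: 1}
print(n_duplicates)  # 1
```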


🧼  Clean the News Text using NLTK

In [7]:
import nltk
import string
import re
from nltk.corpus import stopwords
from nltk.stem import WordNetLemmatizer

# Download necessary resources
nltk.download('stopwords')
nltk.download('wordnet')
nltk.download('omw-1.4')

# Set up cleaner tools
stop_words = set(stopwords.words("english"))
lemmatizer = WordNetLemmatizer()

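# Define cleaning function
def clean_text(text):
    """Lowercase, strip digits and ASCII punctuation, drop stopwords, lemmatize."""
    text = text.lower()
    text = re.sub(r'\d+', '', text)  # Remove digits
    # string.punctuation covers ASCII only; curly quotes such as ’ are left in place
    text = text.translate(str.maketrans('', '', string.punctuation))
    words = text.split()
    # lemmatize() treats each word as a noun by default, so verb forms are mostly unchanged
    words = [lemmatizer.lemmatize(w) for w in words if w not in stop_words]
    return ' '.join(words)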

# Apply to the 'text' column
df["clean_text"] = df["text"].apply(clean_text)

# Preview cleaned version
df[["text", "clean_text"]].head()


[nltk_data] Downloading package stopwords to /root/nltk_data...
[nltk_data]   Unzipping corpora/stopwords.zip.
[nltk_data] Downloading package wordnet to /root/nltk_data...
[nltk_data] Downloading package omw-1.4 to /root/nltk_data...


| | text | clean_text |
|---|---|---|
| 0 | Donald Trump s White House is in chaos, and th... | donald trump white house chaos trying cover ru... |
| 1 | Now that Donald Trump is the presumptive GOP n... | donald trump presumptive gop nominee time reme... |
| 2 | Mike Pence is a huge homophobe. He supports ex... | mike penny huge homophobe support exgay conver... |
| 3 | SAN FRANCISCO (Reuters) - California Attorney ... | san francisco reuters california attorney gene... |
| 4 | Twisted reasoning is all that comes from Pelos... | twisted reasoning come pelosi day especially p... |


📈 Vectorization, Model Training & Evaluation

In [8]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

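# Vectorize clean text
# Note: fitting on the full corpus before the split lets test-set vocabulary
# and IDF statistics influence the features used for training.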
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]

# Split data
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Train Logistic Regression
model = LogisticRegression()
model.fit(X_train, y_train)

# Predict on test set
y_pred = model.predict(X_test)

# Evaluate
print("✅ Accuracy:", accuracy_score(y_test, y_pred))
print("\n📊 Classification Report:\n", classification_report(y_test, y_pred))
print("\n🧮 Confusion Matrix:\n", confusion_matrix(y_test, y_pred))


✅ Accuracy: 0.9880846325167038

📊 Classification Report:
               precision    recall  f1-score   support

           0       0.99      0.99      0.99      4669
           1       0.98      0.99      0.99      4311

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980


🧮 Confusion Matrix:
 [[4600   69]
 [  38 4273]]
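
The reported metrics can be re-derived from the confusion matrix alone. The sketch below copies the four counts printed above and assumes scikit-learn's convention that rows are true labels and columns are predictions, with label 1 (real news) treated as the positive class.

```python
import numpy as np

# Counts copied from the confusion matrix output above
cm = np.array([[4600, 69],
               [38, 4273]])

# Row 0 = true fake, row 1 = true real; column 0 = predicted fake, column 1 = predicted real
tn, fp, fn, tp = cm.ravel()

accuracy = (tp + tn) / cm.sum()
precision_real = tp / (tp + fp)
recall_real = tp / (tp + fn)

print(f"accuracy        = {accuracy:.4f}")        # 0.9881
print(f"precision(real) = {precision_real:.2f}")  # 0.98
print(f"recall(real)    = {recall_real:.2f}")     # 0.99
print(f"misclassified   = {fp + fn} of {cm.sum()}")
```

Read this way, 69 fake articles were labeled real and 38 real articles were labeled fake.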


💾 Save the Trained Model and Vectorizer

In [9]:
import joblib

# Save the model
joblib.dump(model, "fake_news_model_v2.pkl")

# Save the vectorizer
joblib.dump(vectorizer, "tfidf_vectorizer_v2.pkl")


['tfidf_vectorizer_v2.pkl']

✅ Final Training on Entire Dataset and Saving Updated Model & Vectorizer

In [12]:
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
import joblib

# Re-fit the TF-IDF vectorizer
vectorizer = TfidfVectorizer(max_features=5000)
X = vectorizer.fit_transform(df["clean_text"])
y = df["label"]

# Re-train the model
model = LogisticRegression()
model.fit(X, y)

# Save the correct (fitted) model and vectorizer
joblib.dump(vectorizer, "tfidf_vectorizer_v2.pkl")
joblib.dump(model, "fake_news_model_v2.pkl")


['fake_news_model_v2.pkl']
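
An alternative worth considering, sketched here under assumptions rather than taken from the notebook, is to bundle the vectorizer and classifier in a scikit-learn `Pipeline`. A single object then handles both steps, and when it is evaluated on a train/test split the TF-IDF statistics are fitted on the training portion only. The toy texts and the file name `fake_news_pipeline.pkl` are hypothetical.

```python
import os
import tempfile

import joblib
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Toy stand-ins for df["clean_text"] and df["label"] (invented for illustration)
texts = [
    "reuters senate vote budget bill",
    "reuters court ruling appeal statement",
    "shocking hoax secret exposed watch",
    "unbelievable hoax viral watch video",
]
labels = [1, 1, 0, 0]

# One object that vectorizes and classifies, so the two steps cannot drift apart
pipeline = Pipeline([
    ("tfidf", TfidfVectorizer(max_features=5000)),
    ("clf", LogisticRegression()),
])
pipeline.fit(texts, labels)

# Hypothetical file name; a single artifact would replace the two .pkl files
path = os.path.join(tempfile.mkdtemp(), "fake_news_pipeline.pkl")
joblib.dump(pipeline, path)

# Reload and classify raw (already cleaned) text directly
restored = joblib.load(path)
print(restored.predict(["reuters senate statement"]))
```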

⬇️  Download Model & Vectorizer

In [15]:
from google.colab import files
files.download("tfidf_vectorizer_v2.pkl")
files.download("fake_news_model_v2.pkl")
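
For the planned web app, new text has to pass through the same cleaning and the same fitted vectorizer before it reaches the model. The sketch below shows one possible inference helper; `classify_article`, `LABEL_NAMES`, and the toy training texts are assumptions for illustration, and `str.lower` stands in for the notebook's `clean_text` so the example runs without NLTK data.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

LABEL_NAMES = {0: "Fake", 1: "Real"}  # same encoding as the labels added when loading the CSVs

def classify_article(raw_text, model, vectorizer, cleaner):
    """Clean, vectorize, and classify one article; return (label name, probability of that label)."""
    features = vectorizer.transform([cleaner(raw_text)])
    probabilities = model.predict_proba(features)[0]
    best = int(probabilities.argmax())
    return LABEL_NAMES[int(model.classes_[best])], float(probabilities[best])

# In the deployed app one would load the real artifacts instead, e.g.:
#   import joblib
#   model = joblib.load("fake_news_model_v2.pkl")
#   vectorizer = joblib.load("tfidf_vectorizer_v2.pkl")
# Toy stand-ins (invented data) so this sketch runs without those files:
toy_texts = ["reuters senate vote bill", "shocking hoax secret video"]
toy_vectorizer = TfidfVectorizer().fit(toy_texts)
toy_model = LogisticRegression().fit(toy_vectorizer.transform(toy_texts), [1, 0])

label, confidence = classify_article(
    "Senate vote on the bill", toy_model, toy_vectorizer, cleaner=str.lower
)
print(label, round(confidence, 3))
```

Passing the cleaner as an argument keeps the helper independent of NLTK; in the app, the notebook's `clean_text` would be supplied in its place.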

