<a href="https://colab.research.google.com/github/Tanu-N-Prabhu/Python/blob/master/Machine%20Learning/05_projects/Fake%20News%20Detection/fake_news_detection.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Fake News Detection with Machine Learning

Fake news is a major challenge in today's digital age. With the rise of social media, misinformation spreads faster than ever. This project aims to build a **machine learning classifier** that can automatically detect whether a given news article is **fake or real** based on its content.

---

## Objectives

- Understand the structure of fake and real news articles
- Preprocess and clean text data
- Convert text into numerical features using **TF-IDF**
- Train classification models (Logistic Regression, Random Forest)
- Evaluate performance with **accuracy**, **F1-score**, and **ROC-AUC**
- Predict the authenticity of new, unseen news content

---

## Dataset

- **Source**: [ISOT Fake and Real News Dataset on Kaggle](https://www.kaggle.com/datasets/clmentbisaillon/fake-and-real-news-dataset)
- Two CSV files:
  - `Fake.csv`: Contains fake news articles
  - `True.csv`: Contains real news articles
- Each file contains:
  - `title`: Title of the news article
  - `text`: Full text of the article
  - `subject`: Category (not used here)
  - `date`: Publication date (not used here)

Let's begin!


## Step 1: Upload and Load the Dataset

In [1]:
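import pandas as pd

# In Colab, upload Fake.csv and True.csv first (Files sidebar, or
# `from google.colab import files; files.upload()`) so they land in /content/.
fake = pd.read_csv('/content/Fake.csv')
real = pd.read_csv('/content/True.csv')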

# Add labels
fake['label'] = 'FAKE'
real['label'] = 'REAL'

# Combine datasets
df = pd.concat([fake[['title','text','label']], real[['title','text','label']]], ignore_index=True)
df.sample(5)

| | title | text | label |
|---|---|---|---|
| 44869 | Britain outlines plans to break free of Europe... | LONDON (Reuters) - Britain on Wednesday outlin... | REAL |
| 1811 | Trump Brags About The ‘Beautiful Chocolate Ca... | Trump, channeling Marie Antoinette, decided to... | FAKE |
| 30334 | U.S. mayors ask Trump to keep young illegal im... | NEW YORK (Reuters) - Mayors from the largest U... | REAL |
| 10926 | BOOM! KELLYANNE CONWAY Schools CNN’s Anderson ... | Kellyanne Conway went at it with Anderson Coop... | FAKE |
| 27426 | Trump struggles to win over moderate Republica... | WASHINGTON (Reuters) - Time was running short ... | REAL |
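
Before preprocessing, it can help to confirm the class balance and look for missing values in the combined frame. The sketch below is one way to do that; it uses a tiny hand-made DataFrame (the rows are invented for illustration, not taken from the dataset) with the same `title`, `text`, and `label` columns as `df`.

```python
import pandas as pd

# Toy stand-in for `df`; these rows are invented for illustration only.
toy = pd.DataFrame({
    "title": ["Senate passes budget", "BOOM! Shocking video", None, "Court rules on appeal"],
    "text": ["WASHINGTON (Reuters) - Lawmakers voted", "You will not believe this", "Some body text", ""],
    "label": ["REAL", "FAKE", "FAKE", "REAL"],
})

# Rows per class: a heavy imbalance would make plain accuracy misleading.
class_counts = toy["label"].value_counts()

# Missing values per column: a NaN title or text makes the
# concatenated `title + ' ' + text` string NaN as well.
missing = toy[["title", "text"]].isna().sum()

# Empty strings are not NaN, so count them separately.
empty_text = int((toy["text"].fillna("").str.strip() == "").sum())

print(class_counts.to_dict())
print(missing.to_dict())
print(empty_text)
```

On the real data the same three checks would run on `df` directly, right after the `pd.concat` call.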


## Step 2: Preprocess Text Data

In [2]:
import re

# Clean and combine title + text
def clean_text(text):
    text = re.sub(r'\W+', ' ', str(text))
    return text.lower()

df['content'] = (df['title'] + ' ' + df['text']).apply(clean_text)
df['label'] = df['label'].map({'FAKE': 0, 'REAL': 1})
df.drop(columns=['title', 'text'], inplace=True)
df.head()

| | label | content |
|---|---|---|
| 0 | 0 | donald trump sends out embarrassing new year ... |
| 1 | 0 | drunk bragging trump staffer started russian ... |
| 2 | 0 | sheriff david clarke becomes an internet joke... |
| 3 | 0 | trump is so obsessed he even has obama s name... |
| 4 | 0 | pope francis just called out donald trump dur... |
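
To see what the regex does, here is a standalone copy of `clean_text` applied to an invented headline. `\W+` matches runs of anything other than letters, digits, and underscores, so punctuation and apostrophes collapse into single spaces; that is why the output above contains fragments like `obama s name`.

```python
import re

def clean_text(text):
    # Replace every run of non-word characters with one space, then lowercase.
    text = re.sub(r'\W+', ' ', str(text))
    return text.lower()

# Invented example headline, not taken from the dataset.
cleaned = clean_text("BOOM! Obama's name, 100% 'exposed'...")
print(repr(cleaned))

# str() turns a missing value into the literal word "nan".
cleaned_missing = clean_text(float("nan"))
print(repr(cleaned_missing))
```

The trailing space left by the final punctuation is harmless, since `TfidfVectorizer` tokenizes on word boundaries. The second call shows that a missing title or text would silently become the token `nan`.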


## Step 3: Train/Test Split

In [3]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    df['content'], df['label'], test_size=0.2, random_state=42, stratify=df['label']
)
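
The `stratify` argument keeps the FAKE/REAL ratio the same in both splits. A minimal sketch of that behavior, using invented toy labels with a 60/40 ratio rather than the real data:

```python
from collections import Counter
from sklearn.model_selection import train_test_split

# Toy data (invented for illustration): 60 items of class 0, 40 of class 1.
toy_texts = [f"doc {i}" for i in range(100)]
toy_labels = [0] * 60 + [1] * 40

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_texts, toy_labels, test_size=0.2, random_state=42, stratify=toy_labels
)

# With stratify, both splits keep the original 60/40 class ratio.
train_counts = Counter(y_tr)
test_counts = Counter(y_te)
print(train_counts)
print(test_counts)
```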

## Step 4: Convert Text to TF-IDF Vectors

In [4]:
from sklearn.feature_extraction.text import TfidfVectorizer

tfidf = TfidfVectorizer(stop_words='english', max_df=0.7)
X_train_tfidf = tfidf.fit_transform(X_train)
X_test_tfidf = tfidf.transform(X_test)
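
Two details in this cell are easy to miss: `max_df=0.7` drops any term that appears in more than 70% of the training documents, and `transform` on the test set reuses the vocabulary learned from the training set only. The sketch below illustrates both on an invented four-document corpus; `stop_words` is left out here just to keep the toy vocabulary predictable.

```python
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy training corpus (invented): "reuters" appears in 3 of 4 documents (75%).
train_docs = [
    "reuters senate vote",
    "reuters budget vote",
    "reuters court ruling",
    "shocking video leak",
]

# max_df=0.7 removes terms found in more than 70% of the training documents.
vec = TfidfVectorizer(max_df=0.7)
X_train_toy = vec.fit_transform(train_docs)

vocab = sorted(vec.vocabulary_)
print(vocab)

# transform() reuses the fitted vocabulary; unseen or dropped words are ignored.
X_new = vec.transform(["reuters unknown vote"])
print(X_new.shape, X_new.nnz)
```

Only `vote` survives in the new document: `reuters` was removed by `max_df` and `unknown` was never seen during fitting.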

## Step 5: Train Classifiers

In [5]:
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier

lr = LogisticRegression(max_iter=1000)
rf = RandomForestClassifier(n_estimators=100, random_state=42)

lr.fit(X_train_tfidf, y_train)
rf.fit(X_train_tfidf, y_train)
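
An optional variation, not used in the rest of this notebook, is to wrap the vectorizer and the classifier in a scikit-learn `Pipeline`. The TF-IDF step is then always fitted on the training data only, and raw strings can be passed straight to `predict`. The sketch assumes an invented toy corpus with the same label convention (0 = FAKE, 1 = REAL).

```python
from sklearn.pipeline import make_pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# Toy data (invented for illustration): 0 = FAKE, 1 = REAL.
toy_texts = [
    "reuters senate passes budget",
    "reuters court issues ruling",
    "reuters mayors ask congress",
    "boom shocking video destroys senator",
    "shocking rant goes viral",
    "boom embarrassing meltdown video",
]
toy_labels = [1, 1, 1, 0, 0, 0]

# fit() runs fit_transform on the vectorizer, then fits the classifier.
pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))
pipe.fit(toy_texts, toy_labels)

# predict() applies transform (not fit) before classifying.
toy_preds = pipe.predict(["reuters budget ruling", "shocking video boom"])
print(toy_preds)
```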

## Step 6: Evaluate Model Performance

In [6]:
from sklearn.metrics import classification_report, roc_auc_score

for model, name in [(lr, "Logistic Regression"), (rf, "Random Forest")]:
    preds = model.predict(X_test_tfidf)
    proba = model.predict_proba(X_test_tfidf)[:,1]
    print(f"\n{name} Results:")
    print(classification_report(y_test, preds, target_names=['FAKE', 'REAL']))
    print("ROC AUC Score:", round(roc_auc_score(y_test, proba), 3))


Logistic Regression Results:
              precision    recall  f1-score   support

        FAKE       0.99      0.98      0.99      4696
        REAL       0.98      0.99      0.99      4284

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980

ROC AUC Score: 0.999

Random Forest Results:
              precision    recall  f1-score   support

        FAKE       0.99      0.99      0.99      4696
        REAL       0.99      0.99      0.99      4284

    accuracy                           0.99      8980
   macro avg       0.99      0.99      0.99      8980
weighted avg       0.99      0.99      0.99      8980

ROC AUC Score: 1.0


## Step 7: Save Vectorizer and Model

In [7]:
import joblib

joblib.dump(tfidf, "tfidf_vectorizer.pkl")
joblib.dump(rf, "fake_news_rf_model.pkl")

['fake_news_rf_model.pkl']
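
Both files are needed at prediction time, because the model only understands vectors produced by the vectorizer it was trained with. A minimal round-trip sketch, assuming a small invented corpus and a temporary directory instead of the notebook's working directory:

```python
import os
import tempfile

import joblib
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (invented for illustration): 0 = FAKE, 1 = REAL.
toy_texts = [
    "reuters senate passes budget",
    "reuters court issues ruling",
    "boom shocking video destroys senator",
    "shocking rant goes viral",
]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer().fit(toy_texts)
toy_rf = RandomForestClassifier(n_estimators=10, random_state=42)
toy_rf.fit(toy_vec.transform(toy_texts), toy_labels)

with tempfile.TemporaryDirectory() as tmp:
    vec_path = os.path.join(tmp, "tfidf_vectorizer.pkl")
    model_path = os.path.join(tmp, "fake_news_rf_model.pkl")
    joblib.dump(toy_vec, vec_path)
    joblib.dump(toy_rf, model_path)

    # In a fresh session, these two loads replace refitting.
    loaded_vec = joblib.load(vec_path)
    loaded_rf = joblib.load(model_path)

sample_docs = ["reuters senate budget"]
before = toy_rf.predict(toy_vec.transform(sample_docs))
after = loaded_rf.predict(loaded_vec.transform(sample_docs))
same = bool((before == after).all())
print(same)
```

The scikit-learn documentation advises loading pickled models with the same library version that saved them.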

## Step 8: Predict on Custom Text

In [8]:
def predict_news(text):
    cleaned = clean_text(text)
    vect = tfidf.transform([cleaned])
    pred = rf.predict(vect)[0]
    return "REAL" if pred == 1 else "FAKE"

sample = "Breaking: Scientists discover water on Mars!"
predict_news(sample)

'FAKE'
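
A bare label hides how sure the model is. A possible extension is to return the predicted probability alongside it using `predict_proba`. In the sketch below, `predict_with_confidence` is a hypothetical helper, and the vectorizer and forest are fitted on an invented toy corpus; in the notebook the real `tfidf` and `rf` would be passed in, with `clean_text` applied to the input first.

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy data (invented for illustration): 0 = FAKE, 1 = REAL.
toy_texts = [
    "reuters senate passes budget",
    "reuters court issues ruling",
    "boom shocking video destroys senator",
    "shocking rant goes viral",
]
toy_labels = [1, 1, 0, 0]

toy_vec = TfidfVectorizer().fit(toy_texts)
toy_rf = RandomForestClassifier(n_estimators=50, random_state=42)
toy_rf.fit(toy_vec.transform(toy_texts), toy_labels)

def predict_with_confidence(text, vectorizer, model):
    """Hypothetical helper: return (label, probability of that label)."""
    vect = vectorizer.transform([text])
    proba = model.predict_proba(vect)[0]   # columns follow model.classes_
    idx = int(proba.argmax())
    label = "REAL" if model.classes_[idx] == 1 else "FAKE"
    return label, float(proba[idx])

label, confidence = predict_with_confidence("reuters senate budget", toy_vec, toy_rf)
print(label, round(confidence, 2))
```

A probability near 0.5 signals that the text looks unlike either class in the training data, which a bare `'FAKE'` or `'REAL'` cannot convey.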

---

## Summary

- Preprocessed ~44,900 news articles (the 20% test split alone holds 8,980)
- Converted text to TF-IDF features
- Trained and evaluated classifiers
- Built a `predict_news` helper that classifies new, unseen text

The model could be improved further with deep learning models (such as BERT), additional ensemble methods, and metadata such as source credibility, author, or publishing domain.

---
