# Fake News Detection Using Machine Learning & NLP

## 📌 Problem Statement
The rapid spread of misinformation through digital media has made fake news detection a critical challenge. This project aims to build a machine learning model that can automatically classify news articles as **Real** or **Fake** based on their textual content.

---

## 🎯 Objective
The objective of this project is to develop an **end-to-end binary text classification system** using **Natural Language Processing (NLP)** techniques and deploy it as an interactive application.

---

## 🧠 Type of Problem
- Supervised Learning  
- Binary Classification  
- NLP-based Text Classification 

In [55]:
import numpy as np 
import pandas as pd 

## Dataset Description
The dataset consists of two CSV files:


* True.csv → Real news articles
* Fake.csv → Fake news articles

## Dataset Preparation Strategy


1. Assign labels to both datasets
2. Merge them into a single DataFrame
3. Shuffle data to avoid ordering bias


In [34]:
true_data = pd.read_csv('/kaggle/input/fake-news-detection-datasets/News _dataset/True.csv')
true_data['label'] = 1

In [35]:
fake_data = pd.read_csv('/kaggle/input/fake-news-detection-datasets/News _dataset/Fake.csv')
fake_data['label'] = 0

In [36]:
data = pd.concat([true_data,fake_data],axis=0)

In [37]:
data.head()

|   | title | text | subject | date | label |
|---|---|---|---|---|---|
| 0 | As U.S. budget fight looms, Republicans flip t... | WASHINGTON (Reuters) - The head of a conservat... | politicsNews | December 31, 2017 | 1 |
| 1 | U.S. military to accept transgender recruits o... | WASHINGTON (Reuters) - Transgender people will... | politicsNews | December 29, 2017 | 1 |
| 2 | Senior U.S. Republican senator: 'Let Mr. Muell... | WASHINGTON (Reuters) - The special counsel inv... | politicsNews | December 31, 2017 | 1 |
| 3 | FBI Russia probe helped by Australian diplomat... | WASHINGTON (Reuters) - Trump campaign adviser ... | politicsNews | December 30, 2017 | 1 |
| 4 | Trump wants Postal Service to charge 'much mor... | SEATTLE/WASHINGTON (Reuters) - President Donal... | politicsNews | December 29, 2017 | 1 |


In [38]:
data = data.sample(frac=1, random_state=42).reset_index(drop=True)


In [39]:
data.isnull().sum()

title      0
text       0
subject    0
date       0
label      0
dtype: int64

## Why Combine title and text?

Fake news detection depends heavily on **context** and **phrasing**. Using both the title and the article body provides richer semantic information.
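
One caveat is worth checking before concatenating: in pandas, adding a string to a missing value yields a missing value, so a row with a missing title or text would end up with a missing `content`. The null check earlier in this notebook reports no missing values, so the guard below is optional here. It is a minimal sketch on a made-up frame; the `toy` DataFrame and the `combine_title_text` helper are illustrative names, not part of the notebook.

```python
import pandas as pd


def combine_title_text(df: pd.DataFrame) -> pd.Series:
    """Join title and text with a space, treating missing values as empty strings."""
    title = df["title"].fillna("")
    text = df["text"].fillna("")
    # strip() removes the stray space left when one side is empty
    return (title + " " + text).str.strip()


# Hypothetical two-row frame; the second row has no title.
toy = pd.DataFrame({
    "title": ["Budget vote delayed", None],
    "text": ["Lawmakers postponed the vote.", "Body without a headline."],
})

naive = toy["title"] + " " + toy["text"]  # plain concatenation propagates the missing value
safe = combine_title_text(toy)

print(naive.isna().sum())
print(safe.tolist())
```

If the data could contain missing fields, the same helper could replace the plain concatenation without changing the rest of the workflow.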

In [40]:
data['content'] = data['title']+ " "+ data['text']

In [41]:
data = data[['content','label']]

In [42]:
data.head()

|   | content | label |
|---|---|---|
| 0 | BREAKING: GOP Chairman Grassley Has Had Enoug... | 0 |
| 1 | Failed GOP Candidates Remembered In Hilarious... | 0 |
| 2 | Mike Pence’s New DC Neighbors Are HILARIOUSLY... | 0 |
| 3 | California AG pledges to defend birth control ... | 1 |
| 4 | AZ RANCHERS Living On US-Mexico Border Destroy... | 0 |


In [56]:
import re
from nltk.corpus import stopwords
from sklearn.feature_extraction.text import TfidfVectorizer

In [57]:
X = data['content']
y = data['label']

## Train–Test Split

In [48]:
from sklearn.model_selection import train_test_split
X_train,X_test,y_train,y_test = train_test_split(X,y,test_size=0.2,random_state=2,stratify=y)
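
The `stratify=y` argument in the split above keeps the class proportions the same in the training and test sets. The sketch below illustrates this on hypothetical, deliberately imbalanced labels; `toy_X`, `toy_y`, and the 80/20 class ratio are made up for the demonstration and say nothing about the real dataset's balance.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 rows of class 0, 20 rows of class 1.
toy_y = pd.Series([0] * 80 + [1] * 20)
toy_X = pd.Series([f"document {i}" for i in range(100)])

X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=2, stratify=toy_y
)

# The mean of a 0/1 label column is the share of class 1.
train_share = y_tr.mean()
test_share = y_te.mean()
print(train_share, test_share)
```

Without `stratify`, the two shares would generally differ from split to split, which matters most when one class is rare.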

## Why Use a Pipeline?

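Wrapping the vectorizer and the classifier in a single scikit-learn `Pipeline` gives:

1. No data leakage: the TF-IDF vocabulary and IDF weights are learned from the training data only, including within each cross-validation fold
2. Consistent preprocessing: the same fitted transformation is applied at training and prediction time
3. Deployment-ready design: the whole workflow is a single object that accepts raw text
4. Cleaner, more reproducible code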
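
The leakage point can be illustrated with a small sketch. It assumes a made-up four-document corpus (`train_docs`, `test_docs`, and `toy_pipe` are illustrative names) and default vectorizer settings rather than the configuration used in this notebook. The vectorizer inside the pipeline is fit only on the documents passed to `fit`, so a word that appears only in the test text never enters the vocabulary.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus: label 1 = real, 0 = fake.
train_docs = [
    "senate passes budget bill",
    "court upholds election ruling",
    "shocking secret they hide from you",
    "you will not believe this miracle cure",
]
train_labels = [1, 1, 0, 0]
test_docs = ["zebra budget"]  # "zebra" never appears in the training documents

toy_pipe = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=1000)),
])
toy_pipe.fit(train_docs, train_labels)

# The vocabulary was learned from train_docs only.
vocab = toy_pipe.named_steps["tfidf"].vocabulary_
print("budget" in vocab, "zebra" in vocab)

# predict() takes raw text and reuses the fitted vectorizer; unseen words are ignored.
pred = toy_pipe.predict(test_docs)
print(pred.shape)
```

Fitting the vectorizer on the full dataset before splitting would let test-set word statistics influence the features, which is the leakage a pipeline avoids.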


In [50]:
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression

pipe = Pipeline([
    ('tfidf', TfidfVectorizer(
        lowercase=True,
        stop_words='english',
        ngram_range=(1,2),
        max_features=5000
    )),
    ('model', LogisticRegression(
        max_iter=1000,
        class_weight='balanced'
    ))
])

pipe.fit(X_train,y_train)

In [52]:
from sklearn.metrics import accuracy_score
y_pred = pipe.predict(X_test)
print(accuracy_score(y_test,y_pred))

0.989532293986637


## Why Cross-Validation?

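A single train-test split can give misleading results because the score depends on which rows happen to land in the test set. Cross-validation averages the score over several splits of the training data, giving a more robust estimate of generalization performance.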
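
As a rough illustration of the idea, the sketch below runs 5-fold cross-validation on a synthetic corpus and reports the spread of the fold scores alongside the mean. The documents, labels, and `toy_pipe` are invented for the example, and the explicit `StratifiedKFold` with shuffling is an assumption of this sketch: passing an integer `cv` to scikit-learn with a classifier also uses stratified folds, but without shuffling.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import make_pipeline

# Hypothetical synthetic corpus: 10 "real" and 10 "fake" documents.
real = [f"officials confirmed the report on item {i}" for i in range(10)]
fake = [f"shocking secret exposed about item {i}" for i in range(10)]
docs = real + fake
labels = [1] * 10 + [0] * 10

toy_pipe = make_pipeline(TfidfVectorizer(), LogisticRegression(max_iter=1000))

# Stratified folds keep the class ratio the same in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

# cross_validate can report several metrics in one pass.
results = cross_validate(toy_pipe, docs, labels, cv=cv, scoring=["accuracy", "f1"])
f1_scores = results["test_f1"]
print(f1_scores.mean(), f1_scores.std())
```

A small standard deviation across folds suggests the score does not hinge on one lucky split; a large one is a sign the single-split number should be treated with caution.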

In [53]:
from sklearn.model_selection import cross_val_score
scores = cross_val_score(
    pipe,
    X_train,
    y_train,
    cv=5,
    scoring = 'f1'
)

In [54]:
print(scores.mean())

0.987593675598719
