# Project 11: Fake News Detection

This notebook builds a machine learning model that classifies news articles as 'Real' or 'Fake'. It uses a classic NLP approach: TF-IDF vectorization followed by a linear Passive Aggressive classifier.

## 1. Setup and Library Imports

In [None]:
import pandas as pd
import numpy as np
import re
import nltk
from nltk.corpus import stopwords
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import PassiveAggressiveClassifier
from sklearn.metrics import accuracy_score, confusion_matrix, classification_report

## 2. Data Loading and Preparation

In [None]:
# Load the datasets
try:
    df_fake = pd.read_csv('data/Fake.csv')
    df_true = pd.read_csv('data/True.csv')
    print("Data loaded successfully.")
except FileNotFoundError:
    print("Data files not found. Make sure 'Fake.csv' and 'True.csv' are in the 'data/' directory.")
    raise  # Stop here: every later cell needs both dataframes

# Add labels
df_fake['label'] = 1 # 1 for Fake
df_true['label'] = 0 # 0 for True

# Combine the dataframes
df_combined = pd.concat([df_fake, df_true], ignore_index=True)

# Shuffle the dataset
df_combined = df_combined.sample(frac=1, random_state=42).reset_index(drop=True)

df_combined.head()

## 3. Text Preprocessing

In [None]:
# Download stopwords
nltk.download('stopwords')
stop_words = set(stopwords.words('english'))

def clean_text(text):
    """Function to clean article text."""
    text = text.lower()
    text = re.sub(r'[^a-zA-Z\s]', '', text) # Remove punctuation and numbers
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text

# Apply the cleaning function to the 'text' column
# (missing values become empty strings so .lower() does not fail on NaN)
df_combined['clean_text'] = df_combined['text'].fillna('').apply(clean_text)
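
To make the effect of the cleaning step concrete, here is a minimal standalone sketch of the same three operations applied to one headline. The sample sentence and the tiny `DEMO_STOP_WORDS` set are assumptions made for illustration; the notebook itself uses NLTK's full English stop-word list.

```python
import re

# Hypothetical mini stop-word list standing in for NLTK's English list,
# so this example runs without downloading anything.
DEMO_STOP_WORDS = {"the", "a", "is", "on", "in", "and", "of", "to"}

def clean_text_demo(text, stop_words=DEMO_STOP_WORDS):
    """Lowercase, keep only letters and whitespace, then drop stop words."""
    text = text.lower()
    text = re.sub(r"[^a-z\s]", "", text)
    return " ".join(word for word in text.split() if word not in stop_words)

# Made-up sample headline, not taken from the dataset.
sample = "BREAKING: The Senate voted 52-48 on the bill in 2017!"
cleaned = clean_text_demo(sample)
print(cleaned)  # breaking senate voted bill
```

Because non-letter characters are deleted rather than replaced with a space, contractions and hyphenated words are fused ("don't" becomes "dont"). That is usually acceptable for a bag-of-words model, but worth knowing when inspecting the vocabulary.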

## 4. Feature Engineering (TF-IDF) and Data Splitting

In [None]:
# Define features and labels
X = df_combined['clean_text']
y = df_combined['label']

# Split the data 80/20, stratified so both sets keep the same class balance
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42, stratify=y)

# Initialize a TfidfVectorizer
tfidf_vectorizer = TfidfVectorizer(max_features=5000)

# Fit and transform the training set, transform the test set
X_train_tfidf = tfidf_vectorizer.fit_transform(X_train)
X_test_tfidf = tfidf_vectorizer.transform(X_test)
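
For readers who want to see what the vectorizer computes, the following is a rough pure-Python sketch of TF-IDF weighting. It assumes raw term counts, the smoothed IDF `ln((1 + n) / (1 + df)) + 1`, and L2-normalised rows, which matches scikit-learn's documented defaults. Whitespace tokenisation, the toy corpus, and the function name `tfidf_fit_transform` are simplifications introduced here, and the `max_features` cap is not modelled.

```python
import math
from collections import Counter

def tfidf_fit_transform(docs):
    """Return (vocabulary, rows) with smoothed IDF and L2-normalised rows."""
    tokenized = [doc.split() for doc in docs]
    vocab = sorted({tok for doc in tokenized for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each term appears.
    df = Counter(tok for doc in tokenized for tok in set(doc))
    # Smoothed inverse document frequency.
    idf = {t: math.log((1 + n_docs) / (1 + df[t])) + 1 for t in vocab}
    rows = []
    for doc in tokenized:
        counts = Counter(doc)
        raw = [counts[t] * idf[t] for t in vocab]
        norm = math.sqrt(sum(v * v for v in raw)) or 1.0
        rows.append([v / norm for v in raw])
    return vocab, rows

# Toy corpus, made up for illustration.
toy_docs = [
    "senate passes budget bill",
    "aliens endorse budget bill",
    "senate senate hearing",
]
vocab, rows = tfidf_fit_transform(toy_docs)
for term, weight in zip(vocab, rows[0]):
    print(f"{term:8s} {weight:.3f}")
```

In the first document, "passes" appears in only one document and therefore receives a higher weight than "bill", which appears in two. This is the property that lets rare, distinctive terms dominate the feature vector.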

## 5. Model Building and Training

In [None]:
# Initialize a PassiveAggressiveClassifier
pac = PassiveAggressiveClassifier(max_iter=50, random_state=42)
pac.fit(X_train_tfidf, y_train)
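
The name "Passive Aggressive" describes the update rule: the model leaves its weights alone when an example is classified with enough margin (passive) and otherwise moves them just far enough to fix the mistake (aggressive). The sketch below illustrates the PA-I variant with hinge loss in plain Python. It is a simplified illustration, not scikit-learn's implementation: it has no intercept, uses labels in {-1, +1} rather than the notebook's {0, 1}, and the helper names and toy data are made up here.

```python
def pa_fit(samples, labels, C=1.0, epochs=5):
    """Train a linear classifier with the PA-I update; labels must be -1 or +1."""
    weights = [0.0] * len(samples[0])
    for _ in range(epochs):
        for x, y in zip(samples, labels):
            margin = y * sum(w * xi for w, xi in zip(weights, x))
            loss = max(0.0, 1.0 - margin)  # hinge loss
            sq_norm = sum(xi * xi for xi in x)
            if loss == 0.0 or sq_norm == 0.0:
                continue  # passive: margin already satisfied (or empty vector)
            tau = min(C, loss / sq_norm)  # aggressive step size, capped by C
            weights = [w + tau * y * xi for w, xi in zip(weights, x)]
    return weights

def pa_predict(weights, x):
    """Return +1 or -1 depending on which side of the hyperplane x falls."""
    return 1 if sum(w * xi for w, xi in zip(weights, x)) >= 0 else -1

# Toy, linearly separable data made up for illustration.
toy_X = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0], [0.1, 0.9]]
toy_y = [1, 1, -1, -1]
toy_weights = pa_fit(toy_X, toy_y)
print([pa_predict(toy_weights, x) for x in toy_X])
```

The parameter `C` limits how large a single correction can be, which makes the model less sensitive to mislabeled articles.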

## 6. Model Evaluation

In [None]:
# Predict on the test set and calculate accuracy
y_pred = pac.predict(X_test_tfidf)
score = accuracy_score(y_test, y_pred)
print(f'Accuracy: {score:.4f}')

# Build a confusion matrix
conf_matrix = confusion_matrix(y_test, y_pred)
print("\nConfusion Matrix:")
print(conf_matrix)

# Classification Report
print("\nClassification Report:")
print(classification_report(y_test, y_pred, target_names=['Real (0)', 'Fake (1)']))
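
To read the confusion matrix, recall that scikit-learn puts true labels on the rows and predicted labels on the columns, so with labels 0 and 1 the layout is `[[TN, FP], [FN, TP]]`. The sketch below derives accuracy, precision, and recall for the 'Fake' class from such a matrix. The counts are hypothetical and chosen only for easy arithmetic; they are not results from this notebook, and `binary_metrics` is a helper introduced here.

```python
def binary_metrics(conf_matrix):
    """Derive accuracy, precision, and recall for the positive class (label 1)."""
    (tn, fp), (fn, tp) = conf_matrix
    total = tn + fp + fn + tp
    accuracy = (tn + tp) / total
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    return accuracy, precision, recall

# Hypothetical counts for illustration only; not output from this notebook.
example_matrix = [[90, 10],
                  [5, 95]]
accuracy, precision, recall = binary_metrics(example_matrix)
print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

Here precision answers "of the articles flagged as fake, how many really were?", while recall answers "of the fake articles, how many were caught?". These are the same per-class quantities that the classification report lists.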

## 7. Conclusion

This notebook demonstrated a simple pipeline for fake news detection: clean the article text, convert it to TF-IDF features, and train a Passive Aggressive Classifier. The accuracy, confusion matrix, and classification report printed in Section 6 quantify how well the model separates real from fake articles on the held-out 20% test split.

This project shows that traditional NLP and machine learning techniques remain a strong, inexpensive baseline for text classification tasks.