# Challenge 2: Fake News Detector Project - Educational Notebook

## 🎯 Theme: AI for Information Integrity and Media Literacy

Welcome to Challenge 2! You'll build a fake news detection system that helps identify misinformation in news articles. This challenge explores one of the most critical applications of AI in our information age: combating the spread of false information.

## 📖 What You'll Learn
- **Text Classification**: Distinguish between real and fake news articles
- **Feature Engineering**: Extract meaningful signals from news text
- **NLP Techniques**: Apply advanced text processing for classification
- **Model Evaluation**: Assess classifier performance for critical applications
- **Information Literacy**: Understand how misinformation spreads and can be detected
- **Ethical AI**: Consider the responsibility and challenges of automated fact-checking

## 🗂️ Dataset Overview
You'll work with a news article dataset containing:
- **Headlines**: Article titles that may contain bias or sensational language
- **Text Content**: Full article text with varying writing styles and quality
- **Labels**: Binary classification (real=1, fake=0)
- **Diverse Sources**: Different types of news sources and topics

## 🚀 Challenge Roadmap
Follow these steps to build your fake news detector:

1. **📊 Data Exploration**: Understand news article patterns and distributions
2. **🔍 Linguistic Analysis**: Discover differences between real and fake news
3. **🧹 Text Preprocessing**: Clean and standardize news content
4. **⚙️ Feature Engineering**: Extract signals that distinguish fake from real news
5. **🤖 Model Training**: Build and train a news classifier
6. **📈 Evaluation**: Assess model performance with appropriate metrics
7. **💭 Critical Thinking**: Consider limitations and ethical implications

---

## 💡 **Key Insight**: 
Fake news detection goes beyond simple keyword matching. It requires understanding linguistic patterns, source credibility, and the subtle ways misinformation can be crafted to appear legitimate.

---

## ⚠️ **Important Considerations**:
- **No system is perfect**: AI can help flag suspicious content but shouldn't be the sole arbiter of truth
- **Context matters**: The same facts can be presented with different biases
- **Evolution of deception**: Bad actors constantly adapt to evade detection systems
- **Human oversight**: Critical decisions about information should involve human judgment

---

### Task 1: Load and Explore the Dataset

**🎯 Goal**: Understand your news data and the challenge of distinguishing real from fake content

**📝 What to do**:
- Load the dataset and examine article structure
- Analyze the distribution of real vs. fake news
- Compare sample articles to identify potential patterns
- Look for obvious linguistic or structural differences

**💡 Key Questions to Explore**:
- What makes an article "fake" vs. "real"?
- Are there obvious patterns in language, tone, or structure?
- How might the definition of "fake news" affect our approach?
- What challenges might arise in real-world deployment?
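
One way to start on the pattern questions above is to compare simple surface statistics between the two classes. The sketch below uses a tiny hand-written DataFrame rather than the challenge data; the example sentences and the `n_words` and `n_exclaim` column names are illustrative assumptions, and differences seen on four rows say nothing about the real dataset.

```python
import pandas as pd

# Toy stand-in for the challenge data (label: 1 = real, 0 = fake)
toy = pd.DataFrame({
    "text": [
        "Officials confirmed the budget on Tuesday.",
        "You will NOT believe this secret trick!!!",
        "The committee published its annual report.",
        "Doctors HATE this one weird cure!",
    ],
    "label": [1, 0, 1, 0],
})

# Surface statistics: whitespace-separated word count and exclamation marks
toy["n_words"] = toy["text"].str.split().str.len()
toy["n_exclaim"] = toy["text"].str.count("!")

# Average of each statistic per class
summary = toy.groupby("label")[["n_words", "n_exclaim"]].mean()
print(summary)
```

Once the dataset is loaded, the same two derived columns and `groupby` could be applied to `df`, assuming it has `text` and `label` columns as in the loading cell.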

In [None]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.naive_bayes import MultinomialNB
from sklearn.metrics import classification_report, confusion_matrix

# Load the dataset from GitHub repository
print("Loading fake news dataset from GitHub...")

# GitHub raw URLs for the dataset files
fake_url = "https://raw.githubusercontent.com/BridgingAISocietySummerSchools/Coding-Project/main/challenge_2/data/Fake.csv"
true_url = "https://raw.githubusercontent.com/BridgingAISocietySummerSchools/Coding-Project/main/challenge_2/data/True.csv"

try:
    # Load fake news articles
    fake_df = pd.read_csv(fake_url)
    fake_df['label'] = 0  # 0 for fake news
    print(f"✅ Loaded {len(fake_df)} fake news articles")
    
    # Load true news articles  
    true_df = pd.read_csv(true_url)
    true_df['label'] = 1  # 1 for real news
    print(f"✅ Loaded {len(true_df)} real news articles")
    
    # Combine datasets
    df = pd.concat([fake_df, true_df], ignore_index=True)
    
    print(f"\n📊 Combined dataset shape: {df.shape}")
    print(f"📊 Total articles: {len(df)}")
    
except Exception as e:
    print(f"❌ Error loading data: {e}")
    print("💡 Trying to load from local data folder as fallback...")
    try:
        df = pd.read_csv("../data/fake_news_dataset.csv")
        print("✅ Loaded from local data folder")
    except Exception as fallback_error:
        print(f"❌ Could not load data from local folder either: {fallback_error}")
        raise

print("\n🔍 Dataset Overview:")
print("First few rows:")
display_cols = ['title', 'text', 'label'] if 'title' in df.columns else df.columns[:3].tolist()
print(df[display_cols].head())

print("\n📈 Label distribution:")
label_counts = df['label'].value_counts()
print(f"Real news (1): {label_counts.get(1, 0)}")
print(f"Fake news (0): {label_counts.get(0, 0)}")

print("\n📋 Column information:")
print(f"Columns: {list(df.columns)}")
print(f"Data types: {df.dtypes.to_dict()}")

### Task 2: Text Preprocessing


In [None]:
import re
import nltk
from nltk.corpus import stopwords

# Download required NLTK data
try:
    stopwords.words('english')
except LookupError:
    nltk.download('stopwords')

stop_words = set(stopwords.words('english'))

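def preprocess_text(text):
    """Lowercase the text, strip non-letter characters, and drop English stopwords."""
    # Treat missing or non-string values (e.g. NaN) as empty text
    if not isinstance(text, str):
        return ''
    # Convert to lowercase
    text = text.lower()
    # Remove everything except letters and whitespace (digits, punctuation, symbols)
    text = re.sub(r'[^a-z\s]', '', text)
    # Remove stopwords
    text = ' '.join([word for word in text.split() if word not in stop_words])
    return text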

# YOUR_CODE_HERE: Apply preprocessing to the text column
df['processed_text'] = df['text'].apply(preprocess_text)

print("Text preprocessing completed!")
print(df[['text', 'processed_text']].head())


### Task 3: Feature Extraction and Model Training


In [None]:
# YOUR_CODE_HERE: Split the data into training and testing sets
X = df['processed_text']
y = df['label']
# stratify=y keeps the real/fake ratio the same in the training and test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# YOUR_CODE_HERE: Create TF-IDF features
vectorizer = TfidfVectorizer(max_features=5000)
X_train_tfidf = vectorizer.fit_transform(X_train)
X_test_tfidf = vectorizer.transform(X_test)

# YOUR_CODE_HERE: Train a Naive Bayes classifier
model = MultinomialNB()
model.fit(X_train_tfidf, y_train)

print("Model training completed!")


### Task 4: Model Evaluation


In [None]:
# YOUR_CODE_HERE: Make predictions and evaluate the model
y_pred = model.predict(X_test_tfidf)

print("Classification Report:")
print(classification_report(y_test, y_pred))

print("\nConfusion Matrix:")
print(confusion_matrix(y_test, y_pred))
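
Because the labels here are real=1 and fake=0, the cells of the confusion matrix are easy to misread. The sketch below unpacks a matrix built from small made-up label lists (not model output) and names each cell in terms of the task; the variable names are illustrative.

```python
from sklearn.metrics import confusion_matrix

# Made-up labels for illustration (1 = real, 0 = fake)
y_true_toy = [0, 0, 0, 1, 1, 1, 1, 0]
y_pred_toy = [0, 1, 0, 1, 1, 0, 1, 0]

# With labels=[0, 1], rows are true classes and columns are predicted classes
cm = confusion_matrix(y_true_toy, y_pred_toy, labels=[0, 1])
fake_caught, fake_missed, real_flagged, real_passed = cm.ravel()

# Share of fake articles the classifier caught (recall for class 0)
fake_recall = fake_caught / (fake_caught + fake_missed)

print(f"Fake caught: {fake_caught}, fake missed (predicted real): {fake_missed}")
print(f"Real wrongly flagged as fake: {real_flagged}, real passed: {real_passed}")
print(f"Recall on fake news: {fake_recall:.2f}")
```

The same unpacking works on `confusion_matrix(y_test, y_pred, labels=[0, 1])` from the cell above.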


### Task 5: Ethical Considerations
Discuss the following questions:

1. What are the potential consequences of false positives and false negatives in fake news detection?
2. How might this system be biased based on the training data?
3. What are the implications of automated content moderation on free speech?
4. How can we ensure transparency and accountability in AI-powered fact-checking systems?
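
One practical response to the first question is to avoid fully automated decisions: act only on confident predictions and send uncertain ones to a human reviewer. The sketch below illustrates that idea; the `route` helper and the 0.3 and 0.7 thresholds are hypothetical choices, not part of the challenge code.

```python
import numpy as np

def route(p_fake: np.ndarray, low: float = 0.3, high: float = 0.7) -> np.ndarray:
    """Map P(fake) to an action; scores between low and high go to a human."""
    return np.where(
        p_fake >= high, "flag",
        np.where(p_fake <= low, "pass", "review"),
    )

# Made-up probabilities for illustration
probs = np.array([0.05, 0.45, 0.92, 0.70, 0.30])
actions = route(probs)
print(actions)
```

With the classifier trained earlier, the fake-class probability would be `model.predict_proba(X_test_tfidf)[:, 0]`, since class 0 is fake. Widening the review band reduces automated errors at the cost of more human work.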
