# Lab 5 — Final Mini Project (Choose Track)

✅ **Submit this notebook** with outputs + short notes.

**Requirements:**
1) Load dataset
2) Train/test split
3) Pipeline (vectorizer + model)
4) Metrics (F1 + confusion matrix)
5) Error analysis (10 wrong predictions)
6) Demo predictions on 5 custom inputs

---

## Choose ONE track
- Track A (Easy): Built-in dataset
- Track B (Medium): BBC News category classification (public CSV)
- Track C (Advanced): Toxic comments (HuggingFace dataset subset)


In [None]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, ConfusionMatrixDisplay

print('✅ Ready')

# Track B (Medium) — BBC News Category Classification (Recommended)

This track is recommended for most teams.

Dataset source: public GitHub CSV.

In [None]:
url = "https://raw.githubusercontent.com/selva86/datasets/master/BBCNews.csv"
df = pd.read_csv(url)
df.head()

In [None]:
df['category'].value_counts()
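
Before splitting, it can help to check for missing, empty, or duplicated texts, since duplicates that land in both train and test can inflate the scores. The sketch below uses a hypothetical toy frame (`toy_df`) with the same `text` and `category` column names as above; whether the real CSV contains such rows is not something this notebook has verified.

```python
import pandas as pd

# Hypothetical toy data with one duplicate, one missing and one blank text
toy_df = pd.DataFrame({
    "category": ["sport", "sport", "tech", "business", "tech"],
    "text": ["goal scored", "goal scored", "new phone", None, "  "],
})

# Drop rows with missing text, then rows whose text is only whitespace
toy_clean = toy_df.dropna(subset=["text"])
toy_clean = toy_clean[toy_clean["text"].str.strip() != ""]

# Keep the first copy of each repeated text
toy_clean = toy_clean.drop_duplicates(subset=["text"]).reset_index(drop=True)

print(f"rows before: {len(toy_df)}, rows after: {len(toy_clean)}")
```

If you apply the same steps to `df`, mention in your notes how many rows were removed.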

In [None]:
# Hold out 25% for testing; stratify keeps the category proportions similar in both splits
X_train, X_test, y_train, y_test = train_test_split(
    df['text'], df['category'], test_size=0.25, random_state=42, stratify=df['category']
)
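
A quick way to confirm that `stratify` did its job is to compare class proportions in the two splits. This is a minimal sketch on hypothetical toy labels (`toy_labels`, `toy_texts`), not on the BBC data; the same `value_counts(normalize=True)` comparison should work on `y_train` and `y_test`.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical toy labels: three classes with unequal frequencies (60/30/10)
toy_labels = pd.Series(["sport"] * 60 + ["business"] * 30 + ["tech"] * 10)
toy_texts = pd.Series([f"doc {i}" for i in range(len(toy_labels))])

_, _, toy_y_train, toy_y_test = train_test_split(
    toy_texts, toy_labels, test_size=0.25, random_state=42, stratify=toy_labels
)

# Class proportions side by side; with stratification the columns should be close
split_check = pd.DataFrame({
    "train": toy_y_train.value_counts(normalize=True),
    "test": toy_y_test.value_counts(normalize=True),
}).round(2)
print(split_check)
```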

## Train TF-IDF + Logistic Regression pipeline

In [None]:
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression

# TF-IDF on unigrams and bigrams; min_df=2 drops terms that occur in only one document
clf = Pipeline([
    ('tfidf', TfidfVectorizer(stop_words='english', ngram_range=(1, 2), min_df=2)),
    ('model', LogisticRegression(max_iter=500))
])

clf.fit(X_train, y_train)
y_pred = clf.predict(X_test)

print(classification_report(y_test, y_pred))
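
The report above already lists F1 per class. If you want a single number for your notes, `f1_score` with `average="macro"` gives the unweighted mean of the per-class scores. The sketch below trains on a hypothetical six-sentence toy corpus so it runs on its own; the names `f1_clf`, `toy_train_texts` and so on are made up for this example.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score
from sklearn.pipeline import Pipeline

# Hypothetical toy corpus, only to make the example runnable on its own
toy_train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players after the match",
    "shares rose after the company reported profits",
    "the bank raised interest rates",
    "investors bought shares in the company",
]
toy_train_labels = ["sport", "sport", "sport", "business", "business", "business"]
toy_test_texts = [
    "the players scored a goal in the match",
    "the company profits pleased investors",
]
toy_test_labels = ["sport", "business"]

f1_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=500)),
]).fit(toy_train_texts, toy_train_labels)
toy_pred = f1_clf.predict(toy_test_texts)

# Macro F1: unweighted mean of the per-class F1 scores
macro_f1 = f1_score(toy_test_labels, toy_pred, average="macro")
# average=None returns one F1 per class, in the order given by labels
per_class_f1 = f1_score(
    toy_test_labels, toy_pred, average=None, labels=f1_clf.classes_
)

print(f"macro F1: {macro_f1:.3f}")
for label, score in zip(f1_clf.classes_, per_class_f1):
    print(f"{label}: {score:.3f}")
```

On the real data the equivalent call would be `f1_score(y_test, y_pred, average="macro")`.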

## Confusion matrix

In [None]:
cm = confusion_matrix(y_test, y_pred, labels=clf.classes_)
disp = ConfusionMatrixDisplay(cm, display_labels=clf.classes_)
disp.plot(xticks_rotation=45)
plt.show()
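
Raw counts can be hard to compare when categories differ in size. One option is a row-normalized matrix, where each row sums to 1 and the diagonal is the recall for that class. The sketch below uses hypothetical hand-written labels (`toy_true`, `toy_predicted`) rather than model output.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Hypothetical true and predicted labels for six documents
toy_label_order = ["business", "sport", "tech"]
toy_true = ["business", "business", "sport", "sport", "sport", "tech"]
toy_predicted = ["business", "tech", "sport", "sport", "business", "tech"]

# normalize="true" divides each row by the number of true samples in that class
cm_norm = confusion_matrix(
    toy_true, toy_predicted, labels=toy_label_order, normalize="true"
)

# Rows are true classes, columns are predicted classes
cm_table = pd.DataFrame(cm_norm, index=toy_label_order, columns=toy_label_order)
print(cm_table.round(2))
```

For the BBC model, passing `normalize="true"` to the `confusion_matrix` call in the cell above should give the same kind of matrix, which `ConfusionMatrixDisplay` can plot as before.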

## Error analysis

In [None]:
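# y_pred is a NumPy array in the same row order as X_test, so it lines up with the test index
test_df = pd.DataFrame({'text': X_test, 'y_true': y_test, 'y_pred': y_pred})
wrong = test_df[test_df['y_true'] != test_df['y_pred']]
wrong.head(10)  # shows up to 10 misclassified articles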
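
When picking which wrong predictions to discuss, it may help to sort them by how confident the model was, since confidently wrong cases often point to ambiguous or mislabeled texts. This is a minimal sketch on a hypothetical toy corpus (`err_clf`, `toy_eval` are made-up names); it relies on `predict_proba`, which `LogisticRegression` provides.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy training data
err_train_texts = [
    "the team won the football match",
    "the striker scored a late goal",
    "the coach praised the players",
    "shares rose after the company reported profits",
    "the bank raised interest rates",
    "investors bought shares in the company",
]
err_train_labels = ["sport", "sport", "sport", "business", "business", "business"]

err_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=500)),
]).fit(err_train_texts, err_train_labels)

# Hypothetical evaluation set; the last sentence mixes both topics on purpose
toy_eval = pd.DataFrame({
    "text": [
        "the striker scored in the match",
        "investors sold shares in the bank",
        "the company bought the football team",
    ],
    "y_true": ["sport", "business", "sport"],
})

# One row of class probabilities per text; columns follow err_clf.classes_
proba = err_clf.predict_proba(toy_eval["text"])
toy_eval["y_pred"] = err_clf.classes_[proba.argmax(axis=1)]
toy_eval["confidence"] = proba.max(axis=1)

# Misclassified rows, most confident mistakes first
toy_wrong = toy_eval[toy_eval["y_true"] != toy_eval["y_pred"]].sort_values(
    "confidence", ascending=False
)
print(toy_wrong.head(10))
```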

## Demo predictions (mandatory)

In [None]:
demo_inputs = [
    "The company reported strong quarterly profits and new investments.",
    "The team scored a last-minute goal to win the match.",
    "New smartphone release includes AI features and better battery.",
    "Government announced new policy changes for the economy.",
    "A famous actor starred in a new blockbuster movie."
]

# Predict all demo inputs in one call instead of one call per sentence
for s, label in zip(demo_inputs, clf.predict(demo_inputs)):
    print(s, '->', label)

# Optional Track C (Advanced) — Toxic Comments (HuggingFace)
Run only if your team wants an advanced dataset.

In [None]:
# Uncomment to try Track C
# !pip -q install datasets
# from datasets import load_dataset
# ds = load_dataset('civil_comments')
# train = ds['train'].select(range(5000))
# adv_df = train.to_pandas()[['text', 'toxicity']]  # convert the HuggingFace subset to a DataFrame
# adv_df['label'] = (adv_df['toxicity'] > 0.5).astype(int)
# adv_df.head()
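
After thresholding, the toxic class is likely to be much smaller than the non-toxic one, so a plain classifier may mostly predict 0. One common option is `class_weight="balanced"` in `LogisticRegression`. The sketch below is an assumption about how a team might set this up, shown on a hypothetical toy frame (`toy_adv_df`) with the same `text`, `toxicity` and `label` columns as the cell above.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Hypothetical toy comments with made-up toxicity scores
toy_adv_df = pd.DataFrame({
    "text": [
        "thanks for the helpful answer",
        "great article, well written",
        "i agree with this point",
        "interesting read, thank you",
        "nice summary of the topic",
        "good point, well argued",
        "you are a stupid idiot",
        "shut up you idiot",
    ],
    "toxicity": [0.0, 0.0, 0.1, 0.0, 0.2, 0.0, 0.9, 0.8],
})
# Same thresholding rule as in the Track C cell
toy_adv_df["label"] = (toy_adv_df["toxicity"] > 0.5).astype(int)
print(toy_adv_df["label"].value_counts())

# class_weight="balanced" reweights samples inversely to class frequency
toxic_clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("model", LogisticRegression(max_iter=500, class_weight="balanced")),
])
toxic_clf.fit(toy_adv_df["text"], toy_adv_df["label"])

toy_adv_preds = toxic_clf.predict(["what an idiot", "thank you for the article"])
print(toy_adv_preds)
```

With an imbalanced label, report F1 for the toxic class rather than accuracy alone.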