
# ITAI 2373 — NewsBot Intelligence System (Midterm) — **BBC Dataset**
**Student:** `Tzorin, Bryan`  

**Repo folder:** `ITAI2373-NewsBot-Midterm/`  

**Notebook name:** `NewsBot_Midterm_Tzorin_Bryan.ipynb`

This notebook implements **Modules 1–8** end-to-end using the **BBC News** dataset from Kaggle. It is designed for **Google Colab (free tier)** and mirrors the midterm rubric.


## 0) Setup — installs & imports

In [1]:

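import sys, subprocess, importlib

def ensure(pkg, import_name=None):
    """Install `pkg` with pip only if its module cannot be imported."""
    import_name = import_name or pkg
    try:
        importlib.import_module(import_name)
        print(f"✓ {pkg} available")
    except ImportError:
        print(f"Installing {pkg}…")
        subprocess.check_call([sys.executable, "-m", "pip", "install", "-q", pkg])

# pip package name -> importable module name (they differ for scikit-learn)
packages = {"pandas": "pandas", "numpy": "numpy", "scikit-learn": "sklearn", "spacy": "spacy",
            "nltk": "nltk", "matplotlib": "matplotlib", "seaborn": "seaborn"}
for pkg, mod in packages.items():
    ensure(pkg, mod)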

import pandas as pd, numpy as np, re, json, math, os, random, collections
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.pipeline import Pipeline
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC
from sklearn.metrics import accuracy_score, f1_score, classification_report, confusion_matrix
import nltk, spacy

# NLTK resources (quiet); each resource lives under a different nltk_data folder
nltk_paths = {
    "vader_lexicon": "sentiment/vader_lexicon",
    "stopwords": "corpora/stopwords",
    "punkt": "tokenizers/punkt",
}
for res, path in nltk_paths.items():
    try:
        nltk.data.find(path)
    except LookupError:
        nltk.download(res, quiet=True)

# spaCy model (small)
try:
    nlp = spacy.load("en_core_web_sm")
except Exception:
    subprocess.check_call([sys.executable, "-m", "spacy", "download", "en_core_web_sm"])
    nlp = spacy.load("en_core_web_sm")

print("✅ Setup complete")


✓ pandas available
✓ numpy available
Installing scikit-learn…
✓ spacy available
✓ nltk available
✓ matplotlib available
✓ seaborn available
✅ Setup complete



## 1) Data — Download BBC dataset from Kaggle
> **Option A:** Kaggle API inside Colab (recommended).  
> **Option B:** Upload CSV/JSON manually, then run **1.3 Load**.


In [2]:

# Kaggle setup (Colab) — uncomment in Colab only
# !pip install kaggle -q
# from google.colab import files
# print("Upload kaggle.json (Kaggle account → Create New API Token).")
# uploaded = files.upload()
# !mkdir -p ~/.kaggle
# !cp kaggle.json ~/.kaggle/
# !chmod 600 ~/.kaggle/kaggle.json
# print("✅ Kaggle API ready.")


In [3]:

# Download competition files — uncomment in Colab only
# !kaggle competitions download -c learn-ai-bbc -q
# !unzip -o learn-ai-bbc.zip
# !ls -la


### 1.3 Load BBC data (auto-detect common filenames)

In [4]:

import glob, pandas as pd, os

csv_candidates = glob.glob("*.csv") + glob.glob("**/*.csv", recursive=True)
json_candidates = glob.glob("*.json") + glob.glob("**/*.json", recursive=True)
print("CSV candidates:", csv_candidates[:10])
print("JSON candidates:", json_candidates[:10])

df = None
src_file = None
for fname in csv_candidates:
    try:
        tmp = pd.read_csv(fname)
        cols = [c.lower() for c in tmp.columns]
        if any(k in cols for k in ["text","content","article"]) and any(k in cols for k in ["category","label","class"]):
            df = tmp.copy()
            src_file = fname
            break
    except Exception:
        pass

if df is None and os.path.exists("train.csv"):
    df = pd.read_csv("train.csv")
    src_file = "train.csv"

if df is not None:
    print(f"Loaded: {src_file}, shape={df.shape}")
    print(df.head())
else:
    print("⚠️ Upload a CSV via the Colab sidebar, then re-run this cell.")


CSV candidates: ['sample_data/mnist_train_small.csv', 'sample_data/california_housing_train.csv', 'sample_data/california_housing_test.csv', 'sample_data/mnist_test.csv']
JSON candidates: ['sample_data/anscombe.json']
⚠️ Upload a CSV via the Colab sidebar, then re-run this cell.


### 1.4 Standardize columns, filter & sample

In [5]:

def find_col(cands, cols):
    cols_lower = [c.lower() for c in cols]
    for c in cands:
        if c in cols_lower:
            return cols[cols_lower.index(c)]
    return None

if df is not None:
    text_col = find_col(["text","content","article","description","headline"], df.columns)
    label_col = find_col(["category","label","class","topic"], df.columns)
    if text_col is None or label_col is None:
        raise ValueError("Could not infer text/label columns. Rename them to contain 'text' and 'category'.")

    df = df.dropna(subset=[text_col, label_col]).copy()
    df = df[df[text_col].str.len() > 30]  # substantial text
    df = df.rename(columns={text_col: "content", label_col: "category"})
    if len(df) > 2000:
        df = df.sample(2000, random_state=42)
    print(df["category"].value_counts())
    df.to_csv("newsbot_dataset.csv", index=False)
    print("✅ Saved standardized dataset as newsbot_dataset.csv")
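
The column matching in 1.4 is case-insensitive and exact: a candidate must equal the lowercased header, not merely appear inside it. The sketch below replays that logic on a plain list. The header names `ArticleId`, `Text`, and `Category` are an assumption about the Kaggle file's layout, used here only for illustration.

```python
def find_col(candidates, columns):
    """Return the first column whose lowercased name equals a candidate, else None."""
    columns = list(columns)
    lowered = [c.lower() for c in columns]
    for cand in candidates:
        if cand in lowered:
            return columns[lowered.index(cand)]
    return None

# Assumed header layout, for illustration only
columns = ["ArticleId", "Text", "Category"]
text_col = find_col(["text", "content", "article"], columns)
label_col = find_col(["category", "label", "class"], columns)
print(text_col, label_col)
```

Because the match is exact, `ArticleId` is not mistaken for an `article` text column, and the original capitalization is returned so the later `rename` call targets the real header.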


## 2) Module 1 — Real-World NLP Application Context


**Use case:** Media monitoring dashboard for editors/PR teams to route and summarize news.  
**Users:** Editors, PR/brand teams, analysts, researchers.  
**Value:** Fast categorization, entity tracking, sentiment monitoring, pattern discovery.


## 3) Module 2 — Text Preprocessing

In [6]:

import re, nltk
from nltk.corpus import stopwords
stop_set = set(stopwords.words("english"))

def clean_text(s: str) -> str:
    s = re.sub(r"http\S+|www\.\S+", " ", s)
    s = re.sub(r"<.*?>", " ", s)
    s = re.sub(r"[^A-Za-z0-9\s']", " ", s)
    s = re.sub(r"\s+", " ", s).strip()
    return s

def spacy_lemma(doc):
    return " ".join(t.lemma_.lower() for t in nlp(doc) if t.is_alpha and t.lemma_.lower() not in stop_set)

if 'df' in globals() and df is not None:
    df["clean"] = df["content"].apply(clean_text)
    sample_n = min(2000, len(df))
    df_lemma = df.sample(sample_n, random_state=42).copy()
    df_lemma["lemma"] = df_lemma["clean"].apply(spacy_lemma)
    # spacy_lemma returns "" (never NaN) when no tokens survive, so filter out empty strings
    df_lemma = df_lemma[df_lemma["lemma"].str.len() > 0]
    print(df_lemma[["category","lemma"]].head())
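
To make the effect of each regex in `clean_text` concrete, here is a small standalone run on a made-up sentence (the sample string and URL are hypothetical). It uses only the standard library and repeats the function so the snippet runs on its own.

```python
import re

def clean_text(s: str) -> str:
    s = re.sub(r"http\S+|www\.\S+", " ", s)   # drop URLs
    s = re.sub(r"<.*?>", " ", s)              # drop HTML tags
    s = re.sub(r"[^A-Za-z0-9\s']", " ", s)    # keep letters, digits, whitespace, apostrophes
    s = re.sub(r"\s+", " ", s).strip()        # collapse runs of whitespace
    return s

# Hypothetical input, for illustration only
raw = "<p>Apple's shares rose 5% - see https://example.com/story for details!</p>"
cleaned = clean_text(raw)
print(cleaned)  # Apple's shares rose 5 see for details
```

Note that the percent sign, dash, and exclamation mark are all removed. Steps that depend on punctuation or symbols (sentiment scoring, MONEY entities) may therefore work better on the original `content` column than on `clean`.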


## 4) Module 3 — TF‑IDF & Top Terms

In [7]:

from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np, pandas as pd

if 'df_lemma' in globals():
    tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1,2), min_df=2)
    X = tfidf.fit_transform(df_lemma["lemma"])
    y = df_lemma["category"].values
    feature_names = np.array(tfidf.get_feature_names_out())

    top_terms = {}
    for cat in pd.Series(y).unique():
        idx = np.where(y == cat)[0]
        class_mean = X[idx].mean(axis=0).A1
        top_idx = class_mean.argsort()[-15:][::-1]
        top_terms[cat] = feature_names[top_idx].tolist()

    for cat, terms in top_terms.items():
        print(f"\nTop terms for {cat}:")
        print(", ".join(terms))
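
The "top terms" ranking above averages each term's TF-IDF weight over the documents of one class and keeps the largest means. The following is a hand-rolled sketch of that idea on a toy corpus, not scikit-learn's implementation: the four documents are invented, and the smoothed IDF formula is an assumption chosen to resemble the vectorizer's default behavior.

```python
import math
from collections import Counter, defaultdict

# Toy corpus, invented for illustration: (label, text)
docs = [
    ("sport", "match goal win"),
    ("sport", "goal team win"),
    ("tech", "software chip launch"),
    ("tech", "chip software update"),
]

n_docs = len(docs)
# Document frequency: number of documents containing each term
df_counts = Counter(term for _, text in docs for term in set(text.split()))
# Smoothed IDF (assumed variant): ln((1 + n) / (1 + df)) + 1
idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df_counts.items()}

def tfidf_vector(text):
    """L2-normalized TF-IDF weights for one document."""
    tf = Counter(text.split())
    weights = {t: c * idf[t] for t, c in tf.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {t: w / norm for t, w in weights.items()}

# Sum each term's weight per class, then rank by the class mean
class_sums = defaultdict(Counter)
class_sizes = Counter()
for label, text in docs:
    class_sizes[label] += 1
    for term, w in tfidf_vector(text).items():
        class_sums[label][term] += w

top_terms = {}
for label, sums in class_sums.items():
    ranked = sorted(sums.items(), key=lambda kv: (-kv[1] / class_sizes[label], kv[0]))
    top_terms[label] = [term for term, _ in ranked[:2]]

print(top_terms)
```

Terms that occur in every document of a class ("goal", "win") outrank terms that occur in only one of them, even though the rarer terms have a higher IDF, because the mean is taken over all documents in the class.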


## 5) Module 4 — POS Distributions

In [8]:

from collections import Counter
import itertools, pandas as pd

def pos_counts(texts, max_docs=120):
    c = Counter()
    for t in itertools.islice(texts, max_docs):
        for tok in nlp(t):
            if not tok.is_space:
                c[tok.pos_] += 1
    total = sum(c.values()) or 1
    return {k: v/total for k,v in c.items()}

if 'df_lemma' in globals():
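    # Tag the cleaned (not lemmatized) text: the lemma column has stop words removed, which skews POS shares
    pos_by_cat = {cat: pos_counts(grp["clean"].tolist(), 100) for cat, grp in df_lemma.groupby("category")}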
    pos_df = pd.DataFrame(pos_by_cat).fillna(0).sort_index()
    display(pos_df)


## 6) Module 5 — Syntax & SVO Samples

In [9]:

def extract_svo(doc):
    """Return the first (subject, verb lemma, object) triple whose subject and object share a verb."""
    for tok in doc:
        if tok.dep_ in ("nsubj", "nsubjpass"):
            verb = tok.head
            # Only accept objects attached to the same verb as the subject
            for child in verb.children:
                if child.dep_ in ("dobj", "attr", "dative"):
                    return (tok.text, verb.lemma_, child.text)
    return None

if 'df_lemma' in globals():
    samples = []
    # Parse the cleaned text; the lemma column lacks the function words the parser relies on
    for txt in df_lemma["clean"].head(40):
        trip = extract_svo(nlp(txt))
        if trip:
            samples.append(trip)
    print(samples[:12])


## 7) Module 6 — Sentiment by Category

In [10]:

from nltk.sentiment import SentimentIntensityAnalyzer
sia = SentimentIntensityAnalyzer()

if 'df_lemma' in globals():
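    # Score the original text: VADER uses negations, punctuation and capitalization, all of which the lemma column strips
    df_lemma["sentiment"] = df_lemma["content"].apply(lambda s: sia.polarity_scores(s)["compound"])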
    print(df_lemma.groupby("category")["sentiment"].agg(["mean","median","count"]).sort_values("mean"))


## 8) Module 7 — Classifiers & Evaluation

In [11]:

models = {
    "LogReg": LogisticRegression(max_iter=2000),
    "LinearSVC": LinearSVC(),
    "MultinomialNB": MultinomialNB()
}

if 'df_lemma' in globals():
    X_train, X_test, y_train, y_test = train_test_split(df_lemma["lemma"], df_lemma["category"], test_size=0.2, random_state=42, stratify=df_lemma["category"])

    results = {}
    for name, clf in models.items():
        pipe = Pipeline([("tfidf", TfidfVectorizer(max_features=30000, ngram_range=(1,2), min_df=2)),
                         ("clf", clf)])
        pipe.fit(X_train, y_train)
        preds = pipe.predict(X_test)
        acc = accuracy_score(y_test, preds)
        f1 = f1_score(y_test, preds, average="macro")
        results[name] = {"accuracy": acc, "macro_f1": f1, "model": pipe}
        print(f"{name}: accuracy={acc:.3f}, macro_f1={f1:.3f}")

    best = max(results.items(), key=lambda kv: kv[1]["macro_f1"])[0]
    print("Best model:", best)
    # Predict once with the best model and reuse the sorted label list
    best_preds = results[best]["model"].predict(X_test)
    labels = sorted(y_test.unique())
    print("\nClassification report:\n", classification_report(y_test, best_preds))
    cm = confusion_matrix(y_test, best_preds, labels=labels)

    import matplotlib.pyplot as plt, numpy as np
    plt.figure(figsize=(6,6))
    plt.imshow(cm, aspect='auto')
    plt.title("Confusion Matrix (best model)")
    plt.xlabel("Predicted")
    plt.ylabel("True")
    plt.xticks(range(len(labels)), labels, rotation=45)
    plt.yticks(range(len(labels)), labels)
    plt.colorbar()
    plt.tight_layout()
    plt.show()
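
The best model is chosen by macro F1 rather than accuracy. Macro F1 is the unweighted mean of the per-class F1 scores, so a small class counts as much as a large one. The sketch below computes both metrics by hand on made-up labels to show how they can diverge; it is a minimal illustration, not a replacement for `f1_score`.

```python
def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 scores."""
    labels = sorted(set(y_true) | set(y_pred))
    scores = []
    for label in labels:
        tp = sum(t == label and p == label for t, p in zip(y_true, y_pred))
        fp = sum(t != label and p == label for t, p in zip(y_true, y_pred))
        fn = sum(t == label and p != label for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); defined as 0 when the class never appears
        scores.append(2 * tp / denom if denom else 0.0)
    return sum(scores) / len(scores)

# Made-up labels: one of two "sport" articles is misrouted to "tech"
y_true = ["tech", "tech", "tech", "tech", "sport", "sport"]
y_pred = ["tech", "tech", "tech", "tech", "tech", "sport"]

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
score = macro_f1(y_true, y_pred)
print(round(accuracy, 3), round(score, 3))  # 0.833 0.778
```

Here a single error in the smaller class pulls macro F1 below accuracy, which is why macro F1 is a reasonable selection criterion when category sizes are uneven.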


## 9) Module 8 — NER (PERSON/ORG/GPE/DATE/MONEY)

In [12]:

from collections import defaultdict, Counter
import pandas as pd

if 'df_lemma' in globals():
    ent_counts = defaultdict(Counter)
    for cat, grp in df_lemma.groupby("category"):
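        # Run NER on the original text: the lemma column is lowercased and keeps only alphabetic tokens,
        # so the capitalization, digits and currency symbols that NER relies on are gone
        for text in grp["content"].head(60):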
            for e in nlp(text).ents:
                if e.label_ in {"PERSON","ORG","GPE","DATE","MONEY"}:
                    ent_counts[cat][e.label_] += 1
    ent_df = pd.DataFrame(ent_counts).fillna(0).astype(int)
    display(ent_df)



## 10) Business Insights
- **Routing:** the classifier sends each article to the right desk; the confusion matrix shows which categories overlap (e.g., Tech vs Business).  
- **Entities:** frequent **ORG/PERSON** mentions per category → beats and alerts.  
- **Sentiment:** track tone by category; anomalies can be newsworthy.  
- **Style:** POS and dependency patterns reveal how writing differs across desks.
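
One way to turn the routing insight into something actionable is to list the largest off-diagonal cells of the confusion matrix, since those are the category pairs most often mistaken for each other. The sketch below assumes rows are true labels and columns are predictions (the layout `confusion_matrix` produces). The counts and the `top_confusions` helper are hypothetical, made up for illustration.

```python
def top_confusions(cm, labels, k=3):
    """Return the k largest off-diagonal cells as (true, predicted, count)."""
    cells = [
        (labels[i], labels[j], cm[i][j])
        for i in range(len(labels))
        for j in range(len(labels))
        if i != j and cm[i][j] > 0
    ]
    return sorted(cells, key=lambda c: -c[2])[:k]

# Made-up counts; rows are true labels, columns are predicted labels
labels = ["business", "sport", "tech"]
cm = [
    [90, 1, 9],
    [2, 97, 1],
    [12, 0, 88],
]

pairs = top_confusions(cm, labels, k=2)
print(pairs)  # [('tech', 'business', 12), ('business', 'tech', 9)]
```

With the notebook's own matrix, `cm` and the sorted label list from the evaluation cell could be passed in directly, since indexing with `cm[i][j]` also works on a NumPy array.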
