
# Advanced NLP Project: End‑to‑End Text Classification (Scrape → Dataset → Experiments → Tuning → Evaluation → Demo)

**Last updated:** 2025-09-04 09:33

This notebook walks through an end‑to‑end text classification pipeline, organized as an **advanced, resume‑ready** project. It includes:

- **Data ingestion**: respectful web scraping & content extraction
- **Dataset building**: cleaning, labeling, and Hugging Face `datasets` integration
- **EDA**: quick sanity checks, length stats, and class balance plots
- **Modeling**: multiple transformer backbones (DistilBERT, BERT, RoBERTa)
- **Training**: Hugging Face `Trainer` with early stopping, mixed precision
- **Hyperparameter tuning**: `optuna` via `Trainer.hyperparameter_search`
- **Evaluation**: accuracy, precision, recall, F1 (macro), confusion matrix, error analysis
- **Inference & Packaging**: pipeline for batch inference
- **Deployment**: minimal **Gradio** demo
- **Reproducibility**: config cell, seeds, and model card stub

> Tip: Run sections incrementally. Comment out heavy cells (such as hyperparameter tuning) while iterating. The optional `shap`/`umap-learn` installs are not used by any cell below.


In [None]:

# %%capture
# If running fresh, uncomment installs.
# !pip install -U torch transformers datasets evaluate accelerate scikit-learn optuna gradio requests beautifulsoup4 trafilatura matplotlib pandas numpy
# Optional (heavy): shap umap-learn
# !pip install shap umap-learn


In [None]:

import os, random, time, json, math
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix, precision_recall_fscore_support

import torch  # needed for the CUDA check in TrainingArguments
import datasets
from datasets import Dataset, DatasetDict
import evaluate

from transformers import (AutoTokenizer, AutoModelForSequenceClassification,
                          DataCollatorWithPadding, TrainingArguments, Trainer,
                          EarlyStoppingCallback, pipeline, set_seed)

# Optional: optuna for tuning
import optuna

SEED = 42
set_seed(SEED)
rng = np.random.default_rng(SEED)

pd.set_option('display.max_colwidth', 200)



## 1) Data Ingestion: Scraping / Import

Two options:

1. **Scrape**: provide a list of URLs for each class (distant supervision).  
2. **Import**: load a CSV with columns `text` and `label`.

> For a resume-quality project, keep a small **URL manifest** per class and a **raw dump** CSV so the dataset can be rebuilt.
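
One way to keep the manifest and the raw dump is sketched below using only the standard library. The helper names (`save_manifest_and_dump`, `load_dump`) and the file names (`url_manifest.json`, `raw_dump.csv`) are assumptions made for illustration; nothing in this notebook depends on them.

```python
import csv
import json
from pathlib import Path

def save_manifest_and_dump(urls_by_label, records, out_dir="data"):
    """Write the URL manifest as JSON and the scraped rows as CSV (hypothetical helper)."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    manifest_path = out / "url_manifest.json"  # assumed file name
    dump_path = out / "raw_dump.csv"           # assumed file name

    # The manifest records which URLs were requested for each class.
    with open(manifest_path, "w", encoding="utf-8") as f:
        json.dump(urls_by_label, f, indent=2, sort_keys=True)

    # The raw dump records what was actually extracted, before any cleaning.
    with open(dump_path, "w", encoding="utf-8", newline="") as f:
        writer = csv.DictWriter(f, fieldnames=["url", "text", "label"])
        writer.writeheader()
        writer.writerows(records)
    return manifest_path, dump_path

def load_dump(dump_path):
    """Read the raw dump back as a list of dicts with `url`, `text`, `label` keys."""
    with open(dump_path, encoding="utf-8", newline="") as f:
        return list(csv.DictReader(f))
```

In this notebook the inputs could be `URLS` and `df_raw.to_dict("records")`. Saving the dump before cleaning means later changes to `clean_text` do not require re-scraping.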


In [None]:

from bs4 import BeautifulSoup
import requests
import trafilatura

def fetch_text(url, timeout=15):
    """Fetch a page and extract its main text with trafilatura, falling back to BeautifulSoup."""
    try:
        # fetch_url has no `timeout` parameter; passing one raised a TypeError that the
        # bare except hid. SSL verification is left on (no `no_ssl=True`).
        downloaded = trafilatura.fetch_url(url)
        if downloaded:
            text = trafilatura.extract(downloaded, include_formatting=False, include_images=False)
            if text and len(text.split()) > 50:
                return text
    except Exception as e:
        print(f"trafilatura failed for {url}: {e}")

    # Fallback: plain BeautifulSoup extraction (noisier than trafilatura)
    try:
        resp = requests.get(url, headers={"User-Agent": "Mozilla/5.0"}, timeout=timeout)
        if resp.status_code == 200:
            soup = BeautifulSoup(resp.text, 'html.parser')
            # Drop non-content elements before collecting text
            for tag in soup(['script','style','header','footer','nav','aside']):
                tag.decompose()
            text = ' '.join(soup.get_text(separator=' ').split())
            if len(text.split()) > 50:
                return text
    except Exception as e:
        print(f"requests fallback failed for {url}: {e}")

    return None

def crawl_class(urls, label, sleep=1.0):
    records = []
    for u in urls:
        txt = fetch_text(u)
        if txt:
            records.append({"url": u, "text": txt, "label": label})
        time.sleep(sleep)  # be polite
    return pd.DataFrame(records)

# EXAMPLE: Provide your own URLs per label (keep small for demo)
URLS = {
    # "tech": ["https://example.com/article1", "https://example.com/article2"],
    # "sports": ["https://example.com/article3"],
}

DO_SCRAPE = False  # set True after filling URLS above

if DO_SCRAPE and URLS:
    frames = []
    for label, urls in URLS.items():
        frames.append(crawl_class(urls, label))
    df_raw = pd.concat(frames, ignore_index=True)
else:
    # Fallback demo: create a tiny synthetic dataset (replace with your CSV or scraping output)
    data = {
        "text": [
            "Apple unveils new chips for AI on-device computing at the developer conference.",
            "The local team clinched the championship after a dramatic penalty shootout.",
            "Researchers propose a novel transformer variant that improves long-context modeling.",
            "A record-breaking run highlights the athlete's training regimen and endurance."
        ],
        "label": ["tech", "sports", "tech", "sports"],
        "url": [None, None, None, None]
    }
    df_raw = pd.DataFrame(data)

df_raw.head()



## 2) Cleaning & Label Normalization
- Drop duplicates/empties
- Normalize labels to consecutive integers with a mapping (stored in metadata)
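
When the mapping is later written to JSON metadata, the integer ids in `id2label` come back as strings, because JSON object keys are always strings. A small round-trip sketch with toy labels (not the notebook's data) shows the conversion needed after reloading:

```python
import json

labels = ["sports", "tech"]  # toy labels, already sorted
label2id = {lbl: i for i, lbl in enumerate(labels)}
id2label = {i: lbl for lbl, i in label2id.items()}

# Serialize and reload, as a metadata file would.
restored = json.loads(json.dumps({"label2id": label2id, "id2label": id2label}))

# Integer keys were turned into strings by JSON.
print(list(restored["id2label"].keys()))  # ['0', '1']

# Convert the keys back to int before using the mapping for lookups.
id2label_restored = {int(k): v for k, v in restored["id2label"].items()}
print(id2label_restored == id2label)  # True
```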


In [None]:

def clean_text(s):
    if not isinstance(s, str): return None
    # Lightweight cleaning; avoid aggressive normalization that might hurt models
    s = s.strip()
    s = ' '.join(s.split())
    return s if len(s.split()) >= 5 else None

df = df_raw.copy()
df['text'] = df['text'].apply(clean_text)
df = df.dropna(subset=['text', 'label']).drop_duplicates(subset=['text']).reset_index(drop=True)

labels = sorted(df['label'].unique())
label2id = {lbl:i for i,lbl in enumerate(labels)}
id2label = {i:lbl for lbl,i in label2id.items()}

df['label_id'] = df['label'].map(label2id)

print("Labels:", labels)
print("Counts:\n", df['label'].value_counts())
df.head()



## 3) Train/Validation/Test Split & Hugging Face Datasets
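
The split below is done in two steps: 30% is held out, then that part is halved, giving roughly 70/15/15. The arithmetic can be sketched as follows. The helper `split_sizes` is hypothetical, and it assumes the held-out share of each step is rounded up, which is how scikit-learn treats a fractional `test_size`.

```python
import math

def split_sizes(n_rows, first_test=0.3, second_test=0.5):
    """Return (train, validation, test) row counts for the two-step split."""
    n_temp = math.ceil(n_rows * first_test)   # rows set aside for validation + test
    n_train = n_rows - n_temp
    n_test = math.ceil(n_temp * second_test)  # half of the held-out rows
    n_valid = n_temp - n_test
    return n_train, n_valid, n_test

print(split_sizes(1000))  # (700, 150, 150)
print(split_sizes(4))     # (2, 1, 1)
```

With the four-row synthetic dataset, validation and test hold one row each, which is fewer than the two classes, so a stratified split of that second step cannot succeed. Use a larger dataset before relying on stratification.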


In [None]:

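def split_frame(frame, test_size):
    """Stratify when both parts can hold every class; otherwise fall back to a random split."""
    n_test = math.ceil(len(frame) * test_size)
    counts = frame['label_id'].value_counts()
    can_stratify = counts.min() >= 2 and min(n_test, len(frame) - n_test) >= len(counts)
    return train_test_split(frame, test_size=test_size, random_state=SEED,
                            stratify=frame['label_id'] if can_stratify else None)

# 70% train, 15% validation, 15% test
train_df, temp_df = split_frame(df, test_size=0.3)
valid_df, test_df = split_frame(temp_df, test_size=0.5)

def to_dataset(frame):
    # preserve_index=False keeps the shuffled pandas index out of the dataset columns
    return Dataset.from_pandas(frame[['text', 'label_id']], preserve_index=False)

ds = DatasetDict({
    "train": to_dataset(train_df),
    "validation": to_dataset(valid_df),
    "test": to_dataset(test_df),
})

num_labels = len(labels)
ds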



## 4) Quick EDA


In [None]:

lengths = [len(t.split()) for t in df['text']]
plt.figure()
plt.hist(lengths, bins=20)
plt.title("Text length distribution (words)")
plt.xlabel("Words")
plt.ylabel("Count")
plt.show()

plt.figure()
df['label'].value_counts().plot(kind='bar')
plt.title("Class balance")
plt.xlabel("Label")
plt.ylabel("Count")
plt.show()



## 5) Tokenization


In [None]:

MODEL_CANDIDATES = [
    "distilbert-base-uncased",
    "bert-base-uncased",
    "roberta-base",
]

max_length = 256

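def tokenize_function(examples, tokenizer):
    enc = tokenizer(examples["text"], truncation=True, max_length=max_length)
    # Trainer only forwards a label column named `labels` (or `label`/`label_ids`);
    # `label_id` alone would be dropped and the model would return no loss.
    enc["labels"] = examples["label_id"]
    return enc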

# Prepare a default tokenizer to inspect
default_tokenizer = AutoTokenizer.from_pretrained(MODEL_CANDIDATES[0])
tokenized_ds_preview = ds["train"].select(range(min(5, ds["train"].num_rows))).map(lambda x: tokenize_function(x, default_tokenizer))
tokenized_ds_preview



## 6) Metrics
We report **accuracy**, **precision**, **recall**, **F1 (macro)** and the **confusion matrix**.
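
Macro averaging gives every class equal weight, which is why it is reported alongside accuracy. The pure-Python sketch below uses a made-up imbalanced example (not results from this notebook) in which a model always predicts the majority class. The helper `macro_scores` is hypothetical and follows the same zero-division convention as `zero_division=0`.

```python
def macro_scores(y_true, y_pred, classes):
    """Per-class precision, recall and F1, averaged with equal weight per class."""
    precisions, recalls, f1s = [], [], []
    for c in classes:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        prec = tp / (tp + fp) if tp + fp else 0.0  # undefined precision counts as 0
        rec = tp / (tp + fn) if tp + fn else 0.0
        f1 = 2 * prec * rec / (prec + rec) if prec + rec else 0.0
        precisions.append(prec)
        recalls.append(rec)
        f1s.append(f1)
    n = len(classes)
    return sum(precisions) / n, sum(recalls) / n, sum(f1s) / n

# Toy case: 8 examples of class 0, 2 of class 1; the model always predicts 0.
y_true = [0] * 8 + [1] * 2
y_pred = [0] * 10

accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
prec, rec, f1 = macro_scores(y_true, y_pred, classes=[0, 1])
print(round(accuracy, 3), round(prec, 3), round(rec, 3), round(f1, 3))  # 0.8 0.4 0.5 0.444
```

Accuracy looks acceptable at 0.8, while macro F1 drops to about 0.44 because the minority class is never predicted.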


In [None]:

accuracy_metric = evaluate.load("accuracy")

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = logits.argmax(axis=-1)
    acc = accuracy_metric.compute(predictions=preds, references=labels)["accuracy"]
    pr, rc, f1, _ = precision_recall_fscore_support(labels, preds, average='macro', zero_division=0)
    return {"accuracy": acc, "precision_macro": pr, "recall_macro": rc, "f1_macro": f1}



## 7) Training Utilities
`train_one` fine-tunes one backbone for a few epochs with early stopping and returns the `Trainer` together with its validation metrics.
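
The patience rule behind early stopping can be illustrated with a simplified sketch. This is not the `EarlyStoppingCallback` implementation: the helper `epochs_until_stop` is hypothetical, assumes a higher score is better, and ignores any improvement threshold.

```python
def epochs_until_stop(scores, patience=2):
    """Return (epochs run, index of best epoch) for per-epoch validation scores."""
    best_score = float("-inf")
    best_epoch = -1
    bad_epochs = 0  # consecutive epochs without improvement
    for epoch, score in enumerate(scores):
        if score > best_score:
            best_score, best_epoch, bad_epochs = score, epoch, 0
        else:
            bad_epochs += 1
            if bad_epochs >= patience:
                return epoch + 1, best_epoch  # stop after this epoch
    return len(scores), best_epoch

# Made-up validation scores: training stops after the 4th epoch,
# and the best checkpoint is the 2nd epoch (index 1).
print(epochs_until_stop([0.70, 0.75, 0.74, 0.73, 0.80]))  # (4, 1)
```

With `epochs=3` and a patience of 2, the rule can only fire at the final epoch, so it matters mainly for longer runs.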


In [None]:

def train_one(model_name, ds, num_labels, output_dir, learning_rate=2e-5, batch_size=8, epochs=3):
    tokenizer = AutoTokenizer.from_pretrained(model_name, use_fast=True)
    tokenized = ds.map(lambda x: tokenize_function(x, tokenizer), batched=True, remove_columns=["text"])
    collator = DataCollatorWithPadding(tokenizer=tokenizer)

    model = AutoModelForSequenceClassification.from_pretrained(model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)

    args = TrainingArguments(
        output_dir=output_dir,
        eval_strategy="epoch",  # named `evaluation_strategy` before transformers 4.41
        save_strategy="epoch",
        load_best_model_at_end=True,
        metric_for_best_model="f1_macro",
        logging_strategy="steps",
        logging_steps=50,
        learning_rate=learning_rate,
        per_device_train_batch_size=batch_size,
        per_device_eval_batch_size=batch_size,
        num_train_epochs=epochs,
        seed=SEED,
        fp16=torch.cuda.is_available(),
        report_to=["none"],
    )

    trainer = Trainer(
        model=model,
        args=args,
        train_dataset=tokenized["train"],
        eval_dataset=tokenized["validation"],
        tokenizer=tokenizer,
        data_collator=collator,
        compute_metrics=compute_metrics,
        callbacks=[EarlyStoppingCallback(early_stopping_patience=2)]
    )

    trainer.train()
    eval_metrics = trainer.evaluate(tokenized["validation"])
    return trainer, eval_metrics

import torch



## 8) Baseline Experiments (Multiple Backbones)
Run a short training for each candidate backbone and compare.


In [None]:

results = []
trainers = {}

for model_name in MODEL_CANDIDATES:
    out_dir = f"models/{model_name.replace('/', '_')}-baseline"
    os.makedirs(out_dir, exist_ok=True)
    trainer, metrics = train_one(model_name, ds, num_labels, out_dir, learning_rate=2e-5, batch_size=8, epochs=3)
    metrics["model"] = model_name
    results.append(metrics)
    trainers[model_name] = trainer

# trainer.evaluate() prefixes metric names with "eval_"
df_results = pd.DataFrame(results).sort_values("eval_f1_macro", ascending=False).reset_index(drop=True)
df_results



## 9) Hyperparameter Tuning (Optuna)
We run a small hyperparameter search on the best backbone to squeeze extra performance.
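
The search space below samples the learning rate with `log=True`, meaning the logarithm of the value is what gets sampled uniformly. The sketch illustrates that idea with plain random draws. It is not how Optuna's default sampler chooses trials (that sampler adapts to earlier results), and `sample_log_uniform` is a hypothetical helper.

```python
import math
import random

def sample_log_uniform(low, high, py_rng):
    """Draw a value whose logarithm is uniform on [log(low), log(high)]."""
    return math.exp(py_rng.uniform(math.log(low), math.log(high)))

py_rng = random.Random(42)
samples = [sample_log_uniform(1e-5, 5e-5, py_rng) for _ in range(1000)]

# On a log scale the midpoint is the geometric mean, so about half of the
# draws should fall below it (compare with the arithmetic midpoint 3e-5).
geo_mean = math.sqrt(1e-5 * 5e-5)
share_below = sum(s < geo_mean for s in samples) / len(samples)
print(f"geometric mean = {geo_mean:.2e}, share below = {share_below:.2f}")
```

Sampling on a log scale spends as many trials between 1e-5 and 2.2e-5 as between 2.2e-5 and 5e-5, which suits parameters whose effect depends on ratios rather than differences.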


In [None]:

best_model_name = df_results.iloc[0]["model"]
print("Best baseline backbone:", best_model_name)

tokenizer = AutoTokenizer.from_pretrained(best_model_name, use_fast=True)
tokenized = ds.map(lambda x: tokenize_function(x, tokenizer), batched=True, remove_columns=["text"])
collator = DataCollatorWithPadding(tokenizer=tokenizer)
def model_init():
    # hyperparameter_search re-creates the model for every trial,
    # so Trainer needs a factory function instead of a model instance.
    return AutoModelForSequenceClassification.from_pretrained(
        best_model_name, num_labels=num_labels, id2label=id2label, label2id=label2id)

args = TrainingArguments(
    output_dir=f"models/{best_model_name.replace('/', '_')}-tuned",
    eval_strategy="epoch",  # named `evaluation_strategy` before transformers 4.41
    save_strategy="epoch",
    load_best_model_at_end=True,
    metric_for_best_model="f1_macro",
    learning_rate=2e-5,
    per_device_train_batch_size=8,
    per_device_eval_batch_size=8,
    num_train_epochs=5,
    seed=SEED,
    fp16=torch.cuda.is_available(),
    report_to=["none"],
)

trainer = Trainer(
    model_init=model_init,
    args=args,
    train_dataset=tokenized["train"],
    eval_dataset=tokenized["validation"],
    tokenizer=tokenizer,
    data_collator=collator,
    compute_metrics=compute_metrics,
    callbacks=[EarlyStoppingCallback(early_stopping_patience=2)],
)

def optuna_hp_space(trial):
    return {
        "learning_rate": trial.suggest_float("learning_rate", 1e-5, 5e-5, log=True),
        "per_device_train_batch_size": trial.suggest_categorical("per_device_train_batch_size", [8, 16]),
        "num_train_epochs": trial.suggest_int("num_train_epochs", 3, 6),
        "weight_decay": trial.suggest_float("weight_decay", 0.0, 0.1),
    }

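best_run = trainer.hyperparameter_search(
    direction="maximize",
    backend="optuna",
    hp_space=optuna_hp_space,
    # Optimize macro F1 only; the default objective sums all reported metrics.
    compute_objective=lambda metrics: metrics["eval_f1_macro"],
    n_trials=6
)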

best_run



## 10) Final Training & Test Evaluation
Train with the best-found hyperparameters on train+validation and evaluate on the held-out test set.


In [None]:

# Apply best hyperparameters
for n, v in best_run.hyperparameters.items():
    setattr(trainer.args, n, v)

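# Merge train+validation for final training.
# The validation rows are now also training rows, so per-epoch validation scores
# (and early stopping on them) are optimistic; rely on the test metrics below.
train_val = datasets.concatenate_datasets([tokenized["train"], tokenized["validation"]])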

trainer.train_dataset = train_val
trainer.train()

test_metrics = trainer.evaluate(tokenized["test"])
print(json.dumps(test_metrics, indent=2))

# Confusion matrix & report
preds = trainer.predict(tokenized["test"]).predictions.argmax(axis=-1)
true = np.array(tokenized["test"]["label_id"])

cm = confusion_matrix(true, preds, labels=list(range(num_labels)))
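# Passing `labels` keeps the report aligned with `target_names` even if a class is missing from the test set
print("\nClassification report:\n", classification_report(true, preds, labels=list(range(num_labels)), target_names=labels, zero_division=0))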

plt.figure()
plt.imshow(cm, interpolation='nearest')
plt.title('Confusion matrix')
plt.xlabel('Predicted')
plt.ylabel('True')
plt.xticks(range(num_labels), labels, rotation=45, ha='right')
plt.yticks(range(num_labels), labels)
plt.colorbar()
plt.tight_layout()
plt.show()



## 11) Error Analysis (Most-Confused Examples)
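
One common heuristic for "most confused" is to rank misclassified examples by how confident the model was in its wrong answer. The sketch below works on plain lists of logits. The helper `most_confident_errors` is hypothetical and the logits are made up for illustration.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one row of logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def most_confident_errors(logit_rows, true_ids, top_n=10):
    """Return (row index, predicted id, confidence) for wrong predictions, most confident first."""
    errors = []
    for i, (logits, true_id) in enumerate(zip(logit_rows, true_ids)):
        probs = softmax(logits)
        pred_id = max(range(len(probs)), key=probs.__getitem__)
        if pred_id != true_id:
            errors.append((i, pred_id, probs[pred_id]))
    return sorted(errors, key=lambda e: e[2], reverse=True)[:top_n]

# Made-up logits for three examples with two classes.
toy_logits = [[2.0, 0.0], [0.0, 3.0], [0.5, 0.0]]
toy_true = [0, 0, 1]
for idx, pred_id, conf in most_confident_errors(toy_logits, toy_true):
    print(idx, pred_id, round(conf, 3))
# 1 1 0.953
# 2 0 0.622
```

In this notebook the logits could come from `trainer.predict(tokenized["test"]).predictions.tolist()`, and the returned row indices line up with the rows of the test set.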


In [None]:

df_test = test_df.reset_index(drop=True).copy()
df_test['pred'] = preds
df_test['true_label'] = df_test['label_id'].map(id2label)
df_test['pred_label'] = df_test['pred'].map(id2label)
errors = df_test[df_test['pred'] != df_test['label_id']]

# Show a few errors, longest texts first (a simple heuristic; swap in any other ranking).
# `by` must name a column, so the length is computed through `key`.
errors_sorted = errors.sort_values(by='text', key=lambda s: s.str.len(), ascending=False, na_position='last')
errors_sorted[['text','true_label','pred_label']].head(10)



## 12) Inference Pipeline


In [None]:

# top_k=None returns scores for every label (replaces the deprecated return_all_scores=True)
clf = pipeline("text-classification", model=trainer.model, tokenizer=tokenizer, top_k=None, truncation=True)
samples = [
    "The player scored a hat-trick in the final match.",
    "A breakthrough in quantum processors was announced by the research team."
]
predictions = clf(samples)
predictions



## 13) Save Model & Minimal Gradio Demo


In [None]:

save_dir = "deployable_model"
trainer.model.save_pretrained(save_dir)
tokenizer.save_pretrained(save_dir)  # same tokenizer object that was passed to the Trainer

import gradio as gr

def predict_gradio(text):
    res = clf(text)[0]
    # Return label with max score
    best = max(res, key=lambda x: x['score'])
    return f"{best['label']} ({best['score']:.3f})"

demo = gr.Interface(
    fn=predict_gradio,
    inputs=gr.Textbox(lines=4, label="Enter text"),
    outputs=gr.Textbox(label="Prediction"),
    title="NLP Classifier Demo",
    description="Transformer-based text classifier with tuned hyperparameters.",
)

# To launch locally:
# demo.launch(share=False)



## 14) Model Card (Stub)

**Model**: `<best_backbone>-finetuned-topic-classifier`  
**Labels**: {label → id} mapping, saved to `artifacts/metadata.json` by the last cell.  
**Intended Use**: Topic classification of news-like short/medium texts.  
**Training Data**: URLs listed in manifest or CSV; see `df_raw`.  
**Metrics**: Accuracy, Precision, Recall, F1 (macro), Confusion Matrix.  
**Ethical Considerations**: Beware of dataset bias, domain drift, and potential misclassification harms.  
**Limitations**: Small dataset and limited topics; not robust to slang or code-mixed text without further fine-tuning.  
**How to Reproduce**: Run this notebook top-to-bottom after filling URL manifests or loading your CSV.
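
The stub above could be filled in from the saved metadata rather than by hand. A minimal sketch, assuming a dict shaped like the `meta` object built in the next cell; the helper `render_model_card` is hypothetical and the numbers in `toy_meta` are placeholders, not real results.

```python
def render_model_card(meta):
    """Build a Markdown model card string from a metadata dict (hypothetical helper)."""
    lines = [
        f"# {meta['best_backbone']}-finetuned-topic-classifier",
        "",
        "## Labels",
    ]
    # List labels in id order.
    lines += [f"- {lbl}: {idx}"
              for lbl, idx in sorted(meta["label2id"].items(), key=lambda kv: kv[1])]
    lines += ["", "## Test metrics"]
    # Keep only float-valued entries and format them consistently.
    lines += [f"- {name}: {value:.4f}"
              for name, value in sorted(meta["test_metrics"].items())
              if isinstance(value, float)]
    lines += ["", f"Seed: {meta['seed']}"]
    return "\n".join(lines) + "\n"

# Placeholder values for illustration only.
toy_meta = {
    "best_backbone": "distilbert-base-uncased",
    "label2id": {"tech": 1, "sports": 0},
    "test_metrics": {"eval_f1_macro": 0.5, "eval_runtime": 1.25},
    "seed": 42,
}
card = render_model_card(toy_meta)
print(card)
```

Writing the result to a `README.md` next to the saved model keeps the card and the weights together.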


In [None]:

meta = {
    "labels": labels,
    "label2id": label2id,
    "id2label": id2label,
    "seed": SEED,
    "candidates": MODEL_CANDIDATES,
    "best_backbone": best_model_name,
    "test_metrics": test_metrics,
}
os.makedirs("artifacts", exist_ok=True)
with open("artifacts/metadata.json", "w") as f:
    json.dump(meta, f, indent=2)
"Saved artifacts/metadata.json"
