# ClassiNews — Colab Notebook

**Intelligent News Article Categorization System**

This notebook contains a full pipeline for the ClassiNews project:
- dataset loading (use your uploaded ZIP or fall back to AG News),
- preprocessing, EDA, baseline TF-IDF + Logistic Regression,
- Transformer fine-tuning (DistilBERT) using Hugging Face, and
- evaluation + inference examples.

**How to use:** Open this notebook in Google Colab. If you have a project ZIP (like `(genAi)_final project.zip`), upload it to Colab or mount Google Drive and place the zip in the working directory. The notebook will try to detect and use your uploaded dataset; otherwise it will use the AG News dataset as a fallback.

In [None]:
# ## Setup — run this first
# Install required libraries in Colab
!pip install -q transformers datasets evaluate scikit-learn matplotlib seaborn sentencepiece
!pip install -q accelerate
# Clear the pip output once installation finishes
from IPython.display import clear_output
clear_output()
print("Setup complete. Libraries installed.")

## Dataset
You have two options:
1. **Upload your project ZIP** (e.g. `(genAi)_final project.zip`) with CSV(s) containing news text and labels. Place it in the notebook working directory or mount Drive. The notebook will look for CSV files with columns named `text` and `label` (or `category`).

2. **Fallback**: If no CSV is found, the notebook automatically loads the **AG News** dataset (4 classes) from the `datasets` library for demonstration.

**If you use your own dataset**, ensure it has at least two columns: `text` (article content/title) and `label` (integer index or string category).
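
The column requirement can be checked before any data is loaded. The following is a minimal sketch, not the notebook's own loader: `resolve_columns` is a hypothetical helper, and the lists of alternative column names are assumptions chosen for illustration.

```python
# Hypothetical helper: map a CSV's column names onto the `text`/`label`
# names the notebook expects. The alternative names below are assumptions.
TEXT_ALTERNATIVES = ["content", "article", "body", "headline"]
LABEL_ALTERNATIVES = ["category"]


def resolve_columns(columns):
    """Return a rename mapping, or raise ValueError if a column is missing."""
    rename = {}
    for target, alternatives in (("text", TEXT_ALTERNATIVES),
                                 ("label", LABEL_ALTERNATIVES)):
        if target in columns:
            continue  # already has the expected name
        match = next((c for c in alternatives if c in columns), None)
        if match is None:
            raise ValueError(f"no column usable as '{target}' in {list(columns)}")
        rename[match] = target
    return rename


print(resolve_columns(["headline", "category", "date"]))
```

The returned mapping can be passed to `df.rename(columns=...)`; an empty mapping means the CSV already uses the expected names.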

In [None]:
# Try to detect uploaded CSV files or unzip a provided zip file.
import os, glob, zipfile, pandas as pd
DATA_PATH = '/content'  # Colab working directory

# If a zip file is present (like your uploaded project zip), extract it first
for z in glob.glob(os.path.join(DATA_PATH, '*.zip')):
    try:
        with zipfile.ZipFile(z) as zf:  # context manager closes the archive
            zf.extractall(DATA_PATH)
        print(f'Extracted: {z}')
    except (zipfile.BadZipFile, OSError) as e:
        print('Could not extract', z, e)

# Search recursively so CSVs inside extracted folders are found too.
# Skip Colab's bundled /content/sample_data CSVs, which are not news data.
found_csv = sorted(
    p for p in glob.glob(os.path.join(DATA_PATH, '**', '*.csv'), recursive=True)
    if os.sep + 'sample_data' + os.sep not in p
)

found_csv[:10], len(found_csv)

In [None]:
# Load your CSV if found (prefers first matching CSV). Otherwise load AG News.
from datasets import load_dataset, Dataset
import pandas as pd

if found_csv:
    csv_path = found_csv[0]
    print('Using CSV:', csv_path)
    df = pd.read_csv(csv_path)
    # Try to standardize columns
    if 'text' not in df.columns:
        # try common alternatives
        for alt in ['content','article','body','headline']:
            if alt in df.columns:
                df = df.rename(columns={alt:'text'})
                break
    if 'label' not in df.columns and 'category' in df.columns:
        df = df.rename(columns={'category':'label'})
    # Fail early with a clear message instead of a KeyError further down
    missing = [c for c in ('text', 'label') if c not in df.columns]
    if missing:
        raise ValueError(f'{csv_path} is missing required column(s): {missing}')
    # Drop rows without text or label so later steps do not see NaN values
    df = df.dropna(subset=['text', 'label']).reset_index(drop=True)
    display(df.head())
    # String labels are converted to integer ids in the preprocessing cell.
    dataset = Dataset.from_pandas(df[['text','label']])
else:
    print('No CSV found; loading AG News as fallback.')
    dataset = load_dataset('ag_news')  # has train/test splits
    display(dataset['train'][0])

## Preprocessing
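We'll apply a simple preprocessing pipeline: replace newlines with spaces, collapse repeated whitespace, and strip the ends. The cleaning function does not lowercase the text; `TfidfVectorizer` lowercases by default, and the uncased DistilBERT tokenizer lowercases its input. For transformer models we won't remove stopwords or stem, since they are pretrained on raw text. The TF-IDF baseline relies on the vectorizer's built-in tokenization.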

In [None]:
# Basic preprocessing utilities
import re
def clean_text(text):
    if not isinstance(text, str):
        return ''
    text = text.replace('\n',' ').replace('\r',' ')
    text = re.sub(r'\s+', ' ', text)   # collapse whitespace
    text = text.strip()
    return text

# Apply to the dataset: a single Dataset (from CSV) or a DatasetDict (AG News)
from datasets import Dataset as HFDataset, DatasetDict
# Dataset.map and DatasetDict.map accept the same function, so one call covers both
dataset = dataset.map(lambda x: {'text': clean_text(x['text'])})
if not isinstance(dataset, DatasetDict):
    # single split (from CSV): if labels are strings, convert them to ids
    sample_label = dataset[0].get('label', None)
    if isinstance(sample_label, str):
        labels = sorted(set(dataset['label']))
        label2id = {l: i for i, l in enumerate(labels)}
        # Replace only the 'label' column; 'text' must survive for later cells
        dataset = dataset.map(lambda x: {'label': label2id[x['label']]}, remove_columns=['label'])
print('Preprocessing finished. Sample:')
print(dataset['train'][0] if isinstance(dataset, DatasetDict) else dataset[0])

## Train / Validation / Test split
If the dataset came with train/test splits, we keep them and carve a validation set out of the training split. Otherwise we create an 80/10/10 train/validation/test split.
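
Because the validation set is taken from what remains after the test set is held out, the second split fraction is relative to the remainder, not to the original data. The sketch below works through that arithmetic; `split_fractions` is a hypothetical helper written for this illustration.

```python
def split_fractions(test_size, val_size_of_rest):
    """Fractions of the original data that end up in (train, validation, test)."""
    rest = 1.0 - test_size            # what is left after holding out the test set
    val = rest * val_size_of_rest     # validation is a share of that remainder
    return rest - val, val, test_size


for test_size, val_size in [(0.1, 1 / 9), (0.2, 0.111)]:
    train, val, test = split_fractions(test_size, val_size)
    print(f"test_size={test_size}, val_size={val_size:.3f} -> "
          f"train={train:.3f}, val={val:.3f}, test={test:.3f}")
```

Holding out 10% for test and then 1/9 of the remainder for validation gives 80/10/10. Holding out 20% and then 11.1% of the remainder gives roughly 71/9/20.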

In [None]:
from datasets import DatasetDict
# If dataset has train/test already (like AG News), keep splits. Otherwise split.
if isinstance(dataset, DatasetDict) and 'train' in dataset and 'test' in dataset:
    ds = DatasetDict({
        'train': dataset['train'],
        'test': dataset['test']
    })
    # Create validation from train (10% of the training split)
    tmp = ds['train'].train_test_split(test_size=0.1, seed=42)
    ds['train'], ds['validation'] = tmp['train'], tmp['test']
else:
    # single split dataset: hold out 10% for test first
    tmp = dataset.train_test_split(test_size=0.1, seed=42)
    ds = DatasetDict({
        'train': tmp['train'],
        'test': tmp['test']
    })
    # 1/9 of the remaining 90% is 10% of the original, giving 80/10/10
    tmp = ds['train'].train_test_split(test_size=1/9, seed=42)
    ds['train'], ds['validation'] = tmp['train'], tmp['test']
print(ds)

## Baseline: TF-IDF + Logistic Regression
We'll vectorize text using TF-IDF and train a scikit-learn logistic regression classifier.
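
To make the weighting concrete, here is a minimal pure-Python sketch of TF-IDF. It assumes whitespace tokenization and unigrams only, and uses the smoothed IDF `ln((1 + n) / (1 + df)) + 1` followed by L2 normalization, which matches scikit-learn's documented defaults. The real `TfidfVectorizer` uses a regex tokenizer and can also include bigrams, so its numbers will differ.

```python
import math
from collections import Counter


def tfidf(docs):
    """Return one {term: weight} dict per document (L2-normalized)."""
    tokenized = [doc.lower().split() for doc in docs]  # naive whitespace tokens
    n_docs = len(tokenized)
    # Document frequency: number of documents containing each term
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {t: math.log((1 + n_docs) / (1 + d)) + 1 for t, d in df.items()}
    vectors = []
    for tokens in tokenized:
        counts = Counter(tokens)
        weights = {t: c * idf[t] for t, c in counts.items()}
        # Guard against empty documents, whose norm would be zero
        norm = math.sqrt(sum(w * w for w in weights.values())) or 1.0
        vectors.append({t: w / norm for t, w in weights.items()})
    return vectors


docs = ["stocks rally as markets open", "team wins the cup final", "markets fall"]
for vec in tfidf(docs):
    print({t: round(w, 3) for t, w in vec.items()})
```

Terms that appear in fewer documents (such as "stocks") receive a higher weight than terms shared across documents (such as "markets"). This is what lets a linear classifier pick up on category-specific vocabulary.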

In [None]:
# Convert datasets to pandas for scikit-learn
import pandas as pd, numpy as np
train_df = ds['train'].to_pandas()
val_df = ds['validation'].to_pandas()
test_df = ds['test'].to_pandas()

from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix, accuracy_score

tfidf = TfidfVectorizer(max_features=20000, ngram_range=(1,2))
clf = LogisticRegression(max_iter=1000)
pipe = Pipeline([('tfidf', tfidf), ('clf', clf)])

print('Training TF-IDF + LogisticRegression baseline...')
pipe.fit(train_df['text'].astype(str), train_df['label'])

preds = pipe.predict(test_df['text'].astype(str))
print('Accuracy:', accuracy_score(test_df['label'], preds))
print('\nClassification Report:\n', classification_report(test_df['label'], preds, digits=4))

## Transformer fine-tuning (DistilBERT)
We'll fine-tune a DistilBERT base model, which is smaller and faster than BERT base. This uses the Hugging Face Transformers `Trainer` API.
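
The classification head produces one logit per class for each article. Evaluation takes the highest-scoring class as the prediction and compares it with the reference label. The sketch below illustrates that step with made-up logits (not real model output); `softmax` and `accuracy` are helper names introduced here for illustration only.

```python
import math


def softmax(logits):
    """Convert raw logits to probabilities (shifted by the max for stability)."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def accuracy(batch_logits, labels):
    """Fraction of rows whose highest-scoring class equals the label."""
    preds = [max(range(len(row)), key=row.__getitem__) for row in batch_logits]
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


# Made-up logits for three articles over four classes
batch_logits = [[2.0, 0.1, -1.0, 0.3], [0.2, 3.1, 0.0, -0.5], [0.0, 0.4, 0.3, 1.9]]
labels = [0, 1, 2]
print([round(p, 3) for p in softmax(batch_logits[0])])
print("accuracy:", accuracy(batch_logits, labels))
```

Softmax does not change which class wins; it only rescales the logits into scores that sum to 1, which is what an inference pipeline reports as the confidence.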

In [None]:
# Prepare for fine-tuning
from transformers import AutoTokenizer, AutoModelForSequenceClassification, TrainingArguments, Trainer
import numpy as np

MODEL_NAME = 'distilbert-base-uncased'
tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

# Tokenization function
def tokenize_fn(batch):
    return tokenizer(batch['text'], truncation=True, padding='max_length', max_length=256)

# Tokenize datasets
tokenized = ds.map(tokenize_fn, batched=True, remove_columns=[c for c in ds['train'].column_names if c!='label' and c!='text'])
tokenized = tokenized.rename_column('label','labels')
tokenized.set_format('torch')

# Labels are 0-based integer ids, so the class count is the largest id plus one
num_labels = int(train_df['label'].max()) + 1
model = AutoModelForSequenceClassification.from_pretrained(MODEL_NAME, num_labels=num_labels)

In [None]:
# Define compute metrics
import evaluate
metric = evaluate.load('accuracy')

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    preds = np.argmax(logits, axis=-1)
    return metric.compute(predictions=preds, references=labels)

# Training arguments - keep small for Colab / demo
training_args = TrainingArguments(
    output_dir='./results',
    evaluation_strategy='epoch',  # renamed to `eval_strategy` in newer transformers releases
    save_strategy='epoch',        # must match the evaluation strategy for load_best_model_at_end
    learning_rate=2e-5,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=32,
    num_train_epochs=2,
    weight_decay=0.01,
    save_total_limit=1,
    load_best_model_at_end=True,
    push_to_hub=False,
    report_to='none',
)

trainer = Trainer(
    model=model,
    args=training_args,
    train_dataset=tokenized['train'],
    eval_dataset=tokenized['validation'],
    tokenizer=tokenizer,
    compute_metrics=compute_metrics
)

# Start training (may take long depending on runtime)
trainer.train()

In [None]:
# Evaluate on test set
test_tok = tokenized['test']
preds_output = trainer.predict(test_tok)
preds = np.argmax(preds_output.predictions, axis=-1)
y_true = preds_output.label_ids  # true labels returned alongside the predictions

from sklearn.metrics import classification_report, accuracy_score
print('Test accuracy:', accuracy_score(y_true, preds))
print(classification_report(y_true, preds, digits=4))

## Save model & inference example

In [None]:
# Save the fine-tuned model locally (you can then download or push to Drive)
model.save_pretrained('./classinews_distilbert')
tokenizer.save_pretrained('./classinews_distilbert')
print('Saved model to ./classinews_distilbert')


# Inference example function
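from transformers import pipeline
# By default the pipeline returns the single top label with its score. Labels print as
# LABEL_0, LABEL_1, ... unless id2label is set on the model config.
classifier = pipeline('text-classification', model='./classinews_distilbert', tokenizer='./classinews_distilbert')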

def predict(text):
    text = clean_text(text)
    out = classifier(text, truncation=True, max_length=256)
    return out

# Demo
sample_text = "President signs new economic bill to boost small businesses and create jobs."
print('Sample prediction:', predict(sample_text))

## Notes & next steps
- For multi-label tasks, convert training & loss to support multi-label (BCEWithLogitsLoss) and prepare multi-hot labels.
- For larger datasets or better performance, use a GPU runtime (Colab GPU/TPU), increase epochs, and tune learning rate.
- Add data augmentation, class balancing, or use ensemble methods for stronger baselines.
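
As a rough illustration of the multi-label note above, the sketch below builds a multi-hot target and computes binary cross-entropy directly from logits, which is the quantity `BCEWithLogitsLoss` averages. It is pure Python with made-up logits, and `multi_hot` and `bce_with_logits` are hypothetical helper names. In an actual fine-tuning run you would use the framework's loss instead; for example, Hugging Face sequence-classification models select it when the config sets `problem_type="multi_label_classification"`.

```python
import math


def multi_hot(label_ids, num_labels):
    """Encode a set of label ids as a 0/1 vector of length num_labels."""
    vec = [0.0] * num_labels
    for i in label_ids:
        vec[i] = 1.0
    return vec


def bce_with_logits(logits, targets):
    """Mean binary cross-entropy computed directly from logits.

    Uses the numerically stable form max(x, 0) - x*y + log(1 + exp(-|x|)).
    """
    losses = [max(x, 0.0) - x * y + math.log1p(math.exp(-abs(x)))
              for x, y in zip(logits, targets)]
    return sum(losses) / len(losses)


targets = multi_hot([0, 2], num_labels=4)  # an article tagged with classes 0 and 2
logits = [1.5, -2.0, 0.0, -0.5]            # made-up model outputs
print(targets)
print(round(bce_with_logits(logits, targets), 4))
```

Each class is scored independently, so an article can be assigned several categories at once. This contrasts with the single-label setup in this notebook, where classes compete through softmax.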

---

**End of notebook**