# 03 — Category Detection with Stratified K-Fold Cross Validation

Goal: 
Train and evaluate multiple classical ML models for **news category classification** using:
- TF-IDF feature extraction
- **StratifiedKFold (k=5)**
- Metrics: accuracy, precision, recall, F1
- Report **mean ± std** across folds (mentor requirement)

Output:
A final table: **Model comparison (mean ± std)**.

In [1]:
# Used Libraries

import pandas as pd
import numpy as np

from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.preprocessing import LabelEncoder

from sklearn.linear_model import LogisticRegression, SGDClassifier
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

In [2]:
# Helper methods

def print_dataset(text, df):
    print("\n" + text + ":")
    display(df.head())

def read_dataset(path):
    return pd.read_csv(path)

### Constants

In [3]:
RANDOM_STATE = 42
N_SPLITS = 5

DATA_PATH = "../data/preprocessed_kosovo_news.csv"
TEXT_COL = "text"                                  
CATEGORY_COL = "category"
MODEL_COMPARISONS = [] 

### Read dataset

In [4]:
df = read_dataset(DATA_PATH)
df.head()

|    | title | category | source | text |
|---:|:------|:---------|:-------|:-----|
|  0 | As Kate as Meghan; ja cila është princesha më ... | Fun;Argëtim | Lajmi | as kate as meghan; ja cila është princesha më ... |
|  1 | I kapen 10 kg substanca narkotike në BMW X5, a... | Lajme;Nacionale | Lajmi | i kapen 10 kg substanca narkotike në bmw x5, a... |
|  2 | E fundit, Mbappe mund të zyrtarizohet nesër te... | La Liga;Lajme futbolli;Sport | Lajmi | e fundit, mbappe mund të zyrtarizohet nesër te... |
|  3 | Enca e quan jetë pushimin në plazh me poza në ... | nan;Entertainment | Lajmi | enca e quan jetë pushimin në plazh me poza në ... |
|  4 | Gurët në veshka – Kurat natyrale dhe si t’i pë... | Lifestyle;Shëndeti | Lajmi | gurët në veshka – kurat natyrale dhe si t’i pë... |


## Category target preparation
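The `category` column may hold several categories separated by `;`, sometimes including a literal `nan` token.
Standard multi-class classification needs exactly one target label per sample.
We therefore extract a **primary category** with a deterministic rule: drop empty and `nan` tokens, then keep the last remaining one.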

In [5]:
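def extract_primary_category(cat):
    """Return the last non-empty, non-'nan' token of a ';'-separated category string."""
    # A missing value is a float NaN, which has no .split(); treat it as unlabeled.
    if not isinstance(cat, str):
        return np.nan

    # Split on ';' and drop empty or literal 'nan' tokens.
    parts = [p.strip() for p in cat.split(";")]
    parts = [p for p in parts if p and p.lower() != "nan"]
    if not parts:
        return np.nan

    # Use the LAST remaining category.
    return parts[-1]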

df["primary_category"] = df[CATEGORY_COL].apply(extract_primary_category)
df = df.dropna(subset=["primary_category"]).copy()

df["primary_category"].value_counts().head(20)

primary_category
Lajme             434024
Kosovë            154486
Sport             122822
Ndërkombëtare      81791
Nga Bota           68263
Maqedoni           60974
Shkurt             48121
Yjet               44443
Showbiz            40595
Shqiperi           34777
Bota               31586
Shqipëri           30343
Magazina           28549
Kronika e Zezë     27676
Ekonomi            25493
CultBiz            20675
Serie A            19771
Premier League     19394
Fun Lajme          19352
Politikë           17395
Name: count, dtype: int64
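
As a quick sanity check of the "last token" rule, the sketch below applies an equivalent standalone function to category strings taken from the preview above. The name `primary_category_demo` is hypothetical and exists only for this illustration; it is not part of the notebook's pipeline.

```python
import numpy as np


def primary_category_demo(cat):
    """Illustrative copy of the rule: last non-empty, non-'nan' token."""
    if not isinstance(cat, str):
        return np.nan
    parts = [p.strip() for p in cat.split(";")]
    parts = [p for p in parts if p and p.lower() != "nan"]
    return parts[-1] if parts else np.nan


# Category strings copied from the dataset preview, plus two edge cases.
examples = ["Fun;Argëtim", "La Liga;Lajme futbolli;Sport", "nan;Entertainment", "nan", ""]
results = [primary_category_demo(c) for c in examples]

for raw, label in zip(examples, results):
    print(repr(raw), "->", label)
```

Rows whose string contains only `nan` or nothing at all map to `NaN` and are dropped by the `dropna` call.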

## Handle rare categories (optional)
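To keep the stratified folds stable, we drop every category with fewer than `MIN_SAMPLES_PER_CLASS` samples. A class needs at least `N_SPLITS` samples to appear in the test set of every fold.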

In [6]:
MIN_SAMPLES_PER_CLASS = 50  # tune (e.g., 20/50/100)
counts = df["primary_category"].value_counts()
keep = counts[counts >= MIN_SAMPLES_PER_CLASS].index
df = df[df["primary_category"].isin(keep)].copy()

df["primary_category"].value_counts().describe()

count       295.000000
mean       5910.220339
std       29027.347826
min          51.000000
25%         145.000000
50%         439.000000
75%        2113.000000
max      434024.000000
Name: count, dtype: float64
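
The effect of the filter on stratification can be checked on toy data. The sketch below uses made-up labels and a hypothetical threshold of 5 (not the notebook's 50); it drops the rare class and then confirms that every test fold contains every remaining class.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import StratifiedKFold

# Toy labels (assumption for illustration): two frequent classes and one rare class.
labels = pd.Series(["Sport"] * 20 + ["Lajme"] * 15 + ["Rare"] * 2)

min_samples = 5  # hypothetical threshold for this toy example
counts = labels.value_counts()
kept = labels[labels.isin(counts[counts >= min_samples].index)].reset_index(drop=True)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
X_dummy = np.zeros((len(kept), 1))  # features are irrelevant for the split itself

# Collect the set of classes present in each test fold.
fold_classes = [set(kept.iloc[test_idx]) for _, test_idx in skf.split(X_dummy, kept)]
print(fold_classes)
```

With only 2 samples, the `Rare` class could not be represented in all 5 test folds, which is why it is removed before splitting.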

## Encode labels & X, y definition

In [7]:
le = LabelEncoder()
y = le.fit_transform(df["primary_category"])
X = df[TEXT_COL]

NUM_CLASSES = len(le.classes_)
print("Classes:", NUM_CLASSES)
print("Example classes:", le.classes_[:15])

Classes: 295
Example classes: ['(VIDEO)' 'A e dini se...' 'Afrika' 'Aktivitete sportive' 'Amerika'
 'Amerika Latine' 'Analiza' 'Analizë' 'Aplikacione' 'Argëtim' 'Arsim'
 'Arte' 'Artikull i Sponsoruar' 'Atletikë' 'Australia']
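
For readers unfamiliar with `LabelEncoder`, here is a minimal round trip on toy labels (the label list is an assumption for illustration). It shows that classes are stored in sorted order and that `inverse_transform` recovers the original strings from integer predictions.

```python
from sklearn.preprocessing import LabelEncoder

toy_labels = ["Sport", "Lajme", "Sport", "Ekonomi"]

toy_le = LabelEncoder()
toy_y = toy_le.fit_transform(toy_labels)

print(list(toy_le.classes_))  # ['Ekonomi', 'Lajme', 'Sport']
print(toy_y.tolist())         # [2, 1, 2, 0]

# Map integer codes back to the original category names.
decoded = toy_le.inverse_transform(toy_y)
print(list(decoded))          # ['Sport', 'Lajme', 'Sport', 'Ekonomi']
```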


## Models to evaluate
We evaluate multiple baseline ML classifiers using the same TF-IDF representation.

In [8]:
MODELS = {
    "MultinomialNB": MultinomialNB(),
    "LogisticRegression": LogisticRegression(max_iter=2000),
    "LinearSVC": LinearSVC(),
    "SGDClassifier": SGDClassifier(random_state=RANDOM_STATE)
}
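
To make "the same TF-IDF representation" concrete, the sketch below wraps a vectorizer and one classifier in a `Pipeline` and cross-validates it on a tiny invented corpus. The texts, labels, and 2-fold setup are assumptions for illustration only. Placing the vectorizer inside the pipeline means it is refit on the training part of each fold, so no vocabulary or IDF statistics leak from the test part.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import StratifiedKFold, cross_validate
from sklearn.naive_bayes import MultinomialNB
from sklearn.pipeline import Pipeline

# Tiny invented corpus (assumption): four "Sport" and four "Politikë" documents.
texts = [
    "gol ndeshje futboll", "ndeshje futboll fitore",
    "gol fitore ekip", "ekip futboll gol",
    "qeveria ligj kuvend", "kuvend ligj votim",
    "qeveria votim ministri", "ministri ligj qeveria",
]
labels = ["Sport"] * 4 + ["Politikë"] * 4

# The vectorizer is a pipeline step, so it is fit only on each fold's training data.
pipe = Pipeline([("tfidf", TfidfVectorizer()), ("clf", MultinomialNB())])

cv_toy = StratifiedKFold(n_splits=2, shuffle=True, random_state=42)
toy_scores = cross_validate(pipe, texts, labels, cv=cv_toy,
                            scoring=["accuracy", "f1_weighted"])

print(toy_scores["test_accuracy"])
print(toy_scores["test_f1_weighted"])
```

Swapping `MultinomialNB()` for any other entry of `MODELS` leaves the feature extraction step unchanged, which is what makes the comparison fair.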

## Stratified K-Fold cross validation (mentor requirement)
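We use `StratifiedKFold` so that each fold preserves the category distribution.
We report accuracy plus support-weighted precision, recall, and F1 as **mean ± std** across folds.
With weighted averaging, recall is mathematically equal to accuracy, so those two columns always match.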

In [9]:
cv = StratifiedKFold(n_splits=N_SPLITS, shuffle=True, random_state=RANDOM_STATE)

SCORING = {
    "accuracy": "accuracy",
    "precision": "precision_weighted",
    "recall": "recall_weighted",
    "f1": "f1_weighted"
}
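
A small worked example may help when reading the results: with `average="weighted"`, each class's recall is weighted by its share of the true labels, which collapses to the overall fraction of correct predictions. The toy labels below are made up for illustration.

```python
from sklearn.metrics import accuracy_score, recall_score

# Toy predictions (assumption): 6 samples, 3 classes, 4 correct.
y_true_toy = [0, 0, 0, 1, 1, 2]
y_pred_toy = [0, 0, 1, 1, 2, 2]

acc = accuracy_score(y_true_toy, y_pred_toy)
rec_weighted = recall_score(y_true_toy, y_pred_toy, average="weighted")

# Per-class recall: 2/3, 1/2, 1/1 with weights 3/6, 2/6, 1/6 -> (2 + 1 + 1) / 6 = 4/6
print(acc, rec_weighted)
```

This is why the `accuracy` and `recall` entries produced with this scoring dictionary are identical; weighted precision and F1 do carry separate information.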


In [10]:
# =========================
# K-FOLD MODEL EVALUATION
# =========================

# numpy, pandas, and the scikit-learn classes used below are already imported in the first cell.
from IPython.display import display, Markdown

# -------------------------
# CONFIG (SAFE FOR MAC)
# -------------------------
RANDOM_STATE = 42
N_SPLITS = 5
N_JOBS = 1
MAX_FEATURES = 10000

from joblib import Memory
memory = Memory(location="./.sk_cache", verbose=0)

MODELS = {
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(),
    "SGD_LogReg": SGDClassifier(
        loss="log_loss",
        alpha=1e-5,
        max_iter=1000,
        tol=1e-3,
        random_state=RANDOM_STATE
    )
}

# -------------------------
# CV + METRICS
# -------------------------
cv = StratifiedKFold(
    n_splits=N_SPLITS,
    shuffle=True,
    random_state=RANDOM_STATE
)

SCORING = {
    "accuracy": "accuracy",
    "precision": "precision_weighted",
    "recall": "recall_weighted",
    "f1": "f1_weighted"
}

# -------------------------
# PIPELINE
# -------------------------
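def build_pipeline(model):
    """Build a TF-IDF + classifier pipeline for the given estimator."""
    return Pipeline(
        [
            ("tfidf", TfidfVectorizer(
                max_features=MAX_FEATURES,
                ngram_range=(1, 1),  # unigrams only: fast baseline
                # if runtime allows, try ngram_range=(1, 2) to add bigrams
                min_df=2  # ignore terms that appear in fewer than 2 documents
            )),
            ("clf", model)
        ],
        memory=memory  # cache fitted transformers on disk between runs
    )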

# -------------------------
# STORAGE
# -------------------------
MODEL_COMPARISONS = []

# -------------------------
# RUN ONE MODEL AT A TIME
# -------------------------
def run_model(model_name):
    print(f"\nRunning model: {model_name} ...")

    pipeline = build_pipeline(MODELS[model_name])

    scores = cross_validate(
        pipeline,
        X, y,
        cv=cv,
        scoring=SCORING,
        n_jobs=N_JOBS,
        return_train_score=False
    )

    # Summarize each metric across folds; np.std defaults to the population std (ddof=0).
    row = {"model": model_name}
    for metric in SCORING:
        fold_scores = scores[f"test_{metric}"]
        row[f"{metric}_mean"] = float(np.mean(fold_scores))
        row[f"{metric}_std"] = float(np.std(fold_scores))

    MODEL_COMPARISONS.append(row)

    # show result as markdown table
    df_one = pd.DataFrame([row])
    display(Markdown(df_one.to_markdown(index=False)))

    return row

# -------------------------
# FINAL LEADERBOARD
# -------------------------
def show_leaderboard():
    if not MODEL_COMPARISONS:
        print("No models evaluated yet.")
        return

    # Use a local name that does not shadow the global dataset `df`.
    leaderboard = (
        pd.DataFrame(MODEL_COMPARISONS)
        .sort_values("f1_mean", ascending=False)
        .reset_index(drop=True)
    )
    display(leaderboard)

In [11]:
# Quick smoke test on the first 20,000 rows before running the full cross-validation.
X_small = X.iloc[:20000]
y_small = y[:20000]
scores = cross_validate(
    build_pipeline(MODELS["SGD_LogReg"]),
    X_small, y_small,
    cv=cv, scoring=SCORING, n_jobs=N_JOBS
)

  _warn_prf(average, modifier, f"{metric.capitalize()} is", len(result))
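
The `_warn_prf` output above is the tail of scikit-learn's `UndefinedMetricWarning`, most likely raised because some categories are never predicted in a fold, which leaves their precision undefined. The sketch below reproduces the warning on toy labels and shows one way to make the fallback explicit with `zero_division=0`. The `QUIET_SCORING` dictionary is a hypothetical alternative, not something the notebook uses.

```python
import warnings

from sklearn.exceptions import UndefinedMetricWarning
from sklearn.metrics import f1_score, make_scorer, precision_score

# Toy labels (assumption): class 2 exists in y_true but is never predicted.
y_true_toy = [0, 0, 1, 2]
y_pred_toy = [0, 0, 1, 1]


def weighted_precision_with_warning_count(**kwargs):
    """Return (weighted precision, number of UndefinedMetricWarning raised)."""
    with warnings.catch_warnings(record=True) as caught:
        warnings.simplefilter("always")
        value = precision_score(y_true_toy, y_pred_toy, average="weighted", **kwargs)
    n_warn = sum(issubclass(w.category, UndefinedMetricWarning) for w in caught)
    return value, n_warn


p_default, n_default = weighted_precision_with_warning_count()
p_explicit, n_explicit = weighted_precision_with_warning_count(zero_division=0)
print(p_default, n_default)    # same value, but with a warning
print(p_explicit, n_explicit)  # same value, no warning

# Hypothetical scoring dict that sets the fallback explicitly.
QUIET_SCORING = {
    "accuracy": "accuracy",
    "precision": make_scorer(precision_score, average="weighted", zero_division=0),
    "recall": "recall_weighted",
    "f1": make_scorer(f1_score, average="weighted", zero_division=0),
}
```

The scores do not change, since the default already substitutes 0 for undefined values. Only the warning is suppressed, so the warning itself is still worth reading once as a sign that some classes receive no predictions.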


In [11]:
run_model("MultinomialNB")


Running model: MultinomialNB ...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


| model         |   accuracy_mean |   accuracy_std |   precision_mean |   precision_std |   recall_mean |   recall_std |   f1_mean |      f1_std |
|:--------------|----------------:|---------------:|-----------------:|----------------:|--------------:|-------------:|----------:|------------:|
| MultinomialNB |        0.496674 |    0.000556364 |         0.490146 |      0.00172522 |      0.496674 |  0.000556364 |  0.460409 | 0.000728489 |

{'model': 'MultinomialNB',
 'accuracy_mean': 0.49667367358468384,
 'accuracy_std': 0.0005563639346664044,
 'precision_mean': 0.4901460067869025,
 'precision_std': 0.0017252236153776157,
 'recall_mean': 0.49667367358468384,
 'recall_std': 0.0005563639346664044,
 'f1_mean': 0.46040854297845035,
 'f1_std': 0.0007284892407378377}

In [12]:
run_model("LinearSVC")


Running model: LinearSVC ...


  _warn_prf(average, modifier, f"{metric.capitalize()} is", result.shape[0])


| model     |   accuracy_mean |   accuracy_std |   precision_mean |   precision_std |   recall_mean |   recall_std |   f1_mean |     f1_std |
|:----------|----------------:|---------------:|-----------------:|----------------:|--------------:|-------------:|----------:|-----------:|
| LinearSVC |        0.679761 |     0.00060889 |         0.653807 |     0.000760791 |      0.679761 |   0.00060889 |  0.654477 | 0.00063212 |

{'model': 'LinearSVC',
 'accuracy_mean': 0.6797612868257514,
 'accuracy_std': 0.0006088901764260413,
 'precision_mean': 0.6538074134362727,
 'precision_std': 0.0007607911760113517,
 'recall_mean': 0.6797612868257514,
 'recall_std': 0.0006088901764260413,
 'f1_mean': 0.6544766510346525,
 'f1_std': 0.0006321204777859011}

In [None]:
run_model("SGD_LogReg")

In [13]:
show_leaderboard()

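|    | model         |   accuracy_mean |   accuracy_std |   precision_mean |   precision_std |   recall_mean |   recall_std |   f1_mean |   f1_std |
|---:|:--------------|----------------:|---------------:|-----------------:|----------------:|--------------:|-------------:|----------:|---------:|
|  0 | LinearSVC     |        0.679761 |       0.000609 |         0.653807 |        0.000761 |      0.679761 |     0.000609 |  0.654477 | 0.000632 |
|  1 | MultinomialNB |        0.496674 |       0.000556 |         0.490146 |        0.001725 |      0.496674 |     0.000556 |  0.460409 | 0.000728 |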
