# Baseline Character TF-IDF + Linear SVM

This notebook runs a reproducible experiment for the initial typological language-identification baseline. It uses the bundled sample dataset to sanity-check the pipeline end to end before scaling up to WiLI-2018 subsets.

**What this notebook covers**
- Load the curated sample dataset
- Split into stratified training and test sets
- Vectorize with character n-grams (TF-IDF)
- Fit a linear SVM classifier (`LinearSVC`)
- Report accuracy, per-class precision/recall/F1 with macro and weighted averages, and a confusion matrix heatmap

Once this flow looks healthy, you can point the same logic at WiLI subsets generated via `python -m src.data.wili_downloader prepare-subset`.
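
One way to reuse the same logic on another dataset is to wrap it in a function that accepts any frame with `text` and `language` columns. The sketch below is only an assumption about how that could look: `run_baseline` is a hypothetical helper, `TfidfVectorizer` stands in for `build_char_vectorizer`, the hyperparameters are illustrative rather than the values in `TrainingConfig`, and the toy rows are invented.

```python
import pandas as pd
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC


def run_baseline(frame, test_size=0.25, random_state=0):
    """Hypothetical helper: fit and score the baseline on one labeled frame."""
    X_train, X_test, y_train, y_test = train_test_split(
        frame["text"],
        frame["language"],
        test_size=test_size,
        random_state=random_state,
        stratify=frame["language"],
    )
    pipeline = Pipeline(
        [
            # Assumption: stand-in for build_char_vectorizer, illustrative settings.
            ("vectorizer", TfidfVectorizer(analyzer="char", ngram_range=(1, 3))),
            ("classifier", LinearSVC(random_state=random_state)),
        ]
    )
    pipeline.fit(X_train, y_train)
    y_pred = pipeline.predict(X_test)
    # Return a dict so callers can pick out individual metrics.
    return classification_report(y_test, y_pred, output_dict=True, zero_division=0)


# Invented toy rows; a real run would load a prepared subset instead.
toy = pd.DataFrame(
    {
        "text": [
            "the weather is nice today",
            "where is the train station",
            "I would like a cup of tea",
            "this book is very interesting",
            "das Wetter ist heute schön",
            "wo ist der Bahnhof bitte",
            "ich möchte eine Tasse Tee",
            "dieses Buch ist sehr interessant",
        ],
        "language": ["eng"] * 4 + ["deu"] * 4,
    }
)
report = run_baseline(toy)
print(report["macro avg"])
```

Returning the report as a dictionary keeps the function usable both interactively and in scripts that compare runs across subsets.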


In [None]:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns
from sklearn.metrics import classification_report, confusion_matrix
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import LinearSVC

from src.config.settings import SAMPLE_DATA, TrainingConfig
from src.features.text_vectorizer import build_char_vectorizer


In [None]:
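# Load the training configuration and the bundled sample dataset.
config = TrainingConfig()
df = pd.read_csv(SAMPLE_DATA)

# Hold out a stratified test split that preserves each language's proportion.
X_train, X_test, y_train, y_test = train_test_split(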
    df["text"],
    df["language"],
    test_size=config.test_size,
    random_state=config.random_state,
    stratify=df["language"],
)

pipeline = Pipeline(
    [
        (
            "vectorizer",
            build_char_vectorizer(
                ngram_range=config.ngram_range,
                max_features=config.max_features,
                use_idf=config.use_idf,
            ),
        ),
        (
            "classifier",
            LinearSVC(
                C=config.c_value,
                class_weight=config.class_weight,
                random_state=config.random_state,
            ),
        ),
    ]
)

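# Fit on the training split, then predict the held-out test split.
pipeline.fit(X_train, y_train)
y_pred = pipeline.predict(X_test)

# Text report: per-class precision/recall/F1 plus accuracy and macro/weighted averages.
report = classification_report(y_test, y_pred)
print(report)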

# Compute the label order once so matrix rows, columns, and tick labels agree.
labels = sorted(df["language"].unique())
cm = confusion_matrix(y_test, y_pred, labels=labels)
plt.figure(figsize=(6, 5))
sns.heatmap(
    cm,
    annot=True,
    fmt="d",
    cmap="Blues",
    xticklabels=labels,
    yticklabels=labels,
)
plt.title("Confusion Matrix - Sample Dataset")
plt.xlabel("Predicted")
plt.ylabel("True")
plt.tight_layout()
plt.show()
