<a href="https://colab.research.google.com/github/KiptooAlvin/Emobilis_cohort_5/blob/main/Swahili_News_Classification.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Swahili News Classification
**Goal:** Build a multiclass classifier that assigns each Swahili news article to one of five categories (`kitaifa`, `michezo`, `biashara`, `kimataifa`, `burudani`) and produces a submission with a probability for every class, as required for Log Loss evaluation.

**Important notes before running:**

- This notebook handles the following target formats in `train.csv`:
  1. a single column named `target` containing class names (e.g., `kitaifa`, `michezo`, ...)
  2. multiple one-hot columns (e.g., `kitaifa`, `michezo`, ... with 0/1 values)
  3. otherwise, the string column with the fewest unique values (in the run shown here, `category`)

Run all cells top-to-bottom. Each code block is followed by detailed explanations.


In [2]:
# --- Imports ---
import os
import re
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, log_loss
from sklearn.pipeline import Pipeline


**Explanation:**
This cell imports the Python libraries we need. `pandas` and `numpy` are for data handling, `re` for simple text cleaning, and scikit-learn provides feature extraction, modeling, and evaluation tools. The `Pipeline` will help keep preprocessing and modeling together cleanly.

In [3]:
# --- Load datasets ---
TRAIN_PATH = '/content/train.csv'
TEST_PATH = '/content/test.csv'
SUB_PATH = '/content/sample_submission.csv'

# Load files (pd.read_csv raises FileNotFoundError if a path is wrong)
train = pd.read_csv(TRAIN_PATH)
test = pd.read_csv(TEST_PATH)
sample_sub = pd.read_csv(SUB_PATH)

print('Train shape:', train.shape)
print('Test shape :', test.shape)
print('Sample sub :', sample_sub.shape)

# Show first rows to inspect columns
display(train.head())
display(test.head())
display(sample_sub.head())

Train shape: (5151, 3)
Test shape : (1030, 2)
Sample sub : (1288, 6)


Unnamed: 0,id,content,category
0,SW0,SERIKALI imesema haitakuwa tayari kuona amani...,Kitaifa
1,SW1,"Mkuu wa Mkoa wa Tabora, Aggrey Mwanri amesiti...",Biashara
2,SW10,SERIKALI imetoa miezi sita kwa taasisi zote z...,Kitaifa
3,SW100,KAMPUNI ya mchezo wa kubahatisha ya M-bet ime...,michezo
4,SW1000,WATANZANIA wamekumbushwa kusherehekea sikukuu...,Kitaifa


Unnamed: 0,swahili_id,content
0,ae3baa6c34aa523fd2aa4de3c89448efff922311,Rais John Magufuli amemuagiza Msajili wa Hazi...
1,c4ee26a3ade8064a2ec494996e836900fd32dd8e,TAHARUKI imezuka katika mkutano wa Naibu Wazi...
2,58aee3aa1d94554ff57e6a053dbd60658e4890ff,"KOCHA wa Azam FC ya Dar es Salaam, Idd Cheche..."
3,00579c2307b5c11003d21c40c3ecff5e922c3fd8,THAMANI ya mauzo ya bidhaa za Afrika Masharik...
4,c83e9738ae5d1790ee85b99863deb734e7614c52,"WAZIRI wa Nchi, Ofi si ya Makamu wa Rais, Muu..."


Unnamed: 0,swahili_id,kitaifa,michezo,biashara,kimataifa,burudani
0,001dd47ac202d9db6624a5ff734a5e7dddafeaf2,0,0,0,0,0
1,0043d97f7690e9bc02f0ed8bb2b260d1d44bad92,0,0,0,0,0
2,00579c2307b5c11003d21c40c3ecff5e922c3fd8,0,0,0,0,0
3,00868eeee349e286303706ef0ffd851f39708d37,0,0,0,0,0
4,00a5cb12d3058dcf2e42f277eee599992db32412,0,0,0,0,0


**Explanation:**
This cell loads `train.csv`, `test.csv`, and `sample_submission.csv`. It prints the shapes and displays the first rows so you can quickly verify column names and formats. The sample submission is used to determine the required output column order (important for Log Loss scoring). If your files are in a different folder, update the `TRAIN_PATH`, `TEST_PATH`, and `SUB_PATH` variables accordingly.
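
The shapes printed above show 1030 test rows but 1288 sample-submission rows, so it can be worth checking how the two ID sets relate before building a submission. The following is a minimal sketch on toy data; `test_ids` and `sub_ids` are hypothetical stand-ins for `test['swahili_id']` and `sample_sub['swahili_id']`.

```python
import pandas as pd

# Toy stand-ins (assumption): swap in test['swahili_id'] and sample_sub['swahili_id']
test_ids = pd.Series(['a1', 'b2', 'c3'], name='swahili_id')
sub_ids = pd.Series(['a1', 'b2', 'c3', 'd4'], name='swahili_id')

# IDs present in one file but not the other
missing_from_sub = sorted(set(test_ids) - set(sub_ids))
extra_in_sub = sorted(set(sub_ids) - set(test_ids))

# Duplicate IDs in the test file would produce duplicate submission rows
has_duplicates = bool(test_ids.duplicated().any())

print('Test IDs missing from sample submission:', missing_from_sub)
print('Sample submission IDs not in test       :', extra_in_sub)
print('Duplicate test IDs                      :', has_duplicates)
```

If `extra_in_sub` is non-empty on the real files, the evaluator may expect rows that `test.csv` cannot supply, which is worth resolving before uploading.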

In [4]:
# --- Text cleaning function ---
def clean_text(text):
    if pd.isnull(text):
        return ''
    # Lowercase
    text = str(text).lower()
    # Remove URLs
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    # Keep letters and digits; replace punctuation and other symbols with a space
    text = re.sub(r'[^a-zA-Zā-ž0-9\s]', ' ', text)
    # Collapse repeated whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Apply to training and test data (assumes the text column is named 'text' or 'content' or similar)
# Try to detect the text column:
text_candidates = [c for c in train.columns if c.lower() in ('text', 'article', 'content', 'body', 'news')]
if len(text_candidates) == 0:
    # fallback: choose the string column with the largest median length
    str_cols = [c for c in train.columns if train[c].dtype == object]
    text_col = max(str_cols, key=lambda c: train[c].astype(str).str.len().median()) if str_cols else None
else:
    text_col = text_candidates[0]

print('Detected text column:', text_col)

# Pass raw values: astype(str) would turn NaN into the string 'nan' and bypass the null check in clean_text
train['clean_text'] = train[text_col].apply(clean_text)
test['clean_text']  = test[text_col].apply(clean_text)

# Show cleaned examples
display(train[['clean_text']].head())

Detected text column: content


Unnamed: 0,clean_text
0,serikali imesema haitakuwa tayari kuona amani ...
1,mkuu wa mkoa wa tabora aggrey mwanri amesitish...
2,serikali imetoa miezi sita kwa taasisi zote za...
3,kampuni ya mchezo wa kubahatisha ya m bet imei...
4,watanzania wamekumbushwa kusherehekea sikukuu ...


**Explanation:**
We create a `clean_text` function that:
- lowercases text
- removes URLs
- replaces punctuation and other symbols with spaces, keeping letters and digits
- collapses extra whitespace

The notebook tries to automatically detect a reasonable text column in `train` (`text`, `article`, `content`, `body`, or `news`). If none of those exist, it falls back to the string column with the largest median length. The same column name is then used for `test`, and the cleaned text is stored in a new `clean_text` column in both frames. Cleaning before vectorization reduces vocabulary noise, for example by merging case variants of the same word into one token.
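
To see what the cleaning steps do in practice, here is a small self-contained sketch that repeats the `clean_text` function from the cell above and applies it to a few made-up inputs (the sample strings and the `example.com` URL are illustrative, not taken from the dataset).

```python
import re
import pandas as pd

def clean_text(text):
    """Same steps as the notebook cell: lowercase, drop URLs, drop symbols, collapse spaces."""
    if pd.isnull(text):
        return ''
    text = str(text).lower()
    text = re.sub(r'http\S+|www\.\S+', ' ', text)
    text = re.sub(r'[^a-zA-Zā-ž0-9\s]', ' ', text)
    text = re.sub(r'\s+', ' ', text).strip()
    return text

# Made-up inputs (assumption): punctuation, a URL, digits, and a missing value
sample = "KAMPUNI ya M-bet: tembelea https://example.com sasa! 2019"
cleaned = clean_text(sample)
cleaned_missing = clean_text(None)

print(repr(cleaned))
print(repr(cleaned_missing))
```

Note that the hyphen in `M-bet` becomes a space, so the token is split into `m` and `bet`, which matches the `m bet` visible in the cleaned output above.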

In [5]:
# --- Target handling ---
# The dataset may provide targets in two formats:
# 1) A single column 'target' with class names
# 2) Multiple one-hot columns (kitaifa, michezo, ...)
# Attempt to detect format
if 'target' in train.columns:
    y_raw = train['target'].astype(str)
    print('Detected single target column named "target" with sample values:', y_raw.unique()[:10])
else:
    # find columns in train that also appear in sample_sub (these are likely the class columns)
    candidate_classes = [c for c in train.columns if c in sample_sub.columns]
    if len(candidate_classes) >= 2:
        # assume these are one-hot columns
        class_cols = candidate_classes
        print('Detected one-hot class columns:', class_cols)
        # convert one-hot to single label per row
        y_raw = train[class_cols].idxmax(axis=1)
    else:
        # fallback: try to find the column with small number of unique string values
        str_cols = [c for c in train.columns if train[c].dtype == object]
        possible = [(c, train[c].nunique()) for c in str_cols]
        possible = sorted(possible, key=lambda x: x[1])
        if possible:
            chosen = possible[0][0]
            y_raw = train[chosen].astype(str)
            print(f'Fallback: using column "{chosen}" as target with {possible[0][1]} unique values')
        else:
            raise ValueError('Could not detect target column. Please ensure train.csv contains a target column or one-hot class columns.')

# Encode labels to integers for the model
le = LabelEncoder()
le.fit(y_raw)
train['target_label'] = le.transform(y_raw)
classes = list(le.classes_)
print('Classes detected (in this order):', classes)

# Quick distribution check
display(train['target_label'].value_counts().sort_index())

Fallback: using column "category" as target with 5 unique values
Classes detected (in this order): ['Biashara', 'Burudani', 'Kimataifa', 'Kitaifa', 'michezo']


Unnamed: 0_level_0,count
target_label,Unnamed: 1_level_1
0,1360
1,17
2,54
3,2000
4,1720


**Explanation:**
This cell detects how the target labels are stored and converts them into a single label column `target_label` suitable for model training.

- If a `target` column exists, it uses that directly.
- If one-hot class columns are present (e.g., `kitaifa`, `michezo`, ...), it converts them into a single label using `idxmax`.
- Otherwise it falls back to choosing a string column with a small number of unique values.

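Finally, we use `LabelEncoder` to map the class names to integer labels and print the class order. `LabelEncoder` sorts the names, and uppercase letters sort before lowercase ones, which is why `michezo` comes last here. The order matters because the submission probabilities must match the class order expected by the evaluator (we will align probabilities to `sample_submission.csv` later).

The distribution above is also heavily imbalanced: `Burudani` (label 1) has only 17 examples and `Kimataifa` (label 2) has 54, against 1360 to 2000 for the other three classes.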
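
The two conversions described above can be sketched on toy data. The frame `toy_onehot` and the label list are hypothetical; only `idxmax` and `LabelEncoder` come from the notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# One-hot columns -> single label per row (toy frame, an assumption for illustration)
toy_onehot = pd.DataFrame({
    'kitaifa':  [1, 0, 0],
    'michezo':  [0, 1, 0],
    'biashara': [0, 0, 1],
})
labels_from_onehot = toy_onehot.idxmax(axis=1).tolist()

# LabelEncoder sorts class names; capitalised names sort before lowercase ones
encoder = LabelEncoder().fit(['Kitaifa', 'michezo', 'Biashara', 'Kitaifa'])
class_order = list(encoder.classes_)
encoded = encoder.transform(['michezo', 'Biashara']).tolist()
decoded = encoder.inverse_transform([1]).tolist()

print(labels_from_onehot)
print(class_order, encoded, decoded)
```

Because the integer codes depend on the sorted names, any later step that maps probabilities back to class names should go through `encoder.classes_` rather than assume a fixed order.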

In [6]:
# --- Vectorization and train/validation split ---
X = train['clean_text'].values
y = train['target_label'].values

X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.15, random_state=42, stratify=y)

pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(max_features=20000, ngram_range=(1,2), min_df=3)),
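    # lbfgs fits a multinomial model for multiclass targets by default; the explicit
    # multi_class argument is deprecated in recent scikit-learn releases
    ('clf', LogisticRegression(max_iter=1000, solver='lbfgs'))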
])

pipeline.fit(X_train, y_train)

print('Training completed.')

# Validate
val_probs = pipeline.predict_proba(X_val)
val_preds = pipeline.predict(X_val)
print('\nValidation classification report:\n')
print(classification_report(y_val, val_preds, target_names=le.classes_))
print('\nValidation log loss:', log_loss(y_val, val_probs))



Training completed.

Validation classification report:

              precision    recall  f1-score   support

    Biashara       0.85      0.78      0.81       204
    Burudani       0.00      0.00      0.00         3
   Kimataifa       0.00      0.00      0.00         8
     Kitaifa       0.80      0.88      0.84       300
     michezo       0.95      0.95      0.95       258

    accuracy                           0.86       773
   macro avg       0.52      0.52      0.52       773
weighted avg       0.85      0.86      0.86       773


Validation log loss: 0.42010068738144896




**Explanation:**
- We split the training data into training and validation sets (15% for validation), stratified by class to preserve class distribution.
- A `Pipeline` is used containing `TfidfVectorizer` followed by `LogisticRegression` (multinomial). TF-IDF converts text into numeric features; logistic regression is a fast, strong baseline for multiclass text classification.
- After fitting, we evaluate with `classification_report` (precision/recall/F1) and `log_loss`, the challenge metric. `log_loss` scores the predicted probabilities rather than the hard labels, so the submission must contain the output of `predict_proba`. Note that no separate probability calibration step is applied in this notebook.
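
As a rough illustration of why Log Loss rewards well-spread probabilities, the sketch below computes it by hand on made-up predictions and compares the result with `sklearn.metrics.log_loss`. The arrays `y_true` and `probs` are invented for the example.

```python
import numpy as np
from sklearn.metrics import log_loss

# Made-up labels and predicted probabilities for three classes (assumption)
y_true = np.array([0, 1, 2])
probs = np.array([
    [0.8, 0.1, 0.1],   # confident and correct
    [0.2, 0.7, 0.1],   # fairly confident and correct
    [0.3, 0.3, 0.4],   # unsure, but the true class still gets the most mass
])

# Log Loss is the mean negative log of the probability given to the true class
manual = -np.mean(np.log(probs[np.arange(len(y_true)), y_true]))
from_sklearn = log_loss(y_true, probs, labels=[0, 1, 2])

# Same predictions, but the third row is now confidently wrong
overconfident = probs.copy()
overconfident[2] = [0.98, 0.01, 0.01]
worse = log_loss(y_true, overconfident, labels=[0, 1, 2])

print(manual, from_sklearn, worse)
```

The confidently wrong row raises the score sharply, which is why a class that the model almost never predicts can still cost a lot under this metric.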

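Note that the report shows zero precision and recall for `Burudani` and `Kimataifa`: with only 3 and 8 validation examples, the model never predicts them, which is also what triggers scikit-learn's undefined-metric warnings. If you see warnings about convergence, increase `max_iter` or try `solver='saga'` for large data. On a dataset of this size the configuration above trains quickly and is usually stable.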
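
For readers new to this setup, here is a minimal sketch of the same TF-IDF plus logistic regression pipeline on four invented sentences. The toy documents and labels are assumptions, and `min_df` and `max_features` are left at their defaults because the corpus is tiny.

```python
import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import Pipeline

# Invented toy corpus (assumption): 0 = national news, 1 = sports
docs = [
    "serikali imesema amani",
    "timu imeshinda mchezo",
    "serikali imetoa fedha",
    "timu imepoteza mchezo",
]
labels = [0, 1, 0, 1]

toy_pipeline = Pipeline([
    ('tfidf', TfidfVectorizer(ngram_range=(1, 2))),   # unigrams and bigrams
    ('clf', LogisticRegression(max_iter=1000)),
])
toy_pipeline.fit(docs, labels)

# The vocabulary contains both single words and word pairs
vocab = toy_pipeline.named_steps['tfidf'].vocabulary_
print('bigram present:', 'serikali imesema' in vocab)

# predict_proba returns one row per document and one column per class
toy_probs = toy_pipeline.predict_proba(["serikali imetoa amani"])
print(toy_probs.shape, toy_probs.sum(axis=1))
```

The columns of `predict_proba` follow `toy_pipeline.classes_`, and each row sums to 1, which is the shape the submission step relies on.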

In [10]:
# --- Prepare test predictions and submission ---
# Determine the required class order from sample_submission (columns other than the id column)
possible_id_cols = [c for c in sample_sub.columns if c.lower() in ('id','test_id','testid','test')]
if len(possible_id_cols) == 0:
    # fallback to the first column
    id_col = sample_sub.columns[0]
else:
    id_col = possible_id_cols[0]

class_cols_sample_sub = [c for c in sample_sub.columns if c != id_col] # These are the class columns from sample_sub (lowercase)
print('ID column detected in sample submission:', id_col)
print('Submission class column order will be:', class_cols_sample_sub)

# Transform test text and predict probabilities
X_test = test['clean_text'].values
probs = pipeline.predict_proba(X_test)  # shape (n_samples, n_classes)

# Create a mapping from the LabelEncoder's classes (which are the order of `probs`)
# to their respective column names in `sample_sub` (handling case differences).
# This mapping will tell us which column in `probs` corresponds to which submission column.

# First, create a dictionary to easily look up the index of each class in `le.classes_`
le_class_to_index = {cls: i for i, cls in enumerate(le.classes_)}

# Prepare the submission DataFrame with the ID column
submission = pd.DataFrame({id_col: test[id_col].values})

# Populate the class probability columns in the submission DataFrame
for sub_col_name in class_cols_sample_sub:
    # Find the corresponding class name in le.classes_ (case-insensitively)
    matched_le_class = None
    for le_class in le.classes_:
        if le_class.lower() == sub_col_name.lower():
            matched_le_class = le_class
            break

    if matched_le_class is not None:
        # Get the index of this class in le.classes_ to access the correct column in `probs`
        class_index = le_class_to_index[matched_le_class]
        submission[sub_col_name] = probs[:, class_index]
    else:
        # If a class in sample_sub is not found in le.classes_ (even case-insensitively)
        print(f"Warning: Class '{sub_col_name}' from sample_submission not found in training classes. Filling with zeros.")
        submission[sub_col_name] = 0.0

# Ensure the final submission DataFrame has columns in the exact order as sample_sub
submission = submission[sample_sub.columns]

# Save
submission.to_csv('submission.csv', index=False)
print('Submission saved to submission.csv')
submission.head()

ID column detected in sample submission: swahili_id
Submission class column order will be: ['kitaifa', 'michezo', 'biashara', 'kimataifa', 'burudani']
Submission saved to submission.csv


Unnamed: 0,swahili_id,kitaifa,michezo,biashara,kimataifa,burudani
0,ae3baa6c34aa523fd2aa4de3c89448efff922311,0.546759,0.070385,0.357173,0.019708,0.005975
1,c4ee26a3ade8064a2ec494996e836900fd32dd8e,0.977458,0.009388,0.009224,0.003247,0.000683
2,58aee3aa1d94554ff57e6a053dbd60658e4890ff,0.059579,0.918616,0.01522,0.004744,0.00184
3,00579c2307b5c11003d21c40c3ecff5e922c3fd8,0.252667,0.019924,0.717187,0.007396,0.002827
4,c83e9738ae5d1790ee85b99863deb734e7614c52,0.883692,0.021237,0.085883,0.007133,0.002056


**Explanation:**
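- We look for an ID column in `sample_submission.csv` by name (`id`, `test_id`, ...) and otherwise fall back to the first column, which is how `swahili_id` is picked up here. The remaining columns give the required class order.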
- We predict probabilities on the test set using `pipeline.predict_proba`.
- The probabilities are arranged to match the class column order expected by the submission file. This order must match exactly or evaluation will be incorrect.
- Finally, we save `submission.csv` with the same columns as `sample_submission.csv` so you can upload it directly to Zindi.

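If a class column in the sample submission has no case-insensitive match among the training classes (rare), the code prints a warning and fills that column with zeros. Treat that warning as a prompt to fix the labels, because a zero probability is heavily penalized under Log Loss whenever that class is the true one. Also note that `sample_submission.csv` has 1288 rows while `test.csv` has 1030, so confirm that the IDs in `submission.csv` cover what the evaluator expects, and inspect the `submission.head()` output before uploading to the leaderboard.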
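
The column alignment can also be written as a small helper. This is a sketch under stated assumptions rather than the notebook's method: `align_probs` is a hypothetical function, and unlike the cell above it raises an error for an unmatched class instead of filling zeros.

```python
import numpy as np
import pandas as pd

def align_probs(probs, model_classes, submission_cols):
    """Reorder probability columns to match submission_cols, matching names case-insensitively.

    Hypothetical helper: raises KeyError for an unmatched column rather than filling zeros.
    """
    lookup = {str(name).lower(): i for i, name in enumerate(model_classes)}
    aligned = {}
    for col in submission_cols:
        idx = lookup.get(col.lower())
        if idx is None:
            raise KeyError(f"No training class matches submission column '{col}'")
        aligned[col] = probs[:, idx]
    return pd.DataFrame(aligned, columns=list(submission_cols))

# Class order as printed by the LabelEncoder, and the order required by the sample submission
model_classes = ['Biashara', 'Burudani', 'Kimataifa', 'Kitaifa', 'michezo']
submission_cols = ['kitaifa', 'michezo', 'biashara', 'kimataifa', 'burudani']

# One invented probability row in model_classes order (assumption)
toy_probs = np.array([[0.10, 0.20, 0.30, 0.35, 0.05]])
aligned = align_probs(toy_probs, model_classes, submission_cols)
aligned_row = aligned.iloc[0].tolist()

print(aligned)
```

Raising on a mismatch is a deliberate difference from the zero-fill approach: it stops the run early instead of producing a file whose rows no longer sum to 1.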