# Swahili news classification using NLP

## Problem Statement  
The goal of this project is to build a Swahili news classification model that accurately categorizes news articles into six predefined categories:

- uchumi (economy)

- kitaifa (national news)

- michezo (sports)

- kimataifa (international news)

- burudani (entertainment)

- afya (health)

To achieve this, we will preprocess the text data (remove noise, tokenize, and normalize the text) and then build a classification model. To keep the approach clear, reproducible, and scalable, we will implement the preprocessing steps within a scikit-learn `Pipeline`.

### Importing the necessary libraries

In [None]:
import pandas as pd
import re
import string
import nltk
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize
from sklearn.preprocessing import LabelEncoder
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import TfidfVectorizer

In [22]:
nltk.download("stopwords")
nltk.download("punkt")

[nltk_data] Downloading package stopwords to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package stopwords is already up-to-date!
[nltk_data] Downloading package punkt to
[nltk_data]     C:\Users\Administrator\AppData\Roaming\nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

### Check the first 5 and last 5 rows

This checks for consistency throughout the data. It also shows which columns we are working with.

In [24]:
train_df.tail()

Unnamed: 0,id,content,category
23263,SW24920,Alitoa pongezi hizo alipozindua rasmi hatua y...,uchumi
23264,SW4038,Na NORA DAMIAN-DAR ES SALAAM TEKLA (si jina ...,kitaifa
23265,SW16649,"Mkuu wa Mkoa wa Njombe, Dk Rehema Nchimbi wak...",uchumi
23266,SW23291,"MABINGWA wa Ligi Kuu Soka Tanzania Bara, Simb...",michezo
23267,SW11778,"WIKI iliyopita, nilianza makala haya yanayole...",kitaifa


#### Observation:  
1. The data maintains uniformity from top to bottom.
2. The columns are: (id, content, category).
3. The train data has 23268 entries.

In [25]:
# Loading the testing dataset
test_df = pd.read_csv('data/test.csv')
test_df.head()

Unnamed: 0,text,label
0,BUNGE limehakikishiwa kuwa hakuna changamoto ...,kitaifa
1,Twiga ilicheza mechi ya kirafiki na Kenya kwe...,michezo
2,['Miaka mitano iliyopita Harry Maguire alikuwa...,michezo
3,"Bethsheba Wambura, Dar es Salaam Msanii wa Bon...",burudani
4,"\nMwekezaji wa Klabu ya Simba, Mohammed Dewji ...",michezo


In [26]:
test_df.tail()

Unnamed: 0,text,label
7333,Kamati hiyo ilibainisha kuwa moja ya mapunguf...,uchumi
7334,ARODIA PETER-DODOMA HOSPITALI ya Rufaa ya Benj...,kitaifa
7335,WAKATI mazoezi ya timu ya taifa ya Tanzania (...,michezo
7336,\n\tNa Suleiman Rashid Omar-Pemba\n \n\n \n\tW...,kitaifa
7337,BAO pekee lililofungwa na mshambuliaji wa Yan...,michezo


#### Observation:  
1. The data also maintains uniformity.
2. The columns are (text, label). This dataset does not have an id column. However, its text column corresponds to the content column of the train data, and its label column corresponds to category.
3. The test data has 7338 entries.

### Check the details of the datasets

Here we are going to check the number of non-null values and the datatypes of the data in the columns.

In [27]:
train_df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 23268 entries, 0 to 23267
Data columns (total 3 columns):
 #   Column    Non-Null Count  Dtype 
---  ------    --------------  ----- 
 0   id        23268 non-null  object
 1   content   23268 non-null  object
 2   category  23268 non-null  object
dtypes: object(3)
memory usage: 545.5+ KB


#### Observation
1. The train dataset has no null values: all 23268 rows are non-null in every column.

### Check the categories in the category column.

This being the target we check the categories we are dealing with.

In [28]:
train_df['category'].unique()

array(['uchumi', 'kitaifa', 'michezo', 'kimataifa', 'burudani', 'afya'],
      dtype=object)

#### Observation
The categories are:
1. uchumi - This is the economy news category.
2. kitaifa - This is the national news category.
3. michezo - This is the sports news category.
4. kimataifa - This is the international news category.
5. burudani - This is the entertainment news category.
6. afya - This is the health news category.

#### Check the number of values in the categories.

Here we check the number of occurrences of each category.

In [30]:
train_df['category'].value_counts()

kitaifa      10242
michezo       6004
burudani      2229
uchumi        2028
kimataifa     1906
afya           859
Name: category, dtype: int64

The value counts, from most frequent to least frequent, are:
1. kitaifa   -   10242
2. michezo   -    6004
3. burudani  -    2229
4. uchumi   -     2028
5. kimataifa  -   1906
6. afya     -      859
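The counts above can also be read as class proportions, which makes the imbalance between categories easier to see. The snippet below is a minimal sketch that re-enters the counts by hand (the `counts` Series is an illustrative stand-in for `train_df['category'].value_counts()`):

```python
import pandas as pd

# Category counts copied from the value_counts() output above
counts = pd.Series(
    {"kitaifa": 10242, "michezo": 6004, "burudani": 2229,
     "uchumi": 2028, "kimataifa": 1906, "afya": 859},
    name="category",
)

# Share of each class in the training set, as a percentage
shares = (counts / counts.sum() * 100).round(1)
print(shares)

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()
print(f"largest/smallest class ratio: {imbalance_ratio:.1f}")
```

On the real data the same proportions can be obtained with `train_df['category'].value_counts(normalize=True)`. kitaifa alone accounts for roughly 44% of the articles, which is worth keeping in mind when choosing evaluation metrics.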

# Data cleaning

In this step we are going to prepare the data for exploratory data analysis (EDA) and preprocessing. The cleaning objectives are:
1. Lowercasing – Convert all text to lowercase to maintain uniformity.

2. Removing Special Characters & Punctuation – Strip out unnecessary symbols (e.g., !?,.) to clean the text.

3. Removing Stopwords – Remove common Swahili stopwords (requires a Swahili stopwords list, which exists inside the `data` folder as a csv file).

4. Tokenization – Split the text into individual words for further processing.

5. Lemmatization/Stemming – Normalize words to their base form. This is done by removing suffixes and prefixes (this requires a Swahili NLP package).

6. Vectorization – Convert the processed text into numerical features using TF-IDF or CountVectorizer.

We are going to implement these steps in a pipeline to automate the process.


#### Check for Null Values

Null values affect EDA and Modeling negatively and have to be removed from data.  

In [31]:
test_df.isna().sum()

text     0
label    0
dtype: int64

In [32]:
train_df.isna().sum()

id          0
content     0
category    0
dtype: int64

Neither the train nor the test dataset has null values.

#### Check for duplicated entries

Duplicates distort EDA results and model performance, so we need to deal with them before those steps.

In [33]:
train_df.duplicated().sum()

0

The train data does not have fully duplicated rows.
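One caveat: `duplicated()` compares whole rows, so if every row has its own `id` (an assumption here, not verified above), repeated article text would go unnoticed. The sketch below uses a small made-up frame (`toy_df` is hypothetical, not the project data) to show the difference between a whole-row check and a check restricted to the text column:

```python
import pandas as pd

# Toy frame: two rows share the same text but have different ids
toy_df = pd.DataFrame({
    "id": ["SW1", "SW2", "SW3"],
    "content": ["habari za leo", "habari za leo", "michezo ya leo"],
    "category": ["kitaifa", "kitaifa", "michezo"],
})

# Whole-row comparison: the distinct ids hide the repeated text
full_row_dupes = toy_df.duplicated().sum()

# Comparing only the text column exposes the repeated article
content_dupes = toy_df.duplicated(subset=["content"]).sum()

print(full_row_dupes, content_dupes)
```

Running `train_df.duplicated(subset=["content"]).sum()` on the real data would be the equivalent check.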

#### Drop unnecessary columns

Identity columns contribute nothing to modeling and EDA. In this case the ID column has to be dropped.

In [None]:
# Drop 'id' column before processing
train_df = train_df.drop(columns=["id"])
train_df

#### Loading the stopwords from the csv file

Stopwords are words that occur frequently in a language and contribute to the grammatical correctness of a text but carry little of its meaning. For modelling, these words are noise and should be removed. The Natural Language Toolkit (NLTK) stopwords corpus does not include a Swahili list, so we load the stopwords from an external CSV file.

In [35]:
# Load Swahili stopwords from CSV
stopwords_df = pd.read_csv("data/Common Swahili Stop-words.csv")
stopwords_df.head()

Unnamed: 0,StopWords
0,tuna
1,ilikuwa
2,kisha
3,pili
4,mbaya


In [36]:
# We convert it to a set for fast lookup
swahili_stopwords = set(stopwords_df["StopWords"].dropna())
swahili_stopwords

{'a',
 'acha',
 'afanaleki',
 'aidha',
 'akiwa',
 'ala',
 'ali',
 'alia',
 'aliendelea',
 'alikuwa',
 'aliweza',
 'ama',
 'ambacho',
 'ambako',
 'ambalo',
 'ambamo',
 'ambao',
 'ambapo',
 'ambaye',
 'anafanya',
 'anafikiri',
 'anajua',
 'anakwenda',
 'anatakiwa',
 'anatokea',
 'anaye',
 'angali',
 'anza',
 'atakuwa',
 'au',
 'b',
 'baada',
 'baadaye',
 'baadhi',
 'barabara',
 'basi',
 'bila',
 'bora',
 'budi',
 'c',
 'cha',
 'chake',
 'chako',
 'chini',
 'chochote',
 'chote',
 'd',
 'dhidi',
 'duu',
 'e',
 'ebo',
 'ewaa',
 'f',
 'fauka',
 'g',
 'h',
 'hadi',
 'haiyumkini',
 'halafu',
 'halikadhalika',
 'hao',
 'haohao',
 'hapa',
 'hapana',
 'hapo',
 'haraka',
 'harakaharaka',
 'hasa',
 'hasha',
 'hata',
 'hii',
 'hili',
 'hilihili',
 'hivi',
 'hivyo',
 'hivyohivyo',
 'hiyo',
 'hiyohiyo',
 'hizi',
 'hizo',
 'huko',
 'hukohuko',
 'huku',
 'hukuhuku',
 'humu',
 'humuhumu',
 'huo',
 'huohuo',
 'hususani',
 'huu',
 'ila',
 'ile',
 'ilhali',
 'ili',
 'ilikuwa',
 'ingawa',
 'ingawaje',
 'ipi'
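To illustrate how this set is used, here is a minimal sketch of stopword filtering on one sentence. It uses only a handful of entries from the set above, and the helper `remove_stopwords` and the example sentence are made up for illustration; the actual cleaning class uses NLTK tokenization rather than a plain whitespace split.

```python
# A few entries taken from the stopword set shown above (not the full list)
sample_stopwords = {"baada", "hadi", "hii", "ili", "au"}

def remove_stopwords(text, stopwords):
    """Lowercase, split on whitespace, and drop tokens found in the stopword set."""
    return [tok for tok in text.lower().split() if tok not in stopwords]

sentence = "Timu hii ilicheza hadi mwisho ili kushinda"
filtered = remove_stopwords(sentence, sample_stopwords)
print(filtered)
```

Because membership tests on a `set` take constant time on average, the filter stays fast even when it is applied to every token of every article.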

### Defining the suffixes and prefixes for lemmatization

As noted above, this step reduces words to a base form by stripping prefixes and suffixes. Strictly speaking this is rule-based stemming rather than dictionary-based lemmatization, but we keep the name used in the code. It requires a list of prefixes and a list of suffixes; the lists below were compiled from several web sources.

In [None]:

swahili_suffixes = [
            'ni', 'to', 'ua', 'ika', 'eka', 'wa', 'ka', 'sha', 'la', 'lo', 'zo', 'e', 'ye', 'mo', 'ji', 'po'
        ]
swahili_suffixes

In [45]:
swahili_prefixes = [
            'm', 'wa', 'ki', 'vi', 'u', 'zi', 'ku', 'pa', 'mu', 'ni', 'tu', 'hu', 'ha', 'me', 'ta', 'li',
            'si', 'hatu', 'ham', 'hawa', 'hu', 'ha', 'a', 'ya'
        ]
swahili_prefixes

['m',
 'wa',
 'ki',
 'vi',
 'u',
 'zi',
 'ku',
 'pa',
 'mu',
 'ni',
 'tu',
 'hu',
 'ha',
 'me',
 'ta',
 'li',
 'si',
 'hatu',
 'ham',
 'hawa',
 'hu',
 'ha',
 'a',
 'ya']

## Text Preprocessing Steps (Implemented in a Pipeline)


### Cleaning Function 
Here, we define a class called `SwahiliTextCleaner` that inherits from scikit-learn's `BaseEstimator` and `TransformerMixin` classes.  
We will define the various steps it will take in cleaning the text, then later apply it to both the `train_df` and the `test_df`.  
Our `preprocessing_pipeline` will then include both the `SwahiliTextCleaner` and the `TfidfVectorizer`.

In [46]:
class SwahiliTextCleaner(BaseEstimator, TransformerMixin):
    def lemmatize(self, text):
        # Remove prefixes only if the remaining word is reasonable
        for prefix in swahili_prefixes:
            if text.startswith(prefix) and len(text) > len(prefix) + 2:  # Keep at least 3 characters after removal
                text = text[len(prefix):]
                break  # Stop after the first valid match
        
        # Remove suffixes only if the remaining word is reasonable
        for suffix in swahili_suffixes:
            if text.endswith(suffix) and len(text) > len(suffix) + 2:
                text = text[:-len(suffix)]
                break  # Stop after the first valid match

        return text
    def clean_text(self, text):
        # Convert to string (handle any non-string entries)
        text = str(text)
        # Collapse newlines (`\n`), tabs (`\t`), and repeated spaces into single spaces
        text = re.sub(r"\s+", " ", text)
        # Remove punctuation and special characters; re.escape keeps characters such as
        # `]`, `\` and `-` literal inside the character class. This also strips square
        # brackets (handles cases like row 2 in the test set)
        text = re.sub(f"[{re.escape(string.punctuation)}]", "", text)
        # Convert to lowercase
        text = text.lower()
        # Tokenization using NLTK
        tokens = word_tokenize(text)
        # Remove stopwords
        tokens = [word for word in tokens if word not in swahili_stopwords]
        # Lemmatization (combined prefix + suffix removal)
        tokens = [self.lemmatize(word) for word in tokens]
        # Join words back to a sentence
        return " ".join(tokens)

    def fit(self, X, y=None):
        return self  # no fitting needed: the cleaner learns nothing from the data

    def transform(self, X):
        return X.apply(self.clean_text)
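One property of the `lemmatize` method is that it stops at the first prefix that matches, so the order of the prefix list matters: `'ha'` appears before `'hatu'`, which means the longer prefix is never reached. The sketch below reproduces the same stripping rule in a standalone function (`strip_affixes` is a hypothetical helper, and the lists are small subsets of the ones above) and compares the listed order with a longest-first ordering. This is one possible variation, not the method used in this notebook.

```python
def strip_affixes(word, prefixes, suffixes, min_stem=3):
    """Strip at most one prefix and one suffix, keeping at least `min_stem` characters."""
    for prefix in prefixes:
        if word.startswith(prefix) and len(word) - len(prefix) >= min_stem:
            word = word[len(prefix):]
            break  # stop after the first valid match
    for suffix in suffixes:
        if word.endswith(suffix) and len(word) - len(suffix) >= min_stem:
            word = word[:-len(suffix)]
            break  # stop after the first valid match
    return word

prefixes = ["ha", "hatu", "ku", "wa"]  # subset, in the same relative order as above
suffixes = ["ni", "wa", "ka"]          # subset of the suffix list

word = "hatukusoma"
in_listed_order = strip_affixes(word, prefixes, suffixes)
longest_first = strip_affixes(
    word, sorted(prefixes, key=len, reverse=True), suffixes
)
print(in_listed_order, longest_first)
```

With the listed order only `ha` is removed, while the longest-first ordering removes `hatu`. Whether that improves classification would need to be checked empirically.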

# Preprocessing

This step involves preparing the clean data for modeling.
1. Creating pipelines for vectorizing the features and label-encoding the target.
2. Defining X and y for the train and test sets, for use in model training and testing.

### Pipelines

#### Preprocessing the Features

Machine learning and deep learning models do not work on raw text directly, so the text must be converted into a numerical format. The Term Frequency-Inverse Document Frequency (TF-IDF) vectorizer converts text data into numerical feature vectors that these models can use. TF-IDF is useful because:

1. Terms that are unique to a document are given higher weight.

2. Terms that are common across all documents (like "za" or "na") are downweighted.

3. It helps improve the relevance of features for text classification and clustering.

In [47]:
# Define the preprocessing pipeline
preprocessing_pipeline = Pipeline([
    ("text_cleaner", SwahiliTextCleaner()),
    ("tfidf", TfidfVectorizer(max_features=5000))  # Convert text to numerical features
])

#### Preprocessing the target

Since the target is categorical, we need to encode its classes in a numerical format that the model can use. We use label encoding, which assigns an integer label to each class.

In [48]:
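# FunctionTransformer wraps a plain function as a pipeline step
from sklearn.preprocessing import FunctionTransformer

# Fit one LabelEncoder on the training labels so that train and test
# share the same class-to-integer mapping
label_encoder = LabelEncoder()
label_encoder.fit(train_df["category"])

# Custom function to encode labels with the fitted encoder
def encode_labels(y):
    return label_encoder.transform(y)

# Create label encoding pipeline for target
encoding_pipeline = Pipeline([
    ("label_encoder", FunctionTransformer(encode_labels))
])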

In [51]:
# Apply the pipeline to the target on training and test sets
train_df["category_encoded"] = encoding_pipeline.fit_transform(train_df["category"])
test_df["label_encoded"] = encoding_pipeline.transform(test_df["label"])
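For reference, the sketch below shows what label encoding does to the six categories, assuming a single `LabelEncoder` that is fitted on the training labels and then reused for the test labels (the short `train_labels` and `test_labels` lists are illustrative). `LabelEncoder` sorts the classes alphabetically before assigning integers, and `inverse_transform` maps the integers back to the names.

```python
from sklearn.preprocessing import LabelEncoder

train_labels = ["uchumi", "kitaifa", "michezo", "kimataifa", "burudani", "afya"]
test_labels = ["michezo", "kitaifa"]

encoder = LabelEncoder()
encoder.fit(train_labels)                       # learn the class-to-integer mapping once
encoded_test = encoder.transform(test_labels)   # reuse the same mapping on new labels

print(list(encoder.classes_))
print(encoded_test)
print(encoder.inverse_transform(encoded_test))
```

Reusing one fitted encoder guarantees that a given integer means the same category in both sets, even if one set happens to be missing a class.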

In [53]:
# Apply the pipeline to the 'content' and 'text' columns (train and test sets).
# fit_transform/transform return a sparse TF-IDF matrix with one row per article,
# so the result is kept in its own variable rather than in a DataFrame column.
X_train = preprocessing_pipeline.fit_transform(train_df["content"])
X_test = preprocessing_pipeline.transform(test_df["text"])



#### Calling the Target and features for train and test data

In [54]:
y_train = train_df["category_encoded"]
y_test = test_df["label_encoded"]

In [55]:
# X_train and X_test are the TF-IDF matrices created above
X_train.shape, X_test.shape