The Swahili news dataset contains more than 31,000 news articles from categories such as local, international, business or financial, health, sports, and entertainment. Swahili is one of the most widely spoken languages in Africa, with an estimated 100-150 million speakers across East Africa.

The data was collected from news publication platforms inside and outside Tanzania. This notebook uses the dataset to develop a multi-class classification model that assigns each news article to its category.

Swahili online news platforms can use such a model to group articles by category automatically and help readers find the news they want to read.

In [None]:
!pip install pandarallel
!pip install nltk
!pip install datasets

In [2]:
from datasets import load_dataset
import pandas as pd

In [3]:
# load the swahili_news dataset
dataset = load_dataset("swahili_news")

In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22207
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7338
    })
})

The next cell creates a pandas DataFrame called df with two columns, text and label, and as many rows as the train and test splits combined (22,207 + 7,338 = 29,545).

The ignore_index=True argument to concat() resets the index of the concatenated DataFrame. Both original DataFrames are indexed from 0, so without it the result would contain duplicate index values.

In [5]:
# convert the train and test datasets to separate pandas DataFrames
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# concatenate the two dataframes
df = pd.concat([train_df, test_df], ignore_index=True)

# display the resulting DataFrame
print(df.head())

                                                text  label
0   Bodi ya Utalii Tanzania (TTB) imesema, itafan...      0
1   PENDO FUNDISHA-MBEYA RAIS Dk. John Magufuri, ...      1
2  Mwandishi Wetu -Singida BENKI ya NMB imetoa ms...      0
3   TIMU ya taifa ya Tanzania, Serengeti Boys jan...      2
4   Na AGATHA CHARLES – DAR ES SALAAM ALIYEKUWA K...      1
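
To see what ignore_index=True changes, here is a minimal sketch on two tiny, made-up DataFrames. The names left and right and their contents are illustrative only and are not part of the dataset.

```python
import pandas as pd

# Two toy frames, each indexed from 0 like the train and test splits
left = pd.DataFrame({"text": ["habari ya kwanza", "habari ya pili"], "label": [0, 1]})
right = pd.DataFrame({"text": ["habari ya tatu", "habari ya nne"], "label": [2, 0]})

# Without ignore_index the original row labels are kept, so 0 and 1 repeat
kept = pd.concat([left, right])
print(kept.index.tolist())   # [0, 1, 0, 1]

# With ignore_index=True the result gets a fresh 0..n-1 index
reset = pd.concat([left, right], ignore_index=True)
print(reset.index.tolist())  # [0, 1, 2, 3]
```

A unique index matters later if rows are selected by label with df.loc, since a repeated label would return more than one row.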


In [6]:
df['label'].unique()

array([0, 1, 2, 3, 4, 5])

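Punkt: This tokenizer divides a text into a list of sentences. It uses an unsupervised algorithm to build a model of abbreviations, collocations, and words that start sentences, and that model has to be trained on a large collection of plain text in the target language. The punkt package downloaded below ships pretrained models, and nltk.word_tokenize, which is used in the preprocessing step, relies on the English model by default.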

In [7]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


True

In [8]:
from pandarallel import pandarallel  # imported, but apply() below runs on a single core
from nltk.stem import PorterStemmer
import re

# PorterStemmer implements English suffix-stripping rules
stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, stem, and rejoin a text."""
    # convert to lowercase
    text = text.lower()

    # remove every character that is not a word character or whitespace
    text = re.sub(r'[^\w\s]', '', text)

    # tokenize the text
    tokens = nltk.word_tokenize(text)

    # stem each token
    tokens = [stemmer.stem(token) for token in tokens]

    # rejoin the tokens into a string
    return ' '.join(tokens)


# apply the preprocess function to the 'text' column of the DataFrame
df['text'] = df['text'].apply(preprocess)

# display the resulting DataFrame
print(df.head())

                                                text  label
0  bodi ya utalii tanzania ttb imesema itafanya m...      0
1  pendo fundishambeya rai dk john magufuri ameta...      1
2  mwandishi wetu singida benki ya nmb imetoa msa...      0
3  timu ya taifa ya tanzania serengeti boy jana i...      2
4  na agatha charl dar es salaam aliyekuwa katibu...      1
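
The output above shows two side effects of this preprocessing on Swahili text. The stemmer turns RAIS into rai and Boys into boy, because PorterStemmer applies English suffix rules, and FUNDISHA-MBEYA is merged into fundishambeya, because punctuation is deleted rather than replaced by a space. For comparison, here is a minimal sketch of a language-neutral variant that skips stemming and uses only the standard library. The function name preprocess_no_stem is hypothetical, and whether this variant changes the accuracy on this dataset has not been measured here.

```python
import re


def preprocess_no_stem(text: str) -> str:
    """Lowercase, replace punctuation with spaces, and collapse whitespace."""
    text = text.lower()
    # replace every non-word, non-space character with a space so that
    # hyphenated words such as "fundisha-mbeya" stay two separate tokens
    text = re.sub(r"[^\w\s]", " ", text)
    # split on any whitespace and rejoin with single spaces
    return " ".join(text.split())


sample = " PENDO FUNDISHA-MBEYA RAIS Dk. John"
print(preprocess_no_stem(sample))
# pendo fundisha mbeya rais dk john
```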


CountVectorizer: It transforms each text into a vector of word counts, with one dimension for every word in the vocabulary learned from the whole corpus.


In information retrieval, tf–idf, short for term frequency–inverse document frequency, is a numerical statistic that is intended to reflect how important a word is to a document in a collection or corpus. It is often used as a weighting factor in searches of information retrieval, text mining, and user modeling.
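
To make the weighting concrete, here is a small worked sketch in plain Python on three made-up sentences. It assumes the smoothed idf formula, idf(t) = ln((1 + n) / (1 + df(t))) + 1, followed by L2 normalisation, which is how scikit-learn documents the defaults of TfidfTransformer. The corpus and the helper name tfidf_vector are illustrative.

```python
import math
from collections import Counter

# Toy corpus (made-up sentences), already lowercased and tokenizable by split()
docs = [
    "timu ya taifa imeshinda",
    "benki ya taifa imetoa mkopo",
    "timu imeshinda mechi",
]
tokenized = [doc.split() for doc in docs]
n_docs = len(tokenized)

# document frequency: in how many documents does each term appear?
doc_freq = Counter(term for doc in tokenized for term in set(doc))


def tfidf_vector(tokens):
    """Return an L2-normalised {term: weight} mapping for one document."""
    term_freq = Counter(tokens)
    weights = {
        term: count * (math.log((1 + n_docs) / (1 + doc_freq[term])) + 1)
        for term, count in term_freq.items()
    }
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {term: w / norm for term, w in weights.items()}


vec = tfidf_vector(tokenized[1])
for term, weight in sorted(vec.items(), key=lambda item: -item[1]):
    print(f"{term:8s} {weight:.3f}")
```

In the second sentence, benki appears in only one document while ya appears in two, so benki receives the larger weight even though both occur once in that sentence.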

In [9]:
# Feature extraction
from sklearn.feature_extraction.text import CountVectorizer
from sklearn.feature_extraction.text import TfidfTransformer

# build the vocabulary and count word occurrences per article
count_vect = CountVectorizer()
counts = count_vect.fit_transform(df['text'])

# fit a TfidfTransformer on the raw counts to learn the idf weights
transformer = TfidfTransformer().fit(counts)
# reweight the raw counts into tf-idf features
counts = transformer.transform(counts)

In [10]:
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Hold out 10% of the tf-idf features and labels for testing
X_train, X_test, y_train, y_test = train_test_split(counts, df['label'], test_size=0.1, random_state=42)

# Train a linear-kernel SVM classifier on the remaining 90%
svm = SVC(kernel='linear', C=1.0, random_state=42)
model = svm.fit(X_train, y_train)
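
In the cells above, the CountVectorizer and TfidfTransformer are fitted on the full DataFrame before the split, so the vocabulary and idf weights also reflect the held-out articles. A common alternative is to split the raw text first and fit the feature extraction on the training portion only. The sketch below illustrates that ordering with a scikit-learn Pipeline on a handful of made-up sentences. The toy corpus and variable names are assumptions for illustration, and the accuracy reported in this notebook was not produced this way.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split
from sklearn.pipeline import Pipeline
from sklearn.svm import SVC

# Toy stand-in for df['text'] and df['label'] (made-up sentences, two classes)
texts = [
    "timu imeshinda mechi ya ligi",
    "mchezaji amefunga bao uwanjani",
    "kocha ametaja kikosi cha timu",
    "mashabiki wamejaa uwanjani leo",
    "benki imetoa mkopo kwa wakulima",
    "bei ya mafuta imepanda sokoni",
    "serikali imeongeza kodi ya bidhaa",
    "wafanyabiashara wamepata faida kubwa",
]
labels = [2, 2, 2, 2, 0, 0, 0, 0]

# Split the raw text first; stratify keeps both classes in each split
train_texts, test_texts, train_labels, test_labels = train_test_split(
    texts, labels, test_size=0.25, random_state=42, stratify=labels
)

# TfidfVectorizer is equivalent to CountVectorizer followed by TfidfTransformer
clf = Pipeline([
    ("tfidf", TfidfVectorizer()),
    ("svm", SVC(kernel="linear", C=1.0, random_state=42)),
])

# fit() learns the vocabulary and idf weights from the training split only
clf.fit(train_texts, train_labels)

# predict() reuses the fitted vocabulary; unseen words are simply ignored
predictions = clf.predict(test_texts)
print(len(clf.named_steps["tfidf"].vocabulary_), len(predictions))
```

Because the pipeline holds the vectorizer and the classifier together, it can also be saved and loaded as a single object.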

In [11]:
# Evaluating the model
import numpy as np
prediction = model.predict(X_test)
# fraction of test articles whose predicted label matches the true label
print(np.mean(prediction == y_test))
# Our model classifies with an accuracy of 89.2%

0.8923857868020305
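
The expression np.mean(prediction == y_test) is the fraction of correct predictions, which is the same quantity that accuracy_score returns. Accuracy alone does not show which categories are confused with each other, so a confusion matrix can be a useful complement. Here is a minimal sketch on made-up labels; y_true and y_pred are illustrative and are not results from this notebook.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

# Made-up true and predicted labels for six articles in three classes
y_true = np.array([0, 1, 2, 2, 1, 0])
y_pred = np.array([0, 1, 2, 1, 1, 0])

# Both lines compute the share of positions where the labels agree (5 of 6)
manual_accuracy = np.mean(y_pred == y_true)
sk_accuracy = accuracy_score(y_true, y_pred)
print(manual_accuracy, sk_accuracy)

# Rows are true classes, columns are predicted classes
cm = confusion_matrix(y_true, y_pred)
print(cm)
# [[2 0 0]
#  [0 2 0]
#  [0 1 1]]
```

The off-diagonal 1 in the last row says that one article of class 2 was predicted as class 1.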


**Deployment**

Export the trained model as a Python object: we will use Python's pickle module to save the trained SVM model as a binary file. This file can then be loaded and used in a Python script to make predictions on new text data.

In [12]:
import pickle

# Save the trained SVM model as a binary file
with open('svm_model.pkl', 'wb') as f:
    pickle.dump(model, f)
    
# Load the SVM model from the binary file
with open('svm_model.pkl', 'rb') as f:
    model = pickle.load(f)

**Tests**

This section shows how to use the saved binary file to make predictions on new text data. We first load the trained SVM model and the fitted feature extractors (the CountVectorizer and TfidfTransformer used during training) from their respective binary files. We then define a new text sample, clean it with the same preprocess function, and transform it with the same fitted extractors. Finally, we use the loaded SVM model to predict the label of the sample and print it. The pickle cell above saves only the SVM model, so count_vect and transformer have to be saved in the same way before this step.
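
The steps described here could look roughly like the sketch below. It is self-contained, so it trains on a few made-up sentences instead of the real DataFrame. The file name svm_bundle.pkl, the dictionary keys, and the sample text are hypothetical, and the cleaning step is a simplified stand-in (lowercasing and punctuation removal) for the notebook's preprocess function so that the sketch does not depend on NLTK.

```python
import os
import pickle
import re
import tempfile

from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.svm import SVC

# Toy training data standing in for the preprocessed df (made-up sentences)
texts = [
    "timu imeshinda mechi ya ligi",
    "mchezaji amefunga bao uwanjani",
    "kocha ametaja kikosi cha timu",
    "benki imetoa mkopo kwa wakulima",
    "bei ya mafuta imepanda sokoni",
    "serikali imeongeza kodi ya bidhaa",
]
labels = [2, 2, 2, 0, 0, 0]

# Fit the same three objects the notebook uses
count_vect = CountVectorizer()
counts = count_vect.fit_transform(texts)
transformer = TfidfTransformer().fit(counts)
model = SVC(kernel="linear", C=1.0, random_state=42)
model.fit(transformer.transform(counts), labels)

# Save all three fitted objects together (hypothetical file name)
path = os.path.join(tempfile.mkdtemp(), "svm_bundle.pkl")
with open(path, "wb") as f:
    pickle.dump(
        {"count_vect": count_vect, "transformer": transformer, "model": model}, f
    )

# Later, for example in a separate script: load the objects back
with open(path, "rb") as f:
    bundle = pickle.load(f)

# Clean the new sample, then reuse the fitted extractors (transform, not fit)
new_text = "Timu ya taifa imeshinda mechi!"
cleaned = re.sub(r"[^\w\s]", "", new_text.lower())
features = bundle["transformer"].transform(bundle["count_vect"].transform([cleaned]))

predicted = bundle["model"].predict(features)
print(predicted)
```

Calling transform rather than fit_transform on the new sample is the important detail: the loaded extractors must keep the vocabulary and idf weights they learned during training, otherwise the feature columns would no longer match what the model expects.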