The Swahili news dataset contains more than 31,000 news articles across categories such as local, international, business or financial, health, sports, and entertainment. Swahili is one of the most widely spoken languages in Africa, with an estimated 100-150 million speakers across East Africa.

The data was collected from news publication platforms inside and outside Tanzania. The dataset will be used to develop a multi-class classification model that assigns each news article to one of these categories.

The model will be used by Swahili online news platforms to group articles by category automatically, helping readers find the news they want to read.

In [None]:
!pip install pandarallel
!pip install nltk
!pip install datasets

In [2]:
from datasets import load_dataset
import pandas as pd

In [11]:
# load the swahili_news dataset
dataset = load_dataset("swahili_news")




In [4]:
dataset

DatasetDict({
    train: Dataset({
        features: ['text', 'label'],
        num_rows: 22207
    })
    test: Dataset({
        features: ['text', 'label'],
        num_rows: 7338
    })
})

The next cell converts the train and test splits to pandas DataFrames and concatenates them into a single DataFrame called df. The result has two columns, text and label, and 29,545 rows (22,207 from train plus 7,338 from test).

The ignore_index=True argument to pd.concat() discards the original indexes and gives the result a fresh index running from 0 to n-1. Without it, the indexes of the two DataFrames, which both start at 0, would overlap.

In [13]:
# convert the train and test datasets to separate pandas DataFrames
train_df = dataset['train'].to_pandas()
test_df = dataset['test'].to_pandas()

# concatenate the two dataframes
df = pd.concat([train_df, test_df], ignore_index=True)

# display the resulting DataFrame
print(df.head())

                                                text  label
0   Bodi ya Utalii Tanzania (TTB) imesema, itafan...      0
1   PENDO FUNDISHA-MBEYA RAIS Dk. John Magufuri, ...      1
2  Mwandishi Wetu -Singida BENKI ya NMB imetoa ms...      0
3   TIMU ya taifa ya Tanzania, Serengeti Boys jan...      2
4   Na AGATHA CHARLES – DAR ES SALAAM ALIYEKUWA K...      1
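
The effect of ignore_index=True can be checked on a small example. The sketch below uses two made-up two-row frames (toy stand-ins named left and right, not rows from the dataset) whose default indexes both start at 0.

```python
import pandas as pd

# Two toy frames standing in for the train and test splits (hypothetical data).
left = pd.DataFrame({"text": ["a", "b"], "label": [0, 1]})
right = pd.DataFrame({"text": ["c", "d"], "label": [2, 0]})

# Without ignore_index, each frame keeps its own index, so index values repeat.
kept = pd.concat([left, right])

# With ignore_index=True, the result gets a fresh 0..n-1 index.
reset = pd.concat([left, right], ignore_index=True)

print(kept.index.tolist())   # [0, 1, 0, 1]
print(reset.index.tolist())  # [0, 1, 2, 3]
```

A duplicated index is not an error in pandas, but label-based lookups such as kept.loc[0] would then return two rows instead of one, which is usually not what later steps expect.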


In [14]:
df['label'].unique()

array([0, 1, 2, 3, 4, 5])
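
Since the labels are plain integers, a natural follow-up for a multi-class task is to check how many rows fall into each class. The sketch below shows one way to do that on a toy frame; the values are made up for illustration and say nothing about the real class distribution.

```python
import pandas as pd

# Toy frame with integer labels (hypothetical values, not the real distribution).
toy = pd.DataFrame({"label": [0, 1, 0, 2, 1, 0]})

# Count rows per class, ordered by label id.
counts = toy["label"].value_counts().sort_index()

# Convert the counts to fractions of the whole frame.
shares = counts / len(toy)

# Plain-Python view of the counts: label id -> number of rows.
counts_by_label = {int(k): int(v) for k, v in counts.items()}
print(counts_by_label)  # {0: 3, 1: 2, 2: 1}
```

The same two lines would apply to df['label']. If the label column of the loaded dataset is a ClassLabel feature, the category names behind the integers should be available from dataset['train'].features['label'].names.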

Punkt is a sentence tokenizer: it splits a text into a list of sentences, using an unsupervised algorithm that learns abbreviations, collocations, and words that start sentences. It must be trained on a large collection of plain text in the target language before it can be used. The punkt package downloaded in the next cell provides pre-trained models, and nltk.word_tokenize() relies on the English one by default.
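
To make the training requirement concrete, here is a minimal sketch of training a Punkt model with NLTK's PunktTrainer. The corpus is a handful of made-up sentences used only to show the calls; it is far too small to learn reliable abbreviations, so the resulting splits should not be trusted.

```python
from nltk.tokenize.punkt import PunktSentenceTokenizer, PunktTrainer

# Tiny made-up corpus (hypothetical text); real training needs a large
# collection of plain text in the target language.
corpus = (
    "Dk. Juma alifika jana. Alikutana na Dk. Asha ofisini. "
    "Dk. Asha alisema kazi imekamilika. Wote waliondoka mapema."
)

# Learn abbreviation, collocation, and sentence-starter statistics from the corpus.
trainer = PunktTrainer()
trainer.train(corpus)

# Build a sentence tokenizer from the learned parameters.
tokenizer = PunktSentenceTokenizer(trainer.get_params())

# Split a new text into sentences with the trained model.
sentences = tokenizer.tokenize("Dk. Juma alisema kazi imekamilika. Wote waliondoka.")
for sentence in sentences:
    print(sentence)
```

This sketch needs no download, because it trains its own parameters rather than loading the pre-trained ones.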

In [25]:
import nltk
nltk.download('punkt')

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Package punkt is already up-to-date!


True

In [26]:
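import re

import nltk
from nltk.stem import PorterStemmer
from pandarallel import pandarallel  # imported but unused below; the serial .apply() is used

# NOTE: PorterStemmer applies English suffix-stripping rules, so on Swahili
# text it can clip words incorrectly (e.g. "rais" becomes "rai" in the output below).
stemmer = PorterStemmer()


def preprocess(text):
    """Lowercase, strip punctuation, tokenize, stem, and rejoin a text."""
    # convert to lowercase
    text = text.lower()

    # delete every character that is not a word character or whitespace;
    # because the match is deleted, hyphenated words are joined ("a-b" -> "ab")
    text = re.sub(r'[^\w\s]', '', text)

    # split the text into word tokens
    tokens = nltk.word_tokenize(text)

    # stem each token
    tokens = [stemmer.stem(token) for token in tokens]

    # rejoin the tokens into a single string
    return ' '.join(tokens)


# apply the preprocess function to the 'text' column of the DataFrame
df['text'] = df['text'].apply(preprocess)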

# display the resulting DataFrame
print(df.head())

                                                text  label
0  bodi ya utalii tanzania ttb imesema itafanya m...      0
1  pendo fundishambeya rai dk john magufuri ameta...      1
2  mwandishi wetu singida benki ya nmb imetoa msa...      0
3  timu ya taifa ya tanzania serengeti boy jana i...      2
4  na agatha charl dar es salaam aliyekuwa katibu...      1
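
Two effects are visible in this output: "fundisha-mbeya" has become "fundishambeya", and "rais" has become "rai". The sketch below reproduces both on a short string taken from row 1. The helper names strip_delete and strip_to_space are hypothetical, and strip_to_space is only one possible alternative, not part of the original pipeline.

```python
import re

from nltk.stem import PorterStemmer

stemmer = PorterStemmer()


def strip_delete(text):
    # Same substitution as in preprocess(): punctuation is deleted outright.
    return re.sub(r"[^\w\s]", "", text.lower())


def strip_to_space(text):
    # Hypothetical variant: punctuation becomes a space, then runs of
    # whitespace are collapsed, so hyphenated words stay separate.
    spaced = re.sub(r"[^\w\s]", " ", text.lower())
    return re.sub(r"\s+", " ", spaced).strip()


sample = "PENDO FUNDISHA-MBEYA RAIS Dk. John"
print(strip_delete(sample))    # pendo fundishambeya rais dk john
print(strip_to_space(sample))  # pendo fundisha mbeya rais dk john

# The English Porter rules strip the final "s" from the Swahili word "rais".
print(stemmer.stem("rais"))    # rai
```

Whether to keep the stemming step at all is a judgment call: the Porter rules were designed for English, so for Swahili it may be safer to skip stemming or to compare classifier results with and without it.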
