# NLP Modelling and Algorithms - Solutions
This notebook guides you through basic modelling of text data and using Machine Learning algorithms to classify Kinyarwanda news articles into one of 14 categories.

## Data preparation & pre-processing

### Download and combine data

In [60]:
# Import text data and libraries
import pandas as pd
import warnings
warnings.filterwarnings("ignore")

data1_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_raw_500.csv"
data2_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_raw_500_1000.csv"

data1 = pd.read_csv(data1_url)
data2 = pd.read_csv(data2_url)

# Combine data1 and data2 into one DataFrame
data = pd.concat([data1, data2])

data

Unnamed: 0,label,kin_label,en_label,url,title,content
0,5,imyidagaduro,entertainment,http://www.igihe.com/imyidagaduro/article/namu...,Inkuru y’urukundo rwa Iyakare na Nyirandayishi...,Iri tsinda abarizi bakunze kubonamo umugabo w...
1,1,politiki,politic,http://www.igihe.com/amakuru/muri-afurika/arti...,Uganda: Abapolisikazi bakuriweho gusaba uburen...,Muri aya mavugurura kandi abapolisikazi bo mu...
2,1,politiki,politic,http://www.igihe.com/diaspora/ibikorwa/article...,U Bufaransa: Abanyarwanda bararimbanyije mu my...,Abanyarwanda batuye mu bice bitandukanye by’I...
3,5,imyidagaduro,entertainment,http://muhabura.rw/amakuru/imyidagaduro/articl...,Ibyo utari uzi kuri Nyakwigendera Mowzey Radi...,"Nubwo yabaye icyamamare muri muzika, ngo bury..."
4,5,imyidagaduro,entertainment,http://www.igihe.com/imyidagaduro/article/nell...,"Nelly Kelba, umuhanzi mushya uhanze amaso isok...",Uyu musore yatangiye gukora umuziki we mu nta...
...,...,...,...,...,...,...
495,14,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/im...,Imyitwarire Abasore na abagabo bagira ikunze g...,"Mu buzima bwa muntu, abantu bashobora guhura ..."
496,14,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/me...,Menya byinshi ku kurangiza k’Umugore mu gihe c...,"Kuri ubu, kurangiza ku bagore mu gihe cy’imib..."
497,14,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/bi...,Bimwe mu bintu biranga umugore cyangwa umukobw...,"Uruhurirane rw’abanditsi aribo Nora Ephron, B..."
498,14,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/es...,Ese wari uzi ibanga ry’ibintu abagabo bakunda ...,Ubusanzwe abagabo muri kamere yabo ntibakunda...


### Add category names (easier to interpret than the numeric labels)

In [63]:
# Download the category names
categories_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/labels.csv"
categories = pd.read_csv(categories_url)

# Add the correct category name to each row (there are several ways to do this)
data = pd.merge(categories, data, on='label')

data

Unnamed: 0,label,category_name,kin_label,en_label,url,title,content
0,1,politics,politiki,politic,http://www.igihe.com/amakuru/muri-afurika/arti...,Uganda: Abapolisikazi bakuriweho gusaba uburen...,Muri aya mavugurura kandi abapolisikazi bo mu...
1,1,politics,politiki,politic,http://www.igihe.com/diaspora/ibikorwa/article...,U Bufaransa: Abanyarwanda bararimbanyije mu my...,Abanyarwanda batuye mu bice bitandukanye by’I...
2,1,politics,politiki,politic,http://kigalipost.com/Rubingisa-Pudence-yatore...,\nRubingisa Pudence yatorewe kuba Meya w’Umujy...,Rubingisa Pudence ni we watorewe kuyobora Umu...
3,1,politics,politiki,politic,http://www.igihe.com/politiki/article/murangwa...,Murangwa Hadidja yatowe nk’umusenateri uhagara...,Murangwa asimbuye Uwamurera Salama wari watow...
4,1,politics,politiki,politic,http://www.igihe.com/diaspora/ibikorwa/article...,Abanyarwanda baba muri Amerika batanze miliyon...,Mu gukumira ko iki cyorezo cyarushaho gukwira...
...,...,...,...,...,...,...,...
994,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/im...,Imyitwarire Abasore na abagabo bagira ikunze g...,"Mu buzima bwa muntu, abantu bashobora guhura ..."
995,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/me...,Menya byinshi ku kurangiza k’Umugore mu gihe c...,"Kuri ubu, kurangiza ku bagore mu gihe cy’imib..."
996,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/bi...,Bimwe mu bintu biranga umugore cyangwa umukobw...,"Uruhurirane rw’abanditsi aribo Nora Ephron, B..."
997,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/es...,Ese wari uzi ibanga ry’ibintu abagabo bakunda ...,Ubusanzwe abagabo muri kamere yabo ntibakunda...
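
The comment in the cell above mentions that there are several ways to attach the category names. As a minimal sketch on made-up toy frames (`categories_demo` and `data_demo` are hypothetical stand-ins for `categories` and `data`), the same result can be obtained with `Series.map` or with a left merge that keeps the original row order:

```python
import pandas as pd

# Toy stand-ins for the notebook's `categories` and `data` frames (values are made up).
categories_demo = pd.DataFrame({
    "label": [1, 5, 14],
    "category_name": ["politics", "entertainment", "relationship"],
})
data_demo = pd.DataFrame({
    "label": [5, 1, 14, 1],
    "content": ["a", "b", "c", "d"],
})

# Option 1: build a label -> name lookup and map it onto the label column.
label_to_name = categories_demo.set_index("label")["category_name"]
data_demo["category_name"] = data_demo["label"].map(label_to_name)

# Option 2: a left merge keeps the rows in the order of the left frame.
merged_demo = data_demo[["label", "content"]].merge(categories_demo, on="label", how="left")

print(data_demo)
print(merged_demo)
```

The notebook's `pd.merge(categories, data, on='label')` puts `categories` on the left, which is why the merged output above is ordered by label rather than by the original article order.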


### Handle empty fields (NaN) -> ``dropna()``

In [64]:
data = data.dropna()

data

Unnamed: 0,label,category_name,kin_label,en_label,url,title,content
0,1,politics,politiki,politic,http://www.igihe.com/amakuru/muri-afurika/arti...,Uganda: Abapolisikazi bakuriweho gusaba uburen...,Muri aya mavugurura kandi abapolisikazi bo mu...
1,1,politics,politiki,politic,http://www.igihe.com/diaspora/ibikorwa/article...,U Bufaransa: Abanyarwanda bararimbanyije mu my...,Abanyarwanda batuye mu bice bitandukanye by’I...
2,1,politics,politiki,politic,http://kigalipost.com/Rubingisa-Pudence-yatore...,\nRubingisa Pudence yatorewe kuba Meya w’Umujy...,Rubingisa Pudence ni we watorewe kuyobora Umu...
3,1,politics,politiki,politic,http://www.igihe.com/politiki/article/murangwa...,Murangwa Hadidja yatowe nk’umusenateri uhagara...,Murangwa asimbuye Uwamurera Salama wari watow...
4,1,politics,politiki,politic,http://www.igihe.com/diaspora/ibikorwa/article...,Abanyarwanda baba muri Amerika batanze miliyon...,Mu gukumira ko iki cyorezo cyarushaho gukwira...
...,...,...,...,...,...,...,...
994,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/im...,Imyitwarire Abasore na abagabo bagira ikunze g...,"Mu buzima bwa muntu, abantu bashobora guhura ..."
995,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/me...,Menya byinshi ku kurangiza k’Umugore mu gihe c...,"Kuri ubu, kurangiza ku bagore mu gihe cy’imib..."
996,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/bi...,Bimwe mu bintu biranga umugore cyangwa umukobw...,"Uruhurirane rw’abanditsi aribo Nora Ephron, B..."
997,14,relationship,urukundo,relationship,http://muhabura.rw/amakuru/urukundo/article/es...,Ese wari uzi ibanga ry’ibintu abagabo bakunda ...,Ubusanzwe abagabo muri kamere yabo ntibakunda...
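
Before dropping rows, it can help to see where the empty fields actually are. The following is a small sketch on a made-up frame (`demo` and its values are illustrative, not taken from the dataset); it counts missing values per column and contrasts dropping on all columns with dropping only on the column used for prediction:

```python
import numpy as np
import pandas as pd

# Made-up example frame with a missing title and a missing content field.
demo = pd.DataFrame({
    "title": ["t1", None, "t3"],
    "content": ["c1", "c2", np.nan],
    "category_name": ["sport", "politics", "health"],
})

# Number of missing values in each column.
missing_per_column = demo.isna().sum()

# Drop rows only when 'content' is missing (rows with a missing title are kept).
only_content = demo.dropna(subset=["content"])

# Drop rows with a missing value in any column (what a plain dropna() does).
all_columns = demo.dropna()

print(missing_per_column)
print(len(only_content), len(all_columns))
```

Whether to restrict `dropna()` with `subset` depends on which columns the model will use; this is a design choice, not something the notebook prescribes.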


### Separate the data (to use for prediction) and label (to predict) columns
Note: In the beginning, we will only use the article 'content' for classification. Later, you can also play around with using the title, or even both.  

In [88]:
X = data['content']             # data to use for prediction
y = data['category_name']       # label to predict

X

0       Muri aya mavugurura kandi abapolisikazi bo mu...
1       Abanyarwanda batuye mu bice bitandukanye by’I...
2       Rubingisa Pudence ni we watorewe kuyobora Umu...
3       Murangwa asimbuye Uwamurera Salama wari watow...
4       Mu gukumira ko iki cyorezo cyarushaho gukwira...
                             ...                        
994     Mu buzima bwa muntu, abantu bashobora guhura ...
995     Kuri ubu, kurangiza ku bagore mu gihe cy’imib...
996     Uruhurirane rw’abanditsi aribo Nora Ephron, B...
997     Ubusanzwe abagabo muri kamere yabo ntibakunda...
998     u bijanye no kumenya neza ivyiyumviro abagore...
Name: content, Length: 999, dtype: object

### Text Pre-processing

In [89]:
# Lowercase text
X = X.str.lower()

# Remove special characters using a regular expression
# (letters, digits and whitespace are kept, as in the output below)
X = X.str.replace(r'[^a-z0-9\s]', '', regex=True)

# Remove stopwords
STOPWORD_KN = {'aba', 'abo', 'aha', 'aho', 'ari', 'ati', 'aya', 'ayo', 'ba', 'baba', 'babo', 'bari', 'be', 'bo', 'bose',
           'bw', 'bwa', 'bwo', 'by', 'bya', 'byo', 'cy', 'cya', 'cyo', 'hafi', 'ibi', 'ibyo', 'icyo', 'iki',
           'imwe', 'iri', 'iyi', 'iyo', 'izi', 'izo', 'ka', 'ko', 'ku', 'kuri', 'kuva', 'kwa', 'maze', 'mu', 'muri',
           'na', 'naho','nawe', 'ngo', 'ni', 'niba', 'nk', 'nka', 'no', 'nta', 'nuko', 'rero', 'rw', 'rwa', 'rwo', 'ry',
           'rya','ubu', 'ubwo', 'uko', 'undi', 'uri', 'uwo', 'uyu', 'wa', 'wari', 'we', 'wo', 'ya', 'yabo', 'yari', 'ye',
           'yo', 'yose', 'za', 'zo'}
X = X.apply(lambda row: " ".join(word for word in str(row).split() if word not in STOPWORD_KN))

print('After processing:\n', X[1])
print('\n\nBefore processing:')
data['content'][1]

After processing:
 abanyarwanda batuye bice bitandukanye byisi bararimbanyije myiteguro kwerekeza ahazabera rwanda day 2019 bufaransa bakomeje gushishikarizwa kuzitabira munsi ufite amateka afatika buzima bwabanyarwanda nigihugu cyabibarutse butumwa bwe ambasaderi wu rwanda bufaransa kabale jacques yakanguriye abanyarwanda kwiyandikisha hakiri kare kuko bizafasha myiteguro kubakira itangazo yashyizeho umukono ryo 7 kanama 2019 rigira riti ambasade yu rwanda iramenyesha abanyarwanda bufaransa u butaliyani espagne portugal monaco rwanda day izabera mujyi bonn budage 24 kanama 2019 rikomeza rivuga ambasade iboneyeho gutumira abanyamuryango bayo bufaransa kwitabira gikorwa cyingenzi kwiyandikisha rwanda day umwanya mwiza kungurana ibitekerezo abayitabira bahabwa igihe kubaza ibibazo umukuru wigihugu gutanga ibitekerezo nibyifuzo bitandukanye abanyarwanda bufaransa bakiriye rwanda day yabereye i paris 2011 gihe yitabiriwe nabantu 3700 barimo abanyarwanda ninshuti zabo insanganyamatsiko yagi

' Abanyarwanda batuye mu bice bitandukanye by’Isi bararimbanyije mu myiteguro yo kwerekeza ahazabera Rwanda Day 2019. Abo mu Bufaransa bakomeje gushishikarizwa kuzitabira uwo munsi ufite amateka afatika mu buzima bw’Abanyarwanda n’igihugu cyabibarutse. Mu butumwa bwe, Ambasaderi w’u Rwanda mu Bufaransa, Kabale Jacques, yakanguriye Abanyarwanda kwiyandikisha hakiri kare kuko bizafasha mu myiteguro yo kubakira. Itangazo yashyizeho umukono ryo ku wa 7 Kanama 2019, rigira riti “Ambasade y’u Rwanda iramenyesha Abanyarwanda baba mu Bufaransa, u Butaliyani, Espagne, Portugal na Monaco ko Rwanda Day izabera mu Mujyi wa Bonn mu Budage ku wa 24 Kanama 2019.’’ Rikomeza rivuga ko “Ambasade iboneyeho gutumira abanyamuryango bayo baba mu Bufaransa kwitabira icyo gikorwa cy’ingenzi no kwiyandikisha.’’ Rwanda Day ni umwanya mwiza wo kungurana ibitekerezo, abayitabira bahabwa igihe cyo kubaza ibibazo Umukuru w’Igihugu, gutanga ibitekerezo n’ibyifuzo bitandukanye. Abanyarwanda baba mu Bufaransa bakiriye
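
The three pre-processing steps can also be bundled into one function, which makes them easy to try on a single string. This is only a sketch: `preprocess` and `STOPWORDS_DEMO` are hypothetical names, and the stopword set is a small subset of `STOPWORD_KN` chosen for the example.

```python
import re

# Small subset of the notebook's STOPWORD_KN, just for this example.
STOPWORDS_DEMO = {"mu", "na", "ko", "wa", "ku"}

def preprocess(text: str, stopwords: set) -> str:
    """Lowercase, strip everything except letters/digits/whitespace, drop stopwords."""
    text = text.lower()
    text = re.sub(r"[^a-z0-9\s]", "", text)
    return " ".join(word for word in text.split() if word not in stopwords)

sample = "Rwanda Day izabera mu Mujyi wa Bonn mu Budage ku wa 24 Kanama 2019."
cleaned = preprocess(sample, STOPWORDS_DEMO)
print(cleaned)

# Removing the apostrophe fuses a prefix with the following word.
fused = preprocess("w’u Rwanda", STOPWORDS_DEMO)
print(fused)
```

The second call illustrates a side effect visible in the processed article above: because the apostrophe is deleted rather than replaced by a space, `w’u` becomes `wu` and is no longer matched by the stopword list.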

## Transformation: Making text machine-readable (TF-IDF)
There are several ways to turn text into something that a Machine Learning algorithm can handle. Here, we will use one of the basic methods: Term Frequency-Inverse Document Frequency, or TF-IDF for short.

Look at the resulting TF-IDF matrix:
- What do the rows refer to? What about the columns?
- What do the cell values mean?
- How large is the matrix? Why does it have this number of rows and columns?
- Do the column names make sense?

In [92]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Transform X using TF-IDF
vectorizer = TfidfVectorizer()
X_transformed = vectorizer.fit_transform(X).toarray()

X_transformed

# Visualize the TF-IDF matrix
# (get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out())
# pd.DataFrame(X_transformed, columns=vectorizer.get_feature_names_out())

array([[0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       ...,
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.],
       [0., 0., 0., ..., 0., 0., 0.]])

## Training a Text Classifier
In order to be able to evaluate your model, you have to split the data into training and test datasets.

Then, train your model using the training data (article content ``X_train`` and article category ``y_train``).

In [43]:
# Split the data into train and test data
from sklearn.model_selection import train_test_split

SEED = 1   # random state/seed for reproducibility

X_train, X_test, y_train, y_test = train_test_split(X_transformed, y, test_size=0.3, random_state=SEED)

print(f'X_train shape: {X_train.shape}\nX_test shape {X_test.shape}\ny_train shape {y_train.shape}\ny_test shape {y_test.shape}')

X_train shape: (699, 45362)
X_test shape (300, 45362)
y_train shape (699,)
y_test shape (300,)
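
Note that the cells above fit the TF-IDF vectorizer on all articles before splitting, so the vocabulary and document frequencies also reflect the test articles. A possible alternative, sketched here on made-up toy texts (all names ending in `_demo` and the texts themselves are hypothetical), is to split the raw text first and fit the vectorizer on the training part only. The sketch also passes `stratify` so that both parts contain the same share of each category:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

# Made-up toy corpus with two balanced categories.
texts_demo = [
    "ikipe igitego umupira", "umukino ikipe stade", "umupira umukino igitego", "stade ikipe umupira",
    "itorero imana bibiliya", "pasiteri itorero imana", "bibiliya pasiteri itorero", "imana bibiliya pasiteri",
]
labels_demo = ["sport"] * 4 + ["religion"] * 4

# Split the raw text first; stratify keeps the class proportions in both parts.
train_texts, test_texts, y_train_demo, y_test_demo = train_test_split(
    texts_demo, labels_demo, test_size=0.25, stratify=labels_demo, random_state=1
)

# Learn vocabulary and document frequencies from the training texts only ...
vec = TfidfVectorizer()
X_train_demo = vec.fit_transform(train_texts)
# ... and reuse them unchanged for the test texts.
X_test_demo = vec.transform(test_texts)

print(X_train_demo.shape, X_test_demo.shape)
```

On the real data, `stratify` raises an error for categories that contain only a single article, so very rare categories would need to be handled first.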


In [44]:
from time import time
from sklearn.neural_network import MLPClassifier

before = time()

# Create the classifier object
classifier = MLPClassifier(max_iter=50, random_state=SEED)

# Train the classifier on the training data X_train, y_train
classifier.fit(X_train, y_train)

print(f'Training took {time() - before} seconds.')

Training took 48.96645450592041 seconds.




## Evaluation
To evaluate the model, first let it predict the categories of the test data.
Then look at the evaluation metrics:
- What is the overall accuracy?
- Is the accuracy high enough? What should the target accuracy be?
- Which categories are predicted well? Which ones are not? What could be the reason?
- Where do precision and recall differ sharply? What could be the reason?

In [45]:
# Do predictions on the test data and show the predicted labels in a DataFrame
before = time()

predictions = classifier.predict(X_test)

print(f'Inference/predictions took {time() - before} seconds.')

pd.DataFrame(predictions, columns=['Predicted label'])

Inference/predictions took 0.05527687072753906 seconds.


Unnamed: 0,Predicted label
0,religion
1,religion
2,culture
3,entertainment
4,entertainment
...,...
295,religion
296,religion
297,sport
298,entertainment


In [46]:
# Evaluate your model using various metrics
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=predictions))

               precision    recall  f1-score   support

      culture       1.00      0.10      0.18        10
      economy       0.50      0.33      0.40         3
    education       1.00      0.33      0.50         3
entertainment       0.74      0.94      0.83        54
  environment       1.00      0.62      0.77         8
      fashion       0.00      0.00      0.00         4
       health       1.00      0.15      0.27        13
      history       0.00      0.00      0.00         1
     politics       0.82      0.82      0.82        17
 relationship       0.92      0.99      0.95        80
     religion       0.74      0.88      0.80        65
        sport       0.85      0.94      0.89        31
   technology       0.00      0.00      0.00         2
      tourism       0.50      0.33      0.40         9

     accuracy                           0.81       300
    macro avg       0.65      0.46      0.49       300
 weighted avg       0.81      0.81      0.78       300



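
To see which categories get confused with each other, a confusion matrix is a useful complement to the classification report. The sketch below uses made-up true and predicted labels (`y_true_demo` and `y_pred_demo` are illustrative); in the notebook, `y_test` and `predictions` would take their place.

```python
import pandas as pd
from sklearn.metrics import confusion_matrix

# Made-up labels standing in for y_test and predictions.
y_true_demo = ["sport", "sport", "health", "health", "religion", "religion"]
y_pred_demo = ["sport", "sport", "religion", "health", "religion", "religion"]

class_names = sorted(set(y_true_demo))
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=class_names)

# Rows are the true categories, columns are the predicted categories.
cm_table = pd.DataFrame(cm, index=class_names, columns=class_names)
print(cm_table)
```

Off-diagonal cells show the mistakes: here one `health` article was predicted as `religion`. A pattern like this lowers the recall of the true category and the precision of the predicted one.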


## Improving your model

### Hyperparameter tuning
As a first step to improve your model's performance, you can try to change ("tune") the hyperparameters used for training.

The next cell includes some common hyperparameters. Refer to the scikit-learn documentation on the ``MLPClassifier`` for more parameters and how to use them: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [28]:
# Some hyperparameters you may want to play around with. They are set to their default values.
# Refer to the documentation to see other hyperparameters.
HIDDEN_LAYER_SIZES = (100,)  # has to be a tuple; the trailing comma makes it one
MAX_ITERATIONS = 200
LEARNING_RATE = 0.001


new_classifier = MLPClassifier(random_state=SEED,
                               hidden_layer_sizes=HIDDEN_LAYER_SIZES,
                               max_iter=MAX_ITERATIONS,
                               learning_rate_init=LEARNING_RATE)

# Train the new classifier
before = time()
new_classifier.fit(X_train, y_train)

# Predict on the test data
new_predictions = new_classifier.predict(X_test)

print(f'Training and inference took {time() - before} seconds.')
print(classification_report(y_test, new_predictions))

Training and inference took 130.00215458869934 seconds.
              precision    recall  f1-score   support

           1       0.65      0.97      0.78        73
           2       0.93      0.93      0.93        43
           3       0.79      0.88      0.83        60
           4       0.75      0.35      0.47        26
           5       0.76      0.88      0.82        51
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         9
           8       0.00      0.00      0.00         3
           9       1.00      0.29      0.44         7
          10       1.00      1.00      1.00         1
          11       1.00      0.50      0.67         6
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         2
          14       1.00      0.25      0.40        12

    accuracy                           0.76       300
   macro avg       0.56      0.43      0.45       300
weighted avg       0.73 

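
Instead of changing hyperparameters by hand, scikit-learn's `GridSearchCV` can try several combinations with cross-validation. The following is only a sketch on small synthetic data from `make_classification`; the grid values are arbitrary choices for illustration, and on the real TF-IDF matrix each fit takes far longer.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic dataset standing in for X_train, y_train.
X_demo, y_demo = make_classification(
    n_samples=120, n_features=20, n_informative=5, n_classes=3, random_state=1
)

# Arbitrary example grid: 2 x 2 = 4 combinations.
param_grid = {
    "hidden_layer_sizes": [(20,), (50,)],
    "learning_rate_init": [0.001, 0.01],
}

search = GridSearchCV(
    MLPClassifier(max_iter=200, random_state=1),
    param_grid,
    cv=3,                # 3-fold cross-validation on the training data
    scoring="f1_macro",  # macro F1 weights all classes equally
)
search.fit(X_demo, y_demo)

best_params = search.best_params_
print(best_params, round(search.best_score_, 2))
```

Macro F1 is used as the score here because the categories are imbalanced and plain accuracy is dominated by the large ones; other scoring choices are equally possible.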


### Try other ML algorithms
Try using other classification algorithms. Do they improve the metrics?

For example, the scikit-learn classifier comparison shows some of the other classifiers that scikit-learn supports: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [None]:
...
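
One possible starting point, sketched on made-up toy texts, is to loop over a few classifiers that work well with sparse TF-IDF features. The three models below are examples chosen for this sketch, not a recommendation from the notebook, and the toy texts are invented.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy data standing in for the article texts and categories.
train_texts = ["umupira ikipe igitego", "ikipe umukino stade",
               "itorero pasiteri amasengesho", "imana itorero bibiliya"]
train_labels = ["sport", "sport", "religion", "religion"]
test_texts = ["ikipe igitego", "itorero imana"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "linear_svc": LinearSVC(),
    "naive_bayes": MultinomialNB(),
}

# Train each model on the same features and collect its predictions.
results = {}
for name, model in candidates.items():
    model.fit(X_tr, train_labels)
    results[name] = model.predict(X_te).tolist()

print(results)
```

In the notebook, the same loop could use `X_train`, `y_train` and `X_test`, with `classification_report` called on each model's predictions.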

### Try using the 'title' column for predictions

In [None]:
...
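
To use the title together with the content, the two columns can be joined into one text column before vectorizing. A minimal sketch on a made-up frame (the column name `title_and_content` is a hypothetical choice):

```python
import pandas as pd

# Made-up example with one missing title.
demo = pd.DataFrame({
    "title": ["Title one", None],
    "content": ["Body one", "Body two"],
})

# Replace missing values with empty strings so the concatenation does not produce NaN.
demo["title_and_content"] = (
    demo["title"].fillna("") + " " + demo["content"].fillna("")
).str.strip()

print(demo["title_and_content"].tolist())
```

The combined column can then be passed to the vectorizer in place of `data['content']`; using `data['title']` alone works the same way.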

### Try using more data
Use more data and see if the model performs better. You can get more data here:

- Complete data of Kinyarwanda news: https://drive.google.com/drive/folders/1zxn0hgrOLlUsK5V0c7l71eAj1t2jiyox?usp=sharing
- Complete data of Kirundi news: https://drive.google.com/uc?export=download&id=1-53VQFOHqBeoX2JiN01X1Sxgfh78ckru

Does using Kirundi news articles improve or worsen the model's performance? What does that mean?

In [None]:
...