# NLP Modelling and Algorithms - Solutions
This notebook guides you through basic modelling of text data and using Machine Learning algorithms to classify Kinyarwanda news articles into one of 14 categories.

In [11]:
SEED = 1   # random state/seed for reproducibility

## Prepare data

### Download and combine data

In [78]:
# Import text data and libraries
import pandas as pd

data1_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_0_500.csv"
data2_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_1000_1500.csv"

data1 = pd.read_csv(data1_url)
data2 = pd.read_csv(data2_url)

# Combine data1 and data2 into one DataFrame
data = pd.concat([data1, data2])

data

Unnamed: 0,label,title,content
0,2,ikipe y’ u rwanda amavubi yahesheje u rwanda a...,uyu mukino wabaye itariki ukwakira gihe ikipe ...
1,11,urubyiruko itorero erc giterane cy’ububyutse k...,urubyiruko itorero ry’ivugabutumwa n’isanamiti...
2,4,rusizi bambaye udupfukamunwa n’ubwo bamwe bata...,kuri kabiri tariki gicurasi urasanga isura nsh...
3,5,abanyarwanda batatu begukanye ibihembo pam awards,buri mwaka ibihembo bihembo bihabwa abaririmby...
4,11,light family choir igiye gukora igitaramo cy’a...,korali light family ikorera umurimo w’ivugabut...
...,...,...,...
495,1,ihohoterwa ntirikorerwa abagore gusa n’abagabo...,ibi bikaba biterwa hari abagore bamwe barumvis...
496,11,muco adonis wahimbye nzogera yavuze’ yashyize ...,ndategereje’ indirimbo muco adonis yakoze ugus...
497,1,abatangabuhamya rubanza munyenyezi bemerewe ku...,steven mcauliffe yatangajeko umwirondoro w’aba...
498,2,mayweather agiye guhura mcgregor mukino bise b...,ni umukino bamwe bise byendagusetsa ariko usho...
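
Note that `pd.concat` keeps each frame's original row labels by default, which is why the tail of the combined table shows labels in the 490s even though the frame holds the rows of both files. The following is a minimal sketch on made-up toy frames (`part1`, `part2` and their contents are illustrative, not the real data) showing how `ignore_index=True` produces a fresh, unique index instead:

```python
import pandas as pd

# Toy stand-ins for data1 and data2 (illustrative values only)
part1 = pd.DataFrame({"label": [2, 11], "content": ["text a", "text b"]})
part2 = pd.DataFrame({"label": [4, 5], "content": ["text c", "text d"]})

kept = pd.concat([part1, part2])                      # row labels repeat
fresh = pd.concat([part1, part2], ignore_index=True)  # row labels renumbered

print(kept.index.tolist())
print(fresh.index.tolist())
```

A unique index is not required for the steps that follow, but it avoids surprises when rows are later looked up by label.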


### Add category names (easier to interpret than the numeric labels)

In [79]:
# Download the category names
categories_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/labels.csv"
categories = pd.read_csv(categories_url)

# Add the correct category name to each row (there are several ways to do this)
data = pd.merge(categories, data, on='label')

data

Unnamed: 0,label,category_name,title,content
0,1,politics,saa niyo saha nziza gutera akabariro bashakanye,yifashishije science’ umwanditsi akaba n’inzob...
1,1,politics,ntitwabona ishimwe duha abaturokoye jenoside- ...,ibi byagarutsweho benshi barokotse jenoside bi...
2,1,politics,perezida kagame yasabye hakemurwa ikibazo cy’i...,perezida repubulika paul kagame yasabye inzego...
3,1,politics,gakenke inkeragutabara n’abafasha barahuguwe m...,kubakira byiza mateka y’u rwanda kugira uruhar...
4,1,politics,rubavunyuma gato y’uruzinduko minisitiri shyak...,hadaciye n’umunsi umwe nyuma y’uruzinduko mini...
...,...,...,...,...
995,14,relationship,menya amagambo meza udakwiye kwibagirwa kubwir...,urukundo nk’ubusitani bw’indabo bukeneye kubag...
996,14,relationship,umugore will smith yatunguye abantu barimo n’u...,jada pinkett smith w’umukinnyi filime akaba n’...
997,14,relationship,ubuhamya bubabaje bw’ umugore ukiri isugi nyum...,nari mfite imyaka niga kiciro mbere kaminuza n...
998,14,relationship,bamwe bakobwa ugomba kwirinda kuryamana nabo,niba ibi bikurikira ubizi cyangwa ukaba wabibo...
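
As one of the "several ways" mentioned in the cell above, the category name can also be looked up with `Series.map`. This is a sketch on toy frames; the column names `label` and `category_name` follow the tables shown above, while the rows themselves are invented for illustration:

```python
import pandas as pd

# Toy stand-ins; only the column names are taken from the notebook
categories = pd.DataFrame({"label": [1, 2],
                           "category_name": ["politics", "sport"]})
articles = pd.DataFrame({"label": [2, 1, 2],
                         "content": ["x", "y", "z"]})

# Build a label -> name lookup and map it onto the label column
lookup = categories.set_index("label")["category_name"]
articles["category_name"] = articles["label"].map(lookup)

print(articles)
```

Unlike the merge with `categories` on the left, which returns the rows grouped by label, `map` leaves the articles in their original order.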


### Separate the data (to use for prediction) and label (to predict) columns
Note: In the beginning, we will only use the article 'content' for classification. Later, you can also play around with using the title, or even both.  

In [None]:
X = data['content']             # data to use for prediction
y = data['category_name']       # label to predict
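
If you later want to use both title and content, one simple option is to concatenate the two text columns before vectorizing. The sketch below assumes a toy frame with the same column names as `data`; the name `X_both` is hypothetical:

```python
import pandas as pd

toy = pd.DataFrame({
    "title": ["match report", "election news"],
    "content": ["the team won", "votes were counted"],
    "category_name": ["sport", "politics"],
})

# Join title and content with a space so the last title word and the
# first content word do not fuse into a single token
X_both = toy["title"].fillna("") + " " + toy["content"].fillna("")
y_toy = toy["category_name"]

print(X_both.tolist())
```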

## Making text machine-readable (TF-IDF)
There are several ways to turn text into something that a Machine Learning algorithm can handle. Here, we will be using one of the basic methods: Term Frequency-Inverse Document Frequency, or TF-IDF for short.

Look at the resulting TF-IDF matrix:
- What do rows refer to? What about columns?
- What do the cell values mean?
- How large is the matrix? Why does it have this number of rows and columns?
- Do the column names make sense?

In [80]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Transform X using TF-IDF
vectorizer = TfidfVectorizer()
X = vectorizer.fit_transform(X).toarray()

# Visualize the TF-IDF matrix
# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out()
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())

Unnamed: 0,aabanyarwanda,aamerika,ab,aba,abaatanira,ababa,abababanjirije,abababaye,ababacungira,ababafasha,...,zumvikanyweho,zunze,zurugo,zuturere,zuyu,zuzuye,zuzuyemo,zuzuze,zyed,zzirimo
0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
1,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
2,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
3,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
4,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
995,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
996,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
997,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
998,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,...,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0
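
To make the questions about rows, columns and cell values concrete, here is a minimal pure-Python sketch of TF-IDF on a three-document toy corpus. It follows what I understand to be scikit-learn's default settings (smoothed IDF and L2 normalization of each row), but it tokenizes by plain whitespace, so its numbers will not match `TfidfVectorizer` exactly on real text:

```python
import math
from collections import Counter

docs = ["good news today", "bad news again today", "great match"]  # toy corpus
tokenized = [d.split() for d in docs]
vocab = sorted({tok for doc in tokenized for tok in doc})  # one column per term
n_docs = len(docs)

# Document frequency: in how many documents each term occurs
doc_freq = Counter(tok for doc in tokenized for tok in set(doc))
# Smoothed inverse document frequency: rarer terms get larger weights
idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1 for t in vocab}

matrix = []
for doc in tokenized:  # one row per document
    tf = Counter(doc)
    row = [tf[t] * idf[t] for t in vocab]
    norm = math.sqrt(sum(v * v for v in row))
    matrix.append([v / norm for v in row])  # scale each row to unit length

print(len(matrix), "rows x", len(vocab), "columns")
```

Each row is a document and each column is one distinct term of the corpus, which is why a large corpus yields tens of thousands of columns that are mostly zero.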


## Training a Text Classifier
To evaluate your model on articles it has not seen during training, split the data into a training set and a test set.

Then, train your model using the training data (article content ``X_train`` and article category ``y_train``).

In [81]:
# Split the data into train and test data
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=SEED)

print(f'X_train shape: {X_train.shape}\nX_test shape {X_test.shape}\ny_train shape {y_train.shape}\ny_test shape {y_test.shape}')

X_train shape: (700, 34455)
X_test shape (300, 34455)
y_train shape (700,)
y_test shape (300,)
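
One thing to be aware of: in this notebook the vectorizer is fitted on all articles before the split, so the vocabulary and IDF weights also reflect the test articles. A stricter variant splits the raw texts first and fits the vectorizer on the training part only. The sketch below uses invented toy texts and hypothetical variable names; it also keeps the matrices sparse instead of calling `.toarray()`, which is optional:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.model_selection import train_test_split

texts = ["team wins match", "votes were counted", "team loses final",
         "new law passed", "coach praises team", "minister gives speech"]
labels = ["sport", "politics", "sport", "politics", "sport", "politics"]

# Split the raw texts first ...
txt_train, txt_test, lab_train, lab_test = train_test_split(
    texts, labels, test_size=0.33, random_state=1)

# ... then learn vocabulary and IDF weights from the training texts only
vec = TfidfVectorizer()
Xtr = vec.fit_transform(txt_train)
Xte = vec.transform(txt_test)  # words unseen in training are ignored

print(Xtr.shape, Xte.shape)
```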


In [82]:
from time import time
from sklearn.neural_network import MLPClassifier

before = time()

# Create the classifier object
classifier = MLPClassifier(random_state=SEED)

# Train the classifier on the training data X_train, y_train
classifier.fit(X_train, y_train)

print(f'Training took {time() - before} seconds.')

Training took 130.59096813201904 seconds.


## Evaluation
For evaluation, first have the model make predictions on your test data.
Then have a look at the evaluation metrics.
- What is the overall accuracy?
- Is the accuracy high enough? What should the target accuracy be?
- Which categories are predicted well? Which ones are not? What could be the reason?
- Where do precision and recall differ sharply? What could be the reason?

In [83]:
# Make predictions on the test data and show the predicted labels in a DataFrame
before = time()

predictions = classifier.predict(X_test)

print(f'Inference/predictions took {time() - before} seconds.')

pd.DataFrame(predictions, columns=['Predicted label'])

Inference/predictions took 0.12498092651367188 seconds.


Unnamed: 0,Predicted label
0,economy
1,entertainment
2,economy
3,sport
4,sport
...,...
295,economy
296,economy
297,politics
298,sport
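
To inspect individual mistakes, it helps to see the text, the true label and the predicted label side by side. The sketch below uses invented toy values and hypothetical names (`texts_test`, `y_test_toy`, `predictions_toy`). In this notebook `X` has been overwritten by the TF-IDF matrix, so the raw test texts would first have to be recovered, for example via the index of `y_test`:

```python
import pandas as pd

# Toy stand-ins for the held-out texts, their true labels and the model output
texts_test = pd.Series(["team wins match", "votes were counted"], index=[7, 3])
y_test_toy = pd.Series(["sport", "politics"], index=[7, 3])
predictions_toy = ["sport", "economy"]  # plain list, aligned by position

comparison = pd.DataFrame({
    "Text": texts_test,
    "True label": y_test_toy,
    "Predicted label": predictions_toy,
})
# Flag the rows where the model was right
comparison["Correct"] = comparison["True label"] == comparison["Predicted label"]

print(comparison)
```

Filtering with `comparison[~comparison["Correct"]]` then lists only the misclassified articles.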


In [84]:
# Evaluate your model using various metrics
from sklearn.metrics import classification_report

print(classification_report(y_true=y_test, y_pred=predictions))

               precision    recall  f1-score   support

      culture       0.00      0.00      0.00         8
      economy       0.71      0.90      0.80        63
    education       0.00      0.00      0.00         5
entertainment       0.73      0.91      0.81        54
      fashion       0.00      0.00      0.00         2
       health       0.79      0.48      0.59        23
      history       0.00      0.00      0.00         1
     politics       0.78      0.82      0.80        72
 relationship       0.86      0.60      0.71        10
     religion       1.00      0.33      0.50         9
        sport       0.91      0.95      0.93        44
   technology       0.75      0.60      0.67         5
      tourism       1.00      0.50      0.67         4

     accuracy                           0.77       300
    macro avg       0.58      0.47      0.50       300
 weighted avg       0.75      0.77      0.75       300
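
The gap between the macro average and the weighted average in the report comes from the rare categories: the macro average counts every category equally, so categories that are never predicted pull it down, while the weighted average is dominated by the large categories. The following pure-Python sketch recomputes the metrics on six invented labels to show the mechanics. Treating precision as 0 when a category is never predicted is an assumption made here to mirror the zeros in the report:

```python
from collections import Counter

y_true = ["sport", "sport", "politics", "politics", "politics", "culture"]
y_pred = ["sport", "politics", "politics", "politics", "sport", "politics"]

classes = sorted(set(y_true))
support = Counter(y_true)  # number of true examples per category
report = {}
for c in classes:
    tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
    predicted = sum(p == c for p in y_pred)
    precision = tp / predicted if predicted else 0.0  # never predicted -> 0
    recall = tp / support[c]
    f1 = 2 * precision * recall / (precision + recall) if tp else 0.0
    report[c] = (precision, recall, f1)

# Macro: plain mean over categories. Weighted: mean weighted by support.
macro_f1 = sum(f1 for _, _, f1 in report.values()) / len(classes)
weighted_f1 = sum(report[c][2] * support[c] for c in classes) / len(y_true)

for c in classes:
    print(c, [round(v, 2) for v in report[c]], support[c])
print("macro F1:", round(macro_f1, 2), "weighted F1:", round(weighted_f1, 2))
```

Here the single `culture` article is never predicted, so its scores are all zero and the macro F1 ends up below the weighted F1, the same pattern as in the report.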





## Improving your model

### Hyperparameter tuning
As a first step to improve your model's performance, you can try to change ("tune") the hyperparameters used for training.

The next cell includes some common hyperparameters. Refer to the scikit-learn documentation on the ``MLPClassifier`` for more parameters and how to use them: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html

In [28]:
# Some hyperparameters you may want to play around with. They are set to their default values.
# Refer to the documentation to see other hyperparameters.
HIDDEN_LAYER_SIZES = (100,)  # has to be a tuple; note the trailing comma, since (100) is just the integer 100
MAX_ITERATIONS = 200
LEARNING_RATE = 0.001


new_classifier = MLPClassifier(random_state=SEED,
                               hidden_layer_sizes=HIDDEN_LAYER_SIZES,
                               max_iter=MAX_ITERATIONS,
                               learning_rate_init=LEARNING_RATE)

# Train the new classifier
before = time()
new_classifier.fit(X_train, y_train)

# Predict on the test data
new_predictions = new_classifier.predict(X_test)

print(f'Training and inference took {time() - before} seconds.')
print(classification_report(y_test, new_predictions))

Training and inference took 130.00215458869934 seconds.
              precision    recall  f1-score   support

           1       0.65      0.97      0.78        73
           2       0.93      0.93      0.93        43
           3       0.79      0.88      0.83        60
           4       0.75      0.35      0.47        26
           5       0.76      0.88      0.82        51
           6       0.00      0.00      0.00         4
           7       0.00      0.00      0.00         9
           8       0.00      0.00      0.00         3
           9       1.00      0.29      0.44         7
          10       1.00      1.00      1.00         1
          11       1.00      0.50      0.67         6
          12       0.00      0.00      0.00         3
          13       0.00      0.00      0.00         2
          14       1.00      0.25      0.40        12

    accuracy                           0.76       300
   macro avg       0.56      0.43      0.45       300
weighted avg       0.73 
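
Instead of changing one hyperparameter at a time and checking the test set after each change, the search can be automated with cross-validation on the training data. The sketch below uses scikit-learn's `GridSearchCV` on a small synthetic dataset so that it runs in seconds. The dataset, the grid values and the choice of `f1_macro` as the score are all assumptions made for illustration, not recommendations for the news data:

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.neural_network import MLPClassifier

# Small synthetic dataset (not the news data) to keep the sketch fast
X_toy, y_toy = make_classification(n_samples=120, n_features=20,
                                   n_informative=5, random_state=1)

param_grid = {
    "hidden_layer_sizes": [(20,), (50,)],  # one-element tuples
    "learning_rate_init": [0.001, 0.01],
}
search = GridSearchCV(MLPClassifier(random_state=1, max_iter=300),
                      param_grid, cv=3, scoring="f1_macro")
search.fit(X_toy, y_toy)  # tries every combination with 3-fold CV

print(search.best_params_, round(search.best_score_, 2))
```

Because every combination is trained several times, the cost grows quickly; with a model that takes about two minutes per fit, as above, a small grid is advisable.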



### Optional 1 - Try other ML algorithms
Try using other classification algorithms. Do they improve the metrics?

See, for example, the scikit-learn classifier comparison for other classifiers that scikit-learn supports: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html

In [None]:
...
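
As a possible starting point, the sketch below trains three classifiers that are commonly used with TF-IDF features and records their accuracy. The toy texts and the selection of models are assumptions for illustration; on the real data you would reuse `X_train`, `X_test`, `y_train` and `y_test` from above:

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Invented toy data, only to make the sketch runnable
train_texts = ["team wins match", "coach praises team", "striker scores goal",
               "votes were counted", "new law passed", "minister gives speech"]
train_labels = ["sport", "sport", "sport", "politics", "politics", "politics"]
test_texts = ["team scores goal", "minister passed law"]
test_labels = ["sport", "politics"]

vec = TfidfVectorizer()
X_tr = vec.fit_transform(train_texts)
X_te = vec.transform(test_texts)

candidates = {
    "logistic regression": LogisticRegression(max_iter=1000),
    "linear SVM": LinearSVC(),
    "naive Bayes": MultinomialNB(),
}
scores = {}
for name, model in candidates.items():
    model.fit(X_tr, train_labels)
    scores[name] = model.score(X_te, test_labels)  # accuracy on the test texts

print(scores)
```

All three typically train much faster than the MLP, so they are also convenient for quick experiments with the larger datasets.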

### Optional 2 - Try using more data
Use more data and see if the model performs better. You can get more data here:

- Complete data of Kinyarwanda news: https://drive.google.com/drive/folders/1zxn0hgrOLlUsK5V0c7l71eAj1t2jiyox?usp=sharing
- Complete data of Kirundi news: https://drive.google.com/uc?export=download&id=1-53VQFOHqBeoX2JiN01X1Sxgfh78ckru

Does using Kirundi news articles improve or worsen the model's performance? What does that mean?

In [None]:
...