# NLP Modelling and Algorithms - Solutions
This notebook guides you through basic modelling of text data and using Machine Learning algorithms to classify Kinyarwanda news articles into one of 14 categories.

In [None]:
SEED = 1   # random state/seed for reproducibility

## Prepare data

### Download and combine data

In [None]:
# Import text data and libraries
import pandas as pd

data1_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_0_500.csv"
data2_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/kinnews_1000_1500.csv"

data1 = ...
data2 = ...

# Combine data1 and data2 into one DataFrame
data = ...

data

### Add category names (easier to interpret than the numeric labels)
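Two common ways to attach names to numeric labels are a lookup with `map` and a left `merge`. The sketch below runs on toy DataFrames; the column names `label`, `id` and `name` are assumptions for illustration and may not match the real CSV files, so check the actual columns first.

```python
import pandas as pd

# Toy stand-ins for the real data. The column names 'label', 'id' and 'name'
# are hypothetical and may differ from the downloaded CSV files.
toy_data = pd.DataFrame({"label": [2, 1, 2], "content": ["a", "b", "c"]})
toy_categories = pd.DataFrame({"id": [1, 2], "name": ["politics", "sport"]})

# Option 1: build a lookup Series and map each numeric label to its name
lookup = toy_categories.set_index("id")["name"]
toy_data["category"] = toy_data["label"].map(lookup)

# Option 2: a left merge keeps every article, even if a label has no name
merged = toy_data[["label", "content"]].merge(
    toy_categories, how="left", left_on="label", right_on="id"
)

print(toy_data)
print(merged)
```

`map` adds a single column in place, while `merge` brings along every column of the category table, which is useful if that table holds more than the name.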

In [None]:
# Download the category names
categories_url = "https://raw.githubusercontent.com/MBAZA-NLP/nlp-training/main/data/labels.csv"
categories = ...

# Add the correct category name to each row (there are several ways to do this)
...

data

### Separate the data (to use for prediction) and label (to predict) columns
Note: At first, we will only use the article 'content' for classification. Later, you can also experiment with using the title, or both together.
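
As a rough illustration of the later variant, title and content can be joined into a single text per article before vectorizing. This is a sketch on a toy DataFrame; the column names `title`, `content` and `label` are assumed here and should be checked against the real data.

```python
import pandas as pd

# Toy articles; the column names are assumptions for illustration only
toy = pd.DataFrame({
    "title": ["T1", "T2"],
    "content": ["body one", "body two"],
    "label": [1, 2],
})

X_content = toy["content"]                     # content only
X_both = toy["title"] + " " + toy["content"]   # title and content joined
y_toy = toy["label"]                           # label to predict

print(X_both.tolist())
```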

In [None]:
X = ...     # data to use for prediction
y = ...     # label to predict

## Making text machine-readable (TF-IDF)
There are several ways to turn text into something that a Machine Learning algorithm can handle. Here, we will be using one of the basic methods: Term Frequency-Inverse Document Frequency, or TF-IDF for short.

Look at the resulting TF-IDF matrix:
- What do the rows refer to? What about the columns?
- What do the cell values mean?
- How large is the matrix? Why does it have this number of rows and columns?
- Do the column names make sense?

In [None]:
from sklearn.feature_extraction.text import TfidfVectorizer

# Transform X using TF-IDF (do not forget to call toarray())
vectorizer = TfidfVectorizer()
X = ...

# Visualize the TF-IDF matrix
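# Note: get_feature_names() was removed in scikit-learn 1.2; use get_feature_names_out() instead
pd.DataFrame(X, columns=vectorizer.get_feature_names_out())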

## Training a Text Classifier
In order to be able to evaluate your model, you have to split the data into training and test datasets.

Then, train your model using the training data (article content ``X_train`` and article category ``y_train``).
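
The split itself can be sketched on toy arrays. The 20% test size and the `stratify` argument are assumptions chosen for illustration, not settings prescribed by this notebook; fixing `random_state` makes the split reproducible.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data: 10 samples with 2 features each, two balanced classes
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.array([0, 1] * 5)

# test_size and stratify are illustrative choices; random_state fixes the shuffle
X_tr, X_te, y_tr, y_te = train_test_split(
    toy_X, toy_y, test_size=0.2, random_state=1, stratify=toy_y
)

print(X_tr.shape, X_te.shape, y_tr.shape, y_te.shape)
```

With `stratify`, the class proportions in the train and test parts follow those of the full data, which matters when some categories are rare.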

In [None]:
# Split the data into train and test data
from sklearn.model_selection import train_test_split

# Remember to set the random state
X_train, X_test, y_train, y_test = ...

print(f'X_train shape: {X_train.shape}\nX_test shape: {X_test.shape}\ny_train shape: {y_train.shape}\ny_test shape: {y_test.shape}')

In [None]:
from time import time
from sklearn.neural_network import MLPClassifier

before = time()

# Create the classifier object (remember to set the random state)
classifier = ...

# Train the classifier on the training data X_train, y_train
...

print(f'Training took {time() - before} seconds.')

## Evaluation
For evaluation, first have the model make predictions on your test data.
Then have a look at the evaluation metrics.
- What is the overall accuracy?
- Is the accuracy high enough? What should the target accuracy be?
- Which categories are predicted well? Which ones are not? What could be the reason?
- Where do precision and recall differ sharply? What could be the reason?
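
To see how precision and recall can diverge, consider a small made-up example in which the model predicts one category far too often. The labels and predictions below are invented for illustration; `zero_division=0` assumes scikit-learn 0.22 or newer.

```python
from sklearn.metrics import (accuracy_score, classification_report,
                             precision_score, recall_score)

# Invented labels: the model over-predicts 'politics' and never predicts 'health'
toy_true = ["sport", "sport", "sport", "politics", "politics", "health"]
toy_pred = ["sport", "sport", "politics", "politics", "politics", "politics"]
toy_label_order = ["health", "politics", "sport"]

toy_accuracy = accuracy_score(toy_true, toy_pred)

# average=None returns one value per class, in the order of toy_label_order
toy_precision = precision_score(toy_true, toy_pred, labels=toy_label_order,
                                average=None, zero_division=0)
toy_recall = recall_score(toy_true, toy_pred, labels=toy_label_order,
                          average=None, zero_division=0)

print(f"Accuracy: {toy_accuracy:.2f}")
print(classification_report(toy_true, toy_pred, zero_division=0))
```

Here 'politics' has perfect recall (every politics article is found) but low precision (half of the articles labelled politics belong elsewhere), while 'health' is never predicted at all. Rare categories in the real data may show a similar pattern.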

In [None]:
# Do predictions on the test data. Create a DataFrame with three columns: Text, true label, and predicted label
before = time()

predictions = ...

print(f'Inference/predictions took {time() - before} seconds.')

pd.DataFrame(predictions, columns=['Predicted label'])

In [None]:
# Evaluate your model using various metrics
from sklearn.metrics import classification_report

print(classification_report(y_true=..., y_pred=...))

## Improving your model

### Hyperparameter tuning
As a first step to improve your model's performance, you can try to change ("tune") the hyperparameters used for training.

The next cell includes some common hyperparameters. Refer to the scikit-learn documentation on the ``MLPClassifier`` for more parameters and how to use them: https://scikit-learn.org/stable/modules/generated/sklearn.neural_network.MLPClassifier.html
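
One simple way to compare settings is to loop over a few candidate values and record the score of each. The sketch below uses synthetic data from `make_classification` and three arbitrarily chosen layer configurations; neither comes from this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier

# Synthetic stand-in data, not the news articles
toy_X, toy_y = make_classification(n_samples=200, n_features=10, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(toy_X, toy_y, random_state=1)

# Candidate values are arbitrary examples; note that each entry is a tuple
results = {}
for hidden in [(10,), (50,), (20, 20)]:
    clf = MLPClassifier(hidden_layer_sizes=hidden, max_iter=300, random_state=1)
    clf.fit(X_tr, y_tr)
    results[hidden] = clf.score(X_te, y_te)  # accuracy on the held-out part

best = max(results, key=results.get)
print(results)
print(f"Best setting in this run: {best}")
```

Choosing hyperparameters by their test score makes the final test score optimistic. For a stricter comparison, a separate validation split or cross-validation (for example scikit-learn's `GridSearchCV`) can be used instead.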

In [None]:
# Some hyperparameters you may want to play around with. They are set to their default values.
# Refer to the documentation to see other hyperparameters.
HIDDEN_LAYER_SIZES = (100,)  # has to be a tuple; without the trailing comma, (100) is just the int 100
MAX_ITERATIONS = 200
LEARNING_RATE = 0.001


new_classifier = MLPClassifier(random_state=SEED,
                               hidden_layer_sizes=HIDDEN_LAYER_SIZES,
                               max_iter=MAX_ITERATIONS,
                               learning_rate_init=LEARNING_RATE)

before = time()

# Train the new classifier
...

# Predict on the test data
new_predictions = ...

print(f'Training and inference took {time() - before} seconds.')
print(classification_report(y_true=..., y_pred=...))

### Optional 1 - Try other ML algorithms
Try using other classification algorithms. Do they improve the metrics?

See, for example, this comparison of some other classifiers that scikit-learn supports: https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html
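
Because scikit-learn classifiers share the same `fit`/`predict` interface, swapping algorithms usually takes only a line or two. The sketch below tries three classifiers that are often used with TF-IDF features on a made-up toy corpus; the selection of models is an assumption, not a recommendation from this notebook.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.naive_bayes import MultinomialNB
from sklearn.svm import LinearSVC

# Made-up toy corpus with two categories
toy_texts = ["goal match team", "team win match", "vote election party",
             "party law vote", "match goal win", "election law party"]
toy_labels = ["sport", "sport", "politics", "politics", "sport", "politics"]

toy_vectorizer = TfidfVectorizer()
toy_features = toy_vectorizer.fit_transform(toy_texts)
new_doc = toy_vectorizer.transform(["team goal"])  # unseen document

models = {
    "LogisticRegression": LogisticRegression(random_state=1),
    "MultinomialNB": MultinomialNB(),
    "LinearSVC": LinearSVC(random_state=1),
}

# Same fit/predict calls for every model
toy_predictions = {}
for name, model in models.items():
    model.fit(toy_features, toy_labels)
    toy_predictions[name] = model.predict(new_doc)[0]

print(toy_predictions)
```

These models accept the sparse matrix returned by `fit_transform` directly, so calling `toarray()` is not needed for them.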

In [None]:
...

### Optional 2 - Try using more data
Use more data and see if the model performs better. You can get more data here:

- Complete data of Kinyarwanda news: https://drive.google.com/drive/folders/1zxn0hgrOLlUsK5V0c7l71eAj1t2jiyox?usp=sharing
- Complete data of Kirundi news: https://drive.google.com/uc?export=download&id=1-53VQFOHqBeoX2JiN01X1Sxgfh78ckru

Does using Kirundi news articles improve or worsen the model's performance? What does that mean?

In [None]:
...