# Template NLP

**Prerequisites :**
- Download the fasttext embedding matrix: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz and extract it in the folder `{{package_name}}-data`

- **Launch this notebook with the kernel of your virtual environment** (used in the last part of this notebook). In order to create a kernel linked to your virtual environment : `python -m ipykernel install --user --name=name_of_the_kernel` (when your virtual environment is activated)

## 1. Understand how the NLP template works

**Why use the NLP template?**

The NLP (natural language processing) template automatically generates an NLP project that includes the most common models and makes them easier to industrialize.

The generated project can be used for **classification** tasks on text data. Of course, you have to adapt it to your particular use case. 

**Structure of the generated project**

.   
├── <span style="color:darkred">{{package_name}}</span>  **# The package** <br>
│   ├── <span style="color:darkred">models_training</span> **# Folder containing all the modules pertaining to the models** <br>
│   ├── <span style="color:darkred">monitoring</span> **# Folder containing all the modules pertaining to the explainers and MLflow** <br>
│   ├── <span style="color:darkred">preprocessing</span> **# Folder containing all the modules pertaining to preprocessing** <br>
├── <span style="color:darkred">{{package_name}}-data</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing all the datasets** <br>
├── <span style="color:darkred">{{package_name}}-exploration</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing this tutorial and which should contain all your experiments and explorations** <br> 
├── <span style="color:darkred">{{package_name}}-models</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing all the models generated** <br>
├── <span style="color:darkred">{{package_name}}-ressources</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing some resources such as the instructions to upload a model** <br>
├── <span style="color:darkred">{{package_name}}-scripts</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing example scripts to preprocess data, train models, predict and use a demonstrator** <br>
│   ├── <span style="color:darkred">active_learning</span> **# Folder containing an example of active learning** <br>
│   ├── <span style="color:darkred">utils</span> **# Folder containing scripts to preprocess data** <br>
│   ├── <span style="color:darkred">utils_torch</span> **# Folder containing scripts to put data in pytorch format** <br>
├── <span style="color:darkred">{{package_name}}-transformers</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing the pytorch transformers** <br>
├── <span style="color:darkred">{{package_name}}.egg-info</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing various data on the package** <br>
├── <span style="color:darkred">tests</span>   &emsp;&emsp;&emsp;&emsp;**# Folder containing all the unit tests** <br>
├── .gitignore <br>
├── Makefile <br>
├── README.md    <br>
├── requirements.txt    <br>
└── setup.py   

**General principles on the generated packages**

- Data must be saved in the `{{package_name}}-data` folder<br>
<br>
- Trained models will automatically be saved in the `{{package_name}}-models` folder<br>
<br>
- Be aware that all the functions/methods for writing/reading files use these two folders as their base. Thus, when a script has an argument for the path of a file/model, the given path should be **relative** to the `{{package_name}}-data`/`{{package_name}}-models` folders.<br>
<br>
- The scripts provided in `{{package_name}}-scripts` are given as examples. You can use them to help you develop, but their use is not required. The package is mainly useful for the functions contained in utils, preprocessing and the models' classes.<br>
<br>
- The file `preprocess.py` contains the various preprocessing pipelines used in this package, stored in a dictionary. It is used to create datasets (one for each preprocessing pipeline). Be very careful when you modify a pipeline, because models already trained with it will no longer be compatible. It is generally advised to create a new pipeline instead.<br>
<br>
- You can use this package for mono-label and multi-label tasks (`multi_label` argument in the models' classes)<br>
<br>
- The modelling part is structured as follows :
    - ModelClass: main class taking care of saving data and metrics (among other things)
    - ModelPipeline: child class of ModelClass managing all models related to a sklearn pipeline
    - ModelKeras: child class of ModelClass managing all models using Keras




## 2. Use the template to train your first model

For that purpose, we will use a dataset containing popular video games from the website jeuxvideo.com

This dataset contains a description of the games and their type (Action, RPG, etc.). The goal will be to predict the type of the game from its description.

Note that the dataset is in French, but it is not necessary to understand French to follow this tutorial.

Note : in the following exercises, the datasets are .csv with `;` as separator and `utf-8` as encoding. These are the default values of the generated project.

<span style="color:red">**Exercise 1**</span>

Goal:

- Split a dataset into train / valid / test

TODO:
- Use the script `utils/0_split_train_valid_test.py` on the dataset `{{package_name}}-data/dataset_jvc.csv`
- We want a 'random' split but **with a random seed set to 42** (in order to always reproduce the same results)
- We use the default split ratios (60/20/20)

Help:
- The file `utils/0_split_train_valid_test.py` splits a dataset into three .csv files:
    - {file_name}_train.csv : the training dataset
    - {file_name}_valid.csv : the validation dataset
    - {file_name}_test.csv : the test dataset
- You can specify the type of split : random, stratified or hierarchical (here, use random)
- Reminder: the path to the file to process is relative to `{{package_name}}-data`
- To get the possible arguments of the script : `python 0_split_train_valid_test.py --help`
- Don't forget to activate your virtual environment ...
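For intuition, a seeded random 60/20/20 split can be sketched in a few lines of plain Python. This is only an illustration of the idea, under the assumption that rows are shuffled once and then cut in three. It is not the code of `0_split_train_valid_test.py`, and the function name and sample rows are made up.

```python
import random


def split_train_valid_test(rows, percents=(60, 20, 20), seed=42):
    """Shuffle rows with a fixed seed, then cut them into train/valid/test."""
    shuffled = list(rows)
    # A dedicated Random instance makes the shuffle reproducible for a given seed.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percents[0] // 100
    n_valid = len(shuffled) * percents[1] // 100
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


rows = [f"game_{i}" for i in range(10)]  # made-up rows
train, valid, test = split_train_valid_test(rows)
print(len(train), len(valid), len(test))  # 6 2 2
```

Because the seed is fixed, running the split twice on the same data gives the same three subsets, which is why the exercise asks for a seed of 42.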

**Exercise 1** : Validation

~ Run the following cell

In [None]:
import nlp
nlp.test_exercice_1()

**Exercise 1** : Solution

In [None]:
import nlp
nlp.get_exercice_1_solution()

<span style="color:red">**Exercise 2**</span>

Goal:

- Obtain a random sample of the file `dataset_jvc_train.csv` (n=10) (we won't use it afterwards, this exercise is just here to show what can be done)

TODO:
- Use the script `utils/0_create_samples.py` on the dataset `{{package_name}}-data/dataset_jvc.csv`
- We want a sample of 10 lines

Help:
- The file `utils/0_create_samples.py` samples a dataset
- To get the possible arguments of the script : `python 0_create_samples.py --help`
- Don't forget to activate your virtual environment ...
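As an illustration, sampling rows from a `;`-separated dataset can be sketched with the standard library alone. The in-memory dataset and the `sample_rows` function below are made up for the example. They are not the implementation of `0_create_samples.py`.

```python
import csv
import io
import random

# Made-up in-memory stand-in for a `;`-separated dataset.
raw = "description;RPG\n" + "\n".join(f"jeu {i};{i % 2}" for i in range(50))


def sample_rows(csv_text, n, seed=None):
    """Read a `;`-separated CSV and return n rows drawn without replacement."""
    reader = csv.DictReader(io.StringIO(csv_text), delimiter=";")
    rows = list(reader)
    return random.Random(seed).sample(rows, n)


sample = sample_rows(raw, n=10, seed=42)
print(len(sample))  # 10
```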

**Exercise 2** : Validation

~ Run the following cell

In [None]:
import nlp
nlp.test_exercice_2()

**Exercise 2** : Solution

In [None]:
import nlp
nlp.get_exercice_2_solution()

<span style="color:red">**Exercise 3**</span>

Goal:

- Apply the default preprocessing to `dataset_jvc_train.csv`

TODO:
- Use the script `1_preprocess_data.py` on the dataset `{{package_name}}-data/dataset_jvc_train.csv` to apply the default pipeline (`preprocess_P1`)
- The preprocessing must be done on the `description` column

Help:
- The file `1_preprocess_data.py` applies a preprocessing pipeline **to one column of one or several .csv files**
- Without the argument `preprocessing`, this python script creates as many files as there are **pipelines registered in `preprocessing/preprocess.py`**
- It works as follows:<br>
    - In `preprocessing/preprocess.py`: <br>
        - There is a dictionary of functions (`get_preprocessors_dict`): key: str -> function <br>
            - /!\ Don't remove the element 'no_preprocess': lambda x: x /!\ <br>
        - There are preprocessing functions (usually from words_n_fun pipelines) <br>
    - In `1_preprocess_data.py` :<br>
        - Reads the dictionary of `preprocessing/preprocess.py` <br>
        - For each 'key' of the dictionary (except `no_preprocess`):<br>
            - Get the associated preprocessing function
            - Load data
            - Create a column `preprocessed_text` -> apply the preprocessing function
            - Save the result -> {file_name}_{key}.csv <br>
- To get the possible arguments of the script : `python 1_preprocess_data.py --help`
- Don't forget to activate your virtual environment ...
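The loop described above can be sketched as follows. This is a simplified illustration working on in-memory rows. `preprocess_p1` is a stand-in that only lowercases and strips text (the real default pipeline differs), and `preprocess_file` is a hypothetical name for what `1_preprocess_data.py` does when reading and saving .csv files.

```python
def preprocess_p1(texts):
    """Stand-in pipeline: lowercase and strip each text (not the real preprocess_P1)."""
    return [text.lower().strip() for text in texts]


def get_preprocessors_dict():
    """Dictionary of pipelines: key (str) -> preprocessing function."""
    return {
        "no_preprocess": lambda x: x,  # must stay in the dictionary
        "preprocess_P1": preprocess_p1,
    }


def preprocess_file(file_name, rows, column):
    """Return {output file name: rows with an extra `preprocessed_text` column}."""
    outputs = {}
    for key, func in get_preprocessors_dict().items():
        if key == "no_preprocess":
            continue  # no file is created for the identity pipeline
        texts = func([row[column] for row in rows])
        new_rows = [dict(row, preprocessed_text=text) for row, text in zip(rows, texts)]
        outputs[f"{file_name}_{key}.csv"] = new_rows
    return outputs


rows = [{"description": "  Un JEU de rôle  "}]  # made-up row
result = preprocess_file("dataset_jvc_train", rows, "description")
print(list(result))  # ['dataset_jvc_train_preprocess_P1.csv']
```

Each registered pipeline produces its own output file, which is why the file used in later exercises carries the pipeline name as a suffix.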

**Exercise 3** : Validation

~ Run the following cell

In [None]:
import nlp
nlp.test_exercice_3()

**Exercise 3** : Solution

In [None]:
import nlp
nlp.get_exercice_3_solution()

<span style="color:red">**Exercise 4**</span>

Goal:

- Apply a "custom" preprocess to `dataset_jvc_train.csv` and `dataset_jvc_valid.csv`

TODO:
- Add a new preprocessing pipeline `preprocess_P2` in `preprocessing/preprocess.py`

- Pipeline to use:

```python
# pipeline to use
pipeline = ['remove_non_string', 'get_true_spaces', 'remove_punct', 'to_lower', 'remove_stopwords', 'trim_string', 'remove_leading_and_ending_spaces']
```
- Use the script `1_preprocess_data.py` to apply the new pipeline `preprocess_P2`
- The preprocessing must be done on the `description` column

Help:
- You have to create a new preprocessing in `preprocessing/preprocess.py` and add it to the dictionary of `get_preprocessors_dict()`
- Don't forget to activate your virtual environment ...
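To see what "a pipeline" means here, the sketch below chains a list of named steps into a single function. The three step functions are simplified stand-ins written for this example. The real steps come from words_n_fun and may behave differently (for instance in how punctuation is replaced), and `build_pipeline` is a hypothetical helper.

```python
import string

# Simplified stand-ins for three of the step names used in this exercise.
STEPS = {
    "to_lower": str.lower,
    "remove_punct": lambda s: s.translate(str.maketrans("", "", string.punctuation)),
    "remove_leading_and_ending_spaces": str.strip,
}


def build_pipeline(step_names):
    """Return a function applying the named steps in order (hypothetical helper)."""
    funcs = [STEPS[name] for name in step_names]

    def apply(text):
        for func in funcs:
            text = func(text)
        return text

    return apply


preprocess_p2 = build_pipeline(["remove_punct", "to_lower", "remove_leading_and_ending_spaces"])
print(preprocess_p2("  Un RPG, épique !  "))  # un rpg épique
```

Once such a function exists, registering it only requires a new entry in the dictionary returned by `get_preprocessors_dict()`, with `preprocess_P2` as its key.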

**Exercise 4** : Validation

~ Run the following cell

In [None]:
import nlp
nlp.test_exercice_4()

**Exercise 4** : Solution

In [None]:
import nlp
nlp.get_exercice_4_solution()

<span style="color:red">**Exercise 5**</span>

Goal:

- Use the script `2_training.py` to train a mono-label TF-IDF + SVM model to predict the 'RPG' category

- Training dataset : `dataset_jvc_train_preprocess_P2.csv`

- Validation dataset : `dataset_jvc_valid_preprocess_P2.csv`

TODO:
- Use the script `2_training.py` with the proper arguments

- We want to train on the column `preprocessed_text`, result of the preprocessing on the `description` column

- We want to predict the `RPG` column

Help:
- The script `2_training.py` trains a model on a dataset
- It works as follows:<br>
    - Read a train .csv file as input <br>
        - If a validation file is given, it will use it <br>
        - Otherwise, split the train dataset in two parts (train/validation) <br>
    - Manage `y_col` argument: <br>
        - If there is only one value, training in mono-label mode <br>
        - If several values, training in multi-label mode <br>
    - **Manual modifications of the script**: <br>
        - **To change the model used** -> you have to comment/uncomment/modify the code in the "training" part (not necessary for this exercise) <br>
        - **To load datasets** -> if a dataset is not in the right format, you have to adapt the loading part (not necessary for this exercise) <br>
    - Optional arguments (no need to use them in this exercise): <br>
        - `min_rows` : minimal number of lines necessary to handle a class (default 0)
        - `iter_for_keras` : number of model iterations if it is a Keras model (experimental) <br>
        - `level_save` : save level: <br>
            - `HIGH` : everything is saved (model, plots, predictions, etc.) <br>
            - `MEDIUM` : the predictions are not saved
            - `LOW` : we don't save models nor plots (be careful, you can't re-use the model)<br> 
- To get the possible arguments of the script : `python 2_training.py --help`
- Don't forget to activate your virtual environment ...

**Exercise 5** : Validation

~ Manual validation

After having executed the script `2_training.py`, you should see logs similar to this :

<img src="images/model1.png">

With the default TF-IDF, you can see that the model overfits on the train dataset <br>

A new model was created in the folder `{{package_name}}-models`. It contains the saved model, results, statistics and plots.

<img src="images/model1_path.png">

Details:
- `plots/` : Folder containing the plots (here confusion matrices) <br>
- `acc_train@0.xx` : Empty file. The value after @ indicates the accuracy of the model on the train dataset <br>
- `acc_valid@0.xx` : Empty file. The value after @ indicates the accuracy of the model on the validation dataset <br>
- `configurations.json` : **Configurations used by the model**. Mandatory to re-use a model <br>
- `f1_train@0.xx.csv` : Statistics per class on the train dataset. The value after @ indicates the weighted f1-score <br>
- `f1_valid@0.xx.csv` : Statistics per class on the validation dataset. The value after @ indicates the weighted f1-score <br>
- `model_{name_of_the_model}.pkl` : Pickle of the class of the trained model
- `{name_of_the_type_of_model}_standalone.pkl` : Pickle of the model. Standalone version (i.e. no need for the generated package, for example a sklearn model) <br>
- `model_upload_instructions.md` : Instructions to upload a model to use it (needs to be customized)
- `predictions_train.csv` : Prediction on the train dataset. Wrong predictions first <br>
- `predictions_valid.csv` : Prediction on the validation dataset. Wrong predictions first <br>
- `proprietes.json` : Configuration file for uploading the model. Not useful for this tutorial
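Since the scores are encoded in the file names, they can be read back without opening any file. The sketch below is a hypothetical helper, assuming names of the form `acc_train@0.xx` and `f1_valid@0.xx.csv` as listed above. The file names and values in the example are made up.

```python
import re

# Pattern for names such as `acc_valid@0.81` or `f1_valid@0.79.csv`.
METRIC_PATTERN = re.compile(r"(acc|f1)_(train|valid)@(\d+(?:\.\d+)?)(?:\.csv)?")


def parse_metric_file(name):
    """Return (metric, dataset, value) for a metric file name, else None."""
    match = METRIC_PATTERN.fullmatch(name)
    if match is None:
        return None
    metric, dataset, value = match.groups()
    return metric, dataset, float(value)


# Made-up listing of a model folder.
files = ["acc_train@0.98", "acc_valid@0.81", "f1_valid@0.79.csv", "configurations.json"]
for name in files:
    print(name, "->", parse_metric_file(name))
```

This can be handy to compare several model folders at a glance, for example by listing each folder and keeping the validation scores.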

**Exercise 5** : Solution

In [None]:
import nlp
nlp.get_exercice_5_solution()

<span style="color:red">**Exercise 6**</span>

Goal:

- Use the script `2_training.py` to train a multi-label LSTM model to predict all the categories of the dataset

- Training dataset : `dataset_jvc_train.csv`

- Validation dataset : `dataset_jvc_valid.csv`

TODO:
- Generate the fasttext embedding matrix in a .pkl -> cf. `utils/0_get_embedding_dict.py`

- Modify the script `2_training.py` to select the model `ModelEmbeddingLstm`

<img src="images/choix_lstm.jpg">

- *optional* : You can see the model structure directly in the script `{{package_name}}/models_training/model_embedding_lstm.py` (`_get_model()` function)

- Use the script `2_training.py` with the proper arguments

- We want to train on the `description` column

- We want to predict the columns : "Action", "Aventure", "RPG", "Plate-Forme", "FPS", "Course", "Strategie", "Sport", "Reflexion", "Combat"

Help:
- Check that the file `cc.fr.300.vec` is in the folder `{{package_name}}-data`
- If training is too long, we advise that you lower the number of epochs
- The structure of Deep Learning models must be modified directly in the code of the relevant class (`_get_model()` function)
- To get the possible arguments of the script : `python 2_training.py --help`
- Don't forget to activate your virtual environment ...

**Exercise 6** : Validation

~ Manual validation

After having executed the script `2_training.py`, you should see logs similar to this :

<img src="images/model2.png">

Here we can see that our results are pretty good on some categories (Course, FPS) and less so on others (Strategie, Reflexion).

You should obtain an f1-score higher than 0.70. For comparison, a TF-IDF/SVM gives an f1-score of roughly 0.63.

A new folder has been created in your folder `{{package_name}}-models`. It contains the saved model, results, statistics and plots.

**Exercise 6** : Solution

In [None]:
import nlp
nlp.get_exercice_6_solution()

<span style="color:red">**Exercise 7**</span>

Goal:

- Use your model to predict on the test dataset `dataset_jvc_test.csv`

TODO:
- Get the name of your model which is the name of the created folder `model_embedding_lstm_{YYYY_MM_DD-hh_mm_ss}`

- Use the script `3_predict.py` to predict on the test dataset `dataset_jvc_test.csv`

- We want to predict using the `description` column

- *optional* : you can pass the `y_col` argument to evaluate the model's performance on the test dataset: "Action", "Aventure", "RPG", "Plate-Forme", "FPS", "Course", "Strategie", "Sport", "Reflexion", "Combat"

Help:
- To get the possible arguments of the script : `python 3_predict.py --help`
- Don't forget to activate your virtual environment ...

**Exercise 7** : Validation

~ Manual validation

After having executed the script `3_predict.py`, you should see logs similar to this :

<img src="images/predictions.png">

A new folder `predictions/dataset_jvc_test/` has been created in your folder `{{package_name}}-data`. It contains the predictions on the test dataset, plus statistics and plots if `y_col` was given.

**Exercise 7** : Solution

In [None]:
import nlp
nlp.get_exercice_7_solution()

## 3. Use a model to predict on new data

<br>

In this section, we will see how to manually reload a model and how to use it.

To load a model :

In [None]:
from {{package_name}}.models_training import utils_models

#
# TODO: replace the name of the model by the one you trained
#
model, model_conf = utils_models.load_model('model_embedding_lstm_{YYYY_MM_DD-hh_mm_ss}')

Now, feel free to imagine the description of a video game (in French) or take one from a French website (such as www.jeuxvideo.com)...

In [None]:
description = '''The Legend of Pôle emploi 2 : le retour des Data Scientists, est un jeu de tir à la première personne, 
 dans lequel vous vivrez une aventure digne du Seigneur des anneaux !
 Encore meilleur que The Legend of Pôle emploi, une multitude de nouvelles armes à feu viendront enrichir votre arsenal de guerre.'''

... and predict its class!

In [None]:
from IPython.display import clear_output

predictions = model.predict([description])
clear_output()
print(predictions)

🤬 What is that ?!!! A vector ???!!! 🤬

😎 Stay calm ! Do not panic ! 😎 

The `inverse_transform` method is here to save the day !

In [None]:
model.inverse_transform(predictions)[0]
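To get a feel for what such a decoding step does, here is a minimal sketch that maps a multi-label prediction vector back to class names. It assumes one 0/1 (or probability) value per class, in the column order used in this tutorial. The function `decode_multi_label`, the threshold and the example vector are illustrative assumptions, not the package's `inverse_transform` implementation.

```python
# Assumption: class order mirrors the columns listed in this tutorial.
CLASSES = ["Action", "Aventure", "RPG", "Plate-Forme", "FPS",
           "Course", "Strategie", "Sport", "Reflexion", "Combat"]


def decode_multi_label(vector, classes=CLASSES, threshold=0.5):
    """Keep the names of the classes whose score reaches the threshold."""
    return tuple(name for name, score in zip(classes, vector) if score >= threshold)


example_vector = [1, 1, 0, 0, 1, 0, 0, 0, 0, 0]  # made-up prediction
print(decode_multi_label(example_vector))  # ('Action', 'Aventure', 'FPS')
```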

You can now play around with your model 😄

<br>

**Disclaimer : To be perfectly honest, the training dataset is really small -> the performance of your model will probably be poor  😕**