# Template NLP

**Prerequisites:**

- This notebook must have been generated using Gabarit's NLP template.


- Download the fasttext embedding matrix: https://dl.fbaipublicfiles.com/fasttext/vectors-crawl/cc.fr.300.vec.gz and extract it in `{{package_name}}-data`


- Download the file `dataset_jvc.csv` from https://github.com/OSS-Pole-Emploi/gabarit/tree/main/gabarit/template_nlp/nlp_data and place it inside `{{package_name}}-data`


- **Launch this notebook with a kernel using your project virtual environment**. To create a kernel linked to your virtual environment, activate the environment and run `python -m ipykernel install --user --name=your_venv_name`. The project must be installed in this virtual environment.

---
---
---

## 1. Understand how the NLP template works

**Why use the NLP template?**

The NLP (Natural Language Processing) template automatically generates an NLP project that includes the most common models and makes them easier to industrialize.

The generated project can be used for **classification** tasks on text data. Of course, you have to adapt it to your particular use case. 

**Structure of the generated project**

<div style="font-family: monospace; display: grid; grid-template-columns: 1fr 2fr;">
  <div>.                                </div>  <div style="color: green;"></div>
  <div>.                                </div>  <div style="color: green;"></div>
  <div>├── {{package_name}}             </div>  <div style="color: green;"># The package</div>
  <div>│ ├── models_training            </div>  <div style="color: green;"># Folder containing all the modules related to the models</div>
  <div>│ ├── monitoring                 </div>  <div style="color: green;"># Folder containing all the modules related to the explainers and MLflow</div>
  <div>│ └── preprocessing              </div>  <div style="color: green;"># Folder containing all the modules related to the preprocessing</div>
  <div>├── {{package_name}}-data        </div>  <div style="color: green;"># Folder containing all the data (datasets, embeddings, etc.)</div>
  <div>├── {{package_name}}-exploration </div>  <div style="color: green;"># Folder where all your experiments and explorations must go</div>
  <div>├── {{package_name}}-models      </div>  <div style="color: green;"># Folder containing all the generated models</div>
  <div>├── {{package_name}}-ressources  </div>  <div style="color: green;"># Folder containing some resources such as the instructions to upload a model</div>
  <div>├── {{package_name}}-scripts     </div>  <div style="color: green;"># Folder containing example scripts to preprocess data, train models, predict and use a demonstrator</div>
  <div>│ └── utils                      </div>  <div style="color: green;"># Folder containing utils scripts (such as split train/test, sampling, etc...)</div>
  <div>├── {{package_name}}-tutorials    </div>  <div style="color: green;"># Folder containing notebook tutorials, including this one</div>
  <div>├── tests                        </div>  <div style="color: green;"># Folder containing all the unit tests</div>
  <div>├── .gitignore                   </div>  <div style="color: green;"></div>
  <div>├── .coveragerc                  </div>  <div style="color: green;"></div>
  <div>├── Makefile                     </div>  <div style="color: green;"></div>
  <div>├── nose_setup_coverage.cfg      </div>  <div style="color: green;"></div>
  <div>├── README.md                    </div>  <div style="color: green;"></div>
  <div>├── requirements.txt             </div>  <div style="color: green;"></div>
  <div>├── setup.py                     </div>  <div style="color: green;"></div>
  <div>└── version.txt                  </div>  <div style="color: green;"></div>
</div>

**General principles on the generated packages**

- Data must be saved in the `{{package_name}}-data` folder<br>
<br>
- Trained models will automatically be saved in the `{{package_name}}-models` folder<br>
<br>
- Be aware that all the functions/methods for writing/reading files use these two folders as their base. Thus, when a script has an argument for the path of a file/model, the given path must be **relative** to the `{{package_name}}-data`/`{{package_name}}-models` folders.<br>
<br>
- The scripts provided in `{{package_name}}-scripts` are given as examples. You can use them as accelerators, but their use is not required.<br>
<br>
- The file `preprocess.py` contains the preprocessing pipelines used in this package, stored in a dictionary that is used to create datasets. Be very careful when you modify a pipeline, because models already trained with it will no longer be compatible. It is generally better to create a new pipeline.<br>
<br>
- You can use this package for mono-label and multi-label tasks (`multi_label` argument in the model classes)<br>
<br>
- The modelling part is structured as follows:
    - `ModelClass`: main class taking care of saving data and metrics (among other things)
    - `ModelPipeline`: child class of `ModelClass` managing all models related to a sklearn pipeline
    - `ModelKeras`: child class of `ModelClass` managing all models using Keras
    - `ModelHuggingFace`: child class of `ModelClass` implementing a Hugging Face model (can be overridden for more complex problems)
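
To make the relative-path rule above concrete, here is a minimal sketch. The project root, the package name and the helpers `resolve_data_file` / `resolve_model_dir` are hypothetical and only illustrate the convention; the generated package ships its own utilities, which may differ.

```python
from pathlib import PurePosixPath

# Hypothetical project root and package name, for illustration only.
PROJECT_ROOT = PurePosixPath("/home/user/my_project")
PACKAGE_NAME = "my_package"


def resolve_data_file(relative_path: str) -> PurePosixPath:
    """Resolve a path given relative to the `<package_name>-data` folder."""
    return PROJECT_ROOT / f"{PACKAGE_NAME}-data" / relative_path


def resolve_model_dir(relative_path: str) -> PurePosixPath:
    """Resolve a path given relative to the `<package_name>-models` folder."""
    return PROJECT_ROOT / f"{PACKAGE_NAME}-models" / relative_path


# A script argument such as "dataset_jvc.csv" is looked up inside the data folder.
print(resolve_data_file("dataset_jvc.csv"))
```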
    
---
---
---

## 2. Use the template to train your first model

### Load utility functions

Please run the following cell to load needed utility functions. These functions are only needed in this notebook.

In [None]:
# Import utility functions
import utils_main_tutorial

---

### Video games dataset

For this tutorial, we will use a dataset containing popular video games from the French website jeuxvideo.com.

This dataset contains a description of many games and their type (Action, RPG, etc.). The goal will be to predict the type of the game from its description.

Note that the dataset is in French but it is not necessary to understand French to follow this tutorial.

Note: the main dataset is a CSV file with `;` as separator and `utf-8` as encoding. These are the default values for a generated project. If you generated your project with different options, you must first edit the CSV file accordingly.

---

<span style="color:red">**Exercise 1**</span> : **train / valid / test split**

**Goal:**

- Split the main dataset in train / valid / test sets

**TODO:**
- Use the script `utils/0_split_train_valid_test.py` on the dataset `{{package_name}}-data/dataset_jvc.csv`
- We want a 'random' split but **with a random seed set to 42** (in order to always reproduce the same results)
- We use the default splitting ratios (0.6 / 0.2 / 0.2)

**Help:**
- The file `utils/0_split_train_valid_test.py` splits a dataset into three .csv files:
    - `{filename}_train.csv`: the training dataset
    - `{filename}_valid.csv`: the validation dataset
    - `{filename}_test.csv`: the test dataset
- You can specify the type of split: random, stratified or hierarchical (here, use random)
- Reminder: the path to the file to process is relative to `{{package_name}}-data`
- To get the possible arguments of the script: `python 0_split_train_valid_test.py --help`
- Don't forget to activate your virtual environment ...
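
To illustrate what a seeded random split does (this is not the script itself), here is a minimal standard-library sketch. The function `random_split` and the toy rows are hypothetical; the actual script may rely on other tooling and will not produce the same partition.

```python
import random


def random_split(rows, ratios=(0.6, 0.2, 0.2), seed=42):
    """Shuffle rows with a fixed seed, then cut them into train/valid/test."""
    shuffled = list(rows)
    random.Random(seed).shuffle(shuffled)  # same seed -> same order every run
    n_train = round(len(shuffled) * ratios[0])
    n_valid = round(len(shuffled) * ratios[1])
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]  # the remainder goes to the test set
    return train, valid, test


rows = [f"game_{i}" for i in range(10)]
train, valid, test = random_split(rows)
print(len(train), len(valid), len(test))  # 6 2 2
```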

**Exercise 1**: Validation

~ Run the following cell

In [None]:
import utils_main_tutorial
utils_main_tutorial.test_exercice_1()

**Exercise 1**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_1_solution()

---

<span style="color:red">**Exercise 2**</span> : **random sample**

**Goal:**

- Get a random sample of the file `dataset_jvc_train.csv` (n=10) (we won't use it, this exercise is just here to show what can be done)

**TODO:**
- Use the script `utils/0_create_samples.py` on the dataset `{{package_name}}-data/dataset_jvc.csv`
- We want a sample of 10 lines

**Help:**
- The file `utils/0_create_samples.py` samples a dataset
- To get the possible arguments of the script: `python 0_create_samples.py --help`
- Don't forget to activate your virtual environment ...

**Exercise 2**: Validation

~ Run the following cell

In [None]:
import utils_main_tutorial
utils_main_tutorial.test_exercice_2()

**Exercise 2**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_2_solution()

---

<span style="color:red">**Exercise 3**</span> : **pre-processing**

**Goal:**

- Apply the default preprocessing to `dataset_jvc_train.csv`

**TODO:**
- Use the script `1_preprocess_data.py` on the dataset `{{package_name}}-data/dataset_jvc_train.csv` to apply the default pipeline (`preprocess_P1`)
- The preprocessing must be done on the `description` column

**Help:**
- The file `1_preprocess_data.py` applies a preprocessing pipeline **to one column of one or several .csv files**
- Without the `preprocessing` argument, this Python script creates as many files as there are **pipelines registered in `preprocessing/preprocess.py`**
- It works as follows:<br>
    - In `preprocessing/preprocess.py`: <br>
        - There is a dictionary of functions (`get_preprocessors_dict`): key: str -> function <br>
            - /!\ Don't remove the default element 'no_preprocess': lambda x: x /!\ <br>
        - There are preprocessing functions (usually from words_n_fun pipelines) <br>
    - In `1_preprocess_data.py` :<br>
        - We retrieve the dictionary of functions from `preprocessing/preprocess.py` <br>
        - If a `preprocessing` argument is specified, we keep only the corresponding key from the dictionary <br>
        - Otherwise, we keep all keys (except `no_preprocess`) <br>
        - For each entry of the dictionary, we:<br>
            - Get the associated preprocessing function
            - Load data
            - Create a column `preprocessed_text` -> apply the preprocessing function
            - Save the result -> {file_name}_{key}.csv <br>
- To get the possible arguments of the script: `python 1_preprocess_data.py --help`
- Don't forget to activate your virtual environment ...

**Important:**
- Each preprocessed file is saved in the `{{package_name}}-data` folder.
- To track which preprocessing has been applied, we add a metadata line as the first line of these files (e.g. `#preprocess_P1`).
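
The mechanism described in the help above, including the metadata line, can be sketched as follows. This is a simplified illustration under assumptions: `preprocess_p1` is a toy stand-in for a real pipeline, `preprocess_rows` is a hypothetical helper, and the result is written to an in-memory buffer instead of a `{file_name}_{key}.csv` file.

```python
import csv
import io


def preprocess_p1(text: str) -> str:
    """Toy stand-in for a real pipeline: lowercase and collapse spaces."""
    return " ".join(text.lower().split())


def get_preprocessors_dict() -> dict:
    """Map each pipeline name to its preprocessing function."""
    return {
        "no_preprocess": lambda x: x,  # default entry, must not be removed
        "preprocess_P1": preprocess_p1,
    }


def preprocess_rows(rows, column, key):
    """Return CSV text: a metadata line, then rows with `preprocessed_text`."""
    func = get_preprocessors_dict()[key]
    buffer = io.StringIO()
    buffer.write(f"#{key}\n")  # metadata line recording the pipeline used
    fieldnames = list(rows[0].keys()) + ["preprocessed_text"]
    writer = csv.DictWriter(buffer, fieldnames=fieldnames,
                            delimiter=";", lineterminator="\n")
    writer.writeheader()
    for row in rows:
        writer.writerow({**row, "preprocessed_text": func(row[column])})
    return buffer.getvalue()


rows = [{"description": "Un  JEU de Rôle", "RPG": "1"}]
print(preprocess_rows(rows, column="description", key="preprocess_P1"))
```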

**Exercise 3**: Validation

~ Run the following cell

In [None]:
import utils_main_tutorial
utils_main_tutorial.test_exercice_3()

**Exercise 3**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_3_solution()

---

<span style="color:red">**Exercise 4**</span> : **custom pre-processing**

**Goal:**

- Apply a "custom" preprocess to `dataset_jvc_train.csv` and `dataset_jvc_valid.csv`

**TODO:**
- Add a new preprocessing pipeline `preprocess_P2` in `preprocessing/preprocess.py`

- Pipeline to use :
```python
pipeline = ['remove_non_string', 'get_true_spaces', 'remove_punct', 'to_lower','remove_stopwords', 'trim_string', 'remove_leading_and_ending_spaces']
```
- Use the script `1_preprocess_data.py` to apply the new pipeline `preprocess_P2`
- The preprocessing must be done on the `description` column

**Help:**
- You have to create a new preprocessing in `preprocessing/preprocess.py` and add it to the dictionary of `get_preprocessors_dict()`
- Don't forget to activate your virtual environment ...
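
To show how a list of step names can become a single preprocessing function, here is a minimal sketch. The `STEPS` table and `build_pipeline` are hypothetical, and the three step implementations are simplified stand-ins that only borrow names from the list above; the real steps come from the preprocessing library used by the template and may behave differently.

```python
import string

# Simplified stand-ins for three of the step names; not the real implementations.
STEPS = {
    "to_lower": str.lower,
    "remove_punct": lambda text: text.translate(
        str.maketrans(string.punctuation, " " * len(string.punctuation))
    ),
    "trim_string": lambda text: " ".join(text.split()),
}


def build_pipeline(step_names):
    """Chain the named steps, in order, into a single text -> text function."""
    def run(text: str) -> str:
        for name in step_names:
            text = STEPS[name](text)
        return text
    return run


preprocess_sketch = build_pipeline(["remove_punct", "to_lower", "trim_string"])
print(preprocess_sketch("Un RPG,  épique !"))  # un rpg épique
```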

**Exercise 4**: Validation

~ Run the following cell

In [None]:
import utils_main_tutorial
utils_main_tutorial.test_exercice_4()

**Exercise 4**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_4_solution()

---


<span style="color:red">**Exercise 5**</span> : **Train a model**

**Goal:**

- Use the script `2_training.py` to train a mono-label TF-IDF + SVM model to predict the 'RPG' category

- Training dataset : `dataset_jvc_train_preprocess_P2.csv`

- Validation dataset : `dataset_jvc_valid_preprocess_P2.csv`

**TODO:**
- Use the script `2_training.py` with the proper arguments

- We want to train on the column `preprocessed_text`, result of the preprocessing on the `description` column

- We want to predict the `RPG` column

**Help:**
- The script `2_training.py` trains a model on a dataset
- It works as follows:<br>
    - Read a train .csv file as input <br>
        - If a validation file is given, it will use it as validation data <br>
    - Manage `y_col` argument: <br>
        - If there is only one value, training in mono-label mode <br>
        - If several values, training in multi-label mode <br>
    - **Manual modifications of the script**: <br>
        - **To change the model used** -> you have to comment/uncomment/modify the code in the "training" part (not necessary for this exercise) <br>
        - **To load datasets** -> if a dataset is not in the right format, you have to adapt the loading part (not necessary for this exercise) <br>
- To get the possible arguments of the script: `python 2_training.py --help`
- Don't forget to activate your virtual environment ...
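
The way the number of target columns selects the training mode can be sketched with `argparse`. The flag names `--x_col` and `--y_col` and the helper `is_multi_label` are assumptions made for this example; check `python 2_training.py --help` for the real arguments.

```python
import argparse


def parse_args(argv):
    """Parse a reduced, hypothetical subset of training arguments."""
    parser = argparse.ArgumentParser(description="Training sketch")
    parser.add_argument("--x_col", required=True, help="Input text column")
    parser.add_argument("--y_col", nargs="+", required=True, help="Target column(s)")
    return parser.parse_args(argv)


def is_multi_label(y_col) -> bool:
    """One target column means mono-label, several mean multi-label."""
    return len(y_col) > 1


args = parse_args(["--x_col", "preprocessed_text", "--y_col", "RPG"])
print(is_multi_label(args.y_col))  # False

args = parse_args(["--x_col", "description", "--y_col", "Action", "RPG", "FPS"])
print(is_multi_label(args.y_col))  # True
```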

**Exercise 5**: Validation

**~ Manual validation**

After having executed the script `2_training.py`, you should see logs similar to these ones:

<img src="images/model1.png">

With the default TF-IDF, you can see that the model overfits the training dataset. <br>

A new model was created in the `{{package_name}}-models` folder. It contains the saved model, results, statistics and plots.

<img src="images/model1_path.png">

Details:
- `plots/` : Folder containing the plots (here confusion matrices) <br>
- `acc_train@0.xx` : Empty file. The value after @ indicates the accuracy of the model on the train dataset <br>
- `acc_valid@0.xx` : Empty file. The value after @ indicates the accuracy of the model on the validation dataset <br>
- `configurations.json` : **Configurations used by the model**. Mandatory to re-use a model <br>
- `f1_train@0.xx.csv` : Statistics per class on the train dataset. The value after @ indicates the weighted f1-score <br>
- `f1_valid@0.xx.csv` : Statistics per class on the validation dataset. The value after @ indicates the weighted f1-score <br>
- `model_{name_of_the_model}.pkl` : Saved model in Pickle format. Full model object.
- `{name_of_the_type_of_model}_standalone.pkl` : Saved model in Pickle format. Standalone version (e.g. the sklearn model only). <br>
- `model_upload_instructions.md` : Instructions to upload a model to use it (needs to be customized).
- `predictions_train.csv` : Predictions on the train dataset. Wrong predictions first. <br>
- `predictions_valid.csv` : Predictions on the validation dataset. Wrong predictions first. <br>
- `properties.json` : Property file to be uploaded alongside the model. Not useful for this tutorial.
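
Because the scores are encoded in the file names, they can be collected without opening any file. The sketch below is one hypothetical way to do it; `read_metrics` is not part of the template and the listed values are made up for the example.

```python
import re

# Matches names such as `acc_valid@0.87` or `f1_train@0.91.csv`.
METRIC_PATTERN = re.compile(
    r"^(?P<metric>[a-z0-9]+)_(?P<split>train|valid)@(?P<value>\d+(?:\.\d+)?)"
)


def read_metrics(file_names):
    """Extract {(metric, split): value} from the names of a model folder."""
    metrics = {}
    for name in file_names:
        match = METRIC_PATTERN.match(name)
        if match:  # other files (configurations.json, ...) are ignored
            metrics[(match["metric"], match["split"])] = float(match["value"])
    return metrics


# Hypothetical folder listing with made-up scores.
names = ["acc_train@0.99", "acc_valid@0.87", "f1_valid@0.85.csv", "configurations.json"]
print(read_metrics(names))
```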

**Exercise 5**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_5_solution()

---

<span style="color:red">**Exercise 6**</span> : **Try another classification model**

**Goal:**

- Use the script `2_training.py` to train a multi-label LSTM model to predict all the categories of the dataset

- Training dataset : `dataset_jvc_train.csv`

- Validation dataset : `dataset_jvc_valid.csv`

**TODO:**
- Generate the fasttext embedding matrix as a .pkl file -> cf. `utils/0_get_embedding_dict.py`

- Modify the script `2_training.py` to select the model `ModelEmbeddingLstm`

<img src="images/lstm_choice.jpg">

- *optional*: you can see the model structure directly in the script `{{package_name}}/models_training/model_embedding_lstm.py` (`_get_model()` function)

- Use the script `2_training.py` with the proper arguments

- We want to train on the `description` column

- We want to predict the columns : "Action", "Aventure", "RPG", "Plate-Forme", "FPS", "Course", "Strategie", "Sport", "Reflexion", "Combat"

**Help:**
- Check that the file `cc.fr.300.vec` is in the folder `{{package_name}}-data`
- We advise you to lower the number of epochs, as the training can take a long time
- The structure of a deep learning model must be modified directly in the code of the corresponding class (`_get_model()` function)
- To get the possible arguments of the script: `python 2_training.py --help`
- Don't forget to activate your virtual environment ...
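
For intuition about the embedding conversion, here is a minimal sketch of how a fastText `.vec` text file (a `count dim` header followed by one `word v1 v2 ...` line per word) can be turned into a pickled dictionary. `load_vec` is a hypothetical helper, and the tiny in-memory sample replaces the real `cc.fr.300.vec`; the script `utils/0_get_embedding_dict.py` may proceed differently.

```python
import io
import pickle


def load_vec(handle):
    """Parse fastText text format into a {word: vector} dictionary."""
    _n_words, dim = map(int, handle.readline().split())  # header line
    embeddings = {}
    for line in handle:
        parts = line.rstrip().split(" ")
        word, values = parts[0], parts[1:]
        if len(values) == dim:  # skip malformed lines
            embeddings[word] = [float(v) for v in values]
    return embeddings


# Tiny in-memory stand-in for `cc.fr.300.vec` (2 words, 3 dimensions).
sample = io.StringIO("2 3\njeu 0.1 0.2 0.3\narme -0.4 0.5 0.6\n")
embedding_dict = load_vec(sample)

# Serialise the dictionary, as would be done when writing a .pkl file.
payload = pickle.dumps(embedding_dict)
print(pickle.loads(payload)["jeu"])  # [0.1, 0.2, 0.3]
```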

**Exercise 6**: Validation

**~ Manual validation**

After having executed the script `2_training.py`, you should see logs similar to this :

<img src="images/model2.png">

Here we can see that our results are pretty good on some categories (Course, FPS) and quite poor on others (Strategie, Reflexion).

With enough epochs, you should obtain an F1-score higher than 0.70 on the validation set. For comparison, a TF-IDF/SVM model gives an F1-score of roughly 0.63.

A new folder has been created in your `{{package_name}}-models` folder. It contains the saved model, results, statistics and plots.

**Exercise 6**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_6_solution()

---


<span style="color:red">**Exercise 7**</span> : **Test your model on the test dataset**

**Goal:**

- Use your model to predict on the test dataset `dataset_jvc_test.csv`

**TODO:**
- Get the name of your model which is the name of the created folder `model_embedding_lstm_{YYYY_MM_DD-hh_mm_ss}`

- Use the script `3_predict.py` to predict on the test dataset `dataset_jvc_test.csv`

- We want to predict using the `description` column

- *optional*: the `y_col` argument is not required, but you can use it to evaluate the model's performance on the test dataset: "Action", "Aventure", "RPG", "Plate-Forme", "FPS", "Course", "Strategie", "Sport", "Reflexion", "Combat"

**Help:**
- To get the possible arguments of the script: `python 3_predict.py --help`
- Don't forget to activate your virtual environment ...

**Exercise 7**: Validation

**~ Manual validation**

After having executed the script `3_predict.py`, you should see logs similar to this :

<img src="images/predictions.png">

A new folder `predictions/dataset_jvc_test/` has been created in your `{{package_name}}-data` folder. It contains the predictions on the test dataset, plus statistics and plots if `y_col` was given.

**Exercise 7**: Solution

In [None]:
import utils_main_tutorial
utils_main_tutorial.get_exercice_7_solution()

---
---
---


## 3. Use a model to predict on new data

<br>

In this section, we will see how to manually reload a model and how to use it.

To load a model:

In [None]:
from {{package_name}}.models_training import utils_models

#
# TODO: replace the name of the model by the one you trained
#
model, model_conf = utils_models.load_model('model_embedding_lstm_{YYYY_MM_DD-hh_mm_ss}')

Now, feel free to imagine the description of a video game (in French) or take one from a French website (such as www.jeuxvideo.com)...

In [None]:
description = '''The Legend of Pôle emploi 2 : le retour des Data Scientists, est un jeu de tir à la première personne, 
 dans lequel vous vivrez une aventure digne du Seigneur des anneaux !
 Encore meilleur que The Legend of Pôle emploi, une multitude de nouvelles armes à feu viendront enrichir votre arsenal de guerre.'''

... and predict its class!

In [None]:
from IPython.display import clear_output

predictions = model.predict([description])
clear_output()
print(predictions)

🤬 What is that ?!!! A vector ???!!! 🤬

😎 Stay calm ! Do not panic ! 😎 

The `inverse_transform` method is here to save the day !

In [None]:
model.inverse_transform(predictions)[0]
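
To see what such a decoding step does, assume (this is an assumption; the actual output format depends on the model) that a multi-label prediction is a vector of 0/1 flags, one per target column. The `decode` function and the label order below are hypothetical stand-ins, not the template's implementation.

```python
# Label order assumed for the example; the real order is stored with the model.
LABELS = ["Action", "Aventure", "RPG", "Plate-Forme", "FPS",
          "Course", "Strategie", "Sport", "Reflexion", "Combat"]


def decode(prediction, labels=LABELS):
    """Return the labels whose position holds a 1 in a multi-label prediction."""
    return tuple(label for label, flag in zip(labels, prediction) if flag == 1)


print(decode([1, 0, 0, 0, 1, 0, 0, 0, 0, 0]))  # ('Action', 'FPS')
```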

You can now play around with your model 😄

<br>

**Disclaimer : To be perfectly honest, the training dataset is really small -> the performance of your model will probably be poor  😕**

---
---
---


## 4. BONUS : You can now showcase your best models to the world !

#### Well ... Maybe you should stick to your localhost for the moment ...
<br>

You are now ready to demonstrate how well your models work. We implemented a default ***Streamlit*** app, so let's try it!

You just have to open a command shell in your `{{package_name}}-scripts` folder and run `streamlit run 4_demonstrator.py`.

It will start a Streamlit app on the default port (8501): http://localhost:8501/

<img src="images/demonstrator.png">

Now just have fun showing your best models 😀