## **Introduction to ML for NLP [Network + Practical]**

### **CNN**

After the fine-tuning phase, we saved the weights of the best model for each language.

In this notebook, we review their training results and evaluate them on a held-out test set that they never saw during training, in order to measure their real performance.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

# pytorch
import torch

# custom imports
from utility.models_pytorch import PytorchModel
from utility.dataviz import plot_model_fit_loss, plot_classes_accuracy

print("> Libraries Imported")

> Libraries Imported


#### **Setup**

- We set the device to *cuda* if a GPU is available, otherwise to *cpu*
- We import the dataset

In [2]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print("> Device:", device)

> Device: cuda


In [3]:
dataframe = pd.read_pickle("data/3_multi_eurlex_encoded.pkl")
dataframe.head(3)

| | celex_id | labels | labels_new | text_en | text_de | text_it | text_pl | text_sv | text_en_enc | text_de_enc | text_it_enc | text_pl_enc | text_sv_enc | set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32010D0395 | 2 | 0 | commission decision of december on state aid c... | beschluss der kommission vom dezember uber die... | decisione della commissione del dicembre conce... | decyzja komisji z dnia grudnia r w sprawie pom... | kommissionens beslut av den december om det st... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | train |
| 1 | 32012R0453 | 2 | 0 | commission implementing regulation eu no of ma... | durchfuhrungsverordnung eu nr der kommission v... | regolamento di esecuzione ue n della commissio... | rozporzadzenie wykonawcze komisji ue nr z dnia... | kommissionens genomforandeforordning eu nr av ... | [[2, 1275, 1276, 29, 100, 4, 743, 1277, 15, 12... | [[1302, 33, 1303, 3, 4, 5, 807, 15, 1304, 3, 6... | [[453, 10, 1422, 38, 14, 3, 4, 5, 990, 1423, 1... | [[1753, 1754, 3, 34, 24, 4, 5, 829, 7, 1755, 9... | [[2, 1239, 33, 23, 4, 5, 806, 7, 774, 4, 132, ... | train |
| 2 | 32012D0043 | 2 | 0 | commission implementing decision of january au... | durchfuhrungsbeschluss der kommission vom janu... | decisione di esecuzione della commissione del ... | decyzja wykonawcza komisji z dnia stycznia r u... | kommissionens genomforandebeslut av den januar... | [[2, 1275, 3, 4, 1310, 1311, 15, 1015, 4, 1312... | [[1344, 3, 4, 5, 1345, 15, 1346, 74, 1347, 134... | [[2, 10, 1422, 3, 4, 5, 1454, 245, 1455, 24, 1... | [[2, 1791, 3, 4, 5, 1792, 7, 1, 1793, 1794, 65... | [[2, 1279, 4, 5, 1280, 7, 1281, 19, 1282, 1283... | train |


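#### **CNN**

**Instantiate a Pytorch Model**

We use our custom class `PytorchModel` to instantiate a CNN for each language. The constants below are the vocabulary sizes of each language, which the models receive as `vocab_size`.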

In [4]:
COUNTS_EN = 3506
COUNTS_DE = 4216
COUNTS_IT = 4180
COUNTS_PL = 5255
COUNTS_SV = 4010
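
The vocabulary sizes matter because the encoded texts are lists of integer token ids. Assuming the models feed these ids into an embedding table with `vocab_size` rows (as with `torch.nn.Embedding`, where every index must be smaller than the number of embeddings), a quick sanity check could look like the sketch below. The helper names and the toy input are hypothetical and not part of the `utility` package.

```python
def max_token_id(encoded_text):
    """Return the largest token id in a nested list such as [[2, 3], [4, 5]]."""
    return max(token for chunk in encoded_text for token in chunk)


def fits_vocabulary(encoded_text, vocab_size):
    """True if every token id is a valid row index of a vocab_size-row table."""
    return max_token_id(encoded_text) < vocab_size


# toy input shaped like one cell of the text_en_enc column (made-up ids)
toy_encoded = [[2, 3, 4], [5, 1275]]

print(fits_vocabulary(toy_encoded, 3506))  # True
print(fits_vocabulary(toy_encoded, 1000))  # False
```

In the notebook, the same check could be applied to each `text_*_enc` column with the matching `COUNTS_*` constant.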

#### **Visualize the training results**

We plot the training and validation loss, as well as the mean validation accuracy for each class.

##### *English Model*

In [5]:
# Set all paths of the best model

model_weights_en_path = "models/CNN_fixed_en/CNN_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][kernel_size=5][stride=1][padding=3][lr=0.001][dropout=0.1]_best.model"

global_res_en_path = "models/CNN_fixed_en/CNN_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][kernel_size=5][stride=1][padding=3][lr=0.001][dropout=0.1]_global_results.csv"

classes_res_en_path = "models/CNN_fixed_en/CNN_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][kernel_size=5][stride=1][padding=3][lr=0.001][dropout=0.1]_classes_results.csv"
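
The three paths above share one stem that encodes the model type, the language and the hyperparameters. As a minimal sketch, the same strings can be assembled from a dictionary instead of being typed by hand. `build_run_paths` is a hypothetical helper introduced here for illustration; only the naming pattern itself comes from the paths above.

```python
def build_run_paths(model_type, language, hparams, root="models"):
    """Return the weights, global-results and per-class-results paths of a run."""
    # each hyperparameter becomes a "[name=value]" tag, in insertion order
    tags = "".join(f"[{name}={value}]" for name, value in hparams.items())
    stem = f"{root}/{model_type}_{language}/{model_type}[{language}]{tags}"
    return {
        "weights": stem + "_best.model",
        "global_results": stem + "_global_results.csv",
        "classes_results": stem + "_classes_results.csv",
    }


# hyperparameters of the best English run, in the order used by the file names
hparams_en = {
    "batch_size": 64, "epochs": 50, "vocab_size": 3506, "emb_dim": 1024,
    "kernel_size": 5, "stride": 1, "padding": 3, "lr": 0.001, "dropout": 0.1,
}

paths_en = build_run_paths("CNN_fixed", "en", hparams_en)
print(paths_en["weights"])
```

Building the paths this way keeps the file names and the arguments passed to `PytorchModel` in sync, since both can be read from the same dictionary.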

In [6]:
# Instantiate the CNN with the same parameters

BEST_CNN_EN = PytorchModel(

    # set model and text language
    model_type      = "CNN_fixed",
    dataset         = dataframe,
    language        = "en",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set general hyperparameters
    learning_rate   = 0.001,

    # set specific hyperparameters
    vocab_size      = COUNTS_EN,
    embedding_dim   = 1024,
    out_channels    = 1,
    kernel_size     = 5,
    stride          = 1,
    padding         = 3,
    dropout_p       = 0.1,
)

# Load its weights (map_location lets weights saved on a GPU load on a CPU-only machine)
BEST_CNN_EN.MODEL.load_state_dict(torch.load(model_weights_en_path, map_location=device))

> Parameters imported for CNN_fixed
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Initialization required 0.131 seconds


<All keys matched successfully>
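
The `kernel_size`, `stride` and `padding` arguments determine how the sequence length changes through the convolution. The internals of `CNN_fixed` are not shown in this notebook, so the sketch below assumes a standard PyTorch convolution along the token axis and uses the output-length formula documented for `torch.nn.Conv1d`. The function name and the input length of 100 are illustrative only.

```python
def conv_output_length(length_in, kernel_size, stride=1, padding=0, dilation=1):
    """Output length of a 1-D convolution, following the torch.nn.Conv1d formula."""
    return (length_in + 2 * padding - dilation * (kernel_size - 1) - 1) // stride + 1


# English setting: kernel_size=5, stride=1, padding=3
print(conv_output_length(100, kernel_size=5, stride=1, padding=3))  # 102

# German setting: kernel_size=5, stride=1, padding=1
print(conv_output_length(100, kernel_size=5, stride=1, padding=1))  # 98
```

Under this assumption, the English configuration lengthens the sequence by two positions, while the German one shortens it by two.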

In [7]:
global_res_en = pd.read_csv(global_res_en_path)

plot_model_fit_loss(
    train_loss=global_res_en['training_loss'],
    val_loss=global_res_en['validation_loss'],
    vertical_line=46, # best epoch chosen
    subtitle="Details: " + BEST_CNN_EN.MODEL_DESCRIPTION
)
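
The `vertical_line` value is hard-coded to the epoch that was chosen as best. The notebook does not state which criterion was used, so the following is only a sketch of one common choice: the epoch with the lowest validation loss. The function name and the toy values are made up, and the returned index is 0-based, which may differ from the numbering used on the plot.

```python
def best_epoch(validation_loss):
    """Return the 0-based index of the epoch with the lowest validation loss."""
    return min(range(len(validation_loss)), key=lambda epoch: validation_loss[epoch])


toy_validation_loss = [0.95, 0.81, 0.74, 0.78, 0.80]  # made-up values

print(best_epoch(toy_validation_loss))  # 2
```

With the real results, the same function could be called on `list(global_res_en['validation_loss'])`.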

In [8]:
classes_res_en = pd.read_csv(classes_res_en_path)

# label each row with its class id (3 classes x 50 epochs);
# the former set_index("index") call discarded its result, so it was a no-op
classes_res_en["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_en,
    vertical_line=46, # best epoch chosen
    subtitle="Details: " + BEST_CNN_EN.MODEL_DESCRIPTION
)
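
The `[0, 1, 2] * 50` labelling only works if the per-class results file stores one row per class for each epoch, with the three classes written in the same order every time. Under that assumption (the file layout itself is not shown here), row `i` belongs to epoch `i // 3` and class `i % 3`. A small sketch with hypothetical names:

```python
N_CLASSES = 3
N_EPOCHS = 50

# same labelling as in the cell above
class_labels = [0, 1, 2] * N_EPOCHS


def row_to_epoch_and_class(row, n_classes=N_CLASSES):
    """Map a row number to (0-based epoch, class id), assuming epoch-major order."""
    return divmod(row, n_classes)


print(len(class_labels))            # 150
print(row_to_epoch_and_class(0))    # (0, 0)
print(row_to_epoch_and_class(4))    # (1, 1)
print(row_to_epoch_and_class(149))  # (49, 2)
```

If the number of epochs or classes changes, the hard-coded `50` and `[0, 1, 2]` must change with it, which is why named constants are used in the sketch.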

In [9]:
# finally, we test it
test_loss, test_acc, classes_dict = BEST_CNN_EN.test_model()

> Test Loss:     0.7085
> Test Accuracy: 0.8425

> Classes Accuracy
   * Class 0	 0.8627 [377 out of 437]
   * Class 1	 0.8237 [313 out of 380]
   * Class 2	 0.8381 [321 out of 383]
   * Mean        0.8415
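
The test accuracy (0.8425) and the mean of the class accuracies (0.8415) differ slightly because the classes have different sizes: the first weights every document equally, the second weights every class equally. The short check below recomputes both from the counts printed above. How `test_model` computes them internally is not shown, so this is only an arithmetic cross-check.

```python
# correct predictions and class sizes, copied from the English test output
correct = {0: 377, 1: 313, 2: 321}
total = {0: 437, 1: 380, 2: 383}

# overall accuracy: all correct predictions over all test documents
overall_accuracy = sum(correct.values()) / sum(total.values())

# mean class accuracy: unweighted average of the three per-class accuracies
class_accuracy = {c: correct[c] / total[c] for c in correct}
mean_class_accuracy = sum(class_accuracy.values()) / len(class_accuracy)

print(f"{overall_accuracy:.4f}")     # 0.8425
print(f"{mean_class_accuracy:.4f}")  # 0.8415
```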


##### *German Model*

In [10]:
# Set all paths of the best model

model_weights_de_path = "models/CNN_fixed_de/CNN_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=2048][kernel_size=5][stride=1][padding=1][lr=0.001][dropout=0.1]_best.model"

global_res_de_path = "models/CNN_fixed_de/CNN_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=2048][kernel_size=5][stride=1][padding=1][lr=0.001][dropout=0.1]_global_results.csv"

classes_res_de_path = "models/CNN_fixed_de/CNN_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=2048][kernel_size=5][stride=1][padding=1][lr=0.001][dropout=0.1]_classes_results.csv"

In [11]:
# Instantiate the CNN with the same parameters

BEST_CNN_DE = PytorchModel(

    # set model and text language
    model_type      = "CNN_fixed",
    dataset         = dataframe,
    language        = "de",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set general hyperparameters
    learning_rate   = 0.001,

    # set specific hyperparameters
    vocab_size      = COUNTS_DE,
    embedding_dim   = 2048,
    out_channels    = 1,
    kernel_size     = 5,
    stride          = 1,
    padding         = 1,
    dropout_p       = 0.1,
)

# Load its weights (map_location lets weights saved on a GPU load on a CPU-only machine)
BEST_CNN_DE.MODEL.load_state_dict(torch.load(model_weights_de_path, map_location=device))

> Parameters imported for CNN_fixed
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Initialization required 0.056 seconds


<All keys matched successfully>

In [12]:
global_res_de = pd.read_csv(global_res_de_path)

plot_model_fit_loss(
    train_loss=global_res_de['training_loss'],
    val_loss=global_res_de['validation_loss'],
    vertical_line=31, # best epoch chosen
    subtitle="Details: " + BEST_CNN_DE.MODEL_DESCRIPTION
)

In [15]:
classes_res_de = pd.read_csv(classes_res_de_path)

# label each row with its class id (3 classes x 50 epochs);
# the former set_index("index") call discarded its result, so it was a no-op
classes_res_de["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_de,
    vertical_line=31, # best epoch chosen
    subtitle="Details: " + BEST_CNN_DE.MODEL_DESCRIPTION
)

In [14]:
# finally, we test it
test_loss, test_acc, classes_dict = BEST_CNN_DE.test_model()

> Test Loss:     0.7286
> Test Accuracy: 0.8208

> Classes Accuracy
   * Class 0	 0.8146 [356 out of 437]
   * Class 1	 0.8368 [318 out of 380]
   * Class 2	 0.812 [311 out of 383]
   * Mean        0.8211


---

In [None]:
# first, we iterate over the models' .txt files and print them, so that we can choose the best model

import os

directory = "models/CNN_fixed_de"

for filename in os.listdir(directory):
    # skip everything that is not a .txt file
    if filename.endswith(".txt"):
        with open(os.path.join(directory, filename)) as f:
            for line in f:
                print(line)