## **Introduction to ML for NLP [Network + Practical]**

### **LSTM**

After the fine-tuning phase, we obtained and saved the weights of the best models for each language.

In this notebook, we review their training results and evaluate each model on a held-out test set it has never seen, to estimate its performance on unseen data.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

# pytorch
import torch

# custom imports
from utility.models_pytorch import PytorchModel
from utility.dataviz import plot_model_fit_loss, plot_classes_accuracy

print("> Libraries Imported")

> Libraries Imported


#### **Setup**

- We select the device: *cuda* if a GPU is available, otherwise *cpu*
- We import the dataset
- We set the vocabulary size for each language

In [2]:
# Device configuration
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

print("> Device:", device)

> Device: cuda


In [3]:
dataframe = pd.read_pickle("data/3_multi_eurlex_encoded.pkl")

In [4]:
COUNTS_EN = 3506
COUNTS_DE = 4216
COUNTS_IT = 4180
COUNTS_PL = 5255
COUNTS_SV = 4010

#### **Visualize the training results**

We plot the training and validation loss, as well as the mean validation accuracy for each class.
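
The vertical line drawn in each plot marks the epoch whose weights were kept as the best model, and its value is hard-coded in the cells that follow. As a minimal sketch, such an epoch could be recovered from the `validation_loss` column of the results file. It assumes the best epoch is the one with the lowest validation loss and that epochs are numbered from 1. Both are assumptions, and `best_epoch` is a hypothetical helper, not part of `utility`.

```python
def best_epoch(val_loss, first_epoch=1):
    """Return the epoch number with the lowest validation loss.

    `val_loss` can be any iterable of floats, for example the
    'validation_loss' column of a results DataFrame.
    Assumption: row i of the results corresponds to epoch i + first_epoch.
    """
    losses = list(val_loss)
    if not losses:
        raise ValueError("val_loss is empty")
    # index of the smallest loss; ties resolve to the earliest epoch
    best_index = min(range(len(losses)), key=losses.__getitem__)
    return best_index + first_epoch


# toy values for illustration only, not taken from the real training runs
toy_val_loss = [0.92, 0.71, 0.64, 0.66, 0.70]
print(best_epoch(toy_val_loss))
```

If the saved model was instead selected on validation accuracy, the same idea applies with `max` in place of `min`.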

##### *English Model*

In [5]:
# Set all paths of the best model

model_weights_en_path = "models/LSTM_fixed_en/LSTM_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_best.model"

global_res_en_path = "models/LSTM_fixed_en/LSTM_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_global_results.csv"

classes_res_en_path = "models/LSTM_fixed_en/LSTM_fixed[en][batch_size=64][epochs=50][vocab_size=3506][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_classes_results.csv"
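
The three paths share one long prefix that encodes the hyperparameters, and the same pattern repeats for every language below. A small helper could build them from the hyperparameters instead of repeating the string by hand. This is only a sketch: `build_result_paths` and its parameter names are hypothetical, and the naming pattern is inferred from the paths used in this notebook.

```python
def build_result_paths(language, vocab_size, hidden_dim,
                       model_type="LSTM_fixed", batch_size=64, epochs=50,
                       emb_dim=1024, lr=0.001, dropout=0.0, root="models"):
    """Build the weights and results paths for one trained model.

    The file naming pattern is inferred from the paths in this notebook.
    """
    description = (
        f"{model_type}[{language}][batch_size={batch_size}][epochs={epochs}]"
        f"[vocab_size={vocab_size}][emb_dim={emb_dim}][hidden_dim={hidden_dim}]"
        f"[lr={lr}][dropout={dropout}]"
    )
    prefix = f"{root}/{model_type}_{language}/{description}"
    return {
        "weights": prefix + "_best.model",
        "global_results": prefix + "_global_results.csv",
        "classes_results": prefix + "_classes_results.csv",
    }


# same hyperparameters as the English model
paths_en = build_result_paths("en", vocab_size=3506, hidden_dim=2048)
print(paths_en["weights"])
```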

In [6]:
# Instantiate the LSTM with the same parameters

BEST_LSTM_EN = PytorchModel(

    # set model and text language
    model_type      = "LSTM_fixed",
    dataset         = dataframe,
    language        = "en",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set hyperparameters
    vocab_size      = COUNTS_EN,
    embedding_dim   = 1024,
    hidden_dim      = 2048,
    learning_rate   = 0.001,
    dropout_p       = 0.0
)

# Load its weights (map_location lets the checkpoint load on a CPU-only machine too)
BEST_LSTM_EN.MODEL.load_state_dict(torch.load(model_weights_en_path, map_location=device))

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Model 'LSTM_fixed' instantiated
> Initialization required 1.5288 seconds


<All keys matched successfully>

In [7]:
global_res_en = pd.read_csv(global_res_en_path)

plot_model_fit_loss(
    train_loss=global_res_en['training_loss'],
    val_loss=global_res_en['validation_loss'],
    vertical_line=31, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_EN.MODEL_DESCRIPTION
)

In [8]:
classes_res_en = pd.read_csv(classes_res_en_path)

# label each row with its class id: the file holds 3 class rows for each of the 50 epochs
classes_res_en["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_en, 
    vertical_line=31, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_EN.MODEL_DESCRIPTION
    )

In [9]:
# finally, we test it
test_loss, test_acc, classes_dict = BEST_LSTM_EN.test_model()

> Test Loss:     0.5561
> Test Accuracy: 0.8217

> Classes Accuracy
   * Class 0	 0.8307 [363 out of 437]
   * Class 1	 0.8474 [322 out of 380]
   * Class 2	 0.7859 [301 out of 383]
   * Mean        0.8213
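
The test accuracy (0.8217) and the mean of the class accuracies (0.8213) differ slightly because the classes are not the same size. The first weights every sample equally, the second weights every class equally. A short check reproduces both figures from the counts printed above. It is consistent with the output, though how `test_model` computes them internally is an assumption.

```python
# (correct, total) per class, copied from the English test output above
class_counts = {0: (363, 437), 1: (322, 380), 2: (301, 383)}

per_class_acc = {
    label: correct / total for label, (correct, total) in class_counts.items()
}

# overall accuracy: every test sample counts equally
total_correct = sum(correct for correct, _ in class_counts.values())
total_samples = sum(total for _, total in class_counts.values())
overall_acc = total_correct / total_samples

# mean of class accuracies: every class counts equally
mean_class_acc = sum(per_class_acc.values()) / len(per_class_acc)

print(round(overall_acc, 4))     # 0.8217
print(round(mean_class_acc, 4))  # 0.8213
```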


##### *German Model*

In [10]:
# Set all paths of the best model

model_weights_de_path = "models/LSTM_fixed_de/LSTM_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_best.model"

global_res_de_path = "models/LSTM_fixed_de/LSTM_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_global_results.csv"

classes_res_de_path = "models/LSTM_fixed_de/LSTM_fixed[de][batch_size=64][epochs=50][vocab_size=4216][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_classes_results.csv"

In [11]:
# Instantiate the LSTM with the same parameters

BEST_LSTM_DE = PytorchModel(

    # set model and text language
    model_type      = "LSTM_fixed",
    dataset         = dataframe,
    language        = "de",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set hyperparameters
    vocab_size      = COUNTS_DE,
    embedding_dim   = 1024,
    hidden_dim      = 1024,
    learning_rate   = 0.001,
    dropout_p       = 0.0
)

# Load its weights (map_location lets the checkpoint load on a CPU-only machine too)
BEST_LSTM_DE.MODEL.load_state_dict(torch.load(model_weights_de_path, map_location=device))

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Model 'LSTM_fixed' instantiated
> Initialization required 0.074 seconds


<All keys matched successfully>

In [12]:
global_res_de = pd.read_csv(global_res_de_path)

plot_model_fit_loss(
    train_loss=global_res_de['training_loss'],
    val_loss=global_res_de['validation_loss'],
    vertical_line=42, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_DE.MODEL_DESCRIPTION
)

In [13]:
classes_res_de = pd.read_csv(classes_res_de_path)

# label each row with its class id: the file holds 3 class rows for each of the 50 epochs
classes_res_de["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_de, 
    vertical_line=42, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_DE.MODEL_DESCRIPTION
    )

In [14]:
# finally, we test it
test_loss, test_acc, classes_dict = BEST_LSTM_DE.test_model()

> Test Loss:     0.9223
> Test Accuracy: 0.7575

> Classes Accuracy
   * Class 0	 0.7025 [307 out of 437]
   * Class 1	 0.7632 [290 out of 380]
   * Class 2	 0.8146 [312 out of 383]
   * Mean        0.7601


##### *Italian Model*

In [15]:
# Set all paths of the best model

model_weights_it_path = "models/LSTM_fixed_it/LSTM_fixed[it][batch_size=64][epochs=50][vocab_size=4180][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_best.model"

global_res_it_path = "models/LSTM_fixed_it/LSTM_fixed[it][batch_size=64][epochs=50][vocab_size=4180][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_global_results.csv"

classes_res_it_path = "models/LSTM_fixed_it/LSTM_fixed[it][batch_size=64][epochs=50][vocab_size=4180][emb_dim=1024][hidden_dim=2048][lr=0.001][dropout=0.0]_classes_results.csv"

In [16]:
# Instantiate the LSTM with the same parameters

BEST_LSTM_IT = PytorchModel(

    # set model and text language
    model_type      = "LSTM_fixed",
    dataset         = dataframe,
    language        = "it",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set hyperparameters
    vocab_size      = COUNTS_IT,
    embedding_dim   = 1024,
    hidden_dim      = 2048,
    learning_rate   = 0.001,
    dropout_p       = 0.0
)

# Load its weights (map_location lets the checkpoint load on a CPU-only machine too)
BEST_LSTM_IT.MODEL.load_state_dict(torch.load(model_weights_it_path, map_location=device))

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Model 'LSTM_fixed' instantiated
> Initialization required 0.152 seconds


<All keys matched successfully>

In [17]:
global_res_it = pd.read_csv(global_res_it_path)

plot_model_fit_loss(
    train_loss=global_res_it['training_loss'],
    val_loss=global_res_it['validation_loss'],
    vertical_line=34, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_IT.MODEL_DESCRIPTION
)

In [18]:
classes_res_it = pd.read_csv(classes_res_it_path)

# label each row with its class id: the file holds 3 class rows for each of the 50 epochs
classes_res_it["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_it, 
    vertical_line=34, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_IT.MODEL_DESCRIPTION
    )

In [19]:
test_loss, test_acc, classes_dict = BEST_LSTM_IT.test_model()

> Test Loss:     0.5415
> Test Accuracy: 0.8133

> Classes Accuracy
   * Class 0	 0.8032 [351 out of 437]
   * Class 1	 0.8263 [314 out of 380]
   * Class 2	 0.812 [311 out of 383]
   * Mean        0.8138


##### *Polish Model*

In [20]:
# Set all paths of the best model

model_weights_pl_path = "models/LSTM_fixed_pl/LSTM_fixed[pl][batch_size=64][epochs=50][vocab_size=5255][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_best.model"

global_res_pl_path = "models/LSTM_fixed_pl/LSTM_fixed[pl][batch_size=64][epochs=50][vocab_size=5255][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_global_results.csv"

classes_res_pl_path = "models/LSTM_fixed_pl/LSTM_fixed[pl][batch_size=64][epochs=50][vocab_size=5255][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_classes_results.csv"

In [21]:
# Instantiate the LSTM with the same parameters

BEST_LSTM_PL = PytorchModel(

    # set model and text language
    model_type      = "LSTM_fixed",
    dataset         = dataframe,
    language        = "pl",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set hyperparameters
    vocab_size      = COUNTS_PL,
    embedding_dim   = 1024,
    hidden_dim      = 1024,
    learning_rate   = 0.001,
    dropout_p       = 0.0
)

# Load its weights (map_location lets the checkpoint load on a CPU-only machine too)
BEST_LSTM_PL.MODEL.load_state_dict(torch.load(model_weights_pl_path, map_location=device))

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Model 'LSTM_fixed' instantiated
> Initialization required 0.079 seconds


<All keys matched successfully>

In [22]:
global_res_pl = pd.read_csv(global_res_pl_path)

plot_model_fit_loss(
    train_loss=global_res_pl['training_loss'],
    val_loss=global_res_pl['validation_loss'],
    vertical_line=13, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_PL.MODEL_DESCRIPTION
)

In [23]:
classes_res_pl = pd.read_csv(classes_res_pl_path)

# label each row with its class id: the file holds 3 class rows for each of the 50 epochs
classes_res_pl["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_pl, 
    vertical_line=13, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_PL.MODEL_DESCRIPTION
    )

In [24]:
test_loss, test_acc, classes_dict = BEST_LSTM_PL.test_model()

> Test Loss:     0.6351
> Test Accuracy: 0.8075

> Classes Accuracy
   * Class 0	 0.7849 [343 out of 437]
   * Class 1	 0.8237 [313 out of 380]
   * Class 2	 0.8172 [313 out of 383]
   * Mean        0.8086


##### *Swedish Model*

In [25]:
# Set all paths of the best model

model_weights_sv_path = "models/LSTM_fixed_sv/LSTM_fixed[sv][batch_size=64][epochs=50][vocab_size=4010][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_best.model"

global_res_sv_path = "models/LSTM_fixed_sv/LSTM_fixed[sv][batch_size=64][epochs=50][vocab_size=4010][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_global_results.csv"

classes_res_sv_path = "models/LSTM_fixed_sv/LSTM_fixed[sv][batch_size=64][epochs=50][vocab_size=4010][emb_dim=1024][hidden_dim=1024][lr=0.001][dropout=0.0]_classes_results.csv"

In [26]:
# Instantiate the LSTM with the same parameters

BEST_LSTM_SV = PytorchModel(

    # set model and text language
    model_type      = "LSTM_fixed",
    dataset         = dataframe,
    language        = "sv",

    # set device, batch size and epochs
    device          = device,
    batch_size      = 64,
    epochs          = 50,

    # set hyperparameters
    vocab_size      = COUNTS_SV,
    embedding_dim   = 1024,
    hidden_dim      = 1024,
    learning_rate   = 0.001,
    dropout_p       = 0.0
)

# Load its weights (map_location lets the checkpoint load on a CPU-only machine too)
BEST_LSTM_SV.MODEL.load_state_dict(torch.load(model_weights_sv_path, map_location=device))

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Created Pytorch datasets and dataloaders
> Model 'LSTM_fixed' instantiated
> Initialization required 0.072 seconds


<All keys matched successfully>

In [27]:
global_res_sv = pd.read_csv(global_res_sv_path)

plot_model_fit_loss(
    train_loss=global_res_sv['training_loss'],
    val_loss=global_res_sv['validation_loss'],
    vertical_line=48, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_SV.MODEL_DESCRIPTION
)

In [28]:
classes_res_sv = pd.read_csv(classes_res_sv_path)

# label each row with its class id: the file holds 3 class rows for each of the 50 epochs
classes_res_sv["index"] = [0, 1, 2] * 50

plot_classes_accuracy(
    classes_res_sv, 
    vertical_line=48, # best epoch chosen
    subtitle="Models Details: " + BEST_LSTM_SV.MODEL_DESCRIPTION
    )

In [29]:
test_loss, test_acc, classes_dict = BEST_LSTM_SV.test_model()

> Test Loss:     0.9211
> Test Accuracy: 0.8192

> Classes Accuracy
   * Class 0	 0.8124 [355 out of 437]
   * Class 1	 0.85 [323 out of 380]
   * Class 2	 0.7963 [305 out of 383]
   * Mean        0.8196
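
To compare the five models at a glance, the test figures printed in this notebook can be collected in one place and ranked. The sketch below copies the reported values by hand. A fuller version might instead gather the values returned by each `test_model()` call as they are produced.

```python
# (test loss, test accuracy) as printed by test_model() in this notebook
test_results = {
    "en": (0.5561, 0.8217),
    "de": (0.9223, 0.7575),
    "it": (0.5415, 0.8133),
    "pl": (0.6351, 0.8075),
    "sv": (0.9211, 0.8192),
}

# rank languages from highest to lowest test accuracy
ranked = sorted(test_results.items(), key=lambda item: item[1][1], reverse=True)

print(f"{'lang':<6}{'loss':>8}{'accuracy':>10}")
for language, (loss, accuracy) in ranked:
    print(f"{language:<6}{loss:>8.4f}{accuracy:>10.4f}")
```

Ranked by accuracy, the order is English, Swedish, Italian, Polish, German. The Swedish model ranks second even though its test loss (0.9211) is close to the German model's (0.9223), so loss and accuracy do not order the models identically.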
