## **Introduction to ML for NLP [Network + Practical]**

### **Linear Support Vector Classifier**

#### **Libraries**

After the fine-tuning phase, we know the best C value for each language.

In this notebook, we evaluate the best models on a held-out test set that they have never seen, in order to estimate their real performance.

In [1]:
# general
import pandas as pd

from utility.models_sklearn import CustomLinearSVC

print("> Libraries Imported")

> Libraries Imported
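The source of `CustomLinearSVC` (from the project's `utility` package) is not shown in this notebook. Judging only from the log messages it prints (TF-IDF computation, a `LinearSVC` model, accuracy and a per-class report), a minimal stand-in built directly on scikit-learn might look like the sketch below. The function name `fit_and_report` and the toy texts are hypothetical, and the real class may work differently.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics import accuracy_score, classification_report
from sklearn.svm import LinearSVC


def fit_and_report(train_texts, train_labels, eval_texts, eval_labels, C=1.0):
    """Fit TF-IDF + LinearSVC on the training texts and score the evaluation texts.

    Hypothetical helper: an assumption about what CustomLinearSVC does internally.
    """
    vectorizer = TfidfVectorizer()
    # Fit the vocabulary and IDF weights on the training split only,
    # so nothing from the evaluation split leaks into the features.
    X_train = vectorizer.fit_transform(train_texts)
    X_eval = vectorizer.transform(eval_texts)

    model = LinearSVC(C=C)  # C is the regularization strength tuned earlier
    model.fit(X_train, train_labels)

    predictions = model.predict(X_eval)
    accuracy = accuracy_score(eval_labels, predictions)
    report = classification_report(
        eval_labels, predictions, digits=4, zero_division=0
    )
    return accuracy, report


# Toy data, invented for illustration only.
train_texts = [
    "commission decision on state aid",
    "decision on state aid granted",
    "implementing regulation on import duties",
    "regulation on import duties for steel",
]
train_labels = [0, 0, 1, 1]
eval_texts = ["decision concerning state aid", "regulation concerning import duties"]
eval_labels = [0, 1]

accuracy, report = fit_and_report(train_texts, train_labels, eval_texts, eval_labels)
print(f"Accuracy Score: {accuracy:.4f}")
print(report)
```

Fitting the vectorizer on the training texts alone is what keeps the later test-set score an honest estimate.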


#### **Setup**

We only need to load the dataset.

In [2]:
dataframe = pd.read_pickle("data/3_multi_eurlex_encoded.pkl")
dataframe.head(3)

| Unnamed: 0 | celex_id | labels | labels_new | text_en | text_de | text_it | text_pl | text_sv | text_en_enc | text_de_enc | text_it_enc | text_pl_enc | text_sv_enc | set |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32010D0395 | 2 | 0 | commission decision of december on state aid c... | beschluss der kommission vom dezember uber die... | decisione della commissione del dicembre conce... | decyzja komisji z dnia grudnia r w sprawie pom... | kommissionens beslut av den december om det st... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | [[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ... | train |
| 1 | 32012R0453 | 2 | 0 | commission implementing regulation eu no of ma... | durchfuhrungsverordnung eu nr der kommission v... | regolamento di esecuzione ue n della commissio... | rozporzadzenie wykonawcze komisji ue nr z dnia... | kommissionens genomforandeforordning eu nr av ... | [[2, 1275, 1276, 29, 100, 4, 743, 1277, 15, 12... | [[1302, 33, 1303, 3, 4, 5, 807, 15, 1304, 3, 6... | [[453, 10, 1422, 38, 14, 3, 4, 5, 990, 1423, 1... | [[1753, 1754, 3, 34, 24, 4, 5, 829, 7, 1755, 9... | [[2, 1239, 33, 23, 4, 5, 806, 7, 774, 4, 132, ... | train |
| 2 | 32012D0043 | 2 | 0 | commission implementing decision of january au... | durchfuhrungsbeschluss der kommission vom janu... | decisione di esecuzione della commissione del ... | decyzja wykonawcza komisji z dnia stycznia r u... | kommissionens genomforandebeslut av den januar... | [[2, 1275, 3, 4, 1310, 1311, 15, 1015, 4, 1312... | [[1344, 3, 4, 5, 1345, 15, 1346, 74, 1347, 134... | [[2, 10, 1422, 3, 4, 5, 1454, 245, 1455, 24, 1... | [[2, 1791, 3, 4, 5, 1792, 7, 1, 1793, 1794, 65... | [[2, 1279, 4, 5, 1280, 7, 1281, 19, 1282, 1283... | train |
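The preview shows a `set` column that tags each row with its split. How `CustomLinearSVC` divides the data is not shown, so the following is only a sketch of one way to do it by hand. It assumes that `set` takes the values `train`, `val` and `test`, that `labels_new` is the target, and that the raw `text_<language>` column is the model input. The helper `split_by_set` and the toy frame are hypothetical.

```python
import pandas as pd


def split_by_set(frame, language, split_column="set"):
    """Return {split_name: (texts, labels)} for one language.

    Assumes a 'text_<language>' column and a 'labels_new' target column.
    """
    text_column = f"text_{language}"
    splits = {}
    for split_name, part in frame.groupby(split_column):
        splits[split_name] = (
            part[text_column].tolist(),
            part["labels_new"].tolist(),
        )
    return splits


# Toy frame with the same column names as the real dataset (values invented).
toy = pd.DataFrame(
    {
        "labels_new": [0, 1, 2, 0],
        "text_en": ["a", "b", "c", "d"],
        "set": ["train", "train", "val", "test"],
    }
)

splits = split_by_set(toy, "en")
print({name: len(texts) for name, (texts, _) in splits.items()})
```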


#### **Test Best Models**

##### *English Model*

In [3]:
LinearSVC_EN = CustomLinearSVC(
    dataset     = dataframe,
    language    = "en",
    C           = 1.0 # only hyperparameter
)

eval_res = LinearSVC_EN.train_model()
test_res = LinearSVC_EN.test_model(on="test")

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Computed TF-IDF for train, val and test set
> Model 'LinearSVC' Instantiated
> Training completed in 0.222 seconds

> Testing the model on 'val set'
  - Accuracy Score: 0.9115 

              precision    recall  f1-score   support

           0     0.9249    0.9086    0.9167       339
           1     0.9049    0.8818    0.8932       313
           2     0.9037    0.9448    0.9238       308

    accuracy                         0.9115       960
   macro avg     0.9112    0.9117    0.9112       960
weighted avg     0.9116    0.9115    0.9113       960


> Testing the model on 'test set'
  - Accuracy Score: 0.8792 

              precision    recall  f1-score   support

           0     0.9078    0.8787    0.8930       437
           1     0.8601    0.8737    0.8668       380
           2     0.8670    0.8851    0.8760       383

    accuracy                         0.8792      1200
   macro
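The `macro avg` and `weighted avg` rows of a report are derived from the per-class rows: the macro average is the plain mean over classes, while the weighted average weights each class by its support. The sketch below recomputes both from the English validation report above. Because the inputs are already rounded to four decimals, the recomputed values can differ from the printed ones in the last digit; the helper names are illustrative.

```python
# (precision, recall, f1-score, support) per class,
# copied from the English validation report.
rows = {
    0: (0.9249, 0.9086, 0.9167, 339),
    1: (0.9049, 0.8818, 0.8932, 313),
    2: (0.9037, 0.9448, 0.9238, 308),
}


def macro_average(rows, index):
    """Unweighted mean of one metric over all classes."""
    values = [row[index] for row in rows.values()]
    return sum(values) / len(values)


def weighted_average(rows, index):
    """Mean of one metric over all classes, weighted by class support."""
    total_support = sum(row[3] for row in rows.values())
    weighted_sum = sum(row[index] * row[3] for row in rows.values())
    return weighted_sum / total_support


for name, index in (("precision", 0), ("recall", 1), ("f1-score", 2)):
    macro = macro_average(rows, index)
    weighted = weighted_average(rows, index)
    print(f"{name:>9}  macro={macro:.4f}  weighted={weighted:.4f}")
```

With three classes of similar size (339, 313 and 308 samples), the two averages stay close to each other, as they do in the report.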

##### *German Model*

In [4]:
LinearSVC_DE = CustomLinearSVC(
    dataset     = dataframe,
    language    = "de",
    C           = 1.0 # only hyperparameter
)

eval_res = LinearSVC_DE.train_model()
test_res = LinearSVC_DE.test_model(on="test")

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Computed TF-IDF for train, val and test set
> Model 'LinearSVC' Instantiated
> Training completed in 0.286 seconds

> Testing the model on 'val set'
  - Accuracy Score: 0.9125 

              precision    recall  f1-score   support

           0     0.9333    0.9086    0.9208       339
           1     0.8885    0.8914    0.8900       313
           2     0.9146    0.9383    0.9263       308

    accuracy                         0.9125       960
   macro avg     0.9121    0.9127    0.9123       960
weighted avg     0.9127    0.9125    0.9125       960


> Testing the model on 'test set'
  - Accuracy Score: 0.8758 

              precision    recall  f1-score   support

           0     0.9117    0.8741    0.8925       437
           1     0.8594    0.8684    0.8639       380
           2     0.8539    0.8851    0.8692       383

    accuracy                         0.8758      1200
   macro

##### *Italian Model*

In [5]:
LinearSVC_IT = CustomLinearSVC(
    dataset     = dataframe,
    language    = "it",
    C           = 1.0 # only hyperparameter
)

eval_res = LinearSVC_IT.train_model()
test_res = LinearSVC_IT.test_model(on="test")

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Computed TF-IDF for train, val and test set
> Model 'LinearSVC' Instantiated
> Training completed in 0.272 seconds

> Testing the model on 'val set'
  - Accuracy Score: 0.9094 

              precision    recall  f1-score   support

           0     0.9271    0.8997    0.9132       339
           1     0.9169    0.8818    0.8990       313
           2     0.8848    0.9481    0.9154       308

    accuracy                         0.9094       960
   macro avg     0.9096    0.9098    0.9092       960
weighted avg     0.9102    0.9094    0.9093       960


> Testing the model on 'test set'
  - Accuracy Score: 0.8775 

              precision    recall  f1-score   support

           0     0.9074    0.8741    0.8904       437
           1     0.8528    0.8842    0.8682       380
           2     0.8701    0.8747    0.8724       383

    accuracy                         0.8775      1200
   macro

##### *Polish Model*

In [6]:
LinearSVC_PL = CustomLinearSVC(
    dataset     = dataframe,
    language    = "pl",
    C           = 1.0 # only hyperparameter
)

eval_res = LinearSVC_PL.train_model()
test_res = LinearSVC_PL.test_model(on="test")

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Computed TF-IDF for train, val and test set
> Model 'LinearSVC' Instantiated
> Training completed in 0.317 seconds

> Testing the model on 'val set'
  - Accuracy Score: 0.9125 

              precision    recall  f1-score   support

           0     0.9388    0.9056    0.9219       339
           1     0.9035    0.8978    0.9006       313
           2     0.8944    0.9351    0.9143       308

    accuracy                         0.9125       960
   macro avg     0.9123    0.9128    0.9123       960
weighted avg     0.9131    0.9125    0.9125       960


> Testing the model on 'test set'
  - Accuracy Score: 0.88 

              precision    recall  f1-score   support

           0     0.9141    0.8764    0.8949       437
           1     0.8668    0.8737    0.8702       380
           2     0.8568    0.8903    0.8732       383

    accuracy                         0.8800      1200
   macro a

##### *Swedish Model*

In [7]:
LinearSVC_SV = CustomLinearSVC(
    dataset     = dataframe,
    language    = "sv",
    C           = 1.0 # only hyperparameter
)

eval_res = LinearSVC_SV.train_model()
test_res = LinearSVC_SV.test_model(on="test")

> Parameters imported
> Dataset correctly divided in training set, validation set and test set
> Computed TF-IDF for train, val and test set
> Model 'LinearSVC' Instantiated
> Training completed in 0.271 seconds

> Testing the model on 'val set'
  - Accuracy Score: 0.9094 

              precision    recall  f1-score   support

           0     0.9281    0.9145    0.9212       339
           1     0.9055    0.8882    0.8968       313
           2     0.8934    0.9253    0.9091       308

    accuracy                         0.9094       960
   macro avg     0.9090    0.9093    0.9090       960
weighted avg     0.9096    0.9094    0.9094       960


> Testing the model on 'test set'
  - Accuracy Score: 0.885 

              precision    recall  f1-score   support

           0     0.9125    0.8833    0.8977       437
           1     0.8590    0.8816    0.8701       380
           2     0.8811    0.8903    0.8857       383

    accuracy                         0.8850      1200
   macro 
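To compare the five models at a glance, the accuracy scores printed above can be collected in one place. The sketch below copies those numbers by hand and computes the gap between validation and test accuracy for each language; the variable names are illustrative.

```python
# (validation accuracy, test accuracy) per language,
# copied from the outputs above.
accuracies = {
    "en": (0.9115, 0.8792),
    "de": (0.9125, 0.8758),
    "it": (0.9094, 0.8775),
    "pl": (0.9125, 0.8800),
    "sv": (0.9094, 0.8850),
}

# Positive gap: the model scores lower on the test set than on validation.
gaps = {lang: val - test for lang, (val, test) in accuracies.items()}
best_test = max(accuracies, key=lambda lang: accuracies[lang][1])

print("lang  val     test    gap")
for lang, (val, test) in accuracies.items():
    print(f"{lang:<4}  {val:.4f}  {test:.4f}  {gaps[lang]:.4f}")
print(f"Highest test accuracy: {best_test}")
```

Every model loses between roughly 2.4 and 3.7 accuracy points when moving from the validation set to the test set, and the Swedish model has the highest test accuracy.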