## **Introduction to ML for NLP [Network + Practical]**

### **Downloading the Dataset**

Full details on the chosen dataset (`multi_eurlex`) are available [here](https://huggingface.co/datasets/multi_eurlex), on the official HuggingFace website.

#### **Libraries**

We import the necessary libraries for the notebook.

In [11]:
# general
import pandas as pd
from tqdm import tqdm

# library with multi_eurlex data
from datasets import load_dataset

print("> Libraries Imported")

> Libraries Imported


#### **Import the dataset**

The dataset covers 23 official EU languages, but we only use 5 of them:

| **_Language_** | **_Total EU Speakers (%)_** |
|:--------------:|:---------------------------:|
|        English | 51                          |
|         German | 32                          |
|        Italian | 16                          |
|         Polish | 9                           |
|        Swedish | 3                           |

*Source: [HuggingFace](https://huggingface.co/datasets/multi_eurlex)*

**English**

Let's start with English.

In [2]:
dataset_en = load_dataset('multi_eurlex', 'en')

Downloading and preparing dataset multi_eurlex/en (download: 2.58 GiB, generated: 467.05 MiB, post-processed: Unknown size, total: 3.04 GiB) to C:\Users\Daniele\.cache\huggingface\datasets\multi_eurlex\en\1.0.0\8ec8b79877a517369a143ead6679d1788d13e51cf641ed29772f4449e8364fb6...


Downloading data: 100%|██████████| 2.77G/2.77G [04:48<00:00, 9.60MB/s] 
                                                                                         

Dataset multi_eurlex downloaded and prepared to C:\Users\Daniele\.cache\huggingface\datasets\multi_eurlex\en\1.0.0\8ec8b79877a517369a143ead6679d1788d13e51cf641ed29772f4449e8364fb6. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 88.16it/s]


We will convert it to a pandas dataframe, which is easier to explore and work with. First, we look at its structure.

In [10]:
# explore the structure
dataset_en

DatasetDict({
    train: Dataset({
        features: ['celex_id', 'text', 'labels'],
        num_rows: 55000
    })
    test: Dataset({
        features: ['celex_id', 'text', 'labels'],
        num_rows: 5000
    })
    validation: Dataset({
        features: ['celex_id', 'text', 'labels'],
        num_rows: 5000
    })
})

We create a custom function that stacks the three splits (train, test and validation) into a single pandas dataframe.

In [21]:
def from_hf_dataset_to_dataframe(dataset, language):

    # list of available keys
    keys = ["train","test","validation"]

    # create placeholder
    celex_id = []
    text = []
    labels = []

    # iterate over each key
    for key in keys:

        # iterate over each id, text and labels for key
        custom_desc = "Converting '" + key + "' rows in a dataframe"
        for item in tqdm(dataset[key], total=len(dataset[key]), desc=custom_desc):

            celex_id.append(item["celex_id"])
            text.append(item["text"])
            labels.append(item["labels"])

    # create the dataframe, starting from the 3 lists
    final_df = pd.DataFrame(
        list(zip(celex_id, text, labels)),
        columns =['celex_id', 'text_' + language, 'labels_' + language]
        )

    # finally, return the df
    return final_df

In [22]:
dataframe_en = from_hf_dataset_to_dataframe(dataset_en, language="en")
dataframe_en

Converting 'train' rows in a dataframe: 100%|██████████| 55000/55000 [00:03<00:00, 14941.92it/s]
Converting 'test' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 14409.60it/s]
Converting 'validation' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 14244.17it/s]


| | celex_id | text_en | labels_en |
|---|---|---|---|
| 0 | 32006D0213 | COMMISSION DECISION\nof 6 March 2006\nestablis... | [1, 20, 7, 3, 0] |
| 1 | 32003R1330 | Commission Regulation (EC) No 1330/2003\nof 25... | [2, 17] |
| 2 | 32003R1786 | Council Regulation (EC) No 1786/2003\nof 29 Se... | [3, 19, 6] |
| 3 | 31985R2590 | *****\nCOMMISSION REGULATION (EEC) No 2590/85\... | [12, 17, 19, 6] |
| 4 | 31993R1103 | COMMISSION REGULATION (EEC) No 1103/93 of 30 A... | [18, 3, 4, 1] |
| ... | ... | ... | ... |
| 64995 | 32011D0151 | COMMISSION DECISION\nof 3 March 2011\namending... | [4, 11, 5, 0, 12, 15] |
| 64996 | 32010D0256 | COMMISSION DECISION\nof 30 April 2010\namendin... | [12, 0, 6] |
| 64997 | 32010D0177 | COMMISSION DECISION\nof 23 March 2010\namendin... | [1, 4, 0, 3, 18] |
| 64998 | 32012R0307 | COMMISSION IMPLEMENTING REGULATION (EU) No 307... | [0, 3, 17, 15] |
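
The conversion above stacks all splits and drops the information about which split each row came from. As a possible variation, the sketch below builds one dataframe per split and concatenates them, keeping a `split` column. The name `splits_to_dataframe` and the toy rows are hypothetical; `splits` is assumed to be any mapping from split name to an iterable of row dicts with the keys `celex_id`, `text` and `labels`, which stands in for the `DatasetDict` used in this notebook.

```python
import pandas as pd


def splits_to_dataframe(splits, language):
    """Stack all splits into one dataframe, remembering each row's split.

    `splits` is assumed to map a split name to an iterable of row dicts
    with the keys 'celex_id', 'text' and 'labels'.
    """
    frames = []
    for split_name, rows in splits.items():
        # one dataframe per split, built from the row dicts in a single call
        frame = pd.DataFrame(list(rows), columns=["celex_id", "text", "labels"])
        frame["split"] = split_name
        frames.append(frame)

    combined = pd.concat(frames, ignore_index=True)

    # same language-specific column names as in the notebook
    return combined.rename(
        columns={"text": "text_" + language, "labels": "labels_" + language}
    )


# toy data (made up for illustration, not taken from the dataset)
toy_splits = {
    "train": [{"celex_id": "A1", "text": "first act", "labels": [1, 2]}],
    "test": [{"celex_id": "B2", "text": "second act", "labels": [3]}],
}
toy_df = splits_to_dataframe(toy_splits, language="en")
print(toy_df)
```

If this variant were used for several languages, the `split` column would appear in every dataframe, so it would probably make sense to keep it for one language only before joining on `celex_id`.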


**German**

We now do the same for German.

First, we download the data.

In [23]:
dataset_de = load_dataset('multi_eurlex', 'de')

Downloading and preparing dataset multi_eurlex/de (download: 2.58 GiB, generated: 512.42 MiB, post-processed: Unknown size, total: 3.08 GiB) to C:\Users\Daniele\.cache\huggingface\datasets\multi_eurlex\de\1.0.0\8ec8b79877a517369a143ead6679d1788d13e51cf641ed29772f4449e8364fb6...


                                                                                         

Dataset multi_eurlex downloaded and prepared to C:\Users\Daniele\.cache\huggingface\datasets\multi_eurlex\de\1.0.0\8ec8b79877a517369a143ead6679d1788d13e51cf641ed29772f4449e8364fb6. Subsequent calls will reuse this data.


100%|██████████| 3/3 [00:00<00:00, 88.17it/s]


Then we convert it to a dataframe using our custom function.

In [25]:
dataframe_de = from_hf_dataset_to_dataframe(dataset_de, language="de")
dataframe_de

Converting 'train' rows in a dataframe: 100%|██████████| 55000/55000 [00:03<00:00, 13999.27it/s]
Converting 'test' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 12852.76it/s]
Converting 'validation' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 13297.83it/s]


| | celex_id | text_de | labels_de |
|---|---|---|---|
| 0 | 32006D0213 | ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006\... | [1, 20, 7, 3, 0] |
| 1 | 32003R1330 | Verordnung (EG) Nr. 1330/2003 der Kommission\n... | [2, 17] |
| 2 | 32003R1786 | Verordnung (EG) Nr. 1786/2003 des Rates\nvom 2... | [3, 19, 6] |
| 3 | 31985R2590 | *****\nVERORDNUNG (EWG) Nr. 2590/85 DER KOMMIS... | [12, 17, 19, 6] |
| 4 | 31993R1103 | VERORDNUNG (EWG) Nr. 1103/93 DER KOMMISSION vo... | [18, 3, 4, 1] |
| ... | ... | ... | ... |
| 64995 | 32011D0151 | BESCHLUSS DER KOMMISSION\nvom 3. März 2011\nzu... | [4, 11, 5, 0, 12, 15] |
| 64996 | 32010D0256 | BESCHLUSS DER KOMMISSION\nvom 30. April 2010\n... | [12, 0, 6] |
| 64997 | 32010D0177 | BESCHLUSS DER KOMMISSION\nvom 23. März 2010\nz... | [1, 4, 0, 3, 18] |
| 64998 | 32012R0307 | DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 307/2012 DER ... | [0, 3, 17, 15] |


The number of observations is the same. 

However, we perform an inner join on `celex_id` to make sure that each row holds the same document in both languages.

In [27]:
dataframe_complete = pd.merge(
    dataframe_en,
    dataframe_de,
    on='celex_id',
    how="inner"
    )

dataframe_complete

| | celex_id | text_en | labels_en | text_de | labels_de |
|---|---|---|---|---|---|
| 0 | 32006D0213 | COMMISSION DECISION\nof 6 March 2006\nestablis... | [1, 20, 7, 3, 0] | ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006\... | [1, 20, 7, 3, 0] |
| 1 | 32003R1330 | Commission Regulation (EC) No 1330/2003\nof 25... | [2, 17] | Verordnung (EG) Nr. 1330/2003 der Kommission\n... | [2, 17] |
| 2 | 32003R1786 | Council Regulation (EC) No 1786/2003\nof 29 Se... | [3, 19, 6] | Verordnung (EG) Nr. 1786/2003 des Rates\nvom 2... | [3, 19, 6] |
| 3 | 31985R2590 | *****\nCOMMISSION REGULATION (EEC) No 2590/85\... | [12, 17, 19, 6] | *****\nVERORDNUNG (EWG) Nr. 2590/85 DER KOMMIS... | [12, 17, 19, 6] |
| 4 | 31993R1103 | COMMISSION REGULATION (EEC) No 1103/93 of 30 A... | [18, 3, 4, 1] | VERORDNUNG (EWG) Nr. 1103/93 DER KOMMISSION vo... | [18, 3, 4, 1] |
| ... | ... | ... | ... | ... | ... |
| 64995 | 32011D0151 | COMMISSION DECISION\nof 3 March 2011\namending... | [4, 11, 5, 0, 12, 15] | BESCHLUSS DER KOMMISSION\nvom 3. März 2011\nzu... | [4, 11, 5, 0, 12, 15] |
| 64996 | 32010D0256 | COMMISSION DECISION\nof 30 April 2010\namendin... | [12, 0, 6] | BESCHLUSS DER KOMMISSION\nvom 30. April 2010\n... | [12, 0, 6] |
| 64997 | 32010D0177 | COMMISSION DECISION\nof 23 March 2010\namendin... | [1, 4, 0, 3, 18] | BESCHLUSS DER KOMMISSION\nvom 23. März 2010\nz... | [1, 4, 0, 3, 18] |
| 64998 | 32012R0307 | COMMISSION IMPLEMENTING REGULATION (EU) No 307... | [0, 3, 17, 15] | DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 307/2012 DER ... | [0, 3, 17, 15] |


As expected, the number of observations did not change.
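
An unchanged row count only tells us that nothing was dropped or duplicated overall. As a complementary check, `pd.merge` accepts a `validate` argument that raises an error if the join key is not unique on both sides. The sketch below illustrates this on toy dataframes (the data is made up, not taken from the dataset).

```python
import pandas as pd

# toy dataframes standing in for dataframe_en and dataframe_de
left = pd.DataFrame(
    {"celex_id": ["A1", "B2", "C3"], "text_en": ["one", "two", "three"]}
)
right = pd.DataFrame(
    {"celex_id": ["A1", "B2", "C3"], "text_de": ["eins", "zwei", "drei"]}
)

# validate="one_to_one" raises pandas.errors.MergeError
# if a celex_id is repeated in either dataframe
merged = pd.merge(left, right, on="celex_id", how="inner", validate="one_to_one")
rows_unchanged = len(merged) == len(left) == len(right)
print(rows_unchanged)

# the same merge with a duplicated key on the right side is rejected
duplicated = pd.concat([right, right.iloc[[0]]], ignore_index=True)
try:
    pd.merge(left, duplicated, on="celex_id", how="inner", validate="one_to_one")
    duplicate_detected = False
except pd.errors.MergeError:
    duplicate_detected = True
print(duplicate_detected)
```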

Note: we keep the labels for every language, even though they should match, so that once all joins are done we can check the final dataset for label inconsistencies.

**Italian, Polish, Swedish**

We now apply the same dataframe conversion and inner joins to the texts in the remaining languages.

In [29]:
dataset_it = load_dataset('multi_eurlex', 'it')
dataset_pl = load_dataset('multi_eurlex', 'pl')
dataset_sv = load_dataset('multi_eurlex', 'sv')

dataframe_it = from_hf_dataset_to_dataframe(dataset_it, language="it")
dataframe_pl = from_hf_dataset_to_dataframe(dataset_pl, language="pl")
dataframe_sv = from_hf_dataset_to_dataframe(dataset_sv, language="sv")

Converting 'train' rows in a dataframe: 100%|██████████| 55000/55000 [00:03<00:00, 14161.06it/s]
Converting 'test' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 12657.56it/s]
Converting 'validation' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 12854.02it/s]
Converting 'train' rows in a dataframe: 100%|██████████| 23197/23197 [00:01<00:00, 11781.45it/s]
Converting 'test' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 11820.59it/s]
Converting 'validation' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 12345.99it/s]
Converting 'train' rows in a dataframe: 100%|██████████| 42490/42490 [00:03<00:00, 13340.81it/s]
Converting 'test' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 12048.42it/s]
Converting 'validation' rows in a dataframe: 100%|██████████| 5000/5000 [00:00<00:00, 13054.13it/s]


In [30]:
# italian
dataframe_complete = pd.merge(
    dataframe_complete,
    dataframe_it,
    on='celex_id',
    how="inner"
    )

# polish
dataframe_complete = pd.merge(
    dataframe_complete,
    dataframe_pl,
    on='celex_id',
    how="inner"
    )

# swedish
dataframe_complete = pd.merge(
    dataframe_complete,
    dataframe_sv,
    on='celex_id',
    how="inner"
    )

dataframe_complete

| | celex_id | text_en | labels_en | text_de | labels_de | text_it | labels_it | text_pl | labels_pl | text_sv | labels_sv |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 32006D0213 | COMMISSION DECISION\nof 6 March 2006\nestablis... | [1, 20, 7, 3, 0] | ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006\... | [1, 20, 7, 3, 0] | DECISIONE DELLA COMMISSIONE\ndel 6 marzo 2006\... | [1, 20, 7, 3, 0] | DECYZJA KOMISJI\nz dnia 6 marca 2006 r.\nustan... | [1, 20, 7, 3, 0] | KOMMISSIONENS BESLUT\nav den 6 mars 2006\nom i... | [1, 20, 7, 3, 0] |
| 1 | 32003R1786 | Council Regulation (EC) No 1786/2003\nof 29 Se... | [3, 19, 6] | Verordnung (EG) Nr. 1786/2003 des Rates\nvom 2... | [3, 19, 6] | Regolamento (CE) n. 1786/2003 del Consiglio\nd... | [3, 19, 6] | Rozporządzenie Rady (WE) nr 1786/2003\nz dnia ... | [3, 19, 6] | Rådets förordning (EG) nr 1786/2003\nav den 29... | [3, 19, 6] |
| 2 | 32004R1038 | COMMISSION REGULATION (EC) No 1038/2004\nof 27... | [3, 17, 5] | VERORDNUNG (EG) Nr. 1038/2004 DER KOMMISSION\n... | [3, 17, 5] | REGOLAMENTO (CE) N. 1038/2004 DELLA COMMISSION... | [3, 17, 5] | ROZPORZĄDZENIE KOMISJI (WE) NR 1038/2004\nz dn... | [3, 17, 5] | KOMMISSIONENS FÖRORDNING (EG) nr 1038/2004\nav... | [3, 17, 5] |
| 3 | 32003R1012 | Commission Regulation (EC) No 1012/2003\nof 12... | [2, 5, 10, 8, 3, 18, 15] | Verordnung (EG) Nr. 1012/2003 der Kommission\n... | [2, 5, 10, 8, 3, 18, 15] | Regolamento (CE) n. 1012/2003 della Commission... | [2, 5, 10, 8, 3, 18, 15] | Rozporządzenie Komisji (WE) nr 1012/2003\nz dn... | [2, 5, 10, 8, 3, 18, 15] | Kommissionens förordning (EG) nr 1012/2003\nav... | [2, 5, 10, 8, 3, 18, 15] |
| 4 | 32003R2229 | Council Regulation (EC) No 2229/2003\nof 22 De... | [18, 3, 4, 1] | Verordnung (EG) Nr. 2229/2003 des Rates\nvom 2... | [18, 3, 4, 1] | Regolamento (CE) n. 2229/2003 del Consiglio\nd... | [18, 3, 4, 1] | Rozporządzenie Rady (WE) nr 2229/2003\nz dnia ... | [18, 3, 4, 1] | Rådets förordning (EG) nr 2229/2003\nav den 22... | [18, 3, 4, 1] |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 32935 | 32011D0151 | COMMISSION DECISION\nof 3 March 2011\namending... | [4, 11, 5, 0, 12, 15] | BESCHLUSS DER KOMMISSION\nvom 3. März 2011\nzu... | [4, 11, 5, 0, 12, 15] | DECISIONE DELLA COMMISSIONE\ndel 3 marzo 2011\... | [4, 11, 5, 0, 12, 15] | DECYZJA KOMISJI\nz dnia 3 marca 2011 r.\nzmien... | [4, 11, 5, 0, 12, 15] | KOMMISSIONENS BESLUT\nav den 3 mars 2011\nom ä... | [4, 11, 5, 0, 12, 15] |
| 32936 | 32010D0256 | COMMISSION DECISION\nof 30 April 2010\namendin... | [12, 0, 6] | BESCHLUSS DER KOMMISSION\nvom 30. April 2010\n... | [12, 0, 6] | DECISIONE DELLA COMMISSIONE\ndel 30 aprile 201... | [12, 0, 6] | DECYZJA KOMISJI\nz dnia 30 kwietnia 2010 r.\nz... | [12, 0, 6] | KOMMISSIONENS BESLUT\nav den 30 april 2010\nom... | [12, 0, 6] |
| 32937 | 32010D0177 | COMMISSION DECISION\nof 23 March 2010\namendin... | [1, 4, 0, 3, 18] | BESCHLUSS DER KOMMISSION\nvom 23. März 2010\nz... | [1, 4, 0, 3, 18] | DECISIONE DELLA COMMISSIONE\ndel 23 marzo 2010... | [1, 4, 0, 3, 18] | DECYZJA KOMISJI\nz dnia 23 marca 2010 r.\nzmie... | [1, 4, 0, 3, 18] | KOMMISSIONENS BESLUT\nav den 23 mars 2010\nom ... | [1, 4, 0, 3, 18] |
| 32938 | 32012R0307 | COMMISSION IMPLEMENTING REGULATION (EU) No 307... | [0, 3, 17, 15] | DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 307/2012 DER ... | [0, 3, 17, 15] | REGOLAMENTO DI ESECUZIONE (UE) N. 307/2012 DEL... | [0, 3, 17, 15] | ROZPORZĄDZENIE WYKONAWCZE KOMISJI (UE) NR 307/... | [0, 3, 17, 15] | KOMMISSIONENS GENOMFÖRANDEFÖRORDNING (EU) nr 3... | [0, 3, 17, 15] |


For Polish and Swedish, not all texts are available: their train splits contain 23,197 and 42,490 documents respectively, against 55,000 for English, German and Italian.

After the inner joins we are left with around 33k observations, which is more than enough to continue with our classification task.
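
The three consecutive merges follow the same pattern, so they could also be written as a single fold over a list of dataframes. The sketch below uses `functools.reduce` on toy data to show how each inner join keeps only the `celex_id` values present in every dataframe. The helper name `inner_join_all` and the toy rows are hypothetical.

```python
from functools import reduce

import pandas as pd


def inner_join_all(frames, key="celex_id"):
    """Inner-join a list of dataframes on `key`, from left to right."""
    return reduce(
        lambda left, right: pd.merge(left, right, on=key, how="inner"),
        frames,
    )


# toy dataframes: only "A1" is available in all three languages
toy_en = pd.DataFrame(
    {"celex_id": ["A1", "B2", "C3"], "text_en": ["one", "two", "three"]}
)
toy_pl = pd.DataFrame({"celex_id": ["A1", "C3"], "text_pl": ["jeden", "trzy"]})
toy_sv = pd.DataFrame({"celex_id": ["A1", "B2"], "text_sv": ["ett", "två"]})

toy_complete = inner_join_all([toy_en, toy_pl, toy_sv])
print(toy_complete)
```

In the notebook this would correspond to something like `inner_join_all([dataframe_en, dataframe_de, dataframe_it, dataframe_pl, dataframe_sv])`, assuming the five dataframes built above.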

**Are there any inconsistencies?**

Before saving the final dataset, we make sure that there are no inconsistencies in the labels.

In [31]:
inconsistent_rows = []

# 'idx' instead of 'id', to avoid shadowing the Python built-in
for idx, row in dataframe_complete.iterrows():

    # compare every language against the English labels
    if row["labels_en"] != row["labels_de"] or \
       row["labels_en"] != row["labels_it"] or \
       row["labels_en"] != row["labels_pl"] or \
       row["labels_en"] != row["labels_sv"]:

        inconsistent_rows.append(idx)

# show inconsistent rows, if any
inconsistent_rows

[]
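
The same check can be written without listing every language by hand. The sketch below is one possible formulation: for each row it collects the label lists of all given columns into a set and flags the row if more than one distinct list is found. Like the comparison above, it is sensitive to the order of the labels. The function name `find_inconsistent_rows` and the toy data are hypothetical.

```python
import pandas as pd


def find_inconsistent_rows(df, label_columns):
    """Return the index labels of rows whose label lists differ across columns."""
    # lists are unhashable, so each one is converted to a tuple before deduplicating
    differs = df[label_columns].apply(
        lambda row: len({tuple(labels) for labels in row}) > 1,
        axis=1,
    )
    return differs[differs].index.tolist()


# toy data: the second row has different labels in the two languages
toy = pd.DataFrame(
    {
        "labels_en": [[1, 2], [3]],
        "labels_de": [[1, 2], [4]],
    }
)
toy_inconsistent = find_inconsistent_rows(toy, ["labels_en", "labels_de"])
print(toy_inconsistent)
```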

There are no inconsistencies, so we can drop the redundant label columns and then save the complete dataframe to disk as a .csv file for future use.

In [34]:
# remove redundant columns (we just need one 'labels' col)
dataframe_complete["labels"] = dataframe_complete["labels_en"]
dataframe_complete_clean = dataframe_complete[["celex_id","labels","text_en","text_de","text_it","text_pl","text_sv"]]

# show df
dataframe_complete_clean

| | celex_id | labels | text_en | text_de | text_it | text_pl | text_sv |
|---|---|---|---|---|---|---|---|
| 0 | 32006D0213 | [1, 20, 7, 3, 0] | COMMISSION DECISION\nof 6 March 2006\nestablis... | ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006\... | DECISIONE DELLA COMMISSIONE\ndel 6 marzo 2006\... | DECYZJA KOMISJI\nz dnia 6 marca 2006 r.\nustan... | KOMMISSIONENS BESLUT\nav den 6 mars 2006\nom i... |
| 1 | 32003R1786 | [3, 19, 6] | Council Regulation (EC) No 1786/2003\nof 29 Se... | Verordnung (EG) Nr. 1786/2003 des Rates\nvom 2... | Regolamento (CE) n. 1786/2003 del Consiglio\nd... | Rozporządzenie Rady (WE) nr 1786/2003\nz dnia ... | Rådets förordning (EG) nr 1786/2003\nav den 29... |
| 2 | 32004R1038 | [3, 17, 5] | COMMISSION REGULATION (EC) No 1038/2004\nof 27... | VERORDNUNG (EG) Nr. 1038/2004 DER KOMMISSION\n... | REGOLAMENTO (CE) N. 1038/2004 DELLA COMMISSION... | ROZPORZĄDZENIE KOMISJI (WE) NR 1038/2004\nz dn... | KOMMISSIONENS FÖRORDNING (EG) nr 1038/2004\nav... |
| 3 | 32003R1012 | [2, 5, 10, 8, 3, 18, 15] | Commission Regulation (EC) No 1012/2003\nof 12... | Verordnung (EG) Nr. 1012/2003 der Kommission\n... | Regolamento (CE) n. 1012/2003 della Commission... | Rozporządzenie Komisji (WE) nr 1012/2003\nz dn... | Kommissionens förordning (EG) nr 1012/2003\nav... |
| 4 | 32003R2229 | [18, 3, 4, 1] | Council Regulation (EC) No 2229/2003\nof 22 De... | Verordnung (EG) Nr. 2229/2003 des Rates\nvom 2... | Regolamento (CE) n. 2229/2003 del Consiglio\nd... | Rozporządzenie Rady (WE) nr 2229/2003\nz dnia ... | Rådets förordning (EG) nr 2229/2003\nav den 22... |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 32935 | 32011D0151 | [4, 11, 5, 0, 12, 15] | COMMISSION DECISION\nof 3 March 2011\namending... | BESCHLUSS DER KOMMISSION\nvom 3. März 2011\nzu... | DECISIONE DELLA COMMISSIONE\ndel 3 marzo 2011\... | DECYZJA KOMISJI\nz dnia 3 marca 2011 r.\nzmien... | KOMMISSIONENS BESLUT\nav den 3 mars 2011\nom ä... |
| 32936 | 32010D0256 | [12, 0, 6] | COMMISSION DECISION\nof 30 April 2010\namendin... | BESCHLUSS DER KOMMISSION\nvom 30. April 2010\n... | DECISIONE DELLA COMMISSIONE\ndel 30 aprile 201... | DECYZJA KOMISJI\nz dnia 30 kwietnia 2010 r.\nz... | KOMMISSIONENS BESLUT\nav den 30 april 2010\nom... |
| 32937 | 32010D0177 | [1, 4, 0, 3, 18] | COMMISSION DECISION\nof 23 March 2010\namendin... | BESCHLUSS DER KOMMISSION\nvom 23. März 2010\nz... | DECISIONE DELLA COMMISSIONE\ndel 23 marzo 2010... | DECYZJA KOMISJI\nz dnia 23 marca 2010 r.\nzmie... | KOMMISSIONENS BESLUT\nav den 23 mars 2010\nom ... |
| 32938 | 32012R0307 | [0, 3, 17, 15] | COMMISSION IMPLEMENTING REGULATION (EU) No 307... | DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 307/2012 DER ... | REGOLAMENTO DI ESECUZIONE (UE) N. 307/2012 DEL... | ROZPORZĄDZENIE WYKONAWCZE KOMISJI (UE) NR 307/... | KOMMISSIONENS GENOMFÖRANDEFÖRORDNING (EU) nr 3... |


Finally, we save the cleaned dataframe to a .csv file.

In [36]:
dataframe_complete_clean.to_csv("data/0_multi_eurlex_custom.csv", index=False)
print("> All done!")

> All done!
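
One thing to keep in mind for future use: a .csv file stores each list of labels as plain text, so when the file is read back the `labels` column contains strings such as `"[1, 20, 7]"` rather than Python lists. The sketch below shows one way to restore the lists, using `ast.literal_eval` as a converter in `pd.read_csv`. It uses a toy dataframe and an in-memory buffer instead of the real file, purely to stay self-contained.

```python
import ast
import io

import pandas as pd

# toy dataframe with a list-valued 'labels' column, as in the final dataset
toy = pd.DataFrame({"celex_id": ["A1", "B2"], "labels": [[1, 20, 7], [2, 17]]})

# write to an in-memory buffer instead of a file on disk
buffer = io.StringIO()
toy.to_csv(buffer, index=False)
buffer.seek(0)

# the converter parses each stored string back into a Python list
restored = pd.read_csv(buffer, converters={"labels": ast.literal_eval})
print(restored["labels"].tolist())
```

For the real file, the same `converters` argument could be passed when reading `data/0_multi_eurlex_custom.csv`.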
