## **NLP Practical**

### **Exploring the Dataset**

After cleaning the dataset, we explore it in order to understand what data we are dealing with.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px

print("> Libraries Imported")

> Libraries Imported


#### **Import the dataset**

We can simply read our custom (and cleaned) CSV.

In [2]:
dataframe = pd.read_csv("../data/1_multi_eurlex_clean.csv")
dataframe.head()

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,commission decision of march establishing the ...,entscheidung der kommission vom marz zur festl...,decisione della commissione del marzo che dete...,decyzja komisji z dnia marca r ustanawiajaca k...,kommissionens beslut av den mars om indelning ...
1,32003R1786,3,council regulation ec no of september on the c...,verordnung eg nr des rates vom september uber ...,regolamento ce n del consiglio del settembre r...,rozporzadzenie rady we nr z dnia wrzesnia r w ...,radets forordning eg nr av den september om de...
2,32004R1038,3,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...
3,32003R1012,2,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
4,32003R2229,18,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...
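
The `Unnamed: 0` column in the output above usually appears when a CSV was saved together with its index and then read back without saying so. A minimal sketch of the effect, using a tiny in-memory CSV (the two rows are a hypothetical stand-in, not the real file):

```python
import io
import pandas as pd

# Tiny in-memory stand-in for a CSV that was written with its index column
csv_text = ",celex_id,labels\n0,32006D0213,1\n1,32003R1786,3\n"

# Default read: the unnamed first column comes back as "Unnamed: 0"
df_default = pd.read_csv(io.StringIO(csv_text))

# index_col=0 tells pandas to use the first column as the index instead
df_indexed = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(df_default.columns))
print(list(df_indexed.columns))
```

Whether to apply this to the real file is a judgment call: the notebook drops the extra column anyway when it reorders the columns before saving.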


#### **Data Exploration**

**1. How many occurrences per class?**

Let's see whether the classes are balanced or not.

First, a simple function to plot the number of texts for each class.

In [3]:
def create_obs_per_classes_barplot(df, width, height, title):
    """
    Input:
        > df            dataset with "labels" and "text_en" columns
        > width         of plot
        > height        of plot
        > title         of plot
    Output:
        > shows and returns the plot object
    """

    # data manipulation for plot
    temp_df = pd.DataFrame(df.groupby("labels")["text_en"].count()).reset_index()

    # convert labels to string (so that they are treated as categoricals)
    temp_df.labels = temp_df.labels.astype(str)

    # create plot
    fig = px.bar(
        temp_df, 
        x="labels", 
        y="text_en", 
        color="labels",
        title=title,
        width=width, height=height, 
        text_auto=True,
        # axis labels: keys must match the column names used for x and y
        labels={
            "text_en": "Number of Observations",
            "labels": "Label (first level)"
        },
    )
    fig.update_layout(xaxis_tickangle=-45, showlegend=False)
    fig.update_traces(textangle=0, textposition="outside", cliponaxis=False)
    fig.show()

    return fig

Now we can plot!

In [4]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 21</sup>"
    )

The dataset is very unbalanced: many classes have fewer than 500 observations, while others exceed 2000 (in some cases by a lot, e.g. class 3).

As interesting as it is to test how architectures behave with unbalanced data, classes such as 14 would only introduce noise, given the low number of observations. We therefore decide to update the dataset, keeping only the classes with at least 500 observations.
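
The "keep only the classes with at least N observations" rule can be sketched with `value_counts`. This is an illustration on toy labels, not the notebook's actual cell; the data and the `min_obs` value are hypothetical (the text above uses 500):

```python
import pandas as pd

# Toy labels (hypothetical), not the real EUR-Lex distribution
toy = pd.DataFrame({"labels": [3] * 6 + [2] * 4 + [14] * 1})

min_obs = 4  # hypothetical threshold for the toy data

# count observations per class, most frequent first
counts = toy["labels"].value_counts()

# keep only the labels that reach the threshold
frequent_labels = sorted(counts[counts >= min_obs].index.tolist())
filtered = toy[toy["labels"].isin(frequent_labels)]

print(counts.to_dict())
print(frequent_labels)
```

Here class 14 is dropped because it has a single observation, while classes 2 and 3 survive the filter.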

**Note after the first training tests**: the texts are very, very long (1000-1200 words). As a consequence, the training times and epochs needed to arrive at satisfactory performance are high as well. 

In order to make the training faster, we select only 3 classes: *2*, *3* and *18* (the most frequent). For each class, we then randomly select 2000 observations to avoid working with unbalanced classes.

In [5]:
# 1. groupby texts
temp_df = pd.DataFrame(dataframe.groupby("labels")["text_en"].count()).reset_index()

# 2. obtain the list of labels with at least 2646 obs (i.e. the three most frequent classes)
classes_to_keep = list(temp_df.loc[temp_df["text_en"] >= 2646]["labels"])

classes_to_keep

[2, 3, 18]

In [6]:
# filter the dataset
reduced_dataframe = dataframe.loc[dataframe["labels"].isin(classes_to_keep)]
reduced_dataframe

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
1,32003R1786,3,council regulation ec no of september on the c...,verordnung eg nr des rates vom september uber ...,regolamento ce n del consiglio del settembre r...,rozporzadzenie rady we nr z dnia wrzesnia r w ...,radets forordning eg nr av den september om de...
2,32004R1038,3,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...
3,32003R1012,2,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
4,32003R2229,18,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...
6,32008R0284,3,commission regulation ec no of march registeri...,verordnung eg nr der kommission vom marz zur e...,regolamento ce n della commissione del marzo r...,rozporzadzenie komisji we nr z dnia marca r re...,kommissionens forordning eg nr av den mars om ...
...,...,...,...,...,...,...,...
32923,32011R0354,3,commission regulation eu no of april opening a...,verordnung eu nr der kommission vom april zur ...,regolamento ue n della commissione del aprile ...,rozporzadzenie komisji ue nr z dnia kwietnia r...,kommissionens forordning eu nr av den april om...
32925,32012R0751,3,commission implementing regulation eu no of au...,durchfuhrungsverordnung eu nr der kommission v...,regolamento di esecuzione ue n della commissio...,rozporzadzenie wykonawcze komisji ue nr z dnia...,kommissionens genomforandeforordning eu nr av ...
32929,32012D0272,18,council decision of may on the signing on beha...,beschluss des rates vom mai uber die unterzeic...,decisione del consiglio del maggio relativa al...,decyzja rady z dnia maja r w sprawie podpisani...,radets beslut av den maj om undertecknande pa ...
32931,32012D0454,3,council implementing decision cfsp of august i...,durchfuhrungsbeschluss gasp des rates vom augu...,decisione di esecuzione pesc del consiglio del...,decyzja wykonawcza rady wpzib z dnia sierpnia ...,radets genomforandebeslut gusp av den augusti ...


We now keep 2000 obs for each class.

In [8]:
dataframe_class2 = reduced_dataframe.loc[reduced_dataframe["labels"] == 2].sample(n=2000, random_state=42)
dataframe_class3 = reduced_dataframe.loc[reduced_dataframe["labels"] == 3].sample(n=2000, random_state=42) 
dataframe_class18 = reduced_dataframe.loc[reduced_dataframe["labels"] == 18].sample(n=2000, random_state=42) 

reduced_dataframe = pd.concat([dataframe_class2, dataframe_class3, dataframe_class18])
reduced_dataframe

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
1353,32010D0395,2,commission decision of december on state aid c...,beschluss der kommission vom dezember uber die...,decisione della commissione del dicembre conce...,decyzja komisji z dnia grudnia r w sprawie pom...,kommissionens beslut av den december om det st...
28193,32012R0453,2,commission implementing regulation eu no of ma...,durchfuhrungsverordnung eu nr der kommission v...,regolamento di esecuzione ue n della commissio...,rozporzadzenie wykonawcze komisji ue nr z dnia...,kommissionens genomforandeforordning eu nr av ...
28475,32012D0043,2,commission implementing decision of january au...,durchfuhrungsbeschluss der kommission vom janu...,decisione di esecuzione della commissione del ...,decyzja wykonawcza komisji z dnia stycznia r u...,kommissionens genomforandebeslut av den januar...
19060,32007R0730,2,commission regulation ec no of june establishi...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
21514,32009R0375,2,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...
...,...,...,...,...,...,...,...
27301,32013R0519,18,commission regulation eu no of february adapti...,verordnung eu nr der kommission vom februar zu...,regolamento ue n della commissione del febbrai...,rozporzadzenie komisji ue nr z dnia lutego r d...,kommissionens forordning eu nr av den februari...
17620,32008D0914,18,commission decision of june on the confirmatio...,entscheidung der kommission juni zur bestatigu...,decisione della commissione dell giugno recant...,decyzja komisji z dnia czerwca r w sprawie zat...,kommissionens beslut av den juni om godkannand...
13434,31999R2502,18,commission regulation ec no of november amendi...,verordnung eg nr der kommission vom november z...,regolamento ce n della commissione del novembr...,rozporzadzenie komisji we nr z dnia listopada ...,kommissionens forordning eg nr av den november...
19018,32008D0847,18,council decision of november on the eligibilit...,beschluss des rates vom november uber die ford...,decisione del consiglio del novembre sull ammi...,decyzja rady z dnia listopada r w sprawie kwal...,radets beslut av den november om berattigande ...
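
The three per-class `sample` calls above could also be written as a single grouped sample. A minimal sketch on toy data, assuming pandas 1.1 or later (where `DataFrameGroupBy.sample` is available); the toy frame and `n_per_class` are hypothetical:

```python
import pandas as pd

# Toy unbalanced dataset (hypothetical)
toy = pd.DataFrame({
    "labels": [2] * 5 + [3] * 8 + [18] * 6,
    "text_en": [f"doc {i}" for i in range(19)],
})

n_per_class = 4  # hypothetical; the notebook uses 2000

# draw the same number of rows from every class in one call
balanced = toy.groupby("labels").sample(n=n_per_class, random_state=42)

print(balanced["labels"].value_counts().sort_index().to_dict())
```

Two caveats: the call raises an error if any class has fewer than `n_per_class` rows, and the rows drawn need not be the same ones the three separate calls would pick, even with the same `random_state`.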


Let's see if everything worked correctly with a plot :)

In [10]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = reduced_dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 3</sup>"
    )

#### **Re-map classes**

To avoid errors during training, we need to remap the labels so that they are consecutive integers starting from 0 (here, from 0 to 2).

As we already did before, we can simply apply a custom function.

In [11]:
# Reduced to 3 classes: 2, 3 and 18 (the most frequent)
def map_old_labels_to_new_labels(label):

    if label == 2:
        return 0
    elif label == 3:
        return 1
    elif label == 18:
        return 2
    else:
        return label

In [12]:
# turn off pandas warning
pd.options.mode.chained_assignment = None

# apply the function
reduced_dataframe["labels_new"] = reduced_dataframe["labels"].progress_apply(map_old_labels_to_new_labels)

# explore results
set(reduced_dataframe["labels_new"])

100%|██████████| 6000/6000 [00:00<00:00, 858813.91it/s]


{0, 1, 2}
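
As an alternative to the `if`/`elif` function, the same remapping can be sketched with a dictionary built from the labels actually present. This is an illustration on toy data, not the notebook's method; `label_map` is a hypothetical name:

```python
import pandas as pd

# Toy labels (hypothetical)
toy = pd.DataFrame({"labels": [2, 18, 3, 3, 2]})

# map each old label to its position in the sorted list of distinct labels
label_map = {
    int(old): new
    for new, old in enumerate(sorted(toy["labels"].unique()))
}

toy["labels_new"] = toy["labels"].map(label_map)

# .map leaves NaN for labels missing from the dictionary, so check for that
assert toy["labels_new"].notna().all()

print(label_map)
print(toy["labels_new"].tolist())
```

One difference in behaviour: the function above returns an unmapped label unchanged, whereas `.map` produces `NaN`, which makes a forgotten class easier to spot.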

**2. How many words in each text?**

In [13]:
text_cols = ["text_en", "text_de", "text_it", "text_pl", "text_sv"]

for col in text_cols:

    # obtain the number of words in each text
    reduced_dataframe[col + "_len"] = reduced_dataframe[col].apply(lambda x: len(x.split()))
    
    # calculate its mean
    temp_mean = np.mean(reduced_dataframe[col + "_len"])

    # print res
    print(f"> Average number of words in '{col[-2:]}' texts: {round(temp_mean)}")

> Average number of words in 'en' texts: 1064
> Average number of words in 'de' texts: 950
> Average number of words in 'it' texts: 1101
> Average number of words in 'pl' texts: 918
> Average number of words in 'sv' texts: 933


The average number of words per text ranges from about 920 (Polish) to about 1100 (Italian), depending on the language. We will need this information in the tokenization and embedding phase.
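
An average alone hides the spread of lengths, which matters when choosing a maximum sequence length for tokenization. A minimal sketch on hypothetical toy texts, using upper quantiles as one possible heuristic (the choice of the 95th percentile is an assumption, not something the notebook prescribes):

```python
import pandas as pd

# Toy texts (hypothetical)
toy = pd.DataFrame({"text_en": ["a b c", "a b", "a b c d e f", "a", "a b c d"]})

# whitespace word count per text, same idea as len(x.split())
lengths = toy["text_en"].str.split().str.len()

# mean plus upper quantiles give a better picture than the mean alone
mean_len = lengths.mean()
p95_len = lengths.quantile(0.95)

print(f"mean: {mean_len:.1f}, 95th percentile: {p95_len:.1f}")
```

Keep in mind that these are word counts: subword tokenizers typically split texts into more tokens than whitespace-separated words, so the token-level lengths should be checked with the tokenizer that is actually used.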

#### **Save the changes**

Finally, we can save this final version of the dataset and train some networks!

In [15]:
# reorder columns
reduced_dataframe = reduced_dataframe[["celex_id", "labels", "labels_new", "text_en", "text_de", "text_it", "text_pl", "text_sv"]]

reduced_dataframe.to_csv("../data/2_multi_eurlex_reduced.csv", index=False)
print("> All done!")

> All done!
