## **NLP Practical**

### **Exploring the Dataset**

After cleaning the dataset, we explore it in order to understand what data we are dealing with.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px

print("> Libraries Imported")

> Libraries Imported


#### **Import the dataset**

We can simply read our custom (and cleaned) CSV file.

In [2]:
dataframe = pd.read_csv("../data/1_multi_eurlex_clean.csv")
dataframe.head()

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,commission decision of march establishing the ...,entscheidung der kommission vom marz zur festl...,decisione della commissione del marzo che dete...,decyzja komisji z dnia marca r ustanawiajaca k...,kommissionens beslut av den mars om indelning ...
1,32003R1786,3,council regulation ec no of september on the c...,verordnung eg nr des rates vom september uber ...,regolamento ce n del consiglio del settembre r...,rozporzadzenie rady we nr z dnia wrzesnia r w ...,radets forordning eg nr av den september om de...
2,32004R1038,3,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...
3,32003R1012,2,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
4,32003R2229,18,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...
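
The `Unnamed: 0` column in the preview is most likely the index that was written along with the data when the CSV was saved. As a minimal sketch (using a hypothetical two-row in-memory sample, `csv_text`, instead of the real file), `index_col=0` lets pandas absorb that column as the index on read:

```python
import io
import pandas as pd

# Hypothetical sample that mimics a CSV saved together with its index
csv_text = ",celex_id,labels\n0,32006D0213,1\n1,32003R1786,3\n"

# Default read: the unnamed first column shows up as "Unnamed: 0"
with_extra = pd.read_csv(io.StringIO(csv_text))

# index_col=0 uses the first column as the index instead
without_extra = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(list(with_extra.columns))
print(list(without_extra.columns))
```

The rest of the notebook does not depend on this choice, since the extra column is dropped when the columns are reordered before saving.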


#### **Data Exploration**

**1. How many occurrences per class?**

Let's see whether the classes are balanced or not.

First, a simple function to plot the number of texts for each class.

In [3]:
def create_obs_per_classes_barplot(df, width, height, title):
    """
    Input:
        > df            dataset (must contain the "labels" and "text_en" columns)
        > width         of plot
        > height        of plot
        > title         of plot
    Output:
        > prints and returns the plot object
    """

    # data manipulation for plot
    temp_df = pd.DataFrame(df.groupby("labels")["text_en"].count()).reset_index()

    # convert labels to string (so that they are treated as categoricals)
    temp_df.labels = temp_df.labels.astype(str)

    # create plot
    fig = px.bar(
        temp_df, 
        x="labels", 
        y="text_en", 
        color="labels",
        title=title,
        width=width, height=height, 
        text_auto=True,
        # keys must match the column names used for x and y
        labels={
            "text_en": "Number of Observations",
            "labels": "Label (first level)"
        },
    )
    fig.update_layout(xaxis_tickangle=-45, showlegend=False)
    fig.update_traces(textangle=0, textposition="outside", cliponaxis=False)
    fig.show()

    return fig

Now we can plot!

In [4]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 21</sup>"
    )

The dataset is heavily imbalanced: many classes have fewer than 500 observations, while others exceed 2000 (in some cases by a wide margin, e.g. class 3).
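
Besides reading the bar plot, the imbalance can be put into numbers. The following is only a sketch on hypothetical toy labels (`toy_labels`); in the notebook the same calls would be applied to `dataframe["labels"]`:

```python
import pandas as pd

# Hypothetical toy labels standing in for dataframe["labels"]
toy_labels = pd.Series([3] * 6 + [0] * 3 + [14] * 1, name="labels")

counts = toy_labels.value_counts()                 # observations per class
shares = toy_labels.value_counts(normalize=True)   # fraction of rows per class

# Ratio between the largest and the smallest class
imbalance_ratio = counts.max() / counts.min()

print(counts)
print(shares.round(2))
print(f"imbalance ratio: {imbalance_ratio:.1f}")
```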

As interesting as it would be to test how architectures behave with imbalanced data, classes such as 14 would only introduce noise, given their low number of observations. We therefore reduce the dataset, keeping only the classes with at least 2000 observations and then dropping class 3, which exceeds that threshold by a wide margin.

In [11]:
# 1. groupby texts
temp_df = pd.DataFrame(dataframe.groupby("labels")["text_en"].count()).reset_index()

# 2. obtain the list of labels with at least 2000 obs
classes_to_keep = list(temp_df.loc[temp_df["text_en"] >= 2000]["labels"])

classes_to_keep

[0, 2, 3, 7, 18]

In [12]:
# drop class 3 by value (more robust than removing it by list position)
classes_to_keep.remove(3)
classes_to_keep

[0, 2, 7, 18]
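
The selection above can also be wrapped in a small helper so that the threshold and the excluded classes are explicit parameters. This is a sketch, not the notebook's own code: `select_classes`, `min_obs`, `exclude` and the toy data are hypothetical names introduced for illustration.

```python
import pandas as pd

def select_classes(labels, min_obs, exclude=()):
    """Return the sorted labels with at least `min_obs` rows, minus `exclude`."""
    excluded = set(exclude)
    counts = labels.value_counts()
    kept = counts[counts >= min_obs].index
    return sorted(int(c) for c in kept if c not in excluded)

# Hypothetical toy labels: class 3 dominates, class 14 is tiny
toy_labels = pd.Series([0] * 4 + [2] * 5 + [3] * 9 + [14] * 1)

toy_classes = select_classes(toy_labels, min_obs=4, exclude=[3])
print(toy_classes)
```

On the real data the equivalent call would presumably be `select_classes(dataframe["labels"], min_obs=2000, exclude=[3])`.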

In [13]:
# filter the dataset (copy, so that adding columns later does not trigger SettingWithCopyWarning)
reduced_dataframe = dataframe.loc[dataframe["labels"].isin(classes_to_keep)].copy()
reduced_dataframe

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
3,32003R1012,2,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
4,32003R2229,18,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...
5,32003R0223,7,commission regulation ec no of february on lab...,verordnung eg nr der kommission vom februar zu...,regolamento ce n della commissione del febbrai...,rozporzadzenie komisji we nr z dnia lutego r w...,kommissionens forordning eg nr av den februari...
9,31989L0681,7,council directive of december amending directi...,richtlinie des rates vom dezember zur anderung...,direttiva del consiglio del dicembre che modif...,dyrektywa rady z dnia grudnia r zmieniajaca dy...,radets direktiv av den december om andring av ...
15,32006R1007,2,commission regulation ec no of june determinin...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
...,...,...,...,...,...,...,...
32928,32011R0880,7,commission regulation eu no of september corre...,verordnung eu nr der kommission vom september ...,regolamento ue n della commissione del settemb...,rozporzadzenie komisji ue nr z dnia wrzesnia r...,kommissionens forordning eu nr av den septembe...
32929,32012D0272,18,council decision of may on the signing on beha...,beschluss des rates vom mai uber die unterzeic...,decisione del consiglio del maggio relativa al...,decyzja rady z dnia maja r w sprawie podpisani...,radets beslut av den maj om undertecknande pa ...
32933,32012R0596,18,commission regulation eu no of july initiating...,verordnung eu nr der kommission vom juli zur e...,regolamento ue n della commissione del luglio ...,rozporzadzenie komisji ue nr z dnia lipca r ws...,kommissionens forordning eu nr av den juli om ...
32934,32010D0165,7,commission decision of march withdrawing the r...,beschluss der kommission vom marz uber die str...,decisione della commissione del marzo che riti...,decyzja komisji z dnia marca r w sprawie wycof...,kommissionens beslut av den mars om strykning ...


In [14]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = reduced_dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 4</sup>"
    )

This situation is much preferable: the dataset is still unbalanced, but there is a sufficient number of observations for each remaining class.

#### **Re-map classes**

To avoid errors during training, the labels must be remapped to a contiguous range (here, from 0 to 3).

As we already did before, we can simply apply a custom function.

In [15]:
# Reduced to 4 classes (>= 2000 occurrences, class 3 excluded).
# This supersedes an earlier 12-class mapping
# (7 -> 5, 8 -> 6, 10 -> 7, 11 -> 8, 12 -> 9, 17 -> 10, 18 -> 11).
def map_old_labels_to_new_labels(label):
    """Map an original label to a contiguous id in 0..3."""

    if label == 2:
        return 1
    elif label == 7:
        return 2
    elif label == 18:
        return 3
    else:
        # label 0 already has the right value
        return label

In [16]:
# turn off pandas warning
pd.options.mode.chained_assignment = None

# apply the function
reduced_dataframe["labels_new"] = reduced_dataframe["labels"].progress_apply(map_old_labels_to_new_labels)

# explore results
set(reduced_dataframe["labels_new"])

100%|██████████| 11676/11676 [00:00<00:00, 972993.20it/s]


{0, 1, 2, 3}
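
An alternative to the hand-written `if`/`elif` chain is to derive the mapping from the list of kept classes, so that it stays correct if the selection changes. The sketch below assumes the four classes kept above; `kept_labels`, `label_map` and the toy frame are hypothetical names used only for illustration.

```python
import pandas as pd

kept_labels = [0, 2, 7, 18]  # the four classes kept above

# Old label -> new contiguous id, in ascending order of the old label
label_map = {old: new for new, old in enumerate(sorted(kept_labels))}

# Hypothetical toy frame standing in for reduced_dataframe
toy = pd.DataFrame({"labels": [18, 0, 7, 2, 7]})
toy["labels_new"] = toy["labels"].map(label_map)

print(label_map)
print(toy)
```

One difference in behaviour: with `Series.map` and a dictionary, a label that is missing from `label_map` becomes `NaN` instead of passing through unchanged, which makes unexpected labels easier to spot.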

**2. How many words in each text?**

In [17]:
text_cols = ["text_en", "text_de", "text_it", "text_pl", "text_sv"]

for col in text_cols:

    # obtain the number of words in each text
    reduced_dataframe[col + "_len"] = reduced_dataframe[col].apply(lambda x: len(x.split()))
    
    # calculate its mean
    temp_mean = np.mean(reduced_dataframe[col + "_len"])

    # print res
    print(f"> Average number of words in '{col[-2:]}' texts: {round(temp_mean)}")

> Average number of words in 'en' texts: 1231
> Average number of words in 'de' texts: 1100
> Average number of words in 'it' texts: 1274
> Average number of words in 'pl' texts: 1058
> Average number of words in 'sv' texts: 1080


The average number of words per text ranges from about 1060 (Polish) to about 1270 (Italian). We will need this information in the tokenization and embedding phase.
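
The mean alone says little about the longest texts, which matter when a maximum sequence length has to be chosen. As a sketch on hypothetical toy texts (`toy_texts`), percentiles of the word counts can be computed alongside the mean; note that word counts are only a proxy, since subword tokenizers usually produce more tokens than words.

```python
import numpy as np
import pandas as pd

# Hypothetical toy texts standing in for one of the text_* columns
toy_texts = pd.Series([
    "council regulation on imports",
    "commission decision",
    "commission regulation fixing the import duties for certain products",
    "directive",
])

# Number of whitespace-separated words in each text
word_counts = toy_texts.str.split().str.len()

mean_len = word_counts.mean()
p50, p95 = np.percentile(word_counts, [50, 95])

print(f"mean: {mean_len:.1f}, median: {p50:.1f}, 95th percentile: {p95:.1f}")
```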

#### **Save the changes**

Finally, we can save this final version of the dataset and train some networks!

In [18]:
# reorder columns
reduced_dataframe = reduced_dataframe[["celex_id", "labels", "labels_new", "text_en", "text_de", "text_it", "text_pl", "text_sv"]]

reduced_dataframe.to_csv("../data/2_multi_eurlex_reduced_v2.csv", index=False)
print("> All done!")

> All done!
