## **NLP Practical**

### **Exploring the Dataset**

After cleaning the dataset, we explore it in order to understand what data we are dealing with.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px

print("> Libraries Imported")

> Libraries Imported


#### **Import the dataset**

We can simply read our custom (and cleaned) CSV file.

In [2]:
dataframe = pd.read_csv("../data/1_multi_eurlex_clean.csv")
dataframe.head()

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
1,32003R1786,3,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
2,32004R1038,3,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
3,32003R1012,2,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
4,32003R2229,18,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...


#### **Data Exploration**

**1. How many occurrences per class?**

Let's see whether the classes are balanced or not.

First, a simple function to plot the number of texts for each class.

In [3]:
def create_obs_per_classes_barplot(df, width, height, title):
    """
    Input:
        > df            dataset (must contain the "labels" and "text_en" columns)
        > width         of plot
        > height        of plot
        > title         of plot
    Output:
        > prints and returns the plot object
    """

    # data manipulation for plot
    temp_df = pd.DataFrame(df.groupby("labels")["text_en"].count()).reset_index()

    # convert labels to string (so that they are treated as categoricals)
    temp_df.labels = temp_df.labels.astype(str)

    # create plot
    fig = px.bar(
        temp_df, 
        x="labels", 
        y="text_en", 
        color="labels",
        title=title,
        width=width, height=height, 
        text_auto=True,
        # keys must match the plotted column names for the axis titles to apply
        labels={
            "text_en": "Number of Observations",
            "labels": "Label (first level)"
        },
    )
    fig.update_layout(xaxis_tickangle=-45, showlegend=False)
    fig.update_traces(textangle=0, textposition="outside", cliponaxis=False)
    fig.show()

    return fig

Now we can plot!

In [4]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 21</sup>"
    )

The dataset is very unbalanced: many classes have fewer than 500 observations, while others exceed 2000 (in some cases by a wide margin, e.g. class 3).

As interesting as it would be to test how architectures behave with unbalanced data, classes such as 14 would only introduce noise, given their low number of observations. We therefore update the dataset, keeping only the classes with at least 500 observations.

In [5]:
# 1. groupby texts
temp_df = pd.DataFrame(dataframe.groupby("labels")["text_en"].count()).reset_index()

# 2. obtain the list of labels with at least 500 obs
classes_to_keep = list(temp_df.loc[temp_df["text_en"] >= 500]["labels"])

classes_to_keep

[0, 1, 2, 3, 4, 7, 8, 10, 11, 12, 17, 18]

In [6]:
# filter the dataset
reduced_dataframe = dataframe.loc[dataframe["labels"].isin(classes_to_keep)]
reduced_dataframe

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
1,32003R1786,3,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
2,32004R1038,3,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
3,32003R1012,2,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
4,32003R2229,18,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
...,...,...,...,...,...,...,...
32935,32011D0151,4,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32936,32010D0256,12,commission decision of _number_ april _number_...,beschluss der kommission vom _number_ april _n...,decisione della commissione del _number_ april...,decyzja komisji z dnia _number_ kwietnia _numb...,kommissionens beslut av den _number_ april _nu...
32937,32010D0177,1,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32938,32012R0307,0,commission implementing regulation eu no _numb...,durchfuhrungsverordnung eu nr _number_ _number...,regolamento di esecuzione ue n _number_ _numbe...,rozporzadzenie wykonawcze komisji ue nr _numbe...,kommissionens genomforandeforordning eu nr _nu...


In [7]:
# create plot
barplot = create_obs_per_classes_barplot(
    df = reduced_dataframe,
    width = 1400, 
    height = 600, 
    title = "Number of Texts for each Label<br><sup>Total Classes: 12</sup>"
    )

This is a much better situation: the dataset is still unbalanced, but every remaining class has a sufficient number of observations.
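To put a number on "still unbalanced", one option is to compare the largest and smallest class sizes. The snippet below is a minimal sketch on toy data; the helper name `summarize_class_balance` and the toy frame are assumptions made for illustration and are not part of the original notebook.

```python
import pandas as pd


def summarize_class_balance(df, label_col="labels"):
    """Return basic balance statistics for a label column (illustrative helper)."""
    # number of rows per class, ordered by label value
    counts = df[label_col].value_counts().sort_index()
    return {
        "n_classes": int(counts.size),
        "min_count": int(counts.min()),
        "max_count": int(counts.max()),
        # ratio between the most and the least populated class
        "imbalance_ratio": float(counts.max() / counts.min()),
    }


# toy data standing in for reduced_dataframe (hypothetical values)
toy_df = pd.DataFrame({"labels": [0, 0, 0, 0, 1, 1, 3]})
summary = summarize_class_balance(toy_df)
print(summary)
```

Calling `summarize_class_balance(reduced_dataframe)` in the notebook should give the same statistics for the real data.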

#### **Re-map classes**

To avoid errors during training, we need to remap the labels to consecutive integers (so that they go from 0 to 11), since classification layers typically expect class indices in the range 0 to n_classes - 1.

As we already did before, we can simply apply a custom function.

In [8]:
def map_old_labels_to_new_labels(label):

    if label == 7:
        return 5
    elif label == 8:
        return 6
    elif label == 10:
        return 7
    elif label == 11:
        return 8
    elif label == 12:
        return 9
    elif label == 17:
        return 10
    elif label == 18:
        return 11
    else:
        return label

In [9]:
# turn off pandas' chained-assignment warning (we add columns to a filtered dataframe)
pd.options.mode.chained_assignment = None

# apply the function
reduced_dataframe["labels_new"] = reduced_dataframe["labels"].progress_apply(map_old_labels_to_new_labels)

# explore results
set(reduced_dataframe["labels_new"])

100%|██████████| 30825/30825 [00:00<00:00, 906619.78it/s]


{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11}
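The hard-coded `if`/`elif` chain works for this specific set of classes, but it has to be rewritten whenever the threshold changes. As an alternative, the mapping can be derived from the list of kept labels. This is a sketch, not the notebook's own approach; `build_label_mapping` is a hypothetical helper name, and the toy series stands in for `reduced_dataframe["labels"]`.

```python
import pandas as pd


def build_label_mapping(kept_labels):
    """Map each kept label to its rank, giving consecutive ids from 0 (illustrative helper)."""
    return {old: new for new, old in enumerate(sorted(set(kept_labels)))}


# same list obtained earlier in the notebook
classes_to_keep = [0, 1, 2, 3, 4, 7, 8, 10, 11, 12, 17, 18]
label_mapping = build_label_mapping(classes_to_keep)

# toy labels standing in for reduced_dataframe["labels"]
toy_labels = pd.Series([1, 3, 7, 18, 12])
remapped = toy_labels.map(label_mapping)
print(remapped.tolist())
```

For these twelve classes the derived dictionary agrees with the manual function (for example 7 becomes 5 and 18 becomes 11). `Series.map` with a dictionary is also vectorized, so no progress bar is needed.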

**2. How many words in each text?**

In [12]:
text_cols = ["text_en", "text_de", "text_it", "text_pl", "text_sv"]

for col in text_cols:

    # obtain the number of words in each text
    reduced_dataframe[col + "_len"] = reduced_dataframe[col].apply(lambda x: len(x.split()))
    
    # calculate its mean
    temp_mean = np.mean(reduced_dataframe[col + "_len"])

    # print res
    print(f"> Average number of words in '{col[-2:]}' texts: {round(temp_mean)}")

> Average number of words in 'en' texts: 1384
> Average number of words in 'de' texts: 1247
> Average number of words in 'it' texts: 1426
> Average number of words in 'pl' texts: 1204
> Average number of words in 'sv' texts: 1229


The average number of words per text ranges from roughly 1,200 to 1,400, depending on the language. We will need this information in the tokenization and embedding phase.
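The mean alone can hide a long tail of very long documents, which matters when choosing a maximum sequence length. A possible complement is to look at percentiles of the word counts. The sketch below uses toy texts; `word_count_percentiles` is a hypothetical helper, and the percentile choices are assumptions. Note that word counts are only a proxy: subword tokenizers usually produce more tokens than whitespace-separated words.

```python
import numpy as np
import pandas as pd


def word_count_percentiles(texts, percentiles=(50, 90, 95, 99)):
    """Return selected percentiles of whitespace-separated word counts (illustrative helper)."""
    # number of whitespace-separated words in each text
    lengths = texts.str.split().str.len()
    values = np.percentile(lengths, percentiles)
    return {p: float(v) for p, v in zip(percentiles, values)}


# toy texts standing in for a column such as reduced_dataframe["text_en"]
toy_texts = pd.Series(["a b c", "a b", "a b c d e", "a"])
stats = word_count_percentiles(toy_texts, percentiles=(50, 100))
print(stats)
```

Applied to each text column, this shows how many documents would be truncated at a given length.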

#### **Save the changes**

Finally, we can save this final version of the dataset and train some networks!

In [11]:
# reorder columns
reduced_dataframe = reduced_dataframe[["celex_id", "labels", "labels_new", "text_en", "text_de", "text_it", "text_pl", "text_sv"]]

reduced_dataframe.to_csv("../data/2_multi_eurlex_reduced.csv", index=False)
print("> All done!")

> All done!
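Saving with `index=False` matters here: the `Unnamed: 0` column visible in the first table is what an index saved to CSV looks like when the file is read back. The round trip below is a small sketch on toy data written to a temporary directory; the file name and values are made up for illustration.

```python
import os
import tempfile

import pandas as pd

# toy frame standing in for reduced_dataframe (hypothetical values)
toy_df = pd.DataFrame(
    {"celex_id": ["32006D0213", "32003R1786"], "labels_new": [1, 3]}
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "toy.csv")
    # index=False keeps the row index out of the file
    toy_df.to_csv(path, index=False)
    reloaded = pd.read_csv(path)

# the reloaded frame has exactly the saved columns, with no "Unnamed: 0"
print(list(reloaded.columns))
```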
