## **NLP Practical**

### **Cleaning the Dataset**

After obtaining the dataset, which contains the texts to be classified in 5 different languages, we have to preprocess it.

#### **Libraries**

We import the necessary libraries for the notebook.

In [10]:
# general
import pandas as pd
from tqdm import tqdm
tqdm.pandas()

# string manipulation
import re
import unidecode

print("> Libraries Imported")

> Libraries Imported


#### **Import the dataset**

We can simply read our custom CSV file.

In [2]:
dataframe_complete = pd.read_csv("../data/0_multi_eurlex_custom.csv")
dataframe_complete

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,"[1, 20, 7, 3, 0]",COMMISSION DECISION\nof 6 March 2006\nestablis...,ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006\...,DECISIONE DELLA COMMISSIONE\ndel 6 marzo 2006\...,DECYZJA KOMISJI\nz dnia 6 marca 2006 r.\nustan...,KOMMISSIONENS BESLUT\nav den 6 mars 2006\nom i...
1,32003R1786,"[3, 19, 6]",Council Regulation (EC) No 1786/2003\nof 29 Se...,Verordnung (EG) Nr. 1786/2003 des Rates\nvom 2...,Regolamento (CE) n. 1786/2003 del Consiglio\nd...,Rozporządzenie Rady (WE) nr 1786/2003\nz dnia ...,Rådets förordning (EG) nr 1786/2003\nav den 29...
2,32004R1038,"[3, 17, 5]",COMMISSION REGULATION (EC) No 1038/2004\nof 27...,VERORDNUNG (EG) Nr. 1038/2004 DER KOMMISSION\n...,REGOLAMENTO (CE) N. 1038/2004 DELLA COMMISSION...,ROZPORZĄDZENIE KOMISJI (WE) NR 1038/2004\nz dn...,KOMMISSIONENS FÖRORDNING (EG) nr 1038/2004\nav...
3,32003R1012,"[2, 5, 10, 8, 3, 18, 15]",Commission Regulation (EC) No 1012/2003\nof 12...,Verordnung (EG) Nr. 1012/2003 der Kommission\n...,Regolamento (CE) n. 1012/2003 della Commission...,Rozporządzenie Komisji (WE) nr 1012/2003\nz dn...,Kommissionens förordning (EG) nr 1012/2003\nav...
4,32003R2229,"[18, 3, 4, 1]",Council Regulation (EC) No 2229/2003\nof 22 De...,Verordnung (EG) Nr. 2229/2003 des Rates\nvom 2...,Regolamento (CE) n. 2229/2003 del Consiglio\nd...,Rozporządzenie Rady (WE) nr 2229/2003\nz dnia ...,Rådets förordning (EG) nr 2229/2003\nav den 22...
...,...,...,...,...,...,...,...
32935,32011D0151,"[4, 11, 5, 0, 12, 15]",COMMISSION DECISION\nof 3 March 2011\namending...,BESCHLUSS DER KOMMISSION\nvom 3. März 2011\nzu...,DECISIONE DELLA COMMISSIONE\ndel 3 marzo 2011\...,DECYZJA KOMISJI\nz dnia 3 marca 2011 r.\nzmien...,KOMMISSIONENS BESLUT\nav den 3 mars 2011\nom ä...
32936,32010D0256,"[12, 0, 6]",COMMISSION DECISION\nof 30 April 2010\namendin...,BESCHLUSS DER KOMMISSION\nvom 30. April 2010\n...,DECISIONE DELLA COMMISSIONE\ndel 30 aprile 201...,DECYZJA KOMISJI\nz dnia 30 kwietnia 2010 r.\nz...,KOMMISSIONENS BESLUT\nav den 30 april 2010\nom...
32937,32010D0177,"[1, 4, 0, 3, 18]",COMMISSION DECISION\nof 23 March 2010\namendin...,BESCHLUSS DER KOMMISSION\nvom 23. März 2010\nz...,DECISIONE DELLA COMMISSIONE\ndel 23 marzo 2010...,DECYZJA KOMISJI\nz dnia 23 marca 2010 r.\nzmie...,KOMMISSIONENS BESLUT\nav den 23 mars 2010\nom ...
32938,32012R0307,"[0, 3, 17, 15]",COMMISSION IMPLEMENTING REGULATION (EU) No 307...,DURCHFÜHRUNGSVERORDNUNG (EU) Nr. 307/2012 DER ...,REGOLAMENTO DI ESECUZIONE (UE) N. 307/2012 DEL...,ROZPORZĄDZENIE WYKONAWCZE KOMISJI (UE) NR 307/...,KOMMISSIONENS GENOMFÖRANDEFÖRORDNING (EU) nr 3...


#### **Clean the texts**

First, let's inspect some texts.

In [3]:
dataframe_complete["text_en"][0]

'COMMISSION DECISION\nof 6 March 2006\nestablishing the classes of reaction-to-fire performance for certain construction products as regards wood flooring and solid wood panelling and cladding\n(notified under document number C(2006) 655)\n(Text with EEA relevance)\n(2006/213/EC)\nTHE COMMISSION OF THE EUROPEAN COMMUNITIES,\nHaving regard to the Treaty establishing the European Community,\nHaving regard to Directive 89/106/EEC of 21 December 1988, on the approximation of laws, regulations and administrative provisions of the Member States relating to construction products (1), and in particular Article 20(2) thereof,\nWhereas:\n(1)\nDirective 89/106/EEC envisages that in order to take account of different levels of protection for construction works at national, regional or local level, it may be necessary to establish in the interpretative documents classes corresponding to the performance of products in respect of each essential requirement. Those documents have been published as the 

There do not seem to be any particular problems, except for the special character *'\n'*, which we can simply replace with a space.

We can also remove the punctuation and replace every number with the placeholder keyword `_number_`.

**Note**

Usually, stopwords are removed and stemming or lemmatization is applied. However, this would require a list of stopwords for each language, along with language-specific dictionaries for stemming or lemmatization.

For now, we decide not to apply these procedures. By skipping them for every language, we can compare the performance of the models without having favored any language (for example, through a better stopword list).

It is time to create our custom cleaning function.

In [6]:
NUMBER_RULE = re.compile(r"\d+")              # one or more digits (raw string avoids an invalid-escape warning)
ONLY_LETTERS = re.compile(r"[^a-zA-Z _]+")    # anything that is not an ASCII letter, a space or an underscore
MULTIPLE_SPACES = re.compile(r" +")           # runs of one or more spaces

def clean_text(text):

    text = text.replace("\n", " ")                  # substitute \n with a space
    text = NUMBER_RULE.sub(" _number_ ", text)      # replace each number with the keyword _number_
    text = unidecode.unidecode(text)                # transliterate accented characters to ASCII (they may create errors in the future architectures)
    text = text.lower()                             # lowercase text
    text = ONLY_LETTERS.sub(" ", text)              # remove everything that is not a letter, a space or an underscore
    text = MULTIPLE_SPACES.sub(" ", text)           # collapse multiple spaces, if any

    # return the pre-processed text
    return text
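
To make the individual steps easier to follow, here is a minimal, self-contained sketch of the same pipeline applied to two short samples. It is not the function used in this notebook: `strip_accents` and `clean_text_sketch` are hypothetical names, and the standard-library `unicodedata` module stands in for `unidecode`. As a consequence, characters without a Unicode decomposition (for example the Polish `ł`) are not transliterated by this sketch, whereas `unidecode` maps them to ASCII.

```python
import re
import unicodedata

NUMBER_RULE = re.compile(r"\d+")
ONLY_LETTERS = re.compile(r"[^a-zA-Z _]+")
MULTIPLE_SPACES = re.compile(r" +")


def strip_accents(text):
    # Assumption: stdlib approximation of unidecode.
    # Decompose characters (e.g. "ä" -> "a" + combining diaeresis),
    # then drop the combining marks.
    decomposed = unicodedata.normalize("NFKD", text)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean_text_sketch(text):
    text = text.replace("\n", " ")
    # Numbers are replaced first; the underscore is kept in ONLY_LETTERS
    # so that the keyword "_number_" survives the later filtering step.
    text = NUMBER_RULE.sub(" _number_ ", text)
    text = strip_accents(text).lower()
    text = ONLY_LETTERS.sub(" ", text)
    return MULTIPLE_SPACES.sub(" ", text)


# repr() makes leading/trailing spaces visible
print(repr(clean_text_sketch("ENTSCHEIDUNG DER KOMMISSION\nvom 6. März 2006")))
print(repr(clean_text_sketch("Directive 89/106/EEC")))
# 'directive _number_ _number_ eec'
```

Note that a text ending in a number keeps a trailing space after cleaning, since the result is not stripped.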

Let's try it out on the first text.

In [7]:
clean_text(dataframe_complete["text_en"][0])

'commission decision of _number_ march _number_ establishing the classes of reaction to fire performance for certain construction products as regards wood flooring and solid wood panelling and cladding notified under document number c _number_ _number_ text with eea relevance _number_ _number_ ec the commission of the european communities having regard to the treaty establishing the european community having regard to directive _number_ _number_ eec of _number_ december _number_ on the approximation of laws regulations and administrative provisions of the member states relating to construction products _number_ and in particular article _number_ _number_ thereof whereas _number_ directive _number_ _number_ eec envisages that in order to take account of different levels of protection for construction works at national regional or local level it may be necessary to establish in the interpretative documents classes corresponding to the performance of products in respect of each essential 

It works as expected! We can now apply it to all the texts in the dataset.

In [12]:
# specify columns with texts
text_cols = ["text_en", "text_de", "text_it", "text_pl", "text_sv"]

# iterate over the text columns
for text_col in text_cols:
    print("\n > Cleaning column '" + text_col + "'")
    dataframe_complete[text_col] = dataframe_complete[text_col].progress_apply(clean_text)


 > Cleaning column 'text_en'
100%|██████████| 32940/32940 [01:19<00:00, 412.05it/s]

 > Cleaning column 'text_de'
100%|██████████| 32940/32940 [01:53<00:00, 290.72it/s]

 > Cleaning column 'text_it'
100%|██████████| 32940/32940 [01:54<00:00, 288.69it/s]

 > Cleaning column 'text_pl'
100%|██████████| 32940/32940 [01:50<00:00, 297.10it/s]

 > Cleaning column 'text_sv'
100%|██████████| 32940/32940 [01:44<00:00, 313.72it/s]


In [13]:
dataframe_complete

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,"[1, 20, 7, 3, 0]",commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
1,32003R1786,"[3, 19, 6]",council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
2,32004R1038,"[3, 17, 5]",commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
3,32003R1012,"[2, 5, 10, 8, 3, 18, 15]",commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
4,32003R2229,"[18, 3, 4, 1]",council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
...,...,...,...,...,...,...,...
32935,32011D0151,"[4, 11, 5, 0, 12, 15]",commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32936,32010D0256,"[12, 0, 6]",commission decision of _number_ april _number_...,beschluss der kommission vom _number_ april _n...,decisione della commissione del _number_ april...,decyzja komisji z dnia _number_ kwietnia _numb...,kommissionens beslut av den _number_ april _nu...
32937,32010D0177,"[1, 4, 0, 3, 18]",commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32938,32012R0307,"[0, 3, 17, 15]",commission implementing regulation eu no _numb...,durchfuhrungsverordnung eu nr _number_ _number...,regolamento di esecuzione ue n _number_ _numbe...,rozporzadzenie wykonawcze komisji ue nr _numbe...,kommissionens genomforandeforordning eu nr _nu...


#### **From multi-label to multi-class**

The dataset is designed for multi-label classification. However, as explained in the [paper](https://arxiv.org/pdf/2109.00904.pdf) on the dataset, the labels are *multi-granular*. 

This means that we can keep only the first label of each document and convert the task into multi-class classification.

We create a function that extracts the first value from the list of labels.

In [24]:
def clean_labels(labels):

    # first, we convert the string (e.g. "[1, 20, 7, 3, 0]") back to a list of strings
    labels = labels[1:-1].split(", ")

    # then, we extract the first element only and we return it (as int)
    return int(labels[0])
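
The slicing above relies on the labels being stored exactly as `"[a, b, c]"`, with a comma and a single space between values. As a possible alternative, and assuming the column always holds the string representation of a Python list, the standard-library `ast.literal_eval` can parse it regardless of spacing. The name `first_label` is hypothetical and only used for this sketch.

```python
import ast


def first_label(labels):
    # Parse the string representation of the list, e.g. "[1, 20, 7, 3, 0]"
    parsed = ast.literal_eval(labels)

    # Fail loudly on an empty list instead of returning a bogus class
    if not parsed:
        raise ValueError("empty label list")

    # Keep only the first label, as an int
    return int(parsed[0])


print(first_label("[1, 20, 7, 3, 0]"))  # 1
print(first_label("[3,19,6]"))          # 3
```

The rest of this section keeps using `clean_labels` as defined above.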

We can then apply it to the whole dataset.

In [29]:
dataframe_complete["labels"] = dataframe_complete["labels"].progress_apply(clean_labels)

100%|██████████| 32940/32940 [00:00<00:00, 658791.20it/s]


According to the paper, there are 21 classes in the first level.

We check that everything matches by listing the distinct classes in the dataset.

In [32]:
set(dataframe_complete["labels"])

{0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20}

There are indeed 21.
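
Beyond listing the distinct values, one might also want to verify the label set programmatically and look at how the documents are distributed across the classes. The sketch below uses a small made-up list (`toy_labels`, hypothetical) in place of `dataframe_complete["labels"]`; on the real column, `dataframe_complete["labels"].value_counts()` would give the per-class counts directly.

```python
from collections import Counter

# Hypothetical stand-in for dataframe_complete["labels"] after clean_labels
toy_labels = [1, 3, 3, 2, 18, 4, 12, 1, 0]

# The first-level classes are expected to be the integers 0..20 (21 classes)
expected_classes = set(range(21))

# Labels outside the expected range would point to a parsing problem
unexpected = set(toy_labels) - expected_classes

# Classes that never occur (many here, because the toy list is tiny)
missing = expected_classes - set(toy_labels)

# Number of documents per class
counts = Counter(toy_labels)

print("unexpected labels:", sorted(unexpected))
print("number of missing classes:", len(missing))
print("documents per class:", dict(counts))
```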

#### **Show cleaned dataset**

We now show the final dataset, on which we will train the architectures.

In [30]:
dataframe_complete

Unnamed: 0,celex_id,labels,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
1,32003R1786,3,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
2,32004R1038,3,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
3,32003R1012,2,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
4,32003R2229,18,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
...,...,...,...,...,...,...,...
32935,32011D0151,4,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32936,32010D0256,12,commission decision of _number_ april _number_...,beschluss der kommission vom _number_ april _n...,decisione della commissione del _number_ april...,decyzja komisji z dnia _number_ kwietnia _numb...,kommissionens beslut av den _number_ april _nu...
32937,32010D0177,1,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
32938,32012R0307,0,commission implementing regulation eu no _numb...,durchfuhrungsverordnung eu nr _number_ _number...,regolamento di esecuzione ue n _number_ _numbe...,rozporzadzenie wykonawcze komisji ue nr _numbe...,kommissionens genomforandeforordning eu nr _nu...


The cleaning procedure is time-consuming (especially for the texts).

We save the dataset so that we can reuse it later without repeating the cleaning.

In [33]:
dataframe_complete.to_csv("../data/1_multi_eurlex_clean.csv", index=False)
print("> All done!")

> All done!
