## **NLP Practical**

### **Preprocessing**

It is now time to preprocess the texts.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px         # create interactive plots

# preprocessing
import spacy                        # tokenization
from collections import Counter     # count words occurrences

print("> Libraries Imported")



> Libraries Imported


#### **Import the dataset**

We read our custom (and already cleaned) CSV file.

In [2]:
dataframe = pd.read_csv("../data/2_multi_eurlex_reduced.csv")
dataframe.head()

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv
0,32006D0213,1,1,commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...
1,32003R1786,3,3,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...
2,32004R1038,3,3,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
3,32003R1012,2,2,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...
4,32003R2229,18,11,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...


#### **Preprocessing**

**Part 1: Tokenization**

Tokenization splits a sentence, paragraph or entire text document into smaller units (tokens), such as individual words or terms.

We create a custom function capable of tokenizing the texts in different languages.

In [3]:
# load the tokenizers (one for each language)

TOKENIZER_EN = spacy.load('en_core_web_sm')
TOKENIZER_DE = spacy.load('de_core_news_sm')
TOKENIZER_IT = spacy.load('it_core_news_sm')
TOKENIZER_PL = spacy.load('pl_core_news_sm')
TOKENIZER_SV = spacy.load('sv_core_news_sm')

In [4]:
# create custom function

def tokenize(text, language):

    # use the required tokenizer (based on input language)
    if language=="en":
        tokenizer = TOKENIZER_EN
    elif language=="de":
        tokenizer = TOKENIZER_DE
    elif language=="it":
        tokenizer = TOKENIZER_IT
    elif language=="pl":
        tokenizer = TOKENIZER_PL
    elif language=="sv":
        tokenizer = TOKENIZER_SV
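    else:
        # fail early: without this, `tokenizer` would be undefined below
        raise ValueError(
            f"Language '{language}' not available. Please choose between 'en', 'de', 'it', 'pl', 'sv'."
        )

    # strip underscores so that placeholders such as "_number_" become the plain token "number"
    text = text.replace("_","")
    # run only the tokenizer component, skipping the rest of the spaCy pipeline
    return [token.text for token in tokenizer.tokenizer(text)]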
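
The language dispatch can also be written as a dictionary lookup. The following is only a sketch under stated assumptions: `TOKENIZERS` and `tokenize_with_lookup` are hypothetical names, and plain `str.split` stands in for the spaCy tokenizers so that the snippet runs without any model installed.

```python
# Sketch: dictionary-based dispatch instead of an if/elif chain.
# ASSUMPTION: str.split is only a stand-in for the spaCy tokenizers.
TOKENIZERS = {
    "en": str.split,
    "de": str.split,
    "it": str.split,
    "pl": str.split,
    "sv": str.split,
}

def tokenize_with_lookup(text, language):
    """Tokenize `text` with the tokenizer registered for `language`."""
    try:
        tokenizer = TOKENIZERS[language]
    except KeyError:
        available = ", ".join(sorted(TOKENIZERS))
        raise ValueError(
            f"Language '{language}' not available. Choose one of: {available}."
        ) from None

    # same cleaning step as in the notebook: "_number_" becomes "number"
    text = text.replace("_", "")
    return tokenizer(text)

example_tokens = tokenize_with_lookup(
    "commission decision of _number_ march _number_", "en"
)
print(example_tokens)
```

With this layout, supporting an additional language would only require a new entry in the dictionary.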

Here we show a small example of tokenization for the first text (in each language). 

In [5]:
test_text_en = dataframe["text_en"][0]
test_text_de = dataframe["text_de"][0]
test_text_it = dataframe["text_it"][0]
test_text_pl = dataframe["text_pl"][0]
test_text_sv = dataframe["text_sv"][0]


print("Sample of 'en' text:", test_text_en[0:120])
print("Sample of 'de' text:", test_text_de[0:120])
print("Sample of 'it' text:", test_text_it[0:120])
print("Sample of 'pl' text:", test_text_pl[0:120])
print("Sample of 'sv' text:", test_text_sv[0:120])

Sample of 'en' text: commission decision of _number_ march _number_ establishing the classes of reaction to fire performance for certain cons
Sample of 'de' text: entscheidung der kommission vom _number_ marz _number_ zur festlegung der brandverhaltensklassen fur bestimmte bauproduk
Sample of 'it' text: decisione della commissione del _number_ marzo _number_ che determina le classi di reazione al fuoco di alcuni prodotti 
Sample of 'pl' text: decyzja komisji z dnia _number_ marca _number_ r ustanawiajaca klasy reakcji na ogien niektorych wyrobow budowlanych w o
Sample of 'sv' text: kommissionens beslut av den _number_ mars _number_ om indelning i klasser beroende pa reaktion vid brandpaverkan for vis


In [6]:
print("Example of tokenization in 'en':", tokenize(text=test_text_en, language="en")[0:10])
print("Example of tokenization in 'de':", tokenize(text=test_text_de, language="de")[0:10])
print("Example of tokenization in 'it':", tokenize(text=test_text_it, language="it")[0:10])
print("Example of tokenization in 'pl':", tokenize(text=test_text_pl, language="pl")[0:10])
print("Example of tokenization in 'sv':", tokenize(text=test_text_sv, language="sv")[0:10])

Example of tokenization in 'en': ['commission', 'decision', 'of', 'number', 'march', 'number', 'establishing', 'the', 'classes', 'of']
Example of tokenization in 'de': ['entscheidung', 'der', 'kommission', 'vom', 'number', 'marz', 'number', 'zur', 'festlegung', 'der']
Example of tokenization in 'it': ['decisione', 'della', 'commissione', 'del', 'number', 'marzo', 'number', 'che', 'determina', 'le']
Example of tokenization in 'pl': ['decyzja', 'komisji', 'z', 'dnia', 'number', 'marca', 'number', 'r', 'ustanawiajaca', 'klasy']
Example of tokenization in 'sv': ['kommissionens', 'beslut', 'av', 'den', 'number', 'mars', 'number', 'om', 'indelning', 'i']


**Part 2: Count Words Occurrences**

We now count the number of occurrences of each word (now a token) in each corpus.

In [7]:
# instantiate one Counter per language
COUNTS_EN = Counter()
COUNTS_DE = Counter()
COUNTS_IT = Counter()
COUNTS_PL = Counter()
COUNTS_SV = Counter()

for index, row in tqdm(dataframe.iterrows(), total=dataframe.shape[0], desc="> Counting words in texts"):
    COUNTS_EN.update(tokenize(row['text_en'], language="en"))
    COUNTS_DE.update(tokenize(row['text_de'], language="de"))
    COUNTS_IT.update(tokenize(row['text_it'], language="it"))
    COUNTS_PL.update(tokenize(row['text_pl'], language="pl"))
    COUNTS_SV.update(tokenize(row['text_sv'], language="sv"))

> Counting words in texts: 100%|██████████| 30825/30825 [04:33<00:00, 112.83it/s]
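
To make the behaviour of `Counter.update` concrete, here is a minimal sketch on a toy corpus. The two sentences and the names `toy_texts` and `toy_counts` are illustrative only and are not taken from the dataset; whitespace splitting replaces the spaCy tokenizer.

```python
from collections import Counter

# Toy corpus (illustrative, not from the dataset)
toy_texts = [
    "commission decision of number march number",
    "council regulation of number",
]

toy_counts = Counter()
for text in toy_texts:
    # update() adds the frequency of each token to the running totals
    toy_counts.update(text.split())

# the two most frequent tokens with their counts
print(toy_counts.most_common(2))
```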


In [8]:
# show the first 10 words of COUNTS_EN (in insertion order, not sorted by frequency)
for key, value in list(COUNTS_EN.items())[:10]:
    print(f"{key}: {value}")

commission: 239011
decision: 110058
of: 2122065
number: 3326816
march: 18376
establishing: 40092
the: 3798810
classes: 1341
reaction: 601
to: 1248629


*How many distinct words are in the dictionary of each language?*

In [9]:
print("Total Words in 'en' texts:", len(COUNTS_EN.keys()))
print("Total Words in 'de' texts:", len(COUNTS_DE.keys()))
print("Total Words in 'it' texts:", len(COUNTS_IT.keys()))
print("Total Words in 'pl' texts:", len(COUNTS_PL.keys()))
print("Total Words in 'sv' texts:", len(COUNTS_SV.keys()))

Total Words in 'en' texts: 78547
Total Words in 'de' texts: 251591
Total Words in 'it' texts: 106707
Total Words in 'pl' texts: 173840
Total Words in 'sv' texts: 229339


It is interesting to note how much the number of distinct words (the vocabulary size) varies across languages, even though the texts describe the same laws.

Comparing English and Swedish, for example, the Swedish vocabulary (229,339 words) is roughly 3 times larger than the English one (78,547 words).
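
The ratios can be checked directly from the counts printed above. This is a small sketch; `vocab_sizes` is a hypothetical name that simply holds the reported numbers.

```python
# Vocabulary sizes as reported in the output above
vocab_sizes = {"en": 78547, "de": 251591, "it": 106707, "pl": 173840, "sv": 229339}

for language, size in vocab_sizes.items():
    # ratio of this vocabulary to the English one
    ratio = size / vocab_sizes["en"]
    print(f"{language}: {size} distinct words ({ratio:.2f}x English)")
```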

Starting from the generated dictionaries, we can now create the 5 vocabularies.

In [10]:
# prepare placeholders
# - one for index mapping (index 0 is reserved for padding, index 1 for unknown words)
# - one with the list of all the words

EN_WORDS_TO_INDEX = {"":0, "UNK":1}
EN_WORDS = ["", "UNK"]

DE_WORDS_TO_INDEX = {"":0, "UNK":1}
DE_WORDS = ["", "UNK"]

IT_WORDS_TO_INDEX = {"":0, "UNK":1}
IT_WORDS = ["", "UNK"]

PL_WORDS_TO_INDEX = {"":0, "UNK":1}
PL_WORDS = ["", "UNK"]

SV_WORDS_TO_INDEX = {"":0, "UNK":1}
SV_WORDS = ["", "UNK"]


# iterate over each counter and populate the placeholders

for word in COUNTS_EN:
    EN_WORDS_TO_INDEX[word] = len(EN_WORDS)
    EN_WORDS.append(word)

for word in COUNTS_DE:
    DE_WORDS_TO_INDEX[word] = len(DE_WORDS)
    DE_WORDS.append(word)

for word in COUNTS_IT:
    IT_WORDS_TO_INDEX[word] = len(IT_WORDS)
    IT_WORDS.append(word)

for word in COUNTS_PL:
    PL_WORDS_TO_INDEX[word] = len(PL_WORDS)
    PL_WORDS.append(word)

for word in COUNTS_SV:
    SV_WORDS_TO_INDEX[word] = len(SV_WORDS)
    SV_WORDS.append(word)

print("> Dictionaries Generated")

> Dictionaries Generated
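
Since the five loops are identical, the same logic could be factored into one helper. The sketch below assumes a hypothetical function `build_vocab` and a toy `Counter`; it is not part of the notebook.

```python
from collections import Counter

def build_vocab(counts):
    """Return (word_to_index, words); index 0 is padding, index 1 is 'UNK'."""
    words = ["", "UNK"]
    word_to_index = {"": 0, "UNK": 1}
    for word in counts:
        # the next free index is the current length of the word list
        word_to_index[word] = len(words)
        words.append(word)
    return word_to_index, words

# toy counter (illustrative tokens only)
toy_counts = Counter("commission decision of number march number".split())
toy_word_to_index, toy_words = build_vocab(toy_counts)
print(toy_word_to_index)
```

Each vocabulary would then be a single call, for example `build_vocab(COUNTS_EN)`.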


In [19]:
print("Total Words in 'en' texts:", len(EN_WORDS))
print("Total Words in 'de' texts:", len(DE_WORDS))
print("Total Words in 'it' texts:", len(IT_WORDS))
print("Total Words in 'pl' texts:", len(PL_WORDS))
print("Total Words in 'sv' texts:", len(SV_WORDS))

Total Words in 'en' texts: 78549
Total Words in 'de' texts: 251593
Total Words in 'it' texts: 106709
Total Words in 'pl' texts: 173842
Total Words in 'sv' texts: 229341


**Part 3: Encoding**

We have everything we need to encode our texts!

Note: we set the maximum length of each encoded text to 1400 tokens, since the average length of the texts is around 1200-1400 words. Longer texts are truncated and shorter ones are padded with zeros.

In [11]:
# create our custom encoding function

def encode_text(text, word_to_index, language, N=1400):
    
    # 1. tokenize the text
    tokenized = tokenize(text, language)

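    # 2. encode the text: map each token to its index ("UNK" if unseen),
    #    then truncate to N tokens or right-pad with zeros (the padding index)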
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([word_to_index.get(word, word_to_index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]

    # finally, return the encoded text and its unpadded (possibly truncated) length
    return encoded, length

In [12]:
# test the function on the first english text

test_encoded = encode_text(
    text=dataframe["text_en"][0], 
    word_to_index=EN_WORDS_TO_INDEX, 
    language="en", 
    N=1400
    )

test_encoded

(array([2, 3, 4, ..., 0, 0, 0]), 513)
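
The padding, truncation and unknown-word handling can be illustrated on a tiny vocabulary. This is a sketch that uses plain lists instead of NumPy arrays; `encode_tokens`, `decode_ids` and the toy vocabulary are hypothetical and only mirror the logic of `encode_text`.

```python
def encode_tokens(tokens, word_to_index, n):
    """Map tokens to indices, truncate to n, and right-pad with 0."""
    ids = [word_to_index.get(token, word_to_index["UNK"]) for token in tokens][:n]
    length = len(ids)
    return ids + [0] * (n - length), length

def decode_ids(ids, length, words):
    """Map the first `length` indices back to words (padding is dropped)."""
    return [words[i] for i in ids[:length]]

# toy vocabulary (illustrative)
toy_words = ["", "UNK", "commission", "decision", "of", "number", "march"]
toy_word_to_index = {word: index for index, word in enumerate(toy_words)}

# "fire" is not in the toy vocabulary, so it maps to "UNK" (index 1)
encoded, length = encode_tokens(
    ["commission", "decision", "of", "fire"], toy_word_to_index, n=6
)
print(encoded, length)
print(decode_ids(encoded, length, toy_words))

# a text longer than n is cut to n tokens
truncated, truncated_length = encode_tokens(["of", "of", "of"], toy_word_to_index, n=2)
print(truncated, truncated_length)
```

The returned length makes it possible to ignore the padded positions later on.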

In [13]:
# finally, we can apply the function to the whole df, encoding all texts

dataframe['text_en_enc'] = dataframe['text_en'].progress_apply(lambda x: np.array(encode_text(x, EN_WORDS_TO_INDEX, language="en"), dtype=object))
dataframe['text_de_enc'] = dataframe['text_de'].progress_apply(lambda x: np.array(encode_text(x, DE_WORDS_TO_INDEX, language="de"), dtype=object))
dataframe['text_it_enc'] = dataframe['text_it'].progress_apply(lambda x: np.array(encode_text(x, IT_WORDS_TO_INDEX, language="it"), dtype=object))
dataframe['text_pl_enc'] = dataframe['text_pl'].progress_apply(lambda x: np.array(encode_text(x, PL_WORDS_TO_INDEX, language="pl"), dtype=object))
dataframe['text_sv_enc'] = dataframe['text_sv'].progress_apply(lambda x: np.array(encode_text(x, SV_WORDS_TO_INDEX, language="sv"), dtype=object))

100%|██████████| 30825/30825 [00:49<00:00, 617.20it/s]
100%|██████████| 30825/30825 [00:50<00:00, 606.70it/s]
100%|██████████| 30825/30825 [00:54<00:00, 566.63it/s]
100%|██████████| 30825/30825 [00:49<00:00, 625.32it/s]
100%|██████████| 30825/30825 [00:50<00:00, 609.34it/s]


In [14]:
# show df
dataframe

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv,text_en_enc,text_de_enc,text_it_enc,text_pl_enc,text_sv_enc
0,32006D0213,1,1,commission decision of _number_ march _number_...,entscheidung der kommission vom _number_ marz ...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...,"[[2, 3, 4, 5, 6, 5, 7, 8, 9, 4, 10, 11, 12, 13...","[[2, 3, 4, 5, 6, 7, 6, 8, 9, 3, 10, 11, 12, 13...","[[2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 12, 13, 1...","[[2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 12, 13, 1...","[[2, 3, 4, 5, 6, 7, 6, 8, 9, 10, 11, 12, 13, 1..."
1,32003R1786,3,3,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...,"[[133, 177, 34, 92, 5, 5, 4, 5, 178, 5, 44, 8,...","[[196, 31, 98, 6, 6, 42, 43, 5, 6, 197, 6, 49,...","[[219, 41, 110, 6, 6, 5, 53, 5, 6, 220, 6, 221...","[[242, 49, 39, 32, 6, 6, 4, 5, 6, 243, 6, 8, 1...","[[49, 189, 38, 33, 6, 6, 4, 5, 6, 190, 6, 8, 5..."
2,32004R1038,3,3,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...,"[[2, 177, 34, 92, 5, 5, 4, 5, 74, 5, 665, 8, 2...","[[196, 31, 98, 6, 6, 3, 4, 5, 6, 454, 6, 8, 78...","[[219, 41, 110, 6, 6, 3, 4, 5, 6, 471, 6, 8, 8...","[[242, 3, 39, 32, 6, 6, 4, 5, 6, 552, 6, 8, 10...","[[2, 189, 38, 33, 6, 6, 4, 5, 6, 442, 6, 8, 72..."
3,32003R1012,2,2,commission regulation ec no _number_ _number_ ...,verordnung eg nr _number_ _number_ der kommiss...,regolamento ce n _number_ _number_ della commi...,rozporzadzenie komisji we nr _number_ _number_...,kommissionens forordning eg nr _number_ _numbe...,"[[2, 177, 34, 92, 5, 5, 4, 5, 352, 5, 688, 14,...","[[196, 31, 98, 6, 6, 3, 4, 5, 6, 402, 6, 8, 80...","[[219, 41, 110, 6, 6, 3, 4, 5, 6, 417, 6, 418,...","[[242, 3, 39, 32, 6, 6, 4, 5, 6, 497, 6, 8, 11...","[[2, 189, 38, 33, 6, 6, 4, 5, 6, 397, 6, 8, 77..."
4,32003R2229,18,11,council regulation ec no _number_ _number_ of ...,verordnung eg nr _number_ _number_ des rates v...,regolamento ce n _number_ _number_ del consigl...,rozporzadzenie rady we nr _number_ _number_ z ...,radets forordning eg nr _number_ _number_ av d...,"[[133, 177, 34, 92, 5, 5, 4, 5, 43, 5, 690, 94...","[[196, 31, 98, 6, 6, 42, 43, 5, 6, 44, 6, 8, 8...","[[219, 41, 110, 6, 6, 5, 53, 5, 6, 54, 6, 8, 4...","[[242, 49, 39, 32, 6, 6, 4, 5, 6, 51, 6, 8, 11...","[[49, 189, 38, 33, 6, 6, 4, 5, 6, 52, 6, 8, 78..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
30820,32011D0151,4,4,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...,"[[2, 3, 4, 5, 6, 5, 688, 3, 5, 5, 34, 353, 354...","[[401, 3, 4, 5, 6, 7, 6, 8, 809, 3, 2, 6, 6, 3...","[[2, 3, 4, 5, 6, 7, 6, 8, 850, 42, 2, 6, 6, 41...","[[2, 3, 4, 5, 6, 7, 6, 8, 2486, 209, 6, 6, 39,...","[[2, 3, 4, 5, 6, 7, 6, 8, 778, 4, 3, 6, 6, 38,..."
30821,32010D0256,12,9,commission decision of _number_ april _number_...,beschluss der kommission vom _number_ april _n...,decisione della commissione del _number_ april...,decyzja komisji z dnia _number_ kwietnia _numb...,kommissionens beslut av den _number_ april _nu...,"[[2, 3, 4, 5, 248, 5, 688, 3, 5, 5, 42, 18, 19...","[[401, 3, 4, 5, 6, 275, 6, 8, 809, 3, 2, 6, 6,...","[[2, 3, 4, 5, 6, 300, 6, 418, 850, 3, 2, 6, 6,...","[[2, 3, 4, 5, 6, 343, 6, 8, 2486, 209, 6, 6, 5...","[[2, 3, 4, 5, 6, 265, 6, 8, 778, 4, 3, 6, 6, 5..."
30822,32010D0177,1,1,commission decision of _number_ march _number_...,beschluss der kommission vom _number_ marz _nu...,decisione della commissione del _number_ marzo...,decyzja komisji z dnia _number_ marca _number_...,kommissionens beslut av den _number_ mars _num...,"[[2, 3, 4, 5, 6, 5, 688, 3, 5, 5, 34, 112, 223...","[[401, 3, 4, 5, 6, 7, 6, 8, 809, 42, 736, 6, 6...","[[2, 3, 4, 5, 6, 7, 6, 8, 850, 42, 2, 6, 6, 41...","[[2, 3, 4, 5, 6, 7, 6, 8, 2486, 209, 6, 6, 39,...","[[2, 3, 4, 5, 6, 7, 6, 8, 778, 4, 3, 6, 6, 38,..."
30823,32012R0307,0,0,commission implementing regulation eu no _numb...,durchfuhrungsverordnung eu nr _number_ _number...,regolamento di esecuzione ue n _number_ _numbe...,rozporzadzenie wykonawcze komisji ue nr _numbe...,kommissionens genomforandeforordning eu nr _nu...,"[[2, 132, 177, 856, 92, 5, 5, 4, 5, 248, 5, 7,...","[[15611, 1010, 98, 6, 6, 3, 4, 5, 6, 275, 6, 8...","[[219, 12, 422, 1042, 110, 6, 6, 3, 4, 155, 6,...","[[242, 4085, 3, 1367, 32, 6, 6, 4, 5, 6, 343, ...","[[2, 54694, 974, 33, 6, 6, 4, 5, 6, 265, 6, 8,..."


#### **Train, Test and Validation splits**

We split the dataset into:
- training set (64% of the observations)
- validation set (16%)
- test set (20%)

In [15]:
from sklearn.model_selection import train_test_split

# SPLIT 1: train [0.8] and test [0.2]

# execute the split (random, stratified on the labels)
train_indexes, test_indexes, _, _ = train_test_split(
    dataframe["text_en_enc"],
    dataframe["labels_new"],
    test_size=0.2,
    stratify=dataframe["labels_new"],   # keep the label distribution in both sets
    random_state=42
)

# create placeholder
dataframe["set"] = "not specified"

# set value based on stratified split
dataframe.loc[train_indexes.index, "set"] = "train"
dataframe.loc[test_indexes.index, "set"] = "test"


# SPLIT 2: we split train again [0.8 / 0.2] in order to obtain the validation set

train_indexes, val_indexes, _, _ = train_test_split(
    dataframe.loc[dataframe["set"] == "train", "text_en_enc"],
    dataframe.loc[dataframe["set"] == "train", "labels_new"],
    test_size=0.2,
    stratify=dataframe.loc[dataframe["set"] == "train", "labels_new"],
    random_state=42
)

# update dataset with validation info
dataframe.loc[val_indexes.index, "set"] = "validation"

*How many observations for each set?*

In [16]:
Counter(dataframe['set'])

Counter({'train': 19728, 'test': 6165, 'validation': 4932})
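
Two successive 80/20 splits should give 0.8 × 0.8 = 64% training, 0.8 × 0.2 = 16% validation and 20% test observations. The sketch below checks this against the counts reported above; `set_sizes` is a hypothetical name holding those numbers.

```python
# Set sizes as reported by the Counter output above
set_sizes = {"train": 19728, "test": 6165, "validation": 4932}
total = sum(set_sizes.values())

for name, size in set_sizes.items():
    # share of the whole dataset that ends up in this set
    print(f"{name}: {size} observations ({size / total:.2%})")
```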

#### **Save preprocessed texts**

We save the data in pickle format in order to preserve the column types.

In [17]:
dataframe.to_pickle("../data/3_multi_eurlex_encoded.pkl")
print("> All done!")

> All done!
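
The reason for choosing pickle can be illustrated with the standard library alone. This sketch does not use pandas: the `row` dictionary is a made-up record with a shortened encoding, and the `csv` and `pickle` modules stand in for the corresponding pandas writers.

```python
import csv
import io
import pickle

# made-up record with a shortened encoded text (illustrative values)
row = {"celex_id": "32006D0213", "text_en_enc": [2, 3, 4, 0, 0]}

# pickle round trip: the list comes back as a list
restored = pickle.loads(pickle.dumps(row))

# CSV round trip: every value comes back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

print(type(restored["text_en_enc"]).__name__)
print(type(from_csv["text_en_enc"]).__name__)
```

After the CSV round trip the encoded text would have to be parsed back from a string, which the pickle format avoids.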
