## **NLP Practical**

### **Preprocessing**

It is now time to preprocess the texts.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px         # create interactive plots

# preprocessing
import spacy                        # tokenization
from collections import Counter     # count words occurrences

print("> Libraries Imported")



> Libraries Imported


#### **Import the dataset**

We read our custom (and cleaned) CSV file.

In [2]:
dataframe = pd.read_csv("../data/2_multi_eurlex_reduced.csv")
dataframe

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv
0,32010D0395,2,0,commission decision of december on state aid c...,beschluss der kommission vom dezember uber die...,decisione della commissione del dicembre conce...,decyzja komisji z dnia grudnia r w sprawie pom...,kommissionens beslut av den december om det st...
1,32012R0453,2,0,commission implementing regulation eu no of ma...,durchfuhrungsverordnung eu nr der kommission v...,regolamento di esecuzione ue n della commissio...,rozporzadzenie wykonawcze komisji ue nr z dnia...,kommissionens genomforandeforordning eu nr av ...
2,32012D0043,2,0,commission implementing decision of january au...,durchfuhrungsbeschluss der kommission vom janu...,decisione di esecuzione della commissione del ...,decyzja wykonawcza komisji z dnia stycznia r u...,kommissionens genomforandebeslut av den januar...
3,32007R0730,2,0,commission regulation ec no of june establishi...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
4,32009R0375,2,0,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...
...,...,...,...,...,...,...,...,...
5995,32013R0519,18,2,commission regulation eu no of february adapti...,verordnung eu nr der kommission vom februar zu...,regolamento ue n della commissione del febbrai...,rozporzadzenie komisji ue nr z dnia lutego r d...,kommissionens forordning eu nr av den februari...
5996,32008D0914,18,2,commission decision of june on the confirmatio...,entscheidung der kommission juni zur bestatigu...,decisione della commissione dell giugno recant...,decyzja komisji z dnia czerwca r w sprawie zat...,kommissionens beslut av den juni om godkannand...
5997,31999R2502,18,2,commission regulation ec no of november amendi...,verordnung eg nr der kommission vom november z...,regolamento ce n della commissione del novembr...,rozporzadzenie komisji we nr z dnia listopada ...,kommissionens forordning eg nr av den november...
5998,32008D0847,18,2,council decision of november on the eligibilit...,beschluss des rates vom november uber die ford...,decisione del consiglio del novembre sull ammi...,decyzja rady z dnia listopada r w sprawie kwal...,radets beslut av den november om berattigande ...


#### **Preprocessing**

**Part 1: Tokenization**

This essentially involves dividing a sentence, paragraph or entire text document into smaller units, such as individual words or terms.
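
As a minimal illustration of the idea (this is not the spaCy tokenizer used in this notebook), a text that has already been lowercased and stripped of punctuation can be split on whitespace. The `sample` string is the beginning of the first English text.

```python
# Toy illustration: split an already-cleaned sentence into word tokens.
sample = "commission decision of december on state aid"

# str.split() with no arguments splits on any run of whitespace
tokens = sample.split()

print(tokens)
print("number of tokens:", len(tokens))
```

A real tokenizer also handles punctuation, contractions and language-specific rules, which is why the notebook relies on spaCy.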

We create a custom function capable of tokenizing the texts in different languages.

In [3]:
# load the tokenizers (one for each language)

TOKENIZER_EN = spacy.load('en_core_web_sm')
TOKENIZER_DE = spacy.load('de_core_news_sm')   # German model ('nl_core_news_sm' is Dutch)
TOKENIZER_IT = spacy.load('it_core_news_sm')
TOKENIZER_PL = spacy.load('pl_core_news_sm')
TOKENIZER_SV = spacy.load('sv_core_news_sm')

In [4]:
# create custom function

def tokenize(text, language):

    # use the required tokenizer (based on input language)
    if language=="en":
        tokenizer = TOKENIZER_EN
    elif language=="de":
        tokenizer = TOKENIZER_DE
    elif language=="it":
        tokenizer = TOKENIZER_IT
    elif language=="pl":
        tokenizer = TOKENIZER_PL
    elif language=="sv":
        tokenizer = TOKENIZER_SV
    else:
        # fail loudly: otherwise `tokenizer` would be undefined below
        raise ValueError(
            "language '" + language + "' not available. Please choose between 'en', 'de', 'it', 'pl', 'sv'."
        )

    text = text.replace("_","")
    return [token.text for token in tokenizer.tokenizer(text)]
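
The language selection can also be written as a dictionary lookup instead of an `if`/`elif` chain. The sketch below is only an illustration: `TOKENIZERS` and `tokenize_with_lookup` are hypothetical names, and plain whitespace splitting stands in for the spaCy pipelines so that the example runs without any model installed.

```python
# Stand-in tokenizers: whitespace splitting replaces the spaCy pipelines here.
TOKENIZERS = {lang: str.split for lang in ("en", "de", "it", "pl", "sv")}

def tokenize_with_lookup(text, language):
    """Pick the tokenizer for `language` and apply it to `text`."""
    try:
        tokenizer = TOKENIZERS[language]
    except KeyError:
        # unknown language: raise instead of continuing with no tokenizer
        raise ValueError(
            f"language '{language}' not available; choose one of {sorted(TOKENIZERS)}"
        ) from None
    # same underscore removal as in the notebook's tokenize function
    return tokenizer(text.replace("_", ""))

print(tokenize_with_lookup("commission decision of december", "en"))
```

With the real models, the dictionary values would be the loaded spaCy objects, and adding a language would only require one new entry.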

Here we show a small example of tokenization for the first text (in each language). 

In [5]:
test_text_en = dataframe["text_en"][0]
test_text_de = dataframe["text_de"][0]
test_text_it = dataframe["text_it"][0]
test_text_pl = dataframe["text_pl"][0]
test_text_sv = dataframe["text_sv"][0]


print("Sample of 'en' text:", test_text_en[0:120])
print("Sample of 'de' text:", test_text_de[0:120])
print("Sample of 'it' text:", test_text_it[0:120])
print("Sample of 'pl' text:", test_text_pl[0:120])
print("Sample of 'sv' text:", test_text_sv[0:120])

Sample of 'en' text: commission decision of december on state aid c ex n by germany for the restructuring of landesbank baden wurttemberg not
Sample of 'de' text: beschluss der kommission vom dezember uber die staatliche beihilfe c ex n deutschlands zur umstrukturierung der landesba
Sample of 'it' text: decisione della commissione del dicembre concernente l aiuto di stato c ex n eseguito dalla germania a favore della rist
Sample of 'pl' text: decyzja komisji z dnia grudnia r w sprawie pomocy panstwa c ex n ktorej niemcy zamierzaja udzielic w celu restrukturyzac
Sample of 'sv' text: kommissionens beslut av den december om det statliga stod c f d n som tyskland har genomfort for rekapitalisering av lan


In [6]:
print("Example of tokenization in 'en':", tokenize(text=test_text_en, language="en")[0:10])
print("Example of tokenization in 'de':", tokenize(text=test_text_de, language="de")[0:10])
print("Example of tokenization in 'it':", tokenize(text=test_text_it, language="it")[0:10])
print("Example of tokenization in 'pl':", tokenize(text=test_text_pl, language="pl")[0:10])
print("Example of tokenization in 'sv':", tokenize(text=test_text_sv, language="sv")[0:10])

Example of tokenization in 'en': ['commission', 'decision', 'of', 'december', 'on', 'state', 'aid', 'c', 'ex', 'n']
Example of tokenization in 'de': ['beschluss', 'der', 'kommission', 'vom', 'dezember', 'uber', 'die', 'staatliche', 'beihilfe', 'c']
Example of tokenization in 'it': ['decisione', 'della', 'commissione', 'del', 'dicembre', 'concernente', 'l', 'aiuto', 'di', 'stato']
Example of tokenization in 'pl': ['decyzja', 'komisji', 'z', 'dnia', 'grudnia', 'r', 'w', 'sprawie', 'pomocy', 'panstwa']
Example of tokenization in 'sv': ['kommissionens', 'beslut', 'av', 'den', 'december', 'om', 'det', 'statliga', 'stod', 'c']


**Part 2: Count Words Occurrences**

We now count the number of occurrences of each word (now a token) in each of the five corpora.

In [7]:
# instantiate a Counter
COUNTS_EN = Counter()
COUNTS_DE = Counter()
COUNTS_IT = Counter()
COUNTS_PL = Counter()
COUNTS_SV = Counter()

for index, row in tqdm(dataframe.iterrows(), total=dataframe.shape[0], desc="> Counting words in texts"):
    COUNTS_EN.update(tokenize(row['text_en'], language="en"))
    COUNTS_DE.update(tokenize(row['text_de'], language="de"))
    COUNTS_IT.update(tokenize(row['text_it'], language="it"))
    COUNTS_PL.update(tokenize(row['text_pl'], language="pl"))
    COUNTS_SV.update(tokenize(row['text_sv'], language="sv"))

> Counting words in texts: 100%|██████████| 6000/6000 [00:48<00:00, 123.63it/s]


In [8]:
# show the first 10 words of COUNTS_EN (insertion order, not frequency order)
for key, value in list(COUNTS_EN.items())[:10]:
    print(f"{key}: {value}")

commission: 38904
decision: 17465
of: 345645
december: 7003
on: 67601
state: 13320
aid: 9671
c: 6715
ex: 1770
n: 1249
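
The words above appear in the order in which they were first seen, not by frequency. As a small sketch on a toy corpus (the three `docs` strings are made up for illustration), `Counter.most_common` returns the entries sorted by count instead:

```python
from collections import Counter

# Toy corpus, already cleaned; whitespace splitting stands in for spaCy.
docs = [
    "commission decision of december",
    "commission regulation of may",
    "decision of the council",
]

counts = Counter()
for doc in docs:
    counts.update(doc.split())  # add one count per token occurrence

# most frequent tokens first
print(counts.most_common(3))
```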


*How many distinct words are in the dictionary of each language?*

In [9]:
print("Total Words in 'en' texts:", len(COUNTS_EN.keys()))
print("Total Words in 'de' texts:", len(COUNTS_DE.keys()))
print("Total Words in 'it' texts:", len(COUNTS_IT.keys()))
print("Total Words in 'pl' texts:", len(COUNTS_PL.keys()))
print("Total Words in 'sv' texts:", len(COUNTS_SV.keys()))

Total Words in 'en' texts: 31454
Total Words in 'de' texts: 87084
Total Words in 'it' texts: 44570
Total Words in 'pl' texts: 74691
Total Words in 'sv' texts: 79973


Note how much the number of distinct words needed to describe the same laws varies from one language to another.

Comparing English and Swedish, for example, the Swedish vocabulary is about 2.5 times the size of the English one (79973 vs. 31454 distinct words).
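
The ratios can be checked directly from the counts printed above. This is just the arithmetic; the `vocab_sizes` dictionary is a hand-copied summary of that output, not a variable from the notebook.

```python
# Vocabulary sizes copied from the output above.
vocab_sizes = {"en": 31454, "de": 87084, "it": 44570, "pl": 74691, "sv": 79973}

for lang, size in vocab_sizes.items():
    ratio = size / vocab_sizes["en"]  # size relative to the English vocabulary
    print(f"{lang}: {size} distinct words, {ratio:.2f}x English")
```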

The number of words is very high, so we remove the uncommon ones and keep only the words that occur at least 100 times.

In [10]:
COUNTS_EN = {key:val for key, val in COUNTS_EN.items() if val >= 100}
COUNTS_DE = {key:val for key, val in COUNTS_DE.items() if val >= 100}
COUNTS_IT = {key:val for key, val in COUNTS_IT.items() if val >= 100}
COUNTS_PL = {key:val for key, val in COUNTS_PL.items() if val >= 100}
COUNTS_SV = {key:val for key, val in COUNTS_SV.items() if val >= 100}
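
The filtering step can be tried on a toy counter. One detail worth noticing is that the dict comprehension returns a plain `dict`, not a `Counter`, so looking up a removed word raises `KeyError` rather than returning 0. The counts and the names `MIN_COUNT` and `frequent` below are illustrative only.

```python
from collections import Counter

# Toy counts (made up for illustration).
counts = Counter({"commission": 250, "decision": 120, "landesbank": 3})

MIN_COUNT = 100  # same threshold as in the notebook

# keep only the words seen at least MIN_COUNT times
frequent = {word: n for word, n in counts.items() if n >= MIN_COUNT}

print(frequent)
print(type(frequent).__name__)    # a plain dict after the comprehension
print("landesbank" in frequent)   # the rare word is gone

# wrapping the result restores Counter behaviour (missing keys count as 0)
frequent_counter = Counter(frequent)
print(frequent_counter["landesbank"])
```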

Let's observe the changes.

In [11]:
print("Total Words in 'en' texts:", len(COUNTS_EN.keys()))
print("Total Words in 'de' texts:", len(COUNTS_DE.keys()))
print("Total Words in 'it' texts:", len(COUNTS_IT.keys()))
print("Total Words in 'pl' texts:", len(COUNTS_PL.keys()))
print("Total Words in 'sv' texts:", len(COUNTS_SV.keys()))

Total Words in 'en' texts: 3504
Total Words in 'de' texts: 4214
Total Words in 'it' texts: 4178
Total Words in 'pl' texts: 5253
Total Words in 'sv' texts: 4008


Starting from the generated dictionaries, we can now create the 5 vocabularies.

In [12]:
# prepare placeholders
# - one for index mapping
# - one with the list of all the words

EN_WORDS_TO_INDEX = {"":0, "UNK":1}
EN_WORDS = ["", "UNK"]

DE_WORDS_TO_INDEX = {"":0, "UNK":1}
DE_WORDS = ["", "UNK"]

IT_WORDS_TO_INDEX = {"":0, "UNK":1}
IT_WORDS = ["", "UNK"]

PL_WORDS_TO_INDEX = {"":0, "UNK":1}
PL_WORDS = ["", "UNK"]

SV_WORDS_TO_INDEX = {"":0, "UNK":1}
SV_WORDS = ["", "UNK"]


# iterate over each counter and populate the placeholders

for word in COUNTS_EN:
    EN_WORDS_TO_INDEX[word] = len(EN_WORDS)
    EN_WORDS.append(word)

for word in COUNTS_DE:
    DE_WORDS_TO_INDEX[word] = len(DE_WORDS)
    DE_WORDS.append(word)

for word in COUNTS_IT:
    IT_WORDS_TO_INDEX[word] = len(IT_WORDS)
    IT_WORDS.append(word)

for word in COUNTS_PL:
    PL_WORDS_TO_INDEX[word] = len(PL_WORDS)
    PL_WORDS.append(word)

for word in COUNTS_SV:
    SV_WORDS_TO_INDEX[word] = len(SV_WORDS)
    SV_WORDS.append(word)

print("> Dictionaries Generated")

> Dictionaries Generated


In [13]:
print("Total Words in 'en' texts:", len(EN_WORDS))
print("Total Words in 'de' texts:", len(DE_WORDS))
print("Total Words in 'it' texts:", len(IT_WORDS))
print("Total Words in 'pl' texts:", len(PL_WORDS))
print("Total Words in 'sv' texts:", len(SV_WORDS))

Total Words in 'en' texts: 3506
Total Words in 'de' texts: 4216
Total Words in 'it' texts: 4180
Total Words in 'pl' texts: 5255
Total Words in 'sv' texts: 4010
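
Each vocabulary is exactly two entries larger than its filtered counter, because index 0 is reserved for the empty (padding) token and index 1 for `UNK`. A small sketch of the same construction on three toy words (the names `frequent_words`, `word_to_index` and `index_to_word` are illustrative):

```python
# Toy list of frequent words (illustrative).
frequent_words = ["commission", "decision", "of"]

# index 0: empty/padding token, index 1: unknown-word token
word_to_index = {"": 0, "UNK": 1}
index_to_word = ["", "UNK"]

for word in frequent_words:
    # the next free index is the current length of the word list
    word_to_index[word] = len(index_to_word)
    index_to_word.append(word)

print(word_to_index)
print("extra entries:", len(index_to_word) - len(frequent_words))

# a word outside the vocabulary falls back to the UNK index
print(word_to_index.get("landesbank", word_to_index["UNK"]))
```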


**Part 3: Encoding**

We have everything we need to encode our texts!

Note: we cap each encoded text at 1000 tokens, as the average length of the texts is around that value. Shorter texts are padded with zeros (the index of the empty token) and longer texts are truncated.

In [14]:
# create our custom encoding function

def encode_text(text, word_to_index, language, N=1000):
    
    # 1. tokenize the text
    tokenized = tokenize(text, language)

    # 2. encode the text
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([word_to_index.get(word, word_to_index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]

    # finally, return encoded text
    return encoded, length
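
To make the padding and truncation behaviour concrete, here is a sketch of the same logic using plain Python lists instead of NumPy, on a toy vocabulary. `encode_tokens` and `vocab` are hypothetical names, and the input is assumed to be already tokenized.

```python
def encode_tokens(tokens, word_to_index, max_len):
    """Map tokens to indices, then pad with 0 or truncate to max_len."""
    # unknown words fall back to the UNK index
    ids = [word_to_index.get(tok, word_to_index["UNK"]) for tok in tokens]
    length = min(max_len, len(ids))
    # keep at most max_len indices and fill the rest with the padding index 0
    padded = ids[:length] + [0] * (max_len - length)
    return padded, length

vocab = {"": 0, "UNK": 1, "commission": 2, "decision": 3}

# shorter than max_len: padded, with one unknown word
print(encode_tokens(["commission", "decision", "landesbank"], vocab, max_len=5))

# longer than max_len: truncated
print(encode_tokens(["commission"] * 7, vocab, max_len=5))
```

The returned `length` records how many positions hold real tokens, which lets later steps ignore the padding.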

In [15]:
# test the function on the first english text

test_encoded = encode_text(
    text=dataframe["text_en"][0], 
    word_to_index=EN_WORDS_TO_INDEX, 
    language="en", 
    N=1000
    )

test_encoded

(array([  2,   3,   4,   5,   6,   7,   8,   9,  10,  11,  12,  13,  14,
         15,  16,   4,  17,   1,   1,  18,  19,  20,   9,  21,  15,  22,
         23,  24,  25,  23,  26,  27,  28,  29,  15,  30,   2,  31,  32,
         33,  15,  34,   6,  15,  35,   4,  15,  30,  36,  37,  38,  39,
         15,  40,  41,   4,  42,  43,  31,  32,  33,  15,  44,   6,  15,
         30,  45,  46,  37,  38,  39,  42,  47,  43,  31,  48,   6,  15,
         49,  50,  37,  51,  52,  53,  33,  54,  55,  56,  37,  31,  32,
         33,  55,  56,  57,  58,  38,   3,   9,   4,  59,  38,  60,   9,
         61,  62,  33,  63,  15,  64,   3,  15,   2,  65,  14,  47,  66,
         67,  68,  69,   4,  70,  37,  68,  71,  72,  73,  12,  15,   7,
          4,   1,   1,  37,  51,  74,  75,  14,  15,  17,   1,   1,  61,
         62,  33,  63,  76,  77,  15,  78,  79,  15,  80,  81,  15,   2,
         82,  83,  38,  15,  84,   3,  63,  33,  85,  15,  71,  72,  73,
         24,  86,  26,  29,   7,   8,  87,  26,  32

In [16]:
# finally, we can apply the function to the whole df, encoding all texts

dataframe['text_en_enc'] = dataframe['text_en'].progress_apply(lambda x: np.array(encode_text(x, EN_WORDS_TO_INDEX, language="en"), dtype=object))
dataframe['text_de_enc'] = dataframe['text_de'].progress_apply(lambda x: np.array(encode_text(x, DE_WORDS_TO_INDEX, language="de"), dtype=object))
dataframe['text_it_enc'] = dataframe['text_it'].progress_apply(lambda x: np.array(encode_text(x, IT_WORDS_TO_INDEX, language="it"), dtype=object))
dataframe['text_pl_enc'] = dataframe['text_pl'].progress_apply(lambda x: np.array(encode_text(x, PL_WORDS_TO_INDEX, language="pl"), dtype=object))
dataframe['text_sv_enc'] = dataframe['text_sv'].progress_apply(lambda x: np.array(encode_text(x, SV_WORDS_TO_INDEX, language="sv"), dtype=object))

100%|██████████| 6000/6000 [00:07<00:00, 793.88it/s] 
100%|██████████| 6000/6000 [00:07<00:00, 820.38it/s] 
100%|██████████| 6000/6000 [00:08<00:00, 745.46it/s]
100%|██████████| 6000/6000 [00:07<00:00, 804.22it/s] 
100%|██████████| 6000/6000 [00:08<00:00, 737.93it/s] 


In [17]:
# show df
dataframe

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv,text_en_enc,text_de_enc,text_it_enc,text_pl_enc,text_sv_enc
0,32010D0395,2,0,commission decision of december on state aid c...,beschluss der kommission vom dezember uber die...,decisione della commissione del dicembre conce...,decyzja komisji z dnia grudnia r w sprawie pom...,kommissionens beslut av den december om det st...,"[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 14, ..."
1,32012R0453,2,0,commission implementing regulation eu no of ma...,durchfuhrungsverordnung eu nr der kommission v...,regolamento di esecuzione ue n della commissio...,rozporzadzenie wykonawcze komisji ue nr z dnia...,kommissionens genomforandeforordning eu nr av ...,"[[2, 1275, 1276, 29, 100, 4, 743, 1277, 15, 12...","[[1302, 33, 1303, 3, 4, 5, 807, 15, 1304, 3, 6...","[[453, 10, 1422, 38, 14, 3, 4, 5, 990, 1423, 1...","[[1753, 1754, 3, 34, 24, 4, 5, 829, 7, 1755, 9...","[[2, 1239, 33, 23, 4, 5, 806, 7, 774, 4, 132, ..."
2,32012D0043,2,0,commission implementing decision of january au...,durchfuhrungsbeschluss der kommission vom janu...,decisione di esecuzione della commissione del ...,decyzja wykonawcza komisji z dnia stycznia r u...,kommissionens genomforandebeslut av den januar...,"[[2, 1275, 3, 4, 1310, 1311, 15, 1015, 4, 1312...","[[1344, 3, 4, 5, 1345, 15, 1346, 74, 1347, 134...","[[2, 10, 1422, 3, 4, 5, 1454, 245, 1455, 24, 1...","[[2, 1791, 3, 4, 5, 1792, 7, 1, 1793, 1794, 65...","[[2, 1279, 4, 5, 1280, 7, 1281, 19, 1282, 1283..."
3,32007R0730,2,0,commission regulation ec no of june establishi...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...,"[[2, 1276, 1284, 100, 4, 59, 1285, 15, 1361, 1...","[[1311, 1312, 1303, 3, 4, 5, 66, 15, 1403, 140...","[[453, 1432, 14, 3, 4, 5, 76, 1423, 1510, 167,...","[[1753, 3, 273, 24, 4, 5, 77, 7, 1763, 1858, 1...","[[2, 1245, 1247, 23, 4, 5, 71, 7, 517, 4, 1328..."
4,32009R0375,2,0,commission regulation ec no of may fixing the ...,verordnung eg nr der kommission vom mai zur fe...,regolamento ce n della commissione del maggio ...,rozporzadzenie komisji we nr z dnia maja r ust...,kommissionens forordning eg nr av den maj om f...,"[[2, 1276, 1284, 100, 4, 743, 1380, 15, 1381, ...","[[1311, 1312, 1303, 3, 4, 5, 807, 15, 1424, 74...","[[453, 1432, 14, 3, 4, 5, 990, 1423, 1510, 5, ...","[[1753, 3, 273, 24, 4, 5, 829, 7, 1876, 1877, ...","[[2, 1245, 1247, 23, 4, 5, 806, 7, 517, 4, 8, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
5995,32013R0519,18,2,commission regulation eu no of february adapti...,verordnung eu nr der kommission vom februar zu...,regolamento ue n della commissione del febbrai...,rozporzadzenie komisji ue nr z dnia lutego r d...,kommissionens forordning eu nr av den februari...,"[[2, 1276, 29, 100, 4, 1390, 1, 401, 1399, 37,...","[[1311, 33, 1303, 3, 4, 5, 1429, 15, 816, 931,...","[[453, 38, 14, 3, 4, 5, 1534, 245, 1, 1117, 15...","[[1753, 3, 34, 24, 4, 5, 1888, 7, 1, 1913, 177...","[[2, 1245, 33, 23, 4, 5, 1356, 7, 1489, 4, 325..."
5996,32008D0914,18,2,commission decision of june on the confirmatio...,entscheidung der kommission juni zur bestatigu...,decisione della commissione dell giugno recant...,decyzja komisji z dnia czerwca r w sprawie zat...,kommissionens beslut av den juni om godkannand...,"[[2, 3, 4, 59, 6, 15, 3126, 4, 207, 991, 12, 1...","[[65, 3, 4, 66, 15, 3255, 3, 28, 31, 1643, 116...","[[2, 3, 4, 45, 76, 1423, 2433, 176, 246, 1227,...","[[2, 3, 4, 5, 77, 7, 8, 9, 1117, 175, 4787, 99...","[[2, 3, 4, 5, 71, 7, 672, 4, 631, 15, 1, 4, 15..."
5997,31999R2502,18,2,commission regulation ec no of november amendi...,verordnung eg nr der kommission vom november z...,regolamento ce n della commissione del novembr...,rozporzadzenie komisji we nr z dnia listopada ...,kommissionens forordning eg nr av den november...,"[[2, 1276, 1284, 100, 4, 1065, 1277, 1276, 140...","[[1311, 1312, 1303, 3, 4, 5, 1027, 15, 1304, 3...","[[453, 1432, 14, 3, 4, 5, 1181, 1423, 1424, 5,...","[[1753, 3, 273, 24, 4, 5, 1404, 7, 1755, 1753,...","[[2, 1245, 1247, 23, 4, 5, 1012, 7, 774, 4, 12..."
5998,32008D0847,18,2,council decision of november on the eligibilit...,beschluss des rates vom november uber die ford...,decisione del consiglio del novembre sull ammi...,decyzja rady z dnia listopada r w sprawie kwal...,radets beslut av den november om berattigande ...,"[[1283, 3, 4, 1065, 6, 15, 831, 4, 115, 1, 129...","[[2, 74, 1313, 5, 1027, 7, 8, 1, 1, 340, 92, 1...","[[2, 5, 577, 5, 1181, 361, 889, 167, 424, 45, ...","[[2, 664, 4, 5, 1404, 7, 8, 9, 1138, 456, 4905...","[[1246, 3, 4, 5, 1012, 7, 1, 19, 1, 1819, 88, ..."


#### **Train, Test and Validation splits**

We split the dataset into:
- training set
- validation set
- test set

In [18]:
from sklearn.model_selection import train_test_split

# SPLIT 1: train [0.8] and test [0.2]

# execute the split (random, stratified on the labels)
train_indexes, test_indexes, _, _ = train_test_split(
    dataframe["text_en_enc"],
    dataframe["labels_new"],
    test_size=0.2,
    stratify=dataframe["labels_new"],   # needed for the split to actually be stratified
    random_state=42
)

# create placeholder
dataframe["set"] = "not specified"

# set value based on stratified split
dataframe.loc[train_indexes.index, "set"] = "train"
dataframe.loc[test_indexes.index, "set"] = "test"


# SPLIT 2: we split train again in order to obtain the validation set

train_indexes, val_indexes, _, _ = train_test_split(
    dataframe.loc[dataframe["set"] == "train"]["text_en_enc"],
    dataframe.loc[dataframe["set"] == "train"]["labels_new"],
    test_size=0.2,
    stratify=dataframe.loc[dataframe["set"] == "train"]["labels_new"],   # keep label proportions
    random_state=42
)

# update dataset with validation info
dataframe.loc[val_indexes.index, "set"] = "validation"

*How many observations for each set?*

In [19]:
Counter(dataframe['set'])

Counter({'train': 3840, 'test': 1200, 'validation': 960})
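
These counts follow from applying a 20% hold-out twice, which gives a 64% / 16% / 20% split overall. The arithmetic, as a quick check (variable names are illustrative):

```python
total = 6000

# split 1: 20% of all rows go to the test set
test = round(total * 0.2)
remaining = total - test

# split 2: 20% of the remaining rows go to the validation set
validation = round(remaining * 0.2)
train = remaining - validation

for name, n in [("train", train), ("validation", validation), ("test", test)]:
    print(f"{name}: {n} rows ({n / total:.0%} of the dataset)")
```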

#### **Save preprocessed texts**

We save the data in pickle format in order to preserve the column types.

In [20]:
dataframe.to_pickle("../data/3_multi_eurlex_encoded.pkl")
print("> All done!")

> All done!
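
The reason for choosing pickle can be seen in a small round-trip sketch using only the standard library (not the pandas calls used above): a list written to CSV comes back as a string, while pickle restores the original object. The `row` dictionary is a toy record.

```python
import csv
import io
import pickle

# Toy record with a list-valued field, like the encoded columns.
row = {"celex_id": "32010D0395", "encoded": [2, 3, 4, 1, 0]}

# CSV round trip: every value is written as text
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))

# pickle round trip: Python objects are restored as they were
from_pickle = pickle.loads(pickle.dumps(row))

print(type(from_csv["encoded"]).__name__, repr(from_csv["encoded"]))
print(type(from_pickle["encoded"]).__name__, repr(from_pickle["encoded"]))
```

Keep in mind that unpickling can execute arbitrary code, so pickle files should only be loaded from trusted sources.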
