## **NLP Practical**

### **Preprocessing**

It is now time to preprocess the texts.

#### **Libraries**

We import the necessary libraries for the notebook.

In [1]:
# general
import pandas as pd
import numpy as np
from tqdm import tqdm
tqdm.pandas()

# dataviz
import plotly.express as px         # create interactive plots

# preprocessing
import spacy                        # tokenization
from collections import Counter     # count words occurrences

print("> Libraries Imported")



> Libraries Imported


#### **Import the dataset**

We read our custom (and cleaned) CSV file.

In [2]:
dataframe = pd.read_csv("../data/2_multi_eurlex_reduced_v2.csv")
dataframe

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv
0,32003R1012,2,1,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
1,32003R2229,18,3,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...
2,32003R0223,7,2,commission regulation ec no of february on lab...,verordnung eg nr der kommission vom februar zu...,regolamento ce n della commissione del febbrai...,rozporzadzenie komisji we nr z dnia lutego r w...,kommissionens forordning eg nr av den februari...
3,31989L0681,7,2,council directive of december amending directi...,richtlinie des rates vom dezember zur anderung...,direttiva del consiglio del dicembre che modif...,dyrektywa rady z dnia grudnia r zmieniajaca dy...,radets direktiv av den december om andring av ...
4,32006R1007,2,1,commission regulation ec no of june determinin...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...
...,...,...,...,...,...,...,...,...
11671,32011R0880,7,2,commission regulation eu no of september corre...,verordnung eu nr der kommission vom september ...,regolamento ue n della commissione del settemb...,rozporzadzenie komisji ue nr z dnia wrzesnia r...,kommissionens forordning eu nr av den septembe...
11672,32012D0272,18,3,council decision of may on the signing on beha...,beschluss des rates vom mai uber die unterzeic...,decisione del consiglio del maggio relativa al...,decyzja rady z dnia maja r w sprawie podpisani...,radets beslut av den maj om undertecknande pa ...
11673,32012R0596,18,3,commission regulation eu no of july initiating...,verordnung eu nr der kommission vom juli zur e...,regolamento ue n della commissione del luglio ...,rozporzadzenie komisji ue nr z dnia lipca r ws...,kommissionens forordning eu nr av den juli om ...
11674,32010D0165,7,2,commission decision of march withdrawing the r...,beschluss der kommission vom marz uber die str...,decisione della commissione del marzo che riti...,decyzja komisji z dnia marca r w sprawie wycof...,kommissionens beslut av den mars om strykning ...


#### **Preprocessing**

**Part 1: Tokenization**

This essentially involves dividing a sentence, paragraph or entire text document into smaller units, such as individual words or terms.

We create a custom function capable of tokenizing the texts in different languages.

In [3]:
# load the tokenizers (one for each language)

TOKENIZER_EN = spacy.load('en_core_web_sm')
TOKENIZER_DE = spacy.load('de_core_news_sm')    # German model ('nl_core_news_sm' is the Dutch one)
TOKENIZER_IT = spacy.load('it_core_news_sm')
TOKENIZER_PL = spacy.load('pl_core_news_sm')
TOKENIZER_SV = spacy.load('sv_core_news_sm')

In [4]:
# create custom function

def tokenize(text, language):

    # use the required tokenizer (based on input language)
    if language=="en":
        tokenizer = TOKENIZER_EN
    elif language=="de":
        tokenizer = TOKENIZER_DE
    elif language=="it":
        tokenizer = TOKENIZER_IT
    elif language=="pl":
        tokenizer = TOKENIZER_PL
    elif language=="sv":
        tokenizer = TOKENIZER_SV
    else:
        # fail early: without this, 'tokenizer' would be undefined below
        raise ValueError(
            f"language '{language}' not available. Please choose between 'en', 'de', 'it', 'pl', 'sv'."
        )

    text = text.replace("_","")
    return [token.text for token in tokenizer.tokenizer(text)]
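
The language dispatch in `tokenize` can also be expressed as a dictionary lookup. The following is a minimal sketch, not the notebook's implementation: `TOKENIZERS` and `tokenize_with_lookup` are hypothetical names, and plain `str.split` stands in for the spaCy tokenizers so that the example runs without any model installed.

```python
# Hypothetical registry: in the notebook, the values would be the spaCy
# tokenizers loaded above; str.split is only a stand-in here.
TOKENIZERS = {
    "en": str.split,
    "de": str.split,
    "it": str.split,
    "pl": str.split,
    "sv": str.split,
}


def tokenize_with_lookup(text, language):
    """Tokenize `text` with the tokenizer registered for `language`."""
    try:
        tokenizer = TOKENIZERS[language]
    except KeyError:
        available = ", ".join(sorted(TOKENIZERS))
        raise ValueError(
            f"language '{language}' not available; choose one of: {available}"
        ) from None

    # same underscore clean-up as in tokenize()
    text = text.replace("_", "")
    return list(tokenizer(text))


print(tokenize_with_lookup("commission regulation ec no of june", "en"))
```

With this layout, supporting a further language would only require a new entry in the registry.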

Here we show a small example of tokenization for the first text (in each language). 

In [5]:
test_text_en = dataframe["text_en"][0]
test_text_de = dataframe["text_de"][0]
test_text_it = dataframe["text_it"][0]
test_text_pl = dataframe["text_pl"][0]
test_text_sv = dataframe["text_sv"][0]


print("Sample of 'en' text:", test_text_en[0:120])
print("Sample of 'de' text:", test_text_de[0:120])
print("Sample of 'it' text:", test_text_it[0:120])
print("Sample of 'pl' text:", test_text_pl[0:120])
print("Sample of 'sv' text:", test_text_sv[0:120])

Sample of 'en' text: commission regulation ec no of june amending for the th time council regulation ec no imposing certain specific restrict
Sample of 'de' text: verordnung eg nr der kommission vom juni zur neunzehnten anderung der verordnung eg nr des rates uber die anwendung best
Sample of 'it' text: regolamento ce n della commissione del giugno recante diciannovesima modifica del regolamento ce n che impone specifiche
Sample of 'pl' text: rozporzadzenie komisji we nr z dnia czerwca r zmieniajace po raz dziewietnasty rozporzadzenie rady we nr wprowadzajace n
Sample of 'sv' text: kommissionens forordning eg nr av den juni om andring for nittonde gangen av radets forordning eg nr betraffande inforan


In [6]:
print("Example of tokenization in 'en':", tokenize(text=test_text_en, language="en")[0:10])
print("Example of tokenization in 'de':", tokenize(text=test_text_de, language="de")[0:10])
print("Example of tokenization in 'it':", tokenize(text=test_text_it, language="it")[0:10])
print("Example of tokenization in 'pl':", tokenize(text=test_text_pl, language="pl")[0:10])
print("Example of tokenization in 'sv':", tokenize(text=test_text_sv, language="sv")[0:10])

Example of tokenization in 'en': ['commission', 'regulation', 'ec', 'no', 'of', 'june', 'amending', 'for', 'the', 'th']
Example of tokenization in 'de': ['verordnung', 'eg', 'nr', 'der', 'kommission', 'vom', 'juni', 'zur', 'neunzehnten', 'anderung']
Example of tokenization in 'it': ['regolamento', 'ce', 'n', 'della', 'commissione', 'del', 'giugno', 'recante', 'diciannovesima', 'modifica']
Example of tokenization in 'pl': ['rozporzadzenie', 'komisji', 'we', 'nr', 'z', 'dnia', 'czerwca', 'r', 'zmieniajace', 'po']
Example of tokenization in 'sv': ['kommissionens', 'forordning', 'eg', 'nr', 'av', 'den', 'juni', 'om', 'andring', 'for']


**Part 2: Count Words Occurrences**

We now count the number of occurrences of each word (now a token) in each corpus.

In [7]:
# instantiate a Counter
COUNTS_EN = Counter()
COUNTS_DE = Counter()
COUNTS_IT = Counter()
COUNTS_PL = Counter()
COUNTS_SV = Counter()

for index, row in tqdm(dataframe.iterrows(), total=dataframe.shape[0], desc="> Counting words in texts"):
    COUNTS_EN.update(tokenize(row['text_en'], language="en"))
    COUNTS_DE.update(tokenize(row['text_de'], language="de"))
    COUNTS_IT.update(tokenize(row['text_it'], language="it"))
    COUNTS_PL.update(tokenize(row['text_pl'], language="pl"))
    COUNTS_SV.update(tokenize(row['text_sv'], language="sv"))

> Counting words in texts: 100%|██████████| 11676/11676 [01:42<00:00, 113.55it/s]
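
To make the behaviour of `Counter.update` concrete, here is a small sketch on a toy corpus. The `documents` list is invented for illustration and is assumed to be already tokenized.

```python
from collections import Counter

# toy corpus (made up): each "document" is already a list of tokens
documents = [
    ["commission", "regulation", "ec", "no"],
    ["council", "regulation", "ec", "no"],
    ["commission", "decision"],
]

counts = Counter()
for tokens in documents:
    # update() adds one to the count of every token in the list
    counts.update(tokens)

print(counts["regulation"])  # 2
print(counts["decision"])    # 1
print(counts["directive"])   # 0: a missing key counts as zero
```

Because a `Counter` returns zero for unseen keys, no explicit initialisation per word is needed.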


In [8]:
# show first 10 words of COUNTS_EN
for key, value in list(COUNTS_EN.items())[:10]:
    print(f"{key}: {value}")

commission: 86468
regulation: 116649
ec: 81198
no: 84780
of: 769888
june: 9941
amending: 5689
for: 204635
the: 1394185
th: 1918


*How many distinct words are there in the dictionary of each language?*

In [9]:
print("Total Words in 'en' texts:", len(COUNTS_EN.keys()))
print("Total Words in 'de' texts:", len(COUNTS_DE.keys()))
print("Total Words in 'it' texts:", len(COUNTS_IT.keys()))
print("Total Words in 'pl' texts:", len(COUNTS_PL.keys()))
print("Total Words in 'sv' texts:", len(COUNTS_SV.keys()))

Total Words in 'en' texts: 45290
Total Words in 'de' texts: 145627
Total Words in 'it' texts: 65064
Total Words in 'pl' texts: 110699
Total Words in 'sv' texts: 132128


It is interesting to note how much the number of distinct words needed to describe the same laws varies across languages.

Comparing English and Swedish, for example, the Swedish vocabulary (132,128 words) is roughly 3 times larger than the English one (45,290 words).

The vocabularies are very large, so we remove the uncommon words, keeping only those that occur at least 10 times.

In [10]:
COUNTS_EN = {key:val for key, val in COUNTS_EN.items() if val >= 10}
COUNTS_DE = {key:val for key, val in COUNTS_DE.items() if val >= 10}
COUNTS_IT = {key:val for key, val in COUNTS_IT.items() if val >= 10}
COUNTS_PL = {key:val for key, val in COUNTS_PL.items() if val >= 10}
COUNTS_SV = {key:val for key, val in COUNTS_SV.items() if val >= 10}

Let's observe the changes.

In [11]:
print("Total Words in 'en' texts:", len(COUNTS_EN.keys()))
print("Total Words in 'de' texts:", len(COUNTS_DE.keys()))
print("Total Words in 'it' texts:", len(COUNTS_IT.keys()))
print("Total Words in 'pl' texts:", len(COUNTS_PL.keys()))
print("Total Words in 'sv' texts:", len(COUNTS_SV.keys()))

Total Words in 'en' texts: 14751
Total Words in 'de' texts: 31903
Total Words in 'it' texts: 20577
Total Words in 'pl' texts: 33812
Total Words in 'sv' texts: 29075
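
The effect of the threshold can be quantified directly from the two sets of counts printed above. The short sketch below only redoes that arithmetic; the dictionary names `before`, `after` and `retained` are introduced here for illustration.

```python
# vocabulary sizes printed before and after the frequency filter
before = {"en": 45290, "de": 145627, "it": 65064, "pl": 110699, "sv": 132128}
after = {"en": 14751, "de": 31903, "it": 20577, "pl": 33812, "sv": 29075}

# share of distinct words that survive the threshold, per language
retained = {language: after[language] / before[language] for language in before}

for language, share in retained.items():
    print(f"{language}: {after[language]:>6} of {before[language]:>6} words kept ({share:.1%})")
```

In every language, only a minority of the distinct words occurs 10 times or more.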


Starting from the generated dictionaries, we can now create the 5 vocabularies.

In [12]:
# prepare placeholders
# - one for index mapping
# - one with the list of all the words

EN_WORDS_TO_INDEX = {"":0, "UNK":1}
EN_WORDS = ["", "UNK"]

DE_WORDS_TO_INDEX = {"":0, "UNK":1}
DE_WORDS = ["", "UNK"]

IT_WORDS_TO_INDEX = {"":0, "UNK":1}
IT_WORDS = ["", "UNK"]

PL_WORDS_TO_INDEX = {"":0, "UNK":1}
PL_WORDS = ["", "UNK"]

SV_WORDS_TO_INDEX = {"":0, "UNK":1}
SV_WORDS = ["", "UNK"]


# iterate over each counter and populate the placeholders

for word in COUNTS_EN:
    EN_WORDS_TO_INDEX[word] = len(EN_WORDS)
    EN_WORDS.append(word)

for word in COUNTS_DE:
    DE_WORDS_TO_INDEX[word] = len(DE_WORDS)
    DE_WORDS.append(word)

for word in COUNTS_IT:
    IT_WORDS_TO_INDEX[word] = len(IT_WORDS)
    IT_WORDS.append(word)

for word in COUNTS_PL:
    PL_WORDS_TO_INDEX[word] = len(PL_WORDS)
    PL_WORDS.append(word)

for word in COUNTS_SV:
    SV_WORDS_TO_INDEX[word] = len(SV_WORDS)
    SV_WORDS.append(word)

print("> Dictionaries Generated")

> Dictionaries Generated


In [13]:
print("Total Words in 'en' texts:", len(EN_WORDS))
print("Total Words in 'de' texts:", len(DE_WORDS))
print("Total Words in 'it' texts:", len(IT_WORDS))
print("Total Words in 'pl' texts:", len(PL_WORDS))
print("Total Words in 'sv' texts:", len(SV_WORDS))

Total Words in 'en' texts: 14753
Total Words in 'de' texts: 31905
Total Words in 'it' texts: 20579
Total Words in 'pl' texts: 33814
Total Words in 'sv' texts: 29077
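
Each vocabulary is two entries larger than the corresponding filtered dictionary because of the two reserved entries, `""` and `"UNK"`. The five loops above follow the same pattern, which could be factored into one helper. This is a sketch only: `build_vocabulary` and `toy_counts` are hypothetical names that do not appear in the notebook.

```python
def build_vocabulary(counts):
    """Return (word_to_index, words) with "" at index 0 and "UNK" at index 1."""
    words = ["", "UNK"]              # reserved entries
    word_to_index = {"": 0, "UNK": 1}
    for word in counts:
        # the next free index is the current length of the word list
        word_to_index[word] = len(words)
        words.append(word)
    return word_to_index, words


toy_counts = {"commission": 86468, "regulation": 116649, "ec": 81198}
word_to_index, words = build_vocabulary(toy_counts)

print(word_to_index)
print(words[word_to_index["ec"]])
```

The two structures are inverse to each other: `word_to_index` maps a word to its position, and `words` maps a position back to the word.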


**Part 3: Encoding**

We have everything we need to encode our texts!

Note: we set the maximum length of each text to 1400 words, as the average length of the texts is around 1200-1400 words. Longer texts are truncated and shorter ones are padded with zeros.

In [14]:
# create our custom encoding function

def encode_text(text, word_to_index, language, N=1400):
    
    # 1. tokenize the text
    tokenized = tokenize(text, language)

    # 2. encode the text
    encoded = np.zeros(N, dtype=int)
    enc1 = np.array([word_to_index.get(word, word_to_index["UNK"]) for word in tokenized])
    length = min(N, len(enc1))
    encoded[:length] = enc1[:length]

    # finally, return encoded text
    return encoded, length

In [15]:
# test the function on the first english text

test_encoded = encode_text(
    text=dataframe["text_en"][0], 
    word_to_index=EN_WORDS_TO_INDEX, 
    language="en", 
    N=1400
    )

test_encoded

(array([2, 3, 4, ..., 0, 0, 0]), 265)
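
The padding and truncation logic is easier to follow on a tiny example. The sketch below mirrors `encode_text` under simplifying assumptions: it uses plain lists instead of NumPy arrays, takes tokens that are already split, and uses a maximum length of 6 instead of 1400. `encode_tokens` and `vocab` are hypothetical names.

```python
def encode_tokens(tokens, word_to_index, n=6):
    """Map tokens to indexes, then pad with 0 or truncate to exactly n entries."""
    unk = word_to_index["UNK"]
    indexes = [word_to_index.get(token, unk) for token in tokens]
    length = min(n, len(indexes))
    encoded = indexes[:length] + [0] * (n - length)
    return encoded, length


vocab = {"": 0, "UNK": 1, "commission": 2, "regulation": 3, "ec": 4}

# short text: the unknown word maps to 1, remaining positions are padded with 0
print(encode_tokens(["commission", "regulation", "xyz"], vocab))
# → ([2, 3, 1, 0, 0, 0], 3)

# long text: only the first n tokens are kept
print(encode_tokens(["ec"] * 8, vocab))
# → ([4, 4, 4, 4, 4, 4], 6)
```

The returned length tells how many positions hold real tokens, which is why the notebook output above ends in zeros and reports 265.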

In [16]:
# finally, we can apply the function to the whole df, encoding all texts

dataframe['text_en_enc'] = dataframe['text_en'].progress_apply(lambda x: np.array(encode_text(x, EN_WORDS_TO_INDEX, language="en"), dtype=object))
dataframe['text_de_enc'] = dataframe['text_de'].progress_apply(lambda x: np.array(encode_text(x, DE_WORDS_TO_INDEX, language="de"), dtype=object))
dataframe['text_it_enc'] = dataframe['text_it'].progress_apply(lambda x: np.array(encode_text(x, IT_WORDS_TO_INDEX, language="it"), dtype=object))
dataframe['text_pl_enc'] = dataframe['text_pl'].progress_apply(lambda x: np.array(encode_text(x, PL_WORDS_TO_INDEX, language="pl"), dtype=object))
dataframe['text_sv_enc'] = dataframe['text_sv'].progress_apply(lambda x: np.array(encode_text(x, SV_WORDS_TO_INDEX, language="sv"), dtype=object))

100%|██████████| 11676/11676 [00:17<00:00, 662.04it/s] 
100%|██████████| 11676/11676 [00:17<00:00, 674.46it/s] 
100%|██████████| 11676/11676 [00:18<00:00, 628.89it/s] 
100%|██████████| 11676/11676 [00:16<00:00, 701.33it/s] 
100%|██████████| 11676/11676 [00:17<00:00, 670.01it/s] 


In [17]:
# show df
dataframe

Unnamed: 0,celex_id,labels,labels_new,text_en,text_de,text_it,text_pl,text_sv,text_en_enc,text_de_enc,text_it_enc,text_pl_enc,text_sv_enc
0,32003R1012,2,1,commission regulation ec no of june amending f...,verordnung eg nr der kommission vom juni zur n...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...,"[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 3, 4...","[[2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 5, 2, 3, 4, 1...","[[2, 3, 4, 5, 6, 7, 8, 9, 1, 10, 7, 2, 3, 4, 1...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 1, 2, 13...","[[2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12, 13, 6, 1..."
1,32003R2229,18,3,council regulation ec no of december imposing ...,verordnung eg nr des rates vom dezember zur ei...,regolamento ce n del consiglio del dicembre ch...,rozporzadzenie rady we nr z dnia grudnia r nak...,radets forordning eg nr av den december om inf...,"[[13, 3, 4, 5, 6, 117, 14, 118, 119, 120, 121,...","[[2, 3, 4, 11, 12, 7, 116, 9, 117, 118, 119, 1...","[[2, 3, 4, 7, 37, 7, 128, 11, 44, 129, 130, 13...","[[2, 13, 4, 5, 6, 7, 134, 9, 135, 136, 137, 13...","[[14, 3, 4, 5, 6, 7, 113, 9, 16, 6, 114, 115, ..."
2,32003R0223,7,2,commission regulation ec no of february on lab...,verordnung eg nr der kommission vom februar zu...,regolamento ce n della commissione del febbrai...,rozporzadzenie komisji we nr z dnia lutego r w...,kommissionens forordning eg nr av den februari...,"[[2, 3, 4, 5, 6, 1033, 78, 1034, 343, 574, 38,...","[[2, 3, 4, 5, 6, 7, 1305, 9, 1136, 57, 1306, 1...","[[2, 3, 4, 5, 6, 7, 1259, 1260, 80, 393, 53, 1...","[[2, 3, 4, 5, 6, 7, 1603, 9, 59, 150, 1604, 59...","[[2, 3, 4, 5, 6, 7, 1240, 9, 1241, 122, 1242, ..."
3,31989L0681,7,2,council directive of december amending directi...,richtlinie des rates vom dezember zur anderung...,direttiva del consiglio del dicembre che modif...,dyrektywa rady z dnia grudnia r zmieniajaca dy...,radets direktiv av den december om andring av ...,"[[13, 1068, 6, 117, 8, 1068, 1040, 78, 1187, 7...","[[1350, 11, 12, 7, 116, 9, 10, 5, 1350, 1312, ...","[[1297, 7, 37, 7, 128, 11, 10, 38, 1297, 1267,...","[[1660, 13, 6, 7, 134, 9, 1663, 1666, 1613, 59...","[[14, 1274, 6, 7, 113, 9, 10, 6, 1274, 1247, 9..."
4,32006R1007,2,1,commission regulation ec no of june determinin...,verordnung eg nr der kommission vom juni zur f...,regolamento ce n della commissione del giugno ...,rozporzadzenie komisji we nr z dnia czerwca r ...,kommissionens forordning eg nr av den juni om ...,"[[2, 3, 4, 5, 6, 7, 488, 10, 544, 319, 373, 9,...","[[2, 3, 4, 5, 6, 7, 8, 9, 1591, 11, 1592, 166,...","[[2, 3, 4, 5, 6, 7, 8, 11, 1526, 36, 433, 7, 3...","[[2, 3, 4, 5, 6, 7, 8, 9, 1994, 476, 399, 1995...","[[2, 3, 4, 5, 6, 7, 8, 9, 591, 6, 1504, 122, 1..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...
11671,32011R0880,7,2,commission regulation eu no of september corre...,verordnung eu nr der kommission vom september ...,regolamento ue n della commissione del settemb...,rozporzadzenie komisji ue nr z dnia wrzesnia r...,kommissionens forordning eu nr av den septembe...,"[[2, 3, 318, 5, 6, 157, 5739, 3, 318, 5, 8, 69...","[[2, 331, 4, 5, 6, 7, 161, 9, 747, 5, 2, 331, ...","[[2, 371, 4, 5, 6, 7, 180, 11, 6911, 36, 2, 37...","[[2, 3, 400, 5, 6, 7, 189, 9, 59, 150, 12760, ...","[[2, 3, 316, 5, 6, 7, 160, 9, 7201, 6, 3, 316,..."
11672,32012D0272,18,3,council decision of may on the signing on beha...,beschluss des rates vom mai uber die unterzeic...,decisione del consiglio del maggio relativa al...,decyzja rady z dnia maja r w sprawie podpisani...,radets beslut av den maj om undertecknande pa ...,"[[13, 566, 6, 42, 78, 10, 4886, 78, 849, 6, 10...","[[1674, 11, 12, 7, 46, 13, 14, 7677, 11, 6345,...","[[653, 7, 37, 7, 46, 187, 28, 6508, 24, 983, 6...","[[2117, 13, 6, 7, 42, 9, 59, 150, 7622, 59, 12...","[[14, 788, 6, 7, 45, 9, 4447, 122, 101, 4526, ..."
11673,32012R0596,18,3,commission regulation eu no of july initiating...,verordnung eu nr der kommission vom juli zur e...,regolamento ue n della commissione del luglio ...,rozporzadzenie komisji ue nr z dnia lipca r ws...,kommissionens forordning eu nr av den juli om ...,"[[2, 3, 318, 5, 6, 1096, 3256, 146, 153, 464, ...","[[2, 331, 4, 5, 6, 7, 1376, 9, 3210, 548, 153,...","[[2, 371, 4, 5, 6, 7, 1332, 11, 11640, 129, 17...","[[2, 3, 400, 5, 6, 7, 1702, 9, 19048, 4119, 18...","[[2, 3, 316, 5, 6, 7, 1308, 9, 3161, 6, 114, 5..."
11674,32010D0165,7,2,commission decision of march withdrawing the r...,beschluss der kommission vom marz uber die str...,decisione della commissione del marzo che riti...,decyzja komisji z dnia marca r w sprawie wycof...,kommissionens beslut av den mars om strykning ...,"[[2, 566, 6, 1307, 6219, 10, 1088, 6, 389, 689...","[[1674, 5, 6, 7, 1645, 13, 14, 5739, 5, 7497, ...","[[653, 5, 6, 7, 1585, 11, 8249, 36, 565, 5, 89...","[[2117, 3, 6, 7, 2069, 9, 59, 150, 5460, 1811,...","[[2, 788, 6, 7, 1556, 9, 14029, 6, 2850, 50, 3..."


#### **Train, Test and Validation splits**

We split the dataset into three sets:
- training set (64% of the rows)
- validation set (16%)
- test set (20%)

In [18]:
from sklearn.model_selection import train_test_split

# SPLIT 1: train [0.8] and test [0.2]

# execute the split (randomly, stratified fashion)
train_indexes, test_indexes, _, _ = train_test_split(
    dataframe["text_en_enc"],
    dataframe["labels_new"],
    test_size=0.2,
    random_state=42,
    stratify=dataframe["labels_new"]    # without this argument the split is not stratified
)

# create placeholder
dataframe["set"] = "not specified"

# set value based on stratified split
dataframe.loc[train_indexes.index, "set"] = "train"
dataframe.loc[test_indexes.index, "set"] = "test"


# SPLIT 2: we split train again in order to obtain the validation set

train_indexes, val_indexes, _, _ = train_test_split(
    dataframe.loc[dataframe["set"] == "train"]["text_en_enc"],
    dataframe.loc[dataframe["set"] == "train"]["labels_new"],
    test_size=0.2,
    random_state=42,
    stratify=dataframe.loc[dataframe["set"] == "train"]["labels_new"]
)

# update dataset with validation info
dataframe.loc[val_indexes.index, "set"] = "validation"

*How many observations are there in each set?*

In [19]:
Counter(dataframe['set'])

Counter({'test': 2336, 'validation': 1868, 'train': 7472})
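
These counts can be reproduced by hand from the two `test_size=0.2` splits. The sketch below assumes that `train_test_split` rounds the held-out share up to the next integer, which is consistent with the numbers shown. The variable names are illustrative.

```python
import math
from fractions import Fraction

n_total = 11676               # rows in the dataframe
test_share = Fraction(1, 5)   # test_size=0.2, kept exact to avoid float rounding

# split 1: the test set takes 20% of all rows (rounded up)
n_test = math.ceil(test_share * n_total)
n_train_full = n_total - n_test

# split 2: the validation set takes 20% of the remaining rows (rounded up)
n_val = math.ceil(test_share * n_train_full)
n_train = n_train_full - n_val

print(n_test, n_val, n_train)
# → 2336 1868 7472
```

Overall this gives about 64% of the rows for training, 16% for validation and 20% for testing.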

#### **Save preprocessed texts**

We save the data in pickle format in order to preserve the column types.

In [20]:
dataframe.to_pickle("../data/3_multi_eurlex_encoded_v3.pkl")
print("> All done!")

> All done!
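
The following sketch illustrates why a text format would lose the column types. It does not use pandas: the standard-library `pickle` and `csv` modules stand in for `to_pickle` and `to_csv`, and the `row` dictionary is a made-up, shortened example of one encoded row.

```python
import csv
import io
import pickle

# one illustrative "row": an encoded text (list of indexes) and its length
row = {"celex_id": "32003R1012", "text_en_enc": ([2, 3, 4, 0, 0], 3)}

# pickle round trip: the tuple and the list inside it come back unchanged
restored = pickle.loads(pickle.dumps(row))
print(type(restored["text_en_enc"]), restored == row)

# CSV round trip: every cell is written as text and read back as a string
buffer = io.StringIO()
writer = csv.DictWriter(buffer, fieldnames=list(row))
writer.writeheader()
writer.writerow(row)
buffer.seek(0)
from_csv = next(csv.DictReader(buffer))
print(type(from_csv["text_en_enc"]), from_csv["text_en_enc"])
```

After the CSV round trip, the encoded column would have to be parsed again before it could be used as numeric input.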
