# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and Italian Europarl datasets and combine them into a dataframe of parallel (translated) sentences.

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import os
import sys
# Add the parent of the current working directory to the import path so that `src` can be imported.
sys.path.append(os.path.dirname(os.path.abspath('')))

from src.data import create_data_subset

In [2]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.it-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.it-en.it',
                   sample_size=25000,
                   sentence_data_sampled_path="../data/interim/europarl_en_it.pkl",)

Finished function: 'load_doc' in 2.93 seconds.
Finished function: 'to_sentences' in 1.18 seconds.
Finished function: 'load_doc' in 3.74 seconds.
Finished function: 'to_sentences' in 1.33 seconds.
Sampled dataframe saved in: ../data/interim/europarl_en_it.pkl
Finished function: 'create_data_subset' in 12.31 seconds.
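
The internals of `create_data_subset` are not shown in this notebook. As a rough illustration only, the sketch below shows one way two line-aligned Europarl files could be read and sampled into sentence pairs. The function names, the `seed` parameter and the dictionary keys are assumptions made for this example, not the project's actual API.

```python
import os
import random
import tempfile


def read_lines(path):
    """Read a text file into a list of lines without trailing newlines."""
    with open(path, encoding="utf-8") as handle:
        return [line.rstrip("\n") for line in handle]


def sample_parallel_sentences(source_path, target_path, sample_size, seed=0):
    """Hypothetical helper: sample aligned sentence pairs from two files."""
    source_lines = read_lines(source_path)
    target_lines = read_lines(target_path)
    if len(source_lines) != len(target_lines):
        raise ValueError("Source and target files must have the same number of lines.")
    # Keep only pairs where both sides contain text; remember the line number as id.
    pairs = [
        {"id": i, "text_source": s.strip(), "text_target": t.strip()}
        for i, (s, t) in enumerate(zip(source_lines, target_lines))
        if s.strip() and t.strip()
    ]
    rng = random.Random(seed)
    return rng.sample(pairs, min(sample_size, len(pairs)))


# Tiny demonstration with two temporary, line-aligned toy files.
with tempfile.TemporaryDirectory() as tmp:
    en_path = os.path.join(tmp, "toy.en")
    it_path = os.path.join(tmp, "toy.it")
    with open(en_path, "w", encoding="utf-8") as handle:
        handle.write("Good morning.\n\nThank you.\nThe sitting is closed.\n")
    with open(it_path, "w", encoding="utf-8") as handle:
        handle.write("Buongiorno.\n\nGrazie.\nLa seduta è tolta.\n")
    subset = sample_parallel_sentences(en_path, it_path, sample_size=2)

print(subset)
```

Sampling whole pairs, rather than sampling each file separately, keeps every source sentence attached to its translation.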


## II. Preprocess data

In this section we preprocess the parallel sentence data for feature generation.

In [3]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
# import de_core_news_sm
import it_core_news_sm
# import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl

In [4]:
stopwords_source = stopwords.words('english')
# stopwords_target = stopwords.words('german') # German stopwords
stopwords_target = stopwords.words('italian') # Italian stopwords
# stopwords_target = stopwords.words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
# nlp_target = de_core_news_sm.load() # German pipeline
nlp_target = it_core_news_sm.load() # Italian pipeline
# nlp_target = pl_core_news_sm.load() # Polish pipeline

In [5]:
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de.pkl") # German
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pol.pkl") # Polish

Finished function: 'import_data' in 0.02 seconds.


In [6]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)

100%|██████████| 25000/25000 [03:20<00:00, 124.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 199372.93it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'spacy' in 200.24 seconds.
Finished function: 'remove_stopwords' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 134693.05it/s]
100%|██████████| 25000/25000 [00:00<00:00, 127434.08it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_punctuation' in 0.19 seconds.
Finished function: 'remove_numbers' in 0.2 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 66933.15it/s]
100%|██████████| 25000/25000 [00:00<00:00, 141090.86it/s]


Finished function: 'lemmatize' in 0.38 seconds.
Finished function: 'lowercase_spacy' in 0.18 seconds.


100%|██████████| 25000/25000 [03:15<00:00, 127.55it/s]
100%|██████████| 25000/25000 [00:00<00:00, 192035.98it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_cleaned_token_embedding' in 201.58 seconds.
Finished function: 'spacy' in 195.99 seconds.
Finished function: 'remove_stopwords' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 130482.06it/s]
100%|██████████| 25000/25000 [00:00<00:00, 161687.78it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_punctuation' in 0.19 seconds.
Finished function: 'remove_numbers' in 0.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 81086.96it/s]
100%|██████████| 25000/25000 [00:00<00:00, 154638.07it/s]


Finished function: 'lemmatize' in 0.31 seconds.
Finished function: 'lowercase_spacy' in 0.16 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 5260.58it/s]
100%|██████████| 25000/25000 [00:00<00:00, 214551.62it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_cleaned_token_embedding' in 197.2 seconds.
Finished function: 'tokenize_sentence' in 4.75 seconds.
Finished function: 'remove_stopwords' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 164121.83it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'strip_whitespace' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 59838.55it/s]
  1%|▏         | 349/25000 [00:00<00:07, 3487.45it/s]

Finished function: 'lowercase' in 0.42 seconds.
Finished function: 'create_cleaned_text' in 5.48 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 5079.95it/s]
100%|██████████| 25000/25000 [00:00<00:00, 289392.47it/s]
100%|██████████| 25000/25000 [00:00<00:00, 217137.84it/s]

Finished function: 'tokenize_sentence' in 4.92 seconds.
Finished function: 'remove_stopwords' in 0.09 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 153625.80it/s]


Finished function: 'strip_whitespace' in 0.12 seconds.
Finished function: 'lowercase' in 0.16 seconds.
Finished function: 'create_cleaned_text' in 5.32 seconds.
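
The log above lists the cleaning steps that `preprocess_sentences` runs (stopword removal, punctuation removal, number removal, lemmatization, lowercasing). The following is a minimal, dependency-free sketch of most of these steps on a single sentence. It skips lemmatization, which needs a spaCy pipeline, and it uses a small hypothetical stopword set in place of `stopwords.words('english')`; the function name `clean_tokens` is also an assumption.

```python
import string

# Hypothetical stand-in for the NLTK stopword list used in the notebook.
TOY_STOPWORDS = {"the", "is", "of", "a", "to", "for"}


def clean_tokens(sentence, stopword_list):
    """Whitespace-tokenize, strip punctuation, lowercase, drop numbers and stopwords."""
    tokens = sentence.split()
    # Remove punctuation attached to the start or end of each token.
    tokens = [token.strip(string.punctuation) for token in tokens]
    # Lowercase before comparing against the (lowercase) stopword list.
    tokens = [token.lower() for token in tokens]
    return [
        token
        for token in tokens
        if token                                   # drop tokens that were only punctuation
        and not any(ch.isdigit() for ch in token)  # drop tokens containing digits
        and token not in stopword_list             # drop stopwords
    ]


cleaned = clean_tokens(
    "The resumption of the session is set for 17 January 2000.", TOY_STOPWORDS
)
print(cleaned)
```

Lowercasing before the stopword check matters here, because sentence-initial words such as "The" would otherwise slip through.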


In [7]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 25000/25000 [00:00<00:00, 81189.86it/s]
 74%|███████▍  | 18482/25000 [00:00<00:00, 91988.94it/s]

Finished function: 'number_punctuations_total' in 0.31 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 90313.43it/s]
100%|██████████| 25000/25000 [00:00<00:00, 227856.69it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuations_total' in 0.28 seconds.
Finished function: 'number_words' in 0.11 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 216801.13it/s]
 19%|█▉        | 4758/25000 [00:00<00:00, 47579.48it/s]

Finished function: 'number_words' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 48682.17it/s]
 18%|█▊        | 4384/25000 [00:00<00:00, 43835.23it/s]

Finished function: 'number_unique_words' in 0.52 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 45460.56it/s]
 56%|█████▌    | 14046/25000 [00:00<00:00, 70457.18it/s]

Finished function: 'number_unique_words' in 0.55 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 63648.74it/s]
 25%|██▍       | 6232/25000 [00:00<00:00, 62314.41it/s]

Finished function: 'number_characters' in 0.39 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 62192.60it/s]
  return (character_vector / word_vector).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)
100%|██████████| 25000/25000 [00:00<00:00, 417662.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 366892.82it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.4 seconds.
Finished function: 'average_characters' in 0.02 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 352962.01it/s]
100%|██████████| 25000/25000 [00:00<00:00, 402078.31it/s]
100%|██████████| 25000/25000 [00:00<00:00, 425705.30it/s]
100%|██████████| 25000/25000 [00:00<00:00, 384846.55it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 334456.93it/s]
100%|██████████| 25000/25000 [00:00<00:00, 413121.21it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441097.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 437676.25it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 436065.42it/s]
100%|██████████| 25000/25000 [00:00<00:00, 433878.56it/s]
100%|██████████| 25000/25000 [00:00<00:00, 425351.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 393974.89it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 360910.87it/s]
100%|██████████| 25000/25000 [00:00<00:00, 368356.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 394720.87it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 355835.48it/s]
100%|██████████| 25000/25000 [00:00<00:00, 373408.54it/s]
100%|██████████| 25000/25000 [00:00<00:00, 407971.30it/s]
100%|██████████| 25000/25000 [00:00<00:00, 434689.75it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 435388.25it/s]
100%|██████████| 25000/25000 [00:00<00:00, 428993.52it/s]
100%|██████████| 25000/25000 [00:00<00:00, 424339.16it/s]
100%|██████████| 25000/25000 [00:00<00:00, 435191.29it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 409707.23it/s]
100%|██████████| 25000/25000 [00:00<00:00, 383731.13it/s]
100%|██████████| 25000/25000 [00:00<00:00, 294252.87it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.09 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 317328.16it/s]
100%|██████████| 25000/25000 [00:00<00:00, 306857.78it/s]
100%|██████████| 25000/25000 [00:00<00:00, 346914.23it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 331217.82it/s]
100%|██████████| 25000/25000 [00:00<00:00, 368309.10it/s]
100%|██████████| 25000/25000 [00:00<00:00, 433734.98it/s]
100%|██████████| 25000/25000 [00:00<00:00, 452089.33it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 429598.13it/s]
100%|██████████| 25000/25000 [00:00<00:00, 400018.31it/s]
100%|██████████| 25000/25000 [00:00<00:00, 382292.94it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 343194.54it/s]
100%|██████████| 25000/25000 [00:00<00:00, 391023.34it/s]
100%|██████████| 25000/25000 [00:00<00:00, 373512.29it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 351394.75it/s]
100%|██████████| 25000/25000 [00:00<00:00, 404766.52it/s]
100%|██████████| 25000/25000 [00:00<00:00, 419217.44it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 332676.38it/s]
100%|██████████| 25000/25000 [00:00<00:00, 417724.41it/s]
100%|██████████| 25000/25000 [00:00<00:00, 437491.81it/s]
100%|██████████| 25000/25000 [00:00<00:00, 401961.17it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 416182.51it/s]
100%|██████████| 25000/25000 [00:00<00:00, 361292.63it/s]
100%|██████████| 25000/25000 [00:00<00:00, 204770.40it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 291766.30it/s]
100%|██████████| 25000/25000 [00:00<00:00, 366342.92it/s]
100%|██████████| 25000/25000 [00:00<00:00, 400339.03it/s]
100%|██████████| 25000/25000 [00:00<00:00, 433892.92it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.09 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 426956.79it/s]
100%|██████████| 25000/25000 [00:00<00:00, 414850.45it/s]
100%|██████████| 25000/25000 [00:00<00:00, 424411.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 415116.51it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 422290.15it/s]
100%|██████████| 25000/25000 [00:00<00:00, 420493.41it/s]
100%|██████████| 25000/25000 [00:00<00:00, 427824.31it/s]
100%|██████████| 25000/25000 [00:00<00:00, 423841.74it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 380854.42it/s]
  0%|          | 10/25000 [00:00<04:34, 90.94it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [02:56<00:00, 141.63it/s]
  0%|          | 15/25000 [00:00<02:57, 140.37it/s]

Finished function: 'spacy' in 176.51 seconds.


100%|██████████| 25000/25000 [02:45<00:00, 151.42it/s]
100%|██████████| 25000/25000 [00:00<00:00, 169213.86it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'spacy' in 165.11 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 180482.15it/s]
100%|██████████| 25000/25000 [00:00<00:00, 169430.68it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.14 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 189939.12it/s]
100%|██████████| 25000/25000 [00:00<00:00, 169892.14it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.13 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 185282.71it/s]
 10%|▉         | 2470/25000 [00:00<00:01, 12242.04it/s]

Finished function: 'number_pos' in 0.14 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12617.04it/s]
  5%|▍         | 1198/25000 [00:00<00:01, 11969.22it/s]

Finished function: 'number_times' in 1.98 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12553.30it/s]
  5%|▍         | 1170/25000 [00:00<00:02, 11699.06it/s]

Finished function: 'number_times' in 1.99 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12685.10it/s]
 10%|▉         | 2407/25000 [00:00<00:01, 11873.68it/s]

Finished function: 'number_times' in 1.97 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 12412.08it/s]
 10%|▉         | 2497/25000 [00:00<00:01, 12354.79it/s]

Finished function: 'number_times' in 2.02 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12643.05it/s]
  5%|▍         | 1174/25000 [00:00<00:02, 11738.81it/s]

Finished function: 'number_times' in 1.98 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12780.34it/s]
100%|██████████| 25000/25000 [00:00<00:00, 179788.66it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_times' in 1.96 seconds.
Finished function: 'named_numbers' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 180689.27it/s]

Finished function: 'named_numbers' in 0.14 seconds.
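
The function names in the log (`number_words`, `number_unique_words`, `number_characters`, `average_characters`, `number_punctuation_marks`) suggest simple per-sentence counts. The sketch below shows how such counts could be computed for one sentence. The exact definitions used in `src.data` are not shown here, so the bodies are assumptions. Note the guard against empty sentences, which avoids the division by zero hinted at by the warning in the output above.

```python
import string


def number_punctuation_marks(sentence, mark):
    """Count how often one specific punctuation mark occurs in a sentence."""
    return sentence.count(mark)


def sentence_statistics(sentence):
    """Assumed definitions of basic surface statistics for one sentence."""
    words = sentence.split()
    n_words = len(words)
    n_characters = sum(len(word) for word in words)
    return {
        "number_words": n_words,
        "number_unique_words": len({word.lower() for word in words}),
        "number_characters": n_characters,
        # Return 0.0 for empty sentences instead of dividing by zero.
        "average_characters": n_characters / n_words if n_words else 0.0,
        "number_punctuations_total": sum(ch in string.punctuation for ch in sentence),
    }


stats = sentence_statistics("Is this true? Yes, it is true.")
empty_stats = sentence_statistics("")
print(stats)
```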





In [8]:
parallel_sentences.create_embedding_information("proc_5k", language_pair="en_it")

Finished function: 'load_embeddings' in 0.86 seconds.


  0%|          | 48/25000 [00:00<00:53, 468.23it/s]

Finished function: 'load_embeddings' in 0.62 seconds.


100%|██████████| 25000/25000 [00:56<00:00, 441.92it/s]
  0%|          | 45/25000 [00:00<00:55, 448.95it/s]

Finished function: 'word_embeddings' in 56.57 seconds.


100%|██████████| 25000/25000 [01:02<00:00, 402.99it/s]


Finished function: 'word_embeddings' in 62.04 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 190043.42it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 59.74 seconds.
Finished function: 'translate_words' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 171933.78it/s]


Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1521.18it/s]


Finished function: 'tf_idf_vector' in 16.8 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1545.20it/s]
  1%|          | 224/25000 [00:00<00:11, 2233.84it/s]

Finished function: 'tf_idf_vector' in 16.54 seconds.


100%|██████████| 25000/25000 [00:10<00:00, 2400.71it/s]
  2%|▏         | 508/25000 [00:00<00:09, 2552.88it/s]

Finished function: 'sentence_embedding_average' in 10.42 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2153.48it/s]
  0%|          | 17/25000 [00:00<02:27, 169.22it/s]

Finished function: 'sentence_embedding_average' in 11.61 seconds.


  return [pd.Series(embedding_dataframe.values.mean(axis=1))]
100%|██████████| 25000/25000 [02:06<00:00, 198.03it/s]
  0%|          | 23/25000 [00:00<01:51, 224.33it/s]

Finished function: 'sentence_embedding_tf_idf' in 126.25 seconds.


100%|██████████| 25000/25000 [02:02<00:00, 204.00it/s]

Finished function: 'sentence_embedding_tf_idf' in 122.55 seconds.
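
`create_embedding_information` produces two kinds of sentence embeddings: a plain average of the word embeddings and a tf-idf weighted one. The sketch below illustrates both ideas on toy data. It is not the project's implementation: the idf formula (`log(N / df) + 1`) is one common variant chosen for illustration, the normalization by the total weight is an assumption, and all names are hypothetical.

```python
import math
from collections import Counter


def average_embedding(tokens, word_vectors):
    """Unweighted mean of the vectors of all tokens that have an embedding."""
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return None  # no known token, so no sentence embedding
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


def inverse_document_frequency(corpus):
    """idf per token over a corpus of tokenized sentences (assumed variant)."""
    doc_freq = Counter(token for doc in corpus for token in set(doc))
    return {token: math.log(len(corpus) / df) + 1.0 for token, df in doc_freq.items()}


def tf_idf_embedding(tokens, word_vectors, idf):
    """Mean of word vectors weighted by term frequency times idf."""
    counts = Counter(t for t in tokens if t in word_vectors and t in idf)
    if not counts:
        return None
    weights = {t: (count / len(tokens)) * idf[t] for t, count in counts.items()}
    total = sum(weights.values())
    dim = len(next(iter(word_vectors.values())))
    return [
        sum(weight * word_vectors[t][i] for t, weight in weights.items()) / total
        for i in range(dim)
    ]


toy_vectors = {"house": [1.0, 0.0], "green": [0.0, 1.0]}
toy_corpus = [["green", "house"], ["house"]]
toy_idf = inverse_document_frequency(toy_corpus)

mean_vector = average_embedding(["green", "house"], toy_vectors)
weighted_vector = tf_idf_embedding(["green", "house"], toy_vectors, toy_idf)
print(mean_vector, weighted_vector)
```

In the toy corpus "house" occurs in every sentence and "green" in only one, so the weighted embedding leans towards the rarer word, while the plain average treats both equally.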





In [9]:
parallel_sentences.create_embedding_information("proc_b_1k", language_pair="en_it")

Finished function: 'load_embeddings' in 0.83 seconds.


  0%|          | 47/25000 [00:00<00:53, 467.25it/s]

Finished function: 'load_embeddings' in 0.77 seconds.


100%|██████████| 25000/25000 [00:58<00:00, 424.33it/s]
  0%|          | 46/25000 [00:00<00:54, 459.05it/s]

Finished function: 'word_embeddings' in 58.92 seconds.


100%|██████████| 25000/25000 [00:56<00:00, 445.10it/s]


Finished function: 'word_embeddings' in 56.17 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 158696.74it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 57.88 seconds.
Finished function: 'translate_words' in 0.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 159527.28it/s]


Finished function: 'translate_words' in 0.16 seconds.


100%|██████████| 25000/25000 [00:17<00:00, 1420.97it/s]


Finished function: 'tf_idf_vector' in 17.98 seconds.


100%|██████████| 25000/25000 [00:17<00:00, 1469.96it/s]
  1%|          | 232/25000 [00:00<00:10, 2311.73it/s]

Finished function: 'tf_idf_vector' in 17.36 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1510.40it/s]
  1%|          | 243/25000 [00:00<00:10, 2421.41it/s]

Finished function: 'sentence_embedding_average' in 16.55 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2116.46it/s]
  0%|          | 17/25000 [00:00<02:30, 166.50it/s]

Finished function: 'sentence_embedding_average' in 11.81 seconds.


100%|██████████| 25000/25000 [02:13<00:00, 186.92it/s]
  0%|          | 22/25000 [00:00<01:53, 219.84it/s]

Finished function: 'sentence_embedding_tf_idf' in 133.75 seconds.


100%|██████████| 25000/25000 [02:06<00:00, 197.05it/s]

Finished function: 'sentence_embedding_tf_idf' in 126.89 seconds.





In [10]:
parallel_sentences.create_embedding_information("vecmap", language_pair="en_it")

Finished function: 'load_embeddings' in 0.64 seconds.


  0%|          | 28/25000 [00:00<01:29, 277.86it/s]

Finished function: 'load_embeddings' in 0.66 seconds.


100%|██████████| 25000/25000 [00:57<00:00, 431.62it/s]
  0%|          | 48/25000 [00:00<00:51, 479.99it/s]

Finished function: 'word_embeddings' in 57.92 seconds.


100%|██████████| 25000/25000 [00:56<00:00, 444.63it/s]


Finished function: 'word_embeddings' in 56.23 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 194040.60it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 59.06 seconds.
Finished function: 'translate_words' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 171957.19it/s]


Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:18<00:00, 1381.96it/s]


Finished function: 'tf_idf_vector' in 18.44 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1500.57it/s]
  2%|▏         | 444/25000 [00:00<00:11, 2132.48it/s]

Finished function: 'tf_idf_vector' in 17.05 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2246.87it/s]
  1%|          | 238/25000 [00:00<00:10, 2376.85it/s]

Finished function: 'sentence_embedding_average' in 11.13 seconds.


100%|██████████| 25000/25000 [00:10<00:00, 2459.24it/s]
  0%|          | 23/25000 [00:00<01:51, 224.27it/s]

Finished function: 'sentence_embedding_average' in 10.17 seconds.


100%|██████████| 25000/25000 [02:12<00:00, 188.42it/s]
  0%|          | 42/25000 [00:00<01:55, 216.63it/s]

Finished function: 'sentence_embedding_tf_idf' in 132.69 seconds.


100%|██████████| 25000/25000 [02:11<00:00, 189.99it/s]

Finished function: 'sentence_embedding_tf_idf' in 131.61 seconds.





In [11]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_it.json")

In [None]:
import pandas as pd
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_en_it.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl")
parallel_sentences.preprocessed = preprocessed_data

## III. Create data set

In this section we create the training dataset for the supervised model and the query-document pairs used for supervised and unsupervised retrieval.

In [12]:
from src.data import DataSet

In [13]:
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100

In [14]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.0 seconds.


In [15]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.


In [16]:
dataset.create_model_index(n_model, k, sample_size_k,
     "sentence_embedding_tf_idf_proc_5k_source", "sentence_embedding_tf_idf_proc_5k_target")

100%|██████████| 2000000/2000000 [13:06<00:00, 2542.36it/s]


Finished function: 'cosine_similarity_vector' in 786.89 seconds.


100%|██████████| 20000/20000 [00:06<00:00, 3319.12it/s]
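
The first progress bar above counts 2,000,000 cosine similarity computations, which equals `n_model * sample_size_k` (20,000 × 100). This suggests, though the notebook does not confirm it, that for every source sentence `sample_size_k` candidate target sentences are drawn at random and the `k` most similar ones are kept as difficult negative examples next to the true translation. The sketch below illustrates that assumed procedure on toy vectors. The function name, the `seed` parameter and the label column are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_model_index(source_vectors, target_vectors, k, sample_size_k, seed=0):
    """Assumed procedure: one positive pair plus the k hardest sampled negatives."""
    rng = random.Random(seed)
    n = len(source_vectors)
    rows = []
    for i in range(n):
        rows.append((i, i, 1))  # the true translation is the positive example
        candidates = [j for j in range(n) if j != i]
        sampled = rng.sample(candidates, min(sample_size_k, len(candidates)))
        # Rank the sampled targets by similarity to the source sentence.
        ranked = sorted(
            sampled,
            key=lambda j: cosine_similarity(source_vectors[i], target_vectors[j]),
            reverse=True,
        )
        rows.extend((i, j, 0) for j in ranked[:k])  # most similar wrong targets
    return rows


toy_source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
toy_target = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8], [0.8, -1.0]]
rows = build_model_index(toy_source, toy_target, k=2, sample_size_k=3)
print(rows)
```

Sampling a subset of candidates first keeps the cost linear in `sample_size_k` per sentence instead of comparing every source sentence with every target sentence.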


In [17]:
dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_it.feather")

In [18]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index_en_it.feather")

In [19]:
# dataset.create_retrieval_index(n_queries)
# If your pandas version is old, build the retrieval index by hand instead:
import pandas as pd
query_ids = dataset.retrieval_subset.iloc[:n_queries]["id_source"]  # first n_queries sentences act as queries
document_ids = dataset.retrieval_subset["id_target"]  # every retrieval sentence is a candidate document
# Cartesian product: pair each query with each candidate document.
index = pd.MultiIndex.from_product([query_ids, document_ids], names=["id_source", "id_target"])
dataset.retrieval_dataset_index = pd.DataFrame(index=index).reset_index()
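
The cell above pairs every query with every candidate document. With `n_queries = 100` and `n_retrieval = 5000` this gives 100 × 5,000 = 500,000 rows, which matches the totals of the progress bars in the feature cells further down. As a small illustration of what such an index contains, here is the same Cartesian product on toy ids using only the standard library; the helper name and the toy ids are made up for this example.

```python
from itertools import product


def retrieval_pairs(query_ids, document_ids):
    """Pair every query id with every document id (query varies slowest)."""
    return [
        {"id_source": query_id, "id_target": document_id}
        for query_id, document_id in product(query_ids, document_ids)
    ]


toy_document_ids = [10, 11, 12]
toy_query_ids = toy_document_ids[:2]  # the first sentences double as queries
pairs = retrieval_pairs(toy_query_ids, toy_document_ids)
print(len(pairs), pairs[0], pairs[-1])
```

Because the queries are taken from the same subset as the documents, each query has exactly one correct document in its candidate list: the pair with identical ids.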

In [20]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_it.feather")

In [21]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index_en_it.feather")

## IV. Create features

In this section we create the features for our model. The sentence-based features should be created before the text is preprocessed.

In [22]:
#%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [None]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
                                                             parallel_sentences.preprocessed)

In [None]:
features_model.create_feature_dataframe()

In [None]:
features_model.create_sentence_features()

In [None]:
features_model.create_embedding_features("proc_5k")

In [None]:
features_model.create_embedding_features("proc_b_1k")

In [None]:
features_model.create_embedding_features("vecmap")

In [None]:
features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model_en_it.feather")

Generation of the data for the cross-lingual information retrieval task.

In [23]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [24]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.0 seconds.


In [25]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.02 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.


  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: '

  1%|          | 3899/500000 [00:00<00:12, 38986.88it/s]

Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.


100%|██████████| 500000/500000 [00:10<00:00, 48893.17it/s]

Finished function: 'jaccard' in 10.29 seconds.
Finished function: 'create_sentence_features' in 12.78 seconds.
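
`create_sentence_features` compares the per-sentence statistics of the source and target sentence in each pair. The log names three comparison functions plus a Jaccard overlap. The sketch below gives plausible scalar versions of them. The function names are borrowed from the log, but only the absolute difference mirrors an expression visible in the warning output; the definitions of the relative and normalized difference are assumptions, as is returning 0.0 when a denominator is zero.

```python
def difference_numerical(source_value, target_value):
    """Absolute difference between the two values."""
    return abs(target_value - source_value)


def relative_difference_numerical(source_value, target_value):
    """Assumed definition: signed difference relative to the source value."""
    if source_value == 0:
        return 0.0  # avoid division by zero
    return (target_value - source_value) / source_value


def normalized_difference_numerical(source_value, target_value):
    """Assumed definition: absolute difference scaled into [0, 1] for non-negative counts."""
    denominator = max(abs(source_value), abs(target_value))
    if denominator == 0:
        return 0.0
    return abs(target_value - source_value) / denominator


def jaccard(source_tokens, target_tokens):
    """Jaccard similarity: size of the intersection over size of the union."""
    source_set, target_set = set(source_tokens), set(target_tokens)
    union = source_set | target_set
    return len(source_set & target_set) / len(union) if union else 0.0


# Example: the source sentence has 10 words, the target sentence 14.
print(difference_numerical(10, 14))
print(relative_difference_numerical(10, 14))
print(normalized_difference_numerical(10, 14))
print(jaccard(["a", "b"], ["b", "c"]))
```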





In [26]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [03:22<00:00, 2471.95it/s]
  0%|          | 168/500000 [00:00<04:58, 1676.84it/s]

Finished function: 'cosine_similarity_vector' in 202.4 seconds.


100%|██████████| 500000/500000 [03:26<00:00, 2423.36it/s]
  0%|          | 671/500000 [00:00<02:36, 3188.72it/s]

Finished function: 'cosine_similarity_vector' in 206.39 seconds.


100%|██████████| 500000/500000 [02:18<00:00, 3598.83it/s]
  0%|          | 309/500000 [00:00<02:41, 3087.19it/s]

Finished function: 'euclidean_distance_vector' in 138.99 seconds.


100%|██████████| 500000/500000 [02:12<00:00, 3783.93it/s]
  1%|          | 3361/500000 [00:00<00:14, 33609.31it/s]

Finished function: 'euclidean_distance_vector' in 132.19 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 39252.57it/s]
  1%|          | 3062/500000 [00:00<00:16, 30619.74it/s]

Finished function: 'jaccard' in 12.81 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 38962.15it/s]

Finished function: 'jaccard' in 12.91 seconds.
Finished function: 'create_embedding_features' in 705.72 seconds.
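
For each embedding type, `create_embedding_features` logs a cosine similarity, a Euclidean distance and a Jaccard computation, each of them twice (presumably once per sentence-embedding variant, although the notebook does not say so). The sketch below shows the two vector measures for a single source-target pair. The wrapper `embedding_features` and its handling of missing embeddings are assumptions for illustration; `math.dist` requires Python 3.8 or newer.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.dist(u, v)


def embedding_features(source_vector, target_vector):
    """Hypothetical wrapper: both measures for one pair, None if a vector is missing."""
    if source_vector is None or target_vector is None:
        return {"cosine_similarity": None, "euclidean_distance": None}
    return {
        "cosine_similarity": cosine_similarity(source_vector, target_vector),
        "euclidean_distance": euclidean_distance(source_vector, target_vector),
    }


features = embedding_features([1.0, 0.0], [1.0, 1.0])
missing = embedding_features(None, [1.0, 1.0])
print(features)
```

Cosine similarity ignores vector length and only compares direction, whereas the Euclidean distance is sensitive to both, so the two features can carry complementary information.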





In [27]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [03:12<00:00, 2603.09it/s]
  0%|          | 184/500000 [00:00<04:32, 1836.66it/s]

Finished function: 'cosine_similarity_vector' in 192.15 seconds.


100%|██████████| 500000/500000 [03:13<00:00, 2588.31it/s]
  0%|          | 705/500000 [00:00<02:32, 3266.84it/s]

Finished function: 'cosine_similarity_vector' in 193.25 seconds.


100%|██████████| 500000/500000 [02:24<00:00, 3462.70it/s]
  0%|          | 252/500000 [00:00<03:18, 2517.86it/s]

Finished function: 'euclidean_distance_vector' in 144.5 seconds.


100%|██████████| 500000/500000 [02:20<00:00, 3558.80it/s]
  1%|          | 2664/500000 [00:00<00:18, 26639.45it/s]

Finished function: 'euclidean_distance_vector' in 140.56 seconds.


100%|██████████| 500000/500000 [00:16<00:00, 30826.78it/s]
  1%|          | 3362/500000 [00:00<00:14, 33618.35it/s]

Finished function: 'jaccard' in 16.29 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 39102.57it/s]

Finished function: 'jaccard' in 12.85 seconds.
Finished function: 'create_embedding_features' in 699.65 seconds.





In [28]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [03:18<00:00, 2517.98it/s]
  0%|          | 170/500000 [00:00<04:54, 1699.25it/s]

Finished function: 'cosine_similarity_vector' in 198.66 seconds.


100%|██████████| 500000/500000 [04:07<00:00, 2017.44it/s]
  0%|          | 648/500000 [00:00<02:42, 3077.74it/s]

Finished function: 'cosine_similarity_vector' in 247.93 seconds.


100%|██████████| 500000/500000 [02:29<00:00, 3333.84it/s]
  0%|          | 277/500000 [00:00<03:00, 2766.02it/s]

Finished function: 'euclidean_distance_vector' in 150.04 seconds.


100%|██████████| 500000/500000 [02:46<00:00, 2998.68it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'euclidean_distance_vector' in 166.8 seconds.


100%|██████████| 500000/500000 [00:18<00:00, 27207.82it/s]
  0%|          | 2061/500000 [00:00<00:24, 20604.08it/s]

Finished function: 'jaccard' in 18.5 seconds.


100%|██████████| 500000/500000 [00:14<00:00, 34255.29it/s]

Finished function: 'jaccard' in 14.68 seconds.
Finished function: 'create_embedding_features' in 796.64 seconds.





In [29]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval_en_it.feather")