# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and German Europarl datasets and combine them into a dataframe of parallel (translated) sentences.

In [None]:
%load_ext autoreload
%autoreload 2

In [4]:
import os
import sys
# Add the project root (the parent of the notebook directory) to the import path.
sys.path.append(os.path.dirname(os.path.abspath('')))

from src.data import create_data_subset

In [9]:
create_data_subset(sentence_data_source_path="../data/external/europarl-v7.de-en.en",
                   sentence_data_target_path="../data/external/europarl-v7.de-en.de",
                   sample_size=25000,
                   sentence_data_sampled_path="../data/interim/europarl_en_de.pkl")

Finished function: 'load_doc' in 1.62 seconds.
Finished function: 'to_sentences' in 1.14 seconds.
Finished function: 'load_doc' in 2.03 seconds.
Finished function: 'to_sentences' in 1.39 seconds.
Sampled dataframe saved in: ../data/interim/europarl_en_de_test.pkl
Finished function: 'create_data_subset' in 8.57 seconds.
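The internals of `create_data_subset` are not shown in this notebook. As a rough illustration of the idea only, the following sketch pairs two aligned sentence lists and draws a reproducible random sample. The function name `sample_parallel_sentences`, the `seed` parameter, and the toy sentences are assumptions made for this example and are not part of `src.data`.

```python
import random


def sample_parallel_sentences(source_sentences, target_sentences, sample_size, seed=42):
    """Pair aligned sentences and draw a reproducible random sample of the pairs."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must contain the same number of sentences")
    # Line i of the source file is assumed to be the translation of line i of the target file.
    pairs = list(zip(source_sentences, target_sentences))
    rng = random.Random(seed)
    # Never ask for more pairs than are available.
    sample_size = min(sample_size, len(pairs))
    return rng.sample(pairs, sample_size)


# Toy data standing in for the two Europarl files (hypothetical sentences).
english = ["Resumption of the session", "Thank you.", "The debate is closed."]
german = ["Wiederaufnahme der Sitzungsperiode", "Vielen Dank.", "Die Aussprache ist geschlossen."]
sampled_pairs = sample_parallel_sentences(english, german, sample_size=2)
print(sampled_pairs)
```

Sampling whole pairs rather than sampling each language separately keeps every source sentence attached to its translation.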


## II. Preprocess Data

In this section we preprocess the parallel sentence data for feature generation.

In [10]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm
# import it_core_news_sm
# import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl

In [11]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german') # German stopwords
# stopwords_target = stopwords.words('italian') # Italian stopwords
# stopwords_target = stopwords.words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load() # German pipeline
# nlp_target = it_core_news_sm.load() # Italian pipeline
# nlp_target = pl_core_news_sm.load() # Polish pipeline

In [12]:
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de.pkl") # German
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pol.pkl") # Polish

Finished function: 'import_data' in 0.03 seconds.


In [5]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)

100%|██████████| 25000/25000 [02:46<00:00, 150.06it/s]
100%|██████████| 25000/25000 [00:00<00:00, 238907.28it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'spacy' in 166.61 seconds.
Finished function: 'remove_stopwords' in 0.11 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 165136.32it/s]
100%|██████████| 25000/25000 [00:00<00:00, 178095.43it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_punctuation' in 0.15 seconds.
Finished function: 'remove_numbers' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 87882.70it/s]
100%|██████████| 25000/25000 [00:00<00:00, 143983.54it/s]


Finished function: 'lemmatize' in 0.29 seconds.
Finished function: 'lowercase_spacy' in 0.18 seconds.


100%|██████████| 25000/25000 [02:43<00:00, 153.28it/s]
100%|██████████| 25000/25000 [00:00<00:00, 203800.11it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_cleaned_token_embedding' in 167.63 seconds.
Finished function: 'spacy' in 163.1 seconds.
Finished function: 'remove_stopwords' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 135626.33it/s]
100%|██████████| 25000/25000 [00:00<00:00, 161884.74it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_punctuation' in 0.19 seconds.
Finished function: 'remove_numbers' in 0.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 79516.79it/s]
100%|██████████| 25000/25000 [00:00<00:00, 185187.81it/s]


Finished function: 'lemmatize' in 0.32 seconds.
Finished function: 'lowercase_spacy' in 0.14 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 5536.72it/s]
100%|██████████| 25000/25000 [00:00<00:00, 246921.30it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_cleaned_token_embedding' in 164.18 seconds.
Finished function: 'tokenize_sentence' in 4.52 seconds.
Finished function: 'remove_stopwords' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 173940.51it/s]
100%|██████████| 25000/25000 [00:00<00:00, 152523.70it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'strip_whitespace' in 0.15 seconds.
Finished function: 'lowercase' in 0.17 seconds.
Finished function: 'create_cleaned_text' in 4.96 seconds.


100%|██████████| 25000/25000 [00:05<00:00, 4877.50it/s]
100%|██████████| 25000/25000 [00:00<00:00, 259202.06it/s]
 17%|█▋        | 4130/25000 [00:00<00:01, 17379.43it/s]

Finished function: 'tokenize_sentence' in 5.13 seconds.
Finished function: 'remove_stopwords' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 68682.61it/s]
100%|██████████| 25000/25000 [00:00<00:00, 160151.51it/s]


Finished function: 'strip_whitespace' in 0.37 seconds.
Finished function: 'lowercase' in 0.16 seconds.
Finished function: 'create_cleaned_text' in 5.78 seconds.
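The log above lists the cleaning steps by name (stopword removal, punctuation removal, number removal, lowercasing, and others). The following is a minimal, standard-library sketch of such a pipeline, not the code of `PreprocessingEuroParl`. The function `clean_sentence` and the toy stopword set are hypothetical, the order of the steps is simplified, and lemmatization is left out because it needs a language model such as the spaCy pipelines loaded earlier.

```python
import re
import string


def clean_sentence(sentence, stopwords):
    """Lowercase, strip punctuation and digits, and drop stopwords."""
    text = sentence.lower()
    # Remove ASCII punctuation characters.
    text = text.translate(str.maketrans("", "", string.punctuation))
    # Replace digit runs with a space so neighbouring words stay separated.
    text = re.sub(r"\d+", " ", text)
    # Whitespace tokenization also strips leading, trailing, and repeated whitespace.
    tokens = text.split()
    return [token for token in tokens if token not in stopwords]


toy_stopwords = {"the", "is", "on"}  # stand-in for stopwords.words('english')
cleaned = clean_sentence("The vote is on 12 March, 2003!", toy_stopwords)
print(cleaned)  # ['vote', 'march']
```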


In [6]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 25000/25000 [00:00<00:00, 89542.44it/s]
 38%|███▊      | 9488/25000 [00:00<00:00, 94873.53it/s]

Finished function: 'number_punctuations_total' in 0.28 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 96656.94it/s]
100%|██████████| 25000/25000 [00:00<00:00, 218360.98it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuations_total' in 0.26 seconds.
Finished function: 'number_words' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 225061.22it/s]
 18%|█▊        | 4609/25000 [00:00<00:00, 46085.87it/s]

Finished function: 'number_words' in 0.11 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 48517.67it/s]
 19%|█▉        | 4702/25000 [00:00<00:00, 47006.37it/s]

Finished function: 'number_unique_words' in 0.52 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 48498.77it/s]
 27%|██▋       | 6797/25000 [00:00<00:00, 67959.05it/s]

Finished function: 'number_unique_words' in 0.52 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 67459.26it/s]
 27%|██▋       | 6868/25000 [00:00<00:00, 68672.37it/s]

Finished function: 'number_characters' in 0.37 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 69143.84it/s]
  return (character_vector / word_vector).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)
100%|██████████| 25000/25000 [00:00<00:00, 442996.02it/s]
100%|██████████| 25000/25000 [00:00<00:00, 420223.78it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.36 seconds.
Finished function: 'average_characters' in 0.02 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 384279.58it/s]
100%|██████████| 25000/25000 [00:00<00:00, 361188.10it/s]
100%|██████████| 25000/25000 [00:00<00:00, 435815.31it/s]
100%|██████████| 25000/25000 [00:00<00:00, 371801.18it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 401328.86it/s]
100%|██████████| 25000/25000 [00:00<00:00, 405127.77it/s]
100%|██████████| 25000/25000 [00:00<00:00, 417947.52it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 395074.83it/s]
100%|██████████| 25000/25000 [00:00<00:00, 252188.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 332266.31it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.1 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 308930.80it/s]
100%|██████████| 25000/25000 [00:00<00:00, 298921.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 383965.78it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.09 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 360428.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 396272.25it/s]
100%|██████████| 25000/25000 [00:00<00:00, 311250.82it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 435456.96it/s]
100%|██████████| 25000/25000 [00:00<00:00, 382175.89it/s]
100%|██████████| 25000/25000 [00:00<00:00, 401399.53it/s]
100%|██████████| 25000/25000 [00:00<00:00, 406259.47it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 403172.86it/s]
100%|██████████| 25000/25000 [00:00<00:00, 366203.46it/s]
100%|██████████| 25000/25000 [00:00<00:00, 349195.93it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 389220.65it/s]
100%|██████████| 25000/25000 [00:00<00:00, 377780.82it/s]
100%|██████████| 25000/25000 [00:00<00:00, 389443.27it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441214.02it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 448703.84it/s]
100%|██████████| 25000/25000 [00:00<00:00, 437904.72it/s]
100%|██████████| 25000/25000 [00:00<00:00, 439746.70it/s]
100%|██████████| 25000/25000 [00:00<00:00, 425373.72it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 428485.15it/s]
100%|██████████| 25000/25000 [00:00<00:00, 414599.49it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441019.17it/s]
100%|██████████| 25000/25000 [00:00<00:00, 444351.40it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 437789.54it/s]
100%|██████████| 25000/25000 [00:00<00:00, 430662.07it/s]
100%|██████████| 25000/25000 [00:00<00:00, 446014.66it/s]
100%|██████████| 25000/25000 [00:00<00:00, 396788.08it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 442878.14it/s]
100%|██████████| 25000/25000 [00:00<00:00, 425403.06it/s]
100%|██████████| 25000/25000 [00:00<00:00, 442138.64it/s]
100%|██████████| 25000/25000 [00:00<00:00, 437888.27it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 437205.42it/s]
100%|██████████| 25000/25000 [00:00<00:00, 431884.21it/s]
100%|██████████| 25000/25000 [00:00<00:00, 449843.41it/s]
100%|██████████| 25000/25000 [00:00<00:00, 424138.34it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 447161.57it/s]
100%|██████████| 25000/25000 [00:00<00:00, 394179.28it/s]
100%|██████████| 25000/25000 [00:00<00:00, 449502.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 423384.81it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 433274.38it/s]
100%|██████████| 25000/25000 [00:00<00:00, 424153.78it/s]
100%|██████████| 25000/25000 [00:00<00:00, 423169.53it/s]
100%|██████████| 25000/25000 [00:00<00:00, 445189.04it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 448212.84it/s]
100%|██████████| 25000/25000 [00:00<00:00, 405553.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441004.33it/s]
100%|██████████| 25000/25000 [00:00<00:00, 434540.23it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 435494.94it/s]
100%|██████████| 25000/25000 [00:00<00:00, 411546.81it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441719.57it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [02:37<00:00, 158.68it/s]
  0%|          | 15/25000 [00:00<02:48, 148.41it/s]

Finished function: 'spacy' in 157.55 seconds.


100%|██████████| 25000/25000 [02:37<00:00, 158.59it/s]
100%|██████████| 25000/25000 [00:00<00:00, 171654.86it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'spacy' in 157.64 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 188828.84it/s]
100%|██████████| 25000/25000 [00:00<00:00, 165356.11it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.13 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 178394.48it/s]
100%|██████████| 25000/25000 [00:00<00:00, 168628.25it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.14 seconds.
Finished function: 'number_pos' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 190145.78it/s]
  9%|▉         | 2371/25000 [00:00<00:01, 11830.54it/s]

Finished function: 'number_pos' in 0.13 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 12393.42it/s]
 11%|█         | 2704/25000 [00:00<00:01, 13418.13it/s]

Finished function: 'number_times' in 2.02 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 13475.72it/s]
  5%|▍         | 1160/25000 [00:00<00:02, 11593.54it/s]

Finished function: 'number_times' in 1.86 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 12324.25it/s]
  5%|▌         | 1268/25000 [00:00<00:01, 12672.67it/s]

Finished function: 'number_times' in 2.03 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 13579.59it/s]
 10%|▉         | 2417/25000 [00:00<00:01, 12137.97it/s]

Finished function: 'number_times' in 1.84 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12501.33it/s]
  5%|▍         | 1240/25000 [00:00<00:01, 12398.50it/s]

Finished function: 'number_times' in 2.0 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 13634.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 191246.87it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_times' in 1.84 seconds.
Finished function: 'named_numbers' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 171824.75it/s]

Finished function: 'named_numbers' in 0.15 seconds.
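The function names in the log (`number_words`, `number_unique_words`, `number_characters`, `average_characters`, `number_punctuations_total`) suggest simple per-sentence counts. As a sketch under that assumption, and without claiming to match the notebook's exact definitions, such counts could be computed like this. The helper `sentence_statistics` is hypothetical. It guards the average against empty sentences directly, whereas the warning line in the output above indicates that the notebook divides first and replaces invalid values afterwards.

```python
import string


def sentence_statistics(sentence):
    """Compute simple surface statistics for one sentence."""
    words = sentence.split()
    number_words = len(words)
    number_unique_words = len({word.lower() for word in words})
    # Characters are counted per whitespace-separated word, so spaces are excluded.
    number_characters = sum(len(word) for word in words)
    # Avoid a division by zero for empty sentences.
    average_characters = number_characters / number_words if number_words else 0.0
    number_punctuations_total = sum(char in string.punctuation for char in sentence)
    return {
        "number_words": number_words,
        "number_unique_words": number_unique_words,
        "number_characters": number_characters,
        "average_characters": average_characters,
        "number_punctuations_total": number_punctuations_total,
    }


stats = sentence_statistics("Yes, we can; yes, we will!")
print(stats)
```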





In [7]:
parallel_sentences.create_embedding_information("proc_5k")

Finished function: 'load_embeddings' in 0.71 seconds.


  0%|          | 43/25000 [00:00<00:58, 426.08it/s]

Finished function: 'load_embeddings' in 0.56 seconds.


100%|██████████| 25000/25000 [00:53<00:00, 470.34it/s]
  0%|          | 46/25000 [00:00<00:54, 455.14it/s]

Finished function: 'word_embeddings' in 53.15 seconds.


100%|██████████| 25000/25000 [00:48<00:00, 516.47it/s]


Finished function: 'word_embeddings' in 48.4 seconds.


 95%|█████████▍| 23701/25000 [00:00<00:00, 121801.47it/s]

Finished function: 'create_translation_dictionary' in 70.05 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 118072.25it/s]
100%|██████████| 25000/25000 [00:00<00:00, 142362.41it/s]


Finished function: 'translate_words' in 0.21 seconds.
Finished function: 'translate_words' in 0.18 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1529.85it/s]


Finished function: 'tf_idf_vector' in 16.7 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1619.98it/s]
  1%|          | 140/25000 [00:00<00:17, 1391.77it/s]

Finished function: 'tf_idf_vector' in 15.84 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2266.72it/s]
  1%|          | 202/25000 [00:00<00:12, 2018.49it/s]

Finished function: 'sentence_embedding_average' in 11.03 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2147.21it/s]
  0%|          | 17/25000 [00:00<02:36, 159.15it/s]

Finished function: 'sentence_embedding_average' in 11.64 seconds.


  return [pd.Series(embedding_dataframe.values.mean(axis=1))]
100%|██████████| 25000/25000 [01:59<00:00, 208.62it/s]
  0%|          | 19/25000 [00:00<02:14, 185.15it/s]

Finished function: 'sentence_embedding_tf_idf' in 119.84 seconds.


100%|██████████| 25000/25000 [01:45<00:00, 237.45it/s]

Finished function: 'sentence_embedding_tf_idf' in 105.29 seconds.
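The log shows two kinds of sentence embeddings being built from word embeddings: a plain average (`sentence_embedding_average`) and a tf-idf weighted variant (`sentence_embedding_tf_idf`). The sketch below illustrates both ideas with plain Python lists. The helper names, the toy two-dimensional embeddings, and the smoothed idf formula `log(N / df) + 1` are assumptions for this example; the weighting used in `src.data` may differ.

```python
import math
from collections import Counter


def inverse_document_frequency(tokenized_corpus):
    """idf(t) = log(N / df(t)) + 1, an assumed smoothing choice for this sketch."""
    n_documents = len(tokenized_corpus)
    document_frequency = Counter(
        token for tokens in tokenized_corpus for token in set(tokens)
    )
    return {token: math.log(n_documents / df) + 1.0
            for token, df in document_frequency.items()}


def average_embedding(tokens, embeddings):
    """Unweighted mean of the word vectors of all tokens with a known embedding."""
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    return [sum(values) / len(vectors) for values in zip(*vectors)]


def tf_idf_embedding(tokens, embeddings, idf):
    """Mean of the word vectors, weighted by term frequency times idf."""
    counts = Counter(token for token in tokens if token in embeddings)
    total_weight = sum(count * idf.get(token, 0.0) for token, count in counts.items())
    if total_weight == 0:
        return None
    dimension = len(next(iter(embeddings.values())))
    result = [0.0] * dimension
    for token, count in counts.items():
        weight = count * idf.get(token, 0.0) / total_weight
        for i, value in enumerate(embeddings[token]):
            result[i] += weight * value
    return result


# Toy embeddings and corpus (hypothetical values).
embeddings = {"house": [1.0, 0.0], "green": [0.0, 1.0]}
corpus = [["green", "house"], ["house"]]
idf = inverse_document_frequency(corpus)

averaged = average_embedding(["green", "house"], embeddings)
weighted = tf_idf_embedding(["green", "house"], embeddings, idf)
print(averaged, weighted)
```

In the toy corpus "green" is rarer than "house", so the weighted vector leans towards the embedding of "green", while the plain average treats both words equally.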





In [8]:
parallel_sentences.create_embedding_information("proc_b_1k")

Finished function: 'load_embeddings' in 0.75 seconds.


  0%|          | 86/25000 [00:00<01:00, 408.84it/s]

Finished function: 'load_embeddings' in 0.63 seconds.


100%|██████████| 25000/25000 [00:53<00:00, 466.93it/s]
  0%|          | 43/25000 [00:00<00:58, 429.91it/s]

Finished function: 'word_embeddings' in 53.54 seconds.


100%|██████████| 25000/25000 [00:50<00:00, 498.92it/s]


Finished function: 'word_embeddings' in 50.11 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 166672.12it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 66.74 seconds.
Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 127926.29it/s]


Finished function: 'translate_words' in 0.2 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1544.08it/s]


Finished function: 'tf_idf_vector' in 16.52 seconds.


100%|██████████| 25000/25000 [00:14<00:00, 1674.31it/s]
  2%|▏         | 401/25000 [00:00<00:12, 1905.24it/s]

Finished function: 'tf_idf_vector' in 15.33 seconds.


100%|██████████| 25000/25000 [00:09<00:00, 2517.33it/s]
  1%|          | 236/25000 [00:00<00:10, 2356.11it/s]

Finished function: 'sentence_embedding_average' in 9.93 seconds.


100%|██████████| 25000/25000 [00:10<00:00, 2476.25it/s]
  0%|          | 17/25000 [00:00<02:39, 156.86it/s]

Finished function: 'sentence_embedding_average' in 10.1 seconds.


100%|██████████| 25000/25000 [01:58<00:00, 210.57it/s]
  0%|          | 20/25000 [00:00<02:07, 195.43it/s]

Finished function: 'sentence_embedding_tf_idf' in 118.73 seconds.


100%|██████████| 25000/25000 [01:45<00:00, 237.88it/s]

Finished function: 'sentence_embedding_tf_idf' in 105.1 seconds.





In [9]:
parallel_sentences.create_embedding_information("vecmap")

Finished function: 'load_embeddings' in 0.56 seconds.


  0%|          | 39/25000 [00:00<01:05, 382.00it/s]

Finished function: 'load_embeddings' in 0.51 seconds.


100%|██████████| 25000/25000 [00:54<00:00, 462.79it/s]
  0%|          | 45/25000 [00:00<00:56, 444.50it/s]

Finished function: 'word_embeddings' in 54.02 seconds.


100%|██████████| 25000/25000 [00:49<00:00, 509.00it/s]


Finished function: 'word_embeddings' in 49.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 139462.78it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 67.73 seconds.
Finished function: 'translate_words' in 0.18 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 138240.64it/s]


Finished function: 'translate_words' in 0.18 seconds.


100%|██████████| 25000/25000 [00:16<00:00, 1536.53it/s]


Finished function: 'tf_idf_vector' in 16.6 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1635.24it/s]
  1%|          | 189/25000 [00:00<00:13, 1885.08it/s]

Finished function: 'tf_idf_vector' in 15.7 seconds.


100%|██████████| 25000/25000 [00:10<00:00, 2357.60it/s]
  2%|▏         | 505/25000 [00:00<00:09, 2496.43it/s]

Finished function: 'sentence_embedding_average' in 10.61 seconds.


100%|██████████| 25000/25000 [00:18<00:00, 1387.98it/s]
  0%|          | 18/25000 [00:00<02:19, 178.88it/s]

Finished function: 'sentence_embedding_average' in 18.01 seconds.


100%|██████████| 25000/25000 [02:05<00:00, 199.07it/s]
  0%|          | 39/25000 [00:00<02:19, 179.50it/s]

Finished function: 'sentence_embedding_tf_idf' in 125.59 seconds.


100%|██████████| 25000/25000 [01:47<00:00, 232.33it/s]

Finished function: 'sentence_embedding_tf_idf' in 107.7 seconds.





In [10]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_de.json")

In [None]:
# Optional: reload the preprocessed data saved above instead of recomputing it.
import pandas as pd
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_en_de.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de.pkl")
parallel_sentences.preprocessed = preprocessed_data

## III. Create Data Set

In this section we create the dataset for training the supervised model and the dataset used for supervised and unsupervised retrieval.

In [11]:
from src.data import DataSet

In [12]:
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100

In [14]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.0 seconds.


In [15]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.


In [16]:
dataset.create_model_index(n_model, k, sample_size_k,
                           "sentence_embedding_tf_idf_proc_5k_source",
                           "sentence_embedding_tf_idf_proc_5k_target")

100%|██████████| 2000000/2000000 [13:28<00:00, 2473.02it/s]


Finished function: 'cosine_similarity_vector' in 808.96 seconds.


100%|██████████| 20000/20000 [00:06<00:00, 3267.09it/s]
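The implementation of `create_model_index` is not shown here. The numbers in the logs are consistent with one plausible reading: 20,000 source sentences times `sample_size_k = 100` sampled candidates gives the 2,000,000 cosine similarities above, and 20,000 times (1 translation + `k = 10` negatives) gives the 220,000 rows that appear in the feature generation later on. Under that assumption, a sketch of such hard-negative sampling could look as follows. All names below are hypothetical.

```python
import math
import random


def cosine_similarity(u, v):
    """Cosine of the angle between two vectors, 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def build_training_index(source_vectors, target_vectors, k, sample_size_k, seed=0):
    """Return (source_id, target_id, label) triples.

    Each source sentence gets its translation (label 1) plus the k most similar
    wrong targets (label 0) out of sample_size_k randomly drawn candidates.
    """
    rng = random.Random(seed)
    n_sentences = len(source_vectors)
    triples = []
    for source_id in range(n_sentences):
        triples.append((source_id, source_id, 1))
        others = [t for t in range(n_sentences) if t != source_id]
        candidates = rng.sample(others, min(sample_size_k, len(others)))
        # Most similar wrong candidates first: these are the "hard" negatives.
        candidates.sort(
            key=lambda t: cosine_similarity(source_vectors[source_id], target_vectors[t]),
            reverse=True,
        )
        triples.extend((source_id, t, 0) for t in candidates[:k])
    return triples


# Toy sentence embeddings (hypothetical values); index i in both lists is a translation pair.
toy_source = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [1.0, -1.0]]
toy_target = [[0.9, 0.1], [0.1, 0.9], [1.0, 0.8], [0.8, -1.0]]
training_index = build_training_index(toy_source, toy_target, k=2, sample_size_k=3)
print(len(training_index))  # 4 sentences * (1 positive + 2 negatives) = 12
```

Drawing candidates at random before ranking keeps the cost at `n_model * sample_size_k` similarity computations instead of `n_model * n_model`.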


In [17]:
dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_de.feather")

In [18]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index_en_de.feather")

In [21]:
import pandas as pd
# dataset.create_retrieval_index(n_queries)

# If your pandas version is old, build the retrieval index manually instead:
# pair each of the first n_queries source sentences with every target sentence.
query = pd.DataFrame({"id_source": dataset.retrieval_subset.iloc[:n_queries]["id_source"]})
documents = pd.DataFrame({"id_target": dataset.retrieval_subset["id_target"]})
index = pd.MultiIndex.from_product([query["id_source"], documents["id_target"]],
                                   names=["id_source", "id_target"])
dataset.retrieval_dataset_index = pd.DataFrame(index=index).reset_index()
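
The retrieval index is the Cartesian product of the query ids and all candidate document ids. The following standard-library sketch shows the same construction with `itertools.product`; the function `build_retrieval_index` and the toy ids are hypothetical. Assuming the retrieval subset holds `n_retrieval = 5000` sentence pairs, `n_queries = 100` would yield 100 * 5000 = 500,000 rows.

```python
from itertools import product


def build_retrieval_index(source_ids, target_ids, n_queries):
    """Pair each of the first n_queries source ids with every target id."""
    queries = source_ids[:n_queries]
    return [
        {"id_source": query_id, "id_target": document_id}
        for query_id, document_id in product(queries, target_ids)
    ]


retrieval_index = build_retrieval_index([10, 11, 12], [10, 11, 12], n_queries=2)
print(len(retrieval_index))  # 2 queries * 3 documents = 6
```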

In [22]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_de.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index_en_de.feather")

## IV. Create Features

In this section we create the features for our model. These features are sentence based and should be created before the text is preprocessed.

In [23]:
#%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [24]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
                                                             parallel_sentences.preprocessed)

In [25]:
features_model.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.0 seconds.


In [26]:
features_model.create_sentence_features()

Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.0 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.0 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'd

  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: '

  2%|▏         | 4206/220000 [00:00<00:05, 42054.12it/s]

Finished function: 'normalized_difference_numerical' in 0.01 seconds.


100%|██████████| 220000/220000 [00:04<00:00, 49972.91it/s]

Finished function: 'jaccard' in 4.44 seconds.
Finished function: 'create_sentence_features' in 5.77 seconds.
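The log shows that each numerical sentence statistic is turned into pair features (`difference_numerical`, `relative_difference_numerical`, `normalized_difference_numerical`) and that a `jaccard` feature is computed as well. Only the absolute difference is visible in the warning output; the other definitions are not shown. The sketch below therefore uses the absolute difference as seen in the output, an assumed definition of the relative difference (difference divided by the larger magnitude), and the standard Jaccard coefficient on token sets. The normalized variant is omitted because its definition cannot be inferred from the log.

```python
def difference(source_value, target_value):
    """Absolute difference between a source and a target statistic."""
    return abs(target_value - source_value)


def relative_difference(source_value, target_value):
    """Assumed definition: absolute difference divided by the larger magnitude."""
    denominator = max(abs(source_value), abs(target_value))
    # Guard against 0 / 0 when both statistics are zero.
    return abs(target_value - source_value) / denominator if denominator else 0.0


def jaccard(tokens_a, tokens_b):
    """Jaccard coefficient |A intersect B| / |A union B| of two token collections."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


print(difference(12, 15))                          # 3
print(relative_difference(10, 5))                  # 0.5
print(jaccard(["a", "b", "c"], ["b", "c", "d"]))   # 0.5
```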





In [27]:
features_model.create_embedding_features("proc_5k")

100%|██████████| 220000/220000 [01:29<00:00, 2450.43it/s]
  0%|          | 472/220000 [00:00<01:36, 2283.64it/s]

Finished function: 'cosine_similarity_vector' in 89.82 seconds.


100%|██████████| 220000/220000 [01:24<00:00, 2604.21it/s]
  0%|          | 633/220000 [00:00<01:21, 2700.48it/s]

Finished function: 'cosine_similarity_vector' in 84.52 seconds.


100%|██████████| 220000/220000 [00:59<00:00, 3715.12it/s]
  0%|          | 606/220000 [00:00<01:21, 2703.32it/s]

Finished function: 'euclidean_distance_vector' in 59.26 seconds.


100%|██████████| 220000/220000 [00:58<00:00, 3787.58it/s]
  1%|          | 2290/220000 [00:00<00:09, 22899.97it/s]

Finished function: 'euclidean_distance_vector' in 58.12 seconds.


100%|██████████| 220000/220000 [00:06<00:00, 36610.21it/s]
  1%|          | 2717/220000 [00:00<00:07, 27164.00it/s]

Finished function: 'jaccard' in 6.03 seconds.


100%|██████████| 220000/220000 [00:06<00:00, 36316.79it/s]

Finished function: 'jaccard' in 6.08 seconds.
Finished function: 'create_embedding_features' in 303.84 seconds.
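For each sentence pair, the embedding features compare the source and target sentence vectors with a cosine similarity and a Euclidean distance. The log shows each measure twice per embedding set, presumably once for the averaged and once for the tf-idf weighted sentence embeddings, although the notebook does not state this. A minimal sketch of the two measures on plain lists, with hypothetical function names:

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between u and v; 0.0 if either vector has zero length."""
    dot = sum(x * y for x, y in zip(u, v))
    norm = math.sqrt(sum(x * x for x in u)) * math.sqrt(sum(y * y for y in v))
    return dot / norm if norm else 0.0


def euclidean_distance(u, v):
    """Straight-line distance between u and v."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(u, v)))


print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0 (orthogonal vectors)
print(euclidean_distance([0.0, 0.0], [3.0, 4.0]))  # 5.0
```

Cosine similarity ignores vector length and only compares direction, while the Euclidean distance is sensitive to both, so the two features can carry different information for the classifier.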





In [28]:
features_model.create_embedding_features("proc_b_1k")

100%|██████████| 220000/220000 [01:27<00:00, 2528.52it/s]
  0%|          | 122/220000 [00:00<03:02, 1203.52it/s]

Finished function: 'cosine_similarity_vector' in 87.02 seconds.


100%|██████████| 220000/220000 [01:25<00:00, 2563.96it/s]
  0%|          | 275/220000 [00:00<01:20, 2745.74it/s]

Finished function: 'cosine_similarity_vector' in 85.84 seconds.


100%|██████████| 220000/220000 [00:57<00:00, 3835.17it/s]
  0%|          | 299/220000 [00:00<01:13, 2982.54it/s]

Finished function: 'euclidean_distance_vector' in 57.4 seconds.


100%|██████████| 220000/220000 [00:57<00:00, 3831.48it/s]
  1%|▏         | 2978/220000 [00:00<00:07, 29775.20it/s]

Finished function: 'euclidean_distance_vector' in 57.45 seconds.


100%|██████████| 220000/220000 [00:05<00:00, 37263.61it/s]
  1%|▏         | 3135/220000 [00:00<00:06, 31346.89it/s]

Finished function: 'jaccard' in 5.94 seconds.


100%|██████████| 220000/220000 [00:06<00:00, 36582.63it/s]

Finished function: 'jaccard' in 6.05 seconds.
Finished function: 'create_embedding_features' in 299.71 seconds.





In [29]:
features_model.create_embedding_features("vecmap")

100%|██████████| 220000/220000 [01:27<00:00, 2519.11it/s]
  0%|          | 172/220000 [00:00<02:08, 1715.41it/s]

Finished function: 'cosine_similarity_vector' in 87.37 seconds.


100%|██████████| 220000/220000 [01:24<00:00, 2600.26it/s]
  0%|          | 682/220000 [00:00<01:09, 3173.53it/s]

Finished function: 'cosine_similarity_vector' in 84.64 seconds.


100%|██████████| 220000/220000 [00:57<00:00, 3832.43it/s]
  0%|          | 701/220000 [00:00<01:06, 3310.82it/s]

Finished function: 'euclidean_distance_vector' in 57.44 seconds.


100%|██████████| 220000/220000 [00:57<00:00, 3850.43it/s]
  1%|▏         | 2912/220000 [00:00<00:07, 29117.32it/s]

Finished function: 'euclidean_distance_vector' in 57.17 seconds.


100%|██████████| 220000/220000 [00:05<00:00, 37154.78it/s]
  1%|▏         | 3139/220000 [00:00<00:06, 31386.89it/s]

Finished function: 'jaccard' in 5.96 seconds.


100%|██████████| 220000/220000 [00:06<00:00, 36579.73it/s]

Finished function: 'jaccard' in 6.05 seconds.
Finished function: 'create_embedding_features' in 298.64 seconds.





In [30]:
features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_de.feather")

In [31]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model_en_de.feather")

Next, we generate the features for the cross-lingual information retrieval task.

In [32]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [33]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.0 seconds.
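
The `Finished function: ... in ... seconds.` lines throughout this notebook look like the output of a timing decorator. The actual helper in `src` is not shown here, so the following is only a sketch of how such a decorator could be written. The name `timer` and the toy function `add` are hypothetical.

```python
import functools
import time


def timer(func):
    """Print how long the wrapped function took, in the style of the log lines above."""
    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        result = func(*args, **kwargs)
        elapsed = round(time.perf_counter() - start, 2)
        print(f"Finished function: {func.__name__!r} in {elapsed} seconds.")
        return result
    return wrapper


@timer
def add(a, b):
    """Toy function used only to demonstrate the decorator."""
    return a + b


total = add(2, 3)
```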


In [34]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.


  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: '

100%|██████████| 500000/500000 [00:10<00:00, 49454.73it/s]

Finished function: 'jaccard' in 10.16 seconds.
Finished function: 'create_sentence_features' in 12.59 seconds.
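
The warning fragment in the output above comes from a line that passes `np.log(0)` as a value to replace. NumPy evaluates `np.log(0)` to `-inf` and emits a `RuntimeWarning` (divide by zero encountered in log) each time it does so. A minimal sketch of one way to zero out non-finite values without triggering that warning follows. The helper name `zero_non_finite` and the relative-difference formula are illustrative assumptions, not the definitions used in `src`.

```python
import numpy as np
import pandas as pd


def zero_non_finite(values: pd.Series) -> pd.Series:
    """Replace NaN, +inf and -inf with 0 without evaluating np.log(0)."""
    return values.replace([np.inf, -np.inf], np.nan).fillna(0)


# Toy feature columns (hypothetical values).
source_counts = pd.Series([4.0, 0.0, 0.0, 2.0])
target_counts = pd.Series([6.0, 3.0, 0.0, np.nan])

# Illustrative relative difference; dividing by zero yields inf or NaN.
with np.errstate(divide="ignore", invalid="ignore"):
    relative = (target_counts - source_counts) / source_counts

cleaned = zero_non_finite(relative.abs())
```

Naming `-np.inf` explicitly covers the same case as `np.log(0)` while keeping the output free of warnings.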





In [35]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [03:12<00:00, 2603.43it/s]
  0%|          | 196/500000 [00:00<04:15, 1954.94it/s]

Finished function: 'cosine_similarity_vector' in 192.11 seconds.


100%|██████████| 500000/500000 [03:10<00:00, 2620.02it/s]
  0%|          | 681/500000 [00:00<02:36, 3194.07it/s]

Finished function: 'cosine_similarity_vector' in 190.89 seconds.


100%|██████████| 500000/500000 [02:09<00:00, 3849.30it/s]
  0%|          | 243/500000 [00:00<03:25, 2426.53it/s]

Finished function: 'euclidean_distance_vector' in 129.94 seconds.


100%|██████████| 500000/500000 [02:09<00:00, 3846.95it/s]
  1%|          | 3253/500000 [00:00<00:15, 32527.39it/s]

Finished function: 'euclidean_distance_vector' in 130.03 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 36531.93it/s]
  1%|          | 2964/500000 [00:00<00:16, 29634.66it/s]

Finished function: 'jaccard' in 13.74 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 40778.67it/s]

Finished function: 'jaccard' in 12.32 seconds.
Finished function: 'create_embedding_features' in 669.07 seconds.
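
The `jaccard` passes in the log compute a set-overlap score per sentence pair. Which token sets the project compares is not shown in this notebook, so the sketch below only illustrates the measure itself: the size of the intersection divided by the size of the union. The function name `jaccard_similarity` and the choice to return 0 for two empty inputs are assumptions.

```python
def jaccard_similarity(source_tokens, target_tokens):
    """Jaccard similarity |A ∩ B| / |A ∪ B| of two token collections."""
    source_set, target_set = set(source_tokens), set(target_tokens)
    union = source_set | target_set
    if not union:
        # Both inputs are empty; defined as 0 here by assumption.
        return 0.0
    return len(source_set & target_set) / len(union)


# Toy token lists (hypothetical): two shared tokens out of four distinct ones.
score = jaccard_similarity(
    ["europa", "2020", "parlament"],
    ["parlament", "2020", "bericht"],
)
```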





In [36]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [03:08<00:00, 2654.21it/s]
  0%|          | 139/500000 [00:00<06:00, 1386.30it/s]

Finished function: 'cosine_similarity_vector' in 188.44 seconds.


100%|██████████| 500000/500000 [03:09<00:00, 2639.72it/s]
  0%|          | 691/500000 [00:00<02:33, 3255.00it/s]

Finished function: 'cosine_similarity_vector' in 189.47 seconds.


100%|██████████| 500000/500000 [02:09<00:00, 3855.85it/s]
  0%|          | 707/500000 [00:00<02:31, 3304.17it/s]

Finished function: 'euclidean_distance_vector' in 129.73 seconds.


100%|██████████| 500000/500000 [02:10<00:00, 3825.91it/s]
  1%|          | 3126/500000 [00:00<00:15, 31258.09it/s]

Finished function: 'euclidean_distance_vector' in 130.74 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 41233.22it/s]
  1%|          | 3048/500000 [00:00<00:16, 30479.74it/s]

Finished function: 'jaccard' in 12.18 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 40835.64it/s]

Finished function: 'jaccard' in 12.3 seconds.
Finished function: 'create_embedding_features' in 662.88 seconds.





In [37]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [03:10<00:00, 2628.65it/s]
  0%|          | 148/500000 [00:00<05:39, 1474.12it/s]

Finished function: 'cosine_similarity_vector' in 190.27 seconds.


100%|██████████| 500000/500000 [03:11<00:00, 2612.48it/s]
  0%|          | 296/500000 [00:00<02:49, 2952.24it/s]

Finished function: 'cosine_similarity_vector' in 191.43 seconds.


100%|██████████| 500000/500000 [02:09<00:00, 3863.98it/s]
  0%|          | 697/500000 [00:00<02:34, 3240.46it/s]

Finished function: 'euclidean_distance_vector' in 129.46 seconds.


100%|██████████| 500000/500000 [02:09<00:00, 3871.01it/s]
  1%|          | 3123/500000 [00:00<00:15, 31226.61it/s]

Finished function: 'euclidean_distance_vector' in 129.22 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 41239.57it/s]
  1%|▏         | 6915/500000 [00:00<00:15, 31968.05it/s]

Finished function: 'jaccard' in 12.18 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 35720.76it/s]

Finished function: 'jaccard' in 14.05 seconds.
Finished function: 'create_embedding_features' in 666.64 seconds.





In [38]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_de.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval_en_de.feather")