# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and Polish Europarl datasets and combine them into a dataframe of parallel (translated) sentences.

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import os
import sys

# Make the project root (the parent of the notebook directory) importable.
sys.path.append(os.path.dirname(os.path.abspath('')))

from src.data import create_data_subset

In [2]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.pl-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.pl-en.pl',
                   sample_size=25000,
                   sentence_data_sampled_path='../data/interim/europarl_en_pl.pkl')

Finished function: 'load_doc' in 0.86 seconds.
Finished function: 'to_sentences' in 0.41 seconds.
Finished function: 'load_doc' in 0.78 seconds.
Finished function: 'to_sentences' in 0.48 seconds.
Sampled dataframe saved in: ../data/interim/europarl_en_pl.pkl
Finished function: 'create_data_subset' in 3.32 seconds.
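
The internals of `create_data_subset` are not shown in this notebook. As a rough illustration of the idea only, the sketch below pairs two line-aligned sentence lists and draws a random subset. The helper name `sample_parallel_sentences`, the `seed` parameter, and the filtering of empty lines are assumptions made for this example, not the project's actual implementation; the record keys mirror the column names shown in the dataframe at the end of this notebook.

```python
import random


def sample_parallel_sentences(source_lines, target_lines, sample_size, seed=42):
    """Pair line-aligned sentences and draw a random subset (hypothetical helper)."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must be line-aligned")
    # Keep only pairs where both sides are non-empty after stripping whitespace.
    pairs = [
        (src.strip(), tgt.strip())
        for src, tgt in zip(source_lines, target_lines)
        if src.strip() and tgt.strip()
    ]
    rng = random.Random(seed)
    sampled = rng.sample(pairs, min(sample_size, len(pairs)))
    # Matching ids mark a pair as a true translation of each other.
    return [
        {"id_source": i, "text_source": src, "id_target": i, "text_target": tgt}
        for i, (src, tgt) in enumerate(sampled)
    ]


english = ["Resumption of the session", "", "Thank you."]
polish = ["Wznowienie sesji", "", "Dziękuję."]
subset = sample_parallel_sentences(english, polish, sample_size=2)
```

Fixing the seed keeps the sample reproducible across runs, which matters when several notebooks are expected to work on the same subset.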


In [None]:
!python -m spacy download pl_core_news_sm

## II. Preprocess Data

In this section we preprocess the parallel sentence data for feature generation.

In [6]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
# import de_core_news_sm
# import it_core_news_sm
import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl
from stop_words import get_stop_words

In [7]:
stopwords_source = stopwords.words('english')
# stopwords_target = stopwords.words('german') # German stopwords
# stopwords_target = stopwords.words('italian') # Italian stopwords
stopwords_target = get_stop_words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
# nlp_target = de_core_news_sm.load() # German pipeline
# nlp_target = it_core_news_sm.load() # Italian pipeline
nlp_target = pl_core_news_sm.load() # Polish pipeline

In [8]:
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de.pkl") # German
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pl.pkl") # Polish

Finished function: 'import_data' in 0.06 seconds.


In [9]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)

100%|██████████| 25000/25000 [03:58<00:00, 104.69it/s]
 10%|█         | 2625/25000 [00:00<00:01, 11405.80it/s]

Finished function: 'spacy' in 238.82 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 24206.65it/s]
 64%|██████▍   | 16080/25000 [00:00<00:00, 73045.86it/s]

Finished function: 'remove_punctuation' in 1.04 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 85825.46it/s]
 14%|█▍        | 3507/25000 [00:00<00:00, 35066.86it/s]

Finished function: 'remove_numbers' in 0.3 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 47910.81it/s]
 71%|███████▏  | 17861/25000 [00:00<00:00, 21962.04it/s] 

Finished function: 'lemmatize' in 0.52 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 35292.34it/s]
  9%|▉         | 2325/25000 [00:00<00:00, 23237.50it/s]

Finished function: 'lowercase_spacy' in 0.71 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 22346.87it/s]
  0%|          | 2/25000 [00:00<29:14, 14.25it/s]

Finished function: 'remove_stopwords' in 1.12 seconds.
Finished function: 'create_cleaned_token_embedding' in 242.98 seconds.


100%|██████████| 25000/25000 [05:30<00:00, 75.64it/s]  
  2%|▏         | 417/25000 [00:00<00:05, 4138.50it/s]

Finished function: 'spacy' in 330.52 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 9481.21it/s]
100%|██████████| 25000/25000 [00:00<00:00, 163378.71it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_punctuation' in 2.64 seconds.
Finished function: 'remove_numbers' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 70491.04it/s]
100%|██████████| 25000/25000 [00:00<00:00, 161316.41it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'lemmatize' in 0.36 seconds.
Finished function: 'lowercase_spacy' in 0.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 31347.33it/s]
  0%|          | 1/25000 [00:00<53:02,  7.86it/s]

Finished function: 'remove_stopwords' in 0.8 seconds.
Finished function: 'create_cleaned_token_embedding' in 338.15 seconds.


100%|██████████| 25000/25000 [00:05<00:00, 4916.16it/s]
  9%|▉         | 2260/25000 [00:00<00:01, 22597.70it/s]

Finished function: 'tokenize_sentence' in 5.09 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 23064.49it/s]
100%|██████████| 25000/25000 [00:00<00:00, 211009.53it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_stopwords' in 1.09 seconds.
Finished function: 'strip_whitespace' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 184612.35it/s]
  9%|▉         | 2228/25000 [00:00<00:01, 22278.00it/s]

Finished function: 'lowercase' in 0.14 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 23138.81it/s]
  1%|▏         | 372/25000 [00:00<00:06, 3716.55it/s]

Finished function: 'remove_stopwords' in 1.08 seconds.
Finished function: 'create_cleaned_text' in 7.55 seconds.


100%|██████████| 25000/25000 [00:05<00:00, 4462.57it/s]
 22%|██▏       | 5435/25000 [00:00<00:00, 26778.45it/s]

Finished function: 'tokenize_sentence' in 5.61 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 19592.81it/s]
100%|██████████| 25000/25000 [00:00<00:00, 203908.30it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_stopwords' in 1.28 seconds.
Finished function: 'strip_whitespace' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 139551.68it/s]
 11%|█         | 2646/25000 [00:00<00:00, 26459.46it/s]

Finished function: 'lowercase' in 0.18 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 27126.56it/s]

Finished function: 'remove_stopwords' in 0.92 seconds.
Finished function: 'create_cleaned_text' in 8.16 seconds.
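
The log above names the cleaning steps that run on each sentence (`remove_punctuation`, `remove_numbers`, `lemmatize`, `lowercase`, `remove_stopwords`). The sketch below shows the same kind of pipeline in plain Python. It is a simplified stand-in, not the code in `PreprocessingEuroParl`: it uses a naive regex tokenizer instead of spaCy, skips lemmatization entirely, and the function name `clean_tokens` and the tiny stopword set are made up for the example.

```python
import re
import string


def clean_tokens(sentence, stopwords):
    """Tokenize, then drop punctuation, numbers and stopwords (simplified sketch)."""
    # Naive tokenizer: runs of word characters, or single non-space symbols.
    tokens = re.findall(r"\w+|[^\w\s]", sentence)
    # remove_punctuation: drop tokens that are a single ASCII punctuation mark.
    tokens = [t for t in tokens if t not in string.punctuation]
    # remove_numbers: drop tokens that contain any digit.
    tokens = [t for t in tokens if not any(c.isdigit() for c in t)]
    # lowercase
    tokens = [t.lower() for t in tokens]
    # remove_stopwords
    return [t for t in tokens if t not in stopwords]


example_stopwords = {"after", "all", "this", "is", "the", "we", "that", "in"}
cleaned = clean_tokens(
    "After all, this is the goal we all say that we share in 2020.",
    example_stopwords,
)
print(cleaned)
```

Lowercasing before the stopword filter matters here: stopword lists are typically lowercase, so a sentence-initial "After" would otherwise slip through.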





In [10]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 25000/25000 [00:00<00:00, 117305.34it/s]
 44%|████▍     | 11004/25000 [00:00<00:00, 110032.23it/s]

Finished function: 'number_punctuations_total' in 0.22 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 91285.54it/s] 
100%|██████████| 25000/25000 [00:00<00:00, 248553.70it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuations_total' in 0.28 seconds.
Finished function: 'number_words' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 254235.99it/s]
 22%|██▏       | 5482/25000 [00:00<00:00, 54818.35it/s]

Finished function: 'number_words' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 56174.73it/s]
 39%|███▉      | 9717/25000 [00:00<00:00, 47922.72it/s]

Finished function: 'number_unique_words' in 0.45 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 46373.59it/s]
 32%|███▏      | 7877/25000 [00:00<00:00, 78766.88it/s]

Finished function: 'number_unique_words' in 0.54 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 80714.14it/s]
 32%|███▏      | 8013/25000 [00:00<00:00, 80126.83it/s]

Finished function: 'number_characters' in 0.31 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 71760.70it/s]
  return (character_vector / word_vector).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)
100%|██████████| 25000/25000 [00:00<00:00, 475898.63it/s]
100%|██████████| 25000/25000 [00:00<00:00, 391217.37it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.35 seconds.
Finished function: 'average_characters' in 0.03 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 436986.78it/s]
100%|██████████| 25000/25000 [00:00<00:00, 367301.50it/s]
100%|██████████| 25000/25000 [00:00<00:00, 479772.32it/s]
100%|██████████| 25000/25000 [00:00<00:00, 376365.19it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 423242.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 276096.54it/s]
100%|██████████| 25000/25000 [00:00<00:00, 315007.83it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.09 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 356664.57it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464286.00it/s]
100%|██████████| 25000/25000 [00:00<00:00, 407497.25it/s]
100%|██████████| 25000/25000 [00:00<00:00, 453817.02it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 309030.95it/s]
100%|██████████| 25000/25000 [00:00<00:00, 337113.92it/s]
100%|██████████| 25000/25000 [00:00<00:00, 341580.18it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 438720.04it/s]
100%|██████████| 25000/25000 [00:00<00:00, 412072.43it/s]
100%|██████████| 25000/25000 [00:00<00:00, 514729.47it/s]
100%|██████████| 25000/25000 [00:00<00:00, 351142.93it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 437462.61it/s]
100%|██████████| 25000/25000 [00:00<00:00, 312598.12it/s]
100%|██████████| 25000/25000 [00:00<00:00, 386566.10it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 337267.89it/s]
100%|██████████| 25000/25000 [00:00<00:00, 453013.17it/s]
100%|██████████| 25000/25000 [00:00<00:00, 428206.93it/s]
100%|██████████| 25000/25000 [00:00<00:00, 470297.81it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 366396.68it/s]
100%|██████████| 25000/25000 [00:00<00:00, 445069.99it/s]
100%|██████████| 25000/25000 [00:00<00:00, 398952.94it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464829.35it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 346712.34it/s]
100%|██████████| 25000/25000 [00:00<00:00, 462698.24it/s]
100%|██████████| 25000/25000 [00:00<00:00, 402427.05it/s]
100%|██████████| 25000/25000 [00:00<00:00, 437727.41it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 327748.62it/s]
100%|██████████| 25000/25000 [00:00<00:00, 256855.98it/s]
100%|██████████| 25000/25000 [00:00<00:00, 355298.94it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.1 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 453532.41it/s]
100%|██████████| 25000/25000 [00:00<00:00, 349308.76it/s]
100%|██████████| 25000/25000 [00:00<00:00, 429074.27it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 351711.81it/s]
100%|██████████| 25000/25000 [00:00<00:00, 301739.81it/s]
100%|██████████| 25000/25000 [00:00<00:00, 369975.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 511268.55it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 425923.17it/s]
100%|██████████| 25000/25000 [00:00<00:00, 489936.13it/s]
100%|██████████| 25000/25000 [00:00<00:00, 435017.96it/s]
100%|██████████| 25000/25000 [00:00<00:00, 522071.81it/s]

Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 416014.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 469670.07it/s]
100%|██████████| 25000/25000 [00:00<00:00, 427196.84it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 449390.36it/s]
100%|██████████| 25000/25000 [00:00<00:00, 341761.65it/s]
100%|██████████| 25000/25000 [00:00<00:00, 455486.73it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 277672.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 476116.88it/s]
100%|██████████| 25000/25000 [00:00<00:00, 412023.86it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464372.36it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.09 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 364505.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 508772.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 429656.22it/s]
100%|██████████| 25000/25000 [00:00<00:00, 507080.23it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 405685.77it/s]
  0%|          | 6/25000 [00:00<07:08, 58.32it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [03:45<00:00, 110.99it/s]
  0%|          | 6/25000 [00:00<07:00, 59.45it/s]

Finished function: 'spacy' in 225.24 seconds.


100%|██████████| 25000/25000 [04:46<00:00, 87.12it/s] 
  3%|▎         | 673/25000 [00:00<00:03, 6696.53it/s]

Finished function: 'spacy' in 286.97 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 17904.72it/s]
 10%|▉         | 2446/25000 [00:00<00:00, 24456.82it/s]

Finished function: 'number_pos' in 1.4 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 36807.19it/s]
 30%|██▉       | 7375/25000 [00:00<00:00, 73749.89it/s]

Finished function: 'number_pos' in 0.68 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 76728.54it/s]
 36%|███▌      | 9013/25000 [00:00<00:00, 90117.41it/s]

Finished function: 'number_pos' in 0.33 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 96913.86it/s]
 34%|███▍      | 8594/25000 [00:00<00:00, 85933.12it/s]

Finished function: 'number_pos' in 0.26 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 87063.99it/s]
 41%|████      | 10165/25000 [00:00<00:00, 101639.92it/s]

Finished function: 'number_pos' in 0.29 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 101679.80it/s]
  2%|▏         | 608/25000 [00:00<00:04, 6079.89it/s]

Finished function: 'number_pos' in 0.25 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 6124.42it/s]
  3%|▎         | 686/25000 [00:00<00:03, 6851.71it/s]

Finished function: 'number_times' in 4.08 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 8955.36it/s] 
  9%|▉         | 2208/25000 [00:00<00:02, 10986.46it/s]

Finished function: 'number_times' in 2.79 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 11099.43it/s]
  2%|▏         | 584/25000 [00:00<00:04, 5835.39it/s]

Finished function: 'number_times' in 2.25 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 11840.50it/s]
  9%|▉         | 2353/25000 [00:00<00:01, 11577.94it/s]

Finished function: 'number_times' in 2.11 seconds.


100%|██████████| 25000/25000 [00:02<00:00, 11628.76it/s]
  5%|▌         | 1331/25000 [00:00<00:01, 13305.89it/s]

Finished function: 'number_times' in 2.15 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 13557.51it/s]
100%|██████████| 25000/25000 [00:00<00:00, 148436.68it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_times' in 1.85 seconds.
Finished function: 'named_numbers' in 0.17 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 150108.80it/s]

Finished function: 'named_numbers' in 0.17 seconds.
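
Several of the features logged above are simple per-sentence counts. The sketch below computes a handful of them for a single string. The dictionary keys reuse the function names from the log, but the exact definitions (for example, whitespace splitting and case-insensitive uniqueness) are assumptions for illustration. The guard on the average mirrors the intent of the warning line in the output, where NaN and infinite ratios are replaced with 0.

```python
import string


def sentence_features(sentence):
    """Compute a few surface features for one sentence (illustrative definitions)."""
    words = sentence.split()
    n_words = len(words)
    n_chars = sum(len(w) for w in words)
    return {
        "number_punctuations_total": sum(ch in string.punctuation for ch in sentence),
        "number_words": n_words,
        "number_unique_words": len({w.lower() for w in words}),
        "number_characters": n_chars,
        # Return 0.0 for empty sentences instead of dividing by zero.
        "average_characters": n_chars / n_words if n_words else 0.0,
    }


features = sentence_features("Thank you, thank you!")
empty_features = sentence_features("")
```

Because both sides of a translation pair get the same features, differences or ratios between the source and target values can later serve as inputs to a supervised model.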





In [11]:
parallel_sentences.create_embedding_information("proc_5k", language_pair="en_pl")

Finished function: 'load_embeddings' in 1.28 seconds.


  0%|          | 43/25000 [00:00<00:58, 425.92it/s]

Finished function: 'load_embeddings' in 1.11 seconds.


100%|██████████| 25000/25000 [00:53<00:00, 463.27it/s]
  0%|          | 41/25000 [00:00<01:01, 406.46it/s]

Finished function: 'word_embeddings' in 53.97 seconds.


100%|██████████| 25000/25000 [00:53<00:00, 464.95it/s]


Finished function: 'word_embeddings' in 53.77 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 260614.84it/s]
 74%|███████▍  | 18550/25000 [00:00<00:00, 185496.20it/s]

Finished function: 'create_translation_dictionary' in 68.06 seconds.
Finished function: 'translate_words' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 186087.98it/s]


Finished function: 'translate_words' in 0.14 seconds.


100%|██████████| 25000/25000 [00:09<00:00, 2742.28it/s]


Finished function: 'tf_idf_vector' in 9.38 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1652.17it/s]
  0%|          | 2/25000 [00:00<21:03, 19.78it/s]

Finished function: 'tf_idf_vector' in 15.57 seconds.


100%|██████████| 25000/25000 [00:21<00:00, 1182.08it/s]
  1%|          | 165/25000 [00:00<00:15, 1638.34it/s]

Finished function: 'sentence_embedding_average' in 21.15 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1903.43it/s]
  return [pd.Series(embedding_dataframe.values.mean(axis=1))]
  0%|          | 22/25000 [00:00<01:55, 215.41it/s]

Finished function: 'sentence_embedding_average' in 13.14 seconds.


100%|██████████| 25000/25000 [01:26<00:00, 288.95it/s]
  0%|          | 25/25000 [00:00<01:48, 230.84it/s]

Finished function: 'sentence_embedding_tf_idf' in 86.55 seconds.


100%|██████████| 25000/25000 [01:59<00:00, 208.57it/s]


Finished function: 'sentence_embedding_tf_idf' in 119.91 seconds.
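
The last two steps in this log build one vector per sentence from its word vectors, once as a plain average and once weighted by TF-IDF. A minimal pure-Python sketch of both ideas follows. The function names, the handling of out-of-vocabulary tokens, and the normalization by the sum of weights are assumptions for this example; the project code may weight or normalize differently.

```python
def average_embedding(tokens, embeddings):
    """Mean of the word vectors of all tokens found in `embeddings`."""
    vectors = [embeddings[t] for t in tokens if t in embeddings]
    if not vectors:
        return None  # no known word in the sentence
    dim = len(vectors[0])
    return [sum(v[i] for v in vectors) / len(vectors) for i in range(dim)]


def tf_idf_weighted_embedding(tokens, embeddings, weights):
    """Weighted mean of word vectors, using per-word TF-IDF weights."""
    pairs = [(embeddings[t], weights.get(t, 0.0)) for t in tokens if t in embeddings]
    total = sum(w for _, w in pairs)
    if not pairs or total == 0:
        return None
    dim = len(pairs[0][0])
    # Assumption: normalize by the sum of weights so the result stays on
    # the same scale as the plain average.
    return [sum(v[i] * w for v, w in pairs) / total for i in range(dim)]


toy_embeddings = {"goal": [1.0, 0.0], "share": [0.0, 1.0]}
toy_tokens = ["goal", "share", "unknown"]
plain = average_embedding(toy_tokens, toy_embeddings)
weighted = tf_idf_weighted_embedding(
    toy_tokens, toy_embeddings, {"goal": 0.75, "share": 0.25}
)
```

The weighted variant lets rare, informative words dominate the sentence vector, while the plain average treats every in-vocabulary word equally.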


In [12]:
parallel_sentences.create_embedding_information("proc_b_1k", language_pair="en_pl")

Finished function: 'load_embeddings' in 0.97 seconds.


  0%|          | 16/25000 [00:00<02:36, 159.75it/s]

Finished function: 'load_embeddings' in 0.66 seconds.


100%|██████████| 25000/25000 [00:47<00:00, 529.37it/s]
  0%|          | 40/25000 [00:00<01:06, 372.85it/s]

Finished function: 'word_embeddings' in 47.23 seconds.


100%|██████████| 25000/25000 [00:49<00:00, 508.56it/s]


Finished function: 'word_embeddings' in 49.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 195619.63it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 69.93 seconds.
Finished function: 'translate_words' in 0.13 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 144900.59it/s]


Finished function: 'translate_words' in 0.17 seconds.


100%|██████████| 25000/25000 [00:08<00:00, 2888.44it/s]


Finished function: 'tf_idf_vector' in 8.9 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2095.27it/s]
  1%|          | 208/25000 [00:00<00:11, 2077.86it/s]

Finished function: 'tf_idf_vector' in 12.29 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2250.27it/s]
  1%|          | 239/25000 [00:00<00:10, 2384.16it/s]

Finished function: 'sentence_embedding_average' in 11.11 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1888.27it/s]
  0%|          | 29/25000 [00:00<01:27, 286.12it/s]

Finished function: 'sentence_embedding_average' in 13.24 seconds.


100%|██████████| 25000/25000 [01:23<00:00, 298.03it/s]
  0%|          | 29/25000 [00:00<01:27, 286.72it/s]

Finished function: 'sentence_embedding_tf_idf' in 83.94 seconds.


100%|██████████| 25000/25000 [01:31<00:00, 271.75it/s]

Finished function: 'sentence_embedding_tf_idf' in 92.01 seconds.
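
The `tf_idf_vector` step produces, for each sentence, a dictionary that maps words to weights (see the `tf_idf_*` columns in the dataframe at the end of this notebook). The notebook does not show which TF-IDF variant is used. The sketch below uses one common choice as an assumption: raw term counts, a smoothed inverse document frequency of `ln((1 + n) / (1 + df)) + 1`, and L2 normalization per sentence. The function name is hypothetical.

```python
import math
from collections import Counter


def tf_idf_per_sentence(tokenized_sentences):
    """Return one {word: weight} dict per sentence (assumed TF-IDF variant)."""
    n_docs = len(tokenized_sentences)
    # Document frequency: in how many sentences does each word occur?
    doc_freq = Counter(tok for sent in tokenized_sentences for tok in set(sent))
    result = []
    for sent in tokenized_sentences:
        counts = Counter(sent)
        raw = {
            tok: count * (math.log((1 + n_docs) / (1 + doc_freq[tok])) + 1)
            for tok, count in counts.items()
        }
        # L2-normalize so that weights are comparable across sentence lengths.
        norm = math.sqrt(sum(v * v for v in raw.values()))
        result.append({tok: v / norm for tok, v in raw.items()} if norm else {})
    return result


toy_corpus = [["goal", "say", "share"], ["goal", "fight"], []]
tf_idf = tf_idf_per_sentence(toy_corpus)
```

In the toy corpus, "goal" appears in two sentences and therefore receives a lower weight than "say" or "share", which each appear in only one.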





In [13]:
parallel_sentences.create_embedding_information("vecmap", language_pair="en_pl")

Finished function: 'load_embeddings' in 0.79 seconds.


  1%|          | 147/25000 [00:00<00:34, 720.71it/s]

Finished function: 'load_embeddings' in 0.66 seconds.


100%|██████████| 25000/25000 [00:36<00:00, 678.53it/s]
  0%|          | 53/25000 [00:00<00:48, 519.60it/s]

Finished function: 'word_embeddings' in 36.85 seconds.


100%|██████████| 25000/25000 [00:48<00:00, 517.04it/s]


Finished function: 'word_embeddings' in 48.35 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 216300.69it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 63.87 seconds.
Finished function: 'translate_words' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 160473.32it/s]


Finished function: 'translate_words' in 0.16 seconds.


100%|██████████| 25000/25000 [00:08<00:00, 2796.41it/s]


Finished function: 'tf_idf_vector' in 9.19 seconds.


100%|██████████| 25000/25000 [00:12<00:00, 2059.82it/s]
  1%|          | 143/25000 [00:00<00:17, 1423.97it/s]

Finished function: 'tf_idf_vector' in 12.6 seconds.


100%|██████████| 25000/25000 [00:11<00:00, 2257.93it/s]
  1%|          | 160/25000 [00:00<00:15, 1588.89it/s]

Finished function: 'sentence_embedding_average' in 11.07 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1919.34it/s]
  0%|          | 28/25000 [00:00<01:29, 279.90it/s]

Finished function: 'sentence_embedding_average' in 13.03 seconds.


100%|██████████| 25000/25000 [01:20<00:00, 310.92it/s]
  0%|          | 25/25000 [00:00<01:46, 233.57it/s]

Finished function: 'sentence_embedding_tf_idf' in 80.41 seconds.


100%|██████████| 25000/25000 [01:35<00:00, 261.74it/s]

Finished function: 'sentence_embedding_tf_idf' in 95.52 seconds.
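
Each call to `create_embedding_information` also logs a `create_translation_dictionary` step followed by `translate_words`. How the dictionary is built is not shown here. One plausible approach, assuming source and target word vectors live in a shared cross-lingual space, is a nearest-neighbour lookup by cosine similarity. The sketch below illustrates that idea with hypothetical helper names and toy two-dimensional vectors; it is not the project's implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 for zero vectors)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def build_translation_dictionary(source_vectors, target_vectors):
    """Map each source word to its most similar target word."""
    return {
        src: max(target_vectors, key=lambda tgt: cosine(src_vec, target_vectors[tgt]))
        for src, src_vec in source_vectors.items()
    }


def translate_tokens(tokens, dictionary):
    """Translate word by word, skipping tokens missing from the dictionary."""
    return [dictionary[t] for t in tokens if t in dictionary]


english_vectors = {"fight": [1.0, 0.1], "work": [0.1, 1.0]}
polish_vectors = {"walka": [0.9, 0.2], "praca": [0.2, 0.9]}
dictionary = build_translation_dictionary(english_vectors, polish_vectors)
translated = translate_tokens(["fight", "undeclared", "work"], dictionary)
```

This brute-force search compares every source word with every target word, so a real vocabulary would normally call for a vectorized or approximate nearest-neighbour search instead.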





In [14]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_pl.json")

In [15]:
parallel_sentences.preprocessed

Unnamed: 0,id_source,id_target,token_preprocessed_embedding_source,token_preprocessed_embedding_target,Translation,number_punctuations_total_source,number_punctuations_total_target,number_words_source,number_words_target,number_unique_words_source,...,sentence_embedding_average_proc_b_1k_source,sentence_embedding_average_proc_b_1k_target,sentence_embedding_tf_idf_proc_b_1k_source,sentence_embedding_tf_idf_proc_b_1k_target,translated_to_target_vecmap_source,translated_to_source_vecmap_target,sentence_embedding_average_vecmap_source,sentence_embedding_average_vecmap_target,sentence_embedding_tf_idf_vecmap_source,sentence_embedding_tf_idf_vecmap_target
0,0,0,"[fight, undeclared, work, widespread, domestic...","[więcej, walka, z, praca, nierejestrowaną, któ...",1,3,4,17,22,16,...,"[[0.0002838987306209414, -0.019273668960002915...","[[0.014335259014521451, 0.00018965698751237462...","[[-0.00021047590691174376, -0.0056726738907578...","[[0.004026147518240783, -0.0001390160811865146...","[walkę, neutralność, prace, rozpowszechnienie,...","[less, fight, along, work, another, since, sec...","[[-0.1938670820423535, 0.025120466432001973, -...","[[-0.13137145680101478, -0.016225614420631352,...","[[-0.053670995357906634, 0.0029820666373567392...","[[-0.02837276402285368, -0.004121396385896437,..."
1,1,1,"[goal, say, share]","[w, koniec, cela, który, twierdzić, wszyscy, p...",1,1,3,3,7,3,...,"[[0.003693463901678721, -0.01951286941766739, ...","[[0.0028768247769524655, -0.02109205846985181,...","[[0.00045217300043198333, -0.00950566398922733...","[[0.005712840439277451, -0.01091033432579134, ...","[wywalczenie, powiedzieć, sprzedać]","[since, beginning, casa, another, disprove, ev...","[[-0.2845593939224879, -0.12511111795902252, -...","[[-0.09147724602371454, -0.13585265167057514, ...","[[-0.15326538806887333, -0.061999269481892626,...","[[-0.0391518184192676, -0.04997417943527003, -..."
2,2,2,"[mr, president, rather, like, mr, schulz, conc...","[pani, przewodniczący, podobnie, pan, schulz, ...",1,2,2,14,17,13,...,"[[-0.02075207233428955, -0.023995181473975, 0....","[[-0.021793486467296525, 0.005810244574344584,...","[[-0.004336027120259941, -0.005060519301008339...","[[-0.0014635108940764736, 0.001177877982031354...","[pani, prezydent, raczej, takie, pani, schröde...","[mrs, chairman, unlike, mr, lange, beginning, ...","[[-0.18414082734559017, -0.09646152361081196, ...","[[-0.09970268786751799, -0.08895570006487626, ...","[[-0.04796280174016808, -0.023740545777922124,...","[[-0.022703221940974183, -0.016542937329247637..."
3,3,3,"[writing, welcome, ms, marian, harkin, 's, rep...","[pismo, z, zadowolenie, przyjmować, sprawozdan...",1,1,1,13,18,13,...,"[[-0.027708127008130152, -0.02301789327369382,...","[[-0.0016926788375712931, 0.00553107401356101,...","[[-0.007986982589232606, -0.005852513778259378...","[[0.0010018837857502347, 0.0020518884301006725...","[pisanie, serdecznie, anna, teresa, thatcher, ...","[newspaper, along, dissatisfaction, prescribe,...","[[-0.14424067921936512, -0.11529516351098816, ...","[[-0.14495723915752023, -0.08850125632307027, ...","[[-0.02622177980817667, -0.03732730928331812, ...","[[-0.02873036793153511, -0.01622434315431163, ..."
4,4,4,"[must, make, difference, opportunity, reassure...","[musić, sprawić, aby, te, różnica, stać, okazj...",1,1,1,8,11,8,...,"[[0.03252435673493892, -0.017803869675844908, ...","[[0.04475131108322077, 0.0010855019920402104, ...","[[0.013019942435215433, -0.009802636239078445,...","[[0.010895567908825732, -0.0006262334834721017...","[powinien, dać, różnica, okazja, przekonać, eu...","[make, allow, different, difference, become, o...","[[-0.26769606070593, -0.13227443769574165, -0....","[[-0.210455062902636, -0.06927116243686113, -0...","[[-0.09721756027445141, -0.05619072551686986, ...","[[-0.061185799416916056, -0.019028736888225175..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,24995,24995,"[bosman, judgment, court, examine, type, restr...","[w, wyrok, w, sprawa, bosman, trybunał, przean...",1,0,4,12,17,12,...,"[[0.017704669661311942, -0.0011741653592749076...","[[0.022024729723731675, 0.00889726960643505, 0...","[[0.006112599920603325, 1.729136721664926e-05,...","[[0.006160566100297899, 0.0010525265338731705,...","[boer, werdykt, sąd, analizować, typ, ogranicz...","[since, conviction, since, case, captain, trib...","[[-0.23768041824752634, -0.07696458539629186, ...","[[-0.1821831771483024, -0.03618997276450197, -...","[[-0.06569058839895493, -0.02387716349928876, ...","[[-0.041337720099176176, -0.006944955772073381..."
24996,24996,24996,"[urge, develop, country, least, us, raise, bid]","[obecnie, wzywać, kraj, uprzemysłowione, w, ty...",1,2,2,7,13,7,...,"[[0.0390678205128227, 0.0002555195242166519, 0...","[[-0.010872848849329684, 0.0076006797769676065...","[[0.015756938269822777, -0.0009966028830326735...","[[0.002074098245790323, 0.001640318170330213, ...","[skłonny, rozwijać, kraj, przynajmniej, usa, p...","[currently, country, since, lastly, also, unit...","[[-0.2176687480615718, -0.045634376018175056, ...","[[-0.11650129676693016, -0.019244895099998556,...","[[-0.081310045977543, -0.019216476069332248, -...","[[-0.029907682143598395, -0.002722837056983037..."
24997,24997,24997,"[opinion, duty, give, fight, misuse, hard, dru...","[moim, zdanie, naszym, obowiązek, nierezygnowa...",1,1,1,12,17,12,...,"[[0.03263462439645082, -0.027260263916105032, ...","[[0.004082426366706689, 0.007997502495224277, ...","[[0.009257623429409284, -0.008334003374923531,...","[[0.0014845895720771523, 0.0012757205532066757...","[opinia, służby, dać, walkę, nadużycie, twardy...","[think, wording, surely, obligation, along, fi...","[[-0.29579253246386844, -0.09873284818604589, ...","[[-0.2096769094467163, -0.07020765030756593, -...","[[-0.0838266234520192, -0.02501571072107024, -...","[[-0.044796592528507866, -0.013348316703308282..."
24998,24998,24998,"[understand, consumer, demand, enable, private...","[zrozumienie, popyt, konsumencki, umożliwić, p...",1,1,1,14,19,14,...,"[[0.01621616169411157, -0.01912250173544245, -...","[[0.03305096714757383, -0.017004272987833247, ...","[[0.003939290051787707, -0.005331207558207234,...","[[0.008247069033624838, -0.004289889591721116,...","[zrozumieć, konsument, popyt, umożliwiać, pryw...","[understanding, profitability, enable, company...","[[-0.27957853728107046, -0.008006543648662046,...","[[-0.2097105401335284, -0.008116696728393435, ...","[[-0.07228680966825449, -0.0032469309268726605...","[[-0.048736239088120865, -0.001692401413858680..."


In [16]:
parallel_sentences.dataframe

Unnamed: 0,id_source,text_source,text_target,id_target,text_preprocessed_source,text_preprocessed_target,text_source_spacy,text_target_spacy,word_embedding_proc_5k_source,word_embedding_proc_5k_target,tf_idf_proc_5k_source,tf_idf_proc_5k_target,word_embedding_proc_b_1k_source,word_embedding_proc_b_1k_target,tf_idf_proc_b_1k_source,tf_idf_proc_b_1k_target,word_embedding_vecmap_source,word_embedding_vecmap_target,tf_idf_vecmap_source,tf_idf_vecmap_target
0,0,"What is more, the fight against undeclared wor...","Co więcej, walka z pracą nierejestrowaną, któr...",0,"[,, fight, undeclared, work, ,, widespread, do...","[więcej, ,, walka, z, pracą, nierejestrowaną, ...","[What, is, more, ,, the, fight, against, undec...","[Co, więcej, ,, walka, z, pracą, nierejestrowa...",fight undeclared work widesprea...,więcej walka z praca ...,"{'fight': 0.22316617570215289, 'undeclared': 0...","{'więcej': 0.20446593950575123, 'walka': 0.195...",fight undeclared work widesprea...,więcej walka z praca ...,"{'fight': 0.22316617570215289, 'undeclared': 0...","{'więcej': 0.20446593950575123, 'walka': 0.195...",fight undeclared work widesprea...,więcej walka z praca ...,"{'fight': 0.22316617570215289, 'undeclared': 0...","{'więcej': 0.20446593950575123, 'walka': 0.195..."
1,1,"After all, this is the goal we all say that we...","W końcu jest to cel, który - jak twierdzimy - ...",1,"[,, goal, say, share, .]","[w, końcu, cel, ,, który, -, twierdzimy, -, ws...","[After, all, ,, this, is, the, goal, we, all, ...","[W, końcu, jest, to, cel, ,, który, -, jak, tw...",goal say share 0 -0.02371...,w koniec cela który t...,"{'goal': 0.6661962453387232, 'say': 0.43475323...","{'w': 0.1153031815764954, 'koniec': 0.38160613...",goal say share 0 -0.04515...,w koniec cela który t...,"{'goal': 0.6661962453387232, 'say': 0.43475323...","{'w': 0.1153031815764954, 'koniec': 0.38160613...",goal say share 0 -0.11391...,w koniec cela który t...,"{'goal': 0.6661962453387232, 'say': 0.43475323...","{'w': 0.1153031815764954, 'koniec': 0.38160613..."
2,2,"Mr President, rather like Mr Schulz, I was con...","Panie przewodniczący! Podobnie jak pan Schulz,...",2,"[mr, president, ,, rather, like, mr, schulz, ,...","[panie, przewodniczący, !, podobnie, pan, schu...","[Mr, President, ,, rather, like, Mr, Schulz, ,...","[Panie, przewodniczący, !, Podobnie, jak, pan,...",mr president rather like ...,pani przewodniczący podobnie ...,"{'mr': 0.3038185093178962, 'president': 0.1492...","{'pani': 0.1384883751318738, 'przewodniczący':...",mr president rather like ...,pani przewodniczący podobnie ...,"{'mr': 0.3038185093178962, 'president': 0.1492...","{'pani': 0.1384883751318738, 'przewodniczący':...",mr president rather like ...,pani przewodniczący podobnie ...,"{'mr': 0.3038185093178962, 'president': 0.1492...","{'pani': 0.1384883751318738, 'przewodniczący':..."
3,3,in writing. - I welcome Ms Marian Harkin's rep...,na piśmie. - Z zadowoleniem przyjmuję sprawozd...,3,"[writing, ., -, welcome, ms, marian, harkin, '...","[piśmie, ., -, z, zadowoleniem, przyjmuję, spr...","[in, writing, ., -, I, welcome, Ms, Marian, Ha...","[na, piśmie, ., -, Z, zadowoleniem, przyjmuję,...",writing welcome ms marian ...,pismo z zadowolenie przyjmow...,"{'writing': 0.21399994434026803, 'welcome': 0....","{'pismo': 0.20768010157923242, 'z': 0.09263291...",writing welcome ms marian ...,pismo z zadowolenie przyjmow...,"{'writing': 0.21399994434026803, 'welcome': 0....","{'pismo': 0.20768010157923242, 'z': 0.09263291...",writing welcome ms marian ...,pismo z zadowolenie przyjmow...,"{'writing': 0.21399994434026803, 'welcome': 0....","{'pismo': 0.20768010157923242, 'z': 0.09263291..."
4,4,We must make these differences an opportunity ...,"Musimy sprawić, aby te różnice stały się okazj...",4,"[must, make, differences, opportunity, reassur...","[musimy, sprawić, ,, aby, te, różnice, stały, ...","[We, must, make, these, differences, an, oppor...","[Musimy, sprawić, ,, aby, te, różnice, stały, ...",must make difference opportuni...,sprawić aby te różnica ...,"{'must': 0.2249334925638152, 'make': 0.2276239...","{'musić': 0.16536147641724275, 'sprawić': 0.33...",must make difference opportuni...,sprawić aby te różnica ...,"{'must': 0.2249334925638152, 'make': 0.2276239...","{'musić': 0.16536147641724275, 'sprawić': 0.33...",must make difference opportuni...,sprawić aby te różnica ...,"{'must': 0.2249334925638152, 'make': 0.2276239...","{'musić': 0.16536147641724275, 'sprawić': 0.33..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
24995,24995,In the Bosman judgment the Court examined two ...,"W wyroku w sprawie Bosman, Trybunał przeanaliz...",24995,"[bosman, judgment, court, examined, two, types...","[w, wyroku, w, sprawie, bosman, ,, trybunał, p...","[In, the, Bosman, judgment, the, Court, examin...","[W, wyroku, w, sprawie, Bosman, ,, Trybunał, p...",bosman judgment court examine ...,w wyrok sprawa bosman t...,"{'bosman': 0.4363252925507528, 'judgment': 0.3...","{'w': 0.12258972918750365, 'wyrok': 0.27716831...",bosman judgment court examine ...,w wyrok sprawa bosman t...,"{'bosman': 0.4363252925507528, 'judgment': 0.3...","{'w': 0.12258972918750365, 'wyrok': 0.27716831...",bosman judgment court examine ...,w wyrok sprawa bosman t...,"{'bosman': 0.4363252925507528, 'judgment': 0.3...","{'w': 0.12258972918750365, 'wyrok': 0.27716831..."
24996,24996,We now urge other developed countries - not le...,Obecnie wzywamy inne kraje uprzemysłowione - w...,24996,"[urge, developed, countries, -, least, us, -, ...","[obecnie, wzywamy, inne, kraje, uprzemysłowion...","[We, now, urge, other, developed, countries, -...","[Obecnie, wzywamy, inne, kraje, uprzemysłowion...",urge develop country least ...,obecnie kraj w tym ...,"{'urge': 0.4013945000642252, 'develop': 0.3083...","{'obecnie': 0.24084899624789596, 'wzywać': 0.2...",urge develop country least ...,obecnie kraj w tym ...,"{'urge': 0.4013945000642252, 'develop': 0.3083...","{'obecnie': 0.24084899624789596, 'wzywać': 0.2...",urge develop country least ...,obecnie kraj w tym ...,"{'urge': 0.4013945000642252, 'develop': 0.3083...","{'obecnie': 0.24084899624789596, 'wzywać': 0.2..."
24997,24997,In my opinion it is our duty not to give up th...,Moim zdaniem naszym obowiązkiem jest nierezygn...,24997,"[opinion, duty, give, fight, misuse, hard, dru...","[moim, zdaniem, naszym, obowiązkiem, nierezygn...","[In, my, opinion, it, is, our, duty, not, to, ...","[Moim, zdaniem, naszym, obowiązkiem, jest, nie...",opinion duty give fight ...,moim zdanie naszym obowiązek ...,"{'opinion': 0.2450297013198977, 'duty': 0.2851...","{'moim': 0.1806000726434773, 'zdanie': 0.18179...",opinion duty give fight ...,moim zdanie naszym obowiązek ...,"{'opinion': 0.2450297013198977, 'duty': 0.2851...","{'moim': 0.1806000726434773, 'zdanie': 0.18179...",opinion duty give fight ...,moim zdanie naszym obowiązek ...,"{'opinion': 0.2450297013198977, 'duty': 0.2851...","{'moim': 0.1806000726434773, 'zdanie': 0.18179..."
24998,24998,Understanding consumer demand will enable priv...,Zrozumienie popytu konsumenckiego umożliwi prz...,24998,"[understanding, consumer, demand, enable, priv...","[zrozumienie, popytu, konsumenckiego, umożliwi...","[Understanding, consumer, demand, will, enable...","[Zrozumienie, popytu, konsumenckiego, umożliwi...",understand consumer demand enable ...,zrozumienie popyt umożliwić przedsi...,"{'understand': 0.2716060638797365, 'consumer':...","{'zrozumienie': 0.2704750648734728, 'popyt': 0...",understand consumer demand enable ...,zrozumienie popyt umożliwić przedsi...,"{'understand': 0.2716060638797365, 'consumer':...","{'zrozumienie': 0.2704750648734728, 'popyt': 0...",understand consumer demand enable ...,zrozumienie popyt umożliwić przedsi...,"{'understand': 0.2716060638797365, 'consumer':...","{'zrozumienie': 0.2704750648734728, 'popyt': 0..."


In [13]:
import pandas as pd
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_en_pl.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pl.pkl")
parallel_sentences.preprocessed = preprocessed_data

Finished function: 'import_data' in 0.05 seconds.


## III. Create data set

In this section we create the dataset for training the supervised model and the dataset used for supervised and unsupervised retrieval.
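A minimal sketch of how such a split could look, assuming the first `n_model` rows are used for training and the following `n_retrieval` rows form the retrieval collection. The helper `split_model_retrieval_sketch` is hypothetical and is not the `DataSet.split_model_retrieval` implementation from `src.data`, whose exact slicing is not shown in this notebook.

```python
import pandas as pd

def split_model_retrieval_sketch(df, n_model, n_retrieval):
    """Hypothetical helper: contiguous, non-overlapping model and retrieval subsets."""
    model_subset = df.iloc[:n_model]
    retrieval_subset = df.iloc[n_model:n_model + n_retrieval]
    return model_subset, retrieval_subset

# Toy data standing in for the preprocessed parallel sentences.
toy = pd.DataFrame({"id_source": range(10), "id_target": range(10)})
model_subset, retrieval_subset = split_model_retrieval_sketch(toy, n_model=6, n_retrieval=3)
```

Keeping the two subsets disjoint means retrieval is evaluated on sentences the supervised model has not seen during training.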

In [17]:
from src.data import DataSet

In [18]:
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100
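
These settings appear to determine the pair counts shown in the progress bars of this notebook. The check below assumes, as an interpretation of the logs rather than something confirmed by the `DataSet` source, that each training sentence is compared with `sample_size_k` candidates and that every query is paired with every retrieval document.

```python
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100

# Candidate pairs scored while building the model index (assumed):
# each of the n_model source sentences is compared with sample_size_k targets.
model_candidate_pairs = n_model * sample_size_k

# Query-document pairs in the retrieval index:
# every query is paired with every document of the retrieval subset.
retrieval_pairs = n_queries * n_retrieval

print(model_candidate_pairs)  # 2000000
print(retrieval_pairs)        # 500000
```

Both numbers match the totals of the `cosine_similarity_vector` progress bars (2,000,000 for the model index, 500,000 for each retrieval feature).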

In [19]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.0 seconds.


In [20]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.


In [23]:
dataset.create_model_index(n_model, k, sample_size_k,
     "sentence_embedding_tf_idf_proc_5k_source", "sentence_embedding_tf_idf_proc_5k_target")

100%|██████████| 2000000/2000000 [16:21<00:00, 2036.66it/s]


Finished function: 'cosine_similarity_vector' in 982.61 seconds.


100%|██████████| 20000/20000 [00:10<00:00, 1900.52it/s]
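
The two progress bars above (2,000,000 and then 20,000 iterations) are consistent with a negative-sampling scheme in which, for every source sentence, `sample_size_k` candidate targets are scored by cosine similarity and the `k` most similar ones are kept as hard negatives. This is an assumption about `create_model_index`, not its actual code. A small sketch of that idea, with the hypothetical helper `hard_negatives_sketch`:

```python
import numpy as np

def hard_negatives_sketch(source_vecs, target_vecs, k, sample_size_k, seed=0):
    """Hypothetical: for each source row i, sample candidate targets (excluding i),
    rank them by cosine similarity and keep the k most similar as negatives."""
    rng = np.random.default_rng(seed)
    # Normalise rows so that a dot product equals cosine similarity
    # (assumes no all-zero rows).
    src = source_vecs / np.linalg.norm(source_vecs, axis=1, keepdims=True)
    tgt = target_vecs / np.linalg.norm(target_vecs, axis=1, keepdims=True)
    n = len(src)
    negatives = {}
    for i in range(n):
        others = np.delete(np.arange(n), i)          # never sample the true translation
        candidates = rng.choice(others, size=min(sample_size_k, n - 1), replace=False)
        sims = tgt[candidates] @ src[i]              # cosine similarity to each candidate
        negatives[i] = candidates[np.argsort(-sims)[:k]].tolist()
    return negatives

rng = np.random.default_rng(42)
source_vecs = rng.normal(size=(50, 8))
target_vecs = rng.normal(size=(50, 8))
negatives = hard_negatives_sketch(source_vecs, target_vecs, k=3, sample_size_k=10)
```

Sampling a subset of candidates instead of scoring all pairs keeps the cost at `n_model * sample_size_k` similarity computations.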


In [24]:
dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_pl.feather")

In [25]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index_en_pl.feather")

In [21]:
#dataset.create_retrieval_index(n_queries)
import pandas as pd
# Fallback for older pandas versions: build the query-document cross product by hand.
query_ids = dataset.retrieval_subset.iloc[:n_queries]["id_source"]
document_ids = dataset.retrieval_subset["id_target"]
# Pair each of the first n_queries source sentences with every target sentence.
index = pd.MultiIndex.from_product([query_ids, document_ids], names=["id_source", "id_target"])
dataset.retrieval_dataset_index = pd.DataFrame(index=index).reset_index()
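
To illustrate what the cell above builds, here is a toy version with two queries and three documents. The variable names are made up for the example. The `merge(how="cross")` variant is one possible equivalent on newer pandas versions (1.2 or later); whether `create_retrieval_index` uses it is not known.

```python
import pandas as pd

queries = pd.Series([0, 1], name="id_source")
documents = pd.Series([0, 1, 2], name="id_target")

# Cartesian product of query ids and document ids, flattened into two columns.
index = pd.MultiIndex.from_product([queries, documents], names=["id_source", "id_target"])
pairs = pd.DataFrame(index=index).reset_index()
print(pairs.shape)  # (6, 2)

# Possible equivalent on pandas >= 1.2.
cross = queries.to_frame().merge(documents.to_frame(), how="cross")
```

With `n_queries = 100` and `n_retrieval = 5000`, the same construction yields 500,000 rows.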

In [22]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index_en_pl.feather")

## IV. Create features

In this section we create the sentence-based features for our model, which should be created before the text is preprocessed.

In [23]:
#%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [None]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
                                                             parallel_sentences.preprocessed)

In [None]:
features_model.create_feature_dataframe()

In [None]:
features_model.create_sentence_features()

In [None]:
features_model.create_embedding_features("proc_5k")

In [None]:
features_model.create_embedding_features("proc_b_1k")

In [None]:
features_model.create_embedding_features("vecmap")

In [None]:
features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model_en_pl.feather")

Generation of the data for the cross-lingual information retrieval task.

In [24]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [25]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.0 seconds.
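
One plausible reading of this step is that the pair index is joined with the per-sentence columns, once on `id_source` and once on `id_target`. The sketch below only illustrates that idea. The `token_count_*` and `translation` columns are hypothetical and this is not the `FeatureGeneration` implementation.

```python
import pandas as pd

# Per-sentence table (hypothetical columns standing in for the preprocessed data).
sentences = pd.DataFrame({
    "id_source": [0, 1, 2],
    "id_target": [0, 1, 2],
    "token_count_source": [5, 7, 3],
    "token_count_target": [6, 7, 4],
})
# Pair index: which source sentence is compared with which target sentence.
pair_index = pd.DataFrame({"id_source": [0, 0, 1], "id_target": [0, 2, 1]})

features = (
    pair_index
    .merge(sentences[["id_source", "token_count_source"]], on="id_source", how="left")
    .merge(sentences[["id_target", "token_count_target"]], on="id_target", how="left")
)
# Hypothetical label: a pair is a translation when both ids match.
features["translation"] = (features["id_source"] == features["id_target"]).astype(int)
```

Left joins preserve the row order of the pair index, so the features stay aligned with it.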


In [26]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.02 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.


  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 

  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.


100%|██████████| 500000/500000 [00:11<00:00, 44459.74it/s]

Finished function: 'jaccard' in 11.32 seconds.
Finished function: 'create_sentence_features' in 14.04 seconds.
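
The log names three kinds of numerical comparison: `difference_numerical`, `relative_difference_numerical` and `normalized_difference_numerical`. Only the absolute difference is visible in the warning output. The relative and normalized definitions below are assumptions chosen for illustration, and all function names are hypothetical.

```python
import numpy as np
import pandas as pd

def clean(series):
    """Replace NaN and +/-inf with 0, similar in spirit to the .replace(...) chain in the log."""
    return series.replace([np.inf, -np.inf], np.nan).fillna(0)

def difference(source, target):
    return clean((target - source).abs())

def relative_difference(source, target):
    # Assumed definition: absolute difference relative to the larger value.
    return clean((target - source).abs() / np.maximum(source, target))

def normalized_difference(source, target):
    # Assumed definition: absolute difference divided by the sum of both values.
    return clean((target - source).abs() / (source + target))

source = pd.Series([12, 14, 0])
target = pd.Series([17, 19, 0])

diff = difference(source, target)             # 5, 5, 0
rel = relative_difference(source, target)     # 0/0 in the last row becomes 0
norm = normalized_difference(source, target)
```

Replacing `-np.inf` directly avoids evaluating `np.log(0)`, which returns `-inf` and emits a `RuntimeWarning`.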





In [27]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [03:23<00:00, 2452.87it/s]
  0%|          | 191/500000 [00:00<04:22, 1905.93it/s]

Finished function: 'cosine_similarity_vector' in 203.9 seconds.


100%|██████████| 500000/500000 [03:22<00:00, 2471.01it/s]
  0%|          | 308/500000 [00:00<02:42, 3074.83it/s]

Finished function: 'cosine_similarity_vector' in 202.41 seconds.


100%|██████████| 500000/500000 [02:19<00:00, 3579.12it/s]
  0%|          | 304/500000 [00:00<02:44, 3033.54it/s]

Finished function: 'euclidean_distance_vector' in 139.76 seconds.


100%|██████████| 500000/500000 [02:12<00:00, 3785.64it/s]
  1%|          | 3524/500000 [00:00<00:14, 35233.57it/s]

Finished function: 'euclidean_distance_vector' in 132.13 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 41014.96it/s]
  1%|          | 3515/500000 [00:00<00:14, 35145.51it/s]

Finished function: 'jaccard' in 12.26 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 41237.71it/s]

Finished function: 'jaccard' in 12.18 seconds.
Finished function: 'create_embedding_features' in 702.68 seconds.
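
Each embedding variant is compared with cosine similarity and Euclidean distance. As a reference for what these two measures compute, here is a small NumPy sketch that handles all pairs at once. It is an illustration with hypothetical function names, not the row-by-row code behind `cosine_similarity_vector` and `euclidean_distance_vector`.

```python
import numpy as np

def cosine_similarity_rows(a, b):
    """Row-wise cosine similarity; rows with a zero norm get similarity 0."""
    dot = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    return np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)

def euclidean_distance_rows(a, b):
    """Row-wise Euclidean distance between paired vectors."""
    return np.linalg.norm(a - b, axis=1)

source = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
target = np.array([[2.0, 0.0], [-1.0, 1.0], [3.0, 4.0]])

cos = cosine_similarity_rows(source, target)    # 1.0, 0.0, 0.0
dist = euclidean_distance_rows(source, target)  # 1.0, 2.0, 5.0
```

Cosine similarity ignores vector length, while Euclidean distance does not, so the two features can carry different information for the same pair.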





In [28]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [03:14<00:00, 2569.24it/s]
  0%|          | 491/500000 [00:00<03:27, 2410.37it/s]

Finished function: 'cosine_similarity_vector' in 194.68 seconds.


100%|██████████| 500000/500000 [03:17<00:00, 2534.00it/s]
  0%|          | 316/500000 [00:00<02:38, 3153.76it/s]

Finished function: 'cosine_similarity_vector' in 197.38 seconds.


100%|██████████| 500000/500000 [02:13<00:00, 3754.81it/s]
  0%|          | 1443/500000 [00:00<02:22, 3491.46it/s]

Finished function: 'euclidean_distance_vector' in 133.22 seconds.


100%|██████████| 500000/500000 [02:22<00:00, 3508.96it/s]
  1%|          | 2528/500000 [00:00<00:19, 25273.46it/s]

Finished function: 'euclidean_distance_vector' in 142.55 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 39564.42it/s]
  1%|          | 2989/500000 [00:00<00:16, 29888.75it/s]

Finished function: 'jaccard' in 12.71 seconds.


100%|██████████| 500000/500000 [00:17<00:00, 28988.17it/s]

Finished function: 'jaccard' in 17.33 seconds.
Finished function: 'create_embedding_features' in 697.9 seconds.
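
The `jaccard` steps in the log presumably compare two token collections per sentence pair. Which columns are compared is not visible here, so the sketch below only shows the measure itself on made-up token lists.

```python
def jaccard(tokens_a, tokens_b):
    """Jaccard similarity: size of the intersection divided by size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # convention chosen here for two empty inputs
    return len(set_a & set_b) / len(union)

score = jaccard(["walka", "praca", "popyt"], ["walka", "popyt", "cel", "koniec"])
print(score)  # 0.4
```

Two of the five distinct tokens are shared, which gives 2 / 5 = 0.4.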





In [29]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [03:29<00:00, 2383.26it/s]
  0%|          | 198/500000 [00:00<04:13, 1971.62it/s]

Finished function: 'cosine_similarity_vector' in 209.88 seconds.


100%|██████████| 500000/500000 [03:45<00:00, 2217.11it/s]
  0%|          | 527/500000 [00:00<03:19, 2507.48it/s]

Finished function: 'cosine_similarity_vector' in 225.6 seconds.


100%|██████████| 500000/500000 [02:47<00:00, 2977.04it/s]
  0%|          | 282/500000 [00:00<02:57, 2811.96it/s]

Finished function: 'euclidean_distance_vector' in 168.05 seconds.


100%|██████████| 500000/500000 [03:01<00:00, 2749.22it/s]
  1%|          | 3448/500000 [00:00<00:14, 34479.95it/s]

Finished function: 'euclidean_distance_vector' in 181.94 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 39113.96it/s]
  1%|          | 3294/500000 [00:00<00:15, 32934.77it/s]

Finished function: 'jaccard' in 12.86 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 39560.04it/s]

Finished function: 'jaccard' in 12.72 seconds.
Finished function: 'create_embedding_features' in 811.1 seconds.





In [30]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")