# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and German Europarl datasets and combine them into a dataframe of parallel (sentence-aligned) translations.

In [2]:
import os
import sys
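# Add the parent of the current working directory to sys.path so that the `src` package can be imported.
sys.path.append(os.path.dirname(os.path.abspath('')))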

from src.data import create_data_subset

In [3]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.de-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.de-en.de',
                   sample_size=35000,
                   sentence_data_sampled_path="../data/interim/europarl_en_de_test.pkl",)


Finished function: 'load_doc' in 1.56 seconds.
Finished function: 'to_sentences' in 1.05 seconds.
Finished function: 'load_doc' in 2.01 seconds.
Finished function: 'to_sentences' in 1.24 seconds.
Sampled dataframe saved in: ../data/interim/europarl_en_de_test.pkl
Finished function: 'create_data_subset' in 7.85 seconds.
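
The source of `create_data_subset` is not shown in this notebook. As a rough illustration of the idea only, the sketch below pairs line-aligned sentences and draws a random sample of pairs. The function name `sample_parallel_sentences`, the `seed` parameter, and the filtering of empty lines are assumptions made for this example, not the project's actual behaviour.

```python
import random


def sample_parallel_sentences(source_lines, target_lines, sample_size, seed=42):
    """Pair line-aligned sentences and draw a random sample of pairs (hypothetical helper)."""
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must contain the same number of lines")
    # Assumption: keep only pairs in which both sides are non-empty after stripping.
    pairs = [
        (src.strip(), tgt.strip())
        for src, tgt in zip(source_lines, target_lines)
        if src.strip() and tgt.strip()
    ]
    rng = random.Random(seed)
    # Sample without replacement; never ask for more pairs than exist.
    return rng.sample(pairs, min(sample_size, len(pairs)))


english = ["Resumption of the session", "", "Thank you.", "The debate is closed."]
german = ["Wiederaufnahme der Sitzungsperiode", "", "Vielen Dank.", "Die Aussprache ist geschlossen."]
sampled = sample_parallel_sentences(english, german, sample_size=2)
print(sampled)
```

Sampling whole pairs rather than each side separately keeps every English sentence attached to its German translation.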


## II. Preprocess Data

In this section we preprocess the parallel sentence data in preparation for feature generation.

In [4]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm
# import it_core_news_sm
# import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl

In [5]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german') # German stopwords
# stopwords_target = stopwords.words('italian') # Italian stopwords
# stopwords_target = stopwords.words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load() # German pipeline
# nlp_target = it_core_news_sm.load() # Italian pipeline
# nlp_target = pl_core_news_sm.load() # Polish pipeline

In [6]:
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de_test.pkl") # German
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pol.pkl") # Polish

Finished function: 'import_data' in 0.04 seconds.


In [7]:
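import numpy as np

# Keep only rows 30000 onward, i.e. the last 5,000 of the 35,000 sampled sentence pairs.
parallel_sentences.dataframe = parallel_sentences.dataframe.iloc[30000:, :]
parallel_sentences.dataframe = parallel_sentences.dataframe.reset_index(drop=True)
# Renumber both ID columns from 0 so that a source sentence and its translation share the same ID.
parallel_sentences.dataframe["id_source"] = np.arange(len(parallel_sentences.dataframe))
parallel_sentences.dataframe["id_target"] = np.arange(len(parallel_sentences.dataframe))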

In [8]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)



100%|██████████| 5000/5000 [00:39<00:00, 125.99it/s]
100%|██████████| 5000/5000 [00:00<00:00, 192976.42it/s]
100%|██████████| 5000/5000 [00:00<00:00, 124244.75it/s]
100%|██████████| 5000/5000 [00:00<00:00, 63269.37it/s]
100%|██████████| 5000/5000 [00:00<00:00, 154990.98it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'spacy' in 39.69 seconds.
Finished function: 'remove_punctuation' in 0.03 seconds.
Finished function: 'remove_numbers' in 0.04 seconds.
Finished function: 'lemmatize' in 0.08 seconds.
Finished function: 'lowercase_spacy' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 17386.77it/s]
  0%|          | 14/5000 [00:00<00:36, 135.06it/s]

Finished function: 'remove_stopwords' in 0.29 seconds.
Finished function: 'create_cleaned_token_embedding' in 40.22 seconds.


100%|██████████| 5000/5000 [00:36<00:00, 135.76it/s]
100%|██████████| 5000/5000 [00:00<00:00, 191498.00it/s]
100%|██████████| 5000/5000 [00:00<00:00, 195473.03it/s]
100%|██████████| 5000/5000 [00:00<00:00, 73237.37it/s]
100%|██████████| 5000/5000 [00:00<00:00, 152475.79it/s]
 13%|█▎        | 667/5000 [00:00<00:01, 3664.11it/s]

Finished function: 'spacy' in 36.83 seconds.
Finished function: 'remove_punctuation' in 0.03 seconds.
Finished function: 'remove_numbers' in 0.03 seconds.
Finished function: 'lemmatize' in 0.07 seconds.
Finished function: 'lowercase_spacy' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 12765.04it/s]
 15%|█▍        | 737/5000 [00:00<00:01, 3531.32it/s]

Finished function: 'remove_stopwords' in 0.39 seconds.
Finished function: 'create_cleaned_token_embedding' in 37.43 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 4280.96it/s]
 46%|████▌     | 2284/5000 [00:00<00:00, 22837.95it/s]

Finished function: 'tokenize_sentence' in 1.17 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 22901.20it/s]
100%|██████████| 5000/5000 [00:00<00:00, 122677.77it/s]
100%|██████████| 5000/5000 [00:00<00:00, 147986.90it/s]
 41%|████      | 2060/5000 [00:00<00:00, 20596.93it/s]

Finished function: 'remove_stopwords' in 0.22 seconds.
Finished function: 'strip_whitespace' in 0.04 seconds.
Finished function: 'lowercase' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 22143.88it/s]
 18%|█▊        | 892/5000 [00:00<00:00, 4351.04it/s]

Finished function: 'remove_stopwords' in 0.23 seconds.
Finished function: 'create_cleaned_text' in 1.71 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 4527.10it/s]
 37%|███▋      | 1869/5000 [00:00<00:00, 18689.44it/s]

Finished function: 'tokenize_sentence' in 1.11 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 18172.97it/s]
100%|██████████| 5000/5000 [00:00<00:00, 163590.78it/s]
100%|██████████| 5000/5000 [00:00<00:00, 103908.91it/s]
 17%|█▋        | 847/5000 [00:00<00:00, 7463.33it/s]

Finished function: 'remove_stopwords' in 0.28 seconds.
Finished function: 'strip_whitespace' in 0.03 seconds.
Finished function: 'lowercase' in 0.05 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 10776.91it/s]

Finished function: 'remove_stopwords' in 0.47 seconds.
Finished function: 'create_cleaned_text' in 1.95 seconds.
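
The log above lists the cleaning steps that were run (`remove_punctuation`, `remove_numbers`, `lowercase`, `remove_stopwords`, among others). A minimal standard-library sketch of such a chain is given below. It is only an approximation: the notebook relies on spaCy for tokenization and lemmatization and on NLTK for the stopword lists, whereas `clean_tokens` and the tiny `STOPWORDS_EN` set are hypothetical stand-ins, and lemmatization is omitted.

```python
import string

# Tiny illustrative stopword set; the notebook uses NLTK's full lists instead.
STOPWORDS_EN = {"there", "are", "other", "as", "the", "of", "and", "is", "a"}


def clean_tokens(sentence, stopwords):
    """Whitespace-tokenize a sentence and apply simple cleaning steps (hypothetical helper)."""
    tokens = sentence.split()
    # Strip leading and trailing punctuation from every token.
    tokens = [token.strip(string.punctuation) for token in tokens]
    # Drop empty tokens and tokens that contain a digit.
    tokens = [t for t in tokens if t and not any(ch.isdigit() for ch in t)]
    # Lowercase before the stopword lookup so that "The" matches "the".
    tokens = [t.lower() for t in tokens]
    return [t for t in tokens if t not in stopwords]


cleaned = clean_tokens("Naturally, there are many other issues as well.", STOPWORDS_EN)
print(cleaned)  # ['naturally', 'many', 'issues', 'well']
```

Because this sketch does not lemmatize, it yields `issues` where the notebook's `token_preprocessed_embedding_source` column contains `issue`.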





In [9]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 5000/5000 [00:00<00:00, 57565.20it/s]
100%|██████████| 5000/5000 [00:00<00:00, 68232.46it/s]
100%|██████████| 5000/5000 [00:00<00:00, 204968.14it/s]

Finished function: 'number_punctuations_total' in 0.09 seconds.
Finished function: 'number_punctuations_total' in 0.08 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 152439.21it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_words' in 0.03 seconds.
Finished function: 'number_words' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 38255.79it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_unique_words' in 0.13 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 50974.99it/s]
100%|██████████| 5000/5000 [00:00<00:00, 79739.62it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_unique_words' in 0.1 seconds.
Finished function: 'number_characters' in 0.06 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 82351.70it/s]
  return (character_vector / word_vector).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)
100%|██████████| 5000/5000 [00:00<00:00, 402223.29it/s]
100%|██████████| 5000/5000 [00:00<00:00, 261382.72it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.06 seconds.
Finished function: 'average_characters' in 0.03 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 375490.50it/s]
100%|██████████| 5000/5000 [00:00<00:00, 334607.42it/s]
100%|██████████| 5000/5000 [00:00<00:00, 388627.76it/s]
100%|██████████| 5000/5000 [00:00<00:00, 377953.75it/s]
100%|██████████| 5000/5000 [00:00<00:00, 383440.66it/s]
100%|██████████| 5000/5000 [00:00<00:00, 241933.48it/s]
100%|██████████| 5000/5000 [00:00<00:00, 389284.23it/s]
100%|██████████| 5000/5000 [00:00<00:00, 358732.81it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 348601.54it/s]
100%|██████████| 5000/5000 [00:00<00:00, 366455.58it/s]
100%|██████████| 5000/5000 [00:00<00:00, 396789.59it/s]
100%|██████████| 5000/5000 [00:00<00:00, 369763.74it/s]
100%|██████████| 5000/5000 [00:00<00:00, 303253.85it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 350805.77it/s]
100%|██████████| 5000/5000 [00:00<00:00, 350091.31it/s]
100%|██████████| 5000/5000 [00:00<00:00, 363287.89it/s]
100%|██████████| 5000/5000 [00:00<00:00, 325968.66it/s]
100%|██████████| 5000/5000 [00:00<00:00, 320063.49it/s]
100%|██████████| 5000/5000 [00:00<00:00, 400387.95it/s]
100%|██████████| 5000/5000 [00:00<00:00, 244224.06it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 217772.79it/s]
100%|██████████| 5000/5000 [00:00<00:00, 263752.89it/s]
100%|██████████| 5000/5000 [00:00<00:00, 308377.50it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.03 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 251318.46it/s]
100%|██████████| 5000/5000 [00:00<00:00, 323310.26it/s]
100%|██████████| 5000/5000 [00:00<00:00, 360459.26it/s]
100%|██████████| 5000/5000 [00:00<00:00, 361528.07it/s]
100%|██████████| 5000/5000 [00:00<00:00, 345568.57it/s]
100%|██████████| 5000/5000 [00:00<00:00, 406606.05it/s]
100%|██████████| 5000/5000 [00:00<00:00, 229518.01it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 326750.80it/s]
100%|██████████| 5000/5000 [00:00<00:00, 237575.70it/s]
100%|██████████| 5000/5000 [00:00<00:00, 351593.88it/s]
100%|██████████| 5000/5000 [00:00<00:00, 320655.64it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 282676.95it/s]
100%|██████████| 5000/5000 [00:00<00:00, 294933.20it/s]
100%|██████████| 5000/5000 [00:00<00:00, 233564.47it/s]
100%|██████████| 5000/5000 [00:00<00:00, 344846.91it/s]
100%|██████████| 5000/5000 [00:00<00:00, 329057.93it/s]
100%|██████████| 5000/5000 [00:00<00:00, 298952.53it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 328619.65it/s]
100%|██████████| 5000/5000 [00:00<00:00, 344066.15it/s]
100%|██████████| 5000/5000 [00:00<00:00, 467123.73it/s]
100%|██████████| 5000/5000 [00:00<00:00, 431033.83it/s]
100%|██████████| 5000/5000 [00:00<00:00, 459932.01it/s]
100%|██████████| 5000/5000 [00:00<00:00, 384347.19it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 465640.57it/s]
100%|██████████| 5000/5000 [00:00<00:00, 422821.43it/s]
100%|██████████| 5000/5000 [00:00<00:00, 455912.52it/s]
100%|██████████| 5000/5000 [00:00<00:00, 458423.94it/s]
100%|██████████| 5000/5000 [00:00<00:00, 493586.90it/s]
100%|██████████| 5000/5000 [00:00<00:00, 432331.16it/s]
100%|██████████| 5000/5000 [00:00<00:00, 481009.20it/s]
100%|██████████| 5000/5000 [00:00<00:00, 423077.33it/s]


Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 429665.02it/s]
100%|██████████| 5000/5000 [00:00<00:00, 409432.07it/s]
100%|██████████| 5000/5000 [00:00<00:00, 496873.03it/s]
100%|██████████| 5000/5000 [00:00<00:00, 439645.29it/s]
100%|██████████| 5000/5000 [00:00<00:00, 438642.96it/s]
100%|██████████| 5000/5000 [00:00<00:00, 454391.26it/s]
100%|██████████| 5000/5000 [00:00<00:00, 486702.41it/s]
100%|██████████| 5000/5000 [00:00<00:00, 431078.13it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.



  0%|          | 10/5000 [00:00<00:52, 95.76it/s]

Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:39<00:00, 125.99it/s]
  0%|          | 12/5000 [00:00<00:45, 110.84it/s]

Finished function: 'spacy' in 39.69 seconds.


100%|██████████| 5000/5000 [00:45<00:00, 110.76it/s]
100%|██████████| 5000/5000 [00:00<00:00, 164196.61it/s]
100%|██████████| 5000/5000 [00:00<00:00, 179006.62it/s]
100%|██████████| 5000/5000 [00:00<00:00, 150570.94it/s]
100%|██████████| 5000/5000 [00:00<00:00, 132594.35it/s]
100%|██████████| 5000/5000 [00:00<00:00, 86968.59it/s]


Finished function: 'spacy' in 45.14 seconds.
Finished function: 'number_pos' in 0.03 seconds.
Finished function: 'number_pos' in 0.03 seconds.
Finished function: 'number_pos' in 0.04 seconds.
Finished function: 'number_pos' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 123808.32it/s]
 43%|████▎     | 2127/5000 [00:00<00:00, 10533.59it/s]

Finished function: 'number_pos' in 0.06 seconds.
Finished function: 'number_pos' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 10517.31it/s]
 49%|████▉     | 2468/5000 [00:00<00:00, 12200.05it/s]

Finished function: 'number_times' in 0.48 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 10926.82it/s]
 15%|█▍        | 733/5000 [00:00<00:00, 7323.70it/s]

Finished function: 'number_times' in 0.46 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 9791.54it/s]
 48%|████▊     | 2421/5000 [00:00<00:00, 11806.21it/s]

Finished function: 'number_times' in 0.51 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 12550.30it/s]
 47%|████▋     | 2338/5000 [00:00<00:00, 11716.72it/s]

Finished function: 'number_times' in 0.4 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 11752.29it/s]
 24%|██▍       | 1201/5000 [00:00<00:00, 12001.94it/s]

Finished function: 'number_times' in 0.43 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 10862.22it/s]
100%|██████████| 5000/5000 [00:00<00:00, 106057.64it/s]
100%|██████████| 5000/5000 [00:00<00:00, 115370.13it/s]

Finished function: 'number_times' in 0.46 seconds.
Finished function: 'named_numbers' in 0.05 seconds.
Finished function: 'named_numbers' in 0.05 seconds.
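
The function names in the log (`number_words`, `number_unique_words`, `number_characters`, `average_characters`, `number_punctuations_total`) suggest simple per-sentence counts. The sketch below shows how such counts could be computed with the standard library. The helper `sentence_features` is hypothetical, and the exact definitions used in `src.data` (for example, whether counts are taken before or after stopword removal) are not visible here, so the values need not match the dataframe.

```python
import string


def sentence_features(sentence):
    """Compute simple surface statistics for one sentence (hypothetical helper)."""
    words = sentence.split()
    n_words = len(words)
    # Characters are counted per whitespace-separated token, so spaces are excluded.
    n_chars = sum(len(word) for word in words)
    return {
        "number_words": n_words,
        "number_unique_words": len({word.lower() for word in words}),
        "number_characters": n_chars,
        # Guard against division by zero for empty sentences.
        "average_characters": n_chars / n_words if n_words else 0.0,
        "number_punctuations_total": sum(ch in string.punctuation for ch in sentence),
    }


features = sentence_features("Naturally, there are many other issues as well.")
print(features)
```

Handling the empty-sentence case explicitly avoids the NaN and infinity values that a vectorized division would otherwise have to replace afterwards.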





In [10]:
parallel_sentences.create_embedding_information("proc_5k")

Finished function: 'load_embeddings' in 1.29 seconds.


  2%|▏         | 77/5000 [00:00<00:06, 763.16it/s]

Finished function: 'load_embeddings' in 0.64 seconds.


100%|██████████| 5000/5000 [00:07<00:00, 705.75it/s]
  2%|▏         | 83/5000 [00:00<00:05, 828.95it/s]

Finished function: 'word_embeddings' in 7.09 seconds.


100%|██████████| 5000/5000 [00:06<00:00, 771.38it/s]


Finished function: 'word_embeddings' in 6.48 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 259876.58it/s]
100%|██████████| 5000/5000 [00:00<00:00, 221719.07it/s]
  6%|▋         | 313/5000 [00:00<00:01, 3128.75it/s]

Finished function: 'create_translation_dictionary' in 20.78 seconds.
Finished function: 'translate_words' in 0.02 seconds.
Finished function: 'translate_words' in 0.02 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 2457.90it/s]
  7%|▋         | 339/5000 [00:00<00:01, 3382.26it/s]

Finished function: 'tf_idf_vector' in 2.1 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 3027.60it/s]
 12%|█▏        | 618/5000 [00:00<00:01, 3073.90it/s]

Finished function: 'tf_idf_vector' in 1.74 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 2431.44it/s]
  6%|▌         | 305/5000 [00:00<00:01, 3042.12it/s]

Finished function: 'sentence_embedding_average' in 2.06 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 2521.64it/s]
  1%|          | 32/5000 [00:00<00:16, 309.09it/s]

Finished function: 'sentence_embedding_average' in 1.98 seconds.


  return [pd.Series(embedding_dataframe.values.mean(axis=1))]
100%|██████████| 5000/5000 [00:21<00:00, 231.86it/s]
  1%|▏         | 65/5000 [00:00<00:15, 324.67it/s]

Finished function: 'sentence_embedding_tf_idf' in 21.57 seconds.


100%|██████████| 5000/5000 [00:19<00:00, 262.27it/s]

Finished function: 'sentence_embedding_tf_idf' in 19.07 seconds.
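
The steps `create_translation_dictionary` and `translate_words` indicate that words are translated through the cross-lingual embedding space. How the project does this is not shown. One common approach, sketched here purely as an assumption, maps each source word to the target word whose vector has the highest cosine similarity. The names `cosine` and `translate_tokens` and the two-dimensional toy vectors are invented for the example.

```python
import math


def cosine(u, v):
    """Cosine similarity of two equal-length vectors; 0.0 if either has zero norm."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def translate_tokens(tokens, source_vectors, target_vectors):
    """Map each known source token to its nearest target word (hypothetical helper)."""
    translations = []
    for token in tokens:
        if token not in source_vectors:
            continue  # Out-of-vocabulary tokens are skipped in this sketch.
        vector = source_vectors[token]
        best = max(target_vectors, key=lambda word: cosine(vector, target_vectors[word]))
        translations.append(best)
    return translations


# Toy vectors in a shared space; real embeddings have hundreds of dimensions.
source_vectors = {"house": [1.0, 0.1], "dog": [0.1, 1.0]}
target_vectors = {"haus": [0.9, 0.2], "hund": [0.2, 0.9]}
translated = translate_tokens(["house", "dog", "xyz"], source_vectors, target_vectors)
print(translated)  # ['haus', 'hund']
```

A brute-force search like this is quadratic in vocabulary size; precomputing a dictionary once, as the step name `create_translation_dictionary` hints, amortizes that cost across all sentences.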





In [11]:
parallel_sentences.create_embedding_information("proc_b_1k")

Finished function: 'load_embeddings' in 0.83 seconds.


  3%|▎         | 129/5000 [00:00<00:07, 660.08it/s]

Finished function: 'load_embeddings' in 0.61 seconds.


100%|██████████| 5000/5000 [00:07<00:00, 627.94it/s]
  2%|▏         | 77/5000 [00:00<00:06, 766.37it/s]

Finished function: 'word_embeddings' in 7.96 seconds.


100%|██████████| 5000/5000 [00:08<00:00, 579.62it/s]


Finished function: 'word_embeddings' in 8.63 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 230999.49it/s]
100%|██████████| 5000/5000 [00:00<00:00, 194416.56it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 27.67 seconds.
Finished function: 'translate_words' in 0.02 seconds.
Finished function: 'translate_words' in 0.03 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 2156.91it/s]
  4%|▍         | 192/5000 [00:00<00:02, 1915.38it/s]

Finished function: 'tf_idf_vector' in 2.38 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 2235.64it/s]
  4%|▍         | 205/5000 [00:00<00:02, 2049.49it/s]

Finished function: 'tf_idf_vector' in 2.33 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 1971.08it/s]
 10%|▉         | 485/5000 [00:00<00:02, 2180.88it/s]

Finished function: 'sentence_embedding_average' in 2.54 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 2717.77it/s]
  1%|▏         | 70/5000 [00:00<00:13, 359.12it/s]

Finished function: 'sentence_embedding_average' in 1.84 seconds.


100%|██████████| 5000/5000 [00:16<00:00, 305.03it/s]
  2%|▏         | 80/5000 [00:00<00:12, 397.39it/s]

Finished function: 'sentence_embedding_tf_idf' in 16.4 seconds.


100%|██████████| 5000/5000 [00:14<00:00, 351.60it/s]


Finished function: 'sentence_embedding_tf_idf' in 14.22 seconds.


In [12]:
parallel_sentences.create_embedding_information("vecmap")

Finished function: 'load_embeddings' in 0.67 seconds.


  1%|          | 62/5000 [00:00<00:08, 616.44it/s]

Finished function: 'load_embeddings' in 0.59 seconds.


100%|██████████| 5000/5000 [00:07<00:00, 648.30it/s]
  1%|▏         | 73/5000 [00:00<00:06, 719.76it/s]

Finished function: 'word_embeddings' in 7.71 seconds.


100%|██████████| 5000/5000 [00:08<00:00, 609.92it/s]


Finished function: 'word_embeddings' in 8.2 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 275462.62it/s]
100%|██████████| 5000/5000 [00:00<00:00, 211615.51it/s]
  6%|▌         | 304/5000 [00:00<00:01, 3038.67it/s]

Finished function: 'create_translation_dictionary' in 24.22 seconds.
Finished function: 'translate_words' in 0.02 seconds.
Finished function: 'translate_words' in 0.03 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 2884.19it/s]
  6%|▌         | 295/5000 [00:00<00:01, 2949.80it/s]

Finished function: 'tf_idf_vector' in 1.82 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 3064.38it/s]
  6%|▌         | 311/5000 [00:00<00:01, 3103.17it/s]

Finished function: 'tf_idf_vector' in 1.73 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 3000.24it/s]
  6%|▌         | 299/5000 [00:00<00:01, 2984.03it/s]

Finished function: 'sentence_embedding_average' in 1.67 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 3078.20it/s]
  1%|          | 36/5000 [00:00<00:13, 356.32it/s]

Finished function: 'sentence_embedding_average' in 1.63 seconds.


100%|██████████| 5000/5000 [00:15<00:00, 318.76it/s]
  1%|          | 42/5000 [00:00<00:11, 414.07it/s]

Finished function: 'sentence_embedding_tf_idf' in 15.69 seconds.


100%|██████████| 5000/5000 [00:14<00:00, 351.51it/s]

Finished function: 'sentence_embedding_tf_idf' in 14.23 seconds.
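
Each embedding run produces two sentence representations, `sentence_embedding_average` and `sentence_embedding_tf_idf`. The sketch below illustrates the difference on toy data: the first is an unweighted mean of the word vectors, the second a mean weighted by each word's tf-idf score. The helper names are hypothetical, and dividing by the sum of the weights is an assumption; the project's implementation may normalize differently.

```python
def average_embedding(tokens, word_vectors):
    """Unweighted mean of the vectors of all tokens found in word_vectors."""
    vectors = [word_vectors[token] for token in tokens if token in word_vectors]
    if not vectors:
        return None  # No known token, so no embedding can be built.
    dim = len(vectors[0])
    return [sum(vector[i] for vector in vectors) / len(vectors) for i in range(dim)]


def tf_idf_weighted_embedding(tf_idf_weights, word_vectors):
    """Mean of word vectors weighted by tf-idf (assumption: normalized by total weight)."""
    known = {t: w for t, w in tf_idf_weights.items() if t in word_vectors}
    total = sum(known.values())
    if total == 0:
        return None
    dim = len(next(iter(word_vectors.values())))
    return [
        sum(weight * word_vectors[token][i] for token, weight in known.items()) / total
        for i in range(dim)
    ]


# Toy two-dimensional vectors and weights, invented for this example.
word_vectors = {"press": [1.0, 0.0], "freedom": [0.0, 1.0]}
mean_vector = average_embedding(["press", "freedom", "unknown"], word_vectors)
weighted_vector = tf_idf_weighted_embedding({"press": 0.75, "freedom": 0.25}, word_vectors)
print(mean_vector)      # [0.5, 0.5]
print(weighted_vector)  # [0.75, 0.25]
```

The weighted variant pulls the sentence vector toward rarer, more informative words, whereas the plain average treats every remaining token equally.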





In [13]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_de_testset.json")

In [14]:
parallel_sentences.dataframe

Unnamed: 0,id_source,text_source,text_target,id_target,text_preprocessed_source,text_preprocessed_target,text_source_spacy,text_target_spacy,word_embedding_proc_5k_source,word_embedding_proc_5k_target,tf_idf_proc_5k_source,tf_idf_proc_5k_target,word_embedding_proc_b_1k_source,word_embedding_proc_b_1k_target,tf_idf_proc_b_1k_source,tf_idf_proc_b_1k_target,word_embedding_vecmap_source,word_embedding_vecmap_target,tf_idf_vecmap_source,tf_idf_vecmap_target
0,0,"They must include press freedom, the rule of l...","Dazu gehört Pressefreiheit, dazu gehört Rechts...",0,"[must, include, press, freedom, ,, rule, law, ...","[gehört, pressefreiheit, ,, gehört, rechtstaat...","[They, must, include, press, freedom, ,, the, ...","[Dazu, gehört, Pressefreiheit, ,, dazu, gehört...",must include press freedom ...,hören pressefreiheit toleranz gegen...,"{'must': 0.17626342263744035, 'include': 0.235...","{'hören': 0.6660395574174905, 'pressefreiheit'...",must include press freedom ...,hören pressefreiheit toleranz gegen...,"{'must': 0.17626342263744035, 'include': 0.235...","{'hören': 0.6660395574174905, 'pressefreiheit'...",must include press freedom ...,hören pressefreiheit toleranz gegen...,"{'must': 0.17626342263744035, 'include': 0.235...","{'hören': 0.6660395574174905, 'pressefreiheit'..."
1,1,"Naturally, there are many other issues as well.",Aber natürlich auch alle weiteren Fragen.,1,"[naturally, ,, many, issues, well, .]","[natürlich, weiteren, fragen, .]","[Naturally, ,, there, are, many, other, issues...","[Aber, natürlich, auch, alle, weiteren, Fragen...",naturally many issue well 0...,natürlich all weit frage 0...,"{'naturally': 0.6306460930482953, 'many': 0.46...","{'natürlich': 0.563138440048413, 'all': 0.4521...",naturally many issue well 0...,natürlich all weit frage 0...,"{'naturally': 0.6306460930482953, 'many': 0.46...","{'natürlich': 0.563138440048413, 'all': 0.4521...",naturally many issue well 0...,natürlich all weit frage 0...,"{'naturally': 0.6306460930482953, 'many': 0.46...","{'natürlich': 0.563138440048413, 'all': 0.4521..."
2,2,"Mr President, ladies and gentlemen, Mrs Staune...","Herr Präsident, liebe Kolleginnen und Kollegen...",2,"[mr, president, ,, ladies, gentlemen, ,, mrs, ...","[herr, präsident, ,, liebe, kolleginnen, kolle...","[Mr, President, ,, ladies, and, gentlemen, ,, ...","[Herr, Präsident, ,, liebe, Kolleginnen, und, ...",mr president lady gentleman ...,herr präsident lieb kollegin ...,"{'mr': 0.1990752902598333, 'president': 0.2026...","{'herr': 0.1667370642204634, 'präsident': 0.17...",mr president lady gentleman ...,herr präsident lieb kollegin ...,"{'mr': 0.1990752902598333, 'president': 0.2026...","{'herr': 0.1667370642204634, 'präsident': 0.17...",mr president lady gentleman ...,herr präsident lieb kollegin ...,"{'mr': 0.1990752902598333, 'president': 0.2026...","{'herr': 0.1667370642204634, 'präsident': 0.17..."
3,3,It is a basic policy that consumer protection ...,"Es ist ein politisches Grundprinzip, dass Verb...",3,"[basic, policy, consumer, protection, integral...","[politisches, grundprinzip, ,, verbraucherschu...","[It, is, a, basic, policy, that, consumer, pro...","[Es, ist, ein, politisches, Grundprinzip, ,, d...",basic policy consumer protection ...,politisch grundprinzip verbraucherschut...,"{'basic': 0.21810369565971274, 'policy': 0.142...","{'politisch': 0.2082602674266007, 'grundprinzi...",basic policy consumer protection ...,politisch grundprinzip verbraucherschut...,"{'basic': 0.21810369565971274, 'policy': 0.142...","{'politisch': 0.2082602674266007, 'grundprinzi...",basic policy consumer protection ...,politisch grundprinzip verbraucherschut...,"{'basic': 0.21810369565971274, 'policy': 0.142...","{'politisch': 0.2082602674266007, 'grundprinzi..."
4,4,"Thank you Mrs Ţicău, we take due note of your ...","Vielen Dank Frau Ţicău, wir werden Ihre Beobac...",4,"[thank, mrs, ţicău, ,, take, due, note, observ...","[vielen, dank, frau, ţicău, ,, beobachtung, ge...","[Thank, you, Mrs, Ţicău, ,, we, take, due, not...","[Vielen, Dank, Frau, Ţicău, ,, wir, werden, Ih...",thank mrs take due ...,dank frau beobachtung gebühren...,"{'thank': 0.2859167764901562, 'mrs': 0.3108525...","{'dank': 0.325110383877484, 'frau': 0.22630542...",thank mrs take due ...,dank frau beobachtung gebühren...,"{'thank': 0.2859167764901562, 'mrs': 0.3108525...","{'dank': 0.325110383877484, 'frau': 0.22630542...",thank mrs take due ...,dank frau beobachtung gebühren...,"{'thank': 0.2859167764901562, 'mrs': 0.3108525...","{'dank': 0.325110383877484, 'frau': 0.22630542..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,I told them in one discussion that it looks as...,"Ich habe ihnen in einem Gespräch gesagt, dass ...",4995,"[told, one, discussion, looks, difficult, get,...","[gespräch, gesagt, ,, aussieht, ,, schwieriger...","[I, told, them, in, one, discussion, that, it,...","[Ich, habe, ihnen, in, einem, Gespräch, gesagt...",tell discussion look difficult...,gespräch sagen aussehen schwierig ...,"{'tell': 0.2336639862054862, 'discussion': 0.2...","{'gespräch': 0.2662775367692719, 'sagen': 0.16...",tell discussion look difficult...,gespräch sagen aussehen schwierig ...,"{'tell': 0.2336639862054862, 'discussion': 0.2...","{'gespräch': 0.2662775367692719, 'sagen': 0.16...",tell discussion look difficult...,gespräch sagen aussehen schwierig ...,"{'tell': 0.2336639862054862, 'discussion': 0.2...","{'gespräch': 0.2662775367692719, 'sagen': 0.16..."
4996,4996,Is the European Union right to be proud of the...,"Hat die Europäische Union Anlaß, auf humanitär...",4996,"[european, union, right, proud, humanitarian, ...","[europäische, union, anlaß, ,, humanitäre, maß...","[Is, the, European, Union, right, to, be, prou...","[Hat, die, Europäische, Union, Anlaß, ,, auf, ...",european union right proud h...,europäisch union anlaß maßnahme ...,"{'european': 0.12302883215168699, 'union': 0.1...","{'europäisch': 0.1675379646449199, 'union': 0....",european union right proud h...,europäisch union anlaß maßnahme ...,"{'european': 0.12302883215168699, 'union': 0.1...","{'europäisch': 0.1675379646449199, 'union': 0....",european union right proud h...,europäisch union anlaß maßnahme ...,"{'european': 0.12302883215168699, 'union': 0.1...","{'europäisch': 0.1675379646449199, 'union': 0...."
4997,4997,"However, this information would concern only s...",Solche Hinweise hätten aber nur bei der Kennze...,4997,"[however, ,, information, would, concern, subs...","[hinweise, hätten, kennzeichnung, gesundheitsg...","[However, ,, this, information, would, concern...","[Solche, Hinweise, hätten, aber, nur, bei, der...",however information would concern...,solch hinweis kennzeichnung sto...,"{'however': 0.20587812620381066, 'information'...","{'solch': 0.24858968001616202, 'hinweis': 0.28...",however information would concern...,solch hinweis kennzeichnung sto...,"{'however': 0.20587812620381066, 'information'...","{'solch': 0.24858968001616202, 'hinweis': 0.28...",however information would concern...,solch hinweis kennzeichnung sto...,"{'however': 0.20587812620381066, 'information'...","{'solch': 0.24858968001616202, 'hinweis': 0.28..."
4998,4998,"Earlier this week, Jan Pronk, the former UN en...","Anfang dieser Woche gab Jan Pronk, der ehemali...",4998,"[earlier, week, ,, jan, pronk, ,, former, un, ...","[anfang, woche, gab, jan, pronk, ,, ehemalige,...","[Earlier, this, week, ,, Jan, Pronk, ,, the, f...","[Anfang, dieser, Woche, gab, Jan, Pronk, ,, de...",early week jan former ...,anfang woche geben jan e...,"{'early': 0.19923117371332466, 'week': 0.20036...","{'anfang': 0.21764427103233672, 'woche': 0.206...",early week jan former ...,anfang woche geben jan e...,"{'early': 0.19923117371332466, 'week': 0.20036...","{'anfang': 0.21764427103233672, 'woche': 0.206...",early week jan former ...,anfang woche geben jan e...,"{'early': 0.19923117371332466, 'week': 0.20036...","{'anfang': 0.21764427103233672, 'woche': 0.206..."


In [15]:
parallel_sentences.preprocessed

Unnamed: 0,id_source,id_target,token_preprocessed_embedding_source,token_preprocessed_embedding_target,Translation,number_punctuations_total_source,number_punctuations_total_target,number_words_source,number_words_target,number_unique_words_source,...,sentence_embedding_average_proc_b_1k_source,sentence_embedding_average_proc_b_1k_target,sentence_embedding_tf_idf_proc_b_1k_source,sentence_embedding_tf_idf_proc_b_1k_target,translated_to_target_vecmap_source,translated_to_source_vecmap_target,sentence_embedding_average_vecmap_source,sentence_embedding_average_vecmap_target,sentence_embedding_tf_idf_vecmap_source,sentence_embedding_tf_idf_vecmap_target
0,0,0,"[must, include, press, freedom, rule, law, als...","[hören, pressefreiheit, hören, rechtstaatlichk...",1,2,2,13,10,13,...,"[[-0.021971818561164234, 0.011474978608580736,...","[[-0.04290382231452635, 0.046113511946584494, ...","[[-0.007066868075968942, 0.005607346065645118,...","[[-0.013969529264133333, 0.018536083310556985,...","[müssen, zählen, nachdruck, freiheit, facto, v...","[listen, censorship, listen, listen, tolerance...","[[0.20800741131489092, -0.03718488272548152, 0...","[[0.1708042360842228, -0.05323345214128494, 0....","[[0.05308659109689821, -0.012124574781091323, ...","[[0.053098278212977024, -0.004806572931137665,..."
1,1,1,"[naturally, many, issue, well]","[natürlich, all, weit, frage]",1,1,0,4,3,4,...,"[[-0.07873985217884183, -0.00878874131012708, ...","[[-0.05510551622137427, 0.019210652448236942, ...","[[-0.039775796329763254, -0.004100882681543145...","[[-0.026569483773157737, 0.009380328691558417,...","[natürlicherweise, zahlreiche, problematik, eb...","[obviously, amazing, far, question]","[[0.33323322981595993, -0.013636057265102863, ...","[[0.31983064115047455, -0.012548421160317957, ...","[[0.16483797091322513, -0.009893845352128313, ...","[[0.1630749418803205, -0.007294520385128128, 0..."
2,2,2,"[mr, president, lady, gentleman, mrs, stauner,...","[herr, präsident, lieb, kollegin, kollege, ber...",1,5,4,12,13,12,...,"[[-0.05349120330065489, 0.038198266993276775, ...","[[-0.07230656314641237, 0.04887153177211682, -...","[[-0.013956825976806853, 0.010524739055448654,...","[[-0.01840957375747747, 0.011432228637823928, ...","[herrn, präsident, mary, vornehm, mary, berich...","[lord, president, dear, mrs, colleague, report...","[[0.13944147573783994, 0.14624346122145654, 0....","[[0.13265844124058881, 0.13158888618151346, 0....","[[0.039298395795571026, 0.03915855198838343, 0...","[[0.03935996898779049, 0.029764392773049386, 0..."
3,3,3,"[basic, policy, consumer, protection, integral...","[politisch, grundprinzip, verbraucherschutz, i...",1,0,2,16,11,11,...,"[[-0.0196231756020676, -0.00987706098451533, -...","[[-0.05054771602153778, 0.00880683995783329, -...","[[-0.006306732065082907, -0.002367327419645338...","[[-0.014033250896603547, 0.0007865312866477389...","[grundlegende, wirtschaftspolitik, verbraucher...","[politically, principle, consumer, integral, p...","[[0.2640666257251393, -0.0853267332369631, 0.0...","[[0.18233205378055573, -0.06059268582612276, 0...","[[0.07169493797530326, -0.026357206451051805, ...","[[0.05362438231735067, -0.02147952806897634, 0..."
4,4,4,"[thank, mrs, ţicău, take, due, note, observation]","[dank, frau, ţicău, beobachtung, gebühren, bea...",1,1,1,7,7,7,...,"[[-0.016065427257368963, 0.06924035431196292, ...","[[-0.06597628220915794, 0.033844958432018755, ...","[[-0.004266745919428459, 0.020920746568999327,...","[[-0.02459387689770779, 0.009592057857595168, ...","[danke, mary, nehmen, aufgrund, anmerkung, beo...","[thank, daughter, observation, pay, apply]","[[0.2628776244819164, 0.07202177572374542, 0.1...","[[0.2281290665268898, 0.025979317165911196, 0....","[[0.08939328089640149, 0.01932059090826744, 0....","[[0.09774504782369009, -0.005629674603903475, ..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4995,4995,"[tell, discussion, look, difficult, get, good,...","[gespräch, sagen, aussehen, schwierig, peer, r...",1,0,5,19,16,19,...,"[[-0.03781758025029881, 0.04561582228375806, -...","[[-0.006831209890411368, 0.022711734254179255,...","[[-0.008710065448488858, 0.010466450631324158,...","[[5.6516836084291216e-05, 0.005375066987057035...","[sagen, konsens, schauen, schwierig, bekommen,...","[interview, say, shape, difficult, peer, good,...","[[0.24532438007493815, 0.004404090862307284, 0...","[[0.1969107918183519, 0.011512659090970243, 0....","[[0.05863010090177196, 0.004771032047790131, 0...","[[0.048826807176762056, 0.005519456655719748, ..."
4996,4996,4996,"[european, union, right, proud, humanitarian, ...","[europäisch, union, anlaß, humanitär, maßnahme...",1,1,3,12,12,11,...,"[[-0.03594180399721319, 0.015256671806458722, ...","[[-0.03823526995256543, 0.015385312959551811, ...","[[-0.011011873583563005, 0.0016257848030860106...","[[-0.009254192504192075, 0.007094162098472008,...","[europäische, union, linke, stolz, humanitäre,...","[european, union, acknowledgement, preventativ...","[[0.1838966760445725, -0.04385858147659085, 0....","[[0.1643218114040792, -0.023434823495335877, 0...","[[0.052804342223652274, -0.003989617522633716,...","[[0.04534390128151565, -0.001144711801600671, ..."
4997,4997,4997,"[however, information, would, concern, substan...","[solch, hinweis, kennzeichnung, gesundheitsgef...",1,2,0,9,8,8,...,"[[-0.06054617161862552, -0.032334873627405614,...","[[0.0004524181131273508, 0.0363775837700814, -...","[[-0.017964938149843405, -0.006479997166303774...","[[0.0009744391006637598, 0.013281651314851739,...","[jedoch, information, müssten, bedenken, subst...","[incredibly, explanation, labelling, substance...","[[0.33515107817947865, -0.048095799633301795, ...","[[0.29965757131576537, -0.055498963221907616, ...","[[0.09410211836024641, -0.024247944199762585, ...","[[0.093213372315576, -0.012954907841144938, 0...."
4998,4998,4998,"[early, week, jan, pronk, former, un, envoy, s...","[anfang, woche, geben, jan, pronk, ehemalig, u...",1,3,2,14,13,13,...,"[[-0.0497313872911036, 0.03800346574280411, -0...","[[-0.03595951018441054, 0.04274792379389206, -...","[[-0.01482810741283633, 0.010372276755398773, ...","[[-0.011945683411114657, 0.01196331349841016, ...","[frühe, woche, that, ehemalig, uno, botschafte...","[beginning, week, give, anyway, former, sudan,...","[[0.06874247279483825, 0.09173224462817113, 0....","[[0.10321165372927983, 0.0371778037192093, 0.1...","[[0.008600220600305348, 0.01612903007741604, 0...","[[0.010701002245010907, 4.584188879966058e-05,..."


In [None]:
import pandas as pd

# Optional: reload previously saved preprocessing results instead of recomputing them.
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_en_de_testset.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de_test.pkl")
parallel_sentences.preprocessed = preprocessed_data

## III. Create data set

In this section we create the dataset for training the supervised model and the dataset for supervised and unsupervised retrieval.

In [16]:
from src.data import DataSet

In [17]:
n_model = 0
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100


In [18]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.0 seconds.


In [19]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.


In [20]:
import pandas as pd
# dataset.create_retrieval_index(n_queries)

# If your pandas version is old, build the index manually instead:
# pair each of the first n_queries source ids with every target id.
query_ids = dataset.retrieval_subset.iloc[:n_queries]["id_source"]
document_ids = dataset.retrieval_subset["id_target"]
index = pd.MultiIndex.from_product([query_ids, document_ids], names=["id_source", "id_target"])
dataset.retrieval_dataset_index = pd.DataFrame(index=index).reset_index()
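
The retrieval index built in this cell is a Cartesian product of query ids and document ids. The following is a minimal sketch of that construction on a toy frame; `toy_retrieval_subset` and `toy_n_queries` are hypothetical stand-ins for `dataset.retrieval_subset` and `n_queries`, and the values are made up for illustration.

```python
import pandas as pd

# Hypothetical stand-in for dataset.retrieval_subset: 4 aligned sentence pairs.
toy_retrieval_subset = pd.DataFrame({"id_source": [0, 1, 2, 3],
                                     "id_target": [0, 1, 2, 3]})
toy_n_queries = 2

# Pair each of the first toy_n_queries source ids with every target id.
toy_index = pd.MultiIndex.from_product(
    [toy_retrieval_subset["id_source"].iloc[:toy_n_queries],
     toy_retrieval_subset["id_target"]],
    names=["id_source", "id_target"],
)
toy_pairs = pd.DataFrame(index=toy_index).reset_index()

print(toy_pairs.shape)  # (8, 2): 2 queries x 4 documents
```

With `n_queries = 100` and `n_retrieval = 5000`, the same construction yields 100 × 5000 = 500000 candidate pairs, which matches the totals shown in the progress bars of the feature generation cells.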

In [21]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_de_testset.feather")

In [22]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index_en_de_testset.feather")

## IV. Create features

In this section we create the features for our model. The sentence-based features should be created before the text is preprocessed.
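
Sentence-based counts such as `number_words_source` and `number_words_target` can be compared as simple numerical differences. The sketch below is not the implementation in `src.features`; the helper names and the ratio definition are assumptions chosen for illustration. It shows one way to map NaN and infinite values to 0 without evaluating `np.log(0)`, which is what triggers the runtime warning in the output further down.

```python
import numpy as np
import pandas as pd

# Hypothetical per-sentence counts, named after columns of the preprocessed frame.
toy_counts = pd.DataFrame({"number_words_source": [13, 4, 0, 0],
                           "number_words_target": [10, 3, 0, 2]})

def sanitize(values):
    """Map NaN, +inf and -inf to 0 (no np.log(0) needed to obtain -inf)."""
    return values.astype(float).replace([np.inf, -np.inf], np.nan).fillna(0)

def absolute_difference_sketch(source, target):
    """Absolute difference between target and source counts."""
    return sanitize((target - source).abs())

def ratio_difference_sketch(source, target):
    """Assumed definition: |target - source| / source; division by 0 becomes 0."""
    return sanitize((target - source).abs() / source)

abs_diff = absolute_difference_sketch(toy_counts["number_words_source"],
                                      toy_counts["number_words_target"])
ratio_diff = ratio_difference_sketch(toy_counts["number_words_source"],
                                     toy_counts["number_words_target"])
print(abs_diff.tolist())
print(ratio_diff.tolist())
```

The last two rows have a source count of 0, so the raw ratio is NaN (0/0) and inf (2/0) respectively; both end up as 0 after sanitizing.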

In [23]:
#%autoreload 2
from src.features import feature_generation_class

Generate the data for the cross-lingual information retrieval task.

In [24]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [25]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.01 seconds.


In [26]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.02 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.


  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 

  1%|          | 3653/500000 [00:00<00:13, 36527.51it/s]

Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 36116.08it/s]

Finished function: 'jaccard' in 13.9 seconds.
Finished function: 'create_sentence_features' in 17.06 seconds.
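
The `jaccard` step in the log above accounts for most of the runtime of this cell. Its implementation is not shown in this notebook, so the following is only a sketch of the standard Jaccard coefficient on token lists; the function name `jaccard_similarity` and the choice to return 0 for two empty inputs are assumptions. The example tokens are taken from row 4 of the preprocessed preview.

```python
def jaccard_similarity(tokens_a, tokens_b):
    """Jaccard coefficient |A ∩ B| / |A ∪ B| of two token lists."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    if not union:
        return 0.0  # assumption: two empty sentences count as no overlap
    return len(set_a & set_b) / len(union)

# Source tokens and the target tokens translated back to English (row 4 above).
source_tokens = ["thank", "mrs", "ţicău", "take", "due", "note", "observation"]
translated_target_tokens = ["thank", "daughter", "observation", "pay", "apply"]

score = jaccard_similarity(source_tokens, translated_target_tokens)
print(score)  # 2 shared tokens out of 10 distinct tokens -> 0.2
```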





In [27]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [03:44<00:00, 2230.90it/s]
  0%|          | 154/500000 [00:00<05:24, 1539.17it/s]

Finished function: 'cosine_similarity_vector' in 224.19 seconds.


100%|██████████| 500000/500000 [03:48<00:00, 2192.02it/s]
  0%|          | 234/500000 [00:00<03:34, 2334.51it/s]

Finished function: 'cosine_similarity_vector' in 228.18 seconds.


100%|██████████| 500000/500000 [02:54<00:00, 2861.78it/s]
  0%|          | 421/500000 [00:00<04:21, 1906.93it/s]

Finished function: 'euclidean_distance_vector' in 174.8 seconds.


100%|██████████| 500000/500000 [02:50<00:00, 2935.08it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'euclidean_distance_vector' in 170.46 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 37392.74it/s]
  1%|          | 3139/500000 [00:00<00:15, 31385.62it/s]

Finished function: 'jaccard' in 13.5 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 37078.33it/s]

Finished function: 'jaccard' in 13.56 seconds.
Finished function: 'create_embedding_features' in 824.75 seconds.
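
The embedding features reported above are cosine similarities and Euclidean distances between source and target sentence embeddings. A minimal sketch of both measures follows, assuming NumPy only; the function names and the toy vectors are hypothetical and this is not the code in `src.features`. The inputs are flattened first because the sentence embeddings in the preview are stored as nested lists (`[[...]]`).

```python
import numpy as np

def cosine_similarity(u, v):
    """Cosine of the angle between two vectors; 0.0 if either has zero norm."""
    u, v = np.ravel(u), np.ravel(v)
    denom = np.linalg.norm(u) * np.linalg.norm(v)
    if denom == 0:
        return 0.0  # assumption: undefined similarity is reported as 0
    return float(np.dot(u, v) / denom)

def euclidean_distance(u, v):
    """L2 distance between two vectors."""
    return float(np.linalg.norm(np.ravel(u) - np.ravel(v)))

# Toy 3-dimensional "sentence embeddings" with the same nesting as the preview.
source_vec = np.array([[1.0, 0.0, 1.0]])
target_vec = np.array([[1.0, 1.0, 0.0]])

cos = cosine_similarity(source_vec, target_vec)
dist = euclidean_distance(source_vec, target_vec)
print(cos, dist)
```

The progress bars indicate that the pairs are processed one at a time. Stacking the embeddings into two matrices and computing the measures in a single NumPy operation may be faster, although that has not been measured here.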





In [28]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [03:45<00:00, 2221.47it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 225.14 seconds.


100%|██████████| 500000/500000 [03:39<00:00, 2278.26it/s]
  0%|          | 650/500000 [00:00<02:42, 3066.14it/s]

Finished function: 'cosine_similarity_vector' in 219.57 seconds.


100%|██████████| 500000/500000 [02:18<00:00, 3614.72it/s]
  0%|          | 316/500000 [00:00<02:38, 3154.30it/s]

Finished function: 'euclidean_distance_vector' in 138.37 seconds.


100%|██████████| 500000/500000 [02:22<00:00, 3515.18it/s]
  1%|          | 4066/500000 [00:00<00:12, 40658.29it/s]

Finished function: 'euclidean_distance_vector' in 142.29 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 37314.92it/s]
  1%|          | 3382/500000 [00:00<00:14, 33818.98it/s]

Finished function: 'jaccard' in 13.49 seconds.


100%|██████████| 500000/500000 [00:12<00:00, 40500.65it/s]

Finished function: 'jaccard' in 12.43 seconds.
Finished function: 'create_embedding_features' in 751.32 seconds.





In [29]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [03:35<00:00, 2324.30it/s]
  0%|          | 95/500000 [00:00<08:46, 948.77it/s]

Finished function: 'cosine_similarity_vector' in 215.17 seconds.


100%|██████████| 500000/500000 [03:37<00:00, 2296.49it/s]
  0%|          | 240/500000 [00:00<03:28, 2395.15it/s]

Finished function: 'cosine_similarity_vector' in 217.8 seconds.


100%|██████████| 500000/500000 [02:33<00:00, 3264.70it/s]
  0%|          | 611/500000 [00:00<03:06, 2674.84it/s]

Finished function: 'euclidean_distance_vector' in 153.22 seconds.


100%|██████████| 500000/500000 [02:27<00:00, 3388.74it/s]
  1%|          | 3294/500000 [00:00<00:15, 32937.05it/s]

Finished function: 'euclidean_distance_vector' in 147.6 seconds.


100%|██████████| 500000/500000 [00:13<00:00, 37049.50it/s]
  0%|          | 1457/500000 [00:00<00:34, 14566.78it/s]

Finished function: 'jaccard' in 13.57 seconds.


100%|██████████| 500000/500000 [00:15<00:00, 32475.46it/s]

Finished function: 'jaccard' in 15.47 seconds.
Finished function: 'create_embedding_features' in 762.88 seconds.





In [30]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_de_testset.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")