# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import data

In this section we import the English and Polish Europarl datasets and combine them into a dataframe of parallel (translated) sentences.

In [None]:
%load_ext autoreload
%autoreload 2

In [2]:
import os
import sys
sys.path.append(os.path.dirname((os.path.abspath(''))))

from src.data import create_data_subset

In [None]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.pl-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.pl-en.pl',
                   sample_size=25000,
                   sentence_data_sampled_path="../data/interim/europarl_en_pl.pkl",)
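
The internals of `create_data_subset` are not shown in this notebook. As a minimal sketch of the idea, assuming the two Europarl files are line-aligned plain text (line *i* of the English file translates line *i* of the Polish file), the pairing and sampling step could look like the following. The function name `build_parallel_frame` and the column names are hypothetical and chosen only for illustration.

```python
import pandas as pd


def build_parallel_frame(source_lines, target_lines, sample_size, seed=42):
    """Pair line-aligned sentences and draw a random sample of the pairs.

    Hypothetical helper: this is not the implementation in src.data.
    """
    if len(source_lines) != len(target_lines):
        raise ValueError("source and target must have the same number of lines")
    frame = pd.DataFrame({
        "text_source": [line.strip() for line in source_lines],
        "text_target": [line.strip() for line in target_lines],
    })
    # Drop pairs where either side is empty, since they carry no translation.
    frame = frame[(frame["text_source"] != "") & (frame["text_target"] != "")]
    sample_size = min(sample_size, len(frame))
    return frame.sample(n=sample_size, random_state=seed).reset_index(drop=True)


english = ["Resumption of the session", "", "Thank you."]
polish = ["Wznowienie sesji", "Pusty", "Dziękuję."]
pairs = build_parallel_frame(english, polish, sample_size=2)
print(pairs)
```

Fixing the random seed keeps the sample reproducible between runs, which matters here because every later step reads the same pickled subset.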

In [None]:
!python -m spacy download pl_core_news_sm

## II. Preprocess data

In this section we preprocess the parallel sentence data for feature generation.

In [3]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
# import de_core_news_sm
# import it_core_news_sm
import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl
from stop_words import get_stop_words

In [4]:
stopwords_source = stopwords.words('english')
# stopwords_target = stopwords.words('german') # German stopwords
# stopwords_target = stopwords.words('italian') # Italian stopwords
stopwords_target = get_stop_words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
# nlp_target = de_core_news_sm.load() # German pipeline
# nlp_target = it_core_news_sm.load() # Italian pipeline
nlp_target = pl_core_news_sm.load() # Polish pipeline

In [35]:
parallel_sentences

<src.data.preprocessing_class.PreprocessingEuroParl at 0x7f8658984f28>

In [5]:
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_de.pkl") # German
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pl.pkl") # Polish

Finished function: 'import_data' in 0.05 seconds.


In [6]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)

100%|██████████| 25000/25000 [02:41<00:00, 154.78it/s]
 61%|██████    | 15160/25000 [00:00<00:00, 32640.97it/s]

Finished function: 'spacy' in 161.52 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 46998.56it/s]
100%|██████████| 25000/25000 [00:00<00:00, 149139.72it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'remove_stopwords' in 0.53 seconds.
Finished function: 'remove_punctuation' in 0.17 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 182182.35it/s]
 30%|██▉       | 7451/25000 [00:00<00:00, 74508.47it/s]

Finished function: 'remove_numbers' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 76720.00it/s]
100%|██████████| 25000/25000 [00:00<00:00, 176887.08it/s]


Finished function: 'lemmatize' in 0.33 seconds.
Finished function: 'lowercase_spacy' in 0.14 seconds.


100%|██████████| 25000/25000 [03:22<00:00, 123.18it/s]
100%|██████████| 25000/25000 [00:00<00:00, 214994.61it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_cleaned_token_embedding' in 163.06 seconds.
Finished function: 'spacy' in 202.96 seconds.
Finished function: 'remove_stopwords' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 143405.21it/s]
 85%|████████▌ | 21261/25000 [00:00<00:00, 43392.73it/s] 

Finished function: 'remove_punctuation' in 0.18 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 40371.34it/s]
 22%|██▏       | 5591/25000 [00:00<00:00, 55903.26it/s]

Finished function: 'remove_numbers' in 0.62 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 65900.84it/s]
100%|██████████| 25000/25000 [00:00<00:00, 153890.72it/s]


Finished function: 'lemmatize' in 0.38 seconds.
Finished function: 'lowercase_spacy' in 0.16 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 5804.76it/s]
100%|██████████| 25000/25000 [00:00<00:00, 286031.96it/s]
100%|██████████| 25000/25000 [00:00<00:00, 192007.50it/s]

Finished function: 'create_cleaned_token_embedding' in 204.63 seconds.
Finished function: 'tokenize_sentence' in 4.31 seconds.
Finished function: 'remove_stopwords' in 0.09 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 145564.59it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'strip_whitespace' in 0.13 seconds.
Finished function: 'lowercase' in 0.17 seconds.
Finished function: 'create_cleaned_text' in 4.73 seconds.


100%|██████████| 25000/25000 [00:04<00:00, 5302.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 278529.91it/s]
100%|██████████| 25000/25000 [00:00<00:00, 207109.35it/s]

Finished function: 'tokenize_sentence' in 4.72 seconds.
Finished function: 'remove_stopwords' in 0.09 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 129688.87it/s]


Finished function: 'strip_whitespace' in 0.12 seconds.
Finished function: 'lowercase' in 0.19 seconds.
Finished function: 'create_cleaned_text' in 5.15 seconds.
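
The log above lists the cleaning steps by name (stopword, punctuation and number removal, lowercasing). The sketch below illustrates those steps on a single sentence with the standard library only. It is a simplified stand-in, not the `PreprocessingEuroParl` implementation: it tokenizes on whitespace instead of using spaCy, it skips lemmatization, and `clean_tokens` is a hypothetical name.

```python
import re
import string


def clean_tokens(sentence, stopwords):
    """Lowercase, split on whitespace, then drop stopwords, punctuation and numbers."""
    tokens = sentence.lower().split()
    # Strip leading and trailing ASCII punctuation from each token.
    tokens = [token.strip(string.punctuation) for token in tokens]
    cleaned = []
    for token in tokens:
        if not token:                               # token was pure punctuation
            continue
        if token in stopwords:                      # stopword removal
            continue
        if re.fullmatch(r"\d+([.,]\d+)*", token):   # plain numbers such as 3 or 1,5
            continue
        cleaned.append(token)
    return cleaned


example = "The Commission adopted 3 proposals, in 2009."
print(clean_tokens(example, stopwords={"the", "in"}))
```

Note that `string.punctuation` only covers ASCII characters, so typographic quotes that occur in Polish text would need to be added separately.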


In [7]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 25000/25000 [00:00<00:00, 87688.24it/s]
 86%|████████▌ | 21378/25000 [00:00<00:00, 106092.13it/s]

Finished function: 'number_punctuations_total' in 0.29 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 104911.52it/s]
100%|██████████| 25000/25000 [00:00<00:00, 217978.78it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuations_total' in 0.24 seconds.
Finished function: 'number_words' in 0.12 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 255681.65it/s]
 19%|█▉        | 4737/25000 [00:00<00:00, 47365.75it/s]

Finished function: 'number_words' in 0.1 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 49653.42it/s]
 21%|██        | 5172/25000 [00:00<00:00, 51705.50it/s]

Finished function: 'number_unique_words' in 0.51 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 51818.51it/s]
 29%|██▊       | 7141/25000 [00:00<00:00, 71399.17it/s]

Finished function: 'number_unique_words' in 0.48 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 70670.57it/s]
 61%|██████    | 15273/25000 [00:00<00:00, 74076.85it/s]

Finished function: 'number_characters' in 0.36 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 75209.22it/s]
  return (character_vector / word_vector).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)
100%|██████████| 25000/25000 [00:00<00:00, 379631.37it/s]
100%|██████████| 25000/25000 [00:00<00:00, 441351.44it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.33 seconds.
Finished function: 'average_characters' in 0.01 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 381919.77it/s]
100%|██████████| 25000/25000 [00:00<00:00, 376891.42it/s]
100%|██████████| 25000/25000 [00:00<00:00, 384079.70it/s]
100%|██████████| 25000/25000 [00:00<00:00, 408370.10it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 397366.99it/s]
100%|██████████| 25000/25000 [00:00<00:00, 343058.68it/s]
100%|██████████| 25000/25000 [00:00<00:00, 371300.89it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 406631.27it/s]
100%|██████████| 25000/25000 [00:00<00:00, 383195.44it/s]
100%|██████████| 25000/25000 [00:00<00:00, 455664.87it/s]
100%|██████████| 25000/25000 [00:00<00:00, 399136.69it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 407212.40it/s]
100%|██████████| 25000/25000 [00:00<00:00, 380239.84it/s]
100%|██████████| 25000/25000 [00:00<00:00, 376858.91it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 381580.65it/s]
100%|██████████| 25000/25000 [00:00<00:00, 377682.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 358805.23it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 403744.13it/s]
100%|██████████| 25000/25000 [00:00<00:00, 353171.24it/s]
100%|██████████| 25000/25000 [00:00<00:00, 368693.72it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 402518.20it/s]
100%|██████████| 25000/25000 [00:00<00:00, 406325.59it/s]
100%|██████████| 25000/25000 [00:00<00:00, 456945.64it/s]
100%|██████████| 25000/25000 [00:00<00:00, 424758.57it/s]
100%|██████████| 25000/25000 [00:00<00:00, 436301.30it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 404910.32it/s]
100%|██████████| 25000/25000 [00:00<00:00, 363928.41it/s]
100%|██████████| 25000/25000 [00:00<00:00, 400609.75it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 457298.36it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464485.49it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464366.19it/s]
100%|██████████| 25000/25000 [00:00<00:00, 461271.41it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 451865.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 423641.40it/s]
100%|██████████| 25000/25000 [00:00<00:00, 438250.64it/s]
100%|██████████| 25000/25000 [00:00<00:00, 458941.60it/s]
100%|██████████| 25000/25000 [00:00<00:00, 460277.24it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 457513.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 463700.85it/s]
100%|██████████| 25000/25000 [00:00<00:00, 463729.56it/s]
100%|██████████| 25000/25000 [00:00<00:00, 440374.62it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 468804.94it/s]
100%|██████████| 25000/25000 [00:00<00:00, 463262.44it/s]
100%|██████████| 25000/25000 [00:00<00:00, 459311.50it/s]
100%|██████████| 25000/25000 [00:00<00:00, 436027.34it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 461332.29it/s]
100%|██████████| 25000/25000 [00:00<00:00, 459651.77it/s]
100%|██████████| 25000/25000 [00:00<00:00, 458085.49it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 444745.30it/s]
100%|██████████| 25000/25000 [00:00<00:00, 450514.07it/s]
100%|██████████| 25000/25000 [00:00<00:00, 444475.72it/s]
100%|██████████| 25000/25000 [00:00<00:00, 456295.42it/s]
100%|██████████| 25000/25000 [00:00<00:00, 463254.25it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 443740.27it/s]
100%|██████████| 25000/25000 [00:00<00:00, 456436.44it/s]
100%|██████████| 25000/25000 [00:00<00:00, 461297.79it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 451435.36it/s]
100%|██████████| 25000/25000 [00:00<00:00, 406278.36it/s]
100%|██████████| 25000/25000 [00:00<00:00, 464164.74it/s]
100%|██████████| 25000/25000 [00:00<00:00, 448414.09it/s]
100%|██████████| 25000/25000 [00:00<00:00, 435730.20it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.



100%|██████████| 25000/25000 [00:00<00:00, 435375.60it/s]
  0%|          | 12/25000 [00:00<03:30, 118.74it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 25000/25000 [02:35<00:00, 160.45it/s]
  0%|          | 13/25000 [00:00<03:20, 124.73it/s]

Finished function: 'spacy' in 155.81 seconds.


100%|██████████| 25000/25000 [03:18<00:00, 125.97it/s]
100%|██████████| 25000/25000 [00:00<00:00, 125803.36it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'spacy' in 198.46 seconds.
Finished function: 'number_pos' in 0.2 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 169461.90it/s]
100%|██████████| 25000/25000 [00:00<00:00, 179704.54it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.15 seconds.
Finished function: 'number_pos' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 197458.55it/s]
100%|██████████| 25000/25000 [00:00<00:00, 177773.86it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_pos' in 0.13 seconds.
Finished function: 'number_pos' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 197416.17it/s]
  5%|▍         | 1206/25000 [00:00<00:01, 12057.71it/s]

Finished function: 'number_pos' in 0.13 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12745.71it/s]
  6%|▌         | 1413/25000 [00:00<00:01, 14128.87it/s]

Finished function: 'number_times' in 1.96 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 14771.42it/s]
 10%|█         | 2514/25000 [00:00<00:01, 12490.52it/s]

Finished function: 'number_times' in 1.69 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 12812.24it/s]
 11%|█▏        | 2829/25000 [00:00<00:01, 13962.44it/s]

Finished function: 'number_times' in 1.95 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 14753.06it/s]
  5%|▌         | 1279/25000 [00:00<00:01, 12788.98it/s]

Finished function: 'number_times' in 1.7 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 13080.18it/s]
 11%|█▏        | 2820/25000 [00:00<00:01, 14160.27it/s]

Finished function: 'number_times' in 1.91 seconds.


100%|██████████| 25000/25000 [00:01<00:00, 14704.80it/s]
100%|██████████| 25000/25000 [00:00<00:00, 187033.63it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'number_times' in 1.7 seconds.
Finished function: 'named_numbers' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 187236.35it/s]

Finished function: 'named_numbers' in 0.14 seconds.





In [9]:
parallel_sentences.create_embedding_information("proc_5k", language_pair="en_pl")

Finished function: 'load_embeddings' in 1.01 seconds.


  0%|          | 33/25000 [00:00<01:18, 318.71it/s]

Finished function: 'load_embeddings' in 0.66 seconds.


100%|██████████| 25000/25000 [00:57<00:00, 436.42it/s]
  0%|          | 47/25000 [00:00<00:53, 464.15it/s]

Finished function: 'word_embeddings' in 57.28 seconds.


100%|██████████| 25000/25000 [00:53<00:00, 471.20it/s]


Finished function: 'word_embeddings' in 53.06 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 171934.63it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 61.91 seconds.
Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 178752.42it/s]


Finished function: 'translate_words' in 0.14 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1580.53it/s]


Finished function: 'tf_idf_vector' in 16.18 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1850.63it/s]
  1%|          | 181/25000 [00:00<00:13, 1801.91it/s]

Finished function: 'tf_idf_vector' in 13.86 seconds.


100%|██████████| 25000/25000 [00:12<00:00, 1957.88it/s]
  1%|          | 254/25000 [00:00<00:09, 2534.17it/s]

Finished function: 'sentence_embedding_average' in 12.77 seconds.


100%|██████████| 25000/25000 [00:10<00:00, 2498.73it/s]
  return [pd.Series(embedding_dataframe.values.mean(axis=1))]
  0%|          | 18/25000 [00:00<02:22, 174.85it/s]

Finished function: 'sentence_embedding_average' in 10.01 seconds.


100%|██████████| 25000/25000 [02:07<00:00, 196.41it/s]
  0%|          | 22/25000 [00:00<01:55, 215.43it/s]

Finished function: 'sentence_embedding_tf_idf' in 127.29 seconds.


100%|██████████| 25000/25000 [01:50<00:00, 226.68it/s]

Finished function: 'sentence_embedding_tf_idf' in 110.3 seconds.





In [10]:
parallel_sentences.create_embedding_information("proc_b_1k", language_pair="en_pl")

Finished function: 'load_embeddings' in 1.05 seconds.


  0%|          | 41/25000 [00:00<01:01, 406.30it/s]

Finished function: 'load_embeddings' in 0.68 seconds.


100%|██████████| 25000/25000 [00:52<00:00, 478.75it/s]
  0%|          | 47/25000 [00:00<00:53, 463.10it/s]

Finished function: 'word_embeddings' in 52.22 seconds.


100%|██████████| 25000/25000 [00:59<00:00, 418.52it/s]


Finished function: 'word_embeddings' in 59.74 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 180622.67it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 59.43 seconds.
Finished function: 'translate_words' in 0.14 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 167362.72it/s]


Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:20<00:00, 1217.80it/s]


Finished function: 'tf_idf_vector' in 20.87 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1861.64it/s]
  0%|          | 91/25000 [00:00<00:27, 905.35it/s]

Finished function: 'tf_idf_vector' in 13.79 seconds.


100%|██████████| 25000/25000 [00:13<00:00, 1825.99it/s]
  1%|          | 158/25000 [00:00<00:15, 1579.80it/s]

Finished function: 'sentence_embedding_average' in 13.69 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1603.65it/s]
  0%|          | 13/25000 [00:00<03:14, 128.68it/s]

Finished function: 'sentence_embedding_average' in 15.59 seconds.


100%|██████████| 25000/25000 [02:06<00:00, 197.65it/s]
  0%|          | 25/25000 [00:00<01:41, 246.11it/s]

Finished function: 'sentence_embedding_tf_idf' in 126.55 seconds.


100%|██████████| 25000/25000 [01:50<00:00, 225.41it/s]

Finished function: 'sentence_embedding_tf_idf' in 110.92 seconds.





In [11]:
parallel_sentences.create_embedding_information("vecmap", language_pair="en_pl")

Finished function: 'load_embeddings' in 0.92 seconds.


  0%|          | 21/25000 [00:00<02:00, 207.63it/s]

Finished function: 'load_embeddings' in 1.02 seconds.


100%|██████████| 25000/25000 [00:59<00:00, 420.79it/s]
  0%|          | 39/25000 [00:00<01:06, 377.68it/s]

Finished function: 'word_embeddings' in 59.42 seconds.


100%|██████████| 25000/25000 [00:51<00:00, 484.15it/s]


Finished function: 'word_embeddings' in 51.64 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 166618.10it/s]
  0%|          | 0/25000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 62.89 seconds.
Finished function: 'translate_words' in 0.16 seconds.


100%|██████████| 25000/25000 [00:00<00:00, 170175.86it/s]


Finished function: 'translate_words' in 0.15 seconds.


100%|██████████| 25000/25000 [00:21<00:00, 1162.44it/s]


Finished function: 'tf_idf_vector' in 21.87 seconds.


100%|██████████| 25000/25000 [00:14<00:00, 1730.15it/s]
  0%|          | 83/25000 [00:00<00:30, 827.38it/s]

Finished function: 'tf_idf_vector' in 15.21 seconds.


100%|██████████| 25000/25000 [00:23<00:00, 1058.18it/s]
  1%|          | 250/25000 [00:00<00:09, 2493.34it/s]

Finished function: 'sentence_embedding_average' in 23.63 seconds.


100%|██████████| 25000/25000 [00:15<00:00, 1651.35it/s]
  0%|          | 21/25000 [00:00<02:03, 202.45it/s]

Finished function: 'sentence_embedding_average' in 15.14 seconds.


100%|██████████| 25000/25000 [02:04<00:00, 201.35it/s]
  0%|          | 24/25000 [00:00<01:45, 237.49it/s]

Finished function: 'sentence_embedding_tf_idf' in 124.17 seconds.


100%|██████████| 25000/25000 [01:42<00:00, 243.76it/s]

Finished function: 'sentence_embedding_tf_idf' in 102.57 seconds.
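
The step names `sentence_embedding_average` and `sentence_embedding_tf_idf` suggest two ways of turning word vectors into one sentence vector. The following is a minimal sketch of both, assuming a plain dictionary from token to vector and a dictionary of per-token TF-IDF weights. The function names and data layout are assumptions for illustration and may differ from the project code.

```python
import numpy as np


def average_embedding(tokens, word_vectors):
    """Unweighted mean of the vectors of all in-vocabulary tokens."""
    vectors = [word_vectors[t] for t in tokens if t in word_vectors]
    if not vectors:
        return None  # no known token, so no sentence vector
    return np.mean(vectors, axis=0)


def weighted_embedding(tokens, word_vectors, weights):
    """Mean of word vectors weighted by a per-token weight such as TF-IDF."""
    known = [t for t in tokens if t in word_vectors and weights.get(t, 0.0) > 0]
    if not known:
        return None
    matrix = np.stack([word_vectors[t] for t in known])
    w = np.array([weights[t] for t in known])
    # Scale each row by its weight, sum, and normalize by the total weight.
    return (matrix * w[:, None]).sum(axis=0) / w.sum()


vectors = {"parliament": np.array([1.0, 0.0]), "vote": np.array([0.0, 1.0])}
tfidf = {"parliament": 3.0, "vote": 1.0}
tokens = ["parliament", "vote", "unknown"]
avg = average_embedding(tokens, vectors)
weighted = weighted_embedding(tokens, vectors, tfidf)
print(avg, weighted)
```

Returning `None` for sentences without any in-vocabulary token makes the empty case explicit instead of averaging over an empty array.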





In [12]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_pl.json")

In [13]:
import pandas as pd
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_en_pl.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pl.pkl")
parallel_sentences.preprocessed = preprocessed_data

Finished function: 'import_data' in 0.05 seconds.


## III. Create data set

In this section we create the dataset for training the supervised model and the dataset used for supervised and unsupervised retrieval.

In [19]:
from src.data import DataSet

In [20]:
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100

In [21]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.01 seconds.


In [22]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.
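
With `n_model = 20000` and `n_retrieval = 5000`, the two subsets together use all 25000 sampled sentence pairs. How `DataSet.split_model_retrieval` selects the rows is not shown here. A minimal sketch, assuming the split simply takes consecutive, disjoint slices, could be:

```python
import pandas as pd


def split_model_retrieval(frame, n_model, n_retrieval):
    """Return two disjoint slices: one for model training, one for retrieval.

    Assumption: rows are taken in order; the project code may shuffle first.
    """
    if n_model + n_retrieval > len(frame):
        raise ValueError("not enough rows for the requested split")
    model_subset = frame.iloc[:n_model]
    retrieval_subset = frame.iloc[n_model:n_model + n_retrieval]
    return model_subset, retrieval_subset


toy = pd.DataFrame({"id_source": range(10), "id_target": range(10)})
model_part, retrieval_part = split_model_retrieval(toy, n_model=8, n_retrieval=2)
print(len(model_part), len(retrieval_part))
```

Keeping the subsets disjoint ensures that retrieval is evaluated on sentence pairs the supervised model has not seen during training.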


In [23]:
dataset.create_model_index(n_model, k, sample_size_k,
     "sentence_embedding_tf_idf_proc_5k_source", "sentence_embedding_tf_idf_proc_5k_target")

100%|██████████| 2000000/2000000 [16:21<00:00, 2036.66it/s]


Finished function: 'cosine_similarity_vector' in 982.61 seconds.


100%|██████████| 20000/20000 [00:10<00:00, 1900.52it/s]
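
The progress bar shows 2,000,000 similarity computations, which is consistent with `n_model * sample_size_k` (20000 × 100). One plausible reading, stated here as an assumption rather than as the behavior of `create_model_index`, is that for each source sentence `sample_size_k` wrong translations are sampled and the `k` most similar ones are kept as hard negatives. A sketch of that idea, with hypothetical function names:

```python
import numpy as np


def cosine_similarity(a, b):
    """Cosine similarity, returning 0.0 when either vector has zero norm."""
    norm = np.linalg.norm(a) * np.linalg.norm(b)
    return float(a @ b / norm) if norm > 0 else 0.0


def hard_negatives(source_id, source_vecs, target_vecs, k, sample_size, rng):
    """Sample candidate wrong translations and keep the k most similar ones."""
    # Exclude the true translation, which shares the index of the source.
    candidates = [i for i in range(len(target_vecs)) if i != source_id]
    sampled = rng.choice(candidates, size=min(sample_size, len(candidates)), replace=False)
    scores = [cosine_similarity(source_vecs[source_id], target_vecs[j]) for j in sampled]
    order = np.argsort(scores)[::-1][:k]  # indices of the highest scores first
    return [int(sampled[i]) for i in order]


rng = np.random.default_rng(0)
source_vecs = rng.normal(size=(50, 8))
target_vecs = source_vecs + 0.1 * rng.normal(size=(50, 8))  # noisy stand-ins for translations
negatives = hard_negatives(0, source_vecs, target_vecs, k=3, sample_size=20, rng=rng)
print(negatives)
```

Sampling candidates before scoring keeps the cost linear in `sample_size_k` per sentence instead of comparing every source with every target.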


In [24]:
dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_pl.feather")

In [25]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index.feather")

In [26]:
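import pandas as pd

# dataset.create_retrieval_index(n_queries)
# If your pandas version is too old for the call above, build the index manually instead:
# pair each of the first n_queries source sentences with every target sentence.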
query = pd.DataFrame({"id_source": dataset.retrieval_subset.iloc[:n_queries]["id_source"]})
documents = pd.DataFrame({"id_target": dataset.retrieval_subset["id_target"]})
index = pd.MultiIndex.from_product([dataset.retrieval_subset.iloc[:n_queries]["id_source"], dataset.retrieval_subset["id_target"]], names = ["id_source", "id_target"])
dataset.retrieval_dataset_index = pd.DataFrame(index = index).reset_index()
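
The cell above builds the Cartesian product of query ids and document ids, so 100 queries and 5000 documents give 500000 candidate pairs. The toy example below shows the same `pd.MultiIndex.from_product` pattern on small data, and compares it with `DataFrame.merge(how="cross")`, which is available in pandas 1.2 and later. Whether `create_retrieval_index` uses a cross merge internally is an assumption.

```python
import pandas as pd

queries = pd.Series([101, 102], name="id_source")
documents = pd.Series([201, 202, 203], name="id_target")

# Every query id paired with every document id, queries varying slowest.
index = pd.MultiIndex.from_product([queries, documents], names=["id_source", "id_target"])
pairs = pd.DataFrame(index=index).reset_index()

# Equivalent result on pandas >= 1.2.
cross = queries.to_frame().merge(documents.to_frame(), how="cross")

print(pairs)
```

The resulting frame has one row per (query, document) pair, which is the shape the feature generation step expects as its index.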

In [27]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index.feather")

## IV. Create features

In this section we create the features for our model. The sentence-based features use information that has to be extracted before the text is preprocessed.

In [16]:
#%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [None]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
                                                             parallel_sentences.preprocessed)

In [None]:
features_model.create_feature_dataframe()

In [None]:
features_model.create_sentence_features()

In [None]:
features_model.create_embedding_features("proc_5k")

In [None]:
features_model.create_embedding_features("proc_b_1k")

In [None]:
features_model.create_embedding_features("vecmap")

In [None]:
features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model.feather")

Generation of the data for the cross-lingual information retrieval task.

In [28]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [29]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.01 seconds.


In [30]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.03 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.


  return abs(target_array - source_array).replace(np.nan, 0).replace(np.inf, 0).replace(np.log(0), 0)


Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.01 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 

  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.02 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.


100%|██████████| 500000/500000 [00:19<00:00, 26282.87it/s]

Finished function: 'jaccard' in 19.11 seconds.
Finished function: 'create_sentence_features' in 22.08 seconds.
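
The log names three kinds of pairwise features: `difference_numerical`, `relative_difference_numerical` and `normalized_difference_numerical`. Only the absolute difference is visible in the warning text above, so the other two definitions in this sketch are assumptions. The sketch also replaces infinities explicitly: `np.log(0)` evaluates to `-inf` and emits a `RuntimeWarning`, so listing `-np.inf` directly gives the same cleanup without the warning.

```python
import numpy as np
import pandas as pd


def clean(series):
    """Replace NaN and +/-inf (for example from division by zero) with 0."""
    return series.replace([np.inf, -np.inf], np.nan).fillna(0.0)


def difference(source, target):
    """Absolute difference between target and source values."""
    return clean((target - source).abs())


def relative_difference(source, target):
    # Assumed definition: absolute difference relative to the larger value.
    return clean((target - source).abs() / np.maximum(source, target))


def normalized_difference(source, target):
    # Assumed definition: absolute difference divided by the sum of both values.
    return clean((target - source).abs() / (source + target))


source_words = pd.Series([10.0, 0.0, 4.0])   # e.g. number of words per source sentence
target_words = pd.Series([8.0, 0.0, 6.0])    # e.g. number of words per target sentence

diff = difference(source_words, target_words)
rel = relative_difference(source_words, target_words)
norm = normalized_difference(source_words, target_words)
print(pd.DataFrame({"diff": diff, "rel": rel, "norm": norm}))
```

The middle row is the empty-sentence case, where both counts are 0 and the ratios would otherwise be undefined.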





In [31]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [05:18<00:00, 1570.98it/s]
  0%|          | 116/500000 [00:00<07:11, 1159.28it/s]

Finished function: 'cosine_similarity_vector' in 318.37 seconds.


100%|██████████| 500000/500000 [05:15<00:00, 1582.63it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 316.01 seconds.


100%|██████████| 500000/500000 [03:18<00:00, 2524.50it/s]
  0%|          | 224/500000 [00:00<03:45, 2220.22it/s]

Finished function: 'euclidean_distance_vector' in 198.18 seconds.


100%|██████████| 500000/500000 [03:06<00:00, 2684.32it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'euclidean_distance_vector' in 186.34 seconds.


100%|██████████| 500000/500000 [00:22<00:00, 22124.60it/s]
  0%|          | 1670/500000 [00:00<00:29, 16697.87it/s]

Finished function: 'jaccard' in 22.72 seconds.


100%|██████████| 500000/500000 [00:26<00:00, 18910.90it/s]

Finished function: 'jaccard' in 26.54 seconds.
Finished function: 'create_embedding_features' in 1068.22 seconds.
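
The embedding features logged above are cosine similarity, Euclidean distance and Jaccard similarity for each query and document pair. As an illustration only, the sketch below computes the first two for all rows at once with NumPy, and Jaccard on two token lists. It assumes the sentence vectors are stacked into two arrays of equal shape. The function names are hypothetical, and this is not the row-by-row implementation whose timings appear in the log.

```python
import numpy as np


def rowwise_cosine(a, b):
    """Cosine similarity between matching rows of two (n, d) arrays."""
    dot = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows with a zero vector get similarity 0 instead of a division by zero.
    return np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)


def rowwise_euclidean(a, b):
    """Euclidean distance between matching rows of two (n, d) arrays."""
    return np.linalg.norm(a - b, axis=1)


def jaccard(tokens_a, tokens_b):
    """Size of the token intersection divided by the size of the union."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


source = np.array([[1.0, 0.0], [1.0, 1.0], [0.0, 0.0]])
target = np.array([[1.0, 0.0], [1.0, -1.0], [2.0, 0.0]])
cos = rowwise_cosine(source, target)
dist = rowwise_euclidean(source, target)
overlap = jaccard(["a", "b"], ["b", "c"])
print(cos, dist, overlap)
```

Operating on whole arrays avoids a Python-level loop over all pairs, which can matter when the index holds hundreds of thousands of rows.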





In [32]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [05:00<00:00, 1661.14it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 301.12 seconds.


100%|██████████| 500000/500000 [06:02<00:00, 1380.42it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 362.34 seconds.


100%|██████████| 500000/500000 [03:19<00:00, 2501.94it/s]
  0%|          | 170/500000 [00:00<04:54, 1696.08it/s]

Finished function: 'euclidean_distance_vector' in 199.97 seconds.


100%|██████████| 500000/500000 [03:06<00:00, 2685.01it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'euclidean_distance_vector' in 186.3 seconds.


100%|██████████| 500000/500000 [00:22<00:00, 21893.46it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'jaccard' in 22.98 seconds.


100%|██████████| 500000/500000 [00:23<00:00, 21155.37it/s]

Finished function: 'jaccard' in 23.74 seconds.
Finished function: 'create_embedding_features' in 1096.49 seconds.





In [33]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [05:22<00:00, 1550.32it/s]


Finished function: 'cosine_similarity_vector' in 315.73 seconds.


100%|██████████| 500000/500000 [05:18<00:00, 1569.23it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 318.86 seconds.


100%|██████████| 500000/500000 [03:25<00:00, 2430.55it/s]
  0%|          | 231/500000 [00:00<03:36, 2307.49it/s]

Finished function: 'euclidean_distance_vector' in 205.81 seconds.


100%|██████████| 500000/500000 [02:41<00:00, 3088.96it/s]
  1%|          | 3552/500000 [00:00<00:13, 35518.51it/s]

Finished function: 'euclidean_distance_vector' in 161.94 seconds.


100%|██████████| 500000/500000 [00:16<00:00, 30403.51it/s]
  1%|          | 2875/500000 [00:00<00:17, 28745.92it/s]

Finished function: 'jaccard' in 16.54 seconds.


100%|██████████| 500000/500000 [00:15<00:00, 33093.92it/s]

Finished function: 'jaccard' in 15.2 seconds.
Finished function: 'create_embedding_features' in 1034.14 seconds.





In [34]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_pl.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval.feather")