# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and target-language Europarl datasets and combine them into a dataframe of parallel sentences. The cell below uses the English-Italian files.

In [None]:
%load_ext autoreload
%autoreload 2

In [1]:
import os
import sys
# Add the parent of the current working directory to the import path so that `src` can be imported.
sys.path.append(os.path.dirname(os.path.abspath('')))

from src.data import create_data_subset

In [None]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.it-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.it-en.it',
                   sample_size=25000,
                   sentence_data_sampled_path="../data/interim/europarl_en_it.pkl",)
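
The internals of `create_data_subset` are not shown in this notebook. As a rough sketch of the idea, assuming the two Europarl files are line-aligned (line i of one file is the translation of line i of the other), the sampling step could look like the following. The function name `sample_parallel_sentences`, the column names, and the `seed` parameter are illustrative and not part of `src.data`.

```python
import random

import pandas as pd


def sample_parallel_sentences(source_lines, target_lines, sample_size, seed=42):
    """Draw the same random line positions from two line-aligned corpora."""
    if len(source_lines) != len(target_lines):
        raise ValueError("Parallel corpora must have the same number of lines.")
    rng = random.Random(seed)
    # Sample positions once so that every source line keeps its translation.
    positions = sorted(rng.sample(range(len(source_lines)), sample_size))
    return pd.DataFrame(
        {
            "text_source": [source_lines[i].strip() for i in positions],
            "text_target": [target_lines[i].strip() for i in positions],
        }
    )


# Toy stand-in for the Europarl files.
english = ["Good morning.\n", "Thank you.\n", "The sitting is closed.\n"]
italian = ["Buongiorno.\n", "Grazie.\n", "La seduta è tolta.\n"]
subset = sample_parallel_sentences(english, italian, sample_size=2)
```

Sampling positions rather than lines keeps the two columns aligned, which is the property the later retrieval evaluation depends on.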

## II. Preprocess data

In this section we preprocess the parallel sentence data for feature generation.

In [2]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm
# import it_core_news_sm
# import pl_core_news_sm
import time
from src.data import PreprocessingEuroParl

In [3]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german') # German stopwords
# stopwords_target = stopwords.words('italian') # Italian stopwords
# stopwords_target = stopwords.words('polish') # Polish stopwords
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load() # German pipeline
# nlp_target = it_core_news_sm.load() # Italian pipeline
# nlp_target = pl_core_news_sm.load() # Polish pipeline

In [4]:
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/feature_retrieval_doc.pickle") # German
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl") # Italian
# parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_pol.pkl") # Polish

Finished function: 'import_data' in 0.03 seconds.


In [5]:
import numpy as np
parallel_sentences.dataframe["id_source"] = np.arange(len(parallel_sentences.dataframe))
parallel_sentences.dataframe["id_target"] = np.arange(len(parallel_sentences.dataframe))

In [6]:
parallel_sentences.dataframe

| | id_source | text_source | id_target | text_target |
|---|---|---|---|---|
| 0 | 0 | die afroasiatischen ( früher auch als hamito -... | 0 | afroasiatic languages afroasiatic afro asiatic... |
| 1 | 1 | ( von ‚ begnadigung , straferlass , amnestie '... | 1 | amnesty international amnesty international co... |
| 2 | 2 | die ( bíos ‚leben ' ; auch als synonym zu biot... | 2 | biotechnology biotechnology is the use of livi... |
| 3 | 3 | eine ist die beschreibung der individuellen zu... | 3 | blood type a blood type also called a blood gr... |
| 4 | 4 | ( englische aussprache [ ˈbuːtən ] ; von engl ... | 4 | booting in computing booting also known as boo... |
| ... | ... | ... | ... | ... |
| 4995 | 4995 | ist ein novellenband der amerikanischen schrif... | 4995 | pale horse , pale rider pale horse pale rider ... |
| 4996 | 4996 | das , auch als vedanta bezeichnet , zählt im h... | 4996 | brahma sutras the brahma sūtras also known as ... |
| 4997 | 4997 | der von ( 13. jahrhundert , in der literatur a... | 4997 | cremona elephant the cremona elephant was a gi... |
| 4998 | 4998 | die ist ein über den quellteich in springstill... | 4998 | stille ( river ) stille is a river of thuringi... |


In [7]:
parallel_sentences.preprocess_sentences(nlp_source, nlp_target, stopwords_source, stopwords_target)

100%|██████████| 5000/5000 [00:39<00:00, 125.42it/s]
100%|██████████| 5000/5000 [00:00<00:00, 28293.48it/s]
100%|██████████| 5000/5000 [00:00<00:00, 141413.77it/s]

Finished function: 'spacy' in 39.87 seconds.
Finished function: 'remove_punctuation' in 0.18 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 54053.23it/s]
100%|██████████| 5000/5000 [00:00<00:00, 160998.63it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'remove_numbers' in 0.04 seconds.
Finished function: 'lemmatize' in 0.09 seconds.
Finished function: 'lowercase_spacy' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 21194.42it/s]
  0%|          | 3/5000 [00:00<02:53, 28.74it/s]

Finished function: 'remove_stopwords' in 0.24 seconds.
Finished function: 'create_cleaned_token_embedding' in 40.49 seconds.


100%|██████████| 5000/5000 [03:36<00:00, 23.07it/s]
 33%|███▎      | 1665/5000 [00:00<00:00, 16646.01it/s]

Finished function: 'spacy' in 216.69 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 21439.90it/s]
 48%|████▊     | 2407/5000 [00:00<00:00, 24068.07it/s]

Finished function: 'remove_punctuation' in 0.24 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 21502.77it/s]
 13%|█▎        | 658/5000 [00:00<00:00, 6577.89it/s]

Finished function: 'remove_numbers' in 0.23 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 7875.74it/s]
 32%|███▏      | 1588/5000 [00:00<00:00, 15875.21it/s]

Finished function: 'lemmatize' in 0.64 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 15930.16it/s]
  2%|▏         | 106/5000 [00:00<00:04, 1053.50it/s]

Finished function: 'lowercase_spacy' in 0.32 seconds.


100%|██████████| 5000/5000 [00:03<00:00, 1346.70it/s]
 14%|█▍        | 715/5000 [00:00<00:01, 3251.76it/s]

Finished function: 'remove_stopwords' in 3.71 seconds.
Finished function: 'create_cleaned_token_embedding' in 222.22 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 4214.67it/s]
 70%|███████   | 3512/5000 [00:00<00:00, 16873.56it/s]

Finished function: 'tokenize_sentence' in 1.19 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 17178.29it/s]
100%|██████████| 5000/5000 [00:00<00:00, 152382.72it/s]
100%|██████████| 5000/5000 [00:00<00:00, 115372.03it/s]
 32%|███▏      | 1610/5000 [00:00<00:00, 16095.99it/s]

Finished function: 'remove_stopwords' in 0.29 seconds.
Finished function: 'strip_whitespace' in 0.03 seconds.
Finished function: 'lowercase' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 17796.02it/s]
  1%|          | 57/5000 [00:00<00:08, 568.61it/s]

Finished function: 'remove_stopwords' in 0.28 seconds.
Finished function: 'create_cleaned_text' in 1.85 seconds.


100%|██████████| 5000/5000 [00:06<00:00, 759.05it/s]
  4%|▍         | 214/5000 [00:00<00:04, 1054.49it/s]

Finished function: 'tokenize_sentence' in 6.59 seconds.


100%|██████████| 5000/5000 [00:04<00:00, 1114.95it/s]
100%|██████████| 5000/5000 [00:00<00:00, 26075.90it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'remove_stopwords' in 4.49 seconds.
Finished function: 'strip_whitespace' in 0.19 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 17561.37it/s]
  4%|▍         | 214/5000 [00:00<00:04, 1064.62it/s]

Finished function: 'lowercase' in 0.29 seconds.


100%|██████████| 5000/5000 [00:03<00:00, 1304.22it/s]


Finished function: 'remove_stopwords' in 3.84 seconds.
Finished function: 'create_cleaned_text' in 15.47 seconds.
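
The log above names the cleaning steps that `preprocess_sentences` runs (`remove_punctuation`, `remove_numbers`, `lemmatize`, `lowercase`, `remove_stopwords`, `strip_whitespace`). Their implementation lives in `src.data` and is not shown here. The sketch below illustrates a comparable chain in plain Python. It skips lemmatization (which needs a language model such as spaCy), uses a tiny hand-written stopword set, and the helper name `clean_tokens` is hypothetical.

```python
import re
import string

# Tiny illustrative stopword set; the notebook uses NLTK's lists instead.
STOPWORDS = {"the", "is", "a", "of", "in"}


def clean_tokens(sentence, stopwords=STOPWORDS):
    """Lowercase, drop punctuation and digits, then remove stopwords."""
    text = sentence.lower()
    # Replace ASCII punctuation with spaces so that words stay separated.
    text = text.translate(str.maketrans(string.punctuation, " " * len(string.punctuation)))
    # Remove digit sequences.
    text = re.sub(r"\d+", " ", text)
    # split() without arguments also discards leading and trailing whitespace.
    return [token for token in text.split() if token not in stopwords]


tokens = clean_tokens("The sitting of 12 March is closed, in Strasbourg!")
# tokens == ["sitting", "march", "closed", "strasbourg"]
```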


In [8]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target)

100%|██████████| 5000/5000 [00:00<00:00, 73017.05it/s]
 17%|█▋        | 828/5000 [00:00<00:00, 8272.47it/s]

Finished function: 'number_punctuations_total' in 0.07 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 9741.14it/s]
100%|██████████| 5000/5000 [00:00<00:00, 240278.64it/s]
100%|██████████| 5000/5000 [00:00<00:00, 30477.56it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuations_total' in 0.51 seconds.
Finished function: 'number_words' in 0.02 seconds.
Finished function: 'number_words' in 0.17 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 45841.69it/s]
 14%|█▍        | 706/5000 [00:00<00:00, 7050.83it/s]

Finished function: 'number_unique_words' in 0.11 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 8545.35it/s]
100%|██████████| 5000/5000 [00:00<00:00, 72852.18it/s]
 25%|██▌       | 1251/5000 [00:00<00:00, 12505.75it/s]

Finished function: 'number_unique_words' in 0.59 seconds.
Finished function: 'number_characters' in 0.07 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 14128.01it/s]
100%|██████████| 5000/5000 [00:00<00:00, 398432.98it/s]
100%|██████████| 5000/5000 [00:00<00:00, 83341.36it/s]
100%|██████████| 5000/5000 [00:00<00:00, 347584.65it/s]
100%|██████████| 5000/5000 [00:00<00:00, 76739.77it/s]
100%|██████████| 5000/5000 [00:00<00:00, 411658.29it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_characters' in 0.36 seconds.
Finished function: 'average_characters' in 0.02 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 94736.43it/s]
100%|██████████| 5000/5000 [00:00<00:00, 409944.29it/s]
100%|██████████| 5000/5000 [00:00<00:00, 95611.49it/s]
100%|██████████| 5000/5000 [00:00<00:00, 404793.08it/s]
100%|██████████| 5000/5000 [00:00<00:00, 79386.46it/s]
100%|██████████| 5000/5000 [00:00<00:00, 376623.39it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.05 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 92570.69it/s]
100%|██████████| 5000/5000 [00:00<00:00, 339696.77it/s]
100%|██████████| 5000/5000 [00:00<00:00, 81953.93it/s]
100%|██████████| 5000/5000 [00:00<00:00, 309867.46it/s]
100%|██████████| 5000/5000 [00:00<00:00, 81051.55it/s]
100%|██████████| 5000/5000 [00:00<00:00, 354290.54it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 92752.47it/s]
100%|██████████| 5000/5000 [00:00<00:00, 343423.84it/s]
100%|██████████| 5000/5000 [00:00<00:00, 82395.38it/s]
100%|██████████| 5000/5000 [00:00<00:00, 360986.66it/s]
100%|██████████| 5000/5000 [00:00<00:00, 84983.39it/s]
100%|██████████| 5000/5000 [00:00<00:00, 330452.70it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 78731.38it/s]
100%|██████████| 5000/5000 [00:00<00:00, 353281.90it/s]
100%|██████████| 5000/5000 [00:00<00:00, 82217.72it/s]
100%|██████████| 5000/5000 [00:00<00:00, 351440.69it/s]
100%|██████████| 5000/5000 [00:00<00:00, 87223.60it/s]
100%|██████████| 5000/5000 [00:00<00:00, 339394.41it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 84268.66it/s]
100%|██████████| 5000/5000 [00:00<00:00, 336319.20it/s]
100%|██████████| 5000/5000 [00:00<00:00, 68647.22it/s]
100%|██████████| 5000/5000 [00:00<00:00, 267480.23it/s]
100%|██████████| 5000/5000 [00:00<00:00, 78956.66it/s]
100%|██████████| 5000/5000 [00:00<00:00, 330598.57it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.08 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.



100%|██████████| 5000/5000 [00:00<00:00, 80423.37it/s]
100%|██████████| 5000/5000 [00:00<00:00, 320822.42it/s]
100%|██████████| 5000/5000 [00:00<00:00, 84009.74it/s]
100%|██████████| 5000/5000 [00:00<00:00, 332069.54it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 86107.30it/s]
100%|██████████| 5000/5000 [00:00<00:00, 359896.35it/s]
100%|██████████| 5000/5000 [00:00<00:00, 85626.70it/s]
100%|██████████| 5000/5000 [00:00<00:00, 319181.78it/s]
100%|██████████| 5000/5000 [00:00<00:00, 80645.11it/s]
100%|██████████| 5000/5000 [00:00<00:00, 379623.12it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.01 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 81055.62it/s]
100%|██████████| 5000/5000 [00:00<00:00, 322514.73it/s]
100%|██████████| 5000/5000 [00:00<00:00, 83744.05it/s]
100%|██████████| 5000/5000 [00:00<00:00, 360391.13it/s]
100%|██████████| 5000/5000 [00:00<00:00, 86058.54it/s]
100%|██████████| 5000/5000 [00:00<00:00, 346453.45it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 85508.00it/s]
100%|██████████| 5000/5000 [00:00<00:00, 125334.80it/s]
100%|██████████| 5000/5000 [00:00<00:00, 79634.25it/s]
100%|██████████| 5000/5000 [00:00<00:00, 350483.32it/s]
100%|██████████| 5000/5000 [00:00<00:00, 75589.12it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.06 seconds.
Finished function: 'number_punctuation_marks' in 0.04 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 351788.51it/s]
100%|██████████| 5000/5000 [00:00<00:00, 72953.30it/s]
100%|██████████| 5000/5000 [00:00<00:00, 284575.68it/s]
100%|██████████| 5000/5000 [00:00<00:00, 76864.08it/s]
100%|██████████| 5000/5000 [00:00<00:00, 372939.73it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 78182.51it/s]
100%|██████████| 5000/5000 [00:00<00:00, 356149.72it/s]
100%|██████████| 5000/5000 [00:00<00:00, 87140.25it/s]
  0%|          | 10/5000 [00:00<00:53, 93.82it/s]

Finished function: 'number_punctuation_marks' in 0.07 seconds.
Finished function: 'number_punctuation_marks' in 0.02 seconds.
Finished function: 'number_punctuation_marks' in 0.06 seconds.


100%|██████████| 5000/5000 [00:37<00:00, 132.88it/s]
  0%|          | 4/5000 [00:00<02:37, 31.63it/s]

Finished function: 'spacy' in 37.63 seconds.


100%|██████████| 5000/5000 [04:01<00:00, 20.68it/s]
100%|██████████| 5000/5000 [00:00<00:00, 112673.78it/s]
 23%|██▎       | 1135/5000 [00:00<00:00, 11347.63it/s]

Finished function: 'spacy' in 241.83 seconds.
Finished function: 'number_pos' in 0.05 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 15203.73it/s]
100%|██████████| 5000/5000 [00:00<00:00, 164826.38it/s]
 56%|█████▌    | 2791/5000 [00:00<00:00, 27906.97it/s]

Finished function: 'number_pos' in 0.33 seconds.
Finished function: 'number_pos' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 29617.19it/s]
100%|██████████| 5000/5000 [00:00<00:00, 166013.74it/s]
 49%|████▉     | 2463/5000 [00:00<00:00, 24629.03it/s]

Finished function: 'number_pos' in 0.17 seconds.
Finished function: 'number_pos' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 26115.26it/s]
 46%|████▌     | 2280/5000 [00:00<00:00, 11285.35it/s]

Finished function: 'number_pos' in 0.19 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 12052.52it/s]
  2%|▏         | 116/5000 [00:00<00:04, 1151.51it/s]

Finished function: 'number_times' in 0.42 seconds.


100%|██████████| 5000/5000 [00:03<00:00, 1327.40it/s]
 21%|██▏       | 1063/5000 [00:00<00:00, 10626.29it/s]

Finished function: 'number_times' in 3.77 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 11751.88it/s]
  2%|▏         | 106/5000 [00:00<00:04, 1056.20it/s]

Finished function: 'number_times' in 0.43 seconds.


100%|██████████| 5000/5000 [00:03<00:00, 1325.50it/s]
 25%|██▍       | 1228/5000 [00:00<00:00, 12270.18it/s]

Finished function: 'number_times' in 3.78 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 12825.97it/s]
  2%|▏         | 103/5000 [00:00<00:04, 1023.45it/s]

Finished function: 'number_times' in 0.39 seconds.


100%|██████████| 5000/5000 [00:03<00:00, 1281.00it/s]
100%|██████████| 5000/5000 [00:00<00:00, 148390.05it/s]
 34%|███▎      | 1678/5000 [00:00<00:00, 16773.06it/s]

Finished function: 'number_times' in 3.91 seconds.
Finished function: 'named_numbers' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 18360.27it/s]

Finished function: 'named_numbers' in 0.27 seconds.
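
`extract_sentence_information` produces surface statistics per sentence, as the step names in the log suggest (`number_words`, `number_unique_words`, `number_characters`, `average_characters`, `number_punctuation_marks`). A minimal sketch of such counts with pandas follows. The column names and the helper `sentence_statistics` are assumptions rather than the project's actual schema. Empty sentences are handled by mapping the result of a division by zero to 0.

```python
import numpy as np
import pandas as pd


def sentence_statistics(sentences):
    """Compute simple surface statistics for a Series of sentences."""
    tokens = sentences.str.split()
    stats = pd.DataFrame(
        {
            "number_words": tokens.str.len(),
            "number_unique_words": tokens.apply(lambda t: len(set(t))),
            "number_characters": sentences.str.replace(" ", "", regex=False).str.len(),
            "number_question_marks": sentences.str.count(r"\?"),
        }
    )
    # Characters per word; 0/0 gives NaN and x/0 gives inf, both mapped to 0.
    ratio = stats["number_characters"] / stats["number_words"]
    stats["average_characters"] = ratio.replace([np.inf, -np.inf], np.nan).fillna(0)
    return stats


stats = sentence_statistics(pd.Series(["why not why", "", "really ?"]))
```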





In [9]:
parallel_sentences.create_embedding_information("proc_5k")

Finished function: 'load_embeddings' in 1.09 seconds.


  1%|          | 55/5000 [00:00<00:09, 543.92it/s]

Finished function: 'load_embeddings' in 0.73 seconds.


100%|██████████| 5000/5000 [00:07<00:00, 661.80it/s]
  0%|          | 8/5000 [00:00<01:07, 73.43it/s]

Finished function: 'word_embeddings' in 7.56 seconds.


100%|██████████| 5000/5000 [00:59<00:00, 83.93it/s] 


Finished function: 'word_embeddings' in 59.57 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 122270.80it/s]
  0%|          | 0/5000 [00:00<?, ?it/s]

Finished function: 'create_translation_dictionary' in 69.63 seconds.
Finished function: 'translate_words' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 6968.09it/s]
  3%|▎         | 162/5000 [00:00<00:02, 1613.68it/s]

Finished function: 'translate_words' in 0.72 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 1798.59it/s]


Finished function: 'tf_idf_vector' in 2.94 seconds.


100%|██████████| 5000/5000 [00:35<00:00, 141.21it/s]
 12%|█▏        | 598/5000 [00:00<00:01, 2884.91it/s]

Finished function: 'tf_idf_vector' in 36.42 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 3111.87it/s]
  5%|▌         | 274/5000 [00:00<00:03, 1380.82it/s]

Finished function: 'sentence_embedding_average' in 1.61 seconds.


100%|██████████| 5000/5000 [00:04<00:00, 1043.13it/s]
  1%|          | 32/5000 [00:00<00:15, 315.32it/s]

Finished function: 'sentence_embedding_average' in 4.8 seconds.


100%|██████████| 5000/5000 [00:16<00:00, 307.16it/s]
  0%|          | 3/5000 [00:00<03:49, 21.76it/s]

Finished function: 'sentence_embedding_tf_idf' in 16.28 seconds.


100%|██████████| 5000/5000 [02:25<00:00, 34.42it/s]

Finished function: 'sentence_embedding_tf_idf' in 145.27 seconds.
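
The steps `sentence_embedding_average` and `sentence_embedding_tf_idf` in the log indicate that each sentence is represented once by the plain mean of its word vectors and once by a TF-IDF-weighted mean. The project's own implementation is not shown; the following is a small NumPy sketch of both aggregations, with hypothetical function names and toy two-dimensional vectors.

```python
import numpy as np


def average_embedding(word_vectors):
    """Unweighted mean of the word vectors of one sentence."""
    return np.mean(word_vectors, axis=0)


def tf_idf_embedding(word_vectors, tf_idf_weights):
    """Mean of the word vectors, weighted by each word's TF-IDF score."""
    weights = np.asarray(tf_idf_weights, dtype=float)
    total = weights.sum()
    if total == 0:
        # No informative word: fall back to the unweighted mean.
        return average_embedding(word_vectors)
    return weights @ np.asarray(word_vectors, dtype=float) / total


vectors = np.array([[1.0, 0.0], [0.0, 1.0]])      # two words, two dimensions
plain = average_embedding(vectors)                # [0.5, 0.5]
weighted = tf_idf_embedding(vectors, [3.0, 1.0])  # [0.75, 0.25]
```

The weighted variant pulls the sentence vector towards rare, informative words, while the plain mean treats every word equally.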





In [10]:
parallel_sentences.create_embedding_information("proc_b_1k")

Finished function: 'load_embeddings' in 0.88 seconds.


  1%|▏         | 71/5000 [00:00<00:06, 708.31it/s]

Finished function: 'load_embeddings' in 0.69 seconds.


100%|██████████| 5000/5000 [00:06<00:00, 765.08it/s]
  0%|          | 8/5000 [00:00<01:05, 76.40it/s]

Finished function: 'word_embeddings' in 6.54 seconds.


100%|██████████| 5000/5000 [00:57<00:00, 86.80it/s] 


Finished function: 'word_embeddings' in 57.6 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 148903.15it/s]
 29%|██▉       | 1469/5000 [00:00<00:00, 14667.70it/s]

Finished function: 'create_translation_dictionary' in 56.05 seconds.
Finished function: 'translate_words' in 0.04 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 16214.81it/s]


Finished function: 'translate_words' in 0.31 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 1762.50it/s]


Finished function: 'tf_idf_vector' in 2.99 seconds.


100%|██████████| 5000/5000 [00:38<00:00, 129.83it/s]
  3%|▎         | 126/5000 [00:00<00:03, 1256.85it/s]

Finished function: 'tf_idf_vector' in 39.61 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 2123.99it/s]
  2%|▏         | 85/5000 [00:00<00:05, 843.41it/s]

Finished function: 'sentence_embedding_average' in 2.36 seconds.


100%|██████████| 5000/5000 [00:05<00:00, 923.94it/s] 
  1%|          | 37/5000 [00:00<00:13, 364.95it/s]

Finished function: 'sentence_embedding_average' in 5.41 seconds.


100%|██████████| 5000/5000 [00:16<00:00, 301.15it/s]
  0%|          | 4/5000 [00:00<02:07, 39.17it/s]

Finished function: 'sentence_embedding_tf_idf' in 16.61 seconds.


100%|██████████| 5000/5000 [02:10<00:00, 38.44it/s]

Finished function: 'sentence_embedding_tf_idf' in 130.08 seconds.





In [11]:
parallel_sentences.create_embedding_information("vecmap")

Finished function: 'load_embeddings' in 0.55 seconds.


  1%|          | 58/5000 [00:00<00:08, 579.55it/s]

Finished function: 'load_embeddings' in 0.54 seconds.


100%|██████████| 5000/5000 [00:06<00:00, 774.18it/s]
  0%|          | 8/5000 [00:00<01:07, 73.82it/s]

Finished function: 'word_embeddings' in 6.46 seconds.


100%|██████████| 5000/5000 [00:57<00:00, 86.42it/s] 


Finished function: 'word_embeddings' in 57.86 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 159923.13it/s]
 32%|███▏      | 1586/5000 [00:00<00:00, 15857.67it/s]

Finished function: 'create_translation_dictionary' in 63.75 seconds.
Finished function: 'translate_words' in 0.03 seconds.


100%|██████████| 5000/5000 [00:00<00:00, 17642.92it/s]
  3%|▎         | 161/5000 [00:00<00:03, 1604.55it/s]

Finished function: 'translate_words' in 0.28 seconds.


100%|██████████| 5000/5000 [00:02<00:00, 1856.33it/s]


Finished function: 'tf_idf_vector' in 2.84 seconds.


100%|██████████| 5000/5000 [00:35<00:00, 140.01it/s]
  5%|▌         | 260/5000 [00:00<00:01, 2597.17it/s]

Finished function: 'tf_idf_vector' in 36.73 seconds.


100%|██████████| 5000/5000 [00:01<00:00, 2746.09it/s]
  2%|▏         | 83/5000 [00:00<00:05, 821.39it/s]

Finished function: 'sentence_embedding_average' in 1.82 seconds.


100%|██████████| 5000/5000 [00:05<00:00, 923.42it/s] 
  1%|          | 39/5000 [00:00<00:12, 388.99it/s]

Finished function: 'sentence_embedding_average' in 5.42 seconds.


100%|██████████| 5000/5000 [00:14<00:00, 352.17it/s]
  0%|          | 5/5000 [00:00<01:59, 41.78it/s]

Finished function: 'sentence_embedding_tf_idf' in 14.21 seconds.


100%|██████████| 5000/5000 [02:10<00:00, 38.28it/s]

Finished function: 'sentence_embedding_tf_idf' in 130.64 seconds.
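
Each call to `create_embedding_information` logs a `create_translation_dictionary` step followed by `translate_words`. How the dictionary is built is not shown here. In an aligned cross-lingual embedding space, a common approach is a nearest-neighbour lookup by cosine similarity, which the sketch below illustrates. This is an assumption about the technique, and `build_translation_dictionary` together with the toy vocabulary is purely illustrative.

```python
import numpy as np


def build_translation_dictionary(source_words, source_vectors, target_words, target_vectors):
    """Map every source word to its cosine-nearest target word."""
    # Normalise rows so that a dot product equals the cosine similarity.
    src = source_vectors / np.linalg.norm(source_vectors, axis=1, keepdims=True)
    tgt = target_vectors / np.linalg.norm(target_vectors, axis=1, keepdims=True)
    similarity = src @ tgt.T  # shape: (n_source, n_target)
    nearest = similarity.argmax(axis=1)
    return {word: target_words[j] for word, j in zip(source_words, nearest)}


english_words = ["house", "water"]
german_words = ["wasser", "haus"]
english_vectors = np.array([[0.9, 0.1], [0.1, 0.9]])
german_vectors = np.array([[0.2, 1.0], [1.0, 0.2]])
dictionary = build_translation_dictionary(
    english_words, english_vectors, german_words, german_vectors
)
# Translate a tokenised sentence word by word, keeping unknown words unchanged.
translated = [dictionary.get(token, token) for token in ["house", "by", "water"]]
```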





In [12]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_doc.json")

In [20]:
import pandas as pd
preprocessed_data = pd.read_json("../data/interim/preprocessed_data_doc.json")
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/feature_retrieval_doc.pickle")
parallel_sentences.preprocessed = preprocessed_data

Finished function: 'import_data' in 0.03 seconds.


## III. Create data set

In this section we create the dataset for training the supervised model and the dataset used for supervised and unsupervised retrieval.

In [13]:
from src.data import DataSet

In [14]:
n_model = 0
n_queries = 100
n_retrieval = 5000
k = 10
sample_size_k = 100

In [15]:
dataset = DataSet(parallel_sentences.preprocessed)
#dataset = DataSet(preprocessed_data)

Finished function: '__init__' in 0.0 seconds.


In [16]:
dataset.split_model_retrieval(n_model, n_retrieval)

Finished function: 'split_model_retrieval' in 0.0 seconds.


In [None]:
dataset.create_model_index(n_model, k, sample_size_k,
     "sentence_embedding_tf_idf_proc_5k_source", "sentence_embedding_tf_idf_proc_5k_target")

In [None]:
dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_de.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index.feather")

In [31]:
dataset.retrieval_dataset_index

| | id_source | id_target |
|---|---|---|
| 0 | 0 | 0 |
| 1 | 0 | 1 |
| 2 | 0 | 2 |
| 3 | 0 | 3 |
| 4 | 0 | 4 |
| ... | ... | ... |
| 499995 | 99 | 4995 |
| 499996 | 99 | 4996 |
| 499997 | 99 | 4997 |
| 499998 | 99 | 4998 |


In [17]:
import pandas as pd
# dataset.create_retrieval_index(n_queries)

# Fallback for older pandas versions: build the retrieval index by hand.
query = pd.DataFrame({"id_source": dataset.retrieval_subset.iloc[:n_queries]["id_source"]})
documents = pd.DataFrame({"id_target": dataset.retrieval_subset["id_target"]})
# Pair each of the first n_queries source ids with every target id (Cartesian product).
index = pd.MultiIndex.from_product(
    [query["id_source"], documents["id_target"]],
    names=["id_source", "id_target"],
)
dataset.retrieval_dataset_index = pd.DataFrame(index=index).reset_index()
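
The retrieval index pairs each of the `n_queries` source ids with every target id, which is why 100 queries against 5,000 documents produce 500,000 pairs. On pandas 1.2 or newer the same Cartesian product can be written as a cross merge. Whether `create_retrieval_index` does exactly this is an assumption; the snippet below uses a toy id range to compare both variants.

```python
import pandas as pd

ids = pd.DataFrame({"id_source": range(4), "id_target": range(4)})
n_queries = 2

queries = ids.iloc[:n_queries][["id_source"]]
documents = ids[["id_target"]]

# Variant 1: explicit Cartesian product, also works on older pandas versions.
index = pd.MultiIndex.from_product(
    [queries["id_source"], documents["id_target"]], names=["id_source", "id_target"]
)
pairs_product = pd.DataFrame(index=index).reset_index()

# Variant 2: cross merge, available from pandas 1.2 onwards.
pairs_merge = queries.merge(documents, how="cross")
```

Both frames contain `n_queries * len(documents)` rows in the same order, with the query id varying slowest.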

In [18]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_de.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index.feather")

## IV. Create features

In this section we create the features for our model. The sentence-based features should be created before the text is preprocessed.

In [19]:
#%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [None]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
                                                             parallel_sentences.preprocessed)

In [None]:
features_model.create_feature_dataframe()

In [None]:
features_model.create_sentence_features()

In [None]:
features_model.create_embedding_features("proc_5k")

In [None]:
features_model.create_embedding_features("proc_b_1k")

In [None]:
features_model.create_embedding_features("vecmap")

In [None]:
features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_de.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model.feather")

Generation of the data for the cross-lingual information retrieval task.

In [20]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index, 
                                                            parallel_sentences.preprocessed)

In [21]:
features_retrieval.create_feature_dataframe()

Finished function: 'create_feature_dataframe' in 0.0 seconds.


In [22]:
features_retrieval.create_sentence_features()

Finished function: 'difference_numerical' in 0.03 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.01 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.




Finished function: 'normalized_difference_numerical' in 0.03 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: 'normalized_difference_numerical' in 0.02 seconds.
Finished function: 'difference_numerical' in 0.0 seconds.
Finished function: 'relative_difference_numerical' in 0.01 seconds.
Finished function: '

100%|██████████| 500000/500000 [00:12<00:00, 39619.20it/s]

Finished function: 'jaccard' in 12.67 seconds.
Finished function: 'create_sentence_features' in 15.59 seconds.
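
The log names three numerical comparisons (`difference_numerical`, `relative_difference_numerical`, `normalized_difference_numerical`) and a `jaccard` step. The exact formulas live in `src.features` and are not shown in this notebook. The sketch below illustrates an absolute difference, one plausible relative difference (assumed here to be the difference divided by the larger value), and the Jaccard similarity of two token lists. All function names are hypothetical.

```python
import numpy as np
import pandas as pd


def absolute_difference(source, target):
    """Absolute difference of one numerical feature for every sentence pair."""
    return (target - source).abs()


def relative_difference(source, target):
    """Absolute difference scaled by the larger value (assumed definition)."""
    largest = np.maximum(source, target)
    ratio = (target - source).abs() / largest
    # 0/0 (both values zero) yields NaN; treat identical values as difference 0.
    return ratio.replace([np.inf, -np.inf], np.nan).fillna(0)


def jaccard(tokens_a, tokens_b):
    """Share of distinct tokens that two token lists have in common."""
    set_a, set_b = set(tokens_a), set(tokens_b)
    union = set_a | set_b
    return len(set_a & set_b) / len(union) if union else 0.0


words_source = pd.Series([10, 0, 4])
words_target = pd.Series([8, 0, 8])
diff = absolute_difference(words_source, words_target)  # 2, 0, 4
rel = relative_difference(words_source, words_target)   # 0.2, 0.0, 0.5
overlap = jaccard(["haus", "am", "see"], ["haus", "see", "blau"])  # 0.5
```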





In [23]:
features_retrieval.create_embedding_features("proc_5k")

100%|██████████| 500000/500000 [03:33<00:00, 2337.55it/s]
  0%|          | 216/500000 [00:00<03:51, 2155.62it/s]

Finished function: 'cosine_similarity_vector' in 213.96 seconds.


100%|██████████| 500000/500000 [03:39<00:00, 2277.45it/s]
  0%|          | 294/500000 [00:00<02:49, 2939.94it/s]

Finished function: 'cosine_similarity_vector' in 219.61 seconds.


100%|██████████| 500000/500000 [02:31<00:00, 3310.91it/s]
  0%|          | 207/500000 [00:00<04:01, 2069.63it/s]

Finished function: 'euclidean_distance_vector' in 151.07 seconds.


100%|██████████| 500000/500000 [02:32<00:00, 3275.63it/s]
  0%|          | 1342/500000 [00:00<00:37, 13419.44it/s]

Finished function: 'euclidean_distance_vector' in 152.74 seconds.


100%|██████████| 500000/500000 [00:25<00:00, 19598.36it/s]
  0%|          | 956/500000 [00:00<00:52, 9553.63it/s]

Finished function: 'jaccard' in 25.6 seconds.


100%|██████████| 500000/500000 [00:27<00:00, 18348.12it/s]

Finished function: 'jaccard' in 27.34 seconds.
Finished function: 'create_embedding_features' in 790.35 seconds.





In [24]:
features_retrieval.create_embedding_features("proc_b_1k")

100%|██████████| 500000/500000 [04:03<00:00, 2054.21it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'cosine_similarity_vector' in 243.53 seconds.


100%|██████████| 500000/500000 [03:49<00:00, 2175.58it/s]
  0%|          | 655/500000 [00:00<02:41, 3087.21it/s]

Finished function: 'cosine_similarity_vector' in 229.93 seconds.


100%|██████████| 500000/500000 [02:22<00:00, 3506.64it/s]
  0%|          | 660/500000 [00:00<02:37, 3179.49it/s]

Finished function: 'euclidean_distance_vector' in 142.64 seconds.


100%|██████████| 500000/500000 [02:12<00:00, 3776.70it/s]
  0%|          | 1915/500000 [00:00<00:26, 19147.74it/s]

Finished function: 'euclidean_distance_vector' in 132.44 seconds.


100%|██████████| 500000/500000 [00:19<00:00, 25021.50it/s]
  0%|          | 1641/500000 [00:00<00:30, 16404.11it/s]

Finished function: 'jaccard' in 20.05 seconds.


100%|██████████| 500000/500000 [00:23<00:00, 21299.02it/s]

Finished function: 'jaccard' in 23.54 seconds.
Finished function: 'create_embedding_features' in 792.16 seconds.





In [25]:
features_retrieval.create_embedding_features("vecmap")

100%|██████████| 500000/500000 [03:22<00:00, 2465.57it/s]
  0%|          | 168/500000 [00:00<05:00, 1663.31it/s]

Finished function: 'cosine_similarity_vector' in 202.87 seconds.


100%|██████████| 500000/500000 [04:43<00:00, 1763.10it/s]
  0%|          | 164/500000 [00:00<05:05, 1637.50it/s]

Finished function: 'cosine_similarity_vector' in 283.68 seconds.


100%|██████████| 500000/500000 [03:34<00:00, 2326.49it/s]
  0%|          | 172/500000 [00:00<05:04, 1642.26it/s]

Finished function: 'euclidean_distance_vector' in 215.0 seconds.


100%|██████████| 500000/500000 [03:41<00:00, 2261.81it/s]
  0%|          | 0/500000 [00:00<?, ?it/s]

Finished function: 'euclidean_distance_vector' in 221.13 seconds.


100%|██████████| 500000/500000 [00:30<00:00, 16397.71it/s]
  0%|          | 891/500000 [00:00<00:56, 8907.14it/s]

Finished function: 'jaccard' in 30.6 seconds.


100%|██████████| 500000/500000 [00:38<00:00, 12986.64it/s]

Finished function: 'jaccard' in 38.6 seconds.
Finished function: 'create_embedding_features' in 991.95 seconds.
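
`create_embedding_features` logs `cosine_similarity_vector` and `euclidean_distance_vector` for each pair of sentence embeddings. With 500,000 pairs these steps take several minutes each in the run above. As an illustration, and not as a description of the project's code, both measures can be computed for all pairs at once on stacked NumPy arrays, which may reduce the runtime compared with a row-by-row loop.

```python
import numpy as np


def rowwise_cosine_similarity(a, b):
    """Cosine similarity between corresponding rows of a and b."""
    dot = np.einsum("ij,ij->i", a, b)
    norms = np.linalg.norm(a, axis=1) * np.linalg.norm(b, axis=1)
    # Rows with a zero vector get similarity 0 instead of a division error.
    return np.divide(dot, norms, out=np.zeros_like(dot), where=norms > 0)


def rowwise_euclidean_distance(a, b):
    """Euclidean distance between corresponding rows of a and b."""
    return np.linalg.norm(a - b, axis=1)


source_embeddings = np.array([[1.0, 0.0], [3.0, 4.0], [0.0, 0.0]])
target_embeddings = np.array([[0.0, 1.0], [3.0, 4.0], [1.0, 1.0]])
cosine = rowwise_cosine_similarity(source_embeddings, target_embeddings)     # [0., 1., 0.]
distance = rowwise_euclidean_distance(source_embeddings, target_embeddings)  # [1.414..., 0., 1.414...]
```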





In [26]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_doc.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval.feather")