# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and Italian Europarl datasets and combine them into a dataframe of parallel (translated) sentences.

In [1]:
%load_ext autoreload
%autoreload 2

In [2]:
from src.data import create_data_subset

In [3]:
create_data_subset(sentence_data_source_path='../data/external/europarl-v7.it-en.en',
                   sentence_data_target_path='../data/external/europarl-v7.it-en.it',
                   sample_size=500,
                   sentence_data_sampled_path="../data/interim/europarl_en_it.pkl",)

Finished function: 'load_doc' in 1.34 seconds.
Finished function: 'to_sentences' in 0.71 seconds.
Finished function: 'load_doc' in 1.89 seconds.
Finished function: 'to_sentences' in 0.97 seconds.
Sampled dataframe saved in: ../data/interim/europarl_en_it.pkl
Finished function: 'create_data_subset' in 6.26 seconds.
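
The internals of `create_data_subset` are not shown here. As a rough sketch, assuming the two Europarl files are line-aligned (line i of the English file is the translation of line i of the Italian file), drawing a reproducible sample of sentence pairs could look like the following. The function name `sample_parallel_sentences` and the `seed` parameter are hypothetical and not part of `src.data`.

```python
import random


def sample_parallel_sentences(source_sentences, target_sentences, sample_size, seed=42):
    """Draw `sample_size` aligned sentence pairs without replacement."""
    if len(source_sentences) != len(target_sentences):
        raise ValueError("source and target must contain the same number of sentences")
    # Drop pairs where either side is empty, since they carry no translation signal.
    pairs = [
        (src.strip(), tgt.strip())
        for src, tgt in zip(source_sentences, target_sentences)
        if src.strip() and tgt.strip()
    ]
    rng = random.Random(seed)  # a fixed seed keeps the subset reproducible
    chosen = rng.sample(range(len(pairs)), k=min(sample_size, len(pairs)))
    # Sort the chosen positions so the sample keeps the corpus order.
    return [pairs[i] for i in sorted(chosen)]


english = ["Resumption of the session", "", "Agenda", "Thank you, Mr President."]
italian = ["Ripresa della sessione", "", "Ordine del giorno", "Grazie, signor Presidente."]
sample = sample_parallel_sentences(english, italian, sample_size=2)
```

Filtering empty lines before sampling means the requested sample size refers to usable pairs only.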


## II. Preprocess data

In this section we preprocess the parallel sentence data for the feature generation.

In [3]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
#import en_core_web_sm
# import de_core_news_sm
#import it_core_news_sm
# import pl_core_news_sm
import time

In [None]:
from src.data import PreprocessingEuroParl

In [7]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('italian')  # Italian stopwords, matching the en-it data
# stopwords_target = stopwords.words('german')  # German stopwords
# stopwords_target = stopwords.words('polish')  # Polish stopwords
# The spaCy pipelines are passed to preprocess_sentences, so they must be loaded here.
nlp_source = spacy.load('en_core_web_sm')
nlp_target = spacy.load('it_core_news_sm')  # Italian pipeline
# nlp_target = spacy.load('de_core_news_sm')  # German pipeline
# nlp_target = spacy.load('pl_core_news_sm')  # Polish pipeline

In [7]:
parallel_sentences = PreprocessingEuroParl(df_sampled_path="../data/interim/europarl_en_it.pkl")

Finished function: 'import_data' in 0.0 seconds.


In [9]:
parallel_sentences.preprocess_sentences(stopwords_source, stopwords_target, nlp_source, nlp_target)

Finished function: 'lemmatize' in 3.12 seconds.
Finished function: 'tokenize_sentence' in 0.07 seconds.
Finished function: 'strip_whitespace' in 0.0 seconds.
Finished function: 'lowercase' in 0.0 seconds.
Finished function: 'remove_punctuation' in 0.0 seconds.
Finished function: 'remove_stopwords' in 0.02 seconds.
Finished function: 'remove_numbers' in 0.0 seconds.
Finished function: 'create_cleaned_token_embedding' in 3.22 seconds.


Finished function: 'lemmatize' in 2.63 seconds.
Finished function: 'tokenize_sentence' in 0.07 seconds.
Finished function: 'strip_whitespace' in 0.0 seconds.
Finished function: 'lowercase' in 0.0 seconds.
Finished function: 'remove_punctuation' in 0.0 seconds.
Finished function: 'remove_stopwords' in 0.03 seconds.
Finished function: 'remove_numbers' in 0.01 seconds.
Finished function: 'create_cleaned_token_embedding' in 2.74 seconds.
Finished function: 'tokenize_sentence' in 0.07 seconds.
Finished function: 'strip_whitespace' in 0.0 seconds.
Finished function: 'lowercase' in 0.0 seconds.
Finished function: 'create_cleaned_text' in 0.07 seconds.


Finished function: 'tokenize_sentence' in 0.07 seconds.
Finished function: 'strip_whitespace' in 0.0 seconds.
Finished function: 'lowercase' in 0.0 seconds.
Finished function: 'create_cleaned_text' in 0.08 seconds.
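
The log above names the cleaning steps that `preprocess_sentences` runs. Their implementation lives in `src.data` and is not shown; the sketch below only illustrates the same sequence (tokenize, strip whitespace, lowercase, remove punctuation, remove stopwords, remove numbers) with the standard library. Lemmatization is left out because it needs a spaCy pipeline. The function name `clean_tokens` and the tiny stopword set are made up for the example.

```python
import string


def clean_tokens(sentence, stopwords):
    """Apply the cleaning steps from the log, in order, to one sentence."""
    tokens = sentence.split()                       # naive whitespace tokenization
    tokens = [token.strip() for token in tokens]    # strip surrounding whitespace
    tokens = [token.lower() for token in tokens]    # lowercase
    # Remove punctuation characters, then drop tokens that became empty.
    table = str.maketrans("", "", string.punctuation)
    tokens = [token.translate(table) for token in tokens]
    tokens = [token for token in tokens if token]
    tokens = [token for token in tokens if token not in stopwords]  # remove stopwords
    tokens = [token for token in tokens if not token.isdigit()]     # remove numbers
    return tokens


example_stopwords = {"the", "of", "on", "was"}
cleaned = clean_tokens("The session of 17 December 1999 was adjourned.", example_stopwords)
```

Lowercasing before the stopword lookup matters here, because the stopword lists are lowercase and "The" would otherwise survive.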





In [None]:
parallel_sentences.extract_sentence_information(nlp_source, nlp_target, textblob_source, textblob_target)

Finished function: 'number_punctuations_total' in 0.01 seconds.
Finished function: 'number_punctuations_total' in 0.01 seconds.
Finished function: 'number_words' in 0.0 seconds.
Finished function: 'number_words' in 0.0 seconds.
Finished function: 'number_unique_words' in 0.01 seconds.
Finished function: 'number_unique_words' in 0.01 seconds.
Finished function: 'number_characters' in 0.01 seconds.
Finished function: 'number_characters' in 0.01 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'average_characters' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds


Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.
Finished function: 'number_punctuation_marks' in 0.0 seconds.


Finished function: 'number_pos' in 2.69 seconds.
Finished function: 'number_pos' in 2.45 seconds.
Finished function: 'number_pos' in 2.6 seconds.
Finished function: 'number_pos' in 2.42 seconds.
Finished function: 'number_pos' in 2.58 seconds.
Finished function: 'number_pos' in 2.42 seconds.
Finished function: 'number_times' in 2.63 seconds.
Finished function: 'number_times' in 2.52 seconds.
Finished function: 'number_times' in 2.65 seconds.
Finished function: 'number_times' in 2.47 seconds.
Finished function: 'number_times' in 2.87 seconds.
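
The function names in the log suggest simple per-sentence counts. A minimal sketch of such features, assuming whitespace tokenization, is given below; `sentence_statistics` is a hypothetical helper, not the method used by `extract_sentence_information`. The average word length guards against empty sentences explicitly instead of dividing first and replacing `inf` or `nan` afterwards.

```python
import string


def sentence_statistics(sentence):
    """Compute basic count features for a single sentence."""
    words = sentence.split()
    number_characters = sum(len(word) for word in words)
    return {
        "number_words": len(words),
        "number_unique_words": len({word.lower() for word in words}),
        "number_characters": number_characters,
        # Guard against empty sentences instead of dividing by zero.
        "average_characters": number_characters / len(words) if words else 0.0,
        "number_punctuations_total": sum(ch in string.punctuation for ch in sentence),
        "number_question_marks": sentence.count("?"),
    }


stats = sentence_statistics("Is that clear? Yes, it is clear.")
```

Computing the same statistics for the source and the target sentence lets later steps compare the two sides, for example by their difference.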

In [None]:
parallel_sentences.create_embedding_information("proc_5k")

In [None]:
parallel_sentences.create_embedding_information("proc_b_1k")

In [None]:
parallel_sentences.create_embedding_information("vecmap")

In [None]:
parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data_en_it.json")

In [None]:
# import pandas as pd
# preprocessed_data = pd.read_json("../data/interim/preprocessed_data.json")

## III. Create data set

In this section we create the dataset for training the supervised model and the dataset for the supervised and unsupervised retrieval task.

In [None]:
%autoreload 2
from src.data import DataSet

In [None]:
n_model = 0
n_queries = 100
n_retrieval = 500
k = 0
sample_size_k = 100

In [None]:
dataset = DataSet(parallel_sentences.preprocessed)

In [None]:
dataset.split_model_retrieval(n_model, n_retrieval)

In [None]:
# dataset.create_model_index(n_model, k, sample_size_k,
#     "sentence_embedding_tf_idf_proc_5k_source", "sentence_embedding_tf_idf_proc_5k_target")

In [None]:
# dataset.model_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_model_index_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_model_index.feather")

In [None]:
dataset.create_retrieval_index(n_queries)

In [None]:
dataset.retrieval_dataset_index.reset_index(drop=True).to_feather("../data/processed/dataset_retrieval_index_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/dataset_retrieval_index.feather")
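
How `DataSet` builds its retrieval index is not visible in this notebook. One plausible reading, stated here as an assumption, is that each of the `n_queries` source sentences is paired with every one of the `n_retrieval` target sentences, and a pair is labelled 1 only when the target is the true translation. The helper `build_retrieval_index` and its column names are hypothetical and only sketch that idea.

```python
from itertools import product


def build_retrieval_index(n_queries, n_retrieval):
    """Pair every query id with every candidate id and label the true translation."""
    if n_queries > n_retrieval:
        raise ValueError("n_queries cannot exceed n_retrieval")
    rows = []
    for query_id, candidate_id in product(range(n_queries), range(n_retrieval)):
        rows.append({
            "id_source": query_id,
            "id_target": candidate_id,
            # Parallel sentences share the same position in the corpus.
            "translation": int(query_id == candidate_id),
        })
    return rows


index = build_retrieval_index(n_queries=3, n_retrieval=5)
```

Under this assumption, the notebook's settings of 100 queries and 500 candidates would give 50,000 rows, of which 100 are positive.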

## IV. Create features

In this section we create the features for our model. Some of them are sentence-based and have to be computed on the original text, before it is preprocessed.

In [None]:
%autoreload 2
from src.features import feature_generation_class

In [None]:
# import pickle
# with open(r"../data/processed/correlated_features.pkl", "rb") as file:
#    chosen_features = pickle.load(file)

Generation of the training data for the supervised classification model.

In [None]:
# features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset_index, 
#                                                             parallel_sentences.preprocessed)

In [None]:
# features_model.create_feature_dataframe()

In [None]:
# features_model.create_sentence_features()

In [None]:
# features_model.create_embedding_features("proc_5k")

In [None]:
# features_model.create_embedding_features("proc_b_1k")

In [None]:
# features_model.create_embedding_features("vecmap")

In [None]:
# features_model.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_model_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_model.feather")

Generation of the data for the cross-lingual information retrieval task.

In [None]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset_index,
                                                                parallel_sentences.preprocessed)

In [None]:
features_retrieval.create_feature_dataframe()

In [None]:
features_retrieval.create_sentence_features()

In [None]:
features_retrieval.create_embedding_features("proc_5k")

In [None]:
features_retrieval.create_embedding_features("proc_b_1k")

In [None]:
features_retrieval.create_embedding_features("vecmap")

In [None]:
features_retrieval.feature_dataframe.reset_index(drop=True).to_feather("../data/processed/feature_retrieval_en_it.feather")

In [None]:
# import pandas as pd
# pd.read_feather("../data/processed/feature_retrieval.feather")
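
The embedding features are computed inside `FeatureGeneration`, whose code is not shown. A common feature for a source/target pair whose sentence embeddings live in a shared cross-lingual space is the cosine similarity of the two vectors. The sketch below assumes that setup and uses a hypothetical `cosine_similarity` helper written with the standard library only; the example vectors are made up.

```python
import math


def cosine_similarity(vector_a, vector_b):
    """Cosine of the angle between two equally long vectors (0.0 if either is all zeros)."""
    if len(vector_a) != len(vector_b):
        raise ValueError("vectors must have the same length")
    dot = sum(a * b for a, b in zip(vector_a, vector_b))
    norm_a = math.sqrt(sum(a * a for a in vector_a))
    norm_b = math.sqrt(sum(b * b for b in vector_b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # a zero vector has no direction; treat the pair as dissimilar
    return dot / (norm_a * norm_b)


source_embedding = [1.0, 0.0, 1.0]
target_embedding = [1.0, 0.0, 1.0]
unrelated_embedding = [0.0, 1.0, 0.0]
same = cosine_similarity(source_embedding, target_embedding)
different = cosine_similarity(source_embedding, unrelated_embedding)
```

For unsupervised retrieval, ranking all candidates of a query by such a similarity score is enough to produce a result list without any trained model.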