# Preprocessing and Feature Creation

In this notebook we import and preprocess the data, then create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and German Europarl datasets and combine them into a dataframe of parallel (translated) sentence pairs.

In [None]:
import importlib

In [None]:
from src.data import import_data
importlib.reload(import_data)

In [None]:
import_data.create_data_subset(sentence_data_source='../data/external/europarl-v7.de-en.en',
                      sentences_data_target='../data/external/europarl-v7.de-en.de',
                      sample_size=25000)
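
The source of `create_data_subset` is not shown in this notebook. As a rough illustration of the idea only, the sketch below pairs two line-aligned files and draws a random sample. The function name `sample_parallel_pairs`, the field names `text_source` and `text_target`, and the `seed` parameter are assumptions made for this example, not the project's actual API.

```python
import random


def sample_parallel_pairs(lines_source, lines_target, sample_size, seed=0):
    """Pair two line-aligned corpora and return a random sample of pairs.

    Hypothetical helper for illustration; `lines_source` and `lines_target`
    can be any iterables of strings, including open file handles.
    """
    pairs = []
    # zip() keeps line i of the source aligned with line i of the target.
    for src, trg in zip(lines_source, lines_target):
        src, trg = src.strip(), trg.strip()
        # Skip pairs where either side is empty.
        if src and trg:
            pairs.append({"text_source": src, "text_target": trg})
    rng = random.Random(seed)
    # Never ask for more pairs than are available.
    return rng.sample(pairs, min(sample_size, len(pairs)))


if __name__ == "__main__":
    english = ["Resumption of the session\n", "\n", "Thank you.\n"]
    german = ["Wiederaufnahme der Sitzungsperiode\n", "Leer\n", "Vielen Dank.\n"]
    for pair in sample_parallel_pairs(english, german, sample_size=2):
        print(pair)
```

The resulting list of dictionaries can be turned into a dataframe with `pd.DataFrame(pairs)` if pandas is available.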

## II. Preprocess data

In this section we preprocess the parallel sentence data in preparation for feature generation.

In [None]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm
import time

In [None]:
from src.data import preprocessing_class
importlib.reload(preprocessing_class)

In [None]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german')
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load()
embedding_array_source_path = "../data/interim/en_de_proc_5k_src_emb.pkl"
embedding_dictionary_source_path =  "../data/interim/en_de_proc_5k_src_word.pkl"
embedding_array_target_path = "../data/interim/en_de_proc_5k_trg_emb.pkl"
embedding_dictionary_target_path =  "../data/interim/en_de_proc_5k_trg_word.pkl"
number_pc = 10

In [None]:
parallel_sentences = preprocessing_class.PreprocessingEuroParl(df_sampled_path=
                                                               "../data/interim/europarl_english_german.pkl")

In [None]:
parallel_sentences.preprocess_sentences(stopwords_source, nlp_source, textblob_source,
                                        embedding_array_source_path, embedding_dictionary_source_path,
                                        stopwords_target, nlp_target, textblob_target,
                                        embedding_array_target_path, embedding_dictionary_target_path,
                                        number_pc)
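
What `preprocess_sentences` does internally is not shown here. The arguments suggest tokenization, stopword removal, and a lookup of word embeddings, so the following is a heavily simplified sketch of those steps using only the standard library. The helpers `preprocess` and `average_embedding`, the regex tokenizer, and the toy embedding dictionary are assumptions for illustration; the actual class relies on spaCy, NLTK, and TextBlob, and the role of `number_pc` is not modeled.

```python
import re


def preprocess(sentence, stopwords):
    """Lowercase, tokenize on word characters, and drop stopwords.

    Hypothetical stand-in for the spaCy/NLTK-based pipeline.
    """
    # \w+ is Unicode-aware in Python 3, so German umlauts stay inside tokens.
    tokens = re.findall(r"\w+", sentence.lower())
    return [token for token in tokens if token not in stopwords]


def average_embedding(tokens, embeddings):
    """Average the vectors of all tokens found in `embeddings`.

    Returns None when no token has a vector.
    """
    vectors = [embeddings[token] for token in tokens if token in embeddings]
    if not vectors:
        return None
    dimension = len(vectors[0])
    return [sum(vec[i] for vec in vectors) / len(vectors) for i in range(dimension)]


if __name__ == "__main__":
    toy_stopwords = {"the", "is"}
    toy_embeddings = {"house": [1.0, 0.0], "big": [0.0, 1.0]}
    tokens = preprocess("The house is big.", toy_stopwords)
    print(tokens)
    print(average_embedding(tokens, toy_embeddings))
```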

In [None]:
# parallel_sentences.preprocessed.to_json("../data/interim/preprocessed_data.json")

## III. Create data set

In this section we create two datasets: one for training the supervised model, and one for evaluating supervised and unsupervised retrieval.

In [None]:
# import pandas as pd
# preprocessed_data = pd.read_json("../data/interim/preprocessed_data.json")

In [None]:
from src.data import dataset_class
importlib.reload(dataset_class)

In [None]:
n_model = 20000
n_queries = 100
n_retrieval = 5000
k = 5
sample_size_k = 100

In [None]:
dataset = dataset_class.DataSet(parallel_sentences.preprocessed)

In [None]:
dataset.split_model_retrieval(n_model, n_retrieval)

In [None]:
dataset.generate_model_dataset(n_model, k, sample_size_k)

In [None]:
dataset.generate_retrieval_dataset(n_queries)
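
The `DataSet` class is not shown in this notebook. One plausible reading of the calls above is: split the sentence pairs into disjoint model and retrieval partitions, add `k` wrong translations per source sentence as negative training examples, and pair each query with every retrieval candidate. The sketch below follows that reading. All function names are hypothetical, the negatives are drawn uniformly at random, and `sample_size_k` is left out because its meaning cannot be inferred from the notebook.

```python
import random


def split_model_retrieval(pairs, n_model, n_retrieval):
    """Split (source, target) pairs into two disjoint partitions."""
    if n_model + n_retrieval > len(pairs):
        raise ValueError("not enough sentence pairs for the requested split")
    return pairs[:n_model], pairs[n_model:n_model + n_retrieval]


def with_random_negatives(pairs, k, seed=0):
    """Return (source, target, label) rows: one positive and up to k negatives per source."""
    rng = random.Random(seed)
    n = len(pairs)
    rows = []
    for i, (src, trg) in enumerate(pairs):
        rows.append((src, trg, 1))
        # Sample from n - 1 slots and shift past i so the true translation is never picked.
        for j in rng.sample(range(n - 1), min(k, n - 1)):
            wrong = j + 1 if j >= i else j
            rows.append((src, pairs[wrong][1], 0))
    return rows


def retrieval_candidates(pairs, n_queries):
    """Pair each of the first n_queries sources with every target in the partition."""
    rows = []
    for q, (src, _) in enumerate(pairs[:n_queries]):
        for c, (_, trg) in enumerate(pairs):
            rows.append((src, trg, int(q == c)))
    return rows


if __name__ == "__main__":
    toy_pairs = [(f"source {i}", f"target {i}") for i in range(10)]
    model_pairs, retrieval_pairs = split_model_retrieval(toy_pairs, n_model=6, n_retrieval=4)
    print(len(with_random_negatives(model_pairs, k=2)))
    print(len(retrieval_candidates(retrieval_pairs, n_queries=2)))
```

Labels in the retrieval set are assigned by position rather than by string equality, so duplicate target sentences do not produce more than one positive per query.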

## IV. Create features

In this section we create the features for our models. Some of them are sentence-based and should be computed on the raw sentences, before the text is preprocessed.

In [None]:
from src.features import feature_generation_class
importlib.reload(feature_generation_class)

In [None]:
number_pc = 10

Generation of the training data for the supervised classification model.

In [None]:
features_model = feature_generation_class.FeatureGeneration(dataset.model_dataset, number_pc)

In [None]:
features_model.feature_generation()
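
The features computed by `FeatureGeneration` are not listed in this notebook. To illustrate why sentence-based features depend on the raw text, the sketch below computes a few simple ones that would be lost after lowercasing, punctuation stripping, or stopword removal. The function `sentence_features` and all feature names are hypothetical examples, not the project's actual feature set.

```python
import re


def sentence_features(text_source, text_target):
    """Compute simple surface features for a raw (unpreprocessed) sentence pair."""
    n_source = len(text_source.split())
    n_target = len(text_target.split())
    longest = max(n_source, n_target)
    return {
        # Translations tend to have similar lengths.
        "word_count_difference": abs(n_source - n_target),
        "word_count_ratio": min(n_source, n_target) / longest if longest else 0.0,
        # Punctuation is usually preserved in translation.
        "question_mark_difference": abs(text_source.count("?") - text_target.count("?")),
        # Numbers often appear unchanged on both sides.
        "numbers_match": sorted(re.findall(r"\d+", text_source))
        == sorted(re.findall(r"\d+", text_target)),
    }


if __name__ == "__main__":
    print(sentence_features("Is the vote at 12 today?", "Findet die Abstimmung heute um 12 statt?"))
```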

In [None]:
# features_model.feature_dataframe.reset_index(drop=True).to_json("../data/processed/feature_dataframe.json")

Generation of the data for the cross-lingual information retrieval task.

In [None]:
features_retrieval = feature_generation_class.FeatureGeneration(dataset.retrieval_dataset, number_pc)

In [None]:
features_retrieval.feature_generation()

In [None]:
# features_retrieval.feature_dataframe.reset_index(drop=True).to_json("../data/processed/feature_retrieval.json")