# Preprocessing and Feature Creation

In this notebook we import the data, preprocess it, and create features for supervised and unsupervised cross-lingual information retrieval models.

## I. Import Data

In this section we import the English and German Europarl datasets and combine them into a dataframe of parallel sentences, where each row holds an English sentence and its German translation.

In [1]:
from src.data.preprocessing_class import PreprocessingEuroParl

In [2]:
parallel_sentences = PreprocessingEuroParl(
    sentence_data_source='../data/external/europarl-v7.de-en.en',
    sentence_data_target='../data/external/europarl-v7.de-en.de')

In [3]:
parallel_sentences.dataframe

index,text_source,text_target
1374447,"Those two languages will have no legal value, ...",Diese zwei Sprachen werden nicht rechtsverbind...
973261,Citizens are calling for initiatives and sanct...,Die Bürgerinnen und Bürger fordern neben der V...
109364,"Firstly, I would ask the Commission to give an...",Zuerst möchte ich die Kommission heute um die ...
544881,The European Union wishes to impose the Consti...,Die Europäische Union möchte den Verfassungsve...
1860290,"Regrettably, it has to be said that we are sti...","Leider müssen wir heute feststellen, daß wir d..."
...,...,...
604227,"Thank you, Mr President.",Vielen Dank Herr Präsident.
908401,The funding of the EIT is the big agreement st...,"Für die Finanzierung des ETI gilt es noch, das..."
571106,"Today the situation is reversed, Europe has th...",Heute ist die Lage genau umgekehrt: Europa ste...
806775,It is important for the Union to have comprehe...,"Es ist wichtig, dass die Union über umfassende..."
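
The internals of `PreprocessingEuroParl` are not shown in this notebook. As a minimal sketch of the general idea, assuming the two Europarl files are line-aligned (line i of the English file translates line i of the German file), a parallel dataframe could be built as follows. The helper `build_parallel_dataframe` and the in-memory toy files are hypothetical and only illustrate the pairing step.

```python
import io

import pandas as pd


def build_parallel_dataframe(source_lines, target_lines):
    """Pair line i of the source with line i of the target (hypothetical helper)."""
    source = [line.rstrip("\n") for line in source_lines]
    target = [line.rstrip("\n") for line in target_lines]
    if len(source) != len(target):
        raise ValueError("source and target must have the same number of lines")

    frame = pd.DataFrame({"text_source": source, "text_target": target})

    # Drop pairs in which either side is empty after stripping whitespace.
    non_empty = (frame["text_source"].str.strip() != "") & (
        frame["text_target"].str.strip() != ""
    )
    return frame[non_empty]


# In-memory stand-ins for the two aligned files; real code would open the
# files with open(path, encoding="utf-8") instead.
english = io.StringIO("Thank you, Mr President.\n\nThe debate is closed.\n")
german = io.StringIO("Vielen Dank Herr Präsident.\n\nDie Aussprache ist geschlossen.\n")

toy_parallel = build_parallel_dataframe(english, german)
print(toy_parallel)
```

Keeping the original line number as the index, as this sketch does, makes it possible to trace any row back to its position in the raw files.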


## II. Preprocess data

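In this section we preprocess the parallel sentences. For each language, preprocessing adds lemmatized tokens, count features (stopwords, punctuation, words, unique words, characters), polarity and subjectivity scores, named entities, and a sentence embedding, as the resulting dataframe shows.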

In [4]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm

In [5]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german')
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load()
embedding_matrix_source = "../data/interim/proc_b_src_emb.p"
embedding_dictionary_source = "../data/interim/proc_b_src_word.p"
embedding_matrix_target = "../data/interim/proc_b_trg_emb.p"
embedding_dictionary_target = "../data/interim/proc_b_trg_word.p"

In [6]:
parallel_sentences.preprocess_sentences_source(stopwords_source, nlp_source, textblob_source,
                                               embedding_matrix_source, embedding_dictionary_source)

In [7]:
parallel_sentences.preprocess_sentences_target(stopwords_target, nlp_target, textblob_target,
                                               embedding_matrix_target, embedding_dictionary_target)

In [8]:
parallel_sentences.combine_source_target()

In [9]:
parallel_sentences.add_label()

In [10]:
parallel_sentences.dataframe

index,text_source,text_target,token_preprocessed_source,text_source_1,number_stopwords_source,number_punctuations_total_source,number_words_source,number_unique_words_source,number_characters_source,number_!_source,...,number_|_target,number_}_target,number_~_target,number_Pres_target,number_Past_target,number__target,score_polarity_target,score_subjectivity_target,list_named_entities_target,sentence_embedding_target
1374447,"Those two languages will have no legal value, ...",Diese zwei Sprachen werden nicht rechtsverbind...,"[two, language, legal, value, great, value, te...","[two, languages, legal, value, ,, great, value...",16,1,12,11,84,0,...,0,0,0,0,0,0,1.00,0.0,[],"[[-0.10605211555957794, -0.02644946426153183, ..."
973261,Citizens are calling for initiatives and sanct...,Die Bürgerinnen und Bürger fordern neben der V...,"[citizen, call, initiative, sanction, custom, ...","[citizens, calling, initiatives, sanctions, ,,...",12,2,13,13,100,0,...,0,0,0,0,0,0,0.00,0.0,[(CE-Kennzeichnung)],"[[-0.013108158484101295, 0.011367682367563248,..."
109364,"Firstly, I would ask the Commission to give an...",Zuerst möchte ich die Kommission heute um die ...,"[firstly, would, ask, commission, give, assura...","[firstly, ,, would, ask, commission, give, ass...",18,3,27,25,189,0,...,0,0,0,0,0,0,0.90,0.0,"[(Kommission), (GVO)]","[[-0.021073421463370323, 0.08619111031293869, ..."
544881,The European Union wishes to impose the Consti...,Die Europäische Union möchte den Verfassungsve...,"[european, union, wish, impose, constitutional...","[european, union, wishes, impose, constitution...",10,1,17,16,126,0,...,0,0,0,0,0,0,0.85,0.0,"[(Europäische, Union), (Vertrags)]","[[-0.05865674838423729, 0.0236191563308239, -0..."
1860290,"Regrettably, it has to be said that we are sti...","Leider müssen wir heute feststellen, daß wir d...","[regrettably, say, still, long, way, reality]","[regrettably, ,, said, still, long, way, reali...",10,1,6,6,34,0,...,0,0,0,0,0,0,0.00,0.0,[],"[[0.013209634460508823, 0.10260877758264542, -..."
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
604227,"Thank you, Mr President.",Vielen Dank Herr Präsident.,"[thank, mr, president]","[thank, ,, mr, president, .]",1,1,3,3,16,0,...,0,0,0,0,0,0,0.00,1.0,[],"[[-0.09365130960941315, 0.07141955196857452, -..."
908401,The funding of the EIT is the big agreement st...,"Für die Finanzierung des ETI gilt es noch, das...","[funding, eit, big, agreement, still, reach]","[funding, eit, big, agreement, still, reached, .]",7,0,6,6,34,0,...,0,0,0,0,0,0,0.70,0.0,[(ETI)],"[[-0.07112306356430054, -0.011566597037017345,..."
571106,"Today the situation is reversed, Europe has th...",Heute ist die Lage genau umgekehrt: Europa ste...,"[today, situation, reverse, europe, problem, d...","[today, situation, reversed, ,, europe, proble...",7,1,7,7,53,0,...,0,0,0,0,0,0,1.00,0.0,[(Europa)],"[[-0.004369521513581276, -0.005538631230592728..."
806775,It is important for the Union to have comprehe...,"Es ist wichtig, dass die Union über umfassende...","[important, union, comprehensive, information,...","[important, union, comprehensive, information,...",12,0,12,12,97,0,...,0,0,0,0,0,0,0.70,0.0,[(Union)],"[[0.0046020736917853355, -0.011742735281586647..."
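
To make the count columns above more concrete, the following is a rough sketch of how such surface features could be computed for one sentence. It is not the implementation behind `preprocess_sentences_source`: it uses naive whitespace tokenization and a tiny hand-written stopword set instead of spaCy and NLTK, and the function name `surface_features` is an assumption. Counts from the real pipeline can therefore differ.

```python
import string


def surface_features(sentence, stopword_set):
    """Compute simple count features for one sentence (illustrative only)."""
    # Naive tokenization: lowercase, split on whitespace, and strip
    # punctuation from the edges of each token.
    raw_tokens = sentence.lower().split()
    tokens = [token.strip(string.punctuation) for token in raw_tokens]
    tokens = [token for token in tokens if token]

    # Content words are the tokens that are not stopwords.
    content = [token for token in tokens if token not in stopword_set]

    return {
        "number_stopwords": len(tokens) - len(content),
        "number_punctuations_total": sum(ch in string.punctuation for ch in sentence),
        "number_words": len(content),
        "number_unique_words": len(set(content)),
        "number_characters": sum(len(token) for token in content),
    }


# Tiny stand-in for stopwords.words('english'), to keep the sketch self-contained.
toy_stopwords = {"you", "the", "is"}
toy_features = surface_features("Thank you, Mr President.", toy_stopwords)
print(toy_features)
```

In this sketch the word and character counts are taken after stopword removal, which is one possible convention; whether the project counts before or after removal is not stated in the notebook.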


## III. Create data set

In [11]:
from src.data.dataset_class import DataSet

In [12]:
dataset = DataSet(parallel_sentences)

In [13]:
dataset.get_sample(50)

In [14]:
dataset.dataset

index,number_punctuations_total_source,number_words_source,number_unique_words_source,number_!_source,"number_""_source",number_#_source,number_$_source,number_%_source,number_&_source,number_'_source,...,number_SCONJ_target,number_SYM_target,number_VERB_target,number_X_target,score_polarity_target,score_subjectivity_target,number_stopwords_target,list_named_entities_target,sentence_embedding_target,Translation
0,1,148,27,0,0,0,0,0,0,0,...,0,0,2,0,0.000,0.0,0,"[(Haushaltslinien), (europäischen)]","[[-0.023463550955057144, -0.030060242861509323...",1
1,1,82,17,0,0,0,0,0,0,0,...,0,0,1,0,0.000,0.0,0,[],"[[-0.06262011080980301, 0.05538003519177437, -...",1
2,1,130,22,0,0,0,0,0,0,0,...,2,0,4,0,1.000,0.0,0,[],"[[-0.010547094978392124, -0.024415630847215652...",1
3,3,184,29,0,0,0,0,0,0,1,...,1,0,2,0,0.850,0.0,0,"[(Progressiven, Allianz, der, Sozialisten), (D...","[[-0.04979599639773369, 0.03973759710788727, -...",1
4,5,24,18,0,0,0,0,0,0,0,...,0,0,0,1,0.000,0.0,0,"[(Anfrage, Nr., 7), (H-0252)]","[[-0.02767094597220421, 0.092250294983387, -0....",1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,2,159,26,0,0,0,0,0,0,0,...,1,0,3,0,0.500,0.0,0,"[(PT), (Europäische, Union)]","[[-0.05103617161512375, 0.09060624986886978, -...",0
96,2,143,25,0,0,0,0,0,0,1,...,0,0,3,0,0.000,0.0,0,[],"[[-0.07834000140428543, -0.017925119027495384,...",0
97,5,24,18,0,0,0,0,0,0,0,...,0,0,0,1,0.000,0.0,0,"[(Vertrag, von, Lissabon)]","[[-0.0791318267583847, -0.031332068145275116, ...",0
98,1,73,20,0,0,0,0,0,0,0,...,1,0,2,0,-0.175,0.0,0,"[(Parlaments), (Blockade), (ECOFIN-Rat), (Währ...","[[0.008830344304442406, -0.024369219318032265,...",0
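
The `Translation` column above contains both 1 and 0, which suggests that the sample holds correct sentence pairs as well as mismatched ones. How `DataSet.get_sample` builds these rows is not shown here. One common way to obtain negative examples, sketched below under that assumption, is to pair each source sentence with the target of a different row. The function `make_labelled_pairs` and the toy data are hypothetical.

```python
import pandas as pd


def make_labelled_pairs(frame, n_pairs, seed=0):
    """Return n_pairs correct pairs (label 1) and n_pairs mismatched pairs (label 0).

    Assumes n_pairs >= 2 and that the sampled target sentences are distinct.
    """
    positives = frame.sample(n=n_pairs, random_state=seed).reset_index(drop=True)
    positives["Translation"] = 1

    # Rotate the target column by one position so that every source
    # sentence is paired with the target of another row.
    targets = positives["text_target"].tolist()
    negatives = positives.copy()
    negatives["text_target"] = targets[1:] + targets[:1]
    negatives["Translation"] = 0

    return pd.concat([positives, negatives], ignore_index=True)


toy_frame = pd.DataFrame(
    {
        "text_source": ["Thank you.", "Good morning.", "The debate is closed.", "Yes."],
        "text_target": ["Danke.", "Guten Morgen.", "Die Aussprache ist geschlossen.", "Ja."],
    }
)
toy_pairs = make_labelled_pairs(toy_frame, n_pairs=3)
print(toy_pairs)
```

Rotating rather than shuffling guarantees that no source sentence keeps its own translation, as long as the sampled targets are distinct.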


## IV. Create sentence-based features

In this section we create sentence-based features for our model. These features depend on the raw sentence and should be created before the text is preprocessed.

In [15]:
from src.features.feature_generation_class import FeatureGeneration

In [16]:
features = FeatureGeneration(dataset.dataset)

In [17]:
features.feature_generation()

KeyError: 'number_Pres_source'
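
The `KeyError` indicates that the column `number_Pres_source` is not present in the dataframe passed to `FeatureGeneration`. A small check like the one below can list every missing column up front, before feature generation starts. The helper `missing_columns` and the list of required names are assumptions for illustration; the actual columns that `feature_generation` expects would have to be taken from its source code.

```python
import pandas as pd


def missing_columns(frame, required):
    """Return the required column names that are absent from the dataframe."""
    return [name for name in required if name not in frame.columns]


# Toy dataframe standing in for dataset.dataset.
toy = pd.DataFrame({"number_words_source": [3], "number_Pres_target": [0]})

# Hypothetical list of columns that feature generation relies on.
required_columns = ["number_words_source", "number_Pres_source", "number_Pres_target"]

absent = missing_columns(toy, required_columns)
print(absent)
```

In the notebook, the same check could be run as `missing_columns(dataset.dataset, required_columns)` to see whether only the source-side tense counts are missing or whether other columns are affected as well.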

In [None]:
features.feature_dataframe

## V. Create token-based features