# Preprocessing and Feature Creation

In this notebook we import the data, preprocess the data and create features for supervised and unsupervised cross-lingual-information retrieval models.

## I. Import Data

In this section we import the English and German europarl datasets and combine them into a parallel sentence translation dataframe.

In [1]:
from src.data.preprocessing_class import PreprocessingEuroParl

In [2]:
parallel_sentences = PreprocessingEuroParl(sentence_data_source='../data/external/europarl-v7.de-en.en',
                 sentence_data_target='../data/external/europarl-v7.de-en.de')

In [3]:
parallel_sentences.dataframe

Unnamed: 0,text_source,text_target
1,I declare resumed the session of the European ...,"Ich erkläre die am Freitag, dem 17. Dezember u..."
2,"Although, as you will have seen, the dreaded '...","Wie Sie feststellen konnten, ist der gefürchte..."
3,You have requested a debate on this subject in...,Im Parlament besteht der Wunsch nach einer Aus...
4,"In the meantime, I should like to observe a mi...",Heute möchte ich Sie bitten - das ist auch der...
5,"Please rise, then, for this minute' s silence.","Ich bitte Sie, sich zu einer Schweigeminute zu..."
...,...,...
95,There was a vote on this matter.,Es gab eine Abstimmung zu diesem Punkt.
96,"As I recall, the outcome of this vote was 422 ...",Diese Abstimmung ist meiner Erinnerung nach so...
97,This means that all the Groups with the except...,"Das heißt, alle Fraktionen, mit Ausnahme der F..."
98,All of the others were of a different opinion.,Alle anderen waren anderer Meinung.


## II. Preprocess data

In this section we preprocess the parallel sentence data.

In [4]:
import spacy
from nltk.corpus import stopwords
from textblob import TextBlob as textblob_source
from textblob_de import TextBlobDE as textblob_target
import en_core_web_sm
import de_core_news_sm

In [5]:
stopwords_source = stopwords.words('english')
stopwords_target = stopwords.words('german')
nlp_source = en_core_web_sm.load()
nlp_target = de_core_news_sm.load()
embedding_matrix_source = "../data/interim/proc_b_src_emb.p"
embedding_dictionary_source =  "../data/interim/proc_b_src_word.p"
embedding_matrix_target = "../data/interim/proc_b_trg_emb.p"
embedding_dictionary_target =  "../data/interim/proc_b_trg_word.p"

In [6]:
parallel_sentences.preprocess_sentences_source(stopwords_source, nlp_source, textblob_source,
                                               embedding_matrix_source, embedding_dictionary_source)

  return _methods._mean(a, axis=axis, dtype=dtype,


In [19]:
print(parallel_sentences.dataframe.columns[1:70])

Index(['text_target', 'token_preprocessed_source',
       'number_punctuations_total_source', 'number_words_source',
       'number_unique_words_source', 'number_!_source', 'number_"_source',
       'number_#_source', 'number_$_source', 'number_%_source',
       'number_&_source', 'number_'_source', 'number_(_source',
       'number_)_source', 'number_*_source', 'number_+_source',
       'number_,_source', 'number_-_source', 'number_._source',
       'number_/_source', 'number_:_source', 'number_;_source',
       'number_<_source', 'number_=_source', 'number_>_source',
       'number_?_source', 'number_@_source', 'number_[_source',
       'number_\_source', 'number_]_source', 'number_^_source',
       'number___source', 'number_`_source', 'number_{_source',
       'number_|_source', 'number_}_source', 'number_~_source',
       'number_characters_source', 'number_ADJ_source', 'number_ADP_source',
       'number_ADV_source', 'number_AUX_source', 'number_CONJ_source',
       'number_CCONJ

In [8]:
parallel_sentences.preprocess_sentences_target(stopwords_source,nlp_source, textblob_source, 
                                               embedding_matrix_target, embedding_dictionary_target)

  return _methods._mean(a, axis=axis, dtype=dtype,


In [13]:
parallel_sentences.dataframe.columns

Index(['text_source', 'text_target', 'token_preprocessed_source',
       'number_punctuations_total_source', 'number_words_source',
       'number_unique_words_source', 'number_!_source', 'number_"_source',
       'number_#_source', 'number_$_source',
       ...
       'number_VERB_target', 'number_X_target', 'number_Pres_target',
       'number_Past_target', 'number__target', 'score_polarity_target',
       'score_subjectivity_target', 'number_stopwords_target',
       'list_named_entities_target', 'sentence_embedding_target'],
      dtype='object', length=126)

## II. Create sentence based features

In this section we create features for our model, that are sentence based and should be created before the text is preprocessed.

In [20]:
from src.features.feature_generation_class import FeatureGeneration

In [22]:
features = FeatureGeneration(parallel_sentences.dataframe)

In [25]:
features.feature_generation()

In [26]:
features.feature_dataframe

Unnamed: 0,number_punctuations_total_difference,number_words_difference,number_unique_words_difference,number_!_difference,"number_""_difference",number_#_difference,number_$_difference,number_%_difference,number_&_difference,number_'_difference,...,number_SCONJ_difference,number_SYM_difference,number_VERB_difference,number_X_difference,number_Pres_difference,number_Past_difference,number__difference,score_polarity_difference,score_subjectivity_difference,number_stopwords_difference
1,-2,-8,-4,0,0,0,0,0,0,0,...,1,0,1,0,0,0,0,0.433939,0.624242,16
2,1,6,-6,0,-2,0,0,0,0,2,...,1,0,5,0,0,0,0,-0.466667,0.566667,17
3,2,-7,-4,0,0,0,0,0,0,0,...,0,0,-2,0,0,0,0,-0.122222,0.144444,6
4,1,14,-8,0,0,0,0,0,0,1,...,0,0,2,0,0,0,0,-0.208333,0.458333,26
5,2,-11,-1,0,0,0,0,0,0,1,...,0,0,1,0,0,0,0,0.000000,0.000000,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
95,0,-7,-3,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,0.000000,0.000000,0
96,0,-33,-1,0,0,0,0,0,0,0,...,0,0,0,0,0,0,0,-0.200000,0.100000,0
97,-1,10,-4,0,0,0,0,0,0,0,...,2,0,1,-1,0,0,0,0.000000,1.000000,10
98,0,11,3,0,0,0,0,0,0,0,...,0,0,-1,0,0,0,0,0.000000,0.600000,8


## III. Preprocess data for token based features

## IV. Create token based features

In [None]:
sentence = "This is a sentence!"

In [None]:
[word for word in sentence]

In [None]:
len([word for word in sentence])