# Supervised Retrieval

In this notebook we train a supervised classification model, scikit-learn's built-in MLPClassifier, for a cross-lingual information retrieval task. We first prepare the data and then run a pipeline of forward feature selection followed by hyperparameter optimization via grid search. The grid search does not improve on the default settings; because of limited computing power and time, we could only test a small number of network architectures. After training the model on English-German, we apply it to English-Italian, English-Polish and document-level data. The model transfers well to the other languages (MAP 0.8725 and 0.8691) but, as expected, not to the document level (MAP 0.0004).
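
All results in this notebook are reported as mean average precision (MAP). The project's own `MAP_score` lives in `src.models.predict_model` and is not shown here, so the following is only a minimal sketch of the metric under stated assumptions: the function name `mean_average_precision` is hypothetical, labels are 0/1, and `scores` is a one-dimensional list of relevance scores (for example the positive-class column of `predict_proba`).

```python
from collections import defaultdict


def mean_average_precision(query_ids, labels, scores):
    """Hypothetical MAP sketch: average the per-query average precision."""
    per_query = defaultdict(list)
    for query_id, label, score in zip(query_ids, labels, scores):
        per_query[query_id].append((score, label))

    average_precisions = []
    for candidates in per_query.values():
        # Rank the candidates of one query by descending score.
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits, precision_sum = 0, 0.0
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precision_sum += hits / rank
        if hits:  # queries without a relevant candidate are skipped
            average_precisions.append(precision_sum / hits)

    if not average_precisions:
        return 0.0
    return sum(average_precisions) / len(average_precisions)


# Toy example: query 0 ranks its translation first, query 1 ranks it second.
example_map = mean_average_precision(
    query_ids=[0, 0, 1, 1],
    labels=[1, 0, 1, 0],
    scores=[0.9, 0.1, 0.2, 0.8],
)
print(example_map)  # 0.75
```

With exactly one correct translation per source sentence, the average precision of a query reduces to the reciprocal rank of that translation.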

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

In this section we import the feature dataframe for the retrieval task.

In [2]:
# training pairs and retrieval candidates for English-German
feature_dataframe = pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
# use consistent id column names
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

In [3]:
feature_dataframe

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,10,0.555556,0.184124,5,0.061728,0.184124,3,...,0.881515,0.859575,0.248435,0.043538,0.014706,0.689299,0.627702,0.298771,0.051651,0.007246
1,1,1,1,0,0.000000,0.000000,0,0.000000,0.000000,0,...,0.894710,0.871179,0.228432,0.056366,0.260714,0.836215,0.789221,0.216985,0.054289,0.281404
2,2,2,1,3,1.000000,0.250000,2,0.125000,0.250000,2,...,0.771116,0.792226,0.389483,0.130394,0.240385,0.643538,0.690642,0.409364,0.135454,0.291667
3,3,3,1,0,0.000000,0.028070,4,0.133333,0.028070,4,...,0.859097,0.854491,0.282293,0.074980,0.120000,0.757080,0.754322,0.293724,0.076331,0.120000
4,4,4,1,0,0.000000,0.012605,3,0.103448,0.012605,2,...,0.861526,0.848707,0.296092,0.079806,0.103679,0.795838,0.783914,0.306117,0.081788,0.125217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219995,19999,10689,0,1,0.333333,0.062937,1,0.047619,0.062937,1,...,0.796217,0.780328,0.354726,0.112828,0.000000,0.668862,0.638944,0.366766,0.119223,0.000000
219996,19999,9781,0,2,0.500000,0.059091,7,0.259259,0.059091,7,...,0.801057,0.783654,0.336315,0.113979,0.080128,0.653291,0.621897,0.356319,0.119032,0.080128
219997,19999,7757,0,1,0.333333,0.026738,5,0.200000,0.026738,5,...,0.770226,0.775183,0.376769,0.111710,0.048055,0.630458,0.643871,0.390613,0.115372,0.048055
219998,19999,4932,0,1,0.333333,0.007576,12,0.375000,0.007576,11,...,0.801876,0.789016,0.342303,0.113451,0.000000,0.658583,0.630707,0.364702,0.117218,0.000000


#### Delete all columns with only one value

In [4]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
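
The helper `threshold_counts` is imported from the project and its body is not shown in this notebook. A minimal sketch of what such a filter could look like, assuming it returns `True` for columns with more than `threshold` distinct values (the name `threshold_counts_sketch` and the toy dataframe are hypothetical):

```python
import pandas as pd


def threshold_counts_sketch(column, threshold=1):
    # Assumption: keep a column only if it has more than `threshold` distinct values.
    return column.nunique() > threshold


toy = pd.DataFrame({"constant": [0, 0, 0], "varying": [0.1, 0.5, 0.1]})

# DataFrame.apply forwards the extra keyword argument to the function
# and returns one boolean per column.
toy_mask = toy.apply(threshold_counts_sketch, threshold=1)
toy_filtered = toy.loc[:, toy_mask]
print(list(toy_filtered.columns))
```

Because the mask is computed on the training dataframe and then reused for the retrieval dataframe, both dataframes need to carry the same column names.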

## II. Supervised Retrieval

### MLP Classifier

#### Start with one feature and perform selection and tuning on validation set 

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]
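
The forward selection itself happens inside `pipeline_model_optimization`, whose source is not part of this notebook. The sketch below shows one plausible reading of the procedure, assuming that repeated passes are made over the candidate list and that a feature is kept only if it raises the score by more than a threshold (compare `threshold_map_feature_selection`). The names `forward_selection`, `score_fn` and `min_gain` are hypothetical.

```python
def forward_selection(start_features, candidates, score_fn, min_gain=1e-4):
    """Greedy forward selection: keep a feature only if it improves the score."""
    selected = list(start_features)
    best_score = score_fn(selected)
    improved = True
    while improved:  # one pass per "iteration through feature list"
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score - best_score > min_gain:
                selected.append(feature)
                best_score = score
                improved = True
    return selected, best_score


# Toy scoring function standing in for "train the model and compute MAP".
toy_gain = {"a": 0.5, "b": 0.2, "c": 0.0, "d": 0.00001}


def toy_score(features):
    return sum(toy_gain[name] for name in features)


selected_features, selected_score = forward_selection(["a"], ["b", "c", "d"], toy_score)
print(selected_features)
```

In the real pipeline every call to the scoring function means refitting the classifier, which is why the selection is greedy rather than exhaustive.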

In [6]:
model = MLPClassifier(random_state=0, early_stopping=True, max_iter=100000)
scaler = preprocessing.StandardScaler()

# 6 * 2 * 2 * 1 * 2 * 1 * 1 = 48 parameter combinations
MLP_parameter_grid = {
    'hidden_layer_sizes': [(100,), (5,), (5, 3, 2), (6, 2), (7, 3), (8, 3, 2)],
    'activation': ['tanh', 'relu'],
    'solver': ['lbfgs', 'adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant', 'adaptive'],
    'random_state': [0],
    'early_stopping': [True]
}
MLP_best_features, MLP_best_parameter_combination, MLP_best_map_score, MLP_all_parameter_combination = \
    pipeline_model_optimization(model, MLP_parameter_grid, scaler, feature_dataframe,
                                feature_retrieval, start_features,
                                added_features,
                                threshold_map_feature_selection=0.0001)

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7515
Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.7666
Updated MAP score on test set with new feature euclidean_distance_tf_idf_vecmap: 0.8138
Updated MAP score on test set with new feature cosine_similarity_tf_idf_vecmap: 0.8167
Updated MAP score on test set with new feature jaccard_translation_proc_b_1k: 0.8352
Updated MAP score on test set with new feature number_._difference_relative: 0.8368
Updated MAP score on test set with new feature number_-_difference_relative: 0.8391

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8391
Updated MAP score on test set with new feature number_;_difference_normalized: 0.8467
Updated MAP score on test set with new feature number_:_difference: 0.8477

Current Iteration through feature list: 3
The initial MAP score on test set: 0.8477

------------

Hyperparameter Tuning:   0%|          | 0/48 [00:00<?, ?it/s]



-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 48


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
Hyperparameter Tuning:   2%|▏         | 1/48 [03:15<2:33:06, 195.47s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8081045079157305}
With Map Score 0.8081


Hyperparameter Tuning:   6%|▋         | 3/48 [06:53<1:27:11, 116.25s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8092826925727866}
With Map Score 0.8093


Hyperparameter Tuning:  15%|█▍        | 7/48 [12:30<1:00:24, 88.39s/it] 


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8477437491237899}
With Map Score 0.8477



-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'hidden_layer_sizes': (100,), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True}
With MAP Score: 0.8477
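
The count of 48 combinations follows directly from the Cartesian product of the value lists in `MLP_parameter_grid`. The sketch below reproduces that count with the standard library only. It also illustrates a point worth keeping in mind when reading the timings: scikit-learn documents `learning_rate` as only used when `solver='sgd'`, so with the solvers `lbfgs` and `adam` the two `learning_rate` values should lead to the same fitted model. The variable names are illustrative and not taken from the project code.

```python
from itertools import product

parameter_grid = {
    'hidden_layer_sizes': [(100,), (5,), (5, 3, 2), (6, 2), (7, 3), (8, 3, 2)],
    'activation': ['tanh', 'relu'],
    'solver': ['lbfgs', 'adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant', 'adaptive'],
    'random_state': [0],
    'early_stopping': [True],
}

# Every combination is one dict of keyword arguments for the classifier.
keys = list(parameter_grid)
combinations = [dict(zip(keys, values)) for values in product(*parameter_grid.values())]
print(len(combinations))  # 48

# Ignoring 'learning_rate' (unused by 'lbfgs' and 'adam') halves the grid.
effective = {
    tuple(sorted((key, value) for key, value in combo.items() if key != 'learning_rate'))
    for combo in combinations
}
print(len(effective))  # 24
```

Dropping the `learning_rate` entry, or adding `'sgd'` to the solvers, would therefore make better use of the same compute budget.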





In [15]:
start_features=['jaccard_translation_proc_5k',
 'jaccard_translation_vecmap',
 'euclidean_distance_tf_idf_vecmap',
 'cosine_similarity_tf_idf_vecmap',
 'jaccard_translation_proc_b_1k',
 'number_._difference_relative',
 'number_-_difference_relative',
 'number_;_difference_normalized',
 'number_:_difference']

In [25]:
# final model, MAP 0.8477
scaler = preprocessing.StandardScaler()
model = MLPClassifier(hidden_layer_sizes=(100,), random_state=0, alpha=0.0001, solver='adam',
                      activation='relu', early_stopping=True, max_iter=1000000)
target_train = feature_dataframe['Translation']
target_test = feature_retrieval['Translation']
# copy the selected columns so that scaling does not write into a view of the original frames
data_train = feature_dataframe.filter(items=start_features).copy()
data_test = feature_retrieval.filter(items=start_features).copy()
# scale the features: fit the scaler on the training data only, then apply it to the test data
data_train[data_train.columns] = scaler.fit_transform(data_train[data_train.columns])
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# fit the model and get the initial MapScore
model.fit(data_train.to_numpy(), target_train.to_numpy())
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval['source_id'], target_test, prediction)
print("The initial MAP score on test set: {:.4f}".format(MapScore))

The initial MAP score on test set: 0.8477


In [26]:
import pickle
# save the model to disk
filename = 'finalized_model_MLP.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)

# load the model from disk
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
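
The pickled classifier is only usable together with the fitted scaler and the exact list of selected features, since every later cell scales the data with `scaler` and filters it with `start_features`. One option, sketched here as an assumption rather than as part of the original project, is to store all three in a single file. The helper names `save_bundle` and `load_bundle` and the file name are hypothetical, and the demo uses plain stand-in objects instead of the fitted model.

```python
import os
import pickle
import tempfile


def save_bundle(path, model, scaler, feature_names):
    # Store everything needed to score new data in one pickle file.
    bundle = {"model": model, "scaler": scaler, "features": list(feature_names)}
    with open(path, "wb") as handle:
        pickle.dump(bundle, handle)


def load_bundle(path):
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Demo with stand-in objects; in the notebook these would be
# `model`, `scaler` and `start_features`.
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "mlp_bundle.pkl")
    save_bundle(bundle_path, model={"stand-in": "model"},
                scaler={"stand-in": "scaler"}, feature_names=["f1", "f2"])
    restored = load_bundle(bundle_path)

print(restored["features"])
```

As with any pickle file, a bundle like this should only be loaded from a trusted source and with a compatible scikit-learn version.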

In [27]:
#final features of the model
len(start_features)

9

# Test our model on an independent English-German test set

In [28]:
feature_retrieval_test=pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")
feature_retrieval_test = feature_retrieval_test.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_test

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,0,0.000000,0.033333,3,0.130435,0.033333,5,...,0.776923,0.752113,0.380413,0.130624,0.140867,0.628849,0.569304,0.413058,0.139506,0.171569
1,0,1,0,2,1.000000,0.133333,10,0.625000,0.133333,10,...,0.620641,0.603776,0.543104,0.281724,0.000000,0.356381,0.324728,0.622440,0.298332,0.000000
2,0,2,0,2,0.333333,0.101961,0,0.000000,0.101961,0,...,0.588427,0.581390,0.480664,0.128045,0.000000,0.229010,0.239453,0.530768,0.140842,0.000000
3,0,3,0,0,0.000000,0.020513,2,0.083333,0.020513,2,...,0.665777,0.634401,0.444180,0.139126,0.000000,0.442247,0.403900,0.478450,0.147651,0.000000
4,0,4,0,1,0.333333,0.008333,6,0.300000,0.008333,6,...,0.611621,0.592239,0.494053,0.186831,0.000000,0.285372,0.285385,0.546625,0.194719,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,5,1.000000,0.238095,13,0.684211,0.238095,13,...,0.618434,0.603234,0.548296,0.335448,0.000000,0.421944,0.402578,0.580330,0.335545,0.000000
499996,99,4996,0,3,1.000000,0.200000,9,0.600000,0.200000,9,...,0.606798,0.569988,0.559360,0.334685,0.000000,0.390204,0.333805,0.598008,0.341303,0.000000
499997,99,4997,0,0,0.000000,0.000000,5,0.454545,0.000000,4,...,0.568771,0.501923,0.628250,0.343402,0.000000,0.396668,0.300773,0.670406,0.359109,0.000000
499998,99,4998,0,2,1.000000,0.133333,10,0.625000,0.133333,9,...,0.554428,0.514135,0.596427,0.343570,0.000000,0.217345,0.160667,0.666197,0.360725,0.000000


### Prepare model

In [29]:
target_test = feature_retrieval_test['Translation']
data_test = feature_retrieval_test.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,0.140867,0.171569,0.139506,0.569304,0.140867,0.0,0.0,0.0,0
1,0.000000,0.000000,0.298332,0.324728,0.000000,0.0,0.0,0.0,0
2,0.000000,0.000000,0.140842,0.239453,0.000000,0.0,0.0,0.0,0
3,0.000000,0.000000,0.147651,0.403900,0.000000,0.0,0.0,0.0,0
4,0.000000,0.000000,0.194719,0.285385,0.000000,0.0,0.0,0.0,0
...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.335545,0.402578,0.000000,0.0,0.0,0.0,0
499996,0.000000,0.000000,0.341303,0.333805,0.000000,1.0,0.0,0.0,0
499997,0.000000,0.000000,0.359109,0.300773,0.000000,0.0,0.0,0.0,0
499998,0.000000,0.000000,0.360725,0.160667,0.000000,0.0,0.0,0.0,0


### Use model

In [30]:
# scale with the scaler fitted on the training data
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# pass a NumPy array, as during training
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval_test['source_id'], target_test, prediction)
print("The MAP score on test set: {:.4f}".format(MapScore))

The MAP score on test set: 0.8459


# III. Other languages

## Use the model on English-Italian

In [31]:
feature_retrieval_it=pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it =feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_it

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,1,0.333333,0.075000,1,0.066667,0.075000,1,...,0.810779,0.810212,0.361123,0.130714,0.250000,0.770822,0.771820,0.347454,0.125257,0.250000
1,20000,20001,0,0,0.000000,0.046154,3,0.157895,0.046154,3,...,0.652366,0.632913,0.495442,0.162176,0.000000,0.471685,0.442877,0.540197,0.175718,0.000000
2,20000,20002,0,1,0.200000,0.023529,6,0.272727,0.023529,6,...,0.551240,0.537829,0.552460,0.176089,0.000000,0.176865,0.133939,0.636416,0.201498,0.000000
3,20000,20003,0,1,0.333333,0.123077,4,0.200000,0.123077,3,...,0.609395,0.578322,0.497603,0.170155,0.000000,0.317125,0.261110,0.547007,0.188239,0.000000
4,20000,20004,0,0,0.000000,0.033333,2,0.111111,0.033333,1,...,0.625252,0.605403,0.499228,0.169059,0.000000,0.431676,0.400298,0.526985,0.179437,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.000000,0.363636,0,0.000000,0.363636,0,...,0.494533,0.470337,0.600207,0.206954,0.000000,0.146285,0.122446,0.678523,0.233086,0.000000
499996,20099,24996,0,4,1.000000,0.363636,3,0.272727,0.363636,3,...,0.445105,0.431848,0.653755,0.295593,0.000000,0.105808,0.098467,0.726166,0.319219,0.000000
499997,20099,24997,0,4,1.000000,0.363636,3,0.176471,0.363636,3,...,0.508493,0.465414,0.574689,0.194991,0.000000,0.166518,0.125571,0.639576,0.215090,0.000000
499998,20099,24998,0,2,0.200000,0.192208,22,0.611111,0.192208,18,...,0.609450,0.580977,0.502350,0.169182,0.014706,0.270303,0.249942,0.562242,0.172988,0.014706


## Prepare the test set

In [32]:
target_test = feature_retrieval_it['Translation']
data_test = feature_retrieval_it.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,0.306818,0.250000,0.125257,0.771820,0.250000,0.0,0.0,0.000000,0
1,0.000000,0.000000,0.175718,0.442877,0.000000,0.0,0.0,0.000000,1
2,0.000000,0.000000,0.201498,0.133939,0.000000,0.0,0.0,0.000000,0
3,0.000000,0.000000,0.188239,0.261110,0.000000,0.0,0.0,0.000000,0
4,0.000000,0.000000,0.179437,0.400298,0.000000,1.0,0.0,0.000000,0
...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.233086,0.122446,0.000000,0.0,0.0,0.000000,0
499996,0.000000,0.000000,0.319219,0.098467,0.000000,0.0,0.0,0.000000,0
499997,0.000000,0.000000,0.215090,0.125571,0.000000,0.0,0.0,0.000000,0
499998,0.014706,0.014706,0.172988,0.249942,0.014706,0.0,0.0,0.028571,0


## Use model

In [33]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,4.284752,3.661185,-0.228882,1.735805,3.728962,-0.466162,-0.377241,-0.192207,-0.329894
1,-0.395802,-0.386745,0.184983,-0.832727,-0.385061,-0.466162,-0.377241,-0.192207,2.868866
2,-0.395802,-0.386745,0.396417,-3.245045,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
3,-0.395802,-0.386745,0.287674,-2.252040,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
4,-0.395802,-0.386745,0.215481,-1.165199,-0.385061,3.165573,-0.377241,-0.192207,-0.329894
...,...,...,...,...,...,...,...,...,...
499995,-0.395802,-0.386745,0.655488,-3.334791,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
499996,-0.395802,-0.386745,1.361915,-3.522024,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
499997,-0.395802,-0.386745,0.507889,-3.310387,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
499998,-0.171462,-0.148632,0.162586,-2.339242,-0.143060,-0.466162,-0.377241,2.060283,-0.329894


In [34]:
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval_it['source_id'], target_test, prediction)
print("The Italian MAP score on test set: {:.4f}".format(MapScore))

The Italian MAP score on test set: 0.8725


## Use the model on English-Polish

In [35]:
feature_retrieval_pl=pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_pl

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,0,0.0,0.000000,1,0.090909,0.000000,1,...,0.642565,0.626953,0.517595,0.193824,0.236111,0.557422,0.526448,0.503654,0.198580,0.236111
1,20000,20001,0,0,0.0,0.000000,0,0.000000,0.000000,0,...,0.573507,0.551732,0.569353,0.235418,0.000000,0.404290,0.365516,0.602765,0.252775,0.000000
2,20000,20002,0,11,1.0,0.139241,62,0.837838,0.139241,53,...,0.708999,0.674955,0.431595,0.209228,0.000000,0.600519,0.560591,0.444171,0.206232,0.000000
3,20000,20003,0,2,1.0,0.250000,0,0.000000,0.250000,0,...,0.555197,0.545064,0.569774,0.230373,0.000000,0.356233,0.364142,0.612446,0.245365,0.000000
4,20000,20004,0,2,1.0,0.086957,15,0.555556,0.086957,15,...,0.608879,0.543577,0.495413,0.207498,0.000000,0.363847,0.248384,0.538228,0.221728,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.0,0.190476,2,0.062500,0.190476,2,...,0.687810,0.649056,0.398149,0.104033,0.000000,0.296902,0.164214,0.443523,0.119273,0.000000
499996,20099,24996,0,2,1.0,0.133333,2,0.071429,0.133333,1,...,0.696351,0.647455,0.421799,0.111301,0.000000,0.521585,0.400441,0.436105,0.118380,0.000000
499997,20099,24997,0,1,1.0,0.055556,2,0.062500,0.055556,3,...,0.716196,0.700957,0.387899,0.095337,0.000000,0.346653,0.326476,0.458947,0.111125,0.000000
499998,20099,24998,0,1,1.0,0.050000,4,0.117647,0.050000,4,...,0.661488,0.617984,0.426675,0.107472,0.000000,0.304688,0.202665,0.484050,0.123391,0.000000


## Prepare the test set

In [36]:
target_test = feature_retrieval_pl['Translation']
data_test = feature_retrieval_pl.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,0.325397,0.236111,0.198580,0.526448,0.236111,0.0,0.0,0.000000,0
1,0.000000,0.000000,0.252775,0.365516,0.000000,0.0,0.0,0.000000,0
2,0.000000,0.000000,0.206232,0.560591,0.000000,0.0,1.0,0.012658,1
3,0.000000,0.000000,0.245365,0.364142,0.000000,0.0,0.0,0.000000,0
4,0.000000,0.000000,0.221728,0.248384,0.000000,1.0,0.0,0.000000,0
...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.119273,0.164214,0.000000,0.0,0.0,0.000000,0
499996,0.000000,0.000000,0.118380,0.400441,0.000000,0.0,1.0,0.000000,0
499997,0.000000,0.000000,0.111125,0.326476,0.000000,0.0,0.0,0.000000,0
499998,0.000000,0.000000,0.123391,0.202665,0.000000,0.0,0.0,0.000000,0


## Use model

In [37]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,4.568171,3.436300,0.372486,-0.180167,3.500405,-0.466162,-0.377241,-0.192207,-0.329894
1,-0.395802,-0.386745,0.816968,-1.436793,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
2,-0.395802,-0.386745,0.435245,0.086441,-0.385061,-0.466162,2.664338,0.805732,2.868866
3,-0.395802,-0.386745,0.756193,-1.447521,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
4,-0.395802,-0.386745,0.562336,-2.351407,-0.385061,3.165573,-0.377241,-0.192207,-0.329894
...,...,...,...,...,...,...,...,...,...
499995,-0.395802,-0.386745,-0.277955,-3.008647,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
499996,-0.395802,-0.386745,-0.285284,-1.164084,-0.385061,-0.466162,2.664338,-0.192207,-0.329894
499997,-0.395802,-0.386745,-0.344785,-1.741631,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894
499998,-0.395802,-0.386745,-0.244185,-2.708404,-0.385061,-0.466162,-0.377241,-0.192207,-0.329894


In [38]:
prediction = model.predict_proba(data_test.to_numpy())
# use the source ids of the Polish dataframe, not the Italian one
MapScore = MAP_score(feature_retrieval_pl['source_id'], target_test, prediction)
print("The Polish MAP score on test set: {:.4f}".format(MapScore))

The Polish MAP score on test set: 0.8691


# IV. Document level

## Use the model on German-English documents

In [39]:
feature_retrieval_doc=pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_doc

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,5,1.000000,0.185185,277,0.862928,0.185185,139,...,0.511370,0.506299,0.628872,0.069684,0.004274,0.137734,0.120406,0.711984,0.073070,0.0
1,0,1,0,5,1.000000,0.185185,315,0.877437,0.185185,183,...,0.478780,0.472470,0.657574,0.070496,0.003448,0.058341,0.033254,0.756963,0.075525,0.0
2,0,2,0,5,1.000000,0.185185,320,0.879121,0.185185,171,...,0.453333,0.408333,0.693955,0.072785,0.004202,0.068042,0.002854,0.788789,0.077393,0.0
3,0,3,0,5,1.000000,0.185185,356,0.890000,0.185185,164,...,0.472022,0.457837,0.671196,0.071793,0.000000,0.079043,0.041237,0.775098,0.074297,0.0
4,0,4,0,5,1.000000,0.185185,376,0.895238,0.185185,171,...,0.459031,0.439136,0.670771,0.071706,0.003546,0.024217,-0.025245,0.781736,0.077319,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,1,0.333333,0.114653,318,0.913793,0.114653,162,...,0.434860,0.389901,0.696221,0.154386,0.000000,-0.046217,-0.154401,0.842940,0.160095,0.0
499996,99,4996,0,2,1.000000,0.117647,362,0.923469,0.117647,152,...,0.427382,0.374952,0.701396,0.155738,0.000000,-0.055270,-0.153815,0.832674,0.158915,0.0
499997,99,4997,0,2,1.000000,0.117647,121,0.801325,0.117647,83,...,0.501045,0.485808,0.632044,0.147978,0.000000,0.057846,0.013532,0.745893,0.157064,0.0
499998,99,4998,0,0,0.000000,0.000000,0,0.000000,0.000000,5,...,0.497301,0.445718,0.644503,0.163494,0.000000,0.157026,0.118324,0.736800,0.183978,0.0


## Prepare the test set

In [40]:
target_test = feature_retrieval_doc['Translation']
data_test = feature_retrieval_doc.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,0.004167,0.0,0.073070,0.120406,0.004274,1.0,1.0,0.0,0
1,0.003448,0.0,0.075525,0.033254,0.003448,1.0,1.0,0.0,0
2,0.003968,0.0,0.077393,0.002854,0.004202,1.0,1.0,0.0,0
3,0.000000,0.0,0.074297,0.041237,0.000000,0.0,1.0,0.0,0
4,0.003546,0.0,0.077319,-0.025245,0.003546,1.0,1.0,0.0,0
...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.0,0.160095,-0.154401,0.000000,1.0,1.0,0.0,0
499996,0.000000,0.0,0.158915,-0.153815,0.000000,1.0,1.0,0.0,0
499997,0.000000,0.0,0.157064,0.013532,0.000000,1.0,1.0,0.0,0
499998,0.000000,0.0,0.183978,0.118324,0.000000,1.0,1.0,0.0,0


## Use model

In [41]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,cosine_similarity_tf_idf_vecmap,jaccard_translation_proc_b_1k,number_._difference_relative,number_-_difference_relative,number_;_difference_normalized,number_:_difference
0,-0.332239,-0.386745,-0.656892,-3.350718,-0.314736,3.165573,2.664338,-0.192207,-0.329894
1,-0.343198,-0.386745,-0.636760,-4.031234,-0.328316,3.165573,2.664338,-0.192207,-0.329894
2,-0.335266,-0.386745,-0.621440,-4.268616,-0.315918,3.165573,2.664338,-0.192207,-0.329894
3,-0.395802,-0.386745,-0.646832,-3.968902,-0.385061,-0.466162,2.664338,-0.192207,-0.329894
4,-0.341706,-0.386745,-0.622047,-4.488027,-0.326706,3.165573,2.664338,-0.192207,-0.329894
...,...,...,...,...,...,...,...,...,...
499995,-0.395802,-0.386745,0.056844,-5.496526,-0.385061,3.165573,2.664338,-0.192207,-0.329894
499996,-0.395802,-0.386745,0.047171,-5.491954,-0.385061,3.165573,2.664338,-0.192207,-0.329894
499997,-0.395802,-0.386745,0.031988,-4.185236,-0.385061,3.165573,2.664338,-0.192207,-0.329894
499998,-0.395802,-0.386745,0.252727,-3.366977,-0.385061,3.165573,2.664338,-0.192207,-0.329894


In [42]:
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval_doc['source_id'], target_test, prediction)
print("The Doc MAP score on test set: {:.4f}".format(MapScore))

The Doc MAP score on test set: 0.0004
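
One plausible reason for the collapse on document level is a shift in the feature distributions. In the scaled document-level table above, even the correct pair in row 0 has a scaled `cosine_similarity_tf_idf_vecmap` of -3.35, whereas the correct English-Italian pair in row 0 sits at 1.74. Since the scaler was fitted on the training data, the mean of a scaled feature on new data measures how far that feature has moved, in training standard deviations. The sketch below shows this check with NumPy; the function name `mean_shift_in_train_std` is hypothetical and the numbers are toy values, not taken from the notebook.

```python
import numpy as np


def mean_shift_in_train_std(train_values, new_values):
    """Mean of the new data after standardizing with the training mean and std."""
    train_values = np.asarray(train_values, dtype=float)
    new_values = np.asarray(new_values, dtype=float)
    mean = train_values.mean(axis=0)
    std = train_values.std(axis=0)  # population std, as StandardScaler uses
    std = np.where(std == 0, 1.0, std)  # avoid division by zero for constant columns
    return ((new_values - mean) / std).mean(axis=0)


# Toy data: one feature with training mean 1 and training std 1.
toy_train = [[0.0], [2.0]]
toy_new = [[-3.0], [-5.0]]
toy_shift = mean_shift_in_train_std(toy_train, toy_new)
print(toy_shift)
```

Values far from zero indicate that the model is being asked to score inputs outside the range it saw during training, which would be consistent with the document-level result, although this notebook does not verify that explanation.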
