# Supervised Retrieval

In this notebook we train a supervised classification model, scikit-learn's built-in MLPClassifier, for a cross-lingual information retrieval task. We first prepare the data and then run a pipeline of forward feature selection followed by hyperparameter optimization via grid search. In our runs the default settings perform best (MAP 0.8476, against 0.8358 and 0.8375 for the best grid-search settings); limited computing power and time kept us from testing a wider range of network architectures. Finally, we apply the model trained on English-German to English-Italian, to English-Polish and to document-level retrieval. It transfers well to the other languages (MAP 0.8167 and 0.8081) but, as expected, not to document level (MAP 0.0005).

In [1]:
import pandas as pd
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

In this section we import the feature dataframe for the retrieval task.

In [2]:
feature_dataframe = pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

In [3]:
feature_dataframe

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,10,0.555556,0.094256,3,0.016575,0.094256,12,...,0.938824,0.922837,0.188550,0.025019,0.069408,0.851877,0.816036,0.227537,0.029612,0.074080
1,1,1,1,0,0.000000,0.000000,0,0.000000,0.000000,5,...,0.928516,0.911385,0.191291,0.035125,0.246975,0.881250,0.846294,0.197129,0.036056,0.258936
2,2,2,1,3,1.000000,0.142857,1,0.027027,0.142857,1,...,0.841325,0.834805,0.308668,0.072557,0.176154,0.729274,0.726981,0.328892,0.075238,0.224167
3,3,3,1,0,0.000000,0.004762,2,0.037037,0.004762,5,...,0.881792,0.873648,0.255657,0.051406,0.173542,0.771814,0.758749,0.281536,0.053657,0.173542
4,4,4,1,0,0.000000,0.005517,4,0.076923,0.005517,2,...,0.916279,0.894702,0.227575,0.047172,0.186111,0.874617,0.839182,0.238205,0.049603,0.205495
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219995,19999,4647,0,3,0.600000,0.079032,8,0.173913,0.079032,9,...,0.874105,0.859388,0.280235,0.065319,0.092105,0.783083,0.739614,0.302003,0.072655,0.078947
219996,19999,685,0,2,0.500000,0.050000,8,0.173913,0.050000,9,...,0.856001,0.852788,0.322840,0.066262,0.068498,0.738716,0.726509,0.375336,0.076159,0.069173
219997,19999,10689,0,1,0.333333,0.036957,2,0.050000,0.036957,2,...,0.886237,0.852108,0.263981,0.067731,0.062500,0.819358,0.759386,0.275182,0.070944,0.063508
219998,19999,9172,0,1,0.333333,0.016667,9,0.191489,0.016667,7,...,0.871786,0.855601,0.276340,0.065629,0.054054,0.785261,0.750286,0.295176,0.071319,0.054805


#### Delete all columns with only one value

In [4]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
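
The helper `threshold_counts` lives in `src.models.predict_model` and is not shown in this notebook. The following is only a minimal sketch of what such a column filter could look like, assuming it returns `True` for columns with more than `threshold` distinct values. The name `threshold_counts_sketch` and the toy data are hypothetical.

```python
import pandas as pd


def threshold_counts_sketch(column: pd.Series, threshold: int = 1) -> bool:
    """Hypothetical stand-in: keep a column only if it has more than
    `threshold` distinct values (an assumption about the real helper)."""
    return column.nunique() > threshold


# Toy frame with one constant and one varying column.
toy = pd.DataFrame({
    "constant": [0.0, 0.0, 0.0],
    "varying": [0.1, 0.5, 0.9],
})

# DataFrame.apply calls the function once per column and forwards `threshold`.
mask = toy.apply(threshold_counts_sketch, threshold=1)
filtered = toy.loc[:, mask]
kept_columns = list(filtered.columns)
print(kept_columns)
```

Computing the mask on the training frame only and reusing it for the retrieval frame, as the cell above does, keeps both frames on the same set of columns.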

## II. Supervised Retrieval

### MLP Classifier

#### Start with one feature

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]
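
The forward selection itself is implemented in `src.models.predict_model` and not shown here. As a rough sketch of the idea, assuming a greedy procedure that keeps passing over the candidate list and keeps a feature whenever it raises the score by more than a threshold, it could look like this. The function name, the `score_fn` callback and the toy gains are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection_sketch(
    start: Sequence[str],
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    threshold: float = 0.0001,
) -> Tuple[List[str], float]:
    """Greedy forward selection (illustrative, not the project's code)."""
    selected = list(start)
    best = score_fn(selected)
    improved = True
    # Repeat full passes over the candidates until a pass adds nothing.
    while improved:
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score - best > threshold:
                selected.append(feature)
                best = score
                improved = True
    return selected, best


# Toy score: each feature contributes a fixed, additive gain.
toy_gain = {"a": 0.5, "b": 0.2, "c": 0.0, "d": 0.00005}


def toy_score(features: List[str]) -> float:
    return sum(toy_gain[name] for name in features)


selected, best = forward_selection_sketch(["a"], ["b", "c", "d"], toy_score)
print(selected, round(best, 4))
```

In the notebook, `score_fn` would correspond to fitting the classifier on the chosen columns and computing the MAP score on the retrieval set.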

In [6]:
model = MLPClassifier(random_state=0, early_stopping=True, max_iter=100000)
scaler = preprocessing.StandardScaler()

MLP_parameter_grid = {
    'hidden_layer_sizes': [(100,), (50,100,), (50,75,100,)],
    'activation': ['logistic','tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant','adaptive'],
    'random_state' :[0],
    'early_stopping':[True]
}
MLP_best_features, MLP_best_parameter_combination, MLP_best_map_score, MLP_all_parameter_combination = \
pipeline_model_optimization(model, MLP_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.0001)
# Current hyperparameters: {'hidden_layer_sizes': (100,), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
# MAP score on test set with current hyperparameters: 0.8667

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7535
Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.7718
Updated MAP score on test set with new feature euclidean_distance_tf_idf_vecmap: 0.7857
Updated MAP score on test set with new feature euclidean_distance_tf_idf_proc_b_1k: 0.8035
Updated MAP score on test set with new feature number_VERB_difference: 0.8425
Updated MAP score on test set with new feature number_ADJ_difference: 0.8427
Updated MAP score on test set with new feature number_#_difference_normalized: 0.8458

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8458
Updated MAP score on test set with new feature number_ADJ_difference_relative: 0.8476

Current Iteration through feature list: 3
The initial MAP score on test set: 0.8476

-----------------Result of Feature Selection-----------------

Best MAP Score after feature sel

Hyperparameter Tuning:   0%|          | 0/72 [00:00<?, ?it/s]



-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 72


Hyperparameter Tuning:   1%|▏         | 1/72 [00:20<24:03, 20.34s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'logistic', 'solver': 'sgd', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.7833685807702923}
With Map Score 0.7834


Hyperparameter Tuning:   4%|▍         | 3/72 [01:46<40:33, 35.27s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'logistic', 'solver': 'sgd', 'alpha': 0.05, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.7833710991012302}
With Map Score 0.7834


Hyperparameter Tuning:   7%|▋         | 5/72 [03:21<45:24, 40.66s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'logistic', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.8168707825996413}
With Map Score 0.8169


Hyperparameter Tuning:  18%|█▊        | 13/72 [08:59<48:46, 49.60s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (100,), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.8243364612412717}
With Map Score 0.8243


Hyperparameter Tuning:  51%|█████▏    | 37/72 [47:19<2:09:03, 221.24s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (50, 100), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.8251010878830725}
With Map Score 0.8251


Hyperparameter Tuning:  54%|█████▍    | 39/72 [52:27<1:41:45, 185.01s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (50, 100), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.05, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.827774949656375}
With Map Score 0.8278


Hyperparameter Tuning:  62%|██████▎   | 45/72 [1:32:00<3:02:39, 405.90s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (50, 100), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.8288211837073144}
With Map Score 0.8288


Hyperparameter Tuning:  85%|████████▍ | 61/72 [2:43:14<42:11, 230.12s/it]  


Current Best Hyperpamaters: {'hidden_layer_sizes': (50, 75, 100), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True, 'MAP_score': 0.8357811020975123}
With Map Score 0.8358


Hyperparameter Tuning: 100%|██████████| 72/72 [3:11:59<00:00, 159.99s/it]


-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'hidden_layer_sizes': (50, 75, 100), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
With MAP Score: 0.8358
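
The count of 72 combinations follows directly from the grid: 3 layer layouts × 3 activations × 2 solvers × 2 alphas × 2 learning-rate schedules. The sketch below enumerates the same grid with `itertools.product`; whether `pipeline_model_optimization` builds its combinations this way is an assumption.

```python
import itertools

grid = {
    "hidden_layer_sizes": [(100,), (50, 100), (50, 75, 100)],
    "activation": ["logistic", "tanh", "relu"],
    "solver": ["sgd", "adam"],
    "alpha": [0.0001, 0.05],
    "learning_rate": ["constant", "adaptive"],
    "random_state": [0],
    "early_stopping": [True],
}

# One dict per point of the Cartesian product of all value lists.
combinations = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]
n_combinations = len(combinations)
print(n_combinations)
```

At roughly 160 seconds per fit, as the progress bar reports, each additional option in any list adds a noticeable amount of run time.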





In [8]:
model = MLPClassifier(random_state=0, early_stopping=True, max_iter=100000)
scaler = preprocessing.StandardScaler()

MLP_parameter_grid = {
    'hidden_layer_sizes': [(5,), (5,3,2), (6,3,),(6,2),(7,3,),(8,3,2)],
    'activation': ['tanh', 'relu'],
    'solver': ['lbfgs', 'adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant','adaptive'],
    'random_state' :[0],
    'early_stopping':[True]
}
MLP_best_features, MLP_best_parameter_combination, MLP_best_map_score, MLP_all_parameter_combination = \
pipeline_model_optimization(model, MLP_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.0001)
# Current hyperparameters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
# MAP score on test set with current hyperparameters: 0.8674

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.8476

-----------------Result of Feature Selection-----------------

Best MAP Score after feature selection: 0.8475974148299663


Hyperparameter Tuning:   0%|          | 0/48 [00:00<?, ?it/s]



-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 48


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)
Hyperparameter Tuning:   2%|▏         | 1/48 [00:28<22:01, 28.11s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8193988871674549}
With Map Score 0.8194


Hyperparameter Tuning:   6%|▋         | 3/48 [01:33<24:18, 32.42s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8261110022139623}
With Map Score 0.8261


Hyperparameter Tuning:  19%|█▉        | 9/48 [04:18<18:53, 29.07s/it]


Current Best Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8358773709922623}
With Map Score 0.8359




Current Best Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True, 'MAP_score': 0.8374879871682634}
With Map Score 0.8375


Hyperparameter Tuning: 100%|██████████| 48/48 [30:34<00:00, 38.21s/it]


-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 0, 'early_stopping': True}
With MAP Score: 0.8375
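
According to the scikit-learn documentation, `learning_rate` is only used when `solver='sgd'`, and `early_stopping` is only effective for `'sgd'` and `'adam'`. With the `lbfgs` and `adam` solvers in this grid, the two `learning_rate` values should therefore not lead to different models. The sketch below counts how many settings remain once such duplicates are collapsed. It is an illustration only and says nothing about how `pipeline_model_optimization` handles the grid.

```python
import itertools

grid = {
    "hidden_layer_sizes": [(5,), (5, 3, 2), (6, 3), (6, 2), (7, 3), (8, 3, 2)],
    "activation": ["tanh", "relu"],
    "solver": ["lbfgs", "adam"],
    "alpha": [0.0001],
    "learning_rate": ["constant", "adaptive"],
}

all_settings = [
    dict(zip(grid, values)) for values in itertools.product(*grid.values())
]


def effective_key(setting):
    """Drop `learning_rate` from the key unless the solver is 'sgd'."""
    relevant = {
        name: value
        for name, value in setting.items()
        if name != "learning_rate" or setting["solver"] == "sgd"
    }
    return tuple(sorted(relevant.items(), key=lambda item: item[0]))


# Settings that share a key are expected to train the same model.
distinct = {effective_key(setting): setting for setting in all_settings}
n_all, n_distinct = len(all_settings), len(distinct)
print(n_all, n_distinct)
```

Under this assumption, half of the 48 fits repeat an earlier configuration, which would be an easy way to shorten the search.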





In [None]:
# The model did not improve, so we discard this configuration
scaler = preprocessing.StandardScaler()
model = MLPClassifier(hidden_layer_sizes=(7, 3), random_state=0, alpha=0.0001, early_stopping=True)
feature_selection(model,scaler,feature_dataframe,feature_retrieval,start_features,added_features)

In [36]:
start_features=['jaccard_translation_proc_5k','jaccard_translation_vecmap','euclidean_distance_tf_idf_vecmap','euclidean_distance_tf_idf_proc_b_1k','number_VERB_difference','number_ADJ_difference','number_#_difference_normalized','number_ADJ_difference_relative']

In [53]:
# Final model (MAP 0.8476)
scaler = preprocessing.StandardScaler()
model = MLPClassifier(hidden_layer_sizes=(100,), random_state=0, alpha=0.0001, early_stopping=True, max_iter=1000000)
target_train = feature_dataframe['Translation']
target_test = feature_retrieval['Translation']
data_train = feature_dataframe.filter(items=start_features)
data_test = feature_retrieval.filter(items=start_features)
# scale the features
data_train[data_train.columns] = scaler.fit_transform(data_train[data_train.columns])
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# fit the model and get the initial MapScore
model.fit(data_train.to_numpy(), target_train.to_numpy())
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval['source_id'], target_test, prediction)
print("The initial MAP score on test set: {:.4f}".format(MapScore))

The initial MAP score on test set: 0.8476
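
`MAP_score` is imported from `src.models.predict_model` and its implementation is not shown. A minimal sketch of mean average precision, assuming candidates are grouped by `source_id` and ranked by the predicted probability of the positive class (column 1 of `predict_proba`), might look as follows. The function name and the toy inputs are hypothetical.

```python
from collections import defaultdict


def map_score_sketch(source_ids, labels, probabilities):
    """Mean average precision over queries (illustrative sketch only)."""
    per_query = defaultdict(list)
    for source_id, label, row in zip(source_ids, labels, probabilities):
        # row[1] is assumed to be the probability of the "translation" class.
        per_query[source_id].append((row[1], label))

    average_precisions = []
    for candidates in per_query.values():
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits, precisions = 0, []
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        if precisions:  # skip queries without any relevant candidate
            average_precisions.append(sum(precisions) / len(precisions))
    return sum(average_precisions) / len(average_precisions)


# Two queries: the first ranks its translation first, the second ranks it second.
toy_sources = [0, 0, 1, 1]
toy_labels = [1, 0, 1, 0]
toy_probabilities = [[0.1, 0.9], [0.8, 0.2], [0.7, 0.3], [0.2, 0.8]]
toy_map = map_score_sketch(toy_sources, toy_labels, toy_probabilities)
print(toy_map)
```

With exactly one correct translation per source sentence, the average precision of a query reduces to the reciprocal rank of that translation.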


In [55]:
import pickle
# Save the model to disk
filename = 'finalized_model_MLP.sav'
with open(filename, 'wb') as model_file:
    pickle.dump(model, model_file)

# Load the model from disk
with open(filename, 'rb') as model_file:
    loaded_model = pickle.load(model_file)
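
The saved model is only usable together with the fitted scaler and the selected feature list, since new data must be scaled with the training statistics before prediction. One option, sketched here with hypothetical helper names and placeholder objects, is to pickle all three in a single bundle. In the notebook the placeholders would be the fitted `model`, `scaler` and `start_features`.

```python
import os
import pickle
import tempfile


def save_bundle(path, **objects):
    """Pickle several named objects into one file (hypothetical helper)."""
    with open(path, "wb") as handle:
        pickle.dump(objects, handle)


def load_bundle(path):
    """Load the dict of named objects written by save_bundle."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "bundle_sketch.pkl")
    # Placeholders stand in for the fitted estimator and scaler.
    save_bundle(
        bundle_path,
        model="fitted model placeholder",
        scaler="fitted scaler placeholder",
        features=["jaccard_translation_proc_5k"],
    )
    restored = load_bundle(bundle_path)

print(sorted(restored))
```

As with any pickle file, such a bundle should only be loaded from a trusted source and with a compatible scikit-learn version.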

In [56]:
# Final features of the model
len(start_features)
start_features

['jaccard_translation_proc_5k',
 'jaccard_translation_vecmap',
 'euclidean_distance_tf_idf_vecmap',
 'euclidean_distance_tf_idf_proc_b_1k',
 'number_VERB_difference',
 'number_ADJ_difference',
 'number_#_difference_normalized',
 'number_ADJ_difference_relative']

# III. Other languages

## Use the model on English-Italian

In [57]:
feature_retrieval_it = pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_it

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,1,0.333333,0.046218,2,0.071429,0.046218,3,...,0.854424,0.852184,0.294122,0.074822,0.192029,0.824711,0.820362,0.271466,0.068599,0.192029
1,20000,20001,0,0,0.000000,0.034314,7,0.189189,0.034314,7,...,0.821553,0.759180,0.338124,0.084507,0.060662,0.733241,0.616938,0.351165,0.090769,0.060662
2,20000,20002,0,1,0.200000,0.007353,6,0.166667,0.007353,6,...,0.752849,0.665112,0.386866,0.099398,0.000000,0.540558,0.339125,0.432807,0.114732,0.000000
3,20000,20003,0,1,0.333333,0.072193,6,0.166667,0.072193,2,...,0.776020,0.708160,0.356853,0.093711,0.032292,0.615510,0.475073,0.384684,0.103048,0.031754
4,20000,20004,0,0,0.000000,0.030691,6,0.166667,0.030691,0,...,0.781948,0.717154,0.360449,0.096081,0.000000,0.659130,0.557169,0.376907,0.100984,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.000000,0.266667,2,0.083333,0.266667,3,...,0.629671,0.571876,0.485150,0.138779,0.000000,0.340138,0.254788,0.544548,0.155010,0.000000
499996,20099,24996,0,4,1.000000,0.266667,5,0.294118,0.266667,4,...,0.608893,0.541164,0.521543,0.206140,0.000000,0.352565,0.252309,0.584824,0.224727,0.000000
499997,20099,24997,0,4,1.000000,0.266667,5,0.185185,0.266667,6,...,0.634123,0.549922,0.471551,0.135978,0.000000,0.317182,0.216514,0.537252,0.150231,0.000000
499998,20099,24998,0,2,0.200000,0.168306,44,0.666667,0.168306,30,...,0.700972,0.652883,0.426266,0.122974,0.009434,0.404160,0.338733,0.492635,0.124281,0.009434


## Prepare the test set

In [58]:
target_test = feature_retrieval_it['Translation']
data_test = feature_retrieval_it.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,0.219697,0.192029,0.068599,0.074822,0,0,0.0,0.000000
1,0.060662,0.060662,0.090769,0.084507,1,1,0.0,0.333333
2,0.000000,0.000000,0.114732,0.099398,2,1,0.0,0.333333
3,0.032292,0.031754,0.103048,0.093711,0,0,0.0,0.000000
4,0.000000,0.000000,0.100984,0.096081,3,2,0.0,0.500000
...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.155010,0.138779,1,1,0.0,1.000000
499996,0.000000,0.000000,0.224727,0.206140,1,1,0.0,1.000000
499997,0.000000,0.000000,0.150231,0.135978,1,1,0.0,1.000000
499998,0.009434,0.009434,0.124281,0.122974,4,6,0.0,1.000000


## Use model

In [59]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,2.697973,2.136147,-0.224990,-0.082522,-1.145620,-1.074044,-0.007071,-1.253327
1,-0.141238,-0.206646,0.090893,0.056906,-0.592309,-0.513583,-0.007071,-0.345023
2,-1.224216,-1.288483,0.432320,0.271281,-0.038998,-0.513583,-0.007071,-0.345023
3,-0.647722,-0.722184,0.265843,0.189407,-1.145620,-1.074044,-0.007071,-1.253327
4,-1.224216,-1.288483,0.236446,0.223524,0.514313,0.046877,-0.007071,0.109130
...,...,...,...,...,...,...,...,...
499995,-1.224216,-1.288483,1.006231,0.838213,-0.592309,-0.513583,-0.007071,1.471587
499996,-1.224216,-1.288483,1.999589,1.807956,-0.592309,-0.513583,-0.007071,1.471587
499997,-1.224216,-1.288483,0.938138,0.797892,-0.592309,-0.513583,-0.007071,1.471587
499998,-1.055794,-1.120238,0.568392,0.610689,1.067623,2.288720,-0.007071,1.471587


In [60]:
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_it['source_id'], target_test, prediction)
print("The Italian MAP score on test set: {:.4f}".format(MapScore))

The Italian MAP score on test set: 0.8167
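
The evaluation on a new test set always follows the same four steps: select the features, scale them with the already fitted scaler, predict probabilities, and score. A small helper that receives the frame once keeps the ids, labels and features from the same data. The sketch below uses hypothetical stub classes and a stand-in metric so that it runs on its own; in the notebook one would pass the fitted `scaler`, `model` and `MAP_score` instead.

```python
import pandas as pd


def evaluate_retrieval(frame, features, scaler, model, score_fn):
    """Filter, scale, predict and score one retrieval frame (sketch)."""
    data = frame.filter(items=features)
    scaled = scaler.transform(data.to_numpy())
    prediction = model.predict_proba(scaled)
    # Ids and labels come from the same frame as the features.
    return score_fn(frame["source_id"], frame["Translation"], prediction)


class IdentityScalerStub:
    """Stand-in for a fitted scaler; returns the values unchanged."""
    def transform(self, values):
        return values


class FirstColumnModelStub:
    """Stand-in for a classifier; uses the first feature as P(class 1)."""
    def predict_proba(self, values):
        return [[1.0 - row[0], row[0]] for row in values]


def top1_accuracy(source_ids, labels, prediction):
    """Stand-in metric: share of queries whose top candidate is correct."""
    best = {}
    for source_id, label, row in zip(source_ids, labels, prediction):
        if source_id not in best or row[1] > best[source_id][0]:
            best[source_id] = (row[1], label)
    return sum(label for _, label in best.values()) / len(best)


toy_frame = pd.DataFrame({
    "source_id": [0, 0, 1, 1],
    "Translation": [1, 0, 0, 1],
    "feature": [0.9, 0.1, 0.2, 0.7],
})
toy_result = evaluate_retrieval(
    toy_frame, ["feature"], IdentityScalerStub(), FirstColumnModelStub(), top1_accuracy
)
print(toy_result)
```

Passing NumPy arrays to the model also matches how the final model was fitted, which used `to_numpy()` on the training data.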


## Use the model on English-Polish

In [61]:
feature_retrieval_pl = pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_pl

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,0,0.0,0.000000,5,0.217391,0.000000,3,...,0.834714,0.787117,0.343884,0.096155,0.244485,0.802804,0.713881,0.337174,0.100275,0.242647
1,20000,20001,0,0,0.0,0.000000,7,0.333333,0.000000,5,...,0.736686,0.695557,0.449273,0.171380,0.000000,0.670068,0.575405,0.452836,0.173662,0.000000
2,20000,20002,0,11,1.0,0.123596,64,0.695652,0.123596,53,...,0.859428,0.801389,0.306350,0.114496,0.050166,0.831464,0.732550,0.313809,0.113891,0.049837
3,20000,20003,0,2,1.0,0.222222,7,0.333333,0.222222,5,...,0.703519,0.662172,0.456495,0.162102,0.000000,0.585979,0.515644,0.483511,0.166323,0.000000
4,20000,20004,0,2,1.0,0.071429,12,0.300000,0.071429,12,...,0.765796,0.681011,0.385775,0.112808,0.030303,0.627124,0.444584,0.426907,0.126310,0.030303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.0,0.153846,7,0.137255,0.153846,1,...,0.791923,0.728771,0.331753,0.075849,0.026671,0.536269,0.325363,0.381850,0.088468,0.025978
499996,20099,24996,0,2,1.0,0.125000,15,0.348837,0.125000,7,...,0.779662,0.718409,0.357416,0.084390,0.047379,0.694737,0.528046,0.351784,0.088915,0.030835
499997,20099,24997,0,1,1.0,0.050000,10,0.208333,0.050000,2,...,0.773840,0.736827,0.350169,0.074033,0.013514,0.465373,0.364922,0.427608,0.090424,0.000000
499998,20099,24998,0,1,1.0,0.047619,9,0.183673,0.047619,2,...,0.731214,0.671121,0.383888,0.084993,0.069884,0.440550,0.286580,0.444644,0.100719,0.054094


## Prepare the test set

In [62]:
target_test = feature_retrieval_pl['Translation']
data_test = feature_retrieval_pl.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,0.242647,0.242647,0.100275,0.096155,1,1,0.0,0.333333
1,0.027778,0.000000,0.173662,0.171380,0,2,0.0,1.000000
2,0.043275,0.049837,0.113891,0.114496,4,12,0.0,0.750000
3,0.000000,0.000000,0.166323,0.162102,0,1,0.0,0.333333
4,0.030303,0.030303,0.126310,0.112808,1,1,0.0,0.200000
...,...,...,...,...,...,...,...,...
499995,0.040936,0.025978,0.088468,0.075849,0,1,0.0,0.333333
499996,0.031281,0.030835,0.088915,0.084390,2,1,0.0,0.200000
499997,0.013514,0.000000,0.090424,0.074033,3,1,0.0,0.200000
499998,0.069884,0.054094,0.100719,0.084993,2,1,0.0,0.200000


## Use model

In [63]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,3.107695,3.038866,0.226334,0.224596,-0.592309,-0.513583,-0.007071,-0.345023
1,-0.728307,-1.288483,1.271983,1.307549,-1.145620,0.046877,-0.007071,1.471587
2,-0.451635,-0.399701,0.420348,0.488642,1.067623,5.651484,-0.007071,0.790358
3,-1.224216,-1.288483,1.167412,1.173976,-1.145620,-0.513583,-0.007071,-0.345023
4,-0.683224,-0.748061,0.597295,0.464334,-0.592309,-0.513583,-0.007071,-0.708344
...,...,...,...,...,...,...,...,...
499995,-0.493402,-0.825186,0.058106,-0.067736,-1.145620,-0.513583,-0.007071,-0.345023
499996,-0.665773,-0.738575,0.064478,0.055222,-0.038998,-0.513583,-0.007071,-0.708344
499997,-0.982963,-1.288483,0.085979,-0.093870,0.514313,-0.513583,-0.007071,-0.708344
499998,0.023407,-0.323782,0.232667,0.063907,-0.038998,-0.513583,-0.007071,-0.708344


In [64]:
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_pl['source_id'], target_test, prediction)
print("The Polish MAP score on test set: {:.4f}".format(MapScore))

The Polish MAP score on test set: 0.8081


# IV. Document level

## Use the model on German-English doc

In [65]:
feature_retrieval_doc = pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_doc

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,5,1.000000,0.178571,290,0.863095,0.178571,144,...,0.580389,0.549913,0.531730,0.054664,0.007030,0.158882,0.125875,0.602464,0.057259,0.007866
1,0,1,0,5,1.000000,0.178571,335,0.879265,0.178571,185,...,0.543362,0.521699,0.567206,0.055330,0.005784,0.073765,0.043059,0.656281,0.059766,0.006825
2,0,2,0,5,1.000000,0.178571,329,0.877333,0.178571,172,...,0.512680,0.452094,0.609772,0.057912,0.006540,0.082798,0.011172,0.693836,0.061963,0.007726
3,0,3,0,5,1.000000,0.178571,368,0.888889,0.178571,167,...,0.536417,0.502595,0.580073,0.056595,0.006444,0.101143,0.050886,0.671459,0.058318,0.008272
4,0,4,0,5,1.000000,0.178571,389,0.894253,0.178571,174,...,0.530223,0.491524,0.575751,0.056442,0.005852,0.056589,-0.003096,0.673305,0.061273,0.006961
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,1,0.333333,0.114773,332,0.917127,0.114773,167,...,0.486395,0.418157,0.623162,0.129749,0.000000,-0.016775,-0.145692,0.752196,0.133761,0.000000
499996,99,4996,0,2,1.000000,0.117647,373,0.925558,0.117647,155,...,0.474177,0.402875,0.631539,0.131067,0.000000,-0.030901,-0.146376,0.744337,0.132629,0.000000
499997,99,4997,0,2,1.000000,0.117647,140,0.823529,0.117647,89,...,0.559839,0.524665,0.551034,0.123131,0.000000,0.083889,0.026164,0.649910,0.130587,0.000000
499998,99,4998,0,0,0.000000,0.006536,1,0.032258,0.006536,4,...,0.553223,0.475026,0.565722,0.140282,0.000000,0.174603,0.126667,0.642268,0.157276,0.000000


## Prepare the test set

In [66]:
target_test = feature_retrieval_doc['Translation']
data_test = feature_retrieval_doc.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,0.007874,0.007866,0.057259,0.054664,11,17,0.0,0.680000
1,0.006623,0.006825,0.059766,0.055330,7,10,0.0,0.555556
2,0.007692,0.007726,0.061963,0.057912,2,6,0.0,0.428571
3,0.004032,0.008272,0.058318,0.056595,2,18,0.0,0.692308
4,0.006849,0.006961,0.061273,0.056442,9,24,0.0,0.750000
...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.133761,0.129749,19,37,0.0,0.948718
499996,0.000000,0.000000,0.132629,0.131067,7,10,0.0,0.833333
499997,0.004854,0.000000,0.130587,0.123131,4,3,0.0,0.600000
499998,0.000000,0.000000,0.157276,0.140282,0,2,0.0,0.500000


## Use model

In [67]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,euclidean_distance_tf_idf_vecmap,euclidean_distance_tf_idf_proc_b_1k,number_VERB_difference,number_ADJ_difference,number_#_difference_normalized,number_ADJ_difference_relative
0,-1.083643,-1.148194,-0.386573,-0.372711,4.940800,8.453787,-0.007071,0.599614
1,-1.105986,-1.166766,-0.350860,-0.363133,2.727556,4.530563,-0.007071,0.260514
2,-1.086887,-1.150694,-0.319545,-0.325956,-0.038998,2.288720,-0.007071,-0.085507
3,-1.152229,-1.140954,-0.371484,-0.344923,-0.038998,9.014248,-0.007071,0.633152
4,-1.101937,-1.164342,-0.329379,-0.347125,3.834178,12.377011,-0.007071,0.790358
...,...,...,...,...,...,...,...,...
499995,-1.224216,-1.288483,0.703466,0.708221,9.367287,19.663000,-0.007071,1.331847
499996,-1.224216,-1.288483,0.687331,0.727202,2.727556,4.530563,-0.007071,1.017434
499997,-1.137552,-1.288483,0.658232,0.612942,1.067623,0.607338,-0.007071,0.381621
499998,-1.224216,-1.288483,1.038513,0.859858,-1.145620,0.046877,-0.007071,0.109130


In [69]:
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_doc['source_id'], target_test, prediction)
print("The Doc MAP score on test set: {:.4f}".format(MapScore))

The Doc MAP score on test set: 0.0005
