# Supervised Retrieval

In this notebook we train a supervised classification model for a cross-lingual information retrieval task, using scikit-learn's built-in LogisticRegression. We first prepare the data and then run a pipeline of forward feature selection and hyperparameter optimization via grid search to find the best model. We then apply the model trained on English-German to Italian, Polish, and document-level data. The results for the other languages are good, whereas the document-level results are, as expected, considerably weaker.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

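In this section we import two feature dataframes: `feature_dataframe`, which is used to fit the model, and `feature_retrieval`, on which the retrieval performance is evaluated.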

In [2]:
feature_dataframe = pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

In [3]:
feature_dataframe

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,10,0.555556,0.184124,5,0.061728,0.184124,3,...,0.881515,0.859575,0.248435,0.043538,0.014706,0.689299,0.627702,0.298771,0.051651,0.007246
1,1,1,1,0,0.000000,0.000000,0,0.000000,0.000000,0,...,0.894710,0.871179,0.228432,0.056366,0.260714,0.836215,0.789221,0.216985,0.054289,0.281404
2,2,2,1,3,1.000000,0.250000,2,0.125000,0.250000,2,...,0.771116,0.792226,0.389483,0.130394,0.240385,0.643538,0.690642,0.409364,0.135454,0.291667
3,3,3,1,0,0.000000,0.028070,4,0.133333,0.028070,4,...,0.859097,0.854491,0.282293,0.074980,0.120000,0.757080,0.754322,0.293724,0.076331,0.120000
4,4,4,1,0,0.000000,0.012605,3,0.103448,0.012605,2,...,0.861526,0.848707,0.296092,0.079806,0.103679,0.795838,0.783914,0.306117,0.081788,0.125217
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219995,19999,10689,0,1,0.333333,0.062937,1,0.047619,0.062937,1,...,0.796217,0.780328,0.354726,0.112828,0.000000,0.668862,0.638944,0.366766,0.119223,0.000000
219996,19999,9781,0,2,0.500000,0.059091,7,0.259259,0.059091,7,...,0.801057,0.783654,0.336315,0.113979,0.080128,0.653291,0.621897,0.356319,0.119032,0.080128
219997,19999,7757,0,1,0.333333,0.026738,5,0.200000,0.026738,5,...,0.770226,0.775183,0.376769,0.111710,0.048055,0.630458,0.643871,0.390613,0.115372,0.048055
219998,19999,4932,0,1,0.333333,0.007576,12,0.375000,0.007576,11,...,0.801876,0.789016,0.342303,0.113451,0.000000,0.658583,0.630707,0.364702,0.117218,0.000000


#### Delete all columns with only one value

In [4]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
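
The helper `threshold_counts` comes from `src.models.predict_model` and its implementation is not shown in this notebook. As a minimal sketch, assuming it simply keeps a column when it holds more than `threshold` distinct values, the filtering step could look like the following. The name `keep_column` and the toy table are hypothetical and only illustrate the idea.

```python
def keep_column(values, threshold=1):
    """Return True if the column has more than `threshold` distinct values.

    Hypothetical stand-in for threshold_counts; the real helper may differ.
    """
    return len(set(values)) > threshold


# Toy table: column name -> list of values (illustrative data only)
table = {
    "constant_feature": [0.0, 0.0, 0.0, 0.0],
    "useful_feature": [0.1, 0.4, 0.4, 0.9],
}

# Boolean mask per column, analogous to column_mask in the cell above
mask = {name: keep_column(values) for name, values in table.items()}
filtered = {name: values for name, values in table.items() if mask[name]}
print(sorted(filtered))
```

Note that the notebook computes the mask on `feature_dataframe` and reuses it for `feature_retrieval`, which only works if both dataframes have the same columns in the same order.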

## II. Supervised Retrieval

# Logistic Regression

#### Start with one feature and perform selection and tuning on validation set 

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]
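
The forward selection itself runs inside `pipeline_model_optimization`, whose code is not part of this notebook. Judging only from its arguments and its log output, it seems to sweep repeatedly over the candidate features and keep a feature whenever the MAP score improves by more than `threshold_map_feature_selection`. The sketch below shows that greedy scheme under those assumptions; `forward_selection_sketch`, `score_fn` and the toy gains are hypothetical names and values, not the project's implementation.

```python
def forward_selection_sketch(start, candidates, score_fn, min_gain=0.0001):
    """Greedy forward selection: add a feature if it raises the score enough.

    score_fn(list_of_features) -> float is assumed to train and evaluate a model.
    """
    selected = list(start)
    best = score_fn(selected)
    improved = True
    while improved:  # one pass per "iteration through the feature list"
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score - best > min_gain:
                selected.append(feature)
                best = score
                improved = True
    return selected, best


# Toy scoring function: each feature contributes a fixed, made-up gain
toy_gain = {"a": 0.5, "b": 0.2, "c": 0.00001, "d": -0.1}
toy_score = lambda features: sum(toy_gain[f] for f in features)

chosen, final_score = forward_selection_sketch(["a"], ["b", "c", "d"], toy_score)
print(chosen, round(final_score, 4))
```

In the real pipeline, `score_fn` would correspond to fitting the scaled logistic regression on `feature_dataframe` and computing the MAP score on `feature_retrieval`.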

In [6]:
model = LogisticRegression()
scaler = preprocessing.StandardScaler()

LR_parameter_grid = {
    'penalty' : ['l1', 'l2','elasticnet'],
    'C' : np.logspace(-4, 4, 50),
    'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter':[100000]
}

LR_best_features, LR_best_parameter_combination, LR_best_map_score, LR_all_parameter_combination = \
pipeline_model_optimization(model, LR_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.0001)

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7515
Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.7580
Updated MAP score on test set with new feature cosine_similarity_tf_idf_vecmap: 0.7990
Updated MAP score on test set with new feature cosine_similarity_tf_idf_proc_b_1k: 0.8145
Updated MAP score on test set with new feature cosine_similarity_average_proc_5k: 0.8174
Updated MAP score on test set with new feature jaccard_numbers_source: 0.8217
Updated MAP score on test set with new feature number_ADJ_difference_relative: 0.8280
Updated MAP score on test set with new feature number_?_difference_relative: 0.8283
Updated MAP score on test set with new feature number_-_difference_normalized: 0.8288
Updated MAP score on test set with new feature number_-_difference_relative: 0.8296

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8296
Upd

Hyperparameter Tuning:   0%|          | 0/750 [00:00<?, ?it/s]



-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 750
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   0%|          | 3/750 [00:01<05:51,  2.12it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0001, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7544100270322174}
With Map Score 0.7544
Model failed to fit


Hyperparameter Tuning:   1%|          | 5/750 [00:03<09:55,  1.25it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   1%|          | 8/750 [00:05<07:54,  1.56it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00014563484775012445, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7652476563227828}
With Map Score 0.7652
Model failed to fit


Hyperparameter Tuning:   1%|▏         | 10/750 [00:07<10:40,  1.16it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   2%|▏         | 13/750 [00:09<08:29,  1.45it/s]

Model failed to fit


Hyperparameter Tuning:   2%|▏         | 15/750 [00:11<10:27,  1.17it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   2%|▏         | 18/750 [00:12<08:22,  1.46it/s]

Model failed to fit


Hyperparameter Tuning:   3%|▎         | 20/750 [00:15<09:57,  1.22it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   3%|▎         | 23/750 [00:16<08:25,  1.44it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0004498432668969444, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7739473363010971}
With Map Score 0.7739
Model failed to fit


Hyperparameter Tuning:   3%|▎         | 25/750 [00:19<10:06,  1.19it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0004498432668969444, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7739625096319901}
With Map Score 0.7740
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   4%|▎         | 28/750 [00:20<08:31,  1.41it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0006551285568595509, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7739946676952845}
With Map Score 0.7740
Model failed to fit


Hyperparameter Tuning:   4%|▍         | 30/750 [00:23<11:00,  1.09it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0006551285568595509, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7792693520630847}
With Map Score 0.7793
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   4%|▍         | 33/750 [00:25<09:39,  1.24it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0009540954763499944, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7992154157754088}
With Map Score 0.7992
Model failed to fit


Hyperparameter Tuning:   5%|▍         | 35/750 [00:27<10:52,  1.10it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   5%|▌         | 38/750 [00:30<09:58,  1.19it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0013894954943731374, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8084935496526834}
With Map Score 0.8085
Model failed to fit


Hyperparameter Tuning:   5%|▌         | 40/750 [00:32<11:12,  1.06it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0013894954943731374, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8086606219016909}
With Map Score 0.8087
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   6%|▌         | 43/750 [00:34<10:09,  1.16it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0020235896477251557, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8107694505482244}
With Map Score 0.8108
Model failed to fit


Hyperparameter Tuning:   6%|▌         | 45/750 [00:37<11:50,  1.01s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0020235896477251557, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8171778126106586}
With Map Score 0.8172
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   6%|▋         | 48/750 [00:39<10:50,  1.08it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0029470517025518097, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8233966895467902}
With Map Score 0.8234
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 50/750 [00:43<12:50,  1.10s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 53/750 [00:48<15:39,  1.35s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8242853095881959}
With Map Score 0.8243
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 55/750 [00:53<18:22,  1.59s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   8%|▊         | 58/750 [01:02<23:47,  2.06s/it]

Model failed to fit


Hyperparameter Tuning:   8%|▊         | 60/750 [01:08<27:09,  2.36s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   8%|▊         | 63/750 [01:16<28:42,  2.51s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.009102981779915217, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8288370427501803}
With Map Score 0.8288
Model failed to fit


Hyperparameter Tuning:   9%|▊         | 65/750 [01:23<30:02,  2.63s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   9%|▉         | 68/750 [01:31<30:47,  2.71s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.013257113655901081, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8306217741183047}
With Map Score 0.8306
Model failed to fit


Hyperparameter Tuning:   9%|▉         | 70/750 [01:36<30:05,  2.66s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.013257113655901081, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8306911195079453}
With Map Score 0.8307
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  10%|▉         | 73/750 [01:42<27:32,  2.44s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.019306977288832496, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8317131971721647}
With Map Score 0.8317
Model failed to fit


Hyperparameter Tuning:  10%|█         | 75/750 [01:48<28:35,  2.54s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.019306977288832496, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8318501267582217}
With Map Score 0.8319
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  10%|█         | 78/750 [01:56<28:40,  2.56s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.02811768697974228, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8318881926565118}
With Map Score 0.8319
Model failed to fit


Hyperparameter Tuning:  11%|█         | 80/750 [02:00<27:09,  2.43s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.02811768697974228, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8319014528060897}
With Map Score 0.8319
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  11%|█         | 83/750 [02:09<29:37,  2.66s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.040949150623804234, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8319469328806642}
With Map Score 0.8319
Model failed to fit


Hyperparameter Tuning:  11%|█▏        | 85/750 [02:12<26:57,  2.43s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.040949150623804234, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8319470064508567}
With Map Score 0.8319
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 88/750 [02:21<27:54,  2.53s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.05963623316594643, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8319613722369299}
With Map Score 0.8320
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 90/750 [02:24<25:50,  2.35s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.05963623316594643, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8319767847550423}
With Map Score 0.8320
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 93/750 [02:34<28:44,  2.62s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.08685113737513521, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8320244253359136}
With Map Score 0.8320
Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 95/750 [02:39<28:36,  2.62s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.08685113737513521, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8320281077992018}
With Map Score 0.8320
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 98/750 [02:49<31:56,  2.94s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.12648552168552957, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8320830133794069}
With Map Score 0.8321
Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 100/750 [02:56<32:07,  2.97s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.12648552168552957, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8320830411728521}
With Map Score 0.8321
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  14%|█▎        | 103/750 [03:06<34:13,  3.17s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.18420699693267145, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8320997719524094}
With Map Score 0.8321
Model failed to fit


Hyperparameter Tuning:  14%|█▍        | 105/750 [03:12<33:25,  3.11s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  14%|█▍        | 108/750 [03:20<31:36,  2.95s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.2682695795279725, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8320999773216805}
With Map Score 0.8321
Model failed to fit


Hyperparameter Tuning:  15%|█▍        | 110/750 [03:26<31:09,  2.92s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.2682695795279725, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8321036513869355}
With Map Score 0.8321
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  15%|█▌        | 113/750 [03:33<28:45,  2.71s/it]

Model failed to fit


Hyperparameter Tuning:  15%|█▌        | 115/750 [03:38<29:12,  2.76s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  16%|█▌        | 118/750 [03:46<28:31,  2.71s/it]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8321179820717339}
With Map Score 0.8321
Model failed to fit


Hyperparameter Tuning:  16%|█▌        | 120/750 [03:53<29:26,  2.80s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  16%|█▋        | 123/750 [03:58<25:13,  2.41s/it]

Model failed to fit


Hyperparameter Tuning:  17%|█▋        | 125/750 [04:03<25:59,  2.49s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  17%|█▋        | 128/750 [04:13<28:21,  2.74s/it]

Model failed to fit


Hyperparameter Tuning:  17%|█▋        | 130/750 [04:18<28:35,  2.77s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  18%|█▊        | 133/750 [04:23<24:34,  2.39s/it]

Model failed to fit


Hyperparameter Tuning:  18%|█▊        | 135/750 [04:28<24:34,  2.40s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  18%|█▊        | 138/750 [04:36<25:09,  2.47s/it]

Model failed to fit


Hyperparameter Tuning:  19%|█▊        | 140/750 [04:41<24:43,  2.43s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  19%|█▉        | 143/750 [04:48<25:05,  2.48s/it]

Model failed to fit


Hyperparameter Tuning:  19%|█▉        | 145/750 [04:54<25:32,  2.53s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  20%|█▉        | 148/750 [05:01<25:11,  2.51s/it]

Model failed to fit


Hyperparameter Tuning:  20%|██        | 150/750 [05:07<26:27,  2.65s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  20%|██        | 153/750 [05:16<27:27,  2.76s/it]

Model failed to fit


Hyperparameter Tuning:  21%|██        | 155/750 [05:21<27:02,  2.73s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  21%|██        | 158/750 [05:29<26:10,  2.65s/it]

Model failed to fit


Hyperparameter Tuning:  21%|██▏       | 160/750 [05:35<27:04,  2.75s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  22%|██▏       | 163/750 [05:45<28:15,  2.89s/it]

Model failed to fit


Hyperparameter Tuning:  22%|██▏       | 165/750 [05:50<27:49,  2.85s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  22%|██▏       | 168/750 [05:54<22:45,  2.35s/it]

Model failed to fit


Hyperparameter Tuning:  23%|██▎       | 170/750 [06:00<24:17,  2.51s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  23%|██▎       | 173/750 [06:06<22:42,  2.36s/it]

Model failed to fit


Hyperparameter Tuning:  23%|██▎       | 175/750 [06:13<24:17,  2.54s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  24%|██▎       | 178/750 [06:24<27:53,  2.92s/it]

Model failed to fit


Hyperparameter Tuning:  24%|██▍       | 180/750 [06:29<27:50,  2.93s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  24%|██▍       | 183/750 [06:37<26:24,  2.79s/it]

Model failed to fit


Hyperparameter Tuning:  25%|██▍       | 185/750 [06:43<26:26,  2.81s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  25%|██▌       | 188/750 [06:47<22:08,  2.36s/it]

Model failed to fit


Hyperparameter Tuning:  25%|██▌       | 190/750 [06:54<23:43,  2.54s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  26%|██▌       | 193/750 [07:04<27:09,  2.92s/it]

Model failed to fit


Hyperparameter Tuning:  26%|██▌       | 195/750 [07:09<26:06,  2.82s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  26%|██▋       | 198/750 [07:19<26:46,  2.91s/it]

Model failed to fit


Hyperparameter Tuning:  27%|██▋       | 200/750 [07:25<27:10,  2.96s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  27%|██▋       | 203/750 [07:32<25:29,  2.80s/it]

Model failed to fit


Hyperparameter Tuning:  27%|██▋       | 205/750 [07:38<25:15,  2.78s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  28%|██▊       | 208/750 [07:44<22:26,  2.48s/it]

Model failed to fit


Hyperparameter Tuning:  28%|██▊       | 210/750 [07:49<23:15,  2.58s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  28%|██▊       | 213/750 [07:58<23:51,  2.67s/it]

Model failed to fit


Hyperparameter Tuning:  29%|██▊       | 215/750 [08:04<24:11,  2.71s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  29%|██▉       | 218/750 [08:11<23:42,  2.67s/it]

Model failed to fit


Hyperparameter Tuning:  29%|██▉       | 220/750 [08:17<23:40,  2.68s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  30%|██▉       | 223/750 [08:19<18:03,  2.06s/it]

Model failed to fit


Hyperparameter Tuning:  30%|███       | 225/750 [08:26<20:10,  2.31s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  30%|███       | 228/750 [08:31<18:34,  2.14s/it]

Model failed to fit


Hyperparameter Tuning:  31%|███       | 230/750 [08:37<20:18,  2.34s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  31%|███       | 233/750 [08:45<21:07,  2.45s/it]

Model failed to fit


Hyperparameter Tuning:  31%|███▏      | 235/750 [08:51<22:16,  2.60s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  32%|███▏      | 238/750 [09:02<25:28,  2.99s/it]

Model failed to fit


Hyperparameter Tuning:  32%|███▏      | 240/750 [09:06<23:21,  2.75s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  32%|███▏      | 243/750 [09:13<22:02,  2.61s/it]

Model failed to fit


Hyperparameter Tuning:  33%|███▎      | 245/750 [09:20<23:23,  2.78s/it]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  33%|███▎      | 248/750 [09:28<23:03,  2.76s/it]

Model failed to fit


Hyperparameter Tuning:  45%|████▍     | 336/750 [12:46<20:53,  3.03s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.05963623316594643, 'solver': 'newton-cg', 'max_iter': 100000, 'MAP_score': 0.8321271436635727}
With Map Score 0.8321


Hyperparameter Tuning:  45%|████▌     | 341/750 [13:00<20:46,  3.05s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'newton-cg', 'max_iter': 100000, 'MAP_score': 0.8321438012457436}
With Map Score 0.8321


Hyperparameter Tuning:  46%|████▌     | 342/750 [13:02<18:01,  2.65s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'lbfgs', 'max_iter': 100000, 'MAP_score': 0.8321438045658408}
With Map Score 0.8321


Hyperparameter Tuning: 100%|██████████| 750/750 [21:01<00:00,  1.68s/it]

Model failed to fit (message repeated 50 times)
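
The many `Model failed to fit` messages are consistent with the grid containing solver and penalty pairs that scikit-learn's `LogisticRegression` rejects. According to the scikit-learn documentation (for versions that accept a `penalty` argument), `newton-cg`, `lbfgs` and `sag` support only `l2`, `liblinear` supports `l1` and `l2`, and `elasticnet` requires `saga` together with an `l1_ratio`, which this grid does not set. The sketch below filters the grid to combinations that can fit at all. `SUPPORTED_PENALTIES` and `valid_combinations` are hypothetical helpers, and the `C` values are built without NumPy to mirror `np.logspace(-4, 4, 50)`.

```python
from itertools import product

# Penalties each solver accepts (assumption taken from the scikit-learn docs)
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2"},
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}


def valid_combinations(grid):
    """Yield only the parameter dicts that LogisticRegression should accept."""
    keys = list(grid)
    for values in product(*(grid[key] for key in keys)):
        params = dict(zip(keys, values))
        if params["penalty"] not in SUPPORTED_PENALTIES[params["solver"]]:
            continue
        # elasticnet additionally needs an l1_ratio, which the grid lacks
        if params["penalty"] == "elasticnet" and "l1_ratio" not in params:
            continue
        yield params


grid = {
    "penalty": ["l1", "l2", "elasticnet"],
    "C": [10 ** (-4 + 8 * i / 49) for i in range(50)],  # like np.logspace(-4, 4, 50)
    "solver": ["newton-cg", "lbfgs", "liblinear", "sag", "saga"],
    "max_iter": [100000],
}

valid = list(valid_combinations(grid))
total = len(list(product(*grid.values())))
print(f"{len(valid)} of {total} combinations are valid")
```

Pre-filtering the grid this way would skip the fits that are bound to fail, so the search only spends time on combinations that can produce a score.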





In [18]:
# the tuned hyperparameters give a better MAP score, so we fix them and run forward selection once more
scaler = preprocessing.StandardScaler()
model = LogisticRegression(max_iter=100000,penalty="l2", C= 0.08685113737513521, solver='lbfgs')
feature_selection(model,scaler,feature_dataframe,feature_retrieval,start_features,added_features)

The initial MAP score on test set: 0.8323


0.8322730336275701

In [9]:
start_features

['jaccard_translation_proc_5k',
 'jaccard_translation_vecmap',
 'cosine_similarity_tf_idf_vecmap',
 'cosine_similarity_tf_idf_proc_b_1k',
 'cosine_similarity_average_proc_5k',
 'jaccard_numbers_source',
 'number_ADJ_difference_relative',
 'number_?_difference_relative',
 'number_-_difference_normalized',
 'number_-_difference_relative',
 'euclidean_distance_tf_idf_vecmap',
 'number_ADJ_difference',
 'characters_avg_difference_normalized',
 'number_?_difference_normalized']

In [22]:
#final model
scaler = preprocessing.StandardScaler()
model = LogisticRegression(max_iter=100000,penalty="l2", C=  0.08685113737513521, solver='lbfgs')
target_train = feature_dataframe['Translation']
target_test = feature_retrieval['Translation']
data_train = feature_dataframe.filter(items=start_features)
data_test = feature_retrieval.filter(items=start_features)
# scale the features
data_train[data_train.columns] = scaler.fit_transform(data_train[data_train.columns])
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# fit the model and get the initial MapScore
model.fit(data_train.to_numpy(), target_train.to_numpy())
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval['source_id'], target_test, prediction)
print("The initial MAP score on test set: {:.4f}".format(MapScore))

The initial MAP score on test set: 0.8323
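
`MAP_score` is imported from `src.models.predict_model` and is not defined in this notebook. As a rough sketch of the standard mean average precision it presumably computes, the candidates of each source sentence are ranked by predicted score, the precision is taken at every rank holding a true translation, and the per-query averages are averaged again. The function name `map_score_sketch` and the toy data are hypothetical; the sketch also takes a flat list of positive-class scores rather than the two-column `predict_proba` output.

```python
from collections import defaultdict


def map_score_sketch(source_ids, labels, scores):
    """Mean average precision over queries identified by source_ids."""
    by_query = defaultdict(list)
    for source_id, label, score in zip(source_ids, labels, scores):
        by_query[source_id].append((score, label))

    average_precisions = []
    for candidates in by_query.values():
        # Rank candidates from highest to lowest predicted score
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits = 0
        precision_sum = 0.0
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precision_sum += hits / rank
        if hits:  # queries without any relevant candidate are skipped
            average_precisions.append(precision_sum / hits)
    return sum(average_precisions) / len(average_precisions)


# Two toy queries: the translation is ranked 1st for query 0 and 2nd for query 1
toy_ids = [0, 0, 0, 1, 1, 1]
toy_labels = [1, 0, 0, 1, 0, 0]
toy_scores = [0.9, 0.2, 0.5, 0.3, 0.8, 0.1]
print(map_score_sketch(toy_ids, toy_labels, toy_scores))  # (1.0 + 0.5) / 2 = 0.75
```

With exactly one correct translation per source sentence, as in this retrieval setup, the average precision of a query reduces to the reciprocal rank of its translation.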


In [15]:
# save the model to disk
import pickle
filename = 'finalized_model_LR.sav'
with open(filename, 'wb') as f:
    pickle.dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)
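
Because the features are standardized before fitting, a reloaded model is only usable together with the fitted scaler and the list of selected features. One possible way to keep them in sync, sketched here with placeholder objects instead of the real estimator, is to pickle them as a single bundle. `save_bundle`, `load_bundle` and the file name are hypothetical.

```python
import os
import pickle
import tempfile


def save_bundle(path, model, scaler, feature_names):
    """Store model, scaler and feature order together in one pickle file."""
    bundle = {"model": model, "scaler": scaler, "features": list(feature_names)}
    with open(path, "wb") as handle:
        pickle.dump(bundle, handle)


def load_bundle(path):
    """Load the bundle back; only unpickle files from a trusted source."""
    with open(path, "rb") as handle:
        return pickle.load(handle)


# Placeholder objects stand in for the fitted LogisticRegression and StandardScaler
with tempfile.TemporaryDirectory() as tmp_dir:
    bundle_path = os.path.join(tmp_dir, "finalized_bundle_LR.sav")
    save_bundle(bundle_path, {"kind": "model stand-in"}, {"kind": "scaler stand-in"},
                ["jaccard_translation_proc_5k"])
    restored = load_bundle(bundle_path)

print(restored["features"])
```

In the notebook, the arguments would be `model`, `scaler` and `start_features`.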

In [16]:
model.coef_

array([[ 2.48980022,  0.86483488,  1.68772459, -0.67613725, -0.77818001,
         0.61111047, -0.04341167, -0.67556591,  0.06248484, -0.42091946,
         0.11493594, -0.65458572, -0.06658312,  0.03263054]])
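
The raw coefficient array is easier to read when paired with the feature names. The sketch below does that with the values printed above, assuming the columns were in the order of the `start_features` list shown earlier. Since the inputs were standardized, the absolute values give a rough indication of each feature's influence.

```python
feature_names = [
    "jaccard_translation_proc_5k", "jaccard_translation_vecmap",
    "cosine_similarity_tf_idf_vecmap", "cosine_similarity_tf_idf_proc_b_1k",
    "cosine_similarity_average_proc_5k", "jaccard_numbers_source",
    "number_ADJ_difference_relative", "number_?_difference_relative",
    "number_-_difference_normalized", "number_-_difference_relative",
    "euclidean_distance_tf_idf_vecmap", "number_ADJ_difference",
    "characters_avg_difference_normalized", "number_?_difference_normalized",
]
# Values copied from the model.coef_ output above
coefficients = [
    2.48980022, 0.86483488, 1.68772459, -0.67613725, -0.77818001,
    0.61111047, -0.04341167, -0.67556591, 0.06248484, -0.42091946,
    0.11493594, -0.65458572, -0.06658312, 0.03263054,
]

# Sort features by the magnitude of their coefficient, largest first
ranked = sorted(zip(feature_names, coefficients), key=lambda item: abs(item[1]), reverse=True)
for name, weight in ranked:
    print(f"{name:40s} {weight:+.4f}")
```

In the notebook itself, the same pairing can be obtained with `zip(start_features, model.coef_[0])`.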

# Test our model on an independent English-German test set

In [31]:
feature_retrieval_test = pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")
feature_retrieval_test = feature_retrieval_test.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_test

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,0,0.000000,0.033333,3,0.130435,0.033333,5,...,0.776923,0.752113,0.380413,0.130624,0.140867,0.628849,0.569304,0.413058,0.139506,0.171569
1,0,1,0,2,1.000000,0.133333,10,0.625000,0.133333,10,...,0.620641,0.603776,0.543104,0.281724,0.000000,0.356381,0.324728,0.622440,0.298332,0.000000
2,0,2,0,2,0.333333,0.101961,0,0.000000,0.101961,0,...,0.588427,0.581390,0.480664,0.128045,0.000000,0.229010,0.239453,0.530768,0.140842,0.000000
3,0,3,0,0,0.000000,0.020513,2,0.083333,0.020513,2,...,0.665777,0.634401,0.444180,0.139126,0.000000,0.442247,0.403900,0.478450,0.147651,0.000000
4,0,4,0,1,0.333333,0.008333,6,0.300000,0.008333,6,...,0.611621,0.592239,0.494053,0.186831,0.000000,0.285372,0.285385,0.546625,0.194719,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,5,1.000000,0.238095,13,0.684211,0.238095,13,...,0.618434,0.603234,0.548296,0.335448,0.000000,0.421944,0.402578,0.580330,0.335545,0.000000
499996,99,4996,0,3,1.000000,0.200000,9,0.600000,0.200000,9,...,0.606798,0.569988,0.559360,0.334685,0.000000,0.390204,0.333805,0.598008,0.341303,0.000000
499997,99,4997,0,0,0.000000,0.000000,5,0.454545,0.000000,4,...,0.568771,0.501923,0.628250,0.343402,0.000000,0.396668,0.300773,0.670406,0.359109,0.000000
499998,99,4998,0,2,1.000000,0.133333,10,0.625000,0.133333,9,...,0.554428,0.514135,0.596427,0.343570,0.000000,0.217345,0.160667,0.666197,0.360725,0.000000


### Prepare test set

In [32]:
target_test = feature_retrieval_test['Translation']
data_test = feature_retrieval_test.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,cosine_similarity_tf_idf_vecmap,cosine_similarity_tf_idf_proc_b_1k,cosine_similarity_average_proc_5k,jaccard_numbers_source,number_ADJ_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,euclidean_distance_tf_idf_vecmap,number_ADJ_difference,characters_avg_difference_normalized,number_?_difference_normalized
0,0.140867,0.171569,0.569304,0.752113,0.783764,0.0,0.333333,0.0,0.0,0.0,0.139506,1,0.394231,0.000000
1,0.000000,0.000000,0.324728,0.603776,0.633933,0.0,0.333333,0.0,0.0,0.0,0.298332,1,2.124786,0.000000
2,0.000000,0.000000,0.239453,0.581390,0.616525,0.0,0.000000,0.0,0.0,0.0,0.140842,0,0.000905,0.000000
3,0.000000,0.000000,0.403900,0.634401,0.669379,0.0,0.000000,0.0,0.0,0.0,0.147651,0,0.401399,0.000000
4,0.000000,0.000000,0.285385,0.592239,0.610032,0.0,1.000000,0.0,0.0,0.0,0.194719,2,0.408516,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.402578,0.603234,0.622870,0.0,0.500000,0.0,0.0,0.0,0.335545,2,1.955357,0.000000
499996,0.000000,0.000000,0.333805,0.569988,0.611470,0.0,0.333333,1.0,0.0,0.0,0.341303,1,1.816667,0.066667
499997,0.000000,0.000000,0.300773,0.501923,0.588852,0.0,0.333333,0.0,0.0,0.0,0.359109,1,0.942708,0.000000
499998,0.000000,0.000000,0.160667,0.514135,0.549289,0.0,0.500000,0.0,0.0,0.0,0.360725,2,1.882051,0.000000


### Use model

In [33]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# pass a NumPy array, matching how the model was fitted
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval_test['source_id'], target_test, prediction)
print("The MAP score on test set: {:.4f}".format(MapScore))

The MAP score on test set: 0.8367


## III. Other languages

## Use the model on English-Italian

In [24]:
feature_retrieval_it = pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_it

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,1,0.333333,0.075000,1,0.066667,0.075000,1,...,0.810779,0.810212,0.361123,0.130714,0.250000,0.770822,0.771820,0.347454,0.125257,0.250000
1,20000,20001,0,0,0.000000,0.046154,3,0.157895,0.046154,3,...,0.652366,0.632913,0.495442,0.162176,0.000000,0.471685,0.442877,0.540197,0.175718,0.000000
2,20000,20002,0,1,0.200000,0.023529,6,0.272727,0.023529,6,...,0.551240,0.537829,0.552460,0.176089,0.000000,0.176865,0.133939,0.636416,0.201498,0.000000
3,20000,20003,0,1,0.333333,0.123077,4,0.200000,0.123077,3,...,0.609395,0.578322,0.497603,0.170155,0.000000,0.317125,0.261110,0.547007,0.188239,0.000000
4,20000,20004,0,0,0.000000,0.033333,2,0.111111,0.033333,1,...,0.625252,0.605403,0.499228,0.169059,0.000000,0.431676,0.400298,0.526985,0.179437,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.000000,0.363636,0,0.000000,0.363636,0,...,0.494533,0.470337,0.600207,0.206954,0.000000,0.146285,0.122446,0.678523,0.233086,0.000000
499996,20099,24996,0,4,1.000000,0.363636,3,0.272727,0.363636,3,...,0.445105,0.431848,0.653755,0.295593,0.000000,0.105808,0.098467,0.726166,0.319219,0.000000
499997,20099,24997,0,4,1.000000,0.363636,3,0.176471,0.363636,3,...,0.508493,0.465414,0.574689,0.194991,0.000000,0.166518,0.125571,0.639576,0.215090,0.000000
499998,20099,24998,0,2,0.200000,0.192208,22,0.611111,0.192208,18,...,0.609450,0.580977,0.502350,0.169182,0.014706,0.270303,0.249942,0.562242,0.172988,0.014706


### Prepare test set

In [25]:
target_test = feature_retrieval_it['Translation']
data_test = feature_retrieval_it.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,cosine_similarity_tf_idf_vecmap,cosine_similarity_tf_idf_proc_b_1k,cosine_similarity_average_proc_5k,jaccard_numbers_source,number_ADJ_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,euclidean_distance_tf_idf_vecmap,number_ADJ_difference,characters_avg_difference_normalized,number_?_difference_normalized
0,0.306818,0.250000,0.771820,0.810212,0.803508,0.0,0.000000,0.0,0.0,0.0,0.125257,0,0.348214,0.000000
1,0.000000,0.000000,0.442877,0.632913,0.644942,0.0,0.333333,0.0,0.0,0.0,0.175718,1,0.079108,0.000000
2,0.000000,0.000000,0.133939,0.537829,0.544185,0.0,0.333333,0.0,0.0,0.0,0.201498,1,0.229517,0.000000
3,0.000000,0.000000,0.261110,0.578322,0.605945,0.0,0.000000,0.0,0.0,0.0,0.188239,0,0.104167,0.000000
4,0.000000,0.000000,0.400298,0.605403,0.607170,0.0,0.500000,1.0,0.0,0.0,0.179437,2,0.195833,0.083333
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.122446,0.470337,0.486182,0.0,1.000000,0.0,0.0,0.0,0.233086,1,0.372913,0.000000
499996,0.000000,0.000000,0.098467,0.431848,0.451062,0.0,1.000000,0.0,0.0,0.0,0.319219,1,1.142045,0.000000
499997,0.000000,0.000000,0.125571,0.465414,0.513577,0.0,1.000000,0.0,0.0,0.0,0.215090,1,0.164545,0.000000
499998,0.014706,0.014706,0.249942,0.580977,0.597098,0.0,1.000000,0.0,0.0,0.0,0.172988,6,0.318854,0.000000


### Use model

In [26]:
# Apply the scaler fitted earlier (transform only, no refitting on the Italian data)
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_it['source_id'], target_test, prediction)
print("The Italian MAP score on test set: {:.4f}".format(MapScore))

The Italian MAP score on test set: 0.8311
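
`MAP_score` is imported from `src.models.predict_model`, and its implementation is not shown in this notebook. The sketch below is a hypothetical stand-in, not the project's code. It shows what a mean average precision over source sentences typically computes: rank each source's candidates by predicted score, take the precision at each true translation, and average over the sources. The function name `mean_average_precision` and the toy data are assumptions made for illustration.

```python
import numpy as np
import pandas as pd


def mean_average_precision(source_ids, labels, scores):
    """Hypothetical stand-in for MAP_score: mean over queries of the
    average precision of each ranked candidate list."""
    frame = pd.DataFrame({
        "source_id": np.asarray(source_ids),
        "label": np.asarray(labels),
        "score": np.asarray(scores, dtype=float),
    })
    average_precisions = []
    for _, group in frame.groupby("source_id"):
        # Rank this query's candidates from most to least likely translation.
        ranked = group.sort_values("score", ascending=False)["label"].to_numpy()
        # 1-based ranks at which the true translations appear.
        hit_ranks = np.flatnonzero(ranked == 1) + 1
        if hit_ranks.size == 0:
            continue  # queries without any relevant candidate are skipped
        # Precision at each hit = (hits found so far) / (rank of the hit).
        precisions = np.arange(1, hit_ranks.size + 1) / hit_ranks
        average_precisions.append(precisions.mean())
    return float(np.mean(average_precisions))


# Toy data: query 1 ranks its translation first, query 2 ranks it second.
toy_score = mean_average_precision(
    source_ids=[1, 1, 1, 2, 2, 2],
    labels=[1, 0, 0, 0, 1, 0],
    scores=[0.9, 0.2, 0.1, 0.8, 0.6, 0.1],
)
print("Toy MAP: {:.4f}".format(toy_score))
```

With a two-class `predict_proba` output, the positive-class column (`prediction[:, 1]`) would be the natural input for `scores` in a function like this. Whether `MAP_score` does the same internally is an assumption.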


## Use the model on English-Polish

In [27]:
feature_retrieval_pl=pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_pl

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,0,0.0,0.000000,1,0.090909,0.000000,1,...,0.642565,0.626953,0.517595,0.193824,0.236111,0.557422,0.526448,0.503654,0.198580,0.236111
1,20000,20001,0,0,0.0,0.000000,0,0.000000,0.000000,0,...,0.573507,0.551732,0.569353,0.235418,0.000000,0.404290,0.365516,0.602765,0.252775,0.000000
2,20000,20002,0,11,1.0,0.139241,62,0.837838,0.139241,53,...,0.708999,0.674955,0.431595,0.209228,0.000000,0.600519,0.560591,0.444171,0.206232,0.000000
3,20000,20003,0,2,1.0,0.250000,0,0.000000,0.250000,0,...,0.555197,0.545064,0.569774,0.230373,0.000000,0.356233,0.364142,0.612446,0.245365,0.000000
4,20000,20004,0,2,1.0,0.086957,15,0.555556,0.086957,15,...,0.608879,0.543577,0.495413,0.207498,0.000000,0.363847,0.248384,0.538228,0.221728,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.0,0.190476,2,0.062500,0.190476,2,...,0.687810,0.649056,0.398149,0.104033,0.000000,0.296902,0.164214,0.443523,0.119273,0.000000
499996,20099,24996,0,2,1.0,0.133333,2,0.071429,0.133333,1,...,0.696351,0.647455,0.421799,0.111301,0.000000,0.521585,0.400441,0.436105,0.118380,0.000000
499997,20099,24997,0,1,1.0,0.055556,2,0.062500,0.055556,3,...,0.716196,0.700957,0.387899,0.095337,0.000000,0.346653,0.326476,0.458947,0.111125,0.000000
499998,20099,24998,0,1,1.0,0.050000,4,0.117647,0.050000,4,...,0.661488,0.617984,0.426675,0.107472,0.000000,0.304688,0.202665,0.484050,0.123391,0.000000


### Prepare test set

In [28]:
target_test = feature_retrieval_pl['Translation']
data_test = feature_retrieval_pl.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,cosine_similarity_tf_idf_vecmap,cosine_similarity_tf_idf_proc_b_1k,cosine_similarity_average_proc_5k,jaccard_numbers_source,number_ADJ_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,euclidean_distance_tf_idf_vecmap,number_ADJ_difference,characters_avg_difference_normalized,number_?_difference_normalized
0,0.325397,0.236111,0.526448,0.626953,0.670519,0.0,0.333333,0.0,0.000000,0.0,0.198580,1,0.070000,0.000000
1,0.000000,0.000000,0.365516,0.551732,0.585349,0.0,1.000000,0.0,0.000000,0.0,0.252775,2,0.055556,0.000000
2,0.000000,0.000000,0.560591,0.674955,0.721307,0.0,0.750000,0.0,0.025316,1.0,0.206232,12,1.152829,0.000000
3,0.000000,0.000000,0.364142,0.545064,0.573316,0.0,0.333333,0.0,0.000000,0.0,0.245365,1,0.375000,0.000000
4,0.000000,0.000000,0.248384,0.543577,0.622763,0.0,0.200000,1.0,0.000000,0.0,0.221728,1,0.964286,0.043478
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.164214,0.649056,0.701219,0.0,0.333333,0.0,0.000000,0.0,0.119273,1,0.098749,0.000000
499996,0.000000,0.000000,0.400441,0.647455,0.692230,0.0,0.200000,0.0,0.133333,1.0,0.118380,1,0.002735,0.000000
499997,0.000000,0.000000,0.326476,0.700957,0.728958,0.0,0.200000,0.0,0.000000,0.0,0.111125,1,0.030588,0.000000
499998,0.000000,0.000000,0.202665,0.617984,0.666977,0.0,0.200000,0.0,0.000000,0.0,0.123391,1,0.033099,0.000000


### Use model

In [29]:
# Apply the scaler fitted earlier (transform only, no refitting on the Polish data)
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_pl['source_id'], target_test, prediction)
print("The Polish MAP score on test set: {:.4f}".format(MapScore))

The Polish MAP score on test set: 0.8381
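
The Italian and Polish cells repeat the same steps: select the features, scale, predict, and score. One possible way to reduce copy-and-paste slips is to bundle these steps in a function, so that the source ids, labels and features always come from the same frame. The sketch below is only an illustration. `evaluate_language_pair`, `IdentityScaler`, `FirstColumnModel` and `top1_accuracy` are hypothetical names, and the stand-in classes exist only so the example runs without the trained `scaler` and `model`.

```python
import numpy as np
import pandas as pd


def evaluate_language_pair(features, feature_names, scaler, model, score_fn):
    """Apply an already fitted scaler and model to one retrieval frame."""
    target = features["Translation"]
    # Scale the selected features without modifying `features` in place.
    data = pd.DataFrame(
        scaler.transform(features[feature_names]),
        columns=feature_names,
        index=features.index,
    )
    prediction = model.predict_proba(data)
    # Source ids and labels are taken from the same frame as the features.
    return score_fn(features["source_id"], target, prediction)


class IdentityScaler:
    """Toy stand-in for the fitted scaler (assumption, not the real one)."""
    def transform(self, X):
        return np.asarray(X, dtype=float)


class FirstColumnModel:
    """Toy stand-in for the fitted model: uses the first feature as P(y=1)."""
    def predict_proba(self, X):
        p = np.asarray(X, dtype=float)[:, 0]
        return np.column_stack([1 - p, p])


def top1_accuracy(source_ids, labels, prediction):
    """Toy metric: share of queries whose top-scored candidate is correct."""
    frame = pd.DataFrame({
        "source_id": np.asarray(source_ids),
        "label": np.asarray(labels),
        "score": prediction[:, 1],
    })
    best = frame.loc[frame.groupby("source_id")["score"].idxmax()]
    return float(best["label"].mean())


toy_features = pd.DataFrame({
    "source_id": [1, 1, 2, 2],
    "Translation": [1, 0, 0, 1],
    "similarity": [0.9, 0.1, 0.7, 0.3],
})
toy_result = evaluate_language_pair(
    toy_features, ["similarity"], IdentityScaler(), FirstColumnModel(), top1_accuracy
)
print("Toy top-1 accuracy: {:.2f}".format(toy_result))
```

In the notebook, the call would presumably take `feature_retrieval_pl`, `start_features`, `scaler`, `model` and `MAP_score` in place of the toy arguments.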


# IV. Document level

## Use the model on German-English documents

In [34]:
feature_retrieval_doc=pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_doc

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,5,1.000000,0.185185,277,0.862928,0.185185,139,...,0.511370,0.506299,0.628872,0.069684,0.004274,0.137734,0.120406,0.711984,0.073070,0.0
1,0,1,0,5,1.000000,0.185185,315,0.877437,0.185185,183,...,0.478780,0.472470,0.657574,0.070496,0.003448,0.058341,0.033254,0.756963,0.075525,0.0
2,0,2,0,5,1.000000,0.185185,320,0.879121,0.185185,171,...,0.453333,0.408333,0.693955,0.072785,0.004202,0.068042,0.002854,0.788789,0.077393,0.0
3,0,3,0,5,1.000000,0.185185,356,0.890000,0.185185,164,...,0.472022,0.457837,0.671196,0.071793,0.000000,0.079043,0.041237,0.775098,0.074297,0.0
4,0,4,0,5,1.000000,0.185185,376,0.895238,0.185185,171,...,0.459031,0.439136,0.670771,0.071706,0.003546,0.024217,-0.025245,0.781736,0.077319,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,1,0.333333,0.114653,318,0.913793,0.114653,162,...,0.434860,0.389901,0.696221,0.154386,0.000000,-0.046217,-0.154401,0.842940,0.160095,0.0
499996,99,4996,0,2,1.000000,0.117647,362,0.923469,0.117647,152,...,0.427382,0.374952,0.701396,0.155738,0.000000,-0.055270,-0.153815,0.832674,0.158915,0.0
499997,99,4997,0,2,1.000000,0.117647,121,0.801325,0.117647,83,...,0.501045,0.485808,0.632044,0.147978,0.000000,0.057846,0.013532,0.745893,0.157064,0.0
499998,99,4998,0,0,0.000000,0.000000,0,0.000000,0.000000,5,...,0.497301,0.445718,0.644503,0.163494,0.000000,0.157026,0.118324,0.736800,0.183978,0.0


### Prepare test set

In [35]:
target_test = feature_retrieval_doc['Translation']
data_test = feature_retrieval_doc.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,cosine_similarity_tf_idf_vecmap,cosine_similarity_tf_idf_proc_b_1k,cosine_similarity_average_proc_5k,jaccard_numbers_source,number_ADJ_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,euclidean_distance_tf_idf_vecmap,number_ADJ_difference,characters_avg_difference_normalized,number_?_difference_normalized
0,0.004167,0.0,0.120406,0.506299,0.478486,0.0,0.680000,0.0,0.074074,1.0,0.073070,17,0.218582,0.0
1,0.003448,0.0,0.033254,0.472470,0.442238,0.0,0.555556,0.0,0.074074,1.0,0.075525,10,0.221102,0.0
2,0.003968,0.0,0.002854,0.408333,0.426509,0.0,0.428571,0.0,0.074074,1.0,0.077393,6,0.219582,0.0
3,0.000000,0.0,0.041237,0.457837,0.444401,0.0,0.692308,0.0,0.074074,1.0,0.074297,18,0.224314,0.0
4,0.003546,0.0,-0.025245,0.439136,0.439739,0.0,0.750000,0.0,0.074074,1.0,0.077319,24,0.225215,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.0,-0.154401,0.389901,0.375731,0.0,0.948718,0.0,0.058824,1.0,0.160095,37,0.385686,0.0
499996,0.000000,0.0,-0.153815,0.374952,0.379238,0.0,0.833333,0.0,0.058824,1.0,0.158915,10,0.386245,0.0
499997,0.000000,0.0,0.013532,0.485808,0.441704,0.0,0.600000,0.0,0.058824,1.0,0.157064,3,0.361235,0.0
499998,0.000000,0.0,0.118324,0.445718,0.428619,0.0,0.500000,0.0,0.058824,1.0,0.183978,2,0.129412,0.0


### Use model

In [36]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,cosine_similarity_tf_idf_vecmap,cosine_similarity_tf_idf_proc_b_1k,cosine_similarity_average_proc_5k,jaccard_numbers_source,number_ADJ_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,euclidean_distance_tf_idf_vecmap,number_ADJ_difference,characters_avg_difference_normalized,number_?_difference_normalized
0,-0.332239,-0.386745,-3.350718,-2.356727,-2.905031,-0.102434,0.620976,-0.223476,2.426918,2.664338,-0.656892,8.241563,-0.284020,-0.140768
1,-0.343198,-0.386745,-4.031234,-2.708324,-3.285807,-0.102434,0.276019,-0.223476,2.426918,2.664338,-0.636760,4.398280,-0.280489,-0.140768
2,-0.335266,-0.386745,-4.268616,-3.374936,-3.451030,-0.102434,-0.075977,-0.223476,2.426918,2.664338,-0.621440,2.202119,-0.282619,-0.140768
3,-0.395802,-0.386745,-3.968902,-2.860418,-3.263077,-0.102434,0.655092,-0.223476,2.426918,2.664338,-0.646832,8.790604,-0.275987,-0.140768
4,-0.341706,-0.386745,-4.488027,-3.054791,-3.312059,-0.102434,0.815013,-0.223476,2.426918,2.664338,-0.622047,12.084846,-0.274725,-0.140768
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,-0.395802,-0.386745,-5.496526,-3.566510,-3.984439,-0.102434,1.365853,-0.223476,1.858552,2.664338,0.056844,19.222372,-0.049857,-0.140768
499996,-0.395802,-0.386745,-5.491954,-3.721891,-3.947600,-0.102434,1.046011,-0.223476,1.858552,2.664338,0.047171,4.398280,-0.049074,-0.140768
499997,-0.395802,-0.386745,-4.185236,-2.569698,-3.291415,-0.102434,0.399218,-0.223476,1.858552,2.664338,0.031988,0.554998,-0.084121,-0.140768
499998,-0.395802,-0.386745,-3.366977,-2.986381,-3.428866,-0.102434,0.122021,-0.223476,1.858552,2.664338,0.252727,0.005957,-0.408974,-0.140768


In [37]:
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_doc['source_id'], target_test, prediction)
print("The Doc MAP score on test set: {:.4f}".format(MapScore))

The Doc MAP score on test set: 0.0079
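
The scaled document-level table above hints at why the score collapses. Several features sit many units away from zero, for example `cosine_similarity_tf_idf_vecmap` around -3 to -5 and `number_ADJ_difference` up to about 19. Assuming `scaler` standardizes each feature to zero mean and unit variance on the sentence-level training data (its definition is not shown in this section), such values lie far outside what the model saw during training. The sketch below flags the shifted features. `shifted_features` and its `threshold` are hypothetical choices, and the sample uses only the first three rows printed above.

```python
import numpy as np
import pandas as pd


def shifted_features(scaled, threshold=2.0):
    """Return features whose mean scaled value is more than `threshold`
    units away from 0 (the training mean, if the scaler standardizes)."""
    column_means = scaled.mean()
    flagged = column_means[column_means.abs() > threshold]
    # Order the flagged features by the size of the shift, largest first.
    order = flagged.abs().sort_values(ascending=False).index
    return flagged.reindex(order)


# First three scaled document-level rows from the table above, four features.
scaled_sample = pd.DataFrame({
    "cosine_similarity_tf_idf_vecmap": [-3.350718, -4.031234, -4.268616],
    "cosine_similarity_average_proc_5k": [-2.905031, -3.285807, -3.451030],
    "number_ADJ_difference": [8.241563, 4.398280, 2.202119],
    "jaccard_numbers_source": [-0.102434, -0.102434, -0.102434],
})
report = shifted_features(scaled_sample)
print(report)
```

Running the same function on the full scaled `data_test` would give a per-feature view of how far the document-level inputs drift from the sentence-level training distribution.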
