# Supervised Retrieval

In this notebook we train a supervised classification model for a cross-lingual information retrieval task, using scikit-learn's built-in `LogisticRegression`. We first prepare the data, then run a pipeline of forward feature selection and hyperparameter optimization via grid search to find our best model. After training on English-German, we apply the trained model to Italian, to Polish, and at document level. The results for the other languages are good; at document level, as expected, they are noticeably worse.

In [1]:
import pandas as pd
import numpy as np
from sklearn import preprocessing
from sklearn.linear_model import LogisticRegression
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

In this section we import the feature dataframe for the retrieval task.

In [20]:
feature_dataframe = pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

In [21]:
feature_dataframe

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,10,0.555556,0.094256,3,0.016575,0.094256,12,...,0.938824,0.922837,0.188550,0.025019,0.069408,0.851877,0.816036,0.227537,0.029612,0.074080
1,1,1,1,0,0.000000,0.000000,0,0.000000,0.000000,5,...,0.928516,0.911385,0.191291,0.035125,0.246975,0.881250,0.846294,0.197129,0.036056,0.258936
2,2,2,1,3,1.000000,0.142857,1,0.027027,0.142857,1,...,0.841325,0.834805,0.308668,0.072557,0.176154,0.729274,0.726981,0.328892,0.075238,0.224167
3,3,3,1,0,0.000000,0.004762,2,0.037037,0.004762,5,...,0.881792,0.873648,0.255657,0.051406,0.173542,0.771814,0.758749,0.281536,0.053657,0.173542
4,4,4,1,0,0.000000,0.005517,4,0.076923,0.005517,2,...,0.916279,0.894702,0.227575,0.047172,0.186111,0.874617,0.839182,0.238205,0.049603,0.205495
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
219995,19999,4647,0,3,0.600000,0.079032,8,0.173913,0.079032,9,...,0.874105,0.859388,0.280235,0.065319,0.092105,0.783083,0.739614,0.302003,0.072655,0.078947
219996,19999,685,0,2,0.500000,0.050000,8,0.173913,0.050000,9,...,0.856001,0.852788,0.322840,0.066262,0.068498,0.738716,0.726509,0.375336,0.076159,0.069173
219997,19999,10689,0,1,0.333333,0.036957,2,0.050000,0.036957,2,...,0.886237,0.852108,0.263981,0.067731,0.062500,0.819358,0.759386,0.275182,0.070944,0.063508
219998,19999,9172,0,1,0.333333,0.016667,9,0.191489,0.016667,7,...,0.871786,0.855601,0.276340,0.065629,0.054054,0.785261,0.750286,0.295176,0.071319,0.054805


#### Delete all columns with only one value

In [22]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
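
`threshold_counts` is imported from `src.models.predict_model` and its body is not shown in this notebook. The following is a minimal sketch of how such a column filter might work, assuming it returns `True` for a column with more than `threshold` distinct values. The helper name `has_more_unique_values` and the toy frames are hypothetical.

```python
import pandas as pd


def has_more_unique_values(column: pd.Series, threshold: int = 1) -> bool:
    """Hypothetical stand-in for threshold_counts: keep a column only if it
    has more than `threshold` distinct values."""
    return column.nunique() > threshold


# Toy data: column "b" is constant in the training frame.
train = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0], "c": [1, 1, 2]})
test = pd.DataFrame({"a": [4, 5, 6], "b": [7, 8, 9], "c": [0, 0, 0]})

# The mask is computed on the training frame only and reused for the test
# frame, so both frames end up with exactly the same columns.
column_mask = train.apply(has_more_unique_values, threshold=1)
train_filtered = train.loc[:, column_mask]
test_filtered = test.loc[:, column_mask]

print(list(train_filtered.columns), list(test_filtered.columns))
```

Reusing the training mask keeps the feature layout identical for both frames, which the model needs at prediction time.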

## II. Supervised Retrieval

# Logistic Regression

#### Start with one feature

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]

In [6]:
model = LogisticRegression()
scaler = preprocessing.StandardScaler()

LR_parameter_grid = {
    'penalty' : ['l1', 'l2','elasticnet'],
    'C' : np.logspace(-4, 4, 50),
    'solver' : ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter':[100000]
}

LR_best_features, LR_best_parameter_combination, LR_best_map_score, LR_all_parameter_combination = \
pipeline_model_optimization(model, LR_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.0001)

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7536
Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.7874
Updated MAP score on test set with new feature jaccard_numbers_source: 0.7915
Updated MAP score on test set with new feature number_VERB_difference_relative: 0.7921
Updated MAP score on test set with new feature number_?_difference_relative: 0.7922
Updated MAP score on test set with new feature number_-_difference_normalized: 0.7986
Updated MAP score on test set with new feature number_-_difference_relative: 0.8004
Updated MAP score on test set with new feature number_!_difference_normalized: 0.8014

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8014

-----------------Result of Feature Selection-----------------

Best MAP Score after feature selection: 0.8014117683692032


-----------------Start Hyperparameter-tuning with Grid Se

Hyperparameter Tuning:   0%|          | 0/750 [00:00<?, ?it/s]

Number of Parameter Combinations: 750
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   0%|          | 3/750 [00:00<03:25,  3.64it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0001, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7782490242924823}
With Map Score 0.7782
Model failed to fit


Hyperparameter Tuning:   1%|          | 5/750 [00:02<06:54,  1.80it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   1%|          | 8/750 [00:03<05:31,  2.24it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00014563484775012445, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7833912811006615}
With Map Score 0.7834
Model failed to fit


Hyperparameter Tuning:   1%|▏         | 10/750 [00:05<06:53,  1.79it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   2%|▏         | 13/750 [00:06<05:59,  2.05it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00021209508879201905, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7887423717950762}
With Map Score 0.7887
Model failed to fit


Hyperparameter Tuning:   2%|▏         | 15/750 [00:08<07:26,  1.65it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00021209508879201905, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7890591240571934}
With Map Score 0.7891
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   2%|▏         | 18/750 [00:09<06:14,  1.96it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00030888435964774815, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7893867074415528}
With Map Score 0.7894
Model failed to fit


Hyperparameter Tuning:   3%|▎         | 20/750 [00:11<07:37,  1.60it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.00030888435964774815, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7897190556373275}
With Map Score 0.7897
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   3%|▎         | 23/750 [00:12<06:24,  1.89it/s]

Model failed to fit


Hyperparameter Tuning:   3%|▎         | 25/750 [00:14<08:03,  1.50it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   4%|▎         | 28/750 [00:15<06:58,  1.73it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0006551285568595509, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7901988841159363}
With Map Score 0.7902
Model failed to fit


Hyperparameter Tuning:   4%|▍         | 30/750 [00:17<08:50,  1.36it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:   4%|▍         | 33/750 [00:19<07:31,  1.59it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0009540954763499944, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7910281384413594}
With Map Score 0.7910
Model failed to fit


Hyperparameter Tuning:   5%|▍         | 35/750 [00:21<09:16,  1.28it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0009540954763499944, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7926951887252516}
With Map Score 0.7927
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   5%|▌         | 38/750 [00:23<07:59,  1.48it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0013894954943731374, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7977158859926264}
With Map Score 0.7977
Model failed to fit


Hyperparameter Tuning:   5%|▌         | 40/750 [00:25<09:46,  1.21it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0013894954943731374, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7977175065396623}
With Map Score 0.7977
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   6%|▌         | 43/750 [00:27<08:23,  1.41it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0020235896477251557, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.799394671403095}
With Map Score 0.7994
Model failed to fit


Hyperparameter Tuning:   6%|▌         | 45/750 [00:29<10:12,  1.15it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0020235896477251557, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.799399219241546}
With Map Score 0.7994
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   6%|▋         | 48/750 [00:31<08:13,  1.42it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0029470517025518097, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.7994028368663243}
With Map Score 0.7994
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 50/750 [00:33<09:52,  1.18it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0029470517025518097, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.7994077657041707}
With Map Score 0.7994
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 53/750 [00:35<08:23,  1.38it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8007472890366986}
With Map Score 0.8007
Model failed to fit


Hyperparameter Tuning:   7%|▋         | 55/750 [00:37<10:21,  1.12it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8008623053741128}
With Map Score 0.8009
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   8%|▊         | 58/750 [00:39<08:38,  1.34it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0062505519252739694, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8008680414108392}
With Map Score 0.8009
Model failed to fit


Hyperparameter Tuning:   8%|▊         | 60/750 [00:42<10:29,  1.10it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.0062505519252739694, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8008733087115663}
With Map Score 0.8009
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   8%|▊         | 63/750 [00:43<08:39,  1.32it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.009102981779915217, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8008832416763402}
With Map Score 0.8009
Model failed to fit


Hyperparameter Tuning:   9%|▊         | 65/750 [00:46<10:10,  1.12it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.009102981779915217, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.800888380875425}
With Map Score 0.8009
Model failed to fit
Model failed to fit


Hyperparameter Tuning:   9%|▉         | 68/750 [00:47<08:19,  1.37it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.013257113655901081, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.801220559671628}
With Map Score 0.8012
Model failed to fit


Hyperparameter Tuning:   9%|▉         | 70/750 [00:50<10:23,  1.09it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  10%|▉         | 73/750 [00:51<08:35,  1.31it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.019306977288832496, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8012353801104065}
With Map Score 0.8012
Model failed to fit


Hyperparameter Tuning:  10%|█         | 75/750 [00:54<10:39,  1.06it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.019306977288832496, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8013837613968473}
With Map Score 0.8014
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  10%|█         | 78/750 [00:56<08:46,  1.28it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.02811768697974228, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8013886544362785}
With Map Score 0.8014
Model failed to fit


Hyperparameter Tuning:  11%|█         | 80/750 [00:59<10:42,  1.04it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.02811768697974228, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8013920462167429}
With Map Score 0.8014
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  11%|█         | 83/750 [01:00<08:53,  1.25it/s]

Model failed to fit


Hyperparameter Tuning:  11%|█▏        | 85/750 [01:03<10:53,  1.02it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.040949150623804234, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8013965514303483}
With Map Score 0.8014
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 88/750 [01:05<09:21,  1.18it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.05963623316594643, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.801401955749339}
With Map Score 0.8014
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 90/750 [01:08<10:54,  1.01it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.05963623316594643, 'solver': 'saga', 'max_iter': 100000, 'MAP_score': 0.8014019600476148}
With Map Score 0.8014
Model failed to fit
Model failed to fit


Hyperparameter Tuning:  12%|█▏        | 93/750 [01:09<09:00,  1.22it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.08685113737513521, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8014085518500292}
With Map Score 0.8014
Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 95/750 [01:12<10:17,  1.06it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 98/750 [01:14<08:42,  1.25it/s]

Model failed to fit


Hyperparameter Tuning:  13%|█▎        | 100/750 [01:17<10:28,  1.03it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  14%|█▎        | 103/750 [01:18<08:31,  1.26it/s]

Model failed to fit


Hyperparameter Tuning:  14%|█▍        | 105/750 [01:21<10:30,  1.02it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  14%|█▍        | 108/750 [01:23<08:45,  1.22it/s]

Model failed to fit


Hyperparameter Tuning:  15%|█▍        | 110/750 [01:25<10:23,  1.03it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  15%|█▌        | 113/750 [01:27<08:36,  1.23it/s]

Model failed to fit


Hyperparameter Tuning:  15%|█▌        | 115/750 [01:30<10:08,  1.04it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  16%|█▌        | 118/750 [01:31<08:28,  1.24it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8014117683692032}
With Map Score 0.8014
Model failed to fit


Hyperparameter Tuning:  16%|█▌        | 120/750 [01:34<09:59,  1.05it/s]

Model failed to fit
Model failed to fit


Hyperparameter Tuning:  16%|█▋        | 123/750 [01:36<08:09,  1.28it/s]


Current Best Hyperpamaters: {'penalty': 'l1', 'C': 0.8286427728546842, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8014117722301305}
With Map Score 0.8014

Hyperparameter Tuning:  36%|███▋      | 273/750 [04:11<11:07,  1.40s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.0004498432668969444, 'solver': 'liblinear', 'max_iter': 100000, 'MAP_score': 0.8014862037600845}
With Map Score 0.8015


Hyperparameter Tuning:  37%|███▋      | 276/750 [04:18<15:38,  1.98s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.0006551285568595509, 'solver': 'newton-cg', 'max_iter': 100000, 'MAP_score': 0.8019319984858279}
With Map Score 0.8019


Hyperparameter Tuning:  37%|███▋      | 281/750 [04:25<12:52,  1.65s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.0009540954763499944, 'solver': 'newton-cg', 'max_iter': 100000, 'MAP_score': 0.8069646648814421}
With Map Score 0.8070


Hyperparameter Tuning:  38%|███▊      | 282/750 [04:26<11:17,  1.45s/it]


Current Best Hyperpamaters: {'penalty': 'l2', 'C': 0.0009540954763499944, 'solver': 'lbfgs', 'max_iter': 100000, 'MAP_score': 0.8069646658863794}
With Map Score 0.8070


Hyperparameter Tuning: 100%|██████████| 750/750 [10:46<00:00,  1.16it/s]

Model failed to fit
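
Many of the "Model failed to fit" messages are probably due to penalty/solver pairs that `LogisticRegression` does not support, since the grid crosses every penalty with every solver. The sketch below counts the pairs that can be fitted, based on the solver constraints described in the scikit-learn documentation. The helper `is_valid` and the lookup table are illustrative and not part of the pipeline code.

```python
from itertools import product

# Penalties from this grid that each solver accepts, according to the
# scikit-learn LogisticRegression documentation.
SUPPORTED_PENALTIES = {
    "newton-cg": {"l2"},
    "lbfgs": {"l2"},
    "liblinear": {"l1", "l2"},
    "sag": {"l2"},
    "saga": {"l1", "l2", "elasticnet"},
}


def is_valid(penalty, solver, l1_ratio=None):
    """Return True if the penalty/solver pair can be fitted."""
    if penalty not in SUPPORTED_PENALTIES[solver]:
        return False
    # elasticnet additionally needs an l1_ratio, which the grid does not set.
    if penalty == "elasticnet" and l1_ratio is None:
        return False
    return True


penalties = ["l1", "l2", "elasticnet"]
solvers = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]
n_C = 50  # number of values in np.logspace(-4, 4, 50)

valid_pairs = [(p, s) for p, s in product(penalties, solvers) if is_valid(p, s)]
total = len(penalties) * len(solvers) * n_C
print(valid_pairs)
print(f"{len(valid_pairs) * n_C} of {total} combinations can be fitted")
```

Filtering the grid this way before the search would skip the combinations that cannot succeed and make the log much shorter.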





In [22]:
# Rounding C to 0.0001 gives a better estimate, so we use that value and run another forward selection
scaler = preprocessing.StandardScaler()
model = LogisticRegression(max_iter=100000, penalty="l2", C=0.0001, solver='saga')
feature_selection(model, scaler, feature_dataframe, feature_retrieval, start_features, added_features)

The initial MAP score on test set: 0.8284
Updated MAP score on test set with new feature number_VERB_difference: 0.8292
Updated MAP score on test set with new feature number_ADJ_difference_relative: 0.8299
Updated MAP score on test set with new feature characters_avg_difference_relative: 0.8303
Updated MAP score on test set with new feature number_'_difference_relative: 0.8353


0.8352689940750164
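
The log output above suggests a greedy forward selection: the candidate list is scanned repeatedly, and a feature is kept whenever it raises the score by more than a threshold. A minimal sketch of that idea follows, under the assumption that this is roughly how `feature_selection` behaves (its source is not shown here). The function `forward_selection`, the `min_gain` parameter, and the toy scoring function are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    start: Sequence[str],
    candidates: Sequence[str],
    score: Callable[[List[str]], float],
    min_gain: float = 0.0001,
) -> Tuple[List[str], float]:
    """Greedily add candidates that improve the score by more than min_gain."""
    selected = list(start)
    best = score(selected)
    improved = True
    while improved:  # repeat passes until a full pass adds nothing
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            trial = score(selected + [feature])
            if trial - best > min_gain:
                selected.append(feature)
                best = trial
                improved = True
    return selected, best


# Toy scoring function: each feature contributes a fixed gain.
GAINS = {"a": 0.5, "b": 0.2, "c": 0.0, "d": 0.05}


def toy_score(features: List[str]) -> float:
    return sum(GAINS[f] for f in features)


selected, best = forward_selection(["a"], ["b", "c", "d"], toy_score)
print(selected, round(best, 2))
```

In the notebook, the scoring function would correspond to fitting the model on the chosen columns and computing the MAP score on the retrieval set.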

In [6]:
start_features

['jaccard_translation_proc_5k',
 'jaccard_translation_vecmap',
 'jaccard_numbers_source',
 'number_VERB_difference_relative',
 'number_?_difference_relative',
 'number_-_difference_normalized',
 'number_-_difference_relative',
 'number_!_difference_normalized',
 'cosine_similarity_tf_idf_vecmap',
 'euclidean_distance_average_proc_b_1k',
 'number_NOUN_difference',
 'number_ADJ_difference_normalized',
 'number_ADJ_difference',
 'number_characters_difference_normalized',
 'number_characters_difference',
 'number_VERB_difference',
 'number_ADJ_difference_relative',
 'characters_avg_difference_relative',
 "number_'_difference_relative"]

In [36]:
# Rounding C to 0.0001 gives a better estimate, so we use that value
scaler = preprocessing.StandardScaler()
model = LogisticRegression(max_iter=100000, penalty="l2", C=0.0001, solver='saga')
target_train = feature_dataframe['Translation']
target_test = feature_retrieval['Translation']
# explicit copies, so the scaling below never touches the original frames
data_train = feature_dataframe.filter(items=start_features).copy()
data_test = feature_retrieval.filter(items=start_features).copy()
# scale the features: fit the scaler on the training data, then apply it to the test data
data_train[data_train.columns] = scaler.fit_transform(data_train[data_train.columns])
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
# fit the model and compute the MAP score on the test set
model.fit(data_train.to_numpy(), target_train.to_numpy())
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval['source_id'], target_test, prediction)
print("The initial MAP score on test set: {:.4f}".format(MapScore))

The initial MAP score on test set: 0.8353
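
`MAP_score` comes from `src.models.predict_model`, so its exact implementation is not visible here. As a rough illustration of the metric, the sketch below computes mean average precision by grouping candidates per source sentence and ranking them by the predicted probability of being a translation. The function name `mean_average_precision` and the toy inputs are hypothetical. In the notebook, the scores would presumably be the positive-class column `prediction[:, 1]`.

```python
from collections import defaultdict


def mean_average_precision(source_ids, labels, positive_scores):
    """Mean over queries of the average precision of the ranked candidates."""
    by_query = defaultdict(list)
    for query_id, label, score in zip(source_ids, labels, positive_scores):
        by_query[query_id].append((score, label))

    average_precisions = []
    for candidates in by_query.values():
        # Rank the candidates of this query by descending score.
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits, precisions = 0, []
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precisions.append(hits / rank)
        if precisions:  # skip queries without any relevant candidate
            average_precisions.append(sum(precisions) / len(precisions))
    return sum(average_precisions) / len(average_precisions)


source_ids = [0, 0, 0, 1, 1, 1]
labels = [1, 0, 0, 0, 1, 0]
scores = [0.9, 0.2, 0.1, 0.8, 0.6, 0.3]
# Query 0 ranks its translation first (AP = 1), query 1 ranks it second (AP = 0.5).
print(mean_average_precision(source_ids, labels, scores))  # 0.75
```

With exactly one correct translation per source sentence, the average precision of a query reduces to the reciprocal rank of that translation.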


In [2]:
import pickle
filename = 'finalized_model_LR.sav'

# save the model to disk
# with open(filename, 'wb') as f:
#     pickle.dump(model, f)

# load the model from disk
with open(filename, 'rb') as f:
    loaded_model = pickle.load(f)

In [3]:
loaded_model.coef_

array([[ 7.53380191e-01,  6.60808198e-01,  2.41156071e-01,
        -1.26480806e-02, -9.58177714e-02, -5.21367315e-02,
        -8.65654324e-02,  3.98228332e-02,  1.33003489e-01,
         1.60447685e-01, -2.56232352e-01,  6.79084136e-04,
        -1.67896375e-01,  1.15958968e-01, -3.80036013e-01,
        -1.71007515e-01, -7.23648744e-02,  5.34286966e-02,
        -8.03855972e-03, -2.50453456e-01]])

In [37]:
start_features=['jaccard_translation_proc_5k',
 'jaccard_translation_vecmap',
 'jaccard_numbers_source',
 'number_VERB_difference_relative',
 'number_?_difference_relative',
 'number_-_difference_normalized',
 'number_-_difference_relative',
 'number_!_difference_normalized',
 'cosine_similarity_tf_idf_vecmap',
 'euclidean_distance_average_proc_b_1k',
 'number_NOUN_difference',
 'number_ADJ_difference_normalized',
 'number_ADJ_difference',
 'number_characters_difference_normalized',
 'number_characters_difference',
 'number_VERB_difference',
 'number_ADJ_difference_relative',
 'characters_avg_difference_relative',
 "number_'_difference_relative"]

# Test our model on an independent English-German test set

In [45]:
feature_retrieval_test = pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")
feature_retrieval_test = feature_retrieval_test.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_test

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,1,1.0,0.052632,1,0.028571,0.052632,2,...,0.861389,0.824081,0.319038,0.088379,0.178854,0.785007,0.709331,0.334444,0.094191,0.127717
1,0,1,0,2,1.0,0.095238,2,0.055556,0.095238,2,...,0.805623,0.733701,0.373031,0.103607,0.057692,0.693420,0.552060,0.397617,0.111695,0.057692
2,0,2,0,1,1.0,0.043478,5,0.128205,0.043478,4,...,0.729932,0.664696,0.423182,0.111810,0.033908,0.531288,0.378973,0.471684,0.124874,0.033908
3,0,3,0,1,1.0,0.034483,11,0.244444,0.034483,11,...,0.808925,0.758664,0.359914,0.097409,0.027402,0.683163,0.584583,0.391664,0.103523,0.027402
4,0,4,0,1,1.0,0.142857,11,0.478261,0.142857,8,...,0.605760,0.523842,0.551022,0.220077,0.000000,0.415959,0.268377,0.590496,0.230078,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,1,1.0,0.050000,5,0.151515,0.050000,2,...,0.786550,0.758610,0.361696,0.097063,0.041667,0.642865,0.566579,0.384921,0.106913,0.041667
499996,99,4996,0,3,0.6,0.050000,17,0.309091,0.050000,18,...,0.813926,0.764302,0.327940,0.092496,0.036905,0.567438,0.404435,0.400315,0.110524,0.036905
499997,99,4997,0,0,0.0,0.004545,2,0.050000,0.004545,5,...,0.830305,0.780338,0.325268,0.087891,0.068966,0.704087,0.603182,0.354744,0.098615,0.068966
499998,99,4998,0,1,1.0,0.050000,6,0.136364,0.050000,8,...,0.828593,0.803272,0.318972,0.082745,0.064516,0.731584,0.681119,0.336975,0.088615,0.064516


### Prepare test set

In [46]:
target_test = feature_retrieval_test['Translation']
data_test = feature_retrieval_test.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,jaccard_numbers_source,number_VERB_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,number_!_difference_normalized,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_proc_b_1k,number_NOUN_difference,number_ADJ_difference_normalized,number_ADJ_difference,number_characters_difference_normalized,number_characters_difference,number_VERB_difference,number_ADJ_difference_relative,characters_avg_difference_relative,number_'_difference_relative
0,0.181818,0.127717,0.0,0.250000,0.0,0.00,0.0,0.000,0.709331,0.319038,1,0.000000,0,1.040248,28,2,0.000000,0.138756,0.0
1,0.057692,0.057692,0.0,0.428571,0.0,0.00,0.0,0.000,0.552060,0.373031,2,0.095238,2,1.120448,40,3,1.000000,0.168750,0.0
2,0.033333,0.033908,0.0,1.000000,1.0,0.00,0.0,0.000,0.378973,0.423182,6,0.043478,1,0.751918,42,5,1.000000,0.105691,0.0
3,0.027799,0.027402,0.0,0.250000,0.0,0.00,0.0,0.000,0.584583,0.359914,5,0.034483,1,1.158215,83,2,1.000000,0.140539,0.0
4,0.000000,0.000000,0.0,1.000000,0.0,0.00,0.0,0.000,0.268377,0.551022,0,0.000000,0,0.260504,43,5,0.000000,0.044369,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.041667,0.0,0.333333,0.0,0.05,1.0,0.000,0.566579,0.361696,0,0.092857,1,0.678571,19,1,0.333333,0.041096,0.0
499996,0.012500,0.036905,0.0,0.500000,0.0,0.05,1.0,0.025,0.404435,0.327940,5,0.025000,0,0.625000,120,2,0.000000,0.088608,0.0
499997,0.032796,0.068966,0.0,0.333333,0.0,0.05,1.0,0.000,0.603182,0.325268,0,0.050000,1,0.159091,13,1,1.000000,0.014085,0.0
499998,0.031281,0.064516,0.0,0.000000,0.0,0.05,1.0,0.000,0.681119,0.318972,4,0.030000,1,0.970000,48,0,0.333333,0.067164,0.0


### Use model

In [49]:
# NOTE: this fits a new scaler on the test set itself, whereas the model was
# trained on features scaled with statistics from the training data
scaler = preprocessing.StandardScaler()
data_test[data_test.columns] = scaler.fit_transform(data_test[data_test.columns])
prediction = model.predict_proba(data_test.to_numpy())
MapScore = MAP_score(feature_retrieval_test['source_id'], target_test, prediction)
print("The MAP score on test set: {:.4f}".format(MapScore))

The MAP score on test set: 0.8156
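
The cell above fits a fresh `StandardScaler` on the test features. The more common practice is to reuse the scaler that was fitted on the training data. The sketch below uses only the standard library to show how the two choices differ on toy numbers. The helpers `fit_standardizer` and `standardize` are hypothetical, and whether the difference matters for the score reported above has not been checked here.

```python
from statistics import fmean, pstdev


def fit_standardizer(column):
    """Return (mean, std) of a column; std is the population standard
    deviation, which is also what StandardScaler uses."""
    return fmean(column), pstdev(column)


def standardize(column, mean, std):
    """Apply z-scoring with the given statistics."""
    return [(x - mean) / std for x in column]


train = [0.0, 2.0, 4.0]
test = [10.0, 12.0, 14.0]

# Option 1: reuse the statistics learned on the training data.
train_mean, train_std = fit_standardizer(train)
reused = standardize(test, train_mean, train_std)

# Option 2: refit on the test data, as the cell above does.
refit = standardize(test, *fit_standardizer(test))

print(reused)  # values far from zero: the shift relative to training is kept
print(refit)   # centred on zero: the shift relative to training is lost
```

Reusing the training statistics keeps test features on the scale the model coefficients were learned on. Refitting hides any shift between the two distributions.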


# III. Other languages

## Use the model on English-Italian

In [41]:
feature_retrieval_it=pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_it

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,1,0.333333,0.046218,2,0.071429,0.046218,3,...,0.854424,0.852184,0.294122,0.074822,0.192029,0.824711,0.820362,0.271466,0.068599,0.192029
1,20000,20001,0,0,0.000000,0.034314,7,0.189189,0.034314,7,...,0.821553,0.759180,0.338124,0.084507,0.060662,0.733241,0.616938,0.351165,0.090769,0.060662
2,20000,20002,0,1,0.200000,0.007353,6,0.166667,0.007353,6,...,0.752849,0.665112,0.386866,0.099398,0.000000,0.540558,0.339125,0.432807,0.114732,0.000000
3,20000,20003,0,1,0.333333,0.072193,6,0.166667,0.072193,2,...,0.776020,0.708160,0.356853,0.093711,0.032292,0.615510,0.475073,0.384684,0.103048,0.031754
4,20000,20004,0,0,0.000000,0.030691,6,0.166667,0.030691,0,...,0.781948,0.717154,0.360449,0.096081,0.000000,0.659130,0.557169,0.376907,0.100984,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.000000,0.266667,2,0.083333,0.266667,3,...,0.629671,0.571876,0.485150,0.138779,0.000000,0.340138,0.254788,0.544548,0.155010,0.000000
499996,20099,24996,0,4,1.000000,0.266667,5,0.294118,0.266667,4,...,0.608893,0.541164,0.521543,0.206140,0.000000,0.352565,0.252309,0.584824,0.224727,0.000000
499997,20099,24997,0,4,1.000000,0.266667,5,0.185185,0.266667,6,...,0.634123,0.549922,0.471551,0.135978,0.000000,0.317182,0.216514,0.537252,0.150231,0.000000
499998,20099,24998,0,2,0.200000,0.168306,44,0.666667,0.168306,30,...,0.700972,0.652883,0.426266,0.122974,0.009434,0.404160,0.338733,0.492635,0.124281,0.009434


### Prepare test set

In [42]:
target_test = feature_retrieval_it['Translation']
data_test = feature_retrieval_it.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,jaccard_numbers_source,number_VERB_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,number_!_difference_normalized,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_proc_b_1k,number_NOUN_difference,number_ADJ_difference_normalized,number_ADJ_difference,number_characters_difference_normalized,number_characters_difference,number_VERB_difference,number_ADJ_difference_relative,characters_avg_difference_relative,number_'_difference_relative
0,0.219697,0.192029,0.0,0.000000,0.0,0.0,0.0,0.0,0.820362,0.294122,0,0.012605,0,1.050420,2.0,0,0.000000,0.085044,0.000000
1,0.060662,0.060662,0.0,0.333333,0.0,0.0,0.0,0.0,0.616938,0.338124,0,0.024510,1,0.723039,47.0,1,0.333333,0.059662,0.000000
2,0.000000,0.000000,0.0,0.500000,0.0,0.0,0.0,0.0,0.339125,0.386866,0,0.024510,1,0.973039,53.0,2,0.333333,0.107174,0.000000
3,0.032292,0.031754,0.0,0.000000,0.0,0.0,0.0,0.0,0.475073,0.356853,0,0.013369,0,0.810160,39.0,0,0.000000,0.048159,0.000000
4,0.000000,0.000000,0.0,0.600000,1.0,0.0,0.0,0.0,0.557169,0.360449,2,0.071611,2,0.539642,13.0,3,0.500000,0.085038,0.000000
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.0,0.333333,0.0,0.0,0.0,0.0,0.254788,0.485150,3,0.076923,1,1.589744,14.0,1,1.000000,0.039882,1.000000
499996,0.000000,0.000000,0.0,0.333333,0.0,0.0,0.0,0.0,0.252309,0.521543,0,0.166667,1,2.000000,18.0,1,1.000000,0.079755,1.000000
499997,0.000000,0.000000,0.0,0.333333,0.0,0.0,0.0,0.0,0.216514,0.471551,3,0.062500,1,2.041667,36.0,1,1.000000,0.083620,1.000000
499998,0.009434,0.009434,0.0,0.500000,0.0,0.0,0.0,0.0,0.338733,0.426266,12,0.098361,6,1.535519,247.0,4,1.000000,0.085923,0.333333


### Use model

In [43]:
# Reuse the scaler fitted above; it is not re-fitted on the Italian data.
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_it['source_id'], target_test, prediction)
print("The Italian MAP score on test set: {:.4f}".format(MapScore))

The Italian MAP score on test set: 0.8134


## Use the model on English-Polish

In [30]:
feature_retrieval_pl=pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_pl

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,20000,20000,1,0,0.0,0.000000,5,0.217391,0.000000,3,...,0.834714,0.787117,0.343884,0.096155,0.244485,0.802804,0.713881,0.337174,0.100275,0.242647
1,20000,20001,0,0,0.0,0.000000,7,0.333333,0.000000,5,...,0.736686,0.695557,0.449273,0.171380,0.000000,0.670068,0.575405,0.452836,0.173662,0.000000
2,20000,20002,0,11,1.0,0.123596,64,0.695652,0.123596,53,...,0.859428,0.801389,0.306350,0.114496,0.050166,0.831464,0.732550,0.313809,0.113891,0.049837
3,20000,20003,0,2,1.0,0.222222,7,0.333333,0.222222,5,...,0.703519,0.662172,0.456495,0.162102,0.000000,0.585979,0.515644,0.483511,0.166323,0.000000
4,20000,20004,0,2,1.0,0.071429,12,0.300000,0.071429,12,...,0.765796,0.681011,0.385775,0.112808,0.030303,0.627124,0.444584,0.426907,0.126310,0.030303
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,20099,24995,0,4,1.0,0.153846,7,0.137255,0.153846,1,...,0.791923,0.728771,0.331753,0.075849,0.026671,0.536269,0.325363,0.381850,0.088468,0.025978
499996,20099,24996,0,2,1.0,0.125000,15,0.348837,0.125000,7,...,0.779662,0.718409,0.357416,0.084390,0.047379,0.694737,0.528046,0.351784,0.088915,0.030835
499997,20099,24997,0,1,1.0,0.050000,10,0.208333,0.050000,2,...,0.773840,0.736827,0.350169,0.074033,0.013514,0.465373,0.364922,0.427608,0.090424,0.000000
499998,20099,24998,0,1,1.0,0.047619,9,0.183673,0.047619,2,...,0.731214,0.671121,0.383888,0.084993,0.069884,0.440550,0.286580,0.444644,0.100719,0.054094


### Prepare test set

In [31]:
target_test = feature_retrieval_pl['Translation']
data_test = feature_retrieval_pl.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,jaccard_numbers_source,number_VERB_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,number_!_difference_normalized,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_proc_b_1k,number_NOUN_difference,number_ADJ_difference_normalized,number_ADJ_difference,number_characters_difference_normalized,number_characters_difference,number_VERB_difference,number_ADJ_difference_relative,characters_avg_difference_relative,number_'_difference_relative
0,0.242647,0.242647,0.0,1.000000,0.0,0.000000,0.0,0.0,0.713881,0.343884,0,0.031746,1,0.428571,19,1,0.333333,0.044776,0.0
1,0.027778,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.575405,0.449273,1,0.142857,2,2.571429,14,0,1.000000,0.219512,0.0
2,0.043275,0.049837,0.0,0.666667,0.0,0.022472,1.0,0.0,0.732550,0.306350,28,0.014446,12,1.574639,483,4,0.750000,0.210751,0.0
3,0.000000,0.000000,0.0,0.000000,0.0,0.000000,0.0,0.0,0.515644,0.456495,1,0.031746,1,0.317460,20,0,0.333333,0.157895,0.0
4,0.030303,0.030303,0.0,0.333333,1.0,0.000000,0.0,0.0,0.444584,0.385775,13,0.035714,1,0.964286,91,1,0.200000,0.131977,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.040936,0.025978,0.0,0.000000,0.0,0.000000,0.0,0.0,0.325363,0.331753,2,0.030504,1,0.415119,4,0,0.333333,0.123311,0.0
499996,0.031281,0.030835,0.0,0.500000,0.0,0.125000,1.0,0.0,0.528046,0.357416,3,0.118534,1,0.631466,54,2,0.200000,0.126336,0.0
499997,0.013514,0.000000,0.0,1.000000,0.0,0.000000,0.0,0.0,0.364922,0.350169,1,0.081034,1,1.768966,9,3,0.200000,0.177041,0.0
499998,0.069884,0.054094,0.0,0.500000,0.0,0.000000,0.0,0.0,0.286580,0.383888,6,0.073892,1,2.783251,19,2,0.200000,0.243186,0.0


### Use model

In [32]:
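# Reuse the scaler fitted above; it is not re-fitted on the Polish data.
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
prediction = model.predict_proba(data_test)
# Use the Polish source ids here (the original cell passed feature_retrieval_it by mistake).
MapScore = MAP_score(feature_retrieval_pl['source_id'], target_test, prediction)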
print("The Polish MAP score on test set: {:.4f}".format(MapScore))

The Polish MAP score on test set: 0.8401


# IV. Document level

## Use the model on German-English documents

In [33]:
feature_retrieval_doc=pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval_doc

Unnamed: 0,source_id,target_id,Translation,number_punctuations_total_difference,number_punctuations_total_difference_relative,number_punctuations_total_difference_normalized,number_words_difference,number_words_difference_relative,number_words_difference_normalized,number_unique_words_difference,...,cosine_similarity_average_proc_b_1k,cosine_similarity_tf_idf_proc_b_1k,euclidean_distance_average_proc_b_1k,euclidean_distance_tf_idf_proc_b_1k,jaccard_translation_proc_b_1k,cosine_similarity_average_vecmap,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_vecmap,euclidean_distance_tf_idf_vecmap,jaccard_translation_vecmap
0,0,0,1,5,1.000000,0.178571,290,0.863095,0.178571,144,...,0.580389,0.549913,0.531730,0.054664,0.007030,0.158882,0.125875,0.602464,0.057259,0.007866
1,0,1,0,5,1.000000,0.178571,335,0.879265,0.178571,185,...,0.543362,0.521699,0.567206,0.055330,0.005784,0.073765,0.043059,0.656281,0.059766,0.006825
2,0,2,0,5,1.000000,0.178571,329,0.877333,0.178571,172,...,0.512680,0.452094,0.609772,0.057912,0.006540,0.082798,0.011172,0.693836,0.061963,0.007726
3,0,3,0,5,1.000000,0.178571,368,0.888889,0.178571,167,...,0.536417,0.502595,0.580073,0.056595,0.006444,0.101143,0.050886,0.671459,0.058318,0.008272
4,0,4,0,5,1.000000,0.178571,389,0.894253,0.178571,174,...,0.530223,0.491524,0.575751,0.056442,0.005852,0.056589,-0.003096,0.673305,0.061273,0.006961
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,99,4995,0,1,0.333333,0.114773,332,0.917127,0.114773,167,...,0.486395,0.418157,0.623162,0.129749,0.000000,-0.016775,-0.145692,0.752196,0.133761,0.000000
499996,99,4996,0,2,1.000000,0.117647,373,0.925558,0.117647,155,...,0.474177,0.402875,0.631539,0.131067,0.000000,-0.030901,-0.146376,0.744337,0.132629,0.000000
499997,99,4997,0,2,1.000000,0.117647,140,0.823529,0.117647,89,...,0.559839,0.524665,0.551034,0.123131,0.000000,0.083889,0.026164,0.649910,0.130587,0.000000
499998,99,4998,0,0,0.000000,0.006536,1,0.032258,0.006536,4,...,0.553223,0.475026,0.565722,0.140282,0.000000,0.174603,0.126667,0.642268,0.157276,0.000000


### Prepare test set

In [34]:
target_test = feature_retrieval_doc['Translation']
data_test = feature_retrieval_doc.filter(items=start_features)
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,jaccard_numbers_source,number_VERB_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,number_!_difference_normalized,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_proc_b_1k,number_NOUN_difference,number_ADJ_difference_normalized,number_ADJ_difference,number_characters_difference_normalized,number_characters_difference,number_VERB_difference,number_ADJ_difference_relative,characters_avg_difference_relative,number_'_difference_relative
0,0.007874,0.007866,0.0,0.523810,0.0,0.071429,1.0,0.0,0.125875,0.531730,9,0.075764,17,0.365701,1570,11,0.680000,0.063690,0.0
1,0.006623,0.006825,0.0,0.411765,0.0,0.071429,1.0,0.0,0.043059,0.567206,9,0.103751,10,0.194533,1755,7,0.555556,0.079496,0.0
2,0.007692,0.007726,0.0,0.166667,0.0,0.071429,1.0,0.0,0.011172,0.609772,11,0.114448,6,0.878653,1964,2,0.428571,0.018978,0.0
3,0.004032,0.008272,0.0,0.166667,0.0,0.071429,1.0,0.0,0.050886,0.580073,11,0.086591,18,0.252923,1755,2,0.692308,0.123123,0.0
4,0.006849,0.006961,0.0,0.473684,0.0,0.071429,1.0,0.0,-0.003096,0.575751,9,0.074896,24,0.352288,1816,9,0.750000,0.133294,0.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,0.000000,0.000000,0.0,0.826087,0.0,0.058824,1.0,0.0,-0.145692,0.623162,10,0.050372,37,1.327586,1524,19,0.948718,0.184063,0.0
499996,0.000000,0.000000,0.0,0.636364,0.0,0.058824,1.0,0.0,-0.146376,0.631539,8,0.030473,10,0.873711,1887,7,0.833333,0.140338,0.0
499997,0.004854,0.000000,0.0,0.500000,0.0,0.058824,1.0,0.0,0.026164,0.551034,9,0.033017,3,1.064516,663,4,0.600000,0.158879,0.0
499998,0.000000,0.000000,0.0,0.000000,0.0,0.058824,1.0,0.0,0.126667,0.565722,9,0.107843,2,1.944444,29,0,0.500000,0.196920,0.0


### Use model

In [35]:
data_test[data_test.columns] = scaler.transform(data_test[data_test.columns])
data_test

Unnamed: 0,jaccard_translation_proc_5k,jaccard_translation_vecmap,jaccard_numbers_source,number_VERB_difference_relative,number_?_difference_relative,number_-_difference_normalized,number_-_difference_relative,number_!_difference_normalized,cosine_similarity_tf_idf_vecmap,euclidean_distance_average_proc_b_1k,number_NOUN_difference,number_ADJ_difference_normalized,number_ADJ_difference,number_characters_difference_normalized,number_characters_difference,number_VERB_difference,number_ADJ_difference_relative,characters_avg_difference_relative,number_'_difference_relative
0,-1.083643,-1.148194,-0.10217,0.702317,-0.229119,4.225414,2.635605,-0.149675,-5.016847,3.229756,1.090514,0.434909,8.453787,-0.930379,14.513577,4.940800,0.599614,-0.627510,-0.207088
1,-1.105986,-1.166766,-0.10217,0.291659,-0.229119,4.225414,2.635605,-0.149675,-5.755854,3.698212,1.090514,1.006094,4.530563,-1.176597,16.366537,2.727556,0.260514,-0.430519,-0.207088
2,-1.086887,-1.150694,-0.10217,-0.606655,-0.229119,4.225414,2.635605,-0.149675,-6.040398,4.260287,1.582376,1.224414,2.288720,-0.192517,18.459880,-0.038998,-0.085507,-1.184756,-0.207088
3,-1.152229,-1.140954,-0.10217,-0.606655,-0.229119,4.225414,2.635605,-0.149675,-5.686007,3.868118,1.582376,0.655874,9.014248,-1.092605,16.366537,-0.038998,0.633152,0.113218,-0.207088
4,-1.101937,-1.164342,-0.10217,0.518601,-0.229119,4.225414,2.635605,-0.149675,-6.167719,3.811053,1.090514,0.417183,12.377011,-0.949672,16.977513,3.834178,0.790358,0.239982,-0.207088
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,-1.224216,-1.288483,-0.10217,1.810201,-0.229119,3.421750,2.635605,-0.149675,-7.440170,4.437109,1.336445,-0.083337,19.663000,0.453257,14.052841,9.367287,1.331847,0.872718,-0.207088
499996,-1.224216,-1.288483,-0.10217,1.114841,-0.229119,3.421750,2.635605,-0.149675,-7.446267,4.547720,0.844582,-0.489459,4.530563,-0.199624,17.688649,2.727556,1.017434,0.327770,-0.207088
499997,-1.137552,-1.288483,-0.10217,0.615052,-0.229119,3.421750,2.635605,-0.149675,-5.906618,3.484663,1.090514,-0.437536,0.607338,0.074841,5.429066,1.067623,0.381621,0.558844,-0.207088
499998,-1.224216,-1.288483,-0.10217,-1.217509,-0.229119,3.421750,2.635605,-0.149675,-5.009779,3.678614,1.090514,1.089612,0.046877,1.340585,-0.921078,-1.145620,0.109130,1.032958,-0.207088


In [36]:
prediction = model.predict_proba(data_test)
MapScore = MAP_score(feature_retrieval_doc['source_id'], target_test, prediction)
print("The Doc MAP score on test set: {:.4f}".format(MapScore))

The Doc MAP score on test set: 0.0004
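
The scaled document-level table above contains values far from zero, for example `number_characters_difference` between roughly 14 and 18 standard deviations. This suggests that document pairs lie well outside the feature distribution the scaler was fitted on, which may contribute to the low score, although the notebook does not establish the cause. A minimal sketch for quantifying this follows; the helper `share_outside_range` and the threshold of 3 are assumptions made for illustration, and the sample values are copied from the first five rows of that table.

```python
def share_outside_range(scaled_columns, z_threshold=3.0):
    """Return, per column, the share of standardized values with |z| > z_threshold."""
    shares = {}
    for name, values in scaled_columns.items():
        outside = sum(1 for value in values if abs(value) > z_threshold)
        shares[name] = outside / len(values) if values else 0.0
    return shares


# First five rows of three columns from the scaled document-level table.
sample = {
    "number_characters_difference": [14.513577, 16.366537, 18.459880, 16.366537, 16.977513],
    "cosine_similarity_tf_idf_vecmap": [-5.016847, -5.755854, -6.040398, -5.686007, -6.167719],
    "jaccard_numbers_source": [-0.10217, -0.10217, -0.10217, -0.10217, -0.10217],
}

shares = share_outside_range(sample)
for column, share in shares.items():
    print("{}: {:.0%} of values beyond 3 standard deviations".format(column, share))
```

On the full frame, the same check could be written with pandas as `(data_test.abs() > 3).mean()`, which returns the share per column.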
