# Supervised Retrieval

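In this notebook we use a supervised classification model for a cross-lingual information retrieval task. The model is trained on EN-DE parallel sentences and evaluated with the mean average precision (MAP) score on the EN-DE, EN-IT, and EN-PL retrieval sets.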

In [1]:
import sys
import os
# Add the parent of the working directory to the import path so that `src` can be imported
sys.path.append(os.path.dirname(os.path.abspath('')))

import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

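In this section we load the feature dataframes: the EN-DE training features (`feature_dataframe`) and the retrieval features for EN-DE, EN-PL, EN-IT, and the `feature_retrieval_doc` set.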

In [2]:
feature_dataframe=pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval=pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_pl = pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_it = pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_doc = pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})

#### Drop all columns that contain only one distinct value

In [3]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
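
The helper `threshold_counts` is imported from `src.models.predict_model` and its body is not shown in this notebook. The following is a minimal sketch of a comparable column mask, assuming the helper flags columns that hold more than `threshold` distinct values. The function name `has_more_than_n_values` and the toy frame are hypothetical.

```python
import pandas as pd


def has_more_than_n_values(column: pd.Series, threshold: int = 1) -> bool:
    """Return True when the column holds more than `threshold` distinct values."""
    # Hypothetical stand-in for threshold_counts; the real helper may differ.
    return column.nunique(dropna=False) > threshold


toy_frame = pd.DataFrame({
    "constant_feature": [0.0, 0.0, 0.0],
    "varying_feature": [0.1, 0.5, 0.9],
    "Translation": [1, 0, 0],
})

# DataFrame.apply calls the function once per column and returns a boolean Series.
toy_mask = toy_frame.apply(has_more_than_n_values, threshold=1)
filtered_frame = toy_frame.loc[:, toy_mask]
print(list(filtered_frame.columns))
```

Computing the mask once on the training frame and reusing it on the retrieval frame, as the cell above does, keeps both frames on the same set of columns.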


## II. Supervised Retrieval

#### Start with one feature

In [11]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]
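
The starting feature and the candidate pool defined here are the inputs to forward feature selection. The internals of `pipeline_model_optimization` are not shown, so the following is only a sketch of one plausible greedy scheme, assuming that candidates are swept in order, that a candidate is kept when it raises the score by more than a threshold, and that sweeps repeat until a full pass adds nothing. The names `forward_selection`, `score_fn`, and `min_gain`, as well as the toy scoring function, are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    start_features: Sequence[str],
    candidate_features: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.001,
) -> Tuple[List[str], float]:
    """Greedily add candidates that raise the score by more than `min_gain`."""
    selected = list(start_features)
    best_score = score_fn(selected)
    improved = True
    while improved:  # keep sweeping the candidate list until a full pass adds nothing
        improved = False
        for feature in candidate_features:
            if feature in selected:
                continue
            trial_score = score_fn(selected + [feature])
            if trial_score - best_score > min_gain:
                selected.append(feature)
                best_score = trial_score
                improved = True
    return selected, best_score


# Toy scoring function (hypothetical): each feature contributes a fixed gain.
toy_gains = {"a": 0.70, "b": 0.05, "c": 0.0005, "d": 0.02}


def toy_score(features: List[str]) -> float:
    return sum(toy_gains[name] for name in features)


chosen, score = forward_selection(["a"], ["b", "c", "d"], toy_score)
print(chosen, round(score, 4))
```

In the notebook's setting, `score_fn` would train the classifier on the given feature subset and return the MAP score on the retrieval set.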

#### XGBoost

In [43]:
from xgboost import XGBClassifier

start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]

xgb = XGBClassifier(verbosity = 0, use_label_encoder=False, random_state=42)
scaler = preprocessing.StandardScaler()

xgb_parameter_grid = {"verbosity": [0],
                      "use_label_encoder": [False],
                      "random_state": [42],
                      'min_child_weight': [1, 5],
                      'gamma': [0.5, 1,  5],
                      'subsample': [0.6, 0.8],
                      'colsample_bytree': [0.6, 0.8, 1.0],
                      'max_depth': [3, 4, 5]}

xgb_best_features, xgb_best_parameter_combination, xgb_best_map_score, xgb_all_parameter_combination = \
pipeline_model_optimization(xgb, xgb_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.001)

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7388
Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.7406
Updated MAP score on test set with new feature euclidean_distance_tf_idf_vecmap: 0.7665
Updated MAP score on test set with new feature euclidean_distance_tf_idf_proc_b_1k: 0.7948
Updated MAP score on test set with new feature number_VERB_difference_relative: 0.8067
Updated MAP score on test set with new feature number_VERB_difference: 0.8261
Updated MAP score on test set with new feature number_ADJ_difference_relative: 0.8380
Updated MAP score on test set with new feature number_-_difference: 0.8463

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8463

-----------------Result of Feature Selection-----------------

Best MAP Score after feature selection: 0.8463057084876486


-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 108


Hyperparameter Tuning:   1%|          | 1/108 [00:05<10:33,  5.92s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 3, 'MAP_score': 0.8140683054648559}
With Map Score 0.8141


Hyperparameter Tuning:   2%|▏         | 2/108 [00:12<10:59,  6.22s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 4, 'MAP_score': 0.8195359903714428}
With Map Score 0.8195


Hyperparameter Tuning:   6%|▌         | 6/108 [00:56<15:52,  9.34s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.8, 'max_depth': 5, 'MAP_score': 0.8325409529537386}
With Map Score 0.8325


Hyperparameter Tuning:  10%|█         | 11/108 [01:36<13:07,  8.12s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.8, 'colsample_bytree': 0.6, 'max_depth': 4, 'MAP_score': 0.8364889636529729}
With Map Score 0.8365


Hyperparameter Tuning: 100%|██████████| 108/108 [15:44<00:00,  8.75s/it]


-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.8, 'colsample_bytree': 0.6, 'max_depth': 4}
With MAP Score: 0.8365
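
The log reports 108 parameter combinations, which equals the product of the list lengths in `xgb_parameter_grid` (1 x 1 x 1 x 2 x 3 x 2 x 3 x 3). The following is a minimal sketch of such an enumeration, assuming a plain Cartesian product over the grid. How `pipeline_model_optimization` builds its combinations internally is not shown, and `expand_grid` is a hypothetical name.

```python
from itertools import product
from typing import Any, Dict, List


def expand_grid(grid: Dict[str, List[Any]]) -> List[Dict[str, Any]]:
    """Return one dict per combination of the grid's values (Cartesian product)."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[key] for key in keys))]


toy_grid = {
    "verbosity": [0],
    "use_label_encoder": [False],
    "random_state": [42],
    "min_child_weight": [1, 5],
    "gamma": [0.5, 1, 5],
    "subsample": [0.6, 0.8],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
}

combinations = expand_grid(toy_grid)
print(len(combinations))
print(combinations[0])
```

Each dict can be unpacked into the estimator constructor (for example `XGBClassifier(**combination)`), scored on the retrieval set, and compared against the best score seen so far.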





In [50]:
xgb_best_features_ = ["jaccard_translation_proc_5k",
                      "jaccard_translation_vecmap", 
                      "euclidean_distance_tf_idf_vecmap",
                      "euclidean_distance_tf_idf_proc_b_1k",
                      "number_VERB_difference_relative",
                      "number_VERB_difference",
                      "number_ADJ_difference_relative",
                      "number_-_difference"]

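# Only the non-tuned arguments are set: the grid search above peaked at MAP 0.8365,
# below the 0.8463 reached with the default XGBoost settings after feature selection.
xgb_best_hyperparameters = {"verbosity": 0, "use_label_encoder": False, "random_state": 42}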

In [51]:
from xgboost import XGBClassifier

# Training data: binary translation label plus the features chosen by forward selection
target_train = feature_dataframe['Translation'].astype(float)
data_train = feature_dataframe.drop(columns=['Translation','source_id','target_id'])
data_train = data_train.loc[:, xgb_best_features_]
# Fit the scaler on the training features only; the same scaler is reused for every test set
scaler = preprocessing.StandardScaler()
data_train.loc[:, data_train.columns] = scaler.fit_transform(data_train.loc[:, data_train.columns])

print("Model was trained on EN-DE Parallel Sentences.\n")
xgb_classifier = XGBClassifier(**xgb_best_hyperparameters).fit(data_train.to_numpy(), target_train.to_numpy())

# Evaluate the EN-DE model on every language pair, reusing the scaler fitted on the training data
retrieval_sets = {"EN-DE": feature_retrieval, "EN-IT": feature_retrieval_it, "EN-PL": feature_retrieval_pl}
for language_pair, retrieval_frame in retrieval_sets.items():
    target_test = retrieval_frame['Translation'].astype(float)
    data_test = retrieval_frame.drop(columns=['Translation','source_id','target_id'])
    data_test = data_test.loc[:, xgb_best_features_]
    data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
    prediction = xgb_classifier.predict_proba(data_test).tolist()
    print("{} Map Score: {}".format(language_pair, MAP_score(retrieval_frame['source_id'], target_test, prediction)))

Model was trained on EN-DE Parallel Sentences.

EN-DE Map Score: 0.8463057084876486
EN-IT Map Score: 0.7718790626112558
EN-PL Map Score: 0.8044894862823776
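
The scores above come from `MAP_score`, whose implementation lives in `src.models.predict_model` and is not shown here. The following is a minimal sketch of mean average precision, assuming that each `source_id` is one query, that its candidates are ranked by the predicted probability of the positive class, and that queries without a relevant candidate are skipped. The function name `mean_average_precision` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def mean_average_precision(
    source_ids: Sequence[int],
    labels: Sequence[float],
    probabilities: Sequence[Sequence[float]],
) -> float:
    """Mean over queries of the average precision of the ranked candidates."""
    per_query: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for source_id, label, probability in zip(source_ids, labels, probabilities):
        # probability[1] is the predicted probability of the positive class
        per_query[source_id].append((probability[1], label))

    average_precisions = []
    for candidates in per_query.values():
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits, precision_sum = 0, 0.0
        for rank, (_, label) in enumerate(ranked, start=1):
            if label == 1:
                hits += 1
                precision_sum += hits / rank
        if hits:  # queries without any relevant candidate are skipped
            average_precisions.append(precision_sum / hits)
    if not average_precisions:
        return 0.0
    return sum(average_precisions) / len(average_precisions)


toy_source_ids = [1, 1, 1, 2, 2, 2]
toy_labels = [1.0, 0.0, 0.0, 0.0, 1.0, 0.0]
toy_probabilities = [[0.1, 0.9], [0.6, 0.4], [0.8, 0.2],
                     [0.3, 0.7], [0.5, 0.5], [0.9, 0.1]]
toy_map = mean_average_precision(toy_source_ids, toy_labels, toy_probabilities)
print(toy_map)
```

In the toy data the correct candidate is ranked first for query 1 and second for query 2, so the average precisions are 1 and 1/2. When every query has exactly one correct translation, the average precision of a query reduces to the reciprocal rank of that translation.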
