# Supervised Retrieval

In this notebook we apply a supervised classification model to a cross-lingual information retrieval task.

In [1]:
import sys
import os
sys.path.append(os.path.dirname((os.path.abspath(''))))

import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn.metrics import accuracy_score, f1_score, precision_score, recall_score, log_loss
from sklearn.model_selection import train_test_split
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

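In this section we load the feature dataframes: the EN-DE features used to train the model, the EN-DE retrieval features used for feature selection and hyperparameter tuning, and the retrieval test sets (EN-DE, EN-PL, EN-IT and the document corpus).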

In [9]:
feature_dataframe=pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval=pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})


# Load Test Data
feature_retrieval_de = pd.read_feather("../data/processed/feature_retrieval_en_de_testset.feather")
feature_retrieval_de = feature_retrieval_de.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_pl = pd.read_feather("../data/processed/feature_retrieval_en_pl.feather")
feature_retrieval_pl = feature_retrieval_pl.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_it = pd.read_feather("../data/processed/feature_retrieval_en_it.feather")
feature_retrieval_it = feature_retrieval_it.rename(columns={"id_source": "source_id", "id_target": "target_id"})

feature_retrieval_doc = pd.read_feather("../data/processed/feature_retrieval_doc.feather")
feature_retrieval_doc = feature_retrieval_doc.rename(columns={"id_source": "source_id", "id_target": "target_id"})

#### Delete all columns with only one value

In [3]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
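
`threshold_counts` comes from the project's `src.models.predict_model` module and its body is not shown in this notebook. As a rough sketch only, assuming it keeps a column when the column has more than `threshold` distinct values, a helper with the same call pattern could look like the following. The name `more_than_n_unique` and the toy dataframe are hypothetical and exist only for illustration.

```python
import pandas as pd


def more_than_n_unique(column: pd.Series, threshold: int = 1) -> bool:
    """Return True when the column has more than `threshold` distinct values.

    Hypothetical stand-in for the project's `threshold_counts`.
    """
    return column.nunique(dropna=False) > threshold


# Toy dataframe: "constant" carries no information and should be dropped.
toy = pd.DataFrame({
    "constant": [0, 0, 0],
    "varying": [0.1, 0.5, 0.9],
    "Translation": [1, 0, 0],
})

# DataFrame.apply calls the helper once per column and forwards `threshold`.
column_mask = toy.apply(more_than_n_unique, threshold=1)
filtered = toy.loc[:, column_mask]
kept_columns = list(filtered.columns)
print(kept_columns)
```

Computing the mask once on the training dataframe and reusing it for the retrieval dataframe, as the cell above does, keeps both frames on the same set of columns.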


## II. Supervised Retrieval

#### Start with one feature

In [4]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]

#### XGBoost

In [5]:
from xgboost import XGBClassifier

start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]

xgb = XGBClassifier(verbosity = 0, use_label_encoder=False, random_state=42)
scaler = preprocessing.StandardScaler()

xgb_parameter_grid = {"verbosity": [0],
                      "use_label_encoder": [False],
                      "random_state": [42],
                      'min_child_weight': [1, 5],
                      'gamma': [0.5, 1,  5],
                      'subsample': [0.6, 0.8],
                      'colsample_bytree': [0.6, 0.8, 1.0],
                      'max_depth': [3, 4, 5]}

xgb_best_features, xgb_best_parameter_combination, xgb_best_map_score, xgb_all_parameter_combination = \
pipeline_model_optimization(xgb, xgb_parameter_grid, scaler, feature_dataframe, 
                            feature_retrieval, start_features, 
                            added_features, 
                            threshold_map_feature_selection=0.001)

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7689
Updated MAP score on test set with new feature cosine_similarity_tf_idf_vecmap: 0.7933
Updated MAP score on test set with new feature cosine_similarity_average_vecmap: 0.7956
Updated MAP score on test set with new feature number_]_difference_normalized: 0.7971
Updated MAP score on test set with new feature number_?_difference_normalized: 0.8087
Updated MAP score on test set with new feature number_-_difference_relative: 0.8207

Current Iteration through feature list: 2
The initial MAP score on test set: 0.8207
Updated MAP score on test set with new feature number_[_difference_normalized: 0.8252

Current Iteration through feature list: 3
The initial MAP score on test set: 0.8252
Updated MAP score on test set with new feature jaccard_numbers_source: 0.8330

Current Iteration through feature list: 4
The initial MAP score on test set: 0.8330


Hyperparameter Tuning:   0%|          | 0/108 [00:00<?, ?it/s]


-----------------Result of Feature Selection-----------------

Best MAP Score after feature selection: 0.8330139822214722


-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 108


Hyperparameter Tuning:   1%|          | 1/108 [00:04<07:59,  4.49s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 3, 'MAP_score': 0.8102546653754751}
With Map Score 0.8103


Hyperparameter Tuning:   2%|▏         | 2/108 [00:10<08:38,  4.89s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 4, 'MAP_score': 0.8174467051651921}
With Map Score 0.8174


Hyperparameter Tuning:   4%|▎         | 4/108 [00:21<08:50,  5.10s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.8, 'max_depth': 3, 'MAP_score': 0.8283726610502967}
With Map Score 0.8284


Hyperparameter Tuning:   5%|▍         | 5/108 [00:27<09:11,  5.35s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.6, 'colsample_bytree': 0.8, 'max_depth': 4, 'MAP_score': 0.8324104205462614}
With Map Score 0.8324


Hyperparameter Tuning:   9%|▉         | 10/108 [00:59<10:34,  6.48s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.8, 'colsample_bytree': 0.6, 'max_depth': 3, 'MAP_score': 0.832892520075553}
With Map Score 0.8329


Hyperparameter Tuning:  16%|█▌        | 17/108 [01:47<10:46,  7.10s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 0.5, 'subsample': 0.8, 'colsample_bytree': 1.0, 'max_depth': 4, 'MAP_score': 0.833349721145063}
With Map Score 0.8333


Hyperparameter Tuning:  18%|█▊        | 19/108 [02:00<09:36,  6.48s/it]


Current Best Hyperpamaters: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 1, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 3, 'MAP_score': 0.8357483985293337}
With Map Score 0.8357


Hyperparameter Tuning: 100%|██████████| 108/108 [11:13<00:00,  6.24s/it]


-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 1, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 3}
With MAP Score: 0.8357
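
The log above suggests a greedy forward selection: the candidate list is scanned repeatedly, a feature is kept whenever it raises the MAP score, and the scan stops after a pass that adds nothing. The internals of `pipeline_model_optimization` are not shown here, so the following is only a sketch under those assumptions. It also assumes that `threshold_map_feature_selection` acts as a minimum required gain. The function `forward_selection` and the toy scoring function are hypothetical.

```python
from typing import Callable, List, Sequence, Tuple


def forward_selection(
    start: Sequence[str],
    candidates: Sequence[str],
    score_fn: Callable[[List[str]], float],
    min_gain: float = 0.001,
) -> Tuple[List[str], float]:
    """Greedy forward selection with repeated passes over the candidates."""
    selected = list(start)
    best = score_fn(selected)
    improved = True
    while improved:  # one loop iteration corresponds to one pass in the log
        improved = False
        for feature in candidates:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score - best > min_gain:  # keep only meaningful improvements
                selected.append(feature)
                best = score
                improved = True
    return selected, best


def toy_score(features: List[str]) -> float:
    """Toy stand-in for 'train the model and compute MAP'."""
    score = 0.0
    if "a" in features:
        score += 0.5
    if "b" in features:
        score += 0.25
    if "c" in features:
        score += 0.0005  # below min_gain, never selected
    if "d" in features and "b" in features:
        score += 0.125  # only useful once "b" is present
    return score


selected, best = forward_selection(["a"], ["d", "b", "c"], toy_score)
print(selected, best)
```

In the toy run, `d` is rejected in the first pass and accepted in the second, after `b` has been added. This is why more than one pass over the feature list can pay off.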





In [6]:
xgb_best_features

['jaccard_translation_proc_5k',
 'cosine_similarity_tf_idf_vecmap',
 'cosine_similarity_average_vecmap',
 'number_]_difference_normalized',
 'number_?_difference_normalized',
 'number_-_difference_relative',
 'number_[_difference_normalized',
 'jaccard_numbers_source']

In [7]:
xgb_best_features_ = ['jaccard_translation_proc_5k',
 'cosine_similarity_tf_idf_vecmap',
 'cosine_similarity_average_vecmap',
 'number_]_difference_normalized',
 'number_?_difference_normalized',
 'number_-_difference_relative',
 'number_[_difference_normalized',
 'jaccard_numbers_source']

xgb_best_hyperparameters = {'verbosity': 0, 'use_label_encoder': False, 'random_state': 42, 'min_child_weight': 1, 'gamma': 1, 'subsample': 0.6, 'colsample_bytree': 0.6, 'max_depth': 3}
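
The count of 108 parameter combinations reported during tuning follows directly from the grid: 2 × 3 × 2 × 3 × 3 values for the five varied parameters, with the other three held fixed. The sketch below expands the same grid with `itertools.product`. Whether `pipeline_model_optimization` enumerates the grid this way is an assumption; the variable names are illustrative.

```python
from itertools import product

xgb_parameter_grid = {
    "verbosity": [0],
    "use_label_encoder": [False],
    "random_state": [42],
    "min_child_weight": [1, 5],
    "gamma": [0.5, 1, 5],
    "subsample": [0.6, 0.8],
    "colsample_bytree": [0.6, 0.8, 1.0],
    "max_depth": [3, 4, 5],
}

# Cartesian product of all value lists, one dict per combination.
names = list(xgb_parameter_grid)
combinations = [
    dict(zip(names, values))
    for values in product(*xgb_parameter_grid.values())
]

best = {"verbosity": 0, "use_label_encoder": False, "random_state": 42,
        "min_child_weight": 1, "gamma": 1, "subsample": 0.6,
        "colsample_bytree": 0.6, "max_depth": 3}

n_combinations = len(combinations)
best_position = combinations.index(best) + 1  # 1-based, like the progress bar
print(n_combinations, best_position)
```

With this enumeration order the best setting sits at position 19, which is consistent with the progress bar reading 19/108 when it was reported.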

## III. Evaluate on Test Set

In [10]:
from xgboost import XGBClassifier

# Training data: keep only the selected features and standardize them.
target_train = feature_dataframe['Translation'].astype(float)
data_train = feature_dataframe.drop(columns=['Translation','source_id','target_id'])
data_train = data_train.loc[:, xgb_best_features_]
scaler = preprocessing.StandardScaler()
data_train.loc[:, data_train.columns] = scaler.fit_transform(data_train.loc[:, data_train.columns])

# Fit XGBoost with the best hyperparameters found above.
# The scaler fitted on the training data is reused for every test set below.
print("Model was trained on EN-DE Parallel Sentences.\n")
xgb_classifier = XGBClassifier(**xgb_best_hyperparameters).fit(data_train.to_numpy(), target_train.to_numpy())

# EN-DE
target_test = feature_retrieval_de['Translation'].astype(float)
data_test = feature_retrieval_de.drop(columns=['Translation','source_id','target_id'])
data_test = data_test.loc[:, xgb_best_features_]
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
prediction = xgb_classifier.predict_proba(data_test).tolist()
print("EN-DE Map Score: {}".format(MAP_score(feature_retrieval_de['source_id'],target_test,prediction)))

# EN-IT
target_test = feature_retrieval_it['Translation'].astype(float)
data_test = feature_retrieval_it.drop(columns=['Translation','source_id','target_id'])
data_test = data_test.loc[:, xgb_best_features_]
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
prediction = xgb_classifier.predict_proba(data_test).tolist()
print("EN-IT Map Score: {}".format(MAP_score(feature_retrieval_it['source_id'],target_test,prediction)))

# EN-PL
target_test = feature_retrieval_pl['Translation'].astype(float)
data_test = feature_retrieval_pl.drop(columns=['Translation','source_id','target_id'])
data_test = data_test.loc[:, xgb_best_features_]
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
prediction = xgb_classifier.predict_proba(data_test).tolist()
print("EN-PL Map Score: {}".format(MAP_score(feature_retrieval_pl['source_id'],target_test,prediction)))

# Document Corpus
target_test = feature_retrieval_doc['Translation'].astype(float)
data_test = feature_retrieval_doc.drop(columns=['Translation','source_id','target_id'])
data_test = data_test.loc[:, xgb_best_features_]
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
prediction = xgb_classifier.predict_proba(data_test).tolist()
print("Document Corpus Map Score: {}".format(MAP_score(feature_retrieval_doc['source_id'],target_test,prediction)))

Model was trained on EN-DE Parallel Sentences.

EN-DE Map Score: 0.8431334195285983
EN-IT Map Score: 0.8686442751667981
EN-PL Map Score: 0.8664685384247898
Document Corpus Map Score: 0.0021624042275190146
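
The scores above come from the project's `MAP_score(source_ids, targets, predictions)`, whose implementation is not shown in this notebook. As a hedged illustration of the metric itself, the sketch below computes mean average precision from the same three inputs. It assumes each prediction is a `[p_negative, p_positive]` pair as returned by `predict_proba`, and that candidates are ranked per source by `p_positive`. The function name `mean_average_precision` and the toy data are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List, Sequence, Tuple


def mean_average_precision(
    source_ids: Sequence[int],
    targets: Sequence[float],
    probabilities: Sequence[Sequence[float]],
) -> float:
    """Mean over sources of the average precision of the ranked candidates."""
    by_source: Dict[int, List[Tuple[float, float]]] = defaultdict(list)
    for source_id, target, proba in zip(source_ids, targets, probabilities):
        by_source[source_id].append((proba[1], target))

    average_precisions = []
    for candidates in by_source.values():
        # Rank the candidates of one source by positive-class probability.
        ranked = sorted(candidates, key=lambda pair: pair[0], reverse=True)
        hits = 0
        precision_sum = 0.0
        for rank, (_, target) in enumerate(ranked, start=1):
            if target == 1:
                hits += 1
                precision_sum += hits / rank
        if hits:  # sources without any correct candidate are skipped
            average_precisions.append(precision_sum / hits)

    if not average_precisions:
        return 0.0
    return sum(average_precisions) / len(average_precisions)


# Source 1: the correct target is ranked first (AP = 1.0).
# Source 2: the correct target is ranked second (AP = 0.5).
source_ids = [1, 1, 2, 2, 2]
targets = [1.0, 0.0, 0.0, 1.0, 0.0]
probabilities = [[0.1, 0.9], [0.8, 0.2], [0.2, 0.8], [0.4, 0.6], [0.9, 0.1]]

map_value = mean_average_precision(source_ids, targets, probabilities)
print(map_value)
```

If each source has exactly one correct target, average precision reduces to the reciprocal of the rank at which that target appears, so the metric rewards placing the true translation near the top of the candidate list.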
