# Supervised Retrieval for All Models with All Features

In this notebook we apply supervised classification models to a supervised cross-lingual information retrieval task. The models use their default settings and all features that remain after removing correlated features and features that have only one value in the whole column.

In [1]:
import sys
import os
#sys.path.append(os.path.dirname((os.path.abspath(''))))

import pandas as pd
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from src.models.predict_model import MAP_score, threshold_counts

## I. Import Data

In this section we load the two feature dataframes: one for training the models and one for the retrieval task.

In [2]:
# Training features and retrieval (test) features for the EN-DE pair
feature_dataframe = pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval_en_de.feather")

# Use the same ID column names in both dataframes
id_columns = {"id_source": "source_id", "id_target": "target_id"}
feature_dataframe = feature_dataframe.rename(columns=id_columns)
feature_retrieval = feature_retrieval.rename(columns=id_columns)

#### Delete all columns with only one value

In [3]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
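
The helper `threshold_counts` comes from `src.models.predict_model`, and its implementation is not shown in this notebook. As a rough sketch of how such a column mask could work, the hypothetical function `has_more_unique_values_than` below keeps a column only if it holds more than `threshold` distinct values. The function name, the toy dataframe, and the exact comparison are assumptions made for illustration, not the project's code.

```python
import pandas as pd


def has_more_unique_values_than(column: pd.Series, threshold: int = 1) -> bool:
    """Hypothetical stand-in for threshold_counts: True if the column varies."""
    # nunique() ignores NaN by default, so an all-NaN column counts as 0 values
    return column.nunique() > threshold


# Toy data: one constant column and two columns that vary
toy = pd.DataFrame({
    "constant": [1, 1, 1],
    "varying": [0.1, 0.5, 0.9],
    "label": [0, 1, 0],
})

# DataFrame.apply calls the function once per column and forwards `threshold`
mask = toy.apply(has_more_unique_values_than, threshold=1)
filtered = toy.loc[:, mask]
print(list(filtered.columns))  # ['varying', 'label']
```

Computing the mask on the training dataframe and reusing it for the retrieval dataframe, as the cell above does, keeps both dataframes on the same set of columns.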

In [4]:
len(feature_retrieval.columns)

97

## II. Supervised Retrieval

#### Separate the target label and drop the ID columns for training and testing

In [5]:
# Labels: 1.0 if the (source, target) pair is a translation, else 0.0
target_train = feature_dataframe['Translation'].astype(float)
data_train = feature_dataframe.drop(columns=['Translation', 'source_id', 'target_id'])
target_test = feature_retrieval['Translation'].astype(float)
data_test = feature_retrieval.drop(columns=['Translation', 'source_id', 'target_id'])

#### Z-Normalization

In [6]:
# Standardize each feature to zero mean and unit variance (fitted on the training data only)
scaler = preprocessing.StandardScaler()
data_train.loc[:, data_train.columns] = scaler.fit_transform(data_train.loc[:, data_train.columns])
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
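
To make the step above concrete, here is a minimal NumPy sketch of the arithmetic behind z-normalization: the mean and standard deviation are estimated on the training split only and then reused for the test split. The function name `z_normalize` and the random toy arrays are assumptions for illustration; the notebook itself relies on `StandardScaler`.

```python
import numpy as np


def z_normalize(train: np.ndarray, test: np.ndarray):
    """Standardize both splits with statistics taken from the training split."""
    mean = train.mean(axis=0)
    std = train.std(axis=0)  # population standard deviation (ddof=0)
    return (train - mean) / std, (test - mean) / std


rng = np.random.default_rng(0)
toy_train = rng.normal(loc=5.0, scale=2.0, size=(1000, 3))
toy_test = rng.normal(loc=5.0, scale=2.0, size=(200, 3))

train_scaled, test_scaled = z_normalize(toy_train, toy_test)

# The training split ends up with mean 0 and standard deviation 1 per feature;
# the test split is only approximately standardized, since it reuses the
# training statistics.
print(np.allclose(train_scaled.mean(axis=0), 0.0))
print(np.allclose(train_scaled.std(axis=0), 1.0))
```

Reusing the training statistics for the test split avoids leaking information from the retrieval data into preprocessing. Note that the scaled values are not confined to the interval [0, 1].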

# Naive Bayes

In [7]:
nb = GaussianNB().fit(data_train, target_train)
prediction = nb.predict_proba(data_test)
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

The MAP score on test set: 0.1287
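
`MAP_score` is imported from `src.models.predict_model`, and its implementation is not shown here. As a hedged sketch of what a mean average precision over queries might look like, the hypothetical function `mean_average_precision` below treats every `source_id` as one query, ranks its candidates by the predicted probability of the positive class, and averages the per-query average precision. The function name, the handling of queries without a relevant candidate, and the toy data are assumptions; the project's helper may differ.

```python
import numpy as np
import pandas as pd


def mean_average_precision(source_ids, labels, probabilities) -> float:
    """Hypothetical stand-in for MAP_score; the real helper may differ."""
    frame = pd.DataFrame({
        "source_id": np.asarray(source_ids),
        "label": np.asarray(labels, dtype=float),
        # Column 1 of predict_proba output is the positive-class probability
        "score": np.asarray(probabilities)[:, 1],
    })
    average_precisions = []
    for _, group in frame.groupby("source_id"):
        ranked = group.sort_values("score", ascending=False)
        relevant = ranked["label"].to_numpy() == 1.0
        if not relevant.any():
            continue  # assumption: skip queries with no correct candidate
        hits = np.cumsum(relevant)               # relevant items seen so far
        ranks = np.arange(1, len(ranked) + 1)    # 1-based rank positions
        # Precision at each rank where a relevant item appears, then averaged
        average_precisions.append((hits[relevant] / ranks[relevant]).mean())
    return float(np.mean(average_precisions))


# Two toy queries with three candidates each
toy_sources = [1, 1, 1, 2, 2, 2]
toy_labels = [1, 0, 0, 0, 1, 0]
toy_probabilities = [
    [0.1, 0.9], [0.8, 0.2], [0.9, 0.1],  # query 1: translation ranked first
    [0.2, 0.8], [0.5, 0.5], [0.9, 0.1],  # query 2: translation ranked second
]
toy_map = mean_average_precision(toy_sources, toy_labels, toy_probabilities)
print(f"{toy_map:.2f}")  # 0.75
```

With a single correct translation per query, the average precision of a query reduces to the reciprocal of the rank at which that translation appears.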


# MLP Classifier

In [8]:
mlp = MLPClassifier(verbose=True, early_stopping=True).fit(data_train, target_train)
prediction = mlp.predict_proba(data_test)
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

Iteration 1, loss = 0.06518519
Validation score: 0.987091
Iteration 2, loss = 0.03789368
Validation score: 0.987773
Iteration 3, loss = 0.03533450
Validation score: 0.988364
Iteration 4, loss = 0.03473055
Validation score: 0.988227
Iteration 5, loss = 0.03368928
Validation score: 0.988182
Iteration 6, loss = 0.03309335
Validation score: 0.988500
Iteration 7, loss = 0.03249780
Validation score: 0.988773
Iteration 8, loss = 0.03149650
Validation score: 0.988909
Iteration 9, loss = 0.03099210
Validation score: 0.989045
Iteration 10, loss = 0.03063312
Validation score: 0.989636
Iteration 11, loss = 0.03044787
Validation score: 0.988864
Iteration 12, loss = 0.03039670
Validation score: 0.989364
Iteration 13, loss = 0.02978781
Validation score: 0.989364
Iteration 14, loss = 0.03024525
Validation score: 0.989227
Iteration 15, loss = 0.02917336
Validation score: 0.989545
Iteration 16, loss = 0.02930624
Validation score: 0.989636
Iteration 17, loss = 0.02875342
Validation score: 0.989500
Iterat

# Logistic Regression

In [9]:
lr = LogisticRegression(max_iter=100000, verbose=10).fit(data_train.to_numpy(), target_train.to_numpy())
prediction = lr.predict_proba(data_test.to_numpy())
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.3s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   23.3s finished


The MAP score on test set: 0.7379


# XGBoost

In [10]:
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(data_train.to_numpy(), target_train.to_numpy())

# Predict on a NumPy array, matching the array format used for training
prediction = model.predict_proba(data_test.to_numpy()).tolist()
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))



The MAP score on test set: 0.7169
