# Supervised Retrieval for All Models with All Features

In this notebook we apply supervised classification models to a cross-lingual information retrieval task. We use the default settings and all features that remain after removing correlated features and features that take only a single value across the whole column.

In [1]:
#import sys
#import os
#sys.path.append(os.path.dirname((os.path.abspath(''))))

import pandas as pd
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from src.models.predict_model import MAP_score, threshold_counts

## I. Import Data

In this section we import two feature dataframes: one for training the models and one for the retrieval task.

In [2]:
feature_dataframe=pd.read_feather("../data/processed/feature_model_en_de.feather")
feature_retrieval=pd.read_feather("../data/processed/feature_retrieval_en_de.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

#### Delete all columns with only one value

In [3]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]

In [4]:
len(feature_retrieval.columns)

97
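The column filter above relies on `threshold_counts`, which is imported from the project module and not shown in this notebook. The following is a minimal sketch of the idea, assuming the helper returns `True` when a column has more than `threshold` distinct values. The function `more_than_n_unique` and the toy dataframes are hypothetical stand-ins, not the project's code.

```python
import pandas as pd


def more_than_n_unique(column: pd.Series, threshold: int = 1) -> bool:
    """Hypothetical stand-in for threshold_counts.

    Returns True if the column holds more than `threshold` distinct values.
    """
    return column.nunique(dropna=False) > threshold


# Toy data: column "b" is constant in the training frame.
toy_train = pd.DataFrame({"a": [1, 2, 3], "b": [0, 0, 0], "c": [0.5, 0.5, 0.7]})
toy_test = pd.DataFrame({"a": [4, 5], "b": [1, 0], "c": [0.1, 0.2]})

# Build the mask on the training frame only, then apply it to both frames
# so that train and test keep exactly the same columns.
mask = toy_train.apply(more_than_n_unique, threshold=1)
toy_train_kept = toy_train.loc[:, mask]
toy_test_kept = toy_test.loc[:, mask]
kept_columns = list(toy_train_kept.columns)
```

Computing the mask on the training frame and reusing it for the retrieval frame mirrors the cell above and keeps the two feature sets aligned.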

## II. Supervised Retrieval

#### Drop the target label and the indexes for training and testing

In [5]:
target_train=feature_dataframe['Translation'].astype(float)
data_train=feature_dataframe.drop(columns=['Translation','source_id','target_id'])
target_test=feature_retrieval['Translation'].astype(float)
data_test=feature_retrieval.drop(columns=['Translation','source_id','target_id'])

#### Z-Normalization

In [6]:
# standardize each feature to zero mean and unit variance (fit on train only)
scaler = preprocessing.StandardScaler()
data_train.loc[:, data_train.columns] = scaler.fit_transform(data_train.loc[:, data_train.columns])
data_test.loc[:, data_test.columns] = scaler.transform(data_test.loc[:, data_test.columns])
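To make the normalization step concrete, here is a small sketch on toy arrays (the values are made up for illustration). It shows that `StandardScaler` subtracts the per-column mean and divides by the per-column population standard deviation learned from the training data, and that the test data is transformed with those same training statistics.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Toy data, not taken from the notebook's feature files.
train = np.array([[1.0, 10.0], [2.0, 20.0], [3.0, 30.0]])
test = np.array([[2.0, 40.0]])

scaler = StandardScaler()
train_scaled = scaler.fit_transform(train)  # learns mean and std from train
test_scaled = scaler.transform(test)        # reuses the train statistics

# The same result computed by hand (np.std uses ddof=0, like StandardScaler).
manual = (train - train.mean(axis=0)) / train.std(axis=0)
```

Because the scaler is fitted on the training set only, scaled test values are not guaranteed to have zero mean or unit variance, and they are not bounded to any fixed interval.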

# Naive Bayes

In [7]:
nb = GaussianNB().fit(data_train, target_train)
prediction = nb.predict_proba(data_test)
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

The MAP score on test set: 0.3244
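The `MAP_score` helper comes from `src.models.predict_model` and its implementation is not shown here. As a rough illustration of what a mean average precision over retrieval queries can look like, the sketch below assumes that each `source_id` is one query, that candidates are ranked by the probability in the second column of `predict_proba`, and that queries without any relevant candidate are skipped. The function name `mean_average_precision` and all toy values are hypothetical.

```python
import numpy as np
import pandas as pd


def mean_average_precision(source_ids, labels, probabilities):
    """Hypothetical MAP sketch: one query per source_id, ranked by P(label=1)."""
    frame = pd.DataFrame({
        "source_id": np.asarray(source_ids),
        "label": np.asarray(labels, dtype=float),
        "score": np.asarray(probabilities)[:, 1],
    })
    average_precisions = []
    for _, group in frame.groupby("source_id"):
        ranked = group.sort_values("score", ascending=False)
        relevant = ranked["label"].to_numpy() == 1.0
        if not relevant.any():
            continue  # assumption: skip queries with no relevant candidate
        ranks = np.arange(1, len(ranked) + 1)
        # Precision at each rank where a relevant candidate appears.
        precision_at_hits = np.cumsum(relevant)[relevant] / ranks[relevant]
        average_precisions.append(precision_at_hits.mean())
    return float(np.mean(average_precisions))


# Toy example: query 0 ranks its translation first, query 1 ranks it second.
toy_sources = [0, 0, 0, 1, 1, 1]
toy_labels = [1, 0, 0, 0, 0, 1]
toy_probs = np.array([
    [0.1, 0.9], [0.8, 0.2], [0.7, 0.3],
    [0.4, 0.6], [0.9, 0.1], [0.5, 0.5],
])
toy_map = mean_average_precision(toy_sources, toy_labels, toy_probs)
```

Under these assumptions, when every source sentence has exactly one correct translation, the average precision of a query reduces to the reciprocal rank of that translation.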


# MLP Classifier

In [8]:
mlp = MLPClassifier(verbose=True, early_stopping=True).fit(data_train, target_train)
prediction = mlp.predict_proba(data_test)
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

Iteration 1, loss = 0.06539059
Validation score: 0.985682
Iteration 2, loss = 0.03628665
Validation score: 0.986500
Iteration 3, loss = 0.03410492
Validation score: 0.986182
Iteration 4, loss = 0.03367580
Validation score: 0.986545
Iteration 5, loss = 0.03275958
Validation score: 0.987045
Iteration 6, loss = 0.03221838
Validation score: 0.986682
Iteration 7, loss = 0.03168696
Validation score: 0.987000
Iteration 8, loss = 0.03085646
Validation score: 0.986682
Iteration 9, loss = 0.03065942
Validation score: 0.987091
Iteration 10, loss = 0.03035675
Validation score: 0.985864
Iteration 11, loss = 0.03027931
Validation score: 0.986818
Iteration 12, loss = 0.02986307
Validation score: 0.987409
Iteration 13, loss = 0.02938762
Validation score: 0.986818
Iteration 14, loss = 0.02902481
Validation score: 0.987227
Iteration 15, loss = 0.02894969
Validation score: 0.987136
Iteration 16, loss = 0.02849265
Validation score: 0.987182
Iteration 17, loss = 0.02908350
Validation score: 0.987273
Iterat

# Logistic Regression

In [9]:
lr = LogisticRegression(max_iter=100000, verbose=10).fit(data_train.to_numpy(), target_train.to_numpy())
prediction = lr.predict_proba(data_test.to_numpy())
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))

[Parallel(n_jobs=1)]: Using backend SequentialBackend with 1 concurrent workers.
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.0s remaining:    0.0s
[Parallel(n_jobs=1)]: Done   1 out of   1 | elapsed:   31.0s finished


The MAP score on test set: 0.6661


In [10]:
prediction

array([[1.30858736e-02, 9.86914126e-01],
       [9.99772324e-01, 2.27676183e-04],
       [9.99999377e-01, 6.22875232e-07],
       ...,
       [9.99866234e-01, 1.33765934e-04],
       [9.99999346e-01, 6.54168260e-07],
       [7.06386589e-01, 2.93613411e-01]])
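The array above has one row per candidate pair and one column per class. A short sketch on toy data (not the notebook's features) shows how the columns of `predict_proba` follow the order of `classes_`, so the probability of the positive class can be selected explicitly rather than by assuming it sits in column 1.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Toy, linearly separable data with float labels like the Translation column.
X = np.array([[-2.0], [-1.0], [1.0], [2.0]])
y = np.array([0.0, 0.0, 1.0, 1.0])

clf = LogisticRegression().fit(X, y)
proba = clf.predict_proba(X)  # shape (n_samples, n_classes), rows sum to 1

# Columns are ordered like clf.classes_, so look up the positive class.
positive_column = list(clf.classes_).index(1.0)
positive_scores = proba[:, positive_column]
```

With labels `0.0` and `1.0`, `classes_` is sorted ascending, so the second column holds the probability that a pair is a translation.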

# XGBoost

In [11]:
from xgboost import XGBClassifier

model = XGBClassifier()
model.fit(data_train.to_numpy(), target_train.to_numpy())

# predict on a NumPy array, matching the input type used for fitting
prediction = model.predict_proba(data_test.to_numpy()).tolist()
print("The MAP score on test set: {:.4f}".format(MAP_score(feature_retrieval['source_id'],target_test,prediction)))



The MAP score on test set: 0.7100


In [12]:
prediction

[[0.00016641616821289062, 0.9998335838317871],
 [0.9994708895683289, 0.0005291011184453964],
 [0.9999507665634155, 4.925528264720924e-05],
 [0.9998302459716797, 0.00016975290782283992],
 [0.9883915781974792, 0.011608441360294819],
 [0.9990099668502808, 0.0009900357108563185],
 [0.9991344213485718, 0.0008655553101561964],
 [0.9973745942115784, 0.00262541719712317],
 [0.9996878504753113, 0.00031214350019581616],
 [0.9999774694442749, 2.253130151075311e-05],
 [0.9997862577438354, 0.00021374788775574416],
 [0.9996382594108582, 0.0003617327893152833],
 [0.9966864585876465, 0.003313559340313077],
 [0.9994636178016663, 0.0005363533273339272],
 [0.9999358057975769, 6.418725388357416e-05],
 [0.999866783618927, 0.0001332088722847402],
 [0.9995082020759583, 0.0004917940823361278],
 [0.9991277456283569, 0.0008722495404072106],
 [0.9999446868896484, 5.529464760911651e-05],
 [0.9998815655708313, 0.00011845587869174778],
 [0.999837338924408, 0.000162656040629372],
 [0.9857541918754578, 0.014245810918