# Supervised Retrieval

In this notebook we apply supervised classification models (an MLP classifier and logistic regression) to a cross-lingual information retrieval task.
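Retrieval quality throughout this notebook is reported as a MAP (mean average precision) score. The implementation of `MAP_score` in `src.models.predict_model` is not shown here, so the following is only a minimal sketch of the standard definition. It assumes that each row holds a source id, a target id, a model score, and a 0/1 relevance label (in the spirit of the `source_id`, `target_id` and `Translation` columns used below). The function names and the toy rows are hypothetical.

```python
from collections import defaultdict


def average_precision(relevance_in_rank_order):
    """Average precision for one query, given 0/1 labels in ranked order."""
    hits = 0
    precisions = []
    for rank, relevant in enumerate(relevance_in_rank_order, start=1):
        if relevant:
            hits += 1
            precisions.append(hits / rank)  # precision at this relevant hit
    return sum(precisions) / len(precisions) if precisions else 0.0


def mean_average_precision(rows):
    """rows: iterable of (source_id, target_id, score, is_translation)."""
    by_source = defaultdict(list)
    for source_id, _target_id, score, is_translation in rows:
        by_source[source_id].append((score, is_translation))

    ap_values = []
    for candidates in by_source.values():
        # Rank the candidates of one source sentence by descending score.
        ranked = sorted(candidates, key=lambda c: c[0], reverse=True)
        ap_values.append(average_precision([label for _, label in ranked]))
    return sum(ap_values) / len(ap_values)


# Hypothetical toy data: s1 ranks its translation first, s2 ranks it second.
toy_rows = [
    ("s1", "t1", 0.9, 1),
    ("s1", "t2", 0.2, 0),
    ("s2", "t1", 0.8, 0),
    ("s2", "t2", 0.6, 1),
    ("s2", "t3", 0.1, 0),
]
toy_map = mean_average_precision(toy_rows)
print(toy_map)
```

When every source sentence has exactly one correct target, average precision reduces to the reciprocal rank of that target, so MAP equals the mean reciprocal rank in that special case.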

In [1]:
import pandas as pd
import numpy as np
import seaborn as sn
import matplotlib.pyplot as plt
from sklearn import preprocessing
from sklearn.neural_network import MLPClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.feature_selection import RFECV
from src.models.predict_model import MAP_score, threshold_counts, feature_selection, pipeline_model_optimization

## I. Import Data

In this section we load the feature dataframes for the model and for the retrieval task.

In [2]:
feature_dataframe = pd.read_feather("../data/processed/feature_model.feather")
feature_retrieval = pd.read_feather("../data/processed/feature_retrieval.feather")
feature_dataframe = feature_dataframe.rename(columns={"id_source": "source_id", "id_target": "target_id"})
feature_retrieval = feature_retrieval.rename(columns={"id_source": "source_id", "id_target": "target_id"})

In [3]:
# We cannot use the sentiment features, since we did not find a sentiment pipeline for Italian and Polish
drop_features = [
    'score_polarity_difference',
    'score_polarity_difference_relative',
    'score_polarity_difference_normalized',
    'score_subjectivity_difference',
    'score_subjectivity_difference_relative',
    'score_subjectivity_difference_normalized',
]
feature_dataframe = feature_dataframe.drop(columns=drop_features)
feature_retrieval = feature_retrieval.drop(columns=drop_features)

#### Delete all columns with only one value

In [4]:
column_mask = feature_dataframe.apply(threshold_counts, threshold=1)
feature_dataframe = feature_dataframe.loc[:, column_mask]
feature_retrieval = feature_retrieval.loc[:, column_mask]
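The body of `threshold_counts` lives in `src.models.predict_model` and is not shown in this notebook. As a rough illustration of the masking pattern used in the cell above, the sketch below assumes the helper returns `True` for a column with more than `threshold` distinct values. The function `has_more_unique_values` and the toy dataframes are hypothetical stand-ins.

```python
import pandas as pd


def has_more_unique_values(column: pd.Series, threshold: int = 1) -> bool:
    """Hypothetical stand-in for threshold_counts: keep a column only if it
    contains more than `threshold` distinct values."""
    return column.nunique(dropna=False) > threshold


# Toy dataframes standing in for feature_dataframe / feature_retrieval.
train = pd.DataFrame({"a": [1, 2, 3], "constant": [0, 0, 0], "b": [0.1, 0.1, 0.5]})
retrieval = pd.DataFrame({"a": [4, 5], "constant": [0, 1], "b": [0.2, 0.2]})

# DataFrame.apply calls the function once per column and forwards threshold=1,
# which yields a boolean Series indexed by column name.
column_mask = train.apply(has_more_unique_values, threshold=1)

# The mask computed on the first dataframe is reused for the second one,
# so both end up with the same set of columns.
train_reduced = train.loc[:, column_mask]
retrieval_reduced = retrieval.loc[:, column_mask]
print(list(train_reduced.columns))
```

Reusing one mask for both dataframes keeps their feature columns aligned, even when a column that is constant in one dataframe varies in the other.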

## II. Supervised Retrieval

# MLP Classifier

#### Start with one feature

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]
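The forward selection performed inside `pipeline_model_optimization` is not shown in this notebook. Judging only from its log output (repeated passes through the feature list, with a feature kept whenever the MAP score improves) and from the `threshold_map_feature_selection` argument, it plausibly resembles the greedy procedure sketched below. The function `forward_selection`, the `score_fn` callback and the toy weights are hypothetical.

```python
def forward_selection(start_features, candidate_features, score_fn, min_gain=0.001):
    """Greedy forward selection (sketch).

    Repeatedly passes over the candidates and keeps a feature whenever it
    improves the score by more than `min_gain`. Stops after a full pass
    without any improvement.
    """
    selected = list(start_features)
    best_score = score_fn(selected)
    improved = True
    while improved:
        improved = False
        for feature in candidate_features:
            if feature in selected:
                continue
            score = score_fn(selected + [feature])
            if score - best_score > min_gain:
                selected.append(feature)
                best_score = score
                improved = True
    return selected, best_score


# Hypothetical scoring function: each feature adds a fixed amount to the score.
toy_weights = {"a": 0.5, "b": 0.1, "c": 0.0005, "d": 0.05}


def toy_score(features):
    return sum(toy_weights[f] for f in features)


chosen, final_score = forward_selection(["a"], ["b", "c", "d"], toy_score)
print(chosen, round(final_score, 4))
```

In the notebook's setting, `score_fn` would fit the scaled model on the selected columns and return the MAP score on the retrieval set.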

In [None]:
model = MLPClassifier(random_state=42, early_stopping=True)
scaler = preprocessing.StandardScaler()

MLP_parameter_grid = {
    'hidden_layer_sizes': [(100,), (50, 100), (50, 75, 100)],
    'activation': ['logistic', 'tanh', 'relu'],
    'solver': ['sgd', 'adam'],
    'alpha': [0.0001, 0.05],
    'learning_rate': ['constant', 'adaptive'],
    'random_state': [42],
    'early_stopping': [True]
}
MLP_best_features, MLP_best_parameter_combination, MLP_best_map_score, MLP_all_parameter_combination = \
    pipeline_model_optimization(model, MLP_parameter_grid, scaler, feature_dataframe,
                                feature_retrieval, start_features,
                                added_features,
                                threshold_map_feature_selection=0.001)
# Current hyperparameters: {'hidden_layer_sizes': (100,), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
# MAP score on test set with current hyperparameters: 0.8667

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1


In [6]:
model = MLPClassifier(random_state=42, early_stopping=True, max_iter=100000)
scaler = preprocessing.StandardScaler()

MLP_parameter_grid = {
    'hidden_layer_sizes': [(5,), (5, 3, 2), (6, 3), (6, 2), (7, 3), (8, 3, 2)],
    'activation': ['tanh', 'relu'],
    'solver': ['lbfgs', 'adam'],
    'alpha': [0.0001],
    'learning_rate': ['constant', 'adaptive'],
    'random_state': [42],
    'early_stopping': [True]
}
MLP_best_features, MLP_best_parameter_combination, MLP_best_map_score, MLP_all_parameter_combination = \
    pipeline_model_optimization(model, MLP_parameter_grid, scaler, feature_dataframe,
                                feature_retrieval, start_features,
                                added_features,
                                threshold_map_feature_selection=0.001)
# Current hyperparameters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
# MAP score on test set with current hyperparameters: 0.8674

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.8667

-----------------Result of Feature Selection-----------------

Best MAP Score after feature selection: 0.8666658045132479


-----------------Start Hyperparameter-tuning with Grid Search-----------------
Number of Parameter Combinations: 48

Current Hyperpamaters: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8283

Current Hyperpamaters: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8283

Current Hyperpamaters: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'con

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8459

Current Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8459

Current Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8453

Current Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8453

Current Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.0004

Current Hyperpamaters: {'hidden_layer_sizes': (5, 3, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test 

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8338

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8338

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8390

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8390

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8272

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8272

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8388

Current Hyperpamaters: {'hidden_layer_sizes': (6, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8388

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8426

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8426

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8502

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8502

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8529

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8529

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8352

Current Hyperpamaters: {'hidden_layer_sizes': (6, 2), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8352

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8430

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8430

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8457

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8457

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8325

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8325

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8674

Current Hyperpamaters: {'hidden_layer_sizes': (7, 3), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8674

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8379

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8379

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8357

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'tanh', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8357

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8265

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'relu', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
  self.n_iter_ = _check_optimize_result("lbfgs", opt_res, self.max_iter)


MAP score on test set with current hyperpamaters: 0.8265

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8288

Current Hyperpamaters: {'hidden_layer_sizes': (8, 3, 2), 'activation': 'relu', 'solver': 'adam', 'alpha': 0.0001, 'learning_rate': 'adaptive', 'random_state': 42, 'early_stopping': True}
MAP score on test set with current hyperpamaters: 0.8288

-----------------Result of Hyperparameter Tuning-----------------

Best Hyperamater Settting: {'hidden_layer_sizes': (5,), 'activation': 'tanh', 'solver': 'lbfgs', 'alpha': 0.0001, 'learning_rate': 'constant', 'random_state': 42, 'early_stopping': True}
With MAP Score: 0.0000
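The count of 48 combinations in the log follows from the grid sizes: 6 × 2 × 2 × 1 × 2 × 1 × 1 = 48. In scikit-learn's `MLPClassifier`, `learning_rate` is only used when `solver='sgd'`, which is consistent with the identical MAP scores logged for each `constant`/`adaptive` pair. The sketch below enumerates the grid with `itertools.product` and counts how many combinations remain once that ignored parameter is discounted. The helpers `expand_grid` and `effective_key` are hypothetical and not part of `pipeline_model_optimization`.

```python
from itertools import product

mlp_grid = {
    "hidden_layer_sizes": [(5,), (5, 3, 2), (6, 3), (6, 2), (7, 3), (8, 3, 2)],
    "activation": ["tanh", "relu"],
    "solver": ["lbfgs", "adam"],
    "alpha": [0.0001],
    "learning_rate": ["constant", "adaptive"],
    "random_state": [42],
    "early_stopping": [True],
}


def expand_grid(grid):
    """Return every parameter combination of the grid as a dict."""
    keys = list(grid)
    return [dict(zip(keys, values)) for values in product(*(grid[k] for k in keys))]


def effective_key(params):
    """Hashable key that ignores learning_rate unless the solver is sgd."""
    relevant = dict(params)
    if relevant["solver"] != "sgd":
        relevant.pop("learning_rate")
    return tuple(sorted(relevant.items()))


all_combinations = expand_grid(mlp_grid)
distinct_combinations = {effective_key(p) for p in all_combinations}
print(len(all_combinations), len(distinct_combinations))
```

Under this assumption, dropping `learning_rate` from a grid that contains no `sgd` solver would halve the number of model fits without changing the results.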


In [25]:
scaler = preprocessing.StandardScaler()
model = MLPClassifier(hidden_layer_sizes=(7, 3), activation='relu', random_state=42,
                      solver='adam', alpha=0.0001, learning_rate='adaptive',
                      early_stopping=True)
feature_selection(model, scaler, feature_dataframe, feature_retrieval, start_features, added_features)

The initial MAP score on test set: 0.8674


0.8674403459178001

# Logistic Regression

#### Start with one feature

In [5]:
start_features = ["jaccard_translation_proc_5k"]
not_add = ["Translation", "source_id", "target_id"]
added_features = feature_dataframe.columns[~feature_dataframe.columns.isin(start_features+not_add)]

In [6]:
model = LogisticRegression()
scaler = preprocessing.StandardScaler()

LR_parameter_grid = {
    'penalty': ['l1', 'l2', 'elasticnet'],
    'C': np.logspace(-4, 4, 50),
    'solver': ['newton-cg', 'lbfgs', 'liblinear', 'sag', 'saga'],
    'max_iter': [100000]
}

LR_best_features, LR_best_parameter_combination, LR_best_map_score, LR_all_parameter_combination = \
    pipeline_model_optimization(model, LR_parameter_grid, scaler, feature_dataframe,
                                feature_retrieval, start_features,
                                added_features,
                                threshold_map_feature_selection=0.001)
# The initial MAP score on test set: 0.7555
# Updated MAP score on test set with new feature cosine_similarity_tf_idf_vecmap: 0.7590
# Updated MAP score on test set with new feature jaccard_numbers_source: 0.7620
# Updated MAP score on test set with new feature number_VERB_difference_normalized: 0.7637
# Updated MAP score on test set with new feature number_VERB_difference_relative: 0.7754
# Updated MAP score on test set with new feature number_NOUN_difference_relative: 0.7969
# Updated MAP score on test set with new feature number_ADJ_difference_normalized: 0.8020
# Updated MAP score on test set with new feature number_?_difference_normalized: 0.8038
# Updated MAP score on test set with new feature number_:_difference_relative: 0.8127
# Updated MAP score on test set with new feature number_-_difference_normalized: 0.8148

# Current Iteration through feature list: 2
# The initial MAP score on test set: 0.8148
# Updated MAP score on test set with new feature jaccard_translation_vecmap: 0.8199
# Updated MAP score on test set with new feature jaccard_translation_proc_b_1k: 0.8243
# Best model: {'penalty': 'l2', 'C': 0.009102981779915217, 'solver': 'lbfgs'} -> MAP 0.8242


# feat=['jaccard_translation_proc_5k', 'jaccard_numbers_source', 'cosine_similarity_average_proc_5k', 'cosine_similarity_tf_idf_proc_5k', 'cosine_similarity_average_proc_b_1k', 'cosine_similarity_tf_idf_proc_b_1k', 'euclidean_distance_tf_idf_proc_b_1k', 'euclidean_distance_average_vecmap', 'euclidean_distance_tf_idf_vecmap', 'number_words_difference', 'number_!_difference_normalized', 'number_&_difference', 'number_-_difference', 'number_-_difference_relative', 'number_-_difference_normalized', 'number_._difference_relative', 'number_:_difference', 'number_:_difference_relative', 'number_:_difference_normalized', 'number_;_difference', 'number_;_difference_relative', 'number_?_difference', 'number_?_difference_relative', 'number_?_difference_normalized', 'characters_avg_difference_relative', 'score_subjectivity_difference', 'score_subjectivity_difference_normalized', 'euclidean_distance_tf_idf_proc_5k', 'euclidean_distance_average_proc_b_1k', 'cosine_similarity_average_vecmap', 'cosine_similarity_tf_idf_vecmap', 'jaccard_translation_vecmap', 'number_%_difference', 'number_%_difference_relative', "number_'_difference", "number_'_difference_relative", 'number_+_difference', 'number_[_difference', 'number_[_difference_relative', 'number_ADJ_difference_normalized', 'score_subjectivity_difference_relative']
# penalty="l2", C=0.0001, solver:lbfgs -> MAP 0.8449

-----------------First do Forward Selection-----------------

Current Iteration through feature list: 1
The initial MAP score on test set: 0.7555
Updated MAP score on test set with new feature cosine_similarity_tf_idf_vecmap: 0.7590
Updated MAP score on test set with new feature jaccard_numbers_source: 0.7620
Updated MAP score on test set with new feature number_VERB_difference_normalized: 0.7637
Updated MAP score on test set with new feature number_VERB_difference_relative: 0.7754
Updated MAP score on test set with new feature number_NOUN_difference_relative: 0.7969
Updated MAP score on test set with new feature number_ADJ_difference_normalized: 0.8020
Updated MAP score on test set with new feature number_?_difference_normalized: 0.8038
Updated MAP score on test set with new feature number_:_difference_relative: 0.8127
Updated MAP score on test set with new feature number_-_difference_normalized: 0.8148

Current Iteration through feature list: 2
The initial MAP score on test set: 0.81

MAP score on test set with current hyperpamaters: 0.8106

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8105

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.004291934260128779, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8121

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.0062505519252739694, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.0062505519252739694, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.3906939937054613, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.3906939937054613, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 0.5689866029018293, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 51.79474679231202, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 51.79474679231202, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 51.79474679231202, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 51.79474679231202, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 51.79474679231202, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 75.43120063354607, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 75.43120063354607, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C'

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 4714.8663634573895, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 4714.8663634573895, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 6866.488450042998, 'solver': 'newton-cg', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 6866.488450042998, 'solver': 'lbfgs', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 6866.488450042998, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l1', 'C': 6866.488450042998, 'solver': 'sag', 'max_iter': 100000}
Mistake

Current Hyperpamaters: {'penalty': 'l1', 'C': 6866.488450042998, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243



MAP score on test set with current hyperpamaters: 0.8205

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0020235896477251557, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8205

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0020235896477251557, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8203

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0020235896477251557, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8205

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0020235896477251557, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8205

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0029470517025518097, 'solver': 'newton-cg', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8210

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.0029470517025518097, 'solver': 'lbfgs', 'max_iter': 100000}

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.08685113737513521, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.12648552168552957, 'solver': 'newton-cg', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 0.12648552168552957, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score o

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 3.727593720314938, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 3.727593720314938, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 3.727593720314938, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 3.727593720314938, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 5.428675439323859, 'solver': 'newton-cg', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 5.428675439323859, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set w

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 159.98587196060572, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 159.98587196060572, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 159.98587196060572, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 232.99518105153672, 'solver': 'newton-cg', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 232.99518105153672, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 232.99518105153672, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on 

MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 6866.488450042998, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 6866.488450042998, 'solver': 'saga', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 10000.0, 'solver': 'newton-cg', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 10000.0, 'solver': 'lbfgs', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 10000.0, 'solver': 'liblinear', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current Hyperpamaters: {'penalty': 'l2', 'C': 10000.0, 'solver': 'sag', 'max_iter': 100000}
MAP score on test set with current hyperpamaters: 0.8243

Current

KeyError: 'MAP_score'
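
The traceback above is cut off, so the place where the `'MAP_score'` lookup fails is not visible. As a sketch only, the sweep could be structured so that every result record carries its score under a single key, written at the moment the record is created. The 50-point log grid for `C` is an assumption that reproduces the logged values (for example 3.7276 and 6866.49). `evaluate` and `grid_search` are hypothetical names, not functions from `src.models.predict_model`.

```python
import math
from itertools import product

# Assumption: 50 log-spaced C values from 1e-4 to 1e4 (consistent with the logged values).
C_VALUES = [10 ** (-4 + 8 * i / 49) for i in range(50)]
SOLVERS = ["newton-cg", "lbfgs", "liblinear", "sag", "saga"]


def grid_search(evaluate, c_values=C_VALUES, solvers=SOLVERS):
    """Score every (C, solver) pair and return the best result record.

    `evaluate` is a hypothetical callable mapping a hyperparameter dict
    to a MAP score (float).
    """
    results = []
    for c, solver in product(c_values, solvers):
        params = {"penalty": "l2", "C": c, "solver": solver, "max_iter": 100000}
        # The score key is written where the record is created,
        # so a later lookup by "MAP_score" always finds it.
        results.append({"params": params, "MAP_score": evaluate(params)})
    # max() returns the first record with the highest score.
    return max(results, key=lambda r: r["MAP_score"])
```

In the notebook, `evaluate` would wrap whatever fits the scaler and model and computes the MAP score on the test set. Because the log shows the same score for every configuration, ties are likely, and this sketch resolves them in favour of the first configuration tried.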

In [6]:
# keep_columns=['jaccard_translation_proc_5k','cosine_similarity_tf_idf_vecmap','jaccard_numbers_source','number_VERB_difference_normalized','number_VERB_difference_relative',
#              'number_NOUN_difference_relative','number_ADJ_difference_normalized','number_?_difference_normalized','number_:_difference_relative','number_-_difference_normalized',
#              'jaccard_translation_vecmap','jaccard_translation_proc_b_1k']
# features=['jaccard_numbers_source',
#  'cosine_similarity_average_proc_5k',
#  'cosine_similarity_tf_idf_proc_5k',
#  'euclidean_distance_average_proc_5k',
#  'euclidean_distance_tf_idf_proc_5k',
#  'cosine_similarity_average_proc_b_1k',
#  'cosine_similarity_tf_idf_proc_b_1k',
#  'euclidean_distance_average_proc_b_1k',
#  'euclidean_distance_tf_idf_proc_b_1k',
#  'jaccard_translation_proc_b_1k',
#  'cosine_similarity_average_vecmap',
#  'cosine_similarity_tf_idf_vecmap',
#  'euclidean_distance_average_vecmap',
#  'euclidean_distance_tf_idf_vecmap',
#  'jaccard_translation_vecmap', 'number_punctuations_total_difference',
#  'number_punctuations_total_difference_relative',
#  'number_punctuations_total_difference_normalized',
#  'number_words_difference',
#  'number_words_difference_relative',
#  'number_words_difference_normalized',
#  'number_unique_words_difference',
#  'number_unique_words_difference_relative',
#  'number_unique_words_difference_normalized',
#  'number_!_difference',
#  'number_!_difference_relative',
#  'number_!_difference_normalized',
#  'number_#_difference',
#  'number_#_difference_relative',
#  'number_#_difference_normalized',
#  'number_$_difference',
#  'number_$_difference_relative',
#  'number_$_difference_normalized',
#  'number_%_difference',
#  'number_%_difference_relative',
#  'number_%_difference_normalized',
#  'number_&_difference',
#  'number_&_difference_relative',
#  'number_&_difference_normalized',
#  "number_'_difference",
#  "number_'_difference_relative",
#  "number_'_difference_normalized",
#  'number_(_difference',
#  'number_(_difference_relative',
#  'number_(_difference_normalized',
#  'number_)_difference',
#  'number_)_difference_relative',
#  'number_)_difference_normalized',
#  'number_+_difference',
#  'number_+_difference_relative',
#  'number_+_difference_normalized',
#  'number_,_difference',
#  'number_,_difference_relative',
#  'number_,_difference_normalized',
#  'number_-_difference',
#  'number_-_difference_relative',
#  'number_-_difference_normalized',
#  'number_._difference',
#  'number_._difference_relative',
#  'number_._difference_normalized',
#  'number_/_difference',
#  'number_/_difference_relative',
#  'number_/_difference_normalized',
#  'number_:_difference',
#  'number_:_difference_relative',
#  'number_:_difference_normalized',
#  'number_;_difference',
#  'number_;_difference_relative',
#  'number_;_difference_normalized',
#  'number_?_difference',
#  'number_?_difference_relative',
#  'number_?_difference_normalized',
#  'number_[_difference',
#  'number_[_difference_relative',
#  'number_[_difference_normalized',
#  'number_]_difference',
#  'number_]_difference_relative',
#  'number_]_difference_normalized',
#  'number_characters_difference',
#  'number_characters_difference_relative',
#  'number_characters_difference_normalized',
#  'characters_avg_difference',
#  'characters_avg_difference_relative',
#  'characters_avg_difference_normalized',
#  'number_ADJ_difference',
#  'number_ADJ_difference_relative',
#  'number_ADJ_difference_normalized',
#  'number_NOUN_difference',
#  'number_NOUN_difference_relative',
#  'number_NOUN_difference_normalized',
#  'number_VERB_difference',
#  'number_VERB_difference_relative',
#  'number_VERB_difference_normalized']

In [9]:
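# Standardize the features, then fit a strongly regularized (small C) L2 logistic regression
scaler = preprocessing.StandardScaler()
model = LogisticRegression(max_iter=100000, penalty="l2", C=0.0001, solver="lbfgs")
# Start from start_features and try the remaining columns in added_features
feature_selection(model, scaler, feature_dataframe, feature_retrieval, start_features, added_features)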

The initial MAP score on test set: 0.8302


0.8302100146418323
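
The body of `feature_selection` lives in `src.models.predict_model` and is not shown in this notebook. Assuming it performs a greedy forward selection (start from `start_features`, then keep a candidate only if it raises the MAP score), the idea can be sketched as follows. `forward_selection` and `score_fn` are hypothetical names used for illustration, and the single pass over the candidates is a simplification.

```python
def forward_selection(score_fn, start_features, candidates):
    """Greedy single-pass forward selection.

    `score_fn` is a hypothetical callable mapping a list of feature names
    to a score (higher is better), e.g. a MAP score on held-out data.
    """
    selected = list(start_features)
    best_score = score_fn(selected)
    for feature in candidates:
        trial = selected + [feature]
        trial_score = score_fn(trial)
        # Keep the candidate only if it strictly improves the score.
        if trial_score > best_score:
            selected, best_score = trial, trial_score
    return selected, best_score


# Toy usage: each feature contributes a fixed amount to the score.
toy_weights = {"a": 0.5, "b": 0.2, "c": -0.1}
toy_score = lambda feats: sum(toy_weights[f] for f in feats)
chosen, score = forward_selection(toy_score, ["a"], ["b", "c"])
```

In the toy run, `b` is kept because it raises the score and `c` is rejected because it lowers it. A single pass is order-dependent, so repeating the pass until no candidate helps is a common variation.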