# Model selection and tuning
In this notebook we perform model selection and hyperparameter tuning.

We treat the prediction of Russian as a two-class classification problem and the prediction of ITM as a regression problem, and choose the candidate models accordingly.

For Russian, we consider gradient boosting models (`GradientBoostingClassifier` from the *scikit-learn* library and `CatBoostClassifier` from the *catboost* project), a random forest (`RandomForestClassifier` from *scikit-learn*) and logistic regression (`LogisticRegression` from *scikit-learn*).

For ITM, we also consider gradient boosting (`GradientBoostingRegressor` and `CatBoostRegressor`) and a random forest (`RandomForestRegressor`), as well as linear regression (`LinearRegression`).

We avoid regularization in linear and logistic regression because we want unbiased estimates of the target variable.
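The effect behind this choice can be illustrated on synthetic data. The sketch below is only a toy under stated assumptions: the data are simulated, the "true" coefficients are made up, and `Ridge` (which is not used in this notebook) stands in for a regularized model to show how a penalty shrinks the coefficients away from the values that generated the data.

```python
import numpy as np
from sklearn.linear_model import LinearRegression, Ridge

rng = np.random.default_rng(0)
true_coef = np.array([2.0, -1.0, 0.5])  # hypothetical effects, for illustration only
X = rng.normal(size=(1000, 3))
y = X @ true_coef + rng.normal(scale=1.0, size=1000)

ols = LinearRegression().fit(X, y)     # no penalty, as in this notebook
ridge = Ridge(alpha=1000.0).fit(X, y)  # strong L2 penalty, for comparison only

print("true :", true_coef)
print("OLS  :", ols.coef_.round(2))
print("ridge:", ridge.coef_.round(2))
```

The unpenalized fit recovers the simulated coefficients up to noise, while the penalized fit is systematically pulled towards zero.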

Categorical variables (`sex`, `mother tongue` and `residence`) are processed with a *one-hot* encoding scheme for all models except *catboost* (these models have their own way of dealing with categorical variables).
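As a minimal sketch of what the one-hot step produces, the snippet below applies a `ColumnTransformer` to a tiny, made-up data frame. The rows and the village names are hypothetical; only the column names follow this notebook.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder

# Hypothetical toy rows; the real data are loaded later in the notebook.
toy = pd.DataFrame(
    {
        "sex": ["f", "m", "f", "m"],
        "residence": ["village_a", "village_b", "village_c", "village_a"],
        "year_of_birth": [1950, 1972, 1985, 1990],
    }
)

ct = ColumnTransformer(
    [
        ("catenc", OneHotEncoder(), ["sex", "residence"]),  # one column per category
        ("real", "passthrough", ["year_of_birth"]),         # numeric column kept as is
    ]
)

encoded = ct.fit_transform(toy)
if hasattr(encoded, "toarray"):  # the transformer may return a sparse matrix
    encoded = encoded.toarray()

print(encoded.shape)  # (4, 6): 2 sex columns + 3 residence columns + 1 numeric column
```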

First, we split the dataset into two parts: one with direct data and one with indirect data. We perform the optimization on each part separately.
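In the `do_cv` function defined later in this notebook, the split into parts is implemented by filtering on the `type` column after a train/test split that is stratified by `type`. The following sketch reproduces that pattern on a made-up data frame (the 70/30 mix of types and the `value` column are assumptions for illustration).

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical data: 70 rows of one type and 30 rows of the other.
toy = pd.DataFrame({"type": [0] * 70 + [1] * 30, "value": range(100)})

# Stratifying by "type" keeps the proportion of both parts in train and test.
train, test = train_test_split(
    toy, test_size=0.3, stratify=toy["type"], random_state=42
)

# Each part is then handled on its own.
parts = {
    t: (train[train["type"] == t], test[test["type"] == t]) for t in (0, 1)
}
for t, (part_train, part_test) in parts.items():
    print(f"type={t}: {len(part_train)} train rows, {len(part_test)} test rows")
```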

As usual, each part of the dataset is split into a train and a test sample (70%/30%). Hyperparameter selection is performed using 10-fold cross-validation on the train set. The best model is then evaluated on the test set.
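The procedure can be sketched end to end on synthetic data. The data set generated by `make_regression` and the very small parameter grid are assumptions chosen to keep the example fast; the actual grids used in this notebook are much larger.

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for one part of the dataset.
X, y = make_regression(n_samples=200, n_features=4, noise=10.0, random_state=42)

# 70%/30% train/test split.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=42
)

# 10-fold cross-validated grid search on the train set only.
search = GridSearchCV(
    GradientBoostingRegressor(random_state=42),
    {"max_depth": [1, 2], "n_estimators": [20, 50]},
    cv=10,
    scoring="r2",
)
search.fit(X_train, y_train)

# The refitted best model is evaluated once on the held-out test set.
test_r2 = search.score(X_test, y_test)
print(search.best_params_, round(test_r2, 3))
```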

The optimization objective is negative log loss for Russian and $R^2$ for ITM, because both yield an unbiased estimate of the corresponding mean.
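A small numerical check of this property, on simulated targets (the probability 0.8 and the mean 3.0 are arbitrary assumptions): among all constant predictions, the log loss is smallest and $R^2$ is largest at the sample mean of the target.

```python
import numpy as np
from sklearn.metrics import log_loss, r2_score

rng = np.random.default_rng(0)

# Binary target: search for the constant probability with the lowest log loss.
y_bin = (rng.random(1000) < 0.8).astype(int)
q_grid = np.linspace(0.01, 0.99, 99)
losses = [log_loss(y_bin, np.full(y_bin.shape, q)) for q in q_grid]
best_q = q_grid[int(np.argmin(losses))]

# Continuous target: search for the constant prediction with the highest R^2.
y_cont = rng.normal(loc=3.0, scale=1.0, size=1000)
c_grid = np.linspace(2.0, 4.0, 201)
r2s = [r2_score(y_cont, np.full(y_cont.shape, c)) for c in c_grid]
best_c = c_grid[int(np.argmax(r2s))]

print(f"best constant probability {best_q:.2f} vs. sample mean {y_bin.mean():.3f}")
print(f"best constant prediction  {best_c:.2f} vs. sample mean {y_cont.mean():.3f}")
```

Up to the resolution of the search grid, both optima coincide with the sample means.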

## Imports

In [1]:
import joblib
import numpy as np
import pandas as pd
import itertools
import matplotlib.pyplot as plt
import json
from collections import Counter
from scipy import stats
from sklearn.ensemble import (
    GradientBoostingRegressor,
    GradientBoostingClassifier,
    RandomForestRegressor,
    RandomForestClassifier,
)
from sklearn.linear_model import LogisticRegression, LinearRegression
from itertools import chain
import seaborn as sns
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OrdinalEncoder, OneHotEncoder
from indirect_utils import (
    logodds,
    stratified_permute,
    tologodds,
    trimmed,
    identity,
    read_data,
    residence_info,
    russian_to_target,
)
import random
import math
from tqdm.notebook import tqdm
from sklearn.metrics import r2_score, log_loss, accuracy_score, f1_score
from sklearn.model_selection import train_test_split, GridSearchCV
from catboost import CatBoostClassifier, CatBoostRegressor
import pickle

pd.set_option("display.max_colwidth", 500)

%matplotlib inline

## Data loading and variable specification

In [2]:
data_ITM = read_data("data/ITM.csv")
data_russian = read_data("data/russian.csv").rename(columns={"русский": "russian"})

In [3]:
cat_features = [
    "sex",
    "mother tongue",  # L1, defined according to residence
    "residence",  # Village name
]

real_features = [
    "year_of_birth",
    "language population",
    "village population",
    "elevation",
]

## Model selection framework

In [4]:
def do_cv(estimator, param_grid, cat_features, real_features, target):
    """
    Runs a cross-validated grid search for the given estimator class and param_grid.
    Splits the data into direct and indirect parts and trains a model separately on each part.
    Uses the appropriate dataset and target variable for Russian and ITM.
    target should be in ('russian', 'itm').
    Returns a dataframe with the scores of the best estimator for each part.
    """
    assert target in ("russian", "itm")
    russian = target == "russian"
    if russian:
        data = data_russian
        scoring = lambda *args, **kwargs: -log_loss(*args, **kwargs)
        scoring_cv = "neg_log_loss"
        additional_scorers = [accuracy_score, f1_score]
    else:
        data = data_ITM
        scoring = r2_score
        scoring_cv = "r2"
        additional_scorers = []

    target = russian_to_target[russian]
    features = real_features + cat_features

    data_train, data_test = train_test_split(
        data[["type"] + features + [target]],
        test_size=0.3,
        stratify=data["type"],
        random_state=42,
    )
    
    if issubclass(estimator, (CatBoostClassifier, CatBoostRegressor)):
        # Don't need OneHotEncoder() for CatBoost
        the_model = estimator()
        param_grid["cat_features"] = [cat_features]
    else:
        ct = ColumnTransformer(
            [
                ("catenc", OneHotEncoder(), cat_features),
                ("real", "passthrough", real_features),
            ]
        )

        the_model = Pipeline([("column_transform", ct), ("estimator", estimator())])

    results = []
    for data_type in (0, 1):
        # Tune and evaluate separately on each part of the dataset ("type" column).
        cv = GridSearchCV(the_model, param_grid, cv=10, n_jobs=-1, scoring=scoring_cv)
        # Log loss is computed from class probabilities, R^2 from point predictions.
        predict = cv.predict_proba if russian else cv.predict

        data_train_mt, data_test_mt = (
            data_[lambda x: x["type"] == data_type] for data_ in (data_train, data_test)
        )

        cv.fit(data_train_mt.drop(target, axis=1), data_train_mt[target])

        train_score, test_score = (
            scoring(data_[target], predict(data_.drop(target, axis=1)))
            for data_ in (data_train_mt, data_test_mt)
        )

        # Accuracy and F1 (classification only) need hard class labels.
        predictions = cv.predict(data_test_mt.drop(target, axis=1))
        additional_scores = [
            additional_scorer(data_test_mt[target], predictions)
            for additional_scorer in additional_scorers
        ]

        results.append(
            [
                data_type,
                estimator.__name__,
                cv.cv_results_["mean_test_score"][cv.best_index_],
                cv.cv_results_["std_test_score"][cv.best_index_],
                cv.best_params_,
                train_score,
                test_score,
                cv,
            ] + additional_scores
        )

    return pd.DataFrame(
        results,
        columns=[
            "type",
            "estimator",
            "mean_cv_score",
            "std_cv_score",
            "cv_best_params",
            "train_score",
            "test_score",
            "cv",
        ] + [s.__name__ for s in additional_scorers]
    )

In [5]:
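def compare_estimators(target, models, seed=None):
    """
    Runs do_cv for every (estimator, param_grid) pair in models
    and concatenates the resulting score tables.
    """
    if seed is not None:  # a plain `if seed:` would silently ignore seed=0
        np.random.seed(seed)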
        
    return pd.concat(
        [
            do_cv(
                estimator,
                param_grid,
                cat_features=cat_features,
                real_features=real_features,
                target=target,
            )
            for estimator, param_grid in tqdm(models)
        ],
        axis=0,
        ignore_index=True,
    )

## Model selection
The actual computation is done here. These cells take a long time to execute.
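Two details of the grids below may be worth spelling out. First, models wrapped in a `Pipeline` by `do_cv` are tuned through keys prefixed with the step name (`estimator__`), whereas the CatBoost models are used without a pipeline and take bare parameter names. Second, the size of a grid gives a rough idea of the running time. The sketch below checks both for the `GradientBoostingRegressor` grid; the `"passthrough"` first step is a placeholder assumption standing in for the real column transformer.

```python
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.model_selection import ParameterGrid
from sklearn.pipeline import Pipeline

# Same step names as in do_cv; "passthrough" replaces the ColumnTransformer here.
pipe = Pipeline(
    [("column_transform", "passthrough"), ("estimator", GradientBoostingRegressor())]
)

grid = {
    "estimator__max_depth": [1, 2, 3, 4, 5, 6],
    "estimator__n_estimators": [20, 50, 75, 100, 150, 200, 250, 300, 500, 1000],
    "estimator__random_state": [42],
}

# Every grid key must be a valid parameter name of the pipeline.
valid_keys = all(key in pipe.get_params() for key in grid)

n_candidates = len(ParameterGrid(grid))  # 6 * 10 * 1 parameter combinations
n_fits = n_candidates * 10 * 2           # 10 folds, 2 parts of the dataset
print(valid_keys, n_candidates, n_fits)  # True 60 1200
```

The count excludes the final refit of the best model that `GridSearchCV` performs for each part.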

In [6]:
itm_results = compare_estimators(
    target="itm",
    models=[
        (
            CatBoostRegressor,
            {
                "depth": [1, 2, 3, 4, 5, 6, 7, 8],
                "iterations": [20, 50, 100, 200, 500, 1000],
                "random_state": [42],
                "logging_level": ["Silent"],
            },
        ),
        (
            GradientBoostingRegressor,
            {
                "estimator__max_depth": [1, 2, 3, 4, 5, 6],
                "estimator__n_estimators": [
                    20,
                    50,
                    75,
                    100,
                    150,
                    200,
                    250,
                    300,
                    500,
                    1000,
                ],
                "estimator__random_state": [42],
            },
        ),
        (
            RandomForestRegressor,
            {
                "estimator__max_depth": [1, 2, 3, 4, 5, 6, 10, 20, 50, None],
                "estimator__n_estimators": [
                    20,
                    50,
                    75,
                    100,
                    150,
                    200,
                    250,
                    300,
                    1000,
                    2000,
                    5000,
                ],
                "estimator__random_state": [42],
            },
        ),
        (LinearRegression, {"estimator__fit_intercept": [True],}),
    ],
    seed=42,
)

In [7]:
itm_results.drop(columns=['cv'])

,type,estimator,mean_cv_score,std_cv_score,cv_best_params,train_score,test_score
0,0,CatBoostRegressor,0.50467,0.05401,"{'cat_features': ['sex', 'mother tongue', 'residence'], 'depth': 4, 'iterations': 500, 'logging_level': 'Silent', 'random_state': 42}",0.612798,0.487507
1,1,CatBoostRegressor,0.490332,0.075982,"{'cat_features': ['sex', 'mother tongue', 'residence'], 'depth': 5, 'iterations': 200, 'logging_level': 'Silent', 'random_state': 42}",0.643575,0.440209
2,0,GradientBoostingRegressor,0.509771,0.047797,"{'estimator__max_depth': 2, 'estimator__n_estimators': 500, 'estimator__random_state': 42}",0.601414,0.499058
3,1,GradientBoostingRegressor,0.477928,0.063391,"{'estimator__max_depth': 5, 'estimator__n_estimators': 50, 'estimator__random_state': 42}",0.68316,0.478526
4,0,RandomForestRegressor,0.480033,0.04636,"{'estimator__max_depth': 10, 'estimator__n_estimators': 2000, 'estimator__random_state': 42}",0.690065,0.449557
5,1,RandomForestRegressor,0.464667,0.074315,"{'estimator__max_depth': 10, 'estimator__n_estimators': 50, 'estimator__random_state': 42}",0.756363,0.43963
6,0,LinearRegression,0.463992,0.063478,{'estimator__fit_intercept': True},0.500878,0.457024
7,1,LinearRegression,0.414284,0.083969,{'estimator__fit_intercept': True},0.476078,0.402143


In [8]:
russian_results = compare_estimators(
    target="russian",
    models=[
        (
            CatBoostClassifier,
            {
                "depth": [1, 2, 3, 4, 5, 6, 7, 8],
                "iterations": [20, 50, 100, 200, 500, 1000],
                "random_state": [42],
                "logging_level": ["Silent"],
            },
        ),
        (
            GradientBoostingClassifier,
            {
                "estimator__max_depth": [1, 2, 3, 4, 5, 6],
                "estimator__n_estimators": [
                    20,
                    50,
                    75,
                    100,
                    150,
                    200,
                    250,
                    300,
                    500,
                    1000,
                ],
                "estimator__random_state": [42],
            },
        ),
        (
            RandomForestClassifier,
            {
                "estimator__max_depth": [1, 2, 3, 4, 5, 6, 10, 20, 50, None],
                "estimator__n_estimators": [
                    20,
                    50,
                    75,
                    100,
                    150,
                    200,
                    250,
                    300,
                    1000,
                    2000,
                    5000,
                ],
                "estimator__criterion": ["gini", "entropy"],
                "estimator__random_state": [42],
            },
        ),
        (LogisticRegression, {"estimator__penalty": ["none"],}),
    ],
    seed=42,
)

In [9]:
russian_results.drop(columns=['cv'])

,type,estimator,mean_cv_score,std_cv_score,cv_best_params,train_score,test_score,accuracy_score,f1_score
0,0,CatBoostClassifier,-0.381706,0.017512,"{'cat_features': ['sex', 'mother tongue', 'residence'], 'depth': 4, 'iterations': 1000, 'logging_level': 'Silent', 'random_state': 42}",-0.344177,-0.372736,0.829932,0.895105
1,1,CatBoostClassifier,-0.236057,0.048951,"{'cat_features': ['sex', 'mother tongue', 'residence'], 'depth': 3, 'iterations': 500, 'logging_level': 'Silent', 'random_state': 42}",-0.195574,-0.239133,0.914141,0.953425
2,0,GradientBoostingClassifier,-0.383843,0.014572,"{'estimator__max_depth': 1, 'estimator__n_estimators': 300, 'estimator__random_state': 42}",-0.361974,-0.37015,0.833333,0.896842
3,1,GradientBoostingClassifier,-0.233642,0.051751,"{'estimator__max_depth': 1, 'estimator__n_estimators': 300, 'estimator__random_state': 42}",-0.196978,-0.227983,0.924242,0.958678
4,0,RandomForestClassifier,-0.397339,0.013829,"{'estimator__criterion': 'entropy', 'estimator__max_depth': 10, 'estimator__n_estimators': 5000, 'estimator__random_state': 42}",-0.312547,-0.372805,0.840136,0.902959
5,1,RandomForestClassifier,-0.243756,0.04376,"{'estimator__criterion': 'entropy', 'estimator__max_depth': 10, 'estimator__n_estimators': 2000, 'estimator__random_state': 42}",-0.150372,-0.241045,0.90404,0.947802
6,0,LogisticRegression,-0.493776,0.031521,{'estimator__penalty': 'none'},-0.531325,-0.51057,0.791383,0.883544
7,1,LogisticRegression,-0.32864,0.029972,{'estimator__penalty': 'none'},-0.305205,-0.305698,0.891414,0.94259


In [10]:
itm_results.pivot(index="estimator", columns="type", values="test_score").sort_values(
    0, ascending=False
)

estimator,type 0,type 1
GradientBoostingRegressor,0.499058,0.478526
CatBoostRegressor,0.487507,0.440209
LinearRegression,0.457024,0.402143
RandomForestRegressor,0.449557,0.43963


In [12]:
russian_results.pivot(
    index="estimator", columns="type", values="test_score"
).sort_values(0, ascending=False)

estimator,type 0,type 1
GradientBoostingClassifier,-0.37015,-0.227983
CatBoostClassifier,-0.372736,-0.239133
RandomForestClassifier,-0.372805,-0.241045
LogisticRegression,-0.51057,-0.305698


As we can see, the gradient boosting models (*scikit-learn*'s `GradientBoosting*` and `CatBoost*`) outperform the other models in both settings, though their advantage over `RandomForest*` is rather small (but seems to be consistent across our experiments). `CatBoostRegressor` appears to be marginally better than `GradientBoostingRegressor` in ITM prediction with direct data in some runs (this is unstable and varies with the random seed; in the run shown above, `GradientBoostingRegressor` has the higher test score for both data types), but we will stick with *scikit-learn*'s `GradientBoosting*` models because they are faster.
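To put the size of such differences into perspective, the gap between two models can be compared with the spread of the scores across cross-validation folds. The sketch below does this for the two boosting regressors, using the `mean_cv_score` and `std_cv_score` values for type 0 copied from the ITM results table above. It is an informal check, not a significance test.

```python
import pandas as pd

# Values copied from the ITM results table (type 0).
cv_scores = pd.DataFrame(
    {
        "estimator": ["CatBoostRegressor", "GradientBoostingRegressor"],
        "mean_cv_score": [0.504670, 0.509771],
        "std_cv_score": [0.054010, 0.047797],
    }
).set_index("estimator")

gap = abs(cv_scores["mean_cv_score"].diff().iloc[-1])
smallest_std = cv_scores["std_cv_score"].min()
print(f"gap between means: {gap:.6f}, smallest fold std: {smallest_std:.6f}")
```

The gap is roughly an order of magnitude smaller than the fold-to-fold standard deviation, which is in line with the observation that the ranking of the two models depends on the random seed.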

We will use hyperparameters found during tuning in the subsequent experiments.
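One possible way to reuse a tuned configuration is sketched below. The dictionary is copied from the `cv_best_params` column for `GradientBoostingRegressor` on type 0. Stripping the `estimator__` prefix, and the helper names `prefix` and `model_kwargs`, are assumptions about how one might do this, not necessarily how the subsequent experiments do it.

```python
from sklearn.ensemble import GradientBoostingRegressor

# Copied from the results table: GradientBoostingRegressor, type 0.
best_params = {
    "estimator__max_depth": 2,
    "estimator__n_estimators": 500,
    "estimator__random_state": 42,
}

# The pipeline step prefix is not needed when the model is built directly.
prefix = "estimator__"
model_kwargs = {
    key[len(prefix):]: value
    for key, value in best_params.items()
    if key.startswith(prefix)
}

model = GradientBoostingRegressor(**model_kwargs)
print(model_kwargs)
```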

In [13]:
with open("russian_cv_model_select.pickle", "wb") as f:
    pickle.dump(russian_results, f)
    
with open("itm_cv_model_select.pickle", "wb") as f:
    pickle.dump(itm_results, f)
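
The saved tables can be read back with `pickle.load`. The sketch below shows the round trip on a small stand-in table written to a temporary directory (the file name `toy_cv_model_select.pickle` is hypothetical). Note that the real tables also contain a `cv` column with fitted `GridSearchCV` objects, so *scikit-learn* and *catboost* have to be importable when loading them.

```python
import pickle
import tempfile
from pathlib import Path

import pandas as pd

# Small stand-in for a results table (test scores copied from the ITM table).
toy_results = pd.DataFrame(
    {
        "type": [0, 1],
        "estimator": ["GradientBoostingRegressor", "GradientBoostingRegressor"],
        "test_score": [0.499058, 0.478526],
    }
)

with tempfile.TemporaryDirectory() as tmp_dir:
    path = Path(tmp_dir) / "toy_cv_model_select.pickle"
    with open(path, "wb") as f:
        pickle.dump(toy_results, f)
    with open(path, "rb") as f:
        restored = pickle.load(f)

print(restored.equals(toy_results))  # True
```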