## Prudential risk prediction

KATE expects your code to define variables with specific names that correspond to certain things we are interested in.

KATE will run your notebook from top to bottom and check the latest value of those variables, so make sure you don't overwrite them.

* Remember to uncomment the line assigning the variable to your answer and don't change the variable or function names.
* Use copies of the original or previous DataFrames to make sure you do not overwrite them by mistake.

You will find instructions below about how to define each variable.

Once you're happy with your code, upload your notebook to KATE to check your feedback.

We will work with a dataset published by an insurance company which contains anonymised information about their clients.

The aim is to predict people's risk profile based on their properties.

You will be given a description of the data set and the goal is to develop a prediction model.

## Dataset

The data provided consists of three csv files in the `data/` folder:
* `X_train.csv`: the training set
* `y_train.csv`: the target for the training set, valued from 1 to 8
* `X_test.csv`: the test set that will be evaluated

Below we give the description of the data features, some categorical, others numerical. The dataset has been thoroughly anonymized, which makes it extra challenging. 

Although the risk profile is ordered, we will treat this as a classification problem, and your model will be evaluated on exact-category accuracy. Because the data has low signal and there are 8 classes, accuracy can be quite low.

## Get Started

Your task is to train a model to predict the target variable. You should save the predictions for the test set in the variable called `y_pred`, which will be evaluated against the ground truth. Below we give you a sample baseline implementation.

You are free to use all your modelling skills to get the best possible performance.

Good luck!

### Dataset info

**Variable descriptions:**
- Id - A unique identifier associated with an application.
- Product_Info_1-7 - A set of normalized variables relating to the product applied for
- Ins_Age - Normalized age of applicant
- Ht - Normalized height of applicant
- Wt - Normalized weight of applicant
- BMI - Normalized BMI of applicant
- Employment_Info_1-6 - A set of normalized variables relating to the employment history of the applicant.
- InsuredInfo_1-7 - A set of normalized variables providing information about the applicant.
- Insurance_History_1-9 - A set of normalized variables relating to the insurance history of the applicant.
- Family_Hist_1-5 - A set of normalized variables relating to the family history of the applicant.
- Medical_History_1-41 - A set of normalized variables relating to the medical history of the applicant.
- Medical_Keyword_1-48 - A set of dummy variables relating to the presence of/absence of a medical keyword being associated with the application.
- Response - This is the target variable, an ordinal variable relating to the final decision associated with an application

**Categorical (nominal) features:**
```
Product_Info_1, Product_Info_2, Product_Info_3, Product_Info_5, Product_Info_6, Product_Info_7, Employment_Info_2, Employment_Info_3, Employment_Info_5, InsuredInfo_1, InsuredInfo_2, InsuredInfo_3, InsuredInfo_4, InsuredInfo_5, InsuredInfo_6, InsuredInfo_7, Insurance_History_1, Insurance_History_2, Insurance_History_3, Insurance_History_4, Insurance_History_7, Insurance_History_8, Insurance_History_9, Family_Hist_1, Medical_History_2, Medical_History_3, Medical_History_4, Medical_History_5, Medical_History_6, Medical_History_7, Medical_History_8, Medical_History_9, Medical_History_11, Medical_History_12, Medical_History_13, Medical_History_14, Medical_History_16, Medical_History_17, Medical_History_18, Medical_History_19, Medical_History_20, Medical_History_21, Medical_History_22, Medical_History_23, Medical_History_25, Medical_History_26, Medical_History_27, Medical_History_28, Medical_History_29, Medical_History_30, Medical_History_31, Medical_History_33, Medical_History_34, Medical_History_35, Medical_History_36, Medical_History_37, Medical_History_38, Medical_History_39, Medical_History_40, Medical_History_41
```

**Continuous features:**
```
Product_Info_4, Ins_Age, Ht, Wt, BMI, Employment_Info_1, Employment_Info_4, Employment_Info_6, Insurance_History_5, Family_Hist_2, Family_Hist_3, Family_Hist_4, Family_Hist_5
```

**Discrete features:**
```
Medical_History_1, Medical_History_10, Medical_History_15, Medical_History_24, Medical_History_32
Medical_Keyword_1-48 are dummy variables.
```

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.impute import KNNImputer, SimpleImputer
from sklearn.preprocessing import OneHotEncoder, OrdinalEncoder
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.decomposition import PCA
from sklearn.metrics import classification_report

pd.set_option("display.max_columns", 500)

import warnings

warnings.filterwarnings(category=FutureWarning, action="ignore")

In [2]:
X_train = pd.read_csv("data/X_train.csv")
y_train = pd.read_csv("data/y_train.csv")
X_test = pd.read_csv("data/X_test.csv")

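# Sample baseline (uncomment to run); it needs these extra imports:
# from sklearn.compose import make_column_transformer
# from sklearn.pipeline import make_pipeline
# from sklearn.tree import DecisionTreeClassifier

# categories = ["Product_Info_1", "Product_Info_2", "Product_Info_3",
#               "Product_Info_5", "Product_Info_6", "Product_Info_7"]

# preprocessor = make_column_transformer((OneHotEncoder(handle_unknown="ignore"), categories))

# model = make_pipeline(preprocessor, DecisionTreeClassifier())

# y_train is a one-column DataFrame, so flatten it to 1-D before fitting
# model.fit(X_train, y_train.values.ravel())

# y_pred = model.predict(X_test)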
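
The commented baseline can be tried end to end on a small stand-in frame before running it on the real files. The sketch below is only illustrative: the `toy_*` frames and their values are made up, and just two of the categorical columns are used. It shows why `handle_unknown="ignore"` matters when the test set contains a category that never appeared in training.

```python
import pandas as pd
from sklearn.compose import make_column_transformer
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OneHotEncoder
from sklearn.tree import DecisionTreeClassifier

# Toy stand-ins for X_train / y_train / X_test (hypothetical values).
toy_X_train = pd.DataFrame({
    "Product_Info_1": [1, 1, 2, 1, 2, 1],
    "Product_Info_2": ["D3", "A2", "D4", "D3", "D2", "A2"],
    "BMI": [0.38, 0.45, 0.44, 0.43, 0.76, 0.50],
})
toy_y_train = pd.DataFrame({"Response": [6, 8, 5, 8, 2, 8]})
toy_X_test = pd.DataFrame({
    "Product_Info_1": [1, 2],
    "Product_Info_2": ["D3", "B1"],  # "B1" never appears in training
    "BMI": [0.40, 0.60],
})

categories = ["Product_Info_1", "Product_Info_2"]

# Columns not listed in `categories` are dropped by default (remainder="drop").
preprocessor = make_column_transformer(
    (OneHotEncoder(handle_unknown="ignore"), categories)
)
baseline = make_pipeline(preprocessor, DecisionTreeClassifier(random_state=0))

# Pass the target as a 1-D Series rather than a one-column DataFrame.
baseline.fit(toy_X_train, toy_y_train["Response"])
toy_y_pred = baseline.predict(toy_X_test)
print(toy_y_pred)
```

Without `handle_unknown="ignore"`, the unseen category `"B1"` would raise an error at prediction time; with it, that row is encoded as all zeros for the `Product_Info_2` columns.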

In [3]:
# check how many samples we have
X_train.shape, y_train.shape, X_test.shape

((44535, 126), (44535, 1), (14846, 126))

In [4]:
X_train.head()

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48
0,1,D3,26,0.487179,2,3,1,0.208955,0.745455,0.257322,0.377922,0.072,9,1,0.0,2,0.15,1,2,1,3,1,1,1,2,1,1,3,,3,2,3,3,0.26087,,0.239437,,12.0,16,2,1,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,3,3,1,2,1,1,3,3,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
1,1,A2,15,0.076923,2,3,1,0.089552,0.654545,0.246862,0.447639,0.035,9,1,0.0,3,0.002,1,2,8,3,1,2,1,2,1,1,3,,3,2,3,3,0.304348,,0.338028,,0.0,613,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
2,1,D4,26,0.230769,2,3,1,0.447761,0.781818,0.320084,0.443418,0.06,14,1,0.0,2,0.0,2,2,8,3,1,1,1,2,1,1,3,,3,2,3,3,,0.54902,,0.535714,15.0,156,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
3,1,D3,26,1.0,2,1,1,0.373134,0.709091,0.269874,0.432872,0.12,14,1,0.0,2,0.25,2,2,3,3,1,1,1,2,1,3,1,0.006667,1,3,2,3,0.565217,,0.464789,,3.0,335,2,2,1,3,2,2,2,,3,2,3,3,,1,3,2,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0
4,1,D2,29,0.076923,2,1,1,0.328358,0.672727,0.430962,0.764352,0.075,9,1,0.0,3,,1,2,8,3,1,1,1,1,1,3,1,0.003333,1,1,2,3,0.492754,,0.408451,,9.0,307,2,1,1,3,2,2,2,,3,2,3,3,12.0,1,3,1,1,2,1,2,3,,1,3,3,1,1,2,3,,3,3,1,2,2,1,1,3,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0


In [5]:
# check the distribution of the target 
y_train.value_counts(normalize=True)

Response
8           0.327203
6           0.189424
7           0.134860
2           0.110969
1           0.105153
5           0.090221
4           0.024879
3           0.017290
dtype: float64
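
With eight imbalanced classes, two reference points help put any model's accuracy in context: always predicting the most frequent class, and guessing uniformly at random. The short sketch below derives both from the class shares printed above (the values are copied from that output; the variable names are illustrative).

```python
import pandas as pd

# Class shares copied from the y_train.value_counts(normalize=True) output.
class_share = pd.Series({
    8: 0.327203, 6: 0.189424, 7: 0.134860, 2: 0.110969,
    1: 0.105153, 5: 0.090221, 4: 0.024879, 3: 0.017290,
})

majority_class = class_share.idxmax()      # most frequent class label
majority_accuracy = class_share.max()      # accuracy of always predicting it
uniform_accuracy = 1 / len(class_share)    # accuracy of uniform random guessing

print(majority_class, majority_accuracy, uniform_accuracy)
```

A model is only adding value on the training distribution if its cross-validated accuracy is clearly above the majority-class figure of roughly 0.33.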

## EDA

In [6]:
X_train.info(verbose=True, show_counts=True)

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 44535 entries, 0 to 44534
Data columns (total 126 columns):
 #    Column               Non-Null Count  Dtype  
---   ------               --------------  -----  
 0    Product_Info_1       44535 non-null  int64  
 1    Product_Info_2       44535 non-null  object 
 2    Product_Info_3       44535 non-null  int64  
 3    Product_Info_4       44535 non-null  float64
 4    Product_Info_5       44535 non-null  int64  
 5    Product_Info_6       44535 non-null  int64  
 6    Product_Info_7       44535 non-null  int64  
 7    Ins_Age              44535 non-null  float64
 8    Ht                   44535 non-null  float64
 9    Wt                   44535 non-null  float64
 10   BMI                  44535 non-null  float64
 11   Employment_Info_1    44518 non-null  float64
 12   Employment_Info_2    44535 non-null  int64  
 13   Employment_Info_3    44535 non-null  int64  
 14   Employment_Info_4    39410 non-null  float64
 15   Employment_Info_5

In [7]:
# add target in temporarily for EDA
df = X_train.copy()
df["target"] = y_train

In [8]:
df.head()

Unnamed: 0,Product_Info_1,Product_Info_2,Product_Info_3,Product_Info_4,Product_Info_5,Product_Info_6,Product_Info_7,Ins_Age,Ht,Wt,BMI,Employment_Info_1,Employment_Info_2,Employment_Info_3,Employment_Info_4,Employment_Info_5,Employment_Info_6,InsuredInfo_1,InsuredInfo_2,InsuredInfo_3,InsuredInfo_4,InsuredInfo_5,InsuredInfo_6,InsuredInfo_7,Insurance_History_1,Insurance_History_2,Insurance_History_3,Insurance_History_4,Insurance_History_5,Insurance_History_7,Insurance_History_8,Insurance_History_9,Family_Hist_1,Family_Hist_2,Family_Hist_3,Family_Hist_4,Family_Hist_5,Medical_History_1,Medical_History_2,Medical_History_3,Medical_History_4,Medical_History_5,Medical_History_6,Medical_History_7,Medical_History_8,Medical_History_9,Medical_History_10,Medical_History_11,Medical_History_12,Medical_History_13,Medical_History_14,Medical_History_15,Medical_History_16,Medical_History_17,Medical_History_18,Medical_History_19,Medical_History_20,Medical_History_21,Medical_History_22,Medical_History_23,Medical_History_24,Medical_History_25,Medical_History_26,Medical_History_27,Medical_History_28,Medical_History_29,Medical_History_30,Medical_History_31,Medical_History_32,Medical_History_33,Medical_History_34,Medical_History_35,Medical_History_36,Medical_History_37,Medical_History_38,Medical_History_39,Medical_History_40,Medical_History_41,Medical_Keyword_1,Medical_Keyword_2,Medical_Keyword_3,Medical_Keyword_4,Medical_Keyword_5,Medical_Keyword_6,Medical_Keyword_7,Medical_Keyword_8,Medical_Keyword_9,Medical_Keyword_10,Medical_Keyword_11,Medical_Keyword_12,Medical_Keyword_13,Medical_Keyword_14,Medical_Keyword_15,Medical_Keyword_16,Medical_Keyword_17,Medical_Keyword_18,Medical_Keyword_19,Medical_Keyword_20,Medical_Keyword_21,Medical_Keyword_22,Medical_Keyword_23,Medical_Keyword_24,Medical_Keyword_25,Medical_Keyword_26,Medical_Keyword_27,Medical_Keyword_28,Medical_Keyword_29,Medical_Keyword_30,Medical_Keyword_31,Medical_Keyword_32,Medical_Keyword_33,Medical_Keyword_34,Medical_Keyword_35,Medical_Keyword_36,Medical_Keyword_37,Medical_Keyword_38,Medical_Keyword_39,Medical_Keyword_40,Medical_Keyword_41,Medical_Keyword_42,Medical_Keyword_43,Medical_Keyword_44,Medical_Keyword_45,Medical_Keyword_46,Medical_Keyword_47,Medical_Keyword_48,target
0,1,D3,26,0.487179,2,3,1,0.208955,0.745455,0.257322,0.377922,0.072,9,1,0.0,2,0.15,1,2,1,3,1,1,1,2,1,1,3,,3,2,3,3,0.26087,,0.239437,,12.0,16,2,1,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,3,,1,3,3,1,3,2,3,,3,3,1,2,1,1,3,3,1,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,6
1,1,A2,15,0.076923,2,3,1,0.089552,0.654545,0.246862,0.447639,0.035,9,1,0.0,3,0.002,1,2,8,3,1,2,1,2,1,1,3,,3,2,3,3,0.304348,,0.338028,,0.0,613,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,8
2,1,D4,26,0.230769,2,3,1,0.447761,0.781818,0.320084,0.443418,0.06,14,1,0.0,2,0.0,2,2,8,3,1,1,1,2,1,1,3,,3,2,3,3,,0.54902,,0.535714,15.0,156,2,2,1,3,2,2,2,,3,2,3,3,,1,3,1,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,5
3,1,D3,26,1.0,2,1,1,0.373134,0.709091,0.269874,0.432872,0.12,14,1,0.0,2,0.25,2,2,3,3,1,1,1,2,1,3,1,0.006667,1,3,2,3,0.565217,,0.464789,,3.0,335,2,2,1,3,2,2,2,,3,2,3,3,,1,3,2,1,2,1,2,1,,1,3,3,1,3,2,3,,3,3,1,2,2,1,3,3,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,8
4,1,D2,29,0.076923,2,1,1,0.328358,0.672727,0.430962,0.764352,0.075,9,1,0.0,3,,1,2,8,3,1,1,1,1,1,3,1,0.003333,1,1,2,3,0.492754,,0.408451,,9.0,307,2,1,1,3,2,2,2,,3,2,3,3,12.0,1,3,1,1,2,1,2,3,,1,3,3,1,1,2,3,,3,3,1,2,2,1,1,3,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,2


In [9]:
# one-hot encode the "Product_Info_2" column for ease
# encoder = OneHotEncoder()
# ohe_transformed = encoder.fit_transform(df[["Product_Info_2"]]).toarray()

# df = pd.concat([df.drop(columns=["Product_Info_2"]), pd.DataFrame(ohe_transformed, columns=encoder.get_feature_names_out(), index=df.index)], axis=1)

In [10]:
# check missing values

"""
We'll drop any columns missing more than 75% of their values, as too little information is left in them to impute reliably.

For the remaining columns, we'll group by similar features and then impute based on the mean value.
"""

missing_values = df.isnull().sum().sort_values(ascending=False).head(15) / df.shape[0]
missing_values

Medical_History_10     0.990592
Medical_History_32     0.981071
Medical_History_24     0.935377
Medical_History_15     0.749590
Family_Hist_5          0.702863
Family_Hist_3          0.575772
Family_Hist_2          0.483036
Insurance_History_5    0.427596
Family_Hist_4          0.324172
Employment_Info_6      0.182531
Medical_History_1      0.149994
Employment_Info_4      0.115078
Employment_Info_1      0.000382
Medical_Keyword_14     0.000000
Medical_Keyword_15     0.000000
dtype: float64
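
Applying the "more than 75% missing" rule to the fractions printed above shows which columns it would remove. The sketch below copies the top entries from that output (variable names are illustrative). Note that `Medical_History_15`, at 0.749590, falls just below the threshold and is therefore kept.

```python
import pandas as pd

# Missing-value fractions copied from the output above (top entries only).
missing_share = pd.Series({
    "Medical_History_10": 0.990592,
    "Medical_History_32": 0.981071,
    "Medical_History_24": 0.935377,
    "Medical_History_15": 0.749590,
    "Family_Hist_5": 0.702863,
    "Family_Hist_3": 0.575772,
})

threshold = 0.75
columns_to_drop = missing_share[missing_share > threshold].index.tolist()
print(columns_to_drop)
```

On a full DataFrame the same fractions can be computed directly with `df.isnull().mean()`, which is equivalent to dividing the null counts by the number of rows.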

## Modelling

In [11]:
# cleans dataset
class DataCleaner(BaseEstimator, TransformerMixin):
    def __init__(self) -> None:
        pass
    
    def fit(self, X, y=None):
        # collect columns that are missing more than 75% of their values
        missing_values = X.isnull().mean()
        self.missing_columns = missing_values[missing_values > 0.75].index.to_list()
        
        # collect columns that are highly correlated with an earlier column;
        # use the X passed to fit (not the global X_train) so the cleaner
        # only learns from the data it is fitted on, e.g. inside cross-validation
        correlations = X.select_dtypes("number").corr()
        # keep only the upper triangle so each pair is considered once
        correlations = correlations.where(np.triu(np.ones(correlations.shape), k=1).astype(bool))
        self.correlated_columns = [column for column in correlations.columns if any(correlations[column] > 0.9)]
        
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        
        # remove columns that are missing more than 75% of their values
        X_transformed = X_transformed.drop(columns=self.missing_columns)
        
        # remove correlated columns; errors="ignore" covers columns already
        # dropped above for having too many missing values
        X_transformed = X_transformed.drop(columns=self.correlated_columns, errors="ignore")
        
        return X_transformed
   
# feature engineering 
class FeatureEngineer(BaseEstimator, TransformerMixin):
    def __init__(self) -> None:
        pass
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X):
        X_transformed = X.copy()
        
        # log(1 + x) transformation of BMI
        X_transformed["BMI"] = np.log1p(X_transformed["BMI"])
        
        return X_transformed
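
The correlation filter in `DataCleaner` relies on masking the correlation matrix to its upper triangle, so that for each highly correlated pair only the later column is flagged. The sketch below illustrates that step in isolation on synthetic data; the column names `a`, `b`, `c` and the noise scale are assumptions made purely for the example.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
a = rng.normal(size=200)
toy = pd.DataFrame({
    "a": a,
    "b": a + rng.normal(scale=0.01, size=200),  # near-duplicate of "a"
    "c": rng.normal(size=200),                  # independent of "a" and "b"
})

corr = toy.corr()

# Keep entries strictly above the diagonal; everything else becomes NaN.
upper = corr.where(np.triu(np.ones(corr.shape), k=1).astype(bool))

# A column is flagged if it correlates > 0.9 with any earlier column.
correlated = [col for col in upper.columns if (upper[col] > 0.9).any()]
print(correlated)
```

Only `b` is flagged: `a` is kept as the representative of the pair. The comparison uses the signed correlation, so strongly negative correlations are not flagged; wrapping `corr` in `.abs()` would be one way to include them.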

In [12]:
# pipelines
clean_pipeline = Pipeline(
    [
        ("data_cleaner", DataCleaner()),
    ]
)

object_columns = X_train.select_dtypes("object").columns
encoder_ct = ColumnTransformer(
    [
        ("encode", OneHotEncoder(handle_unknown="ignore"), object_columns)
    ],
    remainder="passthrough"
)

transformation_pipeline = Pipeline(
    [
        ("clean_pipeline", clean_pipeline),
        ("feature_engineer", FeatureEngineer()),
        ("encoder", encoder_ct),
        ("imputer", SimpleImputer()),
        #("pca", PCA(n_components=5))
    ]
)

# transformation_pipeline

In [13]:
pipeline = transformation_pipeline.fit(X_train)
transfomred_x_train = pipeline.transform(X_train)

In [14]:
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier
from sklearn.model_selection import StratifiedKFold

In [15]:
# if y_train.min().values == 1:
#     y_train -= 1
    
# cv = StratifiedKFold(shuffle=True, random_state=68)
# for i, (train_index, test_index) in enumerate(cv.split(transfomred_x_train, y_train)):
#     x_train, y_train_ = transfomred_x_train[train_index], y_train.values[train_index]
#     x_test_, y_test = transfomred_x_train[test_index], y_train.values[test_index]
    
#     model = XGBClassifier()
#     model.fit(x_train, y_train_.reshape(-1,))
    
#     score = model.score(x_test_, y_test)
#     print(score)
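
The manual fold loop in the commented cell can also be expressed with `cross_val_score`, which handles the splitting, fitting, and scoring. The sketch below is a minimal illustration on synthetic data, with a `DecisionTreeClassifier` swapped in for `XGBClassifier` so that it has no dependency beyond scikit-learn; the data and class structure are assumptions for the example.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

rng = np.random.default_rng(0)
y = np.repeat([0, 1, 2], 40)       # three balanced classes, 120 samples
X = rng.normal(size=(120, 4))
X[:, 0] += y                       # make the first feature informative

# Stratification keeps the class proportions similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=68)
scores = cross_val_score(DecisionTreeClassifier(random_state=0), X, y, cv=cv)

print(scores, scores.mean())
```

For a classifier, the default scoring used by `cross_val_score` is the estimator's own `score` method, which is accuracy, matching the `model.score` call in the loop.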

In [17]:
# XGBClassifier expects class labels starting at 0, so the model is trained on y_train - 1;
# add one to the predictions to map them back to the original 1-8 labels
# y_pred = model.predict(pipeline.transform(X_test)) + 1
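
An alternative to subtracting and re-adding 1 by hand is to let `LabelEncoder` handle the mapping in both directions. This is a swapped-in technique rather than what the cells above do, and the sample labels are made up; it gives the same result as the manual shift as long as every class from 1 to 8 appears in the training target.

```python
import numpy as np
from sklearn.preprocessing import LabelEncoder

# Hypothetical sample of Response values covering all eight classes.
labels = np.array([6, 8, 5, 8, 2, 1, 3, 4, 7])

encoder = LabelEncoder()
encoded = encoder.fit_transform(labels)        # contiguous labels 0..7 for training
decoded = encoder.inverse_transform(encoded)   # back to the original 1..8 labels

print(encoded, decoded)
```

In practice, the model would be fitted on `encoded` and its predictions passed through `encoder.inverse_transform` before being stored in `y_pred`.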

In [19]:
# p = model.predict(x_test_)
# print(classification_report(y_test, model.predict(x_test_)))

In [20]:
import optuna
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression, RidgeClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.naive_bayes import GaussianNB

In [28]:
def objective_function(trial):
    hyperparams = {
        "rfc__n_estimators": trial.suggest_int("rfc__n_estimators", 20, 260),
        "rfc__max_depth": trial.suggest_int("rfc__max_depth", 1, 20),
        
        # C must be strictly positive, so the search starts just above 0
        "lr__C": trial.suggest_float("lr__C", 1e-3, 3),
        "lr__penalty": trial.suggest_categorical("lr__penalty", [None, "l2"]),
        
        "rc__alpha": trial.suggest_float("rc__alpha", 0, 3)
    }

    outer_cv = StratifiedKFold(shuffle=True, random_state=259)
    inner_cv = StratifiedKFold(shuffle=True, random_state=5421)

    stacked = StackingClassifier(
        [
            ("rfc", RandomForestClassifier()),
            ("nb", GaussianNB()),
            ("lr", LogisticRegression(max_iter=2500)),
            ("rc", RidgeClassifier())
        ],
        cv=inner_cv
    )
    
    stacked.set_params(**hyperparams)
    
    nested_score = cross_val_score(stacked, X=transfomred_x_train[:100], y=y_train.values[:100].reshape(-1,), cv=outer_cv)

    return np.mean(nested_score)

In [29]:
stacked_study = optuna.create_study(direction="maximize")
stacked_study.optimize(objective_function, n_trials=3)

[I 2024-02-29 21:21:39,297] A new study created in memory with name: no-name-75d6c6f6-3d99-4a21-8982-f8732cba5a03


STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(
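
The warning above suggests scaling the data. One way to do that, sketched here under the assumption that the features sit on very different scales, is to put a `StandardScaler` in front of `LogisticRegression` inside a pipeline. The synthetic data below is only for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Three features on very different scales (1, 100, 1000).
X = rng.normal(size=(200, 3)) * np.array([1.0, 100.0, 1000.0])
y = (X[:, 0] + X[:, 1] / 100 > 0).astype(int)

# Scaling happens inside the pipeline, so it is re-fitted on each training fold.
scaled_lr = make_pipeline(StandardScaler(), LogisticRegression(max_iter=2500))
scaled_lr.fit(X, y)
train_accuracy = scaled_lr.score(X, y)

print(train_accuracy)
```

If such a pipeline replaced the `"lr"` estimator in the stacking classifier, the hyperparameter names would gain an extra level, for example `lr__logisticregression__C` instead of `lr__C`, because `make_pipeline` names each step after its lowercased class name.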