# Final Project: Flu Shot Learning: Predict H1N1 and Seasonal Flu Vaccines - Hosted by DRIVENDATA

### 1. Problem Statement
*Predict the likelihood that an individual received the H1N1 vaccine and the seasonal flu vaccine.*

### 2. Metrics and Assumptions
Previous submissions to *https://www.drivendata.org/competitions/66/flu-shot-learning/* suggest there is an estimated irreducible error of approximately 13 percent.

### 3. Data Source
The United States National Center for Immunization and Respiratory Diseases (NCIRD) and National Center for Health Statistics (NCHS), two organizations within the Centers for Disease Control and Prevention (CDC), conducted a national survey of American households to determine socio-economic indicators of individuals who have or have not received the H1N1 and seasonal flu vaccines. The survey was conducted between October 2009 and June 2010.

### 4. Target and Feature Variables
This dataset has two target variables. Both are binary variables (0,1) that correspond to a respondent receiving either the H1N1 flu vaccine or the seasonal flu vaccine.

The feature set contains 36 columns (35 excluding the respondent id). Twelve feature variables are object-typed (string) columns containing economic, race, sex, age, and lifestyle indicators; the rest of the feature variables consist of binary and Likert-scale questions on health behaviors, indicators, and knowledge. I included all feature variables in my analysis.

### 5. Model Approach
Classification with probability outputs. Performance for this model was evaluated using the Area Under the Receiver Operating Characteristic Curve (ROC AUC) for each target variable. A submission to DrivenData must include three columns, 'respondent_id', 'h1n1_vaccine', and 'seasonal_vaccine', where each target column holds a probability between 0.0 and 1.0 that the respondent received that vaccine.
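
To make the metric concrete, here is a minimal sketch of scoring both targets with `roc_auc_score`. The labels and probabilities are made-up toy values, and averaging the two per-target scores into one number is an assumption of this sketch rather than something stated above.

```python
import numpy as np
from sklearn.metrics import roc_auc_score

# Toy ground truth (made-up values); columns: h1n1_vaccine, seasonal_vaccine
y_true = np.array([
    [0, 1],
    [1, 1],
    [0, 0],
    [1, 0],
])

# Toy predicted probabilities in the same column order (made-up values)
y_prob = np.array([
    [0.1, 0.8],
    [0.7, 0.6],
    [0.4, 0.3],
    [0.9, 0.7],
])

# ROC AUC for each target separately
auc_h1n1 = roc_auc_score(y_true[:, 0], y_prob[:, 0])
auc_seasonal = roc_auc_score(y_true[:, 1], y_prob[:, 1])

# Assumed aggregation: the plain mean of the two per-target scores
mean_auc = (auc_h1n1 + auc_seasonal) / 2

# The same mean, computed in one call on the two-column arrays
macro_auc = roc_auc_score(y_true, y_prob, average="macro")

print(auc_h1n1, auc_seasonal, mean_auc)
```

Because ROC AUC depends only on how the predictions rank respondents, it is unaffected by any monotonic rescaling of the probabilities.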

#### Package Dependencies

In [127]:
import pandas as pd
import numpy as np
import os
import matplotlib.pyplot as plt
import seaborn as sns

from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, RandomForestRegressor
from sklearn.tree import DecisionTreeClassifier
from sklearn.svm import SVC
from sklearn import svm
from sklearn.neural_network import MLPClassifier
from sklearn.metrics import confusion_matrix, classification_report, roc_curve, roc_auc_score
from sklearn.preprocessing import StandardScaler, LabelEncoder, OneHotEncoder, MinMaxScaler
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import KNeighborsClassifier
from sklearn.multioutput import MultiOutputClassifier
from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer

from catboost import CatBoostClassifier
from catboost import Pool, cv
import optuna

%matplotlib inline

### Data Import

In [128]:
df = pd.read_csv(r"Documents/H1N1 Predictions/training_set_features.csv", index_col='respondent_id')
labels_df = pd.read_csv(r"Documents/H1N1 Predictions/training_set_labels.csv", index_col='respondent_id')
test_df = pd.read_csv(r'Documents/H1N1 Predictions/test_set_features.csv', index_col='respondent_id')
np.testing.assert_array_equal(df.index.values, labels_df.index.values)

df = df.join(labels_df)
df.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 26707 entries, 0 to 26706
Data columns (total 37 columns):
 #   Column                       Non-Null Count  Dtype  
---  ------                       --------------  -----  
 0   h1n1_concern                 26615 non-null  float64
 1   h1n1_knowledge               26591 non-null  float64
 2   behavioral_antiviral_meds    26636 non-null  float64
 3   behavioral_avoidance         26499 non-null  float64
 4   behavioral_face_mask         26688 non-null  float64
 5   behavioral_wash_hands        26665 non-null  float64
 6   behavioral_large_gatherings  26620 non-null  float64
 7   behavioral_outside_home      26625 non-null  float64
 8   behavioral_touch_face        26579 non-null  float64
 9   doctor_recc_h1n1             24547 non-null  float64
 10  doctor_recc_seasonal         24547 non-null  float64
 11  chronic_med_condition        25736 non-null  float64
 12  child_under_6_months         25887 non-null  float64
 13  health_worker   

### Handling Missing Values

In [129]:
from sklearn.preprocessing import LabelEncoder
label_encoder = LabelEncoder()

In [130]:
num_cols = df.select_dtypes('number').columns
obj_cols = df.select_dtypes('object').columns

test_num_cols = test_df.select_dtypes('number').columns

In [131]:
type(obj_cols)

pandas.core.indexes.base.Index

In [132]:
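# TRAINING DATASET
# mode() returns a Series, so take its first entry to get a scalar fill value.
# The training set is filled with its own mode rather than the test set's.
for col in obj_cols:
    df[col] = df[col].fillna(df[col].mode().iloc[0])

# -1 marks a missing value in the numeric columns
for col in num_cols:
    df[col] = df[col].fillna(-1)

# TEST DATASET
for col in obj_cols:
    test_df[col] = test_df[col].fillna(test_df[col].mode().iloc[0])

for col in test_num_cols:
    test_df[col] = test_df[col].fillna(-1)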
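
A small illustration of why the fill value for a categorical column should be a scalar. `Series.mode()` returns a Series indexed from 0, and `fillna` aligns a Series argument on index labels, so passing the mode directly leaves most gaps untouched. The column name and values below are hypothetical.

```python
import numpy as np
import pandas as pd

# Hypothetical toy column; values are made up for illustration
s = pd.Series(["a", np.nan, "b", "a", np.nan], name="toy_col")

# mode() gives a one-element Series with index label 0.
# fillna aligns on labels, so only a gap at label 0 could be filled.
filled_with_series = s.fillna(s.mode())

# Taking the first mode yields a scalar that fills every gap.
filled_with_scalar = s.fillna(s.mode().iloc[0])

remaining_series = int(filled_with_series.isna().sum())
remaining_scalar = int(filled_with_scalar.isna().sum())

print(remaining_series, remaining_scalar)
```

Gaps that survive this step do not show up as NaN after `pd.get_dummies`; they become rows with zeros in every dummy column, so a later `isna().sum()` of zero does not prove the imputation worked.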

In [133]:
df = pd.get_dummies(df, columns=obj_cols)
test_df = pd.get_dummies(test_df, columns=obj_cols)
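
Encoding the two frames separately yields matching columns only if both contain the same categories. The sketch below, using hypothetical toy frames, shows one defensive way to align the test columns to the training columns with `reindex`; it is not part of the original workflow.

```python
import pandas as pd

# Hypothetical toy frames with partly different categories
train = pd.DataFrame({"age": [1.0, 2.0, 3.0],
                      "region": ["north", "south", "east"]})
test = pd.DataFrame({"age": [4.0, 5.0],
                     "region": ["north", "west"]})

train_enc = pd.get_dummies(train, columns=["region"])
test_enc = pd.get_dummies(test, columns=["region"])

# Align test to the training columns:
# - categories missing from the test set become all-zero columns
# - categories unseen in training are dropped
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_aligned.columns))
```

When aligning real data, the target columns would need to be excluded from the training column list first.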

In [134]:
test_df.isna().sum()

h1n1_concern                      0
h1n1_knowledge                    0
behavioral_antiviral_meds         0
behavioral_avoidance              0
behavioral_face_mask              0
                                 ..
employment_occupation_vlluhbov    0
employment_occupation_xgwztkwe    0
employment_occupation_xqwwgdyp    0
employment_occupation_xtkaffoo    0
employment_occupation_xzmlyyjv    0
Length: 105, dtype: int64

In [135]:
# COPY DATASET TO PREDICT H1N1 VACCINE EXCLUSIVELY

# dropna returns a new DataFrame, so df itself is left unchanged
h1n1_df = df.dropna(axis=1)
y = h1n1_df['h1n1_vaccine']
X = h1n1_df.drop(columns=['h1n1_vaccine','seasonal_vaccine'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [136]:
# IDENTIFY COLUMNS WITH CATEGORICAL FEATURES

cat_features = np.where(X_train.dtypes != float)[0]
cat_features

array([ 23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
        49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,
        62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104], dtype=int64)
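
The index array above comes from comparing each column's dtype with `float`. A minimal sketch on a hypothetical two-column frame shows how the one-hot columns created by `pd.get_dummies` end up flagged while the original float column does not.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame: one float survey item and one string column
toy = pd.DataFrame({
    "h1n1_concern": [1.0, 2.0, 0.0],
    "sex": ["Male", "Female", "Female"],
})
toy = pd.get_dummies(toy, columns=["sex"])

# Positions of every column whose dtype is not float64
non_float_positions = np.where(toy.dtypes != float)[0]
non_float_names = toy.columns[non_float_positions].tolist()

print(non_float_positions, non_float_names)
```

The positions are relative to the frame they were computed on, so they are only valid for data with the same column order.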

In [137]:
# API Reference: https://catboost.ai/docs/concepts/python-reference_pool.html
# CatBoost Pool preprocesses / wraps the dataset

train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features = cat_features)

In [138]:
def objective(trial):
    
    # Optuna Trial API Reference: https://optuna.readthedocs.io/en/stable/reference/generated/optuna.trial.Trial.html
    # Trial determines best AUC mean by refining hyperparameter suggestions
    # For most param APIs: https://catboost.ai/docs/search/?query=od_type
    
    param = {
        # The maximum number of trees that can be built when solving machine learning problems.
        'iterations':trial.suggest_categorical('iterations', [100,200,300,500,1000,1200,1500,2000,3000,4000,5000]),
        
        # The learning rate. Used for reducing the gradient step.
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.3),
        
        'random_strength':trial.suggest_int("random_strength", 1,10),
        
        # Defines the settings of the Bayesian bootstrap. It is used by default in classification and regression modes. 
        # Use the Bayesian bootstrap to assign random weights to objects.
        'bagging_temperature':trial.suggest_int("bagging_temperature", 0,10),
        
        'max_bin':trial.suggest_categorical('max_bin', [4,5,6,8,10,20,30]),
        
        # Defines how to perform greedy tree construction. ['SymmetricTree', 'Depthwise', 'Lossguide'] are all available hyperparams
        'grow_policy':trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        
        # The minimum number of training samples in a leaf. 
        'min_data_in_leaf':trial.suggest_int("min_data_in_leaf", 1,10),
        
        # Overfitting detector
        'od_type' : "Iter",
        
        # Number of iterations to continue the training after the iteration with the optimal metric value.
        'od_wait' : 100,
        
        # Depth of the tree
        "depth": trial.suggest_int("max_depth", 2,10),
        
        "l2_leaf_reg": trial.suggest_loguniform("l2_leaf_reg", 1e-8, 100),
        
        'one_hot_max_size':trial.suggest_categorical('one_hot_max_size', [5,10,12,100,500,1024]),
        
        'custom_metric' : ['AUC'],
        
        # Classification Objectives and Metrics: https://catboost.ai/docs/concepts/loss-functions-classification.html
        # More about Log Loss: https://www.kaggle.com/dansbecker/what-is-log-loss
        "loss_function": 'Logloss',
        
        # Scaling
        'auto_class_weights':trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']),
        }
    
    # CatBoost CV API Reference: https://catboost.ai/docs/concepts/python-reference_cv.html
    # Runs CatBoost's cross-validation with the hyperparameters suggested by Optuna
    # Note: the logged "10 iterations wait" suggests early_stopping_rounds=10 takes effect rather than od_wait=100
  
    scores = cv(train_dataset,
            param,
            fold_count=10, 
            early_stopping_rounds=10,         
            plot=False, verbose=False)

    return scores['test-AUC-mean'].max()

In [139]:
# Tree-structured Parzen Estimator (TPE) Sampler API: https://optuna.readthedocs.io/en/stable/reference/samplers.html
# Seeding the sampler makes the hyperparameter suggestions reproducible

sampler = optuna.samplers.TPESampler(seed=68)  
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=50)

[I 2021-04-07 22:36:14,722] A new study created in memory with name: no-name-d5438db2-09a7-481c-8c24-9c5ff3c26bf2
[I 2021-04-07 22:37:00,939] Trial 0 finished with value: 0.8659522576072736 and parameters: {'iterations': 300, 'learning_rate': 0.013964954297408176, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 4, 'max_depth': 5, 'l2_leaf_reg': 21.328495943450676, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 0 with value: 0.8659522576072736.
[I 2021-04-07 22:37:26,619] Trial 1 finished with value: 0.8661515849000121 and parameters: {'iterations': 1200, 'learning_rate': 0.11477165079768124, 'random_strength': 9, 'bagging_temperature': 6, 'max_bin': 5, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 0.5714362138520529, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.86615158490001
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:37:39,261] Trial 2 finished with value: 0.8324344309562843 and parameters: {'iterations': 500, 'learning_rate': 0.10424841039250944, 'random_strength': 8, 'bagging_temperature': 3, 'max_bin': 6, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 3, 'max_depth': 10, 'l2_leaf_reg': 7.961547985302404e-08, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:38:14,891] Trial 3 finished with value: 0.8625857316074506 and parameters: {'iterations': 4000, 'learning_rate': 0.06981342303072555, 'random_strength': 6, 'bagging_temperature': 7, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'max_depth': 6, 'l2_leaf_reg': 1.253388039132331e-06, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:38:36,145] Trial 4 finished with value: 0.8645234204971002 and parameters: {'iterations': 300, 'learning_rate': 0.19342991898511874, 'random_strength': 8, 'bagging_temperature': 7, 'max_bin': 20, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 2, 'l2_leaf_reg': 0.0011659140576640084, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:38:47,872] Trial 5 finished with value: 0.8608266555516847 and parameters: {'iterations': 100, 'learning_rate': 0.21103536986772822, 'random_strength': 4, 'bagging_temperature': 4, 'max_bin': 6, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 4, 'l2_leaf_reg': 1.5248267732768012e-08, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:39:42,330] Trial 6 finished with value: 0.8643937645965247 and parameters: {'iterations': 1000, 'learning_rate': 0.039222305770230614, 'random_strength': 5, 'bagging_temperature': 3, 'max_bin': 20, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 4, 'l2_leaf_reg': 3.544948380552023e-06, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:40:33,497] Trial 7 finished with value: 0.8618390833729654 and parameters: {'iterations': 1000, 'learning_rate': 0.045869610193466345, 'random_strength': 9, 'bagging_temperature': 10, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 8, 'max_depth': 5, 'l2_leaf_reg': 6.629274905463984e-06, 'one_hot_max_size': 12, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:40:43,227] Trial 8 finished with value: 0.852634376586922 and parameters: {'iterations': 1500, 'learning_rate': 0.2775338323862574, 'random_strength': 6, 'bagging_temperature': 4, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 9, 'max_depth': 9, 'l2_leaf_reg': 0.0010943835695463693, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:40:57,535] Trial 9 finished with value: 0.8610753121404245 and parameters: {'iterations': 1000, 'learning_rate': 0.16527427195333977, 'random_strength': 1, 'bagging_temperature': 9, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 2, 'max_depth': 5, 'l2_leaf_reg': 5.547808519611033e-05, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:41:27,022] Trial 10 finished with value: 0.8658453957804421 and parameters: {'iterations': 1200, 'learning_rate': 0.11280639560744715, 'random_strength': 10, 'bagging_temperature': 1, 'max_bin': 5, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 5, 'max_depth': 8, 'l2_leaf_reg': 3.212210040719702, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8661515849000121.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:43:20,132] Trial 11 finished with value: 0.8707177746116492 and parameters: {'iterations': 2000, 'learning_rate': 0.01981954863467386, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 78.38138081858969, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:43:46,435] Trial 12 finished with value: 0.8660174114474406 and parameters: {'iterations': 2000, 'learning_rate': 0.11011181943928247, 'random_strength': 3, 'bagging_temperature': 6, 'max_bin': 5, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 0.19478780443657354, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:49:50,610] Trial 13 finished with value: 0.8692368704287926 and parameters: {'iterations': 2000, 'learning_rate': 0.005245169519152451, 'random_strength': 2, 'bagging_temperature': 10, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 1, 'max_depth': 8, 'l2_leaf_reg': 0.2554425816669215, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 22:55:59,817] Trial 14 finished with value: 0.8589175416984947 and parameters: {'iterations': 2000, 'learning_rate': 0.0012570631773644297, 'random_strength': 2, 'bagging_temperature': 10, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 1, 'max_depth': 9, 'l2_leaf_reg': 59.44042958669503, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 22:59:10,833] Trial 15 finished with value: 0.8654439920466578 and parameters: {'iterations': 5000, 'learning_rate': 0.005965894062374641, 'random_strength': 2, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 1, 'max_depth': 8, 'l2_leaf_reg': 0.02635111535001674, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:00:27,360] Trial 16 finished with value: 0.8683362480741573 and parameters: {'iterations': 2000, 'learning_rate': 0.07038569751421626, 'random_strength': 1, 'bagging_temperature': 10, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 10, 'l2_leaf_reg': 94.7424627110417, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:00:37,222] Trial 17 finished with value: 0.8560306713537132 and parameters: {'iterations': 3000, 'learning_rate': 0.26987200266265216, 'random_strength': 3, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'max_depth': 7, 'l2_leaf_reg': 0.010165738633200461, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:01:22,630] Trial 18 finished with value: 0.8641010549136038 and parameters: {'iterations': 200, 'learning_rate': 0.03163074156241095, 'random_strength': 3, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 1, 'max_depth': 8, 'l2_leaf_reg': 4.702623628134756, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:02:04,494] Trial 19 finished with value: 0.8665550091858591 and parameters: {'iterations': 2000, 'learning_rate': 0.07133190045789511, 'random_strength': 2, 'bagging_temperature': 0, 'max_bin': 8, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 10, 'max_depth': 9, 'l2_leaf_reg': 0.12220253571616772, 'one_hot_max_size': 5, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:07:58,752] Trial 20 finished with value: 0.863138674425203 and parameters: {'iterations': 2000, 'learning_rate': 0.002684083901463956, 'random_strength': 5, 'bagging_temperature': 8, 'max_bin': 4, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 2, 'max_depth': 6, 'l2_leaf_reg': 5.990213697132072, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:09:03,031] Trial 21 finished with value: 0.8698982042087969 and parameters: {'iterations': 2000, 'learning_rate': 0.06853671945128265, 'random_strength': 1, 'bagging_temperature': 10, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 10, 'l2_leaf_reg': 60.30641967454954, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:09:47,396] Trial 22 finished with value: 0.8650243448616667 and parameters: {'iterations': 2000, 'learning_rate': 0.05177305737720475, 'random_strength': 1, 'bagging_temperature': 10, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 10, 'l2_leaf_reg': 0.8303872031381462, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:11:38,936] Trial 23 finished with value: 0.8698190365816535 and parameters: {'iterations': 2000, 'learning_rate': 0.02455228272459276, 'random_strength': 2, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 8, 'l2_leaf_reg': 80.3741668819409, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:12:22,853] Trial 24 finished with value: 0.869075717100228 and parameters: {'iterations': 2000, 'learning_rate': 0.0874622861808137, 'random_strength': 4, 'bagging_temperature': 7, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 9, 'l2_leaf_reg': 83.7712703435238, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:12:38,665] Trial 25 finished with value: 0.8675870756870838 and parameters: {'iterations': 5000, 'learning_rate': 0.14239138314365457, 'random_strength': 1, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 7, 'l2_leaf_reg': 17.933734301020408, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:15:21,347] Trial 26 finished with value: 0.8700577992766568 and parameters: {'iterations': 3000, 'learning_rate': 0.03007462271004389, 'random_strength': 2, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 10, 'l2_leaf_reg': 96.07835704457133, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:16:58,837] Trial 27 finished with value: 0.8666956429977081 and parameters: {'iterations': 3000, 'learning_rate': 0.0510925287232617, 'random_strength': 4, 'bagging_temperature': 6, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 2, 'l2_leaf_reg': 2.0700996524527304, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:17:49,476] Trial 28 finished with value: 0.8675219604872335 and parameters: {'iterations': 3000, 'learning_rate': 0.08652247042703512, 'random_strength': 3, 'bagging_temperature': 8, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 10, 'l2_leaf_reg': 15.698542577382348, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:21:00,724] Trial 29 finished with value: 0.8684601514410792 and parameters: {'iterations': 4000, 'learning_rate': 0.02049218225214315, 'random_strength': 1, 'bagging_temperature': 5, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 3, 'l2_leaf_reg': 21.836587300765807, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:21:10,785] Trial 30 finished with value: 0.8487845654691452 and parameters: {'iterations': 3000, 'learning_rate': 0.13955585842146848, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 20, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 10, 'l2_leaf_reg': 0.04691486822115045, 'one_hot_max_size': 5, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:21:53,688] Trial 31 finished with value: 0.863963699262688 and parameters: {'iterations': 200, 'learning_rate': 0.020458737773142104, 'random_strength': 2, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 9, 'l2_leaf_reg': 78.11005021356631, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:23:08,406] Trial 32 finished with value: 0.8686498177263152 and parameters: {'iterations': 500, 'learning_rate': 0.0244535896832416, 'random_strength': 2, 'bagging_temperature': 7, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 6, 'l2_leaf_reg': 95.24145158795909, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:23:56,851] Trial 33 finished with value: 0.8690128315510919 and parameters: {'iteration
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:24:19,006] Trial 34 finished with value: 0.8634902899367006 and parameters: {'iterations': 100, 'learning_rate': 0.08486955794981052, 'random_strength': 1, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 9, 'l2_leaf_reg': 0.9712965396123249, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:25:13,115] Trial 35 finished with value: 0.8683777432489865 and parameters: {'iterations': 300, 'learning_rate': 0.034317783131161274, 'random_strength': 2, 'bagging_temperature': 7, 'max_bin': 6, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 8, 'l2_leaf_reg': 32.51006335944713, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:26:07,165] Trial 36 finished with value: 0.8680443313880088 and parameters: {'iterations': 1200, 'learning_rate': 0.06474802170010949, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 4, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 10, 'l2_leaf_reg': 7.409887370787184, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:28:29,883] Trial 37 finished with value: 0.8694052465956791 and parameters: {'iterations': 2000, 'learning_rate': 0.01993785122775789, 'random_strength': 7, 'bagging_temperature': 5, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 90.64040382757709, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:28:49,908] Trial 38 finished with value: 0.859866292946496 and parameters: {'iterations': 500, 'learning_rate': 0.12708490722567845, 'random_strength': 4, 'bagging_temperature': 6, 'max_bin': 6, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 0.004995767033688599, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:29:13,136] Trial 39 finished with value: 0.8545279345215718 and parameters: {'iterations': 4000, 'learning_rate': 0.040061077282591395, 'random_strength': 3, 'bagging_temperature': 7, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 8, 'l2_leaf_reg': 1.0399183239165113e-07, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:29:28,630] Trial 40 finished with value: 0.8588792236163825 and parameters: {'iterations': 300, 'learning_rate': 0.1634566124396803, 'random_strength': 2, 'bagging_temperature': 10, 'max_bin': 20, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 9, 'l2_leaf_reg': 1.0300040671020267, 'one_hot_max_size': 10, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:32:03,658] Trial 41 finished with value: 0.8696511540361117 and parameters: {'iterations': 2000, 'learning_rate': 0.018203961046851007, 'random_strength': 7, 'bagging_temperature': 5, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 82.67896536400292, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:37:29,928] Trial 42 finished with value: 0.86820397878855 and parameters: {'iterations': 2000, 'learning_rate': 0.008363581311288481, 'random_strength': 7, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 5, 'max_depth': 5, 'l2_leaf_reg': 33.25939019472065, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:38:36,559] Trial 43 finished with value: 0.868196145711711 and parameters: {'iterations': 2000, 'learning_rate': 0.03486570379120119, 'random_strength': 7, 'bagging_temperature': 4, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 7, 'l2_leaf_reg': 8.272549786010309, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:39:05,828] Trial 44 finished with value: 0.8670916470663176 and parameters: {'iterations': 2000, 'learning_rate': 0.09985942096325906, 'random_strength': 8, 'bagging_temperature': 9, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 6, 'l2_leaf_reg': 2.1967005623918605, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:39:23,027] Trial 45 finished with value: 0.8565560530081836 and parameters: {'iterations': 100, 'learning_rate': 0.058851052012146673, 'random_strength': 6, 'bagging_temperature': 8, 'max_bin': 5, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 8, 'l2_leaf_reg': 6.301367635010785e-05, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
[I 2021-04-07 23:42:07,955] Trial 46 finished with value: 0.8693237504543081 and parameters: {'iterations': 2000, 'learning_rate': 0.014042525479080605, 'random_strength': 8, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 7, 'l2_leaf_reg': 32.48233574930904, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:42:22,781] Trial 47 finished with value: 0.8621982563734569 and parameters: {'iterations': 1000, 'learning_rate': 0.19420213641694983, 'random_strength': 9, 'bagging_temperature': 6, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 8, 'l2_leaf_reg': 0.3754445308132189, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:42:38,586] Trial 48 finished with value: 0.8672922791949803 and parameters: {'iterations': 3000, 'learning_rate': 0.25406816515011066, 'random_strength': 7, 'bagging_temperature': 10, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 6, 'l2_leaf_reg': 83.62197910519791, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)
[I 2021-04-07 23:43:24,139] Trial 49 finished with value: 0.8683587513592108 and parameters: {'iterations': 2000, 'learning_rate': 0.04294018055805611, 'random_strength': 5, 'bagging_temperature': 3, 'max_bin': 10, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 9, 'max_depth': 5, 'l2_leaf_reg': 3.116940448380019, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 11 with value: 0.8707177746116492.
Stopped by overfitting detector  (10 iterations wait)


In [140]:
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}={},".format(key, value))

Number of finished trials: 50
Best trial:
  Value: 0.8707177746116492
  Params: 
    iterations=2000,
    learning_rate=0.01981954863467386,
    random_strength=1,
    bagging_temperature=8,
    max_bin=10,
    grow_policy=Lossguide,
    min_data_in_leaf=5,
    max_depth=7,
    l2_leaf_reg=78.38138081858969,
    one_hot_max_size=100,
    auto_class_weights=SqrtBalanced,
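
The best trial selected `auto_class_weights=SqrtBalanced`. As I understand the CatBoost documentation, `Balanced` weights each class by the largest class total divided by that class's total, and `SqrtBalanced` takes the square root of that ratio. The sketch below illustrates that arithmetic under the assumption of unit sample weights. The function name and the toy labels are hypothetical, and this is not CatBoost's own code.

```python
from collections import Counter
from math import sqrt


def class_weights(labels, mode="Balanced"):
    """Per-class weight = largest class count / this class's count.

    Assumes every sample has weight 1. 'SqrtBalanced' takes the square root.
    """
    counts = Counter(labels)
    largest = max(counts.values())
    weights = {cls: largest / n for cls, n in counts.items()}
    if mode == "SqrtBalanced":
        weights = {cls: sqrt(w) for cls, w in weights.items()}
    return weights


# Hypothetical labels with a 4:1 imbalance (not the survey's actual ratio).
labels = [0] * 80 + [1] * 20
balanced = class_weights(labels, "Balanced")
sqrt_balanced = class_weights(labels, "SqrtBalanced")
print(balanced)
print(sqrt_balanced)
```

With these toy labels the minority class is weighted 4.0 under `Balanced` and 2.0 under `SqrtBalanced`, so the square-root variant corrects for imbalance less aggressively.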


In [141]:
# CatBoostClassifier API Reference: https://catboost.ai/docs/concepts/python-reference_catboostclassifier.html

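# trial.params holds only the values Optuna tuned (including 'max_depth');
# any fixed settings passed inside objective() are not carried over here.
final_model = CatBoostClassifier(verbose=False, cat_features=cat_features,
                                 **trial.params)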

In [142]:
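final_model.fit(X_train, y_train)
# Probability of the positive class (column 1) for the competition test set.
predictions_h1 = final_model.predict_proba(test_df)[:,1]
# Despite the "_train" suffix, these are predictions on the held-out split (X_test).
predictions_h1_train = final_model.predict_proba(X_test)[:,1]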

In [143]:
roc_auc_score(y_test, predictions_h1_train)

0.8671530529635235
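
For readers unfamiliar with the metric, ROC AUC can be read as the probability that a randomly chosen vaccinated respondent receives a higher predicted score than a randomly chosen unvaccinated one, with ties counted as half. The sketch below computes it that way on a tiny hypothetical example. `pairwise_auc` is an illustrative helper, not part of scikit-learn, and its quadratic loop is only suitable for small inputs.

```python
def pairwise_auc(y_true, scores):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("need at least one positive and one negative label")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0      # positive ranked above negative
            elif p == n:
                wins += 0.5      # ties count as half
    return wins / (len(pos) * len(neg))


# Hypothetical labels and predicted probabilities.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = pairwise_auc(y_true, scores)
print(auc)
```

Three of the four positive/negative pairs are ordered correctly here, so the result is 0.75.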

In [144]:
predictions_h1

array([0.16592446, 0.04416085, 0.22726992, ..., 0.23541126, 0.02632169,
       0.72728195])

In [145]:
# Drop every column that contains a missing value, then split off the seasonal target.
seasonal_df = df.dropna(axis=1)
y = seasonal_df['seasonal_vaccine']
X = seasonal_df.drop(columns=['h1n1_vaccine','seasonal_vaccine'])
# Hold out 30% of the rows for a local ROC AUC check.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)

In [146]:
# Positional indices of every non-float column; these are passed to CatBoost as categorical features
cat_features = np.where(X_train.dtypes != float)[0]
cat_features

array([ 23,  24,  25,  26,  27,  28,  29,  30,  31,  32,  33,  34,  35,
        36,  37,  38,  39,  40,  41,  42,  43,  44,  45,  46,  47,  48,
        49,  50,  51,  52,  53,  54,  55,  56,  57,  58,  59,  60,  61,
        62,  63,  64,  65,  66,  67,  68,  69,  70,  71,  72,  73,  74,
        75,  76,  77,  78,  79,  80,  81,  82,  83,  84,  85,  86,  87,
        88,  89,  90,  91,  92,  93,  94,  95,  96,  97,  98,  99, 100,
       101, 102, 103, 104], dtype=int64)
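
The dtype test above treats every column that is not a float as categorical, which includes integer columns as well as object columns. A minimal sketch on a hypothetical three-column frame (the column names and values are made up for illustration) shows which positions the expression selects:

```python
import numpy as np
import pandas as pd

# Hypothetical frame: one float column, one string column, one integer column.
toy = pd.DataFrame({
    "opinion_score": [1.0, 3.0, 5.0],
    "age_group": ["18 - 34", "65+", "35 - 44"],
    "household_size": [1, 2, 4],
})

# Same expression as in the notebook: positions of all non-float columns.
cat_idx = np.where(toy.dtypes != float)[0]
print(cat_idx)
print(list(toy.columns[cat_idx]))
```

Both the string column and the integer column are selected, so any integer-coded feature would be handled as a category rather than as a number.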

In [147]:
train_dataset = Pool(data=X_train,
                     label=y_train,
                     cat_features = cat_features)

In [148]:
def objective(trial):
    # Search space for one Optuna trial; fixed settings sit alongside the tuned ones.
    param = {
        'iterations':trial.suggest_categorical('iterations', [100,200,300,500,1000,1200,1500,2000,3000,4000,5000]),
        'learning_rate':trial.suggest_float("learning_rate", 0.001, 0.3),
        'random_strength':trial.suggest_int("random_strength", 1,10),
        'bagging_temperature':trial.suggest_int("bagging_temperature", 0,10),
        'max_bin':trial.suggest_categorical('max_bin', [4,5,6,8,10,20,30]),
        'grow_policy':trial.suggest_categorical('grow_policy', ['SymmetricTree', 'Depthwise', 'Lossguide']),
        'min_data_in_leaf':trial.suggest_int("min_data_in_leaf", 1,10),
        'od_type' : "Iter",
        # The logs report a 10-iteration wait, so early_stopping_rounds=10 below
        # appears to take precedence over this value.
        'od_wait' : 100,
        # Tuned under the name "max_depth", so trial.params reports it under that key.
        "depth": trial.suggest_int("max_depth", 2,10),
        # suggest_float(..., log=True) replaces the deprecated suggest_loguniform.
        "l2_leaf_reg": trial.suggest_float("l2_leaf_reg", 1e-8, 100, log=True),
        'one_hot_max_size':trial.suggest_categorical('one_hot_max_size', [5,10,12,100,500,1024]),
        'custom_metric' : ['AUC'],
        "loss_function": "Logloss",
        'auto_class_weights':trial.suggest_categorical('auto_class_weights', ['Balanced', 'SqrtBalanced']),
        }

    # 10-fold cross-validation on the training pool.
    scores = cv(train_dataset,
            param,
            fold_count=10, 
            early_stopping_rounds=10,         
            plot=False, verbose=False)

    # Best mean validation AUC reached at any iteration.
    return scores['test-AUC-mean'].max()
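
The search range for `l2_leaf_reg` spans ten orders of magnitude (1e-8 to 100), which is why it is sampled on a log scale rather than uniformly. The sketch below shows the idea with the standard library only: draw uniformly between the logarithms of the bounds, then exponentiate. `sample_log_uniform` is a hypothetical helper for illustration and is not how Optuna's TPE sampler chooses values.

```python
import math
import random


def sample_log_uniform(low, high, rng):
    """Draw a value whose logarithm is uniform between log(low) and log(high)."""
    return math.exp(rng.uniform(math.log(low), math.log(high)))


rng = random.Random(68)  # seeded so the sketch is repeatable
draws = [sample_log_uniform(1e-8, 100.0, rng) for _ in range(10_000)]

# 1e-3 is the midpoint of the range on a log scale, so about half the draws fall below it.
share_below_midpoint = sum(d < 1e-3 for d in draws) / len(draws)
print(round(share_below_midpoint, 2))
```

A plain uniform draw over the same interval would almost never produce a value below 1e-3, so small regularization strengths would go unexplored.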

In [150]:
sampler = optuna.samplers.TPESampler(seed=68)  # Make the sampler behave in a deterministic way.
study = optuna.create_study(direction="maximize", sampler=sampler)
study.optimize(objective, n_trials=50)

[32m[I 2021-04-08 06:45:42,423][0m A new study created in memory with name: no-name-56004c1d-182d-4a61-b9f8-c199e3f637d5[0m
[32m[I 2021-04-08 06:46:34,734][0m Trial 0 finished with value: 0.8585526376329218 and parameters: {'iterations': 300, 'learning_rate': 0.013964954297408176, 'random_strength': 1, 'bagging_temperature': 8, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 4, 'max_depth': 5, 'l2_leaf_reg': 21.328495943450676, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 0 with value: 0.8585526376329218.[0m
[32m[I 2021-04-08 06:47:05,679][0m Trial 1 finished with value: 0.8603229456814425 and parameters: {'iterations': 1200, 'learning_rate': 0.11477165079768124, 'random_strength': 9, 'bagging_temperature': 6, 'max_bin': 5, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 7, 'l2_leaf_reg': 0.5714362138520529, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.86032294568144

Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:47:25,654][0m Trial 2 finished with value: 0.8368923253294923 and parameters: {'iterations': 500, 'learning_rate': 0.10424841039250944, 'random_strength': 8, 'bagging_temperature': 3, 'max_bin': 6, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 3, 'max_depth': 10, 'l2_leaf_reg': 7.961547985302404e-08, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 1 with value: 0.8603229456814425.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:48:06,981][0m Trial 3 finished with value: 0.8575016417253865 and parameters: {'iterations': 4000, 'learning_rate': 0.06981342303072555, 'random_strength': 6, 'bagging_temperature': 7, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 3, 'max_depth': 6, 'l2_leaf_reg': 1.253388039132331e-06, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8603229456814425.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:48:27,473][0m Trial 4 finished with value: 0.8586675981281925 and parameters: {'iterations': 300, 'learning_rate': 0.19342991898511874, 'random_strength': 8, 'bagging_temperature': 7, 'max_bin': 20, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 2, 'l2_leaf_reg': 0.0011659140576640084, 'one_hot_max_size': 12, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8603229456814425.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:48:38,550][0m Trial 5 finished with value: 0.8564892389418693 and parameters: {'iterations': 100, 'learning_rate': 0.21103536986772822, 'random_strength': 4, 'bagging_temperature': 4, 'max_bin': 6, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 4, 'l2_leaf_reg': 1.5248267732768012e-08, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 1 with value: 0.8603229456814425.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:50:08,831][0m Trial 6 finished with value: 0.8605099679890091 and parameters: {'iterations': 1000, 'learning_rate': 0.039222305770230614, 'random_strength': 5, 'bagging_temperature': 3, 'max_bin': 20, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 4, 'l2_leaf_reg': 3.544948380552023e-06, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 6 with value: 0.8605099679890091.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:51:15,960][0m Trial 7 finished with value: 0.8589576087086522 and parameters: {'iterations': 1000, 'learning_rate': 0.045869610193466345, 'random_strength': 9, 'bagging_temperature': 10, 'max_bin': 30, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 8, 'max_depth': 5, 'l2_leaf_reg': 6.629274905463984e-06, 'one_hot_max_size': 12, 'auto_class_weights': 'Balanced'}. Best is trial 6 with value: 0.8605099679890091.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:51:31,004][0m Trial 8 finished with value: 0.8528217687771015 and parameters: {'iterations': 1500, 'learning_rate': 0.2775338323862574, 'random_strength': 6, 'bagging_temperature': 4, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 9, 'max_depth': 9, 'l2_leaf_reg': 0.0010943835695463693, 'one_hot_max_size': 500, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 6 with value: 0.8605099679890091.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:51:49,810][0m Trial 9 finished with value: 0.8572167719660676 and parameters: {'iterations': 1000, 'learning_rate': 0.16527427195333977, 'random_strength': 1, 'bagging_temperature': 9, 'max_bin': 8, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 2, 'max_depth': 5, 'l2_leaf_reg': 5.547808519611033e-05, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 6 with value: 0.8605099679890091.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:58:31,542][0m Trial 10 finished with value: 0.8591985185353671 and parameters: {'iterations': 5000, 'learning_rate': 0.010762266734580167, 'random_strength': 3, 'bagging_temperature': 0, 'max_bin': 20, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 2, 'l2_leaf_reg': 0.10127958284396282, 'one_hot_max_size': 10, 'auto_class_weights': 'Balanced'}. Best is trial 6 with value: 0.8605099679890091.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:59:08,482][0m Trial 11 finished with value: 0.8607816262653284 and parameters: {'iterations': 1200, 'learning_rate': 0.11322551338851809, 'random_strength': 10, 'bagging_temperature': 1, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 5, 'max_depth': 8, 'l2_leaf_reg': 43.57209194906412, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 11 with value: 0.8607816262653284.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 06:59:48,365][0m Trial 12 finished with value: 0.8613701943133913 and parameters: {'iterations': 2000, 'learning_rate': 0.09961795086011382, 'random_strength': 4, 'bagging_temperature': 1, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 8, 'l2_leaf_reg': 62.63368620692968, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 12 with value: 0.8613701943133913.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:00:24,630][0m Trial 13 finished with value: 0.8613415868457771 and parameters: {'iterations': 2000, 'learning_rate': 0.12088872443638732, 'random_strength': 3, 'bagging_temperature': 0, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 8, 'l2_leaf_reg': 61.63488178410244, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 12 with value: 0.8613701943133913.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:00:53,628][0m Trial 14 finished with value: 0.855096502695367 and parameters: {'iterations': 2000, 'learning_rate': 0.14835756064337013, 'random_strength': 3, 'bagging_temperature': 0, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 10, 'l2_leaf_reg': 1.5637683031256537, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 12 with value: 0.8613701943133913.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:01:35,680][0m Trial 15 finished with value: 0.8616285796286872 and parameters: {'iterations': 2000, 'learning_rate': 0.07951489401924636, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 6, 'max_depth': 8, 'l2_leaf_reg': 62.705713595381376, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 15 with value: 0.8616285796286872.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:02:04,888][0m Trial 16 finished with value: 0.8577493269808103 and parameters: {'iterations': 2000, 'learning_rate': 0.0744607791169085, 'random_strength': 2, 'bagging_temperature': 2, 'max_bin': 5, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 8, 'max_depth': 8, 'l2_leaf_reg': 0.06915594233689655, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 15 with value: 0.8616285796286872.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:02:48,283][0m Trial 17 finished with value: 0.8600820168476628 and parameters: {'iterations': 200, 'learning_rate': 0.07684406381403432, 'random_strength': 5, 'bagging_temperature': 2, 'max_bin': 4, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 1, 'max_depth': 9, 'l2_leaf_reg': 6.4247724894952425, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 15 with value: 0.8616285796286872.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:03:04,130][0m Trial 18 finished with value: 0.8543553821104531 and parameters: {'iterations': 2000, 'learning_rate': 0.16672331851591765, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 5, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 4, 'max_depth': 7, 'l2_leaf_reg': 0.021606642032831525, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 15 with value: 0.8616285796286872.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:03:18,959][0m Trial 19 finished with value: 0.8543137681739721 and parameters: {'iterations': 3000, 'learning_rate': 0.2372277057838844, 'random_strength': 4, 'bagging_temperature': 5, 'max_bin': 8, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 7, 'max_depth': 9, 'l2_leaf_reg': 2.465405577846948, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 15 with value: 0.8616285796286872.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:04:22,397][0m Trial 20 finished with value: 0.8622908444898643 and parameters: {'iterations': 2000, 'learning_rate': 0.044265091319581354, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 73.52506535922271, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 20 with value: 0.8622908444898643.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:05:37,148][0m Trial 21 finished with value: 0.8625784609666646 and parameters: {'iterations': 2000, 'learning_rate': 0.042606984231318104, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 62.50005938682571, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 21 with value: 0.8625784609666646.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:06:51,859][0m Trial 22 finished with value: 0.862609807680017 and parameters: {'iterations': 2000, 'learning_rate': 0.03526272914782955, 'random_strength': 1, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 7, 'l2_leaf_reg': 98.00004001104887, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:08:19,384][0m Trial 23 finished with value: 0.8625368590898382 and parameters: {'iterations': 2000, 'learning_rate': 0.028079602905080877, 'random_strength': 1, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 8.151804570593427, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:12:40,439][0m Trial 24 finished with value: 0.86102290443585 and parameters: {'iterations': 2000, 'learning_rate': 0.004607740064806687, 'random_strength': 1, 'bagging_temperature': 3, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 7.454228172328469, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m
[32m[I 2021-04-08 07:13:48,696][0m Trial 25 finished with value: 0.8620383161032587 and parameters: {'iterations': 500, 'learning_rate': 0.029800286889888797, 'random_strength': 1, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 0.3983195710858186, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m
[32m[I 2021-04-08 07:17:28,102][0m Trial 26 finished with value: 0.8569233627913475 and parameters: {'iterations':

Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:19:36,704][0m Trial 29 finished with value: 0.8450047500358892 and parameters: {'iterations': 100, 'learning_rate': 0.017388446552419206, 'random_strength': 1, 'bagging_temperature': 3, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 5, 'l2_leaf_reg': 13.433470826373599, 'one_hot_max_size': 5, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m
[32m[I 2021-04-08 07:19:59,496][0m Trial 30 finished with value: 0.8560876842877153 and parameters: {'iterations': 3000, 'learning_rate': 0.05609115749150309, 'random_strength': 1, 'bagging_temperature': 0, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 7, 'l2_leaf_reg': 0.00013076250641927363, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:21:11,349][0m Trial 31 finished with value: 0.8622584181752337 and parameters: {'iterations': 2000, 'learning_rate': 0.04366243751897575, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 26.199827613558732, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:26:05,909][0m Trial 32 finished with value: 0.8573416738116191 and parameters: {'iterations': 2000, 'learning_rate': 0.0026147015459110426, 'random_strength': 2, 'bagging_temperature': 2, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 70.1393538899556, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m
[32m[I 2021-04-08 07:26:35,833][0m Trial 33 finished with value: 0.8604898136840473 and parameters: {'iterations': 2000, 'learning_rate': 0.091552077805099, 'random_strength': 3, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 7, 'max_depth': 6, 'l2_leaf_reg': 2.449615205881121, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:27:19,874][0m Trial 34 finished with value: 0.862375291567718 and parameters: {'iterations': 5000, 'learning_rate': 0.0571557613338729, 'random_strength': 1, 'bagging_temperature': 0, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 7, 'l2_leaf_reg': 18.043642692124692, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:27:51,143][0m Trial 35 finished with value: 0.8618001596792684 and parameters: {'iterations': 5000, 'learning_rate': 0.13332628241447375, 'random_strength': 1, 'bagging_temperature': 0, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 5, 'l2_leaf_reg': 17.977646693040857, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:28:21,472][0m Trial 36 finished with value: 0.8612109438180058 and parameters: {'iterations': 5000, 'learning_rate': 0.060970889788992705, 'random_strength': 1, 'bagging_temperature': 3, 'max_bin': 6, 'grow_policy': 'Depthwise', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 0.9544216041805242, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:29:03,819][0m Trial 37 finished with value: 0.8558765210416673 and parameters: {'iterations': 300, 'learning_rate': 0.021838059939802965, 'random_strength': 7, 'bagging_temperature': 2, 'max_bin': 30, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 7, 'l2_leaf_reg': 0.09959386166070887, 'one_hot_max_size': 5, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 22 with value: 0.862609807680017.[0m
[32m[I 2021-04-08 07:30:31,188][0m Trial 38 finished with value: 0.8612907749318722 and parameters: {'iterations': 5000, 'learning_rate': 0.030581297756174224, 'random_strength': 2, 'bagging_temperature': 6, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 10, 'max_depth': 4, 'l2_leaf_reg': 3.894995143050487, 'one_hot_max_size': 12, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:30:52,693][0m Trial 39 finished with value: 0.8572790902326396 and parameters: {'iterations': 1200, 'learning_rate': 0.09141145504683143, 'random_strength': 4, 'bagging_temperature': 3, 'max_bin': 10, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 5, 'l2_leaf_reg': 1.5282545810431035e-07, 'one_hot_max_size': 500, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:31:32,899][0m Trial 40 finished with value: 0.8617817740799941 and parameters: {'iterations': 500, 'learning_rate': 0.05863309194188887, 'random_strength': 1, 'bagging_temperature': 4, 'max_bin': 6, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 9, 'max_depth': 6, 'l2_leaf_reg': 18.0233382126146, 'one_hot_max_size': 1024, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:32:37,151][0m Trial 41 finished with value: 0.8624183143864881 and parameters: {'iterations': 2000, 'learning_rate': 0.044432328928803416, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 84.05079365936713, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:33:33,878][0m Trial 42 finished with value: 0.8619779875486581 and parameters: {'iterations': 4000, 'learning_rate': 0.04296737318598617, 'random_strength': 3, 'bagging_temperature': 0, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 7, 'l2_leaf_reg': 19.70794495725286, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:36:51,199][0m Trial 43 finished with value: 0.862426409036574 and parameters: {'iterations': 2000, 'learning_rate': 0.011943671829882242, 'random_strength': 2, 'bagging_temperature': 1, 'max_bin': 10, 'grow_policy': 'SymmetricTree', 'min_data_in_leaf': 8, 'max_depth': 8, 'l2_leaf_reg': 9.12026467643942, 'one_hot_max_size': 1024, 'auto_class_weights': 'Balanced'}. Best is trial 22 with value: 0.862609807680017.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:40:33,060][0m Trial 44 finished with value: 0.86263419571034 and parameters: {'iterations': 2000, 'learning_rate': 0.011474700179802198, 'random_strength': 2, 'bagging_temperature': 2, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 8, 'l2_leaf_reg': 99.05683151592714, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 44 with value: 0.86263419571034.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:43:15,215][0m Trial 45 finished with value: 0.86199357321067 and parameters: {'iterations': 2000, 'learning_rate': 0.015601520735630273, 'random_strength': 4, 'bagging_temperature': 2, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 9, 'l2_leaf_reg': 0.5674765587597007, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 44 with value: 0.86263419571034.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:48:35,562][0m Trial 46 finished with value: 0.8624956015685921 and parameters: {'iterations': 2000, 'learning_rate': 0.007736085095600653, 'random_strength': 3, 'bagging_temperature': 4, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 7, 'max_depth': 8, 'l2_leaf_reg': 7.499443521533301, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 44 with value: 0.86263419571034.[0m


Stopped by overfitting detector  (10 iterations wait)


[32m[I 2021-04-08 07:48:51,968][0m Trial 47 finished with value: 0.8478031954404605 and parameters: {'iterations': 100, 'learning_rate': 0.029734414414858742, 'random_strength': 5, 'bagging_temperature': 4, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 6, 'max_depth': 9, 'l2_leaf_reg': 1.0521712916014843, 'one_hot_max_size': 100, 'auto_class_weights': 'SqrtBalanced'}. Best is trial 44 with value: 0.86263419571034.[0m
[32m[I 2021-04-08 07:49:42,028][0m Trial 48 finished with value: 0.8396113618392009 and parameters: {'iterations': 300, 'learning_rate': 0.004064222944990553, 'random_strength': 3, 'bagging_temperature': 3, 'max_bin': 4, 'grow_policy': 'Lossguide', 'min_data_in_leaf': 5, 'max_depth': 10, 'l2_leaf_reg': 0.24571187865982283, 'one_hot_max_size': 100, 'auto_class_weights': 'Balanced'}. Best is trial 44 with value: 0.86263419571034.[0m
[32m[I 2021-04-08 07:51:03,793][0m Trial 49 finished with value: 0.8620568529725521 and parameters: {'iterations': 1000,

Stopped by overfitting detector  (10 iterations wait)
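
The repeated "Stopped by overfitting detector (10 iterations wait)" messages indicate that training halted once the validation metric had gone 10 iterations without improving. The sketch below is a simplified illustration of that patience rule for a metric that should be maximized. The function name and the toy AUC values are hypothetical, and this is not CatBoost's implementation.

```python
def best_iteration_with_patience(scores, wait):
    """Return (best_index, last_index_evaluated) for a metric to maximize.

    Stops as soon as `wait` iterations have passed since the best score.
    """
    best_idx = 0
    for i, score in enumerate(scores):
        if score > scores[best_idx]:
            best_idx = i                  # new best: reset the patience window
        elif i - best_idx >= wait:
            return best_idx, i            # patience exhausted: stop here
    return best_idx, len(scores) - 1      # ran to the end without stopping


# Hypothetical validation AUC per iteration.
auc_by_iteration = [0.80, 0.84, 0.86, 0.859, 0.858, 0.857, 0.861]
result = best_iteration_with_patience(auc_by_iteration, wait=3)
print(result)
```

In this toy run the search stops at index 5 and never sees the later, slightly better value at index 6. A short wait saves time but can cut off slow learners, which may matter for the low learning rates in the search space.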


In [None]:
print("Number of finished trials: {}".format(len(study.trials)))
print("Best trial:")
trial = study.best_trial
print("  Value: {}".format(trial.value))
print("  Params: ")
for key, value in trial.params.items():
    print("    {}={},".format(key, value))

In [None]:
final_model_seasonal = CatBoostClassifier(verbose=False,  cat_features=cat_features, 
                          **trial.params)

In [None]:
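final_model_seasonal.fit(X_train, y_train)
# Probability of the positive class (column 1) for the competition test set.
predictions_seasonal = final_model_seasonal.predict_proba(test_df)[:,1]
# Despite the "_train" suffix, these are predictions on the held-out split (X_test).
predictions_seasonal_train = final_model_seasonal.predict_proba(X_test)[:,1]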

In [None]:
predictions_seasonal

In [None]:
roc_auc_score(y_test, predictions_seasonal_train)

In [None]:
seasonal = pd.Series(predictions_seasonal)

In [None]:
h1n1 = pd.Series(predictions_h1)

In [None]:
h1n1=h1n1.reset_index().drop('index',axis=1)
seasonal=seasonal.reset_index().drop('index',axis=1)

In [None]:
respondent_id_df = pd.DataFrame(
    test_df.index
)

In [None]:
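# Join the ids and both prediction frames on their shared 0..n-1 positional index.
submission = respondent_id_df.merge(h1n1 ,how='outer', left_index=True,right_index=True)
# Both prediction frames name their only column 0, so this merge suffixes them as '0_x' and '0_y'.
submission = submission.merge(seasonal ,how='outer', left_index=True,right_index=True)
submission = submission.set_index('respondent_id')
submission = submission.rename(columns={'0_x':'h1n1_vaccine','0_y':'seasonal_vaccine'})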
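
The same table can also be built in one step, without the intermediate merges and suffix renames. The sketch below is an alternative, not the code used for the submission. It assumes the two prediction arrays are in the same row order as the test index, and the ids, probabilities, and the name `submission_sketch` are hypothetical stand-ins.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-ins for test_df.index and the two prediction arrays.
respondent_ids = pd.Index([26707, 26708, 26709], name="respondent_id")
h1n1_probs = np.array([0.17, 0.04, 0.23])
seasonal_probs = np.array([0.30, 0.05, 0.70])

# Build the frame directly: one column per target, indexed by respondent id.
submission_sketch = pd.DataFrame(
    {"h1n1_vaccine": h1n1_probs, "seasonal_vaccine": seasonal_probs},
    index=respondent_ids,
)
print(submission_sketch)
```

Passing the arrays by name avoids relying on the automatic `_x`/`_y` suffixes, which would silently change if the merge order changed.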

In [None]:
submission.to_csv(r"C:\Users\Horri\Downloads\submission_20210331v7.csv")

In [None]:
submission