**Note**: Credit to Alexis Cook, who provided an awesome template for beginners to get started in the 30 Days of ML competition right away.
<br>
If you find Adam's [tutorial](https://www.kaggle.com/adam48/tutorial-building-a-custom-pipeline-for-hyperopt) useful, give him credit in this [comment](https://www.kaggle.com/c/30-days-of-ml/discussion/267470#1490123).

Link to the competition main page: **[30 Days of ML competition](https://www.kaggle.com/c/30-days-of-ml/overview)**.

# Things to be done for tomorrow
- Check the [useful features discussion](https://www.kaggle.com/c/30-days-of-ml/discussion/267931) and adjust the preprocessing pipeline to use only the successful features (read the whole discussion).
- Use Adam's tutorial to tune the XGBoost hyperparameters.

# Part 1: Setting up the stage
## Step 1.1: Importing libraries

We begin by importing the libraries we'll need.  Some of them will be familiar from the **[Intro to Machine Learning](https://www.kaggle.com/learn/intro-to-machine-learning)** course and the **[Intermediate Machine Learning](https://www.kaggle.com/learn/intermediate-machine-learning)** course.

In [1]:
# Familiar imports
import numpy as np
import pandas as pd

# For preprocessing and encoding data
from sklearn.base import BaseEstimator, TransformerMixin # To construct custom encoders
from sklearn.preprocessing import OrdinalEncoder # Encoders
from sklearn.preprocessing import OneHotEncoder
from sklearn.model_selection import train_test_split # Splitting data
from sklearn.pipeline import Pipeline, FeatureUnion # Pipelines
from sklearn.compose import ColumnTransformer

# For training XGBoost and LGBM
from xgboost import XGBRegressor
from lightgbm import LGBMRegressor
from sklearn.metrics import mean_squared_error

# For parameter optimization
import optuna

## Step 1.2: Load the data

Next, we'll load the training and test data.  

We set `index_col=0` in the code cell below to use the `id` column to index the DataFrame.  (*If you're not sure how this works, try temporarily removing `index_col=0` and see how it changes the result.*)
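
To see the effect of `index_col=0` without touching the competition files, here is a minimal sketch on a tiny made-up CSV (the values are invented for illustration and are not taken from the dataset):

```python
import io
import pandas as pd

# Tiny made-up CSV standing in for train.csv (illustrative values only)
csv_text = "id,cat0,cont0,target\n1,B,0.20,8.1\n2,A,0.74,8.4\n4,B,0.33,8.0\n"

# Without index_col, pandas creates a default 0..n-1 index and keeps `id` as a column
default_index = pd.read_csv(io.StringIO(csv_text))

# With index_col=0, the first column (`id`) becomes the row index
id_index = pd.read_csv(io.StringIO(csv_text), index_col=0)

print(default_index.columns.tolist())  # includes 'id'
print(id_index.columns.tolist())       # 'id' is no longer a column
print(id_index.index.tolist())         # the ids themselves
```

Using the `id` column as the index keeps it out of the feature matrix while still letting the predictions be matched back to the right rows.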

In [2]:
# Load the training and test data
train = pd.read_csv("../input/30-days-of-ml/train.csv", index_col=0)
test = pd.read_csv("../input/30-days-of-ml/test.csv", index_col=0)

# Preview the data
display(train.head())

test.head()

| id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 | target |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | B | B | B | C | B | B | A | E | C | N | ... | 0.400361 | 0.160266 | 0.310921 | 0.38947 | 0.267559 | 0.237281 | 0.377873 | 0.322401 | 0.86985 | 8.113634 |
| 2 | B | B | A | A | B | D | A | F | A | O | ... | 0.533087 | 0.558922 | 0.516294 | 0.594928 | 0.341439 | 0.906013 | 0.921701 | 0.261975 | 0.465083 | 8.481233 |
| 3 | A | A | A | C | B | D | A | D | A | F | ... | 0.650609 | 0.375348 | 0.902567 | 0.555205 | 0.843531 | 0.748809 | 0.620126 | 0.541474 | 0.763846 | 8.364351 |
| 4 | B | B | A | C | B | D | A | E | C | K | ... | 0.66898 | 0.239061 | 0.732948 | 0.679618 | 0.574844 | 0.34601 | 0.71461 | 0.54015 | 0.280682 | 8.049253 |
| 6 | A | A | A | C | B | D | A | E | A | N | ... | 0.686964 | 0.420667 | 0.648182 | 0.684501 | 0.956692 | 1.000773 | 0.776742 | 0.625849 | 0.250823 | 7.97226 |


| id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | B | B | B | C | B | B | A | E | E | I | ... | 0.476739 | 0.37635 | 0.337884 | 0.321832 | 0.445212 | 0.290258 | 0.244476 | 0.087914 | 0.301831 | 0.845702 |
| 5 | A | B | A | C | B | C | A | E | C | H | ... | 0.285509 | 0.860046 | 0.798712 | 0.835961 | 0.391657 | 0.288276 | 0.549568 | 0.905097 | 0.850684 | 0.69394 |
| 15 | B | A | A | A | B | B | A | E | D | K | ... | 0.697272 | 0.6836 | 0.404089 | 0.879379 | 0.275549 | 0.427871 | 0.491667 | 0.384315 | 0.376689 | 0.508099 |
| 16 | B | B | A | C | B | D | A | E | A | N | ... | 0.719306 | 0.77789 | 0.730954 | 0.644315 | 1.024017 | 0.39109 | 0.98834 | 0.411828 | 0.393585 | 0.461372 |
| 17 | B | B | A | C | B | C | A | E | C | F | ... | 0.313032 | 0.431007 | 0.390992 | 0.408874 | 0.447887 | 0.390253 | 0.648932 | 0.385935 | 0.370401 | 0.900412 |


The next code cell separates the target (which we assign to `y`) from the training features (which we assign to `X`).

In [3]:
# Separate target from features
y = train['target']
X = train.drop(['target'], axis=1)

# Preview features
X.head()

| id | cat0 | cat1 | cat2 | cat3 | cat4 | cat5 | cat6 | cat7 | cat8 | cat9 | ... | cont4 | cont5 | cont6 | cont7 | cont8 | cont9 | cont10 | cont11 | cont12 | cont13 |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 1 | B | B | B | C | B | B | A | E | C | N | ... | 0.610706 | 0.400361 | 0.160266 | 0.310921 | 0.38947 | 0.267559 | 0.237281 | 0.377873 | 0.322401 | 0.86985 |
| 2 | B | B | A | A | B | D | A | F | A | O | ... | 0.276853 | 0.533087 | 0.558922 | 0.516294 | 0.594928 | 0.341439 | 0.906013 | 0.921701 | 0.261975 | 0.465083 |
| 3 | A | A | A | C | B | D | A | D | A | F | ... | 0.285074 | 0.650609 | 0.375348 | 0.902567 | 0.555205 | 0.843531 | 0.748809 | 0.620126 | 0.541474 | 0.763846 |
| 4 | B | B | A | C | B | D | A | E | C | K | ... | 0.284667 | 0.66898 | 0.239061 | 0.732948 | 0.679618 | 0.574844 | 0.34601 | 0.71461 | 0.54015 | 0.280682 |
| 6 | A | A | A | C | B | D | A | E | A | N | ... | 0.287595 | 0.686964 | 0.420667 | 0.648182 | 0.684501 | 0.956692 | 1.000773 | 0.776742 | 0.625849 | 0.250823 |



# Part 2: Exploratory Data Analysis

I will do this in another notebook or at another time. For now, I have relied on information from other notebooks and forum discussions to complete the task.


# Part 3: Setting up preprocessing pipelines

Here comes the juicy part.

First we split the data into training and validation sets, and then we set up the preprocessing pipeline.

## Step 3.1: Split into train and validation data

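The hold-out split below is left commented out, because the train/validation split actually used for scoring is made inside the objective function in Step 4.1. The preprocessing pipelines are set up in the steps that follow.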

In [4]:
#X_train, X_valid, y_train, y_valid = train_test_split(X, y, random_state=0, train_size=0.8, test_size=0.2)
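
For reference, a hold-out split along these lines could look like the sketch below. The toy DataFrame is made up; the `train_size`, `test_size` and `random_state` values mirror the commented-out line above.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up features and target, only to show the shapes involved
X_toy = pd.DataFrame({"cont0": [0.1 * i for i in range(10)],
                      "cat0": list("ABABABABAB")})
y_toy = pd.Series(range(10), name="target")

# 80% of the rows go to training, 20% to validation; random_state makes it repeatable
X_train, X_valid, y_train, y_valid = train_test_split(
    X_toy, y_toy, train_size=0.8, test_size=0.2, random_state=0
)

print(X_train.shape, X_valid.shape)
```

The features and the target are split with the same row assignment, so `X_train` and `y_train` stay aligned by index.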

## Step 3.2: Subselect categorical and numerical data

Here we define the column lists that the preprocessing pipeline uses.

These are:
- All the numerical columns: cont0, cont1, cont2, cont3, cont4, cont5, cont6, cont7, cont8, cont9, cont10, cont11, cont12, cont13.
- Categorical features to be treated with OrdinalEncoder: cat5, cat8.
- Categorical features to be binarized into 0/1 indicator columns: cat1_A, cat1_B, cat8_C, cat8_E (for example, cat1_A is 1 where cat1 equals A and 0 otherwise).
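
As a rough illustration of what these two encodings produce, the sketch below applies them to a made-up frame. The category values are invented, and `OrdinalEncoder` is used with its default settings, which sort the categories before assigning integer codes.

```python
import pandas as pd
from sklearn.preprocessing import OrdinalEncoder

# Made-up categorical data (not the competition values)
toy = pd.DataFrame({"cat1": ["A", "B", "A"], "cat8": ["C", "E", "A"]})

# Ordinal encoding: each category is mapped to an integer code (sorted order by default)
ordinal = OrdinalEncoder().fit_transform(toy[["cat8"]])

# Binarization: one 0/1 indicator per (column, category) pair
binary = pd.DataFrame({
    "cat1_A": (toy["cat1"] == "A").astype(int),
    "cat8_E": (toy["cat8"] == "E").astype(int),
})

print(ordinal.ravel())
print(binary)
```

Ordinal encoding keeps one column per feature, while binarization creates one column per chosen category, so the same source column (here `cat8`) can feed both.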

In [5]:
# List of categorical and numerical columns
numerical_cols = [col for col in X.columns if 'cont' in col]
categorical_cols = [col for col in X.columns if 'cat' in col]

# Features to be ordinally encoded
categorical_ordinal_cols = ['cat5', 'cat8']

# Categorical features to be binarized
categorical_binary_cols = ['cat1', 'cat8']
categorical_to_iterate = ['cat1', 'cat1', 'cat8', 'cat8']
category_to_iterate = ['A', 'B', 'C', 'E']

## Step 3.3: Adding custom processing of data and functionality to convert into DataFrame

In [6]:
# Class to select given features
class FeatureSelector(BaseEstimator, TransformerMixin):
    #Class Constructor
    def __init__(self, feature_names):
        self.feature_names = feature_names 
    
    #Nothing to fit, so just return self
    def fit( self, X, y = None ):
        return self
    
    #Extract features
    def transform( self, X, y = None ):
        return X.loc[:, self.feature_names]
    
# Class to binarize by a given category
class CategoryBinarizer(BaseEstimator, TransformerMixin):
    def __init__(self, column_names, category_names):
        self.column_names = column_names
        self.category_names = category_names

    #Nothing to fit, so just return self
    def fit( self, X, y = None ):
        return self
    
    # Binarize a categorical feature based on given category
    def transform( self, X, y = None ):
        # Work on a copy so the caller's DataFrame is not modified
        X = X.copy()
        new_cols = []
        
        for col, category in zip(self.column_names, self.category_names):
            new_col_name = "{}_{}".format(col, category)
            new_cols.append(new_col_name)
            # 1 where the column equals the given category, 0 otherwise
            X[new_col_name] = (X[col] == category).astype(int)
        
        return X.loc[:, new_cols]

# return dataframe from arbitrary step
class Conv2df(BaseEstimator, TransformerMixin):
    def __init__(self):
        self.colnames = []
    
    def fit(self, X, y=None):
        return self
    
    def transform(self, X, y=None):
        return pd.DataFrame(X)
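
The general pattern behind these classes is a scikit-learn transformer: inherit from `BaseEstimator` and `TransformerMixin`, implement `fit` and `transform`, and `fit_transform` comes for free. A minimal sketch with a hypothetical `ColumnPicker` class (a stand-in written for this example, not one of the classes above) and made-up data:

```python
import pandas as pd
from sklearn.base import BaseEstimator, TransformerMixin
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder

class ColumnPicker(BaseEstimator, TransformerMixin):
    """Hypothetical stateless transformer that keeps only the named columns."""
    def __init__(self, feature_names):
        self.feature_names = feature_names

    def fit(self, X, y=None):
        # Nothing to learn from the data
        return self

    def transform(self, X, y=None):
        return X.loc[:, self.feature_names]

# Made-up data for illustration
toy = pd.DataFrame({"cat5": ["B", "D", "B"], "cont0": [0.1, 0.5, 0.9]})

# fit_transform is provided by TransformerMixin, so the class can be a pipeline step
pipe = Pipeline(steps=[
    ("pick", ColumnPicker(feature_names=["cat5"])),
    ("encode", OrdinalEncoder()),
])
encoded = pipe.fit_transform(toy)
print(encoded)
```

Because the selector is an ordinary pipeline step, the same column selection is applied automatically to the validation and test data at prediction time.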

## Step 3.4: Setting up preprocessing pipeline

In [7]:
def buildFeaturesPipeline():
    # Pipeline to treat some categories with ordinal encoding
    categorical_ordinal = Pipeline(
        steps = [
            ('select_ordinal', FeatureSelector(feature_names=categorical_ordinal_cols)),
            ('ordinal_encoder', OrdinalEncoder())
        ]
    )
    
    # Pipeline to binarize selected categorical columns against specific categories
    categorical_binarize = Pipeline(
        steps = [
            ('select_binary', FeatureSelector(feature_names=categorical_binary_cols)),
            ('category_binarizer', CategoryBinarizer(column_names=categorical_to_iterate,
                                                     category_names=category_to_iterate))
        ]
    )
    
    # Pipeline to extract numerical features
    numerical_features = Pipeline(
        steps = [
            ('select_numerical', FeatureSelector(feature_names=numerical_cols))
        ]
    )
    
    # Unite pipelines for the different features
    features_merged = FeatureUnion(transformer_list = [('categorical_ordinal_pipeline',
                                                        categorical_ordinal),
                                                      ('categorical_binary_pipeline',
                                                       categorical_binarize),
                                                      ('numerical_pipeline',
                                                       numerical_features)
                                                      ])
    
    # FeatureUnion returns a numpy array, convert to dataframe. Note that we've lost the column names at this point.
    features_pipeline = Pipeline(steps=[('features',features_merged),('convert2df',Conv2df())])
    
    return features_pipeline
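
To see why the final `Conv2df` step is needed, here is a small sketch showing that `FeatureUnion` stacks the outputs of its branches side by side into a NumPy array. It uses `FunctionTransformer` as a stand-in for the selector classes (an assumption made to keep the example short), made-up data, and scikit-learn's default output configuration.

```python
import numpy as np
import pandas as pd
from sklearn.pipeline import FeatureUnion
from sklearn.preprocessing import FunctionTransformer

# Made-up numerical data for illustration
toy = pd.DataFrame({"cont0": [0.1, 0.2], "cont1": [0.3, 0.4], "cont2": [0.5, 0.6]})

# Two branches, each selecting a different subset of columns
union = FeatureUnion(transformer_list=[
    ("first", FunctionTransformer(lambda df: df[["cont0"]])),
    ("rest", FunctionTransformer(lambda df: df[["cont1", "cont2"]])),
])

merged = union.fit_transform(toy)
print(type(merged))   # a NumPy array: the column names are gone
print(merged.shape)

# Wrapping the array restores a DataFrame, with integer column labels
as_frame = pd.DataFrame(merged)
print(as_frame.columns.tolist())
```

The columns come out in the order the branches are listed, which is the only way to tell them apart once the names are lost.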

## Step 3.5: Create model pipeline and full pipeline

In [8]:
class ModelTransformer(BaseEstimator, TransformerMixin):
    def __init__(self,space):
        self.space = space.copy()
        # remove hyperparameters not used in training individual models
        self.percXG = space['percXG']
        self.space.pop('percXG')
        # create XGB model
        self.xgmodel = XGBRegressor(**self.space)
        # remove hyperparameters used by XGB but not LGBM
        for key in ['base_score','booster','tree_method','gamma']:
            self.space.pop(key)
        # create LGBM model
        self.lgmodel = LGBMRegressor(**self.space)
        
    def fit(self, X, y):
        # small internal train/val split, needed for early stopping rounds
        X_t, X_v, y_t, y_v = train_test_split(X, y, test_size = 0.1, random_state = 3) 
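        # fit the models (note: newer XGBoost and LightGBM releases expect early stopping
        # to be configured outside fit(), via the constructor or callbacks)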
        self.xgmodel.fit(X_t, y_t, eval_set=[(X_v, y_v)],
                      eval_metric='rmse',early_stopping_rounds=10,verbose=0)
        self.lgmodel.fit(X_t, y_t, eval_set=[(X_v, y_v)],
                      eval_metric='rmse',early_stopping_rounds=10,verbose=0)
        return self
    
    def predict(self, X, y=None):
        # predict with both models
        xgpred = self.xgmodel.predict(X)
        lgpred = self.lgmodel.predict(X)
        # perform weighted averaging of the models based on the hyperparameter percXG
        preds = xgpred*self.percXG + lgpred*(1.0-self.percXG)
        return preds
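
The blending rule in `predict` is a plain weighted average: with weight `percXG` = w, the blend is w * xgpred + (1 - w) * lgpred. The sketch below reproduces it with two scikit-learn regressors standing in for XGBoost and LightGBM (a substitution made so the example needs no extra dependencies) on synthetic data.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression

# Synthetic regression data (made up for illustration)
rng = np.random.default_rng(0)
X_toy = rng.uniform(size=(200, 3))
y_toy = 8.0 + X_toy @ np.array([0.5, -0.2, 0.1]) + rng.normal(scale=0.05, size=200)

# Stand-ins for the two boosted models
model_a = LinearRegression().fit(X_toy, y_toy)
model_b = RandomForestRegressor(n_estimators=20, random_state=0).fit(X_toy, y_toy)

pred_a = model_a.predict(X_toy)
pred_b = model_b.predict(X_toy)

# weight plays the role of percXG: 1.0 means only model_a, 0.0 means only model_b
weight = 0.7
blended = pred_a * weight + pred_b * (1.0 - weight)
print(blended[:3])
```

Since the weight lies between 0 and 1, every blended prediction falls between the two individual predictions, and tuning the weight lets the search decide how much to trust each model.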

# Part 4: Fitting the model and evaluating results

We fit the model and evaluate the predictions.

## Step 4.1: Create objective function

In [9]:
def objective(trial, X=X, y=y):
    params = {
        # model hyperparameters to be tuned - picking which ones to tune and their ranges is a bit of an artform and not covered here
        'n_estimators': trial.suggest_int('n_estimators', 1000, 10000),
        'learning_rate': trial.suggest_uniform('learning_rate', 0.01, 0.20),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'reg_alpha': trial.suggest_loguniform('reg_alpha', 1, 100),
        'reg_lambda': trial.suggest_loguniform('reg_lambda', 1, 100),
        'colsample_bytree': trial.suggest_uniform('colsample_bytree', 0.1, 1.0),
        'gamma': trial.suggest_uniform('gamma', 0.0, 1.0),
        # model hyperparameters not tuned, just passed
        'tree_method': trial.suggest_categorical('tree_method',['hist']), # switch to gpu_hist for a big performance gain with XGBoost!
        'booster': trial.suggest_categorical('booster',['gbtree']),
        'subsample': trial.suggest_categorical('subsample',[0.9]),
        'random_state': trial.suggest_categorical('random_state',[8]),
        'base_score': trial.suggest_categorical('base_score',[8]),
        # feature engineering hyperparameters
        #'num_clusters': trial.suggest_int('num_clusters', 0, 2),
        #'num_PCAfeatures': trial.suggest_int('num_PCAfeatures', 0, 2),
        #'num_Poly': trial.suggest_int('num_Poly', 1, 2),
        # model weight hyperparameters
        'percXG': trial.suggest_uniform('percXG', 0.0, 1.0)
    }
    fp = buildFeaturesPipeline()
    full_pipeline = Pipeline(steps=[('features',fp),('model',ModelTransformer(params))])
    X_train, X_val, y_train, y_val = train_test_split(X, y, test_size=0.2, random_state=5)
    full_pipeline.fit(X_train,y_train)
    y_pred = full_pipeline.predict(X_val)
    
    rmse_score = mean_squared_error(y_val, y_pred, squared=False)
    
    return rmse_score
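
The score returned by the objective is the root mean squared error, i.e. the square root of the average squared difference between targets and predictions. A small sketch with made-up numbers, computing it by hand and via `mean_squared_error`; the square root is taken explicitly here so the example does not depend on the `squared` argument, which newer scikit-learn releases deprecate.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

# Made-up targets and predictions
y_true = np.array([8.0, 8.5, 7.5, 8.2])
y_hat = np.array([8.1, 8.3, 7.9, 8.2])

# RMSE by hand: square the errors, average them, take the square root
rmse_manual = np.sqrt(np.mean((y_true - y_hat) ** 2))

# Same quantity starting from scikit-learn's MSE
rmse_sklearn = np.sqrt(mean_squared_error(y_true, y_hat))

print(rmse_manual, rmse_sklearn)
```

Lower is better, which is why the study in the next part is created with `direction='minimize'`.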

# Part 5: Optimize hyperparameters

In [10]:
# optuna.logging.set_verbosity(optuna.logging.WARNING) # disable normal logging for long runs or results notebook hangs browser!
study = optuna.create_study(direction='minimize')
study.optimize(objective, n_trials=20, timeout=60*10) # 20 trials or 10 minute runtime, whichever comes first
print('Number of finished trials:', len(study.trials))
print('Best trial:', study.best_trial.params)

[I 2021-08-31 23:21:43,306] A new study created in memory with name: no-name-83d6539b-654b-427c-8c77-971b7efb28a2
[I 2021-08-31 23:22:06,981] Trial 0 finished with value: 0.7179342289636041 and parameters: {'n_estimators': 6394, 'learning_rate': 0.15453675652059595, 'max_depth': 3, 'reg_alpha': 8.103017920617429, 'reg_lambda': 3.94631882313996, 'colsample_bytree': 0.9468409412806735, 'gamma': 0.24673648119936864, 'tree_method': 'hist', 'booster': 'gbtree', 'subsample': 0.9, 'random_state': 8, 'base_score': 8, 'percXG': 0.588930373318188}. Best is trial 0 with value: 0.7179342289636041.
[I 2021-08-31 23:22:18,534] Trial 1 finished with value: 0.7185760926285102 and parameters: {'n_estimators': 3301, 'learning_rate': 0.16643360609076122, 'max_depth': 10, 'reg_alpha': 2.205285979008538, 'reg_lambda': 11.95244613949713, 'colsample_bytree': 0.1622840297736657, 'gamma': 0.6819976247956067, 'tree_method': 'hist', 'booster': 'gbtree', 'subsample': 0.9, 'rando

Number of finished trials: 19
Best trial: {'n_estimators': 4038, 'learning_rate': 0.048799326739117965, 'max_depth': 5, 'reg_alpha': 24.234693761068616, 'reg_lambda': 5.783748035635562, 'colsample_bytree': 0.18900042554055074, 'gamma': 0.48339854413337446, 'tree_method': 'hist', 'booster': 'gbtree', 'subsample': 0.9, 'random_state': 8, 'base_score': 8, 'percXG': 0.955550246432554}


# Part 6: Use best hyperparameters and submit

In [11]:
sub = pd.read_csv("/kaggle/input/30-days-of-ml/sample_submission.csv",index_col = 'id')
fp = buildFeaturesPipeline()
final_pipeline = Pipeline(steps=[('features',fp),('model',ModelTransformer(study.best_trial.params))])
final_pipeline.fit(X,y)
sub.target = final_pipeline.predict(test)
sub.head()

| id | target |
|---|---|
| 0 | 8.177982 |
| 5 | 8.370835 |
| 15 | 8.363151 |
| 16 | 8.519019 |
| 17 | 8.156708 |


In [12]:
sub.to_csv("submission.csv")
print("Complete")

Complete
