# LightGBM with grid search for parameter optimisation

In this section we build a boosting model for the classification task.
Boosting set out to answer the question "Can a set of weak learners create a single strong learner?"
It has proved very successful across a wide range of applications.
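
To make the weak-learner idea concrete, here is a minimal sketch of gradient boosting for squared error, using one-split regression stumps on a toy 1-D problem. The data, function names and settings are illustrative assumptions and not part of this project's pipeline; LightGBM's own implementation is far more elaborate.

```python
import numpy as np


def fit_stump(x, residual):
    """Fit the best single-split regressor (a weak learner) to the residuals."""
    best = None
    for threshold in np.unique(x)[:-1]:  # exclude the max so both sides are non-empty
        left = residual[x <= threshold]
        right = residual[x > threshold]
        sse = ((left - left.mean()) ** 2).sum() + ((right - right.mean()) ** 2).sum()
        if best is None or sse < best[0]:
            best = (sse, threshold, left.mean(), right.mean())
    _, threshold, left_value, right_value = best
    return threshold, left_value, right_value


def predict_stump(stump, x):
    threshold, left_value, right_value = stump
    return np.where(x <= threshold, left_value, right_value)


def boost(x, y, n_rounds, learning_rate=0.1):
    """Add stumps one at a time, each fitted to what the ensemble still gets wrong."""
    prediction = np.full(len(y), y.mean())
    for _ in range(n_rounds):
        stump = fit_stump(x, y - prediction)
        prediction = prediction + learning_rate * predict_stump(stump, x)
    return prediction


# Toy data (assumption): a smooth curve that no single stump can follow.
x = np.linspace(0, 6, 200)
y = np.sin(x)

mse_1 = np.mean((y - boost(x, y, n_rounds=1, learning_rate=1.0)) ** 2)  # one stump
mse_100 = np.mean((y - boost(x, y, n_rounds=100)) ** 2)                 # 100 stumps
print(mse_1, mse_100)
```

Each stump on its own is a poor model, but the sum of many small corrections fits the curve much more closely than the single stump does.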

LightGBM, short for light gradient-boosting machine, is a gradient boosting framework developed by Microsoft and released as open source in 2016.
Although less widely used than XGBoost, LightGBM has advantages in efficiency and memory consumption.

I originally wished to use XGBoost, but because of problems we ran into when implementing the model, LightGBM was the better choice.

In [None]:
import numpy as np
import pandas as pd
from datetime import datetime
from numba import jit
import lightgbm as lgbm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import RandomizedSearchCV
from sklearn.model_selection import StratifiedKFold
from sklearn.preprocessing import StandardScaler

Here we define the functions that calculate the Gini coefficient, along with the data handling code: dropping the columns that are mostly missing values, imputing the remaining missing values, and dropping the calc columns, since our EDA found that these had no correlation with the target. We also encode our categorical features and rescale the data.
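
For reference, the normalised Gini coefficient is a linear rescaling of ROC AUC: Gini = 2 * AUC - 1. The sketch below checks this on a tiny hand-made example. It assumes binary labels and no tied scores; `gini_normalised` mirrors the scan used by the `gini` function in this section (without numba), and `auc_pairwise` is a hypothetical helper written only for this check.

```python
import numpy as np


def gini_normalised(y_true, y_score):
    """Normalised Gini: scan from the highest score down, counting mis-ordered pairs."""
    y_sorted = np.asarray(y_true)[np.argsort(y_score)]
    n_pos = 0
    n_neg_above = 0   # negatives seen so far, i.e. scored higher
    discordant = 0    # (positive, negative) pairs where the negative scores higher
    for y in y_sorted[::-1]:
        n_pos += y
        discordant += y * n_neg_above
        n_neg_above += 1 - y
    n = len(y_sorted)
    return 1 - 2 * discordant / (n_pos * (n - n_pos))


def auc_pairwise(y_true, y_score):
    """ROC AUC as the fraction of (positive, negative) pairs ranked correctly."""
    y_true = np.asarray(y_true)
    y_score = np.asarray(y_score)
    pos = y_score[y_true == 1]
    neg = y_score[y_true == 0]
    return (pos[:, None] > neg[None, :]).sum() / (len(pos) * len(neg))


# Made-up labels and scores: 6 of the 9 positive/negative pairs are ordered correctly.
labels = [0, 0, 1, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8, 0.65, 0.9]

auc = auc_pairwise(labels, scores)
g = gini_normalised(labels, scores)
print(auc, g, 2 * auc - 1)
```

This is why a search scored with `roc_auc` ranks parameter combinations in the same order as one scored with the normalised Gini.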

In [15]:
def timer(start_time=None):
    if not start_time:
        start_time = datetime.now()
        return start_time
    elif start_time:
        thour, temp_sec = divmod((datetime.now() - start_time).total_seconds(), 3600)
        tmin, tsec = divmod(temp_sec, 60)
        print('\n Time taken: %i hours %i minutes and %s seconds' % (thour, tmin, round(tsec, 2)))


@jit
def gini(y_true, y_prob):
    y_true = np.asarray(y_true)
    y_true = y_true[np.argsort(y_prob)]
    ntrue = 0
    gini = 0
    delta = 0
    n = len(y_true)
    for i in range(n - 1, -1, -1):
        y_i = y_true[i]
        ntrue += y_i
        gini += y_i * delta
        delta += 1 - y_i
    gini = 1 - 2 * gini / (ntrue * (n - ntrue))
    return gini


def evalerror(preds, dtrain):
    labels = dtrain.get_label()
    return 'gini', gini(labels, preds), True


def dropmissingcol(pdData):
    vars_to_drop = ['ps_car_03_cat', 'ps_car_05_cat']
    pdData.drop(vars_to_drop, inplace=True, axis=1)
    return pdData


def missingvalues(pdData):
    mean_imp = SimpleImputer(missing_values=-1, strategy='mean')
    mode_imp = SimpleImputer(missing_values=-1, strategy='most_frequent')
    features = ['ps_reg_03', 'ps_car_12', 'ps_car_14', 'ps_car_11']
    for i in features:
        if i == 'ps_car_11':
            pdData[i] = mode_imp.fit_transform(pdData[[i]]).ravel()
        else:
            pdData[i] = mean_imp.fit_transform(pdData[[i]]).ravel()
    return pdData


def encodecat(train, test):
    cat_features = [col for col in train.columns if '_cat' in col]
    for column in cat_features:
        temp = pd.get_dummies(pd.Series(train[column]), prefix=column)
        train = pd.concat([train, temp], axis=1)
        train = train.drop([column], axis=1)

    for column in cat_features:
        temp = pd.get_dummies(pd.Series(test[column]), prefix=column)
        test = pd.concat([test, temp], axis=1)
        test = test.drop([column], axis=1)
    return train, test


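def RescaleData(train, test):
    # fit_transform returns a new array rather than modifying its input,
    # so the result has to be assigned. The scaler is fitted on the training
    # data only and the same transformation is then applied to the test data.
    scaler = StandardScaler()
    train = pd.DataFrame(scaler.fit_transform(train), columns=train.columns, index=train.index)
    test = pd.DataFrame(scaler.transform(test), columns=test.columns, index=test.index)
    return train, test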


def DropCalcCol(train, test):
    col_to_drop = train.columns[train.columns.str.startswith('ps_calc_')]
    train = train.drop(col_to_drop, axis=1)
    test = test.drop(col_to_drop, axis=1)
    return train, test

We begin by loading our data. This is the data in which the missing values have already been imputed by a linear regression model.
We can later set our `impute` boolean to False and compare how effective that model was against a simple mean/mode imputation.
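
As a rough illustration of the simple baseline, the sketch below does by hand what mean and mode imputation amount to when -1 marks a missing value, as in `missingvalues` above. The helper names and the toy columns are assumptions made for illustration; the notebook itself uses scikit-learn's `SimpleImputer`.

```python
import numpy as np

MISSING = -1  # marker for a missing value, as passed to SimpleImputer above


def impute_mean(values, missing=MISSING):
    """Replace missing entries with the mean of the observed entries."""
    values = np.asarray(values, dtype=float)
    observed = values != missing
    filled = values.copy()
    filled[~observed] = values[observed].mean()
    return filled


def impute_mode(values, missing=MISSING):
    """Replace missing entries with the most frequent observed value."""
    values = np.asarray(values)
    observed = values != missing
    uniques, counts = np.unique(values[observed], return_counts=True)
    filled = values.copy()
    filled[~observed] = uniques[np.argmax(counts)]
    return filled


# Toy columns (made up): one continuous, one discrete.
continuous = impute_mean([0.5, -1, 1.5, 1.0])
discrete = impute_mode([2, 3, -1, 3, 1])
print(continuous, discrete)
```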

Applying the functions above gives us an encoded, rescaled dataset with missing values imputed. The targets have been separated out from the features.
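
One thing to watch: `encodecat` encodes train and test separately, so a category that appears in only one of them would produce different columns in each. The sketch below shows one possible way to align them with `DataFrame.reindex`. The helper `align_dummies` and the toy column names `demo_cat` and `demo_num` are hypothetical, and this is not what the notebook currently does.

```python
import pandas as pd


def align_dummies(train, test, cat_columns):
    """One-hot encode both frames, then force test to have the training columns."""
    train_enc = pd.get_dummies(train, columns=cat_columns, dtype=int)
    test_enc = pd.get_dummies(test, columns=cat_columns, dtype=int)
    # Dummies missing from test are added as all-zero columns;
    # dummies for categories never seen in training are dropped.
    test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)
    return train_enc, test_enc


# Toy frames (made up): category 2 never occurs in the test data.
toy_train = pd.DataFrame({"demo_cat": [0, 1, 2], "demo_num": [0.1, 0.2, 0.3]})
toy_test = pd.DataFrame({"demo_cat": [0, 1, 1], "demo_num": [0.4, 0.5, 0.6]})

train_enc, test_enc = align_dummies(toy_train, toy_test, ["demo_cat"])
print(list(test_enc.columns))
```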

In [16]:
Kaggle = False
impute = True
if Kaggle == False:
    if impute == True:
        train = pd.read_csv("imputeTrain.csv")
        test = pd.read_csv("imputetest.csv")
    else:
        train = pd.read_csv("new_train.csv")
        test = pd.read_csv("new_test.csv")
        test = dropmissingcol(test)
        train = dropmissingcol(train)
    target_test = test['target'].values
    test = test.drop(['target'], axis=1)
else:
    if impute == True:
        train = pd.read_csv("imputetrainKag.csv")
        test = pd.read_csv("imputetestKag.csv")
    else:
        train = pd.read_csv("Dataset/train.csv")
        test = pd.read_csv("Dataset/test.csv")
        test = dropmissingcol(test)
        train = dropmissingcol(train)

train = missingvalues(train)
test = missingvalues(test)

y_train = train['target'].values
train_id = train['id'].values
X = train.drop(['target', 'id'], axis=1)

test_id = test['id']
X_test = test.drop(['id'], axis=1)

X, X_test = DropCalcCol(X, X_test)
X, X_test = encodecat(X, X_test)
X = pd.DataFrame(X)
X_test = pd.DataFrame(X_test)
X, X_test = RescaleData(X, X_test)

We are now ready to implement the model.

## XGBoost vs LightGBM

Originally I set out to implement an XGBoost model for this project. The reasoning was that XGBoost models often perform extremely well on tasks similar to this one, as the Kaggle leaderboards show, although they are prone to overfitting.
When implementing a boosting algorithm, a huge part of the success comes from parameter optimisation.
A method we have previously looked at is grid search with cross-validation (GridSearchCV): effectively a brute-force search through a collection of potential parameter combinations, returning the best result. An issue with this method is that the finer the grid you want, the more combinations you have to try.

XGBoost simply fell short on this large dataset, as a grid search was going to take too long to run.
One possibility was to reduce the dataset size for the grid search, say by running it on 20% of the data.
The other was to look into LightGBM, an alternative boosting framework with solid claims of being much more efficient in run time.

When implementing LightGBM the difference was huge. We could keep the full dataset and search through a large number of parameter combinations, which resulted in a large score increase compared with our XGBoost model and its small parameter search grid.

Therefore, for this project, I found LightGBM to be a much better choice of algorithm.

In [14]:
OPTIMIZE_ROUNDS = False
LEARNING_RATE = 0.07
EARLY_STOPPING_ROUNDS = 50

params = {
    'min_child_weight': [5, 10, 12, 15, 30, 50, 100, 150],
    'num_leaves': [4, 5, 8, 10, 15, 20, 30],
    'subsample': [0.2, 0.4, 0.6, 0.8],
    'drop_rate': [0.1, 0.3, 0.5, 0.7, 0.15, 0.2],
    'max_depth': [3, 4, 5, 7, 10, 12, 15, 20]
}

model = lgbm.LGBMClassifier(learning_rate=LEARNING_RATE, n_estimators=600, objective='binary')

folds = 5
param_comb = 40

SKfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1)

random_search = RandomizedSearchCV(model, param_distributions=params, n_iter=param_comb, scoring='roc_auc', n_jobs=4,
                                   cv=SKfold.split(X, y_train), verbose=3, random_state=1)

start_time = timer(None)

random_search.fit(X, y_train)
timer(start_time)

print('\n All results:')
print(random_search.cv_results_)
print('\n Best estimator:')
print(random_search.best_estimator_)
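# The search is scored with 'roc_auc', so best_score_ is an AUC, not a Gini value.
print('\n Best ROC AUC score for %d-fold search with %d parameter combinations (normalised Gini = 2 * AUC - 1):' % (folds, param_comb))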
print(random_search.best_score_)
print('\n Best hyperparameters:')
print(random_search.best_params_)
results = pd.DataFrame(random_search.cv_results_)
results.to_csv('lightgbm-randomgridsearch-results-03.csv')

[Parallel(n_jobs=4)]: Done  24 tasks      | elapsed:  2.8min


KeyboardInterrupt: 

Here we have set up a search over the parameters of interest. For the other parameters, we can either make a good guess from the LightGBM literature online, or they are not important enough to fine-tune in this particular case.

We run the search with a 5-fold StratifiedKFold; `n_iter` sets the number of parameter combinations sampled from the grid, so this is a randomised search over the grid rather than an exhaustive one.
This is still a time-intensive exercise. It runs as 4 parallel jobs, but the LightGBM model has to train on five folds for each combination, at about 20-30 seconds each. For the parameters I use in the final model, we searched 200 combinations, which took around 2 hours. This is overkill.
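
A back-of-the-envelope check of these numbers, using the parameter lists from the cell above. Two assumptions are mine: that 20-30 seconds is the cost of one fit on one fold (25 seconds is used as a midpoint), and that the 4 jobs scale perfectly.

```python
from math import prod

# Parameter lists copied from the search cell above.
params = {
    'min_child_weight': [5, 10, 12, 15, 30, 50, 100, 150],
    'num_leaves': [4, 5, 8, 10, 15, 20, 30],
    'subsample': [0.2, 0.4, 0.6, 0.8],
    'drop_rate': [0.1, 0.3, 0.5, 0.7, 0.15, 0.2],
    'max_depth': [3, 4, 5, 7, 10, 12, 15, 20],
}

# Size of the full grid: the product of the list lengths.
n_grid = prod(len(values) for values in params.values())


def search_hours(n_combinations, folds=5, seconds_per_fit=25, n_jobs=4):
    """Rough wall-clock estimate (assumed: 25 s per fit, perfect scaling over jobs)."""
    return n_combinations * folds * seconds_per_fit / n_jobs / 3600


print(n_grid)                             # combinations in the full grid
print(round(search_hours(40), 2))         # the 40-combination search above
print(round(search_hours(200), 2))        # the 200-combination search
print(round(search_hours(n_grid), 1))     # an exhaustive grid search
```

Under these assumptions the 200-combination search comes out at a little under two hours, in line with the figure above, while an exhaustive search of the same grid would take days.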

In the future, I would like to look into Bayesian optimisation for the parameters. I think this would ease my time problems as well as give a more precise result.

In [17]:
'''
Best hyperparameters from grid search:
{'subsample': 0.2, 'num_leaves': 15, 'min_child_weight': 150, 'max_depth': 3, 'drop_rate': 0.15}
'''


Now that we have found our best parameters, we are ready to train the model and predict on the test set.

We set up a new model with our best parameters, and again use a StratifiedKFold with k=5 for training.
To guard against overfitting, we generate our predictions by averaging over the models trained on the different folds, and then average that whole process over different seeds. Taking only our best folds would overfit here.
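
The averaging scheme can be sketched in isolation as follows. Every model, one per (seed, fold) pair, should end up with the same weight of 1 / (seeds * folds). The `fake_model_prediction` function is a made-up stand-in for training a LightGBM model and predicting on the test set.

```python
import numpy as np


def averaged_prediction(predict_one, n_test, n_seeds, n_folds):
    """Mean test prediction over every (seed, fold) model, all weighted equally."""
    total = np.zeros(n_test)
    for seed in range(n_seeds):
        for fold in range(n_folds):
            total += predict_one(seed, fold)
    return total / (n_seeds * n_folds)


def fake_model_prediction(seed, fold, n_test=4):
    """Stand-in (assumption) for fitting on one fold and predicting on the test set."""
    rng = np.random.default_rng(100 * seed + fold)
    return rng.uniform(0.0, 0.1, size=n_test)


avg = averaged_prediction(fake_model_prediction, n_test=4, n_seeds=3, n_folds=5)

# Cross-check against stacking all 15 predictions and taking a plain mean.
stacked = np.array([fake_model_prediction(s, f) for s in range(3) for f in range(5)])
print(avg, stacked.mean(axis=0))
```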

In [None]:
OPTIMIZE_ROUNDS = False
LEARNING_RATE = 0.07
EARLY_STOPPING_ROUNDS = 50

min_data_in_leaf = 2000
feature_fraction = 0.6
num_boost_round = 10000
params = {"objective": "binary",
          "boosting_type": "gbdt",
          "learning_rate": LEARNING_RATE,
          "max_bin": 256,
          "n_estimators": 600,
          "verbosity": -1,
          "feature_fraction": feature_fraction,
          "is_unbalance": False,
          "max_drop": 50,
          "min_child_samples": 10,
          "min_split_gain": 0,
          'subsample': 0.2,
          'num_leaves': 15,
          'min_child_weight': 150,
          'max_depth': 3,
          'drop_rate': 0.15
          }

folds = 5

SKfold = StratifiedKFold(n_splits=folds, shuffle=True, random_state=1)

best_trees = []
fold_scores = []

cv_train = np.zeros(len(y_train))
cv_pred = np.zeros(len(X_test))

start_time = timer(None)
iterations = 3
for seed in range(iterations):
    timer(start_time)
    params['seed'] = seed
    for id_train, id_test in SKfold.split(X, y_train):
        # split() yields positional indices, so select rows by position
        xtr, xvl = X.iloc[id_train], X.iloc[id_test]
        ytr, yvl = y_train[id_train], y_train[id_test]
        dtrain = lgbm.Dataset(data=xtr, label=ytr)
        dval = lgbm.Dataset(data=xvl, label=yvl, reference=dtrain)
        bst = lgbm.train(params, dtrain, num_boost_round, valid_sets=dval, feval=evalerror, verbose_eval=100,
                         early_stopping_rounds=100)

        best_trees.append(bst.best_iteration)
        fold_scores.append(bst.best_score)

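        # Add each fold's share of the average. Dividing the running total by
        # `folds` once per seed would shrink earlier seeds' predictions again
        # on every pass; the final division by `iterations` averages over seeds.
        cv_pred += bst.predict(X_test, num_iteration=bst.best_iteration) / folds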

pd.DataFrame({'id': test_id, 'target': cv_pred / iterations}).to_csv('lgbm_pred5-with-encodingscaling.csv', index=False)

if Kaggle == False:
    test_score = gini(target_test, cv_pred / iterations)
    print("Score on the test data")
    print("Gini")
    print(test_score)


 Time taken: 0 hours 0 minutes and 0.0 seconds




Training until validation scores don't improve for 100 rounds
[100]	valid_0's binary_logloss: 0.151579	valid_0's gini: 0.275997
[200]	valid_0's binary_logloss: 0.1513	valid_0's gini: 0.282682
[300]	valid_0's binary_logloss: 0.151211	valid_0's gini: 0.284987
[400]	valid_0's binary_logloss: 0.151194	valid_0's gini: 0.28541
[500]	valid_0's binary_logloss: 0.151186	valid_0's gini: 0.285667
Early stopping, best iteration is:
[464]	valid_0's binary_logloss: 0.151178	valid_0's gini: 0.286068




Training until validation scores don't improve for 100 rounds
[100]	valid_0's binary_logloss: 0.151305	valid_0's gini: 0.285804
[200]	valid_0's binary_logloss: 0.150978	valid_0's gini: 0.293999
[300]	valid_0's binary_logloss: 0.150864	valid_0's gini: 0.297001
[400]	valid_0's binary_logloss: 0.150837	valid_0's gini: 0.297286
Early stopping, best iteration is:
[370]	valid_0's binary_logloss: 0.150827	valid_0's gini: 0.29771
Training until validation scores don't improve for 100 rounds
[100]	valid_0's binary_logloss: 0.152177	valid_0's gini: 0.25852
[200]	valid_0's binary_logloss: 0.151943	valid_0's gini: 0.265777
[300]	valid_0's binary_logloss: 0.151851	valid_0's gini: 0.269261
[400]	valid_0's binary_logloss: 0.151805	valid_0's gini: 0.270663
[500]	valid_0's binary_logloss: 0.151779	valid_0's gini: 0.271642
[600]	valid_0's binary_logloss: 0.151775	valid_0's gini: 0.271986
Did not meet early stopping. Best iteration is:
[550]	valid_0's binary_logloss: 0.151766	valid_0's gini: 0.272237
Tra



Training until validation scores don't improve for 100 rounds
[100]	valid_0's binary_logloss: 0.151625	valid_0's gini: 0.275666
[200]	valid_0's binary_logloss: 0.151359	valid_0's gini: 0.282139
[300]	valid_0's binary_logloss: 0.151257	valid_0's gini: 0.284361
[400]	valid_0's binary_logloss: 0.151215	valid_0's gini: 0.285315
[500]	valid_0's binary_logloss: 0.151203	valid_0's gini: 0.285903
Early stopping, best iteration is:
[481]	valid_0's binary_logloss: 0.151191	valid_0's gini: 0.286294
Training until validation scores don't improve for 100 rounds
[100]	valid_0's binary_logloss: 0.151301	valid_0's gini: 0.286941
[200]	valid_0's binary_logloss: 0.150991	valid_0's gini: 0.294536
[300]	valid_0's binary_logloss: 0.15088	valid_0's gini: 0.297241
[400]	valid_0's binary_logloss: 0.150839	valid_0's gini: 0.298296
[500]	valid_0's binary_logloss: 0.15084	valid_0's gini: 0.298272
[600]	valid_0's binary_logloss: 0.150812	valid_0's gini: 0.298999
Did not meet early stopping. Best iteration is:
[60

The process can be sped up by choosing fewer folds or fewer seed iterations. These choices were not made since overfitting was a risk.