pyLightGBM
=======

Python wrapper for Microsoft [LightGBM](https://github.com/Microsoft/LightGBM)  
Make sure LightGBM is installed first; see the [Installation Guide](https://github.com/Microsoft/LightGBM/wiki/Installation-Guide).

**GitHub: [https://github.com/ArdalanM/pyLightGBM](https://github.com/ArdalanM/pyLightGBM)**

**Author of this notebook :** Evgeny BAZAROV <baz.evgenii@gmail.com>

-------


In [1]:
%matplotlib inline
import matplotlib.pyplot as plt

import os, gc
import numpy as np
import pandas as pd

from sklearn import metrics, model_selection
from sklearn.preprocessing import LabelEncoder

from pylightgbm.models import GBMRegressor

DATA
-------

This example uses **data from the Kaggle competition Allstate Claims Severity**.  
You can download the data here: https://www.kaggle.com/c/allstate-claims-severity/data

In [2]:
df_train = pd.read_csv("data/train.csv.zip")
print('Train data shape', df_train.shape)

df_test = pd.read_csv("data/test.csv.zip")
print('Test data shape', df_test.shape)

Train data shape (188318, 132)
Test data shape (125546, 131)


Extracting `loss` from train and `id` from test

In [3]:
# log(loss + 1); .values and float replace the removed .as_matrix() and np.float
y = np.log(df_train['loss'] + 1).values.astype(float)
id_test = np.array(df_test['id'])
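
The target is modeled as `log(loss + 1)`, a common way to reduce the influence of very large claim amounts, so predictions have to be mapped back with `exp(pred) - 1`. A minimal sketch of that round trip using only the standard library (the sample values are made up for illustration and are not taken from the dataset):

```python
import math

# Hypothetical loss values, for illustration only.
losses = [0.0, 120.5, 2213.18, 45000.0]

# Forward transform: log(loss + 1), computed with log1p.
log_losses = [math.log1p(v) for v in losses]

# Inverse transform: exp(y) - 1, computed with expm1.
recovered = [math.expm1(v) for v in log_losses]
```

`log1p` and `expm1` compute the same quantities as `log(x + 1)` and `exp(x) - 1` but stay accurate for values close to zero.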

Merging train and test data

In [4]:
# pd.concat replaces DataFrame.append, which was removed in pandas 2.0
df = pd.concat([df_train, df_test], ignore_index=True)
del df_test, df_train
gc.collect()

print('Merged data shape', df.shape)

Merged data shape (313864, 132)


Dropping columns that are not used as features

In [5]:
df.drop(labels=['loss', 'id'], axis=1, inplace=True)
feature_list = df.columns.tolist()

Transform the categorical features (`cat1` to `cat116`) into integer labels

In [6]:
le = LabelEncoder()

for col in df.columns.tolist():
    if 'cat' in col:
        df[col] = le.fit_transform(df[col])
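
`LabelEncoder` maps each distinct value of a column to an integer code, with codes assigned in sorted order of the values. Because the encoder is refit on each column of the merged frame, train and test rows share one mapping per column. A pure-Python sketch of the same idea (the helper `encode_labels` and the sample column are hypothetical, not part of scikit-learn):

```python
def encode_labels(values):
    """Map each distinct value to its index in the sorted list of distinct values."""
    classes = sorted(set(values))
    mapping = {label: code for code, label in enumerate(classes)}
    return [mapping[v] for v in values], classes


codes, classes = encode_labels(["B", "A", "C", "A"])
print(codes)    # [1, 0, 2, 0]
print(classes)  # ['A', 'B', 'C']
```

The resulting integers carry an arbitrary ordering, which tree-based models can usually split around, but the codes themselves have no numeric meaning.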

TRAIN, VALIDATION, TEST
-------
Split data into train, validation (for early stopping) and test set

In [7]:
print('train-test split')
df_train, df_test = df.iloc[:len(y)], df.iloc[len(y):]
del df, df_test
gc.collect()

print('train transform\n')
X = df_train.values  # .as_matrix() was removed in pandas 1.0

del df_train
gc.collect()

print('Train shape', X.shape)

train-test split
train transform

Train shape (188318, 130)


Bayesian Optimization of GBMRegressor params
-------
For more information about Bayesian Optimization please visit this github page :  
https://github.com/fmfn/BayesianOptimization

All credit goes to its author, Fernando M. F. Nogueira.

In [8]:
from sklearn.model_selection import cross_val_score
from sklearn.metrics import make_scorer, mean_absolute_error
from bayes_opt import BayesianOptimization

In [9]:
def mae(y, y_pred):
    return mean_absolute_error((np.exp(y_pred)-1), (np.exp(y)-1))

def gbmr_eval(num_leaves, min_data_in_leaf, feature_fraction, bagging_fraction, seed=42):
    gbmr = GBMRegressor(
        exec_path='/path/to/your/LightGBM/lightgbm',  # Change this to your LightGBM path
        config='',
        application='regression',
        num_iterations=500,
        learning_rate=0.1,
        num_leaves=int(num_leaves),
        tree_learner='serial',
        num_threads=4,
        min_data_in_leaf=int(min_data_in_leaf),
        metric='l2',
        feature_fraction=max(feature_fraction,0),
        feature_fraction_seed=seed,
        bagging_fraction=max(bagging_fraction,0),
        bagging_freq=10,
        bagging_seed=seed,
        metric_freq=1,
        verbose=False
    )
    
    score =  cross_val_score(gbmr, X=X, y=y, scoring=make_scorer(score_func=mae, greater_is_better=False), cv=5, verbose=0, pre_dispatch=1)
    return np.array(score).mean()
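
`make_scorer(..., greater_is_better=False)` negates the metric, so `gbmr_eval` returns the negative of the mean per-fold MAE. Maximizing that value is therefore the same as minimizing MAE in original loss units. A small standard-library sketch of the metric and the sign flip (the name `mae_original_scale` and the numbers are illustrative assumptions):

```python
import math


def mae_original_scale(y_log, y_pred_log):
    """MAE after mapping log-scale values back with exp(v) - 1."""
    errors = [
        abs(math.expm1(p) - math.expm1(t))
        for t, p in zip(y_log, y_pred_log)
    ]
    return sum(errors) / len(errors)


# Hypothetical targets and predictions on the log(loss + 1) scale.
y_true = [math.log1p(100.0), math.log1p(2000.0)]
y_pred = [math.log1p(150.0), math.log1p(1900.0)]

error = mae_original_scale(y_true, y_pred)  # about 75.0: (50 + 100) / 2
score = -error  # the sign a greater_is_better=False scorer reports
```

This is why the optimizer's reported values are negative: a value closer to zero means a lower MAE.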

In [10]:
num_iter = 40
init_points = 15

gbmrBO = BayesianOptimization(gbmr_eval, 
                              {
                                'num_leaves': (15, 500),
                                'min_data_in_leaf': (15, 200),
                                'feature_fraction': (0.3,1),
                                'bagging_fraction': (0.3,1),
                              }
                             )

gbmrBO.maximize(init_points=init_points, n_iter=num_iter)

print('Final Results')
print('LightGBM: %f' % gbmrBO.res['max']['max_val'])

Initialization
-----------------------------------------------------------------------------------------------------------
 Step |   Time |      Value |   bagging_fraction |   feature_fraction |   min_data_in_leaf |   num_leaves |
    1 | 04m26s | -1164.38437 |             0.6703 |             0.4968 |            40.0666 |     177.5792 |
    2 | 07m09s | -1250.96450 |             0.3270 |             0.8248 |            21.8809 |     473.7783 |
    3 | 03m39s | -1144.07996 |             0.8711 |             0.5284 |           127.2338 |      22.1844 |
    4 | 09m08s | -1179.26832 |             0.9244 |             0.9143 |           178.1341 |     374.0264 |
    5 | 05m18s | -1212.30360 |             0.3862 |             0.5212 |            91.0555 |     475.1453 |
    6 | 03m19s | -1145.65928 |             0.5126 |             0.3603 |           163.1402 |      29.9441 |


**When the run finishes, we can inspect the best parameters found by Bayesian Optimization:**

In [12]:
gbmrBO.res['max']['max_params']

{'bagging_fraction': 1.0,
 'feature_fraction': 0.29999999999999999,
 'min_data_in_leaf': 169.3442542624897,
 'num_leaves': 52.201833585663096}
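
The optimizer searches a continuous space, so `num_leaves` and `min_data_in_leaf` come back as floats. `gbmr_eval` truncates them with `int()` before building the model, and the same conversion is needed when reusing the best parameters. A sketch, assuming the dictionary printed above (the helper name `to_model_params` is hypothetical):

```python
def to_model_params(best):
    """Cast integer-valued hyperparameters the same way gbmr_eval does."""
    params = dict(best)
    for name in ("num_leaves", "min_data_in_leaf"):
        params[name] = int(params[name])  # int() truncates toward zero
    return params


best = {
    "bagging_fraction": 1.0,
    "feature_fraction": 0.3,
    "min_data_in_leaf": 169.3442542624897,
    "num_leaves": 52.201833585663096,
}
model_params = to_model_params(best)
print(model_params["num_leaves"], model_params["min_data_in_leaf"])  # 52 169
```

The fractional parameters are left untouched, since `feature_fraction` and `bagging_fraction` are passed to the model as floats.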