**Importing packages**

In [1]:
import lightgbm as lgb
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn import metrics

**Loading the data for classification**

In [2]:
import seaborn as sns

df_iris = sns.load_dataset('iris')

df_iris.head()

,sepal_length,sepal_width,petal_length,petal_width,species
0,5.1,3.5,1.4,0.2,setosa
1,4.9,3.0,1.4,0.2,setosa
2,4.7,3.2,1.3,0.2,setosa
3,4.6,3.1,1.5,0.2,setosa
4,5.0,3.6,1.4,0.2,setosa


**Splitting the data into features and target**

In [3]:
x_iris = df_iris.drop(['species'], axis=1)
y_iris = df_iris['species']

In [4]:
x_iris_train, x_iris_test, y_iris_train, y_iris_test = train_test_split(x_iris, y_iris, random_state=42)  # Hold out a test set (default test_size=0.25)
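
As a quick sanity check on the split above: `train_test_split` is called without `test_size`, so scikit-learn falls back to its default of 0.25 and rounds the test set up. The sketch below reproduces the resulting sizes by hand, assuming the standard 150-row iris dataset; `split_sizes` is a hypothetical helper written for this illustration, not a scikit-learn function.

```python
import math

def split_sizes(n_samples, test_size=0.25):
    """Hypothetical helper: train/test sizes for a fractional test_size (test rounded up)."""
    n_test = math.ceil(test_size * n_samples)
    n_train = n_samples - n_test
    return n_train, n_test

n_train, n_test = split_sizes(150)
print(n_train, n_test)  # 112 38
```

These sizes agree with the 112 training rows LightGBM reports when the final classifier is fitted and with the 38 test labels compared later in the notebook.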

**Creating the object with the LightGBM classifier**

In [5]:
classificador_lgbm = lgb.LGBMClassifier()

type(classificador_lgbm)

**scikit-learn compatibility**

In [6]:
from sklearn.model_selection import cross_val_score

cross_val_score(classificador_lgbm, x_iris_train, y_iris_train).mean()  # Mean accuracy over the default 5 folds

[LightGBM] [Info] Auto-choosing row-wise multi-threading, the overhead of testing was 0.002682 seconds.
You can set `force_row_wise=true` to remove the overhead.
And if memory is not enough, you can set `force_col_wise=true`.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000031 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 79
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi

0.9284584980237154
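
The `Start training from score` lines in the log above look like the natural logarithm of each class's share of the training fold, which is what a multiclass model starting from the class priors would produce. The check below is a sketch under that assumption: the per-class counts (28, 31 and 30 out of 89) are inferred from the logged values, not printed by LightGBM.

```python
import math

# Assumed per-class counts in the first 89-row training fold (inferred, not logged)
fold_counts = [28, 31, 30]
n_fold = sum(fold_counts)  # 89, matching "Number of data points in the train set"

# Log of each class proportion, rounded to the six decimals shown in the log
init_scores = [round(math.log(count / n_fold), 6) for count in fold_counts]
print(init_scores)  # [-1.156432, -1.054649, -1.087439]
```

The three values match the logged scores to six decimals, so the lines are informational and do not indicate a problem with the data.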

**A little tuning**

In [7]:
classificador_lgbm_tunado = lgb.LGBMClassifier(max_depth=2)

cross_val_score(classificador_lgbm_tunado, x_iris_train, y_iris_train).mean()  # Mean cross-validated accuracy with shallower trees

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000036 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000021 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 79
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000031 seconds.
You ca

0.9371541501976285

In [8]:
classificador_lgbm_rf = lgb.LGBMClassifier(boosting_type='rf', bagging_freq=1, bagging_fraction=0.8)

cross_val_score(classificador_lgbm_rf, x_iris_train, y_iris_train).mean()  # Mean cross-validated accuracy in random-forest mode

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000026 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000032 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 79
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000031 seconds.
You ca

0.9280632411067193

**Boosting types**

In [9]:
classificador_lgbm_dart = lgb.LGBMClassifier(boosting_type='dart')

cross_val_score(classificador_lgbm_dart, x_iris_train, y_iris_train).mean()  # Mean cross-validated accuracy with DART boosting

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000031 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 77
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000034 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 79
[LightGBM] [Info] Number of data points in the train set: 89, number of used features: 4
[LightGBM] [Info] Start training from score -1.156432
[LightGBM] [Info] Start training from score -1.054649
[LightGBM] [Info] Start training from score -1.087439
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000032 seconds.
You ca

0.9462450592885375
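
To make the model choice in the next cell explicit, the four mean accuracies printed so far can be ranked side by side. This is a minimal sketch: the numbers are copied from the cell outputs above, and the dictionary labels are made up for this illustration (`gbdt` is LightGBM's default `boosting_type`).

```python
# Mean cross-validated accuracies copied from the cell outputs above
cv_accuracy = {
    "default (gbdt)": 0.9284584980237154,
    "max_depth=2": 0.9371541501976285,
    "boosting_type='rf'": 0.9280632411067193,
    "boosting_type='dart'": 0.9462450592885375,
}

# Higher accuracy is better, so the best candidate is the maximum
best = max(cv_accuracy, key=cv_accuracy.get)

for name, score in sorted(cv_accuracy.items(), key=lambda kv: kv[1], reverse=True):
    print(f"{name:22s} {score:.4f}")
print("best:", best)
```

The gaps between candidates are under two percentage points on a training set of 112 rows, so the ranking should be read as a rough guide rather than a firm result.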

In [10]:
classificador_final = classificador_lgbm_dart

In [11]:
classificador_final.fit(x_iris_train, y_iris_train)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000061 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 90
[LightGBM] [Info] Number of data points in the train set: 112, number of used features: 4
[LightGBM] [Info] Start training from score -1.163151
[LightGBM] [Info] Start training from score -1.054937
[LightGBM] [Info] Start training from score -1.080913


In [12]:
predicoes_iris = classificador_final.predict(x_iris_test)  # Predict labels for the held-out test set

predicoes_iris[:10]

array(['versicolor', 'setosa', 'virginica', 'versicolor', 'versicolor',
       'setosa', 'versicolor', 'virginica', 'versicolor', 'versicolor'],
      dtype=object)

In [13]:
y_iris_test

73     versicolor
18         setosa
118     virginica
78     versicolor
76     versicolor
31         setosa
64     versicolor
141     virginica
68     versicolor
82     versicolor
110     virginica
12         setosa
36         setosa
9          setosa
19         setosa
56     versicolor
104     virginica
69     versicolor
55     versicolor
132     virginica
29         setosa
127     virginica
26         setosa
128     virginica
131     virginica
145     virginica
108     virginica
143     virginica
45         setosa
30         setosa
22         setosa
15         setosa
65     versicolor
11         setosa
42         setosa
146     virginica
51     versicolor
27         setosa
Name: species, dtype: object

In [14]:
(predicoes_iris == y_iris_test).sum()

38

In [15]:
len(y_iris_test)

38

In [16]:
acertos = (predicoes_iris == y_iris_test).sum()
total = len(y_iris_test)

acuracia = 100 * acertos / total

acuracia  # Test-set accuracy as a percentage

100.0

**Now for regression**

In [17]:
df_mpg = sns.load_dataset('mpg')

df_mpg.head()

,mpg,cylinders,displacement,horsepower,weight,acceleration,model_year,origin,name
0,18.0,8,307.0,130.0,3504,12.0,70,usa,chevrolet chevelle malibu
1,15.0,8,350.0,165.0,3693,11.5,70,usa,buick skylark 320
2,18.0,8,318.0,150.0,3436,11.0,70,usa,plymouth satellite
3,16.0,8,304.0,150.0,3433,12.0,70,usa,amc rebel sst
4,17.0,8,302.0,140.0,3449,10.5,70,usa,ford torino


In [18]:
x_mpg = df_mpg.drop(['mpg', 'origin', 'name'], axis=1)
y_mpg = df_mpg['mpg']

In [19]:
x_mpg_train, x_mpg_test, y_mpg_train, y_mpg_test = train_test_split(x_mpg, y_mpg, random_state=42)  # Hold out a test set (default test_size=0.25)

In [20]:
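from sklearn import metrics

# metrics.SCORERS was removed in scikit-learn 1.3; on newer versions use metrics.get_scorer_names()
metrics.SCORERS.keys()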

dict_keys(['explained_variance', 'r2', 'max_error', 'matthews_corrcoef', 'neg_median_absolute_error', 'neg_mean_absolute_error', 'neg_mean_absolute_percentage_error', 'neg_mean_squared_error', 'neg_mean_squared_log_error', 'neg_root_mean_squared_error', 'neg_mean_poisson_deviance', 'neg_mean_gamma_deviance', 'accuracy', 'top_k_accuracy', 'roc_auc', 'roc_auc_ovr', 'roc_auc_ovo', 'roc_auc_ovr_weighted', 'roc_auc_ovo_weighted', 'balanced_accuracy', 'average_precision', 'neg_log_loss', 'neg_brier_score', 'positive_likelihood_ratio', 'neg_negative_likelihood_ratio', 'adjusted_rand_score', 'rand_score', 'homogeneity_score', 'completeness_score', 'v_measure_score', 'mutual_info_score', 'adjusted_mutual_info_score', 'normalized_mutual_info_score', 'fowlkes_mallows_score', 'precision', 'precision_macro', 'precision_micro', 'precision_samples', 'precision_weighted', 'recall', 'recall_macro', 'recall_micro', 'recall_samples', 'recall_weighted', 'f1', 'f1_macro', 'f1_micro', 'f1_samples', 'f1_weig

In [21]:
regressor_lgbm = lgb.LGBMRegressor()

cross_val_score(regressor_lgbm, x_mpg_train, y_mpg_train, scoring='neg_root_mean_squared_error').mean()  # Mean negated RMSE over 5 folds (closer to 0 is better)

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000066 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 227
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.352101
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000037 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 229
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.576050
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000046 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 235
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from

-3.0602968789895
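
The log above reports 238 rows per training fold, which follows from splitting the 298-row training set into 5 folds. The sketch below redoes that arithmetic, assuming the usual K-fold convention that the first `n_samples % n_splits` folds receive one extra row; `kfold_train_sizes` is a hypothetical helper, not part of scikit-learn.

```python
def kfold_train_sizes(n_samples, n_splits=5):
    """Hypothetical helper: training rows per fold for an unshuffled K-fold split."""
    base, extra = divmod(n_samples, n_splits)
    # The first `extra` validation folds are one row larger than the rest
    test_sizes = [base + 1 if i < extra else base for i in range(n_splits)]
    return [n_samples - size for size in test_sizes]

train_sizes = kfold_train_sizes(298)
print(train_sizes)  # [238, 238, 238, 239, 239]
```

The first three folds train on 238 rows, matching the three complete log entries shown before the output is cut off.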

In [22]:
regressor_lgbm_tunado = lgb.LGBMRegressor(max_depth=2)

cross_val_score(regressor_lgbm_tunado, x_mpg_train, y_mpg_train, scoring='neg_root_mean_squared_error').mean()  # Mean negated RMSE with shallower trees

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000065 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 227
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.352101
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000040 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 229
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.576050
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000040 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 235
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from

-3.085218235455716

In [23]:
regressor_lgbm_dart = lgb.LGBMRegressor(boosting_type='dart')

cross_val_score(regressor_lgbm_dart, x_mpg_train, y_mpg_train, scoring='neg_root_mean_squared_error').mean()  # Mean negated RMSE with DART boosting

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000053 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 227
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.352101
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000042 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 229
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from score 23.576050
[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000033 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 235
[LightGBM] [Info] Number of data points in the train set: 238, number of used features: 6
[LightGBM] [Info] Start training from

-3.8958190536698334
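
The `neg_root_mean_squared_error` scorer returns the RMSE with its sign flipped, so that "higher is better" holds for every scikit-learn scorer. The sketch below turns the three scores printed above back into plain RMSE values and picks the best candidate; the dictionary labels are made up for this illustration.

```python
# Mean negated RMSE values copied from the cell outputs above
neg_rmse = {
    "default": -3.0602968789895,
    "max_depth=2": -3.085218235455716,
    "boosting_type='dart'": -3.8958190536698334,
}

# Flip the sign to recover RMSE in the target's units (miles per gallon)
rmse_by_model = {name: -score for name, score in neg_rmse.items()}

# The largest negated score corresponds to the smallest RMSE
best = max(neg_rmse, key=neg_rmse.get)
print(best, round(rmse_by_model[best], 3))  # default 3.06
```

Here the untuned regressor has the lowest cross-validated error, which is why it is the one kept as the final model.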

In [24]:
regressor_final = regressor_lgbm

In [25]:
# eval_metric is only evaluated when an eval_set is also passed to fit()
regressor_final.fit(x_mpg_train, y_mpg_train, eval_metric='root_mean_squared_error')

[LightGBM] [Info] Auto-choosing col-wise multi-threading, the overhead of testing was 0.000057 seconds.
You can set `force_col_wise=true` to remove the overhead.
[LightGBM] [Info] Total Bins 268
[LightGBM] [Info] Number of data points in the train set: 298, number of used features: 6
[LightGBM] [Info] Start training from score 23.526846


In [27]:
predicoes_mpg = regressor_final.predict(x_mpg_test)  # Predict mpg for the held-out test set

In [28]:
y_mpg_test[:10]

198    33.0
396    28.0
33     19.0
208    13.0
93     14.0
84     27.0
373    24.0
94     13.0
222    17.0
126    21.0
Name: mpg, dtype: float64

In [31]:
from sklearn.metrics import mean_squared_error
import math

mse = mean_squared_error(y_mpg_test, predicoes_mpg)

display(mse)

rmse = math.sqrt(mse)

rmse

6.383742648831409

2.526606943873821
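
To put the two numbers above in context: the MSE is in squared miles per gallon, while its square root, the RMSE, is in the same units as `mpg`. The sketch below reuses the values printed in this notebook and compares the test RMSE with the cross-validated estimate from the model-selection step; no new results are computed.

```python
import math

test_mse = 6.383742648831409  # test-set MSE printed above
cv_rmse = 3.0602968789895     # cross-validated RMSE of the default regressor (sign flipped)

test_rmse = math.sqrt(test_mse)
print(round(test_rmse, 4))                # 2.5266
print(round(cv_rmse - test_rmse, 2))      # 0.53
```

The test error comes out about half a mile per gallon lower than the cross-validated estimate. With a test set this small, a gap of that size may be due to the particular split as much as to the model.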