## Bayesian methods of hyperparameter optimization

In addition to random search and grid search, we can use Bayesian methods to select the optimal hyperparameters for an algorithm.

In this case study, we will use the BayesianOptimization library to perform hyperparameter tuning. The library is well documented; you can find the documentation here: https://github.com/fmfn/BayesianOptimization

You will need to install the Bayesian optimization module. Running a cell with an exclamation point in the beginning of the command will run it as a shell command — please do this to install this module from our notebook in the cell below.

In [3]:
! pip install bayesian-optimization lightgbm catboost

Collecting bayesian-optimization
  Downloading bayesian_optimization-2.0.3-py3-none-any.whl.metadata (9.0 kB)
Collecting lightgbm
  Downloading lightgbm-4.6.0-py3-none-macosx_12_0_arm64.whl.metadata (17 kB)
Collecting catboost
  Downloading catboost-1.2.7-cp312-cp312-macosx_11_0_universal2.whl.metadata (1.2 kB)
Collecting graphviz (from catboost)
  Downloading graphviz-0.20.3-py3-none-any.whl.metadata (12 kB)
Downloading bayesian_optimization-2.0.3-py3-none-any.whl (31 kB)
Downloading lightgbm-4.6.0-py3-none-macosx_12_0_arm64.whl (1.6 MB)
Downloading catboost-1.2.7-cp312-cp312-macosx_11_0_universal2.whl (27.0 MB)
Downloading graphviz-0.20.3-py3-none-any.whl (47 kB)
Installing collected packages: graphviz, lightgbm, catboost,

In [42]:
import warnings
warnings.filterwarnings('ignore')
from sklearn.preprocessing import LabelEncoder
import numpy as np
import pandas as pd
import lightgbm
from bayes_opt import BayesianOptimization
from catboost import CatBoostClassifier, cv, Pool

In [44]:
import os
os.listdir()

['Bayesian_optimization_case_study.ipynb', '.ipynb_checkpoints', 'data']

## How does Bayesian optimization work?

Bayesian optimization works by constructing a posterior distribution of functions (Gaussian process) that best describes the function you want to optimize. As the number of observations grows, the posterior distribution improves, and the algorithm becomes more certain of which regions in parameter space are worth exploring and which are not, as seen in the picture below.

<img src="https://github.com/fmfn/BayesianOptimization/blob/master/examples/bo_example.png?raw=true" />
As you iterate over and over, the algorithm balances its needs of exploration and exploitation while taking into account what it knows about the target function. At each step, a Gaussian process is fitted to the known samples (points previously explored), and the posterior distribution, combined with an exploration strategy such as UCB (Upper Confidence Bound) or EI (Expected Improvement), is used to determine the next point that should be explored (see the gif below).
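
To make the exploration/exploitation trade-off concrete, here is a minimal sketch of the Upper Confidence Bound rule in plain Python. The candidate points and their posterior means and standard deviations are made-up numbers for illustration, and `kappa` is an assumed name for the exploration weight. This is not the library's internal code.

```python
def ucb(mean, std, kappa):
    """Upper Confidence Bound: predicted value plus a bonus for uncertainty."""
    return mean + kappa * std


def pick_next(candidates, kappa):
    """Return the x whose UCB score is highest.

    candidates: list of (x, posterior_mean, posterior_std) tuples.
    """
    return max(candidates, key=lambda c: ucb(c[1], c[2], kappa))[0]


# Hypothetical posterior summaries at three candidate points
candidates = [
    (0.0, 1.0, 0.05),  # well explored, high predicted value
    (1.0, 0.8, 0.50),  # poorly explored, slightly lower predicted value
    (2.0, 0.2, 0.10),  # well explored, low predicted value
]

next_exploit = pick_next(candidates, kappa=0.0)    # ignores uncertainty
next_explore = pick_next(candidates, kappa=2.576)  # rewards uncertainty
print(next_exploit, next_explore)
```

With `kappa=0.0` the rule picks the point with the best predicted value; with a larger `kappa` it prefers the uncertain point, which is the exploration side of the trade-off.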
<img src="https://github.com/fmfn/BayesianOptimization/raw/master/examples/bayesian_optimization.gif" />

## Let's look at a simple example

The first step is to create an optimizer. It uses two items:
* function to optimize
* bounds of parameters

The function is the procedure that computes a quality metric for our model. The important thing is that the optimizer maximizes the value of this function. If smaller values of your metric are better (for example, a loss or an error), return the negative of the metric so that maximizing it minimizes the original value.
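
As a small illustration of the sign flip, the sketch below scores a one-parameter model by mean squared error and returns its negative, so that a maximizer prefers the smallest error. The data, the `slope` parameter, and the helper names are hypothetical.

```python
def mse(y_true, y_pred):
    """Mean squared error: smaller is better."""
    return sum((t - p) ** 2 for t, p in zip(y_true, y_pred)) / len(y_true)


# Toy data generated by y = 2 * x
xs = [1.0, 2.0, 3.0]
ys = [2.0, 4.0, 6.0]


def objective(slope):
    """Return the negative MSE so that a larger value means a better fit."""
    preds = [slope * x for x in xs]
    return -mse(ys, preds)


# Stand-in for the optimizer: choose the candidate with the largest objective
slope_candidates = [1.0, 1.5, 2.0, 2.5]
best_slope = max(slope_candidates, key=objective)
print(best_slope)
```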

Here we define our simple function we want to optimize.

In [48]:
def simple_func(a, b):
    return a + b

Now, we define our bounds of the parameters to optimize, within the Bayesian optimizer.

In [66]:
optimizer = BayesianOptimization(
    f=simple_func,
    pbounds={'a': (1, 3), 'b': (4, 7)},
    random_state=42,  # Ensures repeatability
)

These are the main parameters of the optimizer's `maximize` method:

* **n_iter:** This is how many steps of Bayesian optimization you want to perform. The more steps, the more likely you are to find a good maximum.

* **init_points:** This is how many steps of random exploration you want to perform. Random exploration can help by diversifying the exploration space.

Let's run the optimizer to find the values of a and b that maximize the target, using 3 steps of random exploration (init_points) followed by 2 steps of Bayesian optimization (n_iter).

In [68]:
optimizer.maximize(init_points=3, n_iter=2)

|   iter    |  target   |     a     |     b     |
-------------------------------------------------
| 1         | 8.601     | 1.749     | 6.852     |
| 2         | 8.26      | 2.464     | 5.796     |
| 3         | 5.78      | 1.312     | 4.468     |
| 4         | 9.802     | 2.822     | 6.979     |
| 5         | 9.603     | 2.996     | 6.607     |


Great, now let's print the best parameters and the associated maximized target.

In [70]:
print(optimizer.max['params']);optimizer.max['target']

{'a': 2.82203139995844, 'b': 6.979491216800435}


9.801522616758875

## Test it on real data using LightGBM

The dataset we will be working with is the famous flight departures dataset. Our modeling goal will be to predict if a flight departure is going to be delayed by 15 minutes based on the other attributes in our dataset. As part of this modeling exercise, we will use Bayesian hyperparameter optimization to identify the best parameters for our model.

**<font color='teal'> You can load the zipped csv files just as you would regular csv files using Pandas read_csv. In the next cell, load the train and test data into two separate dataframes. </font>**


In [120]:
train_df = pd.read_csv('/Users/shivanginimarjiwe/Desktop/repo/DataScienceGuidedCapstone/18.2.6 - Bayesian Optimization/data/flight_delays_train.csv')

test_df = pd.read_csv('/Users/shivanginimarjiwe/Desktop/repo/DataScienceGuidedCapstone/18.2.6 - Bayesian Optimization/data/flight_delays_test.csv')

**<font color='teal'> Print the top five rows of the train dataframe and review the columns in the data. </font>**

In [122]:
train_df.head()

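|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| 0 | c-8 | c-21 | c-7 | 1934 | AA | ATL | DFW | 732 | N |
| 1 | c-4 | c-20 | c-3 | 1548 | US | PIT | MCO | 834 | N |
| 2 | c-9 | c-2 | c-5 | 1422 | XE | RDU | CLE | 416 | N |
| 3 | c-11 | c-25 | c-6 | 1015 | OO | DEN | MEM | 872 | N |
| 4 | c-10 | c-7 | c-6 | 1828 | WN | MDW | OMA | 423 | Y |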


In [None]:
Observation:

The categorical features Month, DayofMonth, and DayOfWeek are stored as prefixed strings such as c-8 and c-21 rather than as numbers, so they need to be converted before modeling.

Fixing the encoding:
1. Convert these values into numerical labels.
2. Reprocess these columns before moving on to Bayesian optimization with LightGBM.


In [126]:
# Use category encoding with mapping for unseen labels
for col in ['Month', 'DayofMonth', 'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest']:
    le = LabelEncoder()
    train_df[col] = le.fit_transform(train_df[col])
    
    # Create a mapping of known values
    mapping = dict(zip(le.classes_, le.transform(le.classes_)))
    
    # Apply mapping to test data, assigning -1 for unknown categories
    test_df[col] = test_df[col].map(lambda x: mapping.get(x, -1))

# Encode target variable
train_df['dep_delayed_15min'] = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

# Separate features and target
X_train = train_df.drop(columns=['dep_delayed_15min'])
y_train = train_df['dep_delayed_15min']
X_test = test_df  # No target column in test set

# Display processed dataset
X_train.head()




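|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance |
|---|---|---|---|---|---|---|---|---|
| 0 | 10 | 13 | 6 | 1934 | 0 | 18 | 78 | 732 |
| 1 | 6 | 12 | 2 | 1548 | 18 | 217 | 171 | 834 |
| 2 | 11 | 11 | 4 | 1422 | 20 | 228 | 59 | 416 |
| 3 | 2 | 17 | 5 | 1015 | 15 | 78 | 175 | 872 |
| 4 | 1 | 28 | 5 | 1828 | 19 | 174 | 199 | 423 |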


In [None]:
#The dataset has been successfully preprocessed, with categorical columns encoded properly and 
#the target variable transformed into binary format.
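
The idea of encoding with a fallback for unseen labels can be sketched without any library. This is a simplified stand-in for the LabelEncoder-based cell above, with hypothetical helper names and made-up airport lists: codes are assigned to the sorted unique training labels, and any test label that was not seen in training gets -1.

```python
def fit_label_mapping(values):
    """Assign consecutive integer codes to the sorted unique labels."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping, unknown=-1):
    """Look up each label, falling back to `unknown` for unseen labels."""
    return [mapping.get(value, unknown) for value in values]


train_origin = ["PIT", "ATL", "DEN", "ATL"]
test_origin = ["DEN", "XYZ"]  # "XYZ" never appears in the training data

mapping = fit_label_mapping(train_origin)
train_codes = encode(train_origin, mapping)
test_codes = encode(test_origin, mapping)
print(train_codes)  # [2, 0, 1, 0]
print(test_codes)   # [1, -1]
```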


**<font color='teal'> Use the describe function to review the numeric columns in the train dataframe. </font>**

In [128]:
train_df.describe()

|   | Month | DayofMonth | DayOfWeek | DepTime | UniqueCarrier | Origin | Dest | Distance | dep_delayed_15min |
|---|---|---|---|---|---|---|---|---|---|
| count | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 | 100000.0 |
| mean | 5.56211 | 14.82392 | 2.95183 | 1341.52388 | 12.04504 | 142.37781 | 141.04111 | 729.39716 | 0.19044 |
| std | 3.451188 | 8.952739 | 1.99164 | 476.378445 | 6.566272 | 76.899 | 76.783888 | 574.61686 | 0.39265 |
| min | 0.0 | 0.0 | 0.0 | 1.0 | 0.0 | 0.0 | 0.0 | 30.0 | 0.0 |
| 25% | 3.0 | 7.0 | 1.0 | 931.0 | 6.0 | 78.0 | 77.0 | 317.0 | 0.0 |
| 50% | 6.0 | 15.0 | 3.0 | 1330.0 | 13.0 | 151.0 | 150.0 | 575.0 | 0.0 |
| 75% | 9.0 | 22.0 | 5.0 | 1733.0 | 18.0 | 205.0 | 203.0 | 957.0 | 0.0 |
| max | 11.0 | 30.0 | 6.0 | 2534.0 | 21.0 | 288.0 | 288.0 | 4962.0 | 1.0 |


In [130]:
# Display unique sample values of DepTime to analyze its format
train_df['DepTime'].sample(10).sort_values()


5540      631
40968     701
53611    1309
29243    1313
89782    1443
56715    1612
23897    1633
22385    1745
23492    1956
91560    2133
Name: DepTime, dtype: int64

In [None]:
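Observation:

DepTime is stored as an integer in HHMM form on the 24-hour clock (for example, 1934 means 19:34).
The next cell adds DepTime_24000, a rescaled copy of the column (DepTime divided by 24).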

In [132]:
# Add a rescaled copy of DepTime (the HHMM value divided by 24)
train_df['DepTime_24000'] = train_df['DepTime'] / 24
test_df['DepTime_24000'] = test_df['DepTime'] / 24

# Display the transformed DepTime column
display(train_df[['DepTime', 'DepTime_24000']].head(10))


|   | DepTime | DepTime_24000 |
|---|---|---|
| 0 | 1934 | 80.583333 |
| 1 | 1548 | 64.5 |
| 2 | 1422 | 59.25 |
| 3 | 1015 | 42.291667 |
| 4 | 1828 | 76.166667 |
| 5 | 1918 | 79.916667 |
| 6 | 754 | 31.416667 |
| 7 | 635 | 26.458333 |
| 8 | 735 | 30.625 |
| 9 | 2029 | 84.541667 |


Notice that `DepTime` is the departure time written as a number on the 24-hour clock (HHMM).

In [None]:
Observation:

DepTime was already on the 24-hour clock, so no format conversion was needed. DepTime_24000 is only a rescaled copy of it (DepTime / 24);
the feature engineering below works from the original DepTime column.
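
To make the HHMM reading concrete, here is a minimal sketch that splits a DepTime value into hour and minute and converts it to minutes since midnight. The helper names are hypothetical; the sample values 1934 and 631 come from the outputs above.

```python
def split_hhmm(dep_time):
    """Split an HHMM integer such as 1934 into (hour, minute)."""
    return divmod(dep_time, 100)


def minutes_since_midnight(dep_time):
    """Convert an HHMM integer into minutes elapsed since 00:00."""
    hour, minute = split_hhmm(dep_time)
    return hour * 60 + minute


print(split_hhmm(1934))              # (19, 34)
print(minutes_since_midnight(1934))  # 1174
print(split_hhmm(631))               # (6, 31)
```

The describe output shows a maximum DepTime of 2534, which is outside the 0 to 2400 range; the notebook handles such rows later by keeping only DepTime <= 2400.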

 **<font color='teal'>The response variable is 'dep_delayed_15min' which is a categorical column, so we need to map the Y for yes and N for no values to 1 and 0. Run the code in the next cell to do this.</font>**

In [136]:
# Ensure only valid DepTime values (<= 2400) are used
train_df = train_df[train_df.DepTime <= 2400].copy()

# Convert target variable 'dep_delayed_15min' to binary (1 = 'Y', 0 = 'N')
y_train = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0}).values

# Display the first few values of the target variable
display(y_train[:10])


array([nan, nan, nan, nan, nan, nan, nan, nan, nan, nan])

In [None]:
Observation:

Every value in y_train is NaN. The cause is that dep_delayed_15min was already mapped to 1 and 0 in the earlier preprocessing cell,
so mapping {'Y': 1, 'N': 0} a second time finds no matching keys and returns NaN for every row.

Fix:
1. Reload the original training data.
2. Encode dep_delayed_15min once.
3. Then filter on DepTime.

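
A short sketch on a made-up Series shows how applying the same Y/N mapping twice produces NaN: after the first pass the values are 0 and 1, which are not keys of the mapping dictionary.

```python
import pandas as pd

raw = pd.Series(["N", "Y", "N"])
label_map = {"Y": 1, "N": 0}

once = raw.map(label_map)    # 'Y'/'N' are keys, so this works
twice = once.map(label_map)  # 0/1 are not keys, so every value becomes NaN

print(once.tolist())       # [0, 1, 0]
print(twice.isna().all())  # True
```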


In [138]:
# Restore the original train_df without filtering first
train_df = pd.read_csv('/Users/shivanginimarjiwe/Desktop/repo/DataScienceGuidedCapstone/18.2.6 - Bayesian Optimization/data/flight_delays_train.csv')

# Encode the target variable before filtering
train_df['dep_delayed_15min'] = train_df['dep_delayed_15min'].map({'Y': 1, 'N': 0})

# Now filter the dataset for valid DepTime values
train_df = train_df[train_df.DepTime <= 2400].copy()

# Extract the target variable again
y_train = train_df['dep_delayed_15min'].values

# Display unique values and first few entries
(unique_values_fixed, missing_values_fixed) = (train_df['dep_delayed_15min'].unique(), train_df['dep_delayed_15min'].isnull().sum())

(unique_values_fixed, missing_values_fixed, y_train[:10])


(array([0, 1]), 0, array([0, 0, 0, 0, 1, 0, 0, 0, 0, 0]))

In [None]:
Issue fixed:

Unique values in dep_delayed_15min are now [0, 1] (correctly encoded).
No missing values in dep_delayed_15min (missing_values = 0).
First few values of y_train: [0, 0, 0, 0, 1, 0, 0, 0, 0, 0] (correctly mapped)

## Feature Engineering
Use these defined functions to create additional features for the model. Run the cell to add the functions to your workspace.

In [142]:
def label_enc(df_column):
    df_column = LabelEncoder().fit_transform(df_column)
    return df_column

def make_harmonic_features_sin(value, period=2400):
    value *= 2 * np.pi / period 
    return np.sin(value)

def make_harmonic_features_cos(value, period=2400):
    value *= 2 * np.pi / period 
    return np.cos(value)

def feature_eng(df):
    df['flight'] = df['Origin']+df['Dest']
    df['Month'] = df.Month.map(lambda x: x.split('-')[-1]).astype('int32')
    df['DayofMonth'] = df.DayofMonth.map(lambda x: x.split('-')[-1]).astype('uint8')
    df['begin_of_month'] = (df['DayofMonth'] < 10).astype('uint8')
    df['midddle_of_month'] = ((df['DayofMonth'] >= 10)&(df['DayofMonth'] < 20)).astype('uint8')
    df['end_of_month'] = (df['DayofMonth'] >= 20).astype('uint8')
    df['DayOfWeek'] = df.DayOfWeek.map(lambda x: x.split('-')[-1]).astype('uint8')
    df['hour'] = df.DepTime.map(lambda x: x/100).astype('int32')
    df['morning'] = df['hour'].map(lambda x: 1 if (x <= 11)& (x >= 7) else 0).astype('uint8')
    df['day'] = df['hour'].map(lambda x: 1 if (x >= 12) & (x <= 18) else 0).astype('uint8')
    df['evening'] = df['hour'].map(lambda x: 1 if (x >= 19) & (x <= 23) else 0).astype('uint8')
    df['night'] = df['hour'].map(lambda x: 1 if (x >= 0) & (x <= 6) else 0).astype('int32')
    df['winter'] = df['Month'].map(lambda x: x in [12, 1, 2]).astype('int32')
    df['spring'] = df['Month'].map(lambda x: x in [3, 4, 5]).astype('int32')
    df['summer'] = df['Month'].map(lambda x: x in [6, 7, 8]).astype('int32')
    df['autumn'] = df['Month'].map(lambda x: x in [9, 10, 11]).astype('int32')
    df['holiday'] = (df['DayOfWeek'] >= 5).astype(int) 
    df['weekday'] = (df['DayOfWeek'] < 5).astype(int)
    df['airport_dest_per_month'] = df.groupby(['Dest', 'Month'])['Dest'].transform('count')
    df['airport_origin_per_month'] = df.groupby(['Origin', 'Month'])['Origin'].transform('count')
    df['airport_dest_count'] = df.groupby(['Dest'])['Dest'].transform('count')
    df['airport_origin_count'] = df.groupby(['Origin'])['Origin'].transform('count')
    df['carrier_count'] = df.groupby(['UniqueCarrier'])['Dest'].transform('count')
    df['carrier_count_per month'] = df.groupby(['UniqueCarrier', 'Month'])['Dest'].transform('count')
    df['deptime_cos'] = df['DepTime'].map(make_harmonic_features_cos)
    df['deptime_sin'] = df['DepTime'].map(make_harmonic_features_sin)
    df['flightUC'] = df['flight']+df['UniqueCarrier']
    df['DestUC'] = df['Dest']+df['UniqueCarrier']
    df['OriginUC'] = df['Origin']+df['UniqueCarrier']
    return df.drop('DepTime', axis=1)

# Functions are now ready for use!
print("Feature engineering functions have been added to the workspace.")

Feature engineering functions have been added to the workspace.
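
The harmonic helpers map a departure time onto a circle so that times just before and just after midnight end up close together. A minimal sketch of that effect, using the standard library instead of NumPy and three illustrative times (the function names here are hypothetical):

```python
import math


def harmonic(dep_time, period=2400):
    """Map an HHMM value to a point (cos, sin) on the unit circle."""
    angle = 2 * math.pi * dep_time / period
    return math.cos(angle), math.sin(angle)


def circular_distance(t1, t2):
    """Straight-line distance between two times after the harmonic mapping."""
    return math.dist(harmonic(t1), harmonic(t2))


late, early, noon = 2350, 10, 1200

print(abs(late - early))  # raw gap looks huge: 2340
print(circular_distance(late, early) < circular_distance(late, noon))  # True
```

On the raw scale 2350 and 0010 look far apart, but after the sine and cosine transform they are neighbors, while 2350 and 1200 are nearly opposite each other.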


Concatenate the training and testing dataframes.


In [146]:
full_df = pd.concat([train_df.drop('dep_delayed_15min', axis=1), test_df])
full_df = feature_eng(full_df)

# Display the first few rows of the transformed dataset
display(full_df.head())


AttributeError: 'int' object has no attribute 'split'
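
The error above comes from calling `.split` on a value that is not a string. A minimal sketch of a more tolerant extraction, assuming a small made-up Series that mixes prefixed strings with an integer:

```python
import pandas as pd

# Hypothetical column mixing 'c-8' style strings with an already-numeric value
mixed = pd.Series(["c-8", "c-21", 7], dtype=object)

# Casting to str first lets .str.split handle both kinds of value
cleaned = mixed.astype(str).str.split("-").str[-1].astype("int32")
print(cleaned.tolist())  # [8, 21, 7]

# Caveat: a negative code such as -1 becomes "-1", which splits into ["", "1"]
negative = pd.Series([-1]).astype(str).str.split("-").str[-1].astype("int32")
print(negative.tolist())  # [1]
```

The caveat matters in this notebook because unseen categories in the test set were assigned -1 earlier, and a value that is already numeric is a label code rather than the original month or day number.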

In [None]:

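Error fix:

The error occurs because some values in Month, DayofMonth, and DayOfWeek are already integers (the test set was label-encoded earlier), so calling x.split('-') on them fails.
Instead of "df['Month'] = df.Month.map(lambda x: x.split('-')[-1]).astype('int32')" we cast to string first:
"df['Month'] = df['Month'].astype(str).str.split('-').str[-1].astype('int32')"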

In [148]:
# Modify the feature_eng function to correctly extract numeric values from categorical columns
def feature_eng_fixed(df):
    """Applies feature engineering transformations with correct handling of categorical formats."""
    
    df = df.copy()  # Work on a copy to prevent modification of the original data
    
    df['flight'] = df['Origin'] + df['Dest']
    
    # Extract numeric values from encoded categorical features (handling 'c-8' format)
    df['Month'] = df['Month'].astype(str).str.split('-').str[-1].astype('int32')
    df['DayofMonth'] = df['DayofMonth'].astype(str).str.split('-').str[-1].astype('uint8')
    df['begin_of_month'] = (df['DayofMonth'] < 10).astype('uint8')
    df['midddle_of_month'] = ((df['DayofMonth'] >= 10) & (df['DayofMonth'] < 20)).astype('uint8')
    df['end_of_month'] = (df['DayofMonth'] >= 20).astype('uint8')
    df['DayOfWeek'] = df['DayOfWeek'].astype(str).str.split('-').str[-1].astype('uint8')
    
    # Extract hour from DepTime (assuming it's in HHMM format)
    df['hour'] = (df.DepTime / 100).astype('int32')

    # Time of day categories
    df['morning'] = df['hour'].map(lambda x: 1 if (7 <= x <= 11) else 0).astype('uint8')
    df['day'] = df['hour'].map(lambda x: 1 if (12 <= x <= 18) else 0).astype('uint8')
    df['evening'] = df['hour'].map(lambda x: 1 if (19 <= x <= 23) else 0).astype('uint8')
    df['night'] = df['hour'].map(lambda x: 1 if (0 <= x <= 6) else 0).astype('int32')

    # Seasonal categories
    df['winter'] = df['Month'].map(lambda x: 1 if x in [12, 1, 2] else 0).astype('int32')
    df['spring'] = df['Month'].map(lambda x: 1 if x in [3, 4, 5] else 0).astype('int32')
    df['summer'] = df['Month'].map(lambda x: 1 if x in [6, 7, 8] else 0).astype('int32')
    df['autumn'] = df['Month'].map(lambda x: 1 if x in [9, 10, 11] else 0).astype('int32')

    # Weekend and weekday flags
    df['holiday'] = (df['DayOfWeek'] >= 5).astype(int) 
    df['weekday'] = (df['DayOfWeek'] < 5).astype(int)

    # Group-based aggregations
    df['airport_dest_per_month'] = df.groupby(['Dest', 'Month'])['Dest'].transform('count')
    df['airport_origin_per_month'] = df.groupby(['Origin', 'Month'])['Origin'].transform('count')
    df['airport_dest_count'] = df.groupby(['Dest'])['Dest'].transform('count')
    df['airport_origin_count'] = df.groupby(['Origin'])['Origin'].transform('count')
    df['carrier_count'] = df.groupby(['UniqueCarrier'])['Dest'].transform('count')
    df['carrier_count_per_month'] = df.groupby(['UniqueCarrier', 'Month'])['Dest'].transform('count')

    # Harmonic features for cyclic time variables
    df['deptime_cos'] = df['DepTime'].map(make_harmonic_features_cos)
    df['deptime_sin'] = df['DepTime'].map(make_harmonic_features_sin)

    # Additional combined categorical features
    df['flightUC'] = df['flight'] + df['UniqueCarrier']
    df['DestUC'] = df['Dest'] + df['UniqueCarrier']
    df['OriginUC'] = df['Origin'] + df['UniqueCarrier']

    return df.drop('DepTime', axis=1)  # Drop the original DepTime column after transformation

# Reapply feature engineering with the corrected function
full_df = feature_eng_fixed(full_df)

# Display the first few rows of the transformed dataset
display(full_df.head())


|   | Month | DayofMonth | DayOfWeek | UniqueCarrier | Origin | Dest | Distance | DepTime_24000 | flight | begin_of_month | ... | airport_origin_per_month | airport_dest_count | airport_origin_count | carrier_count | carrier_count_per_month | deptime_cos | deptime_sin | flightUC | DestUC | OriginUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | AA | ATL | DFW | 732 | NaN | ATLDFW | 0 | ... | 534 | 4337 | 5822 | 9418 | 789 | 0.34366 | -0.939094 | ATLDFWAA | DFWAA | ATLAA |
| 1 | 4 | 20 | 3 | US | PIT | MCO | 834 | NaN | PITMCO | 0 | ... | 42 | 1728 | 688 | 6482 | 537 | -0.612907 | -0.790155 | PITMCOUS | MCOUS | PITUS |
| 2 | 9 | 2 | 5 | XE | RDU | CLE | 416 | NaN | RDUCLE | 1 | ... | 59 | 1217 | 868 | 5901 | 493 | -0.835807 | -0.549023 | RDUCLEXE | CLEXE | RDUXE |
| 3 | 11 | 25 | 6 | OO | DEN | MEM | 872 | NaN | DENMEM | 0 | ... | 250 | 629 | 2973 | 7390 | 567 | -0.884988 | 0.465615 | DENMEMOO | MEMOO | DENOO |
| 4 | 10 | 7 | 6 | WN | MDW | OMA | 423 | NaN | MDWOMA | 1 | ... | 121 | 311 | 1366 | 15082 | 1358 | 0.073238 | -0.997314 | MDWOMAWN | OMAWN | MDWWN |


Apply the earlier defined `label_enc` function to the categorical columns of the full dataframe.

In [178]:
for column in ['UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']:
    full_df[column] = label_enc(full_df[column])

In [None]:
Fix:
Before applying label encoding, we need to:

1. Convert all values in these columns to strings so that each column has a uniform type.
2. Then apply label encoding.

In [152]:
# Ensure all values in categorical columns are strings before encoding
for column in ['UniqueCarrier', 'Origin', 'Dest', 'flight', 'flightUC', 'DestUC', 'OriginUC']:
    full_df[column] = full_df[column].astype(str)  # Convert to string
    full_df[column] = label_enc(full_df[column])  # Apply Label Encoding

# Display the first few rows of the updated dataset
display(full_df.head())


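|   | Month | DayofMonth | DayOfWeek | UniqueCarrier | Origin | Dest | Distance | DepTime_24000 | flight | begin_of_month | ... | airport_origin_per_month | airport_dest_count | airport_origin_count | carrier_count | carrier_count_per_month | deptime_cos | deptime_sin | flightUC | DestUC | OriginUC |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 8 | 21 | 7 | 20 | 301 | 359 | 732 | NaN | 664 | 0 | ... | 534 | 4337 | 5822 | 9418 | 789 | 0.34366 | -0.939094 | 743 | 741 | 361 |
| 1 | 4 | 20 | 3 | 38 | 500 | 452 | 834 | NaN | 4039 | 0 | ... | 42 | 1728 | 688 | 6482 | 537 | -0.612907 | -0.790155 | 6363 | 1259 | 1579 |
| 2 | 9 | 2 | 5 | 40 | 511 | 340 | 416 | NaN | 4131 | 1 | ... | 59 | 1217 | 868 | 5901 | 493 | -0.835807 | -0.549023 | 6493 | 619 | 1646 |
| 3 | 11 | 25 | 6 | 35 | 361 | 456 | 872 | NaN | 1693 | 0 | ... | 250 | 629 | 2973 | 7390 | 567 | -0.884988 | 0.465615 | 2454 | 1293 | 732 |
| 4 | 10 | 7 | 6 | 39 | 457 | 480 | 423 | NaN | 3193 | 1 | ... | 121 | 311 | 1366 | 15082 | 1358 | 0.073238 | -0.997314 | 4915 | 1464 | 1271 |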


In [None]:
The categorical columns have been successfully converted to strings and label-encoded.



Split the new full dataframe into X_train and X_test. 

In [156]:
X_train = full_df[:train_df.shape[0]]
X_test = full_df[train_df.shape[0]:]

# Display the shapes of the resulting datasets
X_train.shape, X_test.shape

((99983, 34), (100000, 34))
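
Splitting by row count works because the training rows were placed first when the two dataframes were concatenated. A minimal sketch on two tiny made-up dataframes, using `.iloc` to make the positional slicing explicit:

```python
import pandas as pd

# Hypothetical stand-ins for the train and test dataframes
train = pd.DataFrame({"x": [1, 2, 3]})
test = pd.DataFrame({"x": [4, 5]})

# Concatenate so that features are computed over both sets together
full = pd.concat([train, test])

# Split back by position: the first len(train) rows are the training rows,
# even though the index labels repeat (0, 1, 2, 0, 1) after concatenation
n_train = train.shape[0]
X_train_part = full.iloc[:n_train]
X_test_part = full.iloc[n_train:]

print(X_train_part["x"].tolist())  # [1, 2, 3]
print(X_test_part["x"].tolist())   # [4, 5]
```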

Create a list of the categorical features.

In [160]:
categorical_features = ['Month',  'DayOfWeek', 'UniqueCarrier', 'Origin', 'Dest','flight',  'flightUC', 'DestUC', 'OriginUC']

# Display the categorical features list
categorical_features

['Month',
 'DayOfWeek',
 'UniqueCarrier',
 'Origin',
 'Dest',
 'flight',
 'flightUC',
 'DestUC',
 'OriginUC']

Let's build a LightGBM model to test the Bayesian optimizer.

### [LightGBM](https://lightgbm.readthedocs.io/en/latest/) is a gradient boosting framework that uses tree-based learning algorithms. It is designed to be distributed and efficient with the following advantages:

* Faster training speed and higher efficiency.
* Lower memory usage.
* Better accuracy.
* Support of parallel and GPU learning.
* Capable of handling large-scale data.

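First, we define the function we want to maximize, which computes the cross-validated AUC of LightGBM for a given set of parameters.

Some parameters, such as num_leaves, max_depth, min_child_samples, and min_data_in_leaf, must be integers, so the function casts them with int(). Note that in LightGBM min_child_samples is an alias of min_data_in_leaf, so these two entries control the same setting.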

In [164]:
def lgb_eval(num_leaves,max_depth,lambda_l2,lambda_l1,min_child_samples, min_data_in_leaf):
    params = {
        "objective" : "binary",
        "metric" : "auc", 
        'is_unbalance': True,
        "num_leaves" : int(num_leaves),
        "max_depth" : int(max_depth),
        "lambda_l2" : lambda_l2,
        "lambda_l1" : lambda_l1,
        "num_threads" : 20,
        "min_child_samples" : int(min_child_samples),
        'min_data_in_leaf': int(min_data_in_leaf),
        "learning_rate" : 0.03,
        "subsample_freq" : 5,
        "bagging_seed" : 42,
        "verbosity" : -1
    }
    lgtrain = lightgbm.Dataset(X_train, y_train,categorical_feature=categorical_features)
    cv_result = lightgbm.cv(params,
                       lgtrain,
                       1000,
                       stratified=True,
                       nfold=3)
    return cv_result['valid auc-mean'][-1]


# Function is now defined and ready for optimization! 
print("LightGBM evaluation function is ready.")


LightGBM evaluation function is ready.


Apply the Bayesian optimizer to the function we created in the previous step to identify the best hyperparameters. We will run 5 iterations and set init_points = 2.


In [168]:
!pip install lightgbm




In [170]:
lgbBO = BayesianOptimization(lgb_eval, {'num_leaves': (25, 4000),
                                                'max_depth': (5, 63),
                                                'lambda_l2': (0.0, 0.05),
                                                'lambda_l1': (0.0, 0.05),
                                                'min_child_samples': (50, 10000),
                                                'min_data_in_leaf': (100, 2000)
                                                })

lgbBO.maximize(n_iter=5, init_points=2)

|   iter    |  target   | lambda_l1 | lambda_l2 | max_depth | min_ch... | min_da... | num_le... |
-------------------------------------------------------------------------------------------------
| 1         | 0.7313    | 0.04818   | 0.03175   | 41.01     | 5.928e+03 | 1.02e+03  | 1.976e+03 |
| 2         | 0.7443    | 0.002931  | 0.02464   | 29.04     | 5.12e+03  | 1.345e+03 | 3.151e+03 |
| 3         | 0.7055    | 0.02921   | 0.005497  | 36.85     | 5.98e+03  | 277.0     | 1.748e+03 |
| 4         | 0.7255    | 0.03399   | 0.02158   | 55.48     | 7.993e+03 | 1.002e+03 | 2.544e+03 |
| 5         | 0.7439    | 0.02759   | 0.01063   |

 **<font color='teal'> Print the best result by using the '.max' attribute.</font>**

In [172]:
lgbBO.max

{'target': 0.7442844509016954,
 'params': {'lambda_l1': 0.0029306655504813817,
  'lambda_l2': 0.024639256072447005,
  'max_depth': 29.038484757043882,
  'min_child_samples': 5120.182434233687,
  'min_data_in_leaf': 1344.9519072870098,
  'num_leaves': 3150.904931241149}}

In [None]:
Observation 

The Bayesian Optimization process has found the best hyperparameters for LightGBM based on AUC score (target = 0.7443).
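
The optimizer reports every parameter as a float. Below is a small sketch, assuming you want to reuse the best point outside of `lgb_eval`, that casts the integer-valued parameters the same way `lgb_eval` does. The helper name `to_model_params` is hypothetical; the values are copied from the output above.

```python
# Best parameters as reported by lgbBO.max['params'] above
best_params = {
    'lambda_l1': 0.0029306655504813817,
    'lambda_l2': 0.024639256072447005,
    'max_depth': 29.038484757043882,
    'min_child_samples': 5120.182434233687,
    'min_data_in_leaf': 1344.9519072870098,
    'num_leaves': 3150.904931241149,
}

# Parameters that lgb_eval passes through int()
INT_PARAMS = ('num_leaves', 'max_depth', 'min_child_samples', 'min_data_in_leaf')


def to_model_params(params, int_keys=INT_PARAMS):
    """Truncate integer-valued parameters; leave the others unchanged."""
    return {key: int(value) if key in int_keys else value
            for key, value in params.items()}


final_params = to_model_params(best_params)
print(final_params)
```

Note that `int()` truncates rather than rounds, so 3150.90 becomes 3150.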

Review the process at each step by using the '.res' attribute; for example, '.res[0]' returns the first step.

In [176]:
lgbBO.res[0]

{'target': 0.7313452653191486,
 'params': {'lambda_l1': 0.048183299383045095,
  'lambda_l2': 0.03175387371094503,
  'max_depth': 41.01243354277301,
  'min_child_samples': 5927.504726192272,
  'min_data_in_leaf': 1020.4698618651744,
  'num_leaves': 1975.6926610593114}}

In [None]:
Observation:

AUC Score (target): 0.7313

Hyperparameters tested (integer parameters shown as truncated by int() in lgb_eval):

{
  'lambda_l1': 0.04818,
  'lambda_l2': 0.03175,
  'max_depth': 41,
  'min_child_samples': 5927,
  'min_data_in_leaf': 1020,
  'num_leaves': 1975
}






In [180]:
#review all the iterations
lgbBO.res


[{'target': 0.7313452653191486,
  'params': {'lambda_l1': 0.048183299383045095,
   'lambda_l2': 0.03175387371094503,
   'max_depth': 41.01243354277301,
   'min_child_samples': 5927.504726192272,
   'min_data_in_leaf': 1020.4698618651744,
   'num_leaves': 1975.6926610593114}},
 {'target': 0.7442844509016954,
  'params': {'lambda_l1': 0.0029306655504813817,
   'lambda_l2': 0.024639256072447005,
   'max_depth': 29.038484757043882,
   'min_child_samples': 5120.182434233687,
   'min_data_in_leaf': 1344.9519072870098,
   'num_leaves': 3150.904931241149}},
 {'target': 0.7055352919408965,
  'params': {'lambda_l1': 0.029205420533006777,
   'lambda_l2': 0.005496747090440419,
   'max_depth': 36.84502087718671,
   'min_child_samples': 5980.315870292989,
   'min_data_in_leaf': 277.0365302834856,
   'num_leaves': 1748.1393347835933}},
 {'target': 0.7255393705876799,
  'params': {'lambda_l1': 0.03399395278706789,
   'lambda_l2': 0.02158403838146125,
   'max_depth': 55.48167447108914,
   'min_child_sa

In [None]:
Observation:

| Iteration | AUC Score (Target) | Better than Previous? |
|---|---|---|
| 1 | 0.7313 | Initial Random |
| 2 | 0.7443 | Highest AUC (Best) |
| 3 | 0.7055 | Worse than Iteration 2 |
| 4 | 0.7255 | Worse than Iteration 2 |
| 5 | 0.7439 | Slightly worse than Iteration 2 |
| 6 | 0.7027 | Worse than Iteration 2 |
| 7 | 0.7248 | Worse than Iteration 2 |

Iteration 2 is considered the best because it has the highest AUC score (target) among all iterations.
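
The same comparison can be done programmatically. The sketch below uses a plain list shaped like `lgbBO.res`, filled with the rounded AUC values from the table above (the parameter dictionaries are omitted for brevity), and picks the entry with the highest target.

```python
# Rounded targets from the seven iterations, in the shape of lgbBO.res
results = [
    {'target': 0.7313},
    {'target': 0.7443},
    {'target': 0.7055},
    {'target': 0.7255},
    {'target': 0.7439},
    {'target': 0.7027},
    {'target': 0.7248},
]

# enumerate gives zero-based positions; add 1 to match the iteration numbers
best_index, best_result = max(enumerate(results), key=lambda pair: pair[1]['target'])
best_iteration = best_index + 1
print(best_iteration, best_result['target'])  # 2 0.7443
```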