# Tutorial: Automated Machine Learning
This is the code for the paper entitled "**[IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective](https://arxiv.org/abs/2209.08018)**" published in *Engineering Applications of Artificial Intelligence* (Elsevier's Journal, IF:7.8).<br>
Authors: Li Yang (lyang339@uwo.ca) and Abdallah Shami (Abdallah.Shami@uwo.ca)<br>
Organization: The Optimized Computing and Communications (OC2) Lab, ECE Department, Western University

L. Yang and A. Shami, "IoT Data Analytics in Dynamic Environments: From An Automated Machine Learning Perspective", *Engineering Applications of Artificial Intelligence*, vol. 116, pp. 1-33, 2022, doi: https://doi.org/10.1016/j.engappai.2022.105366.

# Code Part 1: Automated Offline/Static/Batch Learning
Batch learning: Batch learning methods analyze static data in batches and often need access to the entire dataset prior to model training. Traditional ML algorithms can effectively solve batch learning tasks. Although batch learning models often achieve high performance due to their ability to learn diverse data patterns, it is often difficult to update these models once created. Therefore, batch learning faces two significant challenges: model degradation and data unavailability.

## Dataset 1: CICIDS2017
A subset of the network traffic data randomly sampled from the [CICIDS2017 dataset](https://www.unb.ca/cic/datasets/ids-2017.html).  

The Canadian Institute for Cybersecurity Intrusion Detection System 2017 (CICIDS2017) dataset has the most updated network threats. The CICIDS2017 dataset is close to real-world network data since it has a large amount of network traffic data, a variety of network features, various types of attacks, and highly imbalanced classes.

## Import libraries

In [1]:
import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split,cross_val_score
import lightgbm as lgb
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score, precision_score, recall_score, f1_score
from sklearn.preprocessing import LabelEncoder
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import classification_report,confusion_matrix,accuracy_score
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
from sklearn.naive_bayes import GaussianNB
from sklearn.pipeline import Pipeline
from sklearn.model_selection import GridSearchCV
from scipy.stats import shapiro
from imblearn.over_sampling import SMOTE
import time

In [2]:
import warnings 
warnings.filterwarnings('ignore')
import os

## Read the sampled CICIDS2017 dataset

In [3]:
dataset_directory = r'utils\dataset_files'
df = pd.read_csv(os.path.join(dataset_directory,"Test_DS.csv"))
df.columns= ['Timestamp', 'CAN_ID', 'RTR', 'DLC', 'Data0', 'Data1', 'Data2', 'Data3', 'Data4', 'Data5', 'Data6', 'Data7', 'Label','Anomaly_Label',\
        'Mean', 'Median','Skew', 'Kurtosis', 'Variance', 'Standard_deviation']

In [4]:
df

Unnamed: 0,Timestamp,CAN_ID,RTR,DLC,Data0,Data1,Data2,Data3,Data4,Data5,Data6,Data7,Label,Anomaly_Label,Mean,Median,Skew,Kurtosis,Variance,Standard_deviation
0,0.008960,870.0,0.0,8.0,129.0,29.0,200.0,2.0,1.0,71.0,206.0,56.0,2.0,0.0,86.750,63.5,0.591948,-1.328752,6842.214286,82.717678
1,0.011995,1634.0,0.0,8.0,78.0,224.0,0.0,0.0,64.0,0.0,0.0,0.0,2.0,0.0,45.750,0.0,2.023583,4.221153,6230.214286,78.931706
2,0.011997,208.0,0.0,8.0,82.0,119.0,4.0,96.0,1.0,1.0,240.0,0.0,2.0,0.0,67.875,43.0,1.256800,1.354502,7266.125000,85.241568
3,0.014066,1694.0,0.0,8.0,4.0,64.0,4.0,125.0,31.0,192.0,21.0,162.0,2.0,0.0,75.375,47.5,0.637440,-1.421334,5544.553571,74.461759
4,0.018414,186.0,0.0,8.0,6.0,184.0,83.0,196.0,16.0,0.0,3.0,52.0,2.0,0.0,67.500,34.0,0.995326,-0.749078,6530.857143,80.813719
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1923723,347.270475,1201.0,0.0,8.0,0.0,0.0,3.0,0.0,0.0,0.0,0.0,3.0,2.0,0.0,0.750,0.0,1.440165,0.000000,1.928571,1.388730
1923724,347.276351,339.0,0.0,8.0,44.0,179.0,98.0,18.0,251.0,236.0,192.0,206.0,0.0,1.0,153.000,185.5,-0.603115,-1.337754,7804.285714,88.341868
1923725,347.281315,1440.0,0.0,8.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,0.0,2.0,0.0,0.000,0.0,0.000000,0.000000,0.000000,0.000000
1923726,347.297869,356.0,0.0,8.0,196.0,245.0,47.0,130.0,238.0,6.0,135.0,23.0,0.0,1.0,127.500,132.5,-0.036447,-1.757365,8990.571429,94.818624


# 1. Automated Data Pre-Processing

## Automated Transformation/Encoding
Automatically identify and transform string/text features into numerical features to make the data more readable by ML models

In [5]:
# Define the automated data encoding function
def Auto_Encoding(df):
    cat_features=[x for x in df.columns if df[x].dtype=="object"] ## Find string/text features
    le=LabelEncoder()
    for col in cat_features:
        if col in df.columns:
            i = df.columns.get_loc(col)
            # Transform to numerical features
            df.iloc[:,i] = df.apply(lambda i:le.fit_transform(i.astype(str)), axis=0, result_type='expand')
    return df

In [6]:
df=Auto_Encoding(df)

## Automated Imputation
Detect and impute missing values to improve data quality

In [7]:
# Define the automated data imputation function
def Auto_Imputation(df):
    if df.isnull().values.any() or np.isinf(df).values.any(): # if there is any empty or infinite values
        df.replace([np.inf, -np.inf], np.nan, inplace=True)
        df.fillna(0, inplace = True)  # Replace empty values with zeros; there are other imputation methods discussed in the paper
    return df

In [8]:
df=Auto_Imputation(df)

## Automated normalization
Normalize the range of features to a similar scale to improve data quality

In [9]:
def Auto_Normalization(df):
    stat, p = shapiro(df)
    print('Statistics=%.3f, p=%.3f' % (stat, p))
    # interpret
    alpha = 0.05
    numeric_features = df.drop(['Label'],axis = 1).dtypes[df.dtypes != 'object'].index
    
    # The selection strategy is based on the following article: 
    # https://medium.com/@kumarvaishnav17/standardization-vs-normalization-in-machine-learning-3e132a19c8bf
    # Check if the data distribution follows a Gaussian/normal distribution
    # If so, select the Z-score normalization method; otherwise, select the min-max normalization
    # Details are in the paper
    if p > alpha:
        print('Sample looks Gaussian (fail to reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.mean()) / (x.std()))
        print('Z-score normalization is automatically chosen and used')
    else:
        print('Sample does not look Gaussian (reject H0)')
        df[numeric_features] = df[numeric_features].apply(
            lambda x: (x - x.min()) / (x.max()-x.min()))
        print('Min-max normalization is automatically chosen and used')
    return df

In [10]:
df=Auto_Normalization(df)

Statistics=0.225, p=0.000
Sample does not look Gaussian (reject H0)
Min-max normalization is automatically chosen and used


## Train-test split
Split the dataset into the training and the test set

In [11]:
X = df.drop(['Label','Anomaly_Label'],axis=1)
y = df['Anomaly_Label']

# Here we used the 80%/20% split, it can be changed based on specific tasks
#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_new, X_subset, y_new, y_subset = train_test_split(X, y, train_size=500000, stratify=y)
X_train, X_test, y_train, y_test = train_test_split(X_new,y_new, train_size = 0.1, test_size = 0.9, shuffle=False,random_state = 0)

## Automated data balancing
Generate minority class samples to solve class-imbalance and improve data quality.  
Synthetic Minority Over-sampling Technique (SMOTE) method is used.

In [12]:
pd.Series(y_train).value_counts()

0.0    40707
1.0     9293
Name: Anomaly_Label, dtype: int64

In [13]:
# For binary data (can be modified for multi-class data with the same logic)
def Auto_Balancing(X_train, y_train):
    number0 = pd.Series(y_train).value_counts().iloc[0]
    number1 = pd.Series(y_train).value_counts().iloc[1]
    
    if number0 > number1:
        nlarge = number0
    else:
        nlarge = number1
    
    # evaluate whether the incoming dataset is imbalanced (the abnormal/normal ratio is smaller than a threshold (e.g., 50%)) 
    if (number1/number0 > 1.5) or (number0/number1 > 1.5):
        smote=SMOTE(n_jobs=-1,sampling_strategy={0:nlarge, 1:nlarge})
        X_train, y_train = smote.fit_resample(X_train, y_train)
        
    return X_train, y_train

In [14]:
X_train, y_train = Auto_Balancing(X_train, y_train)

In [15]:
pd.Series(y_train).value_counts()

0.0    40707
1.0    40707
Name: Anomaly_Label, dtype: int64

## Model learning (for Comparison)

In [16]:
%%time
lg = lgb.LGBMClassifier(verbose = -1)
lg.fit(X_train,y_train)
t1=time.time()
predictions = lg.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 90.583%
Precision: 73.236%
Recall: 79.235%
F1-score: 76.118%
Time: 0.6835
CPU times: total: 9.27 s
Wall time: 1.19 s


In [17]:
%%time
rf = RandomForestClassifier()
rf.fit(X_train,y_train)
t1=time.time()
predictions = rf.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 88.431%
Precision: 67.354%
Recall: 75.519%
F1-score: 71.20400000000001%
Time: 13.98276
CPU times: total: 23.3 s
Wall time: 23.3 s


In [18]:
%%time
nb = GaussianNB()
nb.fit(X_train,y_train)
t1=time.time()
predictions = nb.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 46.414%
Precision: 24.557000000000002%
Recall: 88.28399999999999%
F1-score: 38.426%
Time: 0.43122
CPU times: total: 719 ms
Wall time: 743 ms


In [19]:
%%time
knn = KNeighborsClassifier()
knn.fit(X_train,y_train)
t1=time.time()
predictions = knn.predict(X_test)
t2=time.time()
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 85.045%
Precision: 57.656%
Recall: 79.211%
F1-score: 66.73599999999999%
Time: 62.55657
CPU times: total: 4min 55s
Wall time: 28.7 s


In [20]:
import tensorflow as tf
from keras.layers import Input,Dense,Dropout,BatchNormalization,Activation
from keras import Model
import keras.backend as K
import keras.callbacks as kcallbacks
from keras import optimizers
from keras.optimizers import Adam

from sklearn.model_selection import GridSearchCV
from keras.wrappers.scikit_learn import KerasClassifier
from keras.callbacks import EarlyStopping
def ANN(optimizer = 'sgd',neurons=16,batch_size=1024,epochs=80,activation='relu',patience=8,loss='binary_crossentropy'):
    K.clear_session()
    inputs=Input(shape=(X.shape[1],))
    x=Dense(1000)(inputs)
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.3)(x)
    x=Dense(256)(inputs)
    x=BatchNormalization()(x)
    x=Activation('relu')(x)
    x=Dropout(0.25)(x)
    x=Dense(2,activation='softmax')(x)
    model=Model(inputs=inputs,outputs=x,name='base_nlp')
    model.compile(optimizer='adam',loss='categorical_crossentropy')
    early_stopping = EarlyStopping(monitor="loss", patience = patience)# early stop patience
    history = model.fit(X, pd.get_dummies(y).values,
              batch_size=batch_size,
              epochs=epochs,
              callbacks = [early_stopping],
              verbose=0) #verbose set to 1 will show the training process
    return model

In [21]:
%%time
ann = KerasClassifier(build_fn=ANN, verbose=0)
ann.fit(X_train,y_train)
predictions = ann.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")
print("Time: "+str(round((t2-t1)/len(y_test)*1000000,5)))

Accuracy: 89.912%
Precision: 71.04%
Recall: 78.89399999999999%
F1-score: 74.762%
Time: 62.55657
CPU times: total: 1h 48min 43s
Wall time: 14min 23s


# 2. Automated Feature Engineering
Feature selection method 1: **Information Gain (IG)**, used to remove irrelevant features to improve model efficiency  
Feature selection method 2: **Pearson Correlation**, used to remove redundant features to improve model efficiency and accuracy  

In [22]:
# Remove irrelevant features and select important features
def Feature_Importance_IG(data):
    features = data.drop(['Label'],axis=1).values  # "Label" should be changed to the target class variable name if different
    labels = data['Label'].values
    
    # Extract feature names
    feature_names = list(data.drop(['Label'],axis=1).columns)

    # Empty array for feature importances
    feature_importance_values = np.zeros(len(feature_names))
    model = lgb.LGBMRegressor(verbose = -1)
    model.fit(features, labels)
    feature_importances = pd.DataFrame({'feature': feature_names, 'importance': model.feature_importances_})

    # Sort features according to importance
    feature_importances = feature_importances.sort_values('importance', ascending = False).reset_index(drop = True)

    # Normalize the feature importances to add up to one
    feature_importances['normalized_importance'] = feature_importances['importance'] / feature_importances['importance'].sum()
    feature_importances['cumulative_importance'] = np.cumsum(feature_importances['normalized_importance'])
    
    cumulative_importance=0.90 # Only keep the important features with cumulative importance scores>=90%. It can be changed.

    # Make sure most important features are on top
    feature_importances = feature_importances.sort_values('cumulative_importance')

    # Identify the features not needed to reach the cumulative_importance
    record_low_importance = feature_importances[feature_importances['cumulative_importance'] > cumulative_importance]

    to_drop = list(record_low_importance['feature'])
#     print(feature_importances.drop(['importance'],axis=1))
    return to_drop

In [23]:
# Remove redundant features
def Feature_Redundancy_Pearson(data):
    correlation_threshold=0.90 # Only remove features with the redundancy>90%. It can be changed
    features = data.drop(['Label'],axis=1)
    corr_matrix = features.corr()

    # Extract the upper triangle of the correlation matrix
    upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k = 1).astype(np.bool))

    # Select the features with correlations above the threshold
    # Need to use the absolute value
    to_drop = [column for column in upper.columns if any(upper[column].abs() > correlation_threshold)]

    # Dataframe to hold correlated pairs
    record_collinear = pd.DataFrame(columns = ['drop_feature', 'corr_feature', 'corr_value'])

    # Iterate through the columns to drop
    for column in to_drop:

        # Find the correlated features
        corr_features = list(upper.index[upper[column].abs() > correlation_threshold])

        # Find the correlated values
        corr_values = list(upper[column][upper[column].abs() > correlation_threshold])
        drop_features = [column for _ in range(len(corr_features))]    

        # Record the information (need a temp df for now)
        temp_df = pd.DataFrame.from_dict({'drop_feature': drop_features,
                                         'corr_feature': corr_features,
                                         'corr_value': corr_values})
        record_collinear = record_collinear.append(temp_df, ignore_index = True)
#     print(record_collinear)
    return to_drop

In [24]:
def Auto_Feature_Engineering(df):
    drop1 = Feature_Importance_IG(df)
    dfh1 = df.drop(columns = drop1)
    
    drop2 = Feature_Redundancy_Pearson(dfh1)
    dfh2 = dfh1.drop(columns = drop2)
    
    return dfh2

In [25]:
dfh2 = Auto_Feature_Engineering(df)
dfh2

Unnamed: 0,Timestamp,CAN_ID,Data0,Data1,Data3,Data4,Data6,Label,Anomaly_Label,Mean,Skew,Kurtosis
0,8.801449e-07,0.451245,0.251953,0.056641,0.003906,0.001953,0.402344,2.0,0.0,0.169434,0.602383,0.136227
1,1.178308e-06,0.847510,0.152344,0.437500,0.000000,0.125000,0.000000,2.0,0.0,0.089355,0.856909,0.650107
2,1.178504e-06,0.107884,0.160156,0.232422,0.187500,0.001953,0.468750,2.0,0.0,0.132568,0.720585,0.384676
3,1.381766e-06,0.878631,0.007812,0.125000,0.244141,0.060547,0.041016,2.0,0.0,0.147217,0.610471,0.127654
4,1.808919e-06,0.096473,0.011719,0.359375,0.382812,0.031250,0.005859,2.0,0.0,0.131836,0.674098,0.189900
...,...,...,...,...,...,...,...,...,...,...,...,...
1923723,3.411635e-02,0.622925,0.000000,0.000000,0.000000,0.000000,0.000000,2.0,0.0,0.001465,0.753185,0.259259
1923724,3.411692e-02,0.175830,0.085938,0.349609,0.035156,0.490234,0.375000,0.0,1.0,0.298828,0.389916,0.135393
1923725,3.411741e-02,0.746888,0.000000,0.000000,0.000000,0.000000,0.000000,2.0,0.0,0.000000,0.497142,0.259259
1923726,3.411904e-02,0.184647,0.382812,0.478516,0.253906,0.464844,0.263672,0.0,1.0,0.249023,0.490663,0.096540


## Data Split & Balancing (After Feature Engineering)

In [26]:
X = dfh2.drop(['Label','Anomaly_Label'],axis=1)
y = dfh2['Anomaly_Label']

#X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2, shuffle=False,random_state = 0)
X_train, X_test, y_train, y_test = train_test_split(X,y, train_size = 0.8, test_size = 0.2,random_state = 0)

In [27]:
X_train, y_train = Auto_Balancing(X_train, y_train)

# 3. Automated Model Selection
Select the best-performing model among five common machine learning models (Naive Bayes, KNN, random forest, LightGBM, and ANN/MLP) by evaluating their learning performance

### Method 1: Grid Search

In [28]:
# Create a pipeline
pipe = Pipeline([('classifier', GaussianNB())])

# Create space of candidate learning algorithms and their hyperparameters
search_space = [{'classifier': [GaussianNB()]},
                {'classifier': [KNeighborsClassifier()]},
                {'classifier': [RandomForestClassifier()]},
                {'classifier': [lgb.LGBMClassifier(verbose = -1)]},
                {'classifier': [KerasClassifier(build_fn=ANN, verbose=0)]},
                 ]

In [29]:
clf = GridSearchCV(pipe, search_space, cv=5, verbose=0)

In [30]:
clf.fit(X, y)

In [31]:
print("Best Model:"+ str(clf.best_params_))
print("Accuracy:"+ str(clf.best_score_))

Best Model:{'classifier': RandomForestClassifier()}
Accuracy:0.8349538145772648


In [32]:
clf.cv_results_

{'mean_fit_time': array([4.49287271e-01, 9.23181119e+00, 3.44438641e+02, 2.47019458e+00,
        8.56434151e+02]),
 'std_fit_time': array([4.97836289e-03, 4.97621235e-01, 3.79760645e+01, 1.63161131e-02,
        6.72389365e+01]),
 'mean_score_time': array([ 0.11600556, 27.21723347,  4.55360003,  0.23275356,  5.67175727]),
 'std_score_time': array([3.25754269e-03, 1.12521099e+01, 7.18234999e-01, 7.96020089e-03,
        1.20933518e-01]),
 'param_classifier': masked_array(data=[GaussianNB(), KNeighborsClassifier(),
                    RandomForestClassifier(), LGBMClassifier(verbose=-1),
                    <keras.wrappers.scikit_learn.KerasClassifier object at 0x000001A6BB571700>],
              mask=[False, False, False, False, False],
        fill_value='?',
             dtype=object),
 'params': [{'classifier': GaussianNB()},
  {'classifier': KNeighborsClassifier()},
  {'classifier': RandomForestClassifier()},
  {'classifier': LGBMClassifier(verbose=-1)},
  {'classifier': <keras.wrappe

LightGBM model is the best performing machine learning model, and the best cross-validation accuracy is 98.438%

### Method 2: Bayesian Optimization with Tree Parzen Estimator (BO-TPE)

In [33]:
! pip install hyperopt

Collecting hyperopt
  Downloading hyperopt-0.2.7-py2.py3-none-any.whl (1.6 MB)
     ---------------------------------------- 0.0/1.6 MB ? eta -:--:--
     ----------- ---------------------------- 0.4/1.6 MB 9.2 MB/s eta 0:00:01
     ---------------------------- ----------- 1.1/1.6 MB 12.1 MB/s eta 0:00:01
     ---------------------------------------- 1.6/1.6 MB 14.3 MB/s eta 0:00:00
Collecting networkx>=2.2
  Downloading networkx-3.0-py3-none-any.whl (2.0 MB)
     ---------------------------------------- 0.0/2.0 MB ? eta -:--:--
     -------------------- ------------------- 1.0/2.0 MB 22.0 MB/s eta 0:00:01
     ---------------------------------------  2.0/2.0 MB 25.7 MB/s eta 0:00:01
     ---------------------------------------- 2.0/2.0 MB 21.6 MB/s eta 0:00:00
Collecting py4j
  Downloading py4j-0.10.9.7-py2.py3-none-any.whl (200 kB)
     ---------------------------------------- 0.0/200.5 kB ? eta -:--:--
     ------------------------------------- 200.5/200.5 kB 11.9 MB/s eta 0:00:00
I

In [34]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    
    classifier_type = params['type']
    del params['type']
    if classifier_type == 'nb':
        clf = GaussianNB()
    elif classifier_type == 'knn':
        clf = KNeighborsClassifier()
    elif classifier_type == 'rf':
        clf = RandomForestClassifier()
    elif classifier_type == 'lgb':
        clf = lgb.LGBMClassifier(verbose = -1)
    elif classifier_type == 'ann':
        clf = KerasClassifier(build_fn=ANN, verbose=0)
    else:
        return 0
    
    clf.fit(X_train,y_train)
    predictions = clf.predict(X_test)
    score = accuracy_score(y_test,predictions)
    return {'loss':-score, 'status': STATUS_OK }

# Define the hyperparameter configuration space
space = hp.choice('classifier_type', [{'type': 'nb'},{'type': 'knn'},{'type': 'rf'},{'type': 'lgb'},{'type': 'ann'},])

# Detect the optimal hyperparameter values
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=10)
print("Hyperopt estimated optimum {}".format(best))

100%|██████████| 10/10 [57:05<00:00, 342.56s/trial, best loss: -0.9314846678068128] 
Hyperopt estimated optimum {'classifier_type': 2}


Classifier type 3 is the LightGBM model, and the best hold-out accuracy is 99.806%

# 4. Hyperparameter Optimization
Optimize the best performing machine learning model (lightGBM) by tuning its hyperparameters

## Cross validation

In [35]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']),
        'learning_rate': abs(float(params['learning_rate'])),
        "num_leaves": int(params['num_leaves']),
        "min_child_samples": int(params['min_child_samples']),
    }
    clf = lgb.LGBMClassifier( **params)
    score = cross_val_score(clf, X, y, scoring='accuracy', cv=StratifiedKFold(n_splits=5)).mean()
    return {'loss':-score, 'status': STATUS_OK }

# Define the hyperparameter configuration space
space = {
    'n_estimators': hp.quniform('n_estimators', 50, 500, 20),
    'max_depth': hp.quniform('max_depth', 5, 50, 1),
    "learning_rate":hp.uniform('learning_rate', 0, 1),
    "num_leaves":hp.quniform('num_leaves',100,2000,100),
    "min_child_samples":hp.quniform('min_child_samples',10,50,5),
}

# Detect the optimal hyperparameter values
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=20)
print("LightGBM: Hyperopt estimated optimum {}".format(best))

100%|██████████| 20/20 [30:04<00:00, 90.22s/trial, best loss: -0.8268782903387375] 
LightGBM: Hyperopt estimated optimum {'learning_rate': 0.0031594313778544603, 'max_depth': 47.0, 'min_child_samples': 45.0, 'n_estimators': 460.0, 'num_leaves': 1300.0}


In [36]:
%%time
clf = lgb.LGBMClassifier(max_depth=14, learning_rate=  0.4765834961973211, n_estimators = 480, 
                         num_leaves = 600, min_child_samples = 25)
clf.fit(X,y)
scores = cross_val_score(clf, X, y, cv=5,scoring='accuracy')
print("Accuracy: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='precision')
print("Precision: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='recall')
print("Recall: "+ str(round(scores.mean(),5)*100)+"%")
scores = cross_val_score(clf, X, y, cv=5,scoring='f1')
print("F1-score: "+ str(round(scores.mean(),5)*100)+"%")

Accuracy: 80.301%
Precision: 51.214999999999996%
Recall: 43.402%
F1-score: 44.326%
CPU times: total: 1h 58min 12s
Wall time: 9min 14s


After hyperparameter optimization, the cross-validation accuracy has been improved from 98.438% to 98.477%

## Hold-out validation

In [None]:
from hyperopt import hp, fmin, tpe, STATUS_OK, Trials
from sklearn.model_selection import cross_val_score, StratifiedKFold

# Define the objective function
def objective(params):
    params = {
        'n_estimators': int(params['n_estimators']), 
        'max_depth': int(params['max_depth']),
        'learning_rate': abs(float(params['learning_rate'])),
        "num_leaves": int(params['num_leaves']),
        "min_child_samples": int(params['min_child_samples']),
    }
    clf = lgb.LGBMClassifier( **params)
    clf.fit(X_train,y_train)
    predictions = clf.predict(X_test)
    score = accuracy_score(y_test,predictions)
    return {'loss':-score, 'status': STATUS_OK }

# Define the hyperparameter configuration space
space = {
    'n_estimators': hp.quniform('n_estimators', 50, 500, 20),
    'max_depth': hp.quniform('max_depth', 5, 50, 1),
    "learning_rate":hp.uniform('learning_rate', 0, 1),
    "num_leaves":hp.quniform('num_leaves',100,2000,100),
    "min_child_samples":hp.quniform('min_child_samples',10,50,5),
}

# Detect the optimal hyperparameter values
best = fmin(fn=objective,
            space=space,
            algo=tpe.suggest,
            max_evals=50)
print("LightGBM: Hyperopt estimated optimum {}".format(best))

In [None]:
%%time
clf = lgb.LGBMClassifier(max_depth=35, learning_rate= 0.7925617918030913, n_estimators = 200, 
                         num_leaves = 200, min_child_samples = 25)
clf.fit(X_train,y_train)
predictions = clf.predict(X_test)
print("Accuracy: "+str(round(accuracy_score(y_test,predictions),5)*100)+"%")
print("Precision: "+str(round(precision_score(y_test,predictions),5)*100)+"%")
print("Recall: "+str(round(recall_score(y_test,predictions),5)*100)+"%")
print("F1-score: "+str(round(f1_score(y_test,predictions),5)*100)+"%")

Accuracy: 99.84100000000001%
Precision: 99.381%
Recall: 99.822%
F1-score: 99.601%
Wall time: 360 ms


After hyperparameter optimization, the hold-out accuracy has been improved from 99.806% to 99.841%