## Assignment 4
<b>Objective</b>: Use an AutoML tool or library to develop a machine learning model on a given dataset. Understand the strengths and limitations of using AutoML and compare results to traditional model development processes.

In [1]:
%%capture
## Install libraries
%pip install scikit-learn
%pip install tpot
%pip install pandas
%pip install numpy
%pip install matplotlib
%pip install seaborn

In [2]:
### Libraries
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import warnings
warnings.filterwarnings('ignore')

sns.set_style('whitegrid')

### Task 1
#### Dataset Selection & Preprocessing:
- Choose one of the datasets suggested below or any dataset of your interest.
- Preprocess the dataset: handle missing values, normalize or standardize features, split the data into training and test sets.

#### 1. a) The selected dataset is UCI Adult Income. The task is to predict whether a person's income exceeds $50K per year based on census data.

##### --- Load dataset

In [3]:
## List of files in the dataset
os.listdir('data')

['adult.data', 'old.adult.names', 'adult.names', 'Index', 'adult.test']

In [4]:
columns = ['age', 'workclass', 'fnlwgt', 'education', 'education_num', 'martial_status', 'occupation', 'relationship', 'race', 'sex', 'capital_gain', 'capital_loss', 'hours_per_week', 'native_country', 'class']

# Train dataset
train = pd.read_csv('data/adult.data', header = None, na_values='?')
train.columns = columns

# Test dataset
test = pd.read_csv('data/adult.test', header = None, na_values='?')
test.columns = columns

## map class column into binary
class_mapper = {' <=50K': 0, ' <=50K.': 0,
                ' >50K': 1, ' >50K.': 1}
train['class'] = train['class'].map(class_mapper)
test['class'] = test['class'].map(class_mapper)

In [5]:
print(f'Sample in train dataset: {len(train)}')
print(f'Sample in test dataset : {len(test)}')

Sample in train dataset: 32561
Sample in test dataset : 16281


In [6]:
train.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,martial_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
0,39,State-gov,77516,Bachelors,13,Never-married,Adm-clerical,Not-in-family,White,Male,2174,0,40,United-States,0
1,50,Self-emp-not-inc,83311,Bachelors,13,Married-civ-spouse,Exec-managerial,Husband,White,Male,0,0,13,United-States,0
2,38,Private,215646,HS-grad,9,Divorced,Handlers-cleaners,Not-in-family,White,Male,0,0,40,United-States,0
3,53,Private,234721,11th,7,Married-civ-spouse,Handlers-cleaners,Husband,Black,Male,0,0,40,United-States,0
4,28,Private,338409,Bachelors,13,Married-civ-spouse,Prof-specialty,Wife,Black,Female,0,0,40,Cuba,0


In [7]:
test.head()

Unnamed: 0,age,workclass,fnlwgt,education,education_num,martial_status,occupation,relationship,race,sex,capital_gain,capital_loss,hours_per_week,native_country,class
0,25,Private,226802,11th,7,Never-married,Machine-op-inspct,Own-child,Black,Male,0,0,40,United-States,0
1,38,Private,89814,HS-grad,9,Married-civ-spouse,Farming-fishing,Husband,White,Male,0,0,50,United-States,0
2,28,Local-gov,336951,Assoc-acdm,12,Married-civ-spouse,Protective-serv,Husband,White,Male,0,0,40,United-States,1
3,44,Private,160323,Some-college,10,Married-civ-spouse,Machine-op-inspct,Husband,Black,Male,7688,0,40,United-States,1
4,18,?,103497,Some-college,10,Never-married,?,Own-child,White,Female,0,0,30,United-States,0


In [8]:
### Combine train and test dataset for preprocessing and feature scaling and engineering
train['kind'] = 'train'
test['kind'] = 'test'

df = pd.concat([train, test], axis = 0)
df.reset_index(inplace = True, drop = True)
df.shape

(48842, 16)

#### 1. b) Preprocess the dataset: handle missing values, normalize or standardize features, split the data into training and test sets.

##### ---- Inspect features

In [9]:
#### [clean-up] All string labels have a leading space. Strip it.
for f in df.select_dtypes(include = 'object'):
    df[f] = df[f].str.strip()

In [10]:
### In the original dataset, the authors note that '?' represents a missing value.
# Make that representation explicit by converting it to NaN.
df = df.replace('?', np.nan)
df.shape

(48842, 16)

In [11]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 48842 entries, 0 to 48841
Data columns (total 16 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   age             48842 non-null  int64 
 1   workclass       46043 non-null  object
 2   fnlwgt          48842 non-null  int64 
 3   education       48842 non-null  object
 4   education_num   48842 non-null  int64 
 5   martial_status  48842 non-null  object
 6   occupation      46033 non-null  object
 7   relationship    48842 non-null  object
 8   race            48842 non-null  object
 9   sex             48842 non-null  object
 10  capital_gain    48842 non-null  int64 
 11  capital_loss    48842 non-null  int64 
 12  hours_per_week  48842 non-null  int64 
 13  native_country  47985 non-null  object
 14  class           48842 non-null  int64 
 15  kind            48842 non-null  object
dtypes: int64(7), object(9)
memory usage: 6.0+ MB


In [12]:
### Inspect categorical features
for f in df.select_dtypes(include = ['object']).columns:
    if f != 'kind':
        n = df[f].nunique()
        if n < 10:
            print(f'{f:20s} [{n}]: {", ".join([str(c) for c in df[f].unique()])}')
        else:
            print(f'{f:20s} [{n}]')

workclass            [8]: State-gov, Self-emp-not-inc, Private, Federal-gov, Local-gov, nan, Self-emp-inc, Without-pay, Never-worked
education            [16]
martial_status       [7]: Never-married, Married-civ-spouse, Divorced, Married-spouse-absent, Separated, Married-AF-spouse, Widowed
occupation           [14]
relationship         [6]: Not-in-family, Husband, Wife, Own-child, Unmarried, Other-relative
race                 [5]: White, Black, Asian-Pac-Islander, Amer-Indian-Eskimo, Other
sex                  [2]: Male, Female
native_country       [41]


All categorical features except sex (binary) need one-hot encoding (OHE). However, native_country has too many labels (41), so we exclude it from OHE and drop it.
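As an aside, a high-cardinality column does not have to be dropped outright. One hedged alternative is to collapse infrequent labels into a single bucket before encoding. The sketch below uses a toy series and an arbitrary frequency threshold of 2, both of which are illustrative assumptions rather than values from this notebook.

```python
import pandas as pd

# Toy stand-in for a high-cardinality column (illustrative values only)
country = pd.Series(["United-States"] * 6 + ["Mexico"] * 2 + ["Cuba", "India"])

# Keep labels that occur at least `min_count` times; the threshold is an assumption
min_count = 2
counts = country.value_counts()
keep = counts[counts >= min_count].index

# Replace every rare label with a single "Other" bucket, then one-hot encode
grouped = country.where(country.isin(keep), "Other")
dummies = pd.get_dummies(grouped, prefix="native_country", dtype=int)
print(dummies.sum())
```

This keeps the number of encoded columns small while retaining some signal from the column. Whether it helps on this dataset is untested here.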

In [13]:
### drop native_country and education (redundant to education_num)
df.drop(columns = ['native_country', 'education'], inplace = True)

In [14]:
### inspect target class
print('Train dataset :', dict(round(train['class'].value_counts() / len(train), 2)))
print('Test dataset  :', dict(round(test['class'].value_counts() / len(test), 2)))

Train dataset : {0: 0.76, 1: 0.24}
Test dataset  : {0: 0.76, 1: 0.24}


- The target classes are imbalanced (about 76% vs. 24%).
- However, the imbalance is the same in both the train and test datasets.
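The imbalance matters when reading accuracy numbers: a model that always predicts the majority class already reaches roughly the majority share. The following minimal sketch illustrates this with toy labels in a 76/24 split (the arrays are made up for illustration and are not the notebook's data).

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score, f1_score

# Toy labels mirroring the 76/24 class split (illustrative only)
y = np.array([0] * 76 + [1] * 24)
X = np.zeros((len(y), 1))  # the dummy model ignores the features

# Always predict the most frequent class
baseline = DummyClassifier(strategy="most_frequent").fit(X, y)
y_pred = baseline.predict(X)

baseline_acc = accuracy_score(y, y_pred)
# F1 for the positive class; zero_division=0 avoids a warning when class 1 is never predicted
baseline_f1 = f1_score(y, y_pred, zero_division=0)
print(f"accuracy: {baseline_acc:.2f}, F1 (class 1): {baseline_f1:.2f}")
```

Any accuracy reported later is best compared against this kind of baseline rather than against 0.5.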

##### ---- Missing values

In [15]:
# inspect missing values
100 * df.isna().sum(axis = 0) / len(df)

age               0.000000
workclass         5.730724
fnlwgt            0.000000
education_num     0.000000
martial_status    0.000000
occupation        5.751198
relationship      0.000000
race              0.000000
sex               0.000000
capital_gain      0.000000
capital_loss      0.000000
hours_per_week    0.000000
class             0.000000
kind              0.000000
dtype: float64

- The workclass and occupation columns have missing values.

In [16]:
df[df['workclass'].isna()][['workclass', 'occupation']]

Unnamed: 0,workclass,occupation
27,,
61,,
69,,
77,,
106,,
...,...,...
48682,,
48769,,
48800,,
48812,,


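It seems that workclass and occupation are missing together: the rows shown above lack both, although the non-null counts from `df.info()` (46043 vs. 46033) indicate that occupation is missing in 10 more rows than workclass.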
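One way to check the pairing directly is to cross-tabulate the two missingness indicators. The sketch below uses a small toy frame (hypothetical rows, not the census data) so it runs on its own. On the real data the same call would be applied to `df`.

```python
import numpy as np
import pandas as pd

# Toy rows (illustrative): two rows miss both columns, one misses only occupation
toy = pd.DataFrame({
    "workclass":  ["Private", np.nan, "State-gov", np.nan, "Never-worked"],
    "occupation": ["Sales", np.nan, "Adm-clerical", np.nan, np.nan],
})

# Rows: is workclass missing? Columns: is occupation missing?
missing_pairs = pd.crosstab(
    toy["workclass"].isna(),
    toy["occupation"].isna(),
    rownames=["workclass missing"],
    colnames=["occupation missing"],
)
print(missing_pairs)
```

A zero in the (True, False) cell means no row lacks workclass while still having an occupation.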

We will use other data from the remaining columns to impute the missing values.

In [17]:
### Impute missing values with a k-nearest-neighbors classifier from scikit-learn
# For simplicity, we will use only the numerical columns as predictors
num_cols = [c for c in df.select_dtypes(exclude='object').columns if c != 'class']
num_cols

['age',
 'fnlwgt',
 'education_num',
 'capital_gain',
 'capital_loss',
 'hours_per_week']

In [18]:
imp_df = df[num_cols + ['workclass', 'occupation']].copy()   # copy so later assignments do not touch df
imp_df.head(3)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass,occupation
0,39,77516,13,2174,0,40,State-gov,Adm-clerical
1,50,83311,13,0,0,13,Self-emp-not-inc,Exec-managerial
2,38,215646,9,0,0,40,Private,Handlers-cleaners


In [19]:
## Note: fnlwgt is the final weight, i.e. the number of people the census believes the entry represents.
# It can be normalized by dividing by the sum of all weights.
imp_df['fnlwgt'] = imp_df['fnlwgt'] / imp_df['fnlwgt'].sum()
imp_df.head(3)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass,occupation
0,39,8e-06,13,2174,0,40,State-gov,Adm-clerical
1,50,9e-06,13,0,0,13,Self-emp-not-inc,Exec-managerial
2,38,2.3e-05,9,0,0,40,Private,Handlers-cleaners


In [20]:
## Apply standard scaling on the features
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
imp_df[num_cols] = scaler.fit_transform(imp_df[num_cols])
imp_df.head(3)

Unnamed: 0,age,fnlwgt,education_num,capital_gain,capital_loss,hours_per_week,workclass,occupation
0,0.025996,-1.061979,1.136512,0.146932,-0.217127,-0.034087,State-gov,Adm-clerical
1,0.828308,-1.007104,1.136512,-0.144804,-0.217127,-2.213032,Self-emp-not-inc,Exec-managerial
2,-0.046942,0.246034,-0.419335,-0.144804,-0.217127,-0.034087,Private,Handlers-cleaners


In [21]:
### Fit KNN classifiers to predict the missing labels from the nearest neighbors
from sklearn.neighbors import KNeighborsClassifier

# Model for workclass imputation
wc_train = imp_df[~imp_df['workclass'].isna()]
wc_test  = imp_df[imp_df['workclass'].isna()].copy()   # copy to avoid SettingWithCopyWarning
model = KNeighborsClassifier(n_neighbors = 5)
model.fit(wc_train[num_cols], wc_train['workclass'])
wc_test['workclass'] = model.predict(wc_test[num_cols])

# Model for occupation imputation
occ_train = imp_df[~imp_df['occupation'].isna()]
occ_test  = imp_df[imp_df['occupation'].isna()].copy()  # copy to avoid SettingWithCopyWarning
model = KNeighborsClassifier(n_neighbors = 5)
model.fit(occ_train[num_cols], occ_train['occupation'])
occ_test['occupation'] = model.predict(occ_test[num_cols])

In [22]:
wc_after_imputation = pd.DataFrame(pd.concat([wc_train, wc_test], axis = 0)['workclass'])
occ_after_imputation = pd.DataFrame(pd.concat([occ_train, occ_test], axis = 0)['occupation'])
imputed = pd.concat([wc_after_imputation, occ_after_imputation], axis = 1)
imputed.head()

Unnamed: 0,workclass,occupation
0,State-gov,Adm-clerical
1,Self-emp-not-inc,Exec-managerial
2,Private,Handlers-cleaners
3,Private,Handlers-cleaners
4,Private,Prof-specialty


In [23]:
### Now replace the original columns with imputed
df.drop(columns = ['workclass', 'occupation'], inplace = True)
df = pd.concat([df, imputed], axis = 1)
df.tail()

Unnamed: 0,age,fnlwgt,education_num,martial_status,relationship,race,sex,capital_gain,capital_loss,hours_per_week,class,kind,workclass,occupation
48837,39,215419,13,Divorced,Not-in-family,White,Female,0,0,36,0,test,Private,Prof-specialty
48838,64,321403,9,Widowed,Other-relative,Black,Male,0,0,40,0,test,Private,Machine-op-inspct
48839,38,374983,13,Married-civ-spouse,Husband,White,Male,0,0,50,0,test,Private,Prof-specialty
48840,44,83891,13,Divorced,Own-child,Asian-Pac-Islander,Male,5455,0,40,0,test,Private,Adm-clerical
48841,35,182148,13,Married-civ-spouse,Husband,White,Male,0,0,60,1,test,Self-emp-inc,Exec-managerial


In [24]:
## Verify for missing values
100 * df.isna().sum(axis = 0) / len(df)

age               0.0
fnlwgt            0.0
education_num     0.0
martial_status    0.0
relationship      0.0
race              0.0
sex               0.0
capital_gain      0.0
capital_loss      0.0
hours_per_week    0.0
class             0.0
kind              0.0
workclass         0.0
occupation        0.0
dtype: float64

##### ---- Encoding categorical features

In [25]:
# Boolean encoding : sex
sex_mapper = {'Male': 1, 'Female': 0}
df['sex'] = df['sex'].map(sex_mapper)

In [26]:
### Scaling numerical features
scaler = StandardScaler()
df[num_cols] = scaler.fit_transform(df[num_cols])
df.head()

Unnamed: 0,age,fnlwgt,education_num,martial_status,relationship,race,sex,capital_gain,capital_loss,hours_per_week,class,kind,workclass,occupation
0,0.025996,-1.061979,1.136512,Never-married,Not-in-family,White,1,0.146932,-0.217127,-0.034087,0,train,State-gov,Adm-clerical
1,0.828308,-1.007104,1.136512,Married-civ-spouse,Husband,White,1,-0.144804,-0.217127,-2.213032,0,train,Self-emp-not-inc,Exec-managerial
2,-0.046942,0.246034,-0.419335,Divorced,Not-in-family,White,1,-0.144804,-0.217127,-0.034087,0,train,Private,Handlers-cleaners
3,1.047121,0.426663,-1.197259,Married-civ-spouse,Husband,Black,1,-0.144804,-0.217127,-0.034087,0,train,Private,Handlers-cleaners
4,-0.776316,1.40853,1.136512,Married-civ-spouse,Wife,Black,0,-0.144804,-0.217127,-0.034087,0,train,Private,Prof-specialty
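The scaler above is fit on the combined train and test rows. A common alternative, sketched here under the assumption that test data should stay unseen during preprocessing, is to fit the scaler on the training rows only and reuse its statistics for the test rows. The arrays are synthetic and the names `X_tr` and `X_te` are placeholders.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Synthetic stand-ins for the numeric train/test features (illustrative only)
rng = np.random.default_rng(0)
X_tr = rng.normal(loc=40, scale=10, size=(100, 2))
X_te = rng.normal(loc=40, scale=10, size=(50, 2))

# Learn mean and standard deviation from the training rows only
scaler = StandardScaler().fit(X_tr)
X_tr_scaled = scaler.transform(X_tr)
X_te_scaled = scaler.transform(X_te)  # reuses the training statistics

# StandardScaler computes z = (x - mean) / std with the population std (ddof=0)
manual = (X_tr - X_tr.mean(axis=0)) / X_tr.std(axis=0)
print(np.allclose(X_tr_scaled, manual))
```

With this split, the test columns will not have exactly zero mean and unit variance, which is expected.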


In [27]:
### One-hot-encoding features
ohe_features = ['martial_status', 'relationship', 'race', 'workclass', 'occupation']
ohe_df = pd.concat([pd.get_dummies(df[feature], prefix=feature, dtype = int) for feature in ohe_features], axis = 1)
df.drop(columns = ohe_features, inplace = True)
df = pd.concat([df, ohe_df], axis = 1)
df.head()

Unnamed: 0,age,fnlwgt,education_num,sex,capital_gain,capital_loss,hours_per_week,class,kind,martial_status_Divorced,...,occupation_Farming-fishing,occupation_Handlers-cleaners,occupation_Machine-op-inspct,occupation_Other-service,occupation_Priv-house-serv,occupation_Prof-specialty,occupation_Protective-serv,occupation_Sales,occupation_Tech-support,occupation_Transport-moving
0,0.025996,-1.061979,1.136512,1,0.146932,-0.217127,-0.034087,0,train,0,...,0,0,0,0,0,0,0,0,0,0
1,0.828308,-1.007104,1.136512,1,-0.144804,-0.217127,-2.213032,0,train,0,...,0,0,0,0,0,0,0,0,0,0
2,-0.046942,0.246034,-0.419335,1,-0.144804,-0.217127,-0.034087,0,train,1,...,0,1,0,0,0,0,0,0,0,0
3,1.047121,0.426663,-1.197259,1,-0.144804,-0.217127,-0.034087,0,train,0,...,0,1,0,0,0,0,0,0,0,0
4,-0.776316,1.40853,1.136512,0,-0.144804,-0.217127,-0.034087,0,train,0,...,0,0,0,0,0,1,0,0,0,0


In [28]:
print(f'# of features after encoding: {df.shape[1] - 2}')       # -2 for kind and class columns

# of features after encoding: 47


##### ---- Split dataset

In [29]:
train = df[df['kind'] == 'train'].drop(columns = ['kind'])
test = df[df['kind'] == 'test'].drop(columns = ['kind'])

print(f'Sample in train dataset: {len(train)}')
print(f'Sample in test dataset : {len(test)}')

Sample in train dataset: 32561
Sample in test dataset : 16281


### Task 2:
#### AutoML:
- Select an AutoML tool or library (e.g., Google Cloud AutoML, H2O.ai, TPOT, Auto-Sklearn).
- Use the tool to automatically select a model, hyperparameters, and optionally, feature engineering techniques.
- Train the model on the training data.
- Evaluate the model's performance on the test data using appropriate metrics (e.g., accuracy, F1-score, RMSE).

In [30]:
### AutoML tool: TPOT
from tpot import TPOTClassifier

### performance metrics
from sklearn.metrics import accuracy_score
from sklearn.metrics import f1_score

In [31]:
X_train = train.drop(columns = ['class'])
y_train = train['class']

In [32]:
tpot = TPOTClassifier(generations=20, population_size=20, verbosity=2, random_state=42, 
                      scoring = 'f1_weighted', n_jobs = -1, early_stop = 5)
tpot.fit(X_train, y_train)
tpot.export('tpot_pipeline.py')

Optimization Progress:   0%|          | 0/420 [00:00<?, ?pipeline/s]


Generation 1 - Current best internal CV score: 0.8588304140330105

Generation 2 - Current best internal CV score: 0.8599323024229598

Generation 3 - Current best internal CV score: 0.8599323024229598

Generation 4 - Current best internal CV score: 0.8609714474716654

Generation 5 - Current best internal CV score: 0.8611831398331955

Generation 6 - Current best internal CV score: 0.8611831398331955

Generation 7 - Current best internal CV score: 0.8636886732875129

Generation 8 - Current best internal CV score: 0.8636886732875129

Generation 9 - Current best internal CV score: 0.8636886732875129

Generation 10 - Current best internal CV score: 0.8642812073968715

Generation 11 - Current best internal CV score: 0.866331639870122

Generation 12 - Current best internal CV score: 0.866331639870122

Generation 13 - Current best internal CV score: 0.8668457395444383

Generation 14 - Current best internal CV score: 0.8668457395444383

Generation 15 - Current best internal CV score: 0.86751700

In [33]:
### Estimate train accuracy
accuracy_score(y_train, tpot.predict(X_train))

0.900832284020761

In [34]:
### Make predictions for the test dataset
X_original_test = test.drop(columns = ['class'])
y_original_test = test['class']

y_pred = tpot.predict(X_original_test)
accuracy_score(y_original_test, y_pred)

0.8705853448805356
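Because the classes are imbalanced, accuracy alone can hide weak performance on the minority class. The notebook imports `f1_score` but reports only accuracy, so here is a small sketch of how F1 and a confusion matrix could be reported next to it. The label arrays are toy values, not the model's actual predictions.

```python
from sklearn.metrics import accuracy_score, confusion_matrix, f1_score

# Toy ground truth and predictions (illustrative only)
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_hat  = [0, 0, 0, 0, 0, 1, 1, 1, 0, 0]

acc = accuracy_score(y_true, y_hat)
f1_pos = f1_score(y_true, y_hat)                           # F1 for the positive class (>50K)
f1_weighted = f1_score(y_true, y_hat, average="weighted")  # same averaging as scoring='f1_weighted'
cm = confusion_matrix(y_true, y_hat)                       # rows: true class, columns: predicted class

print(f"accuracy={acc:.3f}  f1(class 1)={f1_pos:.3f}  f1(weighted)={f1_weighted:.3f}")
print(cm)
```

In the notebook, the same calls would take `y_original_test` and `y_pred` as arguments.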

### Task 3
#### Comparison (Optional for CS451 students):
- Implement a traditional machine learning pipeline (e.g., using scikit-learn) for the same dataset: select a model, perform manual hyperparameter tuning, etc.
- Compare the results of your traditional pipeline with the AutoML results in terms of performance, time consumption, and other relevant metrics.

In [35]:
# Traditional ML pipeline
from sklearn.ensemble import GradientBoostingClassifier

List of hyperparameters:
- Number of estimators
- Learning rate
- Max depth
- Max features
- Min samples in a leaf
- Min samples in a split
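The cells that follow tune these hyperparameters one at a time and compare candidates by test accuracy. For reference, a cross-validated grid search is a common alternative that selects hyperparameters using only the training data. The sketch below is a minimal illustration on synthetic data with a deliberately tiny grid. The grid values and the names `X_demo` and `y_demo` are assumptions chosen to keep it fast, not the settings used in this notebook.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Small synthetic binary classification problem (illustrative only)
X_demo, y_demo = make_classification(n_samples=300, n_features=10, random_state=42)

# Tiny grid over three of the hyperparameters listed above
param_grid = {
    "n_estimators": [20, 40],
    "learning_rate": [0.05, 0.1],
    "max_depth": [2, 3],
}

# 3-fold cross-validation on the training data picks the best combination
search = GridSearchCV(
    GradientBoostingClassifier(random_state=42),
    param_grid,
    scoring="accuracy",
    cv=3,
)
search.fit(X_demo, y_demo)

print(search.best_params_)
print(f"best CV accuracy: {search.best_score_:.3f}")
```

The held-out test set would then be used once, on `search.best_estimator_`, to report the final score.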

In [36]:
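def evaluate(model):
    """Print train and test accuracy for a fitted model (uses the notebook-level splits)."""
    # accuracy_score expects (y_true, y_pred)
    train_acc = accuracy_score(y_train, model.predict(X_train))
    test_acc = accuracy_score(y_original_test, model.predict(X_original_test))

    print('Train accuracy:', train_acc)
    print('Test accuracy :', test_acc)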

In [37]:
### out of box baseline with default parameters
model = GradientBoostingClassifier(
    n_estimators = 100,
    learning_rate = 0.1,
    max_depth = 3,
    max_features = None,
    min_samples_leaf = 1,
    min_samples_split = 2,
    random_state = 42
)
model.fit(X_train, y_train)
evaluate(model)

Train accuracy: 0.8688615214520439
Test accuracy : 0.8694797616854002


In [38]:
# increase n_estimators to 200, 300, 400, 500, 1000
for n in [200, 300, 400, 500, 1000]:
    model = GradientBoostingClassifier(
        n_estimators = n,
        learning_rate = 0.1,
        max_depth = 3,
        max_features = None,
        min_samples_leaf = 1,
        min_samples_split = 2,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}')
    evaluate(model)
    print()

num_estimators: 200
Train accuracy: 0.8758023402229661
Test accuracy : 0.8726736686935692

num_estimators: 300
Train accuracy: 0.8800712508829581
Test accuracy : 0.8745163073521283

num_estimators: 400
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 500
Train accuracy: 0.8868892233039526
Test accuracy : 0.8749462563724587

num_estimators: 1000
Train accuracy: 0.895826295261202
Test accuracy : 0.8731036177138997



In [39]:
### Best num_estimators (400) - best score 0.87525
# change learning rate from [0.001, 0.01, 0.1, 0.2, 0.5]
n = 400
for lr in  [0.001, 0.01, 0.1, 0.2, 0.5]:
    model = GradientBoostingClassifier(
        n_estimators = n,     # optimized
        learning_rate = lr,
        max_depth = 3,
        max_features = None,
        min_samples_leaf = 1,
        min_samples_split = 2,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n} and learning rate: {lr}')
    evaluate(model)
    print()

num_estimators: 400 and learning rate: 0.001
Train accuracy: 0.7591904425539756
Test accuracy : 0.7637737239727289

num_estimators: 400 and learning rate: 0.01
Train accuracy: 0.8591566598077455
Test accuracy : 0.8600208832381303

num_estimators: 400 and learning rate: 0.1
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400 and learning rate: 0.2
Train accuracy: 0.8942907158871042
Test accuracy : 0.8731650390025183

num_estimators: 400 and learning rate: 0.5
Train accuracy: 0.9132397653634716
Test accuracy : 0.8662244333886125



In [40]:
### Best parameters (test score 0.87525)
# num_estimators (400)
# learning_rate (0.1)

# change max depth from [2, 3, 5, 7]
n = 400
lr = 0.1
for d in [2, 3, 5, 7]:
    model = GradientBoostingClassifier(
        n_estimators = n,     # optimized
        learning_rate = lr,
        max_depth = d,
        max_features = None,
        min_samples_leaf = 1,
        min_samples_split = 2,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}, learning rate: {lr}, max depth: {d},')
    evaluate(model)
    print()

num_estimators: 400, learning rate: 0.1, max depth: 2,
Train accuracy: 0.8739289333865667
Test accuracy : 0.8701553958602052

num_estimators: 400, learning rate: 0.1, max depth: 3,
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 5,
Train accuracy: 0.9083566229538405
Test accuracy : 0.872550826116332

num_estimators: 400, learning rate: 0.1, max depth: 7,
Train accuracy: 0.95009367034182
Test accuracy : 0.8670843314292734



In [41]:
### Best parameters (test score 0.87525)
# num_estimators (400)
# learning_rate (0.1)
# max depth (3)

n = 400
lr = 0.1
d = 3
for mf in [None, 0.1, 0.2, 0.3, 0.4, 0.5, 0.6, 'sqrt', 'log2']:
    model = GradientBoostingClassifier(
        n_estimators = n,     # optimized
        learning_rate = lr,
        max_depth = d,
        max_features = mf,
        min_samples_leaf = 1,
        min_samples_split = 2,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}, learning rate: {lr}, max depth: {d}, max features: {mf}')
    evaluate(model)
    print()

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.1
Train accuracy: 0.8721169497251313
Test accuracy : 0.8677599656040784

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.2
Train accuracy: 0.8777985934092933
Test accuracy : 0.8732264602911369

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.3
Train accuracy: 0.880163385645404
Test accuracy : 0.8733493028683742

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.4
Train accuracy: 0.8803169435828138
Test accuracy : 0.8729807751366624

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.5
Train accuracy: 0.881606830257056
Test accuracy : 0.8726736686935692

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: 0.6
Train accuracy: 0.883050274868708
Test accuracy : 0.873349302868

In [42]:
### Best parameters (test score 0.87525)
# num_estimators (400)
# learning_rate (0.1)
# max depth (3)
# max features (None)

n = 400
lr = 0.1
d = 3
mf = None
for min_samples_leaf in [1, 3, 5, 7, 9, 11, 13]:
    model = GradientBoostingClassifier(
        n_estimators = n,     # optimized
        learning_rate = lr,
        max_depth = d,
        max_features = mf,
        min_samples_leaf = min_samples_leaf,
        min_samples_split = 2,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}, learning rate: {lr}, max depth: {d}, max features: {mf}, min_samples_leaf: {min_samples_leaf}')
    evaluate(model)
    print()

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 3
Train accuracy: 0.8836030834433832
Test accuracy : 0.8746391499293655

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 5
Train accuracy: 0.8833573907435275
Test accuracy : 0.8747005712179842

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 7
Train accuracy: 0.8834802370934554
Test accuracy : 0.8751919415269332

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 9
Train accuracy: 0.8839101993182028
Test accuracy : 0.875069098949696

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 11
Train accuracy: 0.8834188139184914
Test accuracy : 0.8748848350838401

num_

In [43]:
### Best parameters (test score 0.87525)
# num_estimators (400)
# learning_rate (0.1)
# max depth (3)
# max features (None)
# min samples in a leaf (1)

n = 400
lr = 0.1
d = 3
mf = None
min_samples_leaf = 1
for min_samples_split in [2, 4, 8, 16, 32]:
    model = GradientBoostingClassifier(
        n_estimators = n,     # optimized
        learning_rate = lr,
        max_depth = d,
        max_features = mf,
        min_samples_leaf = min_samples_leaf,
        min_samples_split = min_samples_split,
        random_state = 42
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}, learning rate: {lr}, max depth: {d}, max features: {mf}, min_samples_leaf: {min_samples_leaf}, min_samples_split: {min_samples_split}')
    evaluate(model)
    print()

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 2
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 4
Train accuracy: 0.8842480267805043
Test accuracy : 0.8748234137952214

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 8
Train accuracy: 0.8839409109056847
Test accuracy : 0.8742706221976537

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 16
Train accuracy: 0.8838794877307208
Test accuracy : 0.8748848350838401

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 32
Train accuracy: 0.8833881023310095
Test accuracy : 0.8745777286407469



In [44]:
### Best parameters (test score 0.87525)
# num_estimators (400)
# learning_rate (0.1)
# max depth (3)
# max features (None)
# min_samples_leaf (1)
# min_samples_split (2)

n = 400
lr = 0.1
d = 3
mf = None
min_samples_leaf = 1
min_samples_split = 2
for seed in [42, 512, 2048, 4096]:
    model = GradientBoostingClassifier(
        n_estimators = n,
        learning_rate = lr,
        max_depth = d,
        max_features = mf,
        min_samples_leaf = min_samples_leaf,
        min_samples_split = min_samples_split,
        random_state = seed
    )
    model.fit(X_train, y_train)
    print(f'num_estimators: {n}, learning rate: {lr}, max depth: {d}, max features: {mf}, min_samples_leaf: {min_samples_leaf}, min_samples_split: {min_samples_split}, seed: {seed}')
    evaluate(model)
    print()

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 2, seed: 42
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 2, seed: 512
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 2, seed: 2048
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519

num_estimators: 400, learning rate: 0.1, max depth: 3, max features: None, min_samples_leaf: 1, min_samples_split: 2, seed: 4096
Train accuracy: 0.8841251804305764
Test accuracy : 0.8752533628155519



#### Comparison between AutoML and manual hyperparameter search
| parameter/metric | AutoML | Manual |
| --- | --- | --- |
| n_estimators | 100 | 400 |
| learning_rate | 0.1 | 0.1 |
| max_depth | 8 | 3 |
| max_features | 0.35 | None |
| min_samples_leaf | 3 | 1 |
| min_samples_split | 17 | 2 |
| random_state | 42 | 42 |
| | | |
| train_accuracy | 90.08% | 88.41% |
| test_accuracy | 87.06% | 87.53% |
| | | |
| time consumption | 4 hrs | 0.5 hrs |
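The time figures in the table are coarse. If a more precise comparison is wanted, wall-clock time can be measured around each `fit` call. The helper `timed_fit` below is a hypothetical name introduced for illustration, and the data is synthetic.

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Small synthetic dataset so the example runs quickly (illustrative only)
X_demo, y_demo = make_classification(n_samples=500, n_features=10, random_state=0)

def timed_fit(model, X, y):
    """Fit the model and return it together with the elapsed wall-clock seconds."""
    start = time.perf_counter()
    model.fit(X, y)
    return model, time.perf_counter() - start

fitted, seconds = timed_fit(
    GradientBoostingClassifier(n_estimators=50, random_state=0), X_demo, y_demo
)
print(f"fit took {seconds:.2f} s")
```

The same wrapper could be applied to the TPOT search and to each manual tuning loop to put both approaches on a common scale.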

### Task 4
#### Analysis:
- Discuss the benefits and limitations of using AutoML based on your experience.
    - Benefits of AutoML: (1) it tries various ML models and selects the algorithm with the best score, (2) it performs hyperparameter tuning in the process, and (3) it implicitly performs feature selection as well.
    - Limitations of AutoML: it offers limited control over which models are considered and which hyperparameter ranges are searched in the tuning step. The framework runs only a limited number of iterations and does not guarantee the best score. The tool also consumes a large amount of time because it evaluates many pipelines. The early-stopping argument helps, but we still cannot guarantee that the framework reached the best-performing model and hyperparameters.

- Reflect on the model choices and hyperparameters the AutoML tool selected. Were there any surprises?
    - The AutoML tool tries various ML models, including random forest, gradient boosting, and XGBoost, among others. I expected the tool to select XGBoost because it usually performs better, but it selected GradientBoostingClassifier.

### END