Variant 2
In this variant, you will compare two existing implementations of classifiers. You can choose any two
existing implementations of classification models. Train and test them on the dataset provided in the
beginning. Compare the two models using techniques for classification model comparison.

Reporting
Your submission for this assignment is a single PDF file with a report on the assignment. Your report
should be no longer than two pages. Somewhere at the top of the first page should be: your matric
number, full name, and a line “IN6227-2023-Assignment-1.2”. The only requirement for report
formatting is that it is readable; otherwise, you are free to arrange information in any way you prefer.
Make sure to provide a full performance comparison of the two models, including the time it took to
train and apply each model. Explain all decisions you make along the way, e.g., how you fine-tune
model hyper-parameters, how you handle missing values, what the stopping criterion is, etc. If
you do any data pre-processing, please explain what was done and why.
Please upload your source code to GitHub and provide the repository link in the report.
Submission
Submission should be done in NTULearn. Access the assignment submission page through the left
navigation bar by selecting “Assignments”. Submit a single PDF file. Submissions are accepted up to
Friday, 3rd March 2023, 23:59:59.

In [1]:
import pandas as pd
#for random forest
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
#for MLP
from numpy import loadtxt
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense
import tensorflow as tf
from tensorflow import keras
#Preprocessing libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

#https://archive.ics.uci.edu/ml/datasets/Census%2BIncome
headers = [
    'age','workclass', 'fnlwgt', 'education', 'edu-num', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week', 'native-country' , 'label'     
          ]
data =  pd.read_csv('adult.data', sep=",", names = headers) 
data.columns

#get test data
test =  pd.read_csv('adult.test', sep=",", names = headers)
test = test.iloc[1:, :]  # drop the first row, which is a non-data header line in adult.test

In [56]:
data.label.unique() #only two kinds of labels

array([' <=50K', ' >50K'], dtype=object)
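
With only two labels, a useful reference point for any accuracy figure is the accuracy of always predicting the more frequent label. The sketch below computes that baseline; the helper name `majority_baseline` and the toy label list are hypothetical, and in this notebook the input would be `data.label`.

```python
from collections import Counter

def majority_baseline(labels):
    """Return (most frequent label, accuracy of always predicting it)."""
    counts = Counter(labels)
    label, count = counts.most_common(1)[0]
    return label, count / len(labels)

# Hypothetical toy labels, not the real class distribution of the dataset.
toy_labels = [' <=50K'] * 3 + [' >50K']
baseline_label, baseline_acc = majority_baseline(toy_labels)
print(baseline_label, baseline_acc)
```

A model is only informative to the extent that it beats this number on the same data.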

In [2]:
def preprocessing(data, pca_num):  # encode, scale, and apply PCA to the data
    # encode every column as integer codes (note: this modifies `data` in place)
    encoder = LabelEncoder()
    for col in data.columns:
        data[col] = encoder.fit_transform(data[col])
    # split the data set into features and label
    labels = data['label']
    features = data.drop("label", axis='columns')
    # scale the data
    features = StandardScaler().fit_transform(features)
    # apply PCA, keeping pca_num principal components
    pca = PCA(n_components=pca_num)
    features = pca.fit_transform(features)
    return features, labels

In [3]:
#decision: to use Random Forest and MLP
#MLP
def preprocessing(data, pca_num):  # encode and scale the data; this redefinition skips PCA
    # encode every column as integer codes (note: this modifies `data` in place)
    encoder = LabelEncoder()
    for col in data.columns:
        data[col] = encoder.fit_transform(data[col])
    # split the data set into features and label
    labels = data['label']
    features = data.drop("label", axis='columns')
    # scale the data
    features = StandardScaler().fit_transform(features)
    # PCA is commented out in this version, so pca_num is accepted but unused
#     pca = PCA(n_components=pca_num)
#     features = pca.fit_transform(features)
    return features, labels
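
The function above fits a fresh `LabelEncoder` and `StandardScaler` on whichever frame it receives, so the training and test sets are encoded and scaled independently of each other. One common alternative, sketched here under assumptions, is to fit the transformations on the training data only and reuse them on the test data. The toy frames, the helper `split_xy`, and the choice of one-hot encoding are illustrative and are not what this notebook does.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import OneHotEncoder, StandardScaler

# Toy stand-ins for the training and test frames (hypothetical values).
train_df = pd.DataFrame({
    "age": [25, 38, 52, 46],
    "workclass": ["Private", "Self-emp", "Private", "?"],
    "label": ["<=50K", ">50K", ">50K", "<=50K"],
})
test_df = pd.DataFrame({
    "age": [30, 61],
    "workclass": ["Private", "Federal-gov"],  # second category is unseen in training
    "label": ["<=50K.", ">50K."],             # adult.test labels end with a period
})

def split_xy(df):
    """Return the feature frame and a 0/1 label vector (1 means >50K)."""
    y = df["label"].str.strip().str.rstrip(".").eq(">50K").astype(int)
    return df.drop(columns="label"), y

X_train, y_train = split_xy(train_df)
X_test, y_test = split_xy(test_df)

transformer = ColumnTransformer(
    [
        ("num", StandardScaler(), ["age"]),
        ("cat", OneHotEncoder(handle_unknown="ignore"), ["workclass"]),
    ],
    sparse_threshold=0,  # always return a dense array
)

train_features = transformer.fit_transform(X_train)  # statistics learned here
test_features = transformer.transform(X_test)        # and only applied here
print(train_features.shape, test_features.shape)
```

With `handle_unknown="ignore"`, a category that appears only in the test set is encoded as all zeros instead of raising an error. In this sketch the `?` placeholder is simply treated as its own category, which is one of several possible ways to handle missing values.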

In [16]:
#Using MLP with some basic tuning
component_num = 14
features, labels = preprocessing(data,component_num)
test_features,test_labels = preprocessing(test,component_num)
MLP = Sequential()
MLP.add(Dense(units = 50, input_shape=(14,), activation='relu'))
MLP.add(Dense(units = 200, activation='relu'))
MLP.add(Dense(1, activation='sigmoid'))
MLP.compile(optimizer='adam',
                loss= 'binary_crossentropy',
                metrics=['accuracy'])
MLP.fit(features,labels, epochs = 10,batch_size = 5)
MLP.evaluate(test_features,test_labels)

Epoch 1/10
...
Epoch 10/10


[0.3188125193119049, 0.8500092029571533]
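
The assignment asks for the time taken to train and to apply each model, which the cell above does not record. A minimal sketch of one way to measure it with the standard library follows; the helper `timed` is hypothetical, and the `sum` call is only a stand-in for calls such as `MLP.fit(...)` or `MLP.evaluate(...)`.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Stand-in workload; replace with the training or evaluation call to be timed.
total, seconds = timed(sum, range(1_000_000))
print(f"result={total}, elapsed={seconds:.4f}s")
```

`time.perf_counter` measures wall-clock time, so the figures depend on the machine and on whatever else is running; repeating the measurement a few times gives a more stable number.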

In [12]:
#Hyperparameter tuning with Keras Tuner, following https://www.tensorflow.org/tutorials/keras/keras_tuner
import keras_tuner as kt
component_num = 14
features, labels = preprocessing(data,component_num)
test_features,test_labels = preprocessing(test,component_num)
def model_builder(hp):
    MLP = Sequential()
    hp_first_layer = hp.Int('first_layer', min_value=16, max_value=512, step=32)
    MLP.add(Dense(units = hp_first_layer, input_shape=(14,), activation='relu'))
    hp_second_layer = hp.Int('second_layer', min_value=16, max_value=512, step=32)
    MLP.add(Dense(units = hp_second_layer, activation='relu'))
    MLP.add(Dense(1, activation='sigmoid'))

    # Tune the learning rate for the optimizer
    # Choose an optimal value from 0.01, 0.001, or 0.0001
    hp_learning_rate = hp.Choice('learning_rate', values=[1e-2, 1e-3, 1e-4])
    MLP.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=hp_learning_rate),
                loss= 'binary_crossentropy',
                metrics=['accuracy'])

    return MLP

tuner = kt.Hyperband(model_builder,
                     objective='val_accuracy',
                     max_epochs=30,
                     factor=3,
                     directory='my_dir',
                     project_name='intro_to_kt'
                    )
stop_early = tf.keras.callbacks.EarlyStopping(monitor='val_loss', patience=5)

tuner.search(features, labels, epochs=50, validation_split=0.2, callbacks=[stop_early])

# Get the optimal hyperparameters
best_hps=tuner.get_best_hyperparameters(num_trials=1)[0]

print(f"""
The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is {best_hps.get('first_layer')}, the optimal number of units in the second densely-connected
layer is {best_hps.get('second_layer')}, and the optimal learning rate for the optimizer
is {best_hps.get('learning_rate')}.
""")

Trial 88 Complete [00h 00m 29s]
val_accuracy: 0.8542914390563965

Best val_accuracy So Far: 0.8549055457115173
Total elapsed time: 00h 28m 26s
INFO:tensorflow:Oracle triggered exit

The hyperparameter search is complete. The optimal number of units in the first densely-connected
layer is 368, the optimal number of units in the second densely-connected
layer is 304, and the optimal learning rate for the optimizer
is 0.0001.



In [13]:
# Build the model with the optimal hyperparameters and train it on the data for 50 epochs
model = tuner.hypermodel.build(best_hps)
history = model.fit(features, labels, epochs=50, validation_split=0.2)

val_acc_per_epoch = history.history['val_accuracy']
best_epoch = val_acc_per_epoch.index(max(val_acc_per_epoch)) + 1
print('Best epoch: %d' % (best_epoch,))

Epoch 1/50
...
Epoch 50/50
Best epoch: 23


In [14]:
hypermodel = tuner.hypermodel.build(best_hps)
# Retrain the model
hypermodel.fit(features, labels, epochs=best_epoch, validation_split=0.2)
eval_result = hypermodel.evaluate(test_features,test_labels)
print("[test loss, test accuracy]:", eval_result)

Epoch 1/23
...
Epoch 23/23
[test loss, test accuracy]: [0.31774643063545227, 0.8495792746543884]
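
Test accuracy alone does not show whether one classifier is significantly better than another on the same test set. One standard technique for that comparison is McNemar's test, which looks only at the samples where the two models disagree. The sketch below is a plain-Python version with continuity correction. The function name and the toy predictions are hypothetical; for the Keras model, the class predictions would have to be obtained first, for example by thresholding the sigmoid output at 0.5.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """McNemar's test with continuity correction.

    Returns (only_a, only_b, statistic, p_value), where only_a counts samples
    that only model A classified correctly and only_b counts samples that
    only model B classified correctly.
    """
    only_a = only_b = 0
    for truth, a, b in zip(y_true, pred_a, pred_b):
        a_ok, b_ok = (a == truth), (b == truth)
        if a_ok and not b_ok:
            only_a += 1
        elif b_ok and not a_ok:
            only_b += 1
    if only_a + only_b == 0:
        return only_a, only_b, 0.0, 1.0  # the models never disagree
    stat = (abs(only_a - only_b) - 1) ** 2 / (only_a + only_b)
    # Survival function of the chi-square distribution with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(stat / 2))
    return only_a, only_b, stat, p_value

# Hypothetical toy predictions, not results from this notebook.
y_true = [1] * 10
pred_a = [1, 1, 1, 1, 1, 1, 1, 1, 0, 0]
pred_b = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
only_a, only_b, stat, p_value = mcnemar_test(y_true, pred_a, pred_b)
print(only_a, only_b, round(stat, 3), round(p_value, 3))
```

The chi-square approximation is usually considered reasonable only when the number of disagreements is fairly large; for small counts such as in this toy example, an exact binomial version of the test is the safer choice.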
