Variant 2
In this variant, you will compare two existing implementations of classifiers. You can choose any two
existing implementations of classification models. Train and test them on the dataset provided in the
beginning. Compare the two models using techniques for classification model comparison.
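One common technique for comparing two classifiers evaluated on the same test set is McNemar's test, which looks only at the examples where the two models disagree. The sketch below is a minimal, standard-library-only illustration with a continuity correction; the function name `mcnemar_test` and the toy predictions are hypothetical and not part of the assignment.

```python
import math

def mcnemar_test(y_true, pred_a, pred_b):
    """Return (chi-square statistic, p-value) for McNemar's test with continuity correction."""
    # b: model A correct and model B wrong; c: model A wrong and model B correct
    b = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a == t and bb != t)
    c = sum(1 for t, a, bb in zip(y_true, pred_a, pred_b) if a != t and bb == t)
    if b + c == 0:
        return 0.0, 1.0  # the models never disagree, so there is no evidence of a difference
    stat = max(abs(b - c) - 1, 0) ** 2 / (b + c)
    # survival function of a chi-square distribution with 1 degree of freedom
    p_value = math.erfc(math.sqrt(stat / 2))
    return stat, p_value

# Toy example (made-up labels and predictions)
toy_true = [1, 1, 0, 0, 1, 0, 1, 0]
toy_a = [1, 1, 0, 0, 1, 0, 0, 0]
toy_b = [1, 0, 1, 0, 0, 0, 1, 1]
print(mcnemar_test(toy_true, toy_a, toy_b))
```

A small p-value suggests that the two models make different kinds of errors; with few disagreements an exact binomial version of the test is usually preferred.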

Reporting
Your submission for this assignment is a single PDF file with a report on the assignment. Your report
should be no longer than two pages. Somewhere at the top of the first page should be: your matric
number, full name, and a line “IN6227-2023-Assignment-1.2”. The only requirement for report
formatting is that it is readable; otherwise, you are free to arrange information in any way you prefer.
Make sure to provide a full performance comparison for the two models, including the time it took to
train and apply each model. Explain all decisions you make along the way, e.g., how you fine-tune
model hyper-parameters, how you handle missing values, what the stopping criterion is, etc. If
you do any data pre-processing, please explain what was done and why.
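Training and prediction time can be measured with a small wrapper around `time.perf_counter`. The helper name `timed` is hypothetical; this is only a sketch of one way to collect the timings the report asks for.

```python
import time

def timed(fn, *args, **kwargs):
    """Call fn(*args, **kwargs) and return (result, elapsed_seconds)."""
    start = time.perf_counter()
    result = fn(*args, **kwargs)
    return result, time.perf_counter() - start

# Toy usage with a built-in; with a classifier this could be
# timed(clf.fit, features, labels) and timed(clf.predict, test_features).
total, seconds = timed(sum, range(1_000_000))
print(f"sum took {seconds:.4f} s")
```

Repeating each measurement a few times and reporting the mean makes the comparison less sensitive to background load.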
Please upload your source code to GitHub and provide the repository link in the report.
Submission
Submission should be done in NTULearn. Access the assignment submission page through the left
navigation bar by selecting “Assignments”. Submit a single PDF file. Submissions are accepted up to
Friday, 3rd March 2023, 23:59:59.

In [30]:
import pandas as pd
import numpy as np
#for random forest
from sklearn import tree
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import cross_val_score
from sklearn.ensemble import RandomForestClassifier
#Preprocessing libraries
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.model_selection import RandomizedSearchCV

# Source: https://archive.ics.uci.edu/ml/datasets/Census%2BIncome
headers = [
    'age', 'workclass', 'fnlwgt', 'education', 'edu-num', 'marital-status', 'occupation',
    'relationship', 'race', 'sex', 'capital-gain', 'capital-loss', 'hours-per-week',
    'native-country', 'label'
]
data = pd.read_csv('adult.data', sep=",", names=headers)
data.columns

# Load the test data; the first line of adult.test is not a data record, so drop it
test = pd.read_csv('adult.test', sep=",", names=headers)
test = test.iloc[1:, :]
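The raw UCI files put a space after each comma, mark missing values with `?`, and `adult.test` ends each label with a period. A loader along the following lines would normalise all three; the function name `load_adult` and the three-row sample are hypothetical, and this is a sketch rather than the loading code used for the results in this notebook.

```python
import io
import pandas as pd

def load_adult(source, columns, skiprows=0):
    """Read an Adult-style CSV, strip padding, and turn '?' into NaN."""
    df = pd.read_csv(source, names=columns, skipinitialspace=True,
                     na_values="?", skiprows=skiprows)
    # adult.test writes labels as '<=50K.' and '>50K.'; drop the trailing period
    df["label"] = df["label"].str.rstrip(".")
    return df

# Made-up sample mimicking the layout of adult.test (first line is not a record)
sample = io.StringIO("|1x3 Cross validator\n25, Private, ?, <=50K.\n38, ?, Sales, >50K.\n")
toy_columns = ["age", "workclass", "occupation", "label"]
toy = load_adult(sample, toy_columns, skiprows=1)
print(toy)
print(toy.isna().sum())
```

With missing values exposed as NaN, the choice between dropping rows and imputing a value becomes an explicit step that can be described in the report.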

In [33]:
def preprocessing(data, pca_num, pca=False):
    """Label-encode every column, standardize the features, and optionally apply PCA.

    Note: the encoder, scaler, and PCA are fitted on whatever frame is passed in,
    so calling this separately on the training and test sets fits them twice.
    """
    data = data.copy()  # avoid modifying the caller's DataFrame in place
    # encode every column (including the label) as integers
    encoder = LabelEncoder()
    for col in data.columns:
        data[col] = encoder.fit_transform(data[col])
    # split the data set into features and label
    labels = data['label']
    features = data.drop("label", axis='columns')
    # scale the features to zero mean and unit variance
    features = StandardScaler().fit_transform(features)
    # optionally project the features onto the first pca_num principal components
    if pca:
        features = PCA(n_components=pca_num).fit_transform(features)
    return features, labels
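The `preprocessing` function above fits its encoder, scaler, and PCA on whichever frame it receives, so the test set ends up transformed with different parameters than the training set. A common alternative is to fit the transformations on the training data only and reuse them on the test data, for example with a scikit-learn pipeline. The sketch below uses a tiny made-up frame; `build_model` and the column lists are hypothetical, and the accuracies reported in this notebook were not produced this way.

```python
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import OrdinalEncoder, StandardScaler

def build_model(categorical_cols, numeric_cols):
    """Pipeline whose encoder and scaler are fitted only when .fit() is called."""
    preprocess = ColumnTransformer([
        # categories unseen during training are mapped to -1 instead of raising
        ("cat", OrdinalEncoder(handle_unknown="use_encoded_value", unknown_value=-1),
         categorical_cols),
        ("num", StandardScaler(), numeric_cols),
    ])
    return make_pipeline(preprocess, RandomForestClassifier(n_estimators=10, random_state=0))

# Made-up rows standing in for the training and test sets
toy_train = pd.DataFrame({
    "age": [25, 40, 33, 58],
    "workclass": ["Private", "State-gov", "Private", "Self-emp"],
    "label": ["<=50K", ">50K", "<=50K", ">50K"],
})
toy_test = pd.DataFrame({
    "age": [29, 51],
    "workclass": ["Private", "Never-worked"],
    "label": ["<=50K", ">50K"],
})

model = build_model(["workclass"], ["age"])
model.fit(toy_train.drop(columns="label"), toy_train["label"])
predictions = model.predict(toy_test.drop(columns="label"))
print(predictions)
```

Because the pipeline stores the fitted encoder and scaler, the same mapping is applied to both sets, and a PCA step could be added to the pipeline in the same way.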

In [34]:
component_num = 14
features, labels = preprocessing(data,component_num)
test_features,test_labels = preprocessing(test,component_num)

In [35]:
# Random Forest Model
clf_gini = RandomForestClassifier(n_estimators=10, criterion='gini')
clf_gini.fit(features, labels)
result = clf_gini.predict(test_features)
print(f"RF-gini Accuracy (score): {clf_gini.score(test_features, test_labels)}")

RF-gini Accuracy (score): 0.8465081997420306


In [39]:
# Train the model for different numbers of PCA components; PCA does not improve the random forest here
def loopTraining(component_num):
    features, labels = preprocessing(data, component_num, pca=True)
    test_features, test_labels = preprocessing(test, component_num, pca=True)
    # Random Forest Model
    clf_gini = RandomForestClassifier(n_estimators=10, criterion='gini')
    clf_gini.fit(features, labels)
    print("For Component number: ", component_num, ",RF-gini Accuracy (score): " + str(clf_gini.score(test_features, test_labels)))

In [40]:
for num in range(1,15):
    loopTraining(num)

For Component number:  1 ,RF-gini Accuracy (score): 0.5877403107917204
For Component number:  2 ,RF-gini Accuracy (score): 0.6137215158774031
For Component number:  3 ,RF-gini Accuracy (score): 0.6230575517474357
For Component number:  4 ,RF-gini Accuracy (score): 0.6100976598489036
For Component number:  5 ,RF-gini Accuracy (score): 0.6167311590197162
For Component number:  6 ,RF-gini Accuracy (score): 0.7128554757078803
For Component number:  7 ,RF-gini Accuracy (score): 0.7019224863337633
For Component number:  8 ,RF-gini Accuracy (score): 0.7053620784964069
For Component number:  9 ,RF-gini Accuracy (score): 0.7142067440574903
For Component number:  10 ,RF-gini Accuracy (score): 0.6935077697930102
For Component number:  11 ,RF-gini Accuracy (score): 0.6969473619556539
For Component number:  12 ,RF-gini Accuracy (score): 0.6919108162889257
For Component number:  13 ,RF-gini Accuracy (score): 0.7390823659480376
For Component number:  14 ,RF-gini Accuracy (score): 0.7512437810945274
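Besides retraining the model for every component count, the cumulative explained variance of a single PCA fit gives a quick view of how many components carry most of the information. The sketch below uses synthetic data, not the census features, and the 95% threshold is an arbitrary assumption chosen for illustration.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)
base = rng.normal(size=(200, 3))
# toy data: 3 independent columns plus 3 noisy copies of them
toy_features = np.hstack([base, base + 0.1 * rng.normal(size=(200, 3))])

scaled = StandardScaler().fit_transform(toy_features)
pca_model = PCA().fit(scaled)  # keep all components
cumulative = np.cumsum(pca_model.explained_variance_ratio_)
# smallest number of components whose cumulative ratio reaches 95%
n_needed = int(np.searchsorted(cumulative, 0.95) + 1)
print(cumulative.round(3))
print("components needed for 95% variance:", n_needed)
```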


In [41]:
# Use randomized search to look for hyper-parameters that improve accuracy
n_estimators = [5, 20, 50, 100]  # number of trees in the random forest
# number of features considered at every split; for classifiers 'auto' means 'sqrt',
# and 'auto' is deprecated in newer scikit-learn releases
max_features = ['auto', 'sqrt']
max_depth = [int(x) for x in np.linspace(10, 120, num=12)]  # maximum depth of each decision tree
min_samples_split = [2, 6, 10]  # minimum number of samples required to split a node
min_samples_leaf = [1, 3, 4]  # minimum number of samples required at a leaf node
bootstrap = [True, False]  # whether each tree is trained on a bootstrap sample

random_grid = {'n_estimators': n_estimators,
               'max_features': max_features,
               'max_depth': max_depth,
               'min_samples_split': min_samples_split,
               'min_samples_leaf': min_samples_leaf,
               'bootstrap': bootstrap}
rf = RandomForestClassifier()
rf_random = RandomizedSearchCV(estimator = rf,param_distributions = random_grid,
               n_iter = 100, cv = 5, verbose=2, random_state=35, n_jobs = -1)
rf_random.fit(features, labels)
print ('Random grid: ', random_grid, '\n')
# print the best parameters
print ('Best Parameters: ', rf_random.best_params_, ' \n')

Fitting 5 folds for each of 100 candidates, totalling 500 fits
Random grid:  {'n_estimators': [5, 20, 50, 100], 'max_features': ['auto', 'sqrt'], 'max_depth': [10, 20, 30, 40, 50, 60, 70, 80, 90, 100, 110, 120], 'min_samples_split': [2, 6, 10], 'min_samples_leaf': [1, 3, 4], 'bootstrap': [True, False]} 

Best Parameters:  {'n_estimators': 100, 'min_samples_split': 6, 'min_samples_leaf': 4, 'max_features': 'sqrt', 'max_depth': 120, 'bootstrap': False}  
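`RandomizedSearchCV` also keeps the mean cross-validated score of the best candidate in `best_score_` and, with the default `refit=True`, a model refitted on all the training data in `best_estimator_`. The sketch below shows both on synthetic data from `make_classification`; the small grid and the data are assumptions chosen only to keep the example fast.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Synthetic stand-in for the encoded census features
X, y = make_classification(n_samples=200, n_features=8, random_state=0)

search = RandomizedSearchCV(
    RandomForestClassifier(random_state=0),
    param_distributions={
        "n_estimators": [5, 10, 20],
        "max_depth": [3, 5, None],
        "min_samples_leaf": [1, 3],
    },
    n_iter=5, cv=3, random_state=0,
)
search.fit(X, y)

print("best parameters:", search.best_params_)
print("mean CV accuracy of best candidate:", round(search.best_score_, 3))
# best_estimator_ is already fitted, so it can be used for prediction directly
print("training accuracy:", search.best_estimator_.score(X, y))
```

Reporting `best_score_` next to the test accuracy shows whether the tuned model generalises as well as cross-validation suggested.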



In [42]:
paras = rf_random.best_params_
# rebuild the forest with the best hyper-parameters found by the search
randmf = RandomForestClassifier(**paras)
randmf.fit(features, labels)

RandomForestClassifier(bootstrap=False, max_depth=120, max_features='sqrt',
                       min_samples_leaf=4, min_samples_split=6)

In [43]:
randmf.score(test_features,test_labels)

0.8603894109698421
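Accuracy alone can hide how a model treats the smaller class, which in the Census Income data is the >50K group. Precision, recall, and F1 for the positive class give a fuller picture when comparing models. The helper `binary_metrics` below is a hypothetical, standard-library-only sketch applied to made-up predictions, not to the results above.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Accuracy, precision, recall, and F1 for one positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == positive and p == positive)
    fp = sum(1 for t, p in pairs if t != positive and p == positive)
    fn = sum(1 for t, p in pairs if t == positive and p != positive)
    tn = len(pairs) - tp - fp - fn
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Made-up labels and predictions
toy_metrics = binary_metrics([1, 1, 0, 0, 1], [1, 0, 0, 1, 1])
print(toy_metrics)
```

scikit-learn offers the same quantities through `sklearn.metrics.classification_report` and `confusion_matrix`.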