<center><h1>Dimensionality reduction and reevaluation of models</h1></center>

## Summary:

1. [Loading preprocessed dataset, functions and hyperparameters](#load_data)


2. [Dimensionality reduction with models reevaluation](#dim_reduction)
    
    2.1. [k-NN](#k-NN)

    2.2. [Linear regression](#linear_regression)
    
    2.3. [Logistic regression](#logistic_regression)
    
    2.4. [Nearest Centroid Classifier (NCC)](#ncc)
    
    2.5. [Quadratic Gaussian Classifier (QGC)](#qgc)
    
    2.6. [Decison trees](#decision_trees)
    
    2.7. [Artificial Neural Network (ANN)](#ann)

# 1. Loading preprocessed data set, functions and hyperparameters <a class="anchor" id="load_data"></a>

In [1]:
import pandas as pd

X_tr = pd.read_csv("X_tr.csv")
Y_tr = pd.read_csv("Y_tr.csv", header=None)

print("X_tr.shape: {}\nY_tr.shape: {}".format(X_tr.shape, Y_tr.shape))

X_tr.shape: (631760, 113)
Y_tr.shape: (631760, 1)


In [2]:
# Auxiliary function to tell when processing is over
def is_over(): # linux os
    import os
    os.system('spd-say "your program has finished"')
    
# Function for scaling numerical features
from sklearn.preprocessing import MinMaxScaler, StandardScaler

def scale_feat(X_train, X_test, featIndex, scaleType='min-max'):
    if scaleType=='min-max' or scaleType=='std':
        X_tr_norm = np.copy(X_train) # making a copy to let the original available
        X_ts_norm = np.copy(X_test)
        scaler = MinMaxScaler(copy=False) if scaleType=='min-max' else StandardScaler(copy=False)
        scaler.fit(X_tr_norm[:,featIndex])
        X_tr_norm[:,featIndex] = scaler.transform(X_tr_norm[:,featIndex])
        X_ts_norm[:,featIndex] = scaler.transform(X_ts_norm[:,featIndex])
        return (X_tr_norm, X_ts_norm)
    else:
        raise ValueError("Type of scaling not defined. Use 'min-max' or 'std' instead.")
        
import numpy as np

# Hyperparameters:
# Numerical/Ordinal feautures
numFeat = [
    'building_id', # searching for data leakage
    'vdcmun_id',   # categorical, but used as numerical for simplicity
    'ward_id',     # categorical, but used as numerical for simplicity
    'count_floors_pre_eq',
    'count_floors_post_eq',
    'age_building',
    'plinth_area_sq_ft',
    'height_ft_pre_eq',
    'height_ft_post_eq',
    'count_families'
]

# Train/Test split = 80% train and 20% test
test_size = 0.2

# Index of columns to be scaled
numFeat_idx = np.in1d(X_tr.columns.values, numFeat).nonzero()[0]

In [3]:
# Number of resamplings
n_resamplings = 10

In [4]:
# Objective function
def f_o(u):
    return np.mean(u) - 2*np.std(u)

In [5]:
X_values = X_tr.values # taking the numpy matrix of dataframe
Y_values = np.ravel(Y_tr.values) # taking numpy column array from dataframe and converting to simple array

# 2. Dimensionality reduction with models reevaluation <a class="anchor" id="dim_reduction"></a>

The objective of this notebook is to use dimensionality reduction, in this case, *Principal Component Analysis* (PCA), to decrease the time needed to process the data when training the machine learning models.

As a hyperparameter, the conserved variance, $p$, was set to 98%. Also, the type of feature scaling was set to `min-max` as many transformed categorical features now have values `0` or `1`, so, using standard scaling may result in the original numerical features dominating the PCA transformation, as they would have values ranging from `-1` to `+1`.

In [6]:
p = 0.98
scaleType = 'min-max'

On the code below we show the number of dimensions in the case that we apply dimensionality reduction in the whole data set.

In [6]:
%%time
from sklearn.decomposition import PCA

X_tr_norm, _ = scale_feat(X_tr.values, X_tr.values, featIndex=numFeat_idx, scaleType='min-max')

pca = PCA(n_components=p)
pca.fit(X_tr_norm)
    
print("Minimum variance conserved: {}%".format(p*100))
print("# dimensions: {}".format(len(pca.components_)))

Minimum variance conserved: 98.0%
# dimensions: 69
CPU times: user 1min 25s, sys: 6.13 s, total: 1min 31s
Wall time: 14.8 s


As the PCA is applied after the features scaling, which depends on the train/test split, this notebook will contain a lot of copy and paste from the previous one (*02 Building and evaluation of ML models*).

Giving that the time to process the data was usually reduced a longer hyperparameters search as conducted.

## 2.1 k-NN <a class="anchor" id="k-NN"></a>

In [32]:
%%time
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier
from sklearn.metrics import f1_score
from tqdm import tnrange, tqdm_notebook

header = ["k", "$\mu$", "$\sigma$", "$f_o$"]
ks = np.linspace(1,10, num=6, dtype='int').tolist() # possible values of k

nn_data = np.zeros((len(ks), len(header)))
for i in tqdm_notebook(range(len(ks)), desc='ks'):
    results = [0]*n_resamplings
    for j in tqdm_notebook(range(n_resamplings), desc='n_resamplings'):
        # Train/validation split
        X_train, X_test, y_train, y_test = train_test_split(X_values, Y_values, test_size=test_size)

        # scaling
        X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)
        
        # PCA
        pca = PCA(n_components = p)
        pca.fit(X_tr_norm)
        
        pca_X_tr_norm = pca.transform(X_tr_norm)
        pca_X_ts_norm = pca.transform(X_ts_norm)
        
        # model fitting
        k_nn = KNeighborsClassifier(n_neighbors=ks[i], n_jobs=-1)
        k_nn.fit(pca_X_tr_norm, y_train)

        # model evaluation
        y_pred = k_nn.predict(pca_X_ts_norm)
        results[j] = f1_score(y_test, y_pred, average='weighted')
        
    
    nn_data[i,:] = np.matrix([ks[i], np.mean(results), np.std(results), f_o(results)])

df_pca_knn = pd.DataFrame(nn_data, columns=header)

is_over()

HBox(children=(IntProgress(value=0, description='ks', max=6, style=ProgressStyle(description_width='initial'))…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…

CPU times: user 20h 46min 29s, sys: 6min 1s, total: 20h 52min 31s
Wall time: 2h 50min 31s


In [35]:
# saving results
# filename = "./simulation_results/df_pca_knn.csv"
# df_pca_knn.to_csv(filename, sep='\t', index=False)

In [8]:
# loading results
df_pca_knn = pd.read_csv("./simulation_results/df_pca_knn.csv", sep='\t')
print('df_pca_knn:')
display(df_pca_knn)

df_knn = pd.read_csv("./simulation_results/df_knn.csv", sep='\t')
print('df_knn:')
display(df_knn)

df_pca_knn:


Unnamed: 0,k,$\mu$,$\sigma$,$f_o$
0,1.0,0.705696,0.000974,0.703747
1,2.0,0.697567,0.000832,0.695903
2,4.0,0.72017,0.001066,0.718038
3,6.0,0.724551,0.001226,0.722099
4,8.0,0.729181,0.001154,0.726872
5,10.0,0.730667,0.00071,0.729246


df_knn:


Unnamed: 0,k,$\mu$,$\sigma$,$f_o$
0,1.0,0.700718,0.001065,0.698588
1,5.0,0.72444,0.001197,0.722045
2,10.0,0.731552,0.001212,0.729127


As we can see the k-NN classifier applied in the transformed data set was slightly better, but the notorious difference is the processing time:

* Original k-NN took 13h 29min 2s while searching for 3 different cases of hyperparameters;

* The combination of PCA + k-NN took 2h 50min 31s while searching for double the number of hyperparameters cases.

## 2.2 Linear regression <a class="anchor" id="linear_regression"></a>

In [7]:
# Creating the multilabel version of Y_tr
from sklearn.preprocessing import MultiLabelBinarizer

mlb = MultiLabelBinarizer()
Y_tr_multilabel = mlb.fit_transform(Y_tr.values) 

In [10]:
%%time
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.model_selection import train_test_split
from sklearn.metrics import f1_score
from tqdm import tnrange, tqdm_notebook

header = ["$\mu$", "$\sigma$", "$f_o$"]

results = [0]*n_resamplings
for i in tnrange(n_resamplings, desc='n_resamplings'):
    # Train/validation split
    X_train, X_test, y_train, y_test = train_test_split(X_values, Y_tr_multilabel, test_size=test_size)

    # scaling
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)

    # PCA
    pca = PCA(n_components = p)
    pca.fit(X_tr_norm)

    pca_X_tr_norm = pca.transform(X_tr_norm)
    pca_X_ts_norm = pca.transform(X_ts_norm)

    # model fitting
    reg = linear_model.LinearRegression(n_jobs=-1)
    reg.fit(pca_X_tr_norm, y_train)

    # model evaluation
    y_pred = np.argmax(reg.predict(pca_X_ts_norm), axis=1).astype(str) # +1???
    y_pred = mlb.fit_transform(y_pred)

    results[i] = f1_score(y_test, y_pred, average='weighted')


reg_data = np.matrix([np.mean(results), np.std(results), f_o(results)])

df_pca_reg = pd.DataFrame(reg_data, columns=header)
# display(df_reg)
is_over()

HBox(children=(IntProgress(value=0, description='n_resamplings', max=10, style=ProgressStyle(description_width…


CPU times: user 14min 36s, sys: 1min 21s, total: 15min 57s
Wall time: 2min 43s


In [11]:
# saving results
# filename = "./simulation_results/df_pca_reg.csv"
# df_pca_reg.to_csv(filename, sep='\t', index=False)

In [8]:
# loading results
df_pca_reg = pd.read_csv("./simulation_results/df_pca_reg.csv", sep='\t')
print('df_pca_reg:')
display(df_pca_reg)

df_reg = pd.read_csv("./simulation_results/df_reg.csv", sep='\t')
print('df_reg:')
display(df_reg)

df_pca_reg:


Unnamed: 0,$\mu$,$\sigma$,$f_o$
0,0.70277,0.000784,0.701202


df_reg:


Unnamed: 0,scaleType,$\mu$,$\sigma$,$f_o$
0,min-max,0.707567,0.001361,0.704845
1,std,0.70768,0.001097,0.705486


In the linear regression case, we had a drop in performance and no reduction in processing time. This is probably because finding the PCA transform takes a significant time when compared to the model fitting time.

## 2.3. Logistic regression <a class="anchor" id="logistic_regression"></a>

In [None]:
%%time
from tqdm import tnrange, tqdm_notebook
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn import linear_model
from sklearn.metrics import f1_score

rhos = [0, 0.5, 1] # rho=0 <=> l2 norm / rho==1 <=> l1 norm
Cs = [10**x for x in range(-1,+1 +1)]

header = ["$rho$", "$C$", "$\mu$", "$\sigma$", "$f_o$"]
logreg_data = np.empty((len(rhos)*len(Cs), len(header)), dtype=object)
count=0
for rho in tqdm_notebook(rhos, desc="rho's: "):
    for C in tqdm_notebook(Cs, desc="C's"):
#         print("Started penalty={}/C={} at {}".format(penalty, C, datetime.datetime.now()))
        results = [0]*n_resamplings
        for i in tnrange(n_resamplings, desc='resampling :', leave=False):
            # Train/validation split
            X_train, X_test, y_train, y_test = train_test_split(X_values, Y_values, test_size=test_size)

            # scaling
            X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)

            # PCA
            pca = PCA(n_components = p)
            pca.fit(X_tr_norm)

            pca_X_tr_norm = pca.transform(X_tr_norm)
            pca_X_ts_norm = pca.transform(X_ts_norm)
            
            # model fitting
            logreg = linear_model.LogisticRegression(multi_class='multinomial',solver='saga',
                                                     penalty='elasticnet',max_iter=500, tol=1e-4, 
                                                     l1_ratio=rho, C=C,  n_jobs=-1)
            logreg.fit(pca_X_tr_norm, y_train)

            # model evaluation
            y_pred = logreg.predict(pca_X_ts_norm)
            results[i] = f1_score(y_test, y_pred, average='weighted')

        logreg_data[count,:] = np.matrix([rho, C, np.mean(results), np.std(results), f_o(results)])
        count+=1

df_pca_logreg = pd.DataFrame(logreg_data, columns=header)
is_over()

'''
Original model:
CPU times: user 15h 57min 10s, sys: 1min 22s, total: 15h 58min 33s
Wall time: 15h 57min

New model:
CPU times: user 10h 49min 55s, sys: 8min 29s, total: 10h 58min 25s
Wall time: 9h 22min 6s
'''

In [10]:
# saving results
# filename = "./simulation_results/df_pca_logreg.csv"
# df_pca_logreg.to_csv(filename, sep='\t', index=False)

In [9]:
# loading results
df_pca_logreg = pd.read_csv("./simulation_results/df_pca_logreg.csv", sep='\t')
print('df_pca_logreg:')
display(df_pca_logreg.sort_values(by=['$f_o$'], ascending=False))

df_logreg = pd.read_csv("./simulation_results/df_logreg.csv", sep='\t')
print('df_logreg:')
display(df_logreg.sort_values(by=['$f_o$'], ascending=False))

df_pca_logreg:


Unnamed: 0,$rho$,$C$,$\mu$,$\sigma$,$f_o$
7,1.0,1.0,0.71112,0.000617,0.709885
4,0.5,1.0,0.710984,0.000968,0.709047
5,0.5,10.0,0.711,0.001155,0.708689
1,0.0,1.0,0.710448,0.000896,0.708656
2,0.0,10.0,0.710393,0.00091,0.708572
6,1.0,0.1,0.709856,0.000689,0.708478
8,1.0,10.0,0.710814,0.001219,0.708376
3,0.5,0.1,0.708937,0.000981,0.706976
0,0.0,0.1,0.708259,0.001078,0.706104


df_logreg:


Unnamed: 0,$rho$,$C$,$\mu$,$\sigma$,$f_o$
6,1.0,0.1,0.71868,0.000901,0.716878
4,0.5,1.0,0.718251,0.000835,0.716582
2,0.0,10.0,0.718541,0.001046,0.716449
7,1.0,1.0,0.718173,0.000922,0.716328
1,0.0,1.0,0.718408,0.00121,0.715988
5,0.5,10.0,0.718665,0.001632,0.715401
8,1.0,10.0,0.717804,0.0013,0.715204
3,0.5,0.1,0.717645,0.001321,0.715004
0,0.0,0.1,0.716944,0.001352,0.71424


We had a slight drop in performance (from $f_o = 0.716878$ to $0.709885$) and in processing time (from 15h 57min to 9h 22min 6s).

## 2.4. Nearest Centroid Classifier (NCC) <a class="anchor" id="ncc"></a>

In [7]:
%%time
from classifiers import NearestCentroid
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score

header = ["$\mu$", "$\sigma$", "$f_o$"]

# ncc_data = np.empty(len(header)), dtype=object)
results = [0]*n_resamplings
for i in range(n_resamplings):
    # Train/validation split
    X_train, X_test, y_train, y_test = train_test_split(X_values, Y_tr.values, test_size=test_size)

    # scaling
    X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)
    
    # PCA
    pca = PCA(n_components = p)
    pca.fit(X_tr_norm)

    pca_X_tr_norm = pca.transform(X_tr_norm)
    pca_X_ts_norm = pca.transform(X_ts_norm)

    # model fitting
    ncc = NearestCentroid()
    ncc.fit(pca_X_tr_norm, y_train)

    # model evaluation
    y_pred = ncc.predict(pca_X_ts_norm)
    results[i] = f1_score(y_test, y_pred, average='weighted')

ncc_data = np.matrix([np.mean(results), np.std(results), f_o(results)])

df_ncc = pd.DataFrame(ncc_data, columns=header)
display(df_ncc)

Unnamed: 0,$\mu$,$\sigma$,$f_o$
0,0.646854,0.000999,0.644857


CPU times: user 12min 15s, sys: 1min 8s, total: 13min 23s
Wall time: 2min 32s


In this case, we had similar performance and an increase in processing time.

## 2.5. Quadratic Gaussian Classifier (QGC) <a class="anchor" id="qgc"></a>

In [7]:
%%time
from classifiers import QuadraticGaussianClassifier
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from tqdm import tnrange, tqdm_notebook

n_resamplings=10

variants = [1, 2, 3]
lambdas = np.linspace(0,1,num=12)[1:-1] # we crop 0 and 1 because:
                                        # lambda=0 => original QGC
                                        # lambda=1 => variant 2 of QGC
header = ["variant","$\lambda$", "$\mu$", "$\sigma$", "$f_o$"]
qgc_data = np.empty((2 + len(lambdas), len(header)), dtype=object)

for variant in tqdm_notebook(variants, desc="variants: "):
    results = [0]*n_resamplings if variant!=3 else np.zeros((len(lambdas),n_resamplings))
    for i in tnrange(n_resamplings, desc='resampling :', leave=False):
        # Train/validation split
        X_train, X_test, y_train, y_test = train_test_split(X_tr.values, Y_tr.values, test_size=test_size)
        
        # scaling
        X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)
        
        # PCA
        pca = PCA(n_components = p)
        pca.fit(X_tr_norm)

        pca_X_tr_norm = pca.transform(X_tr_norm)
        pca_X_ts_norm = pca.transform(X_ts_norm)
        
        if variant!=3:
            # model fitting
            qgc = QuadraticGaussianClassifier(variant=variant)
            qgc.fit(pca_X_tr_norm, y_train)
            
            # model evaluation
            y_pred = qgc.predict(pca_X_ts_norm)
            results[i] = f1_score(y_test, y_pred, average='weighted')
            
        else:
            for j in range(len(lambdas)): # for each lambda
                # model fitting
                qgc = QuadraticGaussianClassifier(variant=variant, Lambda=lambdas[j])
                qgc.fit(pca_X_tr_norm, y_train)
                
                # model evaluation
                y_pred = qgc.predict(pca_X_ts_norm)
                results[j,i] = f1_score(y_test, y_pred, average='weighted')
        
    if variant!=3:
        qgc_data[variant-1,:] = np.asmatrix(
            [variant, np.nan, np.mean(results), np.std(results), f_o(results)]
        )

    else:
        var3_matrix = np.asmatrix([3]*len(lambdas)).T
        lambdas_matrix  = np.asmatrix(lambdas).T
        fo = [f_o(result) for result in results]
        fo = np.asmatrix(fo).T
        qgc_data[2:2+len(lambdas),:] = np.concatenate(
            (var3_matrix, lambdas_matrix, np.asmatrix(np.mean(results,axis=1)).T, 
             np.asmatrix(np.std(results,axis=1)).T, fo), axis=1
        )


df_pca_qgc = pd.DataFrame(qgc_data, columns=header)
is_over()

'''
Old time:
CPU times: user 6h 20min 15s, sys: 4h 36min 1s, total: 10h 56min 17s
Wall time: 2h 7min 30s

New time:
CPU times: user 1h 27min 57s, sys: 6min 52s, total: 1h 34min 49s
Wall time: 56min 17s
'''

HBox(children=(IntProgress(value=0, description='variants: ', max=3, style=ProgressStyle(description_width='in…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…


CPU times: user 1h 27min 57s, sys: 6min 52s, total: 1h 34min 49s
Wall time: 56min 17s


In [8]:
# saving results
# filename = "./simulation_results/df_pca_qgc.csv"
# df_pca_qgc.to_csv(filename, sep='\t', index=False)

In [9]:
# loading results
df_pca_qgc = pd.read_csv("./simulation_results/df_pca_qgc.csv", sep='\t')
print('df_pca_qgc:')
display(df_pca_qgc.sort_values(by=['$f_o$'], ascending=False))

df_qgc = pd.read_csv("./simulation_results/df_qgc.csv", sep='\t')
print("df_qgc:")
display(df_qgc.sort_values(by=['$f_o$'], ascending=False))

df_pca_qgc:


Unnamed: 0,variant,$\lambda$,$\mu$,$\sigma$,$f_o$
2,3.0,0.090909,0.679278,0.001332,0.676614
3,3.0,0.181818,0.679278,0.001332,0.676614
4,3.0,0.272727,0.679278,0.001332,0.676614
5,3.0,0.363636,0.679278,0.001332,0.676614
6,3.0,0.454545,0.679278,0.001332,0.676614
7,3.0,0.545455,0.679278,0.001332,0.676614
8,3.0,0.636364,0.679278,0.001332,0.676614
9,3.0,0.727273,0.679278,0.001332,0.676614
10,3.0,0.818182,0.679278,0.001332,0.676614
11,3.0,0.909091,0.679278,0.001332,0.676614


df_qgc:


Unnamed: 0,variant,$\lambda$,$\mu$,$\sigma$,$f_o$
0,1.0,,0.635737,0.000768,0.634201
2,3.0,0.090909,0.63296,0.000848,0.631264
4,3.0,0.272727,0.63285,0.000848,0.631153
3,3.0,0.181818,0.63289,0.000871,0.631148
5,3.0,0.363636,0.632503,0.00086,0.630784
6,3.0,0.454545,0.631791,0.00082,0.630152
7,3.0,0.545455,0.629895,0.000721,0.628453
8,3.0,0.636364,0.627478,0.000775,0.625929
9,3.0,0.727273,0.616343,0.000785,0.614774
10,3.0,0.818182,0.603493,0.000727,0.602039


In this case, we had a big improvement in performance and the processing time was reduced by half.

## 2.6. Decison trees <a class="anchor" id="decision_trees"></a>

In [7]:
%%time
from xgboost import XGBClassifier
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from tqdm import tnrange, tqdm_notebook

cases = [
    {
    "learning_rate"     : learning_rate
    ,'n_estimators'     : n_estimators
    ,'max_depth'        : max_depth
    ,'tree_method'      : 'gpu_hist'
    ,'objective'        : 'multi:softmax'
    } 
    # hyperparameters possible values
    for learning_rate    in [1e-1, 0.3, 0.5, 1]
    for n_estimators     in [500, 1000]
    for max_depth        in [6, 12]
]

header = list(cases[0].keys())[:-2] + ["$\mu$", "$\sigma$", "$f_o$"] # no need for repeating
                                                                     # 'tree_method' and 'objective'

xgb_data = np.empty((len(cases), len(header)), dtype=object)
count=0
for case in tqdm_notebook(cases, desc="case: "):
#     print("Starting instance {}/{} at {}".format(count+1, len(cases), datetime.datetime.now()))
    results = [0]*n_resamplings 
    for i in tnrange(n_resamplings, desc='resampling :', leave=False):
        # Train/test split
        X_train, X_test, y_train, y_test = train_test_split(X_values, Y_values, 
                                                            test_size=test_size)
        # Train/validation split
        X_train, X_validation, y_train, y_validation = train_test_split(X_train, y_train, 
                                                                        test_size=test_size)

        # scaling
        X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType=scaleType)
        X_val_norm, _ = scale_feat(X_validation, X_validation, featIndex=numFeat_idx, scaleType=scaleType)
        
        # PCA
        pca = PCA(n_components = p)
        pca.fit(X_tr_norm)

        pca_X_tr_norm  = pca.transform(X_tr_norm)
        pca_X_ts_norm  = pca.transform(X_ts_norm)
        pca_X_val_norm = pca.transform(X_val_norm)
        
        # model fitting
        xgb = XGBClassifier(**case)
        xgb.fit(pca_X_tr_norm, y_train, eval_set=[(pca_X_val_norm,y_validation)]
                ,early_stopping_rounds=30, verbose=False)
                
        # model evaluation
        y_pred = xgb.predict(pca_X_ts_norm)
        results[i] = f1_score(y_test, y_pred, average='weighted')
            

    xgb_data[count,:] = np.matrix(
        list(case.values())[:-2] + [np.mean(results), np.std(results), f_o(results)])
    count+=1
        
    
df_pca_xgb = pd.DataFrame(xgb_data, columns=header)
is_over()
'''
Old time:
CPU times: user 11h 21min 42s, sys: 3h 37min 24s, total: 14h 59min 7s
Wall time: 14h 59min

New time:
CPU times: user 8h 11min 24s, sys: 1h 47min 33s, total: 9h 58min 57s
Wall time: 7h 40min 25s
'''

HBox(children=(IntProgress(value=0, description='case: ', max=16, style=ProgressStyle(description_width='initi…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…


CPU times: user 8h 11min 24s, sys: 1h 47min 33s, total: 9h 58min 57s
Wall time: 7h 40min 25s


In [None]:
# saving results
# df_pca_xgb.to_csv("./simulation_results/df_pca_xgb.csv", sep='\t', index=False)

In [7]:
# loading results
df_pca_xgb = pd.read_csv("./simulation_results/df_pca_xgb.csv", sep='\t')
df_xgb     = pd.read_csv("./simulation_results/df_xgb.csv", sep='\t')

print('df_pca_xgb:')
display(df_pca_xgb.sort_values(by=['$f_o$'], ascending=False).head())
print('df_xgb:')
display(df_xgb.sort_values(by=['$f_o$'], ascending=False).head())

df_pca_xgb:


Unnamed: 0,learning_rate,n_estimators,max_depth,$\mu$,$\sigma$,$f_o$
3,0.1,1000.0,12.0,0.736385,0.001277,0.73383
1,0.1,500.0,12.0,0.735723,0.001048,0.733627
6,0.3,1000.0,6.0,0.733833,0.000981,0.73187
5,0.3,500.0,12.0,0.733779,0.001152,0.731476
7,0.3,1000.0,12.0,0.734421,0.001515,0.731392


df_xgb:


Unnamed: 0,learning_rate,n_estimators,max_depth,$\mu$,$\sigma$,$f_o$
1,0.1,500.0,12.0,0.767047,0.00109,0.764868
7,0.3,1000.0,12.0,0.766257,0.001113,0.764031
5,0.3,500.0,12.0,0.765584,0.000913,0.763758
3,0.1,1000.0,12.0,0.766463,0.001637,0.763188
4,0.3,500.0,6.0,0.76478,0.001135,0.762511


We had a drop in performance but the processing time was cut by half.

## 2.7. Artificial Neural Network (ANN) <a class="anchor" id="ann"></a>

In [7]:
%%time 
from keras.models import Sequential
from keras.layers import Dense
from keras.optimizers import Adam
from keras.utils import to_categorical
from keras.callbacks import EarlyStopping
from tqdm import tnrange, tqdm_notebook
from sklearn.model_selection import train_test_split
from sklearn.decomposition import PCA
from sklearn.metrics import f1_score
from metric import f1


Y_tr_categorical = to_categorical(Y_tr)[:, 1:] # change to categorical

cases = [
    {
        'lr'  : lr,
        'arq' : arq
    } 
    # hyperparameters possible values
    for arq in [
                 [32],[128],[512]         # MLP with one   layer
                ,[512,128], [128,32]      # //   //  two   layers
                ,[512,128,32], [128,32,8] # //   //  three  //
                ]
    for lr in [1e-3, 1e-2]
]

header = list(cases[0].keys()) + ["$\mu$", "$\sigma$", "$f_o$"] 

ann_data = np.empty((len(cases), len(header)), dtype=object)
count=0
for case in tqdm_notebook(cases, desc="case: "):
    # model building
    arq = case['arq']
    model = Sequential()
    model.add(Dense(units=arq[0], activation='relu', input_dim=69)) # first layer
    for layer in arq:
        model.add(Dense(units=layer, activation='relu'))
    model.add(Dense(units=5, activation='softmax')) # output layer

    model.compile(loss='categorical_crossentropy'
                 ,metrics=[f1]
                 ,optimizer=Adam(lr=case['lr'], amsgrad=True)
                 )
    
    results = [0]*n_resamplings
    for i in tnrange(n_resamplings, desc='resampling :', leave=False):
        # Train/validation split
        X_train, X_test, y_train, y_test = train_test_split(X_tr.values, Y_tr_categorical, test_size=test_size)

        # scaling
        X_tr_norm, X_ts_norm = scale_feat(X_train, X_test, featIndex=numFeat_idx, scaleType="min-max")
        
        # PCA
        pca = PCA(n_components = 69)
        pca.fit(X_tr_norm)

        pca_X_tr_norm  = pca.transform(X_tr_norm)
        pca_X_ts_norm  = pca.transform(X_ts_norm)
        
        
        
        # model fitting
        es = EarlyStopping(monitor='val_f1', mode='max', verbose=0, patience=5)
        history = model.fit(pca_X_tr_norm, y_train, shuffle=True 
                            ,epochs=1_000
                            ,verbose=0
                            ,validation_split=0.2
                            ,callbacks=[es]
                            )
    
        # model evaluation
        y_pred = np.argmax(model.predict(pca_X_ts_norm), axis=1)+1
        y_pred = to_categorical(y_pred)[:, 1:]
        results[i] = f1_score(y_test, y_pred, average='weighted')
                
    ann_data[count,:] = np.matrix(list(case.values()) + [np.mean(results), np.std(results), f_o(results)])
    count+=1
        
df_pca_ann = pd.DataFrame(ann_data, columns=header)
is_over()

'''
Old model:
CPU times: user 1d 7h 58min 2s, sys: 3h 33min 4s, total: 1d 11h 31min 6s
Wall time: 18h 41min 29s

New model:
CPU times: user 1d 16h 19min 38s, sys: 4h 46min 33s, total: 1d 21h 6min 11s
Wall time: 22h 9min 6s
'''

Using TensorFlow backend.


HBox(children=(IntProgress(value=0, description='case: ', max=14, style=ProgressStyle(description_width='initi…

Instructions for updating:
Colocations handled automatically by placer.


HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

Instructions for updating:
Use tf.cast instead.


HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…

HBox(children=(IntProgress(value=0, description='resampling :', max=10, style=ProgressStyle(description_width=…


CPU times: user 1d 16h 19min 38s, sys: 4h 46min 33s, total: 1d 21h 6min 11s
Wall time: 22h 9min 6s


In [8]:
# saving results
# df_pca_ann.to_csv("./simulation_results/df_pca_ann.csv", sep='\t', index=False)

In [9]:
# loading results
df_pca_ann = pd.read_csv("./simulation_results/df_pca_ann.csv", sep='\t')
df_ann     = pd.read_csv("./simulation_results/df_ann.csv", sep='\t')

print('df_pca_ann:')
display(df_pca_ann.sort_values(by=['$f_o$'], ascending=False))
print('df_ann:')
display(df_ann.sort_values(by=['$f_o$'], ascending=False))

df_pca_ann:


Unnamed: 0,lr,arq,$\mu$,$\sigma$,$f_o$
2,0.001,[128],0.738402,0.003538,0.731327
8,0.001,"[128, 32]",0.737496,0.003756,0.729984
12,0.001,"[128, 32, 8]",0.736715,0.004054,0.728607
6,0.001,"[512, 128]",0.755639,0.015378,0.724884
11,0.01,"[512, 128, 32]",0.734996,0.005391,0.724213
10,0.001,"[512, 128, 32]",0.754969,0.015465,0.724039
0,0.001,[32],0.727099,0.001573,0.723952
4,0.001,[512],0.754622,0.01584,0.722942
13,0.01,"[128, 32, 8]",0.727965,0.003047,0.721872
7,0.01,"[512, 128]",0.73445,0.00646,0.72153


df_ann:


Unnamed: 0,lr,arq,$\mu$,$\sigma$,$f_o$
8,0.001,"[128, 32]",0.748759,0.006041,0.736676
12,0.001,"[128, 32, 8]",0.749997,0.00679,0.736417
2,0.001,[128],0.749068,0.006379,0.73631
10,0.001,"[512, 128, 32]",0.777009,0.02062,0.73577
11,0.01,"[512, 128, 32]",0.746012,0.005657,0.734699
6,0.001,"[512, 128]",0.773487,0.019732,0.734022
4,0.001,[512],0.770884,0.018516,0.733851
0,0.001,[32],0.736151,0.002087,0.731977
9,0.01,"[128, 32]",0.734137,0.003825,0.726487
3,0.01,[128],0.732377,0.004225,0.723928


We can see a drop in performance and an increase in processing time. The increase in processing time can be explained by the time necessary to find the PCA projection and the permanence of the old hyperparameters (the number of neurons in the hidden layers) which makes our new model have a similar size to the old one.