### Optimal Parameters

The purpose of this notebook is to find the best parameters for the different sets of features. To do so, a model needs to be selected, along with the parameter space that will be searched. A coarse grid search is performed first, followed by a narrower one. An end-to-end run of the notebook should produce a csv file whose name mentions the model it is for. For an SVM, for example, the csv should look like:

    parameter, value  <-- (header)
    kernel, rbf
    nu, 0.01
    degree, 3
    gamma, 0.05
    coef0, 0.0
    
In order to successfully run the notebook for its purpose, the sections that need to be modified for a new algorithm are:

    1. Model Selection (where a new model is initialized)
    2. Parameter Space - Coarse (where the selected parameters need to correspond to the algorithm)
    3. Drop a cleaner csv with only the best parameters (where the parameter names need to be set
       accordingly)

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import os
import itertools
from time import time
from IPython.display import display, clear_output


# Pretty display for notebooks
%matplotlib inline

random_state = 0

rows_each = 60

# specify the place from where we will use the data
source_path_rel = 'data/Main collection - features/'

# specify the target path for dropping the parameters
target_path_rel = 'data/Main collection - results/parameters/'

# find the file names in the directory
feature_files = os.listdir(source_path_rel)

# remove the readme, the Giorgos files and the directory with the best ratios
feature_files.remove('README.md')
feature_files.remove('Giorgos Mon 19.csv')
feature_files.remove('Giorgos Wed 21.csv')
feature_files.remove('Giorgos Fri 23.csv')
feature_files.remove('Giorgos Mon 26.csv')
feature_files.remove('best ratios by multiclass RF')

print("The number of files selected from the directory is {}".format(len(feature_files)))

The number of files selected from the directory is 16


In [2]:
# DEBUGGING OPTIONS
np.set_printoptions(threshold=80)
pd.set_option('display.max_rows', 8)
pd.set_option('display.max_columns', 8)

### Load all data in a single DataFrame

The process here is to sample 60 locks from each participant. Some participants have more than 60 locks, and in that case we select the locks randomly.
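
The sampling step can be illustrated with a minimal, standard-library sketch. It is not the code the notebook runs: the next cell uses pandas, whose random generator differs from `random.Random`, so the picked rows would not match. The function name `sample_row_positions` and the row count of 75 are assumptions made for illustration.

```python
import random

def sample_row_positions(n_rows, k=60, seed=0):
    """Pick k distinct row positions out of n_rows (no replacement)."""
    rng = random.Random(seed)   # fixed seed: the same rows are picked on every run
    return sorted(rng.sample(range(n_rows), k))

picked = sample_row_positions(n_rows=75)   # 75 is a made-up number of locks
print(len(picked), len(set(picked)))       # 60 60
```

Fixing the seed is what makes the selection repeatable between runs, which matters here because the coarse and the fine search should see the same locks.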

In [3]:
# initialize the data frame with the locks that will be used
X = pd.DataFrame()

for i, file in enumerate(feature_files):
    
    # read each features file into a df
    df = pd.read_csv(source_path_rel+file)
    
    # select a sample of 60 values (without replacement)
    df = df.sample(n=rows_each, random_state=random_state, axis=0)
    
    # reset the index of the sampled dataframe to 0..n-1
    df.index = range(df.shape[0])
    
    # add the data to the bigger dataframe
    X = pd.concat([X, df.loc[:, :]], ignore_index=True)
    
print("The dataframe that holds the data is as follows:")
display(X)

The dataframe that holds the data is as follows:


Unnamed: 0,AB_mil,AB_xyz,AB|AC_mil,AB|AC_xyz,...,zSpeed_range,zSpeed_skew,zSpeed_std,zSpeed_var
0,434.0,258.930923,0.418919,0.655369,...,1.081136,-0.149765,0.184878,0.034180
1,502.0,297.418459,0.548035,0.629426,...,2.088235,1.069284,0.282818,0.079986
2,529.0,339.767000,0.549325,0.757873,...,1.235294,0.071162,0.249953,0.062477
3,503.0,301.571361,0.558269,0.689733,...,2.217469,1.207258,0.290968,0.084663
...,...,...,...,...,...,...,...,...,...
956,445.0,158.355584,0.654412,0.820856,...,0.558824,0.489388,0.124969,0.015617
957,436.0,138.876355,0.586022,0.597251,...,0.491597,0.393809,0.109206,0.011926
958,495.0,183.983717,0.651316,0.818607,...,0.897829,-0.483202,0.137261,0.018841
959,477.0,175.668280,0.671831,0.857197,...,0.421429,0.742492,0.105224,0.011072


### Feature sets selection

After loading the data, the next step is to select features. The parameter tuning will run across all the feature groups. There are 10 groups of features that we deal with here. These 10 groups fall into 3 broader categories: the overall/holistic features, the distances and the ratios.

In [4]:
def loadRatioFeatureIds(file_name, n=30):
    """ Loads the file with features. """
    df = pd.read_csv('data/Main collection - features/best ratios by multiclass RF/' + file_name)
    return df.loc[:n-1, 'feature id']

In [5]:
def findMagnitudes(f1, f2):
    """ Selects the features for both f1 and f2 lists that start with mag. """
    ff = [f for f in f1 if f[:3]=='mag']
    ff.extend([f for f in f2 if f[:3]=='mag'])
    return ff

In [6]:
# get all the features
all_features = X.columns.values.tolist()

# find all positional features
pos_features = [f for f in all_features if f[1:4]=='Pos' and f[4]!='I']

# find all position intervals
posInc_features = [f for f in all_features if f[1:7]=='PosInc' or f[3:9]=='PosInc']

# find all speed intervals
speed_features = [f for f in all_features if f[1:6]=='Speed' or f[3:8]=='Speed']

# find all x axis features


# find all y axis features


# find all z axis features


# find all magnitude features
#magnitude_features = findMagnitudes(posInc_features, speed_features)

# find all euclidean distances
euc_dist_features = [f for f in all_features if f[2]=='_' and f[-3:]=='xyz']

# find all temporal distances
mil_dist_features = [f for f in all_features if f[2]=='_' and f[-3:]=='mil']

# combine euc and mil distances


# find the selected euclidean ratios
euc_ratio_features = loadRatioFeatureIds('Euclidean distance ratios.csv')

# find the selected temporal ratios
mil_ratio_features = loadRatioFeatureIds('Temporal distance ratios.csv')

# find all euclidean ratios
euc_all_ratio_features = [f for f in all_features if f[2]=='|' and f[-3:]=='xyz']

# find all temporal ratios
mil_all_ratio_features = [f for f in all_features if f[2]=='|' and f[-3:]=='mil']

# combine euc and temp ratios


# combine ratios and distances


# ----- combine best performing ones

# best 1 


# best 2

In [7]:
feature_sets = [pos_features, posInc_features, speed_features, euc_dist_features,
                mil_dist_features, euc_ratio_features, mil_ratio_features,
                euc_all_ratio_features, mil_all_ratio_features, all_features]

feature_set_names = ['positions', 'position intervals', 'speed intervals', 'euclidean distances',
                     'temporal distances', 'euclidean ratios', 'temporal ratios',
                     'euclidean ratios (all 356)', 'temporal ratios (all 356)', 'all features']
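
The feature groups above are picked purely by position-based checks on the column names. The sketch below replays those checks on a handful of names to show which group each one lands in. The names `AB_mil`, `AB|AC_xyz` and `zSpeed_std` appear in the table printed earlier; the `Pos` and `PosInc` names are assumptions, since those columns are hidden in the truncated display.

```python
# Column names are illustrative: 'AB_mil', 'AB|AC_xyz' and 'zSpeed_std' appear in
# the table above, while the 'Pos' and 'PosInc' names are assumptions.
columns = ['xPos_mean', 'xPosInc_mean', 'magPosInc_std', 'zSpeed_std',
           'AB_mil', 'AB_xyz', 'AB|AC_mil', 'AB|AC_xyz']

# one-character prefix + 'Pos', but not 'PosInc'
positions = [f for f in columns if f[1:4] == 'Pos' and f[4] != 'I']
# 'PosInc' after a one-character or a three-character prefix
intervals = [f for f in columns if f[1:7] == 'PosInc' or f[3:9] == 'PosInc']
speeds = [f for f in columns if f[1:6] == 'Speed' or f[3:8] == 'Speed']
# two-character prefix, then '_' for a distance or '|' for a ratio
euc_distances = [f for f in columns if f[2] == '_' and f[-3:] == 'xyz']
mil_ratios = [f for f in columns if f[2] == '|' and f[-3:] == 'mil']

print(positions)      # ['xPos_mean']
print(intervals)      # ['xPosInc_mean', 'magPosInc_std']
print(euc_distances)  # ['AB_xyz']
print(mil_ratios)     # ['AB|AC_mil']
```

Because the checks depend on exact character positions, a renamed column silently drops out of its group, so it is worth printing the size of each feature set after a change to the feature files.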

### Model Selection

The next step is to initialize the model that will be used. Because this notebook is meant to tune only one model per run, choose the classifier in the next cell by leaving its three lines uncommented and commenting out the remaining options.

In [8]:
# from sklearn import svm
# selected_classifier = svm.OneClassSVM()
# selected_classifier_name = 'one-class SVM'

from sklearn.ensemble import IsolationForest
selected_classifier = IsolationForest()
selected_classifier_name = 'Isolation Forest'

# from sklearn.covariance import EllipticEnvelope
# selected_classifier = EllipticEnvelope()
# selected_classifier_name = 'Elliptic Envelope'

# from sklearn.neighbors import LocalOutlierFactor
# selected_classifier = LocalOutlierFactor()
# selected_classifier_name = 'Local Outlier Factor'

### Parameter Space - Coarse

Here for every algorithm that will be probed, we will define the parameter space for a coarse grid search.

##### Parameters for one-class SVM:

In [9]:
nu = [0.1, 0.3, 0.5, 0.7, 0.9]    # nu
degree = [1, 2, 3, 4]             # degree (poly)
gamma = [0.1, 0.3, 0.5, 0.7, 0.9] # gamma (rbf, poly, sigmoid)
coef0 = [0.1, 0.3, 0.5, 0.7, 0.9] # coef0 (poly, sigmoid)

def_degree = [3.0]
def_gamma = ['auto']
def_coef0 = [0.0]

# initialize a list of parameters the classifier can take
parameters_SVM = [{'kernel': ['linear'],
                   'nu': nu,
                   'degree': def_degree,
                   'gamma': def_gamma,
                   'coef0': def_coef0},
                  {'kernel': ['rbf'],
                   'nu': nu,
                   'degree': def_degree,
                   'gamma': gamma,
                   'coef0': def_coef0},
                  {'kernel': ['poly'],
                   'nu': nu,
                   'degree': degree,
                   'gamma': gamma,
                   'coef0': coef0},
                  {'kernel': ['sigmoid'],
                   'nu': nu,
                   'degree': def_degree,
                   'gamma': gamma,
                   'coef0': coef0}]

param_names_SVM = ['kernel', 'nu', 'degree', 'gamma', 'coef0']

In [10]:
# initialize a list of parameters the classifier can take
parameters_SVM_only_rbf = [{'kernel': ['rbf'],
                           'nu': nu,
                           'degree': def_degree,
                           'gamma': gamma,
                           'coef0': def_coef0}]

param_names_SVM = ['kernel', 'nu', 'degree', 'gamma', 'coef0']

##### Parameters for Isolation Forest:

In [36]:
estimators = [300, 400, 500, 600]      # number of estimators
contamination = [0.05, 0.2, 0.35] # contamination
max_features = [0.5, 0.7, 0.9]    # max features

def_samples = [1.0]

parameters_IF = [{'n_estimators': estimators,
                  'max_samples': def_samples,
                  'contamination': contamination,
                  'max_features': max_features}]

param_names_IF = ['n_estimators', 'max_samples', 'contamination', 'max_features']

In [12]:
def findListOfParameters(parameters):
    """ Returns all combinations of the dictionary values as a list of tuples (in key order) """
    return [tup for d in parameters for tup in list(itertools.product(*d.values()))]

In [13]:
selected_parameters = findListOfParameters(parameters_IF)
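
As a quick check of what the expansion produces, the sketch below applies the same `itertools.product` expression to the coarse Isolation Forest grid defined above. The variable names `grid`, `combos` and `as_dicts` are chosen for this example only.

```python
import itertools

grid = [{'n_estimators': [300, 400, 500, 600],
         'max_samples': [1.0],
         'contamination': [0.05, 0.2, 0.35],
         'max_features': [0.5, 0.7, 0.9]}]

# one tuple per combination, values in the order of the dictionary keys
combos = [tup for d in grid for tup in itertools.product(*d.values())]
print(len(combos))   # 36
print(combos[0])     # (300, 1.0, 0.05, 0.5)

# optional: keep the names next to the values instead of relying on position
as_dicts = [dict(zip(d.keys(), tup))
            for d in grid for tup in itertools.product(*d.values())]
print(as_dicts[0]['contamination'])   # 0.05
```

The tuples carry no parameter names, so the code that consumes them has to index the values in the same order as the dictionary keys. The dictionary form is one possible way to avoid that coupling.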

### Coarse Grid Search

In order to find the optimal parameter combination, we will run the selected model for every user and every group of features. Then we will select the parameters with the best mean score across all combinations of users with features. The data structure that we will use will look like:

|                 | f1,u1 | f1,u2 |  ...  | f1,uN |  ...  | fN,u1 | fN,u2 |  ...  | fN,uN |
|-----------------|-------|-------|-------|-------|-------|-------|-------|-------|-------|
|   parameters1   |  val  |  val  |  ...  |  val  |  ...  |  val  |  val  |  ...  |  val  |
|   parameters2   |  val  |  val  |  ...  |  val  |  ...  |  val  |  val  |  ...  |  val  |
|       ...       |  val  |  val  |  ...  |  val  |  ...  |  val  |  val  |  ...  |  val  |
|   parametersN   |  val  |  val  |  ...  |  val  |  ...  |  val  |  val  |  ...  |  val  |
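
To get a feeling for the size of this table, the sketch below builds the column labels and counts the model fits behind it. The counts are taken from this notebook's run (36 parameter sets, 10 feature sets, 16 users); the figure of 5 cross-validation folds plus one extra fit per cell is an assumption based on the default `folds=5` of the scoring function used in this notebook.

```python
n_params, n_feature_sets, n_users, folds = 36, 10, 16, 5

# one column per (feature set, user) pair, labelled like the table header
columns = ['f{}, u{}'.format(i + 1, j + 1)
           for i in range(n_feature_sets) for j in range(n_users)]

# per cell: one fit per CV fold (FN rate) plus one fit on all positives (FP rate)
fits_per_cell = folds + 1
total_fits = n_params * len(columns) * fits_per_cell

print(columns[0], '|', columns[-1])   # f1, u1 | f10, u16
print(len(columns), total_fits)       # 160 34560
```

Tens of thousands of fits per search explains why a single grid search takes hours, and why the grid is kept coarse in the first pass.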

In [14]:
from sklearn.model_selection import KFold
from sklearn.preprocessing import MinMaxScaler
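
In the cells that follow, the scaler is fitted on the training rows only and then applied to the held-out rows. The plain-Python sketch below illustrates that idea with made-up numbers. It is a simplified stand-in for `MinMaxScaler`, and the helper names `fit_min_max` and `transform` are hypothetical.

```python
def fit_min_max(train_rows):
    """Record (min, max) of every column, using the training rows only."""
    return [(min(col), max(col)) for col in zip(*train_rows)]

def transform(rows, bounds):
    """Scale rows with the stored bounds; a constant column is divided by 1."""
    return [[(v - lo) / ((hi - lo) or 1.0) for v, (lo, hi) in zip(row, bounds)]
            for row in rows]

train = [[1.0, 10.0], [3.0, 30.0]]
test = [[2.0, 40.0]]

bounds = fit_min_max(train)
print(transform(train, bounds))   # [[0.0, 0.0], [1.0, 1.0]]
print(transform(test, bounds))    # [[0.5, 1.5]]
```

The held-out value 40.0 maps to 1.5, outside the [0, 1] range, because the bounds never saw it. That is the expected behaviour when no information from the test rows is allowed to leak into the scaling.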

In [15]:
def addParamsToClf(clf, p, clf_name):
    """ Adds the parameters of the classifier according to its name. """
    
    if clf_name == 'one-class SVM':
        clf.set_params(kernel=p[0], nu=p[1], degree=p[2], gamma=p[3], coef0=p[4])
    elif clf_name == 'Isolation Forest':
        clf.set_params(n_estimators=p[0], max_samples=p[1], contamination=p[2], max_features=p[3])
    else:
        raise ValueError('The classifier name is not supported')
    
    return clf

In [16]:
def findFalseNegatives(pred):
    """ Finds the False Negative rate (%) in a prediction of only positive samples. """
    # get the unique predicted labels and their counts in a dictionary
    unique_counts = dict(zip(*np.unique(pred, return_counts=True)))
    # count the positive samples that were wrongly rejected as outliers (-1)
    num_false_negatives = unique_counts[-1] if -1 in unique_counts else 0
    # return the rate as a percentage
    return num_false_negatives / sum(unique_counts.values()) * 100

In [17]:
def findFalsePositives(pred):
    """ Finds the False Positive rate (%) in a prediction of only negative samples. """
    # get the unique predicted labels and their counts in a dictionary
    unique_counts = dict(zip(*np.unique(pred, return_counts=True)))
    # count the negative samples that were wrongly accepted as inliers (1)
    num_false_positives = unique_counts[1] if 1 in unique_counts else 0
    # return the rate as a percentage
    return num_false_positives / sum(unique_counts.values()) * 100
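
The two rates can be checked by hand on a tiny example. The sketch below restates the counting with the standard library instead of NumPy; the helper name `rate_of` and the prediction lists are made up for illustration. As in the functions above, `1` means "accepted as the user" and `-1` means "rejected as an outlier".

```python
from collections import Counter

def rate_of(label, predictions):
    """Share (in %) of predictions that carry the given label."""
    return Counter(predictions)[label] / len(predictions) * 100

own_locks_pred = [1, 1, -1, 1]                        # the user's own locks
other_locks_pred = [-1, -1, 1, -1, -1, -1, -1, -1]    # locks of other users

fn_rate = rate_of(-1, own_locks_pred)    # own locks that were rejected
fp_rate = rate_of(1, other_locks_pred)   # foreign locks that were accepted
print(fn_rate, fp_rate)                  # 25.0 12.5
```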

In [18]:
def classifyWithCV(X_positive, X_negative, clf, scaler, folds=5, random_state=0):
    """ Returns the mean of the cross-validated FN rate and the FP rate of a classifier. """
    
    # initialize a cross validation object
    kf = KFold(n_splits=folds, shuffle=True, random_state=random_state)

    FN_rates = []

    for train, test in kf.split(X_positive):

        # fit the scaler on the training fold only
        scaler.fit(X_positive.loc[train, :])

        # scale all data before fitting the classifier or making predictions
        X_pos_transformed_train = scaler.transform(X_positive.loc[train, :])
        X_pos_transformed_test = scaler.transform(X_positive.loc[test, :])
        
        # fit the classifier
        clf.fit(X_pos_transformed_train)

        # make predictions on Positive data
        prediction_pos = clf.predict(X_pos_transformed_test)

        # find the false negatives of the split and save to the array
        FN_rates.append(findFalseNegatives(prediction_pos))
        
    # fit the scaler on all positive data
    scaler.fit(X_positive)
    X_pos_transformed_train = scaler.transform(X_positive)
    X_neg_transformed_test = scaler.transform(X_negative)
    
    # fit the classifier
    clf.fit(X_pos_transformed_train)

    # make predictions on Negative data
    prediction_neg = clf.predict(X_neg_transformed_test)

    # find the false positives
    FP_rate = findFalsePositives(prediction_neg)
        
    # return the mean of the two (more intuitive sense than the sum)
    return sum([np.mean(FN_rates), FP_rate])/2
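
The returned score weights the two error types equally: the FN rate is first averaged over the folds, and that average is then averaged with the single FP rate. A minimal sketch with made-up rates, using a hypothetical helper name `combined_error`:

```python
from statistics import mean

def combined_error(fn_rates_per_fold, fp_rate):
    """Mean of (average FN rate over the folds) and the FP rate."""
    return (mean(fn_rates_per_fold) + fp_rate) / 2

score = combined_error([0.0, 8.0, 16.0, 8.0, 8.0], 12.0)
print(score)   # 10.0
```

Averaging the folds first means the number of folds does not change how much the FN side counts in the final score.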

In [19]:
def bestParameter_GridSearch(X_all, clf, params, features, users, clf_name, search_type,
                             locks_per_participant=60):
    """ Finds the best set of parameters across all users and features """
    
    # initialize the dataframe for all values
    df = pd.DataFrame()
    
    # initialize a minmax scaler
    scaler = MinMaxScaler()
    
    for cc, p in enumerate(params):
        
        # add the parameters to the classifier according to its name
        classifier = addParamsToClf(clf, p, clf_name)
        
        if clf_name=='one-class SVM':
            param_string = '{}, {}, {}, {}, {}'.format(*p)
        elif clf_name=='Isolation Forest':
            param_string = '{}, {}, {}, {}'.format(*p)
        else:
            raise ValueError("Something is wrong with the classifier name")
        
        print("Currently working on the {}/{} set of parameters which is: "
              .format(cc+1, len(params))+param_string+" ...")
        
        for i, f in enumerate(features):

            for j, u in enumerate(users):
                
                # define the column name
                col = 'f'+str(i+1)+', u'+str(j+1)
                
                # find the starting index of the original dataframe
                idx = j*locks_per_participant
        
                # find the mask for positive values to slice the dataset
                positive_mask = X_all.index.isin(range(idx, idx+locks_per_participant))

                # find the lines that correspond to this participant and the feature set
                X_pos = X_all.loc[positive_mask, f]

                # find the lines that correspond to all other participants and the feature set
                X_neg = X_all.loc[~positive_mask, f]

                # reset the index of the features dataframe
                X_pos.index = range(X_pos.shape[0])
                
                # reset the index of the features dataframe
                X_neg.index = range(X_neg.shape[0])
                
                # run classify with cv
                df.loc[param_string, col] = classifyWithCV(X_pos, X_neg, classifier, scaler)
                
            # give some feedback that the feature is finished
            print('The models are finished for {}/{} features'.format(i+1, len(features)))
            
        clear_output()
            
    cols = df.columns.tolist()
    
    # set new column for the parameters
    df.loc[:, 'parameters'] = df.index
    
    # make new column for the mean scores
    df.loc[:, 'mean FN+FP'] = np.mean(df.loc[:, cols], axis=1)
    
    # make new column for the std of the scores
    df.loc[:, 'std FN+FP'] = np.std(df.loc[:, cols], axis=1)
    
    # reset the index of the features dataframe
    df.index = range(df.shape[0])
    
    cols = df.columns.tolist()
    
    # reorder the columns
    cols = cols[-3:] + cols[:-3]
    
    df = df[cols]
    
    print("The result of the grid search for the parameters is:\n")
    
    # show the dataframe
    display(df)
    
    # log the results
    df.to_csv(target_path_rel + '{} logs for {}.csv'.format(search_type, clf_name), index=False)
    
    # find the winning line
    winning_line = df.loc[df.loc[:, 'mean FN+FP'].idxmin(), :]
        
    # return the parameter with the smallest sum of mean FN and FP
    return winning_line['parameters'], winning_line['mean FN+FP'], winning_line['std FN+FP']
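
The positive/negative split inside the loop relies on every participant occupying a contiguous block of 60 rows in the combined dataframe. The sketch below shows that index arithmetic on its own, without pandas. The helper name `user_row_range` is hypothetical, and 16 users of 60 rows each match the 960 rows displayed earlier.

```python
def user_row_range(user_index, locks_per_participant=60):
    """Rows that belong to the user at position user_index (0-based)."""
    start = user_index * locks_per_participant
    return range(start, start + locks_per_participant)

n_users = 16
positive = set(user_row_range(2))                  # third participant
negative = set(range(n_users * 60)) - positive     # everybody else

print(min(positive), max(positive), len(negative))   # 120 179 900
```

This only holds while each file contributes exactly the same number of rows in file order, so changing the sample size in one place without the other would silently mix participants.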

In [20]:
print('The number of selected parameter sets is {}.'.format(len(selected_parameters)))

The number of selected parameter sets is 36.


In [21]:
start = time()
best_parameters_coarse_search,\
best_mean_of_FN_FP_coarse_search,\
std_of_FN_FP_mean_coarse_search = bestParameter_GridSearch(X,
                                                           selected_classifier,
                                                           selected_parameters,
                                                           feature_sets,
                                                           feature_files, # meaning users
                                                           selected_classifier_name,
                                                           'Coarse Grid Search')

print('It took {:.2f} minutes to run the above function'.format((time() - start)/60))

The result of the grid search for the parameters is:



Unnamed: 0,parameters,mean FN+FP,std FN+FP,"f1, u1",...,"f10, u13","f10, u14","f10, u15","f10, u16"
0,"300, 1.0, 0.05, 0.5",23.742014,11.945584,16.055556,...,21.666667,29.111111,10.222222,9.888889
1,"300, 1.0, 0.05, 0.7",23.649653,12.194530,18.555556,...,20.666667,34.888889,11.055556,17.055556
2,"300, 1.0, 0.05, 0.9",23.268403,12.029715,12.166667,...,24.611111,32.000000,15.000000,7.833333
3,"300, 1.0, 0.2, 0.5",19.993750,7.076355,19.722222,...,14.833333,17.222222,12.722222,11.055556
...,...,...,...,...,...,...,...,...,...
32,"600, 1.0, 0.2, 0.9",19.917361,7.384408,19.222222,...,15.222222,17.888889,9.555556,11.000000
33,"600, 1.0, 0.35, 0.5",24.411111,4.147089,19.888889,...,22.222222,22.944444,23.555556,21.777778
34,"600, 1.0, 0.35, 0.7",24.333333,4.419286,19.611111,...,20.000000,21.111111,23.555556,19.222222
35,"600, 1.0, 0.35, 0.9",24.283681,4.158448,19.500000,...,20.611111,20.555556,22.722222,22.611111


It took 599.14 minutes to run the above function


In [22]:
print("The best parameters of the coarse search made with the {} classifier are:\n{}"
      .format(selected_classifier_name, best_parameters_coarse_search))
print('The error on those parameters in terms of combined FN and FP rates is {}.'
      .format(best_mean_of_FN_FP_coarse_search))
print('The std of this error across all features and participants is {}.'
      .format(std_of_FN_FP_mean_coarse_search))

The best parameters of the coarse search made with the Isolation Forest classifier are:
400, 1.0, 0.2, 0.9
The error on those parameters in terms of combined FN and FP rates is 19.8829861111111.
The std of this error across all features and participants is 7.249429513727164.
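
The winner is simply the row with the smallest mean error. The sketch below repeats that selection on four rows copied from the log printed above (three visible rows plus the reported winner); the remaining 32 rows are omitted, so this is an illustration rather than a re-run of the search.

```python
# 'mean FN+FP' per parameter string, copied from the coarse search output
log = {
    '300, 1.0, 0.2, 0.5': 19.993750,
    '400, 1.0, 0.2, 0.9': 19.8829861111111,
    '600, 1.0, 0.2, 0.9': 19.917361,
    '600, 1.0, 0.35, 0.5': 24.411111,
}

best = min(log, key=log.get)   # on a tie, the first entry wins
print(best)                    # 400, 1.0, 0.2, 0.9
```

The differences between the top rows are well below the reported standard deviation, so the choice between them should be read as a weak preference.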


### Fine Grid Search

Here we define the parameters for a more refined grid search according to the results of the previous search.

##### Parameters for one-class SVM:

In [23]:
new_nu = [0.001, 0.005, 0.01, 0.015, 0.1, 0.15, 0.2, 0.25, 0.3]        # nu - best was 0.1 (rbf)
new_gamma = [0.0001, 0.0005, 0.001, 0.005, 0.01, 0.05, 0.1, 0.15, 0.2] # gamma - best was 0.1 (rbf)

new_def_degree = [3.0]
new_def_coef0 = [0.0]

# initialize a list of parameters the classifier can take
new_parameters_SVM = [{'kernel': ['rbf'],
                       'nu': new_nu,
                       'degree': new_def_degree,
                       'gamma': new_gamma,
                       'coef0': new_def_coef0}]

##### Parameters for Isolation Forest:

In [31]:
estimators = [350, 400, 450]           # number of estimators
contamination = [0.1, 0.15, 0.2, 0.25] # contamination
max_features = [0.8, 0.9, 1.0]         # max features

def_samples = [1.0]

new_parameters_IF = [{'n_estimators': estimators,
                      'max_samples': def_samples,
                      'contamination': contamination,
                      'max_features': max_features}]
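
A simple sanity check for a refined grid is that it still contains the winner of the coarse search, so the fine search can never end up worse by construction. The sketch below performs that check for the values used in this notebook; the variable names are chosen for this example.

```python
coarse_best = {'n_estimators': 400, 'max_samples': 1.0,
               'contamination': 0.2, 'max_features': 0.9}

fine_grid = {'n_estimators': [350, 400, 450],
             'max_samples': [1.0],
             'contamination': [0.1, 0.15, 0.2, 0.25],
             'max_features': [0.8, 0.9, 1.0]}

# parameters whose coarse winner is missing from the fine grid
missing = [name for name, value in coarse_best.items()
           if value not in fine_grid[name]]
print(missing)   # []
```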

In [32]:
selected_parameters_fine_grid_search = findListOfParameters(new_parameters_IF)

print("The number of parameters for the fine grid search is {}."
      .format(len(selected_parameters_fine_grid_search)))

The number of parameters for the fine grid search is 36.


In [33]:
start = time()
best_parameters_fine_search,\
best_mean_of_FN_FP_fine_search,\
std_of_FN_FP_mean_fine_search = bestParameter_GridSearch(X, 
                                                         selected_classifier, 
                                                         selected_parameters_fine_grid_search,
                                                         feature_sets, 
                                                         feature_files,
                                                         selected_classifier_name, 
                                                         'Fine Grid Search')

print('It took {:.2f} minutes to run the above function'.format((time() - start)/60))

The result of the grid search for the parameters is:



Unnamed: 0,parameters,mean FN+FP,std FN+FP,"f1, u1",...,"f10, u13","f10, u14","f10, u15","f10, u16"
0,"350, 1.0, 0.1, 0.8",20.018056,10.369014,15.277778,...,14.500000,15.333333,7.666667,6.777778
1,"350, 1.0, 0.1, 0.9",20.015972,10.174626,17.388889,...,13.333333,14.333333,10.444444,5.555556
2,"350, 1.0, 0.1, 1.0",19.858333,10.431870,12.555556,...,15.444444,12.611111,12.055556,4.111111
3,"350, 1.0, 0.15, 0.8",19.475694,8.788098,15.555556,...,13.444444,12.333333,10.333333,9.388889
...,...,...,...,...,...,...,...,...,...
32,"450, 1.0, 0.2, 1.0",20.054514,7.293937,19.277778,...,17.722222,15.166667,13.333333,12.722222
33,"450, 1.0, 0.25, 0.8",20.940972,6.274922,19.666667,...,15.000000,17.444444,13.722222,15.944444
34,"450, 1.0, 0.25, 0.9",20.946181,6.109828,21.444444,...,17.444444,17.277778,14.111111,11.777778
35,"450, 1.0, 0.25, 1.0",20.936458,5.958336,18.500000,...,18.666667,17.111111,16.888889,14.333333


It took 436.79 minutes to run the above function


In [34]:
print("The best parameters of the fine search made with the {} classifier are:\n{}"
      .format(selected_classifier_name, best_parameters_fine_search))
print('The error on those parameters in terms of combined FN and FP rates is {}.'
      .format(best_mean_of_FN_FP_fine_search))
print('The std of this error across all features and participants is {}.'
      .format(std_of_FN_FP_mean_fine_search))

The best parameters of the fine search made with the Isolation Forest classifier are:
350, 1.0, 0.15, 1.0
The error on those parameters in terms of combined FN and FP rates is 19.1763888888889.
The std of this error across all features and participants is 8.642370569389048.


### Drop a cleaner csv with only the best parameters

To do that, we pick the best parameters and drop them in a smaller csv, using the name of the classifier in the file name. The goal is to have one such file for every classifier that is used in the next stage.

In [37]:
# split the string on the comma and space characters that separate the params (', ')
best_params_list = best_parameters_fine_search.split(', ')

best_params_list = [best_params_list[0]] + [p for p in best_params_list[1:]]

# put them to a dataframe
best_params_df = pd.DataFrame({'parameter name': param_names_IF,
                               'parameter value': best_params_list})

print("The dataframe with the best parameters is:\n")
display(best_params_df)

The dataframe with the best parameters is:



Unnamed: 0,parameter name,parameter value
0,n_estimators,350.0
1,max_samples,1.0
2,contamination,0.15
3,max_features,1.0
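
Splitting the parameter string yields text values, so a later stage has to convert them back to numbers before passing them to a classifier. The sketch below is one possible conversion, not part of the notebook: the helper name `parse_value` is hypothetical, and the string fallback is there for values such as an SVM kernel name.

```python
def parse_value(text):
    """Turn '350' into an int, '0.15' into a float, and leave e.g. 'rbf' as text."""
    for cast in (int, float):
        try:
            return cast(text)
        except ValueError:
            pass
    return text

names = ['n_estimators', 'max_samples', 'contamination', 'max_features']
best = dict(zip(names, map(parse_value, '350, 1.0, 0.15, 1.0'.split(', '))))
print(best)
# {'n_estimators': 350, 'max_samples': 1.0, 'contamination': 0.15, 'max_features': 1.0}
```

Keeping `1.0` as a float matters for scikit-learn's Isolation Forest, where a float `max_samples` or `max_features` is read as a fraction and an integer as an absolute count.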


In [38]:
# make a csv using the name of the classifier as the file name
best_params_df.to_csv(target_path_rel + 'Optimal Parameters --- {}.csv'
                      .format(selected_classifier_name), 
                      index=False)

In [39]:
# append the score and std of the best params
with open(target_path_rel + 'Optimal Parameters --- {}.csv'.format(selected_classifier_name), 'a') as fd:
    fd.write('--------------------------------------------------------------------------------\n')
    fd.write('--------------------------------------------------------------------------------\n')
    fd.write('The mean error across features and users of the above parameters is {:.4f}\n'
             .format(best_mean_of_FN_FP_fine_search))
    fd.write('The std of the aforementioned error is {:.4f}\n'.format(std_of_FN_FP_mean_fine_search))
    fd.write('--------------------------------------------------------------------------------\n')
    fd.write('The error for every separate case is measured as the mean of FN rate and FP rate')
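
Because free-text lines are appended after the table, the resulting file is no longer a plain two-column csv. A reader in the next stage therefore has to stop at the first separator line. The sketch below shows one way to do that with the standard library. The function name `read_parameters` is hypothetical, the sample text mimics the file layout produced above, and the separator is shortened for readability.

```python
import csv
import io

sample = (
    "parameter name,parameter value\n"
    "n_estimators,350.0\n"
    "max_samples,1.0\n"
    "contamination,0.15\n"
    "max_features,1.0\n"
    "----------------------------------------\n"
    "The mean error across features and users of the above parameters is 19.1764\n"
)

def read_parameters(handle):
    """Read name/value rows until the first line that is not a two-column row."""
    reader = csv.reader(handle)
    next(reader)                    # skip the header row
    params = {}
    for row in reader:
        if len(row) != 2:           # separator or free-text footer reached
            break
        params[row[0]] = row[1]
    return params

print(read_parameters(io.StringIO(sample)))
```

The values come back as strings, so they still need to be converted to numbers before use.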