<a href="https://colab.research.google.com/github/SungjooHwang/ICTclass/blob/main/Ex06_2_Practice_1_Stress_Recognition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

**This code is based on: https://stackabuse.com/cross-validation-and-grid-search-for-model-selection-in-python/**

**The Dataset**

For Practice #1 in Lecture Note #10, use 'EDA_features.csv', which contains electrodermal activity (EDA) signal features and stress labels ('1' for high stress, '0' for low stress).



**Importing Libraries**

In [None]:
import numpy as np
import matplotlib.pyplot as plt
import pandas as pd

**Importing the Dataset**

Import the dataset and load it into a pandas DataFrame.

In [None]:
dataset = pd.read_csv("EDA_features.csv")

In [None]:
dataset.head()

| | mean.SCR-Amplitude | mean.nSCR | mean.Latency [s] | mean.AmpSum [muS] | mean.SCR [muS] | mean.ISCR [muSxs] | mean.PhasicMax [muS] | mean.Tonic [muS] | Stress |
|---|---|---|---|---|---|---|---|---|---|
| 0 | 0.056009 | 0.483871 | 8.642857 | 0.025914 | 0.001367 | 0.164075 | 0.100281 | 0.347202 | 0 |
| 1 | 0.038883 | 2.866667 | 9.697368 | 0.111464 | 0.003969 | 0.476328 | 0.214056 | 0.324783 | 1 |
| 2 | 0.096909 | 6.413793 | 7.595238 | 0.617555 | 0.016855 | 2.021824 | 0.697503 | 0.9309 | 0 |
| 3 | 0.024563 | 0.851852 | 11.291667 | 0.020922 | 0.001667 | 0.199644 | 0.106704 | 0.085922 | 1 |
| 4 | 0.078719 | 6.064516 | 7.428571 | 0.471905 | 0.012251 | 1.470121 | 0.430211 | 0.912001 | 0 |


**Preprocessing**

The next step is to split our dataset into its attributes and labels.

The X variable contains every column except the last one (i.e. the eight EDA signal features), while y contains the last column, the stress labels.

In [None]:
X = dataset.iloc[:, :-1].values
y = dataset.iloc[:, -1].values
print(y)

[0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0 1 0
 1 0 1 0 1 0 1]
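The positional split above can also be written by column name, which keeps working if the column order changes. The following is a minimal sketch on a made-up miniature frame (the values and the `toy` name are hypothetical, only the `Stress` column name comes from the dataset); it also checks the class balance, which matters when the dataset is this small.

```python
import pandas as pd

# Hypothetical miniature stand-in for the real file; the values are made up.
toy = pd.DataFrame({
    "mean.nSCR": [0.5, 2.9, 6.4, 0.9],
    "mean.Tonic [muS]": [0.35, 0.32, 0.93, 0.09],
    "Stress": [0, 1, 0, 1],
})

X_toy = toy.drop(columns="Stress").values  # every column except the label
y_toy = toy["Stress"].values               # the label column

print(X_toy.shape, y_toy.shape)
print(toy["Stress"].value_counts().to_dict())  # samples per class
```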


**Feature Scaling**

Before making any actual predictions, it is always a good practice to scale the features so that all of them can be uniformly evaluated. Wikipedia explains the reasoning pretty well:

*Since the range of values of raw data varies widely, in some machine learning algorithms, objective functions will not work properly without normalization. For example, the majority of classifiers calculate the distance between two points by the Euclidean distance. If one of the features has a broad range of values, the distance will be governed by this particular feature. Therefore, the range of all features should be normalized so that each feature contributes approximately proportionately to the final distance.*

The gradient descent algorithm (which is used in neural network training and other machine learning algorithms) also converges faster with normalized features.
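To make the idea concrete, here is a small sketch of what standardization does. It uses a made-up two-feature array (not the EDA data) and checks that StandardScaler gives the same result as subtracting each column's mean and dividing by its standard deviation by hand.

```python
import numpy as np
from sklearn.preprocessing import StandardScaler

# Hypothetical toy data: two features on very different scales.
toy = np.array([[0.1, 100.0],
                [0.2, 300.0],
                [0.3, 500.0],
                [0.4, 700.0]])

# Standardize by hand: subtract the column mean and divide by the column
# standard deviation (population std, ddof=0, as StandardScaler does).
manual = (toy - toy.mean(axis=0)) / toy.std(axis=0)

scaled = StandardScaler().fit_transform(toy)

print(np.allclose(manual, scaled))
print(scaled.mean(axis=0).round(6), scaled.std(axis=0).round(6))
```

After scaling, both columns have mean 0 and standard deviation 1, so neither dominates a distance computation.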



In [None]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
scaler.fit(X)

X = scaler.transform(X)
print(X)

[[-2.37835627e-01 -1.18098328e+00  5.56924060e-01 -5.57631061e-01
  -5.77339774e-01 -5.77284688e-01 -6.45045982e-01 -2.89877315e-01]
 [-4.46185385e-01 -7.06398115e-01  8.52879101e-01 -4.86875143e-01
  -4.84614841e-01 -4.84562719e-01 -4.99310210e-01 -2.94790285e-01]
 [ 2.59747988e-01  8.86287371e-05  2.62903366e-01 -6.82969802e-02
  -2.54354822e-02 -2.56346565e-02  1.19944847e-01 -1.61965353e-01]
 [-6.20392984e-01 -1.10769195e+00  1.30032866e+00 -5.61760051e-01
  -5.66671773e-01 -5.66722618e-01 -6.36819110e-01 -3.47134468e-01]
 [ 3.84492078e-02 -6.94774148e-02  2.16127346e-01 -1.88761202e-01
  -1.89502915e-01 -1.89460289e-01 -2.22433668e-01 -1.66106977e-01]
 [-4.92308067e-01 -1.14457557e+00 -6.29387206e-04 -5.59716857e-01
  -5.78549997e-01 -5.78371678e-01 -6.72490207e-01 -3.21818023e-01]
 [ 6.25116910e-01  1.53032004e+00 -1.30969779e+00  8.89435725e-01
   9.42333707e-01  9.42339915e-01  1.34744259e+00  6.00750873e-02]
 [ 2.88121456e+00  2.01218674e+00 -1.56998370e+00  3.71339524e+00
   

**Training and Cross Validation**

The first step in the training and cross-validation phase is to create the classifier, here a random forest with 300 trees and a fixed random seed:

In [None]:
from sklearn.ensemble import RandomForestClassifier
classifier = RandomForestClassifier(n_estimators=300, random_state=0)


Next, to implement cross validation, the *cross_val_score* function of the *sklearn.model_selection* module can be used. For a classifier, *cross_val_score* returns the accuracy for each fold by default. The first parameter, estimator, specifies the algorithm that you want to cross-validate. The second and third parameters, X and y, are the features and labels. Finally, the number of folds (10 here) is passed to the cv parameter, as shown in the following code:

In [None]:
from sklearn.model_selection import cross_val_score
all_accuracies = cross_val_score(estimator=classifier, X=X, y=y, cv=10)

In [None]:
print(all_accuracies)
print(all_accuracies.mean())
print(all_accuracies.std())

[0.4  0.4  0.8  0.6  0.75 0.75 0.5  0.5  1.   0.25]
0.595
0.21615966321217286
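One caveat about the workflow above: the scaler was fitted on the full dataset before cross-validation, so every validation fold has already contributed to the scaling statistics. A common way to avoid this is to put the scaler and the classifier in a scikit-learn pipeline, so the scaler is re-fitted on the training part of each fold. The sketch below illustrates this on synthetic data from `make_classification` (an assumption standing in for the real file); the names `X_demo`, `y_demo`, `model` and `folds`, the smaller forest, and the choice of five shuffled stratified folds are also illustrative.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the real data: 44 samples, 8 features (assumption).
X_demo, y_demo = make_classification(n_samples=44, n_features=8, random_state=0)

# Inside cross_val_score, the whole pipeline is fitted per fold, so the
# scaler only ever sees the training part of that fold.
model = make_pipeline(StandardScaler(),
                      RandomForestClassifier(n_estimators=50, random_state=0))

# Stratified folds keep the class ratio similar in every fold.
folds = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
scores = cross_val_score(model, X_demo, y_demo, cv=folds)

print(scores)
print(scores.mean(), scores.std())
```

With only a few samples per fold, the per-fold accuracies will vary widely, as they do in the output above, so the mean is best read together with the standard deviation.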


**(Optional) Grid Search for Parameter Selection**

A machine learning model has two types of parameters. The first type is learned from the data during training, while the second type, the hyperparameters, is set by us before training and passed to the model.

In the last section, we used the Random Forest algorithm with 300 estimators. Similarly, for the KNN algorithm we have to specify the value of K, and for the SVM algorithm we have to specify the type of kernel. The number of estimators, the K value, and the kernel type are all hyperparameters.
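The distinction can be seen directly on a scikit-learn estimator. In this minimal sketch (synthetic data and a deliberately small forest, both assumptions for illustration), the hyperparameters are available before training through `get_params()`, while the learned quantities such as the fitted trees only exist after `fit()`.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in data (assumption): 40 samples, 8 features.
X_demo, y_demo = make_classification(n_samples=40, n_features=8, random_state=0)

forest = RandomForestClassifier(n_estimators=10, random_state=0)

# Hyperparameter: chosen by us, available before any training.
print(forest.get_params()["n_estimators"])

# Learned parameters: created by fit(), stored in attributes ending in "_".
forest.fit(X_demo, y_demo)
print(len(forest.estimators_))            # the fitted decision trees
print(forest.feature_importances_.shape)  # one importance per feature
```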

Normally we set the values of these hyperparameters by trial and error and see which values give the best performance. However, picking values by hand in this way can be exhausting.

It is also hard to compare different algorithms fairly this way, because one algorithm may outperform another under one set of hyperparameters and underperform it under a different set.

Therefore, instead of randomly selecting the values of the parameters, a better approach would be to develop an algorithm which automatically finds the best parameters for a particular model. Grid Search is one such algorithm.

To implement the Grid Search algorithm, import the GridSearchCV class from the sklearn.model_selection module.

The first step is to create a dictionary of the parameters and the values you want to test for each. Each key is a parameter name, and each value is the list of candidate values for that parameter.

Let's create such a dictionary for our Random Forest algorithm. Details of all the parameters for the random forest algorithm are available in the Scikit-Learn docs.

In [None]:
from sklearn.model_selection import GridSearchCV
grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False]
}

Take a careful look at the above code. Here we create the grid_param dictionary with three parameters: n_estimators, criterion, and bootstrap. The values to try for each parameter are passed as a list. For instance, for n_estimators we want to find which value (out of 100, 300, 500, 800, and 1000) provides the highest accuracy.

Similarly, we want to find which value results in the highest performance for the criterion parameter: "gini" or "entropy"? The Grid Search algorithm basically tries all possible combinations of parameter values and returns the combination with the highest accuracy. For instance, in the above case the algorithm will check 20 combinations (5 x 2 x 2 = 20).

The Grid Search algorithm can be very slow, owing to the potentially huge number of combinations to test. Furthermore, cross validation further increases the execution time and complexity.
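The size of the search can be checked before running it. This sketch uses scikit-learn's `ParameterGrid` to enumerate the combinations in the dictionary above, and multiplies by an assumed five folds (the value used later in this notebook) to estimate how many models are trained.

```python
from sklearn.model_selection import ParameterGrid

grid_param = {
    'n_estimators': [100, 300, 500, 800, 1000],
    'criterion': ['gini', 'entropy'],
    'bootstrap': [True, False],
}

# Every combination of one value per parameter.
combinations = list(ParameterGrid(grid_param))
n_folds = 5  # assumption for this sketch

print(len(combinations))            # 5 x 2 x 2 = 20 combinations
print(combinations[0])              # one candidate setting
print(len(combinations) * n_folds)  # 100 model fits during the search
```

With the default `refit=True`, GridSearchCV trains one more model at the end, using the best combination on all the data passed to `fit`.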

Once the parameter dictionary is created, the next step is to create an instance of the GridSearchCV class. The estimator parameter is the algorithm that you want to tune. The param_grid parameter takes the parameter dictionary that we just created, the scoring parameter takes the performance metric, the cv parameter sets the number of folds (5 in our case), and the n_jobs parameter sets the number of CPU cores to use. A value of -1 for n_jobs means that all available cores are used, which is handy when the grid or the dataset is large.

In [None]:
gd_sr = GridSearchCV(estimator=classifier,
                     param_grid=grid_param,
                     scoring='accuracy',
                     cv=5,
                     n_jobs=-1)



Once the GridSearchCV object is created, the last step is to call its fit method and pass it the features X and the labels y.



In [None]:
gd_sr.fit(X, y)

GridSearchCV(cv=5,
             estimator=RandomForestClassifier(n_estimators=300, random_state=0),
             n_jobs=-1,
             param_grid={'bootstrap': [True, False],
                         'criterion': ['gini', 'entropy'],
                         'n_estimators': [100, 300, 500, 800, 1000]},
             scoring='accuracy')

Once the method completes execution, the next step is to check the parameters that return the highest accuracy. To do so, print the gd_sr.best_params_ attribute of the GridSearchCV object.

In [None]:
best_parameters = gd_sr.best_params_
print(best_parameters)

{'bootstrap': True, 'criterion': 'entropy', 'n_estimators': 300}


The final step of the Grid Search algorithm is to check the accuracy obtained with the best parameters. The best_score_ attribute holds the mean cross-validated accuracy of the best parameter combination.

In [None]:
best_result = gd_sr.best_score_
print(best_result)

0.5916666666666666
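Beyond the single best score, the fitted search object also stores the results for every combination in `cv_results_`. The sketch below shows one way to inspect them; it runs on synthetic data with a much smaller grid (both assumptions, so that it finishes quickly), and the names `search`, `results` and `columns` are illustrative.

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in data (assumption): 44 samples, 8 features.
X_demo, y_demo = make_classification(n_samples=44, n_features=8, random_state=0)

search = GridSearchCV(
    estimator=RandomForestClassifier(random_state=0),
    param_grid={'n_estimators': [10, 30], 'criterion': ['gini', 'entropy']},
    scoring='accuracy',
    cv=5,
)
search.fit(X_demo, y_demo)

# One row per parameter combination, best-ranked first.
results = pd.DataFrame(search.cv_results_)
columns = ['params', 'mean_test_score', 'std_test_score', 'rank_test_score']
print(results[columns].sort_values('rank_test_score'))

# best_score_ is the highest mean cross-validated score in that table.
print(search.best_score_, results['mean_test_score'].max())

# With the default refit=True, the best model is retrained on all the data
# and can be used directly for prediction.
print(search.best_estimator_.predict(X_demo[:3]))
```

Looking at `std_test_score` next to `mean_test_score` helps judge whether the best combination is clearly better than the rest. On a dataset of this size the fold-to-fold spread can be large, so small differences in mean accuracy may not be meaningful.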
