# Hyperparameter tuning using Grid Search to select the best C & Gamma for SVM classifier

Running this notebook takes too long: https://www.kaggle.com/code/sreedevirajavelu/vreed-modelling-1/edit

The notebook searches a very large range of values for C and gamma. The ranges are built with np.logspace(), which returns values spaced evenly on a logarithmic scale:

C_range = np.logspace(-2, 10, 100)

gamma_range = np.logspace(-9, 3, 100)
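
To make the size of these ranges concrete, the short sketch below rebuilds the two ranges quoted above and prints their length, endpoints, and the ratio between neighbouring values. Only the two np.logspace() calls come from the notebook; the variable `ratios` and the printing are added for illustration.

```python
import numpy as np

# Same ranges as quoted above: 100 values spaced evenly on a log10 scale.
C_range = np.logspace(-2, 10, 100)      # from 10**-2 up to 10**10
gamma_range = np.logspace(-9, 3, 100)   # from 10**-9 up to 10**3

print(len(C_range), C_range[0], C_range[-1])
print(len(gamma_range), gamma_range[0], gamma_range[-1])

# Neighbouring values differ by a constant factor, not a constant step.
ratios = C_range[1:] / C_range[:-1]
print(ratios.min(), ratios.max())
```

Each range spans 12 orders of magnitude, so even a much coarser sampling still covers the same interval.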


# Causes of Long Runtime:

np.logspace() generates 100 values for each parameter, which gives a grid of 10,000 combinations (100 x 100).

GridSearchCV is initialized with this grid and 10-fold cross-validation (cv=10), so every combination is trained and scored 10 times, for 100,000 model fits in total.

An SVM with an RBF (Radial Basis Function) kernel can be computationally expensive. Each fit solves a quadratic optimization problem, which can be slow.
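
The cost can be checked with a few lines of arithmetic. The sketch below counts the cross-validation fits for the original settings and for the coarser settings used later in this notebook (10 values per parameter, 5 folds). The function name `count_cv_fits` is made up for this illustration, and the count ignores the single final refit that GridSearchCV performs by default.

```python
def count_cv_fits(n_c_values, n_gamma_values, cv_folds):
    """Number of model fits a full grid search performs during cross-validation."""
    n_combinations = n_c_values * n_gamma_values
    return n_combinations * cv_folds


original_fits = count_cv_fits(100, 100, cv_folds=10)
coarse_fits = count_cv_fits(10, 10, cv_folds=5)

print(f"Original grid: {original_fits} fits")
print(f"Coarse grid:   {coarse_fits} fits")
print(f"Reduction:     {original_fits // coarse_fits}x fewer fits")
```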


# Solutions to reduce runtime:

**1. Reduce Parameter Grid:**

Use a coarser grid (e.g. 10 values per parameter). Alternatively, use random search or successive halving to explore the space with fewer fits.

**2. Use fewer cross-validation folds:**

Reduce the number of cross-validation folds from 10 to 5, which halves the number of fits per parameter combination.


**3. Sample the data:**

If the dataset is large, run the grid search on a subset of the data.

**4. Parallelize Grid Search:**

**GridSearchCV** supports parallel processing through the **n_jobs** parameter.

Setting **n_jobs = -1** uses all available CPU cores to run multiple fits simultaneously.
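
The options above can be combined. The following is a minimal sketch, not the code from the original notebook: the helper names `coarse_grid_search` and `random_search` and their default arguments are assumptions made for illustration. The first helper runs GridSearchCV on a coarse grid with 5 folds and all CPU cores. The second uses RandomizedSearchCV, which samples a fixed number of (C, gamma) pairs from log-uniform distributions over the same ranges.

```python
import numpy as np
from scipy.stats import loguniform
from sklearn.model_selection import GridSearchCV, RandomizedSearchCV
from sklearn.svm import SVC


def coarse_grid_search(X, y, n_values=10, cv=5, n_jobs=-1):
    """Grid search over a coarse log-spaced grid of C and gamma (hypothetical helper)."""
    param_grid = {
        "C": np.logspace(-2, 10, n_values),
        "gamma": np.logspace(-9, 3, n_values),
    }
    search = GridSearchCV(SVC(kernel="rbf"), param_grid=param_grid, cv=cv, n_jobs=n_jobs)
    return search.fit(X, y)


def random_search(X, y, n_iter=20, cv=5, n_jobs=-1, random_state=0):
    """Randomized search: try n_iter sampled (C, gamma) pairs (hypothetical helper)."""
    param_distributions = {
        "C": loguniform(1e-2, 1e10),
        "gamma": loguniform(1e-9, 1e3),
    }
    search = RandomizedSearchCV(
        SVC(kernel="rbf"),
        param_distributions=param_distributions,
        n_iter=n_iter,
        cv=cv,
        n_jobs=n_jobs,
        random_state=random_state,
    )
    return search.fit(X, y)
```

Both helpers return the fitted search object, so `coarse_grid_search(X_train, y_train).best_params_` gives the selected C and gamma. With the defaults, the grid search performs 500 cross-validation fits and the random search 100.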




In [1]:
## This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
from sklearn import svm
from sklearn.impute import SimpleImputer
from sklearn.model_selection import GridSearchCV
from sklearn.model_selection import train_test_split
from sklearn.model_selection import cross_val_score
from sklearn.svm import SVC

#from skl2onnx import convert_sklearn
#from skl2onnx.common.data_types import FloatTensorType, StringTensorType

# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

# You can write up to 20GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session



# Hyperparameter Tuning using Grid Search

Search for the best set of hyperparameters for the SVM classifier.

In [2]:

# Coarse grid: 10 log-spaced values per parameter (100 combinations)
C_range = np.logspace(-2, 10, 10)
gamma_range = np.logspace(-9, 3, 10)
param_grid = dict(gamma=gamma_range, C=C_range)
        
eyedata_file_path = '/kaggle/input/vr-eyes-emotions-dataset-vreed/04 Eye Tracking Data/02 Eye Tracking Data (Features Extracted)/EyeTracking_FeaturesExtracted.csv'
eye_data = pd.read_csv(eyedata_file_path) 
eye_data.columns


y = eye_data.Quad_Cat

eye_data_features = ['Num_of_Fixations', 'Mean_Fixation_Duration',
       'SD_Fixation_Duration', 'Skew_Fixation_Duration',
       'Max_Fixation_Duration', 'First_Fixation_Duration', 'Num_of_Saccade',
       'Mean_Saccade_Duration', 'SD_Saccade_Duration', 'Skew_Saccade_Duration',
       'Max_Saccade_Duration', 'Mean_Saccade_Amplitude',
       'SD_Saccade_Amplitude', 'Skew_Saccade_Amplitude',
       'Max_Saccade_Amplitude', 'Mean_Saccade_Direction',
       'SD_Saccade_Direction', 'Skew_Saccade_Direction',
       'Max_Saccade_Direction', 'Mean_Saccade_Length', 'SD_Saccade_Length',
       'Skew_Saccade_Length', 'Max_Saccade_Length', 'Num_of_Blink',
       'Mean_Blink_Duration', 'SD_Blink_Duration', 'Skew_Blink_Duration',
       'Max_Blink_Duration', 'Num_of_Microsac', 'Mean_Microsac_Peak_Vel',
       'SD_Microsac_Peak_Vel', 'Skew_Microsac_Peak_Vel',
       'Max_Microsac_Peak_Vel', 'Mean_Microsac_Ampl', 'SD_Microsac_Ampl',
       'Skew_Microsac_Ampl', 'Max_Microsac_Ampl', 'Mean_Microsac_Dir',
       'SD_Microsac_Dir', 'Skew_Microsac_Dir', 'Max_Microsac_Dir',
       'Mean_Microsac_H_Amp', 'SD_Microsac_H_Amp', 'Skew_Microsac_H_Amp',
       'Max_Microsac_H_Amp', 'Mean_Microsac_V_Amp', 'SD_Microsac_V_Amp',
       'Skew_Microsac_V_Amp', 'Max_Microsac_V_Amp']




# Preprocessing step: Handle missing data

SimpleImputer is used to fill missing values in the feature matrix X.

It replaces each missing value (NaN) with the mean of its column.

In [3]:
X = eye_data[eye_data_features]

imp = SimpleImputer(missing_values=np.nan, strategy='mean')
imp.fit(X)
X_imp = imp.transform(X)
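
As a small illustration of what mean imputation does, the sketch below applies the same SimpleImputer settings to a toy matrix. The values in `X_toy` are made up for this example and are not taken from the eye-tracking dataset.

```python
import numpy as np
from sklearn.impute import SimpleImputer

# Toy feature matrix (made-up values) with one missing entry in each column.
X_toy = np.array([
    [1.0, np.nan],
    [3.0, 4.0],
    [np.nan, 8.0],
])

toy_imp = SimpleImputer(missing_values=np.nan, strategy='mean')
X_toy_imp = toy_imp.fit_transform(X_toy)

# Column means computed from the observed values only: 2.0 and 6.0
print(toy_imp.statistics_)
# Each NaN is replaced by the mean of its own column
print(X_toy_imp)
```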


# Perform GridSearchCV with 5 folds of cross validation and find best set of parameters for SVM classifier

In [4]:
# gs_svm = GridSearchCV(svm.SVC(), param_grid=param_grid, cv=5) # perform GridSearch with 5 fold cross validation

# X_train, X_test, y_train, y_test = train_test_split(X_imp, y, test_size=0.1, random_state=0)

# gs_svm.fit(X_train, y_train)
# gs_svm.best_params_

# #print("Ground Truth: " + str(eye_data.iloc[5].Quad_Cat))
# #print("Inference: " + str(test[0]))

# # find best model score
# gs_svm.score(X_test, y_test)


In [5]:
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split, StratifiedKFold, cross_val_score
from sklearn.impute import SimpleImputer
from sklearn.metrics import confusion_matrix , accuracy_score
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier , GradientBoostingClassifier , AdaBoostClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.linear_model import LogisticRegression

# load and process data
# eye tracking features and target variable

# X = eye_data[eye_data_features]
# y = eye_data.Quad_Cat

# # handle missing values 
# imp = SimpleImputer(missing_values=np.nan , strategy ='mean')
# X_imp = imp.fit_transform(X)

# train test split
X_train, X_test , y_train , y_test = train_test_split(X_imp, y , test_size=0.2,random_state=0)

classifiers = {
    'SVM':SVC(),
    'Random Forest':RandomForestClassifier(),
    'K-Nearest Neighbors': KNeighborsClassifier(),
    'Decision Tree': DecisionTreeClassifier(),
    'Logistic Regression': LogisticRegression(max_iter=1000),
    'Gradient Boosting': GradientBoostingClassifier(),
    'AdaBoost':AdaBoostClassifier()
    
}


# use StratifiedKFold to ensure class proportions are preserved in each fold
cross_val = StratifiedKFold(n_splits=5,shuffle=True,random_state=42)

for name, clf in classifiers.items():
    print(f"\nClassifier: {name}")
    
    # perform 5 Fold Cross-validation on the training data
    cv_scores = cross_val_score(clf, X_train,y_train , cv = cross_val , scoring='accuracy') 
        
    # print the accuracy of each fold
    for i, score in enumerate(cv_scores):
        print(f"Fold {i+1} Accuracy: {score * 100:.2f}%")
          
    # mean accuracy across all folds 
    mean_cv_accuracy = np.mean(cv_scores)
    print(f"Mean Cross-Validation Accuracy: {mean_cv_accuracy}")
          
    # train the classifier on the entire training set and evaluate on the test set
    clf.fit(X_train , y_train)
    y_test_pred = clf.predict(X_test)
    
    # confusion matrix and accuracy on the test set
    test_cm = confusion_matrix(y_test, y_test_pred)
    test_accuracy = accuracy_score(y_test , y_test_pred)
    print(f"Confusion Matrix (Test Set):\n{test_cm}")
    print(f"Test Set Accuracy: {test_accuracy * 100:.2f}%")      
          


Classifier: SVM
Fold 1 Accuracy: 40.00%
Fold 2 Accuracy: 42.00%
Fold 3 Accuracy: 32.00%
Fold 4 Accuracy: 30.00%
Fold 5 Accuracy: 26.53%
Mean Cross-Validation Accuracy: 0.341061224489796
Confusion Matrix (Test Set):
[[ 0 10  6  4]
 [ 0  7  2  5]
 [ 0  6  5  5]
 [ 2  6  2  3]]
Test Set Accuracy: 23.81%

Classifier: Random Forest
Fold 1 Accuracy: 46.00%
Fold 2 Accuracy: 66.00%
Fold 3 Accuracy: 58.00%
Fold 4 Accuracy: 70.00%
Fold 5 Accuracy: 36.73%
Mean Cross-Validation Accuracy: 0.5534693877551021
Confusion Matrix (Test Set):
[[ 9  7  1  3]
 [ 1  7  2  4]
 [ 1  3 10  2]
 [ 3  1  4  5]]
Test Set Accuracy: 49.21%

Classifier: K-Nearest Neighbors
Fold 1 Accuracy: 38.00%
Fold 2 Accuracy: 32.00%
Fold 3 Accuracy: 34.00%
Fold 4 Accuracy: 36.00%
Fold 5 Accuracy: 38.78%
Mean Cross-Validation Accuracy: 0.3575510204081632
Confusion Matrix (Test Set):
[[9 5 5 1]
 [2 7 3 2]
 [4 6 5 1]
 [6 4 3 0]]
Test Set Accuracy: 33.33%

Classifier: Decision Tree
Fold 1 Accuracy: 42.00%
Fold 2 Accuracy: 32.00%
Fold

STOP: TOTAL NO. of ITERATIONS REACHED LIMIT.

Increase the number of iterations (max_iter) or scale the data as shown in:
    https://scikit-learn.org/stable/modules/preprocessing.html
Please also refer to the documentation for alternative solver options:
    https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression
  n_iter_i = _check_optimize_result(

Fold 1 Accuracy: 42.00%
Fold 2 Accuracy: 60.00%
Fold 3 Accuracy: 42.00%
Fold 4 Accuracy: 58.00%
Fold 5 Accuracy: 40.82%
Mean Cross-Validation Accuracy: 0.4856326530612245


Confusion Matrix (Test Set):
[[4 7 0 9]
 [2 7 0 5]
 [3 3 7 3]
 [6 1 1 5]]
Test Set Accuracy: 36.51%

Classifier: Gradient Boosting
Fold 1 Accuracy: 48.00%
Fold 2 Accuracy: 62.00%
Fold 3 Accuracy: 54.00%
Fold 4 Accuracy: 62.00%
Fold 5 Accuracy: 44.90%
Mean Cross-Validation Accuracy: 0.541795918367347
Confusion Matrix (Test Set):
[[ 8  5  1  6]
 [ 1  8  1  4]
 [ 0  3 13  0]
 [ 4  4  1  4]]
Test Set Accuracy: 52.38%

Classifier: AdaBoost
Fold 1 Accuracy: 40.00%
Fold 2 Accuracy: 46.00%
Fold 3 Accuracy: 40.00%
Fold 4 Accuracy: 60.00%
Fold 5 Accuracy: 26.53%
Mean Cross-Validation Accuracy: 0.425061224489796
Confusion Matrix (Test Set):
[[4 6 2 8]
 [1 5 3 5]
 [0 4 5 7]
 [2 5 0 6]]
Test Set Accuracy: 31.75%
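
The convergence warning in the output above suggests scaling the data, and RBF-kernel SVMs and k-nearest neighbors are also sensitive to feature scale. The following is a minimal sketch of how scaling could be added, not something the notebook does: `make_scaled_classifier` and `cv_accuracy` are hypothetical helpers, and StandardScaler and make_pipeline are not used elsewhere in this notebook. Placing the imputer and scaler inside a pipeline means they are re-fitted on the training part of each fold only.

```python
from sklearn.impute import SimpleImputer
from sklearn.model_selection import StratifiedKFold, cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler


def make_scaled_classifier(clf):
    """Chain mean imputation, standardization, and the classifier (hypothetical helper)."""
    return make_pipeline(SimpleImputer(strategy='mean'), StandardScaler(), clf)


def cv_accuracy(clf, X, y, n_splits=5, random_state=42):
    """Per-fold accuracy of the scaled pipeline under stratified K-fold cross-validation."""
    cross_val = StratifiedKFold(n_splits=n_splits, shuffle=True, random_state=random_state)
    return cross_val_score(make_scaled_classifier(clf), X, y, cv=cross_val, scoring='accuracy')
```

Calling `cv_accuracy(SVC(), X, y)` with the raw feature matrix would return one accuracy per fold, which can be compared against the unscaled results above. Whether scaling improves accuracy on this dataset would need to be checked by running it.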
