# COMP47670 Assignment II Autumn 2023
## Time Series Running Data

## Objective
The objective of this assignment is to identify good models for classifying time series data.  
The data is from an accelerometer sensor and there are samples of fatigued and non-fatigued running. The data has been segmented into strides and the segments (samples) are labelled F (fatigued) and NF (not fatigued). The data for two subjects A and B are available in the files  `fatigueA.csv` and  `fatigueB.csv`. This dataset is extracted from a much larger dataset described [here](https://openreview.net/pdf?id=9c0lAonDNP).  
At present, the best performing method for time-series classification is [Rocket](https://openreview.net/pdf?id=9c0lAonDNP). 
A Rocket implementation is available in the [sktime toolkit](https://www.sktime.net/en/latest/api_reference/auto_generated/sktime.transformations.panel.rocket.Rocket.html). This sktime implementation can be used in this assignment.   
Some code to get you started is available in the notebook `RunningCore`.



In [None]:
#pip install xgboost
#pip install sktime

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OrdinalEncoder

from sklearn.linear_model import SGDClassifier
from sklearn import metrics

from sklearn.preprocessing import MinMaxScaler, StandardScaler
from sklearn.preprocessing import Normalizer

from sklearn.tree import DecisionTreeClassifier
from xgboost import XGBClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC

from sklearn.model_selection import cross_val_score

from sktime.transformations.panel.rocket import Rocket
from sklearn.model_selection import GridSearchCV

import warnings
warnings.filterwarnings('ignore')

In [None]:
fatigue_df = pd.read_csv('fatigueA.csv', header = None) # sep = '\s+')
print(fatigue_df.shape)
fatigue_df.head(5)

In [None]:
fatigue_df.iloc[4][1:].plot(label='Fatigued')
fatigue_df.iloc[-5][1:].plot(figsize=(4,3), label = 'Not Fatigued')
plt.legend(loc='upper center', bbox_to_anchor=(0.5, 1.15),ncol=2)
plt.ylabel('Accel Mag')


## Task 1: Building a Logistic Regression Classifier (with SGDClassifier) for Time Series Data
Calculate the accuracy of a logistic regression classifier (`SGDClassifier`) on the raw time series data for subject A. 

**Objective:** The main objective of this task is to build a logistic regression classifier trained with stochastic gradient descent, which iteratively updates the weights and bias using the gradient of the loss function during training. The original dataset contains time series data derived from an accelerometer, segmented into strides and labelled as either fatigued (F) or not fatigued (NF).
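
To make the idea of stochastic gradient descent on the logistic loss concrete, here is a minimal NumPy sketch. It is an illustration only: the function `sgd_logistic_fit` and the toy data are hypothetical, and `SGDClassifier` itself additionally applies regularisation and a learning-rate schedule by default, so this is not its actual implementation.

```python
import numpy as np

def sgd_logistic_fit(X, y, lr=0.1, epochs=50, seed=0):
    """Fit logistic regression with plain per-sample SGD (illustrative only)."""
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        # Visit the samples in a fresh random order every epoch
        for i in rng.permutation(len(X)):
            z = X[i] @ w + b
            p = 1.0 / (1.0 + np.exp(-z))  # predicted P(y = 1)
            grad = p - y[i]               # gradient of the log loss w.r.t. z
            w -= lr * grad * X[i]         # weight update
            b -= lr * grad                # bias update
    return w, b

# Toy, well-separated data (hypothetical, not the running data)
rng = np.random.default_rng(1)
X_toy = np.vstack([rng.normal(-2, 1, (50, 2)), rng.normal(2, 1, (50, 2))])
y_toy = np.array([0] * 50 + [1] * 50)

w, b = sgd_logistic_fit(X_toy, y_toy)
pred = (X_toy @ w + b > 0).astype(int)
toy_accuracy = (pred == y_toy).mean()
print(f"Toy training accuracy: {toy_accuracy:.2f}")
```

Each update uses a single sample, which is what makes the method "stochastic"; the weights move a small step against the gradient of that sample's loss.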

**Methodology:** Given the small sample size, I use k-fold cross-validation with `K-fold = 5` to test the model's ability to classify the samples correctly, so that every sample is used for testing exactly once. Furthermore, cross-validation, albeit slower, offers a more reliable accuracy estimate than a single hold-out split.





In [None]:
# Creating an OrdinalEncoder

ordinal_enc = OrdinalEncoder()

#converting labels from F, NF, to binary 0,1

fatigue_df[0] = ordinal_enc.fit_transform(fatigue_df[[0]])


In [None]:
#Initialising X and y variables
y = fatigue_df.pop(0).values
X = fatigue_df.values

In [None]:


# Creates an instance of the SGDClassifier model with a logistic regression loss
# (scikit-learn >= 1.1 names this loss 'log_loss'; 'log' was removed in 1.3)
sgd_clf = SGDClassifier(loss='log', random_state=42)

# Performs cross-validation, k-fold=5
scores = cross_val_score(sgd_clf, X, y, cv=5, scoring='accuracy')

# Converting accuracy scores to percentage
scores_percentage = [round(score * 100, 2) for score in scores]

# Prints the accuracy for each fold
print("Accuracy for each fold (%): ", scores_percentage)
print(f"Mean accuracy: {round(scores.mean() * 100, 2)}%")

### Results
Across the 5 folds, the logistic regression had a mean accuracy score of 78.64%; however, the results have a wide range, with a minimum of 63.1% and a maximum of 95.24%. This suggests the model performs better on certain subsets of the data than on others. This could be due to several factors, such as data variability between subsets and the choice of hyperparameters.
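
One way to probe the fold-to-fold spread is to compare the default splitting with shuffled splitting. For a classifier, `cv=5` uses `StratifiedKFold` without shuffling, so if the rows are ordered (for example by time or by class) each fold covers one contiguous block of each class. The sketch below assumes synthetic stand-in data (`X_demo`, `y_demo` are hypothetical) and uses the default `SGDClassifier` loss to stay independent of the scikit-learn version; it is not a claim about what happens on the running data.

```python
import numpy as np
from sklearn.linear_model import SGDClassifier
from sklearn.model_selection import StratifiedKFold, cross_val_score

# Hypothetical stand-in for (X, y): 200 samples, 30 ticks, rows ordered by class
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (100, 30)),
                    rng.normal(0.5, 1.0, (100, 30))])
y_demo = np.array([0] * 100 + [1] * 100)

clf = SGDClassifier(random_state=42)

# cv=5 on a classifier: stratified folds, rows taken in their original order
unshuffled = cross_val_score(clf, X_demo, y_demo, cv=5, scoring='accuracy')

# Same number of folds, but rows are shuffled before being assigned to folds
cv_shuffled = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
shuffled = cross_val_score(clf, X_demo, y_demo, cv=cv_shuffled, scoring='accuracy')

print("Unshuffled folds:", np.round(unshuffled, 3), "std:", round(unshuffled.std(), 3))
print("Shuffled folds:  ", np.round(shuffled, 3), "std:", round(shuffled.std(), 3))
```

If the spread shrinks noticeably with shuffling on the real data, the variability is probably tied to the row order rather than to the model itself.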

#### Conclusion
Task 1 successfully established a baseline for classifying time series data using a logistic regression approach. The moderate level of accuracy achieved suggests the potential for improved performance with more advanced methods, which will be explored in the following tasks of the project. This baseline serves as a foundation for further experimentation and comparison.

## Task 2
The RunningCore Notebook contains code to convert the data to the `sktime` time-series format. Using this format assess the accuracy of the Rocket transformer coupled with an `SGDClassifier` classifier on the data for subject A. 


In [None]:

X3d = X[:,np.newaxis,:] # time series algs require a 3D data array (sample, var, tick)
X3d.shape
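
As a small illustration of the reshaping step above, the sketch below uses a hypothetical 4 x 6 array (not the running data) to show that `np.newaxis` only adds a singleton "variable" axis and leaves the values untouched.

```python
import numpy as np

# Hypothetical 2D array: 4 strides (samples), 6 time ticks each
X_2d = np.arange(24, dtype=float).reshape(4, 6)

# Insert a singleton "variable" axis: (sample, tick) -> (sample, var, tick)
X_3d = X_2d[:, np.newaxis, :]

print(X_2d.shape)  # (4, 6)
print(X_3d.shape)  # (4, 1, 6)

# The values are unchanged; only the indexing gains one level
same_values = np.array_equal(X_3d[:, 0, :], X_2d)
print(same_values)  # True
```

The middle axis has length 1 because the data is univariate: one accelerometer magnitude per tick.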

In [None]:
rocket = Rocket(random_state=41)
X3d_transformed = rocket.fit_transform(X3d)

# Create an instance of SGDClassifier
sgd_clf_rocket = SGDClassifier(loss='log', random_state=42)

# Perform cross-validation
scores = cross_val_score(sgd_clf_rocket, X3d_transformed, y, cv=5, scoring='accuracy')  # cv=5 for 5-fold cross-validation

scores_percentage = [round(score * 100, 2) for score in scores]

# Prints the accuracy for each fold
print("Accuracy for each fold (%): ", scores_percentage)
print(f"Mean accuracy: {round(scores.mean() * 100, 2)}%")

### Results

The mean accuracy score for the ROCKET-transformed logistic regression is 81.72% (range: 77.65%-94.05%). Applying ROCKET improved the model's mean accuracy from 78.64% in the baseline model to 81.72%.

### Conclusion
The increase in overall accuracy (about three percentage points) could be attributable to ROCKET's ability to efficiently capture complex patterns in time series data through its convolutional kernels. Unlike the basic logistic regression model, which directly handled raw time series, this approach transformed the data into a more informative feature space, thereby facilitating more accurate classifications.

Overall, this approach and its findings set a new benchmark for the subsequent tasks, in which I analyse more complex models.
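
To make the "convolutional kernels" mentioned in the conclusion above more concrete, here is a simplified NumPy sketch of what a single ROCKET-style kernel computes. It is an assumption-laden illustration, not the sktime implementation: the helper `random_kernel_features` is hypothetical, padding is omitted, and the sine wave is only a stand-in for one stride. The two features per kernel (the maximum and the proportion of positive values, PPV) follow the ROCKET design, which is why 10,000 kernels yield 20,000 features.

```python
import numpy as np

def random_kernel_features(series, rng):
    """Apply one random dilated kernel; return (max, PPV) features."""
    length = rng.choice([7, 9, 11])
    weights = rng.normal(0.0, 1.0, length)
    weights -= weights.mean()              # mean-centred weights
    bias = rng.uniform(-1.0, 1.0)
    # Dilation sampled on a log scale so that the kernel still fits the series
    max_exp = np.log2((len(series) - 1) / (length - 1))
    dilation = int(2 ** rng.uniform(0.0, max_exp))

    span = (length - 1) * dilation
    # Valid (unpadded) dilated convolution over every admissible start position
    out = np.array([
        np.dot(weights, series[start:start + span + 1:dilation]) + bias
        for start in range(len(series) - span)
    ])
    return out.max(), (out > 0).mean()

rng = np.random.default_rng(0)
t = np.linspace(0, 4 * np.pi, 100)
series = np.sin(t)                         # hypothetical stand-in for one stride

features = np.array(
    [random_kernel_features(series, rng) for _ in range(100)]
).ravel()
print(features.shape)  # (200,)
```

A linear classifier such as `SGDClassifier` then only has to learn which of these many random features separate the classes.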

## Task 3

**Objective:** The main objective of this task is to try to improve the accuracy score by employing alternative models and performing data normalisation if necessary.

#### **Methodology:** Methods used include the following:

1. **Normalisation:** I have employed various normalisation techniques, which include
    * **No Normalisation:** In this method, the data is left as is, that is, in its original state
    * **StandardScaler():** This normalises the data by subtracting the mean and dividing by the standard deviation. Hence each feature has a mean of 0 and a variance of 1
    * **MinMaxScaler():** This normalises the data by scaling features to a particular range, usually 0-1
    * **Normalizer():** This rescales each sample (row), rather than each feature, to unit norm

2. **Classifiers:**
    * SGDClassifier
    * RandomForest Classifier
    * Support Vector Machine (SVM)
    * KNN
    * Decision Tree
    * XGBoost

The accuracy of each model is calculated for each normalisation technique, with evaluation done using cross-validation (`K-fold = 5`), and the results are presented below.
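
The difference between the scalers is easiest to see on a tiny array. The sketch below uses hypothetical data (`X_demo`) and checks the defining property of each transformer: `StandardScaler` and `MinMaxScaler` work per feature (column), while `Normalizer` works per sample (row).

```python
import numpy as np
from sklearn.preprocessing import MinMaxScaler, Normalizer, StandardScaler

# Hypothetical data: 5 samples (rows) x 3 features (columns)
X_demo = np.array([[1.0, 10.0, 100.0],
                   [2.0, 20.0, 300.0],
                   [3.0, 30.0, 200.0],
                   [4.0, 40.0, 500.0],
                   [5.0, 50.0, 400.0]])

X_std = StandardScaler().fit_transform(X_demo)   # per column: mean 0, std 1
X_mm = MinMaxScaler().fit_transform(X_demo)      # per column: min 0, max 1
X_unit = Normalizer().fit_transform(X_demo)      # per row: L2 norm 1

print("Column means after StandardScaler:", X_std.mean(axis=0).round(6))
print("Column stds after StandardScaler: ", X_std.std(axis=0).round(6))
print("Column min/max after MinMaxScaler:", X_mm.min(axis=0), X_mm.max(axis=0))
print("Row norms after Normalizer:       ", np.linalg.norm(X_unit, axis=1))
```

For time series, a column is one tick position across all strides and a row is one whole stride, so the per-row `Normalizer` changes the amplitude of each stride while the other two rescale each tick position.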

In [None]:
normalisation_techniques = ['NoNormalisation', StandardScaler(), MinMaxScaler(), Normalizer()]

classifiers = {
    'SGDClassifier': SGDClassifier(loss='log', random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'kNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42)
}

for normalisation in normalisation_techniques:
    if normalisation == 'NoNormalisation':
        for clf_name, clf in classifiers.items():
            cv_scores = cross_val_score(clf, X, y, cv=5, scoring='accuracy')
            cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]
            print(f"Cross-Validation without normalization and {clf_name}")
            print("CV Scores: ", cv_scores_percentage)
            print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
            print("--------------------------------------------\n")

    else:
        X_norm = normalisation.fit_transform(X)


    

        for clf_name, clf in classifiers.items():
            cv_scores = cross_val_score(clf, X_norm, y, cv=5, scoring='accuracy')
            cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]
            print(f"Cross-Validation with {normalisation} and {clf_name}")
            print("CV Scores: ", cv_scores_percentage)
            print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
            print("--------------------------------------------\n")




| Normalization Method     | Model          | Average CV Score (%) | Min CV Score (%) | Max CV Score (%) |
|--------------------------|----------------|----------------------|------------------|------------------|
| No Normalization         | SGDClassifier  | 78.64                | 63.10            | 95.24            |
| No Normalization         | RandomForest   | 84.37                | 65.88            | 98.81            |
| No Normalization         | SVM            | 80.54                | 70.24            | 94.05            |
| No Normalization         | kNN            | 85.78                | 70.59            | 95.24            |
| No Normalization         | DecisionTree   | 79.61                | 65.88            | 88.1             |
| No Normalization         | XGBoost        | 86.74                | 70.59            | 98.81            |
| StandardScaler()         | SGDClassifier  | 76.04                | 62.35            | 90.48            |
| StandardScaler()         | RandomForest   | 84.37                | 65.88            | 98.81            |
| StandardScaler()         | SVM            | 87.91                | 76.47            | 97.62            |
| StandardScaler()         | kNN            | 85.07                | 71.76            | 95.24            |
| StandardScaler()         | DecisionTree   | 79.37                | 65.88            | 88.1             |
| StandardScaler()         | XGBoost        | 86.74                | 70.59            | 98.81            |
| MinMaxScaler()           | SGDClassifier  | 81.48                | 77.38            | 86.9             |
| MinMaxScaler()           | RandomForest   | 84.37                | 65.88            | 98.81            |
| MinMaxScaler()           | SVM            | 84.83                | 77.38            | 94.05            |
| MinMaxScaler()           | kNN            | 86.03                | 77.38            | 95.24            |
| MinMaxScaler()           | DecisionTree   | 79.61                | 77.38            | 88.1             |
| MinMaxScaler()           | XGBoost        | 86.74                | 70.59            | 98.81            |
| Normalizer()             | SGDClassifier  | 80.04                | 65.48            | 89.29            |
| Normalizer()             | RandomForest   | 85.07                | 70.59            | 97.62            |
| Normalizer()             | SVM            | 81.03                | 69.41            | 94.05            |
| Normalizer()             | kNN            | 83.41                | 65.88            | 95.24            |
| Normalizer()             | DecisionTree   | 75.58                | 57.65            | 88.1             |
| Normalizer()             | XGBoost        | 87.92                | 75.29            | 98.81            |


### Results:

**No Normalization:**
* The best performing model on the non-normalised dataset is the XGBoost classifier with an accuracy of 86.74% (min: 70.59%, max: 98.81%).

* The least performing model was the SGDClassifier with an accuracy score of 78.64% (min: 63.10%, max:95.24%)


**StandardScaler() Normalization:**

* The best performing model with StandardScaler() normalization is the SVM classifier with an accuracy of 87.91% (min: 76.47%, max: 97.62%).

* The least performing model in this dataset with StandardScaler() normalization is the SGDClassifier with an accuracy score of 76.04% (min: 62.35%, max: 90.48%).


**MinMaxScaler() Normalization:**

* The best performing model with MinMaxScaler() normalization is the XGBoost classifier with an accuracy of 86.74% (min: 70.59%, max: 98.81%).

* The least performing model in this dataset with MinMaxScaler() normalization is the DecisionTree with an accuracy score of 79.61% (min: 77.38%, max: 88.1%).

**Normalizer():**

* The best performing model with Normalizer() normalization is the XGBoost classifier with an accuracy of 87.92% (min: 75.29%, max: 98.81%).

* The least performing model in this dataset with Normalizer() normalization is the DecisionTree with an accuracy score of 75.58% (min: 57.65%, max: 88.1%).


#### Conclusion
----------------------------------------
Overall, data normalisation had a varying, though mostly small, impact on the accuracy of each model. Across all four techniques, the best performing combinations were XGBoost on the Normalizer() data with a score of 87.92% (min: 75.29%, max: 98.81%) and the SVM on the StandardScaler() data with 87.91% (min: 76.47%, max: 97.62%), which are effectively tied. The DecisionTree on the Normalizer() data was the least performing combination, with a score of 75.58% (min: 57.65%, max: 88.1%). The models also performed differently on different subsets of the data, reflecting how variations in the data might affect results. A model's performance also depends on its algorithm's strengths and weaknesses: the tree-based models (RandomForest, DecisionTree, XGBoost) were largely unaffected by StandardScaler() and MinMaxScaler(), since per-feature rescaling does not change the ordering of values that tree splits depend on, while the SVM, kNN and SGDClassifier, which rely on distances or gradient updates, were more sensitive (the SVM rose from 80.54% to 87.91% with StandardScaler()). Further hyperparameter tuning or a larger dataset could help improve results.
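
One refinement worth considering for the Task 3 loop: the scaler there is fitted on all of `X` before cross-validation, so each test fold contributes to the scaling statistics. A minimal sketch of the usual alternative, wrapping scaler and classifier in a scikit-learn pipeline so the scaler is re-fitted on the training part of every fold, is shown below. The data (`X_demo`, `y_demo`) is a hypothetical stand-in, and the effect on the real results may well be small.

```python
import numpy as np
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVC

# Hypothetical stand-in for (X, y): 120 samples, 20 ticks
rng = np.random.default_rng(0)
X_demo = np.vstack([rng.normal(0.0, 1.0, (60, 20)),
                    rng.normal(1.0, 1.0, (60, 20))])
y_demo = np.array([0] * 60 + [1] * 60)

# The scaler is fitted inside each training fold, then applied to the test fold
pipe = make_pipeline(StandardScaler(), SVC(random_state=42))
pipe_scores = cross_val_score(pipe, X_demo, y_demo, cv=5, scoring='accuracy')

print("CV Scores: ", [round(s * 100, 2) for s in pipe_scores])
print(f"Average CV Score: {round(pipe_scores.mean() * 100, 2)}%")
```

`Normalizer()` is not affected by this concern, because it rescales each row independently and learns nothing from the other samples.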

##  Task 4
#### Objective
In this task, the main focus is to optimise the time series classification by experimenting with different numbers of ROCKET kernels (referred to below as kernel sizes), with the main aim of determining what impact the number of kernels has on the accuracy of various classifiers.

#### Methodology
As with the previous ROCKET task, the input is the 3D array created earlier (i.e. `X3d`). I use 4 different kernel numbers (5000, 10000, 20000, and 50000); by default the number of kernels in sktime's `Rocket` is 10000. The accuracy of different classifiers is observed across these kernel sizes. The classifiers used include
* SGDClassifier
* RandomForest Classifier
* Support Vector Machine (SVM)
* KNN
* Decision Tree
* XGBoost


In [None]:
kernel_options = [5000, 10000, 20000, 50000]

for kernel in kernel_options:
    # Transform the series into ROCKET features using `kernel` random kernels
    rocket = Rocket(num_kernels=kernel, random_state=42)
    X3d_transformed = rocket.fit_transform(X3d)

    for classifier_name, classifier in classifiers.items():
        # cross_val_score clones and fits the classifier on each fold itself,
        # so no separate fit call is needed beforehand
        cv_scores = cross_val_score(classifier, X3d_transformed, y, cv=5, scoring='accuracy')
        cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]

        # Print the results
        print(f"Cross-Validation with {kernel} kernels and {classifier_name}")
        print("CV Scores: ", cv_scores_percentage)
        print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
        print("----------------------------------------------------\n")
        

| Kernels | Model          | Average CV Score (%) | Min CV Score (%) | Max CV Score (%) |
|---------|----------------|----------------------|------------------|------------------|
| 5000    | SGDClassifier  | 82.21                | 70.59            | 94.05            |
| 5000    | RandomForest   | 93.13                | 84.71            | 97.62            |
| 5000    | SVM            | 87.69                | 71.76            | 92.86            |
| 5000    | kNN            | 87.93                | 69.41            | 96.43            |
| 5000    | DecisionTree   | 82.45                | 72.94            | 91.67            |
| 5000    | XGBoost        | 92.9                 | 82.35            | 96.43            |
| 10000   | SGDClassifier  | 82.68                | 72.94            | 94.05            |
| 10000   | RandomForest   | 92.18                | 82.35            | 97.62            |
| 10000   | SVM            | 86.26                | 71.76            | 92.86            |
| 10000   | kNN            | 87.69                | 68.24            | 97.62            |
| 10000   | DecisionTree   | 85.98                | 79.76            | 90.48            |
| 10000   | XGBoost        | 95.03                | 89.41            | 98.81            |
| 20000   | SGDClassifier  | 81.06                | 52.94            | 96.43            |
| 20000   | RandomForest   | 93.37                | 84.71            | 97.62            |
| 20000   | SVM            | 86.02                | 71.76            | 92.86            |
| 20000   | kNN            | 87.93                | 68.24            | 96.43            |
| 20000   | DecisionTree   | 84.59                | 70.59            | 90.48            |
| 20000   | XGBoost        | 95.26                | 90.59            | 100.0            |
| 50000   | SGDClassifier  | 91.7                 | 84.52            | 96.43            |
| 50000   | RandomForest   | 93.37                | 83.53            | 97.62            |
| 50000   | SVM            | 86.97                | 71.76            | 92.86            |
| 50000   | kNN            | 88.17                | 69.41            | 97.62            |
| 50000   | DecisionTree   | 82.91                | 76.47            | 89.29            |
| 50000   | XGBoost        | 92.19                | 80.0             | 100.0            |


#### Results
-------------------------
**Best Performing Models By Kernel Size:**
* ROCKET = 5000 Kernels:
    * Best Performing Model: The RandomForest was the best performing model with an average accuracy of 93.13% (Min: 84.71%, Max: 97.62%)
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 82.21% (Min: 70.59%, Max: 94.05%)

* ROCKET = 10000 Kernels:
    * Best Performing Model: The XGBoost was the best performing model with an average accuracy of 95.03% (Min: 89.41%, Max: 98.81%)
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 82.68% (Min: 72.94%, Max: 94.05%)


* ROCKET = 20000 Kernels:
    * Best Performing Model: The XGBoost was the best performing model with an average accuracy of 95.26% (Min: 90.59%, Max: 100.0%)
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 81.06% (Min: 52.94%, Max: 96.43%)

* ROCKET = 50000 Kernels:
    * Best Performing Model: The RandomForest was the best performing model with an average accuracy of 93.37% (Min: 83.53%, Max: 97.62%)
    * Worst Performing Model: The DecisionTree was the least performing model with an average accuracy score of 82.91% (Min: 76.47%, Max: 89.29%)


#### Conclusion
------------------
Overall, both XGBoost and RandomForest outperformed the other models in terms of average accuracy at every kernel count, highlighting their robustness to the growing number of ROCKET features, with XGBoost recording the best accuracy of all models (95.26%; Min: 90.59%, Max: 100.0%) at 20000 kernels. In contrast, the DecisionTree and the SGDClassifier struggled at most kernel counts, although the SGDClassifier improved to 91.7% at 50000 kernels. The SVM and kNN varied only slightly across kernel counts (roughly 86-88%) but did not reach the accuracy of XGBoost or RandomForest.
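
Picking the best and worst model per kernel count by eye is error-prone, so a short check can help. The sketch below copies the average CV scores from the Task 4 table above into a plain dictionary (the variable names are illustrative) and reads off the extremes for each kernel count.

```python
# Average CV accuracy (%) copied from the Task 4 table above
avg_scores = {
    5000:  {'SGDClassifier': 82.21, 'RandomForest': 93.13, 'SVM': 87.69,
            'kNN': 87.93, 'DecisionTree': 82.45, 'XGBoost': 92.90},
    10000: {'SGDClassifier': 82.68, 'RandomForest': 92.18, 'SVM': 86.26,
            'kNN': 87.69, 'DecisionTree': 85.98, 'XGBoost': 95.03},
    20000: {'SGDClassifier': 81.06, 'RandomForest': 93.37, 'SVM': 86.02,
            'kNN': 87.93, 'DecisionTree': 84.59, 'XGBoost': 95.26},
    50000: {'SGDClassifier': 91.70, 'RandomForest': 93.37, 'SVM': 86.97,
            'kNN': 88.17, 'DecisionTree': 82.91, 'XGBoost': 92.19},
}

best_by_kernel = {}
worst_by_kernel = {}
for n_kernels, by_model in avg_scores.items():
    # Model names with the highest and lowest average accuracy
    best_by_kernel[n_kernels] = max(by_model, key=by_model.get)
    worst_by_kernel[n_kernels] = min(by_model, key=by_model.get)
    print(f"{n_kernels:>6} kernels: "
          f"best={best_by_kernel[n_kernels]} ({by_model[best_by_kernel[n_kernels]]}%), "
          f"worst={worst_by_kernel[n_kernels]} ({by_model[worst_by_kernel[n_kernels]]}%)")
```

The same pattern could be applied to the per-fold minimum and maximum columns if those are needed as well.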


## Task 5

In [None]:
#Loading data
fatigue_df_b = pd.read_csv('fatigueB.csv', header=None)

In [None]:
# Creating an OrdinalEncoder

ordinal_enc = OrdinalEncoder()

#converting labels from F, NF, to binary 0,1

fatigue_df_b[0] = ordinal_enc.fit_transform(fatigue_df_b[[0]])


In [None]:
#Initialising X and y variables
y_b = fatigue_df_b.pop(0).values
X_b = fatigue_df_b.values

### a. Applying the methods used in Task 3, such as normalisation, to Subject B

In [None]:
normalisation_techniques = ['NoNormalisation', StandardScaler(), MinMaxScaler(), Normalizer()]

classifiers = {
    'SGDClassifier': SGDClassifier(loss='log', random_state=42),
    'RandomForest': RandomForestClassifier(random_state=42),
    'SVM': SVC(random_state=42),
    'kNN': KNeighborsClassifier(),
    'DecisionTree': DecisionTreeClassifier(random_state=42),
    'XGBoost': XGBClassifier(random_state=42)
}

for normalisation in normalisation_techniques:
    if normalisation == 'NoNormalisation':
        for clf_name, clf in classifiers.items():
            cv_scores = cross_val_score(clf, X_b, y_b, cv=5, scoring='accuracy')
            cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]
            print(f"Cross-Validation without normalization and {clf_name} for FatigueB")
            print("CV Scores: ", cv_scores_percentage)
            print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
            print("--------------------------------------------\n")

    else:
        X_norm_b = normalisation.fit_transform(X_b)


    

        for clf_name, clf in classifiers.items():
            cv_scores = cross_val_score(clf, X_norm_b, y_b, cv=5, scoring='accuracy')
            cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]
            print(f"Cross-Validation with {normalisation} and {clf_name} for FatigueB")
            print("CV Scores: ", cv_scores_percentage)
            print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
            print("--------------------------------------------\n")




| Scaler              | Model          | Average CV Score (%) | Min CV Score (%) | Max CV Score (%) |
|---------------------|----------------|----------------------|------------------|------------------|
| None                | SGDClassifier  | 76.49                | 58.0             | 92.0             |
| None                | RandomForest   | 85.31                | 72.55            | 98.0             |
| None                | SVM            | 79.35                | 62.75            | 94.0             |
| None                | kNN            | 81.74                | 64.71            | 94.0             |
| None                | DecisionTree   | 80.09                | 66.0             | 90.0             |
| None                | XGBoost        | 90.45                | 74.0             | 100.0            |
| StandardScaler      | SGDClassifier  | 81.27                | 60.0             | 94.0             |
| StandardScaler      | RandomForest   | 85.31                | 70.0             | 98.0             |
| StandardScaler      | SVM            | 86.53                | 68.63            | 96.0             |
| StandardScaler      | kNN            | 83.73                | 66.67            | 92.0             |
| StandardScaler      | DecisionTree   | 80.09                | 66.0             | 90.0             |
| StandardScaler      | XGBoost        | 90.45                | 74.0             | 100.0            |
| MinMaxScaler        | SGDClassifier  | 77.26                | 68.0             | 86.0             |
| MinMaxScaler        | RandomForest   | 84.91                | 68.0             | 98.0             |
| MinMaxScaler        | SVM            | 84.93                | 66.67            | 94.0             |
| MinMaxScaler        | kNN            | 84.53                | 68.63            | 94.0             |
| MinMaxScaler        | DecisionTree   | 80.09                | 66.0             | 90.0             |
| MinMaxScaler        | XGBoost        | 90.45                | 74.0             | 100.0            |
| Normalizer          | SGDClassifier  | 76.91                | 60.0             | 94.0             |
| Normalizer          | RandomForest   | 86.91                | 72.55            | 100.0            |
| Normalizer          | SVM            | 79.35                | 62.75            | 96.0             |
| Normalizer          | kNN            | 81.34                | 64.71            | 96.0             |
| Normalizer          | DecisionTree   | 83.28                | 68.0             | 94.0             |
| Normalizer          | XGBoost        | 88.85                | 68.0             | 100.0            |


### Results:
-----------------------------------------

**No Normalization:**
- The best performing model in the non-normalized dataset is the XGBoost classifier with an accuracy of 90.45% (min: 74.0%, max: 100.0%).
- The least performing model was the SGDClassifier with an accuracy score of 76.49% (min: 58.0%, max: 92.0%).

**StandardScaler() Normalization:**
- The best performing model with StandardScaler() normalization is the XGBoost classifier with an accuracy of 90.45% (min: 74.0%, max: 100.0%).
- The least performing model in this dataset with StandardScaler() normalization is the DecisionTree with an accuracy score of 80.09% (min: 66.0%, max: 90.0%).

**MinMaxScaler() Normalization:**
- The best performing model with MinMaxScaler() normalization is the XGBoost classifier with an accuracy of 90.45% (min: 74.0%, max: 100.0%).
- The least performing model in this dataset with MinMaxScaler() normalization is the SGDClassifier with an accuracy score of 77.26% (min: 68.0%, max: 86.0%).

**Normalizer() Normalization:**
- The best performing model with Normalizer() normalization is the XGBoost classifier with an accuracy of 88.85% (min: 68.0%, max: 100.0%).
- The least performing model in this dataset with Normalizer() normalization is the SGDClassifier with an accuracy score of 76.91% (min: 60.0%, max: 94.0%).

### Conclusion:
-----------------------------

Overall, XGBoost saw an increase in its accuracy score on `FatigueB` compared to `FatigueA`, with a peak of 90.45% recorded with the No Normalization, StandardScaler() and MinMaxScaler() data, compared to a peak of 87.92% on `FatigueA` (Normalizer() data only). This is potentially due to several factors such as sample size, class distribution and variation in the dataset. Apart from this, accuracies were similar, with only minor differences, and as with the `FatigueA` dataset, both the DecisionTree and SGDClassifier had comparatively lower accuracy scores, which varied with the normalisation technique used, or the lack thereof.

### b. Changing Kernel Sizes From Task 4 with Fatigue B

In [None]:
#Converting data to 3d array
X3d_b = X_b[:, np.newaxis, :]
print(X3d_b.shape)

In [None]:
kernel_options = [5000, 10000, 20000, 50000]

for kernel in kernel_options:
    # Use subject B's arrays (X3d_b, y_b) here; X3d and y belong to subject A
    rocket = Rocket(num_kernels=kernel, random_state=42)
    X3d_transformed_b = rocket.fit_transform(X3d_b)

    for classifier_name, classifier in classifiers.items():
        # cross_val_score clones and fits the classifier on each fold itself,
        # so no separate fit call is needed beforehand
        cv_scores = cross_val_score(classifier, X3d_transformed_b, y_b, cv=5, scoring='accuracy')
        cv_scores_percentage = [round(score * 100, 2) for score in cv_scores]

        # Print the results
        print(f"Cross-Validation with {kernel} kernels and {classifier_name} for FatigueB")
        print("CV Scores: ", cv_scores_percentage)
        print(f"Average CV Score:  {round(cv_scores.mean() * 100, 2)}%")
        print("----------------------------------------------------\n")
        

| Kernel Size | Model          | Average CV Score (%) | Min CV Score (%) | Max CV Score (%) |
|-------------|----------------|----------------------|------------------|------------------|
| 5000        | SGDClassifier  | 82.21                | 70.59            | 94.05            |
| 5000        | RandomForest   | 93.13                | 84.71            | 97.62            |
| 5000        | SVM            | 87.69                | 71.76            | 92.86            |
| 5000        | kNN            | 87.93                | 69.41            | 96.43            |
| 5000        | DecisionTree   | 82.45                | 72.94            | 91.67            |
| 5000        | XGBoost        | 92.9                 | 82.35            | 96.43            |
| 10000       | SGDClassifier  | 82.68                | 72.94            | 94.05            |
| 10000       | RandomForest   | 92.18                | 82.35            | 97.62            |
| 10000       | SVM            | 86.26                | 71.76            | 92.86            |
| 10000       | kNN            | 87.69                | 68.24            | 97.62            |
| 10000       | DecisionTree   | 85.98                | 79.76            | 90.48            |
| 10000       | XGBoost        | 95.03                | 89.41            | 98.81            |
| 20000       | SGDClassifier  | 81.06                | 52.94            | 96.43            |
| 20000       | RandomForest   | 93.37                | 84.71            | 97.62            |
| 20000       | SVM            | 86.02                | 71.76            | 92.86            |
| 20000       | kNN            | 87.93                | 68.24            | 96.43            |
| 20000       | DecisionTree   | 84.59                | 70.59            | 90.48            |
| 20000       | XGBoost        | 95.26                | 90.59            | 100.0            |
| 50000       | SGDClassifier  | 91.7                 | 84.52            | 96.43            |
| 50000       | RandomForest   | 93.37                | 83.53            | 97.62            |
| 50000       | SVM            | 86.97                | 71.76            | 92.86            |
| 50000       | kNN            | 88.17                | 69.41            | 97.62            |
| 50000       | DecisionTree   | 82.91                | 76.47            | 89.29            |
| 50000       | XGBoost        | 92.19                | 80.0             | 100.0            |


### Results
--------------------------------------------
Best Performing Models By Kernel Size:

* ROCKET = 5000 Kernels:

    * Best Performing Model: The RandomForest was the best performing model with an average accuracy of 93.13% (Min: 84.71%, Max: 97.62%).
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 82.21% (Min: 70.59%, Max: 94.05%).

* ROCKET = 10000 Kernels:

    * Best Performing Model: The XGBoost was the best performing model with an average accuracy of 95.03% (Min: 89.41%, Max: 98.81%).
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 82.68% (Min: 72.94%, Max: 94.05%).

* ROCKET = 20000 Kernels:

    * Best Performing Model: The XGBoost was the best performing model with an average accuracy of 95.26% (Min: 90.59%, Max: 100.0%).
    * Worst Performing Model: The SGDClassifier was the least performing model with an average accuracy score of 81.06% (Min: 52.94%, Max: 96.43%).

* ROCKET = 50000 Kernels:

    * Best Performing Model: The RandomForest was the best performing model with an average accuracy of 93.37% (Min: 83.53%, Max: 97.62%).
    * Worst Performing Model: The DecisionTree was the least performing model with an average accuracy score of 82.91% (Min: 76.47%, Max: 89.29%).

    
### Conclusion
In summary, this analysis revealed important insights about the performance of machine learning models under different kernel sizes in the context of ROCKET. Key findings include:

**Model Robustness:** Both XGBoost and RandomForest consistently outperformed the other models across the kernel sizes. These models demonstrated their robustness to the growing number of ROCKET features.

**Kernel Size Impact:** The number of kernels affected some models more than others. XGBoost peaked at 10000 and 20000 kernels, with average accuracies of 95.03% and 95.26% respectively, before dropping to 92.19% at 50000 kernels.

**Sensitivity to Kernel Size:** The SGDClassifier struggled to deliver competitive accuracy at 5000 to 20000 kernels (81.06%-82.68%) and only became competitive at 50000 kernels (91.7%), highlighting its sensitivity to this parameter.

**Moderate Variability:** SVM and kNN exhibited moderate performance variability across different kernel sizes but did not reach the consistently high performance levels of XGBoost or RandomForest.

Overall, the results emphasize the importance of selecting the right model and kernel size for specific tasks involving ROCKET features. The XGBoost and RandomForest models stand out as strong choices for scalability and performance, while the DecisionTree and SGDClassifier models may require additional optimization or consideration for specific use cases. Furthermore, a large increase in the number of kernels did not lead to a large increase in accuracy for most models, suggesting there is a point where its effect plateaus.

#### Comparing Results: FatigueA and FatigueB
----------------------------------------------
Overall, the results for both datasets show only minor differences, which could be attributable to variations in the data contained in each dataset. Both datasets consistently showed RandomForest, and XGBoost in particular, as the best performing models across kernel sizes and normalisation techniques. In both we can also notice that, after a certain point, adding new kernels does not significantly improve a model's accuracy, meaning it might plateau at a certain number, with further increases being redundant. In summary, the results for FatigueB confirm most of our observations for FatigueA; where results vary, as discussed earlier, this is more likely due to dataset-specific differences than anything else. We can also observe how changes in the subset of data can greatly affect the accuracy of a model, as seen in the CV scores, giving credence to this theory.
