***Code and documentation written by group members***

Model name : LR2

Type of Classification: Binary classification - Low Stress and High
Stress

-   In our application, the user's EEG sample is first passed to LR1,
    which classifies it as either "stress" or "non-stress".
-   If the sample is classified as "stress", it is then passed to LR2,
    which classifies it as either "low stress" or "high stress".

**1. Loading the EEG data samples and its labels**

-   The EEG sample files and labels are loaded in the same way as for
    model LR1. Please refer to LR1_Model_Documentation.html for a
    detailed explanation of each function in the following code block.

In \[4\]:

    import numpy as np  # for numerical operations
    from sklearn.model_selection import train_test_split # split data in training and testing
    from sklearn.preprocessing import StandardScaler  # standardizing feature values
    from glob import glob  # file path expansion
    import matplotlib.pyplot as plt   # plotting
    import scipy.io    # reading MATLAB files
    import os     # OS related functionalities
    import pandas as pd   # handle excel data in tabular form
    from sklearn.linear_model import LogisticRegression #Import the Logistic Regression model from scikit-learn
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    import seaborn as sns


    # for labels in later stages
    def extract_number(string):
        for char in string:
            if char.isdigit():
                return int(char)
        return None

    # takes filepath as parameter, returns indices for address of stress level label in xls file
    def extract_label_address(string):
        arr=string.split(os.path.sep)[-1].split("_") # ['Arithmetic', 'sub', '10', 'trial1.mat']
        # ^ elem 0: which type of test
        # elem 2: subject number
        # extract_number(elem 3): trial number
        base_num=0
        if ("Arithmetic" in arr[0]):
            base_num=1
        elif ("Mirror" in arr[0]):
            base_num=2
        elif ("Stroop" in arr[0]):
            base_num=3
        else:
            # any other file name (e.g. a Relax file) has no stress rating column; return a placeholder address
            return 0,0
        trial_no=extract_number(arr[-1])
        if trial_no==2:
            base_num+=3
        elif trial_no==3:
            base_num+=6
        return int(arr[-2]),base_num # returns address of cell in excel file

    # mapping the stress ratings to either 0 or 1 - ratings from 1 to 5 are marked as low stress (0),
    # every other value (ratings 6-10) is marked as high stress (1); note that a rating of 0 would also map to 1
    def diff_label(stress_label):
        if stress_label>=1 and stress_label<=5:
            stress_label=0
        else:
            stress_label=1
        return stress_label

    # reads the data and returns the samples array and its label array
    def read_data(file_paths):
        # load first element
        data = scipy.io.loadmat(file_paths[0])
        eeg_data = data["Clean_data"]
        all_samples=[eeg_data]
        addr1,addr2=extract_label_address(file_paths[0])
        all_labels=[diff_label(stress_levels_arr[addr1][addr2])]
        for i in range(1,len(file_paths)):
            data = scipy.io.loadmat(file_paths[i])# loads EEG coordinates from .mat file
            eeg_data = data["Clean_data"] # get EEG data in numpy array, has 32 channels, and 3200 time samples
            all_samples=np.append(all_samples,[eeg_data],axis=0)
            # extract the stress level for that one file of EEG data
            addr1,addr2=extract_label_address(file_paths[i])
            all_labels=np.append(all_labels,[diff_label(stress_levels_arr[addr1][addr2])],axis=0)
        return all_samples,all_labels
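
As a worked example of the file-name-to-label mapping performed by `extract_label_address`, the sketch below restates the same rule in a compact form: the task selects a base column (Arithmetic 1, Mirror 2, Stroop 3) and each later trial adds 3. The helper `label_address`, the `TASK_COLUMN` table, and the regular expression are hypothetical restatements for illustration, not the code used above, and the file names are examples in the same pattern as the dataset's.

```python
import os
import re

# Base column per task, taken from extract_label_address above.
TASK_COLUMN = {"Arithmetic": 1, "Mirror": 2, "Stroop": 3}

def label_address(file_path):
    """Return (subject_number, column_index) for a file such as
    'Arithmetic_sub_10_trial1.mat', or (0, 0) if the task is not recognised."""
    name = os.path.basename(file_path)
    match = re.match(r"(Arithmetic|Mirror|Stroop).*_sub_(\d+)_trial(\d+)\.mat$", name)
    if match is None:
        return 0, 0
    task, subject, trial = match.group(1), int(match.group(2)), int(match.group(3))
    # trial 1 uses the base column, trial 2 adds 3, trial 3 adds 6
    return subject, TASK_COLUMN[task] + 3 * (trial - 1)


print(label_address("Arithmetic_sub_10_trial1.mat"))  # (10, 1)
print(label_address("Stroop_sub_7_trial2.mat"))       # (7, 6)
print(label_address("Relax_sub_3_trial1.mat"))        # (0, 0)
```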

The dataset contains EEG samples named "Relax", which represent the
non-stress condition. LR2 only needs to distinguish low-stress from
high-stress EEG samples, so the file paths of the Relax samples are
removed.

In \[5\]:

    # total 480 files, 3 trials for each of the 40 participants in each of the 4 experiments
    all_file_paths=sorted(glob("filtered_data/*.mat"))
    # Filter out file names containing the word "Relax" - we are not using non-stress EEG samples
    filtered_file_paths = [file_path for file_path in all_file_paths if "Relax" not in file_path]
    all_file_paths=filtered_file_paths

    # reading excel file's stress level labels
    stress_levels_arr = pd.read_excel('Dataset_Information/scales.xls').to_numpy()

    # each sample is of shape (32,3200), total 120 samples per experiment
    # we do not use the relaxation EEG samples to train the model, so 360 files will be used, not 480
    # load the EEG data and its labels
    all_samples,all_labels=read_data(all_file_paths) # should be (360,32,3200) and (360,)

    # test if labels are correct
    print(all_samples.shape,all_labels.shape) # should be (360,32,3200) and (360,)
    print("File Name: ",all_file_paths[3])
    print("Label: ",all_labels[3])

    (360, 32, 3200) (360,)
    File Name:  filtered_data\Arithmetic_sub_11_trial1.mat
    Label:  1

Structure of the all_samples array: its shape is (360, 32, 3200), i.e.
360 files, each containing 32 channels of 3200 time samples.

**2. Preprocessing Data for LR Model**

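-   Although the EEG data samples are already preprocessed (artifacts
    and noise have been removed), additional preprocessing steps are
    needed:
    -   oversampling to address the class imbalance. There are 233
        low-stress EEG samples and 127 high-stress EEG samples, so the
        high-stress samples are oversampled. The code below does this by
        appending 3 copies of every high-stress sample (deterministic
        replication rather than random sampling), which yields 508
        high-stress samples against 233 low-stress samples, so high
        stress becomes the larger class.
    -   splitting the data into training and testing sets (80% of the
        dataset is used for training, 20% for testing). Because the
        split is performed after oversampling, copies of the same
        high-stress sample can land in both sets, so the test scores
        reported below are likely optimistic.
    -   flattening each (32, 3200) sample into a single feature vector
        of 32 x 3200 = 102400 values to prepare it for input into the
        LR model.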
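
A minimal sketch of the same three steps on toy data, with one change in order: it splits first and oversamples only the training part, so that no copy of a test sample reaches the training set. This is an alternative ordering shown for illustration, not the procedure used for LR2 below. The helper `oversample_minority`, the toy arrays, and the fixed split indices are all hypothetical.

```python
import numpy as np

def oversample_minority(X, y, seed=0):
    """Randomly resample the smaller class (with replacement) until both
    classes have the same size. Intended for the training split only."""
    rng = np.random.default_rng(seed)
    classes, counts = np.unique(y, return_counts=True)
    minority = classes[np.argmin(counts)]
    deficit = counts.max() - counts.min()
    minority_idx = np.flatnonzero(y == minority)
    extra = rng.choice(minority_idx, size=deficit, replace=True)
    keep = np.concatenate([np.arange(len(y)), extra])
    return X[keep], y[keep]


# Toy stand-in for the EEG array: 10 samples, 2 channels, 4 time points.
X = np.arange(10 * 2 * 4, dtype=float).reshape(10, 2, 4)
y = np.array([0, 0, 0, 0, 0, 0, 0, 1, 1, 1])

# 1) split first (a fixed split here, purely for illustration)
train_idx, test_idx = np.array([0, 1, 2, 3, 4, 7, 8]), np.array([5, 6, 9])
# 2) oversample the training part only
X_train, y_train = oversample_minority(X[train_idx], y[train_idx])
X_test, y_test = X[test_idx], y[test_idx]

# 3) flatten each (channels, time) sample into one feature vector
X_train_flat = X_train.reshape(len(X_train), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

print(np.bincount(y_train))                    # class counts after oversampling: 5 and 5
print(X_train_flat.shape, X_test_flat.shape)   # (10, 8) (3, 8)
```

With real data, the fixed indices would typically be replaced by `train_test_split(..., stratify=labels)` so that both splits keep the original class ratio.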

**3. Training the Model**

-   LR2 shares the same training methodologies as LR1 - refer to
    LR1_Model_Documentation.html

In \[6\]:

    # Assuming all_samples shape is (360, 32, 3200)
    # Assuming all_labels shape is (360,)

    # identify indices of high-stress class
    high_stress_indices = np.where(all_labels == 1)[0]

    # oversampling factor
    oversampling_factor = 3

    # replicate high-stress samples
    replicated_samples = np.repeat(all_samples[high_stress_indices], oversampling_factor, axis=0)
    replicated_labels = np.repeat(all_labels[high_stress_indices], oversampling_factor)

    # combine with original data
    oversampled_samples = np.vstack([all_samples, replicated_samples])
    oversampled_labels = np.concatenate([all_labels, replicated_labels])

    # split oversampled data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(oversampled_samples, oversampled_labels, test_size=0.2, random_state=42)

    # flatten each (32, 3200) sample into one feature vector of length 102400
    X_train_flatten = X_train.reshape(X_train.shape[0], -1)
    X_test_flatten = X_test.reshape(X_test.shape[0], -1)

    # create Logistic Regression model
    logistic_model = LogisticRegression(random_state=42)

    # train model on training set
    logistic_model.fit(X_train_flatten, y_train)

Out\[6\]:

    LogisticRegression(random_state=42)

**4. Testing the Model - confusion matrix and classification report**

The model is tested in the code cell below. The classification report in
the output summarizes precision, recall, F1-score, and support for each
class. The evaluation uses the 20% of the oversampled dataset that was
held out for testing.

In \[7\]:

    # make predictions on test set
    predictions = logistic_model.predict(X_test_flatten)

    # evaluate model's performance
    accuracy = accuracy_score(y_test, predictions)
    accuracy = accuracy *100
    conf_matrix = confusion_matrix(y_test, predictions)
    classification_rep = classification_report(y_test, predictions)

    print("Accuracy:", accuracy)
    print("Confusion Matrix:\n", conf_matrix)
    print("Classification Report:\n", classification_rep)
    print("Training Set - Class Distribution:", np.bincount(y_train))
    print("Testing Set - Class Distribution:", np.bincount(y_test))

    Accuracy: 93.28859060402685
    Confusion Matrix:
     [[43 10]
     [ 0 96]]
    Classification Report:
                   precision    recall  f1-score   support

               0       1.00      0.81      0.90        53
               1       0.91      1.00      0.95        96

        accuracy                           0.93       149
       macro avg       0.95      0.91      0.92       149
    weighted avg       0.94      0.93      0.93       149

    Training Set - Class Distribution: [180 412]
    Testing Set - Class Distribution: [53 96]
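
The figures in the classification report can be reproduced by hand from the confusion matrix, which helps when reading it: rows are the actual classes and columns are the predicted classes. The short sketch below recomputes them with NumPy from the matrix printed above (the variable names are illustrative).

```python
import numpy as np

# Confusion matrix from the output above: rows = actual class, columns = predicted class.
conf_matrix = np.array([[43, 10],
                        [0, 96]])

accuracy = np.trace(conf_matrix) / conf_matrix.sum()           # (43 + 96) / 149
precision = np.diag(conf_matrix) / conf_matrix.sum(axis=0)     # correct / all predicted as that class
recall = np.diag(conf_matrix) / conf_matrix.sum(axis=1)        # correct / all actually in that class
f1 = 2 * precision * recall / (precision + recall)

print("accuracy :", round(float(accuracy), 4))   # about 0.9329
print("precision:", np.round(precision, 2))      # about 1.00 and 0.91
print("recall   :", np.round(recall, 2))         # about 0.81 and 1.00
print("f1-score :", np.round(f1, 2))             # about 0.90 and 0.95
```

All 10 errors are low-stress samples predicted as high stress; no high-stress sample was predicted as low stress.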

In \[8\]:

    # simple sanity check of the trained model on one EEG sample from the dataset
    # (the sample may also have been part of the training set)

    from scipy.io import loadmat

    # index of the sample to check
    test_index=3

    new_user_input = loadmat(all_file_paths[test_index])   # load the EEG data at test_index for prediction
    eeg_data = new_user_input["Clean_data"]

    # flatten the (32, 3200) sample into a single row of 102400 features
    input_for_model = eeg_data.reshape(1, -1)

    # make predictions
    prediction = logistic_model.predict(input_for_model)

    print("Actual Label:", all_labels[test_index])   # the actual label at test_index
    print(f'Predicted Label: {prediction[0]}')

    Actual Label: 1
    Predicted Label: 1

**5. Saving the Trained Model**

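Saving the trained model means it can be retrieved and reused for
future predictions or analysis without retraining. The model is
serialized with joblib: although the file is given a .h5 extension, it
is a joblib (pickle-based) file rather than an HDF5 file, so it must be
loaded with joblib.load. This file will be used when the model is
required to classify a user's EEG sample for the website.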

In \[39\]:

    import joblib

    # Save the trained Logistic Regression model using joblib
    # (written in joblib's pickle-based format; the .h5 extension does not make it an HDF5 file)
    joblib.dump(logistic_model, 'C:/Users/Manoharan/Desktop/LR2.h5')

Out\[39\]:

    ['C:/Users/Manoharan/Desktop/LR2.h5']
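
For completeness, here is a hedged sketch of how a model saved this way could be reloaded and used on a new sample. It trains a tiny model on synthetic data and writes it to a temporary directory, so it does not touch the real LR2 file. The data, the path, and the variable names are placeholders, and the website's actual loading code is not shown in this document.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

# Tiny synthetic stand-in for the flattened EEG features (not real data).
rng = np.random.default_rng(0)
X = rng.normal(size=(20, 6))
y = np.array([0, 1] * 10)
model = LogisticRegression(random_state=42).fit(X, y)

# Save under an .h5 name, as in the cell above; the content is still a joblib file.
path = os.path.join(tempfile.mkdtemp(), "LR2.h5")
joblib.dump(model, path)

# Reload and classify one (channels, time) sample after flattening it to a single row.
restored = joblib.load(path)
new_sample = rng.normal(size=(2, 3))
prediction = restored.predict(new_sample.reshape(1, -1))[0]
print("Predicted label:", prediction)  # 0 = low stress, 1 = high stress
```

A joblib file should be loaded only from a trusted source and with a compatible scikit-learn version, since it relies on Python pickling.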