***Code and documentation written by group members***

Model name : RNN2

Type of Classification: Binary classification - Low Stress and High
Stress

-   In our application, the user's EEG sample is first fed through RNN1,
    which classifies it as either "stress" or "non-stress".
-   If the sample is classified as "stress", it is then fed into RNN2,
    which classifies it as either "low stress" or "high stress".
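The two-stage flow described above can be sketched as follows. This is a minimal illustration, not the application's actual code: `classify_sample`, `rnn1_predict` and `rnn2_predict` are hypothetical names, each predictor is assumed to return a sigmoid probability for a single sample, and the 0.5 threshold and the RNN1 output convention are assumptions.

```python
from typing import Callable, Sequence

# Hypothetical type: any callable that maps one EEG sample to a probability in [0, 1].
Predictor = Callable[[Sequence], float]

def classify_sample(sample: Sequence,
                    rnn1_predict: Predictor,
                    rnn2_predict: Predictor,
                    threshold: float = 0.5) -> str:
    """Two-stage cascade: RNN1 decides whether RNN2 is consulted at all."""
    # Stage 1 (assumed convention: probability above the threshold means "stress")
    if rnn1_predict(sample) <= threshold:
        return "non-stress"
    # Stage 2 runs only for samples that stage 1 flagged as stress
    # (RNN2 convention from this document: 1 = high stress, 0 = low stress)
    return "high stress" if rnn2_predict(sample) > threshold else "low stress"

# Usage with stand-in predictors (constant probabilities, for illustration only)
print(classify_sample([0.0], lambda s: 0.2, lambda s: 0.9))  # non-stress
print(classify_sample([0.0], lambda s: 0.8, lambda s: 0.9))  # high stress
print(classify_sample([0.0], lambda s: 0.8, lambda s: 0.1))  # low stress
```

Passing the predictors in as callables keeps the sketch independent of how the two trained models are loaded.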

**1. Loading the EEG data samples and its labels**

-   The EEG sample files and labels are loaded in the same way as for
    model RNN1. Refer to RNN1_Model_Documentation.html for a detailed
    explanation of each function in the following code block.

In \[1\]:

    import os
    from glob import glob

    import numpy as np
    import pandas as pd
    import scipy.io

    # for labels in later stages
    def extract_number(string):
        for char in string:
            if char.isdigit():
                return int(char)
        return None

    # takes filepath as parameter, returns indices for address of stress level label in xls file
    def extract_label_address(string):
        arr=string.split(os.path.sep)[-1].split("_") # ['Arithmetic', 'sub', '10', 'trial1.mat']
        # ^ elem 0: which type of test
        # elem 2: subject number
        # extract_number(elem 3): trial number
        base_num=0
        if ("Arithmetic" in arr[0]):
            base_num=1
        elif ("Mirror" in arr[0]):
            base_num=2
        elif ("Stroop" in arr[0]):
            base_num=3
        else:
            return 0,0 # any other file name (e.g. Relax) falls back to address (0,0)
        trial_no=extract_number(arr[-1])
        if trial_no==2:
            base_num+=3
        elif trial_no==3:
            base_num+=6
        return int(arr[-2]),base_num # returns address of cell in excel file

    # mapping the subject's stress rating to a binary label (0 or 1)
    def diff_label(stress_label):
        if 1 <= stress_label <= 5: # rating from 1 to 5: label as low stress (0)
            return 0
        return 1 # any other rating (6 to 10): label as high stress (1)

    # reads the data and returns the samples array and its label array
    # (uses the global stress_levels_arr, which is loaded from scales.xls before this is called)
    def read_data(file_paths):
        all_samples=[]
        all_labels=[]
        for file_path in file_paths:
            data = scipy.io.loadmat(file_path) # loads the EEG recording from the .mat file
            # EEG data as a numpy array with 32 channels and 3200 time samples
            all_samples.append(data["Clean_data"])
            # look up the stress rating for this file and map it to 0 or 1
            addr1,addr2=extract_label_address(file_path)
            all_labels.append(diff_label(stress_levels_arr[addr1][addr2]))
        # build the arrays once at the end instead of calling np.append on every iteration
        return np.array(all_samples),np.array(all_labels)
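As a worked example of the label lookup above: the column index returned by `extract_label_address` is the task's base index (Arithmetic 1, Mirror 2, Stroop 3) plus 3 for each trial after the first, and the row index is the subject number. The sketch below restates that arithmetic in compact form; `TASK_BASE` and `label_column` are hypothetical names, not part of the original code.

```python
# Base column per task, as in extract_label_address
TASK_BASE = {"Arithmetic": 1, "Mirror": 2, "Stroop": 3}

def label_column(task: str, trial: int) -> int:
    """Column index of the stress rating for a given task and trial (1, 2 or 3)."""
    return TASK_BASE[task] + 3 * (trial - 1)

# Arithmetic_sub_10_trial1.mat -> row 10, column 1
print(label_column("Arithmetic", 1))  # 1
print(label_column("Mirror", 2))      # 5
print(label_column("Stroop", 3))      # 9
```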

The dataset contains some EEG samples labelled "Relax", which represent
the non-stress condition. RNN2 only needs to differentiate between low
stress and high stress EEG samples, so the file paths of the Relax EEG
samples are removed.

In \[2\]:

    # total 480 files, 3 trials for each of the 40 participants in each of the 4 experiments
    all_file_paths=sorted(glob("filtered_data/*.mat"))
    # Filter out file names containing the word "Relax" - we are not using non-stress EEG samples
    filtered_file_paths = [file_path for file_path in all_file_paths if "Relax" not in file_path]
    all_file_paths=filtered_file_paths # now total 360 file paths of EEG samples, excluding Relax files 

In \[3\]:

    # reading excel file's stress level labels
    stress_levels_arr = pd.read_excel('Dataset_Information/scales.xls').to_numpy()

    # testing if labels are correct
    addr1,addr2=extract_label_address(all_file_paths[349])

    # each file's content is of shape (32,3200), total 120 EEG samples per experiment
    # we do not use the relaxation EEG samples to train the model, so 360 files will be used, not 480
    # load the EEG data and its labels
    all_samples,all_labels=read_data(all_file_paths) # should be (360,32,3200) and (360,)

After loading the EEG samples and their corresponding labels, we can
check if the EEG sample and its corresponding label can be accessed with
the same index. This ensures that the EEG samples and their respective
labels are loaded accurately.

In \[4\]:

    # test if labels match the EEG sample
    print(all_samples.shape,all_labels.shape) # should be (360,32,3200) and (360,)
    print("File name: ",all_file_paths[3])
    print("Label: ",all_labels[3])

    (360, 32, 3200) (360,)
    File name:  filtered_data\Arithmetic_sub_11_trial1.mat
    Label:  1

**2. Preprocessing Data for RNN2 Model**

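-   Although the EEG data samples are already preprocessed (artifacts
    and noise have been removed), two additional preprocessing steps are
    applied:
    -   oversampling the minority class: the dataset contains 233
        low-stress and 127 high-stress EEG samples, so the high-stress
        samples are replicated to compensate for the class imbalance.
        With the oversampling factor of 4 used in the training code,
        four extra copies of each high-stress sample are appended,
        giving 635 high-stress and 233 low-stress samples (868 in
        total).
    -   splitting the oversampled data into training and testing sets
        (80% for training, 20% for testing). Because the split happens
        after oversampling, copies of the same high-stress sample can
        appear in both sets, which should be kept in mind when reading
        the test metrics.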
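In the training code of this document, the oversampling step runs before the train/test split. A common alternative is to split first and replicate minority samples in the training portion only, so that no copy of a test sample is seen during training. The sketch below illustrates that alternative on synthetic data; it is not the procedure used in this notebook, `split_then_oversample` and its parameters are hypothetical names, and the split is a plain shuffle without stratification.

```python
import numpy as np

def split_then_oversample(samples, labels, test_fraction=0.2,
                          extra_copies=4, minority_label=1, seed=42):
    """Hold out a test set first, then replicate minority samples in the training set only."""
    rng = np.random.default_rng(seed)
    order = rng.permutation(len(samples))               # shuffle indices once
    n_test = int(np.ceil(test_fraction * len(samples)))
    test_idx, train_idx = order[:n_test], order[n_test:]

    # Replicate only the minority-class indices that ended up in the training set
    minority_idx = train_idx[labels[train_idx] == minority_label]
    train_idx = np.concatenate([train_idx, np.repeat(minority_idx, extra_copies)])
    rng.shuffle(train_idx)

    return samples[train_idx], samples[test_idx], labels[train_idx], labels[test_idx]

# Synthetic stand-in for the EEG data: 360 small "recordings", 127 of them labelled 1
samples = np.arange(360 * 2 * 3, dtype=float).reshape(360, 2, 3)
labels = np.array([1] * 127 + [0] * 233)

X_train, X_test, y_train, y_test = split_then_oversample(samples, labels)
print(X_train.shape, X_test.shape)
```

With this ordering the test set contains only original, unreplicated samples, so its class ratio reflects the raw data.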

**3. Architecture of the RNN2 Model**

-   RNN2 uses the same architecture as RNN1 - refer to
    RNN1_Model_Documentation.html
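
For reference, the layer sizes in the training code of this document (two SimpleRNN layers with 128 and 64 units, followed by Dense layers with 64 and 1 units) imply the trainable parameter counts computed below. Keras reads `input_shape=(32, 3200)` as 32 time steps with 3200 features each, so the first recurrent layer sees 3200 input features. The helper names are hypothetical; the formulas are the standard ones for these layer types.

```python
def simple_rnn_params(input_features: int, units: int) -> int:
    # input kernel + recurrent kernel + bias
    return input_features * units + units * units + units

def dense_params(input_features: int, units: int) -> int:
    # weight matrix + bias
    return input_features * units + units

layer_counts = [
    ("SimpleRNN(128)", simple_rnn_params(3200, 128)),  # 426,112
    ("SimpleRNN(64)", simple_rnn_params(128, 64)),     # 12,352
    ("Dense(64)", dense_params(64, 64)),               # 4,160
    ("Dense(1)", dense_params(64, 1)),                 # 65
]
for name, count in layer_counts:
    print(f"{name:15s} {count:>8,d}")
total_params = sum(count for _, count in layer_counts)
print("total", total_params)  # total 442689
```

Almost all of the parameters sit in the input kernel of the first recurrent layer, a consequence of feeding 3200 features per time step.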

**4. Compilation**

-   RNN2 has the same compilation process as RNN1 - refer to
    RNN1_Model_Documentation.html
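
The training code in this document compiles the model with the Adam optimizer and the binary cross-entropy loss. As a reminder of what that loss measures, the sketch below computes it by hand for a few predictions; `binary_cross_entropy` is a hypothetical helper written for illustration, not the Keras implementation. Because the recurrent layers also carry an L2 kernel regularizer, the loss values printed during training include that penalty on top of the cross-entropy.

```python
import math

def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy; eps clips probabilities away from 0 and 1."""
    total = 0.0
    for y, p in zip(y_true, y_prob):
        p = min(max(p, eps), 1.0 - eps)
        total += -(y * math.log(p) + (1 - y) * math.log(1.0 - p))
    return total / len(y_true)

# Confident and correct predictions give a small loss ...
print(round(binary_cross_entropy([1, 0], [0.9, 0.1]), 4))  # 0.1054
# ... confident but wrong predictions give a large one
print(round(binary_cross_entropy([1, 0], [0.1, 0.9]), 4))  # 2.3026
```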

**5. Training the Model**

-   RNN2 shares the same parameter settings and training methodologies
    as RNN1 - refer to RNN1_Model_Documentation.html
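
The step counts in the training output of this section (22 steps per epoch, 6 evaluation steps) follow from the dataset size. Assuming the counts stated earlier (233 low-stress and 127 high-stress samples), an oversampling factor of 4, `test_size=0.2` and `batch_size=32`, the arithmetic works out as below. The rounding mirrors how scikit-learn's `train_test_split` sizes a fractional test set (rounded up).

```python
import math

n_low, n_high = 233, 127
extra_copies = 4              # oversampling_factor in the training code
test_size, batch_size = 0.2, 32

n_total = n_low + n_high * (1 + extra_copies)   # originals plus replicated copies
n_test = math.ceil(test_size * n_total)         # fractional test size is rounded up
n_train = n_total - n_test

print(n_total, n_train, n_test)                 # 868 694 174
print(math.ceil(n_train / batch_size))          # 22 training steps per epoch
print(math.ceil(n_test / batch_size))           # 6 evaluation steps
```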

In \[5\]:

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import h5py

    # Assuming all_samples shape is (360, 32, 3200)
    # Assuming all_labels shape is (360,)

    # Identify indices of high-stress class
    high_stress_indices = np.where(all_labels == 1)[0]

    # oversampling factor - number of extra copies appended for each high-stress sample
    oversampling_factor = 4

    # Replicate high-stress samples
    replicated_samples = np.repeat(all_samples[high_stress_indices], oversampling_factor, axis=0)
    replicated_labels = np.repeat(all_labels[high_stress_indices], oversampling_factor)

    # combine with original data
    oversampled_samples = np.vstack([all_samples, replicated_samples])
    oversampled_labels = np.concatenate([all_labels, replicated_labels])

    # split oversampled data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(oversampled_samples, oversampled_labels, test_size=0.2, random_state=42)

    # define RNN model
    model = models.Sequential()

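    # add a SimpleRNN layer; with input_shape=(32, 3200) Keras treats the 32 channels as time steps
    # and the 3200 time samples as features; return_sequences=True passes the full sequence to the next RNN layer
    model.add(layers.SimpleRNN(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01), input_shape=(32, 3200), return_sequences=True))

    # additional SimpleRNN layer (returns only its last output, shape (batch, 64))
    model.add(layers.SimpleRNN(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)))

    # flatten output before passing it to dense layer
    # (the previous output is already 2-D, so this leaves its shape unchanged)
    model.add(layers.Flatten())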

    # add Dense layer
    model.add(layers.Dense(64, activation='relu'))

    # add final output layer - gives probability of which class it belongs to
    model.add(layers.Dense(1, activation='sigmoid'))

    # compile model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # train model (the test split is also passed as validation data, so val_accuracy is measured on the test set)
    history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))
    # evaluate the model
    loss, accuracy = model.evaluate(X_test, y_test)
    print(f'Test Accuracy: {accuracy * 100:.2f}%')

    # predict on test set
    y_pred = model.predict(X_test)
    y_pred_binary = np.round(y_pred)

    Epoch 1/20
    22/22 [==============================] - 17s 382ms/step - loss: 4.8092 - accuracy: 0.7277 - val_loss: 3.6081 - val_accuracy: 0.9195
    Epoch 2/20
    22/22 [==============================] - 5s 230ms/step - loss: 3.4331 - accuracy: 0.9798 - val_loss: 3.4741 - val_accuracy: 0.9483
    Epoch 3/20
    22/22 [==============================] - 4s 208ms/step - loss: 3.2994 - accuracy: 1.0000 - val_loss: 3.3737 - val_accuracy: 0.9483
    Epoch 4/20
    22/22 [==============================] - 4s 197ms/step - loss: 3.1997 - accuracy: 1.0000 - val_loss: 3.2760 - val_accuracy: 0.9425
    Epoch 5/20
    22/22 [==============================] - 3s 156ms/step - loss: 3.0868 - accuracy: 1.0000 - val_loss: 3.1609 - val_accuracy: 0.9368
    Epoch 6/20
    22/22 [==============================] - 3s 146ms/step - loss: 2.9653 - accuracy: 1.0000 - val_loss: 3.0357 - val_accuracy: 0.9368
    Epoch 7/20
    22/22 [==============================] - 3s 156ms/step - loss: 2.8382 - accuracy: 1.0000 - val_loss: 2.9055 - val_accuracy: 0.9368
    Epoch 8/20
    22/22 [==============================] - 3s 157ms/step - loss: 2.7075 - accuracy: 1.0000 - val_loss: 2.7708 - val_accuracy: 0.9368
    Epoch 9/20
    22/22 [==============================] - 3s 148ms/step - loss: 2.5750 - accuracy: 1.0000 - val_loss: 2.6374 - val_accuracy: 0.9425
    Epoch 10/20
    22/22 [==============================] - 3s 146ms/step - loss: 2.4423 - accuracy: 1.0000 - val_loss: 2.5031 - val_accuracy: 0.9425
    Epoch 11/20
    22/22 [==============================] - 3s 150ms/step - loss: 2.3105 - accuracy: 1.0000 - val_loss: 2.3710 - val_accuracy: 0.9425
    Epoch 12/20
    22/22 [==============================] - 3s 154ms/step - loss: 2.1806 - accuracy: 1.0000 - val_loss: 2.2407 - val_accuracy: 0.9483
    Epoch 13/20
    22/22 [==============================] - 3s 153ms/step - loss: 2.0535 - accuracy: 1.0000 - val_loss: 2.1129 - val_accuracy: 0.9483
    Epoch 14/20
    22/22 [==============================] - 3s 158ms/step - loss: 1.9299 - accuracy: 1.0000 - val_loss: 1.9933 - val_accuracy: 0.9483
    Epoch 15/20
    22/22 [==============================] - 3s 149ms/step - loss: 1.8103 - accuracy: 1.0000 - val_loss: 1.8732 - val_accuracy: 0.9483
    Epoch 16/20
    22/22 [==============================] - 3s 152ms/step - loss: 1.6951 - accuracy: 1.0000 - val_loss: 1.7601 - val_accuracy: 0.9483
    Epoch 17/20
    22/22 [==============================] - 3s 149ms/step - loss: 1.5846 - accuracy: 1.0000 - val_loss: 1.6521 - val_accuracy: 0.9483
    Epoch 18/20
    22/22 [==============================] - 3s 161ms/step - loss: 1.4790 - accuracy: 1.0000 - val_loss: 1.5498 - val_accuracy: 0.9483
    Epoch 19/20
    22/22 [==============================] - 4s 196ms/step - loss: 1.3785 - accuracy: 1.0000 - val_loss: 1.4536 - val_accuracy: 0.9368
    Epoch 20/20
    22/22 [==============================] - 5s 216ms/step - loss: 1.2831 - accuracy: 1.0000 - val_loss: 1.3618 - val_accuracy: 0.9310
    6/6 [==============================] - 1s 42ms/step - loss: 1.3618 - accuracy: 0.9310
    Test Accuracy: 93.10%
    6/6 [==============================] - 5s 33ms/step

**6. Results of Testing the Model - confusion matrix and classification
report**

The report generated below summarises the classification metrics
(precision, recall, F1-score, and support) for each class. The
evaluation uses the 20% of the oversampled dataset that was held out
for testing.

In \[8\]:

    from sklearn.metrics import classification_report

    # output confusion matrix
    print("Confusion Matrix:")
    conf_matrix = confusion_matrix(y_test, y_pred_binary)
    print(conf_matrix)

    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_binary))

    Confusion Matrix:
    [[ 37  12]
     [  0 125]]

    Classification Report:
                  precision    recall  f1-score   support

               0       1.00      0.76      0.86        49
               1       0.91      1.00      0.95       125

        accuracy                           0.93       174
       macro avg       0.96      0.88      0.91       174
    weighted avg       0.94      0.93      0.93       174
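
The figures in the classification report can be reproduced by hand from the confusion matrix above, which is a useful check on how each metric is defined. The sketch below does so in plain Python; `per_class_metrics` is a hypothetical helper, and the matrix values are copied from the output above.

```python
# Confusion matrix from the output above: rows are true classes, columns are predictions
conf_matrix = [[37, 12],
               [0, 125]]

def per_class_metrics(matrix, cls):
    """Precision, recall and F1 for one class of a square confusion matrix."""
    true_pos = matrix[cls][cls]
    predicted = sum(row[cls] for row in matrix)   # column sum
    actual = sum(matrix[cls])                     # row sum (support)
    precision = true_pos / predicted
    recall = true_pos / actual
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1

for cls in (0, 1):
    p, r, f1 = per_class_metrics(conf_matrix, cls)
    print(f"class {cls}: precision={p:.2f} recall={r:.2f} f1={f1:.2f}")

correct = sum(conf_matrix[i][i] for i in range(2))
total = sum(sum(row) for row in conf_matrix)
print(f"accuracy={correct / total:.4f}")  # accuracy=0.9310
```

All 12 errors are low-stress samples predicted as high stress, which is why class 0 has perfect precision but a recall of 0.76.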

In \[9\]:

    # simple check that the trained model reproduces the label of one known sample
    import numpy as np
    from scipy.io import loadmat

    test_index=3
    # load EEG data from .mat file - example:
    new_user_input = loadmat(all_file_paths[test_index])
    eeg_data = new_user_input["Clean_data"]
    input_for_model = eeg_data

    # Add a batch dimension
    input_for_model = np.expand_dims(input_for_model, axis=0)

    # perform prediction
    prediction = model.predict(input_for_model)

    # round prediction to get binary output
    binary_prediction = np.round(prediction)

    print("Actual Label:", all_labels[test_index])
    print("Binary prediction:", binary_prediction[0][0])

    1/1 [==============================] - 2s 2s/step
    Actual Label: 1
    Binary prediction: 1.0

**7. Saving the Trained Model**

Saving the trained model ensures that it can be retrieved and reused
for future predictions or analysis without retraining. The .h5 file is
the version loaded when the website needs to classify a user's EEG
sample.

In \[10\]:

    import h5py
    # Save the trained model to an HDF5 file
    model.save("C:/Users/Manoharan/Desktop/CorrectedRNN2.h5")
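
When the saved model is later used to classify a user's EEG sample, the steps mirror the single-sample check shown earlier: add a batch dimension, call `predict`, and threshold the sigmoid output. The sketch below wraps those steps; `predict_stress_level` is a hypothetical helper, and the model is assumed to be any object with a Keras-style `predict` method (for example one returned by `tf.keras.models.load_model` on the saved .h5 file). A stand-in model is used here so the example runs without TensorFlow.

```python
import numpy as np

EXPECTED_SHAPE = (32, 3200)  # channels x time samples, as in the training data

def predict_stress_level(model, eeg_sample, threshold=0.5):
    """Return 'high stress' or 'low stress' for one EEG sample of shape (32, 3200)."""
    eeg_sample = np.asarray(eeg_sample)
    if eeg_sample.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {eeg_sample.shape}")
    batch = np.expand_dims(eeg_sample, axis=0)       # (1, 32, 3200)
    probability = float(model.predict(batch)[0][0])  # sigmoid output for the single sample
    return "high stress" if probability > threshold else "low stress"

class ConstantModel:
    """Stand-in with a Keras-like predict(); always returns the same probability."""
    def __init__(self, probability):
        self.probability = probability

    def predict(self, batch):
        return np.full((len(batch), 1), self.probability)

sample = np.zeros(EXPECTED_SHAPE)
print(predict_stress_level(ConstantModel(0.93), sample))  # high stress
print(predict_stress_level(ConstantModel(0.12), sample))  # low stress
```

Checking the shape up front gives a clear error for recordings that do not match the 32-channel, 3200-sample format the model was trained on.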