***Code and documentation written by group members***

Model name : RNN1

Type of Classification: Binary classification - Stress and Non Stress

**1. Loading the EEG data samples and its labels**

-   The dataset folder contains an Excel file, "scales.xls", which
    records each subject's stress rating (1-10) for each rated task and
    trial. These ratings serve as the labels for the corresponding EEG
    samples from the dataset.

-   For example, if we load an EEG sample of subject 10 from the
    dataset, that subject's stress rating for the mental task is used
    as the EEG sample's label.

-   The EEG sample file names follow the pattern
    MentalTaskType_sub_{sub_no}_trial{trial_no}.mat, for example
    Relax_sub_2_trial2.mat:

    -   Mental Task Type : either Arithmetic, Stroop, Mirror_Image or
        Relax
    -   sub_no : Subject Number
    -   trial_no : Trial Number

    The functions in the following code block and their functionalities:

    -   extract_number(file_name_snippet) : returns the first digit
        found in the given filename snippet, which is the trial number
        for a snippet such as "trial2.mat"
    -   extract_label_address(file_name) : splits the file name on
        underscores and reads the task type, subject number and trial
        number. From these it computes and returns the (row, column)
        address of the label in the Excel sheet. Relax EEG samples have
        no rating in the Excel sheet, so the function returns (0, 0)
        for them and read_data() assigns them stress rating 0.
    -   diff_label(original_stress_label) : the labels in the Excel
        sheet range from 1 to 10, but binary classification needs
        labels of 0 or 1. EEG samples rated 1 to 10 are mapped to
        stress label 1 (stress), and EEG samples from files labelled
        "Relax", which carry stress rating 0, are mapped to stress
        label 0 (non-stress).
    -   read_data(file_array) : takes an array of file paths to all
        the EEG sample files and loads the EEG data from each file. It
        stores each file's EEG data in an array and looks up the
        corresponding label using extract_label_address(). The function
        returns two arrays: all_samples, which holds each file's EEG
        data, and all_labels, which holds each file's corresponding
        label to feed into the model.
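
The file-name convention described above can also be parsed with a single regular expression. The following is a minimal sketch rather than the notebook's implementation: parse_file_name and label_column are hypothetical helpers, and the column formula assumes the same sheet layout that extract_label_address() relies on (columns 1-3 for trial 1, 4-6 for trial 2, 7-9 for trial 3).

```python
import re

# Hypothetical pattern for names such as "Relax_sub_2_trial2.mat"
FILE_NAME_PATTERN = re.compile(
    r"^(?P<task>.+)_sub_(?P<sub>\d+)_trial(?P<trial>\d+)\.mat$"
)

# Column offset of each rated task within a trial, as in extract_label_address()
TASK_OFFSETS = {"Arithmetic": 1, "Mirror": 2, "Stroop": 3}


def parse_file_name(file_name):
    """Return (task, subject_no, trial_no) for a bare file name (no directory)."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        raise ValueError(f"unexpected file name: {file_name}")
    return match["task"], int(match["sub"]), int(match["trial"])


def label_column(task, trial_no):
    """Return the sheet column for a task and trial, or None for Relax files."""
    for prefix, offset in TASK_OFFSETS.items():
        if task.startswith(prefix):
            return offset + 3 * (trial_no - 1)
    return None


task, subject_no, trial_no = parse_file_name("Arithmetic_sub_10_trial3.mat")
print(task, subject_no, trial_no, label_column(task, trial_no))
```

Unlike a first-digit search, the regular expression reads multi-digit subject and trial numbers as whole integers.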

In \[1\]:

    import numpy as np
    from sklearn.model_selection import train_test_split
    from sklearn.svm import LinearSVC
    from sklearn.preprocessing import StandardScaler
    from glob import glob
    import matplotlib.pyplot as plt
    import scipy.io
    import os
    import pandas as pd

    # extract numbers for labels in later stages
    def extract_number(string):
        for char in string:
            if char.isdigit():
                return int(char)
        return None

    # takes filepath as parameter, returns indices for address of stress level label in xls file
    def extract_label_address(string):
        arr=string.split(os.path.sep)[-1].split("_") # ['Arithmetic', 'sub', '10', 'trial1.mat']
        base_num=0
        if ("Arithmetic" in arr[0]):
            base_num=1
        elif ("Mirror" in arr[0]):
            base_num=2
        elif ("Stroop" in arr[0]):
            base_num=3
        else:
            return 0,0
        trial_no=extract_number(arr[-1])
        if trial_no==2:
            base_num+=3
        elif trial_no==3:
            base_num+=6
        return int(arr[-2]),base_num # returns address of cell in excel file

    # all samples other than Relax are mapped to label 1 (Stress)
    def diff_label(stress_label):
        if stress_label==0:
            return 0
        return 1

    # total 480 files, 3 trials for each of the 40 participants in each of the 4 experiments
    all_file_paths=sorted(glob("filtered_data/*.mat"))

    # reading excel file's stress level labels - arr stores the full excel file's contents
    stress_levels_arr = pd.read_excel('Dataset_Information/scales.xls').to_numpy()

    # reads the data and returns the samples array and its label array
    def read_data(file_paths):
        all_samples=[]
        all_labels=[]
        for file_path in file_paths:
            # each .mat file stores its EEG recording under the "Clean_data" key
            eeg_data = scipy.io.loadmat(file_path)["Clean_data"]
            all_samples.append(eeg_data)
            addr1,addr2=extract_label_address(file_path)
            if (addr1!=0 and addr2!=0):
                all_labels.append(diff_label(stress_levels_arr[addr1][addr2]))
            else: # address (0,0) marks a Relaxation file, which has no rating
                all_labels.append(diff_label(0))
        # convert once at the end instead of calling np.append on every iteration
        return np.array(all_samples),np.array(all_labels)

Structure of the all_samples array: its shape is (480, 32, 3200), i.e.
480 files, each containing 32 channels of 3200 time samples.

In \[3\]:

    # total 480 files
    # each file is of shape (32,3200), total 120 files per experiment
    # test if labels correct - should be able to access both the EEG sample and its label from same index in both arrays
    addr1,addr2=extract_label_address(all_file_paths[304])
    print("file_name: ",all_file_paths[304])
    print("label: ", stress_levels_arr[addr1][addr2])

    file_name:  filtered_data\Relax_sub_2_trial2.mat
    label:  0

In \[4\]:

    all_samples,all_labels=read_data(all_file_paths) # shape should be (480,32,3200) and (480,) respectively

**2. Preprocessing Data for RNN Model**

-   Although the EEG data samples are already preprocessed (artifacts
    and noise have been removed), two additional preprocessing steps
    are applied:
    -   oversampling to balance the classes: only a quarter of the
        dataset (the 120 Relax files out of 480) is non-stress, so the
        non-stress samples are replicated to compensate for the class
        imbalance
    -   splitting the data into training and testing sets (80% of the
        data for training, 20% for testing). In the code below the
        split is performed after oversampling, so copies of the same
        Relax recording can land in both sets.
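
One alternative ordering, shown here only as a sketch, is to split first and oversample the training set alone, so that no replicated recording reaches the test set. The helper split_then_oversample and its extra_copies parameter are hypothetical and not part of the notebook. Note also that adding 4 extra copies of each of the 120 Relax files gives 600 non-stress samples against 360 stress samples, whereas 2 extra copies would give 360 against 360 on the full dataset.

```python
import numpy as np
from sklearn.model_selection import train_test_split


def split_then_oversample(samples, labels, extra_copies, test_size=0.2, seed=42):
    """Split first, then replicate non-stress (label 0) samples in the training set only."""
    X_train, X_test, y_train, y_test = train_test_split(
        samples, labels, test_size=test_size, random_state=seed, stratify=labels
    )
    minority_idx = np.where(y_train == 0)[0]
    # np.repeat adds `extra_copies` duplicates of every non-stress training sample
    X_extra = np.repeat(X_train[minority_idx], extra_copies, axis=0)
    y_extra = np.repeat(y_train[minority_idx], extra_copies)
    X_train = np.concatenate([X_train, X_extra])
    y_train = np.concatenate([y_train, y_extra])
    return X_train, X_test, y_train, y_test


# Small synthetic stand-in with the same 1:3 class ratio as the dataset
rng = np.random.default_rng(0)
demo_samples = rng.normal(size=(40, 4, 16))
demo_labels = np.array([0] * 10 + [1] * 30)

X_tr, X_te, y_tr, y_te = split_then_oversample(demo_samples, demo_labels, extra_copies=2)
print(X_tr.shape, X_te.shape, np.bincount(y_tr))
```

The test split here keeps the original class ratio, so its metrics reflect the distribution of the recorded data rather than the oversampled one.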

**3. Architecture of the RNN Model**

-   Input Layer: The input shape for the RNN model is specified as (32,
    3200). Keras recurrent layers read this as (time steps, features),
    so each EEG sample is processed as a sequence of 32 steps (one per
    channel), with that channel's 3200 time samples as the features of
    the step.
-   Recurrent Layers: The model has two SimpleRNN layers:
    -   The first SimpleRNN layer has 128 units and uses the ReLU
        activation function.
    -   The second SimpleRNN layer has 64 units and also uses the
        ReLU activation function.
    -   Both layers apply L2 kernel regularization with a strength
        of 0.01.
    -   The first layer is set to return sequences, so it passes its
        output at every step to the second layer.
-   Flattening Layer: After the recurrent layers, the output is passed
    through a Flatten layer. Because the second SimpleRNN layer
    returns only its final output (a vector of 64 values per sample),
    the Flatten layer leaves the shape unchanged.
-   Dense Layer: After the Flatten layer, there is a Dense layer with
    64 units and the ReLU activation function.
-   Output Layer: The final output layer is a Dense layer with 1 unit
    and a sigmoid activation function. The output is a probability
    score indicating the likelihood of the sample belonging to the
    positive class (stressed).
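
The layer stack can be traced by hand to see the output shape and trainable parameter count of each layer. This is a rough sketch in plain Python (no TensorFlow needed); the helper names are made up for illustration, and the formulas assume the standard SimpleRNN layout of an input kernel, a recurrent kernel and a bias vector.

```python
def simple_rnn_params(input_dim, units):
    # input kernel (input_dim x units) + recurrent kernel (units x units) + bias (units)
    return units * (input_dim + units + 1)


def dense_params(input_dim, units):
    # weight matrix (input_dim x units) + bias (units)
    return units * (input_dim + 1)


steps, features = 32, 3200  # input_shape=(32, 3200) read as (time steps, features)

# (layer description, per-sample output shape, trainable parameters)
layer_trace = [
    ("SimpleRNN(128, return_sequences=True)", (steps, 128), simple_rnn_params(features, 128)),
    ("SimpleRNN(64)", (64,), simple_rnn_params(128, 64)),
    ("Flatten", (64,), 0),
    ("Dense(64)", (64,), dense_params(64, 64)),
    ("Dense(1)", (1,), dense_params(64, 1)),
]

for name, shape, params in layer_trace:
    print(f"{name:40s} output={shape} params={params}")

total_params = sum(params for _, _, params in layer_trace)
print("total trainable parameters:", total_params)
```

Most of the parameters sit in the first layer's input kernel, because each step carries 3200 features.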

**4. Compilation**

The model is compiled using the Adam optimizer and binary crossentropy
loss function. Accuracy is used as the evaluation metric.
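
For reference, binary cross-entropy and an L2 kernel penalty can be written out in a few lines of NumPy. This is an illustrative sketch, not the Keras internals; the function names are hypothetical. Keras adds the L2 penalties of regularized layers to the logged loss, which is a plausible reason the loss in the training log stays above 1.5 even when training accuracy reaches 100%.

```python
import numpy as np


def binary_cross_entropy(y_true, y_prob, eps=1e-7):
    """Mean binary cross-entropy between 0/1 labels and predicted probabilities."""
    y_true = np.asarray(y_true, dtype=float)
    # clip to avoid log(0) for saturated predictions
    y_prob = np.clip(np.asarray(y_prob, dtype=float), eps, 1.0 - eps)
    losses = y_true * np.log(y_prob) + (1.0 - y_true) * np.log(1.0 - y_prob)
    return float(-np.mean(losses))


def l2_penalty(weights, strength=0.01):
    """L2 regularization term: strength times the sum of squared weights."""
    return float(strength * np.sum(np.square(weights)))


data_loss = binary_cross_entropy([1, 0, 1], [0.9, 0.1, 0.8])
penalty = l2_penalty(np.ones((4, 4)))  # toy weight matrix, not the model's weights
print(f"data loss={data_loss:.4f}  penalty={penalty:.4f}  total={data_loss + penalty:.4f}")
```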

**5. Training the Model**

The model is trained for 20 epochs with a batch size of 32. The test
split is passed as the validation data, so the validation metrics
logged after each epoch are computed on the same samples as the final
test evaluation.
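
The step counts shown in the training log (24 steps per epoch, 6 evaluation steps) follow from the dataset size, the oversampling factor, the 80/20 split and the batch size. A short worked calculation, using only values from the code in this notebook:

```python
import math

original_files = 480
relax_files = 120
oversampling_factor = 4
batch_size = 32

# originals plus four extra copies of every Relax file
total_samples = original_files + oversampling_factor * relax_files
test_samples = total_samples // 5          # 20% held out for testing
train_samples = total_samples - test_samples

steps_per_epoch = math.ceil(train_samples / batch_size)
eval_steps = math.ceil(test_samples / batch_size)

print(total_samples, train_samples, test_samples, steps_per_epoch, eval_steps)
```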

In \[6\]:

    import tensorflow as tf
    from tensorflow.keras import layers, models
    from sklearn.model_selection import train_test_split
    from sklearn.metrics import confusion_matrix
    import numpy as np
    import seaborn as sns
    import matplotlib.pyplot as plt
    import h5py

    # Assuming all_samples shape is (480, 32, 3200)
    # Assuming all_labels shape is (480,)

    # identify indices of non-stress class
    non_stress_indices = np.where(all_labels == 0)[0]

    # oversampling factor
    oversampling_factor = 4

    # oversampling the non-stress samples
    replicated_samples = np.repeat(all_samples[non_stress_indices], oversampling_factor, axis=0)
    replicated_labels = np.repeat(all_labels[non_stress_indices], oversampling_factor)

    # combine with original data
    oversampled_samples = np.vstack([all_samples, replicated_samples]) # shape (960, 32, 3200)
    oversampled_labels = np.concatenate([all_labels, replicated_labels])

    # split oversampled data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(oversampled_samples, oversampled_labels, test_size=0.2, random_state=42)

    # define RNN model
    model = models.Sequential()

    # add a SimpleRNN layer with return_sequences=True for multiple time steps
    model.add(layers.SimpleRNN(128, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01), input_shape=(32, 3200), return_sequences=True))

    # additional SimpleRNN layer
    model.add(layers.SimpleRNN(64, activation='relu', kernel_regularizer=tf.keras.regularizers.l2(0.01)))

    # flatten output before passing it to next dense layer
    model.add(layers.Flatten())

    # extra dense layer
    model.add(layers.Dense(64, activation='relu'))

    # final output layer
    model.add(layers.Dense(1, activation='sigmoid'))

    # compile model
    model.compile(optimizer='adam', loss='binary_crossentropy', metrics=['accuracy'])

    # train model
    history = model.fit(X_train, y_train, epochs=20, batch_size=32, validation_data=(X_test, y_test))
    # evaluate model
    loss, accuracy = model.evaluate(X_test, y_test)
    print(f'Test Accuracy: {accuracy * 100:.2f}%')

    # predict on test set
    y_pred = model.predict(X_test)
    y_pred_binary = np.round(y_pred)

    Epoch 1/20
    24/24 [==============================] - 19s 401ms/step - loss: 5.0068 - accuracy: 0.7096 - val_loss: 3.6336 - val_accuracy: 0.8958
    Epoch 2/20
    24/24 [==============================] - 5s 222ms/step - loss: 3.4602 - accuracy: 0.9805 - val_loss: 3.4930 - val_accuracy: 0.9375
    Epoch 3/20
    24/24 [==============================] - 3s 148ms/step - loss: 3.3492 - accuracy: 1.0000 - val_loss: 3.4088 - val_accuracy: 0.9375
    Epoch 4/20
    24/24 [==============================] - 3s 146ms/step - loss: 3.2681 - accuracy: 1.0000 - val_loss: 3.3251 - val_accuracy: 0.9427
    Epoch 5/20
    24/24 [==============================] - 4s 150ms/step - loss: 3.1758 - accuracy: 1.0000 - val_loss: 3.2270 - val_accuracy: 0.9427
    Epoch 6/20
    24/24 [==============================] - 4s 149ms/step - loss: 3.0759 - accuracy: 1.0000 - val_loss: 3.1219 - val_accuracy: 0.9427
    Epoch 7/20
    24/24 [==============================] - 3s 145ms/step - loss: 2.9704 - accuracy: 1.0000 - val_loss: 3.0109 - val_accuracy: 0.9479
    Epoch 8/20
    24/24 [==============================] - 3s 142ms/step - loss: 2.8610 - accuracy: 1.0000 - val_loss: 2.8977 - val_accuracy: 0.9531
    Epoch 9/20
    24/24 [==============================] - 3s 139ms/step - loss: 2.7490 - accuracy: 1.0000 - val_loss: 2.7825 - val_accuracy: 0.9531
    Epoch 10/20
    24/24 [==============================] - 3s 141ms/step - loss: 2.6354 - accuracy: 1.0000 - val_loss: 2.6662 - val_accuracy: 0.9531
    Epoch 11/20
    24/24 [==============================] - 3s 140ms/step - loss: 2.5213 - accuracy: 1.0000 - val_loss: 2.5504 - val_accuracy: 0.9479
    Epoch 12/20
    24/24 [==============================] - 3s 138ms/step - loss: 2.4075 - accuracy: 1.0000 - val_loss: 2.4354 - val_accuracy: 0.9479
    Epoch 13/20
    24/24 [==============================] - 3s 144ms/step - loss: 2.2946 - accuracy: 1.0000 - val_loss: 2.3216 - val_accuracy: 0.9479
    Epoch 14/20
    24/24 [==============================] - 3s 138ms/step - loss: 2.1833 - accuracy: 1.0000 - val_loss: 2.2097 - val_accuracy: 0.9479
    Epoch 15/20
    24/24 [==============================] - 3s 146ms/step - loss: 2.0740 - accuracy: 1.0000 - val_loss: 2.0998 - val_accuracy: 0.9531
    Epoch 16/20
    24/24 [==============================] - 3s 133ms/step - loss: 1.9672 - accuracy: 1.0000 - val_loss: 1.9929 - val_accuracy: 0.9479
    Epoch 17/20
    24/24 [==============================] - 4s 168ms/step - loss: 1.8632 - accuracy: 1.0000 - val_loss: 1.8889 - val_accuracy: 0.9479
    Epoch 18/20
    24/24 [==============================] - 4s 179ms/step - loss: 1.7623 - accuracy: 1.0000 - val_loss: 1.7903 - val_accuracy: 0.9427
    Epoch 19/20
    24/24 [==============================] - 4s 156ms/step - loss: 1.6647 - accuracy: 1.0000 - val_loss: 1.6945 - val_accuracy: 0.9427
    Epoch 20/20
    24/24 [==============================] - 3s 137ms/step - loss: 1.5706 - accuracy: 1.0000 - val_loss: 1.6016 - val_accuracy: 0.9427
    6/6 [==============================] - 1s 41ms/step - loss: 1.6016 - accuracy: 0.9427
    Test Accuracy: 94.27%
    6/6 [==============================] - 5s 38ms/step

**6. Results of Testing the Model - confusion matrix and classification
report**

The report generated below summarizes the classification metrics
(precision, recall, F1-score and support) for each class. The metrics
are computed on the 20% test split, which holds 192 samples after
oversampling.

In \[15\]:

    from sklearn.metrics import classification_report

    # output confusion matrix
    print("Confusion Matrix:")
    conf_matrix = confusion_matrix(y_test, y_pred_binary)
    print(conf_matrix)

    # generate and print the classification report
    print("\nClassification Report:")
    print(classification_report(y_test, y_pred_binary))

    Confusion Matrix:
    [[133   0]
     [ 11  48]]

    Classification Report:
                  precision    recall  f1-score   support

               0       0.92      1.00      0.96       133
               1       1.00      0.81      0.90        59

        accuracy                           0.94       192
       macro avg       0.96      0.91      0.93       192
    weighted avg       0.95      0.94      0.94       192
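
The figures in the classification report can be reproduced from the confusion matrix alone. The sketch below hard-codes the matrix printed above; per_class_metrics is a hypothetical helper written for this illustration.

```python
import numpy as np

# rows: true class, columns: predicted class (0 = non-stress, 1 = stress)
conf_matrix = np.array([[133, 0],
                        [11, 48]])


def per_class_metrics(matrix, cls):
    """Return (precision, recall, f1) for one class of a confusion matrix."""
    true_positives = matrix[cls, cls]
    predicted_as_cls = matrix[:, cls].sum()
    actually_cls = matrix[cls, :].sum()
    precision = true_positives / predicted_as_cls
    recall = true_positives / actually_cls
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


accuracy = np.trace(conf_matrix) / conf_matrix.sum()
for cls in (0, 1):
    precision, recall, f1 = per_class_metrics(conf_matrix, cls)
    print(f"class {cls}: precision={precision:.2f} recall={recall:.2f} f1={f1:.2f}")
print(f"accuracy={accuracy:.4f}")
```

Read this way, every error in this run is a stress sample predicted as non-stress (11 of 59), and no non-stress sample is predicted as stress.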

In \[13\]:

    # simple testing if the trained model predicts correctly
    import numpy as np
    from scipy.io import loadmat
    test_index=203 # try different file path indexes
    # Load EEG data from the .mat file - example
    new_user_input = loadmat(all_file_paths[test_index])
    eeg_data = new_user_input["Clean_data"]
    input_for_model = eeg_data
    # add a batch dimension before feeding into model
    input_for_model = np.expand_dims(input_for_model, axis=0)
    prediction = model.predict(input_for_model)

    # Round prediction to get binary output
    binary_prediction = np.round(prediction)
    print("Actual Label:", all_labels[test_index])
    print("Model's prediction:", binary_prediction[0][0])

    1/1 [==============================] - 1s 822ms/step
    Actual Label: 1
    Model's prediction: 1.0

**7. Saving the Trained Model**

Saving the trained model allows it to be reloaded for future
predictions or analysis without retraining. The .h5 version will be
used when the website needs to classify a user's EEG sample.
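
When the saved model is used to classify a single uploaded sample, the input needs the same (32, 3200) shape as the training data plus a batch dimension. The following is a minimal sketch under stated assumptions: prepare_sample and classify_sample are hypothetical helpers, and predict_fn stands in for the predict method of a model reloaded with tf.keras.models.load_model. A stub is used here so the example runs without TensorFlow.

```python
import numpy as np

EXPECTED_SHAPE = (32, 3200)  # channels x time samples, as in the training data


def prepare_sample(eeg_data):
    """Validate one EEG recording and add the batch dimension the model expects."""
    eeg_data = np.asarray(eeg_data, dtype=np.float32)
    if eeg_data.shape != EXPECTED_SHAPE:
        raise ValueError(f"expected shape {EXPECTED_SHAPE}, got {eeg_data.shape}")
    return eeg_data[np.newaxis, ...]  # shape (1, 32, 3200)


def classify_sample(eeg_data, predict_fn, threshold=0.5):
    """Return (label, probability) using any callable that maps a batch to probabilities."""
    probability = float(np.ravel(predict_fn(prepare_sample(eeg_data)))[0])
    label = "stress" if probability >= threshold else "non-stress"
    return label, probability


def stub_predict(batch):
    # Stand-in for model.predict: always returns a probability of 0.9
    return np.full((batch.shape[0], 1), 0.9)


print(classify_sample(np.zeros(EXPECTED_SHAPE), stub_predict))
```

An explicit threshold makes the decision rule visible and adjustable, which rounding the probability does not.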

In \[16\]:

    import h5py
    # Save the trained model to an HDF5 file
    model.save("C:/Users/Manoharan/Desktop/CorrectedRNN1.h5")