***Code and documentation written by group members***

Model name : LR1

Type of Classification: Binary classification - Stress and Non Stress

**1. Loading the EEG data samples and its labels**

-   The dataset folder contains an Excel file called "scales.xls",
    which records the subjects' stress ratings on a scale of 1 to 10.
    We use these ratings as the labels for the corresponding EEG
    samples from the dataset.

-   For example, if we load an EEG sample of subject 10 from the
    dataset, that subject's stress rating for that mental task is used
    as the EEG sample's label.

-   The EEG sample file names are formatted in the following manner:
    MentalTaskType_sub_no_trial_no.mat - for example:
    Relax_sub_2_trial2.mat

    -   Mental Task Type : either Arithmetic, Stroop, Mirror_Image or
        Relax
    -   sub_no : Subject Number
    -   trial_no : Trial Number

    The following code block defines these functions:

    -   extract_number(file_name_snippet) : returns the first digit found
        in the given file name snippet, which is the trial number
    -   extract_label_address(file_name) : splits the file name on
        underscores and extracts the task type, subject number and trial
        number. The function uses this information to calculate and
        return the label address (row and column) on the Excel sheet.
        The Relax EEG samples have no ratings on the Excel sheet, so the
        function returns the address (0, 0) for them and they are
        assigned stress rating 0.
    -   diff_label(original_stress_label) : the labels from the Excel
        sheet range from 1 to 10. Because we are performing binary
        classification, the labels are grouped into 0 or 1: EEG samples
        with a rating from 1 to 10 are mapped to stress label 1
        (stress), and EEG samples from files labelled "Relax", which
        carry stress rating 0, are mapped to stress label 0
        (non-stress).
    -   read_data(file_array) : takes a string array of file paths of
        all the EEG sample files and loads the EEG data from each file.
        It stores each file's EEG data in an array and gets its
        corresponding label using extract_label_address(). This
        function returns two arrays: all_samples, which stores each
        file's EEG data, and all_labels, which stores each file's
        corresponding label to feed into the model.
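
The file name convention described above can also be parsed with a regular expression. The block below is only a sketch and not the notebook's own implementation: the pattern, `parse_file_name`, `label_address` and `TASK_COLUMN` are hypothetical names, and the column arithmetic simply mirrors the base number plus trial offset that extract_label_address() uses.

```python
import re

# Hypothetical pattern for names such as "Stroop_sub_10_trial2.mat".
FILE_NAME_PATTERN = re.compile(
    r"^(?P<task>Arithmetic|Mirror_image|Stroop|Relax)"
    r"_sub_(?P<sub_no>\d+)_trial(?P<trial_no>\d+)\.mat$",
    re.IGNORECASE,
)

# Assumed base column per task, mirroring extract_label_address().
TASK_COLUMN = {"arithmetic": 1, "mirror_image": 2, "stroop": 3}


def parse_file_name(file_name):
    """Return (task, subject number, trial number), or None if the name does not match."""
    match = FILE_NAME_PATTERN.match(file_name)
    if match is None:
        return None
    return match["task"], int(match["sub_no"]), int(match["trial_no"])


def label_address(file_name):
    """Return the (row, column) address of the rating, or (0, 0) for Relax files."""
    parsed = parse_file_name(file_name)
    if parsed is None:
        raise ValueError(f"Unexpected file name: {file_name}")
    task, sub_no, trial_no = parsed
    if task.lower() == "relax":
        return 0, 0
    # Trial 1 uses the base column, trial 2 adds 3, trial 3 adds 6.
    return sub_no, TASK_COLUMN[task.lower()] + 3 * (trial_no - 1)


print(parse_file_name("Relax_sub_2_trial2.mat"))
print(label_address("Stroop_sub_10_trial2.mat"))
```

Unlike a first-digit search, this version would also cope with multi-digit trial numbers, although the dataset described here only has three trials.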

In \[1\]:

    import numpy as np  # for numerical operations
    from sklearn.model_selection import train_test_split # split data in training and testing
    from sklearn.preprocessing import StandardScaler  # standardizing feature values
    from glob import glob  # file path expansion
    import matplotlib.pyplot as plt   # plotting
    import scipy.io    # reading MATLAB files
    import os     # OS related functionalities
    import pandas as pd   # handle excel data in tabular form
    from sklearn.linear_model import LogisticRegression #Import the Logistic Regression model from scikit-learn
    from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
    import seaborn as sns

    # for labels in later stages
    def extract_number(string):
        for char in string:
            if char.isdigit():
                return int(char)
        return None

    # takes filepath as parameter, returns indices for address of stress level label in xls file
    def extract_label_address(string):
        arr=string.split(os.path.sep)[-1].split("_") # ['Arithmetic', 'sub', '10', 'trial1.mat']
        base_num=0
        if ("Arithmetic" in arr[0]):
            base_num=1
        elif ("Mirror" in arr[0]):
            base_num=2
        elif ("Stroop" in arr[0]):
            base_num=3
        else:
            return 0,0
        trial_no=extract_number(arr[-1])
        if trial_no==2:
            base_num+=3
        elif trial_no==3:
            base_num+=6
        return int(arr[-2]),base_num # returns address of cell in excel file

    # stress levels are from 1 - 10 in scales.xls; they are mapped to binary labels 0 or 1
    def diff_label(stress_label):
        # rating 0 marks a Relax sample (non-stress); any rating from 1 to 10 counts as stress
        if stress_label==0:
            return 0
        else:
            return 1

    # total 480 files, 3 trials for each of the 40 participants in each of the 4 experiments
    all_file_paths=sorted(glob("filtered_data/*.mat"))

    # reading excel file's stress level labels - array stores the full excel file's contents
    stress_levels_arr = pd.read_excel('Dataset_Information/scales.xls').to_numpy()

    # reads the data and returns the samples array and its label array
    def read_data(file_paths):
        all_samples=[]
        all_labels=[]
        for file_path in file_paths:
            data = scipy.io.loadmat(file_path)
            all_samples.append(data["Clean_data"]) # EEG array for one file
            addr1,addr2=extract_label_address(file_path)
            if (addr1!=0 and addr2!=0): # a rated task: look up the stress rating in the excel data
                all_labels.append(diff_label(stress_levels_arr[addr1][addr2]))
            else: # if addr1, addr2 == 0,0: we are labelling a Relaxation epoch
                all_labels.append(diff_label(0))
        # stack once at the end instead of calling np.append in every iteration
        return np.array(all_samples),np.array(all_labels)

    all_samples,all_labels=read_data(all_file_paths) # shape should be (480,32,3200) and (480,) respectively

Structure of the all_samples array: its shape is (480,32,3200), which
means 480 files, each containing 32 channels of 3200 time samples.
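
To make the three axes concrete, here is a small sketch that indexes into an array with the same layout. It uses a synthetic stand-in with only 6 files (`demo_samples` is a hypothetical name) so that it runs without the dataset.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-in for all_samples: 6 files, 32 channels, 3200 time samples.
demo_samples = rng.standard_normal((6, 32, 3200))

first_file = demo_samples[0]         # all channels of the first file
fifth_channel = demo_samples[0, 4]   # one channel of that file (channels are 0-indexed)
one_value = demo_samples[0, 4, 100]  # a single time sample on that channel

print(first_file.shape)     # (32, 3200)
print(fifth_channel.shape)  # (3200,)
print(one_value.shape)      # ()
```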

**2. Preprocessing Data for LR Model**

-   Although the EEG data samples are already preprocessed for artifact
    and noise removal, some additional preprocessing steps are needed:
    -   oversampling to balance the classes (only the 120 Relax samples,
        one quarter of the 480-sample dataset, are non-stressed, so each
        of them was replicated four more times to compensate for the
        class imbalance)
    -   splitting the data into training and testing sets (80% of the
        dataset was used for training, 20% was used for testing)
    -   flattening each sample's 32 x 3200 array into a single feature
        vector of 102,400 values to prepare it for input into the LR
        model.
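
The notebook's code oversamples before it splits, so copies of the same Relax sample can end up in both the training and the testing set. One possible alternative, shown here only as a sketch on synthetic data, is to split first and replicate the minority class in the training set alone. The array shapes, the `extra_copies` value and the use of `stratify` are assumptions made for this illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

rng = np.random.default_rng(42)

# Synthetic stand-in: 40 small "EEG" samples, a quarter of them labelled 0 (non-stress).
samples = rng.standard_normal((40, 4, 50))
labels = np.array([0] * 10 + [1] * 30)

# Split first, keeping the class ratio the same in both sets.
X_train, X_test, y_train, y_test = train_test_split(
    samples, labels, test_size=0.2, random_state=42, stratify=labels
)

# Replicate the minority class in the training set only.
extra_copies = 2  # with a 1:3 class ratio, two extra copies give a 1:1 ratio
minority = np.where(y_train == 0)[0]
X_train_bal = np.concatenate([X_train, np.repeat(X_train[minority], extra_copies, axis=0)])
y_train_bal = np.concatenate([y_train, np.repeat(y_train[minority], extra_copies)])

# Flatten each sample into one feature vector.
X_train_flat = X_train_bal.reshape(len(X_train_bal), -1)
X_test_flat = X_test.reshape(len(X_test), -1)

print("Training class counts:", np.bincount(y_train_bal))
print("Testing class counts:", np.bincount(y_test))
```

With this ordering the test set contains only samples the model has never seen in any form, so the reported metrics would be a stricter estimate.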

**3. Training the Model**

The model is trained by fitting it to the preprocessed training data
using the fit method of scikit-learn's LogisticRegression class. After
training, the model makes predictions on the test set using the predict
method of the same class.
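
As a minimal sketch of the fit and predict calls, the block below trains on synthetic data. It also standardizes the features with StandardScaler inside a pipeline; the notebook imports StandardScaler but does not use it, so this step and the `max_iter=1000` setting are assumptions for illustration only.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for flattened EEG features: 60 samples, 20 features each.
X = rng.standard_normal((60, 20))
y = (X[:, 0] > 0).astype(int)  # label depends on the first feature

# Scale the features, then fit the logistic regression model.
model = make_pipeline(
    StandardScaler(),
    LogisticRegression(max_iter=1000, random_state=42),
)
model.fit(X[:48], y[:48])

predictions = model.predict(X[48:])          # hard 0/1 labels
probabilities = model.predict_proba(X[48:])  # one probability per class

print(predictions)
print(probabilities[:3])
```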

In \[7\]:

    # identify all indices of non-stress class
    non_stress_indices = np.where(all_labels == 0)[0]

    # oversampling factor
    oversampling_factor = 4

    # replicate non-stress samples
    replicated_samples = np.repeat(all_samples[non_stress_indices], oversampling_factor, axis=0)
    replicated_labels = np.repeat(all_labels[non_stress_indices], oversampling_factor)

    # combine with original data
    oversampled_samples = np.vstack([all_samples, replicated_samples])
    oversampled_labels = np.concatenate([all_labels, replicated_labels])

    # split oversampled data into training and testing sets
    X_train, X_test, y_train, y_test = train_test_split(oversampled_samples, oversampled_labels, test_size=0.2, random_state=42)

    # flatten each (32, 3200) sample into a single feature vector
    X_train_flatten = X_train.reshape(X_train.shape[0], -1)
    X_test_flatten = X_test.reshape(X_test.shape[0], -1)

    # create logistic regression model
    logistic_model = LogisticRegression(random_state=42)

    # train model on training set
    logistic_model.fit(X_train_flatten, y_train)

    # make predictions on test set
    predictions = logistic_model.predict(X_test_flatten)

**4. Testing the Model - confusion matrix and classification report**

The model is tested in the code section below. The report in the
output summarizes classification metrics such as precision, recall,
F1-score, and support for each class. The evaluation uses the 20% of
the oversampled dataset that was held out for testing.

In \[8\]:

    # evaluate model's performance
    accuracy = accuracy_score(y_test, predictions)
    accuracy = accuracy *100
    conf_matrix = confusion_matrix(y_test, predictions)
    classification_rep = classification_report(y_test, predictions)

    # display the evaluation metrics
    print("Test Accuracy:", accuracy)
    print("Confusion Matrix:\n", conf_matrix)
    print("Classification Report:\n", classification_rep)

    print("Training Set - Class Distribution:", np.bincount(y_train))
    print("Testing Set - Class Distribution:", np.bincount(y_test))

    Test Accuracy: 98.4375
    Confusion Matrix:
     [[133   0]
     [  3  56]]
    Classification Report:
                   precision    recall  f1-score   support

               0       0.98      1.00      0.99       133
               1       1.00      0.95      0.97        59

        accuracy                           0.98       192
       macro avg       0.99      0.97      0.98       192
    weighted avg       0.98      0.98      0.98       192

    Training Set - Class Distribution: [467 301]
    Testing Set - Class Distribution: [133  59]
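
The figures in the classification report can be recomputed by hand from the confusion matrix printed above. The sketch below does this with NumPy, assuming scikit-learn's convention that rows are actual classes and columns are predicted classes.

```python
import numpy as np

# Confusion matrix from the output above (rows: actual, columns: predicted).
conf_matrix = np.array([[133, 0],
                        [3, 56]])

correct = np.diag(conf_matrix)               # correctly classified samples per class
precision = correct / conf_matrix.sum(axis=0)  # correct / everything predicted as that class
recall = correct / conf_matrix.sum(axis=1)     # correct / everything actually in that class
f1 = 2 * precision * recall / (precision + recall)
support = conf_matrix.sum(axis=1)            # number of actual samples per class
accuracy = correct.sum() / conf_matrix.sum()

print("Precision:", precision.round(2))
print("Recall:", recall.round(2))
print("F1-score:", f1.round(2))
print("Support:", support)
print("Accuracy:", accuracy * 100)
```

For example, the precision for class 0 is 133 / (133 + 3), and the recall for class 1 is 56 / (3 + 56), which round to the 0.98 and 0.95 shown in the report.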

In \[5\]:

    # simple test of the model - checking whether it predicts correctly for one EEG data sample from the dataset

    import numpy as np
    from scipy.io import loadmat

    # load EEG data from .mat file - example

    test_index=301

    new_user_input = loadmat(all_file_paths[test_index])   # loads eeg data at index 301 for prediction
    eeg_data = new_user_input["Clean_data"]

    # assuming eeg_data has the shape (32, 3200)

    # flatten input data along channel axis
    selected_data = eeg_data
    input_for_model = selected_data.reshape(1, -1)
    input_for_model.shape

    # make predictions
    prediction = logistic_model.predict(input_for_model)

    print("Actual Label:", all_labels[test_index])   # checks the actual label at index 301
    print(f'Predicted Label: {prediction[0]}')

    Actual Label: 0
    Predicted Label: 0

**5. Saving the Trained Model**

Saving the trained model means it can be retrieved and reused for
future predictions or analysis without retraining. The saved LR1.h5
file will be used when the model is required to classify a user's EEG
sample for the website. Note that the file is written by joblib, so
despite its .h5 extension it must be loaded with joblib.load rather
than with an HDF5 reader.
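
A minimal sketch of the save and reload round trip is shown below. It trains a small model on synthetic data and writes it to a temporary directory; the file name `LR1_demo.joblib` is hypothetical and is not the path used in this notebook.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)

# Synthetic stand-in for the flattened training data.
X = rng.standard_normal((40, 10))
y = (X[:, 0] > 0).astype(int)
model = LogisticRegression(random_state=42).fit(X, y)

with tempfile.TemporaryDirectory() as tmp_dir:
    model_path = os.path.join(tmp_dir, "LR1_demo.joblib")  # hypothetical file name
    joblib.dump(model, model_path)      # serialize the fitted model
    restored = joblib.load(model_path)  # load it back, no retraining needed

# The restored model should reproduce the original predictions.
same_predictions = np.array_equal(model.predict(X), restored.predict(X))
print("Predictions match after reload:", same_predictions)
```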

In \[5\]:

    import joblib

    # save trained Logistic Regression model using joblib
    # (the file is a joblib pickle, not HDF5, despite the .h5 extension)
    joblib.dump(logistic_model, 'C:/Users/Manoharan/Desktop/LR1.h5')

Out\[5\]:

    ['C:/Users/Manoharan/Desktop/LR1.h5']