# Machine Learning Experiment for Depression - classify depression or no depression

This notebook produces the results of the machine learning experiment on time series data for my final year project, "<i>Machine Learning-Based Human Activity Recognition for the Classification of Depression</i>". The project uses supervised machine learning to solve a binary classification problem: each patient is classified as <b>depression</b> or <b>non-depression</b>.<br>

Deliverable requirements for machine learning experiment:
- [x] Provide a time series graph to show the activity data over hourly intervals for each patient
- [x] Provide a time series graph to show the activity data over daily intervals for each patient
- [x] Perform data pre-processing on the <code>afftype</code> and <code>work</code> columns from the <code>scores.csv</code> file before creating the new dataset
- [x] Create the new dataset that is based on features extracted from each patient's actigraph data
- [x] Train machine learning models on the new dataset, with and without features from <code>scores.csv</code>, without Leave One Subject Out Cross Validation
- [x] Train machine learning models on the new dataset, with and without features from <code>scores.csv</code>, using Leave One Subject Out Cross Validation

The relevant libraries needed for the machine learning experiment are:
* NumPy
* Matplotlib
* Seaborn
* Pandas
* scikit-learn
* Time Series Feature Extraction Library (TSFEL)

NOTE: Make sure to install the TSFEL library using the command: <code>pip install tsfel</code>

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import tsfel
import os
import warnings

# Disable warning messages in the code
warnings.filterwarnings("ignore")

## Create the new dataset based on features extracted from each patient's actigraph data

For this deliverable, I will pre-process the <code>scores.csv</code> file, select the features that are useful, and then create the new dataset. The new dataset combines time series features extracted from each patient's actigraph data over hourly intervals with the selected features from <code>scores.csv</code>.

### Data pre-processing on the scores.csv file

Before creating the new dataset, I will perform data pre-processing on the <code>scores.csv</code> file for the relevant features.

In [2]:
# Load the scores.csv file
scores_df = pd.read_csv("data/scores.csv")
scores_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


In the output above, there are missing values in the majority of the columns.<br><br>
For feature selection, the columns relevant to the new dataset are the <code>number</code> and <code>afftype</code> columns. The <code>afftype</code> column has missing values for the non-depressed patients.
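
The missing values can be counted per column before an imputation strategy is chosen. The sketch below is a minimal example on a small hypothetical DataFrame (`toy_scores`, with made-up values, not the real <code>scores.csv</code>) that mimics the structure of the file.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for scores.csv: two condition and two control patients
toy_scores = pd.DataFrame({
    "number": ["condition_1", "condition_2", "control_1", "control_2"],
    "days": [11, 18, 8, 20],
    "afftype": [2.0, 1.0, np.nan, np.nan],
    "madrs1": [19.0, 24.0, np.nan, np.nan],
})

# Count the missing values in each column
missing_per_column = toy_scores.isna().sum()
print(missing_per_column)

# List the patients whose afftype is missing
missing_afftype = toy_scores.loc[toy_scores["afftype"].isna(), "number"].tolist()
print(missing_afftype)
```

Running the same two checks on `scores_df` would show which columns and which patients are affected in the real file.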

To handle the missing values in the afftype column, we can use imputation, which replaces each missing value with a substitute value.<br>

For the afftype column, the value 1 is <b>bipolar II</b>, 2 is <b>unipolar depressive</b> and 3 is <b>bipolar I</b>, and these values are filled in for the depressed patients. For the non-depressed patients, we can impute the value 0 to indicate <b>no depression</b>.<br>
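
As a minimal sketch of this imputation, the example below fills the missing codes with 0 using `fillna` and then maps each code to its label. The values in `toy_afftype` are hypothetical, and the cast to integers is only for readability in this sketch (the notebook keeps the column as floats).

```python
import numpy as np
import pandas as pd

# Hypothetical afftype values; NaN marks a non-depressed patient
toy_afftype = pd.Series([2.0, 1.0, 3.0, np.nan, np.nan], name="afftype")

# Impute 0 for every missing value, then cast to int for readability
imputed = toy_afftype.fillna(0).astype(int)

# Map the codes to the labels described above (code 0 is the imputed value)
afftype_labels = {
    0: "no depression",
    1: "bipolar II",
    2: "unipolar depressive",
    3: "bipolar I",
}
labelled = imputed.map(afftype_labels)
print(labelled.tolist())
```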

In [3]:
# Look at the current count for the afftype column
scores_df["afftype"].value_counts()

2.0    15
1.0     7
3.0     1
Name: afftype, dtype: int64

Only one depressed patient has bipolar I, 7 have bipolar II and 15 have unipolar depression. The non-depressed patients have missing values in the afftype column, so we impute the value 0 to indicate no depression.

In [4]:
# The afftype column is missing for the non-depressed (control) patients.
# Impute the value 0 for these patients to indicate 'no depression'.
# Selecting rows by the 'control' prefix avoids hard-coding row positions.
is_control = scores_df["number"].str.startswith("control")
scores_df.loc[is_control, "afftype"] = 0

scores_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


In [5]:
# Look at the new count for the afftype column
scores_df["afftype"].value_counts()

0.0    32
2.0    15
1.0     7
3.0     1
Name: afftype, dtype: int64

### Create the new dataset
Now we can create the new dataset based on features extracted from each patient's actigraph data.<br>

For this, we need to create a CSV file of the new dataset and for each patient:
* Use the TSFEL library to extract new features over hourly intervals using temporal data
* Add a patient ID (<code>number</code>) feature so that every hourly window is tied to the patient it came from
* Add an <code>afftype</code> feature so that every hourly window carries the value recorded for its patient
* Add a depression feature: 0 if the patient has no depression, 1 if the patient has bipolar I, bipolar II or unipolar depressive
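
The depression rule in the list above can be expressed directly on the afftype codes: once the missing values are imputed with 0, any non-zero code means depression. The sketch below uses hypothetical rows in `toy` and also compares the result with a label derived from the patient number prefix, which is the approach this notebook takes later.

```python
import pandas as pd

# Hypothetical patients with already-imputed afftype codes
toy = pd.DataFrame({
    "number": ["condition_1", "condition_2", "condition_3", "control_1"],
    "afftype": [2.0, 1.0, 3.0, 0.0],
})

# Any non-zero afftype code counts as depression
toy["depression"] = (toy["afftype"] > 0).astype(int)

# Alternative: derive the label from the patient number prefix
prefix_label = toy["number"].str.startswith("condition").astype(int)

print(toy)
print((toy["depression"] == prefix_label).all())
```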

First extract the features for all patients using the TSFEL library.

In [6]:
# Define condition and control folders
condition = "data/condition"
control = "data/control"

In [7]:
# Load a condition and control patient file to see the time series data
condition_1 = pd.read_csv("data/condition/condition_1.csv")
control_1 = pd.read_csv("data/control/control_1.csv")

In [8]:
condition_1

Unnamed: 0,timestamp,date,activity
0,2003-05-07 12:00:00,2003-05-07,0
1,2003-05-07 12:01:00,2003-05-07,143
2,2003-05-07 12:02:00,2003-05-07,0
3,2003-05-07 12:03:00,2003-05-07,20
4,2003-05-07 12:04:00,2003-05-07,166
...,...,...,...
23239,2003-05-23 15:19:00,2003-05-23,0
23240,2003-05-23 15:20:00,2003-05-23,0
23241,2003-05-23 15:21:00,2003-05-23,0
23242,2003-05-23 15:22:00,2003-05-23,0


In [9]:
control_1

Unnamed: 0,timestamp,date,activity
0,2003-03-18 15:00:00,2003-03-18,60
1,2003-03-18 15:01:00,2003-03-18,0
2,2003-03-18 15:02:00,2003-03-18,264
3,2003-03-18 15:03:00,2003-03-18,662
4,2003-03-18 15:04:00,2003-03-18,293
...,...,...,...
51606,2003-04-23 12:06:00,2003-04-23,3
51607,2003-04-23 12:07:00,2003-04-23,3
51608,2003-04-23 12:08:00,2003-04-23,3
51609,2003-04-23 12:09:00,2003-04-23,3


As seen above, the actigraph watch records one activity count per minute. We want the TSFEL library to extract new features over hourly intervals.<br>

TSFEL defines its window size in samples, so with one sample per minute an hourly window is 60 data points. The true sampling rate is 1/60 Hz, but we set the sampling frequency to 1 so that each sample counts as one time step; features that depend on time are therefore measured in minutes rather than seconds. Temporal features will be used for feature extraction.
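
To make the windowing concrete, the sketch below splits a hypothetical activity series into non-overlapping 60-sample windows with NumPy and computes two simple statistics per window by hand. The statistic names echo two TSFEL temporal features, but the formulas are my own assumption of their meaning, and dropping the trailing partial window is a choice made for this sketch.

```python
import numpy as np

WINDOW_SIZE = 60  # samples per window: 60 one-minute samples = 1 hour

# Hypothetical activity counts covering 150 minutes
rng = np.random.default_rng(0)
activity = rng.integers(0, 500, size=150)

# Keep only full windows; the trailing 30 samples are dropped in this sketch
n_windows = len(activity) // WINDOW_SIZE
windows = activity[: n_windows * WINDOW_SIZE].reshape(n_windows, WINDOW_SIZE)

# Absolute differences between consecutive samples: 59 per window
abs_diff = np.abs(np.diff(windows, axis=1))
sum_abs_diff = abs_diff.sum(axis=1)
mean_abs_diff = abs_diff.mean(axis=1)

print(windows.shape)
print(sum_abs_diff, mean_abs_diff)
```

Because each window holds 60 samples, there are 59 consecutive differences, so the sum and the mean of the absolute differences differ by a factor of 59.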

In [10]:
# Function to extract features for each patient file
def extract_time_series_features(patient_data):
    # Extract the patient number from the file name - e.g. condition_1.csv is split to condition_1
    patient_number = os.path.basename(patient_data).split('.')[0]
    
    # Load the DataFrame for the patient file
    patient_df = pd.read_csv(patient_data)
    
    # Extract temporal features using TSFEL
    cfg_file = tsfel.get_features_by_domain("temporal")
    
    features = tsfel.time_series_features_extractor(
        cfg_file,
        patient_df["activity"],
        fs=1, # Treat each one-minute sample as one time step
        window_size=60, # Hourly intervals: 60 one-minute samples per window
        verbose=0
    )
    
    # Insert the patient number as the first column
    features.insert(0, 'number', patient_number)
    return features

In [11]:
# Function to extract features for all patients
def extract_all_patient_features(folder):
    # List to collect the extracted features for all patients
    all_patient_features = []
    
    # Loop through the files in sorted (reproducible) order and extract the features
    for f in sorted(os.listdir(folder)):
        # Check if the file is a CSV file - if so extract the temporal features for that patient
        if f.endswith(".csv"):
            patient_file = os.path.join(folder, f)
            patient_features = extract_time_series_features(patient_file)
            all_patient_features.append(patient_features)
    
    # Concatenate all patients with their temporal features into a new DataFrame
    features_df = pd.concat(all_patient_features, ignore_index=True)
    return features_df

In [12]:
# Extract features for depressed patients
condition_features = extract_all_patient_features(condition)

In [13]:
condition_features

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate
0,condition_1,20694.5,11191987.0,31.821197,216.694915,3.338983,150.0,0.0,16.0,2.0,14.0,12788.374959,2.999083,12785.0,3.0
1,condition_1,16921.0,8078860.0,39.889938,149.118644,-5.186441,106.0,0.0,15.0,2.0,15.0,8803.539272,5.044624,8798.0,3.0
2,condition_1,16733.5,9051577.0,34.039841,194.220339,0.593220,119.0,-6.0,14.0,2.0,16.0,11461.499311,3.374465,11459.0,7.0
3,condition_1,13077.5,8124159.0,21.132863,168.355932,0.389831,90.0,12.0,17.0,2.0,18.0,9935.297455,-3.898111,9933.0,14.0
4,condition_1,14299.5,7685663.0,35.597247,148.084746,-0.050847,79.0,-9.0,13.0,2.0,13.0,8740.510339,0.924229,8737.0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9177,condition_9,9805.5,8258943.0,9.318980,107.338983,-28.118644,33.0,0.0,8.0,2.0,10.0,6353.503210,-9.620922,6333.0,9.0
9178,condition_9,1033.0,600727.0,16.827043,24.101695,0.000000,0.0,0.0,0.0,1.0,3.0,1474.652678,-0.739455,1422.0,6.0
9179,condition_9,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,0.000000,0.0,0.0
9180,condition_9,30199.0,34324479.0,42.625566,71.593220,0.000000,0.0,0.0,4.0,0.0,5.0,4255.134412,21.246207,4224.0,2.0


In [14]:
condition_features.shape

(9182, 15)

After performing feature extraction for depressed patients using temporal data, we can see 14 temporal features extracted from the patients' activity data.<br>

In this extracted data, there are 9182 rows and 15 columns (including the number (patient ID) column) for condition patients.

In [15]:
# Extract features for non-depressed patients
control_features = extract_all_patient_features(control)

In [16]:
control_features

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate
0,control_1,9326.0,5076847.0,9.890104,142.983051,0.101695,62.0,-18.0,17.0,2.0,17.0,8437.707426,-5.953570e+00,8436.0,4.0
1,control_1,14771.0,16467673.0,46.411548,181.220339,26.440678,74.0,5.0,15.0,2.0,17.0,10695.200739,8.134454e+00,10692.0,10.0
2,control_1,65211.5,97004748.0,19.981332,383.271186,-24.661017,288.0,0.0,18.0,2.0,17.0,22615.279140,-2.083056e+01,22613.0,0.0
3,control_1,19661.5,12267300.0,33.452533,244.525424,4.898305,229.0,-8.0,20.0,2.0,19.0,14429.480910,1.998388e+00,14427.0,2.0
4,control_1,33380.0,41691638.0,12.263850,219.322034,-17.694915,155.0,0.0,14.0,1.0,14.0,12958.275567,-2.980650e+01,12940.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16979,control_9,413.0,2940.0,29.500000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,-2.681580e-17,0.0,0.0
16980,control_9,3558.5,2318139.0,56.806817,44.661017,4.254237,0.0,0.0,1.0,0.0,2.0,2688.020020,4.913170e+00,2635.0,0.0
16981,control_9,7862.5,4409459.0,17.142844,118.864407,-5.949153,17.0,0.0,12.0,2.0,14.0,7032.569413,-5.571298e+00,7013.0,0.0
16982,control_9,3167.0,867152.0,28.485564,70.915254,-0.067797,0.0,0.0,5.0,2.0,10.0,4215.194223,-3.219228e-01,4184.0,0.0


In [17]:
control_features.shape

(16984, 15)

After performing feature extraction for non-depressed patients using temporal data, we can see 14 temporal features extracted from the patients' activity data.<br>

In this extracted data, there are 16984 rows and 15 columns (including the number (patient ID) column) for control patients.
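
A quick sanity check on the extracted data is to count the hourly windows per patient and compare them with the number of full 60-sample windows in each recording. The sketch below uses hypothetical values (`toy_features` and `n_samples`) and assumes that the extractor yields one row per full window.

```python
import pandas as pd

# Hypothetical extracted features: one row per hourly window
toy_features = pd.DataFrame({
    "number": ["control_1"] * 3 + ["control_2"] * 2,
    "0_Mean diff": [0.1, 26.4, -24.7, 4.9, -17.7],
})

# Hypothetical number of one-minute samples recorded for each patient
n_samples = pd.Series({"control_1": 200, "control_2": 125})

# Rows per patient in the extracted features
windows_per_patient = toy_features.groupby("number").size()

# Assumption: one row per full 60-sample window
expected_windows = n_samples // 60

print(windows_per_patient)
print(windows_per_patient.eq(expected_windows).all())
```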

We can save the extracted features for each group as a CSV file.

In [18]:
# Save the condition features as a CSV file in the extracted_features_data folder within the new_data_final folder
condition_features.to_csv("new_data_final/extracted_features_data/condition_hourly_temporal_features.csv", index=False)

In [19]:
# Save the control features as a CSV file in the extracted_features_data folder within the new_data_final folder
control_features.to_csv("new_data_final/extracted_features_data/control_hourly_temporal_features.csv", index=False)
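
Note that `to_csv` fails if the target folder does not exist, so the folders need to be created before saving. A minimal sketch, assuming a temporary directory in place of the project folder and a hypothetical one-row DataFrame:

```python
import os
import tempfile

import pandas as pd

# Hypothetical one-row feature table
toy_features = pd.DataFrame({"number": ["condition_1"], "0_Mean diff": [3.34]})

# A temporary directory stands in for the project folder in this sketch
base_dir = tempfile.mkdtemp()
output_dir = os.path.join(base_dir, "new_data_final", "extracted_features_data")

# Create the nested folders if they do not exist yet
os.makedirs(output_dir, exist_ok=True)

# Save the features and read them back to confirm the round trip
output_path = os.path.join(output_dir, "toy_hourly_temporal_features.csv")
toy_features.to_csv(output_path, index=False)
reloaded = pd.read_csv(output_path)
print(reloaded)
```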

Now that the features have been extracted for both depressed and non-depressed patients, we will concatenate the condition and control features into one new DataFrame, with each row keeping its patient number (identifier) and temporal features.

Next, add the afftype column from <code>scores.csv</code> so that each row receives the value recorded for its patient.

Finally, add a depression state feature: assign 0 to patients who do not have depression and 1 to patients who have bipolar I, bipolar II or unipolar depression.
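
The merge step can be guarded against silent mistakes. The sketch below runs a left merge on hypothetical data (`toy_features` and `toy_scores`) with `validate="many_to_one"`, which makes pandas raise an error if a patient appears more than once in the scores table, and then counts the rows that did not receive an afftype value.

```python
import pandas as pd

# Hypothetical hourly windows: several rows per patient
toy_features = pd.DataFrame({
    "number": ["condition_1", "condition_1", "control_1", "control_1"],
    "0_Mean diff": [3.3, -5.2, 0.1, 26.4],
})

# Hypothetical scores table: exactly one row per patient
toy_scores = pd.DataFrame({
    "number": ["condition_1", "control_1"],
    "afftype": [2.0, 0.0],
})

merged = pd.merge(
    toy_features,
    toy_scores[["number", "afftype"]],
    on="number",
    how="left",
    validate="many_to_one",  # raise if a patient is duplicated in toy_scores
)

# Every window should have received an afftype value
unmatched = int(merged["afftype"].isna().sum())
print(merged)
print(unmatched)
```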

In [20]:
# Check temporal features for depressed patients
condition_features

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate
0,condition_1,20694.5,11191987.0,31.821197,216.694915,3.338983,150.0,0.0,16.0,2.0,14.0,12788.374959,2.999083,12785.0,3.0
1,condition_1,16921.0,8078860.0,39.889938,149.118644,-5.186441,106.0,0.0,15.0,2.0,15.0,8803.539272,5.044624,8798.0,3.0
2,condition_1,16733.5,9051577.0,34.039841,194.220339,0.593220,119.0,-6.0,14.0,2.0,16.0,11461.499311,3.374465,11459.0,7.0
3,condition_1,13077.5,8124159.0,21.132863,168.355932,0.389831,90.0,12.0,17.0,2.0,18.0,9935.297455,-3.898111,9933.0,14.0
4,condition_1,14299.5,7685663.0,35.597247,148.084746,-0.050847,79.0,-9.0,13.0,2.0,13.0,8740.510339,0.924229,8737.0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
9177,condition_9,9805.5,8258943.0,9.318980,107.338983,-28.118644,33.0,0.0,8.0,2.0,10.0,6353.503210,-9.620922,6333.0,9.0
9178,condition_9,1033.0,600727.0,16.827043,24.101695,0.000000,0.0,0.0,0.0,1.0,3.0,1474.652678,-0.739455,1422.0,6.0
9179,condition_9,0.0,0.0,0.000000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,0.000000,0.0,0.0
9180,condition_9,30199.0,34324479.0,42.625566,71.593220,0.000000,0.0,0.0,4.0,0.0,5.0,4255.134412,21.246207,4224.0,2.0


In [21]:
# Check temporal features for non-depressed patients
control_features

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate
0,control_1,9326.0,5076847.0,9.890104,142.983051,0.101695,62.0,-18.0,17.0,2.0,17.0,8437.707426,-5.953570e+00,8436.0,4.0
1,control_1,14771.0,16467673.0,46.411548,181.220339,26.440678,74.0,5.0,15.0,2.0,17.0,10695.200739,8.134454e+00,10692.0,10.0
2,control_1,65211.5,97004748.0,19.981332,383.271186,-24.661017,288.0,0.0,18.0,2.0,17.0,22615.279140,-2.083056e+01,22613.0,0.0
3,control_1,19661.5,12267300.0,33.452533,244.525424,4.898305,229.0,-8.0,20.0,2.0,19.0,14429.480910,1.998388e+00,14427.0,2.0
4,control_1,33380.0,41691638.0,12.263850,219.322034,-17.694915,155.0,0.0,14.0,1.0,14.0,12958.275567,-2.980650e+01,12940.0,10.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
16979,control_9,413.0,2940.0,29.500000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,-2.681580e-17,0.0,0.0
16980,control_9,3558.5,2318139.0,56.806817,44.661017,4.254237,0.0,0.0,1.0,0.0,2.0,2688.020020,4.913170e+00,2635.0,0.0
16981,control_9,7862.5,4409459.0,17.142844,118.864407,-5.949153,17.0,0.0,12.0,2.0,14.0,7032.569413,-5.571298e+00,7013.0,0.0
16982,control_9,3167.0,867152.0,28.485564,70.915254,-0.067797,0.0,0.0,5.0,2.0,10.0,4215.194223,-3.219228e-01,4184.0,0.0


In [22]:
# Concatenate the DataFrames of the temporal features for condition and control patients into a new DataFrame
new_patient_df = pd.concat([condition_features, control_features], ignore_index=True)
new_patient_df

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate
0,condition_1,20694.5,11191987.0,31.821197,216.694915,3.338983,150.0,0.0,16.0,2.0,14.0,12788.374959,2.999083e+00,12785.0,3.0
1,condition_1,16921.0,8078860.0,39.889938,149.118644,-5.186441,106.0,0.0,15.0,2.0,15.0,8803.539272,5.044624e+00,8798.0,3.0
2,condition_1,16733.5,9051577.0,34.039841,194.220339,0.593220,119.0,-6.0,14.0,2.0,16.0,11461.499311,3.374465e+00,11459.0,7.0
3,condition_1,13077.5,8124159.0,21.132863,168.355932,0.389831,90.0,12.0,17.0,2.0,18.0,9935.297455,-3.898111e+00,9933.0,14.0
4,condition_1,14299.5,7685663.0,35.597247,148.084746,-0.050847,79.0,-9.0,13.0,2.0,13.0,8740.510339,9.242290e-01,8737.0,16.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26161,control_9,413.0,2940.0,29.500000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,-2.681580e-17,0.0,0.0
26162,control_9,3558.5,2318139.0,56.806817,44.661017,4.254237,0.0,0.0,1.0,0.0,2.0,2688.020020,4.913170e+00,2635.0,0.0
26163,control_9,7862.5,4409459.0,17.142844,118.864407,-5.949153,17.0,0.0,12.0,2.0,14.0,7032.569413,-5.571298e+00,7013.0,0.0
26164,control_9,3167.0,867152.0,28.485564,70.915254,-0.067797,0.0,0.0,5.0,2.0,10.0,4215.194223,-3.219228e-01,4184.0,0.0


In [23]:
# Display scores_df with the imputed values for the afftype column
scores_df

Unnamed: 0,number,days,gender,age,afftype,melanch,inpatient,edu,marriage,work,madrs1,madrs2
0,condition_1,11,2,35-39,2.0,2.0,2.0,6-10,1.0,2.0,19.0,19.0
1,condition_2,18,2,40-44,1.0,2.0,2.0,6-10,2.0,2.0,24.0,11.0
2,condition_3,13,1,45-49,2.0,2.0,2.0,6-10,2.0,2.0,24.0,25.0
3,condition_4,13,2,25-29,2.0,2.0,2.0,11-15,1.0,1.0,20.0,16.0
4,condition_5,13,2,50-54,2.0,2.0,2.0,11-15,2.0,2.0,26.0,26.0
5,condition_6,7,1,35-39,2.0,2.0,2.0,6-10,1.0,2.0,18.0,15.0
6,condition_7,11,1,20-24,1.0,,2.0,11-15,2.0,1.0,24.0,25.0
7,condition_8,5,2,25-29,2.0,,2.0,11-15,1.0,2.0,20.0,16.0
8,condition_9,13,2,45-49,1.0,,2.0,6-10,1.0,2.0,26.0,26.0
9,condition_10,9,2,45-49,2.0,2.0,2.0,6-10,1.0,2.0,28.0,21.0


In [24]:
# Merge the afftype column from scores_df so that each row receives the value for its patient
new_patient_df = pd.merge(new_patient_df, scores_df[['number', 'afftype']], on='number', how='left')
new_patient_df

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate,afftype
0,condition_1,20694.5,11191987.0,31.821197,216.694915,3.338983,150.0,0.0,16.0,2.0,14.0,12788.374959,2.999083e+00,12785.0,3.0,2.0
1,condition_1,16921.0,8078860.0,39.889938,149.118644,-5.186441,106.0,0.0,15.0,2.0,15.0,8803.539272,5.044624e+00,8798.0,3.0,2.0
2,condition_1,16733.5,9051577.0,34.039841,194.220339,0.593220,119.0,-6.0,14.0,2.0,16.0,11461.499311,3.374465e+00,11459.0,7.0,2.0
3,condition_1,13077.5,8124159.0,21.132863,168.355932,0.389831,90.0,12.0,17.0,2.0,18.0,9935.297455,-3.898111e+00,9933.0,14.0,2.0
4,condition_1,14299.5,7685663.0,35.597247,148.084746,-0.050847,79.0,-9.0,13.0,2.0,13.0,8740.510339,9.242290e-01,8737.0,16.0,2.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26161,control_9,413.0,2940.0,29.500000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,-2.681580e-17,0.0,0.0,0.0
26162,control_9,3558.5,2318139.0,56.806817,44.661017,4.254237,0.0,0.0,1.0,0.0,2.0,2688.020020,4.913170e+00,2635.0,0.0,0.0
26163,control_9,7862.5,4409459.0,17.142844,118.864407,-5.949153,17.0,0.0,12.0,2.0,14.0,7032.569413,-5.571298e+00,7013.0,0.0,0.0
26164,control_9,3167.0,867152.0,28.485564,70.915254,-0.067797,0.0,0.0,5.0,2.0,10.0,4215.194223,-3.219228e-01,4184.0,0.0,0.0


In [25]:
# Add depression state feature - assign non-depressed patients to 0 and depressed patients to 1

# Initialise depression state column with 0
new_patient_df["depression_state"] = 0

# Assign 1 to depressed patients, whose value in the number column starts with "condition"
new_patient_df.loc[new_patient_df["number"].str.startswith('condition'), "depression_state"] = 1

new_patient_df

Unnamed: 0,number,0_Area under the curve,0_Autocorrelation,0_Centroid,0_Mean absolute diff,0_Mean diff,0_Median absolute diff,0_Median diff,0_Negative turning points,0_Neighbourhood peaks,0_Positive turning points,0_Signal distance,0_Slope,0_Sum absolute diff,0_Zero crossing rate,afftype,depression_state
0,condition_1,20694.5,11191987.0,31.821197,216.694915,3.338983,150.0,0.0,16.0,2.0,14.0,12788.374959,2.999083e+00,12785.0,3.0,2.0,1
1,condition_1,16921.0,8078860.0,39.889938,149.118644,-5.186441,106.0,0.0,15.0,2.0,15.0,8803.539272,5.044624e+00,8798.0,3.0,2.0,1
2,condition_1,16733.5,9051577.0,34.039841,194.220339,0.593220,119.0,-6.0,14.0,2.0,16.0,11461.499311,3.374465e+00,11459.0,7.0,2.0,1
3,condition_1,13077.5,8124159.0,21.132863,168.355932,0.389831,90.0,12.0,17.0,2.0,18.0,9935.297455,-3.898111e+00,9933.0,14.0,2.0,1
4,condition_1,14299.5,7685663.0,35.597247,148.084746,-0.050847,79.0,-9.0,13.0,2.0,13.0,8740.510339,9.242290e-01,8737.0,16.0,2.0,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
26161,control_9,413.0,2940.0,29.500000,0.000000,0.000000,0.0,0.0,0.0,0.0,0.0,59.000000,-2.681580e-17,0.0,0.0,0.0,0
26162,control_9,3558.5,2318139.0,56.806817,44.661017,4.254237,0.0,0.0,1.0,0.0,2.0,2688.020020,4.913170e+00,2635.0,0.0,0.0,0
26163,control_9,7862.5,4409459.0,17.142844,118.864407,-5.949153,17.0,0.0,12.0,2.0,14.0,7032.569413,-5.571298e+00,7013.0,0.0,0.0,0
26164,control_9,3167.0,867152.0,28.485564,70.915254,-0.067797,0.0,0.0,5.0,2.0,10.0,4215.194223,-3.219228e-01,4184.0,0.0,0.0,0


In [26]:
# Check the count for depression_state column
new_patient_df["depression_state"].value_counts()

0    16984
1     9182
Name: depression_state, dtype: int64

The <code>depression_state</code> label is 1 for condition patients, who have bipolar I, bipolar II or unipolar depression, and 0 for control patients, who do not have depression.

The counts match the feature extraction step: the condition features DataFrame had 9182 rows, which now appear with the value <b>1</b> in the <code>depression_state</code> column, and the control features DataFrame had 16984 rows, which now appear with the value <b>0</b>.
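
The two classes are not the same size, which is worth keeping in mind when interpreting model accuracy later. The short sketch below computes the class proportions from the row counts reported above (the Series is built by hand here rather than read from `new_patient_df`).

```python
import pandas as pd

# Row counts reported above for the depression_state column
counts = pd.Series({0: 16984, 1: 9182}, name="depression_state")

# Share of hourly windows in each class
proportions = counts / counts.sum()
print(proportions.round(3))
```

Roughly 35% of the hourly windows belong to depressed patients and 65% to non-depressed patients.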

Finally, we will save this new dataset as a CSV file and use it for training the machine learning models.

In [27]:
# Save the new_patient_df DataFrame as a CSV file
new_patient_df.to_csv("new_data_final/depression_hourly_time_series_data_temporal.csv", index=False)
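
Because every row keeps its patient number, the saved dataset can later be split by patient, which is what Leave One Subject Out Cross Validation requires. The sketch below is a hand-written illustration on hypothetical data: `leave_one_subject_out` is a hypothetical helper, not the implementation used in this project. scikit-learn provides `LeaveOneGroupOut` for the same purpose.

```python
import pandas as pd


def leave_one_subject_out(df, subject_col="number"):
    """Yield (held-out subject, train rows, test rows) for each subject."""
    for subject in df[subject_col].unique():
        is_test = df[subject_col] == subject
        yield subject, df[~is_test], df[is_test]


# Hypothetical dataset: hourly windows from three patients
toy_dataset = pd.DataFrame({
    "number": ["condition_1", "condition_1", "control_1", "control_1", "control_2"],
    "0_Mean diff": [3.3, -5.2, 0.1, 26.4, 4.9],
    "depression_state": [1, 1, 0, 0, 0],
})

# Each fold holds out all windows of exactly one patient
folds = list(leave_one_subject_out(toy_dataset))
for subject, train, test in folds:
    print(subject, len(train), len(test))
```

Each fold keeps all windows of one patient out of the training data, so a model is never evaluated on a patient it has already seen.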