# Learning about the Demographics of the Data for the Project

The data for the project was taken from the SensSmartTech database on PhysioNet.
It is a polycardiographic database of cardiovascular signals measured synchronously. It consists of electrocardiogram (ECG), phonocardiogram (PCG), photoplethysmogram (PPG) and accelerometer (ACC) signals, in 338 30-second recordings from 32 healthy volunteers.



Each recording is made up of 10 channels:
 1. 4 ECG (limb, V3 and V4 leads)
 2. 1 PCG (measured at the heart apex)
 3. 4 PPG
 4. 1 ACC (accelerometer)


Several multisensory databases capturing variations in HR during activity have been documented [5-8]. SensSmartTech stands out as the first database of multisensory recordings that systematically follows heart relaxation dynamics across a wide range of HRs (58 bpm - 173 bpm). The recorded HR dependence is of interest to clinicians applying HR biomarker correction, engineers investigating HR estimation by different wearable sensors and the impact of noise and artefacts on diagnostic signals, and scientists studying the underlying nonlinear dynamics of the heart as an electro-mechanical system.

 

Of interest to us are the 4 ECG channels, the 1 PCG channel and the 1 ACC channel, giving 6 channels for the project work.

### **Technicalities of the hardware used for taking the data.**

1. ECG signal acquisition is performed with the ADS1298 chip (Texas Instruments) with the sampling rate set to 500 Hz. The measurement used 4 limb electrodes, V3 and V4, while the redundant precordial electrodes (V1, V2, V5 and V6) were placed on the upper right arm to prevent noise from the hanging leads.

2. The PCG signal is captured using an ICS-40300 microphone (TDK InvenSense) placed in a SPIRIT CK-S474SPF63 cardiology stethoscope (Spirit Medical), with a sampling rate of 1 kHz. The PCG stethoscope was positioned at the sternum to the right of the V3 ECG electrode and secured with an elastic band.

3. The ACC signal was recorded by an MPU6050 MEMS accelerometer (TDK InvenSense) with the acceleration range set to +/- 1g. It was attached to the body between the V3 and V4 ECG electrodes using a self-adhesive ECG electrode. Only the z axis, in the direction perpendicular to the chest, was used.

4. The sensor output signals were digitized by 16-bit A/D converters. The polycardiograph synchronously collected data from the sensors and transmitted them to a PC over Ethernet. Accuracy of the polycardiograph was set by the sampling rate of the sensors.
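
As a quick sanity check on the figures above, the nominal number of samples in one 30-second recording follows directly from each sensor's sampling rate. The sketch below uses only the ECG and PCG rates stated in this section; the helper name `expected_samples` is illustrative, and the accelerometer is left out because its sampling rate is not given here. Actual files may differ slightly from these nominal counts.

```python
# Nominal samples per recording = sampling rate (Hz) x duration (s).
RECORDING_SECONDS = 30  # each recording is 30 seconds long

# Sampling rates quoted in the hardware description above
SAMPLING_RATE_HZ = {"ECG": 500, "PCG": 1000}


def expected_samples(rate_hz, duration_s=RECORDING_SECONDS):
    """Return the nominal number of samples for one recording (hypothetical helper)."""
    return int(rate_hz * duration_s)


samples_per_recording = {
    name: expected_samples(rate) for name, rate in SAMPLING_RATE_HZ.items()
}
print(samples_per_recording)  # {'ECG': 15000, 'PCG': 30000}
```

Comparing these counts with the number of rows in a loaded CSV is a cheap way to spot truncated files.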


**Relevant information concerning mode of data collection**: Recordings were taken in a standing position at rest and immediately after the activity.
After each recording, the researcher calculated the heart rate (HR). Three 30-second recordings were made at rest. After the activity, recordings were repeated until the HR dropped to 10-20 bpm above the HR at rest.

Of interest to us is the CSV format, which has a time column followed by the columns for the associated channels. The **acquisition time follows the sampling rate of the sensor**, so different sensors may record samples at different points in time.
Therefore the time axes of different sensors differ, but the **acquisition is synchronized so that they can be extended to a common zero.**
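
Because the time axes share a common zero but not a common step, one way to compare channels sample by sample is to interpolate them onto a single time grid. The sketch below is only an illustration on synthetic sine waves, not on the real files: the helper names `make_channel` and `resample_to_grid` are made up for this example, and the 500 Hz and 1 kHz rates are the ECG and PCG rates quoted earlier. Plain linear interpolation is used for simplicity; downsampling a real PCG signal this way without a low-pass filter could alias.

```python
import numpy as np


def make_channel(rate_hz, duration_s, freq_hz):
    """Build a synthetic (time, signal) pair starting at t = 0 (illustrative only)."""
    n = int(rate_hz * duration_s)
    t = np.arange(n) / rate_hz
    return t, np.sin(2 * np.pi * freq_hz * t)


def resample_to_grid(t_src, x_src, t_grid):
    """Linearly interpolate a signal onto another time axis."""
    return np.interp(t_grid, t_src, x_src)


# Two synthetic channels with different sampling rates but a common zero
t_ecg, ecg = make_channel(500, 2.0, 1.2)
t_pcg, pcg = make_channel(1000, 2.0, 1.2)

# Bring the 1 kHz channel onto the 500 Hz time axis and stack the columns
pcg_on_ecg_grid = resample_to_grid(t_pcg, pcg, t_ecg)
aligned = np.column_stack([t_ecg, ecg, pcg_on_ecg_grid])
print(aligned.shape)  # (1000, 3)
```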
 



Additionally, a table Demographics.csv lists file names and subject demographics, including age, height, weight, and body-mass index. Furthermore, each row in this table displays the subject activity status: 'B' for the measurement before and 'A' for the measurement after the activity, and the HR calculated as the inverse of the median RR interval per recording.
To de-identify the data, all dates were removed from the recordings. The published data do not contain any information that identifies or provides a reasonable basis to identify an individual. The data comply with HIPAA requirements for sharing personal health information.
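
The HR definition above (inverse of the median RR interval) can be sketched in a few lines. This is a minimal illustration, assuming R-peak times in seconds are already available; the function name `median_heart_rate_bpm` and the example peak times are invented for the sketch, and the factor of 60 converts beats per second to beats per minute.

```python
import numpy as np


def median_heart_rate_bpm(r_peak_times_s):
    """Heart rate in bpm as 60 / median RR interval (RR in seconds)."""
    rr = np.diff(np.asarray(r_peak_times_s, dtype=float))
    if rr.size == 0:
        raise ValueError("need at least two R peaks to form an RR interval")
    return 60.0 / float(np.median(rr))


# Made-up R-peak times: regular 0.5 s beats with one missed detection (1.5 -> 2.5)
r_peaks = [0.0, 0.5, 1.0, 1.5, 2.5]

hr_median = median_heart_rate_bpm(r_peaks)
hr_mean = 60.0 / float(np.mean(np.diff(r_peaks)))
print(hr_median)  # 120.0
print(hr_mean)    # 96.0
```

The median ignores the single doubled interval, whereas the mean-based estimate is pulled down by it, which is one reason a median is a reasonable choice for short recordings.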

### Understanding the Demographics

In [29]:
# Loading the necessary Libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datetime import datetime, timedelta


In [None]:
# Load the cleaned demographics table
df = pd.read_csv('Demographics_Cleaned.csv')

# 'Recording time (hh:mm:ss)' holds the recording time as a string;
# it is converted below into a datetime.time object.

# Function to safely convert time strings
def parse_time(val):
    if pd.isna(val):
        return None
    try:
        return datetime.strptime(str(val).strip(), "%H:%M:%S").time()
    except ValueError:
        print(f"Skipping invalid time format: {val}")
        return None

# Apply the conversion
df["Recording time"] = df["Recording time (hh:mm:ss)"].apply(parse_time)

In [28]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 401 entries, 0 to 400
Data columns (total 15 columns):
 #   Column                            Non-Null Count  Dtype  
---  ------                            --------------  -----  
 0   File Number                       338 non-null    float64
 1   Subject number                    338 non-null    float64
 2   Recording time (hh:mm:ss)         338 non-null    object 
 3   Gender                            338 non-null    object 
 4   Age (year)                        338 non-null    float64
 5   Height (cm)                       338 non-null    float64
 6   Weight (kg)                       338 non-null    float64
 7   Body-mass index                   338 non-null    float64
 8   ECG                               338 non-null    object 
 9   PPG                               338 non-null    object 
 10  PCG                               338 non-null    object 
 11  ACC                               338 non-null    object 
 12  Before (
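
The summary above reports 401 index entries but only 338 non-null values per column, which suggests the file carries rows that are entirely empty, and the identifier columns are stored as `float64` because of those missing values. A minimal sketch of one way to handle this is shown below on a small made-up frame (`toy`), not on the real table; whether the extra rows in the real file are fully empty is an assumption worth checking first.

```python
import numpy as np
import pandas as pd

# Toy frame imitating the pattern: two real rows followed by two empty rows
toy = pd.DataFrame({
    "File Number": [1.0, 2.0, np.nan, np.nan],
    "Subject number": [1.0, 1.0, np.nan, np.nan],
    "Gender": ["M", "M", None, None],
})

# Drop rows where every column is missing, then restore integer identifiers
toy_clean = toy.dropna(how="all")
toy_clean = toy_clean.astype({"File Number": int, "Subject number": int})

print(len(toy), len(toy_clean))  # 4 2
print(toy_clean.dtypes)
```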

In [None]:
# Group by Subject number and sort within each group by File number
grouped = df.groupby("Subject number")

# Create a dictionary where key = Subject number, value = DataFrame of that subject's records
subject_batches = {
    subject: group.sort_values(by="File Number").reset_index(drop=True)
    for subject, group in grouped
}

# Optional: View one subject’s data (e.g., Subject 1)
subject_batches[1].head(10)

Unnamed: 0,File Number,Subject number,Recording time (hh:mm:ss),Gender,Age (year),Height (cm),Weight (kg),Body-mass index,ECG,PPG,PCG,ACC,Before (B) / after (A) activity,Median heart rate (bpm)
0,1.0,1.0,10:09:54,M,53.0,175.0,88.0,28.73,1_10-09-54_ecg,1_10-09-54_ppg,1_10-09-54_pcg,1_10-09-54_acc,B,91.2
1,2.0,1.0,10:11:48,M,53.0,175.0,88.0,28.73,1_10-11-48_ecg,1_10-11-48_ppg,1_10-11-48_pcg,1_10-11-48_acc,B,90.1
2,3.0,1.0,10:12:41,M,53.0,175.0,88.0,28.73,1_10-12-41_ecg,1_10-12-41_ppg,1_10-12-41_pcg,1_10-12-41_acc,B,92.9
3,4.0,1.0,10:25:13,M,53.0,175.0,88.0,28.73,1_10-25-13_ecg,1_10-25-13_ppg,1_10-25-13_pcg,1_10-25-13_acc,A,129.9
4,5.0,1.0,10:26:08,M,53.0,175.0,88.0,28.73,1_10-26-08_ecg,1_10-26-08_ppg,1_10-26-08_pcg,1_10-26-08_acc,A,120.0
5,6.0,1.0,10:26:55,M,53.0,175.0,88.0,28.73,1_10-26-55_ecg,1_10-26-55_ppg,1_10-26-55_pcg,1_10-26-55_acc,A,112.8
6,7.0,1.0,10:27:42,M,53.0,175.0,88.0,28.73,1_10-27-42_ecg,1_10-27-42_ppg,1_10-27-42_pcg,1_10-27-42_acc,A,111.1
7,8.0,1.0,10:29:01,M,53.0,175.0,88.0,28.73,1_10-29-01_ecg,1_10-29-01_ppg,1_10-29-01_pcg,1_10-29-01_acc,A,110.3
8,9.0,1.0,10:31:08,M,53.0,175.0,88.0,28.73,1_10-31-08_ecg,1_10-31-08_ppg,1_10-31-08_pcg,1_10-31-08_acc,A,113.2
9,10.0,1.0,10:32:38,M,53.0,175.0,88.0,28.73,1_10-32-38_ecg,1_10-32-38_ppg,1_10-32-38_pcg,1_10-32-38_acc,A,104.9


In [17]:
subject_batches[1].tail()


Unnamed: 0,File Number,Subject number,Recording time (hh:mm:ss),Gender,Age (year),Height (cm),Weight (kg),Body-mass index,ECG,PPG,PCG,ACC,Before (B) / after (A) activity,Median heart rate (bpm)
5,6.0,1.0,10:26:55,M,53.0,175.0,88.0,28.73,1_10-26-55_ecg,1_10-26-55_ppg,1_10-26-55_pcg,1_10-26-55_acc,A,112.8
6,7.0,1.0,10:27:42,M,53.0,175.0,88.0,28.73,1_10-27-42_ecg,1_10-27-42_ppg,1_10-27-42_pcg,1_10-27-42_acc,A,111.1
7,8.0,1.0,10:29:01,M,53.0,175.0,88.0,28.73,1_10-29-01_ecg,1_10-29-01_ppg,1_10-29-01_pcg,1_10-29-01_acc,A,110.3
8,9.0,1.0,10:31:08,M,53.0,175.0,88.0,28.73,1_10-31-08_ecg,1_10-31-08_ppg,1_10-31-08_pcg,1_10-31-08_acc,A,113.2
9,10.0,1.0,10:32:38,M,53.0,175.0,88.0,28.73,1_10-32-38_ecg,1_10-32-38_ppg,1_10-32-38_pcg,1_10-32-38_acc,A,104.9
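
The Subject 1 rows shown above can be used to check the recording protocol described earlier, where recordings after the activity were repeated until the HR dropped to 10-20 bpm above the resting HR. The sketch below re-types the activity labels and median heart rates from that output. Taking the resting HR as the mean of the three 'B' recordings is an assumption made here, since the text does not say how the researcher defined it.

```python
import pandas as pd

# Activity labels and median HR values copied from the Subject 1 rows above
subject1 = pd.DataFrame({
    "Activity": ["B", "B", "B", "A", "A", "A", "A", "A", "A", "A"],
    "Median heart rate (bpm)": [91.2, 90.1, 92.9, 129.9, 120.0,
                                112.8, 111.1, 110.3, 113.2, 104.9],
})

hr = subject1["Median heart rate (bpm)"]

# Assumed definition: resting HR = mean of the recordings before the activity
resting_hr = hr[subject1["Activity"] == "B"].mean()

# HR of the last recording after the activity
last_after_hr = hr[subject1["Activity"] == "A"].iloc[-1]

excess = last_after_hr - resting_hr
print(round(resting_hr, 1), round(excess, 1))  # 91.4 13.5
```

Under that assumption the last recording sits about 13.5 bpm above rest, which falls inside the 10-20 bpm stopping window.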


In [31]:
df_clean = df.dropna(subset=["Recording time"])

# Group by Subject
subject_groups = df_clean.groupby("Subject number")

# Prepare results
duration_data = []

for subject, group in subject_groups:
    group_sorted = group.sort_values("File Number")

    start_time = group_sorted["Recording time"].iloc[0]
    end_time = group_sorted["Recording time"].iloc[-1]

    # Convert times to datetime so we can subtract
    start_dt = datetime.combine(datetime.today(), start_time)
    end_dt = datetime.combine(datetime.today(), end_time)

    # Handle potential midnight wraparound (if needed)
    if end_dt < start_dt:
        end_dt += timedelta(days=1)

    duration = end_dt - start_dt

    duration_data.append({
        "Subject number": subject,
        "Start time": start_time,
        "End time": end_time,
        "Duration (HH:MM:SS)": duration
    })

# Create a new DataFrame with the results
duration_df = pd.DataFrame(duration_data)


In [32]:
duration_df

Unnamed: 0,Subject number,Start time,End time,Duration (HH:MM:SS)
0,1.0,10:09:54,10:32:38,0 days 00:22:44
1,2.0,14:08:15,14:35:20,0 days 00:27:05
2,3.0,10:08:25,10:23:07,0 days 00:14:42
3,4.0,11:19:22,11:29:07,0 days 00:09:45
4,5.0,08:59:04,09:14:32,0 days 00:15:28
5,6.0,10:43:42,11:26:54,0 days 00:43:12
6,7.0,14:16:33,14:29:51,0 days 00:13:18
7,8.0,12:52:50,13:21:39,0 days 00:28:49
8,9.0,13:52:23,14:10:21,0 days 00:17:58
9,10.0,18:12:33,18:34:40,0 days 00:22:07


In [44]:
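# Group the recordings by before (B) / after (A) activity and compute the time span of each phase.

# Locate the activity column by its prefix, since the spacing in the header may vary
activity_col = next(c for c in df.columns if c.startswith("Before (B)"))

# Clean the activity column (strip whitespace and uppercase).
# No astype(str) here: it would turn missing values into the string "NAN",
# and the dropna below would then keep those rows.
df["Activity"] = df[activity_col].str.strip().str.upper()

# Drop rows with missing info
df_clean = df.dropna(subset=["Recording time", "Activity"])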

# Prepare results
results = []

for subject, group in df_clean.groupby("Subject number"):
    subject_data = {"Subject number": subject}

    for label in ['A', 'B']:  # A = After, B = Before
        sub = group[group["Activity"] == label]

        if sub.empty:
            subject_data[f"{label} Start"] = None
            subject_data[f"{label} End"] = None
            subject_data[f"{label} Duration"] = None
            continue

        sub_sorted = sub.sort_values("File Number")

        start_time = sub_sorted["Recording time"].iloc[0]
        end_time = sub_sorted["Recording time"].iloc[-1]

        start_dt = datetime.combine(datetime.today(), start_time)
        end_dt = datetime.combine(datetime.today(), end_time)

        if end_dt < start_dt:
            end_dt += timedelta(days=1)

        duration = end_dt - start_dt

        subject_data[f"{label} Start"] = start_time
        subject_data[f"{label} End"] = end_time
        subject_data[f"{label} Duration"] = duration

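    # Optional: add total duration (sum of the two phase spans; the gap between them is not included).
    # Compare against None explicitly: a zero-length timedelta is falsy.
    if subject_data["A Duration"] is not None and subject_data["B Duration"] is not None: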
        subject_data["Total Duration"] = subject_data["A Duration"] + subject_data["B Duration"]
    else:
        subject_data["Total Duration"] = None

    results.append(subject_data)

# Create results DataFrame
activity_duration_df = pd.DataFrame(results)

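In [45]:
# The cell above stores its result in `activity_duration_df`. The all-empty table
# below was displayed from `duration_by_activity`, a name that cell never defines,
# so it likely reflects an earlier run rather than the code above.
activity_duration_df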


Unnamed: 0,Subject number,A Start,A End,A Duration,B Start,B End,B Duration
0,1.0,,,,,,
1,2.0,,,,,,
2,3.0,,,,,,
3,4.0,,,,,,
4,5.0,,,,,,
5,6.0,,,,,,
6,7.0,,,,,,
7,8.0,,,,,,
8,9.0,,,,,,
9,10.0,,,,,,
