# Learning about the Demographics of the Project Data

The data for the project were taken from the SensSmartTech database on PhysioNet.
It is a set of polycardiograph recordings of cardiovascular signals measured synchronously. It consists of electrocardiograph (ECG), phonocardiograph (PCG), photoplethysmograph (PPG) and accelerometer (ACC) signals, with 338 30-second recordings from 32 healthy volunteers.



It is made up of 10 channels:
 1. 4 ECG (limb, V3 and V4 leads)
 2. 1 PCG (measured at the heart apex)
 3. 4 PPG
 4. 1 ACC (accelerometer)


Several multisensory databases capturing variations in HR during activity have been documented [5-8]. SensSmartTech stands out as the first database of multisensory recordings that systematically follows heart relaxation dynamics across a wide range of HRs (58 bpm - 173 bpm). The recorded HR dependence is of interest to clinicians applying the HR biomarker correction, engineers investigating HR estimation by different wearable sensors and the impact of noise and artefacts on diagnostic signals, and scientists studying the underlying nonlinear dynamics of the heart as an electro-mechanical system.

 

Of interest to us are the 4 ECG channels, 1 PCG channel and 1 ACC channel, which together make up the 6 channels used in the project work.

### **Technical details of the hardware used to record the data**

1. ECG signal acquisition was performed with the ADS1298 chip (Texas Instruments) with the sampling rate set to 500 Hz. The measurement used 4 limb electrodes plus V3 and V4, while the redundant precordial electrodes (V1, V2, V5 and V6) were placed on the upper right arm to prevent noise from the hanging leads.

2. The PCG signal was captured using an ICS-40300 microphone (TDK InvenSense) placed in a SPIRIT CK-S474SPF63 cardiology stethoscope (Spirit Medical), with a sampling rate of 1 kHz. The PCG stethoscope was positioned at the sternum to the right of the V3 ECG electrode and secured with an elastic band.

3. The ACC signal was recorded by an MPU6050 MEMS accelerometer (TDK InvenSense) with the acceleration range set to +/- 1g. It was attached to the body between the V3 and V4 ECG electrodes using a self-adhesive ECG electrode. Only the z axis, in the direction perpendicular to the chest, was used.

4. The sensor output signals were digitized by 16-bit A/D converters. The polycardiograph synchronously collected data from the sensors and transmitted them to a PC over Ethernet. The accuracy of the polycardiograph was set by the sampling rate of the sensors.
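
To make these figures concrete, the sketch below works out how many samples a 30-second recording holds at the two sampling rates stated above, and the size of one quantization step for a 16-bit converter. The step calculation assumes that the full +/- 1g accelerometer range is mapped linearly onto all 16 bits, which the dataset description does not confirm. The names `samples_per_recording` and `lsb_size` are illustrative.

```python
RECORDING_SECONDS = 30
SAMPLING_RATES_HZ = {"ECG": 500, "PCG": 1000}  # rates stated in the list above


def samples_per_recording(rate_hz, seconds=RECORDING_SECONDS):
    """Number of samples in one recording at the given sampling rate."""
    return rate_hz * seconds


def lsb_size(full_scale_range, bits=16):
    """Size of one quantization step, assuming a linear mapping (assumption)."""
    return full_scale_range / (2 ** bits)


sample_counts = {
    name: samples_per_recording(rate) for name, rate in SAMPLING_RATES_HZ.items()
}
acc_step_g = lsb_size(2.0)  # +/- 1 g spans 2 g in total

print(sample_counts)
print(f"ACC quantization step: {acc_step_g * 1e6:.1f} micro-g")
```

The different sample counts per recording are the reason each sensor carries its own time axis in the exported files.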


**Relevant information on how the data were collected**: Recordings were taken in a standing position, at rest and immediately after the activity.
After each recording, the researcher calculated the heart rate (HR). Three 30-second recordings were made at rest. After the activity, recordings were repeated until the HR dropped to 10-20 bpm above the HR at rest.

Of interest to us is the CSV format, which has a column for time followed by the channels associated with it. The **acquisition time follows the sampling rate of the sensor**, and sensors may record samples at different points in time.
Therefore, the time axes of different sensors are different, but the **acquisition is synchronized so that they can be extended to a common zero.**
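
As an illustration of what separate time axes mean in practice, the sketch below builds two synthetic signals at 500 Hz and 1 kHz that share a common zero, then interpolates the slower one onto the faster time axis with `np.interp`. The signals are synthetic stand-ins, and linear interpolation is only one possible way to align channels; nothing here is read from the dataset files.

```python
import numpy as np

duration_s = 30
fs_ecg, fs_pcg = 500, 1000

# Each sensor has its own time axis; both start at the common zero
t_ecg = np.arange(duration_s * fs_ecg) / fs_ecg
t_pcg = np.arange(duration_s * fs_pcg) / fs_pcg

# Synthetic stand-ins for real channels
ecg = np.sin(2 * np.pi * 1.2 * t_ecg)
pcg = np.sin(2 * np.pi * 40.0 * t_pcg)

# Linearly interpolate the 500 Hz signal onto the 1 kHz time axis
ecg_on_pcg_axis = np.interp(t_pcg, t_ecg, ecg)

print(len(t_ecg), len(t_pcg), len(ecg_on_pcg_axis))
```

After this step both channels have one value per 1 kHz time stamp, so they can be stored side by side or compared sample by sample.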
 



Additionally, a table Demographics.csv lists file names and subject demographics, including age, height, weight, and body-mass index. Furthermore, each row in this table displays the subject activity status: 'B' for the measurement before and 'A' for the measurement after the activity, and the HR calculated as the inverse of the median RR interval per recording.
To de-identify the data, all dates were removed from the recordings. The published data do not contain any information that identifies or provides a reasonable basis to identify an individual. The data comply with HIPAA requirements for sharing personal health information.
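
The heart-rate definition used in Demographics.csv (the inverse of the median RR interval) can be sketched as follows. The function name `heart_rate_bpm` and the R-peak times are hypothetical; how the dataset authors detected R peaks is not described here, so the sketch starts from peak times that are assumed to be already known.

```python
import numpy as np


def heart_rate_bpm(r_peak_times_s):
    """HR in beats per minute from R-peak times given in seconds."""
    rr_intervals = np.diff(np.asarray(r_peak_times_s, dtype=float))
    if rr_intervals.size == 0:
        raise ValueError("At least two R peaks are needed")
    # Inverse of the median RR interval, converted from beats/s to beats/min
    return 60.0 / np.median(rr_intervals)


# Hypothetical R-peak times: one beat every 0.8 s, with one slightly late beat
example_peaks = [0.0, 0.8, 1.6, 2.5, 3.3, 4.1]
example_hr = heart_rate_bpm(example_peaks)
print(round(example_hr, 1))
```

Using the median rather than the mean keeps a single late or missed beat from shifting the estimate.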

### **Understanding the Demographics**

In [None]:
# Loading the necessary Libraries
import pandas as pd
import numpy as np
from matplotlib import pyplot as plt
from datetime import datetime, timedelta


In [None]:
df = pd.read_csv('Demographics_Cleaned.csv')
# Display the first few rows of the DataFrame to understand its structure
df.head()

In [None]:
# Check for missing values in the DataFrame
missing_values = df.isnull().sum()
print("Missing values in each column:\n", missing_values)
df.info()

In [None]:
# Convert 'Recording time' string to datetime.time object
# Assuming 'Recording time (hh:mm:ss)' is the column with time strings

# Function to safely convert time strings to datetime.time objects
def parse_time(val):
    if pd.isna(val):
        return None
    try:
        return datetime.strptime(str(val).strip(), "%H:%M:%S").time()
    except ValueError:
        print(f"Skipping invalid time format: {val}")
        return None

# Apply the conversion
df["Recording time"] = df["Recording time (hh:mm:ss)"].apply(parse_time)
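
An alternative to the row-by-row function above is to let pandas parse the whole column at once. The sketch below does this on a small made-up Series with `pd.to_datetime` and `errors="coerce"`, which marks unparseable entries as missing instead of raising. The sample values are invented for illustration and are not taken from the demographics file.

```python
import pandas as pd
from datetime import time

# Made-up sample values: stray whitespace, a missing entry and a bad string
raw_times = pd.Series([" 09:15:30", "10:02:05 ", None, "not a time"])

# Parse the whole column in one call; invalid entries become missing values
parsed = pd.to_datetime(
    raw_times.str.strip(), format="%H:%M:%S", errors="coerce"
).dt.time

print(parsed)
print("Rows that failed to parse:", int(parsed.isna().sum()))
```

Unlike `parse_time`, this version does not print each rejected value, so it suits cases where only the count of bad rows matters.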

In [None]:
# Group by Subject number and sort within each group by File number
grouped = df.groupby("Subject number")

# Create a dictionary where key = Subject number, value = DataFrame of that subject's records
subject_batches = {
    subject: group.sort_values(by="File Number").reset_index(drop=True)
    for subject, group in grouped
}

# Display the first few rows of the DataFrame for a specific subject (e.g., Subject 1)
subject_batches[1].head(10)

In [None]:
# Creating the duration DataFrame
# Drop rows with NaN in 'Recording time' to avoid issues in calculations

df_clean = df.dropna(subset=["Recording time"])

# Group by Subject
subject_groups = df_clean.groupby("Subject number")

# Prepare results
duration_data = []

for subject, group in subject_groups:
    group_sorted = group.sort_values("File Number")

    start_time = group_sorted["Recording time"].iloc[0]
    end_time = group_sorted["Recording time"].iloc[-1]

    # Convert times to datetime so we can subtract
    start_dt = datetime.combine(datetime.today(), start_time)
    end_dt = datetime.combine(datetime.today(), end_time)

    # Handle potential midnight wraparound (if needed)
    if end_dt < start_dt:
        end_dt += timedelta(days=1)

    duration = end_dt - start_dt

    duration_data.append({
        "Subject number": subject,
        "Start time": start_time,
        "End time": end_time,
        "Duration (HH:MM:SS)": duration
    })

# Create a new DataFrame with the results
duration_df = pd.DataFrame(duration_data)
duration_df
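
The subtraction step in the loop above can be isolated into a small helper, which makes the midnight wraparound easy to check by hand. The function name `elapsed_between` and the fixed anchor date are illustrative choices and are not part of the notebook. The helper assumes the two clock times are less than 24 hours apart.

```python
from datetime import date, datetime, time, timedelta


def elapsed_between(start_time, end_time):
    """Elapsed time between two clock times, assuming less than 24 h apart."""
    anchor = date(2000, 1, 1)  # arbitrary fixed date; only the difference matters
    start_dt = datetime.combine(anchor, start_time)
    end_dt = datetime.combine(anchor, end_time)
    if end_dt < start_dt:  # the end time falls on the next calendar day
        end_dt += timedelta(days=1)
    return end_dt - start_dt


same_day = elapsed_between(time(9, 15, 0), time(9, 47, 30))
across_midnight = elapsed_between(time(23, 50, 0), time(0, 10, 0))
print(same_day, across_midnight)
```

A fixed anchor date gives the same result on every run, whereas two separate calls to `datetime.today()` could in principle straddle midnight.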


In [None]:
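# Group the recordings by activity status (B = before, A = after) and compute the time span of each.

# Clean the activity column (strip whitespace and uppercase).
# The nullable "string" dtype keeps missing values as <NA>; astype(str) would turn them
# into the literal text "nan", which the dropna() call below could not remove.
df["Activity"] = df["Before (B)  / after (A) activity"].astype("string").str.strip().str.upper()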

# Drop rows with missing info
df_clean = df.dropna(subset=["Recording time", "Activity"])

# Prepare results
results = []

for subject, group in df_clean.groupby("Subject number"):
    subject_data = {"Subject number": subject}

    for label in ['A', 'B']:  # A = After, B = Before
        sub = group[group["Activity"] == label]

        if sub.empty:
            subject_data[f"{label} Start"] = None
            subject_data[f"{label} End"] = None
            subject_data[f"{label} Duration"] = None
            continue

        sub_sorted = sub.sort_values("File Number")

        start_time = sub_sorted["Recording time"].iloc[0]
        end_time = sub_sorted["Recording time"].iloc[-1]

        start_dt = datetime.combine(datetime.today(), start_time)
        end_dt = datetime.combine(datetime.today(), end_time)

        if end_dt < start_dt:
            end_dt += timedelta(days=1)

        duration = end_dt - start_dt

        subject_data[f"{label} Start"] = start_time
        subject_data[f"{label} End"] = end_time
        subject_data[f"{label} Duration"] = duration

    # Optional: add total duration (compare against None, because a zero-length timedelta is falsy)
    if subject_data["A Duration"] is not None and subject_data["B Duration"] is not None:
        subject_data["Total Duration"] = subject_data["A Duration"] + subject_data["B Duration"]
    else:
        subject_data["Total Duration"] = None

    results.append(subject_data)

# Create results DataFrame
activity_duration_df = pd.DataFrame(results)
activity_duration_df


In [None]:
# Summary statistics of the median heart rate for each subject,
# split by activity status (before / after the activity).
heart_rate_distr = df.groupby(["Subject number","Before (B)  / after (A) activity"])["Median heart rate (bpm)"].describe() 
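
To see what shape `describe()` returns for a two-level grouping, the sketch below runs the same kind of call on a tiny made-up table with shortened, hypothetical column names (`subject`, `activity`, `hr`). It also shows that unstacking the activity level yields the labels in sorted order, 'A' before 'B', which matters whenever the resulting columns are renamed by position.

```python
import pandas as pd

# Made-up values for illustration only
toy = pd.DataFrame({
    "subject": [1, 1, 1, 1, 2, 2],
    "activity": ["B", "B", "A", "A", "B", "A"],
    "hr": [60.0, 62.0, 120.0, 100.0, 70.0, 130.0],
})

# One row per (subject, activity) pair; columns are count, mean, std, min, ...
stats = toy.groupby(["subject", "activity"])["hr"].describe()

# Pivot the activity level into columns: one row per subject
mean_table = stats["mean"].unstack("activity")
column_order = list(mean_table.columns)

print(stats)
print(column_order)
```

Selecting columns by label, for example `mean_table["B"]` for the resting value, avoids depending on that order.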


In [None]:
subject_info = df[["Subject number","Age (year)","Height (cm)","Weight (kg)","Body-mass index"]]

# Drop duplicates so each subject appears only once, then index the table by subject number
unique_subject_info = subject_info.drop_duplicates(subset=["Subject number"]).set_index("Subject number")
unique_subject_info

In [None]:
new_df = unique_subject_info.join(duration_df.set_index("Subject number"), on="Subject number", how="inner")
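mean_series = heart_rate_distr["mean"]
mean_df = mean_series.unstack("Before (B)  / after (A) activity")
# unstack sorts the activity labels, so 'A' (after) precedes 'B' (before).
# Rename by label rather than by position so the two columns cannot be swapped.
mean_df = mean_df.rename(columns=lambda label: str(label).strip().upper())
mean_df = mean_df.rename(columns={"B": "heart_rate_Before", "A": "heart_rate_After"})
mean_df = mean_df.reset_index()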
new_df = new_df.join(mean_df.set_index("Subject number"), on="Subject number", how="left")
new_df

In [115]:
# Keep the index when writing: it holds "Subject number", which index=False would drop
new_df.to_csv("Final_Demographics.csv")

# I have had my fair share of challenges with the code, but I have tried to make it as clean as possible.
# I have also tried to make the code as efficient as possible. I have used functions and loops where necessary to avoid redundancy.
# I have also used pandas and numpy to handle the data efficiently. I have used groupby and apply functions to manipulate the data.