## Data context and data sampling for the whole dataset



### Data Context

This repository contains heart rate variability (HRV) and COVID-19 related data, aiming to explore the potential links between physiological stress markers and COVID-19 incidence. The data is collected from users of the Welltory app, which tracks various physiological parameters through smartphone sensors and connected devices.

- **Source:** Welltory app, a popular health tracking application that provides insights into HRV and stress levels.
- **Purpose:** The primary purpose is to investigate the correlation between physiological stress, as indicated by HRV metrics, and COVID-19 infection rates. The data provides a unique perspective on how pandemic-related stress might manifest physiologically in users.
- **Timeframe:** The data was collected from March 2020 to September 2020, capturing the early months of the COVID-19 pandemic.
- **Population:** The dataset includes HRV and health data from thousands of users across multiple countries. The user base predominantly consists of individuals interested in tracking their health and fitness metrics.
- **Key Variables:**
  - **user_code:** An anonymized identifier for each user.
  - **measurement_datetime:** The exact date and time when the measurement was recorded.
  - **hrv_parameters:** Various HRV metrics such as RMSSD, pNN50, LF/HF ratio, and more.
  - **covid_status:** Self-reported COVID-19 infection status.
- **Limitations:** The dataset relies on self-reported COVID-19 status, which may be prone to reporting bias. Additionally, HRV data is influenced by multiple factors such as physical activity, sleep, and overall health, making it challenging to isolate the impact of COVID-19 alone.



### Data Sampling

The data in this repository was collected from Welltory app users who opted in to share their anonymized data for research purposes. It includes all users who reported their COVID-19 status during the specified timeframe, along with their HRV measurements.

- **Sampling Method:** Convenience sampling of Welltory app users who volunteered to share their data.
- **Sample Size:** Data includes thousands of HRV measurements from thousands of unique users.
- **Representativeness:** The sample primarily represents individuals interested in health and fitness tracking and may not be representative of the general population.
- **Sampling Bias:** There is a potential for sampling bias as the data includes only those users who actively engage with the app and report their COVID-19 status. Additionally, there may be underreporting or misreporting of COVID-19 infection due to reliance on self-reports.



In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.preprocessing import OrdinalEncoder,RobustScaler
from sklearn.decomposition import PCA



In [None]:

df = pd.read_csv("/content/blood_pressure.csv")
df.head()

In [None]:
# Check the data structure
print("Data Structure:")
print(df.info())

In [None]:
# Descriptive statistics for numerical columns
df.describe()


## Data Quality Assessment



In [None]:
# Descriptive statistics for categorical columns
for column in df.select_dtypes(include=['object']).columns:
    print(f"\n{column} Value Counts:")
    print(df[column].value_counts())

**Comment:** My first thought before exploring this dataset was to drop `user_code`, since it appeared to have no intrinsic value. Looking at the value counts, however, there appears to be a pattern in the data, so we keep it for now.

In [None]:
# Checking for missing values
df.isnull().sum()


In [None]:
# Calculate the percentage of missing values in each column
# (isnull().mean() returns a fraction between 0 and 1, so scale it to percent)
df.isnull().mean() * 100

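The two checks above (counts and percentages) can be read side by side in a single table. The following is a minimal sketch, assuming a hypothetical helper named `missing_summary` and a made-up toy frame; neither is part of the original notebook.

```python
import numpy as np
import pandas as pd


def missing_summary(frame: pd.DataFrame) -> pd.DataFrame:
    """Return per-column missing counts and percentages, worst column first."""
    summary = pd.DataFrame({
        "missing_count": frame.isnull().sum(),
        "missing_pct": frame.isnull().mean() * 100,
    })
    return summary.sort_values("missing_pct", ascending=False)


# Toy frame with made-up values, only to show the shape of the output
toy = pd.DataFrame({
    "systolic": [120, 130, 125, 118],
    "robinson_index": [np.nan, 80.0, np.nan, 75.0],
})
print(missing_summary(toy))
```

On the real data, `missing_summary(df)` would give the same numbers as the two cells above in one view.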

In [None]:
# Checking for duplicates
df.duplicated().sum()

In [None]:
# Checking for data type issues
df.dtypes

In [None]:
# Show unique values in a column (user_code) as an example
unique_values = df['user_code'].unique()
print("\nUnique Values in 'user_code' Column:\n", unique_values)

## Variable relationships

* Before we start visualizing for exploratory data analysis, let's add a column that encodes the `user_code` data and split the `measurement_datetime` column into two separate columns.


In [None]:
# Converting measurement_datetime to datetime object
df['measurement_datetime'] = pd.to_datetime(df['measurement_datetime'])

# Extracting date and time into separate columns
df['date'] = df['measurement_datetime'].dt.date
df['time'] = df['measurement_datetime'].dt.time
df['date'] = pd.to_datetime(df['date'])
df['time'] = pd.to_timedelta(df['time'].astype(str))

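The cell above goes through Python `date`/`time` objects and strings before converting back. A shorter route that should give the same dtypes is sketched below, assuming timezone-naive timestamps; the toy timestamps are made up for illustration.

```python
import pandas as pd

# Toy timestamps (hypothetical, for illustration only)
toy = pd.DataFrame({
    "measurement_datetime": pd.to_datetime(
        ["2020-04-29 22:33:33", "2020-04-30 01:33:33"]
    )
})

# Midnight of each measurement day, kept as datetime64
toy["date"] = toy["measurement_datetime"].dt.normalize()
# Elapsed time since midnight, kept as timedelta64
toy["time"] = toy["measurement_datetime"] - toy["date"]

print(toy.dtypes)
```

Keeping `time` as a timedelta is what later makes it show up among the numerical columns.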

In [None]:
# Ordinal encoding for 'user_code'
encoder = OrdinalEncoder()
df['user_code_ordinal'] = encoder.fit_transform(df[['user_code']])

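With its default settings, `OrdinalEncoder` sorts the distinct values it finds and numbers them from 0, so the resulting codes are labels rather than measurements. The sketch below reproduces that mapping with plain pandas on made-up user codes (an illustration, not the notebook's method); note that `OrdinalEncoder` returns floats, while `pd.factorize` returns integers.

```python
import pandas as pd

# Hypothetical user codes, only for illustration
toy = pd.DataFrame({"user_code": ["c9a1", "01bad5a519", "c9a1", "a7f2"]})

# sort=True numbers the categories in sorted order, starting at 0
codes, categories = pd.factorize(toy["user_code"], sort=True)
toy["user_code_ordinal"] = codes

# Show which code each category received
print(dict(zip(categories, range(len(categories)))))
print(toy)
```

Because the order is only alphabetical, any correlation between `user_code_ordinal` and a physiological metric should be read with caution.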
In [None]:
# Dropping `measurement_datetime`

df = df.drop(['measurement_datetime'], axis =1)

# We're not dropping user_code, since we have a couple of visualizations to perform with that

In [None]:
# printing it out
df.info()

In [None]:
viz_df = df.drop(["user_code", "date","time",], axis =1)

# Correlation analysis for df
viz_df.corr()

In [None]:
# Correlation matrix for numerical variables heatmap
plt.figure(figsize=(10, 8))
sns.heatmap(viz_df.corr(), annot=True, cmap='coolwarm')
plt.title("Correlation Matrix")
plt.show()

As the heatmap shows, there is a very high correlation between:
* `systolic` and `diastolic`
* `functional_changes_index` and `diastolic`
* `kerdo_vegetation_index` and `diastolic`
* `functional_changes_index` and `systolic`
* `circulatory_efficiency` and `systolic`
* `robinson_index` and `functional_changes_index`
* `robinson_index` and `circulatory_efficiency`

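Reading strong pairs off a heatmap by eye is error-prone, so they can also be listed programmatically. This is a minimal sketch, assuming a hypothetical helper `strong_pairs` and an arbitrary cutoff of 0.8; the toy data is synthetic and only demonstrates the mechanics.

```python
import numpy as np
import pandas as pd


def strong_pairs(frame: pd.DataFrame, threshold: float = 0.8) -> pd.Series:
    """List column pairs whose absolute Pearson correlation meets the threshold."""
    corr = frame.corr()
    # Keep the strict upper triangle: each pair once, diagonal excluded
    mask = np.triu(np.ones(corr.shape, dtype=bool), k=1)
    pairs = corr.where(mask).stack().dropna()
    strong = pairs[pairs.abs() >= threshold]
    return strong.sort_values(key=np.abs, ascending=False)


# Synthetic data: b is nearly collinear with a, c is independent noise
rng = np.random.default_rng(0)
base = rng.normal(size=200)
toy = pd.DataFrame({
    "a": base,
    "b": base * 2 + rng.normal(scale=0.1, size=200),
    "c": rng.normal(size=200),
})
result = strong_pairs(toy)
print(result)
```

On the notebook's data the call would be `strong_pairs(viz_df)`; the threshold is a judgment call, not a value taken from the source.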

## Visualizations for outlier detection

In [None]:
# Let's plot the scatter plot matrix to get an idea about the outliers.
# scatter_matrix creates its own figure, so no separate plt.figure() call is needed.

pd.plotting.scatter_matrix(viz_df, diagonal='kde', figsize=(15, 15))
plt.suptitle('Scatter Plot Matrix')
plt.show()

In [None]:
# 1. Using Boxplots for Numerical Columns

numerical_columns = df.select_dtypes(include=['number']).columns
plt.figure(figsize=(15, 10))
for i, col in enumerate(numerical_columns, 1):
    plt.subplot(3, 3, i)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
plt.tight_layout()
plt.show()

The boxplots show that the numerical columns contain outliers, especially `circulatory_efficiency` and `functional_changes_index`.

In [None]:
# 2. Calculating IQR to Detect Outliers
for col in numerical_columns:
    # Calculating Q1 (25th percentile) and Q3 (75th percentile)
    Q1 = df[col].quantile(0.25)
    Q3 = df[col].quantile(0.75)
    IQR = Q3 - Q1

    # Defining outlier boundaries
    lower_bound = Q1 - 1.5 * IQR
    upper_bound = Q3 + 1.5 * IQR

    # Identifying outliers
    outliers = df[(df[col] < lower_bound) | (df[col] > upper_bound)]
    print(f"Outliers in {col}:\n", outliers[[col]])

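Printing every outlying row can get long. A compact alternative is to count the outliers per column with the same 1.5 × IQR rule. The sketch below assumes a hypothetical helper `iqr_outlier_counts` and uses made-up readings.

```python
import pandas as pd


def iqr_outlier_counts(frame: pd.DataFrame, k: float = 1.5) -> pd.Series:
    """Count values outside [Q1 - k*IQR, Q3 + k*IQR] for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    # Comparisons align the per-column bounds with the matching columns
    outside = (numeric < q1 - k * iqr) | (numeric > q3 + k * iqr)
    return outside.sum()


# Made-up readings: one implausibly high systolic value
toy = pd.DataFrame({
    "systolic": [118, 120, 122, 125, 121, 210],
    "diastolic": [78, 80, 79, 82, 81, 80],
})
counts = iqr_outlier_counts(toy)
print(counts)
```

Here only the `systolic` value of 210 falls outside its bounds, so the counts are 1 and 0.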
In [None]:
# 3. Frequency Distribution for Categorical Columns
categorical_columns = ['user_code']

for col in categorical_columns:
    value_counts = df[col].value_counts()
    plt.figure(figsize=(10, 4))
    sns.barplot(x=value_counts.index, y=value_counts.values, palette='viridis')
    plt.title(f'Frequency Distribution of {col}')
    plt.xticks(rotation=90)
    plt.show()

In [None]:
# 4. Date and Time Columns
print(f"Date Range: {df['date'].min()} to {df['date'].max()}")

# Checking for unusual patterns in time, like all values being around midnight or similar patterns
print(f"Time Range: {df['time'].min()} to {df['time'].max()}")

## Handling Outliers



While I believe it would be easier to handle the numerical outliers by removing them outright, I want to try to minimize their impact instead, using a scaling method called RobustScaler.

Before scaling, we need to deal with the Inf and NaN values.


In [None]:

# Dropping the 'time' column temporarily: it is a timedelta, which gets grouped with the numerical columns by select_dtypes
df_temp = df.drop(columns=['time'], errors='ignore')

# Selecting numerical columns
numerical_columns = df_temp.select_dtypes(include=[np.number]).columns

# Replacing Inf and -Inf with NaN
df_temp.replace([np.inf, -np.inf], np.nan, inplace=True)

# Filling missing values with the median of each column
df_temp[numerical_columns] = df_temp[numerical_columns].fillna(df_temp[numerical_columns].median())

# Applying RobustScaler
scaler = RobustScaler()
df_temp[numerical_columns] = scaler.fit_transform(df_temp[numerical_columns])


df_scaled = pd.concat([df_temp, df['time']], axis=1)

df_scaled.head()

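To make the scaling step concrete: with its default settings, `RobustScaler` subtracts each column's median and divides by its interquartile range (25th to 75th percentile). The sketch below does the same by hand for a single made-up column; `robust_scale` is a hypothetical helper, not part of scikit-learn.

```python
import pandas as pd


def robust_scale(column: pd.Series) -> pd.Series:
    """Center on the median and divide by the interquartile range."""
    median = column.median()
    iqr = column.quantile(0.75) - column.quantile(0.25)
    return (column - median) / iqr


# Made-up values with one extreme outlier
toy = pd.Series([1.0, 2.0, 3.0, 4.0, 100.0], name="toy_metric")
scaled = robust_scale(toy)
print(scaled.tolist())  # [-1.0, -0.5, 0.0, 0.5, 48.5]
```

The outlier is still far from the rest after scaling. What the method buys us is that the median and IQR are barely affected by it, so the bulk of the data ends up on a sensible scale.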
## Feature Engineering

The new feature we'll be creating is called pulse pressure, which is simply the `systolic` pressure minus the `diastolic` pressure.

In [None]:
# Calculating Pulse Pressure
df['Pulse_Pressure'] = df['systolic'] - df['diastolic']

# Check the result
df.head()

## Visualizations

In [None]:
# Systolic vs diastolic (scaled values), plotted against the row index

ax = df_scaled[['diastolic', 'systolic']].plot(kind='line', figsize=(8, 4), title='systolic vs diastolic')
ax.spines[['top', 'right']].set_visible(False)
ax.legend()

In [None]:
# Systolic time series (the diastolic series is plotted in the next cell)
def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  xs = series['date']
  ys = series['systolic']

  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = df_scaled.sort_values('date', ascending=True)
_plot_series(df_sorted, '')
sns.despine(fig=fig, ax=ax)
plt.xlabel('date')
_ = plt.ylabel('systolic')

In [None]:

def _plot_series(series, series_name, series_index=0):
  palette = list(sns.palettes.mpl_palette('Dark2'))
  xs = series['date']
  ys = series['diastolic']

  plt.plot(xs, ys, label=series_name, color=palette[series_index % len(palette)])

fig, ax = plt.subplots(figsize=(10, 5.2), layout='constrained')
df_sorted = df_scaled.sort_values('date', ascending=True)
_plot_series(df_sorted, '')
sns.despine(fig=fig, ax=ax)
plt.xlabel('date')
_ = plt.ylabel('diastolic')

In [None]:
# kerdo_vegetation_index vs circulatory_efficiency
df_scaled.plot(kind='scatter', x='circulatory_efficiency', y='kerdo_vegetation_index', s=32, alpha=.8)
plt.gca().spines[['top', 'right',]].set_visible(False)

In [None]:
df_scaled.head()

In [None]:
df_scaled.isnull().sum()

## Dimensionality Reduction Method: Principal Component Analysis (PCA)

In [None]:
# Drop the remaining non-numeric columns from df for pca
df.drop(['date', 'time', 'user_code'], axis=1, inplace=True)

In [None]:


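# Replace Inf values with NaN, then drop rows with missing values
# (PCA accepts neither NaN nor infinite values)
df.replace([np.inf, -np.inf], np.nan, inplace=True)
df.dropna(inplace=True)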

# Applying PCA w/ 2 components for graphing
pca = PCA(n_components=2)
principal_components = pca.fit_transform(df)

# Explained variance ratio
print("EVR:", pca.explained_variance_ratio_)

# Plotting the principal components
plt.figure(figsize=(8, 6))
plt.scatter(principal_components[:, 0], principal_components[:, 1], alpha=0.5)
plt.xlabel('Principal Component 1')
plt.ylabel('Principal Component 2')
plt.title('PCA on Processed Blood Pressure Data')
plt.show()

Almost all of the variance in the data is captured by the first principal component, as we can see from the very large explained variance ratio of PC1 compared to PC2. We can also see this visually in the scatter plot: there is far more spread, and particularly extreme outliers, along PC1 than along PC2.

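One caveat when reading these ratios: PCA is sensitive to the units of each column, and the cell above fits it on `df`, which was not scaled (the scaled copy is `df_scaled`). Part of PC1's dominance may therefore reflect column scale rather than structure. The sketch below illustrates the effect on synthetic data, computing the explained variance ratio with a plain NumPy SVD; the helper name and the data are assumptions made for illustration.

```python
import numpy as np


def explained_variance_ratio(X: np.ndarray) -> np.ndarray:
    """Share of total variance along each principal axis, via SVD of centered data."""
    centered = X - X.mean(axis=0)
    singular_values = np.linalg.svd(centered, compute_uv=False)
    variances = singular_values ** 2
    return variances / variances.sum()


# Two independent synthetic columns that differ only in scale
rng = np.random.default_rng(0)
X = np.column_stack([
    rng.normal(scale=1000.0, size=500),
    rng.normal(scale=1.0, size=500),
])

raw_ratio = explained_variance_ratio(X)
standardized = (X - X.mean(axis=0)) / X.std(axis=0)
std_ratio = explained_variance_ratio(standardized)

print("unscaled:    ", raw_ratio)
print("standardized:", std_ratio)
```

On unscaled input the large-scale column takes nearly all of the variance, while after standardization the two components are close to balanced. Re-running the notebook's PCA on scaled columns would show how much of the PC1 share survives.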

In [None]:
loadings = pd.DataFrame(pca.components_.T, columns=['PC1', 'PC2'], index=df.columns)
loadings

The loadings table above shows which of the remaining variables have the most and least influence on PC1 and PC2.

## Data quality assessment report

Overall, the data is fairly incomplete, with many null values across many of the metrics. However, the most important metric, blood pressure, is 100% complete. The data is clean and well formatted, and we found no outside reason to believe it is not high quality. The format of the data appears to be consistent and, given that it is part of a larger and often-used open research dataset, it can reasonably be trusted. Because the data is anonymized, we do not know specifically how or where it was collected, so its accuracy cannot be fully verified.