# Exploratory Data Analysis (Assignment # 4) - Team Forever Loop

**Authors:** Santosh Ganesan, Haran Nallasivan

**Date:** 23 September 2024

## Running this notebook

TODO: add instructions for running this notebook

## Introduction

The Welltory COVID-19 and Wearables Open Data Research dataset [1] contains data from wearables worn by participants, some of whom had COVID-19 and some of whom did not. We were interested in whether COVID-19 can be predicted from this biometric data. For this analysis, our hypothesis is that a participant's height, weight, gender, and variance in systolic and diastolic blood pressure are markers for COVID-19 detection.
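To make the "variance in systolic/diastolic blood pressure" part of the hypothesis concrete, the sketch below computes a per-participant sample variance with pandas. The readings and the `bp_variance` name are made up for illustration; only the column names `user_code`, `systolic`, and `diastolic` come from the dataset.

```python
import pandas as pd

# Toy readings shaped like blood_pressure.csv (values are invented).
bp = pd.DataFrame({
    "user_code": ["a", "a", "a", "b", "b"],
    "systolic": [120, 130, 110, 140, 140],
    "diastolic": [80, 85, 75, 90, 92],
})

# Sample variance (ddof=1, the pandas default) of each reading per participant.
bp_variance = (
    bp.groupby("user_code")[["systolic", "diastolic"]]
    .var()
    .add_suffix("_var")
)
print(bp_variance)
```

The result has one row per participant, so it could later be joined to the participants table on `user_code`. Participants with a single reading would get `NaN`, since a sample variance needs at least two values.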

## Data sourcing

TODO: imports and script to pull in data into dataframes/SQLite table as appropriate


In [1]:
# Import statements
import pandas as pd
from pathlib import Path

Path('./data').mkdir(exist_ok=True)

# Helper function for extracting and persisting data
def extract_df(csv_name):
    '''
    Extracts a dataframe from the dataset given a CSV filename. Persists it locally to avoid redundant network calls and unnecessary load on GitHub.

    Parameters:
        csv_name (str) - CSV filename, e.g. "blood_pressure"
    Returns:
        A pandas dataframe
    '''
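    try:
        # Reuse the locally cached copy if one exists
        return pd.read_csv(f'./data/{csv_name}.csv')
    except FileNotFoundError:
        # No cached copy yet: download from GitHub and cache it for later runs
        df = pd.read_csv(f'https://raw.githubusercontent.com/Welltory/hrv-covid19/refs/heads/master/data/{csv_name}.csv')
        df.to_csv(f'./data/{csv_name}.csv')
        return df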

# Loading the data frames
csv_names = ['blood_pressure', 'participants']
df = {}
for csv_name in csv_names:
    df[csv_name] = extract_df(csv_name)
df['blood_pressure'].head(2)

Unnamed: 0,user_code,measurement_datetime,diastolic,systolic,functional_changes_index,circulatory_efficiency,kerdo_vegetation_index,robinson_index
0,01bad5a519,2020-04-29 22:33:33,100,150,,,,
1,01bad5a519,2020-04-30 01:33:33,100,150,,,,




## Preliminary analysis

TODO: blurb on what we'll do in this section

### Data structure

In this section, we'll explore the data structures in this data set captured in the individual CSVs (extracted as Pandas dataframes above) and talk about missing data.

**Blood Pressure:**


In [2]:
df['blood_pressure'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 721 entries, 0 to 720
Data columns (total 9 columns):
 #   Column                    Non-Null Count  Dtype  
---  ------                    --------------  -----  
 0   Unnamed: 0                721 non-null    int64  
 1   user_code                 721 non-null    object 
 2   measurement_datetime      721 non-null    object 
 3   diastolic                 721 non-null    int64  
 4   systolic                  721 non-null    int64  
 5   functional_changes_index  299 non-null    float64
 6   circulatory_efficiency    299 non-null    float64
 7   kerdo_vegetation_index    283 non-null    float64
 8   robinson_index            299 non-null    float64
dtypes: float64(4), int64(3), object(2)
memory usage: 50.8+ KB


The `user_code` column contains strings that uniquely identify each participant in the study. This column has no missing values.

The `measurement_datetime` column contains strings that record the date and time at which the measurement was taken. This column has no missing values.
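Because `measurement_datetime` is stored as strings, time-based operations need a conversion first. A minimal sketch, using the two sample rows shown earlier and assuming every row follows the same `YYYY-MM-DD HH:MM:SS` layout (the format string and the `elapsed` name are assumptions, not part of the dataset):

```python
import pandas as pd

# The two sample rows from the blood pressure preview.
bp = pd.DataFrame({
    "user_code": ["01bad5a519", "01bad5a519"],
    "measurement_datetime": ["2020-04-29 22:33:33", "2020-04-30 01:33:33"],
})

# Convert the string column to datetime64; the format is assumed from the sample rows.
bp["measurement_datetime"] = pd.to_datetime(
    bp["measurement_datetime"], format="%Y-%m-%d %H:%M:%S"
)

# Time elapsed since the previous reading (NaT for the first row).
elapsed = bp["measurement_datetime"].diff()
print(elapsed)
```

If some rows deviate from the assumed format, `pd.to_datetime` raises an error by default; passing `errors="coerce"` would turn those rows into `NaT` instead.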

The `diastolic` and `systolic` columns contain integers that record the participant's diastolic and systolic blood pressure. Neither column has missing values.

The `functional_changes_index` column contains 64-bit floating-point numbers that assess how well the participant's body can adapt to stressors. This is based on the participant's height, weight, and gender. Only 299 of the 721 rows have a value.

The `circulatory_efficiency` column contains 64-bit floating-point numbers that assess the body's readiness to cope with pressure. Only 299 of the 721 rows have a value.
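The non-null counts from `info()` can be turned into an explicit missing-value summary. The sketch below does this on a small invented frame; the `toy` and `missing` names and all values are illustrative only.

```python
import pandas as pd

# Invented frame with the same kind of gaps as the derived index columns.
toy = pd.DataFrame({
    "systolic": [150, 150, 120, 118],
    "functional_changes_index": [None, None, 2.5, 2.7],
    "kerdo_vegetation_index": [None, None, None, -5.0],
})

# Count and percentage of missing values per column.
missing = pd.DataFrame({
    "n_missing": toy.isna().sum(),
    "pct_missing": toy.isna().mean().mul(100).round(1),
})
print(missing)
```

Applied to `df['blood_pressure']`, the same two expressions should agree with the `info()` output above, for example 721 - 299 = 422 missing values for `functional_changes_index`.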





**Participants:**


In [6]:
df['participants'].info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 185 entries, 0 to 184
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype  
---  ------          --------------  -----  
 0   Unnamed: 0      185 non-null    int64  
 1   user_code       185 non-null    object 
 2   gender          185 non-null    object 
 3   age_range       185 non-null    object 
 4   city            173 non-null    object 
 5   country         179 non-null    object 
 6   height          183 non-null    float64
 7   weight          185 non-null    float64
 8   symptoms_onset  147 non-null    object 
dtypes: float64(2), int64(1), object(6)
memory usage: 13.1+ KB




### Descriptive statistics

TODO: Output some descriptive statistics and plots




### Data quality

TODO: preliminary data quality analysis (missing values, duplicates, outliers, variance)
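As one possible starting point for the duplicate and outlier checks listed above, the sketch below flags repeated readings and applies the common 1.5 × IQR rule to `systolic`. The data is invented, and both the duplicate key (`user_code` plus `measurement_datetime`) and the IQR threshold are assumptions rather than choices made in this analysis.

```python
import pandas as pd

# Invented readings: one repeated row and one implausibly high value.
toy = pd.DataFrame({
    "user_code": ["a", "a", "a", "b", "b", "b"],
    "measurement_datetime": [
        "2020-04-29 22:33:33", "2020-04-29 22:33:33", "2020-04-30 08:00:00",
        "2020-05-01 09:00:00", "2020-05-02 09:00:00", "2020-05-03 09:00:00",
    ],
    "systolic": [120, 120, 118, 122, 125, 220],
})

# A reading is treated as a duplicate if the same participant already has
# a row with the same timestamp (assumed key).
dup_mask = toy.duplicated(subset=["user_code", "measurement_datetime"], keep="first")

# IQR rule: flag values more than 1.5 * IQR outside the quartiles.
q1, q3 = toy["systolic"].quantile([0.25, 0.75])
iqr = q3 - q1
outlier_mask = (toy["systolic"] < q1 - 1.5 * iqr) | (toy["systolic"] > q3 + 1.5 * iqr)

print(toy[dup_mask])
print(toy[outlier_mask])
```

Keeping the two checks as boolean masks makes it easy to inspect the flagged rows before deciding whether to drop them in the cleaning section.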




### Variable relationships

TODO: correlation analysis between features of the data 
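A correlation matrix over the numeric columns is one common way to approach this step. The following is a sketch on invented values, assuming Pearson correlation is an acceptable first look; the `toy` and `corr` names are illustrative.

```python
import pandas as pd

# Invented values; diastolic is an exact linear function of systolic here.
toy = pd.DataFrame({
    "user_code": ["a", "b", "c", "d"],
    "systolic": [110, 120, 130, 140],
    "diastolic": [70, 75, 80, 85],
    "weight": [80.0, 60.0, 75.0, 65.0],
})

# Keep only numeric columns, then compute pairwise Pearson correlations.
corr = toy.select_dtypes("number").corr(method="pearson")
print(corr.round(2))
```

Selecting numeric columns first avoids errors from string columns such as `user_code`. Pearson correlation only captures linear relationships, so `method="spearman"` may be worth comparing if monotonic but non-linear relationships are expected.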




## Data cleaning

TODO: intro blurb on what we'll do in this section




### Handling missing values

TODO: remove missing values identified in previous data quality section




### Handling duplicates

TODO: remove duplicates identified in previous data quality section




### Handling outliers

TODO: remove outliers identified in previous data quality section




## Post-cleaning analysis

TODO: intro blurb on what we'll cover in this section

### A new feature

TODO: devise a new feature and analyze the feature




### Dimensionality reduction

TODO: employ a dimensionality reduction technique




### Data quality assessment

TODO: intro blurb on what we'll cover in this section




#### Data profiling

TODO: data profiling analysis




#### Data completeness

TODO: data completeness analysis




#### Data accuracy

TODO: data accuracy analysis


#### Data consistency

TODO: data consistency analysis



#### Data integrity

TODO: data integrity analysis




#### Data lineage

TODO: data lineage analysis




#### Data provenance

TODO: data provenance analysis




## Reference

[1] Pravdin, Pavel (2022). *Welltory COVID-19 and Wearables Open Data Research [dataset].* 22 July 2022. Welltory Inc. CC0-1.0 License. https://github.com/Welltory/hrv-covid19