# Team Assignment #4 - Trackstars
## Data Exploration 

## Instructions
For this assignment, you will use the Welltory COVID-19 and Wearables Open Data Research dataset. This dataset is messy: exploring it is meant to feel less like working with the cleaned datasets you may see in classrooms or on Kaggle and more like real-world data exploration. The data is available [here](https://github.com/Welltory/hrv-covid19/tree/master).
Note: you will first need to organize the data to enable your analysis.

For this assignment, you will provide a Jupyter notebook report of your exploratory data analysis.

### In your Jupyter notebook report, complete the following:
* Document data context and data sampling in markdown
* Explore and interpret data structure, descriptive statistics, data quality, and variable relationships
* Explore data visually with appropriate visualizations
* Discuss and implement strategies for handling missing values, removing duplicates, and handling outliers
* Perform data transformation as appropriate
* Create at least one new feature and document your approach
* Apply a dimensionality reduction method to the data and discuss the results
* Include a discussion around data quality assessment, including data profiling, data completeness, data accuracy, data consistency, data integrity, and data lineage and provenance

## Submission
To submit your code, open a PR into the `data-eda-ta4` branch and add me and the TA as reviewers. Make sure the name of your notebook follows best practices and includes your team name.

## Rubric

### Report (35 points)
* Report includes a title, authors, dates, and relevant information for running it at the top 
* Report includes a reference to the original dataset
* Data context and sampling are documented in markdown
* Code and documented interpretation of data structure
* Code and documented interpretation of descriptive statistics
* Code and documented interpretation of data quality
* Code and documented interpretation of variable relationships
* Visualizations used are complete and appropriate and interpretation(s) are documented in the notebook
* Visualizations follow best practices (titles, axis labels, etc.)
* Strategies for handling missing values, outliers, and removing duplicates are implemented and/or discussed
* Appropriate data transformation is performed
* One new feature is engineered and documented
* A dimensionality reduction method is performed and interpreted
* A discussion on data quality assessment is included and incorporates the following components: data profiling, data completeness, data accuracy, data consistency, data integrity, and data lineage and provenance

### Code (10 points)
* Code is in a clearly named notebook
* Code is clean and well organized
* Code is documented with docstrings and comments 
* Branching and PRs were done appropriately
* Requirements are included in the text of the PR and are correct and versioned
* The code runs as documented




In [2]:
import pandas as pd 
import matplotlib as mpl 
import seaborn as sns

# Retrieve data

In [3]:
import os

In [4]:
!git clone https://github.com/Welltory/hrv-covid19.git

Cloning into 'hrv-covid19'...
remote: Enumerating objects: 198, done.
remote: Counting objects: 100% (12/12), done.
remote: Compressing objects: 100% (11/11), done.
remote: Total 198 (delta 5), reused 4 (delta 1), pack-reused 186 (from 1)
Receiving objects: 100% (198/198), 4.57 MiB | 25.31 MiB/s, done.
Resolving deltas: 100% (128/128), done.


In [5]:
!ls hrv-covid19/data

blood_pressure.csv     participants.csv       surveys.csv
heart_rate.csv         scales_description.csv wearables.csv
hrv_measurements.csv   sleep.csv              weather.csv


In [8]:
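hrv_data_dir = "hrv-covid19/data"

# Build one record per CSV file: a short name, the file name, and its full path.
dataset = []
files = os.listdir(path=hrv_data_dir)
for file in files:
    if not file.endswith(".csv"):
        continue  # skip anything that is not a CSV data file
    name = os.path.splitext(file)[0]
    dataset.append({"name": name, "file": file, "path": os.path.join(hrv_data_dir, file)})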

In [9]:
# Load each CSV into a DataFrame, keyed by its short name (e.g. dfs["sleep"]).
dfs = {}
for d in dataset:
    print(f"Loading {d['name']} ({d['file']}) into dataframe... ")
    dfs[d['name']] = pd.read_csv(d['path'])

Loading scales_description (scales_description.csv) into dataframe... 
Loading participants (participants.csv) into dataframe... 
Loading wearables (wearables.csv) into dataframe... 
Loading blood_pressure (blood_pressure.csv) into dataframe... 
Loading surveys (surveys.csv) into dataframe... 
Loading heart_rate (heart_rate.csv) into dataframe... 
Loading weather (weather.csv) into dataframe... 
Loading hrv_measurements (hrv_measurements.csv) into dataframe... 
Loading sleep (sleep.csv) into dataframe... 


# Analyze
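
With every CSV loaded into the `dfs` dictionary, a quick structural overview is a reasonable first step. The sketch below is one possible approach, not the team's actual method: `summarize_frames` is a hypothetical helper name, and the `demo` frames are made-up stand-ins so the example runs without the cloned repository.

```python
import pandas as pd


def summarize_frames(frames):
    """Return one row per DataFrame: rows, columns, missing cells, duplicate rows."""
    records = []
    for name, frame in frames.items():
        records.append({
            "name": name,
            "rows": frame.shape[0],
            "columns": frame.shape[1],
            # Total count of NaN/None cells across all columns.
            "missing_cells": int(frame.isna().sum().sum()),
            # Rows that exactly repeat an earlier row.
            "duplicate_rows": int(frame.duplicated().sum()),
        })
    return pd.DataFrame(records).set_index("name")


# Made-up frames standing in for the real `dfs` dictionary (assumption).
demo = {
    "a": pd.DataFrame({"x": [1, 1, None], "y": [2, 2, 3]}),
    "b": pd.DataFrame({"z": ["u", "v"]}),
}
summary = summarize_frames(demo)
print(summary)
```

In the notebook itself, calling `summarize_frames(dfs)` would give the same overview for the nine loaded tables, which can help decide where missing values and duplicates need attention first.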

In [None]:
df = pd.read_csv(hrv_data_dir + 