# **Data Collection**

## Objectives

* Download data from Kaggle.com and perform an initial EDA.

## Inputs

* unclean_smartwatch_health_data.csv

## Outputs

* ydata-profiling EDA

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 

---

# Change working directory

* The notebooks are assumed to live in a subfolder, so when you run a notebook in the editor you need to change the working directory.

We need to change the working directory from the notebook's folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir
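
The same steps can be wrapped in a small helper. This is only a sketch: the helper name `move_to_parent` and the folder name `jupyter_notebooks` are hypothetical, and the demonstration runs in a temporary folder so it does not touch the project directory.

```python
import os
import tempfile

def move_to_parent(path=None):
    """Change the working directory to the parent of `path` (default: the cwd)."""
    start = os.path.abspath(path or os.getcwd())
    parent = os.path.dirname(start)
    os.chdir(parent)
    return parent

# Demonstration in a throwaway folder (folder name is hypothetical)
original = os.getcwd()
with tempfile.TemporaryDirectory() as root:
    notebooks = os.path.join(root, "jupyter_notebooks")
    os.mkdir(notebooks)
    new_dir = move_to_parent(notebooks)
    is_parent = os.path.samefile(new_dir, root)
    os.chdir(original)  # restore the cwd before the temp folder is deleted

print(is_parent)
```

Running the cell twice in a real notebook would move up two levels, which is why the notebook confirms the new directory right after the change.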

# Section 2 EDA

Load the raw CSV into a DataFrame, preview the first rows, and create the versioned output folder.

In [None]:
import pandas as pd
data = pd.read_csv(DestinationFolder + "/unclean_smartwatch_health_data.csv")
df = pd.DataFrame(data)
df.head()

# Change the version variable to store outputs in a different folder
version = "v1"

OutputFolder = f"outputs/{version}/"
# Create outputs/<version>/ if it does not exist yet (no error if it already does)
os.makedirs(OutputFolder, exist_ok=True)

# Section 2 Normality, Skewness and Kurtosis Improvement


Before imputing the rest of the data, I am going to try to bring the numeric variables closer to a normal distribution, because the shape of the distribution decides whether the mean or the median should be used to impute a numeric column. Now that the data types are fixed, all numeric columns can be tested together.

I will draw Q-Q plots before and after a Box-Cox transformation to see how normal the 3 numeric variables we can test are, and then possibly improve skewness, kurtosis and normality with Box-Cox, with the aim of imputing as many variables as possible with the mean.
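
A minimal sketch of how skewness could drive the choice between mean and median imputation. The helper name `choose_imputation_strategy`, the 0.5 cut-off (a common rule of thumb, not a value taken from this project) and the synthetic columns are all assumptions made for illustration.

```python
import numpy as np
import pandas as pd

def choose_imputation_strategy(series, skew_threshold=0.5):
    """Suggest 'mean' for roughly symmetric data, 'median' for skewed data."""
    skew = series.dropna().skew()
    return "mean" if abs(skew) <= skew_threshold else "median"

# Synthetic data: one symmetric column and one strongly right-skewed column
rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "symmetric": rng.normal(70, 10, 500),
    "right_skewed": rng.lognormal(8, 1, 500),
})
demo.iloc[::25] = np.nan  # inject missing values in every 25th row

strategies = {col: choose_imputation_strategy(demo[col]) for col in demo}
print(strategies)
```

The suggested strategy could then be passed to `MeanMedianImputer(imputation_method=...)` from feature_engine, one group of columns per strategy.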

In [None]:
# check min and max for numeric variables to see if boxcox is suitable
for col in df_processed:
    if df_processed[col].dtype == "float64" or df_processed[col].dtype == "int64":
        print(f"{col} min: {df_processed[col].min()}, max: {df_processed[col].max()}")

Heart Rate (BPM) min: 40.0, max: 296.5939695131042
Blood Oxygen Level (%) min: 90.79120814564097, max: 100.0
Step Count min: 0.9101380609604088, max: 62486.690753464914
Sleep Duration (hours) min: -0.1944527906201543, max: 12.140232872862926
Stress Level min: 1, max: 11
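
Box-Cox is only defined for strictly positive values, so any column whose minimum is zero or negative cannot be transformed as it is. The sketch below flags such columns; the helper name `boxcox_unsuitable_columns` is hypothetical, and the small DataFrame holds only the rounded min and max values printed above, not the real dataset.

```python
import pandas as pd

def boxcox_unsuitable_columns(df):
    """Return the numeric columns that contain zero or negative values."""
    numeric = df.select_dtypes(include="number")
    return [col for col in numeric.columns if numeric[col].min() <= 0]

# Rounded min and max per column, copied from the output above
extremes = pd.DataFrame({
    "Heart Rate (BPM)": [40.0, 296.59],
    "Blood Oxygen Level (%)": [90.79, 100.0],
    "Step Count": [0.91, 62486.69],
    "Sleep Duration (hours)": [-0.19, 12.14],
    "Stress Level": [1, 11],
})

print(boxcox_unsuitable_columns(extremes))  # prints ['Sleep Duration (hours)']
```

On these values only Sleep Duration (hours) fails the check, because of its negative minimum.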


In [None]:
from sklearn.pipeline import Pipeline
from feature_engine import transformation as vt
from feature_engine.imputation import MeanMedianImputer
import seaborn as sns
import pingouin as pg
import matplotlib.pyplot as plt

df_numeric = df_processed.select_dtypes(include=['float64','int64'])
df_numeric.head()

def calculate_skew_kurtosis(df,col, moment):
  print(f"{moment}  | skewness: {df[col].skew().round(2)} | kurtosis: {df[col].kurtosis().round(2)}")


# Build a pipeline with a single step: vt.BoxCoxTransformer().
# Then .fit_transform() the pipeline and assign the result to df_transformed.

pipeline = Pipeline([
      ( 'boxcox', vt.BoxCoxTransformer() )
  ])

df_transformed = pipeline.fit_transform(df_numeric)
print(df_transformed.head())

def compare_distributions_before_and_after_applying_transformer(df, df_transformed, method):

  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,8))

    sns.histplot(data=df, x=col, kde=True, ax=axes[0,0])
    axes[0,0].set_title(f'Before {method}')
    pg.qqplot(df[col], dist='norm',ax=axes[0,1])
    
    sns.histplot(data=df_transformed, x=col, kde=True, ax=axes[1,0])
    axes[1,0].set_title(f'After {method}')
    pg.qqplot(df_transformed[col], dist='norm',ax=axes[1,1])
    
    plt.tight_layout()

    # Save the plot before showing it, in a subfolder of the output folder
    # reserved for the normality and skewness improvement plots
    plot_names = method + col + ".png"
    NormalitySkewness = os.path.join(OutputFolder, "norm_skew_improvement")
    os.makedirs(NormalitySkewness, exist_ok=True)
    plot_dir = os.path.join(NormalitySkewness, plot_names)
    fig.savefig(plot_dir)
    plt.show()

    calculate_skew_kurtosis(df,col, moment='before transformation')
    calculate_skew_kurtosis(df_transformed,col, moment='after transformation')
    print("\n")
    
compare_distributions_before_and_after_applying_transformer(df_numeric, df_transformed, method='BoxCoxTransformer')

ValueError: Some of the variables in the dataset contain NaN. Check and remove those before using this transformer.
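
The error comes from the missing values that have not been imputed yet. One possible workaround, sketched here, is to fit Box-Cox per column on the observed values only and leave the NaN positions untouched. This swaps in `scipy.stats.boxcox` in place of the feature_engine transformer; the helper name `boxcox_ignoring_nan` and the synthetic series are assumptions for illustration.

```python
import numpy as np
import pandas as pd
from scipy import stats

def boxcox_ignoring_nan(series):
    """Box-Cox transform the observed values of a series, keeping NaN in place.

    Returns (transformed_series, fitted_lambda). If the column is empty or has
    non-positive values, it is returned unchanged with lambda set to None.
    """
    observed = series.dropna()
    if observed.empty or (observed <= 0).any():
        return series.astype("float64"), None
    values, lmbda = stats.boxcox(observed.to_numpy(dtype="float64"))
    result = pd.Series(np.nan, index=series.index, dtype="float64", name=series.name)
    result.loc[observed.index] = values
    return result, lmbda

# Synthetic right-skewed data with missing values
rng = np.random.default_rng(42)
demo = pd.Series(rng.lognormal(mean=8, sigma=1, size=300), name="synthetic_steps")
demo.iloc[::20] = np.nan

transformed, lmbda = boxcox_ignoring_nan(demo)
print(f"lambda: {lmbda:.3f}")
print(f"skew before: {demo.skew():.2f} | skew after: {transformed.skew():.2f}")
```

Dropping the rows with NaN before calling `pipeline.fit_transform()` would also avoid the error, at the cost of comparing the distributions on fewer rows.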

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
