# **Data Collection**

## Objectives

* Download data from Kaggle.com and perform an initial EDA.

## Inputs

* unclean_smartwatch_health_data.csv

## Outputs

* ydata-profiling EDA

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 

---

# Change working directory

* We are assuming you will store the notebooks in a subfolder. When running the notebook in the editor, you will therefore need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [1]:
import os
current_dir = os.getcwd()
current_dir

'/workspaces/Predictive_Analytics_Project/jupyter_notebooks'

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [2]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

You set a new current directory


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'/workspaces/Predictive_Analytics_Project'
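
Running the directory change cell twice moves up two levels. A minimal sketch of a guarded version, assuming the notebooks live in a folder called `jupyter_notebooks` as in the path shown above (the helper name `move_to_project_root` is hypothetical):

```python
import os


def move_to_project_root(notebook_folder="jupyter_notebooks"):
    """Move to the parent directory only when the current folder is the notebooks folder."""
    cwd = os.getcwd()
    if os.path.basename(cwd) == notebook_folder:
        os.chdir(os.path.dirname(cwd))
    return os.getcwd()


# Safe to call repeatedly: a second call leaves the directory unchanged
print(move_to_project_root())
```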

Include data path

In [4]:
DataUntouched = "inputs/smartwatch_health_data_untouched"

In [5]:
import pandas as pd
data = pd.read_csv(DataUntouched + "/unclean_smartwatch_health_data.csv")
df = pd.DataFrame(data)
print(df.head())

# Change version variable to store outputs in different folder
version = "v1"

OutputFolder = f"outputs/{version}/"
if "outputs" in os.listdir(current_dir):
    if version not in os.listdir(current_dir + "/outputs"):
        os.mkdir(OutputFolder)
else:
    os.makedirs(OutputFolder)

   User ID  Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  \
0   4174.0         58.939776               98.809650   5450.390578   
1      NaN               NaN               98.532195    727.601610   
2   1860.0        247.803052               97.052954   2826.521994   
3   2294.0         40.000000               96.894213  13797.338044   
4   2130.0         61.950165               98.583797  15679.067648   

  Sleep Duration (hours) Activity Level Stress Level  
0      7.167235622316564  Highly Active            1  
1      6.538239375570314  Highly_Active            5  
2                  ERROR  Highly Active            5  
3      7.367789630207228          Actve            3  
4                    NaN  Highly_Active            6  
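
The nested folder checks in the cell above could be collapsed into a single call. A minimal sketch, assuming the same `outputs/<version>` layout (the helper name `ensure_output_folder` and its `base_dir` parameter are hypothetical):

```python
import os


def ensure_output_folder(version, base_dir="."):
    """Create <base_dir>/outputs/<version> if it is missing and return its path."""
    folder = os.path.join(base_dir, "outputs", version)
    os.makedirs(folder, exist_ok=True)  # no error if the folder already exists
    return folder
```

Calling `ensure_output_folder("v1")` from the project root would return the folder path without a trailing slash, so later paths are best built with `os.path.join` rather than string concatenation.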


# Clean Data

Cleaning is needed: the initial EDA shows 1551 missing cells across all features. Hypothesis 1 has no target variable, because it uses unsupervised clustering to form groups. The target for Hypothesis 2 is Stress Level, and the target for Hypothesis 3 is Step Count.

Next we drop the User ID feature and impute the numeric and categorical variables.
We also try to improve normality and skewness for the columns that need it.

---

Drop User ID from a copy of the original dataset

In [6]:
df_todrop = df.copy()
df_dropped = df_todrop.drop("User ID", axis=1)

In [7]:
print(df_dropped.columns)
df_dropped.head()
df_dropped.isnull().sum()

Index(['Heart Rate (BPM)', 'Blood Oxygen Level (%)', 'Step Count',
       'Sleep Duration (hours)', 'Activity Level', 'Stress Level'],
      dtype='object')


Heart Rate (BPM)          400
Blood Oxygen Level (%)    300
Step Count                100
Sleep Duration (hours)    150
Activity Level            200
Stress Level              200
dtype: int64

Now let's check the initial distribution of the one true categorical column, Activity Level.

In [8]:
df_category_test = df_dropped["Activity Level"]
df_category_test.unique()

array(['Highly Active', 'Highly_Active', 'Actve', 'Seddentary',
       'Sedentary', 'Active', nan], dtype=object)

The categorical column has 6 unique string values, not including NaN. Several look like spelling variants of the same class (for example 'Actve' and 'Active', or 'Seddentary' and 'Sedentary'), so they will likely need to be merged.

This categorical column will likely need encoding. The values in Activity Level are well balanced, so this feature should introduce little bias.


In [9]:
df_category_test.value_counts()

Activity Level
Seddentary       1676
Sedentary        1657
Highly Active    1650
Active           1643
Actve            1622
Highly_Active    1552
Name: count, dtype: int64
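
Several of the labels above look like spelling variants of the same class. A minimal sketch of merging them, assuming 'Highly_Active', 'Actve' and 'Seddentary' are misspellings of three underlying classes (the mapping and the toy series are illustrative, not taken from the notebook):

```python
import pandas as pd

# Hypothetical mapping from suspected misspellings to a canonical label
activity_map = {
    "Highly_Active": "Highly Active",
    "Actve": "Active",
    "Seddentary": "Sedentary",
}

# Toy series containing every label seen in the output above, plus a missing value
sample = pd.Series(
    ["Highly Active", "Highly_Active", "Actve", "Seddentary", "Sedentary", "Active", None]
)

cleaned = sample.replace(activity_map)  # missing values are left untouched
print(cleaned.value_counts())
```

Whether the variants really belong together is a judgment call that should be confirmed against the dataset description before applying it to `df_dropped`.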

Let's take a closer look at the other two object-typed columns, which should be float64 and int.

The Sleep Duration (hours) column has some cells containing the string "ERROR". We must replace or drop these before we can impute and change the column's data type. One option is to replace the value with NaN and then impute with the median.

In [10]:
df_dropped["Sleep Duration (hours)"].value_counts()

Sleep Duration (hours)
ERROR                 247
4.515512633313341       1
6.192069563693488       1
8.225011860105145       1
7.77547280382428        1
                     ... 
7.809611926858791       1
6.5424774602354105      1
5.690109349564968       1
7.144720940526833       1
9.572659844239388       1
Name: count, Length: 9604, dtype: int64
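
The replace-then-impute idea can be sketched on a toy series. The values below are made up for illustration; only the "ERROR" marker comes from the dataset:

```python
import pandas as pd

# Toy column mixing numeric strings, the "ERROR" marker and a missing value
sleep = pd.Series(["7.5", "ERROR", "6.0", None, "8.5"])

# errors="coerce" turns anything that cannot be parsed as a number into NaN
sleep_numeric = pd.to_numeric(sleep, errors="coerce")

# Impute the gaps with the median of the values that could be parsed
median = sleep_numeric.median()
sleep_imputed = sleep_numeric.fillna(median)
print(sleep_imputed)
```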

The Stress Level column contains the numbers 1 to 10 and one value labelled "Very High". We may replace "Very High" with 11, or with a custom value based on knowledge of the dataset.

In [11]:
df_dropped["Stress Level"].value_counts()

Stress Level
2            1007
7            1006
6            1001
3             995
1             984
9             976
4             966
10            954
5             945
8             917
Very High      49
Name: count, dtype: int64

Let's go ahead and replace the "Very High" class in the Stress Level column with 11.

In [12]:
df_preprocess = df_dropped.copy()

df_preprocess["Stress Level"] = df_preprocess["Stress Level"].replace("Very High", 11)
df_preprocess["Stress Level"].value_counts()

Stress Level
2     1007
7     1006
6     1001
3      995
1      984
9      976
4      966
10     954
5      945
8      917
11      49
Name: count, dtype: int64

Let's replace the ERROR cells in the Sleep Duration column with NaN. Instead of replacing each non-numeric value by hand, we can pass errors='coerce' to pd.to_numeric below, which turns any non-numeric value into NaN.

Now let's change the dtype of the two columns, Sleep Duration and Stress Level.

In [13]:
# Convert the columns to numeric data types, handling non-numeric values
df_preprocess["Sleep Duration (hours)"] = pd.to_numeric(df_preprocess["Sleep Duration (hours)"], errors='coerce')
df_preprocess["Stress Level"] = pd.to_numeric(df_preprocess["Stress Level"], errors='coerce')

# Check the data types of the columns after conversion
print("Data types after conversion to numeric:")
print(df_preprocess.dtypes)

Data types after conversion to numeric:
Heart Rate (BPM)          float64
Blood Oxygen Level (%)    float64
Step Count                float64
Sleep Duration (hours)    float64
Activity Level             object
Stress Level              float64
dtype: object


In [14]:
from feature_engine.imputation import MeanMedianImputer
from sklearn.pipeline import Pipeline

# Define the pipeline with MeanMedianImputer
pipeline = Pipeline([
    ('median', MeanMedianImputer(imputation_method='median',
                                 variables=["Sleep Duration (hours)", "Stress Level"]))
])

# Fit and transform the data using the pipeline
df_processed = pipeline.fit_transform(df_preprocess)

Now let's change the dtype of Stress Level to an integer. pd.to_numeric left it as a float because the column still contained NaN values at that point.

In [15]:
# Set explicit data types
df_processed["Stress Level"] = df_processed["Stress Level"].astype("int64")

# Check the data types of the columns again
print("Data types after setting explicit types:")
print(df_processed.dtypes)
df_processed.head()

Data types after setting explicit types:
Heart Rate (BPM)          float64
Blood Oxygen Level (%)    float64
Step Count                float64
Sleep Duration (hours)    float64
Activity Level             object
Stress Level                int64
dtype: object


   Heart Rate (BPM)  Blood Oxygen Level (%)    Step Count  Sleep Duration (hours) Activity Level  Stress Level
0         58.939776               98.809650   5450.390578                7.167236  Highly Active             1
1               NaN               98.532195    727.601610                6.538239  Highly_Active             5
2        247.803052               97.052954   2826.521994                6.503308  Highly Active             5
3         40.000000               96.894213  13797.338044                7.367790          Actve             3
4         61.950165               98.583797  15679.067648                6.503308  Highly_Active             6
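
The cast to int64 only works because the missing Stress Level values were imputed first. A minimal sketch on toy data of what happens otherwise, and of pandas' nullable `Int64` dtype as one possible alternative (not what the notebook uses):

```python
import pandas as pd

# Toy stress levels with one missing value
stress = pd.Series([1.0, 5.0, None, 11.0])

try:
    stress.astype("int64")  # NaN cannot be represented in a plain int64 column
except ValueError as exc:
    print(f"Plain int64 cast failed: {exc}")

# The nullable integer dtype keeps the gap as <NA> instead of raising
stress_nullable = stress.astype("Int64")
print(stress_nullable)
```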


In [17]:
# Compare skewness and kurtosis before and after type conversion and imputation
print("Skewness and Kurtosis before transformation and imputation:")
for col in ["Sleep Duration (hours)", "Stress Level"]:
    print(f"{col} | Skewness: {df_preprocess[col].skew()} | Kurtosis: {df_preprocess[col].kurtosis()}")

print("Skewness and Kurtosis after transformation and imputation:")
for col in ["Sleep Duration (hours)", "Stress Level"]:
    print(f"{col} | Skewness: {df_processed[col].skew()} | Kurtosis: {df_processed[col].kurtosis()}")

Skewness and Kurtosis before transformation and imputation:
Sleep Duration (hours) | Skewness: 0.006315730832954415 | Kurtosis: -0.06651157982483946
Stress Level | Skewness: 0.02130785311919279 | Kurtosis: -1.2141975459496481
Skewness and Kurtosis after transformation and imputation:
Sleep Duration (hours) | Skewness: 0.006618365704973535 | Kurtosis: 0.05477632157071133
Stress Level | Skewness: 0.010961089966155042 | Kurtosis: -1.180198079757423


The skewness values were already very close to zero before imputation, so the distributions were already quite symmetrical. The changes in skewness are minimal.
The kurtosis values move slightly towards those of a normal distribution, but the improvement is not substantial.
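
For reference, pandas reports excess kurtosis (Fisher's definition), so a normal distribution scores about 0. A minimal sketch on synthetic data suggests why Stress Level sits near -1.2: evenly spread integer levels are flat rather than bell shaped. The perfectly even counts below are an assumption that only roughly matches the value counts shown earlier:

```python
import numpy as np
import pandas as pd

# Synthetic data: integer levels 1 to 10, each appearing exactly 1000 times
uniform_levels = pd.Series(np.tile(np.arange(1, 11), 1000))

skewness = uniform_levels.skew()
excess_kurtosis = uniform_levels.kurtosis()
print(f"skewness: {skewness:.3f} | excess kurtosis: {excess_kurtosis:.3f}")
```

A strongly negative excess kurtosis with near-zero skewness therefore points to a flat, symmetric distribution rather than a data quality problem.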

Now the dataset has the correct data types

# Section 2 Normality, Skewness and Kurtosis Improvement


Before imputing the rest of the data, I'm going to try improving the normality of the variables, as this decides whether the median or the mean is used to impute the numeric columns. Now that the data types are fixed, I can test all numerical columns together.

I will draw QQ plots before and after a Box-Cox transformation to see what the normality looks like for the 3 numeric data types we can test, and then possibly improve skewness, kurtosis and normality with Box-Cox, with the aim of imputing as many variables as possible with the mean.
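
The mean-versus-median decision can be sketched as a simple rule on skewness. The 0.5 threshold and the helper name `choose_imputation_method` are assumptions for illustration, not a rule taken from this notebook:

```python
import pandas as pd


def choose_imputation_method(series, skew_threshold=0.5):
    """Return 'mean' for roughly symmetric data, otherwise 'median'."""
    skew = series.dropna().skew()
    return "mean" if abs(skew) < skew_threshold else "median"


# Toy examples: one symmetric column and one with a large outlier
symmetric = pd.Series([1, 2, 3, 4, 5, 6, 7])
right_skewed = pd.Series([1, 1, 1, 2, 2, 3, 50])

print(choose_imputation_method(symmetric))
print(choose_imputation_method(right_skewed))
```

The median is less sensitive to outliers, which is why it is the usual fallback when a column is clearly skewed.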

In [None]:
# check min and max for numeric variables to see if boxcox is suitable
for col in df_processed:
    if df_processed[col].dtype == "float64" or df_processed[col].dtype == "int64":
        print(f"{col} min: {df_processed[col].min()}, max: {df_processed[col].max()}")
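
Box-Cox is only defined for strictly positive values, which is why the min/max check above matters. A minimal sketch of the check and the transformation, using `scipy.stats.boxcox` on synthetic log-normal data rather than the notebook's feature_engine transformer (the data and variable names are assumptions for illustration):

```python
import numpy as np
import pandas as pd
from scipy import stats

# Synthetic right-skewed, strictly positive data
rng = np.random.default_rng(0)
values = pd.Series(rng.lognormal(mean=0.0, sigma=1.0, size=1000))

# Box-Cox needs every value to be greater than zero
if (values <= 0).any():
    raise ValueError("Box-Cox needs strictly positive values")

# boxcox returns the transformed data and the fitted lambda
transformed, fitted_lambda = stats.boxcox(values.to_numpy())

skew_before = values.skew()
skew_after = pd.Series(transformed).skew()
print(f"lambda: {fitted_lambda:.2f} | skew before: {skew_before:.2f} | skew after: {skew_after:.2f}")
```

For columns that are already close to symmetric, as the skewness figures earlier suggest, the transformation is likely to change very little.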

In [None]:
from sklearn.pipeline import Pipeline
from feature_engine import transformation as vt
from feature_engine.imputation import MeanMedianImputer
import seaborn as sns
import pingouin as pg
import matplotlib.pyplot as plt

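# feature_engine's BoxCoxTransformer raises an error on missing values,
# so only complete rows are used for this comparison
df_numeric = df_processed.select_dtypes(include=['float64','int64']).dropna()
df_numeric.head()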

def calculate_skew_kurtosis(df,col, moment):
  print(f"{moment}  | skewness: {df[col].skew().round(2)} | kurtosis: {df[col].kurtosis().round(2)}")


# We set the pipeline with this transformer: vt.BoxCoxTransformer().
# Then we .fit_transform() the pipeline, assigning the result to df_transformed

pipeline = Pipeline([
      ( 'boxcox', vt.BoxCoxTransformer() )
  ])

df_transformed = pipeline.fit_transform(df_numeric)
print(df_transformed.head())

def compare_distributions_before_and_after_applying_transformer(df, df_transformed, method):

  for col in df.columns:
    print(f"*** {col} ***")
    fig, axes = plt.subplots(nrows=2, ncols=2, figsize=(10,8))

    sns.histplot(data=df, x=col, kde=True, ax=axes[0,0])
    axes[0,0].set_title(f'Before {method}')
    pg.qqplot(df[col], dist='norm',ax=axes[0,1])
    
    sns.histplot(data=df_transformed, x=col, kde=True, ax=axes[1,0])
    axes[1,0].set_title(f'After {method}')
    pg.qqplot(df_transformed[col], dist='norm',ax=axes[1,1])
    
    plt.tight_layout()
    plt.show()
    
    # Save plot
    plot_names = method + col + ".png"
    # Add a subfolder to the output folder for normality and skewness improvement plots
    NormalitySkewness = OutputFolder + "norm_skew_improvement/"
    if "norm_skew_improvement" not in os.listdir(OutputFolder):
        os.mkdir(NormalitySkewness)
    plot_dir = os.path.join(NormalitySkewness, plot_names)
    fig.savefig(plot_dir)

    calculate_skew_kurtosis(df,col, moment='before transformation')
    calculate_skew_kurtosis(df_transformed,col, moment='after transformation')
    print("\n")
    
compare_distributions_before_and_after_applying_transformer(df_numeric, df_transformed, method='BoxCoxTransformer')

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the try block is valid Python until a folder name is set
except Exception as e:
  print(e)
