# **Data Collection**

## Objectives

* Download data from Kaggle.com and perform an initial EDA.

## Inputs

* unclean_smartwatch_health_data.csv

## Outputs

* ydata-profiling EDA

## Additional Comments

* In case you have any additional comments that don't fit in the previous bullets, please state them here. 

---

# Change working directory

* We assume the notebooks are stored in a subfolder, so when running the notebook in the editor you will need to change the working directory.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with os.getcwd()

In [None]:
import os
current_dir = os.getcwd()
current_dir

We want to make the parent of the current directory the new current directory
* os.path.dirname() gets the parent directory
* os.chdir() sets the new current directory

In [None]:
os.chdir(os.path.dirname(current_dir))
print("You set a new current directory")

Confirm the new current directory

In [None]:
current_dir = os.getcwd()
current_dir

# Section 1

Point the Kaggle CLI at the kaggle.json token in the current directory, then download the dataset and unzip it.

In [None]:
# Tell the Kaggle CLI to look for kaggle.json in the current directory
os.environ["KAGGLE_CONFIG_DIR"] = os.getcwd()
! chmod 600 kaggle.json

In [None]:
KaggleDatasetPath = "mohammedarfathr/smartwatch-health-data-uncleaned"
DestinationFolder = "inputs/smartwatch_health_data_untouched"   
! kaggle datasets download -d {KaggleDatasetPath} -p {DestinationFolder}

In [None]:
import zipfile
with zipfile.ZipFile(DestinationFolder + "/smartwatch-health-data-uncleaned.zip", "r") as zip_ref:
    zip_ref.extractall(DestinationFolder)

os.remove(DestinationFolder + "/smartwatch-health-data-uncleaned.zip")

---

# Section 2 EDA

Load the downloaded CSV into a DataFrame, create the versioned output folder, and run an initial EDA.

In [None]:
import pandas as pd
# pd.read_csv already returns a DataFrame, so no extra conversion is needed
df = pd.read_csv(DestinationFolder + "/unclean_smartwatch_health_data.csv")
df.head()

# Change version variable to store outputs in different folder
version = "v1"

OutputFolder = f"outputs/{version}/"
# Create the versioned output folder (and "outputs" itself) if it does not exist yet
os.makedirs(OutputFolder, exist_ok=True)

### We run an initial EDA on the data. The company has told me in advance that the data has missing values, so a ydata-profiling report seems ideal for now.

In [None]:
from ydata_profiling import ProfileReport
os.makedirs(OutputFolder + "initial_eda", exist_ok=True)
# explorative=True was chosen for a deeper analysis
profile = ProfileReport(df, explorative=True)
profile.to_file(OutputFolder + "initial_eda/unclean_data_report.html")
profile.to_notebook_iframe()

#### The ydata EDA shows that every feature has missing data, as expected. We can safely drop User ID from this dataset, since it is only an identifier. Step Count, Heart Rate and Activity Level are the only features that currently look normally distributed. Altogether we have 10000 observations and 1551 missing cells, equating to 2.2% of the dataset. According to the report, there is no duplicated data.
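
The missing-cell share quoted from the report can also be cross-checked directly in pandas. The sketch below is only an illustration: `toy` is a hypothetical stand-in for `df`, and its column names and values are made up, not taken from the dataset.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame standing in for df (values are invented)
toy = pd.DataFrame({
    "a": [1.0, np.nan, 3.0, 4.0],
    "b": ["x", None, "z", "z"],
})

# Total number of missing cells across all columns
missing_cells = int(toy.isna().sum().sum())

# DataFrame.size is rows * columns, i.e. the total number of cells
total_cells = toy.size

# Share of cells that are missing, as a fraction of the whole frame
missing_share = missing_cells / total_cells

# Number of fully duplicated rows
duplicate_rows = int(toy.duplicated().sum())

print(f"Missing cells: {missing_cells} of {total_cells} ({missing_share:.1%})")
print(f"Duplicated rows: {duplicate_rows}")
```

Running the same three expressions on `df` should reproduce the report's missing-cell count and percentage, assuming the report counts cells the same way.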

In [None]:
# Check the data types of the columns
print("Data types of columns:")
for column, dtype in df.dtypes.items():
    print(f"{column}: {dtype}")

print("\n")
# Double check duplicated data levels
print(f"Duplications = {df.duplicated().sum()}")

print("\n")
# Double check missing data levels
print(f"Missing data levels:\n{df.isnull().sum()}")

I notice that the last 3 columns are of type object, when Activity Level looks to be the only one that should be. I will convert Sleep Duration to float and Stress Level to integer after the data has been imputed.
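
A minimal sketch of how that conversion might look, assuming the object dtype is caused by non-numeric strings mixed into the columns. The `toy` frame, its column names and the placeholder strings are hypothetical; the real column names and bad values should be checked in `df` first.

```python
import pandas as pd

# Hypothetical toy frame: numbers stored as strings, plus non-numeric entries
toy = pd.DataFrame({
    "Sleep Duration": ["7.5", "ERROR", None, "6.1"],
    "Stress Level": ["3", "Very High", None, "8"],
})

# errors="coerce" turns anything that cannot be parsed into NaN
toy["Sleep Duration"] = pd.to_numeric(toy["Sleep Duration"], errors="coerce")

# The nullable Int64 dtype keeps integers even while missing values remain
toy["Stress Level"] = pd.to_numeric(
    toy["Stress Level"], errors="coerce"
).astype("Int64")

print(toy.dtypes)
```

Coercion creates additional missing values wherever a string cannot be parsed, so it may be worth counting them before imputing.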

---

#### I think dropping User ID, imputing the missing numerical and categorical data, and improving the normality and skewness of the variables are the priorities for the next cleaning step. Encoding, or even discretization, will probably be tried soon after.
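
As a rough sketch of the first two steps, the example below drops the identifier and fills the gaps with the median (numerical) and the mode (categorical). The choice of median and mode is an assumption made for illustration, not a decision taken in this notebook, and `toy` is a hypothetical frame with invented values.

```python
import numpy as np
import pandas as pd

# Hypothetical toy frame with one identifier, one numerical and one categorical column
toy = pd.DataFrame({
    "User ID": [1, 2, 3, 4],
    "Heart Rate": [60.0, np.nan, 80.0, 70.0],
    "Activity Level": ["Active", None, "Active", "Sedentary"],
})

# Drop the identifier column
cleaned = toy.drop(columns=["User ID"])

# Split the remaining columns by type
numeric_cols = cleaned.select_dtypes(include="number").columns
categorical_cols = cleaned.select_dtypes(exclude="number").columns

# Numerical columns: fill gaps with each column's median (assumed strategy)
cleaned[numeric_cols] = cleaned[numeric_cols].fillna(cleaned[numeric_cols].median())

# Categorical columns: fill gaps with each column's most frequent value (assumed strategy)
for col in categorical_cols:
    cleaned[col] = cleaned[col].fillna(cleaned[col].mode().iloc[0])

print(cleaned)
```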

NOTE

* You may add as many sections as you want, as long as it supports your project workflow.
* All notebook cells should be run top-down (you can't create a workflow where, at a given point, you need to go back to a previous cell to execute some task, such as refreshing a variable's content).

---

# Push files to Repo

* In case you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create here your folder
  # os.makedirs(name='')
  pass  # placeholder so the empty try block is valid Python
except Exception as e:
  print(e)
