# **ETL: Rest Wellness and Lifestyle project**

## Objectives

* Extract: Load the CSV file into a pandas DataFrame
* Transform: Handle missing values, convert data types, normalize column names, remove duplicates
* Load: Save the cleaned dataset to a new CSV file
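
The three steps above can be sketched end to end on a small in-memory sample. The sample rows, the `clean` helper, and the snake_case naming rule are assumptions made for illustration; this is not the notebook's actual transform code.

```python
import io

import pandas as pd

# Tiny stand-in for the raw CSV (hypothetical rows, not the real dataset).
RAW_CSV = """Person ID,Gender,Sleep Duration,Sleep Disorder
1,Male,6.1,
2,Female,7.2,Insomnia
2,Female,7.2,Insomnia
"""


def clean(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the transform steps: names, missing values, dtypes, duplicates."""
    out = df.copy()
    # Normalize column names: "Sleep Duration" -> "sleep_duration".
    out.columns = (
        out.columns.str.strip().str.lower().str.replace(" ", "_", regex=False)
    )
    # Handle missing values: label absent disorders explicitly (an assumption).
    out["sleep_disorder"] = out["sleep_disorder"].fillna("None")
    # Convert data types: a low-cardinality text column becomes a categorical.
    out["gender"] = out["gender"].astype("category")
    # Remove exact duplicate rows and rebuild the index.
    return out.drop_duplicates().reset_index(drop=True)


# Extract
raw = pd.read_csv(io.StringIO(RAW_CSV))
# Transform
cleaned = clean(raw)
# Load (written to an in-memory buffer here instead of a file)
buffer = io.StringIO()
cleaned.to_csv(buffer, index=False)
```

Keeping the transform in one function makes it easy to rerun on the full dataset once the individual rules have been agreed.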

## Inputs

* Sleep Health and Lifestyle Dataset from: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset/data  

## Outputs

* Cleaned dataset saved as a new CSV file

## Additional Comments

* Sleep Health and Lifestyle Dataset is synthetic and was created by Laksika Tharmalingam for illustrative purposes.



---

# Change working directory

* The notebooks are stored in a subfolder, so the working directory must be changed when running a notebook in the editor.

We need to change the working directory from its current folder to its parent folder.
* We access the current directory with `os.getcwd()`.

In [1]:
import os
current_dir = os.getcwd()
current_dir

'c:\\Users\\Ewa\\Documents\\vscode-projects\\Rest_Wellness_and_Lifestyle\\02_jupyter_notebooks'

We want to make the parent of the current directory the new current directory.
* `os.path.dirname()` gets the parent directory.
* `os.chdir()` sets the new current directory.

In [2]:
os.chdir(os.path.dirname(current_dir))
print("Rest_Wellness_and_Lifestyle")

Rest_Wellness_and_Lifestyle


Confirm the new current directory

In [3]:
current_dir = os.getcwd()
current_dir

'c:\\Users\\Ewa\\Documents\\vscode-projects\\Rest_Wellness_and_Lifestyle'
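
One caveat with the cells above: running the `os.chdir` cell a second time moves up one more level. A minimal sketch of a guard against that is shown here. The `project_root` helper and its `notebooks_dir` parameter are hypothetical; the folder name `02_jupyter_notebooks` is taken from the output above.

```python
import os
from pathlib import Path


def project_root(start: Path, notebooks_dir: str = "02_jupyter_notebooks") -> Path:
    """Return the parent of `start` only if `start` is the notebooks folder."""
    # Re-running the cell from the project root leaves the path unchanged.
    return start.parent if start.name == notebooks_dir else start


root = project_root(Path.cwd())
os.chdir(root)
print(f"Working directory: {Path.cwd()}")
```

Because the check is based on the folder name, the cell can be executed repeatedly without drifting further up the directory tree.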

# Section 1

Import the Python libraries

In [4]:
import pandas as pd
import numpy as np



---

# Step 1: Extract - Loading and Initial Exploration

In [None]:
# Load the data using a path relative to the project root (the working directory set earlier)
df = pd.read_csv("01_data/Sleep_health_and_lifestyle_raw_dataset.csv")
# Work on a copy so the DataFrame loaded from the raw file stays unchanged
df_copy = df.copy()

In [9]:
# Display basic information about the dataset
df_copy.info()
df_copy.head()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | NaN |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | NaN |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
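
In the preview above, rows 1 and 2 (and rows 3 and 4) differ only in `Person ID`, so a plain `duplicated()` call would not flag them. A minimal sketch of checking for repeated profiles while ignoring the identifier follows; it uses a few columns copied from the preview. Whether such rows are true duplicates or distinct people with identical measurements is a judgement call this sketch does not settle.

```python
import pandas as pd

# Selected columns copied from the five preview rows.
preview = pd.DataFrame(
    {
        "Person ID": [1, 2, 3, 4, 5],
        "Occupation": [
            "Software Engineer",
            "Doctor",
            "Doctor",
            "Sales Representative",
            "Sales Representative",
        ],
        "Sleep Duration": [6.1, 6.2, 6.2, 5.9, 5.9],
        "Blood Pressure": ["126/83", "125/80", "125/80", "140/90", "140/90"],
    }
)

# With the identifier included, every row is unique.
exact_duplicates = int(preview.duplicated().sum())

# Ignoring the identifier exposes repeated profiles.
feature_columns = [col for col in preview.columns if col != "Person ID"]
repeated_profiles = int(preview.duplicated(subset=feature_columns).sum())

print(exact_duplicates, repeated_profiles)
```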


In [22]:
# Describe basic statistics of the dataset

summary_stats = df_copy.describe()
styled_summary = summary_stats.style.background_gradient(cmap='Blues')
display(styled_summary)

| | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
|---|---|---|---|---|---|---|---|---|
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.84492 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 25% | 94.25 | 35.25 | 6.4 | 6.0 | 45.0 | 4.0 | 68.0 | 5600.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
| 75% | 280.75 | 50.0 | 7.8 | 8.0 | 75.0 | 7.0 | 72.0 | 8000.0 |
| max | 374.0 | 59.0 | 8.5 | 9.0 | 90.0 | 8.0 | 86.0 | 10000.0 |


Dataset Summary:

* Age:

Mean (42.18) / Median (43): The sample is centered on adults in their early forties. The close values of the mean and median suggest the age distribution is fairly symmetrical.
Standard Deviation (8.67): The age distribution has a moderate spread, with a mix of younger and older individuals around the average.
Range (27 - 59): The dataset covers adults from their late twenties to their late fifties; no one younger than 27 or older than 59 is included.

* Sleep Duration:

Mean (7.13) / Median (7.2): On average, individuals in the dataset get about 7 hours of sleep per day. The proximity of the mean and median suggests a relatively symmetrical distribution.
Range (5.8 - 8.5): There’s some variation in sleep duration, but most individuals sleep within a fairly narrow range.

* Quality of Sleep:

Mean (7.31) / Median (7): The average sleep quality rating is well above the midpoint of the 1-10 scale. The mean and median are close, so these statistics do not point to a strong skew; recorded ratings range from 4 to 9.

* Physical Activity Level:

Mean (59.17) / Median (60): On average, individuals engage in around an hour of physical activity per day.
Range (30 - 90): There is considerable variation in activity levels, with some individuals being significantly more active than others.

* Stress Level:

Mean (5.39) / Median (5): The average stress level is close to the midpoint of the 1-10 scale. The mean sits above the median, which suggests a mild skew toward higher stress levels for some individuals.

* Heart Rate:

Mean (70.17) / Median (70): The average resting heart rate is approximately 70 beats per minute, within the normal range for adults. The near-identical mean and median suggest a balanced distribution.

* Daily Steps:

Mean (6816.84) / Median (7000): On average, individuals take nearly 7,000 steps per day.
Range (3000 - 10000): There’s considerable variation in daily step counts, indicating varying levels of activity among individuals.
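
The symmetry and skew statements in this summary rest on comparing each mean with its median. A minimal sketch of that comparison is given below, using made-up columns chosen to show both directions; the column names and values are hypothetical. On the notebook's data, the same expressions could be applied to the numeric columns of `df_copy`.

```python
import pandas as pd

# Hypothetical values chosen to show both skew directions.
sample = pd.DataFrame(
    {
        "right_skewed": [1, 2, 3, 4, 10],
        "left_skewed": [1, 7, 8, 9, 10],
    }
)

comparison = pd.DataFrame(
    {
        "mean": sample.mean(),
        "median": sample.median(),
        "skew": sample.skew(),
    }
)
# A positive gap (mean > median) points to a longer tail of high values.
comparison["mean_minus_median"] = comparison["mean"] - comparison["median"]
print(comparison.round(2))
```

The gap is only a rough indicator, particularly for ratings on a short integer scale, so a histogram is a useful follow-up before drawing conclusions.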


In [32]:
# Get categorical statistics
categorical_stats = df_copy.describe(include='object')
display(categorical_stats)

| | Gender | Occupation | BMI Category | Blood Pressure | Sleep Disorder |
|---|---|---|---|---|---|
| count | 374 | 374 | 374 | 374 | 155 |
| unique | 2 | 11 | 4 | 25 | 2 |
| top | Male | Nurse | Normal | 130/85 | Sleep Apnea |
| freq | 189 | 73 | 195 | 99 | 78 |
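
The summary above reports only how many distinct values each column has and which one is most frequent. A short sketch of listing every category with its count, including missing values, is shown here on a hypothetical sample; the rows are invented and only the column names come from the dataset.

```python
import pandas as pd

# Hypothetical sample; the real labels come from the dataset itself.
sample = pd.DataFrame(
    {
        "Gender": ["Male", "Female", "Male", "Male"],
        "Sleep Disorder": ["Sleep Apnea", None, "Insomnia", None],
    }
)

# dropna=False keeps missing values as their own row in the counts.
counts = {
    column: sample[column].value_counts(dropna=False)
    for column in ["Gender", "Sleep Disorder"]
}

for column, column_counts in counts.items():
    print(f"\n{column}")
    print(column_counts)
```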


Observations: 

* Gender:

Males are slightly more represented in the dataset.

* Occupation:

The dataset contains 11 occupations. "Nurse" is the most frequent (73 of 374 records), which may indicate overrepresentation.

* BMI Category:

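Four categories are present. The summary does not list them, so the labels (likely "Underweight," "Normal," "Overweight," and "Obese") should be confirmed before any recoding.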
"Normal" is the most frequent category, indicating a relatively healthy BMI distribution in the sample.

* Blood Pressure:

25 unique blood pressure readings, stored as text in systolic/diastolic form. "130/85" is the most frequent reading (99 records).

* Sleep Disorder:

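Two recorded categories: "Insomnia" and "Sleep Apnea." The remaining 219 of 374 rows have no value in this column, which most likely corresponds to individuals with no sleep disorder. "Sleep Apnea" is the most commonly reported sleep disorder among individuals with recorded disorders (78 of 155).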
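
If the raw file writes the no-disorder case as the text `None`, the way it is parsed decides whether it shows up as a category or as a missing value. The sketch below contrasts the two readings on a hypothetical three-row excerpt. To my understanding, recent pandas releases treat `None` as missing by default, but that is an assumption worth checking against the installed version; the explicit `na_values` argument here makes the behaviour independent of it.

```python
import io

import pandas as pd

# Hypothetical excerpt in which "None" is written out as text.
RAW_CSV = """Person ID,Sleep Disorder
1,None
2,Sleep Apnea
3,Insomnia
"""

# Reading 1: treat "None" as a missing value.
as_missing = pd.read_csv(io.StringIO(RAW_CSV), na_values=["None"])

# Reading 2: disable the default missing-value markers so "None" stays a label.
as_label = pd.read_csv(io.StringIO(RAW_CSV), keep_default_na=False)

# nunique() ignores missing values, so the two readings disagree.
missing_unique = as_missing["Sleep Disorder"].nunique()
label_unique = as_label["Sleep Disorder"].nunique()
print(missing_unique, label_unique)
```

Either reading can be valid; what matters is choosing one deliberately, since it changes the category count reported by `describe`.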


---

# Push files to Repo

* In cases where you don't need to push files to Repo, you may replace this section with "Conclusions and Next Steps" and state your conclusions and next steps.

In [None]:
import os
try:
  # create your folder here
  # os.makedirs(name='')
  pass  # placeholder so the otherwise empty try block is valid Python
except Exception as e:
  print(e)
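
For the Load step, a minimal sketch of creating an output folder and writing the cleaned data is given below. The folder name, the file name, and the stand-in `cleaned` DataFrame are hypothetical, since the notebook does not state them; a temporary directory is used so the sketch runs anywhere.

```python
import tempfile
from pathlib import Path

import pandas as pd

# Stand-in for the cleaned DataFrame produced by the transform step.
cleaned = pd.DataFrame({"person_id": [1, 2], "sleep_duration": [6.1, 6.2]})

# A temporary folder keeps the sketch self-contained; in the project this
# would be a real output folder (its name is not given in the notebook).
output_dir = Path(tempfile.mkdtemp()) / "cleaned"
output_dir.mkdir(parents=True, exist_ok=True)  # no error if it already exists

output_path = output_dir / "cleaned_dataset.csv"  # hypothetical file name
cleaned.to_csv(output_path, index=False)  # index=False avoids an extra column

# Read the file back to confirm that the round trip preserves the data.
reloaded = pd.read_csv(output_path)
```

Writing with `index=False` prevents the row index from being saved as an unnamed first column when the file is read again.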
