# 1. Title and Executive Summary


### Sleep Health and Lifestyle Data Analysis

This project analyzes the Sleep Health and Lifestyle dataset, summarized in the Data Card section below. The analysis focuses on which demographic and health factors most influence sleep. This notebook contains:
- **Data card** summarizing the dataset origin, fields, units, limitations, and license
- **Loading and File IO** section detailing how the dataset was imported
- **Exploratory data analysis** of the dataset, including [] visualizations 
- **Conclusions** summarizing my findings based on the EDA
- **Appendix** listing resources and references

# 2. Data Card

### Dataset Overview

- **Dataset Origin**: Kaggle 
- **Dataset Link**: https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset
- **Fields**: 
  - Person ID: An identifier for each individual.
  - Gender: The gender of the person (Male/Female).
  - Age: Age of the individual in years.
  - Occupation: Profession of the individual.
  - Sleep Duration: Duration of sleep in hours.
  - Quality of Sleep: Subjective rating of sleep quality on a scale of 1-10.
  - Physical Activity Level: Minutes of physical activity per day.
  - Stress Level: Subjective rating of stress level on a scale of 1-10.
  - BMI Category: The BMI category of the person (Underweight, Normal, Overweight).
  - Blood Pressure: Blood pressure measurement of the person (systolic/diastolic).
  - Heart Rate: The resting heart rate in beats per minute.
  - Daily Steps: Number of steps taken per day. 
  - Sleep Disorder: The presence or absence of a sleep disorder in the person (None, Insomnia, Sleep Apnea).
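- **Units**: years (Age), hours (Sleep Duration), minutes per day (Physical Activity Level), beats per minute (Heart Rate), steps per day (Daily Steps). Quality of Sleep and Stress Level are unitless ratings on a scale of 1-10.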
- **Limitations**: This is a synthetic dataset created for educational purposes, so the data may contain logical or technical errors. 
- **License**: Public Domain


# 3. Loading and File IO

In [3]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

In [38]:
# Data Loading
pd.set_option('display.max_columns', None)
sleepy_path = 'data/SleepyData.csv'
try:
  SleepyData = pd.read_csv(sleepy_path)
except FileNotFoundError:
  print(f"Error: {sleepy_path} not found. Check the data/ folder and file name.")
except pd.errors.ParserError:
  print(
      f"Error: Parsing failed for {sleepy_path}. Check the delimiter or look for bad rows.")
else:
  print("Sleepy data loaded successfully!")
  print(SleepyData.info())

Sleepy data loaded successfully!
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage
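
The `Sleep Disorder` column reports only 155 non-null values, although the data card lists `None` as a valid category. One plausible cause, stated here as an assumption rather than something verified against the raw file, is that pandas parsed the literal string `None` as a missing value (recent pandas versions include `None` among the default NA markers). Below is a minimal sketch of a loader that keeps such strings intact. `load_sleep_data` and the inline sample rows are hypothetical and only illustrate the idea.

```python
import io

import pandas as pd


def load_sleep_data(source) -> pd.DataFrame:
    """Read the CSV while treating only empty fields as missing.

    Hypothetical helper: with keep_default_na=False, strings such as
    "None" or "NA" are kept as text instead of being converted to NaN.
    """
    return pd.read_csv(source, keep_default_na=False, na_values=[""])


# Tiny in-memory stand-in for data/SleepyData.csv (illustrative rows only)
sample_csv = io.StringIO(
    "Person ID,Sleep Disorder\n"
    "1,None\n"
    "2,Insomnia\n"
    "3,Sleep Apnea\n"
)
sample = load_sleep_data(sample_csv)
print(sample["Sleep Disorder"].value_counts())
```

If the file is already loaded, `SleepyData["Sleep Disorder"].fillna("None")` would have a similar effect, assuming every missing entry really means no disorder.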

In [39]:
# Data Cleaning and Optimization

# Converting sleep quality and stress level to categorical data types (low, moderate, high).
# Currently, they are measured on a scale of 1-10.
# Using if/elif to map the numerical values to categorical labels.

# Checking the min and max values of each category before converting to categorical

print("\nBefore converting Sleep Quality and Stress Level to categorical:")
print("Quality of Sleep raw min/max:",
      SleepyData["Quality of Sleep"].min(), SleepyData["Quality of Sleep"].max())
print("Stress Level raw min/max:",
      SleepyData["Stress Level"].min(), SleepyData["Stress Level"].max())

# Coercing to numeric so the comparisons in categorize_levels work;
# entries that cannot be parsed become NaN
SleepyData["Quality of Sleep"] = pd.to_numeric(
    SleepyData["Quality of Sleep"], errors="coerce")
SleepyData["Stress Level"] = pd.to_numeric(
    SleepyData["Stress Level"], errors="coerce")

# Categorizing into low (<=3), moderate (4-7), and high (>=8)


def categorize_levels(value: float):
  # Keep missing values missing: NaN fails every comparison and
  # would otherwise fall through to 'high'
  if pd.isna(value):
    return np.nan
  if value <= 3:
    return 'low'
  elif value <= 7:
    return 'moderate'
  else:
    return 'high'


# Applying the categorization
SleepyData["Quality of Sleep"] = SleepyData["Quality of Sleep"].apply(
    categorize_levels).astype("category")
SleepyData["Stress Level"] = SleepyData["Stress Level"].apply(
    categorize_levels).astype("category")

# Printing the results
# They match the min and max values (nobody rated quality of sleep below 4, so there are no 'low' values)
print("\nAfter converting Sleep Quality and Stress Level to categorical:")
print(SleepyData["Quality of Sleep"].unique())
print(SleepyData["Stress Level"].unique())


Before converting Sleep Quality and Stress Level to categorical:
Quality of Sleep raw min/max: 4 9
Stress Level raw min/max: 3 8

After converting Sleep Quality and Stress Level to categorical:
['moderate', 'high']
Categories (2, object): ['high', 'moderate']
['moderate', 'high', 'low']
Categories (3, object): ['high', 'low', 'moderate']
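
As a possible alternative to the row-wise `apply`, the same thresholds can be expressed with `pd.cut`, which is vectorized, leaves missing values as NaN, and yields an ordered categorical (low < moderate < high). This is a sketch under the assumption that ratings fall on the 1-10 scale described in the data card. `to_level` and the sample ratings are illustrative names, not part of the notebook.

```python
import numpy as np
import pandas as pd


def to_level(ratings: pd.Series) -> pd.Series:
    """Bin 1-10 ratings into ordered low/moderate/high categories."""
    # Right-inclusive bins: (0, 3] -> low, (3, 7] -> moderate, (7, 10] -> high
    return pd.cut(ratings, bins=[0, 3, 7, 10], labels=["low", "moderate", "high"])


# Illustrative ratings, including one missing value
ratings = pd.Series([3, 4, 7, 8, 9, np.nan])
levels = to_level(ratings)
print(levels)
```

An ordered categorical lets sorting and plotting follow the natural low-to-high order instead of alphabetical order. Note that any rating outside the 0-10 range would become NaN under these bins.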
