# Sleep Health & Lifestyle Analysis

This project explores how lifestyle and health-related factors 
are associated with sleep duration, sleep quality, and sleep disorders.


## Goal
Analyze how individual lifestyle and health-related factors relate to
sleep duration, sleep quality, and the presence of sleep disorders.

This analysis is observational and focuses on identifying associations 
rather than causal relationships.


## Independent Variables (Factors)
- Age
- Gender
- BMI Category
- Physical Activity Level
- Stress Level
- Daily Steps
- Heart Rate

## Dependent Variables (Sleep Outcomes)
- Sleep Duration
- Quality of Sleep
- Sleep Disorder


## Analytical Approach
- Exploratory data analysis (EDA)
- Group comparisons across categorical variables
- Correlation analysis for numerical variables
- Visual analysis using plots and summary statistics
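
As a rough illustration of the group-comparison and correlation steps listed above, the sketch below uses a tiny made-up sample. The values are illustrative assumptions and are not taken from the dataset; only the column names match it.

```python
import pandas as pd

# Tiny made-up sample; values are illustrative, not real dataset rows.
sample = pd.DataFrame({
    "BMI Category": ["Normal", "Normal", "Overweight", "Overweight", "Obese"],
    "Stress Level": [4, 5, 6, 7, 8],
    "Sleep Duration": [7.8, 7.5, 6.5, 6.3, 5.9],
})

# Group comparison: mean sleep duration for each level of a categorical factor.
group_means = sample.groupby("BMI Category")["Sleep Duration"].mean()

# Correlation analysis: Pearson correlation between two numerical columns.
stress_sleep_corr = sample["Stress Level"].corr(sample["Sleep Duration"])

print(group_means)
print(f"corr = {stress_sleep_corr:.2f}")
```

Both results describe associations only, in line with the observational framing of the project.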


In [17]:
import pandas as pd 
import numpy as np
import matplotlib.pyplot as plt 
import seaborn as sns

In [18]:
df = pd.read_csv("Sleep_health_and_lifestyle_dataset.csv")

## Initial Data Inspection

In this section, we take a first look at the dataset to check:
- Dataset size
- Data types
- Missing values
- Potential data quality issues
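
The checks listed above can also be gathered into one per-column summary. The helper below is a minimal sketch: `inspect` is a hypothetical name, not part of pandas, and `demo` is a made-up frame standing in for `df`.

```python
import pandas as pd

def inspect(frame: pd.DataFrame) -> pd.DataFrame:
    """Return one row per column with its dtype, missing count, and unique count."""
    return pd.DataFrame({
        "dtype": frame.dtypes.astype(str),
        "missing": frame.isna().sum(),
        "unique": frame.nunique(),  # NaN is not counted as a value
    })

# Made-up stand-in for df.
demo = pd.DataFrame({
    "Age": [27, 28, 28],
    "Sleep Disorder": [None, "Sleep Apnea", None],
})

summary = inspect(demo)
print(demo.shape)
print(summary)
print("duplicate rows:", demo.duplicated().sum())
```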


In [19]:
df.head()

|   | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | Male | 27 | Software Engineer | 6.1 | 6 | 42 | 6 | Overweight | 126/83 | 77 | 4200 | |
| 1 | 2 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 2 | 3 | Male | 28 | Doctor | 6.2 | 6 | 60 | 8 | Normal | 125/80 | 75 | 10000 | |
| 3 | 4 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |
| 4 | 5 | Male | 28 | Sales Representative | 5.9 | 4 | 30 | 8 | Obese | 140/90 | 85 | 3000 | Sleep Apnea |


In [167]:
df.shape

(374, 13)

In [20]:
df.info()

<class 'pandas.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    str    
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    str    
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    str    
 9   Blood Pressure           374 non-null    str    
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    str    
dtypes: float64(1), int64(7), str(5)
memory usage: 38.1 KB


In [168]:
df.isna().sum()

Based on the `df.info()` output above, `Sleep Disorder` is the only column with missing values (219 of 374 rows).

In [169]:
df.duplicated().sum()

np.int64(0)
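
Because `Person ID` is unique for every row, a full-row duplicate check cannot flag records that repeat in every other column. The sketch below reuses a subset of columns from the five rows shown by `df.head()` and repeats the check with the identifier excluded. Whether such repeats are true duplicates or distinct people with identical measurements is not something the data alone can settle.

```python
import pandas as pd

# Subset of columns from the five rows shown by df.head().
head_rows = pd.DataFrame({
    "Person ID": [1, 2, 3, 4, 5],
    "Occupation": ["Software Engineer", "Doctor", "Doctor",
                   "Sales Representative", "Sales Representative"],
    "Sleep Duration": [6.1, 6.2, 6.2, 5.9, 5.9],
    "Daily Steps": [4200, 10000, 10000, 3000, 3000],
})

# Full-row check: the identifier makes every row distinct.
full_dupes = head_rows.duplicated().sum()

# Same check with the identifier column excluded.
feature_cols = head_rows.columns.drop("Person ID").tolist()
feature_dupes = head_rows.duplicated(subset=feature_cols).sum()

print(full_dupes, feature_dupes)
```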

In [21]:
df.describe()

|   | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
|---|---|---|---|---|---|---|---|---|
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.84492 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 25% | 94.25 | 35.25 | 6.4 | 6.0 | 45.0 | 4.0 | 68.0 | 5600.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
| 75% | 280.75 | 50.0 | 7.8 | 8.0 | 75.0 | 7.0 | 72.0 | 8000.0 |
| max | 374.0 | 59.0 | 8.5 | 9.0 | 90.0 | 8.0 | 86.0 | 10000.0 |


In [165]:
df.describe(include="str")  # pandas 3: request the string dtype explicitly instead of "object"


|   | Gender | Occupation | BMI Category | Blood Pressure | Sleep Disorder |
|---|---|---|---|---|---|
| count | 374 | 374 | 374 | 374 | 374 |
| unique | 2 | 11 | 3 | 25 | 3 |
| top | Male | Nurse | Normal | 130/85 | |
| freq | 189 | 73 | 216 | 99 | 219 |


## Column Descriptions

Each feature in the dataset is described below.
Knowing the meaning and role of every variable makes the later
exploratory analysis easier to interpret.

### Demographic Information
- **Person ID**  
  Unique identifier assigned to each individual.

- **Gender**  
  Biological sex of the individual (Male or Female).

- **Age**  
  Age of the individual in years.

- **Occupation**  
  Job or professional category of the individual.

### Sleep-Related Features
- **Sleep Duration**  
  Average number of hours the individual sleeps per day.

- **Quality of Sleep**  
  Subjective sleep quality rating on a scale from 1 to 10.

- **Sleep Disorder**  
  Indicates whether the individual has a diagnosed sleep disorder
  such as Insomnia or Sleep Apnea.
  Missing values likely represent individuals without sleep disorders.
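
If missing `Sleep Disorder` entries are read as "no diagnosed disorder", they can be given an explicit label so that this group appears in counts and group comparisons. The sketch below uses a made-up frame. The label `"None"` and the interpretation itself are assumptions, as noted in the description above.

```python
import pandas as pd

# Made-up stand-in for the Sleep Disorder column.
demo = pd.DataFrame({"Sleep Disorder": [None, "Sleep Apnea", None, "Insomnia"]})

# Assumption: a missing entry means "no diagnosed sleep disorder".
demo["Sleep Disorder"] = demo["Sleep Disorder"].fillna("None")

print(demo["Sleep Disorder"].value_counts())
```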

### Lifestyle and Health Indicators
- **Physical Activity Level**  
  Numeric indicator of daily physical activity intensity or duration.

- **Stress Level**  
  Self-reported stress level on a numeric scale.
  Higher values indicate higher stress.

- **BMI Category**  
  Body Mass Index classification (Normal, Overweight, Obese).

- **Blood Pressure**  
  Blood pressure measurement recorded as systolic/diastolic
  (for example, 120/80).

- **Heart Rate**  
  Resting heart rate measured in beats per minute (BPM).

- **Daily Steps**  
  Average number of steps taken per day.
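
Since `Blood Pressure` is stored as a `systolic/diastolic` string, it cannot enter correlation analysis directly. One possible approach, sketched below on three values taken from the value counts, is to split it into two integer columns. The names `Systolic BP` and `Diastolic BP` are illustrative and not part of the dataset.

```python
import pandas as pd

demo = pd.DataFrame({"Blood Pressure": ["126/83", "125/80", "140/90"]})

# Split "systolic/diastolic" strings into two integer columns.
# The new column names are illustrative assumptions.
bp = demo["Blood Pressure"].str.split("/", expand=True).astype(int)
demo["Systolic BP"] = bp[0]
demo["Diastolic BP"] = bp[1]

print(demo)
```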


## Data Quality Checks

Before proceeding to exploratory analysis, we perform additional data quality checks
to identify potential issues such as:
- Inconsistent values
- Unexpected ranges
- Features requiring transformation
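
A check for unexpected ranges could look like the sketch below. The bounds in `expected_ranges` are hypothetical plausibility limits chosen for illustration, and `demo` is a made-up frame with two deliberately invalid values.

```python
import pandas as pd

# Hypothetical plausibility bounds; adjust them to the study's own definitions.
expected_ranges = {
    "Sleep Duration": (0, 24),    # hours per day
    "Quality of Sleep": (1, 10),  # rating scale from the column description
    "Heart Rate": (30, 220),      # beats per minute
}

# Made-up frame; 25.0 hours and a rating of 11 are deliberately invalid.
demo = pd.DataFrame({
    "Sleep Duration": [6.1, 7.8, 25.0],
    "Quality of Sleep": [6, 9, 11],
    "Heart Rate": [77, 70, 68],
})

# Count the values that fall outside each (inclusive) range.
out_of_range = {
    col: int((~demo[col].between(low, high)).sum())
    for col, (low, high) in expected_ranges.items()
}
print(out_of_range)
```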


In [174]:
categorical_cols = df.select_dtypes(include="str").columns  # pandas 3: select string columns explicitly
categorical_cols


Index(['Gender', 'Occupation', 'BMI Category', 'Blood Pressure',
       'Sleep Disorder'],
      dtype='str')

In [177]:
for col in categorical_cols:
    print(f"\n{col}")
    print(df[col].value_counts())


Gender
Gender
Male      189
Female    185
Name: count, dtype: int64

Occupation
Occupation
Nurse                   73
Doctor                  71
Engineer                63
Lawyer                  47
Teacher                 40
Accountant              37
Salesperson             32
Software Engineer        4
Scientist                4
Sales Representative     2
Manager                  1
Name: count, dtype: int64

BMI Category
BMI Category
Normal        216
Overweight    148
Obese          10
Name: count, dtype: int64

Blood Pressure
Blood Pressure
130/85    99
125/80    65
140/95    65
120/80    45
115/75    32
135/90    27
140/90     4
125/82     4
132/87     3
128/85     3
126/83     2
130/86     2
117/76     2
131/86     2
128/84     2
135/88     2
129/84     2
115/78     2
119/77     2
142/92     2
139/91     2
118/75     2
118/76     1
121/79     1
122/80     1
Name: count, dtype: int64

Sleep Disorder
Sleep Disorder
None           219
Sleep Apnea     78
Insomnia        77
Name: count, dtype: int64

In [178]:
numerical_cols = df.select_dtypes(include=['int64', 'float64']).columns
numerical_cols

Index(['Person ID', 'Age', 'Sleep Duration', 'Quality of Sleep',
       'Physical Activity Level', 'Stress Level', 'Heart Rate', 'Daily Steps'],
      dtype='str')

In [185]:
'Minimal', df[numerical_cols].min(), 'Maximal', df[numerical_cols].max()

('Minimal',
 Person ID                     1.0
 Age                          27.0
 Sleep Duration                5.8
 Quality of Sleep              4.0
 Physical Activity Level      30.0
 Stress Level                  3.0
 Heart Rate                   65.0
 Daily Steps                3000.0
 dtype: float64,
 'Maximal',
 Person ID                    374.0
 Age                           59.0
 Sleep Duration                 8.5
 Quality of Sleep               9.0
 Physical Activity Level       90.0
 Stress Level                   8.0
 Heart Rate                    86.0
 Daily Steps                10000.0
 dtype: float64)
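
The same minimum and maximum values can be shown side by side with `DataFrame.agg`, which yields one row per column instead of a tuple of two Series. This is an alternative sketch on a small stand-in frame, not the call used in the cell above.

```python
import pandas as pd

# Small stand-in frame; with the real data this would be df[numerical_cols].
demo = pd.DataFrame({
    "Age": [27, 43, 59],
    "Sleep Duration": [5.8, 7.2, 8.5],
})

# One row per column, with "min" and "max" as labelled columns.
ranges = demo.agg(["min", "max"]).T
print(ranges)
```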