# **<font color=#8e44ad>Sleep Health and Lifestyle</font>**

*Project by Julien Guyet, Preetha Pallavi, Malika Matissa, Chorten Tsomo Tamang*

## **<font color=#8e44ad>Introduction</font>**

In this project we study the impact of an individual's lifestyle on their sleep, and try to understand which factors are associated with experiencing a sleep disorder.
The dataset is composed of 13 columns and 374 rows. Columns are split into two categories:

**<font color=#ff7400>Categorical</font>**:
- Gender
- Occupation: the job of the individual
- BMI Category: is the person overweight, underweight, etc. 
- Blood Pressure: the systolic over diastolic measurements, defined by the [American Heart Association](https://www.heart.org/en/health-topics/high-blood-pressure/understanding-blood-pressure-readings) as:
  - Systolic blood pressure: "indicates how much pressure your blood is exerting against your artery walls when the heart contracts".
  - Diastolic blood pressure: "indicates how much pressure your blood is exerting against your artery walls while the heart muscle is resting between contractions".
- Sleep Disorder: the type of disorder the individual suffers from. Unique values are: None, Sleep Apnea and Insomnia.

**<font color=#ff7400>Quantitative</font>**:
- Age
- Sleep duration: how many hours per night the individual sleeps.
- Quality of sleep: a rating from 1 to 10
- Physical Activity Level: the number of minutes an individual performs a physical activity per day
- Stress Level: a rating from 1 to 10
- Heart rate: the resting rate of a patient in beats per minute. 
- Daily steps

The raw data was created and synthesized by Laksika Tharmalingam for illustrative purposes. It is expected to be updated every quarter, and the full dataset can be found [here](https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset).

Finally, the data has a CC0 license which means the owner dedicated it to the Public Domain.

Some important definitions to keep in mind for the following analysis:
- Sleep Apnea: according to the United Kingdom's National Health Service, "[Sleep apnoea is when your breathing stops and starts while you sleep. The most common type is called obstructive sleep apnoea (OSA)](https://www.nhs.uk/conditions/sleep-apnoea/)."
- Insomnia: according to the US National Heart, Lung, and Blood Institute, someone facing insomnia ["may have trouble falling asleep, staying asleep, or getting good quality sleep"](https://www.nhlbi.nih.gov/health/insomnia).

## **<font color=#8e44ad>Research Question</font>**

#### **How can someone's lifestyle help predict a sleep disorder?**

**Subheading ideas**
1. **<font color=#ff7400>Sleep Disorder by Gender and by Age</font>**: our hypothesis is that **gender and age** should (indirectly or not) **play a role**.
- First, men and women tend to be overrepresented in different job categories, and some of those jobs might affect sleep health.
- Second, older people should be more affected by sleep disorders than young adults, as a population is generally more exposed to health issues with age.

2. **<font color=#ff7400>Occupational Stress and Sleep Health</font>**: we suppose that **occupational stress levels** vary across different professions and are **associated with** differences in **sleep disorder outcomes**.
- Individuals in high-stress occupations (e.g., emergency responders, financial traders) will report higher levels of stress and poorer sleep quality compared to those in lower-stress occupations (e.g., administrative roles, creative professions).

3. **<font color=#ff7400>Physical Activity and Stress Level</font>**: how active a person is should **impact the stress level** and help to reduce it.
- Some studies have already shown how beneficial physical activity is when it comes to dealing with [stress](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4013452/) and [anxiety](https://www.lemonde.fr/sciences/article/2021/11/30/l-activite-physique-a-un-effet-anxiolytique_6104117_1650684.html).

4. **<font color=#ff7400>Stress and Sleep Quality</font>**: stress should be correlated to **shorter sleep duration** and have a huge impact on sleep quality.
- Intense stress might lead to sleep issues and increase the chance of a disorder being present.

5. **<font color=#ff7400>BMI Category and Sleep Disorder Risk</font>**: we suppose that there is an **association between BMI category** and the risk of experiencing **sleep disorders**, such as insomnia and sleep apnea.
- Individuals in higher BMI categories (e.g., overweight, obese) will have a higher prevalence of sleep disorders compared to those in lower BMI categories (e.g., normal weight, underweight).

## **<font color=#8e44ad>Dataset Overview</font>**

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
import plotly.graph_objects as go
import plotly.express as px

In [2]:
# loading the dataset
df = pd.read_csv('../data/Sleep_health_and_lifestyle_dataset.csv')

In [3]:
# displaying first 5 rows of the dataset
df.head(5)

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea


In [4]:
df.shape

(374, 13)

In [5]:
missing_values = df.isnull().any()
print(missing_values)

Person ID                  False
Gender                     False
Age                        False
Occupation                 False
Sleep Duration             False
Quality of Sleep           False
Physical Activity Level    False
Stress Level               False
BMI Category               False
Blood Pressure             False
Heart Rate                 False
Daily Steps                False
Sleep Disorder              True
dtype: bool
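
`isnull().any()` only tells us whether a column has missing values, not how many. A minimal sketch of how the count and the share could be obtained, assuming a toy frame (the values below are made up for illustration and are not the real data):

```python
import numpy as np
import pandas as pd

# Toy frame standing in for df; values are illustrative only
toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Female", "Female"],
    "Sleep Disorder": [np.nan, "Sleep Apnea", np.nan, "Insomnia"],
})

missing_counts = toy.isnull().sum()         # number of missing cells per column
missing_share = toy.isnull().mean() * 100   # same information as a percentage of rows

print(missing_counts)
print(missing_share)
```

On the real dataset, the same two calls on `df` would show how many rows have no recorded sleep disorder.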


In [6]:
num_columns = len(df.columns)

print(f"The dataset has {num_columns} columns:")
for i, col in enumerate(df.columns):
    if i % 5 == 0:  # Add newline every 5 columns
        print()
    print(col, ", column type: ", df[col].dtypes, end=", \n")

The dataset has 13 columns:

Person ID , column type:  int64, 
Gender , column type:  object, 
Age , column type:  int64, 
Occupation , column type:  object, 
Sleep Duration , column type:  float64, 

Quality of Sleep , column type:  int64, 
Physical Activity Level , column type:  int64, 
Stress Level , column type:  int64, 
BMI Category , column type:  object, 
Blood Pressure , column type:  object, 

Heart Rate , column type:  int64, 
Daily Steps , column type:  int64, 
Sleep Disorder , column type:  object, 
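
`Blood Pressure` is read as `object` because it is stored as a `systolic/diastolic` string. A minimal sketch of how it could be split into two numeric columns if numeric values were needed later. The helper name `split_blood_pressure` and the column names `Systolic` and `Diastolic` are our own assumptions for illustration; the three readings are copied from the first rows shown above.

```python
import pandas as pd


def split_blood_pressure(frame: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with hypothetical 'Systolic' and 'Diastolic' integer columns."""
    out = frame.copy()
    # "126/83" -> two string columns, 0 = systolic and 1 = diastolic
    parts = out["Blood Pressure"].str.split("/", expand=True)
    out["Systolic"] = parts[0].astype(int)
    out["Diastolic"] = parts[1].astype(int)
    return out


# Toy frame with readings that appear in the first rows of the dataset
toy = pd.DataFrame({"Blood Pressure": ["126/83", "125/80", "140/90"]})
bp = split_blood_pressure(toy)
print(bp)
```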


In [7]:
unique_sleep_disorders = list(df["Sleep Disorder"].unique())

print(f"Unique Sleep Disorder in the dataset are:\n{unique_sleep_disorders}")

Unique Sleep Disorder in the dataset are:
[nan, 'Sleep Apnea', 'Insomnia']


### **<font color=#8e44ad>Data Distribution</font>**

In [8]:
# Missing values mean "no disorder"; label them "None" so value_counts() keeps them
sleep_disorder_counts = df["Sleep Disorder"].fillna("None").value_counts()
sleep_disorder_categories = sleep_disorder_counts.index

fig = go.Figure()

fig.add_trace(go.Bar(x=sleep_disorder_categories, y=sleep_disorder_counts, text=sleep_disorder_counts))

fig.update_layout(width=800, height=600, barmode="stack", 
                title={'text': "Type of Sleep Disorder",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="Disorder", yaxis_title="Occurrences")
fig.show()

###### code to center graph title adapted from plotly documentation [here](https://plotly.com/python/figure-labels/)

As we can see, the "None" category is overrepresented compared to the others, with 219 occurrences. That represents about 58% of our data (we have 374 rows in total). This makes sense (as with any disease, only a fraction of the population suffers from it), but we have to be careful about the insights we extract from this dataset. Cross-validation with additional sources would be beneficial.
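
The share quoted above is a simple ratio. A minimal sketch of the arithmetic, followed by one way the same share could be computed from data; the toy Series is made up for illustration.

```python
import numpy as np
import pandas as pd

# Figures quoted in the text: 219 "None" rows out of 374
none_share = 219 / 374 * 100
print(f"{none_share:.1f}%")

# Same idea on data: label missing values, then ask for normalized counts
toy = pd.Series([np.nan, np.nan, np.nan, "Sleep Apnea", "Insomnia"], name="Sleep Disorder")
toy_shares = toy.fillna("None").value_counts(normalize=True) * 100
print(toy_shares)
```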

We will now attempt to explore the dependencies between this variable and some others like the gender or the age of the individual. 

## **<font color=#8e44ad>1. Sleep Disorder by Gender and Age</font>**

### *What role do gender and age play in someone's sleep?*

First, we will create age groups to make our graphs easier to read and understand.

In [9]:
data = df.copy()

ages = list(data["Age"].unique())
ages.sort()

print(f"Here are the unique values for the Age column by asc order:\n{ages}")

Here are the unique values for the Age column by asc order:
[27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45, 48, 49, 50, 51, 52, 53, 54, 55, 56, 57, 58, 59]


In [10]:
bins= [20,30,40,50,60]
labels = ["20-30","30-40","40-50","50-60"]
data["Age_Group"] = pd.cut(data["Age"], bins=bins, labels=labels, right=False)
age_groups = data["Age_Group"].unique()
print(f"Here are the age groups: {age_groups}")

Here are the age groups: ['20-30', '30-40', '40-50', '50-60']
Categories (4, object): ['20-30' < '30-40' < '40-50' < '50-60']


###### this code was adapted from jezrael's solution proposed on stackoverflow [here](https://stackoverflow.com/questions/52753613/grouping-categorizing-ages-column)
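
With `right=False`, each interval includes its left edge and excludes its right edge, so an individual aged exactly 30 falls into "30-40". A minimal sketch comparing both settings on a few hand-picked ages (the ages are chosen for illustration only):

```python
import pandas as pd

bins = [20, 30, 40, 50, 60]
labels = ["20-30", "30-40", "40-50", "50-60"]
ages = pd.Series([27, 29, 30, 39, 40, 59])

# right=False -> [20, 30), [30, 40), ... (the setting used in this notebook)
left_closed = pd.cut(ages, bins=bins, labels=labels, right=False)
# right=True  -> (20, 30], (30, 40], ... (the pandas default)
right_closed = pd.cut(ages, bins=bins, labels=labels, right=True)

print(pd.DataFrame({"Age": ages, "right=False": left_closed, "right=True": right_closed}))
```

Only the boundary ages (30 and 40 here) change group between the two settings.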

In [11]:
# Keep only rows with a recorded disorder (missing values mean no disorder)
sleep_disorder_only = data.loc[data["Sleep Disorder"].notna()]
unique_disorder_values = sleep_disorder_only["Sleep Disorder"].unique()
print(f"Here are the unique values for Sleep Disorder column: {unique_disorder_values}")

Here are the unique values for Sleep Disorder column: ['Sleep Apnea' 'Insomnia']


In [12]:
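# groupby() drops missing keys, so label missing disorders as "None" first
gender_sleep_disorder_counts = data.fillna({"Sleep Disorder": "None"}).groupby(["Gender", "Sleep Disorder"]).size().unstack(fill_value=0)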
gender_total_counts = gender_sleep_disorder_counts.sum(axis=1)
gender_sleep_disorder_percentages = gender_sleep_disorder_counts.div(gender_total_counts, axis=0) * 100

fig = go.Figure()

for disorder in gender_sleep_disorder_percentages.columns:
    if disorder == "None":
        fig.add_trace(go.Bar(x=gender_sleep_disorder_percentages.index, y=gender_sleep_disorder_percentages["None"], name="No Sleep Disorder", text=round(gender_sleep_disorder_percentages["None"], 2)))
    else:
        fig.add_trace(go.Bar(x=gender_sleep_disorder_percentages.index, y=gender_sleep_disorder_percentages[disorder], name=disorder, text=round(gender_sleep_disorder_percentages[disorder], 2)))

fig.update_layout(width=800, height=600, barmode="stack", 
                title={'text': "Sleep Disorder by Gender (in percentage)",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="Gender", yaxis_title="Percentage")
fig.show()


###### this code was adapted from vestland's solution proposed on stackoverflow [here](https://stackoverflow.com/questions/66496583/plotly-how-to-add-data-labels-to-stacked-bar-charts-using-go-bar)

It is really interesting to notice that **72%** of men **<font color=#ff7400>do not</font>** experience a sleep disorder. On the other hand, **66%** of women **<font color=#ff7400>are dealing with</font>** one.

Sleep apnea is the main disorder for women, whereas men are more often affected by insomnia. We should keep these results in mind as we pursue the analysis and try to understand why we see them: are women overrepresented in a certain age category? Are men underrepresented in some professional categories? Is it something else?

It is also important to stay critical and put these results in the context of the data, so we should check how many occurrences we have for each gender.
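
The row percentages plotted above can also be obtained in one call with `pd.crosstab`. A minimal sketch, assuming a toy frame in which missing disorders have already been labelled "None" (the rows are made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Gender": ["Male", "Male", "Male", "Male",
               "Female", "Female", "Female", "Female"],
    "Sleep Disorder": ["None", "None", "None", "Insomnia",
                       "None", "Sleep Apnea", "Sleep Apnea", "Insomnia"],
})

# normalize="index" divides each row by its own total, giving shares per gender
row_pct = pd.crosstab(toy["Gender"], toy["Sleep Disorder"], normalize="index") * 100
print(row_pct.round(1))
```

Each row sums to 100, which makes it easy to check that no category was silently dropped.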

In [13]:
gender_counts = data['Gender'].value_counts()

fig = go.Figure(go.Bar(x=gender_counts.index, y=gender_counts.values, text=gender_counts))
fig.update_layout(title="Occurrences per Gender", xaxis_title="Gender", yaxis_title="Count")
fig.show()

Good news: we have a fairly **balanced dataset**, with almost as many occurrences for the "Male" and "Female" categories. We can move on to the next steps with confidence.

We will now analyse the sleep disorders individuals have by age and see whether particular trends apply to some age categories.

In [14]:
fig = go.Figure(go.Histogram(x=sleep_disorder_only["Age"], y=sleep_disorder_only["Sleep Disorder"]))
fig.update_layout(title="Distribution of Sleep Disorder by Age")
fig.show()

As we can see in this distribution, young adults are **less subject** to sleep disorders than older individuals, with an obvious **<font color=#ff7400>peak for individuals between 40 and 45</font>**. Let's investigate further by looking at which disorders are the most frequent in each age group.

In [15]:
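# groupby() drops missing keys, so label missing disorders as "None" first
age_sleep_disorder_counts = data.fillna({"Sleep Disorder": "None"}).groupby(["Age_Group", "Sleep Disorder"]).size().unstack(fill_value=0)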
age_total_counts = age_sleep_disorder_counts.sum(axis=1)
age_sleep_disorder_percentages = age_sleep_disorder_counts.div(age_total_counts, axis=0) * 100

fig = go.Figure()

for disorder in age_sleep_disorder_percentages.columns:
    if disorder == "None":
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages["None"], name="No Sleep Disorder", text=round(age_sleep_disorder_percentages["None"])))
    else:
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages[disorder], name=disorder, text=round(age_sleep_disorder_percentages[disorder])))

fig.update_layout(width=800, height=700, barmode="stack", 
                title={'text': "Sleep Disorder by Age Group (in percentage)",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="Age Group", yaxis_title="Percentage")
fig.show()


When we dive into the age group level, we clearly see a difference in the type of sleep disorder:
- Insomnia is the most represented in the 40-50 age group.
- People above 50 years old suffer more from sleep apnea than other age categories.
- Despite being less subject to sleep disorders, when young people do suffer from one, it is mainly sleep apnea.

If we refer to the sleep apnea web page from [Wikipedia](https://en.wikipedia.org/wiki/Sleep_apnea), we can read that "causes of obstructive sleep apnea are complex and individualized, but typical risk factors include narrow pharyngeal anatomy and craniofacial structure. When anatomical risk factors are combined with non-anatomical contributors such as an ineffective pharyngeal dilator muscle function during sleep, unstable control of breathing (high loop gain), and premature awakening to mild airway narrowing, the severity of the OSA rapidly increases as more factors are present".

This is consistent with our result that sleep apnea is most common among people above 50 years old, as muscular and organ-related issues tend to become more common as the human body ages.

Let's observe these results at the gender level; maybe we will discover some more interesting facts.

In [16]:
women_data = data.loc[data["Gender"] == "Female"]
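# groupby() drops missing keys, so label missing disorders as "None" first
age_sleep_disorder_counts = women_data.fillna({"Sleep Disorder": "None"}).groupby(["Age_Group", "Sleep Disorder"]).size().unstack(fill_value=0)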
age_total_counts = age_sleep_disorder_counts.sum(axis=1)
age_sleep_disorder_percentages = age_sleep_disorder_counts.div(age_total_counts, axis=0) * 100

fig = go.Figure()

for disorder in age_sleep_disorder_percentages.columns:
    if disorder == "None":
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages["None"], name="No Sleep Disorder", text=round(age_sleep_disorder_percentages["None"],2)))
    else:
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages[disorder], name=disorder, text=round(age_sleep_disorder_percentages[disorder],2)))

fig.update_layout(width=800, height=700, barmode="stack", 
                title={'text': "Sleep Disorder for Women by Age Group (in percentage)",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="Age Group", yaxis_title="Percentage")
fig.show()

In [17]:
male_data = data.loc[data["Gender"] == "Male"]
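# groupby() drops missing keys, so label missing disorders as "None" first
age_sleep_disorder_counts = male_data.fillna({"Sleep Disorder": "None"}).groupby(["Age_Group", "Sleep Disorder"]).size().unstack(fill_value=0)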
age_total_counts = age_sleep_disorder_counts.sum(axis=1)
age_sleep_disorder_percentages = age_sleep_disorder_counts.div(age_total_counts, axis=0) * 100

fig = go.Figure()

for disorder in age_sleep_disorder_percentages.columns:
    if disorder == "None":
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages["None"], name="No Sleep Disorder", text=round(age_sleep_disorder_percentages["None"], 2)))
    else:
        fig.add_trace(go.Bar(x=age_sleep_disorder_percentages.index, y=age_sleep_disorder_percentages[disorder], name=disorder, text=round(age_sleep_disorder_percentages[disorder], 2)))

fig.update_layout(width=800, height=800, barmode="stack", 
                title={'text': "Sleep Disorder for Men by Age Group (in percentage)",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="Age Group", yaxis_title="Percentage")
fig.show()

In [18]:
unique_ages_for_men = list(male_data["Age"].unique())
print(f"The unique ages represented in the dataset for men are: {unique_ages_for_men}")

The unique ages represented in the dataset for men are: [27, 28, 29, 30, 31, 32, 33, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 48, 49]


Let's break down what we understand from these two graphs:
- Insomnia is the most common sleep disorder for the 40-50 age group, regardless of gender.
- Women are more subject to sleep apnea. This is particularly true for the 20-30 and 50-60 age groups.

However, it is important to note that (i) **<font color=#ff7400>data is missing</font>** for men above 50 years old, while **<font color=#ff7400>92%</font>** of individuals above 50 experience a sleep disorder; and (ii) there **<font color=#ff7400>is not any "No Sleep Disorder"</font>** label for women of the 20-30 age group. Would this mean that women in their twenties always suffer from either insomnia or sleep apnea? This is very doubtful and highlights again the need for some cross-validation of our data.

Obviously, these two remarks will affect the rest of our analysis and any conclusions we might draw.
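
Gaps like these can be spotted before plotting by counting rows per age group and gender. A minimal sketch, assuming a toy frame with plain string age groups (in the notebook `Age_Group` is categorical; the rows here are made up to mimic the missing "50-60 / Male" cell):

```python
import pandas as pd

labels = ["20-30", "30-40", "40-50", "50-60"]
toy = pd.DataFrame({
    "Age_Group": ["20-30", "30-40", "40-50", "50-60", "20-30", "30-40", "40-50"],
    "Gender":    ["Female", "Female", "Female", "Female", "Male", "Male", "Male"],
})

# Rows = age groups, columns = genders, values = number of individuals
cell_counts = (
    toy.groupby(["Age_Group", "Gender"]).size()
       .unstack(fill_value=0)
       .reindex(labels, fill_value=0)
)

# List every (age group, gender) combination without any observation
empty_cells = [
    (age, gender)
    for age in cell_counts.index
    for gender in cell_counts.columns
    if cell_counts.loc[age, gender] == 0
]
print(cell_counts)
print(empty_cells)
```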

## **<font color=#8e44ad>2. Stress and Sleep Quality</font>**

### *Are high stress levels negatively associated with sleep duration and positively associated with the likelihood of experiencing sleep disorders?*

In [19]:
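# Replace missing values in 'Sleep Disorder' column with 'No Disorder'
# (assignment instead of chained inplace=True, which recent pandas versions deprecate)
data['Sleep Disorder'] = data['Sleep Disorder'].fillna('No Disorder')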
data

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder,Age_Group
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,No Disorder,20-30
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,20-30
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,20-30
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,20-30
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,20-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
369,370,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,50-60
370,371,Female,59,Nurse,8.0,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,50-60
371,372,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,50-60
372,373,Female,59,Nurse,8.1,9,75,3,Overweight,140/95,68,7000,Sleep Apnea,50-60


In [20]:
histogram_trace = go.Histogram2d(x=data['Stress Level'], y=data['Sleep Duration'], 
                                 colorscale='greens')

# Create layout
layout = go.Layout(
    title='Distribution of Stress Level and Sleep Duration',
    xaxis=dict(title='Stress Level'),
    yaxis=dict(title='Sleep Duration'),
)

fig = go.Figure(data=[histogram_trace], layout=layout)

fig.show()


The 2D histogram shows how observations are spread across combinations of stress level and sleep duration. Darker colors represent areas with higher concentrations of data points, while lighter colors represent areas with fewer data points.
The pattern suggests a negative association between the two variables: higher stress levels tend to coincide with shorter sleep and, conversely, individuals with lower stress levels tend to have a longer sleep duration.
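
The visual impression can be backed by a correlation coefficient. A minimal sketch, assuming toy values (made up for illustration): Pearson's r for the linear relationship, and a rank-based (Spearman-style) coefficient computed as Pearson on ranks, which suits an ordinal rating such as the stress level.

```python
import pandas as pd

toy = pd.DataFrame({
    "Stress Level":   [3, 4, 5, 6, 7, 8],
    "Sleep Duration": [8.1, 7.8, 7.2, 6.5, 6.2, 5.9],
})

# Linear association (Pearson is the default method of Series.corr)
pearson_r = toy["Stress Level"].corr(toy["Sleep Duration"])

# Monotonic association: Pearson correlation computed on the ranks
spearman_r = toy["Stress Level"].rank().corr(toy["Sleep Duration"].rank())

print(f"Pearson r = {pearson_r:.2f}, rank correlation = {spearman_r:.2f}")
```

A value close to -1 would support the negative association; a value near 0 would mean the heatmap pattern is weaker than it looks.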

In [21]:
# Calculate average sleep duration for each sleep disorder category
average_duration = data.groupby('Sleep Disorder')['Sleep Duration'].mean()

line_trace = go.Scatter(x=average_duration.index, y=average_duration.values, 
                        mode='lines+markers', marker=dict(color='blue'))

layout = go.Layout(
    title='Average Sleep Duration by Sleep Disorders',
    xaxis=dict(title='Sleep Disorder'),
    yaxis=dict(title='Average Sleep Duration'),
)

fig = go.Figure(data=[line_trace], layout=layout)

fig.show()


The analysis of average sleep duration by sleep disorder category revealed a potential association between reported sleep disorders and reduced sleep duration. Individuals with reported Insomnia or Sleep Apnea exhibited lower average sleep durations compared to those with no reported sleep disorder.

In [22]:
light_colors = {'Insomnia': 'red', 'Sleep Apnea': 'orange', 'No Disorder': 'green'}
scatter_traces = []
for category, group in data.groupby('Sleep Disorder'):
    trace = go.Scatter(
        x=group['Stress Level'],
        y=group['Sleep Duration'],
        mode='markers',
        marker=dict(color=light_colors.get(category, 'black')),
        name=category
    )
    scatter_traces.append(trace)

fig = go.Figure(data=scatter_traces)

fig.update_layout(
    title='Distribution of Sleep Disorders in context to stress level and sleep duration',
    xaxis_title='Stress Level',
    yaxis_title='Sleep Duration (hours)'
)

fig.show()

There is an emerging trend: individuals reporting higher stress levels (rated 7 and 8) tend to have shorter sleep durations and a higher prevalence of sleep apnea. Conversely, individuals with lower stress levels are not consistently exempt from sleep disorders, as shown by cases of sleep apnea despite relatively low stress. These observations underline the importance of stress as a potential influence on both sleep patterns and sleep disorders.

## **<font color=#8e44ad>3. Occupational Stress and Sleep Health</font>**

### *Is the profession of an individual linked to the stress level and somehow impacting sleep quality?*

In this section, we will explore the potential relationship between occupational stress and sleep health. By analyzing data on various professions, we want to discover whether there is a link between an individual's profession, their stress level, and the quality of their sleep. We also seek to understand whether certain professions exhibit more sleep disorders and how this may influence sleep.

Let's first see how these disorders are distributed among the different occupations. Two types of sleep disorder are highlighted in this dataset: Sleep apnea and Insomnia. 

In [23]:
df['Sleep Disorder'] = df['Sleep Disorder'].fillna('No Disorder')


occupation_sleep_disorder_counts = df.groupby('Occupation')['Sleep Disorder'].value_counts().unstack(fill_value=0)

fig = go.Figure()

for disorder in occupation_sleep_disorder_counts.columns:
    fig.add_trace(go.Bar(
        y=occupation_sleep_disorder_counts.index,
        x=occupation_sleep_disorder_counts[disorder],
        name=disorder,
        orientation='h', 
        text=occupation_sleep_disorder_counts[disorder].round(2).astype(str),
        hoverinfo='text'
    ))

fig.update_layout(
    title='Sleep Disorder by Occupation',
    xaxis_title='Number of individuals',
    yaxis_title='Occupation',
    barmode='stack',
    legend_title='Sleep Disorder Type',
    
)

fig.show()


Sales representatives are the most likely to have a sleep disorder, with 100% of them having sleep apnea, but the dataset contains only 2 of them. Next, Salesperson is the occupation with the highest share of sleep disorders, at 93.8% ((29 insomnia + 1 sleep apnea) / 32 people = 0.9375), followed by nurses at 87.7% ((3 insomnia + 61 sleep apnea) / 73 people = 0.8767) and teachers (67.5%).
In contrast, lawyers, engineers, doctors and accountants are less likely to have a sleep disorder.
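
The shares quoted above are the number of individuals with any disorder divided by the number of individuals in the occupation. A minimal sketch of that arithmetic, using only the counts given in the text (the dictionary layout is our own choice for illustration):

```python
# Counts quoted in the text: insomnia + sleep apnea cases, and group size
counts = {
    "Salesperson": {"with_disorder": 29 + 1, "total": 32},
    "Nurse":       {"with_disorder": 3 + 61, "total": 73},
}

# Share of individuals with a sleep disorder, in percent
shares = {
    job: c["with_disorder"] / c["total"] * 100
    for job, c in counts.items()
}

for job, share in shares.items():
    print(f"{job}: {share:.2f}%")
```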


In [24]:
fig = go.Figure()
fig.add_trace(go.Box(y=df['Occupation'], x=df['Stress Level'], 
                     orientation='h', name='Stress Level'))

fig.update_layout(title='Stress Level by Occupation',
                  xaxis_title='Stress Level Score',
                  yaxis_title='Occupation')

fig.show()


We consider the professions experiencing the highest level of stress to be those with a median stress level above 5. On that basis, the high-stress professions are scientists, doctors, and software engineers, but also salespersons, sales representatives, and nurses, which were the professions with the most sleep disorders.
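
The "median above 5" rule can be applied directly to the data instead of being read off the box plot. A minimal sketch, assuming a toy frame (the stress scores are made up for illustration) and the threshold of 5 stated above:

```python
import pandas as pd

toy = pd.DataFrame({
    "Occupation":   ["Nurse", "Nurse", "Nurse", "Teacher", "Teacher", "Teacher"],
    "Stress Level": [3, 6, 8, 4, 4, 5],
})

# Median stress level per occupation
median_stress = toy.groupby("Occupation")["Stress Level"].median()

# Occupations whose median is strictly above the threshold
high_stress = median_stress[median_stress > 5].index.tolist()

print(median_stress)
print(high_stress)
```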

Now, we want to check how sleep disorders and occupational stress affect the sleep quality and sleep duration of individuals.

In [25]:
average_sleep_duration = df.groupby("Occupation").agg({'Sleep Duration': ['mean', 'std']}).reset_index()

fig = go.Figure()

fig.update_layout(title='Average Sleep Duration by Occupation',
                  xaxis_title='Occupation',
                  yaxis_title='Average Sleep Duration')

fig.add_trace(go.Bar(x=average_sleep_duration['Occupation'], y=average_sleep_duration['Sleep Duration']['mean'],
                  error_y=dict(type='data', array=average_sleep_duration['Sleep Duration']['std']),
                  name='Sleep Duration', marker_color='lightgreen'))

fig.show()

In [26]:
# Mean and standard deviation of sleep quality per occupation
average_sleep_quality = df.groupby("Occupation").agg({'Quality of Sleep': ['mean', 'std']}).reset_index()

fig = go.Figure()

fig.update_layout(title='Average Sleep Quality by Occupation',
                  xaxis_title='Occupation',
                  yaxis_title='Average Sleep Quality')

fig.add_trace(go.Bar(x=average_sleep_quality['Occupation'], y=average_sleep_quality['Quality of Sleep']['mean'],
                  error_y=dict(type='data', array=average_sleep_quality['Quality of Sleep']['std']),
                  name='Quality of Sleep', marker_color='lightgreen'))

fig.show()


Looking at the sales representative data, this occupation appears to have the highest stress level score. Their sleep quality (4) and sleep duration (5.9 hours) are also the lowest of the dataset, which suggests an impact of occupational stress. As we have seen before, all sales representatives have sleep apnea.

Teachers, although they are quite prone to sleep disorders, do not show reduced sleep duration or quality. This may be due to their low stress level score of 4.

Next, with a stress level score of 7, come scientists and salespersons. They have the next lowest sleep duration (6 hours for scientists and 6.4 hours for salespersons) and quality (5 for scientists and 6 for salespersons).

The list of "high-stress" professions identified above also included doctors, with a score of 6, and software engineers, with a score of 5.5. If we consider low sleep quality to be a score of 6 or below, they still experience good sleep quality, with scores of 6.5 for software engineers and 6.64 for doctors.

Lastly, nurses reported the full range of stress level scores, from 3 to 8, with a median of 6. With a sleep quality of 7.3 and a sleep duration of 7.06 hours, their sleep does not seem to be impacted (given the assumption made above). Nursing is normally considered a very high-stress profession because of long working hours, yet the dataset tells us that nurses' sleep health is not strongly affected.

It seems that a high stress level (7 or above) impacts sleep health (duration and quality) more than simply having a sleep disorder.

## **<font color=#8e44ad>4. Physical Activity and Stress Level</font>**

### *What is the impact of physical activity on stress level and sleep health if any?*

To address the physical activity component of the research question, we will examine how physical activity, as measured by daily steps, relates to several sleep-related characteristics such as sleep quality, sleep duration, stress level, and sleep disorders.

##### Relationship between Daily Steps and Sleep Duration

###### This code was adapted from stackoverflow [here](https://stackoverflow.com/questions/73466312/how-does-scatter-plot-works-in-plotly)


In [27]:
# Scatter Plot: Relationship between Daily Steps and Sleep Duration
fig_scatter_sleep_duration = px.scatter(df, x='Daily Steps', y='Sleep Duration', 
                                        title='Relationship between Daily Steps and Sleep Duration',
                                        labels={'Daily Steps': 'Daily Steps', 'Sleep Duration': 'Sleep Duration'})
fig_scatter_sleep_duration.show()


The scatter plot shows a trend of increasing sleep duration as daily steps increase. This suggests that people who move more tend to sleep longer. There is a clear improvement from 4k steps to 6k, 7k and 8k, but we can also see a cluster at 10k steps with shorter sleep duration.
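
Because many individuals share the same step count, a per-step summary can make the trend and the dip at 10k easier to read than the raw scatter. A minimal sketch, assuming a toy frame (values made up for illustration):

```python
import pandas as pd

toy = pd.DataFrame({
    "Daily Steps":    [4000, 4000, 6000, 8000, 8000, 10000, 10000],
    "Sleep Duration": [6.0, 6.2, 7.0, 7.6, 7.8, 6.0, 8.0],
})

# One row per step count: how many individuals, and their sleep duration range
summary = toy.groupby("Daily Steps")["Sleep Duration"].agg(["count", "mean", "min", "max"])
print(summary)
```

A wide gap between `min` and `max` at a given step count points to a mixed cluster rather than a consistent drop.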

##### Relationship between Daily Steps and Quality of Sleep

In [28]:
fig_scatter_quality_sleep = px.scatter(df, x='Daily Steps', y='Quality of Sleep', 
                                       title='Daily Steps vs. Quality of Sleep',
                                       labels={'Daily Steps': 'Daily Steps', 'Quality of Sleep': 'Quality of Sleep'})
fig_scatter_quality_sleep.show()


We can determine whether there is a trend between the two variables by looking at where the bulk of the dots fall on the graph. Here, a positive association is implied: as the number of daily steps grows (moving right on the x-axis), so does the quality of sleep (moving up on the y-axis). There are, however, some points where quality drops, which would require further investigation.

##### Distribution of Sleep Disorders by Physical Activity Level

###### This code was adapted from stackoverflow [here](https://stackoverflow.com/questions/55138359/plotly-stacked-bar-chart-pandas-dataframe)

In [29]:
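df['Sleep Disorder'] = df['Sleep Disorder'].fillna('No Disorder')
# Note: bins=3 below asks pd.cut for three equal-width bins over the observed range;
# the explicit edges in `bins` are not used, so the ranges in the labels are nominal.
bins = [0, 40, 70, 100]
df['Activity Category'] = pd.cut(df['Physical Activity Level'], bins=3, labels=['Low (0-40)', 'Medium (40-70)', 'High (70-100)'])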

# Group by activity category and sleep disorder
sleep_disorder_freq = df.groupby(['Activity Category', 'Sleep Disorder']).size().unstack(fill_value=0)
#Frequency calculation
sleep_disorder_freq_percent = sleep_disorder_freq.div(sleep_disorder_freq.sum(axis=1), axis=0) * 100
sleep_disorder_freq_percent.reset_index(inplace=True)

# Melt the dataframe for plotting
melted_sleep_disorder_freq = sleep_disorder_freq_percent.melt(id_vars='Activity Category', 
                                                             var_name='Sleep Disorder', 
                                                             value_name='Percentage')

# Plotting the stacked bar chart
fig_sleep_disorder_activity = px.bar(melted_sleep_disorder_freq, 
                                     x='Activity Category', 
                                     y='Percentage', 
                                     color='Sleep Disorder', 
                                     title='Distribution of Sleep Disorders by Physical Activity Level',
                                     labels={'Percentage': 'Percentage', 
                                             'Activity Category': 'Physical Activity Level', 
                                             'Sleep Disorder': 'Sleep Disorder'},
                                     barmode='stack')

# Visualization
fig_sleep_disorder_activity.show()

Among people with a low activity level, about 48% have no disorder, around 44% suffer from insomnia and around 7% from sleep apnea.
Among people with a medium activity level, about 90% have no disorder; insomnia (around 4%) is much less frequent than in the low-activity group, and sleep apnea affects around 4%.
Among people with a high activity level, about 51% have no disorder; they have the lowest rate of insomnia (around 2%) but a high rate of sleep apnea (around 45%).

From this, we can conclude that higher physical activity levels generally correlate with a lower likelihood of experiencing sleep disorders, particularly insomnia. However, there seems to be an increase in the prevalence of sleep apnea among individuals with high activity levels, which might warrant further investigation into the specific relationship between intense physical activity and sleep apnea.

## **<font color=#8e44ad>5. BMI Category and Sleep Disorder Risk</font>**

### *Is the BMI Category a factor influencing Sleep Disorder?*

In this section we will look at the possible correlation between the weight of an individual and their sleep quality. We will start by plotting sleep disorder by BMI category.

But first, we noticed that the dataset has 4 different values for the BMI Category, and two of them are alike: "Normal" and "Normal Weight". Unfortunately, the documentation gives little information about this feature and we do not have access to the exact BMI score of each individual. We therefore decided to merge these two categories.

In [30]:
data["BMI Category"].unique()

array(['Overweight', 'Normal', 'Obese', 'Normal Weight'], dtype=object)

In [31]:
data.head()

Unnamed: 0,Person ID,Gender,Age,Occupation,Sleep Duration,Quality of Sleep,Physical Activity Level,Stress Level,BMI Category,Blood Pressure,Heart Rate,Daily Steps,Sleep Disorder,Age_Group
0,1,Male,27,Software Engineer,6.1,6,42,6,Overweight,126/83,77,4200,No Disorder,20-30
1,2,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,20-30
2,3,Male,28,Doctor,6.2,6,60,8,Normal,125/80,75,10000,No Disorder,20-30
3,4,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,20-30
4,5,Male,28,Sales Representative,5.9,4,30,8,Obese,140/90,85,3000,Sleep Apnea,20-30


In [32]:
data.loc[data['BMI Category'] == "Normal Weight", 'BMI Category'] = "Normal"

In [33]:
data["BMI Category"].unique()

array(['Overweight', 'Normal', 'Obese'], dtype=object)

In [34]:
bmi_counts = data['BMI Category'].value_counts()

fig = go.Figure(go.Bar(x=bmi_counts.index, y=bmi_counts.values, text=bmi_counts))
fig.update_layout(title="Occurrences per BMI Category", xaxis_title="BMI Category", yaxis_title="Count")
fig.show()

As we can observe, we must be careful in our analysis: there are only 10 rows in the "Obese" category, so it is doubtful that this group is representative.
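To make the small-sample caveat concrete, one could look at how wide a confidence interval for a proportion becomes with only 10 observations. The sketch below is an illustration added for this purpose, not part of the original analysis: `wilson_interval` is a hypothetical helper implementing the standard Wilson score interval, and the counts passed to it are example values, not figures read from the dataset.

```python
import math
from typing import Tuple


def wilson_interval(successes: int, n: int, z: float = 1.96) -> Tuple[float, float]:
    """Wilson score interval for a binomial proportion (z=1.96 is roughly the 95% level)."""
    if n <= 0:
        raise ValueError("n must be positive")
    p_hat = successes / n
    denom = 1 + z**2 / n
    centre = (p_hat + z**2 / (2 * n)) / denom
    half_width = z * math.sqrt(p_hat * (1 - p_hat) / n + z**2 / (4 * n**2)) / denom
    # Clamp to [0, 1] to guard against floating-point overshoot
    return max(0.0, centre - half_width), min(1.0, centre + half_width)


# Example values (assumptions for illustration): the same observed share of 80%,
# once with 10 rows and once with 200 rows.
small = wilson_interval(8, 10)
large = wilson_interval(160, 200)
print(f"n=10:  {small[0]:.2f} to {small[1]:.2f}")
print(f"n=200: {large[0]:.2f} to {large[1]:.2f}")
```

For the same observed share, the interval based on 10 rows is markedly wider than the one based on 200 rows, which is why percentages computed on the "Obese" group should be read with caution.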

In [35]:
bmi_sleep_disorder_counts = data.groupby(["BMI Category", "Sleep Disorder"]).size().unstack(fill_value=0)
bmi_total_counts = bmi_sleep_disorder_counts.sum(axis=1)
bmi_sleep_disorder_percentages = bmi_sleep_disorder_counts.div(bmi_total_counts, axis=0) * 100

fig = go.Figure()

for disorder in bmi_sleep_disorder_percentages.columns:
    # Rows without a disorder appear as "No Disorder" in data.head() above; "None" is kept as a fallback
    trace_name = "No Sleep Disorder" if disorder in ("None", "No Disorder") else disorder
    fig.add_trace(go.Bar(x=bmi_sleep_disorder_percentages.index, y=bmi_sleep_disorder_percentages[disorder], name=trace_name, text=round(bmi_sleep_disorder_percentages[disorder])))

fig.update_layout(width=800, height=600, barmode="stack", 
                title={'text': "Sleep Disorder by BMI Category (in percentage)",
                    'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                    xaxis_title="BMI Category", yaxis_title="Percentage")
fig.show()


With this graph the difference between BMI categories is very clear: **<font color=#ff7400>93%</font>** of people with a "normal" BMI **<font color=#ff7400>are not</font>** experiencing any sleep disorder. On the other hand, people who are obese or overweight are far more likely to have one. Again, we have to be careful here: in the "Obese" category no row is labelled "No Sleep Disorder", which might indicate some bias or missing data.

In [36]:
# Mean sleep duration and mean sleep quality by BMI category
from plotly.subplots import make_subplots

stats_by_bmi = data.groupby("BMI Category").agg({'Sleep Duration': ['mean', 'std'], 'Quality of Sleep': ['mean', 'std']}).reset_index()

fig = make_subplots(rows=1, cols=2, subplot_titles=["Mean Sleep Duration", "Mean Sleep Quality Score"])

# Plot 1: Mean Sleep Duration
fig.add_trace(go.Bar(x=stats_by_bmi['BMI Category'], y=stats_by_bmi['Sleep Duration']['mean'],
                  error_y=dict(type='data', array=stats_by_bmi['Sleep Duration']['std']),
                  name='Sleep Duration', marker_color='lightblue'),row=1, col=1)
fig.update_yaxes(title_text="Sleep Duration", row=1, col=1)

# Plot 2: Mean Sleep Quality
fig.add_trace(go.Bar(x=stats_by_bmi['BMI Category'], y=stats_by_bmi['Quality of Sleep']['mean'],
                  error_y=dict(type='data', array=stats_by_bmi['Quality of Sleep']['std']),
                  name='Sleep Quality', marker_color='lightcoral'),
            row=1, col=2)
fig.update_yaxes(title_text="Sleep Quality", row=1, col=2)
fig.update_layout(title={'text': "Mean Sleep Duration and Mean Sleep Quality by BMI Category",
                  'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'})
fig.show()

The average sleep duration is slightly lower for overweight and obese individuals. The gap in sleep quality is more obvious with a 7.6 average score for "normal" BMI versus 6.4 and 6.8 for Obese and Overweight categories respectively. 

These observations are consistent with our first analysis of sleep disorder by BMI category, as sleep disorders are more common in the overweight and obese categories. BMI category therefore looks like an important feature for predicting sleep disorders.

In [37]:
# Stress Level
stats_by_bmi = data.groupby("BMI Category").agg({'Stress Level': ['mean', 'std']}).reset_index()

fig = go.Figure()

fig.add_trace(go.Bar(x=stats_by_bmi['BMI Category'], y=stats_by_bmi['Stress Level']['mean'],
                error_y=dict(type='data', array=stats_by_bmi['Stress Level']['std']),
                name='Stress Level', marker_color='lightblue'))

fig.update_layout(title={'text': "Mean Stress Level by BMI Category",
                'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'},
                xaxis_title="BMI Category", yaxis_title="Average Stress Level")
fig.show()

As we can see, people who are obese or overweight report slightly higher stress than the rest of the population (5.7 vs 5.1 for the "Normal" category). We should try to understand why by looking at the correlation with other features such as occupation, age, or activity level.


In [39]:
physical_activity_by_bmi = data.groupby("BMI Category").agg({'Physical Activity Level': ['mean', 'std'], 'Daily Steps': ['mean', 'std']}).reset_index()

fig = make_subplots(rows=1, cols=2, subplot_titles=["Mean Physical Activity", "Mean Daily Steps"])

fig.add_trace(go.Bar(x=physical_activity_by_bmi['BMI Category'], y=physical_activity_by_bmi['Physical Activity Level']['mean'],
                error_y=dict(type='data', array=physical_activity_by_bmi['Physical Activity Level']['std']),
                name='Physical Activity Level', marker_color='lightblue'), row=1, col=1)
fig.update_yaxes(title_text="Physical Activity", row=1, col=1)

fig.add_trace(go.Bar(x=physical_activity_by_bmi['BMI Category'], y=physical_activity_by_bmi['Daily Steps']['mean'],
                error_y=dict(type='data', array=physical_activity_by_bmi['Daily Steps']['std']),
                name='Daily Steps', marker_color='lightcoral'), row=1, col=2)
fig.update_yaxes(title_text="Daily Steps", row=1, col=2)

fig.update_layout(title={'text': "Mean Physical Activity level by BMI Category",
                'y':0.9, 'x':0.5,'xanchor': 'center','yanchor': 'top'})
fig.show()

In [40]:
obese_population_data = data.loc[data["BMI Category"] == "Obese"]
overweight_population_data = data.loc[data["BMI Category"] == "Overweight"]
rest_population_data = data.loc[data["BMI Category"] == "Normal"]
obese_job_counts = obese_population_data['Occupation'].value_counts().head(5)
overweight_job_counts = overweight_population_data['Occupation'].value_counts().head(5)
rest_job_counts = rest_population_data['Occupation'].value_counts().head(5)

fig = make_subplots(rows=1, cols=3, subplot_titles=["Top 5 Jobs for Obese Population", "Top 5 Jobs for Overweight Population", "Top 5 Jobs for Population with no condition"])

fig.add_trace(go.Bar(x=obese_job_counts.index, y=obese_job_counts.values, text=obese_job_counts, name="Obese"), row=1, col=1)
fig.add_trace(go.Bar(x=overweight_job_counts.index, y=overweight_job_counts.values, text=overweight_job_counts, name="Overweight"), row=1, col=2)
fig.add_trace(go.Bar(x=rest_job_counts.index, y=rest_job_counts.values, text=rest_job_counts, name="Rest"), row=1, col=3)

fig.update_layout(title="Top 5 Jobs by BMI Category", showlegend=False)
fig.update_xaxes(title_text="Job", row=1, col=1)
fig.update_yaxes(title_text="Count", row=1, col=1)
fig.show()

The results above are consistent with our findings from parts 3 and 4.

First, physical activity: we observe that overweight people tend to do as much physical activity as other individuals, if not more. Only obese people are less active, with a clear gap versus the other categories. This suggests that, even for active people, weight has a bigger influence on the probability of suffering from a sleep disorder.

Secondly, occupation: the "Nurse" and "Salesperson" occupations are over-represented among overweight people in our dataset. As a reminder, these two professions were the top 2 for sleep apnea (nurses) and insomnia (salespersons).
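One possible way to quantify this kind of over-representation is to compare the share of each occupation within a BMI category with its share in the whole sample. The sketch below is an illustration and not the authors' code: the `toy` frame is made-up data that only reuses the dataset's column names, and the variable names are hypothetical.

```python
import pandas as pd

# Hypothetical toy rows, NOT taken from the dataset.
toy = pd.DataFrame({
    "Occupation": ["Nurse", "Nurse", "Nurse", "Salesperson", "Salesperson",
                   "Doctor", "Doctor", "Engineer"],
    "BMI Category": ["Overweight", "Overweight", "Normal", "Overweight", "Overweight",
                     "Normal", "Normal", "Normal"],
})

# Share of each occupation within every BMI category (each row sums to 100).
job_share_by_bmi = pd.crosstab(toy["BMI Category"], toy["Occupation"], normalize="index") * 100

# Share of each occupation in the whole sample, used as the baseline.
overall_job_share = toy["Occupation"].value_counts(normalize=True) * 100

print(job_share_by_bmi.round(1))
print(overall_job_share.round(1))
```

An occupation is over-represented in a BMI category when its share in that row is clearly above its overall share. In the toy data, nurses make up half of the overweight rows but a smaller part of the whole sample.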

## **<font color=#8e44ad>Conclusion</font>**

#### **Main Factors for Sleep Disorder**
In this analysis, we have seen that a combination of factors influences whether or not a person suffers from a sleep disorder. The main factors seem to be:

- Age: sleep disorders become more common with age and mainly affect people in the 40-50 and 50-60 age groups.
- Stress level: individuals with elevated stress levels tend to sleep less, potentially predisposing them to sleep disorders. However, it is crucial to acknowledge that this analysis is based on observational data and cannot definitively establish causation.
- BMI category: people who are obese or overweight are more likely to face a sleep disorder. However, we could not clearly identify the underlying causes.
- Daily physical activity: higher physical activity levels generally correlate with a lower likelihood of experiencing sleep disorders, particularly insomnia. However, there seems to be an increase in the prevalence of sleep apnea among individuals with high activity levels.
- Occupation: professions with higher stress levels tend to exhibit more sleep disorders, while other occupations, despite experiencing high stress, maintain relatively good sleep quality and duration. A stress level of 7 or above appears to have a more pronounced impact on sleep health than simply having a sleep disorder.

#### **Identified Limitations**
Regarding gender, it seems at first sight that women are more subject to sleep disorders. However, we notice an **<font color=#ff7400>important lack</font>** of data for men above 50 years old, while 92% of women in that age group suffer from at least one sleep disorder. This is clearly a **<font color=#ff7400>major bias</font>** in our data.

Another difficulty we faced in the analysis was establishing the relationships between (i) occupation and stress and (ii) BMI category and occupation. First, we had no additional data linked to occupation (factors such as working hours or break duration are at play here). Then, we feel that the BMI category involves other dimensions than just the weight itself, notably medical conditions and diagnoses. Finally, we did not have enough rows for the obesity category, which limits our analysis.

#### **Next Steps**
To go further, some **<font color=#ff7400>cross validation with additional data</font>** is needed. First, to obtain more observations for each of the categories cited above, essentially gender and occupation. Then, comparing our findings with medical data would be very interesting, for example to understand in more depth why and how obesity or being overweight increases the chance of suffering from sleep apnea. On the other hand, what about other BMI categories? In our dataset for example, we had no labels for "Underweight" or "Anorexia".

Finally, analysing data at a larger scale (country level?) could help to better understand the causes of sleep disorders and how they influence a population's health.
