# Sleep Data Analysis

The dataset contains 100 records, each describing the sleep and health of one of 100 unique individuals.

Because there are no timestamps and no repeated entries for the same individual, we cannot identify long-term effects of sleep quality. We can, however, examine how the features relate to one another across age ranges, genders, medical issues and physical activity.

With that in mind, these are the goals for this analysis.

Goals:

1. Identify which gender typically has higher quality sleep.
2. Visualise the relationship between age and medication usage.
3. Identify the relationship between calories burned and daily steps, extended to include age.
4. Visualise the relationship between sleep duration and sleep disorders.
5. Visualise the relationship between sleep quality and dietary habits.
6. Identify the relationship between dietary habits and medication usage.

In [1]:
import pandas as pd
import altair as alt
import numpy as np

import os 

In [2]:
dataset_dir = "/home/sam/Desktop/datasets/sleep/"
os.listdir(dataset_dir)

['Health_Sleep_Statistics.csv']

In [3]:
df = pd.read_csv(dataset_dir + "Health_Sleep_Statistics.csv")

In [4]:
df.head(5)

| | User ID | Age | Gender | Sleep Quality | Bedtime | Wake-up Time | Daily Steps | Calories Burned | Physical Activity Level | Dietary Habits | Sleep Disorders | Medication Usage |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | f | 8 | 23:00 | 06:30 | 8000 | 2500 | medium | healthy | no | no |
| 1 | 2 | 34 | m | 7 | 00:30 | 07:00 | 5000 | 2200 | low | unhealthy | yes | yes |
| 2 | 3 | 29 | f | 9 | 22:45 | 06:45 | 9000 | 2700 | high | healthy | no | no |
| 3 | 4 | 41 | m | 5 | 01:00 | 06:30 | 4000 | 2100 | low | unhealthy | yes | no |
| 4 | 5 | 22 | f | 8 | 23:30 | 07:00 | 10000 | 2800 | high | medium | no | no |


## Missing Values

Here, we complete a quick check for missing values or outliers in the dataset.

In [5]:
df.isna().sum()

User ID                    0
Age                        0
Gender                     0
Sleep Quality              0
Bedtime                    0
Wake-up Time               0
Daily Steps                0
Calories Burned            0
Physical Activity Level    0
Dietary Habits             0
Sleep Disorders            0
Medication Usage           0
dtype: int64

### Numerical Outliers

Thankfully, there are no null values. Are there any outliers in the numerical data?

In [24]:
calories_chart = alt.Chart(
    df,
    title='Boxplot of Calories Burned Amongst 100 Records'
).mark_boxplot(size=50).encode(
    y=alt.Y('Calories Burned:Q').scale(zero=False),
    tooltip=[alt.Tooltip(title='Calories Burned (kcal)', field='Calories Burned')]
).properties(
    width=200
)

steps_chart = alt.Chart(
    df,
    title='Boxplot of Daily Steps Amongst 100 Records'
).mark_boxplot(size=50).encode(
    y=alt.Y('Daily Steps:Q').scale(zero=False),
).properties(
    width=200
)

age_chart = alt.Chart(
    df,
    title='Boxplot of Age Amongst 100 Records'
).mark_boxplot(size=50).encode(
    y=alt.Y('Age:Q').scale(zero=False),
).properties(
    width=200
)

chart = calories_chart | steps_chart | age_chart
# chart.save('images/outlier_boxplots.html')

chart

In [29]:
def get_above_outlier(q3, q1):
    iqr = q3 - q1 
    above_outlier_threshold = q3 + (1.5*iqr)
    return above_outlier_threshold

def get_below_outlier(q3, q1):
    iqr = q3 - q1 
    below_outlier_threshold = q1 - (1.5*iqr)
    return max(below_outlier_threshold, 0)

In [33]:
quantitative_features = [
    ("Calories Burned", 2700, 2175),
    ("Daily Steps", 9000, 4750),
    ("Age", 44, 28.75)
] 

for (feat, q3, q1) in quantitative_features:
    print(f"Feature: {feat}")
    print(f"Upper outlier threshold: {get_above_outlier(q3, q1)}")
    print(f"Lower outlier threshold: {get_below_outlier(q3, q1)}\n")

Feature: Calories Burned
Upper outlier threshold: 3487.5
Lower outlier threshold: 1387.5

Feature: Daily Steps
Upper outlier threshold: 15375.0
Lower outlier threshold: 0

Feature: Age
Upper outlier threshold: 66.875
Lower outlier threshold: 5.875



Generally, we consider 'above-outliers' to be values greater than Q3 + 1.5 × IQR, and 'below-outliers' to be values less than Q1 - 1.5 × IQR, where IQR = Q3 - Q1.

None of the values in these three features exceed the upper threshold or drop below the lower threshold, so there are no significant numerical outliers.
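The quartiles can also be computed directly from the data rather than typed in by hand. The following is a minimal sketch, assuming pandas' default quantile interpolation is acceptable. The `sample` frame holds only the five rows shown in the `df.head()` output, and `iqr_bounds` is an illustrative helper name, not part of the original notebook.

```python
import pandas as pd

# Sample rows copied from the df.head() output; the full dataset has 100 rows.
sample = pd.DataFrame({
    "Calories Burned": [2500, 2200, 2700, 2100, 2800],
    "Daily Steps": [8000, 5000, 9000, 4000, 10000],
    "Age": [25, 34, 29, 41, 22],
})


def iqr_bounds(series: pd.Series) -> tuple:
    """Return (lower, upper) outlier thresholds using the 1.5 x IQR rule."""
    q1, q3 = series.quantile(0.25), series.quantile(0.75)
    iqr = q3 - q1
    # Clamp the lower bound at 0, since none of these features can be negative.
    return max(q1 - 1.5 * iqr, 0), q3 + 1.5 * iqr


for feature in sample.columns:
    lower, upper = iqr_bounds(sample[feature])
    # Count the rows that fall outside the thresholds for this feature.
    outside = sample[(sample[feature] < lower) | (sample[feature] > upper)]
    print(f"{feature}: lower={lower}, upper={upper}, outliers={len(outside)}")
```

Replacing `sample` with the full `df` would apply the same check to all 100 records.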

### Categorical Outliers

Next, we check the categorical and binary columns for outliers.

For categorical data, an outlier is typically a mislabelling, where two different labels refer to the same value. An example would be 'fem' and 'f' both denoting female in the gender column.
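If a mislabelling of this kind were found, one way to repair it would be to normalise the labels and map them onto a canonical set. The sketch below uses invented labels purely for illustration; `raw_gender` and the `canonical` mapping are hypothetical and do not come from the dataset.

```python
import pandas as pd

# Hypothetical raw labels, invented for illustration only.
raw_gender = pd.Series(["f", "fem", "F ", "m", "male"])

# Assumed mapping from variant spellings to one canonical label.
canonical = {"f": "f", "fem": "f", "female": "f", "m": "m", "male": "m"}

# Strip whitespace and lower-case before mapping, so 'F ' matches 'f'.
cleaned = raw_gender.str.strip().str.lower().map(canonical)

# Any label missing from the mapping becomes NaN, which makes it easy to spot.
print(cleaned.value_counts(dropna=False))
```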

In [37]:
df.columns

Index(['User ID', 'Age', 'Gender', 'Sleep Quality', 'Bedtime', 'Wake-up Time',
       'Daily Steps', 'Calories Burned', 'Physical Activity Level',
       'Dietary Habits', 'Sleep Disorders', 'Medication Usage'],
      dtype='object')

In [40]:
# There are only two gender labels in this dataset, with no mislabelling.
df['Gender'].unique()

array(['f', 'm'], dtype=object)

In [43]:
# Once more, no mislabelling is present.
df['Physical Activity Level'].unique()

array(['medium', 'low', 'high'], dtype=object)

In [49]:
# Seems like no mislabelling is present.
print(df['Dietary Habits'].unique())

# However, 'medium' is strange when healthy and unhealthy sound binary. Let's check if this is a common label.
df['Dietary Habits'].value_counts()
# Looks like a common label, so we can feel confident it is not a mislabelling.

['healthy' 'unhealthy' 'medium']


unhealthy    41
medium       30
healthy      29
Name: Dietary Habits, dtype: int64

In [55]:
# Finally, we check the bedtime and wake-up time columns.
# It is hard to argue for outliers here, as someone with a sleep disorder could plausibly go to sleep at 8am, for example.

# Furthermore, pandas has no time-only dtype, so these values are stored as strings rather than datetimes.
# Hence, we will look at all the unique values and make sure nothing looks strange, like writing 8am instead of 08:00.

print(f"Wake-up Time: {df['Wake-up Time'].unique()}")
print(f"Bed Time: {df['Bedtime'].unique()}")

# These times all look very good and typical for waking/sleeping hours.

Wake-up Time: ['06:30' '07:00' '06:45' '07:15' '06:00' '07:30' '06:15']
Bed Time: ['23:00' '00:30' '22:45' '01:00' '23:30' '00:15' '22:30' '01:30' '00:45'
 '22:00' '22:15' '23:45' '01:15' '23:15']


In [42]:
# Check that no User ID is repeated and that each binary column has exactly 2 distinct values.
assert df['User ID'].unique().shape[0] == df.shape[0]
assert df['Medication Usage'].unique().shape[0] == 2
assert df['Sleep Disorders'].unique().shape[0] == 2

There are no categorical outliers, so the columns have now been checked for both null values and outliers.

Hence, the data is ready to use.

## Answering Goals

Now that our data is ready, we can move on to answering our goals.
I will post them here again, to avoid scrolling.

Goals:

1. Identify which gender typically has higher quality sleep.
2. Visualise the relationship between age and medication usage.
3. Identify the relationship between calories burned and daily steps, extended to include age.
4. Visualise the relationship between sleep duration and sleep disorders.
5. Visualise the relationship between sleep quality and dietary habits.
6. Identify the relationship between dietary habits and medication usage.

### Which gender gets better sleep? 

In [109]:
selection = alt.selection_point(fields=['Gender'])
color = alt.condition(
    selection,
    alt.Color('Gender:N').legend(None),
    alt.value('lightgray')
)

sleep_chart = alt.Chart(
    df.reset_index(), 
    title='Sleep Quality amongst Genders'
).mark_point().encode(
    x=alt.X('index', axis=alt.Axis(labels=False), title=None),
    y=alt.Y('Sleep Quality:Q', title='Level of Sleep Quality'),
    color=color,
    tooltip=[
        alt.Tooltip(title='ID', field='User ID'),
        alt.Tooltip(title='Sleep Disorder', field='Sleep Disorders')
    ]
)

legend = alt.Chart(df.reset_index()).mark_point().encode(
    alt.Y('Gender:N').axis(orient='right'),
    color=color
).add_params(
    selection
)

chart = sleep_chart | legend

# chart.save('images/sleep_chart.html')
chart

From this graph, it is pretty clear that people identifying as 'female' report higher quality sleep.

However, what do these individuals consider higher quality sleep? More sleep? 

Let's investigate the relationship between sleep duration and sleep quality.

In [138]:
from datetime import datetime, timedelta

# Convert wake-up times to datetimes on the following day.
# From the unique values printed earlier, all wake-up times fall between 06:00 and 07:30.
wake_up_times = np.array([
    datetime.strptime(t, "%H:%M") + timedelta(days=1)
    for t in df["Wake-up Time"]
])

# Convert bedtimes to datetimes.
# Bedtimes range from 22:00 to 01:30, so those after midnight (hours 0-2)
# are shifted onto the following day as well.
bedtimes = np.array([
    datetime.strptime(t, "%H:%M") + timedelta(days=1) if int(t[0:2]) in range(0, 3)
    else datetime.strptime(t, "%H:%M")
    for t in df["Bedtime"]
])

duration = wake_up_times - bedtimes
# total_seconds() covers the whole timedelta, whereas .seconds ignores the days component.
df["Hours Slept"] = [td.total_seconds() / 3600 for td in duration]
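The same duration can be computed without a Python loop. This is a sketch of one alternative, assuming that every individual sleeps for less than 24 hours. The `times` frame contains only the bedtime and wake-up values from the `df.head()` output.

```python
import pandas as pd

# Bedtime and wake-up values copied from the df.head() output.
times = pd.DataFrame({
    "Bedtime": ["23:00", "00:30", "22:45", "01:00", "23:30"],
    "Wake-up Time": ["06:30", "07:00", "06:45", "06:30", "07:00"],
})

# Parsing with an explicit format places every time on the same default date.
bed = pd.to_datetime(times["Bedtime"], format="%H:%M")
wake = pd.to_datetime(times["Wake-up Time"], format="%H:%M")

# A negative difference means the bedtime was before midnight; taking the
# result modulo 24 hours wraps it onto the following day.
times["Hours Slept"] = ((wake - bed).dt.total_seconds() / 3600) % 24
print(times)
```

The modulo removes the need to decide in advance which hours count as "after midnight", at the cost of relying on the under-24-hours assumption.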

In [145]:
selection = alt.selection_point(fields=['Gender'])
color = alt.condition(
    selection,
    alt.Color('Gender:N').legend(None),
    alt.value('lightgray')
)

quality_sleep_chart = alt.Chart(
    df,
    title='Sleep Quality Against Hours Slept'
).mark_point().encode(
    x=alt.X('Sleep Quality:Q', title='Sleep Quality'),
    y=alt.Y('Hours Slept:Q', title='Hours Slept'),
    color=color,
    tooltip=[
        alt.Tooltip(title='ID', field='User ID'),
        alt.Tooltip(title='Sleep Disorder', field='Sleep Disorders')
    ]
)

legend = alt.Chart(df).mark_point().encode(
    alt.Y('Gender:N').axis(orient='right'),
    color=color
).add_params(
    selection
)

chart = quality_sleep_chart | legend

# chart.save('images/quality_against_duration.html')
chart

It is clear that higher quality sleep is reported when more hours are slept.

However, it is important to note how many more males than females have sleeping disorders in this dataset.

In [163]:
df[df["Sleep Disorders"] == "yes"]['Gender'].value_counts()

m    25
f     1
Name: Gender, dtype: int64

#### Observations on 'Which gender gets better sleep?'

In this dataset, females report better sleep. However, the dataset appears to be biased: of the 26 individuals with a reported sleeping disorder, 25 are male and 1 is female.

In contrast, of the 74 individuals without a reported sleeping disorder, 25 are male and 49 are female.
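Both sets of counts can be read from a single contingency table. The sketch below uses `pd.crosstab` on the five rows from the `df.head()` output only, so its numbers are not the full-dataset counts quoted above.

```python
import pandas as pd

# Gender and Sleep Disorders values copied from the df.head() output.
sample = pd.DataFrame({
    "Gender": ["f", "m", "f", "m", "f"],
    "Sleep Disorders": ["no", "yes", "no", "yes", "no"],
})

# Rows are genders, columns are sleep-disorder labels, cells are counts.
counts = pd.crosstab(sample["Gender"], sample["Sleep Disorders"])
print(counts)
```

Passing the full `df` columns instead of `sample` should give the four counts for all 100 records in one table.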

### Does medication usage become more common with age? 

In [274]:
selection = alt.selection_point(fields=['Medication Usage'])
color = alt.condition(
    selection,
    alt.Color('Medication Usage:N').legend(None),
    alt.value('lightgray')
)

medication_chart = alt.Chart(
    df.reset_index(),
    title='Medication Usage with Age'
).mark_point().encode(
    x=alt.X('index', title='Index', axis=alt.Axis(labels=False)),
    y=alt.Y('Age', title='Age'),
    color=color,
    tooltip=[
        alt.Tooltip(title='Age', field='Age'),
        alt.Tooltip(title='Medicated', field='Medication Usage')
    ]
)

legend = alt.Chart(
    df,
).mark_point().encode(
    y=alt.Y(
        'Medication Usage:N',
        title=['Medication', 'Usage'],
        sort='descending', 
        axis=alt.Axis(orient='right')
    ),
    color=color
).add_params(
    selection
)

chart = medication_chart | legend 
# chart.save('images/age_medication.html')

chart

#### Observations on 'Does medication usage become more common with age?'

From this scatterplot, we can see that medication usage becomes more common with age: nobody under 30 takes medication, and usage grows more common as we head towards 45.
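To put numbers on this trend, the share of medicated individuals can be computed per age band. This is a minimal sketch on the five rows from the `df.head()` output; the ten-year bands are an assumption chosen for illustration.

```python
import pandas as pd

# Age and Medication Usage values copied from the df.head() output.
sample = pd.DataFrame({
    "Age": [25, 34, 29, 41, 22],
    "Medication Usage": ["no", "yes", "no", "no", "no"],
})

# Assumed age bands; pd.cut makes each band right-inclusive, e.g. (20, 30].
bands = pd.cut(sample["Age"], bins=[20, 30, 40, 50])

# The mean of a boolean column is the fraction of True values in each band.
share = (
    sample["Medication Usage"].eq("yes")
    .groupby(bands, observed=False)
    .mean()
)
print(share)
```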

### Do we linearly burn more calories as we step more? 

In [241]:
# pd.cut bins are right-inclusive by default, e.g. '20-25' is the interval (20, 25].
df["Age Range"] = pd.cut(
    df["Age"],
    bins=[20, 25, 30, 35, 40, 45, 50],
    labels=['20-25', '25-30', '30-35', '35-40', '40-45', '45-50']
)

In [265]:
selection = alt.selection_point(fields=['Age Range'])
color = alt.condition(
    selection,
    alt.Color(
        'Age Range:N',
        scale=alt.Scale(domainMid=0, scheme='reds'),
        legend=None
    ),
    alt.value('lightgray')
)

calories_chart = alt.Chart(
    df,
    title='Steps against Calories Burned'
).mark_point().encode(
    x=alt.X('Calories Burned:Q').scale(zero=False),
    y=alt.Y('Daily Steps:Q'),
    color=color,
    tooltip=[
        alt.Tooltip(title='Age', field='Age')
    ]
)

legend = alt.Chart(
    df,
).mark_point().encode(
    y=alt.Y(
        'Age Range:N',
        sort='ascending', 
        axis=alt.Axis(orient='right')
    ),
    color=color
).add_params(
    selection
)

chart = calories_chart | legend
# chart.save('images/calories_steps.html')

chart

#### Observations on 'Do we linearly burn more calories as we step more?'

From the interactivity of the chart, it is clear that younger people typically burn more calories and complete more steps.

Furthermore, steps and calories burned rise together in a linear fashion. However, this does not show that taking more steps is what causes more calories to be burned.

For example, someone could walk very little and burn many calories while swimming. The dataset does not record the individuals' other activities, so we cannot clearly answer this question.

Despite that, we can say that people who walk more tend to burn more calories.
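The strength of this linear association can be quantified with a Pearson correlation coefficient and a least-squares line. The sketch below uses only the five rows from the `df.head()` output, so the numbers it prints describe that sample and not the full dataset.

```python
import numpy as np

# Daily Steps and Calories Burned values copied from the df.head() output.
steps = np.array([8000, 5000, 9000, 4000, 10000])
calories = np.array([2500, 2200, 2700, 2100, 2800])

# Pearson correlation coefficient between the two features.
r = np.corrcoef(steps, calories)[0, 1]

# Least-squares straight line: calories ~ slope * steps + intercept.
slope, intercept = np.polyfit(steps, calories, deg=1)

print(f"Pearson r = {r:.3f}")
print(f"calories ~ {slope:.3f} * steps + {intercept:.1f}")
```

A high coefficient measures association only. It does not address the causal caveat raised above.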

### Do we sleep less with a sleeping disorder? 

We would expect the answer to be yes, but it is nice to confirm this visually. We have also already done the legwork, so this visualisation can be created very quickly.

In [273]:
selection = alt.selection_point(fields=['Sleep Disorders'])
color = alt.condition(
    selection,
    alt.Color(
        'Sleep Disorders:N',
        legend=None
    ),
    alt.value('lightgray')
)

disorder_chart = alt.Chart(
    df.reset_index(),
    title='Sleep Duration against Sleep Disorders'
).mark_point().encode(
    x=alt.X('index', title='Index', axis=alt.Axis(labels=False)),
    y=alt.Y('Hours Slept:Q', title='Hours Slept'),
    tooltip=[
        alt.Tooltip(title='Sleep Disorder', field='Sleep Disorders')
    ],
    color=color
)

legend = alt.Chart(
    df.reset_index(),
).mark_point().encode(
    y=alt.Y(
        'Sleep Disorders:N',
        sort='descending', 
        axis=alt.Axis(orient='right')
    ),
    color=color
).add_params(
    selection
)

chart = disorder_chart | legend
# chart.save('images/disorder.html')

chart

#### Observations on 'Do we sleep less with a sleeping disorder?'

As expected, individuals with a sleeping disorder get less sleep.

In fact, we can almost draw a horizontal line, as no one with a sleeping disorder gets more than six and a half hours of sleep.
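The position of that horizontal line can be checked numerically by summarising hours slept per group. This is a sketch on the five rows from the `df.head()` output, with hours slept derived from their bedtimes and wake-up times.

```python
import pandas as pd

# Sleep Disorders values from the df.head() output, with hours slept derived
# from the bedtime and wake-up time of the same rows.
sample = pd.DataFrame({
    "Sleep Disorders": ["no", "yes", "no", "yes", "no"],
    "Hours Slept": [7.5, 6.5, 8.0, 5.5, 7.5],
})

# Minimum, median and maximum hours slept for each group.
summary = sample.groupby("Sleep Disorders")["Hours Slept"].agg(["min", "median", "max"])
print(summary)
```

On the full dataset, the `max` of the 'yes' row is the value to compare against the six and a half hours mentioned above.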

### Does our sleeping affect our dietary habits?

Longer sleep has been linked to better dietary choices ([View Study](https://www.ncbi.nlm.nih.gov/pmc/articles/PMC9859770/#:~:text=Short%20sleepers%20often%20make%20poor,those%20sleeping%20longer%20%5B48%5D.)). Let's see if this dataset follows this idea.

In [278]:
selection = alt.selection_point(fields=['Dietary Habits'])
color = alt.condition(
    selection,
    alt.Color(
        'Dietary Habits:N',
        legend=None
    ),
    alt.value('lightgray')
)

duration_chart = alt.Chart(
    df.reset_index(),
    title='Hours Slept against Dietary Habits'
).mark_point().encode(
    x=alt.X('index', title='Index', axis=alt.Axis(labels=False)),
    y=alt.Y('Hours Slept:Q', title='Hours Slept'),
    tooltip=[
        alt.Tooltip(title='Diet Quality', field='Dietary Habits')
    ],
    color=color
)

legend = alt.Chart(
    df.reset_index(),
).mark_point().encode(
    y=alt.Y(
        'Dietary Habits:N',
        sort='ascending', 
        axis=alt.Axis(orient='right')
    ),
    color=color
).add_params(
    selection
)


chart = duration_chart | legend
# chart.save('images/sleep_more_diet.html')

chart

#### Observations on 'Does our sleeping affect our dietary habits?'

In this dataset, every individual with an unhealthy diet sleeps 7 hours or less, with no exceptions.

Furthermore, the study linked above states the following:
>"Short sleepers often make poor nutritional choices and have higher caloric intakes compared to people who sleep >7 hours a night"

Hence, our observations are consistent with this paper.

### Does our diet have a relationship with medication usage?

We do not know when the medication or the diet started, so we cannot say that one causes the other. We can, however, observe the relationship between the two.

In [289]:
selection = alt.selection_point(fields=['Dietary Habits'])
color = alt.condition(
    selection,
    alt.Color(
        'Dietary Habits:N',
        legend=None
    ),
    alt.value('lightgray')
)

medication_chart = alt.Chart(
    df.reset_index(),
    title='Dietary Habits against Medication Usage'
).mark_point().encode(
    x=alt.X('index', title='Index', axis=alt.Axis(labels=False)),
    y=alt.Y('Medication Usage:N', title='Uses Medication', sort='descending'),
    color=color
).properties(
    width=500,
    height=100
)

legend = alt.Chart(
    df.reset_index(),
).mark_point().encode(
    y=alt.Y(
        'Dietary Habits:N',
        sort='ascending', 
        axis=alt.Axis(orient='right')
    ),
    color=color
).add_params(
    selection
)


chart = medication_chart | legend
# chart.save('images/diet_medication.html')

chart

#### Observations on 'Does our diet have a relationship with medication usage?'

This interactive chart shows that an unhealthy diet has a strong link with medication usage, although we have no evidence that the relationship is causal.
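The link can also be expressed as the proportion of medicated individuals within each diet group. The sketch below normalises a contingency table by row, again using only the five rows from the `df.head()` output, so the proportions are illustrative.

```python
import pandas as pd

# Dietary Habits and Medication Usage values copied from the df.head() output.
sample = pd.DataFrame({
    "Dietary Habits": ["healthy", "unhealthy", "healthy", "unhealthy", "medium"],
    "Medication Usage": ["no", "yes", "no", "no", "no"],
})

# normalize="index" turns each row into proportions that sum to 1.
shares = pd.crosstab(
    sample["Dietary Habits"],
    sample["Medication Usage"],
    normalize="index",
)
print(shares)
```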

## Conclusion

The primary conclusions from the notebook are:
- People identifying as 'female' reported higher sleep quality.
- There was a strong correlation between sleep duration and sleep quality amongst the records.
- The dataset was strongly imbalanced: almost all individuals with sleeping disorders were men.
- Medication usage becomes more common with age.
- People who walk more tend to burn more calories.
- The data is consistent with a study reporting that people who sleep 7 hours or less make poorer nutritional choices and have a higher caloric intake.
- There is a strong correlation between medication usage and unhealthy dietary habits.