# Visualization Project

Data visualization plays a crucial role in extracting meaningful insights from data. This project is part of a course on data visualization and explores techniques for understanding data needs and designing effective visual representations.

## The Dataset

For this project, I have selected the "Sleep Health and Lifestyle Dataset" from Kaggle. Although the data is synthetic, I chose it because it includes a diverse set of variables and allows for the exploration of multiple relationships between them. The dataset contains key attributes such as occupation, sleep duration, and Quality of Sleep, along with other well-being indicators like physical activity level, stress level, blood pressure, and sleep disorders.
My objective is to explore the following research questions:

- How does occupation influence sleep duration?
- Is Quality of Sleep correlated with stress levels?
- Do sleep disorders impact both sleep duration and quality?
- How do sleep duration and quality change over time? Is there a gender-based difference?
- Does physical activity correlate with better Quality of Sleep?

While it would be ideal to analyze real-world data, I had difficulty finding a sufficiently recent dataset that covered a wide range of relevant variables while remaining manageable in size. A comprehensive and up-to-date dataset on sleep health could significantly benefit sleep medicine research. In the meantime, this structured dataset serves as a methodological example of how an in-depth analysis could be conducted in a real-world study.

I also reviewed a Kaggle notebook available at: https://www.kaggle.com/code/ratchakritbootkong/sleep-health-and-lifestyle-dataset-analyze. The analysis demonstrates a strong use of the grammar of graphics, with well-structured visualizations that feature clear axes and minimal visual clutter.
The study explores sleep duration by gender, followed by a line plot analyzing sleep duration across different age groups. An alternative approach could be to consolidate these insights into a single graph, using color to encode gender, which might offer a more comprehensive view.
Additionally, the analysis examines the relationship between physical activity and Quality of Sleep using a bar plot to display average values. While this effectively summarizes the data, a scatter plot could potentially reveal clusters or patterns that are not as apparent in a bar chart.
One of the standout visualizations in the analysis is a correlation between age and sleep disorders, which is both clear and insightful. Another valuable visualization is a box plot showing the relationship between stress levels and sleep duration, which effectively captures distribution patterns and variations.

## Tasks and Sketch

For this section, I aim to define two key tasks that will guide the development of this project. These tasks are designed to address both the goals and the means of the project.


- Why analyze the relationship between Quality of Sleep and various factors?

    Examining correlations can provide insights into the factors that influence Quality of Sleep. Understanding these relationships can help improve overall sleep health and determine whether these improvements contribute to a healthier lifestyle.
    
- How to analyze correlations?

    Initially, I'll compute correlation coefficients to identify potential relationships between variables. If multicollinearity is detected, I'll address it by removing redundant variables or applying dimensionality reduction techniques.
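As a rough illustration of the correlation step described above, the sketch below computes a Pearson correlation matrix and flags strongly correlated pairs as candidates for multicollinearity. The toy data, the helper name `highly_correlated_pairs`, and the 0.8 threshold are assumptions made for this example, not part of the project.

```python
import numpy as np
import pandas as pd

# Hypothetical toy data standing in for numeric columns of the dataset.
rng = np.random.default_rng(0)
stress = rng.integers(3, 9, size=200).astype(float)
toy = pd.DataFrame({
    "Stress Level": stress,
    # Strongly (negatively) tied to stress by construction.
    "Quality of Sleep": 10 - stress + rng.normal(0, 0.3, size=200),
    # Generated independently of the other two columns.
    "Daily Steps": rng.normal(7000, 1500, size=200),
})


def highly_correlated_pairs(df, threshold=0.8):
    """Return (col_a, col_b, r) for each column pair with |Pearson r| >= threshold."""
    corr = df.corr()
    cols = corr.columns
    pairs = []
    for i in range(len(cols)):
        for j in range(i + 1, len(cols)):
            r = corr.iloc[i, j]
            if abs(r) >= threshold:
                pairs.append((cols[i], cols[j], float(r)))
    return pairs


flagged = highly_correlated_pairs(toy)
print(flagged)
```

Pairwise correlations only catch collinearity between two variables at a time. A dependency that involves three or more variables would need another diagnostic, such as variance inflation factors.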

## Libraries

This section provides an overview of the libraries used in the project.

In [59]:
import pandas as pd
import altair as alt

## Import the DataFrame



In [60]:
sleep_dataset = pd.read_csv('Sleep_health_and_lifestyle_dataset_Synthetic.csv')


## Exploratory Data Analysis



In [61]:
sleep_dataset.sample(10)

| | Person ID | Gender | Age | Occupation | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | BMI Category | Blood Pressure | Heart Rate | Daily Steps | Sleep Disorder |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 250 | 251 | Female | 45 | Teacher | 6.8 | 7 | 30 | 6 | Overweight | 135/90 | 65 | 6000 | Insomnia |
| 205 | 206 | Male | 43 | Engineer | 7.7 | 8 | 90 | 5 | Normal | 130/85 | 70 | 8000 | NaN |
| 249 | 250 | Male | 44 | Salesperson | 6.5 | 6 | 45 | 7 | Overweight | 130/85 | 72 | 6000 | NaN |
| 351 | 352 | Female | 57 | Nurse | 8.1 | 9 | 75 | 3 | Overweight | 140/95 | 68 | 7000 | Sleep Apnea |
| 291 | 292 | Female | 50 | Nurse | 6.1 | 6 | 90 | 8 | Overweight | 140/95 | 75 | 10000 | Sleep Apnea |
| 289 | 290 | Female | 50 | Nurse | 6.1 | 6 | 90 | 8 | Overweight | 140/95 | 75 | 10000 | Sleep Apnea |
| 89 | 90 | Male | 35 | Engineer | 7.3 | 8 | 60 | 4 | Normal | 125/80 | 65 | 5000 | NaN |
| 56 | 57 | Male | 32 | Doctor | 7.7 | 7 | 75 | 6 | Normal | 120/80 | 70 | 8000 | NaN |
| 235 | 236 | Male | 44 | Salesperson | 6.3 | 6 | 45 | 7 | Overweight | 130/85 | 72 | 6000 | Insomnia |
| 224 | 225 | Female | 44 | Teacher | 6.6 | 7 | 45 | 4 | Overweight | 135/90 | 65 | 6000 | Insomnia |


In [62]:
#Description of the dataset
sleep_dataset.info()

sleep_dataset.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


| | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
|---|---|---|---|---|---|---|---|---|
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.84492 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 25% | 94.25 | 35.25 | 6.4 | 6.0 | 45.0 | 4.0 | 68.0 | 5600.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
| 75% | 280.75 | 50.0 | 7.8 | 8.0 | 75.0 | 7.0 | 72.0 | 8000.0 |
| max | 374.0 | 59.0 | 8.5 | 9.0 | 90.0 | 8.0 | 86.0 | 10000.0 |


In [None]:
# Convert columns to appropriate data types
sleep_dataset['Person ID'] = sleep_dataset['Person ID'].astype('str')
sleep_dataset['Gender'] = sleep_dataset['Gender'].astype('category')

# Merge the duplicate label 'Normal Weight' into 'Normal', then make BMI an ordered category
sleep_dataset['BMI Category'] = sleep_dataset['BMI Category'].replace('Normal Weight', 'Normal')
bmi_order = ['Normal', 'Overweight', 'Obese']
sleep_dataset['BMI Category'] = pd.Categorical(
    sleep_dataset['BMI Category'],
    categories=bmi_order,
    ordered=True
)

# Treat missing values in 'Sleep Disorder' as an explicit "No disorder" category
sleep_dataset['Sleep Disorder'] = sleep_dataset['Sleep Disorder'].fillna('No disorder')
sleep_dataset['Sleep Disorder'] = sleep_dataset['Sleep Disorder'].astype('category')
sleep_dataset['Occupation'] = sleep_dataset['Occupation'].astype('category')

In [64]:

sleep_dataset['BMI Category'].unique()

['Overweight', 'Normal', 'Obese']
Categories (3, object): ['Normal' < 'Overweight' < 'Obese']
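`Blood Pressure` is stored as text (dtype `object` in the `info()` output above), so it cannot enter a numeric correlation as it stands. One possible way to prepare it, sketched here on a few hypothetical rows, is to split it into two integer columns. The column names `Systolic` and `Diastolic` are assumptions for illustration and are not used elsewhere in this notebook.

```python
import pandas as pd

# Hypothetical rows in the same "systolic/diastolic" text format as the dataset.
bp = pd.DataFrame({"Blood Pressure": ["135/90", "130/85", "120/80"]})

# Split on "/" into two columns and convert both to integers.
parts = bp["Blood Pressure"].str.split("/", expand=True).astype(int)
bp["Systolic"] = parts[0]
bp["Diastolic"] = parts[1]

print(bp)
```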

## Visualization

First, we plot the relationship between Quality of Sleep and the quantitative variables: Age, Sleep Duration, Physical Activity Level, Stress Level, Heart Rate, and Daily Steps.

In [65]:
# Columns that are not numeric features (identifiers, the target, and categorical variables)
excluded_columns = ['Person ID', 'Quality of Sleep', 'Gender', 'Occupation',
                    'BMI Category', 'Blood Pressure', 'Sleep Disorder']

# Store the observed (min, max) range of each numeric feature for its x-axis
x_domains = {}
for feature in sleep_dataset.columns[~sleep_dataset.columns.isin(excluded_columns)]:
    x_domains[feature] = (sleep_dataset[feature].min(), sleep_dataset[feature].max())

# Melt only the numeric features to long format (one row per person and feature)
melted_data = sleep_dataset.melt(
    id_vars=['Person ID', 'Quality of Sleep'],
    value_vars=list(x_domains),
    var_name='Feature',
    value_name='Value'
)

# Count how many observations share each (Feature, Value, Quality of Sleep) combination
freq = melted_data.groupby(['Feature', 'Value', 'Quality of Sleep']).size().reset_index(name='Count of Occurrences')

# Merge the counts back so that point size can encode frequency
melted_data_with_count = pd.merge(melted_data, freq, on=['Feature', 'Value', 'Quality of Sleep'])


base_chart = alt.Chart(melted_data_with_count).mark_circle().encode(
    y=alt.Y('Quality of Sleep:Q', title='Quality of Sleep', scale=alt.Scale(domain=[0,10]), axis=alt.Axis(values=list(range(0, 11)))),
    color=alt.Color('Feature:N', legend=alt.Legend(title='Feature')),
    size=alt.Size('Count of Occurrences:Q', legend=None),
    tooltip=['Feature:N', 'Value:Q', 'Quality of Sleep:O', 'Count of Occurrences:Q'],
)

charts = []
for feature in x_domains:
    min_val, max_val = x_domains[feature]
    
    scatter_plot = base_chart.encode(
        x=alt.X('Value:Q', title=feature, scale=alt.Scale(domain=(min_val, max_val))),
        xOffset='jitter:Q',
        opacity=alt.condition(alt.datum['Feature'] == feature, alt.value(1), alt.value(0.1))
    ).transform_filter(
        alt.datum['Feature'] == feature
    ).transform_calculate(
        jitter='sqrt(-2*log(random()))*cos(2*PI*random())*0.2'
    )

    regression_line = alt.Chart(melted_data_with_count).transform_filter(
        alt.datum['Feature'] == feature
    ).transform_regression(
        'Value', 'Quality of Sleep'
    ).mark_line(color='gray').encode(
        x=alt.X('Value:Q', title=feature, scale=alt.Scale(domain=(min_val, max_val))),
        y=alt.Y('Quality of Sleep:Q', title='Quality of Sleep', scale=alt.Scale(domain=[0,10]), axis=alt.Axis(values=list(range(0, 11)))),
    )

    combined_chart = (scatter_plot + regression_line).properties(
        title= f'{feature} vs. Quality of Sleep (r = {sleep_dataset[[feature, "Quality of Sleep"]].corr().iloc[0,1]:.2f})')

    charts.append(combined_chart)

scatter_plots = alt.vconcat(*charts).properties(
    title='Quality of Sleep vs. Various Features'
)

scatter_plots
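The `jitter` expression in the cell above has the form of the Box-Muller transform: it turns two uniform random numbers into one normally distributed value, which is then scaled by 0.2. The sketch below reproduces the same calculation in plain Python to show what the offsets look like. The helper name `gaussian_jitter` and the seed are assumptions made for this example.

```python
import math
import random
import statistics


def gaussian_jitter(scale=0.2, rng=random):
    """One Box-Muller draw: a standard normal sample multiplied by `scale`."""
    u1 = 1.0 - rng.random()  # shift to (0, 1] so that log(u1) is always defined
    u2 = rng.random()
    return math.sqrt(-2.0 * math.log(u1)) * math.cos(2.0 * math.pi * u2) * scale


rng = random.Random(42)
samples = [gaussian_jitter(rng=rng) for _ in range(20_000)]

mean = statistics.fmean(samples)
spread = statistics.pstdev(samples)
print(f"mean = {mean:.3f}, standard deviation = {spread:.3f}")
```

The offsets are centered on zero with a standard deviation close to the scale factor, so most points stay near their true position while overlapping points are pulled apart.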

Next, we plot Quality of Sleep against the categorical variables BMI Category, Gender, and Sleep Disorder, and Sleep Duration against Occupation.

In [69]:
violin_BMI = alt.Chart(sleep_dataset).transform_density(
    'Quality of Sleep',
    as_=['Quality of Sleep', 'density'],
    groupby=['BMI Category']
).mark_area(
    opacity=0.4,
    interpolate='monotone',
    orient='horizontal',
).encode(
    x=alt.X('density:Q', title=None, axis=None, stack='center'),
    y=alt.Y('Quality of Sleep:Q', scale=alt.Scale(domain=[0, 9]), title='Quality of Sleep'),
    color=alt.Color('BMI Category:N', legend=None),
    column=alt.Column(
        'BMI Category:N',
        sort=['Normal', 'Overweight', 'Obese'],
        spacing=10,
        title='BMI Category'
    )
).properties(
    width=100
)

violin_Gender = alt.Chart(sleep_dataset).transform_density(
    'Quality of Sleep',
    as_=['Quality of Sleep', 'density'],
    groupby=['Gender']
).mark_area(
    opacity=0.4,
    interpolate='monotone',
    orient='horizontal',
).encode(
    x=alt.X('density:Q', title=None, axis=None, stack='center'),
    y=alt.Y('Quality of Sleep:Q', scale=alt.Scale(domain=[0, 9]), title='Quality of Sleep'),
    color=alt.Color('Gender:N', legend=None),
    column=alt.Column(
        'Gender:N',
        spacing=10,
        title='Gender'
    )
).properties(
    width=100
)

violin_sleep_disorder = alt.Chart(sleep_dataset).transform_density(
    'Quality of Sleep',
    as_=['Quality of Sleep', 'density'],
    groupby=['Sleep Disorder']
).mark_area(
    opacity=0.4,
    interpolate='monotone',
    orient='horizontal',
).encode(
    x=alt.X('density:Q', title=None, axis=None, stack='center'),
    y=alt.Y('Quality of Sleep:Q', scale=alt.Scale(domain=[0, 9]), title='Quality of Sleep'),
    color=alt.Color('Sleep Disorder:N', legend=None),
    column=alt.Column(
        'Sleep Disorder:N',
        spacing=10,
        title='Sleep Disorder'
    )
).properties(
    width=100
)

violin_Occupation = alt.Chart(sleep_dataset).transform_density(
    'Sleep Duration',
    as_=['Sleep Duration', 'density'],
    groupby=['Occupation']
).mark_area(
    opacity=0.4,
    interpolate='monotone',
    orient='horizontal',
).encode(
    x=alt.X('density:Q', title=None, axis=None, stack='center'),
    y=alt.Y('Sleep Duration:Q', scale=alt.Scale(domain=[0, 9]), title='Sleep Duration'),
    color=alt.Color('Occupation:N', legend=None),
    # Facet by occupation so that each group gets its own violin, as in the charts above
    column=alt.Column(
        'Occupation:N',
        spacing=10,
        title='Occupation'
    )
).properties(
    width=100
)


final_categorial_chart = alt.vconcat(violin_BMI, violin_Gender, violin_sleep_disorder, violin_Occupation)
final_categorial_chart
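The violin shapes above come from `transform_density`, which computes a kernel density estimate for each group. As a minimal sketch of the idea, assuming a Gaussian kernel and a hand-picked bandwidth of 0.5, the estimate is an average of small bell curves centered on the observations. The scores and the helper `gaussian_kde` are hypothetical and only illustrate the technique.

```python
import numpy as np


def gaussian_kde(values, grid, bandwidth):
    """Average of Gaussian bumps centered on each observation, evaluated on `grid`."""
    values = np.asarray(values, dtype=float)[:, None]  # shape (n, 1)
    grid = np.asarray(grid, dtype=float)[None, :]      # shape (1, m)
    z = (grid - values) / bandwidth
    kernels = np.exp(-0.5 * z**2) / (bandwidth * np.sqrt(2.0 * np.pi))
    return kernels.mean(axis=0)


# Hypothetical Quality of Sleep scores for a single category.
scores = [6, 6, 7, 7, 7, 8, 8, 9]
grid = np.linspace(0.0, 15.0, 1501)
density = gaussian_kde(scores, grid, bandwidth=0.5)

# The curve should integrate to roughly 1 and peak near the most common score.
area = float(density.sum() * (grid[1] - grid[0]))
peak = float(grid[density.argmax()])
print(f"area = {area:.3f}, peak at {peak:.2f}")
```

Because Quality of Sleep takes only integer values, the smooth outline of each violin partly reflects the kernel rather than the data, which is worth keeping in mind when reading density between two scores.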



## Visualization evaluation

This section outlines our approach to evaluating the current project. Our key research question is: Is Quality of Sleep directly correlated with well-known health variables such as BMI or blood pressure? A secondary question is: Are there any factors that directly correlate with poor Quality of Sleep?

To explore this, we plan to recruit doctors from various specialties and use a think-aloud protocol to capture their insights. We will evaluate the quantity and depth of insights and the time taken to reach them, which will help determine whether our visualization effectively communicates key findings. Specifically, we aim to see whether relationships (or the lack thereof) between Quality of Sleep and health variables are clearly conveyed. Additionally, this evaluation will highlight potential areas for design improvement.
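One possible way to tally these measures after the sessions is sketched below. Everything here is hypothetical: the `Insight` record, the 1-to-5 depth scale, and the sample entries are assumptions made for illustration, not collected data or a fixed protocol.

```python
from dataclasses import dataclass
from statistics import fmean


@dataclass
class Insight:
    participant: str  # hypothetical participant label
    seconds: float    # time from session start until the insight was stated
    depth: int        # hypothetical rating from 1 (surface) to 5 (deep)


def summarize(insights):
    """Return insight count, mean depth, and time to first insight per participant."""
    summary = {}
    for participant in sorted({i.participant for i in insights}):
        own = [i for i in insights if i.participant == participant]
        summary[participant] = {
            "count": len(own),
            "mean_depth": fmean(i.depth for i in own),
            "time_to_first": min(i.seconds for i in own),
        }
    return summary


# Made-up log entries, used only to exercise the function.
log = [
    Insight("P1", 42.0, 2),
    Insight("P1", 95.0, 4),
    Insight("P2", 30.0, 3),
]
report = summarize(log)
print(report)
```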

If our visualization is successful, we expect it to generate in-depth insights in a short period. More importantly, we hope to inspire doctors to pursue further research in sleep medicine, ultimately benefiting their patients.
