# Data Activity

---

Yifan Jing/Ivan (YIJ63@pitt.edu/Jyf20050919@163.com)

---

## Activity Layout

Today, we will be doing some free-form data exploration with a dataset of your choosing! First, please go to [Kaggle](https://www.kaggle.com/datasets) and select a dataset; it can be any dataset you want. 

**MAKE SURE TO SAVE IT IN A SEPARATE DIRECTORY FROM THIS ONE!!!**

Afterwards, please `git clone` this repository onto your computer.

## Dataset Chosen

**Dataset:** Sleep Health and Lifestyle Dataset  
**Link:** https://www.kaggle.com/datasets/uom190346a/sleep-health-and-lifestyle-dataset  
**Why I chose this dataset:**
- Sleep health is closely tied to daily wellbeing, mood, and productivity, which makes it relevant and interesting to analyze.
- The dataset records sleep duration and sleep quality alongside lifestyle and health factors (such as physical activity, stress level, BMI category, and daily steps), which makes it well suited to multi-variable analysis.

**What I hope to learn:**
- Explore the relationships between lifestyle factors (e.g., physical activity, stress level, daily steps) and sleep outcomes (e.g., quality, duration).
- Identify which variables are most strongly associated with sleep quality, and possibly attempt some regression or clustering analysis.
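
Once the data is loaded, the second goal could be approached by ranking numeric columns by how strongly they correlate with a sleep-related target. The snippet below is only a minimal sketch: the numbers are made up, and using `Quality of Sleep` as the target is an assumption for illustration, not a result from the real dataset.

```python
import pandas as pd

# Made-up numbers for illustration only; the column names follow the
# dataset's df.info() listing, but none of these values come from the real file.
sample = pd.DataFrame({
    "Quality of Sleep": [4, 5, 6, 7, 8, 9],
    "Stress Level": [8, 7, 6, 5, 4, 3],
    "Physical Activity Level": [30, 45, 40, 60, 75, 90],
    "Daily Steps": [3000, 8000, 5000, 7000, 6000, 10000],
})

target = "Quality of Sleep"  # assumed target column

# Pearson correlation of every numeric column with the target,
# sorted by absolute strength (strongest first).
correlations = (
    sample.select_dtypes("number")
    .corr()[target]
    .drop(target)
    .sort_values(key=lambda s: s.abs(), ascending=False)
)

print(correlations)
```

A strong correlation only indicates a linear association, so it is a starting point for choosing variables to examine further and not evidence of influence by itself.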

---

## Instructions

1. Make a separate branch, naming it based on the dataset you chose.

    **NOTE:** First, `cd` into the repository directory. Then run `git branch <NAME>` in the terminal to create the branch and `git checkout <NAME>` to switch to it.
2. Make a copy of this template and name it to reflect the dataset (also change the title).
3. Replace my name and email (Alejandro Ciuba) with your own.
4. Fill in the empty spaces in your Jupyter Notebook.
5. Create an Anaconda environment for this notebook.
6. Launch the environment.
7. Launch the notebook within the environment.

    **NOTE:** This is done either by selecting the kernel in VSCode or by running `jupyter notebook` (or `jupyter lab`) after launching the environment.

From there, you are free to explore your data however you see fit! Make graphs, record anomalies, make connections. I recommend performing some statistical tests if you know them (although we'll cover those next week). Run `pip install <NAME>` for any packages you might need (e.g., [pandas](https://pandas.pydata.org/docs/), matplotlib, seaborn, etc.). Once we are finished, we will go over making an `environment.txt` file and forming pull requests.

---

## Imports & Settings

In [5]:
# Put your imports here
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns

In [6]:
# Each code block should be concise and accomplish one or two basic tasks
# Toggle pretty printing off, which lets you see more output
%pprint

pd.set_option('display.max_columns', None)  # Output setting

plt.style.use('ggplot')  # I prefer this style for charts

sns.set_palette("muted")  # Set the colors for the charts

sns.color_palette("muted")  # Shows you what the color palette looks like

Pretty printing has been turned OFF


---

## Functions

In [7]:
# Put any useful functions in their own codeblock here
def load_dataset(path: str) -> pd.DataFrame:
    return pd.read_csv(path)

In [8]:
def percentage_plot(df: pd.DataFrame, x: str, labels: bool = False) -> plt.Axes:
    """Plot the share of each category in column `x` as a percentage bar chart."""

    ordering = df[x].value_counts().index

    plot = sns.countplot(data=df, x=x, hue=x, stat="percent", 
                         order=ordering, hue_order=ordering, palette=sns.color_palette("colorblind"))
    plot.tick_params(axis="x", rotation=90)
    plot.set_title(f"{x} Percentages")
    plot.set_ylabel("Percentage")
    plot.set_xlabel(x)

    # Uncomment this if you want percentages over your bars
    # Bonus challenge: Rotate the labels vertically so they don't collide!
    # if labels:
    #     for c in plot.containers:
    #         plot.bar_label(c, fmt="%.2f")

    return plot
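
The percentages drawn by `percentage_plot` can be cross-checked numerically with pandas alone. The sketch below uses a small made-up column (the `Occupation` values are hypothetical, not taken from the dataset) and applies the same ordering rule as the function, with the most frequent category first.

```python
import pandas as pd

# Hypothetical stand-in for one categorical column; the values are made up.
sample = pd.DataFrame({
    "Occupation": ["Nurse", "Doctor", "Nurse", "Engineer",
                   "Nurse", "Doctor", "Nurse", "Doctor"]
})

# Same ordering rule as percentage_plot: most frequent category first.
ordering = sample["Occupation"].value_counts().index

# normalize=True returns fractions, so multiply by 100 to get percentages.
percentages = (
    sample["Occupation"].value_counts(normalize=True).mul(100).round(2)
)

print(list(ordering))
print(percentages)
```

If the bar heights in the chart differ from these numbers, the ordering or the `stat` setting is the first thing to check.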

---
## Globals

In [9]:
DATA = "./Sleep_health_and_lifestyle_dataset.csv"

---
## Load Data

In [10]:
## Use Pandas to load your dataset
df = load_dataset(DATA)

In [11]:
## Use df.info() to get an overview
## Use df.describe() to get the data's descriptive statistics (mean, max/min, etc.)

---

## Data Exploration

In [13]:
## Explore your data here!
df.info()
df.describe()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 374 entries, 0 to 373
Data columns (total 13 columns):
 #   Column                   Non-Null Count  Dtype  
---  ------                   --------------  -----  
 0   Person ID                374 non-null    int64  
 1   Gender                   374 non-null    object 
 2   Age                      374 non-null    int64  
 3   Occupation               374 non-null    object 
 4   Sleep Duration           374 non-null    float64
 5   Quality of Sleep         374 non-null    int64  
 6   Physical Activity Level  374 non-null    int64  
 7   Stress Level             374 non-null    int64  
 8   BMI Category             374 non-null    object 
 9   Blood Pressure           374 non-null    object 
 10  Heart Rate               374 non-null    int64  
 11  Daily Steps              374 non-null    int64  
 12  Sleep Disorder           155 non-null    object 
dtypes: float64(1), int64(7), object(5)
memory usage: 38.1+ KB


| Statistic | Person ID | Age | Sleep Duration | Quality of Sleep | Physical Activity Level | Stress Level | Heart Rate | Daily Steps |
| --- | --- | --- | --- | --- | --- | --- | --- | --- |
| count | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 | 374.0 |
| mean | 187.5 | 42.184492 | 7.132086 | 7.312834 | 59.171123 | 5.385027 | 70.165775 | 6816.84492 |
| std | 108.108742 | 8.673133 | 0.795657 | 1.196956 | 20.830804 | 1.774526 | 4.135676 | 1617.915679 |
| min | 1.0 | 27.0 | 5.8 | 4.0 | 30.0 | 3.0 | 65.0 | 3000.0 |
| 25% | 94.25 | 35.25 | 6.4 | 6.0 | 45.0 | 4.0 | 68.0 | 5600.0 |
| 50% | 187.5 | 43.0 | 7.2 | 7.0 | 60.0 | 5.0 | 70.0 | 7000.0 |
| 75% | 280.75 | 50.0 | 7.8 | 8.0 | 75.0 | 7.0 | 72.0 | 8000.0 |
| max | 374.0 | 59.0 | 8.5 | 9.0 | 90.0 | 8.0 | 86.0 | 10000.0 |


From the output above, we can see:
- The dataset has 374 rows and 13 columns.
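- Every column is complete except 'Sleep Disorder', which has only 155 non-null values out of 374, so 219 entries are missing.
- The numeric features fall within plausible ranges: Age spans 27 to 59, Sleep Duration 5.8 to 8.5 (mean about 7.13), Quality of Sleep 4 to 9, Physical Activity Level 30 to 90, Stress Level 3 to 8, Heart Rate 65 to 86, and Daily Steps 3,000 to 10,000. 'Person ID' runs from 1 to 374 and appears to be a row identifier, so its summary statistics are not meaningful.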
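
The missing 'Sleep Disorder' values are worth a closer look before any analysis. One possible explanation, which is an assumption and has not been verified against the real file, is that the CSV stores a literal `None` string for people without a disorder and that pandas parsed it as missing. The sketch below uses a tiny made-up CSV to show how the count of missing values could be checked and how the parsing choice changes the result.

```python
import io

import pandas as pd

# Tiny hypothetical CSV standing in for the real file; all values are made up.
SAMPLE_CSV = """Person ID,Sleep Duration,Sleep Disorder
1,6.1,None
2,7.4,Insomnia
3,8.0,None
4,5.9,Sleep Apnea
"""

# Explicitly treat the literal string "None" as missing.
as_missing = pd.read_csv(io.StringIO(SAMPLE_CSV), na_values=["None"])

# Disable the default missing-value markers so "None" survives as a category.
as_text = pd.read_csv(io.StringIO(SAMPLE_CSV), keep_default_na=False)

missing_per_column = as_missing.isna().sum()
missing_share = as_missing["Sleep Disorder"].isna().mean()
disorder_counts = as_text["Sleep Disorder"].value_counts()

print(missing_per_column)
print(f"Share of missing 'Sleep Disorder' values: {missing_share:.0%}")
print(disorder_counts)
```

If the real file does behave this way, reading it with `keep_default_na=False` (or filling the gaps with an explicit label) would keep "no disorder" as its own category instead of dropping those rows from plots and counts.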

