# Sleep and Step Data Analysis

This notebook walks through the data processing and visualization steps for analyzing sleep duration and step counts.
The goal is to understand how the two variables relate and to present the findings visually.

## Required Libraries
To run this notebook, make sure you have the following libraries installed:

```bash
pip install pandas matplotlib seaborn xmltodict
```

## Processing Sleep Data

The raw sleep data is stored in XML format. The cell below parses it with `xmltodict`, keeps the sleep-analysis records, and saves the start date, end date, and sleep state of each record as CSV.

In [None]:
import xmltodict
import pandas as pd

# Load the XML file
with open("data/raw/sleep_data.xml", "r", encoding="utf-8") as file:
    xml_data = file.read()

# Parse the XML file
data_dict = xmltodict.parse(xml_data)
records = data_dict["HealthData"]["Record"]

# Convert records to DataFrame
df = pd.DataFrame(records)

# Filter sleep data
sleep_data = df[df['@type'] == 'HKCategoryTypeIdentifierSleepAnalysis']

# Select and rename necessary columns
# (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
sleep_data_cleaned = sleep_data[['@startDate', '@endDate', '@value']].rename(columns={
    '@startDate': 'startDate',
    '@endDate': 'endDate',
    '@value': 'sleepState'
})

# Save processed sleep data
sleep_data_cleaned.to_csv("data/processed/sleep_data.csv", index=False)
print("Sleep data processed and saved to 'data/processed/sleep_data.csv'.")
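
Health exports can be large, and reading the whole file into a string before parsing may use a lot of memory. As an alternative sketch (not the approach used in this notebook), the standard library's `xml.etree.ElementTree.iterparse` can stream `Record` elements one at a time. The helper name `iter_records` and the inline sample XML are illustrative assumptions; the element and attribute names mirror those used in the cell above.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_type):
    """Yield attribute dicts for <Record> elements of the given type.

    `source` is a file path or a binary file object. This is a
    hypothetical helper, not part of xmltodict or pandas.
    """
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            if elem.get("type") == record_type:
                yield dict(elem.attrib)  # copy before the element is cleared
            elem.clear()  # release the element's contents once handled


# Tiny made-up sample in the same shape as the raw file
sample = b"""<HealthData>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis"
          startDate="2024-01-01 23:00:00 +0000"
          endDate="2024-01-02 07:00:00 +0000" value="asleep"/>
  <Record type="HKQuantityTypeIdentifierStepCount"
          startDate="2024-01-01 10:00:00 +0000"
          endDate="2024-01-01 10:05:00 +0000" value="250"/>
</HealthData>"""

sleep_records = list(
    iter_records(io.BytesIO(sample), "HKCategoryTypeIdentifierSleepAnalysis")
)
print(sleep_records)
```

Note that the keys here are plain attribute names (`startDate`, `value`) without the `@` prefix that `xmltodict` adds, so the rename step would not be needed with this variant.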

## Processing Step Data

The raw step data is also stored in XML format. The cell below applies the same steps: it parses the file, keeps the step-count records, and saves the start date, end date, and step value of each record as CSV.

In [None]:
# Load the XML file for step data
with open("data/raw/step_data.xml", "r", encoding="utf-8") as file:
    xml_data = file.read()

# Parse the XML file
data_dict = xmltodict.parse(xml_data)
records = data_dict["HealthData"]["Record"]

# Convert records to DataFrame
df = pd.DataFrame(records)

# Filter step count data
step_data = df[df['@type'] == 'HKQuantityTypeIdentifierStepCount']

# Select and rename necessary columns
# (rename returns a new DataFrame, which avoids pandas' SettingWithCopyWarning)
step_data_cleaned = step_data[['@startDate', '@endDate', '@value']].rename(columns={
    '@startDate': 'startDate',
    '@endDate': 'endDate',
    '@value': 'steps'
})

# Save processed step data
step_data_cleaned.to_csv("data/processed/step_data.csv", index=False)
print("Step data processed and saved to 'data/processed/step_data.csv'.")
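
The sleep and step cells repeat the same filter, select, and rename steps. One way to avoid the duplication is a small helper that works on the list of dictionaries returned by `xmltodict.parse`. The name `select_records` and the sample records below are hypothetical; the `@`-prefixed keys follow the convention used in the cells above.

```python
def select_records(records, record_type, value_name):
    """Keep records of one type and rename their fields.

    `records` is a list of dicts keyed by '@'-prefixed attribute names,
    as xmltodict produces for XML attributes. Hypothetical helper.
    """
    return [
        {
            "startDate": rec["@startDate"],
            "endDate": rec["@endDate"],
            value_name: rec["@value"],
        }
        for rec in records
        if rec.get("@type") == record_type
    ]


# Made-up records for illustration only
records = [
    {
        "@type": "HKQuantityTypeIdentifierStepCount",
        "@startDate": "2024-01-01 10:00:00 +0000",
        "@endDate": "2024-01-01 10:05:00 +0000",
        "@value": "250",
    },
    {
        "@type": "HKCategoryTypeIdentifierSleepAnalysis",
        "@startDate": "2024-01-01 23:00:00 +0000",
        "@endDate": "2024-01-02 07:00:00 +0000",
        "@value": "asleep",
    },
]

steps = select_records(records, "HKQuantityTypeIdentifierStepCount", "steps")
print(steps)
```

The returned list can be passed directly to `pd.DataFrame(...)`. Values are still strings at this point, exactly as they come out of the XML.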

## Merging Sleep and Step Data

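After processing the sleep and step data, we merge them into a single dataset based on the date. Sleep and step records rarely share an identical start timestamp, so merging on `startDate` directly would match almost nothing. Instead, we compute each sleep record's duration in hours, aggregate both datasets to one row per calendar day (a sleep record is assigned to the day on which it starts), and merge on that day. Note that this sums every sleep record regardless of `sleepState`; filter on that column first if only certain states should count.

In [None]:
# Load processed data
sleep_data = pd.read_csv("data/processed/sleep_data.csv")
step_data = pd.read_csv("data/processed/step_data.csv")

# Convert timestamp columns to datetime
sleep_data['startDate'] = pd.to_datetime(sleep_data['startDate'])
sleep_data['endDate'] = pd.to_datetime(sleep_data['endDate'])
step_data['startDate'] = pd.to_datetime(step_data['startDate'])

# Compute the duration of each sleep record in hours
sleep_data['duration_hours'] = (sleep_data['endDate'] - sleep_data['startDate']).dt.total_seconds() / 3600

# Aggregate to one row per calendar day (sleep is assigned to its start day)
sleep_data['date'] = sleep_data['startDate'].dt.date
step_data['date'] = step_data['startDate'].dt.date
daily_sleep = sleep_data.groupby('date', as_index=False)['duration_hours'].sum()
daily_steps = step_data.groupby('date', as_index=False)['steps'].sum()

# Merge datasets on the date
merged_data = pd.merge(daily_sleep, daily_steps, on='date', how='inner')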

# Save merged data
merged_data.to_csv("data/processed/merged_sleep_step_data.csv", index=False)
print("Merged data saved to 'data/processed/merged_sleep_step_data.csv'.")
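
Because sleep and step records seldom start at exactly the same moment, a merge keyed on the calendar day tends to be more useful than one keyed on the raw timestamp. The sketch below shows the idea on made-up values: it derives a `duration_hours` column, sums both datasets per day, and joins them on a `date` column. All values and the variable names `daily_sleep`, `daily_steps`, and `daily` are illustrative assumptions.

```python
import pandas as pd

# Made-up records for illustration only
sleep = pd.DataFrame({
    "startDate": pd.to_datetime(["2024-01-01 23:00", "2024-01-02 23:30"]),
    "endDate": pd.to_datetime(["2024-01-02 07:00", "2024-01-03 06:30"]),
})
steps = pd.DataFrame({
    "startDate": pd.to_datetime(
        ["2024-01-01 10:00", "2024-01-01 18:00", "2024-01-02 09:00"]
    ),
    "steps": [3000, 4500, 6000],
})

# Duration of each sleep record in hours
sleep["duration_hours"] = (
    sleep["endDate"] - sleep["startDate"]
).dt.total_seconds() / 3600

# Assign each record to the calendar day on which it starts
sleep["date"] = sleep["startDate"].dt.date
steps["date"] = steps["startDate"].dt.date

# One row per day in each dataset
daily_sleep = sleep.groupby("date", as_index=False)["duration_hours"].sum()
daily_steps = steps.groupby("date", as_index=False)["steps"].sum()

# Join on the day; only days present in both datasets are kept
daily = pd.merge(daily_sleep, daily_steps, on="date", how="inner")
print(daily)
```

Assigning a night's sleep to its start day is one convention among several; assigning it to the wake-up day instead would pair each night with the following day's steps.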

## Visualizing Data: Boxplots

Boxplots show the distribution of sleep duration and step counts, including the median, the quartiles, and any outliers.

In [None]:
import matplotlib.pyplot as plt

# Load merged data
data = pd.read_csv("data/processed/merged_sleep_step_data.csv")

# Boxplots
plt.figure(figsize=(12, 6))

# Sleep Duration Boxplot
plt.subplot(1, 2, 1)
plt.boxplot(data['duration_hours'], patch_artist=True, boxprops=dict(facecolor="lightblue"))
plt.title('Boxplot of Sleep Duration')
plt.ylabel('Hours')

# Step Count Boxplot
plt.subplot(1, 2, 2)
plt.boxplot(data['steps'], patch_artist=True, boxprops=dict(facecolor="lightgreen"))
plt.title('Boxplot of Step Count')
plt.ylabel('Steps')

plt.tight_layout()
plt.show()
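
To make the boxplot easier to read, the following sketch computes the numbers a boxplot draws: the quartiles, the interquartile range (IQR), and the fences at 1.5 × IQR beyond which matplotlib, by default, marks points as outliers. It uses only the standard library, and the sample durations are made up for illustration.

```python
import statistics

# Made-up sleep durations in hours, for illustration only
durations = [4.5, 6.0, 6.5, 7.0, 7.5, 8.0, 8.5, 12.0]

# "inclusive" uses linear interpolation between data points
q1, median, q3 = statistics.quantiles(durations, n=4, method="inclusive")
iqr = q3 - q1

# Points outside these fences are drawn individually as outliers
lower_fence = q1 - 1.5 * iqr
upper_fence = q3 + 1.5 * iqr
outliers = [d for d in durations if d < lower_fence or d > upper_fence]

print(f"Q1={q1}, median={median}, Q3={q3}, IQR={iqr}")
print(f"fences=({lower_fence}, {upper_fence}), outliers={outliers}")
```

In this sample, only the 12-hour night falls outside the fences, so it is the one point a boxplot would draw separately from the whiskers.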