# Sleep and Step Data Analysis

This notebook consolidates all the data processing and visualization steps for analyzing the relationship between sleep duration and step counts.

## Required Libraries
Install necessary libraries before running the notebook:

```bash
pip install pandas matplotlib seaborn xmltodict
```

## 1. Processing Sleep Data

The raw XML export is parsed with `xmltodict`, filtered to `HKCategoryTypeIdentifierSleepAnalysis` records, and saved as a CSV file for analysis.

```python
import xmltodict
import pandas as pd

# Load the XML file
with open("data/raw/sleep_data.xml", "r", encoding="utf-8") as file:
    xml_data = file.read()

# Parse XML data; xmltodict exposes element attributes as '@'-prefixed keys
data_dict = xmltodict.parse(xml_data)
records = data_dict["HealthData"]["Record"]

# Convert to DataFrame
df = pd.DataFrame(records)

# Filter sleep data
sleep_data = df[df['@type'] == 'HKCategoryTypeIdentifierSleepAnalysis']

# Select and rename the columns of interest. rename() returns a new DataFrame,
# which avoids a possible SettingWithCopyWarning from renaming a slice in place.
sleep_data_cleaned = sleep_data[['@startDate', '@endDate', '@value']].rename(columns={
    '@startDate': 'startDate',
    '@endDate': 'endDate',
    '@value': 'sleepState'
})

# Save sleep data as CSV
sleep_data_cleaned.to_csv("data/processed/sleep_data.csv", index=False)
print("Sleep data processed successfully.")
```
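Reading the whole export into memory can be slow when the XML file is large. As an alternative sketch (not the approach this notebook uses), the standard library's `xml.etree.ElementTree.iterparse` can stream `Record` elements one at a time. The helper name `iter_records` and the inline sample XML are hypothetical; the element and attribute names mirror the ones used in the cell above.

```python
import io
import xml.etree.ElementTree as ET


def iter_records(source, record_type):
    """Yield (startDate, endDate, value) for each <Record> of the given type."""
    for _event, elem in ET.iterparse(source, events=("end",)):
        if elem.tag == "Record":
            if elem.get("type") == record_type:
                yield elem.get("startDate"), elem.get("endDate"), elem.get("value")
            elem.clear()  # drop the element's attributes and children once read


# Hypothetical two-record sample standing in for data/raw/sleep_data.xml
sample = io.BytesIO(b"""<HealthData>
  <Record type="HKCategoryTypeIdentifierSleepAnalysis"
          startDate="2024-01-01 23:00:00 +0000"
          endDate="2024-01-02 07:00:00 +0000"
          value="HKCategoryValueSleepAnalysisAsleep"/>
  <Record type="HKQuantityTypeIdentifierStepCount"
          startDate="2024-01-02 08:00:00 +0000"
          endDate="2024-01-02 08:10:00 +0000"
          value="250"/>
</HealthData>""")

sleep_rows = list(iter_records(sample, "HKCategoryTypeIdentifierSleepAnalysis"))
print(sleep_rows)
```

The resulting tuples could be passed to `pd.DataFrame(sleep_rows, columns=["startDate", "endDate", "sleepState"])` to obtain the same columns as the CSV written above.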


## 2. Processing Step Data

The raw XML export is parsed the same way, filtered to `HKQuantityTypeIdentifierStepCount` records, and saved as a CSV file for analysis.

```python
# Load and parse the step export (reuses the imports from the previous cell)
with open("data/raw/step_data.xml", "r", encoding="utf-8") as file:
    xml_data = file.read()

data_dict = xmltodict.parse(xml_data)
records = data_dict["HealthData"]["Record"]

df = pd.DataFrame(records)

# Filter step data
step_data = df[df['@type'] == 'HKQuantityTypeIdentifierStepCount']

# Select and rename columns; rename() returns a new DataFrame instead of
# modifying a slice in place
step_data_cleaned = step_data[['@startDate', '@endDate', '@value']].rename(columns={
    '@startDate': 'startDate',
    '@endDate': 'endDate',
    '@value': 'steps'
})
step_data_cleaned.to_csv("data/processed/step_data.csv", index=False)
print("Step data processed successfully.")
```
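Sections 1 and 2 apply the same three steps: filter on `@type`, select the date and value columns, and rename them. A minimal sketch of a shared helper is shown below. The name `records_to_frame` is hypothetical, and the inline records only imitate the '@'-prefixed dictionaries that `xmltodict` produces for element attributes.

```python
import pandas as pd


def records_to_frame(records, record_type, value_name):
    """Filter parsed records by type and return startDate, endDate and the value column."""
    df = pd.DataFrame(records)
    selected = df.loc[df["@type"] == record_type, ["@startDate", "@endDate", "@value"]]
    return selected.rename(columns={
        "@startDate": "startDate",
        "@endDate": "endDate",
        "@value": value_name,
    }).reset_index(drop=True)


# Made-up records shaped like xmltodict output
records = [
    {"@type": "HKQuantityTypeIdentifierStepCount",
     "@startDate": "2024-01-02 08:00:00 +0000",
     "@endDate": "2024-01-02 08:10:00 +0000",
     "@value": "250"},
    {"@type": "HKCategoryTypeIdentifierSleepAnalysis",
     "@startDate": "2024-01-01 23:00:00 +0000",
     "@endDate": "2024-01-02 07:00:00 +0000",
     "@value": "HKCategoryValueSleepAnalysisAsleep"},
]

steps = records_to_frame(records, "HKQuantityTypeIdentifierStepCount", "steps")
print(steps)
```

Note that values parsed from XML attributes are strings; they only become numeric once converted, for example when the CSV is read back with `pd.read_csv`.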


## 3. Merging Sleep and Step Data

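Merge the processed sleep and step data into a single dataset for analysis. Sleep duration in hours is derived from each record's start and end times, and the two tables are joined on `startDate` with an inner join, so only records whose start timestamps match exactly are kept.

```python
sleep_data = pd.read_csv("data/processed/sleep_data.csv")
step_data = pd.read_csv("data/processed/step_data.csv")

# Ensure date columns are datetime (utc=True normalizes any UTC offsets)
for col in ['startDate', 'endDate']:
    sleep_data[col] = pd.to_datetime(sleep_data[col], utc=True)
    step_data[col] = pd.to_datetime(step_data[col], utc=True)

# Derive the sleep duration column that the plots in section 4 read
sleep_data['duration_hours'] = (
    (sleep_data['endDate'] - sleep_data['startDate']).dt.total_seconds() / 3600
)

# Both tables have an endDate column, so give the duplicates explicit suffixes
merged_data = pd.merge(sleep_data, step_data, on='startDate', how='inner',
                       suffixes=('_sleep', '_steps'))
merged_data.to_csv("data/processed/merged_sleep_step_data.csv", index=False)
print("Merged data saved successfully.")
```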
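An inner join on `startDate` keeps only rows whose timestamps are identical in both files, which may leave few or no rows if sleep and step records begin at different moments. One possible alternative, offered as a sketch under assumptions rather than as this notebook's method, is to aggregate both tables per calendar day and join on the date. Assigning each sleep record to the day on which it ends is an illustrative choice, and the small inline frames are made up.

```python
import pandas as pd

# Made-up records standing in for the processed CSV files
sleep = pd.DataFrame({
    "startDate": ["2024-01-01 23:00:00", "2024-01-02 22:30:00"],
    "endDate": ["2024-01-02 07:00:00", "2024-01-03 06:00:00"],
})
steps = pd.DataFrame({
    "startDate": ["2024-01-02 08:00:00", "2024-01-02 18:00:00", "2024-01-03 09:00:00"],
    "steps": [3000, 4500, 6000],
})

sleep["startDate"] = pd.to_datetime(sleep["startDate"])
sleep["endDate"] = pd.to_datetime(sleep["endDate"])
steps["startDate"] = pd.to_datetime(steps["startDate"])

# Duration of each sleep record in hours
sleep["duration_hours"] = (sleep["endDate"] - sleep["startDate"]).dt.total_seconds() / 3600

# Assumption: a night of sleep counts toward the day it ends on
sleep["date"] = sleep["endDate"].dt.date
steps["date"] = steps["startDate"].dt.date

# One row per day in each table, then join on the date
daily_sleep = sleep.groupby("date", as_index=False)["duration_hours"].sum()
daily_steps = steps.groupby("date", as_index=False)["steps"].sum()
daily = pd.merge(daily_sleep, daily_steps, on="date", how="inner")
print(daily)
```

With real data, the choice of which day a sleep record belongs to and how time zones are handled would need to match the question being asked.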


## 4. Visualizing Data

This section plots the `duration_hours` and `steps` columns of the merged dataset to explore the relationship between sleep duration and step counts.

### Boxplot

```python
import matplotlib.pyplot as plt

data = pd.read_csv("data/processed/merged_sleep_step_data.csv")

plt.figure(figsize=(12, 6))
plt.subplot(1, 2, 1)
plt.boxplot(data['duration_hours'], patch_artist=True, boxprops=dict(facecolor="lightblue"))
plt.title('Boxplot of Sleep Duration')
plt.ylabel('Hours')

plt.subplot(1, 2, 2)
plt.boxplot(data['steps'], patch_artist=True, boxprops=dict(facecolor="lightgreen"))
plt.title('Boxplot of Step Count')
plt.ylabel('Steps')

plt.tight_layout()
plt.show()
```


![Boxplot](assets/Boxplot_Separately.png)

### Scatter Plot

```python
plt.figure(figsize=(8, 6))
plt.scatter(data['duration_hours'], data['steps'], alpha=0.6, color='purple')
plt.title('Scatter Plot of Sleep Duration vs Step Count')
plt.xlabel('Sleep Duration (hours)')
plt.ylabel('Step Count')
plt.grid()
plt.show()
```


![Scatter Plot](assets/Scatter_Plot.png)
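
To complement the scatter plot with a single summary number, one option is Pearson's correlation coefficient via pandas' `Series.corr`. The frame below is made up for illustration (its values are deliberately perfectly linear); in the notebook, `data` loaded from the merged CSV would take its place, assuming it has `duration_hours` and `steps` columns.

```python
import pandas as pd

# Made-up values; replace with the merged dataset in practice
example = pd.DataFrame({
    "duration_hours": [6.0, 7.0, 8.0, 9.0],
    "steps": [9000, 8000, 7000, 6000],
})

# Series.corr defaults to Pearson's correlation coefficient
r = example["duration_hours"].corr(example["steps"])
print(f"Pearson r = {r:.2f}")
```

A value near -1 or 1 suggests a strong linear association, while a value near 0 suggests little linear association; it says nothing about causation.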