# **APPLE HEALTH DATA ANALYSIS**

**INTRODUCTION**  
This project focuses on analyzing personal health and activity data exported from the Apple Health app. By leveraging tools like Python and its data analysis libraries, we aim to uncover insights into daily routines, fitness progress, and overall health trends.  

**Note:** The XML-to-CSV conversion was performed separately, outside this notebook, because of its complexity. This notebook starts from the point where the data is already in CSV format and ready for analysis.  
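
For readers curious about the conversion step, here is a minimal sketch of one way it could be done. It is not the script used for this project. It assumes the export contains `<Record>` elements whose attributes match the CSV column names, and that the `type` values carry an `HK...Identifier` prefix that is stripped. The helper names `strip_prefix` and `records_to_csv` are hypothetical. Other elements in the export (workouts, activity summaries, profile data) are not handled here.

```python
import csv
import io
import xml.etree.ElementTree as ET

# Assumed prefixes on the raw 'type' attribute; stripped to get short names
PREFIXES = ("HKQuantityTypeIdentifier", "HKCategoryTypeIdentifier", "HKDataType")


def strip_prefix(type_name):
    """Return the type name without a known prefix (hypothetical helper)."""
    for prefix in PREFIXES:
        if type_name.startswith(prefix):
            return type_name[len(prefix):]
    return type_name


def records_to_csv(xml_source, csv_target,
                   fields=("type", "sourceName", "value", "unit",
                           "startDate", "endDate", "creationDate")):
    """Stream <Record> elements from xml_source into csv_target; return the row count."""
    writer = csv.DictWriter(csv_target, fieldnames=fields)
    writer.writeheader()
    count = 0
    # iterparse reads the file incrementally, which helps with large exports
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if elem.tag == "Record":
            row = {field: elem.attrib.get(field, "") for field in fields}
            row["type"] = strip_prefix(row["type"])
            writer.writerow(row)
            count += 1
            elem.clear()  # release the element's data once it has been written
    return count


# Tiny in-memory example with made-up values, standing in for the real export file
sample = (
    b'<HealthData>'
    b'<Record type="HKQuantityTypeIdentifierStepCount" sourceName="iPhone" '
    b'value="42" unit="count" startDate="2024-11-18 12:00:00 +0000" '
    b'endDate="2024-11-18 12:05:00 +0000" creationDate="2024-11-18 12:06:00 +0000"/>'
    b'</HealthData>'
)
out = io.StringIO()
n_records = records_to_csv(io.BytesIO(sample), out)
print(out.getvalue())
```

For a real export, `xml_source` would be the path to the XML file and `csv_target` an open text file created with `newline=''`.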



**OBJECTIVE**  
The primary goals of this analysis are:  
- To explore patterns in daily activity levels, such as steps and calories burned.  
- To analyze heart rate data to identify trends in resting and active heart rates.  
- To examine sleep patterns and their consistency over time.  
- To visualize key health metrics and derive actionable insights.  



**DATA SOURCE**  
The data was exported directly from the Apple Health app in XML format, then parsed and transformed into CSV for analysis. It includes a variety of health metrics such as steps, heart rate, distance traveled, active energy, and sleep analysis.  



**TOOLS AND LIBRARIES**  
The analysis will use:  
- **pandas**: For data cleaning and transformation.  
- **matplotlib** and **seaborn**: For creating insightful visualizations.  
- **plotly**: For interactive charts.  



**METHODOLOGY**  
1. **DATA PREPARATION**:  
   - Loading the CSV file into a pandas DataFrame.  
   - Cleaning the data to handle missing or inconsistent values.  
   - Ensuring proper formatting for dates and numerical values.  
   
2. **ANALYSIS**:  
   - Analyzing daily activity levels (steps, distance, calories burned).  
   - Examining heart rate trends over time.  
   - Evaluating sleep patterns and consistency.  
   
3. **VISUALIZATION**:  
   - Creating time-series plots for trends over days, weeks, and months.  
   - Developing heatmaps and histograms to visualize activity distribution.  
   - Highlighting insights with interactive dashboards.  
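
As a rough illustration of the analysis step, the sketch below turns long-format records (one row per measurement) into daily totals per metric. The column names mirror the cleaned dataset, but the values are made up for the example, and the `day` column is a helper introduced here.

```python
import pandas as pd

# Made-up records in the same long format as the cleaned dataset
records = pd.DataFrame({
    "type": ["StepCount", "StepCount", "StepCount", "DistanceWalkingRunning"],
    "value": [120.0, 80.0, 300.0, 0.25],
    "date": pd.to_datetime([
        "2024-11-17 09:00:00",
        "2024-11-17 18:30:00",
        "2024-11-18 08:15:00",
        "2024-11-18 08:15:00",
    ]),
})

# Truncate each timestamp to midnight so rows from the same day share a key
records["day"] = records["date"].dt.normalize()

# One row per day, one column per metric, values summed within each day;
# day/metric combinations with no records show up as NaN
daily = records.pivot_table(index="day", columns="type", values="value", aggfunc="sum")
print(daily)
```

With matplotlib installed, `daily["StepCount"].plot()` would draw this as a simple time series.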



**INSIGHTS AND CONCLUSIONS**  
Through this analysis, we aim to provide a deeper understanding of health metrics, identify areas for improvement, and celebrate progress towards wellness goals.




**DATA LOADING AND EXPLORATION**

**DATA LOADING**  
In this step, we will load the CSV file into a pandas DataFrame and take an initial look at the data. This includes:
- Checking the structure and format of the dataset.
- Identifying the number of rows and columns.
- Displaying column names and sample data.








* We start by importing essential libraries, such as `pandas` for data manipulation and `matplotlib` for visualizations.

* The raw data is loaded into a pandas DataFrame from a CSV file for initial inspection.

* We display basic information about the dataset using `info()` to check the column names, data types, and the number of non-null entries.

In [None]:
# Importing necessary libraries
import pandas as pd
import matplotlib.pyplot as plt

# Loading the dataset
file_path = 'apple_health_export_2024-11-18.csv'  # Replace with the correct file path
health_data = pd.read_csv(file_path, low_memory=False)
# low_memory=False makes pandas read the whole file before inferring dtypes,
# instead of inferring them chunk by chunk. This gives more consistent type
# inference for columns that contain mixed data types.

# Displaying basic information about the dataset
print("Dataset Overview:")
health_data.info()

Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320257 entries, 0 to 320256
Data columns (total 31 columns):
 #   Column                       Non-Null Count   Dtype  
---  ------                       --------------   -----  
 0   type                         310031 non-null  object 
 1   sourceName                   310026 non-null  object 
 2   value                        319853 non-null  object 
 3   unit                         309783 non-null  object 
 4   startDate                    310029 non-null  object 
 5   endDate                      310029 non-null  object 
 6   creationDate                 310026 non-null  object 
 7   appleMoveTimeGoal            393 non-null     float64
 8   device                       309885 non-null  object 
 9   date                         4 non-null       object 
 10  workoutActivityType          2 non-null       object 
 11  locale                       1 non-null       object 
 12  activeEnergyBurnedUnit       393 non-nul

* We display the first few rows of the dataset using `head()` to get an overview of the data and understand its structure.

In [85]:
# Displaying the first few rows of the dataset
print("\nSample Data:")
health_data.head()


Sample Data:


Unnamed: 0,type,sourceName,value,unit,startDate,endDate,creationDate,appleMoveTimeGoal,device,date,...,dateComponents,appleExerciseTimeGoal,appleStandHours,BiologicalSex,BloodType,appleMoveTime,DateOfBirth,appleStandHoursGoal,sum,activeEnergyBurned
0,HeadphoneAudioExposure,Al's iPhone,94.5578,dBASPL,2024-11-18 12:20:47 +0000,2024-11-18 12:21:57 +0000,2024-11-18 12:31:59 +0000,,"<<HKDevice: 0x3030c1220>, name:AirPods Pro, ma...",,...,,,,,,,,,,
1,HeadphoneAudioExposure,Al's iPhone,92.9011,dBASPL,2024-11-18 12:19:50 +0000,2024-11-18 12:20:48 +0000,2024-11-18 12:25:38 +0000,,"<<HKDevice: 0x3030c1220>, name:AirPods Pro, ma...",,...,,,,,,,,,,
2,HeadphoneAudioExposure,Al's iPhone,94.8688,dBASPL,2024-11-18 12:14:34 +0000,2024-11-18 12:19:50 +0000,2024-11-18 12:25:37 +0000,,"<<HKDevice: 0x3030c1220>, name:AirPods Pro, ma...",,...,,,,,,,,,,
3,HeadphoneAudioExposure,Al's iPhone,94.5529,dBASPL,2024-11-18 12:13:22 +0000,2024-11-18 12:14:34 +0000,2024-11-18 12:19:48 +0000,,"<<HKDevice: 0x3030c1220>, name:AirPods Pro, ma...",,...,,,,,,,,,,
4,HeadphoneAudioExposure,Al's iPhone,93.1836,dBASPL,2024-11-18 12:08:45 +0000,2024-11-18 12:13:22 +0000,2024-11-18 12:19:48 +0000,,"<<HKDevice: 0x3030c1220>, name:AirPods Pro, ma...",,...,,,,,,,,,,


* We list all column names to confirm the structure of the dataset before cleaning.

* We count the missing values in each column to see how complete the data is and which columns are too sparse to be useful.

In [86]:
# Viewing column names
print("Column Names:")
print(health_data.columns)

# Checking for missing values
missing_values = health_data.isnull().sum()
print("\nMissing Values per Column:")
print(missing_values)


Column Names:
Index(['type', 'sourceName', 'value', 'unit', 'startDate', 'endDate',
       'creationDate', 'appleMoveTimeGoal', 'device', 'date',
       'workoutActivityType', 'locale', 'activeEnergyBurnedUnit', 'key',
       'CardioFitnessMedicationsUse', 'duration', 'appleExerciseTime',
       'FitzpatrickSkinType', 'sourceVersion', 'activeEnergyBurnedGoal',
       'durationUnit', 'dateComponents', 'appleExerciseTimeGoal',
       'appleStandHours', 'BiologicalSex', 'BloodType', 'appleMoveTime',
       'DateOfBirth', 'appleStandHoursGoal', 'sum', 'activeEnergyBurned'],
      dtype='object')

Missing Values per Column:
type                            10226
sourceName                      10231
value                             404
unit                            10474
startDate                       10228
endDate                         10228
creationDate                    10231
appleMoveTimeGoal              319864
device                          10372
date                           
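
Raw counts can be hard to compare across columns, so it may help to express them as a share of all rows. The sketch below shows one way to do that on a small made-up frame. The 50% cutoff and the names `toy`, `missing_share`, and `sparse_columns` are assumptions for illustration, not part of the original analysis.

```python
import numpy as np
import pandas as pd

# Small made-up frame standing in for health_data
toy = pd.DataFrame({
    "type": ["StepCount", None, "StepCount", "HeartRate"],
    "value": ["10", "5", None, "72"],
    "appleMoveTimeGoal": [np.nan, np.nan, np.nan, 30.0],
})

# isnull().mean() gives the fraction of missing entries per column
missing_share = toy.isnull().mean().mul(100).round(1).sort_values(ascending=False)
print(missing_share)

# Columns that are mostly empty are candidates for removal (cutoff is arbitrary)
sparse_columns = missing_share[missing_share > 50].index.tolist()
print(sparse_columns)
```

Applied to the real dataset, the same two lines would work with `health_data` in place of `toy`.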

**DATA CLEANING**

- **Remove Unwanted Columns**: Drop columns that are not needed for the analysis.
- **Remove Unwanted Rows**: Filter rows based on the 'type' column to keep relevant data.
- **Rename Columns**: Rename columns for clarity, e.g., `creationDate` to `date`.
- **Change Date Format**: Convert the date format to `YYYY-MM-DD HH:MM:SS` and remove timezone information.
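
There is more than one way to drop the timezone, and the choice affects what can be done with the column afterwards. The sketch below, using two made-up timestamps in the export's format, compares `tz_localize(None)`, which keeps a datetime column, with `strftime`, which produces plain strings. The variable names are illustrative only.

```python
import pandas as pd

# Made-up timestamps in the same format as the export
raw = pd.Series(["2024-11-18 12:31:59 +0000", "2023-12-31 23:59:59 +0000"])

# Parse into timezone-aware UTC datetimes
parsed = pd.to_datetime(raw, utc=True)

# Option 1: drop the timezone but keep a datetime column (the .dt accessor still works)
naive = parsed.dt.tz_localize(None)

# Option 2: format as text; readable, but no longer a datetime column
as_text = parsed.dt.strftime("%Y-%m-%d %H:%M:%S")

# Filtering by year only works directly on the datetime version
in_2024 = naive[naive.dt.year == 2024]
print(in_2024)
print(as_text)
```

If later steps resample or group by date, keeping the datetime version is usually the more convenient option. The string version suits display or export.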


* Remove all columns that are not needed for the analysis.

In [87]:
# List of columns to remove
columns_to_remove = [
    'sourceName', 'startDate', 'endDate', 'appleMoveTimeGoal', 'device', 'date',
    'workoutActivityType', 'locale', 'activeEnergyBurnedUnit', 'key',
    'CardioFitnessMedicationsUse', 'duration', 'appleExerciseTime',
    'FitzpatrickSkinType', 'sourceVersion', 'activeEnergyBurnedGoal',
    'durationUnit', 'dateComponents', 'appleExerciseTimeGoal',
    'appleStandHours', 'BiologicalSex', 'BloodType', 'appleMoveTime',
    'DateOfBirth', 'appleStandHoursGoal', 'sum', 'activeEnergyBurned'
]

# Remove the columns
health_data.drop(columns=columns_to_remove, inplace=True, errors='ignore')

# Confirm the removal
print("Remaining Columns:")
print(health_data.columns)


Remaining Columns:
Index(['type', 'value', 'unit', 'creationDate'], dtype='object')


* After removing unwanted columns, we inspect the structure of the cleaned data using `info()` to check the data types and non-null entries.

* We then display the first few rows of the cleaned data using `head()` to get an overview of the dataset after the cleaning process.

In [88]:
# View the structure of the dataset after removing columns
# (info() prints its report directly and returns None, so it is not wrapped in print())
print("Dataset Overview:")
health_data.info()

# View the first few rows of the cleaned data
print("\nSample Data:")
health_data.head()


Dataset Overview:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 320257 entries, 0 to 320256
Data columns (total 4 columns):
 #   Column        Non-Null Count   Dtype 
---  ------        --------------   ----- 
 0   type          310031 non-null  object
 1   value         319853 non-null  object
 2   unit          309783 non-null  object
 3   creationDate  310026 non-null  object
dtypes: object(4)
memory usage: 9.8+ MB

Sample Data:


Unnamed: 0,type,value,unit,creationDate
0,HeadphoneAudioExposure,94.5578,dBASPL,2024-11-18 12:31:59 +0000
1,HeadphoneAudioExposure,92.9011,dBASPL,2024-11-18 12:25:38 +0000
2,HeadphoneAudioExposure,94.8688,dBASPL,2024-11-18 12:25:37 +0000
3,HeadphoneAudioExposure,94.5529,dBASPL,2024-11-18 12:19:48 +0000
4,HeadphoneAudioExposure,93.1836,dBASPL,2024-11-18 12:19:48 +0000
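
The overview above shows `value` stored as `object` (text) rather than as a number, so it needs converting before any arithmetic. A minimal sketch of that conversion on made-up rows follows. The non-numeric sleep value is a hypothetical example of the kind of entry that could cause the `object` dtype, and `value_num` is a column name introduced here.

```python
import pandas as pd

# Made-up rows; the last value is a hypothetical non-numeric entry
toy = pd.DataFrame({
    "type": ["HeadphoneAudioExposure", "StepCount", "SleepAnalysis"],
    "value": ["94.5578", "1250", "HKCategoryValueSleepAnalysisAsleep"],
})

# errors="coerce" turns anything that cannot be parsed as a number into NaN
toy["value_num"] = pd.to_numeric(toy["value"], errors="coerce")
print(toy)

# Counting the NaNs shows how many entries were not numeric
print(toy["value_num"].isna().sum())
```

Writing the result to a new column rather than overwriting `value` keeps the original text available for inspecting the rows that failed to convert.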


- Filter the data based on the `type` column to keep only the relevant health metrics and remove unwanted ones.
- Additionally, keep only entries from the year 2024 by checking the year of the `creationDate` column.

In [92]:
# Define the desired values for the 'type' column
desired_types = ['HeadphoneAudioExposure', 'ActiveEnergyBurned', 'DistanceWalkingRunning', 'StepCount']

# Filter the rows based on the 'type' column
# (.copy() avoids a SettingWithCopyWarning when a column is assigned below)
health_data = health_data[health_data['type'].isin(desired_types)].copy()

# Ensure the 'creationDate' column is in datetime format (if not already)
health_data['creationDate'] = pd.to_datetime(health_data['creationDate'])

# Filter the data for the year 2024
health_data_2024 = health_data[health_data['creationDate'].dt.year == 2024].copy()

# Display the filtered data
health_data_2024.head()  # Or use .info() for more info about the filtered data



Unnamed: 0,type,value,unit,creationDate
0,HeadphoneAudioExposure,94.5578,dBASPL,2024-11-18 12:31:59+00:00
1,HeadphoneAudioExposure,92.9011,dBASPL,2024-11-18 12:25:38+00:00
2,HeadphoneAudioExposure,94.8688,dBASPL,2024-11-18 12:25:37+00:00
3,HeadphoneAudioExposure,94.5529,dBASPL,2024-11-18 12:19:48+00:00
4,HeadphoneAudioExposure,93.1836,dBASPL,2024-11-18 12:19:48+00:00


* Rename `creationDate` to `date`, remove the timezone information, and format the values as `YYYY-MM-DD HH:MM:SS`.

In [93]:
# Rename 'creationDate' to 'date' in the 2024 subset (the DataFrame displayed below)
health_data_2024 = health_data_2024.rename(columns={'creationDate': 'date'})

# Remove the timezone by formatting the timestamps as 'YYYY-MM-DD HH:MM:SS'
# (strftime returns strings, so 'date' is no longer a datetime column after this step)
health_data_2024['date'] = pd.to_datetime(health_data_2024['date']).dt.strftime('%Y-%m-%d %H:%M:%S')

# Display the updated DataFrame
health_data_2024.head()


Unnamed: 0,type,value,unit,date
0,HeadphoneAudioExposure,94.5578,dBASPL,2024-11-18 12:31:59
1,HeadphoneAudioExposure,92.9011,dBASPL,2024-11-18 12:25:38
2,HeadphoneAudioExposure,94.8688,dBASPL,2024-11-18 12:25:37
3,HeadphoneAudioExposure,94.5529,dBASPL,2024-11-18 12:19:48
4,HeadphoneAudioExposure,93.1836,dBASPL,2024-11-18 12:19:48
