# 01 – Multimodal Exploratory Data Analysis

Unified notebook for the Fitbit wearable, Chengdu PM2.5 air-quality, and Delhi weather datasets. Use it to inspect raw distributions, data-quality issues, and cross-modal relationships before feature engineering.


In [1]:
from __future__ import annotations

import os
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

from federated_health_risk.utils.config import init_env

pd.options.display.max_columns = 50
sns.set_theme(style="whitegrid", context="talk")

init_env()
fitbit_path = Path(os.getenv("FITBIT_CSV", "data/fitbit/dailyActivity_merged.csv"))
air_path = Path(os.getenv("AIR_QUALITY_CSV", "data/air_quality/ChengduPM20100101_20151231.csv"))
weather_path = Path(os.getenv("WEATHER_CSV", "data/weather/DailyDelhiClimateTrain.csv"))

print("Loading datasets:\n", fitbit_path, "\n", air_path, "\n", weather_path)
fitbit_df = pd.read_csv(fitbit_path)
air_df = pd.read_csv(air_path)
weather_df = pd.read_csv(weather_path)

fitbit_df.head()


Loading datasets:
 C:\Users\Sahal Saeed\Documents\7semester\mlops\project_cursor\data\fitbit\dailyActivity_merged.csv 
 C:\Users\Sahal Saeed\Documents\7semester\mlops\project_cursor\data\air_quality\ChengduPM20100101_20151231.csv 
 C:\Users\Sahal Saeed\Documents\7semester\mlops\project_cursor\data\weather\DailyDelhiClimateTrain.csv


|  | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.5 | 8.5 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.4 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |


In [2]:
summary = {
    "fitbit": fitbit_df.describe(include="all").transpose().assign(dataset="fitbit"),
    "air_quality": air_df.describe(include="all").transpose().assign(dataset="air_quality"),
    "weather": weather_df.describe(include="all").transpose().assign(dataset="weather"),
}
summary_df = pd.concat(summary.values(), axis=0)
summary_df.head(10)


|  | count | unique | top | freq | mean | std | min | 25% | 50% | 75% | max | dataset |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| Id | 940.0 |  |  |  | 4855407369.332978 | 2424805475.65796 | 1503960366.0 | 2320127002.0 | 4445114986.0 | 6962181067.0 | 8877689391.0 | fitbit |
| ActivityDate | 940.0 | 31.0 | 4/12/2016 | 33.0 |  |  |  |  |  |  |  | fitbit |
| TotalSteps | 940.0 |  |  |  | 7637.910638 | 5087.150742 | 0.0 | 3789.75 | 7405.5 | 10727.0 | 36019.0 | fitbit |
| TotalDistance | 940.0 |  |  |  | 5.489702 | 3.924606 | 0.0 | 2.62 | 5.245 | 7.7125 | 28.030001 | fitbit |
| TrackerDistance | 940.0 |  |  |  | 5.475351 | 3.907276 | 0.0 | 2.62 | 5.245 | 7.71 | 28.030001 | fitbit |
| LoggedActivitiesDistance | 940.0 |  |  |  | 0.108171 | 0.619897 | 0.0 | 0.0 | 0.0 | 0.0 | 4.942142 | fitbit |
| VeryActiveDistance | 940.0 |  |  |  | 1.502681 | 2.658941 | 0.0 | 0.0 | 0.21 | 2.0525 | 21.92 | fitbit |
| ModeratelyActiveDistance | 940.0 |  |  |  | 0.567543 | 0.88358 | 0.0 | 0.0 | 0.24 | 0.8 | 6.48 | fitbit |
| LightActiveDistance | 940.0 |  |  |  | 3.340819 | 2.040655 | 0.0 | 1.945 | 3.365 | 4.7825 | 10.71 | fitbit |
| SedentaryActiveDistance | 940.0 |  |  |  | 0.001606 | 0.007346 | 0.0 | 0.0 | 0.0 | 0.0 | 0.11 | fitbit |


## To-Do
- Plot key distributions (histograms, KDEs) across all modalities.
- Check missingness per column and handle outliers (e.g., negative PM values).
- Align timestamps across modalities (e.g., aggregate Fitbit daily metrics, resample weather/air data to a common frequency).
- Save cleaned versions into `data/eda/` for downstream feature engineering.
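The missingness, outlier, and timestamp items above could be sketched roughly as follows. This is a minimal illustration on a toy frame, not the project's cleaning code: the helper names (`missingness_report`, `mask_negative`, `to_daily_mean`) and the `timestamp` and `pm25` columns are hypothetical stand-ins, and the real air-quality and weather CSVs may use different column names.

```python
from __future__ import annotations

import pandas as pd


def missingness_report(df: pd.DataFrame) -> pd.DataFrame:
    """Per-column missing counts and fractions, worst column first."""
    report = pd.DataFrame(
        {"n_missing": df.isna().sum(), "frac_missing": df.isna().mean()}
    )
    return report.sort_values("frac_missing", ascending=False)


def mask_negative(df: pd.DataFrame, columns: list[str]) -> pd.DataFrame:
    """Return a copy with negative readings in `columns` replaced by NaN."""
    cleaned = df.copy()
    for col in columns:
        # Series.where keeps values where the condition holds, NaN elsewhere.
        cleaned[col] = cleaned[col].where(cleaned[col] >= 0)
    return cleaned


def to_daily_mean(df: pd.DataFrame, time_col: str) -> pd.DataFrame:
    """Resample the numeric columns to one row per calendar day (mean)."""
    indexed = df.set_index(pd.to_datetime(df[time_col]))
    return indexed.select_dtypes("number").resample("D").mean()


# Toy sub-daily readings; column names are assumptions for illustration only.
toy_air = pd.DataFrame(
    {
        "timestamp": [
            "2015-01-01 00:00",
            "2015-01-01 12:00",
            "2015-01-02 00:00",
            "2015-01-02 12:00",
        ],
        "pm25": [10.0, 30.0, -999.0, 50.0],
    }
)

toy_clean = mask_negative(toy_air, ["pm25"])
toy_report = missingness_report(toy_clean)
toy_daily = to_daily_mean(toy_clean, "timestamp")

print(toy_report)
print(toy_daily)
```

Masking negative values as missing, rather than dropping rows, keeps the time index intact for resampling; whether that is the right treatment for the real data is a judgment call. For the Fitbit frame, the preview above shows `ActivityDate` in month/day/year form, so `pd.to_datetime(fitbit_df["ActivityDate"], format="%m/%d/%Y")` should parse it before any join on date.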


# 01 – Exploratory Data Analysis

Use this notebook to profile the wearable, air-quality, and weather datasets on each node. Run data-quality checks, visualize distributions, and confirm schema adherence before feeding the data into downstream pipelines.
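One lightweight way to confirm schema adherence is to compare a frame's columns against a required set and count duplicated rows. The sketch below assumes a hypothetical helper `check_schema` and a hand-picked `REQUIRED_FITBIT_COLUMNS` set; the column names come from the Fitbit preview in this notebook, but which columns the pipelines actually require is an assumption.

```python
from __future__ import annotations

import pandas as pd

# Assumed minimum set of columns; adjust to whatever the pipeline really needs.
REQUIRED_FITBIT_COLUMNS = {"Id", "ActivityDate", "TotalSteps", "Calories"}


def check_schema(df: pd.DataFrame, required: set[str]) -> list[str]:
    """Return human-readable schema problems; an empty list means the check passed."""
    problems: list[str] = []
    missing = sorted(required - set(df.columns))
    if missing:
        problems.append(f"missing columns: {missing}")
    n_dupes = int(df.duplicated().sum())
    if n_dupes:
        problems.append(f"{n_dupes} fully duplicated rows")
    return problems


# Toy frames: one that conforms, one with a dropped column and a repeated row.
toy_ok = pd.DataFrame(
    {
        "Id": [1, 2],
        "ActivityDate": ["4/12/2016", "4/13/2016"],
        "TotalSteps": [13162, 10735],
        "Calories": [1985, 1797],
    }
)
toy_bad = toy_ok.drop(columns=["Calories"])
toy_bad = pd.concat([toy_bad, toy_bad.iloc[[0]]], ignore_index=True)

ok_problems = check_schema(toy_ok, REQUIRED_FITBIT_COLUMNS)
bad_problems = check_schema(toy_bad, REQUIRED_FITBIT_COLUMNS)

print(ok_problems)
print(bad_problems)
```

Returning a list of problems instead of raising lets a node log every issue in one pass; a stricter pipeline might raise on the first failure instead.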


In [3]:
import os
from pathlib import Path

import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
from IPython.display import display  # explicit import so the cell also runs outside Jupyter

from federated_health_risk.utils.config import init_env

init_env()
fitbit_path = Path(
    os.getenv("FITBIT_CSV", "data/fitbit/dailyActivity_merged.csv")
)

if fitbit_path.exists():
    wearables = pd.read_csv(fitbit_path)
    print(f"Loaded Fitbit dataset: {fitbit_path}")
    display(wearables.head())
else:
    print(
        f"Fitbit CSV not found at {fitbit_path}. Update FITBIT_CSV in .env or verify path."
    )


Loaded Fitbit dataset: C:\Users\Sahal Saeed\Documents\7semester\mlops\project_cursor\data\fitbit\dailyActivity_merged.csv


|  | Id | ActivityDate | TotalSteps | TotalDistance | TrackerDistance | LoggedActivitiesDistance | VeryActiveDistance | ModeratelyActiveDistance | LightActiveDistance | SedentaryActiveDistance | VeryActiveMinutes | FairlyActiveMinutes | LightlyActiveMinutes | SedentaryMinutes | Calories |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 0 | 1503960366 | 4/12/2016 | 13162 | 8.5 | 8.5 | 0.0 | 1.88 | 0.55 | 6.06 | 0.0 | 25 | 13 | 328 | 728 | 1985 |
| 1 | 1503960366 | 4/13/2016 | 10735 | 6.97 | 6.97 | 0.0 | 1.57 | 0.69 | 4.71 | 0.0 | 21 | 19 | 217 | 776 | 1797 |
| 2 | 1503960366 | 4/14/2016 | 10460 | 6.74 | 6.74 | 0.0 | 2.44 | 0.4 | 3.91 | 0.0 | 30 | 11 | 181 | 1218 | 1776 |
| 3 | 1503960366 | 4/15/2016 | 9762 | 6.28 | 6.28 | 0.0 | 2.14 | 1.26 | 2.83 | 0.0 | 29 | 34 | 209 | 726 | 1745 |
| 4 | 1503960366 | 4/16/2016 | 12669 | 8.16 | 8.16 | 0.0 | 2.71 | 0.41 | 5.04 | 0.0 | 36 | 10 | 221 | 773 | 1863 |


## Next Steps
- Generate summary stats for each modality (mean, std, missingness).
- Visualize time series per node with seaborn/matplotlib.
- Store cleaned outputs under `data/eda/` for reproducibility.
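The summary-statistics and storage steps above might look something like the sketch below. It is a minimal illustration under stated assumptions: `modality_summary` and `save_clean` are hypothetical helpers, the toy column `temp_c` is invented, and the example writes to a temporary directory, whereas the notebook would presumably pass `Path("data/eda")`.

```python
from __future__ import annotations

import tempfile
from pathlib import Path

import numpy as np
import pandas as pd


def modality_summary(frames: dict[str, pd.DataFrame]) -> pd.DataFrame:
    """Mean, std, and missing fraction for every numeric column of each frame."""
    parts = []
    for name, df in frames.items():
        numeric = df.select_dtypes("number")
        stats = pd.DataFrame(
            {
                "mean": numeric.mean(),
                "std": numeric.std(),
                "frac_missing": numeric.isna().mean(),
            }
        )
        stats.insert(0, "dataset", name)
        parts.append(stats.rename_axis("column").reset_index())
    return pd.concat(parts, ignore_index=True)


def save_clean(df: pd.DataFrame, name: str, out_dir: Path) -> Path:
    """Write a cleaned frame to `<out_dir>/<name>_clean.csv` and return the path."""
    out_dir.mkdir(parents=True, exist_ok=True)
    out_path = out_dir / f"{name}_clean.csv"
    df.to_csv(out_path, index=False)
    return out_path


# Toy frames standing in for two modalities; values are made up.
toy_frames = {
    "fitbit": pd.DataFrame({"TotalSteps": [100, 200, 300]}),
    "weather": pd.DataFrame({"temp_c": [10.0, np.nan]}),
}

summary = modality_summary(toy_frames)
print(summary)

# Temporary directory for the example only; swap in Path("data/eda") in the notebook.
out_dir = Path(tempfile.mkdtemp()) / "eda"
saved_path = save_clean(toy_frames["fitbit"], "fitbit", out_dir)
print(saved_path)
```

Stacking all modalities into one long table with a `dataset` column makes it easy to filter or plot by modality later. Note that a column with a single non-missing value has an undefined sample standard deviation, which pandas reports as NaN.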
