# Exploratory Data Analysis of Synthetic Biohacking Data

This workbook leans on the reusable `scripts/eda_utils.py` helpers to summarize distributions, surface correlations, and flag potential outliers inside the synthetic biohacking dataset.


In [None]:
import pandas as pd
import matplotlib.pyplot as plt
from IPython.display import display

from scripts.eda_utils import (
    calculate_correlations,
    create_correlation_heatmap,
    create_distribution_plot,
    get_summary_statistics,
)

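# matplotlib 3.6 renamed the bundled seaborn styles; fall back to the old name on older versions.
style_name = 'seaborn-v0_8-darkgrid' if 'seaborn-v0_8-darkgrid' in plt.style.available else 'seaborn-darkgrid'
plt.style.use(style_name)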


In [None]:
df = pd.read_csv('data/raw/synthetic_biohacking_data.csv')
df.head()


The dataset contains 600 synthetic samples covering five core lifestyle signals: `sleep_hours`, `workout_intensity`, `supplement_intake`, `screen_time`, and `stress_level`. These signals feed the downstream personalization pipeline, so it is critical to understand their central tendencies, spread, and any anomalies before modeling.


In [None]:
summary_stats = get_summary_statistics(df)
summary_stats


Summary statistics reveal that the cohort sleeps about 6.52 hours per night on average with an interquartile range from ~6.25 to ~6.78 hours. Workout intensity sits near 4.56 units (median 4.58) while supplement intake centers around 3.31 servings. Screen time averages just under 6 hours, and stress level clusters near a moderate 3.08. All features have relatively tight variance, which is consistent with the synthetic generation process.
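
The internals of `get_summary_statistics` are not shown in this notebook, so the following is only a minimal sketch of how the figures quoted above (mean, median, quartiles, IQR) could be reproduced with plain pandas. The helper name `summarize_numeric` and the tiny demo frame are hypothetical and use made-up values, not the real dataset.

```python
import pandas as pd


def summarize_numeric(frame: pd.DataFrame) -> pd.DataFrame:
    """Return mean, median, std, quartiles, and IQR for each numeric column."""
    numeric = frame.select_dtypes(include="number")
    stats = pd.DataFrame({
        "mean": numeric.mean(),
        "median": numeric.median(),
        "std": numeric.std(),
        "q25": numeric.quantile(0.25),
        "q75": numeric.quantile(0.75),
    })
    # The interquartile range is the spread of the middle 50% of values.
    stats["iqr"] = stats["q75"] - stats["q25"]
    return stats


# Illustrative values only (assumption): a stand-in for the real CSV.
demo = pd.DataFrame({
    "sleep_hours": [6.0, 6.5, 7.0, 6.5],
    "screen_time": [5.0, 6.0, 7.0, 6.0],
})
demo_stats = summarize_numeric(demo)
print(demo_stats.round(2))
```

On the real data, `summarize_numeric(df)` would give one row per feature, which makes it easy to cross-check the numbers reported by the project helper.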


In [None]:
corr_matrix = calculate_correlations(df)
corr_matrix


In [None]:
fig_corr, ax_corr = create_correlation_heatmap(corr_matrix)
display(fig_corr)
plt.close(fig_corr)


The correlation matrix shows that `workout_intensity` and `supplement_intake` are strongly linked (~0.88), a consistent synthetic pattern in which higher training loads pair with higher supplement intake. `sleep_hours` has a moderate inverse relationship with `screen_time` (-0.54), suggesting the synthetic users trade rest for late-night screen exposure. Stress level correlates weakly with higher screen time and lower workout intensity, mirroring common wellness anecdotes.
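
Reading the strongest relationships off a heatmap gets harder as the number of features grows. As a rough sketch, the pairs can instead be ranked by absolute correlation. The function `top_correlated_pairs` is a hypothetical helper, not part of `scripts/eda_utils.py`, and the demo matrix uses made-up feature names and values.

```python
from itertools import combinations

import pandas as pd


def top_correlated_pairs(corr: pd.DataFrame, n: int = 3) -> pd.DataFrame:
    """List the n feature pairs with the largest absolute correlation."""
    # combinations() yields each unordered pair once and skips the diagonal.
    rows = [
        {"feature_a": a, "feature_b": b, "correlation": float(corr.loc[a, b])}
        for a, b in combinations(corr.columns, 2)
    ]
    pairs = pd.DataFrame(rows)
    # Sort by magnitude so strong negative links rank alongside positive ones.
    order = pairs["correlation"].abs().sort_values(ascending=False).index
    return pairs.loc[order].head(n).reset_index(drop=True)


# Illustrative correlation matrix (assumption): not taken from the real dataset.
names = ["a", "b", "c"]
demo_corr = pd.DataFrame(
    [[1.0, 0.9, -0.2],
     [0.9, 1.0, -0.6],
     [-0.2, -0.6, 1.0]],
    index=names,
    columns=names,
)
top_pairs = top_correlated_pairs(demo_corr, n=2)
print(top_pairs)
```

In this notebook the call would be `top_correlated_pairs(corr_matrix)`, assuming `calculate_correlations` returns a square DataFrame indexed by feature name.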


In [None]:
numeric_columns = df.select_dtypes(include='number').columns
for column in numeric_columns:
    fig, ax = create_distribution_plot(df, column)
    display(fig)
    plt.close(fig)


In [None]:
# Flag outliers per feature with Tukey's rule: values more than 1.5 * IQR beyond the quartiles.
numeric = df.select_dtypes(include='number')
quantiles = numeric.quantile([0.25, 0.75])
iqr = quantiles.loc[0.75] - quantiles.loc[0.25]
outlier_rows = []
for column in numeric.columns:
    lower = quantiles.loc[0.25, column] - 1.5 * iqr[column]
    upper = quantiles.loc[0.75, column] + 1.5 * iqr[column]
    mask = (numeric[column] < lower) | (numeric[column] > upper)
    outlier_rows.append({
        'feature': column,
        'outlier_count': int(mask.sum()),
        'lower_bound': round(float(lower), 2),
        'upper_bound': round(float(upper), 2),
    })
# Rank features by how many values fall outside the fences.
outlier_summary = (
    pd.DataFrame(outlier_rows)
    .sort_values('outlier_count', ascending=False)
    .reset_index(drop=True)
)
outlier_summary


The IQR-based outlier check surfaces only a handful of extreme entries (5 sleep records, 4 screen-time values, and even fewer for the other features). There are no missing values, and the ranges stay within realistic synthetic bounds (e.g., screen time peaks near 13 hours). Because the feature distributions are stable and outlier counts are low, downstream modeling can proceed with standard scaling, with robust trimming applied only if needed.
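
The notebook does not say how trimming would be done. One possible reading, sketched here under that assumption, is to clip each numeric feature to the same 1.5 * IQR fences used in the outlier check, after confirming that nothing is missing. The helper `clip_to_iqr_fences` is hypothetical and the demo values are made up.

```python
import pandas as pd


def clip_to_iqr_fences(frame: pd.DataFrame, k: float = 1.5) -> pd.DataFrame:
    """Clip every numeric column to [Q1 - k*IQR, Q3 + k*IQR]."""
    numeric = frame.select_dtypes(include="number")
    q1 = numeric.quantile(0.25)
    q3 = numeric.quantile(0.75)
    iqr = q3 - q1
    clipped = frame.copy()
    # axis=1 aligns the per-column bounds (Series) with the DataFrame columns.
    clipped[numeric.columns] = numeric.clip(
        lower=q1 - k * iqr, upper=q3 + k * iqr, axis=1
    )
    return clipped


# Illustrative values only (assumption): one extreme screen-time reading.
demo = pd.DataFrame({"screen_time": [5.0, 6.0, 6.0, 7.0, 13.0]})

# Count missing values across the whole frame before trimming.
missing_total = int(demo.isna().sum().sum())

clipped = clip_to_iqr_fences(demo)
print(missing_total)
print(clipped["screen_time"].tolist())
```

Clipping keeps every row and only pulls extreme values back to the fence, which may be preferable to dropping rows when the dataset has only 600 samples. Dropping the flagged rows instead would be an equally valid choice.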
