# 01_exploration - Synthetic clinical risk dataset

This notebook performs a quick exploratory data analysis (EDA) of the
synthetic patient dataset used in the Clinical Risk Scorer demo.

We inspect basic structure, value distributions and simple correlations
with the binary target `high_risk`.


In [None]:
import matplotlib.pyplot as plt
import pandas as pd

from src.paths import PROCESSED_DATA_DIR


In [None]:
train_path = PROCESSED_DATA_DIR / "train.csv"
test_path = PROCESSED_DATA_DIR / "test.csv"

df_train = pd.read_csv(train_path)
df_test = pd.read_csv(test_path)

df_train.head()

## Basic info

Check the shape of the training set and basic column information.

In [None]:
df_train.shape, df_train.dtypes
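
Shape and dtypes alone do not reveal missing values or constant columns. The following is a minimal sketch of a per-column summary, not part of the original notebook: the helper name `summarize_columns` and the toy frame are hypothetical stand-ins for `df_train`, and the values are made up for illustration.

```python
import pandas as pd


def summarize_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return dtype, missing-value count and distinct-value count per column."""
    return pd.DataFrame(
        {
            "dtype": df.dtypes.astype(str),
            "n_missing": df.isna().sum(),
            # nunique() ignores NaN by default
            "n_unique": df.nunique(),
        }
    )


# Toy frame standing in for df_train (hypothetical values).
toy = pd.DataFrame(
    {
        "age": [34, 51, None, 67],
        "bmi": [22.1, 30.4, 27.8, 27.8],
        "high_risk": [0, 1, 0, 1],
    }
)

summary = summarize_columns(toy)
print(summary)
```

In the notebook itself, the same helper could be called as `summarize_columns(df_train)`, assuming the frame has been loaded as above.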

## Distributions of key features

We take a quick look at the distributions of age, BMI and the target `high_risk`.


In [None]:
fig, axes = plt.subplots(1, 3, figsize=(12, 3))

df_train["age"].hist(bins=30, ax=axes[0])
axes[0].set_title("Age distribution (train)")

df_train["bmi"].hist(bins=30, ax=axes[1])
axes[1].set_title("BMI distribution (train)")

df_train["high_risk"].value_counts(normalize=True).plot(
    kind="bar", ax=axes[2]
)
axes[2].set_title("high_risk class balance")

plt.tight_layout()
plt.show()
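
The bar chart shows the class balance visually; it can also help to read the exact proportions. A minimal sketch, assuming the target is coded as 0/1: the helper `class_balance` and the toy series are hypothetical and only illustrate the computation.

```python
import pandas as pd


def class_balance(target: pd.Series) -> pd.Series:
    """Return the share of each class, ordered by class label."""
    return target.value_counts(normalize=True).sort_index()


# Toy target standing in for df_train["high_risk"] (hypothetical values).
toy_target = pd.Series([0, 0, 0, 1], name="high_risk")

balance = class_balance(toy_target)
print(balance)
```

A strongly skewed balance would suggest stratified splitting and metrics other than plain accuracy when the model is evaluated later.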

## Correlations with the target

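Compute Pearson correlations between each numeric feature and `high_risk`. Because the target is binary, these are point-biserial correlations and capture only linear association.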


In [None]:
corr = df_train.corr(numeric_only=True)
# Drop the target's trivial self-correlation (1.0) before ranking the features.
corr["high_risk"].drop("high_risk").sort_values(ascending=False)
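
Sorting by signed value places strongly negative correlations at the bottom of the list, where they are easy to overlook. One possible alternative, sketched below, ranks features by absolute correlation. The helper `target_correlations`, the column `feature_x` and all toy values are hypothetical and not taken from the dataset.

```python
import pandas as pd


def target_correlations(df: pd.DataFrame, target: str = "high_risk") -> pd.Series:
    """Pearson correlation of each numeric feature with the target,
    ordered by absolute value (strongest first), target excluded."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.drop(columns=target).corrwith(numeric[target])
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Toy frame standing in for df_train (hypothetical values).
toy = pd.DataFrame(
    {
        "age": [30, 40, 50, 60],
        "bmi": [21, 22, 30, 33],
        "feature_x": [5, 4, 2, 1],  # made-up feature, negatively related to the target
        "high_risk": [0, 0, 1, 1],
    }
)

ranked = target_correlations(toy)
print(ranked)
```

With this ordering, the negatively correlated `feature_x` appears between `bmi` and `age` instead of last.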