# Interactive Exploration of Cardiometabolic Risk Factors in the NHANES Population
**Name:** Gia Lam Nguyen  
**Student ID:** S225247631  
**Email:** s225247631@deakin.edu.au  
**Unit:** SIT731 (Postgraduate, Master of Data Science)  
**Assessment:** Task 7D

## Introduction

This study investigates **population-level patterns and determinants of blood glucose**, a key indicator of metabolic health and diabetes risk, using data from the **National Health and Nutrition Examination Survey (NHANES)**. NHANES is a large-scale health monitoring program conducted by the **U.S. Centers for Disease Control and Prevention (CDC)** that integrates interviews, physical examinations, and laboratory tests to provide nationally representative health and nutrition data.

The analysis draws on **two NHANES survey cycles (2017–2020 and 2021–2023)**, enabling a comprehensive and contemporary examination of metabolic health across the U.S. population (CDC, 2017–2020; CDC, 2021–2023).

The central objective of this analysis is to **explore how blood glucose levels vary in relation to demographic, physiological, and behavioural factors**. Specifically, the study examines associations between glucose and:
- demographic characteristics such as **age and sex**,
- physiological indicators including **body mass index (BMI) and blood pressure**,
- behavioural factors, particularly **smoking status**.

Together, these variables capture multiple dimensions of metabolic risk and allow examination of both individual-level variability and broader population health patterns.

All variables used in this study were identified, cleaned, and validated in **Part 1**, with careful attention to NHANES survey design, missing data, and laboratory subsample constraints. These considerations are especially important for glucose-related measures, which are collected only for eligible subsamples and therefore exhibit structured missingness.

In **Part 2**, the analysis moves beyond static descriptive summaries and adopts **interactive visual analytics** using the **Bokeh** library (Bokeh Development Team, 2023). Five non-trivial interactive visualisations are developed to progressively examine:
- age- and sex-related patterns in glucose levels,
- distributional characteristics and variability,
- behavioural associations such as smoking status,
- and joint physiological relationships, particularly between BMI and glucose.

The use of interactivity enables dynamic filtering, smoothing, and subgroup comparison, allowing relationships to be explored at multiple levels of granularity. This approach supports deeper exploratory insight while maintaining transparency and interpretability when working with a large, heterogeneous population health dataset.

These visual analyses provide the empirical foundation for **Part 3**, where key analytical conclusions are synthesised and critically discussed, alongside reflection on **data privacy considerations, ethical interpretation, and the responsible use of population health data**.

## 1. Data Acquisition and Integration of NHANES Datasets (2017–2020 and 2021–2023)
For this task, five NHANES datasets were selected to represent complementary aspects of individual health status and to enable participant-level analysis across multiple domains:

- **DEMO** – Demographic information (participant-level baseline)  
- **BMX** – Body measurements (e.g., height, weight, BMI)  
- **BPXO** – Blood pressure measurements  
- **GLU** – Laboratory-based glucose indicators  
- **SMQ** – Smoking behaviour questionnaire  

Two NHANES survey cycles (**2017–2020** and **2021–2023**) were combined to increase the effective sample size while maintaining consistency in survey design and variable definitions.

Data from the two cycles were first **concatenated within each domain** to preserve domain-specific variable structures. Subsequently, all domains were merged using **SEQN** as the unique participant identifier, with the **DEMO dataset serving as the base table** and **left joins** applied to retain all sampled participants.

Basic validation checks confirmed that **SEQN was unique** in the merged dataset and that the resulting dataset dimensions were consistent with the demographic base table, indicating a correct participant-level integration.
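The integrity checks described above can be illustrated on a toy example. This is a minimal sketch, assuming two invented tables (`demo_toy`, `glu_toy`) in place of the real NHANES files; it uses the `validate` argument of `DataFrame.merge`, which the notebook itself does not use, to make the one-record-per-participant assumption explicit.

```python
import pandas as pd

# Toy stand-ins for two NHANES domains (all values are invented for illustration).
demo_toy = pd.DataFrame({"SEQN": [1, 2, 3, 4], "RIDAGEYR": [34, 61, 12, 47]})
glu_toy = pd.DataFrame({"SEQN": [2, 4], "LBXGLU": [101.0, 93.0]})

# validate="one_to_one" makes pandas raise MergeError if SEQN repeats on
# either side, so a many-to-many join cannot slip through silently.
merged_toy = demo_toy.merge(glu_toy, on="SEQN", how="left", validate="one_to_one")

# A left join on a unique key must keep exactly the base table's rows.
assert merged_toy["SEQN"].is_unique
assert len(merged_toy) == len(demo_toy)

# Participants without a lab record keep their row, with NaN for glucose.
n_missing_glucose = int(merged_toy["LBXGLU"].isna().sum())
print(n_missing_glucose)
```

Checking uniqueness and row count after the merge, as the notebook does, catches the same problems; `validate` simply moves the failure to the point where the join happens.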

In [1]:
import pandas as pd
from pathlib import Path

DATA_ROOT = Path("data")

def find_cycle_folder(root: Path, candidates: list[str]) -> Path:
    for name in candidates:
        p = root / name
        if p.exists() and p.is_dir():
            return p
    raise FileNotFoundError(
        f"Cannot find cycle folder in {root}. Tried: {candidates}\n"
        f"Existing folders: {[x.name for x in root.iterdir() if x.is_dir()]}"
    )
PATH_2017_2020 = find_cycle_folder(DATA_ROOT, ["2017_2020", "data2017_2020", "2017-2020"])
PATH_2021_2023 = find_cycle_folder(DATA_ROOT, ["2021_2023", "data2021_2023", "2021-2023"])

print("Detected folders:")
print("2017–2020 ->", PATH_2017_2020)
print("2021–2023 ->", PATH_2021_2023)

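def read_xpt(path: Path) -> pd.DataFrame:
    # NHANES distributes its data as SAS transport (XPT) files.
    df = pd.read_sas(path, format="xport")
    # Defensive: decode any column names that come back as bytes.
    df.columns = [c.decode() if isinstance(c, bytes) else c for c in df.columns]
    return df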

def find_xpt_by_domain(folder: Path, domain: str) -> Path:
    patterns = [f"{domain}_*.xpt", f"P_{domain}.xpt", f"{domain}.xpt"]
    matches = []
    for pat in patterns:
        matches += list(folder.glob(pat))
    matches = sorted(set(matches), key=lambda p: p.name.lower())

    if len(matches) == 0:
        raise FileNotFoundError(
            f"No XPT found for domain '{domain}' in {folder}.\n"
            f"Files in folder: {[p.name for p in folder.glob('*.xpt')]}"
        )
    if len(matches) > 1:
        raise FileExistsError(
            f"Multiple XPT files found for '{domain}' in {folder}: {[m.name for m in matches]}\n"
            "Keep only one file per domain per cycle folder, or move extras elsewhere."
        )
    return matches[0]

DOMAINS = ["DEMO", "BMX", "BPXO", "GLU", "SMQ"]

def load_cycle(folder: Path, cycle_label: str) -> dict[str, pd.DataFrame]:
    out = {}
    for d in DOMAINS:
        f = find_xpt_by_domain(folder, d)
        df = read_xpt(f)
        if "SEQN" not in df.columns:
            raise ValueError(f"SEQN not found in {f.name}")
        df["cycle"] = cycle_label
        out[d] = df
    return out

data_17 = load_cycle(PATH_2017_2020, "2017-2020")
data_21 = load_cycle(PATH_2021_2023, "2021-2023")

print("\nLoaded files:")
print("2017–2020:", {d: find_xpt_by_domain(PATH_2017_2020, d).name for d in DOMAINS})
print("2021–2023:", {d: find_xpt_by_domain(PATH_2021_2023, d).name for d in DOMAINS})

demo_all = pd.concat([data_17["DEMO"], data_21["DEMO"]], ignore_index=True, sort=False)
bmx_all  = pd.concat([data_17["BMX"],  data_21["BMX"]],  ignore_index=True, sort=False)
bpx_all  = pd.concat([data_17["BPXO"], data_21["BPXO"]], ignore_index=True, sort=False)
glu_all  = pd.concat([data_17["GLU"],  data_21["GLU"]],  ignore_index=True, sort=False)
smq_all  = pd.concat([data_17["SMQ"],  data_21["SMQ"]],  ignore_index=True, sort=False)

summary = pd.DataFrame({
    "Dataset": DOMAINS,
    "2017-2020_rows": [len(data_17[d]) for d in DOMAINS],
    "2021-2023_rows": [len(data_21[d]) for d in DOMAINS],
    "Combined_rows":  [len(x) for x in [demo_all, bmx_all, bpx_all, glu_all, smq_all]],
})
summary

Detected folders:
2017–2020 -> data/data2017_2020
2021–2023 -> data/data2021_2023

Loaded files:
2017–2020: {'DEMO': 'P_DEMO.xpt', 'BMX': 'P_BMX.xpt', 'BPXO': 'P_BPXO.xpt', 'GLU': 'P_GLU.xpt', 'SMQ': 'P_SMQ.xpt'}
2021–2023: {'DEMO': 'DEMO_L.xpt', 'BMX': 'BMX_L.xpt', 'BPXO': 'BPXO_L.xpt', 'GLU': 'GLU_L.xpt', 'SMQ': 'SMQ_L.xpt'}


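|   | Dataset | 2017-2020_rows | 2021-2023_rows | Combined_rows |
|---|---------|----------------|----------------|---------------|
| 0 | DEMO | 15560 | 11933 | 27493 |
| 1 | BMX | 14300 | 8860 | 23160 |
| 2 | BPXO | 11656 | 7801 | 19457 |
| 3 | GLU | 5090 | 3996 | 9086 |
| 4 | SMQ | 11137 | 9015 | 20152 |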


The table above summarises the number of observations available for each NHANES domain across the two survey cycles and after concatenation.

- **DEMO** contains the full sample in both cycles (15,560 and 11,933 records), resulting in **27,493 participants** overall. This confirms that the DEMO dataset represents the complete participant base.
- **BMX** and **BPXO** show moderately fewer records than DEMO, indicating that body measurements and blood pressure data were collected only from participants who attended the examination component.
- **GLU** has the smallest number of observations (5,090 and 3,996), reflecting laboratory testing eligibility criteria such as fasting requirements and age restrictions.
- **SMQ** includes questionnaire-based responses and therefore has higher coverage than laboratory data, but still fewer records than DEMO due to survey routing and eligibility conditions.

These differences in row counts are consistent with the NHANES sampling and data collection design and explain the presence of missing values when domains are merged at the participant level.
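One way to quantify these coverage differences is to ask what share of the DEMO participants appear in each domain file. The sketch below is illustrative only: `coverage_vs_base` is a hypothetical helper, not part of the notebook, and the toy tables stand in for the real domain data frames.

```python
import pandas as pd

def coverage_vs_base(base: pd.DataFrame, domains: dict, key: str = "SEQN") -> pd.Series:
    """Percentage of base-table participants that have a record in each domain."""
    base_ids = base[key]
    return pd.Series(
        {name: base_ids.isin(df[key]).mean() * 100 for name, df in domains.items()}
    ).round(1)

# Invented identifiers: five sampled participants, two partially covered domains.
demo_toy = pd.DataFrame({"SEQN": [1, 2, 3, 4, 5]})
toy_domains = {
    "BMX": pd.DataFrame({"SEQN": [1, 2, 3, 4]}),
    "GLU": pd.DataFrame({"SEQN": [2, 5]}),
}

coverage = coverage_vs_base(demo_toy, toy_domains)
print(coverage)
```

With the notebook's objects, the same call would take `demo_all` as the base and a dictionary of the concatenated domain frames. Note that a participant can have a record in a domain file and still have a missing value for a specific measurement, so record-level coverage is an upper bound on variable-level coverage.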

In [2]:
bmx_m = bmx_all.drop(columns=["cycle"], errors="ignore")
bpx_m = bpx_all.drop(columns=["cycle"], errors="ignore")
glu_m = glu_all.drop(columns=["cycle"], errors="ignore")
smq_m = smq_all.drop(columns=["cycle"], errors="ignore")

df_merged = (demo_all
             .merge(bmx_m, on="SEQN", how="left", suffixes=("", "_bmx"))
             .merge(bpx_m, on="SEQN", how="left", suffixes=("", "_bpx"))
             .merge(glu_m, on="SEQN", how="left", suffixes=("", "_glu"))
             .merge(smq_m, on="SEQN", how="left", suffixes=("", "_smq")))

df_merged.shape

(27493, 92)

In [3]:
print("Merged shape:", df_merged.shape)
print("SEQN unique:", df_merged["SEQN"].is_unique)

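# Share of missing values per column (top 15). Only the last expression in a
# cell is rendered, so wrap this in display(...) or print(...) to inspect it.
missing_top15 = (df_merged.isna().mean() * 100).sort_values(ascending=False).head(15)
df_merged.head()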

Merged shape: (27493, 92)
SEQN unique: True


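|   | SEQN | SDDSRVYR | RIDSTATR | RIAGENDR | RIDAGEYR | RIDAGEMN | RIDRETH1 | RIDRETH3 | RIDEXMON | DMDBORN4 | ... | SMD057 | SMQ078 | SMD641 | SMD650 | SMD100FL | SMD100MN | SMQ670 | SMQ621 | SMD630 | SMAQUEX2 |
|---|------|----------|----------|----------|----------|----------|----------|----------|----------|----------|-----|--------|--------|--------|--------|----------|----------|--------|--------|--------|----------|
| 0 | 109263.0 | 66.0 | 2.0 | 1.0 | 2.0 | NaN | 5.0 | 6.0 | 2.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 1 | 109264.0 | 66.0 | 2.0 | 2.0 | 13.0 | NaN | 1.0 | 1.0 | 2.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 | NaN | 2.0 |
| 2 | 109265.0 | 66.0 | 2.0 | 1.0 | 2.0 | NaN | 3.0 | 3.0 | 2.0 | 1.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN |
| 3 | 109266.0 | 66.0 | 2.0 | 2.0 | 29.0 | NaN | 5.0 | 6.0 | 2.0 | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |
| 4 | 109267.0 | 66.0 | 1.0 | 2.0 | 21.0 | NaN | 2.0 | 2.0 | NaN | 2.0 | ... | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | NaN | 1.0 |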


Five NHANES datasets (**DEMO, BMX, BPXO, GLU, and SMQ**) from two survey cycles (**2017–2020** and **2021–2023**) were successfully downloaded, aligned, and merged into a single analytical data frame using **SEQN** as the unique participant identifier.

After concatenation across cycles within each domain, the final merged dataset contains **27,493 unique participants** and **92 variables**, with **SEQN confirmed to be unique**. This confirms that the merge process did not introduce duplicate records or unintended many-to-many joins, and that participant-level integrity was preserved. Using the **DEMO dataset as the base table** with **left joins** ensured that all sampled participants were retained, even when domain-specific information was unavailable.

As the domain-level summary table shows, coverage narrows from **DEMO** (the full participant base of **27,493 records**) through the examination-based **BMX** and **BPXO** files to **GLU**, the smallest domain, whose laboratory subsample is limited by eligibility criteria such as fasting requirements and age restrictions. **SMQ**, which is questionnaire-based, covers more participants than the laboratory data but still fewer than DEMO because of survey routing and eligibility conditions.

These differences in row counts are consistent with the **NHANES sampling and data collection design rather than data processing errors**, and directly explain the presence of missing values (NaN) after merging at the participant level. The observed missingness therefore reflects conditional data collection and subsampling protocols inherent to NHANES, not issues arising from the integration process.

Overall, the data integration procedure was **logically sound** and resulted in a **well-structured, participant-level dataset** that accurately represents NHANES sampling characteristics. The final merged dataset provides a reliable and transparent foundation for the **interactive visualisation** in Part 2 and the **analytical interpretation and ethical reflection** in Part 3.

## 2. Interactive Visualisations (Bokeh)
### 2.1 Setup and Environment
This section uses the **Bokeh** library to create **interactive, non-trivial** visualisations from the merged NHANES dataset (`df_merged`).  
Interactivity is implemented using **hover tooltips** and **user controls** (sliders and dropdown filters) so that patterns can be explored dynamically rather than shown as static charts.

In [4]:
import numpy as np
import pandas as pd

from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import (
    ColumnDataSource, HoverTool, Select, RangeSlider, Slider,
    DataTable, TableColumn, NumberFormatter, StringFormatter, Div, CustomJS
)
from bokeh.plotting import figure
from bokeh.transform import jitter

output_notebook()

### 2.2 Analysis-ready dataset for visualisation

To support interactive exploration across multiple NHANES domains, a compact visualisation dataset (`df_vis`) is created from `df_merged` using the following fields:

- **Age** (`RIDAGEYR`) and **Sex** (`RIAGENDR`) from DEMO  
- **BMI** (`BMXBMI`) from BMX  
- **Blood pressure** (systolic/diastolic) from BPXO  
- **Glucose** (e.g., `LBXGLU`) from GLU  
- **Smoking** variables (e.g., `SMQ020`, `SMQ040`) from SMQ  

Simple derived labels (Sex, AgeGroup, SmokingStatus) are created to make filtering and grouping more readable in interactive charts.

In [5]:
def pick_first_existing(df, candidates, label, required=True):
    for c in candidates:
        if c in df.columns:
            return c
    if required:
        raise KeyError(f"Cannot find '{label}' column. Tried: {candidates}")
    return None

COL_AGE = pick_first_existing(df_merged, ["RIDAGEYR"], "Age (RIDAGEYR)")
COL_SEX = pick_first_existing(df_merged, ["RIAGENDR"], "Sex (RIAGENDR)")
COL_BMI = pick_first_existing(df_merged, ["BMXBMI"], "BMI (BMXBMI)")

COL_SYS = pick_first_existing(df_merged, ["BPXOSY1","BPXSY1","BPXSY2","BPXSY3","BPXSY4"], "Systolic BP", required=False)
COL_DIA = pick_first_existing(df_merged, ["BPXODI1","BPXDI1","BPXDI2","BPXDI3","BPXDI4"], "Diastolic BP", required=False)

COL_GLU = pick_first_existing(df_merged, ["LBXGLU", "LBDGLUSI", "GLU"], "Glucose", required=False)

COL_SMQ020 = "SMQ020" if "SMQ020" in df_merged.columns else None
COL_SMQ040 = "SMQ040" if "SMQ040" in df_merged.columns else None

use_cols = ["SEQN", COL_AGE, COL_SEX, COL_BMI]
if COL_SYS: use_cols.append(COL_SYS)
if COL_DIA: use_cols.append(COL_DIA)
if COL_GLU: use_cols.append(COL_GLU)
if COL_SMQ020: use_cols.append(COL_SMQ020)
if COL_SMQ040: use_cols.append(COL_SMQ040)

df_vis = df_merged[use_cols].copy()

df_vis["Age"] = pd.to_numeric(df_vis[COL_AGE], errors="coerce")
df_vis["BMI"] = pd.to_numeric(df_vis[COL_BMI], errors="coerce")
df_vis["Systolic"]  = pd.to_numeric(df_vis[COL_SYS], errors="coerce") if COL_SYS else np.nan
df_vis["Diastolic"] = pd.to_numeric(df_vis[COL_DIA], errors="coerce") if COL_DIA else np.nan
df_vis["Glucose"]   = pd.to_numeric(df_vis[COL_GLU], errors="coerce") if COL_GLU else np.nan

sex_map = {1: "Male", 2: "Female"}
df_vis["Sex"] = df_vis[COL_SEX].map(sex_map).fillna("Unknown")

def derive_smoking(row):
    if COL_SMQ040 and pd.notna(row[COL_SMQ040]):
        v = row[COL_SMQ040]
        if v == 1: return "Current (Every day)"
        if v == 2: return "Current (Some days)"
        if v == 3: return "Not at all (Now)"
    if COL_SMQ020 and pd.notna(row[COL_SMQ020]):
        v = row[COL_SMQ020]
        if v == 1: return "Ever (100+ cigs)"
        if v == 2: return "Never (100+ cigs)"
    return "Unknown"

df_vis["SmokingStatus"] = df_vis.apply(derive_smoking, axis=1)

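# right=False makes the bins left-closed: [0, 18), [18, 30), ...,
# so a participant aged exactly 18 falls in "18–29", not "0–17".
df_vis["AgeGroup"] = pd.cut(
    df_vis["Age"],
    bins=[0, 18, 30, 45, 60, 80, 120],
    labels=["0–17", "18–29", "30–44", "45–59", "60–79", "80+"],
    right=False
)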

df_vis.head()

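|   | SEQN | RIDAGEYR | RIAGENDR | BMXBMI | BPXOSY1 | BPXODI1 | LBXGLU | SMQ020 | SMQ040 | Age | BMI | Systolic | Diastolic | Glucose | Sex | SmokingStatus | AgeGroup |
|---|------|----------|----------|--------|---------|---------|--------|--------|--------|-----|-----|----------|-----------|---------|-----|---------------|----------|
| 0 | 109263.0 | 2.0 | 1.0 | NaN | NaN | NaN | NaN | NaN | NaN | 2.0 | NaN | NaN | NaN | NaN | Male | Unknown | 0–17 |
| 1 | 109264.0 | 13.0 | 2.0 | 17.6 | 109.0 | 67.0 | 97.0 | NaN | NaN | 13.0 | 17.6 | 109.0 | 67.0 | 97.0 | Female | Unknown | 0–17 |
| 2 | 109265.0 | 2.0 | 1.0 | 15.0 | NaN | NaN | NaN | NaN | NaN | 2.0 | 15.0 | NaN | NaN | NaN | Male | Unknown | 0–17 |
| 3 | 109266.0 | 29.0 | 2.0 | 37.8 | 99.0 | 56.0 | NaN | 2.0 | NaN | 29.0 | 37.8 | 99.0 | 56.0 | NaN | Female | Never (100+ cigs) | 18–29 |
| 4 | 109267.0 | 21.0 | 2.0 | NaN | NaN | NaN | NaN | 2.0 | NaN | 21.0 | NaN | NaN | NaN | NaN | Female | Never (100+ cigs) | 18–29 |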


After constructing the analysis-ready dataset (`df_vis`) from the merged NHANES data, the resulting table confirms that participant-level alignment across domains was preserved while enabling flexible visual exploration.

The dataset contains **27,493 observations**, matching the DEMO base population, with **SEQN remaining unique**, indicating that the transformation process did not introduce duplication or record loss. This validates that `df_vis` is suitable as a visualisation backbone.

Key health indicators (Age, BMI, blood pressure, glucose, and smoking status) are now co-located in a single table. However, the preview of `df_vis` shows that **not all variables are populated for every participant**. For example:
- BMI is present only for participants who completed body measurements.
- Blood pressure and glucose values appear intermittently, reflecting examination and laboratory eligibility.
- Smoking variables are available for most adults but may be missing or coded as *Unknown* for minors or ineligible respondents.

The derived categorical variables (**Sex**, **AgeGroup**, and **SmokingStatus**) make the data considerably easier to interpret. Rather than relying on numeric codes (e.g., `RIAGENDR = 1/2` or raw SMQ values), the dataset now supports intuitive grouping, filtering, and colour mapping in interactive charts.
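The smoking labels can also be derived without a row-wise function. The following is a sketch of an equivalent vectorised formulation using `np.select`, shown on an invented toy frame; it assumes the same priority order as the notebook's `derive_smoking` (current status from `SMQ040` first, then lifetime status from `SMQ020`) and is offered as an alternative, not as the method used above.

```python
import numpy as np
import pandas as pd

# Invented responses; NaN marks questions that were not asked or not answered.
toy = pd.DataFrame({
    "SMQ020": [1.0, 2.0, 1.0, np.nan, 1.0],
    "SMQ040": [1.0, np.nan, 3.0, np.nan, np.nan],
})

# np.select takes the first matching condition, so SMQ040 has priority.
# Comparisons against NaN are False, so missing answers fall through.
conditions = [
    toy["SMQ040"] == 1,
    toy["SMQ040"] == 2,
    toy["SMQ040"] == 3,
    toy["SMQ020"] == 1,
    toy["SMQ020"] == 2,
]
labels = [
    "Current (Every day)",
    "Current (Some days)",
    "Not at all (Now)",
    "Ever (100+ cigs)",
    "Never (100+ cigs)",
]
toy["SmokingStatus"] = np.select(conditions, labels, default="Unknown")
print(toy)
```

On a table of this size the row-wise `apply` is still workable; the vectorised form mainly helps when the derivation has to be rerun often.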

Importantly, this dataset is **not artificially cleaned or imputed**. Missing values are intentionally retained to reflect NHANES’ conditional data collection design. This allows each subsequent visualisation to handle missingness explicitly, either by filtering to valid records or by making coverage differences visible.

Overall, `df_vis` represents a **balanced trade-off between completeness and analytical flexibility**. It preserves the integrity of the original NHANES structure while providing a coherent, readable, and extensible foundation for interactive Bokeh visualisations in the following sections. To ensure transparency before interactive exploration, the extent and structure of missingness across key variables are examined explicitly in the next section. This data quality snapshot provides context for interpreting subsequent visualisations and clarifies which domains are broadly representative versus conditionally sampled.

### 2.3 Data quality snapshot (missingness and coverage)
NHANES domains have different coverage because some measurements are collected only from eligible subsamples (for example, laboratory tests).  
Therefore, missing values are expected after merging. In the visualisations below, missingness is handled **per chart** by filtering to records with the required fields for that specific relationship.

In [6]:
print("df_vis shape:", df_vis.shape)
print("SEQN unique in df_vis:", df_vis["SEQN"].is_unique)

missing_pct = (
    df_vis[["Age","Sex","BMI","Systolic","Diastolic","Glucose","SmokingStatus","AgeGroup"]]
    .isna()
    .mean()
    .mul(100)
    .sort_values(ascending=False)
)
print("\nNaN missingness (%):")
display(missing_pct)

semantic_missing_pct = pd.Series({
    "Sex == 'Unknown' (%)": (df_vis["Sex"] == "Unknown").mean() * 100,
    "SmokingStatus == 'Unknown' (%)": (df_vis["SmokingStatus"] == "Unknown").mean() * 100,
}).sort_values(ascending=False)
print("\nSemantic missingness (%):")
display(semantic_missing_pct)

agegroup_invalid_pct = (
    df_vis.loc[df_vis["Age"].notna(), "AgeGroup"]
    .isna()
    .mean() * 100
)
print("\nAgeGroup invalid (Age not NaN but AgeGroup NaN) (%):", agegroup_invalid_pct)

range_sanity_pct = pd.Series({
    "BMI out of [10, 80] (%)": ((df_vis["BMI"] < 10) | (df_vis["BMI"] > 80)).mean() * 100,
    "Systolic out of [70, 250] (%)": ((df_vis["Systolic"] < 70) | (df_vis["Systolic"] > 250)).mean() * 100,
    "Diastolic out of [40, 150] (%)": ((df_vis["Diastolic"] < 40) | (df_vis["Diastolic"] > 150)).mean() * 100,
    "Glucose out of [40, 500] (%)": ((df_vis["Glucose"] < 40) | (df_vis["Glucose"] > 500)).mean() * 100,
}).sort_values(ascending=False)
print("\nRange sanity flags (% of all rows):")
display(range_sanity_pct)

df_vis shape: (27493, 17)
SEQN unique in df_vis: True

NaN missingness (%):


Glucose          69.388572
Systolic         35.005274
Diastolic        35.005274
BMI              21.405449
Age               0.000000
Sex               0.000000
SmokingStatus     0.000000
AgeGroup          0.000000
dtype: float64


Semantic missingness (%):


SmokingStatus == 'Unknown' (%)    35.223511
Sex == 'Unknown' (%)               0.000000
dtype: float64


AgeGroup invalid (Age not NaN but AgeGroup NaN) (%): 0.0

Range sanity flags (% of all rows):


Diastolic out of [40, 150] (%)    0.080020
BMI out of [10, 80] (%)           0.018186
Systolic out of [70, 250] (%)     0.018186
Glucose out of [40, 500] (%)      0.007275
dtype: float64

The missingness profile of the analysis-ready dataset shows **clear variation in data coverage across health domains**, which is **consistent with the NHANES survey design** rather than data quality or processing issues (CDC, NHANES).

- **Glucose measurements** have the **highest missingness (≈69%)**, reflecting NHANES laboratory eligibility criteria such as fasting requirements, age thresholds, and participation in the laboratory examination component. As a result, glucose data are available only for a **defined subsample** of participants.

- **Blood pressure variables** show **moderate missingness (≈35%)**, indicating that these measurements were collected during the physical examination and therefore limited to participants who attended and met measurement protocols.

- **BMI** exhibits **lower but non-negligible missingness (≈21%)**, corresponding to incomplete body measurement participation or invalid height and weight readings.

In contrast, **demographic and derived categorical variables** (Age, Sex, AgeGroup) show **complete coverage**, making them reliable anchors for stratification and interactive filtering. Although **SmokingStatus** has no NaN values, a substantial proportion of responses are labelled as *Unknown*, reflecting **semantic missingness** common in questionnaire data, particularly for minors or ineligible respondents.

Overall, this missingness structure accurately reflects **conditional data collection inherent to NHANES** (CDC) and is therefore **informative rather than problematic**. In subsequent interactive visualisations, missingness is handled **on a per-visual basis**, with each chart restricted to participants who have valid data for the variables being examined.
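The per-visual handling of missingness can be sketched as a small filtering step applied before each chart. `complete_cases` is a hypothetical helper introduced here for illustration, and the toy values are invented; the notebook itself performs the equivalent filtering inline with `dropna`.

```python
import numpy as np
import pandas as pd

def complete_cases(df: pd.DataFrame, required: list) -> pd.DataFrame:
    """Return only the rows with non-missing values in every required column."""
    return df.dropna(subset=required).copy()

# Invented records with the kind of structured gaps described above.
toy = pd.DataFrame({
    "Age": [25, 40, 63, 71],
    "BMI": [22.1, np.nan, 30.4, 27.9],
    "Glucose": [np.nan, np.nan, 110.0, 98.0],
})

# Each chart keeps only the participants it can actually plot.
bmi_view = complete_cases(toy, ["Age", "BMI"])
glucose_view = complete_cases(toy, ["Age", "BMI", "Glucose"])

# Reporting the retained sample size keeps the subsample visible to the reader.
retained = {"BMI chart": len(bmi_view), "BMI-glucose chart": len(glucose_view)}
print(retained)
```

Filtering per chart rather than once globally means a participant with a valid BMI but no glucose reading still contributes to the BMI visual.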

### Visual analysis strategy

This analysis deliberately adopts an **exploratory visual analytics approach** to reflect both the **structure of NHANES data collection** and the **progressive nature of health risk assessment**, rather than testing a fixed set of predefined hypotheses.

The five visuals were selected intentionally to move from **high-coverage, foundational measures** to **more restricted and conditionally collected indicators**, ensuring that each visual is analytically justified and contextually connected.

Specifically:
- **BMI by age and sex** is chosen first to establish a **demographic and physiological baseline**, as BMI is widely available and represents overall body composition.
- **Blood pressure** is examined next as a core **cardiovascular risk indicator**, transitioning from general body metrics to clinically relevant outcomes.
- **Fasting glucose** is analysed separately due to its **restricted laboratory eligibility**, allowing metabolic risk to be explored while explicitly acknowledging subsample limitations.
- **Behavioural factors**, such as smoking status, are included to assess how **lifestyle characteristics interact with measured health outcomes**, linking questionnaire data with physiological indicators.
- Finally, a visual focusing on **missingness and subsample coverage** is included to make the **impact of NHANES’ conditional survey design explicit**, ensuring transparency in interpretation.

Together, these visuals form a coherent analytical sequence that reflects **how health data are collected, constrained, and interpreted** within NHANES. 

### 2.4 Visual 1: Body Mass Index (BMI) across Age Groups and Sex
This visual examines how **Body Mass Index (BMI)** varies across **age groups** and **sex**, and whether systematic differences in body composition are observable across key demographic strata.

BMI is a widely used indicator of **metabolic health and obesity risk**, while age and sex are two **fundamental demographic variables** available for nearly all NHANES participants. Analysing BMI across age groups provides insight into how body composition evolves across the life course, while stratification by sex enables comparison of population-level differences.

This visual uses:
- **BMI** as the primary outcome variable  
- **AgeGroup** to represent life-stage progression  
- **Sex** as an interactive stratification control  

Missing BMI values are handled by restricting this visual to participants with valid BMI measurements, so that comparisons are based on observed data only.

In [7]:
import numpy as np
import pandas as pd

from IPython.display import clear_output

from bokeh.io import output_notebook, show
from bokeh.layouts import column, row
from bokeh.models import (
    ColumnDataSource, Select, CustomJS, HoverTool,
    RangeSlider, Slider, Div
)
from bokeh.plotting import figure

output_notebook()

clear_output(wait=True)

df_bmi = df_vis.dropna(subset=["BMI", "AgeGroup", "Sex"]).copy()
df_bmi = df_bmi[df_bmi["Sex"].isin(["Male", "Female"])]

age_order = ["0–17", "18–29", "30–44", "45–59", "60–79", "80+"]
df_bmi["AgeGroup"] = df_bmi["AgeGroup"].astype(str)
df_bmi = df_bmi[df_bmi["AgeGroup"].isin(age_order)]

agg = (
    df_bmi.groupby(["Sex", "AgeGroup"], as_index=False)
    .agg(mean_bmi=("BMI", "mean"), n=("BMI", "size"))
)

sex_list = sorted(agg["Sex"].unique().tolist())
initial_sex = "Male" if "Male" in sex_list else sex_list[0]

mean_lookup = {s: {g: None for g in age_order} for s in sex_list}
n_lookup = {s: {g: 0 for g in age_order} for s in sex_list}

for _, r in agg.iterrows():
    mean_lookup[r["Sex"]][r["AgeGroup"]] = float(r["mean_bmi"])
    n_lookup[r["Sex"]][r["AgeGroup"]] = int(r["n"])

color_map = {
    "Male": "#1f77b4",
    "Female": "#e377c2"
}

source = ColumnDataSource(
    data=dict(
        AgeGroup=age_order,
        mean_bmi=[mean_lookup[initial_sex][g] for g in age_order],
        n=[n_lookup[initial_sex][g] for g in age_order],
        color=[color_map[initial_sex]] * len(age_order),
    )
)

p = figure(
    x_range=age_order,
    height=420,
    width=760,
    title="Mean BMI across age groups (select Sex)",
    y_axis_label="Mean BMI",
    x_axis_label="Age Group",
    toolbar_location="right",
)

bars = p.vbar(
    x="AgeGroup",
    top="mean_bmi",
    width=0.8,
    source=source,
    fill_color="color",
    fill_alpha=0.75,
)

p.add_tools(
    HoverTool(
        renderers=[bars],
        tooltips=[
            ("Age Group", "@AgeGroup"),
            ("Mean BMI", "@mean_bmi{0.0}"),
            ("Sample size (n)", "@n"),
        ],
    )
)

select = Select(title="Select Sex:", value=initial_sex, options=sex_list)

callback = CustomJS(
    args=dict(
        source=source,
        mean_lookup=mean_lookup,
        n_lookup=n_lookup,
        color_map=color_map,
    ),
    code="""
    const sex = cb_obj.value;
    const x = source.data['AgeGroup'];
    const y = source.data['mean_bmi'];
    const n = source.data['n'];
    const c = source.data['color'];

    for (let i = 0; i < x.length; i++) {
        const g = x[i];
        y[i] = mean_lookup[sex][g];
        n[i] = n_lookup[sex][g];
        c[i] = color_map[sex];
    }

    source.change.emit();
    """,
)

select.js_on_change("value", callback)

layout_v1 = column(select, p)

This visualisation presents the **mean Body Mass Index (BMI)** across defined **age groups**, with an interactive selector allowing comparison between **Male and Female** populations.

For both sexes, mean BMI **increases markedly from childhood and adolescence to middle adulthood**. In the **0–17 age group**, the average BMI is relatively low at approximately **20**, before rising to around **27–28** in the **18–29 group**. Mean BMI continues to increase through adulthood, reaching values close to **30–31** in the **30–44 and 45–59 age groups**.

The **highest mean BMI is observed in the 45–59 age group** for both males and females, indicating that mid-life adults exhibit the greatest average BMI levels. Beyond this point, mean BMI **declines slightly**, falling to around **29–30** in the **60–79 group** and further to approximately **27–28** among individuals aged **80 and above**.

Across most age groups, **females show a marginally higher mean BMI than males**, particularly between **18–29 and 60–79**, although the difference remains modest. Importantly, the **overall shape of the BMI–age relationship is consistent for both sexes**, suggesting that, in this sample, **mean BMI varies more with age than with sex**.

The interactive sex selector enhances interpretability by enabling **direct visual comparison**, supporting the conclusion that **mean BMI increases with age up to late middle adulthood, followed by a gradual decline in older age**, with only **minor sex-based variation**.

### 2.5 Visual 2: Blood Pressure patterns across Age and Sex

This visual asks: **How does blood pressure change with age, and does the trend differ by sex?**

Blood pressure is a core **cardiovascular risk indicator** in NHANES. After establishing a baseline body-composition trend in Visual 1 (BMI), blood pressure is a logical next step because it reflects **downstream cardiometabolic strain** that typically varies with age and differs between male and female populations.

**Variables used**
- **Age** (`Age`)
- **Sex** (`Sex`)
- **Blood pressure outcome** (`Systolic` or `Diastolic`)

**Interactive design**
To support open-ended exploratory analysis, the visual includes:
- a **Sex selector** (Male/Female/All)
- an **Age range slider** to focus on life stages
- a **Measure selector** (Systolic vs Diastolic)
- a **Smoothing control** (rolling mean window) to reveal trends while still allowing raw points to remain visible

Missing values are handled **per-visual** by filtering to participants with valid blood pressure for the selected measure, consistent with NHANES subsample coverage.
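To make the smoothing control concrete, the sketch below shows one common way to build such a trend line in pandas: average the measure at each age, then apply a centred rolling mean across ages. It is an assumption about what a "rolling mean window" means here, written on invented toy data; `smoothed_trend` is a hypothetical helper and does not reproduce the JavaScript callback used in the chart.

```python
import pandas as pd

def smoothed_trend(df: pd.DataFrame, value_col: str, window: int) -> pd.Series:
    """Mean of value_col per age, then a centred rolling mean over `window` ages."""
    per_age = (
        df.dropna(subset=["Age", value_col])
        .groupby("Age")[value_col]
        .mean()
        .sort_index()
    )
    # min_periods=1 keeps the youngest and oldest ages instead of dropping them.
    return per_age.rolling(window=window, center=True, min_periods=1).mean()

# Invented readings: two participants aged 20, one each at 21, 22 and 23.
toy = pd.DataFrame({
    "Age": [20, 20, 21, 22, 23],
    "Systolic": [110.0, 114.0, 118.0, 120.0, 130.0],
})

trend_toy = smoothed_trend(toy, "Systolic", window=3)
print(trend_toy)
```

A window of 1 returns the per-age means unchanged, while wider windows trade local detail for a steadier line. The rolling step counts rows, so it spans exactly `window` years only when every age in the range is present.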

In [8]:
df = df_vis[["Age", "Sex", "Systolic", "Diastolic"]].copy()
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Systolic"] = pd.to_numeric(df["Systolic"], errors="coerce")
df["Diastolic"] = pd.to_numeric(df["Diastolic"], errors="coerce")
df["Sex"] = df["Sex"].astype(str)
df = df[df["Sex"].isin(["Male", "Female"])].dropna(subset=["Age"])

age_min = int(np.nanmin(df["Age"]))
age_max = int(np.nanmax(df["Age"]))

title = Div(text="<h3>Visual 2: Blood Pressure vs Age (interactive)</h3>")

sex_select = Select(title="Sex", value="All", options=["All", "Male", "Female"])
measure_select = Select(title="Measure", value="Systolic", options=["Systolic", "Diastolic"])

age_slider = RangeSlider(
    title="Age range",
    start=age_min, end=age_max,
    value=(age_min, age_max),
    step=1
)
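# Odd-only steps: the callback averages window // 2 ages on each side,
# so an even window would behave like the next odd one.
smooth_slider = Slider(title="Smoothing window (years)", start=1, end=15, value=5, step=2)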

src_points = ColumnDataSource(data=dict(Age=[], BP=[], Sex=[], Color=[]))
src_line = ColumnDataSource(data=dict(Age=[], BP=[]))
src_all = ColumnDataSource(df)

p = figure(
    height=420,
    width=820,
    title="Blood Pressure vs Age",
    x_axis_label="Age (years)",
    y_axis_label="Blood Pressure (mmHg)",
    tools="pan,wheel_zoom,box_zoom,reset,save",
)

pts = p.scatter(
    x="Age",
    y="BP",
    source=src_points,
    size=4,
    alpha=0.35,
    color="Color",
)

trend = p.line(
    x="Age",
    y="BP",
    source=src_line,
    line_width=3,
    line_color="#7f7f7f",
)

p.add_tools(
    HoverTool(
        renderers=[pts],
        tooltips=[
            ("Age", "@Age"),
            ("BP", "@BP{0.0}"),
            ("Sex", "@Sex"),
        ],
    )
)

callback = CustomJS(
    args=dict(
        all_src=src_all,
        pts_src=src_points,
        line_src=src_line,
        sex_w=sex_select,
        meas_w=measure_select,
        age_w=age_slider,
        smooth_w=smooth_slider,
        plot=p,
        trend=trend,
    ),
    code="""
    const data = all_src.data;

    const sex = sex_w.value;
    const meas = meas_w.value;
    const a0 = age_w.value[0];
    const a1 = age_w.value[1];
    const window = Math.max(1, Math.floor(smooth_w.value));

    const Age = data['Age'];
    const Sex = data['Sex'];
    const BPraw = data[meas];

    const outAge = [];
    const outBP = [];
    const outSex = [];
    const outColor = [];

    const sums = new Map();
    const counts = new Map();

    for (let i = 0; i < Age.length; i++) {
      const age = Age[i];
      const s = Sex[i];
      const bp = BPraw[i];

      if (age == null) continue;
      if (age < a0 || age > a1) continue;
      if (sex !== "All" && s !== sex) continue;
      if (bp == null || isNaN(bp)) continue;

      outAge.push(Math.floor(age));
      outBP.push(bp);
      outSex.push(s);

      if (s === "Male") outColor.push("#1f77b4");
      else outColor.push("#e377c2");

      const key = Math.floor(age);
      sums.set(key, (sums.get(key) || 0) + bp);
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    pts_src.data = { Age: outAge, BP: outBP, Sex: outSex, Color: outColor };

    const ages = Array.from(sums.keys()).sort((x,y) => x-y);
    const mean = ages.map(a => sums.get(a) / counts.get(a));

    const smooth = [];
    for (let i = 0; i < mean.length; i++) {
      const left = Math.max(0, i - Math.floor(window/2));
      const right = Math.min(mean.length - 1, i + Math.floor(window/2));
      let acc = 0;
      let n = 0;
      for (let j = left; j <= right; j++) {
        acc += mean[j];
        n += 1;
      }
      smooth.push(acc / n);
    }

    line_src.data = { Age: ages, BP: smooth };

    if (sex === "Male") trend.glyph.line_color = "#1f77b4";
    else if (sex === "Female") trend.glyph.line_color = "#e377c2";
    else trend.glyph.line_color = "#7f7f7f";

    plot.title.text = `${meas} vs Age (Sex: ${sex}, window=${window})`;
    plot.yaxis[0].axis_label = `${meas} (mmHg)`;
"""
)

for w in [sex_select, measure_select, age_slider, smooth_slider]:
    w.js_on_change("value", callback)

sex0 = sex_select.value
meas0 = measure_select.value
a0, a1 = age_slider.value
window = int(smooth_slider.value)

df0 = df[(df["Age"] >= a0) & (df["Age"] <= a1)].copy()
if sex0 != "All":
    df0 = df0[df0["Sex"] == sex0]

df0[meas0] = pd.to_numeric(df0[meas0], errors="coerce")
df0 = df0.dropna(subset=[meas0]).copy()

age_int = np.floor(df0["Age"]).astype(int)
sex_arr = df0["Sex"].astype(str).values
bp_arr = df0[meas0].values

color_arr = np.where(sex_arr == "Male", "#1f77b4", "#e377c2")

src_points.data = dict(
    Age=age_int.tolist(),
    BP=bp_arr.tolist(),
    Sex=sex_arr.tolist(),
    Color=color_arr.tolist(),
)

means = (
    pd.DataFrame({"Age": age_int, "BP": bp_arr})
    .groupby("Age")["BP"]
    .mean()
    .sort_index()
)

ages = means.index.to_numpy()
mean_vals = means.values

k = max(1, window)
half = k // 2
smooth_vals = []
for i in range(len(mean_vals)):
    left = max(0, i - half)
    right = min(len(mean_vals) - 1, i + half)
    smooth_vals.append(float(np.mean(mean_vals[left:right + 1])))

src_line.data = dict(Age=ages.tolist(), BP=smooth_vals)

p.title.text = f"{meas0} vs Age (Sex: {sex0}, window={k})"
p.yaxis.axis_label = f"{meas0} (mmHg)"

layout_v2 = column(
    title,
    row(sex_select, measure_select),
    row(age_slider, smooth_slider),
    p,
)

This visual explores the relationship between **blood pressure** and **age**, using an **interactive scatter plot** combined with a **smoothed trend line**. Users can filter the data by **sex** (All, Male, Female), select the **blood pressure measure** (**Systolic** or **Diastolic**), and adjust the **smoothing window** to examine age-related patterns at different levels of granularity.

Across the full population, **systolic blood pressure shows a clear upward trend with increasing age**. The **smoothed trend line** rises steadily from childhood through older adulthood, indicating that **age is strongly associated with higher systolic blood pressure**. This pattern remains consistent even when the **smoothing window** is adjusted, suggesting that the observed increase reflects a **robust population-level trend** rather than random noise.

When stratified by sex, **males exhibit higher systolic blood pressure than females across most age ranges**. This difference is evident throughout adulthood and becomes more pronounced in **middle and later life**. Despite this difference in absolute levels, **both sexes display a similar age-related trajectory**, indicating that **age is the dominant correlate of systolic blood pressure**, with **sex acting as a secondary modifier**.

The scatter of individual observations shows **substantial variability within each age group**, particularly at older ages. This reflects **heterogeneity in health status, lifestyle factors, and medical conditions** among individuals. The adjustable **smoothing window** averages the per-age means across neighbouring ages, which reveals the **underlying systematic trend** and reduces the influence of extreme individual measurements.

Overall, this visual demonstrates that **systolic blood pressure increases progressively with age**, that **males tend to have higher systolic values than females**, and that **interactive smoothing helps distinguish systematic physiological patterns from individual-level variation** in large-scale health datasets such as **NHANES**.
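To put a number on the sex difference described above, one option is to tabulate mean blood pressure by age band and sex and take the difference. The sketch below is a minimal example under stated assumptions: the helper name `sex_gap_by_age_band`, the band edges, and the toy values are all hypothetical, and the column names follow those used in this visual.

```python
import pandas as pd


def sex_gap_by_age_band(df, measure="Systolic", edges=(0, 18, 30, 45, 60, 120)):
    """Mean of `measure` per age band and sex, plus the Male minus Female gap."""
    # Left-closed bands such as [18, 30); the edges are illustrative assumptions.
    bands = pd.cut(df["Age"], bins=list(edges), right=False)
    table = (
        df.groupby([bands, "Sex"], observed=True)[measure]
        .mean()
        .unstack("Sex")
    )
    table["Gap"] = table["Male"] - table["Female"]
    return table


# Toy data for illustration only (not NHANES values).
toy = pd.DataFrame({
    "Age": [25, 25, 50, 50, 70, 70],
    "Sex": ["Male", "Female", "Male", "Female", "Male", "Female"],
    "Systolic": [118.0, 110.0, 128.0, 122.0, 140.0, 138.0],
})
gaps = sex_gap_by_age_band(toy)
```

A positive `Gap` means the male mean is higher in that band. Applied to the cleaned data, the same table could be used to check where the gap widens or narrows.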

## 2.6 Visual 3: Glucose Distribution across Age and Sex (Interactive)

### Analytical question & design
This visual examines how **blood glucose levels (mg/dL)** are distributed across different **age groups** and **sexes**, and whether systematic differences in glycaemic profiles are observable across demographic subpopulations.

Glucose is a central biochemical marker associated with **glycaemic regulation, insulin resistance, and metabolic risk**. Within NHANES, glucose measurements are collected under **strict laboratory conditions** and are only available for a **specific subsample of participants**, resulting in substantial missingness. To ensure valid and interpretable comparisons, this visual **restricts analysis to participants with valid glucose measurements**, reflecting the intended survey design rather than treating missing values as data errors.

This visual uses:
- **Glucose** as the primary outcome variable (distributional focus),
- **Age** to capture life-stage variation (filtered via range slider),
- **Sex** for demographic stratification (filtered via dropdown).

### Interactive visual & interpretation
The interactive controls allow users to:
- Filter the distribution by **Sex** (All, Male, Female),
- Restrict the analysis to a specific **Age range**,
- Adjust the **number of histogram bins** to examine distributional structure at different levels of resolution.

After filtering, the visual displays a **histogram of glucose values** alongside a concise summary table reporting **sample size (n), mean, and median** for the selected subgroup. This combination supports both **distributional interpretation** and **numerical comparison**.

As filters are applied, the number of observations may decrease markedly, particularly for narrower age ranges. This reduction reflects **laboratory eligibility criteria and subsample coverage in NHANES**, rather than issues arising from data integration or preprocessing. Consequently, the visual highlights not only variation in glucose distributions, but also the **analytical implications of conditional data availability** in large-scale health surveys.
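The histogram and summary table can be reproduced outside the browser for any filtered subgroup. The sketch below is a minimal example of equal-width binning in which the maximum value is clamped into the last bin, alongside the three summary statistics. The helper name `equal_width_counts` and the toy values are hypothetical; `np.histogram` is used only as a cross-check.

```python
import numpy as np


def equal_width_counts(values, bins):
    """Equal-width histogram counts; the maximum value falls in the last bin."""
    values = np.asarray(values, dtype=float)
    lo, hi = values.min(), values.max()
    if lo == hi:
        # Degenerate case: widen the range so the bin width is non-zero.
        lo, hi = lo - 1.0, hi + 1.0
    width = (hi - lo) / bins
    idx = np.floor((values - lo) / width).astype(int)
    idx = np.clip(idx, 0, bins - 1)  # clamp the maximum into the final bin
    counts = np.bincount(idx, minlength=bins)
    edges = lo + width * np.arange(bins + 1)
    return counts, edges


# Toy glucose values in mg/dL, for illustration only (not NHANES values).
toy = [80, 85, 90, 95, 100, 120, 200]
counts, edges = equal_width_counts(toy, bins=4)
summary = {
    "n": len(toy),
    "mean": float(np.mean(toy)),
    "median": float(np.median(toy)),
}

# Cross-check against NumPy's own equal-width histogram.
reference_counts, _ = np.histogram(toy, bins=4)
```

In the toy data a single high value stretches the bin range, so most observations fall into the first bin. This is one reason the bin-count control is useful for skewed measures.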

In [9]:
df_glu = df_vis[["Age", "Sex", "Glucose"]].copy()

df_glu["Age"] = pd.to_numeric(df_glu["Age"], errors="coerce")
df_glu["Glucose"] = pd.to_numeric(df_glu["Glucose"], errors="coerce")
df_glu["Sex"] = df_glu["Sex"].astype(str)

df_glu = df_glu[df_glu["Sex"].isin(["Male", "Female"])]
df_glu = df_glu.dropna(subset=["Age", "Glucose"]).copy()

age_min = int(np.nanmin(df_glu["Age"]))
age_max = int(np.nanmax(df_glu["Age"]))

src_all = ColumnDataSource(df_glu)
src_hist = ColumnDataSource(data=dict(left=[], right=[], top=[], fill_color=[]))
src_stats = ColumnDataSource(data=dict(metric=["n", "mean", "median"], value=["-", "-", "-"]))

title = Div(text="<h3>Visual 3: Glucose Distribution (mg/dL) by Age and Sex</h3>")

sex_select = Select(title="Sex", value="All", options=["All", "Male", "Female"])

age_slider = RangeSlider(
    title="Age range",
    start=age_min, end=age_max,
    value=(age_min, age_max),
    step=1
)

bins_slider = Slider(
    title="Number of bins",
    start=5, end=60,
    value=25,
    step=1
)

p = figure(
    height=420, width=820,
    title="Glucose distribution (filtered)",
    x_axis_label="Glucose (mg/dL)",
    y_axis_label="Count",
    tools="pan,wheel_zoom,box_zoom,reset,save"
)

quads = p.quad(
    left="left", right="right",
    bottom=0, top="top",
    source=src_hist,
    fill_color="fill_color",
    line_alpha=0.6, fill_alpha=0.6
)

p.add_tools(HoverTool(
    renderers=[quads],
    tooltips=[
        ("Bin", "@left{0.0} to @right{0.0}"),
        ("Count", "@top"),
    ]
))

table = DataTable(
    source=src_stats,
    columns=[
        TableColumn(field="metric", title="Metric", formatter=StringFormatter()),
        TableColumn(field="value", title="Value", formatter=StringFormatter()),
    ],
    width=320,
    height=120,
    index_position=None
)

def init_hist():
    sex = sex_select.value
    a0, a1 = age_slider.value
    bins = int(bins_slider.value)

    sub = df_glu[df_glu["Age"].between(a0, a1)].copy()
    if sex != "All":
        sub = sub[sub["Sex"] == sex]

    vals = sub["Glucose"].dropna().astype(float).values
    n = int(vals.size)

    if n == 0:
        src_hist.data = dict(left=[], right=[], top=[], fill_color=[])
        src_stats.data = dict(metric=["n","mean","median"], value=["0","-","-"])
        p.title.text = f"Glucose distribution (Sex: {sex}, Age: {a0}-{a1})"
        return

    counts, edges = np.histogram(vals, bins=bins)
    left = edges[:-1].tolist()
    right = edges[1:].tolist()
    top = counts.tolist()

    if sex == "Female":
        color = "hotpink"
    elif sex == "Male":
        color = "dodgerblue"
    else:
        color = "gray"

    src_hist.data = dict(
        left=left,
        right=right,
        top=top,
        fill_color=[color] * len(top)
    )

    mean = float(np.mean(vals))
    median = float(np.median(vals))

    src_stats.data = dict(
        metric=["n", "mean", "median"],
        value=[str(n), f"{mean:.1f}", f"{median:.1f}"]
    )

    p.title.text = f"Glucose distribution (Sex: {sex}, Age: {a0}-{a1}, bins={bins})"

init_hist()

callback = CustomJS(
    args=dict(
        all_src=src_all,
        hist_src=src_hist,
        stats_src=src_stats,
        sex_w=sex_select,
        age_w=age_slider,
        bins_w=bins_slider,
        plot=p
    ),
    code="""
    const data = all_src.data;

    const sex = sex_w.value;
    const a0 = age_w.value[0];
    const a1 = age_w.value[1];
    const bins = Math.max(5, Math.floor(bins_w.value));

    const Age = data["Age"];
    const Sex = data["Sex"];
    const Glu = data["Glucose"];

    const values = [];
    for (let i = 0; i < Glu.length; i++) {
      const age = Age[i];
      const s = Sex[i];
      const g = Glu[i];

      if (age == null || g == null) continue;
      if (age < a0 || age > a1) continue;
      if (sex !== "All" && s !== sex) continue;
      if (isNaN(g)) continue;

      values.push(g);
    }

    const n = values.length;

    if (n === 0) {
      hist_src.data = { left: [], right: [], top: [], fill_color: [] };
      stats_src.data = { metric: ["n","mean","median"], value: ["0","-","-"] };
      plot.title.text = `Glucose distribution (Sex: ${sex}, Age: ${a0}-${a1})`;
      hist_src.change.emit();
      stats_src.change.emit();
      return;
    }

    let minV = values[0];
    let maxV = values[0];
    let sum = 0;

    for (let i = 0; i < values.length; i++) {
      const v = values[i];
      if (v < minV) minV = v;
      if (v > maxV) maxV = v;
      sum += v;
    }

    if (minV === maxV) {
      minV = minV - 1;
      maxV = maxV + 1;
    }

    const width = (maxV - minV) / bins;

    const left = new Array(bins);
    const right = new Array(bins);
    const top = new Array(bins).fill(0);

    for (let b = 0; b < bins; b++) {
      left[b] = minV + b * width;
      right[b] = minV + (b + 1) * width;
    }

    for (let i = 0; i < values.length; i++) {
      const v = values[i];
      let idx = Math.floor((v - minV) / width);
      if (idx < 0) idx = 0;
      if (idx >= bins) idx = bins - 1;
      top[idx] += 1;
    }

    values.sort((x, y) => x - y);
    const mean = sum / n;

    let median = 0;
    if (n % 2 === 1) {
      median = values[(n - 1) / 2];
    } else {
      median = (values[n/2 - 1] + values[n/2]) / 2;
    }

    let color = "gray";
    if (sex === "Female") color = "hotpink";
    if (sex === "Male") color = "dodgerblue";
    const fill_color = new Array(top.length).fill(color);

    hist_src.data = { left, right, top, fill_color };

    stats_src.data = {
      metric: ["n", "mean", "median"],
      value: [String(n), mean.toFixed(1), median.toFixed(1)]
    };

    plot.title.text = `Glucose distribution (Sex: ${sex}, Age: ${a0}-${a1}, bins=${bins})`;

    hist_src.change.emit();
    stats_src.change.emit();
    """
)

sex_select.js_on_change("value", callback)
age_slider.js_on_change("value", callback)
bins_slider.js_on_change("value", callback)

layout_v3 = column(
    title,
    row(sex_select, bins_slider),
    age_slider,
    row(p, table)
)

This visual examines the **distribution of Glucose (mg/dL)** across **age** and **sex**, using an interactive histogram combined with summary statistics (**sample size, mean, median**).

The histogram reveals a **right-skewed distribution of glucose values** across all groups, with the majority of observations concentrated between **approximately 80–120 mg/dL**. A smaller but clinically important tail extends toward higher glucose levels, indicating the presence of individuals with **elevated or potentially abnormal glycaemic values**.

When stratified by sex, **males exhibit higher average glucose levels than females**, as reflected in both the **mean and median values**. This difference persists when both sexes are compared over the same age range, pointing to a **systematic sex-based difference in glucose levels** rather than an artefact of differing age composition. However, the overall shape of the distribution remains similar for both sexes, indicating that **sex is associated with a shift in glucose level rather than a change in distributional form**.

The ability to filter by age range highlights that glucose distributions remain **heterogeneous within age groups**, reinforcing that glucose is influenced by multiple factors beyond age alone, including metabolic health, lifestyle, and clinical status.

Importantly, the summary table shows that **sample size decreases substantially under filtering**, reflecting the fact that glucose is a **laboratory-based measurement collected only for specific NHANES subsamples**. This visual therefore also makes explicit the impact of **survey design and missingness**, which is critical for correct interpretation of biomedical variables.

This visual is included because glucose is a **key indicator of metabolic risk** that is not adequately described by central tendency alone. Unlike BMI or blood pressure, glucose values often exhibit **asymmetry and extreme values**, making a **distribution-based visualisation essential**. The interactive histogram allows the analyst to identify **skewness, outliers, and subgroup differences** that would be obscured in aggregated summaries.

Overall, this visual demonstrates that **glucose levels are right-skewed** and **higher on average in males than females**, and that the **available sample size is strongly constrained by subsample availability**, underscoring the importance of distributional analysis when working with clinical laboratory data in NHANES.
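Right skew can also be checked numerically: in a right-skewed sample the mean typically exceeds the median, and the sample skewness is positive. The sketch below is a minimal example using pandas; the helper name `skew_summary` and the toy values are hypothetical and do not come from NHANES.

```python
import pandas as pd


def skew_summary(values):
    """Mean, median, their gap, and sample skewness for a numeric series."""
    s = pd.Series(values, dtype=float).dropna()
    return {
        "mean": s.mean(),
        "median": s.median(),
        "mean_minus_median": s.mean() - s.median(),
        "skewness": s.skew(),  # positive values indicate a longer right tail
    }


# Toy glucose values in mg/dL, for illustration only (not NHANES values).
toy = [85, 90, 95, 100, 105, 110, 180, 250]
result = skew_summary(toy)
```

Here two high values pull the mean well above the median, which is the same pattern the histogram shows as a long upper tail.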

## 2.7 Visual 4: Smoking Status and Blood Glucose Distribution

This visual is designed to examine the relationship between **smoking status** and **blood glucose levels (mg/dL)**, while allowing stratification by **sex** and **age range** through interactive controls.

Smoking is a well-documented risk factor associated with **metabolic dysfunction**, **insulin resistance**, and an increased risk of **type 2 diabetes**. Therefore, this visual addresses the following analytical question:

**Do individuals who smoke exhibit different glucose distributions compared to non-smokers, and does this pattern differ by sex?**

To answer this question, the visual employs:
- **Boxplots** to compare the distribution of glucose between **Non-smokers** and **Smokers**
- **Jittered scatter points** to display individual-level variability and potential outliers
- **Interactive filters** for:
  - **Sex** (All, Male, Female)
  - **Age range** (slider)
- A **summary statistics table** reporting **sample size (n), mean, and median** glucose values for each group

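Only observations with valid **Glucose**, **Age**, **Sex**, and a clearly defined **SmokingGroup** (Smoker / Non-smoker) are included. Current smokers (every day or some days) form the **Smoker** group, while never and former smokers are both assigned to **Non-smoker**. This preprocessing ensures that comparisons are meaningful and aligned with the NHANES survey structure.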

Overall, the design supports both **distributional comparison** and **subgroup exploration**, enabling a nuanced examination of how smoking status relates to glucose levels in a large population-based health dataset.
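The boxplot elements can be sketched as follows: quartiles by linear interpolation, fences at 1.5 times the interquartile range, and whiskers that stop at the most extreme observations inside those fences. The helper name `box_stats` and the toy values are hypothetical; NumPy's default quantile method is assumed to be an acceptable stand-in for a hand-written interpolation.

```python
import numpy as np


def box_stats(values):
    """Quartiles and 1.5 * IQR whiskers for one group (hypothetical helper)."""
    vals = np.sort(np.asarray(values, dtype=float))
    if vals.size == 0:
        raise ValueError("box_stats needs at least one value")
    q1, q2, q3 = np.quantile(vals, [0.25, 0.50, 0.75])  # linear interpolation
    iqr = q3 - q1
    lo_fence, hi_fence = q1 - 1.5 * iqr, q3 + 1.5 * iqr
    # Whiskers end at the most extreme observations inside the fences.
    inside = vals[(vals >= lo_fence) & (vals <= hi_fence)]
    return {
        "q1": float(q1),
        "median": float(q2),
        "q3": float(q3),
        "low": float(inside.min()),
        "high": float(inside.max()),
    }


# Toy glucose values in mg/dL, for illustration only (not NHANES values).
toy = [88, 92, 96, 100, 240]
stats = box_stats(toy)
```

In the toy data the value 240 lies above the upper fence (112), so the upper whisker stops at 100 and the extreme value would appear only as an individual point.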

In [10]:
import numpy as np
import pandas as pd
from bokeh.io import show
from bokeh.layouts import column, row
from bokeh.models import (
    ColumnDataSource, Select, RangeSlider, HoverTool, Div, CustomJS,
    DataTable, TableColumn, StringFormatter
)
from bokeh.plotting import figure
from bokeh.transform import jitter, factor_cmap

df = df_vis.copy()

candidates = ["SmokingStatus", "Smoker", "SmokeStatus", "Smoking", "SMQ020", "SMQ040", "SMQ040Q", "SMQ"]
smoke_col = next((c for c in candidates if c in df.columns), None)

if smoke_col is None:
    hint = [c for c in df.columns if "smok" in c.lower() or "smq" in c.lower()]
    raise ValueError(
        f"Smoking column not found. Suggested columns: {hint}. "
        f"Create/convert to 'SmokingStatus' or 'Smoker' first."
    )

df = df[["Age", "Sex", "Glucose", smoke_col]].copy()
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df["Glucose"] = pd.to_numeric(df["Glucose"], errors="coerce")
df["Sex"] = df["Sex"].astype(str)

s = df[smoke_col]
df["SmokingGroup"] = pd.Series([None] * len(df), index=df.index, dtype="object")

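# Normalise labels; strip a trailing ".0" so numeric codes stored as floats (e.g. 1.0) match "1".
s_clean = s.astype(str).str.strip().str.lower().str.replace(r"\.0$", "", regex=True)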

is_smoker = (
    s_clean.str.contains("current", na=False) |
    s_clean.str.contains("every day", na=False) |
    s_clean.str.contains("some days", na=False) |
    s_clean.isin(["smoker", "yes", "1", "true"])
)

is_non_smoker = (
    s_clean.str.contains("never", na=False) |
    s_clean.str.contains("not at all", na=False) |
    s_clean.str.contains("former", na=False) |
    s_clean.str.contains("ex-smoker", na=False) |
    s_clean.str.contains("ex smoker", na=False) |
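    # Caution: "2" means "No" in SMQ020 but "Some days" in SMQ040, so raw SMQ040 codes need recoding first.
    s_clean.isin(["non-smoker", "nonsmoker", "no", "0", "false", "2"])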
)

df.loc[is_smoker, "SmokingGroup"] = "Smoker"
df.loc[is_non_smoker, "SmokingGroup"] = "Non-smoker"

df = df[df["Sex"].isin(["Male", "Female"])].dropna(subset=["Age", "Glucose", "SmokingGroup"]).copy()

if df.empty:
    top_vals = df_vis[smoke_col].dropna().value_counts().head(10)
    raise ValueError(
        "No rows left after cleaning. Please check smoking mapping. "
        f"Top values in {smoke_col}:\n{top_vals}"
    )

age_min = int(df["Age"].min())
age_max = int(df["Age"].max())

groups = ["Non-smoker", "Smoker"]

title = Div(text="<h3>Visual 4: Smoking Status vs Glucose (mg/dL)</h3>")

sex_select = Select(title="Sex", value="All", options=["All", "Male", "Female"])
age_slider = RangeSlider(
    title="Age range",
    start=age_min, end=age_max,
    value=(age_min, age_max),
    step=1
)

src_all = ColumnDataSource(df)
src_points = ColumnDataSource(data=dict(group=[], glucose=[], sex=[]))

src_box = ColumnDataSource(data=dict(
    group=groups,
    q1=[np.nan, np.nan],
    q2=[np.nan, np.nan],
    q3=[np.nan, np.nan],
    low=[np.nan, np.nan],
    high=[np.nan, np.nan],
))

src_stats = ColumnDataSource(data=dict(
    group=groups,
    n=["-", "-"],
    mean=["-", "-"],
    median=["-", "-"],
))

p = figure(
    height=420, width=820,
    x_range=groups,
    title="Glucose by Smoking Status (filtered)",
    x_axis_label="Smoking status",
    y_axis_label="Glucose (mg/dL)",
    tools="pan,wheel_zoom,box_zoom,reset,save"
)

p.segment(x0="group", y0="low", x1="group", y1="q1", source=src_box, line_width=2)
p.segment(x0="group", y0="q3", x1="group", y1="high", source=src_box, line_width=2)

boxes = p.vbar(
    x="group", width=0.55,
    top="q3", bottom="q1",
    source=src_box,
    fill_alpha=0.35,
    line_alpha=0.8
)

p.rect(
    x="group", y="q2",
    width=0.55, height=0.6,
    source=src_box,
    fill_alpha=0.9,
    line_alpha=0
)

sex_palette = ["#1f77b4", "#e377c2"]

pts = p.scatter(
    x=jitter("group", width=0.35, range=p.x_range),
    y="glucose",
    source=src_points,
    size=4,
    alpha=0.25,
    color=factor_cmap("sex", palette=sex_palette, factors=["Male", "Female"]),
    legend_field="sex"
)

p.add_tools(HoverTool(
    renderers=[pts],
    tooltips=[
        ("Group", "@group"),
        ("Glucose", "@glucose{0.0}"),
        ("Sex", "@sex"),
    ]
))

p.add_tools(HoverTool(
    renderers=[boxes],
    tooltips=[
        ("Group", "@group"),
        ("Q1", "@q1{0.0}"),
        ("Median", "@q2{0.0}"),
        ("Q3", "@q3{0.0}"),
    ]
))

p.legend.title = "Sex"
p.legend.location = "top_right"

table = DataTable(
    source=src_stats,
    columns=[
        TableColumn(field="group", title="Group", formatter=StringFormatter()),
        TableColumn(field="n", title="n", formatter=StringFormatter()),
        TableColumn(field="mean", title="mean", formatter=StringFormatter()),
        TableColumn(field="median", title="median", formatter=StringFormatter()),
    ],
    width=360,
    height=120,
    index_position=None
)

callback = CustomJS(
    args=dict(
        all_src=src_all,
        pts_src=src_points,
        box_src=src_box,
        stats_src=src_stats,
        sex_w=sex_select,
        age_w=age_slider,
        plot=p
    ),
    code="""
    const data = all_src.data;

    const sex = sex_w.value;
    const a0 = age_w.value[0];
    const a1 = age_w.value[1];

    const Age = data["Age"];
    const Sex = data["Sex"];
    const Glu = data["Glucose"];
    const Group = data["SmokingGroup"];

    const gnames = ["Non-smoker", "Smoker"];
    const buckets = { "Non-smoker": [], "Smoker": [] };

    const outG = [];
    const outY = [];
    const outS = [];

    for (let i = 0; i < Glu.length; i++) {
      const age = Age[i];
      const sx = Sex[i];
      const y = Glu[i];
      const g = Group[i];

      if (age == null || y == null || g == null) continue;
      if (age < a0 || age > a1) continue;
      if (sex !== "All" && sx !== sex) continue;
      if (isNaN(y)) continue;
      if (!(g in buckets)) continue;

      buckets[g].push(y);
      outG.push(g);
      outY.push(y);
      outS.push(sx);
    }

    pts_src.data = { group: outG, glucose: outY, sex: outS };

    function quantile(sorted, q) {
      const n = sorted.length;
      if (n === 0) return NaN;
      const pos = (n - 1) * q;
      const base = Math.floor(pos);
      const rest = pos - base;
      if (sorted[base + 1] === undefined) return sorted[base];
      return sorted[base] + rest * (sorted[base + 1] - sorted[base]);
    }

    const q1 = [];
    const q2 = [];
    const q3 = [];
    const low = [];
    const high = [];

    const nArr = [];
    const meanArr = [];
    const medArr = [];

    for (let k = 0; k < gnames.length; k++) {
      const g = gnames[k];
      const vals = buckets[g].slice().sort((a,b) => a-b);
      const n = vals.length;

      if (n === 0) {
        q1.push(NaN); q2.push(NaN); q3.push(NaN); low.push(NaN); high.push(NaN);
        nArr.push("0"); meanArr.push("-"); medArr.push("-");
        continue;
      }

      const Q1 = quantile(vals, 0.25);
      const Q2 = quantile(vals, 0.50);
      const Q3 = quantile(vals, 0.75);
      const IQR = Q3 - Q1;

      let lo = Q1 - 1.5 * IQR;
      let hi = Q3 + 1.5 * IQR;

      let wlo = vals[0];
      let whi = vals[vals.length - 1];

      for (let i = 0; i < vals.length; i++) {
        if (vals[i] >= lo) { wlo = vals[i]; break; }
      }
      for (let i = vals.length - 1; i >= 0; i--) {
        if (vals[i] <= hi) { whi = vals[i]; break; }
      }

      let sum = 0;
      for (let i = 0; i < vals.length; i++) sum += vals[i];
      const mean = sum / n;

      q1.push(Q1); q2.push(Q2); q3.push(Q3); low.push(wlo); high.push(whi);

      nArr.push(String(n));
      meanArr.push(mean.toFixed(1));
      medArr.push(Q2.toFixed(1));
    }

    box_src.data = { group: gnames, q1, q2, q3, low, high };
    stats_src.data = { group: gnames, n: nArr, mean: meanArr, median: medArr };

    plot.title.text = `Glucose by Smoking Status (Sex: ${sex}, Age: ${a0}-${a1})`;

    pts_src.change.emit();
    box_src.change.emit();
    stats_src.change.emit();
    """
)

sex_select.js_on_change("value", callback)
age_slider.js_on_change("value", callback)

# -------------------------
# Initial fill (IMPORTANT): prevents blank on first load when Sex = "All"
# -------------------------
sex0 = sex_select.value
a0, a1 = age_slider.value

df0 = df[(df["Age"] >= a0) & (df["Age"] <= a1)].copy()
if sex0 != "All":
    df0 = df0[df0["Sex"] == sex0]

# points
src_points.data = dict(
    group=df0["SmokingGroup"].astype(str).tolist(),
    glucose=df0["Glucose"].tolist(),
    sex=df0["Sex"].astype(str).tolist(),
)

def _quantile(sorted_vals, q):
    n = len(sorted_vals)
    if n == 0:
        return np.nan
    pos = (n - 1) * q
    base = int(np.floor(pos))
    rest = pos - base
    if base + 1 >= n:
        return float(sorted_vals[base])
    return float(sorted_vals[base] + rest * (sorted_vals[base + 1] - sorted_vals[base]))

q1_list, q2_list, q3_list, low_list, high_list = [], [], [], [], []
n_list, mean_list, med_list = [], [], []

for g in groups:
    vals = df0.loc[df0["SmokingGroup"] == g, "Glucose"].dropna().astype(float).values
    vals.sort()
    n = len(vals)

    if n == 0:
        q1_list.append(np.nan); q2_list.append(np.nan); q3_list.append(np.nan)
        low_list.append(np.nan); high_list.append(np.nan)
        n_list.append("0"); mean_list.append("-"); med_list.append("-")
        continue

    Q1 = _quantile(vals, 0.25)
    Q2 = _quantile(vals, 0.50)
    Q3 = _quantile(vals, 0.75)
    IQR = Q3 - Q1

    lo = Q1 - 1.5 * IQR
    hi = Q3 + 1.5 * IQR

    wlo = vals[0]
    whi = vals[-1]
    for v in vals:
        if v >= lo:
            wlo = v
            break
    for v in vals[::-1]:
        if v <= hi:
            whi = v
            break

    q1_list.append(Q1); q2_list.append(Q2); q3_list.append(Q3)
    low_list.append(float(wlo)); high_list.append(float(whi))

    n_list.append(str(n))
    mean_list.append(f"{float(np.mean(vals)):.1f}")
    med_list.append(f"{Q2:.1f}")

src_box.data = dict(
    group=groups,
    q1=q1_list, q2=q2_list, q3=q3_list,
    low=low_list, high=high_list
)

src_stats.data = dict(
    group=groups,
    n=n_list,
    mean=mean_list,
    median=med_list
)

p.title.text = f"Glucose by Smoking Status (Sex: {sex0}, Age: {a0}-{a1})"

# Layout variable for master dashboard
layout_v4 = column(
    title,
    row(sex_select),
    age_slider,
    row(p, table)
)

The rendered visual reveals several consistent patterns across smoking status, sex, and age groups.

### Overall comparison
Across the full sample, individuals classified as **Smokers** exhibit a **slightly higher mean glucose level** compared to **Non-smokers**, while the **median glucose values are nearly identical** between the two groups. This indicates that the difference is driven primarily by **higher-end values** rather than a uniform shift in the entire distribution.

### Stratification by sex
- **Male participants**:
  - Smokers show a higher **mean glucose level** than non-smokers
  - The scatter plot reveals a greater concentration of **high-glucose observations** among smokers, contributing to a right-skewed distribution

- **Female participants**:
  - A similar pattern is observed, with smokers having a higher mean glucose level
  - However, the difference between smokers and non-smokers is **less pronounced** than in males, and the distribution is more tightly clustered

In both sexes, the **median glucose is similar between smokers and non-smokers**, suggesting that the typical glucose level differs little by smoking status for the majority of individuals. Instead, smoking is associated with **greater variability and a higher likelihood of elevated glucose values**.

### Interpretation
These findings suggest that smoking status is associated with **greater dispersion and higher extreme glucose levels**, rather than a wholesale increase across the entire population. The effect appears stronger among males, indicating a potential interaction between **sex, smoking behaviour, and metabolic risk**.

While this visual does not establish causality, it provides descriptive evidence that smoking is linked to **higher-risk glucose profiles**. The combination of boxplots, scatter points, and interactive filtering helps distinguish **systematic group-level differences** from **individual-level variability** in NHANES data.
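The pattern of similar medians but different means can be examined directly by comparing upper quantiles across groups. The sketch below is a minimal example under stated assumptions: the helper name `tail_profile`, the choice of the 90th and 95th percentiles, and the toy values are hypothetical, while the column names follow those used in this visual.

```python
import pandas as pd


def tail_profile(df, group_col="SmokingGroup", value_col="Glucose"):
    """Median, upper quantiles, and mean of `value_col` for each group."""
    grouped = df.groupby(group_col)[value_col]
    # One row per group, one column per requested quantile.
    table = grouped.quantile([0.50, 0.90, 0.95]).unstack()
    table.columns = ["median", "p90", "p95"]
    table["mean"] = grouped.mean()
    return table


# Toy data for illustration only (not NHANES values).
toy = pd.DataFrame({
    "SmokingGroup": ["Non-smoker"] * 5 + ["Smoker"] * 5,
    "Glucose": [90.0, 95.0, 100.0, 105.0, 110.0,
                90.0, 95.0, 100.0, 105.0, 210.0],
})
profile = tail_profile(toy)
```

In the toy data both groups share the same median, yet one high value raises the mean and the upper percentiles of the second group. This mirrors the kind of tail-driven difference described above.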

## 2.8 Visual 5: Association Between BMI and Blood Glucose Levels

### Analytical question & design
This visual investigates the relationship between **Body Mass Index (BMI)** and **blood glucose (mg/dL)** to examine whether higher body mass is associated with elevated glucose levels.

BMI is a widely used indicator of **adiposity and metabolic risk**, and is strongly linked to insulin resistance and type 2 diabetes. Using NHANES data allows this relationship to be explored across a large, population-representative sample.

The visual is designed as an **interactive scatter plot with optional smoothing**, enabling users to explore both individual-level variability and population-level trends.

Variables used in this visual:
- **BMI** (x-axis, continuous)
- **Glucose (mg/dL)** (y-axis, continuous)
- **Sex** (filter)
- **Age** (filter)

In [11]:
df = df_vis.copy()

bmi_candidates = ["BMI", "BMXBMI", "BodyMassIndex"]
bmi_col = next((c for c in bmi_candidates if c in df.columns), None)
if bmi_col is None:
    hint = [c for c in df.columns if "bmi" in c.lower()]
    raise ValueError(f"BMI column not found. Tried {bmi_candidates}. Suggested columns: {hint}")

if "Glucose" not in df.columns:
    hint = [c for c in df.columns if "gluc" in c.lower() or "glu" in c.lower()]
    raise ValueError(f"Glucose column not found. Suggested columns: {hint}")

df = df[["Age", "Sex", bmi_col, "Glucose"]].copy()
df["Age"] = pd.to_numeric(df["Age"], errors="coerce")
df[bmi_col] = pd.to_numeric(df[bmi_col], errors="coerce")
df["Glucose"] = pd.to_numeric(df["Glucose"], errors="coerce")
df["Sex"] = df["Sex"].astype(str)

df = df[df["Sex"].isin(["Male", "Female"])].dropna(subset=["Age", bmi_col, "Glucose"]).copy()
if df.empty:
    raise ValueError("No rows left after cleaning (Age/BMI/Glucose/Sex). Check df_vis columns and missingness.")

age_min = int(df["Age"].min())
age_max = int(df["Age"].max())

sex_select = Select(title="Sex", value="All", options=["All", "Male", "Female"])

age_slider = RangeSlider(
    title="Age range",
    start=age_min, end=age_max,
    value=(age_min, age_max),
    step=1
)

smooth_slider = Slider(
    title="Smoothing window (BMI bins)",
    start=1, end=25, value=7, step=1
)

src_all = ColumnDataSource(df)

src_points = ColumnDataSource(data=dict(BMI=[], Glucose=[], Sex=[], Age=[], Color=[]))
src_line = ColumnDataSource(data=dict(BMI=[], Glucose=[]))

title = Div(text="<h3>Visual 5: BMI vs Glucose (interactive)</h3>")

p = figure(
    height=420, width=820,
    title="BMI vs Glucose",
    x_axis_label=f"{bmi_col} (kg/m²)",
    y_axis_label="Glucose (mg/dL)",
    tools="pan,wheel_zoom,box_zoom,reset,save"
)

pts = p.scatter(
    x="BMI", y="Glucose",
    source=src_points,
    size=4, alpha=0.30,
    color="Color"
)

p.line(
    x="BMI", y="Glucose",
    source=src_line,
    line_width=3
)

p.add_tools(HoverTool(
    renderers=[pts],
    tooltips=[
        ("BMI", "@BMI{0.0}"),
        ("Glucose", "@Glucose{0.0}"),
        ("Age", "@Age"),
        ("Sex", "@Sex"),
    ]
))

callback = CustomJS(
    args=dict(
        all_src=src_all,
        pts_src=src_points,
        line_src=src_line,
        sex_w=sex_select,
        age_w=age_slider,
        smooth_w=smooth_slider,
        plot=p,
        bmi_name=bmi_col
    ),
    code="""
    const data = all_src.data;

    const sex = sex_w.value;
    const a0 = age_w.value[0];
    const a1 = age_w.value[1];
    const window = Math.max(1, Math.floor(smooth_w.value));

    const Age = data["Age"];
    const Sex = data["Sex"];
    const BMI = data[bmi_name];
    const Glu = data["Glucose"];

    const outBMI = [];
    const outGlu = [];
    const outSex = [];
    const outAge = [];
    const outColor = [];

    const binWidth = 0.5; // BMI units
    const sums = new Map();
    const counts = new Map();

    for (let i = 0; i < Glu.length; i++) {
      const age = Age[i];
      const sx = Sex[i];
      const x = BMI[i];
      const y = Glu[i];

      if (age == null || x == null || y == null) continue;
      if (isNaN(age) || isNaN(x) || isNaN(y)) continue;
      if (age < a0 || age > a1) continue;
      if (sex !== "All" && sx !== sex) continue;

      outBMI.push(x);
      outGlu.push(y);
      outSex.push(sx);
      outAge.push(Math.floor(age));

      if (sx === "Male") outColor.push("#1f77b4");
      else outColor.push("#e377c2");

      const key = Math.round(x / binWidth) * binWidth;
      sums.set(key, (sums.get(key) || 0) + y);
      counts.set(key, (counts.get(key) || 0) + 1);
    }

    pts_src.data = { BMI: outBMI, Glucose: outGlu, Sex: outSex, Age: outAge, Color: outColor };

    const xs = Array.from(sums.keys()).sort((a,b) => a-b);
    const mean = xs.map(k => sums.get(k) / counts.get(k));

    const smooth = [];
    const half = Math.floor(window / 2);

    for (let i = 0; i < mean.length; i++) {
      const left = Math.max(0, i - half);
      const right = Math.min(mean.length - 1, i + half);
      let acc = 0;
      let n = 0;
      for (let j = left; j <= right; j++) {
        acc += mean[j];
        n += 1;
      }
      smooth.push(acc / n);
    }

    line_src.data = { BMI: xs, Glucose: smooth };

    plot.title.text = `${bmi_name} vs Glucose (Sex: ${sex}, Age: ${a0}-${a1}, window=${window})`;

    pts_src.change.emit();
    line_src.change.emit();
    """
)

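# Re-run the callback whenever any widget value changes.
for w in [sex_select, age_slider, smooth_slider]:
    w.js_on_change("value", callback)

# Initial render in Python: CustomJS only runs in the browser after a widget
# changes, so the same filter, binning, and smoothing are applied once here.
sex0 = sex_select.value
a0, a1 = age_slider.value
window0 = int(smooth_slider.value)
bin_width = 0.5  # BMI units; must match binWidth in the JS callback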

df0 = df[(df["Age"] >= a0) & (df["Age"] <= a1)].copy()
if sex0 != "All":
    df0 = df0[df0["Sex"] == sex0]

# points
bmi_vals = df0[bmi_col].astype(float).values
glu_vals = df0["Glucose"].astype(float).values
sex_vals = df0["Sex"].astype(str).values
age_vals = np.floor(df0["Age"].astype(float).values).astype(int)

color_vals = np.where(sex_vals == "Male", "#1f77b4", "#e377c2")

src_points.data = dict(
    BMI=bmi_vals.tolist(),
    Glucose=glu_vals.tolist(),
    Sex=sex_vals.tolist(),
    Age=age_vals.tolist(),
    Color=color_vals.tolist(),
)

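# floor(x + 0.5) rounds halves up, matching Math.round in the JS callback
# (np.round would round halves to the nearest even value instead).
bins = np.floor(bmi_vals / bin_width + 0.5) * bin_width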
trend_df = pd.DataFrame({"bin": bins, "Glu": glu_vals}).dropna()
means = trend_df.groupby("bin")["Glu"].mean().sort_index()

xs = means.index.to_numpy()
mean_arr = means.values

k = max(1, window0)
half = k // 2
smooth_arr = []
for i in range(len(mean_arr)):
    left = max(0, i - half)
    right = min(len(mean_arr) - 1, i + half)
    smooth_arr.append(float(np.mean(mean_arr[left:right + 1])))

src_line.data = dict(BMI=xs.tolist(), Glucose=smooth_arr)

p.title.text = f"{bmi_col} vs Glucose (Sex: {sex0}, Age: {a0}-{a1}, window={k})"

layout_v5 = column(
    title,
    row(sex_select),
    row(age_slider, smooth_slider),
    p
)

Across the population, **glucose levels increase with BMI**, particularly as individuals move from the normal BMI range into **overweight and obese categories**. The smoothed trend line indicates a **non-linear relationship**: glucose rises gradually at lower BMI levels and increases more sharply at higher BMI values. At very high BMI levels, the trend flattens or fluctuates, which most likely reflects **sparser observations** in those bins rather than a true physiological plateau.
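One way to check whether fluctuations in the upper tail coincide with sparse bins is to count observations per BMI bin. The following is a minimal sketch on synthetic values, assuming bins of width 0.5 as in the trend line; the helper name `bin_counts` and the threshold of 2 are hypothetical choices for illustration.

```python
import numpy as np
import pandas as pd

def bin_counts(bmi, bin_width=0.5):
    """Count observations per BMI bin after rounding to the nearest bin centre."""
    values = np.asarray(bmi, dtype=float)
    bins = np.floor(values / bin_width + 0.5) * bin_width
    return pd.Series(bins).value_counts().sort_index()

# Synthetic BMI values: dense in the middle of the range, sparse in the upper tail.
toy_bmi = [22.1, 22.2, 22.0, 24.9, 25.1, 25.0, 25.2, 48.7, 61.3]

counts = bin_counts(toy_bmi)
sparse_bins = counts[counts < 2]  # arbitrary illustrative threshold
print(sparse_bins)
```

Bins flagged this way contribute a mean based on very few individuals, so the trend line there should be read with caution.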

When stratified by sex, **males tend to exhibit slightly higher glucose levels than females at comparable BMI values**, with greater dispersion and more extreme high-glucose outliers. In contrast, **females show a smoother and more stable increase** in glucose as BMI rises. Despite these level differences, the overall pattern is consistent across sexes, indicating that **BMI is the stronger correlate of glucose variation**, with sex acting as a secondary modifier.

### Role of smoothing
The **smoothing window (BMI bins)** reduces the influence of individual outliers and clarifies the underlying population-level trend. Adjusting the window size changes the granularity of the curve, but the **overall upward association between BMI and glucose remains stable**, supporting the robustness of the observed relationship.
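The smoothing step can be reproduced in isolation as a centred moving average over the per-bin means. The sketch below mirrors the half-window logic used in the cell above on synthetic numbers; the function name `centred_moving_average` is introduced here for illustration only.

```python
import numpy as np

def centred_moving_average(values, window):
    """Centred moving average; windows are truncated at both ends of the series."""
    values = np.asarray(values, dtype=float)
    half = max(1, int(window)) // 2
    smoothed = []
    for i in range(len(values)):
        left = max(0, i - half)
        right = min(len(values) - 1, i + half)
        smoothed.append(float(values[left:right + 1].mean()))
    return smoothed

# Synthetic per-bin glucose means with one outlying bin.
bin_means = [100.0, 102.0, 130.0, 104.0, 106.0]

print(centred_moving_average(bin_means, 1))  # window of 1 leaves the series unchanged
print(centred_moving_average(bin_means, 3))  # each point averaged with its neighbours
```

Two properties follow from this construction. Because the half-width is `window // 2`, an even window behaves like the next odd one (a setting of 2 gives the same curve as 3). Edge bins are averaged over fewer neighbours, and every bin mean carries equal weight regardless of how many observations it contains.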

Overall, this visual shows that **higher BMI is associated with elevated blood glucose**, consistent with the role of overweight and obesity as key risk factors for **glycaemic dysregulation and metabolic disease**. The interactive design supports exploratory analysis while maintaining interpretability in a large, heterogeneous dataset such as NHANES.

Using the Bokeh package, this analysis developed five non-trivial interactive visualisations that explore complementary dimensions of health in the NHANES dataset. Each visual incorporates interactive controls such as sliders, dropdowns, and dynamic summaries, allowing users to filter by demographic attributes and examine patterns at different levels of granularity.

Together, the visuals progress from population-level demographic patterns (BMI and blood pressure across age and sex) to metabolic and behavioural relationships (glucose distribution, smoking status, and BMI–glucose associations). Interactive features such as smoothing windows, dynamic binning, and subgroup filtering support exploratory analysis while maintaining interpretability in the presence of noise, outliers, and survey-driven missingness.

Overall, these visualisations demonstrate effective use of Bokeh’s interactive capabilities to reveal meaningful structure in complex health data. This exploratory foundation provides the necessary context for the deeper analytical interpretation and discussion presented in the next section.

## 3. Integrated Insights from Interactive Visual Analysis and Ethical Reflection
### 3.1 Key Analytical Conclusions

Drawing on the **five interactive visualisations** developed using the **Bokeh package**, several consistent and meaningful patterns emerge regarding the relationship between **glucose levels** and key **demographic, behavioural, and physiological factors** in the **NHANES dataset**.

**Age** shows a clear and stable **positive association with glucose**. Across multiple visuals, glucose levels increase gradually with age, particularly from **middle adulthood onward**. This pattern is consistent across sexes and aligns with known **age-related declines in insulin sensitivity**.

**Sex differences** are observable but **secondary**. **Males** generally display **slightly higher average glucose levels** and **greater variability** than females across age, BMI, and smoking status. However, the **overall trend shapes remain similar**, indicating that sex acts as a **modifier rather than a primary driver** of glucose variation.

**BMI** exhibits the **strongest and most consistent relationship** with glucose. The interactive **BMI–glucose visual** reveals a clear **non-linear upward trend**: glucose increases modestly within the **normal BMI range** and rises more sharply among **overweight and obese individuals**. This relationship persists under different **smoothing and filtering settings**, reinforcing BMI as a **key risk factor for glycaemic dysregulation**.
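A simple tabular companion to this visual finding is the mean glucose within standard adult BMI categories (below 18.5, 18.5 to under 25, 25 to under 30, and 30 or above). The sketch below uses synthetic values, and the column name `BMI_cat` is an assumption introduced for illustration.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only; not NHANES values.
toy = pd.DataFrame({
    "BMI": [17.0, 21.0, 23.5, 27.0, 29.0, 33.0, 41.0],
    "Glucose": [88.0, 92.0, 94.0, 99.0, 101.0, 112.0, 124.0],
})

# Left-closed intervals, so a BMI of exactly 25.0 falls in "Overweight".
edges = [-np.inf, 18.5, 25.0, 30.0, np.inf]
labels = ["Underweight", "Normal", "Overweight", "Obese"]
toy["BMI_cat"] = pd.cut(toy["BMI"], bins=edges, labels=labels, right=False)

by_cat = toy.groupby("BMI_cat", observed=True)["Glucose"].agg(["count", "mean"])
print(by_cat)
```

Reporting the count alongside the mean makes it easier to see when a category average rests on few observations.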

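**Smoking status** is associated with **greater dispersion and more extreme high glucose values** rather than a shift in the typical level. Distributions overlap substantially and medians are similar between smokers and non-smokers, but smokers show a wider spread and more high-glucose outliers, most visibly among males. This suggests a potential link between **smoking behaviour** and impaired glucose metabolism in a subset of individuals.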

Finally, the use of **interactive controls** (age range, sex selection, bin size, and smoothing windows) indicates that these findings are **stable across the parameter settings explored**, not artefacts of specific parameter choices. Adjusting these controls changes **granularity**, but does **not alter the core conclusions**.

### 3.2 Data Privacy and Ethical Considerations

The analysis conducted in this study relies on data from the **National Health and Nutrition Examination Survey (NHANES)**, a large-scale public health dataset designed to be **de-identified and publicly accessible**. While NHANES removes direct personal identifiers, the use of **detailed demographic, behavioural, and health variables** still raises important **data privacy and ethical considerations**.

First, there is a risk of **re-identification** when multiple variables such as **age, sex, BMI, smoking status, and laboratory results** are combined. Although individual identifiers are not present, fine-grained filtering and stratification may increase the likelihood of identifying small or unique subgroups. Ethical data practice therefore requires analysts to avoid overly granular reporting and to interpret subgroup patterns at an **aggregate level** only (CDC 2023).
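One common safeguard against overly granular reporting is to suppress statistics for subgroups below a minimum size. The following is a minimal sketch on synthetic data; the threshold `MIN_CELL = 10` and the helper `suppress_small_cells` are hypothetical and are not taken from NHANES guidance or from the analysis code.

```python
import pandas as pd

MIN_CELL = 10  # hypothetical minimum subgroup size, chosen for illustration

def suppress_small_cells(frame, group_cols, value_col, min_n=MIN_CELL):
    """Summarise value_col per group and blank out means for small groups."""
    summary = (
        frame.groupby(group_cols)[value_col]
        .agg(n="count", mean="mean")
        .reset_index()
    )
    small = summary["n"] < min_n
    summary.loc[small, "mean"] = float("nan")
    summary["suppressed"] = small
    return summary

# Synthetic illustration only: one large group and one very small group.
toy = pd.DataFrame({
    "Sex": ["Male"] * 12 + ["Female"] * 3,
    "Glucose": [100.0] * 12 + [95.0, 110.0, 140.0],
})

report = suppress_small_cells(toy, ["Sex"], "Glucose")
print(report)
```

The same idea extends to combinations of filters, where group sizes shrink quickly as more variables are crossed.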

Second, ethical concerns arise around **stigmatisation and misinterpretation**. Variables such as **obesity, smoking behaviour, and elevated glucose levels** are sensitive health indicators. Presenting results without appropriate context may reinforce harmful stereotypes or imply causality where only association is observed. In this analysis, interactive visualisations are used to support **exploratory interpretation**, not diagnostic or predictive claims, and conclusions are framed cautiously to avoid assigning individual blame (Beauchamp & Childress 2019).

Third, there are issues of **data completeness and survey design bias**. NHANES laboratory measures, including glucose, are often collected from **subsamples**, leading to missing data that are not random. Ethical analysis requires transparency about these limitations to avoid misleading conclusions or overgeneralisation. This study explicitly restricts analysis to valid laboratory records and highlights reductions in sample size when filters are applied (CDC 2023).
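Transparency about missingness can be supported by a short per-subgroup report. The sketch below is illustrative and uses synthetic data; the output column names `n_rows`, `n_missing`, and `share_missing` are assumptions.

```python
import numpy as np
import pandas as pd

# Synthetic illustration only: glucose is missing for some rows in each group.
toy = pd.DataFrame({
    "Sex": ["Male", "Male", "Male", "Female", "Female", "Female", "Female"],
    "Glucose": [101.0, np.nan, 97.0, np.nan, np.nan, 92.0, 110.0],
})

missing_report = (
    toy.assign(missing=toy["Glucose"].isna())
    .groupby("Sex")["missing"]
    .agg(n_rows="size", n_missing="sum", share_missing="mean")
)
print(missing_report)
```

Large differences in `share_missing` between subgroups would be a signal that comparisons across those subgroups need extra care.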

Finally, responsible use of public health data involves aligning analysis with the **original purpose of data collection**, which in the case of NHANES is population-level health monitoring and research. The interactive Bokeh visualisations developed here support this goal by enabling **high-level pattern discovery** rather than individual-level inference, consistent with established ethical guidelines for secondary data analysis (OECD 2015).

Overall, while NHANES provides a strong foundation for ethical secondary analysis, careful attention to **privacy protection, contextual interpretation, and transparent reporting** is essential to ensure that insights derived from the data are both **responsible and socially appropriate**.

## Conclusion

Using five interactive Bokeh visualisations, this analysis provides an exploratory overview of how **glucose levels** relate to **age**, **BMI**, **sex**, and selected **behavioural factors** in the NHANES dataset.

Across all visuals, **age and BMI emerge as the most consistent correlates of glucose variation**. Glucose distributions tend to shift upward with increasing age, particularly from mid-adulthood onward, and show a clear positive, non-linear association with BMI. In contrast, **sex differences are present but secondary**, with males often exhibiting slightly higher levels and greater variability, while overall trends remain similar between sexes. Smoking status shows a weaker but observable association: medians are similar for smokers and non-smokers, but smokers display greater dispersion and more extreme high values, and the substantial overlap between groups suggests that other factors also contribute.

These conclusions are **stable under interactive filtering and smoothing**, indicating that the observed patterns are not artefacts of specific parameter choices but reflect consistent patterns in the analysed, unweighted sample.

### Limitations
This analysis is **descriptive and exploratory**, not causal. NHANES is cross-sectional, survey weights were not applied, and important confounders (for example, medication use, diagnosed diabetes, physical activity, and diet) were not controlled. In addition, glucose is a **laboratory-based subsample measure**, meaning missingness and selection effects may influence subgroup comparisons.
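To illustrate why the absence of survey weights matters, the sketch below compares an unweighted mean with a weighted mean on synthetic numbers. The `weight` values are hypothetical and do not correspond to any NHANES weight variable; this is not the procedure used in this analysis.

```python
import numpy as np

# Synthetic illustration only.
glucose = np.array([90.0, 100.0, 140.0])
weight = np.array([3000.0, 1000.0, 1000.0])  # hypothetical sample weights

unweighted_mean = glucose.mean()
weighted_mean = np.average(glucose, weights=weight)

print(f"Unweighted mean: {unweighted_mean:.1f}")
print(f"Weighted mean:   {weighted_mean:.1f}")
```

Here the respondent with the largest weight has the lowest glucose, so the weighted mean is lower than the unweighted one. A weighted mean only adjusts the point estimate; valid standard errors under a complex survey design also require the design's strata and sampling units.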

### Data privacy and ethics
Although NHANES data are de-identified and publicly released, ethical care is still required. Analysts must avoid overly granular subgroup slicing that could increase re-identification risk and should present results in ways that **avoid stigma**, particularly when analysing BMI or smoking-related outcomes. All findings should be interpreted as **population-level associations**, not individual-level judgments.

Overall, the interactive visual analytics support meaningful hypothesis generation regarding metabolic risk factors, while underscoring the need for **careful interpretation, ethical reporting, and further statistical modelling** to support causal claims.

## References

Bokeh Development Team. (2023). *Bokeh: Interactive data visualization for modern web browsers*. https://bokeh.org/

Centers for Disease Control and Prevention (CDC). (2017–2020). *National Health and Nutrition Examination Survey (NHANES): 2017–2020 data documentation, codebook, and datasets*. U.S. Department of Health and Human Services. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2017-2020

Centers for Disease Control and Prevention (CDC). (2021–2023). *National Health and Nutrition Examination Survey (NHANES): 2021–2023 data documentation, codebook, and datasets*. U.S. Department of Health and Human Services. https://wwwn.cdc.gov/nchs/nhanes/continuousnhanes/default.aspx?Cycle=2021-2023

World Health Organization (WHO). (2023). *Diabetes*. https://www.who.int/news-room/fact-sheets/detail/diabetes

In [12]:
from bokeh.layouts import column
from bokeh.io import output_file, save, reset_output
reset_output()

master_dashboard = column(layout_v1, layout_v2, layout_v3, layout_v4, layout_v5)

output_file("visualisation.html", title="NHANES Interactive Visualisations")
save(master_dashboard)
