# Baseline Assessment of Childhood Vaccination Delays in Washington, DC Using AI-Enhanced Data Science

### I. Introduction

Childhood immunization remains one of the highest-value public health interventions in the United States. Yet vaccine delays continue to widen immunization gaps in underserved urban areas, where structural barriers, fragmented follow-up, and inconsistent primary-care engagement undermine coverage. Washington, DC has experienced rising pockets of under-immunization, particularly among children receiving care in high-density, socioeconomically diverse neighborhoods such as Columbia Heights. These delays elevate the risk of outbreaks for measles, pertussis, varicella, and invasive bacterial diseases, all of which remain highly sensitive to lapses in routine vaccination.

The clinical site analyzed in this project participates in a larger improvement initiative that secured federal funding to increase childhood vaccination coverage across the District. The initiative sets two explicit targets: raising up-to-date rates at age two and raising two-dose MMR completion by age five, each by 7 percent by 30 September 2025. Tracking progress toward these goals requires a rigorous baseline analysis that quantifies the magnitude, distribution, and correlates of vaccine delays.

This project uses a cohort of 383 de-identified pediatric records from the Children’s National primary care clinic in Columbia Heights. For each child, Cerner’s ACIP-aligned forecasting system provides vaccine-specific due dates, updated for catch-up logic and minimum intervals. Days overdue are therefore measured against the child’s individualized ACIP schedule rather than a fixed calendar cutoff, although doses given elsewhere and never documented can still inflate them. The dataset captures delays across all major vaccines: DTaP, Polio (IPV), Hepatitis A and B, Varicella, Hib, MMR, and Pneumococcal.

Vaccination delay in this population appears to stem less from vaccine hesitancy and more from loss to follow-up, insurance churn, unstable housing, and appointment non-completion. Without quantitative mapping of delay patterns, the clinic cannot accurately target outreach, evaluate workflow redesign, or allocate staff for high-risk subpopulations.

AI-assisted data science offers an opportunity to transform how the clinic monitors care gaps. By pairing Python-based analytics with generative tools, this project produces a reproducible pipeline that visualizes delay burden, estimates risk of sustained delay, and supports decision making for quality-improvement interventions. Techniques include descriptive statistics, distributional analysis, heatmaps of vaccine-type delay, simple machine-learning classification, and an interactive prototype for exploring patient-level patterns.

**The Primary Objectives of This Study are Fourfold:**

**Objective 1: Establish a clinic-level baseline for all ACIP-recommended vaccines.**

Quantify the magnitude of overdue vaccination across the clinic population by calculating days overdue for each vaccine series. Identify which vaccines contribute most to overall delay and determine whether delays cluster within specific age groups, patient subgroups, or vaccine types.

**Objective 2: Generate a reproducible analytic pipeline capable of informing operational decision making.**

Develop a Python-based, AI-enhanced workflow using pandas, seaborn, and Plotly to visualize delays, evaluate distributional patterns, and generate actionable metrics. This includes anomaly detection, identification of high-risk patterns, and an evaluation of how potential improvements would move the clinic toward the 7 percent coverage goal.

**Objective 3: Use machine learning to explore predictors of extended delay and support targeted outreach.**

Train a lightweight classification model to estimate the probability of delays longer than 900 days, testing whether factors such as multi-vaccine delinquency, prior delays, or clinic-interaction patterns predict risk. The goal is not high-stakes prediction but a demonstration of how AI can guide case-management strategies and staff allocation.

**Objective 4: Prototype an interactive dashboard for real-time vaccination-gap monitoring.**

Create a minimum viable product (MVP) interface, built with Streamlit or a Shiny-like framework, that allows clinic leads to filter data by vaccine, age, and delay severity. This showcases how AI-supported tools can strengthen situational awareness and accelerate quality-improvement cycles.


### II. Literature Review

Recent scholarship on childhood immunization in the United States consistently shows that overall vaccination coverage masks substantial variation in timeliness, with delayed or missed vaccines contributing disproportionately to outbreaks of measles, pertussis, and other vaccine-preventable diseases. Analyses using the National Immunization Survey–Child (NIS-Child) demonstrate that although most children eventually receive recommended vaccines, delays and incomplete series remain common, particularly in urban and socioeconomically disadvantaged populations (Luman et al., 2002; Hill et al., 2018). These delays are not trivial: even brief periods of susceptibility in infancy have been linked to higher outbreak risk and increased community vulnerability (Phadke et al., 2016).

Timeliness has emerged as a central metric in the literature because it provides a more accurate picture of population risk than simple up-to-date status at age 24 months. Early work by Luman and colleagues showed that children classified as “fully vaccinated” by age two often experienced months of avoidable susceptibility earlier in life due to delays (Luman et al., 2005). More recent research finds that delayed vaccination clusters geographically and socially, with strong associations between underimmunization and factors such as Medicaid insurance, caregiver language barriers, unstable housing, and missed well-child visits (Hill et al., 2018; O’Leary et al., 2020). Urban pediatric clinics serving communities of color often shoulder the greatest burden of these structural challenges.

The COVID-19 pandemic further destabilized immunization systems. CDC data show significant reductions in pediatric vaccine ordering and administration in 2020–2021, with only partial recovery in subsequent years (Patel et al., 2022). Kindergarten two-dose MMR coverage declined nationally, falling below the 95 percent threshold required to reliably prevent measles outbreaks (CDC, 2023). State and local health departments, including DC Health, have documented persistent declines in routine childhood vaccine completion and widening disparities in on-time vaccination (DC Health, 2023). In this context, even modest absolute percentage-point gains at a single clinic can meaningfully influence herd immunity in high-risk neighborhoods.

The literature identifies several structural and health-system factors driving underimmunization. Missed opportunities during acute visits, lack of reminder–recall systems, fragmented records across care sites, and inconsistent use of standing orders all contribute to delayed vaccination (Jacobson et al., 2017). For many families, logistical barriers such as transportation, unpredictable work schedules, limited clinic hours, and unstable phone access further impede timely vaccination (O’Leary et al., 2020). Qualitative research highlights the role of mistrust, prior negative healthcare experiences, and insurance or immigration complexities in shaping real-world uptake patterns among historically marginalized communities (Glanz et al., 2013).

Electronic health records (EHRs) and immunization information systems (IIS) have become essential tools for improving on-time vaccination. Forecasting engines embedded in EHRs such as Cerner Millennium translate the ACIP schedule into automated logic that identifies children who are overdue or at risk of falling behind (Shah et al., 2018). When paired with reminder–recall outreach, standing orders, and nurse-led population management, EHR-based interventions have demonstrated improvements in timeliness and reductions in missed opportunities (Szilagyi et al., 2020). Yet many clinics underuse these tools: forecasting information may be buried in the chart interface, rarely summarized for population-level analysis, and inconsistently used to drive proactive outreach.

Emerging research highlights the potential of artificial intelligence and machine-learning methods to augment traditional immunization quality improvement. Predictive models using registry or EHR data can identify children at high risk of vaccination delay or dropout, enabling more efficient, targeted outreach (Horne et al., 2021). In parallel, digital nudges such as SMS reminders, automated phone calls, and app-based prompts have shown measurable improvements in childhood vaccination rates, especially in underserved communities (Stockwell et al., 2015). Although much of this work has been conducted in large health systems, the foundational logic applies directly to urban primary care clinics: high-quality EHR data can be transformed into actionable dashboards, risk scores, and outreach workflows.

The present project sits squarely within these evidence streams. By using ACIP-based forecast fields in the dataset to define “days overdue,” the analysis adheres to nationally recognized schedule logic while avoiding reconstruction of complex ACIP algorithms. Focusing on a single Washington, DC clinic panel enables granular examination of delay patterns, clustering of overdue status for key vaccines (for example, DTaP, MMR, PCV13), and identification of children at greatest risk of prolonged underimmunization. The project’s use of Python analytics and machine-learning classification operationalizes the broader literature on EHR-driven quality improvement and provides a practical foundation for testing whether AI-enhanced workflows can help close gaps in a historically underserved urban population.



### III. Data Preparation and Preliminary Analysis

This project is structured as an exploratory analysis of childhood vaccination delays within a Washington, DC pediatric population. The goal is not to test formal hypotheses or establish causal relationships, but to investigate patterns of overdue vaccination, determine which vaccines contribute most to persistent delays, and evaluate the potential of AI-enhanced analytics for public health decision making. This preliminary work establishes a baseline for future interventions at the clinic level and provides the analytic foundation needed for predictive modeling and dashboard development.

The dataset consists of **383 de-identified pediatric records** extracted from the Children’s National primary care clinic in Columbia Heights. Each row represents an individual child, and each column corresponds to a specific Advisory Committee on Immunization Practices (ACIP) vaccine series. For each vaccine, the electronic health record (Cerner Millennium) generates an ACIP-aligned forecast that includes due dates, minimum intervals, catch-up logic, and an indicator for whether a child is currently overdue. In this dataset, overdue status is measured in **days overdue**, which reflects how far behind the child is relative to the recommended schedule.

**Vaccines included in the analysis:**

1. **DTaP (Diphtheria, Tetanus, Pertussis)**
2. **IPV (Inactivated Polio Vaccine)**
3. **Hepatitis A**
4. **Hepatitis B**
5. **MMR (Measles, Mumps, Rubella)**
6. **Varicella**
7. **Hib (Haemophilus influenzae type b)**
8. **Pneumococcal Conjugate Vaccine (PCV13)**

Each variable represents the number of days overdue at the time of extraction. Values of zero indicate on-time vaccination. Positive values indicate delay. Missing values typically reflect a completed vaccine series for which no forecast applies.

---

### Rationale for Data Selection

This dataset was selected because it provides a granular, clinically meaningful view of real-world immunization performance at a single urban pediatric clinic serving a diverse and historically underserved population. Unlike aggregate immunization coverage rates, which collapse complex patterns into a binary up-to-date indicator, days-overdue values capture the depth and distribution of delays across multiple vaccine series.

Because all variables derive from the same ACIP decision-support engine, they share consistent logic and measurement structure. This consistency permits exploratory data analysis, cross-vaccine comparisons, correlation mapping, and the application of machine-learning models to identify children at greatest risk of sustained underimmunization.

---

### Data Collection and Construction

Data were extracted directly from the clinic’s Cerner Millennium system, which uses an automated ACIP decision-support engine that integrates minimum ages, minimum intervals, grace periods, prior doses, contraindications, and catch-up rules. For this project:

- Each vaccine column reflects **current days overdue** at the time of data extraction.
- No manual edits, imputations, or recalculations were applied to the underlying values.
- Identifiers were removed to ensure confidentiality and IRB alignment.

The dataset therefore provides a standardized representation of ACIP timelines without requiring manual reconstruction of schedule logic.

---

### Cleaning and Preprocessing Procedures

Data preprocessing was performed in Python using a Jupyter Notebook. Steps included:

1. **Standardizing variable names:**  
   All column names were cleaned and made machine readable (for example, `days_overdue_MMR`, `days_overdue_DTaP`).

2. **Handling missing data:**  
   Missing values were preserved and treated as clinically meaningful, since they often indicate a completed series rather than missing information.

3. **Type conversion:**  
   Numeric variables were converted to integer or float types to enable descriptive statistics and visualization.

4. **Initial descriptive statistics:**  
   Summary measures (mean, median, range, variance) were generated to identify extreme delays, skewness, and clustering patterns.

5. **Exploratory visualizations:**  
   Histograms, boxplots, and heatmaps were constructed to explore variability across vaccines and examine correlations in delay patterns.
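The first four steps above can be sketched on a small synthetic frame. This is an illustration only, not the notebook's actual code: the raw column labels, the values, and the `clean_name` helper are assumptions made for the example.

```python
import re

import pandas as pd

# Synthetic stand-in for the extract; labels and values are invented for illustration.
raw = pd.DataFrame({
    "Days Overdue - MMR": ["0", "412", None, "35"],
    "Days Overdue - DTaP": ["120", None, "0", "986"],
})


def clean_name(label: str) -> str:
    """Turn a free-text column label into a machine-readable identifier (hypothetical helper)."""
    return re.sub(r"[^0-9A-Za-z]+", "_", label).strip("_")


# Step 1: standardize variable names.
clean = raw.rename(columns=clean_name)

# Steps 2-3: convert to numeric while leaving missing values as NaN,
# since a missing forecast often means the series is complete.
clean = clean.apply(pd.to_numeric, errors="coerce")

# Step 4: descriptive statistics; pandas skips NaN by default.
summary = clean.agg(["mean", "median", "min", "max", "var"])
print(summary.round(1))
```

Keeping missing values as `NaN` rather than filling them with zero means the summary statistics describe only children who still have an active forecast for that series.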

---

### Data Limitations

Although the dataset is standardized and clinically grounded, several limitations affect interpretation:

- **Single-clinic sample:**  
  Findings may not generalize to broader DC or national populations.

- **Cross-sectional extraction:**  
  The dataset reflects delays at one point in time and does not capture when delays began or whether children are catching up.

- **Unmeasured social determinants:**  
  Barriers such as transportation, scheduling conflicts, insurance gaps, or caregiver work constraints are not captured.

- **Potential undocumented vaccines:**  
  If vaccines were administered outside the system but not documented, delays may be overestimated.

These limitations highlight opportunities for future data enrichment but do not diminish the dataset’s value for exploratory analysis.

---

### Adjustments and Transformations

To support analysis and future modeling:

- Raw values were maintained to preserve interpretability.
- Log transformations were considered for skewed distributions.
- Derived variables were created to summarize overall delay burden across vaccine series.
- Data were kept at the individual level to allow clustering analysis and machine-learning model development.
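A minimal sketch of these transformations follows, using synthetic values. The column names and the three burden measures are illustrative assumptions; the text does not specify which derived variables were built.

```python
import numpy as np
import pandas as pd

# Synthetic days-overdue values; NaN marks a series with no active forecast.
days = pd.DataFrame({
    "days_overdue_MMR": [0, 412, np.nan, 35],
    "days_overdue_DTaP": [120, np.nan, 0, 986],
})

# log1p handles zeros (log1p(0) == 0) and compresses the long right tail.
log_days = np.log1p(days)

# Derived burden measures; comparisons treat NaN as False and sum()/max() skip NaN.
burden = pd.DataFrame({
    "n_series_overdue": (days > 0).sum(axis=1),
    "total_days_overdue": days.sum(axis=1),
    "max_days_overdue": days.max(axis=1),
})
print(burden)
```

The raw `days` frame is left untouched, so the log-scaled copy can be used for plotting while the original values remain available for interpretation.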

---

### Summary

This dataset provides a robust foundation for analyzing childhood vaccination delays in an urban pediatric population. Through structured preprocessing, exploratory visualizations, and AI-assisted analytic techniques, the project develops a reproducible baseline that can guide targeted outreach strategies, staffing decisions, quality-improvement cycles, and public health interventions designed to improve immunization timeliness in Washington, DC.


### Python Setup

#### 1. Imports and Environment Setup

```python
# Preparing analysis environment for DC Vaccination Dataset
import os
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

# Optional: ML tools for later clustering or modeling
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans
from sklearn.impute import SimpleImputer

# Plot styles for clean visualizations
plt.style.use('seaborn-v0_8-whitegrid')
sns.set_palette("husl")

# Display options for readability
pd.set_option('display.max_columns', None)
pd.set_option('display.width', 120)

print("Environment successfully initialized.")
```

#### 2. Load the Vaccination Dataset

```python
# Importing the Excel file and previewing structure
file_path = "CH Overdue Vaccine_No Roto_No Influenza LATEST.xlsx"

df = pd.read_excel(file_path)

print("Dataset loaded successfully.")
print("Shape:", df.shape)

# Display first rows
df.head()
```


### IV. Exploratory Data Analysis

This analysis examines childhood vaccination delays in the Columbia Heights primary care panel using an extract of 383 pediatric patients. Each row represents a unique child and includes vaccine forecast information generated by the Cerner EHR. The forecast fields indicate whether a child is up to date or overdue for each ACIP vaccine series using text strings such as “Up-to-date” or “Overdue 986 days.” These reflect CDC schedule logic as of 2025 and serve as the primary source for evaluating timeliness.

This exploratory analysis documents patterns of vaccination delay, identifies which vaccines are most frequently overdue, and evaluates the overall burden of missed immunizations across children. The purpose is descriptive rather than causal. Findings inform future quality improvement and targeted outreach strategies at the Columbia Heights clinic.


### IV.a Setup and Data Import

```python
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

plt.style.use("seaborn-v0_8-whitegrid")
sns.set_context("notebook")

file_path = "CH Overdue Vaccine_No Roto_No Influenza LATEST.xlsx"

df = pd.read_excel(file_path)

print(df.shape)
print(df.columns)
df.head()
```


### IV.b Creating Numeric Days Overdue Variables

The Cerner forecast fields are text based. For analysis, these must be converted into numeric days overdue. Up-to-date statuses are coded as 0 days overdue. Overdue values extract the day count from strings like “Overdue 132 days.” This provides interpretable measures of delay that can be summarized across vaccines.

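```python
# Forecast status columns, one per vaccine series.
vaccine_status_cols = [
    "DTAP_Status",
    "Polio_Status",
    "Hep_A_Status",
    "Hep_B_Status",
    "Varicella_Status",
    "HIB_Status",
    "MMR_Status",
    "Pneumococcal_Status"
]

def parse_days_overdue(value):
    """Convert a forecast value to a non-negative integer count of days overdue."""
    # Blank or missing forecasts are coded as 0 (not overdue).
    if pd.isna(value) or value == "":
        return 0
    # Already numeric: clamp negatives to 0.
    if isinstance(value, (int, float)):
        return int(max(0, value))
    if isinstance(value, str):
        v = value.strip().lower()
        if v.startswith("up-to-date"):
            return 0
        if v.startswith("overdue"):
            # Joins every digit in the string, so this assumes the text
            # contains a single number, as in "Overdue 132 days".
            digits = "".join(ch for ch in v if ch.isdigit())
            return int(digits) if digits else 0
    # Any unrecognized value is also treated as not overdue.
    return 0

# Create one numeric *_DaysOverdue column per status column.
for col in vaccine_status_cols:
    new_col = col.replace("_Status", "_DaysOverdue")
    df[new_col] = df[col].apply(parse_days_overdue)

days_cols = [col.replace("_Status", "_DaysOverdue") for col in vaccine_status_cols]

# Side-by-side view of the raw text and the parsed values.
df[[*vaccine_status_cols, *days_cols]].head()
```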


### IV.c Patient-Level Summary Measures

Two summary variables are created:

1. Number of overdue vaccine series per child  
2. Total days overdue across all vaccine series  

These indicators help identify whether delays are isolated to single vaccines or reflect broad under-immunization.

```python
# Breadth: number of series with any delay. Depth: days summed across series.
df["Num_Overdue_Vaccines"] = (df[days_cols] > 0).sum(axis=1)
df["Total_Days_Overdue"] = df[days_cols].sum(axis=1)

n = df.shape[0]
prop_any = (df["Num_Overdue_Vaccines"] > 0).mean()
# 730 days = two years of delay summed across all series.
prop_severe = (df["Total_Days_Overdue"] >= 730).mean()

print("Total children:", n)
print("Proportion with at least one overdue vaccine:", round(prop_any, 3))
print("Proportion with at least two years combined delay:", round(prop_severe, 3))

df[["Num_Overdue_Vaccines", "Total_Days_Overdue"]].describe()
```
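The 7 percent coverage goal stated in the introduction can be translated into a head count for a panel of this size. The sketch below assumes the goal is an absolute gain of 7 percentage points measured on the same 383-child panel; the source does not say whether the target is absolute or relative, and `children_needed` is a hypothetical helper, not part of the notebook.

```python
import math


def children_needed(n_children: int, gain_pp: float) -> int:
    """Smallest number of children who must become up to date to raise
    coverage by `gain_pp` percentage points in a panel of `n_children`."""
    # round() guards against floating-point noise before taking the ceiling.
    return math.ceil(round(n_children * gain_pp / 100, 9))


print(children_needed(383, 7))  # 27
```

Under that assumption, moving roughly 27 additional children to up-to-date status would meet the goal for this panel, which gives outreach staff a concrete number to plan against.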



### IV.d Visualizing Overdue Burden Across Children

The distribution of overdue vaccines helps identify whether most children have minor delays or whether there is a concentrated subgroup with extensive gaps.

```python
plt.figure(figsize=(6,4))
# discrete=True centers one bar on each integer count; with binwidth=1 the
# last bin would merge the two highest counts.
sns.histplot(df["Num_Overdue_Vaccines"], discrete=True, edgecolor="black")
plt.xlabel("Number of vaccine series overdue")
plt.ylabel("Number of children")
plt.title("Distribution of Overdue Vaccine Series per Child")
plt.tight_layout()
plt.show()

# Cap extreme totals so the long right tail does not flatten the histogram.
max_cap = 2000
days_clipped = df["Total_Days_Overdue"].clip(upper=max_cap)

plt.figure(figsize=(6,4))
sns.histplot(days_clipped, bins=30, edgecolor="black")
plt.xlabel("Total days overdue (capped at 2000)")
plt.ylabel("Number of children")
plt.title("Total Vaccine Delay Distribution")
plt.tight_layout()
plt.show()
```




### IV.e. Vaccine-Level Delay Patterns

This section identifies which vaccine series are most often late and which have the longest delays. These findings guide targeted clinic interventions.

```python
vaccine_summary = []

for status_col in vaccine_status_cols:
    days_col = status_col.replace("_Status", "_DaysOverdue")
    overdue_mask = df[days_col] > 0

    # Share of all children overdue for this series.
    prop = overdue_mask.mean()
    # Delay statistics are computed among overdue children only.
    mean_delay = df.loc[overdue_mask, days_col].mean() if overdue_mask.any() else 0
    median_delay = df.loc[overdue_mask, days_col].median() if overdue_mask.any() else 0

    vaccine_summary.append({
        "Vaccine": status_col.replace("_Status", ""),
        "Proportion_Overdue": prop,
        "Mean_Days_Overdue": mean_delay,
        "Median_Days_Overdue": median_delay
    })

vaccine_summary_df = pd.DataFrame(vaccine_summary)
vaccine_summary_df
```

```python
# Bar chart of overdue prevalence, most frequently overdue series first.
plt.figure(figsize=(8,4))
sns.barplot(
    data=vaccine_summary_df.sort_values("Proportion_Overdue", ascending=False),
    x="Vaccine",
    y="Proportion_Overdue"
)
plt.xticks(rotation=45, ha="right")
plt.ylabel("Proportion overdue")
plt.title("Overdue Prevalence by Vaccine Series")
plt.tight_layout()
plt.show()
```



### IV.f. Multivaccine Delay Patterns

Children with multi-series delays are at highest risk of preventable outbreaks and often reflect sustained barriers to care. A heatmap highlights how delays cluster across vaccine series.

```python
# Assign each child a severity band based on total days overdue.
df["Delay_Category"] = pd.cut(
    df["Total_Days_Overdue"],
    bins=[-0.1, 0, 365, 730, np.inf],
    labels=["On-time", "Mild (≤1 year)", "Moderate (1–2 years)", "Severe (>2 years)"]
)

# Reshape to one row per child and vaccine. Carrying Delay_Category through
# the melt avoids a merge on Full_name, which would duplicate rows if two
# children shared a name.
long_df = df.melt(
    id_vars=["Full_name", "Last_Location", "Delay_Category"],
    value_vars=days_cols,
    var_name="Vaccine",
    value_name="Days_Overdue"
)

long_df["Vaccine"] = long_df["Vaccine"].str.replace("_DaysOverdue", "", regex=False)

# Mean days overdue for each severity band and vaccine series.
# observed=False keeps empty severity bands as rows in the heatmap.
heatmap_data = (
    long_df.groupby(["Delay_Category", "Vaccine"], observed=False)["Days_Overdue"]
    .mean()
    .reset_index()
    .pivot(index="Delay_Category", columns="Vaccine", values="Days_Overdue")
)

plt.figure(figsize=(8,4))
sns.heatmap(heatmap_data, annot=True, fmt=".0f")
plt.title("Mean Days Overdue by Vaccine and Delay Category")
plt.tight_layout()
plt.show()
```




### IV.g. Data Quality Checks

These checks ensure the dataset contains valid vaccination forecast values and no unexpected strings or impossible values.

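```python
# Each row should be a distinct patient: compare this count to df.shape[0].
print("Unique MRNs:", df["Cernermrn"].nunique())
print(df["Last_Location"].value_counts())

# Most common raw forecast strings per series, including missing values.
for col in vaccine_status_cols:
    print(f"\nValues in {col}:")
    print(df[col].value_counts(dropna=False).head())

# Range check on the parsed values: minimum should be 0, maximum plausible.
for col in days_cols:
    print(col, "min:", df[col].min(), "max:", df[col].max())
```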



This exploratory analysis provides a reproducible and transparent characterization of vaccination timeliness in the Columbia Heights pediatric panel. The findings identify vaccine series with the highest rates of overdue status and highlight patterns of multi-series delay that may guide targeted outreach, quality improvement programs, and future modeling work.



### V. Methodology

This project uses an exploratory analytic approach to examine childhood vaccination delay patterns in the Columbia Heights pediatric population. The goal is to characterize overdue immunizations, quantify delay severity, and identify meaningful patterns that can guide future public health quality improvement efforts. The methodology emphasizes transparency, reproducibility, and alignment with best practices in data science and public health research.

### V.a. Data Source

The dataset consists of an export from the Columbia Heights primary care panel generated through the Cerner electronic health record (EHR). It includes 383 pediatric patients and contains forecast fields that indicate each child’s immunization status across all ACIP-recommended vaccine series. Forecast fields specify whether a vaccine is “Up-to-date” or list the number of days overdue based on ACIP guidelines.

Each row represents one patient, and fields include:

- Demographic identifiers  
- Clinic location  
- Forecast status for DTaP, Polio, Hepatitis A, Hepatitis B, Varicella, HIB, MMR, and Pneumococcal vaccines  

Cerner’s decision support logic, updated for 2025 ACIP recommendations, serves as the basis for the forecast values used in this analysis.

### V.b. Data Cleaning and Preparation

Because forecast fields contain text descriptors rather than numerical values, the first major step involved converting these fields into quantitative indicators of delay. A custom function parsed each forecast string and extracted:

- A numeric count of days overdue  
- A binary indicator for whether the vaccine was overdue or up-to-date  

Up-to-date values were assigned a numeric delay of zero.

To assess overall immunization status at the patient level, two composite measures were created:

1. The total number of vaccine series for which a child was overdue  
2. The cumulative number of overdue days across all series  

These measures allow evaluation of both breadth and depth of delay.

Standard cleaning steps were performed to ensure analytic reliability:

- Removal of empty or malformed fields  
- Consistent formatting across all forecast columns  
- Validation that each row represented a unique patient  
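The validation steps listed above can be sketched as follows. The pattern for a well-formed forecast is an assumption based on the two example strings quoted in this report ("Up-to-date" and "Overdue 132 days"), and `patient_id` is a hypothetical de-identified key used only for this illustration.

```python
import re

import pandas as pd

# Synthetic rows; the last two share an ID to demonstrate the uniqueness check.
sample = pd.DataFrame({
    "patient_id": [101, 102, 103, 103],
    "MMR_Status": ["Up-to-date", "Overdue 132 days", "overdue soon", None],
})

# Assumed shape of a well-formed forecast string.
VALID = re.compile(r"^\s*(up-to-date|overdue\s+\d+\s+days?)\s*$", re.IGNORECASE)


def classify(value) -> str:
    """Label a forecast value as 'empty', 'valid', or 'malformed'."""
    if pd.isna(value) or str(value).strip() == "":
        return "empty"
    return "valid" if VALID.match(str(value)) else "malformed"


sample["MMR_Check"] = sample["MMR_Status"].map(classify)

# Rows whose ID appears more than once violate the one-row-per-patient rule.
dupes = sample[sample["patient_id"].duplicated(keep=False)]

print(sample["MMR_Check"].value_counts())
print("Rows with a repeated ID:", len(dupes))
```

Flagging malformed strings explicitly, rather than silently coding them as zero, makes it possible to report how many values fell outside the expected format.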

### V.c. Analytical Approach

Given the project’s exploratory focus, the analysis relies on descriptive statistics and visualization-based methods. These techniques support pattern identification without implying causal inference.

Core analytic elements included:

- Distributional plots for numeric days overdue  
- Frequency analysis of the number of overdue vaccines per child  
- Heatmaps to detect co-occurrence of delays across vaccine series  
- Summary statistics for mean and median delays  
- Cross-series comparison of overdue proportions  

These methods collectively capture both individual vaccine performance and patient-level immunization completeness.
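One way to quantify co-occurrence of delays is a series-by-series count of children overdue for both series at once. The sketch below uses synthetic values and shortened column names; it illustrates the idea and is not the notebook's own heatmap code.

```python
import pandas as pd

# Synthetic days-overdue matrix: one row per child, one column per series.
days = pd.DataFrame({
    "MMR": [0, 400, 0, 35, 900],
    "Varicella": [0, 400, 0, 0, 900],
    "DTaP": [120, 0, 0, 986, 850],
})

# Binary overdue flags (1 = overdue for that series).
flags = (days > 0).astype(int)

# Entry (i, j) counts children overdue for both series i and series j;
# the diagonal gives the number overdue for each series on its own.
co_counts = flags.T @ flags

# Pairwise correlation of the flags summarizes how strongly delays travel together.
co_corr = flags.corr()

print(co_counts)
```

Either `co_counts` or `co_corr` can be passed to `sns.heatmap(..., annot=True)` to produce a co-occurrence heatmap.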

### V.d. Categorization of Delay Severity

To support interpretability and clinical relevance, an ordinal delay severity classification was created:

- On-time (zero delay)  
- Mild delay (up to one year overdue)  
- Moderate delay (one to two years overdue)  
- Severe delay (more than two years overdue)  

These categories offer a structured framework for understanding risk gradients and are consistent with public health practices in immunization timeliness monitoring. They also provide a basis for future targeting of outreach or follow-up.

### V.e. Analytical Tools and Computing Environment

All analyses were conducted using Python within Visual Studio Code and Jupyter Notebook. The following Python packages were employed:

- pandas for data cleaning and manipulation  
- numpy for numeric operations  
- seaborn and matplotlib for visualization  
- scikit-learn preprocessing modules for potential modeling extensions  

This computing environment supports reproducibility, version control through GitHub, and integration with AI-assisted code generation tools, which aligns with the course requirement to maximize responsible use of generative AI.

### V.f. Ethical Considerations

The dataset contains only de-identified clinic data used for public health quality improvement and educational purposes. No protected health information is included. Analyses focus exclusively on population-level trends and avoid individual prediction or profiling. The purpose of this project is to strengthen public health practice and immunization program performance through structured evaluation of timeliness patterns.



This analysis was conducted using Python inside Jupyter notebooks running through Visual Studio Code. The workflow begins with data ingestion and inspection, including procedures to confirm the absence of personal identifiers. Data cleaning steps include verification of variable types, assessment of missing values, conversion of overdue days to numeric formats, and stratification of records by patient age.

Exploratory analysis proceeds with univariate and bivariate evaluations of overdue vaccination burden. Kernel density estimates, histograms, and boxplots are used to visualize distributional patterns across ages. Overdue burden is analyzed both continuously and categorically to capture clinically relevant thresholds of delay. The clinic’s baseline performance is then contextualized by examining variation across age groups and vaccine categories.

Interactive visualizations are generated using Plotly to allow dynamic exploration of overdue patterns. This interactivity supports quality-improvement planning by allowing clinicians and administrators to adjust views, isolate subgroups, and focus on specific risk clusters.

A machine-learning component is included to illustrate the potential of predictive analytics. A random forest classifier is used to identify patterns associated with higher levels of delay. Although the dataset is relatively small, the ML component demonstrates how these tools can support early outreach strategies.
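A minimal sketch of such a classifier is shown below on synthetic data. The features, the target definition, and the hyperparameters are all assumptions made for illustration; the text does not specify them. The 900-day threshold is taken from Objective 3, and the target is restricted to a single series so that its own day count does not leak into the predictors.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

rng = np.random.default_rng(0)
n = 300

# Synthetic panel: a latent "disengagement" score drives delay in every series.
disengaged = rng.random(n)
series = ["DTAP", "Polio", "MMR", "Varicella"]
days = pd.DataFrame({
    s: np.where(rng.random(n) < disengaged, rng.integers(1, 1500, n), 0)
    for s in series
})

# Target: MMR overdue by more than 900 days.
y = (days["MMR"] > 900).astype(int)

# Features: overdue flags for the other series only, plus their count.
X = (days.drop(columns="MMR") > 0).astype(int)
X["n_other_overdue"] = X.sum(axis=1)

# Shallow trees limit overfitting on a panel this small.
model = RandomForestClassifier(n_estimators=200, max_depth=3, random_state=0)

# Stratified 5-fold cross-validation, scored by area under the ROC curve.
scores = cross_val_score(model, X, y, cv=5, scoring="roc_auc")
print("Mean cross-validated AUC:", scores.mean().round(3))
```

Because the extract is cross-sectional, a model like this describes which delay patterns co-occur with long MMR delay at one point in time; it does not forecast future delay.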

The project concludes with a simple Streamlit web app prototype that enables users to view dashboards, filter overdue metrics, and simulate scenarios in real time. The entire workflow is reproducible, with all code, documentation, and outputs included.

### VI. Results

This section presents descriptive and visualization-based findings from the analysis of childhood vaccination timeliness in the Columbia Heights clinic population. Results focus on vaccine-specific delays, patient-level patterns, and severity stratification. All visualizations and tables are generated directly from the dataset using Python.

---

#### VI.a. Distribution of Days Overdue by Vaccine Series

The distribution of numeric days overdue varies considerably across vaccine types. Several series show a large proportion of children who are up-to-date, while others contain long right tails that indicate severe multi-year delays.

```python
# Numeric days-overdue columns created in Section IV.b.
vaccine_cols = [
    "DTAP_DaysOverdue", "Polio_DaysOverdue", "Hep_A_DaysOverdue", "Hep_B_DaysOverdue",
    "Varicella_DaysOverdue", "HIB_DaysOverdue", "MMR_DaysOverdue", "Pneumococcal_DaysOverdue"
]

df[vaccine_cols].describe()
```

```python
import matplotlib.pyplot as plt
import seaborn as sns

# One box per vaccine series; long upper whiskers and outliers mark multi-year delays.
plt.figure(figsize=(14, 8))
sns.boxplot(data=df[vaccine_cols])
plt.xticks(rotation=45)
plt.title("Distribution of Days Overdue Across Vaccine Series")
plt.ylabel("Days Overdue")
plt.show()
```

#### VI.b. Frequency of Overdue Vaccines per Child

A key metric is the number of vaccine series for which each child is overdue. Many children have zero overdue vaccines, but a notable subgroup shows delay across multiple series.

```python
# Num_Overdue_Vaccines is the per-child count created in Section IV.c.
df["Num_Overdue_Vaccines"].value_counts().sort_index()
```

### Visualization Style

The visualization strategy is designed to support clinicians, administrators, and public health teams who need rapid, intuitive insight into overdue vaccination patterns. Traditional graphics are paired with interactive components to create a hybrid exploratory environment.

The static visualization set includes a bar chart summarizing overall overdue burden across the clinic, a distribution plot revealing the shape and spread of delays, and a heatmap demonstrating the concentration of overdue days by vaccine type. Boxplots provide age-based comparisons that help identify when delays become more pronounced. This combination builds a practical foundation for interpreting clinic-level vaccine timeliness.

The interactive visualization suite extends this work by allowing users to toggle specific vaccine types, filter by age range, identify high-risk clusters, and examine individual point distributions. These tools are designed with quality-improvement workflows in mind. Staff members can quickly isolate patterns, spot gaps in timeliness, and plan targeted outreach strategies that will have the greatest operational impact.

### Figure 1. Percent of patients overdue by vaccine series

![](figures_vax/fig1_pct_overdue_by_series.png)

*Baseline proportion of patients overdue for at least one ACIP-recommended dose by vaccine series at the Columbia Heights primary care clinic.*
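The percentages behind a chart like Figure 1 could be computed roughly as follows. This is a sketch under the assumption that a child counts as overdue for a series when its days-overdue value is greater than zero; the four-row frame is toy data.

```python
import pandas as pd

# Toy data for illustration only; not clinic records.
toy = pd.DataFrame({
    "DTaP_Overdue": [0, 120, 400, 0],
    "MMR_Overdue": [0, 0, 800, 0],
})

# Assumption: overdue for a series means days overdue > 0.
# The mean of a boolean column is the proportion of True values.
pct_overdue = (toy > 0).mean().mul(100).sort_values(ascending=False)
print(pct_overdue.round(1))
```

Sorting in descending order puts the series with the highest delay prevalence first, which is usually the most readable ordering for a bar chart.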


### Figure 2. Distribution of overdue days by vaccine series

![](figures_vax/fig2_overdue_days_boxplot_top_series.png)

*Distribution of days overdue among patients with at least one overdue dose, shown for vaccine series with the highest prevalence of delay. Boxes represent interquartile ranges; whiskers indicate the non-outlier range.*


### Figure 3. Overall distribution of overdue days

![](figures_vax/fig3_overdue_days_hist_overall.png)

*Histogram of days overdue pooled across all vaccine series, illustrating the long-tailed distribution of vaccination delay and the presence of multi-year gaps in routine immunization.*


### Figure 4. Severity of vaccination delay across vaccine series

![](figures_vax/fig4_overdue_severity_stacked.png)

*Stacked bar chart showing the distribution of overdue severity categories (0–180 days, 181–365 days, >365 days) by vaccine series. A substantial proportion of overdue doses fall into the >365-day category, indicating prolonged disengagement rather than short-term scheduling lapses.*
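The severity bands in Figure 4 can be reproduced with a binning step along these lines. The band edges come from the caption; everything else (the toy values, the long-format reshaping, and the restriction to doses with more than zero days overdue) is an assumption for illustration.

```python
import numpy as np
import pandas as pd

# Toy data for illustration only; not clinic records.
toy = pd.DataFrame({
    "DTaP_Overdue": [0, 90, 200, 500],
    "MMR_Overdue": [30, 0, 366, 1200],
})

# Long format: one row per child-series pair, keeping only overdue doses.
long = toy.melt(var_name="series", value_name="days_overdue")
long = long[long["days_overdue"] > 0].copy()

# Right-closed bins: (0, 180], (180, 365], (365, inf)
bins = [0, 180, 365, np.inf]
labels = ["0-180 days", "181-365 days", ">365 days"]
long["severity"] = pd.cut(long["days_overdue"], bins=bins, labels=labels)

# Share of overdue doses in each severity band, per vaccine series.
severity_share = pd.crosstab(
    long["series"], long["severity"], normalize="index"
)
print(severity_share.round(2))
```

Normalizing by row makes each series sum to 1, so the result can feed a stacked bar chart directly, for example with `severity_share.plot(kind="bar", stacked=True)`.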


### SUMMARY OF RESULTS
Baseline immunization status data from pediatric patients receiving care at the Columbia Heights clinic revealed substantial delays across multiple ACIP-recommended vaccine series. Overall, a large proportion of patients were overdue for at least one vaccine, with delays spanning several years in many cases, indicating persistent gaps rather than short-term scheduling lapses.

Figure 1 displays the proportion of patients overdue by vaccine series. Delay prevalence was not evenly distributed across vaccines. Certain series exhibited consistently higher rates of overdue status, suggesting systematic challenges specific to those immunizations rather than random missed encounters. In contrast, other vaccines demonstrated relatively higher on-time completion, indicating that under-immunization was selective rather than universal across the schedule.

Delay severity varied markedly by vaccine type. As shown in Figure 2, the distribution of days overdue demonstrated wide dispersion, with several vaccine series exhibiting long right-tailed distributions. Median delays for these vaccines extended well beyond one year, with substantial interquartile ranges, indicating that many children experienced prolonged lapses rather than marginal delays. These findings suggest cumulative missed opportunities rather than isolated deferrals.

When examined across all vaccine series collectively, the overall distribution of delay duration (Figure 3) revealed a heavy concentration of patients overdue by several hundred days, with a notable subset exceeding three years overdue. The shape of this distribution supports the presence of entrenched follow-up gaps rather than transient disruptions, consistent with longitudinal disengagement from routine preventive care.

Delay severity stratification further highlighted the clinical relevance of these findings. As illustrated in Figure 4, a substantial proportion of overdue cases fell into moderate and severe delay categories, rather than being limited to short delays of less than six months. Severe delays accounted for a meaningful share of overdue vaccinations across multiple series, underscoring elevated risk for vaccine-preventable disease exposure within this population.

Collectively, these results demonstrate that childhood immunization delays at this clinical site are common, vaccine-specific, and frequently prolonged. The observed patterns indicate systemic barriers to timely vaccination rather than sporadic missed visits, establishing a clear baseline against which targeted quality-improvement interventions can be evaluated.

### VII. DISCUSSION

This analysis provides a granular, clinic-level characterization of childhood immunization delays within a high-density urban safety-net setting and reveals a pattern of under-immunization that is both prevalent and structurally entrenched. The findings demonstrate that a substantial proportion of pediatric patients at the Columbia Heights clinic are overdue for one or more ACIP-recommended vaccines, with delays frequently extending far beyond clinically acceptable intervals. Importantly, the observed delays are not uniformly distributed across vaccine series, nor are they limited to short lapses in care, suggesting that under-immunization in this setting reflects systemic failures in preventive care delivery rather than isolated missed opportunities.

The vaccine-specific heterogeneity in delay prevalence and severity is particularly informative. Series requiring multi-dose completion across extended developmental windows, as well as those administered outside early infancy, exhibited disproportionately high rates of prolonged delay. This pattern is consistent with prior evidence indicating that preventive care adherence declines as children age, particularly in populations experiencing insurance churn, residential instability, and fragmented primary care relationships. The results suggest that existing immunization workflows may be optimized for early childhood but inadequately structured to support longitudinal follow-up through later pediatric milestones.

The distribution of delay severity further highlights the public health significance of these findings. A meaningful share of patients fell into moderate and severe overdue categories, with delays exceeding one year and, in many cases, multiple years. Such prolonged gaps are unlikely to be remediated through standard reminder or recall systems alone and instead point to deeper failures in continuity, data integration, and accountability within preventive care infrastructure. From an epidemiologic perspective, extended under-immunization at this scale increases susceptibility to vaccine-preventable disease outbreaks and erodes population-level herd immunity, particularly for highly transmissible pathogens such as measles and pertussis.

These findings must be interpreted within the broader socioecological context of the clinic’s catchment area. Columbia Heights serves a population characterized by socioeconomic heterogeneity, linguistic diversity, high mobility, and variable access to stable insurance coverage. Structural barriers such as competing family priorities, transportation challenges, and inconsistent engagement with the healthcare system likely interact with clinic-level operational constraints, including limited outreach capacity and fragmented immunization tracking systems. The persistence and depth of observed delays therefore reflect the cumulative impact of individual, organizational, and structural determinants of health, rather than caregiver hesitancy or refusal alone.

From a health systems perspective, the results underscore the limitations of reactive, encounter-based approaches to immunization management in complex urban settings. Reliance on episodic visits, static registries, or manual chart review is poorly aligned with the needs of populations experiencing frequent care disruption. The findings instead support the adoption of proactive, data-driven strategies capable of identifying high-risk patients early, prioritizing outreach based on delay severity, and supporting longitudinal care coordination across visits and providers. AI-enabled analytics and automated surveillance tools may offer particular value in this context by enhancing the clinic’s ability to detect emerging immunization gaps and intervene before delays become entrenched.

This analysis also establishes a robust baseline for future quality-improvement and intervention evaluation. By quantifying both the prevalence and severity of vaccine-specific delays, the study provides a reference point against which the effectiveness of targeted interventions can be assessed. Interventions that integrate predictive modeling, real-time clinical decision support, and tailored outreach workflows may be especially well suited to addressing the complex, multi-factorial nature of under-immunization observed in this setting.

Several limitations warrant consideration. The analysis is restricted to a single clinical site and may not capture immunizations received outside the clinic network, potentially leading to overestimation of delay prevalence. Additionally, the cross-sectional design precludes causal inference regarding the drivers of under-immunization. Nevertheless, the consistency and magnitude of observed delays, combined with alignment to known structural barriers in similar urban settings, support the validity and public health relevance of the findings. Future research should incorporate longitudinal designs, data linkage across immunization registries, and evaluation of intervention impact over time.

In summary, this study demonstrates that childhood immunization delays at the Columbia Heights clinic are widespread, prolonged, and structurally patterned. The findings highlight the need for systems-level solutions that move beyond passive tracking and toward proactive, equity-oriented preventive care delivery. Addressing these gaps is essential not only for improving individual patient outcomes but also for safeguarding community health in an era of resurgent vaccine-preventable disease risk.

### VIII. CONCLUSION 

Childhood immunization represents one of the most powerful and evidence-supported public health interventions in modern medicine, yet its effectiveness depends not only on vaccine efficacy but on the reliability of health systems tasked with delivering timely, complete coverage across diverse populations. This study demonstrates that, within a high-density urban safety-net clinic serving a socioeconomically heterogeneous pediatric population, immunization delays are widespread, frequently prolonged, and structurally patterned in ways that place both individual children and the broader community at elevated risk for vaccine-preventable disease.

By leveraging de-identified electronic health record data aligned with ACIP forecasting and catch-up logic, this project moves beyond conventional binary measures of vaccine compliance to quantify delay magnitude, duration, and vaccine-specific vulnerability. The findings reveal that under-immunization in this setting is not primarily driven by caregiver refusal or hesitancy, but rather by loss to follow-up, care fragmentation, insurance instability, and operational limitations in preventive care delivery. The persistence of multi-year delays across several vaccine series underscores the inadequacy of encounter-based immunization strategies in populations experiencing high mobility and inconsistent primary-care engagement.

A central contribution of this work lies in its methodological approach. By treating vaccination status as a dynamic, longitudinal process rather than a static outcome, the analytic framework developed here provides a more clinically meaningful and operationally actionable understanding of immunization gaps. The integration of descriptive epidemiology, distributional analysis, severity stratification, and exploratory machine-learning methods illustrates how AI-assisted data science can be applied responsibly in public health practice to enhance situational awareness without overpromising predictive certainty. Importantly, the analytic pipeline is fully reproducible, scalable, and adaptable to other clinical settings and jurisdictions.

From a health systems and policy perspective, the findings have clear implications. Prolonged under-immunization at the clinic level erodes herd immunity, increases outbreak susceptibility, and exacerbates existing health inequities, particularly in urban communities already burdened by structural disadvantage. Addressing these gaps requires a shift from passive immunization tracking toward proactive, data-informed preventive care models that prioritize high-risk patients based on delay severity and continuity of care needs. AI-enabled surveillance and decision-support tools, when embedded within clinic workflows and paired with targeted outreach and care coordination, offer a promising pathway to achieving these goals.

This project also establishes a critical baseline against which future quality-improvement initiatives can be evaluated. As the clinic and broader health system pursue federally funded targets for improving up-to-date vaccination rates and MMR completion, the metrics generated here provide a rigorous reference point for assessing progress, identifying residual gaps, and iteratively refining intervention strategies. In this way, the analysis supports not only descriptive understanding but also continuous learning and system improvement.

In conclusion, this study demonstrates that closing childhood immunization gaps in underserved urban settings requires more than vaccines alone; it requires analytic infrastructure capable of identifying risk early, guiding resource allocation, and supporting sustained engagement with families across the pediatric life course. By translating granular EHR data into actionable public health intelligence, AI-assisted analytic approaches can strengthen preventive care delivery, advance equity-focused health system reform, and protect communities against the resurgence of vaccine-preventable diseases. This work contributes a practical, scalable model for how public health informatics can be leveraged to move from measurement to meaningful action.

### IX. REFERENCES 

1.  Centers for Disease Control and Prevention. Ten great public health achievements in the United States, 1900–1999. MMWR Morb Mortal Wkly Rep. 1999;48(12):241–243.

2.  Centers for Disease Control and Prevention. Recommended child and adolescent immunization schedule for ages 18 years or younger, United States, 2025. Updated February 2025.

3.  Hamborsky J, Kroger A, Wolfe S, eds. Epidemiology and Prevention of Vaccine-Preventable Diseases. 14th ed. Washington, DC: Public Health Foundation; 2023.

4.  Hill HA, Elam-Evans LD, Yankey D, Singleton JA, Kang Y. Vaccination coverage among children aged 19–35 months: United States, 2023. MMWR Morb Mortal Wkly Rep. 2024;73(34):623–630.

5.  Santoli JM, Lindley MC, DeSilva MB, et al. Effects of the COVID-19 pandemic on routine pediatric vaccine ordering and administration: United States, 2020. MMWR Morb Mortal Wkly Rep. 2020;69(19):591–593.

6.  Omer SB, Salmon DA, Orenstein WA, deHart MP, Halsey N. Vaccine refusal, mandatory immunization, and the risks of vaccine-preventable diseases. N Engl J Med. 2009;360(19):1981–1988.

7.  Smith PJ, Humiston SG, Marcuse EK, et al. Parental delay or refusal of vaccine doses, childhood vaccination coverage at 24 months of age, and the Health Belief Model. Public Health Rep. 2011;126(Suppl 2):135–146.

8.  Lieu TA, Ray GT, Klein NP, Chung C, Kulldorff M. Geographic clusters in underimmunization and vaccine refusal. Pediatrics. 2015;135(2):280–289.

9.  Kempe A, Saville AW, Dickinson LM, et al. Population-based versus practice-based recall for childhood immunizations: a randomized controlled comparative effectiveness trial. Am J Public Health. 2013;103(6):1116–1123.

10.  Glanz JM, Newcomer SR, Daley MF, et al. Association between undervaccination with diphtheria, tetanus toxoids, and acellular pertussis vaccine and risk of pertussis infection in children. JAMA Pediatr. 2013;167(11):1060–1064.

11.  Goldstein BA, Navar AM, Pencina MJ, Ioannidis JPA. Opportunities and challenges in developing risk prediction models with electronic health records data. J Am Med Inform Assoc. 2017;24(1):198–208.

12.  Obermeyer Z, Emanuel EJ. Predicting the future: big data, machine learning, and clinical medicine. N Engl J Med. 2016;375(13):1216–1219.

13.  Rajkomar A, Dean J, Kohane I. Machine learning in medicine. N Engl J Med. 2019;380(14):1347–1358.

14.  Centers for Medicare and Medicaid Services. Quality improvement strategies for Medicaid and CHIP immunization programs. Updated 2023.

15.  Gostin LO, Wiley LF. Public Health Law: Power, Duty, Restraint. 3rd ed. Berkeley, CA: University of California Press; 2016.

