In [5]:
#Import libraries

import numpy as np
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt
import joblib
import shap

from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from imblearn.pipeline import Pipeline as imb_pipeline
from imblearn.over_sampling import SMOTENC
from sklearn.linear_model import LogisticRegression
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split, RandomizedSearchCV, StratifiedKFold
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score, classification_report, confusion_matrix

from model import WhiteBox

%load_ext autoreload
%autoreload 2



# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">Executive Summary</div>

St. James Hospital’s “Baby Monitor” ML model misclassified newborn risk, leading to a patient fatality. The model’s unreliability and lack of transparency have jeopardized patient safety and eroded the trust of the hospital staff. The Organization for Whitebox ModeLs (OWL) was commissioned to diagnose these failures. Our objective is to propose a new, robust methodology for model development, validated by a proof-of-concept model, and a governance framework that ensures clear explanations and effective human oversight.

Our evaluation of the original model identified critical flaws. The system was optimized for overall accuracy rather than patient safety, a critical error for an imbalanced dataset. This focus resulted in a dangerously low 57.1% recall for at-risk infants, meaning the model failed to identify nearly half of the babies in danger. Furthermore, our analysis of the misclassifications revealed that the model systematically failed to flag risk in infants from an underrepresented demographic, indicating a significant data bias.

The proposed methodology improves the model's performance through several steps. First, data points were split by baby to avoid data leakage. Second, the sampling technique SMOTENC was used to balance the distribution between the healthy and at-risk babies. Third, a random forest was tuned and trained using unscaled numerical features and one-hot encoded categorical features. Finally, probabilities were output to capture the model's confidence in each prediction. Our proposed proof-of-concept (a transparent decision tree) shows that these changes significantly improved the model, achieving a 95% recall rate.


# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">I. Introduction</div>

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Overview</div>

The Organization for Whitebox ModeLs (OWL) was commissioned by St. James Hospital to audit the Baby Monitor Project, an AI-based system developed by Data Monitors for the hospital’s OB-GYN department. The system was designed to predict newborn risk levels and support staff in providing timely care.

Although the project initially showed strong results in improving efficiency and patient satisfaction, a series of misclassifications, one of which resulted in a fatality, prompted serious concerns regarding the model’s accuracy and transparency.

This report summarizes OWL’s independent assessment of the Baby Monitor model and outlines findings, issues, and recommendations to guide St. James Hospital in strengthening the safety, reliability, and governance of future AI solutions.

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Problem</div>

The main issue is the severe failure of the ‘Baby Monitor’ ML model, whose misclassifications led to a patient fatality. These errors can be traced to several key factors.

First is Model Opacity. How did the erroneous model make its predictions, and what faulty logic did it use? The lack of transparency makes it unclear how the model reaches its conclusions.

Next, a deeper misclassification analysis is needed to identify which features or patterns caused the twelve "false negative" cases, and what those errors reveal about the model's decision-making process.

There are also potential methodological flaws in the original approach, including the preprocessing, model selection, and evaluation metrics. These have likely reduced the model's ability to detect "at risk" infants with sufficient accuracy.

Finally, governance and oversight gaps seem to play a major role. The lack of robust human review processes and risk management frameworks allowed system errors to go unnoticed until the consequences became too severe.

In short, technical weaknesses and a limited accountability structure are the primary reasons patients were affected by these errors.


## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Project Objectives</div>

This assessment aims to tackle two main objectives. The first objective is to investigate and interpret the erroneous model. Interpreting the model allows OWL and St. James Hospital to discover the logic behind its predictions, cross-reference the findings with healthcare practitioners, and troubleshoot faulty steps in preprocessing the data and training the model. This entails understanding the behavior of the model, particularly in detecting the most prominent patterns in predicting the health of the 12 misclassified babies, to see where the model went wrong.

The second objective is to propose a better methodology and framework that reduces the risks of misclassifications and ensures that any errors are quickly detected and mitigated. The new machine learning model developed will serve as a proof-of-concept that this method works and can transparently provide clear explanations for its decisions. The framework also entails seamlessly integrating the model into the workflow and educating stakeholders on proper intervention. This framework will serve as a template for future machine learning modeling efforts.


## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Significance of the Report</div>

This report is significant because it is our direct response to the 'Baby Monitor' model's severe failures and shortcomings. Beyond being a technical review, it is a crucial step toward rebuilding the trust of hospitals and patients in these kinds of models while also safeguarding patient safety.

By breaking down why the Data Monitors model failed, this report provides St. James Hospital with a clear diagnosis of methodological flaws. More importantly, it establishes a benchmark for the 'correct' way to develop, validate, and govern clinical AI systems.

Ultimately, the significance is twofold: it provides an immediate, safer, and more transparent alternative for newborn risk assessment, and it delivers a robust, understandable framework that can guide all future AI adoption within the hospital, ensuring that technology serves as a reliable and trustworthy aid to medical practitioners rather than a liability.


# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">II. Methodology</div>

The methodology is as follows:

1. Data Collection
    St. James Hospital provided historical records for 70 babies. Exploratory Data Analysis is used to understand the dataset: duplicate values, the distribution of each feature, and other patterns present in the data.

2. Evaluation of Data Monitors’ Model
    St. James Hospital also provided the original model, its specification, and its performance metrics (recall, precision, accuracy). The model was run to reproduce its current classification report. Preprocessing steps were inferred from the model specifications and parameters. Then, the preprocessing, modelling, and evaluation metrics were critiqued.

3. Analysis of the Misclassified Babies
    SHAP was used to analyze the decisions behind classifying the 12 babies as healthy. By applying SHAP to both an individual instance and a group instance, stakeholders can recognize which features have a large influence on the final prediction and lead to the misclassification.

4. Improved Model
    Given the previous observations, the team suggests more appropriate preprocessing methods, model types, and evaluation metrics for detecting and classifying newborn risk levels. The new model will also be interpreted through SHAP to showcase the contribution of each feature to its predictions, particularly for the features that may have contributed to misclassifying the 12 false negatives.

5. Robust Framework
    Finally, the team suggests ways to seamlessly integrate the system into the hospital’s workflow. The framework allows for human intervention so that errors are easily mitigated.


# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">III. Data Description</div>

The `historical.csv` dataset contains 2,100 observations and 25 features representing longitudinal monitoring data of newborns. It integrates demographic and clinical variables to facilitate early identification of potential health risks in infants.


The dataset includes both baseline characteristics recorded at birth and daily follow-up measurements, allowing for analysis of growth trajectories and vital sign stability over time. Continuous variables such as `weight`, `temperature`, and `oxygen saturation` reflect clinical conditions, while categorical variables capture developmental indicators.

| **Feature Name**              | **Feature Description**                            | **Data Type** |
| :----------------------------: | :------------------------------------------------: | :------------: |
| `baby_id`                     | Unique identifier for each baby                    | string        |
| `name`                        | Name of the baby                                   | string        |
| `gender`                      | Gender of the baby (Male/Female)                   | string        |
| `gestational_age_weeks`       | Gestational age at birth (normal: 37–42 weeks)     | float         |
| `birth_weight_kg`             | Weight of the baby at birth (normal: 2.5–4.5 kg)   | float         |
| `birth_length_cm`             | Length of the baby at birth (average: 48–52 cm)    | float         |
| `birth_head_circumference_cm` | Head circumference at birth (average: 33–35 cm)    | float         |
| `date`                        | Monitoring date                                    | string (date) |
| `age_days`                    | Age of the baby in days since birth                | integer       |
| `weight_kg`                   | Recorded daily weight                              | float         |
| `length_cm`                   | Recorded daily body length                         | float         |
| `head_circumference_cm`       | Recorded daily head circumference                  | float         |
| `temperature_c`               | Body temperature in °C (normal: 36.5–37.5)         | float         |
| `heart_rate_bpm`              | Heart rate (normal: 120–160 bpm)                   | integer       |
| `respiratory_rate_bpm`        | Respiratory rate (normal: 30–60 bpm)               | integer       |
| `oxygen_saturation`           | Oxygen saturation level (normal >95%)              | integer       |
| `feeding_type`                | Type of feeding: Breastfeeding, Formula, or Mixed  | string        |
| `feeding_frequency_per_day`   | Number of feeds per day (normal: 8–12)             | integer       |
| `urine_output_count`          | Wet diaper count per day (normal: 6–8+)            | integer       |
| `stool_count`                 | Bowel movements per day (0–5 typical)              | integer       |
| `jaundice_level_mg_dl`        | Bilirubin level (normal <5, mild 5–12, severe >15) | float         |
| `apgar_score`                 | APGAR score at birth (0–10, recorded on day 1)     | float         |
| `immunizations_done`          | Immunizations completed (Yes/No)                   | string        |
| `reflexes_normal`             | Whether newborn reflexes are normal (Yes/No)       | string        |
| `risk_level`                  | Target variable: Healthy (0) or At Risk (1)        | string        |


The `risk_level` variable serves as the target feature, representing the infant’s overall health classification. Analytical models developed using this dataset aim to predict or classify newborns as Healthy or At Risk based on observed developmental and physiological indicators.

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Evaluation Metrics</div>

To get a better idea of how the team evaluates the models, the following section will define the different evaluation metrics to be discussed in this report:

- Recall - Also known as sensitivity or the true positive rate. It measures the proportion of all actual at-risk babies that were correctly identified by the model. It is the crucial metric for patient safety, as high recall minimizes false negatives (babies who are at risk but are classified as healthy).
- Precision - Also known as the positive predictive value. It measures the proportion of babies the model predicted to be at risk that were actually at risk. High precision minimizes false positives, which is often prioritized for hospital efficiency.
- Accuracy - The proportion of all predictions that are correct, calculated as the ratio of correct predictions (true positives and true negatives) to the total number of predictions. Although often used as an overall measure of performance, it can be misleading on imbalanced datasets like this one, where "at-risk" babies are the minority class.
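
For reference, the three metrics can be written in terms of the confusion-matrix counts, where $TP$, $TN$, $FP$, and $FN$ denote true positives, true negatives, false positives, and false negatives, with "At Risk" taken as the positive class:

$$
\text{Recall} = \frac{TP}{TP + FN}, \qquad
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN}
$$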

# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">IV. Evaluation of Data Monitors’ Model</div>

Data Monitors provided St. James Hospital with a model that acts as a Baby Monitor for the OB-GYN ward. The model provides a daily prediction of whether the baby is healthy or at risk based on various factors. It preprocessed each feature according to its data type so that the feature could be understood by the machine. Then, it used Logistic Regression to learn and identify patterns within the dataset. Finally, the model development team used precision as their evaluation metric to assess the performance of the model.


In [11]:
# load the data frame
df = pd.read_csv('historical.csv')
fn_details = pd.read_csv('corrected_false_negatives.csv')

# get complete columns for false negative babies
fn_df = fn_details.merge(df, 
                         on=fn_details.columns.tolist(), 
                         how='left')

In [12]:
# load old model for audit
model = joblib.load('old_model.joblib')

https://scikit-learn.org/stable/model_persistence.html#security-maintainability-limitations


In [13]:
# get features used in old model, keyed by the name of each fitted transformer

features = {}
for dtype, preprocess, col_list in model.named_steps['preprocess'].transformers_:
    features[dtype] = col_list

# flatten the per-transformer column lists into one list of model inputs
all_features = [col for cols in features.values() for col in cols]

# keep only the columns the old model was trained on
target_col = "risk_level"
X = df[all_features]
fn_df = fn_df[all_features]
y = df[target_col]

In [14]:
# Encode the target as integers (map avoids pandas' replace downcasting warning)
y = df['risk_level'].map({'Healthy': 0, 'At Risk': 1}).values

# Split into train/test sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Predict train set
y_train_pred = model.predict(X_train)
train_accuracy = accuracy_score(y_train, y_train_pred)
train_f1 = f1_score(y_train, y_train_pred)
train_precision = precision_score(y_train, y_train_pred)
train_recall = recall_score(y_train, y_train_pred)

print(f"Train Accuracy: {train_accuracy:.3f}")
print(f"F1: {train_f1:.3f}")
print(f"Precision: {train_precision:.3f}")
print(f"Recall: {train_recall:.3f}")

# Predict test set
y_pred = model.predict(X_test)
accuracy = accuracy_score(y_test, y_pred)
f1 = f1_score(y_test, y_pred)  # F1 on the test predictions
precision = precision_score(y_test, y_pred)
recall = recall_score(y_test, y_pred)

# Print results
print("===============================")
print(f"Test Accuracy: {accuracy:.3f}")
print(f"F1: {f1:.3f}")
print(f"Precision: {precision:.3f}")
print(f"Recall: {recall:.3f}")
print("===============================")
print("Test Set Confusion Matrix:\n", pd.DataFrame(confusion_matrix(y_test, y_pred)))


Train Accuracy: 0.926
F1: 0.684
Precision: 0.780
Recall: 0.608
Test Accuracy: 0.919
F1: 0.653
Precision: 0.762
Recall: 0.571
Test Set Confusion Matrix:
      0   1
0  354  10
1   24  32


One flaw of the audited model is its overreliance on accuracy as the only metric. Accuracy measures how often the model’s predictions align with the babies’ actual risk. Our audit revealed that the original model had a high test accuracy of 91.9%. However, the class distribution is imbalanced: 1,822 records are labeled healthy, while only 278 are labeled at risk. Models trained on imbalanced datasets have difficulty capturing patterns of the smaller class. The high accuracy is therefore misleading, because a model can achieve it by merely predicting the majority class, i.e. “Healthy,” without properly detecting the minority class.

A better metric for this case is recall. Recall, or the true positive rate, measures how many at-risk babies were properly flagged by the model among all at-risk babies. A low recall, like the 57.1% of the original model, indicates that the model fails to detect nearly half of the babies who actually need medical attention. While prioritizing recall might lead to more false positives, i.e. healthy babies prematurely flagged as at-risk, it is better to remain cautious and conduct additional screenings earlier than to detect risk too late.
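
As a quick check, the reported test metrics can be recomputed by hand from the test-set confusion matrix printed above. The sketch below also compares them with the accuracy of a trivial classifier that labels every record "Healthy". The variable names are illustrative only.

```python
# Counts read from the test-set confusion matrix above
# (rows = actual class, columns = predicted class; 1 = At Risk).
tn, fp = 354, 10
fn, tp = 24, 32

total = tn + fp + fn + tp
accuracy = (tp + tn) / total
precision = tp / (tp + fp)
recall = tp / (tp + fn)

# A classifier that always predicts "Healthy" is right on every healthy record.
baseline_accuracy = (tn + fp) / total

print(f"Accuracy:  {accuracy:.3f}")                      # 0.919
print(f"Precision: {precision:.3f}")                     # 0.762
print(f"Recall:    {recall:.3f}")                        # 0.571
print(f"Always-Healthy accuracy: {baseline_accuracy:.3f}")  # 0.867
print(f"At-risk records missed: {fn} of {tp + fn}")      # 24 of 56
```

The trivial baseline reaches about 86.7% accuracy while catching no at-risk babies at all, which is why a 91.9% accuracy says little about patient safety on this dataset.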

In [None]:
numeric_df = df.select_dtypes(include=['float64','int64'])

numeric_df[['heart_rate_bpm','jaundice_level_mg_dl']].hist(
    figsize=(6,5),
    bins= 20,
    edgecolor='black',
    grid=False,
)

In addition, the distributions of the features in the dataset are a poor fit for a linear model. Some, like heart rate and weight, follow a relatively normal distribution. Others, like urine output and stool count, are uniformly distributed, with a roughly equal number of data points per category. Finally, a few, like jaundice, are bimodal. Logistic regression assumes that each feature relates linearly to the log-odds of the outcome, and a single linear relationship is unlikely to describe such features well unless the data is heavily preprocessed.

<div style="text-align: center;">
    <img src="https://imgur.com/d3aDLA7.png" alt="Boxplots of Numerical Features" width="800">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 1. Feature Importance of the Initial Model
</p>
</div>

Figure 1 shows the feature importance of the model provided by Data Monitors. Jaundice level is the most influential predictor, with heart rate and oxygen saturation following as key indicators, reflecting how the model weighs signs of distress versus stability.
Further directionality and feature interactions will be detailed in the upcoming SHAP analysis.

<div style="text-align: center;">
    <img src="https://imgur.com/SnFkXa5.png" alt="Boxplots of Numerical Features" width="600">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 2. Global SHAP Summary Plot of the Initial Model
</p>
</div>

The global SHAP summary plot gives a view of how each feature influences the model’s predictions, with both magnitude and direction of influence, showing how risk predictions change depending on the baby’s physiological profile. As seen above, jaundice level remains the strongest determinant of neonatal risk. Given the heavy distribution of blue points, this indicates that higher bilirubin values consistently push predictions toward the “At Risk” side. This confirms the model’s strong sensitivity to jaundice. Beyond jaundice, heart rate, oxygen saturation, and weight also shape the decision boundary of the model. Increased heart rate, lower oxygen saturation, and lower weight values tend to increase predicted risk.

To further interpret the model’s decision logic, we can later inspect the decision boundary and distribution for jaundice level specifically. This would help visualize how the model separates “Healthy” vs. “At Risk” infants and reveal the approximate threshold at which jaundice becomes critical in classification.


<div style="text-align: center;">
    <img src="https://imgur.com/rvMtKpC.png" alt="Boxplots of Numerical Features" width="750">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 3. Partial Dependence Plot of Risk on Jaundice Level
</p>
</div>

Figure 3 displays the Partial Dependence Plot (PDP) of baby health on jaundice level. The plot shows that the model’s predicted probability of a baby being classified as 'At Risk' changes as bilirubin levels increase. The curve shows a strong and nonlinear relationship, where risk remains minimal below approximately 5 mg/dL, then rises sharply beyond 8–10 mg/dL, where the model predicts a significantly higher likelihood of classifying baby health as being at risk.

The insights from the PDP complement the global SHAP analysis above, which ranked jaundice as the top predictor influencing model decisions. However, the PDP shows that the model’s reliance is not only strong but also has a threshold or decision boundary, meaning that predictions change rapidly around a certain value.
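
To make the idea of a partial dependence curve concrete, the sketch below computes one by hand: a single feature is forced to each value on a grid, and the model's predicted probability is averaged over all records. It runs on synthetic data with a stand-in logistic regression, not on the audited model or `historical.csv`; the helper `partial_dependence_1d`, the toy features, and the synthetic labelling rule are all assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic stand-in data (assumption): risk grows with jaundice, with overlap.
rng = np.random.default_rng(0)
n = 300
jaundice = rng.uniform(1, 15, n)
heart_rate = rng.normal(140, 10, n)
p_true = 1 / (1 + np.exp(-(jaundice - 10)))
y_toy = (rng.random(n) < p_true).astype(int)
X_toy = np.column_stack([jaundice, heart_rate])

clf = LogisticRegression(max_iter=1000).fit(X_toy, y_toy)


def partial_dependence_1d(estimator, X, feature_idx, grid):
    """Average predicted P(class 1) when one feature is set to each grid value."""
    averages = []
    for value in grid:
        X_mod = X.copy()
        X_mod[:, feature_idx] = value  # overwrite the feature for every record
        averages.append(estimator.predict_proba(X_mod)[:, 1].mean())
    return np.array(averages)


grid = np.linspace(1, 15, 15)
pdp = partial_dependence_1d(clf, X_toy, feature_idx=0, grid=grid)

for value, prob in zip(grid, pdp):
    print(f"jaundice = {value:4.1f} mg/dL -> mean P(at risk) = {prob:.2f}")
```

scikit-learn provides the same computation through `sklearn.inspection.partial_dependence`, which is the more practical choice once a fitted pipeline is available.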

<div style="text-align: center;">
    <img src="https://imgur.com/qhBWGC6.png" alt="Boxplots of Numerical Features" width="750">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 4. Scatter Plot of Jaundice Levels by Baby Health Status
</p>
</div>

Figure 4 shows a scatter plot on the distribution of jaundice levels among infants, differentiated by their health status. Blue dots indicate healthy infants and red dots represent those classified as at risk.

There is a clear decision boundary in the health classification of babies based on their level of jaundice. Most healthy babies cluster within lower jaundice levels of 0–5 mg/dL, with the decision boundary at around 10 mg/dL; cases beyond that level are commonly classified as being at risk. This trend suggests that higher bilirubin concentrations are strongly associated with elevated health risks.

However, the scatter plot also reveals a potential class imbalance, where healthy cases appear to heavily outnumber at-risk cases, as seen in the heavy clustering of healthy cases between 2–4 mg/dL. This imbalance may contribute to misclassifications, as the model may struggle to correctly identify subtle cases of elevated risk due to the dominance of healthy examples in the training data.

Although this plot reflects raw observed data rather than model predictions, it effectively highlights both the clinical significance of jaundice level and the impact of data distribution on model performance, underscoring the need for balanced datasets or resampling techniques to improve classification reliability.


<div style="text-align: center;">
    <img src="https://imgur.com/YOY5deo.png" alt="Boxplots of Numerical Features" width="750">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 5. Bar Graph of Class Imbalance on the Decision Boundary (10 mg/dL)
</p>
</div>

Figure 5 presents a bar chart comparing the number of infants with jaundice levels above and below the 10 mg/dL threshold among selected instances. The majority of babies fall within the <10 mg/dL group, while a smaller subset exceeds 10 mg/dL. This distribution reinforces the class imbalance observed earlier, where most babies exhibit normal or moderately elevated bilirubin levels, while only a limited number display severe jaundice. Such imbalance can influence model performance as the classifier may become biased toward predicting the more common category and underperform when identifying rare high-risk cases.

To address this, the proposed model aims to reduce dependency on jaundice alone by incorporating a broader set of physiological and developmental features, such as weight, reflexes, feeding type, and immunization status. By diversifying the feature space and rebalancing the training data, the new model is expected to achieve better generalization and fairer predictions across all baby health profiles.

<a href="https://ibb.co/jxJJFn1"><img src="https://i.ibb.co/wbcc15x/Screen-Shot-2025-10-17-at-10-19-50-PM.png" alt="Screen-Shot-2025-10-17-at-10-19-50-PM" border="0"></a>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 6. Global SHAP Values of Audited Model
</p>
</div>

Figure 6 is a beeswarm plot of the cumulative Shapley values of the false negative babies. The group instance tells us how high and low values of the features impact the model outcome, giving us better and more actionable insights.

1. Lower values for jaundice, heart rate, and the baby’s age in days contributed significantly to mislabelling the babies as healthy, which is in line with the single instance plot.
2. Higher values of weight and oxygen saturation also significantly pulled the prediction lower.


## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Local Interpretability (Individual Instance)</div>

To explain why the babies were misclassified, this section uses LIME (Local Interpretable Model-agnostic Explanations) to explore individual and group instances that could have led to misclassifying these babies as ‘healthy’. LIME provides local explanations by approximating the model’s behavior around a single instance and highlighting which specific features influenced that prediction the most.

<div style="text-align: center;">
    <img src="https://imgur.com/Z2raNTL.png" alt="Boxplots of Numerical Features" width="900">
</div>

<div style="text-align: center;">
<p style="text-align: center; font-size: 14px; margin-bottom: 30px; margin-left: 20px; font-style: italic;">
    Figure 7. LIME Explanation of Instance #729
</p>
</div>

Figure 7 shows the LIME explanation for Instance #729, a baby with a jaundice level of 4.9 mg/dL, which is well below the model’s decision boundary of roughly 10 mg/dL. Clinically, this baby should have been labeled as 'At Risk', yet the model predicted it as 100% healthy. This supports the finding that the model relies almost entirely on jaundice, a strong case of boundary-driven misclassification.

The LIME results show that the model placed heavy weight on the low jaundice level, which dominated the prediction and neglected other relevant indicators such as lower gestational age and smaller head circumference. While these latter features would typically correlate with higher risk, their influence was minimal compared to jaundice, which led the model to classify the instance as 'Healthy' with 100% confidence.

# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">V. Improved Model</div>

Several modifications were made to improve performance and address key issues found in the audit.

First, instead of the default train-test split, the data points were grouped by baby ID before splitting. One issue arising from the audit is unrealistic test scores for tree-based models. The dataset contains multiple records per baby, and the birth features (gestational age, birth weight, birth length, birth head circumference, and APGAR score) do not change between records. While personally identifiable information like baby ID and name was initially dropped, the model may still implicitly identify a specific baby through its birth details. Grouping the records by baby ID before splitting ensures that each baby appears in only one of the three sets: train, validation, or test. This prevents data leakage and improves generalizability to unseen babies.
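
A grouped split of this kind can be sketched with scikit-learn's `GroupShuffleSplit`. The helper `split_by_baby`, the split proportions, and the toy frame below are assumptions for illustration; the notebook's actual split code is not shown here.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import GroupShuffleSplit


def split_by_baby(frame, group_col="baby_id", test_size=0.2, val_size=0.2, seed=42):
    """Return (train, val, test) frames in which no baby appears in more than one set."""
    # First carve out the test babies.
    outer = GroupShuffleSplit(n_splits=1, test_size=test_size, random_state=seed)
    trainval_idx, test_idx = next(outer.split(frame, groups=frame[group_col]))
    trainval, test = frame.iloc[trainval_idx], frame.iloc[test_idx]

    # Then split the remaining babies into train and validation.
    inner = GroupShuffleSplit(n_splits=1, test_size=val_size, random_state=seed)
    train_idx, val_idx = next(inner.split(trainval, groups=trainval[group_col]))
    return trainval.iloc[train_idx], trainval.iloc[val_idx], test


# Toy data (assumption): 10 babies with 3 daily records each.
toy = pd.DataFrame({
    "baby_id": np.repeat([f"B{i:03d}" for i in range(1, 11)], 3),
    "age_days": np.tile([1, 2, 3], 10),
})

train_df, val_df, test_df = split_by_baby(toy)

for label, part in [("train", train_df), ("val", val_df), ("test", test_df)]:
    print(label, sorted(part["baby_id"].unique()))
```

Because the split is made over baby IDs rather than rows, all daily records of a given baby, including its constant birth features, stay on one side of the split.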



Another step implemented before modelling was SMOTENC. SMOTENC is a sampling technique that handles both numerical and categorical variables. It creates synthetic data points of the minority class to balance the distribution between the two classes. These data points are generated by interpolating between the nearest at-risk cases to make them similar approximations to the original data. This technique improves generalizability by helping the model reduce the bias toward only predicting the majority class.
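
The interpolation step can be illustrated with a simplified, NumPy-only sketch. This is not the `SMOTENC` implementation from imbalanced-learn: it handles numerical features only and omits categorical columns entirely. The function `interpolate_minority` and the toy values are hypothetical.

```python
import numpy as np


def interpolate_minority(X_minority, n_new, k=3, seed=0):
    """Create n_new synthetic rows on segments between minority rows and their neighbours."""
    rng = np.random.default_rng(seed)
    X_minority = np.asarray(X_minority, dtype=float)

    # Pairwise Euclidean distances; a row is never its own neighbour.
    diffs = X_minority[:, None, :] - X_minority[None, :, :]
    dists = np.linalg.norm(diffs, axis=2)
    np.fill_diagonal(dists, np.inf)
    neighbours = np.argsort(dists, axis=1)[:, :k]

    synthetic = np.empty((n_new, X_minority.shape[1]))
    for i in range(n_new):
        base = rng.integers(len(X_minority))        # pick a minority row
        neighbour = neighbours[base, rng.integers(k)]  # pick one of its k neighbours
        gap = rng.random()                           # position along the segment
        synthetic[i] = X_minority[base] + gap * (X_minority[neighbour] - X_minority[base])
    return synthetic


# Toy at-risk records (assumption): [jaundice_level_mg_dl, heart_rate_bpm]
at_risk = np.array([
    [11.0, 165.0],
    [12.5, 170.0],
    [10.2, 158.0],
    [14.0, 172.0],
    [13.1, 168.0],
])

synthetic = interpolate_minority(at_risk, n_new=4)
print(synthetic.round(2))
```

Every synthetic row lies on a line segment between two real at-risk records, so it stays within the range of the observed minority data rather than being an arbitrary copy or random noise.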

Random search was conducted to determine the model and hyperparameters that yielded the best recall and F1 score. Instead of Logistic Regression, the proposed method uses a Random Forest. A random forest suits the dataset because it is non-parametric and makes no assumptions about the distribution of the features. In addition, a random forest reduces overfitting by averaging the predictions of multiple decision trees. Since random forests are insensitive to feature scaling, the standard scaler was removed from the pipeline, while the one-hot encoding was retained.
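
A minimal sketch of such a search is shown below, assuming a toy dataset and an illustrative hyperparameter grid; the actual search space used for the proposed model is not given in this report. The resampling step is left out here, and only recall is used for scoring to keep the example short.

```python
import numpy as np
import pandas as pd
from sklearn.compose import ColumnTransformer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV, StratifiedKFold
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OneHotEncoder

# Toy data (assumption): two numeric features and one categorical feature.
rng = np.random.default_rng(0)
n = 120
toy_X = pd.DataFrame({
    "jaundice_level_mg_dl": rng.uniform(1, 15, n),
    "heart_rate_bpm": rng.integers(100, 180, n),
    "feeding_type": rng.choice(["Breastfeeding", "Formula", "Mixed"], n),
})
toy_y = (toy_X["jaundice_level_mg_dl"] > 10).astype(int)

# One-hot encode the categorical column; numeric columns pass through unscaled.
preprocess = ColumnTransformer(
    [("cat", OneHotEncoder(handle_unknown="ignore"), ["feeding_type"])],
    remainder="passthrough",
)
pipeline = Pipeline([
    ("preprocess", preprocess),
    ("clf", RandomForestClassifier(random_state=42)),
])

# Illustrative search space (assumption), scored on recall.
search = RandomizedSearchCV(
    pipeline,
    param_distributions={
        "clf__n_estimators": [25, 50, 100],
        "clf__max_depth": [3, 5, None],
        "clf__min_samples_leaf": [1, 3, 5],
    },
    n_iter=4,
    scoring="recall",
    cv=StratifiedKFold(n_splits=3, shuffle=True, random_state=42),
    random_state=42,
)
search.fit(toy_X, toy_y)

print(search.best_params_)
print(f"Best cross-validated recall: {search.best_score_:.3f}")
```

Scoring on recall rather than accuracy keeps the search aligned with the safety goal of missing as few at-risk babies as possible.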

Finally, instead of using hard class predictions, the model outputs its predicted probability that a baby belongs to the “at risk” group. There is no strict cutoff between healthy and unhealthy, so using probabilities shows how confident or uncertain the model is in its classification. This provides a more nuanced view of risk, allowing medical personnel to see and intervene based on their domain expertise. For example, a record with probabilities of 55% healthy and 45% at risk is more uncertain than a record with probabilities of 90% healthy and 10% at risk, so medical personnel can use this information to screen babies near the threshold.
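
One way to act on these probabilities is to flag records whose predicted risk falls close to the decision threshold for manual review. The sketch below is hypothetical: the function `triage`, the 0.5 threshold, and the 0.15 review margin are assumptions, not values taken from the proposed model.

```python
import numpy as np


def triage(prob_at_risk, threshold=0.5, margin=0.15):
    """Return class labels and a mask of records that sit near the threshold."""
    prob = np.asarray(prob_at_risk, dtype=float)
    labels = np.where(prob >= threshold, "At Risk", "Healthy")
    needs_review = np.abs(prob - threshold) <= margin  # uncertain predictions
    return labels, needs_review


# Predicted P(at risk) for three records, e.g. from predict_proba(...)[:, 1]
labels, needs_review = triage([0.45, 0.10, 0.80])

for label, review in zip(labels, needs_review):
    print(label, "-> manual review" if review else "-> no review flag")
```

With these settings, the record at 45% risk is labeled "Healthy" but flagged for review, while the record at 10% risk is not, mirroring the example in the paragraph above.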

The tuned random forest model improved the test recall to 95% and the F1 score to 92%. While these scores are more encouraging, the confusion matrix still reveals at-risk babies from the test set misclassified as healthy.

Another major challenge of the original model is its lack of explainability. Explainable models allow the user to see the factors that influence the model’s prediction. To enhance transparency and support clinical trust, model explainability was introduced using LIME. LIME, or Local Interpretable Model-Agnostic Explanations, provides insight into the features that most strongly influenced the model’s decision for a specific case. This addition allows clinicians and data monitors to better understand why the model flagged (or did not flag) a baby as at risk, determine whether these factors are realistic, and intervene accordingly.


In [None]:
results = []

pipe = Pipeline(
    [("clf", WhiteBox())]
)

pipe.fit(X_train, y_train)

y_train_pred = pipe.predict(X_train)
y_val_pred = pipe.predict(X_val)

# Record train/validation metrics for the WhiteBox model
results.append({
    'Model': 'WhiteBox',
    'Train_Acc': accuracy_score(y_train, y_train_pred),
    'Val_Acc': accuracy_score(y_val, y_val_pred),
    'Train_F1': f1_score(y_train, y_train_pred),
    'Val_F1': f1_score(y_val, y_val_pred),
    'Train_Precision': precision_score(y_train, y_train_pred),
    'Val_Precision': precision_score(y_val, y_val_pred),
    'Train_Recall': recall_score(y_train, y_train_pred),
    'Val_Recall': recall_score(y_val, y_val_pred),
})

results_df = pd.DataFrame(results).sort_values(by='Val_F1', ascending=False).reset_index(drop=True)
print(results_df)

In [None]:
model = pipe.named_steps['clf']

In [None]:
sns.barplot(x='Importance', y='Feature', data=model._get_feature_importance())
plt.title('Feature Importance')
plt.show()

In [None]:
# Fit the model
model.fit(X_train, y_train)

instance = fn_df[(fn_df['baby_id'] == 'B025') & (fn_df['date'] == '2025-04-19')]  
explanation = model._explain_instance(instance)

explanation.show_in_notebook()

Finally, to support the decision-making of the doctors at St. James' Hospital, a Local Interpretable Model-agnostic Explanation (LIME) is provided alongside each prediction. LIME shows doctors the predicted probability of the patient's health status together with the features that drove it. For instance, for baby 25 (B025), the same baby misclassified by Data Monitor's model, the predicted probability of being healthy decreased because the improved model depends less on jaundice. However, the decrease was not enough to classify the baby correctly as ‘at risk’, which highlights the importance of doctor intervention. After identifying which features affected the prediction, the doctor can decide on an alternative course of action based on their own experience and unquantifiable observations of the patient.

# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">VI. Proposed Framework</div>

The Organization for Whitebox Models (OWL) is a non-profit organization that provides accreditation for the following:

Data Management Association – Certified Data Management Professional (DAMA-CDMP), 
International Institute of Business Analysis – Certification in Business Data Analytics (IIBA-CBDA), 
International Organization for Standardization (ISO), and 
General Data Protection Regulation (GDPR)

To audit Data Monitor's model, the standards of these organizations were used, first, to identify and locate the source of the issue and, second, to create an improved framework that embodies OWL's mission of advocating the ethical use of machine learning models through transparency. The investigation found that St. James' Hospital and Data Monitors failed to uphold three key standards: 1) Privacy and Security, 2) Model Interpretability and Explainability, and 3) Human Intervention.

The dataset provided by St. James' Hospital does not meet the privacy and security practices of DAMA-CDMP and GDPR. Including real names that identify the babies unnecessarily risks breaching the confidentiality codes of the health industry. DAMA-CDMP and GDPR encourage organizations to collect only necessary data, so identifiers such as names, which do not contribute to the model's predictions or performance, should be excluded from future data collection. Hospitals nevertheless benefit from keeping patient records for future visits or appointments. To address this need, the public dataset could retain the baby_id column, which lets hospitals and patients look up a baby's history by using it as the key to their sealed records.

Second, IIBA-CBDA, ISO standards, and GDPR all discourage overreliance on artificial intelligence (AI). IIBA-CBDA holds that data analytics should support decision-making processes, not replace them. Likewise, GDPR and ISO standards express the need for human involvement; GDPR Article 22 generally prohibits fully automated decisions that significantly affect individuals. Despite these regulations, St. James' Hospital became overdependent on the model to classify the health of the babies. This is visualized in Figure 1 (the initial framework), where, after the model makes a prediction, the doctors do not verify it, trading verification for efficiency. In Figure 2 (the proposed framework), human intervention is added so that the doctor's expert opinion is a required step before a course of action is chosen.

Finally, the original framework lacks an explainability component to help the doctor choose the best course of action for the baby. IIBA-CBDA advocates quality interpretation and communication of results to industry professionals, while ISO/IEC 23894 aims to ensure transparency and risk controls for ML models. To mitigate the risks of at-risk babies being misclassified as healthy, such as death or a missed critical diagnosis, an explainability model could help doctors identify what is causing those babies to be labelled healthy. Adding this step maintains efficiency, since doctors can still bypass the extensive monitoring process, while giving them an at-a-glance explanation of the model's output. Doctors can then base their professional opinions on a concrete, interpretable visualization rather than relying solely on the predictions.
<div style="text-align: center;">
    <img src="https://imgur.com/tYXAR5C.png" alt="Initial Framework" width="800">
</div>

<div style="text-align: center;">
    <img src="https://imgur.com/8hnb2n0.png" alt="Proposed Framework" width="670">
</div>

Additionally, to keep the model reliable and useful for hospital staff, we suggest that a designated St. James' Hospital employee take responsibility for regularly evaluating the model's performance metrics. A significant drop in performance could indicate a bug, human error, or the need for a better classifier. Such instances must be reported directly to the team so that misclassifications are prevented and a solution is found as soon as possible, without compromising the hospital's efficiency.
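The monitoring step described above could be as simple as comparing recall on newly labelled cases against a recorded baseline. This is a minimal sketch: the function name `recall_alert`, the baseline value, and the tolerated drop of 0.05 are all assumptions for illustration, not part of the proposed framework.

```python
from sklearn.metrics import recall_score

def recall_alert(y_true, y_pred, baseline_recall, max_drop=0.05):
    """Return (current_recall, alert); alert is True when recall has fallen
    more than max_drop below the recorded baseline."""
    current = float(recall_score(y_true, y_pred))
    alert = bool((baseline_recall - current) > max_drop)
    return current, alert

# Toy batch of newly labelled cases (1 = at risk, 0 = healthy), invented data
current, alert = recall_alert(
    y_true=[1, 1, 1, 1, 0, 0],
    y_pred=[1, 1, 0, 0, 0, 0],
    baseline_recall=0.95,
)
print(current, alert)
```

In this toy batch only two of four at-risk cases are caught, so the check raises an alert that the responsible employee would report to the team.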

# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">VII. Conclusions and Recommendations</div>

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Limitations</div>

Although the modified model significantly improves on the performance of Data Monitor's initial system, the study still has certain limitations:

1. Limited Feature Scope: The model relies mostly on quantitative features and little contextual data. Adding doctors' notes and maternal health information could help nurses and doctors better determine the health of the baby.

2. Limited Interpretability in Complex Models: Although tree-based models provide better evaluation metrics, they also increase the risk of overfitting and reduce transparency. Even with SHAP, interpreting hundreds of possible feature interactions could be difficult for hospital personnel.

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Recommendations</div>

To extend the study, we recommend the following improvements to the model:

1. Expand and Diversify the Dataset: To improve the quality of the dataset, future researchers can collect related datasets from other hospitals or different demographic regions. Doing so can improve generalizability across areas or countries.

2. Implement Class Balancing Techniques: To address class imbalance more effectively, we encourage exploring and combining other sampling methods, such as ADASYN. Done properly, this could raise the performance of non-tree-based models, reducing the chance of overfitting and increasing transparency.

# <div style="background-color:#00357A; padding:20px; border-radius:10px; color:white; width:auto;">Supplementary Materials</div>

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Loading Files</div>

In [None]:
from sklearn.inspection import PartialDependenceDisplay
import matplotlib.pyplot as plt

# --- Partial Dependence Plot for Jaundice Level ---
fig, ax = plt.subplots(figsize=(8, 5))  # smaller aspect ratio

disp = PartialDependenceDisplay.from_estimator(
    model,
    X_train,
    features=['jaundice_level_mg_dl'],
    ax=ax,
    kind='average',
    grid_resolution=100
)

# --- Adjust scaling and formatting ---
ax.set_xlim(0, 15)                    # jaundice range (mg/dL)
ax.set_ylim(0, 0.8)                   # predicted-risk range shown on the y-axis
ax.set_xlabel('Jaundice Level (mg/dL)', fontsize=12)
ax.set_ylabel('Partial Dependence (Predicted Risk)', fontsize=12)
ax.set_title('Partial Dependence of Risk on Jaundice Level', fontsize=14)
ax.grid(True, linestyle='--', alpha=0.6)

plt.tight_layout()
plt.show()


## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Exploratory Data Analysis</div>

In [None]:
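# Forward-fill Apgar scores within each baby so values do not carry over between babies
df["apgar_score"] = df.groupby("baby_id")["apgar_score"].ffill()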

In [None]:
df.info()

In [None]:
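# Work on a copy without identifiers so `df` keeps baby_id for the split by baby later on
eda_df = df.drop(columns=['baby_id', 'name', 'date'])

In [None]:
non_numeric_df = eda_df.select_dtypes(exclude=['number'])

# Bar chart of category counts (including missing values) for each non-numeric column
for col in non_numeric_df.columns:
    plt.figure(figsize=(6,4))
    non_numeric_df[col].value_counts(dropna=False).plot(kind='bar', color='skyblue', edgecolor='black')
    plt.title(col)
    plt.show()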

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Hyperparameter Tuning</div>

In [None]:
# ----------- STEP 1: Define Feature Types -----------
# Separate features by data type
numerical_features_std = ['weight_kg', 'length_cm',
                     'head_circumference_cm', 'temperature_c', 'heart_rate_bpm', 'respiratory_rate_bpm',
                     'oxygen_saturation', 'jaundice_level_mg_dl']

numerical_features_minmax = ['age_days', 'feeding_frequency_per_day','urine_output_count',
                             'stool_count']

numerical_features = numerical_features_std + numerical_features_minmax
categorical_features = ['feeding_type', 'gender','immunizations_done', 'reflexes_normal']
   
drop_col = ['baby_id', 'name', 'date', 'gestational_age_weeks', 'birth_weight_kg', 
                     'birth_length_cm', 'birth_head_circumference_cm','apgar_score']
df['risk_level'] = df['risk_level'].replace({'Healthy': 0, 'At Risk': 1}).values

In [None]:
# ----------- STEP 2: Train-Test Split by Baby -----------
# Get unique babies
unique_babies = df['baby_id'].unique()

# Split babies, not records: hold out 15% of babies for testing
train_val_babies, test_babies = train_test_split(
    unique_babies, test_size=0.15, random_state=0
)

# Split the remaining babies into training and validation sets
train_babies, val_babies = train_test_split(
    train_val_babies, test_size=0.15, random_state=0
)

# Filter dataframe
train_df = df[df['baby_id'].isin(train_babies)].copy()
val_df = df[df['baby_id'].isin(val_babies)].copy()
test_df = df[df['baby_id'].isin(test_babies)].copy()

# Now fill within each set
train_df['apgar_score'] = train_df.groupby('baby_id')['apgar_score'].ffill()
val_df['apgar_score'] = val_df.groupby('baby_id')['apgar_score'].ffill()
test_df['apgar_score'] = test_df.groupby('baby_id')['apgar_score'].ffill()

X_train = train_df.drop(columns='risk_level')
X_train = X_train.drop(columns=drop_col)
y_train = train_df['risk_level']

X_val = val_df.drop(columns='risk_level')
X_val = X_val.drop(columns=drop_col)
y_val = val_df['risk_level']

X_test = test_df.drop(columns='risk_level')
X_test = X_test.drop(columns=drop_col)
y_test = test_df['risk_level']
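The split above assigns babies to sets at random, so a small set could end up with very few at-risk babies. One possible refinement, sketched below on toy data, is to stratify the baby-level split by a per-baby label. The toy frame and the rule "a baby is at risk if any of its records is at risk" are assumptions made for the illustration.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy records: 20 hypothetical babies, 3 records each; every 4th baby is at risk
toy = pd.DataFrame({
    "baby_id": [f"B{i:03d}" for i in range(20) for _ in range(3)],
    "risk_level": [int(i % 4 == 0) for i in range(20) for _ in range(3)],
})

# One label per baby (assumption: at risk if any record is at risk)
baby_label = toy.groupby("baby_id")["risk_level"].max()

# Split baby ids, stratified on the per-baby label
train_ids, test_ids = train_test_split(
    baby_label.index, test_size=0.25, stratify=baby_label, random_state=0
)
train_toy = toy[toy["baby_id"].isin(train_ids)]
test_toy = toy[toy["baby_id"].isin(test_ids)]

print(len(train_ids), len(test_ids))
print(test_toy.groupby("baby_id")["risk_level"].max().sum(), "at-risk babies in test")
```

Because whole babies are assigned to one side, no baby contributes records to both sets, and stratification keeps at-risk babies represented in each.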

In [None]:
# ----------- STEP 3: Preprocessing Pipeline -----------

lr_preprocessor = ColumnTransformer(
    transformers=[
        ('num_std', StandardScaler(), numerical_features_std),
        ('num_minmax', MinMaxScaler(), numerical_features_minmax),
        ('cat', OneHotEncoder(drop="if_binary"), categorical_features),
    ]
)

rf_preprocessor = ColumnTransformer(
    transformers=[
        ('num_std', 'passthrough', numerical_features_std),
        ('num_minmax', 'passthrough', numerical_features_minmax),
        ('cat', OneHotEncoder(drop="if_binary"), categorical_features),
    ]
)

preprocessors = {
    'Logistic Regression (L2)': lr_preprocessor,
    'Logistic Regression (L1)': lr_preprocessor,
    'Random Forest': rf_preprocessor
}

# Parameter grids for hyperparameter tuning
params = {
    'Logistic Regression (L2)': {
        'smote__k_neighbors': [3, 5, 7],
        'clf__C': [1e-3, 1e-2, 1e-1, 1],
    },
    'Logistic Regression (L1)': {
        'smote__k_neighbors': [3, 5, 7],
        'clf__C': [1e-3, 1e-2, 1e-1, 1],
    },
    'Random Forest': {
        'smote__k_neighbors': [3, 5, 7],
        'clf__n_estimators': [100, 200, 300],
        'clf__max_depth': [1, 2, 3],
        'clf__min_samples_split': [2, 5, 10],
        'clf__min_samples_leaf': [1, 2, 4]
    },
}

models = {
    'Logistic Regression (L2)': LogisticRegression(penalty = 'l2', max_iter=1000, solver='liblinear', random_state=42),
    'Logistic Regression (L1)': LogisticRegression(penalty = 'l1', max_iter=1000, solver='liblinear', random_state=42),
    'Random Forest': RandomForestClassifier(random_state=42)
}

In [None]:
# ----------- STEP 4: Hyperparameter Tuning (CV on Train Set, Evaluated on Val Set) -----------
results = []
trained_models = {}

kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=42)

for name, model in models.items():
    print(f"\n{'='*60}")
    print(f"Training: {name}")
    print(f"{'='*60}")

    # Fit preprocessor to get categorical feature indices
    preprocessor = preprocessors[name].fit(X_train)
    feature_names_out = preprocessor.get_feature_names_out()
    
    numerical_count = len(numerical_features)
    new_categorical_features = preprocessor.named_transformers_['cat']
    categorical_count = len(new_categorical_features.get_feature_names_out())
    indices = list(range(numerical_count, numerical_count + categorical_count))

    # Declare pipeline
    pipe = imb_pipeline([
        ("preprocessor", preprocessor),
        ("smote", SMOTENC(categorical_features=indices,
                          random_state=42)),
        ("clf", model)
        
    ])

    # Hyperparameter tuning
    rcv = RandomizedSearchCV(pipe, params[name], 
                             random_state=0, scoring='recall', cv=kfold)
    search = rcv.fit(X_train, y_train)
    
    best_params = search.best_params_
    best_model = search.best_estimator_
    print(best_params)
    trained_models[name] = best_model

    # Get prediction probabilities
    threshold = 0.5
    y_train_proba = best_model.predict_proba(X_train)[:, 1]
    y_val_proba = best_model.predict_proba(X_val)[:, 1]
    
    y_train_pred = (y_train_proba >= threshold).astype(int)
    y_val_pred = (y_val_proba >= threshold).astype(int)
    
    results.append({
        'Model': name,
        'Train_Acc': accuracy_score(y_train, y_train_pred),
        'Val_Acc': accuracy_score(y_val, y_val_pred),
        'Train_F1': f1_score(y_train, y_train_pred),
        'Val_F1': f1_score(y_val, y_val_pred),
        'Train_Precision': precision_score(y_train, y_train_pred),
        'Val_Precision': precision_score(y_val, y_val_pred),
        'Train_Recall': recall_score(y_train, y_train_pred),
        'Val_Recall': recall_score(y_val, y_val_pred),
    })

# ----------- STEP 5: Show Results -----------

results_df = pd.DataFrame(results).sort_values(by='Val_Recall', ascending=False).reset_index(drop=True)
print(results_df)

In [None]:
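best_pipe = trained_models[results_df.loc[0, 'Model']]

# Refit the winning pipeline on the training set
best_pipe.fit(X_train, y_train)
best_model = best_pipe.named_steps['clf']

# Take feature names from the winning pipeline's own preprocessor,
# not from whichever preprocessor was left over from the tuning loop
FEATURE_NAMES_NEW = best_pipe.named_steps['preprocessor'].get_feature_names_out()
# Note: feature_importances_ is only available for tree-based models such as Random Forest
feature_importances = best_model.feature_importances_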

feature_importance_df = pd.DataFrame({
    'Feature': FEATURE_NAMES_NEW,
    'Importance': feature_importances
})

feature_importance_df = feature_importance_df.sort_values(by='Importance', ascending=False)

sns.barplot(x='Importance', y='Feature', data=feature_importance_df)
plt.title('Feature Importance')
plt.show()

In [None]:
# See result of best model on test set
y_train_proba = best_pipe.predict_proba(X_train)[:, 1]
y_train_pred = (y_train_proba >= threshold).astype(int)

y_test_proba = best_pipe.predict_proba(X_test)[:, 1]
y_test_pred = (y_test_proba >= threshold).astype(int)
print(classification_report(y_test, y_test_pred))
print(pd.DataFrame(confusion_matrix(y_test, y_test_pred)))
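The evaluation above uses a fixed threshold of 0.5. Since missed at-risk babies are the costlier error, one option (not part of the original analysis) is to choose the threshold on the validation set so that recall meets a target. The helper name `threshold_for_recall`, the 0.95 target, and the toy scores below are assumptions.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

def threshold_for_recall(y_true, p_at_risk, target_recall=0.95):
    """Return the highest threshold whose recall is at least target_recall."""
    precision, recall, thresholds = precision_recall_curve(y_true, p_at_risk)
    # recall[i] is the recall when predicting 'at risk' for scores >= thresholds[i];
    # the final recall entry has no matching threshold, so it is dropped
    ok = recall[:-1] >= target_recall
    return float(thresholds[ok].max()) if ok.any() else float(thresholds.min())

# Toy validation labels and predicted probabilities (invented for illustration)
y_true = np.array([0, 0, 1, 0, 1, 1])
p_at_risk = np.array([0.10, 0.30, 0.35, 0.60, 0.70, 0.90])

t = threshold_for_recall(y_true, p_at_risk)
y_pred = (p_at_risk >= t).astype(int)
print(t, y_pred)
```

Lowering the threshold this way trades precision for recall, so the resulting number of false alarms should be reviewed with clinical staff before adoption.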

## <div style="background-color:#0081CC; padding:10px; border-radius:10px; color:white; width:auto;">Test Set Predictions and Model Export</div>

In [None]:
# Read test.csv file
prediction = pd.read_csv("test.csv")

# Drop identifiable and birth details
X_prediction = prediction.drop(columns=drop_col)
id_prediction = prediction['baby_id']

# Predict outcomes using best model
# (best_pipe is the fitted winning pipeline; the loop's `pipe` template is never fitted)
y_prediction = best_pipe.predict_proba(X_prediction)[:,1]
df_prediction = pd.DataFrame({"id": id_prediction, 
                              "At Risk Outcome": y_prediction})

df_prediction.to_csv("model_output.csv", index=False)

In [None]:
# Persist the fitted best pipeline (preprocessing, resampling step, and classifier)
joblib.dump(best_pipe, "new_model.joblib")