# Maternal Health Risk Prediction

### A Comparative Study of Machine Learning Model Performance

---

## Summary

This project uses machine learning to predict maternal health risk from key physiological factors. Maternal health is a major concern in healthcare, so early risk assessment is crucial for the welfare of both mother and child. Using structured data from the Maternal Health Risk dataset, we train and evaluate machine learning models that classify maternal health risk into three classes: low, mid, and high risk.

The dataset contains 1,014 records with features such as age, blood pressure, blood sugar level, body temperature, and heart rate. Our goal is to determine whether machine learning algorithms can accurately predict risk levels from these physiological markers and thereby aid medical experts.

We will carry out exploratory data analysis, feature engineering, and preprocessing on the dataset and then apply multiple classification models, including logistic regression and random forest. We might also carry out hyperparameter optimization. Finally, we will evaluate the models using accuracy and other classification metrics such as precision, recall, F1-score, and AUC-ROC.

---

## Introduction


Maternal health can be defined as the health condition of women during pregnancy, childbirth, and the postnatal period (WHO, 2025). This is a critical area of healthcare, as complications during pregnancy and childbirth can lead to severe consequences for both mothers and newborns. According to the World Health Organization (2024), around 800 women died each day in 2020 due to preventable causes related to maternal health, further emphasizing the need for risk assessment measures.

Historically, risk assessment has been carried out by medical professionals who relied heavily on clinical expertise and constant monitoring. However, traditional approaches to monitoring basic physiological indicators often lacked efficiency in identifying potential complications (Mu et al., 2023). Since the rise of machine learning (ML), many researchers have explored its use in maternal health risk prediction, offering data-driven approaches that enhance early detection and intervention and ease the burden on overworked medical professionals (Mu et al., 2023; Ukrit et al., 2024; Bajaj et al., 2023).

To contribute to this discourse, this research aims to conduct a comparative study on the performance of two ML techniques in predicting maternal health risk, assessing each model's reliability in identifying risk levels.

The analysis will utilize the [Maternal Health Risk](https://archive.ics.uci.edu/dataset/863/maternal+health+risk) dataset (Ahmed, 2020) sourced from the UC Irvine Machine Learning Repository. Consisting of 1,014 observations, this dataset includes the following 7 variables (6 features and the target):

- `Age`: Age of the patient (in years).
- `SystolicBP`: Systolic Blood Pressure (mmHg).
- `DiastolicBP`: Diastolic Blood Pressure (mmHg).
- `BS`: Blood sugar level (mmol/L).
- `BodyTemp`: Body temperature (°F).
- `HeartRate`: Heart rate (beats per minute).
- `RiskLevel`: The target variable, categorized into low risk, mid risk, and high risk.

---

## Methods


For this analysis, the data will first be loaded into the notebook and then cleaned to handle any possible missing values and ensure its usability for the various models. The data cleaning stage will be followed by an exploratory data analysis (EDA) to gain a comprehensive view of the data. This step will include visualizing the summary statistics, distributions, and correlations between variables to identify any patterns in the data prior to model development.

This study will implement 2 ML classification models:
1. Logistic Regression
2. Random Forest

Each model will be evaluated using various metrics such as accuracy, precision, recall, F1-score, and AUC-ROC to compare their relative performance in maternal health risk prediction. 
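
For reference, these metrics are commonly defined per class from confusion-matrix counts, treating each class $k$ in turn as the positive class. The notation below is introduced here for illustration: $TP_k$, $FP_k$, and $FN_k$ are the true positives, false positives, and false negatives for class $k$, $N$ is the number of test observations, and $K = 3$ is the number of risk levels. Macro-averaging is one common way to combine the per-class values and is an assumption here, since the averaging scheme is not specified above.

$$
\mathrm{Precision}_k = \frac{TP_k}{TP_k + FP_k}, \qquad
\mathrm{Recall}_k = \frac{TP_k}{TP_k + FN_k}, \qquad
\mathrm{F1}_k = \frac{2 \cdot \mathrm{Precision}_k \cdot \mathrm{Recall}_k}{\mathrm{Precision}_k + \mathrm{Recall}_k}
$$

$$
\mathrm{Accuracy} = \frac{1}{N} \sum_{k=1}^{K} TP_k, \qquad
\mathrm{Macro\text{-}F1} = \frac{1}{K} \sum_{k=1}^{K} \mathrm{F1}_k
$$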

### Importing Relevant Libraries

In [None]:
library(tidyverse)
library(corrplot)
library(nnet)
library(caret)

### Reading the Data
The [dataset](https://archive.ics.uci.edu/dataset/863/maternal+health+risk) was downloaded from the UC Irvine Machine Learning Repository and uploaded to the project's repository, from which it is read for the analysis.

In [None]:
data <- read_csv("data/Maternal Health Risk Data Set.csv")

cat("\n\033[1mTable 1: Sample of Raw Data\033[0m\n\n")
head(data)


### Wrangling and Cleaning the Data
From the cell below, we find that there are no NA or null values in our dataset.

In [None]:
tibble(feature = names(data), na = colSums(is.na(data)))

The code below shows that the features `Age`, `SystolicBP`, `DiastolicBP`, `BS`, `BodyTemp`, and `HeartRate` are numeric variables, while `RiskLevel` is currently a character variable. Moreover, there are three categories under `RiskLevel`: high risk, mid risk, and low risk.

In [None]:
str(data)
data %>% distinct(RiskLevel)

Given the three distinct categories under the target feature, we will modify `RiskLevel` to a factor variable to appropriately reflect its categorical nature in further analysis. 

In [None]:
data_clean <- data %>% 
    mutate(RiskLevel = factor(RiskLevel, levels = c("low risk", "mid risk", "high risk")))

cat("\n\033[1mTable 2: Sample of Clean Data\033[0m\n\n")

head(data_clean)

### EDA

#### Summary Statistics

In [None]:
summary(data_clean)

#### Distributions

Since age is an important factor in maternal health, we visualize the age distribution by risk level (Figure 1). High risk individuals have a higher median age, around 35 years. The interquartile range also indicates that the high risk group has more variation in age, and there are some outliers in the low and mid risk groups. Overall, older individuals seem more associated with higher maternal health risk.

For further exploration, the same kind of plot can be drawn for each of the remaining variables.

In [None]:
ggplot(data_clean, aes(x = RiskLevel, y = Age, fill = RiskLevel)) +
  geom_boxplot() +
  theme_minimal() +
  labs(
    title = "Figure 1: Age Distribution by Risk Level", 
    x = "Risk Level",
    y = "Age"
  ) +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))
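
Extending the boxplot to every variable can be sketched by reshaping the data to long format and faceting by feature. The block below is a minimal illustration, not part of the original analysis: the helper name `plot_features_by_risk` and the toy values are hypothetical, and in the notebook `data_clean` would be passed in place of `toy`.

```r
library(dplyr)
library(tidyr)
library(ggplot2)

# Hypothetical toy data with the same column names as the dataset
set.seed(1)
n <- 60
risk_levels <- c("low risk", "mid risk", "high risk")
toy <- data.frame(
  Age         = round(runif(n, 15, 50)),
  SystolicBP  = round(runif(n, 90, 160)),
  DiastolicBP = round(runif(n, 60, 100)),
  BS          = round(runif(n, 6, 15), 1),
  BodyTemp    = round(runif(n, 98, 103), 1),
  HeartRate   = round(runif(n, 60, 90)),
  RiskLevel   = factor(sample(risk_levels, n, replace = TRUE), levels = risk_levels)
)

# One boxplot panel per feature, each with its own y-axis scale
plot_features_by_risk <- function(df) {
  df %>%
    pivot_longer(cols = -RiskLevel, names_to = "Feature", values_to = "Value") %>%
    ggplot(aes(x = RiskLevel, y = Value, fill = RiskLevel)) +
    geom_boxplot() +
    facet_wrap(~ Feature, scales = "free_y") +
    theme_minimal() +
    labs(x = "Risk Level", y = "Value") +
    theme(legend.position = "none")
}

feature_plot <- plot_features_by_risk(toy)
feature_plot
```

Free y-axis scales are used because the features are measured in different units, so a shared axis would flatten the panels with smaller ranges.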

#### Correlation Matrix

All of the variables have a positive correlation with `RiskLevel`, indicating that increases in these variables generally correspond to a higher maternal health risk. `BS` (blood sugar level) has the strongest correlation at 0.57, suggesting it is likely to be the most influential factor. We expected age to have a stronger correlation with `RiskLevel`; however, systolic and diastolic blood pressure both seem to be more strongly correlated with `RiskLevel` than age is.

These findings may explain the outliers observed above: younger individuals with high blood pressure or blood sugar levels may still be classified into higher risk levels. This indicates the importance of factors other than age.

In [None]:
# Temporarily change RiskLevel to numerical values (3 = high risk, 2 = mid risk, 1 = low risk)
data_numeric <- data_clean %>%
  mutate(RiskLevel_numeric = as.numeric(RiskLevel)) %>%
  select(-RiskLevel) # remove the original factor variable

cor_matrix <- cor(data_numeric)

options(repr.plot.width=6.5, repr.plot.height=6) 

corrplot(cor_matrix, method = "ellipse", type = "upper", tl.cex = 0.8, tl.col = "black",
    col = colorRampPalette(c("red", "grey", "blue"))(200))
mtext("Figure 2a: Correlation Matrix", side = 3, line = 3, cex = 1.5, font = 2)


corrplot(cor_matrix, method = "number", type = "upper", number.cex = 1.5, tl.cex = 0.8, tl.col = "black",
    col = colorRampPalette(c("red", "grey", "blue"))(200))
mtext("Figure 2b: Correlation Matrix Values", side = 3, line = 3, cex = 1.5, font = 2)
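
Because the 1/2/3 coding of the risk level is ordinal, Pearson correlation implicitly treats the three levels as equally spaced. A rank-based alternative such as Spearman correlation uses only the ordering. The sketch below compares the two on hypothetical toy data standing in for `data_numeric`; the simulated relationship is an assumption made purely for illustration.

```r
# Hypothetical toy data: BS is related to risk by construction, HeartRate is not
set.seed(42)
n <- 200
risk <- sample(1:3, n, replace = TRUE)  # 1 = low, 2 = mid, 3 = high
toy_numeric <- data.frame(
  BS                = 6 + 2 * risk + rnorm(n),
  HeartRate         = rnorm(n, mean = 75, sd = 8),
  RiskLevel_numeric = risk
)

# Same cor() call as above, once per method
pearson  <- cor(toy_numeric, method = "pearson")
spearman <- cor(toy_numeric, method = "spearman")

# Correlation of each feature with the risk coding under both methods
features <- setdiff(colnames(toy_numeric), "RiskLevel_numeric")
comparison <- data.frame(
  feature  = features,
  pearson  = pearson[features, "RiskLevel_numeric"],
  spearman = spearman[features, "RiskLevel_numeric"],
  row.names = NULL
)
comparison
```

If the two columns rank the features in the same order, the conclusions drawn from the Pearson matrix are not sensitive to the equal-spacing assumption.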


## Classification Model Building

### Train/Test Splitting

In [None]:
set.seed(123)

# Create an 80% training and 20% testing split
train_index <- createDataPartition(data_clean$RiskLevel, p = 0.8, list = FALSE)

# Subset data into training and testing sets
train_data <- data_clean[train_index, ]
test_data  <- data_clean[-train_index, ]

dim(train_data)
dim(test_data)
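
`createDataPartition` samples within each level of the outcome, so the class shares in the training and test sets should stay close to those of the full data. A quick check of that property might look like the sketch below, which uses hypothetical toy labels in place of `data_clean$RiskLevel`; the class probabilities are made up for illustration.

```r
library(caret)

# Hypothetical, imbalanced toy labels standing in for data_clean$RiskLevel
set.seed(123)
risk_levels <- c("low risk", "mid risk", "high risk")
toy_labels <- factor(
  sample(risk_levels, 500, replace = TRUE, prob = c(0.40, 0.33, 0.27)),
  levels = risk_levels
)

# Same call pattern as the 80/20 split above
toy_index <- as.vector(createDataPartition(toy_labels, p = 0.8, list = FALSE))

# Class shares in the full data and in each partition, side by side
split_check <- rbind(
  full  = prop.table(table(toy_labels)),
  train = prop.table(table(toy_labels[toy_index])),
  test  = prop.table(table(toy_labels[-toy_index]))
)
round(split_check, 3)
```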

### Multinomial Logistic Regression

In [None]:
multinom_logistic <- multinom(RiskLevel ~ ., data = train_data)

summary(multinom_logistic)
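
The coefficients reported by `multinom` are log-odds relative to the reference class, which is the first factor level ("low risk" with the level order used here). One way to make them easier to read is to exponentiate them into odds ratios and compute Wald z statistics from the standard errors. The sketch below does this on simulated toy data; the data-generating process and the object names `toy` and `fit` are hypothetical.

```r
library(nnet)

# Hypothetical toy data: risk depends on BS but not on Age, by construction
set.seed(7)
n <- 300
toy <- data.frame(BS = rnorm(n, 8, 2), Age = rnorm(n, 30, 8))
score <- 0.8 * (toy$BS - 8) + rnorm(n)
toy$RiskLevel <- cut(score, breaks = c(-Inf, -0.7, 0.7, Inf),
                     labels = c("low risk", "mid risk", "high risk"))

fit <- multinom(RiskLevel ~ ., data = toy, trace = FALSE)
fit_summary <- summary(fit)

# Each row compares one class against the reference level ("low risk")
odds_ratios <- exp(coef(fit))

# Wald z statistics and two-sided p-values for each coefficient
z_scores <- fit_summary$coefficients / fit_summary$standard.errors
p_values <- 2 * pnorm(abs(z_scores), lower.tail = FALSE)

round(odds_ratios, 3)
round(p_values, 4)
```

An odds ratio above 1 means that a one-unit increase in the feature raises the odds of that class relative to the reference class, holding the other features fixed.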

In [None]:
# Predict the most likely class for each test observation
test_predictions <- predict(multinom_logistic, newdata = test_data)
head(test_predictions)

# Predict the class probabilities for each test observation
test_probabilities <- predict(multinom_logistic, newdata = test_data, type = "probs")
cat("\n\033[1mTable 3: Predicted Probabilities for Test\033[0m\n\n")
head(test_probabilities)
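
The predicted probabilities are what an AUC-ROC calculation needs. With three classes, one common option is a one-vs-rest AUC per class, averaged afterwards. The base R sketch below computes it from the Mann-Whitney rank statistic; the helper names `auc_binary` and `one_vs_rest_auc` and the toy probabilities are hypothetical, and one-vs-rest with macro-averaging is an assumption rather than the method fixed by this study.

```r
# AUC via the Mann-Whitney rank statistic: the probability that a randomly
# chosen positive case gets a higher score than a randomly chosen negative one
auc_binary <- function(scores, is_positive) {
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  ranks <- rank(scores)  # average ranks, so ties count as half
  (sum(ranks[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# One AUC per class, treating that class as positive and the rest as negative
one_vs_rest_auc <- function(prob_matrix, truth) {
  sapply(colnames(prob_matrix), function(cls) {
    auc_binary(prob_matrix[, cls], truth == cls)
  })
}

# Hypothetical toy probabilities in the same layout as test_probabilities
risk_levels <- c("low risk", "mid risk", "high risk")
truth <- factor(rep(risk_levels, each = 2), levels = risk_levels)
probs <- matrix(c(
  0.7, 0.2, 0.1,
  0.5, 0.3, 0.2,
  0.3, 0.5, 0.2,
  0.5, 0.3, 0.2,
  0.1, 0.3, 0.6,
  0.2, 0.2, 0.6
), ncol = 3, byrow = TRUE, dimnames = list(NULL, risk_levels))

auc_per_class <- one_vs_rest_auc(probs, truth)
macro_auc <- mean(auc_per_class)
auc_per_class
macro_auc
```

In the notebook, the call would be `one_vs_rest_auc(test_probabilities, test_data$RiskLevel)`, since the probability matrix returned by `predict` has one column per risk level.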

In [None]:
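# confusionMatrix(data, reference): predictions first, true labels second
conf_matrix <- confusionMatrix(as.factor(test_predictions), as.factor(test_data$RiskLevel))

cm_table <- as.data.frame(conf_matrix$table)

# conf_matrix$table has dimensions (Prediction, Reference), in that order
colnames(cm_table) <- c("Predicted", "True", "Frequency")

# Percentage of each true class that was assigned to each predicted class
cm_table <- cm_table %>%
  group_by(True) %>%
  mutate(Percentage = round((Frequency / sum(Frequency)) * 100, 1)) %>%  # Round to 1 decimal place
  ungroup()
cat("\n\033[1mTable 4: Confusion Matrix\033[0m\n\n")
cm_table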

cm_visualization <- ggplot(cm_table, aes(x = True, y = Predicted, fill = Frequency)) +
  geom_tile(color = "black") +  # Draw tiles
  geom_text(aes(label = paste0(Frequency, "\n(", Percentage, "%)")), color = "white", size = 5) +  # Add labels
  scale_fill_gradient(low = "lightblue", high = "darkblue") +  # Color scale
  labs(title = "Figure 3: Confusion Matrix with Frequencies and Percentages",
       x = "True Label", y = "Predicted Label") +
  theme_minimal() +
  theme(plot.title = element_text(hjust = 0.5, face = "bold"))

cm_visualization
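
Precision, recall, and F1-score for each class can be read off the same confusion matrix. The base R sketch below shows one way to compute them together with accuracy and macro-averaged F1; the helper name `per_class_metrics` and the toy labels are hypothetical, and it assumes both inputs are factors with identical levels so that the table is square. The object returned by caret's `confusionMatrix` also carries per-class values in its `byClass` element, which may be the more convenient route in the notebook.

```r
per_class_metrics <- function(predicted, truth) {
  cm <- table(Predicted = predicted, True = truth)  # rows = predictions, columns = truth
  tp <- diag(cm)
  precision <- tp / rowSums(cm)  # NaN if a class is never predicted
  recall    <- tp / colSums(cm)
  f1        <- 2 * precision * recall / (precision + recall)
  list(
    accuracy = sum(tp) / sum(cm),
    by_class = data.frame(class = rownames(cm), precision = as.numeric(precision),
                          recall = as.numeric(recall), f1 = as.numeric(f1)),
    macro_f1 = mean(f1)
  )
}

# Hypothetical toy labels
risk_levels <- c("low risk", "mid risk", "high risk")
truth <- factor(rep(risk_levels, times = c(4, 3, 3)), levels = risk_levels)
predicted <- factor(
  c("low risk", "low risk", "low risk", "mid risk",
    "mid risk", "mid risk", "low risk",
    "high risk", "high risk", "mid risk"),
  levels = risk_levels
)

metrics <- per_class_metrics(predicted, truth)
metrics$accuracy
metrics$by_class
metrics$macro_f1
```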

### Random Forest

In [None]:
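# Install randomForest only if it is not already available
if (!requireNamespace("randomForest", quietly = TRUE)) install.packages("randomForest")

library(randomForest)

set.seed(123)  # random forests are stochastic, so fix the seed for reproducibility

# Fit a random forest that predicts RiskLevel from all other variables
rf_model <- randomForest(RiskLevel ~ ., data = train_data,
                         ntree = 500, mtry = 2, importance = TRUE)
print(rf_model)

The arguments passed to `randomForest` are:
- `RiskLevel ~ .`: predicts `RiskLevel` from all other variables.
- `ntree = 500`: the number of trees in the forest, here 500.
- `mtry = 2`: the number of variables randomly sampled as candidates at each split.
- `importance = TRUE`: computes variable importance.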
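
When a forest is fitted with `importance = TRUE`, the variable importance and the out-of-bag (OOB) error can be extracted from the fitted object. The sketch below illustrates this on simulated toy data, where `BS` drives the outcome by construction; the data and the object name `rf_toy` are hypothetical, and in the notebook the same calls would be applied to the fitted random forest.

```r
library(randomForest)

# Hypothetical toy data: risk depends on BS but not on HeartRate, by construction
set.seed(123)
n <- 300
toy <- data.frame(BS = rnorm(n, 8, 2), HeartRate = rnorm(n, 75, 8))
score <- (toy$BS - 8) + rnorm(n)
toy$RiskLevel <- cut(score, breaks = c(-Inf, -1, 1, Inf),
                     labels = c("low risk", "mid risk", "high risk"))

rf_toy <- randomForest(RiskLevel ~ ., data = toy, ntree = 200, importance = TRUE)

# type = 1 requests the mean decrease in accuracy (permutation importance)
imp <- importance(rf_toy, type = 1)
imp_sorted <- imp[order(imp[, 1], decreasing = TRUE), , drop = FALSE]
imp_sorted

# Out-of-bag error estimate after the last tree
oob_error <- rf_toy$err.rate[rf_toy$ntree, "OOB"]
oob_error
```

The OOB error is computed from observations left out of each tree's bootstrap sample, so it gives an error estimate without touching the test set.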

---

## Discussion


- summarize what you found
- discuss whether this is what you expected to find?
- discuss what impact could such findings have?
- discuss what future questions could this lead to?

---

## References

Ahmed, M. (2020). Maternal Health Risk [Dataset]. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D

Bajaj, D., Kumari, R., & Bansal, P. (2023). Risk level prediction for maternal health using machine learning algorithms. 2023 International Conference on Communication, Security and Artificial Intelligence (ICCSAI), 405–409. https://doi.org/10.1109/iccsai59793.2023.10421156 

Mu, C., Yan, Z., & Zhu, Y. (2023). Prediction of maternal health risk based on physiological indicators. Proceedings of the 2023 4th International Symposium on Artificial Intelligence for Medicine Science, 578–584. https://doi.org/10.1145/3644116.3644212 

Ukrit, M. F., Jeyavathana, R. B., Rani, A. L., & Chandana, V. (2024). Maternal health risk prediction with machine learning methods. 2024 Second International Conference on Emerging Trends in Information Technology and Engineering (ICETITE), 1–9. https://doi.org/10.1109/ic-etite58242.2024.10493737 

World Health Organization. (2024, April 26). Maternal mortality. World Health Organization. https://www.who.int/news-room/fact-sheets/detail/maternal-mortality 

World Health Organization. (2025). Maternal health. World Health Organization. https://www.who.int/health-topics/maternal-health#tab=tab_1 