# Maternal Health Risk Assessment: A Comprehensive Model of Predictive Factors



#### Introduction:


Maternal mortality is a major concern under the UN's Sustainable Development Goals, with an estimated 287,000 women dying in 2020. A predictive model can help identify high-risk patients, enabling healthcare providers to allocate resources more effectively and improving the likelihood of successful outcomes for both mothers and babies.

Can we determine whether a pregnant woman faces low, medium, or high maternal health risk, thereby contributing to improved maternal health outcomes and a reduction in maternal mortality rates?

We use a dataset collected from 1013 pregnant women at hospitals, clinics, and maternal health care facilities in rural areas of Bangladesh. It has columns for age, systolic blood pressure, diastolic blood pressure, blood glucose, body temperature, and heart rate, plus the risk level we want to predict.

#### Preliminary exploratory data analysis:
We read the dataset from the UCI Machine Learning Repository directly into R. In summary, we download the zip file, unzip it, read the CSV into a data frame, and display the first 6 rows.

#### Methods:

Based on the data analysis, it has become apparent that the temperature variable does not significantly contribute to distinguishing between high, medium and low-risk patients. The temperature remains within the same range for all risk groups. Therefore, it would be misleading to consider temperature as a crucial factor in determining whether a pregnant woman is at high/medium/low risk.

To visualize the distribution of each quantitative variable and its relationship with the target predictive variable, RiskLevel, we created histograms for each variable. Additionally, we used different colors to represent the 3 levels of the RiskLevel variable. 


#### Step 1: Loading dataset from the web

In [None]:
## reading the data into R
temp <- tempfile()
download.file("https://archive.ics.uci.edu/static/public/863/maternal+health+risk.zip",temp)
maternal_original_data <- read.csv(unzip(temp, "Maternal Health Risk Data Set.csv"))
unlink(temp)
head(maternal_original_data)

In [None]:
# no NA value within the data

has_na <- any(is.na(maternal_original_data))
has_na

### Summary of the original data before wrangling

Interpretation of each column:
- **Age**: Age in years
- **SystolicBP**: Systolic blood pressure (pressure in the arteries when the heart beats, in mmHg)
- **DiastolicBP**: Diastolic blood pressure (pressure in the arteries when the heart rests between beats, in mmHg)
- **BS**: Blood glucose level (molar concentration, in mmol/L)
- **BodyTemp**: Body temperature, in degrees Fahrenheit
- **HeartRate**: Heart rate (normal resting heart rate, in bpm)
- **RiskLevel**: Predicted risk intensity level of maternal health during pregnancy, considering all the prior attributes

Other information of the original data:

- There are 1013 subjects within the whole data set.


In [None]:
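## loading the packages
library(tidyverse)   # also attaches dplyr and ggplot2
library(repr)
library(tidymodels)
# kknn is the engine used by nearest_neighbor(); install it only if it is missing
if (!requireNamespace("kknn", quietly = TRUE)) install.packages("kknn")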

#### Step 2: Wrangling and cleaning

In [None]:
## Adding a Celsius column for interpretation purposes
maternal_with_celcius <- maternal_original_data |> 
                         mutate(Celcius = (BodyTemp - 32) * 5/9 ) 
head(maternal_with_celcius)

In [None]:
## Rename the columns 
names(maternal_with_celcius) <- c("Age", 
                                  "Systolic_Blood_Pressure", 
                                  "Diastolic_Blood_Pressure", 
                                  "Blood_Glucose", #mmol/L	
                                  "Farenheit",
                                  "Heart_Beat", #bpm
                                  "Risk_Level",
                                  "Body_Temp_Celcius")
maternal_new_name <- maternal_with_celcius
# Removing the Fahrenheit column
maternal_new_name <- select(maternal_new_name, 
                            Age, 
                            Systolic_Blood_Pressure, 
                            Diastolic_Blood_Pressure,
                            Blood_Glucose,
                            Heart_Beat,
                            Risk_Level,
                            Body_Temp_Celcius)
head(maternal_new_name)

First, we added a Celsius column to make the data easier to interpret later, and we renamed the columns so they are easier to identify. We then removed the Fahrenheit column, which is redundant now that there is a Celsius column.
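
As a quick check of the conversion used above, the formula C = (F - 32) × 5/9 can be wrapped in a small helper. This is only a sketch: the function name `fahrenheit_to_celsius` is ours and does not appear in the notebook.

```r
# Hypothetical helper (not in the original notebook): Fahrenheit to Celsius.
fahrenheit_to_celsius <- function(f) {
  (f - 32) * 5 / 9
}

# Three reference points: freezing, a typical body temperature, boiling.
fahrenheit_to_celsius(c(32, 98.6, 212))
```

The three values should come out at roughly 0, 37 and 100 degrees Celsius, which is an easy way to confirm the formula is applied in the right direction.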

In [None]:
options(repr.plot.width = 7, repr.plot.height = 6) 
# initial visualization
systolic_histogram <- ggplot(maternal_new_name, aes(x = Systolic_Blood_Pressure, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Systolic Blood Pressure", fill = "Risk Level", y = "Count") +
                     ggtitle("Level of Systolic Blood Pressure among Risk Levels") +
                     theme(text = element_text(size = 14))
systolic_histogram
diastolic_histogram <- ggplot(maternal_new_name, aes(x = Diastolic_Blood_Pressure, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Diastolic Blood Pressure", fill = "Risk Level", y = "Count") +
                     ggtitle("Level of Diastolic Blood Pressure among Risk Levels") +
                     theme(text = element_text(size = 14))
diastolic_histogram
bs_histogram <- ggplot(maternal_new_name, aes(x = Blood_Glucose, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Blood Glucose(mmol/L)", fill = "Risk Level", y = "Count") +
                     ggtitle("Level of Blood Glucose among Risk Levels") +
                     theme(text = element_text(size = 14))
bs_histogram
heart_beat_histogram <- ggplot(maternal_new_name, aes(x = Heart_Beat, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Heart Beat(bpm)", fill = "Risk Level", y = "Count") +
                     ggtitle("Rate of Heart Beat among Risk Levels") +
                     theme(text = element_text(size = 14))
heart_beat_histogram
age_histogram <- ggplot(maternal_new_name, aes(x = Age, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Age", fill = "Risk Level", y = "Count") +
                     ggtitle("Age among Risk Levels") +
                     theme(text = element_text(size = 14))
age_histogram
celsius_histogram <- ggplot(maternal_new_name, aes(x = Body_Temp_Celcius, fill = as_factor(Risk_Level))) +
                     geom_histogram(alpha = 0.5, position = "identity") +
                     labs(x = "Body Temp in Celsius", fill = "Risk Level", y = "Count") +
                     ggtitle("Body Temperature among Risk Levels") +
                     theme(text = element_text(size = 14))
celsius_histogram

From the above visualizations, we can conclude a few things:
1. Temperature is not worth using as a predictor because the mix of risk levels stays about the same across the small changes in temperature. Temperatures stay within a narrow range of roughly 36 to 39 degrees Celsius, and the distribution is extremely right skewed, meaning the majority of women in this data set had a normal temperature. Using temperature as a predictor would therefore be misleading.
2. Blood glucose, age, systolic blood pressure, and diastolic blood pressure are four predictors that visually seem especially important, because the high risk level appears more often as each of them increases.
3. Although heart beat doesn't show as clear a trend as the other predictors, there still seems to be a connection between heart beat and risk level, because the risk level does increase as heart beat increases.

In conclusion, after examining the initial visualizations, we decided to use **age**, **systolic blood pressure**, **diastolic blood pressure**, **blood glucose**, and **heart beat** as predictors.
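
The visual impressions above could be cross-checked numerically by summarizing each predictor within each risk level. The sketch below uses a tiny made-up data frame (`toy`, with invented values) in place of `maternal_new_name`, so the numbers it prints say nothing about the real data.

```r
# Invented toy data standing in for maternal_new_name; values are illustrative only.
toy <- data.frame(
  Risk_Level        = c("low risk", "low risk", "mid risk", "mid risk", "high risk", "high risk"),
  Blood_Glucose     = c(6.8, 7.0, 7.5, 7.9, 13.0, 15.0),
  Body_Temp_Celcius = c(36.7, 36.7, 36.7, 37.2, 36.7, 36.7)
)

# Mean of every numeric column within each risk level.
group_means <- aggregate(. ~ Risk_Level, data = toy, FUN = mean)
group_means
```

A predictor whose group means barely differ, as temperature does in this toy example, gives a classifier little to work with, while one whose means separate clearly is a more promising candidate.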

In [None]:
## Delete body temperature column
maternal_final <- maternal_new_name |>
                  select(Age,Systolic_Blood_Pressure,Diastolic_Blood_Pressure,
                         Blood_Glucose, Heart_Beat, Risk_Level) |>
                  mutate(Risk_Level = as_factor(Risk_Level))
head(maternal_final)


#### Step 3: Splitting the data

After cleaning the data by dropping unnecessary columns, we set a seed and split the data into training and testing sets to prepare for classification.

In [None]:
## Setting seed and splitting data into training set and testing set
set.seed(10)
maternal_split <- initial_split(maternal_final, prop = 0.75, strata = Risk_Level)
maternal_train <- training(maternal_split)
maternal_test <- testing(maternal_split) 
glimpse(maternal_train)
glimpse(maternal_test)
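
To illustrate what stratifying a split on `Risk_Level` is meant to achieve, here is a minimal base R sketch on invented labels. It samples 75% of the rows within each class; the exact procedure `initial_split()` follows internally may differ in detail, and the class counts below are assumptions chosen for illustration.

```r
# Toy labels standing in for Risk_Level (invented, imbalanced on purpose).
set.seed(10)
labels <- factor(rep(c("low risk", "mid risk", "high risk"), times = c(40, 32, 20)))

# Stratified 75/25 split: sample 75% of the row indices within each class.
train_idx <- unlist(lapply(split(seq_along(labels), labels), function(idx) {
  idx[sample.int(length(idx), size = round(0.75 * length(idx)))]
}))
test_idx <- setdiff(seq_along(labels), train_idx)

# Class proportions are the same, or nearly so, in both parts.
prop.table(table(labels[train_idx]))
prop.table(table(labels[test_idx]))
```

Keeping the class mix similar in both sets means the test accuracy is not distorted by one risk level being over- or under-represented by chance.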

#### Step 4: Finding optimal k

In [None]:
# recipe for training data
maternal_recipe <- recipe(Risk_Level ~.,
                          data = maternal_train) |>
                   step_scale(all_predictors()) |>
                   step_center(all_predictors())
maternal_recipe

In [None]:
# model for training data 
maternal_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = tune()) |>
                 set_engine("kknn") |>
                 set_mode("classification")
maternal_spec

In [None]:
# creating v folds for training data
maternal_vfold_10 <- vfold_cv(maternal_train, v = 10, strata = Risk_Level)

# data frame that contains sequence of K values
maternal_k_vals <- tibble(neighbors = seq(from = 1, to = 70, by = 1))

We start the search for the optimal k by creating a recipe that standardizes all the predictors and a k-nearest neighbors model specification for classification. We chose 10-fold cross-validation because the data set is relatively large, and we searched k values from 1 to 70, roughly a tenth of the size of the training set.
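
To make the tuning procedure concrete, here is a minimal base R sketch of the same idea: standardize the predictors, classify by unweighted majority vote among the k nearest neighbors, and average the accuracy over cross-validation folds for each candidate k. It is not the tidymodels implementation. It uses the built-in `iris` data as a three-class stand-in for the maternal data, five unstratified folds instead of ten stratified ones, and helper names (`knn_predict`, `cv_accuracy`) that are ours.

```r
# Stand-in data (an assumption for illustration): iris ships with R and has
# three classes, playing the role of the three risk levels.
X <- as.matrix(iris[, 1:4])
y <- iris$Species

# Unweighted ("rectangular") k-NN: majority vote among the k closest rows.
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d <- sqrt(colSums((t(train_x) - row)^2))   # Euclidean distance to each training row
    nearest <- train_y[order(d)[seq_len(k)]]
    names(which.max(table(nearest)))           # ties resolve to the first factor level
  })
}

# Mean accuracy over the folds for one value of k.
cv_accuracy <- function(X, y, folds, k) {
  mean(sapply(sort(unique(folds)), function(f) {
    held_out <- folds == f
    # Center and scale using the analysis rows only, then apply to both parts.
    mu <- colMeans(X[!held_out, , drop = FALSE])
    s  <- apply(X[!held_out, , drop = FALSE], 2, sd)
    train_x <- scale(X[!held_out, , drop = FALSE], center = mu, scale = s)
    test_x  <- scale(X[held_out, , drop = FALSE], center = mu, scale = s)
    pred <- knn_predict(train_x, y[!held_out], test_x, k)
    mean(pred == as.character(y[held_out]))
  }))
}

set.seed(10)
folds <- sample(rep(1:5, length.out = nrow(X)))   # 5 random, unstratified folds
k_grid <- c(1, 3, 5, 7, 9)
cv_results <- data.frame(
  neighbors = k_grid,
  mean_accuracy = sapply(k_grid, function(k) cv_accuracy(X, y, folds, k))
)
cv_results
```

Standardizing inside each fold, using only the rows the model is fit on, mirrors why the recipe is built from the training data: the held-out rows should not influence the centering and scaling.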

In [None]:
# fitting the model
maternal_first_workflow <- workflow () |>
                           add_recipe(maternal_recipe) |>
                           add_model(maternal_spec) |>
                           tune_grid(resamples = maternal_vfold_10, grid = maternal_k_vals) |>
                           collect_metrics()
                        
head(maternal_first_workflow)

In [None]:
# getting only the accuracies
accuracies <- maternal_first_workflow |>
             filter(.metric == "accuracy")

# accuracy plot
accuracy_plot <- ggplot(accuracies, aes(x = neighbors, y = mean)) +
                geom_point() +
                geom_line() +
                labs(x = "Neighbors", y = "Accuracy Estimate")
accuracy_plot

From this plot, it's clear that the number of neighbors we need is relatively small, so we made a second plot over fewer neighbors to get a better read on the best value of k.

In [None]:
# new k values
maternal_k <- tibble(neighbors = seq(from = 1, to = 10, by = 1))

# fitting the newer model
maternal_new_workflow <- workflow () |>
                           add_recipe(maternal_recipe) |>
                           add_model(maternal_spec) |>
                           tune_grid(resamples = maternal_vfold_10, grid = maternal_k) |>
                           collect_metrics()
head(maternal_new_workflow)

# getting only the accuracies
accuracy <- maternal_new_workflow |>
             filter(.metric == "accuracy")

# accuracy plot
accuracy_plot_2 <- ggplot(accuracy, aes(x = neighbors, y = mean)) +
                geom_point() +
                geom_line() +
                labs(x = "Neighbors", y = "Accuracy Estimate") +
                scale_x_continuous(breaks = seq(0, 14, by = 1)) +
                scale_y_continuous(limits = c(0.4, 1.0))
accuracy_plot_2

After tuning over the grid, filtering for the accuracy metric, and plotting accuracy against the number of neighbors, we determined that k = 1 would provide the highest accuracy.
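
Rather than reading the best k off the plot by eye, it can also be pulled from the metrics table. The sketch below uses an invented table (`toy_accuracy`) shaped like the filtered output of `collect_metrics()`; the accuracy values are made up for illustration.

```r
# Invented accuracy values, shaped like collect_metrics() output after filtering.
toy_accuracy <- data.frame(
  neighbors = 1:5,
  mean      = c(0.80, 0.74, 0.72, 0.70, 0.69)
)

# Number of neighbors with the highest mean accuracy; which.max() keeps the first on ties.
best_k <- toy_accuracy$neighbors[which.max(toy_accuracy$mean)]
best_k
```

Applied to the real results, the same expression with `accuracy` in place of `toy_accuracy` would return the k to use in the final model.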

#### Step 5: Final model

In [None]:
# new model
maternal_final_spec <- nearest_neighbor(weight_func = "rectangular", neighbors = 1) |>
                       set_engine("kknn") |>
                       set_mode("classification")

# new workflow
maternal_final_workflow <- workflow() |>
                           add_recipe(maternal_recipe) |>
                           add_model(maternal_final_spec) |>
                           fit(data = maternal_train)
maternal_final_workflow

With the chosen k value, we created a new model specification and fit the final workflow on the training set, so it can be used to predict the test set.

In [None]:
# predicting testing data
maternal_predict <- predict(maternal_final_workflow, maternal_test) |>
                    bind_cols(maternal_test)
head(maternal_predict)

In [None]:
# assessing classifier's performance
maternal_metrics <- maternal_predict |>
                    metrics(truth = Risk_Level, estimate = .pred_class) |>
                    filter(.metric == "accuracy")
maternal_metrics

In [None]:
# producing the confusion matrix
maternal_matrix <- maternal_predict |>
                   conf_mat(truth = Risk_Level, estimate = .pred_class)
maternal_matrix

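The confusion matrix cross-tabulates predicted against true risk levels: rows are predictions and columns are true labels, so the diagonal counts correct classifications and every off-diagonal cell is a specific kind of error. This breakdown is important here because overall accuracy alone does not show how often high-risk patients are misclassified as lower risk.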
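
As a sketch of how a confusion matrix can be analyzed, the block below computes overall accuracy and per-class recall from a 3 × 3 table. The counts in `toy_cm` are invented and are not the results of this model; only the layout (predictions in rows, truth in columns) follows the printed matrix.

```r
# Invented counts: rows are predictions, columns are the truth.
lv <- c("high risk", "low risk", "mid risk")
toy_cm <- matrix(c(50,  2,  8,
                    3, 80, 20,
                    7, 18, 60),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(Prediction = lv, Truth = lv))

# Overall accuracy: correct predictions (the diagonal) over all predictions.
accuracy_overall <- sum(diag(toy_cm)) / sum(toy_cm)

# Per-class recall: of the truly class-c cases, the share predicted as class c.
recall_by_class <- diag(toy_cm) / colSums(toy_cm)

accuracy_overall
recall_by_class
```

With the real `maternal_matrix`, the underlying table should be available as `maternal_matrix$table`, so the same two expressions could be applied to it.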

In [None]:
maternal_final

In [None]:
# scatter plot of two predictors, colored by risk level
plot_age_vs_blood_glucose <- ggplot(maternal_final, aes(x = Age, y = Blood_Glucose, color = Risk_Level)) +
                            geom_point(alpha = 0.5) +
                            labs(x = "Age", y = "Blood Glucose (mmol/L)", color = "Risk Level") +
                            ggtitle("Scatter plot of Age versus Blood Glucose colored by risk level") +
                            theme(text = element_text(size = 12))
plot_age_vs_blood_glucose

#### Expected outcomes and significance:

- Expected Findings:

1) Multifactorial Nature of Risk: Maternal health risk is multifactorial. Multiple variables collectively contribute to the overall risk assessment, and no single factor holds all the weight in determining risk.
2) Identifying Correlations and Associations: Variables like blood pressure and blood sugar level are expected to be positively associated with risk level, with older mothers facing higher maternal risk.

- Impact of Findings:

The analysis of variables and expected risk in the maternal health dataset can significantly influence maternal healthcare, healthcare policy, and public health initiatives.
1) Preventive Health Education: Understanding risk factors can aid in creating health education programs promoting healthy lifestyle choices, including prenatal care, diet-related risk reduction, and general health practices among pregnant women.
2) Evidence-Based Policy Formulation: The findings could guide the development of evidence-based policies in maternal healthcare, aiming to reduce maternal health risk factors.
3) Research and Innovation: The findings could drive further research and innovation in maternal health, focusing on developing novel diagnostic tools, therapeutic interventions, and preventive measures to address identified risk factors, thereby improving treatment options and health outcomes for pregnant women.

- Future Questions:
  
1) Personalized Care Plans: How can the insights from the predictive model be used to create personalized care plans for pregnant women, considering their individual health profiles? 
2) Monitoring and Evaluation: How can the model be integrated into the monitoring and evaluation system to identify high-risk expecting mothers and ensure frequent monitoring to enhance maternal health outcomes?

#### References Used:

Ahmed, Marzia. (2023). Maternal Health Risk. UCI Machine Learning Repository. https://doi.org/10.24432/C5DP5D.