# *Maternal Mortality Risk Analysis: Exploring Factors Impacting Maternal Well-being*

## Introduction

Maternal health refers to the health of a mother during pregnancy, childbirth and the postnatal period (World Health Organization, 2019). In 2020, 287 000 mothers died during or after pregnancy and childbirth. Many different factors can contribute to these deaths, so a predictive model built on those factors could help in preparing for and preventing them.

The predictive question we will try to answer is: Given a patient’s measurements on 6 numerical variables (age, systolic blood pressure, diastolic blood pressure, blood glucose, body temperature, heart rate), can we predict the patient's risk level as low, medium, or high? We will also ask which factors are most strongly associated with the highest maternal risk level in the dataset.

Our dataset has seven variables, one of which is the risk level we are trying to predict. The other six are age, systolic blood pressure, diastolic blood pressure, blood glucose, body temperature and heart rate. These serve as the predictor variables for our final prediction.

Our dataset is available at https://archive.ics.uci.edu/dataset/863/maternal+health+risk.


## Preliminary Exploratory Data Analysis

In [None]:
# Import necessary libraries

library(rvest)
library(tidyverse)
library(tidymodels)
library(repr)

### Reading the dataset from the web into R

#### General preface

The dataset we have chosen is hosted on a website that does not have a supported R API.

We searched extensively for terms of service that would disallow scraping and did not find any.

Thus, we will use web scraping to retrieve our dataset.

#### Website analysis

- Studying the page source shows that the dataset is retrieved by clicking a download "button", which sends a GET request for a file hosted on the same website. (The "button" is not a `<button>` element but an `<a>` link.)

- Note that the file the website serves is a .zip archive, which we will need to extract.

- If we can find the URL of the file, we can send a GET request to retrieve it and store it locally. (We will most likely use a built-in R function or library to do this.)

- The download button is the only element in the HTML with its particular combination of classes, which makes it easy to extract from the page.

#### General Scraping procedure

**Part 1**
- Retrieve source code of website
- Use CSS selectors to get the `<a></a>` element that contains the link to the file
- Extract the url which is held in the `href` attribute of the `<a></a>` tag

**Part 2**
- Download the dataset file pointed to by the url
- Extract the .zip file containing the dataset csv
- Read the csv file into R
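
Part 1 can be tried offline before touching the real website. The sketch below runs the same compound class selector against a hand-written HTML fragment; the markup, the `href` value, and the `https://example.org` root are all made up for illustration and are not copied from the dataset page.

```r
library(rvest)

# Hand-written HTML standing in for the dataset page (hypothetical markup and href)
toy_page <- minimal_html('
  <a class="btn" href="/about">About</a>
  <a class="btn btn-primary w-full text-primary-content" href="/static/example.zip">Download</a>
')

# A compound class selector only matches elements that carry every listed class,
# so the plain "About" link is skipped
toy_href <- toy_page |>
  html_elements(".btn-primary.btn.w-full.text-primary-content") |>
  html_attr("href")

# Join the relative path onto a placeholder site root
toy_download_url <- paste0("https://example.org", toy_href)
toy_download_url
```

If the selector ever matched more than one element, `toy_href` would be a vector with one entry per match, so it is worth checking its length before building the download URL.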

In [None]:
# Part 1

dataset_website_root <- "https://archive.ics.uci.edu"
dataset_webpage_url <- "/dataset/863/maternal+health+risk"

page <- paste0(dataset_website_root, dataset_webpage_url) |>
    read_html()   # Contains html source code of website

selectors <- ".btn-primary.btn.w-full.text-primary-content"    # CSS selector created by inspecting the download button

download_nodes <- html_elements(page, selectors)  # Contains all the elements that match the CSS selector (only the button we want)

file_url <- download_nodes |> html_attr("href")    # Gets the href of the download button (filepath of dataset file relative to the website root)

download_url <- paste0(dataset_website_root, file_url) # Url of website root + path to file

download_url

In [None]:
# Part 2

temp <- tempfile() # Create a temporary file to store the zip

download.file(download_url, temp, mode = "wb") # Download the zip file in binary mode (avoids a corrupted archive on Windows) and store it in the temporary file

original_name <- unzip(temp) # Extract and save the csv file from the zip
unlink(temp) # Remove the temporary zip file

new_filename <- "maternal_health.csv" # improved filename removing whitespaces

file.rename(original_name, new_filename) # Rename the file to be easier to access and removing whitespace in the original name

maternal_health <- read_csv(new_filename) # Read the csv file into R

### Using local files
If the data has already been downloaded to our local machine, it can simply be read by calling the appropriate `read_*` function on the filepath. Our file is a csv, hence `read_csv()`.

In [None]:
maternal_health <- read_csv("maternal_health.csv")
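
The two routes (download or local file) can be combined so that the network is only used when the csv is missing. The following is a minimal sketch, not part of the original workflow: `ensure_dataset()` is a hypothetical helper name, and it assumes the archive contains exactly one csv file.

```r
# Hypothetical helper: return the path to the csv, downloading it only when missing
ensure_dataset <- function(csv_path, zip_url) {
  if (file.exists(csv_path)) {
    return(csv_path)                      # Already on disk: skip the network entirely
  }
  zip_path <- tempfile(fileext = ".zip")
  on.exit(unlink(zip_path), add = TRUE)   # Remove the temporary zip when the function exits
  download.file(zip_url, zip_path, mode = "wb")
  extracted <- unzip(zip_path, exdir = tempdir())
  csv_files <- extracted[grepl("\\.csv$", extracted, ignore.case = TRUE)]
  stopifnot(length(csv_files) == 1)       # Assumption: exactly one csv in the archive
  file.copy(csv_files, csv_path)
  csv_path
}
```

With this in place, `read_csv(ensure_dataset("maternal_health.csv", download_url))` would read the local copy when it exists and fall back to the download otherwise.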

### Cleaning and Wrangling the data

#### Tidying data

Our dataset is already tidy because:

- each row is a single observation,

- each column is a single variable, and

- each value is a single cell (i.e., its entry in the data frame is not shared with another value).

Our dataset is already in a clean format, but to get it ready for training we rename the columns and convert `risk_level` to a factor.

#### Splitting the data into training and test data

We will split the data into two parts. 75% will be used as training data and 25% will be used as test data.

In [None]:
# Renaming Columns
new_col_names <- c("age", 
                    "systolic_blood_pressure", 
                    "diastolic_blood_pressure", 
                    "blood_glucose", 
                    "body_temperature", 
                    "heart_rate", 
                    "risk_level")
old_col_names <- colnames(maternal_health)
maternal_health <- maternal_health |> 
    rename_at(old_col_names, ~ new_col_names)

# Setting risk_level as a factor
maternal_health <- maternal_health |> 
    mutate(risk_level = as_factor(risk_level)) |>
    mutate(risk_level = fct_recode(risk_level, "high" = "high risk", "low" = "low risk", "mid" = "mid risk"))

# Splitting the data

set.seed(2023) # Fix the random seed (arbitrary value) so the split is reproducible

maternal_health_split <- maternal_health |>
    initial_split(prop = 0.75, strata = risk_level) # Split the data 3:1

maternal_health_train <- training(maternal_health_split) # Assign the training data to maternal_health_train
maternal_health_test <- testing(maternal_health_split) # Assign the testing data to maternal_health_test

glimpse(maternal_health_train)
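
To illustrate what stratifying on `risk_level` is meant to achieve, here is a base-R sketch of a stratified 75/25 split on made-up labels. It is only an illustration of the idea and does not claim to reproduce how `initial_split()` works internally; `stratified_train_ids()` and the `toy` data frame are hypothetical.

```r
# Made-up labels for illustration only (not the real dataset)
set.seed(1)
toy <- data.frame(
  id = 1:200,
  risk_level = factor(rep(c("low", "mid", "high"), times = c(80, 64, 56)))
)

# Stratified split: sample the requested proportion of row indices within each class
stratified_train_ids <- function(labels, prop) {
  idx_by_class <- split(seq_along(labels), labels)
  unlist(lapply(idx_by_class, function(idx) {
    idx[sample.int(length(idx), size = round(prop * length(idx)))]
  }), use.names = FALSE)
}

train_ids <- stratified_train_ids(toy$risk_level, prop = 0.75)
toy_train <- toy[train_ids, ]
toy_test  <- toy[-train_ids, ]

# Class proportions in each part
prop.table(table(toy_train$risk_level))
prop.table(table(toy_test$risk_level))
```

Because sampling happens inside each class, the training and test parts keep the same class proportions as the full toy data, which is the property we want when the risk levels are not equally common.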

### Summarizing the data using training data

First, we show the count and percentage of observations in each of the three categories: high risk, low risk and medium risk.

Second, we show a table of the mean values of all our predictor variables.

In [None]:
total_number <- nrow(maternal_health_train)

maternal_health_count_categories <- maternal_health_train |>
    group_by(risk_level) |>
    summarize(count = n(), percentage = round((n() / total_number) * 100))

maternal_health_variable_means <- maternal_health_train |>
    summarize(across(age:heart_rate, \(x) mean(x, na.rm = TRUE)))

#### Summary of the distribution of high risk, low risk and medium risk for maternal mortality 

In [None]:
maternal_health_count_categories

#### Summary of the mean of the different variables affecting maternal mortality

In [None]:
maternal_health_variable_means 

### Visualisation using training data

We have made graphs to illustrate how each variable is distributed and how it relates to the risk level.

In [None]:
options(repr.plot.width = 8, repr.plot.height = 4) 

maternal_health_train |>
    ggplot(aes(x = age, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 1) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Age", title = "Age vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))
maternal_health_train |>
    ggplot(aes(x = systolic_blood_pressure, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 3) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Systolic blood pressure", title = "Systolic blood pressure vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))
maternal_health_train |>
    ggplot(aes(x = diastolic_blood_pressure, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 2) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Diastolic blood pressure", title = "Diastolic blood pressure vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))
maternal_health_train |>
    ggplot(aes(x = blood_glucose, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 0.5) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Blood Glucose", title = "Blood Glucose vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))
maternal_health_train |>
    ggplot(aes(x = body_temperature, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 0.5) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Body Temperature", title = "Body Temperature vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))
maternal_health_train |>
    ggplot(aes(x = heart_rate, fill = as_factor(risk_level))) + geom_histogram(alpha = 1, position = "identity", binwidth = 1) + facet_grid(rows = vars(risk_level)) +
    labs(x = "Heart Rate", title = "Heart Rate vs risk level", fill = "Risk Level") + theme(text = element_text(size = 20))

maternal_health_train |> ggplot(aes(x = age, y = diastolic_blood_pressure, color = risk_level)) + geom_point() + theme(text = element_text(size = 20))
maternal_health_train |> ggplot(aes(x = age, y = blood_glucose, color = risk_level)) + geom_point() + theme(text = element_text(size = 20))
maternal_health_train |> ggplot(aes(x = blood_glucose, y = heart_rate, color = risk_level)) + geom_point() + theme(text = element_text(size = 20))
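
The six histogram calls above differ only in the column, the axis label, and the bin width, so they could be generated by one small function. This is a sketch of one possible refactor: `plot_risk_histogram()` is a hypothetical helper, and the `toy` rows are made up so that the example runs on its own.

```r
library(ggplot2)

# Hypothetical helper: faceted histogram of one variable, one panel per risk level
plot_risk_histogram <- function(data, variable, label, binwidth) {
  ggplot(data, aes(x = .data[[variable]], fill = risk_level)) +
    geom_histogram(position = "identity", binwidth = binwidth) +
    facet_grid(rows = vars(risk_level)) +
    labs(x = label, title = paste(label, "vs risk level"), fill = "Risk Level") +
    theme(text = element_text(size = 20))
}

# Made-up rows for illustration only (not taken from the dataset)
toy <- data.frame(
  age = c(19, 23, 25, 31, 35, 42),
  risk_level = factor(c("low", "low", "mid", "mid", "high", "high"))
)

p <- plot_risk_histogram(toy, "age", "Age", binwidth = 1)
p
```

On the real data the call would look like `plot_risk_histogram(maternal_health_train, "blood_glucose", "Blood Glucose", binwidth = 0.5)`, keeping the per-variable bin widths chosen above.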

# Method

## **\<\<INSERT METHOD HERE\>\>**