# DESCRIPTION OF THE PROJECT

## The need for predicting postoperative glucose

Hyperglycemia is classified into three categories: persistent hyperglycemia, transient hyperglycemia, and stress-induced hyperglycemia. Persistent hyperglycemia is primarily caused by diabetes mellitus, whereas transient hyperglycemia can result from severe systemic conditions such as infections, sepsis, acute myocardial infarction, and glucocorticoid treatment.

Stress hyperglycemia, also known as the oxidative stress response, arises from a variety of factors. Generally, when a patient undergoes surgery, the procedure itself causes stress to the body, leading to an increase in blood glucose levels.

The most common complications associated with hyperglycemia include surgical wound infections, pneumonia, sepsis, acute kidney injury, acute myocardial infarction, and stroke. These complications are linked to high morbidity and mortality rates.

Postoperative hyperglycemia is typically managed by administering insulin, either subcutaneously or intravenously. The medical community still debates the optimal glycemic target, although most published literature suggests an upper limit of 180 mg/dL.

An algorithm that accurately predicts postoperative glucose levels could therefore help prevent complications related to hyperglycemia.

## Story behind the Diploma Thesis 

To complete my education in the Department of Mechanical Engineering at Aristotle University of Thessaloniki, I needed to undertake a diploma thesis. Since I aspire to become a data scientist/analyst, it seemed fitting for my thesis to focus on a data science project.

I have a strong interest in the healthcare field, which led me to volunteer as a part-time data scientist/analyst in the Anesthesiology Department at Theagenio Anti-Cancer Hospital. I hoped to make a meaningful contribution to the work of the doctors there. After meeting with the hospital's anesthesiologists, we determined that it would be both meaningful and practically useful to create a model that predicts the postoperative blood glucose levels of patients undergoing thoracic surgeries.

## Background

With the help of the hospital's anesthesiologists, I managed to collect data on 235 patients who underwent thoracic surgery between March 2023 and September 2023. Typically, before an operation, patients undergo a preoperative checkup that gathers information about their general health, medical history, regular medications, habits, and blood test results. The data for my project was extracted from these preoperative checkup folders. 

## Goals of the Diploma Thesis

1. **Build Models:** Develop models to predict postoperative glucose levels.  
2. **Compare Models:** Evaluate and compare the performance of the models to determine whether one performs better than the others.
3. **Estimate Feature Importance:** Assess the importance of each feature in the models to understand which factors most significantly impact postoperative glucose predictions.  

## Methods 

* For prediction
    1. Linear Regression
    2. Bagging Regressor
    3. Random Forest Regressor
    4. AdaBoost Regressor
    5. Support Vector Regression

* For Feature Importance
    1. SHAP values
    2. p-values (in the case of Stepwise Selection Linear Regression)
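
As a rough illustration of the prediction methods listed above, the five regressors could be set up with scikit-learn as follows. This is a minimal sketch, not the code from the notebooks: the data are synthetic, the hyperparameters are library defaults, and wrapping the SVR in a `StandardScaler` pipeline is an assumption made here.

```python
import numpy as np
from sklearn.ensemble import AdaBoostRegressor, BaggingRegressor, RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.svm import SVR

# Synthetic stand-in data (NOT the hospital data): 200 rows, 9 predictors.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 9))
y = 120 + 10 * X[:, 0] + rng.normal(scale=5, size=200)

# One entry per method; all hyperparameters are library defaults (assumption).
models = {
    "Linear Regression": LinearRegression(),
    "Bagging Regressor": BaggingRegressor(random_state=0),
    "Random Forest Regressor": RandomForestRegressor(random_state=0),
    "AdaBoost Regressor": AdaBoostRegressor(random_state=0),
    # SVR is sensitive to feature scale, so it is wrapped with a scaler here.
    "Support Vector Regression": make_pipeline(StandardScaler(), SVR()),
}

# Fit every model on the same data and predict the first five rows.
fitted = {name: model.fit(X, y) for name, model in models.items()}
predictions = {name: model.predict(X[:5]) for name, model in fitted.items()}
```

Keeping the models in one dictionary makes it straightforward to loop over them when comparing their scores.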

## Performance Evaluation

1. Coefficient of Determination __R^2__
2. Training Mean Absolute Error __Training_MAE__
3. 5-Fold Cross-Validated Mean Absolute Error __Test_MAE(KFold)__
4. Test Mean Absolute Error __Test_MAE__
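
The four metrics above could be computed along the following lines. This is a hedged sketch on synthetic data with a plain Linear Regression; the variable names are illustrative, and computing R² on the training set is an assumption made here, since this README does not state which set it refers to.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_absolute_error, r2_score
from sklearn.model_selection import KFold, cross_val_score, train_test_split

# Synthetic stand-in data (NOT the hospital data): 235 rows, 9 predictors.
rng = np.random.default_rng(42)
X = rng.normal(size=(235, 9))
y = 120 + 10 * X[:, 0] - 6 * X[:, 1] + rng.normal(scale=8, size=235)

# Hold out 35 rows for testing, leaving 200 rows for training.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=35, random_state=42
)

model = LinearRegression().fit(X_train, y_train)

# R^2 and Training_MAE, both on the training set (assumption for R^2).
r2 = r2_score(y_train, model.predict(X_train))
training_mae = mean_absolute_error(y_train, model.predict(X_train))

# Test_MAE(KFold): 5-fold cross-validated MAE on the training set.
# scikit-learn reports the negated MAE, so the sign is flipped back.
cv = KFold(n_splits=5, shuffle=True, random_state=42)
cv_scores = cross_val_score(
    model, X_train, y_train, cv=cv, scoring="neg_mean_absolute_error"
)
test_mae_kfold = -cv_scores.mean()

# Test_MAE on the held-out rows.
test_mae = mean_absolute_error(y_test, model.predict(X_test))
```

Because the target is measured in mg/dL, each MAE value is also in mg/dL, which makes the three error figures directly comparable.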

## Datasets and Data Dictionary

The dataset of 235 patients is divided into two smaller datasets: the Training Dataset (200 patients) and the Test Dataset (35 patients).

1. **Post Operative Glucose**: Quantitative variable. The unit of measurement is milligrams of glucose per deciliter of blood (mg/dL).

2. **Sex**: Binary (categorical) variable. It takes the value of 1 if the patient is male and 0 if the patient is female.

3. **Age**: Quantitative variable. The unit of measurement is years.

4. **Body Mass Index (BMI)**: Quantitative variable. The unit of measurement is kilograms per square meter (kg/m²).

5. **Cortisol (Cort)**: Binary (categorical) variable. It takes the value of 1 if cortisol is administered intraoperatively by the anesthesiologist and 0 otherwise.

6. **Pre Operative Fast**: Quantitative variable. The unit of measurement is hours (h). It indicates the hours of fasting of the patient until the surgery was performed.

7. **Smoking**: Quantitative variable. The goal is to express the patient's cumulative exposure to smoking. The unit of measurement is pack-years (packs of cigarettes smoked per day multiplied by the number of years of smoking). This unit is the one used by the medical staff.

8. **Pre Operative Glucose**: Quantitative variable. The unit of measurement is milligrams of glucose per deciliter of blood (mg/dL).

9. **Lung Ventilation Number (LV)**: Binary (categorical) variable. It classifies based on mechanical ventilation, either one-lung ventilation (OLV) or two-lung ventilation (TLV). It takes the value of 1 when ventilation is done with one lung and 2 when done with two lungs.

10. **ASA Physical Status Classification System (ASA)**: Ordinal (categorical) variable. It categorizes patients according to their physical condition, assessing the health and risk a patient faces when anesthesia is administered and surgery is performed. The values are assigned by the anesthesiologist who examines the patient during the pre-operative assessment. It takes values from 1 to 5, where 1 means the patient is in excellent condition, and 5 means the patient is in a pre-death state or is an organ donor.
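
For the Smoking variable, pack-years are conventionally computed as packs smoked per day multiplied by years of smoking. The helper below is a small sketch of that arithmetic; the function name, its parameters, and the 20-cigarettes-per-pack default are assumptions for illustration, not part of the notebooks.

```python
def pack_years(cigarettes_per_day: float, years_smoked: float,
               cigarettes_per_pack: int = 20) -> float:
    """Return smoking exposure in pack-years.

    pack-years = (cigarettes per day / cigarettes per pack) * years smoked
    """
    if cigarettes_per_day < 0 or years_smoked < 0:
        raise ValueError("Inputs must be non-negative.")
    packs_per_day = cigarettes_per_day / cigarettes_per_pack
    return packs_per_day * years_smoked


# One pack a day for 10 years, and half a pack a day for 30 years.
one_pack_ten_years = pack_years(20, 10)    # 10.0 pack-years
half_pack_thirty_years = pack_years(10, 30)  # 15.0 pack-years
```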


## Exceptions and Assumptions

During the data collection process, it was deemed necessary to exclude patients who fall into certain categories from the study. The following are some examples of patients who were excluded:

1. **Duplicates**: Each patient was included in the sample only once. This decision was made because having patients with completely identical characteristics (e.g., same age, years of smoking, body mass index) but different glucose values in the training data could be harmful to the models.

2. **Patients receiving corticosteroid medication pre-operatively**: These patients were excluded to determine whether the intra-operative administration of cortisol affects the response variable.

3. **Patients with diabetes mellitus**: We excluded patients with diabetes mellitus from the sample, as the presence of glucose disorders post-operatively is expected for them. Furthermore, we considered it appropriate to examine whether the models could detect the change in post-operative glucose in patients who are not known to have diabetes mellitus beforehand or in patients who show increased insulin resistance (prediabetes).

During the data collection, certain assumptions had to be made regarding some of the variables. The following describes the assumptions that were made:

1. **Assumption on Smoking values**: Patients who report active smoking within the 28 days before surgery are defined as smokers. More than 60% of patients relapse within the first year after quitting smoking. The chronic damage caused by smoking significantly increases the likelihood of cardiovascular events, strokes, chronic lung disease, and various forms of cancer, including lung cancer, and the likelihood of acute events begins to decrease significantly only after two years of cessation. We therefore classified patients who quit smoking more than two years ago as non-smokers, because more recent cessation does not protect the surgical patient from acute ischemic-type events accompanied by hyperglycemia.

2. **Assumption on Pre-Operative Fast values**: The exact start time of pre-operative fasting is not known. According to the relevant guideline of the hospital's medical staff, fasting starts no later than 21:00. Therefore, the values of Pre-Operative Fast are computed as the time elapsed from 21:00 on the night before the surgery until the start time of the surgery. The first surgeries start early in the morning, and the last one typically starts a few hours after noon.
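
The fasting assumption above amounts to a simple time difference. The following sketch shows one way to compute it, assuming the surgery start is available as a naive `datetime`; the function and constant names are hypothetical.

```python
from datetime import datetime, time, timedelta

# Assumed start of fasting: 21:00 on the evening before the surgery.
FAST_START = time(21, 0)


def preoperative_fast_hours(surgery_start: datetime) -> float:
    """Hours elapsed from 21:00 on the previous evening to surgery start."""
    previous_evening = datetime.combine(
        surgery_start.date() - timedelta(days=1), FAST_START
    )
    return (surgery_start - previous_evening).total_seconds() / 3600


# A surgery starting at 08:30 and one starting at 14:00 on the same day.
morning_case = preoperative_fast_hours(datetime(2023, 3, 15, 8, 30))   # 11.5
afternoon_case = preoperative_fast_hours(datetime(2023, 3, 15, 14, 0))  # 17.0
```

Under this assumption, a patient operated on later in the day always gets a longer fasting value than one operated on in the morning.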

## Jupyter Notebooks


In **Notebooks B1 and B2**, I perform Exploratory Data Analysis (EDA) on the data and take the necessary steps to prepare the dataset for applying machine learning (ML) methods. I also divide the dataset into training and test sets and create a version of the dataset with outliers removed.

In **Notebooks C1 to C5**, I apply the ML methods. In each notebook, I first optimize the model using GridSearchCV, estimate the R² and Mean Absolute Error (MAE), and calculate the SHAP values. I also create graphs to help understand how the model behaves.
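
The optimization step in these notebooks could look roughly like the sketch below, which tunes a Random Forest Regressor with `GridSearchCV`. The parameter grid, the MAE-based scoring, and the synthetic data are assumptions for illustration; the grids actually used are in the notebooks. The SHAP step is not shown because it depends on the separate `shap` package.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for the 200-patient training set (NOT the hospital data).
rng = np.random.default_rng(0)
X_train = rng.normal(size=(200, 9))
y_train = 120 + 10 * X_train[:, 0] + rng.normal(scale=5, size=200)

# Hypothetical grid: every combination is scored with 5-fold cross-validation.
param_grid = {"n_estimators": [50, 100], "max_depth": [3, None]}

search = GridSearchCV(
    RandomForestRegressor(random_state=0),
    param_grid,
    cv=5,
    scoring="neg_mean_absolute_error",  # negated MAE, higher is better
)
search.fit(X_train, y_train)

best_model = search.best_estimator_   # refit on the full training set
best_cv_mae = -search.best_score_     # flip the sign back to a plain MAE
```

The same pattern applies to the other regressors by swapping the estimator and its parameter grid.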

In **Notebooks D1 to D5**, I repeat the process using the dataset without outliers.

In **Notebooks E1 to E3**, I build additional Linear Regression models. Specifically, I construct two models: one that includes interactions among the six most important variables and another that includes interactions among seven variables. Finally, in E3, I optimize the Linear Regression model using Stepwise Selection.
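
One way to build a Linear Regression with interaction terms is sketched below using scikit-learn's `PolynomialFeatures`. It assumes the interactions are pairwise products and uses six synthetic columns as placeholders; which variables count as the six most important is not stated in this README.

```python
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.preprocessing import PolynomialFeatures

# Six synthetic predictors standing in for the six most important variables.
rng = np.random.default_rng(0)
X_six = rng.normal(size=(200, 6))
y = 120 + 10 * X_six[:, 0] + 4 * X_six[:, 0] * X_six[:, 1] + rng.normal(size=200)

# interaction_only=True keeps the original columns and adds pairwise products
# (x_i * x_j for i < j) without squared terms: 6 + 15 = 21 columns.
interactions = PolynomialFeatures(degree=2, interaction_only=True, include_bias=False)
X_expanded = interactions.fit_transform(X_six)

model = LinearRegression().fit(X_expanded, y)
r2_with_interactions = model.score(X_expanded, y)
```

With seven variables the same transformer would produce 7 + 21 = 28 columns, so the number of coefficients grows quickly relative to 200 training rows.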

In **Notebook F**, you can find all the results aggregated.

Finally, in **Notebooks G1 and G2**, I create graphs necessary for explaining some concepts in my thesis.



## Notes

__Note 1__: Parts of the code that were omitted from the final draft:

1. Variance Inflation Factor (VIF)
2. Residuals vs Response Graphs

__Note 2__: The data needed scaling.