# Project Assignments 

* [Project Proposal](https://canvas.txstate.edu/courses/1993336/quizzes/6830611) 
* [Project Progress](https://canvas.txstate.edu/courses/1993336/assignments/27480554) 
* [Project Submission](https://canvas.txstate.edu/courses/1993336/assignments/27480566) 



#  Project Synthea

**Bridget Bangert & Aaron Parish** 

* Link to the Project Repository: https://git.txstate.edu/a-p789/ML4347-project-synthea

## Project Summary

This project aimed to predict whether a patient will die from COVID-19 using their demographic information and recent conditions and observations. Our goal was to see which features contributed the most to a COVID-19 death. Selecting features and building one large dataset was challenging because there were so many features to choose from across the 16 datasets provided. We fit separate models on the conditions and observations datasets, plotted their feature importances, and extracted the features that contributed the most in each model. We then combined those features into one large dataframe of 10,535 rows and 47 columns, in which 300 patients died from COVID-19. We tried a few models on that dataset before deciding that XGBoost was the most suitable solution to our problem. We normalized all features that contained quantitative information and oversampled the data at a ratio of 0.2. We first put the data through a basic XGBoost model, then ran a grid search over the parameters that help the most with unbalanced data to optimize its performance. The most important features, in order, were Oxygen saturation in Arterial blood, QALY, Systolic Blood Pressure, Age, DALY, Body temperature, BMI, Heart rate, Diastolic Blood Pressure, and Pain severity (0-10 verbal numeric rating score).


## Problem Statement 

* Synthea COVID-19 10k patient dataset - the link to download the data can be found here:

https://synthea.mitre.org/downloads


* Our goal was to use the Synthea dataset with 10k patients to predict whether a patient will die from COVID-19 based on their medical history and demographic information. We used the datasets that describe a patient's observations, conditions, age, gender, ethnicity, etc.

* We used XGBoost both for feature importance and as our predictive model. Using XGBoost helped us minimize the effects of the imbalanced dataset (see figures below).

* PCA plot of Conditions:

![My Image](Images/index2.png)

* PCA plot of Observations

![My Image](Images/index4.png)

* For cross-validation, we implemented a grid search over the XGBoost parameters max_depth, scale_pos_weight, and eta to maximize performance. For a more in-depth explanation of why we chose these parameters, refer to the Project-Pipeline notebook.

* I mostly relied on the confusion matrix to check the model's performance. In a real-life context, we would rather have the model over-predict COVID-19 deaths than miss them by predicting a non-COVID-19 outcome, so that patients and medical professionals can prepare for the worst-case scenario.
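To make the confusion-matrix reasoning above concrete, here is a minimal sketch in plain Python. The labels are made up for illustration and are not results from our model; the helper names are hypothetical. It shows why recall on the death class is the number to watch when missing a death is worse than a false alarm.

```python
def confusion_counts(y_true, y_pred):
    """Return (tn, fp, fn, tp) for binary labels, where 1 = COVID-19 death."""
    tn = fp = fn = tp = 0
    for actual, predicted in zip(y_true, y_pred):
        if actual == 1 and predicted == 1:
            tp += 1  # death correctly predicted
        elif actual == 1:
            fn += 1  # death missed (the costly error in this setting)
        elif predicted == 1:
            fp += 1  # false alarm
        else:
            tn += 1  # non-death correctly predicted
    return tn, fp, fn, tp


def recall(fn, tp):
    """Share of actual deaths that the model caught."""
    return tp / (tp + fn) if (tp + fn) else 0.0


def precision(fp, tp):
    """Share of predicted deaths that were actual deaths."""
    return tp / (tp + fp) if (tp + fp) else 0.0


# Made-up labels for illustration only: 4 deaths among 10 patients.
y_true = [0, 0, 0, 0, 0, 0, 1, 1, 1, 1]
y_pred = [0, 0, 0, 0, 1, 1, 1, 1, 1, 0]

tn, fp, fn, tp = confusion_counts(y_true, y_pred)
print("tn, fp, fn, tp =", (tn, fp, fn, tp))
print("recall    =", recall(fn, tp))
print("precision =", precision(fp, tp))
```

A model that trades a few extra false positives for fewer false negatives raises recall at the cost of precision, which matches the preference described above.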


## Dataset 

* Our data consists of 16 datasets containing information on 10k synthetic patients. We only used the allergies, conditions, immunizations, observations, and patients datasets. Each of them has a different shape and different features, but we only took 1-3 features from each one and merged them into one large dataframe of about 10k rows.

* The final merged dataframe had 10,535 rows and 47 columns, which consisted of the patient ID, age, race, ethnicity, observations (such as QALY, DALY, BMI, heart rate, etc.), conditions (such as body temperature, fever, fatigue, etc.), and whether or not the patient died from COVID-19.

* Conditions and observation features in the merged dataframe we used in our pipeline were denoted as 'CODE_(condition_code)' and 'VALUE_(observation_code)' (respectively), and their corresponding codes can be found in the condition_codes and observation_codes dataframes under the data directory.

* The observations data was initially in a format of one observation per row (see below). I wanted each observation type to be its own feature, so I pivoted the dataframe, which resulted in many NaN values. I dropped the features that had fewer than 1,000 observations and filled the remaining NaN values depending on the context of each feature. For example, I filled pain-severity NaNs with 0, assuming the patient never had to seek medical attention for pain, and I filled NaN heart rates with the median.

* Observations data before transform
![My Image](Images/observations_before.png)

* Observations data after transform
![My Image](Images/observations_after.png)

* In our final pipeline, I used a merged version of the conditions/demographics and observations dataframes, built after running each one through a model separately and selecting its most important features. I then normalized the real-valued columns and slightly oversampled at a ratio of 0.2 to reduce the bias from class imbalance as much as we could.
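The long-to-wide reshaping of the observations data described above can be sketched with pandas as follows. This is an illustration under stated assumptions, not the code from our notebooks: the column names `PATIENT`, `CODE`, and `VALUE`, the example codes, the toy values, and the small threshold are all assumed for the example (the write-up uses a threshold of 1,000 on the full data).

```python
import pandas as pd

# Toy long-format data: one observation per row. Column names and codes are
# assumptions for this sketch; here "8867-4" stands for heart rate and
# "72514-3" for pain severity.
obs = pd.DataFrame(
    {
        "PATIENT": ["p1", "p1", "p2", "p3", "p3", "p3", "p4"],
        "CODE": ["8867-4", "72514-3", "8867-4", "8867-4",
                 "72514-3", "rare-code", "72514-3"],
        "VALUE": [70.0, 3.0, 90.0, 80.0, 5.0, 1.0, 2.0],
    }
)

MIN_OBSERVATIONS = 2  # toy threshold; the write-up uses 1,000 on the real data

# Pivot so each observation code becomes its own column, keeping the last
# recorded value when a patient has several rows for the same code.
wide = obs.pivot_table(index="PATIENT", columns="CODE", values="VALUE", aggfunc="last")
wide.columns = [f"VALUE_{code}" for code in wide.columns]

# Drop features observed for too few patients.
counts = wide.notna().sum()
wide = wide.loc[:, counts >= MIN_OBSERVATIONS]

# Fill the remaining gaps per feature: 0 for pain severity, median for heart rate.
wide = wide.fillna(
    {
        "VALUE_72514-3": 0.0,
        "VALUE_8867-4": wide["VALUE_8867-4"].median(),
    }
)
print(wide)
```

Choosing the fill value per column keeps the imputation tied to what a missing value plausibly means for that feature, rather than applying one rule to the whole dataframe.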


## Exploratory Data Analysis 

#### Visualizations

* The distribution of age between the groups.

![My Image](Images/boxplot.png)

* Conditions heatmap

![My Image](Images/heatmap.png)

* QALY Scores in patients

![My Image](Images/qaly.png)

* BMI distributions between the groups

![My Image](Images/bmi.png)

* I used an XGBoost classifier, with slight oversampling, on the observations and conditions data to extract the most important features for predicting a given patient's state after COVID-19. Here is the feature importance plot for our observations dataframe:

![My Image](Images/feature_importance.png)


## Data Preprocessing 
 
* We reduced the dimensions of our dataframes by selecting only the features relevant to our task. As mentioned above, we dropped many features, such as payment information, insurance, and medications. Our observations and conditions dataframes had a substantial amount of information that we did not need, so we dropped the features with low value counts as well as the conditions whose value counts were greater than 10k.

* As stated before, we fit a model on the conditions/demographics dataframe and on the observations dataframe to figure out which features to add to our final pipeline dataframe. We then merged the two based on their most important features, normalized the continuous features (observations and age), and oversampled at a ratio of 0.2.
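As a rough illustration of the normalization and oversampling steps above, here is a standard-library sketch. It assumes that "normalized" means min-max scaling and that "a ratio of 0.2" means the minority class ends up at 0.2 times the size of the majority class; both are assumptions, and the function names and the `DEATH` key are hypothetical. The class counts (300 deaths out of 10,535 rows) come from the summary.

```python
import random


def min_max_scale(values):
    """Rescale numbers to [0, 1]. Min-max scaling is an assumption here."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]


def oversample_minority(rows, label_key, ratio, seed=0):
    """Duplicate minority rows (label 1), sampled with replacement, until
    n_minority / n_majority reaches `ratio`."""
    minority = [r for r in rows if r[label_key] == 1]
    majority = [r for r in rows if r[label_key] == 0]
    if not minority:
        return list(rows)
    target = round(ratio * len(majority))
    n_extra = max(0, target - len(minority))
    rng = random.Random(seed)
    return list(rows) + rng.choices(minority, k=n_extra)


# Class counts from the write-up: 300 deaths among 10,535 patients.
rows = [{"DEATH": 1} for _ in range(300)] + [{"DEATH": 0} for _ in range(10235)]
balanced = oversample_minority(rows, "DEATH", ratio=0.2)

n_deaths = sum(r["DEATH"] for r in balanced)
print(len(balanced), n_deaths)
print(min_max_scale([20, 50, 80]))
```

In a train/test setup, oversampling is normally applied to the training split only, so that duplicated rows do not leak into the evaluation data.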


## Machine Learning Approaches

* At first, we used logistic regression and then an SVM as our baseline models, but they did not perform well. We wanted to start with a simple classification model as our baseline; however, with 10k patients there are many features to take into account, and the target y is unbalanced.

* We started off by putting the two large dataframes (conditions/demographics and observations) each in their own model. With the conditions/demographics data, we took into account age and the time a patient had a condition, since those affect how the patient will react to a given condition. We did not add the time variable to our final model because, given how our final dataframe was constructed, it created many duplicate patients, and this would have biased the model towards patients with more conditions.

* For our improved models, we used a basic LGBM model and a basic XGBoost model, but they did not perform as well as we had hoped. We then researched parameters for unbalanced data and performed a grid search on them to get the best possible results.
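The grid search over max_depth, eta, and scale_pos_weight mentioned above can be sketched generically as follows. This is not our notebook code: the candidate values are illustrative, `grid_search` and `scale_pos_weight_hint` are hypothetical helpers, and `placeholder_score` is a stand-in so the sketch runs without training anything. The negatives-to-positives ratio is the typical starting value for scale_pos_weight suggested in the XGBoost documentation; the class counts come from the summary.

```python
from itertools import product


def scale_pos_weight_hint(n_negative, n_positive):
    """Negatives / positives, a typical starting value for scale_pos_weight."""
    return n_negative / n_positive


def grid_search(param_grid, score_fn):
    """Score every parameter combination; return (best_params, best_score)."""
    names = list(param_grid)
    best_params, best_score = None, float("-inf")
    for values in product(*(param_grid[name] for name in names)):
        params = dict(zip(names, values))
        score = score_fn(params)
        if score > best_score:
            best_params, best_score = params, score
    return best_params, best_score


# 10,235 non-deaths vs. 300 deaths before oversampling: roughly 34.
hint = scale_pos_weight_hint(10235, 300)

# Illustrative candidate values, not the grid used in the project.
param_grid = {
    "max_depth": [3, 5, 7],
    "eta": [0.05, 0.1, 0.3],
    "scale_pos_weight": [1, 5, round(hint)],
}


def placeholder_score(params):
    # Stand-in only. A real score_fn would train a model with `params`
    # under cross-validation and return a metric such as recall on deaths.
    return -(
        abs(params["max_depth"] - 5)
        + abs(params["eta"] - 0.1)
        + abs(params["scale_pos_weight"] - 5) / 10
    )


best_params, best_score = grid_search(param_grid, placeholder_score)
print(best_params, best_score)
```

Because every combination is scored, the cost grows multiplicatively with each added parameter; three values for each of three parameters already means 27 model fits per cross-validation fold.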

## Experiments 

* We relied heavily on precision/recall, the feature importance plot, and the confusion matrix to evaluate each model's performance. A big problem for us was the lack of correct predictions of COVID-19 deaths.

* Our baseline models (logistic regression and SVM) would only predict about half of the COVID-19 deaths correctly, and this was an issue we had with every model before the final one. None of them had an issue with the non-COVID-19 deaths since there was plenty of data to learn from in that category.

* Link to another model on a similar dataset with more patients: https://www.ncbi.nlm.nih.gov/pmc/articles/PMC7531559/
    * Their data had 88k COVID-19 patients, with information on major complications, the number of ICU stays, and disease severity.
    * Unlike us, they implemented features such as length of stay when seeking medical attention and supplies.
    * Overall, their model was much more complex than ours: they kept much more information than we did and also took the dates into consideration.

* The performance of our final model: 

![My Image](Images/confusion_matrix.png)

* If we were to go back and do this again, I would use a larger dataset. I would create a dataframe depicting the times a patient stayed for a visit and the supplies used, and I would include the healthcare providers, as in this article. Then, I would attempt a more intensive grid search on my parameters.

## Conclusion

* What did not work? 
    * Narrowing down the data too much
    * The first models - logistic regression and SVM
* Why do you think that is?
    * At first, I had dropped too many columns, and this resulted in a bad model compared with the one I got once I put the better, improved data through the model. Preprocessing is arguably one of the most important steps when making a predictive model, and lacking good data will result in bad performance.
    * These models are not as robust to unbalanced data as gradient boosting.
* What approaches and model parameter tuning have you tried?
    * I oversampled the COVID-19 death group at a ratio of 0.2, and I researched XGBoost parameters that would help with an imbalanced dataset. I played around with them manually at first, then opted for a grid search.
* What features worked well and what didn't? 
    * In our merged dataset, our conditions did not contribute much to a COVID-19 death, and observations and age were the most important features. 
* When describing methods that didn't work, make clear how they failed and any evaluation metrics you used to decide so. 
    * The first few models I used only predicted about half of the COVID-19 deaths correctly. After altering the XGBoost parameters, it was clear that the model was biased towards non-COVID-19 deaths: it kept predicting non-deaths for actual deaths.
* How was that a data-driven decision? Be concise; all details can be left in the .ipynb.
    * In the context of the situation, we wanted it to be the other way around, if possible (so patients and medical professionals can prepare for the worst-case scenario).
