## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional stuff as Appendix. You may refer to the Appendix in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Background / Motivation

A hospital's ability to release happier and healthier patients is an established metric for its success as an institution. More recently, a hospital's success has also been measured by its readmission rate: the extent to which patients are treated again in a hospital within 30 days of their initial inpatient stay. Hospital readmissions are associated with worse patient outcomes and high financial strain on hospitals and their administration [1]. The cost is high; readmissions often double the cost of a patient's care, both for the patients themselves and for the hospital treating them. As McIlvennan et al. note, approximately "20% of all Medicare discharges have a readmission within 30 days" and, of those, an estimated 12% are potentially avoidable [1], [2]. These often avoidable readmissions (due to hasty and/or incomplete care) are not only costly to the hospital and patient, but also potentially life-threatening, depending on the patient's ailment.

As such, determining what factors affect a patient's likelihood of being readmitted to the hospital is of paramount importance for patients, their physicians, and hospital administration alike. As our group is interested in medicine and passionate about improved and equitable health outcomes, we asked ourselves: **What if we could identify the relationship between the factors associated with a patient's initial stay and their likelihood of being readmitted?** We imagine that, equipped with this information, hospital staff and patients could work together to adjust patient care plans and avoid readmissions.

We propose to address this problem using diabetes as a case study. Diabetes is an ailment that affects approximately 537 million individuals worldwide, regardless of age [3]. Though individuals with diabetes must carefully maintain their diet, exercise, and other living conditions in order to manage their illness, hospitalizations are not rare. In one study, "25 percent of patients with Type 1 diabetes and 30 percent with Type 2 diabetes had a hospital admission during one year" [4]. The abundance of information known about these hospital admissions provides a ripe opportunity to build and test a model to predict readmission, simultaneously filling a need for diabetes patients, healthcare workers, and hospital administrators alike.

## Problem Statement 

To approach our analysis, we first articulated a question to guide our modeling process: **What is the relationship between the different factors associated with a patient's hospital stay and their likelihood of being readmitted to that hospital within 30 days?**

In this question, we are most interested in *identifying the relationship* between predictors (here, factors associated with a patient's hospital stay) and a response (whether a patient is readmitted or not). Thus, our model is mainly concerned with **inference.**

## Data sources

In our project, we used a dataset entitled “Diabetes 130-US hospitals for years 1999-2008 dataset” from the UCI Machine Learning Repository. The dataset can be accessed [here](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#). 

All data was collected between 1999 and 2008 from over 130 US hospitals and related care centers. The dataset contains over 10,000 entries, each corresponding to a patient with either Type 1 or Type 2 diabetes. 

Observations had to satisfy the following criteria to be included in this dataset: 

1. The patient must have had an inpatient encounter (i.e., they must have been admitted to the hospital)
2. The incident must have been a diabetic encounter (i.e., the patient had to have been a diabetic)
3. The length of the patient’s hospital stay had to have been between 1 and 14 days
4. Laboratory tests had to have been administered during the patient’s initial hospital stay
5. The patient must have received medications during their hospital stay

In the resulting dataset, there are 55 attributes with information about a patient's demographics (such as their age, gender, and race); the patient's medical history (such as their number of inpatient and outpatient visits, medications, etc.); their initial hospital stay (such as their number of medications, procedures, or diagnoses); and whether the patient was readmitted to the hospital within 30 days of their initial hospital visit. The dataset is largely complete, though 8 attributes have missing values that encompass over 50% of the observations. The remaining 47 attributes have few or no missing values. Thus, the vast majority of the attributes in this dataset are usable as potential predictors for our model.

We chose to use this dataset because it contains our variable of interest (whether a patient was readmitted or not) along with ~46 potential predictors to use to fit the model. This dataset is extremely robust and comprehensive, including information on over 10,000 patients in the US across a decade. This broad and comprehensive data increases the generalizability of our study's findings. Its number of predictors also allows us to approach our problem from multiple perspectives: (1) the patient’s characteristics, (2) their hospital stay and medical history, and (3) the treatments administered to them. 

## Stakeholders

We have identified three stakeholder groups that would benefit from our analysis: (1) diabetes patients and their loved ones; (2) hospital staff involved in direct patient care; and (3) hospital administrators and those who have an interest in reducing financial waste in healthcare (e.g., Medicaid).


### Diabetic Patients and Their Loved Ones 
Our first stakeholder group is diabetes patients and their loved ones. For this group, the results of our project could provide insight into the factors that affect a diabetic patient's continued risk of illness and subsequent hospital readmission. A hospital stay can be emotionally and financially stressful for a diabetic patient and their family. Hospital readmissions then exacerbate this toll on patients by subjecting them to more inpatient stays. Further, as readmissions are associated with worse patient health outcomes, it is likely that diabetic patients who are readmitted will be readmitted again, worsening the burden on both their health and finances [3]. Understanding what factors contribute to hospital readmissions may empower patients and their families to take a more active and focused role in improving their health outcomes.

For instance, if a patient were aware that they had certain risk factors associated with a greater likelihood of readmission, they could direct more time and resources towards alleviating these factors and advocating for their health. Patients and their families could be more vigilant toward early warning signs and communicate proactively with their physician or healthcare provider about prevention. Overall, this knowledge can empower patients with diabetes to direct their efforts towards the areas that most affect their health outcomes.

### Hospital Staff Involved in Direct Patient Care 
Our second stakeholder group includes hospital staff and physicians involved in direct patient care. These stakeholders are directly charged with overseeing the healthcare plans of diabetic patients. Thus, they are morally, professionally, and financially invested in giving their patients the best possible care to improve their health outcomes. With a better understanding of factors that impact the likelihood of a diabetic patient’s readmission, healthcare workers may adjust their administration of care during the patient’s initial hospital visit.

For instance, if a patient has an elevated risk of being readmitted, their doctor may adjust the patient's healthcare plan to focus on and alleviate that risk. The patient will thus likely have better health outcomes, and their risk of returning to the hospital within 30 days of being released (the criterion for a readmission) would be reduced.

This project may thus inform healthcare workers about which of their patients require the most vigilant and preventative interventions. Physicians, equipped with this knowledge, may then help to lower both the morbidity and mortality for diabetics in their care.

### Hospital Administrators and Medicaid
Our third and final stakeholder group consists of hospital administrators and those with an active interest in reducing the cost of healthcare, like Medicaid providers. Hospital readmissions are financially costly to the healthcare system. The Affordable Care Act incentivizes the reduction of hospital readmissions by financially penalizing hospitals with high readmission rates [4]. Penalties may include reduced payments to hospitals with excessive readmissions. In 2021, these penalties were projected to cost hospitals $521 million [4]. As such, hospital administrators may have a vested financial interest in identifying the factors that most contribute to an increased likelihood of returning to the hospital. By identifying such factors, administrators can allocate hospital resources towards addressing them, so that patient health outcomes improve, readmissions decrease, and their hospital avoids the related financial penalties.


## Data quality check / cleaning / preparation 

**NOTE - WE HAVE TO CITE [5] WHEN TALKING ABOUT THE CHOICES WE MADE FOR DATA CLEANING BECAUSE WE WERE BORROWING FROM THAT CODE REPOSITORY**

### Distribution of Categorical Variables

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 

### Distribution of Continuous Variables

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.


### Data Cleaning 
Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 


### Data Wrangling and Preparation
Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).

## Exploratory data analysis

### Key Insights from Exploratory Data Analysis

Our key insights from our exploratory data analysis are as follows:

1. The continuous variables that are most highly correlated with readmission are: `num_inpatient`, `num_diagnoses`, `time_in_hospital`, `age`, `encounter_id`, `num_medications`, and `num_procedures`.

2. The following variables showed differences in distribution when the data were split into readmitted vs. not readmitted patients: `time_in_hospital`, `num_diagnoses`, `age`, `num_inpatient`, `num_emergency`, `num_changes`.

3. `time_in_hospital` tends to increase with `age`. `time_in_hospital` also increases as `num_of_changes` increases.

4. The variability in `num_changes` increases as `num_inpatient` increases.

5. For younger patients, `num_inpatient` and `num_changes` have high variability.

6. The variables `num_inpatient` and `time_in_hospital` do not seem to be related. 

7. The variable `discharge_disposition_id` does not seem to be related to other predictors. 
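
The kind of correlation screen behind insight 1 can be sketched as follows. This is a minimal illustration on synthetic data rather than our actual analysis: the values are randomly generated, the helper name `correlations_with_response` is hypothetical, and only the column names are borrowed from the dataset.

```python
import numpy as np
import pandas as pd


def correlations_with_response(df: pd.DataFrame, response: str) -> pd.Series:
    """Rank numeric predictors by absolute Pearson correlation with a 0/1 response."""
    numeric = df.select_dtypes(include="number")
    corr = numeric.corr()[response].drop(response)
    # Order by magnitude so that strong negative correlations are not overlooked.
    order = corr.abs().sort_values(ascending=False).index
    return corr.reindex(order)


# Synthetic stand-in for the cleaned data (assumption: all values are made up).
rng = np.random.default_rng(0)
n = 1_000
toy = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),   # 1 to 14 days
    "num_medications": rng.integers(1, 40, size=n),
})
# The response depends only on time_in_hospital, so that column should rank first.
prob = 1 / (1 + np.exp(-(0.3 * toy["time_in_hospital"] - 2.5)))
toy["readmitted"] = (rng.random(n) < prob).astype(int)

ranked = correlations_with_response(toy, "readmitted")
print(ranked)
```

A screen like this only captures linear association with a binary response, so it is best treated as a first filter before inspecting the distributions directly.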


### Summarizing Insights from Related Plots and Tables 
Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

## Approach

### Modeling Approach
Our topic lends itself best to a **classification problem**. This is because our response variable (hospital readmission, or `readmitted`) is categorical and binary: either a patient  was readmitted or they were not readmitted. As such, we chose to implement a **logistic regression model** in our modeling and analyses, as this modeling method is appropriate when the response variable (here `readmitted`) is categorical. 

### Optimizing Performance Metrics

In our modeling and analyses, we are most interested in the following performance metrics: our model's (1) False Negative Rate (FNR), (2) precision, and (3) recall. 

#### False Negative Rate

We needed to build a model that has a low False Negative Rate (FNR): the share of patients who are actually readmitted that the model classifies as not readmitted. In the case of readmissions, false negatives are more concerning than false positives. If a person is likely to be readmitted and is told that they are *not* going to be readmitted, their condition may deteriorate and they may suffer further financial and emotional costs associated with unexpected hospital stays. If a person is not going to be readmitted and is told that they *will* be readmitted, they may take unnecessary precautions or tests to avoid further hospitalizations, but this will not be as harmful to the person as the previous case. Although our focus is inference rather than prediction, it is still vital that our model correctly identifies when a person is going to be readmitted. Thus, we are more focused on reducing the number of false negatives (and the False Negative Rate of the model) than on reducing the number of false positives.
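
As a worked illustration, the FNR can be computed directly from two confusion-matrix counts, and it is the complement of recall (FNR = 1 − recall). The sketch below uses illustrative counts that are assumptions, not values from our confusion matrices, and the function names are hypothetical.

```python
def false_negative_rate(false_negatives: int, true_positives: int) -> float:
    """FNR = FN / (FN + TP): the share of actual readmissions the model misses."""
    actual_positives = false_negatives + true_positives
    if actual_positives == 0:
        raise ValueError("FNR is undefined when there are no actual positives.")
    return false_negatives / actual_positives


def recall(false_negatives: int, true_positives: int) -> float:
    """Recall = TP / (FN + TP): the share of actual readmissions the model catches."""
    actual_positives = false_negatives + true_positives
    if actual_positives == 0:
        raise ValueError("Recall is undefined when there are no actual positives.")
    return true_positives / actual_positives


# Illustrative counts only (assumption): 1,000 readmitted patients, 471 of them missed.
fn, tp = 471, 529
fnr = false_negative_rate(fn, tp)
rec = recall(fn, tp)
print(f"FNR = {fnr:.3f}, recall = {rec:.3f}, sum = {fnr + rec:.3f}")
```

Because the two quantities always sum to one, lowering the FNR and raising recall are the same objective stated in two ways.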

#### Precision

Precision measures the accuracy of our model's positive predictions: of all the patients the model classifies as positive, it is the share that truly are positive. Here, a positive prediction corresponds to a scenario in which a patient is going to be readmitted to the hospital. This outcome directly relates to a patient's health outcomes and, similar to our reasoning above, it is vital that our model catch and classify these cases appropriately. We need our model to be very good at predicting positive outcomes when it needs to (i.e., making accurate classifications as to when patients may return to the hospital), so that patients can take preventative measures and avoid undue financial, health-related, and emotional strains associated with further hospital stays.

#### Recall 

Similarly, we sought to maximize our model's recall, or the share of actual positive instances that are correctly detected by the classifier. Again, a "positive instance" in our model is one in which a patient is readmitted to the hospital. Often, there is a tradeoff between precision and recall: increasing precision reduces recall and vice versa. However, it is still important that our model has both accurate positive predictions (high precision) and a high share of positive instances that are correctly detected (high recall). Thus, we sought a classification threshold that balances precision and recall rather than maximizing one at the expense of the other.
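
The tradeoff can be made concrete by computing both metrics at several classification thresholds. The sketch below is an illustration under stated assumptions: the labels and predicted probabilities are synthetic, and `metrics_at` is a hypothetical helper; only `precision_score` and `recall_score` come from *sklearn*.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score

# Synthetic labels and predicted probabilities (assumption: not our model's output).
rng = np.random.default_rng(1)
n = 2_000
y_true = rng.integers(0, 2, size=n)
# Positives receive somewhat higher scores on average than negatives.
scores = np.clip(0.5 + 0.15 * (2 * y_true - 1) + rng.normal(0, 0.2, size=n), 0, 1)


def metrics_at(threshold: float) -> tuple[float, float]:
    """Return (precision, recall) when scores >= threshold are classified as positive."""
    y_pred = (scores >= threshold).astype(int)
    precision = precision_score(y_true, y_pred, zero_division=0)
    rec = recall_score(y_true, y_pred, zero_division=0)
    return precision, rec


for threshold in (0.3, 0.5, 0.7):
    precision, rec = metrics_at(threshold)
    print(f"threshold={threshold:.1f}  precision={precision:.3f}  recall={rec:.3f}")
```

Raising the threshold can only shrink the set of predicted positives, so recall never increases as the threshold goes up; precision usually, though not always, moves in the opposite direction.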

### Anticipated Problems and Initial Modeling


#### Anticipated Problems: Correcting for an Uneven Response Distribution
When visualizing the distribution of our response variable, we noticed a potential problem that could arise in later modeling: there were far more negative instances (when a patient *was not* readmitted) than there were positive instances (when a patient *was* readmitted); see **Appendix C Figure 1**.

To correct for this uneven response distribution, our team first tried undersampling the data to reduce the number of negative values (or non-readmissions) to the number of positive values, or readmissions (see **Appendix C Figure 2**).
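
Random undersampling of this kind can be sketched in a few lines of pandas. The helper `undersample_majority` and the toy data are hypothetical and shown only to illustrate the idea; this is not the exact code we ran.

```python
import pandas as pd


def undersample_majority(df: pd.DataFrame, response: str, random_state: int = 0) -> pd.DataFrame:
    """Randomly drop rows so every class has as many rows as the smallest class."""
    n_minority = int(df[response].value_counts().min())
    parts = [
        group.sample(n=n_minority, random_state=random_state)
        for _, group in df.groupby(response)
    ]
    # Shuffle so the classes are not stored in contiguous blocks.
    balanced = pd.concat(parts).sample(frac=1, random_state=random_state)
    return balanced.reset_index(drop=True)


# Toy data (assumption): 90 non-readmissions and 10 readmissions.
toy = pd.DataFrame({
    "time_in_hospital": list(range(1, 11)) * 10,
    "readmitted": [0] * 90 + [1] * 10,
})
balanced = undersample_majority(toy, "readmitted")
print(balanced["readmitted"].value_counts())
```

The balanced frame keeps every minority row but only a random tenth of the majority rows in this toy case, which illustrates how much information undersampling can discard.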

However, we later realized that by undersampling and removing values from the majority class, we were losing data that could potentially help us in our inference model. As such, we implemented SMOTENC in our model optimization process. SMOTENC, or Synthetic Minority Over-sampling Technique for Nominal and Continuous features, generates synthetic data to oversample a minority target class in an imbalanced dataset. After implementing SMOTENC on our training dataset, we obtained a balanced 50-50 ratio between negative responses (or non-readmissions) and positive responses (or readmissions) (see **Appendix C Figure 3**).

#### Initial Modeling with Undersampling

In our initial model, we implemented undersampling and included the following variables as defined in our exploratory data analysis:

1. `num_of_changes` (or number of medication changes); 

2. `number_inpatient` (or number of inpatient stays); 

3. An interaction between `time_in_hospital` (or the number of days a patient originally stayed in the hospital) and `age` (the patient's age).

The $p$-values for all of the above terms were under 0.05 (indicating that they were statistically significant at the 5% level). Our base model's formula, expressed on the log-odds scale, is as follows:

> log-odds(`readmitted`) = (0.1104 * `time_in_hospital`) + (0.0144 * `age`) + (-0.0009 * `time_in_hospital` * `age`) + (0.1499 * `num_of_changes`) + (`number_inpatient`)
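
A model with this structure can be fit through the formula interface of *statsmodels*. The sketch below is illustrative only: the data are synthetic, so the estimated coefficients will not match those reported above, and the generating coefficients in `linear` are arbitrary assumptions.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic stand-in for the training data (assumption: all values are made up).
rng = np.random.default_rng(0)
n = 2_000
toy = pd.DataFrame({
    "time_in_hospital": rng.integers(1, 15, size=n),
    "age": rng.integers(20, 90, size=n),
    "num_of_changes": rng.integers(0, 5, size=n),
    "number_inpatient": rng.integers(0, 6, size=n),
})
linear = (
    -1.5
    + 0.05 * toy["time_in_hospital"]
    + 0.01 * toy["age"]
    + 0.15 * toy["num_of_changes"]
    + 0.20 * toy["number_inpatient"]
)
toy["readmitted"] = (rng.random(n) < 1 / (1 + np.exp(-linear))).astype(int)

# `a * b` in a formula expands to both main effects plus the a:b interaction.
model = smf.logit(
    "readmitted ~ time_in_hospital * age + num_of_changes + number_inpatient",
    data=toy,
).fit(disp=0)

print(model.params)   # coefficients on the log-odds scale
print(model.pvalues)  # p-value for each term
```

The fitted coefficients describe changes in log-odds; `model.predict(toy)` converts the linear predictor into predicted probabilities of readmission.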

However, when inspecting the confusion matrix of this model on training data, we found that it was not performing as we had hoped (see **Appendix C Figure 4**). Our model's classification accuracy was only **57.1%**, while its precision and recall were **57.8%** and **52.9%** respectively. This indicated that our base model was only making correct classifications 57.1% of the time, and that both the accuracy of its positive predictions and the share of positive instances it correctly detected were rather low. Our base model's False Negative Rate, or FNR, was similarly disappointing at **47.1%**. This indicated that the probability that a true positive would be missed by the model was 47.1%.

We plotted the ROC (Receiver Operating Characteristic) curve of our base model using training data (see **Appendix C Figure 5**). This plot shows the sensitivity (True Positive Rate) of the model on the y-axis against 1 − specificity (False Positive Rate) on the x-axis as the classification threshold varies. The 45° diagonal line connecting (0,0) to (1,1) is the ROC curve that corresponds to random chance. The area under the ROC curve is called the Area Under the Curve (AUC). AUC can be read as the probability that the model assigns a higher predicted probability to a randomly chosen readmitted patient than to a randomly chosen non-readmitted patient. Note in **Appendix C Figure 5** that the base model's AUC is quite low, indicating that the classifier is better than random chance, but still far from optimal in distinguishing between positive and negative classes.
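
The quantities behind an ROC plot can be computed with *sklearn* as in the sketch below. The labels and scores are synthetic assumptions chosen to contrast a weakly informative classifier with pure random scoring; they are not our model's output.

```python
import numpy as np
from sklearn.metrics import roc_auc_score, roc_curve

# Synthetic labels and two sets of scores (assumption: illustrative only).
rng = np.random.default_rng(2)
n = 1_000
y_true = rng.integers(0, 2, size=n)
informative = np.clip(0.5 + 0.1 * (2 * y_true - 1) + rng.normal(0, 0.2, size=n), 0, 1)
random_scores = rng.random(n)  # carries no information about y_true

# One (FPR, TPR) point per candidate threshold; plotting tpr against fpr gives the curve.
fpr, tpr, thresholds = roc_curve(y_true, informative)

auc_informative = roc_auc_score(y_true, informative)
auc_random = roc_auc_score(y_true, random_scores)
print(f"AUC (informative scores): {auc_informative:.3f}")
print(f"AUC (random scores):      {auc_random:.3f}")
```

Random scores give an AUC near 0.5, which corresponds to the diagonal line, while informative scores push the curve towards the top-left corner and the AUC towards 1.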

On test data, our model's classification accuracy increased to **60.1%**, though its precision and recall decreased to **11.9%** and **51.9%** respectively (see **Appendix C Figure 6**). The model's FNR also slightly increased to **48.1%**. This worsened performance of the model on test data inspired our team to move to a SMOTENC approach in an effort to optimize the model and its performance.




#### Existing Solutions

Our team was perplexed by our base model's subpar performance. Thus, we looked to Kaggle to see if there were any existing solutions to the problem that could inform our analysis. We found a code repository from Abishek Sharma called "Prediction on Hospital Readmission" that used our dataset to predict whether a patient was going to be readmitted to the hospital or not [5]. Sharma implemented three different models for prediction: (1) a logistic regression model, (2) a decision tree model, and (3) a Random Forest model. 

To create his logistic regression model, Sharma used the `LogisticRegression` class and the `train_test_split` and `cross_val_score` functions of the *sklearn* library. It was unclear which predictors he used in his analysis, but his logistic regression model attained an accuracy of 61%, precision of 63%, and recall of 56% [5]. However, both of his later models (decision tree and Random Forest) performed far better in terms of classification accuracy, precision, and recall. His decision tree model had an accuracy of 92%, precision of 93%, and recall of 90%, whereas his Random Forest model achieved an accuracy of 94%, precision of 99%, and recall of 90% (see **Appendix C Figure 7** for a visualized comparison of the performance of each of Sharma's models).
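
For readers more familiar with *statsmodels*, the sketch below shows how these three *sklearn* tools typically fit together. It is not Sharma's code: the two predictors, the synthetic data, and all parameter choices are assumptions made for illustration.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_score, train_test_split

# Synthetic predictors and response (assumption: not the readmission data).
rng = np.random.default_rng(3)
n = 1_000
X = np.column_stack([
    rng.integers(1, 15, size=n),  # e.g. days in hospital
    rng.integers(0, 6, size=n),   # e.g. prior inpatient stays
]).astype(float)
log_odds = -2.0 + 0.15 * X[:, 0] + 0.4 * X[:, 1]
y = (rng.random(n) < 1 / (1 + np.exp(-log_odds))).astype(int)

# Hold out 20% of the rows for testing, preserving the class ratio.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

clf = LogisticRegression(max_iter=1000)
# 5-fold cross-validated accuracy, computed on the training split only.
cv_scores = cross_val_score(clf, X_train, y_train, cv=5, scoring="accuracy")
clf.fit(X_train, y_train)
test_accuracy = clf.score(X_test, y_test)

print(f"Mean CV accuracy: {cv_scores.mean():.3f}")
print(f"Test accuracy:    {test_accuracy:.3f}")
```

One practical difference from *statsmodels* is that `LogisticRegression` applies L2 regularization by default and does not report $p$-values, which makes it less convenient for inference.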

Ultimately, Sharma's work informed us about the possible limitations of our approach. His decision tree and Random Forest models performed far better than our base model. Yet, because we did not have the tools or knowledge to implement those methods, we were limited to logistic regression. Further, Sharma implemented logistic regression in *sklearn* while our team was working in *statsmodels.formula.api*. Both Sharma and our team used SMOTENC to correct for the uneven response distribution, but Sharma's approach to logistic regression was difficult to interpret, as the libraries and packages he was working in were different from our own.

Overall, our base model performed on-par with Sharma's on training data in terms of classification accuracy. Our precision was slightly lower than his (57.8% vs 63%), as was our recall (52.9% vs 56%) [5]. 

## Developing the model

Explain the steps taken to develop and improve the base model - informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression). 

Did you succeed in achieving your goal, or did you fail? Why?

**Important: This section should be rigorous and thorough. Present detailed information about decision you made, why you made them, and any evidence/experimentation to back them up.**

### Final Model Equation

**Put the final model equation**.

## Limitations of the model with regard to Inference

If it is inference, will the inference hold for a certain period of time, for a certain subset of population, and / or for certain conditions.

If it is prediction, then will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model. Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs. For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction. This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results. 

When will your model become too obsolete to be useful?

## Future Work

Our model's performance may be limited by our approach. In Abishek Sharma's code repository from Kaggle, he was able to build models with at least 90% classification accuracy, precision, and recall using Random Forest and decision tree modeling methods [5]. His logistic regression model from *sklearn* ultimately performed similarly to ours, which we created using *statsmodels.formula.api*. In future studies, it may be interesting to implement Random Forest and decision tree models so that the model's performance (and thus its power with regard to inference) could be improved.

Further, future work in the readmission space may benefit from more recent data. Our dataset was collected between 1999 and 2008, and new policies introduced since 2008 have imposed further financial penalties on hospitals that have high readmission rates. These penalties incentivize hospitals to identify and address risk factors to reduce readmissions. As such, one potential avenue of future inquiry within this research community could be to study readmissions data from the last decade to see whether such policies have had an impact on readmissions (and whether the risk factors for hospital readmissions have changed).

## Conclusions and Recommendations to Stakeholder(s)

### Model Conclusions
What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.


### Recommendations to Stakeholders 
How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

### Implementation and Limitations of Recommendations
If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-}

Our project repository can be found at [**this link**](https://github.com/AnastasiaKWei/Saturn). A more detailed overview of each team member's contributions to the project can be found below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Kaitlyn Hung</td>
    <td>Modeling and EDA</td>
    <td>XXXXX</td>
    <td>XXXXX</td>
  </tr>
  <tr>
    <td>Amy Wang</td>
    <td>Modeling and model optimization</td>
    <td>XXXXX</td>
    <td>XXXXX</td>
  </tr>
    <tr>
    <td>Anastasia Wei</td>
    <td>Data cleaning and model optimization</td>
    <td>XXXXX</td>
    <td>XXXXX</td>    
  </tr>
    <tr>
    <td>Lila Wells</td>
    <td>Modeling and EDA</td>
    <td>XXXXX</td>
    <td>XXXXX</td>    
  </tr>
</table>

</body>
</html>

### Reflections on the GitHub Experience

When reflecting on our collaboration via GitHub, our team members appreciated how we were able to collectively work on one document (like this project report, for example) from our own devices while pushing and pulling changes that other group members made. One challenge that we faced when using GitHub was maintaining a clean and easy-to-navigate repository, as our team members were often simultaneously working on different files and, as such, our repository would often become quite crowded with different documents. However, we were able to circumvent this issue by organizing our repository into different subsections and folders. 

We found that collaborating on GitHub made it far easier to use the platform as a whole. Practice is the best teacher and, as our team was quite active in our repository, we learned a lot about pushing and pulling changes that other team members made, merging files, and making meaningful commits. GitHub definitely made our collaboration easier, as it stored all of our files in one place and allowed us to collaborate remotely yet effectively. 

## References {-}

[1] McIlvennan, Colleen K., et al. "Hospital Readmissions Reduction Program." National Library of Medicine, https://doi.org/10.1161/CIRCULATIONAHA.114.010270.

[2] "How Reducing Hospital Readmissions Benefits Patients and Hospitals." Regis College, 10 Aug. 2022, online.regiscollege.edu/blog/reducing-hospital-readmissions/.

[3] "About Diabetes." International Diabetes Federation, 12 Sept. 2021, idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html#:~:text=The%20IDF%20Diabetes%20Atlas%20Tenth,and%20783%20million%20by%202045.

[4] Moss, S. E., et al. "Risk factors for hospitalization in people with diabetes." JAMA Internal Medicine, https://doi.org/10.1001/archinte.159.17.2053. Accessed 27 Sept. 1999.

[5] Sharma, Abishek. "Prediction on Hospital Readmission." Kaggle, https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission.


## Appendix {-}

You may put additional stuff here as Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.

### Appendix A: Data quality check / cleaning / preparation

This appendix contains several figures related to our data quality check, cleaning, and preparation. See each figure below for more detail.

### Appendix B: Exploratory data analysis

This appendix contains several figures related to our exploratory data analysis. See each figure below for more detail.

### Appendix C: Approach

This appendix contains several figures related to our model approach. See each figure below for more detail.

#### Appendix C Figure 1: Distribution of the response variable `readmitted` before undersampling or SMOTENC in the training data.

![readmission_distribution_originaltrain.png](attachment:readmission_distribution_originaltrain.png)

#### Appendix C Figure 2: Distribution of the response variable `readmitted` after undersampling
![readmission_distribution_undersampling.png](attachment:readmission_distribution_undersampling.png)

#### Appendix C Figure 3: Distribution of the response variable `readmitted` after SMOTENC

![readmission_distribution_after_smotenc.png](attachment:readmission_distribution_after_smotenc.png)


#### Appendix C Figure 4: Confusion Matrix of Base Model on Training Data

![confusion_matrix_train-2.png](attachment:confusion_matrix_train-2.png)

#### Appendix C Figure 5: ROC Curve of Base Model 

![roc_curve_base_model.png](attachment:roc_curve_base_model.png)

#### Appendix C Figure 6: Confusion Matrix of Base Model on Test Data

![initial_confusion_matrix_test.png](attachment:initial_confusion_matrix_test.png)

#### Appendix C Figure 7: Sharma's Model Performance

Note, this figure was downloaded from Abishek Sharma's code repository on Kaggle. The citation is listed under [5] in the references section of this report.
![sharma_models.png](attachment:sharma_models.png)

### Appendix D: Developing the Model

This appendix contains several figures related to developing our optimized model. See each figure below for more detail.