## Background / Motivation

A hospital's ability to release happier and healthier patients is an established metric for its success as an institution. More recently, a hospital's success has also been measured by its readmission rate: the extent to which patients are retreated in a hospital within 30 days of their initial inpatient stay. Hospital readmissions are associated with worse patient outcomes and high financial strain on hospitals and their administration [1]. The cost is also high; readmissions often double the cost of a patient's care (both for the patients themselves and the hospital treating them). As McIlvennan et al. note, approximately "20% of all Medicare discharges have a readmission within 30 days" and, of those, an estimated 12% are potentially avoidable [1], [2]. These often avoidable readmissions (due to hasty and/or incomplete care) are not only costly to the hospital and patient, but also potentially life-threatening depending on a given patient's ailment.

As such, determining what factors affect a patient's likelihood of being readmitted to the hospital is of paramount importance for patients, their physicians, and hospital administration alike. As our group is interested in medicine and passionate about improved and equitable health outcomes, we asked ourselves: **What if we could identify the relationship between the factors associated with a patient's initial stay and their likelihood of being readmitted?** We imagine that, equipped with this information, hospital staff and patients could work together to adjust patient care plans and avoid readmissions.

We propose to address this problem using diabetes as a case study. Diabetes is an ailment that affects approximately 537 million individuals worldwide, regardless of age [3]. Though individuals with diabetes must carefully maintain their diet, exercise, and other living conditions in order to manage their illness, hospitalizations are not rare. In one study, "25 percent of patients with Type 1 diabetes and 30 percent with Type 2 diabetes had a hospital admission during one year" [4]. The abundance of information known about these hospital admissions provides a ripe opportunity to build and test a model to predict readmission, simultaneously filling a need for diabetes patients, healthcare workers, and hospital administrators alike.

## Problem Statement 

To approach our analysis, we first articulated a question to guide our modeling process: **What is the relationship between the different factors associated with a patient's hospital stay and their likelihood of being readmitted to that hospital within 30 days?**

In this question, we are most interested in *identifying the relationship* between predictors (here, factors associated with a patient's hospital stay) and a response (whether a patient is readmitted or not). Thus, our model is mainly concerned with **inference.**

## Data sources

In our project, we used a dataset entitled “Diabetes 130-US hospitals for years 1999-2008 dataset” from the UCI Machine Learning Repository. The dataset can be accessed [here](https://archive.ics.uci.edu/ml/datasets/Diabetes+130-US+hospitals+for+years+1999-2008#). 

All data was collected between 1999 and 2008 from over 130 US hospitals and related care centers. The dataset contains over 100,000 entries (101,766 before de-duplication), each corresponding to a hospital encounter for a patient with either Type 1 or Type 2 diabetes. 

Observations had to satisfy the following criteria to be included in this dataset: 

1. The patient must have had an inpatient encounter (i.e., they must have been admitted to the hospital)
2. The incident must have been a diabetic encounter (i.e., the patient had to have been a diabetic)
3. The length of the patient’s hospital stay had to have been between 1 and 14 days
4. Laboratory tests had to have been administered during the patient’s initial hospital stay
5. The patient must have received medications during their hospital stay

In the resulting dataset, there are 55 attributes with information about a patient's demographics (such as their age, gender, and race); the patient's medical history (such as their number of inpatient and outpatient visits, medications, etc.); their initial hospital stay (such as their number of medications, procedures, or diagnoses); and whether the patient was readmitted to the hospital within 30 days of their initial hospital visit. The dataset is largely complete, though there are 8 attributes with missing values that encompass over 50% of the observations. The 47 remaining attributes in the dataset have few to no missing values. Thus, the vast majority of the attributes in this dataset are usable as potential predictors for our model.

We chose to use this dataset because it contains our variable of interest (whether a patient was readmitted or not) along with ~46 potential predictors to use to fit the model. This dataset is comprehensive, including information on over 70,000 patients in the US across a decade. This breadth increases the generalizability of our study's findings. Its number of predictors also allows us to approach our problem from multiple perspectives: (1) the patient’s characteristics, (2) their hospital stay and medical history, and (3) the treatments administered to them. 

## Stakeholders

We have identified three stakeholder groups that would benefit from our analysis: (1) diabetes patients and their loved ones; (2) hospital staff involved in direct patient care; and (3) hospital administrators and those who have an interest in reducing financial waste in healthcare (e.g., Medicaid).


### Diabetic Patients and Their Loved Ones 
Our first stakeholder group is diabetes patients and their loved ones. For this group, the results of our project could provide insight into the factors affecting a diabetic patient's continued risk of illness and subsequent hospital readmission. A hospital stay can be emotionally and financially stressful for a diabetic patient and their family. Hospital readmissions then exacerbate this toll by subjecting patients to more inpatient stays. Further, as readmissions are associated with worse patient health outcomes, it is likely that diabetic patients who are readmitted will be readmitted again, worsening the burden on both their health and finances [3]. Understanding what factors contribute to hospital readmissions may empower patients and their families to take a more active and focused role in improving their health outcomes. 

For instance, if a patient were aware that they had certain risk factors associated with a greater likelihood of readmission, they might direct more time and resources towards alleviating these factors and advocating for their health. Patients and their families could be more vigilant about early warning signs and communicate proactively with their physician or healthcare provider about prevention. Overall, this knowledge can empower patients with diabetes to direct their efforts towards the areas that most affect their health outcomes. 

### Hospital Staff Involved in Direct Patient Care 
Our second stakeholder group includes hospital staff and physicians involved in direct patient care. These stakeholders are directly charged with overseeing the healthcare plans of diabetic patients. Thus, they are morally, professionally, and financially invested in giving their patients the best possible care to improve their health outcomes. With a better understanding of the factors that affect the likelihood of a diabetic patient’s readmission, healthcare workers may adjust their administration of care during the patient’s initial hospital visit. 

For instance, if a patient has an elevated risk of being readmitted, their doctor may adjust the patient's healthcare plan to focus on and alleviate that risk. The patient will thus likely have better health outcomes, and their risk of returning to the hospital within 30 days of being released (the criterion for a readmission) would be reduced.

This project may thus inform healthcare workers about which of their patients require the most vigilant and preventative interventions. Physicians, equipped with this knowledge, may then help to lower both the morbidity and mortality for diabetics in their care.

### Hospital Administrators and Medicaid
Our third and final stakeholder group consists of hospital administrators and those with an active interest in reducing the cost of healthcare, like Medicaid providers. Hospital readmissions are financially costly to the healthcare system. The Affordable Care Act incentivizes the reduction of hospital readmissions by financially penalizing hospitals with high readmission rates [4]. Penalties may include reduced payments to hospitals with excessive readmissions. In 2021, these penalties were projected to cost hospitals $521 million [4]. As such, hospital administrators may have a vested financial interest in identifying the factors that most contribute to an increased likelihood of returning to the hospital. By identifying such factors, administrators can allocate hospital resources towards addressing them, so that patient health outcomes are improved, readmissions decrease, and their hospital avoids the related financial penalties. 


## Data quality check / cleaning / preparation 

### Distribution of Categorical Variables

See **Appendix A, Tables 1-7** for distributions of the categorical variables used in our analysis: (1) changes in medications (`change`), (2) use of diabetic medications (`diabetesMed`), (3) primary diagnosis (`diag_1`), (4) secondary diagnosis (`diag_2`), (5) admission source, like the emergency room or a routine visit (`admission_source_id`), (6) discharge type (e.g., to one's home, to further care, etc.) (`discharge_disposition_id`), and (7) type of admission (`admission_type_id`). None of the categorical variables had missing values.  

After visualizing the distribution of these variables, we noted that the variables for primary and secondary diagnoses (`diag_1` and `diag_2`) had 717 and 749 levels, respectively. Thus, these variables would likely benefit from being binned into larger categories representative of the diagnosis type (e.g., circulatory, respiratory, etc.) to aid our later model's interpretability. 

### Distribution of Continuous Variables

See **Appendix A, Tables 8-14** for distributions of the continuous variables used in our analysis: (1) number of lab procedures (`num_lab_procedures`), (2) number of additional procedures ordered (`num_procedures`), (3) number of medications administered (`num_medications`), (4) number of outpatient visits (`number_outpatient`), (5) number of emergency visits (`number_emergency`), (6) number of inpatient visits (`number_inpatient`), (7) number of diagnoses (`number_diagnoses`), (8) patient age (`age`), and (9) time spent in hospital during the patient's initial stay (`time_in_hospital`).  

After visualizing the distribution of these variables, we realized that the patient age variable was prebinned (and thus may benefit from some sort of later data manipulation). 

### Data Cleaning 

The dataset we obtained from UCI was already in very good condition and required minimal cleaning. We took two steps to clean the data further. First, we dropped the predictors with more than 50% of observations missing, which are `weight` and `medical_specialty`. Second, we removed duplicate records that point to the same patient, as they do not offer additional insights. The dataset contained 101,766 records before and 71,518 records after dropping duplicates.
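
The two cleaning steps above can be sketched with pandas as follows. This is a minimal illustration on a made-up table, not our actual cleaning code: the values are invented, the patient identifier column `patient_nbr` and the choice to keep each patient's first record are assumptions, and missing values are assumed to have already been read in as `NaN`.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the raw encounter table (values are invented).
raw = pd.DataFrame({
    "encounter_id": [1, 2, 3, 4, 5],
    "patient_nbr": [10, 10, 11, 12, 12],   # assumed patient identifier
    "weight": [np.nan, np.nan, np.nan, "[75-100)", np.nan],
    "medical_specialty": [np.nan, "Cardiology", np.nan, np.nan, np.nan],
    "time_in_hospital": [3, 5, 2, 7, 1],
})

# Step 1: drop every column whose share of missing values exceeds 50%.
missing_share = raw.isna().mean()
sparse_cols = missing_share[missing_share > 0.5].index.tolist()
cleaned = raw.drop(columns=sparse_cols)

# Step 2: keep one record per patient (assumption: the first one listed).
deduped = cleaned.drop_duplicates(subset="patient_nbr", keep="first")

print(sparse_cols)
print(len(cleaned), len(deduped))
```

Computing the missing share per column first makes the 50% cutoff explicit and easy to change.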

### Data Wrangling and Preparation
We implemented a few steps to transform the existing predictors to be better suited for the logistic regression model. <br>
1. Recoded the `age` variable from 10-year intervals spanning 0 to 100 (e.g. [0-10), [10-20)) to ten numeric levels from 5 to 95 in increments of 10. <br>
2. Used domain knowledge about the diagnosis mapping to further bin the `diag_1`, `diag_2`, and `diag_3` predictors. There are more than 900 individual diagnoses; using the mapping from [6] (see the exact mapping in Appendix A, diagnosis mapping), we grouped them into 10 levels. <br>
3. Mapped the variables `admission_type_id`, `admission_source_id`, and `discharge_disposition_id`, reducing them from 8, 30, and 25 distinct levels to 4, 6, and 8 distinct levels, respectively. See the exact mapping in Appendix A, admission & discharge mapping. <br>
4. To make use of the 23 medication changes in the data, we created a new variable `num_of_changes` to aggregate the number of changes in medications per patient.<br>
5. Changed `readmitted`, the response variable, from three levels (No, >30, <30) to two levels, readmitted and not readmitted, where a readmission means the patient returned to the hospital within 30 days.<br>
6. Split the data into training and test sets. We randomly sampled 80% of the data for the training set and used the rest as the test set.
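
Steps 1, 5, and 6 above can be sketched as follows. This is an illustrative version on a toy table rather than our exact code; the label strings (`"NO"`, `">30"`, `"<30"`) and the random seed are assumptions.

```python
import numpy as np
import pandas as pd

# Toy rows; the label strings are assumed to match the raw file.
df = pd.DataFrame({
    "age": ["[0-10)", "[40-50)", "[70-80)", "[90-100)", "[50-60)"],
    "readmitted": ["NO", ">30", "<30", "NO", "<30"],
})

# Step 1: replace each 10-year bracket with its midpoint (5, 15, ..., 95).
lower_bound = df["age"].str.extract(r"\[(\d+)-", expand=False).astype(int)
df["age"] = lower_bound + 5

# Step 5: the positive class is a readmission within 30 days;
# both ">30" and "NO" count as not readmitted.
df["readmitted"] = (df["readmitted"] == "<30").astype(int)

# Step 6: random 80/20 split into training and test sets.
rng = np.random.default_rng(seed=0)
shuffled = df.iloc[rng.permutation(len(df))]
n_train = int(0.8 * len(df))
train, test = shuffled.iloc[:n_train], shuffled.iloc[n_train:]

print(df)
print(len(train), len(test))
```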

## Exploratory data analysis

We identified seven key insights from our exploratory data analysis below. See **Appendix B** for all relevant figures in our exploratory data analysis that helped our modeling process. 

- **Insight 1**: First, the continuous variables that are the most highly correlated with readmission are `number_inpatient`, `number_diagnoses`, `time_in_hospital`, `age`, `encounter_id`, `num_medications`, and `num_procedures` (see **Appendix B Figure 1**). As noted in the heatmap in **Appendix B Figure 1**, a person's number of inpatient visits to the hospital, number of medical diagnoses, time spent in the hospital, encounter type, number of medications and procedures, as well as that person's age may be correlated with whether they are readmitted to the hospital. 

- **Insight 2**: Second, the following variables showed differences in their distribution when the data was split into readmitted vs. not readmitted: `time_in_hospital`, `number_diagnoses`, `age`, `number_inpatient`, `number_emergency`, and `num_of_changes` (see **Appendix B Figures 2-7**). 

- **Insight 3**: Third, as `time_in_hospital` increases, `age` increases. `time_in_hospital` also increases as `num_of_changes` increases (see **Appendix B Figures 8-9**).

- **Insight 4**: Fourth, the variability in `num_of_changes` increases as `number_inpatient` increases (see **Appendix B Figure 10**). 

- **Insight 5**: Fifth, for younger patients, `number_inpatient` and `num_of_changes` have higher variability than for older patients (see **Appendix B Figures 11-12**).

- **Insight 6**: Sixth, the variables `number_inpatient` and `time_in_hospital` do not seem to be related (see **Appendix B Figure 13**). 

- **Insight 7**: Finally, the variable `discharge_disposition_id` does not seem to be related to other predictors (see **Appendix B Figure 1**). 
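
As a rough illustration of how the correlations behind a heatmap like the one in Insight 1 can be computed, the sketch below ranks numeric columns by their Pearson correlation with a 0/1 response. The toy values are invented and are not taken from our data.

```python
import pandas as pd

# Invented toy data: two count predictors and a binary response.
toy = pd.DataFrame({
    "number_inpatient": [0, 1, 0, 3, 2, 0, 4, 1],
    "time_in_hospital": [2, 5, 1, 9, 4, 3, 8, 2],
    "readmitted":       [0, 0, 0, 1, 1, 0, 1, 0],
})

# Pearson correlation of every predictor with the response,
# sorted from most to least positively correlated.
corr_with_response = (
    toy.corr()["readmitted"]
    .drop("readmitted")
    .sort_values(ascending=False)
)
print(corr_with_response)
```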


## Approach

### Modeling Approach
Our topic lends itself best to a **classification problem**. This is because our response variable (hospital readmission, or `readmitted`) is categorical and binary: either a patient  was readmitted or they were not readmitted. As such, we chose to implement a **logistic regression model** in our modeling and analyses, as this modeling method is appropriate when the response variable (here `readmitted`) is categorical. 

### Optimizing Performance Metrics

In our modeling and analyses, we are most interested in the following performance metrics: our model's (1) False Negative Rate (FNR), (2) precision, and (3) recall. 

#### False Negative Rate

We needed to build a model that has a low False Negative Rate, or FNR. In the case of readmissions, false negatives are more concerning than false positives. This is because if a person is likely to be readmitted, and is told that they are *not* going to be readmitted, their condition may deteriorate and they may suffer further financial and emotional costs associated with unexpected hospital stays. If a person is not going to be readmitted, and is told that they *will* be readmitted, they may take unnecessary precautions or tests to avoid further hospitalizations, but this will not be as harmful to the person as the previous case. Although our focus is inference rather than prediction, it is still vital that our model correctly identifies when a person is going to be readmitted. Thus, we are more focused on reducing the number of false negatives (and the False Negative Rate of the model) than on reducing the number of false positives. 

#### Precision

Precision measures the accuracy of positive predictions: of all the instances our model classifies as positive, the share that truly are positive. Here, positive predictions correspond to a scenario in which a patient is going to be readmitted to the hospital. This outcome directly relates to a patient's health outcomes and, similar to our reasoning above, it is vital that our model catch and classify these values appropriately. We need our model's positive classifications to be reliable (i.e., accurate as to when patients may return to the hospital), so that patients can take preventative measures and avoid undue financial, health-related, and emotional strains associated with further hospital stays.

#### Recall 

Similarly, we sought to maximize our model's recall, or the ratio of positive instances that are correctly detected by the classifier. Again, a "positive instance" in our model is one in which a patient is readmitted to the hospital. Often, there is a tradeoff between precision and recall: increasing precision reduces recall and vice versa. However, it is still important that our model's positive predictions are accurate (precision) and that a high share of positive instances are correctly detected (recall). Thus, we sought a model threshold that balanced precision and recall so that both metrics were as high as possible.
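
To make the three metrics concrete, the sketch below computes them from confusion-matrix counts. The function name and the counts are hypothetical and are used only for illustration; they are not our results.

```python
def classification_metrics(tp: int, fp: int, fn: int, tn: int) -> dict:
    """Return accuracy, precision, recall, and FNR from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + fn + tn),
        "precision": tp / (tp + fp),   # share of predicted positives that are real
        "recall": tp / (tp + fn),      # share of real positives that are caught
        "fnr": fn / (tp + fn),         # share of real positives that are missed
    }

# Hypothetical counts for illustration only.
m = classification_metrics(tp=30, fp=20, fn=10, tn=140)
print(m)
```

Because recall and FNR share the same denominator, they always sum to 1, so lowering the FNR and raising recall are the same goal.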

### Anticipated Problems and Initial Modeling


#### Anticipated Problems: Correcting for an Uneven Response Distribution
When visualizing the distribution of our response variable, we noticed a potential problem that could arise in later modeling: there were far more negative instances (when a patient *was not* readmitted) than there were positive instances (when a patient *was* readmitted) (see **Appendix C Figure 1**). 

To correct for this uneven response distribution, our team first tried undersampling the data to reduce the number of negative values (or non-readmissions) to the number of positive values, or readmissions (see **Appendix C Figure 2**).

However, we later realized that by undersampling and removing values from the majority class, we were losing data that could potentially help us in our inference model. As such, we implemented SMOTENC in our model optimization process. SMOTENC, or Synthetic Minority Over-sampling Technique for Nominal and Continuous features, generates synthetic data to oversample a minority target class in an imbalanced dataset. After implementing SMOTENC on our training dataset, we obtained a balanced 50-50 response ratio between negative responses (non-readmissions) and positive responses (readmissions) (see **Appendix C Figure 3**). 
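
The undersampling idea described above can be sketched with NumPy as follows. The helper name `undersample_majority` is hypothetical and the toy labels are invented. SMOTENC itself is not shown here; it takes the opposite route, synthesizing new minority rows instead of discarding majority rows.

```python
import numpy as np

def undersample_majority(y: np.ndarray, seed: int = 0) -> np.ndarray:
    """Return row indices keeping all minority rows and an equal-sized
    random subset of majority rows (binary 0/1 labels assumed)."""
    rng = np.random.default_rng(seed)
    pos_idx = np.flatnonzero(y == 1)
    neg_idx = np.flatnonzero(y == 0)
    # Order the two index arrays so the smaller class comes first.
    minority, majority = sorted((pos_idx, neg_idx), key=len)
    kept_majority = rng.choice(majority, size=len(minority), replace=False)
    return np.sort(np.concatenate([minority, kept_majority]))

# Toy imbalanced response: 90 non-readmissions, 10 readmissions.
y = np.array([0] * 90 + [1] * 10)
idx = undersample_majority(y)
balanced = y[idx]
print(len(balanced), balanced.mean())
```

The sketch makes the cost of undersampling visible: 80 of the 90 majority rows are thrown away to reach a 50-50 ratio.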

#### Initial Modeling with Undersampling

In our initial model, we implemented undersampling and included the following variables as defined in our exploratory data analysis: 

1. `num_of_changes` (or number of medication changes); 

2. `number_inpatient` (or number of inpatient stays); 

3. An interaction between `time_in_hospital` (or the amount of days a patient stayed in the hospital originally) and `age` (the patient's age). 

The $p$-values for all of the above attributes were under 0.05 (indicating that they were significant at the 5% significance level). Our base model's formula, expressed on the log-odds scale used by logistic regression, is as follows:

> log-odds(`readmitted`) = -1.3101 + (0.1104 * `time_in_hospital`) + (0.0144 * `age`) + (-0.0009 * `time_in_hospital` * `age`) + (0.1499 * `num_of_changes`) + (0.3884 * `number_inpatient`)

However, when inspecting the confusion matrix of this model on training data, we found that it was not performing as we wished (see **Appendix C Figure 4**). Our model's classification accuracy was only **57.1%**, while its precision and recall were **57.8%** and **52.9%** respectively. This indicated that our base model was only making correct classifications 57.1% of the time, and that both the accuracy of its positive predictions and the ratio of positive instances correctly detected by the classifier were rather low. Our base model's False Negative Rate, or FNR, was similarly disappointing at **47.1%**. This indicated that the probability that a true positive would be missed by the model was 47.1%. 
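
To show how the base-model formula above turns into a classification, the sketch below treats the fitted linear expression as the log-odds of readmission and passes it through the logistic function. The coefficients are copied from the formula; the function name and the example patient are hypothetical.

```python
import math

# Coefficients copied from the base-model formula above.
COEF = {
    "intercept": -1.3101,
    "time_in_hospital": 0.1104,
    "age": 0.0144,
    "time_in_hospital:age": -0.0009,
    "num_of_changes": 0.1499,
    "number_inpatient": 0.3884,
}

def readmission_probability(time_in_hospital, age, num_of_changes, number_inpatient):
    """Logistic transform of the fitted linear predictor."""
    log_odds = (
        COEF["intercept"]
        + COEF["time_in_hospital"] * time_in_hospital
        + COEF["age"] * age
        + COEF["time_in_hospital:age"] * time_in_hospital * age
        + COEF["num_of_changes"] * num_of_changes
        + COEF["number_inpatient"] * number_inpatient
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical patient: 3-day stay, age level 65, one medication change.
p_no_prior = readmission_probability(3, 65, 1, number_inpatient=0)
p_two_prior = readmission_probability(3, 65, 1, number_inpatient=2)
print(round(p_no_prior, 3), round(p_two_prior, 3))
```

A patient is classified as readmitted when this probability exceeds the chosen decision threshold. Since the `number_inpatient` coefficient is positive, more prior inpatient visits always raise the predicted probability.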

We plotted the ROC (Receiver Operating Characteristic) curve of our base model using training data (see **Appendix C Figure 5**). This plot visualizes the sensitivity (True Positive Rate) of the model on the y-axis against 1 − specificity (False Positive Rate) on the x-axis for varying values of the threshold. The 45° diagonal line connecting (0,0) to (1,1) is the ROC curve that corresponds to random chance. The area under the ROC curve is called the Area Under the Curve (AUC). AUC can be read as the probability that the model scores a randomly chosen positive instance higher than a randomly chosen negative one. Note in **Appendix C Figure 5** that the base model's AUC is quite low, indicating that the classifier is better than random chance, but still not optimal in distinguishing between positive and negative classes.
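
AUC has a simple ranking interpretation that the following sketch computes directly: the fraction of (positive, negative) pairs in which the positive instance receives the higher score. The function name and the toy scores are illustrative.

```python
def roc_auc(y_true, scores):
    """AUC as the share of (positive, negative) pairs ranked correctly;
    tied scores count as half a correct ranking."""
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))

# Toy example: 3 of the 4 positive/negative pairs are ordered correctly.
y_true = [0, 0, 1, 1]
scores = [0.1, 0.4, 0.35, 0.8]
auc = roc_auc(y_true, scores)
print(auc)  # 0.75
```

Under this reading, an AUC of 0.5 means the scores order positive and negative instances no better than chance.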

On test data, our model's classification accuracy increased to **60.1%**, though its precision and recall decreased to **11.9%** and **51.9%** respectively (see **Appendix C Figure 6**). The model's FNR also slightly increased to **48.1%**. This worsened performance of the model on test data inspired our team to move to a SMOTENC approach in an effort to optimize the model and its performance.

#### Re-Training the Base Model with SMOTENC

We then attempted to train the base model using SMOTENC data (or training data with artificially added positive values) to see if we could improve the model's performance. The re-trained base model's formula, again on the log-odds scale, is as follows:

> log-odds(`readmitted`) = -2.2629 + (0.2012 * `time_in_hospital`) + (0.0339 * `age`) + (-0.0028 * `time_in_hospital` * `age`) + (-0.5269 * `num_of_changes`) + (0.1540 * `number_inpatient`)


All $p$-values for predictors in this model remained under 0.05. We then optimized the decision threshold probability for this re-trained base model using the ROC curve, and tested the model's performance on training and test data.
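
One common way to pick a decision threshold from the ROC curve is to maximize Youden's J statistic (True Positive Rate minus False Positive Rate). The sketch below assumes that criterion purely for illustration; the function name and toy scores are hypothetical, and other criteria are possible.

```python
def best_threshold(y_true, scores):
    """Return (threshold, J) maximizing TPR - FPR over the observed scores.
    An instance is classified positive when its score >= threshold."""
    n_pos = sum(y_true)
    n_neg = len(y_true) - n_pos
    best_t, best_j = None, -1.0
    for t in sorted(set(scores)):
        tp = sum(1 for y, s in zip(y_true, scores) if y == 1 and s >= t)
        fp = sum(1 for y, s in zip(y_true, scores) if y == 0 and s >= t)
        j = tp / n_pos - fp / n_neg
        if j > best_j:
            best_t, best_j = t, j
    return best_t, best_j

# Toy predicted probabilities: four negatives and three positives.
y_true = [0, 0, 0, 0, 1, 1, 1]
scores = [0.1, 0.2, 0.3, 0.55, 0.5, 0.6, 0.9]
t, j = best_threshold(y_true, scores)
print(t, j)  # 0.5 0.75
```

In this toy case, a threshold of 0.5 catches all three positives at the cost of one false positive, which gives the best trade-off among the observed scores.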

On SMOTENC training data, the re-trained base model's classification accuracy increased to **59.1%** (from 57.1% originally). The recall score increased far more markedly between the two base models (from 52.9% to **70.2%** in the re-trained model), and the False Negative Rate (FNR) decreased from 47.1% to **29.8%** (see **Appendix C Figure 7**). 

On test data, the re-trained model's classification accuracy decreased (from 60.1% to 48.3%). However, its test recall increased (from 51.9% to **55.1%**) and its FNR decreased (from 48.1% to **44.9%** in the re-trained model) (see **Appendix C Figure 8**). 

#### Existing Solutions

Our team was perplexed by our base model's subpar performance. Thus, we looked to Kaggle to see if there were any existing solutions to the problem that could inform our analysis. We found a code repository from Abishek Sharma called "Prediction on Hospital Readmission" that used our dataset to predict whether a patient was going to be readmitted to the hospital or not [5]. Sharma implemented three different models for prediction: (1) a logistic regression model, (2) a decision tree model, and (3) a Random Forest model. 

To create his logistic regression model, Sharma used the LogisticRegression, train_test_split, and cross_val_score functions of the *sklearn* library. It was unclear which predictors he used in his analysis, but his logistic regression model attained an accuracy of 61%, precision of 63%, and recall of 56% [5]. However, both of his later models (decision tree and Random Forest) performed far better in terms of classification accuracy, precision, and recall. His decision tree model had an accuracy of 92%, precision of 93%, and recall of 90%, whereas his Random Forest model achieved an accuracy of 94%, precision of 99%, and recall of 90% (see **Appendix C Figure 9** for a visualized comparison of the performance of each of Sharma's models). 

Ultimately, Sharma's work informed us about the possible limitations of our approach. His decision tree and Random Forest models performed far better than our base model. Yet, because we did not have the tools or knowledge to implement those methods, we were limited to logistic regression. Further, Sharma implemented logistic regression in *sklearn* while our team was working in *statsmodels.formula.api*. Both Sharma and our team used SMOTENC to correct for the uneven response distribution, but Sharma's approach to logistic regression was difficult to interpret, as the libraries and packages he was working in were different from our own.

Overall, our base model performed on par with Sharma's on training data in terms of classification accuracy. Our precision was slightly lower than his (57.8% vs 63%), as was our recall (52.9% vs 56%) [5]. 

## Developing the model

### Improving the base model 
We aimed to improve the fit of our base model by training subsequent models on the SMOTENC-generated training data, focusing on binning significant categorical variables, manually selecting for variables by intuition, and continuously ensuring that no predictors were collinear with one another.

As described in the Anticipated Problems section, we used SMOTENC to artificially generate positive data points corresponding to readmissions, such that we had equal numbers of positive and negative data points. This allowed us to increase the size of our training data by nearly ten times and decrease the bias in our model (**Appendix C, Figures 2 and 3**). 

Our initial EDA revealed that each value of `age` and `time_in_hospital` had a distinct distribution of the `readmitted` response, suggesting that we could improve the model's fit on training data by making each value its own predictor (**Appendix B, Figures 2 and 4**). Our first model used each value of `age` as a predictor: this model performed better than the base model on undersampled data in terms of classification accuracy (58.2%, compared to 57.2%), recall (79.8% vs. 48.8%), and false positive rate (63.4% vs. 60.2%) (**Appendix D, Figure 1**). However, where the base model's performance improved on test data, this model fared worse in terms of classification accuracy (40.0%) and overall ROC-AUC (53.9%). We suspected that using only `age` values as predictors led to overfitting, and explored additional models containing other binned variables (`time_in_hospital`), the interactions in our base model, and promising variables similar to those referenced by similar studies (`diag_1`, `diag_2`, `admission_type_id`, `discharge_disposition_id`, `admission_source_id`) [7]. We continuously ensured that the predictors included in the model were not multicollinear by calculating the variance inflation factor (VIF) after initially fitting each model. For each of these new categorical predictors, we identified the significant values of each predictor (e.g. `neoplasms` in `diag_1`) and removed statistically insignificant values. Further information on the exact modeling can be found in our code.

We collected the insights gained in the piecewise exploration of the variables to fit a single model. Similar to our first model, this combined model performed better than the base model on training data, showing improvements in classification accuracy (65.4% vs. 57.2%), precision (64.6% vs. 58.6%), recall (68.4% vs. 48.8%), and ROC-AUC (70.7% vs. 60.2%), with a false positive rate of 37.5% vs. 34.5% and a lower false negative rate (**Appendix D, Figure 2**). However, on test data, performance was worse than the base model in all these metrics (**Appendix D, Figure 2**). To further optimize the model, we attempted to manually add interactions, but it was too computationally intensive to fit a model containing all combinations of interactions between the 41 predictors without cross validation.
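
The VIF check mentioned above can be illustrated with a small NumPy sketch. For predictor j, VIF is 1 / (1 − R²), where R² comes from regressing that predictor on all the others. The `vif` helper and the synthetic data here are hypothetical; statsmodels also provides `variance_inflation_factor` in `statsmodels.stats.outliers_influence` for the same purpose.

```python
import numpy as np

def vif(X: np.ndarray) -> np.ndarray:
    """VIF_j = 1 / (1 - R^2_j), with R^2_j from regressing column j
    on the remaining columns plus an intercept."""
    n, k = X.shape
    out = np.empty(k)
    for j in range(k):
        target = X[:, j]
        others = np.column_stack([np.ones(n), np.delete(X, j, axis=1)])
        beta, *_ = np.linalg.lstsq(others, target, rcond=None)
        resid = target - others @ beta
        r2 = 1.0 - resid.var() / target.var()
        out[j] = 1.0 / (1.0 - r2)
    return out

# Synthetic predictors: c is nearly a copy of a, b is independent.
rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + 0.1 * rng.normal(size=500)
vifs = vif(np.column_stack([a, b, c]))
print(vifs.round(1))
```

Large values for `a` and `c` flag the near-duplicate pair, while the independent column `b` stays close to 1.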


Overall, this phase of model optimization did succeed in improving the fit of our model to the training data. However, we suspect that this model overfits the training data, given the subpar performance on test data. We hypothesize that the poor test data performance is a result of the imbalanced `readmitted` response inherent to the test data. In addition, we used the validation set method to evaluate our model. This method is known for high variability in test error, since the validation set data may not be representative of the population data. The lessons learned lead to actionable directions in our next phase: more rigorous variable selection using the Lasso and k-fold cross validation and adding in interactions.

### Modeling with *Sklearn*
We attempted to use Sklearn to implement further model optimization, specifically for feature selection. <br>

We used the L1 penalty option of the `LogisticRegression` class from Sklearn when fitting, to reduce overfitting, as it penalizes the sum of the absolute values of the weights. We fit the model using all of the dummy variables for the categorical variables found significant in the previous models (`age`, `time_in_hospital`, `admission_type_id`, `discharge_disposition_id`, `admission_source_id`, `diag_1`, `diag_2`, `diabetesMed`, `change`) and the significant continuous variables, and retained the following predictors, whose coefficients were not reduced to zero.
> 'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient',
       'number_diagnoses', 'num_of_changes', 'age5', 'age15', 'age25',
       'age35', 'age45', 'age55', 'age65', 'age75', 'age85', 'age95',
       'time_in_hospital1', 'time_in_hospital2', 'time_in_hospital3',
       'time_in_hospital4', 'time_in_hospital5', 'time_in_hospital6',
       'time_in_hospital7', 'time_in_hospital8', 'time_in_hospital9',
       'time_in_hospital10', 'time_in_hospital11', 'time_in_hospital12',
       'time_in_hospital13', 'time_in_hospital14', 'admission_type_id1',
       'admission_type_id3', 'admission_type_id4', 'admission_type_id5',
       'discharge_disposition_id1', 'discharge_disposition_id2',
       'discharge_disposition_id7', 'discharge_disposition_id10',
       'discharge_disposition_id11', 'discharge_disposition_id18',
       'admission_source_id1', 'admission_source_id4',
       'admission_source_id7', 'admission_source_id8',
       'admission_source_id9', 'admission_source_id11',
       'diag_1circulatory', 'diag_1diabetes', 'diag_1digestive',
       'diag_1genitourinary', 'diag_1injury', 'diag_1musculoskeletal',
       'diag_1neoplasms', 'diag_1other', 'diag_1pregnecy',
       'diag_1respiratory', 'diag_2circulatory', 'diag_2diabetes',
       'diag_2digestive', 'diag_2genitourinary', 'diag_2injury',
       'diag_2musculoskeletal', 'diag_2other', 'diag_2pregnecy',
       'diag_2respiratory', 'diabetesMedNo', 'diabetesMedYes', 'changeCh',
       'changeNo' 
       
See the confusion matrix, and other model metrics on train and test data in **Appendix C Figure 3**.
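
The selection effect of the L1 penalty can be sketched on synthetic data as follows. This is not our fitted model: the data, the `liblinear` solver, and the value of `C` are assumptions chosen only to make the behavior visible.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic data: 10 predictors, but only the first two drive the response.
rng = np.random.default_rng(0)
n = 400
X = rng.normal(size=(n, 10))
log_odds = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (rng.random(n) < 1.0 / (1.0 + np.exp(-log_odds))).astype(int)

# Smaller C means a stronger L1 penalty; 0.05 is an arbitrary illustrative value.
model = LogisticRegression(penalty="l1", solver="liblinear", C=0.05)
model.fit(X, y)

coef = model.coef_.ravel()
selected = np.flatnonzero(coef != 0)   # predictors the penalty kept
print(selected)
print(coef.round(3))
```

Predictors whose coefficients are shrunk exactly to zero drop out of the model, which is the sense in which the Lasso penalty performs feature selection.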

The next step was to test out variable interactions. From the descriptions of variables and common sense, we thought the following four variable interactions could be significant.

- `change` * `diabetesMed`
- `age` * `time_in_hospital`
- `age` * `number_inpatient`
- `change` * `admission_type_id`

We used the `dmatrices` function from patsy to create the training features with interactions for scikit-learn, adding all of the interactions to the significant variables found in the previous model. The initial formula for the input predictors is:

> num_lab_procedures + num_procedures + num_medications + number_outpatient + number_emergency + number_inpatient + number_diagnoses + num_of_changes + age5 + age15 + age25 + age35 + age45 + age55 + age65 + age75 + age85 + age95 + time_in_hospital1 + time_in_hospital2 + time_in_hospital3 + time_in_hospital4 + time_in_hospital5 + time_in_hospital6 + time_in_hospital7 + time_in_hospital8 + time_in_hospital9 + time_in_hospital10 + time_in_hospital11  + time_in_hospital12 + time_in_hospital13 + time_in_hospital14 + admission_type_id1 + admission_type_id3 + admission_type_id4 + admission_type_id5 + discharge_disposition_id1 + discharge_disposition_id2 + discharge_disposition_id7  + discharge_disposition_id10 + discharge_disposition_id11 + discharge_disposition_id18 + admission_source_id1 + admission_source_id4 + admission_source_id7 + admission_source_id8 + admission_source_id9 + admission_source_id11 + diag_1circulatory + diag_1diabetes + diag_1digestive + diag_1genitourinary + diag_1injury + diag_1musculoskeletal + diag_1neoplasms + diag_1other + diag_1pregnecy + diag_1respiratory + diag_2circulatory + diag_2diabetes + diag_2digestive + diag_2genitourinary + diag_2injury + diag_2musculoskeletal + diag_2other + diag_2pregnecy + diag_2respiratory + diabetesMedNo + diabetesMedYes + changeCh + changeNo + change*diabetesMed +(age5+age15+age25+age35+age45+age55+age65+age75+age85+age95) *(time_in_hospital1+time_in_hospital2+time_in_hospital3+time_in_hospital4 + time_in_hospital5 + time_in_hospital6 + time_in_hospital7 + time_in_hospital8 + time_in_hospital9 + time_in_hospital10 + time_in_hospital11 + time_in_hospital12 + time_in_hospital13 + time_in_hospital14)+(age5 + age15 + age25 + age35 + age45 + age55 + age65 + age75 + age85 + age95)*number_inpatient+change*(admission_type_id1 + admission_type_id3 + admission_type_id4 + admission_type_id5)

After fitting, the following variables kept nonzero coefficients under the penalty and were therefore treated as significant.

> 'change[T.No]', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'num_of_changes', 'age5', 'age15', 'age25', 'age35', 'age45',
       'age55', 'age65', 'age75', 'age85', 'age95', 'time_in_hospital1',
       'time_in_hospital2', 'time_in_hospital3', 'time_in_hospital4',
       'time_in_hospital6', 'time_in_hospital7', 'time_in_hospital8',
       'time_in_hospital9', 'time_in_hospital10', 'time_in_hospital11',
       'time_in_hospital13', 'time_in_hospital14', 'admission_type_id1',
       'change[T.No]:admission_type_id1', 'admission_type_id3',
       'change[T.No]:admission_type_id3', 'admission_type_id5',
       'discharge_disposition_id1', 'discharge_disposition_id2',
       'discharge_disposition_id7', 'discharge_disposition_id10',
       'discharge_disposition_id11', 'discharge_disposition_id18',
       'admission_source_id1', 'admission_source_id4',
       'admission_source_id7', 'admission_source_id8',
       'admission_source_id9', 'admission_source_id11',
       'diag_1circulatory', 'diag_1diabetes', 'diag_1digestive',
       'diag_1genitourinary', 'diag_1injury', 'diag_1musculoskeletal',
       'diag_1neoplasms', 'diag_1other', 'diag_1pregnecy',
       'diag_1respiratory', 'diag_2circulatory', 'diag_2diabetes',
       'diag_2digestive', 'diag_2genitourinary', 'diag_2injury',
       'diag_2musculoskeletal', 'diag_2other', 'diag_2pregnecy',
       'diag_2respiratory', 'diabetesMedNo', 'changeCh',
       'age15:time_in_hospital1', 'age15:time_in_hospital2',
       'age15:time_in_hospital3', 'age15:time_in_hospital5',
       'age15:time_in_hospital7', 'age15:time_in_hospital9',
       'age25:time_in_hospital4', 'age25:time_in_hospital5',
       'age25:time_in_hospital6', 'age25:time_in_hospital8',
       'age25:time_in_hospital10', 'age25:time_in_hospital11',
       'age35:time_in_hospital2', 'age35:time_in_hospital3',
       'age35:time_in_hospital4', 'age35:time_in_hospital5',
       'age35:time_in_hospital6', 'age35:time_in_hospital7',
       'age35:time_in_hospital8', 'age35:time_in_hospital9',
       'age35:time_in_hospital10', 'age35:time_in_hospital11',
       'age35:time_in_hospital13', 'age35:time_in_hospital14',
       'age45:time_in_hospital1', 'age45:time_in_hospital2',
       'age45:time_in_hospital3', 'age45:time_in_hospital5',
       'age45:time_in_hospital6', 'age45:time_in_hospital7',
       'age45:time_in_hospital8', 'age45:time_in_hospital9',
       'age45:time_in_hospital10', 'age45:time_in_hospital11',
       'age45:time_in_hospital12', 'age45:time_in_hospital13',
       'age45:time_in_hospital14', 'age55:time_in_hospital1',
       'age55:time_in_hospital2', 'age55:time_in_hospital3',
       'age55:time_in_hospital4', 'age55:time_in_hospital5',
       'age55:time_in_hospital6', 'age55:time_in_hospital7',
       'age55:time_in_hospital9', 'age55:time_in_hospital10',
       'age55:time_in_hospital11', 'age55:time_in_hospital12',
       'age55:time_in_hospital14', 'age65:time_in_hospital2',
       'age65:time_in_hospital4', 'age65:time_in_hospital5',
       'age65:time_in_hospital6', 'age65:time_in_hospital7',
       'age65:time_in_hospital8', 'age65:time_in_hospital9',
       'age65:time_in_hospital10', 'age65:time_in_hospital11',
       'age65:time_in_hospital12', 'age65:time_in_hospital13',
       'age75:time_in_hospital1', 'age75:time_in_hospital2',
       'age75:time_in_hospital3', 'age75:time_in_hospital5',
       'age75:time_in_hospital6', 'age75:time_in_hospital8',
       'age75:time_in_hospital10', 'age75:time_in_hospital14',
       'age85:time_in_hospital1', 'age85:time_in_hospital2',
       'age85:time_in_hospital3', 'age85:time_in_hospital4',
       'age85:time_in_hospital5', 'age85:time_in_hospital7',
       'age85:time_in_hospital8', 'age85:time_in_hospital11',
       'age85:time_in_hospital12', 'age85:time_in_hospital13',
       'age85:time_in_hospital14', 'age95:time_in_hospital1',
       'age95:time_in_hospital2', 'age95:time_in_hospital4',
       'age95:time_in_hospital5', 'age95:time_in_hospital6',
       'age95:time_in_hospital7', 'age95:time_in_hospital9',
       'age95:time_in_hospital10', 'age95:time_in_hospital11',
       'age95:time_in_hospital12', 'age95:time_in_hospital13',
       'age95:time_in_hospital14', 'age15:number_inpatient',
       'age25:number_inpatient', 'age35:number_inpatient',
       'age45:number_inpatient', 'age55:number_inpatient',
       'age75:number_inpatient', 'age85:number_inpatient',
       'age95:number_inpatient'
 
See details about the confusion matrix of this model and other model metrics on the test and train set in **Appendix D Figure 4**.
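
As a rough sketch of how a patsy formula expands an interaction into design-matrix columns, consider the toy example below. The data values are hypothetical; only the column names follow the report. It shows where names such as `change[T.No]:admission_type_id1` in the list above come from.

```python
import pandas as pd
from patsy import dmatrices

# Toy data with hypothetical values; column names follow the report.
df = pd.DataFrame({
    "readmitted": [0, 1, 0, 1, 1, 0],
    "change": ["Ch", "No", "No", "Ch", "No", "Ch"],
    "admission_type_id1": [1, 0, 1, 1, 0, 0],
    "number_inpatient": [0, 2, 1, 0, 3, 1],
})

# `a * b` expands to a + b + a:b (both main effects plus their interaction).
y, X = dmatrices(
    "readmitted ~ change * admission_type_id1 + number_inpatient",
    data=df,
    return_type="dataframe",
)
print(list(X.columns))
```

patsy treatment-codes the string column `change`, using the first level (`Ch`) as the reference, which is why the columns are labeled with `T.No`. It also adds an `Intercept` column, which may need to be dropped or accounted for before passing `X` to an estimator that fits its own intercept.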

Adding these interaction variables improved recall by only 0.2% on the training and test sets, and ROC-AUC by only 0.1% on the test set. The improvement in model performance is minimal. Thus, since this is an inference problem and we would like to preserve the simplicity of the model, we decided not to use these interactions in the final model.
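
For readers who want to reproduce this kind of comparison, the sketch below computes recall and ROC-AUC with scikit-learn on a tiny hand-made example. The labels, scores, and the 0.5 threshold are illustrative assumptions, not values from our models.

```python
from sklearn.metrics import recall_score, roc_auc_score

# Hypothetical true labels and predicted probabilities for four patients.
y_true = [0, 0, 1, 1]
y_score = [0.1, 0.4, 0.35, 0.8]

# Recall needs hard class predictions, so apply a decision threshold (assumed 0.5).
threshold = 0.5
y_pred = [int(score >= threshold) for score in y_score]

recall = recall_score(y_true, y_pred)   # share of actual readmissions that were flagged
auc = roc_auc_score(y_true, y_score)    # threshold-free ranking quality
print(f"recall={recall:.2f}, ROC-AUC={auc:.2f}")
```

Running the same two calls on the predictions of the model with and without interactions gives the differences reported above.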

### Final Model Equation

> log-odds of `readmitted` = -2.2629 + (0.2012 * `time_in_hospital`) + (0.0339 * `age`) + (-0.0028 * `time_in_hospital` * `age`) + (-0.5269 * `num_of_changes`) + (0.1540 * `number_inpatient`)
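
Because the model is a logistic regression, the linear expression above is read here as the log-odds of readmission, and a probability follows from the logistic (sigmoid) function. The sketch below evaluates the equation for one hypothetical patient. The function name and the input values are illustrative assumptions, and `age` is assumed to be in the same units used when the model was fit.

```python
import math

def readmission_probability(time_in_hospital, age, num_of_changes, number_inpatient):
    """Evaluate the final model equation and convert log-odds to a probability."""
    log_odds = (
        -2.2629
        + 0.2012 * time_in_hospital
        + 0.0339 * age
        - 0.0028 * time_in_hospital * age
        - 0.5269 * num_of_changes
        + 0.1540 * number_inpatient
    )
    return 1.0 / (1.0 + math.exp(-log_odds))

# Hypothetical patient: 4 days in hospital, age value 7, 1 medication change, no prior inpatient visits.
p = readmission_probability(time_in_hospital=4, age=7, num_of_changes=1, number_inpatient=0)
print(f"Predicted probability of readmission: {p:.3f}")
```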

## Limitations of the model with regard to Inference

Our model is most limited by the age of its data. The data was collected between 1999 and 2008, about 15 years ago. However, we believe that this data remains relevant to understanding which factors influence hospital readmissions. For one, there have not been major advances in care for diabetic patients that would greatly affect readmission. The biggest change that would impact our models is that in 2012, the Affordable Care Act established the Hospital Readmissions Reduction Program. This program financially penalizes hospitals whose readmission rates are higher than expected for acute myocardial infarction, heart failure, and pneumonia [8]. Prior to 2012, there was little financial incentive for hospitals to reduce readmissions.

Actions that hospitals have taken to reduce readmissions since 2012 may change the composition of which patients would be readmitted in our data, if it were collected from 2012-2022 for example. The effects of this on our analysis, however, are minimal as we are interested in understanding factors that affect readmission for diabetes patients. For one, diabetes is not one of the diseases that factor into financial penalties for elevated readmission rates. Two, while some diseases have seen decreases in readmission rates following the change in policy, the literature is unclear on the status of readmission rates for diabetes. The scarcity of studies examining how hospital readmission rates changed specifically for diabetic patients suggests that identifying factors influencing readmission rate for these patients remains a pertinent issue. In the future, if there are major changes to diabetic care, or if policies are implemented that aim to reduce hospital readmission rates for diabetic patients, then our model may become obsolete.

Our model is broadly applicable to hospitals within the US because the data was sourced from 130 hospitals across the United States, and a wide variety of diabetic patients were included (10,000+). Hence, we do not anticipate limitations to inference with regard to population.

## Future Work

Our model's performance may be limited by our approach. In Abishek Sharma's code repository from Kaggle, he was able to build models with over 90% classification accuracy, precision, and recall using Random Forest and decision tree modeling methods [5]. His logistic regression model from *sklearn* ultimately performed similarly to ours, which we created using *statsmodels.formula.api*. In future studies, it may be interesting to implement further Random Forest and decision tree modeling so that the model's performance (and thus its power with regard to inference) could be improved.

Further, future work in the readmission space may benefit from more recent data. Our dataset was amalgamated between 1999 and 2008, though new policies have been introduced since 2008 that have imposed further financial penalties on hospitals that have high readmission rates. These penalties incentivize hospitals to identify and address risk factors to reduce readmissions. As such, one potential avenue of future inquiry within this research community could include studying readmissions data in the last decade to see whether such policies have had an impact on readmissions (and whether the risk factors for hospital readmissions have changed).

## Conclusions and Recommendations to Stakeholder(s)

### Model Conclusions

In the final model equation, all predictors used are statistically significant (as they all have $p$-values under the threshold of 0.05 for 95% confidence). We interpret the model's coefficients to come to five total conclusions: 

1. The odds of readmission increase by 22.29% with each additional day spent in the hospital during the patient's initial hospital stay.

2. The odds of readmission increase by 3.44% with each additional decade of patient age.

3. For a constant age $X$, the log-odds of readmission change by $0.2012 - 0.0028X$ with each additional day spent in the hospital. In other words, the older the patient, the smaller the increase in readmission odds brought on by each additional day in the hospital. We note, however, that conclusion (1) still holds. The reducing effect of age is small relative to the increase in readmission odds brought on by the additional day in the hospital, so the likelihood of readmission still increases overall.

4. The odds of readmission decrease by 40.96% with each additional medication change.

5. The odds of readmission increase by 16.65% with each additional inpatient hospital visit.
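
The percentages in these conclusions come from exponentiating the model coefficients: a coefficient $\beta$ corresponds to a $(e^{\beta} - 1) \times 100\%$ change in the odds per unit increase in the predictor. The sketch below reproduces that arithmetic from the rounded coefficients in the final model equation. The helper names are our own, and small differences from the reported figures can arise from rounding of the coefficients.

```python
import math

# Rounded main-effect coefficients from the final model equation.
coefficients = {
    "time_in_hospital": 0.2012,
    "age": 0.0339,
    "num_of_changes": -0.5269,
    "number_inpatient": 0.1540,
}

def pct_change_in_odds(coef):
    """Percent change in the odds for a one-unit increase in a predictor."""
    return (math.exp(coef) - 1) * 100

def pct_change_per_day(age, b_time=0.2012, b_interaction=-0.0028):
    """Percent change in the odds per extra hospital day at a given age value,
    accounting for the time_in_hospital * age interaction."""
    return (math.exp(b_time + b_interaction * age) - 1) * 100

pct = {name: pct_change_in_odds(coef) for name, coef in coefficients.items()}
for name, value in pct.items():
    print(f"{name}: {value:+.2f}% change in odds per unit")
```

Calling `pct_change_per_day` with increasing age values shows the per-day effect shrinking while staying positive, which is the pattern described in conclusion (3).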

### Recommendations to Stakeholders 

#### Recommendation (1) 
Regarding conclusion (1), it is imperative that healthcare providers be hyper-vigilant in caring for patients with long inpatient hospital stays. Because the odds of being readmitted to the hospital within 30 days increase by 22.29% *with each additional day spent in the hospital*, patients with lengthy hospital stays are far more likely to have to go *back* to the hospital for further treatment. Per conclusion (3), this effect is slightly tempered at greater patient ages. According to the National Institutes of Health, the average hospital stay in the U.S. is 5.5 days long [9]. Thus, healthcare providers and patients alike should be notified *if the patient's stay exceeds that average*, as longer hospital stays may indicate that a patient is at greater risk of readmission.

#### Recommendation (2) 
Age is a large risk factor in determining hospital readmission, with the odds of readmission increasing 3.44% per decade increase of the patient's age (conclusion 2). We recommend that healthcare providers take additional measures to ensure diligent care for elderly patients. Specifically, readmission is often caused by poor transitions from the hospital to a subsequent location of care and complications in the patient's illness. These causes may be countered, especially for elderly patients, by deliberately giving consultations to the patient's family and caregivers on care and reevaluating the patient's medication plan to ensure compliance [8].

#### Recommendation (3) 
As conclusion (4) notes, a patient’s odds of readmission decrease by 40.96% with every additional medication change during their initial hospital stay. Importantly, our population of interest is diabetic patients, whose variable blood sugar levels necessitate constant supervision and subsequent adjustments in insulin and related medications. As such, we recommend that, for diabetic patients undergoing an inpatient hospital stay, medications be monitored and adjusted more frequently than those of non-diabetic patients. This may necessitate more frequent nurse visits to diabetic patients. For example, if nurses currently check on patients every 3 hours, we would recommend shortening that interval slightly for diabetic patients, depending on hospital staffing, to every 1.5 to 2 hours. Increased monitoring may allow hospital staff to more accurately gauge a patient’s blood sugar levels and associated health risks, and thus make it more likely that medication or insulin adjustments are made that could decrease that patient’s risk of readmission.

#### Recommendation (4)
The number of inpatient hospital visits is an important risk factor: more hospital visits increase the chance of readmission, per conclusion (5). In light of this, we suggest that healthcare providers determine a patient's number of inpatient hospital visits while taking the patient's medical history. We suggest that patients be transparent about the number of their inpatient visits to the hospital. Healthcare administrators may assist the effort to obtain this information from each patient by supporting the implementation of a system that automatically tracks the number of inpatient hospital visits, such as an electronic health record system.

### Implementation and Limitations of Recommendations

Our recommendations should be practically implementable by stakeholders in their current form. However, stakeholders should also be aware of the limitations of our model due to the time period in which its training and test data were collected (see the *Limitations of the model with regard to Inference* section). It may be possible to keep using this model in the future by re-training it on more recent, up-to-date data if available (perhaps following the procedure we created in this study) to recalibrate its responses and performance.

However, in its current form, this model and its conclusions should be implementable unless there are major breakthroughs in the care of diabetic patients (that could impact their hospitalization rates or care). Otherwise, our model is broadly applicable to hospitals within the US because the data was sourced from 130 hospitals across the United States, and a wide variety of diabetic patients were included (10,000+). Hence, we do not anticipate limitations to inference (and thus to the recommendations we formed *based* on this inference) with regard to the population.

## GitHub and individual contribution {-}

Our project repository can be found at [**this link**](https://github.com/AnastasiaKWei/Saturn). A more detailed overview of each team member's contributions to the project can be found below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Kaitlyn Hung</td>
    <td>Modeling and EDA</td>
    <td>XXXXX</td>
    <td>XXXXX</td>
  </tr>
  <tr>
    <td>Amy Wang</td>
    <td>Modeling and model optimization</td>
    <td>Made commits to optimize the base model, resample the training data using SMOTENC, summarize this work in the "Improving the Base Model" section of the Saturn_report, add code to Saturn_code, and add metrics to Appendix D. Contributed to the conclusions and recommendations to stakeholders.</td>
    <td>25</td>
  </tr>
    <tr>
    <td>Anastasia Wei</td>
    <td>Data cleaning and model optimization</td>
    <td>XXXXX</td>
    <td>XXXXX</td>    
  </tr>
    <tr>
    <td>Lila Wells</td>
    <td>Modeling, EDA, and Presentation Assets</td>
    <td>Made commits to create and optimize the base model using decision threshold probabilities (also performed EDA to determine relevant predictors). Resampled the training data using undersampling. Coded and added all figures for variable distributions, initial modeling, and EDA to Appendices A-C. Created all presentation assets for the project presentation and filled out all slides. Added relevant code to the Saturn code document. Formed conclusions and recommendations based on the final model.</td>
    <td>114</td>    
  </tr>
</table>

### Reflections on the GitHub Experience

When reflecting on our collaboration via GitHub, our team members appreciated how we were able to collectively work on one document (like this project report, for example) from our own devices while pushing and pulling changes that other group members made. One challenge that we faced when using GitHub was maintaining a clean and easy-to-navigate repository, as our team members were often simultaneously working on different files and, as such, our repository would often become quite crowded with different documents. However, we were able to circumvent this issue by organizing our repository into different subsections and folders. 

We found that collaborating on GitHub made it far easier to use the platform as a whole. Practice is the best teacher and, as our team was quite active in our repository, we learned a lot about pushing and pulling changes that other team members made, merging files, and making meaningful commits. GitHub definitely made our collaboration easier, as it stored all of our files in one place and allowed us to collaborate remotely yet effectively. 

## References {-}

[1] McIlvennan, Colleen K., et al. "Hospital Readmissions Reduction Program." 
     National Library of Medicine, https://doi.org/10.1161/CIRCULATIONAHA.114.010270.

[2] "How Reducing Hospital Readmissions Benefits Patients and Hospitals." Regis College, 10 Aug. 2022, 
    online.regiscollege.edu/blog/reducing-hospital-readmissions/.
    
[3] "About Diabetes." International Diabetes Federation, 12 Sept. 2021, 
     idf.org/aboutdiabetes/what-is-diabetes/facts-figures.html#:~:text=The%20IDF%20Diabetes%20Atlas%20Tenth,and%20783%20million%20by%202045.

[4] Moss, S. E., et al. "Risk factors for hospitalization in people with diabetes." 
     JAMA Internal Medicine, https://doi.org/10.1001/archinte.159.17.2053. 
     Accessed 27 Sept. 1999.
     
[5] Sharma, Abishek. "Prediction on Hospital Readmission." Kaggle, https://www.kaggle.com/code/iabhishekofficial/prediction-on-hospital-readmission.

[6] Ranveer, Sachin. "Diabetes 130 US hospitals for years 1999-2008" https://medium.com/analytics-vidhya/diabetes-130-us-hospitals-for-years-1999-2008-e18d69beea4d


[7] Ostling, S., Wyckoff, J., Ciarkowski, S.L. et al. The relationship between diabetes mellitus and 30-day readmission rates. Clin Diabetes Endocrinol 3, 3 (2017). https://doi.org/10.1186/s40842-016-0040-x

[8] How Reducing Hospital Readmissions Benefits Patients and Hospitals. (2018, March 9). Regis College Online. https://online.regiscollege.edu/blog/reducing-hospital-readmissions/

[9] Interventions To Decrease Hospital Length of Stay, National Library of Medicine. https://www.ncbi.nlm.nih.gov/books/NBK574438/.


## Appendix {-}

### Appendix A: Data quality check / cleaning / preparation

This appendix contains several figures related to our data quality check, cleaning, and preparation. See each figure below for more detail.

#### Appendix A Table 1: Categorical Variable - `change`

|             | `change`     |
| ----------- | ----------- |
| Levels      | 2 (No, Change) |
| Missing values   | 0        |
| Number of unique values   | 2        |
| Frequency at all levels   | {No : 54755, Change: 47011}   |


#### Appendix A Table 2: Categorical Variable - `diabetesMed`

|             | `diabetesMed`     |
| ----------- | ----------- |
| Levels      | 2 (No, Yes) |
| Missing values   | 0        |
| Number of unique values   | 2        |
| Frequency at all levels   | {No : 23403, Yes: 78363}   |


#### Appendix A Table 3: Categorical Variable - `diag_1`

|             | `diag_1`     |
| ----------- | ----------- |
| Levels      | 717 |
| Missing values   | 0        |
| Number of unique values   | 717        |
| Frequency at all levels   | It depends on the level. There are so many levels that binning here is necessary to make this variable interpretable   |

#### Appendix A Table 4: Categorical Variable - `diag_2`

|             | `diag_2`     |
| ----------- | ----------- |
| Levels      | 749 |
| Missing values   | 0        |
| Number of unique values   | 749       |
| Frequency at all levels   | It depends on the level. There are so many levels that binning here is necessary to make this variable interpretable   |


#### Appendix A Table 5: Categorical Variable - `admission_source_id`
*Note: This variable has integer values that correspond with a type of admission sourcing for the patient. Thus, it is classified as a categorical variable.*

|             | `admission_source_id`     |
| ----------- | ----------- |
| Levels      | 17 |
| Missing values   | 0        |
| Number of unique values   | 17       |
| Frequency at all levels   | {7 : 57494, 1 : 29565, 17 : 6781, 4 : 3187} *Note, only showing the frequency of the first few most popular levels*  |

#### Appendix A Table 6: Categorical Variable - `discharge_disposition_id`
*Note: This variable has integer values that correspond with a type of patient discharge from the hospital. Thus, it is classified as a categorical variable.*


|             | `discharge_disposition_id`     |
| ----------- | ----------- |
| Levels      | 26 |
| Missing values   | 0        |
| Number of unique values   | 26       |
| Frequency at all levels   | {1 : 60234, 3 : 13954, 6 : 12902, 18 : 3691, 2 : 2128} *Note, only showing the frequency of the first few most popular levels*  |

#### Appendix A Table 7: Categorical Variable - `admission_type_id`

|             | `admission_type_id`     |
| ----------- | ----------- |
| Levels      | 8 |
| Missing values   | 0        |
| Number of unique values   | 8       |
| Frequency at all levels   | {1 : 53990, 3 : 18869, 2 : 18480, 6 : 5291} *Note, only showing the frequency of the first few most popular levels*  |

#### Appendix A Table 8: Continuous Variable - `num_lab_procedures`

|             | `num_lab_procedures`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 43.095641        |
| std   | 19.674362       |
| min   | 1.000000       |
| 25%   | 31.000000       |
| 50%   | 44.000000       |
| 75%   | 57.000000       |
| max   | 132.000000       |


#### Appendix A Table 9: Continuous Variable - `num_procedures`


|             | `num_procedures`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 1.339730        |
| std   | 1.705807       |
| min   | 0.000000      |
| 25%   | 0.000000       |
| 50%   | 1.000000       |
| 75%   | 2.000000       |
| max   | 6.000000       |

#### Appendix A Table 10: Continuous Variable - `num_medications`

|             | `num_medications`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 16.021844        |
| std   | 8.127566       |
| min   | 1.000000      |
| 25%   | 10.000000       |
| 50%   | 15.000000       |
| 75%   | 20.000000       |
| max   | 81.000000       |

#### Appendix A Table 11: Continuous Variable - `number_outpatient`

|             | `number_outpatient`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 0.369357        |
| std   | 1.267265       |
| min   | 0.000000      |
| 25%   | 0.000000       |
| 50%   | 0.000000      |
| 75%   | 0.000000       |
| max   | 42.000000       |

#### Appendix A Table 12: Continuous Variable - `number_emergency`


|             | `number_emergency`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 0.197836        |
| std   | 0.930472       |
| min   | 0.000000      |
| 25%   | 0.000000       |
| 50%   | 0.000000      |
| 75%   | 0.000000       |
| max   | 76.000000      |

#### Appendix A Table 13: Continuous Variable - `number_inpatient`

|             | `number_inpatient`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 0.635566       |
| std   | 1.262863       |
| min   | 0.000000      |
| 25%   | 0.000000       |
| 50%   | 0.000000      |
| 75%   | 1.000000       |
| max   | 21.000000      |

#### Appendix A Table 14: Continuous Variable - `number_diagnoses`

|             | `number_diagnoses`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 7.422607       |
| std   | 1.933600       |
| min   | 1.000000      |
| 25%   | 6.000000       |
| 50%   | 8.000000      |
| 75%   | 9.000000       |
| max   | 16.000000      |

#### Appendix A Table 15: Continuous Variable - `age`

|             | `age`     |
| ----------- | ----------- |
| count      | 101766 |
| unique   | 10       |
| top   | [70-80)       |
| freq   | 26068      |


#### Appendix A Table 16: Continuous Variable - `time_in_hospital`

|             | `time_in_hospital`     |
| ----------- | ----------- |
| count      | 101766.000000 |
| mean   | 4.395987       |
| std   | 2.985108       |
| min   | 1.000000      |
| 25%   | 2.000000       |
| 50%   | 4.000000      |
| 75%   | 6.000000       |
| max   | 14.000000      |

**Appendix A Diagnosis Mapping**
1. Circulatory → 390–459, 785 → Diseases of the circulatory system
2. Respiratory → 460–519, 786 → Diseases of the respiratory system
3. Digestive → 520–579, 787 → Diseases of the digestive system
4. Diabetes → 250.xx → Diabetes mellitus
5. Injury → 800–999 → Injury and poisoning
6. Musculoskeletal → 710–739 → Diseases of the musculoskeletal system and connective tissue
7. Genitourinary → 580–629, 788 → Diseases of the genitourinary system
8. Neoplasms → 140–239 → Neoplasms
9. Pregnecy → 630–679 → Complications of pregnancy, childbirth, and the puerperium
10. Other
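
The binning above can be sketched as a small helper. This is one possible implementation under stated assumptions, not necessarily the code used in the project: the function name is our own, and any code that is missing, non-numeric (for example ICD-9 V or E codes), or outside the listed ranges is assumed to fall into "other". The category label `pregnecy` keeps the spelling used in the model's column names.

```python
import math

def bin_diagnosis(code):
    """Map an ICD-9 diagnosis code to one of the report's ten categories."""
    try:
        value = float(code)
    except (TypeError, ValueError):
        return "other"  # non-numeric codes such as 'V57', 'E909', or '?'
    if not math.isfinite(value):
        return "other"  # NaN or infinite values
    major = int(value)  # drop the decimal part, e.g. 250.83 -> 250
    if 390 <= major <= 459 or major == 785:
        return "circulatory"
    if 460 <= major <= 519 or major == 786:
        return "respiratory"
    if 520 <= major <= 579 or major == 787:
        return "digestive"
    if major == 250:
        return "diabetes"
    if 800 <= major <= 999:
        return "injury"
    if 710 <= major <= 739:
        return "musculoskeletal"
    if 580 <= major <= 629 or major == 788:
        return "genitourinary"
    if 140 <= major <= 239:
        return "neoplasms"
    if 630 <= major <= 679:
        return "pregnecy"  # spelling follows the report's column names
    return "other"

print([bin_diagnosis(c) for c in ["250.83", "428", "V57", 786, "?"]])
```

With pandas, a helper like this could be applied column-wise, for example `df["diag_1"].map(bin_diagnosis)`.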

**Appendix A Admission & Discharge Mapping**
Admission Type ID Mapping <br>
1. Urgent, Trauma Center 2,7 → Emergency 1
2. NaN, Not Mapped 6,8 → Not Available 5

Admission Source ID Mapping <br>
1. Referral 2,3 → 1
2. Transfers 5,6,10,22,25 → 4
3. Not available, not known 15,17,20,21 → 9
4. Sick baby, extramural birth 13, 14 → 11 Normal Delivery

Discharge Disposition Mapping
1. Discharged/transferred to home care 6,8,9,13 → 1
2. Discharged/transferred to short term hospital 3,4,5,14,22,23,24 → 2
3. Neonate discharge to another hospital for neonatal aftercare 12,15,16,17 → 10
4. NaN, not mapped 25,26 → 18
5. Expired 13,14,19,20,21 → 11
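
As an illustration of how such ID mappings can be applied, the sketch below collapses the admission type and admission source codes with pandas. The dictionary names and the toy data are our own assumptions. The discharge disposition mapping is left out here because the list above places codes 13 and 14 in more than one group, so the intended target for those codes would need to be confirmed first.

```python
import pandas as pd

# Source code -> collapsed code, transcribed from the mappings listed above.
ADMISSION_TYPE_MAP = {2: 1, 7: 1, 6: 5, 8: 5}
ADMISSION_SOURCE_MAP = {
    2: 1, 3: 1,                      # referrals
    5: 4, 6: 4, 10: 4, 22: 4, 25: 4,  # transfers
    15: 9, 17: 9, 20: 9, 21: 9,       # not available / not known
    13: 11, 14: 11,                   # sick baby, extramural birth
}

# Toy records with hypothetical values.
df = pd.DataFrame({
    "admission_type_id": [1, 2, 7, 6, 3],
    "admission_source_id": [7, 2, 17, 13, 1],
})

# Codes that are not keys in a mapping are left unchanged.
df["admission_type_id"] = df["admission_type_id"].replace(ADMISSION_TYPE_MAP)
df["admission_source_id"] = df["admission_source_id"].replace(ADMISSION_SOURCE_MAP)
print(df)
```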

### Appendix B: Exploratory data analysis

This appendix contains several figures related to our exploratory data analysis. See each figure below for more detail.

#### Appendix B Figure 1: Heatmap of Continuous Variables in the Dataset

![heatmap_continuous-4.png](attachment:heatmap_continuous-4.png)

#### Appendix B Figure 2: Time Spent in Hospital vs. Readmission

![time_in_hospital_vs_readmissions-2.png](attachment:time_in_hospital_vs_readmissions-2.png)

#### Appendix B Figure 3: Number of Diagnoses vs. Readmission

![diag_vs_readmissions.png](attachment:diag_vs_readmissions.png)


#### Appendix B Figure 4: Patient Age vs. Readmission

![readmissions_vs_age.png](attachment:readmissions_vs_age.png)

#### Appendix B Figure 5: Number of Inpatient Visits vs. Readmission
![num_inpatient_vs_readmissions-3.png](attachment:num_inpatient_vs_readmissions-3.png)

#### Appendix B Figure 6: Number of Emergency Visits vs. Readmission

![num_emergency_vs_readmissions.png](attachment:num_emergency_vs_readmissions.png)

#### Appendix B Figure 7: Number of Medication Changes vs. Readmission

![num_changes_vs_readmissions.png](attachment:num_changes_vs_readmissions.png)

#### Appendix B Figure 8: Time Spent in Hospital vs. Patient Age

![time_in_hospital_vs_age.png](attachment:time_in_hospital_vs_age.png)

#### Appendix B Figure 9: Time Spent in Hospital vs. Number of Medication Changes

![time_in_hospital_vs_num_changes.png](attachment:time_in_hospital_vs_num_changes.png)

#### Appendix B Figure 10: Number of Medication Changes vs. Number of Inpatient Visits

![num_changes_vs_number_inpatient.png](attachment:num_changes_vs_number_inpatient.png)

#### Appendix B Figure 11: Number of Inpatient Visits vs. Patient Age

![number_inpatient_vs_age.png](attachment:number_inpatient_vs_age.png)

#### Appendix B Figure 12: Number of Medication Changes vs. Patient Age

![num_changes_vs_age.png](attachment:num_changes_vs_age.png)

#### Appendix B Figure 13: Number of Inpatient Visits vs. Time Spent in Hospital

![num_inpatient_vs_time_in_hospital.png](attachment:num_inpatient_vs_time_in_hospital.png)

### Appendix C: Approach

This appendix contains several figures related to our model approach. See each figure below for more detail.

#### Appendix C Figure 1: Distribution of the response variable `readmitted` before undersampling or SMOTENC in the training data.

![readmission_distribution_originaltrain.png](attachment:readmission_distribution_originaltrain.png)

#### Appendix C Figure 2: Distribution of the response variable `readmitted` after undersampling

![readmission_distribution_undersampling.png](attachment:readmission_distribution_undersampling.png)

#### Appendix C Figure 3: Distribution of the response variable `readmitted` after SMOTENC

![readmission_distribution_after_smotenc.png](attachment:readmission_distribution_after_smotenc.png)
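
The effect of undersampling on the class distribution in the figures above can be illustrated with a minimal sketch. It assumes plain random undersampling on a toy label list. The helper name `undersample`, the seed, and the 10/90 split are illustrative assumptions, not our actual pipeline or data, and SMOTENC (which synthesizes new minority rows instead of dropping majority rows) is not shown.

```python
import random
from collections import Counter


def undersample(labels, seed=0):
    """Return sorted row indices after random undersampling.

    Every class is reduced to the size of the smallest class, so all
    minority rows are kept and majority rows are dropped at random.
    """
    rng = random.Random(seed)
    by_class = {}
    for i, y in enumerate(labels):
        by_class.setdefault(y, []).append(i)
    n_min = min(len(idx) for idx in by_class.values())
    kept = []
    for idx in by_class.values():
        kept.extend(rng.sample(idx, n_min))
    return sorted(kept)


# Toy labels (hypothetical): 10 readmitted (1) and 90 not readmitted (0).
labels = [1] * 10 + [0] * 90
kept = undersample(labels)
balanced = [labels[i] for i in kept]

print("before:", Counter(labels))
print("after: ", Counter(balanced))
```

Undersampling balances the classes by discarding majority-class rows, so the resampled training set is smaller than the original one.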


#### Appendix C Figure 4: Confusion Matrix of Base Model on Training Data

![confusion_matrix_train-2.png](attachment:confusion_matrix_train-2.png)

#### Appendix C Figure 5: ROC Curve of Base Model 

![roc_curve_base_model.png](attachment:roc_curve_base_model.png)
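
For readers less familiar with the area under a ROC curve, the sketch below computes it directly from its ranking interpretation: the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case, with ties counted as one half. The function name `roc_auc` and the toy labels and scores are illustrative assumptions, not output from our base model.

```python
def roc_auc(y_true, scores):
    """ROC-AUC as the fraction of (positive, negative) pairs ranked correctly.

    Assumes binary labels coded 0/1 and at least one case of each class.
    """
    pos = [s for y, s in zip(y_true, scores) if y == 1]
    neg = [s for y, s in zip(y_true, scores) if y == 0]
    # A correctly ordered pair counts 1, a tied pair counts 0.5.
    wins = sum((p > n) + 0.5 * (p == n) for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Toy example (hypothetical scores): 3 positives, 4 negatives.
y_true = [1, 1, 1, 0, 0, 0, 0]
scores = [0.9, 0.8, 0.4, 0.7, 0.3, 0.2, 0.1]
auc = roc_auc(y_true, scores)
print(f"ROC-AUC = {auc:.3f}")
```

This pairwise loop is quadratic in the number of rows and is meant only to show the definition. On real data, `sklearn.metrics.roc_auc_score(y_true, y_score)` computes the same quantity efficiently.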

#### Appendix C Figure 6: Confusion Matrix of Base Model on Test Data

![initial_confusion_matrix_test.png](attachment:initial_confusion_matrix_test.png)

#### Appendix C Figure 7: Re-Trained Base Model Performance on Train Data (SMOTENC)

![base_model_SMOTENC_train.png](attachment:base_model_SMOTENC_train.png)

#### Appendix C Figure 8: Re-Trained Base Model Performance on Test Data 

![base_model_SMOTENC_test.png](attachment:base_model_SMOTENC_test.png)

#### Appendix C Figure 9: Sharma's Model Performance

Note: this figure was downloaded from Abishek Sharma's code repository on Kaggle, cited as [5] in the references section of this report.

![sharma_models.png](attachment:sharma_models.png)

### Appendix D: Developing the Model

This appendix contains the figures and performance metrics for each model we evaluated while developing our optimized model.

**Appendix D Figure 1: Model 1.1, Age Bins as their own variable** <br>
Model Performance on Train Data <br>
![model1_1_train.png](visualizations/model1_1_train.png) <br>
Classification accuracy = 58.2% <br>
Precision = 55.7% <br>
TPR or Recall = 79.8% <br>
FNR = 20.2% <br>
FPR = 63.4% <br>
ROC-AUC = 60.5% <br>

Model Performance on Test Data <br>
![model1_1_test.png](visualizations/model_1_1_test.png) <br>
Classification accuracy = 40.0% <br>
Precision = 10.1% <br>
TPR or Recall = 70.0% <br>
FNR = 30.0% <br>
FPR = 63.0% <br>
ROC-AUC = 53.9% <br>
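
All of the metrics listed in this appendix except ROC-AUC follow from the four cells of a confusion matrix. The sketch below shows those definitions. The function name `confusion_metrics` and the counts are made up for illustration and are not taken from our confusion matrices. The code assumes every denominator is nonzero.

```python
def confusion_metrics(tp, fp, tn, fn):
    """Compute threshold-based metrics from confusion-matrix counts."""
    return {
        "accuracy": (tp + tn) / (tp + fp + tn + fn),
        "precision": tp / (tp + fp),
        "recall": tp / (tp + fn),  # TPR
        "fnr": fn / (tp + fn),     # equals 1 - recall
        "fpr": fp / (fp + tn),
    }


# Hypothetical imbalanced test set: 10 positives, 100 negatives.
metrics = confusion_metrics(tp=7, fp=63, tn=37, fn=3)
for name, value in metrics.items():
    print(f"{name}: {value:.1%}")
```

When positives are rare, false positives can far outnumber true positives among the predicted positives, so precision can be low even when recall is high.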

**Appendix D Figure 2: Single Best Model** <br>
Model Performance on Train Data <br>
![single_model_train.png](visualizations/single_model_train.png) <br>
Classification accuracy = 65.4% <br>
Precision = 64.6% <br>
TPR or Recall = 68.4% <br>
FNR = 31.6% <br>
FPR = 37.5% <br>
ROC-AUC = 70.7% <br>

Model Performance on Test Data <br>
![single_model_test.png](visualizations/single_model_test.png) <br>
Classification accuracy = 60.7% <br>
Precision = 10.4% <br>
TPR or Recall = 42.9% <br>
FNR = 57.1% <br>
FPR = 37.5% <br>
ROC-AUC = 54.4% <br>

**Appendix D Figure 3: Sklearn Feature Selection** <br>
Model Performance on Train Data <br>
![image.png](attachment:image.png) <br>
Classification accuracy = 69.9% <br>
Precision = 68.7% <br>
TPR or Recall = 73.0% <br>
FNR = 27.0% <br>
FPR = 33.3% <br>
ROC-AUC = 69.9% <br>

Model Performance on Test Data <br>
![image-2.png](attachment:image-2.png) <br>
Classification accuracy = 64.5% <br>
Precision = 11.3% <br>
TPR or Recall = 41.6% <br>
FNR = 58.4% <br>
FPR = 33.1% <br>
ROC-AUC = 54.3% <br>

**Appendix D Figure 4: Sklearn Feature Interaction Selection** <br>
Model Performance on Train Data <br>
![image.png](attachment:image.png) <br>
Classification accuracy = 69.9% <br>
Precision = 68.7% <br>
TPR or Recall = 73.2% <br>
FNR = 26.8% <br>
FPR = 33.4% <br>
ROC-AUC = 69.9% <br>

Model Performance on Test Data <br>
![image-2.png](attachment:image-2.png) <br>
Classification accuracy = 64.6% <br>
Precision = 11.3% <br>
TPR or Recall = 41.8% <br>
FNR = 58.2% <br>
FPR = 33.1% <br>
ROC-AUC = 54.4% <br>