## Length of the report {-}
The length of the report must be no more than 15 pages, when printed as PDF. However, there is no requirement on the minimum number of pages.

You may put additional material in an Appendix. You may refer to the Appendix in the main report to support your arguments. However, your appendix is unlikely to be checked while grading, unless the grader deems it necessary. The appendix, references, and information about GitHub and individual contribution will not be included in the page count, and there is no limit on the length of the appendix.

**Delete this section from the report, when using this template.** 

## Background / Motivation

What motivated you to work on this problem?

Mention any background about the problem, if it is required to understand your analysis later on.

## Problem statement 

Describe your problem statement. Articulate your objectives using absolutely no jargon. Interpret the problem as inference and/or prediction.

## Data sources
What data did you use? Provide details about your data. Include links to data if you are using open-access data.

Here is a link to our data. The data set is entitled “Diabetes 130-US hospitals for years 1999-2008 dataset” and comes from the UCI Machine Learning Repository. The data was collected between 1999 and 2008 at 130 US hospitals and contains over 10,000 entries, each corresponding to a patient with diabetes (type 1 or 2). An encounter was included in the dataset if it satisfied all of the following criteria:

1. An inpatient encounter
2. A diabetic encounter (i.e., where the patient was listed as having diabetes)
3. The length of the patient’s hospital stay was between 1 and 14 days
4. Laboratory tests were administered during the patient’s hospital stay
5. The patient received medications during their hospital stay
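
As a rough illustration, the five inclusion criteria above can be expressed as a single filter over encounter records. This is only a sketch: the `Encounter` class and all of its field names are hypothetical and are not the column names of the UCI data set, and the example records are made up.

```python
from dataclasses import dataclass


@dataclass
class Encounter:
    """One hospital encounter; all field names are hypothetical."""
    is_inpatient: bool
    has_diabetes_diagnosis: bool
    length_of_stay_days: int
    num_lab_tests: int
    num_medications: int


def meets_inclusion_criteria(e: Encounter) -> bool:
    """Return True if the encounter satisfies all five criteria listed above."""
    return (
        e.is_inpatient                          # 1. inpatient encounter
        and e.has_diabetes_diagnosis            # 2. diabetic encounter
        and 1 <= e.length_of_stay_days <= 14    # 3. stay between 1 and 14 days
        and e.num_lab_tests > 0                 # 4. laboratory tests administered
        and e.num_medications > 0               # 5. medications received
    )


# Made-up records for illustration only
encounters = [
    Encounter(True, True, 3, 41, 12),    # satisfies every criterion
    Encounter(True, True, 20, 55, 18),   # stay longer than 14 days
    Encounter(False, True, 2, 10, 5),    # not an inpatient encounter
]
included = [e for e in encounters if meets_inclusion_criteria(e)]
print(len(included))
```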

There are 55 attributes containing information about the patients' demographics (age, gender, race, etc.), medical history (inpatient and outpatient visits, medications, etc.), hospital stay (medications, procedures, diagnosis, etc.), and whether they were readmitted to the hospital within 30 days of their initial hospital visit. The data set is largely complete; there are few missing values, so the vast majority of the attributes are usable for our model.

We chose this dataset because it contains our variable of interest, whether the patient was readmitted or not, along with ~46 potential predictors to use to fit the model. The dataset is large and comprehensive: it pools 10,000+ patients across the US over a decade. Such broad coverage should help our findings generalize. The number of predictors allows us to approach the problem from multiple perspectives: the patient’s characteristics, their hospital stay and medical history, and the treatments administered to them.


## Stakeholders
Who cares? If you are successful, what difference will it make to them?

We identified three groups of key stakeholders: (1) diabetes patients and their loved ones; (2) hospital staff involved in direct patient care; and (3) hospital administrators and those who have an interest in reducing the financial waste in healthcare (Medicaid).

For diabetes patients and their loved ones, the results of this project will provide insight into factors affecting a diabetic patient’s continued risk of illness and subsequent hospital readmission. Even one stay in the hospital can be an emotional and financial burden on the patient and their family. Hospital readmissions exacerbate this toll on patients, and readmission is also associated with worse patient outcomes [3]. Thus, understanding which factors contribute to hospital readmission can empower patients and their families to take an active role in improving their health outcomes. For instance, a patient who knows that they have certain risk factors associated with a greater likelihood of readmission could direct more time and resources towards these factors and away from areas that may have less impact on hospital readmission. Patients and their families could be more vigilant about early warning signs and communicate proactively with their physician or healthcare provider about prevention. Overall, this knowledge can help patients with diabetes direct their efforts towards the areas that most affect their health outcomes.

Hospital staff and physicians are another primary stakeholder group in this project. These individuals are directly charged with overseeing the healthcare plans of diabetic patients, and are morally, professionally, and financially invested in giving their patients the best possible care to improve health outcomes. With a better understanding of factors that impact the likelihood of a diabetic patient’s readmission, healthcare workers may be able to adjust their administration of care during the patient’s initial hospital visit. If a patient is at elevated risk of being readmitted, hospital staff can adjust their healthcare plan accordingly by focusing on reducing those factors that are most indicative of increased likelihood of readmission. This project may inform healthcare workers about which patients require the most vigilant and preventative intervention, potentially lowering both the morbidity and mortality for diabetics in their care.

Finally, a third stakeholder group consists of hospital administrators and those with an interest in reducing the cost of healthcare. Hospital readmissions are financially costly to the healthcare system. The Affordable Care Act incentivizes the reduction of hospital readmissions through financial penalties, reducing payments to hospitals with excessive readmissions. In 2021, the penalties were projected to cost hospitals $521 million [4]. As such, hospital administrators may have a vested financial interest in identifying the factors that most contribute to an increased likelihood of a patient returning to the hospital.


## Data quality check / cleaning / preparation 

In a tabular form, show the distribution of values of each variable used in the analysis - for both categorical and continuous variables. Distribution of a categorical variable must include the number of missing values, the number of unique values, the frequency of all its levels. If a categorical variable has too many levels, you may just include the counts of the top 3-5 levels. 
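
As one possible starting point, the summary requested above for a categorical variable (number of missing values, number of unique values, and counts of the top levels) could be computed as in the sketch below. The function name `summarize_categorical`, the set of `missing_markers`, and the example column are all assumptions made for illustration; adjust the missing-value markers to however missing values are encoded in your data.

```python
from collections import Counter


def summarize_categorical(values, top_n=5, missing_markers=(None, "", "?")):
    """Summarize one categorical column.

    `missing_markers` is an assumption: adjust it to match how missing
    values are encoded in your data.
    """
    present = [v for v in values if v not in missing_markers]
    counts = Counter(present)
    return {
        "n_missing": len(values) - len(present),   # number of missing values
        "n_unique": len(counts),                   # number of distinct levels
        "top_levels": counts.most_common(top_n),   # most frequent levels first
    }


# Made-up example column, not taken from the project data
column = ["A", "B", "A", "?", "C", "A", None, "B"]
summary = summarize_categorical(column, top_n=2)
print(summary)
```

Running the same function over every categorical column and collecting the results gives the rows of the table.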

If the tables in this section take too much space, you may put them in the appendix, and just mention any useful insights you obtained from the data quality check that helped you develop the model or helped you realize the necessary data cleaning / preparation.

Were there any potentially incorrect values of variables that required cleaning? If yes, how did you clean them? 

Did you do any data wrangling or data preparation before the data was ready to use for model development? Did you create any new predictors from existing predictors? For example, if you have number of transactions and spend in a credit card dataset, you may create spend per transaction for predicting if a customer pays their credit card bill. Mention the steps at a broad level; you may put minor details in the appendix. Only mention the steps that ended up being useful towards developing your final model(s).
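
The spend-per-transaction example above might look like the following minimal sketch. The function name, the choice to return `None` for customers with no transactions, and the sample values are assumptions made for illustration.

```python
def spend_per_transaction(total_spend, num_transactions):
    """Derived predictor: average spend per transaction.

    Returns None when there are no transactions, so the ratio is treated
    as missing instead of raising ZeroDivisionError.
    """
    if num_transactions == 0:
        return None
    return total_spend / num_transactions


# Made-up customers: (total spend, number of transactions)
customers = [(1200.0, 30), (450.0, 9), (0.0, 0)]
ratios = [spend_per_transaction(spend, n) for spend, n in customers]
print(ratios)
```

How to handle the zero-transaction case (missing value, zero, or a separate indicator) is a modeling decision worth stating in the report.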

## Exploratory data analysis

Put the relevant EDA here (visualizations, tables, etc.) that helped you figure out useful predictors for developing the model(s). Only put the EDA that ended up being useful towards developing your final model(s). 

List the insights (as bullet points) you got from EDA that ended up being useful towards developing your final model. 

Again, if there are too many plots / tables, you may put them into appendix, and just mention the insights you got from them.

- The continuous variables that are most highly correlated with readmission are: num_inpatient, num_diagnoses, time_in_hospital, age, encounter_id, num_medications, and num_procedures
- The following variables showed differences in distribution between readmitted and non-readmitted patients: time_in_hospital, num_diagnoses, age, num_inpatient, num_emergency, num_changes
- time_in_hospital increases as age increases, and time_in_hospital also increases as num_changes increases
- the variability in num_changes increases as num_inpatient increases
- at low ages, num_inpatient and num_changes have high variability
- num_inpatient and time_in_hospital do not seem to be related
- discharge disposition ID does not seem to be related to other predictors
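
One way a correlation between a continuous predictor and a 0/1 readmission indicator could be computed is the Pearson correlation, which in this setting is the point-biserial correlation. The sketch below is not necessarily how the rankings above were obtained, and the values in it are made up for illustration.

```python
from math import sqrt


def pearson_correlation(x, y):
    """Pearson correlation; with a 0/1 `y` this is the point-biserial correlation."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    # Sums of cross-products and squared deviations (the 1/n factors cancel)
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / sqrt(var_x * var_y)


# Made-up values for illustration only; not results from the project data
time_in_hospital = [1, 2, 3, 4, 5, 6]
readmitted = [0, 0, 0, 1, 1, 1]
r = pearson_correlation(time_in_hospital, readmitted)
print(round(r, 3))
```

Note that an identifier column can correlate with the response for reasons unrelated to the patient, so a high correlation alone does not make a variable a sensible predictor.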

## Approach

What kind of a model (linear / logistic / other) did you use? What performance metric(s) did you optimize and why?

Is there anything unorthodox / new in your approach? 

What problems did you anticipate? What problems did you encounter? Did the very first model you tried work? 

Did your problem already have solution(s) (posted on Kaggle or elsewhere)? If yes, then how did you build upon those solutions, and what did you do differently? Is your model better than those solutions in terms of prediction / inference?

**Important: Mention any code repositories (with citations) or other sources that you used, and specifically what changes you made to them for your project.**

## Developing the model

Explain the steps taken to develop and improve the base model (informative visualizations / addressing modeling assumption violations / variable transformation / interactions / outlier treatment / influential points treatment / addressing over-fitting / addressing multicollinearity / variable selection - stepwise regression, lasso, ridge regression).

Did you succeed in achieving your goal, or did you fail? Why?

### Final Model Equation

**Put the final model equation**.

**Important: This section should be rigorous and thorough. Present detailed information about the decisions you made, why you made them, and any evidence/experimentation to back them up.**

## Limitations of the model with regard to inference / prediction

If it is inference, will the inference hold for a certain period of time, for a certain subset of the population, and / or for certain conditions?

If it is prediction, will it be possible / convenient / expensive for the stakeholders to collect the data relating to the predictors in the model? Using your model, how soon will the stakeholder be able to predict the outcome before the outcome occurs? For example, if the model predicts the number of bikes people will rent in Evanston on a certain day, then how many days before that day will your model be able to make the prediction? This will depend on how soon the data that your model uses becomes available. If you are predicting election results, how many days / weeks / months / years before the election can you predict the results?

When will your model become too obsolete to be useful?

## Future Work

You are welcome to introduce additional sections or subsections, if required, to address any specific aspects of your project in detail. For example, you may briefly discuss potential future work that the research community could focus on to make further progress in the direction of your project's topic.

## Conclusions and Recommendations to stakeholder(s)

What conclusions do you draw based on your model? If it is inference you may draw conclusions based on the coefficients, statistical significance of predictors / interactions, etc. If it is prediction, you may draw conclusions based on prediction accuracy, or other performance metrics.

How do you use those conclusions to come up with meaningful recommendations for stakeholders? The recommendations must be action-items for stakeholders that they can directly implement without any further analysis. Be as precise as possible. The stakeholder(s) are depending on you to come up with practically implementable recommendations, instead of having to think for themselves.

If your recommendations are not practically implementable by stakeholders, how will they help them? Is there some additional data / analysis / domain expertise you need to do to make the recommendations implementable? 

Do the stakeholder(s) need to be aware about some limitations of your model? Is your model only good for one-time use, or is it possible to update your model at a certain frequency (based on recent data) to keep using it in the future? If it can be used in the future, then for how far into the future?

## GitHub and individual contribution {-}

Our project repository can be found at [**this link**](https://github.com/AnastasiaKWei/Saturn). A more detailed overview of each team member's contributions to the project can be found below.

<html>
<style>
table, td, th {
  border: 1px solid black;
}

table {
  border-collapse: collapse;
  width: 100%;
}

th {
  text-align: left;
}
    

</style>
<body>

<h2>Individual contribution</h2>

<table style="width:100%">
     <colgroup>
       <col span="1" style="width: 15%;">
       <col span="1" style="width: 20%;">
       <col span="1" style="width: 50%;">
       <col span="1" style="width: 15%;"> 
    </colgroup>
  <tr>
    <th>Team member</th>
    <th>Contributed aspects</th>
    <th>Details</th>
    <th>Number of GitHub commits</th>
  </tr>
  <tr>
    <td>Kaitlyn Hung</td>
    <td>Modeling and EDA</td>
    <td>Cleaned data to impute missing values and developed visualizations to identify appropriate variable transformations.</td>
    <td>100</td>
  </tr>
  <tr>
    <td>Amy Wang</td>
    <td>Modeling and model optimization</td>
    <td>Checked and addressed modeling assumptions and identified relevant variable interactions.</td>
    <td>120</td>
  </tr>
    <tr>
    <td>Anastasia Wei</td>
    <td>Data cleaning and model optimization</td>
    <td>Identified outliers/influential points and analyzed their effect on the model.</td>
    <td>130</td>    
  </tr>
    <tr>
    <td>Lila Wells</td>
    <td>Modeling and EDA</td>
    <td>Performed variable selection on an exhaustive set of predictors to address multicollinearity and overfitting.</td>
    <td>150</td>    
  </tr>
</table>
</body>
</html>

List the **challenges** you faced when collaborating with the team on GitHub. Are you comfortable using GitHub? 
Do you feel GitHub made collaboration easier? If not, then why? *(Individual team members can put their opinion separately, if different from the rest of the team)*

## References {-}

List and number all bibliographical references. When referenced in the text, enclose the citation number in square brackets, for example [1].

[1] Authors. The frobnicatable foo filter, 2014. Face and Gesture submission ID 324. Supplied as additional material fg324.pdf.


## Appendix {-}

You may put additional material here as an Appendix. You may refer to the Appendix in the main report to support your arguments. However, the appendix section is unlikely to be checked while grading, unless the grader deems it necessary.