Hospital Readmission Dataset Analysis

by Christiaan Defaux and Maia Ngo


Introduction

This dataset was pulled from the CMS (Centers for Medicare & Medicaid Services). As part of the Patient Protection and Affordable Care Act, CMS analyses hospital data to determine which hospital systems have higher-than-expected readmission rates and penalises them through the Hospital Readmissions Reduction Program. The dataset includes information on 101,766 patients nationwide, including their time in hospital, race, gender, diagnoses, medications and whether or not they were readmitted. We analysed the data to determine which features affected readmission rates.

Methodology

Firstly, we examined the data to determine how to approach data cleansing. There were 50 features which could be used to gain insight, and from these we selected the ones relevant to readmission. To find droppable columns, we counted the null values in each feature. Some features consisted mostly (>50%) of null values, encoded as "?". These columns were dropped because they carried too little information to be useful. Certain columns contained identifier numbers (patient_nbr, payer_code etc.). These columns were checked for duplicates, and none were found anywhere in the DataFrame. We then dropped the arbitrary identifiers, as they could not give any insight into readmission. Finally, two features contained the same value in every row and therefore could not discriminate between patients. These columns were also dropped.
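The cleansing steps above can be sketched roughly as follows. This is a minimal illustration rather than the exact code used in the analysis: the function name `clean_readmission_data`, the 50% threshold parameter, and the toy columns `mostly_missing`, `constant_flag`, `time_in_hospital` and `readmitted` are assumptions made for the example; only `patient_nbr` and `payer_code` are named in the text.

```python
import numpy as np
import pandas as pd


def clean_readmission_data(df, null_threshold=0.5,
                           id_columns=("patient_nbr", "payer_code")):
    """Hypothetical helper mirroring the cleansing steps described above."""
    # Treat the "?" placeholder as a genuine missing value.
    cleaned = df.replace("?", np.nan)

    # Drop columns where more than `null_threshold` of the values are missing.
    null_fraction = cleaned.isna().mean()
    mostly_null = null_fraction[null_fraction > null_threshold].index
    cleaned = cleaned.drop(columns=mostly_null)

    # Count fully duplicated rows before the identifiers are removed.
    n_duplicates = int(cleaned.duplicated().sum())

    # Drop arbitrary identifier columns (only those actually present).
    present_ids = [c for c in id_columns if c in cleaned.columns]
    cleaned = cleaned.drop(columns=present_ids)

    # Drop columns that hold a single value in every row.
    constant = [c for c in cleaned.columns
                if cleaned[c].nunique(dropna=False) <= 1]
    cleaned = cleaned.drop(columns=constant)

    return cleaned, n_duplicates


# Made-up toy data, purely to exercise the function.
toy = pd.DataFrame({
    "patient_nbr": [1, 2, 3, 4],
    "mostly_missing": ["?", "?", "?", "x"],
    "time_in_hospital": [1, 3, 5, 7],
    "constant_flag": ["No", "No", "No", "No"],
    "readmitted": ["NO", "<30", ">30", "NO"],
})
cleaned, n_duplicates = clean_readmission_data(toy)
print(list(cleaned.columns), n_duplicates)
```

Keeping the threshold and identifier list as parameters makes it easy to revisit which columns are dropped without rewriting the function.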

Next, the data was examined to see where insights might be found. This was done using the pandas .describe(), .info() and .groupby() methods, which gave a high-level overview of the data through metrics such as mean, count and sum. Using these methods we could see that there were interesting features related to age, race, gender and time spent in hospital. These features were pulled out for further investigation as they seemed to affect readmission rates. Additionally, we created and edited features using feature engineering techniques. This was done for two features in particular which would otherwise have been difficult to manage. The age feature was stored as class ranges in an inconvenient format ('(0-10]', '(10-20]' and so on). As this was categorical data, we used the upper end of each class rather than the range, so (0-10] was translated to '10' and so on. The second change was a new column that groups patients with readmittance '<30' or '>30' into 'True', and patients with 'NO' (not readmitted) into 'False'.
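A small sketch of the two engineered features described above is given here. The column names `age` and `readmitted`, the new column names `age_upper` and `was_readmitted`, and the helper functions are assumptions for illustration; the age classes follow the format quoted in the text.

```python
import pandas as pd


def upper_age_bound(age_class):
    """Turn an age class such as '(10-20]' into its upper bound, 20."""
    # Remove the surrounding brackets, then keep the number after the dash.
    return int(age_class.strip("()[]").split("-")[1])


def add_engineered_features(df):
    """Hypothetical helper adding the two engineered columns."""
    out = df.copy()
    # Represent each age class by the end of its range.
    out["age_upper"] = out["age"].map(upper_age_bound)
    # '<30' and '>30' both count as readmitted; 'NO' does not.
    out["was_readmitted"] = out["readmitted"].isin(["<30", ">30"])
    return out


# Made-up rows, only to show the transformation.
toy = pd.DataFrame({
    "age": ["(0-10]", "(10-20]", "(80-90]"],
    "readmitted": ["NO", "<30", ">30"],
})
engineered = add_engineered_features(toy)
print(engineered)
```

Storing the flag as a boolean means its mean over any group is directly the readmission rate for that group.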

However, it must be noted that despite the correlations we saw, no direct causation could be determined; that would require deeper investigation. To gain further insight, we visualised the data using the matplotlib library, which allows easy plotting of data into a variety of standard chart types. Initially, we plotted time in hospital versus readmission rate. The result can be seen in the figure below.
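A plot of this kind could be produced along the following lines. This is a sketch under assumptions, not the code behind the figure: the column names `time_in_hospital` and `was_readmitted`, the helper `readmission_rate_by`, and the toy values are all invented for the example and do not reproduce the dataset's results.

```python
import matplotlib.pyplot as plt
import pandas as pd


def readmission_rate_by(df, feature, flag="was_readmitted"):
    """Share of readmitted patients for each value of `feature`."""
    # The mean of a boolean column is the fraction of True values.
    return df.groupby(feature)[flag].mean().sort_index()


# Made-up data, only to make the sketch runnable.
toy = pd.DataFrame({
    "time_in_hospital": [1, 1, 2, 2, 3, 3, 3, 3],
    "was_readmitted": [True, True, True, False, False, False, False, True],
})
rates = readmission_rate_by(toy, "time_in_hospital")

# Line chart of readmission rate against days spent in hospital.
fig, ax = plt.subplots()
ax.plot(rates.index, rates.values, marker="o")
ax.set_xlabel("Time in hospital (days)")
ax.set_ylabel("Readmission rate")
ax.set_title("Readmission rate by time in hospital")
```

The same helper can be reused for other grouping features, such as age or gender, by changing the `feature` argument.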

![image.png](attachment:image.png)

The figure above shows how the rate of readmission falls as the time spent in hospital increases. This could be due to a number of factors. Perhaps with increased time in care, greater attention is paid to the patient's state and their ability to care for themselves. In any case, it suggests that one way to reduce readmission rates could be to keep patients in hospital for longer. Whether this would decrease total costs in the long run is unclear; a longitudinal study of this feature would be needed to determine this. Next, we plotted age versus the instances of readmittance. These gave an even distribution across categories and ages that can be seen in the figure below:

![image.png](attachment:image.png)

The figure above demonstrates how readmission numbers increase as age increases, up to about 90, and then decrease. This decrease is likely due to patients dying and there being fewer people living beyond 90. The figure also indicates that the rate of readmission after more than 30 days increases with age. This longer time to readmission could be due to factors such as mobility issues hindering access to care, or obstinance increasing with age. The recommendation would be to keep older patients in hospital for longer, as the time-in-hospital figure above indicates that this might decrease the rate of readmission. In addition, we plotted gender versus readmittance rate; this can be seen in the figure below:

![image.png](attachment:image.png)

This figure shows that there is a higher readmission rate for females than for males across a variety of times spent in hospital. Again, precise causation cannot be determined solely from this data. However, our assumption is that it could be due to female physiology needing specific care not needed by males, or perhaps to a greater openness on the part of females to seek medical attention. To tackle the readmission problem, females as a group might need greater attention before discharge. Finally, we looked at data specific to diagnosis codes, in particular those with the highest rates of readmission. Each patient was allocated 3 separate diagnoses. These can be seen below:

![image.png](attachment:image.png) ![image.png](attachment:image.png) ![image.png](attachment:image.png)

From the three diagrams above, the diagnoses with the highest rates of readmission can be determined. In particular, these are: 250 Diabetes Mellitus (~4500), 428 Acute Heart Failure (~4000), 276 Disorder of Fluid Electrolyte (~3100), 401 Hypertension (~3000) and 414 Ischemic Heart Disease (~2500). From this information we concluded that patients with these particular diagnoses should be targeted in order to reduce readmission rates. The care given could include more time spent in hospital, or careful nursing attention at home or in a nursing facility.
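Rankings like these could be obtained by counting diagnosis codes among readmitted patients, once per diagnosis column. The sketch below assumes hypothetical column names `diag_1`, `diag_2`, `diag_3` and `was_readmitted`, and uses invented rows; it shows the counting technique only, not the figures reported above.

```python
import pandas as pd


def top_readmitted_diagnoses(df, diag_columns=("diag_1", "diag_2", "diag_3"),
                             flag="was_readmitted", n=5):
    """Most frequent diagnosis codes among readmitted patients, per column."""
    readmitted = df[df[flag]]
    # value_counts sorts codes from most to least frequent.
    return {col: readmitted[col].value_counts().head(n)
            for col in diag_columns}


# Made-up rows; codes are kept as strings because some ICD-9 codes
# contain letters.
toy = pd.DataFrame({
    "diag_1": ["250", "428", "250", "401"],
    "diag_2": ["428", "250", "276", "250"],
    "diag_3": ["401", "401", "414", "428"],
    "was_readmitted": [True, True, True, False],
})
top = top_readmitted_diagnoses(toy, n=2)
for column, counts in top.items():
    print(column, counts.to_dict())
```

Each resulting series can be passed to a bar chart to give one diagram per diagnosis column.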

Overall, there are many insights yet to be gleaned from the data. From the analysis that was performed, we determined that there are specific at-risk groups that should be given extra attention. The at-risk groups found in our analysis were the elderly, females and those with specific acute and chronic cardiovascular and diabetic diseases. The most useful intervention for reducing readmission appeared to be an increase in time spent in hospital care. In particular, those with acute disease should spend longer in palliative care before being discharged to their homes. Our hope is that these insights can be used to decrease the rate of readmission to hospital and reduce the cost of providing long-term care to suffering patients.

