# Hospital Readmissions Data Analysis and Recommendations for Reduction

### Background
In October 2012, the US government's Centers for Medicare & Medicaid Services (CMS) began reducing Medicare payments for Inpatient Prospective Payment System hospitals with excess readmissions. Excess readmissions are measured by a ratio: a hospital’s number of “predicted” 30-day readmissions for heart attack, heart failure, and pneumonia divided by the number that would be “expected,” based on an average hospital with similar patients. A ratio greater than 1 indicates excess readmissions.

### Exercise Directions

In this exercise, you will:
+ critique a preliminary analysis of readmissions data and recommendations (provided below) for reducing the readmissions rate
+ construct a statistically sound analysis and make recommendations of your own 

More instructions provided below. Include your work **in this notebook and submit to your Github account**. 

### Resources
+ Data source: https://data.medicare.gov/Hospital-Compare/Hospital-Readmission-Reduction/9n3s-kdb3
+ More information: http://www.cms.gov/Medicare/medicare-fee-for-service-payment/acuteinpatientPPS/readmissions-reduction-program.html
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

In [None]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
%matplotlib inline
import scipy.stats as stats
import statsmodels.stats.api as sm
import seaborn as sns
sns.set()
from mpl_toolkits.axes_grid1 import make_axes_locatable

In [None]:
# read in readmissions data provided
hospital_read_df = pd.read_csv('data/cms_hospital_readmissions.csv')

****
## Preliminary Analysis

In [None]:
# deal with missing and inconvenient portions of data 
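# take an explicit copy so the dtype conversion below does not act on a view of the raw frame
clean_hospital_read_df = hospital_read_df[hospital_read_df['Number of Discharges'] != 'Not Available'].copy()
# assign the column directly so the integer dtype is actually stored
clean_hospital_read_df['Number of Discharges'] = clean_hospital_read_df['Number of Discharges'].astype(int)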
clean_hospital_read_df = clean_hospital_read_df.sort_values('Number of Discharges')

In [None]:
# generate a scatterplot for number of discharges vs. excess rate of readmissions
# lists work better with matplotlib scatterplot function
x = [a for a in clean_hospital_read_df['Number of Discharges'][81:-3]]
y = list(clean_hospital_read_df['Excess Readmission Ratio'][81:-3])

fig, ax = plt.subplots(figsize=(8,5))
ax.scatter(x, y,alpha=0.2)

ax.fill_between([0,350], 1.15, 2, facecolor='red', alpha = .15, interpolate=True)
ax.fill_between([800,2500], .5, .95, facecolor='green', alpha = .15, interpolate=True)

ax.set_xlim([0, max(x)])
ax.set_xlabel('Number of discharges', fontsize=12)
ax.set_ylabel('Excess rate of readmissions', fontsize=12)
ax.set_title('Scatterplot of number of discharges vs. excess rate of readmissions', fontsize=14)

ax.grid(True)
fig.tight_layout()

****

## Preliminary Report

Read the following results/report. While you are reading it, think about if the conclusions are correct, incorrect, misleading or unfounded. Think about what you would change or what additional analyses you would perform.

**A. Initial observations based on the plot above**
+ Overall, rate of readmissions is trending down with increasing number of discharges
+ With lower number of discharges, there is a greater incidence of excess rate of readmissions (area shaded red)
+ With higher number of discharges, there is a greater incidence of lower rates of readmissions (area shaded green) 

**B. Statistics**
+ In hospitals/facilities with number of discharges < 100, mean excess readmission rate is 1.023 and 63% have excess readmission rate greater than 1 
+ In hospitals/facilities with number of discharges > 1000, mean excess readmission rate is 0.978 and 44% have excess readmission rate greater than 1 

**C. Conclusions**
+ There is a significant correlation between hospital capacity (number of discharges) and readmission rates. 
+ Smaller hospitals/facilities may be lacking necessary resources to ensure quality care and prevent complications that lead to readmissions.

**D. Regulatory policy recommendations**
+ Hospitals/facilities with small capacity (< 300) should be required to demonstrate upgraded resource allocation for quality care to continue operation.
+ Directives and incentives should be provided for consolidation of hospitals and facilities to have a smaller number of them with higher capacity and number of discharges.

****
### Exercise

Include your work on the following **in this notebook and submit to your Github account**. 

A. Do you agree with the above analysis and recommendations? Why or why not?
   
B. Provide support for your arguments and your own recommendations with a statistically sound analysis:

   1. Setup an appropriate hypothesis test.
   2. Compute and report the observed significance value (or p-value).
   3. Report statistical significance for $\alpha$ = .01. 
   4. Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?
   5. Look at the scatterplot above. 
      - What are the advantages and disadvantages of using this plot to convey information?
      - Construct another plot that conveys the same information in a more direct manner.



You can compose in notebook cells using Markdown: 
+ In the control panel at the top, choose Cell > Cell Type > Markdown
+ Markdown syntax: http://nestacms.com/docs/creating-content/markdown-cheat-sheet
****

# My Analysis and Recommendation on Hospital Readmissions

## A. Do you agree with the above analysis and recommendations? Why or why not?

The above analysis is a good starting point, but it is not enough to support the conclusions and recommendations set forth: they rest on a single scatter plot of the data, with no statistical analysis to substantiate the claims. For that reason, I find the recommendations questionable and do not agree with the analysis or the recommendations above. My reasons are listed below.

- It is tempting to infer the trend mentioned in the analysis, since the notable extreme points draw the eye from top left to bottom right. The plot is actually fairly complicated, and it is difficult to discern any real trend. In addition, it is not clear how the boundaries of the shaded regions were chosen. The dense clustering of points in and around those regions makes these statements difficult to accept.


- It is essential to consider the entire data set, including the very dense collection of points in the center. It is not clear why cutoffs of less than 100 and greater than 1000 were used, since the low and high demarcations in the previous section (the shaded boxes) were 350 and 800, respectively. No proper hypothesis test was conducted to determine whether the readmission rate differs significantly across hospital sizes.


- The relationship between number of discharges and rate of readmissions was simply "eyeballed". No correlation coefficient or other numerical measure was calculated to confirm the initial observations, so there is not enough evidence to say that the two variables are correlated.


- The conclusion that smaller hospitals lack resources is unfounded. There is no evidence that more resources would resolve this issue.


- It is also curious that the only statistical evidence involved small hospitals defined as fewer than 100 discharges, whereas the recommendations define them as fewer than 300. This is another instance where numbers are given without explanation or further context.


- The statement that "Smaller hospitals/facilities may be lacking necessary resources to ensure quality care and prevent complications that lead to readmissions" might be true. However, other factors that are not available in the dataset, such as insurance and doctor ratings, might also contribute to this situation. The recommendations are given without any solid analysis.


- The missing data was handled properly above by dropping the rows whose 'Number of Discharges' is 'Not Available'. What remains are the nulls in the 'Footnote' column and 81 rows with missing values in each of the 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', and 'Number of Readmissions' features.
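
The point about shifting cutoffs (100, 300, 350, 800, 1000) can be made concrete by recomputing the same two summary numbers for any chosen discharge range. The sketch below is a minimal, standard-library illustration; `group_summary` and the toy records are hypothetical names and values invented here, not part of the CMS data or the preliminary report.

```python
from statistics import mean


def group_summary(records, low, high):
    """Summarise excess readmission ratios for hospitals whose
    discharge count lies in the half-open interval [low, high)."""
    ratios = [ratio for discharges, ratio in records if low <= discharges < high]
    if not ratios:
        return None  # no hospitals fall in this range
    share_above_one = sum(r > 1 for r in ratios) / len(ratios)
    return {
        "n": len(ratios),
        "mean_ratio": mean(ratios),
        "share_above_one": share_above_one,
    }


# Toy (discharges, ratio) pairs, invented purely for illustration.
toy_records = [(50, 1.04), (80, 0.99), (250, 1.01),
               (320, 0.97), (900, 1.02), (1500, 0.95)]

small_100 = group_summary(toy_records, 0, 100)   # "small" as in the statistics
small_300 = group_summary(toy_records, 0, 300)   # "small" as in the recommendation
print(small_100)
print(small_300)
```

Running the same function with both definitions of "small" shows how much the reported mean and share depend on a cutoff that the report never justifies.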

## B. Provide support for your arguments and your own recommendations with a statistically sound analysis:

**Let's start by inspecting data**

In [None]:
clean_hospital_read_df.sample(5)

In [None]:
clean_hospital_read_df.describe(include='all')

In [None]:
clean_hospital_read_df.info()

In [None]:
# Check the duplicate observations
clean_hospital_read_df.duplicated().sum()

I checked for duplicate observations so that they could be dropped. The result shows that there are no duplicate rows.

In [None]:
# Find the missing values
clean_hospital_read_df.isnull().sum()

There are 11497 missing values in the 'Footnote' feature. In addition, there are 81 missing values in each of the 'Excess Readmission Ratio', 'Predicted Readmission Rate', 'Expected Readmission Rate', and 'Number of Readmissions' features. I will handle these missing values in the following cells.

In [None]:
# Only 81 of the 11578 rows have missing values in these columns, so we can drop them.
hospital_df = clean_hospital_read_df.dropna(subset=['Excess Readmission Ratio','Predicted Readmission Rate','Expected Readmission Rate',
                                     'Number of Readmissions'])

In [None]:
# Drop 'Footnote' column
hospital_df = hospital_df.drop(columns=['Footnote'], errors='ignore')
hospital_df.sample(5)

## Setup an appropriate hypothesis test

In the conclusion of the preliminary report, it is stated that there is a significant correlation between hospital capacity (number of discharges) and readmission rates. I will build my hypothesis test around this claim.

**Null Hypothesis:** There is no correlation between the number of discharges and the excess readmission ratio.

**Alternative Hypothesis:** There is a correlation between the number of discharges and the excess readmission ratio.

Define the test statistic as the Pearson correlation coefficient (Pearson's r).

Significance level: $\alpha$ = 0.05 (95% confidence level)
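
For reference, the hypotheses can also be written formally in terms of the population correlation $\rho$. The $t$ statistic shown is the standard parametric test for a Pearson correlation and assumes approximately bivariate-normal data; it is included as background only, since the cells that follow build the null distribution by permutation instead.

$$H_0: \rho = 0 \qquad \text{vs.} \qquad H_1: \rho \neq 0$$

$$t = r\sqrt{\frac{n-2}{1-r^{2}}}, \qquad t \sim t_{n-2} \ \text{under } H_0$$

Here $r$ is the sample correlation and $n$ is the number of hospitals in the cleaned data.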

In [None]:
# Calculate the correlation coefficient
r=stats.pearsonr(hospital_df['Number of Discharges'], hospital_df['Excess Readmission Ratio'])
print("correlation coefficient of the two variables is:", r[0])

The correlation coefficient between the excess readmission ratio and the number of discharges is small in magnitude, so the linear relationship is weak.

### Compute and report the observed significance value(p-value)

In [None]:
# Define a helper that builds the null distribution of Pearson's r
def permute_stat(data_1, data_2, size):
    """Return `size` Pearson correlation coefficients, each computed on
    randomly permuted copies of the two data sets."""

    r = np.empty(size)

    # Fixed seed so the permutation replicates are reproducible
    np.random.seed(22)
    for i in range(size):
        # Shuffling breaks any pairing between the two variables
        syn_data1 = np.random.permutation(data_1)
        syn_data2 = np.random.permutation(data_2)
        r[i] = stats.pearsonr(syn_data1, syn_data2)[0]
    return r

In [None]:
# Calculate permutation replicates of the correlation coefficient, size 10000
r = permute_stat(hospital_df['Number of Discharges'], hospital_df['Excess Readmission Ratio'], 10000)

In [None]:
# Calculate standard deviation
np.std(r)

In [None]:
# fit a slope for interpretation
p = np.polyfit(hospital_df['Number of Discharges'], hospital_df['Excess Readmission Ratio'], 1)
print("coefficient = ", p[0])

In [None]:
plt.hist(r, bins = 100)
plt.xlabel('pearson r value')
plt.ylabel('counts')
plt.title('Pearson r of permuted data (null distribution)')

In [None]:
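# Calculate the p-value for the observed Pearson r of about -0.0973:
# the fraction of permutation replicates at least as negative as the observed value
p_val = np.sum(r <= -0.0973) / len(r)
print("p_value for the hospital dataset is:", p_val)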

r ≈ -0.097 (the threshold used in the cell above) is Pearson's sample correlation coefficient. It takes values between -1 and +1: values near +1 indicate a strong positive relationship, values near -1 a strong negative relationship, and values near 0 little or no linear relationship.

The p-value for this observation is lower than the significance level, so the null hypothesis is rejected. There is a statistically significant correlation between number of discharges and excess readmission ratio.
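
The permutation procedure can be condensed into a single function that returns both the observed correlation and a p-value expressed as a proportion. The sketch below uses only the standard library and toy data; `pearson_r`, `permutation_pvalue`, the two-sided comparison, and the add-one correction are choices made here for illustration, not the exact computation in the cells above. Shuffling one variable is enough to break the pairing.

```python
import math
import random


def pearson_r(x, y):
    """Plain-Python Pearson correlation coefficient."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / math.sqrt(sxx * syy)


def permutation_pvalue(x, y, n_perm=2000, seed=22):
    """Two-sided permutation p-value for the null of no correlation."""
    rng = random.Random(seed)      # local generator, reproducible
    observed = pearson_r(x, y)
    shuffled = list(y)
    extreme = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)      # break the pairing between x and y
        if abs(pearson_r(x, shuffled)) >= abs(observed):
            extreme += 1
    # Add-one correction keeps the estimate strictly above zero
    return observed, (extreme + 1) / (n_perm + 1)


# Toy data with a perfect negative linear relationship (illustration only).
x_toy = list(range(1, 21))
y_toy = [10 - 0.5 * v for v in x_toy]
observed, p_value = permutation_pvalue(x_toy, y_toy)
print(observed, p_value)
```

Reporting the p-value as a proportion between 0 and 1 makes it directly comparable with the chosen significance level.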

### Discuss statistical significance and practical significance. Do they differ here? How does this change your recommendation to the client?

**Discussion on statistical significance and practical significance:**

- Statistical significance refers to how unlikely it is that a result was obtained by chance alone, i.e., it is evidence that a relationship between two variables exists. Practical significance refers to whether that relationship is large enough to matter in the real-world situation.

- Statistical significance depends on the sample size; practical significance depends on external factors such as cost, time, and objectives.

- Statistical significance does not guarantee practical significance, but to be practically significant, a result must first be statistically significant.

Click on this [link](http://www.differencebetween.net/science/mathematics-statistics/difference-between-statistical-significance-and-practical-significance/#ixzz5ZEwMu3oW) to read more about "Statistical significance vs Practical significance" 

The idea of statistical significance is the unlikelihood that the observed statistic would occur through sampling variation alone. Usually, a hypothesis test only tells us whether there is or is not evidence of a relationship beyond sampling variation. It does not describe the strength of that relationship, even though it can support its existence. For example, across all hospitals, each increase of 100 discharges in capacity is associated with a decrease of only about 0.3% in the excess readmission ratio. The relationship between discharges and readmissions is so weak that there may be no practical use in addressing it. So it may not be very meaningful to act on the fact that the two are statistically significantly correlated.
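
The "about 0.3% per 100 discharges" figure is simple slope arithmetic, sketched below. The slope value is an assumption: it is back-calculated from that sentence (0.003 per 100 discharges, so roughly -3e-5 per discharge) and is not copied from the notebook's `np.polyfit` output. `ratio_change` is a hypothetical helper name.

```python
def ratio_change(slope_per_discharge, extra_discharges):
    """Predicted change in excess readmission ratio for a given
    increase in discharges, under a straight-line fit."""
    return slope_per_discharge * extra_discharges


# Assumed slope, back-calculated from the text; not the fitted value.
ASSUMED_SLOPE = -3e-5

change_100 = ratio_change(ASSUMED_SLOPE, 100)
change_1000 = ratio_change(ASSUMED_SLOPE, 1000)
print(change_100, change_1000)
```

Because the ratio is close to 1, a change of -0.003 is roughly a 0.3% decrease, and even 1000 additional discharges would move the ratio by only about 0.03 under this assumed slope.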

Adding an "effect size" measurement, in this case the Pearson r, tells us how strong the relationship is. The Pearson r can be classified roughly as follows: r ~ 0.1, the correlation is low; r ~ 0.3, the correlation is medium; r > 0.5, the correlation is large. Combined with statistical significance, this is one way to assess practical significance. Practical significance is usually judged relative to the field of study, and how strong is "strong" can differ by field and by the specific question. For this study of hospital readmissions, I would tell the client that there is a very weak correlation between hospital capacity and readmission, and that the relationship may not be strong enough to justify any conclusion or action.
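
The rule-of-thumb cut-offs can be applied mechanically, as in the sketch below. `classify_r` and `variance_explained` are hypothetical helpers; treating everything below 0.3 as "low" is a simplification made here, and the value -0.0973 is the threshold used in the p-value cell, taken as the observed correlation.

```python
def classify_r(r):
    """Label the strength of a correlation using the rough
    0.3 / 0.5 cut-offs quoted in the text (sign is ignored)."""
    size = abs(r)
    if size >= 0.5:
        return "large"
    if size >= 0.3:
        return "medium"
    return "low"


def variance_explained(r):
    """Share of variance in one variable accounted for by a
    linear relationship with the other (r squared)."""
    return r * r


observed_r = -0.0973  # value used as the threshold in the p-value cell
label = classify_r(observed_r)
r_squared = variance_explained(observed_r)
print(label, r_squared)
```

An r of this size corresponds to less than 1% of the variance in the excess readmission ratio, which is consistent with describing the relationship as very weak.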

### Look at the scatterplot above.
**What are the advantages and disadvantages of using this plot to convey information?**

Scatter plots are good for visualizing the relationship between continuous variables, but without a sound statistical analysis it is not appropriate to draw conclusions from a scatter plot alone.

**Construct another plot that conveys the same information in a more direct manner.**

The scatter plot shows too much information in a small space. A better visual would be a joint plot, which adds the marginal distributions and a fitted regression line.

In [None]:
sns.jointplot(x='Number of Discharges', y='Excess Readmission Ratio', data=hospital_df, kind='reg')

In [None]:
plt.hist('Number of Readmissions', bins=100, data=hospital_df)
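
Another way to convey the same information more directly is to summarise the excess readmission ratio within discharge bins and plot one point per bin. The sketch below computes such a summary with the standard library only; `binned_means`, the bin edges, and the toy records are hypothetical and chosen for illustration.

```python
from statistics import mean


def binned_means(records, edges):
    """Mean excess readmission ratio per discharge bin.
    Bins are half-open: [edges[i], edges[i + 1])."""
    summary = []
    for low, high in zip(edges[:-1], edges[1:]):
        ratios = [ratio for discharges, ratio in records if low <= discharges < high]
        bin_mean = mean(ratios) if ratios else None  # None marks an empty bin
        summary.append((low, high, len(ratios), bin_mean))
    return summary


# Toy (discharges, ratio) pairs, invented purely for illustration.
toy_records = [(50, 1.04), (80, 1.00), (250, 1.01),
               (320, 0.97), (900, 1.02), (1500, 0.96)]

bins = binned_means(toy_records, [0, 100, 500, 1000, 2000])
for low, high, count, bin_mean in bins:
    print(f"[{low}, {high}): n={count}, mean ratio={bin_mean}")
```

The resulting per-bin means could be drawn as a bar or line chart, which makes any trend across hospital sizes visible without the overplotting of the raw scatter.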