In this notebook we will explore the IPPS (Inpatient Prospective Payment System) dataset.

# Imports

In [None]:
%matplotlib inline
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

In [None]:
data = pd.read_csv('IPPS_2013.csv')

data.head()

The data contains the DRG (diagnosis-related group) definition, information about the provider, the number of discharges for each DRG, and the average covered charges, total payments, and Medicare payments for each DRG for each provider.

Let's try grouping by DRG and looking at how much the charges and payments vary by provider.  We'll first need to convert the strings representing the dollar amounts into numbers.

In [None]:
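def str_to_num(s):
    """
    Return a float representing the dollar amount in string s
    String s is of the form '$xxxx.xx'; thousands separators are tolerated
    """
    # Strip whitespace, the leading dollar sign, and any commas before converting
    return float(s.strip().lstrip('$').replace(',', ''))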

data['Average Covered Charges Num'] = data['Average Covered Charges'].apply(str_to_num)
data['Average Total Payments Num'] = data['Average Total Payments'].apply(str_to_num)
data['Average Medicare Payments Num'] = data['Average Medicare Payments'].apply(str_to_num)
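
The row-by-row conversion above can also be done on the whole column at once. The following is a minimal sketch, not the notebook's method: the sample values and the helper name `dollars_to_float` are made up for illustration, and it assumes that any value that fails to parse should become `NaN` rather than raise.

```python
import pandas as pd

# Hypothetical sample values; the real column comes from IPPS_2013.csv.
charges = pd.Series(['$32963.07', '$1,234.50', ' $15131.85', 'n/a'])

def dollars_to_float(series):
    """Strip '$', ',' and whitespace, then parse; unparseable values become NaN."""
    cleaned = series.astype(str).str.replace(r'[$,\s]', '', regex=True)
    return pd.to_numeric(cleaned, errors='coerce')

charges_num = dollars_to_float(charges)
print(charges_num)
```

Coercing bad values to `NaN` makes it easy to count them afterwards with `charges_num.isna().sum()` instead of having the whole cell fail on the first malformed string.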

In [None]:
drgs = data.groupby('DRG Definition')
cols = ['Average Covered Charges Num', 'Average Total Payments Num', 'Average Medicare Payments Num']

n = 10  # Limit the number to show
i = 1
plt.figure(figsize=(20,60))
for drg_name,drg in drgs:
    for j,col in enumerate(cols):
        plt.subplot(n, 3, i)
        drg[col].hist(bins=25)
        plt.xlim(left=0)  # Anchor the x-axis at zero
        plt.xlabel('Amount ($)')
        plt.ylabel('Count')
        plt.legend([col])
        if j == 1:
            plt.title(drg_name)
        i += 1
    if i > n * len(cols):
        break


The distributions share a common pattern: they are skewed to the right.  This is expected to some extent, because amounts are bounded below at zero but have no upper bound.  The covered charges also appear to be substantially higher than either the total payments or the Medicare payments.
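
Both observations can be checked numerically rather than by eye. The sketch below uses synthetic data in place of a single DRG group: the lognormal draws and the 2x to 4x markup are assumptions chosen only to produce a right-skewed example, not properties of the IPPS data.

```python
import numpy as np
import pandas as pd

# Synthetic stand-in for one DRG group; lognormal draws are right-skewed by construction.
rng = np.random.default_rng(0)
payments = rng.lognormal(mean=9.0, sigma=0.4, size=500)
toy = pd.DataFrame({
    'Average Total Payments Num': payments,
    # Assumed markup between 2x and 4x, purely for illustration.
    'Average Covered Charges Num': payments * rng.uniform(2.0, 4.0, size=500),
})

# Positive sample skewness indicates a longer right tail.
skewness = toy.skew()

# Median ratio of charges to payments across providers.
ratio = (toy['Average Covered Charges Num'] / toy['Average Total Payments Num']).median()

print(skewness)
print(ratio)
```

On the notebook's own data, something like `data.groupby('DRG Definition')[cols].skew()` should give one skewness value per DRG and column.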

We have geocoded the providers in the dataset and saved the data in another CSV file.  Let's load this and join it with our existing data to look at things as a function of latitude and longitude.

## Note:
Geocoding is still in progress: rate limits prevent me from doing it all at once, so `provider_geocodes.csv` isn't complete yet.  The scatter plots have also been buggy; a likely cause is that `DataFrame.plot` draws on a new figure rather than the current subplot unless it is given an explicit `ax`.

In [None]:
geo_data = pd.read_csv('provider_geocodes.csv')
data = data.merge(geo_data, how='left')

data.head()
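
Because the geocode file is incomplete, it can help to make the join key explicit and count how many rows came back without coordinates. This is a sketch on toy frames: using `Provider Id` as the shared key is an assumption about both files, and the coordinate values are invented.

```python
import pandas as pd

# Toy frames; 'Provider Id' as the shared key is an assumption about both files.
providers = pd.DataFrame({'Provider Id': [10001, 10005, 10006],
                          'Provider Name': ['A', 'B', 'C']})
geocodes = pd.DataFrame({'Provider Id': [10001, 10006],
                         'Latitude': [31.2, 34.8],
                         'Longitude': [-85.4, -87.7]})

# validate raises if a provider appears more than once in the geocode table;
# indicator adds a '_merge' column marking rows that found no match.
merged = providers.merge(geocodes, on='Provider Id', how='left',
                         validate='many_to_one', indicator=True)

n_missing = (merged['_merge'] == 'left_only').sum()
print(merged)
print(f'{n_missing} of {len(merged)} rows have no geocode yet')
```

Without `on=`, `merge` joins on every column name the two frames share, which can silently drop matches if an unrelated column happens to overlap.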

In [None]:
# Keep only providers that have been geocoded so far
drgs = data[data.Latitude.notnull()].groupby('DRG Definition')
cols = ['Average Covered Charges Num', 'Average Total Payments Num', 'Average Medicare Payments Num']

n = 10  # Limit the number to show
i = 1
plt.figure(figsize=(20,60))
for drg_name,drg in drgs:
    for j,col in enumerate(cols):
        ax = plt.subplot(n, 3, i)
        # Pass ax explicitly; otherwise DataFrame.plot opens a new figure per call.
        # Longitude goes on x and latitude on y so the points are laid out like a map.
        drg.plot(kind='scatter', x='Longitude', y='Latitude', c=col, ax=ax)
        ax.set_xlabel('Longitude')
        ax.set_ylabel('Latitude')
        if j == 1:
            ax.set_title(drg_name)
        i += 1
    if i > n * len(cols):
        break