# Model Exploration

In this notebook, we explore several models for predicting the average Medicare payment from a provider's location and the type of procedure (DRG code).

## Style Fix

In [1]:
%%html
<style>
table {float:left}
</style>

## Imports

In [2]:
import loadAndClean
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn import model_selection as cross_validation  # sklearn.cross_validation was removed in scikit-learn 0.20
from sklearn.metrics import mean_squared_error
import xgboost as xgb

## Load Cleaned Data and Split for Cross Validation

In [3]:
X = loadAndClean.loadAndClean()
X.describe()

y = X['Average Medicare Payments Num']
X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.5, stratify=np.array(X['DRG Code']))

## Baseline Models

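For our first baseline model, we simply predict the average training cost for every row, ignoring the features. We first define a `crossVal` helper that averages the MSE over several random stratified splits; it is used later in the notebook.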

In [4]:
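def crossVal(clf, X, y, cv=3):
    """Print the MSE of clf on cv random 50/50 splits stratified on DRG Code, then the average."""
    scores = []
    for i in range(cv):
        # Draw a fresh random split each round; no seed is set, so scores vary between runs
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.5, stratify=np.array(X['DRG Code']))
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        scores.append(mean_squared_error(y_test, predictions))
        print(scores[i])
    # A single formatted argument prints the same way under Python 2 and Python 3
    print("Average MSE:  {}".format(np.mean(scores)))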
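
The repeated stratified splitting done by `crossVal` can also be expressed with scikit-learn's own splitters. The following is only a sketch on synthetic data, assuming a scikit-learn version that ships `sklearn.model_selection` (0.18 or later); the `toy` frame and its values are hypothetical stand-ins for the cleaned data.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import StratifiedShuffleSplit, cross_val_score

# Synthetic stand-in for the cleaned data (hypothetical values).
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'Latitude': rng.uniform(25, 48, size=120),
    'Longitude': rng.uniform(-120, -70, size=120),
    'DRG Code': np.repeat([39, 57, 64], 40),
})
toy_y = 100.0 * toy['DRG Code'] + rng.normal(0, 50, size=120)

# Three random 50/50 splits, stratified on DRG Code like crossVal.
splitter = StratifiedShuffleSplit(n_splits=3, test_size=0.5, random_state=0)
splits = list(splitter.split(toy, toy['DRG Code']))

# scikit-learn reports negated MSE so that larger is always better.
neg_mse = cross_val_score(LinearRegression(), toy, toy_y,
                          cv=splits, scoring='neg_mean_squared_error')
mse_scores = -neg_mse
print("Average MSE: {:.2f}".format(mse_scores.mean()))
```

Passing `random_state` to the splitter makes the three splits repeatable, which the hand-written helper does not do.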

In [5]:
class baseline(object):
    def __init__(self):
        self.has_fit = False
        
    def fit(self, X_train, y_train):
        self.average_value = y_train.mean()
        self.has_fit = True

    def predict(self, X_test):
        if self.has_fit:
            return np.ones((len(X_test),)) * self.average_value
        return None

alg = baseline()
alg.fit(X_train, y_train)
predictions = alg.predict(X_test)
mean_squared_error(y_test, predictions)

56822730.795006938

As expected, this model did not perform particularly well. The root mean squared error (RMSE), the square root of the MSE shown above, is $7,538.09.
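
The RMSE is the square root of the MSE, which puts the error back in dollars. A minimal sketch of that conversion follows; the `rmse` helper and the payment values are hypothetical and not part of the notebook.

```python
import numpy as np
from sklearn.metrics import mean_squared_error

def rmse(y_true, y_pred):
    """Root mean squared error, in the same units as y_true."""
    return np.sqrt(mean_squared_error(y_true, y_pred))

# Hypothetical payments in dollars, scored against their own mean.
# A constant prediction is what the baseline model produces.
y_true = np.array([3000.0, 5000.0, 7000.0, 9000.0])
y_pred = np.full(len(y_true), y_true.mean())
print(rmse(y_true, y_pred))
```

When the constant is the mean of the same values being scored, as in this toy case, the RMSE equals their standard deviation.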

For a more sophisticated baseline model, we will predict the average cost for the given DRG.

In [6]:
class grouped_baseline(object):
    def __init__(self):
        self.has_fit = False

    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        X_train['Cost'] = y_train
        groups = X_train.groupby(['DRG Code'])

        # Average the cost for each DRG
        self.drg_costs = {}
        for ind,data in groups:
            self.drg_costs[ind] = data['Cost'].mean()

        self.has_fit = True

    def predict(self, X_test):
        if self.has_fit:
            return X_test['DRG Code'].apply(lambda x: self.drg_costs[x])
        return None

alg = grouped_baseline()
alg.fit(X_train, y_train)
predictions = alg.predict(X_test)
mean_squared_error(y_test, predictions)

8550945.6168891862

This model did much better, with an RMSE of $2,924.20. However, we think we can improve on it by also using the location information.
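
The per-DRG average can also be computed without an explicit loop. The sketch below uses made-up rows and adds one assumption that `grouped_baseline` does not make: a DRG code missing from the training data falls back to the overall mean, whereas the dictionary lookup in `predict` would raise a `KeyError`.

```python
import pandas as pd

# Hypothetical training rows: two DRG codes with known payments.
train = pd.DataFrame({'DRG Code': [39, 39, 57, 57],
                      'Cost': [4000.0, 6000.0, 9000.0, 11000.0]})
# Test rows, including a DRG code (64) absent from training.
test = pd.DataFrame({'DRG Code': [39, 57, 64]})

# Per-DRG mean cost, plus the overall mean as a fallback.
drg_means = train.groupby('DRG Code')['Cost'].mean()
overall_mean = train['Cost'].mean()

# map() looks each code up in drg_means; unseen codes become NaN,
# which fillna() replaces with the overall mean.
predictions = test['DRG Code'].map(drg_means).fillna(overall_mean)
print(predictions.tolist())  # [5000.0, 10000.0, 7500.0]
```

Because the notebook's split is stratified on `DRG Code`, an unseen code should be rare, so the fallback is a safeguard rather than a necessity.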

## Random Forest Regressor

We will start with a Random Forest, since we have had success with them in the past.  This time, however, it will need to be a Random Forest Regressor instead of a Random Forest Classifier since we want it to predict a continuous value.

In [7]:
predictors = ['Latitude','Longitude','DRG Code']
alg = RandomForestRegressor(n_estimators=50)
alg.fit(X_train[predictors], y_train)
predictions = alg.predict(X_test[predictors])
mean_squared_error(y_test, predictions)

5407738.3686212441

It did better than either of our baseline models, with an RMSE of $2,325.45, but we're not sure whether this is a good result. We'll try some other models to see if we can do better.
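
Neither the train/test split nor the forest is seeded in this notebook, so the MSE shifts from run to run. One way to make a score repeatable is to pass `random_state` to both, sketched here on synthetic data; the `seeded_mse` helper and all values are hypothetical.

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic stand-in data (hypothetical): cost depends on a code and location.
rng = np.random.RandomState(0)
features = np.column_stack([rng.uniform(25, 48, 200),        # latitude
                            rng.uniform(-120, -70, 200),     # longitude
                            rng.choice([39, 57, 64], 200)])  # DRG code
cost = 100.0 * features[:, 2] + 10.0 * features[:, 0] + rng.normal(0, 50, 200)

def seeded_mse(seed):
    """Split and fit with a fixed seed, then return the test MSE."""
    f_train, f_test, c_train, c_test = train_test_split(
        features, cost, test_size=0.5, random_state=seed)
    forest = RandomForestRegressor(n_estimators=50, random_state=seed)
    forest.fit(f_train, c_train)
    return mean_squared_error(c_test, forest.predict(f_test))

# Running twice with the same seed should give the same score.
print(seeded_mse(0) == seeded_mse(0))
```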

## Linear Regression

The next model we will try is Linear Regression, but our expectations are low because the mapping from latitude and longitude to cost isn't really linear.

In [8]:
predictors = ['Latitude','Longitude','DRG Code']
alg = LinearRegression()
alg.fit(X_train[predictors], y_train)
predictions = alg.predict(X_test[predictors])
mean_squared_error(y_test, predictions)

56162513.386226073

As expected, this didn't do well, resulting in an RMSE of $7,494.17. Part of the problem is that `DRG Code` is a categorical feature that enters the model as a plain number. We can help the model by one-hot encoding it, so we'll try that next.

In [9]:
# Linear Regression with one-hot encoded DRG Code

In [10]:
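# NOTE: the one-hot encoding is not implemented yet. This cell refits the
# LinearRegression from In [8] on the same raw predictors, so the MSE is unchanged.
predictors = ['Latitude','Longitude','DRG Code']

alg.fit(X_train[predictors], y_train)
predictions = alg.predict(X_test[predictors])
mean_squared_error(y_test, predictions)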

56162513.386226073
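
The one-hot encoding mentioned above could be sketched with `pd.get_dummies`. This is an illustration on synthetic data, not a result for the Medicare data: the `toy` frame, the `drg_cost` mapping, and every value in them are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Synthetic stand-in (hypothetical): cost depends on the DRG code in a
# way that is not linear in the code's numeric value.
rng = np.random.RandomState(0)
toy = pd.DataFrame({
    'Latitude': rng.uniform(25, 48, 300),
    'Longitude': rng.uniform(-120, -70, 300),
    'DRG Code': rng.choice([39, 57, 64], 300),
})
drg_cost = {39: 9000.0, 57: 3000.0, 64: 6000.0}
toy_y = toy['DRG Code'].map(drg_cost) + rng.normal(0, 100, 300)

# One indicator column per DRG code; Latitude and Longitude pass through.
encoded = pd.get_dummies(toy, columns=['DRG Code'], dtype=float)

raw_fit = LinearRegression().fit(toy, toy_y)
encoded_fit = LinearRegression().fit(encoded, toy_y)

# In-sample MSE for both fits, to compare the two representations.
raw_mse = mean_squared_error(toy_y, raw_fit.predict(toy))
encoded_mse = mean_squared_error(toy_y, encoded_fit.predict(encoded))
print("raw: {:.0f}  one-hot: {:.0f}".format(raw_mse, encoded_mse))
```

With indicator columns the model can learn a separate offset for each DRG code, instead of forcing cost to rise or fall steadily with the code's number.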

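## XGBoost

Next we try gradient-boosted trees with XGBoost, scored with the `crossVal` helper over three random splits.

In [11]: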
predictors = ['Latitude','Longitude','DRG Code']
alg = xgb.XGBRegressor(n_estimators=5000)
crossVal(alg,X[predictors], X['Average Medicare Payments Num'])

3381361.74153
3081999.0194
3254480.90949
Average MSE:  3239280.55681


In [12]:
# Linear Regression using HRR

## Summary of Model Results


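| Model                           | MSE           |
| :---                            | --:           |
| Baseline (overall mean)         | 56,822,730.80 |
| Grouped baseline (mean per DRG) | 8,550,945.62  |
| Random Forest Regressor         | 5,407,738.37  |
| Linear Regression               | 56,162,513.39 |
| XGBoost (average of 3 splits)   | 3,239,280.56  |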