# Model Exploration

In this notebook, we explore several models for predicting cost from location and type of procedure.

## Style Fix

In [None]:
%%html
<style>
table {float:left}
</style>

## Imports

In [None]:
import loadAndClean
import pandas as pd
import numpy as np
from sklearn import cross_validation
from sklearn.metrics import mean_squared_error
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
import xgboost as xgb

## Load Cleaned Data

In [None]:
X = loadAndClean.loadAndClean()
X.describe()

## Cross Validation Function

In [None]:
def crossVal(clf, X, y, stratify_series=None, cv=3):
    """Fit clf on `cv` random 50/50 splits; print each split's RMSE and the average."""
    # Stratify on DRG Code by default so each DRG keeps the same share in both halves
    if stratify_series is None:
        stratify_series = X['DRG Code']
    scores = []
    for i in range(cv):
        X_train, X_test, y_train, y_test = cross_validation.train_test_split(X, y, test_size=0.5, stratify=np.array(stratify_series))
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        # RMSE is the square root of the mean squared error
        scores.append(mean_squared_error(y_test, predictions)**0.5)
        print(scores[i])
    print("Average RMSE: ${:,.2f}".format(np.mean(scores)))
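
The score reported by `crossVal` is the root mean squared error, which is in the same units as the target (dollars). As a minimal sketch of the arithmetic, using only the standard library and made-up payment values (the `rmse` helper is hypothetical, not part of the notebook):

```python
import math

def rmse(y_true, y_pred):
    """Square root of the mean of the squared prediction errors."""
    squared_errors = [(t - p) ** 2 for t, p in zip(y_true, y_pred)]
    return math.sqrt(sum(squared_errors) / len(squared_errors))

# Made-up values, for illustration only
actual = [5000.0, 7000.0, 12000.0]
predicted = [6000.0, 7000.0, 10000.0]

score = rmse(actual, predicted)
print("RMSE: ${:,.2f}".format(score))  # prints "RMSE: $1,290.99"
```

Because the errors are squared before averaging, the single $2,000 miss contributes four times as much as the $1,000 miss.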

## Baseline Models

For our first baseline model, we will simply predict the average cost, ignoring the features entirely.

In [None]:
class baseline(object):
    def __init__(self):
        self.has_fit = False
        
    def fit(self, X_train, y_train):
        self.average_value = y_train.mean()
        self.has_fit = True

    def predict(self, X_test):
        if self.has_fit:
            return np.ones((len(X_test),)) * self.average_value
        return None

alg = baseline()
crossVal(alg, X, X['Average Medicare Payments Num'])

As expected, this model did not perform particularly well: the root mean squared error (RMSE) was ~$7,500.
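
One way to read this number: among all constant predictions, the mean minimizes squared error, and on the data it was computed from, its RMSE equals the population standard deviation of the target. The baseline score is therefore roughly the spread of the payments themselves. A minimal sketch with made-up costs (Python 3 standard library only):

```python
import math
import statistics

# Made-up costs, for illustration only
costs = [4000.0, 6000.0, 8000.0, 22000.0]

mean_cost = statistics.mean(costs)

# RMSE of predicting mean_cost for every row
rmse_of_mean = math.sqrt(sum((c - mean_cost) ** 2 for c in costs) / len(costs))

# Population standard deviation of the same values
spread = statistics.pstdev(costs)

print("RMSE of mean predictor: {:,.2f}".format(rmse_of_mean))
print("Population std dev:     {:,.2f}".format(spread))
```

Any model worth keeping should score well below this figure.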

For a more sophisticated baseline model, we will predict the average cost for the given Diagnosis-Related Group (DRG).

In [None]:
class grouped_baseline(object):
    def __init__(self):
        self.has_fit = False

    def fit(self, X_train, y_train):
        X_train = X_train.copy()
        X_train['Cost'] = y_train
        # Average the cost for each DRG and store it as {DRG Code: mean cost}
        self.drg_costs = X_train.groupby('DRG Code')['Cost'].mean().to_dict()

        self.has_fit = True

    def predict(self, X_test):
        if self.has_fit:
            return X_test['DRG Code'].apply(lambda x: self.drg_costs[x])
        return None

alg = grouped_baseline()
crossVal(alg, X, X['Average Medicare Payments Num'])

This model did much better, with an RMSE of ~$2,900. However, we think we can improve on this by also using the location information.
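
Note that `predict` looks up every DRG in `self.drg_costs`, so a DRG that never appeared in the training half would raise a `KeyError`; stratifying the split on DRG Code avoids this in practice. A minimal sketch of a variant that falls back to the overall mean for unseen groups, assuming plain Python lists instead of a DataFrame (the function names and data are hypothetical):

```python
from collections import defaultdict

def fit_group_means(groups, costs):
    """Return ({group: mean cost}, overall mean cost) for the training data."""
    totals = defaultdict(float)
    counts = defaultdict(int)
    for group, cost in zip(groups, costs):
        totals[group] += cost
        counts[group] += 1
    group_means = {g: totals[g] / counts[g] for g in totals}
    overall_mean = sum(costs) / len(costs)
    return group_means, overall_mean

def predict_group_means(groups, group_means, overall_mean):
    """Look up each group's mean, falling back to the overall mean if unseen."""
    return [group_means.get(g, overall_mean) for g in groups]

# Made-up DRG codes and costs, for illustration only
train_drgs = [39, 39, 57, 57]
train_costs = [5000.0, 7000.0, 10000.0, 14000.0]
group_means, overall_mean = fit_group_means(train_drgs, train_costs)

# DRG 64 was not in the training data, so it gets the overall mean
predictions = predict_group_means([39, 57, 64], group_means, overall_mean)
print(predictions)  # [6000.0, 12000.0, 9000.0]
```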

## Random Forest Regressor

We will start with a Random Forest, since we have had success with them in the past.  This time, however, it will need to be a Random Forest Regressor instead of a Random Forest Classifier since we want it to predict a continuous value.

In [None]:
predictors = ['Latitude','Longitude','DRG Code']
alg = RandomForestRegressor(n_estimators=50)
crossVal(alg, X[predictors], X['Average Medicare Payments Num'])

It did better than either of our baseline models, with an RMSE of ~$2,300, but we're not sure whether this is a great result. We'll try some other models to see if we can do better.

## Linear Regression

The next model we will try is Linear Regression. We don't have high hopes for it if we just use raw Latitude, Longitude, and DRG Code as features, since cost is probably not linearly related to any of them.

In [None]:
predictors = ['Latitude','Longitude','DRG Code']
alg = LinearRegression()
crossVal(alg, X[predictors], X['Average Medicare Payments Num'])

As expected, this didn't do too well, resulting in an RMSE of ~$7,500 (about the same as our initial baseline model). We can help the model by one-hot encoding the DRG Codes, so we'll try that next.
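
A DRG Code is a category label, not a quantity, so a linear model fed the raw code can only learn one slope per unit of code. One-hot encoding replaces the column with one 0/1 indicator per code, which lets the model learn a separate offset for each DRG. As a toy sketch of the transformation (`one_hot` is a hypothetical helper written for illustration; the notebook itself uses `pd.get_dummies`):

```python
def one_hot(values):
    """Return (sorted categories, one indicator row per input value)."""
    categories = sorted(set(values))
    rows = [[1 if value == category else 0 for category in categories]
            for value in values]
    return categories, rows

# Made-up DRG codes, for illustration only
drg_codes = [57, 39, 57, 64]
categories, rows = one_hot(drg_codes)

print(categories)  # [39, 57, 64]
for code, row in zip(drg_codes, rows):
    print(code, row)
```

Each row contains exactly one 1, in the column matching that row's code.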

In [None]:
one_hot_drg = pd.get_dummies(X['DRG Code'])

predictors = ['Latitude','Longitude']
X_one_hot = pd.concat([X[predictors], one_hot_drg], axis=1)

alg = LinearRegression()
crossVal(alg, X_one_hot, X['Average Medicare Payments Num'], X['DRG Code'])

That helped a lot! The RMSE is now ~$2,800, slightly better than our grouped baseline model but not as good as the Random Forest. We might do better by replacing the continuous Latitude and Longitude values, which are probably not linearly related to cost, with the discrete Hospital Referral Regions (HRRs).

In [None]:
predictors = ['Provider HRR Num']
X_one_hot = pd.concat([X[predictors], one_hot_drg], axis=1)

alg = LinearRegression()
crossVal(alg, X_one_hot, X['Average Medicare Payments Num'], X['DRG Code'])

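Using the raw HRR numbers resulted in an RMSE of ~$2,900, slightly worse than with Latitude and Longitude. Like DRG Codes, HRR numbers are labels rather than quantities, so one-hot encoding them will probably help.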

In [None]:
one_hot_hrr = pd.get_dummies(X['Provider HRR Num'])

X_one_hot = pd.concat([one_hot_hrr, one_hot_drg], axis=1)

alg = LinearRegression()
crossVal(alg, X_one_hot, X['Average Medicare Payments Num'], X['DRG Code'])

The one-hot encoding brought the RMSE down to ~$2,400, but this still isn't quite as good as our Random Forest model.

## eXtreme Gradient Boosting

In [None]:
predictors = ['Latitude','Longitude','DRG Code']
alg = xgb.XGBRegressor(n_estimators=5000)
crossVal(alg, X[predictors], X['Average Medicare Payments Num'])

This is the best model so far, with an RMSE of ~$1,800.

## Summary of Model Results


| Model                       | RMSE
| :---                        | ---:
| Baseline                    | \$7,500
| Grouped Baseline            | \$2,900
| Random Forest (50)          | \$2,300
| Linear Reg. (one-hot enc.)  | \$2,400
| XGB (5000)                  | \$1,800
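
To put these figures on a common scale, the sketch below expresses each RMSE as a percentage reduction relative to the simple baseline. The inputs are the rounded values from the table above, so the percentages are approximate:

```python
# RMSE figures (USD) copied from the summary table; rounded, so approximate
rmse_by_model = {
    "Baseline": 7500,
    "Grouped Baseline": 2900,
    "Random Forest": 2300,
    "Linear Reg. (one-hot enc.)": 2400,
    "XGB": 1800,
}

baseline_rmse = rmse_by_model["Baseline"]

# Percentage by which each model's RMSE is lower than the baseline's
reduction_pct = {
    name: 100.0 * (baseline_rmse - rmse) / baseline_rmse
    for name, rmse in rmse_by_model.items()
}

for name, pct in reduction_pct.items():
    print("{:<28} {:5.1f}% lower than baseline".format(name, pct))
```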