This notebook runs a parameter sweep for the k-Nearest Neighbors (k-NN) regression model.

## Style Fix

In [None]:
%%html
<style>
table {float:left}
</style>

## Imports

In [None]:
import loadAndClean
import random
import numpy as np
from sklearn.metrics import mean_squared_error
from sklearn.neighbors import KNeighborsRegressor

## Load data

In [None]:
X = loadAndClean.loadAndClean()
X.head(3)

## Cross Validation Function

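This function ensures that a given provider appears in either the training set or the testing set, but never both. It draws 500 providers at random for the testing set and redraws until both sets contain every DRG Code, so the model is always trained on, and evaluated against, all possible DRG Codes. The returned score is the RMSE averaged over `cv` such random splits.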

In [None]:
def crossVal(clf, X, predictors, cv=3):
    random.seed(6)  # fixed seed so every parameter setting sees the same splits
    scores = []
    # random.sample needs a sequence, so convert the array of unique ids to a list
    providerIds = list(X['Provider Id'].unique())
    numCodes = len(X['DRG Code'].unique())
    for i in range(cv):
        # Redraw the test providers until both sets contain every DRG Code
        while True:
            testIds = random.sample(providerIds, 500)
            isTest = X['Provider Id'].isin(testIds)
            testData = X[isTest]
            trainData = X[~isTest]
            if len(testData['DRG Code'].unique()) == numCodes and len(trainData['DRG Code'].unique()) == numCodes:
                break
        X_train = trainData[predictors]
        y_train = trainData['Average Medicare Payments Num']
        X_test = testData[predictors]
        y_test = testData['Average Medicare Payments Num']
        clf.fit(X_train, y_train)
        predictions = clf.predict(X_test)
        # RMSE on the held-out providers
        scores.append(mean_squared_error(y_test, predictions)**0.5)

    return np.mean(scores)
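
The provider-disjoint split inside `crossVal` can be illustrated on toy data in plain Python. This is only a sketch: the `split_by_provider` helper, the `provider`/`drg` keys, and the `max_tries` safeguard are assumptions made for illustration and are not part of the notebook.

```python
import random

def split_by_provider(rows, n_test_providers, rng, max_tries=1000):
    """Split rows so no provider is on both sides and both sides see every DRG code."""
    providers = sorted({r["provider"] for r in rows})
    all_codes = {r["drg"] for r in rows}
    for _ in range(max_tries):
        # Sample whole providers, not individual rows
        test_ids = set(rng.sample(providers, n_test_providers))
        test = [r for r in rows if r["provider"] in test_ids]
        train = [r for r in rows if r["provider"] not in test_ids]
        # Accept the split only if both sides cover every DRG code
        if {r["drg"] for r in test} == all_codes and {r["drg"] for r in train} == all_codes:
            return train, test
    raise RuntimeError("no valid split found; try fewer test providers")

# Toy data: 6 hypothetical providers, each reporting two DRG codes
rows = [{"provider": p, "drg": d} for p in range(6) for d in (39, 57)]
train, test = split_by_provider(rows, n_test_providers=2, rng=random.Random(6))
```

Unlike the `while True` loop above, this sketch gives up after `max_tries` draws, which avoids looping forever when no valid split exists.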

## Grid Search

Now we will iterate exhaustively over some possible parameter values for the k-NN model.  The parameters we are looking at are `n_neighbors`, the number of neighbors used in the prediction, and `weights`, whether the neighbors are weighted equally or by the inverse of distance.  The features we'll be using as predictors in the model are Latitude, Longitude, and DRG Code.
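
To make the difference between the two `weights` options concrete, here is a minimal plain-Python sketch of a k-NN prediction from precomputed distances. The `knn_predict` helper and the toy numbers are hypothetical, and the handling of exact matches (distance 0) is a choice made for this sketch.

```python
def knn_predict(distances, targets, k, weights="uniform"):
    """Predict from the k nearest neighbors given precomputed distances."""
    nearest = sorted(zip(distances, targets))[:k]
    if weights == "uniform":
        # Every neighbor counts equally
        return sum(t for _, t in nearest) / len(nearest)
    # weights == "distance": in this sketch, exact matches decide the prediction alone
    exact = [t for d, t in nearest if d == 0]
    if exact:
        return sum(exact) / len(exact)
    # Otherwise weight each neighbor by the inverse of its distance
    w = [1.0 / d for d, _ in nearest]
    return sum(wi * t for wi, (_, t) in zip(w, nearest)) / sum(w)

distances = [1.0, 2.0, 4.0]
targets = [100.0, 200.0, 400.0]
uniform_pred = knn_predict(distances, targets, k=3)
distance_pred = knn_predict(distances, targets, k=3, weights="distance")
```

Here the inverse-distance prediction is pulled toward the closest neighbor's target, while the uniform prediction is the plain mean. With `k=1` the two schemes return the same value, since there is only one neighbor to weight.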

In [None]:
def gridsearch(X, weights, n_neighbors):
    predictors = ['Latitude','Longitude','DRG Code']
    best = [None, None, np.inf]  # [weights, n_neighbors, RMSE] of the best model so far
    # Rows are printed as a Markdown table so the output can be pasted into a cell
    line_str = '|  {: <8}  |  {: <3}  |  {}'
    print(line_str.format('w', 'n', 'RMSE'))
    print(line_str.format(':--', ':--', ':--'))
    for w in weights:
        for n in n_neighbors:
            alg = KNeighborsRegressor(n_neighbors=n, weights=w, n_jobs=4)
            # Mean RMSE over 10 provider-disjoint splits
            score = crossVal(alg, X, predictors, cv=10)
            if score < best[2]:
                best = [w, n, score]
            print(line_str.format(w, n, score))

    print('\nBest:')
    print('{: <10} {: <5} ${:,.2f}'.format(*best))

In [None]:
weights = ['uniform', 'distance']
n_neighbors = [1, 5, 10, 25, 50, 100]
gridsearch(X, weights, n_neighbors)

Results:

| w          | n     | RMSE
| :--        | :--   | :--
| uniform    | 1     | 2902.07093186
| uniform    | 5     | 2498.40236491
| uniform    | 10    | 2639.85794093
| uniform    | 25    | 3029.05165345
| uniform    | 50    | 3440.47657637
| uniform    | 100   | 3885.05167983
| distance   | 1     | 2902.07093186
| distance   | 5     | 2411.13212742
|**distance**|**10** |**2402.10445487**
| distance   | 25    | 2525.15988371
| distance   | 50    | 2726.00126531
| distance   | 100   | 2992.50544288


The model did best with `weights` set to 'distance' and a value for `n_neighbors` between 5 and 25, so let's zoom in there.
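
One way to pick the zoom window mechanically is to take the grid points on either side of the best coarse value. A minimal sketch, using the 'distance' scores from the table above rounded to two decimals; `refine_bounds` is a hypothetical helper, not part of the notebook.

```python
def refine_bounds(scores):
    """Return the grid values on either side of the best-scoring grid point."""
    grid = sorted(scores)
    best = min(grid, key=scores.get)  # grid value with the lowest RMSE
    i = grid.index(best)
    lower = grid[max(i - 1, 0)]
    upper = grid[min(i + 1, len(grid) - 1)]
    return lower, upper

# RMSE by n_neighbors for weights='distance', rounded from the coarse sweep
coarse = {1: 2902.07, 5: 2411.13, 10: 2402.10, 25: 2525.16, 50: 2726.00, 100: 2992.51}
bounds = refine_bounds(coarse)
```

For these scores the helper brackets the best coarse value, 10, with its neighbors 5 and 25, which is the window searched next.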

In [None]:
weights = ['distance']
n_neighbors = range(6, 25, 2)
gridsearch(X, weights, n_neighbors)

Results:

| w          | n    | RMSE
| :--        | :--  | :--
| distance   | 6    | 2402.47325159
|**distance**|**8** |**2396.76030874**
| distance   | 10   | 2402.10445487
| distance   | 12   | 2412.02137187
| distance   | 14   | 2427.56467998
| distance   | 16   | 2445.20653534
| distance   | 18   | 2463.62578598
| distance   | 20   | 2481.88697013
| distance   | 22   | 2499.45920116
| distance   | 24   | 2516.47302241

Zooming in even further, let's check `n_neighbors` of 7, 8, and 9 to find the victor.

In [None]:
weights = ['distance']
n_neighbors = [7, 8, 9]
gridsearch(X, weights, n_neighbors)

Results:

|  w         |  n    |  RMSE
|  :--       |  :--  |  :--
|  distance  |  7    |  2397.19221963
|**distance**|**8**  |**2396.76030874**
|  distance  |  9    |  2398.27932412

The best parameters for our k-NN model are `n_neighbors` of 8 and `weights` set to 'distance', with an RMSE of about $2,396.76 averaged over 10 splits.