# UPMCE Machine Learning Palooza
### Goal
Predict a patient's length of stay (LOS) in a hospital. Details about the challenge are in [this repo](https://git.tdc.upmc.edu/MachineLearning/ml-intro-spark-regression).

#### 1) Get my features before splitting

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn import metrics

from ohe import *

In [2]:
url = "C:/Users/dickm/Documents/Projects/ML/Source/UPMC/Pharmacy/visit_train_panda.csv"
#url = 'http://sparkdl04:50070/webhdfs/v1/palooza/data/visit_train_panda.csv?op=OPEN'
source_data = pd.read_csv(url)

# Parse the admission and discharge dates so day-of-week (DOW) fields can be derived
source_data.ArriveDate = pd.to_datetime(source_data.ArriveDate.astype(str))
source_data.DischargeDate = pd.to_datetime(source_data.DischargeDate.astype(str))

#Add DOW columns. N.B. Monday is 0 and Sunday is 6.
source_data['ArriveDateDOW'] = source_data.ArriveDate.dt.dayofweek
source_data['DischargeDateDOW'] = source_data.DischargeDate.dt.dayofweek

source_data.head()

Unnamed: 0,VisitID,Hospital,Dept_Code,PaymentType,Age,Race,Gender,FC,ArriveDate,DischargeDate,LOS,DXCODE,Description,DispenseID,DOC,ArriveDateDOW,DischargeDateDOW
0,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,2014-10-07,2014-10-10,3,197.7,SECOND MALIG NEO LIVE,1,10028,1,4
1,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,2014-10-07,2014-10-10,3,276.2,ACIDOSIS,1,10028,1,4
2,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,2014-10-07,2014-10-10,3,198.5,SECONDARY MALIG NEO B,1,10028,1,4
3,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,2014-10-07,2014-10-10,3,253.6,NEUROHYPOPHYSIS DIS N,1,10028,1,4
4,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,2014-10-07,2014-10-10,3,V58.66,LONG-TERM USE ASPIRIN,1,10028,1,4
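
As a quick check of the date handling in the cell above, here is a minimal sketch on the two dates of the first visit in the table. The `toy` frame and the `StayDays` column are made up for illustration and are not part of the notebook's data.

```python
import pandas as pd

# Toy frame mirroring the first visit in the table above (illustrative only).
toy = pd.DataFrame({"ArriveDate": ["2014-10-07"], "DischargeDate": ["2014-10-10"]})

# pd.to_datetime parses a whole column at once.
toy["ArriveDate"] = pd.to_datetime(toy["ArriveDate"])
toy["DischargeDate"] = pd.to_datetime(toy["DischargeDate"])

# dayofweek: Monday is 0 and Sunday is 6.
toy["ArriveDateDOW"] = toy["ArriveDate"].dt.dayofweek
toy["DischargeDateDOW"] = toy["DischargeDate"].dt.dayofweek

# Whole days between arrival and discharge (hypothetical helper column).
toy["StayDays"] = (toy["DischargeDate"] - toy["ArriveDate"]).dt.days
```

For this row the sketch gives a Tuesday arrival (1), a Friday discharge (4) and a 3-day difference, which agrees with the `LOS` value of 3 in the table. Whether `LOS` is always derived this way in the source data is an assumption.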


## N.B. - Sparse encode DXCODE
DXCODE is too large to one-hot encode densely, but it can be represented sparsely. It has over 4K unique values, so a dense matrix such as a pandas DataFrame, with one column per code and almost every cell zero, is impractical on a single machine.
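
To make "sparsely represented" concrete, here is a minimal sketch that one-hot encodes a short list of codes into a SciPy CSR matrix. The helper name `one_hot_sparse` and the `demo_codes` list are assumptions for illustration, and the sketch does not handle missing values.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp

def one_hot_sparse(values):
    """Return (CSR one-hot matrix, sorted category labels) for a 1-D sequence.

    Illustrative helper: assumes there are no missing values in `values`.
    """
    # factorize maps each distinct value to an integer column index.
    codes, labels = pd.factorize(pd.Series(values), sort=True)
    n_rows = len(codes)
    ones = np.ones(n_rows, dtype=np.float64)
    # Exactly one stored entry per row: (row i, column codes[i]) = 1.
    matrix = sp.csr_matrix((ones, (np.arange(n_rows), codes)),
                           shape=(n_rows, len(labels)))
    return matrix, list(labels)

# A few diagnosis codes taken from the table above, with one repeat.
demo_codes = ["197.7", "276.2", "198.5", "197.7", "V58.66"]
demo_matrix, demo_labels = one_hot_sparse(demo_codes)
```

Only one value per row is stored, so storage grows with the number of rows rather than with rows times distinct codes.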

In [29]:
# Raises TypeError: object-dtype (string) columns cannot be converted to a sparse matrix directly.
mytest = sp.csr_matrix(source_data.values)

TypeError: no supported conversion for types: (dtype('O'),)

### Pre-Training calls

In [3]:
# One-hot encode the categorical columns and convert everything to a sparse representation.
global_features = ['DXCODE', 'Age', 'Race', 'Gender', 'Hospital', 'ArriveDateDOW', 'DischargeDateDOW']
categorical_features = ['DXCODE', 'Race', 'Gender', 'Hospital', 'ArriveDateDOW', 'DischargeDateDOW']
source_data_categorical,_,_ = ohe_dataframe_to_sparse(source_data[global_features], categorical_features)
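
The `ohe` module is not shown in this notebook, so what `ohe_dataframe_to_sparse` does internally is unknown. The sketch below is a hypothetical stand-in built on scikit-learn's `OneHotEncoder`: it one-hot encodes the listed categorical columns, keeps the remaining columns as numeric, and stacks both into one CSR matrix. The name `encode_frame_sparse`, its return values and the `demo` frame are all assumptions.

```python
import pandas as pd
import scipy.sparse as sp
from sklearn.preprocessing import OneHotEncoder

def encode_frame_sparse(frame, categorical_columns):
    """Hypothetical stand-in, not the notebook's ohe helper.

    One-hot encodes `categorical_columns`, keeps the other columns as
    floats, and returns (CSR feature matrix, fitted encoder).
    """
    numeric_columns = [c for c in frame.columns if c not in categorical_columns]
    # OneHotEncoder returns a SciPy sparse matrix by default.
    encoder = OneHotEncoder(handle_unknown="ignore")
    encoded = encoder.fit_transform(frame[categorical_columns].astype(str))
    numeric = sp.csr_matrix(frame[numeric_columns].to_numpy(dtype=float))
    # Categorical indicator columns first, numeric columns last.
    return sp.hstack([encoded, numeric], format="csr"), encoder

# Small made-up frame using column names from this notebook.
demo = pd.DataFrame({
    "DXCODE": ["197.7", "276.2", "197.7"],
    "Age": [77, 77, 50],
    "Gender": ["M", "M", "F"],
})
demo_features, demo_encoder = encode_frame_sparse(demo, ["DXCODE", "Gender"])
```

Keeping the fitted encoder around would let the same column layout be applied to new data later, which is one plausible reason a helper like this returns more than the matrix.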

### Method 1 - Let's see what happens if I only encode DXCODE

In [6]:
# Prompt variables
evalPrompt = 'RMSE with dxcode only:'

# ID feature and target variables
#Using DXCODE only
feature_names = ['DXCODE']

TrainAndPrintEval(source_data_features, feature_names, evalPrompt)

NameError: name 'source_data_features' is not defined

### Method 2 - Let's try more features; DXCODE, Patient demos and Hospital

In [21]:
#ID feature and target variables
#Using DXCODE, patient demos and hospital as only feature
feature_names = ['DXCODE', 'Age', 'Race', 'Gender', 'Hospital']
categorical_features = ['DXCODE', 'Race', 'Gender', 'Hospital']

# One-hot encode the categorical columns and convert everything to a sparse representation.
source_data_features,_,_ = ohe_dataframe_to_sparse(source_data[feature_names], categorical_features)

#Split Data
target = source_data.LOS
source_features_train, source_features_test, target_train, target_test = train_test_split(source_data_features, target)

# Train the model with DXCODE, patient demographics and hospital as the only features
lin_reg = LinearRegression()
lin_reg.fit(source_features_train, target_train)

# Evaluate the model on the held-out rows
target_pred = lin_reg.predict(source_features_test)

print('RMSE with dxcode, demos & hospital: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(target_test, target_pred))))
print ('RMSE (standard): {:,.2f}'.format(
        np.sqrt(metrics.mean_squared_error(target_test, [target_test.mean()] * len(target_test)))))

RMSE with dxcode, demos & hospital: 6.19
RMSE (standard): 6.41
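
The "RMSE (standard)" line scores a constant prediction at the mean of the test target, which is the same quantity as the population standard deviation of that target. A minimal sketch with made-up LOS values (`toy_los` is illustrative only):

```python
import numpy as np

# Made-up length-of-stay values standing in for target_test.
toy_los = np.array([1.0, 2.0, 3.0, 10.0])

# What the "RMSE (standard)" line computes: always predict the test mean.
constant_pred = np.full(len(toy_los), toy_los.mean())
baseline_rmse = float(np.sqrt(np.mean((toy_los - constant_pred) ** 2)))

# The same number via the population standard deviation (ddof=0).
population_std = float(np.std(toy_los))
```

Read this way, the Method 2 result of 6.19 against a baseline of 6.41 is an improvement of about 3% over simply predicting the average stay.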


### Method 3 - Let's try Method 2 with DOW for Arrival and Discharge

In [12]:
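# Identify feature and target variables
# Using DXCODE, patient demographics, hospital and arrival/discharge DOW
feature_names = ['DXCODE', 'Age', 'Race', 'Gender', 'Hospital', 'ArriveDateDOW', 'DischargeDateDOW']
categorical_features = ['DXCODE', 'Race', 'Gender', 'Hospital', 'ArriveDateDOW', 'DischargeDateDOW']

# One-hot encode the categorical columns and convert everything to a sparse representation.
source_data_features,_,_ = ohe_dataframe_to_sparse(source_data[feature_names], categorical_features)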

#Split Data
target = source_data.LOS
source_features_train, source_features_test, target_train, target_test = train_test_split(source_data_features, target)

# Train the model with DXCODE, patient demographics, hospital and DOW features
lin_reg = LinearRegression()
lin_reg.fit(source_features_train, target_train)

# Evaluate the model on the held-out rows
target_pred = lin_reg.predict(source_features_test)

print('RMSE with dxcode, demos, hospital & dates: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(target_test, target_pred))))
print ('RMSE (standard): {:,.2f}'.format(
        np.sqrt(metrics.mean_squared_error(target_test, [target_test.mean()] * len(target_test)))))

RMSE with dxcode, demos, hospital & dates: 6.27
RMSE (standard): 6.50
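
The baseline RMSE differs between Method 2 (6.41) and Method 3 (6.50). The baseline depends only on the test target, so the two runs must have drawn different random splits, and their model scores are not directly comparable. A minimal sketch of pinning the split with `random_state` follows; the seed 42, the `test_size` and the toy arrays are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy features and target standing in for the real matrices.
toy_X = np.arange(20).reshape(10, 2)
toy_y = np.arange(10)

# The same seed yields the same partition on every call.
split_a = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)
split_b = train_test_split(toy_X, toy_y, test_size=0.3, random_state=42)

# Index 3 of the returned list is the test target.
same_test_rows = np.array_equal(split_a[3], split_b[3])
```

With a fixed seed, a change in RMSE between two feature sets can be attributed to the features rather than to the split.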


### Do common work to train & eval model

In [5]:
def TrainAndPrintEval(data, featureList, message):
    # Select the requested feature columns (data must be a DataFrame that also has an LOS column).
    source_data_features = data[featureList]

    # Split Data
    target = data.LOS
    source_features_train, source_features_test, target_train, target_test = train_test_split(source_data_features, target)

    # Train the model on the selected features
    model = LinearRegression()
    model.fit(source_features_train, target_train)

    # Evaluate the model on the held-out rows
    target_pred = model.predict(source_features_test)

    print (message + ' {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(target_test, target_pred))))
    print ('RMSE (standard): {:,.2f}'.format(
            np.sqrt(metrics.mean_squared_error(target_test, [target_test.mean()] * len(target_test)))))
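
The helper above indexes `data` by column name, so it expects a DataFrame rather than the sparse matrix produced by the encoding step. One possible variant, sketched under that assumption, takes an already encoded feature matrix and the target separately and returns the scores instead of printing them. The name `train_and_eval`, the `seed` parameter and the toy data are hypothetical, and the baseline here uses the training mean rather than the test mean.

```python
import numpy as np
import scipy.sparse as sp
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

def train_and_eval(features, target, seed=0):
    """Hypothetical variant: fit on a (sparse or dense) matrix, return RMSEs."""
    X_train, X_test, y_train, y_test = train_test_split(
        features, target, random_state=seed)
    model = LinearRegression().fit(X_train, y_train)
    pred = model.predict(X_test)
    model_rmse = float(np.sqrt(np.mean((y_test - pred) ** 2)))
    # Baseline: always predict the mean of the training target.
    baseline_rmse = float(np.sqrt(np.mean((y_test - np.mean(y_train)) ** 2)))
    return model_rmse, baseline_rmse

# Toy data: 40 rows alternating between two categories, one-hot encoded,
# with a target that depends only on the category.
toy_groups = np.tile([0, 1], 20)
toy_features = sp.csr_matrix(
    (np.ones(40), (np.arange(40), toy_groups)), shape=(40, 2))
toy_target = np.where(toy_groups == 0, 2.0, 5.0)

toy_model_rmse, toy_baseline_rmse = train_and_eval(toy_features, toy_target)
```

On this toy data the model recovers the two category means almost exactly, while the mean baseline cannot, which is the kind of gap the real features would ideally show.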

### Show the issue numerically with DXCODE

In [23]:
print('%d unique DXCODE values across %d observations.' %
      (len(source_data.DXCODE.unique()), len(source_data.index)))

4789 unique DXCODE values across 279808 observations.
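
The counts printed above allow a rough size comparison between a dense and a sparse one-hot encoding of DXCODE. The sketch below is back-of-the-envelope arithmetic that assumes 8-byte float values and 4-byte CSR indices, and it ignores container overhead.

```python
# Counts taken from the output above.
n_rows = 279808
n_codes = 4789

# Dense one-hot matrix: one float64 cell per (row, code) pair.
dense_bytes = n_rows * n_codes * 8

# CSR one-hot matrix: one stored value per row, plus index arrays
# (assumed float64 data, int32 column indices and int32 row pointers).
nnz = n_rows
csr_bytes = nnz * 8 + nnz * 4 + (n_rows + 1) * 4

dense_gb = dense_bytes / 1e9   # about 10.7 GB
csr_mb = csr_bytes / 1e6       # about 4.5 MB
```

Under these assumptions the sparse form is over two thousand times smaller, because each row stores a single nonzero entry instead of 4789 cells.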


## Scrap

In [4]:
## Show how to convert a column to datetime.
feature_names = ['LOS', 'Age', 'Race', 'Gender', 'Hospital', 'ArriveDate', 'DischargeDate']
# Copy the selection so the assignments below modify an independent frame,
# which avoids pandas' SettingWithCopyWarning.
source_data_features = source_data[feature_names].copy()
source_data_features.ArriveDate = pd.to_datetime(source_data_features.ArriveDate.astype(str))
source_data_features.DischargeDate = pd.to_datetime(source_data_features.DischargeDate.astype(str))
source_data_features.DischargeDate.head()


0   2014-10-10
1   2014-10-10
2   2014-10-10
3   2014-10-10
4   2014-10-10
Name: DischargeDate, dtype: datetime64[ns]

In [7]:
source_data_features.DischargeDate.dt.dayofweek

0         4
1         4
2         4
3         4
4         4
5         4
6         4
7         4
8         4
9         4
10        4
11        4
12        4
13        4
14        4
15        4
16        4
17        4
18        4
19        4
20        4
21        4
22        4
23        4
24        4
25        4
26        4
27        1
28        1
29        1
         ..
279778    1
279779    0
279780    5
279781    5
279782    5
279783    5
279784    5
279785    5
279786    5
279787    0
279788    0
279789    0
279790    0
279791    0
279792    0
279793    0
279794    0
279795    0
279796    3
279797    3
279798    3
279799    3
279800    3
279801    3
279802    3
279803    4
279804    4
279805    4
279806    4
279807    4
Name: DischargeDate, dtype: int64