# UPMCE Machine Learning Palooza
### Goal
Predict a patient's hospital length of stay (LOS). Details about the challenge are in [this repo](https://git.tdc.upmc.edu/MachineLearning/ml-intro-spark-regression).

#### 1) Get my features before splitting

In [1]:
import pandas as pd
import numpy as np
import scipy.sparse as sp

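# train_test_split lives in sklearn.model_selection (sklearn.cross_validation was removed in scikit-learn 0.20).
from sklearn.model_selection import train_test_split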
from sklearn.linear_model import LinearRegression
from sklearn import metrics

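# Local helper module; only ohe_dataframe_to_sparse is used in this notebook.
from ohe import ohe_dataframe_to_sparse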

In [2]:
url = 'http://sparkdl04:50070/webhdfs/v1/palooza/data/visit_train_panda.csv?op=OPEN'
source_data = pd.read_csv(url)
source_data.head()

Unnamed: 0,VisitID,Hospital,Dept_Code,PaymentType,Age,Race,Gender,FC,ArriveDate,DischargeDate,LOS,DXCODE,Description,DispenseID,DOC
0,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,10/07/2014,10/10/2014,3,197.7,SECOND MALIG NEO LIVE,1,10028
1,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,10/07/2014,10/10/2014,3,276.2,ACIDOSIS,1,10028
2,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,10/07/2014,10/10/2014,3,198.5,SECONDARY MALIG NEO B,1,10028
3,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,10/07/2014,10/10/2014,3,253.6,NEUROHYPOPHYSIS DIS N,1,10028
4,HdBgCT1YkEl14280,SHY,436,I,77,W,M,MS,10/07/2014,10/10/2014,3,V58.66,LONG-TERM USE ASPIRIN,1,10028


## N.B. - Sparse encode DXCODE
DXCODE has too many distinct values to one-hot encode densely, but it can be represented sparsely. With over 4K unique values, building a dense one-hot matrix (for example, a pandas DataFrame) is impractical on a single machine.
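To make the size argument concrete, here is a minimal sketch of one-hot encoding a single high-cardinality column into a SciPy CSR matrix, plus a rough estimate of what the dense equivalent would cost. `one_hot_sparse` and `dense_bytes` are hypothetical helpers written for this illustration (they are not part of the `ohe` module), the toy codes are copied from the sample rows above, and the estimate assumes 8-byte floats and no missing values in the column.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def one_hot_sparse(values):
    """One-hot encode a 1-D sequence of labels as a CSR matrix.

    Hypothetical helper for illustration; assumes no missing values.
    """
    cat = pd.Categorical(values)
    codes = np.asarray(cat.codes)          # integer code per row
    n_rows = len(codes)
    n_cols = len(cat.categories)
    data = np.ones(n_rows, dtype=np.float64)
    # Exactly one non-zero entry per row, at (row, code).
    matrix = sp.csr_matrix((data, (np.arange(n_rows), codes)),
                           shape=(n_rows, n_cols))
    return matrix, list(cat.categories)


def dense_bytes(n_rows, n_cols, itemsize=8):
    """Bytes needed to hold a dense n_rows x n_cols array of itemsize-byte values."""
    return n_rows * n_cols * itemsize


toy_codes = ['197.7', '276.2', '198.5', '276.2', 'V58.66']
encoded, categories = one_hot_sparse(toy_codes)
print(encoded.shape, encoded.nnz)

# Dense cost for 279,808 rows and 4,789 distinct codes, in gigabytes.
print('{:.1f} GB'.format(dense_bytes(279808, 4789) / 1e9))
```

The sparse matrix stores one value per row regardless of how many distinct codes exist, which is why the same data fits comfortably in memory.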

### Let's see what happens if I only encode DXCODE

In [19]:
# Identify the feature columns and which of them are categorical
feature_names = ['DXCODE']
categorical_features = ['DXCODE']

# One-hot encode the categorical columns and convert everything to a sparse representation.
source_data_features,_,_ = ohe_dataframe_to_sparse(source_data[feature_names], categorical_features)

# Split Data
target = source_data.LOS
source_features_train, source_features_test, target_train, target_test = train_test_split(source_data_features, target)

# Train Model with DXCODE as only feature
lin_reg = LinearRegression()
lin_reg.fit(source_features_train, target_train)

# Evaluate Model with DXCODE as only feature
target_pred = lin_reg.predict(source_features_test)

print ('RMSE with dxcode only: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(target_test, target_pred))))
print ('RMSE (standard): {:,.2f}'.format(
        np.sqrt(metrics.mean_squared_error(target_test, [target_test.mean()] * len(target_test)))))

RMSE with dxcode only: 6.39
RMSE (standard): 6.57
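The "standard" RMSE above is the error of a model that always predicts the mean of the test targets, which is the same as the population standard deviation of those targets. Because `train_test_split` is called without a fixed seed, this baseline changes from run to run. A small sketch on made-up LOS values (the numbers and the helper names `rmse` and `mean_baseline_rmse` are assumptions for illustration only):

```python
import numpy as np


def rmse(y_true, y_pred):
    """Root mean squared error between two equal-length sequences."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    return float(np.sqrt(np.mean((y_true - y_pred) ** 2)))


def mean_baseline_rmse(y_true):
    """RMSE of a constant prediction equal to the mean of y_true."""
    y_true = np.asarray(y_true, dtype=float)
    return rmse(y_true, np.full_like(y_true, y_true.mean()))


toy_los = [1, 3, 3, 5, 8]  # hypothetical lengths of stay, in days
print('baseline RMSE: {:.3f}'.format(mean_baseline_rmse(toy_los)))
print('population std: {:.3f}'.format(np.std(toy_los)))
```

A model is only useful to the extent that its RMSE falls below this number on the same split.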


### Let's try more features: DXCODE, patient demographics, and hospital

In [21]:
# Identify the feature columns and which of them are categorical
feature_names = ['DXCODE', 'Age', 'Race', 'Gender', 'Hospital']
categorical_features = ['DXCODE', 'Race', 'Gender', 'Hospital']

# One-hot encode the categorical columns and convert everything to a sparse representation.
source_data_features,_,_ = ohe_dataframe_to_sparse(source_data[feature_names], categorical_features)

# Split Data
target = source_data.LOS
source_features_train, source_features_test, target_train, target_test = train_test_split(source_data_features, target)

# Train Model with DXCODE, patient demos and hospital as the only features
lin_reg = LinearRegression()
lin_reg.fit(source_features_train, target_train)

# Evaluate Model with DXCODE, patient demos and hospital as the only features
target_pred = lin_reg.predict(source_features_test)

print ('RMSE with dxcode, demos & hospital: {:,.2f}'.format(np.sqrt(metrics.mean_squared_error(target_test, target_pred))))
print ('RMSE (standard): {:,.2f}'.format(
        np.sqrt(metrics.mean_squared_error(target_test, [target_test.mean()] * len(target_test)))))

RMSE with dxcode, demos & hospital: 6.19
RMSE (standard): 6.41
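The `ohe` module itself is not shown in this notebook. As a rough sketch of how a mixed frame (numeric `Age` plus several categorical columns) could be turned into one sparse matrix, the hypothetical `frame_to_sparse` below one-hot encodes each categorical column and stacks the blocks with `scipy.sparse.hstack`. It is an assumption about the general approach, not the actual implementation of `ohe_dataframe_to_sparse`, and it assumes no missing values.

```python
import numpy as np
import pandas as pd
import scipy.sparse as sp


def frame_to_sparse(df, categorical_columns):
    """Stack numeric columns and one-hot encoded categorical columns into one CSR matrix.

    Hypothetical illustration only; assumes no missing values.
    """
    n_rows = len(df)
    blocks, column_names = [], []
    for name in df.columns:
        if name in categorical_columns:
            cat = pd.Categorical(df[name])
            codes = np.asarray(cat.codes)
            # One non-zero per row at the column of that row's category.
            block = sp.csr_matrix(
                (np.ones(n_rows), (np.arange(n_rows), codes)),
                shape=(n_rows, len(cat.categories)))
            column_names.extend('{}={}'.format(name, c) for c in cat.categories)
        else:
            # Numeric columns pass through as a single sparse column.
            block = sp.csr_matrix(df[[name]].to_numpy(dtype=float))
            column_names.append(name)
        blocks.append(block)
    return sp.hstack(blocks, format='csr'), column_names


toy = pd.DataFrame({
    'DXCODE': ['197.7', '276.2', '197.7'],
    'Age': [77, 77, 54],
    'Gender': ['M', 'M', 'F'],
})
features, names = frame_to_sparse(toy, ['DXCODE', 'Gender'])
print(names)
print(features.toarray())
```

The resulting CSR matrix can be passed directly to `LinearRegression.fit`, which accepts sparse input.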


### Show the issue numerically with DXCODE

In [23]:
print('%d unique DXCODE values across %d observations.' %
      (len(source_data.DXCODE.unique()), len(source_data.index)))

4789 unique DXCODE values across 279808 observations.


## Scrap

In [3]:
# Show how to convert columns to datetime.
feature_names = ['LOS', 'Age', 'Race', 'Gender', 'Hospital', 'ArriveDate', 'DischargeDate']
# Copy the selection so the assignments below modify a new frame and do not raise SettingWithCopyWarning.
source_data_features = source_data[feature_names].copy()
# Dates look like MM/DD/YYYY (10/07/2014 to 10/10/2014 matches LOS = 3), so the format is given explicitly.
source_data_features['ArriveDate'] = pd.to_datetime(source_data_features['ArriveDate'], format='%m/%d/%Y')
source_data_features['DischargeDate'] = pd.to_datetime(source_data_features['DischargeDate'], format='%m/%d/%Y')
source_data_features.DischargeDate.head()

0   2014-10-10
1   2014-10-10
2   2014-10-10
3   2014-10-10
4   2014-10-10
Name: DischargeDate, dtype: datetime64[ns]
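Once both columns are datetimes, LOS can be cross-checked against the dates. The sketch below assumes LOS is the whole-day difference between discharge and arrival, which matches the sample row above (10/07/2014 to 10/10/2014, LOS 3). The second row is made up to show a stay that crosses a year boundary.

```python
import pandas as pd

visits = pd.DataFrame({
    'ArriveDate': ['10/07/2014', '12/30/2014'],     # second row is hypothetical
    'DischargeDate': ['10/10/2014', '01/02/2015'],
})

# Parse both columns with an explicit MM/DD/YYYY format.
arrive = pd.to_datetime(visits['ArriveDate'], format='%m/%d/%Y')
discharge = pd.to_datetime(visits['DischargeDate'], format='%m/%d/%Y')

# Subtracting datetimes gives timedeltas; .dt.days extracts whole days.
visits['LOS_from_dates'] = (discharge - arrive).dt.days
print(visits)
```

Comparing a column like `LOS_from_dates` with the provided `LOS` would be one way to spot rows where the two disagree.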