# Introduction and Summary

**Problem Description**
* Client estimates that an unplanned, acute inpatient admission incurs approximately $15,170 in cost to insurers per event.
* Because Client profits from shared savings, these admissions also affect Client's bottom line.
* Since these events are unexpected, Client can currently analyze them only in hindsight. We therefore examined 1,978,184 "patient months" of claims, clinical, and demographic patient information from 2016 and 2017 to determine whether statistical patterns in the data could predict which patients are most at risk for these events.

**Goals of Project**
* If we can identify the patients most at risk, Client can begin to design interventions and treatments that reduce this risk, resulting in healthier patients, lower costs, and more shared savings.
* The prediction target is a patient month that is followed, within the next 6 calendar months, by an acute inpatient claim with an ER visit on the same calendar date.
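
The target definition above can be sketched in pandas. This is only an illustration under stated assumptions: the two toy tables, the `EventMonthlyID` column, and the `flag_target` helper are hypothetical, and "within the next 6 calendar months" is assumed here to mean months 1 through 6 after the patient month. The actual `HasAcuteERIn6Mths` flag is precomputed in the source table and may be derived differently.

```python
import pandas as pd

def flag_target(patient_months, events, horizon=6):
    """Return 1 for each patient month followed by a qualifying event within `horizon` months."""
    # Pair every patient month with every qualifying event of the same patient.
    merged = patient_months.reset_index().merge(events, on="PatientID", how="left")
    # Months between the patient month and the event (NaN when the patient has no event).
    months_ahead = merged["EventMonthlyID"] - merged["DateMonthlyID"]
    merged["hit"] = months_ahead.between(1, horizon)
    # A patient month is a target if any paired event falls inside the window.
    return merged.groupby("index")["hit"].any().astype(int)

# Hypothetical toy data: one row per patient month, one row per qualifying event
# (an acute inpatient claim with an ER visit on the same calendar date).
patient_months = pd.DataFrame({"PatientID": [1, 1, 2], "DateMonthlyID": [320, 330, 320]})
events = pd.DataFrame({"PatientID": [1], "EventMonthlyID": [325]})

target = flag_target(patient_months, events)
print(target.tolist())
```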

**Output of Project**
* We fit a logistic regression model and evaluate its performance against a holdout set of this data.
* If the model's performance is acceptable, we will then make predictions for 2018.
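
The fit-then-evaluate workflow described above might look like the following minimal sketch. The data are synthetic, and the 70/30 split, the coefficients, and the AUC metric are assumptions chosen for illustration, not the settings used in this project.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the patient-month features and binary target.
rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 3))
logits = -2.0 + 1.5 * X[:, 0] + 0.5 * X[:, 1]
y = rng.binomial(1, 1 / (1 + np.exp(-logits)))

# Hold out 30% of the rows; stratify so both sets keep the same event rate.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0
)

# Fit on the training set only, then score the untouched holdout set.
model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
holdout_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])
print(f"Holdout AUC: {holdout_auc:.3f}")
```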

# Import Packages and Settings

In [1]:
import pyodbc
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
import scipy as sp
from yellowbrick.classifier import ConfusionMatrix
from sklearn.metrics import confusion_matrix

# Display options
pd.options.display.max_columns = None
%matplotlib inline
sns.set(rc={'figure.figsize':(11.7,8.27)}) # Sets the size of plots to be a little bigger.
sns.set_style("darkgrid") # Set style of plots
sns.set_context("notebook") # Set plots to format for Jupyter
pd.options.display.float_format = '{:,.5f}'.format #Set to display only 5 decimal places

#Import module to create our training and test sets.
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import MinMaxScaler



# Machine Learning
import statsmodels.api as sm 
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC, LinearSVC
from sklearn.ensemble import RandomForestClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.linear_model import Perceptron
from sklearn.linear_model import SGDClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import roc_curve, roc_auc_score, auc



In [2]:
%%javascript
$('<div id="toc"></div>').css({position: 'fixed', top: '120px', left: 0}).appendTo(document.body);
$.getScript('https://kmahelona.github.io/ipython_notebook_goodies/ipython_notebook_toc.js');

<IPython.core.display.Javascript object>

# Connect to DB and Initialize DataFrames

In [3]:
con = pyodbc.connect('Trusted_Connection=yes', driver='{SQL Server}', server='dev-sql05', database='Delphi')

cur = con.cursor()

querystring = 'SELECT * FROM stage_PatientClaimDetail_Combined'

df_AcuteInpatient = pd.read_sql(querystring,con)

In [4]:
# Convert every column that can be parsed as numeric; leave the others unchanged
# (prevents errors in later steps). Done before splitting so the subsets inherit the numeric dtypes.
df_AcuteInpatient = df_AcuteInpatient.apply(pd.to_numeric, errors='ignore')

# Create dfs for modeling and predictions.
    # Modeling set is all data from 2016/2017.
    # Predict2018 set is all data from 2018 or later.

df_Modeling = df_AcuteInpatient[df_AcuteInpatient.DateMonthlyID < 337]

df_Predict2018 = df_AcuteInpatient[df_AcuteInpatient.DateMonthlyID >= 337]

# Create dfs to compare Target vs Non-Target populations.
df_Target = df_Modeling[df_Modeling.HasAcuteERIn6Mths == 1]

df_NonTarget = df_Modeling[df_Modeling.HasAcuteERIn6Mths == 0]


## Automated Feature Selection

Some variables with a large number of classes and no obvious correlation with the target are excluded to keep the list of candidate variables manageable.
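
One way to spot variables with many classes is to count distinct values per column. The sketch below assumes a hypothetical helper name and an arbitrary cutoff of 20 classes; neither comes from this analysis, and the toy frame is invented for illustration.

```python
import pandas as pd

def high_cardinality_columns(df, max_classes=20):
    """Return the columns whose number of distinct non-null values exceeds `max_classes`."""
    counts = df.nunique(dropna=True)
    return counts[counts > max_classes].sort_values(ascending=False)

# Toy frame: one identifier-like column with many classes, one binary flag.
toy = pd.DataFrame({
    "ZipCode": [f"{10000 + i}" for i in range(50)],
    "IsDiabetes": [i % 2 for i in range(50)],
})
too_many = high_cardinality_columns(toy)
print(too_many)
```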

In [5]:
# Create new dfs to analyze features

# Make copy of main df for binary and categorical features
df_CatFeatures = df_AcuteInpatient.copy(deep=True)

# Make copy of main df for numeric features
df_NumFeatures = df_AcuteInpatient.copy(deep=True)

In [8]:

# Keep only the binary/categorical candidate features and the target
df_CatFeatures = df_CatFeatures[[
 'IsCOPD'
,'IsDiabetes'
,'IsHeartFailure'
,'IsHypertension'
,'IsChronicKidney'
,'HasPreviousAdmit'
,'Has3MonthSurgery'
,'HasMedicareIns'
,'MaritalStatus'
,'Over21AgeBracketID'
,'HasAcuteERIn6Mths'
]]

In [9]:
# Change categorical columns to Categorical data type
df_CatFeatures['MaritalStatus'] = df_CatFeatures['MaritalStatus'].astype('category')
df_CatFeatures['Over21AgeBracketID'] = df_CatFeatures['Over21AgeBracketID'].astype('category')

In [10]:
# Convert categorical to dummies
df_CatFeatures = pd.get_dummies(df_CatFeatures)

In [11]:
# Keep only the numeric candidate features and the target
df_NumFeatures = df_NumFeatures[[
'OPPLast6Mths'
,'CntInpatient'
,'CntIsERClaim'
,'HasAcuteERIn6Mths'
]]

In [12]:
# Change any NULLs to 0
df_CatFeatures = df_CatFeatures.fillna(0)
df_NumFeatures = df_NumFeatures.fillna(0)

In [13]:
# List all column names, then split into features (X) and target (y). X still includes the target here.
cols=df_CatFeatures.columns.values.tolist()

XCat=df_CatFeatures[cols]
yCat=df_CatFeatures['HasAcuteERIn6Mths']

In [14]:
# List all column names, then split into features (X) and target (y). X still includes the target here.
cols=df_NumFeatures.columns.values.tolist()

XNum=df_NumFeatures[cols]
yNum=df_NumFeatures['HasAcuteERIn6Mths']

In [15]:
# Drop target from features for consideration
XCat = XCat.drop(columns=['HasAcuteERIn6Mths']) 
XNum = XNum.drop(columns=['HasAcuteERIn6Mths']) 

In [19]:
from sklearn.feature_selection import SelectKBest, chi2
# Score every categorical feature against the target with the chi-squared test (k='all' keeps all of them).
selector = SelectKBest(chi2, k='all')
# fit_transform returns the selected feature matrix for later use in the classifier;
# fit() alone is enough if only the feature names and scores are needed.
X_new = selector.fit_transform(XCat, yCat)
names = XCat.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
pval = selector.pvalues_[selector.get_support()]
names_scores = list(zip(names, scores, pval))
ns_df = pd.DataFrame(data=names_scores, columns=['Feat_names', 'Chi2', 'pval'])
# Sort by score, highest first, for easier reading
ns_df_sorted = ns_df.sort_values(['Chi2', 'Feat_names'], ascending=[False, True])
print(ns_df_sorted)

              Feat_names          Chi2    pval
6       HasPreviousAdmit 129,050.73973 0.00000
21  Over21AgeBracketID_5  44,589.53248 0.00000
8         HasMedicareIns  36,934.69378 0.00000
3         IsHeartFailure  35,010.50796 0.00000
7       Has3MonthSurgery  26,057.71042 0.00000
1                 IsCOPD  19,944.43614 0.00000
5        IsChronicKidney  19,549.89638 0.00000
15       MaritalStatus_W  19,442.95306 0.00000
20  Over21AgeBracketID_4  15,977.17481 0.00000
19  Over21AgeBracketID_3   9,325.74892 0.00000
18  Over21AgeBracketID_2   7,116.39893 0.00000
2             IsDiabetes   6,755.59693 0.00000
17  Over21AgeBracketID_1   5,426.08945 0.00000
4         IsHypertension   5,180.33504 0.00000
11       MaritalStatus_M   4,570.30610 0.00000
13       MaritalStatus_S     884.48018 0.00000
0               IsAsthma     434.04671 0.00000
14       MaritalStatus_U       7.66371 0.00563
10       MaritalStatus_D       4.75544 0.02921
16       MaritalStatus_X       3.67793 0.05514
9        Mari
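
The chi-squared statistic grows with the number of rows, which is one reason p-values this close to zero are common on a dataset of roughly two million patient months. As a rough illustration of what `chi2` computes, the sketch below reproduces the score by hand on a tiny binary matrix and checks it against scikit-learn. The data are invented for illustration and are not drawn from the claims table.

```python
import numpy as np
from sklearn.feature_selection import chi2

# Made-up data: 6 rows, 2 non-negative binary features, binary target.
X = np.array([[1, 0], [1, 1], [0, 1], [1, 0], [0, 1], [1, 1]])
y = np.array([1, 1, 0, 1, 0, 0])

# One indicator column per class (class 0, class 1).
Y = np.column_stack([1 - y, y])

# Observed feature totals per class versus the totals expected if each
# feature were spread across the classes in proportion to class frequency.
observed = Y.T @ X
expected = np.outer(Y.mean(axis=0), X.sum(axis=0))
manual_scores = ((observed - expected) ** 2 / expected).sum(axis=0)

sk_scores, sk_pvalues = chi2(X, y)
print(manual_scores, sk_scores)
```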

In [18]:
from sklearn.feature_selection import SelectKBest, f_classif
# Score every numeric feature against the target with the ANOVA F-test (k='all' keeps all of them).
selector = SelectKBest(f_classif, k='all')
# fit_transform returns the selected feature matrix for later use in the classifier;
# fit() alone is enough if only the feature names and scores are needed.
X_new = selector.fit_transform(XNum, yNum)
names = XNum.columns.values[selector.get_support()]
scores = selector.scores_[selector.get_support()]
pval = selector.pvalues_[selector.get_support()]
names_scores = list(zip(names, scores, pval))
ns_df = pd.DataFrame(data=names_scores, columns=['Feat_names', 'F_Scores', 'pval'])
# Sort by score, highest first, for easier reading
ns_df_sorted = ns_df.sort_values(['F_Scores', 'Feat_names'], ascending=[False, True])
print(ns_df_sorted)

     Feat_names      F_Scores    pval
1  CntInpatient 436,979.68878 0.00000
0  OPPLast6Mths  62,936.24391 0.00000
2  CntIsERClaim   1,684.10392 0.00000
