## Medical Plan Recommender Model


#### This assignment evaluates your ability to engineer features and to design and evaluate a model. We will build a multi-class classification model that recommends medical plans to employees from user data (inputs) and medical plan labels (outputs). We have data from ~250 users, and actuaries have classified each user into 1 of 3 plans.

#### Feel free to use any Python packages you like. You may also Google for help with method names and syntax!

#### Dataset Columns:

- **age**: age of employee
- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **salary**: income of employee
- **household_salaries**: household income of employee
- **financial_risk_preference**: scale from 1 (prefer savings) to 5 (prefer protection)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc.)
- **prescription_costs**: annual prescription costs
- **pcp_costs**: primary care costs last year
- **specialist_costs**: specialty care costs last year
- **pcp_visits**: number of pcp visits last year
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **specialty_visits**: number of specialist visits last year
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **savings**: if they had to pay $3000, how would they pay for this? (borrow money, have savings, HSA)
- **label**: plan recommendation as indicated by actuary

In [16]:
import pandas as pd
surveys = pd.read_csv("data/surveys.csv", index_col=0).reset_index(drop=True)
surveys.head()

Unnamed: 0,age,family,salary,household_salaries,financial_risk_preference,preexisting_conditions,prescription_costs,pcp_costs,specialist_costs,pcp_visits,qle,specialty_visits,exercises,savings,label
0,38,Just Me,84189,84189.0,3,none,97,1025,358,9,none,1,I exercise 3x a week,HSA,Cigna Copay Plan PPO
1,33,Me and my Spouse,117690,129459.0,3,none,51,155,0,2,none,0,I exercise 3x a week,have savings,Cigna Base HDHP
2,47,Me and my kids,83461,100153.2,3,high blood pressure,763,268,0,2,baby,0,I exercise 3x a week,have savings,Cigna Copay Plan PPO
3,30,Me and my Spouse,62145,74574.0,3,none,92,268,1257,2,none,3,I don't exercise,have savings,Cigna Base HDHP
4,35,Just Me,55385,55385.0,3,none,80,270,2590,2,moving,6,I exercise 3x a week,have savings,Cigna Choice HDHP


In [17]:
X = surveys.drop("label", axis=1)
y = surveys.label.values

### 1) Write code to split the data into train & test sets

In [44]:
### Option 1 - sklearn 
from sklearn.model_selection import train_test_split

train_size = 0.75
random_state = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=random_state, stratify=y)

In [45]:
X_train

Unnamed: 0,age,family,salary,household_salaries,financial_risk_preference,preexisting_conditions,prescription_costs,pcp_costs,specialist_costs,pcp_visits,qle,specialty_visits,exercises,savings
141,51,"Me, Spouse, and Kids",61115,103895.5,3,none,21,186,0,2,none,0,I don't exercise,borrow money
167,21,Just Me,47762,47762.0,4,none,42,173,723,2,med_procedure,2,I don't exercise,borrow money
171,22,Me and my Spouse,61759,92638.5,2,none,49,110,360,1,married,1,I exercise everyday,HSA
48,32,Me and my Spouse,43286,69257.6,3,none,90,227,699,2,baby,2,I exercise everyday,borrow money
128,57,Me and my Spouse,62072,105522.4,3,none,48,182,0,2,none,0,I exercise everyday,HSA
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
136,57,"Me, Spouse, and Kids",61731,86423.4,3,none,221,187,646,2,none,2,I exercise 3x a week,have savings
265,35,Me and my Spouse,113372,170058.0,3,heart disease,66,134,0,2,none,0,I don't exercise,HSA
60,37,Me and my Spouse,39855,43840.5,3,none,39,485,3482,3,med_procedure,7,I exercise everyday,have savings
103,37,"Me, Spouse, and Kids",63848,89387.2,3,none,1073,715,1935,8,none,4,I exercise 3x a week,borrow money


In [52]:
### Option 2 - from scratch without stratified sampling 

import random

train_size = 0.80
random.seed(42)  # fixed seed so the shuffle, and therefore the split, is reproducible

# shuffle row positions, then cut them into train and test portions
indices = list(range(len(X)))
random.shuffle(indices)
num_training_indices = int(len(indices) * train_size)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]

# split the actual data (X is a DataFrame, y is a NumPy array)
X_train, X_test = X.iloc[train_indices], X.iloc[test_indices]
y_train, y_test = y[train_indices], y[test_indices]
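
Option 2 ignores class balance, which matters with only ~250 rows and 3 plans. As a rough sketch, a from-scratch split can be made stratified by shuffling and cutting each class separately. The helper name `stratified_split_indices` and the toy labels are assumptions for illustration, not part of the assignment data.

```python
import random
from collections import defaultdict

def stratified_split_indices(labels, train_size=0.8, seed=42):
    """Return (train_indices, test_indices), cutting each class separately."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for position, label in enumerate(labels):
        by_class[label].append(position)

    train, test = [], []
    for label in sorted(by_class):
        positions = by_class[label]
        rng.shuffle(positions)
        cut = int(len(positions) * train_size)
        train.extend(positions[:cut])
        test.extend(positions[cut:])

    # shuffle again so rows are not grouped by class
    rng.shuffle(train)
    rng.shuffle(test)
    return train, test

# Toy labels standing in for surveys.label (illustrative only)
toy_labels = ["A"] * 50 + ["B"] * 30 + ["C"] * 20
train_idx, test_idx = stratified_split_indices(toy_labels)
```

The returned positions can be passed to `X.iloc[...]` and `y[...]` in the same way as `train_indices` and `test_indices` above.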

### 2) You will train a multi-class classification model later in the script. Consider what model you would like to train, and implement a feature normalization strategy. Explain your reasoning behind your strategy. 

In [46]:
numeric_features = X_train.select_dtypes(include='number').columns
numeric_features

Index(['age', 'salary', 'household_salaries', 'financial_risk_preference',
       'prescription_costs', 'pcp_costs', 'specialist_costs', 'pcp_visits',
       'specialty_visits'],
      dtype='object')

In [47]:
categorical_features = X_train.select_dtypes(include='object').columns
categorical_features

Index(['family', 'preexisting_conditions', 'qle', 'exercises', 'savings'], dtype='object')

In [54]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

scaler = StandardScaler()  # Note: if the candidate selects a tree-based model, normalization is not required
X_train_numeric = pd.DataFrame(scaler.fit_transform(X_train[numeric_features]), columns=numeric_features, index=X_train.index)
X_test_numeric = pd.DataFrame(scaler.transform(X_test[numeric_features]), columns=numeric_features, index=X_test.index)

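enc = OneHotEncoder(handle_unknown="ignore")  # a rare category seen only in the test set encodes as all zeros instead of raising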
X_train_categorical = pd.DataFrame(enc.fit_transform(X_train[categorical_features]).toarray(), columns=enc.get_feature_names_out(), index=X_train.index)
X_test_categorical = pd.DataFrame(enc.transform(X_test[categorical_features]).toarray(), columns=enc.get_feature_names_out(), index=X_test.index)


In [56]:
X_train_normalized = X_train_numeric.merge(X_train_categorical, left_index=True, right_index=True)
X_test_normalized = X_test_numeric.merge(X_test_categorical, left_index=True, right_index=True)

### 3) Select one model and train. We do not expect you to implement hyperparameter tuning, but please talk through how you would set this up.

In [None]:
clf = #### CHANGEME! select model
model = clf.fit() #### CHANGEME! fit model 
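
As one hedged illustration of how tuning could be set up (not the expected answer), the sketch below runs a cross-validated grid search over the regularization strength of a logistic regression. It uses synthetic data from `make_classification` in place of the survey features, and the grid values and `f1_macro` scoring are assumptions chosen for the example.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold, train_test_split

# Synthetic 3-class data standing in for the normalized survey features
X_demo, y_demo = make_classification(
    n_samples=250, n_features=10, n_informative=5, n_classes=3, random_state=42
)
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, train_size=0.75, stratify=y_demo, random_state=42
)

search = GridSearchCV(
    estimator=LogisticRegression(max_iter=1000),
    param_grid={"C": [0.01, 0.1, 1.0, 10.0]},  # illustrative grid
    scoring="f1_macro",                         # treats the 3 plans equally
    cv=StratifiedKFold(n_splits=5, shuffle=True, random_state=42),
)
search.fit(X_tr, y_tr)  # tuning only ever sees the training portion

demo_model = search.best_estimator_  # refit on all of X_tr with the best C
demo_predictions = demo_model.predict(X_te)
```

With so few rows, cross-validation on the training set is generally preferable to carving out a separate validation split, and the test set stays untouched until the final evaluation.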

### 4) Evaluate Model: Display model train/test classification metrics of your choice and describe them in the context of this problem.  

In [None]:
#### CHANGEME! display classification metrics of your choice 
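
A minimal sketch of one way to display metrics, assuming per-class precision/recall and a confusion matrix are the metrics of choice. The plan names come from the dataset, but `y_true_demo` and `y_pred_demo` are made-up labels for illustration, not model output.

```python
from sklearn.metrics import classification_report, confusion_matrix, f1_score

plans = ["Cigna Base HDHP", "Cigna Choice HDHP", "Cigna Copay Plan PPO"]

# Made-up actuary labels and predictions (illustrative only)
y_true_demo = [plans[0], plans[0], plans[1], plans[1], plans[2], plans[2], plans[2], plans[0]]
y_pred_demo = [plans[0], plans[1], plans[1], plans[1], plans[2], plans[0], plans[2], plans[0]]

# Rows are the actuary's plan, columns are the predicted plan
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=plans)
macro_f1 = f1_score(y_true_demo, y_pred_demo, average="macro")

print(cm)
print(classification_report(y_true_demo, y_pred_demo, labels=plans, zero_division=0))
```

Running the same calls on both the train and test predictions makes overfitting visible as a gap between the two reports. The off-diagonal cells of the confusion matrix show which plans get confused with each other, which matters here because recommending a high-deductible plan to someone with high expected costs is a different kind of error than the reverse.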

### 5) Monitoring: Describe metrics you would consider to monitor this model in production.  
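
One candidate monitoring metric is input drift: how far a feature's distribution in production has moved from what the model saw in training. The sketch below computes a population stability index (PSI) with NumPy. The function name, the binning choice, and the salary samples are assumptions made for illustration.

```python
import numpy as np

def population_stability_index(baseline, current, bins=10, eps=1e-6):
    """PSI between two 1-D samples, using quantile bins taken from the baseline."""
    baseline = np.asarray(baseline, dtype=float)
    current = np.asarray(current, dtype=float)

    # Interior bin edges from baseline quantiles; the outer bins are open-ended
    inner_edges = np.quantile(baseline, np.linspace(0, 1, bins + 1))[1:-1]

    base_bins = np.searchsorted(inner_edges, baseline, side="right")
    curr_bins = np.searchsorted(inner_edges, current, side="right")
    base_frac = np.bincount(base_bins, minlength=bins) / len(baseline)
    curr_frac = np.bincount(curr_bins, minlength=bins) / len(current)

    # Clip so empty bins do not produce log(0) or division by zero
    base_frac = np.clip(base_frac, eps, None)
    curr_frac = np.clip(curr_frac, eps, None)
    return float(np.sum((curr_frac - base_frac) * np.log(curr_frac / base_frac)))

rng = np.random.default_rng(42)
train_salary = rng.normal(60_000, 15_000, size=1_000)    # made-up baseline sample
shifted_salary = rng.normal(80_000, 15_000, size=1_000)  # made-up drifted sample

psi_same = population_stability_index(train_salary, train_salary.copy())
psi_shifted = population_stability_index(train_salary, shifted_salary)
```

The same idea applies to the model's outputs: tracking the share of users recommended each plan over time, and comparing against actuary labels when they become available, covers drift that input features alone would miss.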
