## Medical Plan Recommender Model


In this assignment, we want to evaluate your ability to engineer features and to design and evaluate a model. We will build a multi-class classification model that recommends medical plans to employees, using user data as inputs and medical plan labels as outputs. We have data from ~250 users, each assigned to 1 of 3 plans by actuaries.

Feel free to use any Python packages you like. You are also welcome to search online for help with method names or syntax.

Dataset Columns:

- **age**: age of employee
- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **salary**: income of employee
- **household_salaries**: household income of employee
- **financial_risk_preference**: scale from 1 (prefer savings) to 5 (prefer protection)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc)
- **prescription_costs**: annual prescription costs
- **pcp_costs**: primary care costs last year
- **specialist_costs**: specialty care costs last year
- **pcp_visits**: number of pcp visits last year
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **specialty_visits**: number of specialist visits last year
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **savings**: if they had to pay $3000, how would they pay for this? (borrow money, have savings, HSA)
- **label**: plan recommendation as indicated by actuary

In [2]:
import pandas as pd
surveys = pd.read_csv("data/surveys.csv", index_col=0).reset_index(drop=True)
surveys.head()


In [17]:
X = surveys.drop("label", axis=1)
y = surveys.label.values

### 1) Write code to split the data into train & test sets

In [44]:
### Option 1 - sklearn 
from sklearn.model_selection import train_test_split

train_size = 0.75
random_state = 42

X_train, X_test, y_train, y_test = train_test_split(X, y, train_size=train_size, random_state=random_state, stratify=y)

In [52]:
### Option 2 - from scratch without stratified sampling
### Note: running this cell overwrites the Option 1 split.

from random import seed, shuffle

train_size = 0.80
seed(42)  # fix the shuffle order so the split is reproducible

indices = list(range(len(X)))
shuffle(indices)
num_training_indices = int(len(indices) * train_size)
train_indices = indices[:num_training_indices]
test_indices = indices[num_training_indices:]

# split the actual data (X is a DataFrame, y is a NumPy array)
X_train, X_test = X.iloc[train_indices], X.iloc[test_indices]
y_train, y_test = y[train_indices], y[test_indices]
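
Option 2 shuffles all rows together, so the class proportions in the train and test sets can drift from those of the full dataset. That matters with only a few hundred rows and three classes. The sketch below shows one way to stratify from scratch using only the standard library. The function name `stratified_split_indices` and the toy labels are hypothetical and not part of the assignment.

```python
import random
from collections import defaultdict


def stratified_split_indices(labels, train_size=0.8, seed=42):
    """Return (train_indices, test_indices), splitting each class separately."""
    rng = random.Random(seed)

    # Group row positions by their class label.
    by_class = defaultdict(list)
    for i, label in enumerate(labels):
        by_class[label].append(i)

    train_indices, test_indices = [], []
    for label in sorted(by_class):
        idx = by_class[label]
        rng.shuffle(idx)
        cut = int(len(idx) * train_size)
        train_indices.extend(idx[:cut])
        test_indices.extend(idx[cut:])

    # Shuffle again so rows are not ordered by class.
    rng.shuffle(train_indices)
    rng.shuffle(test_indices)
    return train_indices, test_indices


# Hypothetical labels standing in for surveys.label.values.
demo_labels = ["A"] * 10 + ["B"] * 20 + ["C"] * 5
train_idx, test_idx = stratified_split_indices(demo_labels)
```

The returned lists could then be used in place of `train_indices` and `test_indices` in the cell above.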

### 2) You will train a multi-class classification model later in the script. Consider what model you would like to train, and implement a feature normalization strategy. Explain your reasoning behind your strategy. 

In [46]:
numeric_features = X_train.select_dtypes(include='number').columns
numeric_features

Index(['age', 'salary', 'household_salaries', 'financial_risk_preference',
       'prescription_costs', 'pcp_costs', 'specialist_costs', 'pcp_visits',
       'specialty_visits'],
      dtype='object')

In [47]:
categorical_features = X_train.select_dtypes(include='object').columns
categorical_features

Index(['family', 'preexisting_conditions', 'qle', 'exercises', 'savings'], dtype='object')

In [54]:
from sklearn.preprocessing import StandardScaler
from sklearn.preprocessing import OneHotEncoder

# Note: if the candidate selects a tree-based model, scaling is not required.
# Fit the scaler on the training set only, then reuse it on the test set to avoid leakage.
scaler = StandardScaler()
X_train_numeric = pd.DataFrame(scaler.fit_transform(X_train[numeric_features]), columns=numeric_features, index=X_train.index)
X_test_numeric = pd.DataFrame(scaler.transform(X_test[numeric_features]), columns=numeric_features, index=X_test.index)

# handle_unknown="ignore" encodes categories unseen during fit as all zeros instead of raising an error.
enc = OneHotEncoder(handle_unknown="ignore")
X_train_categorical = pd.DataFrame(enc.fit_transform(X_train[categorical_features]).toarray(), columns=enc.get_feature_names_out(), index=X_train.index)
X_test_categorical = pd.DataFrame(enc.transform(X_test[categorical_features]).toarray(), columns=enc.get_feature_names_out(), index=X_test.index)


In [56]:
# Both frames share the same index, so a column-wise concat lines the rows up.
X_train_normalized = pd.concat([X_train_numeric, X_train_categorical], axis=1)
X_test_normalized = pd.concat([X_test_numeric, X_test_categorical], axis=1)

### 3) Select one model and train. We do not expect you to implement hyperparameter tuning, but please talk through how you would set this up.

In [59]:
from sklearn.linear_model import LogisticRegression

clf = LogisticRegression(random_state=0)
model = clf.fit(X_train_normalized, y_train) 
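
The question asks how hyperparameter tuning would be set up. One possible approach is a cross-validated grid search over the regularization strength `C`, scored with macro F1 so that each plan counts equally. The sketch below runs on synthetic data from `make_classification` as a stand-in for the normalized training features. The grid values and fold count are illustrative assumptions, not tuned recommendations.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import GridSearchCV, StratifiedKFold

# Synthetic 3-class data standing in for X_train_normalized and y_train.
X_demo, y_demo = make_classification(
    n_samples=200, n_features=10, n_informative=5, n_classes=3, random_state=0
)

# Candidate regularization strengths (smaller C means stronger regularization).
param_grid = {"C": [0.01, 0.1, 1.0, 10.0]}

# Stratified folds keep the class mix similar in every fold.
cv = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)

search = GridSearchCV(
    LogisticRegression(max_iter=1000),
    param_grid=param_grid,
    scoring="f1_macro",
    cv=cv,
)
search.fit(X_demo, y_demo)
print(search.best_params_, round(search.best_score_, 3))
```

Tuning should use the training set only, so that the held-out test set still gives an unbiased estimate of the selected model.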

### 4) Evaluate Model: Display model train/test classification metrics of your choice and describe them in the context of this problem.  

In [61]:
from sklearn.metrics import classification_report

y_train_predict = model.predict(X_train_normalized)
print("Train Metrics: ")
print(classification_report(y_train, y_train_predict))

y_test_predict = model.predict(X_test_normalized)
print("Test Metrics: ")
print(classification_report(y_test, y_test_predict))

Train Metrics: 
                      precision    recall  f1-score   support

     Cigna Base HDHP       0.92      0.94      0.93        49
   Cigna Choice HDHP       0.82      0.87      0.85       102
Cigna Copay Plan PPO       0.81      0.72      0.76        65

            accuracy                           0.84       216
           macro avg       0.85      0.84      0.85       216
        weighted avg       0.84      0.84      0.84       216

Test Metrics: 
                      precision    recall  f1-score   support

     Cigna Base HDHP       0.91      0.83      0.87        12
   Cigna Choice HDHP       0.76      0.93      0.84        28
Cigna Copay Plan PPO       0.90      0.60      0.72        15

            accuracy                           0.82        55
           macro avg       0.86      0.79      0.81        55
        weighted avg       0.83      0.82      0.81        55
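
The report above shows recall for Cigna Copay Plan PPO falling from 0.72 on train to 0.60 on test, but it does not show which plan those employees were assigned instead. A confusion matrix answers that question. The sketch below uses hypothetical labels for illustration. In the notebook, `y_test` and `y_test_predict` would take their place.

```python
from sklearn.metrics import confusion_matrix

plans = ["Cigna Base HDHP", "Cigna Choice HDHP", "Cigna Copay Plan PPO"]

# Hypothetical true and predicted labels, not the notebook's actual results.
y_true_demo = [plans[0], plans[0], plans[1], plans[1], plans[1], plans[2], plans[2], plans[2]]
y_pred_demo = [plans[0], plans[1], plans[1], plans[1], plans[1], plans[2], plans[1], plans[1]]

# Rows are true plans and columns are predicted plans, both in the order of `plans`.
cm = confusion_matrix(y_true_demo, y_pred_demo, labels=plans)

# Per-plan recall is the diagonal divided by the row totals.
recall_per_plan = cm.diagonal() / cm.sum(axis=1)

for plan, row in zip(plans, cm):
    print(f"{plan:>22}: {row.tolist()}")
```

In this context the off-diagonal cells matter because recommending a plan with less coverage than the actuary chose is a different kind of error from recommending one with more.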



### 5) Monitoring: Describe metrics you would consider to monitor this model in production.  
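
One family of metrics worth considering is input drift: whether the feature values seen in production still resemble the training data. The sketch below computes a population stability index (PSI) for a single numeric feature using only the standard library. The function name, bin edges, and age samples are hypothetical. The same comparison could be applied to the distribution of predicted plans.

```python
import math


def population_stability_index(expected, actual, bin_edges):
    """PSI between a baseline sample and a recent sample over shared bins."""

    def bin_shares(values):
        counts = [0] * (len(bin_edges) + 1)
        for v in values:
            # The bin index is the number of edges at or below the value.
            idx = sum(v >= edge for edge in bin_edges)
            counts[idx] += 1
        total = len(values)
        # A small floor avoids log(0) when a bin is empty.
        return [max(c / total, 1e-6) for c in counts]

    e, a = bin_shares(expected), bin_shares(actual)
    return sum((ai - ei) * math.log(ai / ei) for ei, ai in zip(e, a))


# Hypothetical samples of the `age` feature.
baseline_ages = [23, 31, 35, 42, 47, 51, 58, 64]
recent_ages = [45, 49, 52, 55, 59, 61, 63, 66]
edges = [30, 40, 50, 60]

psi_same = population_stability_index(baseline_ages, baseline_ages, edges)
psi_shifted = population_stability_index(baseline_ages, recent_ages, edges)
print(psi_same, round(psi_shifted, 3))
```

A PSI of zero means the two samples fall into the bins in identical proportions, and larger values indicate a bigger shift. Once actuary labels arrive for new users, the classification metrics from section 4 can be tracked over time as well.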
