## Medical Recommender Model


In this assignment, we want to evaluate your ability to engineer features and to design and evaluate a model. This notebook walks you through the steps of building a model that recommends medical plans to employees based on user data (demographics, health, finances) and medical plan labels. The exercise focuses on one employer who offers three medical plans to more than 10,000 employees. We have data from about 250 randomly sampled users, each of whom has been classified by actuaries into one of the three plans offered.

The columns of this dataset:

- **age**: age of employee
- **family**: who is covered? (Just Me; Me and my Spouse; Me and my kids; Me, Spouse, and Kids)
- **salary**: income of employee
- **household_salaries**: household income of employee
- **financial_risk_preference**: scale from 1 (Prefer Savings) to 5 (Prefer Protection)
- **preexisting_conditions**: conditions that require frequent doctor visits (cancer, high blood pressure, etc.)
- **prescription_costs**: annual prescription costs
- **pcp_costs**: primary care costs last year
- **specialist_costs**: specialty care costs last year
- **pcp_visits**: number of PCP visits last year
- **qle**: qualifying life event that might incur costs (baby, medical procedure, married, moving)
- **specialty_visits**: number of specialist visits last year
- **exercises**: frequency of exercise (I exercise everyday, I exercise 3x a week, I don't exercise)
- **savings**: how the employee would cover an unexpected $3,000 expense (borrow money, have savings, HSA)
- **label**: plan recommendation as indicated by actuary

In [10]:
import pandas as pd

surveys = pd.read_csv("data/surveys.csv", index_col=0)
surveys.sample(5)

| idx | age | family | salary | household_salaries | financial_risk_preference | preexisting_conditions | prescription_costs | pcp_costs | specialist_costs | pcp_visits | qle | specialty_visits | exercises | savings | label |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 37 | 24 | Me and my kids | 40534 | 56747.6 | 3 | none | 45 | 1265 | 0 | 8 | none | 0 | I exercise everyday | borrow money | Cigna Choice HDHP |
| 201 | 19 | Just Me | 133141 | 133141.0 | 3 | high blood pressure | 87 | 671 | 811 | 8 | none | 2 | I exercise everyday | HSA | Cigna Choice HDHP |
| 105 | 30 | Me and my Spouse | 48047 | 76875.2 | 3 | none | 254 | 281 | 372 | 3 | none | 1 | I exercise everyday | borrow money | Cigna Base HDHP |
| 36 | 22 | Me and my kids | 60161 | 78209.3 | 4 | high blood pressure | 52 | 322 | 1209 | 2 | none | 3 | I exercise 3x a week | have savings | Cigna Choice HDHP |
| 209 | 18 | Me, Spouse, and Kids | 48505 | 67907.0 | 3 | obesity | 61 | 0 | 2574 | 0 | none | 7 | I don't exercise | have savings | Cigna Choice HDHP |


In [11]:
surveys.columns

Index(['age', 'family', 'salary', 'household_salaries',
       'financial_risk_preference', 'preexisting_conditions',
       'prescription_costs', 'pcp_costs', 'specialist_costs', 'pcp_visits',
       'qle', 'specialty_visits', 'exercises', 'savings', 'label'],
      dtype='object')

In [12]:
features = [
    "age",
    "salary",
    "family",
    "household_salaries",
    "savings",
    "financial_risk_preference",
    "preexisting_conditions",
    "qle",
    "pcp_visits",
    "specialty_visits",
    "pcp_costs",
    "specialist_costs"
]
categorical_features = ["family", "preexisting_conditions", "qle", "savings", "exercises"]

#### 1) Separate Numeric and Categorical Features using pandas indexing  (TODO)

In [14]:
numeric_df = surveys.select_dtypes(include='number')
categorical_df = surveys[categorical_features]
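
Since this step asks for pandas indexing, one possible approach is to derive the numeric column list from the two lists defined earlier and index by name instead of relying on dtypes. The sketch below is an assumption about the intended approach, not a reference solution. It uses a toy frame built from three of the sample rows shown above, and `numeric_features` is a helper name introduced here for illustration.

```python
import pandas as pd

# Toy stand-in for data/surveys.csv, built from three of the sample rows above.
surveys = pd.DataFrame(
    {
        "age": [24, 19, 30],
        "family": ["Me and my kids", "Just Me", "Me and my Spouse"],
        "salary": [40534, 133141, 48047],
        "prescription_costs": [45, 87, 254],
        "savings": ["borrow money", "HSA", "borrow money"],
        "label": ["Cigna Choice HDHP", "Cigna Choice HDHP", "Cigna Base HDHP"],
    }
)

# Shortened versions of the notebook's lists (assumption: only these columns matter here).
features = ["age", "family", "salary", "savings"]
categorical_features = ["family", "savings"]

# Hypothetical helper: numeric features are the selected features not declared categorical.
numeric_features = [col for col in features if col not in categorical_features]

# Plain column indexing with a list of names.
numeric_df = surveys[numeric_features]
categorical_df = surveys[categorical_features]

print(list(numeric_df.columns))
print(list(categorical_df.columns))
```

Indexing by name keeps numeric columns that were left out of `features` (here `prescription_costs`) from slipping into the model, and it makes treating an ordinal column such as `financial_risk_preference` as numeric a deliberate choice.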

#### 2) Normalise Features using pandas transformations  (TODO)

##### Question: 

In [22]:
numeric_df = numeric_df #TODO: normalise numeric_df 
categorical_df = categorical_df #TODO: normalise categorical_df

X = numeric_df.merge(categorical_df, left_index=True, right_index=True)
y = surveys.label.values
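
One way the two normalisation TODOs could be filled in with plain pandas is z-score scaling for the numeric columns and one-hot encoding for the categorical ones. This is a sketch under that assumption; the notebook does not say which transformation is expected. The toy values come from the sample rows above, and `numeric_scaled` and `categorical_encoded` are names introduced here.

```python
import pandas as pd

# Toy inputs taken from the five sample rows shown earlier.
numeric_df = pd.DataFrame(
    {
        "age": [24, 19, 30, 22, 18],
        "salary": [40534, 133141, 48047, 60161, 48505],
    }
)
categorical_df = pd.DataFrame(
    {"savings": ["borrow money", "HSA", "borrow money", "have savings", "have savings"]}
)

# Standardise each numeric column to zero mean and unit (population) variance.
# In a real pipeline these statistics should come from the training split only.
numeric_scaled = (numeric_df - numeric_df.mean()) / numeric_df.std(ddof=0)

# One-hot encode each categorical column into 0/1 indicator columns.
categorical_encoded = pd.get_dummies(categorical_df, dtype=float)

# Both frames share the same index, so an index join lines the rows up.
X = numeric_scaled.join(categorical_encoded)
print(X.round(2))
```

Computing the mean and standard deviation before the train/test split, as the cell order in this notebook suggests, lets test-set information leak into training, so it may be worth fitting the scaling on the training rows and reusing it on the test rows.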

#### 3) Split data into train and test sets (TODO)

In [23]:
surveys.label.value_counts()

Cigna Choice HDHP       130
Cigna Copay Plan PPO     80
Cigna Base HDHP          61
Name: label, dtype: int64

In [None]:
train_split = 0.75

X_train = X #TODO: training features
X_test = X #TODO: testing features

y_train = y #TODO training labels
y_test = y #TODO testing labels
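
Because the three plans are not equally represented, a stratified split is one reasonable way to fill in this step. The sketch below assumes scikit-learn is available and uses its `train_test_split`. The label counts match the `value_counts()` output above, while the feature matrix is random placeholder data, not the survey features.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Labels with the same class counts as the value_counts() output above.
y = np.array(
    ["Cigna Choice HDHP"] * 130
    + ["Cigna Copay Plan PPO"] * 80
    + ["Cigna Base HDHP"] * 61
)

# Placeholder features (random noise), standing in for the normalised X.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.normal(size=(len(y), 3)), columns=["f0", "f1", "f2"])

train_split = 0.75

# stratify=y keeps the class proportions roughly equal in both splits;
# random_state makes the split reproducible.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, train_size=train_split, stratify=y, random_state=42
)

print(pd.Series(y_train).value_counts(normalize=True).round(3))
print(pd.Series(y_test).value_counts(normalize=True).round(3))
```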

#### 4) Select model and train (no hyperparameter tuning needed)

In [None]:
clf = None  # TODO: select model (placeholder keeps the cell syntactically valid)
model = clf.fit(X_train, y_train)  # TODO: fit model on the training split
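
As one possible baseline for this step, a multiclass logistic regression from scikit-learn can be fitted without any tuning. The choice of model is an assumption made for illustration; the assignment leaves it open. The training data below is synthetic (three well-separated clusters), so the score it prints says nothing about the real survey data.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Synthetic training set: 30 points around each of three cluster centres.
rng = np.random.default_rng(0)
centers = np.array([[0.0, 0.0], [4.0, 0.0], [0.0, 4.0]])
X_train = np.vstack([rng.normal(loc=c, scale=0.5, size=(30, 2)) for c in centers])
y_train = np.repeat(
    ["Cigna Base HDHP", "Cigna Choice HDHP", "Cigna Copay Plan PPO"], 30
)

# class_weight="balanced" reweights classes inversely to their frequency,
# which can help when the plans are unevenly represented.
clf = LogisticRegression(max_iter=1000, class_weight="balanced")
model = clf.fit(X_train, y_train)

print(model.classes_)
print(model.score(X_train, y_train))
```

`fit` returns the estimator itself, so `model` and `clf` refer to the same fitted object.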

#### 5) Evaluate Model: Display classification metrics 

In [None]:
#TODO: evaluate model
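
For a three-class problem, a confusion matrix and per-class precision, recall, and F1 are a common starting point. The sketch below assumes scikit-learn's metrics module and uses small hand-written `y_test` and `y_pred` lists. These values are hypothetical and are not results from the survey data.

```python
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

labels = ["Cigna Base HDHP", "Cigna Choice HDHP", "Cigna Copay Plan PPO"]
base, choice, copay = labels

# Hypothetical true labels and predictions, for illustration only.
y_test = [base, base, choice, choice, choice, choice, copay, copay]
y_pred = [base, choice, choice, choice, choice, copay, copay, copay]

# Rows are true classes, columns are predicted classes, in the order of `labels`.
cm = confusion_matrix(y_test, y_pred, labels=labels)
accuracy = accuracy_score(y_test, y_pred)

print(cm)
print(f"accuracy: {accuracy:.2f}")
print(classification_report(y_test, y_pred, labels=labels, zero_division=0))
```

With imbalanced classes, the per-class rows of the report are usually more informative than overall accuracy, since a model can score well by favouring the largest plan.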

### Discussion Questions

Feature Exploration and Selection

1) What techniques do you use to explore and visualize the distribution of features in the dataset?
2) How do you decide which features are relevant for the classification task? Can you discuss feature selection methods you're familiar with?

Categorical Variables

1) How do you handle categorical variables in a tabular dataset? Are there specific encoding techniques you prefer for classification models?
2) Can you explain the concept of target encoding, and when might it be useful in a classification problem?
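
As a starting point for the target-encoding question, the sketch below replaces each category with a smoothed mean of a binary target. It is a minimal illustration under stated assumptions: the indicator column `is_choice` (1 if the plan is Cigna Choice HDHP), its values, and the smoothing strength `m` are all hypothetical.

```python
import pandas as pd

# Hypothetical toy data: a categorical feature and a binary target indicator.
df = pd.DataFrame(
    {
        "savings": ["HSA", "HSA", "borrow money", "borrow money", "borrow money", "have savings"],
        "is_choice": [1, 1, 0, 1, 0, 1],
    }
)

prior = df["is_choice"].mean()  # overall target rate
stats = df.groupby("savings")["is_choice"].agg(["mean", "count"])

m = 2  # hypothetical smoothing strength: larger values pull rare categories toward the prior

# Smoothed per-category target mean.
encoding = (stats["count"] * stats["mean"] + m * prior) / (stats["count"] + m)

df["savings_te"] = df["savings"].map(encoding)
print(df)
```

The encoding should be computed on training data only (ideally out-of-fold), because it uses the target directly and can otherwise leak label information into the features.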

Dealing with Imbalanced Data

1) In the context of imbalanced classes, what strategies do you employ during feature engineering to address potential issues?
2) How can feature engineering contribute to mitigating the impact of class imbalance in a classification model?
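
Alongside feature engineering, one commonly used mitigation is to weight classes inversely to their frequency. The sketch below computes such weights from the label counts shown earlier, using the formula `n_samples / (n_classes * count)`, which is the heuristic scikit-learn applies for `class_weight="balanced"`. It is offered as one option to discuss, not as the expected answer.

```python
import pandas as pd

# Label counts from the value_counts() output earlier in the notebook.
label_counts = pd.Series(
    {"Cigna Choice HDHP": 130, "Cigna Copay Plan PPO": 80, "Cigna Base HDHP": 61}
)

n_samples = label_counts.sum()
n_classes = len(label_counts)

# Rarer classes get weights above 1, the majority class gets a weight below 1.
class_weight = (n_samples / (n_classes * label_counts)).to_dict()

for plan, weight in class_weight.items():
    print(f"{plan}: {weight:.3f}")
```

A dictionary of this shape can be passed as the `class_weight` argument of many scikit-learn classifiers.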

Feature Scaling

1) Do you consider feature scaling in your feature engineering process? When is it necessary, and how does it impact different machine learning algorithms?
2) Can you explain the difference between normalization and standardization, and when might you choose one over the other?

Feature Transformation

1) How do you approach feature transformation, such as creating interaction terms or polynomial features, and when might these techniques be beneficial?
2) Can you discuss the use of log-transformations or Box-Cox transformations for certain types of features?
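
Cost columns such as `specialist_costs` tend to be right-skewed and contain zeros, which is relevant to the log-transformation question. The sketch below applies `np.log1p`, which computes log(1 + x) and is therefore defined at zero. The values are the `specialist_costs` entries from the five sample rows above. Box-Cox, by contrast, requires strictly positive inputs, so it would need a shift before being applied to a column like this one.

```python
import numpy as np
import pandas as pd

# specialist_costs values from the five sample rows shown earlier.
specialist_costs = pd.Series([0, 811, 372, 1209, 2574], name="specialist_costs")

# log1p compresses large values while mapping 0 to 0.
log_costs = np.log1p(specialist_costs)

# expm1 is the exact inverse, so the transform is reversible.
recovered = np.expm1(log_costs)

print(pd.DataFrame({"raw": specialist_costs, "log1p": log_costs.round(3)}))
```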