## Preprocessing, Splitting, & Evaluation
The objective of this notebook is to establish a standard way of preprocessing data and evaluating models.

The functions prototyped here have since been moved to `merge.py` in the modules directory.

This notebook only merges the files.

As the goal of the algorithm is to replicate the physician's decision-making process, the information available to the model should be aligned in time with the information available to the physician. To account for this, although referrals can occur at any point in a month, we shift every referral forward to the 1st of the following month (placing it in the next month). Without the shift, a referral associated with a traumatic event such as a car accident would be impossible for the model to predict. With the referral shifted forward to the next month, an appropriately trained model may predict it.
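
The shift described above can be sketched with the standard library alone. This is only an illustration, not the code in `merge.py`: the helper name `shift_to_month_start` is hypothetical, and the example date is made up.

```python
from datetime import date, datetime

def shift_to_month_start(d: date, months: int = 1) -> date:
    """Move a date to the 1st of the month `months` months later."""
    # Count months from year 0 so that December rolls over into January.
    month_index = d.year * 12 + (d.month - 1) + months
    return date(month_index // 12, month_index % 12 + 1, 1)

# A referral dated mid-January is placed on the 1st of February.
referral = datetime.strptime("01/15/2017", "%m/%d/%Y").date()
shifted = shift_to_month_start(referral)
print(shifted)  # 2017-02-01
```

Working with a month index rather than adding a fixed number of days avoids errors from months of different lengths.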

We also create features that tell the model whether the person was recently referred and how many times the person has been referred in total.

This preprocessing also accounts for the case where an individual is referred multiple times in a single month. While the extra referrals are added to the history, they are not used in the main classification.




In [1]:
# For running on Colab
#files = "https://github.com/HealthINCITE/patient_panel/blob/master/zip/merge/v1/files.zip" 
#!wget $files && unzip -o "files.zip"

In [2]:
%reload_ext autoreload
%autoreload 2


In [3]:
import sys, torch, datetime
from pathlib import Path
path=Path('..')
sys.path.append('../modules/')
import pandas as pd
import numpy as np
import merge 
import testit as t
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import precision_score, accuracy_score,recall_score, balanced_accuracy_score, f1_score, roc_auc_score,log_loss, roc_curve 
pd.set_option('display.max_columns', 9999)
pd.set_option('display.max_rows',10)
per_df=pd.read_csv(path/'data'/'raw'/'test.csv')
ref_df=pd.read_csv(path/'data'/'raw'/'test_referral.csv')
#In the real one 

This cell performs the merge and the feature generation.

It returns the merged dataframe and a series, `translate`.

The generated columns are:

- `ref`: whether a referral occurred in the month (0/1).
- `ref_m`: the number of referrals in a given month (int).
- `class`: the reasons for the referrals in a given month, by name (str).
- `labels`: the reasons for the referrals in a given month, by code (str).
- `m_ref`: the total referrals that came in during that month.
- `c_ref`: the cumulative sum of the prior `m_ref`.
- `lab_x`: label for the class of referral.
- `lag1_lab_x`: lagged variable (1 month) for the class of referral.
- `win6_lab_x`: window variable (6 months) indicating a referral for a reason.
- `sum_lab_x`: sum of all previous referrals for a reason.



In [4]:
# Configuration; this shouldn't change.
# Insurance member history, referred to as the 'person' file
per_col='person_id'
per_date_col='myr'
per_date_format='%Y%m'
# Referral history, referred to as the 'referral' file
ref_date_col='date'
ref_date_format='%m/%d/%Y'
ref_shift_months=1
df,translate=merge.merge(per_df, per_col, per_date_col, per_date_format, ref_df, ref_date_col, ref_date_format, gen_features=True )
df

Unnamed: 0,person_id,myr,cad0,cad1,dv9,date,lab_1,lab_2,lab_3,ref_m,ref,class,labels,lag1_lab_1,lag1_lab_2,lag1_lab_3,sum_lab_1,sum_lab_2,sum_lab_3,win6_lab_1,win6_lab_2,win6_lab_3
0,1000,201601,1,1,0,2016-01-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
1,1000,201602,1,1,0,2016-02-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
2,1000,201603,1,1,0,2016-03-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
3,1000,201604,1,1,0,2016-04-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
4,1000,201605,1,1,0,2016-05-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
282,2001,201707,1,0,0,2017-07-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
283,2001,201708,1,0,0,2017-08-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
284,2001,201709,1,0,0,2017-09-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0
285,2001,201710,1,0,0,2017-10-01,0,0,0,0,0,h,0,0,0,0,0,0,0,0,0,0


In [12]:
#save
#t.test_eq(df.shap)
t.test_eq(len(ref_df),df.ref_m.sum())
df.to_csv(path/'data'/'processed'/'test.csv', index = False)

## Details
Below are the steps that the merge function performs.
1. Preprocess referrals. 
2. Preprocess persons.
3. Merge history.
4. Fill NAs.
5. Create features.

### 1. Preprocess referrals. 

This step (a) aggregates multiple referrals that fall in a single month and (b) creates labels that indicate the reason code for each referral.
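
The two operations can be sketched in plain pandas as follows. This is a rough illustration under stated assumptions, not the body of `merge.preprocess_referrals`: the `reason` column name and the toy rows are hypothetical, and the sorted-order coding of reasons is an assumption.

```python
import pandas as pd

# Toy referral file: one row per referral (hypothetical column names and data).
ref = pd.DataFrame({
    "person_id": [1004, 1004, 1004],
    "date": ["01/05/2017", "01/20/2017", "06/02/2017"],
    "reason": ["diabetes", "liver", "liver"],
})

# Month key in the same YYYYMM form as the person file.
parsed = pd.to_datetime(ref["date"], format="%m/%d/%Y")
ref["myr"] = parsed.dt.strftime("%Y%m").astype(int)

# Map each reason to an integer code (1, 2, ...) in sorted order.
codes = {r: i + 1 for i, r in enumerate(sorted(ref["reason"].unique()))}
ref["code"] = ref["reason"].map(codes)

# One indicator column per code, collapsed to one row per person-month.
indicators = pd.get_dummies(ref["code"], prefix="lab").astype(int)
monthly = (
    pd.concat([ref[["person_id", "myr"]], indicators], axis=1)
    .groupby(["person_id", "myr"], as_index=False)
    .max()
)

# Number of referrals in each person-month.
counts = ref.groupby(["person_id", "myr"]).size().reset_index(name="ref_m")
monthly = monthly.merge(counts, on=["person_id", "myr"])
monthly["ref"] = 1  # every row here has at least one referral
print(monthly)
```

Summing `ref_m` over the result should give back the number of rows in the raw referral file, which is the same invariant the notebook checks after the full merge.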


In [6]:
# 1. Preprocess referrals.
ref_pro, transform=merge.preprocess_referrals(ref_df, per_col, ref_date_col, ref_date_format)
ref_pro

Unnamed: 0,person_id,myr,lab_1,lab_2,lab_3,ref_m,ref,class,labels
0,1004,201701,1,1,0,2,1,"diabetes, liver","1, 2"
1,1004,201706,0,1,0,1,1,liver,2
2,1005,201701,1,1,0,2,1,"liver, diabetes","2, 1"
3,1006,201706,0,1,0,1,1,liver,2
4,1007,201706,0,1,0,1,1,liver,2
5,1008,201710,0,0,1,1,1,pnemonia,3
6,1009,201710,0,0,1,1,1,pnemonia,3


In [7]:
# Gives a map from reason to code. 
transform

Unnamed: 0,0
diabetes,1
liver,2
pnemonia,3


### 2. Preprocess persons.
This step adds a `date` field and does little else.
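
As a minimal sketch of what adding the date field might look like (the actual `merge.preprocess_persons` may differ), the integer `YYYYMM` key can be parsed with `pd.to_datetime`. The toy rows are made up.

```python
import pandas as pd

per = pd.DataFrame({"person_id": [1000, 1000], "myr": [201601, 201602]})

# Parse the integer YYYYMM key; the day defaults to the 1st of the month.
per["date"] = pd.to_datetime(per["myr"].astype(str), format="%Y%m")
print(per)
```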

In [8]:
#2. Preprocess Persons
per_pro=merge.preprocess_persons(per_df, per_col, per_date_col, per_date_format)
per_pro.head()
#df.shape

Unnamed: 0,person_id,myr,cad0,cad1,dv9,date
0,1000,201601,1,1,0,2016-01-01
1,1000,201602,1,1,0,2016-02-01
2,1000,201603,1,1,0,2016-03-01
3,1000,201604,1,1,0,2016-04-01
4,1000,201605,1,1,0,2016-05-01


### 3. Merge

This merges the person data with the referral data. 
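
A small sketch of the same kind of left merge on toy data shows why the referral columns come back empty for most rows. The `validate` and `indicator` arguments are optional additions for checking the result; they are not used in this notebook, and the toy frames are made up.

```python
import pandas as pd

per = pd.DataFrame({
    "person_id": [1004, 1004, 1004],
    "myr": [201612, 201701, 201702],
})
ref = pd.DataFrame({"person_id": [1004], "myr": [201701], "ref_m": [2]})

merged = pd.merge(
    per, ref, how="left", on=["person_id", "myr"],
    validate="one_to_one",  # raise if either side has duplicate keys
    indicator=True,         # adds a `_merge` column: "both" or "left_only"
)
print(merged)
```

A left merge keeps every person-month, so the row count equals that of the person file, and `ref_m` is NaN wherever no referral matched.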


In [9]:
#3. Merge 
df=pd.merge(per_pro, ref_pro, how='left',  on=[per_col, per_date_col])
df

Unnamed: 0,person_id,myr,cad0,cad1,dv9,date,lab_1,lab_2,lab_3,ref_m,ref,class,labels
0,1000,201601,1,1,0,2016-01-01,,,,,,,
1,1000,201602,1,1,0,2016-02-01,,,,,,,
2,1000,201603,1,1,0,2016-03-01,,,,,,,
3,1000,201604,1,1,0,2016-04-01,,,,,,,
4,1000,201605,1,1,0,2016-05-01,,,,,,,
...,...,...,...,...,...,...,...,...,...,...,...,...,...
282,2001,201707,1,0,0,2017-07-01,,,,,,,
283,2001,201708,1,0,0,2017-08-01,,,,,,,
284,2001,201709,1,0,0,2017-09-01,,,,,,,
285,2001,201710,1,0,0,2017-10-01,,,,,,,


### 4-5. Fill NAs and create features.

Merging leaves empty values in the referral columns for every "no referral" case.
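
One way to fill such gaps is to select columns by name prefix and fill each with a default. The helper below, `fill_by_prefix`, is a hypothetical stand-in written for illustration; it is not the implementation of `merge.fill_na`, and the toy frame is made up.

```python
import pandas as pd

def fill_by_prefix(df, prefixes, value, dtype):
    """Fill NaNs in every column whose name starts with one of `prefixes`."""
    out = df.copy()
    for col in out.columns:
        if col.startswith(tuple(prefixes)):
            out[col] = out[col].fillna(value).astype(dtype)
    return out

merged = pd.DataFrame({
    "person_id": [1000, 1004],
    "lab_1": [float("nan"), 1.0],
    "ref": [float("nan"), 1.0],
    "class": [None, "diabetes"],
})
filled = fill_by_prefix(merged, ["lab_", "ref"], 0, int)
filled = fill_by_prefix(filled, ["class"], "h", str)
print(filled)
```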

Generating features creates a "look back" for referrals that happened (a) at any time in the person's history or (b) within a 6-month window.
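
The look-back features can be sketched with pandas group operations. This is an assumption about how such features might be built, not the code in `merge.generate_features`: it assumes rows are already sorted by month within each person, that the sum and window are computed from the lagged column, and that the window feature is a 0/1 indicator.

```python
import pandas as pd

# Toy data: person 1 has nine months, person 2 has two (made-up values).
df = pd.DataFrame({
    "person_id": [1] * 9 + [2] * 2,
    "lab_1":     [0, 1, 0, 0, 0, 0, 0, 0, 0, 1, 0],
})

# Value from the previous month; a person's first month has no history.
df["lag1_lab_1"] = df.groupby("person_id")["lab_1"].shift(1).fillna(0).astype(int)

lagged = df.groupby("person_id")["lag1_lab_1"]
# Running count of all earlier referrals for this reason.
df["sum_lab_1"] = lagged.cumsum()
# 1 if a referral for this reason occurred in the previous six months.
df["win6_lab_1"] = lagged.transform(
    lambda s: s.rolling(6, min_periods=1).max()
).astype(int)
print(df)
```

Grouping by person before shifting keeps one person's history from leaking into the first rows of the next person.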


In [11]:
#4-5. Fill NAs and create features. 

df=merge.fill_na(df,['lab_','ref'],0, int)
df=merge.fill_na(df,['class'],'h',str)
df=merge.fill_na(df,['labels'],'0',str)
df=merge.generate_features(df, per_col, ['lab_'], 'lag_1')
df=merge.fill_na(df,['lag1_lab'],'0',int)
    
df=merge.generate_features(df, per_col, ['lag1_lab'], 'sum')
df=merge.generate_features(df, per_col, ['lag1_lab'], 'win_6')