## Preprocessing, Splitting, & Evaluation
The objective of this notebook is to establish a standard way of preprocessing data and evaluating models.

The functions prototyped here have since been moved to `utils.py` in the `modules` directory.

For each person, the model is expected to predict a probability of referral. Each prediction is then evaluated according to the time window in which the referral occurs.




In [2]:
%reload_ext autoreload
%autoreload 2


In [3]:
import sys, torch, datetime
from pathlib import Path
path=Path('..')
sys.path.append('../modules/')
import pandas as pd
import numpy as np
import utils
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import (
    precision_score, accuracy_score, recall_score, balanced_accuracy_score,
    f1_score, roc_auc_score, log_loss, roc_curve,
)
pd.set_option('display.max_columns', 9999)
pd.set_option('display.max_rows', 100)

In [4]:
# This could be used as benchmarking data:
# 100 people over 5.5 years.
ref=pd.read_csv(path/'data'/'raw'/'test_referral.csv')
ref_pro=utils.preprocess_referrals(ref)
ref_pro

id,yrm,referrals,referral,class
1004,201701,2,1,diabetes liver
1005,201701,1,1,diabetes
1006,201706,1,1,liver
1007,201706,1,1,liver
1008,201712,1,1,pnemonia
1009,201712,1,1,pnemonia


In [5]:
# Preprocess patients.
# Currently doesn't handle missing values.
pat=pd.read_csv(path/'data'/'raw'/'test.csv')
con_months=24
date_col='yrm'
date_format='%Y%m'
con_start=datetime.date(2016, 1, 1)
pat_pro=utils.preprocess_patients(pat, date_col, date_format, con_start,con_months)
pat_pro.head()
#df.shape

Unnamed: 0,id,yrm,cad0,cad1,dv9,date
0,1000,201601,1,1,0,2016-01-01
1,1000,201602,1,1,0,2016-02-01
2,1000,201603,1,1,0,2016-03-01
3,1000,201604,1,1,0,2016-04-01
4,1000,201605,1,1,0,2016-05-01


### Details 
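- Left outer merge of the preprocessed patient data with the referral data. Patient-months without a referral are filled with 0 referrals and the class `healthy`, as the output below shows.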
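
The merge step can be sketched roughly as follows. This is not the code in `utils.merge_and_fill`; the function name `merge_and_fill_sketch`, the toy feature values, and the use of plain `id`/`yrm` columns (rather than an index) are assumptions made for illustration.

```python
import pandas as pd

# Toy patient-month rows (feature values are illustrative only).
patients = pd.DataFrame({
    "id": [1000, 1000, 1004, 1004],
    "yrm": [201612, 201701, 201612, 201701],
    "cad0": [1, 1, 0, 0],
})

# Toy referral rows keyed by the same (id, yrm) pair.
referrals = pd.DataFrame({
    "id": [1004],
    "yrm": [201701],
    "referrals": [2],
    "referral": [1],
    "class": ["diabetes liver"],
})


def merge_and_fill_sketch(patients, referrals):
    """Left-merge referrals onto patient-months and fill the gaps."""
    # how="left" keeps every patient-month, even those with no referral.
    merged = patients.merge(referrals, on=["id", "yrm"], how="left")
    # Months without a referral get a count of 0 and a label of 0.
    count_cols = ["referrals", "referral"]
    merged[count_cols] = merged[count_cols].fillna(0).astype(int)
    # Assumed convention: no referral means the class is "healthy".
    merged["class"] = merged["class"].fillna("healthy")
    return merged


merged = merge_and_fill_sketch(patients, referrals)
print(merged)
```

Using a left merge means the patient table determines the number of rows, so the result has one row per patient-month regardless of how sparse the referral table is.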


In [9]:
df=utils.merge_and_fill(pat_pro, ref_pro)
df.head()

Unnamed: 0,id,yrm,cad0,cad1,dv9,date,referrals,referral,class
0,1000,201601,1,1,0,2016-01-01,0,0,healthy
1,1000,201602,1,1,0,2016-02-01,0,0,healthy
2,1000,201603,1,1,0,2016-03-01,0,0,healthy
3,1000,201604,1,1,0,2016-04-01,0,0,healthy
4,1000,201605,1,1,0,2016-05-01,0,0,healthy


## Train test split based on time window.


In [10]:
# Run the function
split_time=datetime.date(2016, 12,30)
date_format='%Y%m'
train, test = utils.train_test_split(df, 'yrm', date_format, split_time) 
train.head()

Unnamed: 0,id,yrm,cad0,cad1,dv9,date,referrals,referral,class
0,1000,2016-01-01,1,1,0,2016-01-01,0,0,healthy
1,1000,2016-02-01,1,1,0,2016-02-01,0,0,healthy
2,1000,2016-03-01,1,1,0,2016-03-01,0,0,healthy
3,1000,2016-04-01,1,1,0,2016-04-01,0,0,healthy
4,1000,2016-05-01,1,1,0,2016-05-01,0,0,healthy
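
A time-based split of this kind can be sketched as below. It is only an approximation of what `utils.train_test_split` might do: the name `time_split_sketch` and the rule "rows dated on or before `split_time` go to train" are assumptions, and the sketch does not convert the `yrm` column the way the output above suggests the real function does.

```python
import datetime

import pandas as pd


def time_split_sketch(df, date_col, date_format, split_time):
    """Split rows into train/test by comparing a date column to a cutoff."""
    # Parse values such as 201601 into timestamps (first day of the month).
    dates = pd.to_datetime(df[date_col].astype(str), format=date_format)
    cutoff = pd.Timestamp(split_time)
    # Assumed rule: on or before the cutoff is train, after it is test.
    train = df[dates <= cutoff]
    test = df[dates > cutoff]
    return train, test


toy = pd.DataFrame({
    "id": [1000, 1000, 1000, 1000],
    "yrm": [201611, 201612, 201701, 201702],
})
train_toy, test_toy = time_split_sketch(
    toy, "yrm", "%Y%m", datetime.date(2016, 12, 30)
)
print(train_toy)
print(test_toy)
```

Splitting on time rather than at random keeps every test row strictly later than every training row, which matches how the model would be used in practice.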


### Predictions 
The predictions are easy to assess for the toy model:
- The first 4 individuals are not referrals.
- The next 2 are positive in the first 3 months.
- The next 2 are positive in the first 6 months.
- The final 2 are positive in the 12th month.

We set the windows as follows:
`windows= [[0,3], [0,6], [0,12]]`
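
To make the window idea concrete, here is a minimal sketch of how per-window labels could be derived from the toy referral months. The helper names (`months_since`, `label_window`), the split month `201612`, and the convention that a window `[start, end]` covers month offsets greater than `start` and up to `end` are all assumptions; `utils.score_windows` may work differently.

```python
import pandas as pd

# Toy referral events: the month in which each person is referred.
events = pd.DataFrame({
    "id": [1004, 1005, 1006, 1007, 1008, 1009],
    "yrm": [201701, 201701, 201706, 201706, 201712, 201712],
})
all_ids = [1000, 1004, 1005, 1006, 1007, 1008, 1009]
split_yrm = 201612  # assumed last month of the training period


def months_since(yrm, ref):
    """Number of calendar months from ref to yrm (both in YYYYMM form)."""
    return (yrm // 100 - ref // 100) * 12 + (yrm % 100 - ref % 100)


def label_window(events, ids, split_yrm, window):
    """Return 1 for ids referred inside the window, else 0."""
    start, end = window
    offset = months_since(events["yrm"], split_yrm)
    # Assumed convention: start is exclusive, end is inclusive.
    hit = events.loc[(offset > start) & (offset <= end), "id"]
    index = pd.Index(ids, name="id")
    return pd.Series(index.isin(hit).astype(int), index=index)


windows = [[0, 3], [0, 6], [0, 12]]
labels = {tuple(w): label_window(events, all_ids, split_yrm, w) for w in windows}
for w, series in labels.items():
    print(w, series.tolist())
```

Under these assumptions each wider window adds the next pair of individuals to the positive class, while person 1000 stays negative throughout.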

In [32]:
# Define the windows. For example, [0, 3] covers the first 3 months after the split.
windows= [[0,3], [0,6], [0,12]]
pred=(path/'data'/'predictions'/'test.csv')
capacity = 7 #The capacity of human review
exp= 'manual_test'
results_file=(path/'data'/'results'/'test.csv')
threshold=0.5
#Score windows
score=utils.score_windows(exp=exp, df=test, pred=pred, capacity=capacity, windows=windows, results_file=results_file, save=True, append=False, target='referral', threshold=threshold)
score

Int64Index([1000, 1001, 1002, 1003, 1004, 1005, 1006, 1007, 1008, 1009], dtype='int64', name='id')
      referral     class
id                      
1000         0   healthy
1004         1  diabetes
1005         1  diabetes
1006         1     liver
1007         1     liver
1008         1  pnemonia
1009         1  pnemonia
id
1000    0
1004    1
1005    1
1006    0
1007    0
1008    0
1009    0
dtype: int64
id
1000    0
1004    1
1005    1
1006    1
1007    1
1008    0
1009    0
dtype: int64
id
1000    0
1004    1
1005    1
1006    1
1007    1
1008    1
1009    1
dtype: int64


Unnamed: 0,experiment,date,range,log_loss,precision,recall,accuracy,balanced_accuracy,f1
0,manual_test,2019-11-22 21:50:41.763722,201701-201703,19.7369,0.333333,1.0,0.428571,0.6,0.5
1,manual_test,2019-11-22 21:50:41.785024,201701-201706,9.86845,0.666667,1.0,0.714286,0.666667,0.8
2,manual_test,2019-11-22 21:50:41.798254,201701-201712,9.992007e-16,1.0,1.0,1.0,1.0,1.0
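
As a sanity check, the classification metrics in the first row can be reproduced by hand with scikit-learn. The true labels come from the 3-month window printed above; the predictions are an assumption (every person except 1000 flagged as a referral), chosen because they are consistent with the reported scores. The contents of the actual predictions file are not shown in this notebook.

```python
from sklearn.metrics import (
    accuracy_score,
    balanced_accuracy_score,
    f1_score,
    precision_score,
    recall_score,
)

# Ids in order: 1000, 1004, 1005, 1006, 1007, 1008, 1009.
y_true = [0, 1, 1, 0, 0, 0, 0]  # referred within the first 3 months
y_pred = [0, 1, 1, 1, 1, 1, 1]  # assumed hard predictions after thresholding

metrics = {
    "precision": precision_score(y_true, y_pred),
    "recall": recall_score(y_true, y_pred),
    "accuracy": accuracy_score(y_true, y_pred),
    "balanced_accuracy": balanced_accuracy_score(y_true, y_pred),
    "f1": f1_score(y_true, y_pred),
}
for name, value in metrics.items():
    print(f"{name}: {value:.6f}")
```

With 2 true positives, 4 false positives, and 1 true negative, this gives precision 1/3, recall 1, accuracy 3/7, balanced accuracy 0.6, and F1 0.5, matching the `201701-201703` row. The log loss is left out of the sketch because, for hard 0/1 predictions, its value depends on how the library clips probabilities.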


## Null Model 
The null model here simply predicts that referrals occur.