# Data collection and train test split

The data are stored in three csv files: dim_claims.csv, dim_date.csv, and dim_pa.csv.

These tables are linked through a fourth file, bridge.csv.

The dim_claims.csv file contains all claims (approved and rejected), while dim_pa.csv contains the prior authorization (PA) data for the rejected claims (drug not on the formulary, PA required, or plan limitations exceeded).

The dim_date.csv file contains dates up to 2021-02-28 (dim_date_id 1520); however, the bridge file only references dates up to 2019-12-31 (dim_date_id 1095).
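The correspondence between dim_date_id and the calendar date can be sanity-checked with a small sketch. It assumes that the ids number consecutive days starting with id 1 on 2017-01-01, which is consistent with the rows displayed in this notebook but is not stated by the data source; the helper name `date_for_id` is hypothetical.

```python
from datetime import date, timedelta

# Assumption: dim_date_id counts consecutive days, with id 1 = 2017-01-01.
FIRST_DATE = date(2017, 1, 1)

def date_for_id(dim_date_id: int) -> date:
    """Return the calendar date for a dim_date_id under the assumption above."""
    return FIRST_DATE + timedelta(days=dim_date_id - 1)

last_bridge_date = date_for_id(1095)  # last id referenced by bridge.csv
last_table_date = date_for_id(1520)   # last id present in dim_date.csv
print(last_bridge_date, last_table_date)  # 2019-12-31 2021-02-28
```

Under this assumption the two ids quoted above map to the two dates quoted above, so the date table simply extends about 14 months beyond the claims.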

In [1]:
import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

## Collecting the data

### The claims

dim_claims.csv contains 1335576 claims. 

Each claim has the following information: claim id (dim_claim_id), the payer (bin), the drug (drug), whether the claim was approved (pharmacy_claim_approved), and the reject code (reject_code; only rejected claims have one, and it is NaN for approved claims).

In [2]:
claims = pd.read_csv('../data/initial_data/dim_claims.csv')

In [3]:
claims.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1335576 entries, 0 to 1335575
Data columns (total 5 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   dim_claim_id             1335576 non-null  int64  
 1   bin                      1335576 non-null  int64  
 2   drug                     1335576 non-null  object 
 3   reject_code              555951 non-null   float64
 4   pharmacy_claim_approved  1335576 non-null  int64  
dtypes: float64(1), int64(3), object(1)
memory usage: 50.9+ MB


In [4]:
claims.head()

Unnamed: 0,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved
0,1,417380,A,75.0,0
1,2,999001,A,,1
2,3,417740,A,76.0,0
3,4,999001,A,,1
4,5,417740,A,,1


In [5]:
claims.columns

Index(['dim_claim_id', 'bin', 'drug', 'reject_code',
       'pharmacy_claim_approved'],
      dtype='object')

### The PA data 

There are 555951 rejected claims, each with one of three reject codes: 70 (drug not covered by the plan), 75 (drug on the formulary but requires PA), and 76 (drug covered, but the plan's limitations have been exceeded). The dim_pa.csv file has one row for each of these claims.
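As an illustration of how the reject codes could be turned into readable labels, here is a minimal sketch on a toy frame. The first five rows follow the claims.head() output; the sixth row (code 70) is made up so that every code appears. The `REJECT_LABELS` mapping and the `reject_reason` column are hypothetical names, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Toy stand-in for the claims table; the last row is invented for illustration.
toy_claims = pd.DataFrame({
    "dim_claim_id": [1, 2, 3, 4, 5, 6],
    "reject_code": [75.0, np.nan, 76.0, np.nan, np.nan, 70.0],
    "pharmacy_claim_approved": [0, 1, 0, 1, 1, 0],
})

# Hypothetical label mapping, paraphrasing the code descriptions above.
REJECT_LABELS = {
    70.0: "not covered by plan",
    75.0: "on formulary, needs PA",
    76.0: "covered, limit exceeded",
}

# Codes missing from the mapping (here only NaN) stay NaN.
toy_claims["reject_reason"] = toy_claims["reject_code"].map(REJECT_LABELS)

# dropna=False keeps the approved claims (NaN reason) in the tally.
reason_counts = toy_claims["reject_reason"].value_counts(dropna=False)
print(reason_counts)
```

The same two lines (`map` followed by `value_counts`) could be applied to the full claims frame to see how the 555951 rejections split across the three codes.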

In [6]:
pas = pd.read_csv('../data/initial_data/dim_pa.csv')

In [7]:
pas.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 555951 entries, 0 to 555950
Data columns (total 5 columns):
 #   Column             Non-Null Count   Dtype
---  ------             --------------   -----
 0   dim_pa_id          555951 non-null  int64
 1   correct_diagnosis  555951 non-null  int64
 2   tried_and_failed   555951 non-null  int64
 3   contraindication   555951 non-null  int64
 4   pa_approved        555951 non-null  int64
dtypes: int64(5)
memory usage: 21.2 MB


In [8]:
pas.head()

Unnamed: 0,dim_pa_id,correct_diagnosis,tried_and_failed,contraindication,pa_approved
0,1,1,1,0,1
1,2,1,0,0,1
2,3,0,0,1,1
3,4,1,1,0,1
4,5,0,1,0,1


### The date data 

There are 1520 dates in the dim_date.csv file. Each row contains the date id (dim_date_id), the date value, the calendar year, month, and day, the day of the week, and flags for weekday, workday, and holiday.

In [9]:
dates = pd.read_csv('../data/initial_data/dim_date.csv')

In [10]:
dates.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1520 entries, 0 to 1519
Data columns (total 9 columns):
 #   Column          Non-Null Count  Dtype 
---  ------          --------------  ----- 
 0   dim_date_id     1520 non-null   int64 
 1   date_val        1520 non-null   object
 2   calendar_year   1520 non-null   int64 
 3   calendar_month  1520 non-null   int64 
 4   calendar_day    1520 non-null   int64 
 5   day_of_week     1520 non-null   int64 
 6   is_weekday      1520 non-null   int64 
 7   is_workday      1520 non-null   int64 
 8   is_holiday      1520 non-null   int64 
dtypes: int64(8), object(1)
memory usage: 107.0+ KB


In [11]:
dates.columns

Index(['dim_date_id', 'date_val', 'calendar_year', 'calendar_month',
       'calendar_day', 'day_of_week', 'is_weekday', 'is_workday',
       'is_holiday'],
      dtype='object')

### The bridge data 

The bridge data link each dim_claim_id to its dim_date_id and, for rejected claims, to its dim_pa_id (NaN for approved claims).

In [12]:
bridge = pd.read_csv('../data/initial_data/bridge.csv')

In [13]:
bridge.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 1335576 entries, 0 to 1335575
Data columns (total 3 columns):
 #   Column        Non-Null Count    Dtype  
---  ------        --------------    -----  
 0   dim_claim_id  1335576 non-null  int64  
 1   dim_pa_id     555951 non-null   float64
 2   dim_date_id   1335576 non-null  int64  
dtypes: float64(1), int64(2)
memory usage: 30.6 MB


In [14]:
bridge.head()

Unnamed: 0,dim_claim_id,dim_pa_id,dim_date_id
0,1,1.0,1
1,2,,1
2,3,2.0,1
3,4,,1
4,5,,1


# Merging the data

### Creating merged data

First, create the merged data by joining claims to bridge on 'dim_claim_id'.

In [15]:
# Join claims to bridge on the shared claim id (default inner join)
data = claims.merge(bridge, on='dim_claim_id')

In [16]:
data

Unnamed: 0,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved,dim_pa_id,dim_date_id
0,1,417380,A,75.0,0,1.0,1
1,2,999001,A,,1,,1
2,3,417740,A,76.0,0,2.0,1
3,4,999001,A,,1,,1
4,5,417740,A,,1,,1
...,...,...,...,...,...,...,...
1335571,1335572,417740,C,75.0,0,555950.0,1095
1335572,1335573,999001,C,,1,,1095
1335573,1335574,417380,C,70.0,0,555951.0,1095
1335574,1335575,999001,C,,1,,1095
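The claims-to-bridge merge relies on each dim_claim_id appearing exactly once in both tables. One way to make that assumption explicit is sketched below on toy frames whose values follow the head() outputs above; `validate` and `indicator` are standard `DataFrame.merge` parameters, while the toy frames themselves are illustrative only.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for claims and bridge.
toy_claims = pd.DataFrame({
    "dim_claim_id": [1, 2, 3],
    "bin": [417380, 999001, 417740],
    "drug": ["A", "A", "A"],
})
toy_bridge = pd.DataFrame({
    "dim_claim_id": [1, 2, 3],
    "dim_pa_id": [1.0, np.nan, 2.0],
    "dim_date_id": [1, 1, 1],
})

# validate="one_to_one" raises pandas.errors.MergeError if a key is duplicated
# on either side; indicator=True adds a "_merge" column recording whether each
# row was found in the left table, the right table, or both.
merged = toy_claims.merge(
    toy_bridge, on="dim_claim_id", how="outer",
    validate="one_to_one", indicator=True,
)

# With an outer join, any unmatched claim would show up as left_only/right_only.
all_matched = bool((merged["_merge"] == "both").all())
print(len(merged), all_matched)
```

If the check passes on the real tables, the default inner join used in the notebook loses no rows.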


Then merge dates into data on 'dim_date_id'.

In [17]:
data = dates.merge(data, on='dim_date_id')

In [18]:
data

Unnamed: 0,dim_date_id,date_val,calendar_year,calendar_month,calendar_day,day_of_week,is_weekday,is_workday,is_holiday,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved,dim_pa_id
0,1,2017-01-01,2017,1,1,1,0,0,1,1,417380,A,75.0,0,1.0
1,1,2017-01-01,2017,1,1,1,0,0,1,2,999001,A,,1,
2,1,2017-01-01,2017,1,1,1,0,0,1,3,417740,A,76.0,0,2.0
3,1,2017-01-01,2017,1,1,1,0,0,1,4,999001,A,,1,
4,1,2017-01-01,2017,1,1,1,0,0,1,5,417740,A,,1,
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1335571,1095,2019-12-31,2019,12,31,3,1,1,0,1335572,417740,C,75.0,0,555950.0
1335572,1095,2019-12-31,2019,12,31,3,1,1,0,1335573,999001,C,,1,
1335573,1095,2019-12-31,2019,12,31,3,1,1,0,1335574,417380,C,70.0,0,555951.0
1335574,1095,2019-12-31,2019,12,31,3,1,1,0,1335575,999001,C,,1,


Then merge pas into data on 'dim_pa_id'. Because 'dim_pa_id' is NaN for approved claims, we use how='right' so that every row of data is kept, including the rows without a PA record.

In [19]:
data = pas.merge(data, on='dim_pa_id', how='right')

In [20]:
data

Unnamed: 0,dim_pa_id,correct_diagnosis,tried_and_failed,contraindication,pa_approved,dim_date_id,date_val,calendar_year,calendar_month,calendar_day,day_of_week,is_weekday,is_workday,is_holiday,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved
0,1.0,1.0,1.0,0.0,1.0,1,2017-01-01,2017,1,1,1,0,0,1,1,417380,A,75.0,0
1,,,,,,1,2017-01-01,2017,1,1,1,0,0,1,2,999001,A,,1
2,2.0,1.0,0.0,0.0,1.0,1,2017-01-01,2017,1,1,1,0,0,1,3,417740,A,76.0,0
3,,,,,,1,2017-01-01,2017,1,1,1,0,0,1,4,999001,A,,1
4,,,,,,1,2017-01-01,2017,1,1,1,0,0,1,5,417740,A,,1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1335571,555950.0,1.0,0.0,0.0,1.0,1095,2019-12-31,2019,12,31,3,1,1,0,1335572,417740,C,75.0,0
1335572,,,,,,1095,2019-12-31,2019,12,31,3,1,1,0,1335573,999001,C,,1
1335573,555951.0,0.0,0.0,1.0,0.0,1095,2019-12-31,2019,12,31,3,1,1,0,1335574,417380,C,70.0,0
1335574,,,,,,1095,2019-12-31,2019,12,31,3,1,1,0,1335575,999001,C,,1
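To see why the join type matters for the PA merge, the following sketch compares right, left, and inner joins on toy frames (two PA records, three claims, one of which has no PA). The frames are illustrative stand-ins, not the real data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins: two PA records and three claims (one claim has no PA).
toy_pas = pd.DataFrame({
    "dim_pa_id": [1, 2],
    "pa_approved": [1, 1],
})
toy_data = pd.DataFrame({
    "dim_claim_id": [1, 2, 3],
    "dim_pa_id": [1.0, np.nan, 2.0],
})

# Right join, as in the notebook: keeps every row of toy_data.
right = toy_pas.merge(toy_data, on="dim_pa_id", how="right")

# Left join with the operands swapped: same rows, claim columns first.
left = toy_data.merge(toy_pas, on="dim_pa_id", how="left")

# An inner join would silently drop the claim without a PA record.
inner = toy_pas.merge(toy_data, on="dim_pa_id", how="inner")

print(len(right), len(left), len(inner))  # 3 3 2
```

Note that pandas matches NaN keys with each other during a merge, so this pattern is only safe as long as the PA table itself has no missing dim_pa_id values, which the pas.info() output above indicates is the case.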


In [21]:
data.info()

<class 'pandas.core.frame.DataFrame'>
Int64Index: 1335576 entries, 0 to 1335575
Data columns (total 19 columns):
 #   Column                   Non-Null Count    Dtype  
---  ------                   --------------    -----  
 0   dim_pa_id                555951 non-null   float64
 1   correct_diagnosis        555951 non-null   float64
 2   tried_and_failed         555951 non-null   float64
 3   contraindication         555951 non-null   float64
 4   pa_approved              555951 non-null   float64
 5   dim_date_id              1335576 non-null  int64  
 6   date_val                 1335576 non-null  object 
 7   calendar_year            1335576 non-null  int64  
 8   calendar_month           1335576 non-null  int64  
 9   calendar_day             1335576 non-null  int64  
 10  day_of_week              1335576 non-null  int64  
 11  is_weekday               1335576 non-null  int64  
 12  is_workday               1335576 non-null  int64  
 13  is_holiday               1335576 non-null 

#### A few more comparisons

Count the NaN entries in reject_code. Because NaN is not equal to itself, comparing the column with itself yields False exactly at the NaN entries. There are 779625 NaN entries in reject_code, which matches the number of approved pharmacy claims (pharmacy_claim_approved = 1).

In [42]:
(claims.reject_code == claims.reject_code).value_counts()

False    779625
True     555951
Name: reject_code, dtype: int64

In [43]:
(claims.reject_code == data.reject_code).value_counts()

False    779625
True     555951
Name: reject_code, dtype: int64

In [62]:
(data.dim_pa_id.fillna(1) == data.pharmacy_claim_approved).value_counts()

True     779625
False    555951
dtype: int64
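The same consistency checks can be written more explicitly with `isna()`. This is a sketch on a toy frame whose values follow the first five rows shown earlier, not a replacement for the cells above.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "reject_code": [75.0, np.nan, 76.0, np.nan, np.nan],
    "dim_pa_id": [1.0, np.nan, 2.0, np.nan, np.nan],
    "pharmacy_claim_approved": [0, 1, 0, 1, 1],
})

approved = toy["pharmacy_claim_approved"].eq(1)

# x == x is False only for NaN, so it agrees with notna() everywhere.
self_equal = toy["reject_code"] == toy["reject_code"]
trick_matches_notna = bool((self_equal == toy["reject_code"].notna()).all())

# Approved claims should have neither a reject code nor a PA record,
# and rejected claims should have both.
codes_consistent = bool((toy["reject_code"].isna() == approved).all())
pa_consistent = bool((toy["dim_pa_id"].isna() == approved).all())
print(trick_matches_notna, codes_consistent, pa_consistent)
```

Unlike the `fillna(1)` comparison, these checks do not depend on the particular values that dim_pa_id takes.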

# Train test split

In [63]:
from sklearn.model_selection import train_test_split

In [67]:
train, test = train_test_split(data,shuffle=True, random_state=3453, test_size=0.2, 
                               stratify=data.pharmacy_claim_approved)
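Passing `stratify=data.pharmacy_claim_approved` asks scikit-learn to keep the share of approved claims roughly equal in both splits. A minimal way to confirm this after splitting is sketched below; `approval_rate_gap` is a hypothetical helper, demonstrated here on toy frames rather than the real split.

```python
import pandas as pd

def approval_rate_gap(train: pd.DataFrame, test: pd.DataFrame,
                      target: str = "pharmacy_claim_approved") -> float:
    """Absolute difference between the mean of `target` in train and in test."""
    return abs(train[target].mean() - test[target].mean())

# Toy frames with the same 3:2 approved/rejected ratio in both splits.
toy_train = pd.DataFrame({"pharmacy_claim_approved": [1, 1, 1, 0, 0] * 4})
toy_test = pd.DataFrame({"pharmacy_claim_approved": [1, 1, 1, 0, 0]})

gap = approval_rate_gap(toy_train, toy_test)
print(gap)
```

On the real data, calling `approval_rate_gap(train, test)` should return a value close to zero, with both rates near 779625 / 1335576, or about 0.584.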

### Export train test data to csv file

In [70]:
train.to_csv('../data/processed_data/train.csv')

In [71]:
test.to_csv('../data/processed_data/test.csv')
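Since `to_csv` is called without `index=False`, the shuffled row labels are written as an unnamed first column. The sketch below shows, on a toy frame and an in-memory buffer instead of the real files, what that means when the data are read back; the toy values are taken from the first two train rows.

```python
import io
import pandas as pd

toy_train = pd.DataFrame(
    {"dim_claim_id": [99181, 415700], "pharmacy_claim_approved": [1, 1]},
    index=[99180, 415699],  # original row labels survive the shuffle
)

buffer = io.StringIO()
toy_train.to_csv(buffer)  # the index is written as an unnamed first column

buffer.seek(0)
naive = pd.read_csv(buffer)  # index comes back as a column "Unnamed: 0"

buffer.seek(0)
restored = pd.read_csv(buffer, index_col=0)  # first column becomes the index

print(list(naive.columns))
print(list(restored.index))
```

Reading train.csv and test.csv with `index_col=0` therefore recovers the original row labels instead of adding an extra "Unnamed: 0" column.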

In [68]:
train

Unnamed: 0,dim_pa_id,correct_diagnosis,tried_and_failed,contraindication,pa_approved,dim_date_id,date_val,calendar_year,calendar_month,calendar_day,day_of_week,is_weekday,is_workday,is_holiday,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved
99180,,,,,,90,2017-03-31,2017,3,31,6,1,1,0,99181,999001,C,,1
415699,,,,,,390,2018-01-25,2018,1,25,5,1,1,0,415700,999001,B,,1
728398,302602.0,1.0,1.0,0.0,1.0,638,2018-09-30,2018,9,30,1,0,0,0,728399,417740,A,76.0,0
97547,,,,,,89,2017-03-30,2017,3,30,5,1,1,0,97548,999001,B,,1
1209171,502990.0,1.0,1.0,0.0,1.0,997,2019-09-24,2019,9,24,3,1,1,0,1209172,999001,C,76.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1085912,451348.0,1.0,1.0,0.0,1.0,905,2019-06-24,2019,6,24,2,1,1,0,1085913,999001,A,76.0,0
1188636,494471.0,1.0,1.0,0.0,1.0,982,2019-09-09,2019,9,9,2,1,1,0,1188637,999001,B,76.0,0
1080523,,,,,,900,2019-06-19,2019,6,19,4,1,1,0,1080524,999001,A,,1
684619,,,,,,601,2018-08-24,2018,8,24,6,1,1,0,684620,417614,C,,1


In [69]:
test

Unnamed: 0,dim_pa_id,correct_diagnosis,tried_and_failed,contraindication,pa_approved,dim_date_id,date_val,calendar_year,calendar_month,calendar_day,day_of_week,is_weekday,is_workday,is_holiday,dim_claim_id,bin,drug,reject_code,pharmacy_claim_approved
144035,,,,,,132,2017-05-12,2017,5,12,6,1,1,0,144036,417614,C,,1
741339,308096.0,1.0,0.0,0.0,1.0,648,2018-10-10,2018,10,10,4,1,1,0,741340,417614,B,75.0,0
916621,,,,,,783,2019-02-22,2019,2,22,6,1,1,0,916622,417740,A,,1
542034,225098.0,0.0,1.0,0.0,1.0,487,2018-05-02,2018,5,2,4,1,1,0,542035,417380,A,75.0,0
248533,102868.0,0.0,1.0,1.0,0.0,235,2017-08-23,2017,8,23,4,1,1,0,248534,417740,C,75.0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
867451,,,,,,751,2019-01-21,2019,1,21,2,1,1,0,867452,999001,A,,1
65929,,,,,,61,2017-03-02,2017,3,2,5,1,1,0,65930,999001,A,,1
901177,374752.0,0.0,0.0,1.0,0.0,773,2019-02-12,2019,2,12,3,1,1,0,901178,417740,B,70.0,0
496624,206093.0,1.0,0.0,1.0,0.0,452,2018-03-28,2018,3,28,4,1,1,0,496625,417614,A,70.0,0
