# CORE Cartridge Notebook: Cancel Before Active Enrichment
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and get the same result every time. 
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **no 'magic numbers':** if it is not instantly obvious what a variable or function is or does, without context, consider renaming it.
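
As a rough illustration of the "repeatable" principle, the sketch below applies a toy transform twice and compares the results. The `uppercase_status` function and the sample frame are hypothetical and not part of this notebook; they only assume pandas is available.

```python
import pandas as pd

def uppercase_status(df: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical single-purpose transform: upper-case the status column."""
    # assign() returns a new frame, so the input is never mutated in place.
    return df.assign(status=df["status"].str.upper())

raw = pd.DataFrame({"status": ["active", "Cancelled"]})
once = uppercase_status(raw)
twice = uppercase_status(once)  # running it again should change nothing
```

If `once.equals(twice)` holds, the transform is idempotent on this input, which is the property the bullet above asks for.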

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below, describing what this transform does; this will appear in the configuration application where Ops will create, configure, and update pipelines. 
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline). 
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do.

![caution](assets/cautionTape.png)

### Description
What does this transformation do? Be specific.

![what does your transform do](assets/what.gif)

Cancelled/Discontinued Before Active enrichment.
Assigns hierarchy values in cases where a cancelled or discontinued status is reported before the first active shipment. This is used as part of the TTFF enrichment. See the logic diagram below:

<img src='assets/cancel_before_active.svg' width=700>

In [103]:
from core.helpers.session_helper import SessionHelper
session = SessionHelper().session

2019-09-06 19:11:09,032 - core.helpers.session_helper.SessionHelper - INFO - Creating session for dev environment...
2019-09-06 19:11:09,061 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Generating administrator mocks.
2019-09-06 19:11:09,066 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Done generating administrator mocks.
2019-09-06 19:11:09,067 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Generating pharmaceutical company mocks.
2019-09-06 19:11:09,071 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Done generating pharmaceutical company mocks.
2019-09-06 19:11:09,073 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Generating brand mocks.
2019-09-06 19:11:09,079 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Done generating brand mocks.
2019-09-06 19:11:09,081 - core.helpers.configuration_mocker.ConfigurationMocker - DEBUG - Generating segment mocks.
2019-09-06 19:11:0

In [104]:
"""
************ CONFIGURATION - PLEASE TOUCH **************
Pipeline Builder configuration: creates configurations from variables specified here!!
This cell will be off in production as configurations will come from the configuration postgres DB.
"""
# config vars: this dataset
config_pharma = "sun" # the pharmaceutical company which owns {brand}
config_brand = "ilumya" # the brand this pipeline operates on
config_state = "enrich" # the state this transform runs in
config_name = "patient_status_enrich_cancel_before_active" # the name of this transform, which is the name of this notebook without .ipynb

# input vars: dataset to fetch. Recall that a contract published to S3 has a key format branch/pharma/brand/state/name
# Note: this key is case sensitive!!
input_pharma = "sun"
input_brand = "ilumya"
input_state = "ingest"
input_name = "patient_status_ingest_column_mapping"
input_branch = "longitudal-id" # if None, input_branch is automagically set to your working branch
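
The key format mentioned in the comments above (branch/pharma/brand/state/name) can be sketched as a small helper. `contract_key` is a hypothetical name used for illustration only; it is not part of `core`, and the real `DatasetContract` may assemble its key differently.

```python
def contract_key(branch: str, pharma: str, brand: str, state: str, name: str) -> str:
    """Hypothetical helper: join the five contract parts into one key."""
    parts = [branch, pharma, brand, state, name]
    # The key is case sensitive, so the parts are joined exactly as given.
    if not all(parts):
        raise ValueError("every key part must be a non-empty string")
    return "/".join(parts)

key = contract_key(
    "longitudal-id", "sun", "ilumya", "ingest", "patient_status_ingest_column_mapping"
)
```

With the input values from this cell, the resulting key matches the path that appears after the bucket name in the fetch log further down.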

In [105]:
"""
************ SETUP - DON'T TOUCH **************
Populating config mocker based on config parameters...
"""
import core.helpers.pipeline_builder as builder

ids = builder.build(config_pharma, config_brand, config_state, config_name, session)
transform_id = ids[0]
run_id = ids[1]

2019-09-06 19:11:25,989 - core.logging - DEBUG - Adding/getting mocks for specified configurations...
2019-09-06 19:11:26,028 - core.logging - DEBUG - Done. Creating mock run event and committing results to configuration mocker.


In [106]:
"""
************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET
from core.helpers.session_helper import SessionHelper
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract

db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch:str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)


### Configuration

In [107]:
""" 
********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>

e.g.

class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""

class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    '''
    input_transform: str #= db_transform.variables.input_transform # Transform to source data from
    hierarchy: str #= db_transform.variables.hierarchy # Column header for Patient Journey Hierarchy
    active_substatus_code: str# = db_transform.variables.active_substatus_code # Active Shipment Substatus code, e.g. 'SHIPMENT' (customer-specific)
    cancel_discontinue_status_code: str# = db_transform.variables.cancel_discontinue_status_code # Comma-separated list (stored as string) of Cancelled and Discontinued status codes, e.g. 'CANCELLED,DISCONTINUED' (customer-specific)
    #cancel_discontinue_status_code = cancel_discontinue_status_code.split(',') # We reassign the string variable to be a list of strings by comma split
    bvpa_cancel_discontinue_substatus: str# = db_transform.variables.bvpa_cancel_discontinue_substatus # Comma-separated list (stored as string) of accepted substatus codes used for BVPA hierarchy, e.g. 'INSURANCE DENIED,COVERAGE DENIED' (customer-specific)
    #bvpa_cancel_discontinue_substatus = bvpa_cancel_discontinue_substatus.split(',') # We reassign the string variable to be a list of strings by comma split
    active_diff_threshold: int# = db_transform.variables.active_diff_threshold # Threshold value for Active/Cancel date difference logic (customer-specific)
    prior_diff_threshold: int #= db_transform.variables.prior_diff_threshold # Threshold value for Cancel/Prior date difference logic (customer-specific)
    active_hierarchy: str #= db_transform.variables.active_hierarchy # Hierarchy to assign to statuses after the first fill, e.g. 'ACTIVE - SHIPMENT' (customer-specific)
    remove_from_ttff: str #= db_transform.variables.remove_from_ttff # Hierarchy to assign to statuses that are ignored from TTFF (customer-specific)
    no_status_clarity: str #= db_transform.variables.no_status_clarity # Hierarchy to assign to cancelled/discontinued statuses with no status clarity (customer-specific)
    bvpa_hierarchy: str #= db_transform.variables.bvpa_hierarchy # Hierarchy to assign to cancelled/discontinued statuses that have BVPA substatus (customer-specific)

In [108]:
transform = Transform()

In [109]:
## Please place your value assignments for development here!!
## This cell will be turned off in production and Engineering will set it to pull from the configuration application instead

transform.hierarchy = 'patient_journey_hierarchy'
transform.active_substatus_code = 'SHIPMENT'
transform.cancel_discontinue_status_code = 'CANCELLED,DISCONTINUED'
transform.bvpa_cancel_discontinue_substatus = 'INSURANCE DENIED'
transform.active_diff_threshold = 60
transform.prior_diff_threshold = 60
transform.active_hierarchy = 'ACTIVE - SHIPMENT'
transform.remove_from_ttff = 'REMOVE FROM TTFF'
transform.no_status_clarity = 'NO STATUS CLARITY'
transform.bvpa_hierarchy = 'BVPA'

In [110]:
transform.cancel_discontinue_status_code = transform.cancel_discontinue_status_code.split(',')
transform.bvpa_cancel_discontinue_substatus = transform.bvpa_cancel_discontinue_substatus.split(',')
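
Note that re-running the cell above would fail, because after the first run the attributes are lists and lists have no `.split` method. A minimal sketch of a more forgiving parser follows; `split_codes` is a hypothetical helper, and stripping whitespace around each code is an assumption about what Ops might type.

```python
def split_codes(raw) -> list:
    """Hypothetical helper: turn 'A, B' (or an existing list) into ['A', 'B']."""
    if isinstance(raw, (list, tuple)):
        # Already parsed: return a copy so repeated runs are harmless.
        return list(raw)
    # Drop surrounding whitespace and ignore empty entries such as 'A,,B'.
    return [code.strip() for code in raw.split(",") if code.strip()]

codes = split_codes("CANCELLED, DISCONTINUED")
codes_again = split_codes(codes)  # a second pass leaves the list unchanged
```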

### Transformation

In [111]:
"""
************ FETCH DATA - TOUCH, BUT CAREFULLY **************
This cell will be turned off in production, as the input_contract will be handled by the pipeline.
"""

if not input_branch:
    input_branch = BRANCH_NAME
input_contract = DatasetContract(branch=input_branch, state=input_state, parent=input_pharma, child=input_brand, dataset=input_name)
run_filter = []
run_filter.append(dict(partition="__metadata_run_id", comparison="==", values=[2]))
# The filter above restricts the fetch to a single run. IF YOU HAVE PUBLISHED DATA MULTIPLE TIMES, change the int to the run_id to fetch.
# Without the filter, you will have duplicate values in your fetched dataset!
df = input_contract.fetch(filters=run_filter)

2019-09-06 19:11:32,518 - core.dataset_contract.DatasetContract - INFO - Fetching dataframe from s3 location s3://ichain-dev/longitudal-id/sun/ilumya/ingest/patient_status_ingest_column_mapping.
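
To show why the run filter matters, here is a small pandas-only sketch of what duplicate runs look like and how restricting to one `__metadata_run_id` removes them. It does not use `DatasetContract`; the `keep_run` helper and the toy frame are assumptions made for illustration.

```python
import pandas as pd

# Two published runs of the same two transactions: every row appears twice.
fetched = pd.DataFrame({
    "__metadata_run_id": [1, 1, 2, 2],
    "pharmacy_transaction_id": ["a", "b", "a", "b"],
})

def keep_run(df: pd.DataFrame, run_id: int) -> pd.DataFrame:
    """Hypothetical helper: keep only the rows published by one run."""
    return df.loc[df["__metadata_run_id"] == run_id].reset_index(drop=True)

single_run = keep_run(fetched, 2)
```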


In [112]:
df.shape

(1546, 136)

In [113]:
### Use the variables above to execute your transformation. The final output needs to be a variable named final_dataframe

In [114]:
import numpy as np
import pandas as pd

In [115]:
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 500)

In [116]:
# Column names defined here from the pre-defined patient status schema

brand_col = 'brand'
patient_id = 'longitudinal_patient_id'
pharmacy = 'pharmacy_name'
status_date = 'status_date'
referral_date = 'referral_date'
status =  'status'
substatus =  'substatus'
hierarchy = transform.hierarchy

if DbTransform.pharmaceutical_company.upper() == 'SUN':
    trans_id = 'pharmacy_transaction_id'

else:
    trans_id = 'aggregator_transaction_id'
    

In [117]:
df.head()

Unnamed: 0,__metadata_app_version,__metadata_output_contract,__metadata_run_id,__metadata_run_timestamp,__metadata_transform_timestamp,aggregator_transaction_id,brand,bridge_patient,bridge_quantity_dispensed,bridge_quantity_dispensed_2,copay_as_amount,customer_status,customer_status_description,customer_substatus,days_supply,dose_count,dose_exchange_count,dose_exchange_flag,dose_titration_count,dose_titration_quantity,dx_1,dx_2,enroll_received_date,fitness_for_duty_request_flag,fitness_for_duty_ship_date,has_medical_coverage_flag,hcp_address_1,hcp_address_2,hcp_city,hcp_dea_number,hcp_facility,hcp_first_name,hcp_last_name,hcp_middle_name,hcp_npi,hcp_phone,hcp_specialty,hcp_state,hcp_state_license_number,hcp_suffix,hcp_zip,hub_patient,hub_patient_id,longitudinal_patient_id,medication,ndc,other_payer_amount,oxygen_flag,patient_consent_date,patient_dob,patient_gender,patient_oop_program_name,patient_state,patient_support_1,patient_support_2,patient_zip,pharmacy_address_1,pharmacy_address_2,pharmacy_city,pharmacy_code,pharmacy_dea_number,pharmacy_hin,pharmacy_name,pharmacy_ncpdp,pharmacy_npi,pharmacy_parent_name,pharmacy_patient_id,pharmacy_state,pharmacy_transaction_id,pharmacy_zip,prev_dispensed,primary_coins,primary_copay,primary_cost_amount,primary_cost_type,primary_coverage_type,primary_deductible,primary_patient_responsibility,primary_payer,primary_payer_bin,primary_payer_group,primary_payer_iin,primary_payer_pcn,primary_payer_subtype,primary_payer_type,primary_pbm_name,primary_plan,primary_plan_paid,primary_plan_type,primary_prior_auth_expiration_date,primary_prior_auth_required_flag,prior_therapy_name,quantity_dispensed,referral_date,referral_number,referral_source,restatement_flag,rx_date,rx_fill_number,rx_fills,rx_number,rx_refills_remaining,secondary_coins,secondary_copay,secondary_coverage_type,secondary_deductible,secondary_patient_responsibility,secondary_payer,secondary_payer_bin,secondary_payer_flag,secondary_payer_group,secondary_payer_iin,secondary_payer_pcn,secondary_payer_subtype,secondary_payer_type,secondary_plan,secondary_plan_paid,secondary_plan_type,ship_address_1,ship_address_2,ship_carrier,ship_city,ship_date,ship_location,ship_state,ship_tracking_id,ship_zip,status,status_date,substatus,transaction_date,transaction_sequence,transaction_type,transfer_pharmacy,triage_date,uom_dispensed
0,0.0.11,s3://ichain-dev/longitudal-id/sun/ilumya/inges...,2,2019-08-19 17:53:17,2019-08-19 17:56:55,,,,,,,ACTIVE,,SHIPMENT,84.0,,,,,,,,,,,,1140 YOUNGS RD,,BUFFALO,MT4284155,,JONATHAN,TUROWSKI,,1104366368,7166880020,,NY,,,14221,,,,ILUMYA,47335017795.0,,,,,M,,,,,14,,,,CVS,,,,,1003925587,,9011536684,,191474159,,,,,,,Medical,,,,,,,,,MEDICAID,,,,,,,,1.0,20190301 23:00:00,,DIRECT,,20190311.0,1.0,3.0,85291256.0,2.0,,,,,,,,,,,,,,,,,,,,,20190813 23:00:00,,,,,,20190508 23:00:00,,20190813 23:00:00,0,COM,,,
1,0.0.11,s3://ichain-dev/longitudal-id/sun/ilumya/inges...,2,2019-08-19 17:53:17,2019-08-19 17:56:55,,,,,,,ACTIVE,,SHIPMENT,84.0,,,,,,L40.9,,,,,,324 S SHERMAN,,SPOKANE,BW0942296,,PHILIP,WERSCHLER,,1023119682,5096241184,,WA,,,99202,,,,ILUMYA SD PFS,47335017795.0,,,,,F,,,,,99,,,,CVS,,,,,1013998921,,9012977441,,191591627,,,,,,,Pharmacy,,,,,,,,,MEDICAID,,,,,,,,1.0,20190515 23:00:00,,DIRECT,,20190520.0,0.0,4.0,86268050.0,4.0,,,,,,,,,,,,,,,,,"SPOKANE DERMATOLOGY,324 S SHERMAN ST STE. A1",,UPS,SPOKANE,20190814 23:00:00,PRESCRIBER OFFICE,WA,1Z6V93W92900031451,99.0,,20190814 23:00:00,,20190814 23:00:00,0,COM,,,
2,0.0.11,s3://ichain-dev/longitudal-id/sun/ilumya/inges...,2,2019-08-19 17:53:17,2019-08-19 17:56:55,,,,,,,CANCELLED,,PATIENT RESPONSE,28.0,,,,,,L40.9,,,,,,1210 BROOKSTONE CTR,PKWY,COLUMBUS,MS1672775,,MARK,"SPATZ, PA",,1083803068,7063221717,,GA,,,31904,,,,ILUMYA SD PFS,47335017795.0,,,,,F,,,,,36,,,,CVS,,,,,1043382302,,9011901489,,901190148920190814000000,,,,,,,Medical,,,,,,,,,,,,,,,,,1.0,20190322 23:00:00,,DIRECT,,20190321.0,,0.0,86125387.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,20190814 23:00:00,,20190814 23:00:00,0,COM,,,
3,0.0.11,s3://ichain-dev/longitudal-id/sun/ilumya/inges...,2,2019-08-19 17:53:17,2019-08-19 17:56:55,,,,,,,PENDING,,PA,,,,,,,,,,,,,156 RAMAPO VALLEY RD,DR JOCELYN LIEB,MAHWAH,FL2139170,,JOCELYN,LIEB,,1891945341,2015007525,,NJ,,,7430,,,,ILUMYA,,,,,,F,,,,,7,,,,CVS,,,,,1043382302,,9013110426,,901311042620190814000000,,,,,,,Medical,,,,,,,,,,,,,,,,,,20190521 23:00:00,,DIRECT,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,,20190814 23:00:00,,20190814 23:00:00,0,COM,,,
4,0.0.11,s3://ichain-dev/longitudal-id/sun/ilumya/inges...,2,2019-08-19 17:53:17,2019-08-19 17:56:55,,,,,,,CANCELLED,,PATIENT END,28.0,,,,,,L40.0,,,,,,1235 LAKE POINTE PWY,STE 200,SUGAR LAND,,,SHAYLA,ARCENEAUX,,1326464207,2819800166,,TX,,,77478,,,,ILUMYA SD PFS,47335017795.0,,,,,F,,,,,77,,,,CVS,,,,,1043382302,,9013799320,,901379932020190814000000,,,,,,,Medical,,,,,,,,,,,,,,,,,1.0,20190618 23:00:00,,DIRECT,,20190520.0,,0.0,86565740.0,0.0,,,,,,,,,,,,,,,,,,,,,,,,,,,20190814 23:00:00,,20190814 23:00:00,0,COM,,,


### DATA CLEANING: ADDRESS THIS SECTION BEFORE PIPELINE INTEGRATION

In [118]:
input_df = df.copy()

In [119]:
patient_id = 'pharmacy_patient_id'
pharmacy = 'pharmacy_code'
status = 'customer_status'
substatus = 'customer_substatus'
brand_col = 'medication'
datetime = '%Y%m%d'

# CLEAN DATA - This step should not be necessary once transform is integrated into pipeline.
#    Extract and map relevant columns
#    Convert dates to datetime format
#    Extract brand from medication
#    Convert substatuses to uppercase
#    Populate null referral dates with the min(status_date) for that patient/pharmacy/brand.
    
def clean_data(cust_input_df, datetime, transform):

    clean_df = (
        cust_input_df
        .loc[:,
             [trans_id,
              patient_id,
              pharmacy,
              brand_col,
              status_date,
              referral_date,
              status,
              substatus]
            ]
        .assign(**{
            status_date : lambda x: (
                pd.to_datetime(
                    x[status_date].str[:8].astype(str),
                    format=datetime,
                    errors='coerce'
                )),
            'min_status_date' : lambda x: (
                x.groupby([patient_id,pharmacy,brand_col])
                [status_date]
                .transform(min)
            )
        })       
        .fillna(value={referral_date:'min_status_date'})
        .assign(**{
            referral_date : lambda x: (
                pd.to_datetime(
                    x[referral_date].str[:8].astype(str),
                    format=datetime,
                    errors='coerce'
                ))
        })
        .dropna()
        .assign(**{
            brand_col : lambda x: (x[brand_col].apply(lambda x: x.split()[0].strip())),
            status : lambda x: (x[status].str.upper()),
            substatus : lambda x: (x[substatus].str.upper())
        })
        .drop(['min_status_date'],axis=1)
        .drop_duplicates()
        .sort_values(
            by=[patient_id, pharmacy, brand_col, status_date, status, trans_id],
            ascending=[True, True, True, True, False, True])
        .reset_index(drop=True)
        .assign(**{transform.hierarchy : 'Dummy'})
    )

    return clean_df
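
One thing to check before pipeline integration: `fillna(value={referral_date: 'min_status_date'})` fills gaps with the literal string `'min_status_date'`, not with that column's values, so those rows appear to become `NaT` when parsed and are then removed by `dropna()`. A minimal sketch of filling from the per-group minimum instead is shown below, on a toy frame with hypothetical column values; adopting it in `clean_data` would likely change the row counts shown in this notebook.

```python
import pandas as pd

toy = pd.DataFrame({
    "patient": ["p1", "p1", "p2"],
    "status_date": pd.to_datetime(["2019-08-13", "2019-08-14", "2019-08-12"]),
    "referral_date": pd.to_datetime(["2019-08-06", None, None]),
})

# Earliest status date per patient, aligned row by row with the frame.
min_status_date = toy.groupby("patient")["status_date"].transform("min")

# Passing a Series to fillna fills each gap from the matching row of that Series.
filled = toy.assign(referral_date=toy["referral_date"].fillna(min_status_date))
```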

In [120]:
df = clean_data(
    input_df,
    datetime,
    transform
)

df.head()

Unnamed: 0,pharmacy_transaction_id,pharmacy_patient_id,pharmacy_code,medication,status_date,referral_date,customer_status,customer_substatus,patient_journey_hierarchy
0,522918362924,11855512788,WAG,ILUMYA,2019-08-13,2019-08-06,PENDING,BENEFITS,Dummy
1,405647714937,11855512788,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT RESPONSE,Dummy
2,560833250829,61489680474,WAG,ILUMYA,2019-08-12,2019-08-06,PENDING,BENEFITS,Dummy
3,145219598523,61489680474,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT CONTACT,Dummy
4,285375989569,61489680474,WAG,ILUMYA,2019-08-15,2019-08-06,PENDING,PATIENT RESPONSE,Dummy


In [121]:
df.shape

(499, 9)

In [134]:
df_mod = df.loc[~df['customer_status'].isin(transform.cancel_discontinue_status_code)]

### APPLY TRANSFORM LOGIC

In [122]:
# Assign Patient Journey (pj_id) and Patient Journey Step (pj_step) identifiers
# (These IDs are used for calculation purposes only.  They will not be published)

def pj(df):
    pj_df = (
        df
        .assign(**{
            'pj_id' : lambda x: (
                x.groupby([patient_id, pharmacy, brand_col]).grouper.group_info[0]
            ),
            'pj_step' : lambda x: x.index
        })
        .sort_values(
            by=[patient_id, pharmacy, brand_col, status_date, status, trans_id],
            ascending=[True, True, True, True, False, True])
        .reset_index(drop=True)
    )
    return pj_df
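
The `pj` function reads group numbers from `.grouper.group_info[0]`, which is a pandas internal and may not be available in newer pandas releases. The public `GroupBy.ngroup()` method numbers groups in the same sorted order; the sketch below shows it on a toy frame with hypothetical column names.

```python
import pandas as pd

toy = pd.DataFrame({
    "patient": ["p2", "p1", "p1", "p2"],
    "pharmacy": ["WAG", "WAG", "WAG", "CVS"],
})

journeys = toy.assign(
    # ngroup() labels each row with the number of its (sorted) group.
    pj_id=lambda x: x.groupby(["patient", "pharmacy"]).ngroup(),
    # The step is simply the row's position in the already-sorted frame.
    pj_step=lambda x: x.index,
)
```

Here the sorted groups are (p1, WAG), (p2, CVS), (p2, WAG), so the rows receive ids 2, 0, 0, 1.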

In [123]:
# Filter to only include patient journeys where:
#    a) Active Shipment status is reported
#    b) Cancelled or Discontinued status occurs prior to first active shipment

def cancel_before_active(pj_df):
    cancel_before_active_df = (
        pj_df
        .assign(active_step = lambda x: (
            np.where(
                x[substatus] == transform.active_substatus_code,
                x['pj_step'],
                np.nan
        )))
        .assign(active_status_date = lambda x: (
            pd.to_datetime(np.where(
                x[substatus] == transform.active_substatus_code,
                x[status_date],
                pd.NaT
        ))))
        .assign(first_active_step = lambda x: (
            x.groupby(['pj_id'])['active_step']
            .transform(min)
        ))
        .assign(first_active_status_date = lambda x: (
            x.groupby(['pj_id'])['active_status_date']
            .transform(min)
        ))
        .drop(['active_step', 'active_status_date'], axis=1)
        .assign(active_cancel_diff = lambda x:(
            np.where(x[status].isin(transform.cancel_discontinue_status_code),
                     (x['first_active_status_date'] - x[status_date]) / np.timedelta64(1, 'D'),
                     np.nan
                    )
        ))
        .assign(active_cancel_diff = lambda x: (
            x.groupby(['pj_id'], sort=False)['active_cancel_diff']
            .transform(lambda x: x.bfill())
            ))
        .loc[lambda x: (
            x['pj_id'].isin(x
                            .loc[x['active_cancel_diff'] >= 0]
                            .pj_id
                            .drop_duplicates()
                            .tolist()
                           )
        )]
    )
    return cancel_before_active_df
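
The core of the filter above can be sketched on a toy frame: find each journey's first active shipment date, measure how many days each cancellation precedes it, and keep journeys with at least one non-negative difference. This is a simplified illustration with hard-coded codes and hypothetical data; it omits the backfill step that `cancel_before_active` performs.

```python
import pandas as pd

toy = pd.DataFrame({
    "pj_id": [0, 0, 0, 1, 1],
    "status": ["PENDING", "CANCELLED", "ACTIVE", "PENDING", "CANCELLED"],
    "substatus": ["BENEFITS", "OTHER", "SHIPMENT", "BENEFITS", "OTHER"],
    "status_date": pd.to_datetime(
        ["2019-08-01", "2019-08-10", "2019-08-15", "2019-08-01", "2019-08-05"]
    ),
})

# Date of the first active shipment in each journey (NaT if there is none).
is_active = toy["substatus"] == "SHIPMENT"
first_active = (
    toy["status_date"].where(is_active).groupby(toy["pj_id"]).transform("min")
)

# Days from each cancelled/discontinued row to that first shipment.
is_cancel = toy["status"].isin(["CANCELLED", "DISCONTINUED"])
days_to_active = (first_active - toy["status_date"]).dt.days.where(is_cancel)

# Keep whole journeys that contain a cancellation on or before the shipment.
qualifying_ids = toy.loc[days_to_active >= 0, "pj_id"].unique()
qualifying = toy.loc[toy["pj_id"].isin(qualifying_ids)]
```

Journey 1 never ships, so its difference stays missing and the journey is dropped.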

In [124]:
# For each patient journey step, get the previous status. If it's the first step in the patient journey, show "no_prior_status"
# For cancelled or discontinued statuses, get the time spent in previous status (if >= 60 days) - and then backfill values for that patient journey.

def prior_status(cancel_before_active_df):
    prior_status_df = (
        cancel_before_active_df
        .assign(prior_status = lambda x:(
            x.groupby(['pj_id'])[status]
            .transform(lambda x: x.shift(1))
        ))
        .fillna(value={'prior_status':'no_prior_status'})
        .assign(prior_status_diff = lambda x: (
            np.where(
                (x[status].isin(transform.cancel_discontinue_status_code)) & ((x[status_date] - x[status_date].shift(1))/np.timedelta64(1,'D') >= transform.prior_diff_threshold),
                (x
                 .groupby(['pj_id'])[status_date]
                 .transform(lambda x: (x - x.shift(1))/np.timedelta64(1,'D'))),
                np.nan       
            )
        ))
        .assign(prior_status_diff = lambda x: (
            x.groupby(['pj_id'], sort=False)['prior_status_diff']
            .transform(lambda x: x.bfill())
        ))
    )
    return prior_status_df
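
The "previous status within the same journey" idea can be shown in isolation. The sketch below uses `groupby(...).shift(1)` on a toy frame with hypothetical data, so the first row of each journey gets no prior status and the day counts never cross journey boundaries.

```python
import pandas as pd

toy = pd.DataFrame({
    "pj_id": [0, 0, 1, 1],
    "status": ["PENDING", "CANCELLED", "PENDING", "ACTIVE"],
    "status_date": pd.to_datetime(
        ["2019-01-01", "2019-03-15", "2019-02-01", "2019-02-03"]
    ),
})

by_journey = toy.groupby("pj_id")
prior = toy.assign(
    # Shift inside each journey; the first step has nothing before it.
    prior_status=by_journey["status"].shift(1).fillna("no_prior_status"),
    # Days spent in the previous status, also computed per journey.
    days_in_prior=(toy["status_date"] - by_journey["status_date"].shift(1)).dt.days,
)
```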

In [125]:
# Apply logic to determine patient journey hierarchy. See logic diagram in transform description.

def hierarchy(prior_status_df):
    hierarchy_df = (
        prior_status_df
        .assign(**{
            transform.hierarchy : lambda x:(
                np.where(
                    x['pj_step'] >= x['first_active_step'],
                    transform.active_hierarchy,
                    np.where(
                        x['active_cancel_diff'] > transform.active_diff_threshold,
                        transform.remove_from_ttff,
                        np.where(
                            (~x[status].isin(transform.cancel_discontinue_status_code)),
                            np.where(
                                x['prior_status_diff'] > transform.prior_diff_threshold,
                                transform.remove_from_ttff,
                                x[transform.hierarchy]
                            ),
                            np.where(
                                (x['prior_status_diff'] > transform.prior_diff_threshold) | (x['prior_status'] == 'no_prior_status'),
                                transform.no_status_clarity,
                                np.where(
                                    x[substatus].isin(transform.bvpa_cancel_discontinue_substatus),
                                    transform.bvpa_hierarchy,
                                    None
                                )
                            )
                        )
                    )
                )
            )
        })
                
        .reset_index(drop=True)
        .assign(**{
            transform.hierarchy : lambda x: (
                x.groupby(['pj_id'], sort=False)[transform.hierarchy]
                .transform(lambda x: x.ffill())
                )
        })
    )
    return hierarchy_df
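
Deeply nested `np.where` calls can be hard to follow. As one possible alternative, `np.select` takes an ordered list of conditions and returns the choice for the first one that matches. The sketch below flattens a simplified version of the ladder on hypothetical data with hard-coded thresholds; the `"UNCHANGED"` default is a placeholder standing in for "keep the existing value", not the notebook's actual behaviour.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pj_step": [0, 1, 2, 3, 4],
    "first_active_step": [3, 3, 3, 3, 9],
    "active_cancel_diff": [5.0, 5.0, 90.0, np.nan, 5.0],
    "status": ["PENDING", "CANCELLED", "CANCELLED", "ACTIVE", "CANCELLED"],
    "substatus": ["BENEFITS", "INSURANCE DENIED", "OTHER", "SHIPMENT", "OTHER"],
    "prior_status": ["no_prior_status", "PENDING", "CANCELLED", "CANCELLED", "no_prior_status"],
    "prior_status_diff": [np.nan] * 5,
})

is_cancel = toy["status"].isin(["CANCELLED", "DISCONTINUED"])
long_prior = toy["prior_status_diff"] > 60  # NaN compares as False

# Conditions are checked top to bottom; the first match wins.
conditions = [
    toy["pj_step"] >= toy["first_active_step"],
    toy["active_cancel_diff"] > 60,
    ~is_cancel & long_prior,
    is_cancel & (long_prior | (toy["prior_status"] == "no_prior_status")),
    is_cancel & toy["substatus"].isin(["INSURANCE DENIED"]),
]
choices = [
    "ACTIVE - SHIPMENT",
    "REMOVE FROM TTFF",
    "REMOVE FROM TTFF",
    "NO STATUS CLARITY",
    "BVPA",
]
labels = np.select(conditions, choices, default="UNCHANGED")
```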

In [126]:
pj_df = pj(df)

pj_df.head()

Unnamed: 0,pharmacy_transaction_id,pharmacy_patient_id,pharmacy_code,medication,status_date,referral_date,customer_status,customer_substatus,patient_journey_hierarchy,pj_id,pj_step
0,522918362924,11855512788,WAG,ILUMYA,2019-08-13,2019-08-06,PENDING,BENEFITS,Dummy,0,0
1,405647714937,11855512788,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT RESPONSE,Dummy,0,1
2,560833250829,61489680474,WAG,ILUMYA,2019-08-12,2019-08-06,PENDING,BENEFITS,Dummy,1,2
3,145219598523,61489680474,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT CONTACT,Dummy,1,3
4,285375989569,61489680474,WAG,ILUMYA,2019-08-15,2019-08-06,PENDING,PATIENT RESPONSE,Dummy,1,4


In [128]:
hierarchy_df = (
    pj_df
    .pipe(cancel_before_active)
    .pipe(prior_status)
    .pipe(hierarchy)
)

hierarchy_df.head()

Unnamed: 0,pharmacy_transaction_id,pharmacy_patient_id,pharmacy_code,medication,status_date,referral_date,customer_status,customer_substatus,patient_journey_hierarchy,pj_id,pj_step,first_active_step,first_active_status_date,active_cancel_diff,prior_status,prior_status_diff
0,191701372019081513,19170137,ACCREDO,ILUMYA,2019-08-15,2019-04-19,CANCELLED,OTHER,NO STATUS CLARITY,13,28,29.0,2019-08-15,0.0,no_prior_status,
1,191701372019081510,19170137,ACCREDO,ILUMYA,2019-08-15,2019-08-15,ACTIVE,SHIPMENT,ACTIVE - SHIPMENT,13,29,29.0,2019-08-15,,CANCELLED,
2,284802602019081420,28480260,ACCREDO,YONSA,2019-08-14,2019-08-14,CANCELLED,OTHER,NO STATUS CLARITY,72,116,117.0,2019-08-15,1.0,no_prior_status,
3,284802602019081518,28480260,ACCREDO,YONSA,2019-08-15,2019-08-15,ACTIVE,SHIPMENT,ACTIVE - SHIPMENT,72,117,117.0,2019-08-15,,CANCELLED,
4,BRIOVARX_20190813_145116881,413721804,BRV,ILUMYA,2019-08-12,2019-08-08,ACTIVE,READY,Dummy,111,201,207.0,2019-08-15,2.0,no_prior_status,


In [137]:
pj_df_mod = pj(df_mod)

pj_df_mod.head()

Unnamed: 0,pharmacy_transaction_id,pharmacy_patient_id,pharmacy_code,medication,status_date,referral_date,customer_status,customer_substatus,patient_journey_hierarchy,pj_id,pj_step
0,522918362924,11855512788,WAG,ILUMYA,2019-08-13,2019-08-06,PENDING,BENEFITS,Dummy,0,0
1,405647714937,11855512788,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT RESPONSE,Dummy,0,1
2,560833250829,61489680474,WAG,ILUMYA,2019-08-12,2019-08-06,PENDING,BENEFITS,Dummy,1,2
3,145219598523,61489680474,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT CONTACT,Dummy,1,3
4,285375989569,61489680474,WAG,ILUMYA,2019-08-15,2019-08-06,PENDING,PATIENT RESPONSE,Dummy,1,4


In [140]:
output_mod = (
    pj_df_mod
    .pipe(cancel_before_active)
    .pipe(prior_status)
#    .pipe(hierarchy)
)

output_mod.head()

ValueError: No objects to concatenate
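
One plausible reading of this error: `df_mod` contains no cancelled or discontinued rows, so the filter in `cancel_before_active` leaves an empty frame, and the grouped `transform` in `prior_status` then has nothing to combine. A minimal guard, assuming that explanation is right, is to skip a step when there are no rows. `pipe_if_rows` and `add_flag` are hypothetical helpers, and skipping a step means the columns it would have added are absent.

```python
import pandas as pd

def pipe_if_rows(df: pd.DataFrame, step):
    """Hypothetical guard: run step only when df has at least one row."""
    return df if df.empty else step(df)

def add_flag(df: pd.DataFrame) -> pd.DataFrame:
    # Stand-in for a real pipeline step such as prior_status.
    return df.assign(flag=True)

empty = pd.DataFrame({"pj_id": [], "status": []})
nonempty = pd.DataFrame({"pj_id": [0], "status": ["PENDING"]})

skipped = pipe_if_rows(empty, add_flag)
applied = pipe_if_rows(nonempty, add_flag)
```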

In [130]:
# Merge hierarchy results for this enrichment back into the initial dataframe

final_dataframe = (
    pd.merge(
        pj_df.rename(columns = {transform.hierarchy:'old_hierarchy'}),
        hierarchy_df.loc[:,['pj_id', 'pj_step', transform.hierarchy]],
        how='left',
        on=['pj_id', 'pj_step']
    )
    .assign(**{
        transform.hierarchy : lambda x:(
            np.where(
                x[transform.hierarchy].isnull(),
                x['old_hierarchy'],
                x[transform.hierarchy]
            )
        )}
    )
    .drop(['pj_id', 'pj_step', 'old_hierarchy'], axis=1)
)

final_dataframe.head()

Unnamed: 0,pharmacy_transaction_id,pharmacy_patient_id,pharmacy_code,medication,status_date,referral_date,customer_status,customer_substatus,patient_journey_hierarchy
0,522918362924,11855512788,WAG,ILUMYA,2019-08-13,2019-08-06,PENDING,BENEFITS,Dummy
1,405647714937,11855512788,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT RESPONSE,Dummy
2,560833250829,61489680474,WAG,ILUMYA,2019-08-12,2019-08-06,PENDING,BENEFITS,Dummy
3,145219598523,61489680474,WAG,ILUMYA,2019-08-14,2019-08-06,PENDING,PATIENT CONTACT,Dummy
4,285375989569,61489680474,WAG,ILUMYA,2019-08-15,2019-08-06,PENDING,PATIENT RESPONSE,Dummy
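
The merge-back step follows a common pattern: left-join the updated values onto the full frame, then fall back to the old value wherever no update exists. A compact sketch of that pattern is below, using toy frames and a plain `hierarchy` column name as assumptions; it uses `fillna` where the cell above uses `np.where`.

```python
import pandas as pd

base = pd.DataFrame({"pj_step": [0, 1, 2], "hierarchy": ["A", "B", "C"]})
updates = pd.DataFrame({"pj_step": [1], "hierarchy": ["BVPA"]})

# Left join keeps every base row; the old column gets the "_old" suffix.
merged = base.merge(updates, how="left", on="pj_step", suffixes=("_old", ""))

# Where no update matched, fall back to the previous value.
merged["hierarchy"] = merged["hierarchy"].fillna(merged["hierarchy_old"])
result = merged.drop(columns="hierarchy_old")
```

Because the join is a left join on a unique key, the row count of `result` matches `base`, which is the property TEST 1 below checks.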


In [131]:
final_dataframe.shape

(499, 9)

### TEST TRANSFORM OUTPUT

In [132]:
# TEST 1: Check that final dataframe has the same number of rows as the input dataframe

test1 = (pj_df.shape[0] == final_dataframe.shape[0])

test1

True

In [133]:
hierarchy_df.shape

(11, 16)

In [None]:
# TEST 2: Check that hierarchy is not changed for statuses after first active shipment, or for journeys that never reported active shipment

first_active_status = (
    final_dataframe
    .assign(pj_id = lambda x: x.groupby([patient_id, pharmacy, brand_col]).grouper.group_info[0])
    .assign(pj_step = lambda x: x.index)
    .assign(active_status_date = lambda x: (
        pd.to_datetime(np.where(
            x[substatus] == transform.active_substatus_code,
            x[status_date],
            pd.NaT
    ))))
    .assign(first_active_status_date = lambda x: (
        x.groupby(['pj_id'])['active_status_date']
        .transform(min)
    ))
    .drop(['active_status_date'], axis=1)
    .merge(
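        # Use transform.hierarchy here: the bare name `hierarchy` now refers to the function defined above
        pj_df.loc[:,['pj_step', transform.hierarchy]].rename(columns={transform.hierarchy:'old_hierarchy'}),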
        how = 'inner',
        on = ['pj_step']
    )
)

test2 = (first_active_status   
    .loc[lambda x: (
        ((x[status_date] > x['first_active_status_date'])
         |
         (x['first_active_status_date'].isnull())
        )
        &
        (x[transform.hierarchy] != x['old_hierarchy'])
        &
        ~((x[transform.hierarchy].isnull()) & (x['old_hierarchy'].isnull()))
        &
        (x[transform.hierarchy] != transform.active_hierarchy)
    )]
)

test2 = (test2.shape[0] == 0)

test2

In [None]:
# TEST 3: Check that all cancelled and discontinued statuses prior to first active shipment have a new hierarchy assigned to them

test3 = (
    first_active_status   
    .loc[lambda x: (
        (x[status_date] < x['first_active_status_date'])
        &
        (x[status].isin(transform.cancel_discontinue_status_code))
        &
        (
            ((x['first_active_status_date'] - x[status_date]) / np.timedelta64(1,'D') > transform.active_diff_threshold)
            |
            ((x[status_date] - x[status_date].shift(1)) / np.timedelta64(1,'D') > transform.prior_diff_threshold)
            |
            (x[substatus].isin(transform.bvpa_cancel_discontinue_substatus))
        )
        &
        (x[transform.hierarchy] == x['old_hierarchy'])
    )]
)

test3 = (test3.shape[0] == 0)

test3

In [None]:
# TEST 4: Check that all non-cancel/discontinue statuses prior to first active shipment have their previous hierarchy assignment (unless they are REMOVE FROM TTFF)

test4 = (
    first_active_status
    .loc[lambda x: (
        (x[status_date] < x['first_active_status_date'])
        &
        (x['pj_id'].isin(
            x
            .loc[(x[status].isin(transform.cancel_discontinue_status_code)) & (x[status_date] < x['first_active_status_date'])]
            .pj_id
            .drop_duplicates()
            .tolist()
        ))
        &
        (~x[status].isin(transform.cancel_discontinue_status_code))
        &
        (
            (x[transform.hierarchy] != x['old_hierarchy'])
            &
            ~((x[transform.hierarchy].isnull()) & (x['old_hierarchy'].isnull()))
            &
            (x[transform.hierarchy] != transform.remove_from_ttff)
        )
    )]
)

test4 = (test4.shape[0] == 0)

test4

In [None]:
# TEST 5: Create a "test" dataframe with expected results

test_data = ([
    [1, 0, 'PENDING', 'OTHER', 'PENDING - OTHER'],
    [1, 70, 'PENDING', 'OTHER', 'PENDING - OTHER'],
    [1, 72, transform.cancel_discontinue_status_code[0], transform.bvpa_cancel_discontinue_substatus[0], transform.bvpa_hierarchy],
    [1, 72, transform.cancel_discontinue_status_code[1], 'OTHER', transform.bvpa_hierarchy],
    [1, 72, 'ACTIVE', transform.active_substatus_code, transform.active_hierarchy],
    [2, 0, 'PENDING', 'OTHER', transform.remove_from_ttff],
    [2, 70, transform.cancel_discontinue_status_code[1], 'OTHER', transform.no_status_clarity],
    [2, 72, 'PENDING', 'OTHER', 'PENDING - OTHER'],
    [2, 72, transform.cancel_discontinue_status_code[0], 'OTHER', 'PENDING - OTHER'],
    [2, 72, 'ACTIVE', transform.active_substatus_code, transform.active_hierarchy],
    [3, 0, transform.cancel_discontinue_status_code[1], transform.bvpa_cancel_discontinue_substatus[0], transform.remove_from_ttff],
    [3, 1, 'PENDING', 'OTHER', transform.remove_from_ttff],
    [3, 2, transform.cancel_discontinue_status_code[1], 'OTHER', transform.remove_from_ttff],
    [3, 70, transform.cancel_discontinue_status_code[1], 'OTHER', transform.no_status_clarity],
    [3, 71, 'ACTIVE', transform.active_substatus_code, transform.active_hierarchy],
    [3, 72, transform.cancel_discontinue_status_code[0], 'OTHER', transform.active_hierarchy],
    [3, 73, transform.cancel_discontinue_status_code[0], transform.bvpa_cancel_discontinue_substatus[0], transform.active_hierarchy]
])

test_df = (
    pd.DataFrame(test_data, columns = [patient_id, status_date, status, substatus, 'expected_hierarchy'])
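    .assign(**{
        trans_id : lambda x: x.index, # pj() sorts on trans_id, so the test frame needs that column
        pharmacy : 'ABC',
        brand_col : 'A',
        status_date : lambda x: (
            pd.to_datetime('2019-01-01', format='%Y-%m-%d') + pd.to_timedelta(x[status_date], unit='d')
        ),
        # transform.hierarchy, because the bare name `hierarchy` now refers to the function defined above
        transform.hierarchy : lambda x: (
            x[status] + ' - ' + x[substatus]
        )
    })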
    
)

test_df

In [None]:
# Apply transform to test dataframe

pj_test = pj(test_df)

test_output = (
    pj_test
    .pipe(cancel_before_active)
    .pipe(prior_status)
    .pipe(hierarchy)
)

final_dataframe_test = (
    pd.merge(
        pj_test.rename(columns = {transform.hierarchy:'old_hierarchy'}),
        test_output.loc[:,['pj_id','pj_step', transform.hierarchy]],
        how='left',
        on=['pj_id', 'pj_step']
    )
    .assign(**{
        transform.hierarchy : lambda x:(
            np.where(
                x[transform.hierarchy].isnull(),
                x['old_hierarchy'],
                x[transform.hierarchy]
            )
        )}
    )
    .drop(['pj_id', 'pj_step', 'old_hierarchy'], axis=1)
)

final_dataframe_test

In [None]:
# Check that results match expectations

test5 = (
    final_dataframe_test
    .assign(passfail = lambda x: np.where(
        (x[transform.hierarchy] == x['expected_hierarchy']) | (x[transform.hierarchy].isnull() & x['expected_hierarchy'].isnull()),
        True,
        False
    ))
    .passfail
    .all()
)

test5

In [None]:
# FINAL TEST: Did all 5 tests pass?

test1 & test2 & test3 & test4 & test5
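
The combined expression above yields a single boolean, which does not say which check failed. A small sketch of reporting failures by name follows; `failed_checks` and the check names are hypothetical, and the sample results are made up for illustration rather than taken from this notebook.

```python
def failed_checks(results: dict) -> list:
    """Hypothetical helper: return the names of checks whose result is falsy."""
    return [name for name, passed in results.items() if not passed]

# Made-up results for illustration; in the notebook these would be test1..test5.
results = {
    "row_count_preserved": True,
    "post_active_unchanged": True,
    "expected_hierarchy": False,
}
failures = failed_checks(results)
all_passed = not failures
```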

### Publish

In [None]:
## that's it - just provide the final dataframe to the var final_dataframe and we take it from there
transform.publish_contract.publish(final_dataframe, run_id, session, publish_to_redshift=False) # Remove publish_to_redshift=False before pipeline integration!
session.close()