# CORE Cartridge Notebook::Cancelled Not Before Active Enrichment
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and produce the same result every time. 
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **free of 'magic numbers':** if it is not instantly obvious, without context, what a variable or function is or does, consider renaming it.

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below describing what this transform does; this will appear in the configuration application where Ops will create, configure and update pipelines. 
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline). 
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do.

![caution](assets/cautionTape.png)

### Description
What does this transformation do? Be specific.

![what does your transform do](assets/what.gif)

Cancelled Not Before Active enrichment. Assigns hierarchy values in cases where a cancelled status is NOT reported before the first active shipment (either the cancel occurs after the first active shipment, or no active shipment is reported). This is used as part of the Fill Rate enrichment. See the logic diagram below:

<img src = 'assets/cancel_not_before_active.svg' style="width:800px;">

### Configuration

In [None]:
from core.helpers.session_helper import SessionHelper
session = SessionHelper().session

In [None]:
"""
************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET
from core.helpers.session_helper import SessionHelper
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract

db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch:str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)


In [None]:
""" 
********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>

e.g.

class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""

class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    '''
    hierarchy: str # Column name to use for Hierarchy
    pending_status_code: str # Pending status code (customer-specific)
    active_status_code: str # Active status code (customer-specific)
    cancel_status_code: str # Cancel status code (customer-specific)
    active_substatus_code: str # Active Shipment Substatus code, e.g. 'SHIPMENT' (customer-specific)
    payer_substatus_code: str # Comma-separated list of accepted substatus codes for BVPA hierarchy, e.g. INSURANCE DENIED,COVERAGE DENIED (customer-specific)
    transfer_hierarchy_input: str # Comma-separated list of accepted substatus codes for Transfer hierarchy, e.g. TRANSFER SP,TRANSFER HUB (customer-specific)
    payer_hierarchy: str # Name to use for Payer hierarchy assignment (customer-specific)
    transfer_greater2_hierarchy: str # Name to use for Transferred > 2 Days hierarchy assignment (customer-specific)
    transfer_less2_hierarchy: str # Name to use for Transferred <= 2 Days hierarchy assignment (customer-specific)
    pending_greater7_hierarchy: str # Name to use for Pending >= 7 Days hierarchy assignment (customer-specific)
    pending_4to6_hierarchy: str # Name to use for Pending 4 to 6 Days hierarchy assignment (customer-specific)
    pending_less3_hierarchy: str # Name to use for Pending <= 3 Days hierarchy assignment (customer-specific)
    transfer_threshold: int # Number of days to use as threshold for Transfer hierarchy logic (customer-specific)
    pending_upper_threshold: int # Number of days to use as upper-tier threshold for Pending hierarchy logic (customer-specific)
    pending_lower_threshold: int # Number of days to use as lower-tier threshold for Pending hierarchy logic (customer-specific)
    input_transform: str # Name of transform to input data from

In [None]:
transform = Transform()

In [None]:
# hardcoded variable values based on Ingest schema
transform.trans_id = 'pharmacy_transaction_id'
transform.brand_col = 'brand'
transform.patient_id = 'pharmacy_patient_id'
transform.pharmacy = 'pharmacy_code'
transform.status_date = 'status_date'
transform.referral_date = 'referral_date'
transform.status = 'status'
transform.substatus = 'substatus'

In [1]:
# split comma-separated lists into actual lists
transform.payer_substatus_code = transform.payer_substatus_code.split(',')
transform.transfer_hierarchy_input = transform.transfer_hierarchy_input.split(',')

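
One thing to watch when splitting these configuration strings: `str.split` is called on the string being split, with the separator as its argument. Writing it the other way round (`','.split(value)`) splits the literal `','` and returns `[',']` for a typical value. The sketch below uses a hypothetical helper, `split_codes`, which also trims whitespace around each code; whether the configuration application can emit spaces after commas is an assumption, not something this notebook confirms.

```python
def split_codes(raw: str) -> list:
    """Split a comma-separated config string into trimmed, non-empty codes.

    Hypothetical helper for illustration; not part of the core library.
    """
    return [code.strip() for code in raw.split(',') if code.strip()]


# Example values taken from the inline comments in the Transform class.
payer_codes = split_codes('INSURANCE DENIED, COVERAGE DENIED')
transfer_codes = split_codes('TRANSFER SP,TRANSFER HUB')

print(payer_codes)
print(transfer_codes)
```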

### Transformation

In [None]:
### Retrieve current dataset from contract
from core.dataset_diff import DatasetDiff

diff = DatasetDiff(db_transform.id)
df = diff.get_diff(transform_name=transform.input_transform, values=[run_id])

In [None]:
df.shape

In [None]:
### Use the variables above to execute your transformation. The final output needs to be a variable named final_dataframe

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)

### DATA CLEANING: ADDRESS THIS SECTION BEFORE PIPELINE INTEGRATION

### APPLY TRANSFORM LOGIC

In [None]:
# Insert threshold values into hierarchy string

transfer_greater2_hierarchy = (
    transform.transfer_greater2_hierarchy
    .format(transform.transfer_threshold))

transfer_less2_hierarchy = (
    transform.transfer_less2_hierarchy
    .format(transform.transfer_threshold))

pending_greater7_hierarchy = (
    transform.pending_greater7_hierarchy
    .format(transform.pending_upper_threshold))

pending_4to6_hierarchy = (
    transform.pending_4to6_hierarchy
    .format(transform.pending_lower_threshold + 1, transform.pending_upper_threshold - 1))

pending_less3_hierarchy = (
    transform.pending_less3_hierarchy
    .format(transform.pending_lower_threshold))
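
The `.format` calls in this cell only change a label if the configured hierarchy name contains `{}` placeholders. A minimal sketch of the idea, using made-up template strings and threshold values (the real ones come from the configuration application and are not shown in this notebook):

```python
# Hypothetical templates and thresholds, for illustration only.
transfer_greater_template = 'Transferred > {} Days'
pending_mid_template = 'Pending {} to {} Days'
plain_template = 'Payer'  # no placeholder

transfer_threshold = 2
pending_lower_threshold = 3
pending_upper_threshold = 7

transfer_label = transfer_greater_template.format(transfer_threshold)

# The middle pending tier covers lower + 1 through upper - 1 days.
pending_mid_label = pending_mid_template.format(
    pending_lower_threshold + 1, pending_upper_threshold - 1)

# A template without placeholders is returned unchanged.
plain_label = plain_template.format(transfer_threshold)

print(transfer_label)     # Transferred > 2 Days
print(pending_mid_label)  # Pending 4 to 6 Days
print(plain_label)        # Payer
```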

In [None]:
# Assign Patient Journey (pj_id), Patient Journey Step (pj_step) and Patient Journey Phase (pj_phase) identifiers
# (These IDs are used for calculation purposes only.  They will not be published)

def pj(df):
    pj_df = (
        df
        .assign(**{
            'pj_id' : lambda x: (
                x.groupby([transform.patient_id, transform.pharmacy, transform.brand_col]).grouper.group_info[0]
            ),
            'pj_step' : lambda x: x.index,
            'pj_phase' : lambda x:(
                np.where((x['pj_id'] == x['pj_id'].shift(1))
                         & (x[transform.status] == x[transform.status].shift(1)),
                         0,
                         1
                        )
                .cumsum()
            )
        })
        .sort_values(
            by=[transform.patient_id, transform.pharmacy, transform.brand_col, transform.status_date, transform.status, transform.trans_id],
            ascending=[True, True, True, True, False, True])
        .reset_index(drop=True)
    )
    return pj_df
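
To see what the journey and phase identifiers look like, here is a small sketch on a toy dataframe. The column names and values are invented. It uses the public `groupby(...).ngroup()` method for the journey id, which is assumed to give the same numbering as `grouper.group_info[0]` when groups are sorted (the default); that equivalence has not been checked against this pipeline's data.

```python
import numpy as np
import pandas as pd

# Toy data, already sorted by patient and status date.
toy = pd.DataFrame({
    'patient': [1, 1, 1, 2, 2],
    'status': ['PENDING', 'PENDING', 'CANCELLED', 'ACTIVE', 'CANCELLED'],
})

# Journey id: one integer per distinct group key.
toy['pj_id'] = toy.groupby('patient').ngroup()

# Phase id: a new phase starts whenever the journey or the status changes
# relative to the previous row.
same_as_previous = (
    (toy['pj_id'] == toy['pj_id'].shift(1))
    & (toy['status'] == toy['status'].shift(1))
)
toy['pj_phase'] = np.where(same_as_previous, 0, 1).cumsum()

print(toy)
```

Because the phase id is a running count over the whole frame, it is unique across journeys, and it depends on row order: the rows need to be in journey and date order when it is computed.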

In [None]:
# Filter to only include patient journeys where at least 1 cancelled status is reported

def cancel(pj_df):
    cancel_df = (
        pj_df
        .loc[lambda x: (
            x['pj_id'].isin(x
                            .loc[x[transform.status] == transform.cancel_status_code]
                            .pj_id
                            .drop_duplicates()
                            .tolist()
                           )
        )]
    )
    return cancel_df
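
The filter above keeps whole journeys rather than individual rows. A minimal sketch of the same pattern on invented data:

```python
import pandas as pd

# Toy data: journeys 0 and 2 contain a cancel, journey 1 does not.
toy = pd.DataFrame({
    'pj_id': [0, 0, 1, 1, 2],
    'status': ['PENDING', 'CANCELLED', 'PENDING', 'ACTIVE', 'CANCELLED'],
})

# Collect the ids of journeys with at least one cancelled row...
cancelled_ids = toy.loc[toy['status'] == 'CANCELLED', 'pj_id'].unique()

# ...then keep every row belonging to those journeys.
kept = toy.loc[toy['pj_id'].isin(cancelled_ids)]

print(kept)
```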

In [None]:
# Filter to only include patient journeys where at least 1 of the following is true:
#    a) No active shipment reported OR
#    b) First active shipment occurs PRIOR to cancelled status

def cancel_not_before_active(cancel_df):
    cancel_not_before_active_df = (
        cancel_df
        .assign(active_phase = lambda x: (
            x.loc[x[transform.substatus] == transform.active_substatus_code].groupby(['pj_id'])['pj_phase']
            .transform(min)
        ))
        .assign(active_phase = lambda x: (
            x.groupby(['pj_id'], sort=False)['active_phase']
            .transform(lambda x: x.ffill())
        ))
        .assign(active_phase = lambda x: (
            x.groupby(['pj_id'], sort=False)['active_phase']
            .transform(lambda x: x.bfill())
        ))
        .assign(cancel_flag = lambda x: (
            np.where(
                (x[transform.status] == transform.cancel_status_code)
                & (
                    (x['pj_phase'] > x['active_phase'])
                    | (x['active_phase'].isnull())
                ),
                1,
                0
            )
        ))
        .loc[lambda x: (
            x['pj_id'].isin(x
                            .loc[x['cancel_flag'] == 1]
                            .pj_id
                            .drop_duplicates()
                            .tolist()
                           )
        )]
        .drop(['active_phase'], axis=1)
    )
    return cancel_not_before_active_df
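
The flagging rule can be sketched on toy data as follows. This is not the notebook's exact method: instead of filtering, transforming and then forward/back-filling, it masks non-active rows with `Series.where` and broadcasts the per-journey minimum in one step. The column names and status values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pj_id':    [0, 0, 0, 1, 1],
    'pj_phase': [1, 2, 3, 4, 5],
    'status':   ['CANCELLED', 'ACTIVE', 'CANCELLED', 'PENDING', 'CANCELLED'],
})

# Earliest active phase per journey, broadcast to every row of that journey.
# Journeys with no active row get NaN.
first_active = (
    toy['pj_phase']
    .where(toy['status'] == 'ACTIVE')
    .groupby(toy['pj_id'])
    .transform('min')
)

# Flag cancels that come after the first active phase, or that belong to a
# journey with no active phase at all.
toy['cancel_flag'] = np.where(
    (toy['status'] == 'CANCELLED')
    & ((toy['pj_phase'] > first_active) | first_active.isnull()),
    1,
    0,
)

print(toy)
```

In journey 0 only the cancel after the active phase is flagged; in journey 1 the cancel is flagged because no active phase exists.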

In [None]:
# Find all pending phases and time spent in each pending phase
# Forward fill so that all subsequent cancels adopt a "prior_pending_time".

def prior_pending(cancel_not_before_active_df):
    prior_pending_df = (
        cancel_not_before_active_df
        .assign(first_journey_step = lambda x: (
            x.groupby(['pj_id'])['pj_step']
            .transform(min)
        ))
        .assign(first_phase_step = lambda x: (
            x.groupby(['pj_phase'])['pj_step']
            .transform(min)
        ))
        .assign(prev_phase = lambda x: (
            np.where(
                (x['pj_step'] != x['first_journey_step']) & (x['pj_step'] == x['first_phase_step']),
                (
                    x.groupby(['pj_phase'])[transform.status]
                    .transform(min).shift(1)
                ),
                None)
        ))
        .assign(prior_pending_time = lambda x: (
            (x[transform.status_date] - 
            pd.to_datetime(
                np.where(
                    x['prev_phase'] == transform.pending_status_code,
                    (
                        x.groupby(['pj_id','pj_phase'])[transform.status_date]
                        .transform(min).shift(1)
                    ),
                    pd.NaT)
            )) / np.timedelta64(1,'D')
        ))
        .assign(prior_pending_time = lambda x: (
            x.groupby(['pj_id'], sort=False)['prior_pending_time']
            .transform(lambda x: x.ffill())
            ))
        .drop(['first_journey_step','first_phase_step','prev_phase'], axis=1)
    )
    return prior_pending_df
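
Two building blocks carry most of this function: converting a date difference to a number of days, and forward-filling that number within a journey so later rows inherit it. A minimal sketch with invented column names and dates:

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'pj_id': [0, 0, 0, 1],
    'phase_start': pd.to_datetime(
        ['2019-01-01', '2019-01-06', '2019-01-08', '2019-01-03']),
    # Start of the preceding pending phase, where there is one.
    'prev_pending_start': pd.to_datetime([None, '2019-01-01', None, None]),
})

# Timedelta divided by one day gives a float number of days (NaN where missing).
toy['elapsed_days'] = (
    (toy['phase_start'] - toy['prev_pending_start']) / np.timedelta64(1, 'D')
)

# Forward fill inside each journey only, so values never leak across journeys.
toy['elapsed_days'] = toy.groupby('pj_id', sort=False)['elapsed_days'].ffill()

print(toy)
```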

In [None]:
## Find the min(referral_date) for each patient journey (This is used in the hierarchy logic for Transfers)

def ref(prior_pending_df):
    ref_df = (
        prior_pending_df
        .assign(min_ref_date = lambda x: (
            x.groupby(['pj_id'])[transform.referral_date]
            .transform(min)
        ))
        .assign(ref_time = lambda x: (
            (x[transform.status_date] - x['min_ref_date']) / np.timedelta64(1, 'D')
        ))
        .drop(['min_ref_date'], axis=1)
    )
    return ref_df

In [None]:
# Apply logic to determine patient journey hierarchy. See logic diagram in transform description.

def hierarchy(ref_df):
    hierarchy_df = (
        ref_df
        .assign(**{
            transform.hierarchy : lambda x:(
                np.where(
                    x['cancel_flag'] == 0,
                    None,
                    np.where(
                        x[transform.substatus].isin(transform.payer_substatus_code),
                        transform.payer_hierarchy,
                        np.where(
                            x[transform.hierarchy].isin(transform.transfer_hierarchy_input),
                            np.where(
                                x['ref_time'] > transform.transfer_threshold,
                                transfer_greater2_hierarchy,
                                transfer_less2_hierarchy
                            ),
                            np.where(
                                x['prior_pending_time'] >= transform.pending_upper_threshold,
                                pending_greater7_hierarchy,
                                np.where(
                                    x['prior_pending_time'] > transform.pending_lower_threshold,
                                    pending_4to6_hierarchy,
                                    pending_less3_hierarchy
                                )
                            )
                        )
                    )
                )
            )
        })
    )
    return hierarchy_df
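
The nested `np.where` above is easier to follow when written out as a decision tree for a single row. The function below is only a reading aid, not the production logic: its name, the labels and the default thresholds are all hypothetical.

```python
import math


def assign_hierarchy(cancel_flag, substatus, current_hierarchy,
                     ref_time, prior_pending_time,
                     payer_codes, transfer_codes,
                     transfer_threshold=2, pending_upper=7, pending_lower=3):
    """Return a hierarchy label for one row, or None for unflagged rows.

    Hypothetical helper; labels and default thresholds are placeholders.
    """
    if cancel_flag == 0:
        return None
    if substatus in payer_codes:
        return 'PAYER'
    if current_hierarchy in transfer_codes:
        if ref_time > transfer_threshold:
            return 'TRANSFER > THRESHOLD'
        return 'TRANSFER <= THRESHOLD'
    # Comparisons against NaN are False, so a missing pending time falls
    # through to the lowest pending tier, as it does with nested np.where.
    if prior_pending_time >= pending_upper:
        return 'PENDING >= UPPER'
    if prior_pending_time > pending_lower:
        return 'PENDING MID'
    return 'PENDING <= LOWER'


payer = ['INSURANCE DENIED']
transfer = ['TRANSFER SP']

print(assign_hierarchy(1, 'OTHER', 'OTHER', 1.0, 5.0, payer, transfer))
print(assign_hierarchy(1, 'OTHER', 'OTHER', 1.0, math.nan, payer, transfer))
```

Note the order of the checks: payer substatus wins over transfer, and transfer wins over any pending tier.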

In [None]:
pj_df = pj(df)

pj_df.head()

In [None]:
hierarchy_df = (
    pj_df
    .pipe(cancel)
    .pipe(cancel_not_before_active)
    .pipe(prior_pending)
    .pipe(ref)
    .pipe(hierarchy)
)

hierarchy_df.head()

In [None]:
final_dataframe = (
    pd.merge(
        pj_df.rename(columns = {transform.hierarchy:'old_hierarchy'}),
        hierarchy_df.loc[
            hierarchy_df['cancel_flag'] == 1,
            ['pj_id', 'pj_step', transform.hierarchy]],
        how='left',
        on=['pj_id', 'pj_step']
    )
    .assign(**{
        transform.hierarchy : lambda x:(
            np.where(
                x[transform.hierarchy].isnull(),
                x['old_hierarchy'],
                x[transform.hierarchy]
            )
        )}
    )
    .drop(['pj_id', 'pj_step', 'pj_phase', 'old_hierarchy'], axis=1)
)

final_dataframe.head()
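
The merge-and-fallback step can be sketched on toy data like this. The frames and column names are invented, and `fillna` is used in place of `np.where` for the fallback, which should be equivalent here.

```python
import pandas as pd

# Full input, with its existing hierarchy values.
base = pd.DataFrame({'key': [0, 1, 2], 'hierarchy': ['A', 'B', 'C']})

# New values for the subset of rows that were flagged.
updates = pd.DataFrame({'key': [1], 'hierarchy': ['NEW']})

merged = (
    pd.merge(
        base.rename(columns={'hierarchy': 'old_hierarchy'}),
        updates,
        how='left',
        on='key',
    )
    # Keep the new value where there is one, otherwise fall back to the old.
    .assign(hierarchy=lambda x: x['hierarchy'].fillna(x['old_hierarchy']))
    .drop(columns=['old_hierarchy'])
)

print(merged)
```

A left merge preserves the row count of the left frame only when the join keys are unique on the right-hand side; duplicated keys there would multiply rows.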

In [None]:
final_dataframe.shape

### TEST TRANSFORM OUTPUT

In [None]:
# TEST 1: Create a "test" dataframe with expected results

test_data = ([
    [0, 0, transform.active_status_code, 'READY', 'READY'],
    [0, 1, transform.cancel_status_code, transform.transfer_hierarchy_input[0], transfer_less2_hierarchy],
    [0, 5, transform.cancel_status_code, 'OTHER', pending_less3_hierarchy],
    [0, 5, transform.cancel_status_code, transform.payer_substatus_code[0], transform.payer_hierarchy],
    [0, 6, transform.pending_status_code, 'OTHER', 'OTHER'],
    [0, 10, transform.cancel_status_code, 'OTHER', pending_4to6_hierarchy],
    [0, 12, transform.pending_status_code, 'OTHER', 'OTHER'],
    [0, 12, transform.cancel_status_code, 'OTHER', pending_less3_hierarchy],
    [0, 13, transform.cancel_status_code, transform.payer_substatus_code[0], transform.payer_hierarchy],
    [14, 14, transform.cancel_status_code, 'OTHER', pending_less3_hierarchy],
    [14, 15, transform.cancel_status_code, transform.transfer_hierarchy_input[0], transfer_greater2_hierarchy]
])

test_df = (
    pd.DataFrame(test_data, columns = [transform.referral_date, transform.status_date, transform.status, transform.substatus, 'expected_hierarchy'])
    .assign(**{
        transform.patient_id : 123,
        transform.pharmacy : 'ABC',
        transform.brand_col : 'A',
        transform.status_date : lambda x: (
            pd.to_datetime('2019-01-01', format='%Y-%m-%d') + pd.to_timedelta(x[transform.status_date], unit='d')
        ),
        transform.referral_date : lambda x: (
            pd.to_datetime('2019-01-01', format='%Y-%m-%d') + pd.to_timedelta(x[transform.referral_date], unit='d')
        ),
        transform.hierarchy : lambda x: x[transform.substatus],
        transform.trans_id : 1
    })    
)

test_df

In [None]:
# Apply transform to test dataframe

pj_test = pj(test_df)

test_output = (
    pj_test
    .pipe(cancel)
    .pipe(cancel_not_before_active)
    .pipe(prior_pending)
    .pipe(ref)
    .pipe(hierarchy)
)

final_dataframe_test = (
    pd.merge(
        pj_test.rename(columns = {transform.hierarchy:'old_hierarchy'}),
        test_output.loc[
            test_output['cancel_flag'] == 1,
            ['pj_id', 'pj_step', transform.hierarchy]],
        how='left',
        on=['pj_id', 'pj_step']
    )
    .assign(**{
        transform.hierarchy : lambda x:(
            np.where(
                x[transform.hierarchy].isnull(),
                x['old_hierarchy'],
                x[transform.hierarchy]
            )
        )}
    )
    .drop(['pj_id', 'pj_step', 'pj_phase', 'old_hierarchy'], axis=1)
)

final_dataframe_test

In [None]:
# Check that results match expectations

test1 = (
    final_dataframe_test
    .assign(passfail = lambda x: np.where(
        (x[transform.hierarchy] == x['expected_hierarchy']) | (x[transform.hierarchy].isnull() & x['expected_hierarchy'].isnull()),
        True,
        False
    ))
    .passfail
    .all()
)

test1
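
The extra `isnull()` clause in the check above matters because a plain `==` does not treat two missing values as equal. A small sketch with invented data:

```python
import pandas as pd

result = pd.DataFrame({
    'actual':   ['A', None, 'C'],
    'expected': ['A', None, 'C'],
})

# Plain equality alone: the row where both sides are missing is not a match.
naive_matches = result['actual'] == result['expected']

# Treat "both missing" as a match as well.
matches = naive_matches | (
    result['actual'].isnull() & result['expected'].isnull()
)

all_pass = bool(matches.all())
print(all_pass)
```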

In [None]:
# TEST 2: Check that final dataframe has the same number of rows as the input dataframe

test2 = (final_dataframe.shape[0] == df.shape[0])

test2

In [None]:
# TEST 3: Check that all cancel statuses NOT before an active shipment have 1 of the 6 cancel_not_before_active hierarchies assigned to them.

hierarchy_list = [
    transform.payer_hierarchy,
    transfer_greater2_hierarchy,
    transfer_less2_hierarchy,
    pending_greater7_hierarchy,
    pending_4to6_hierarchy,
    pending_less3_hierarchy]

first_active_status = (
    final_dataframe
    .assign(pj_id = lambda x: x.groupby([transform.patient_id, transform.pharmacy, transform.brand_col]).grouper.group_info[0])
    .assign(pj_step = lambda x: x.index)
    .assign(active_status_date = lambda x: (
        pd.to_datetime(np.where(
            x[transform.substatus] == transform.active_substatus_code,
            x[transform.status_date],
            pd.NaT
    ))))
    .assign(first_active_status_date = lambda x: (
        x.groupby(['pj_id'])['active_status_date']
        .transform(min)
    ))
    .drop(['active_status_date'], axis=1)
)

test3 = (first_active_status   
    .loc[lambda x: (
        ((x[transform.status_date] > x['first_active_status_date'])
         |
         (x['first_active_status_date'].isnull())
        )
        &
        (x[transform.status] == transform.cancel_status_code)
        &
        (~x[transform.hierarchy].isin(hierarchy_list))
    )]
)

test3 = (test3.shape[0] == 0)

test3

In [None]:
# TEST 4: Check that all statuses BEFORE an active shipment do NOT have any of the 6 cancel_not_before_active hierarchies assigned to them.

test4 = (
    first_active_status   
    .loc[lambda x: (
        (x[transform.status_date] < x['first_active_status_date'])
        &
        (x[transform.hierarchy].isin(hierarchy_list))
    )]
)

test4 = (test4.shape[0] == 0)

test4

In [None]:
# FINAL TEST: Did all 4 tests pass?

test1 & test2 & test3 & test4

### Publish

In [None]:
## that's it - just provide the final dataframe to the var final_dataframe and we take it from there
transform.publish_contract.publish(final_dataframe, run_id, session)
session.close()