# CORE Cartridge Notebook::Time to First Fill Metric
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and return the same result every time. 
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **free of 'magic numbers':** if it is not instantly obvious, without context, what a variable or function is or does, consider renaming it.
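As a minimal sketch of the "repeatable" guideline above, the function below returns a new dataframe instead of mutating its input, so running it twice gives the same result as running it once. The function name `normalize_status` and the sample data are hypothetical and not part of this cartridge.

```python
import pandas as pd


def normalize_status(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with `status` trimmed and upper-cased; the input is not mutated."""
    return df.assign(status=df["status"].str.strip().str.upper())


# Hypothetical sample data, for illustration only
raw = pd.DataFrame({"status": [" active", "Pending ", "ACTIVE"]})

once = normalize_status(raw)
twice = normalize_status(once)

# Applying the transform again changes nothing, and `raw` is left untouched
print(once.equals(twice))
```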

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below describing what it is this transform does; this will appear in the configuration application where Ops will create, configure and update pipelines. 
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline). 
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do. 
![caution](assets/cautionTape.png)

### Description
What does this transformation do? Be specific.

![what does your transform do](assets/what.gif)

The TTFF Metric transform prepares data for use in Spotfire. It does not calculate the TTFF metric itself, only its components; Spotfire performs the final TTFF calculation from this transform's output.

The transform takes all patient journey data up to and including the first active shipment. If no active shipment is reported, the entire journey is dropped.
It then calculates the time spent in each status as (next status_date) - (status_date), counted in business days with US federal holidays excluded. Summing across all statuses in a patient journey gives the total TTFF (this step is done in Spotfire).
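The arithmetic described above can be sketched on a single toy journey. This is an illustration under stated assumptions, not the transform itself: the data is made up, the `days_in_status` column name is hypothetical, and the sketch uses plain calendar days for readability, whereas the transformation cell in this notebook counts business days with `np.busday_count`.

```python
import pandas as pd

# One hypothetical patient journey ending in its first active shipment
journey = pd.DataFrame({
    "pj_id": [0, 0, 0],
    "status": ["PENDING", "PENDING", "ACTIVE"],
    "status_date": pd.to_datetime(["2021-03-01", "2021-03-04", "2021-03-10"]),
})

# Next status date within the journey; the last row falls back to its own date
journey["next_status_date"] = (
    journey.groupby("pj_id")["status_date"].shift(-1).fillna(journey["status_date"])
)

# Calendar-day difference per status (the notebook itself uses business days)
journey["days_in_status"] = (
    journey["next_status_date"] - journey["status_date"]
).dt.days

# The sum over the journey is what Spotfire would report as the total
total = int(journey["days_in_status"].sum())
print(journey[["status", "days_in_status"]], total)
```

The final row (the active shipment) contributes zero days, which is the property the tests at the end of this notebook check.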

### Configuration

In [None]:
from core.helpers.session_helper import SessionHelper
session = SessionHelper().session

In [None]:
"""
************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET
from core.helpers.session_helper import SessionHelper
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract

db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch: str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)


In [None]:
""" 
********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>

e.g.

class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""

class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    '''
    input_transform: str = db_transform.variables.input_transform # The name of the dataset to pull from
    input_hierarchy: str = db_transform.variables.hierarchy # Column header for Patient Journey Hierarchy from the input dataset
    active_status_code: str = db_transform.variables.active_status_code # Active Shipment Status code, e.g. 'ACTIVE' (customer-specific)
    active_substatus_code: str = db_transform.variables.active_substatus_code # Active Shipment Substatus code, e.g. 'SHIPMENT' (customer-specific)
    fulfillment_hierarchy: str = db_transform.variables.fulfillment_hierarchy # Hierarchy used for statuses after the first fill, e.g. 'ACTIVE - SHIPMENT' (customer-specific)
    discontinued_hierarchy: str = db_transform.variables.discontinued_hierarchy # Comma separated list (stored as string) of any hierarchies that we know should be excluded from TTFF, e.g. 'PATIENT'
    discontinued_hierarchy = discontinued_hierarchy.split(',') # We reassign the string variable to be a list of strings by comma split
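
A plain `split(',')` keeps any whitespace around each item, so a configured value such as `'PATIENT, PAYER'` would yield `' PAYER'`. Because the transformation upper-cases string columns before comparing them, the list values also need to be upper case to match. The sketch below shows one defensive way to parse the value; `parse_hierarchies` is a hypothetical helper, not part of the CORE library.

```python
def parse_hierarchies(raw: str) -> list:
    """Split a comma-separated string into trimmed, upper-cased, non-empty items."""
    return [item.strip().upper() for item in raw.split(",") if item.strip()]


# Hypothetical configuration value with stray whitespace and a trailing comma
parsed = parse_hierarchies("Patient, Payer ,")
print(parsed)
```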


In [None]:
transform = Transform()

In [None]:
transform.remove_from_ttff = "REMOVE FROM TTFF"

In [None]:
# hardcoded variable values based on Ingest schema

input_brand_col = 'brand'
input_medication = 'medication'
input_patient_id = 'longitudinal_patient_id'
input_pharmacy = 'pharmacy_name'
input_status_date = 'status_date'
input_referral_date = 'referral_date'
input_status = 'status'
input_substatus = 'substatus'
input_hierarchy = transform.input_hierarchy
input_cust_status = 'customer_status'
input_cust_status_desc = 'customer_status_description'
input_payer = 'primary_payer'
input_payer_type = 'primary_payer_type'
input_hcp_first_name = 'hcp_first_name'
input_hcp_last_name = 'hcp_last_name'
input_hcp_npi = 'hcp_npi'
input_hcp_state = 'hcp_state'
input_hcp_zip = 'hcp_zip'
input_dx_1 = 'dx_1'
input_dx_2 = 'dx_2'
input_referral_source = 'referral_source'
input_remove_from_ttff = 'REMOVE FROM TTFF'
input_cs_outlet_id = 'cs_outlet_id'
input_cot = 'cot'

if DbTransform.pharmaceutical_company.upper() == 'SUN':
    input_trans_id = 'pharmacy_transaction_id'

else:
    input_trans_id = 'aggregator_transaction_id'

### Transformation

In [None]:
### Retrieve current dataset from contract
from core.dataset_diff import DatasetDiff

diff = DatasetDiff(db_transform.id)
df = diff.get_diff(transform_name=transform.input_transform, values=[run_id])

In [None]:
df.shape

In [None]:
### Use the variables above to execute your transformation. The final output must be a variable named final_dataframe

In [None]:
import numpy as np
import pandas as pd

from pandas.tseries.holiday import USFederalHolidayCalendar as cal

In [None]:
pd.set_option('display.max_columns', 100)
pd.set_option('display.max_rows', 500)

### DATA CLEANING: ADDRESS THIS SECTION BEFORE PIPELINE INTEGRATION

### APPLY TRANSFORM LOGIC

In [None]:
# Convert headers to standardized names for schema

trans_id = 'transaction_id'
patient_id = 'longitudinal_patient_id'
payer = 'primary_payer'
payer_type = 'primary_payer_type'
hcp_first_name = 'hcp_first_name'
hcp_last_name = 'hcp_last_name'
hcp_npi = 'hcp_npi'
hcp_state = 'hcp_state'
hcp_zip = 'hcp_zip'
medication = 'medication'
brand = 'brand'
pharmacy = 'pharmacy_name'
cs_outlet_id = 'cs_outlet_id'
cot = 'cot'
status = 'status'
substatus = 'substatus'
cust_status = 'customer_status'
cust_status_desc = 'customer_status_description'
status_date = 'status_date'
next_status_date = 'next_status_date'
dx_1 = 'dx_1'
dx_2 = 'dx_2'
referral_source = 'referral_source'
referral_date = 'referral_date'
min_active_date = 'ship_date'
hierarchy = 'accrual_bucket'
time_in_status = 'tt_fill_days'

#status_date_long = 'startDateLong'
#next_status_date_long = 'statusDateLong'

In [None]:
output_columns = [
    trans_id,
    patient_id,
    payer,
    payer_type,
    hcp_first_name,
    hcp_last_name,
    hcp_npi,
    hcp_state,
    hcp_zip,
    medication,
    brand,
    pharmacy,
    cs_outlet_id,
    cot,
    status,
    substatus,
    cust_status,
    cust_status_desc,
    status_date,
    next_status_date,
    dx_1,
    dx_2,
    referral_source,
    referral_date,
    min_active_date,
    hierarchy,
    time_in_status
]

In [None]:
# Map column headers to match TTFF schema standardization

def map_columns(df):
    map_columns_df = (
        df
        .rename(columns = {
            input_trans_id: trans_id,
            input_patient_id: patient_id,
            input_pharmacy: pharmacy,
            input_payer: payer,
            input_payer_type: payer_type,
            input_hcp_first_name: hcp_first_name,
            input_hcp_last_name: hcp_last_name,
            input_hcp_npi: hcp_npi,
            input_hcp_state: hcp_state,
            input_hcp_zip: hcp_zip,
            input_medication: medication,
            input_brand_col: brand,
            input_cs_outlet_id: cs_outlet_id,
            input_cot: cot,
            input_status: status,
            input_substatus: substatus,
            input_cust_status: cust_status,
            input_cust_status_desc: cust_status_desc,
            input_status_date: status_date,
            input_dx_1: dx_1,
            input_dx_2: dx_2,
            input_referral_source: referral_source,
            input_referral_date: referral_date,
            input_hierarchy: hierarchy
        })
    )
    return map_columns_df

In [None]:
# Assign Patient Journey (pj_id) identifier
# (This ID is used for calculation purposes only.  It will not be published)
# Sort data

def pj(map_columns_df):
    pj_df = (
        map_columns_df
        .assign(**{
            'pj_id' : lambda x: (
                x.groupby([patient_id, pharmacy, brand]).ngroup()
            )
        })
        .sort_values(
            by=['pj_id', status_date, status, trans_id],
            ascending=[True, True, False, True])
        .reset_index(drop=True)
    )
    return pj_df
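
The patient journey identifier is simply an integer label per unique (patient, pharmacy, brand) combination. A minimal sketch of that idea on made-up data, using pandas' public `groupby(...).ngroup()` method to number the groups; the sample rows are hypothetical and the column names follow the Ingest schema listed earlier in this notebook.

```python
import pandas as pd

# Hypothetical events for two patients across two pharmacies
events = pd.DataFrame({
    "longitudinal_patient_id": ["P2", "P1", "P2", "P1"],
    "pharmacy_name": ["ACME", "ACME", "ACME", "ZETA"],
    "brand": ["X", "X", "X", "X"],
    "status_date": pd.to_datetime(
        ["2021-01-05", "2021-01-02", "2021-01-01", "2021-01-03"]
    ),
})

keys = ["longitudinal_patient_id", "pharmacy_name", "brand"]

# ngroup() numbers the groups in sorted key order and labels every row with its group
events["pj_id"] = events.groupby(keys).ngroup()

# Sort so each journey is contiguous and chronological
ordered = events.sort_values(["pj_id", "status_date"]).reset_index(drop=True)
print(ordered)
```

Rows that share all three keys get the same `pj_id`, so later steps can group on a single column.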

In [None]:
# Time to First Fill calculation:
#    1) Convert all strings to uppercase
#    2) Omit records with "Remove from TTFF" hierarchy
#    3) Filter to only include patient journeys where active shipment occurs
#    4) Get status date of first active shipment (min_active_date) for each patient journey
#    5) Keep only records dated on or before the first active shipment date; records dated exactly on that date must have the fulfillment hierarchy.
#    6) Calculate time spent in status (only include business days and non-holidays) using np.busday_count
#    7) Convert all dates to long type (calculated as number of seconds between 1/1/1970 and the status date) - NO LONGER INCLUDING THIS STEP
#    8) Filter/reorder columns to match schema

def ttff(pj_df):
    ttff_df = (
        pj_df
        .apply(lambda x: (x.str.upper() if x.dtype == 'O' else x))
        .loc[lambda x: (
            (x[hierarchy] != transform.remove_from_ttff)
            &
            (x.pj_id.isin(
                x.loc[
                    (
                        (x[status] == transform.active_status_code)
                        &
                        (x[substatus] == transform.active_substatus_code)
                    ),
                    'pj_id'
                ].tolist()
            ))
        )]
        .assign(**{
            min_active_date : lambda x: (
                x.loc[
                    (x[status] == transform.active_status_code)
                    &
                    (x[substatus] == transform.active_substatus_code)
                ]
                .groupby(['pj_id'])[status_date]
                .transform('min')
            )
        })
        .assign(**{
            min_active_date : lambda x: (
                x.groupby(['pj_id'])[min_active_date]
                .transform(lambda x: x.ffill().bfill())
            )
        })
        .loc[lambda x: (
            (x[status_date] < x[min_active_date])
            |
            (
                (x[status_date] == x[min_active_date])
                &
                (x[hierarchy] == transform.fulfillment_hierarchy)
            )
        )]
        .assign(**{
            next_status_date : lambda x: (
                x.groupby(['pj_id'])[status_date]
                .transform(lambda x: x.shift(-1))
                .fillna(x[status_date])
            ),
            time_in_status : lambda x: (
                np.busday_count(
                    x[status_date].apply(lambda x: x.strftime('%Y-%m-%d')),
                    x[next_status_date].apply(lambda x: x.strftime('%Y-%m-%d')),
                    holidays = pd.to_datetime(cal().holidays()).strftime("%Y-%m-%d").tolist()
                )
            )
        })
#        .assign(**{
#            status_date_long : lambda x: (
#                (x[status_date] - pd.to_datetime('1970-01-01'))
#                .dt
#                .total_seconds()
#                .astype(int)),
#            min_active_date : lambda x: (
#                (x[min_active_date] - pd.to_datetime('1970-01-01'))
#                .dt
#                .total_seconds()
#                .astype(int)),
#            next_status_date_long : lambda x: (
#                (x[next_status_date] - pd.to_datetime('1970-01-01'))
#                .dt
#                .total_seconds()
#                .astype(int))
#        })
        .loc[:,output_columns]
        .reset_index(drop=True)
    )
    return ttff_df
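
Step 6 above relies on `np.busday_count`, which counts weekdays in the half-open range from the start date (inclusive) to the end date (exclusive) and skips any dates passed as `holidays`. The sketch below isolates that call on two made-up date pairs; the dates and the one-year holiday window are assumptions chosen for illustration.

```python
import numpy as np
import pandas as pd
from pandas.tseries.holiday import USFederalHolidayCalendar

# US federal holidays for 2021 as day-precision datetime64 values
holidays = (
    USFederalHolidayCalendar()
    .holidays(start="2021-01-01", end="2021-12-31")
    .values.astype("datetime64[D]")
)

# Hypothetical status dates and next status dates
start = np.array(["2021-07-01", "2021-07-06"], dtype="datetime64[D]")
end = np.array(["2021-07-07", "2021-07-06"], dtype="datetime64[D]")

# Independence Day 2021 was observed on Monday 2021-07-05
with_holidays = np.busday_count(start, end, holidays=holidays)
without_holidays = np.busday_count(start, end)

print(with_holidays, without_holidays)
```

The second pair has identical start and end dates and therefore counts zero days, which is why the final row of each journey (the active shipment, whose next status date is filled with its own status date) always has a time in status of 0.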

In [None]:
final_dataframe = (
    df
    .pipe(map_columns)
    .pipe(pj)
    .pipe(ttff)
)

final_dataframe.head()

### TEST TRANSFORM OUTPUT

In [None]:
# TEST 1: Check that time in status = 0 for all active shipments

test1 = len(
    final_dataframe
    .loc[lambda x:
        (
            (x[status].isin([transform.active_status_code]))
            &
            (x[substatus].isin([transform.active_substatus_code]))
            &
            (x[time_in_status] > 0)
        )]
) == 0

test1

In [None]:
# TEST 2: Check that next status date = status date for active shipments

test2 = len(
    final_dataframe
    .loc[lambda x:
        (
            (x[status].isin([transform.active_status_code]))
            &
            (x[substatus].isin([transform.active_substatus_code]))
            &
            (x[next_status_date] != x[status_date])
        )]
) == 0

test2

In [None]:
# TEST 3: Check that results don't contain any "discontinued" hierarchies

test3 = len(
    final_dataframe
    .loc[lambda x: (
        x[hierarchy].isin(transform.discontinued_hierarchy)
    )]
) == 0

test3

In [None]:
# FINAL TEST: Did the first 3 tests pass?

test1 & test2 & test3

### Publish

In [None]:
## that's it - just provide the final dataframe to the var final_dataframe and we take it from there
transform.publish_contract.publish(final_dataframe, run_id, session)
session.close()