# CORE Cartridge Notebook: Active Shipment Enrichment
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and produce the same result every time.
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **free of 'magic numbers':** if it is not instantly obvious, without context, what a variable or function is or does, consider renaming it.
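
As a rough illustration of the "repeatable" and "magic numbers" points, here is a minimal sketch. The names `PHARMACY_NAME_MAP` and `standardize_pharmacy` are hypothetical and not part of CORE. The function returns a new dataframe instead of mutating its input, so running it a second time on its own output changes nothing.

```python
import pandas as pd

# Hypothetical mapping, named so a reader can tell what it is without context.
PHARMACY_NAME_MAP = {"CVS": "CVS Specialty"}


def standardize_pharmacy(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy of df with pharmacy names standardized; the input is left untouched."""
    return df.assign(pharmacy_name=df["pharmacy_name"].replace(PHARMACY_NAME_MAP))


raw = pd.DataFrame({"pharmacy_name": ["CVS", "Accredo"]})
once = standardize_pharmacy(raw)
twice = standardize_pharmacy(once)  # a second run gives the same result as the first
```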

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below with what this transform does; the text will appear in the configuration application, where Ops create, configure, and update pipelines.
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline).
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do.
![caution](assets/cautionTape.png)

### Description
What does this transformation do? Be specific.

![what does your transform do](assets/what.gif)

(clear out and replace with your description)

### Configuration

In [None]:
from core.helpers.session_helper import SessionHelper
session = SessionHelper().session

In [None]:
"""
************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET
from core.helpers.session_helper import SessionHelper
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract

db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch:str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)


In [None]:
""" 
********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>

e.g.

class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""

class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    '''
    input_transform: str = db_transform.variables.input_transform # The name of the transform to input source data from
    dispense_input_transform: str = db_transform.variables.dispense_input_transform # The name of the transform to input dispense data from
    brand_name: str = DbTransform.brand.upper() # Name of brand to be inserted into brand column of Dispense data
    active_shipment_status: str = db_transform.variables.active_shipment_status # Status indicating active shipment (customer-specific)
    active_shipment_substatus: str = db_transform.variables.active_shipment_substatus # Substatus indicating active shipment (customer-specific)
    active_hierarchy: str = db_transform.variables.active_hierarchy # Hierarchy indicating active shipment (customer-specific)
    pharmacy_name_map: str = db_transform.variables.pharmacy_name_map # Dictionary (stored as string) with pharmacy names found in Dispense data to be mapped to standardized names, e.g. "{'CVS':'CVS Specialty'}" (customer-specific)

transform = Transform()
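
Because `pharmacy_name_map` arrives as a dictionary stored in a string, it has to be parsed before use. The sketch below shows one way to do that with the standard library's `ast.literal_eval`, which accepts only Python literals. The helper name `parse_name_map` and the sample value are assumptions for illustration, not part of the configuration application.

```python
import ast

# Illustrative value in the string form described in the configuration comments.
pharmacy_name_map_raw = "{'CVS': 'CVS Specialty'}"


def parse_name_map(raw: str) -> dict:
    """Parse a dictionary stored as a string, accepting only Python literals."""
    parsed = ast.literal_eval(raw)
    if not isinstance(parsed, dict):
        raise ValueError("pharmacy_name_map must be a dictionary literal")
    return parsed


name_map = parse_name_map(pharmacy_name_map_raw)
```

Failing early on a malformed value keeps a configuration mistake from surfacing later as a confusing pandas error.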

### Transformation

In [None]:
### Retrieve current dataset and dispense data from contract
from core.dataset_diff import DatasetDiff

diff = DatasetDiff(db_transform.id)
df = diff.get_diff(transform_name=transform.input_transform, values=[run_id])
dispense_input_df = diff.get_diff(transform_name=transform.dispense_input_transform, values=[run_id])

In [None]:
df.shape

In [None]:
dispense_input_df.shape

In [None]:
### Use the variables above to execute your transformation. The final output needs to be a variable named final_dataframe

In [None]:
import numpy as np
import pandas as pd

In [None]:
pd.set_option('display.max_columns', 150)
pd.set_option('display.max_rows', 500)

In [None]:
# Assign column headers to variables

brand_col = 'brand'
patient_id = 'longitudinal_patient_id'
pharmacy = 'pharmacy_name'
status_date = 'status_date'
referral_date = 'referral_date'
status =  'status'
substatus =  'substatus'
hierarchy = 'patient_journey_hierarchy'
dispense_status_date = 'ship_date'


if DbTransform.pharmaceutical_company.upper() == 'SUN':
    unique_id = 'pharmacy_transaction_id'
    trans_id = 'pharmacy_transaction_id'

else:
    trans_id = 'aggregator_transaction_id'
    unique_id = 'aggregator_transaction_id'
    

### DATA CLEANING: ADDRESS THIS SECTION BEFORE PIPELINE INTEGRATION

In [None]:
# CLEAN DATA - This step should not be necessary once transform is integrated into pipeline.
#    Extract and map relevant columns
#    Rename dispense ship date to status_date (no dtype conversion happens here; the date keeps its incoming format)
#    Drop rows with missing values and cast the patient id to int
#    Assign brand name, active hierarchy, active status, and active substatus using transform variables
#    Filter data to only include patient journeys found in patient status data
#    Keep only the first dispense reported for each patient journey
    
def clean_dispense_data(dispense_input_df, transform):
    import ast  # local import keeps this cell self-contained

    # Parse the stringified dictionary once; literal_eval accepts only Python literals, unlike eval
    pharmacy_name_map = ast.literal_eval(transform.pharmacy_name_map)

    clean_dispense_df = (
        dispense_input_df
        .loc[:,
             [patient_id,
              pharmacy,
              dispense_status_date]
            ]
        .rename(columns = 
                {dispense_status_date : status_date}
        )
        .dropna()
        .assign(**{
            brand_col : transform.brand_name,
            hierarchy : transform.active_hierarchy,
            status : transform.active_shipment_status,
            substatus : transform.active_shipment_substatus,
            patient_id : lambda x: (x[patient_id].astype(int)),
            # Map known pharmacy names to their standardized names; unmapped names pass through unchanged.
            # (replace(..., inplace=True) returns None, so it must not be used inside an expression.)
            pharmacy : lambda x: x[pharmacy].replace(pharmacy_name_map),
            'pj_concat' : lambda x: (x[patient_id].astype(str) + x[pharmacy].astype(str) + x[brand_col].astype(str))
        })
        .loc[lambda x: (
            x['pj_concat'].isin(
                (df[patient_id].astype(str) + df[pharmacy].astype(str) + df[brand_col].astype(str))
                .drop_duplicates()
                .tolist()
            )
        )]
        .sort_values(
            by=[patient_id, pharmacy, brand_col, status_date],
            ascending=[True, True, True, True])
        .drop_duplicates(subset=[patient_id, pharmacy, brand_col], keep='first')
        .drop(['pj_concat'], axis=1)
        .reset_index(drop=True)
    )

    return clean_dispense_df

In [None]:
dispense_df = clean_dispense_data(
    dispense_input_df,
    transform
)

dispense_df.head()
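
The "keep only the first dispense for each patient journey" step can be sketched on toy data as follows. The rows and the `journey_keys` name are made up for illustration; the column names match the header variables defined earlier in this notebook.

```python
import pandas as pd

# Toy dispense records: patient 1 has two dispenses at the same pharmacy and brand.
dispenses = pd.DataFrame({
    "longitudinal_patient_id": [1, 1, 2],
    "pharmacy_name": ["CVS Specialty", "CVS Specialty", "Accredo"],
    "brand": ["BRAND_X"] * 3,
    "status_date": pd.to_datetime(["2020-02-01", "2020-01-15", "2020-03-01"]),
})

# A patient journey is identified by patient, pharmacy, and brand.
journey_keys = ["longitudinal_patient_id", "pharmacy_name", "brand"]

first_dispense = (
    dispenses
    .sort_values(by=journey_keys + ["status_date"])      # earliest date first within each journey
    .drop_duplicates(subset=journey_keys, keep="first")  # keep only that earliest row
    .reset_index(drop=True)
)
```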

In [None]:
# Inherit "missing" information into dispense data using patient status data
#    Inherit from the status immediately prior, if available
#    Otherwise, use the status immediately following.

inherit_df = (
    df
    .assign(
        pj_step =  lambda x: x.index,
        dispense_flag = False
    )
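    # DataFrame.append was removed in pandas 2.0; pd.concat is the supported equivalent
    .pipe(lambda x: pd.concat(
        [x, dispense_df.assign(dispense_flag =  True)],
        sort = False
    ))
    .assign(
        # ngroup() is the public API for numbering groups (replaces grouper.group_info[0])
        pj_id = lambda x: (
            x.groupby([patient_id, pharmacy, brand_col]).ngroup().to_numpy()
        )
    )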
    .sort_values(
        by=[patient_id, pharmacy, brand_col, status_date, status, substatus, 'dispense_flag'],
        ascending=[True, True, True, True, False, True, True])
    .reset_index(drop=True)
)

for column in inherit_df.columns.tolist():
    inherit_df = (
        inherit_df
        .assign(**{
            column : lambda x: (
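                x.groupby(['pj_id'], sort=False)[column]
                .ffill(limit = 1)
                # Group again before back-filling: the grouped ffill returns a plain Series,
                # so an ungrouped bfill could pull values from the next patient journey
                .groupby(x['pj_id'], sort=False)
                .bfill(limit = 1)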
            )
        })
    )
    
inherit_df.head()
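
The inheritance rule (take the value from the row immediately before, otherwise from the row immediately after, always within one patient journey) can be checked on a toy frame. This is a sketch only, and the `payer` column is hypothetical. A grouped `.ffill()` returns an ordinary Series, so the sketch groups a second time before back-filling to keep values from leaking across journeys.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "pj_id": [0, 0, 0, 1],
    "payer": ["Aetna", np.nan, np.nan, "Cigna"],  # hypothetical column to inherit
})

filled = (
    toy.groupby("pj_id", sort=False)["payer"]
    .ffill(limit=1)                       # inherit from the row immediately prior, per journey
    .groupby(toy["pj_id"], sort=False)
    .bfill(limit=1)                       # otherwise from the row immediately following, per journey
)
```

In this toy frame the third row stays empty: its only filled neighbor after the forward fill belongs to a different journey, and an ungrouped back-fill would have copied `Cigna` into it.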

In [None]:
# Cleanse data to prepare for final output
#    Wrap the id of each inserted dispense record as "ic_<id>_as" (prefix "ic_", suffix "_as")
#    If referral date > status date, override referral date with status date.
#    If dispense data is redundant (i.e. active shipment is already reported for that date in the status data), drop that dispense record from the final dataframe.

output_df = (
    inherit_df
    .assign(**{
        unique_id : lambda x: (
            np.where(
                x['dispense_flag'],
                'ic_' + x[unique_id].astype(str) + '_as',
                x[unique_id]
            )
        ),
        referral_date : lambda x: (
            np.where(
                x[referral_date] > x[status_date],
                x[status_date],
                x[referral_date]
            )
        )
    })
    .sort_values(
        by=['pj_step', 'dispense_flag'],
        ascending=[True, True])
    .drop_duplicates(
        subset = [
            patient_id,
            pharmacy,
            brand_col,
            status_date,
            status,
            substatus,
            'pj_step'
            
        ]
    )
)

output_df.head()
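
The two overrides in this step can be sketched on toy data. The ids and dates below are made up. `Series.mask` is used here in place of the second `np.where`; it replaces values only where the condition is true, which should behave the same way for this comparison.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    "aggregator_transaction_id": ["1001", "1001"],
    "dispense_flag": [False, True],
    "referral_date": pd.to_datetime(["2020-01-10", "2020-01-10"]),
    "status_date": pd.to_datetime(["2020-01-12", "2020-01-05"]),
})

tagged = toy.assign(
    # Tag only the inserted dispense rows; leave the original ids alone.
    aggregator_transaction_id=lambda x: np.where(
        x["dispense_flag"],
        "ic_" + x["aggregator_transaction_id"].astype(str) + "_as",
        x["aggregator_transaction_id"],
    ),
    # A referral cannot come after its status: cap referral_date at status_date.
    referral_date=lambda x: x["referral_date"].mask(
        x["referral_date"] > x["status_date"], x["status_date"]
    ),
)
```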

In [None]:
final_dataframe = (
    output_df.loc[:,df.columns]
)

In [None]:
final_dataframe.head()

### TEST TRANSFORM OUTPUT

In [None]:
# Test 1: Check that the output dataframe (minus the inserted dispense records) has the same number of records as the input dataframe.

test1 = (len(df) == len(output_df.loc[output_df['dispense_flag']==False]))

test1

In [None]:
# Test 2: Check that inserted records' IDs all carry the "ic_" prefix and the "_as" suffix.

test2 = (
    (output_df.loc[output_df['dispense_flag']][unique_id].str[:3] == 'ic_')
    &
    (output_df.loc[output_df['dispense_flag']][unique_id].str[-3:] == '_as')
).all()

test2

In [None]:
# Test 3: Check that inserted records are all listed as active shipments with the correct hierarchy.

test3 = (
    (output_df.loc[output_df['dispense_flag'], [status]] == transform.active_shipment_status).all().all()
    &
    (output_df.loc[output_df['dispense_flag'], [substatus]] == transform.active_shipment_substatus).all().all()
    &
    (output_df.loc[output_df['dispense_flag'], [hierarchy]] == transform.active_hierarchy).all().all()
)

test3

In [None]:
# FINAL TEST: Did all 3 tests pass?

test1 & test2 & test3
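
The checks above display booleans, which are easy to overlook when a notebook runs unattended. One option, sketched here with a hypothetical helper name, is to wrap a check in a function and `assert` on it so that a failure stops the run before the publish step.

```python
import pandas as pd


def inserted_ids_are_tagged(ids: pd.Series) -> bool:
    """True when every id starts with 'ic_' and ends with '_as'."""
    ids = ids.astype(str)
    return bool((ids.str.startswith("ic_") & ids.str.endswith("_as")).all())


good_ids = pd.Series(["ic_1001_as", "ic_1002_as"])
bad_ids = pd.Series(["ic_1001_as", "1002"])

assert inserted_ids_are_tagged(good_ids)
assert not inserted_ids_are_tagged(bad_ids)
```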

### Publish

In [None]:
## that's it - just provide the final dataframe to the var final_dataframe and we take it from there
transform.publish_contract.publish(final_dataframe, run_id, session)
session.close()