# CORE Cartridge Notebook::Accredo Cancel Before Active
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and produce the same result every time. 
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **free of 'magic numbers':** if it is not instantly obvious, without context, what a variable or function is or does, consider renaming it.
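
As a minimal sketch of the "repeatable" guideline, the hypothetical `normalize_status` function below returns a new dataframe instead of mutating its input, so running it twice gives the same result as running it once. The function name and sample data are illustrative assumptions, not part of this cartridge.

```python
import pandas as pd

def normalize_status(df: pd.DataFrame) -> pd.DataFrame:
    """Upper-case the status column; safe to run repeatedly (hypothetical helper)."""
    return df.assign(customer_status=df["customer_status"].str.upper())

raw = pd.DataFrame({"customer_status": ["active", "Pending", "CANCELLED"]})

once = normalize_status(raw)
twice = normalize_status(once)  # a second run changes nothing
```

Because `assign` returns a copy, `raw` keeps its original values and `once.equals(twice)` holds.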

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below describing what this transform does; this will appear in the configuration application where Ops will create, configure and update pipelines. 
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline). 
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do. 

![caution](assets/cautionTape.png)

### Description

![what my transform does](assets/ds301_accredo_enrichment.png)

### Configuration

In [None]:
from core.helpers.session_helper import SessionHelper
session = SessionHelper().session

import pandas as pd
import numpy as np
from datetime import datetime

In [None]:
"""
************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract

db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch: str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)


In [None]:
""" 
********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>

e.g.

class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""

class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    '''
    pjh: str # Patient Journey Hierarchy column
        
    # Possible substatus values
    pending_new: str # New substatus when status is 'PENDING' (NEW)
    active_shipped: str # Shipment substatus when status is 'ACTIVE'
    
    # Possible PJH values
    no_clarity: str # Final result of enrichment
        
    input_transform: str # Name of transform to input data from

transform = Transform()
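
A side note on how the annotated properties behave in plain Python: a subclass of a dataclass that is not itself decorated with `@dataclass` only records its annotations, so the values must be supplied afterwards (in production, Engineering builds the populated object). The sketch below uses hypothetical `BaseConfig` and `ChildConfig` classes to show this; it makes no claim about how the configuration application injects values.

```python
from dataclasses import dataclass, fields

@dataclass
class BaseConfig:
    # Stand-in for DbTransform: every field has a default
    id: int = 1
    name: str = "demo"

class ChildConfig(BaseConfig):
    # Annotation-only properties: documented here, assigned later
    some_ratio: float
    site_name: str

cfg = ChildConfig()

# Only the decorated base class contributes dataclass fields
field_names = [f.name for f in fields(cfg)]

# The annotated properties do not exist until something assigns them
had_ratio_before = hasattr(cfg, "some_ratio")
cfg.some_ratio = 0.25
```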

In [None]:
# Hardcoded transform properties

# Columns
transform.patient = 'longitudinal_patient_id'
transform.pharm = 'pharmacy_name'
transform.ref_date = 'referral_date'
transform.status_date = 'status_date'
transform.status = 'customer_status'
transform.substatus = 'customer_status_description'
transform.ic_status = 'customer_status'
transform.ic_substatus = 'customer_status_description'
transform.brand_col = 'brand'

# Status values
transform.pending = 'PENDING'
transform.active = 'ACTIVE'
transform.cancelled = 'CANCELLED'
transform.discontinued = 'DISCONTINUED'

# Transform values
transform.no_clarity = 'NO STATUS CLARITY'
transform.accredo = 'ACCREDO'

### Transformation

In [None]:
### Retrieve current dataset from contract
from core.dataset_diff import DatasetDiff

diff = DatasetDiff(db_transform.id)
df = diff.get_diff(transform_name=transform.input_transform, values=[run_id])

In [None]:


pd.options.display.max_columns=999

# Work on a copy of the dataset retrieved above
df = df.copy()

df = (
    df
    .loc[:,
        [transform.ref_date,
         transform.brand_col,
         transform.patient,
         transform.pharm,
         transform.status_date,
         transform.status,
         transform.substatus]
        ]
    .fillna('NONE')
    .assign(**{
        transform.status: lambda x:(
            x[transform.status].str.upper()
        ),
        transform.substatus: lambda x:(
            x[transform.substatus].str.upper()
        )
    })
    .assign(**{
        transform.status_date: lambda x:(
            pd.to_datetime(
                x[transform.status_date].str[:8].astype(str),
                errors='coerce'
        )),
        transform.ref_date: lambda x:(
            pd.to_datetime(
                x[transform.ref_date].str[:8].astype(str),
                errors='coerce'
        )),
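        'min_status_date': lambda x:(
            x.groupby([transform.patient, transform.pharm, transform.brand_col])[transform.status_date].transform('min')
        )
    })
    # Fall back to the earliest status date when the referral date is missing
    .assign(**{
        transform.ref_date: lambda x:(
            x[transform.ref_date].fillna(x['min_status_date'])
        )
    })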
    .drop(columns=['min_status_date'])
    .drop_duplicates()
    .sort_values(
        [transform.patient, transform.pharm, transform.brand_col, transform.status_date]
    )
)

df.loc[:, transform.pjh] = np.nan

final_dataframe = df.copy()
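
The date handling in the cleaning step can be sketched on a toy dataset: parse the first eight characters of each date string, coerce anything unparseable to `NaT`, then fall back to the earliest status date per patient when the referral date is missing. The column names, sample values and the explicit `format="%Y%m%d"` are assumptions made for illustration.

```python
import pandas as pd

records = pd.DataFrame({
    "patient": ["p1", "p1", "p2"],
    "status_date": ["20200105120000", "20200101", "not a date"],
    "referral_date": ["20200102", None, "20200301"],
})

# Keep the YYYYMMDD prefix and turn unparseable values into NaT
cleaned = records.assign(
    status_date=lambda x: pd.to_datetime(
        x["status_date"].str[:8], format="%Y%m%d", errors="coerce"
    ),
    referral_date=lambda x: pd.to_datetime(
        x["referral_date"].str[:8], format="%Y%m%d", errors="coerce"
    ),
)

# Earliest status date per patient, aligned row by row with the frame
earliest_status = cleaned.groupby("patient")["status_date"].transform("min")

# Fill missing referral dates from that Series (not from a column-name string)
cleaned = cleaned.assign(
    referral_date=cleaned["referral_date"].fillna(earliest_status)
)
```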

In [None]:
# Preserve original data
df = final_dataframe.copy()

In [None]:
# Sort and reset index to join on later
df = (
    df
    .sort_values([
        transform.patient,
        transform.pharm,
        transform.brand_col,
        transform.status_date,
        transform.ref_date,
        transform.status
    ])
    .reset_index(drop=False)
)

In [None]:
# Create secondary df to find the first active shipment date for
# each patient journey
min_shipped_df = (
    df
    .loc[
        (df[transform.ic_status] == transform.active) &
        (df[transform.ic_substatus] == transform.active_shipped)
    ]
    .groupby([transform.patient, transform.pharm, transform.brand_col])
    [transform.status_date]
    .min()
    .reset_index()
    .rename(columns={transform.status_date: 'first_shipped_date'})
)

In [None]:
# Join to main df to get first active shipment date for every
# patient that has one
df = (
    df
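    .merge(
        min_shipped_df,
        # Join explicitly on the patient journey keys
        on=[transform.patient, transform.pharm, transform.brand_col],
        how='left'
    )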
)
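
A small, self-contained sketch of the two steps above: find the first shipment date per patient journey, then left-join it back so every row of that journey carries the date. The column names and the `'SHIPPED'` substatus are illustrative assumptions; the real substatus value comes from the configuration.

```python
import pandas as pd

KEYS = ["patient", "pharmacy", "brand"]

events = pd.DataFrame({
    "patient": ["p1", "p1", "p1", "p2"],
    "pharmacy": ["ACCREDO"] * 4,
    "brand": ["b"] * 4,
    "status": ["PENDING", "ACTIVE", "ACTIVE", "CANCELLED"],
    "substatus": ["NEW", "SHIPPED", "SHIPPED", "NONE"],
    "status_date": pd.to_datetime(
        ["2020-01-01", "2020-01-10", "2020-02-10", "2020-01-05"]
    ),
})

# Earliest ACTIVE/SHIPPED date for each journey that has one
first_shipped = (
    events
    .loc[(events["status"] == "ACTIVE") & (events["substatus"] == "SHIPPED")]
    .groupby(KEYS)["status_date"]
    .min()
    .reset_index()
    .rename(columns={"status_date": "first_shipped_date"})
)

# A left join keeps journeys without a shipment; their date is NaT
merged = events.merge(first_shipped, on=KEYS, how="left")
```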

In [None]:
# Create a column that can be grouped on later to determine the
# first date where a desired case occurs
df = (
    df
    .assign(status_spree=(
    ~(
        # Check to make sure patient journey for row above is the same
        (df[transform.patient].eq(df[transform.patient].shift(1))) &
        (df[transform.pharm].eq(df[transform.pharm].shift(1))) &
        (df[transform.brand_col].eq(df[transform.brand_col].shift(1))) &
        # Check to make sure statuses in row above are the same
        (df[transform.ic_status].eq(df[transform.ic_status].shift(1))) &
        (df[transform.ic_substatus].eq(df[transform.ic_substatus].shift(1)))
    )
    ).cumsum())
)
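
The "spree" column relies on a common run-labelling idiom: flag every row that differs from the row above it, then take a cumulative sum so each consecutive run gets its own id. A minimal sketch with made-up column names and data, assuming the frame is already sorted by journey and date:

```python
import pandas as pd

journey = pd.DataFrame({
    "patient": ["p1", "p1", "p1", "p1", "p2", "p2"],
    "status": ["PENDING", "PENDING", "ACTIVE", "PENDING", "PENDING", "PENDING"],
})

cols = ["patient", "status"]

# True where every compared column matches the row directly above
same_as_previous = journey[cols].eq(journey[cols].shift(1)).all(axis=1)

# Each change starts a new run; cumsum turns the change flags into run ids
journey["status_spree"] = (~same_as_previous).cumsum()
```

Here the two leading `PENDING` rows share one id, the return to `PENDING` after `ACTIVE` starts a new id, and the switch to patient `p2` starts another, giving `[1, 1, 2, 3, 4, 4]`.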

In [None]:
# Create column denoting the min status date of each spree
df = (
    df
    .assign(min_status_date=lambda x:(
        x.groupby('status_spree')[transform.status_date].transform('min')
    ))
)

In [None]:
# Create bool column to denote row being right above active shipment row
df = (
    df
    .assign(above_shipment_step=lambda x:(
        # Check to make sure patient journey in row below is the same
        (x[transform.patient].eq(x[transform.patient].shift(-1))) &
        (x[transform.pharm].eq(x[transform.pharm].shift(-1))) &
        (x[transform.brand_col].eq(x[transform.brand_col].shift(-1))) &
        # Check to make sure status in row below is active/shipped
        (x[transform.ic_status].shift(-1) == transform.active) &
        (x[transform.ic_substatus].shift(-1) == transform.active_shipped)
    ))
)

# Pick out sprees that are above a shipment
above_shipment_series = df.groupby('status_spree').above_shipment_step.any()

# Assign True to those sprees
df = (
    df
    .assign(above_shipment=lambda x:(
        x.status_spree.isin(above_shipment_series[above_shipment_series].index)
    ))
)
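
The flag-and-propagate logic above can be sketched as follows. This version uses `groupby(...).transform("any")` as a compact alternative to the `isin` lookup in the cell, which should give the same per-spree result; the data and column names are illustrative assumptions.

```python
import pandas as pd

steps = pd.DataFrame({
    "patient": ["p1", "p1", "p1", "p1"],
    "status": ["CANCELLED", "CANCELLED", "ACTIVE", "PENDING"],
    "status_spree": [1, 1, 2, 3],
})

# Look one row ahead: same patient, and the next status is ACTIVE
next_is_same_patient = steps["patient"].eq(steps["patient"].shift(-1))
next_is_active = steps["status"].shift(-1).eq("ACTIVE")
steps["above_shipment_step"] = next_is_same_patient & next_is_active

# If any row of a spree sits directly above a shipment, flag the whole spree
steps["above_shipment"] = (
    steps.groupby("status_spree")["above_shipment_step"].transform("any")
)
```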


In [None]:
# Denote how many days between min_status_date and first_shipped_date
df = (
    df
    .assign(day_diff=(df['min_status_date'] - df['first_shipped_date']))
)

In [None]:
df.loc[
    (df[transform.pharm] == transform.accredo) &
    # Negative for day diff, because we only want status sprees that
    # occur BEFORE the first active shipment
    (df.day_diff < np.timedelta64(-2, 'D')) &
    df.above_shipment &
    (
        (df[transform.ic_status].isin([transform.cancelled, transform.discontinued])) |
        (
            (df[transform.ic_status] == transform.pending) &
            (df[transform.ic_substatus] == transform.pending_new)
        )
    ),
    # Change pjh to 'NO STATUS CLARITY' for the records that pass
    transform.pjh
] = transform.no_clarity
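
To make the sign convention concrete, here is a hedged sketch of the day-difference filter on toy data: subtracting the first shipment date from the spree start gives a negative timedelta when the spree began before the shipment, so `< -2 days` keeps sprees that started more than two days earlier. Column names and values are assumptions, and only two of the cell's conditions are reproduced.

```python
import pandas as pd

sprees = pd.DataFrame({
    "pharmacy": ["ACCREDO", "ACCREDO", "OTHER"],
    "min_status_date": pd.to_datetime(["2020-01-01", "2020-01-09", "2020-01-01"]),
    "first_shipped_date": pd.to_datetime(["2020-01-10"] * 3),
    # Object dtype so string labels can be written without upcasting
    "pjh": pd.Series([None] * 3, dtype="object"),
})

# Negative values mean the spree started before the first shipment
sprees["day_diff"] = sprees["min_status_date"] - sprees["first_shipped_date"]

mask = (
    (sprees["pharmacy"] == "ACCREDO")
    & (sprees["day_diff"] < pd.Timedelta(days=-2))
)

# Label only the rows that pass every condition
sprees.loc[mask, "pjh"] = "NO STATUS CLARITY"
```

Only the first row is labelled: the second started one day before the shipment, and the third belongs to a different pharmacy.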

In [None]:
# Format the final dataframe ready for publication
final_dataframe = (
    df
    # Preserve original index
    .set_index('index')
    .sort_index()
    # Drop calculation columns
    .drop(columns=[
        'first_shipped_date',
        'status_spree',
        'min_status_date',
        'above_shipment_step',
        'above_shipment',
        'day_diff'
    ])
)
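
The index round trip used in this notebook (`reset_index(drop=False)` early on, `set_index('index')` at the end) can be sketched on its own. The frame and the `scratch` column below are made up for illustration.

```python
import pandas as pd

original = pd.DataFrame(
    {"patient": ["p2", "p1", "p3"], "value": [10, 20, 30]},
    index=[7, 3, 5],
)

# Sorting reorders rows; reset_index keeps the old labels in an 'index' column
working = original.sort_values("patient").reset_index(drop=False)
working["scratch"] = working["value"] * 2  # temporary calculation column

# Restore the original labels and order, then drop the helper column
restored = (
    working
    .set_index("index")
    .sort_index()
    .drop(columns=["scratch"])
)
```

One side effect worth knowing: the restored index is named `index`, whereas the original index had no name.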

### Publish

In [None]:
## That's it: assign the final dataframe to the variable final_dataframe and we take it from there
transform.publish_contract.publish(final_dataframe, run_id, session)
session.close()