***
# CORE Cartridge Notebook::[master_patient_substatus]
![CORE Logo](assets/coreLogo.png) 

---
## Keep in Mind
Good Transforms Are...
- **singular in purpose:** good transforms do one and only one thing, and handle all known cases for that thing. 
- **repeatable:** transforms should be written so that they can be run against the same dataset any number of times and produce the same result every time. 
- **easy to read:** 99 times out of 100, readable, clear code that runs a little slower is more valuable than a mess that runs quickly. 
- **No 'magic numbers':** if a variable or function is not instantly obvious as to what it is or does, without context, maybe consider renaming it.
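
As a rough illustration of the "repeatable" property, the sketch below cleans a status column and checks that running the cleaner on its own output changes nothing. The function name `normalize_status` and the sample values are hypothetical and not part of this cartridge.

```python
import pandas as pd


def normalize_status(series: pd.Series) -> pd.Series:
    """Upper-case, trim, and collapse whitespace; non-strings pass through unchanged."""
    def clean(value):
        if not isinstance(value, str):
            return value
        # split() with no arguments drops leading/trailing and repeated whitespace
        return " ".join(value.upper().split())
    return series.map(clean)


# Hypothetical raw values, including a missing one.
raw = pd.Series(["  pa ", "Patient  Response", None])
once = normalize_status(raw)
twice = normalize_status(once)

# A repeatable transform gives the same result when applied to its own output.
assert once.equals(twice)
```

A check like this is cheap to keep next to a transform while developing it.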

## Workflow - how to use this notebook to make science
#### Data Science
1. **Document your transform.** Fill out the _description_ cell below, describing what this transform does; this will appear in the configuration application where Ops will create, configure and update pipelines. 
2. **Define your config object.** Fill out the _configuration_ cell below the commented-out guide to define the variables you want Ops to set in the configuration application (these will populate here for every pipeline). 
3. **Build your transformation logic.** Use the transformation cell to do that magic that you do. 
![caution](assets/cautionTape.png)

### Description
What does this transformation do? Be specific.

![what does your transform do](assets/what.gif)

## Planned
1. Collect all unique raw patient substatus instances.
2. Auto-map as many raw patient substatus instances as possible to a defined cleansed data model per **Customer**.
3. Define a process for identifying and manually mapping instances where auto-map fails.
4. Do not publish unmapped instances. Drop them, and keep the ability to triage and map them to IC-gold in a later event.
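
The planned steps above can be sketched roughly as follows. This is a minimal illustration under stated assumptions: `GOLD_VALUES`, `KNOWN_CORRECTIONS`, `split_mapped`, and the sample rows are hypothetical stand-ins, not the mappings this notebook actually uses.

```python
import pandas as pd

# Hypothetical gold domain and known-misspelling map for one customer.
GOLD_VALUES = {"PA", "PATIENT RESPONSE", "TRANSFER SP"}
KNOWN_CORRECTIONS = {
    "PATENT RESPONSE": "PATIENT RESPONSE",
    "TRANSER SP": "TRANSFER SP",
}


def split_mapped(df: pd.DataFrame, column: str):
    """Return (mapped, unmapped) frames after cleaning and applying known corrections."""
    cleaned = df.copy()
    cleaned[column] = (
        cleaned[column].str.strip().str.upper().replace(KNOWN_CORRECTIONS)
    )
    is_gold = cleaned[column].isin(GOLD_VALUES)
    return cleaned[is_gold], cleaned[~is_gold]


raw = pd.DataFrame({"sub_status": ["pa", "Transer SP", "patent response", "mystery"]})

# Step 1: collect the unique raw values.
unique_raw = sorted(raw["sub_status"].unique())

# Steps 2-4: auto-map what we can, keep the rest aside for manual triage.
mapped, unmapped = split_mapped(raw, "sub_status")
```

Only `mapped` would be published; `unmapped` is what a later triage step would review.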


<a id="CELL1"></a>
## CELL 1 
<font color=orange>
last time touched for 'dev' Wednesday, August 28, 2019 1:57:07 PM GMT-04:00 DST</font>

In [None]:
"""CELL 1
builds and returns a database session
local assumes a psql instance in a local docker container
only postgres database is supported for configuration_application at this time
"""
"""
gets env-based configuration secret
returns a session to the configuration db
for dev env it pre-populates the database with helper and seed data
"""
from core.helpers.session_helper import SessionHelper

session = SessionHelper().session

## CONFIGURATION - PLEASE TOUCH
### <font color=pink>This cell will be off in production as configurations will come from the configuration postgres DB</font>
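
The configuration cell that follows describes the S3 key layout as `{ENV}/{BRANCH}/{PARENT}/{CHILD}/{STATE}/{name}`, and its commented-out parquet path shows a `__metadata_run_id=<n>` partition folder under that key. The sketch below only illustrates how such a prefix is assembled; `ContractKey` is a hypothetical stand-in and is not the real `DatasetContract` API.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ContractKey:
    """Hypothetical illustration of the key layout; not core.dataset_contract."""
    env: str
    branch: str
    parent: str   # generally the pharma company
    child: str    # generally the brand
    state: str
    dataset: str

    def prefix(self) -> str:
        # Join the six components in the documented order.
        parts = [self.env, self.branch, self.parent, self.child, self.state, self.dataset]
        return "/".join(parts)

    def run_prefix(self, run_id: int) -> str:
        # Each published run lands under its own run-id partition folder.
        return f"{self.prefix()}/__metadata_run_id={run_id}"


key = ContractKey(
    env="ichain-dev",
    branch="dc-627_alkermes_ingest_column_mapping",
    parent="alkermes",
    child="vivitrol",
    state="ingest",
    dataset="patient_status_ingest_column_mapping",
)
print(key.run_prefix(1))
```

The component values are taken from the commented path in the configuration cell, so the printed prefix matches that path up to the parquet file name.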

In [None]:
"""
************ CONFIGURATION - PLEASE TOUCH **************
Pipeline Builder configuration: creates configurations from variables specified here!!
This cell will be off in production as configurations will come from the configuration postgres DB.
"""
"""
PIPELINE STATE:
raw-->ingest-->master-->enhance-->enrich-->metrics-->dimensional
"""
# config vars: this dataset
config_pharma = "sun" # the pharmaceutical company which owns {brand}
config_brand = "ilumya" # the brand this pipeline operates on
config_state = "master" # the state this transform runs in
config_name = "master_patient_substatus" # the name of this transform!!!, which is the name of this notebook without .ipynb

# input vars: dataset to fetch. 
# Recall that a contract published to S3 has a key format branch/pharma/brand/state/name
#input_branch = "sun-extract-validation"
input_branch ="dc-627_alkermes_ingest_column_mapping"
# None
# if None, input_branch is automagically set to your working branch
input_pharma = "alkermes"
input_brand = "vivitrol"
input_state = "ingest"
#input_name = "symphony_health_association_ingest_column_mapping"
input_name = "patient_status_ingest_column_mapping"
#df = pandas_from_parquet_s3('ichain-dev/dc-627_alkermes_ingest_column_mapping/alkermes/vivitrol/ingest/patient_status_ingest_column_mapping/__metadata_run_id=1/037547bb129341b9aad0ec52424b55e3.parquet')
#This contract defines the base of the output structure of data into S3.
#
#contract structure in s3: 
#s3:// {ENV} / {BRANCH} / {PARENT} / {CHILD} / {STATE} / {name of input}
#
#ENV - environment. Must be one of development, uat, production.
#Prefixed with integrichain- due to the global uniqueness requirement
#BRANCH - the software branch. For development this will be the working pull request (e.g. pr-225);
#in uat this will be edge, in production this will be master
#PARENT - The top level source identifier
#this is generally the customer (and it is aliased as such) but can be IntegriChain for internal sources,
#or another aggregator for future-proofing
#CHILD - The sub level source identifier, generally the brand (and is aliased as such)
#STATE - One of: raw, ingest, master, enhance, enrich, metrics


In [None]:
import logging
logging.getLogger().setLevel(logging.DEBUG)
log = logging.getLogger()

### <font color=orange>SETUP - DON'T TOUCH </font>
Populating config mocker based on config parameters...

In [None]:
"""
************ SETUP - DON'T TOUCH **************
Populating config mocker based on config parameters...
"""
import core.helpers.pipeline_builder as builder

ids = builder.build(config_pharma, config_brand, config_state, config_name, session)
"""
RETURNS: A list of 2 items: [transformation_id, run_id] where transformation_id corresponds
to the configuration created/found for {transformation} and run_id is a randomly generated 6 digit
number (to avoid publishing to the same place with the same dataset)
"""

transform_id = ids[0]
run_id = ids[1]


### <font color=orange>SETUP - DON'T TOUCH </font>
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 


In [None]:
"""************ SETUP - DON'T TOUCH **************
This section imports data from the configuration database
and should not need to be altered or otherwise messed with. 
~~These are not the droids you are looking for~~
"""
from core.constants import BRANCH_NAME, ENV_BUCKET, BATCH_JOB_QUEUE
from core.helpers.session_helper import SessionHelper
from core.models.configuration import Transformation
from dataclasses import dataclass
from core.dataset_contract import DatasetContract


db_transform = session.query(Transformation).filter(Transformation.id == transform_id).one()

@dataclass
class DbTransform:
    id: int = db_transform.id ## the instance id of the transform in the config app
    name: str = db_transform.transformation_template.name ## the transform name in the config app
    state: str = db_transform.pipeline_state.pipeline_state_type.name ## the pipeline state, one of raw, ingest, master, enhance, enrich, metrics, dimensional
    branch:str = BRANCH_NAME ## the git branch for this execution 
    brand: str = db_transform.pipeline_state.pipeline.brand.name ## the pharma brand name
    pharmaceutical_company: str = db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name # the pharma company name
    publish_contract: DatasetContract = DatasetContract(branch=BRANCH_NAME,
                            state=db_transform.pipeline_state.pipeline_state_type.name,
                            parent=db_transform.pipeline_state.pipeline.brand.pharmaceutical_company.name,
                            child=db_transform.pipeline_state.pipeline.brand.name,
                            dataset=db_transform.transformation_template.name)
        


In [None]:

log.debug('Branch name:{} Env Bucket:{} Batch Job Queue:{}'.format(BRANCH_NAME,ENV_BUCKET,BATCH_JOB_QUEUE))

## CONFIGURATION - VARIABLES - PLEASE TOUCH

# TRANSFORM
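
The cell below asks for transform variables in the form `<variable_name>: <data_type>  # comment`. As a self-contained sketch of that pattern, assuming plain dataclasses, the example uses a hypothetical `BaseConfig` in place of `DbTransform` and gives each new field a default, because dataclass fields without defaults cannot follow inherited fields that have them. The field comments are illustrative only.

```python
from dataclasses import dataclass


@dataclass
class BaseConfig:
    """Hypothetical stand-in for DbTransform: fields shared by every transform."""
    name: str = "master_patient_substatus"
    state: str = "master"


@dataclass
class ExampleTransform(BaseConfig):
    some_ratio: float = 0.0  # illustrative numeric setting
    site_name: str = ""      # illustrative text setting


cfg = ExampleTransform()

# Development-time assignments, mirroring the guide's example values.
cfg.some_ratio = 0.6
cfg.site_name = "WALGREENS"
```

Keeping the type and a short comment on every field lets the class double as documentation for whoever configures the pipeline.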

In [None]:
""" 
CONFIGURATION ********* VARIABLES - PLEASE TOUCH ********* 
This section defines what you expect to get from the configuration application 
in a single "transform" object. Define the vars you need here, and comment inline to the right of them 
for all-in-one documentation. 
Engineering will build a production "transform" object for every pipeline that matches what you define here.

@@@ FORMAT OF THE DATA CLASS IS: @@@ 

<variable_name>: <data_type> #<comment explaining what the value is to future us>
e.g.
class Transform(DbTransform):
    some_ratio: float
    site_name: str

~~These ARE the droids you are looking for~~
"""
"""
imports
"""
import pandas as pd
import re 
from core.logging import get_logger
 
class Transform(DbTransform):
    '''
    YOUR properties go here!!
    Variable properties should be assigned to the exact name of
    the transformation as it appears in the Jupyter notebook filename.
    ''' 
    # PROD
    '''
    col_substatus: str = db_transform.variables.col_substatus # The column of interest for the transform 
    customer_name: str = db_transform.variables.col_customer_name # The customer name
    input_transform: str = db_transform.variables.input_transform  # the name of the dataset to pull from 
    '''
    # DEV
    col_substatus: str  
    customer_name: str  
    input_transform: str   


        
    def master_substatus():
        customer_name = transform.customer_name
        # default to an empty conversion map so every customer returns both dicts
        substatus_conversion_dict = {}
        try:
            if customer_name=='sun':
                substatus_dict = Transform.master_substatus_sun()
                substatus_conversion_dict = Transform.master_substatus_conversion_sun()
            elif customer_name=='bi':
                substatus_dict = Transform.master_substatus_bi()
            elif customer_name=='alkermes':
                substatus_dict = Transform.master_substatus_alkermes()
                substatus_conversion_dict = Transform.master_substatus_conversion_alkermes()
            else:
                logger.error('expecting customer name as sun bi or alkermes')
                raise Exception('expecting customer name as sun bi or alkermes')
        except Exception as e:
            logger.exception(f'exception:{e}')
            raise Exception(f'raise exception:{e}') from e
        return substatus_dict, substatus_conversion_dict
    
    def master_substatus_sun():
        # need to input/ define for ic-gold mapping
        # for substatus for sun
        # temporary until future User defines
        # IC - GOLD persistence solution
        substatus_dict = {}
        substatus_dict[1]='ALT THERAPY'
        substatus_dict[2]='APPEAL'
        substatus_dict[3]='BENEFITS'
        substatus_dict[4]='COPAY ASSISTANCE'
        substatus_dict[5]='DELAY'
        substatus_dict[6]='DOSAGE'
        substatus_dict[7]='FORMULARY'
        substatus_dict[8]='FOUNDATION'
        substatus_dict[9]='HOLD OTHER'
        substatus_dict[10]='HOLD RTS'
        substatus_dict[11]='INFORMATION'
        substatus_dict[12]='INS OTHER'
        substatus_dict[13]='INSURANCE COPAY'
        substatus_dict[14]='INSURANCE DENIED'
        substatus_dict[15]='INSURANCE HOLD'
        substatus_dict[16]='INSURANCE OON'
        substatus_dict[17]='INSURANCE OTHER'
        substatus_dict[18]='INVENTORY HOLD'
        substatus_dict[19]='MATERIAL'
        substatus_dict[20]='NEW'
        substatus_dict[21]='OTHER'
        substatus_dict[22]='PA'
        substatus_dict[23]='PATIENT CONTACT'
        substatus_dict[24]='PATIENT DECEASED'
        substatus_dict[25]='PATIENT END'
        substatus_dict[26]='PATIENT FINANCIAL'
        substatus_dict[27]='PATIENT HOLD'
        substatus_dict[28]='PATIENT RESPONSE'
        substatus_dict[29]='PRESCRIBER'
        substatus_dict[30]='PRESCRIBER END'
        substatus_dict[31]='PRESCRIBER HOLD'
        substatus_dict[32]='PT HOLD'
        substatus_dict[33]='QUANTITY'
        substatus_dict[34]='READY'
        substatus_dict[35]='SERVICES END'
        substatus_dict[36]='SHIPMENT'
        substatus_dict[37]='STEP EDIT'
        substatus_dict[38]='THERAPY COMPLETE'
        substatus_dict[39]='THERAPY END'
        substatus_dict[40]='THERAPY HOLD'
        substatus_dict[41]='TRANSFER HUB'
        substatus_dict[42]='TRANSFER SP'
        substatus_dict[43]='TREATMENT DELAY'
        return substatus_dict
   

    def master_substatus_conversion_sun():
        substatus_conversion_dict = {
            'BENEFITS INVESTIGATION':'BENEFITS',
            'INS OON ':'INSURANCE OON',
            'OTHER ':'OTHER',
            'P05':'PA',
            'PATENT RESPONSE':'PATIENT RESPONSE',
            'PATIENT  RESPONSE':'PATIENT RESPONSE',
            'PATIENT RESPOSNE':'PATIENT RESPONSE',
            'PRESCRIBERHOLD':'PRESCRIBER HOLD',
            'TRANSER SP':'TRANSFER SP',
        }
        return substatus_conversion_dict
    
    def master_substatus_conversion_alkermes():
        substatus_conversion_dict = {}
        return substatus_conversion_dict
    
    def master_substatus_bi():
        substatus_dict = {}
        return substatus_dict

    
    def master_substatus_alkermes():
        substatus_dict = {}
        substatus_dict[1]='AWAITING FINANCIAL DECISION'
        substatus_dict[2]='AWAITING INJECTION PROVIDER'
        substatus_dict[3]='BI -- BENEFITS VERIFICATION STARTED'
        substatus_dict[4]='CDF APPROVAL'
        substatus_dict[5]='COPAY ASSISTANCE'
        substatus_dict[6]='COVERAGE DENIED'
        substatus_dict[7]='DELAY IN TREATMENT INITIATION'
        substatus_dict[8]='EBI - BV COMPLETE/COVERED'
        substatus_dict[9]='ENROLLMENT MISSING INFO - WAITING ON HCP'
        substatus_dict[10]='ENROLLMENT MISSING INFO - WAITING ON PATIENT'
        substatus_dict[11]='ENROLLMENT STARTED - DATA ENTRY STARTED'
        substatus_dict[12]='HCP UNRESPONSIVE -- ATTEMPTS EXHAUSTED'
        substatus_dict[13]='INDICATION CRITERIA NOT MET'
        substatus_dict[14]='INSURANCE CHANGE - REVERIFYING BENEFITS'
        substatus_dict[15]='NO COVERAGE - DRUG NOT ON FORMULARY'
        substatus_dict[16]='NO INSURANCE'
        substatus_dict[17]='OTHER'
        substatus_dict[18]='PA -- PA APPROVED'
        substatus_dict[19]='PA -- PA DENIED'
        substatus_dict[20]='PA -- PA STARTED'
        substatus_dict[21]='PA -- WAITING ON HCP'
        substatus_dict[22]='PA -- WAITING ON PAYER'
        substatus_dict[23]='PA APPEAL -- STARTED'
        substatus_dict[24]='PATIENT CHOICE - INJECTION RESISTANCE'
        substatus_dict[25]='PATIENT REFUSED TREATMENT'
        substatus_dict[26]='PATIENT TRANSFERRED BACK TO HUB'
        substatus_dict[27]='PATIENT UNRESPONSIVE -- ATTEMPTS EXHAUSTED'
        substatus_dict[28]='PATIENTS CHOICE - FINANCIAL'
        substatus_dict[29]='PHYSICIAN CANCELLED'
        substatus_dict[30]='PHYSICIAN DISCONTINUED'
        substatus_dict[31]='REFILL TOO SOON'
        substatus_dict[32]='SHIPMENT DELAYED -PATIENT HAS SUPPLY ON HAND'
        substatus_dict[33]='SHIPMENT SCHEDULED'
        substatus_dict[34]='SHIPPED'
        substatus_dict[35]='SWITCHED TO COMPETITOR PRODUCT'
        substatus_dict[36]='THERAPY COMPLETE'
        substatus_dict[37]='TRANSFER TO PBM FOR PROCESSING'
        substatus_dict[38]='TRIAGED TO MAIL ORDER'
        substatus_dict[39]='TRIAGED TO OTHER SP'
        substatus_dict[40]='TRIAGED TO PBM FOR PROCESSING'
        substatus_dict[41]='WAITING ON HCP RESPONSE'
        substatus_dict[42]='WAITING ON NEW RX FROM PHYSICIAN'
        substatus_dict[43]='WAITING ON PATIENT SHIP DATE DECISION'
        return substatus_dict
        
    
    def master_patient_substatus(self,df):
        try:        
            logger.info('try:')
            go = False # assume things are not working YET.
           
            dffail = pd.DataFrame() # initialize df for fails
            
            # df in

            #dfSize = df.size
            dfShape = df.shape
            logger.info('df in  shape: {} {}'.format(dfShape[0],dfShape[1])) 
            logger.info('df in {}'.format(df.head()))  
            
            # am I expecting certain column names? YES 
            substatusColNameExpected = transform.col_substatus
            
            logger.info('expecting column name patient sub status as:{}'.format(substatusColNameExpected))
            columnNamesArr = df.columns.values.tolist()
            logger.info('df column names:{}'.format(columnNamesArr))
            
            if substatusColNameExpected in columnNamesArr:
                logger.info('Clean: space Strip and Upper and other cleanup...')  

                # only clean real strings; None/NaN values pass through untouched
                df[substatusColNameExpected]= df[substatusColNameExpected].apply(lambda x: x.upper() if isinstance(x, str) else x)
                df[substatusColNameExpected]= df[substatusColNameExpected].apply(lambda x: x.strip() if isinstance(x, str) else x)
                # str.replace matches literal text (not regex), so '\\w' targets a backslash followed by 'w'
                df[substatusColNameExpected]= df[substatusColNameExpected].apply(lambda x: x.replace('_',' ').replace('\r', '').replace('\t', '').replace('\\w', '').replace("'",'').replace('.','').replace('  ',' ') if isinstance(x, str) else x)
              
                                                                  
                # master data IC-GOLD substatus
                substatus_dict = {}
                substatus_conversion_dict = {}
                substatus_dict, substatus_conversion_dict = Transform.master_substatus()
                # print(substatus_conversion_dict)
                # store the golden values in a list
                substatus_list = list(substatus_dict.values())           
                logger.info('Gold Domain List:{}'.format(substatus_list))  
                
                #apply master conversions
                if transform.customer_name=='sun':
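                    # assign back instead of inplace=True on a column selection, which may not update df
                    df[substatusColNameExpected] = df[substatusColNameExpected].replace(substatus_conversion_dict)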
                
                
                # what fails
                dffail = df.loc[~df[substatusColNameExpected].isin(substatus_list)]
                # apply master selection for the column of interest
                # what passes
                df = df.loc[df[substatusColNameExpected].isin(substatus_list)]
                
                # meta data log for what comes out of the function pass and fail df
                dfOutSize = df.size
                dfOutShape = df.shape
                dffailSize = dffail.size
                dffailShape = dffail.shape
                logger.info('df in   shape: {} {}'.format(dfShape[0],dfShape[1]))                 
                logger.info('df pass shape: {} {}'.format(dfOutShape[0],dfOutShape[1]))
                logger.info('df fail shape: {} {}'.format(dffailShape[0],dffailShape[1]))  
                # meta data log for what comes out of the function pass df
                logger.info('df pass {}'.format(df.head()))
                # meta data log for what comes out of the function fail df
                logger.info('df fail {}'.format(dffail.head()))  
                go = True
            else:
                # no active exception here, so log an error rather than a traceback
                logger.error('expected patient substatus column {} not found in df columns'.format(substatusColNameExpected))
                raise Exception("sub_status ColNameExpected NOT in columnNamesArr")
        except Exception as e:
            logger.exception(f'exception:{e}')
            raise Exception(f'raise exception:{e}') from e
        return df.copy(),dffail.copy(),go
                

transform = Transform()
logger = get_logger(f"core.transforms.{transform.state}.{transform.name}")

### *Please place your value assignments for development below !!!*
### <font color=pink>This cell will be turned off in production, Engineering will set to pull from the configuration</font>

In [None]:
## Please place your value assignments for development here!!
## This cell will be turned off in production and Engineering will set to pull from the configuration application instead
## For the last example, this could look like...
## transform.some_ratio = 0.6
## transform.site_name = "WALGREENS"

#transform.customer_name = 'sun'
#transform.col_substatus = 'sub_status' # based on dev data works for sun 
transform.customer_name = 'alkermes'
transform.col_substatus = 'customer_status_description' # based on dev data works for alkermes
transform.input_transform = '' # for DEV NA

### FETCH DATA - TOUCH, BUT CAREFULLY
### <font color=pink>This cell will be turned off in production, as the input_contract will be handled by the pipeline</font>

In [None]:
logger.info("FETCH DATA CELL - TOUCH - This cell will be turned off in production, as the input_contract will be handled by the pipeline. ")

# for testing / development only
run_id = 1

if not input_branch:
    input_branch = BRANCH_NAME
input_contract = DatasetContract(branch=input_branch,
                                 state=input_state, 
                                 parent=input_pharma, 
                                 child=input_brand, 
                                 dataset=input_name)
run_filter = []
run_filter.append(dict(partition="__metadata_run_id", comparison="==", values=[run_id]))
# The filter above restricts the fetch to a single run.
# IF YOU HAVE PUBLISHED DATA MULTIPLE TIMES, change run_id to the run you want to fetch.
# Without the filter, you will have duplicate values in your fetched dataset!
# bypass/comment out when unit testing individual parquet files
df = input_contract.fetch(filters=run_filter)



In [None]:
### Retrieve current dataset from contract
from core.dataset_diff import DatasetDiff

diff = DatasetDiff(db_transform.id)
df = diff.get_diff(transform_name=transform.input_transform, values=[run_id])

## *<font color=grey>unit test development only</font>*
*<font color=grey>The next **5** cells will be deleted in production.</font>*

In [None]:
# unit test/development only
# before shot unit testing only
dfSize = df.size
dfShape = df.shape
print('shape: {} {}'.format(dfShape[0],dfShape[1])) 

In [None]:
# unit test/development only 
# needed to see the col(s) of interest
pd.set_option('display.max_columns', 135)
df.head()

In [None]:
df['customer_status_description'].value_counts(dropna=False)

# <font color=red>**CALL**</font> THE TRANSFORM

In [None]:
### Use the variables above to execute your transformation.
### the final output needs to be a variable named final_dataframe
logger.info("CALL THE TRANSFORM - execute your transformation")

final_dataframe, final_fail, go = transform.master_patient_substatus(df)

if go==True:
    logger.info("CALL THE TRANSFORM -  go no go = GO")
elif go==False:
    logger.info("CALL THE TRANSFORM -  go no go = NO go")
else:
    go=False
    logger.info("CALL THE TRANSFORM -  go no go = unknown make it NO go")
    
    

### *<font color=grey>unittest python</font>*

# **publish**
### Writing to S3
Invoke the `publish()` command to write to a given contract. Some things to know:
- To invoke publish, a contract must be at the grain of a dataset. This is because file names will be set by the dataframe=\>parquet conversion. 
- publish only accepts a pandas dataframe.
- publish does not allow for timedelta data types at this time (this is missing functionality in pyarrow).
- publish handles partitioning the data as per contract, creating file paths, and creating the binary parquet files in S3, as well as the needed metadata.
- **By default, all datasets include a single partition, \_\_metadata\_run\_id, the RunEvent ID of an executed pipeline.**
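
Given the timedelta restriction noted above, one possible workaround is to convert timedelta columns to plain numbers before calling `publish()`. The sketch below is an assumption about how that could be done with pandas alone; `timedeltas_to_seconds` and the column names are hypothetical and not part of the publish API.

```python
import pandas as pd


def timedeltas_to_seconds(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with every timedelta64 column converted to float seconds."""
    out = df.copy()
    for column in out.select_dtypes(include=["timedelta64"]).columns:
        out[column] = out[column].dt.total_seconds()
    return out


# Hypothetical frame with one timedelta column.
frame = pd.DataFrame({
    "patient_id": [1, 2],
    "time_to_fill": pd.to_timedelta(["1 days", "36 hours"]),
})

safe = timedeltas_to_seconds(frame)
```

Seconds is an arbitrary choice of unit here; whatever unit is used should be recorded in the column name or its documentation.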

In [None]:
## that's it - just provide the final dataframe to the var final_dataframe and we take it from there
if go==True:
    logger.info("PUBLISH - that's it - its a GO - just provide the final dataframe to the var final_dataframe and we take it from there")
    transform.publish_contract.publish(final_dataframe, run_id, session)
elif go==False:
    logger.info("PUBLISH -  go no go = NO go -  so DONT publish")
else:
    go=False
    logger.info("PUBLISH -  go no go = unknown make it NO go - so DONT publish")    
session.close()

***