In [6]:
recommender_path = "/Users/jonc101/Box Sync/jichiang_folders/clinical_recommender_pipeline/"

#### Design and Implementation of Clinical Recommender: 

Aims: 

- Create a Framework to Evaluate the Impact of a Recommender System on Physician Behavior
- Create a Data Pipeline that Preprocesses Raw Clinical Behavioral Data into Actionable Clinical Insights
- Unit Testing
- Terminal Run Commands 
- Git Pull and Git Run: 
- Configuration: 


HCI PostDoc Feedback: 
- Survey questioning is useless, except for what you couldn’t simply obtain from observing them interact with the system
- Keep them short
- Anecdotal Inverse relationship between # questions asked afterward and quality/truthfulness of response. They’ll humor you for a bit, but after that just click click click
- Most useful is having people think out loud during the protocol. 
- They independently said this is the most useful thing. Alternatively, have each observer watch for one specific, different task


Evaluation Metrics: 
- Number of mouse clicks
- Resolution down to individual buttons and items
- Elapsed time
- From start of simulation to end
- Number of signed orders
- Number of (unique) recommendations
- Signed orders from recommender
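
The click, time, and signed-order metrics above could be computed from a per-session event log. The following is a minimal sketch, assuming a hypothetical log of `(seconds_since_start, event_type, item, from_recommender)` records; the record layout, event names, and order names are illustrative and are not the tracker's actual schema.

```python
from collections import Counter

# Hypothetical event log for one simulation session.
# Each record: (seconds_since_start, event_type, item, from_recommender)
events = [
    (0.0, "click", "search_box", False),
    (4.2, "click", "recommended_item", True),
    (5.0, "sign_order", "CBC with differential", True),
    (9.5, "click", "search_box", False),
    (12.1, "sign_order", "Blood culture", False),
    (15.0, "click", "recommended_item", True),
    (16.3, "sign_order", "CBC with differential", True),
]


def summarize_session(events):
    """Summarize one session's clicks, elapsed time, and signed orders."""
    clicks = [e for e in events if e[1] == "click"]
    signed = [e for e in events if e[1] == "sign_order"]
    timestamps = [e[0] for e in events]
    return {
        "n_clicks": len(clicks),
        # resolution down to individual buttons and items
        "clicks_by_item": dict(Counter(e[2] for e in clicks)),
        # from the first recorded event to the last
        "elapsed_seconds": max(timestamps) - min(timestamps),
        "n_signed_orders": len(signed),
        "n_unique_signed_orders": len({e[2] for e in signed}),
        "n_signed_from_recommender": sum(1 for e in signed if e[3]),
    }


summary = summarize_session(events)
```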

    

Methods: 

Our goal was to design a grading pipeline that incorporates an in-house clinical recommender system (CRS) designed by Jonathan Chen. We had to design a data pipeline that stores data in a PostgreSQL database whenever physicians select clinical orders from the CRS.

The clinical orders correspond to different simulation states. A simulation state represents a patient's current diagnostic and procedural mock-up as designed by an expert physician. We developed 2 cases in-house and reached out to three different Stanford physicians to design three different cases.

For each clinical order that is captured within a simulation state there is a corresponding score. A specific case has between 3 and 6 states that are triggered by specific clinical decisions. To our knowledge, a clinical grading platform of this kind has not been built before.

Responses were recorded using Google Forms, which tabulates the results into a Google Sheet that is accessible from R and Python. The forms were joined to the CRS data on physician id and manually reviewed for validation. The Google Forms capture HCI survey information about the utility of the clinical recommender for clinical workflow. The survey also captures physician background information, such as years since receiving a medical degree and board certifications.

The majority of the initial participants were Stanford University medical residents. After the initial trials ended, we made some quality-of-life improvements to the recommender system based on feedback from the first ~25 physicians.

A grading module was prototyped in R; a more robust, test-driven Python module then scored orders automatically using input from three clinical experts. The grading module treats each pair of physician id and clinical case as a key, and each order under that key has an associated grade and confidence. Summing the grades within each key gives a score for each doctor's case.
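
The keyed summation described above can be sketched with a pandas group-by. This is an illustration only: the rows, order names, and grade values are made up, the confidence column is omitted, and `case_score` is a name chosen here rather than one taken from the grading module.

```python
import pandas as pd

# Hypothetical graded orders: one row per order a physician placed in a case.
orders = pd.DataFrame({
    "sim_user_id": [1, 1, 1, 2, 2],
    "case": ["meningitis", "meningitis", "gi_bleed", "meningitis", "meningitis"],
    "clinical_order": ["Ceftriaxone", "Lumbar puncture", "Type and screen",
                       "Ceftriaxone", "CT head"],
    "grade": [10.0, 8.0, 5.0, 10.0, -2.0],
})

# (physician id, case) is the key; summing the grades in each group
# gives one score per physician per case.
case_scores = (
    orders.groupby(["sim_user_id", "case"])["grade"]
    .sum()
    .reset_index(name="case_score")
)
```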

This provides insight on physician decision-making and a quantitative score associated with each decision.

This process was an iterated Delphi method: a panel of three clinical experts graded each unique clinical order separately, then reconvened to discuss guidelines, grading, and clinical confidence in treatments.
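
One way to prepare for the reconvening step is to flag orders where the independent grades disagree. The sketch below is hypothetical: the grades are invented, and the `max_spread` threshold is an assumption made for illustration, since the notebook does not state how disagreements were selected for discussion.

```python
from statistics import mean

# Hypothetical independent grades from three experts, keyed by (sim state, order).
round_one = {
    ("Mening Active", "Ceftriaxone"): [10, 9, 10],
    ("Mening Active", "CT head"): [2, -3, 5],
    ("Afib-RVR Initial", "Diltiazem"): [8, 8, 7],
}


def flag_for_discussion(grades_by_order, max_spread=3):
    """Return the keys whose highest and lowest grades differ by more than max_spread."""
    return sorted(
        key for key, grades in grades_by_order.items()
        if max(grades) - min(grades) > max_spread
    )


# Mean grade per order, and the orders the panel would revisit together.
consensus = {key: mean(grades) for key, grades in round_one.items()}
to_discuss = flag_for_discussion(round_one)
```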

#### Random Trial Setup: How Were Cases Randomized: 

<pre>
Purpose:  
    Join sim_state_id and clinical_item_id to Grading Sheet
        Then: 
    Generate Deterministic Random Numbers: 
        Reproducible: 
            pseudorandom (deterministic) based on an internal state 
        Set.Seed
            

</pre>  
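
The reproducibility point in the outline above can be checked directly: a pseudorandom generator started from the same seed produces the same sequence. A minimal sketch, using a private `random.Random` instance (a design choice for this example, not what the notebook does) so that the global random state is left untouched; `shuffled_cases` is a hypothetical helper name.

```python
import random


def shuffled_cases(cases, seed):
    """Return a shuffled copy of cases using a private, seeded generator."""
    rng = random.Random(seed)  # independent of the module-level random state
    # sampling the full length without replacement is a shuffle
    return rng.sample(cases, len(cases))


cases = ["A", "B", "C", "D", "E"]
first = shuffled_cases(cases, seed=1)
second = shuffled_cases(cases, seed=1)
# first and second are identical because both generators start from seed 1
```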

In [178]:
# how to get pandas data from PostgreSQL using python
# psycopg2 is a PostgreSQL database adapter; pandas uses its connection to run queries
# pandas is a module that is R-like Magic for data manipulation 

import psycopg2 as pg
import pandas.io.sql as psql
import pandas as pd
import numpy as np

In [31]:
import random
'''
--------------------------------------------------
random.sample() returns a new list of a given length 
with items chosen from a sequence (list, tuple, 
string)
--------------------------------------------------
Used for random sampling without replacement
--------------------------------------------------
x denotes: 
    expects: 
        - list 
        - cases that you want to randomize 
y denotes:
    - list of boolean values that indicate whether or not 
          the recommender is turned on 
n_physicians denotes:
    - the number of physicians to generate orderings for
Purpose of Script:
    - writing a function that accepts a list of physician cases and randomly orders them
    - making it reproducible: seeding with a fixed value restarts the
          pseudorandom sequence, so every run gives the same output
    
Learning Points to Incorporate: 
    - more test driven development
    - functional programming versus Object Oriented Programming 
    - Less Script-Like   
----------------------------------------------------        
'''


def testRandomizeCase(x, y):
    # x and y need not have equal length (5 cases and 4 flags in the run below)
    assert isinstance(x, list)
    assert isinstance(y, list)
    assert all(isinstance(flag, bool) for flag in y)


def randomizeCase(x, y, n_physicians=50):
    # set the seed so the same orderings are produced on every run
    random.seed(a=1)
    # initialize an empty list 
    output = []
    # one (case order, recommender flags) pair per physician in your study
    for _ in range(n_physicians):
        # sampling the full length without replacement is a shuffle
        a = random.sample(x, len(x))
        b = random.sample(y, len(y))
        output.append((a, b))
    return output

# cases to randomize, named by presenting complaint
cases = ['Fever B','Headache','Palpitations', 'Hematemesis', 'Shortness of Breath']

# TRUE or FALSE (True means recommender is turned on) 
booleanList = [True,True, False, False]

# running script: 
t = randomizeCase(cases, booleanList)
# assumes first case recommender is on
print(t[0])
print(t[1])
print(t[2])

(['Headache', 'Fever B', 'Shortness of Breath', 'Hematemesis', 'Palpitations'], [False, True, False, True])
(['Fever B', 'Hematemesis', 'Shortness of Breath', 'Headache', 'Palpitations'], [True, False, True, False])
(['Headache', 'Fever B', 'Shortness of Breath', 'Hematemesis', 'Palpitations'], [True, False, False, True])
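
Each printed tuple holds five cases but only four flags. Reading the comment "assumes first case recommender is on" as meaning that the first case always has the recommender enabled and the four flags apply to the remaining cases, a per-physician schedule could be assembled as below. This reading is an assumption, and `build_schedule` is a hypothetical helper.

```python
def build_schedule(case_order, flags, first_flag=True):
    """Pair each case with a recommender flag; the first case gets first_flag."""
    all_flags = [first_flag] + list(flags)
    if len(all_flags) != len(case_order):
        raise ValueError("need exactly one flag per case after the first")
    return list(zip(case_order, all_flags))


# One (case order, flags) tuple in the same shape as the printed output.
case_order = ['Headache', 'Fever B', 'Shortness of Breath', 'Hematemesis', 'Palpitations']
flags = [False, True, False, True]
schedule = build_schedule(case_order, flags)
```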


#### Assign Paths for Pipeline: 

<pre>
Purpose:  
    Treat Recommender Path as Configuration Folder File
        Then: 
            Assign Appropriate Paths for Clinical Recommender Pipeline     

</pre>  

In [17]:
physician_grading = recommender_path + "physician_grading/"
physician_response = recommender_path + "physician_response/"
tracker_data = recommender_path + "tracker_data/"
unit_test = recommender_path + "unit_test/"
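
String concatenation works only while every folder string ends in a slash. As an alternative sketch, `pathlib` joins path segments without that requirement; the root folder and the file name below are placeholders, and `pipeline_paths` is a hypothetical helper.

```python
from pathlib import Path


def pipeline_paths(root):
    """Build the pipeline's subfolder paths from a single root folder."""
    root = Path(root)
    names = ["physician_grading", "physician_response", "tracker_data", "unit_test"]
    return {name: root / name for name in names}


paths = pipeline_paths("clinical_recommender_pipeline")
# The "/" operator joins segments whether or not the root ends in a slash.
grading_file = paths["physician_grading"] / "example_grades.xlsx"  # hypothetical file name
```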


#### Parse GitHub API for Open Issues with Hashtag Due Dates: 

<pre>
Purpose: List Issues from Github with Deadlines: 
    Parse Github API: 
            Then:
                Convert Json file Format to Pandas DataFrame
            Then: 
                Parse for Due Date Hashtag 
            Then: 
                Drop NA values (no due date)
            Then: 
                Create Column Names:
            Then: 
                Sort By Date 

</pre>  

In [8]:
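# list open GitHub issues that carry a "#Due:" tag in the title, sorted by due date
# without -i, curl writes only the JSON body, so no header lines need trimming
!curl "https://api.github.com/repos/HealthRex/CDSS/issues?state=open" > 'issues3.json'
github_issues = pd.read_json('/Users/jonc101/Documents/Biomedical_Data_Science/issues3.json')['title']
# split each title once at the tag: column 0 is the issue text, column 1 the date
due_date = github_issues.str.split("#Due:", n = 1, expand = True) 
# titles without a tag have no date and are dropped
c = pd.DataFrame(due_date.dropna())
c.columns = ['issue', 'date']
# note: 'date' is still a string here, so this sort is lexicographic
c.sort_values(by='date')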

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed
100  112k  100  112k    0     0   114k      0 --:--:-- --:--:-- --:--:--  114k


Unnamed: 0,issue,date
13,"Merge case data UI usage, grading, survey res...",6/21/2019
22,Expand Recruitment Process - Send out recruit...,6/24/2019
4,Automated UI Test Grader that accounts for gro...,6/26/2019
11,Manual validation of a couple data rows from m...,6/26/2019
10,Get individual expert panel to deliver their f...,6/28/2019
8,Data Analysis - Define analysis plan of which ...,7/12/2019
9,Convene expert panel to reconcile grading for ...,7/20/2019
7,Paper - Methods Description of Grading Process...,7/25/2019
12,Completing remaining UI tests with physicians ...,7/30/2019
3,Second round expert panel review once collecte...,8/10/2019
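
Because the dates are strings, sorting them compares characters rather than calendar order, so a month such as 10 would sort before 6. A minimal sketch of parsing the tag into real dates before sorting, assuming titles follow the `#Due: m/d/yyyy` form seen in the table; the example titles and the `parse_due` helper are hypothetical.

```python
import re
from datetime import datetime

# Issue text, then the tag, then a month/day/year date.
DUE_PATTERN = re.compile(r"^(?P<issue>.*?)\s*#Due:\s*(?P<date>\d{1,2}/\d{1,2}/\d{4})")


def parse_due(title):
    """Return (issue text, due date), or None when the title has no due-date tag."""
    match = DUE_PATTERN.search(title)
    if match is None:
        return None
    due = datetime.strptime(match.group("date"), "%m/%d/%Y").date()
    return match.group("issue"), due


titles = [
    "Write methods section #Due: 10/2/2019",
    "Recruit more physicians #Due: 6/24/2019",
    "Refactor grader",  # no due date, so it is dropped
]
# Keep only tagged titles and sort by the parsed date, earliest first.
dated = sorted(filter(None, map(parse_due, titles)), key=lambda pair: pair[1])
```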


Read Data from Database into Memory: 

In [9]:
connection = pg.connect("host='localhost' dbname=stride_inpatient_2014 user=postgres password='MANUAL PASSWORD'")

# -------------------------------------------------------------------------------
# to do :
#         Feature: Generate Grading Scheme
#             1) help visualize processes
#             2) introduce best grading schemes for each case
#             3) create a list of common errors seen
#
#             4) clean up exploratory analysis
#             5) convert to python module
#
#
#
# ---------------------------------------------------------------------------------


# reading in data from different tables 

clinical_item = pd.read_sql_query('select * from clinical_item', con=connection)
sim_patient_order = pd.read_sql_query('select * from sim_patient_order',con=connection)
sim_state = pd.read_sql_query('select * from sim_state',con=connection)
sim_user = pd.read_sql_query('select * from sim_user',con=connection)
sim_state_transition = pd.read_sql_query('select * from sim_state_transition',con=connection)

sim_state['sim_state_name'] = sim_state['description']

<pre>
Read Sim State Data into Memory: 
    Then: 
        join the sim_state with the sim_patient orders 
    Then:
        find all the unique clinical item orders 
</pre>

In [87]:
merged_order = sim_patient_order.merge(sim_state, left_on='sim_state_id', right_on='sim_state_id')
clinical_items_list = merged_order['clinical_item_id'].unique()

<pre>
Create vector of unique sim states (sim_state_id):
    Then: 
        filter vector of unique orders from clinical item table 
    Then: 
        create a description based table for orders and clinical items 
    Then: 
        split by sim_states into group by object 
</pre>

In [88]:
sim_state_list = merged_order['sim_state_id'].unique()
ordered_clinical_item_table = clinical_item[clinical_item['clinical_item_id'].isin(clinical_items_list)]
remerged_order = merged_order.merge(ordered_clinical_item_table, left_on='clinical_item_id', right_on='clinical_item_id')
split_state = remerged_order.groupby('sim_state_id')


Explicitly write the lists of sim state names for each case: 

In [89]:

#--------------------------------------------------------------------------------
# afib
#--------------------------------------------------------------------------------
# "Afib-RVR Initial"
# "Afib-RVR Stabilized"
# "Afib-RVR Worse"
#--------------------------------------------------------------------------------
afib_states = ["Afib-RVR Initial",
                "Afib-RVR Stabilized" ,
                "Afib-RVR Worse" ]
#--------------------------------------------------------------------------------
# meningitis
#--------------------------------------------------------------------------------
# "Mening Active"
# "Meningitis Adequately Treated"
# "Meningits Worsens"
#--------------------------------------------------------------------------------
mening_states =  ["Mening Active",
                   "Meningitis Adequately Treated",
                   "Meningits Worsens"]
# -------------------------------------------------------------------------------
# pulmonary embolism
# -------------------------------------------------------------------------------
# "PE-COPD-LungCA"
# "PE-COPD-LungCA + Anticoagulation"
# "PE-COPD-LungCA + O2"
# "PE-COPD-LungCA + O2 + Anticoagulation"
# -------------------------------------------------------------------------------
pulmonary_emolism_states = ["PE-COPD-LungCA",
                              "PE-COPD-LungCA + Anticoagulation",
                              "PE-COPD-LungCA + O2",
                              "PE-COPD-LungCA + O2 + Anticoagulation"]
# -------------------------------------------------------------------------------
# neutropenic fever
# -------------------------------------------------------------------------------
#  "Neutropenic Fever Treated with IVF"
#  "Neutropenic Fever Treated with IVF + ABX"
#  "Neutropenic Fever v2"
#  "NFever"
# -------------------------------------------------------------------------------

neutropenic_fever_states = ["Neutropenic Fever Treated with IVF",
                              "Neutropenic Fever Treated with IVF + ABX",
                              "Neutropenic Fever v2"]

# -------------------------------------------------------------------------------
# GIBLEED
# -------------------------------------------------------------------------------
# "EtOH-GIBleed Active"
# "EtOH-GIBleed Bleeding Out"
# "EtOH-GIBleed Coag Stabilized"
# "EtOH-GIBleed Post-EGD"
# -------------------------------------------------------------------------------

gi_bleed_states = ["EtOH-GIBleed Active",
                      "EtOH-GIBleed Bleeding Out",
                      "EtOH-GIBleed Coag Stabilized",
                      "EtOH-GIBleed Post-EGD" ]

# -------------------------------------------------------------------------------
# DKA
# -------------------------------------------------------------------------------
# "DKA Euglycemic"
# "DKA Hyperglycemic"
# "DKA Onset"
# -------------------------------------------------------------------------------

dka_states = ["DKA Euglycemic" ,
                "DKA Hyperglycemic" ,
                "DKA Onset"]

list_of_states = [gi_bleed_states,
                       mening_states,
                       pulmonary_emolism_states,
                       afib_states,
                       neutropenic_fever_states]



<pre>
    Split the states into separate dataframes:
        Then: 
            explicitly add a label for the case name 
        Then: 
            select features for grading 
</pre>
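
The split-and-label step outlined above can also be expressed as a single lookup from state name to case label. The sketch below is an alternative formulation, not the notebook's method; it uses two of the state lists and a small hypothetical stand-in for `remerged_order`.

```python
import pandas as pd

# Two of the state lists, keyed by case label (state names spelled as stored).
states_by_case = {
    "atrial_fibrillation": ["Afib-RVR Initial", "Afib-RVR Stabilized", "Afib-RVR Worse"],
    "meningitis": ["Mening Active", "Meningitis Adequately Treated", "Meningits Worsens"],
}
# Invert to a state -> case lookup.
case_by_state = {
    state: case for case, states in states_by_case.items() for state in states
}

# Hypothetical stand-in for remerged_order with only the columns needed here.
orders = pd.DataFrame({
    "name_x": ["Afib-RVR Initial", "Mening Active", "DKA Onset"],
    "clinical_item_id": [45763, 45801, 45866],
})

# assign() returns a new DataFrame, so the original is not modified in place.
labeled = orders.assign(case=orders["name_x"].map(case_by_state))
# States that belong to no listed case map to NaN and are dropped.
labeled = labeled.dropna(subset=["case"]).reset_index(drop=True)
```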

In [90]:
def state_split(state_names, df):
    # copy so that adding the 'case' column below does not write to a slice of df
    df2 = df[df['name_x'].isin(state_names)].copy()
    return df2

gi_test = state_split(gi_bleed_states, remerged_order)
mening_test = state_split(mening_states, remerged_order)
pulmonary_embolism_test = state_split(pulmonary_emolism_states, remerged_order)
afib_test = state_split(afib_states, remerged_order)
neutropenic_test = state_split(neutropenic_fever_states, remerged_order)

gi_test['case'] = "gi_bleed"
mening_test['case'] = "meningitis"
pulmonary_embolism_test['case'] = "pulmonary_embolism"
afib_test['case'] = "atrial_fibrillation"
neutropenic_test['case'] = "neutropenic"

df_grading_pre = pd.concat([gi_test,
                        mening_test,
                        pulmonary_embolism_test,
                        afib_test,
                        neutropenic_test])


df_grading = pd.DataFrame(df_grading_pre[['sim_state_id',
                                        'clinical_item_id',
                                        'sim_user_id',
                                        'sim_patient_id',
                                        'name_x',
                                        'description_x',
                                        'description_y',
                                        'case']])


print(df_grading)


      sim_state_id  clinical_item_id  sim_user_id  sim_patient_id  \
55              14             45763           26             134   
67              15             45763           26             123   
68              15             45763            0             126   
69              15             45763            0             153   
70               2             45763           31             141   
71               2             45763           13              39   
72               2             45763           53             293   
99              14             45801           48             248   
177             14             45866           10              23   
178             14             45866           11              31   
179             14             45866           13              39   
180             14             45866           15              75   
181             14             45866           17              79   
182             14             458

A value is trying to be set on a copy of a slice from a DataFrame.
Try using .loc[row_indexer,col_indexer] = value instead

See the caveats in the documentation: http://pandas.pydata.org/pandas-docs/stable/indexing.html#indexing-view-versus-copy
  # This is added back by InteractiveShellApp.init_path()

<pre>
Purpose:  
    Split: sim states and get unique orders 
        Then: 
            each list/dictionary 
            key = sim_state_id 
            value = all orders 
        Then: 
            for each key: 
                get unique orders: 
        Then: 
            convert list/groupby object into dataframe
</pre>                

In [91]:
sim_state_list = df_grading.groupby(['sim_state_id'])


In [184]:
#df_grading.columns

convert groupby object to dictionary of dataframes:
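
A minimal sketch of this conversion, using a small hypothetical stand-in for `df_grading`. Iterating over a group-by yields `(key, sub-DataFrame)` pairs, which a dict comprehension can collect.

```python
import pandas as pd

# Hypothetical stand-in for df_grading.
df = pd.DataFrame({
    "sim_state_id": [14, 14, 15, 2],
    "clinical_item_id": [45763, 45801, 45763, 45763],
})

# One DataFrame per sim state, keyed by sim_state_id.
frames_by_state = {state_id: group for state_id, group in df.groupby("sim_state_id")}

# Unique orders seen in each state.
unique_orders = {
    state_id: sorted(group["clinical_item_id"].unique().tolist())
    for state_id, group in frames_by_state.items()
}
```

Grouping by the column name rather than a one-element list keeps the keys as plain values; recent pandas versions yield one-element tuples as keys when a list is passed.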
    

In [93]:
grading_folder = '/Users/jonc101/Documents/Biomedical_Data_Science/physician_grading/'

<pre>
Purpose:  
    Read Grade Data from Grading Folder 
        Then: 
            Append each score as column labeled from Physician 
        Then: 
            Concat to a Dataframe of All Grades 
</pre>                
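
The outline above can be sketched as follows, assuming the three grade sheets share the same row index. The sheets here are small hypothetical frames rather than the Excel files, and the row-wise `mean` skips missing grades, which differs from adding the raw `.values` arrays (a missing grade there makes the whole row's mean missing).

```python
import pandas as pd

# Hypothetical grade sheets sharing one row index (one row per state/order pair).
index = pd.Index(["Mening Active|Ceftriaxone", "Mening Active|CT head"], name="order_key")
ak = pd.DataFrame({"grade": [10, 2]}, index=index)
ls = pd.DataFrame({"grade": [9, None]}, index=index)  # one grade missing
jh = pd.DataFrame({"grade": [10, 4]}, index=index)

# One column per grader, aligned on the shared index; dict keys become column names.
grades = pd.concat(
    {"grade_ak": ak["grade"], "grade_ls": ls["grade"], "grade_jh": jh["grade"]},
    axis=1,
)
# Row-wise mean over the graders who supplied a grade.
grades["grade_mean"] = grades[["grade_ak", "grade_ls", "grade_jh"]].mean(axis=1)
```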

In [191]:
ak = pd.read_excel(physician_grading + 'andre_kumar_v4.xlsx', index_col=0)
ls = pd.read_excel(physician_grading + 'lisa_shieh_v4.xlsx', index_col=0)
jh = pd.read_excel(physician_grading + 'jason_hom_v4.xlsx', index_col=0)

In [192]:
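# NOTE: the next assignments overwrite every grade from ak and jh with a constant;
# ls keeps the grades read from its file
ak['grade'] = 3
#ls['grade'] = 30
jh['grade'] = 300
# copy each grader's 'grade' column to a grader-specific name
ak['grade_ak'] = ak['grade']
ls['grade_ls'] = ls['grade']
jh['grade_jh'] = jh['grade']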


In [193]:
# concat the column grades 
# write tests for grading columns TO DO 
# grading_delphi

gd = pd.concat([ak, ls['grade_ls'], jh['grade_jh']], axis=1)

In [194]:
gd['grade_mean'] = (gd['grade_ak'].values + gd['grade_ls'].values + gd['grade_jh'].values )/ 3

In [195]:
df_grading['sim_state_name'] = df_grading['description_x']
gd['sim_state_name'] = gd['name.x']

<pre>
Purpose:  
    Join sim_state_id and clinical_item_id to Grading Sheet
        Then: 
             
            key = sim_state_id 
            value = all orders 
        Then: 
            for each key: 
                get unique orders: 
        Then: 
            convert list/groupby object into dataframe
</pre>    

In [185]:
#gd

In [196]:
clinical_item_key = pd.DataFrame(ordered_clinical_item_table[['clinical_item_id', 'description']])
clinical_item_key['clinical_order'] = clinical_item_key['description']
df_grading2 = pd.merge(df_grading_pre, clinical_item_key, how='left', on=['clinical_item_id'])


In [197]:
gd2 = pd.merge(gd, clinical_item_key, how='left', on=['clinical_order'])

In [198]:
gd2['sim_state_name'] = gd2['name.x']

In [199]:
order_grade = pd.merge(gd2, df_grading2, how='outer', on=['clinical_item_id', 'sim_state_name'])


In [200]:
df_grading2['sim_state_clinical_order_id'] = df_grading2['sim_state_id'].apply(str) + '_' + df_grading2['clinical_item_id'].apply(str)
sim_clinical_orders = df_grading2

In [201]:
#gd2['sim_state_clinical_order_id'] = gd2['sim_state_id'].apply(str) + '_' + gd2['clinical_item_id'].apply(str)

In [202]:
sim_state_link = sim_state[['sim_state_id', 'name']]
gd2['name'] = gd2['sim_state_name']
gd3 = pd.merge(gd2, sim_state_link, how='left', on=['name'])
gd3['sim_state_clinical_order_id'] = gd3['sim_state_id'].apply(str) + '_' + gd3['clinical_item_id'].apply(str)
physician_grading_key = gd3 
sim_orders_grade = pd.merge(sim_clinical_orders, physician_grading_key, how='left', on=['sim_state_clinical_order_id'])
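
Instead of building a string key from the two ids, the join can be made on both columns directly, with checks that surface orders lacking a grade. This is a sketch on hypothetical frames, not a drop-in replacement for the cells above; `validate` and `indicator` are standard `DataFrame.merge` parameters.

```python
import pandas as pd

# Hypothetical orders placed in the simulation.
placed = pd.DataFrame({
    "sim_state_id": [14, 14, 15],
    "clinical_item_id": [45763, 45801, 45763],
})
# Hypothetical grading key: one grade per (state, order) pair.
grading_key = pd.DataFrame({
    "sim_state_id": [14, 15],
    "clinical_item_id": [45763, 45763],
    "grade_mean": [8.0, 3.0],
})

graded = placed.merge(
    grading_key,
    how="left",
    on=["sim_state_id", "clinical_item_id"],
    validate="many_to_one",  # raise if the grading key repeats a (state, order) pair
    indicator=True,          # adds a _merge column describing each row's match
)
# Orders that were placed but have no grade in the key.
ungraded = graded[graded["_merge"] == "left_only"]
```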


<pre>
Def: 
    Sum Values of Each Score
    
Generate Scores for Each Case: 
    Then: 
        Split by each case 
    Then: 
        Group By Each 
    Then: 
        split by sim_states into group by object 
</pre>

In [203]:
# WRITE TEST FUNCTION 

def grade_sum(case):
    return case['grade_mean'].sum(axis = 0, skipna = True)


In [204]:
import numpy as np
gk = sim_orders_grade.groupby('sim_patient_id') 
ctd = gk.apply(grade_sum)
sim_grade_groups = sim_orders_grade.groupby('sim_patient_id').groups
sim_grade_group_list = list(sim_grade_groups.keys())
sim_grades_list = list(ctd)


<pre>
Def: 
    Sum Values of Each Score
        Use merged_total_score to hold the list of total scores for each case 
    Then: 
        split for each case 
</pre>

In [205]:
# ctd is indexed by sim_patient_id, so resetting the index keeps each id paired with its own score
merged_total_score = ctd.reset_index()
merged_total_score.columns = ['sim_patient_id', 'case_grade']
#merged_total_score

In [206]:
#sim_orders_grade.columns

In [207]:
# GET STAR: TO RUN v4_data Script 
# RUN SCRIPT HERE TO GENERATE OUTPUT IN TRACKER OUTPUT
tracker_data_out = pd.read_csv(tracker_data + 'tracker_output/output.csv')

# preprocessing data columns 

tracker_data_out['sim_patient_id'] = tracker_data_out['patient']
tracker_data_out['sim_user_id'] = tracker_data_out['user']


In [208]:
# merge on grade orders 
sim_user['sim_name'] = sim_user['name']
tracker_user_join = pd.merge(tracker_data_out, sim_user, how='left', on=['sim_user_id'])


In [133]:
# read out physician response 

physician_response_out = pd.read_csv(physician_response + 'physician_responses2.csv')
physician_response_join = pd.merge(physician_response_out, sim_user, how='left', on=['sim_name'])
tracker_response_join = pd.merge(tracker_user_join, physician_response_join,  how='left', on=['sim_user_id'])



sim_user

In [134]:
#tracker_response_merge = pd.merge(tracker_user_join, physician_response_out, how='left', on=['name'])
def intersection(lst1, lst2): 
    return list(set(lst1) & set(lst2)) 
intersection(tracker_user_join.columns, physician_response_join.columns)

['sim_user_id', 'sim_name']

In [182]:
tracker_response_score_join = pd.merge(tracker_response_join, merged_total_score,  how='left', on=['sim_patient_id'])
#tracker_response_score_join

In [180]:
# write to folder 
tracker_response_score_join.to_csv(recommender_path + 'recommender_generated_outputs/tracker_response_grade.csv')