# Hospitalization cost variation master file

Date created: 1/23/23 <br>
Last updated: 2/20/23 <br>
Adapted from: healthrex_ml materials by Conor Corbin ([20230118_healthrex_ml_workshop](https://github.com/HealthRex/healthrex_ml/tree/main/examples/20230118_healthrex_ml_workshop))

**Table of Contents [Tentative]** <br>
0 Inputs and setup <br>
1 Cohort selection <br>
2 Feature extraction <br>
3 Preliminary data visualization <br>
4 Analysis <br>

Additional: 2.5 Data loading, feature selection, model evaluation (part of analysis?)

2/20/23 update: Need to install statsmodels: `python -m pip install statsmodels`

## 0 Inputs and setup
### 0.1 Global variables
Update for your project

In [None]:
# Your local home directory
user_id = 'selinapi'

# Source data projects and datasets
nero_gcp_project = 'som-nero-phi-jonc101-secure' # *** Label rest of these
cdm_project_id = 'som-nero-phi-jonc101'
cdm_dataset_id = 'shc_core_2021'

# NERO project and dataset where you are saving your data
work_project_id = nero_gcp_project
work_dataset_id = 'proj_IP_variation'

# Cohort dataset name
cohort_id = 'cohort_drg_221'

# Hours after admission date to set index time
index_lag = 24 # NOTE: as of 1/23/23, the lag is measured from the admission DATE (midnight), not yet from the admission TIME

# Thresholds
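eps = 1e-6 # Tolerance below which a coefficient is treated as 0
nz_vars = 0.04 # Minimum fraction of observations in which a feature must be nonzero to be kept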

# Control variables to run sections of code
run_cohortselection = 1 # Last run: 2/5/23
run_cohortchecks = 1
run_featurizer = 1 # Last run: 1/28/23

## Setup environment / credentials

In [None]:
from google.cloud import bigquery
import os
import pandas as pd
import sys
import yaml
import numpy as np
import math
import matplotlib.pyplot as plt

In [None]:
# GCP credentials for Mac: Ran steps linked here to create JSON credentials and file path (https://github.com/HealthRex/CDSS/blob/master/scripts/DevWorkshop/ReadMe.GoogleCloud-BigQuery-VPC.txt)
os.environ['GOOGLE_APPLICATION_CREDENTIALS'] = (
    f'/Users/{user_id}/.config/gcloud/application_default_credentials.json'
)
os.environ['GCLOUD_PROJECT'] = nero_gcp_project

# Instantiate a client object so you can make queries
client = bigquery.Client()

#  Create a dataset in project to write all our tables there (if it does not exist already)
client.create_dataset(f"{work_project_id}.{work_dataset_id}", exists_ok=True)

## 1 Cohort selection and outcome calculation

### 1.1 Define cohort of admissions based on DRG code
First pass: Creates dataset of admissions with APR-DRG of 221, 245, and 247 with the following columns (1/29/23 - See additions to DRG codes in code comments below)
1. anon_id : id of the patient 
2. observation_id : id of the ML example (observation)

Also merges outcome variable.<br>
Currently using direct cost; note: merges identifying data, be careful!! *** Add cost breakdowns later<br>
Note: true date from shc_map_2021 in highest security dataset + jitter = anonymized date in lower security dataset
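
The jitter relationship in the note above can be illustrated with a small pandas sketch. The column names (`mrn`, `jitter`, `AdmitDate`, `DischargeDate`) mirror those used in the cohort query, but all values are invented, and the jitter is assumed here to be a whole number of days per patient.

```python
import pandas as pd

# Toy stand-ins for the cost table and the ID/jitter map (values are invented)
costs = pd.DataFrame({
    "mrn": ["001", "002"],
    "AdmitDate": pd.to_datetime(["2020-03-01", "2020-07-15"]),
    "DischargeDate": pd.to_datetime(["2020-03-05", "2020-07-16"]),
})
id_map = pd.DataFrame({
    "mrn": ["001", "002"],
    "anon_id": ["JC1", "JC2"],
    "jitter": [-12, 7],  # assumption: per-patient shift in days
})

# Attach each patient's jitter, then shift both dates by the same amount
merged = costs.merge(id_map, on="mrn", how="left")
shift = pd.to_timedelta(merged["jitter"], unit="D")
merged["adm_date_jittered"] = merged["AdmitDate"] + shift
merged["disch_date_jittered"] = merged["DischargeDate"] + shift

# Length of stay is unchanged because both dates move together
los_true = (merged["DischargeDate"] - merged["AdmitDate"]).dt.days + 1
los_jittered = (merged["disch_date_jittered"] - merged["adm_date_jittered"]).dt.days + 1
print(merged[["anon_id", "adm_date_jittered", "disch_date_jittered"]])
```

Because the shift is constant within a patient, intervals such as LOS are preserved, which is why jittered cost dates can be matched against jittered admission dates.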


In [None]:
# Create dataset of gastrointestinal (GI) admissions with APR-DRG of 221, 245, and 247 with the following columns 
# *** Think of how to make this adaptable for different DRGs, or make cohort creation its own class that you call (like for Healthrex_ML)
# *** ASSUMPTION: sum costs that occur within the dates of the IP admission; ALTERNATE: Only sum costs where Inpatient_C == 'I'
# 1/29/23 UPDATE: Removed 221 because it was not being captured in the cost data, and added
#     230 (MAJOR SMALL BOWEL PROCEDURES, drg_id = 6267, code set 3)
#     231 (MAJOR LARGE BOWEL PROCEDURES, drg_id = 6268, code set 3)
#     Also found the following, but did not add them because they were vague or led to duplicate admissions (some admissions were coded with multiple DRG systems):
#     254 (OTHER DIGESTIVE SYSTEM DIAGNOSES, drg_id = 2447)
#     330 (MAJOR SMALL AND LARGE BOWEL PROCEDURES WITH CC, drg_id = 1836, code set 6)
#     329 (MAJOR SMALL AND LARGE BOWEL PROCEDURES WITH MCC, drg_id = 1835, code set 6)
# *** Think of ways to capture bowel procedures more effectively/in automated fashion, while being comprehensive and not introducing redundancies from admissions coded with multiple systems
# *** Are there any patients with 221 but NOT 230 or 231? (Context: Admissions were coded with both 221 and either 230 or 231, so I had to remove these duplicates in this first pass of the analysis)
# 2/5/23 UPDATE: Filtered to inpatient costs only
# 2/20/23 UPDATE: Only use 230 and 231 (removed "(b.drg_mpi_code IN ('245', '247') AND b.drg_id LIKE '2%') OR" from the WHERE clause)
query= """
    CREATE OR REPLACE TABLE
    `{work_project_id}.{work_dataset_id}.{cohort_id}` AS
    
    --Get anonymized admission DRG details
    WITH 
    DRG_adms AS 
    (
    SELECT DISTINCT
        a.anon_id, 
        a.pat_enc_csn_id_jittered as observation_id, 
        timestamp(date_add(CAST(a.hosp_adm_date_jittered as datetime), interval {index_lag} hour)) as index_time,
        a.hosp_adm_date_jittered as adm_date,
        a.hosp_disch_date_jittered as disch_date,
        TIMESTAMP_DIFF(a.hosp_disch_date_jittered, a.hosp_adm_date_jittered, DAY) + 1 as LOS,
        b.drg_mpi_code,
        b.drg_id,
        b.drg_name,
        b.DRG_CODE_SET_C
    FROM `{cdm_project_id}.{cdm_dataset_id}.f_ip_hsp_admission` a
    LEFT JOIN `{cdm_project_id}.{cdm_dataset_id}.drg_code` b
    ON a.anon_id = b.anon_id AND a.pat_enc_csn_id_jittered = b.pat_enc_csn_id_coded
    WHERE 
        (b.drg_mpi_code IN ('230', '231') AND b.drg_id LIKE '626%')
    ),

    --Link costs to anonymized ID
    SHC_costs AS
    (
    SELECT 
        b.anon_id,
        a.AdmitDate + b.jitter as adm_date_jittered,
        a.DischargeDate + b.jitter as disch_date_jittered,
        --a.VisitCount,
        a.MSDRGWeight,
        a.Inpatient_C,
        --a.ServiceCategory_C,
        a.Cost_Direct,
        a.Cost_Breakdown_Blood,
        a.Cost_Breakdown_Cardiac,
        a.Cost_Breakdown_ED,
        a.Cost_Breakdown_ICU,
        a.Cost_Breakdown_IICU,
        a.Cost_Breakdown_Imaging,
        a.Cost_Breakdown_Labs,
        a.Cost_Breakdown_Implants,
        a.Cost_Breakdown_Supplies,
        a.Cost_Breakdown_OR,
        a.Cost_Breakdown_OrganAcq,
        a.Cost_Breakdown_Other,
        a.Cost_Breakdown_PTOT,
        a.Cost_Breakdown_Resp,
        a.Cost_Breakdown_Accom,
        a.Cost_Breakdown_Pharmacy
    FROM `{nero_gcp_project}.shc_cost.costUB` a
    LEFT JOIN `{nero_gcp_project}.starr_map.shc_map_2021` b
    ON cast(a.mrn AS string) = b.mrn
    )

    --Join admission DRG details and costs by patient ID and overlapping dates (NOTE: manually add all cost variables you want to keep)
    SELECT DISTINCT
        a.*,
        SUM(b.Cost_Direct) OVER(PARTITION BY a.observation_id) AS Cost_Direct,
        SUM(b.Cost_Breakdown_Blood) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Blood,
        SUM(b.Cost_Breakdown_Cardiac) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Cardiac,
        SUM(b.Cost_Breakdown_ED) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_ED,
        SUM(b.Cost_Breakdown_ICU) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_ICU,
        SUM(b.Cost_Breakdown_IICU) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_IICU,
        SUM(b.Cost_Breakdown_Imaging) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Imaging,
        SUM(b.Cost_Breakdown_Labs) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Labs,
        SUM(b.Cost_Breakdown_Implants) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Implants,
        SUM(b.Cost_Breakdown_Supplies) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Supplies,
        SUM(b.Cost_Breakdown_OR) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_OR,
        SUM(b.Cost_Breakdown_OrganAcq) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_OrganAcq,
        SUM(b.Cost_Breakdown_Other) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Other,
        SUM(b.Cost_Breakdown_PTOT) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_PTOT,
        SUM(b.Cost_Breakdown_Resp) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Resp,
        SUM(b.Cost_Breakdown_Accom) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Accom,
        SUM(b.Cost_Breakdown_Pharmacy) OVER(PARTITION BY a.observation_id) AS Cost_Breakdown_Pharmacy,
        MAX(b.MSDRGWeight) OVER(PARTITION BY a.observation_id) AS MSDRGWeight
    FROM DRG_adms a
    LEFT JOIN SHC_costs b
    --ON a.anon_id = b.anon_id AND a.adm_date <= b.disch_date_jittered AND a.disch_date >= b.adm_date_jittered --Join by overlapping dates
    ON a.anon_id = b.anon_id AND a.adm_date <= b.adm_date_jittered AND b.disch_date_jittered <= a.disch_date --Join if cost dates are within IP admission dates
    WHERE b.Inpatient_C = 'I'
""".format_map({'cdm_project_id': cdm_project_id,
                'cdm_dataset_id': cdm_dataset_id,
                'nero_gcp_project': nero_gcp_project,
                'work_project_id': work_project_id,
                'work_dataset_id': work_dataset_id,
               'cohort_id': cohort_id,
               'index_lag': index_lag})

if run_cohortselection == 1:
    client.query(query).result();

### 1.2 Cohort explorations and sanity checks
Record of tests and checks

In [None]:
# Download cohort data temporarily for checks
if run_cohortchecks == 1:
    query = """
        SELECT 
            *
        FROM `{work_project_id}.{work_dataset_id}.{cohort_id}`
    """.format_map({'work_project_id': work_project_id,
                    'work_dataset_id': work_dataset_id,
                   'cohort_id': cohort_id})

    df = client.query(query).to_dataframe()

In [None]:
if run_cohortchecks == 1:
    # Cohort size
    print(df.shape) # 7428, 15. 1/29/23 Update: 6232. 2/5/23 Update: 1658 (patients with costs only)
    print(df["observation_id"].nunique())
    print(type(df))
    
    # Duplicate IP admissions?
    dups = df[df.duplicated(subset=['observation_id'], keep=False)].sort_values(by=['anon_id', 'adm_date'])
    print(dups)

    # Missing values
    print(df.isna().sum())
    
    # Costs: are there non-inpatient costs that overlap with an IP visit?

    # Number of IP admissions with costs
    
    # Min and max dates
    print(min(df["adm_date"]), max(df["adm_date"]))
    

In [None]:
# Test variance inflation factor for cost breakdown

from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

cost_data = df.iloc[:,10:26] # Pharmacy costs (included when index end is 27) introduces a lot of multicollinearity
cost_data = cost_data.astype(float)
cost_data = add_constant(cost_data)
cost_data = cost_data.fillna(0) # Treat missing cost components as 0 (np.NaN no longer exists in NumPy 2.x)
# vif_data = cost_data.columns
vif_data = [variance_inflation_factor(cost_data, i) for i in range(cost_data.shape[1])]# Source: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
vif_data = pd.Series(vif_data, index = cost_data.columns)
# vif_data = [variance_inflation_factor(cost_data, i) for i in range(1, 2)]# Source: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
vif_data

## 2 Feature extraction

### 2.1 Define a set of extractors

From Conor Corbin's Healthrex_ML API; extractor definitions [here]()

In [None]:
# UPDATE 2/8/22 Added and adapted Patient Problem Group and Medication Group Extractors by Yixing Jiang, removed original Patient and Medication Extractors, added expanded Procedure Extractor
if run_featurizer == 1:
    from healthrex_ml.extractors.starr_extractors import add_create_or_append_logic
    
    class PatientProblemGroupExtractor():
        """
        Defines logic to extract diagnoses on the patient’s problem list
        """
        def __init__(self, cohort_table_id, feature_table_id,
                     project_id='som-nero-phi-jonc101', dataset='shc_core_2021'):
            """
            Args:
                cohort_table_id: name of cohort table -- used to join to features
                feature_table_id: name of table the extracted features are written to
                project_id: name of project you are extracting data from
                dataset: name of dataset you are extracting data from
            """
            self.cohort_table_id = cohort_table_id
            self.feature_table_id = feature_table_id
            self.client = bigquery.Client()
            self.project_id = project_id
            self.dataset = dataset
        def __call__(self):
            """
            Executes the query and creates or appends to the feature table
            """
            query = f"""
            SELECT
                labels.observation_id,
                labels.index_time,
                '{self.__class__.__name__}' as feature_type,
                CAST(dx.start_date_utc as TIMESTAMP) as feature_time,
                GENERATE_UUID() as feature_id,
                ccsr.CCSR_CATEGORY_1 as feature,
                1 value
            FROM
                ({self.cohort_table_id}
                labels
            LEFT JOIN
                {self.project_id}.{self.dataset}.diagnosis dx
            ON
                labels.anon_id = dx.anon_id)
            LEFT JOIN
                mining-clinical-decisions.mapdata.ahrq_ccsr_diagnosis ccsr
            ON
                --dx.icd10 = ccsr.icd10
                REPLACE(dx.icd10, ".", "") = ccsr.icd10_string --Updated join that corrects missing matches due to extra periods in 3-digit entries of ccsr.icd10, but doesn't account for data with different ICD codes joined by commas
            WHERE
                CAST(dx.start_date_utc as TIMESTAMP) < labels.index_time
            AND
                source = 2 --problem list only
            """
            query = add_create_or_append_logic(query, self.feature_table_id)
            query_job = self.client.query(query)
            query_job.result()
            
    class MedicationGroupExtractor():
        """
        Defines logic to extract medication orders
        """
        def __init__(self, cohort_table_id, feature_table_id,
                     look_back_days=28, project_id='som-nero-phi-jonc101',
                     dataset='shc_core_2021'):
            """
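            Args:
                cohort_table_id: name of cohort table -- used to join to features
                feature_table_id: name of table the extracted features are written to
                look_back_days: only keep orders placed within this many days before index_time
                project_id: name of project you are extracting data from
                dataset: name of dataset you are extracting data from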
            """
            self.cohort_table_id = cohort_table_id
            self.look_back_days = look_back_days
            self.project_id = project_id
            self.dataset = dataset
            self.feature_table_id = feature_table_id
            self.client = bigquery.Client()
        def __call__(self):
            """
            Executes the query and creates or appends to the feature table
            """
            query = f"""
            SELECT DISTINCT
                labels.observation_id,
                labels.index_time,
                '{self.__class__.__name__}' as feature_type,
                meds.order_inst_utc as feature_time,
                CAST(meds.order_med_id_coded as STRING) as feature_id,
                meds.thera_class_abbr as feature,
                1 as value
            FROM
                {self.cohort_table_id}
                labels
            LEFT JOIN
                {self.project_id}.{self.dataset}.order_med meds
            ON
                labels.anon_id = meds.anon_id
            WHERE
                CAST(meds.order_inst_utc as TIMESTAMP) < labels.index_time
            AND
                TIMESTAMP_ADD(meds.order_inst_utc,
                              INTERVAL 24*{self.look_back_days} HOUR)
                              >= labels.index_time
            """
            query = add_create_or_append_logic(query, self.feature_table_id)
            query_job = self.client.query(query)
            query_job.result()
            
    class ProcedureExpandedExtractor():
        """
        Defines logic to extract procedure orders from order_proc, with additional order types
        """

        def __init__(self, cohort_table_id, feature_table_id,
                     look_back_days=28, project_id='som-nero-phi-jonc101',
                     dataset='shc_core_2021'):
            """
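            Args:
                cohort_table_id: name of cohort table -- used to join to features
                feature_table_id: name of table the extracted features are written to
                look_back_days: only keep orders placed within this many days before index_time
                project_id: name of project you are extracting data from
                dataset: name of dataset you are extracting data from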
            """
            self.cohort_table_id = cohort_table_id
            self.look_back_days = look_back_days
            self.project_id = project_id
            self.dataset = dataset
            self.feature_table_id = feature_table_id
            self.client = bigquery.Client()

        def __call__(self):
            """
            Executes the query and creates or appends to the feature table
            """
            query = f"""
            SELECT DISTINCT
                labels.observation_id,
                labels.index_time,
                '{self.__class__.__name__}' as feature_type,
                op.order_time_jittered_utc as feature_time,
                CAST(op.order_proc_id_coded as STRING) as feature_id,
                op.description as feature,
                1 as value
            FROM
                {self.cohort_table_id}
                labels
            LEFT JOIN
                {self.project_id}.{self.dataset}.order_proc op
            ON
                labels.anon_id = op.anon_id
            WHERE 
                order_type in ('Procedures', 'GI', 'Pathology', 'Surgical Procedures') --List is not comprehensive here, I just looked through BigQuery
            AND
                CAST(op.order_time_jittered_utc as TIMESTAMP) < labels.index_time
            AND
                TIMESTAMP_ADD(op.order_time_jittered_utc,
                              INTERVAL 24*{self.look_back_days} HOUR)
                              >= labels.index_time
            """
            query = add_create_or_append_logic(query, self.feature_table_id)
            query_job = self.client.query(query)
            query_job.result()
        
        
    from healthrex_ml.extractors import (
        AgeExtractor,
        RaceExtractor,
        SexExtractor,
        EthnicityExtractor,
#         ProcedureExtractor,
#         PatientProblemExtractor,
#         MedicationExtractor,
        LabOrderExtractor,
        LabResultBinsExtractor,
        FlowsheetBinsExtractor
    )

    USED_EXTRACTORS = [AgeExtractor,
        RaceExtractor,
        SexExtractor,
        EthnicityExtractor,
#         ProcedureExtractor,
#         PatientProblemExtractor,
#         MedicationExtractor,
        PatientProblemGroupExtractor,
        MedicationGroupExtractor,
        ProcedureExpandedExtractor,
        LabOrderExtractor,
        LabResultBinsExtractor,
        FlowsheetBinsExtractor
    ]

    cohort_table=f"{work_project_id}.{work_dataset_id}.{cohort_id}"
    feature_table=f"{work_project_id}.{work_dataset_id}.{cohort_id}_feature_matrix"
    extractors = [
        ext(cohort_table_id=cohort_table, feature_table_id=feature_table)
        for ext in USED_EXTRACTORS
    ]

### 2.2 Define a featurizer and create a feature matrix

Executes the SQL queries defined by the extractors to build a long-form feature matrix and saves it to BigQuery. It then reads the long-form feature matrix back in and builds a sparse (CSR) matrix without doing the expensive pivot operation, and saves the result locally. A train/test split is generated automatically, using the last year of data as the test set; use the `train_years` and `test_years` arguments of the `__init__` function to modify it.

Implementation of [BagOfWordsFeaturizer](https://github.com/HealthRex/healthrex_ml/blob/main/healthrex_ml/featurizers/starr_featurizers.py#L239)
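
To make the "no pivot" step concrete, here is a minimal sketch of building a CSR matrix directly from long-form rows. It is not the `BagOfWordsFeaturizer` implementation: the toy table, its column names, and the plain counts (no TF-IDF weighting) are assumptions made for illustration.

```python
import numpy as np
import pandas as pd
from scipy.sparse import csr_matrix

# Toy long-form feature table: one row per (observation, feature) event
long_df = pd.DataFrame({
    "observation_id": [101, 101, 102, 103, 103, 103],
    "feature": ["sex_F", "lab_A", "lab_A", "sex_F", "lab_B", "lab_B"],
    "value": [1, 1, 1, 1, 1, 1],
})

# Map observations to row positions and features to column positions
row_codes, row_ids = pd.factorize(long_df["observation_id"])
col_codes, col_names = pd.factorize(long_df["feature"])

# Build the matrix from (value, (row, col)) triplets; duplicate pairs are summed
X = csr_matrix(
    (long_df["value"].to_numpy(), (row_codes, col_codes)),
    shape=(len(row_ids), len(col_names)),
)
print(list(col_names))
print(X.toarray())
```

Only the nonzero entries are ever stored, so memory grows with the number of feature events rather than with observations times features, which is what a dense pivot would require.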

In [None]:
if run_featurizer == 1:
    from healthrex_ml.featurizers import BagOfWordsFeaturizer

    featurizer = BagOfWordsFeaturizer(  cohort_table_id   = cohort_table,
                                        feature_table_id  = feature_table,
                                        extractors        = extractors,
                                        outpath           = f"./{cohort_id}_artifacts",
                                        tfidf             = True
                                )

    featurizer()

## 3 Feature selection and aggregation

### 3.1 Read in data

In [None]:
from sklearn.linear_model import LinearRegression
from scipy.sparse import load_npz
from scipy.sparse.linalg import lsqr

# Read in features
features = pd.read_csv(os.path.join(f"./{cohort_id}_artifacts/feature_order.csv"))

# Read in train data
X_train_full = load_npz(os.path.join(f"./{cohort_id}_artifacts/train_features.npz"))
y_train_full = pd.read_csv(os.path.join(f"./{cohort_id}_artifacts/train_labels.csv"))

# Remove any rows with missing labels (for censoring tasks)
task = 'Cost_Direct' # **** Make a global var?
observed_inds_train = y_train_full[~y_train_full[task].isnull()].index
X_train = X_train_full[observed_inds_train]
y_train = y_train_full.iloc[observed_inds_train].reset_index()
y_train = y_train[task]

### 3.2 Remove uncommon features
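As a small illustration of the filter used in this section, the sketch below applies the same `getnnz`-based rule to a toy matrix. The matrix and the 0.25 cutoff are invented; the notebook itself uses `nz_vars`.

```python
import numpy as np
from scipy.sparse import csr_matrix

# Toy matrix: 5 observations x 4 features
X = csr_matrix(np.array([
    [1, 0, 0, 2],
    [1, 0, 0, 0],
    [1, 1, 0, 0],
    [1, 0, 0, 3],
    [1, 0, 0, 0],
]))
threshold = 0.25  # hypothetical cutoff for this example

# Fraction of observations in which each feature is nonzero
nz_fraction = X.getnnz(axis=0) / X.shape[0]

# Keep only the columns populated above the threshold
keep = np.flatnonzero(nz_fraction > threshold)
X_reduced = X[:, keep]
print(nz_fraction, keep, X_reduced.shape)
```

The same `keep` indices have to be applied to the test matrix and to the feature-name table so that columns stay aligned.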

In [None]:
# Calculate percentage of each column with data and subset to variables populated above a certain threshold
X_nz_pcts = X_train.getnnz(axis=0)/X_train.shape[0]
plt.hist(X_nz_pcts, bins=25);
nz_inds = np.argwhere(X_nz_pcts > nz_vars).reshape(-1,)

# *** NOTE: Not good practice to change variable after first section it's created - revise later
X_train = X_train.tocsr()[:,nz_inds]

### 3.3 Read test data

In [None]:
# Read in test data
X_test_full = load_npz(os.path.join(f"./{cohort_id}_artifacts/test_features.npz"))
y_test_full = pd.read_csv(os.path.join(f"./{cohort_id}_artifacts/test_labels.csv"))

# Remove any rows with missing labels (for censoring tasks)
task = 'Cost_Direct' # **** Make a global var?
observed_inds_test = y_test_full[~y_test_full[task].isnull()].index
X_test = X_test_full[observed_inds_test]
y_test = y_test_full.iloc[observed_inds_test].reset_index()
y_test = y_test[task]

# Remove uncommon features
X_test = X_test.tocsr()[:,nz_inds]

In [None]:
# Create full datasets
# Reference: https://cmdlinetips.com/2019/07/how-to-slice-rows-and-columns-of-sparse-matrix-in-python/
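# pd.concat replaces DataFrame.append / Series.append, which were removed in pandas 2.0
drg_los_outcomes = pd.concat([y_train_full, y_test_full])
y_full = pd.concat([y_train, y_test])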
len(y_full)



### 3.4 Baseline characteristics (IN DEVELOPMENT)
Potentially use TableOne package to summarize and display? https://academic.oup.com/jamiaopen/article/1/1/26/5001910

In [None]:
query= """
   SELECT DISTINCT
        a.anon_id,
        a.observation_id,
        a.drg_name,
        DATE_DIFF(
                CAST(a.index_time AS date), demo.BIRTH_DATE_JITTERED, YEAR)
            as age,
        enc.acuity_level
    FROM `{work_project_id}.{work_dataset_id}.{cohort_id}` a
    LEFT JOIN `{cdm_project_id}.{cdm_dataset_id}.demographic` demo
    ON a.anon_id = demo.anon_id --Assumes demographic is keyed by anon_id
    LEFT JOIN `{cdm_project_id}.{cdm_dataset_id}.encounter` enc
    ON a.observation_id = enc.pat_enc_csn_id_coded
""".format_map({'cdm_project_id': cdm_project_id,
                'cdm_dataset_id': cdm_dataset_id,
                'nero_gcp_project': nero_gcp_project,
                'work_project_id': work_project_id,
                'work_dataset_id': work_dataset_id,
               'cohort_id': cohort_id,
               'index_lag': index_lag})

bl_chars = client.query(query).to_dataframe()

In [None]:
# Demographics
demog = features[features['features'].str.startswith(('race_', 'sex_', 'Age_', 'eth'))]
demog = demog.sort_values(by='features')
print(demog)



from scipy import sparse
# X_full = vstack(X_train_full, X_test_full)
X_demog = X_train_full.tocsr()[:,demog['indices']].todense()
X_demog = np.concatenate((X_demog, X_test_full.tocsr()[:,demog['indices']].todense()))
# X_demog_summary = 

# All data, data with costs, training data, test data

### 3.5 Cost visualizations

In [None]:
# Histogram of all costs
plt.hist(y_full, bins=20);
plt.xlabel('Total hospitalization cost (USD)');
plt.ylabel('Number of hospitalizations');

In [None]:
# Histogram of costs within Xth percentile
percentile = 90
index = math.floor((percentile / 100)*len(y_full))
y_subset = y_full.sort_values()
y_subset = y_subset[:index]
plt.hist(y_subset, bins=20);
plt.xlabel('Total hospitalization cost (USD)');
plt.ylabel('Number of hospitalizations');

In [None]:
# LOS vs. costs - pretty linear correlation, with some outliers?
plt.scatter(drg_los_outcomes['LOS'], drg_los_outcomes['Cost_Direct'], alpha=0.2)
plt.xlabel('Length of stay (days)')
plt.ylabel('Total hospitalization cost (USD)')

In [None]:
# LOS vs. costs - stratified by DRG
# How to: https://stackoverflow.com/questions/21654635/scatter-plots-in-pandas-pyplot-how-to-plot-by-category
# Also: https://www.statology.org/matplotlib-scatterplot-color-by-value/
# **** 1/29/23 This was how I found out that there aren't really any major small & large bowel procedures coded DRG 221 in the cost data

import matplotlib.pyplot as plt
groups = drg_los_outcomes.groupby('drg_name')
fig, ax = plt.subplots()
ax.margins(0.05) # Optional, just adds 5% padding to the autoscaling
for name, group in groups:
    ax.plot(group.LOS, group.Cost_Direct, marker='o', linestyle='', ms=3, label=name, alpha=0.3)
ax.legend()
plt.xlabel('Length of stay (days)');
plt.ylabel('Total hospitalization cost (USD)');

plt.show()

In [None]:
# Histogram of cost per day
plt.hist(drg_los_outcomes['Cost_Direct']/drg_los_outcomes['LOS'], bins=20)
plt.xlabel('Average cost per day (USD)');
plt.ylabel('Number of hospitalizations');

In [None]:
# CDF of costs
plt.hist(drg_los_outcomes['Cost_Direct'], density=True, cumulative=True, label='CDF', histtype='step', alpha=0.8, color='k', bins=len(drg_los_outcomes['Cost_Direct']));
plt.xlim([0, 100000]);
plt.xlabel('Total hospitalization cost (USD)');
plt.ylabel('CDF');

In [None]:
# CDF of LOS
plt.hist(drg_los_outcomes['LOS'], density=True, cumulative=True, label='CDF', histtype='step', alpha=0.8, color='k', bins=len(drg_los_outcomes['LOS']));
plt.xlabel('LOS');
plt.ylabel('CDF');

In [None]:
# Correlation between costs and LOS
from scipy.stats import pearsonr
pcorr, _ = pearsonr(drg_los_outcomes['LOS'], drg_los_outcomes['Cost_Direct'])

from scipy.stats import spearmanr
scorr, _ = spearmanr(drg_los_outcomes['LOS'], drg_los_outcomes['Cost_Direct'])

print("Pearson's correlation", pcorr)
print("Spearman's correlation", scorr)

In [None]:
?pearsonr


## 4 Analysis

### 4.2 Linear regression
Sparse linear regression: https://docs.scipy.org/doc/scipy/reference/generated/scipy.sparse.linalg.lsqr.html
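
A toy sketch of why an under-determined least-squares fit looks nearly perfect on the training data: with more features than observations, `lsqr` can reproduce almost any outcome vector, even one unrelated to the features. The sizes, seed, and simulated "costs" below are arbitrary assumptions.

```python
import numpy as np
from scipy.sparse import random as sparse_random
from scipy.sparse.linalg import lsqr

rng = np.random.default_rng(0)

# Toy under-determined system: 20 observations, 100 features
X = sparse_random(20, 100, density=0.5, format="csr", random_state=0)
y = rng.normal(loc=10_000, scale=2_000, size=20)  # outcome generated independently of X

# Least-squares solution (first element of the tuple returned by lsqr)
coef = lsqr(X, y)[0]
y_hat = X @ coef

# Training R^2 = 1 - RSS / TSS
rss = np.sum((y - y_hat) ** 2)
tss = np.sum((y - y.mean()) ** 2)
r2_train = 1 - rss / tss
print(r2_train)
```

A training R^2 close to 1 here says nothing about predictive value, so the test-set error is the number to watch.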

In [None]:
x, istop, itn, r1norm, r2norm = lsqr(X_train, y_train)[:5]

# **** 1/29/23 Note: system is under-determined (many more features than data points), so there is a near perfect fit....

In [None]:
y_hat = X_train@x

# Maximum and average absolute percentage prediction error
print(max(abs(y_hat - y_train)/y_train))
print(sum(abs(y_hat - y_train)/y_train)/len(y_train))

plt.scatter(y_train, y_hat, alpha=0.2)

In [None]:
# Calculate R^2
RSS = sum((y_hat - y_train)**2)
TSS = sum((y_train - sum(y_train)/len(y_train))**2)
coef_of_det = 1 - RSS / TSS
print(coef_of_det)

In [None]:
y_hat_test = X_test@x
print(max(abs(y_hat_test - y_test)/y_test))
print(sum(abs(y_hat_test - y_test)/y_test)/len(y_test))
# print(r2norm)
# *** Find regression evaluation approach robust to outliers


plt.scatter(y_test, y_hat_test, alpha=0.2)

### 4.3 LASSO
Read more here: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.Lasso.html <br>
With cross-validation: https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LassoCV.html
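
To show why LASSO sets many coefficients to exactly 0, here is a plain-NumPy sketch that minimizes the same objective form, (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1, by proximal gradient descent with soft-thresholding. This is not scikit-learn's solver (which uses coordinate descent), and the data, `alpha`, and iteration count are made up for illustration.

```python
import numpy as np

def soft_threshold(z, t):
    """Shrink each entry of z toward 0 by t; entries smaller than t become exactly 0."""
    return np.sign(z) * np.maximum(np.abs(z) - t, 0.0)

def lasso_ista(X, y, alpha, n_iter=2000):
    """Proximal gradient descent for (1 / (2n)) * ||y - Xw||^2 + alpha * ||w||_1."""
    n, p = X.shape
    step = n / np.linalg.norm(X, ord=2) ** 2  # 1 / Lipschitz constant of the gradient
    w = np.zeros(p)
    for _ in range(n_iter):
        grad = X.T @ (X @ w - y) / n
        w = soft_threshold(w - step * grad, step * alpha)
    return w

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 10))
true_w = np.array([3.0, -2.0] + [0.0] * 8)  # only two informative features
y = X @ true_w + rng.normal(scale=0.1, size=200)

w = lasso_ista(X, y, alpha=0.1)
print(np.round(w, 2))
```

The thresholding step is what produces exact zeros; the surviving coefficients are shrunk toward 0 by roughly `alpha`, which is one reason to choose `alpha` by cross-validation.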

In [None]:
from sklearn import linear_model
clf = linear_model.Lasso(alpha=0.1)
clf.fit(X_train, y_train)

print(clf.coef_)

print(clf.intercept_)

In [None]:
# Try increasing tolerance and iterations # 2/20/23 Update: Use cross-validation
# clf2 = linear_model.Lasso(alpha=0.1, max_iter=5000, tol=0.01, selection='random')
# clf2.fit(X_train, y_train)
from sklearn.linear_model import LassoCV
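# NOTE: the `normalize` argument was removed in scikit-learn 1.2, so it is no longer passed; scale features beforehand if needed
clf2 = LassoCV(cv=10, random_state=0, max_iter=5000, tol=0.01, selection='random').fit(X_train, y_train)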
clf2.score(X_train, y_train)

print(clf2.coef_)

print(clf2.intercept_)

In [None]:
np.mean(np.abs(clf2.coef_) < eps) # Fraction of coefficients that are ~0. OK, ~90% of features are 0; 2/8/22 update: with reduced features ~50% are 0

In [None]:
# Bind features and coefficients
feature_coefs = features.copy().iloc[nz_inds]
feature_coefs['coefs'] = clf2.coef_.tolist()
feature_coefs

In [None]:
# Features with nonzero LASSO coefficients
nonzero_features = feature_coefs[abs(feature_coefs['coefs']) >= eps]
# nonzero_features
nonzero_features['features']

In [None]:
# Evaluate training accuracy by percentage error
y_hat = clf2.predict(X_train)
pct_error = abs(y_hat - y_train)/y_train
min(pct_error), max(pct_error), sum(pct_error)/len(pct_error)

In [None]:
# Calculate R^2
RSS = sum((y_hat - y_train)**2)
TSS = sum((y_train - sum(y_train)/len(y_train))**2)
coef_of_det = 1 - RSS / TSS
print(coef_of_det)

In [None]:
# Plot absolute training error
import matplotlib.pyplot as plt
plt.scatter(y_train, pct_error, alpha = 0.2)
plt.xlabel('True cost (USD)')
plt.ylabel('Absolute percentage error')
plt.show()

In [None]:
# Plot absolute training error (log-log)
plt.scatter(y_train, pct_error, alpha = 0.2)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('True cost (USD, log scale)')
plt.ylabel('Absolute percentage error (log scale)')
plt.show()

In [None]:
# Evaluate test accuracy by percentage error
y_hat_test = clf2.predict(X_test)
pct_error_test = abs(y_hat_test - y_test)/y_test
min(pct_error_test), max(pct_error_test), sum(pct_error_test)/len(pct_error_test)

In [None]:
# Plot absolute testing error
plt.scatter(y_test, pct_error_test, alpha = 0.2)
plt.xlabel('True cost (USD)')
plt.ylabel('Absolute percentage error')
plt.show()

In [None]:
# Plot absolute testing error (log-log)
plt.scatter(y_test, pct_error_test, alpha = 0.2)
plt.xscale('log')
plt.yscale('log')
plt.xlabel('True cost (USD, log scale)')
plt.ylabel('Absolute percentage error (log scale)')
plt.show()

### 4.4 Variance Inflation Factor
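For reference, the VIF of feature j is 1 / (1 - R_j^2), where R_j^2 comes from regressing feature j on all the other features. The sketch below computes it with plain NumPy on invented data; unlike the statsmodels function used in this notebook, this helper (a hypothetical name, `vif`) adds its own intercept.

```python
import numpy as np

def vif(X, j):
    """VIF of column j: 1 / (1 - R^2) from regressing column j on the other columns plus an intercept."""
    target = X[:, j]
    others = np.delete(X, j, axis=1)
    design = np.column_stack([np.ones(len(target)), others])
    beta, *_ = np.linalg.lstsq(design, target, rcond=None)
    resid = target - design @ beta
    r2 = 1 - resid @ resid / np.sum((target - target.mean()) ** 2)
    return 1.0 / (1.0 - r2)

rng = np.random.default_rng(0)
a = rng.normal(size=500)
b = rng.normal(size=500)
c = a + b + rng.normal(scale=0.1, size=500)  # nearly a linear combination of a and b

X_collinear = np.column_stack([a, b, c])
X_independent = np.column_stack([a, b])

print([vif(X_collinear, j) for j in range(3)])
print([vif(X_independent, j) for j in range(2)])
```

Values near 1 indicate little collinearity; a common rule of thumb treats values above 5 or 10 as a warning sign, which is the pattern expected when cost components nearly sum to one another.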

In [None]:
vif_data

In [None]:
features[:10]

In [None]:
# Source: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
from statsmodels.stats.outliers_influence import variance_inflation_factor
from statsmodels.tools.tools import add_constant

n = 15 # Number of features to check (assumes X_train has at least n columns)
# Column 0 of x_data is the added constant, so feature i sits at column i + 1
x_data = add_constant(X_train.toarray(), has_constant='add')
vif_data = [variance_inflation_factor(x_data, i) for i in range(1, n + 1)] # Source: https://www.geeksforgeeks.org/detecting-multicollinearity-with-vif-python/
# Label with the names of the features kept after removing uncommon features
vif_data = pd.Series(vif_data, index=features.iloc[nz_inds]['features'].values[:n])
vif_data


In [None]:
X_train.todense().shape

### 4.5 Ridge regression

In [None]:
from sklearn.linear_model import Ridge

# Fit ridge regression on the training data
clf = Ridge(alpha=1.0)
clf.fit(X_train, y_train)

### Model evaluation (NOT USED - From Conor's healthrex_ml example program)

### Train a set of gradient boosted trees

Implementation of [LightGBMTrainer](https://github.com/HealthRex/healthrex_ml/blob/main/healthrex_ml/trainers/sklearn_trainers.py#L23)

In [None]:
from healthrex_ml.trainers import LightGBMTrainer # SP 1/18/23 Grace replaced with BaselineModelTrainer (uses random forest, since LightGBMTrainer was causing a segmentation fault)

trainer = LightGBMTrainer(working_dir=f"./{cohort_id}_artifacts")
tasks = ['Cost_Direct']

for task in tasks:
    trainer(task)

### Evaluate model performance on test set and dump 

Implementation of [BinaryEvaluator](https://github.com/HealthRex/healthrex_ml/blob/main/healthrex_ml/evaluators/evaluators.py#L21) 

In [None]:
from healthrex_ml.evaluators import BinaryEvaluator
from tqdm import tqdm

for task in tqdm(tasks):
    evalr = BinaryEvaluator(
        outdir=f"./{cohort_id}_artifacts/{task}_performance_artifacts/", # RUN_NAME is not defined in this notebook; cohort_id matches the trainer's working_dir
        task_name=task
    )
    df_yhats = pd.read_csv(os.path.join(trainer.working_dir, f"{task}_yhats.csv"))
    evalr(df_yhats.labels, df_yhats.predictions)