### Libraries as necessary

In [142]:
import pandas as pd
import numpy as np 

import glob # reading multiple files in the same directory
import ast # because multiple medications on the same day are kept as list, they are read as String if you were to export and import again, use ast.literal_eval to read them as list again

from tqdm import tqdm # progress bar
tqdm.pandas()

import shutup # it just makes me feel better
shutup.please()

### ICD-10 criteria

ICD-10 codes used to identify are I10, I11, I12, I13 and I15. 

``str(x).startswith(tuple(icd10))`` is used because ICD-10 code is Alpha character followed by multiple numerical characters (A12344). So, ``startswith`` catches all I10, I100, I101, I1010 and so on.

This is the required input format. 

| Primary Key | Date | ICD code |
|-|-|-|
| A | 2023-01-01 | (not relevant) |
| A | 2023-01-10 | (relevant) |
| A | 2023-01-15 | (relevant) |
| A | 2023-02-01 | (relevant) |

The first date of diagnosis for each of these codes is considered as date of HT diagnosis, so only the earliest date is kept.
| Primary Key | Date | Drugcode |
|-|-|-|
| A | 2023-01-10 | (relevant) |

In [143]:
# Diagnosis

icd10 = ["I10", "I11", "I12", "I13", "I15"]


files = glob.glob(r"path/*.csv")
read = []
for file in tqdm(files):
    temp = pd.read_csv(file)
    read.append(temp)

dx = pd.concat(read)

flag = dx["D035KEY"].apply(lambda x: str(x).startswith(tuple(icd10)))
dx = dx[flag]
dx = dx[["ENC_HN", "D001KEY", "D035KEY"]]
dx["D001KEY"] = pd.to_datetime(dx["D001KEY"], format="%Y%m%d")
dx = dx[dx["D001KEY"]>= pd.to_datetime("20230101", format="%Y%m%d")]
dx = dx.sort_values(["ENC_HN", "D001KEY"])
dx = dx.loc[dx.groupby("ENC_HN").D001KEY.idxmin()] # keeping earliest date
dx = dx.drop_duplicates("ENC_HN", keep="first")

  0%|          | 0/2 [00:00<?, ?it/s]

100%|██████████| 2/2 [00:05<00:00,  2.84s/it]


### Medication criteria

The same goes for medication. We identified the medications prescribed in Ramathibodi Hospital and filtered for relevant drug classes. The mapping table was prepared by Dr Kunlawat Thadanipon and Dr Thosaphol Limpijankit.

In [144]:
"Medication"

key = pd.read_excel(r"path\map_list\cohort_anti_HT.xlsx")
key.head()

Unnamed: 0,ramadrugcode,drugname,combination,combinationsplit,atc,atcdeesc,id,drugclass
0,AMLP1T-,Amlodipine 5 mg,Amlodipine besylate,Amlodipine besylate,C08CA,Dihydropyridine derivatives,1,CCB_DHP
1,AMLP2T-,Amlodipine 10 mg,Amlodipine besylate,Amlodipine besylate,C08CA,Dihydropyridine derivatives,2,CCB_DHP
2,NORV1T-,Norvasc 5 mg,Amlodipine besylate,Amlodipine besylate,C08CA,Dihydropyridine derivatives,3,CCB_DHP
3,NORV2T-,Norvasc 10 mg,Amlodipine besylate,Amlodipine besylate,C08CA,Dihydropyridine derivatives,4,CCB_DHP
4,TENM1T-,Tenormin 50 mg,Atenolol,Atenolol,C07AB,"Beta blocking agents, selective",5,beta_blocker


The following code block is needlessly complicated due to the difference in data format of medication data in Ramathibodi Hospital.
This is the required input format. 

| Primary Key | Date | Drugcode |
|-|-|-|
| A | 2023-01-01 | (not relevant) |
| A | 2023-01-10 | (relevant) |
| A | 2023-01-10 | (relevant) |
| A | 2023-02-01 | (relevant) |

Essentially the goal is to identify the transactions where relevant Anti Hypertensive medication was given, and keep the earliest date. Please note that multiple relevant medications can be prescribed on the same day.

| Primary Key | Date | Drugcode |
|-|-|-|
| A | 2023-01-10 | (relevant) |
| A | 2023-01-10 | (relevant) |


In [145]:
files = (glob.glob(r"path/*csv"))

read = []
for file in tqdm(files):
    try:
        temp = pd.read_csv(file)
    except BaseException:
        temp = pd.read_csv(file, sep=r'$', quotechar=r'"', encoding='utf-8', engine= "python", dtype= str, error_bad_lines=False, warn_bad_lines=False)
        temp = temp.loc[:, ~temp.columns.str.startswith("Unnamed")]
    if "CODE" in temp.columns:
        temp = temp[["ENC_HN", "REC_DATE", "CODE"]]
        temp["REC_DATE"] = pd.to_datetime(temp["REC_DATE"], format="%Y%m%d")
        temp.columns = ["ENC_HN", "D001KEY", "CODE"]
    elif "DSPCode" in temp.columns:
        temp = temp[["ENC_HN", "PerformDate", "DSPCode"]]
        temp["PerformDate"] = pd.to_datetime(temp["PerformDate"]).dt.normalize()
        temp.columns = ["ENC_HN", "D001KEY", "CODE"]
    elif "Drugcode" in temp.columns:
        temp = temp[["ENC_HN", "BillDate", "Drugcode", "OrderDate"]]
        temp["BillDate"] = temp["BillDate"].fillna(temp["OrderDate"])
        temp = temp.drop("OrderDate", 1)
        temp["BillDate"] = pd.to_datetime(temp["BillDate"]).dt.normalize()
        temp.columns = ["ENC_HN", "D001KEY", "CODE"]
    # selected medications
    temp = temp[temp["CODE"].isin(key["ramadrugcode"])]
    temp = temp[temp["D001KEY"] == temp.groupby("ENC_HN")["D001KEY"].transform("min")] # keeping earliest date
    read.append(temp)

med = pd.concat(read)

med = med[med["D001KEY"]>= pd.to_datetime("20230101", format="%Y%m%d")]
med = med.sort_values(["ENC_HN", "D001KEY"])
med = med[med["D001KEY"] == med.groupby("ENC_HN")["D001KEY"].transform("min")]

med = med.groupby(["ENC_HN", "D001KEY"])["CODE"].apply(lambda x: list(set(x))).reset_index() # you need this


100%|██████████| 1/1 [00:21<00:00, 21.05s/it]


### New cases of Hypertension

This step is not necessary for the first time. For me, I am updating my cohort so subjects already in the cohort are removed.

In [146]:
# because it is an update, I remove old cases.

old_hn = pd.read_csv(r"path\hthn_2010_2022.csv")

new_dx = dx[~dx.ENC_HN.isin(old_hn.ENC_HN)].reset_index(drop=True)
new_dx.ENC_HN.nunique() # 5941

new_med = med[~med.ENC_HN.isin(old_hn.ENC_HN)].reset_index(drop=True)
new_med.ENC_HN.nunique() # 7998

new_dx.to_pickle(r"path/new_case/icd10_hn.pkl")
new_med.to_pickle(r"path/new_case/med_hn.pkl")

### Merging for tentative number of cases

This is diagnosis data.
| Primary Key | Date | ICD |
|-|-|-|
| A | 2023-01-09 | (icd code) |
| C | 2023-02-13 | (icd code) |

This is medication data.
| Primary Key | Date | MED |
|-|-|-|
| A | 2023-01-10 | (med code) |
| B | 2023-01-13 | (med code) |
| C | 2023-02-13 | (med code) |

This is merging for tentative number of cases. While the patient might be in both ICD and Medication groups, we identify by the first date of diagnosis.

| Primary Key | Bill_Date | Med_Date | First_Date | flag |
|-|-|-|-|-|
| A | 2023-01-09 | 2023-01-10 | 2023-01-09 | icd |
| B | | 2023-01-13 | 2023-01-13 | med |
| C | 2023-02-13 | 2023-02-13 | 2023-02-13 | both |

In these subjects, subject A is considered ICD only, subject B as Anti-H only and C as both ICD-10 and Anti-H group.



In [147]:
new_dx.columns = ["ENC_HN", "ICD_DATE", "ICD"]
new_med.columns = ["ENC_HN", "MED_DATE", "MED"]

data = new_dx.merge(new_med, on = ["ENC_HN"], how="outer")
data["D001KEY"] = data[["ICD_DATE", "MED_DATE"]].min(1) # first date of HT is whichever date that happen first.
data["flag"] = np.where(data["D001KEY"]!=data["ICD_DATE"], "med", # if ICD does not happen first, it is medication group.
                        np.where(data["D001KEY"]!=data["MED_DATE"], "icd", # if medication does not happen first, it is ICD group.
                                 "both")) # otherwise, it is both

print("No of possible HT subjects are {:,}.".format(data.ENC_HN.nunique())) # 10645

for flag in ["icd", "both", "med"]:
    print("For {} group, the number of subjects is {:,}.".format(flag, len(data[data.flag==flag])))

data.to_pickle(r"path/new_case/merged_hn_inferred.csv.pkl")

No of possible HT subjects are 10,645.
For icd group, the number of subjects is 3,429.
For both group, the number of subjects is 1,060.
For med group, the number of subjects is 6,156.


### Filtering the medication only cases

Since medications are identified using ramadrugcodes, they are transformed into generic drug classes using the previously shown mapping table.

``ast.literal_eval`` is required if the data file is imported again since Pandas does not import ``dtype`` as ``list``.

In [152]:
key = pd.read_excel(r"path\map_list\cohort_anti_HT.xlsx")
key = key.set_index("ramadrugcode")["drugclass"].to_dict()

med_only = data[data["flag"]=="med"].reset_index(drop=True) # med only group

try:
    med_only["drugclass"] = med_only["MED"].progress_apply(lambda x: [key[i] for i in x])
except:
    med_only["drugclass"] = med_only["MED"].progress_apply(lambda x: [key[i] for i in ast.literal_eval(x)])

med_only["drugclass"] = med_only["drugclass"].progress_apply(lambda x: list(set(x)))
print("\nMapping ramadrugcodes such as {} to generic drugclass such as {}".format(med_only["MED"][0], med_only["drugclass"][0]))
med_only[["MED", "drugclass"]].head()

100%|██████████| 6156/6156 [00:00<00:00, 558128.39it/s]
100%|██████████| 6156/6156 [00:00<00:00, 684049.58it/s]


Mapping ramadrugcodes such as ['BISS1T-', 'AMLP2T-'] to generic drugclass such as ['CCB_DHP', 'beta_blocker']





Unnamed: 0,MED,drugclass
0,"[BISS1T-, AMLP2T-]","[CCB_DHP, beta_blocker]"
1,[LOSL-T-],[ARB]
2,[ENAR-T-],[ACEI]
3,"[LOSL-T-, AMLP1T-]","[CCB_DHP, ARB]"
4,"[CRDP2I-, CRDP1I-, AVEX1I-]","[CCB_DHP, alpha_beta_blocker]"


Other indications for specific drug classes were prepared by Dr Kunlawat Thadanipon and Dr Thosaphol Limpijankit. The function ``check_history`` look for other indications in patient's history using ICD-10 codes.

The input is list of medications and the date of prescription to check.

| Primary Key | Date | drugclass |
|-|-|-|
| A | 2023-01-10 | [drug A, drug B] |

If they have corresponding history, the output is tuple of indication and medication.

| Primary Key | Date | drugclass | Check History|
|-|-|-|-|
| A | 2023-01-10 | [drug A, drug B] | [drug A, (other indication on 2022-10-04, drug B)] |

If the patient have at least one Anti-H that does not correspond with other indication, we include them.

In [159]:
# Other indication
indi = pd.read_excel(r"path\map_list\medication_map_2301.xlsx", sheet_name="other indications")
indi["ICD-10"] = indi["ICD-10"].apply(lambda x: [("".join(i.strip().split("."))).ljust(4, "0") for i in x.split(",")])

files = (glob.glob(r"path/*.csv"))

read = []
for file in files:
    temp = pd.read_csv(file)
    temp = temp[temp.ENC_HN.isin(med_only.ENC_HN)]
    temp = temp[temp.D035KEY.isin(indi["ICD-10"].sum())]
    temp["D001KEY"] = pd.to_datetime(temp["D001KEY"], format="%Y%m%d")
    temp = temp[["ENC_HN", "D001KEY", "D035KEY"]]
    read.append(temp)

dx = pd.concat(read)

def check_history(drugclass, hn, dx = dx, indimap = indi):
    res = drugclass
    if (hn in list(dx.ENC_HN)) & (drugclass in list(indimap["Drug Class"])):
        oth_indi = indimap[indimap["Drug Class"]==drugclass]
        cause = oth_indi["Other Indications"].values[0]
        oth_indi = oth_indi["ICD-10"].values[0]
        dx = dx[dx.D035KEY.isin(oth_indi)]
        if len(dx)>0:
            dx = dx[dx.D001KEY==dx.D001KEY.min()]["D001KEY"].astype(str).values[0]
            res = ("{} on {}".format(cause, dx), drugclass)
        else:
            pass
    else:
        pass
    return res

med_only["check_history"] = med_only.progress_apply(lambda row: [check_history(i, row["ENC_HN"]) for i in row["drugclass"]], axis=1)

print("For example, this is the list of medications to check {} and \nthis is the corresponding history {}.".format(med_only.loc[2442,"drugclass"], med_only.loc[2442,"check_history"]))

100%|██████████| 6156/6156 [00:00<00:00, 13210.32it/s]

For example, this is the list of medications to check ['CCB_nonDHP', 'beta_blocker'] and 
this is the corresponding history [('Arrhythmias on 2010-01-11', 'CCB_nonDHP'), ('Hyperthyroidism on 2010-04-12', 'beta_blocker')].





In [172]:
# Boolean column to drop rows
# Some patients are given more than one medication and not all of them have other indications, we include them.
# So, I use minimum of the list.
med_only["check2"] = med_only["check_history"].apply(lambda x: min([1 if isinstance(i, tuple) else 0 for i in x]))

print("\nFor example patient with {}, we will include so boolean for drop is {}.".format(med_only.loc[3236,"check_history"], med_only.loc[3236,"check2"]))
print("For example patient with {}, we will exclude so boolean for drop is {}.".format(med_only.loc[2442,"check_history"], med_only.loc[2442,"check2"]))
print("Medication only patients with history of other indications is {}.".format(sum(med_only["check2"]>0)))

med_only[["drugclass", "check_history", "check2"]].sample(5, random_state=2023)


For example patient with ['CCB_DHP', 'alpha2_agonist', 'ARB', ('Hyperthyroidism on 2010-04-12', 'beta_blocker')], we will include so boolean for drop is 0.
For example patient with [('Arrhythmias on 2010-01-11', 'CCB_nonDHP'), ('Hyperthyroidism on 2010-04-12', 'beta_blocker')], we will exclude so boolean for drop is 1.
Medication only patients with history of other indications is 47.


Unnamed: 0,drugclass,check_history,check2
956,"[CCB_DHP, ARB]","[CCB_DHP, ARB]",0
4222,[beta_blocker],[beta_blocker],0
4985,[alpha_beta_blocker],[alpha_beta_blocker],0
2442,"[CCB_nonDHP, beta_blocker]","[(Arrhythmias on 2010-01-11, CCB_nonDHP), (Hyp...",1
4592,[CCB_DHP],[CCB_DHP],0


### Final Merge

Diagnosis of HT is inferred from medication, after we have dropped those subjects.

In [179]:
to_drop = med_only[med_only["check2"]>0] # 47
final_data = data[~data.ENC_HN.isin(to_drop.ENC_HN)]

print("No of possible HT subjects are {:,}.".format(data.ENC_HN.nunique())) # 10645

for flag in ["icd", "both", "med"]:
    print("For {} group, the number of subjects is {:,}.".format(flag, len(final_data[final_data.flag==flag])))

pd.concat([pd.DataFrame({"ENC_HN": ["A", "B", "C", "D", "E"]}), final_data[["D001KEY", "ICD_DATE", "ICD", "MED_DATE", "MED", "flag"]].head(5)], 1)

No of possible HT subjects are 10,645.
For icd group, the number of subjects is 3,429.
For both group, the number of subjects is 1,060.
For med group, the number of subjects is 6,109.


Unnamed: 0,ENC_HN,D001KEY,ICD_DATE,ICD,MED_DATE,MED,flag
0,A,2023-07-11,2023-07-11,I10,2023-07-11,[AMLP1T-],both
1,B,2023-02-23,2023-02-23,I10,NaT,,icd
2,C,2023-01-09,2023-01-09,I10,2023-01-09,[LOSL-T-],both
3,D,2023-07-04,2023-07-07,I10,2023-07-04,"[BISS1T-, AMLP2T-]",med
4,E,2023-01-17,2023-01-17,I10,NaT,,icd


### Summarization for Data Flow

In [131]:
print("For the flow chart")
print("===========================")
print("ICD 10 for Hypertension (HT): n = {:,} | Anti-Hypertensive medication (Anti-H): n = {:,}".format(new_dx.ENC_HN.nunique(), new_med.ENC_HN.nunique()))
print("======tentative groups=====")
print("ICD-10 only: n = {:,}".format(len(data[data.flag=="icd"])), end=" | ")
print("both ICD-10 and Anti-H: n = {:,}".format(len(data[data.flag=="both"])), end=" | ")
print("Anti-H only: n = {:,}".format(len(data[data.flag=="med"])))
print("======med only=============")
print("other indications: n = {:,}".format(sum(med_only["check2"]>0)))
print("======final merge==========")
print("Diagnosis of HT: n = {:,} | Inferred Diagnosis of HT: n = {:,} ".format(new_dx.ENC_HN.nunique(), len(final_data[final_data.flag=="med"])))
print("HT subjects: n = {:,}".format(len(final_data)))
print("======sub-groups===========")
print("ICD-10 only: n = {:,}".format(len(final_data[final_data.flag=="icd"])), end=" | ")
print("both ICD-10 and Anti-H: n = {:,}".format(len(final_data[final_data.flag=="both"])), end=" | ")
print("Anti-H only: n = {:,}".format(len(final_data[final_data.flag=="med"])))

For the flow chart
ICD 10 for Hypertension (HT): n = 5,941 | Anti-Hypertensive medication (Anti-H): n = 7,998
ICD-10 only: n = 3,429 | both ICD-10 and Anti-H: n = 1,060 | Anti-H only: n = 6,156
other indications: n = 47
Diagnosis of HT: n = 5,941 | Inferred Diagnosis of HT: n = 6,109 
HT subjects: n = 10,598
ICD-10 only: n = 3,429 | both ICD-10 and Anti-H: n = 1,060 | Anti-H only: n = 6,109
