# Preprocessing Ethnicities

This notebook takes the preprocessed ethnicity files and adds the dependent variable (the label) to each of them.

In [99]:
import pandas as pd
import numpy as np

## Determine most frequently occurring disease: poisoning

We want to label our data with 'has disease x' or 'does not have disease x' for our classification model. For the purpose of this research, which disease the model predicts is not really important. We therefore chose the most frequently occurring disease, as this gives the model more data to train on.

Side note: the output below shows that the most frequent terms in the diagnosis titles include fractures and sequela (a pathological condition resulting from a disease), which we do not really count as diseases in themselves. Thus, poisoning seemed to be the most frequently occurring disease.

In [100]:
d_icd_diagnoses = pd.read_csv("mimic-iv-0.4/hosp/d_icd_diagnoses.csv.gz")

In [101]:
s = d_icd_diagnoses['long_title'].str.split(expand=True).stack().value_counts()
print(s.head(60))

of              77276
encounter       36076
fracture        32307
with            23264
subsequent      21894
unspecified     21246
for             17191
initial         14261
and             14000
left            13412
right           13238
sequela         12122
or              11583
Other            9681
other            8195
in               8082
open             6978
Unspecified      6515
healing          6068
to               5671
type             5419
Nondisplaced     4998
Displaced        4953
by               4183
lower            4043
without          4029
closed           3902
injury           3688
shaft            3435
specified        3158
at               3150
delayed          3060
routine          3032
nonunion         3007
I                2957
II               2912
body             2831
injured          2784
femur,           2768
end              2756
malunion         2662
due              2588
arm,             2581
IIIC             2550
IIIB,            2550
IIIA,     
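The counts above are sensitive to case and punctuation ('Other' and 'other', or 'femur,' with its trailing comma, are counted as separate tokens) and are dominated by function words such as 'of' and 'with'. The following is a minimal sketch of a normalised count, not the method used in this notebook. It runs on a small toy series rather than on `d_icd_diagnoses`; the stop-word list `STOP_WORDS` and the fracture title are assumptions made for illustration.

```python
import pandas as pd

# Toy stand-in for d_icd_diagnoses['long_title'].
# The first three titles appear in this notebook; the last one is hypothetical.
titles = pd.Series([
    "Staphylococcal food poisoning",
    "Botulism food poisoning",
    "Other shellfish poisoning, undetermined, sequela",
    "Fracture of femur, sequela",
])

# Hypothetical, hand-picked stop-word list (not part of the original analysis)
STOP_WORDS = {"of", "with", "for", "and", "or", "in", "to", "by", "at", "other"}

tokens = (
    titles.str.lower()                              # merge 'Other' and 'other'
          .str.replace(r"[^\w\s]", "", regex=True)  # drop commas and other punctuation
          .str.split()
          .explode()                                # one token per row
)

# Count the remaining tokens after removing stop words
word_counts = tokens[~tokens.isin(STOP_WORDS)].value_counts()
print(word_counts.head(10))
```

On the real data, `titles` would be replaced by `d_icd_diagnoses['long_title']`.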

## Collect ICD codes for poisoning

In [102]:
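# 'oisoning' matches both 'Poisoning' and 'poisoning' in the titles
poisoning = d_icd_diagnoses[d_icd_diagnoses['long_title'].str.contains('oisoning')]
# unique ICD codes (ICD-9 and ICD-10) whose title mentions poisoning
poisononing_icds = poisoning['icd_code'].unique()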
poisoning

Unnamed: 0,icd_code,icd_version,long_title
24,0050,9,Staphylococcal food poisoning
25,0051,9,Botulism food poisoning
26,0052,9,Food poisoning due to Clostridium perfringens ...
27,0053,9,Food poisoning due to other Clostridia
28,0054,9,Food poisoning due to Vibrio parahaemolyticus
...,...,...,...
74591,T61784A,10,"Other shellfish poisoning, undetermined, initi..."
74592,T61784D,10,"Other shellfish poisoning, undetermined, subse..."
74593,T61784S,10,"Other shellfish poisoning, undetermined, sequela"
78199,V156,9,"Personal history of poisoning, presenting haza..."
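The filter above searches for the substring 'oisoning' so that titles starting with 'Poisoning' and titles containing 'poisoning' are both matched. A hedged alternative is to match case-insensitively through the `case` parameter of `Series.str.contains`. The sketch below uses a toy frame; the codes `ZZ001` and `ZZ002` and their titles are hypothetical.

```python
import pandas as pd

# Toy stand-in for d_icd_diagnoses; the ZZ* rows are hypothetical
toy = pd.DataFrame({
    "icd_code": ["0050", "T61784S", "ZZ001", "ZZ002"],
    "long_title": [
        "Staphylococcal food poisoning",
        "Other shellfish poisoning, undetermined, sequela",
        "Poisoning by a hypothetical agent",
        "Hypothetical fracture title",
    ],
})

# case=False matches 'Poisoning' and 'poisoning'; na=False treats missing titles as no match
mask = toy["long_title"].str.contains("poisoning", case=False, na=False)

# A set gives fast membership tests when labelling rows later
toy_poisoning_icds = set(toy.loc[mask, "icd_code"])
print(toy_poisoning_icds)
```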


## Add labels to ethnicity files

Adds a column 'is_poisoned' that is True if any of the row's ICD codes is among the poisoning ICD codes collected above, and False otherwise. The icd_code column is removed afterwards; otherwise the model could use the pattern in the ICD codes to determine the value of is_poisoned.

In [103]:
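import ast

def add_labels(df_name):
    # import data
    print('Importing ', df_name)
    df = pd.read_csv('data/preprocessing_I/' + df_name + '.csv')

    # add is_poisoned column
    print("Adding labels for ", df_name)
    df['is_poisoned'] = False
    for row in df.itertuples():
        # filter NaNs (missing icd_code values are read as float)
        if isinstance(row.icd_code, str):
            # read_csv returns a stored list as text, e.g. "['0050', 'T61784A']";
            # parse it back, otherwise the loop below would iterate over single characters.
            # Assumption: preprocessing I stored either a Python list or a single code.
            codes = ast.literal_eval(row.icd_code) if row.icd_code.startswith('[') else [row.icd_code]
            for icd in codes:
                if icd in poisononing_icds:
                    df.at[row.Index, 'is_poisoned'] = True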

    # remove icd column
    df = df.drop(columns='icd_code')

    # save csv
    print('Saving .csv for ', df_name)
    df.to_csv("data/preprocessing_II/" + df_name + ".csv")
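The labelling rule described above (True when any of a row's ICD codes is a poisoning code, after which `icd_code` is dropped) can also be written without an explicit row loop. This is only a sketch under assumptions: the toy frame, the `subject_id` column, the helper `parse_codes`, and the idea that each `icd_code` cell holds a Python list saved as text are all illustrative and not confirmed by the files used here.

```python
import ast
import pandas as pd

# Two poisoning codes taken from the table above; a set gives fast lookups
poisoning_codes = {"0050", "T61784A"}

# Toy stand-in for one ethnicity file (assumed layout; '4019' is a placeholder non-poisoning code)
toy = pd.DataFrame({
    "subject_id": [1, 2, 3],
    "icd_code": ["['0050', '4019']", "['4019']", None],
})

def parse_codes(value):
    """Turn a stringified list back into a list; missing values become an empty list."""
    if not isinstance(value, str):
        return []
    return ast.literal_eval(value)

# True if at least one of the row's codes is a poisoning code
toy["is_poisoned"] = toy["icd_code"].apply(
    lambda value: any(code in poisoning_codes for code in parse_codes(value))
)

# Drop the codes so a model cannot read the label off them
toy = toy.drop(columns="icd_code")
print(toy)
```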

In [104]:
ethnic_group_names = ['unknown', 'white', 'other', 'asian', 'hispanic_latino', 'black_african_american', 'unable_to_obtain', 'american_indian_alaska_native']

for name in ethnic_group_names:
    add_labels(name)

Importing  unknown
Adding labels for  unknown
Saving .csv for  unknown
Importing  white
Adding labels for  white
Saving .csv for  white
Importing  other
Adding labels for  other
Saving .csv for  other
Importing  asian
Adding labels for  asian
Saving .csv for  asian
Importing  hispanic_latino
Adding labels for  hispanic_latino
Saving .csv for  hispanic_latino
Importing  black_african_american
Adding labels for  black_african_american
Saving .csv for  black_african_american
Importing  unable_to_obtain
Adding labels for  unable_to_obtain
Saving .csv for  unable_to_obtain
Importing  american_indian_alaska_native
Adding labels for  american_indian_alaska_native
Saving .csv for  american_indian_alaska_native
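Since the choice of disease was motivated by having enough positive examples, it may be worth checking the share of positive labels in each saved file. The helper `positive_share` below is a hypothetical sketch shown on a toy frame; it is not part of the original pipeline.

```python
import pandas as pd

def positive_share(df, label="is_poisoned"):
    """Return the fraction of rows whose label is True."""
    return float(df[label].mean())

# Toy stand-in for one labelled file from data/preprocessing_II/
toy = pd.DataFrame({"is_poisoned": [True, False, False, False]})

share = positive_share(toy)
print(f"{share:.2%} of rows are labelled positive")
```

A share of exactly zero for every group would suggest that the ICD codes were not matched as intended.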
