## Predicting discharge statuses for patients presenting to the emergency room

Managing emergency department (ED) resources according to patient volume plays an important role in hospital performance: it reduces cost and improves the quality of care. A large influx of patients at any given time can result in staff shortages and a lack of available rooms.

In this context, the objective of this project is to predict the outcome of an emergency department visit (whether the patient is admitted to the hospital or sent home) early in the patient's stay.

The dataset selected for this study is part of the National Hospital Ambulatory Medical Care Survey (NHAMCS) public-use data published by the US Centers for Disease Control and Prevention (CDC).

### Reading data in a fixed-width format (fwf)

In [64]:
import pandas as pd
pd.set_option('mode.chained_assignment',None)

df_helper = pd.read_csv(
    '../input/metadata/ED_metadata.csv',
    header=0, 
    dtype={'width': int, 'column_name': str, 'variable_type': str}
)
print(df_helper.head(n=5))

   width column_name  variable_type
0      2      VMONTH    CATEGORICAL
1      1       VDAYR    CATEGORICAL
2      4     ARRTIME  NONPREDICTIVE
3      4    WAITTIME     CONTINUOUS
4      4         LOV  NONPREDICTIVE


In [65]:
width = df_helper['width'].tolist()
col_names = df_helper['column_name'].tolist()
var_types = df_helper['variable_type'].tolist()

In [66]:
df = pd.read_fwf(
    '../input/healthcare/ED2013',
    widths=width,
    header=None,
    dtype='str'  
)

In [67]:
df.columns = col_names

In [68]:
print(df.tail(n=5))

      VMONTH VDAYR ARRTIME WAITTIME ...    CSTRATM   CPSUM   PATWT EDWT
24772     08     1    1925     0000 ...   40300000  000023  003043  nan
24773     08     1    0929     0000 ...   40300000  000023  003043  nan
24774     07     1    0116     0000 ...   40300000  000023  003043  nan
24775     07     7    1300     0000 ...   40300000  000023  003043  nan
24776     07     6    2335     0045 ...   40300000  000023  003043  nan

[5 rows x 579 columns]
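To make the fixed-width parsing concrete, here is a minimal sketch on an in-memory toy file. The three-column layout and the two sample records are assumptions made up for illustration; the real widths come from `ED_metadata.csv`.

```python
import io
import pandas as pd

# Hypothetical layout: month (2 chars), weekday (1 char), arrival time (4 chars).
widths = [2, 1, 4]
names = ['VMONTH', 'VDAYR', 'ARRTIME']

# Two made-up records with no delimiters between fields.
raw = io.StringIO("0811925\n0710116\n")

# dtype=str keeps leading zeros such as '08' and '0116' intact.
toy = pd.read_fwf(raw, widths=widths, header=None, names=names, dtype=str)
print(toy)
```

Reading every field as a string first, as the notebook does, defers numeric conversion until each column's coding scheme has been checked.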


In [69]:
# print(list(df.columns))

In [70]:
print(df.shape)

(24777, 579)


In [71]:
df['ADMITHOS'].value_counts()

0    22304
1     2473
Name: ADMITHOS, dtype: int64

## Target Variable

In this project, we are trying to predict which patients presenting to the ED will eventually be hospitalized.

In this case, hospitalization encompasses:

a) Those admitted to an inpatient ward for further evaluation and treatment

b) Those transferred to a different hospital (either psychiatric or non-psychiatric) for further treatment

c) Those admitted to the observation unit for further evaluation (whether they are eventually admitted or discharged after their observation unit stay)


In [72]:
target_cols = ['ADMITHOS','TRANOTH','TRANPSYC','OBSHOS','OBSDIS']

In [73]:
df.loc[:, target_cols] = df.loc[:, target_cols].apply(pd.to_numeric)

df['ADMITTEMP'] = df[target_cols].sum(axis=1)

df['ADMITFINAL'] = 0
df.loc[df['ADMITTEMP'] >= 1, 'ADMITFINAL'] = 1

df.drop(target_cols, axis=1, inplace=True)
df.drop('ADMITTEMP', axis=1, inplace=True)
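The combined target can also be derived without a temporary sum column. The sketch below uses a made-up toy frame with three of the flag columns and assumes each flag is coded strictly as 0 or 1; it is an illustration, not a replacement for the cell above.

```python
import pandas as pd

# Toy data: each row is a visit, each column one hospitalization flag.
toy = pd.DataFrame({
    'ADMITHOS': [0, 1, 0, 0],
    'TRANOTH':  [0, 0, 1, 0],
    'OBSHOS':   [0, 1, 0, 0],
})
flag_cols = ['ADMITHOS', 'TRANOTH', 'OBSHOS']

# A visit counts as a hospitalization if any of its flags equals 1.
toy['ADMITFINAL'] = toy[flag_cols].eq(1).any(axis=1).astype(int)
print(toy['ADMITFINAL'].tolist())  # [0, 1, 1, 0]
```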

## Train and Test Split

In [74]:
def split_target(data, target_name):
    target = data[[target_name]]
    data.drop(target_name, axis=1, inplace=True)
    return (data, target)

X, y = split_target(df, 'ADMITFINAL')

In [75]:
from sklearn.model_selection import train_test_split

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=123
)

In [76]:
print(y_train.groupby('ADMITFINAL').size())

ADMITFINAL
0    16028
1     2554
dtype: int64
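The class counts show that only about 14% of training visits are hospitalizations. One option, not used in this notebook, is to stratify the split so that both partitions keep the same class ratio. The sketch below runs on synthetic data; `X_toy` and `y_toy` are illustrative names.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic stand-in: 1,000 rows with a minority of positive labels.
rng = np.random.RandomState(123)
X_toy = pd.DataFrame({'x': rng.normal(size=1000)})
y_toy = pd.Series((rng.uniform(size=1000) < 0.14).astype(int), name='ADMITFINAL')

# stratify=y_toy preserves the class ratio in both partitions.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=123, stratify=y_toy
)

print(round(y_tr.mean(), 3), round(y_te.mean(), 3))
```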


## Preprocessing the Predictor Variables

### Visit information

#### Month

In [77]:
print(X_train.groupby('VMONTH').size())

VMONTH
01    1780
02    1369
03    1433
04    1705
05    2034
06    1716
07    1768
08    1038
09    1225
10    1273
11    1669
12    1572
dtype: int64


In [78]:
def is_winter(vmonth):
    if vmonth in ['12','01','02','03']:
        return 1
    else:
        return 0
    
# Derive the flag from each partition's own VMONTH column.
X_train.loc[:,'WINTER'] = X_train.loc[:,'VMONTH'].apply(is_winter)
X_test.loc[:,'WINTER'] = X_test.loc[:,'VMONTH'].apply(is_winter)

#### Day

In [79]:
X_train.groupby('VDAYR').size()

VDAYR
1    2576
2    2959
3    2775
4    2655
5    2529
6    2545
7    2543
dtype: int64

#### Arrival Time

In [80]:
def is_night(arrtime):
    # ARRTIME is an HHMM string; night is 20:00 through 07:59.
    arrtime_int = int(arrtime)
    if 0 <= arrtime_int < 800:
        return 1
    elif 2000 <= arrtime_int < 2400:
        return 1
    else:
        return 0
    
# Derive the flag from each partition's own ARRTIME column.
X_train.loc[:,'NIGHT'] = X_train.loc[:,'ARRTIME'].apply(is_night)
X_test.loc[:,'NIGHT'] = X_test.loc[:,'ARRTIME'].apply(is_night)

X_train.drop('ARRTIME', axis=1, inplace=True)
X_test.drop('ARRTIME', axis=1, inplace=True)
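The `WINTER` and `NIGHT` flags can also be computed with vectorized pandas operations instead of `apply`. The following is a sketch on a made-up toy frame that mirrors the rules in `is_winter` and `is_night`.

```python
import pandas as pd

toy = pd.DataFrame({
    'VMONTH':  ['12', '06', '03', '09'],
    'ARRTIME': ['0116', '1300', '2335', '0800'],
})

# Winter: December through March, matching is_winter.
toy['WINTER'] = toy['VMONTH'].isin(['12', '01', '02', '03']).astype(int)

# Night: 00:00-07:59 or 20:00-23:59, matching is_night.
arr = pd.to_numeric(toy['ARRTIME'])
is_early = (arr >= 0) & (arr < 800)
is_late = (arr >= 2000) & (arr < 2400)
toy['NIGHT'] = (is_early | is_late).astype(int)

print(toy)
```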

#### Wait Time

In [81]:
X_train.loc[:,'WAITTIME'] = X_train.loc[:,'WAITTIME'].apply(pd.to_numeric)
X_test.loc[:,'WAITTIME'] = X_test.loc[:,'WAITTIME'].apply(pd.to_numeric)

### Mean Imputation

In [82]:
# Per the documentation, WAITTIME may take the values -9 (blank) and -7 (not applicable).
# Replace both codes with the mean of the remaining values.

def mean_impute_values(data,col):  
    temp_mean = data.loc[(data[col] != -7) & (data[col] != -9), col].mean()
    data.loc[(data[col] == -7) | (data[col] == -9), col] = temp_mean            
    return data

X_train = mean_impute_values(X_train,'WAITTIME')
X_test = mean_impute_values(X_test,'WAITTIME')
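The cell above imputes each partition with its own mean. A common alternative is to compute the mean on the training set only and reuse it for the test set, so that no test-set statistic enters preprocessing. The sketch below shows that variant on toy data; `fit_sentinel_mean` and `apply_sentinel_mean` are hypothetical helper names, not part of this notebook.

```python
import pandas as pd

def fit_sentinel_mean(train_col, sentinels=(-7, -9)):
    """Mean of the training column, ignoring the sentinel codes."""
    return train_col[~train_col.isin(list(sentinels))].mean()

def apply_sentinel_mean(col, fill_value, sentinels=(-7, -9)):
    """Replace sentinel codes with a precomputed fill value."""
    return col.where(~col.isin(list(sentinels)), fill_value)

train_wait = pd.Series([10.0, -9.0, 30.0, -7.0])
test_wait = pd.Series([-9.0, 50.0])

train_mean = fit_sentinel_mean(train_wait)  # (10 + 30) / 2 = 20.0
train_imputed = apply_sentinel_mean(train_wait, train_mean)
test_imputed = apply_sentinel_mean(test_wait, train_mean)

print(train_imputed.tolist(), test_imputed.tolist())
```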

#### Dropping other visit variables

In [83]:
X_train.drop('LOV', axis=1, inplace=True)
X_test.drop('LOV', axis=1, inplace=True)

## Demographic Variables

#### Age

In [84]:
X_train.loc[:,'AGE'] = X_train.loc[:,'AGE'].apply(pd.to_numeric)
X_test.loc[:,'AGE'] = X_test.loc[:,'AGE'].apply(pd.to_numeric)

X_train.drop('AGEDAYS', axis=1, inplace=True)
X_test.drop('AGEDAYS', axis=1, inplace=True)

#### Sex

We keep the sex column as it is.

#### Ethnicity and race
We leave the unimputed ethnicity and race variables (ETHUN and RACEUN) as is.

In [85]:
X_train.drop(['ETHIM','RACER','RACERETH'], axis=1, inplace=True)
X_test.drop(['ETHIM','RACER','RACERETH'], axis=1, inplace=True)

## Triage Variables

The IMMEDR variable represents the triage scores that range from 1 (critical) to 5 (non-urgent). 

We also include ARREMS, which indicates whether the patient arrived via EMS, and SEEN72, which indicates whether the patient was seen and discharged within the last 72 hours.

## Financial Variables

We include all of the financial variables in our model except for the PAYTYPER variable, which is a nonbinary expansion of the other payment variables.

In [86]:
X_train.drop('PAYTYPER', axis=1, inplace=True)
X_test.drop('PAYTYPER', axis=1, inplace=True)

## Vital Signs

#### Temperature

In the dataset, all temperatures are multiplied by 10, so we need to divide them by 10.

In [87]:
X_train.loc[:,'TEMPF'] = X_train.loc[:,'TEMPF'].apply(pd.to_numeric)
X_test.loc[:,'TEMPF'] = X_test.loc[:,'TEMPF'].apply(pd.to_numeric)

X_train = mean_impute_values(X_train,'TEMPF')
X_test = mean_impute_values(X_test,'TEMPF')

X_train.loc[:,'TEMPF'] = X_train.loc[:,'TEMPF'].apply(lambda x: float(x)/10)
X_test.loc[:,'TEMPF'] = X_test.loc[:,'TEMPF'].apply(lambda x: float(x)/10)

In [88]:
X_train['TEMPF'].head(n=10)

4232     98.7
20594    98.1
10153    98.7
15805    98.4
18066    98.0
6497     98.4
5395     98.4
13974    98.3
9200     97.5
12244    99.5
Name: TEMPF, dtype: float64

#### Pulse

In [89]:
X_train.loc[:,'PULSE'] = X_train.loc[:,'PULSE'].apply(pd.to_numeric)
X_test.loc[:,'PULSE'] = X_test.loc[:,'PULSE'].apply(pd.to_numeric)

In [90]:
def mean_impute_vitals(data,col): 
    temp_mean = data.loc[(data[col] != 998) & (data[col] != -9), col].mean()
    data.loc[(data[col] == 998) | (data[col] == -9), col] = temp_mean 
    return data

X_train = mean_impute_vitals(X_train,'PULSE')
X_test = mean_impute_vitals(X_test,'PULSE')

#### Respiratory Rate

In [91]:
X_train.loc[:,'RESPR'] = X_train.loc[:,'RESPR'].apply(pd.to_numeric)
X_test.loc[:,'RESPR'] = X_test.loc[:,'RESPR'].apply(pd.to_numeric)

X_train = mean_impute_values(X_train,'RESPR')
X_test = mean_impute_values(X_test,'RESPR')

#### Blood pressure

In [92]:
X_train.loc[:,'BPSYS'] = X_train.loc[:,'BPSYS'].apply(pd.to_numeric)
X_test.loc[:,'BPSYS'] = X_test.loc[:,'BPSYS'].apply(pd.to_numeric)

X_train = mean_impute_values(X_train,'BPSYS')
X_test = mean_impute_values(X_test,'BPSYS')

In [93]:
X_train.loc[:,'BPDIAS'] = X_train.loc[:,'BPDIAS'].apply(pd.to_numeric)
X_test.loc[:,'BPDIAS'] = X_test.loc[:,'BPDIAS'].apply(pd.to_numeric)

In [94]:
def mean_impute_bp_diast(data,col): 
    # Mean of the valid readings, excluding the codes 998 and -9.
    temp_mean = data.loc[(data[col] != 998) & (data[col] != -9), col].mean()
    data.loc[data[col] == 998, col] = 40
    data.loc[data[col] == -9, col] = temp_mean 
    return data

# Use the diastolic-specific helper so that the 998 code is handled.
X_train = mean_impute_bp_diast(X_train,'BPDIAS')
X_test = mean_impute_bp_diast(X_test,'BPDIAS')

#### Oxygen saturation

In [95]:
X_train.loc[:,'POPCT'] = X_train.loc[:,'POPCT'].apply(pd.to_numeric)
X_test.loc[:,'POPCT'] = X_test.loc[:,'POPCT'].apply(pd.to_numeric)

X_train = mean_impute_values(X_train,'POPCT')
X_test = mean_impute_values(X_test,'POPCT')

In [96]:
X_train[['TEMPF','PULSE','RESPR','BPSYS','BPDIAS','POPCT']].head(n=10)

       TEMPF  PULSE      RESPR  BPSYS  BPDIAS  POPCT
4232    98.7   73.0       20.0  118.0    61.0  100.0
20594   98.1   86.0       20.0  148.0    85.0   98.0
10153   98.7   98.0       28.0  160.0   106.0  100.0
15805   98.4   75.0  19.568148  111.0    58.0  100.0
18066   98.0  109.0       18.0  199.0   111.0   94.0
6497    98.4   88.0       16.0  169.0    85.0   95.0
5395    98.4  152.0       40.0  125.0    77.0   95.0
13974   98.3  115.0       22.0  133.0    73.0   96.0
9200    97.5   81.0       18.0  140.0    68.0   96.0
12244   99.5  110.0       20.0  158.0   107.0  100.0


#### Pain level

In [97]:
X_train.loc[:,'PAINSCALE'] = X_train.loc[:,'PAINSCALE'].apply(pd.to_numeric)
X_test.loc[:,'PAINSCALE'] = X_test.loc[:,'PAINSCALE'].apply(pd.to_numeric)

In [98]:
def mean_impute_pain(data,col): 
    temp_mean = data.loc[(data[col] != -8) & (data[col] != -9), col].mean()
    data.loc[(data[col] == -8) | (data[col] == -9), col] = temp_mean 
    return data

X_train = mean_impute_pain(X_train,'PAINSCALE')
X_test = mean_impute_pain(X_test,'PAINSCALE')

### Reason-for-visit codes

In [99]:
rfv_codes_path = '../input/rfv-codes/RFV_CODES.csv'

rfv_codes = pd.read_csv(rfv_codes_path,header=0,dtype='str')

In [100]:
from re import sub

def add_rfv_column(data,code,desc,rfv_columns):
    # Build an indicator column: 1 if any RFV field equals this code.
    column_name = 'rfv_' + sub(" ", "_", desc)
    data[column_name] = (data[rfv_columns] == code).any(axis=1).astype('int')
    return data

rfv_columns = ['RFV1','RFV2','RFV3']
for (rfv_code,rfv_desc) in zip(
    rfv_codes['Code'].tolist(),rfv_codes['Description'].tolist()
):
    X_train = add_rfv_column(
        X_train,
        rfv_code,
        rfv_desc,
        rfv_columns
    )
    X_test = add_rfv_column(
        X_test,
        rfv_code,
        rfv_desc,
        rfv_columns 
    )
    
# Remove original RFV columns
X_train.drop(rfv_columns, axis=1, inplace=True)
X_test.drop(rfv_columns, axis=1, inplace=True)
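The effect of `add_rfv_column` is easiest to see on a small example. In the sketch below, the codes and descriptions in `lookup` are made-up placeholders standing in for `RFV_CODES.csv`, and the visits are toy data.

```python
import pandas as pd

# Each visit lists up to three reason-for-visit codes; '-9' marks a blank slot.
toy = pd.DataFrame({
    'RFV1': ['10500', '15450', '10100'],
    'RFV2': ['-9',    '10500', '-9'],
    'RFV3': ['-9',    '-9',    '-9'],
})
rfv_cols = ['RFV1', 'RFV2', 'RFV3']

# Hypothetical code -> description pairs (placeholders, not the real table).
lookup = {'10500': 'Chest pain', '10100': 'Fever'}

for code, desc in lookup.items():
    col = 'rfv_' + desc.replace(' ', '_')
    # 1 if the code appears in any of the three RFV slots, else 0.
    toy[col] = (toy[rfv_cols] == code).any(axis=1).astype(int)

print(toy[['rfv_Chest_pain', 'rfv_Fever']])
```

Because a code can appear in any of the three slots, the indicator does not depend on the order in which reasons were recorded.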

In [101]:
X_train.head(n=5)

Unnamed: 0,VMONTH,VDAYR,WAITTIME,AGE,AGER,RESIDNCE,SEX,ETHUN,RACEUN,ARREMS,NOPAY,PAYPRIV,PAYMCARE,PAYMCAID,PAYWKCMP,PAYSELF,PAYNOCHG,PAYOTH,PAYDK,TEMPF,PULSE,RESPR,BPSYS,BPDIAS,POPCT,ONO2,IMMEDR,PAINSCALE,SEEN72,EPISODE,INJURY,INJR1,INJR2,INJPOISAD,INJPOISADR1,INJPOISADR2,INTENT,INJDETR,INJDETR1,INJDETR2,...,rfv_Food_poisoning,rfv_Ingestion_inhalation_or_exposure_to_potentially_poisonous_products,rfv_Adverse_effect_of_medication,rfv_Adverse_effect_of_drug_abuse,rfv_Adverse_effect_of_alcohol,rfv_Alcohol_poisoning,rfv_Adverse_effects_of_environment,rfv_Adverse_effects_of_secondhand_smoke,rfv_Adverse_effects_of_terrorism_and_bioterrorism,rfv_Adverse_effects_other_and_unspecified,rfv_Complications_of_surgical_or_medical_procedures_and_treatments,rfv_For_other_findings_of_blood_tests,rfv_For_results_of_urine_tests,rfv_For_cytology_findings,rfv_For_radiological_findings,rfv_For_results_of_blood_glucose_tests,rfv_For_results_of_EKG_Holter_monitor_review_abnormal,rfv_For_results_of_skin_tests,rfv_For_results_of_cholesterol_and_triglyceride_tests,rfv_For_other_and_unspecified_test_results,rfv_For_results_of_test_for_human_immunodeficiency,rfv_Physical_examination_required_for_school_or_employment,rfv_Other_reason_for_visit_required_by_party_other_than_the_patient_or_the_health_care_provider,rfv_Physical_examination_required_for_employment,rfv_Executive_physical_examination,rfv_Physical_examination_required_for_school,rfv_Problems_complaints_NEC,rfv_Patient_unable_to_speak_English,rfv_Patient_or_patient's_spokesperson_refused_care,rfv_Physical_examination_for_extracurricular_activities,rfv_Entry_of_none_or_no_complaint,rfv_Insufficient_information,rfv_Driver's_license_examination_DOT_,rfv_Illegible_entry,rfv_Insurance_examination_,rfv_Disability_examination_,rfv_Worker’s_comp_exam,rfv_Premarital_examination,rfv_Premarital_blood_test,rfv_Direct_admission_to_hospital
4232,6,2,18.0,66,5,1,1,2,2,2,0,1,0,0,0,0,0,0,0,98.7,73.0,20.0,118.0,61.0,100.0,2,3,4.776526,2,1,0,0,0,4,4,4,-9,5,5,5,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
20594,1,5,30.0,28,3,1,2,1,-9,2,0,1,0,0,0,0,0,0,0,98.1,86.0,20.0,148.0,85.0,98.0,2,3,10.0,2,1,1,1,1,1,1,1,-9,-9,3,3,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
10153,10,5,11.0,52,4,1,2,2,1,2,0,0,0,1,0,0,0,0,0,98.7,98.0,28.0,160.0,106.0,100.0,2,4,10.0,1,2,1,1,1,1,1,1,-9,-9,3,3,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
15805,4,3,139.0,26,3,1,1,2,1,2,0,1,0,0,0,0,0,0,0,98.4,75.0,19.568148,111.0,58.0,100.0,2,3,7.0,2,1,0,0,0,4,4,4,-9,5,5,5,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0
18066,9,5,19.0,45,4,1,2,-9,1,2,0,0,1,0,0,0,0,0,0,98.0,109.0,18.0,199.0,111.0,94.0,2,-8,4.776526,2,1,0,0,0,4,4,4,-9,5,5,5,...,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0,0


### Injury codes

Injury codes apply only if the patient has suffered a physical injury, poisoning, or adverse effects of medical treatment (including suicide attempts). The exact reason for an injury may not be known until a full workup has been performed, and that workup usually occurs after the decision to admit has already been made. Therefore, we remove the injury code variables.

In [102]:
inj_cols = [
    'INJURY','INJR1','INJR2','INJPOISAD','INJPOISADR1',
    'INJPOISADR2','INTENT','INJDETR','INJDETR1','INJDETR2',
    'CAUSE1','CAUSE2','CAUSE3','CAUSE1R','CAUSE2R','CAUSE3R'
]

X_train.drop(inj_cols, axis=1, inplace=True)
X_test.drop(inj_cols, axis=1, inplace=True)

### Diagnostic codes
For the same reason as the injury codes, these variables will be removed.

In [103]:
diag_cols= [
    'DIAG1','DIAG2','DIAG3',
    'PRDIAG1','PRDIAG2','PRDIAG3',
    'DIAG1R','DIAG2R','DIAG3R'
]

X_train.drop(diag_cols, axis=1, inplace=True)
X_test.drop(diag_cols, axis=1, inplace=True)

### Medical history

In [104]:
X_train.loc[:,'TOTCHRON'] = X_train.loc[:,'TOTCHRON'].apply(pd.to_numeric)
X_test.loc[:,'TOTCHRON'] = X_test.loc[:,'TOTCHRON'].apply(pd.to_numeric)

X_train = mean_impute_values(X_train,'TOTCHRON')
X_test = mean_impute_values(X_test,'TOTCHRON')

### Tests

In [105]:
testing_cols = [
    'ABG','BAC','BLOODCX','BNP','BUNCREAT',
    'CARDENZ','CBC','DDIMER','ELECTROL','GLUCOSE',
    'LACTATE','LFT','PTTINR','OTHERBLD','CARDMON',
    'EKG','HIVTEST','FLUTEST','PREGTEST','TOXSCREN',
    'URINE','WOUNDCX','URINECX','OTHRTEST','ANYIMAGE',
    'XRAY','IVCONTRAST','CATSCAN','CTAB','CTCHEST',
    'CTHEAD','CTOTHER','CTUNK','MRI','ULTRASND',
    'OTHIMAGE','TOTDIAG','DIAGSCRN'
]

X_train.drop(testing_cols, axis=1, inplace=True)
X_test.drop(testing_cols, axis=1, inplace=True)

### Procedures
We omit procedures because, like the tests, they often occur after the time of prediction.

In [106]:
proc_cols = [
    'PROC','BPAP','BLADCATH','CASTSPLINT','CENTLINE',
    'CPR','ENDOINT','INCDRAIN','IVFLUIDS','LUMBAR',
    'NEBUTHER','PELVIC','SKINADH','SUTURE','OTHPROC',
    'TOTPROC'
]

X_train.drop(proc_cols, axis=1, inplace=True)
X_test.drop(proc_cols, axis=1, inplace=True)

### Medication codes

In [107]:
med_cols = [
    'MED1','MED2','MED3','MED4','MED5',
    'MED6','MED7','MED8','MED9','MED10',
    'MED11','MED12','GPMED1','GPMED2','GPMED3',
    'GPMED4','GPMED5','GPMED6','GPMED7','GPMED8',
    'GPMED9','GPMED10','GPMED11','GPMED12','NUMGIV',
    'NUMDIS','NUMMED',
]

X_train.drop(med_cols, axis=1, inplace=True)
X_test.drop(med_cols, axis=1, inplace=True)

### Provider information

In [108]:
prov_cols = [
    'NOPROVID','ATTPHYS','RESINT','CONSULT','RNLPN',
    'NURSEPR','PHYSASST','EMT','MHPROV','OTHPROV'
]

X_train.drop(prov_cols, axis=1, inplace=True)
X_test.drop(prov_cols, axis=1, inplace=True)

### Disposition information

In [109]:
disp_cols = [
    'NODISP','NOFU','RETRNED','RETREFFU','LEFTBTRI',
    'LEFTAMA','DOA','DIEDED','TRANNH','OTHDISP',
    'ADMIT','ADMTPHYS','BOARDED','LOS','HDDIAG1',
    'HDDIAG2','HDDIAG3','HDDIAG1R','HDDIAG2R','HDDIAG3R',
    'HDSTAT','ADISP','OBSSTAY','STAY24'
]

In [110]:
X_train.drop(disp_cols, axis=1, inplace=True)
X_test.drop(disp_cols, axis=1, inplace=True)

### Pre-Imputed columns

In [111]:
imp_cols = [
    'AGEFL','BDATEFL','SEXFL','ETHNICFL','RACERFL'
]

X_train.drop(imp_cols, axis=1, inplace=True)
X_test.drop(imp_cols, axis=1, inplace=True)

### Identifying variables

In [112]:
id_cols = [
    'HOSPCODE','PATCODE'
]

X_train.drop(id_cols, axis=1, inplace=True)
X_test.drop(id_cols, axis=1, inplace=True)

### Electronic medical record status columns

We omit these columns since they are valued on a per-hospital basis rather than a per-encounter basis.

In [113]:
emr_cols = [
    'EBILLANYE','EMRED','HHSMUE','EHRINSE','EDEMOGE',
    'EDEMOGER','EPROLSTE','EPROLSTER','EVITALE','EVITALER',
    'ESMOKEE','ESMOKEER','EPNOTESE','EPNOTESER','EMEDALGE',
    'EMEDALGER','ECPOEE','ECPOEER','ESCRIPE','ESCRIPER',
    'EWARNE','EWARNER','EREMINDE','EREMINDER','ECTOEE',
    'ECTOEER','EORDERE','EORDERER','ERESULTE','ERESULTER',
    'EGRAPHE','EGRAPHER','EIMGRESE','EIMGRESER','EPTEDUE',
    'EPTEDUER','ECQME','ECQMER','EGENLISTE','EGENLISTER',
    'EIMMREGE','EIMMREGER','ESUME','ESUMER','EMSGE',
    'EMSGER','EHLTHINFOE','EHLTHINFOER','EPTRECE','EPTRECER',
    'EMEDIDE','EMEDIDER','ESHAREE','ESHAREEHRE','ESHAREWEBE',
    'ESHAREOTHE','ESHAREUNKE','ESHAREREFE','LABRESE1','LABRESE2',
    'LABRESE3','LABRESE4','LABRESUNKE','LABRESREFE','IMAGREPE1',
    'IMAGREPE2','IMAGREPE3','IMAGREPE4','IMAGREPUNKE','IMAGREPREFE',
    'PTPROBE1','PTPROBE2','PTPROBE3','PTPROBE4','PTPROBUNKE',
    'PTPROBREFE','MEDLISTE1','MEDLISTE2','MEDLISTE3','MEDLISTE4',
    'MEDLISTUNKE','MEDLISTREFE','ALGLISTE1','ALGLISTE2','ALGLISTE3',
    'ALGLISTE4','ALGLISTUNKE','ALGLISTREFE','EDPRIM','EDINFO',
    'MUINC','MUYEAR'
]

X_train.drop(emr_cols, axis=1, inplace=True)
X_test.drop(emr_cols, axis=1, inplace=True)

### Detailed medication information

In [114]:
drug_id_cols = [
    'DRUGID1','DRUGID2','DRUGID3','DRUGID4','DRUGID5',
    'DRUGID6','DRUGID7','DRUGID8','DRUGID9','DRUGID10',
    'DRUGID11','DRUGID12'
]

drug_lev1_cols = [
    'RX1V1C1','RX1V1C2','RX1V1C3','RX1V1C4',
    'RX2V1C1','RX2V1C2','RX2V1C3','RX2V1C4',
    'RX3V1C1','RX3V1C2','RX3V1C3','RX3V1C4',
    'RX4V1C1','RX4V1C2','RX4V1C3','RX4V1C4',
    'RX5V1C1','RX5V1C2','RX5V1C3','RX5V1C4',
    'RX6V1C1','RX6V1C2','RX6V1C3','RX6V1C4',
    'RX7V1C1','RX7V1C2','RX7V1C3','RX7V1C4',
    'RX8V1C1','RX8V1C2','RX8V1C3','RX8V1C4',
    'RX9V1C1','RX9V1C2','RX9V1C3','RX9V1C4',
    'RX10V1C1','RX10V1C2','RX10V1C3','RX10V1C4',
    'RX11V1C1','RX11V1C2','RX11V1C3','RX11V1C4',
    'RX12V1C1','RX12V1C2','RX12V1C3','RX12V1C4'
]

drug_lev2_cols = [
    'RX1V2C1','RX1V2C2','RX1V2C3','RX1V2C4',
    'RX2V2C1','RX2V2C2','RX2V2C3','RX2V2C4',
    'RX3V2C1','RX3V2C2','RX3V2C3','RX3V2C4',
    'RX4V2C1','RX4V2C2','RX4V2C3','RX4V2C4',
    'RX5V2C1','RX5V2C2','RX5V2C3','RX5V2C4',
    'RX6V2C1','RX6V2C2','RX6V2C3','RX6V2C4',
    'RX7V2C1','RX7V2C2','RX7V2C3','RX7V2C4',
    'RX8V2C1','RX8V2C2','RX8V2C3','RX8V2C4',
    'RX9V2C1','RX9V2C2','RX9V2C3','RX9V2C4',
    'RX10V2C1','RX10V2C2','RX10V2C3','RX10V2C4',
    'RX11V2C1','RX11V2C2','RX11V2C3','RX11V2C4',
    'RX12V2C1','RX12V2C2','RX12V2C3','RX12V2C4'
]
drug_lev3_cols = [
    'RX1V3C1','RX1V3C2','RX1V3C3','RX1V3C4',
    'RX2V3C1','RX2V3C2','RX2V3C3','RX2V3C4',
    'RX3V3C1','RX3V3C2','RX3V3C3','RX3V3C4',
    'RX4V3C1','RX4V3C2','RX4V3C3','RX4V3C4',
    'RX5V3C1','RX5V3C2','RX5V3C3','RX5V3C4',
    'RX6V3C1','RX6V3C2','RX6V3C3','RX6V3C4',
    'RX7V3C1','RX7V3C2','RX7V3C3','RX7V3C4',
    'RX8V3C1','RX8V3C2','RX8V3C3','RX8V3C4',
    'RX9V3C1','RX9V3C2','RX9V3C3','RX9V3C4',
    'RX10V3C1','RX10V3C2','RX10V3C3','RX10V3C4',
    'RX11V3C1','RX11V3C2','RX11V3C3','RX11V3C4',
    'RX12V3C1','RX12V3C2','RX12V3C3','RX12V3C4'
]

addl_drug_cols = [
    'PRESCR1','CONTSUB1','COMSTAT1','RX1CAT1','RX1CAT2',
    'RX1CAT3','RX1CAT4','PRESCR2','CONTSUB2','COMSTAT2',
    'RX2CAT1','RX2CAT2','RX2CAT3','RX2CAT4','PRESCR3','CONTSUB3',
    'COMSTAT3','RX3CAT1','RX3CAT2','RX3CAT3','RX3CAT4','PRESCR4',
    'CONTSUB4','COMSTAT4','RX4CAT1','RX4CAT2','RX4CAT3',
    'RX4CAT4','PRESCR5','CONTSUB5','COMSTAT5','RX5CAT1',
    'RX5CAT2','RX5CAT3','RX5CAT4','PRESCR6','CONTSUB6',
    'COMSTAT6','RX6CAT1','RX6CAT2','RX6CAT3','RX6CAT4','PRESCR7',
    'CONTSUB7','COMSTAT7','RX7CAT1','RX7CAT2','RX7CAT3',
    'RX7CAT4','PRESCR8','CONTSUB8','COMSTAT8','RX8CAT1',
    'RX8CAT2','RX8CAT3','RX8CAT4','PRESCR9','CONTSUB9',
    'COMSTAT9','RX9CAT1','RX9CAT2','RX9CAT3','RX9CAT4',
    'PRESCR10','CONTSUB10','COMSTAT10','RX10CAT1','RX10CAT2',
    'RX10CAT3','RX10CAT4','PRESCR11','CONTSUB11','COMSTAT11',
    'RX11CAT1','RX11CAT2','RX11CAT3','RX11CAT4','PRESCR12',
    'CONTSUB12','COMSTAT12','RX12CAT1','RX12CAT2','RX12CAT3',
    'RX12CAT4'
]

X_train.drop(drug_id_cols, axis=1, inplace=True)
X_train.drop(drug_lev1_cols, axis=1, inplace=True)
X_train.drop(drug_lev2_cols, axis=1, inplace=True)
X_train.drop(drug_lev3_cols, axis=1, inplace=True)
X_train.drop(addl_drug_cols, axis=1, inplace=True)

X_test.drop(drug_id_cols, axis=1, inplace=True)
X_test.drop(drug_lev1_cols, axis=1, inplace=True)
X_test.drop(drug_lev2_cols, axis=1, inplace=True)
X_test.drop(drug_lev3_cols, axis=1, inplace=True)
X_test.drop(addl_drug_cols, axis=1, inplace=True)

### Miscellaneous information

In [115]:
design_cols = ['CSTRATM','CPSUM','PATWT','EDWT']

X_train.drop(design_cols, axis=1, inplace=True)
X_test.drop(design_cols, axis=1, inplace=True)

### One-hot encoding

In [116]:
categ_cols = df_helper.loc[
    df_helper['variable_type'] == 'CATEGORICAL', 'column_name'
]

one_hot_cols = list(set(categ_cols) & set(X_train.columns))

X_train = pd.get_dummies(X_train, columns=one_hot_cols)

In [117]:
X_test = pd.get_dummies(X_test, columns=one_hot_cols)
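Encoding the training and test sets separately can produce different columns when a category appears in only one partition. One way to guard against this, shown here as a sketch on toy data rather than as the notebook's own step, is to reindex the test frame to the training columns.

```python
import pandas as pd

train = pd.DataFrame({'IMMEDR': ['01', '02', '03'], 'AGE': [30, 41, 57]})
test = pd.DataFrame({'IMMEDR': ['02', '05'], 'AGE': [22, 68]})

train_enc = pd.get_dummies(train, columns=['IMMEDR'])
test_enc = pd.get_dummies(test, columns=['IMMEDR'])

# Reindex the test frame to the training columns: categories unseen in
# training (IMMEDR_05) are dropped, and missing ones are filled with 0.
test_enc = test_enc.reindex(columns=train_enc.columns, fill_value=0)

print(list(test_enc.columns))
```

After this step both frames have the same columns in the same order, which the later conversion to NumPy arrays relies on.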

### Numeric conversion

In [118]:
X_train.loc[:,X_train.columns] = X_train.loc[:,X_train.columns].apply(pd.to_numeric)
X_test.loc[:,X_test.columns] = X_test.loc[:,X_test.columns].apply(pd.to_numeric)

### NumPy array conversion

In [119]:
X_train_cols = X_train.columns
X_test_cols = X_test.columns

In [120]:
X_train = X_train.values
X_test = X_test.values

# Building the models

## Logistic regression

In [122]:
from sklearn.linear_model import LogisticRegression

clfs = [LogisticRegression()]

for clf in clfs:
    clf.fit(X_train, y_train)
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))
    
    coefs = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'coef': [clf.coef_[0,i] for i in range(len(X_train_cols))]
    }
    df_coefs = pd.DataFrame(coefs)
    print(df_coefs.sort_values('coef', axis=0, ascending=False))

  y = column_or_1d(y, warn=True)


<class 'sklearn.linear_model.logistic.LogisticRegression'>
Training accuracy: 0.8860187278010978
Validation accuracy: 0.886682808716707
                                                column      coef
346                     rfv_Symptoms_of_onset_of_labor  2.066923
108  rfv_Other_symptoms_or_problems_relating_to_psy...  1.388934
520  rfv_General_psychiatric_or_psychological_exami...  1.235171
688                                rfv_Suicide_attempt  1.031974
795                                          IMMEDR_01  0.899332
201         rfv_Labored_or_difficult_breathing_dyspnea  0.883669
88                                     rfv_Depression_  0.880418
796                                          IMMEDR_02  0.820472
95                     rfv_Delusions_or_hallucinations  0.808396
696                   rfv_Adverse_effect_of_drug_abuse  0.788406
55                             rfv_Chest_pain_soreness  0.782589
607                         rfv_Medical_Counseling_NOS  0.773309
42                 

#### Logistic regression provides a robust model with consistent accuracy between the training and test sets. The results of the model are as follows:

#### Training accuracy: 0.8860187278010978
#### Validation accuracy: 0.886682808716707
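Accuracy should be read against the class balance: 16,028 of the 18,582 training visits (about 86%) are non-admissions, so a model that always predicts "not admitted" would already score close to that on similarly distributed data. The sketch below, on synthetic data, compares a majority-class baseline with logistic regression and also reports ROC AUC, which is less sensitive to class imbalance. The data and variable names are assumptions for illustration only.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced stand-in for the ED data.
rng = np.random.RandomState(0)
X_toy = rng.normal(size=(2000, 5))
logits = 1.5 * X_toy[:, 0] - 2.3
y_toy = (rng.uniform(size=2000) < 1 / (1 + np.exp(-logits))).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=0, stratify=y_toy
)

# Baseline that always predicts the most frequent class.
baseline = DummyClassifier(strategy='most_frequent').fit(X_tr, y_tr)
model = LogisticRegression().fit(X_tr, y_tr)

baseline_acc = baseline.score(X_te, y_te)
model_acc = model.score(X_te, y_te)
auc = roc_auc_score(y_te, model.predict_proba(X_te)[:, 1])

print('Baseline accuracy:', baseline_acc)
print('Model accuracy:   ', model_acc)
print('Model ROC AUC:    ', auc)
```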

## Random forest

In [125]:
from sklearn.ensemble import RandomForestClassifier

clfs_rf = [RandomForestClassifier(n_estimators=100)]

for clf in clfs_rf:
    clf.fit(X_train, y_train)
    print(type(clf))
    print('Training accuracy: ' + str(clf.score(X_train, y_train)))
    print('Validation accuracy: ' + str(clf.score(X_test, y_test)))
    
    imps = {
        'column': [X_train_cols[i] for i in range(len(X_train_cols))],
        'imp': [clf.feature_importances_[i] for i in range(len(X_train_cols))]
    }
    df_imps = pd.DataFrame(imps)
    print(df_imps.sort_values('imp', axis=0, ascending=False))

  


<class 'sklearn.ensemble.forest.RandomForestClassifier'>
Training accuracy: 1.0
Validation accuracy: 0.8828087167070218
                                                column       imp
1                                                  AGE  0.036652
13                                               PULSE  0.028601
15                                               BPSYS  0.027059
16                                              BPDIAS  0.026471
12                                               TEMPF  0.024818
0                                             WAITTIME  0.024795
17                                               POPCT  0.021069
14                                               RESPR  0.020226
29                                            TOTCHRON  0.017520
18                                           PAINSCALE  0.016664
872                                          ARREMS_02  0.014394
871                                          ARREMS_01  0.014194
796                                

#### The results of the random forest model are as follows:
#### Training accuracy: 1.0
#### Validation accuracy: 0.8828087167070218

#### A training accuracy of 1.0 indicates that the forest fits the training set perfectly, which is a sign of overfitting, even though random forests are typically robust to it and the validation accuracy stays close to that of logistic regression. The feature importances show that many features have no effect on the model. We expect that removing these features would improve model performance.
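One way to drop low-importance features is scikit-learn's `SelectFromModel`, which keeps the features whose importance exceeds a threshold. The sketch below uses synthetic data in which only two of twenty columns carry signal; the mean-importance threshold is an assumption, not a tuned choice.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_selection import SelectFromModel

rng = np.random.RandomState(0)
X_toy = rng.normal(size=(500, 20))
# Only the first two columns carry signal; the remaining 18 are noise.
y_toy = (X_toy[:, 0] + X_toy[:, 1] > 0).astype(int)

forest = RandomForestClassifier(n_estimators=100, random_state=0).fit(X_toy, y_toy)

# Keep features whose importance is at least the mean importance.
selector = SelectFromModel(forest, threshold='mean', prefit=True)
X_reduced = selector.transform(X_toy)
kept = np.flatnonzero(selector.get_support())

print(X_toy.shape, '->', X_reduced.shape)
print('Kept columns:', kept)
```

On the real data, the same selector would be fitted on the training set and then applied to both `X_train` and `X_test`.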

## Neural network

In [127]:
from sklearn.preprocessing import StandardScaler
from sklearn.neural_network import MLPClassifier 

# Scale data
scaler = StandardScaler() 
scaler.fit(X_train) 
X_train_Tx = scaler.transform(X_train) 
X_test_Tx = scaler.transform(X_test) 

# Fit models that require scaling (e.g. neural networks)
hl_sizes = [150,100,80,60,40,20]
nn_clfs = [MLPClassifier(hidden_layer_sizes=(size,), random_state=2345, verbose=True) for size in hl_sizes]

for num, nn_clf in enumerate(nn_clfs):
    print(str(hl_sizes[num]) + '-unit network:')
    nn_clf.fit(X_train_Tx, y_train)
    print('Training accuracy: ' + str(nn_clf.score(X_train_Tx, y_train)))
    print('Validation accuracy: ' + str(nn_clf.score(X_test_Tx, y_test)))

150-unit network:


  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.46042747
Iteration 2, loss = 0.26762774
Iteration 3, loss = 0.23099945
Iteration 4, loss = 0.20990494
Iteration 5, loss = 0.19234417
Iteration 6, loss = 0.17454482
Iteration 7, loss = 0.15777271
Iteration 8, loss = 0.14261363
Iteration 9, loss = 0.12760340
Iteration 10, loss = 0.11404992
Iteration 11, loss = 0.10026963
Iteration 12, loss = 0.08864339
Iteration 13, loss = 0.07872664
Iteration 14, loss = 0.06766217
Iteration 15, loss = 0.06049338
Iteration 16, loss = 0.05281394
Iteration 17, loss = 0.04664527
Iteration 18, loss = 0.04040355
Iteration 19, loss = 0.03564911
Iteration 20, loss = 0.03147428
Iteration 21, loss = 0.02752492
Iteration 22, loss = 0.02445311
Iteration 23, loss = 0.02191182
Iteration 24, loss = 0.01964949
Iteration 25, loss = 0.01792721
Iteration 26, loss = 0.01644273
Iteration 27, loss = 0.01412957
Iteration 28, loss = 0.01222390
Iteration 29, loss = 0.01197503
Iteration 30, loss = 0.01027081
Iteration 31, loss = 0.00964919
Iteration 32, los

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.36246986
Iteration 2, loss = 0.25986193
Iteration 3, loss = 0.23167331
Iteration 4, loss = 0.20980379
Iteration 5, loss = 0.19225833
Iteration 6, loss = 0.17533651
Iteration 7, loss = 0.16078798
Iteration 8, loss = 0.14319748
Iteration 9, loss = 0.12926094
Iteration 10, loss = 0.11454745
Iteration 11, loss = 0.10248543
Iteration 12, loss = 0.09084019
Iteration 13, loss = 0.08104346
Iteration 14, loss = 0.07228109
Iteration 15, loss = 0.06443965
Iteration 16, loss = 0.05622794
Iteration 17, loss = 0.05085809
Iteration 18, loss = 0.04518165
Iteration 19, loss = 0.03955685
Iteration 20, loss = 0.03703979
Iteration 21, loss = 0.03160814
Iteration 22, loss = 0.02916788
Iteration 23, loss = 0.02514193
Iteration 24, loss = 0.02260722
Iteration 25, loss = 0.02058863
Iteration 26, loss = 0.01810946
Iteration 27, loss = 0.01592094
Iteration 28, loss = 0.01407315
Iteration 29, loss = 0.01374555
Iteration 30, loss = 0.01317764
Iteration 31, loss = 0.01174809
Iteration 32, los

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.35454070
Iteration 2, loss = 0.26225834
Iteration 3, loss = 0.23671209
Iteration 4, loss = 0.21851784
Iteration 5, loss = 0.20176831
Iteration 6, loss = 0.18600299
Iteration 7, loss = 0.17039497
Iteration 8, loss = 0.15523305
Iteration 9, loss = 0.14114271
Iteration 10, loss = 0.12678225
Iteration 11, loss = 0.11476329
Iteration 12, loss = 0.10208249
Iteration 13, loss = 0.09205445
Iteration 14, loss = 0.08243933
Iteration 15, loss = 0.07208936
Iteration 16, loss = 0.06474464
Iteration 17, loss = 0.05889930
Iteration 18, loss = 0.05399492
Iteration 19, loss = 0.04716947
Iteration 20, loss = 0.04413791
Iteration 21, loss = 0.03860912
Iteration 22, loss = 0.03468525
Iteration 23, loss = 0.03149394
Iteration 24, loss = 0.02844319
Iteration 25, loss = 0.02542893
Iteration 26, loss = 0.02332205
Iteration 27, loss = 0.02125126
Iteration 28, loss = 0.01900169
Iteration 29, loss = 0.01834657
Iteration 30, loss = 0.01587931
Iteration 31, loss = 0.01431643
Iteration 32, los

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.43357381
Iteration 2, loss = 0.27888069
Iteration 3, loss = 0.24739004
Iteration 4, loss = 0.22782233
Iteration 5, loss = 0.21412121
Iteration 6, loss = 0.20152233
Iteration 7, loss = 0.18912834
Iteration 8, loss = 0.17768024
Iteration 9, loss = 0.16643064
Iteration 10, loss = 0.15500260
Iteration 11, loss = 0.14202183
Iteration 12, loss = 0.13168001
Iteration 13, loss = 0.12215761
Iteration 14, loss = 0.11166814
Iteration 15, loss = 0.10143313
Iteration 16, loss = 0.09344256
Iteration 17, loss = 0.08525069
Iteration 18, loss = 0.07834549
Iteration 19, loss = 0.07207005
Iteration 20, loss = 0.06564179
Iteration 21, loss = 0.06120169
Iteration 22, loss = 0.05591804
Iteration 23, loss = 0.05146194
Iteration 24, loss = 0.04729026
Iteration 25, loss = 0.04346369
Iteration 26, loss = 0.04028081
Iteration 27, loss = 0.03733748
Iteration 28, loss = 0.03463643
Iteration 29, loss = 0.03126852
Iteration 30, loss = 0.02920701
Iteration 31, loss = 0.02780134
Iteration 32, los

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.37631294
Iteration 2, loss = 0.27670785
Iteration 3, loss = 0.25069618
Iteration 4, loss = 0.23501113
Iteration 5, loss = 0.22176468
Iteration 6, loss = 0.21070250
Iteration 7, loss = 0.19933209
Iteration 8, loss = 0.18843077
Iteration 9, loss = 0.17675223
Iteration 10, loss = 0.16661278
Iteration 11, loss = 0.15574234
Iteration 12, loss = 0.14577344
Iteration 13, loss = 0.13672148
Iteration 14, loss = 0.12788980
Iteration 15, loss = 0.11977264
Iteration 16, loss = 0.11183530
Iteration 17, loss = 0.10408129
Iteration 18, loss = 0.09882054
Iteration 19, loss = 0.09309868
Iteration 20, loss = 0.08651480
Iteration 21, loss = 0.08077091
Iteration 22, loss = 0.07644392
Iteration 23, loss = 0.07087569
Iteration 24, loss = 0.06622560
Iteration 25, loss = 0.06269006
Iteration 26, loss = 0.05816704
Iteration 27, loss = 0.05506414
Iteration 28, loss = 0.05203494
Iteration 29, loss = 0.04873037
Iteration 30, loss = 0.04528483
Iteration 31, loss = 0.04252240
Iteration 32, los

  y = column_or_1d(y, warn=True)


Iteration 1, loss = 0.65423207
Iteration 2, loss = 0.34671278
Iteration 3, loss = 0.29129340
Iteration 4, loss = 0.26584909
Iteration 5, loss = 0.24997232
Iteration 6, loss = 0.23907695
Iteration 7, loss = 0.23079262
Iteration 8, loss = 0.22356778
Iteration 9, loss = 0.21744450
Iteration 10, loss = 0.21146149
Iteration 11, loss = 0.20560899
Iteration 12, loss = 0.20038191
Iteration 13, loss = 0.19466724
Iteration 14, loss = 0.18971259
Iteration 15, loss = 0.18447994
Iteration 16, loss = 0.18000950
Iteration 17, loss = 0.17433077
Iteration 18, loss = 0.16904665
Iteration 19, loss = 0.16440944
Iteration 20, loss = 0.16018349
Iteration 21, loss = 0.15590088
Iteration 22, loss = 0.15012856
Iteration 23, loss = 0.14618453
Iteration 24, loss = 0.14133472
Iteration 25, loss = 0.13680620
Iteration 26, loss = 0.13328819
Iteration 27, loss = 0.12844420
Iteration 28, loss = 0.12476082
Iteration 29, loss = 0.12115142
Iteration 30, loss = 0.11809226
Iteration 31, loss = 0.11365508
Iteration 32, los



#### The network with a single hidden layer of 100 units resulted in
#### Training accuracy: 0.9996771068776235
#### Validation accuracy: 0.8789346246973365
#### The training accuracy indicates that our model is overfitting, which could be due to the large number of features in the data. In addition, the training set has approximately 18,000 observations; neural network models typically need a much larger number of observations to perform well.
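Two standard ways to limit overfitting in `MLPClassifier` are a stronger L2 penalty (`alpha`) and early stopping on a held-out slice of the training data. The sketch below shows both on synthetic data; the parameter values are illustrative assumptions, not tuned settings for the NHAMCS data.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.neural_network import MLPClassifier
from sklearn.preprocessing import StandardScaler

# Synthetic data: two informative columns plus 28 noise columns.
rng = np.random.RandomState(2345)
X_toy = rng.normal(size=(1500, 30))
noise = rng.normal(scale=1.0, size=1500)
y_toy = (X_toy[:, 0] - X_toy[:, 1] + noise > 0).astype(int)

X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.25, random_state=2345
)

# Fit the scaler on the training data only, then apply it to both sets.
scaler = StandardScaler().fit(X_tr)
X_tr_s, X_te_s = scaler.transform(X_tr), scaler.transform(X_te)

clf = MLPClassifier(
    hidden_layer_sizes=(100,),
    alpha=1e-2,               # stronger L2 penalty than the default 1e-4
    early_stopping=True,      # hold out part of the training data
    validation_fraction=0.1,  # and stop when its score stops improving
    n_iter_no_change=10,
    random_state=2345,
)
clf.fit(X_tr_s, y_tr)

train_acc = clf.score(X_tr_s, y_tr)
val_acc = clf.score(X_te_s, y_te)
print('Training accuracy:  ', train_acc)
print('Validation accuracy:', val_acc)
```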

#### In this project, we built a model that predicts emergency department visits resulting in hospital admission with 88.67% accuracy on the test data.
#### Model performance could be improved by applying feature selection methods.
#### The model can help hospitals improve the management of their resources.