## Canadian Hospital Readmittance Challenge 

This notebook was prepared as part of the Machine Learning (AI-511) project by the following students:

1. Siddharth Kothari (IMT2021019)
2. Sankalp Kothari (IMT2021028)
3. M Srinivasan (IMT2021058)


In [25]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import plotly.graph_objects as go
import xgboost as xgb
import optuna
from math import floor
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, PolynomialFeatures
from sklearn.compose import ColumnTransformer
from sklearn.linear_model import LogisticRegression
from sklearn.tree import DecisionTreeClassifier
from sklearn.ensemble import RandomForestClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.naive_bayes import GaussianNB
from sklearn.metrics import accuracy_score,confusion_matrix,ConfusionMatrixDisplay
import re
pd.options.display.max_rows = 4000

In [26]:
df = pd.read_csv('./canadian-hospital-re-admittance-challenge/train.csv')
test_df = pd.read_csv('./canadian-hospital-re-admittance-challenge/test.csv')
df

Unnamed: 0,enc_id,patient_id,race,gender,age,weight,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmission_id
0,88346340,2488608,Caucasian,Male,[60-70),,1,2,6,3,...,No,Steady,No,No,No,No,No,Ch,Yes,2
1,92001408,52133202,Caucasian,Male,[70-80),[100-125),2,6,1,7,...,No,No,No,No,No,No,No,No,Yes,1
2,169424316,40945509,Caucasian,Female,[70-80),,3,2,1,7,...,No,Up,No,No,No,No,No,Ch,Yes,1
3,272987082,38850777,Caucasian,Female,[50-60),,1,1,7,1,...,No,No,No,No,No,No,No,No,Yes,2
4,150600612,72738225,Caucasian,Female,[80-90),,1,6,7,6,...,No,Down,No,No,No,No,No,Ch,Yes,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71231,198619164,85063725,Caucasian,Male,[70-80),,1,1,7,6,...,No,No,No,No,No,No,No,No,Yes,1
71232,177404100,86244345,Caucasian,Male,[90-100),,1,3,7,5,...,No,No,No,No,No,No,No,No,No,2
71233,50905206,5131368,Caucasian,Male,[70-80),,3,6,1,6,...,No,Steady,No,No,No,No,No,Ch,Yes,2
71234,216431502,85969035,Hispanic,Male,[50-60),,1,1,4,4,...,No,Steady,No,No,No,No,No,No,Yes,2


### EDA and Preprocessing

In [27]:
df.columns

Index(['enc_id', 'patient_id', 'race', 'gender', 'age', 'weight',
       'admission_type_id', 'discharge_disposition_id', 'admission_source_id',
       'time_in_hospital', 'payer_code', 'medical_specialty',
       'num_lab_procedures', 'num_procedures', 'num_medications',
       'number_outpatient', 'number_emergency', 'number_inpatient', 'diag_1',
       'diag_2', 'diag_3', 'number_diagnoses', 'max_glu_serum', 'A1Cresult',
       'metformin', 'repaglinide', 'nateglinide', 'chlorpropamide',
       'glimepiride', 'acetohexamide', 'glipizide', 'glyburide', 'tolbutamide',
       'pioglitazone', 'rosiglitazone', 'acarbose', 'miglitol', 'troglitazone',
       'tolazamide', 'examide', 'citoglipton', 'insulin',
       'glyburide-metformin', 'glipizide-metformin',
       'glimepiride-pioglitazone', 'metformin-rosiglitazone',
       'metformin-pioglitazone', 'change', 'diabetesMed', 'readmission_id'],
      dtype='object')

#### Null Removal

We first compute the percentage of null values in each column to see which columns we can drop immediately.

In [28]:
percent_missing = df.isnull().sum() * 100 / len(df)

percentages_df = pd.DataFrame({
    'percent_missing': percent_missing
})

percentages_df

column,percent_missing
enc_id,0.0
patient_id,0.0
race,2.275535
gender,0.0
age,0.0
weight,96.841485
admission_type_id,0.0
discharge_disposition_id,0.0
admission_source_id,0.0
time_in_hospital,0.0


The following columns are null in the vast majority of rows (percentage of null values in parentheses):
1. weight (96.84%)
2. A1Cresult (83.32%)
3. max_glu_serum (94.77%)
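A minimal sketch of how such columns could be picked out programmatically rather than by eye. The 80% cutoff and the helper name `high_null_columns` are assumptions made for illustration, not values taken from this notebook, and the toy frame merely stands in for `df`.

```python
import numpy as np
import pandas as pd

def high_null_columns(frame: pd.DataFrame, threshold_pct: float = 80.0) -> list:
    """Return the columns whose share of null values exceeds threshold_pct."""
    pct_missing = frame.isnull().mean() * 100
    return pct_missing[pct_missing > threshold_pct].index.tolist()

# Toy data standing in for the real training frame (hypothetical values).
toy = pd.DataFrame({
    "weight": [np.nan] * 9 + ["[75-100)"],   # 90% null
    "race": ["Caucasian"] * 9 + [np.nan],    # 10% null
    "age": ["[60-70)"] * 10,                 # 0% null
})
sparse_cols = high_null_columns(toy)
print(sparse_cols)
```

The returned list could then be reviewed before dropping anything, since a sparse column may still be informative where it is present.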


We therefore look at the distribution of the non-null values in these columns to check whether they carry any useful information.

In [29]:
df['max_glu_serum'].value_counts()

Norm    1790
>200    1034
>300     897
Name: max_glu_serum, dtype: int64

In [30]:
df['A1Cresult'].value_counts()

>8      5715
Norm    3476
>7      2689
Name: A1Cresult, dtype: int64

In [31]:
df['weight'].value_counts()

[75-100)     944
[50-75)      643
[100-125)    421
[125-150)    103
[25-50)       69
[0-25)        34
[150-175)     27
[175-200)      7
>200           2
Name: weight, dtype: int64

##### Dropping Columns

Because these columns are null for more than 80% of the rows, little useful information can be gained from them, so we drop them altogether.

We also drop columns that do not seem relevant to a patient's readmission, such as payer_code.

In [32]:
df.drop(columns=['weight','payer_code','max_glu_serum','A1Cresult'], inplace=True)
test_df.drop(columns=['weight','payer_code','max_glu_serum','A1Cresult'],inplace=True)

df

Unnamed: 0,enc_id,patient_id,race,gender,age,admission_type_id,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,...,citoglipton,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,readmission_id
0,88346340,2488608,Caucasian,Male,[60-70),1,2,6,3,Family/GeneralPractice,...,No,Steady,No,No,No,No,No,Ch,Yes,2
1,92001408,52133202,Caucasian,Male,[70-80),2,6,1,7,,...,No,No,No,No,No,No,No,No,Yes,1
2,169424316,40945509,Caucasian,Female,[70-80),3,2,1,7,,...,No,Up,No,No,No,No,No,Ch,Yes,1
3,272987082,38850777,Caucasian,Female,[50-60),1,1,7,1,,...,No,No,No,No,No,No,No,No,Yes,2
4,150600612,72738225,Caucasian,Female,[80-90),1,6,7,6,,...,No,Down,No,No,No,No,No,Ch,Yes,2
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
71231,198619164,85063725,Caucasian,Male,[70-80),1,1,7,6,,...,No,No,No,No,No,No,No,No,Yes,1
71232,177404100,86244345,Caucasian,Male,[90-100),1,3,7,5,,...,No,No,No,No,No,No,No,No,No,2
71233,50905206,5131368,Caucasian,Male,[70-80),3,6,1,6,Cardiology,...,No,Steady,No,No,No,No,No,Ch,Yes,2
71234,216431502,85969035,Hispanic,Male,[50-60),1,1,4,4,InternalMedicine,...,No,Steady,No,No,No,No,No,No,Yes,2


##### Medical Specialty

We do not drop the medical specialty column outright, as some specialists' patients may be more likely to be readmitted than others'. To handle the null values there, we impute them with a new value, "No-Admitting-Physician".

In [33]:
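# Assign the result back instead of calling fillna(inplace=True) on a column selection,
# which may not update the parent frame in recent pandas versions.
df["medical_specialty"] = df["medical_specialty"].fillna("No-Admitting-Physician")
test_df["medical_specialty"] = test_df["medical_specialty"].fillna("No-Admitting-Physician")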
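As a small illustration of why a sentinel value is used here: once the nulls are replaced, the placeholder behaves like any other category and gets its own indicator column under one-hot encoding. The toy frame below is hypothetical and only sketches the idea.

```python
import numpy as np
import pandas as pd

# Hypothetical miniature of the medical_specialty column.
toy = pd.DataFrame(
    {"medical_specialty": ["Cardiology", np.nan, "InternalMedicine", np.nan]}
)
toy["medical_specialty"] = toy["medical_specialty"].fillna("No-Admitting-Physician")

# The sentinel becomes its own indicator column instead of being silently ignored.
dummies = pd.get_dummies(toy, columns=["medical_specialty"])
print(dummies.columns.tolist())
```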

In [34]:
percent_missing = df.isnull().sum() * 100 / len(df)

percentages_df = pd.DataFrame({
    'percent_missing': percent_missing
})

percentages_df

column,percent_missing
enc_id,0.0
patient_id,0.0
race,2.275535
gender,0.0
age,0.0
admission_type_id,0.0
discharge_disposition_id,0.0
admission_source_id,0.0
time_in_hospital,0.0
medical_specialty,0.0


In [None]:
diag_grouping_dict = {
    0 : [0],
    1 : range(1,140),
    2 : range(140,240),
    3 : range(240,280),
    4 : range(280,290),
    5 : range(290,320), 
    6 : range(320,390),
    7 : range(390,460),
    8 : range(460,520),
    9 : range(520,580),
    10: range(580,630),
    11: range(630,680),
    12: range(680,710),
    13: range(710,740),
    14: range(740,760),
    15: range(760,780),
    16: range(780,800),
    17: range(800,1000)
}

def diag_convert1(row):
    if str(row['diag_1'])[0] in ['E','V']:
        return 18
    else:
        for j in diag_grouping_dict.keys():
            if floor(float(row['diag_1'])) in diag_grouping_dict[j]:
                return j


def diag_convert2(row):
    if str(row['diag_2'])[0] in ['E','V']:
        return 18
    else:
        for j in diag_grouping_dict.keys():
            if floor(float(row['diag_2'])) in diag_grouping_dict[j]:
                return j
            

def diag_convert3(row):
    if str(row['diag_3'])[0] in ['E','V']:
        return 18
    else:
        for j in diag_grouping_dict.keys():
            if floor(float(row['diag_3'])) in diag_grouping_dict[j]:
                return j
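The dictionary above bins diagnosis codes into ranges that match the ICD-9 chapter boundaries, with codes starting with `E` or `V` sent to group 18. The three near-identical `diag_convert*` functions could plausibly be replaced by one vectorized helper; the sketch below is one way to do that, not the notebook's method. The name `group_diag` is hypothetical, and the bin edges assume group 15 is meant to cover codes 760 to 779.

```python
import numpy as np
import pandas as pd

# Left-closed bin edges: [0,1) -> 0, [1,140) -> 1, ..., [800,1000) -> 17.
ICD9_EDGES = [0, 1, 140, 240, 280, 290, 320, 390, 460, 520,
              580, 630, 680, 710, 740, 760, 780, 800, 1000]

def group_diag(codes: pd.Series) -> pd.Series:
    """Map raw diagnosis codes to group ids 0-17, or 18 for E/V codes."""
    codes = codes.astype(str)
    is_ev = codes.str[0].isin(["E", "V"])
    # E/V codes are masked out before the numeric conversion.
    numeric = pd.to_numeric(codes.where(~is_ev), errors="coerce")
    groups = pd.cut(np.floor(numeric), bins=ICD9_EDGES, right=False, labels=False)
    return groups.where(~is_ev, 18).astype("Int64")

sample = pd.Series(["250.01", "V45", "0", "428", "E885", "780", "765"])
grouped_codes = group_diag(sample)
print(grouped_codes.tolist())
```

Under these assumptions, a call such as `group_diag(df['diag_2'])` would play the role of `df.apply(diag_convert2, axis=1)` without a Python-level loop over rows.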

In [None]:
df.fillna("0", inplace=True)
test_df.fillna("0", inplace=True)

# new_col = df.apply(diag_convert1, axis=1)
# df.insert(loc = len(df.columns)-1, column = 'diag_1_new', value=new_col)
df.drop(columns=['diag_1'], inplace=True)

# new_col = test_df.apply(diag_convert1, axis=1)
# test_df.insert(loc = len(test_df.columns), column = 'diag_1_new', value=new_col)
test_df.drop(columns=['diag_1'], inplace=True)

new_col = df.apply(diag_convert2, axis=1)
df.insert(loc = len(df.columns)-1, column = 'diag_2_new', value=new_col)
df.drop(columns=['diag_2'], inplace=True)

new_col = test_df.apply(diag_convert2, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'diag_2_new', value=new_col)
test_df.drop(columns=['diag_2'], inplace=True)

new_col = df.apply(diag_convert3, axis=1)
df.insert(loc = len(df.columns)-1, column = 'diag_3_new', value=new_col)
df.drop(columns=['diag_3'], inplace=True)

new_col = test_df.apply(diag_convert3, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'diag_3_new', value=new_col)
test_df.drop(columns=['diag_3'], inplace=True)

In [None]:
# df.loc[df['diag_1'].notnull(), 'diag_1'] = 4
# df.loc[df['diag_2'].notnull(), 'diag_2'] = 2
# df.loc[df['diag_3'].notnull(), 'diag_3'] = 1

# df['diag_1'].fillna(0,inplace=True)
# df['diag_2'].fillna(0,inplace=True)
# df['diag_3'].fillna(0,inplace=True)

# test_df.loc[test_df['diag_1'].notnull(), 'diag_1'] = 4
# test_df.loc[test_df['diag_2'].notnull(), 'diag_2'] = 2
# test_df.loc[test_df['diag_3'].notnull(), 'diag_3'] = 1

# test_df['diag_1'].fillna(0,inplace=True)
# test_df['diag_2'].fillna(0,inplace=True)
# test_df['diag_3'].fillna(0,inplace=True)

# df.loc[:,'diag_1':'diag_3']

In [None]:
# new_col = df['diag_1']+df['diag_2']+df['diag_3']
# df.insert(loc = len(df.columns)-1, column = 'Number_of_Diagnosis', value=new_col)

# new_col = test_df['diag_1']+test_df['diag_2']+test_df['diag_3']
# test_df.insert(loc = len(test_df.columns), column = 'Number_of_Diagnosis', value=new_col)

In [None]:
# df.drop(columns=['diag_1','diag_2','diag_3'], inplace=True)
# test_df.drop(columns=['diag_1','diag_2','diag_3'], inplace=True)

In [None]:
df.drop(columns=['race','gender'], inplace=True)
test_df.drop(columns=['race','gender'], inplace=True)
df

In [5]:
admission_grouping_dict = {
    1 : [1],
    2 : [2],
    3 : [3],
    4 : [4],
    5 : [5,6,8], 
    6 : [7]
}

def admission_group(row):
    for j in admission_grouping_dict.keys():
        if row['admission_type_id'] in admission_grouping_dict[j]:
            return j

new_col = df.apply(admission_group, axis=1)
df.insert(loc = len(df.columns)-1, column = 'admission_type_id_new', value=new_col)
df.drop(columns=['admission_type_id'], inplace=True)

new_col = test_df.apply(admission_group, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'admission_type_id_new', value=new_col)
test_df.drop(columns=['admission_type_id'], inplace=True)

temp_df = df.groupby(by=['admission_type_id_new'])
temp_df.count()

admission_type_id_new,enc_id,patient_id,race,gender,age,discharge_disposition_id,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,...,insulin,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,service_utilization,readmission_id
1,37831,37831,37051,37831,37831,37831,37831,37831,14545,37831,...,37831,37831,37831,37831,37831,37831,37831,37831,37831,37831
2,12979,12979,12501,12979,12979,12979,12979,12979,9147,12979,...,12979,12979,12979,12979,12979,12979,12979,12979,12979,12979
3,13188,13188,12916,13188,13188,13188,13188,13188,8457,13188,...,13188,13188,13188,13188,13188,13188,13188,13188,13188,13188
4,8,8,8,8,8,8,8,8,2,8,...,8,8,8,8,8,8,8,8,8,8
5,7223,7223,7133,7223,7223,7223,7223,7223,4155,7223,...,7223,7223,7223,7223,7223,7223,7223,7223,7223,7223
6,7,7,6,7,7,7,7,7,0,7,...,7,7,7,7,7,7,7,7,7,7
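The grouping cells in this part of the notebook all follow the same pattern: a `{group: [raw ids]}` dictionary applied row by row. A possible alternative, sketched here under the assumption that every raw id belongs to exactly one group, is to invert the dictionary once and use `Series.map`. The names `raw_to_group` and `raw_ids` are illustrative only.

```python
import pandas as pd

# Same grouping as admission_grouping_dict in the cell above.
admission_grouping = {1: [1], 2: [2], 3: [3], 4: [4], 5: [5, 6, 8], 6: [7]}

# Invert {group: [raw ids]} into {raw id: group} for a direct lookup.
raw_to_group = {
    raw: group for group, raws in admission_grouping.items() for raw in raws
}

raw_ids = pd.Series([1, 6, 7, 8, 3])   # hypothetical admission_type_id values
grouped = raw_ids.map(raw_to_group)
print(grouped.tolist())
```

Ids that are absent from the dictionary come out as NaN with `map`, which makes unmapped categories easy to spot with `isnull()`.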


In [6]:
discharge_grouping_dict = {
    1 : [11,19,20,21],
    2 : [18,25,26],
    3 : [7],
    4 : [1,6,8,13,14],
    5 : [2,3,4,5,10,16,22,23,24,30,27,28,29],
    6 : [9,12,15,17]
}

def discharge_group(row):
    for j in discharge_grouping_dict.keys():
        if row['discharge_disposition_id'] in discharge_grouping_dict[j]:
            return j
        
new_col = df.apply(discharge_group, axis=1)
df.insert(loc = len(df.columns)-1, column = 'discharge_type_id_new', value=new_col)
df.drop(columns=['discharge_disposition_id'], inplace=True)

new_col = test_df.apply(discharge_group, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'discharge_type_id_new', value=new_col)
test_df.drop(columns=['discharge_disposition_id'], inplace=True)

temp_df = df.groupby(by=['discharge_type_id_new'])
temp_df.count()

discharge_type_id_new,enc_id,patient_id,race,gender,age,admission_source_id,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,...,glyburide-metformin,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,service_utilization,admission_type_id_new,readmission_id
1,1160,1160,1138,1160,1160,1160,1160,586,1160,1160,...,1160,1160,1160,1160,1160,1160,1160,1160,1160,1160
2,3319,3319,3298,3319,3319,3319,3319,1220,3319,3319,...,3319,3319,3319,3319,3319,3319,3319,3319,3319,3319
3,444,444,433,444,444,444,444,194,444,444,...,444,444,444,444,444,444,444,444,444,444
4,51763,51763,50554,51763,51763,51763,51763,27006,51763,51763,...,51763,51763,51763,51763,51763,51763,51763,51763,51763,51763
5,14489,14489,14131,14489,14489,14489,14489,7281,14489,14489,...,14489,14489,14489,14489,14489,14489,14489,14489,14489,14489
6,61,61,61,61,61,61,61,19,61,61,...,61,61,61,61,61,61,61,61,61,61


In [7]:
source_grouping_dict = {
    1 : [4,5,6,10,18,22,25,26],
    2 : [1,2,3],
    3 : [11,12,13,14],
    4 : [9,15,17,20,21],
    5 : [7],
    6 : [8]
}

def source_group(row):
    for j in source_grouping_dict.keys():
        if row['admission_source_id'] in source_grouping_dict[j]:
            return j

new_col = df.apply(source_group, axis=1)
df.insert(loc = len(df.columns)-1, column = 'admission_source_id_new', value=new_col)
df.drop(columns=['admission_source_id'], inplace=True)

new_col = test_df.apply(source_group, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'admission_source_id_new', value=new_col)
test_df.drop(columns=['admission_source_id'], inplace=True)

temp_df = df.groupby(by=['admission_source_id_new'])
temp_df.count()

admission_source_id_new,enc_id,patient_id,race,gender,age,time_in_hospital,medical_specialty,num_lab_procedures,num_procedures,num_medications,...,glipizide-metformin,glimepiride-pioglitazone,metformin-rosiglitazone,metformin-pioglitazone,change,diabetesMed,service_utilization,admission_type_id_new,discharge_type_id_new,readmission_id
1,4465,4465,4130,4465,4465,4465,2771,4465,4465,4465,...,4465,4465,4465,4465,4465,4465,4465,4465,4465,4465
2,21610,21610,21060,21610,21610,21610,12726,21610,21610,21610,...,21610,21610,21610,21610,21610,21610,21610,21610,21610,21610
3,5,5,5,5,5,5,2,5,5,5,...,5,5,5,5,5,5,5,5,5,5
4,4877,4877,4771,4877,4877,4877,3034,4877,4877,4877,...,4877,4877,4877,4877,4877,4877,4877,4877,4877,4877
5,40268,40268,39638,40268,40268,40268,17765,40268,40268,40268,...,40268,40268,40268,40268,40268,40268,40268,40268,40268,40268
6,11,11,11,11,11,11,8,11,11,11,...,11,11,11,11,11,11,11,11,11,11


In [None]:
new_col = df['number_outpatient'] + df['number_emergency'] + df['number_inpatient']
df.insert(loc = len(df.columns)-1, column = 'service_utilization', value=new_col)

new_col = test_df['number_outpatient'] + test_df['number_emergency'] + test_df['number_inpatient']
test_df.insert(loc = len(test_df.columns), column = 'service_utilization', value=new_col)

df

In [11]:
def age_converter(row):
    if row['age'] == '[0-10)':
        return 5
    elif row['age'] == '[10-20)':
        return 15
    elif row['age'] == '[20-30)':
        return 25
    elif row['age'] == '[30-40)':
        return 35
    elif row['age'] == '[40-50)':
        return 45
    elif row['age'] == '[50-60)':
        return 55
    elif row['age'] == '[60-70)':
        return 65
    elif row['age'] == '[70-80)':
        return 75
    elif row['age'] == '[80-90)':
        return 85
    elif row['age'] == '[90-100)':
        return 95

new_col = df.apply(age_converter, axis=1)
df['age'] = new_col

new_col = test_df.apply(age_converter, axis=1)
test_df['age'] = new_col

df['age'].value_counts()

age
75    18179
65    15801
55    12080
85    12037
45     6785
35     2650
95     1940
25     1165
15      495
5       104
Name: count, dtype: int64
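The `if`/`elif` chain above replaces each age bracket with its midpoint. A shorter sketch of the same idea parses the two bounds out of the bracket string; it assumes every value has the form `[lo-hi)` (a non-matching value would raise on the integer cast), and the helper name `age_midpoint` is hypothetical.

```python
import pandas as pd

def age_midpoint(age: pd.Series) -> pd.Series:
    """Convert brackets such as '[60-70)' to their integer midpoint (65)."""
    bounds = age.str.extract(r"\[(\d+)-(\d+)\)").astype(int)
    return (bounds[0] + bounds[1]) // 2

ages = pd.Series(["[0-10)", "[60-70)", "[90-100)"])
midpoints = age_midpoint(ages)
print(midpoints.tolist())
```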

In [12]:
patient = df['patient_id'].value_counts()
patient = patient.add(test_df['patient_id'].value_counts(),fill_value=0)
patient

def id_convertor(row):
    return patient[row['patient_id']]

new_col = df.apply(id_convertor, axis=1)
df.insert(loc = len(df.columns)-1, column = 'patient_id_new', value=new_col)
df.drop(columns=['patient_id'],inplace=True)
df.drop(columns=['enc_id'], inplace=True)

new_col = test_df.apply(id_convertor, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'patient_id_new', value=new_col)
test_df.drop(columns=['patient_id'], inplace=True)
test_df.drop(columns=['enc_id'], inplace=True)
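The cell above replaces `patient_id` with the number of encounters that patient has across the train and test files combined. The sketch below shows the same computation with `Series.map` on toy data; the frames `train` and `test` are hypothetical stand-ins for `df` and `test_df`.

```python
import pandas as pd

train = pd.DataFrame({"patient_id": [10, 10, 20, 30]})
test = pd.DataFrame({"patient_id": [10, 40]})

# Count encounters per patient over both files at once.
visit_counts = pd.concat([train["patient_id"], test["patient_id"]]).value_counts()

# Look up each row's patient in the combined counts.
train["patient_id_new"] = train["patient_id"].map(visit_counts)
test["patient_id_new"] = test["patient_id"].map(visit_counts)
print(train["patient_id_new"].tolist(), test["patient_id_new"].tolist())
```

Note that this feature uses test-set rows when building a training feature, which is worth keeping in mind when interpreting validation scores.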

In [13]:
for col in df.loc[:,'metformin':'diabetesMed']:
    med_groups = df.groupby(by=[col])
    print(med_groups.count().iloc[:, 0])

metformin
Down        392
No        57223
Steady    12885
Up          736
Name: age, dtype: int64
repaglinide
Down         33
No        70145
Steady      980
Up           78
Name: age, dtype: int64
nateglinide
Down          8
No        70770
Steady      444
Up           14
Name: age, dtype: int64
chlorpropamide
No        71170
Steady       62
Up            4
Name: age, dtype: int64
glimepiride
Down        131
No        67558
Steady     3308
Up          239
Name: age, dtype: int64
acetohexamide
No        71235
Steady        1
Name: age, dtype: int64
glipizide
Down        376
No        62301
Steady     8011
Up          548
Name: age, dtype: int64
glyburide
Down        389
No        63713
Steady     6577
Up          557
Name: age, dtype: int64
tolbutamide
No        71221
Steady       15
Name: age, dtype: int64
pioglitazone
Down         83
No        66074
Steady     4910
Up          169
Name: age, dtype: int64
rosiglitazone
Down         67
No        66740
Steady     4303
Up          126
Na

In [14]:
df.drop(columns=['chlorpropamide', 'tolbutamide', 'miglitol', 'acarbose', 'tolazamide', 'acetohexamide', 'troglitazone', 'examide', 'citoglipton', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'glyburide-metformin'], inplace=True)
test_df.drop(columns=['chlorpropamide', 'tolbutamide', 'miglitol', 'acarbose', 'tolazamide', 'acetohexamide', 'troglitazone', 'examide', 'citoglipton', 'glipizide-metformin', 'glimepiride-pioglitazone', 'metformin-rosiglitazone', 'metformin-pioglitazone', 'glyburide-metformin'], inplace=True)
df.columns.size

29

In [15]:
def count_changes(row):
    count =0
    for col in ['metformin','repaglinide','nateglinide','glimepiride','glipizide','glyburide','pioglitazone','rosiglitazone','insulin']:
        if(row[col]=='Up' or row[col]=='Down'):
            count+=1
    return count

new_col = df.apply(count_changes, axis=1)
df.insert(loc = len(df.columns)-1, column = 'changes', value=new_col)

new_col = test_df.apply(count_changes, axis=1)
test_df.insert(loc = len(test_df.columns), column = 'changes', value=new_col)
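`count_changes` counts, per encounter, how many medications had their dosage moved `Up` or `Down`. A vectorized sketch of the same count is shown below on a toy frame with only two of the nine medication columns; the frame and the shortened `med_cols` list are illustrative assumptions.

```python
import pandas as pd

med_cols = ["metformin", "insulin"]   # subset of the real medication columns
toy = pd.DataFrame({
    "metformin": ["Up", "No", "Steady"],
    "insulin": ["Down", "Up", "No"],
})

# True where the dosage changed, summed across the medication columns.
toy["changes"] = toy[med_cols].isin(["Up", "Down"]).sum(axis=1)
print(toy["changes"].tolist())
```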

In [16]:
df.drop(columns=['metformin','repaglinide','nateglinide','glimepiride','glipizide','glyburide','pioglitazone','rosiglitazone','insulin','change'], inplace=True)
test_df.drop(columns=['metformin','repaglinide','nateglinide','glimepiride','glipizide','glyburide','pioglitazone','rosiglitazone','insulin','change'], inplace=True)

df.columns.size

20

In [None]:
# medics_group = df[df['diag_2_new'] == 5].groupby(by=['medical_specialty'])
# medics_group.count()

In [None]:
# df.drop(columns=['medical_specialty'], inplace=True)
# test_df.drop(columns=['medical_specialty'], inplace=True)

In [None]:
# df.drop(columns=['num_procedures','num_medications','time_in_hospital'])
# test_df.drop(columns=['num_procedures','num_medications','time_in_hospital'])

In [18]:
plt.figure(figsize=(20,20))
corr = df.loc[:, ["time_in_hospital","num_lab_procedures","num_procedures","num_medications","number_outpatient","number_emergency","number_inpatient","number_diagnoses","changes", "readmission_id"]].corr()
sns.heatmap(corr,annot=True)

In [20]:
df.columns

Index(['age', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'diabetesMed', 'service_utilization', 'admission_type_id_new',
       'discharge_type_id_new', 'admission_source_id_new', 'diag_2_new',
       'diag_3_new', 'patient_id_new', 'changes', 'readmission_id'],
      dtype='object')

In [21]:
test_df.columns

Index(['age', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'diabetesMed', 'service_utilization', 'admission_type_id_new',
       'discharge_type_id_new', 'admission_source_id_new', 'diag_2_new',
       'diag_3_new', 'patient_id_new', 'changes'],
      dtype='object')

In [22]:
index_vals = df['readmission_id'].astype('category').cat.codes

dimensions = []

cols = ['num_lab_procedures','number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'changes']

for i in cols:
    d = dict(label=i, values=df[i])
    dimensions.append(d)

fig = go.Figure(data=go.Splom(
     dimensions=dimensions,
     diagonal_visible=False,
     text=df['readmission_id'],
     marker=dict(color=index_vals,
               line_color='white', line_width=0.5)
))

fig.update_layout(
    title='Readmission id',
    dragmode='select',
    width=1300,
    height=1300,
    hovermode='closest',
)

fig.show()


In [23]:
input = df.loc[:, "age":"changes"]
labels = df.loc[:, "readmission_id"]
input.columns

Index(['age', 'time_in_hospital', 'medical_specialty', 'num_lab_procedures',
       'num_procedures', 'num_medications', 'number_outpatient',
       'number_emergency', 'number_inpatient', 'number_diagnoses',
       'diabetesMed', 'service_utilization', 'admission_type_id_new',
       'discharge_type_id_new', 'admission_source_id_new', 'diag_2_new',
       'diag_3_new', 'patient_id_new', 'changes'],
      dtype='object')

In [24]:
input_encoded = pd.get_dummies(input, columns=['medical_specialty','admission_type_id_new',
                    'discharge_type_id_new','admission_source_id_new','diag_2_new','diag_3_new'])

input_encoded['diabetesMed'] = np.where(input_encoded['diabetesMed']=='Yes',1,0)

print(input_encoded.columns)

Index(['age', 'time_in_hospital', 'num_lab_procedures', 'num_procedures',
       'num_medications', 'number_outpatient', 'number_emergency',
       'number_inpatient', 'number_diagnoses', 'diabetesMed',
       ...
       'diag_3_new_8', 'diag_3_new_9', 'diag_3_new_10', 'diag_3_new_11',
       'diag_3_new_12', 'diag_3_new_13', 'diag_3_new_14', 'diag_3_new_16',
       'diag_3_new_17', 'diag_3_new_18'],
      dtype='object', length=136)


In [25]:
test_encoded = pd.get_dummies(test_df, columns=['medical_specialty','admission_type_id_new', 
                    'discharge_type_id_new','admission_source_id_new','diag_2_new','diag_3_new'])

test_encoded['diabetesMed'] = np.where(test_encoded['diabetesMed']=='Yes',1,0)


print(test_encoded.columns)

Index(['age', 'time_in_hospital', 'num_lab_procedures', 'num_procedures',
       'num_medications', 'number_outpatient', 'number_emergency',
       'number_inpatient', 'number_diagnoses', 'diabetesMed',
       ...
       'diag_3_new_8', 'diag_3_new_9', 'diag_3_new_10', 'diag_3_new_11',
       'diag_3_new_12', 'diag_3_new_13', 'diag_3_new_14', 'diag_3_new_16',
       'diag_3_new_17', 'diag_3_new_18'],
      dtype='object', length=130)


In [26]:
for i in input_encoded.columns:
    if i not in test_encoded.columns:
       test_encoded[i] = 0

for i in test_encoded.columns:
    if i not in input_encoded.columns:
       test_encoded.drop(columns=[i], inplace=True)

input_encoded.sort_index(axis=1, inplace=True)
test_encoded.sort_index(axis=1, inplace=True)
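The two loops above make the test columns match the training columns: dummy columns missing from the test set are added as zeros, and columns unseen in training are dropped. `DataFrame.reindex` can express both steps at once, as in this sketch on hypothetical toy frames.

```python
import pandas as pd

train_enc = pd.DataFrame(
    {"age": [65, 75], "diag_2_new_3": [1, 0], "diag_2_new_7": [0, 1]}
)
test_enc = pd.DataFrame({"age": [55], "diag_2_new_3": [1], "diag_2_new_9": [1]})

# Keep exactly the training columns, in training order; fill absent ones with 0.
test_aligned = test_enc.reindex(columns=train_enc.columns, fill_value=0)
print(test_aligned)
```

Because the column order is taken from the training frame, no separate sort is needed for the two frames to line up.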

In [27]:
print(input_encoded.columns)
print(test_encoded.columns)

Index(['admission_source_id_new_1', 'admission_source_id_new_2',
       'admission_source_id_new_3', 'admission_source_id_new_4',
       'admission_source_id_new_5', 'admission_source_id_new_6',
       'admission_type_id_new_1', 'admission_type_id_new_2',
       'admission_type_id_new_3', 'admission_type_id_new_4',
       ...
       'num_lab_procedures', 'num_medications', 'num_procedures',
       'number_diagnoses', 'number_emergency', 'number_inpatient',
       'number_outpatient', 'patient_id_new', 'service_utilization',
       'time_in_hospital'],
      dtype='object', length=136)
Index(['admission_source_id_new_1', 'admission_source_id_new_2',
       'admission_source_id_new_3', 'admission_source_id_new_4',
       'admission_source_id_new_5', 'admission_source_id_new_6',
       'admission_type_id_new_1', 'admission_type_id_new_2',
       'admission_type_id_new_3', 'admission_type_id_new_4',
       ...
       'num_lab_procedures', 'num_medications', 'num_procedures',
       'number

In [28]:
scaler = StandardScaler()
num_cols = ['patient_id_new','age','num_lab_procedures','number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses', 'changes','num_procedures','num_medications','time_in_hospital']
# Fit the scaler on the training data only, then reuse its statistics for the test data.
input_encoded[num_cols] = scaler.fit_transform(input_encoded[num_cols].to_numpy())
test_encoded[num_cols] = scaler.transform(test_encoded[num_cols].to_numpy())

In [29]:
# preprocessor = ColumnTransformer(
#     transformers=[
#         ('transformer', PolynomialFeatures(degree=2, include_bias=False), ['time_in_hospital', 'num_lab_procedures', 'num_procedures', 'num_medications', 'number_outpatient', 'number_emergency', 'number_inpatient', 'number_diagnoses']),
#     ],
#     remainder='passthrough'
# )

# input_encoded = preprocessor.fit_transform(input_encoded)
# test_encoded = preprocessor.transform(test_encoded)

In [30]:
X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.2, random_state=42)

In [31]:
# lr = LogisticRegression(random_state=42, multi_class="multinomial")
# lr.fit(X_train,Y_train)

# y_pred = lr.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

# nb = GaussianNB()
# nb.fit(X_train,Y_train)

# y_pred = nb.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

# tree = DecisionTreeClassifier(max_depth=20,random_state=42)
# tree.fit(X_train,Y_train)

# y_pred = tree.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

In [32]:
rf = RandomForestClassifier(random_state=42, criterion='entropy', max_depth=30, n_estimators=440)
rf.fit(X_train,Y_train)

y_pred = rf.predict(X_test)
print(accuracy_score(y_pred, Y_test))

0.7114682762492981


In [33]:
# gbc = GradientBoostingClassifier(n_estimators=100,learning_rate=0.1,max_depth=4,random_state=42)
# gbc.fit(X_train,Y_train)

# y_pred = gbc.predict(X_test)
# print(accuracy_score(y_pred, Y_test))

In [35]:
# def objective(trial):
#     criterion = trial.suggest_categorical("criterion", ["gini", "entropy"])
#     max_depth = trial.suggest_int("max_depth", 2, 32, log=True)
#     n_estimators = trial.suggest_int("n_estimators", 100,500)
#     random_state = trial.suggest_int("random_state",42,42)
#     rf = RandomForestClassifier(criterion =criterion,
#             max_depth=max_depth, 
#             n_estimators=n_estimators,
#             random_state=random_state
#         )
#     X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.2, random_state=42)
#     rf.fit(X_train,Y_train)
#     y_pred = rf.predict(X_test)
#     score = accuracy_score(y_pred, Y_test)
#     return score


# study = optuna.create_study(direction="maximize")
# study.optimize(objective, n_trials=15)


# def objective2(trial):
#     # data, target = sklearn.datasets.load_breast_cancer(return_X_y=True)
#     X_train,X_test,Y_train,Y_test = train_test_split(input_encoded, labels, test_size=0.3, random_state=42)
#     regex = re.compile(r"\[|\]|<", re.IGNORECASE)
#     dict ={0:1.45,2:1,1:1.4}
#     X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
#     X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]

#     max_depth = trial.suggest_int("max_depth", 3, 10)
#     n_estimators = trial.suggest_int("n_estimators", 200,500)
#     learning_rate = trial.suggest_int("learning_rate",0,1)
#     # gamma = trial.suggest_int("gamma",0,5)
#     reg_lambda = trial.suggest_int("reg_lambda",0,5)
#     class_weight = trial.suggest_int("class_weight",0,3)
#     rf = xgb.XGBClassifier(
#             max_depth=max_depth, 
#             n_estimators=n_estimators,
#             learning_rate=learning_rate,
#             reg_lambda=reg_lambda,
#             class_weight = class_weight
#         )
#     rf.fit(X_train,Y_train)
#     preds = rf.predict(X_test)
#     pred_labels = np.rint(preds)
#     accuracy = accuracy_score(Y_test, pred_labels)
#     return accuracy

# study = optuna.create_study(direction="maximize")
# study.optimize(objective2, n_trials=15)


In [36]:
regex = re.compile(r"\[|\]|<", re.IGNORECASE)
 
X_train.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_train.columns.values]
X_test.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in X_test.columns.values]
test_encoded.columns = [regex.sub("_", col) if any(x in str(col) for x in set(('[', ']', '<'))) else col for col in test_encoded.columns.values]


xgb = xgb.XGBClassifier(max_depth=3,n_estimators=208,learning_rate=1,reg_lambda=3,class_weight=2)
xgb.fit(X_train,Y_train)

y_pred = xgb.predict(X_test)
print(accuracy_score(y_pred, Y_test))

Parameters: { "class_weight" } are not used.

0.7120297585626053
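The `class_weight` argument passed in this cell is reported as unused by XGBoost. If the intent was to compensate for class imbalance, one option is to compute per-sample weights and hand them to the training call. The sketch below uses the common "balanced" heuristic, n_samples / (n_classes * class count), with plain NumPy; the helper name and the toy labels are assumptions, and whether reweighting improves this model is untested here.

```python
import numpy as np

def balanced_sample_weights(y) -> np.ndarray:
    """Weight each sample by n_samples / (n_classes * count of its class)."""
    y = np.asarray(y)
    counts = np.bincount(y)                 # occurrences of each integer label
    n_classes = np.count_nonzero(counts)    # labels that actually appear
    class_weights = len(y) / (n_classes * np.maximum(counts, 1))
    return class_weights[y]

labels_toy = np.array([0, 0, 0, 1, 2, 2])   # hypothetical readmission_id values
weights = balanced_sample_weights(labels_toy)
print(weights)
```

The resulting array could then be supplied through the `sample_weight` argument of the classifier's `fit` method, for example `fit(X_train, Y_train, sample_weight=balanced_sample_weights(Y_train))`.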


In [73]:
test_Y = xgb.predict(test_encoded)

df_output = pd.read_csv("./sample_submission.csv")
df_output["readmission_id"] = test_Y
df_output.to_csv("submission9.csv", index=False)

In [None]:
cm = confusion_matrix(Y_test, y_pred, labels=xgb.classes_)
disp = ConfusionMatrixDisplay(confusion_matrix=cm,display_labels=xgb.classes_)
disp.plot()
plt.show()
print(accuracy_score(y_pred, Y_test))