# Janatahack: Healthcare Analytics II

#### The healthcare sector has long been an early adopter of technological advances and has benefited greatly from them. These days, machine learning plays a key role in many health-related areas, including the development of new medical procedures, the handling of patient data, staff management, and more.



# Problem Statement

#### The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas of focus: healthcare management.
#### While healthcare management offers various use cases for data science, patient length of stay (LOS) is one critical parameter to observe and predict in order to improve the efficiency of healthcare management in a hospital.

#### This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plan optimized to minimize LOS and lower the chance of staff or visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

#### Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.
#### The task is to accurately predict the Length of Stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 different classes ranging from 0-10 days to more than 100 days.

## Data Description


#### Train.zip contains 1 csv alongside the data dictionary that contains definitions for each variable

#### train.csv – File containing features related to patient, hospital and Length of stay on case basis

#### train_data_dict.csv – File containing the information of the features in train file



#### Test Set

#### test.csv – File containing features related to patient and hospital. The Length of stay must be predicted for each case_id



#### Sample Submission:

#### case_id: Unique id for each case

#### Stay: Length of stay for the patient w.r.t each case id in test data

## Evaluation Metric

#### The evaluation metric for this hackathon is 100*Accuracy Score.
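
#### As a minimal sketch of this metric, assuming scikit-learn's accuracy_score and a few made-up labels (hackathon_score is a hypothetical helper name, not part of the competition kit):

```python
from sklearn.metrics import accuracy_score


def hackathon_score(y_true, y_pred):
    """Return 100 * accuracy, the metric described above."""
    return 100 * accuracy_score(y_true, y_pred)


# Toy labels, invented purely for illustration.
y_true = ["0-10", "11-20", "21-30", "21-30"]
y_pred = ["0-10", "21-30", "21-30", "21-30"]

score = hackathon_score(y_true, y_pred)
print(score)  # 3 of the 4 toy predictions match
```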



## Solution



#### Import all the necessary libraries

In [1]:
import numpy as np 
import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import RobustScaler

#### Let us read the csv file with our training dataset

In [2]:
df = pd.read_csv("C:/Users/Daksha/Desktop/AV Healthcare Train.csv")

In [3]:
df.head()

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558,41-50


#### Let us check all the info. we can get from our dataset

In [4]:
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 318438 entries, 0 to 318437
Data columns (total 18 columns):
case_id                              318438 non-null int64
Hospital_code                        318438 non-null int64
Hospital_type_code                   318438 non-null object
City_Code_Hospital                   318438 non-null int64
Hospital_region_code                 318438 non-null object
Available Extra Rooms in Hospital    318438 non-null int64
Department                           318438 non-null object
Ward_Type                            318438 non-null object
Ward_Facility_Code                   318438 non-null object
Bed Grade                            318325 non-null float64
patientid                            318438 non-null int64
City_Code_Patient                    313906 non-null float64
Type of Admission                    318438 non-null object
Severity of Illness                  318438 non-null object
Visitors with Patient                318438 non-null

In [5]:
#### Let us also check the statistical info. of all the columns in our dataset

In [6]:
df.describe()

Unnamed: 0,case_id,Hospital_code,City_Code_Hospital,Available Extra Rooms in Hospital,Bed Grade,patientid,City_Code_Patient,Visitors with Patient,Admission_Deposit
count,318438.0,318438.0,318438.0,318438.0,318325.0,318438.0,313906.0,318438.0,318438.0
mean,159219.5,18.318841,4.771717,3.197627,2.625807,65747.579472,7.251859,3.284099,4880.749392
std,91925.276847,8.633755,3.102535,1.168171,0.873146,37979.93644,4.745266,1.764061,1086.776254
min,1.0,1.0,1.0,0.0,1.0,1.0,1.0,0.0,1800.0
25%,79610.25,11.0,2.0,2.0,2.0,32847.0,4.0,2.0,4186.0
50%,159219.5,19.0,5.0,3.0,3.0,65724.5,8.0,3.0,4741.0
75%,238828.75,26.0,7.0,4.0,3.0,98470.0,8.0,4.0,5409.0
max,318438.0,32.0,13.0,24.0,4.0,131624.0,38.0,32.0,11008.0


#### Let us start by checking for null values. From info() we have seen that two columns contain null values.

In [7]:
df.isnull().sum()

case_id                                 0
Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                             113
patientid                               0
City_Code_Patient                    4532
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
Stay                                    0
dtype: int64

#### Bed Grade and City_Code_Patient have a relatively small number of null values compared to the total number of rows in the dataset

#### Since our ML model does not accept null values, we will have to fill them

#### Since we do not have a huge amount of data, we should not drop the rows with null values, as that would lose vital information

#### Let us first check the nature of the null values present in our dataset

#### The null values may be MCAR (Missing Completely at Random) or MNAR (Missing Not at Random)

#### Let us see if there is any relationship between the null values and other values present
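
#### One way to look for such a relationship is to compare the share of missing values across the categories of another column. The sketch below uses a tiny invented frame, and missing_rate_by is a hypothetical helper. If the rates differ strongly between groups, the values are probably not missing completely at random.

```python
import pandas as pd


def missing_rate_by(frame, target_col, group_col):
    """Share of rows where target_col is null, per value of group_col."""
    return frame[target_col].isnull().groupby(frame[group_col]).mean()


# Invented toy data; the column names mirror the dataset, the values do not.
toy = pd.DataFrame({
    "Hospital_code": [6, 6, 6, 14, 14, 26],
    "Bed Grade": [None, 2.0, None, 3.0, 3.0, 4.0],
})

rates = missing_rate_by(toy, "Bed Grade", "Hospital_code")
print(rates)
```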

In [8]:
df['Age'].value_counts()

41-50     63749
31-40     63639
51-60     48514
21-30     40843
71-80     35792
61-70     33687
Nov-20    16768
81-90      7890
0-10       6254
91-100     1302
Name: Age, dtype: int64

In [9]:
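from sklearn.preprocessing import LabelEncoder
# LabelEncoder assigns integer codes in sorted string order, so 'Nov-20' receives the highest code (9)
labelencoder = LabelEncoder()
df['Age'] = labelencoder.fit_transform(df['Age'])
print(df['Age'])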

0         4
1         4
2         4
3         4
4         4
         ..
318433    3
318434    7
318435    6
318436    9
318437    9
Name: Age, Length: 318438, dtype: int32
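
#### LabelEncoder orders the string categories alphabetically, so the codes above do not follow age order ('Nov-20' ends up last). A possible alternative, sketched here under the assumption that 'Nov-20' is a spreadsheet-mangled '11-20', is an explicit ordinal mapping. AGE_ORDER and encode_age are hypothetical names introduced for illustration.

```python
import pandas as pd

# Assumed true ordering of the age bands, youngest to oldest.
AGE_ORDER = [
    "0-10", "11-20", "21-30", "31-40", "41-50",
    "51-60", "61-70", "71-80", "81-90", "91-100",
]


def encode_age(age_series):
    """Map age bands to 0..9 in age order (assumes 'Nov-20' means '11-20')."""
    cleaned = age_series.replace({"Nov-20": "11-20"})
    categorical = pd.Categorical(cleaned, categories=AGE_ORDER, ordered=True)
    return pd.Series(categorical.codes, index=age_series.index, name=age_series.name)


toy_age = pd.Series(["51-60", "Nov-20", "0-10", "91-100"], name="Age")
codes = encode_age(toy_age)
print(codes.tolist())
```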


In [10]:
df['Hospital_code'].value_counts()

26    33076
23    26566
19    21219
6     20425
11    17328
14    17328
28    17137
27    14244
9     11510
29    11311
12    11297
32    10703
25     9834
10     9435
15     9257
21     8150
24     7992
3      7116
17     5501
5      5261
1      5249
13     5236
2      5102
30     5002
22     4277
31     3967
16     3671
8      3663
18     3630
20     1405
7      1306
4      1240
Name: Hospital_code, dtype: int64

In [11]:
df[df['Bed Grade'].isnull()]

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
293,294,6,a,6,X,4,gynecology,Q,F,,27075,15.0,Trauma,Extreme,2,1,4420,31-40
1071,1072,6,a,6,X,2,gynecology,Q,F,,62491,8.0,Trauma,Extreme,4,5,5395,21-30
20379,20380,6,a,6,X,4,gynecology,Q,F,,69932,2.0,Trauma,Extreme,3,3,5989,31-40
23791,23792,6,a,6,X,3,gynecology,R,F,,29943,10.0,Emergency,Minor,3,2,4488,41-50
25162,25163,6,a,6,X,5,gynecology,R,F,,92499,1.0,Emergency,Minor,2,6,4885,21-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234337,234338,6,a,6,X,2,radiotherapy,R,F,,22881,7.0,Emergency,Minor,2,9,2416,0-10
234577,234578,6,a,6,X,2,gynecology,R,F,,120677,2.0,Trauma,Extreme,4,3,4932,51-60
234895,234896,6,a,6,X,2,gynecology,R,F,,111514,1.0,Trauma,Moderate,4,4,3984,Nov-20
235048,235049,6,a,6,X,2,gynecology,R,F,,57706,2.0,Trauma,Moderate,3,3,4139,51-60


In [12]:
#### Currently we do not need case_id for any of the predictions so we will drop the column
case_id2 = pd.DataFrame(df['case_id'])
df.drop('case_id', axis=1, inplace=True)

In [13]:
df[df['Bed Grade'].isnull()]

Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
293,6,a,6,X,4,gynecology,Q,F,,27075,15.0,Trauma,Extreme,2,1,4420,31-40
1071,6,a,6,X,2,gynecology,Q,F,,62491,8.0,Trauma,Extreme,4,5,5395,21-30
20379,6,a,6,X,4,gynecology,Q,F,,69932,2.0,Trauma,Extreme,3,3,5989,31-40
23791,6,a,6,X,3,gynecology,R,F,,29943,10.0,Emergency,Minor,3,2,4488,41-50
25162,6,a,6,X,5,gynecology,R,F,,92499,1.0,Emergency,Minor,2,6,4885,21-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
234337,6,a,6,X,2,radiotherapy,R,F,,22881,7.0,Emergency,Minor,2,9,2416,0-10
234577,6,a,6,X,2,gynecology,R,F,,120677,2.0,Trauma,Extreme,4,3,4932,51-60
234895,6,a,6,X,2,gynecology,R,F,,111514,1.0,Trauma,Moderate,4,4,3984,Nov-20
235048,6,a,6,X,2,gynecology,R,F,,57706,2.0,Trauma,Moderate,3,3,4139,51-60


In [14]:
df['Hospital_type_code'].value_counts()

a    143425
b     68946
c     45928
e     24770
d     20389
f     10703
g      4277
Name: Hospital_type_code, dtype: int64

In [15]:
df['City_Code_Hospital'].value_counts()

1     55351
2     51809
6     46991
7     35463
3     31569
5     31105
9     26277
11    17137
4     13857
10     5249
13     3630
Name: City_Code_Hospital, dtype: int64

In [16]:
df['Hospital_region_code'].value_counts()

X    133336
Y    122428
Z     62674
Name: Hospital_region_code, dtype: int64

In [17]:
df['Bed Grade'].value_counts()

2.0    123671
3.0    110583
4.0     57566
1.0     26505
Name: Bed Grade, dtype: int64

In [18]:
# Fill missing Bed Grade values with 2.0, the most frequent grade in the counts above
df['Bed Grade'] = df['Bed Grade'].fillna(2.0)
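
#### The fill value 2.0 is the most frequent Bed Grade. As a small sketch, the mode can also be computed instead of typed by hand, which avoids mistakes if the data changes. fill_with_mode is a hypothetical helper and the toy Series is invented.

```python
import pandas as pd


def fill_with_mode(series):
    """Replace nulls with the most frequent non-null value of the series."""
    mode_value = series.mode(dropna=True).iloc[0]
    return series.fillna(mode_value)


# Invented toy values standing in for the real column.
toy_grade = pd.Series([2.0, 3.0, None, 2.0, 4.0], name="Bed Grade")
filled = fill_with_mode(toy_grade)
print(filled.tolist())
```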

In [19]:
df.isnull().sum()

Hospital_code                           0
Hospital_type_code                      0
City_Code_Hospital                      0
Hospital_region_code                    0
Available Extra Rooms in Hospital       0
Department                              0
Ward_Type                               0
Ward_Facility_Code                      0
Bed Grade                               0
patientid                               0
City_Code_Patient                    4532
Type of Admission                       0
Severity of Illness                     0
Visitors with Patient                   0
Age                                     0
Admission_Deposit                       0
Stay                                    0
dtype: int64

In [20]:
df[df['City_Code_Patient'].isnull()]

Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
191,14,a,1,X,2,radiotherapy,Q,E,3.0,122110,,Emergency,Minor,2,6,9096,51-60
192,30,c,3,Z,2,anesthesia,Q,A,4.0,122110,,Trauma,Minor,2,6,5098,31-40
193,27,a,7,Y,2,radiotherapy,P,C,3.0,122110,,Trauma,Minor,2,6,7776,21-30
194,27,a,7,Y,2,anesthesia,Q,C,3.0,122110,,Trauma,Minor,2,6,5988,Nov-20
195,25,e,1,X,3,radiotherapy,S,E,3.0,122110,,Urgent,Minor,2,6,5333,21-30
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318198,23,a,6,X,2,anesthesia,Q,F,3.0,58469,,Urgent,Minor,2,9,4432,Nov-20
318263,28,b,11,X,3,anesthesia,R,F,3.0,66803,,Trauma,Moderate,2,2,5415,Nov-20
318269,26,b,2,Y,3,gynecology,R,D,4.0,95483,,Trauma,Minor,5,4,4135,51-60
318271,28,b,11,X,2,gynecology,R,F,4.0,117128,,Emergency,Minor,2,5,3418,0-10


In [21]:
df['City_Code_Patient'].value_counts()

8.0     124011
2.0      38869
1.0      26377
7.0      23807
5.0      20079
4.0      15380
9.0      11795
15.0      8950
10.0      8174
6.0       6005
12.0      5647
3.0       3772
23.0      3698
14.0      2927
16.0      2254
13.0      1625
21.0      1602
20.0      1409
18.0      1404
19.0      1028
26.0      1023
25.0       798
27.0       771
11.0       658
28.0       521
22.0       405
24.0       360
30.0       133
29.0        98
33.0        78
31.0        59
37.0        57
32.0        52
34.0        46
35.0        16
36.0        12
38.0         6
Name: City_Code_Patient, dtype: int64

In [22]:
df['Type of Admission'].value_counts()

Trauma       152261
Emergency    117676
Urgent        48501
Name: Type of Admission, dtype: int64

In [23]:
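# Fill missing City_Code_Patient values with 8.0, the most frequent city code in the counts above
df['City_Code_Patient'] = df['City_Code_Patient'].fillna(8.0)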

In [24]:
df.isnull().sum()

Hospital_code                        0
Hospital_type_code                   0
City_Code_Hospital                   0
Hospital_region_code                 0
Available Extra Rooms in Hospital    0
Department                           0
Ward_Type                            0
Ward_Facility_Code                   0
Bed Grade                            0
patientid                            0
City_Code_Patient                    0
Type of Admission                    0
Severity of Illness                  0
Visitors with Patient                0
Age                                  0
Admission_Deposit                    0
Stay                                 0
dtype: int64

In [25]:
df['Stay'].value_counts()

21-30                 87491
Nov-20                78139
31-40                 55159
51-60                 35018
0-10                  23604
41-50                 11743
71-80                 10254
More than 100 Days     6683
81-90                  4838
91-100                 2765
61-70                  2744
Name: Stay, dtype: int64
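
#### The class counts above are imbalanced, so it helps to know what accuracy a model would reach by always predicting the most frequent class. A small sketch using the counts printed above (stay_counts is a hand-copied Series, not a new computation on the data):

```python
import pandas as pd

# Class counts copied from the value_counts() output above.
stay_counts = pd.Series({
    "21-30": 87491,
    "Nov-20": 78139,
    "31-40": 55159,
    "51-60": 35018,
    "0-10": 23604,
    "41-50": 11743,
    "71-80": 10254,
    "More than 100 Days": 6683,
    "81-90": 4838,
    "91-100": 2765,
    "61-70": 2744,
})

majority_class = stay_counts.idxmax()
baseline_accuracy = stay_counts.max() / stay_counts.sum()
print(majority_class, baseline_accuracy)
```

#### Any model score is best read against this baseline rather than against zero.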

In [26]:
X = df.drop('Stay', axis=1)
Y = df['Stay']

In [27]:
X = pd.get_dummies(X)

In [28]:
X

Unnamed: 0,Hospital_code,City_Code_Hospital,Available Extra Rooms in Hospital,Bed Grade,patientid,City_Code_Patient,Visitors with Patient,Age,Admission_Deposit,Hospital_type_code_a,...,Ward_Facility_Code_C,Ward_Facility_Code_D,Ward_Facility_Code_E,Ward_Facility_Code_F,Type of Admission_Emergency,Type of Admission_Trauma,Type of Admission_Urgent,Severity of Illness_Extreme,Severity of Illness_Minor,Severity of Illness_Moderate
0,8,3,3,2.0,31397,7.0,2,4,4911,0,...,0,0,0,1,1,0,0,1,0,0
1,2,5,2,2.0,31397,7.0,2,4,5954,0,...,0,0,0,1,0,1,0,1,0,0
2,10,1,2,2.0,31397,7.0,2,4,4745,0,...,0,0,1,0,0,1,0,1,0,0
3,26,2,2,2.0,31397,7.0,2,4,7272,0,...,0,1,0,0,0,1,0,1,0,0
4,26,2,2,2.0,31397,7.0,2,4,5558,0,...,0,1,0,0,0,1,0,1,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
318433,6,6,3,4.0,86499,23.0,3,3,4144,1,...,0,0,0,1,1,0,0,0,0,1
318434,24,1,2,4.0,325,8.0,4,7,6699,1,...,0,0,1,0,0,0,1,0,0,1
318435,7,4,3,4.0,125235,10.0,3,6,4235,1,...,0,0,0,1,1,0,0,0,1,0
318436,11,2,3,3.0,91081,8.0,5,9,3761,0,...,0,1,0,0,0,1,0,0,1,0


In [29]:
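from sklearn.preprocessing import StandardScaler
# Standardize every feature to zero mean and unit variance
# (the scaler is fitted on the full dataset here, before the train/test split)
mm = StandardScaler()
X = mm.fit_transform(X)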

In [32]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
# The projection is only displayed here; it is not assigned, so X keeps all of its columns
pca.fit_transform(X)

array([[-1.18805521,  2.74339907,  0.47223856, -0.3145233 , -0.32670496],
       [-0.95294102,  3.4499711 ,  0.62867087,  1.49451734,  0.01327427],
       [-2.29677406, -0.9674132 , -1.67753342,  2.99046613,  3.12723847],
       ...,
       [-1.90382113, -1.06273378,  1.08608305, -0.74957898, -0.47963824],
       [ 1.75173176, -0.30301772, -3.42787312,  1.37139859, -1.56788347],
       [ 1.71352973, -1.22871654,  0.95266763, -3.1434045 ,  0.6248279 ]])
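
#### To actually train on the reduced features, the projection would need to be stored in a variable. The sketch below shows this on synthetic data (X_demo is random noise, not the hospital data) and also checks how much variance the 5 components keep.

```python
import numpy as np
from sklearn.decomposition import PCA

# Synthetic stand-in for the scaled feature matrix.
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 10))

pca_demo = PCA(n_components=5)
X_reduced = pca_demo.fit_transform(X_demo)  # keep the result this time

# Fraction of the total variance retained by the 5 components.
retained = pca_demo.explained_variance_ratio_.sum()
print(X_reduced.shape, retained)
```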

In [33]:
from sklearn.model_selection import train_test_split
X_train, X_test, Y_train, Y_test = train_test_split(X, Y, test_size=0.2, random_state=0)
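
#### A common way to keep the held-out rows from influencing preprocessing is to split first and fit the scaler on the training part only. The sketch below does this with a scikit-learn pipeline on synthetic data. LogisticRegression is only a lightweight stand-in for the classifiers used in this notebook, and all names ending in _demo are invented.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic data: the label depends on the first feature only.
rng = np.random.default_rng(0)
X_demo = rng.normal(loc=50, scale=10, size=(300, 4))
y_demo = (X_demo[:, 0] > 50).astype(int)

# Split before any fitting so test rows never reach the scaler.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.2, random_state=0
)

# The pipeline fits the scaler on X_tr and reuses it when scoring X_te.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_tr, y_tr)
held_out_accuracy = model.score(X_te, y_te)
print(held_out_accuracy)
```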

In [34]:
from sklearn.tree import DecisionTreeClassifier
from sklearn.neighbors import KNeighborsClassifier
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.ensemble import RandomForestClassifier

In [35]:
import xgboost as xgb
# Train an XGBoost classifier with default settings and report accuracy on the held-out 20%
xg_clf = xgb.XGBClassifier()
xg_clf.fit(X_train,Y_train)
xg_clf.score(X_test,Y_test)

0.42846376083406607
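
#### Depending on the installed version, XGBClassifier may reject string class labels such as '21-30' and expect integers 0..n-1 instead (this is an assumption about the reader's environment, not something the run above shows). A hedged sketch of encoding the target and decoding predictions again, using only scikit-learn and a few invented labels:

```python
from sklearn.preprocessing import LabelEncoder

# Invented sample of Stay labels in the format used by the dataset.
stay_labels = ["0-10", "21-30", "Nov-20", "21-30", "More than 100 Days"]

target_encoder = LabelEncoder()
y_encoded = target_encoder.fit_transform(stay_labels)  # integer codes for training

# After predicting integer codes, map them back to the original strings.
decoded = target_encoder.inverse_transform(y_encoded)
print(y_encoded.tolist(), list(decoded))
```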

In [36]:
forest = RandomForestClassifier()
forest.fit(X_train,Y_train)
forest.score(X_test,Y_test)



0.36496671272453207

In [49]:
df2 = pd.read_csv("C:/Users/Daksha/Desktop/AV Healthcare Test.csv")

In [50]:
# Reuse the encoder fitted on the training Age column so both files share one mapping
df2['Age'] = labelencoder.transform(df2['Age'])
print(df2['Age'])

0         6
1         6
2         6
3         6
4         6
         ..
137052    3
137053    0
137054    0
137055    3
137056    4
Name: Age, Length: 137057, dtype: int32


In [51]:
df2.drop('case_id', axis=1, inplace=True)

In [52]:
# Same fill value as used for the training data
df2['Bed Grade'] = df2['Bed Grade'].fillna(2.0)

In [53]:
# Same fill value as used for the training data
df2['City_Code_Patient'] = df2['City_Code_Patient'].fillna(8.0)

In [54]:
df2 = pd.get_dummies(df2)
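
#### Calling get_dummies separately on train and test only yields matching columns when both files contain the same categories. A defensive sketch, on invented toy frames, is to reindex the test columns to the training columns:

```python
import pandas as pd

# Invented toy frames: the test frame has an unseen ward type 'T' and lacks 'Q' and 'S'.
train_demo = pd.DataFrame({"Ward_Type": ["R", "S", "Q"], "Bed Grade": [2.0, 3.0, 4.0]})
test_demo = pd.DataFrame({"Ward_Type": ["R", "T"], "Bed Grade": [2.0, 1.0]})

train_encoded = pd.get_dummies(train_demo)

# Keep exactly the training columns: missing ones are filled with 0, extras are dropped.
test_encoded = pd.get_dummies(test_demo).reindex(
    columns=train_encoded.columns, fill_value=0
)
print(list(test_encoded.columns))
```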

In [55]:
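from sklearn.preprocessing import StandardScaler
# Note: this fits a new scaler on the test data, so the test features are
# standardized with test-set statistics rather than the training statistics
mm = StandardScaler()
df2 = mm.fit_transform(df2)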

In [56]:
from sklearn.decomposition import PCA
pca = PCA(n_components = 5)
# As with the training data, the projection is only displayed and not assigned
pca.fit_transform(df2)

array([[-0.26929259,  4.2774522 ,  0.1135387 , -1.34543131, -1.84834425],
       [-1.14317337, -0.95361454,  1.06508131,  0.89823503, -0.03078682],
       [ 2.53407292, -0.54089868, -2.84888961, -0.84726509,  0.72045973],
       ...,
       [-1.20795704,  4.51619681, -0.70346218,  2.16069305, -0.06912794],
       [-2.85648299, -1.27275199, -0.84686561,  2.65853994, -0.51536822],
       [-1.80362595, -0.91040012,  0.66834058, -0.39544737,  1.39575475]])

In [57]:
pd2 = xg_clf.predict(df2)

In [58]:
df3 = pd.DataFrame(pd2)

In [59]:
df3

Unnamed: 0,0
0,0-10
1,51-60
2,21-30
3,21-30
4,51-60
...,...
137052,21-30
137053,0-10
137054,Nov-20
137055,Nov-20


In [60]:
path="C:/Users/Daksha/Desktop/SampleSubmissionAVF2.csv"
df3.to_csv(path, index=False)
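
#### The sample submission described earlier has two columns, case_id and Stay. A minimal sketch of building that layout, assuming the test case_id values were saved before the column was dropped (build_submission is a hypothetical helper, and the ids and predictions below are invented):

```python
import pandas as pd


def build_submission(case_ids, predictions):
    """Pair each case_id with its predicted Stay class."""
    return pd.DataFrame({"case_id": list(case_ids), "Stay": list(predictions)})


# Invented ids and predictions, for illustration only.
submission = build_submission([101, 102], ["0-10", "51-60"])

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = submission.to_csv(index=False)
print(csv_text)
```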