# Janatahack: Healthcare Analytics II

## [Janatahack: Healthcare Analytics II](https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii)

The healthcare sector has long been an early adopter of technological advances and has benefited greatly from them. These days, machine learning plays a key role in many health-related realms, including the development of new medical procedures, the handling of patient data, staff management and more.

This weekend we invite you to participate in another Janatahack with the theme of healthcare analytics. Stay tuned for the problem statement and datasets this Friday and get a chance to work on a real healthcare case study along with 250 AV points at stake.

## Problem Statement

The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas of focus: healthcare management. While healthcare management offers many use cases for data science, patient length of stay (LOS) is one critical parameter to observe and predict if one wants to improve the efficiency of healthcare management in a hospital.

This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plan optimized to minimize LOS and lower the chance of staff/visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.
The task is to accurately predict the length of stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 classes, ranging from 0-10 days to more than 100 days.
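
As a rough illustration of the class structure, the sketch below maps a raw number of days to one of the 11 buckets. The helper `stay_bucket` and the exact label strings (`'0-10'`, `'11-20'`, ..., `'More than 100 Days'`) are assumptions made for illustration; the dataset already ships with the bucketed `Stay` column, so no such conversion is needed in the notebook.

```python
def stay_bucket(days):
    """Map a stay length in whole days to one of 11 classes (label strings are assumed)."""
    if days < 0:
        raise ValueError("days must be non-negative")
    if days > 100:
        return "More than 100 Days"
    if days <= 10:
        return "0-10"
    # Remaining buckets are 11-20, 21-30, ..., 91-100
    lower = ((days - 1) // 10) * 10 + 1
    return f"{lower}-{lower + 9}"


# Collect every distinct label produced for stays of 0 to 150 days
labels = sorted({stay_bucket(d) for d in range(0, 151)})
print(len(labels))
```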

## Data

| Column | Description |
|---|---|
| case_id | Case_ID registered in Hospital |
| Hospital_code | Unique code for the Hospital |
| Hospital_type_code | Unique code for the type of Hospital |
| City_Code_Hospital | City Code of the Hospital |
| Hospital_region_code | Region Code of the Hospital |
| Available Extra Rooms in Hospital | Number of Extra rooms available in the Hospital |
| Department | Department overseeing the case |
| Ward_Type | Code for the Ward type |
| Ward_Facility_Code | Code for the Ward Facility |
| Bed Grade | Condition of Bed in the Ward |
| patientid | Unique Patient Id |
| City_Code_Patient | City Code for the patient |
| Type of Admission | Admission Type registered by the Hospital |
| Severity of Illness | Severity of the illness recorded at the time of admission |
| Visitors with Patient | Number of Visitors with the patient |
| Age | Age of the patient |
| Admission_Deposit | Deposit at the Admission Time |
| Stay | Stay Days by the patient |

## Evaluation Metric

The evaluation metric for this hackathon is 100*Accuracy Score.
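
A minimal sketch of how this metric could be computed, assuming it is plain classification accuracy multiplied by 100. The function name `accuracy_pct` and the toy labels are made up for illustration.

```python
def accuracy_pct(y_true, y_pred):
    """Return 100 * (share of predictions that exactly match the true label)."""
    if len(y_true) != len(y_pred):
        raise ValueError("y_true and y_pred must have the same length")
    if len(y_true) == 0:
        raise ValueError("at least one label is required")
    correct = sum(t == p for t, p in zip(y_true, y_pred))
    return 100.0 * correct / len(y_true)


# Toy example: three of the four predictions match
score = accuracy_pct(
    ["0-10", "11-20", "21-30", "11-20"],
    ["0-10", "21-30", "21-30", "11-20"],
)
print(score)
```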

# Load the Packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python

# Basic packages
import os  # file system access
import numpy as np  # linear algebra
import pandas as pd  # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt  # data visualization
import seaborn as sns  # advanced data visualization
%matplotlib inline

# Encoding
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Scaling
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()

# Multicollinearity (VIF)
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

# Data modelling
import sklearn.metrics
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold, train_test_split

# Oversampling; named `ros` so it does not shadow the statsmodels alias `sm`
from imblearn.over_sampling import RandomOverSampler
ros = RandomOverSampler(random_state=294, sampling_strategy='not majority')

# Models
import lightgbm as lgb
from lightgbm import LGBMClassifier

# Load the Datasets

## Loading from Kaggle Input Data

In [None]:
# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

import os
for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

df_Train = pd.read_csv('../input/av-janatahack-healthcare-hackathon-ii/Data/train.csv')
df_Test = pd.read_csv('../input/av-janatahack-healthcare-hackathon-ii/Data/test.csv')

## To check for Data Leakage

Checking the leakage of Data in Case ID

In [None]:
# Share of unique train case_ids that also appear in the test data (0.0 means no overlap)
print(np.intersect1d(df_Train['case_id'], df_Test['case_id']).shape[0] / df_Train['case_id'].nunique())
common_ids = len(set(df_Test['case_id'].unique()).intersection(set(df_Train['case_id'].unique())))

print("Common IDs : ", common_ids)

# If common_ids is 0, there is no case_id leak between train and test
print("Unique IDs : ", df_Test['case_id'].nunique() - common_ids)

Checking the leakage of Data in Patient ID

In [None]:
# Share of unique train patientids that also appear in the test data
print(np.intersect1d(df_Train['patientid'], df_Test['patientid']).shape[0] / df_Train['patientid'].nunique())
common_ids = len(set(df_Test['patientid'].unique()).intersection(set(df_Train['patientid'].unique())))

print("Common IDs : ", common_ids)

# patientid identifies a person rather than an admission, so a non-zero count
# means the same patients appear in both train and test
print("Unique IDs : ", df_Test['patientid'].nunique() - common_ids)
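
The two checks above repeat the same logic, so they could be wrapped in a small helper. The sketch below is one possible version; the name `id_overlap` and the toy ID values are hypothetical.

```python
import pandas as pd


def id_overlap(train_ids, test_ids):
    """Summarize how many unique IDs are shared between two ID columns."""
    train_set, test_set = set(train_ids), set(test_ids)
    common = train_set & test_set
    return {
        "common": len(common),                            # IDs present in both sets
        "test_only": len(test_set - train_set),           # IDs seen only in test
        "share_of_train": len(common) / len(train_set),   # overlap relative to unique train IDs
    }


# Toy ID columns with repeated values, as with patientid
toy_train = pd.Series([1, 2, 2, 3, 4])
toy_test = pd.Series([3, 4, 4, 5])
report = id_overlap(toy_train, toy_test)
print(report)
```

With the real data this would be called as `id_overlap(df_Train['patientid'], df_Test['patientid'])`.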

# Exploratory Data Analysis

In [None]:
#To find the head of the Data
df_Train.head()

In [None]:
#Information of the Dataset Datatype
df_Train.info()

In [None]:
#Information of the Dataset Continuous Values
df_Train.describe()

In [None]:
#Columns List
df_Train.columns

In [None]:
#Shape of the Train and Test Data
print('Shape of Train Data: ', df_Train.shape)
print('Shape of Test Data: ', df_Test.shape)

In [None]:
#Null values in the Train Dataset
print('Null values in Train Data: \n', df_Train.isnull().sum())

In [None]:
#Null Values in the Test Dataset
print('Null Values in Test Data: \n', df_Test.isnull().sum())

Missing values are present in the "Bed Grade" and "City_Code_Patient" columns.

In [None]:
print('Total Count of the Prediction Output Column Stay Variable: \n', df_Train['Stay'].value_counts())

# Data Insight and Visualization

## Target Variable "Stay" Count

In [None]:
#Counting Hospital Stay
df_Train['Stay'].value_counts()

In [None]:
#Counting Hospital Stay
sns.countplot(x='Stay',data=df_Train)
plt.xlabel("Stay")
plt.ylabel("Count")
plt.title("Stay Duration")
plt.show()

The Stay column is highly imbalanced. An oversampling technique such as SMOTE or random oversampling could be used to balance it.
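
A minimal sketch of random oversampling in plain pandas, assuming the goal is to bring every class up to the size of the largest one. This is a stand-in written for illustration, not the imblearn implementation, and the helper name `oversample_to_majority` and the toy data are made up. Oversampling should normally be applied to the training split only, so that validation scores are measured on untouched data.

```python
import pandas as pd


def oversample_to_majority(df, target, seed=294):
    """Resample each minority class with replacement up to the majority class size."""
    counts = df[target].value_counts()
    n_max = counts.max()
    parts = []
    for label, n in counts.items():
        group = df[df[target] == label]
        if n < n_max:
            # Draw the missing rows at random from the same class
            extra = group.sample(n=n_max - n, replace=True, random_state=seed)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Toy frame with a 6 / 3 / 1 class split
toy = pd.DataFrame({
    "Stay": ["0-10"] * 6 + ["11-20"] * 3 + ["21-30"],
    "Age": list(range(10)),
})
balanced = oversample_to_majority(toy, "Stay")
print(balanced["Stay"].value_counts())
```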

## Hospital Code Insight

In [None]:
#Counting Hospital Code
df_Train['Hospital_code'].value_counts()

In [None]:
#Counting Hospital Code
sns.countplot(x='Hospital_code',data=df_Train)
plt.xlabel("Hospital Code")
plt.ylabel("Count")
plt.title("Hospital Code Count")
plt.show()

Hospital Code is highly imbalanced and might affect the model.

## Hospital Type Code

In [None]:
#Counting Hospital Type Code
df_Train['Hospital_type_code'].value_counts()

In [None]:
#Counting Hospital Type Code
sns.countplot(x='Hospital_type_code',data=df_Train)
plt.xlabel("Hospital Type Code")
plt.ylabel("Count")
plt.title("Hospital Type Code Count")
plt.show()

Hospital Type Code is Imbalanced

## City Code Hospital

In [None]:
#Counting City Code Hospital
df_Train['City_Code_Hospital'].value_counts()

In [None]:
#Counting City Code Hospital
sns.countplot(x='City_Code_Hospital',data=df_Train)
plt.xlabel("City Code Hospital")
plt.ylabel("Count")
plt.title("City Code Hospital Count")
plt.show()

City Code Hospital is Imbalanced

## Hospital Region Code

In [None]:
#Counting Hospital Region Code
df_Train['Hospital_region_code'].value_counts()

In [None]:
#Counting Hospital Region Code
sns.countplot(x='Hospital_region_code',data=df_Train)
plt.xlabel("Hospital Region Code")
plt.ylabel("Count")
plt.title("Hospital Region Code Count")
plt.show()

## Available Extra Rooms in Hospital

In [None]:
#Counting Available Extra Rooms in Hospital
df_Train['Available Extra Rooms in Hospital'].value_counts()

In [None]:
#Counting Available Extra Rooms in Hospital
sns.countplot(x='Available Extra Rooms in Hospital',data=df_Train)
plt.xlabel("Available Extra Rooms in Hospital")
plt.ylabel("Count")
plt.title("Available Extra Rooms in Hospital Count")
plt.show()

Available Extra Rooms is positively skewed and may need to be transformed or binned.

## Department

In [None]:
#Counting Department
df_Train['Department'].value_counts()

In [None]:
#Counting Department
sns.countplot(x='Department',data=df_Train)
plt.xlabel("Department")
plt.ylabel("Count")
plt.title("Department Count")
plt.show()

Department is Highly Imbalanced

## Ward Type Variable

In [None]:
#Counting Ward Type
df_Train['Ward_Type'].value_counts()

In [None]:
#Counting Ward Type
sns.countplot(x='Ward_Type',data=df_Train)
plt.xlabel("Ward Type")
plt.ylabel("Count")
plt.title("Ward Type Count")
plt.show()

Ward Type Count is highly imbalanced

## Ward Facility Code

In [None]:
#Counting Ward Facility Code
df_Train['Ward_Facility_Code'].value_counts()

In [None]:
#Counting Ward Facility Code
sns.countplot(x='Ward_Facility_Code',data=df_Train)
plt.xlabel("Ward Facility Code")
plt.ylabel("Count")
plt.title("Ward Facility Code Count")
plt.show()

## Bed Grade

In [None]:
#Counting Bed Grade
df_Train['Bed Grade'].value_counts()

In [None]:
#Counting Bed Grade
sns.countplot(x='Bed Grade',data=df_Train)
plt.xlabel("Bed Grade")
plt.ylabel("Count")
plt.title("Bed Grade Count")
plt.show()

## patientid Variable

In [None]:
#Counting patientid
df_Train['patientid'].value_counts()

In [None]:
#No of Unique Data in the Patient ID Column
df_Train['patientid'].nunique()

In [None]:
#Unique Data in the Patient ID Column
df_Train['patientid'].unique()

In [None]:
#Counting patientid
#sns.countplot(x='patientid',data=df_Train)
#plt.xlabel("patientid")
#plt.ylabel("Count")
#plt.title("patientid Count")
#plt.show()

## City Code Patient Variable

In [None]:
#Counting City Code Patient
df_Train['City_Code_Patient'].value_counts()

In [None]:
#Counting City_Code_Patient
sns.countplot(x='City_Code_Patient',data=df_Train)
plt.xlabel("City Code Patient")
plt.ylabel("Count")
plt.title("City Code Patient Count")
plt.show()

City Code Patient is highly imbalanced

## Type of Admission Variable

In [None]:
#Counting Type of Admission
df_Train['Type of Admission'].value_counts()

In [None]:
#Counting Type of Admission
sns.countplot(x='Type of Admission',data=df_Train)
plt.xlabel("Type of Admission")
plt.ylabel("Count")
plt.title("Type of Admission Count")
plt.show()

## Severity of Illness Variable

In [None]:
#Counting Severity of Illness
df_Train['Severity of Illness'].value_counts()

In [None]:
#Counting Severity of Illness
sns.countplot(x='Severity of Illness',data=df_Train)
plt.xlabel("Severity of Illness")
plt.ylabel("Count")
plt.title("Severity of Illness Count")
plt.show()

## Visitors with Patient Variable

In [None]:
#Counting Visitors with Patient
df_Train['Visitors with Patient'].value_counts()

In [None]:
#Counting Visitors with Patient
sns.countplot(x='Visitors with Patient',data=df_Train)
plt.xlabel("Visitors with Patient")
plt.ylabel("Count")
plt.title("Visitors with Patient Count")
plt.show()

## Age Variable

In [None]:
#Counting Age
df_Train['Age'].value_counts()

In [None]:
#Counting Age
sns.countplot(x='Age',data=df_Train)
plt.xlabel("Age")
plt.ylabel("Count")
plt.title("Age Count")
plt.show()

## Admission Deposit Variable

In [None]:
#Admission Deposit Price
sns.boxplot(x=df_Train['Admission_Deposit'])
plt.xlabel("Admission Deposit")
plt.title("Admission_Deposit")
plt.show()

The box plot shows outliers in Admission Deposit, so the outliers need to be removed or the values scaled.
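
One common way to tame such outliers is to clip values that fall outside the box-plot whiskers. The sketch below assumes the usual 1.5 x IQR rule; the helper `clip_iqr` and the deposit values are hypothetical, and the notebook itself goes on to scale the column rather than clip it.

```python
import pandas as pd


def clip_iqr(values, k=1.5):
    """Clip a numeric Series to [Q1 - k*IQR, Q3 + k*IQR]."""
    q1, q3 = values.quantile(0.25), values.quantile(0.75)
    iqr = q3 - q1
    return values.clip(lower=q1 - k * iqr, upper=q3 + k * iqr)


# Toy deposits with one large value far above the rest
deposits = pd.Series([3000.0, 4000.0, 4500.0, 5000.0, 5500.0, 6000.0, 11000.0])
clipped = clip_iqr(deposits)
print(clipped.tolist())
```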

## Assumptions of the Predictor Variables

Target Variable

Stay - Highly imbalanced; an oversampling technique such as SMOTE could be used to balance it.


Predictor Variables

Hospital Code - Highly imbalanced and might affect the model.

Hospital Type Code - Imbalanced.

City Code Hospital - Imbalanced.

Available Extra Rooms - Positively skewed; may need to be transformed or binned.

Department - Highly imbalanced.

Ward Type - Highly imbalanced.

Patient ID - Many unique values; might need to be dropped.

City Code Patient - Highly imbalanced.

Severity of Illness - Imbalanced.

Visitors with Patient - Imbalanced.

Age - Imbalanced; could be binned into coarser groups.

Admission Deposit - Continuous; the outliers need to be removed or the values scaled.

# Basic Feature Engineering

## Remove Duplicate Rows

In [None]:
df_Train.drop_duplicates(keep='first', inplace=True)

No duplicate rows were found.

## Joining the Train and Test Data for Encoding and Filling the Missing Values

In [None]:
# We will concat both train and test data set
df_Train['is_train'] = 1
df_Test['is_train'] = 0

#df_Frames = [df_Train,df_Test]
df_Total = pd.concat([df_Train, df_Test])

## Fill missing Values

In [None]:
#Null values in the Total Dataset
print('Null values in Total Data: \n', df_Total.isnull().sum())

In [None]:
# Forward-fill missing values; Series.ffill() replaces the deprecated fillna(method="ffill")
df_Total['Bed Grade'] = df_Total['Bed Grade'].ffill()
df_Total['City_Code_Patient'] = df_Total['City_Code_Patient'].ffill()
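
Forward fill copies the value from the previous row, so the result depends on row order. The sketch below contrasts it with filling by the most frequent value on a made-up column; the mode-based variant is an alternative shown for comparison, not the approach used in this notebook.

```python
import numpy as np
import pandas as pd

# Toy column with gaps
toy = pd.DataFrame({"Bed Grade": [2.0, np.nan, 3.0, np.nan, np.nan, 2.0]})

# Forward fill: each gap takes the last seen value above it
ffilled = toy["Bed Grade"].ffill()

# Mode fill: each gap takes the most frequent observed value
mode_filled = toy["Bed Grade"].fillna(toy["Bed Grade"].mode()[0])

print(ffilled.tolist())
print(mode_filled.tolist())
```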

## Feature Engineering

In [None]:
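# Total admission deposit across all of a patient's cases (train and test combined)
df_Total['Bill_per_patient'] = df_Total.groupby('patientid')['Admission_Deposit'].transform('sum')
# Severity is still a string at this point, so 'min' returns the alphabetically first label per patient
df_Total['Min_Severity_of_Illness'] = df_Total.groupby('patientid')['Severity of Illness'].transform('min')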
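
The toy example below shows what `groupby(...).transform(...)` produces: one value per group, broadcast back to every row of that group. It also contrasts the alphabetical minimum of a string column with a minimum taken over an explicit ordinal ranking. The severity labels and the `severity_rank` mapping are assumptions for illustration.

```python
import pandas as pd

toy = pd.DataFrame({
    "patientid": [1, 1, 2, 2, 2],
    "Admission_Deposit": [4000.0, 5000.0, 3000.0, 3500.0, 2000.0],
    "Severity of Illness": ["Moderate", "Extreme", "Minor", "Moderate", "Minor"],
})

# Sum of deposits per patient, repeated on each of that patient's rows
toy["Bill_per_patient"] = toy.groupby("patientid")["Admission_Deposit"].transform("sum")

# Alphabetical minimum of the raw strings
toy["Min_Severity_alpha"] = toy.groupby("patientid")["Severity of Illness"].transform("min")

# Ordinal minimum, using an assumed ranking from least to most severe
severity_rank = {"Minor": 0, "Moderate": 1, "Extreme": 2}
toy["Severity_rank"] = toy["Severity of Illness"].map(severity_rank)
toy["Min_Severity_rank"] = toy.groupby("patientid")["Severity_rank"].transform("min")

print(toy)
```

For patient 1 the alphabetical minimum is "Extreme", while the ordinal minimum corresponds to "Moderate", so the two definitions can disagree.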

In [None]:
#Bill Per Patient
sns.boxplot(x=df_Total['Bill_per_patient'])
plt.xlabel("Bill_per_patient")
plt.title("Bill_per_patient")
plt.show()

## Encoding of the Columns

In [None]:
df_Total.head()

### For Tree Based Algorithm use Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

# Columns to label-encode for the tree-based model
label_cols = ['Hospital_code', 'Hospital_type_code', 'City_Code_Hospital', 'Hospital_region_code',
              'Available Extra Rooms in Hospital', 'Department', 'Ward_Type', 'Ward_Facility_Code',
              'Bed Grade', 'patientid', 'City_Code_Patient', 'Type of Admission',
              'Severity of Illness', 'Visitors with Patient', 'Age', 'Min_Severity_of_Illness']

# The single encoder is refit for every column, so only the last column's mapping is retained in `le`
for col in label_cols:
    df_Total[col] = le.fit_transform(df_Total[col])
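
Because a single `LabelEncoder` is refit for each column, the mappings of earlier columns are lost. If the original labels are needed later, one option is to keep a separate encoder per column, as in the sketch below. The `encoders` dictionary and the toy values are illustrative assumptions.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Department": ["gynecology", "radiotherapy", "gynecology"],
    "Ward_Type": ["R", "S", "Q"],
})

# One fitted encoder per column, so every mapping stays available
encoders = {}
for col in ["Department", "Ward_Type"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# Recover the original labels of one column from its codes
restored = encoders["Department"].inverse_transform(toy["Department"])
print(toy)
print(list(restored))
```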

## For Scaling the Columns

In [None]:
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()
df_Total[['Admission_Deposit']] = mm_scaler.fit_transform(df_Total[['Admission_Deposit']])
df_Total[['Bill_per_patient']] = mm_scaler.fit_transform(df_Total[['Bill_per_patient']])
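
The cell above fits the scaler on train and test rows together. A stricter alternative, sketched below on made-up numbers, fits the scaler on the training rows only and reuses it for the test rows, so that no test statistics enter the transformation. Test values can then fall outside the 0 to 1 range.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

train = pd.DataFrame({"Admission_Deposit": [2000.0, 4000.0, 6000.0]})
test = pd.DataFrame({"Admission_Deposit": [3000.0, 7000.0]})

scaler = MinMaxScaler()
# Learn min and max from the training rows only
train_scaled = scaler.fit_transform(train[["Admission_Deposit"]])
# Apply the same min and max to the test rows
test_scaled = scaler.transform(test[["Admission_Deposit"]])

print(train_scaled.ravel())
print(test_scaled.ravel())
```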

In [None]:
df_Total['Admission_Deposit'].describe()

## Un Merge the Train and Test Data after Feature Engineering

In [None]:
#Un-Merge code
df_Train_final = df_Total[df_Total['is_train'] == 1]
df_Test_final = df_Total[df_Total['is_train'] == 0]

In [None]:
df_Train_final

# Data Modelling

## Split the Data to x and y variable

In [None]:
df_Train_final.columns

In [None]:
# Drop the case identifier, the train/test flag and the target from the predictors
drop_cols = ['case_id', 'is_train', 'Stay']  # add 'patientid' here to drop it as well
x = df_Train_final.drop(columns=drop_cols)
y = df_Train_final['Stay']
x_pred = df_Test_final.drop(columns=drop_cols)

In [None]:
# Uncomment to label-encode the target; needed for the Optuna hyperparameter tuning only
#y = le.fit_transform(y)

## Split the Train Dataset to Train and Validation

In [None]:
from sklearn.model_selection import train_test_split
x_train, x_valid, y_train, y_valid = train_test_split(x, y, test_size=0.20)
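
With an imbalanced target, a plain random split can leave rare classes under-represented in the validation set. The sketch below shows a stratified and reproducible variant of the same call on made-up data; `stratify` and `random_state` are standard `train_test_split` arguments, and the seed value is an arbitrary choice.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data with a 70 / 30 class split
toy_x = pd.DataFrame({"feature": range(100)})
toy_y = pd.Series(["0-10"] * 70 + ["11-20"] * 30)

# stratify keeps the class ratio in both parts; random_state fixes the shuffle
x_tr, x_va, y_tr, y_va = train_test_split(
    toy_x, toy_y, test_size=0.20, stratify=toy_y, random_state=294
)
print(y_va.value_counts())
```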

## LightGBM Model

### Optuna Hyperparameter Search

Optuna is a package that searches for a well-performing hyperparameter configuration for the classifier.

In [None]:
import lightgbm as lgb
import optuna

In [None]:
def objective(trial):
    dtrain = lgb.Dataset(x_train, label=np.ravel(y_train))

    param = {
        # With "objective" commented out, LightGBM uses its default regression objective,
        # so the target must be label-encoded to integers before this cell is run
        #"objective": "multiclass",
        #"metric": "multi_logloss",
        #"num_class": 11,
        "verbosity": -1,
        "boosting_type": "gbdt",
        "lambda_l1": trial.suggest_float("lambda_l1", 1e-8, 10.0, log=True),
        "lambda_l2": trial.suggest_float("lambda_l2", 1e-8, 10.0, log=True),
        "num_leaves": trial.suggest_int("num_leaves", 2, 256),
        "feature_fraction": trial.suggest_float("feature_fraction", 0.4, 1.0),
        "bagging_fraction": trial.suggest_float("bagging_fraction", 0.4, 1.0),
        "bagging_freq": trial.suggest_int("bagging_freq", 1, 7),
        "min_child_samples": trial.suggest_int("min_child_samples", 5, 100)
        #"n_estimators":trial.suggest_int("n_estimators", 0, 1000),
        #"learning_rate":trial.suggest_float("learning_rate", 0.01, 0.3, log=True)
    }

    gbm = lgb.train(param, dtrain)
    preds = gbm.predict(x_valid)
    # Round the regression output to the nearest encoded class
    pred_labels = np.rint(preds)
    accuracy = sklearn.metrics.accuracy_score(y_valid, pred_labels)
    return accuracy
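
The objective above turns a continuous model output into class labels by rounding. The sketch below, on made-up arrays, shows two ways of producing labels: rounding with clipping to the valid class range (rounding alone can yield codes below 0 or above 10), and taking the arg-max over per-class probabilities, which is what a multiclass objective would provide. Both arrays are hypothetical and no model is trained here.

```python
import numpy as np

n_classes = 11

# Hypothetical regression-style outputs for four rows
reg_output = np.array([-0.4, 2.6, 10.7, 4.49])
# Round to the nearest class code, then clip to the valid range 0..10
rounded = np.clip(np.rint(reg_output), 0, n_classes - 1).astype(int)

# Hypothetical per-class probabilities for two rows (three classes shown for brevity)
proba = np.array([
    [0.1, 0.7, 0.2],
    [0.5, 0.3, 0.2],
])
# Pick the most probable class per row
argmax_labels = proba.argmax(axis=1)

print(rounded)
print(argmax_labels)
```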

In [None]:
opt_GS = optuna.create_study(direction="maximize")
opt_GS.optimize(objective, n_trials=300)

print("Number of finished trials: {}".format(len(opt_GS.trials)))

print("Best trial:")
trial = opt_GS.best_trial

print("Value: {}".format(trial.value))

print("Params: ")
for key, value in trial.params.items():
    print("{}: {}".format(key, value))

Number of finished trials: 300

Best trial:

Value: 0.3367510363019721

Params: 

lambda_l1: 0.004646494045162703

lambda_l2: 4.9034416966810145e-06

num_leaves: 249

feature_fraction: 0.9852240055032958

bagging_fraction: 0.7730445719570425

bagging_freq: 3

min_child_samples: 85

In [None]:
import lightgbm as lgb
# n_estimators and num_boost_round are aliases for the number of boosting rounds, so only one is set
lgb_cl = lgb.LGBMClassifier(boosting_type='gbdt', learning_rate=0.1, n_estimators=100, importance_type='gain', objective='multiclass',
                            min_child_samples=70, num_leaves=246, #max_depth=5, 
                            lambda_l1=9.62, lambda_l2=0.006, feature_fraction=0.73, bagging_fraction=0.82, bagging_freq=6,
                            #max_bin=60, bagging_fraction=0.9, feature_fraction=0.9, subsample_freq=2, scale_pos_weight=2.5, 
                            random_state=294, n_jobs=-1) #score accuracy 42.70

In [None]:
#lgb_cl.fit(x_train, y_train, eval_set=[(x_train, y_train),(x_valid, y_valid)], verbose=50, eval_metric='auc', early_stopping_rounds=100)
lgb_cl.fit(x, np.ravel(y))

In [None]:
y_pred = lgb_cl.predict(x_pred)

In [None]:
y_pred

In [None]:
submission_df = pd.DataFrame({'case_id':df_Test['case_id'], 'Stay':y_pred})
submission_df.to_csv('Sample Submission LGB v04.csv', index=False)

Public Score of 42.70

### K-Fold Cross Validation

Cross-validation is a statistical method used to estimate the skill of machine learning models.

It is commonly used in applied machine learning to compare and select a model for a given predictive modeling problem because it is easy to understand, easy to implement, and results in skill estimates that generally have a lower bias than other methods.
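
For an imbalanced target, the stratified variant keeps the class ratio roughly constant in every fold. The sketch below demonstrates this on a made-up binary target; the array sizes and the seed are arbitrary choices for illustration.

```python
import numpy as np
from sklearn.model_selection import StratifiedKFold

# Toy target with 40 negatives and 10 positives
y_toy = np.array([0] * 40 + [1] * 10)
X_toy = np.arange(50).reshape(-1, 1)

skf = StratifiedKFold(n_splits=5, shuffle=True, random_state=294)

# Count the positives that land in each validation fold
val_positive_counts = [
    int(y_toy[val_idx].sum()) for _, val_idx in skf.split(X_toy, y_toy)
]
print(val_positive_counts)
```

Note that `split` takes the target as its second argument; without it the folds cannot be stratified.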

In [None]:
df_Total.columns

In [None]:
from lightgbm import LGBMClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import StratifiedKFold

# Columns that are categorical after label encoding (the two deposit columns are continuous and left out)
categorical_features = ["Hospital_code", "Hospital_type_code", "City_Code_Hospital", "Hospital_region_code", "Available Extra Rooms in Hospital",
                        "Department", "Ward_Type", "Ward_Facility_Code", "Bed Grade", "patientid", "City_Code_Patient", "Type of Admission", 
                        "Visitors with Patient", "Severity of Illness", "Age", "Min_Severity_of_Illness"]

# Baseline configuration, kept for reference (the loop below builds its own classifier)
param_lgb = LGBMClassifier(
    boosting_type='gbdt'
    ,learning_rate=0.1
    ,n_estimators=50000
    ,min_child_samples=21
    ,random_state = 294
    ,n_jobs=-1
    )

# Predictors, label-encoded target and test predictors used for cross-validation
X = x
X_main_test = x_pred
y = pd.Series(le.fit_transform(df_Train_final['Stay']), index=df_Train_final.index)

# Apply Stratified K-Fold Cross Validation where K=10 or n_splits=10 :
n_splits = 10
kf = StratifiedKFold(n_splits=n_splits, shuffle=True)
preds = {}
acc_score = 0

# Pass predictors and target for Cross Validation (the target is required for stratification) :
for i, (train_idx, val_idx) in enumerate(kf.split(X, y)):
    X_train, y_train = X.iloc[train_idx, :], y.iloc[train_idx]
    X_val, y_val = X.iloc[val_idx, :], y.iloc[val_idx]
    print('\nFold: {}\n'.format(i+1))
    # device="gpu" needs a GPU-enabled LightGBM build; remove it to train on CPU
    lg = LGBMClassifier(device="gpu", boosting_type='gbdt', learning_rate=0.04, max_depth=8, objective='multiclass', num_class=11,
                        n_estimators=50000,
                        metric='multi_error', colsample_bytree=0.5, reg_alpha=2, reg_lambda=2, random_state=294, n_jobs=-1)

    # Early stopping and logging are passed as callbacks
    lg.fit(X_train, y_train
           # ,categorical_feature = categorical_features
           ,eval_metric='multi_error'
           ,eval_set=[(X_train, y_train), (X_val, y_val)]
           ,callbacks=[lgb.early_stopping(stopping_rounds=100), lgb.log_evaluation(period=50)]
           )

    fold_acc = accuracy_score(y_val, lg.predict(X_val))
    print(fold_acc)
    acc_score += fold_acc
    preds[i+1] = lg.predict(X_main_test)

print('mean accuracy score: {}'.format(acc_score / n_splits))

In [None]:
# #Permutation Importance of Features using eli5
# perm = PermutationImportance(lg,random_state=100).fit(X_val, y_val)
# eli5.show_weights(perm,feature_names=X_val.columns.tolist())

In [None]:
# Majority vote: find the most frequently predicted class across all folds
fold_ids = sorted(preds)
d = pd.concat([pd.DataFrame(preds[i]) for i in fold_ids], axis=1)
d.columns = [str(i) for i in fold_ids]
re = d.mode(axis=1)[0]
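
The voting step can be checked on a small made-up example. In the sketch below each column holds one fold's predictions and each row is one test case; `mode(axis=1)` returns the most frequent value per row, and on a tie its first column holds the smallest of the tied values. The `fold_preds` dictionary is hypothetical.

```python
import pandas as pd

# Hypothetical predictions from three folds for three test rows
fold_preds = {1: [0, 2, 1], 2: [0, 1, 1], 3: [3, 2, 1]}

# One column per fold, one row per test case
votes = pd.DataFrame(fold_preds)

# Most frequent class per row; column 0 holds the first mode
majority = votes.mode(axis=1)[0].astype(int)
print(majority.tolist())
```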

In [None]:
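# Map the encoded majority-vote classes back to the original Stay labels.
# LabelEncoder sorts its classes, so refitting on the train labels reproduces the mapping.
stay_le = LabelEncoder().fit(df_Train['Stay'])
submission_df['Stay'] = stay_le.inverse_transform(re.astype(int))

sub_file_name = "BEST_1_43.27_GPU-LGBM_NO-early_stopping.csv"

submission_df.to_csv(sub_file_name, index=False)
submission_df.head(5)

# Trigger a browser download when running on Google Colab; skipped elsewhere
try:
    from google.colab import files
    files.download(sub_file_name)
except ImportError:
    pass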

Do share your comments on how to improve the model.