<img src='https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/cover_4-thumbnail-1200x1200.png'>

# Problem Statement

The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas: healthcare management. Data science has many use cases in healthcare management, and patient length of stay (LOS) is one critical parameter to observe and predict in order to improve the efficiency of a hospital.

This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, these patients can have their treatment plan optimized to minimize LOS and lower the chance of staff/visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

<b>Suppose you have been hired as a Data Scientist at HealthMan</b>, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.
The task is to accurately predict the length of stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 classes, ranging from 0-10 days to more than 100 days.

In [None]:
import numpy as np 
import pandas as pd 
import plotly.express as px
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, plot_confusion_matrix
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
from lightgbm import LGBMClassifier, LGBMRegressor
from sklearn import preprocessing
import optuna
from optuna.samplers import TPESampler

# Loading Dataset

In [None]:
train = pd.read_csv('/kaggle/input/av-healthcare-analytics-ii/healthcare/train_data.csv')
test = pd.read_csv('/kaggle/input/av-healthcare-analytics-ii/healthcare/test_data.csv')
sub = pd.read_csv('/kaggle/input/av-healthcare-analytics-ii/healthcare/sample_sub.csv')

train = train.drop(['case_id'], axis=1)
test = test.drop(['case_id'], axis=1)
train['dataset'] = 'train'
test['dataset'] = 'test'

df = pd.concat([train, test])

In [None]:
df

# Exploratory Data Analysis

In [None]:
ds = df.groupby(['Hospital_code', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['hospital', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='hospital', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Cases per hospital distribution', 
    width=900,
    height=700
)
fig.show()

In [None]:
ds = df.groupby(['Hospital_type_code', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['hospital', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='hospital', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Cases hospital type distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Hospital_region_code', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['hospital', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='hospital', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Cases hospital region distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Department', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['department', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='department', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Department distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Ward_Type', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['Ward_Type', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='Ward_Type', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Ward Type distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = ds[ds['dataset']=='train']
fig = px.pie(
    ds, 
    names='Ward_Type', 
    values="count", 
    title='Ward type pie chart for train set', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Ward_Facility_Code', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['Ward_Facility_Code', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='Ward_Facility_Code', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Ward Facility Code distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Bed Grade', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['bed_grade', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='bed_grade', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Bed_grade distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Age', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['age', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='age', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Age distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:

ds = df.groupby(['Type of Admission', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['admission', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='admission', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Admission type distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Severity of Illness', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['Severity of Illness', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='Severity of Illness', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Severity of Illness type distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
ds = df.groupby(['Stay', 'dataset'])['patientid'].count().reset_index()
ds.columns = ['Stay', 'dataset', 'count']
fig = px.bar(
    ds, 
    x='Stay', 
    y="count", 
    color = 'dataset',
    barmode='group',
    orientation='v', 
    title='Stay length distribution', 
    width=900,
    height=600
)
fig.show()

In [None]:
data = df['patientid'].value_counts().reset_index()
data.columns = ['patientid', 'cases']
data['patientid'] = 'patient ' + data['patientid'].astype(str)
data = data.sort_values('cases')
fig = px.bar(
    data.tail(50), 
    x="cases", 
    y="patientid", 
    orientation='h', 
    title='Top 50 patients',
    width=800,
    height=900
)
fig.show()

In [None]:
fig = px.histogram(
    df, 
    "City_Code_Patient", 
    nbins=40, 
    color = 'dataset',
    barmode='group',
    title='City_Code_Patient', 
    width=700,
    height=600
)
fig.show()

In [None]:
fig = px.histogram(
    df, 
    "Visitors with Patient", 
    nbins=40, 
    color = 'dataset',
    barmode='group',
    title='Visitors with Patient', 
    width=700,
    height=600
)
fig.show()

In [None]:
fig = px.histogram(
    df, 
    "Admission_Deposit", 
    nbins=50, 
    color = 'dataset',
    barmode='group',
    title='Admission Deposit distribution', 
    width=700,
    height=600
)
fig.show()
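
The bar-chart cells above all repeat the same group, count, and rename steps. A minimal sketch of a helper that factors this out follows; `count_by` is a hypothetical name, and the toy frame only stands in for `df`. Keep in mind that pandas `groupby` drops rows whose key is missing by default, so columns with missing values are undercounted in these charts.

```python
import pandas as pd


def count_by(frame, column, id_col="patientid"):
    """Count rows per (column, dataset) pair, in the shape px.bar expects."""
    return (
        frame.groupby([column, "dataset"])[id_col]
        .count()
        .reset_index(name="count")
    )


# Toy stand-in for the combined train/test frame (values are made up).
toy = pd.DataFrame(
    {
        "patientid": [1, 2, 3, 4, 5],
        "Ward_Type": ["R", "R", "Q", "Q", "R"],
        "dataset": ["train", "train", "train", "test", "test"],
    }
)

ward_counts = count_by(toy, "Ward_Type")
print(ward_counts)
```

The result can be passed to `px.bar(ward_counts, x='Ward_Type', y='count', color='dataset', barmode='group')` exactly as in the cells above.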

# Preparing Dataset For Modelling

In [None]:
df.loc[df['Stay'] == '0-10', 'Stay'] = 0
df.loc[df['Stay'] == '11-20', 'Stay'] = 1
df.loc[df['Stay'] == '21-30', 'Stay'] = 2
df.loc[df['Stay'] == '31-40', 'Stay'] = 3
df.loc[df['Stay'] == '41-50', 'Stay'] = 4
df.loc[df['Stay'] == '51-60', 'Stay'] = 5
df.loc[df['Stay'] == '61-70', 'Stay'] = 6
df.loc[df['Stay'] == '71-80', 'Stay'] = 7
df.loc[df['Stay'] == '81-90', 'Stay'] = 8
df.loc[df['Stay'] == '91-100', 'Stay'] = 9
df.loc[df['Stay'] == 'More than 100 Days', 'Stay'] = 10
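
The eleven assignments above can also be written as a single lookup table, which has the side benefit of giving the inverse mapping needed later for the submission file. This is only a sketch: `STAY_LABELS`, `STAY_TO_CODE` and `CODE_TO_STAY` are names introduced here, and the demo series stands in for `df['Stay']`.

```python
import pandas as pd

# The 11 class labels in order: '0-10', '11-20', ..., '91-100', 'More than 100 Days'.
STAY_LABELS = (
    ["0-10"]
    + [f"{10 * i + 1}-{10 * (i + 1)}" for i in range(1, 10)]
    + ["More than 100 Days"]
)
STAY_TO_CODE = {label: code for code, label in enumerate(STAY_LABELS)}
CODE_TO_STAY = {code: label for label, code in STAY_TO_CODE.items()}

# Toy stand-in for the 'Stay' column.
demo = pd.Series(["0-10", "41-50", "More than 100 Days"])
codes = demo.map(STAY_TO_CODE)       # labels -> integer classes
decoded = codes.map(CODE_TO_STAY)    # integer classes -> labels

print(codes.tolist())
print(decoded.tolist())
```

With these dictionaries, `df['Stay'].map(STAY_TO_CODE)` would leave the missing test-set targets as NaN, just as the `df.loc` assignments do.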

### Let's first try a linear model on the numerical features only

In [None]:
train = df[df['dataset']=='train']
test = df[df['dataset']=='test']

target = train['Stay']

features = ['Available Extra Rooms in Hospital', 'Bed Grade', 'Visitors with Patient', 'Admission_Deposit']

train = train[features]
train = train.fillna(0)
test = test[features]

In [None]:
X, X_val, y, y_val = train_test_split(train, target, random_state=0, test_size=0.2, shuffle=True)
y=y.astype('int')
y_val=y_val.astype('int')

In [None]:
model = LogisticRegression(random_state=666)
model.fit(X, y)
preds = model.predict(X_val)
print('Baseline accuracy: ', accuracy_score(y_val, preds)*100, '%')
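
An accuracy figure over 11 classes is hard to judge in isolation, especially if some classes are much more frequent than others. One common reference point is the accuracy of always predicting the most frequent training class. The sketch below uses made-up labels, and `majority_class_accuracy` is a hypothetical helper, not part of the original notebook.

```python
from collections import Counter


def majority_class_accuracy(y_train, y_eval):
    """Accuracy obtained by always predicting the most frequent training label."""
    majority_label, _ = Counter(y_train).most_common(1)[0]
    hits = sum(1 for label in y_eval if label == majority_label)
    return majority_label, hits / len(y_eval)


# Toy labels standing in for y and y_val.
toy_train = [2, 2, 1, 2, 0, 1, 2]
toy_eval = [2, 1, 2, 0]

label, acc = majority_class_accuracy(toy_train, toy_eval)
print(f"Always predicting class {label} gives {acc * 100:.1f}% accuracy")
```

In the notebook this would be called as `majority_class_accuracy(y, y_val)`; a model is only useful to the extent that it beats this number.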

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(model, X_val, y_val, ax=ax)
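
Note that `plot_confusion_matrix` was removed in scikit-learn 1.2; on newer versions `ConfusionMatrixDisplay.from_estimator(model, X_val, y_val, ax=ax)` takes the same arguments. The numbers behind the plot can also be inspected directly. The sketch below uses toy labels in place of `y_val` and `preds` and derives per-class recall from the matrix.

```python
import numpy as np
from sklearn.metrics import confusion_matrix

# Toy labels standing in for y_val and preds.
y_true = [0, 1, 2, 2]
y_pred = [0, 2, 2, 2]

# Rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_true, y_pred)

# Per-class recall: correctly predicted samples divided by samples of that class.
recall = cm.diagonal() / cm.sum(axis=1)

print(cm)
print(np.round(recall, 2))
```

A class with low recall here is one the model rarely recovers, which overall accuracy alone can hide.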

### Let's build LightGBM classifier

In [None]:
need_to_encode = ['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness']
for column in need_to_encode:
    le = preprocessing.LabelEncoder()
    le.fit(df[column])
    df[column] = le.transform(df[column])
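
`LabelEncoder` numbers the classes in sorted label order, so a column with a natural ordering does not keep it. That is harmless when the column is passed to LightGBM as a categorical feature, but it matters if the codes are ever treated as numbers. The sketch below contrasts the two encodings; the severity values and their ordering are assumed for illustration.

```python
import pandas as pd

# Toy values, assumed for illustration.
values = ["Moderate", "Minor", "Extreme", "Minor"]

# Codes in sorted label order, which is how LabelEncoder numbers its classes.
alphabetical = {label: code for code, label in enumerate(sorted(set(values)))}
alphabetical_codes = [alphabetical[v] for v in values]

# Explicit ordinal codes that follow the assumed severity ordering instead.
severity_order = ["Minor", "Moderate", "Extreme"]
ordinal_codes = pd.Categorical(
    values, categories=severity_order, ordered=True
).codes.tolist()

print(alphabetical_codes)
print(ordinal_codes)
```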

In [None]:
df.loc[df['Age'] == '0-10', 'Age'] = 0
df.loc[df['Age'] == '11-20', 'Age'] = 1
df.loc[df['Age'] == '21-30', 'Age'] = 2
df.loc[df['Age'] == '31-40', 'Age'] = 3
df.loc[df['Age'] == '41-50', 'Age'] = 4
df.loc[df['Age'] == '51-60', 'Age'] = 5
df.loc[df['Age'] == '61-70', 'Age'] = 6
df.loc[df['Age'] == '71-80', 'Age'] = 7
df.loc[df['Age'] == '81-90', 'Age'] = 8
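df.loc[df['Age'] == '91-100', 'Age'] = 9
df['Age'] = pd.to_numeric(df['Age'])  # the assignments above leave an object column; LightGBM needs a numeric dtype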

In [None]:
categorical = ['Hospital_code', 'Hospital_type_code', 'City_Code_Hospital', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 
              'City_Code_Patient', 'Type of Admission', 'Severity of Illness']

In [None]:
train = df[df['dataset']=='train']
test = df[df['dataset']=='test']

target = train['Stay']
train = train.fillna(0)
test = test.fillna(0)
train = train.drop(['patientid', 'dataset', 'Stay'], axis=1)
test = test.drop(['patientid', 'dataset'], axis=1)
train

In [None]:
X, X_val, y, y_val = train_test_split(train, target, random_state=0, test_size=0.2, shuffle=True)
y=y.astype('int')
y_val=y_val.astype('int')

In [None]:
model = LGBMClassifier(random_state=666)
model.fit(X, y, categorical_feature=categorical)
preds = model.predict(X_val)
print('LGBM accuracy: ', accuracy_score(y_val, preds)*100, '%')

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(model, X_val, y_val, ax=ax)

### We can see that we improved our score without any serious data preprocessing or hyperparameter tuning. Let's tune the hyperparameters next.

### Optuna optimization

#### What is Optuna Optimization?

Optuna is a hyperparameter optimization framework. A nice tutorial I found on Kaggle: https://www.kaggle.com/corochann/optuna-tutorial-for-hyperparameter-optimization

In [None]:
sampler = TPESampler(seed=0)
def create_model(trial):
    max_depth = trial.suggest_int("max_depth", 2, 30)
    n_estimators = trial.suggest_int("n_estimators", 1, 500)
    learning_rate = trial.suggest_float('learning_rate', 0.0000001, 1)
    num_leaves = trial.suggest_int("num_leaves", 2, 5000)
    min_child_samples = trial.suggest_int('min_child_samples', 3, 200)
    model = LGBMClassifier(learning_rate=learning_rate, n_estimators=n_estimators, max_depth=max_depth, num_leaves=num_leaves, min_child_samples=min_child_samples,
                           random_state=0)
    return model

def objective(trial):
    model = create_model(trial)
    model.fit(X, y, categorical_feature=categorical)  # same setup as the final fit
    preds = model.predict(X_val)
    return accuracy_score(y_val, preds)

def optimize():
    study = optuna.create_study(direction="maximize", sampler=sampler)
    study.optimize(objective, n_trials=20)
    return study.best_params

params = optimize()

In [None]:
params['random_state'] = 666
model = LGBMClassifier(**params)
model.fit(X, y, categorical_feature=categorical)
preds = model.predict(X_val)
print('LGBM accuracy: ', accuracy_score(y_val, preds)*100, '%')
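
One caveat: the Optuna objective scores each trial on `X_val`, and the accuracy printed above is measured on the same `X_val`, so that number is likely to be somewhat optimistic. A common remedy is to hold out a third split that the search never sees. The sketch below shows such a split on toy arrays; the variable names and the 60/20/20 proportions are assumptions, not the notebook's setup.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy data standing in for the training features and target.
features = np.arange(200).reshape(100, 2)
labels = np.arange(100) % 3

# First carve out a final test split that the search never touches.
X_rest, X_test, y_rest, y_test = train_test_split(
    features, labels, test_size=0.2, random_state=0
)
# Then split the remainder into a training part and a tuning part.
X_train, X_tune, y_train, y_tune = train_test_split(
    X_rest, y_rest, test_size=0.25, random_state=0
)

print(len(X_train), len(X_tune), len(X_test))
```

The search would then score trials on `X_tune`, and the tuned model would be reported once on `X_test`.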

In [None]:
fig, ax = plt.subplots(figsize=(10, 10))
plot_confusion_matrix(model, X_val, y_val, ax=ax)

# Final Submission

In [None]:
X.columns

In [None]:
test.columns

In [None]:
preds = model.predict(test.drop('Stay',axis=1))

In [None]:
sub['Stay']=preds

In [None]:
sub.loc[sub['Stay'] == 0, 'Stay'] = '0-10'
sub.loc[sub['Stay'] == 1, 'Stay'] = '11-20'
sub.loc[sub['Stay'] == 2, 'Stay'] = '21-30'
sub.loc[sub['Stay'] == 3, 'Stay'] = '31-40'
sub.loc[sub['Stay'] == 4, 'Stay'] = '41-50'
sub.loc[sub['Stay'] == 5, 'Stay'] = '51-60'
sub.loc[sub['Stay'] == 6, 'Stay'] = '61-70'
sub.loc[sub['Stay'] == 7, 'Stay'] = '71-80'
sub.loc[sub['Stay'] == 8, 'Stay'] = '81-90'
sub.loc[sub['Stay'] == 9, 'Stay'] = '91-100'
sub.loc[sub['Stay'] == 10, 'Stay'] = 'More than 100 Days'

In [None]:
sub.to_csv('lgbm.csv',index=False)

# References:
1. Optuna Optimization - https://www.kaggle.com/corochann/optuna-tutorial-for-hyperparameter-optimization
2. LightGBM - https://towardsdatascience.com/understanding-lightgbm-parameters-and-how-to-tune-them-6764e20c6e5b
