<h2>Problem Statement</h2>
The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas: healthcare management. While healthcare management offers many use cases for data science, <strong>patient length of stay (LOS) is one critical parameter to observe and predict</strong> if one wants to improve the efficiency of healthcare management in a hospital.
This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plans optimized to minimize LOS and lower the chance of staff/visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.
Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.

<br><br><center><img align="center" title="AV Hackathon" src="https://datahack-prod.s3.ap-south-1.amazonaws.com/__sized__/contest_cover/cover_4-thumbnail-1200x1200.png"></center>

<h2> Task: </h2>
The task is to accurately <font size="3"><strong>predict the Length of Stay for each patient</strong></font> on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. <br>

* The length of stay is divided into 11 classes, ranging from 0-10 days to more than 100 days.
* Our focus is to predict these 11 categories as accurately as possible.
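
To make the class structure concrete, here is a minimal sketch of how a raw stay in days could map to one of 11 bins. The bin edges and label strings below are assumptions for illustration only; the competition data already ships with the `Stay` labels pre-binned.

```python
import pandas as pd

# Hypothetical bin edges: ten 10-day bins plus one open-ended bin = 11 classes.
edges = list(range(0, 101, 10)) + [float("inf")]

# Hypothetical label strings; the real dataset may spell them differently.
labels = ["0-10"] + [f"{lo + 1}-{lo + 10}" for lo in range(10, 100, 10)]
labels.append("More than 100 Days")

# A few made-up stays in days, binned into the classes above.
days = pd.Series([0, 10, 11, 45, 100, 101, 250])
stay_class = pd.cut(days, bins=edges, labels=labels, include_lowest=True)
print(stay_class.tolist())
```

Because the target is a bin rather than a number of days, the problem is treated as classification and not regression.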

In [None]:
import pandas as pd
import numpy as np
import os

from IPython.display import Markdown as md

import matplotlib.pyplot as plt
import seaborn as sns

import plotly.graph_objects as go

from sklearn import model_selection

In [None]:
DATA_DIR = r"../input/av-healthcare-analytics-ii/healthcare/"

train_df = pd.read_csv(os.path.join(DATA_DIR, "train_data.csv"))
test_df = pd.read_csv(os.path.join(DATA_DIR, "test_data.csv"))

features_desc_df = pd.read_csv(os.path.join(DATA_DIR, "train_data_dictionary.csv"))

## add new column
train_df.loc[:,'dataset'] = 'train'
test_df.loc[:,'dataset'] = 'test'

features_desc_df.loc[features_desc_df.shape[0]] = ["dataset", "Indicates the data belongs to train set or test set"]
df = pd.concat([train_df, test_df]) 

In [None]:
md("<h4>Dataset basic summary:</h4><br>The dataset contains <strong>{}</strong> features and <strong>{}</strong> samples. <br><font size='-1' color='red'> Note: It includes both train and test set</font>".format(df.shape[1], df.shape[0]))

Let's see what kind of different features we have:

In [None]:
features_format_str = "<h3>Features:</h3><br>"

## iterate over each row of the dataframe
for row in features_desc_df.values.tolist():
    features_format_str+= f"- <strong>{row[0]}:</strong>     {row[1]}<br>"

## display the formatted string as markdown
md(features_format_str)

<h2> Distribution of Target variable</h2>

In [None]:
print(f"Number of target classes: {df['Stay'].nunique()}")
target_distribution = df[df["dataset"] == "train"]["Stay"].value_counts().sort_values(ascending=True)
fig = go.Figure(data=go.Bar(x=target_distribution.index, y=target_distribution), layout_title_text="Distribution of Stay")
fig.show()

Observation:

- There are 11 target labels, grouped by the number of days stayed in the hospital (categories such as 0-10, 10-20, etc.).
- The distribution of the target variable is heavily skewed: the bars, sorted by count, rise steeply toward the right side.
- Only a very small percentage of the samples falls in the classes on the left side.

<h2> Cumulative percentage of Target variable </h2>

* The problem is **multiclass classification**: each patient belongs to exactly one of the 11 stay classes.
* Next, we look at the cumulative percentage of samples available for each target class.

In [None]:
sns.set_style("whitegrid")

#Create combination chart
fig, ax1 = plt.subplots(figsize=(18,8))
color = 'tab:green'
#bar plot creation
ax1.set_title('Distribution of Stay', fontsize=16)
ax1.set_xlabel('Length of patient stay in hospital', fontsize=12)
ax1.set_ylabel('No of Samples', fontsize=12)
ax1 = sns.barplot(x=target_distribution.index, y=target_distribution)
ax1.tick_params(axis='y')

#specify we want to share the same x-axis
ax2 = ax1.twinx()
color = 'black'

ax2 = sns.lineplot(x=target_distribution.index, y=target_distribution.cumsum(), sort=False, color=color, markers=True, dashes=False)
ax2.tick_params(axis='y', color=color)
#line plot creation
ax2.set_ylabel('Total no of samples', fontsize=12)

## set label for lineplot
for x in target_distribution.cumsum().index:
    cum_percentage_of_target = round((target_distribution.cumsum()[x] / target_distribution.sum()) * 100, 2)
    ax2.text(x,target_distribution.cumsum()[x]-10000,f'{cum_percentage_of_target} %',color=color, fontsize=11)

#show plot
plt.show()

Interpretation of the Graph:

* Two graphs are combined in the figure above.
* The bar chart represents the distribution of the target variable.
* The line plot represents the cumulative percentage of samples across the target classes.

Observation:

* Nine of the 11 target classes together hold less than <font size="3" color="red">**47.99%**</font> of the data.
* The remaining 2 labels alone account for about <font size="3" color="blue">**52%**</font> of the data samples.
* The label distribution is therefore very unbalanced.
* This is what real-world data looks like: it is very difficult to get a balanced number of samples per class when the data comes from the real world.
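
The percentages read off the chart can also be computed directly. The sketch below uses made-up counts (not the competition data) to show the cumulative share per class and the share held by the two largest classes.

```python
import pandas as pd

# Toy label counts, invented for illustration only.
counts = pd.Series({"a": 5, "b": 15, "c": 30, "d": 50}).sort_values(ascending=True)

# Cumulative percentage of samples, in ascending order of class size.
cum_pct = (counts.cumsum() / counts.sum() * 100).round(2)

# Share of all samples held by the two most frequent classes.
top2_share = counts.nlargest(2).sum() / counts.sum() * 100

print(cum_pct)
print(f"Top-2 classes hold {top2_share:.2f}% of the samples")
```

Applied to `target_distribution`, the same two expressions would give the numbers annotated on the line plot.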

A very nice EDA has already been done in this kernel: https://www.kaggle.com/isaienkov/healthcare-analysis-and-modeling-42-7

So I will go straight to modeling.

<h4> Crazy Feature:</h4>

In [None]:
df[df["dataset"] == "train"]["Visitors with Patient"].unique()

This looks crazy when you examine the data.

The `Visitors with Patient` column has values ranging from <strong>1 to 32</strong>, and the column description reads `Number of Visitors with the patient`.

Is it really possible for a patient to have that many visitors with them?

In [None]:
from sklearn import preprocessing
from tqdm import tqdm
from sklearn.utils import class_weight

from sklearn.ensemble import AdaBoostClassifier
from sklearn.metrics import accuracy_score

<h2> Feature Engineering </h2>

Let's create a new feature called `is_patient_admitted_in_same_city_hospital`, which indicates whether the patient was admitted to a hospital in their own city (1) or to a hospital in a different city (0).

In [None]:
df["is_patient_admitted_in_same_city_hospital"] = np.where(df["City_Code_Patient"] == df["City_Code_Hospital"], 1, 0)

In [None]:
df["is_patient_admitted_in_same_city_hospital"].value_counts()

The `Bed Grade` feature has missing values, so they are imputed directly with the mode computed on the training set.

In [None]:
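## impute missing values with the mode of `Bed Grade` from the training rows
## note: fillna with a scalar fills NaN in every column of df, not only `Bed Grade`
df.fillna(df[df["dataset"] == "train"]["Bed Grade"].mode()[0], inplace=True)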
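
As a possible alternative, imputation can be restricted to a single column by passing a dict to `fillna`. The sketch below runs on a toy frame whose values are invented; note that with this variant any other column with missing values stays untouched and would need its own strategy before modeling.

```python
import numpy as np
import pandas as pd

# Toy frame with made-up values; column names mirror the notebook.
toy = pd.DataFrame({
    "dataset": ["train", "train", "train", "test"],
    "Bed Grade": [2.0, 2.0, 3.0, np.nan],
    "City_Code_Patient": [7.0, np.nan, 8.0, 9.0],
})

# Mode computed on the training rows only, to avoid leaking test information.
bed_grade_mode = toy.loc[toy["dataset"] == "train", "Bed Grade"].mode()[0]

# A dict limits the fill to the named column; other columns keep their NaN.
toy_filled = toy.fillna({"Bed Grade": bed_grade_mode})
print(toy_filled)
```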

<h2> Encoding Features </h2>

In [None]:
label_encoding_cols = ["Hospital_type_code", "Hospital_region_code", "Department", "Ward_Type", "Type of Admission", "Severity of Illness"]

## store label encoder object
label_encoder_dict = {}
for cols in tqdm(label_encoding_cols):
    le = preprocessing.LabelEncoder()
    le.fit(df[df["dataset"] == "train"][cols])
    df[cols] = le.transform(df[cols])
    label_encoder_dict[cols] = le
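
One thing to keep in mind with this pattern: a `LabelEncoder` fitted on the training rows raises an error if the combined frame contains a category it has never seen. A minimal sketch with invented category values shows how that case could be detected up front.

```python
import pandas as pd
from sklearn import preprocessing

# Made-up categories: "d" appears only outside the training rows.
train_vals = pd.Series(["a", "b", "c"])
all_vals = pd.Series(["a", "b", "c", "d"])

le = preprocessing.LabelEncoder().fit(train_vals)

# Categories present in the full data but unknown to the fitted encoder.
unseen = set(all_vals) - set(le.classes_.tolist())
print(f"Unseen categories: {unseen}")

# transform() raises ValueError when it meets an unseen label.
try:
    le.transform(all_vals)
    failed = False
except ValueError:
    failed = True
print(f"transform failed: {failed}")
```

If `unseen` is empty for every encoded column, the loop above runs without errors.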

Encode the target variable:

In [None]:
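## copy the slices so that later column assignments do not trigger SettingWithCopyWarning
train_df_encoded = df[df["dataset"] == "train"].copy()
test_df_encoded = df[df["dataset"] == "test"].copy()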

## target variable encoding
le = preprocessing.LabelEncoder()
le.fit(train_df_encoded["Stay"])
train_df_encoded["Stay"] = le.transform(train_df_encoded["Stay"])
label_encoder_dict["Stay"] = le

In [None]:
## age encode
age_encode = {'51-60':6, '71-80':8, '31-40':4, '41-50':5, '81-90':9, '61-70':7, '21-30':3,
       '11-20':2, '0-10':1, '91-100':10}

train_df_encoded["Age"] = train_df_encoded["Age"].map(age_encode)
test_df_encoded["Age"] = test_df_encoded["Age"].map(age_encode)

In [None]:
train_df_encoded.head()

<h2> Build ML model </h2>

In [None]:
## StratifiedKFold
NUM_FOLDS = 5
kfold = model_selection.StratifiedKFold(n_splits=NUM_FOLDS, shuffle=True, random_state=42)

for fold, (train_idx, val_idx) in enumerate(kfold.split(X=train_df_encoded, y=train_df_encoded["Stay"])):
    train_df_encoded.loc[val_idx, "fold"] = fold
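
To see what stratification buys us, the sketch below assigns folds on a toy imbalanced target (made-up data) and checks that every fold keeps the same class proportion. It indexes with `toy.index[val_idx]` because `split` returns positions, which only coincide with index labels when the frame has a default `0..n-1` index.

```python
import numpy as np
import pandas as pd
from sklearn import model_selection

# Toy imbalanced target: 80% class 0, 20% class 1.
toy = pd.DataFrame({"y": [0] * 80 + [1] * 20})

skf = model_selection.StratifiedKFold(n_splits=5, shuffle=True, random_state=42)
toy["fold"] = -1
for fold, (_, val_idx) in enumerate(skf.split(X=toy, y=toy["y"])):
    # val_idx holds row positions, so translate them to index labels first.
    toy.loc[toy.index[val_idx], "fold"] = fold

# Share of class 1 in each fold; stratification keeps it at the overall rate.
share = toy.groupby("fold")["y"].mean()
print(share)
```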

Calculate cost-sensitive class weights:

In [None]:
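def get_class_weights(y):
    ## 'balanced' weights are inversely proportional to class frequencies
    classes = np.unique(y)
    ## keyword arguments are required by recent scikit-learn versions
    weights = class_weight.compute_class_weight(class_weight='balanced',
                                                classes=classes,
                                                y=y)
    ## map each class label to its weight
    return dict(zip(classes, weights))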
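
For intuition, the `'balanced'` heuristic in scikit-learn computes `n_samples / (n_classes * np.bincount(y))`. The sketch below reproduces it by hand on a small made-up label vector and compares the result with the library call.

```python
import numpy as np
from sklearn.utils import class_weight

# Toy labels: class 0 is frequent, class 2 is rare.
y = np.array([0, 0, 0, 0, 0, 0, 1, 1, 1, 2])
classes = np.unique(y)

# Manual version of the 'balanced' formula.
manual = len(y) / (len(classes) * np.bincount(y))

# Library version, using keyword arguments.
sk = class_weight.compute_class_weight(class_weight="balanced", classes=classes, y=y)

print(dict(zip(classes.tolist(), manual.round(3).tolist())))
```

Rare classes get larger weights, so misclassifying them costs the model more during training.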

In [None]:
from sklearn.ensemble import RandomForestClassifier
from lightgbm import LGBMClassifier

RandomForest model:

In [None]:
selected_features = ["Hospital_code", "Hospital_type_code", "City_Code_Hospital", "Hospital_region_code", "Available Extra Rooms in Hospital", 
                         "Department", "Ward_Type", "Bed Grade", "City_Code_Patient",  "Type of Admission", "Severity of Illness", 'Age', "Admission_Deposit",
                        "is_patient_admitted_in_same_city_hospital"]

## hold out fold 4 for evaluation and train on the remaining folds
trn_df = train_df_encoded[train_df_encoded["fold"] != 4]
tst_df = train_df_encoded[train_df_encoded["fold"] == 4]

## use cost-sensitive class weights for RandomForest
class_weights = get_class_weights(trn_df["Stay"])
model = RandomForestClassifier(random_state=666, class_weight=class_weights)
model.fit(trn_df[selected_features], trn_df["Stay"])
preds = model.predict(tst_df[selected_features])
print(f'Accuracy: {accuracy_score(tst_df["Stay"], preds)*100}%')

This gives an accuracy of about 30% on the held-out fold.
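
With unbalanced labels, plain accuracy is easier to judge next to a majority-class baseline and a class-averaged metric. The sketch below uses invented labels, not the notebook's predictions, to show how a model that always predicts the most frequent class scores on both.

```python
import numpy as np
from sklearn.metrics import accuracy_score, balanced_accuracy_score

# Toy imbalanced ground truth: 6 samples of class 0, 3 of class 1, 1 of class 2.
y_true = np.array([0] * 6 + [1] * 3 + [2] * 1)

# Baseline that always predicts the majority class.
y_major = np.zeros_like(y_true)

acc = accuracy_score(y_true, y_major)
# Balanced accuracy is the mean of the per-class recalls.
bal = balanced_accuracy_score(y_true, y_major)

print(f"accuracy: {acc:.3f}, balanced accuracy: {bal:.3f}")
```

Running the same two metrics on `tst_df["Stay"]` and `preds` would show how far the RandomForest is above such a baseline.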

<h2> KFOLD model </h2>

In [None]:
eval_fold = 4
selected_features = ["Hospital_code", "Hospital_type_code", "City_Code_Hospital", "Hospital_region_code", "Available Extra Rooms in Hospital", 
                         "Department", "Ward_Type", "Bed Grade", "City_Code_Patient",  "Type of Admission", "Severity of Illness", 'Age', "Admission_Deposit",
                        "is_patient_admitted_in_same_city_hospital"]
models = []

predicted_probs = np.array([])

## Build KFold model
for fold in tqdm(range(0, 4)):
    train_df_features = train_df_encoded[(train_df_encoded["fold"] != fold) & (train_df_encoded["fold"] != eval_fold)]
    test_df_features = train_df_encoded[(train_df_encoded["fold"] == fold) & (train_df_encoded["fold"] != eval_fold)]
    
    clf = LGBMClassifier(n_estimators= 300, random_state=666)
        
    ## fit the model
    clf.fit(train_df_features[selected_features], train_df_features["Stay"], categorical_feature=label_encoding_cols + ['Age', 'is_patient_admitted_in_same_city_hospital'])
    ## add model to list
    models.append(clf)
    ## predict for validation set
    pred = clf.predict(test_df_features[selected_features])
    
    ## predict on unseen kfold set
    unseen_pred_probs = clf.predict_proba(train_df_encoded[train_df_encoded["fold"] == eval_fold][selected_features])
    if predicted_probs.size == 0:
        predicted_probs = unseen_pred_probs
    else:
        predicted_probs = np.sum([predicted_probs, unseen_pred_probs], axis=0)
        
    print(f"Fold {fold} accuracy: {accuracy_score(test_df_features['Stay'], pred):.2f}")

In [None]:
## average the probabilities over the fold models, then pick the most likely class
predicted_probs_avg = predicted_probs / len(models)
predicted_probs_avg = np.argmax(predicted_probs_avg,axis=1)

print(f'Unseen Fold accuracy: {accuracy_score(train_df_encoded[train_df_encoded["fold"] == eval_fold]["Stay"], predicted_probs_avg)}')
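
The averaging step above is a form of soft voting. As a compact sketch with made-up probabilities for two models, two samples and two classes, the same result can be obtained by stacking the per-model arrays and taking the mean.

```python
import numpy as np

# Made-up class probabilities from two fold models (rows = samples).
fold_probs = [
    np.array([[0.6, 0.4], [0.2, 0.8]]),
    np.array([[0.3, 0.7], [0.1, 0.9]]),
]

# Soft voting: average the probabilities across models.
avg = np.mean(np.stack(fold_probs), axis=0)

# Predicted class = index of the highest averaged probability.
pred = avg.argmax(axis=1)
print(avg, pred)
```

Dividing by the number of models does not change the `argmax`, but it keeps each row a valid probability distribution, which helps if the probabilities are inspected later.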

<h2> Inference for Test set </h2>

In [None]:
test_predicted_probs = np.array([])

## sum the predictions of the fold models on the test set
for model in tqdm(models):
    ## predict class probabilities for the test set
    test_pred_probs = model.predict_proba(test_df_encoded[selected_features])
    if test_predicted_probs.size == 0:
        test_predicted_probs = test_pred_probs
    else:
        test_predicted_probs = np.sum([test_predicted_probs, test_pred_probs], axis=0)

## calculate the average probability and pick the most likely class
test_predicted_probs_avg = test_predicted_probs / len(models)
test_predicted_probs_avg = np.argmax(test_predicted_probs_avg,axis=1)

Write the submission file:

In [None]:
submission_df = pd.DataFrame(test_df_encoded["case_id"])
submission_df["Stay"] = label_encoder_dict["Stay"].inverse_transform(test_predicted_probs_avg)
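
The `inverse_transform` call maps integer predictions back to the original `Stay` strings. A small round trip with hypothetical label strings illustrates the mapping; `LabelEncoder` assigns codes in sorted order of the labels.

```python
from sklearn import preprocessing

# Hypothetical label strings used only for this illustration.
le_demo = preprocessing.LabelEncoder().fit(["0-10", "11-20", "21-30"])

# Strings -> integer codes (codes follow the sorted order of the labels).
codes = le_demo.transform(["21-30", "0-10"])

# Integer codes -> original strings.
back = le_demo.inverse_transform(codes)
print(codes.tolist(), back.tolist())
```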

In [None]:
## write into submission file
submission_df.to_excel("submission.xlsx", index=False)

In [None]:
submission_df["Stay"].unique()

Let's build a model using **PyCaret**

In [None]:
!pip install pycaret

In [None]:
from pycaret.classification import *

In [None]:
cols = selected_features+['Stay']
exp = setup(train_df_encoded[train_df_encoded["fold"] != eval_fold][cols] , target = 'Stay')

In [None]:
## compare_models returns the best-performing model
best_model = compare_models()

<font size="4">Stay tuned!</font><br>
More will be updated soon. Thanks for reading my kernel.