# Problem Statement
The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas: healthcare management. While healthcare management offers many use cases for data science, patient length of stay (LOS) is one critical parameter to observe and predict in order to improve the efficiency of healthcare management in a hospital.

This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, these patients can have their treatment plans optimized to minimize LOS and lower the chance of staff or visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner. The task is to accurately predict the length of stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 classes ranging from 0-10 days to more than 100 days.

Evaluation Metric : **100*accuracy score**

* Public Leaderboard: 42.917% (Rank 57)
* Private Leaderboard: 42.74% (Rank 53)

[Link To the Leaderboard](https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii/#LeaderBoard)

# Importing libraries and Loading the Data

In [None]:
import os
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
%matplotlib inline

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder
from sklearn.metrics import classification_report, confusion_matrix
from xgboost import XGBClassifier

In [None]:
path='../input/av-healthcare-analytics-ii/healthcare'
train_orig=pd.read_csv(os.path.join(path,'train_data.csv'))
test_orig=pd.read_csv(os.path.join(path,'test_data.csv'))
subm=pd.read_csv(os.path.join(path,'sample_sub.csv'))

**Number of unique values for each column in train and test dataset**

In [None]:
for col in test_orig.columns:
    print("{}:\ntrain dataset:{}\ntest dataset:{}".format(col,train_orig[col].nunique(),test_orig[col].nunique()))
    print("=======================================")
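
Comparing unique counts does not show whether the test set contains values that never occur in train. The following is a minimal sketch of that check on toy frames; the helper name `unseen_values` and the frame contents are illustrative assumptions, not part of the competition data.

```python
import pandas as pd

def unseen_values(train, test, col):
    """Return the set of values in test[col] that never appear in train[col]."""
    return set(test[col].dropna().unique()) - set(train[col].dropna().unique())

# Toy frames standing in for train_orig / test_orig (values are made up).
toy_train = pd.DataFrame({"Ward_Type": ["R", "S", "Q"], "Bed Grade": [1.0, 2.0, None]})
toy_test = pd.DataFrame({"Ward_Type": ["R", "T"], "Bed Grade": [2.0, 4.0]})

# Map each column to the test-only values it contains.
unseen = {col: unseen_values(toy_train, toy_test, col) for col in toy_test.columns}
print(unseen)
```

An empty set for every column would mean that an encoder fitted on train alone covers the test data as well.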

In [None]:
test_orig.isna().sum()

In [None]:
train_orig.isna().sum()

In [None]:
train_orig['Admission_Deposit'].describe()

In [None]:
test_orig['Admission_Deposit'].describe()

In [None]:
train_orig.head()

In [None]:
test_orig.head()

In [None]:
train_orig.info()

In [None]:
print(train_orig['Stay'].unique())
print(f"\nTotal number of target values:{train_orig['Stay'].nunique()}")

**Concatenating train and test data for further inspection**

In [None]:
data= pd.concat([train_orig,test_orig],sort=False)

In [None]:
data.isna().sum()

In [None]:
data.info()

In [None]:
for col in data.columns:
    print("{}:{}".format(col,data[col].nunique()))
    print("=======================================")
    
# Hence case_id is unique for every row

## Feature Preprocessing and Feature Generation

In [None]:
categorical_col=[]
for col in data.columns:
    if data[col].dtype== object and data[col].nunique()<=50:
        categorical_col.append(col)
print(categorical_col)

In [None]:
for col in categorical_col:
    print(f"{col}:\n{data[col].value_counts()}")
    print("=======================================")

In [None]:
data.groupby(['Hospital_region_code','Ward_Facility_Code']).size()

**Generating a combined feature, Hospital_region_code_FEAT_Ward_Facility_Code, because each Ward_Facility_Code corresponds to a particular Hospital_region_code**

In [None]:
data['Hospital_region_code_FEAT_Ward_Facility_Code']= data['Hospital_region_code']+'_'+data['Ward_Facility_Code']
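
Whether each ward facility code really maps to a single region can be checked directly instead of reading it off the `groupby` output. The sketch below does this on a toy frame whose values are assumptions; only the column names come from the dataset.

```python
import pandas as pd

# Toy data (made up): each ward facility code belongs to exactly one region.
toy = pd.DataFrame({
    "Hospital_region_code": ["X", "X", "Y", "Z", "Y"],
    "Ward_Facility_Code":   ["F", "E", "B", "A", "B"],
})

# Number of distinct regions observed for each ward facility code.
regions_per_ward = toy.groupby("Ward_Facility_Code")["Hospital_region_code"].nunique()

# True only if every ward facility code is tied to a single region.
ward_determines_region = bool((regions_per_ward == 1).all())
print(ward_determines_region)
```

When the check returns `True`, the concatenated feature has exactly as many distinct values as `Ward_Facility_Code` itself.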

In [None]:
data.groupby(['Hospital_type_code','Hospital_code']).size()

**Generating a combined feature, Hospital_type_code_FEAT_Hospital_code, because each Hospital_code corresponds to a particular Hospital_type_code**

In [None]:
# Cast the numeric code to string only for the concatenation; Hospital_code itself stays numeric.
data['Hospital_type_code_FEAT_Hospital_code']= data['Hospital_type_code']+'_'+data['Hospital_code'].astype(str)

In [None]:
data.groupby(['Hospital_type_code','Hospital_region_code']).size()

In [None]:
data['Hospital_type_code_FEAT_Hospital_region_code']= data['Hospital_type_code']+'_'+data['Hospital_region_code']

In [None]:
data.groupby(['Hospital_region_code','City_Code_Hospital']).size()

In [None]:
# Cast the numeric code to string only for the concatenation; City_Code_Hospital itself stays numeric.
data['Hospital_region_code_FEAT_City_Code_Hospital']= data['Hospital_region_code']+'_'+data['City_Code_Hospital'].astype(str)

In [None]:
data.groupby(['Bed Grade','Ward_Facility_Code']).size()

In [None]:
data['Visitors with Patient'].unique()

In [None]:
data['City_Code_Patient'].unique()

In [None]:
data['Stay'].value_counts()

**Generating a feature that flags whether a row has the same patient and the same hospital as the row immediately before it (a consecutive repeat visit).**

In [None]:
# Hospital and patient of the previous row.
# The first row has no predecessor, so it is filled with placeholder values.
data['prev_hosp_code']= data['Hospital_code'].shift(1).fillna(0)
data['prev_patientid']= data['patientid'].shift(1).fillna(31397)

In [None]:
# 1 if the previous row has the same patient and the same hospital, else 0.
same_patient= data['prev_patientid']==data['patientid']
same_hospital= data['prev_hosp_code']==data['Hospital_code']
data['patient_visiting_consecutive']= (same_patient & same_hospital).astype(int)

In [None]:
data['patient_visiting_consecutive'].value_counts()
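
The flag above depends on repeat visits sitting in adjacent rows. A possible variant, which is not what this notebook does, looks up the previous visit within each patient's own history using `groupby(...).shift(1)`. The sketch uses a toy frame, and the column name `same_hospital_as_prev_visit` is hypothetical.

```python
import pandas as pd

# Toy visit log (made up); patient 1's visits are interleaved with patient 2's.
toy = pd.DataFrame({
    "patientid":     [1, 1, 2, 1, 2],
    "Hospital_code": [8, 8, 5, 8, 6],
})

# Previous hospital within each patient's own rows (NaN for a first visit).
prev_hosp = toy.groupby("patientid")["Hospital_code"].shift(1)

# 1 if the patient's previous visit was to the same hospital, else 0.
toy["same_hospital_as_prev_visit"] = (prev_hosp == toy["Hospital_code"]).astype(int)
print(toy)
```

In this toy frame the fourth row is flagged even though the row directly above it belongs to another patient, which a plain `shift(1)` over the whole frame would miss.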

In [None]:
data.head().T

**Dropping unnecessary columns from the data**

In [None]:
data.drop(['case_id','patientid','Stay','prev_hosp_code','prev_patientid'],axis=1,inplace=True)

## Label Encoding all the Categorical Features.

In [None]:
categorical_col=[]
for col in data.columns:
    if data[col].dtype== object and data[col].nunique()<=50:
        categorical_col.append(col)
print(categorical_col)

In [None]:
le= LabelEncoder()

In [None]:
for col in categorical_col:
    data[col]= le.fit_transform(data[col])
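
Reusing a single `LabelEncoder` keeps only the mapping of the last column it was fitted on. If the original labels are needed later, one option is to store a separate encoder per column. The sketch below uses a toy frame, and the dictionary name `encoders` is hypothetical.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy frame (made up) with two categorical columns.
toy = pd.DataFrame({
    "Department": ["radiotherapy", "anesthesia", "radiotherapy"],
    "Ward_Type": ["R", "S", "R"],
})

encoders = {}  # hypothetical name: one fitted encoder per column
for col in ["Department", "Ward_Type"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col])

# A stored encoder can map the integer codes back to the original labels.
decoded = list(encoders["Department"].inverse_transform(toy["Department"]))
print(decoded)
```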

## Imputing Missing Values

In [None]:
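# Fill null values: median for City_Code_Patient, sentinel -1 for Bed Grade.
# Assign the result back instead of calling fillna(inplace=True) on a column,
# which may not update the DataFrame in recent pandas versions.
data['City_Code_Patient']= data['City_Code_Patient'].fillna(data['City_Code_Patient'].median())
data['Bed Grade']= data['Bed Grade'].fillna(-1)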
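
The median above is computed over the concatenated train and test rows. A possible alternative, shown here only as a sketch on toy data, is to estimate the fill value from the training rows and apply it to both sets. The frame contents and the name `train_median` are assumptions.

```python
import numpy as np
import pandas as pd

# Toy train/test frames (made up) with missing patient city codes.
toy_train = pd.DataFrame({"City_Code_Patient": [1.0, 3.0, np.nan, 8.0]})
toy_test = pd.DataFrame({"City_Code_Patient": [np.nan, 20.0]})

# Estimate the fill value from the training rows only.
train_median = toy_train["City_Code_Patient"].median()

# Apply the same value to both sets.
toy_train["City_Code_Patient"] = toy_train["City_Code_Patient"].fillna(train_median)
toy_test["City_Code_Patient"] = toy_test["City_Code_Patient"].fillna(train_median)
print(train_median)
```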

In [None]:
train_new= data[:len(train_orig)]
test_new= data[len(train_orig):]

In [None]:
y_le= LabelEncoder()

y= y_le.fit_transform(train_orig['Stay'])

## Checking correlation of features with the target Column.

In [None]:
check=pd.concat([train_new,pd.DataFrame(data=y,columns=['Stay'])],axis=1)

In [None]:
check.corr()['Stay'].sort_values()
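
`DataFrame.corr()` defaults to Pearson correlation. Because the encoded `Stay` classes are ordered bins rather than a measured quantity, a rank-based coefficient may be a reasonable alternative view. The sketch below uses `method="spearman"` on a toy frame; the values are made up and only the column names come from the dataset.

```python
import pandas as pd

# Toy frame (made up): one feature rising with Stay, one mostly falling.
toy = pd.DataFrame({
    "Visitors with Patient": [2, 2, 4, 6, 8],
    "Admission_Deposit": [5000, 4800, 4500, 4000, 4100],
    "Stay": [0, 1, 2, 3, 4],
})

# Spearman rank correlation of every feature with the encoded target.
spearman = toy.corr(method="spearman")["Stay"].drop("Stay").sort_values()
print(spearman)
```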

In [None]:
y_le.classes_

# Model Building- XGBoost

In [None]:

X_train, X_test, y_train, y_test = train_test_split(train_new, y, test_size=0.2, random_state=101)
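
With 11 target classes of unequal size, a purely random split can leave the validation set with a different class mix than the training set. One option, not used in this notebook, is the `stratify` argument of `train_test_split`. The sketch below demonstrates it on a toy two-class target.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Toy imbalanced target (made up): 80 samples of class 0 and 20 of class 1.
X_toy = np.arange(100).reshape(-1, 1)
y_toy = np.array([0] * 80 + [1] * 20)

# stratify=y_toy keeps the 80/20 class ratio in both parts of the split.
X_tr, X_val, y_tr, y_val = train_test_split(
    X_toy, y_toy, test_size=0.2, random_state=101, stratify=y_toy
)
print(np.bincount(y_tr), np.bincount(y_val))
```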

In [None]:
# XGBoost multiclass model; arguments not listed keep the library defaults.
model = XGBClassifier(objective='multi:softprob', num_class=11, base_score=0.5,
                      n_estimators=500, learning_rate=0.03, max_depth=8,
                      min_child_weight=3, gamma=0.1,
                      subsample=0.6, colsample_bytree=0.6,
                      reg_alpha=0.1, reg_lambda=1, random_state=0)

> from sklearn.model_selection import RandomizedSearchCV
> 
> param_dist = {'n_estimators': [100,500],'learning_rate': [0.03,0.1],'max_depth': [5,8],'subsample':[i/10.0 for i in range(6,8)],'colsample_bytree':[i/10.0 for i in range(6,8)],'min_child_weight': [1,3]}
> 
> model_tuning= RandomizedSearchCV(estimator = model, param_distributions=param_dist,verbose = 1, n_jobs =-1, n_iter = 5)
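
For reference, a randomized search of this kind can be run end to end as in the sketch below. It is only an illustration under stated assumptions: the data is synthetic, the grid is reduced, and scikit-learn's `GradientBoostingClassifier` is swapped in for `XGBClassifier` so that the example runs quickly without extra dependencies.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import RandomizedSearchCV

# Small synthetic 3-class problem standing in for the hospital data.
X_toy, y_toy = make_classification(n_samples=150, n_features=8, n_informative=4,
                                   n_classes=3, random_state=0)

# Reduced, illustrative grid; these values are not the ones used in the notebook.
param_dist = {
    "n_estimators": [20, 40],
    "learning_rate": [0.03, 0.1],
    "max_depth": [2, 3],
    "subsample": [0.6, 0.7],
}

search = RandomizedSearchCV(
    estimator=GradientBoostingClassifier(random_state=0),  # stand-in for XGBClassifier
    param_distributions=param_dist,
    n_iter=3, cv=3, scoring="accuracy", random_state=0,
)
search.fit(X_toy, y_toy)  # samples 3 combinations, each scored with 3-fold CV
print(search.best_params_, search.best_score_)
```

After fitting, `search.best_params_` holds the selected combination and `search.best_estimator_` is a model refit on all the data passed to `fit`.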

In [None]:
model.fit(X_train,y_train)

In [None]:
pred= model.predict(X_test)

## Got 43% accuracy on Validation set.

In [None]:
# classification_report expects (y_true, y_pred); swapping them swaps precision and recall.
print(classification_report(y_test,pred))
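
The competition metric is 100 times the accuracy score, which can be computed directly with `accuracy_score`. The sketch below uses toy label arrays as an assumption; in the notebook the inputs would be `y_test` and `pred`.

```python
import numpy as np
from sklearn.metrics import accuracy_score

# Toy labels (made up): 4 of the 5 predictions match the true class.
y_true_toy = np.array([0, 1, 2, 2, 1])
y_pred_toy = np.array([0, 2, 2, 2, 1])

# Competition metric: accuracy expressed as a percentage.
score = 100 * accuracy_score(y_true_toy, y_pred_toy)
print(score)
```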

In [None]:
from xgboost import plot_importance

plot_importance(model);

## Prediction on the test set
## Got an accuracy of 42.917% on the public leaderboard.

In [None]:
testset_pred= model.predict(test_new)

In [None]:
testset_pred= list(y_le.inverse_transform(testset_pred))

In [None]:
subm.head()

In [None]:
final_subm= pd.DataFrame(data= testset_pred,index=subm['case_id'],columns=['Stay'])

In [None]:
final_subm.to_csv('final_subm_new.csv')
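
The submission steps above (decode the predicted class indices, pair them with `case_id`, write a CSV) can be sketched in isolation as follows. The case IDs, labels, and predictions are made-up toy values, and the names `toy_le` and `toy_subm` are hypothetical.

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy target labels (made up) and an encoder fitted on them.
toy_stay = ["0-10", "11-20", "0-10", "21-30"]
toy_le = LabelEncoder().fit(toy_stay)

# Toy predicted class indices and hypothetical case IDs.
toy_pred = np.array([1, 0, 2])
toy_case_ids = [318439, 318440, 318441]

# Decode the indices back to the original Stay labels and build the table.
toy_subm = pd.DataFrame({
    "case_id": toy_case_ids,
    "Stay": toy_le.inverse_transform(toy_pred),
})

# With no path given, to_csv returns the CSV text instead of writing a file.
csv_text = toy_subm.to_csv(index=False)
print(csv_text)
```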

In [None]:
df= pd.read_csv('final_subm_new.csv')
df.head()