# Janatahack: Healthcare Analytics II

## [Janatahack: Healthcare Analytics II](https://datahack.analyticsvidhya.com/contest/janatahack-healthcare-analytics-ii)

The healthcare sector has long been an early adopter of technological advances and has benefited greatly from them. Today, machine learning plays a key role in many health-related areas, including the development of new medical procedures, the handling of patient data, and staff management.

This weekend we invite you to participate in another Janatahack with the theme of healthcare analytics. Stay tuned for the problem statement and datasets this Friday and get a chance to work on a real healthcare case study along with 250 AV points at stake.

## Problem Statement

The recent Covid-19 pandemic has raised alarms over one of the most overlooked areas: healthcare management. While healthcare management has many use cases for data science, patient length of stay (LOS) is one critical parameter to observe and predict in order to improve the efficiency of healthcare management in a hospital.

This parameter helps hospitals identify patients at high LOS risk (patients who will stay longer) at the time of admission. Once identified, patients with high LOS risk can have their treatment plans optimized to minimize LOS and lower the chance of staff or visitor infection. Prior knowledge of LOS can also aid logistics such as room and bed allocation planning.

Suppose you have been hired as a Data Scientist at HealthMan, a not-for-profit organization dedicated to managing the functioning of hospitals in a professional and optimal manner.
The task is to accurately predict the length of stay for each patient on a case-by-case basis so that hospitals can use this information for optimal resource allocation and better functioning. The length of stay is divided into 11 classes ranging from 0-10 days to more than 100 days.
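
To make the 11-class target concrete, the sketch below maps a raw number of days to a class label. The bin edges and the label strings (including `More than 100 Days`) are assumptions inferred from the description above, not values confirmed by the dataset.

```python
import pandas as pd

# Assumed bin edges: ten 10-day buckets plus one open-ended final bucket.
edges = [0, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100, float("inf")]

# Assumed label strings: "0-10", "11-20", ..., "91-100", "More than 100 Days".
labels = [f"{lo if lo == 0 else lo + 1}-{hi}" for lo, hi in zip(edges[:-2], edges[1:-1])]
labels.append("More than 100 Days")


def stay_class(days):
    """Map a list of day counts to the assumed Stay class labels."""
    binned = pd.cut(pd.Series(days), bins=edges, labels=labels, include_lowest=True)
    return binned.astype(str)


print(len(labels))                               # 11
print(stay_class([0, 10, 11, 100, 101]).tolist())
```

Because the target is a set of labels rather than a day count, the task is a multiclass classification problem, not a regression.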

## Data

| Column | Description |
| --- | --- |
| case_id | Case_ID registered in Hospital |
| Hospital_code | Unique code for the Hospital |
| Hospital_type_code | Unique code for the type of Hospital |
| City_Code_Hospital | City Code of the Hospital |
| Hospital_region_code | Region Code of the Hospital |
| Available Extra Rooms in Hospital | Number of Extra rooms available in the Hospital |
| Department | Department overlooking the case |
| Ward_Type | Code for the Ward type |
| Ward_Facility_Code | Code for the Ward Facility |
| Bed Grade | Condition of Bed in the Ward |
| patientid | Unique Patient Id |
| City_Code_Patient | City Code for the patient |
| Type of Admission | Admission Type registered by the Hospital |
| Severity of Illness | Severity of the illness recorded at the time of admission |
| Visitors with Patient | Number of Visitors with the patient |
| Age | Age of the patient |
| Admission_Deposit | Deposit at the Admission Time |
| Stay | Stay Days by the patient |

## Evaluation Metric

The evaluation metric for this hackathon is 100*Accuracy Score.
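
As a minimal sketch of this metric, the function below computes 100 times the fraction of exactly matching labels. The function name `hackathon_score` and the toy labels are illustrative assumptions, not part of the contest material.

```python
import numpy as np


def hackathon_score(y_true, y_pred):
    """Return 100 * accuracy: the percentage of predictions that match exactly."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    if y_true.shape != y_pred.shape:
        raise ValueError("y_true and y_pred must have the same shape")
    return 100.0 * float(np.mean(y_true == y_pred))


# Toy labels: three of the four predictions match.
actual = ["0-10", "11-20", "21-30", "11-20"]
predicted = ["0-10", "21-30", "21-30", "11-20"]
print(hackathon_score(actual, predicted))  # 75.0
```

Accuracy treats every misclassification the same, so predicting `21-30` for a true `11-20` costs as much as predicting a far-away class.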

# Load the Packages

In [None]:
# This Python 3 environment comes with many helpful analytics libraries installed
# It is defined by the kaggle/python Docker image: https://github.com/kaggle/docker-python
# For example, here's several helpful packages to load

#Basic Packages
import numpy as np # linear algebra
import pandas as pd # data processing, CSV file I/O (e.g. pd.read_csv)
import matplotlib.pyplot as plt # Data Visualization
import seaborn as sns # Advanced Data Visualization
%matplotlib inline

#OS packages
import os

#Encoding Packages
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

#Scaling Packages
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()

#Multicollinearity VIF
import statsmodels.api as sm
from statsmodels.stats.outliers_influence import variance_inflation_factor

#Data Modelling Packages
from sklearn.model_selection import train_test_split

from imblearn.over_sampling import RandomOverSampler
# Named ros so that it does not overwrite the statsmodels alias sm imported above
ros = RandomOverSampler(random_state=294, sampling_strategy='not majority')

#Model Packages
import lightgbm as lgb


# Input data files are available in the read-only "../input/" directory
# For example, running this (by clicking run or pressing Shift+Enter) will list all files under the input directory

for dirname, _, filenames in os.walk('/kaggle/input'):
    for filename in filenames:
        print(os.path.join(dirname, filename))

# You can write up to 5GB to the current directory (/kaggle/working/) that gets preserved as output when you create a version using "Save & Run All" 
# You can also write temporary files to /kaggle/temp/, but they won't be saved outside of the current session

# Load the Datasets

## Loading from Kaggle Input Data

In [None]:
df_Train = pd.read_csv('../input/av-janatahack-healthcare-hackathon-ii/Data/train.csv')
df_Test = pd.read_csv('../input/av-janatahack-healthcare-hackathon-ii/Data/test.csv')

# Exploratory Data Analysis

In [None]:
#To find the head of the Data
df_Train.head()

In [None]:
#Information of the Dataset Datatype
df_Train.info()

In [None]:
#Information of the Dataset Continuous Values
df_Train.describe()

In [None]:
#Columns List
df_Train.columns

In [None]:
#Shape of the Train and Test Data
print('Shape of Train Data: ', df_Train.shape)
print('Shape of Test Data: ', df_Test.shape)

In [None]:
#Null values in the Train Dataset
print('Null values in Train Data: \n', df_Train.isnull().sum())

In [None]:
#Null Values in the Test Dataset
print('Null Values in Test Data: \n', df_Test.isnull().sum())

Missing values occur in the "Bed Grade" and "City_Code_Patient" columns.

In [None]:
print('Total Count of the Prediction Output Column Stay Variable: \n', df_Train['Stay'].value_counts())

## Assumptions of the Predictor Variables

Target Variable

Stay - Highly imbalanced; needs balancing with an oversampling technique such as SMOTE


Predictor Variables

Hospital Code - Highly imbalanced and might affect the model

Hospital Type Code - Imbalanced

City Code Hospital - Imbalanced

Available Extra Rooms - Positively skewed; needs balancing

Department - Highly imbalanced

Ward Type - Highly imbalanced

Patient ID - Many unique values; might need to be dropped

City Code Patient - Highly imbalanced

Severity of Illness - Imbalanced

Visitors with Patient - Imbalanced

Age - Imbalanced; could be binned further

Admission Deposit - Continuous; need to remove the outliers or scale the values
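
The note on the imbalanced `Stay` target can be illustrated with a small oversampling sketch. This is plain random oversampling written with pandas, not SMOTE (which synthesizes new points); it mimics the idea behind imbalanced-learn's `RandomOverSampler` with `sampling_strategy='not majority'`. The helper name `oversample_not_majority` and the toy data are assumptions made for illustration.

```python
import pandas as pd


def oversample_not_majority(df, target, random_state=294):
    """Resample every non-majority class with replacement up to the majority size."""
    counts = df[target].value_counts()
    majority_size = counts.max()
    parts = []
    for label, size in counts.items():
        group = df[df[target] == label]
        if size < majority_size:
            # Draw the missing rows from the same class, with replacement.
            extra = group.sample(n=majority_size - size, replace=True,
                                 random_state=random_state)
            group = pd.concat([group, extra])
        parts.append(group)
    return pd.concat(parts, ignore_index=True)


# Toy data: one class has four rows, the other two have one row each.
toy = pd.DataFrame({
    "Admission_Deposit": [4000, 4500, 5000, 5200, 6100, 3900],
    "Stay": ["11-20", "11-20", "11-20", "11-20", "21-30", "0-10"],
})
balanced = oversample_not_majority(toy, "Stay")
print(balanced["Stay"].value_counts())
```

Oversampling should be applied only to the rows used for fitting. Resampling before a train/validation split would leak duplicated rows into the validation set and inflate the measured accuracy.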

# Basic Feature Engineering

## Remove Duplicate Rows

In [None]:
df_Train.drop_duplicates(keep='first', inplace=True)

No duplicate rows were found.

## Joining the Train and Test Data for Encoding and Filling the Missing Values

In [None]:
# Flag the origin of each row, then concatenate the train and test sets
df_Train['is_train'] = 1
df_Test['is_train'] = 0

# ignore_index=True gives the combined frame a unique 0..n-1 index
df_Total = pd.concat([df_Train, df_Test], ignore_index=True)

## Fill missing Values

In [None]:
#Null values in the Total Dataset
print('Null values in Total Data: \n', df_Total.isnull().sum())

In [None]:
#Using forward fill to fill missing values (fillna(method="ffill") is deprecated in recent pandas)
df_Total['Bed Grade'] = df_Total['Bed Grade'].ffill()
df_Total['City_Code_Patient'] = df_Total['City_Code_Patient'].ffill()
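
Forward fill copies the value from the previous row, so the result depends on row order. The sketch below compares it with filling by the most frequent value on a toy column. Mode imputation is shown as a possible alternative, not as the method used in this notebook, and the toy values are assumptions.

```python
import numpy as np
import pandas as pd

# Toy column with two missing entries.
toy = pd.DataFrame({"Bed Grade": [2.0, np.nan, 3.0, 3.0, np.nan]})

# Forward fill: each gap takes the value of the row above it.
forward_filled = toy["Bed Grade"].ffill()

# Mode fill: each gap takes the most frequent observed value.
most_frequent = toy["Bed Grade"].mode().iloc[0]
mode_filled = toy["Bed Grade"].fillna(most_frequent)

print(forward_filled.tolist())  # [2.0, 2.0, 3.0, 3.0, 3.0]
print(mode_filled.tolist())     # [2.0, 3.0, 3.0, 3.0, 3.0]
```

Note that forward fill leaves a gap untouched when the very first row is missing, because there is no earlier value to copy.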

## Feature Engineering

In [None]:
# Total admission deposit per patient, summed over all of that patient's cases in train and test
df_Total['Bill_per_patient'] = df_Total.groupby('patientid')['Admission_Deposit'].transform('sum')
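
To show what this `groupby(...).transform('sum')` feature produces, here is a small sketch on toy data. The patient ids and deposit amounts are made up for illustration.

```python
import pandas as pd

# Toy data: patients 101 and 102 each appear in two cases.
toy = pd.DataFrame({
    "patientid": [101, 102, 101, 103, 102],
    "Admission_Deposit": [4000.0, 5500.0, 3000.0, 4200.0, 500.0],
})

# transform('sum') returns one value per original row, so the result
# lines up with the frame and can be assigned directly as a new column.
toy["Bill_per_patient"] = (
    toy.groupby("patientid")["Admission_Deposit"].transform("sum")
)

print(toy["Bill_per_patient"].tolist())  # [7000.0, 6000.0, 7000.0, 4200.0, 6000.0]
```

Unlike `agg`, which would return one row per patient, `transform` broadcasts the group total back to every case of that patient.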

## Encoding of the Columns

In [None]:
df_Total.head()

### For Tree Based Algorithm use Label Encoding

In [None]:
from sklearn.preprocessing import LabelEncoder

# Columns to label-encode; patientid is left unencoded
label_cols = ['Hospital_code', 'Hospital_type_code', 'City_Code_Hospital', 'Hospital_region_code',
              'Available Extra Rooms in Hospital', 'Department', 'Ward_Type', 'Ward_Facility_Code',
              'Bed Grade', 'City_Code_Patient', 'Type of Admission', 'Severity of Illness',
              'Visitors with Patient', 'Age']

le = LabelEncoder()
for col in label_cols:
    # fit_transform refits the encoder on each column, so one instance can be reused
    df_Total[col] = le.fit_transform(df_Total[col])
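
`LabelEncoder` assigns integer codes in sorted order of the values, which for strings is alphabetical and need not match any natural ranking. The sketch below shows this on a toy severity column and contrasts it with an explicit ordinal mapping. The category names `Minor`, `Moderate`, and `Extreme` and the `severity_order` mapping are assumptions for illustration.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

# Toy column; the category names are assumed, not taken from the dataset.
severity = pd.Series(["Minor", "Extreme", "Moderate", "Minor"])

encoder = LabelEncoder()
codes = encoder.fit_transform(severity)
print(list(encoder.classes_))  # ['Extreme', 'Minor', 'Moderate']
print(codes.tolist())          # [1, 0, 2, 1]

# Hypothetical alternative: an explicit mapping that preserves the ranking.
severity_order = {"Minor": 0, "Moderate": 1, "Extreme": 2}
ordinal_codes = severity.map(severity_order)
print(ordinal_codes.tolist())  # [0, 2, 1, 0]
```

Tree-based models can still split on alphabetical codes, but an order-preserving mapping lets a single threshold separate mild from severe cases.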

## For Scaling the Columns

In [None]:
df_Total['Admission_Deposit']

In [None]:
df_Total['Admission_Deposit'].describe()

In [None]:
from sklearn import preprocessing
mm_scaler = preprocessing.MinMaxScaler()
#df_Total[['Admission_Deposit']] = mm_scaler.fit_transform(df_Total[['Admission_Deposit']])
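
For reference, this is what the commented-out min-max scaling would do, sketched on toy deposit values (the amounts are assumptions). Each value is mapped to `(x - min) / (max - min)`, so the column ends up in the range 0 to 1.

```python
import pandas as pd
from sklearn.preprocessing import MinMaxScaler

# Toy deposits; the minimum is 2000 and the maximum is 10000.
toy = pd.DataFrame({"Admission_Deposit": [2000.0, 4000.0, 5000.0, 10000.0]})

scaler = MinMaxScaler()
# fit_transform expects a 2-D input, hence the double brackets.
scaled = scaler.fit_transform(toy[["Admission_Deposit"]]).ravel()
toy["Admission_Deposit_scaled"] = scaled

print(toy["Admission_Deposit_scaled"].tolist())  # [0.0, 0.25, 0.375, 1.0]
```

Min-max scaling is a monotonic transformation, so it does not change the splits a tree-based model such as LightGBM can find. It matters mainly for distance-based or gradient-based models.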

In [None]:
df_Total['Admission_Deposit'].describe()

## Split the Train and Test Data Back Apart after Feature Engineering

In [None]:
#Split the combined frame back into train and test using the is_train flag
df_Train_final = df_Total[df_Total['is_train'] == 1]
df_Test_final = df_Total[df_Total['is_train'] == 0]
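
The flag-concatenate-split pattern used here can be sketched end to end on toy frames. The column values are assumptions. The sketch also shows that test rows carry `NaN` in `Stay` after concatenation, since the test set has no target column.

```python
import pandas as pd

# Toy train and test frames; only train has the target column.
train = pd.DataFrame({"case_id": [1, 2], "Stay": ["0-10", "11-20"]})
test = pd.DataFrame({"case_id": [3]})

# Flag the origin of each row before combining.
train["is_train"] = 1
test["is_train"] = 0
total = pd.concat([train, test], ignore_index=True)

# Shared preprocessing would run on `total` here.

# Split back using the flag and drop the helper column.
train_back = total[total["is_train"] == 1].drop(columns="is_train")
test_back = total[total["is_train"] == 0].drop(columns=["is_train", "Stay"])

print(total["Stay"].isna().tolist())  # [False, False, True]
print(list(test_back.columns))        # ['case_id']
```

Fitting encoders on the combined frame guarantees that train and test share the same codes, at the cost of letting test-set values influence the preprocessing.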

In [None]:
df_Train_final

In [None]:
df_Test_final

# Data Modelling

## Split the Data to x and y variable

In [None]:
df_Train_final.columns

In [None]:
# Identifier, helper flag, and target columns that are not used as features
drop_cols = ['case_id', 'is_train', 'Stay']  # add 'patientid' here to drop it as well

x = df_Train_final.drop(columns=drop_cols)
y = df_Train_final['Stay']
# The test rows have no target, so their Stay column holds only NaN and is dropped too
x_pred = df_Test_final.drop(columns=drop_cols)

## Boosting Algorithm

### LightGBM Model

In [None]:
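import lightgbm as lgb
# n_estimators and num_boost_round are aliases for the number of boosting rounds, so set only one.
# scale_pos_weight is omitted because it is used only by the binary and multiclassova objectives.
# With max_depth=5 a tree has at most 2**5 = 32 leaves, so num_leaves=300 is never reached.
lgb_cl = lgb.LGBMClassifier(boosting_type='gbdt', learning_rate=0.1, n_estimators=500, importance_type='gain', objective='multiclass',
                            num_leaves=300, max_depth=5,
                            max_bin=60, bagging_fraction=0.9, feature_fraction=0.9, subsample_freq=2,
                            random_state=1994, n_jobs=-1)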

In [None]:
#lgb_cl.fit(x_train, y_train, eval_set=[x_test,y_test], verbose=50, eval_metric='auc', early_stopping_rounds=100)
lgb_cl.fit(x, np.ravel(y))
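
The model above is fitted on all training rows, so the only feedback is the public leaderboard. A stratified hold-out split gives a local estimate of the 100*accuracy metric before submitting. The sketch below is one way to do that. The helper `holdout_score`, the synthetic data, and the `DecisionTreeClassifier` stand-in are assumptions chosen to keep the example self-contained; any estimator with `fit` and `predict`, such as the `LGBMClassifier` above, could be passed instead.

```python
from sklearn.datasets import make_classification
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier


def holdout_score(model, features, target, test_size=0.2, random_state=294):
    """Fit on a stratified training split and return 100 * accuracy on the rest."""
    x_train, x_valid, y_train, y_valid = train_test_split(
        features, target, test_size=test_size,
        stratify=target, random_state=random_state,
    )
    model.fit(x_train, y_train)
    return 100.0 * accuracy_score(y_valid, model.predict(x_valid))


# Synthetic three-class data standing in for the hospital features.
features, target = make_classification(
    n_samples=300, n_features=8, n_informative=5, n_classes=3, random_state=0
)

# Stand-in estimator; swap in the LightGBM classifier for the real data.
score = holdout_score(DecisionTreeClassifier(max_depth=5, random_state=0),
                      features, target)
print(f"Hold-out score: {score:.2f}")
```

`stratify=target` keeps the class proportions the same in both splits, which matters for an imbalanced target like `Stay`.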

In [None]:
y_pred = lgb_cl.predict(x_pred)

In [None]:
y_pred

In [None]:
submission_df = pd.DataFrame({'case_id':df_Test['case_id'], 'Stay':y_pred})
submission_df.to_csv('Sample Submission LGB v01.csv', index=False)

Public leaderboard score: 42.35

Do share your comments on how to improve the model.