# **Notebook 4: Hospital Length of Stay (LOS) Prediction**

## **Importing Libraries**

In [1]:
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns

import warnings
warnings.filterwarnings("ignore")

# Removes the limit for the number of displayed columns
pd.set_option("display.max_columns", None)

# Sets the limit for the number of displayed rows
pd.set_option("display.max_rows", 200)

# To build models for prediction
from sklearn.model_selection import train_test_split, cross_val_score, KFold
from sklearn.linear_model import LinearRegression, Ridge, Lasso, ElasticNet
from sklearn.tree import DecisionTreeRegressor
from sklearn.ensemble import RandomForestRegressor, BaggingRegressor

# To encode categorical variables
from sklearn.preprocessing import LabelEncoder

# For tuning the model
from sklearn.model_selection import GridSearchCV

# To check model performance
from sklearn.metrics import make_scorer, mean_squared_error, r2_score, mean_absolute_error

In [2]:
# Read the healthcare dataset file
data = pd.read_csv("healthcare_data.csv")

In [3]:
# Copying data to another variable to avoid any changes to original data
same_data = data.copy()

## **Data Preparation for Model Building**

- Before we build a model, we need to encode the categorical features.
- We then separate the independent variables from the dependent (target) variable.
- Finally, we split the data into training and test sets, so that the model trained on the training data can be evaluated on data it has not seen.

In [10]:
# Creating dummy variables for the categorical columns
# drop_first=True is used to avoid redundant variables
data = pd.get_dummies(
    data,
    columns = data.select_dtypes(include = ["object", "category"]).columns.tolist(),
    drop_first = True,
)
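To make the effect of `drop_first = True` concrete, here is a minimal sketch on a toy frame. The frame and its values are hypothetical and only mimic a few column names from the dataset; `dtype=int` is an added assumption so that the dummies display as 0/1, since recent pandas versions return booleans by default.

```python
import pandas as pd

# Hypothetical toy frame; the values are made up for illustration only
toy = pd.DataFrame(
    {
        "Department": pd.Categorical(["gynecology", "surgery", "anesthesia", "gynecology"]),
        "Insurance": pd.Categorical(["Yes", "No", "Yes", "Yes"]),
        "Stay (in days)": [8, 9, 34, 7],
    }
)

# Same column selection as in the notebook cell above
categorical_cols = toy.select_dtypes(include=["object", "category"]).columns.tolist()

# drop_first=True drops the first level of each feature (here "anesthesia" and "No"),
# because that level is implied when all remaining dummies are 0
encoded = pd.get_dummies(toy, columns=categorical_cols, drop_first=True, dtype=int)

print(encoded.columns.tolist())
```

A feature with k categories therefore contributes k - 1 dummy columns, which is why `Department_anesthesia`-style baseline levels for some features do not appear in the encoded data.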

In [11]:
# Check the data after handling categorical data
data

Unnamed: 0,Available Extra Rooms in Hospital,staff_available,Visitors with Patient,Admission_Deposit,Stay (in days),Department_anesthesia,Department_gynecology,Department_radiotherapy,Department_surgery,Ward_Facility_Code_B,Ward_Facility_Code_C,Ward_Facility_Code_D,Ward_Facility_Code_E,Ward_Facility_Code_F,doctor_name_Dr John,doctor_name_Dr Mark,doctor_name_Dr Nathan,doctor_name_Dr Olivia,doctor_name_Dr Sam,doctor_name_Dr Sarah,doctor_name_Dr Simon,doctor_name_Dr Sophia,Age_11-20,Age_21-30,Age_31-40,Age_41-50,Age_51-60,Age_61-70,Age_71-80,Age_81-90,Age_91-100,gender_Male,gender_Other,Type of Admission_Trauma,Type of Admission_Urgent,Severity of Illness_Minor,Severity of Illness_Moderate,health_conditions_Diabetes,health_conditions_Heart disease,health_conditions_High Blood Pressure,health_conditions_None,health_conditions_Other,Insurance_Yes
0,4,0,4,2966.408696,8,0,1,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1
1,4,2,2,3554.835677,9,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,1,0,1,0,0,1,0,0,0,0
2,2,8,2,5624.733654,7,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,1
3,4,7,4,4814.149231,8,0,1,0,0,0,0,1,0,0,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0
4,2,10,2,5169.269637,34,1,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,1,0,0,1,1,0,0,0,0,0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
499995,4,2,3,4105.795901,10,0,1,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,1,0,1,0,0,0,1,0,0,0
499996,13,8,2,4631.550257,11,0,1,0,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,1,0
499997,2,3,2,5456.930075,8,0,1,0,0,1,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,1,0,0,0
499998,2,1,2,4694.127772,23,0,0,1,0,0,0,0,0,0,1,0,0,0,0,0,0,0,0,0,0,0,0,1,0,0,0,0,0,1,0,0,0,1,0,0,0,0,0


In [12]:
# Separating independent variables and the target variable
x = data.drop('Stay (in days)', axis = 1)

y = data['Stay (in days)']

In [13]:
# Splitting the dataset into train and test datasets
x_train, x_test, y_train, y_test = train_test_split(x, y, test_size = 0.2, shuffle = True, random_state = 1)

In [14]:
# Checking the shape of the train and test data
print("Shape of Training set : ", x_train.shape)
print("Shape of test set : ", x_test.shape)

Shape of Training set :  (400000, 42)
Shape of test set :  (100000, 42)


## **Serialization**

**Serialization is the process of converting an in-memory object (e.g., a Python object or a trained model) into a format that can be stored or transmitted. Deserialization recreates the object from that format when it is needed. This section explores two serialization tools used in Python: `pickle` and `joblib`.**
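As a minimal illustration of this round trip, the sketch below serializes a small dictionary to bytes and rebuilds it in memory. The `record` dictionary is a hypothetical example and is not part of the dataset.

```python
import pickle

# Hypothetical record, used only to illustrate the round trip
record = {"patient_id": 101, "stay_days": 8, "departments": ["gynecology", "surgery"]}

payload = pickle.dumps(record)    # serialization: object -> bytes
restored = pickle.loads(payload)  # deserialization: bytes -> new object

print(type(payload).__name__)  # the serialized form is a bytes object
print(restored == record)      # the restored object has the same value
print(restored is record)      # but it is a separate object in memory
```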

## **Pickle**

Pickle stores Python objects in a format that can be easily retrieved and reused. This is useful for preserving program state, such as saving a trained machine learning model so that it can later make predictions on new data.

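**Advantages of Using Pickle:**

- It is easy to use: `pickle.dump()` and `pickle.load()` cover most cases.
- It handles most Python objects, including instances of custom classes. Functions and classes are stored by reference (their qualified name), so their definitions must be importable when the object is loaded.
- The serialized bytes can be compressed (for example, with `gzip`) to reduce their size and speed up transmission. Pickle does not compress data on its own.
- The deserialized object has the same type and value as the original, provided the same class definitions and compatible library versions are available at load time.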
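As a sketch of how pickled data can be combined with compression, the example below compresses the pickled bytes with the standard-library `gzip` module and then restores the object. The `records` list is hypothetical, and pairing pickle with `gzip` is one possible choice rather than something the notebook itself does.

```python
import gzip
import pickle

# Hypothetical, repetitive data; repetition is what makes compression effective
records = [{"department": "gynecology", "stay_days": i % 30} for i in range(10_000)]

raw = pickle.dumps(records)    # plain pickle bytes
compressed = gzip.compress(raw)  # the same bytes, gzip-compressed

# Restoring reverses the two steps: decompress first, then unpickle
restored = pickle.loads(gzip.decompress(compressed))

print(f"pickle only: {len(raw)} bytes, pickle + gzip: {len(compressed)} bytes")
print(restored == records)
```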

### **Importing the library**

In [15]:
import pickle

In [16]:
# Create a model with desired hyperparameters
model = RandomForestRegressor(n_estimators=120, max_depth=None, max_features=0.8, random_state=1)

In [17]:
model.fit(x_train, y_train)

### **Saving the trained model using Pickle**

In [None]:
with open('model.pkl', 'wb') as file:
    pickle.dump(model, file)

**The above code does two things:**

- **with open('model.pkl', 'wb') as file:** opens (or creates) a file named **model.pkl** in **write binary** mode. The `with` statement ensures that the file is closed properly once the block finishes, even if an error occurs.

- **pickle.dump(model, file)** serializes the model object and writes the resulting bytes to the file object.

### **Loading the trained model using Pickle**

In [None]:
with open('model.pkl', 'rb') as file:
    loaded_model_pkl = pickle.load(file)

- This code uses the **open()** function to open the **model.pkl** file in **read binary mode ('rb')** and assigns the file object to the variable **file**.

- The **pickle.load()** function then reads the file and **reconstructs the saved model**, which is assigned to the **loaded_model_pkl** variable.

- To summarize, this code **loads the saved Random Forest regression model** from **model.pkl** using the *pickle module*. The loaded model can be used to make predictions on new data. Only load pickle files from sources you trust, because unpickling can execute arbitrary code.
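A quick sanity check after loading is to confirm that the restored model produces the same predictions as the original. The sketch below does this on a small synthetic dataset so that it runs on its own; `X_demo`, `y_demo`, `demo_model`, and the temporary file are assumptions standing in for `x_train`, `y_train`, `model`, and `model.pkl`.

```python
import os
import pickle
import tempfile

import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic stand-in for x_train / y_train (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=200, n_features=5, random_state=1)
demo_model = RandomForestRegressor(n_estimators=10, random_state=1).fit(X_demo, y_demo)

# Save and reload through a temporary file so nothing is left on disk
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.pkl")
    with open(path, "wb") as file:
        pickle.dump(demo_model, file)
    with open(path, "rb") as file:
        restored_model = pickle.load(file)

# The restored model should reproduce the original predictions exactly
same_predictions = np.array_equal(demo_model.predict(X_demo), restored_model.predict(X_demo))
print(same_predictions)
```

In the notebook itself, the equivalent check would compare `model.predict(x_test)` with `loaded_model_pkl.predict(x_test)`.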

## **Joblib**

**Joblib:** Joblib is a Python library that provides lightweight pipelining and parallel-computing utilities. For persistence, it focuses on efficiently serializing objects that contain large NumPy arrays, such as scikit-learn models.

**Advantages of Using Joblib:**

- **Efficiency:** Optimized for handling large NumPy arrays, which makes it well suited to persisting large models and datasets.
- **Parallel Processing:** Provides helpers for multi-core execution, which scikit-learn uses internally (for example, through the `n_jobs` parameter).
- **Seamless Integration with scikit-learn:** Works directly with scikit-learn estimators, so a fitted model can be saved and restored with a single call each.

**Disadvantages of Using Joblib:**

- It is a third-party package and less widely used than pickle, which ships with the Python standard library, so fewer resources and less support may be available.
- It offers little benefit over pickle for objects that do not contain large NumPy arrays.

### **Importing the library**

In [None]:
import joblib

### **Saving the trained model using Joblib**

In [None]:
joblib.dump(model, 'model.joblib')

- This code uses the **joblib library** to save the trained machine learning model to disk in the file **model.joblib**.

- The **joblib.dump()** function takes two main arguments: the **object to save** (here, the trained model `model`) and the **filename ('model.joblib')** where it will be stored.

- The advantage of joblib over pickle is that it is optimized for large NumPy arrays, which are common in machine learning. As a result, joblib is often faster and more efficient than pickle for saving and loading models that hold large arrays, although the difference depends on the model and its size.
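`joblib.dump()` also accepts an optional `compress` argument that trades some speed for a smaller file. The sketch below saves the same model with and without compression and reports both file sizes. The synthetic data, the `demo_model` name, the temporary paths, and the compression level of 3 are assumptions for illustration; the actual savings depend on the model.

```python
import os
import tempfile

import joblib
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor

# Small synthetic stand-in for the notebook's training data (an assumption)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, random_state=1)
demo_model = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_demo, y_demo)

with tempfile.TemporaryDirectory() as tmp_dir:
    plain_path = os.path.join(tmp_dir, "demo_model.joblib")
    compressed_path = os.path.join(tmp_dir, "demo_model_compressed.joblib")

    joblib.dump(demo_model, plain_path)                   # default: no compression
    joblib.dump(demo_model, compressed_path, compress=3)  # integer level from 0 to 9

    plain_size = os.path.getsize(plain_path)
    compressed_size = os.path.getsize(compressed_path)

print(f"uncompressed: {plain_size} bytes, compressed: {compressed_size} bytes")
```

Higher compression levels generally produce smaller files at the cost of slower saving and loading, so a moderate level is a common starting point.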

### **Loading the trained model using Joblib**

In [None]:
loaded_model_joblib = joblib.load('model.joblib')

- **joblib.load():** This function loads an object that was saved with **joblib.dump()**. The name of the file to load is passed as its argument.
- **'model.joblib':** This is the name of the file containing the saved model. Because it is a relative path, the file must be in the current working directory, which is usually the directory of the notebook or script that loads the model.
- **loaded_model_joblib:** This is the variable that holds the loaded model. Once the model is loaded, it can be used to make predictions on new data.
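Once a model has been reloaded, it can be evaluated like any fitted estimator. The following sketch reloads a model with joblib and scores it on held-out data using metrics already imported in this notebook. The synthetic data and the names `demo_model` and `restored_model` are assumptions so that the example runs on its own; in the notebook, `loaded_model_joblib`, `x_test`, and `y_test` would take their place.

```python
import os
import tempfile

import joblib
import numpy as np
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the hospital data (an assumption for this sketch)
X_demo, y_demo = make_regression(n_samples=500, n_features=8, noise=10.0, random_state=1)
X_tr, X_te, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2, random_state=1)

demo_model = RandomForestRegressor(n_estimators=20, random_state=1).fit(X_tr, y_tr)

# Save and reload through a temporary file
with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "demo_model.joblib")
    joblib.dump(demo_model, path)
    restored_model = joblib.load(path)

# Evaluate the reloaded model on the held-out split
predictions = restored_model.predict(X_te)
rmse = np.sqrt(mean_squared_error(y_te, predictions))
r2 = r2_score(y_te, predictions)
print(f"RMSE: {rmse:.3f}, R-squared: {r2:.3f}")

# The reloaded model should agree with the model that was saved
matches_original = np.allclose(predictions, demo_model.predict(X_te))
print(matches_original)
```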