## LOS (Length of Stay) Analysis Using Linear Regression and Random Forest ##

This notebook trains a model that predicts a patient's length of stay (LOS),
that is, how long the patient's bed will remain occupied.

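The model takes the following inputs:
- Age (Integer): patient age in years. Older patients usually stay longer.
- Gender (Integer): encoded as 0 or 1.
- Complaint_Code (Integer): label-encoded complaint. For example, "Trauma" patients usually stay longer than "Flu" patients.
- HR (Integer): heart rate.
- BP (Integer): blood pressure.
- Temp (Float): body temperature.
- SpO2 (Integer): blood oxygen saturation, in percent.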

It produces the following output:
- LOS (Float): the predicted length of stay in days.
    - Example: 4.2 (roughly 4 days).
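
As a rough illustration of the input described above, a single patient can be represented as a one-row DataFrame whose columns follow the feature order used for training. The name `FEATURES` and the patient values are made up for this sketch.

```python
import pandas as pd

# Feature names in the order the notebook passes them to the model.
FEATURES = ["Age", "Gender", "HR", "BP", "Temp", "SpO2", "Complaint_Code"]

# Hypothetical patient record; every value here is illustrative only.
patient = pd.DataFrame(
    [{"Age": 60, "Gender": 1, "HR": 95, "BP": 130,
      "Temp": 37.2, "SpO2": 96, "Complaint_Code": 0}],
    columns=FEATURES,
)

print(patient.shape)
```

A frame shaped like this is what a fitted regressor's `predict` would receive. The model returns one float per row.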

In [13]:
# IMPORTS
import pandas as pd
from sklearn.preprocessing import LabelEncoder
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.ensemble import RandomForestRegressor
import joblib

In [14]:
# DATA

data = pd.read_csv('../data/raw/patients.csv')
data.head()

Unnamed: 0,ID,Age,Gender,HR,BP,Temp,SpO2,Complaint,Urgency,LOS
0,0,42,0,118,108,36.8,99,Trauma,Medium,5.5
1,1,31,1,113,199,36.9,100,Chest Pain,Critical,9.4
2,2,41,1,124,156,36.7,99,Chest Pain,Medium,7.0
3,3,53,0,73,128,36.7,97,General Checkup,Low,2.2
4,4,70,1,104,125,36.6,83,Difficulty Breathing,Critical,10.2


In [15]:
# Encode the Complaint column as numeric labels

label = LabelEncoder()

data['Complaint_Code'] = label.fit_transform(data['Complaint'])

data.head()

Unnamed: 0,ID,Age,Gender,HR,BP,Temp,SpO2,Complaint,Urgency,LOS,Complaint_Code
0,0,42,0,118,108,36.8,99,Trauma,Medium,5.5,4
1,1,31,1,113,199,36.9,100,Chest Pain,Critical,9.4,0
2,2,41,1,124,156,36.7,99,Chest Pain,Medium,7.0,0
3,3,53,0,73,128,36.7,97,General Checkup,Low,2.2,3
4,4,70,1,104,125,36.6,83,Difficulty Breathing,Critical,10.2,1
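
The codes in the output above are consistent with `LabelEncoder` sorting the labels and numbering them in that order. The sketch below reproduces that on a small list. "Flu" is borrowed from the introduction and is only assumed to be a category in the real dataset.

```python
from sklearn.preprocessing import LabelEncoder

# Labels from the sample rows, plus "Flu" as an assumed extra category.
complaints = ["Trauma", "Chest Pain", "Chest Pain", "General Checkup",
              "Difficulty Breathing", "Flu"]

encoder = LabelEncoder()
codes = encoder.fit_transform(complaints)

# classes_ is sorted, so each label's code is its index in that array.
mapping = {str(label): code for code, label in enumerate(encoder.classes_)}
print(mapping)
```

Because the codes follow alphabetical order, their numeric ordering carries no clinical meaning. A linear model will still treat them as ordered, so one-hot encoding may be worth comparing against.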


In [16]:
# Split the data into training and test sets

features = ["Age", "Gender", "HR", "BP", "Temp", "SpO2", "Complaint_Code"]

X = data.loc[:, features]
y = data.loc[:, ["LOS"]]

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2)
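
The split above is random, so the metrics reported later will vary from run to run. A minimal sketch of making it repeatable with `random_state` follows. It uses synthetic arrays in place of the notebook's `X` and `y`, and the seed value 42 is an arbitrary choice.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the notebook's data: 10 rows, 2 features.
X = np.arange(20).reshape(10, 2)
y = np.arange(10, dtype=float)

# Splitting twice with the same seed should select the same rows.
split_a = train_test_split(X, y, test_size=0.2, random_state=42)
split_b = train_test_split(X, y, test_size=0.2, random_state=42)

# Index 1 of the returned list is the test portion of X.
same_test_rows = np.array_equal(split_a[1], split_b[1])
print(same_test_rows, split_a[1].shape)
```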

### Linear regression implementation

In [17]:
# Model training

lr_model = LinearRegression()

lr_model.fit(X_train, y_train)

In [18]:
print(f"Intercept: {lr_model.intercept_}")
print(f"Coefficients: {lr_model.coef_}")

Intercept: [-4.17996735]
Coefficients: [[ 0.05404311  0.11435362  0.1096988   0.0765352   0.34929597 -0.2944591
   1.06580692]]
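
The coefficients are printed in the same order as the `features` list, so they are easier to read when paired with the feature names. The sketch below copies the values from the output above. They come from one random split and will differ on a rerun.

```python
# Feature order used when fitting the linear model.
features = ["Age", "Gender", "HR", "BP", "Temp", "SpO2", "Complaint_Code"]

# Coefficients copied from the printed output of this run.
coefficients = [0.05404311, 0.11435362, 0.1096988, 0.0765352,
                0.34929597, -0.2944591, 1.06580692]

by_feature = dict(zip(features, coefficients))
for name, value in by_feature.items():
    # Each value is the change in predicted LOS (days) per unit of the feature.
    print(f"{name:>15}: {value:+.4f}")
```

In the live notebook the same pairing could be built with `dict(zip(features, lr_model.coef_.ravel()))`. Read this way, each additional year of age adds about 0.05 days to the prediction, while higher SpO2 lowers it.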


In [19]:
lr_pred = lr_model.predict(X_test)

In [20]:
# Evaluating the model

mse = mean_squared_error(y_test, lr_pred)
r2 = r2_score(y_test, lr_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 6.136285685793714
R-squared: 0.562327626369572


### Random forest regressor implementation

In [21]:
rfr_model = RandomForestRegressor()

# y_train is a single-column DataFrame; flatten it to the 1-D target the regressor expects
rfr_model.fit(X_train, y_train.values.ravel())

In [22]:
rfr_pred = rfr_model.predict(X_test)

In [23]:
# Evaluating the model

mse = mean_squared_error(y_test, rfr_pred)
r2 = r2_score(y_test, rfr_pred)
print(f"Mean Squared Error: {mse}")
print(f"R-squared: {r2}")

Mean Squared Error: 2.0338421935
R-squared: 0.8549356099114342
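
MSE is expressed in squared days, which is hard to relate to a bed schedule. Taking the square root gives the RMSE in days. The sketch below does this for the two MSE values printed in this notebook.

```python
import math

# MSE values copied from the two evaluation outputs of this run.
mse_by_model = {
    "linear_regression": 6.136285685793714,
    "random_forest": 2.0338421935,
}

# RMSE is the square root of MSE and has the same unit as LOS (days).
rmse_by_model = {name: math.sqrt(mse) for name, mse in mse_by_model.items()}

for name, rmse in rmse_by_model.items():
    print(f"{name}: RMSE = {rmse:.2f} days")
```

On this split, the linear model is off by roughly 2.5 days in RMSE terms and the random forest by roughly 1.4 days.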


In [24]:
# EXPORTING MODEL TO SRC

joblib.dump(rfr_model, '../src/models/los.pkl')

['../src/models/los.pkl']
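
A saved model can be read back with `joblib.load`. The sketch below shows the round trip under some assumptions: it trains a small forest on synthetic data and writes to a temporary directory instead of `../src/models/los.pkl`, so none of its numbers relate to the real model.

```python
import os
import tempfile

import joblib
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestRegressor

features = ["Age", "Gender", "HR", "BP", "Temp", "SpO2", "Complaint_Code"]

# Synthetic data standing in for the patient dataset.
rng = np.random.default_rng(0)
X = pd.DataFrame(rng.uniform(0, 1, size=(50, len(features))), columns=features)
y = rng.uniform(1, 10, size=50)

model = RandomForestRegressor(n_estimators=10, random_state=0).fit(X, y)

# Dump to a temporary file, then load it back.
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "los.pkl")
    joblib.dump(model, path)
    restored = joblib.load(path)

# The restored model should predict exactly what the original does.
new_patient = X.iloc[[0]]
round_trip_ok = np.allclose(model.predict(new_patient),
                            restored.predict(new_patient))
print(round_trip_ok)
```

Pickle-based files should only be loaded from trusted sources, and ideally with the same scikit-learn version that wrote them.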