## Enhancing Hospital Efficiency with ML: Data Cleaning, XGBoost, and Predictive Modeling

In the ever-evolving landscape of healthcare, improving patient care outcomes and raising the quality of healthcare services is an ongoing mission. Healthcare organizations are navigating a complex web of challenges, but within these challenges lies a remarkable opportunity: the power of data.

Healthcare analytics is the compass that guides organizations towards this opportunity. It is the practice of dissecting and interpreting data, using both quantitative and qualitative techniques to uncover insights and patterns in healthcare information. Among the many metrics used for performance evaluation, one vital indicator stands out: the Length of Stay (LOS) for patients.

Predicting a patient's Length of Stay opens up a range of possibilities. It empowers hospitals to optimize their treatment plans with precision, which can reduce LOS and, with it, the risk of infection among patients, staff, and visitors. In essence, it is a pathway to improving both patient care and healthcare management as a whole.

In this journey towards better healthcare, you play a pivotal role. You are the data virtuoso, armed with the latest tools and techniques in healthcare analytics. Your mission is to transform raw data into meaningful insights that illuminate the intricate web of patient care. Through the lens of data analysis, you decode the mysteries of patient LOS, revealing trends and patterns that hold the key to more efficient and effective healthcare delivery.

Collaborating closely with the healthcare team, you craft compelling data visualizations that bring these insights to life. Your data-driven creations become the guiding stars, steering healthcare professionals towards better decision-making, enhanced care, and safer environments. While the intricacies of your work may often go unnoticed, its impact reverberates throughout the healthcare organization.

In the world of healthcare analytics, you are the unsung hero, the one who helps unveil the extraordinary stories of improved patient care and streamlined healthcare management. Your dedication to data and your ability to transform it into illuminating insights contribute to the ongoing saga of healthcare excellence, making every patient's journey towards better health that much more extraordinary.

#### Step 1: Analyzing the train data

The task is to load the file named train.csv from the root directory and assign it to the variable 'train'.

In [None]:
import pandas as pd
train = pd.read_csv('train.csv')

train

#### Step 2: Decoding the test data

The task is to load the file named test.csv from the root directory and assign it to the variable 'test'.

In [None]:
test = pd.read_csv("./test.csv")
test

#### Step 3: Navigating the Missing Values

Calculate the count of null values in each column of the 'train' DataFrame and store the results in a Pandas Series assigned to the variable 'null_values_train'.


In [None]:
null_values_train = train.isnull().sum()

null_values_train

#### Step 4: Conquering the Enigma of Missing Values

Calculate the count of null values in each column of the 'test' DataFrame and store the results in a Pandas Series assigned to the variable 'null_values_test'.

In [None]:
null_values_test = test.isnull().sum()

null_values_test
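The null counts from Steps 3 and 4 are easier to judge next to the share of rows they represent. The sketch below uses a small made-up frame (not the real train.csv) and shows the same `isnull().sum()` call alongside a percentage.

```python
import numpy as np
import pandas as pd

# Toy frame standing in for train.csv; the values are made up.
toy = pd.DataFrame({
    "Bed Grade": [2.0, np.nan, 3.0, 2.0],
    "City_Code_Patient": [7.0, 7.0, np.nan, np.nan],
    "Age": ["31-40", "41-50", "31-40", "51-60"],
})

null_counts = toy.isnull().sum()          # missing cells per column
null_share = toy.isnull().mean() * 100    # the same information as a percentage

print(pd.DataFrame({"missing": null_counts, "percent": null_share}))
```

A column with only a small share of missing values is usually a reasonable candidate for simple imputation, which is the approach taken in the next steps.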

#### Step 5: Data Healing

Replace the missing values in the 'Bed Grade' and 'City_Code_Patient' columns of the 'train' DataFrame with their respective modes.

In [None]:
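# Assign the filled column back; inplace=True on a column selection may not update the DataFrame in recent pandas versions
train['Bed Grade'] = train['Bed Grade'].fillna(train['Bed Grade'].mode()[0])
train['City_Code_Patient'] = train['City_Code_Patient'].fillna(train['City_Code_Patient'].mode()[0])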

train

#### Step 6: Data Completeness

Replace the missing values in the 'Bed Grade' and 'City_Code_Patient' columns of the 'test' DataFrame with their respective modes.

In [None]:
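# Assign the filled column back; inplace=True on a column selection may not update the DataFrame in recent pandas versions
test['Bed Grade'] = test['Bed Grade'].fillna(test['Bed Grade'].mode()[0])
test['City_Code_Patient'] = test['City_Code_Patient'].fillna(test['City_Code_Patient'].mode()[0])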

test
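To see what mode imputation does in isolation, here is a minimal sketch on a made-up column. It only illustrates the technique used in Steps 5 and 6; the values are invented.

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({"Bed Grade": [2.0, 2.0, np.nan, 3.0, np.nan]})

# mode() ignores NaN and returns a Series (ties are possible); [0] takes the first entry.
most_common = toy["Bed Grade"].mode()[0]

# Assigning the filled column back works regardless of pandas' copy-on-write settings.
toy["Bed Grade"] = toy["Bed Grade"].fillna(most_common)
print(toy["Bed Grade"].tolist())
```

A common alternative is to fill the test set with the mode computed on the training set, so both frames receive the same value; whether that makes a difference here depends on the data.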

#### Step 7: Transforming 'Stay' with LabelEncoder

We should encode the 'Stay' column in the 'train' DataFrame using LabelEncoder, replacing categorical values with numerical labels.

This transformation is essential to enable predictive modeling and machine learning algorithms to make sense of the data. 

In [None]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

train['Stay'] = le.fit_transform(train['Stay'].astype('str'))

train
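It helps to know which number each stay range receives. `LabelEncoder` sorts the distinct labels and uses each label's position as its code, so the mapping can be read back from `classes_`. The sketch below uses a few invented values and a hypothetical variable name, `stay_encoder`.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

stay = pd.Series(["21-30", "0-10", "More than 100 Days", "11-20", "0-10"])

stay_encoder = LabelEncoder()   # hypothetical name, kept separate from `le`
codes = stay_encoder.fit_transform(stay.astype("str"))

# classes_ holds the original labels in sorted order; the position is the code.
mapping = dict(enumerate(stay_encoder.classes_))
print(mapping)

# inverse_transform turns codes back into the original labels.
print(stay_encoder.inverse_transform(codes))
```

Keeping the fitted encoder for 'Stay' under its own name means the numeric predictions can later be decoded with `inverse_transform` instead of a hand-written dictionary.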

#### Step 8: Charting the Unknown

The task is to assign the value -1 to the 'Stay' column for all rows in the 'test' DataFrame.

The reason for this step is practical: -1 is a placeholder that will eventually be replaced by model-generated predictions. Because -1 never occurs among the encoded labels, it also distinguishes test rows from training rows once the two DataFrames are combined. This change marks the start of a new phase, in which the algorithm predicts the length of each patient's stay.

In [None]:
test['Stay'] = -1

test

#### Step 9: Data Convergence

The task is to create a new DataFrame named 'df' by concatenating the 'train' and 'test' DataFrames along their rows and resetting the index with continuous numbering.

In [None]:
df = pd.concat([train, test], ignore_index = True)

df

#### Step 10: Transforming Categories into Numbers

The task is to encode the 'Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', and 'Age' columns in the 'df' DataFrame using LabelEncoder, replacing categorical values with numerical labels.

This transformation turns the categorical variables into numerical representations, providing a common language for the algorithms to interpret.

In [None]:
for i in ['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age']:
    le = LabelEncoder()
    df[i] = le.fit_transform(df[i].astype('str'))

df
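The loop above reuses a single variable for every encoder, so only the last fitted one survives. If the mappings need to be inspected or reversed later, one option is to keep each encoder in a dictionary. This is a sketch on invented data; the `encoders` dictionary is a hypothetical addition, not part of the original notebook.

```python
import pandas as pd
from sklearn.preprocessing import LabelEncoder

toy = pd.DataFrame({
    "Department": ["gynecology", "radiotherapy", "gynecology"],
    "Severity of Illness": ["Minor", "Extreme", "Moderate"],
})

encoders = {}  # hypothetical dict: column name -> fitted LabelEncoder
for col in ["Department", "Severity of Illness"]:
    encoders[col] = LabelEncoder()
    toy[col] = encoders[col].fit_transform(toy[col].astype("str"))

print(toy)
print(list(encoders["Severity of Illness"].classes_))
```

Note that the codes follow alphabetical order, not clinical order: 'Extreme' sorts before 'Minor' and 'Moderate'. Tree-based models can usually cope with this, but it is worth keeping in mind when reading feature values.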

#### Step 11: Data Segmentation

The task is to create a new DataFrame named 'train' by filtering rows in the DataFrame 'df' where the value in the 'Stay' column is not equal to -1.

The "train" dataset now holds only the rows where the outcomes are known, and it is this dataset that will be the cornerstone for developing and validating predictive models.

In [None]:
train = df[df['Stay']!=-1]

train

#### Step 12: Preparation for Prediction

The task is to create a new DataFrame named 'test' by filtering rows in the DataFrame 'df' where the value in the 'Stay' column is equal to -1.

In [None]:
test = df[df['Stay']==-1]

test
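The placeholder, concatenation, and filtering steps form a round trip: mark the test rows, process both frames together, then separate them again. A minimal sketch with made-up frames (`train_toy` and `test_toy` are illustrative names):

```python
import pandas as pd

train_toy = pd.DataFrame({"case_id": [1, 2, 3], "Stay": [0, 2, 1]})
test_toy = pd.DataFrame({"case_id": [4, 5]})

test_toy["Stay"] = -1   # sentinel: outcome unknown
combined = pd.concat([train_toy, test_toy], ignore_index=True)

# ... shared preprocessing on `combined` would go here ...

# .copy() gives independent frames, avoiding chained-assignment warnings later.
train_back = combined[combined["Stay"] != -1].copy()
test_back = combined[combined["Stay"] == -1].copy()

print(len(train_back), len(test_back))
```

Processing both frames together guarantees that a category receives the same code in training and test data, which would not be guaranteed if each frame were encoded separately.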

#### Step 13: Feature Engineering for Enhanced Predictive Analysis

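The task is to add count-based features to both 'train' and 'test' (the number of cases per patient, per patient and hospital region, and per patient and ward facility), and then to create a new DataFrame named 'test1' by dropping the columns 'Stay', 'patientid', 'Hospital_region_code', and 'Ward_Facility_Code' from the 'test' DataFrame. The 'case_id' column stays in 'test1' because it is needed to label the final predictions.

With each transformation, we are enhancing the dataset and preparing it for the predictive modeling phase. The result is a dataset, "test1," that is ready for predictive analysis, with features engineered to capture how often each patient appears in the data.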

In [None]:
import numpy as np

def get_countid_enocde(train, test, cols, name):
    """Add a column `name` holding the number of cases per group of `cols`."""
    # Count cases per group, separately in each frame
    temp = train.groupby(cols)['case_id'].count().reset_index().rename(columns={'case_id': name})
    temp2 = test.groupby(cols)['case_id'].count().reset_index().rename(columns={'case_id': name})
    # Attach the counts to every row of the matching group
    train = pd.merge(train, temp, how='left', on=cols)
    test = pd.merge(test, temp2, how='left', on=cols)
    # Rows without a matching group receive the median count of their own frame
    train[name] = train[name].astype('float').fillna(np.median(temp[name]))
    test[name] = test[name].astype('float').fillna(np.median(temp2[name]))
    return train, test


train, test = get_countid_enocde(train, test, ['patientid'], name = 'count_id_patient')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Hospital_region_code'], name = 'count_id_patient_hospitalCode')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Ward_Facility_Code'], name = 'count_id_patient_wardfacilityCode')

test1 = test.drop(['Stay', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis =1)
test1
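The count features are a form of frequency encoding: each row receives the number of cases that share its key. The sketch below shows the same idea on invented data using `groupby(...).transform("count")`, which is an alternative to the group-then-merge approach used in the function above, not a copy of it.

```python
import pandas as pd

toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "patientid": [10, 10, 11, 10, 12],
    "Ward_Facility_Code": [0, 0, 1, 2, 1],
})

# Number of cases per patient, broadcast back to every row of that patient.
toy["count_id_patient"] = toy.groupby("patientid")["case_id"].transform("count")

# Number of cases per (patient, ward facility) pair.
toy["count_id_patient_wardfacilityCode"] = (
    toy.groupby(["patientid", "Ward_Facility_Code"])["case_id"].transform("count")
)
print(toy)
```

Because the counts are computed separately for the training and test frames, the same patient can receive different counts in each; this mirrors the behaviour of the function above.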

#### Step 14: Sculpting the Data

The task is to create a new DataFrame named 'train1' by dropping the columns 'case_id', 'patientid', 'Hospital_region_code', and 'Ward_Facility_Code' from the 'train' DataFrame.

By removing these variables, we aim to reduce noise and focus on the most relevant features for predicting patient stays. The "train1" dataset now stands as a lean, purpose-built input for the upcoming modeling stage, where machine learning algorithms will look for patterns within the data.

In [None]:
train1 = train.drop(['case_id', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis =1)

train1

#### Step 15: Data Splitting for Model Mastery

The task is to split the 'train1' data into training and testing sets. Create the feature variables X1 by excluding the 'Stay' column and set y1 as the target variable. Split the data into training and testing sets using an 80-20 split ratio, with a random seed of 100 for reproducibility.

This division of data is a fundamental step, essential for training machine learning models and assessing their performance. With the training and testing datasets established, we can move to the final phases of the project, where predictive modeling provides insights and guides decision-making.

In [None]:
from sklearn.model_selection import train_test_split

# Splitting the dataset
X1 = train1.drop('Stay', axis=1)
y1 = train1['Stay']

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size=0.20, random_state=100)

# Displaying a preview of the train and test datasets
print("Preview of X_train:")
display(X_train.head())

print("\nPreview of X_test:")
display(X_test.head())

print("\nPreview of y_train:")
display(y_train.head().to_frame())

print("\nPreview of y_test:")
display(y_test.head().to_frame())

# Summary of dataset splits
print("\nDataset Split Summary:")
print(f"X_train shape: {X_train.shape}, X_test shape: {X_test.shape}")
print(f"y_train shape: {y_train.shape}, y_test shape: {y_test.shape}")
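When the target classes are imbalanced, a plain random split can leave rare classes under-represented in the held-out set. The sketch below shows the optional `stratify` argument of `train_test_split` on a made-up target; it is a possible refinement, not something the split above uses.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Made-up, imbalanced toy target: 80 rows of class 0 and 20 rows of class 1.
X_toy = pd.DataFrame({"feature": np.arange(100)})
y_toy = pd.Series([0] * 80 + [1] * 20, name="Stay")

# stratify=y_toy keeps the class proportions the same in both parts.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_toy, y_toy, test_size=0.20, random_state=100, stratify=y_toy
)

print(X_tr.shape, X_te.shape)
print(y_te.value_counts().to_dict())
```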


#### Step 16: The XGBoost Model Story

The task is to create an XGBoost classifier instance named 'classifier_xgb' with the specified hyperparameters. Fit the classifier to the training data ('X_train' and 'y_train') and assign it to 'model_xgb'. Use the trained model to make predictions on the test data ('X_test') and store the predictions in 'prediction_xgb'. Calculate the accuracy of the XGBoost model's predictions and store it in 'acc_score_xgb'. Round the accuracy score to two decimal places.

In [None]:
import xgboost
from sklearn.metrics import accuracy_score


classifier_xgb = xgboost.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=800,
                                  objective='multi:softmax', reg_alpha=0.5, reg_lambda=1.5,
                                  booster='gbtree', n_jobs=4, min_child_weight=2, base_score= 0.75)


model_xgb = classifier_xgb.fit(X_train, y_train)

prediction_xgb = model_xgb.predict(X_test)
# scikit-learn metrics expect the true labels first, then the predictions
acc_score_xgb = accuracy_score(y_test, prediction_xgb)

acc_score_xgb = round(acc_score_xgb, 2)


print(f"\nXGBoost Model Accuracy: {acc_score_xgb * 100:.2f}%\n")
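An accuracy figure is easier to interpret next to a simple baseline and a per-class breakdown. The sketch below uses made-up labels and predictions (no XGBoost model involved) to show how `accuracy_score` compares with always predicting the most frequent class, and what `confusion_matrix` adds.

```python
import numpy as np
from sklearn.metrics import accuracy_score, confusion_matrix

y_true = np.array([0, 1, 2, 2, 1, 0, 2, 2])   # made-up true labels
y_pred = np.array([0, 2, 2, 2, 1, 0, 1, 2])   # made-up predictions

acc = accuracy_score(y_true, y_pred)          # share of exact matches

# Baseline: always predict the most frequent class in y_true.
majority = np.bincount(y_true).argmax()
baseline_acc = accuracy_score(y_true, np.full_like(y_true, majority))

print(round(acc, 2), round(baseline_acc, 2))
print(confusion_matrix(y_true, y_pred))       # rows: true class, columns: predicted class
```

With eleven stay categories of unequal size, a comparison like this helps show whether the model is learning more than the class frequencies.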

#### Step 17: The Final Predictive Model Transformation

The task is to use the trained 'classifier_xgb' model to make predictions on the 'test1' DataFrame, excluding the 'case_id' column. Store the predictions in 'pred_xgb'. Create a new DataFrame named 'result_xgb' to organize the prediction results, with the 'pred_xgb' values in the 'Stay' column and the 'case_id' column taken from 'test1'. Order the columns in 'result_xgb' so that 'case_id' comes first and 'Stay' second. Replace the numeric labels in the 'Stay' column of 'result_xgb' using the provided label_mapping.

In [None]:
label_mapping = {
    0: '0-10', 1: '11-20', 2: '21-30', 3: '31-40', 4: '41-50',
    5: '51-60', 6: '61-70', 7: '71-80', 8: '81-90', 9: '91-100',
    10: 'More than 100 Days'
}

# Predict on test data
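# Drop 'case_id' by name rather than by position, so the call does not depend on column order
pred_xgb = classifier_xgb.predict(test1.drop(columns='case_id'))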
result_xgb = pd.DataFrame({'case_id': test1['case_id'], 'Stay': pred_xgb})
result_xgb['Stay'] = result_xgb['Stay'].replace(label_mapping)

# Display sample predictions
print("\nSample Predictions:")
display(result_xgb.head())
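Decoding the numeric predictions is a plain dictionary lookup. The sketch below uses a shortened, illustrative mapping and invented predictions, and uses `Series.map` rather than `replace` so that any code missing from the mapping shows up as NaN instead of passing through silently.

```python
import numpy as np
import pandas as pd

label_mapping_toy = {0: "0-10", 1: "11-20", 2: "21-30"}   # shortened, illustrative mapping
pred_toy = np.array([2, 0, 1, 0])                          # made-up predictions

result_toy = pd.DataFrame({"case_id": [101, 102, 103, 104], "Stay": pred_toy})
result_toy["Stay"] = result_toy["Stay"].map(label_mapping_toy)

# map() turns unmapped codes into NaN, so a missing key is easy to detect.
assert result_toy["Stay"].notna().all()
print(result_toy)
```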

#### Step 18: Decoding Patient Stays

The task is to group the 'result_xgb' DataFrame by unique 'Stay' values and calculate the count of unique 'case_id' values in each group. Store the result in the variable 'result'.

In [None]:
result = result_xgb.groupby('Stay')['case_id'].nunique().reset_index()

# Rename columns for better readability
result.columns = ['Stay Duration', 'Number of Cases']

# Display the summarized result
print("\nSummary of Stay Duration Counts:")
display(result)
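The summary can also carry each category's share of all predictions, which makes skew towards a few stay ranges easier to spot. A sketch on invented results, with the 'Share (%)' column as a hypothetical addition:

```python
import pandas as pd

result_toy = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "Stay": ["11-20", "0-10", "11-20", "21-30", "11-20"],
})

# Count distinct cases per predicted stay range.
summary = (
    result_toy.groupby("Stay")["case_id"].nunique()
    .reset_index(name="Number of Cases")
    .rename(columns={"Stay": "Stay Duration"})
)

# Share of all predictions that fall into each range.
total = summary["Number of Cases"].sum()
summary["Share (%)"] = (summary["Number of Cases"] / total * 100).round(1)
print(summary)
```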