### Enhancing Hospital Efficiency with ML: Data Cleaning, XGBoost, and Predictive Modeling.

In the ever-evolving landscape of healthcare, the quest to enhance patient care outcomes and elevate the quality of healthcare services is an ongoing mission. Healthcare organizations are navigating a complex web of challenges, but within these challenges lies a remarkable opportunity: the power of data.

Healthcare analytics is the compass that guides organizations towards this opportunity. It is the practice of dissecting and interpreting data, using both quantitative and qualitative techniques to uncover the insights and patterns hidden within the wealth of healthcare information. Among the many metrics used for performance evaluation, one vital indicator stands out: the Length of Stay (LOS) of patients.

Predicting a patient's Length of Stay is akin to holding a key that unlocks a world of possibilities. It empowers hospitals to plan treatment and allocate resources with precision, which can shorten stays and, by limiting the time spent in hospital, lower the risk of infection among patients, staff, and visitors. In essence, it's a pathway to not just improving patient care but revolutionizing healthcare management as a whole.

In this journey towards better healthcare, you play a pivotal role. You are the data virtuoso, armed with the latest tools and techniques in healthcare analytics. Your mission is to transform raw data into meaningful insights that illuminate the intricate web of patient care. Through the lens of data analysis, you decode the mysteries of patient LOS, revealing trends and patterns that hold the key to more efficient and effective healthcare delivery.

Collaborating closely with the healthcare team, you craft compelling data visualizations that bring these insights to life. Your data-driven creations become the guiding stars, steering healthcare professionals towards better decision-making, enhanced care, and safer environments. While the intricacies of your work may often go unnoticed, its impact reverberates throughout the healthcare organization.

In the world of healthcare analytics, you are the unsung hero, the one who helps unveil the extraordinary stories of improved patient care and streamlined healthcare management. Your dedication to data and your ability to transform it into illuminating insights contribute to the ongoing saga of healthcare excellence, making every patient's journey towards better health that much more extraordinary.

#### Step 1: Analyzing the train data.

Load the train data.

In [1]:
import pandas as pd
#--- Read in dataset ----
train = pd.read_csv('train.csv')


#--- Inspect data ---
train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,19,a,7,Y,2,gynecology,S,C,2.0,104970,8.0,Emergency,Minor,2,11-20,4894.0,0-10
4996,4997,26,b,2,Y,2,gynecology,Q,D,4.0,104970,8.0,Trauma,Minor,2,11-20,6987.0,31-40
4997,4998,32,f,9,Y,3,gynecology,S,B,2.0,68447,5.0,Emergency,Moderate,4,41-50,4196.0,51-60
4998,4999,26,b,2,Y,3,gynecology,R,D,2.0,68447,5.0,Trauma,Moderate,3,41-50,4560.0,21-30


#### Step 2: Decoding the test data.
Load the test data.

In [2]:
test = pd.read_csv("./test.csv")

#--- Inspect data ---
test

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0
1996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0
1997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0
1998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0


#### Step 3: Navigating the Missing Values.
Calculate the count of null values in each column of the 'train' DataFrame and store the results in a Pandas Series assigned to the variable 'null_values_train'.

In [3]:
null_values_train = train.isnull().sum()

#--- Inspect data ---
null_values_train

case_id                               0
Hospital_code                         0
Hospital_type_code                    0
City_Code_Hospital                    0
Hospital_region_code                  0
Available Extra Rooms in Hospital     0
Department                            0
Ward_Type                             0
Ward_Facility_Code                    0
Bed Grade                             2
patientid                             0
City_Code_Patient                    42
Type of Admission                     0
Severity of Illness                   0
Visitors with Patient                 0
Age                                   0
Admission_Deposit                     0
Stay                                  0
dtype: int64

#### Step 4: Conquering the Enigma of Missing Values.
Counting Null Values in test data.

Calculate the count of null values in each column of the 'test' DataFrame and store the results in a Pandas Series assigned to the variable 'null_values_test'.

In [4]:
null_values_test = test.isnull().sum()

#--- Inspect data ---
null_values_test

case_id                               0
Hospital_code                         0
Hospital_type_code                    0
City_Code_Hospital                    0
Hospital_region_code                  0
Available Extra Rooms in Hospital     0
Department                            0
Ward_Type                             0
Ward_Facility_Code                    0
Bed Grade                             2
patientid                             0
City_Code_Patient                    21
Type of Admission                     0
Severity of Illness                   0
Visitors with Patient                 0
Age                                   0
Admission_Deposit                     0
dtype: int64
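
The two null counts can also be placed side by side, which makes it easier to see that the same columns are affected in both files. The following is a minimal sketch on made-up frames; `train_demo`, `test_demo`, and `null_summary` are illustrative names and the values are not taken from the hospital data.

```python
import numpy as np
import pandas as pd

# Toy stand-ins for the real train/test frames (hypothetical values).
train_demo = pd.DataFrame({
    "Bed Grade": [2.0, np.nan, 3.0, 2.0],
    "City_Code_Patient": [7.0, 8.0, np.nan, np.nan],
    "Age": ["51-60", "11-20", "41-50", "51-60"],
})
test_demo = pd.DataFrame({
    "Bed Grade": [2.0, 4.0, np.nan],
    "City_Code_Patient": [2.0, np.nan, 2.0],
    "Age": ["71-80", "71-80", "51-60"],
})

# One row per column, one null count per dataset.
null_summary = pd.DataFrame({
    "train_nulls": train_demo.isnull().sum(),
    "test_nulls": test_demo.isnull().sum(),
})

# Keep only the columns with at least one missing value in either frame.
null_summary = null_summary[null_summary.sum(axis=1) > 0]
print(null_summary)
```

Applied to the real `train` and `test` frames, the same pattern would list only 'Bed Grade' and 'City_Code_Patient', matching the two outputs above.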

#### Step 5: Data Healing
Handling Missing Values in train Data.

Replace the missing values in the 'Bed Grade' and 'City_Code_Patient' columns of the 'train' DataFrame with their respective modes.

In [5]:
train['Bed Grade'] = train['Bed Grade'].fillna(train['Bed Grade'].mode()[0])

train['City_Code_Patient'] = train['City_Code_Patient'].fillna(train['City_Code_Patient'].mode()[0])

#--- Inspect data ---
train


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0-10
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,41-50
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,31-40
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,41-50
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,41-50
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,19,a,7,Y,2,gynecology,S,C,2.0,104970,8.0,Emergency,Minor,2,11-20,4894.0,0-10
4996,4997,26,b,2,Y,2,gynecology,Q,D,4.0,104970,8.0,Trauma,Minor,2,11-20,6987.0,31-40
4997,4998,32,f,9,Y,3,gynecology,S,B,2.0,68447,5.0,Emergency,Moderate,4,41-50,4196.0,51-60
4998,4999,26,b,2,Y,3,gynecology,R,D,2.0,68447,5.0,Trauma,Moderate,3,41-50,4560.0,21-30


#### Step 6: Data Completeness
Handling Missing Values in test Data.

Replace the missing values in the 'Bed Grade' and 'City_Code_Patient' columns of the 'test' DataFrame with their respective modes.

In [6]:
test['Bed Grade'] = test['Bed Grade'].fillna(test['Bed Grade'].mode()[0])

test['City_Code_Patient'] = test['City_Code_Patient'].fillna(test['City_Code_Patient'].mode()[0])

#--- Inspect data ---
test


Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0
1996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0
1997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0
1998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0
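
Steps 5 and 6 fill each DataFrame with its own mode. A common alternative is to learn the fill value from the training data only and reuse it on the test data, so that nothing computed on the test set feeds into preprocessing. The sketch below illustrates that idea on toy frames; the helper `fill_with_train_mode` and the demo values are hypothetical and not part of the original notebook.

```python
import numpy as np
import pandas as pd

def fill_with_train_mode(train_df, test_df, cols):
    """Fill NaNs in `cols` of both frames with the mode computed on train_df only."""
    train_out, test_out = train_df.copy(), test_df.copy()
    for col in cols:
        fill_value = train_out[col].mode()[0]  # most frequent training value
        # Assign the result back instead of using inplace=True on a column.
        train_out[col] = train_out[col].fillna(fill_value)
        test_out[col] = test_out[col].fillna(fill_value)
    return train_out, test_out

# Toy data (hypothetical values).
train_demo = pd.DataFrame({"Bed Grade": [2.0, 2.0, np.nan, 3.0]})
test_demo = pd.DataFrame({"Bed Grade": [4.0, np.nan, 4.0]})

train_filled, test_filled = fill_with_train_mode(train_demo, test_demo, ["Bed Grade"])
print(test_filled)
```

Here the missing test value is filled with 2.0 (the training mode) rather than 4.0 (the test mode). Whether this changes anything on the real data depends on whether the two modes differ.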


#### Step 7: Transforming 'Stay' with LabelEncoder
Encoding 'Stay' Column in train Data.

You should encode the 'Stay' column in the 'train' DataFrame using LabelEncoder, replacing categorical values with numerical labels.

In [7]:
from sklearn.preprocessing import LabelEncoder
le = LabelEncoder()

train['Stay'] = le.fit_transform(train['Stay'].astype('str'))
#--- Inspect data ---
train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,4
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,3
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,4
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,19,a,7,Y,2,gynecology,S,C,2.0,104970,8.0,Emergency,Minor,2,11-20,4894.0,0
4996,4997,26,b,2,Y,2,gynecology,Q,D,4.0,104970,8.0,Trauma,Minor,2,11-20,6987.0,3
4997,4998,32,f,9,Y,3,gynecology,S,B,2.0,68447,5.0,Emergency,Moderate,4,41-50,4196.0,5
4998,4999,26,b,2,Y,3,gynecology,R,D,2.0,68447,5.0,Trauma,Moderate,3,41-50,4560.0,2
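
LabelEncoder assigns integer codes according to the sorted order of the distinct values it sees, and it records that order in its `classes_` attribute. The short sketch below shows this on a handful of made-up 'Stay' values; `stay_demo` and `stay_encoder` are illustrative names.

```python
from sklearn.preprocessing import LabelEncoder

# A few example labels (not the full set found in train.csv).
stay_demo = ["0-10", "41-50", "31-40", "41-50", "More than 100 Days", "11-20"]

stay_encoder = LabelEncoder()
codes = stay_encoder.fit_transform(stay_demo)

# classes_ is sorted; the position of a label in it is its integer code.
code_to_label = dict(enumerate(stay_encoder.classes_))
print(codes)
print(code_to_label)

# inverse_transform recovers the original strings from the codes.
decoded = stay_encoder.inverse_transform(codes)
print(list(decoded))
```

Note that the codes depend on which labels are present: in this toy list '41-50' becomes 3 because '21-30' is absent, whereas in the output above it is 4. Keeping the fitted encoder for the target in its own variable makes it possible to decode predictions later with `inverse_transform`.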


#### Step 8: Charting the Unknown
Setting a Default Value.

The 'Stay' column, the target of the project, is missing from the test data. The task is to assign the placeholder value -1 to the 'Stay' column for all rows in the 'test' DataFrame.

In [8]:
test['Stay'] = -1

#--- Inspect data ---
test

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,318439,21,c,3,Z,3,gynecology,S,A,2.0,17006,2.0,Emergency,Moderate,2,71-80,3095.0,-1
1,318440,29,a,4,X,2,gynecology,S,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4018.0,-1
2,318441,26,b,2,Y,3,gynecology,Q,D,4.0,17006,2.0,Emergency,Moderate,3,71-80,4492.0,-1
3,318442,6,a,6,X,3,gynecology,Q,F,2.0,17006,2.0,Trauma,Moderate,3,71-80,4173.0,-1
4,318443,28,b,11,X,2,gynecology,R,F,2.0,17006,2.0,Trauma,Moderate,4,71-80,4161.0,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0,-1
1996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0,-1
1997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0,-1
1998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0,-1


#### Step 9: Data Convergence

Merging Train and Test Data.

The task is to create a new DataFrame named 'df' by concatenating the 'train' and 'test' DataFrames along their rows and resetting the index with continuous numbering.

In [9]:
df = pd.concat([train, test], ignore_index = True)

#--- Inspect data ---
df

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,c,3,Z,3,radiotherapy,R,F,2.0,31397,7.0,Emergency,Extreme,2,51-60,4911.0,0
1,2,2,c,5,Z,2,radiotherapy,S,F,2.0,31397,7.0,Trauma,Extreme,2,51-60,5954.0,4
2,3,10,e,1,X,2,anesthesia,S,E,2.0,31397,7.0,Trauma,Extreme,2,51-60,4745.0,3
3,4,26,b,2,Y,2,radiotherapy,R,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,7272.0,4
4,5,26,b,2,Y,2,radiotherapy,S,D,2.0,31397,7.0,Trauma,Extreme,2,51-60,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,320434,26,b,2,Y,1,anesthesia,R,D,2.0,19303,8.0,Trauma,Extreme,4,51-60,8829.0,-1
6996,320435,26,b,2,Y,4,gynecology,R,D,1.0,19303,8.0,Trauma,Extreme,6,51-60,3507.0,-1
6997,320436,23,a,6,X,3,gynecology,Q,F,1.0,19303,8.0,Trauma,Extreme,3,51-60,4109.0,-1
6998,320437,25,e,1,X,4,gynecology,Q,E,3.0,19303,8.0,Emergency,Extreme,4,51-60,4155.0,-1
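
The -1 placeholder from Step 8 works because no encoded 'Stay' label is negative. An alternative that does not rely on a sentinel value is to tag each row with its origin before concatenating. The following sketch shows that variant on toy frames; the `source` column and the demo frames are assumptions for illustration and do not appear in the original notebook.

```python
import pandas as pd

# Toy frames (hypothetical values); the test frame has no 'Stay' column.
train_demo = pd.DataFrame({"case_id": [1, 2, 3], "Stay": [0, 4, 3]})
test_demo = pd.DataFrame({"case_id": [10, 11]})

# Tag each row with its origin instead of writing a sentinel into the target.
combined = pd.concat(
    [train_demo.assign(source="train"), test_demo.assign(source="test")],
    ignore_index=True,
)

# Split back using the tag. 'Stay' is NaN for test rows, so it is dropped there.
train_part = combined[combined["source"] == "train"].drop(columns="source")
test_part = combined[combined["source"] == "test"].drop(columns=["source", "Stay"])
print(combined)
```

One side effect to be aware of: because the test rows have no 'Stay' value, pandas stores the combined 'Stay' column as floats with NaN, so the training labels may need to be cast back to integers after splitting.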


#### Step 10: Transforming Categories into Numbers
Categorical Data Label Encoding.

The task is to encode the 'Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', and 'Age' columns in the 'df' DataFrame using LabelEncoder, replacing categorical values with numerical labels.

In [10]:
for i in ['Hospital_type_code', 'Hospital_region_code', 'Department', 'Ward_Type', 'Ward_Facility_Code', 'Type of Admission', 'Severity of Illness', 'Age']:
    le = LabelEncoder()
    df[i] = le.fit_transform(df[i].astype('str'))

#--- Inspect data ---
df

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,2,3,2,3,3,2,5,2.0,31397,7.0,0,0,2,5,4911.0,0
1,2,2,2,5,2,2,3,3,5,2.0,31397,7.0,1,0,2,5,5954.0,4
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,0,2,5,4745.0,3
3,4,26,1,2,1,2,3,2,3,2.0,31397,7.0,1,0,2,5,7272.0,4
4,5,26,1,2,1,2,3,3,3,2.0,31397,7.0,1,0,2,5,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,320434,26,1,2,1,1,1,2,3,2.0,19303,8.0,1,0,4,5,8829.0,-1
6996,320435,26,1,2,1,4,2,2,3,1.0,19303,8.0,1,0,6,5,3507.0,-1
6997,320436,23,0,6,0,3,2,1,5,1.0,19303,8.0,1,0,3,5,4109.0,-1
6998,320437,25,4,1,0,4,2,1,4,3.0,19303,8.0,0,0,4,5,4155.0,-1
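
Because LabelEncoder sorts values alphabetically, 'Severity of Illness' is encoded as Extreme = 0, Minor = 1, Moderate = 2 in the output above, which does not follow the natural order of severity. If an ordered encoding is preferred, an explicit mapping can be used instead. The sketch below assumes the ordering Minor < Moderate < Extreme; that ordering and the name `severity_order` are assumptions, not something stated in the dataset.

```python
import pandas as pd

# Toy column (hypothetical values).
severity_demo = pd.Series(["Extreme", "Minor", "Moderate", "Minor"])

# Assumed ordering (not given in the source data): Minor < Moderate < Extreme.
severity_order = {"Minor": 0, "Moderate": 1, "Extreme": 2}

# map() replaces each category by its rank; unknown categories would become NaN.
severity_encoded = severity_demo.map(severity_order)
print(severity_encoded.tolist())
```

Tree-based models split on thresholds, so an ordered code lets a single split separate milder from more severe cases. Whether this improves the model here has not been tested.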


#### Step 11: Data Segmentation

Filtering the Training Dataset.

In this task, you recreate the 'train' DataFrame by filtering the rows of 'df' where the value in the 'Stay' column is not equal to -1.

In [11]:
train = df[df['Stay']!=-1]

#--- Inspect data ---
train

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
0,1,8,2,3,2,3,3,2,5,2.0,31397,7.0,0,0,2,5,4911.0,0
1,2,2,2,5,2,2,3,3,5,2.0,31397,7.0,1,0,2,5,5954.0,4
2,3,10,4,1,0,2,1,3,4,2.0,31397,7.0,1,0,2,5,4745.0,3
3,4,26,1,2,1,2,3,2,3,2.0,31397,7.0,1,0,2,5,7272.0,4
4,5,26,1,2,1,2,3,3,3,2.0,31397,7.0,1,0,2,5,5558.0,4
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,4996,19,0,7,1,2,2,3,2,2.0,104970,8.0,0,1,2,1,4894.0,0
4996,4997,26,1,2,1,2,2,1,3,4.0,104970,8.0,1,1,2,1,6987.0,3
4997,4998,32,5,9,1,3,2,3,1,2.0,68447,5.0,0,2,4,4,4196.0,5
4998,4999,26,1,2,1,3,2,2,3,2.0,68447,5.0,1,2,3,4,4560.0,2


#### Step 12: Preparation for Prediction
Filtering the Testing Dataset.

In this task, you recreate the 'test' DataFrame by filtering the rows of 'df' where the value in the 'Stay' column is equal to -1.

In [12]:
test = df[df['Stay']==-1]

#--- Inspect data ---
test

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Hospital_region_code,Available Extra Rooms in Hospital,Department,Ward_Type,Ward_Facility_Code,Bed Grade,patientid,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay
5000,318439,21,2,3,2,3,2,3,0,2.0,17006,2.0,0,2,2,7,3095.0,-1
5001,318440,29,0,4,0,2,2,3,5,2.0,17006,2.0,1,2,4,7,4018.0,-1
5002,318441,26,1,2,1,3,2,1,3,4.0,17006,2.0,0,2,3,7,4492.0,-1
5003,318442,6,0,6,0,3,2,1,5,2.0,17006,2.0,1,2,3,7,4173.0,-1
5004,318443,28,1,11,0,2,2,2,5,2.0,17006,2.0,1,2,4,7,4161.0,-1
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
6995,320434,26,1,2,1,1,1,2,3,2.0,19303,8.0,1,0,4,5,8829.0,-1
6996,320435,26,1,2,1,4,2,2,3,1.0,19303,8.0,1,0,6,5,3507.0,-1
6997,320436,23,0,6,0,3,2,1,5,1.0,19303,8.0,1,0,3,5,4109.0,-1
6998,320437,25,4,1,0,4,2,1,4,3.0,19303,8.0,0,0,4,5,4155.0,-1


#### Step 13: Feature Engineering for Enhanced Predictive Analysis
Column Removal in the Test DataFrame.

The task is to create a new DataFrame named 'test1' by dropping the columns 'Stay', 'patientid', 'Hospital_region_code', and 'Ward_Facility_Code' from the 'test' DataFrame.

Before the columns are dropped, the helper function in this step groups the data by combinations of features ('patientid' alone, 'patientid' with 'Hospital_region_code', and 'patientid' with 'Ward_Facility_Code'), counts the rows in each group, and merges the counts back into both the training and testing data. This creates the new features 'count_id_patient', 'count_id_patient_hospitalCode', and 'count_id_patient_wardfacilityCode', which reflect how often each combination occurs.

In [13]:
import numpy as np
def get_countid_enocde(train, test, cols, name):
    # Count the rows per group separately in each frame
    temp = train.groupby(cols)['case_id'].count().reset_index().rename(columns = {'case_id': name})
    temp2 = test.groupby(cols)['case_id'].count().reset_index().rename(columns = {'case_id': name})
    # Attach the group counts to every row as a new feature
    train = pd.merge(train, temp, how='left', on= cols)
    test = pd.merge(test,temp2, how='left', on= cols)
    train[name] = train[name].astype('float')
    test[name] = test[name].astype('float')
    # Assign the result back instead of calling fillna with inplace=True on a column
    train[name] = train[name].fillna(np.median(temp[name]))
    test[name] = test[name].fillna(np.median(temp2[name]))
    return train, test


train, test = get_countid_enocde(train, test, ['patientid'], name = 'count_id_patient')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Hospital_region_code'], name = 'count_id_patient_hospitalCode')
train, test = get_countid_enocde(train, test,
                                 ['patientid', 'Ward_Facility_Code'], name = 'count_id_patient_wardfacilityCode')

test1 = test.drop(['Stay', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis =1)
test1

Unnamed: 0,case_id,Hospital_code,Hospital_type_code,City_Code_Hospital,Available Extra Rooms in Hospital,Department,Ward_Type,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,count_id_patient,count_id_patient_hospitalCode,count_id_patient_wardfacilityCode
0,318439,21,2,3,3,2,3,2.0,2.0,0,2,2,7,3095.0,7.0,1.0,1.0
1,318440,29,0,4,2,2,3,2.0,2.0,1,2,4,7,4018.0,7.0,4.0,4.0
2,318441,26,1,2,3,2,1,4.0,2.0,0,2,3,7,4492.0,7.0,2.0,2.0
3,318442,6,0,6,3,2,1,2.0,2.0,1,2,3,7,4173.0,7.0,4.0,4.0
4,318443,28,1,11,2,2,2,2.0,2.0,1,2,4,7,4161.0,7.0,4.0,4.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
1995,320434,26,1,2,1,1,2,2.0,8.0,1,0,4,5,8829.0,5.0,2.0,2.0
1996,320435,26,1,2,4,2,2,1.0,8.0,1,0,6,5,3507.0,5.0,2.0,2.0
1997,320436,23,0,6,3,2,1,1.0,8.0,1,0,3,5,4109.0,5.0,3.0,1.0
1998,320437,25,4,1,4,2,1,3.0,8.0,0,0,4,5,4155.0,5.0,3.0,2.0
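
The count features from this step can also be produced without a merge by using `groupby(...).transform('count')`, which returns one value per original row. The sketch below shows the idea on a toy frame; `visits_demo` and its values are made up for illustration.

```python
import pandas as pd

# Toy frame (hypothetical values): five admissions from two patients.
visits_demo = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "patientid": [100, 100, 200, 100, 200],
    "Ward_Facility_Code": [5, 5, 4, 3, 4],
})

# Rows per patient, broadcast back onto every row of that patient.
visits_demo["count_id_patient"] = (
    visits_demo.groupby("patientid")["case_id"].transform("count")
)

# Rows per (patient, ward facility) pair.
visits_demo["count_id_patient_wardfacilityCode"] = (
    visits_demo.groupby(["patientid", "Ward_Facility_Code"])["case_id"].transform("count")
)
print(visits_demo)
```

As in the original helper, the counts are computed within a single frame, so the training and testing counts are on different scales when the two files cover different numbers of admissions per patient.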


#### Step 14: Sculpting the Data
Column Removal in the Train DataFrame.

The task is to create a new DataFrame named 'train1' by dropping the columns 'case_id', 'patientid', 'Hospital_region_code', and 'Ward_Facility_Code' from the 'train' DataFrame.

In [14]:
train1 = train.drop(['case_id', 'patientid', 'Hospital_region_code', 'Ward_Facility_Code'], axis=1)

# --- Inspect data ---
train1


Unnamed: 0,Hospital_code,Hospital_type_code,City_Code_Hospital,Available Extra Rooms in Hospital,Department,Ward_Type,Bed Grade,City_Code_Patient,Type of Admission,Severity of Illness,Visitors with Patient,Age,Admission_Deposit,Stay,count_id_patient,count_id_patient_hospitalCode,count_id_patient_wardfacilityCode
0,8,2,3,3,3,2,2.0,7.0,0,0,2,5,4911.0,0,14.0,4.0,5.0
1,2,2,5,2,3,3,2.0,7.0,1,0,2,5,5954.0,4,14.0,4.0,5.0
2,10,4,1,2,1,3,2.0,7.0,1,0,2,5,4745.0,3,14.0,4.0,2.0
3,26,1,2,2,3,2,2.0,7.0,1,0,2,5,7272.0,4,14.0,6.0,3.0
4,26,1,2,2,3,3,2.0,7.0,1,0,2,5,5558.0,4,14.0,6.0,3.0
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
4995,19,0,7,2,2,3,2.0,8.0,0,1,2,1,4894.0,0,3.0,3.0,2.0
4996,26,1,2,2,2,1,4.0,8.0,1,1,2,1,6987.0,3,3.0,3.0,1.0
4997,32,5,9,3,2,3,2.0,5.0,0,2,4,4,4196.0,5,3.0,2.0,1.0
4998,26,1,2,3,2,2,2.0,5.0,1,2,3,4,4560.0,2,3.0,2.0,1.0


#### Step 15: Data Splitting for Model Mastery
Splitting Data into Training and Testing.

The task is to split the 'train1' data into training and testing sets. Create the feature matrix X1 by excluding the 'Stay' column and set y1 as the target variable. Split the data into training and testing sets using an 80-20 split ratio, with a random seed of 100 for reproducibility.

In [15]:
from sklearn.model_selection import train_test_split

X1 = train1.drop('Stay', axis =1)
y1 = train1['Stay']

X_train, X_test, y_train, y_test = train_test_split(X1, y1, test_size =0.20, random_state =100)
print(X_train, X_test, y_train, y_test)

      Hospital_code  Hospital_type_code  City_Code_Hospital  \
4833             10                   4                   1   
1218             26                   1                   2   
135              23                   0                   6   
3612              1                   3                  10   
3259             27                   0                   7   
...             ...                 ...                 ...   
4149              1                   3                  10   
1890             13                   0                   5   
350              14                   0                   1   
79               15                   2                   5   
3927             19                   0                   7   

      Available Extra Rooms in Hospital  Department  Ward_Type  Bed Grade  \
4833                                  2           2          3        2.0   
1218                                  4           2          1        2.0   
135         
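
With eleven unevenly populated 'Stay' classes, a purely random split can leave rare classes under-represented in one of the two sets. `train_test_split` accepts a `stratify` argument that keeps the class proportions the same in both parts. The sketch below demonstrates it on synthetic data; `X_demo` and `y_demo` are made up and only illustrate the mechanism.

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Synthetic, imbalanced three-class target (hypothetical data).
X_demo = pd.DataFrame({"feature": np.arange(100)})
y_demo = pd.Series([0] * 70 + [1] * 20 + [2] * 10)

# stratify=y_demo preserves the 70/20/10 class ratio in both splits.
X_tr, X_te, y_tr, y_te = train_test_split(
    X_demo, y_demo, test_size=0.20, random_state=100, stratify=y_demo
)
print(y_te.value_counts().sort_index().tolist())
```

Stratification requires every class to have at least two samples. Whether it changes the accuracy reported in the next step has not been checked.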

#### Step 16: The XGBoost Model Story

Training an XGBoost Classifier and Evaluating Accuracy.

The task is to create an XGBoost classifier instance named 'classifier_xgb' with specified hyperparameters. Fit the classifier to the training data ('X_train' and 'y_train') and assign it to 'model_xgb'. Use the trained model to make predictions on the test data ('X_test') and store the predictions. Calculate the accuracy of the XGBoost model's predictions and store it in 'acc_score_xgb'. Round the accuracy score to two decimal places.

In [16]:
%pip install xgboost


In [17]:
import xgboost
from sklearn.metrics import accuracy_score


classifier_xgb = xgboost.XGBClassifier(max_depth=4, learning_rate=0.1, n_estimators=800,
                                  objective='multi:softmax', reg_alpha=0.5, reg_lambda=1.5,
                                  booster='gbtree', n_jobs=4, min_child_weight=2, base_score= 0.75)


model_xgb = classifier_xgb.fit(X_train, y_train)

prediction_xgb = model_xgb.predict(X_test)
acc_score_xgb = accuracy_score(y_test, prediction_xgb)

acc_score_xgb = round(acc_score_xgb, 2)

acc_score_xgb

0.4
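
An accuracy figure for an eleven-class problem is easier to interpret next to a simple baseline, such as always predicting the most frequent class. The sketch below shows how such a baseline can be computed with scikit-learn's `DummyClassifier`; the data is synthetic and the resulting number says nothing about the hospital dataset.

```python
import numpy as np
from sklearn.dummy import DummyClassifier
from sklearn.metrics import accuracy_score

# Synthetic features and an imbalanced three-class target (hypothetical data).
rng = np.random.default_rng(0)
X_demo = rng.normal(size=(200, 3))
y_demo = np.array([2] * 90 + [1] * 60 + [0] * 50)

# The baseline ignores the features and always predicts the majority class.
baseline = DummyClassifier(strategy="most_frequent")
baseline.fit(X_demo, y_demo)

baseline_acc = accuracy_score(y_demo, baseline.predict(X_demo))
print(round(baseline_acc, 2))
```

On the real data, fitting the same baseline on `X_train`, `y_train` and scoring it on `X_test`, `y_test` would give a reference value to compare with `acc_score_xgb`.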

#### Step 17: Final Predictive Model Transformation

Use the trained XGBoost classifier to make predictions on the 'test1' dataset. The predictions are based on the features in 'test1' and are stored as 'pred_xgb'.

The task is to use the trained 'classifier_xgb' model to make predictions on the 'test1' DataFrame, excluding the 'case_id' column, and store the predictions in 'pred_xgb'. Create a new DataFrame named 'result_xgb' that holds the 'pred_xgb' values in a 'Stay' column. Assign the 'case_id' column from 'test1' to the 'case_id' column in 'result_xgb'. Reorder the columns in 'result_xgb' so that 'case_id' comes first and 'Stay' second. Finally, replace the numeric labels in the 'Stay' column of 'result_xgb' using the provided label_mapping.

In [18]:
label_mapping = {
    0: '0-10', 1: '11-20', 2: '21-30', 3: '31-40', 4: '41-50',
    5: '51-60', 6: '61-70', 7: '71-80', 8: '81-90', 9: '91-100',
    10: 'More than 100 Days'
}

pred_xgb = classifier_xgb.predict(test1.iloc[:, 1:])
result_xgb = pd.DataFrame(pred_xgb, columns=['Stay'])
result_xgb['case_id'] = test1['case_id']
result_xgb = result_xgb[['case_id', 'Stay']]
result_xgb['Stay'] = result_xgb['Stay'].replace(label_mapping)


result_xgb

Unnamed: 0,case_id,Stay
0,318439,11-20
1,318440,51-60
2,318441,11-20
3,318442,21-30
4,318443,51-60
...,...,...
1995,320434,51-60
1996,320435,71-80
1997,320436,21-30
1998,320437,11-20
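
The assignment `result_xgb['case_id'] = test1['case_id']` aligns rows by index label, so it relies on 'test1' having an index that starts at 0. That holds here because the merges in Step 13 reset the index, but it is easy to break. The sketch below builds the result from positional arrays instead, so that index labels cannot misalign; all names ending in `_demo` and their values are made up for illustration.

```python
import numpy as np
import pandas as pd

# Shortened mapping for the example (the notebook's label_mapping has 11 entries).
label_mapping_demo = {0: "0-10", 1: "11-20", 2: "21-30"}

# Toy test frame whose index does not start at 0, as happens after filtering.
test_demo = pd.DataFrame({"case_id": [318439, 318440, 318441]}, index=[5000, 5001, 5002])
pred_demo = np.array([1, 2, 1])  # stand-in for model predictions

# Build both columns positionally; to_numpy() discards the index labels.
result_demo = pd.DataFrame({
    "case_id": test_demo["case_id"].to_numpy(),
    "Stay": pd.Series(pred_demo).map(label_mapping_demo),
})
print(result_demo)
```

Using `map` instead of `replace` also turns any prediction that is missing from the mapping into NaN, which makes gaps in the mapping visible.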


#### Step 18: Decoding Patient Stays.

The task is to group the 'result_xgb' DataFrame by unique 'Stay' values and calculate the count of unique 'case_id' values in each group. Store the result in the variable 'result'.

In [19]:
result = result_xgb.groupby('Stay')['case_id'].nunique()


result

Stay
0-10                   60
11-20                 375
21-30                 869
31-40                 246
41-50                  32
51-60                 301
61-70                   7
71-80                  49
81-90                  27
91-100                  6
More than 100 Days     28
Name: case_id, dtype: int64
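
The counts above sum to 2,000 predictions, of which 869 fall into the '21-30' bucket. Expressing the distribution as shares can make such comparisons easier to read. A minimal sketch on a toy result frame follows; `result_demo` and its values are hypothetical.

```python
import pandas as pd

# Toy prediction results (hypothetical values).
result_demo = pd.DataFrame({
    "case_id": [1, 2, 3, 4, 5],
    "Stay": ["21-30", "11-20", "21-30", "21-30", "51-60"],
})

# normalize=True returns the fraction of rows per 'Stay' bucket.
stay_share = result_demo["Stay"].value_counts(normalize=True).sort_index()
print(stay_share)
```

Because every 'case_id' appears once, counting rows with `value_counts` gives the same totals as counting unique 'case_id' values per group.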