### BUSINESS UNDERSTANDING

#### OVERVIEW

The dataset consists of electronic health records collected from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine whether a patient should be classified as an in-care or out-care patient. The objective is to build a machine learning model that predicts this classification from the laboratory test results, reducing the time and effort doctors spend deciding which type of service a patient should receive.

The model will use each patient's laboratory test results (haematocrit, haemoglobin, erythrocyte count, leucocyte count, thrombocyte count, MCH, MCHC, and MCV), together with age and gender, to predict whether the patient should be classified as in-care or out-care. It is intended as a predictive tool for healthcare professionals, supporting the decision-making process by quickly indicating whether a patient requires inpatient care (overnight hospitalization) or outpatient care (no overnight stay required).

#### Business Problem

#### Business Objectives

- Improve patient care classification by automating the categorization of patients based on their laboratory test results, ensuring timely and appropriate treatment.

- Optimize resource allocation: by accurately predicting patient classifications, the hospital can better allocate beds, staff, and other medical resources. This ensures that the right resources are available for each patient category, improving efficiency and cost-effectiveness.

### Data Understanding

The Electronic Health Record Predicting dataset was collected from a private hospital in Indonesia. It contains patients' laboratory test results, which are used to determine the patient's next treatment: in-care or out-care. The dataset was downloaded from Kaggle: https://www.kaggle.com/datasets/saurabhshahane/patient-treatment-classification

The attributes are listed below with a brief description. Where the source provides them, the data type and an example value appear in parentheses.

- HAEMATOCRIT (Continuous, e.g. 35.1): Laboratory test that measures the proportion of red blood cells in the blood.

- HAEMOGLOBINS (Continuous, e.g. 11.8): Laboratory test that measures the level of hemoglobin in the blood.

- ERYTHROCYTE (Continuous, e.g. 4.65): Laboratory test that measures how many red blood cells (RBCs) the blood contains.

- LEUCOCYTE (Continuous, e.g. 6.3): Laboratory test that measures the number of white blood cells in the blood.

- THROMBOCYTE: Laboratory test that measures the platelet count in the blood.

- MCH: Mean corpuscular hemoglobin. Laboratory test that measures the amount of hemoglobin in a red blood cell.

- MCHC (Continuous, e.g. 33.6): Mean corpuscular hemoglobin concentration. Laboratory test that evaluates whether red blood cells are carrying an appropriate amount of hemoglobin.

- MCV (Continuous, e.g. 75.5): Mean corpuscular volume. Laboratory test that helps diagnose or monitor certain blood disorders, including anemia.

- AGE: Patient age.

- SEX: Patient gender.

- SOURCE: The target class, either in-care patient or out-care patient.
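
The attribute list above can double as a quick schema check once the data is loaded. The following is a minimal sketch, not part of the original notebook: `EXPECTED_COLUMNS` and `missing_columns` are hypothetical names introduced for illustration, and the two inline rows are sample records from the dataset standing in for the CSV file.

```python
import io

import pandas as pd

# Expected column names, taken from the attribute list above.
EXPECTED_COLUMNS = (
    "HAEMATOCRIT", "HAEMOGLOBINS", "ERYTHROCYTE", "LEUCOCYTE", "THROMBOCYTE",
    "MCH", "MCHC", "MCV", "AGE", "SEX", "SOURCE",
)


def missing_columns(frame, expected=EXPECTED_COLUMNS):
    """Return the expected column names absent from `frame` (hypothetical helper)."""
    return [name for name in expected if name not in frame.columns]


# Two sample rows from the dataset, used here instead of reading the CSV file.
sample_csv = io.StringIO(
    "HAEMATOCRIT,HAEMOGLOBINS,ERYTHROCYTE,LEUCOCYTE,THROMBOCYTE,"
    "MCH,MCHC,MCV,AGE,SEX,SOURCE\n"
    "35.1,11.8,4.65,6.3,310,25.4,33.6,75.5,1,F,out\n"
    "43.5,14.8,5.39,12.7,334,27.5,34.0,80.7,1,F,out\n"
)
sample = pd.read_csv(sample_csv)

# An empty list means every expected attribute is present.
print(missing_columns(sample))

# Dropping a column shows what a failed check looks like.
print(missing_columns(sample.drop(columns=["SEX"])))
```

Running the same check on the full DataFrame right after loading would catch a renamed or missing attribute before any modelling starts.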



In [1]:
# Import libraries
import pandas as pd

In [2]:
df=pd.read_csv('hosp_data.csv')
df.head()

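|   | HAEMATOCRIT | HAEMOGLOBINS | ERYTHROCYTE | LEUCOCYTE | THROMBOCYTE | MCH | MCHC | MCV | AGE | SEX | SOURCE |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 35.1 | 11.8 | 4.65 | 6.3 | 310 | 25.4 | 33.6 | 75.5 | 1 | F | out |
| 1 | 43.5 | 14.8 | 5.39 | 12.7 | 334 | 27.5 | 34.0 | 80.7 | 1 | F | out |
| 2 | 33.5 | 11.3 | 4.74 | 13.2 | 305 | 23.8 | 33.7 | 70.7 | 1 | F | out |
| 3 | 39.1 | 13.7 | 4.98 | 10.5 | 366 | 27.5 | 35.0 | 78.5 | 1 | F | out |
| 4 | 30.9 | 9.9 | 4.23 | 22.1 | 333 | 23.4 | 32.0 | 73.0 | 1 | M | out |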


In [3]:
# Check the data shape
df.shape

(4412, 11)

In [4]:
# Check the data information (column types and non-null counts)
df.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 4412 entries, 0 to 4411
Data columns (total 11 columns):
 #   Column        Non-Null Count  Dtype  
---  ------        --------------  -----  
 0   HAEMATOCRIT   4412 non-null   float64
 1   HAEMOGLOBINS  4412 non-null   float64
 2   ERYTHROCYTE   4412 non-null   float64
 3   LEUCOCYTE     4412 non-null   float64
 4   THROMBOCYTE   4412 non-null   int64  
 5   MCH           4412 non-null   float64
 6   MCHC          4412 non-null   float64
 7   MCV           4412 non-null   float64
 8   AGE           4412 non-null   int64  
 9   SEX           4412 non-null   object 
 10  SOURCE        4412 non-null   object 
dtypes: float64(7), int64(2), object(2)
memory usage: 379.3+ KB
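
The `df.info()` output above reports 4412 non-null values in every column and two non-numeric columns. As a hedged sketch (not part of the original notebook), the same two facts can be checked programmatically. The inline rows below are a few sample records standing in for the CSV file, and the variable names are illustrative assumptions.

```python
import io

import pandas as pd

# A few sample rows from the dataset, used here instead of reading the CSV file.
sample = pd.read_csv(io.StringIO(
    "HAEMATOCRIT,HAEMOGLOBINS,ERYTHROCYTE,LEUCOCYTE,THROMBOCYTE,"
    "MCH,MCHC,MCV,AGE,SEX,SOURCE\n"
    "35.1,11.8,4.65,6.3,310,25.4,33.6,75.5,1,F,out\n"
    "43.5,14.8,5.39,12.7,334,27.5,34.0,80.7,1,F,out\n"
    "30.9,9.9,4.23,22.1,333,23.4,32.0,73.0,1,M,out\n"
))

# Count missing values per column; a total of zero means nothing needs imputing.
missing_per_column = sample.isna().sum()
print(missing_per_column.sum())

# List the non-numeric columns, which will need encoding before modelling.
text_columns = sample.select_dtypes(exclude="number").columns.tolist()
print(text_columns)
```

On the full DataFrame, replacing `sample` with `df` would give the per-column missing counts and the list of columns that need encoding.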


In [5]:
# Generate descriptive statistics for the numeric columns
df.describe().transpose()

|   | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| HAEMATOCRIT | 4412.0 | 38.197688 | 5.974784 | 13.7 | 34.375 | 38.6 | 42.5 | 69.0 |
| HAEMOGLOBINS | 4412.0 | 12.741727 | 2.079903 | 3.8 | 11.4 | 12.9 | 14.2 | 18.9 |
| ERYTHROCYTE | 4412.0 | 4.54126 | 0.784091 | 1.48 | 4.04 | 4.57 | 5.05 | 7.86 |
| LEUCOCYTE | 4412.0 | 8.718608 | 5.049041 | 1.1 | 5.675 | 7.6 | 10.3 | 76.6 |
| THROMBOCYTE | 4412.0 | 257.524479 | 113.972365 | 8.0 | 188.0 | 256.0 | 321.0 | 1183.0 |
| MCH | 4412.0 | 28.234701 | 2.672639 | 14.9 | 27.2 | 28.7 | 29.8 | 40.8 |
| MCHC | 4412.0 | 33.343042 | 1.228664 | 26.0 | 32.7 | 33.4 | 34.1 | 39.0 |
| MCV | 4412.0 | 84.612942 | 6.859101 | 54.0 | 81.5 | 85.4 | 88.7 | 115.6 |
| AGE | 4412.0 | 46.626473 | 21.731218 | 1.0 | 29.0 | 47.0 | 64.0 | 99.0 |


- count is the number of non-null values in each column (4412 for every column).
- mean is the average value of each column.
- std is the standard deviation, which measures the dispersion of the values around the mean.
- min is the minimum value observed in each column. For instance, the minimum HAEMATOCRIT level is 13.7.
- 25%, 50%, and 75% are the quartiles of each column; the 50% value is the median.
- max is the maximum value observed in each column. For example, the maximum HAEMATOCRIT level is 69.0.
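
To make these definitions concrete, the sketch below recomputes each statistic by hand for the five HAEMATOCRIT values shown in the `df.head()` output. It is an illustration only: because it uses five rows rather than all 4412, its results differ from the table above. Note that pandas reports the sample standard deviation (divisor n - 1) by default.

```python
import pandas as pd

# The five HAEMATOCRIT values from the df.head() output.
haematocrit = pd.Series([35.1, 43.5, 33.5, 39.1, 30.9], name="HAEMATOCRIT")

count = haematocrit.count()      # number of non-null values
mean = haematocrit.mean()        # average value
std = haematocrit.std()          # sample standard deviation (ddof=1)
minimum = haematocrit.min()      # smallest value
median = haematocrit.median()    # the 50% row in describe()
maximum = haematocrit.max()      # largest value

print(count, round(mean, 2), round(std, 3), minimum, median, maximum)

# describe() bundles the same statistics into one call.
print(haematocrit.describe())
```

Computing the pieces separately like this is mainly useful for checking one's reading of the `describe()` table; in practice the single `describe()` call is enough.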

### Data Preparation