<a href="https://colab.research.google.com/github/Amanshrivastav06/EDA--HealthCare-Project/blob/main/EDA_Healthcare_Project.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# **EDA - HealthCare Project**

## **Importing Libraries**

In [8]:
import pandas as pd
import numpy as np
import seaborn as sns
import matplotlib.pyplot as plt
import matplotlib_inline


## **Importing Dataset**

**1. Understand the Dataset (Data Collection & Loading)**

In [13]:
df = pd.read_csv("/content/healthcare_dataset.csv")

## **Situation or Problem Statement**
Predictive Analytics and Insights from Patient Admission Data for Improved Healthcare Management

The aim of this project is to analyze and extract meaningful insights from a comprehensive healthcare dataset that includes patient demographics, medical history, hospital admission details, and billing information. The goal is to support better hospital resource allocation, early disease detection, and personalized patient care.

**OBJECTIVES**
*   Identify patterns and correlations between patient demographics and medical conditions.
*   Predict the length of hospital stay based on admission details.
*   Forecast billing amounts using patient profiles and treatment history.
*   Classify patients by admission type (emergency vs elective).
*   Detect trends in medication and test result outcomes.
*   Assist in optimizing hospital resources and patient flow.

## **Column/Data Descriptions**

1. 'Name': Represents the patient’s name.
2. 'Age': The age of the patient.
3. 'Gender': The gender of the patient.
4. 'Blood Type': The blood group of the patient.
5. 'Medical Condition': The primary diagnosis or health issue for which the patient was admitted.
6. 'Date of Admission': The date the patient was admitted.
7. 'Doctor': Name or ID of the doctor treating the patient.
8. 'Hospital': Name or ID of the hospital.
9. 'Insurance Provider': The insurance company covering the patient.
10. 'Billing Amount': Total charges billed for the patient.
11. 'Room Number': The room assigned to the patient.
12. 'Admission Type': Type of admission, which indicates urgency.
13. 'Discharge Date': The date the patient was discharged.
14. 'Medication': Medication(s) prescribed during the stay.
15. 'Test Results': Diagnostic test results.

**2. Data Cleaning**

In [16]:
# Checking the shape of data
df.shape

(55500, 15)

***There are 55,500 rows and 15 columns.***

In [19]:
# Check the column names
print(list(df.columns))

['Name', 'Age', 'Gender', 'Blood Type', 'Medical Condition', 'Date of Admission', 'Doctor', 'Hospital', 'Insurance Provider', 'Billing Amount', 'Room Number', 'Admission Type', 'Discharge Date', 'Medication', 'Test Results']


In [23]:
# Check for missing values
print(df.isnull().sum())

Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Room Number           0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
dtype: int64


**Hence, there are no null or missing values in any column.**
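`isnull()` only flags true nulls, so empty or whitespace-only strings would pass the check above unnoticed. The following is a minimal sketch of a complementary check on a toy frame; the `sample` data and the `text_cols` list are assumptions made for illustration, not values from the dataset.

```python
import pandas as pd

# Toy frame with hypothetical values; the notebook itself works on df.
sample = pd.DataFrame(
    {
        "Name": ["Bobby JacksOn", "", "DaNnY sMitH"],
        "Age": [30, 62, None],
        "Medication": ["Paracetamol", "  ", "Aspirin"],
    }
)

# Standard null count, as in the cell above.
null_counts = sample.isnull().sum()

# Count text cells that are empty or whitespace-only, which isnull() does not flag.
text_cols = ["Name", "Medication"]  # hypothetical choice of text columns
blank_counts = sample[text_cols].apply(lambda col: col.str.strip().eq("").sum())

print(null_counts)
print(blank_counts)
```

If both counts are zero for every column, the "no missing values" conclusion holds for blanks as well as nulls.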

In [42]:

df = pd.read_csv("/content/healthcare_dataset.csv")

In [43]:
df.shape

(55500, 15)

In [44]:
# Check for duplicate rows
df.duplicated().sum()

np.int64(534)

In [45]:
# remove duplicates
df = df.drop_duplicates()

In [46]:
df.shape

(54966, 15)

*534 rows were exact duplicates; after removing them, 54,966 rows remain.*
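Before dropping duplicates it can help to look at them first. The sketch below uses a small hypothetical frame to show how `duplicated(keep=False)` lists every copy in a duplicate group, and how `reset_index(drop=True)` restores a contiguous index after `drop_duplicates()`. Resetting the index is an optional extra step and is not something the notebook does.

```python
import pandas as pd

# Hypothetical rows for illustration; the notebook applies the same calls to df.
sample = pd.DataFrame(
    {
        "Name": ["Bobby JacksOn", "LesLie TErRy", "Bobby JacksOn"],
        "Age": [30, 62, 30],
        "Medical Condition": ["Cancer", "Obesity", "Cancer"],
    }
)

# keep=False marks every member of a duplicate group, so copies can be reviewed side by side.
dupes = sample[sample.duplicated(keep=False)].sort_values(list(sample.columns))
print(dupes)

# drop_duplicates keeps the first occurrence; reset_index makes the index contiguous again.
deduped = sample.drop_duplicates().reset_index(drop=True)
print(deduped.shape)
```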

In [47]:
# Checking the data types

df.dtypes


| Column | Dtype |
|---|---|
| Name | object |
| Age | int64 |
| Gender | object |
| Blood Type | object |
| Medical Condition | object |
| Date of Admission | object |
| Doctor | object |
| Hospital | object |
| Insurance Provider | object |
| Billing Amount | float64 |


In [49]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 54966 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                54966 non-null  object 
 1   Age                 54966 non-null  int64  
 2   Gender              54966 non-null  object 
 3   Blood Type          54966 non-null  object 
 4   Medical Condition   54966 non-null  object 
 5   Date of Admission   54966 non-null  object 
 6   Doctor              54966 non-null  object 
 7   Hospital            54966 non-null  object 
 8   Insurance Provider  54966 non-null  object 
 9   Billing Amount      54966 non-null  float64
 10  Room Number         54966 non-null  int64  
 11  Admission Type      54966 non-null  object 
 12  Discharge Date      54966 non-null  object 
 13  Medication          54966 non-null  object 
 14  Test Results        54966 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.7+ MB


In [51]:
# Select the numeric columns ("number" matches every integer and float dtype)
df.select_dtypes(include="number")

| | Age | Billing Amount | Room Number |
|---|---|---|---|
| 0 | 30 | 18856.281306 | 328 |
| 1 | 62 | 33643.327287 | 265 |
| 2 | 76 | 27955.096079 | 205 |
| 3 | 28 | 37909.782410 | 450 |
| 4 | 43 | 14238.317814 | 458 |
| ... | ... | ... | ... |
| 55495 | 42 | 2650.714952 | 417 |
| 55496 | 61 | 31457.797307 | 316 |
| 55497 | 38 | 27620.764717 | 347 |
| 55498 | 43 | 32451.092358 | 321 |
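Once the numeric columns are isolated, `describe()` gives a quick statistical summary. This is a minimal sketch using only the first five rows shown above, so the numbers it prints describe those five rows and not the full dataset. Treating 'Room Number' as an identifier rather than a measurement is an assumption.

```python
import pandas as pd

# First five numeric rows copied from the output above.
numeric = pd.DataFrame(
    {
        "Age": [30, 62, 76, 28, 43],
        "Billing Amount": [18856.281306, 33643.327287, 27955.096079,
                           37909.782410, 14238.317814],
        "Room Number": [328, 265, 205, 450, 458],
    }
)

# describe() reports count, mean, std, min, quartiles and max per column.
summary = numeric.describe()
print(summary.round(2))

# Room Number is assumed to be an identifier, so it is left out of the summary here.
measures = numeric.drop(columns=["Room Number"]).describe()
print(measures.round(2))
```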


In [52]:
# Select the non-numeric columns
df.select_dtypes(exclude="number")

| | Name | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bobby JacksOn | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | Urgent | 2024-02-02 | Paracetamol | Normal |
| 1 | LesLie TErRy | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
| 2 | DaNnY sMitH | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | Emergency | 2022-10-07 | Aspirin | Normal |
| 3 | andrEw waTtS | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | Elective | 2020-12-18 | Ibuprofen | Abnormal |
| 4 | adrIENNE bEll | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | Urgent | 2022-10-09 | Penicillin | Abnormal |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
| 55495 | eLIZABeTH jaCkSOn | Female | O+ | Asthma | 2020-08-16 | Joshua Jarvis | Jones-Thompson | Blue Cross | Elective | 2020-09-15 | Penicillin | Abnormal |
| 55496 | KYle pEREz | Female | AB- | Obesity | 2020-01-23 | Taylor Sullivan | Tucker-Moyer | Cigna | Elective | 2020-02-01 | Aspirin | Normal |
| 55497 | HEATher WaNG | Female | B+ | Hypertension | 2020-07-13 | Joe Jacobs DVM | and Mahoney Johnson Vasquez, | UnitedHealthcare | Urgent | 2020-08-10 | Ibuprofen | Abnormal |
| 55498 | JENniFER JOneS | Male | O- | Arthritis | 2019-05-25 | Kimberly Curry | Jackson Todd and Castro, | Medicare | Elective | 2019-05-31 | Ibuprofen | Abnormal |
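The 'Name' values shown above have inconsistent capitalisation, and some 'Hospital' values end in a stray comma. The notebook does not clean these, but a minimal sketch of one possible clean-up looks like this. It assumes title case is acceptable for names and that the trailing commas are artifacts rather than meaningful data.

```python
import pandas as pd

# Three Name and Hospital values copied from the output above.
sample = pd.DataFrame(
    {
        "Name": ["Bobby JacksOn", "LesLie TErRy", "andrEw waTtS"],
        "Hospital": ["Sons and Miller", "Kim Inc", "Hernandez Rogers and Vang,"],
    }
)

# Title-case the names so the same patient is always spelled the same way.
sample["Name"] = sample["Name"].str.strip().str.title()

# Strip leading/trailing spaces and commas from hospital names.
sample["Hospital"] = sample["Hospital"].str.strip(" ,")

print(sample)
```

Note that `str.title()` is a blunt tool: it would turn a name such as "McDonald" into "Mcdonald", so the result may still need a manual check.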


The data type check shows that 'Date of Admission' and 'Discharge Date' are stored as object (string) columns, so we convert them to datetime format.

In [58]:

df['Date of Admission'] = pd.to_datetime(df['Date of Admission'])
df['Discharge Date'] = pd.to_datetime(df['Discharge Date'])
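The conversion above lets pandas infer the date format. As a hedged alternative, the sketch below passes an explicit `format` matching the `YYYY-MM-DD` layout seen in the data and uses `errors="coerce"` so that unparseable values become `NaT` and can be counted instead of raising an error. The `raw` series, including its invalid entry, is made up for illustration.

```python
import pandas as pd

# Dates in the YYYY-MM-DD layout seen in the dataset, plus one deliberately invalid value.
raw = pd.Series(["2024-01-31", "2019-08-20", "not a date"])

# An explicit format avoids format guessing; errors="coerce" turns bad values into NaT.
parsed = pd.to_datetime(raw, format="%Y-%m-%d", errors="coerce")

print(parsed)
print("Unparseable values:", parsed.isna().sum())
```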

In [59]:
df.info()

<class 'pandas.core.frame.DataFrame'>
Index: 54966 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype         
---  ------              --------------  -----         
 0   Name                54966 non-null  object        
 1   Age                 54966 non-null  int64         
 2   Gender              54966 non-null  object        
 3   Blood Type          54966 non-null  object        
 4   Medical Condition   54966 non-null  object        
 5   Date of Admission   54966 non-null  datetime64[ns]
 6   Doctor              54966 non-null  object        
 7   Hospital            54966 non-null  object        
 8   Insurance Provider  54966 non-null  object        
 9   Billing Amount      54966 non-null  float64       
 10  Room Number         54966 non-null  int64         
 11  Admission Type      54966 non-null  object        
 12  Discharge Date      54966 non-null  datetime64[ns]
 13  Medication          54966 non-null  object        


Hence, all data types in the dataset have been checked and corrected where needed.

After the cleaning step, no duplicate rows remain.
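With both date columns in datetime format, the length of stay named in the objectives can be derived by subtracting the two columns. This is a minimal sketch on the first three admission/discharge pairs shown earlier; the column name 'Length of Stay' is a hypothetical choice and does not exist in the original dataset.

```python
import pandas as pd

# Admission and discharge dates copied from the first three rows shown earlier.
sample = pd.DataFrame(
    {
        "Date of Admission": pd.to_datetime(["2024-01-31", "2019-08-20", "2022-09-22"]),
        "Discharge Date": pd.to_datetime(["2024-02-02", "2019-08-26", "2022-10-07"]),
    }
)

# Subtracting two datetime columns gives a timedelta; .dt.days converts it to whole days.
# "Length of Stay" is a hypothetical column name, not part of the original dataset.
sample["Length of Stay"] = (
    sample["Discharge Date"] - sample["Date of Admission"]
).dt.days

# Sanity check: a discharge should never precede the admission.
invalid = sample[sample["Length of Stay"] < 0]

print(sample)
print("Rows with negative stay:", len(invalid))
```

The same two steps could be applied to `df` directly; any rows that land in `invalid` would point to data-entry problems worth inspecting before modelling.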