# Advanced Data Mining Project – Project Deliverable 1
**Student Name:** Gaurab Karki  
**Course:** 2025 Fall - Advanced Big Data and Data Mining (MSCS-634-B01)

**Dataset:** Healthcare Dataset (Kaggle)  
**Source:** [https://www.kaggle.com/datasets/prasad22/healthcare-dataset](https://www.kaggle.com/datasets/prasad22/healthcare-dataset)

---

## Task 1: Dataset Selection and Description

### Dataset Overview
The **Healthcare Dataset** provides simulated patient data that mimics hospital records, including demographic, clinical, and billing information.  
It contains **55,500 records** and 15 attributes:

- `Name`
- `Age`
- `Gender`
- `Blood Type`
- `Medical Condition`
- `Date of Admission`
- `Doctor`
- `Hospital`
- `Insurance Provider`
- `Billing Amount`
- `Room Number`
- `Admission Type`
- `Discharge Date`
- `Medication`
- `Test Results`

### Why This Dataset?
This dataset is ideal for the Advanced Data Mining project because:
1. It contains **55,500 rows**, far exceeding the 500-record minimum requirement.
2. It has **15 attributes**, allowing exploration of categorical, numerical, and temporal data.
3. It supports **multiple analysis goals**, such as regression, classification, and clustering.
4. It aligns with **data-driven decision-making** in healthcare, a domain where predictive modeling and insight discovery are vital.

In [5]:
# Import essential libraries
import pandas as pd

# Load the dataset (make sure the CSV file is in the same directory as your notebook)
# Replace the filename below if needed
file_path = "healthcare_dataset.csv"
df = pd.read_csv(file_path)

# Display first few rows to inspect structure
print("Dataset successfully loaded! Preview of data:\n")
df.head()

Dataset successfully loaded! Preview of data:



| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal |
| 1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
| 2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal |
| 3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal |
| 4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal |


In [6]:
# Display basic dataset information
print("Dataset Information:\n")
df.info()

# Check basic statistics for numerical columns
print("\nStatistical Summary:\n")
df.describe()

Dataset Information:

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)

| | Age | Billing Amount | Room Number |
|---|---|---|---|
| count | 55500.0 | 55500.0 | 55500.0 |
| mean | 51.539459 | 25539.316097 | 301.134829 |
| std | 19.602454 | 14211.454431 | 115.243069 |
| min | 13.0 | -2008.49214 | 101.0 |
| 25% | 35.0 | 13241.224652 | 202.0 |
| 50% | 52.0 | 25538.069376 | 302.0 |
| 75% | 68.0 | 37820.508436 | 401.0 |
| max | 89.0 | 52764.276736 | 500.0 |


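The dataset contains several **categorical columns** (e.g., `Gender`, `Medical Condition`, `Hospital`) and **numerical columns** (e.g., `Age`, `Billing Amount`). The `Date of Admission` and `Discharge Date` columns hold dates but are currently stored as `object` (string) values; they will be converted to datetime in later steps. The statistical summary also shows a negative minimum `Billing Amount` (about -2008.49), which may point to noisy entries worth examining during cleaning.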
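
The date conversion mentioned above could look roughly like the sketch below. It uses a small sample frame built from the first three preview rows rather than the full `df`, and the `Length of Stay` column is a hypothetical derived feature that is not part of the original dataset.

```python
import pandas as pd

# Sample rows copied from the preview; the notebook itself would operate on `df`
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07"],
})

# Parse the ISO-formatted strings; errors="coerce" turns unparseable values into NaT
for col in ["Date of Admission", "Discharge Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d", errors="coerce")

# Hypothetical derived feature: number of days between admission and discharge
sample["Length of Stay"] = (
    sample["Discharge Date"] - sample["Date of Admission"]
).dt.days

print(sample.dtypes)
print(sample)
```

Using `errors="coerce"` is one possible design choice: malformed dates become `NaT` and can then be counted or dropped explicitly instead of raising an exception during parsing.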

## Task 2: Data Cleaning and Preprocessing

After loading the dataset, the next step is **data cleaning**.
Raw data may contain missing values, duplicate entries, or inconsistent information; cleaning makes the dataset accurate, consistent, and ready for reliable analysis.

In this step we will:
1. Identify and handle missing values.  
2. Remove or correct duplicate records.  
3. Detect and address noisy or inconsistent data (e.g., text formatting, outliers).

These preprocessing operations improve data quality and model performance.
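
As a rough illustration of steps 2 and 3, the sketch below works on a small toy frame instead of the full dataset. The names come from the preview, but the duplicated row and the pairing of names with billing values are invented for the example; flagging negative billing amounts is only one assumed way to treat such entries.

```python
import pandas as pd

# Toy frame with deliberately messy rows; values are illustrative only
toy = pd.DataFrame({
    "Name": ["Bobby JacksOn", "Bobby JacksOn", "LesLie TErRy", "DaNnY sMitH"],
    "Age": [30, 30, 62, 76],
    "Billing Amount": [18856.28, 18856.28, -2008.49, 27955.10],
})

# Step 2: count exact duplicate rows, then drop them
n_duplicates = int(toy.duplicated().sum())
toy = toy.drop_duplicates().reset_index(drop=True)

# Step 3a: normalize inconsistent capitalization in names
toy["Name"] = toy["Name"].str.strip().str.title()

# Step 3b: flag negative billing amounts for review instead of silently dropping them
negative_billing = toy["Billing Amount"] < 0

print("Duplicates removed:", n_duplicates)
print(toy)
print("Rows with negative billing:", int(negative_billing.sum()))
```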


In [7]:
# Step 1: Handling Missing Values

# Check how many missing values are in each column
print(" Missing Values per Column:\n")
print(df.isnull().sum())

# Calculate the percentage of missing data in each column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of Missing Values:\n")
print(missing_percentage)

 Missing Values per Column:

Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Room Number           0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
dtype: int64

Percentage of Missing Values:

Name                  0.0
Age                   0.0
Gender                0.0
Blood Type            0.0
Medical Condition     0.0
Date of Admission     0.0
Doctor                0.0
Hospital              0.0
Insurance Provider    0.0
Billing Amount        0.0
Room Number           0.0
Admission Type        0.0
Discharge Date        0.0
Medication            0.0
Test Results          0.0
dtype: float64
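
When a dataset does have gaps, the two printouts above can be merged into one compact table. The following is a minimal sketch on a toy frame with artificially inserted missing values; the column names `missing_count` and `missing_pct` are assumptions chosen for this example.

```python
import numpy as np
import pandas as pd

# Toy frame with artificial gaps; the real dataset reports none
toy = pd.DataFrame({
    "Age": [30, np.nan, 76, 28],
    "Gender": ["Male", "Male", None, "Female"],
    "Billing Amount": [18856.28, 33643.33, 27955.10, 37909.78],
})

# Combine counts and percentages into a single summary table
summary = pd.DataFrame({
    "missing_count": toy.isnull().sum(),
    "missing_pct": toy.isnull().mean() * 100,
})

# Keep only columns that actually have gaps, worst first
summary = summary[summary["missing_count"] > 0].sort_values(
    "missing_pct", ascending=False
)
print(summary)
```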


In [None]:
# Drop rows missing critical date fields first; otherwise the mode fill below
# would impute them and this step could never remove anything
df.dropna(subset=['Date of Admission', 'Discharge Date'], inplace=True)

# Identify numeric and categorical columns
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
categorical_cols = df.select_dtypes(include=['object']).columns

# Fill missing numeric values with the column median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill missing categorical values with the column mode (most frequent value)
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

print("Missing values handled successfully.")
print("Remaining missing values:", df.isnull().sum().sum())
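
Because the loaded dataset reports zero missing values, the imputation logic above never changes anything on the real data. The sketch below applies the same median and mode strategy to a toy frame with artificial gaps so the effect is visible; the toy values and the guard against an empty mode are assumptions added for illustration.

```python
import numpy as np
import pandas as pd

# Toy frame with artificial gaps in one row per column
toy = pd.DataFrame({
    "Age": [30.0, np.nan, 76.0, 28.0, 43.0],
    "Billing Amount": [18856.28, 33643.33, np.nan, 37909.78, 14238.32],
    "Gender": ["Male", "Male", None, "Female", "Male"],
})

numeric_cols = toy.select_dtypes(include="number").columns
categorical_cols = toy.select_dtypes(exclude="number").columns

# Median imputation for numeric columns (robust to skewed values)
toy[numeric_cols] = toy[numeric_cols].fillna(toy[numeric_cols].median())

# Mode imputation for categorical columns; skip columns with no observed values
for col in categorical_cols:
    modes = toy[col].mode()
    if not modes.empty:
        toy[col] = toy[col].fillna(modes.iloc[0])

print(toy)
print("Remaining missing values:", int(toy.isnull().sum().sum()))
```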