# Advanced Data Mining Project – Project Deliverable 1
**Student Name:** Gaurab Karki  
**Course:** 2025 Fall - Advanced Big Data and Data Mining (MSCS-634-B01)

**Dataset:** Healthcare Dataset (Kaggle)  
**Source:** [https://www.kaggle.com/datasets/prasad22/healthcare-dataset](https://www.kaggle.com/datasets/prasad22/healthcare-dataset)

---

## Task 1: Dataset Selection and Description

### Dataset Overview
The **Healthcare Dataset** provides simulated patient data that mimics hospital records, including demographic, clinical, and billing information.  
It contains more than **10,000 records** and includes multiple attributes such as:

- `Name`
- `Age`
- `Gender`
- `Blood Type`
- `Medical Condition`
- `Date of Admission`
- `Doctor`
- `Hospital`
- `Insurance Provider`
- `Billing Amount`
- `Room Number`
- `Admission Type`
- `Discharge Date`
- `Medication`
- `Test Results`

### Why This Dataset?
This dataset is ideal for the Advanced Data Mining project because:
1. It contains **over 10,000 rows**, exceeding the 500-record minimum requirement.
2. It has **15 attributes**, allowing exploration of categorical, numerical, and temporal data.
3. It supports **multiple analysis goals**, such as regression, classification, and clustering.
4. It aligns with **data-driven decision-making** in healthcare — a domain where predictive modeling and insight discovery are vital.

In [None]:
# Import essential libraries
import pandas as pd

# Load the dataset (make sure the CSV file is in the same directory as your notebook)
# Replace the filename below if needed
file_path = "healthcare_dataset.csv"
df = pd.read_csv(file_path)

# Display first few rows to inspect structure
print("Dataset successfully loaded! Preview of data:\n")
df.head()

In [None]:
# Display basic dataset information
print("Dataset Information:\n")
df.info()

# Check basic statistics for numerical columns
print("\nStatistical Summary:\n")
df.describe()

The dataset contains several **categorical columns** (e.g., `Gender`, `Medical Condition`, `Hospital`) and **numerical columns** (e.g., `Age`, `Billing Amount`). There are also two **date columns**, `Date of Admission` and `Discharge Date`, which are loaded as text and must be converted to datetime before any temporal analysis.
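
The date conversion could look like the following minimal sketch. It runs on a tiny hand-made frame rather than the real CSV, assumes ISO-style date strings, and adds a hypothetical `Length of Stay` column that is not part of the original dataset.

```python
import pandas as pd

# Tiny illustrative frame; the notebook itself would use `df` loaded from the CSV.
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2024-02-10", "not a date"],
    "Discharge Date": ["2024-02-02", "2024-02-15", "2024-03-01"],
})

# errors="coerce" turns unparseable strings into NaT instead of raising an error.
for col in ["Date of Admission", "Discharge Date"]:
    sample[col] = pd.to_datetime(sample[col], errors="coerce")

# Hypothetical derived feature: stay length in whole days (NaN where a date is missing).
sample["Length of Stay"] = (
    sample["Discharge Date"] - sample["Date of Admission"]
).dt.days

print(sample.dtypes)
print(sample)
```

Coercing bad values to `NaT` keeps the conversion from failing on a single malformed entry and makes those rows easy to find with `isnull()` afterwards.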

## Task 2: Data Cleaning and Preprocessing

After loading the dataset, the next step is **data cleaning**.
Raw data might contain missing values, duplicate entries, or inconsistent information, so cleaning makes the dataset accurate, consistent, and ready for reliable analysis.

In this step we will:
1. Identify and handle missing values.  
2. Remove or correct duplicate records.  
3. Detect and address noisy or inconsistent data (e.g., text formatting, outliers).

These preprocessing operations improve data quality and model performance.


In [None]:
# Step 1: Handling Missing Values

# Check how many missing values are in each column
print("Missing Values per Column:\n")
print(df.isnull().sum())

# Calculate the percentage of missing data in each column
missing_percentage = (df.isnull().sum() / len(df)) * 100
print("\nPercentage of Missing Values:\n")
print(missing_percentage)

In [None]:
# Identify numeric and categorical columns
date_cols = ['Date of Admission', 'Discharge Date']
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns
# The date columns are loaded as text (object dtype); exclude them here so they
# are not mode-imputed, otherwise the dropna step below would never remove anything
categorical_cols = df.select_dtypes(include=['object']).columns.difference(date_cols)

# Fill missing numeric values with median
df[numeric_cols] = df[numeric_cols].fillna(df[numeric_cols].median())

# Fill missing categorical values with mode
for col in categorical_cols:
    df[col] = df[col].fillna(df[col].mode()[0])

# Drop rows that are missing critical date fields
df.dropna(subset=date_cols, inplace=True)

print("Missing values handled successfully.")
print("Remaining missing values:", df.isnull().sum().sum())

### Handling Missing Values
- **Numerical attributes** such as `Age` and `Billing Amount` were imputed using the **median**, which is less sensitive to outliers than the mean.  
- **Categorical attributes** (e.g., `Gender`, `Hospital`, `Insurance Provider`) were imputed using the **mode** (most frequent value).  
- Rows missing **critical date fields** (`Date of Admission`, `Discharge Date`) were dropped because those records cannot support temporal analysis.

This ensures no record has undefined or blank data that could bias models later.
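
To illustrate why the median was preferred over the mean, here is a small sketch with made-up billing values (not taken from the dataset), one of which is extreme and one of which is missing.

```python
import pandas as pd

# Hypothetical billing values with one extreme entry and one missing entry.
billing = pd.Series([1200.0, 1500.0, 1800.0, None, 95000.0])

mean_value = billing.mean()      # pulled upward by the 95,000 entry
median_value = billing.median()  # midpoint of the observed values

# Impute the missing entry with the median, as in the cleaning step.
filled = billing.fillna(median_value)

print(f"mean={mean_value:.2f}, median={median_value:.2f}")
print(filled)
```

With these numbers the mean is 24875.00 while the median is 1650.00, so a median-imputed value stays close to the typical bill.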

In [None]:
# Step 2: Removing Duplicates

# Count duplicate rows before removal
duplicate_count = df.duplicated().sum()
print(f"Number of duplicate rows before removal: {duplicate_count}")

# Remove duplicate rows
df = df.drop_duplicates()

# Verify duplicates removed
print(f"Number of duplicate rows after removal: {df.duplicated().sum()}")


### Removing Duplicates
Duplicate rows can occur when data is merged from multiple sources or logged multiple times.  
All exact duplicates were identified and removed to maintain data integrity.
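
Before dropping duplicates it can help to look at them. The sketch below uses a small invented frame (the names and hospitals are placeholders) to show how `keep=False` lists every copy of a repeated row.

```python
import pandas as pd

# Invented records; the first and third rows are exact duplicates.
records = pd.DataFrame({
    "Name": ["A. Patient", "B. Patient", "A. Patient", "C. Patient"],
    "Age": [34, 58, 34, 41],
    "Hospital": ["North", "South", "North", "North"],
})

# keep=False flags every copy of a duplicated row, not only the later ones.
all_copies = records[records.duplicated(keep=False)]
print(all_copies)

# drop_duplicates keeps the first occurrence of each row by default.
deduplicated = records.drop_duplicates().reset_index(drop=True)
print(len(records), "->", len(deduplicated))
```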

In [None]:
# Step 3: Identifying and Addressing Noisy Data


# Example 1: Standardize categorical text (case and spacing)
df['Gender'] = df['Gender'].str.strip().str.capitalize()
df['Blood Type'] = df['Blood Type'].str.strip().str.upper()
df['Medical Condition'] = df['Medical Condition'].str.strip().str.title()

# Example 2: Detect and handle impossible or extreme numeric values
# Define simple validity checks
df = df[df['Age'].between(0, 120)]  # realistic human age range
df = df[df['Billing Amount'] > 0]   # must be positive

print("Text cleaned and unrealistic numeric values removed.")

### Identifying and Addressing Noisy Data
- **Text Standardization:** Converted inconsistent capitalization and spacing across categorical fields.  
- **Range Checks:** Removed unrealistic values (e.g., negative billing amounts, ages outside 0–120).  
- **Validation:** Re-checked the missing-value and duplicate counts after cleaning; the dataset is now ready for exploration.
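
One way to make the validation step explicit is a small set of named checks. The helper `validate_clean` below is a hypothetical function written for this illustration, and the frame it runs on is invented.

```python
import pandas as pd

def validate_clean(frame: pd.DataFrame) -> dict:
    """Return named data-quality checks mapped to True (passed) or False (failed)."""
    return {
        "no_missing_values": int(frame.isnull().sum().sum()) == 0,
        "no_duplicate_rows": int(frame.duplicated().sum()) == 0,
        "age_in_range": bool(frame["Age"].between(0, 120).all()),
        "billing_positive": bool((frame["Billing Amount"] > 0).all()),
    }

# Invented example of an already-clean frame.
cleaned = pd.DataFrame({
    "Age": [25, 61, 47],
    "Billing Amount": [1820.5, 40210.0, 975.25],
})

checks = validate_clean(cleaned)
print(checks)
```

In the notebook the same function could be called on `df` after Step 3, with an `assert all(checks.values())` to stop execution if any check fails.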

---

### Summary of Cleaning Process
| Step | Action | Reason |
|------|---------|--------|
| 1 | Handled missing values using median/mode | Preserve data completeness |
| 2 | Removed duplicate rows | Ensure unique records |
| 3 | Standardized text and removed invalid values | Improve consistency and reliability |


## Task 3: Exploratory Data Analysis (EDA)

Exploratory Data Analysis (EDA) helps us understand the dataset by:
1. Exploring the **distribution of numerical and categorical features**.
2. Identifying **outliers** in the numerical features.
3. Examining **relationships** between variables.

I will use **Matplotlib** and **Seaborn** for clear and interpretable visualizations.

In [None]:
# Step 1: Visualize Numerical Feature Distributions

import matplotlib.pyplot as plt
import seaborn as sns

# Set Seaborn style
sns.set(style="whitegrid", palette="pastel")

# Plot histogram + boxplot for each numeric column
numeric_cols = df.select_dtypes(include=['int64', 'float64']).columns

for col in numeric_cols:
    plt.figure(figsize=(10,4))
    
    # Histogram with KDE
    plt.subplot(1,2,1)
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    
    # Boxplot to identify outliers
    plt.subplot(1,2,2)
    sns.boxplot(x=df[col])
    plt.title(f'Boxplot of {col}')
    
    plt.tight_layout()
    plt.show()

### Observations – Numerical Features
- **Age**: Most patients are between 20 and 60 years old; any ages outside the 0–120 range were removed during cleaning.  
- **Billing Amount**: Shows high variability with a few large bills.  
- Boxplots complement histograms by clearly showing extreme values that may need capping or normalization.
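
The boxplot whiskers follow the 1.5 × IQR rule by default, and the same rule can be applied numerically to flag or cap extreme values. The sketch below uses invented numbers; capping is shown as one possible option, not as a step performed earlier in this notebook.

```python
import pandas as pd

# Invented values with one obvious extreme point.
values = pd.Series([10, 12, 13, 14, 15, 16, 18, 90], dtype="float64")

# Interquartile range and the usual 1.5 * IQR fences.
q1 = values.quantile(0.25)
q3 = values.quantile(0.75)
iqr = q3 - q1
lower = q1 - 1.5 * iqr
upper = q3 + 1.5 * iqr

# Flag points outside the fences, then cap them at the fence values.
outliers = values[(values < lower) | (values > upper)]
capped = values.clip(lower=lower, upper=upper)

print(f"fences: [{lower}, {upper}]")
print("outliers:", outliers.tolist())
print("max after capping:", capped.max())
```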

In [None]:
# Step 2: Visualize Categorical Feature Counts

categorical_cols = df.select_dtypes(include=['object']).columns

# Plot top 10 categories for each categorical feature (first 6 for readability)
for col in categorical_cols[:6]:
    plt.figure(figsize=(8,4))
    vc = df[col].value_counts().nlargest(10)  # top 10 categories
    sns.barplot(x=vc.values, y=vc.index)
    plt.title(f'Top Categories in {col}')
    plt.xlabel('Count')
    plt.ylabel(col)
    plt.tight_layout()
    plt.show()

### Observations – Categorical Features
- **Gender**: Mostly balanced between Male and Female.  
- **Medical Condition**: Certain conditions are more frequent, which may influence clustering or classification models.  
- **Hospital**: Some hospitals have higher patient counts than others.

In [None]:
# Step 3: Correlation Analysis for Numerical Features

plt.figure(figsize=(10,8))
sns.heatmap(df[numeric_cols].corr(), annot=True, fmt=".2f", cmap='coolwarm')
plt.title('Correlation Heatmap of Numerical Features')
plt.show()

### Observations – Correlations
- **Age vs Billing Amount**: The heatmap shows a positive correlation; older patients tend to have slightly higher billing amounts.  
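
The heatmap values are Pearson correlation coefficients. As a rough illustration of what a single coefficient measures, the sketch below computes it with pandas and by hand on invented numbers, so the result says nothing about the real dataset.

```python
import pandas as pd

# Invented values for illustration only.
toy = pd.DataFrame({
    "Age": [20, 30, 40, 50, 60],
    "Billing Amount": [1000.0, 1500.0, 1400.0, 2100.0, 2000.0],
})

# Pearson r as computed by pandas (the default method of Series.corr).
r_pandas = toy["Age"].corr(toy["Billing Amount"])

# Manual Pearson r: sum of co-deviations divided by the product of the
# square roots of the summed squared deviations.
age_dev = toy["Age"] - toy["Age"].mean()
bill_dev = toy["Billing Amount"] - toy["Billing Amount"].mean()
r_manual = (age_dev * bill_dev).sum() / (
    (age_dev ** 2).sum() ** 0.5 * (bill_dev ** 2).sum() ** 0.5
)

print(round(r_pandas, 3), round(r_manual, 3))
```

Values near 0 indicate little linear relationship, while values near +1 or -1 indicate a strong one.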

In [None]:
# Step 4: Relationship Between Two Variables

if 'Billing Amount' in df.columns and 'Age' in df.columns:
    plt.figure(figsize=(8,5))
    sns.scatterplot(data=df, x='Age', y='Billing Amount', hue='Gender', alpha=0.6)
    plt.title('Billing Amount vs Age by Gender')
    plt.xlabel('Age')
    plt.ylabel('Billing Amount')
    plt.legend(title='Gender')
    plt.show()


## Task 4: Insights from EDA

After performing exploratory data analysis, I identified the following insights.

### Key Insights:

1. **Age and Billing Amount Relationship**
   - Older patients tend to have higher billing amounts, indicating a **positive correlation**.

2. **Hospital and Medical Condition Patterns**
   - Certain hospitals treat more patients and specific medical conditions are more frequent.
   - These categorical variables can be encoded and used for **classification or clustering** models to identify patient segments.

3. **Billing Variability and Outliers**
   - Billing amounts show high variability with extreme values (outliers). The cleaning step removed only non-positive amounts; large bills were not capped.
   - Capping or transforming these extreme values before modeling can help stabilize regression predictions.

4. **Gender Differences**
   - Small differences in billing patterns exist across genders.
   - Gender may be used as a **categorical predictor** in regression or classification tasks.

5. **Categorical Feature Frequencies**
   - Features like `Medical Condition`, `Insurance Provider`, and `Hospital` have **imbalances** in counts.

6. **Potential for Clustering**
   - Patients could potentially be grouped by combinations of `Age`, `Billing Amount`, `Medical Condition`, and `Hospital` for segment analysis.
   - Clustering algorithms (e.g., KMeans) may reveal patterns for operational or cost analysis.
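
Distance-based clustering needs numeric inputs on comparable scales, so the categorical and numerical features would have to be encoded and scaled first. The following is a minimal pandas-only sketch on invented rows; the choice of z-score scaling and one-hot encoding is an assumption, not a decision made earlier in this notebook.

```python
import pandas as pd

# Invented patients for illustration only.
patients = pd.DataFrame({
    "Age": [25, 40, 55, 70],
    "Billing Amount": [1000.0, 3000.0, 5000.0, 7000.0],
    "Medical Condition": ["Asthma", "Diabetes", "Asthma", "Cancer"],
})

# Z-score scaling so Age and Billing Amount contribute on the same scale.
numeric = patients[["Age", "Billing Amount"]]
scaled = (numeric - numeric.mean()) / numeric.std(ddof=0)

# One-hot encode the categorical column into 0/1 indicator columns.
encoded = pd.get_dummies(
    patients["Medical Condition"], prefix="Condition", dtype=float
)

# Combined feature matrix that a clustering algorithm could consume.
features = pd.concat([scaled, encoded], axis=1)
print(features)
```

A matrix like `features` could then be passed to a clustering algorithm such as scikit-learn's `KMeans`.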

### How These Insights Guide Future Steps:

| EDA Insight | Modeling Implication |
|-------------|-------------------|
| Age vs Billing correlation | Use Age as predictor in regression tasks |
| Frequent medical conditions | Encode categorical features for classification/clustering |
| High billing variability | Handle outliers (e.g., capping) to help stabilize regression predictions |
| Gender differences | Include Gender as a categorical predictor |
| Imbalanced categorical features | Consider stratified sampling or weighting |
| Potential clusters | Apply clustering for patient segmentation or grouping |
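
As a sketch of the stratified sampling mentioned in the table, the example below draws a hold-out set that preserves category proportions using pandas only. The data are invented, and the 50% fraction is chosen purely to keep the numbers easy to follow.

```python
import pandas as pd

# Invented, deliberately imbalanced data: 8 Asthma rows and 2 Cancer rows.
data = pd.DataFrame({
    "Medical Condition": ["Asthma"] * 8 + ["Cancer"] * 2,
    "Billing Amount": range(10),
})

# Sample the same fraction within each category so proportions are preserved.
holdout = data.groupby("Medical Condition", group_keys=False).sample(
    frac=0.5, random_state=42
)

# Everything not drawn into the hold-out set stays in the training set.
train = data.drop(holdout.index)

print(holdout["Medical Condition"].value_counts())
print(train["Medical Condition"].value_counts())
```

Both splits keep the original 4:1 ratio between the two conditions, which a purely random split of such a small, imbalanced sample would not guarantee.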

