### Project Title: A Comprehensive Synthetic Healthcare Dataset for Classification and Analysis
#### Done By: Nozipho Sithembiso Ndebele & Tikedzani Geraldine Vele
---

<div style="text-align: center;">
<img src="https://res.cloudinary.com/healthmanagement-org/image/upload/c_thumb,f_auto,fl_lossy,q_90/v1730089956/cw/00128563_cw_image_wi_ac6a71f3514573255034c655004735eb.webp" alt="Healthcare banner image" width="1000"/>
</div>

---

## Table of Contents

<a href=#BC> Background Context</a>

<a href=#one>1. Importing Packages</a>

<a href=#two>2. Data Collection and Description</a>

<a href=#three>3. Loading Data </a>

<a href=#four>4. Data Cleaning and Filtering</a>

<a href=#five>5. Exploratory Data Analysis (EDA)</a>

<a href=#six>6. Modeling </a>

<a href=#seven>7. Evaluation and Validation</a>

<a href=#eight>8. Final Model</a>

<a href=#nine>9. Conclusion and Future Work</a>

<a href=#ten>10. References</a>

---
 <a id="BC"></a>
## **Background Context**

### Purpose
This project aims to analyze a synthetic healthcare dataset to gain insights into patient demographics, medical conditions, hospital admissions, treatments, and billing. Using data cleaning, visualization, and classification techniques, the project explores patterns in diagnoses, admission types, hospital stays, and test results.

### Significance
Understanding healthcare data is crucial for improving patient care, medical research, and healthcare services. This project aims to:

* Identify trends in disease prevalence across different age groups, genders, and geographic locations.
* Explore the effectiveness of treatments and healthcare interventions.
* Gain insights into the socio-economic factors influencing health outcomes.
* Identify patterns in patient visits, hospital stays, and health conditions over time.

These insights can guide healthcare practitioners, researchers, and policymakers in making data-driven decisions that enhance patient care and improve the overall healthcare system.

### Problem Domain
Healthcare datasets contain vast amounts of patient information, including demographics, medical history, lab results, and treatment details. Effectively interpreting this data requires advanced analysis techniques to uncover hidden patterns and correlations that can improve medical decision-making and public health strategies.

### Challenges
* Data Complexity: Healthcare datasets often contain a mix of structured (e.g., numeric) and unstructured data (e.g., text notes), requiring careful preprocessing and feature extraction.
* Missing Data: Many healthcare datasets may have missing values or incomplete records, which could impact the accuracy of the analysis.
* Data Privacy and Ethics: Working with sensitive patient data necessitates careful handling to ensure compliance with data privacy regulations (e.g., HIPAA).
* Healthcare Variability: Patient conditions and outcomes can vary widely due to many factors, including socio-economic status, genetics, and environmental influences.

### Key Questions
* What are the most common diseases and health conditions in the dataset across various demographic groups?
* How do treatment regimens and hospital stays influence patient recovery and health outcomes?
* What socio-economic factors (e.g., income, education, and location) correlate with health conditions and medical outcomes?
* Are there trends in the data that indicate a relationship between lifestyle factors (e.g., exercise, diet) and the development of chronic diseases?
* How do patient behaviors (e.g., adherence to medications, frequency of visits) affect health outcomes?

---
<a id="one"></a>
## **Importing Packages**

### Purpose
To set up the Python environment with the necessary libraries for data manipulation, visualization, and machine learning. These libraries will facilitate data preprocessing, feature extraction, model training, and evaluation.

### Details
* Pandas: For handling and analyzing data.

* NumPy: For numerical operations.

* Matplotlib/Seaborn/Plotly: For static and interactive data visualization to understand trends and patterns.

* scikit-learn: For building and evaluating machine learning models.

* NLTK/spaCy: For text preprocessing and natural language processing tasks.

---

In [2]:
# Import necessary packages  

# Data manipulation and analysis  
import pandas as pd  # Pandas for data handling  
import numpy as np  # NumPy for numerical operations  

# Data visualization  
import matplotlib.pyplot as plt  # Matplotlib for static plots  
import seaborn as sns  # Seaborn for statistical visualization  
import plotly.express as px  # Plotly for interactive plots  

# Natural Language Processing  
import nltk  # Natural Language Toolkit  
from nltk.corpus import stopwords  # Stopword removal  
from nltk.tokenize import word_tokenize  # Tokenization  
import re  # Regular expressions for text cleaning  

# Configure visualization settings
sns.set(style='whitegrid')  # Set the default style for Seaborn plots
plt.rcParams['figure.figsize'] = (10, 6)  # Set default figure size for Matplotlib

# Suppress warnings
import warnings  # Import the warnings module
warnings.filterwarnings('ignore')  # Ignore all warning messages




---
<a id="two"></a>
## **Data Collection and Description**
### Purpose
This section describes how the healthcare dataset was collected and provides insights into its structure. Understanding the dataset's composition is essential for effective data analysis, interpretation, and decision-making.

### Details
The dataset includes patient information, hospital admission records, and medical service details, providing valuable insights into healthcare trends, diagnoses, treatment outcomes, and patient demographics.

* Source: The dataset is a synthetic healthcare dataset created for analysis purposes. It is available from open-source repositories such as Kaggle and is intended for data science, machine learning, and healthcare analytics applications.

* Method of Collection: Data is generated and recorded based on hypothetical healthcare scenarios. It includes various patient records that contain information about their medical conditions, admission details, doctor assignments, and medical treatments received during hospitalization.

* Size: The raw dataset contains 55,500 records and 15 columns, representing patients with diverse medical conditions, treatment outcomes, and healthcare services. This size supports a variety of analysis tasks, including classification, regression, and clustering.

* Scope: This dataset covers several dimensions of healthcare, including:

  * Patient demographics (e.g., age, gender, blood type)
  * Admission details (e.g., date of admission, admission type, room number)
  * Medical conditions and diagnoses (e.g., diabetes, hypertension, asthma)
  * Treatment details (e.g., medications, test results)
  * Billing information (e.g., billing amounts, insurance providers)
* Types of Data:

  * Name: Represents the name of the patient associated with the healthcare record.
  * Age: The age of the patient at the time of admission, expressed in years.
  * Gender: The gender of the patient, either "Male" or "Female."
  * Blood Type: The patient's blood type (e.g., "A+", "O-").
  * Medical Condition: Specifies the primary diagnosis of the patient (e.g., "Diabetes," "Hypertension").
  * Date of Admission: The date on which the patient was admitted to the healthcare facility.
  * Doctor: The name of the doctor responsible for the patient’s care during admission.
  * Hospital: The name of the hospital where the patient was admitted.
  * Insurance Provider: The patient’s insurance provider (e.g., "Aetna," "Blue Cross").
  * Billing Amount: The total amount billed for healthcare services during admission.
  * Room Number: The room number where the patient stayed during their hospitalization.
  * Admission Type: Indicates the nature of the admission (e.g., "Emergency," "Elective").
  * Discharge Date: The date when the patient was discharged from the healthcare facility.
  * Medication: Medications administered to the patient during their admission (e.g., "Aspirin," "Ibuprofen").
  * Test Results: The result of medical tests conducted during the admission (e.g., "Normal," "Abnormal").
  
By leveraging this dataset, we can perform detailed analysis and gain insights into various aspects of healthcare, such as patient demographics, treatment effectiveness, and hospital resource usage.

---
<a id="three"></a>
## **Loading Data**
### Purpose
The purpose of this section is to load the dataset into the notebook for further manipulation and analysis. This is the first step in working with the data, and it allows us to inspect the raw data and get a sense of its structure.

### Details
In this section, we will load the dataset into a Pandas DataFrame and display the first few rows to understand what the raw data looks like. This will help in planning the next steps of data cleaning and analysis.


---

In [3]:
# Load the dataset into a Pandas DataFrame

# The dataset is stored in a CSV file named 'healthcare_dataset.csv'
df = pd.read_csv('healthcare_dataset.csv')

In [4]:
# df is the original dataset (DataFrame), this creates a copy of it
df_copy = df.copy()

# Now 'df_copy' is an independent copy of 'df'. Changes to 'df_copy' won't affect 'df'.


In [5]:
# Display the first few rows of the dataset to get a sense of what the raw data looks like
df_copy.head()

| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal |
| 1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
| 2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal |
| 3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal |
| 4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal |


In [6]:
# Display the number of rows and columns in the dataset to understand its size
df_copy.shape

(55500, 15)

In [7]:
# Check the structure of the dataset
df_copy.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
memory usage: 6.4+ MB
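
The summary above lists `Date of Admission` and `Discharge Date` as `object` columns, meaning they are stored as text rather than as dates. A minimal sketch of one way to convert them and derive a stay duration follows. It uses three rows copied from the preview, and the `length_of_stay_days` column name is an assumption introduced here for illustration, not part of the original notebook.

```python
import pandas as pd

# Toy rows copied from the preview; column names match the raw dataset.
sample = pd.DataFrame({
    "Date of Admission": ["2024-01-31", "2019-08-20", "2022-09-22"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2022-10-07"],
})

# Convert the text columns to datetime64. errors="coerce" turns any value
# that does not match the format into NaT instead of raising an error.
for col in ["Date of Admission", "Discharge Date"]:
    sample[col] = pd.to_datetime(sample[col], format="%Y-%m-%d", errors="coerce")

# Hypothetical derived column: length of stay in whole days.
sample["length_of_stay_days"] = (
    sample["Discharge Date"] - sample["Date of Admission"]
).dt.days

print(sample.dtypes)
print(sample["length_of_stay_days"].tolist())
```

With real data, it may be worth counting the `NaT` values after conversion, since coerced failures are silent.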

---
<a id="four"></a>
## **Data Cleaning and Filtering**
### Purpose
Before analyzing the data, it is crucial to clean and filter it. This process involves handling missing values, removing outliers, correcting errors, and possibly reducing the data by filtering out irrelevant features. These steps ensure that the analysis is based on accurate and reliable data.

### Details
In this section, we will:

* Check for Missing Values: Identify if there are any missing values in the dataset and handle them accordingly.
* Remove Duplicates: Ensure there are no duplicate rows that could skew the analysis.
* Correct Errors: Look for and correct any obvious data entry errors.
* Filter Data: Depending on the analysis requirements, filter the data to include only relevant records.
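
The "Correct Errors" step can be illustrated with the text columns seen in the data preview, where names use inconsistent capitalization (for example `Bobby JacksOn`) and one hospital name ends in a stray comma. The sketch below is one possible approach, not the notebook's own method. The helper name `tidy_text_columns` is hypothetical, and the rows are copied from the preview.

```python
import pandas as pd

# Toy rows copied from the preview of the raw dataset.
sample = pd.DataFrame({
    "Name": ["Bobby JacksOn", "LesLie TErRy", "andrEw waTtS"],
    "Hospital": ["Sons and Miller", "Kim Inc", "Hernandez Rogers and Vang,"],
})


def tidy_text_columns(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with normalized name casing and trimmed hospital names."""
    out = df.copy()
    # Title-case each name: "andrEw waTtS" -> "Andrew Watts".
    out["Name"] = out["Name"].str.strip().str.title()
    # Drop surrounding whitespace and any trailing comma.
    out["Hospital"] = out["Hospital"].str.strip().str.rstrip(",").str.strip()
    return out


cleaned = tidy_text_columns(sample)
print(cleaned)
```

Title-casing is a simplification: it would change a name such as "McDonald" to "Mcdonald", so the result should be spot-checked before it is applied to the full dataset.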

In [8]:
# 1. Check for missing values in the dataset

def check_missing_values(df):
    """
    Check for missing values in the dataset and display the number of missing values per column.

    Parameters:
    df (pandas.DataFrame): The dataset to check for missing values.

    Returns:
    pandas.Series: A series showing the number of missing values for each column.
    """
    # Count missing values per column and display them
    print("Missing values per column:")
    missing_values = df.isnull().sum()
    print(missing_values)
    return missing_values


In [9]:
# Check the working copy of the dataset for missing values
missing_values = check_missing_values(df_copy)


Missing values per column:
Name                  0
Age                   0
Gender                0
Blood Type            0
Medical Condition     0
Date of Admission     0
Doctor                0
Hospital              0
Insurance Provider    0
Billing Amount        0
Room Number           0
Admission Type        0
Discharge Date        0
Medication            0
Test Results          0
dtype: int64


After examining the dataset, no missing values were found across any of the columns. This ensures data completeness and eliminates the need for imputation or further cleaning related to missing data.
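
One caveat is that `isnull()` only detects true nulls (`NaN`/`None`). Blank strings or placeholder text such as "N/A" would not be counted. The sketch below shows one way to check for such hidden gaps. The `PLACEHOLDERS` list and the `count_hidden_missing` helper are assumptions introduced for illustration, and the toy values are made up rather than taken from the dataset.

```python
import pandas as pd

# Assumed placeholder spellings; adjust to whatever the data actually uses.
PLACEHOLDERS = ["", "na", "n/a", "none", "null", "unknown", "?"]


def count_hidden_missing(df: pd.DataFrame) -> pd.Series:
    """Count placeholder-like values in each non-numeric column."""
    counts = {}
    for col in df.columns:
        if pd.api.types.is_numeric_dtype(df[col]):
            continue  # placeholders are a text-column concern
        values = df[col].astype("string").str.strip().str.lower()
        counts[col] = int(values.isin(PLACEHOLDERS).sum())
    return pd.Series(counts, dtype="int64")


# Made-up rows for illustration only.
sample = pd.DataFrame({
    "Medication": ["Aspirin", " ", "N/A", "Ibuprofen"],
    "Test Results": ["Normal", "unknown", "Abnormal", "Normal"],
    "Age": [30, 62, 76, 28],
})

print(sample.isnull().sum())          # reports zero nulls everywhere
hidden = count_hidden_missing(sample)
print(hidden)                         # reveals the placeholder values
```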


In [None]:
# 2. Check for duplicate rows in the dataset
def remove_duplicates(df):
    """
    Checks for duplicate rows in the dataset and removes them if any are found.

    Args:
    df (pandas.DataFrame): The dataframe to check for duplicate rows.

    Returns:
    pandas.DataFrame: The dataframe with duplicate rows removed, if any existed.
    """
    # Check for duplicate rows
    duplicate_rows = df.duplicated().sum()
    print(f"\nNumber of duplicate rows: {duplicate_rows}")
    
    # Remove duplicates if any exist
    if duplicate_rows > 0:
        df.drop_duplicates(inplace=True)
        print(f"Duplicate rows removed. Updated dataframe has {len(df)} rows.")
    else:
        print("No duplicate rows found.")
    
    return df

In [11]:
df_copy = remove_duplicates(df_copy)


Number of duplicate rows: 534
Duplicate rows removed. Updated dataframe has 54966 rows.


Upon reviewing the dataset, 534 duplicate rows were found and removed. This ensures that all records are unique, and no further action is required for data deduplication.

In [None]:
# 3. Column Renaming for PEP 8 Compliance

def rename_columns_to_snake_case(df: pd.DataFrame) -> pd.DataFrame:
    """
    Renames the columns of a DataFrame to follow PEP 8 naming conventions (snake_case).
    
    Parameters:
        df (pd.DataFrame): The input DataFrame with original column names.
    
    Returns:
        pd.DataFrame: A DataFrame with updated column names.
    """
    # Dictionary mapping original column names to snake_case format
    column_mapping = {
        "Name": "name",
        "Age": "age",
        "Gender": "gender",
        "Blood Type": "blood_type",
        "Medical Condition": "medical_condition",
        "Date of Admission": "date_of_admission",
        "Doctor": "doctor",
        "Hospital": "hospital",
        "Insurance Provider": "insurance_provider",
        "Billing Amount": "billing_amount",
        "Room Number": "room_number",
        "Admission Type": "admission_type",
        "Discharge Date": "discharge_date",
        "Medication": "medication",
        "Test Results": "test_results",
    }

    # Rename columns
    df = df.rename(columns=column_mapping)

    return df


# Apply the function
df_copy = rename_columns_to_snake_case(df_copy)

# Display updated columns
print(df_copy.columns)


Index(['name', 'age', 'gender', 'blood_type', 'medical_condition',
       'date_of_admission', 'doctor', 'hospital', 'insurance_provider',
       'billing_amount', 'room_number', 'admission_type', 'discharge_date',
       'medication', 'test_results'],
      dtype='object')


We rename dataset columns to follow **PEP 8** naming conventions, which recommend using **snake_case** for variable names. This improves readability, consistency, and aligns with best practices in Python.  

This ensures easier access to DataFrame columns while maintaining code clarity and consistency. 
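
As an alternative to a hand-written mapping, the same renaming can be derived from the column names themselves. The following is a minimal sketch under the assumption that column names consist of words separated by spaces or punctuation; the `to_snake_case` helper is hypothetical and does not split camelCase names.

```python
import re

import pandas as pd


def to_snake_case(name: str) -> str:
    """Convert a label such as 'Date of Admission' to 'date_of_admission'."""
    # Collapse every run of non-alphanumeric characters into one underscore.
    collapsed = re.sub(r"[^0-9a-zA-Z]+", "_", name.strip())
    return collapsed.strip("_").lower()


# Empty frame carrying a few of the original column names.
sample = pd.DataFrame(columns=["Blood Type", "Date of Admission", "Test Results", "Age"])

# DataFrame.rename accepts a function and applies it to every column label.
renamed = sample.rename(columns=to_snake_case)
print(list(renamed.columns))
```

The explicit dictionary has the advantage of documenting every expected column; the function-based version adapts automatically if columns are added.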

## **Saving the Cleaned Dataset**
### Purpose

This section outlines how to save the cleaned dataset for future use. Saving the dataset ensures that the data cleaning process does not need to be repeated and allows for consistent use in subsequent analyses.

### Details

We will save the cleaned dataset as a CSV file.

In [None]:
# 4. Save the cleaned dataset to a new CSV file

def save_cleaned_dataset(df, filename='cleaned_healthcare_dataset.csv'):
    """
    Saves the cleaned dataframe to a CSV file.

    Args:
    df (pandas.DataFrame): The cleaned dataframe to save.
    filename (str): The name of the file to save the dataframe to (default is 'cleaned_healthcare_dataset.csv').

    Returns:
    None
    """
    # Save the cleaned dataset to a CSV file
    df.to_csv(filename, index=False)
    print(f"Cleaned dataset saved successfully as '{filename}'.")


In [16]:
save_cleaned_dataset(df_copy)


Cleaned dataset saved successfully as 'cleaned_healthcare_dataset.csv'.
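
A quick way to gain confidence in the saved file is to reload it and compare it with the in-memory DataFrame. The sketch below demonstrates the idea on a small made-up frame written to a temporary directory; the file name `cleaned_sample.csv` is an assumption used only for this example.

```python
import os
import tempfile

import pandas as pd

# Small stand-in for the cleaned dataset.
sample = pd.DataFrame({
    "name": ["Bobby JacksOn", "LesLie TErRy"],
    "age": [30, 62],
    "billing_amount": [18856.281306, 33643.327287],
})

with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "cleaned_sample.csv")
    sample.to_csv(path, index=False)   # same call the save function uses
    reloaded = pd.read_csv(path)

# Raises AssertionError if columns, dtypes, or values differ
# (floats are compared with a small tolerance by default).
pd.testing.assert_frame_equal(sample, reloaded)
print("Round trip OK:", reloaded.shape)
```

CSV files do not store data types, so any column converted to `datetime64` before saving is read back as text unless `parse_dates` is passed to `pd.read_csv`.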


---
<a id="five"></a>
## **Exploratory Data Analysis (EDA)**

Exploratory data analysis (EDA) is the process of analyzing a dataset to summarize its key features, often through visualization. It aims to discover patterns, spot anomalies, and formulate hypotheses that inform subsequent analysis.
#### Advantages

- Helps in understanding the data before modeling.
- Provides insights that guide feature selection and engineering.
- Assists in choosing appropriate modeling techniques.
- Uncovers potential data quality issues early.
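
As a rough illustration of the kind of summaries EDA typically starts with, the sketch below computes category counts, grouped billing statistics, and a cross-tabulation. It is an example under stated assumptions, not the analysis performed in this project: the five rows are taken from the data preview with billing amounts rounded to two decimals, and the column names follow the snake_case renaming.

```python
import pandas as pd

# Five rows from the preview, with billing amounts rounded for brevity.
sample = pd.DataFrame({
    "medical_condition": ["Cancer", "Obesity", "Obesity", "Diabetes", "Cancer"],
    "admission_type": ["Urgent", "Emergency", "Emergency", "Elective", "Urgent"],
    "billing_amount": [18856.28, 33643.33, 27955.10, 37909.78, 14238.32],
})

# How often does each condition appear?
condition_counts = sample["medical_condition"].value_counts()

# Billing statistics per condition.
billing_by_condition = (
    sample.groupby("medical_condition")["billing_amount"]
    .agg(["count", "mean", "median"])
    .round(2)
)

# Admission type broken down by condition.
admission_by_condition = pd.crosstab(
    sample["medical_condition"], sample["admission_type"]
)

print(condition_counts)
print(billing_by_condition)
print(admission_by_condition)
```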

`The following methods were employed to communicate our objective:`



---


---
<a id="nine"></a>
## **Conclusion and Future Work**


##### Conclusion



##### Future Work

To build upon this study, future work could focus on the following areas:



---
<a id="ten"></a>
## **References**

## Additional Sections to Consider

**Contributors**: Nozipho Sithembiso Ndebele & Thabisisle Xaba
