
## **Project Context**

**Project Title:** End-to-End Deep Learning for Medical Test Result Prediction

**Project Objective:** To build a Deep Neural Network (DNN) model using Keras for multi-class classification on the 'Test Results' column.

**Dataset:** "Healthcare Dataset" from Kaggle. (Link: https://www.kaggle.com/datasets/prasad22/healthcare-dataset/data)

**Target Column:** Test Results (has 3 categories: 'Normal', 'Abnormal', 'Inconclusive').

**Key Technologies:** Python, Pandas, NumPy, Matplotlib, Seaborn, Scikit-learn, TensorFlow (Keras).

## **Part 0: Environment Setup**

This section sets up the working environment: it imports the libraries used for data manipulation, visualization, and deep learning, then configures display options and fixes the random seeds for reproducibility.

In [9]:
# 0.1. Import Core Libraries

# Data Manipulation
import pandas as pd
import numpy as np

# Visualization
import matplotlib.pyplot as plt
import seaborn as sns

# Deep Learning
import tensorflow as tf
from tensorflow.keras.models import Sequential
from tensorflow.keras.layers import Dense, Dropout

# Machine Learning Utilities
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline
from sklearn.metrics import classification_report, confusion_matrix

In [10]:
# 0.2. Configuration and Helper Functions

# Set visualization style
sns.set_style('whitegrid')

# Set pandas options to display all columns
pd.set_option('display.max_columns', None)

# Set random seeds for reproducibility
np.random.seed(42)
tf.random.set_seed(42)
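
The seed calls above cover NumPy and TensorFlow, but not Python's built-in `random` module. As a minimal sketch, the helper below seeds all three in one place. The name `set_global_seed` is hypothetical and not part of the original notebook, and the TensorFlow import is guarded so the helper also runs where TensorFlow is not installed. Fully deterministic training on a GPU may need additional settings that this sketch does not attempt.

```python
import random

import numpy as np


def set_global_seed(seed: int = 42) -> None:
    """Seed the Python, NumPy and (if installed) TensorFlow random generators.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    random.seed(seed)
    np.random.seed(seed)
    try:
        import tensorflow as tf
    except ImportError:
        # TensorFlow is optional for this sketch.
        return
    tf.random.set_seed(seed)


# Re-seeding with the same value reproduces the same draws.
set_global_seed(42)
first = (random.random(), np.random.rand())
set_global_seed(42)
second = (random.random(), np.random.rand())
print(first == second)
```

Calling the helper once at the top of the notebook keeps all seeding in a single, easy-to-change place.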

## **Part 1: Data Loading & Initial Inspection** 🚀

Here, we load the dataset and take a high-level first look at its structure, size, and content.

In [11]:
# 1.1. Load Dataset

# Load the dataset from the provided URL into a pandas DataFrame
# This dataset is hosted on GitHub, so we can read it directly using the raw content URL.
df = pd.read_csv('https://raw.githubusercontent.com/LatiefDataVisionary/healthcare-test-results-prediction/refs/heads/main/data/raw/healthcare_dataset.csv')

In [12]:
# 1.2. Initial Inspection

# Display the first 5 rows of the DataFrame to get a glimpse of the data
print("First 5 rows of the DataFrame:")
display(df.head())

# Display concise summary of the DataFrame, including data types and non-null values
print("\nDataFrame Information:")
df.info()

# Display the dimensions (number of rows and columns) of the DataFrame
print("\nDataFrame Shape:")
print(df.shape)

First 5 rows of the DataFrame:


| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | Bobby JacksOn | 30 | Male | B- | Cancer | 2024-01-31 | Matthew Smith | Sons and Miller | Blue Cross | 18856.281306 | 328 | Urgent | 2024-02-02 | Paracetamol | Normal |
| 1 | LesLie TErRy | 62 | Male | A+ | Obesity | 2019-08-20 | Samantha Davies | Kim Inc | Medicare | 33643.327287 | 265 | Emergency | 2019-08-26 | Ibuprofen | Inconclusive |
| 2 | DaNnY sMitH | 76 | Female | A- | Obesity | 2022-09-22 | Tiffany Mitchell | Cook PLC | Aetna | 27955.096079 | 205 | Emergency | 2022-10-07 | Aspirin | Normal |
| 3 | andrEw waTtS | 28 | Female | O+ | Diabetes | 2020-11-18 | Kevin Wells | Hernandez Rogers and Vang, | Medicare | 37909.78241 | 450 | Elective | 2020-12-18 | Ibuprofen | Abnormal |
| 4 | adrIENNE bEll | 43 | Female | AB+ | Cancer | 2022-09-19 | Kathleen Hanna | White-White | Aetna | 14238.317814 | 458 | Urgent | 2022-10-09 | Penicillin | Abnormal |



DataFrame Information:
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 55500 entries, 0 to 55499
Data columns (total 15 columns):
 #   Column              Non-Null Count  Dtype  
---  ------              --------------  -----  
 0   Name                55500 non-null  object 
 1   Age                 55500 non-null  int64  
 2   Gender              55500 non-null  object 
 3   Blood Type          55500 non-null  object 
 4   Medical Condition   55500 non-null  object 
 5   Date of Admission   55500 non-null  object 
 6   Doctor              55500 non-null  object 
 7   Hospital            55500 non-null  object 
 8   Insurance Provider  55500 non-null  object 
 9   Billing Amount      55500 non-null  float64
 10  Room Number         55500 non-null  int64  
 11  Admission Type      55500 non-null  object 
 12  Discharge Date      55500 non-null  object 
 13  Medication          55500 non-null  object 
 14  Test Results        55500 non-null  object 
dtypes: float64(1), int64(2), object(12)
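
Three things stand out in the output above: the `Name` values have irregular casing, at least one `Hospital` value ends with a stray comma, and both date columns are stored as `object` rather than as datetimes. The following is a minimal cleanup sketch on three rows copied from the `head()` output, not a step taken from the original notebook. The `Length of Stay` column is a hypothetical derived feature, added only to show what the datetime conversion makes possible. Note that `str.title()` is a blunt tool and would also alter names such as "McDonald".

```python
import pandas as pd

# Three rows copied from the head() output (indices 0, 1 and 3).
sample = pd.DataFrame({
    "Name": ["Bobby JacksOn", "LesLie TErRy", "andrEw waTtS"],
    "Hospital": ["Sons and Miller", "Kim Inc", "Hernandez Rogers and Vang,"],
    "Date of Admission": ["2024-01-31", "2019-08-20", "2020-11-18"],
    "Discharge Date": ["2024-02-02", "2019-08-26", "2020-12-18"],
})

# Normalize name casing and strip stray spaces/commas from hospital names.
cleaned = sample.assign(
    Name=sample["Name"].str.title(),
    Hospital=sample["Hospital"].str.strip(" ,"),
)

# Convert the date strings (ISO format in this dataset) to datetime64.
for col in ["Date of Admission", "Discharge Date"]:
    cleaned[col] = pd.to_datetime(cleaned[col], format="%Y-%m-%d")

# Hypothetical derived feature: number of days between admission and discharge.
cleaned["Length of Stay"] = (
    cleaned["Discharge Date"] - cleaned["Date of Admission"]
).dt.days

print(cleaned)
```

On the full dataset the same operations would apply to `df` directly. Whether `Name` and `Hospital` are kept as model features at all is a separate decision.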

In [13]:
# 1.3. Statistical Summary

# Display statistical summary of the DataFrame, including descriptive statistics for all columns
print("\nStatistical Summary of the DataFrame:")
display(df.describe(include='all'))


Statistical Summary of the DataFrame:


| | Name | Age | Gender | Blood Type | Medical Condition | Date of Admission | Doctor | Hospital | Insurance Provider | Billing Amount | Room Number | Admission Type | Discharge Date | Medication | Test Results |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| count | 55500 | 55500.0 | 55500 | 55500 | 55500 | 55500 | 55500 | 55500 | 55500 | 55500.0 | 55500.0 | 55500 | 55500 | 55500 | 55500 |
| unique | 49992 | | 2 | 8 | 6 | 1827 | 40341 | 39876 | 5 | | | 3 | 1856 | 5 | 3 |
| top | DAvId muNoZ | | Male | A- | Arthritis | 2024-03-16 | Michael Smith | LLC Smith | Cigna | | | Elective | 2020-03-15 | Lipitor | Abnormal |
| freq | 3 | | 27774 | 6969 | 9308 | 50 | 27 | 44 | 11249 | | | 18655 | 53 | 11140 | 18627 |
| mean | | 51.539459 | | | | | | | | 25539.316097 | 301.134829 | | | | |
| std | | 19.602454 | | | | | | | | 14211.454431 | 115.243069 | | | | |
| min | | 13.0 | | | | | | | | -2008.49214 | 101.0 | | | | |
| 25% | | 35.0 | | | | | | | | 13241.224652 | 202.0 | | | | |
| 50% | | 52.0 | | | | | | | | 25538.069376 | 302.0 | | | | |
| 75% | | 68.0 | | | | | | | | 37820.508436 | 401.0 | | | | |
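
Two details in this summary are worth a follow-up check. The minimum `Billing Amount` is negative (-2008.49), and the most frequent target class, 'Abnormal', covers 18627 of 55500 rows (about 33.6%), which means the three classes are close to balanced. The sketch below shows one way to quantify both points. The helper name `quick_checks` and the four-row `toy` frame are hypothetical and serve only to illustrate; on the real data the function would be called as `quick_checks(df)`.

```python
import pandas as pd


def quick_checks(frame: pd.DataFrame) -> dict:
    """Count negative billing amounts and report the share of each target class.

    Hypothetical helper for illustration; not part of the original notebook.
    """
    negative_rows = int((frame["Billing Amount"] < 0).sum())
    class_share = (
        frame["Test Results"].value_counts(normalize=True).round(3).to_dict()
    )
    return {"negative_billing_rows": negative_rows, "class_share": class_share}


# Toy data with the same column names as the dataset (values are illustrative).
toy = pd.DataFrame({
    "Billing Amount": [18856.28, -2008.49, 27955.10, 37909.78],
    "Test Results": ["Normal", "Abnormal", "Normal", "Inconclusive"],
})

report = quick_checks(toy)
print(report)
```

How to treat negative billing amounts (keep, clip, or drop) depends on what they represent in the source data, so the sketch only counts them.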
