## Overview

This project demonstrates an end-to-end approach to healthcare data analytics by:
- **Designing and managing** a relational database (PostgreSQL) for healthcare data.
- **Writing and optimizing SQL queries** to generate insights (e.g., readmission rates, common diagnoses).
- **Performing data analysis** with Python for data cleaning, visualization, and predictive modeling.
- **Building interactive dashboards** using Power BI to communicate key findings.


## Features

- **SQL Queries:** Examples of advanced JOINs, window functions, and aggregations on patient data.
- **Python EDA & Modeling:** Jupyter notebooks detailing data wrangling, exploratory analysis, and a sample logistic regression model for readmission prediction.
- **Data Simulation:** Use of Python Faker to generate realistic dummy datasets for demonstration and training.
- **Dashboard:** An interactive Power BI dashboard showcasing KPI trends (readmissions, length of stay, etc.).
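
As an illustration of the window-function queries mentioned above, the sketch below flags encounters that begin within 30 days of the same patient's previous discharge. It is not the project's actual query: an in-memory SQLite database stands in for PostgreSQL so the snippet runs on its own, the four rows are made up, and the `encounters` table layout is an assumption based on the column names that appear later in this README.

```python
import sqlite3

# In-memory SQLite stands in for PostgreSQL; the table layout and rows are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE encounters ("
    " encounter_id TEXT PRIMARY KEY,"
    " patient_id INTEGER,"
    " admission_date TEXT,"
    " discharge_date TEXT)"
)
conn.executemany(
    "INSERT INTO encounters VALUES (?, ?, ?, ?)",
    [
        ("E1", 1, "2024-01-01", "2024-01-05"),
        ("E2", 1, "2024-01-20", "2024-01-22"),  # 15 days after E1's discharge
        ("E3", 2, "2024-02-01", "2024-02-03"),
        ("E4", 2, "2024-04-15", "2024-04-18"),  # 72 days after E3's discharge
    ],
)

# LAG() looks back at the previous discharge date within each patient's history.
READMISSION_SQL = """
WITH ordered AS (
    SELECT
        patient_id,
        encounter_id,
        admission_date,
        LAG(discharge_date) OVER (
            PARTITION BY patient_id ORDER BY admission_date
        ) AS prev_discharge
    FROM encounters
)
SELECT
    encounter_id,
    CAST(julianday(admission_date) - julianday(prev_discharge) AS INTEGER)
        AS days_since_discharge
FROM ordered
WHERE prev_discharge IS NOT NULL
ORDER BY encounter_id
"""

rows = conn.execute(READMISSION_SQL).fetchall()
readmissions_30d = [encounter_id for encounter_id, gap in rows if gap <= 30]

print(rows)
print(readmissions_30d)
```

`julianday()` is SQLite-specific. In PostgreSQL, with the two columns stored as `DATE`, subtracting one date from another already yields a whole number of days, so the `CAST` and `julianday()` calls would not be needed.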

---

## Tech Stack

- **Database:** PostgreSQL 14+  
- **Programming Language:** Python 3.13+ (libraries include `pandas`, `sqlalchemy`, `matplotlib`, `scikit-learn`, `faker`)


## Database Setup

- **Datasets Overview:**  
  - **Patient_vitals.csv:** Clinical measurements (e.g., BMI, heart rate, blood pressure).  
  - **Encounters.csv:** In-patient hospital admission details (admission/discharge dates, diagnosis codes).  
  - **Telemedicine_encounters.csv:** Remote consultation records (appointment dates, contact methods).  
  - **Patient_RX.csv:** Medication history (e.g., drugs for diabetes, hypertension).


## Data Analysis with Python

### Project 2: Predictive Modeling for Readmission Risk

- **Objective:** Predict the likelihood of a patient being readmitted within 30 days.
- **Data Integration:** Merging data from Patient_vitals, Encounters, and Patient_RX.
- **Key Steps:**  
  - **Data Cleaning:** Handle missing values, encode categorical variables, and standardize numerical values.
  - **Feature Engineering:**  
    - *Length of Stay* (derived from Admission_Date and Discharge_Date).  
    - *Time to Follow-Up* (difference between Discharge_Date and subsequent encounter date).  
    - *Medication Count* (number of medications prescribed).  
    - *Vital Sign Metrics* (BMI, heart rate, blood pressure).
  - **Modeling:**  
    - Target Variable: Readmission within 30 days.
    - Model: Logistic Regression (80/20 train-test split).
    - Evaluation: Accuracy, ROC AUC, and feature importance analysis.
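
The modeling step outlined above can be sketched as follows. This is a minimal example under stated assumptions rather than the project's notebook code: the feature table is synthetic, the relationship between features and the target is invented so the model has a signal to learn, and the column names are placeholders for the engineered features.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for the engineered feature table (not project data).
rng = np.random.default_rng(42)
n = 500
features = pd.DataFrame({
    "length_of_stay": rng.integers(1, 15, size=n),
    "medication_count": rng.integers(0, 5, size=n),
    "bmi": rng.normal(27, 5, size=n),
    "heart_rate": rng.normal(75, 10, size=n),
})

# Invented relationship: longer stays and more medications raise readmission odds.
logit = 0.3 * features["length_of_stay"] + 0.5 * features["medication_count"] - 3.5
target = (rng.random(n) < 1 / (1 + np.exp(-logit))).astype(int)

# 80/20 split, stratified so both sets keep a similar readmission rate.
X_train, X_test, y_train, y_test = train_test_split(
    features, target, test_size=0.2, stratify=target, random_state=42
)

# Standardize inside the pipeline so the scaler is fit on training data only.
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = accuracy_score(y_test, model.predict(X_test))
roc_auc = roc_auc_score(y_test, model.predict_proba(X_test)[:, 1])

# With standardized inputs, coefficient magnitude is a rough importance ranking.
coefficients = pd.Series(
    model.named_steps["logisticregression"].coef_[0], index=features.columns
).sort_values(key=abs, ascending=False)

print(f"accuracy={accuracy:.3f} roc_auc={roc_auc:.3f}")
print(coefficients)
```

Placing `StandardScaler` inside the pipeline avoids leaking test-set statistics into training. Readmission data is often imbalanced, so accuracy alone can be misleading, which is one reason to report ROC AUC alongside it.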


**Data Integration:** Before cleaning, the three datasets are merged in Python on the shared key `patient_id`. Left joins keep every encounter row, including encounters with no matching vitals or prescription record.

```python
import pandas as pd

encounters_df = pd.read_csv("tests/encounters.csv")
patient_vitals_df = pd.read_csv("tests/patient_vitals.csv")
patient_rx_df = pd.read_csv("tests/patient_rx.csv")

# Attach vitals to each encounter
merged_1 = pd.merge(
    encounters_df,
    patient_vitals_df,
    on="patient_id",  # shared key column name
    how="left",       # keep every encounter row
)
# Attach medication history
merged_df = pd.merge(
    merged_1,
    patient_rx_df,
    on="patient_id",
    how="left",
)

# Show a summary of the DataFrame (column types, non-null counts);
# info() prints directly, so it needs no print() wrapper
merged_df.info()

# Basic descriptive statistics for numeric columns
print(merged_df.describe())

# Show first few rows
print(merged_df.head())

# Show last few rows
print(merged_df.tail())

# Random sample of 5 rows
print(merged_df.sample(5))

# Shape (rows, columns)
print(merged_df.shape)

# Column names
print(merged_df.columns)

# Data types of each column
print(merged_df.dtypes)

# Count missing values in each column
print(merged_df.isnull().sum())

# Number of unique values in each column
print(merged_df.nunique())
```


Output of `merged_df.info()`:

```text
Data columns (total 23 columns):
 #   Column                 Non-Null Count  Dtype  
---  ------                 --------------  -----  
 0   patient_id             502 non-null    int64  
 1   encounter_id           502 non-null    object 
 2   admission_date         502 non-null    object 
 3   discharge_date         502 non-null    object 
 4   diagnosis_code         411 non-null    object 
 5   diagnosis_desc         502 non-null    object 
 6   procedure              502 non-null    object 
 7   attending_physician    502 non-null    object 
 8   hospital_department    502 non-null    object 
 9   encounter_summary      502 non-null    object 
 10  age                    502 non-null    int64  
 11  gender                 502 non-null    object 
 12  blood_type             502 non-null    object 
 13  height (cm)            502 non-null    float64
 14  weight (kg)            502 non-null    float64
 15  BMI                    502 non-null    float64
 16  temperature (C)        502 non-null    float64
 17  heart_rate (bpm)       502 non-null    int64  
 18  blood_pressure (mmHg)  502 non-null    object 
 19  weight_loss_drug       47 non-null     object 
 20  hypertension_drug      42 non-null     object 
 21  diabetes_drug          40 non-null     object 
 22  cholesterol_drug       44 non-null     object 

dtypes: float64(4), int64(3), object(16)
memory usage: 90.3+ KB
```
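
The summary shows that the date columns are stored as `object`, that `diagnosis_code` has missing values, and that the drug columns are mostly empty. A possible next cleaning step is sketched below on a tiny made-up frame that mirrors a few of these columns. Two assumptions are worth flagging: a missing drug value is treated as "not prescribed", and a missing diagnosis code is labeled `UNKNOWN` instead of being dropped. Neither is confirmed as the project's approach.

```python
import pandas as pd

# Tiny made-up frame mirroring a few columns of the merged data.
df = pd.DataFrame({
    "patient_id": [1, 1, 2],
    "admission_date": ["2024-01-01", "2024-01-20", "2024-02-01"],
    "discharge_date": ["2024-01-05", "2024-01-22", "2024-02-03"],
    "diagnosis_code": ["E11", None, "I10"],
    "hypertension_drug": [None, "lisinopril", None],
    "diabetes_drug": ["metformin", "metformin", None],
})

# Dates arrive as strings, so parse them before doing date arithmetic.
for col in ["admission_date", "discharge_date"]:
    df[col] = pd.to_datetime(df[col])

df["length_of_stay"] = (df["discharge_date"] - df["admission_date"]).dt.days

# Assumption: a missing drug value means the drug was not prescribed.
drug_cols = [c for c in df.columns if c.endswith("_drug")]
df["medication_count"] = df[drug_cols].notna().sum(axis=1)

# Days from discharge to the same patient's next admission (NaN if there is none).
df = df.sort_values(["patient_id", "admission_date"])
next_admission = df.groupby("patient_id")["admission_date"].shift(-1)
df["days_to_next_encounter"] = (next_admission - df["discharge_date"]).dt.days

# NaN compares as False, so encounters with no follow-up are not flagged.
df["readmitted_30d"] = df["days_to_next_encounter"].le(30)

# Assumption: keep rows with a missing diagnosis code and label them explicitly.
df["diagnosis_code"] = df["diagnosis_code"].fillna("UNKNOWN")

print(df[["patient_id", "length_of_stay", "medication_count", "readmitted_30d"]])
```

If the prescription file holds more than one row per patient, the left joins above would repeat encounter rows, so it may be worth checking for duplicate `encounter_id` values before deriving per-encounter features.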