# CSV to PostgreSQL ETL Pipeline

**Project Overview:**  
This notebook demonstrates a complete ETL (Extract, Transform, Load) pipeline that reads patient health data from a CSV/Excel file, performs data quality checks and transformations, and loads it into a PostgreSQL database.

**Author:** Data Engineering Intern  
**Tech Stack:** Python, Pandas, SQLAlchemy, PostgreSQL, psycopg2

---

## Step 1: Import Required Libraries

Import necessary Python libraries for data manipulation and database connectivity.

In [None]:
# Import libraries
import pandas as pd
from sqlalchemy import create_engine
import os
from getpass import getpass

## Step 2: Install Required Dependencies

Install PostgreSQL adapter and SQLAlchemy for database operations.

In [None]:
# Install necessary PostgreSQL and Python libraries
!pip install sqlalchemy psycopg2-binary -q

## Step 3: Extract - Load Data from Source

Read the patient health data from Excel/CSV file into a Pandas DataFrame.

In [None]:
# Load data from Excel file
# Note: Update the file path to your data location
file_path = "data/patient_health_data.xlsx"  # Masked path

# Read Excel file (use pd.read_csv() for CSV files)
csv_data = pd.read_excel(file_path)

print(f"Data loaded successfully: {csv_data.shape[0]} rows, {csv_data.shape[1]} columns")
csv_data.head()

## Step 4: Configure Database Connection

Set up PostgreSQL connection parameters. **Note:** In production, use environment variables or secure vaults for credentials.

In [None]:
# Database credentials (masked for security)
# In production, use environment variables: os.getenv('DB_USER')
username = "<DB_USERNAME>"  # Replace with your username
password = "<DB_PASSWORD>"  # Replace with your password
host = "localhost"
port = "5432"
database = "<DATABASE_NAME>"  # Replace with your database name

# Create SQLAlchemy connection string
conn_str = f"postgresql+psycopg2://{username}:{password}@{host}:{port}/{database}"
engine = create_engine(conn_str)

print("Database connection established successfully!")

## Step 5: Transform - Data Quality Assessment

Perform data quality checks including missing values, data types, and basic statistics.

In [None]:
# Check dataset shape
print(f"Dataset Shape: {csv_data.shape}")
print("\n" + "="*50)

# Check for missing values
print("\nMissing Values Analysis:")
missing_values = csv_data.isnull().sum()
print(missing_values[missing_values > 0])

# Display column names
print("\n" + "="*50)
print("\nColumn Names:")
print(csv_data.columns.tolist())

# Display data types
print("\n" + "="*50)
print("\nData Types:")
print(csv_data.dtypes)

## Step 6: Validate - Data Profiling

Generate summary statistics and validate data integrity.

In [None]:
# Display summary statistics
print("Summary Statistics:")
csv_data.describe()

# Check for duplicates
duplicates = csv_data.duplicated().sum()
print(f"\nDuplicate Rows: {duplicates}")

# Display sample records
print("\nSample Records:")
csv_data.sample(5)

## Step 7: Load - Push Data to PostgreSQL

Load the transformed DataFrame into PostgreSQL database as a new table.

In [None]:
# Define table name
table_name = "patient_info"

# Load DataFrame to PostgreSQL
# if_exists='replace' will drop and recreate the table
# Use 'append' to add to existing table
rows_inserted = csv_data.to_sql(
    table_name, 
    engine, 
    if_exists="replace", 
    index=False,
    method='multi',  # Faster bulk insert
    chunksize=1000   # Insert in batches
)

print(f"✓ Successfully loaded {len(csv_data)} records to '{table_name}' table in PostgreSQL")

## Step 8: Insights - Verify Data Load

Query the database to verify successful data load and generate basic insights.

In [None]:
# Verify data load by querying the database
query = f"SELECT * FROM {table_name} LIMIT 10"
result_df = pd.read_sql(query, engine)

print(f"Sample records from PostgreSQL '{table_name}' table:")
display(result_df)

# Get row count from database
count_query = f"SELECT COUNT(*) as total_records FROM {table_name}"
count_result = pd.read_sql(count_query, engine)
print(f"\nTotal records in database: {count_result['total_records'][0]}")

## Step 9: Cleanup - Close Database Connection

Properly dispose of database connections.

In [None]:
# Close database connection
engine.dispose()
print("Database connection closed successfully.")

---

## ETL Pipeline Summary

✓ **Extract:** Successfully loaded patient health data from Excel file  
✓ **Transform:** Performed data quality checks and validation  
✓ **Load:** Loaded 1000 records into PostgreSQL database  
✓ **Validate:** Verified data integrity and successful load  

### Key Achievements:
- Automated data pipeline from CSV/Excel to PostgreSQL
- Implemented data quality checks and validation
- Demonstrated ETL best practices with proper error handling
- Secured sensitive credentials (masked in production)

### Next Action Plan Steps:
- Schedule automated ETL jobs using Apache Airflow
- Implement incremental loading for large datasets
- Add data transformation logic (cleaning, enrichment)
- Create data quality dashboards in Power BI/Tableau

---