# ETL Data Transformation & DB Load

## Project Overview
This project demonstrates a classic **ETL (Extract, Transform, Load)** pipeline using the famous **Titanic dataset**. 

**Workflow:**
1.  **Extract**: Load raw data (simulating a file read).
2.  **Transform**: Clean data, handle missing values, and perform aggregations using **Pandas**.
3.  **Load**: Store the processed data into a **PostgreSQL** database.
4.  **Analyze**: Verify the load and perform SQL-based analysis.

## Tech Stack
- **Data Processing**: Pandas
- **Database**: PostgreSQL
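- **Database Access**: SQLAlchemy (engine and connection handling; no ORM models are defined)
- **Dataset Source**: Seaborn (used only to load the Titanic dataset; no plots are produced)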

---


In [None]:
# Install required packages (run once)
!pip install -q pandas sqlalchemy psycopg2-binary seaborn


## 1. Configuration
Read database credentials from environment variables, falling back to local-development defaults when a variable is unset.


In [None]:
import os
import pandas as pd
import seaborn as sns
from sqlalchemy import create_engine

# Database Configuration
DB_HOST = os.getenv("DB_HOST", "localhost")
DB_PORT = os.getenv("DB_PORT", "5432")
DB_NAME = os.getenv("DB_NAME", "postgres")
DB_USER = os.getenv("DB_USER", "postgres")
DB_PASS = os.getenv("DB_PASS", "password")

# Create DB Connection
connection_uri = f"postgresql+psycopg2://{DB_USER}:{DB_PASS}@{DB_HOST}:{DB_PORT}/{DB_NAME}"
engine = create_engine(connection_uri)

print("Configuration loaded.")
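
One caveat with interpolating credentials straight into the URI: a password containing characters such as `@`, `:` or `/` would be misread as a URI delimiter. Below is a minimal sketch of one way to guard against that, using the standard library's `urllib.parse.quote_plus`. The helper name `build_connection_uri` and the sample credentials are hypothetical and not part of this notebook.

```python
from urllib.parse import quote_plus


def build_connection_uri(user, password, host, port, database):
    """Hypothetical helper: build a PostgreSQL URI with escaped credentials."""
    # Percent-encode the user and password so reserved characters
    # (@, :, /) cannot be mistaken for URI delimiters.
    safe_user = quote_plus(user)
    safe_pass = quote_plus(password)
    return f"postgresql+psycopg2://{safe_user}:{safe_pass}@{host}:{port}/{database}"


# Sample values for illustration only
example_uri = build_connection_uri("postgres", "p@ss:w/rd", "localhost", "5432", "postgres")
print(example_uri)
```

The resulting string could be passed to `create_engine` in place of the hand-built `connection_uri`.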

## 2. Extract (Load Data)
We'll use the Titanic dataset, fetched with Seaborn's `load_dataset` (which downloads it on first use, so an internet connection is needed), to simulate reading a raw file.


In [None]:
# Load raw data
df_raw = sns.load_dataset('titanic')
print(f"Loaded {df_raw.shape[0]} rows and {df_raw.shape[1]} columns.")
df_raw.head()
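
Since the later cells depend on specific columns (`sex`, `fare`, `pclass`, `survived`), it can help to check for them right after extraction. The following is a small sketch of such a check; `REQUIRED_COLUMNS` and `missing_columns` are hypothetical names introduced here, and the one-row frame is a stand-in rather than real Titanic data.

```python
import pandas as pd

# Columns the transform and aggregation cells rely on
REQUIRED_COLUMNS = frozenset({"sex", "fare", "pclass", "survived"})


def missing_columns(df, required=REQUIRED_COLUMNS):
    """Hypothetical helper: return the required columns absent from df, sorted."""
    return sorted(set(required) - set(df.columns))


# Tiny stand-in frame that lacks 'fare'
sample = pd.DataFrame({"sex": ["male"], "pclass": [3], "survived": [0]})
print(missing_columns(sample))
```

Calling `missing_columns(df_raw)` and stopping when the result is non-empty would surface a schema problem before any cleaning runs.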

## 3. Transform (Data Cleaning)
Perform standard data cleaning tasks:
- Normalize column names (strip whitespace, lowercase).
- Fill missing values (median for numeric columns, `unknown` for categorical ones).
- Create summary aggregations.


In [None]:
# 1. Normalize Columns (Strip whitespace, lowercase)
df_raw.columns = df_raw.columns.str.strip().str.lower()

# 2. Handle Missing Values
# Fill numerical missing values with median
num_cols = df_raw.select_dtypes(include=['int64', 'float64']).columns
for col in num_cols:
    median_val = df_raw[col].median()
    df_raw[col] = df_raw[col].fillna(median_val)

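# Fill categorical missing values with 'unknown'
cat_cols = df_raw.select_dtypes(include=['object', 'category']).columns
for col in cat_cols:
    # Categorical columns (e.g. 'deck') reject values outside their categories,
    # so register 'unknown' first; the membership check keeps the cell re-runnable
    if isinstance(df_raw[col].dtype, pd.CategoricalDtype) and "unknown" not in df_raw[col].cat.categories:
        df_raw[col] = df_raw[col].cat.add_categories("unknown")
    df_raw[col] = df_raw[col].fillna("unknown")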

print("Missing values handled.")
df_raw.isnull().sum()
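
To make the two fill strategies concrete, here is a minimal sketch on a three-row stand-in frame (made-up values, not Titanic rows). Note that for a pandas categorical column the fill value has to be one of the column's categories, so the sketch registers `unknown` before filling.

```python
import numpy as np
import pandas as pd

# Tiny stand-in frame with one numeric gap and one categorical gap
sample = pd.DataFrame({
    "age": [22.0, np.nan, 30.0],
    "deck": pd.Categorical(["C", None, "E"]),
})

# Numeric gap: fill with the column median (the median of 22 and 30 is 26)
sample["age"] = sample["age"].fillna(sample["age"].median())

# Categorical gap: 'unknown' must be a registered category before filling
sample["deck"] = sample["deck"].cat.add_categories("unknown").fillna("unknown")

print(sample)
```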

In [None]:
# 3. Create Aggregations (Transformation)
# Average fare by Sex
fare_by_sex = df_raw.pivot_table(index='sex', values='fare', aggfunc='mean').reset_index()
fare_by_sex.columns = ['sex', 'avg_fare']

# Survival count by Class
survival_by_class = df_raw.pivot_table(index='pclass', values='survived', aggfunc='sum').reset_index()
survival_by_class.columns = ['pclass', 'survivor_count']

print("Transformations complete.")
display(fare_by_sex)
display(survival_by_class)
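
The same aggregation can also be written with `groupby` and named aggregation, which sets the output column name in one step instead of renaming afterwards. The sketch below compares both forms on a four-row stand-in frame with made-up fares; it is an alternative spelling, not a change to the pipeline.

```python
import pandas as pd

# Made-up values for illustration
sample = pd.DataFrame({
    "sex": ["female", "male", "female", "male"],
    "fare": [80.0, 10.0, 20.0, 30.0],
})

# pivot_table form, followed by a manual rename of the columns
via_pivot = sample.pivot_table(index="sex", values="fare", aggfunc="mean").reset_index()
via_pivot.columns = ["sex", "avg_fare"]

# groupby form with named aggregation: the output column is named directly
via_groupby = sample.groupby("sex", as_index=False).agg(avg_fare=("fare", "mean"))

print(via_groupby)
```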

## 4. Load (Save to DB)
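Load the cleaned dataset and the aggregated summary tables into PostgreSQL. Each table is written with `if_exists='replace'`, so re-running the cell overwrites the previous contents.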


In [None]:
try:
    # Load main cleaned table
    df_raw.to_sql('titanic_cleaned', engine, if_exists='replace', index=False)
    print("Loaded 'titanic_cleaned' table.")
    
    # Load aggregated tables
    fare_by_sex.to_sql('titanic_fare_by_sex', engine, if_exists='replace', index=False)
    print("Loaded 'titanic_fare_by_sex' table.")
    
    survival_by_class.to_sql('titanic_survival_by_class', engine, if_exists='replace', index=False)
    print("Loaded 'titanic_survival_by_class' table.")
    
except Exception as e:
    print(f"Error loading to DB: {e}")
    # Re-raise so later cells do not run against missing or stale tables
    raise
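
The effect of `if_exists='replace'` can be tried without a PostgreSQL server. The sketch below uses an in-memory SQLite database as a stand-in (an assumption made purely for illustration; the table name `demo_survival_by_class` and the counts are made up) and runs the same load twice to show that rows are not duplicated.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for PostgreSQL here (illustration only)
conn = sqlite3.connect(":memory:")

# Made-up numbers, not real Titanic results
summary = pd.DataFrame({"pclass": [1, 2, 3], "survivor_count": [10, 5, 7]})

# Run the same load twice: 'replace' drops and recreates the table each time
for _ in range(2):
    summary.to_sql("demo_survival_by_class", conn, if_exists="replace", index=False)

row_count = conn.execute("SELECT COUNT(*) FROM demo_survival_by_class").fetchone()[0]
print(row_count)
conn.close()
```

With `if_exists='append'` instead, the second run would add the rows again, which is why `replace` suits a notebook that may be re-run.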

## 5. Analysis (SQL Verification)
Query the database to confirm the data is stored correctly.


In [None]:
# Verify Data Load
query = "SELECT * FROM titanic_cleaned LIMIT 5"
df_verify = pd.read_sql(query, engine)
df_verify

In [None]:
# Verify Aggregation Load
query_agg = "SELECT * FROM titanic_survival_by_class"
df_agg_verify = pd.read_sql(query_agg, engine)
df_agg_verify
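
Beyond selecting rows, a stronger check is to recompute an aggregation in SQL and compare it with the pandas result. The following sketch shows the idea against an in-memory SQLite database standing in for PostgreSQL; the table name `demo_passengers` and its five rows are made up for illustration.

```python
import sqlite3

import pandas as pd

# In-memory SQLite stands in for PostgreSQL here (illustration only)
conn = sqlite3.connect(":memory:")

# Made-up rows, not real Titanic data
passengers = pd.DataFrame({
    "pclass": [1, 1, 2, 3, 3],
    "survived": [1, 1, 0, 0, 1],
})
passengers.to_sql("demo_passengers", conn, index=False)

# Recompute survivors per class in SQL
sql_result = pd.read_sql(
    "SELECT pclass, SUM(survived) AS survivor_count "
    "FROM demo_passengers GROUP BY pclass ORDER BY pclass",
    conn,
)

# Same aggregation in pandas
pandas_result = passengers.groupby("pclass", as_index=False).agg(
    survivor_count=("survived", "sum")
)

# The two results should agree if the load preserved the data
counts_match = (
    sql_result["survivor_count"].tolist() == pandas_result["survivor_count"].tolist()
)
print(counts_match)
conn.close()
```

Against the real database, the same `GROUP BY` query on `titanic_cleaned` could be compared with the contents of `titanic_survival_by_class`.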

## Conclusion
This pipeline successfully:
1.  Ingested raw Titanic data.
2.  Cleaned missing values and standardized columns.
3.  Generated summary aggregations (average fare by sex, survivor count by class).
4.  Persisted both the cleaned dataset and the aggregations to a PostgreSQL database.
