# BoostCredit ETL Pipeline Demo

This notebook demonstrates the ETL pipeline for processing CSV and JSON data.

## Pipeline Flow:
1. **Extract** → Read data from CSV/JSON files
2. **Transform** → Clean, convert types, mask PII
3. **Store** → Save to object store (Parquet)
4. **Load** → Load from object store to PostgreSQL database
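
The four stages above can be sketched end to end with the standard library alone. This is a minimal illustration, not the project's implementation: the function names are hypothetical, a plain dict stands in for the Parquet object store, and an in-memory SQLite database stands in for PostgreSQL.

```python
import csv
import io
import sqlite3

# Tiny inline sample; the real pipeline reads files from the data directory.
RAW_CSV = "id,name,paid_amount\n1,Jane Doe,10.5\n2,John Roe,\n"


def extract(text):
    """Extract: parse CSV text into a list of row dicts."""
    return list(csv.DictReader(io.StringIO(text)))


def transform(rows):
    """Transform: convert types and mask the name column."""
    cleaned = []
    for row in rows:
        amount = round(float(row["paid_amount"]), 2) if row["paid_amount"] else None
        masked = " ".join(part[0] + "***" for part in row["name"].split())
        cleaned.append({"id": int(row["id"]), "name": masked, "paid_amount": amount})
    return cleaned


def store(object_store, key, rows):
    """Store: keep the transformed rows under a key (dict stands in for Parquet)."""
    object_store[key] = rows


def load(object_store, key, conn):
    """Load: copy the stored rows into a database table and return the row count."""
    conn.execute("CREATE TABLE demo (id INTEGER, name TEXT, paid_amount REAL)")
    conn.executemany(
        "INSERT INTO demo VALUES (:id, :name, :paid_amount)", object_store[key]
    )
    return conn.execute("SELECT COUNT(*) FROM demo").fetchone()[0]


object_store = {}
conn = sqlite3.connect(":memory:")  # assumption: stand-in for PostgreSQL
store(object_store, "demo_data", transform(extract(RAW_CSV)))
loaded_rows = load(object_store, "demo_data", conn)
print(f"Loaded {loaded_rows} rows")
```

Keeping the store step separate from the load step means the database load can be retried from the stored data without re-reading or re-transforming the source file.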

In [1]:
import os
import sys
import importlib
import pandas as pd
from pathlib import Path

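# Import the project modules, then reload them so that edits made on disk
# are picked up without restarting the kernel (important for notebooks)
import src.loaders
import src.pipeline
import src.extractors
import src.transformers

# Reload the dependencies first, then the pipeline module that uses them
for module in (src.loaders, src.extractors, src.transformers, src.pipeline):
    importlib.reload(module)
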

from src.pipeline import Pipeline
from src.extractors import CSVExtractor, JSONExtractor
from src.transformers import CSVTransformer, JSONTransformer

# Set environment variables
os.environ['STORE_KEY'] = 'demo_data'
os.environ['DB_TYPE'] = 'postgresql'
os.environ['DB_HOST'] = 'localhost'
os.environ['DB_PORT'] = '5432'
os.environ['DB_USER'] = 'etl_user'
os.environ['DB_PASSWORD'] = 'etl_password'
os.environ['DB_NAME'] = 'etl_database'
os.environ['DATA_PATH'] = './data'
os.environ['OBJECT_STORE_PATH'] = './output'

print("✓ Environment variables set")
print("✓ Modules reloaded and imports successful")

✓ Environment variables set
✓ Modules reloaded and imports successful
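
How the pipeline reads these variables is not shown in this notebook. As a rough sketch, assuming it collects them into a single settings object, the lookup might resemble the following. `DBConfig`, `from_env`, and `url` are hypothetical names, and the password is hidden by default so the URL can be printed safely.

```python
import os
from dataclasses import dataclass


@dataclass(frozen=True)
class DBConfig:
    """Hypothetical container for the DB_* environment variables."""

    db_type: str
    host: str
    port: int
    user: str
    password: str
    name: str

    @classmethod
    def from_env(cls, env=None):
        # Fall back to the process environment when no mapping is passed in.
        env = os.environ if env is None else env
        return cls(
            db_type=env.get("DB_TYPE", "postgresql"),
            host=env.get("DB_HOST", "localhost"),
            port=int(env.get("DB_PORT", "5432")),
            user=env["DB_USER"],
            password=env["DB_PASSWORD"],
            name=env["DB_NAME"],
        )

    def url(self, hide_password=True):
        # Mask the password unless the caller explicitly asks for it.
        secret = "***" if hide_password else self.password
        return f"{self.db_type}://{self.user}:{secret}@{self.host}:{self.port}/{self.name}"


demo_env = {
    "DB_TYPE": "postgresql",
    "DB_HOST": "localhost",
    "DB_PORT": "5432",
    "DB_USER": "etl_user",
    "DB_PASSWORD": "etl_password",
    "DB_NAME": "etl_database",
}
config = DBConfig.from_env(demo_env)
print(config.url())
```

Passing the mapping explicitly, as with `demo_env` here, keeps the sketch testable without touching the real process environment.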


## Step 1: Initialize Pipeline

A single `Pipeline` object runs the complete ETL process: extract, transform, store, and load.

In [2]:
pipeline = Pipeline()
print("✓ Pipeline initialized")

✓ Pipeline initialized


## Step 2: Test Individual Components

Test each component separately to see what it does.

In [3]:
# Test CSV Extractor
csv_extractor = CSVExtractor()
csv_file = Path('data/test.csv')
if csv_file.exists():
    sample_data = csv_extractor.extract(str(csv_file))
    print(f"✓ CSV Extracted: {len(sample_data)} rows")
    print(f"  Columns: {list(sample_data.columns)}")
    print(f"\n  First row sample:")
    print(sample_data.head(1))
else:
    print("⚠ CSV file not found")

✓ CSV Extracted: 4000000 rows
  Columns: ['id', 'name', 'address', 'color', 'created_at', 'last_login', 'is_claimed', 'paid_amount']

  First row sample:
     id            name                                            address  \
0  6311  Jennifer Green  7593 Juan Throughway Apt. 948\nWest Corey, TX ...   

  color               created_at  last_login is_claimed  paid_amount  
0  lime  Monday, June 30th, 2013  1202190735       True  5004.671532  


In [4]:
sample_data.head()

| | id | name | address | color | created_at | last_login | is_claimed | paid_amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 6311 | Jennifer Green | 7593 Juan Throughway Apt. 948\nWest Corey, TX ... | lime | Monday, June 30th, 2013 | 1202190735 | True | 5004.671532 |
| 1 | 3350 | Karen Grimes | 60975 Jessica Squares\nEast Sallybury, FL 71671 | lime | Monday, June 30th, 2013 | 195884769 | True | 893.40459503 |
| 2 | 9031 | Calvin Cook | PSC 3989, Box 4719\nAPO AA 42056 | silver | 1986-06-23TEST | 623477862 | True | 266.6 |
| 3 | 1131 | Peter Mcdowell | PSC 1868, Box 4833\nAPO AP 77807 | aqua | 1998-07-17TEST | 1244885561 | True | 674.5441267 |
| 4 | 1889 | Mr. Ryan Sanchez | 352 Simmons Circle\nPort Dustinbury, OK 83627 | white | 2006-05-09 13:29:58 | 1293151276 | truee | |


In [5]:
# Test CSV Transformer
csv_transformer = CSVTransformer()
if csv_file.exists():
    transformed = csv_transformer.transform(sample_data.head(5))
    print("✓ CSV Transformed")
    print(f"  Data types converted")
    print(f"  PII masked (name, address)")
    print(f"\n  Transformed sample:")
    print(transformed[['id', 'name', 'created_at', 'is_claimed', 'paid_amount']].head(2))

✓ CSV Transformed
  Data types converted
  PII masked (name, address)

  Transformed sample:
     id       name created_at  is_claimed  paid_amount
0  6311  J*** G*** 2013-06-30        True      5004.67
1  3350  K*** G*** 2013-06-30        True       893.40
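
The raw sample mixes several date formats, epoch seconds, a misspelled boolean (`truee`), and unrounded or missing amounts. The sketch below shows one way such values could be normalized with the standard library. The helper names are hypothetical and the rules are inferred from the outputs shown here, not taken from `CSVTransformer`. It uses UTC for epoch conversion, so hours may differ from the notebook output, which depends on how the transformer handles time zones. Month and weekday names are parsed with the default English locale.

```python
import re
from datetime import datetime, timezone


def parse_created_at(value):
    """Parse ISO-like dates (with optional time or trailing junk) and long-form dates."""
    text = value.strip()
    # ISO-like prefix, optionally followed by a time; trailing text such as "TEST" is ignored.
    match = re.match(r"(\d{4}-\d{2}-\d{2})(?:[ T](\d{2}:\d{2}:\d{2}))?", text)
    if match:
        date_part, time_part = match.groups()
        return datetime.strptime(
            f"{date_part} {time_part or '00:00:00'}", "%Y-%m-%d %H:%M:%S"
        )
    # Long form: drop the ordinal suffix ("30th" becomes "30") before parsing.
    no_ordinal = re.sub(r"(\d+)(st|nd|rd|th)", r"\1", text)
    return datetime.strptime(no_ordinal, "%A, %B %d, %Y")


def parse_epoch(value):
    """Convert epoch seconds to a timezone-aware UTC datetime."""
    return datetime.fromtimestamp(int(value), tz=timezone.utc)


def parse_bool(value):
    """Lenient boolean parsing; the prefix match tolerating 'truee' is an assumption."""
    text = str(value).strip().lower()
    if text.startswith("true"):
        return True
    if text.startswith("false"):
        return False
    return None


def parse_amount(value):
    """Round amounts to two decimals; empty values become None."""
    if value is None or str(value).strip() == "":
        return None
    return round(float(value), 2)


for raw in ("Monday, June 30th, 2013", "1986-06-23TEST", "2006-05-09 13:29:58"):
    print(raw, "->", parse_created_at(raw))
```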


In [6]:
transformed.head()

| | id | name | address | color | created_at | last_login | is_claimed | paid_amount |
|---|---|---|---|---|---|---|---|---|
| 0 | 6311 | J*** G*** | *****************************\nWest Corey, TX ... | lime | 2013-06-30 00:00:00 | 2008-02-05 10:52:15 | True | 5004.67 |
| 1 | 3350 | K*** G*** | *********************\nEast Sallybury, FL 71671 | lime | 2013-06-30 00:00:00 | 1976-03-17 09:26:09 | True | 893.4 |
| 2 | 9031 | C*** C*** | ******************\nAPO AA 42056 | silver | 1986-06-23 00:00:00 | 1989-10-04 09:17:42 | True | 266.6 |
| 3 | 1131 | P*** M*** | ******************\nAPO AP 77807 | aqua | 1998-07-17 00:00:00 | 2009-06-13 15:32:41 | True | 674.54 |
| 4 | 1889 | M*** S*** | ******************\nPort Dustinbury, OK 83627 | white | 2006-05-09 13:29:58 | 2010-12-24 05:41:16 | True | |
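
The masked output suggests two rules: names keep only the initial of the first and last token, and addresses have their first line (the street) replaced by asterisks of the same length. The sketch below reproduces that behavior. `mask_name` and `mask_address` are hypothetical helpers inferred from the output, not the actual `CSVTransformer` code.

```python
def mask_name(name):
    """Keep the initial of the first and last token, e.g. 'Mr. Ryan Sanchez' gives 'M*** S***'."""
    parts = name.split()
    if not parts:
        return name
    kept = [parts[0]] if len(parts) == 1 else [parts[0], parts[-1]]
    return " ".join(part[0] + "***" for part in kept)


def mask_address(address):
    """Replace the first line with asterisks of equal length; keep city, state, and ZIP."""
    first_line, sep, rest = address.partition("\n")
    return "*" * len(first_line) + sep + rest


print(mask_name("Jennifer Green"))
print(mask_address("60975 Jessica Squares\nEast Sallybury, FL 71671"))
```

Note that a same-length mask still reveals how long the street line was; a fixed-width mask would hide that too, at the cost of no longer matching the output above.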


In [7]:
# Test JSON Extractor
json_extractor = JSONExtractor()
json_file = Path('data/test.json')
if json_file.exists():
    json_data = json_extractor.extract(str(json_file))
    print(f"✓ JSON Extracted: {len(json_data)} records")
    print(f"\n  First record keys: {list(json_data[0].keys())}")
    print(f"  Sample user_id: {json_data[0].get('user_id', 'N/A')}")
else:
    print("⚠ JSON file not found")

✓ JSON Extracted: 100000 records

  First record keys: ['user_id', 'created_at', 'updated_at', 'logged_at', 'user_details', 'jobs_history']
  Sample user_id: e9703a66-6556-4b48-8a0b-0ace129d7a11


In [8]:
# Test JSON Transformer
json_transformer = JSONTransformer()
if json_file.exists():
    json_transformed = json_transformer.transform(json_data[:2])  # Transform 2 records
    print("✓ JSON Transformed into 3 tables:")
    print(f"  - users: {len(json_transformed['users'])} rows")
    print(f"  - telephone_numbers: {len(json_transformed['telephone_numbers'])} rows")
    print(f"  - jobs_history: {len(json_transformed['jobs_history'])} rows")
    print(f"\n  Users sample:")
    print(json_transformed['users'][['user_id', 'name', 'username']].head(2))

✓ JSON Transformed into 3 tables:
  - users: 2 rows
  - telephone_numbers: 4 rows
  - jobs_history: 2 rows

  Users sample:
                                user_id       name  \
0  e9703a66-6556-4b48-8a0b-0ace129d7a11  J*** W***   
1  aa246388-104c-44f7-93f4-4b688dc0baff  S*** H***   

                       username  
0      b****e@garza-shelton.net  
1  h********s@baker-beasley.com  
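
Splitting one nested record into three tables linked by `user_id` can be sketched with plain dicts and lists. The record layout below is an assumption: the notebook only shows the top-level keys, so the placement of `telephone_numbers` inside `user_details` and the job fields (`employer`, `title`) are hypothetical, as is the function name `normalize_records`.

```python
def normalize_records(records):
    """Flatten nested user records into users, telephone_numbers, and jobs_history rows."""
    tables = {"users": [], "telephone_numbers": [], "jobs_history": []}
    for record in records:
        user_id = record["user_id"]
        details = dict(record.get("user_details", {}))
        # Assumption: phone numbers are nested inside user_details.
        phones = details.pop("telephone_numbers", [])

        # One users row: top-level scalar fields plus the flattened details.
        user_row = {
            key: value
            for key, value in record.items()
            if key not in ("user_details", "jobs_history")
        }
        user_row.update(details)
        tables["users"].append(user_row)

        # Child rows carry user_id so they can be joined back to users.
        for phone in phones:
            tables["telephone_numbers"].append(
                {"user_id": user_id, "telephone_number": phone}
            )
        for job in record.get("jobs_history", []):
            tables["jobs_history"].append({"user_id": user_id, **job})
    return tables


sample_records = [
    {
        "user_id": "u-1",
        "created_at": "2003-09-19",
        "user_details": {"name": "Jane Doe", "telephone_numbers": ["555-0100", "555-0101"]},
        "jobs_history": [{"employer": "Acme", "title": "Analyst"}],
    },
    {
        "user_id": "u-2",
        "created_at": "2002-10-08",
        "user_details": {"name": "John Roe", "telephone_numbers": ["555-0102"]},
        "jobs_history": [{"employer": "Globex", "title": "Engineer"}],
    },
]
tables = normalize_records(sample_records)
for table_name, rows in tables.items():
    print(f"{table_name}: {len(rows)} rows")
```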


In [9]:
json_transformed

{'users':                                 user_id created_at updated_at  \
 0  e9703a66-6556-4b48-8a0b-0ace129d7a11 2003-09-19 1986-01-02   
 1  aa246388-104c-44f7-93f4-4b688dc0baff 2002-10-08 1984-08-26   
 
             logged_at       name         dob  \
 0 1997-12-12 05:03:28  J*** W***  1983-05-15   
 1 1972-06-10 17:46:40  S*** H***  1996-09-24   
 
                                        address                      username  \
 0         **************\nNorth Dana, MN 35292      b****e@garza-shelton.net   
 1  ******************\nPort Kimmouth, MI 12236  h********s@baker-beasley.com   
 
      password national_id  
 0  **********   *****8140  
 1  **********   *****4594  ,
 'telephone_numbers':                                 user_id    telephone_number
 0  e9703a66-6556-4b48-8a0b-0ace129d7a11    ************7268
 1  e9703a66-6556-4b48-8a0b-0ace129d7a11  **************5397
 2  aa246388-104c-44f7-93f4-4b688dc0baff          ******9845
 3  aa246388-104c-44f7-93f4-4b688dc0baff    
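
The masked values above follow a few recognizable patterns: usernames keep the first and last character of the local part, phone numbers and national IDs keep only the last four characters, and passwords are fully hidden. The helpers below are a hypothetical sketch of those patterns, inferred from the output rather than taken from `JSONTransformer`. The fixed password width of 10 matches the output shown but is an assumption.

```python
def mask_email(email):
    """Keep the first and last character of the local part and the full domain."""
    local, sep, domain = email.partition("@")
    if not sep or len(local) <= 2:
        # Too short to keep anything without exposing most of the value.
        return "*" * len(local) + sep + domain
    return local[0] + "*" * (len(local) - 2) + local[-1] + sep + domain


def mask_keep_last(value, visible=4):
    """Mask everything except the last `visible` characters."""
    text = str(value)
    if len(text) <= visible:
        return "*" * len(text)
    return "*" * (len(text) - visible) + text[-visible:]


def mask_password(_value, width=10):
    """Fixed-width mask so the output does not reveal the password length."""
    return "*" * width


print(mask_email("bobbie@garza-shelton.net"))
print(mask_keep_last("123458140"))
print(mask_password("hunter2"))
```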

## Step 3: Run Complete Pipeline

Now let's run the full pipeline for CSV processing.

In [None]:
# Process CSV file
if csv_file.exists():
    os.environ['STORE_KEY'] = 'csv_demo'
    pipeline.process_csv('test.csv')
    print("✓ CSV processing completed!")
    print("  → Data extracted, transformed, saved to object store, and loaded to database")
else:
    print("⚠ CSV file not found - skipping CSV processing")

## Step 4: Process JSON File

Process the JSON data, which produces three linked tables: `users`, `telephone_numbers`, and `jobs_history`.

In [None]:
# Process JSON file
if json_file.exists():
    os.environ['STORE_KEY'] = 'json_demo'
    pipeline.process_json('test.json')
    print("✓ JSON processing completed!")
    print("  → Created 3 tables: users, telephone_numbers, jobs_history")
    print("  → All PII masked (emails, phones, national IDs, passwords)")
else:
    print("⚠ JSON file not found - skipping JSON processing")

## Step 5: Verify Data in Object Store

Check what was saved to the object store (intermediate step).

In [None]:
from src.storage import ObjectStore

store = ObjectStore('./output')

# Check CSV data in store
csv_data = store.load('csv_demo', 'parquet')
if csv_data is not None:
    print(f"✓ CSV data in object store: {len(csv_data)} rows")
    print(f"  Columns: {list(csv_data.columns)}")

# Check JSON data in store
json_data_store = store.load('json_demo', 'parquet')
if json_data_store is not None:
    print(f"\n✓ JSON data in object store:")
    for table_name, df in json_data_store.items():
        print(f"  - {table_name}: {len(df)} rows")

## Step 6: Cleanup

Close the pipeline to release database connections.

In [None]:
pipeline.close()
print("✓ Pipeline closed - database connections released")
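
If a processing step raises, an explicit `close()` call at the end of the notebook is never reached. One way to guard against that is `contextlib.closing`, which calls `close()` on exit even when an exception occurs. The sketch below uses a hypothetical `FakePipeline` stand-in, so it runs without a database; it assumes only that the real `Pipeline` exposes a `close()` method, as used above.

```python
from contextlib import closing


class FakePipeline:
    """Hypothetical stand-in exposing the same close() method as Pipeline."""

    def __init__(self):
        self.closed = False

    def process_csv(self, filename):
        # Simulate a failure in the middle of processing.
        raise RuntimeError(f"simulated failure while processing {filename}")

    def close(self):
        self.closed = True


fake = FakePipeline()
try:
    # closing() guarantees fake.close() runs when the block exits, error or not.
    with closing(fake) as pipeline_like:
        pipeline_like.process_csv("test.csv")
except RuntimeError as exc:
    print(f"Processing failed: {exc}")

print(f"Closed after failure: {fake.closed}")
```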