# HelixGraph ETL Pipeline

This notebook demonstrates how to load the HR, Marketing, and Procurement datasets into Neo4j using the ETL framework.

## 1. Configure Project Path and Imports

In [1]:
# Set up project path (local Jupyter run)
import sys
import os

# Locate project root (notebook is inside notebooks/ directory)
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
print(f"Project root: {project_root}")

if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print("✓ Added project root to Python path")

try:
    from etl import HRLoader, MarketingLoader, ProcurementLoader
    from etl.utils import get_neo4j_config
    print("✓ ETL loaders imported successfully")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("Ensure you are running this notebook from the project notebooks/ directory")
    print("Current directory:", os.getcwd())
    raise  # stop here; later cells depend on these imports

Project root: /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph
✓ Added project root to Python path
✓ ETL loaders imported successfully
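
The cell above assumes the notebook is started from the `notebooks/` directory, so the project root is exactly one level up. A minimal sketch of a less fragile alternative is to walk upward until a marker directory is found. The helper name `find_project_root` and the choice of `etl` as the marker are assumptions for illustration, not part of the project.

```python
from pathlib import Path


def find_project_root(start: Path, marker: str = "etl") -> Path:
    """Walk upward from `start` until a directory containing `marker` is found.

    Hypothetical helper: the marker name "etl" is assumed from the import
    `from etl import ...` used in this notebook.
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    raise FileNotFoundError(f"No '{marker}' directory found at or above {start}")
```

In the notebook this could be used as `project_root = str(find_project_root(Path.cwd()))`, which gives the same result whether Jupyter is launched from the project root or from `notebooks/`.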


## 2. Load Environment Configuration

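Read the Neo4j credentials from the `.env` file in the project root. If no `.env` exists, the cell falls back to `.env.example` and prints a reminder to copy it to `.env` and update the credentials.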

In [2]:
# Load .env configuration
from dotenv import load_dotenv

# Locate and load .env file
env_path = os.path.join(project_root, '.env')
env_example_path = os.path.join(project_root, '.env.example')

if os.path.exists(env_path):
    load_dotenv(env_path)
    print("✓ Loaded configuration from .env")
elif os.path.exists(env_example_path):
    load_dotenv(env_example_path)
    print("⚠️  Loaded configuration from .env.example")
    print("   Tip: duplicate .env.example as .env and update credentials")
else:
    print("❌ .env or .env.example not found")
    print("   Please create an .env file or set environment variables")

✓ Loaded configuration from .env
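
For reference, a `.env` file is a plain-text list of `KEY=VALUE` lines. The sketch below parses one by hand only to show the expected shape of the file. The function `parse_env_text` and all values are placeholders for illustration; `load_dotenv` handles quoting and other edge cases more thoroughly and remains the right tool in the notebook.

```python
def parse_env_text(text: str) -> dict[str, str]:
    """Parse simple KEY=VALUE lines, skipping blank lines and # comments."""
    values = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, _, value = line.partition("=")
        # Drop surrounding whitespace and optional single or double quotes
        values[key.strip()] = value.strip().strip("'\"")
    return values


# Placeholder values for illustration only; real credentials belong in .env.
example_env = """
# Neo4j connection
NEO4J_URI=neo4j+s://<your-instance>.databases.neo4j.io
NEO4J_USER=neo4j
NEO4J_PASSWORD='change-me'
"""

parsed = parse_env_text(example_env)
print(sorted(parsed))
```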


## 3. Retrieve Neo4j Configuration

In [4]:
# Retrieve Neo4j configuration
try:
    config = get_neo4j_config()
    print("✓ Configuration loaded successfully")
    print(f"  URI:      {config['uri']}")
    print(f"  User:     {config['user']}")
    print("  Database: neo4j")
except ValueError as e:
    print(f"❌ Configuration error: {e}")
    print("\nPlease check:")
    print("1. .env file exists (or use .env.example)")
    print("2. Contains NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD")
    print("\nAlternatively set environment variables:")
    print("  export NEO4J_URI='...'")
    print("  export NEO4J_USER='...'")
    print("  export NEO4J_PASSWORD='...'")
    raise

✓ Configuration loaded successfully
  URI:      neo4j+s://561f8654.databases.neo4j.io
  User:     neo4j
  Database: neo4j
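
The implementation of `get_neo4j_config` is not shown in this notebook. Based on how it is used above (it returns a mapping with `uri` and `user` keys and raises `ValueError` when settings are missing), a helper of this kind might look roughly like the sketch below. The function name `get_neo4j_config_sketch`, the `password` key, and the error message are assumptions.

```python
import os
from typing import Mapping, Optional

# Config key -> environment variable. The "password" key is an assumption;
# the notebook only shows config["uri"] and config["user"].
REQUIRED_VARS = {
    "uri": "NEO4J_URI",
    "user": "NEO4J_USER",
    "password": "NEO4J_PASSWORD",
}


def get_neo4j_config_sketch(env: Optional[Mapping[str, str]] = None) -> dict:
    """Collect Neo4j settings from the environment or raise ValueError."""
    env = os.environ if env is None else env
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise ValueError(
            f"Missing required environment variables: {', '.join(missing)}"
        )
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

Accepting the environment as a parameter keeps the function easy to test without touching the real process environment.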


## 4. Run Multi-domain ETL Pipeline
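
The load plan in this section passes a `batch_size` to each loader. How the loaders batch internally is not shown in this notebook. As a minimal sketch of the general technique, a list of records can be split into fixed-size slices, with the last slice possibly shorter. The helper `iter_batches` is hypothetical. A common pattern with Neo4j is to send each slice as the parameter of a single `UNWIND` query, though whether these loaders do so is an assumption.

```python
from typing import Iterator, Sequence


def iter_batches(records: Sequence[dict], batch_size: int) -> Iterator[Sequence[dict]]:
    """Yield consecutive slices of `records`, each with at most `batch_size` items."""
    if batch_size <= 0:
        raise ValueError("batch_size must be positive")
    for start in range(0, len(records), batch_size):
        yield records[start:start + batch_size]


# 200 illustrative records in batches of 100 gives two full batches
employees = [{"id": i} for i in range(200)]
batch_sizes = [len(batch) for batch in iter_batches(employees, 100)]
print(batch_sizes)  # [100, 100]
```

Smaller batches keep individual transactions short; larger batches reduce the number of round trips to the database.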

In [8]:
CLEAR_DATABASE_FIRST = True

load_plan = [
    ("HR", HRLoader, {"batch_size": 100}),
    ("Marketing", MarketingLoader, {"batch_size": 500}),
    ("Procurement", ProcurementLoader, {"batch_size": 500}),
]

final_stats = None
loaded_domains = []  # domains whose load ran to completion

for idx, (domain, loader_cls, loader_kwargs) in enumerate(load_plan, 1):
    print("=" * 70)
    print(f"[{idx}] Loading {domain} data")
    print("=" * 70)

    with loader_cls(**config, **loader_kwargs) as loader:
        print("- Testing connection...")
        if not loader.test_connection():
            print("✗ Connection failed. Aborting remaining loads.")
            break

        if CLEAR_DATABASE_FIRST:
            print("- Clearing database before first load...")
            loader.clear_database()
            CLEAR_DATABASE_FIRST = False

        print("- Running loader...")
        loader.load()

        # get_graph_statistics() appears to report whole-graph totals, so the
        # counts printed here are cumulative across the domains loaded so far.
        stats = loader.get_graph_statistics()
        final_stats = stats
        loaded_domains.append(domain)
        print(f"  > {domain} nodes: {stats['total_nodes']:,}, relationships: {stats['total_relationships']:,}")

if final_stats:
    print("\nCombined graph statistics after all loads:")
    print(f"  Total Nodes:         {final_stats['total_nodes']:,}")
    print(f"  Total Relationships: {final_stats['total_relationships']:,}")

    print("\n  Nodes by Label:")
    for label, count in final_stats['nodes_by_label'].items():
        print(f"    {label:20} {count:>6,}")

    print("\n  Relationships by Type:")
    for rel_type, count in final_stats['relationships_by_type'].items():
        print(f"    {rel_type:20} {count:>6,}")

# Report success only if every domain in the plan was loaded
if len(loaded_domains) == len(load_plan):
    print("\n✅ Multi-domain ETL pipeline completed!")
else:
    print(f"\n✗ Pipeline stopped early. Loaded: {', '.join(loaded_domains) or 'none'}")

2025-10-14 17:10:51 - HRLoader - INFO - Initialized HRLoader
2025-10-14 17:10:51 - HRLoader - INFO - Connected to Neo4j at neo4j+s://561f8654.databases.neo4j.io
2025-10-14 17:10:51 - HRLoader - INFO - Data directory: /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr


[1] Loading HR data
- Testing connection...


2025-10-14 17:10:52 - HRLoader - INFO - ✓ Connection test successful


- Clearing database before first load...


2025-10-14 17:10:52 - HRLoader - INFO - ✓ Database cleared
2025-10-14 17:10:52 - HRLoader - INFO - Starting HR Data Load
2025-10-14 17:10:52 - HRLoader - INFO - Loading data files...
2025-10-14 17:10:52 - HRLoader - INFO - Loading employees from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_employees.json
2025-10-14 17:10:52 - HRLoader - INFO - ✓ Loaded 200 employees
2025-10-14 17:10:52 - HRLoader - INFO - Loading skills from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_skills.json
2025-10-14 17:10:52 - HRLoader - INFO - ✓ Loaded 50 skills
2025-10-14 17:10:52 - HRLoader - INFO - Loading employee-skills from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_employee_skills.json
2025-10-14 17:10:52 - HRLoader - INFO - ✓ Loaded 1496 employee-skill relationships
2025-10-14 17:10:52 - HRLoader - INFO - Validating data with Pydantic schemas...
2025-10-14 17:10:52 - HRLoader - INFO - Validated 200/200 records
[log output truncated]

- Running loader...


2025-10-14 17:10:53 - HRLoader - INFO - ✓ Schema setup complete
2025-10-14 17:10:53 - HRLoader - INFO - Loading departments...
2025-10-14 17:10:53 - HRLoader - INFO - Loading 6 records in batches of 50
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 6 records
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 6 departments
2025-10-14 17:10:53 - HRLoader - INFO - Loading locations...
2025-10-14 17:10:53 - HRLoader - INFO - Loading 6 records in batches of 50
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 6 records
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 6 locations
2025-10-14 17:10:53 - HRLoader - INFO - Loading skills...
2025-10-14 17:10:53 - HRLoader - INFO - Loading 50 records in batches of 50
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 50 records
2025-10-14 17:10:53 - HRLoader - INFO - ✓ Loaded 50 skills
2025-10-14 17:10:53 - HRLoader - INFO - Loading employees...
2025-10-14 17:10:53 - HRLoader - INFO - Loading 200 records in batches of 100
[log output truncated]


HRLoader - Loading Statistics

⏱️  Duration: 2.40 seconds

📊 Nodes Created:
  Department                6
  Location                  6
  Skill                    50
  Employee                200
  ──────────────────────────────
  Total                   262

🔗 Relationships Created:
  WORKS_IN                200
  LOCATED_IN              200
  REPORTS_TO              189
  HAS_SKILL             1,496
  ──────────────────────────────
  Total                 2,085

📈 Records:
  Processed:   1,947
  Failed:          0

✅ No errors

  > HR nodes: 262, relationships: 2,085
[2] Loading Marketing data
- Testing connection...


2025-10-14 17:10:54 - MarketingLoader - INFO - ✓ Connection test successful
2025-10-14 17:10:54 - MarketingLoader - INFO - Starting marketing data load
2025-10-14 17:10:54 - MarketingLoader - INFO - Loading marketing dataset from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/marketing/marketing_data.json
2025-10-14 17:10:54 - MarketingLoader - INFO - Loaded marketing dataset: 600 campaigns, 10 brands
2025-10-14 17:10:54 - MarketingLoader - INFO - Setting up marketing schema constraints and indexes


- Running loader...


2025-10-14 17:10:55 - MarketingLoader - INFO - Loading 10 records in batches of 200
2025-10-14 17:10:55 - MarketingLoader - INFO - ✓ Loaded 10 records
2025-10-14 17:10:55 - MarketingLoader - INFO - Loading 5 records in batches of 200
2025-10-14 17:10:55 - MarketingLoader - INFO - ✓ Loaded 5 records
2025-10-14 17:10:55 - MarketingLoader - INFO - Loading 5 records in batches of 200
2025-10-14 17:10:55 - MarketingLoader - INFO - ✓ Loaded 5 records
2025-10-14 17:10:55 - MarketingLoader - INFO - Loading 8 records in batches of 200
2025-10-14 17:10:55 - MarketingLoader - INFO - ✓ Loaded 8 records
2025-10-14 17:10:55 - MarketingLoader - INFO - Loading 600 records in batches of 200
2025-10-14 17:10:56 - MarketingLoader - INFO - ✓ Loaded 600 records
2025-10-14 17:10:56 - MarketingLoader - INFO - Loading 1793 records in batches of 500
2025-10-14 17:10:57 - MarketingLoader - INFO - ✓ Loaded 1793 records
2025-10-14 17:10:57 - MarketingLoader - INFO - Loading 8965 records in batches of 1000
[log output truncated]


MarketingLoader - Loading Statistics

⏱️  Duration: 5.31 seconds

📊 Nodes Created:
  Brand                    10
  MarketingObjective        5
  MarketingKPI              5
  MarketingChannel          8
  MarketingCampaign       600
  CommerceOrder         1,828
  ──────────────────────────────
  Total                 2,456

🔗 Relationships Created:
  FOR_BRAND               600
  HAS_OBJECTIVE           600
  ACTIVATED_ON          1,793
  KPI_RESULT            8,965
  ATTRIBUTED_TO         1,828
  ──────────────────────────────
  Total                13,786

📈 Records:
  Processed:  13,214
  Failed:          0

✅ No errors



2025-10-14 17:11:00 - MarketingLoader - INFO - Neo4j connection closed
2025-10-14 17:11:00 - ProcurementLoader - INFO - Initialized ProcurementLoader
2025-10-14 17:11:00 - ProcurementLoader - INFO - Connected to Neo4j at neo4j+s://561f8654.databases.neo4j.io
2025-10-14 17:11:00 - ProcurementLoader - INFO - ✓ Connection test successful
2025-10-14 17:11:00 - ProcurementLoader - INFO - Starting procurement data load
2025-10-14 17:11:00 - ProcurementLoader - INFO - Loading procurement dataset from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/procurement/procurement_data.json


  > Marketing nodes: 2,718, relationships: 15,875
[3] Loading Procurement data
- Testing connection...
- Running loader...


2025-10-14 17:11:00 - ProcurementLoader - INFO - Loaded procurement dataset: 150 suppliers, 313 contracts
2025-10-14 17:11:00 - ProcurementLoader - INFO - Setting up procurement schema constraints and indexes
2025-10-14 17:11:00 - ProcurementLoader - INFO - Loading 150 records in batches of 500
2025-10-14 17:11:00 - ProcurementLoader - INFO - ✓ Loaded 150 records
2025-10-14 17:11:00 - ProcurementLoader - INFO - Loading 313 records in batches of 500
2025-10-14 17:11:00 - ProcurementLoader - INFO - ✓ Loaded 313 records
2025-10-14 17:11:00 - ProcurementLoader - INFO - Loading 534 records in batches of 500
2025-10-14 17:11:01 - ProcurementLoader - INFO - ✓ Loaded 534 records
2025-10-14 17:11:01 - ProcurementLoader - INFO - Loading 1747 records in batches of 500
2025-10-14 17:11:01 - ProcurementLoader - INFO - ✓ Loaded 1747 records
2025-10-14 17:11:01 - ProcurementLoader - INFO - Loading 4331 records in batches of 1000
2025-10-14 17:11:02 - ProcurementLoader - INFO - ✓ Loaded 4331 records
[log output truncated]


ProcurementLoader - Loading Statistics

⏱️  Duration: 2.30 seconds

📊 Nodes Created:
  Supplier                150
  Contract                313
  SupplierRisk            534
  PurchaseOrder         1,747
  PurchaseOrderLine     4,331
  ──────────────────────────────
  Total                 7,075

🔗 Relationships Created:
  HAS_CONTRACT            313
  HAS_RISK                534
  PLACED_ORDER          1,747
  HAS_LINE              4,331
  FULFILLED_BY          4,331
  ──────────────────────────────
  Total                11,256

📈 Records:
  Processed:   7,075
  Failed:          0

✅ No errors



2025-10-14 17:11:02 - ProcurementLoader - INFO - Neo4j connection closed


  > Procurement nodes: 9,793, relationships: 27,131

Combined graph statistics after all loads:
  Total Nodes:         9,793
  Total Relationships: 27,131

  Nodes by Label:
    PurchaseOrderLine     4,331
    CommerceOrder         1,828
    PurchaseOrder         1,747
    MarketingCampaign       600
    SupplierRisk            534
    Contract                313
    Employee                200
    Supplier                150
    Skill                    50
    Brand                    10
    MarketingChannel          8
    Department                6
    Location                  6
    MarketingObjective        5
    MarketingKPI              5

  Relationships by Type:
    KPI_RESULT            8,965
    HAS_LINE              4,331
    FULFILLED_BY          4,331
    ATTRIBUTED_TO         1,828
    ACTIVATED_ON          1,793
    PLACED_ORDER          1,747
    HAS_SKILL             1,496
    FOR_BRAND               600
    HAS_OBJECTIVE           600
    HAS_RISK                534
