# HelixGraph ETL Pipeline

This notebook demonstrates how to load the HR, Marketing, and Procurement datasets into Neo4j using the ETL framework.

**Data Sources**:
- **HR**: JSON data (self-generated)
- **Marketing**: JSON data (self-generated, 600 campaigns)
- **Procurement**: CSV data from HEL-19 (mertalpaydin, 3,446 records)

## 1. Configure Project Path and Imports

In [1]:
# Set up project path (local Jupyter run)
import sys
import os

# Locate project root (notebook is inside notebooks/ directory)
project_root = os.path.abspath(os.path.join(os.getcwd(), '..'))
print(f"Project root: {project_root}")

if project_root not in sys.path:
    sys.path.insert(0, project_root)
    print("✓ Added project root to Python path")

try:
    from etl.hr_loader import HRLoader
    from etl.marketing_loader import MarketingLoader
    from etl.procurement_csv_loader import ProcurementCSVLoader  # Updated to CSV version
    from etl.utils import get_neo4j_config
    print("✓ ETL loaders imported successfully")
    print("  - HRLoader (JSON)")
    print("  - MarketingLoader (JSON)")
    print("  - ProcurementCSVLoader (CSV) ✨")
except ImportError as e:
    print(f"✗ Import error: {e}")
    print("Ensure you are running this notebook from the project notebooks/ directory")
    print("Current directory:", os.getcwd())
    raise  # Stop here: later cells depend on these imports

Project root: /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph
✓ Added project root to Python path
✓ ETL loaders imported successfully
  - HRLoader (JSON)
  - MarketingLoader (JSON)
  - ProcurementCSVLoader (CSV) ✨
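
The cell above assumes the notebook is started from `notebooks/`, so that the parent of the working directory is the project root. As a hedged alternative, the sketch below walks upward from a starting directory until it finds one that contains an `etl` folder. The helper name `find_project_root` and the choice of `etl` as a marker are assumptions for illustration and are not part of the project.

```python
from pathlib import Path
from typing import Optional


def find_project_root(start: Path, marker: str = "etl") -> Optional[Path]:
    """Return the nearest ancestor of `start` (or `start` itself) that contains `marker`.

    Hypothetical helper: `marker` is assumed to be a directory that exists
    only at the project root (here, the `etl` package folder).
    """
    start = start.resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None  # No ancestor contains the marker directory
```

A notebook could call `find_project_root(Path.cwd())` and, if the result is not `None`, insert `str(result)` into `sys.path` exactly as the cell above does with `project_root`.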


## 2. Load Environment Configuration

Read Neo4j credentials from the `.env` file in the project root, falling back to `.env.example` if `.env` is missing.

In [2]:
# Load .env configuration
from dotenv import load_dotenv

# Locate and load .env file
env_path = os.path.join(project_root, '.env')
env_example_path = os.path.join(project_root, '.env.example')

if os.path.exists(env_path):
    load_dotenv(env_path)
    print("✓ Loaded configuration from .env")
elif os.path.exists(env_example_path):
    load_dotenv(env_example_path)
    print("⚠️  Loaded configuration from .env.example")
    print("   Tip: duplicate .env.example as .env and update credentials")
else:
    print("❌ .env or .env.example not found")
    print("   Please create an .env file or set environment variables")

✓ Loaded configuration from .env


## 3. Retrieve Neo4j Configuration

In [3]:
# Retrieve Neo4j configuration
try:
    config = get_neo4j_config()
    print("✓ Configuration loaded successfully")
    print(f"  URI:      {config['uri']}")
    print(f"  User:     {config['user']}")
    print("  Database: neo4j")
except ValueError as e:
    print(f"❌ Configuration error: {e}")
    print("\nPlease check:")
    print("1. .env file exists (or use .env.example)")
    print("2. Contains NEO4J_URI, NEO4J_USER, NEO4J_PASSWORD")
    print("\nAlternatively set environment variables:")
    print("  export NEO4J_URI='...'")
    print("  export NEO4J_USER='...'")
    print("  export NEO4J_PASSWORD='...'")
    raise

✓ Configuration loaded successfully
  URI:      neo4j+s://561f8654.databases.neo4j.io
  User:     neo4j
  Database: neo4j
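
The error handler above names three required variables: `NEO4J_URI`, `NEO4J_USER`, and `NEO4J_PASSWORD`. The real `get_neo4j_config` in `etl.utils` is not shown in this notebook, so the following is only a minimal sketch of how such a helper could behave. It assumes the helper reads those variables from the environment, returns a dict with `uri`, `user`, and `password` keys (the `password` key is an assumption), and raises `ValueError` when any value is missing. The name `read_neo4j_config` is hypothetical.

```python
import os
from typing import Dict, Mapping, Optional

# Mapping from config key to environment variable name.
# The "password" key is an assumption; "uri" and "user" appear in the cell above.
REQUIRED_VARS = {
    "uri": "NEO4J_URI",
    "user": "NEO4J_USER",
    "password": "NEO4J_PASSWORD",
}


def read_neo4j_config(environ: Optional[Mapping[str, str]] = None) -> Dict[str, str]:
    """Hypothetical stand-in for etl.utils.get_neo4j_config (not the real implementation)."""
    env = os.environ if environ is None else environ
    # Treat unset and empty values the same way
    missing = [name for name in REQUIRED_VARS.values() if not env.get(name)]
    if missing:
        raise ValueError(f"Missing Neo4j settings: {', '.join(missing)}")
    return {key: env[name] for key, name in REQUIRED_VARS.items()}
```

Passing a plain dict as `environ` makes the check easy to try without touching real credentials.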


## 4. Run Multi-domain ETL Pipeline

In [4]:
# Clear the graph once, before the first loader runs (set to False to keep existing data)
CLEAR_DATABASE_FIRST = True

load_plan = [
    ("HR", HRLoader, {"batch_size": 100}),
    ("Marketing", MarketingLoader, {"batch_size": 500}),
    ("Procurement CSV", ProcurementCSVLoader, {"batch_size": 500}),  # Updated to CSV version
]

final_stats = None

for idx, (domain, loader_cls, loader_kwargs) in enumerate(load_plan, 1):
    print("=" * 70)
    print(f"[{idx}] Loading {domain} data")
    print("=" * 70)

    with loader_cls(**config, **loader_kwargs) as loader:
        print("- Testing connection...")
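        if not loader.test_connection():
            # Raise rather than break so the completion message is not printed after a failed connection
            raise ConnectionError(f"✗ Connection failed while loading {domain}. Aborting remaining loads.")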

        if CLEAR_DATABASE_FIRST:
            print("- Clearing database before first load...")
            loader.clear_database()
            CLEAR_DATABASE_FIRST = False

        print("- Running loader...")
        loader.load()

        # Judging by the recorded output, these are graph-wide totals so far, not per-domain counts
        stats = loader.get_graph_statistics()
        final_stats = stats
        print(f"  > {domain} nodes: {stats['total_nodes']:,}, relationships: {stats['total_relationships']:,}")

if final_stats:
    print("\nCombined graph statistics after all loads:")
    print(f"  Total Nodes:         {final_stats['total_nodes']:,}")
    print(f"  Total Relationships: {final_stats['total_relationships']:,}")

    print("\n  Nodes by Label:")
    for label, count in final_stats['nodes_by_label'].items():
        print(f"    {label:20} {count:>6,}")

    print("\n  Relationships by Type:")
    for rel_type, count in final_stats['relationships_by_type'].items():
        print(f"    {rel_type:20} {count:>6,}")

print("\n✅ Multi-domain ETL pipeline completed!")
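
The per-domain line printed inside the loop reports whatever `get_graph_statistics()` returns after each load. In the recorded run those figures appear to be graph-wide totals: the 2,712 nodes reported after Marketing equal the 256 HR nodes plus the 2,456 Marketing nodes. A rough sketch of deriving per-domain increments from such cumulative snapshots follows. `per_domain_deltas` is a hypothetical helper. It assumes only the `total_nodes` and `total_relationships` keys used in the cell above, and a graph that starts empty.

```python
from typing import Dict, List, Tuple

Snapshot = Dict[str, int]


def per_domain_deltas(snapshots: List[Tuple[str, Snapshot]]) -> Dict[str, Snapshot]:
    """Convert cumulative (domain, stats) snapshots into per-domain increments."""
    deltas: Dict[str, Snapshot] = {}
    # Baseline of zero assumes the database was cleared before the first load
    previous = {"total_nodes": 0, "total_relationships": 0}
    for domain, stats in snapshots:
        deltas[domain] = {key: stats[key] - previous[key] for key in previous}
        previous = {key: stats[key] for key in previous}
    return deltas


# Cumulative totals as printed in the recorded run of this notebook
snapshots = [
    ("HR", {"total_nodes": 256, "total_relationships": 2148}),
    ("Marketing", {"total_nodes": 2712, "total_relationships": 15938}),
    ("Procurement CSV", {"total_nodes": 5310, "total_relationships": 20035}),
]

for domain, delta in per_domain_deltas(snapshots).items():
    print(f"{domain}: +{delta['total_nodes']:,} nodes, "
          f"+{delta['total_relationships']:,} relationships")
```

These increments need not match the "Nodes Created" and "Relationships Created" figures that each loader reports for itself, so they are best read as a cross-check and not as a replacement for the loader statistics.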

2025-10-15 18:06:13 - HRLoader - INFO - Initialized HRLoader
2025-10-15 18:06:13 - HRLoader - INFO - Connected to Neo4j at neo4j+s://561f8654.databases.neo4j.io
2025-10-15 18:06:13 - HRLoader - INFO - Data directory: /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr


[1] Loading HR data
- Testing connection...


2025-10-15 18:06:13 - HRLoader - INFO - ✓ Connection test successful


- Clearing database before first load...


2025-10-15 18:06:13 - HRLoader - INFO - ✓ Database cleared
2025-10-15 18:06:13 - HRLoader - INFO - Starting HR Data Load
2025-10-15 18:06:13 - HRLoader - INFO - Loading data files...
2025-10-15 18:06:13 - HRLoader - INFO - Loading employees from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_employees.json
2025-10-15 18:06:13 - HRLoader - INFO - ✓ Loaded 200 employees
2025-10-15 18:06:13 - HRLoader - INFO - Loading skills from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_skills.json
2025-10-15 18:06:13 - HRLoader - INFO - ✓ Loaded 50 skills
2025-10-15 18:06:13 - HRLoader - INFO - Loading employee-skills from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/hr/hr_employee_skills.json
2025-10-15 18:06:13 - HRLoader - INFO - ✓ Loaded 1559 employee-skill relationships
2025-10-15 18:06:13 - HRLoader - INFO - Validating data with Pydantic schemas...
2025-10-15 18:06:13 - HRLoader - INFO - Validated 200/200 records
[log output truncated]

- Running loader...


2025-10-15 18:06:13 - HRLoader - INFO - ✓ Schema setup complete
2025-10-15 18:06:13 - HRLoader - INFO - Loading departments...
2025-10-15 18:06:13 - HRLoader - INFO - Loading 3 records in batches of 50
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 3 records
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 3 departments
2025-10-15 18:06:14 - HRLoader - INFO - Loading locations...
2025-10-15 18:06:14 - HRLoader - INFO - Loading 3 records in batches of 50
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 3 records
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 3 locations
2025-10-15 18:06:14 - HRLoader - INFO - Loading skills...
2025-10-15 18:06:14 - HRLoader - INFO - Loading 50 records in batches of 50
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 50 records
2025-10-15 18:06:14 - HRLoader - INFO - ✓ Loaded 50 skills
2025-10-15 18:06:14 - HRLoader - INFO - Loading employees...
2025-10-15 18:06:14 - HRLoader - INFO - Loading 200 records in batches of 100
[log output truncated]


HRLoader - Loading Statistics

⏱️  Duration: 1.29 seconds

📊 Nodes Created:
  Department                3
  Location                  3
  Skill                    50
  Employee                200
  ──────────────────────────────
  Total                   256

🔗 Relationships Created:
  WORKS_IN                200
  LOCATED_IN              200
  REPORTS_TO              189
  HAS_SKILL             1,559
  ──────────────────────────────
  Total                 2,148

📈 Records:
  Processed:   2,004
  Failed:          0

✅ No errors

  > HR nodes: 256, relationships: 2,148
[2] Loading Marketing data
- Testing connection...


2025-10-15 18:06:14 - MarketingLoader - INFO - ✓ Connection test successful
2025-10-15 18:06:14 - MarketingLoader - INFO - Starting marketing data load
2025-10-15 18:06:14 - MarketingLoader - INFO - Loading marketing dataset from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/marketing/marketing_data.json
2025-10-15 18:06:15 - MarketingLoader - INFO - Loaded marketing dataset: 600 campaigns, 10 brands
2025-10-15 18:06:15 - MarketingLoader - INFO - Setting up marketing schema constraints and indexes


- Running loader...


2025-10-15 18:06:15 - MarketingLoader - INFO - Loading 10 records in batches of 200
2025-10-15 18:06:15 - MarketingLoader - INFO - ✓ Loaded 10 records
2025-10-15 18:06:15 - MarketingLoader - INFO - Loading 5 records in batches of 200
2025-10-15 18:06:15 - MarketingLoader - INFO - ✓ Loaded 5 records
2025-10-15 18:06:15 - MarketingLoader - INFO - Loading 5 records in batches of 200
2025-10-15 18:06:15 - MarketingLoader - INFO - ✓ Loaded 5 records
2025-10-15 18:06:15 - MarketingLoader - INFO - Loading 8 records in batches of 200
2025-10-15 18:06:15 - MarketingLoader - INFO - ✓ Loaded 8 records
2025-10-15 18:06:15 - MarketingLoader - INFO - Loading 600 records in batches of 200
2025-10-15 18:06:16 - MarketingLoader - INFO - ✓ Loaded 600 records
2025-10-15 18:06:16 - MarketingLoader - INFO - Loading 1793 records in batches of 500
2025-10-15 18:06:16 - MarketingLoader - INFO - ✓ Loaded 1793 records
2025-10-15 18:06:16 - MarketingLoader - INFO - Loading 8965 records in batches of 1000
[log output truncated]


MarketingLoader - Loading Statistics

⏱️  Duration: 4.01 seconds

📊 Nodes Created:
  Brand                    10
  MarketingObjective        5
  MarketingKPI              5
  MarketingChannel          8
  MarketingCampaign       600
  CommerceOrder         1,828
  ──────────────────────────────
  Total                 2,456

🔗 Relationships Created:
  FOR_BRAND               600
  HAS_OBJECTIVE           600
  ACTIVATED_ON          1,793
  KPI_RESULT            8,965
  ATTRIBUTED_TO         1,828
  ──────────────────────────────
  Total                13,786

📈 Records:
  Processed:  13,214
  Failed:          0

✅ No errors

  > Marketing nodes: 2,712, relationships: 15,938
[3] Loading Procurement CSV data
- Testing connection...


2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - ✓ Connection test successful
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Starting procurement CSV data load
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loading procurement CSV data from /Users/ivan/FSFM/01_Courses/Coop/Helixgraph/HEL-20/HelixGraph/data/procurement_csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded 240 records from suppliers.csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded 120 records from products.csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded 1452 records from purchase_orders.csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded 674 records from invoices.csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded 960 records from risks.csv
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loaded procurement data: 240 suppliers, 120 products, 1452 POs, 674 invoices, 960 risks
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Setting up procureme

- Running loader...


2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loading 240 records in batches of 500
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - ✓ Loaded 240 records
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loading 120 records in batches of 200
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - ✓ Loaded 120 records
2025-10-15 18:06:19 - ProcurementCSVLoader - INFO - Loading 1452 records in batches of 500
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - ✓ Loaded 1452 records
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - Loading 674 records in batches of 500
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - ✓ Loaded 674 records
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - Loading 960 records in batches of 500
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - ✓ Loaded 960 records
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - Procurement CSV graph totals: 5310 nodes, 20035 relationships
2025-10-15 18:06:20 - ProcurementCSVLoader - INFO - Neo4j connection 


ProcurementCSVLoader - Loading Statistics

⏱️  Duration: 1.82 seconds

📊 Nodes Created:
  Supplier                240
  Product                 120
  PurchaseOrder         1,452
  Invoice                 674
  SupplierRisk            960
  ──────────────────────────────
  Total                 3,446

🔗 Relationships Created:
  FROM_SUPPLIER         1,452
  FOR_PRODUCT           1,452
  FOR_PURCHASE_ORDER      674
  ASSESSES                960
  ──────────────────────────────
  Total                 4,538

📈 Records:
  Processed:   3,446
  Failed:          0

✅ No errors

  > Procurement CSV nodes: 5,310, relationships: 20,035

Combined graph statistics after all loads:
  Total Nodes:         5,310
  Total Relationships: 20,035

  Nodes by Label:
    CommerceOrder         1,828
    SupplierRisk            960
    Invoice                 674
    MarketingCampaign       600
    PurchaseOrder           480
    Supplier                240
    Employee                200
    ProductCategory