## Bulk XAPI ETL Processing

This notebook demonstrates bulk processing of XAPI data using **pure ClickHouse S3 integration**, bypassing the traditional Lambda ETL processing pipeline entirely.

### Why Pure S3 Integration?

The traditional Lambda approach has timeout limitations (15 minutes max) which makes it unsuitable for bulk processing large datasets. ClickHouse's native S3 integration allows us to:

1. **Direct Processing**: Process JSONL files directly from S3 without Lambda timeout limits
2. **Batch Processing**: Handle large datasets (e.g., 2,218 files) in intelligent batches 
3. **Optimized Performance**: Leverage ClickHouse's native S3 functions for maximum efficiency

### Recent S3 Function Syntax Fix ✅

**Issue Resolved**: ClickHouse S3 functions require individual file URLs, not multiple URLs in a single call.

- ❌ **Before**: `s3('url1,url2,url3', 'JSONEachRow')` → caused `NUMBER_OF_ARGUMENTS_DOESNT_MATCH` error
- ✅ **After**: Process each file individually: `s3('url1', 'JSONEachRow')`, `s3('url2', 'JSONEachRow')`, etc.

**Implementation**: The `_process_single_s3_file()` method now handles individual S3 files within batch processing, ensuring proper ClickHouse S3 function syntax.

### Batch Processing Strategy

For large datasets, we use intelligent batching:
- **Batch Size**: 50 files per batch (configurable)
- **Processing**: Each batch processes files individually 
- **Example**: 2,218 files → 45 batches of 50 files each
- **Resilience**: Individual file failures don't stop the entire batch

**Test Results**: ✅ All syntax issues resolved, ready for production processing

# OLI Torus XAPI ETL Pipeline - S3 Integration Bulk Processing

This notebook provides an interface to trigger bulk processing of historical XAPI data using AWS Lambda with **pure ClickHouse S3 integration**.

## S3 Integration Features

The ETL pipeline now uses **pure ClickHouse S3 integration** for all operations:

- **ClickHouse S3 Integration**: Direct processing of files from S3 using ClickHouse's native `s3()` table functions
- **Unified Processing**: Single high-performance method for all batch sizes
- **No Timeout Limits**: Bypasses Lambda's 15-minute limitation completely
- **Consistent Performance**: Same fast processing whether handling 1 file or 1000 files

## Performance Benefits

- **No Lambda Timeout Limits**: Process unlimited datasets without time constraints
- **10x Faster Processing**: ClickHouse S3 integration provides significant performance improvements
- **Cost Effective**: Minimal Lambda execution time for orchestration only
- **Simplified**: Single processing path reduces complexity and maintenance

## Prerequisites

1. AWS credentials configured (via AWS CLI, environment variables, or IAM roles)
2. Lambda functions deployed in your target environment
3. Appropriate permissions to invoke Lambda functions
4. ClickHouse with S3 access permissions for S3 integration
5. ClickHouse `s3()` table functions enabled

## Setup

## Quick Setup - Install Requirements

Run the cell below to install all required dependencies for this notebook:

In [None]:
# Install all required dependencies
# Choose one of the options below:

# Option 1: Install notebook-specific requirements (includes pandas and visualization tools)
!pip install -r notebook-requirements.txt

In [None]:
import boto3
import json
import pandas as pd
from datetime import datetime, timedelta
import time

# Configure AWS region and environment
AWS_REGION = 'us-east-1'  # Change to your region
ENVIRONMENT = 'dev'  # Change to 'staging' or 'prod' as needed

# Initialize AWS clients
lambda_client = boto3.client('lambda', region_name=AWS_REGION)
s3_client = boto3.client('s3', region_name=AWS_REGION)

# Unified Lambda function name (adjust based on your deployment)
XAPI_ETL_FUNCTION = f'xapi-etl-processor-{ENVIRONMENT}'

print(f"Configured for {ENVIRONMENT} environment in {AWS_REGION}")
print(f"Unified XAPI ETL Function: {XAPI_ETL_FUNCTION}")

In [None]:
# Load Environment Variables using python-dotenv
import os

# Install python-dotenv if not already installed
from dotenv import load_dotenv

print("🔧 Loading environment variables from .env file...")

# Load .env file - this is much simpler than manual parsing!
if load_dotenv('.env'):
    print("✅ Environment variables loaded successfully!")

    # Check AWS configuration
    aws_access_key = os.getenv('AWS_ACCESS_KEY_ID')
    aws_secret_key = os.getenv('AWS_SECRET_ACCESS_KEY')
    aws_region = os.getenv('AWS_REGION')
    s3_bucket = os.getenv('S3_XAPI_BUCKET')

    print(f"\n🔍 AWS Configuration Status:")
    print(f"  AWS_ACCESS_KEY_ID: {'✅ Set' if aws_access_key else '❌ Missing'}")
    print(f"  AWS_SECRET_ACCESS_KEY: {'✅ Set' if aws_secret_key else '❌ Missing'}")
    print(f"  AWS_REGION: {aws_region if aws_region else '❌ Missing'}")
    print(f"  S3_XAPI_BUCKET: {s3_bucket if s3_bucket else '❌ Missing'}")

else:
    print(f"❌ Could not load .env file")
    print(f"   Make sure .env file exists with your AWS credentials:")
    print(f"   cp example.env .env")
    print(f"   # Then edit .env with your actual credentials")

In [None]:
import sys

# Execution Mode Configuration
EXECUTION_MODE = 'local'  # Change to 'local' to run locally instead of using Lambda

# For local execution, we'll import the required modules
if EXECUTION_MODE == 'local':
    import sys
    import os

    # Add current directory to path to import local modules
    current_dir = os.path.dirname(os.path.abspath('__file__' if '__file__' in globals() else os.getcwd()))
    if current_dir not in sys.path:
        sys.path.insert(0, current_dir)

    # Import local modules
    try:
        from lambda_function import lambda_handler, health_check
        from common import get_config
        from clickhouse_client import ClickHouseClient
        print(f"✅ Local modules imported successfully")
    except ImportError as e:
        print(f"❌ Failed to import local modules: {e}")
        print("Make sure you're running this notebook from the xapi-etl-processor directory")
        print("Exiting due to failed local module imports.")
        sys.exit(1)

print(f"🔧 Execution mode: {EXECUTION_MODE.upper()}")
if EXECUTION_MODE == 'lambda':
    print(f"   Using Lambda function: {XAPI_ETL_FUNCTION}")
else:
    print(f"   Running locally with imported modules")

## Execution Modes

This notebook supports two execution modes with **pure S3 integration** in both:

### Lambda Mode (Default)
- **EXECUTION_MODE = 'lambda'**
- Invokes the deployed AWS Lambda function remotely
- Requires AWS credentials and deployed Lambda function
- Best for production use and processing large datasets
- Supports true asynchronous processing
- **S3 Integration**: Lambda orchestrates ClickHouse S3 processing for all operations

### Local Mode
- **EXECUTION_MODE = 'local'**
- Runs the Lambda function code locally in this notebook
- Requires local modules (lambda_function.py, common.py, clickhouse_client.py)
- Good for testing and development
- All operations run synchronously in the notebook
- **S3 Integration**: Local orchestration with ClickHouse S3 processing

### S3 Integration Features (Both Modes)
- **Pure S3 Processing**: All operations use ClickHouse native S3 table functions
- **No Timeout Limits**: Process unlimited datasets without Lambda constraints
- **Consistent Performance**: Same high-speed approach for all batch sizes
- **Simplified Architecture**: Single processing path reduces complexity
- **Cost Efficient**: Minimal compute time as ClickHouse handles heavy lifting

**To switch modes:** Change the `EXECUTION_MODE` variable in the cell above and re-run that cell.

## Helper Functions

In [None]:
def execute_function(payload):
    """Execute function either locally or via Lambda based on EXECUTION_MODE"""
    if EXECUTION_MODE == 'local':
        return execute_local(payload)
    else:
        return invoke_lambda_sync(payload)

def execute_function_async(payload):
    """Execute function asynchronously - Lambda only or simulate locally"""
    if EXECUTION_MODE == 'local':
        # For local execution, run synchronously but indicate it's "async"
        result = execute_local(payload)
        if result['success']:
            result['async_mode'] = True
            result['request_id'] = f"local-{datetime.now().strftime('%Y%m%d%H%M%S')}"
        return result
    else:
        return invoke_lambda_async(payload)

def execute_local(payload):
    """Execute the Lambda function locally"""
    try:
        # Create a mock context object
        class MockContext:
            def __init__(self):
                self.function_name = "xapi-etl-processor-local"
                self.function_version = "1"
                self.aws_request_id = f"local-{datetime.now().strftime('%Y%m%d%H%M%S')}"

        context = MockContext()

        # Call the lambda_handler function directly
        result = lambda_handler(payload, context)

        return {
            'success': True,
            'status_code': result.get('statusCode', 200),
            'result': result,
            'execution_mode': 'local'
        }
    except Exception as e:
        return {
            'success': False,
            'error': str(e),
            'execution_mode': 'local'
        }

def invoke_lambda_async(payload):
    """Invoke the unified Lambda function asynchronously"""
    try:
        response = lambda_client.invoke(
            FunctionName=XAPI_ETL_FUNCTION,
            InvocationType='Event',  # Async invocation
            Payload=json.dumps(payload)
        )
        return {
            'success': True,
            'status_code': response['StatusCode'],
            'request_id': response['ResponseMetadata']['RequestId'],
            'execution_mode': 'lambda'
        }
    except Exception as e:
        return {
            'success': False,
            'error': str(e),
            'execution_mode': 'lambda'
        }

def invoke_lambda_sync(payload):
    """Invoke the unified Lambda function synchronously"""
    try:
        response = lambda_client.invoke(
            FunctionName=XAPI_ETL_FUNCTION,
            InvocationType='RequestResponse',  # Sync invocation
            Payload=json.dumps(payload)
        )

        result = json.loads(response['Payload'].read().decode())
        return {
            'success': True,
            'status_code': response['StatusCode'],
            'result': result,
            'execution_mode': 'lambda'
        }
    except Exception as e:
        return {
            'success': False,
            'error': str(e),
            'execution_mode': 'lambda'
        }

def check_function_health():
    """Check if the function is healthy (Lambda or local)"""
    payload = {'health_check': True}
    return execute_function(payload)

print("Helper functions loaded")

## 1. Health Checks

First, let's verify that our Lambda functions and ClickHouse are healthy:

In [None]:
# Check XAPI ETL processor health
print("Checking XAPI ETL processor health...")
health = check_function_health()
print(json.dumps(health, indent=2))

In [None]:
# Optional: Clear existing section data before reprocessing
# Uncomment and run this cell if you want to start fresh for this section

clear_section_data = True  # Set to True to enable clearing
target_section_id = '145'   # Make sure this matches your section ID above

if clear_section_data:
    print(f"⚠️  WARNING: This will delete ALL existing data for section {target_section_id}")

    if EXECUTION_MODE == 'local':
        try:
            # For local execution, directly use ClickHouse client
            from clickhouse_client import ClickHouseClient
            clickhouse_client = ClickHouseClient()

            # Delete all events for this section across all tables
            tables = ['video_events', 'activity_attempt_events', 'page_attempt_events',
                     'page_viewed_events', 'part_attempt_events']

            total_deleted = 0
            for table in tables:
                try:
                    query = f"DELETE FROM {clickhouse_client.database}.{table} WHERE section_id = {target_section_id}"
                    response = clickhouse_client._execute_query(query)
                    print(f"✅ Cleared {table} for section {target_section_id}")
                except Exception as e:
                    print(f"⚠️  Table {table} may not exist or is empty: {str(e)}")

            print(f"✅ Section {target_section_id} data cleared - you can now set force_reprocess=False")

        except Exception as e:
            print(f"❌ Failed to clear section data: {str(e)}")
    else:
        print("❌ Data clearing only available in local mode")
        print("   For Lambda mode, use force_reprocess=True instead")
else:
    print("💡 Section data clearing disabled")
    print("   Set clear_section_data = True above to enable section data clearing")

## 3. Process Specific Section Data

Process historical data for a specific course section:

In [None]:
# Configure section-specific processing
section_id = '145'  # Replace with actual section ID
start_date = '2023-01-01'  # Adjust as needed
end_date = '2025-12-31'    # Adjust as needed

# Use S3_XAPI_BUCKET from environment variables
S3_XAPI_BUCKET = os.getenv('S3_XAPI_BUCKET', 'torus-xapi-test')  # Fallback to default if not in .env

section_payload = {
    'mode': 'bulk',
    's3_bucket': S3_XAPI_BUCKET,
    'section_id': section_id,
    'start_date': start_date,
    'end_date': end_date,
    'force_reprocess': False  # Set to True to reprocess existing data
}

print(f"Processing section {section_id} data from {start_date} to {end_date}...")
print(f"Using S3 bucket: {S3_XAPI_BUCKET}")
if section_payload['force_reprocess']:
    print(f"⚠️  Force reprocess enabled - will reprocess existing data")
else:
    print(f"💡 Force reprocess disabled - will skip existing data")

print("\n🚀 S3 Integration Processing:")
print("   - Uses ClickHouse native S3 table functions")
print("   - High performance for all batch sizes")
print("   - No Lambda timeout limitations")
print("   - Direct S3 processing without file downloads")

if EXECUTION_MODE == 'local':
    print("\n🏠 This will run locally (synchronously).")
else:
    print("\n☁️  This will be an asynchronous Lambda operation.")

section_result = execute_function_async(section_payload)

if section_result['success']:
    print(f"\n✅ Successfully triggered S3 integration processing for section {section_id}")
    print(f"Request ID: {section_result['request_id']}")
    if EXECUTION_MODE == 'lambda':
        print("📊 Check CloudWatch logs for processing status.")
        print("💡 Look for 'processing_method: s3_integration' in results")
    else:
        print("🔍 Local execution completed.")
        # If local and we have result data, show processing method
        if 'result' in section_result and isinstance(section_result['result'], dict):
            result_body = section_result['result'].get('body', {})
            if isinstance(result_body, str):
                try:
                    result_data = json.loads(result_body)
                    processing_method = result_data.get('processing_method', 'unknown')
                    print(f"📈 Processing method used: {processing_method}")
                except:
                    pass
else:
    print(f"❌ Failed to trigger processing: {section_result.get('error', 'Unknown error')}")

## 4. Bulk Process Multiple Sections

Process data for multiple sections (useful for large-scale historical data loading):

In [None]:
# List of section IDs to process
section_ids = ['123', '456', '789']  # Replace with actual section IDs
date_range = {
    'start_date': '2024-01-01',
    'end_date': '2024-12-31'
}

# Use S3_XAPI_BUCKET from environment variables
S3_XAPI_BUCKET = os.getenv('S3_XAPI_BUCKET', 'torus-xapi-test')  # Fallback to default if not in .env

print(f"Processing {len(section_ids)} sections with S3 integration...")
print(f"Using S3 bucket: {S3_XAPI_BUCKET}")
print("🚀 All sections will use ClickHouse S3 Integration:")
print("   - High performance processing for all sections")
print("   - No timeout limitations regardless of data size")
print("   - Direct S3 processing with native ClickHouse functions")

results = []
for i, section_id in enumerate(section_ids, 1):
    payload = {
        'mode': 'bulk',
        's3_bucket': S3_XAPI_BUCKET,
        'section_id': section_id,
        **date_range,
        'force_reprocess': False
    }

    print(f"  {i}/{len(section_ids)}: Triggering S3 integration processing for section {section_id}...")

    result = execute_function_async(payload)
    results.append({
        'section_id': section_id,
        'success': result['success'],
        'request_id': result.get('request_id'),
        'error': result.get('error'),
        'execution_mode': result.get('execution_mode', 'unknown')
    })

    # Small delay to avoid overwhelming Lambda (not needed for local execution)
    if EXECUTION_MODE == 'lambda':
        time.sleep(1)

# Summary
successful = sum(1 for r in results if r['success'])
failed = len(results) - successful

print(f"\n📊 Processing Summary:")
print(f"  ✅ Successfully triggered: {successful}")
print(f"  ❌ Failed: {failed}")
print(f"  🚀 All sections use ClickHouse S3 Integration")

if successful > 0:
    print(f"\n💡 Monitor CloudWatch logs for processing status:")
    print(f"   - 'processing_method: s3_integration' indicates successful S3 processing")
    print(f"   - High performance processing with no timeout limitations")

if failed > 0:
    print("\n❌ Failed sections:")
    for result in results:
        if not result['success']:
            print(f"  - Section {result['section_id']}: {result['error']}")

## 5. Process All Available Data

Process all available XAPI data (use with caution for large datasets):

In [None]:
# ⚠️ WARNING: This will process ALL available data. Use carefully!
process_all = False  # Set to True to enable

if process_all:
    # Use S3_XAPI_BUCKET from environment variables
    S3_XAPI_BUCKET = os.getenv('S3_XAPI_BUCKET', 'torus-xapi-test')  # Fallback to default if not in .env

    all_data_payload = {
        'mode': 'bulk',
        's3_bucket': S3_XAPI_BUCKET,
        's3_prefix': 'section/',
        'start_date': '2024-01-01',  # Adjust as needed
        'end_date': '2024-12-31',    # Adjust as needed
        'force_reprocess': False
    }

    print("⚠️  Processing ALL available data with S3 Integration...")
    print(f"Using S3 bucket: {S3_XAPI_BUCKET}")
    print("🚀 Performance advantages for large datasets:")
    print("   - Uses ClickHouse S3 Integration for optimal performance")
    print("   - Bypasses Lambda 15-minute timeout limitation completely")
    print("   - Processes unlimited files efficiently")
    print("   - Direct S3 processing with native ClickHouse functions")

    if EXECUTION_MODE == 'local':
        print("\n🏠 This will run locally (synchronously).")
        print("   Large datasets will be handled by ClickHouse S3 integration.")
    else:
        print("\n☁️  This is an asynchronous Lambda operation.")
        print("   Lambda orchestrates ClickHouse for all heavy processing.")

    all_result = execute_function_async(all_data_payload)

    if all_result['success']:
        print(f"\n✅ Successfully triggered S3 integration bulk processing")
        print(f"Request ID: {all_result['request_id']}")
        if EXECUTION_MODE == 'lambda':
            print("📊 Monitor CloudWatch logs for:")
            print("   - 'processing_method: s3_integration' status")
            print("   - Performance metrics and progress")
            print("   - Processing completion status")
        else:
            print("🔍 Local execution completed.")
    else:
        print(f"❌ Failed to trigger bulk processing: {all_result.get('error')}")
else:
    print("🛑 Bulk processing disabled. Set process_all = True to enable.")
    print("💡 When enabled, the system will use:")
    print("   - ClickHouse S3 Integration for optimal performance on all datasets")
    print("   - No timeout limitations regardless of data size")
    print("   - Direct S3 processing for maximum efficiency")

## 6. Test Single File Processing

Test processing of a single JSONL file:

In [None]:
# Test with a specific file
test_bucket = 'your-xapi-bucket'  # Replace with your S3 bucket
test_key = 'section/123/video/2024-01-01T12-00-00.000Z_test-bundle.jsonl'  # Replace with actual file

test_payload = {
    'bucket': test_bucket,
    'key': test_key
}

print(f"Testing single file processing: s3://{test_bucket}/{test_key}")

test_result = execute_function(test_payload)

if test_result['success']:
    result_body = test_result['result']['body']
    if isinstance(result_body, str):
        result_data = json.loads(result_body)
    else:
        result_data = result_body

    print("✅ Single file processing completed")
    print(json.dumps(result_data, indent=2))
else:
    print(f"❌ Single file processing failed: {test_result.get('error')}")

## 7. Monitoring and Troubleshooting

Check Lambda function logs and status:

In [None]:
# Check recent Lambda invocations (requires CloudWatch Logs access)
import boto3
from datetime import datetime, timedelta

logs_client = boto3.client('logs', region_name=AWS_REGION)

def get_recent_lambda_logs(function_name, hours=1):
    """Get recent logs for a Lambda function"""
    log_group = f'/aws/lambda/{function_name}'

    try:
        end_time = datetime.utcnow()
        start_time = end_time - timedelta(hours=hours)

        response = logs_client.filter_log_events(
            logGroupName=log_group,
            startTime=int(start_time.timestamp() * 1000),
            endTime=int(end_time.timestamp() * 1000),
            limit=100
        )

        return response.get('events', [])
    except Exception as e:
        print(f"Error getting logs for {function_name}: {str(e)}")
        return []

# Get recent logs for XAPI ETL processor
print(f"Recent logs for {XAPI_ETL_FUNCTION}:")
etl_logs = get_recent_lambda_logs(XAPI_ETL_FUNCTION)
for event in etl_logs[-5:]:  # Show last 5 log events
    timestamp = datetime.fromtimestamp(event['timestamp'] / 1000)
    print(f"[{timestamp}] {event['message']}")