# Advanced Topics: Job Orchestration & Additional Pitfalls
## Companion Notebook to Performance Pitfalls Workshop

This notebook covers:
- Delta Lake transaction issues and best practices
- Object store semantics (S3/Azure/GCS)
- File formats and compression codecs comparison
- Advanced job scheduling patterns
- Alerting and retry strategies
- Real-world orchestration examples

---

## Setup

In [None]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *
from pyspark.sql.types import *
import time
from datetime import datetime, timedelta

# In Databricks, SparkSession is already available as 'spark'
# Delta Lake is pre-configured, no need to set configurations
print(f"Spark version: {spark.version}")

# Import Delta Table (available in Databricks by default)
try:
    from delta.tables import DeltaTable
    print("✅ Delta Lake is available")
except ImportError:
    print("⚠️  Delta Lake not available - some examples will be skipped")
    print("   (This is OK for learning other concepts)")

---
## Delta Lake: Concurrency & Transaction Issues

**What is it?**  
Delta Lake provides ACID transactions, but improper usage leads to conflicts and performance problems.

**Common Issues:**
- Concurrent write conflicts
- Small files problem
- Vacuum timing issues
- Transaction log growth

In [None]:
# Create a Delta table for demonstration
delta_path = "/tmp/delta_sales"

# Initial data
sales_data = [
    (1, 'Product_A', 100, '2025-01-01', 'US'),
    (2, 'Product_B', 200, '2025-01-01', 'UK'),
    (3, 'Product_C', 150, '2025-01-02', 'US'),
    (4, 'Product_A', 300, '2025-01-02', 'UK'),
    (5, 'Product_B', 250, '2025-01-03', 'US'),
]

df = spark.createDataFrame(sales_data, ['id', 'product', 'amount', 'date', 'region'])

# Write as Delta table
df.write.format('delta').mode('overwrite').save(delta_path)

print("✅ Delta table created")
spark.read.format('delta').load(delta_path).show()

### Issue #1: Small Files Problem

In [None]:
# ❌ BAD: Multiple small writes create too many small files
print("BAD: Creating many small files...")

# Simulate streaming writes (creates many small files)
for i in range(10):
    small_batch = spark.createDataFrame(
        [(i + 100, f'Product_{i}', i * 10, '2025-01-04', 'US')],
        ['id', 'product', 'amount', 'date', 'region']
    )
    small_batch.write.format('delta').mode('append').save(delta_path)

# Check the number of files
delta_table = DeltaTable.forPath(spark, delta_path)
file_count = len(delta_table.toDF().inputFiles())
print(f"\n⚠️  Total files: {file_count}")
print("Problem: Too many small files slow down reads!")

In [None]:
# ✅ GOOD: Optimize to compact small files
print("GOOD: Compacting small files with OPTIMIZE...")

# Run OPTIMIZE to compact files
spark.sql(f"OPTIMIZE delta.`{delta_path}`")

# Check file count after optimization
delta_table_optimized = DeltaTable.forPath(spark, delta_path)
optimized_file_count = len(delta_table_optimized.toDF().inputFiles())
print(f"\n✅ Files after OPTIMIZE: {optimized_file_count}")
print(f"Reduced files by: {file_count - optimized_file_count}")

In [None]:
# ✅ BETTER: Auto-optimize for streaming workloads
print("BETTER: Enable auto-optimize for future writes")

# Enable auto-optimize and auto-compaction
spark.sql(f"""
    ALTER TABLE delta.`{delta_path}` 
    SET TBLPROPERTIES (
        'delta.autoOptimize.optimizeWrite' = 'true',
        'delta.autoOptimize.autoCompact' = 'true'
    )
""")

print("\n✅ Auto-optimize enabled!")
print("Future writes will automatically compact files")

### Issue #2: Concurrent Write Conflicts

In [None]:
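# Demonstrate the merge/upsert pattern. Each MERGE commits atomically, but
# concurrent writers that touch the same files can still hit a conflict
# exception, so wrap MERGE in retry logic when several jobs write to one table.
print("SAFE PATTERN: Using MERGE for atomic upserts")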

# New data to upsert
updates = spark.createDataFrame([
    (1, 'Product_A', 150, '2025-01-05', 'US'),  # Update existing
    (200, 'Product_Z', 500, '2025-01-05', 'JP'), # Insert new
], ['id', 'product', 'amount', 'date', 'region'])

# Perform merge (upsert)
delta_table = DeltaTable.forPath(spark, delta_path)

delta_table.alias('target').merge(
    updates.alias('source'),
    'target.id = source.id'
).whenMatchedUpdateAll() \
 .whenNotMatchedInsertAll() \
 .execute()

print("\n✅ Merge completed successfully")
spark.read.format('delta').load(delta_path).filter(col('id').isin(1, 200)).show()
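
MERGE is atomic, but two jobs merging into the same table at once can still collide. Below is a minimal sketch of retrying only on conflict errors. `WriteConflictError`, `merge_with_conflict_retry`, and `fake_merge` are hypothetical stand-ins so the block runs without Spark. In a real job you would put the MERGE inside the callable and catch the concurrency exceptions your Delta Lake version exposes.

```python
import time

class WriteConflictError(Exception):
    """Hypothetical stand-in for a Delta concurrent-write exception."""

def merge_with_conflict_retry(run_merge, max_attempts=3, base_delay=0.01):
    """Call run_merge(); retry only when it raises WriteConflictError."""
    for attempt in range(1, max_attempts + 1):
        try:
            return run_merge()
        except WriteConflictError:
            if attempt == max_attempts:
                raise  # out of attempts: surface the conflict to the caller
            time.sleep(base_delay * 2 ** (attempt - 1))  # back off, then re-run

# Demo: the first call "conflicts", the second succeeds.
calls = {"n": 0}

def fake_merge():
    calls["n"] += 1
    if calls["n"] == 1:
        raise WriteConflictError("another writer committed first")
    return "merged"

merge_result = merge_with_conflict_retry(fake_merge)
print(merge_result, "after", calls["n"], "attempts")
```

Other exceptions (bad schema, missing table) are deliberately not retried, since repeating them would not help.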

### Issue #3: Vacuum Timing and Retention

In [None]:
# Understanding VACUUM and time travel
print("Understanding VACUUM and retention...")

# Check history
print("\nTable history:")
spark.sql(f"DESCRIBE HISTORY delta.`{delta_path}`").select(
    'version', 'timestamp', 'operation', 'operationMetrics'
).show(truncate=False)

# Time travel example
print("\nTime travel - reading version 0:")
spark.read.format('delta').option('versionAsOf', 0).load(delta_path).show()

print("""
⚠️  VACUUM Best Practices:

1. Default retention: 7 days
2. Don't VACUUM too soon (breaks time travel!)
3. Set appropriate retention for your use case:
   - Development: 7 days (default)
   - Production: 30+ days
   - Compliance: 90+ days

4. VACUUM removes old files to save storage
5. But you can't time travel past VACUUM point!
""")

# Set retention period
spark.sql(f"""
    ALTER TABLE delta.`{delta_path}`
    SET TBLPROPERTIES ('delta.deletedFileRetentionDuration' = 'interval 30 days')
""")

print("✅ Retention period set to 30 days")
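
To make the retention trade-off concrete, here is a small, conservative check for whether a point in time is still safely inside the retention window. `can_time_travel` is a hypothetical helper, not a Delta Lake API. It simplifies the real rules: time travel also depends on transaction log retention and on whether VACUUM has actually run.

```python
from datetime import datetime, timedelta

def can_time_travel(target, now, retention_days=7):
    """True if `target` is still inside the deleted-file retention window."""
    return now - target <= timedelta(days=retention_days)

now = datetime(2025, 2, 1)
recent_ok = can_time_travel(datetime(2025, 1, 28), now)        # 4 days old, 7-day window
old_default = can_time_travel(datetime(2025, 1, 10), now)      # 22 days old, 7-day window
old_extended = can_time_travel(datetime(2025, 1, 10), now, 30) # 22 days old, 30-day window
print(recent_ok, old_default, old_extended)  # True False True
```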

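**💡 Delta Lake Key Takeaways:**
- Run `OPTIMIZE` regularly to compact small files
- Enable auto-optimize for streaming workloads
- Use `MERGE` for atomic upserts, and retry on concurrent-write conflicts
- Set retention periods that match your time-travel and compliance needs
- Don't `VACUUM` with a retention shorter than your longest-running reader or time-travel window
- Use `OPTIMIZE ... ZORDER BY` on columns that appear in frequent filters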

---
## Object Store Semantics (S3/Azure/GCS)

**What is it?**  
Cloud object stores are not filesystems - they have different consistency and performance characteristics.

**Key Issues:**
- LIST operations are slow and expensive
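- Consistency assumptions (S3 has been strongly consistent since December 2020, but older tooling and guides still assume eventual consistency)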
- Rate limiting
- Small file performance

In [None]:
print("""
📊 OBJECT STORE BEST PRACTICES:

1. S3 Specific:
   ✅ Use S3A filesystem (s3a://)
   ✅ Prefer Parquet/ORC predicate pushdown for filtered reads
      (S3 Select is not available in every account or runtime)
   ✅ Stay within S3 request rate limits (they apply per key prefix)
   ✅ Key naming: spread heavy traffic across several prefixes
   
2. Azure Blob/ADLS:
   ✅ Use ADLS Gen2 (better performance)
   ✅ Enable hierarchical namespace
   ✅ Use appropriate access tiers
   
3. GCS (Google Cloud Storage):
   ✅ Use composite objects for large files
   ✅ Enable parallel composite uploads
   
4. General:
   ✅ Minimize LIST operations (use partitioning)
   ✅ Write larger files (128MB+)
   ✅ Use columnar formats (Parquet/ORC)
   ✅ Enable cloud-specific optimizations
   ❌ Don't treat like local filesystem!
   ❌ Avoid RENAME operations (copy + delete)
   ❌ Don't have too many small files
""")

# Example configurations for S3
s3_configs = {
    # Connection pooling
    "fs.s3a.connection.maximum": "100",
    
    # Enable multipart uploads
    "fs.s3a.multipart.size": "104857600",  # 100MB
    "fs.s3a.multipart.threshold": "209715200",  # 200MB
    
    # Fast upload
    "fs.s3a.fast.upload": "true",
    "fs.s3a.fast.upload.buffer": "disk",
    
    # Performance tuning
    "fs.s3a.threads.max": "50",
    "fs.s3a.connection.ssl.enabled": "true",
}

print("\nExample S3 Configurations:")
for key, value in s3_configs.items():
    print(f"  {key} = {value}")
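    # These are Hadoop options: when using S3, set them as spark.hadoop.<key>
    # in the cluster or session config. A runtime spark.conf.set(key, value)
    # may not affect a filesystem that is already initialized.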
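
The "write larger files (128MB+)" advice translates into simple arithmetic. A sketch, assuming you already know or can estimate the dataset size in bytes; `target_file_count` is a hypothetical helper whose result you could pass to `repartition()` before writing.

```python
import math

def target_file_count(total_bytes, target_file_mb=128):
    """Number of output files needed so each is roughly target_file_mb."""
    target_bytes = target_file_mb * 1024 * 1024
    files = math.ceil(total_bytes / target_bytes)
    return files if files > 0 else 1  # always write at least one file

ten_gib_files = target_file_count(10 * 1024**3)
tiny_files = target_file_count(5 * 1024**2)
print(ten_gib_files, tiny_files)  # 80 1
```

The helper avoids the built-in `max()` on purpose: this notebook runs `from pyspark.sql.functions import *`, which replaces `max`, `min`, and `sum` with Spark column functions.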

---
## File Formats & Compression Codecs

**What is it?**  
Choice of file format and compression codec significantly impacts performance and storage costs.

**Key Considerations:**
- Read vs write performance
- Compression ratio vs CPU cost
- Splittability for parallelism
- Schema evolution support

In [None]:
# Create test data
test_data = spark.range(0, 1000000).select(
    col('id'),
    (col('id') % 100).alias('category'),
    (rand() * 1000).alias('value'),
    concat(lit('text_'), col('id').cast('string')).alias('description')
)

test_data.cache().count()
print(f"Test data created: {test_data.count():,} rows")

In [None]:
import os

# Compare different formats
formats = ['parquet', 'orc', 'csv', 'json']
results = []

for fmt in formats:
    path = f"/tmp/format_test_{fmt}"
    
    # Write
    start = time.time()
    test_data.write.format(fmt).mode('overwrite').save(path)
    write_time = time.time() - start
    
    # Get size (approximate for demo)
    size = "N/A"  # In production, calculate actual size
    
    # Read
    start = time.time()
    read_df = spark.read.format(fmt).load(path)
    read_df.count()
    read_time = time.time() - start
    
    results.append((fmt, write_time, read_time, size))

# Display results
results_df = spark.createDataFrame(results, ['format', 'write_time_sec', 'read_time_sec', 'size'])
print("\nFormat Performance Comparison:")
results_df.show()

test_data.unpersist()
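
The `size` column above is left as "N/A". One way to fill it in, assuming the output directory is reachable as a local path, is to sum the file sizes on disk. `dir_size_bytes` is a hypothetical helper. On Databricks, a Spark path such as `/tmp/...` usually points at DBFS, so the local path may differ or you may need a different listing mechanism.

```python
import os
import tempfile

def dir_size_bytes(path):
    """Total size of all regular files under a local directory."""
    total = 0
    for root, _dirs, files in os.walk(path):
        for name in files:
            total += os.path.getsize(os.path.join(root, name))
    return total

# Demo on a throwaway local directory
with tempfile.TemporaryDirectory() as tmp:
    with open(os.path.join(tmp, "part-0000.bin"), "wb") as fh:
        fh.write(b"x" * 2048)
    demo_size = dir_size_bytes(tmp)
print(demo_size)  # 2048
```

Note that Spark output directories also contain marker and checksum files, which this count includes.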

In [None]:
# Compare compression codecs for Parquet
codecs = ['snappy', 'gzip', 'lz4', 'uncompressed']
codec_results = []

test_data_small = spark.range(100000).select(
    col('id'),
    concat(lit('text_'), col('id').cast('string')).alias('text')
).cache()
test_data_small.count()

for codec in codecs:
    path = f"/tmp/codec_test_{codec}"
    
    try:
        # Write with compression
        start = time.time()
        test_data_small.write \
            .format('parquet') \
            .option('compression', codec) \
            .mode('overwrite') \
            .save(path)
        write_time = time.time() - start
        
        # Read
        start = time.time()
        spark.read.parquet(path).count()
        read_time = time.time() - start
        
        codec_results.append((codec, write_time, read_time))
    except Exception as e:
        print(f"⚠️  {codec} not available: {e}")

# Display results
if codec_results:
    codec_df = spark.createDataFrame(codec_results, ['codec', 'write_time_sec', 'read_time_sec'])
    print("\nCompression Codec Comparison (Parquet):")
    codec_df.show()

test_data_small.unpersist()
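
The timings above only show speed. To see the compression ratio side of the trade-off without Spark, here is a sketch using Python's built-in `zlib` (the algorithm behind gzip) at different effort levels. It is a stand-in for illustration only: Parquet codecs behave differently on columnar data, and the payload here is made up.

```python
import zlib

# Repetitive text, similar in spirit to the generated text column above
payload = "\n".join(f"text_{i}" for i in range(20000)).encode("utf-8")

sizes = {}
for level in (1, 6, 9):
    compressed = zlib.compress(payload, level)
    assert zlib.decompress(compressed) == payload  # lossless round trip
    sizes[level] = len(compressed)
    print(f"level={level} bytes={sizes[level]:,} ratio={len(payload) / sizes[level]:.1f}x")
```

Higher levels spend more CPU for smaller output, which is the same trade-off you make when choosing between Snappy and gzip.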

In [None]:
print("""
📊 FILE FORMAT RECOMMENDATIONS:

🏆 PARQUET (Best for most use cases):
   ✅ Columnar format (great for analytics)
   ✅ Excellent compression
   ✅ Predicate pushdown
   ✅ Schema evolution support
   ✅ Industry standard
   🎯 Use with: Snappy compression (balanced)
   
🥈 ORC (Alternative to Parquet):
   ✅ Compression comparable to Parquet (often slightly better, depending on data)
   ✅ Built-in indexes
   ✅ Native to Hive ecosystem
   ⚠️  Less widespread adoption than Parquet
   
📄 DELTA (Parquet + ACID transactions):
   ✅ All Parquet benefits + transactions
   ✅ Time travel
   ✅ Schema evolution
   ✅ MERGE/UPDATE/DELETE support
   🎯 Recommended for production data lakes
   
❌ CSV/JSON (Avoid for large data):
   ❌ Uncompressed by default (and gzipped text files are not splittable)
   ❌ No schema enforcement
   ❌ No predicate pushdown
   ❌ Slow to parse
   ✅ Only use for: data exchange, small files

🔐 COMPRESSION CODECS:
   (Parquet and ORC compress blocks inside the file, so files stay
    splittable with any codec. Splittability matters for CSV/JSON.)

   SNAPPY (default, recommended):
     ✅ Fast compression/decompression
     ✅ Moderate compression ratio
     🎯 Best for: General use
     
   GZIP:
     ✅ Better compression than Snappy
     ❌ Slower compression and decompression
     ❌ Not splittable when applied to whole CSV/JSON files
     🎯 Best for: Cold storage, rarely read data
     
   LZ4:
     ✅ Very fast decompression
     ⚠️  Lower compression ratio
     🎯 Best for: Hot data, frequently queried
     
   ZSTD:
     ✅ Excellent compression ratio
     ✅ Fast decompression
     🎯 Best for: Newer Spark versions (3.2+)
""")

---
## Advanced Job Scheduling Patterns

Let's look at real-world job orchestration patterns.

### Pattern 1: Incremental Processing with Checkpointing

In [None]:
# Incremental processing pattern
class IncrementalProcessor:
    """
    Process only new data since last run using checkpoints
    """
    
    def __init__(self, checkpoint_path):
        self.checkpoint_path = checkpoint_path
    
    def get_last_checkpoint(self):
        """Read last processed timestamp"""
        try:
            checkpoint_df = spark.read.parquet(self.checkpoint_path)
            last_timestamp = checkpoint_df.agg(max('processed_until')).collect()[0][0]
            return last_timestamp
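        except Exception:
            # No readable checkpoint yet (e.g. first run): process from the beginning.
            # 'date' is a 'YYYY-MM-DD' string compared lexicographically, so the
            # default must sort before every real date value.
            return '1970-01-01'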
    
    def save_checkpoint(self, timestamp):
        """Save current processing timestamp"""
        checkpoint_data = spark.createDataFrame(
            [(timestamp, datetime.now().isoformat())],
            ['processed_until', 'checkpoint_time']
        )
        checkpoint_data.write.mode('append').parquet(self.checkpoint_path)
    
    def process_incremental(self, source_path, target_path):
        """Process only new data"""
        print("🔄 Starting incremental processing...")
        
        # Get last checkpoint
        last_processed = self.get_last_checkpoint()
        print(f"   Last processed: {last_processed}")
        
        # Read only new data
        source_df = spark.read.format('delta').load(source_path)
        new_data = source_df.filter(col('date') > last_processed)
        
        new_count = new_data.count()
        print(f"   New records to process: {new_count:,}")
        
        if new_count > 0:
            # Process and write
            processed = new_data.groupBy('region', 'product').agg(
                sum('amount').alias('total_sales'),
                count('*').alias('transaction_count')
            )
            
            # Append to target
            processed.write.format('delta').mode('append').save(target_path)
            
            # Update checkpoint
            max_date = new_data.agg(max('date')).collect()[0][0]
            self.save_checkpoint(max_date)
            
            print(f"   ✅ Processed {new_count:,} records")
            print(f"   ✅ Updated checkpoint to: {max_date}")
        else:
            print("   ⏭️  No new data to process")
        
        return new_count

# Demo
processor = IncrementalProcessor('/tmp/checkpoint_demo')
records_processed = processor.process_incremental(delta_path, '/tmp/aggregated_sales')

print("\n💡 This pattern ensures you only process new data each run!")
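
The checkpoint write is what makes reruns safe, so it helps if a crash can never leave a half-written checkpoint. A minimal sketch for a local POSIX filesystem, using a temp file plus `os.replace`. `load_watermark` and `save_watermark` are hypothetical names, and this does not carry over to object stores, where rename is not atomic (a Delta table or a single-object write is the usual alternative there).

```python
import json
import os
import tempfile

def load_watermark(path, default="1970-01-01"):
    """Return the last processed date, or `default` on the first run."""
    try:
        with open(path) as fh:
            return json.load(fh)["processed_until"]
    except FileNotFoundError:
        return default

def save_watermark(path, processed_until):
    """Write to a temp file, then swap it in so readers never see a partial file."""
    tmp_path = path + ".tmp"
    with open(tmp_path, "w") as fh:
        json.dump({"processed_until": processed_until}, fh)
    os.replace(tmp_path, path)  # atomic on a local POSIX filesystem

with tempfile.TemporaryDirectory() as tmp:
    ckpt = os.path.join(tmp, "checkpoint.json")
    wm_before = load_watermark(ckpt)
    save_watermark(ckpt, "2025-01-03")
    wm_after = load_watermark(ckpt)
print(wm_before, wm_after)  # 1970-01-01 2025-01-03
```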

### Pattern 2: Retry Logic with Exponential Backoff

In [None]:
import time
import random

def retry_with_backoff(func, max_retries=3, initial_delay=1, backoff_factor=2):
    """
    Retry function with exponential backoff
    """
    delay = initial_delay
    
    for attempt in range(max_retries):
        try:
            print(f"\n🔄 Attempt {attempt + 1}/{max_retries}...")
            result = func()
            print("   ✅ Success!")
            return result
        except Exception as e:
            print(f"   ❌ Failed: {e}")
            
            if attempt < max_retries - 1:
                # Add jitter to prevent thundering herd
                jitter = random.uniform(0, delay * 0.1)
                sleep_time = delay + jitter
                
                print(f"   ⏳ Waiting {sleep_time:.2f} seconds before retry...")
                time.sleep(sleep_time)
                
                # Exponential backoff
                delay *= backoff_factor
            else:
                print("   💥 Max retries reached, giving up")
                raise

# Demo: Simulate unreliable operation
attempt_count = 0

def unreliable_operation():
    global attempt_count
    attempt_count += 1
    
    # Fail first 2 times, succeed on 3rd
    if attempt_count < 3:
        raise Exception(f"Simulated failure (attempt {attempt_count})")
    
    return "Success!"

# Run with retry logic
print("Demonstrating retry with exponential backoff:")
result = retry_with_backoff(unreliable_operation, max_retries=5)
print(f"\nFinal result: {result}")
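
Unbounded exponential growth can produce very long waits, so production retry loops usually cap the delay. The sketch below only computes the base wait times (before jitter) so you can inspect a schedule up front. `backoff_schedule` and its `max_delay` parameter are illustrative additions, not part of `retry_with_backoff` above.

```python
def backoff_schedule(max_retries, initial_delay=1.0, backoff_factor=2.0, max_delay=30.0):
    """Base delay (before jitter) to wait after each failed attempt."""
    delays = []
    delay = initial_delay
    for _ in range(max_retries - 1):  # no wait after the final attempt
        delays.append(delay if delay < max_delay else max_delay)
        delay *= backoff_factor
    return delays

short_schedule = backoff_schedule(5)
capped_schedule = backoff_schedule(8, max_delay=30.0)
print(short_schedule)   # [1.0, 2.0, 4.0, 8.0]
print(capped_schedule)  # [1.0, 2.0, 4.0, 8.0, 16.0, 30.0, 30.0]
```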

### Pattern 3: Circuit Breaker for Downstream Dependencies

In [None]:
class CircuitBreaker:
    """
    Circuit breaker pattern to prevent cascading failures
    """
    
    def __init__(self, failure_threshold=3, timeout=60):
        self.failure_count = 0
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.last_failure_time = None
        self.state = 'CLOSED'  # CLOSED, OPEN, HALF_OPEN
    
    def call(self, func):
        """
        Execute function with circuit breaker protection
        """
        # Check if circuit is open
        if self.state == 'OPEN':
            # Check if timeout has passed
            if time.time() - self.last_failure_time > self.timeout:
                self.state = 'HALF_OPEN'
                print("🟡 Circuit breaker HALF_OPEN (testing recovery)")
            else:
                raise Exception("Circuit breaker is OPEN - too many failures")
        
        try:
            result = func()
            
            # Success - reset failure count
            if self.state == 'HALF_OPEN':
                print("🟢 Circuit breaker CLOSED (recovered)")
            
            self.failure_count = 0
            self.state = 'CLOSED'
            return result
            
        except Exception as e:
            self.failure_count += 1
            self.last_failure_time = time.time()
            
            print(f"❌ Failure {self.failure_count}/{self.failure_threshold}")
            
            if self.failure_count >= self.failure_threshold:
                self.state = 'OPEN'
                print("🔴 Circuit breaker OPEN (too many failures)")
            
            raise

# Demo
breaker = CircuitBreaker(failure_threshold=3, timeout=5)

def flaky_api_call():
    if random.random() < 0.7:  # 70% failure rate
        raise Exception("API call failed")
    return "API response"

print("Demonstrating circuit breaker:")
for i in range(6):
    print(f"\n--- Call {i+1} ---")
    try:
        result = breaker.call(flaky_api_call)
        print(f"✅ Success: {result}")
    except Exception as e:
        print(f"❌ Error: {e}")
    time.sleep(1)

print("\n💡 Circuit breaker prevents cascading failures!")
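
Because the demo above relies on `random` and real sleeps, its output changes from run to run. For unit tests it helps to inject the clock. The following is a compact, hypothetical variant (`TimedBreaker` is not a library class) that walks through the closed, open, and half-open transitions deterministically.

```python
class TimedBreaker:
    """Hypothetical compact breaker with an injectable clock for testing."""

    def __init__(self, failure_threshold, timeout, clock):
        self.failure_threshold = failure_threshold
        self.timeout = timeout
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, func):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.timeout:
                raise RuntimeError("circuit open: call skipped")
            # Timeout elapsed: fall through and allow one trial call (half-open)
        try:
            result = func()
        except Exception:
            self.failures += 1
            if self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # (re)open the circuit
            raise
        self.failures = 0
        self.opened_at = None
        return result

fake_now = [0.0]
tb = TimedBreaker(failure_threshold=2, timeout=10, clock=lambda: fake_now[0])

def always_fails():
    raise ValueError("downstream unavailable")

states = []
for _ in range(3):
    try:
        tb.call(always_fails)
    except ValueError:
        states.append("failed")
    except RuntimeError:
        states.append("skipped")

fake_now[0] = 11.0                    # advance the fake clock past the timeout
states.append(tb.call(lambda: "ok"))  # trial call succeeds and closes the circuit
print(states)  # ['failed', 'failed', 'skipped', 'ok']
```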

### Pattern 4: Dead Letter Queue for Failed Records

In [None]:
def process_with_dlq(input_df, output_path, dlq_path):
    """
    Process data and send failures to Dead Letter Queue
    """
    print("🔄 Processing with Dead Letter Queue...")
    
    # Validate with native Spark expressions instead of a Python UDF, which
    # avoids row-by-row serialization. Null or negative amounts are failures.
    is_invalid = col('amount').isNull() | (col('amount') < 0)
    processed_df = input_df.withColumn(
        'processed_amount',
        when(~is_invalid, col('amount') * 2)
    ).withColumn(
        'error_message',
        when(is_invalid, lit('Amount must be non-null and non-negative'))
    )
    
    # Split successful and failed records
    success_df = processed_df.filter(col('error_message').isNull())
    failed_df = processed_df.filter(col('error_message').isNotNull())
    
    success_count = success_df.count()
    failed_count = failed_df.count()
    
    print(f"   ✅ Successful records: {success_count:,}")
    print(f"   ❌ Failed records: {failed_count:,}")
    
    # Write successful records
    if success_count > 0:
        success_df.drop('error_message').write.format('delta').mode('append').save(output_path)
        print(f"   💾 Saved successful records to: {output_path}")
    
    # Write failed records to DLQ
    if failed_count > 0:
        failed_df.withColumn('dlq_timestamp', current_timestamp()).write \
            .format('delta').mode('append').save(dlq_path)
        print(f"   📮 Sent failed records to DLQ: {dlq_path}")
    
    return success_count, failed_count

# Demo
test_data_with_errors = spark.createDataFrame([
    (1, 100, '2025-01-01', 'US'),
    (2, -50, '2025-01-01', 'UK'),  # This will fail
    (3, 200, '2025-01-02', 'US'),
    (4, -100, '2025-01-02', 'JP'), # This will fail
    (5, 300, '2025-01-03', 'US'),
], ['id', 'amount', 'date', 'region'])

success, failed = process_with_dlq(
    test_data_with_errors,
    '/tmp/processed_output',
    '/tmp/dead_letter_queue'
)

# Show DLQ contents
print("\nDead Letter Queue contents:")
spark.read.format('delta').load('/tmp/dead_letter_queue').show(truncate=False)
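
A DLQ only pays off if its records are eventually replayed. Here is a plain-Python sketch of a replay step with an attempt cap. The record shape (`payload`, `attempts`) and the names `replay_dlq` and `double_amount` are assumptions for illustration; a Spark version would read the DLQ table and apply the same rules.

```python
def replay_dlq(dlq_records, process, max_attempts=3):
    """Re-run failed records; return (recovered, still_failing, parked)."""
    recovered, still_failing, parked = [], [], []
    for rec in dlq_records:
        attempts = rec.get("attempts", 1) + 1
        try:
            recovered.append(process(rec["payload"]))
        except ValueError as err:
            updated = {**rec, "attempts": attempts, "error": str(err)}
            # Park records that keep failing so they stop being retried forever
            (parked if attempts >= max_attempts else still_failing).append(updated)
    return recovered, still_failing, parked

def double_amount(payload):
    if payload["amount"] < 0:
        raise ValueError("Negative values not allowed")
    return {**payload, "processed_amount": payload["amount"] * 2}

dlq = [
    {"payload": {"id": 2, "amount": 50}, "attempts": 1},    # fixed upstream, now valid
    {"payload": {"id": 4, "amount": -100}, "attempts": 2},  # still invalid
]
recovered, still_failing, parked = replay_dlq(dlq, double_amount)
print(len(recovered), len(still_failing), len(parked))  # 1 0 1
```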

---
## Complete Production Job Template

In [None]:
# Production-grade job template with all patterns combined
import json
from datetime import datetime
from typing import Dict, Tuple

class ProductionETLJob:
    """
    Complete production ETL job with:
    - Incremental processing
    - Error handling & retries
    - Dead letter queue
    - Monitoring & alerting
    - Data quality checks
    """
    
    def __init__(self, job_name: str, config: Dict):
        self.job_name = job_name
        self.config = config
        self.metrics = {
            'job_name': job_name,
            'start_time': None,
            'end_time': None,
            'status': 'initialized',
            'records_read': 0,
            'records_processed': 0,
            'records_failed': 0,
            'errors': [],
            'warnings': []
        }
    
    def run_data_quality_checks(self, df):
        """Run data quality validations"""
        print("\n🔍 Running data quality checks...")
        
        checks = []
        
        # Check 1: No null keys
        null_count = df.filter(col('id').isNull()).count()
        checks.append(('null_ids', null_count == 0, f"Found {null_count} null IDs"))
        
        # Check 2: Valid amounts
        invalid_amounts = df.filter((col('amount').isNull()) | (col('amount') < 0)).count()
        checks.append(('valid_amounts', invalid_amounts == 0, f"Found {invalid_amounts} invalid amounts"))
        
        # Check 3: Record count threshold
        # (named record_count so it does not shadow pyspark's count() function)
        record_count = df.count()
        min_expected = self.config.get('min_records', 0)
        checks.append(('record_count', record_count >= min_expected, 
                      f"Record count {record_count} {'>=' if record_count >= min_expected else '<'} minimum {min_expected}"))
        
        # Report
        all_passed = True
        for check_name, passed, message in checks:
            status = "✅" if passed else "❌"
            print(f"   {status} {check_name}: {message}")
            
            if not passed:
                all_passed = False
                self.metrics['warnings'].append(f"Data quality check failed: {check_name} - {message}")
        
        return all_passed
    
    def process_data(self, input_df):
        """Main processing logic"""
        # Your transformation logic here
        result = input_df.groupBy('region', 'product').agg(
            sum('amount').alias('total_sales'),
            count('*').alias('transaction_count'),
            avg('amount').alias('avg_sale')
        )
        return result
    
    def run(self):
        """Main job execution"""
        self.metrics['start_time'] = datetime.now().isoformat()
        self.metrics['status'] = 'running'
        
        try:
            print(f"\n{'='*80}")
            print(f"🚀 Starting job: {self.job_name}")
            print(f"{'='*80}")
            
            # Step 1: Read data
            print("\n📥 Step 1: Reading source data...")
            source_df = spark.read.format('delta').load(self.config['source_path'])
            self.metrics['records_read'] = source_df.count()
            print(f"   Read {self.metrics['records_read']:,} records")
            
            # Step 2: Data quality checks
            quality_passed = self.run_data_quality_checks(source_df)
            if not quality_passed and self.config.get('fail_on_quality_issues', False):
                raise Exception("Data quality checks failed")
            
            # Step 3: Process
            print("\n⚙️  Step 3: Processing data...")
            result_df = self.process_data(source_df)
            # Note: this counts aggregated output rows, not input records
            self.metrics['records_processed'] = result_df.count()
            print(f"   Produced {self.metrics['records_processed']:,} aggregated rows")
            
            # Step 4: Write output
            print("\n💾 Step 4: Writing output...")
            result_df.write \
                .format('delta') \
                .mode('overwrite') \
                .option('overwriteSchema', 'true') \
                .save(self.config['output_path'])
            print(f"   Output written to: {self.config['output_path']}")
            
            # Success!
            self.metrics['status'] = 'success'
            print("\n✅ Job completed successfully!")
            
        except Exception as e:
            self.metrics['status'] = 'failed'
            self.metrics['errors'].append(str(e))
            print(f"\n❌ Job failed: {e}")
            raise
        
        finally:
            self.metrics['end_time'] = datetime.now().isoformat()
            
            # Log metrics
            print("\n" + "="*80)
            print("📊 JOB METRICS:")
            print("="*80)
            print(json.dumps(self.metrics, indent=2))
            
            # In production: send to monitoring system
            # self.send_metrics_to_cloudwatch(self.metrics)
            # self.send_alerts_if_needed(self.metrics)
        
        return self.metrics

# Example usage
job_config = {
    'source_path': delta_path,
    'output_path': '/tmp/production_output',
    'min_records': 5,
    'fail_on_quality_issues': False
}

job = ProductionETLJob('customer_analytics_v2', job_config)
metrics = job.run()

print("\n" + "="*80)
print("✅ Production job template complete!")
print("="*80)
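
The template leaves alerting as a commented-out hook. One possible shape for that hook is a pure function that maps the metrics dict to alerts, which keeps the routing rules easy to test. The channel names, the `max_failed_ratio` threshold, and `decide_alerts` itself are assumptions for illustration; actually sending to email, chat, or a pager is left out.

```python
def decide_alerts(metrics, max_failed_ratio=0.05):
    """Map a job metrics dict to a list of (channel, message) alerts."""
    alerts = []
    name = metrics["job_name"]
    if metrics["status"] == "failed":
        alerts.append(("pager", f"{name} failed: {metrics['errors'][:1]}"))
    read = metrics["records_read"]
    if read and metrics["records_failed"] / read > max_failed_ratio:
        alerts.append(("chat", f"{name}: high failure ratio"))
    if metrics["warnings"]:
        alerts.append(("email", f"{name}: {len(metrics['warnings'])} warning(s)"))
    return alerts

sample_metrics = {
    "job_name": "customer_analytics_v2", "status": "success",
    "records_read": 1000, "records_failed": 80,
    "errors": [], "warnings": ["Data quality check failed: valid_amounts"],
}
sample_alerts = decide_alerts(sample_metrics)
for channel, message in sample_alerts:
    print(channel, "->", message)
```

Here the job succeeded, but 8% of records failed and one warning was logged, so the sketch raises a chat alert and an email rather than paging anyone.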

---
## Summary: Production Job Checklist

### 🎯 Essential Components:

**1. Error Handling:**
- [ ] `try`/`except` blocks around operations that can fail
- [ ] Retry logic with exponential backoff
- [ ] Circuit breaker for external dependencies
- [ ] Dead letter queue for failed records

**2. Monitoring:**
- [ ] Capture start/end times
- [ ] Track record counts at each stage
- [ ] Log all errors and warnings
- [ ] Send metrics to monitoring system

**3. Data Quality:**
- [ ] Validate input data
- [ ] Check for nulls, duplicates, outliers
- [ ] Verify record count thresholds
- [ ] Validate business logic constraints

**4. Performance:**
- [ ] Review execution plans
- [ ] Optimize joins and aggregations
- [ ] Appropriate partitioning
- [ ] Cache only when needed

**5. Reliability:**
- [ ] Incremental processing with checkpoints
- [ ] Idempotent operations
- [ ] Transaction safety (Delta Lake)
- [ ] Proper cleanup on failure

**6. Alerting:**
- [ ] Email/Slack on failure
- [ ] PagerDuty for critical failures
- [ ] Dashboard for job metrics
- [ ] SLA monitoring

### 📅 Scheduling Best Practices:

**Databricks Jobs:**
```python
# Job settings to configure via the UI or send to the Jobs API.
# In Jobs API 2.1, max_retries and retry_on_timeout are set on each task
# rather than at the job level.
job_settings = {
    "name": "customer_analytics_daily",
    "schedule": {
        "quartz_cron_expression": "0 0 2 * * ?",  # 2 AM daily
        "timezone_id": "UTC",
        "pause_status": "UNPAUSED",
    },
    "max_concurrent_runs": 1,
    "timeout_seconds": 3600,
    "max_retries": 2,
    "retry_on_timeout": True,
    "email_notifications": {
        "on_failure": ["data-team@company.com"],
        "on_success": [],
        "no_alert_for_skipped_runs": True,
    },
}
```
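
Quartz cron expressions start with a seconds field, which trips up people used to five-field Unix cron. A small sketch that names the fields so a schedule can be sanity-checked before it is deployed; `parse_quartz_cron` is a hypothetical helper and it does no validation beyond counting fields.

```python
QUARTZ_FIELDS = ["seconds", "minutes", "hours", "day_of_month", "month", "day_of_week", "year"]

def parse_quartz_cron(expression):
    """Split a Quartz cron expression into named fields (year is optional)."""
    parts = expression.split()
    if len(parts) not in (6, 7):
        raise ValueError(f"expected 6 or 7 fields, got {len(parts)}")
    return dict(zip(QUARTZ_FIELDS, parts))

cron_fields = parse_quartz_cron("0 0 2 * * ?")
print(cron_fields["hours"], cron_fields["minutes"], cron_fields["seconds"])  # 2 0 0
```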

### 🔗 Useful Resources:
- Databricks Jobs API: https://docs.databricks.com/dev-tools/api/latest/jobs.html
- Delta Lake Best Practices: https://docs.delta.io/latest/best-practices.html
- Spark Monitoring Guide: https://spark.apache.org/docs/latest/monitoring.html

---

## 🎓 Workshop Complete!

You now have:
- ✅ Understanding of common performance pitfalls
- ✅ Production-ready job templates
- ✅ Error handling patterns
- ✅ Monitoring and alerting strategies
- ✅ Real-world scheduling examples

**Next Steps:**
1. Apply these patterns to your jobs
2. Set up monitoring dashboards
3. Implement incremental processing
4. Configure alerts and retries
