# Module 09: Advanced Scripting Techniques

**Difficulty**: ⭐⭐⭐ (Advanced)

**Estimated Time**: 75 minutes

**Prerequisites**: 
- Completed Modules 00-08
- Strong Python programming skills
- Understanding of processes and threading concepts
- Familiarity with testing principles

## Learning Objectives

By the end of this notebook, you will be able to:

1. **Implement** parallel processing for performance gains
2. **Apply** advanced error handling and recovery patterns
3. **Configure** structured logging for production systems
4. **Write** testable automation scripts
5. **Optimize** script performance using profiling
6. **Build** production-ready automation tools

## Introduction: Professional Automation

This module elevates your automation scripts from working prototypes to production-ready systems.

### What Makes Code "Production-Ready"?

**Basic Script** → **Production System**

| Aspect | Basic | Production |
|--------|-------|------------|
| **Error Handling** | Try-except if needed | Comprehensive with recovery |
| **Logging** | Print statements | Structured logging with levels |
| **Performance** | "Good enough" | Profiled and optimized |
| **Testing** | Manual testing | Automated test suite |
| **Scalability** | Single-threaded | Parallel when beneficial |
| **Monitoring** | None | Metrics and alerts |
| **Documentation** | Minimal comments | Complete docs + examples |

### When to Apply Advanced Techniques

**Use parallel processing when:**
- Processing many independent items (batch operations)
- I/O-bound tasks (API calls, file operations)
- CPU-bound tasks on multi-core systems

**Use structured logging when:**
- Scripts run unattended
- Debugging production issues
- Compliance/audit requirements
- Multiple team members use the script

**Write tests when:**
- Code will be reused frequently
- Multiple people contribute
- Changes are risky
- Regression bugs are costly

This module teaches you **how** and **when** to apply each technique.

In [None]:
# Setup: Import required libraries
import time
import logging
from pathlib import Path
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor, as_completed
import multiprocessing
from functools import wraps
import traceback
from datetime import datetime

print(f"CPU cores available: {multiprocessing.cpu_count()}")
print("Setup complete!")

## 1. Parallel Processing

Speed up batch operations by processing multiple items simultaneously.

### Threading vs Multiprocessing

| Method | Best For | Limitation |
|--------|----------|------------|
| **Threading** | I/O-bound (file/network) | GIL limits CPU usage |
| **Multiprocessing** | CPU-bound (computation) | Higher memory overhead |

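**GIL (Global Interpreter Lock)**: A lock in CPython, the standard Python interpreter, that allows only one thread at a time to execute Python bytecode. Threads still help with I/O-bound work because the lock is released while a thread waits on file or network operations. Use multiprocessing for CPU-heavy tasks.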
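
As a minimal sketch of the I/O-bound case, the snippet below runs a few blocking tasks on a thread pool with `executor.map`, which returns results in input order (unlike `as_completed`, which yields them as they finish). The `fake_download` function and its delays are hypothetical stand-ins for real network or file operations.

```python
import time
from concurrent.futures import ThreadPoolExecutor

def fake_download(item):
    """Hypothetical I/O-bound task: wait, then return a label."""
    name, delay = item
    time.sleep(delay)  # sleeping releases the GIL, as blocking I/O does
    return f"{name}: done"

# Later items finish first, to show that map() still preserves input order
jobs = [("a", 0.03), ("b", 0.02), ("c", 0.01)]

with ThreadPoolExecutor(max_workers=3) as executor:
    ordered_results = list(executor.map(fake_download, jobs))

print(ordered_results)
```

`executor.map` is convenient when results must line up with inputs; `as_completed` is the better fit when each result should be handled as soon as it is ready.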

In [None]:
# Example: Process files in parallel
import time
from concurrent.futures import ThreadPoolExecutor, as_completed

def process_file(file_path):
    """
    Simulate processing a file (I/O-bound operation).
    
    Args:
        file_path: Path to file
    
    Returns:
        dict: Processing results
    """
    # Simulate I/O operation
    time.sleep(0.5)
    
    return {
        'file': str(file_path),
        'status': 'success',
        'lines': 100
    }

def process_files_sequential(files):
    """Process files one at a time."""
    results = []
    for file in files:
        result = process_file(file)
        results.append(result)
    return results

def process_files_parallel(files, max_workers=4):
    """
    Process files in parallel using ThreadPoolExecutor.
    
    Args:
        files: List of file paths
        max_workers: Maximum concurrent threads
    
    Returns:
        list: Processing results
    """
    results = []
    
    with ThreadPoolExecutor(max_workers=max_workers) as executor:
        # Submit all tasks
        future_to_file = {executor.submit(process_file, f): f for f in files}
        
        # Collect results as they complete
        for future in as_completed(future_to_file):
            file = future_to_file[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                print(f"Error processing {file}: {e}")
    
    return results

# Compare performance
test_files = [f'file_{i}.txt' for i in range(10)]

print("Processing 10 files...\n")

# Sequential (perf_counter is a monotonic clock intended for measuring intervals)
start = time.perf_counter()
results_seq = process_files_sequential(test_files)
time_seq = time.perf_counter() - start
print(f"Sequential: {time_seq:.2f}s")

# Parallel
start = time.perf_counter()
results_par = process_files_parallel(test_files, max_workers=4)
time_par = time.perf_counter() - start
print(f"Parallel (4 workers): {time_par:.2f}s")

speedup = time_seq / time_par
print(f"\nSpeedup: {speedup:.1f}x faster")

### 1.1 CPU-Bound Parallel Processing

For computation-heavy tasks, use multiprocessing: each worker process runs its own interpreter with its own GIL, so the work can proceed on several cores at once.
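
The sketch below shows one way to structure a CPU-bound job for a process pool, assuming a pure, module-level worker function (so it can be pickled and sent to worker processes). The names `is_prime`, `sum_primes_below`, and `sum_primes_parallel` are illustrative. Only the sequential reference values are computed here, since they are what you would compare the parallel results against.

```python
from concurrent.futures import ProcessPoolExecutor

def is_prime(n):
    """Return True if n is prime (trial division up to sqrt(n))."""
    if n < 2:
        return False
    return all(n % d != 0 for d in range(2, int(n ** 0.5) + 1))

def sum_primes_below(limit):
    """CPU-bound worker: sum of all primes strictly below limit."""
    return sum(i for i in range(limit) if is_prime(i))

def sum_primes_parallel(limits, max_workers=None):
    """Run sum_primes_below for each limit in separate processes.

    executor.map returns results in the same order as `limits`.
    """
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        return list(executor.map(sum_primes_below, limits))

# Sequential reference values, useful for checking the parallel results
reference = [sum_primes_below(n) for n in (10, 20, 30)]
print(reference)
```

In a script, call `sum_primes_parallel` under an `if __name__ == "__main__":` guard. On platforms that start workers with the spawn method (Windows, and macOS by default), the worker function must also be importable by the child process, which functions defined interactively in a notebook may not be.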

In [None]:
# CPU-bound task example
import multiprocessing
import time
from concurrent.futures import ProcessPoolExecutor, as_completed

def compute_intensive_task(n):
    """
    Simulate CPU-intensive computation.
    
    Args:
        n: Input value
    
    Returns:
        int: Computed result
    """
    # Simulate computation: sum all primes below n (start at 2, since 0 and 1 are not prime)
    result = sum(i for i in range(2, n) if all(i % j != 0 for j in range(2, int(i**0.5) + 1)))
    return result

def process_batch_parallel(items, max_workers=None):
    """
    Process items in parallel using multiple processes.
    
    Args:
        items: List of items to process
        max_workers: Number of processes (None = CPU count)
    
    Returns:
        list: Results
    """
    if max_workers is None:
        max_workers = multiprocessing.cpu_count()
    
    results = []
    
    with ProcessPoolExecutor(max_workers=max_workers) as executor:
        future_to_item = {executor.submit(compute_intensive_task, item): item for item in items}
        
        for future in as_completed(future_to_item):
            item = future_to_item[future]
            try:
                result = future.result()
                results.append(result)
            except Exception as e:
                print(f"Error processing {item}: {e}")
    
    return results

# Demo with smaller numbers for quick execution
test_items = [1000, 2000, 3000, 4000]

print(f"Processing {len(test_items)} CPU-intensive tasks")
print(f"Using {multiprocessing.cpu_count()} processes\n")

start = time.time()
results = process_batch_parallel(test_items)
elapsed = time.time() - start

print(f"Completed in {elapsed:.2f}s")
# Note: results are collected in completion order, which may differ from the order of test_items
print(f"Results: {results}")

## 2. Advanced Error Handling Patterns

Production code needs robust error handling with recovery strategies.
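
One common recovery strategy is to retry a failed call, waiting longer after each failure. As a small worked example, the helper below computes the wait schedule for exponential backoff with a cap. The function name `backoff_delays` is hypothetical, and it assumes `max_retries` counts total attempts, so there is one fewer wait than attempts.

```python
def backoff_delays(max_retries=3, base_delay=1, max_delay=60):
    """Delays (seconds) slept between attempts: doubled each time, capped."""
    delays = []
    delay = base_delay
    # With max_retries total attempts there are max_retries - 1 waits
    for _ in range(max_retries - 1):
        delays.append(delay)
        delay = min(delay * 2, max_delay)
    return delays

print(backoff_delays(max_retries=3))                 # [1, 2]
print(backoff_delays(max_retries=8, base_delay=1))   # [1, 2, 4, 8, 16, 32, 60]
```

The cap matters for long retry sequences: without `max_delay`, the seventh wait above would be 64 seconds rather than 60.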

In [None]:
# Retry decorator with exponential backoff
from functools import wraps
import time

def retry_with_backoff(max_retries=3, base_delay=1, max_delay=60, exceptions=(Exception,)):
    """
    Decorator to retry function with exponential backoff.
    
    Args:
        max_retries: Maximum number of attempts in total (including the first call)
        base_delay: Initial delay in seconds
        max_delay: Maximum delay between retries
        exceptions: Tuple of exception types that trigger a retry
    """
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            delay = base_delay
            
            for attempt in range(1, max_retries + 1):
                try:
                    return func(*args, **kwargs)
                
                except exceptions as e:
                    if attempt == max_retries:
                        # Last attempt failed, re-raise
                        raise
                    
                    print(f"Attempt {attempt}/{max_retries} failed: {e}")
                    print(f"Retrying in {delay}s...")
                    
                    time.sleep(delay)
                    
                    # Exponential backoff
                    delay = min(delay * 2, max_delay)
        
        return wrapper
    return decorator

# Example usage
@retry_with_backoff(max_retries=3, base_delay=1)
def unreliable_api_call():
    """Simulate API call that sometimes fails."""
    import random
    if random.random() < 0.7:  # 70% failure rate for demo
        raise ConnectionError("API temporarily unavailable")
    return {"status": "success", "data": [1, 2, 3]}

# Test the retry mechanism
print("Testing retry with exponential backoff:\n")
try:
    result = unreliable_api_call()
    print(f"\n✓ Success: {result}")
except Exception as e:
    print(f"\n✗ Failed after all retries: {e}")

### 2.1 Context Managers for Resource Management

Ensure resources are always cleaned up, even when errors occur.
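
As a minimal sketch of that guarantee, the context manager below creates a scratch directory and removes it in a `finally` block, so cleanup happens even when the body raises. The name `scratch_directory` and the simulated failure are illustrative only.

```python
import shutil
import tempfile
from contextlib import contextmanager
from pathlib import Path

@contextmanager
def scratch_directory(prefix="automation_"):
    """Yield a temporary directory and delete it afterwards, even on error."""
    path = Path(tempfile.mkdtemp(prefix=prefix))
    try:
        yield path
    finally:
        # Runs on normal exit and when the with-body raises
        shutil.rmtree(path, ignore_errors=True)

seen = {}
try:
    with scratch_directory() as workdir:
        seen["path"] = workdir
        (workdir / "partial.txt").write_text("half-written output")
        raise RuntimeError("simulated failure")
except RuntimeError:
    pass

print(seen["path"].exists())  # False: the directory was removed despite the error
```

For this particular case the standard library already provides `tempfile.TemporaryDirectory`; the hand-written version is shown only to make the `try`/`finally` structure visible.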

In [None]:
# Custom context manager for automation tasks
import logging
import time
import traceback
from contextlib import contextmanager

@contextmanager
def automation_task(task_name, logger=None):
    """
    Context manager for automation tasks.
    Ensures cleanup and logging even if task fails.
    
    Args:
        task_name: Name of the task
        logger: Logger instance (optional)
    """
    if logger is None:
        logger = logging.getLogger(__name__)
    
    start_time = time.time()
    logger.info(f"Starting task: {task_name}")
    
    try:
        yield
        
        elapsed = time.time() - start_time
        logger.info(f"✓ Task completed: {task_name} ({elapsed:.2f}s)")
    
    except Exception as e:
        elapsed = time.time() - start_time
        logger.error(f"✗ Task failed: {task_name} ({elapsed:.2f}s)")
        logger.error(f"Error: {e}")
        logger.debug(traceback.format_exc())
        raise
    
    finally:
        # Cleanup code always runs
        logger.debug(f"Cleanup for task: {task_name}")

# Example usage
logging.basicConfig(level=logging.INFO, format='%(levelname)s: %(message)s')
logger = logging.getLogger(__name__)

# Successful task
with automation_task("Data Processing", logger):
    print("Processing data...")
    time.sleep(0.5)
    print("Data processed!")

print()

# Failed task
try:
    with automation_task("Risky Operation", logger):
        print("Attempting risky operation...")
        raise ValueError("Something went wrong!")
except ValueError:
    print("Handled the error gracefully")

## 3. Structured Logging

Professional logging with levels, formatting, and multiple handlers.
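
One way to make log output machine-readable is to emit each record as a JSON object, including any fields passed through `extra`. The sketch below is an assumption-laden illustration, not a standard-library feature: `JsonFormatter` is a hypothetical class, and it detects extra fields by comparing a record's attributes against those of a blank `LogRecord`.

```python
import io
import json
import logging

# Attribute names present on every LogRecord; anything else came from `extra`
_STANDARD_ATTRS = set(
    logging.LogRecord("", 0, "", 0, "", (), None).__dict__
) | {"message", "asctime"}

class JsonFormatter(logging.Formatter):
    """Hypothetical formatter: one JSON object per log line."""

    def format(self, record):
        payload = {
            "level": record.levelname,
            "logger": record.name,
            "message": record.getMessage(),
        }
        # Copy fields passed via logger.info(..., extra={...})
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS:
                payload[key] = value
        return json.dumps(payload, default=str)

stream = io.StringIO()  # stands in for a file or console stream
handler = logging.StreamHandler(stream)
handler.setFormatter(JsonFormatter())

json_logger = logging.getLogger("json_demo")
json_logger.setLevel(logging.INFO)
json_logger.propagate = False  # keep the demo output out of the root logger
json_logger.addHandler(handler)

json_logger.info("export finished", extra={"user_id": 12345, "action": "data_export"})
entry = json.loads(stream.getvalue())
print(entry)
```

Because every line is valid JSON, log aggregation tools can filter on fields such as `user_id` without parsing free-form message text.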

In [None]:
# Production-grade logging setup
import logging
import logging.handlers
from pathlib import Path
from datetime import datetime

def setup_production_logger(
    name,
    log_dir='logs',
    console_level=logging.INFO,
    file_level=logging.DEBUG,
    max_bytes=10*1024*1024,  # 10 MB
    backup_count=5
):
    """
    Setup production-grade logger with rotation.
    
    Args:
        name: Logger name
        log_dir: Directory for log files
        console_level: Console logging level
        file_level: File logging level
        max_bytes: Max log file size before rotation
        backup_count: Number of backup files to keep
    
    Returns:
        logging.Logger: Configured logger
    """
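    # Create logger
    logger = logging.getLogger(name)
    logger.setLevel(logging.DEBUG)
    # Close and remove handlers from a previous call so re-running the cell
    # does not duplicate output or leave log files open
    for handler in list(logger.handlers):
        handler.close()
        logger.removeHandler(handler)
    # Do not also pass records to the root logger (configured earlier with
    # basicConfig), which would print each console message twice
    logger.propagate = False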
    
    # Create logs directory
    log_dir = Path(log_dir)
    log_dir.mkdir(parents=True, exist_ok=True)
    
    # File handler with rotation
    log_file = log_dir / f"{name}.log"
    file_handler = logging.handlers.RotatingFileHandler(
        log_file,
        maxBytes=max_bytes,
        backupCount=backup_count
    )
    file_handler.setLevel(file_level)
    
    # Console handler
    console_handler = logging.StreamHandler()
    console_handler.setLevel(console_level)
    
    # Detailed formatter for files
    file_formatter = logging.Formatter(
        '%(asctime)s - %(name)s - %(levelname)s - %(funcName)s:%(lineno)d - %(message)s',
        datefmt='%Y-%m-%d %H:%M:%S'
    )
    
    # Simple formatter for console
    console_formatter = logging.Formatter(
        '%(levelname)s: %(message)s'
    )
    
    file_handler.setFormatter(file_formatter)
    console_handler.setFormatter(console_formatter)
    
    logger.addHandler(file_handler)
    logger.addHandler(console_handler)
    
    logger.info(f"Logger initialized: {name}")
    logger.info(f"Log file: {log_file}")
    
    return logger

# Example usage
logger = setup_production_logger(
    'demo_app',
    log_dir='logs/demo',
    console_level=logging.INFO,
    file_level=logging.DEBUG
)

# Different log levels
logger.debug("Detailed debugging information")
logger.info("General information about program execution")
logger.warning("Warning: Something unexpected happened")
logger.error("Error: Operation failed")

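# Logging with context: fields passed via `extra` become attributes on the log record.
# The formatters above do not reference them, so they reach the output only through
# the message text unless a formatter is written to include them.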
user_id = 12345
action = "data_export"
logger.info(f"User {user_id} performed {action}", extra={'user_id': user_id, 'action': action})

print("\n✓ Logs written to logs/demo/demo_app.log")

## 4. Performance Profiling

Measure before you optimize: profiling shows where time and memory actually go, so effort is spent on real bottlenecks rather than guesses.
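
To find out which functions dominate the run time, the standard library's `cProfile` and `pstats` modules can be used as sketched below. The workload (`slow_sum`, `run_job`) is a hypothetical example; the report is written to an in-memory buffer so it can be printed or saved.

```python
import cProfile
import io
import pstats

def slow_sum(n):
    """Hypothetical hot spot: builds an intermediate list before summing."""
    squares = [i * i for i in range(n)]
    return sum(squares)

def run_job():
    """Hypothetical workload that calls the hot spot several times."""
    results = []
    for _ in range(5):
        results.append(slow_sum(20_000))
    return results

profiler = cProfile.Profile()
job_result = profiler.runcall(run_job)  # profile a single call to run_job()

report = io.StringIO()
stats = pstats.Stats(profiler, stream=report)
stats.sort_stats("cumulative").print_stats(5)  # top 5 entries by cumulative time
print(report.getvalue())
```

Sorting by cumulative time surfaces the call paths that account for most of the run, which is usually the right place to start optimizing.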

In [None]:
# Performance profiling decorator
import time
from functools import wraps

def profile_performance(func):
    """
    Decorator to profile function performance.
    
    Measures wall-clock execution time and the memory allocated by Python
    objects (as tracked by tracemalloc) while the function runs.
    """
    @wraps(func)
    def wrapper(*args, **kwargs):
        import tracemalloc
        
        # Start profiling
        tracemalloc.start()
        start_time = time.perf_counter()  # monotonic clock suited to measuring intervals
        
        try:
            result = func(*args, **kwargs)
            return result
        
        finally:
            # Stop profiling
            elapsed_time = time.perf_counter() - start_time
            current, peak = tracemalloc.get_traced_memory()
            tracemalloc.stop()
            
            # Report
            print(f"\n{'='*60}")
            print(f"Performance Profile: {func.__name__}")
            print(f"{'='*60}")
            print(f"Execution time: {elapsed_time:.4f}s")
            print(f"Current memory: {current / 1024 / 1024:.2f} MB")
            print(f"Peak memory: {peak / 1024 / 1024:.2f} MB")
            print(f"{'='*60}\n")
    
    return wrapper

# Example: Profile a data processing function
@profile_performance
def process_large_dataset(size=100000):
    """
    Simulate processing a large dataset.
    
    Args:
        size: Dataset size
    """
    # Create data
    data = list(range(size))
    
    # Process data
    result = [x * 2 for x in data if x % 2 == 0]
    
    return len(result)

# Test the profiled function
count = process_large_dataset(size=100000)
print(f"Processed {count} items")

### 4.1 Code Optimization Strategies

Common optimization techniques for automation scripts.
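
Single-run timings are noisy, so comparisons are more trustworthy when each candidate is run many times. A minimal sketch using the standard library's `timeit.repeat` follows; the two candidate functions and the `best_time` helper are illustrative, and no particular winner is assumed.

```python
import timeit

data = list(range(10_000))

def with_comprehension():
    return [x * 2 for x in data if x % 2 == 0]

def with_loop():
    out = []
    for x in data:
        if x % 2 == 0:
            out.append(x * 2)
    return out

def best_time(func, number=20, repeat=5):
    """Best per-call time in seconds over several repeats."""
    runs = timeit.repeat(func, number=number, repeat=repeat)
    return min(runs) / number

timings = {f.__name__: best_time(f) for f in (with_comprehension, with_loop)}
for name, seconds in timings.items():
    print(f"{name}: {seconds * 1e6:.1f} microseconds per call")
```

Taking the minimum over repeats reduces the influence of background activity on the machine; the absolute numbers will still vary between systems and Python versions.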

In [None]:
# Optimization examples
import time

def compare_approaches(size=10000):
    """
    Compare different approaches to the same task.
    """
    data = list(range(size))
    
    # Approach 1: List comprehension (Pythonic)
    start = time.perf_counter()
    result1 = [x * 2 for x in data if x % 2 == 0]
    time1 = time.perf_counter() - start
    
    # Approach 2: For loop with append (Verbose)
    start = time.perf_counter()
    result2 = []
    for x in data:
        if x % 2 == 0:
            result2.append(x * 2)
    time2 = time.perf_counter() - start
    
    # Approach 3: Filter + map (Functional)
    start = time.perf_counter()
    result3 = list(map(lambda x: x * 2, filter(lambda x: x % 2 == 0, data)))
    time3 = time.perf_counter() - start
    
    # All approaches must produce the same output for the comparison to be fair
    assert result1 == result2 == result3
    
    # Report: ratios are relative to the list comprehension. A single run is
    # noisy, so treat small differences as inconclusive.
    print("Performance Comparison:")
    print(f"List comprehension: {time1*1000:.2f}ms (baseline)")
    print(f"For loop:           {time2*1000:.2f}ms ({time2/time1:.2f}x baseline)")
    print(f"Filter + map:       {time3*1000:.2f}ms ({time3/time1:.2f}x baseline)")
    timings = [("list comprehension", time1), ("for loop", time2), ("filter + map", time3)]
    fastest = min(timings, key=lambda pair: pair[1])
    print(f"\nFastest in this run: {fastest[0]}")

compare_approaches(size=100000)

## 5. Testing Automation Scripts

Write testable code and automated tests.
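
As a sketch of what an automated test can look like with the standard library's `unittest` module, the example below tests a small file-handling function against temporary files. The function `count_data_rows` is hypothetical and exists only to give the tests something to exercise.

```python
import io
import tempfile
import unittest
from pathlib import Path

def count_data_rows(path):
    """Hypothetical function under test: count non-header rows in a CSV file."""
    path = Path(path)
    if not path.exists():
        raise FileNotFoundError(f"File not found: {path}")
    lines = [line for line in path.read_text().splitlines() if line.strip()]
    return max(len(lines) - 1, 0)  # first non-empty line is the header

class CountDataRowsTests(unittest.TestCase):
    def setUp(self):
        # Fresh temporary directory per test, removed automatically afterwards
        self._tmp = tempfile.TemporaryDirectory()
        self.addCleanup(self._tmp.cleanup)
        self.tmp_path = Path(self._tmp.name)

    def test_counts_rows_after_header(self):
        csv_file = self.tmp_path / "data.csv"
        csv_file.write_text("id,value\n1,100\n2,200\n")
        self.assertEqual(count_data_rows(csv_file), 2)

    def test_missing_file_raises(self):
        with self.assertRaises(FileNotFoundError):
            count_data_rows(self.tmp_path / "nope.csv")

# Run the tests programmatically, which also works inside a notebook
suite = unittest.defaultTestLoader.loadTestsFromTestCase(CountDataRowsTests)
outcome = unittest.TextTestRunner(stream=io.StringIO(), verbosity=0).run(suite)
print(f"ran={outcome.testsRun} ok={outcome.wasSuccessful()}")
```

Each test creates its own input files and a failed assertion is recorded as a failure by the runner, so a passing summary cannot hide a broken case.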

In [None]:
# Example: Testable automation function
def validate_data_file(file_path, required_columns=None, max_missing_pct=0.1):
    """
    Validate a data file meets quality standards.
    
    Args:
        file_path: Path to data file
        required_columns: List of required column names
        max_missing_pct: Maximum allowed fraction of missing data (0.1 = 10%)
    
    Returns:
        dict: Validation results
    
    Raises:
        FileNotFoundError: If the file does not exist
        ValueError: If validation fails
    """
    from pathlib import Path
    
    file_path = Path(file_path)
    
    # Check file exists
    if not file_path.exists():
        raise FileNotFoundError(f"File not found: {file_path}")
    
    # For demo, simulate validation
    # In real code, would read CSV and check columns/data quality
    
    results = {
        'valid': True,
        'file': str(file_path),
        'rows': 1000,
        'columns': ['id', 'value', 'timestamp'],
        'missing_pct': 0.05
    }
    
    # Check required columns
    if required_columns:
        missing_cols = set(required_columns) - set(results['columns'])
        if missing_cols:
            results['valid'] = False
            results['error'] = f"Missing required columns: {missing_cols}"
            raise ValueError(results['error'])
    
    # Check missing data
    if results['missing_pct'] > max_missing_pct:
        results['valid'] = False
        results['error'] = f"Too much missing data: {results['missing_pct']:.1%} > {max_missing_pct:.1%}"
        raise ValueError(results['error'])
    
    return results

# Simple test function
def test_validate_data_file():
    """Test data validation function."""
    from pathlib import Path
    
    # Create test file
    test_file = Path('test_data.csv')
    test_file.write_text('id,value,timestamp\n1,100,2024-01-01')
    
    try:
        # Test 1: Valid file
        result = validate_data_file(test_file, required_columns=['id', 'value'])
        assert result['valid'] is True
        print("✓ Test 1 passed: Valid file")
        
        # Test 2: Missing required column
        try:
            validate_data_file(test_file, required_columns=['missing_column'])
        except ValueError as e:
            print(f"✓ Test 2 passed: Caught expected error: {e}")
        else:
            # Fail loudly so the summary below is never printed after a failure
            raise AssertionError("Test 2 failed: Should have raised ValueError")
        
        # Test 3: Non-existent file
        try:
            validate_data_file('nonexistent.csv')
        except FileNotFoundError:
            print("✓ Test 3 passed: Caught expected FileNotFoundError")
        else:
            raise AssertionError("Test 3 failed: Should have raised FileNotFoundError")
        
        print("\n✓ All tests passed!")
    
    finally:
        # Cleanup
        test_file.unlink()

# Run tests
test_validate_data_file()

## 6. Practice Exercises

### Exercise 1: Parallel File Processor

Build a production-grade parallel file processor:
1. Process multiple files in parallel
2. Implement retry logic for failed files
3. Add comprehensive logging
4. Profile performance (sequential vs parallel)
5. Write tests for edge cases

**Hint**: Combine ThreadPoolExecutor, retry decorator, and logging

In [None]:
# Exercise 1: Your solution here

class ProductionFileProcessor:
    """
    Production-ready parallel file processor.
    """
    
    def __init__(self, max_workers=4, logger=None):
        # TODO: Initialize processor with logging
        pass
    
    @retry_with_backoff(max_retries=3)
    def process_single_file(self, file_path):
        # TODO: Process one file with error handling
        pass
    
    def process_batch(self, file_paths):
        # TODO: Process files in parallel
        pass
    
    def get_stats(self):
        # TODO: Return processing statistics
        pass

# Test your processor
# processor = ProductionFileProcessor(max_workers=4)
# results = processor.process_batch(['file1.txt', 'file2.txt', 'file3.txt'])

### Exercise 2: Performance Optimizer

Create a tool to identify performance bottlenecks:
1. Profile multiple functions automatically
2. Compare execution times
3. Track memory usage
4. Generate optimization recommendations
5. Export performance report

**Hint**: Use the profile_performance decorator and extend it

In [None]:
# Exercise 2: Your solution here

class PerformanceAnalyzer:
    """
    Analyze and optimize script performance.
    """
    
    def __init__(self):
        # TODO: Initialize analyzer
        pass
    
    def profile_function(self, func, *args, **kwargs):
        # TODO: Profile function execution
        pass
    
    def compare_implementations(self, implementations):
        # TODO: Compare different implementations
        pass
    
    def generate_report(self):
        # TODO: Generate performance report
        pass

# Test your analyzer
# analyzer = PerformanceAnalyzer()
# analyzer.profile_function(some_function, arg1, arg2)
# analyzer.generate_report()

### Exercise 3: Complete Production Pipeline

Build an end-to-end data processing pipeline:
1. Load data from multiple sources in parallel
2. Validate data quality (use tests)
3. Process with error handling and retries
4. Log all operations comprehensively
5. Generate performance metrics
6. Save results with proper error handling

**Hint**: Combine all techniques from this module

In [None]:
# Exercise 3: Your solution here

class ProductionDataPipeline:
    """
    Production-grade data processing pipeline.
    """
    
    def __init__(self, config, logger=None):
        # TODO: Initialize pipeline
        pass
    
    def load_data_sources(self, sources):
        # TODO: Load data in parallel
        pass
    
    def validate_data(self, data):
        # TODO: Validate data quality
        pass
    
    def process_data(self, data):
        # TODO: Process with error handling
        pass
    
    def run(self):
        # TODO: Execute complete pipeline
        pass

# Test your pipeline
# pipeline = ProductionDataPipeline(config={'sources': ['s1', 's2']})
# pipeline.run()

## 7. Summary

### Key Concepts

1. **Parallel Processing**
   - ThreadPoolExecutor for I/O-bound tasks
   - ProcessPoolExecutor for CPU-bound tasks
   - Measure speedup to validate parallelization

2. **Advanced Error Handling**
   - Retry with exponential backoff
   - Context managers for resource cleanup
   - Comprehensive exception handling

3. **Structured Logging**
   - Multiple handlers (file, console)
   - Proper log levels (DEBUG, INFO, WARNING, ERROR)
   - Log rotation for disk management
   - Detailed formatting for debugging

4. **Performance Optimization**
   - Profile before optimizing
   - Measure time and memory
   - Use efficient Python patterns
   - Document optimization decisions

5. **Testing**
   - Write testable functions
   - Test edge cases and errors
   - Automate test execution
   - Maintain test coverage

### Production Checklist

Before deploying automation scripts:

- [ ] Comprehensive error handling with retries
- [ ] Structured logging (file + console)
- [ ] Performance profiled and optimized
- [ ] Tests written and passing
- [ ] Resource cleanup (context managers)
- [ ] Configuration externalized (no hardcoded values)
- [ ] Documentation complete
- [ ] Security reviewed (credentials, permissions)
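
For the "configuration externalized" item, one possible approach is to read settings from environment variables with sensible defaults. The sketch below is only an illustration: the `Settings` class, the `load_settings` function, and the `AUTOMATION_*` variable names are all hypothetical.

```python
import os
from dataclasses import dataclass

@dataclass(frozen=True)
class Settings:
    max_workers: int
    log_dir: str
    max_retries: int

def load_settings(env=None):
    """Build Settings from environment variables, falling back to defaults.

    The AUTOMATION_* variable names are hypothetical.
    """
    env = os.environ if env is None else env
    return Settings(
        max_workers=int(env.get("AUTOMATION_MAX_WORKERS", "4")),
        log_dir=env.get("AUTOMATION_LOG_DIR", "logs"),
        max_retries=int(env.get("AUTOMATION_MAX_RETRIES", "3")),
    )

# Passing a plain dict instead of os.environ keeps the function easy to test
settings = load_settings({"AUTOMATION_MAX_WORKERS": "8"})
print(settings)
```

Keeping values such as worker counts and log paths out of the code lets the same script run unchanged across development and production environments.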

### What's Next?

In **Module 10: Final Automation Project**, you'll:
- Apply all learned techniques
- Build a complete real-world automation system
- Follow production best practices
- Create deployment-ready code

### Self-Assessment

Before moving on, make sure you can:
- [ ] Implement parallel processing for appropriate tasks
- [ ] Add retry logic with exponential backoff
- [ ] Setup structured logging with rotation
- [ ] Profile and optimize script performance
- [ ] Write automated tests for automation scripts
- [ ] Use context managers for resource management

---

**Continue to Module 10** for the final capstone project!