# Lesson 3: Local Development Setup

## 🎯 Interactive Tutorial: Professional Spark Development

Welcome to Lesson 3! In this notebook, we'll explore professional development practices for Apache Spark applications. You'll learn how to structure projects, manage configurations, implement testing, and integrate development tools.

### 📋 What You'll Learn
1. **Project Structure Best Practices** - Modular, maintainable Spark applications
2. **Development Workflow** - Debugging, testing, and quality assurance
3. **Configuration Management** - Environment-specific settings and secrets
4. **Development Tools Integration** - Git, Docker, CI/CD foundations

### 🔧 Setup
Make sure you've completed the environment setup:
```bash
make setup
make install-dev
source .venv/bin/activate
```

In [None]:
# Initial setup and imports
import os
import sys
from pathlib import Path

# Add project root to Python path
project_root = Path().absolute()
if str(project_root) not in sys.path:
    sys.path.append(str(project_root))

# Essential imports
import pyspark
from pyspark.sql import SparkSession
from pyspark.sql.functions import col, when, avg, count
from pyspark.sql.types import StructType, StructField, StringType, IntegerType, DoubleType

print(f"🚀 PySpark version: {pyspark.__version__}")
print(f"📁 Project root: {project_root}")
print(f"🐍 Python path: {sys.executable}")

---

## 🏗️ Module 1: Project Structure Best Practices

Let's start by understanding how to structure a professional Spark project. We'll demonstrate the key principles through practical examples.

### 📂 1.1 Demonstrating Project Structure

Let's examine what a well-structured Spark project looks like:

In [None]:
# Let's create a sample project structure to demonstrate
def show_project_structure():
    """Display recommended project structure"""
    structure = """
    my-spark-project/
    ├── README.md                   # Project documentation
    ├── pyproject.toml             # Dependencies and configuration
    ├── Makefile                   # Development commands
    ├── .env.example               # Environment variables template
    ├── .gitignore                 # Git ignore patterns
    │
    ├── src/                       # Source code (production)
    │   ├── config/                # Configuration management
    │   │   ├── settings.py        # Application settings
    │   │   └── environments/      # Environment-specific configs
    │   ├── jobs/                  # Spark job definitions
    │   │   ├── base_job.py        # Abstract base job class
    │   │   └── etl_job.py         # ETL job implementation
    │   ├── transformations/       # Data transformation functions
    │   │   ├── cleaning.py        # Data cleaning functions
    │   │   └── aggregations.py    # Aggregation functions
    │   ├── utils/                 # Utility functions
    │   │   ├── spark_utils.py     # Spark session and utilities
    │   │   └── io_utils.py        # Input/output helpers
    │   └── schemas/               # Data schemas
    │       └── input_schemas.py   # Input data schemas
    │
    ├── tests/                     # Test suite
    │   ├── conftest.py           # Pytest configuration
    │   ├── unit/                 # Unit tests
    │   └── integration/          # Integration tests
    │
    ├── data/                     # Local data directory
    │   ├── raw/                  # Raw input data
    │   └── processed/            # Processed data
    │
    └── scripts/                  # Utility scripts
        ├── setup.sh             # Environment setup
        └── run_job.py           # Job runner script
    """
    print("📂 Recommended Project Structure:")
    print(structure)

show_project_structure()
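
The tree lists `src/jobs/base_job.py` as an abstract base job class but does not show its contents. Here is a minimal sketch of what such a class might contain, assuming a three-step extract/transform/load contract. The names `BaseJob`, `extract`, `transform`, `load`, and `UppercaseJob` are hypothetical, and plain Python lists stand in for DataFrames so the sketch runs without Spark:

```python
from abc import ABC, abstractmethod
from typing import Any, List


class BaseJob(ABC):
    """Hypothetical base class: subclasses fill in the three ETL steps."""

    def __init__(self, name: str) -> None:
        self.name = name

    @abstractmethod
    def extract(self) -> Any:
        """Read the input data."""

    @abstractmethod
    def transform(self, data: Any) -> Any:
        """Apply business logic to the extracted data."""

    @abstractmethod
    def load(self, data: Any) -> None:
        """Persist the transformed data."""

    def run(self) -> None:
        # Template method: the step order is fixed here, once, for every job
        self.load(self.transform(self.extract()))


class UppercaseJob(BaseJob):
    """Toy job that uses lists in place of DataFrames."""

    def __init__(self) -> None:
        super().__init__("uppercase")
        self.sink: List[str] = []

    def extract(self) -> List[str]:
        return ["alice", "bob"]

    def transform(self, data: List[str]) -> List[str]:
        return [value.upper() for value in data]

    def load(self, data: List[str]) -> None:
        self.sink.extend(data)


demo_job = UppercaseJob()
demo_job.run()
print(demo_job.sink)
```

Keeping the step order in one `run` method means each concrete job only supplies the steps, which is the same split the next section applies to a real Spark job.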

### 🔧 1.2 Separation of Concerns Example

Let's see how to properly separate different concerns in a Spark application:

In [None]:
# Example: Proper separation of concerns

# ❌ BAD: Everything in one function
def monolithic_data_processing():
    # Spark session creation
    spark = SparkSession.builder.appName("MonolithicApp").getOrCreate()
    
    # Data loading
    df = spark.read.option("header", "true").csv("data/raw/customers.csv")
    
    # Business logic mixed with technical concerns
    processed_df = (df
                   .filter(col("age") >= 18)
                   .withColumn("age_group", when(col("age") < 30, "Young")
                              .when(col("age") < 50, "Middle")
                              .otherwise("Senior"))
                   .groupBy("age_group")
                   .agg(count("*").alias("count"), avg("income").alias("avg_income")))
    
    # Output writing
    processed_df.write.mode("overwrite").parquet("data/processed/customer_analysis")
    
    spark.stop()

print("❌ Monolithic approach - everything mixed together")

In [None]:
# ✅ GOOD: Separated concerns
from typing import Optional
from dataclasses import dataclass

# 1. Configuration Management
@dataclass
class AppConfig:
    app_name: str = "CustomerAnalysis"
    input_path: str = "data/raw/customers.csv"
    output_path: str = "data/processed/customer_analysis"
    min_age: int = 18

# 2. Spark Utilities
class SparkUtils:
    @staticmethod
    def get_spark_session(app_name: str) -> SparkSession:
        """Create a local Spark session with adaptive query execution enabled"""
        return (SparkSession.builder
                .appName(app_name)
                .master("local[*]")
                .config("spark.sql.adaptive.enabled", "true")
                .config("spark.sql.adaptive.coalescePartitions.enabled", "true")
                .getOrCreate())

# 3. Data Schema Definition
class CustomerSchema:
    SCHEMA = StructType([
        StructField("customer_id", StringType(), True),
        StructField("name", StringType(), True),
        StructField("age", IntegerType(), True),
        StructField("income", DoubleType(), True),
        StructField("city", StringType(), True)
    ])

# 4. Business Logic (Transformations)
class CustomerTransformations:
    @staticmethod
    def filter_valid_customers(df, min_age: int = 18):
        """Filter customers with valid age"""
        return df.filter(col("age") >= min_age)
    
    @staticmethod
    def add_age_group(df):
        """Add age group categorization"""
        return df.withColumn(
            "age_group",
            when(col("age") < 30, "Young")
            .when(col("age") < 50, "Middle")
            .otherwise("Senior")
        )
    
    @staticmethod
    def calculate_age_group_stats(df):
        """Calculate statistics by age group"""
        return (df.groupBy("age_group")
                .agg(
                    count("*").alias("count"),
                    avg("income").alias("avg_income")
                )
                .orderBy("age_group"))

# 5. I/O Operations
class DataIO:
    @staticmethod
    def read_customer_data(spark: SparkSession, path: str):
        """Read customer data with schema"""
        return (spark.read
                .schema(CustomerSchema.SCHEMA)
                .option("header", "true")
                .csv(path))
    
    @staticmethod
    def write_analysis_results(df, path: str):
        """Write analysis results"""
        (df.write
         .mode("overwrite")
         .option("compression", "snappy")
         .parquet(path))

print("✅ Modular approach - separated concerns with clear responsibilities")
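
Once configuration lives in an object rather than in literals scattered through the job, it can be varied per run without editing code. Here is a small sketch using the standard library's `dataclasses.replace`. The class name `JobConfig` and the `frozen=True` choice are illustrative assumptions, not part of the `AppConfig` defined above:

```python
from dataclasses import FrozenInstanceError, dataclass, replace


@dataclass(frozen=True)
class JobConfig:
    """Illustrative stand-in for AppConfig; frozen so instances cannot be mutated."""
    app_name: str = "CustomerAnalysis"
    input_path: str = "data/raw/customers.csv"
    min_age: int = 18


default_config = JobConfig()

# replace() returns a modified copy and leaves the original untouched
strict_config = replace(default_config, min_age=21)
print(default_config.min_age, strict_config.min_age)

# Assigning to a field of a frozen dataclass raises FrozenInstanceError
try:
    default_config.min_age = 0
except FrozenInstanceError:
    print("config is immutable")
```

An immutable config can be shared between the job, its transformations, and its tests without any of them changing it for the others.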

### 🏃‍♂️ 1.3 Putting It All Together

Now let's see how these separated components work together:

In [None]:
# 6. Main Job Class
class CustomerAnalysisJob:
    def __init__(self, config: AppConfig):
        self.config = config
        self.spark = SparkUtils.get_spark_session(config.app_name)
    
    def run(self):
        """Execute the complete analysis pipeline"""
        try:
            print(f"🚀 Starting {self.config.app_name}")
            
            # Extract
            print("📥 Loading customer data...")
            raw_data = DataIO.read_customer_data(self.spark, self.config.input_path)
            print(f"📊 Loaded {raw_data.count():,} customer records")
            
            # Transform
            print("🔄 Applying transformations...")
            valid_customers = CustomerTransformations.filter_valid_customers(raw_data, self.config.min_age)
            customers_with_groups = CustomerTransformations.add_age_group(valid_customers)
            analysis_results = CustomerTransformations.calculate_age_group_stats(customers_with_groups)
            
            print("📈 Analysis results:")
            analysis_results.show()
            
            # Load
            print(f"💾 Saving results to {self.config.output_path}")
            DataIO.write_analysis_results(analysis_results, self.config.output_path)
            
            print("✅ Job completed successfully!")
            
        except Exception as e:
            print(f"❌ Job failed: {str(e)}")
            raise
        finally:
            self.spark.stop()

print("✅ Complete modular job structure defined")

### 🧪 1.4 Testing the Modular Structure

Let's create some sample data and test our modular structure:

In [None]:
# Create sample data for demonstration
import tempfile
import pandas as pd

# Create sample customer data
sample_data = pd.DataFrame({
    'customer_id': ['C001', 'C002', 'C003', 'C004', 'C005', 'C006'],
    'name': ['Alice Johnson', 'Bob Smith', 'Charlie Brown', 'Diana Wilson', 'Eve Davis', 'Frank Miller'],
    'age': [25, 35, 45, 55, 17, 30],  # Note: one customer under 18
    'income': [50000, 75000, 90000, 120000, 25000, 60000],
    'city': ['New York', 'Chicago', 'Los Angeles', 'Houston', 'Phoenix', 'Philadelphia']
})

# Create temporary directories
temp_dir = Path(tempfile.mkdtemp())
input_dir = temp_dir / "input"
output_dir = temp_dir / "output"
input_dir.mkdir(exist_ok=True)
output_dir.mkdir(exist_ok=True)

# Save sample data
input_file = input_dir / "customers.csv"
sample_data.to_csv(input_file, index=False)

print(f"📁 Created test data at: {input_file}")
print(f"📊 Sample data:")
print(sample_data)

# Configure and run the job
config = AppConfig(
    input_path=str(input_file),
    output_path=str(output_dir / "customer_analysis")
)

print(f"\n🔧 Configuration:")
print(f"  Input: {config.input_path}")
print(f"  Output: {config.output_path}")
print(f"  Min age: {config.min_age}")

In [None]:
# Run the modular job
job = CustomerAnalysisJob(config)
job.run()

---

## 🔧 Module 2: Development Workflow

Now let's explore professional development workflows, including debugging techniques and testing strategies.

### 🐛 2.1 DataFrame Debugging Utilities

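Spark evaluates DataFrames lazily, so a mistake often surfaces far from the line that introduced it. Let's create utilities that print the schema, row count, and per-column null counts at any point in a pipeline: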

In [None]:
# Advanced debugging utilities for DataFrames
from pyspark.sql import DataFrame
import time
from functools import wraps

class DataFrameDebugger:
    """Comprehensive DataFrame debugging utilities"""
    
    @staticmethod
    def debug_dataframe(df: DataFrame, 
                       name: str = "DataFrame",
                       show_rows: int = 10,
                       show_schema: bool = True,
                       show_count: bool = True,
                       show_sample: bool = True) -> DataFrame:
        """Comprehensive DataFrame debugging"""
        
        print(f"\n{'='*60}")
        print(f"🔍 DEBUG: {name}")
        print(f"{'='*60}")
        
        if show_schema:
            print(f"\n📋 Schema:")
            df.printSchema()
        
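        if show_count:
            # Named row_count so the imported pyspark `count` function is not shadowed
            row_count = df.count()
            print(f"\n📊 Row Count: {row_count:,} rows")
        
        # head(1) fetches at most one row, which is cheaper than a second full count()
        if show_sample and df.head(1):
            print(f"\n🔍 Sample Data (first {show_rows} rows):")
            df.show(show_rows, truncate=False)
            
            # Show data types and null counts (one Spark job per column, fine for small data)
            dtypes = dict(df.dtypes)
            print(f"\n📈 Column Statistics:")
            for column in df.columns:
                null_count = df.filter(col(column).isNull()).count()
                dtype = dtypes[column]
                print(f"  {column:20} | Type: {dtype:15} | Nulls: {null_count:,}")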
        
        return df
    
    @staticmethod
    def profile_operation(operation_name: str):
        """Decorator to profile Spark operations"""
        def decorator(func):
            @wraps(func)
            def wrapper(*args, **kwargs):
                start_time = time.time()
                
                print(f"\n🚀 Starting: {operation_name}")
                result = func(*args, **kwargs)
                
                # Force an action if the result is a DataFrame; without one, lazy
                # transformations return immediately and the timing is meaningless.
                # isinstance is used because lists and strings also have a `count` attribute.
                row_count = result.count() if isinstance(result, DataFrame) else None
                execution_time = time.time() - start_time
                print(f"✅ Completed: {operation_name}")
                print(f"⏱️  Execution time: {execution_time:.2f} seconds")
                if row_count is not None:
                    print(f"📊 Result count: {row_count:,} rows")
                
                return result
            return wrapper
        return decorator
    
    @staticmethod
    def compare_dataframes(df1: DataFrame, df2: DataFrame, 
                          name1: str = "DataFrame 1", 
                          name2: str = "DataFrame 2"):
        """Compare two DataFrames"""
        print(f"\n🔍 Comparing {name1} vs {name2}")
        print(f"{'='*50}")
        
        # Compare counts
        count1, count2 = df1.count(), df2.count()
        print(f"📊 Row counts: {name1}: {count1:,}, {name2}: {count2:,}")
        
        # Compare schemas
        cols1, cols2 = set(df1.columns), set(df2.columns)
        print(f"📋 Column counts: {name1}: {len(cols1)}, {name2}: {len(cols2)}")
        
        if cols1 != cols2:
            print(f"⚠️  Schema differences:")
            print(f"  Only in {name1}: {cols1 - cols2}")
            print(f"  Only in {name2}: {cols2 - cols1}")
        else:
            print(f"✅ Schemas match")

print("🔧 DataFrame debugging utilities created")
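
`profile_operation` wraps a whole function. For timing an ad hoc block of code, the same measurement can be packaged as a context manager. The sketch below uses only the standard library. The name `timed` is hypothetical, and when used with Spark the timed block must itself trigger an action (such as `count()`), since the context manager cannot force one:

```python
import time
from contextlib import contextmanager
from typing import Dict, Iterator


@contextmanager
def timed(label: str, results: Dict[str, float]) -> Iterator[None]:
    """Record the wall-clock duration of the enclosed block under `label`."""
    start = time.perf_counter()
    try:
        yield
    finally:
        # Runs even if the block raises, so failed operations are still timed
        results[label] = time.perf_counter() - start


timings: Dict[str, float] = {}
with timed("sum of squares", timings):
    total = sum(i * i for i in range(100_000))

print(f"sum of squares took {timings['sum of squares']:.4f} seconds")
```

Collecting durations in a dictionary rather than printing them directly makes it easy to compare several steps after a run.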

### 🧪 2.2 Testing Our Debug Utilities

Let's test our debugging utilities with the customer data:

In [None]:
# Create a new Spark session for debugging demonstration
spark = SparkUtils.get_spark_session("DebuggingDemo")

# Load and debug the customer data
customer_df = DataIO.read_customer_data(spark, str(input_file))
customer_df = DataFrameDebugger.debug_dataframe(customer_df, "Raw Customer Data")

In [None]:
# Apply transformations with profiling
@DataFrameDebugger.profile_operation("Filter Valid Customers")
def filter_customers_with_profiling(df):
    return CustomerTransformations.filter_valid_customers(df)

@DataFrameDebugger.profile_operation("Add Age Groups")
def add_age_groups_with_profiling(df):
    return CustomerTransformations.add_age_group(df)

# Apply transformations
filtered_df = filter_customers_with_profiling(customer_df)
grouped_df = add_age_groups_with_profiling(filtered_df)

# Debug the final result
final_df = DataFrameDebugger.debug_dataframe(grouped_df, "Customers with Age Groups")

In [None]:
# Compare original vs filtered data
DataFrameDebugger.compare_dataframes(
    customer_df, filtered_df, 
    "Original Data", "Filtered Data (Age >= 18)"
)

spark.stop()

### 🧪 2.3 Unit Testing Framework

Let's build a small `unittest` base class that shares one SparkSession across all tests in a class, then use it to test the customer transformations:

In [None]:
# Unit testing framework for Spark applications
import unittest
from typing import List, Tuple

class SparkTestCase(unittest.TestCase):
    """Base test case for Spark applications"""
    
    @classmethod
    def setUpClass(cls):
        """Set up Spark session for testing"""
        cls.spark = (SparkSession.builder
                    .appName("test-spark-app")
                    .master("local[2]")
                    .config("spark.sql.shuffle.partitions", "2")
                    .config("spark.ui.enabled", "false")
                    .getOrCreate())
        cls.spark.sparkContext.setLogLevel("WARN")
    
    @classmethod
    def tearDownClass(cls):
        """Clean up Spark session"""
        cls.spark.stop()
    
    def create_test_dataframe(self, data: List[Tuple], columns: List[str]):
        """Helper to create test DataFrames"""
        return self.spark.createDataFrame(data, columns)
    
    def assert_dataframe_equal(self, df1: DataFrame, df2: DataFrame, 
                              check_schema: bool = True):
        """Assert two DataFrames are equal"""
        if check_schema:
            self.assertEqual(df1.schema, df2.schema, "Schemas don't match")
        
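        # Sort the collected rows so the comparison ignores row order
        # (assumes no null values, which Python cannot order against other values)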
        rows1 = sorted(df1.collect())
        rows2 = sorted(df2.collect())
        
        self.assertEqual(rows1, rows2, "DataFrames don't match")

# Test cases for our customer transformations
class TestCustomerTransformations(SparkTestCase):
    """Test customer transformation functions"""
    
    def setUp(self):
        """Set up test data"""
        self.test_data = [
            ("C001", "Alice", 25, 50000.0, "New York"),
            ("C002", "Bob", 17, 30000.0, "Chicago"),      # Under 18
            ("C003", "Charlie", 35, 75000.0, "LA"),
            ("C004", "Diana", 45, 90000.0, "Houston"),
            ("C005", "Eve", 16, 25000.0, "Phoenix"),       # Under 18
        ]
        
        self.columns = ["customer_id", "name", "age", "income", "city"]
        self.df = self.create_test_dataframe(self.test_data, self.columns)
    
    def test_filter_valid_customers(self):
        """Test filtering customers by minimum age"""
        result = CustomerTransformations.filter_valid_customers(self.df, min_age=18)
        
        # Should have 3 customers (Alice, Charlie, Diana)
        self.assertEqual(result.count(), 3)
        
        # All remaining customers should be >= 18
        ages = [row.age for row in result.collect()]
        self.assertTrue(all(age >= 18 for age in ages))
    
    def test_add_age_group(self):
        """Test age group categorization"""
        result = CustomerTransformations.add_age_group(self.df)
        
        # Check that age_group column was added
        self.assertIn("age_group", result.columns)
        
        # Check age group assignments
        age_groups = {row.name: row.age_group for row in result.collect()}
        
        self.assertEqual(age_groups["Alice"], "Young")    # 25
        self.assertEqual(age_groups["Bob"], "Young")      # 17
        self.assertEqual(age_groups["Charlie"], "Middle") # 35
        self.assertEqual(age_groups["Diana"], "Middle")   # 45
        self.assertEqual(age_groups["Eve"], "Young")      # 16
    
    def test_calculate_age_group_stats(self):
        """Test age group statistics calculation"""
        df_with_groups = CustomerTransformations.add_age_group(self.df)
        result = CustomerTransformations.calculate_age_group_stats(df_with_groups)
        
        # Should have Middle and Young groups
        age_groups = [row.age_group for row in result.collect()]
        self.assertIn("Young", age_groups)
        self.assertIn("Middle", age_groups)
        
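        # Check statistics. Use row["count"]: Row subclasses tuple, so row.count
        # resolves to the built-in tuple.count method rather than the column value.
        stats = {row.age_group: (row["count"], row.avg_income) for row in result.collect()}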
        
        # Young: Alice (25, 50k), Bob (17, 30k), Eve (16, 25k)
        young_count, young_avg = stats["Young"]
        self.assertEqual(young_count, 3)
        self.assertAlmostEqual(young_avg, (50000 + 30000 + 25000) / 3, places=0)
        
        # Middle: Charlie (35, 75k), Diana (45, 90k)
        middle_count, middle_avg = stats["Middle"]
        self.assertEqual(middle_count, 2)
        self.assertAlmostEqual(middle_avg, (75000 + 90000) / 2, places=0)

print("🧪 Unit testing framework created")

In [None]:
# Run the unit tests
if __name__ == "__main__":
    # Create a test suite
    suite = unittest.TestLoader().loadTestsFromTestCase(TestCustomerTransformations)
    
    # Run the tests
    runner = unittest.TextTestRunner(verbosity=2)
    result = runner.run(suite)
    
    print(f"\n📊 Test Results:")
    print(f"  Tests run: {result.testsRun}")
    print(f"  Failures: {len(result.failures)}")
    print(f"  Errors: {len(result.errors)}")
    
    if result.wasSuccessful():
        print("✅ All tests passed!")
    else:
        print("❌ Some tests failed!")
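
The tests above need a running SparkSession, which takes time to start. One complementary approach is to test the bucketing boundaries without Spark, assuming the rule is also expressed as a plain Python function. In this sketch the name `age_group_for` is hypothetical and mirrors the thresholds in `add_age_group`. It also covers the `Senior` branch, which the test data above never reaches:

```python
import unittest


def age_group_for(age: int) -> str:
    """Hypothetical pure-Python mirror of the thresholds in add_age_group."""
    if age < 30:
        return "Young"
    if age < 50:
        return "Middle"
    return "Senior"


class TestAgeGroupBoundaries(unittest.TestCase):
    def test_boundaries(self):
        # Values on either side of each threshold
        cases = {29: "Young", 30: "Middle", 49: "Middle", 50: "Senior"}
        for age, expected in cases.items():
            with self.subTest(age=age):
                self.assertEqual(age_group_for(age), expected)


boundary_suite = unittest.TestLoader().loadTestsFromTestCase(TestAgeGroupBoundaries)
boundary_result = unittest.TextTestRunner(verbosity=0).run(boundary_suite)
print(boundary_result.wasSuccessful())
```

A pure-Python mirror only helps if it stays in sync with the Spark expression, so it complements the Spark-based tests rather than replacing them.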

---

## ⚙️ Module 3: Configuration Management

Let's explore professional configuration management patterns for different environments.

### 🔧 3.1 Hierarchical Configuration System

We'll build a loader that reads `base.yaml`, overlays an environment-specific file such as `dev.yaml`, and substitutes `${VAR}` placeholders from environment variables:

In [None]:
# Advanced configuration management system
import yaml
import os
from typing import Dict, Any, Optional
from dataclasses import dataclass, field
from pathlib import Path

@dataclass
class SparkConfig:
    """Spark-specific configuration"""
    app_name: str = "SparkApp"
    master: str = "local[*]"
    sql_shuffle_partitions: int = 200
    adaptive_enabled: bool = True
    adaptive_coalesce_partitions: bool = True
    serializer: str = "org.apache.spark.serializer.KryoSerializer"
    
    def to_spark_conf(self) -> Dict[str, str]:
        """Convert to Spark configuration dictionary"""
        return {
            "spark.sql.shuffle.partitions": str(self.sql_shuffle_partitions),
            "spark.sql.adaptive.enabled": str(self.adaptive_enabled).lower(),
            "spark.sql.adaptive.coalescePartitions.enabled": str(self.adaptive_coalesce_partitions).lower(),
            "spark.serializer": self.serializer
        }

@dataclass
class DataConfig:
    """Data-related configuration"""
    input_path: str = "data/raw"
    output_path: str = "data/processed"
    input_format: str = "parquet"
    output_format: str = "delta"
    compression: str = "snappy"

@dataclass
class DatabaseConfig:
    """Database configuration"""
    host: str = "localhost"
    port: int = 5432
    name: str = "spark_db"
    username: Optional[str] = None
    password: Optional[str] = None
    
    def get_jdbc_url(self) -> str:
        """Get JDBC connection URL"""
        return f"jdbc:postgresql://{self.host}:{self.port}/{self.name}"

@dataclass
class AppConfig:
    """Main application configuration"""
    environment: str = "dev"
    debug: bool = True
    log_level: str = "INFO"
    
    spark: SparkConfig = field(default_factory=SparkConfig)
    data: DataConfig = field(default_factory=DataConfig)
    database: DatabaseConfig = field(default_factory=DatabaseConfig)

class ConfigLoader:
    """Configuration loader with environment support"""
    
    def __init__(self, config_dir: str = "configs"):
        self.config_dir = Path(config_dir)
    
    def load_config(self, environment: Optional[str] = None) -> AppConfig:
        """Load configuration for specified environment"""
        if environment is None:
            environment = os.getenv("ENVIRONMENT", "dev")
        
        # Load base configuration
        base_config = self._load_yaml_config("base.yaml")
        
        # Load environment-specific configuration
        env_config = self._load_yaml_config(f"{environment}.yaml")
        
        # Merge configurations (environment overrides base)
        merged_config = self._merge_configs(base_config, env_config)
        
        # Resolve environment variables
        resolved_config = self._resolve_env_vars(merged_config)
        
        # Convert to AppConfig object
        return self._dict_to_config(resolved_config)
    
    def _load_yaml_config(self, filename: str) -> Dict[str, Any]:
        """Load YAML configuration file"""
        file_path = self.config_dir / filename
        
        if not file_path.exists():
            return {}
        
        with open(file_path, 'r') as file:
            return yaml.safe_load(file) or {}
    
    def _merge_configs(self, base: Dict, override: Dict) -> Dict:
        """Deep merge two configuration dictionaries"""
        result = base.copy()
        
        for key, value in override.items():
            if key in result and isinstance(result[key], dict) and isinstance(value, dict):
                result[key] = self._merge_configs(result[key], value)
            else:
                result[key] = value
        
        return result
    
    def _resolve_env_vars(self, config: Dict) -> Dict:
        """Resolve environment variables in configuration"""
        def resolve_value(value):
            if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
                env_var = value[2:-1]
                return os.getenv(env_var, value)
            elif isinstance(value, dict):
                return {k: resolve_value(v) for k, v in value.items()}
            elif isinstance(value, list):
                return [resolve_value(item) for item in value]
            return value
        
        return resolve_value(config)
    
    def _dict_to_config(self, config_dict: Dict) -> AppConfig:
        """Convert dictionary to AppConfig object"""
        # Extract sections
        app_section = config_dict.get("app", {})
        spark_section = config_dict.get("spark", {})
        data_section = config_dict.get("data", {})
        database_section = config_dict.get("database", {})
        
        # Create configuration objects
        spark_config = SparkConfig(**spark_section)
        data_config = DataConfig(**data_section)
        database_config = DatabaseConfig(**database_section)
        
        # Create main config
        return AppConfig(
            spark=spark_config,
            data=data_config,
            database=database_config,
            **app_section
        )

print("⚙️ Advanced configuration management system created")
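
To make the override rule concrete, here is a standalone sketch of the same two steps on plain dictionaries: a deep merge in which the environment layer wins, followed by `${VAR}` substitution. The helper names `deep_merge` and `resolve`, the sample values, and the `DEMO_DB_HOST` variable are assumptions for illustration and are separate from `ConfigLoader`:

```python
import os
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict in which `override` wins, merging nested dicts key by key."""
    merged = dict(base)
    for key, value in override.items():
        if isinstance(merged.get(key), dict) and isinstance(value, dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = value
    return merged


def resolve(value: Any) -> Any:
    """Replace "${VAR}" strings with the value of VAR; leave unset variables as-is."""
    if isinstance(value, dict):
        return {k: resolve(v) for k, v in value.items()}
    if isinstance(value, str) and value.startswith("${") and value.endswith("}"):
        return os.getenv(value[2:-1], value)
    return value


base = {"spark": {"app_name": "CustomerAnalytics", "sql_shuffle_partitions": 200}}
dev = {"spark": {"sql_shuffle_partitions": 4}, "database": {"host": "${DEMO_DB_HOST}"}}

os.environ["DEMO_DB_HOST"] = "localhost"
merged_demo = resolve(deep_merge(base, dev))
print(merged_demo)
```

Keys absent from the override (`app_name`) survive the merge, keys present in both layers take the override's value, and the `base` dictionary itself is left unmodified.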

### 📝 3.2 Creating Configuration Files

Let's create sample configuration files for different environments:

In [None]:
# Create sample configuration files
configs_dir = Path("configs")
configs_dir.mkdir(exist_ok=True)

# Base configuration
base_config = {
    "app": {
        "log_level": "INFO"
    },
    "spark": {
        "app_name": "CustomerAnalytics",
        "sql_shuffle_partitions": 200,
        "adaptive_enabled": True,
        "serializer": "org.apache.spark.serializer.KryoSerializer"
    },
    "data": {
        "input_format": "parquet",
        "output_format": "delta",
        "compression": "snappy"
    }
}

# Development configuration
dev_config = {
    "app": {
        "environment": "dev",
        "debug": True,
        "log_level": "DEBUG"
    },
    "spark": {
        "master": "local[2]",
        "sql_shuffle_partitions": 4
    },
    "data": {
        "input_path": "data/dev/input",
        "output_path": "data/dev/output"
    },
    "database": {
        "host": "localhost",
        "port": 5432,
        "name": "dev_database",
        "username": "dev_user",
        "password": "dev_password"
    }
}

# Production configuration
prod_config = {
    "app": {
        "environment": "prod",
        "debug": False,
        "log_level": "WARN"
    },
    "spark": {
        "master": "yarn",
        "sql_shuffle_partitions": 1000
    },
    "data": {
        "input_path": "s3a://prod-data-lake/input",
        "output_path": "s3a://prod-data-lake/output"
    },
    "database": {
        "host": "${DB_HOST}",
        "port": "${DB_PORT}",
        "name": "${DB_NAME}",
        "username": "${DB_USERNAME}",
        "password": "${DB_PASSWORD}"
    }
}

# Write configuration files
with open(configs_dir / "base.yaml", 'w') as f:
    yaml.dump(base_config, f, default_flow_style=False)

with open(configs_dir / "dev.yaml", 'w') as f:
    yaml.dump(dev_config, f, default_flow_style=False)

with open(configs_dir / "prod.yaml", 'w') as f:
    yaml.dump(prod_config, f, default_flow_style=False)

print("📝 Configuration files created:")
for config_file in configs_dir.glob("*.yaml"):
    print(f"  📄 {config_file}")
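
One caveat with the `${DB_PORT}` placeholder: environment variables are always strings, and dataclasses do not convert types, so `port` would arrive as `"5432"` rather than `5432`. Here is a sketch of one way to coerce it, assuming a `__post_init__` hook is acceptable. `CoercedDatabaseConfig` is a hypothetical variant of `DatabaseConfig`, not a replacement for it:

```python
from dataclasses import dataclass
from typing import Union


@dataclass
class CoercedDatabaseConfig:
    """Hypothetical variant of DatabaseConfig that normalizes `port` to int."""
    host: str = "localhost"
    port: Union[int, str] = 5432
    name: str = "spark_db"

    def __post_init__(self) -> None:
        # int() raises ValueError for values such as an unresolved "${DB_PORT}",
        # which surfaces a missing environment variable at load time
        self.port = int(self.port)

    def get_jdbc_url(self) -> str:
        return f"jdbc:postgresql://{self.host}:{self.port}/{self.name}"


db = CoercedDatabaseConfig(host="prod-db.company.com", port="5432", name="prod_analytics")
print(type(db.port).__name__, db.get_jdbc_url())
```

Failing at load time is usually preferable to discovering a malformed JDBC URL when the job first connects.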

### 🧪 3.3 Testing Configuration Loading

Let's test our configuration system with different environments:

In [None]:
# Test configuration loading
config_loader = ConfigLoader("configs")

# Load development configuration
print("🔧 Loading Development Configuration:")
print("="*50)
dev_config = config_loader.load_config("dev")

print(f"Environment: {dev_config.environment}")
print(f"Debug mode: {dev_config.debug}")
print(f"Log level: {dev_config.log_level}")
print(f"\nSpark configuration:")
print(f"  Master: {dev_config.spark.master}")
print(f"  App name: {dev_config.spark.app_name}")
print(f"  Shuffle partitions: {dev_config.spark.sql_shuffle_partitions}")
print(f"\nData configuration:")
print(f"  Input path: {dev_config.data.input_path}")
print(f"  Output path: {dev_config.data.output_path}")
print(f"\nDatabase configuration:")
print(f"  JDBC URL: {dev_config.database.get_jdbc_url()}")
print(f"  Username: {dev_config.database.username}")

In [None]:
# Load production configuration (with environment variables)
print("\n🏭 Loading Production Configuration:")
print("="*50)

# Set some environment variables for demonstration
os.environ["DB_HOST"] = "prod-db.company.com"
os.environ["DB_PORT"] = "5432"
os.environ["DB_NAME"] = "prod_analytics"
os.environ["DB_USERNAME"] = "analytics_user"
os.environ["DB_PASSWORD"] = "secure_password_123"

prod_config = config_loader.load_config("prod")

print(f"Environment: {prod_config.environment}")
print(f"Debug mode: {prod_config.debug}")
print(f"Log level: {prod_config.log_level}")
print(f"\nSpark configuration:")
print(f"  Master: {prod_config.spark.master}")
print(f"  Shuffle partitions: {prod_config.spark.sql_shuffle_partitions}")
print(f"\nData configuration:")
print(f"  Input path: {prod_config.data.input_path}")
print(f"  Output path: {prod_config.data.output_path}")
print(f"\nDatabase configuration:")
print(f"  JDBC URL: {prod_config.database.get_jdbc_url()}")
print(f"  Username: {prod_config.database.username}")
# Print a fixed-width mask so the output does not reveal the password's length
print(f"  Password: {'********' if prod_config.database.password else '(not set)'}")

In [None]:
# Demonstrate Spark configuration conversion
print("\n⚙️ Spark Configuration for Development:")
print("="*50)
spark_conf = dev_config.spark.to_spark_conf()
for key, value in spark_conf.items():
    print(f"  {key}: {value}")

# Create Spark session with configuration
print("\n🚀 Creating Spark session with configuration...")
builder = SparkSession.builder.appName(dev_config.spark.app_name).master(dev_config.spark.master)

for key, value in spark_conf.items():
    builder = builder.config(key, value)

configured_spark = builder.getOrCreate()
print(f"✅ Spark session created: {configured_spark.sparkContext.appName}")
print(f"📊 Shuffle partitions: {configured_spark.conf.get('spark.sql.shuffle.partitions')}")

configured_spark.stop()

---

## 🔗 Module 4: Development Tools Integration

Let's explore how to integrate modern development tools into our Spark workflow.

### 🌿 4.1 Git Workflow Examples

Let's start by inspecting the repository from Python, printing the working-tree status, current branch, and recent commits:

In [None]:
# Git workflow demonstration
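import shlex
import subprocess
from pathlib import Path

def run_git_command(command: str, cwd: str = ".") -> str:
    """Run a git command and return its stdout; raise RuntimeError if it fails"""
    try:
        result = subprocess.run(
            shlex.split(command),  # respects quoted arguments, unlike str.split()
            cwd=cwd,
            capture_output=True,
            text=True,
            check=True
        )
        return result.stdout.strip()
    except subprocess.CalledProcessError as e:
        # Raise instead of returning the message, so callers never mistake
        # an error string for real git output
        raise RuntimeError(e.stderr.strip()) from e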

# Check current git status
print("🌿 Current Git Status:")
print("="*30)
try:
    status = run_git_command("git status --porcelain")
    if status:
        print("Modified files:")
        for line in status.split('\n'):
            print(f"  {line}")
    else:
        print("✅ Working directory clean")
        
    # Show current branch
    branch = run_git_command("git branch --show-current")
    print(f"\n📍 Current branch: {branch}")
    
    # Show recent commits
    commits = run_git_command("git log --oneline -5")
    print(f"\n📝 Recent commits:")
    for line in commits.split('\n'):
        print(f"  {line}")
        
except Exception as e:
    print(f"⚠️  Git not available or not in a git repository: {e}")
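
The `git status --porcelain` output is line-oriented: each line holds a two-character status code, a space, and then the path. A minimal sketch of turning that text into structured records, assuming the short (v1) porcelain format and simple paths. `parse_porcelain` is a hypothetical helper name, not part of the lesson's code:

```python
from typing import List, Tuple


def parse_porcelain(output: str) -> List[Tuple[str, str]]:
    """Split `git status --porcelain` (v1) output into (status, path) pairs.

    Hypothetical helper for illustration. Each line is expected to look like
    'XY path', where XY is the two-character status code.
    """
    entries = []
    for line in output.splitlines():
        if len(line) < 4:
            continue  # skip blank or malformed lines
        status, path = line[:2], line[3:]
        entries.append((status, path))
    return entries


# Sample text written by hand in the porcelain layout (not real repository output)
sample = " M src/jobs/etl.py\n?? notebooks/scratch.ipynb\nA  tests/test_etl.py"
for status, path in parse_porcelain(sample):
    print(f"{status!r} {path}")
```

Pass the output without stripping leading whitespace, because a leading space is part of the status code (` M` means "modified, not staged"). This sketch does not handle renames (`old -> new`) or quoted paths.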

### 📋 4.2 Pre-commit Configuration

The configuration below chains general file checks, formatting (black, isort), linting (flake8), type checking (mypy), and a local pytest hook:

In [None]:
# Create pre-commit configuration
pre_commit_config = """
# .pre-commit-config.yaml
repos:
  - repo: https://github.com/pre-commit/pre-commit-hooks
    rev: v4.4.0
    hooks:
      - id: trailing-whitespace
      - id: end-of-file-fixer
      - id: check-yaml
      - id: check-added-large-files
        args: ['--maxkb=1000']
      - id: check-merge-conflict
      - id: debug-statements

  - repo: https://github.com/psf/black
    rev: 23.7.0
    hooks:
      - id: black
        language_version: python3
        args: [--line-length=100]

  - repo: https://github.com/pycqa/isort
    rev: 5.12.0
    hooks:
      - id: isort
        args: [--profile=black, --line-length=100]

  - repo: https://github.com/pycqa/flake8
    rev: 6.0.0
    hooks:
      - id: flake8
        args: ['--max-line-length=100', '--extend-ignore=E203,W503']

  - repo: https://github.com/pre-commit/mirrors-mypy
    rev: v1.5.1
    hooks:
      - id: mypy
        additional_dependencies: [types-PyYAML, pydantic]

  - repo: local
    hooks:
      - id: pytest
        name: pytest
        entry: pytest
        language: system  # run pytest from the active environment, where the project is installed
        pass_filenames: false
        always_run: true
        args: [tests/, --tb=short]
"""

# Write pre-commit configuration
with open(".pre-commit-config.yaml", "w") as f:
    # End with a newline so the end-of-file-fixer hook leaves the file alone
    f.write(pre_commit_config.strip() + "\n")

print("📋 Pre-commit configuration created:")
print(pre_commit_config)
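
Writing `.pre-commit-config.yaml` does not activate anything by itself: the hooks are registered with `pre-commit install`, and `pre-commit run --all-files` checks the whole tree once. A minimal sketch of driving both from Python, assuming `pre-commit` is installed in the active environment. `run_pre_commit` and `PRE_COMMIT_COMMANDS` are hypothetical names used only for illustration:

```python
import shutil
import subprocess
from typing import List

# The two pre-commit CLI invocations: register the git hook, then run every
# configured hook against all files once.
PRE_COMMIT_COMMANDS: List[List[str]] = [
    ["pre-commit", "install"],
    ["pre-commit", "run", "--all-files"],
]


def run_pre_commit(dry_run: bool = True) -> List[List[str]]:
    """Hypothetical helper: print the commands and, unless dry_run, execute them."""
    for cmd in PRE_COMMIT_COMMANDS:
        print("$", " ".join(cmd))
        if dry_run:
            continue
        if shutil.which(cmd[0]) is None:
            print("pre-commit is not on PATH; install it into the dev environment first")
            break
        # check=False: a failing hook exits non-zero, which is common on a first run
        subprocess.run(cmd, check=False)
    return PRE_COMMIT_COMMANDS


run_pre_commit(dry_run=True)
```

With `dry_run=True` the cell only prints the commands, so it is safe to run outside a git repository.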

### 🐳 4.3 Docker Development Environment

Let's create a Docker setup for consistent development environments:

In [None]:
# Create Dockerfile for development
# (raw string so the backslash line continuations are written to the file as-is)
dockerfile_content = r"""
# Dockerfile
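FROM python:3.11-slim-bookworm

# Install Java (required for Spark).
# Debian bookworm packages OpenJDK 17, not 11; Java 17 needs Spark 3.3 or later.
RUN apt-get update && \
    apt-get install -y openjdk-17-jdk-headless curl git && \
    apt-get clean && \
    rm -rf /var/lib/apt/lists/*

# Set Java environment (this path is for amd64 images)
ENV JAVA_HOME=/usr/lib/jvm/java-17-openjdk-amd64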

# Install uv (recent installers place it in ~/.local/bin, older ones in ~/.cargo/bin)
RUN curl -LsSf https://astral.sh/uv/install.sh | sh
ENV PATH="/root/.local/bin:/root/.cargo/bin:$PATH"

# Set working directory
WORKDIR /app

# Copy dependency files
COPY pyproject.toml uv.lock ./

# Install dependencies
RUN uv sync --extra dev --extra docker

# Copy source code
COPY . .

# Set Python path
ENV PYTHONPATH=/app

# Expose ports
EXPOSE 4040 8888

# Default command
CMD ["uv", "run", "jupyter", "notebook", "--ip=0.0.0.0", "--port=8888", "--no-browser", "--allow-root"]
"""

# Create Docker Compose for development environment
docker_compose_content = """
# docker-compose.yml
version: '3.8'

services:
  spark-app:
    build: .
    volumes:
      - .:/app
      - ./data:/app/data
    environment:
      - ENVIRONMENT=dev
      - PYTHONPATH=/app
    ports:
      - "4040:4040"  # Spark UI
      - "8888:8888"  # Jupyter
    depends_on:
      - postgres
      - minio

  postgres:
    image: postgres:15
    environment:
      POSTGRES_DB: spark_dev
      POSTGRES_USER: spark_user
      POSTGRES_PASSWORD: spark_password
    ports:
      - "5432:5432"
    volumes:
      - postgres_data:/var/lib/postgresql/data

  minio:
    image: minio/minio
    ports:
      - "9000:9000"
      - "9001:9001"
    environment:
      MINIO_ROOT_USER: minioadmin
      MINIO_ROOT_PASSWORD: minioadmin
    command: server /data --console-address ":9001"
    volumes:
      - minio_data:/data

  jupyter:
    build: .
    command: uv run jupyter notebook --ip=0.0.0.0 --port=8888 --no-browser --allow-root
    ports:
      - "8889:8888"
    volumes:
      - .:/app
    environment:
      - PYTHONPATH=/app

volumes:
  postgres_data:
  minio_data:
"""

# Write Docker files
with open("Dockerfile", "w") as f:
    f.write(dockerfile_content.strip())

with open("docker-compose.yml", "w") as f:
    f.write(docker_compose_content.strip())

print("🐳 Docker configuration created:")
print("  📄 Dockerfile")
print("  📄 docker-compose.yml")
print("\n🚀 To start the development environment:")
print("  docker-compose up -d")
print("\n📱 Access points:")
print("  Jupyter (spark-app service): http://localhost:8888")
print("  Jupyter (jupyter service): http://localhost:8889")
print("  Spark UI: http://localhost:4040")
print("  MinIO console: http://localhost:9001")
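
In the Compose file, `depends_on` only controls start order. It does not wait until Postgres or MinIO actually accept connections. A minimal sketch of a readiness check using only the standard library, assuming the published ports from the Compose file above (5432 for Postgres, 9000 for MinIO). `wait_for_port` is a hypothetical helper name:

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout: float = 30.0, interval: float = 0.5) -> bool:
    """Return True once a TCP connection to host:port succeeds, False on timeout.

    Hypothetical helper for illustration: it only proves the port is open,
    not that the service behind it has finished initializing.
    """
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            # Open and immediately close a connection as a liveness probe
            with socket.create_connection((host, port), timeout=interval):
                return True
        except OSError:
            time.sleep(interval)  # refused or unreachable: retry until the deadline
    return False
```

For example, calling `wait_for_port("localhost", 5432)` before opening a JDBC connection, or `wait_for_port("localhost", 9000)` before reading from MinIO, avoids failures caused by a service that is still starting.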

---

## 🎯 Summary and Next Steps

Congratulations! You've completed the interactive tutorial for professional Spark development setup. Let's summarize what you've learned:

### ✅ What You've Accomplished

1. **🏗️ Project Structure Best Practices**
   - Created modular Spark application architecture
   - Implemented separation of concerns
   - Built reusable components and utilities

2. **🔧 Development Workflow**
   - Developed DataFrame debugging utilities
   - Created comprehensive testing framework
   - Implemented performance profiling tools

3. **⚙️ Configuration Management**
   - Built hierarchical configuration system
   - Implemented environment-specific settings
   - Created secrets management patterns

4. **🔗 Development Tools Integration**
   - Explored Git workflow best practices
   - Created pre-commit hook configuration
   - Set up Docker development environment

### 🎓 Key Takeaways

- **Modularity**: Break your code into small, focused, testable components
- **Configuration**: Use environment-specific configurations for flexibility
- **Testing**: Write comprehensive tests for your transformations and logic
- **Automation**: Use tools to enforce code quality and consistency
- **Documentation**: Code should be self-documenting with clear structure

### 🚀 Next Steps

1. **Complete the exercises** in the `exercises/` directory
2. **Explore the project templates** in the `templates/` directory
3. **Set up your own project** using the patterns you've learned
4. **Move on to Lesson 4**: File Formats Deep Dive

### 📚 Additional Practice

Try these challenges to reinforce your learning:

1. Create a new Spark project using the modular structure
2. Implement a configuration for the staging environment
3. Add more comprehensive test cases
4. Set up a CI/CD pipeline using GitHub Actions
5. Containerize your application with Docker
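
As a starting point for the staging challenge, one common approach is to layer a small set of overrides on top of shared base settings. A minimal sketch using plain dictionaries; the keys, values, and the `deep_merge` helper are illustrative assumptions and not the lesson's configuration classes:

```python
from copy import deepcopy
from typing import Any, Dict


def deep_merge(base: Dict[str, Any], override: Dict[str, Any]) -> Dict[str, Any]:
    """Return a new dict where values in `override` win; nested dicts are merged."""
    merged = deepcopy(base)
    for key, value in override.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = deep_merge(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Illustrative values only; real settings depend on your cluster.
base_config = {
    "spark": {
        "app_name": "my-spark-app",
        "master": "local[*]",
        "conf": {"spark.sql.shuffle.partitions": "4"},
    },
    "log_level": "DEBUG",
}
staging_overrides = {
    "spark": {"conf": {"spark.sql.shuffle.partitions": "64"}},
    "log_level": "INFO",
}

staging_config = deep_merge(base_config, staging_overrides)
print(staging_config["spark"]["conf"]["spark.sql.shuffle.partitions"])  # 64
print(staging_config["spark"]["app_name"])  # my-spark-app (inherited from base)
```

Keeping the overrides small makes it easy to see exactly how staging differs from the base settings, and the base dictionary is never mutated.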

### 🆘 Getting Help

If you encounter issues:
1. Check the troubleshooting section in the README
2. Run the validation scripts: `make validate-learning`
3. Review the solution files in `solutions/`
4. Use the debugging utilities you've learned

**Happy coding! 🎉**