# Python Fundamentals for Data Engineering

## Overview
This notebook covers the essential Python concepts needed for Databricks data engineering. It reviews data structures, functions, OOP, exception handling, context managers, generators, decorators, and key standard-library modules.

## Learning Objectives
- Master Python data structures
- Understand functions and lambda expressions
- Learn OOP basics for data engineering
- Get familiar with essential libraries

---

## 1. Python Data Structures Review

Understanding Python's built-in data structures is crucial for data manipulation.

In [None]:
# Lists - Ordered, mutable collection
data_sources = ['CSV', 'JSON', 'Parquet', 'Delta']
print(f"Data sources: {data_sources}")

# List comprehension - efficient data transformation
uppercase_sources = [source.upper() for source in data_sources]
print(f"Uppercase: {uppercase_sources}")

# Filtering with list comprehension
filtered = [s for s in data_sources if len(s) > 4]
print(f"Sources with >4 chars: {filtered}")

In [None]:
# Dictionaries - Key-value pairs (like JSON)
table_config = {
    'name': 'customer_transactions',
    'format': 'delta',
    'partitions': ['year', 'month'],
    'optimize': True
}

print(f"Table name: {table_config['name']}")
print(f"Partitions: {table_config.get('partitions', [])}")

# Dictionary comprehension
config_upper = {k.upper(): v for k, v in table_config.items()}
print(f"Uppercase keys: {config_upper}")

In [None]:
# Tuples - Immutable sequences (good for fixed configs)
bronze_silver_gold = ('bronze', 'silver', 'gold')
print(f"Medallion layers: {bronze_silver_gold}")

# Sets - Unique elements (useful for deduplication)
duplicate_ids = [1, 2, 2, 3, 4, 4, 5]
unique_ids = set(duplicate_ids)
print(f"Unique IDs: {unique_ids}")
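
Converting a list to a set discards the original order. When order matters, `dict.fromkeys` is a common alternative, since dictionaries keep insertion order (Python 3.7+). The sketch below also shows set difference for comparing two collections; the column names are hypothetical and used only for illustration.

```python
# Order-preserving deduplication: dict keys are unique and keep insertion order
duplicate_ids = [3, 1, 3, 2, 1]
unique_in_order = list(dict.fromkeys(duplicate_ids))
print(f"Unique, order kept: {unique_in_order}")

# Set operations for comparing schemas (column names are hypothetical)
expected_columns = {'customer_id', 'name', 'amount'}
actual_columns = {'customer_id', 'name', 'email'}

missing = expected_columns - actual_columns      # in expected, not in actual
unexpected = actual_columns - expected_columns   # in actual, not in expected
print(f"Missing: {missing}, unexpected: {unexpected}")
```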

## 2. Functions and Lambda Expressions

Functions help organize and reuse code. Lambda functions are useful for quick transformations.

In [None]:
# Function definition with type hints
def calculate_partition_key(date_str: str) -> str:
    """
    Extract year-month partition key from date string.
    
    Args:
        date_str: Date in format 'YYYY-MM-DD'
    
    Returns:
        Partition key in format 'year=YYYY/month=MM'
    """
    year, month, _ = date_str.split('-')
    return f"year={year}/month={month}"

# Test the function
partition = calculate_partition_key('2024-03-15')
print(f"Partition: {partition}")
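
`calculate_partition_key` trusts its input: a string such as `'2024-13-01'` would produce a key for a month that does not exist. One possible hardening, sketched here under the assumption that invalid dates should raise an error, is to parse the string with `datetime.strptime` first. The function name `calculate_partition_key_strict` is hypothetical.

```python
from datetime import datetime

def calculate_partition_key_strict(date_str: str) -> str:
    """Hypothetical variant that validates the date before building the key."""
    # strptime raises ValueError if the string is not a real 'YYYY-MM-DD' date
    parsed = datetime.strptime(date_str, "%Y-%m-%d")
    return f"year={parsed.year:04d}/month={parsed.month:02d}"

print(calculate_partition_key_strict('2024-03-15'))

try:
    calculate_partition_key_strict('2024-13-01')  # month 13 does not exist
except ValueError as e:
    print(f"Rejected: {e}")
```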

In [None]:
# Function with default arguments
def create_table_path(database: str, table: str, layer: str = 'silver') -> str:
    """Generate Delta table path."""
    return f"/mnt/{layer}/{database}/{table}"

print(create_table_path('retail', 'orders'))
print(create_table_path('retail', 'orders', 'gold'))

In [None]:
# Lambda functions - anonymous functions for quick operations
multiply_by_2 = lambda x: x * 2
numbers = [1, 2, 3, 4, 5]
doubled = list(map(multiply_by_2, numbers))
print(f"Doubled: {doubled}")

# Lambda with filter
even_numbers = list(filter(lambda x: x % 2 == 0, numbers))
print(f"Even numbers: {even_numbers}")

# Lambda with sorted (useful for sorting data)
files = [('data1.csv', 100), ('data2.csv', 50), ('data3.csv', 200)]
sorted_files = sorted(files, key=lambda x: x[1], reverse=True)
print(f"Sorted by size: {sorted_files}")

## 3. Object-Oriented Programming (OOP) Basics

OOP helps structure complex data engineering workflows: a class bundles a pipeline's configuration and state with the steps that act on them.

In [None]:
class DataPipeline:
    """Base class for data pipelines."""
    
    def __init__(self, name: str, source_path: str, target_path: str):
        self.name = name
        self.source_path = source_path
        self.target_path = target_path
        self.status = 'initialized'
    
    def extract(self):
        """Extract data from source."""
        print(f"[{self.name}] Extracting from {self.source_path}")
        self.status = 'extracted'
        return self
    
    def transform(self):
        """Transform extracted data."""
        print(f"[{self.name}] Transforming data")
        self.status = 'transformed'
        return self
    
    def load(self):
        """Load transformed data to target."""
        print(f"[{self.name}] Loading to {self.target_path}")
        self.status = 'completed'
        return self
    
    def run(self):
        """Run the complete ETL pipeline."""
        self.extract().transform().load()
        print(f"[{self.name}] Pipeline completed with status: {self.status}")

# Create and run pipeline
pipeline = DataPipeline(
    name='customer_etl',
    source_path='/raw/customers.csv',
    target_path='/silver/customers'
)
pipeline.run()

In [None]:
# Inheritance - extending base class
class StreamingPipeline(DataPipeline):
    """Streaming version of data pipeline."""
    
    def __init__(self, name: str, source_path: str, target_path: str, checkpoint_path: str):
        super().__init__(name, source_path, target_path)
        self.checkpoint_path = checkpoint_path
        self.is_streaming = True
    
    def extract(self):
        """Extract streaming data."""
        print(f"[{self.name}] Starting streaming from {self.source_path}")
        print(f"[{self.name}] Using checkpoint: {self.checkpoint_path}")
        self.status = 'streaming'
        return self

# Create streaming pipeline
streaming = StreamingPipeline(
    name='event_stream',
    source_path='/stream/events',
    target_path='/silver/events',
    checkpoint_path='/checkpoints/events'
)
streaming.extract()
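
Because the inherited `run()` calls `self.extract()`, a subclass that overrides `extract()` changes what `run()` does without touching `run()` itself. If every subclass is expected to supply its own `extract()`, the standard-library `abc` module can make that requirement explicit. The following is a minimal sketch with hypothetical class names, not a replacement for the classes above.

```python
from abc import ABC, abstractmethod

class BasePipeline(ABC):
    """Hypothetical abstract base: subclasses must provide extract()."""

    def __init__(self, name: str):
        self.name = name
        self.steps = []  # records which steps ran, in order

    @abstractmethod
    def extract(self):
        ...

    def load(self):
        self.steps.append('load')
        return self

    def run(self):
        # Calls whichever extract() the concrete subclass defines
        self.extract().load()
        return self.steps

class BatchPipeline(BasePipeline):
    def extract(self):
        self.steps.append('batch_extract')
        return self

class StreamPipeline(BasePipeline):
    def extract(self):
        self.steps.append('stream_extract')
        return self

print(BatchPipeline('orders').run())
print(StreamPipeline('events').run())

try:
    BasePipeline('incomplete')  # abstract classes cannot be instantiated
except TypeError as e:
    print(f"Cannot instantiate: {e}")
```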

## 4. Exception Handling

Proper error handling is critical in production data pipelines.

In [None]:
def read_config_file(file_path: str) -> dict:
    """Read configuration file with error handling."""
    import json
    
    try:
        with open(file_path, 'r') as f:
            config = json.load(f)
        print(f"Successfully loaded config from {file_path}")
        return config
    
    except FileNotFoundError:
        print(f"Error: Config file {file_path} not found")
        # Return default config
        return {'mode': 'default', 'retry': 3}
    
    except json.JSONDecodeError as e:
        print(f"Error: Invalid JSON in {file_path}: {e}")
        raise
    
    except Exception as e:
        print(f"Unexpected error: {e}")
        raise
    
    finally:
        print("Config loading attempt completed")

# Test with non-existent file
config = read_config_file('/tmp/nonexistent.json')
print(f"Config: {config}")
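
Pipelines often wrap low-level errors in a domain-specific exception so callers can catch one type. The sketch below assumes a hypothetical `ConfigError` class and `parse_config` function; `raise ... from e` keeps the original exception attached for debugging.

```python
import json

class ConfigError(Exception):
    """Hypothetical exception type for configuration problems."""

def parse_config(raw: str) -> dict:
    """Parse a JSON config string, raising ConfigError on bad input."""
    try:
        config = json.loads(raw)
    except json.JSONDecodeError as e:
        # 'from e' keeps the original error attached as __cause__
        raise ConfigError(f"Invalid JSON config: {e.msg}") from e
    if not isinstance(config, dict):
        raise ConfigError("Config must be a JSON object")
    return config

print(parse_config('{"mode": "batch"}'))

try:
    parse_config('{not valid json')
except ConfigError as e:
    print(f"Caught: {e} (caused by {type(e.__cause__).__name__})")
```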

## 5. Context Managers

Context managers ensure proper resource cleanup (important for file handles, connections).

In [None]:
# Using context manager for file operations
import tempfile
import os

# Create a temporary file
temp_file = tempfile.NamedTemporaryFile(mode='w', delete=False, suffix='.txt')
temp_path = temp_file.name
temp_file.close()

# Write with context manager - the file is closed automatically on exit
with open(temp_path, 'w') as f:
    f.write("customer_id,name,amount\n")
    f.write("1,Alice,100\n")
    f.write("2,Bob,200\n")

# Read with context manager
with open(temp_path, 'r') as f:
    content = f.read()
    print("File content:")
    print(content)

# Cleanup
os.unlink(temp_path)
print("Temporary file cleaned up")
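
You can also write your own context managers. One common approach is `contextlib.contextmanager`, where the code before `yield` is the setup and the `finally` block is the cleanup. The helper name `temporary_csv` below is hypothetical; treat this as a sketch of the pattern rather than a production utility.

```python
import os
import tempfile
from contextlib import contextmanager

@contextmanager
def temporary_csv(rows):
    """Hypothetical helper: write rows to a temp CSV, yield its path, then delete it."""
    fd, path = tempfile.mkstemp(suffix='.csv')
    try:
        with os.fdopen(fd, 'w') as f:
            f.writelines(f"{row}\n" for row in rows)
        yield path
    finally:
        # Runs even if the body of the 'with' block raises
        os.unlink(path)

with temporary_csv(["customer_id,name", "1,Alice"]) as csv_path:
    with open(csv_path) as f:
        print(f.read())
    print(f"Exists inside block: {os.path.exists(csv_path)}")

print(f"Exists after block: {os.path.exists(csv_path)}")
```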

## 6. Essential Libraries for Data Engineering

These standard-library modules are commonly used in Databricks data engineering and need no extra installation.

In [None]:
# datetime - working with dates and times
from datetime import datetime, timedelta

now = datetime.now()
print(f"Current time: {now}")
print(f"Formatted: {now.strftime('%Y-%m-%d %H:%M:%S')}")

# Date arithmetic
yesterday = now - timedelta(days=1)
print(f"Yesterday: {yesterday.strftime('%Y-%m-%d')}")

# Parse date string
date_str = "2024-03-15"
parsed = datetime.strptime(date_str, "%Y-%m-%d")
print(f"Parsed date: {parsed}")
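
`datetime.now()` returns a naive datetime in local time, which can be ambiguous when jobs run on clusters in different regions. A minimal sketch of timezone-aware handling, using only the standard library (the sample timestamp is made up for illustration):

```python
from datetime import datetime, timedelta, timezone

# Timezone-aware "now" in UTC
now_utc = datetime.now(timezone.utc)
print(f"UTC now: {now_utc.isoformat()}")

# Parse an ISO 8601 timestamp that carries an explicit offset
event_time = datetime.fromisoformat("2024-03-15T10:30:00+02:00")
event_utc = event_time.astimezone(timezone.utc)
print(f"Event in UTC: {event_utc.isoformat()}")

# Aware datetimes can be compared and subtracted safely
age = now_utc - event_utc
print(f"Event is older than one day: {age > timedelta(days=1)}")
```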

In [None]:
# json - working with JSON data
import json

# Python dict to JSON
pipeline_config = {
    'name': 'sales_pipeline',
    'schedule': '0 2 * * *',
    'tables': ['orders', 'customers', 'products']
}

json_string = json.dumps(pipeline_config, indent=2)
print("JSON output:")
print(json_string)

# JSON to Python dict
parsed_config = json.loads(json_string)
print(f"\nParsed tables: {parsed_config['tables']}")
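
`json.dumps` raises `TypeError` for values it cannot serialize natively, such as dates and sets. Its `default` parameter accepts a fallback function for those cases. The function `json_default` and the sample metadata below are hypothetical.

```python
import json
from datetime import date, datetime

def json_default(value):
    """Hypothetical fallback serializer for types json does not handle natively."""
    if isinstance(value, (date, datetime)):
        return value.isoformat()
    if isinstance(value, set):
        return sorted(value)
    raise TypeError(f"Cannot serialize {type(value).__name__}")

run_metadata = {
    'pipeline': 'sales_pipeline',
    'run_date': date(2024, 3, 15),
    'tables': {'orders', 'customers'},
}

print(json.dumps(run_metadata, default=json_default, sort_keys=True))
```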

In [None]:
# os and pathlib - file system operations
import os
from pathlib import Path

# Environment variables (common in Databricks)
# os.environ.get('DATABRICKS_TOKEN', 'default_value')

# Path operations
base_path = Path('/dbfs/mnt/data')
bronze_path = base_path / 'bronze' / 'customers'
print(f"Bronze path: {bronze_path}")

# Inspect path components (pure string operations, no file system access)
print(f"Path object: {bronze_path}")
print(f"Parent: {bronze_path.parent}")
print(f"Name: {bronze_path.name}")
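
The path operations above never touch the file system. To try `exists()`, `mkdir()`, and `glob()` safely, one option is to work inside a throwaway directory from `tempfile.TemporaryDirectory`. The directory layout and file names here are hypothetical.

```python
import tempfile
from pathlib import Path

with tempfile.TemporaryDirectory() as tmp_dir:
    # Hypothetical layout, created under a throwaway directory
    bronze_dir = Path(tmp_dir) / 'bronze' / 'customers'
    print(f"Exists before mkdir: {bronze_dir.exists()}")

    bronze_dir.mkdir(parents=True, exist_ok=True)
    (bronze_dir / 'part-0001.csv').write_text("customer_id,name\n1,Alice\n")
    (bronze_dir / 'part-0002.csv').write_text("customer_id,name\n2,Bob\n")

    # glob returns matching paths; sorted() makes the order deterministic
    csv_files = sorted(p.name for p in bronze_dir.glob('*.csv'))
    print(f"Exists after mkdir: {bronze_dir.exists()}")
    print(f"CSV files: {csv_files}")
```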

## 7. List Comprehensions and Generators

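List comprehensions build lists concisely, while generators produce values lazily, one at a time, which keeps memory use low on large datasets.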

In [None]:
# List comprehension - create lists efficiently
numbers = range(1, 11)

# Square all numbers
squares = [n**2 for n in numbers]
print(f"Squares: {squares}")

# Conditional comprehension
even_squares = [n**2 for n in numbers if n % 2 == 0]
print(f"Even squares: {even_squares}")

# Nested comprehension - flatten list of lists
partitions = [['2024-01', '2024-02'], ['2024-03', '2024-04']]
flattened = [month for sublist in partitions for month in sublist]
print(f"Flattened: {flattened}")

In [None]:
# Generator expressions - memory efficient for large datasets
# Creates iterator instead of entire list in memory
large_numbers = (n**2 for n in range(1000000))
print(f"Generator object: {large_numbers}")

# Use with next() or in loop
print(f"First value: {next(large_numbers)}")
print(f"Second value: {next(large_numbers)}")

# Generator function with yield
def partition_generator(start_year, end_year):
    """Generate year-month partitions."""
    for year in range(start_year, end_year + 1):
        for month in range(1, 13):
            yield f"year={year}/month={month:02d}"

# Use generator
partitions = partition_generator(2023, 2024)
print("First 5 partitions:")
for i, partition in enumerate(partitions):
    if i >= 5:
        break
    print(partition)
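
Generators also make it easy to process a large stream in fixed-size batches without loading everything into memory. A minimal sketch using `itertools.islice`; the helper name `in_batches` is hypothetical.

```python
from itertools import islice

def in_batches(iterable, batch_size):
    """Hypothetical helper: yield lists of up to batch_size items."""
    iterator = iter(iterable)
    while True:
        batch = list(islice(iterator, batch_size))
        if not batch:
            return
        yield batch

record_ids = (n for n in range(1, 8))  # stand-in for a large lazy source
for batch in in_batches(record_ids, 3):
    print(batch)
```

The last batch may be shorter than `batch_size`, so downstream code should not assume a fixed length.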

## 8. Decorators

Decorators modify or enhance functions (useful for logging, timing, caching).

In [None]:
import time
from functools import wraps

def timing_decorator(func):
    """Decorator to measure function execution time."""
    @wraps(func)
    def wrapper(*args, **kwargs):
        start_time = time.perf_counter()  # monotonic clock suited to measuring durations
        result = func(*args, **kwargs)
        elapsed = time.perf_counter() - start_time
        print(f"{func.__name__} took {elapsed:.4f} seconds")
        return result
    return wrapper

@timing_decorator
def process_data(records_count):
    """Simulate data processing."""
    print(f"Processing {records_count} records...")
    time.sleep(0.1)  # Simulate work
    return records_count * 2

result = process_data(1000)
print(f"Result: {result}")

In [None]:
import time
from functools import wraps

def retry_decorator(max_attempts=3):
    """Decorator to retry failed operations."""
    def decorator(func):
        @wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except Exception as e:
                    print(f"Attempt {attempt}/{max_attempts} failed: {e}")
                    if attempt == max_attempts:
                        raise
                    time.sleep(1)  # Wait before retry
        return wrapper
    return decorator

@retry_decorator(max_attempts=3)
def unstable_api_call(success_rate=0.7):
    """Simulate an unstable API call."""
    import random
    if random.random() < success_rate:
        return "Success!"
    raise Exception("API call failed")

# This will retry up to 3 times if it fails
try:
    result = unstable_api_call(success_rate=0.3)
    print(result)
except Exception as e:
    print(f"All attempts failed: {e}")
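
For caching, the standard library already provides a decorator: `functools.lru_cache` stores results keyed by the function's arguments. The function `lookup_table_format` and its naming rule are hypothetical, chosen only to show that the body runs once per distinct argument.

```python
from functools import lru_cache

@lru_cache(maxsize=128)
def lookup_table_format(table_name: str) -> str:
    """Hypothetical expensive lookup, cached per table name."""
    print(f"Looking up format for {table_name}...")  # printed only on a cache miss
    return 'delta' if table_name.startswith('silver_') else 'parquet'

print(lookup_table_format('silver_orders'))
print(lookup_table_format('silver_orders'))  # served from the cache
print(lookup_table_format.cache_info())
```

Arguments must be hashable, and cached results can go stale, so this pattern fits pure lookups better than calls whose answers change over time.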

## Practice Exercises

### Exercise 1: Data Processing Function
Create a function that processes a list of customer records (dicts) and returns only customers with purchases > $100.

In [None]:
# Exercise 1 - Your code here
customers = [
    {'id': 1, 'name': 'Alice', 'purchase_amount': 150},
    {'id': 2, 'name': 'Bob', 'purchase_amount': 75},
    {'id': 3, 'name': 'Charlie', 'purchase_amount': 200},
    {'id': 4, 'name': 'David', 'purchase_amount': 90}
]

def filter_high_value_customers(customers, threshold=100):
    # TODO: Implement this function
    pass

# Test your function
# high_value = filter_high_value_customers(customers)
# print(high_value)

### Exercise 2: Data Validator Class
Create a class `DataValidator` that validates data quality (checks for nulls, duplicates, etc.).

In [None]:
# Exercise 2 - Your code here
class DataValidator:
    def __init__(self, data):
        self.data = data
        self.issues = []
    
    def check_nulls(self):
        # TODO: Check for None values
        pass
    
    def check_duplicates(self):
        # TODO: Check for duplicate entries
        pass
    
    def report(self):
        # TODO: Print validation report
        pass

## Summary

In this notebook, you learned:

- ✅ Python data structures (lists, dicts, sets, tuples)
- ✅ Functions, lambda expressions, and decorators
- ✅ OOP basics with classes and inheritance
- ✅ Exception handling and context managers
- ✅ Essential libraries (datetime, json, os, pathlib)
- ✅ List comprehensions and generators
- ✅ Practical patterns for data engineering

## Next Steps

1. Complete the practice exercises
2. Review any concepts that are unclear
3. Move to [02-SQL-Essentials.ipynb](./02-SQL-Essentials.ipynb)

## Additional Resources

- [Python Official Documentation](https://docs.python.org/3/)
- [Real Python Tutorials](https://realpython.com/)
- [Python for Data Analysis](https://wesmckinney.com/book/)