Development State Reset (⚠️ DEV ONLY)
Objective: Simulate a fresh start of the ingestion pipeline during the prototyping phase.

Context: Spark Structured Streaming records which files it has already processed in a checkpoint location. To re-process the same dataset or to test configuration changes, this metadata must be cleared first.

Warning: This step permanently deletes the checkpoint history, the inferred schema, and the contents of the quarantine folder. In a production environment, this cell must be disabled or removed to prevent data duplication and loss of processing history.
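
One way to enforce the warning above, rather than relying on the cell being disabled by hand, is a small guard that fails closed. This is a minimal sketch under stated assumptions: the `PIPELINE_ENV` variable, the `DEV_ENVIRONMENTS` allow-list, and both function names are hypothetical and not part of `src.config`.

```python
import os

# Hypothetical allow-list; adjust to the environment names your project uses.
DEV_ENVIRONMENTS = {"dev", "sandbox"}


def is_reset_allowed(env_name: str) -> bool:
    """Return True only for environments where wiping checkpoints is acceptable."""
    return env_name.strip().lower() in DEV_ENVIRONMENTS


def require_dev_environment(env_name: str) -> None:
    """Raise instead of silently continuing when run outside a dev environment."""
    if not is_reset_allowed(env_name):
        raise RuntimeError(
            f"Checkpoint reset blocked: '{env_name}' is not a dev environment."
        )


# PIPELINE_ENV is an assumed variable name. When it is unset we default to
# "prod", so the guard blocks the reset unless dev is declared explicitly.
current_env = os.environ.get("PIPELINE_ENV", "prod")
print(f"Reset allowed in '{current_env}': {is_reset_allowed(current_env)}")
```

Calling `require_dev_environment(current_env)` as the first statement of the reset cell would stop execution before any `dbutils.fs.rm` call runs.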

In [0]:
# ==============================================================================
# CELL 1: CONFIGURATION & AUTHENTICATION
# Environment setup - Safe to run in any environment
# ==============================================================================

# 1. Secure Authentication
from src.config import setup_authentication, ProjectConfig, Paths

setup_authentication(spark, dbutils)

# 2. Path Definitions
base_path              = ProjectConfig.get_base_path()
raw_logs_path          = Paths.RAW_LOGS
schema_path            = Paths.SCHEMA
checkpoint_path_bronze = Paths.CHECKPOINT_BRONZE
quarantine_path        = Paths.QUARANTINE

print(f"CONFIG: Paths set. Ingestion target: {raw_logs_path}")

In [0]:
# ==============================================================================
# CELL 2: ENVIRONMENT RESET (⚠️ DEV ONLY ⚠️)
# Executes a cleanup to restart streaming from scratch.
# DO NOT RUN THIS IN PRODUCTION or you will lose processing history!
# ==============================================================================

print("MAINTENANCE: Cleaning up checkpoints and schemas for a fresh start...")

# Remove the checkpoint so the stream forgets which files it has already processed
dbutils.fs.rm(checkpoint_path_bronze, recurse=True)

# Remove the inferred schema so column types are re-learned
# (comment out the next line to keep the current schema)
dbutils.fs.rm(schema_path, recurse=True)

# Reset the quarantine folder and rebuild it
dbutils.fs.rm(quarantine_path, recurse=True)
dbutils.fs.mkdirs(quarantine_path)

print("MAINTENANCE: Environment clean. Ready for fresh ingestion.")
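
Because the deletions above are irreversible, it can help to preview them first. The following is a minimal sketch, not part of this notebook's `src` package: `reset_paths` is a hypothetical helper that takes the delete function as an argument, so it can be exercised without a Databricks runtime and defaults to a dry run.

```python
from typing import Callable, Iterable, List


def reset_paths(
    paths: Iterable[str],
    rm: Callable[[str, bool], object],
    dry_run: bool = True,
) -> List[str]:
    """Delete each path recursively, or only report it when dry_run is True."""
    handled = []
    for path in paths:
        if dry_run:
            # Nothing is deleted; we only show what would happen.
            print(f"DRY RUN: would remove {path}")
        else:
            rm(path, True)  # second positional argument is the recurse flag
            print(f"MAINTENANCE: removed {path}")
        handled.append(path)
    return handled
```

In this notebook the call would presumably look like `reset_paths([checkpoint_path_bronze, schema_path, quarantine_path], dbutils.fs.rm, dry_run=False)`, followed by `dbutils.fs.mkdirs(quarantine_path)` to rebuild the quarantine folder.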