# Project Cyber-Trace: Bronze Layer Ingestion
**Author:** Jakub Milczarczyk  
**Description:** This notebook ingests raw security logs (OTRF/Mordor dataset) from **Azure Data Lake Storage Gen2 (ADLS Gen2)** into the Bronze Layer. It uses **Databricks Auto Loader** for scalable, incremental processing.

## 1. Environment Setup & Security
This section configures the connection between Databricks and Azure Storage.

### Architecture Note: Secure Authentication
**Standard Industry Practice:** Never hardcode credentials (such as access keys) in notebooks; doing so is a critical security risk.  
**Selected Solution:** I use **Azure Key Vault** integrated with **Databricks Secret Scopes**:
* The Access Key is stored securely in Azure Key Vault.
* Databricks retrieves it at runtime using `dbutils.secrets.get()`.
* This ensures no sensitive data is exposed in the source code or version control systems (Git).
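
The retrieval pattern above can be sketched as a small helper that separates the secret lookup from the Spark configuration, which makes it testable outside a cluster. This is only a sketch: `build_account_key_conf` and `FakeSecrets` are hypothetical names introduced for illustration, and on Databricks the caller would pass `dbutils.secrets` instead of the stand-in object.

```python
def build_account_key_conf(secrets, storage_account, scope, key):
    """Return the (config key, access key) pair for ADLS account-key auth.

    `secrets` is any object exposing get(scope=..., key=...);
    on Databricks this would be `dbutils.secrets` (assumption of this sketch).
    """
    access_key = secrets.get(scope=scope, key=key)
    if not access_key:
        raise ValueError(f"Secret '{key}' in scope '{scope}' is empty.")
    conf_key = f"fs.azure.account.key.{storage_account}.dfs.core.windows.net"
    return conf_key, access_key


class FakeSecrets:
    """Hypothetical in-memory stand-in for dbutils.secrets (local testing only)."""

    def __init__(self, store):
        self._store = store

    def get(self, scope, key):
        return self._store[(scope, key)]


fake = FakeSecrets(
    {("cybertrace-secrets", "storage-access-key"): "placeholder-not-a-real-key"}
)
conf_key, conf_value = build_account_key_conf(
    fake, "cybertracebronze", "cybertrace-secrets", "storage-access-key"
)
# Print only the config key; the secret value itself should never be logged.
print(conf_key)
```

In a notebook, the result could be applied with `spark.conf.set(*build_account_key_conf(dbutils.secrets, storage_account_name, scope_name, key_name))`.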

In [0]:
# ==============================================================================
# CELL 1: CONFIGURATION & AUTHENTICATION
# Environment setup - Safe to run in any environment
# ==============================================================================

# 1. Resource Definitions
storage_account_name = "cybertracebronze"
container_name       = "bronze"
scope_name           = "cybertrace-secrets"
key_name             = "storage-access-key"

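# 2. Secure Authentication
# The secret holds a storage account access key (not a SAS token),
# so it is set under the `fs.azure.account.key.*` configuration entry.
try:
    access_key = dbutils.secrets.get(scope=scope_name, key=key_name)
    spark.conf.set(
        f"fs.azure.account.key.{storage_account_name}.dfs.core.windows.net",
        access_key
    )
    print("SYSTEM: Authentication configured successfully.")
except Exception as e:
    # Chain the original exception so the root cause stays in the traceback.
    raise ValueError(f"CRITICAL ERROR: Failed to retrieve secrets. Details: {e}") from e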

# 3. Path Definitions
base_path       = f"abfss://{container_name}@{storage_account_name}.dfs.core.windows.net"
raw_logs_path   = f"{base_path}/raw_logs"
schema_path     = f"{base_path}/schemas/ingestion_schema"
checkpoint_path = f"{base_path}/checkpoints/raw_logs_ingest"

print(f"CONFIG: Paths set. Ingestion target: {raw_logs_path}")

## 2. Development State Reset (⚠️ DEV ONLY)
**Objective:** To simulate a "fresh start" of the ingestion pipeline during the prototyping phase.

**Context:** Spark Structured Streaming maintains the state of processed files in a **Checkpoint** location. To re-process the same dataset or test configuration changes, we must clear this metadata.

> **Warning:** This step deletes the checkpoint history. In a **Production** environment, this cell would be disabled or removed to prevent data duplication and loss of processing history.
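
One way to make the warning above harder to ignore is to guard the deletion behind an explicit environment check. The sketch below assumes a hypothetical `reset_checkpoint` helper and a hypothetical `environment` string; how the environment name is obtained (widget, job parameter, cluster tag) is left open. `RecordingFs` is a stand-in so the example runs locally; on Databricks the caller would pass `dbutils.fs`.

```python
def reset_checkpoint(fs, checkpoint_path, environment):
    """Delete the checkpoint directory only when environment is 'dev'.

    `fs` is any object exposing rm(path, recurse=...), e.g. `dbutils.fs`
    on Databricks (assumption of this sketch). Returns True if deleted.
    """
    if environment.lower() != "dev":
        print(f"SKIPPED: refusing to delete {checkpoint_path} in '{environment}'.")
        return False
    fs.rm(checkpoint_path, recurse=True)
    print(f"MAINTENANCE: removed {checkpoint_path}.")
    return True


class RecordingFs:
    """Hypothetical stand-in for dbutils.fs that only records rm() calls."""

    def __init__(self):
        self.removed = []

    def rm(self, path, recurse=False):
        self.removed.append((path, recurse))
        return True


demo_path = (
    "abfss://bronze@cybertracebronze.dfs.core.windows.net/checkpoints/raw_logs_ingest"
)
fake_fs = RecordingFs()
deleted_in_prod = reset_checkpoint(fake_fs, demo_path, "prod")  # guarded, no delete
deleted_in_dev = reset_checkpoint(fake_fs, demo_path, "dev")    # delete is recorded
```

With a guard like this the reset cell can stay in the notebook, since it does nothing unless the environment is explicitly marked as development.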

In [0]:
# ==============================================================================
# CELL 2: ENVIRONMENT RESET (⚠️ DEV ONLY ⚠️)
# Executes a cleanup to restart streaming from scratch.
# DO NOT RUN THIS IN PRODUCTION or you will lose processing history!
# ==============================================================================

print("MAINTENANCE: Cleaning up checkpoints and schemas for a fresh start...")

# Remove checkpoints (resets stream offset to zero)
dbutils.fs.rm(checkpoint_path, recurse=True)

# Optional: Remove inferred schema if you want to re-learn column types
# dbutils.fs.rm(schema_path, recurse=True) 

print("MAINTENANCE: Environment clean. Ready for fresh ingestion.")

## 3. Ingestion Pipeline (Auto Loader)
**Objective:** Ingest raw JSON logs efficiently and robustly.

**Technology Stack:**
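* **Databricks Auto Loader (`cloudFiles`):** A Structured Streaming source that incrementally discovers new files as they arrive in ADLS and tracks which files have already been processed, so each run ingests only new data. In file notification mode it avoids repeatedly listing the directory (the "S3/ADLS listing" performance bottleneck); this notebook uses the default directory listing mode.
* **Schema Evolution:** The schema is inferred from the data and persisted at `schema_path` (`cloudFiles.schemaLocation`), which lets Auto Loader detect changes in the log structure (e.g., new fields in JSON events). With the default settings, the stream stops when a new column appears and picks up the updated schema on restart.
* **Checkpointing:** Provides fault tolerance by recording ingestion progress. With a durable sink, a stream restarted after a cluster crash resumes where it left off.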
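
The schema-related settings described above can be collected in one place. The following is a minimal sketch, assuming a hypothetical helper `auto_loader_options`; the option names (`cloudFiles.inferColumnTypes`, `cloudFiles.schemaEvolutionMode`) are Auto Loader options, but the chosen values are illustrative and not necessarily what this project uses.

```python
# Schema evolution modes accepted by Auto Loader.
VALID_EVOLUTION_MODES = {"addNewColumns", "rescue", "failOnNewColumns", "none"}


def auto_loader_options(schema_location, file_format="json",
                        infer_column_types=True,
                        evolution_mode="addNewColumns"):
    """Build the option dict for a cloudFiles reader (hypothetical helper)."""
    if evolution_mode not in VALID_EVOLUTION_MODES:
        raise ValueError(f"Unknown schema evolution mode: {evolution_mode!r}")
    return {
        "cloudFiles.format": file_format,
        # Where the inferred schema and its later versions are persisted.
        "cloudFiles.schemaLocation": schema_location,
        # Without this, JSON columns are inferred as strings.
        "cloudFiles.inferColumnTypes": str(infer_column_types).lower(),
        "cloudFiles.schemaEvolutionMode": evolution_mode,
    }


opts = auto_loader_options(
    "abfss://bronze@cybertracebronze.dfs.core.windows.net/schemas/ingestion_schema"
)
for name, value in opts.items():
    print(f"{name} = {value}")
```

On a cluster, the dictionary could be passed to the reader with `spark.readStream.format("cloudFiles").options(**opts).load(raw_logs_path)`.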

**Output:**
For this prototype, the stream writes to an **in-memory table** (`memory` format) for immediate validation. In the next stage (Silver Layer), this will be replaced with a **Delta Lake** sink.
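
For orientation, the planned Delta Lake replacement might look roughly like the sketch below. It is not the project's implementation: `start_bronze_delta_sink`, the target path, and the dedicated checkpoint path are hypothetical, and the `availableNow` trigger assumes a runtime that supports it. `FakeStream` only records the chained calls so the example runs without a cluster.

```python
def start_bronze_delta_sink(df_stream, target_path, checkpoint_path):
    """Start an append-only Delta write for a streaming DataFrame (sketch)."""
    return (df_stream.writeStream
        .format("delta")
        .outputMode("append")
        .option("checkpointLocation", checkpoint_path)
        # Process everything currently available, then stop.
        .trigger(availableNow=True)
        .start(target_path)
    )


class FakeWriter:
    """Hypothetical stand-in for DataStreamWriter; records chained calls."""

    def __init__(self):
        self.calls = []

    def __getattr__(self, name):
        # Only reached for names not set in __init__ (format, option, ...).
        def record(*args, **kwargs):
            self.calls.append((name, args, kwargs))
            return self
        return record


class FakeStream:
    """Hypothetical stand-in for a streaming DataFrame."""

    def __init__(self):
        self.writeStream = FakeWriter()


base = "abfss://bronze@cybertracebronze.dfs.core.windows.net"
result = start_bronze_delta_sink(
    FakeStream(),
    f"{base}/delta/raw_logs",              # hypothetical target path
    f"{base}/checkpoints/raw_logs_delta",  # hypothetical, separate checkpoint
)
call_names = [name for name, _, _ in result.calls]
print(call_names)
```

A separate checkpoint directory is used in the sketch because a checkpoint is tied to one query; reusing the in-memory prototype's checkpoint for a different sink would mix their state.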

In [0]:
# ==============================================================================
# CELL 3: DATA INGESTION PIPELINE
# Core logic: Read Stream -> Write Memory -> Verify
# ==============================================================================

# 1. Define the Stream (Auto Loader)
print(f"ACTION: Initializing Auto Loader from {raw_logs_path}...")
df_stream = (spark.readStream
    .format("cloudFiles")
    .option("cloudFiles.format", "json")
    .option("cloudFiles.schemaLocation", schema_path)
    .load(raw_logs_path)
)

# 2. Run the Stream (In-Memory for testing)
temp_table_name = "raw_logs_view"

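# We use the checkpoint defined in Cell 1.
# If Cell 2 was run, the stream starts from scratch.
# Note: Spark's memory sink in append mode does not support recovering from an
# existing checkpoint, so run Cell 2 before re-running this cell.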
query = (df_stream.writeStream
    .format("memory")
    .queryName(temp_table_name)
    .option("checkpointLocation", checkpoint_path)
    .start()
)

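print("SYSTEM: Stream is running. Waiting for available data to be processed...")
# Block until all data currently available at the source has been processed.
# Intended for testing; more reliable than a fixed sleep.
query.processAllAvailable()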

# 3. Data Verification
print("REPORT: Sample of ingested logs (up to 10 rows):")
df_view = spark.sql(f"SELECT * FROM {temp_table_name} LIMIT 10")
display(df_view)