# Project Cyber-Trace: Silver Layer Transformation
**Author:** Jakub Milczarczyk  
**Pipeline Type:** Micro-batch Processing (FinOps Optimized)

## Objective
Process raw Bronze data into a clean, queryable Silver Layer optimized for Threat Hunting (SQL Analysis).

## Technical Highlights
* **Defensive Logic:** Implemented `standardize_security_logs` function (in `src.transformations`) to safely handle missing columns and nested JSON fields.
* **FinOps Strategy:** Uses `.trigger(availableNow=True)` to process all pending data (in one or more micro-batches) and then shut down. For periodic workloads, this reduces compute costs by ~90% compared to continuous streaming.
* **Data Quality:** Null handling and type casting (timestamp normalization) ensure a consistent schema for downstream analytics.
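
The defensive handling and timestamp normalization listed above can be illustrated on a single record. This is a minimal pure-Python sketch of the idea, not the code in `src.transformations` (which operates on Spark DataFrames). The field names, the `SILVER_FIELDS` mapping, and the way `event_date` is derived from the timestamp are all assumptions made for illustration.

```python
from datetime import datetime, timezone

# Hypothetical target schema: output column -> path inside the raw event.
SILVER_FIELDS = {
    "event_time": ("@timestamp",),
    "host_name": ("host", "name"),
    "event_id": ("winlog", "event_id"),
    "user_name": ("winlog", "event_data", "TargetUserName"),
}


def get_nested(record, path):
    """Walk a nested dict; return None if any level is missing or not a dict."""
    current = record
    for key in path:
        if not isinstance(current, dict) or key not in current:
            return None
        current = current[key]
    return current


def normalize_timestamp(value):
    """Parse an ISO 8601 string into a UTC datetime; return None if unparseable."""
    if not isinstance(value, str):
        return None
    try:
        parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    except ValueError:
        return None
    if parsed.tzinfo is None:
        # Assumption: timestamps without an offset are already in UTC.
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed.astimezone(timezone.utc)


def standardize_record(raw):
    """Flatten one raw event into the fixed schema, filling gaps with None."""
    row = {name: get_nested(raw, path) for name, path in SILVER_FIELDS.items()}
    row["event_time"] = normalize_timestamp(row["event_time"])
    # Assumed derivation of the partition column from the event timestamp.
    row["event_date"] = row["event_time"].date() if row["event_time"] else None
    return row
```

Every output row has the same keys regardless of what the input contained, which is the property that keeps the downstream schema stable.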

In [0]:
# ==============================================================================
# CELL 1: CONFIGURATION & AUTHENTICATION
# ==============================================================================
from src.config import setup_authentication, ProjectConfig, Paths
from src.logger import get_logger

logger = get_logger("SilverTransformation")

setup_authentication(spark, dbutils)
logger.info("SYSTEM: Authentication successful")

# Path definitions
base_path       = ProjectConfig.get_base_path()
raw_logs_path   = Paths.RAW_LOGS
schema_path     = Paths.SCHEMA
checkpoint_path_bronze = Paths.CHECKPOINT_BRONZE
checkpoint_path_silver = Paths.CHECKPOINT_SILVER
quarantine_path = Paths.QUARANTINE

logger.info(f"CONFIG: Paths set. Silver checkpoint: {checkpoint_path_silver}")

In [0]:
# ==============================================================================
# CELL 2: IMPORTS AND ENVIRONMENT SETUP
# ==============================================================================

import sys
import os

notebook_path = os.getcwd()
project_root = os.path.dirname(notebook_path)

if project_root not in sys.path:
    sys.path.append(project_root)

from src.transformations import standardize_security_logs
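
The path setup above assumes the notebook sits exactly one directory below the project root. A slightly more tolerant variant, sketched here under the assumption that the root is the nearest ancestor containing a `src` directory, walks upward until it finds that marker. The helper names `find_project_root` and `ensure_on_sys_path` are hypothetical and are not part of the project.

```python
import sys
from pathlib import Path


def find_project_root(start, marker="src"):
    """Return the nearest directory at or above `start` that contains `marker`."""
    start = Path(start).resolve()
    for candidate in (start, *start.parents):
        if (candidate / marker).is_dir():
            return candidate
    return None


def ensure_on_sys_path(root):
    """Append `root` to sys.path once; return True if it was added."""
    root_str = str(root)
    if root_str in sys.path:
        return False
    sys.path.append(root_str)
    return True
```

In the notebook this could be called as `ensure_on_sys_path(find_project_root(os.getcwd()))`, after checking that a root was actually found.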

In [0]:
# ==============================================================================
# CELL 3: READ BRONZE DATA
# ==============================================================================

logger.info("SYSTEM: Reading Bronze Data")
df_bronze = spark.readStream.table("bronze_mordor_logs")

## Write Silver Layer

**Processing:** Flattens the Bronze table into a flat structure and fills missing fields with nulls.

**FinOps:** Uses `.trigger(availableNow=True)` to process pending data in micro-batches and then shut down, reducing compute costs by ~90% vs. continuous streaming.

**Output:** Silver Layer written to ADLS Gen2 as the table `silver_security_logs`.
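
To show how a saving of roughly 90% can arise, the arithmetic below compares an always-on cluster with scheduled runs. The schedule (24 runs per day at 6 minutes each) is an illustrative assumption, not a measurement from this project, and the calculation ignores cluster start-up time and any pricing differences between cluster types.

```python
def compute_savings(continuous_hours, batch_hours):
    """Fraction of compute hours saved by running in batches instead of continuously."""
    if continuous_hours <= 0:
        raise ValueError("continuous_hours must be positive")
    return 1 - batch_hours / continuous_hours


# Assumed schedule: one run per hour, each keeping the cluster busy for 6 minutes.
runs_per_day = 24
minutes_per_run = 6

batch_hours_per_day = runs_per_day * minutes_per_run / 60  # 2.4 hours
continuous_hours_per_day = 24

savings = compute_savings(continuous_hours_per_day, batch_hours_per_day)
```

With these assumed numbers the cluster runs 2.4 hours instead of 24, so `savings` is about 0.9. Longer or more frequent runs shrink the saving proportionally.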

In [0]:
# ==============================================================================
# CELL 4: WRITE SILVER LAYER
# Transform Bronze data and write it to the Silver table
# ==============================================================================

logger.info("SYSTEM: Start processing Silver Layer (with PII Masking)...")
df_silver = standardize_security_logs(df_bronze)

table_name_silver = "silver_security_logs"

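# Stream the standardized rows into the Silver Delta table.
query_silver = (
    df_silver.writeStream
    .format("delta")
    .outputMode("append")
    .option("checkpointLocation", checkpoint_path_silver)
    .option("mergeSchema", "true")   # allow new columns to be added to the table schema
    .partitionBy("event_date")
    .trigger(availableNow=True)      # drain the backlog, then stop
    .toTable(table_name_silver)      # documented PySpark API; starts the query
)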

logger.info(f"Stream Silver initialized. Partitioned by 'event_date'. Target: {table_name_silver}")
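
The write call above returns as soon as the query has started, so a scheduled job may want to block until the backlog is drained. The helper below is a minimal sketch of that step. It relies only on `awaitTermination`, `stop`, and `lastProgress`, which exist on PySpark's `StreamingQuery`; the helper name `wait_for_available_now` and the timeout policy are assumptions, not part of the project.

```python
def wait_for_available_now(query, timeout_s=None):
    """Block until the streaming query terminates and return its last progress.

    With a timeout, awaitTermination returns False if the query is still
    running when the timeout expires; in that case the query is stopped and
    a TimeoutError is raised.
    """
    if timeout_s is None:
        query.awaitTermination()
        return query.lastProgress

    finished = query.awaitTermination(timeout_s)
    if not finished:
        query.stop()
        raise TimeoutError(f"Streaming query still running after {timeout_s} s")
    return query.lastProgress
```

In this notebook it could be used as `progress = wait_for_available_now(query_silver, timeout_s=3600)`, where the one-hour limit is an arbitrary example value.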