# Project Cyber-Trace: Bronze Layer Ingestion (Prototype)
**Author:** Jakub Milczarczyk
**Date:** 2025-11-28
**Description:** This notebook handles the ingestion of raw security logs (OTRF/Mordor dataset) from AWS S3.
It implements the "Direct URI Injection" pattern (ADR-001) to bypass Free Tier limitations.

## Overview
This notebook serves as the **Initialization & Prototyping** phase for the Cyber-Trace pipeline.
1.  **Configuration:** Securely load AWS credentials from a local module.
2.  **Connection:** Build a credential-bearing `s3a://` URI (ADR-001), because the Spark session's `fs.s3a.*` settings cannot be changed on this tier.
3.  **Ingestion:** Read raw JSON logs generated by the Mordor/OTRF project (simulated SMB lateral-movement attacks).
4.  **Validation:** Verify the schema and preview the data.

In [0]:
# -------------------------------------------------------------------------
# 1. IMPORTS & CONFIGURATION LOADING
# -------------------------------------------------------------------------
import os
import sys
from urllib.parse import quote

# Attempt to load the local configuration module
try:
    import project_config
    import importlib
    importlib.reload(project_config) 
    print("SUCCESS: Configuration module 'project_config' loaded.")
except ImportError:
    raise ImportError("CRITICAL ERROR: 'project_config.py' not found.")

# Extract variables cleanly
aws_access_key = project_config.settings.get("AWS_ACCESS_KEY")
aws_secret_key = project_config.settings.get("AWS_SECRET_KEY")
bucket_name    = project_config.settings.get("S3_BUCKET")
folder_name    = project_config.settings.get("S3_FOLDER")
file_name      = project_config.settings.get("S3_FILE_NAME")

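# Basic validation: every setting, including the AWS keys, must be present and non-empty
# (a missing key would otherwise fail later inside quote() with an unhelpful TypeError)
if not all([aws_access_key, aws_secret_key, bucket_name, folder_name, file_name]):
    raise ValueError("Configuration Error: One or more AWS/S3 variables are missing in project_config.py")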

SUCCESS: Configuration module 'project_config' loaded.
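
The cell above only assumes that `project_config` exposes a dict-like `settings` object. As a minimal sketch, one possible layout for that module is shown below; the placeholder values and the `find_missing` helper are hypothetical and are not part of the original project.

```python
# Hypothetical layout for project_config.py; every value is a placeholder.
settings = {
    "AWS_ACCESS_KEY": "<access-key-id>",
    "AWS_SECRET_KEY": "<secret-access-key>",
    "S3_BUCKET": "<bucket-name>",
    "S3_FOLDER": "<folder>",
    "S3_FILE_NAME": "<file-name>.json",
}

# The five keys the notebook reads via settings.get(...)
REQUIRED_KEYS = (
    "AWS_ACCESS_KEY",
    "AWS_SECRET_KEY",
    "S3_BUCKET",
    "S3_FOLDER",
    "S3_FILE_NAME",
)


def find_missing(config, required=REQUIRED_KEYS):
    """Return the required keys that are absent or empty in `config`."""
    return [key for key in required if not config.get(key)]
```

Reporting the names of the missing keys, rather than a generic message, makes a misconfigured module quicker to fix. Keeping this file out of version control is what keeps the secrets out of the repository.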


## ⚠️ Architecture Note: S3 Authentication Strategy

**Standard Industry Practice:**
In a production environment, the secure standard is to use **Instance Profiles** (IAM Roles) attached directly to the Cluster, or **Databricks Secrets** injected via `spark.conf.set("fs.s3a.access.key", ...)` during session initialization.

**Environment Constraints:**
This project runs on **Databricks Free Tier (Serverless Compute)**. This environment enforces strict security isolation and explicitly blocks the modification of global Hadoop configurations (`fs.s3a.*`) via `spark.conf`, returning a `[CONFIG_NOT_AVAILABLE]` error.

**Selected Solution:**
To work around this limitation while keeping secrets out of the notebook source:
1.  We use the **Direct URI Scheme** (`s3a://access_key:secret_key@bucket/file`).
2.  We programmatically **URL-encode** the credentials so that special characters such as `/` and `+` cannot break the URI.
3.  This keeps the pipeline working within the restricted tier without hardcoding secrets in plain text. Because the credentials travel inside the URI, any printed path or error message must be masked before it is displayed.
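
The encoding and masking steps can be sketched in plain Python, independent of Spark. The helper names `build_s3a_uri` and `mask_uri` and the sample credentials are hypothetical, chosen only for illustration; the notebook itself builds the path inline.

```python
from urllib.parse import quote


def build_s3a_uri(access_key, secret_key, bucket, object_key):
    """Build s3a://access:secret@bucket/key with URL-encoded credentials."""
    # safe="" forces '/' to be encoded too; quote() leaves it alone by default
    user = quote(access_key, safe="")
    password = quote(secret_key, safe="")
    return f"s3a://{user}:{password}@{bucket}/{object_key.lstrip('/')}"


def mask_uri(uri):
    """Replace the credential part of a URI with asterisks for logging."""
    scheme, _, rest = uri.partition("://")
    # Encoded credentials cannot contain '@', so the first '@' is the separator
    credentials, separator, location = rest.partition("@")
    if not separator:
        return uri  # no credentials embedded, nothing to mask
    return f"{scheme}://***:***@{location}"


# Sample values only; these are not real credentials
uri = build_s3a_uri("AKIAEXAMPLE", "abc/def+ghi", "my-bucket", "raw_logs/file.json")
print(mask_uri(uri))
```

Without `safe=""`, a secret containing `/` would pass through unencoded and the URI parser could read the remainder of the secret as part of the path.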

In [0]:
# -------------------------------------------------------------------------
# 3. DATA INGESTION (Direct Credential Injection)
# -------------------------------------------------------------------------

# 1. Encode the keys safely
encoded_access_key = quote(aws_access_key, safe="")
encoded_secret_key = quote(aws_secret_key, safe="")

# 2. Construct the credential-bearing path with an f-string
# Layout: s3a://access_key:secret_key@bucket/folder/file
# Strip stray slashes from the folder and file names to avoid double slashes in the path
object_key = f"{folder_name.strip('/')}/{file_name.lstrip('/')}"
source_path = f"s3a://{encoded_access_key}:{encoded_secret_key}@{bucket_name}/{object_key}"

print("Attempting to read data from S3...")
print(f"Target Path Structure: s3a://***:***@{bucket_name}/{object_key}")

try:
    # 3. Read JSON
    df_raw = spark.read.option("multiline", "true").json(source_path)
    
    print("SUCCESS: Dataframe created.")
    
    # Validation
    print("\n--- Data Preview ---")
    display(df_raw.limit(5))
    
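except Exception as e:
    print("ERROR: Failed to read file.")
    # Mask both credentials, in raw and URL-encoded form, before printing the error
    safe_error = str(e)
    for secret in (encoded_secret_key, aws_secret_key, encoded_access_key, aws_access_key):
        if secret:
            safe_error = safe_error.replace(secret, "***")
    print(f"Error Details: {safe_error}")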


Attempting to read data from S3...
Target Path Structure: s3a://***:***@cybertrace-project-bronze-data-jm/raw_logs/empire_smbexec_dcerpc_smb_svcctl_2020-09-20025716.json
SUCCESS: Dataframe created.

--- Data Preview ---


@timestamp,@version,AccountName,AccountType,CallTrace,Category,Channel,Domain,EventID,EventReceivedTime,EventTime,EventType,ExecutionProcessID,GrantedAccess,Hostname,Keywords,Opcode,OpcodeValue,ProviderGuid,RecordNumber,RuleName,Severity,SeverityValue,SourceImage,SourceModuleName,SourceModuleType,SourceName,SourceProcessGUID,SourceProcessId,SourceThreadId,TargetImage,TargetProcessGUID,TargetProcessId,Task,ThreadID,UserID,UtcTime,Version,host,port,tags
2020-09-20T06:57:17.371Z,1,SYSTEM,User,C:\windows\SYSTEM32\ntdll.dll+9c534|C:\windows\SYSTEM32\psmserviceexthost.dll+222a3|C:\windows\SYSTEM32\psmserviceexthost.dll+1a172|C:\windows\SYSTEM32\psmserviceexthost.dll+19e3b|C:\windows\SYSTEM32\psmserviceexthost.dll+19318|C:\windows\SYSTEM32\ntdll.dll+3081d|C:\windows\SYSTEM32\ntdll.dll+345b4|C:\windows\System32\KERNEL32.DLL+17bd4|C:\windows\SYSTEM32\ntdll.dll+6ce51,Process accessed (rule: ProcessAccess),Microsoft-Windows-Sysmon/Operational,NT AUTHORITY,10,2020-09-20 02:57:17,2020-09-20 02:57:14,INFO,9848,0x1000,WORKSTATION5.theshire.local,-9223372036854775808,Info,0,{5770385F-C22A-43E0-BF4C-06F5698FFBD9},1929240,-,INFO,2,C:\windows\system32\svchost.exe,eventlog,im_msvistalog,Microsoft-Windows-Sysmon,{b34bc01c-7fae-5f63-1000-000000000400},880.0,7488.0,C:\windows\System32\svchost.exe,{b34bc01c-803f-5f63-5402-000000000400},704.0,10,7976,S-1-5-18,2020-09-20 06:57:14.637,3,wec.internal.cloudapp.net,64545,List(mordorDataset)
2020-09-20T06:57:17.372Z,1,pgustavo,User,,Executing Pipeline,Microsoft-Windows-PowerShell/Operational,THESHIRE,4103,2020-09-20 02:57:17,2020-09-20 02:57:15,INFO,8948,,WORKSTATION5.theshire.local,0,To be used when operation is just executing a method,20,{A0C1853B-5C40-4B15-8766-3CF1C58F985A},37562,,INFO,2,,eventlog,im_msvistalog,Microsoft-Windows-PowerShell,,,,,,,106,9552,S-1-5-21-4228717743-1032521047-1810997296-1104,,1,wec.internal.cloudapp.net,64545,List(mordorDataset)
2020-09-20T06:57:17.373Z,1,SYSTEM,User,C:\windows\SYSTEM32\ntdll.dll+9c534|C:\windows\SYSTEM32\psmserviceexthost.dll+222a3|C:\windows\SYSTEM32\psmserviceexthost.dll+1a172|C:\windows\SYSTEM32\psmserviceexthost.dll+19e3b|C:\windows\SYSTEM32\psmserviceexthost.dll+19318|C:\windows\SYSTEM32\ntdll.dll+3081d|C:\windows\SYSTEM32\ntdll.dll+345b4|C:\windows\System32\KERNEL32.DLL+17bd4|C:\windows\SYSTEM32\ntdll.dll+6ce51,Process accessed (rule: ProcessAccess),Microsoft-Windows-Sysmon/Operational,NT AUTHORITY,10,2020-09-20 02:57:17,2020-09-20 02:57:14,INFO,9848,0x1000,WORKSTATION5.theshire.local,-9223372036854775808,Info,0,{5770385F-C22A-43E0-BF4C-06F5698FFBD9},1929241,-,INFO,2,C:\windows\system32\svchost.exe,eventlog,im_msvistalog,Microsoft-Windows-Sysmon,{b34bc01c-7fae-5f63-1000-000000000400},880.0,7488.0,C:\windows\System32\svchost.exe,{b34bc01c-803f-5f63-5402-000000000400},704.0,10,7976,S-1-5-18,2020-09-20 06:57:14.637,3,wec.internal.cloudapp.net,64545,List(mordorDataset)
2020-09-20T06:57:17.373Z,1,SYSTEM,User,C:\windows\SYSTEM32\ntdll.dll+9c534|C:\windows\SYSTEM32\psmserviceexthost.dll+222a3|C:\windows\SYSTEM32\psmserviceexthost.dll+1a172|C:\windows\SYSTEM32\psmserviceexthost.dll+19e3b|C:\windows\SYSTEM32\psmserviceexthost.dll+19318|C:\windows\SYSTEM32\ntdll.dll+3081d|C:\windows\SYSTEM32\ntdll.dll+345b4|C:\windows\System32\KERNEL32.DLL+17bd4|C:\windows\SYSTEM32\ntdll.dll+6ce51,Process accessed (rule: ProcessAccess),Microsoft-Windows-Sysmon/Operational,NT AUTHORITY,10,2020-09-20 02:57:17,2020-09-20 02:57:14,INFO,9848,0x1000,WORKSTATION5.theshire.local,-9223372036854775808,Info,0,{5770385F-C22A-43E0-BF4C-06F5698FFBD9},1929242,-,INFO,2,C:\windows\system32\svchost.exe,eventlog,im_msvistalog,Microsoft-Windows-Sysmon,{b34bc01c-7fae-5f63-1000-000000000400},880.0,7488.0,C:\windows\System32\svchost.exe,{b34bc01c-803f-5f63-5402-000000000400},704.0,10,7976,S-1-5-18,2020-09-20 06:57:14.637,3,wec.internal.cloudapp.net,64545,List(mordorDataset)
2020-09-20T06:57:17.373Z,1,SYSTEM,User,,Registry object added or deleted (rule: RegistryEvent),Microsoft-Windows-Sysmon/Operational,NT AUTHORITY,12,2020-09-20 02:57:17,2020-09-20 02:57:15,,9848,,WORKSTATION5.theshire.local,-9223372036854775808,Info,0,{5770385F-C22A-43E0-BF4C-06F5698FFBD9},1929243,-,INFO,2,,eventlog,im_msvistalog,Microsoft-Windows-Sysmon,,,,,,,12,7976,S-1-5-18,2020-09-20 06:57:15.069,2,wec.internal.cloudapp.net,64545,List(mordorDataset)
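
The preview confirms that rows load, but the schema part of the validation step could also be checked programmatically. The sketch below compares a list of column names against a required set; the choice of required columns is an assumption drawn from the preview header above, and `missing_columns` is a hypothetical helper.

```python
# Assumed minimum set of columns for downstream layers; adjust as needed.
REQUIRED_COLUMNS = {"@timestamp", "EventID", "Hostname", "Channel"}


def missing_columns(actual_columns, required=REQUIRED_COLUMNS):
    """Return the required column names that are not in `actual_columns`."""
    return sorted(set(required) - set(actual_columns))


# A few column names copied from the preview header
preview_columns = ["@timestamp", "@version", "AccountName", "Channel", "EventID", "Hostname"]
print(missing_columns(preview_columns))
```

In the notebook, the same check would be called as `missing_columns(df_raw.columns)`, since a Spark DataFrame exposes its column names as a list through `columns`.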
