# RAG - Data Handling Demo

This notebook demonstrates the **`data_handling` package**, which is used for:
- Parsing incident reports
- Generating embeddings
- Storing vectors in Pinecone (organized by namespace)
- Querying for similar incidents

---

## 1. Setup and Configuration

In [1]:
import os
from configs.config import RAGConfig
from data_handling import (
    embed_documents,
    embed_query,
    VectorStore,
    IncidentReportParser,
    IngestionPipeline
)

# Initialize configuration
print("Initializing RAG Configuration...")
config = RAGConfig(
    index_name="incident-reports",
    model="models/gemini-2.0-flash",
    embedding_model="models/text-embedding-004",
    embedding_dimension=768
)
print("✓ Configuration initialized")

Initializing RAG Configuration...
Successfully initialized model: models/gemini-2.0-flash
Using existing Pinecone index: incident-reports
✓ Configuration initialized


## 2. Example Incident Report

Let's look at a sample cybersecurity incident report:

In [2]:
# Read one example incident report
example_file = "demo incident reports/AV-SEC-2025-002.txt"

with open(example_file, 'r', encoding='utf-8') as f:
    full_report = f.read()

# Display the report (truncated for readability)
print("="*80)
print("EXAMPLE INCIDENT REPORT (First 600 characters)")
print("="*80)
print(full_report[:600] + "...\n")
print(f"Total length: {len(full_report)} characters")

EXAMPLE INCIDENT REPORT (First 600 characters)
Incident ID: AV-SEC-2025-002 Date of Detection: 2025-10-28 09:15 UTC Vehicle ID: AV-991X (Fleet: "Highway Hauler") Threat Category: CAN Bus Intrusion / Denial of Service Detection Method: In-Vehicle Network (IVN) Behavioral Anomaly Detection.

Detailed Incident Description: A high-severity event was triggered on vehicle AV-991X, a long-haul autonomous truck, while it was operating on Interstate 80. Our network monitoring agent on the primary Controller Area Network (CAN) bus detected a sudden flood of high-priority, non-standard messages originating from an unauthorized ECU. The attack manifes...

Total length: 3277 characters


## 3. Document Parsing

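The `IncidentReportParser` splits a report into structured sections (description, impact, response, recommendations) and returns a `(text, metadata, doc_id)` tuple for each: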

In [3]:
# Initialize parser
parser = IncidentReportParser(config)

# Parse the incident report
sections = parser.parse_incident_report(example_file)

print("="*80)
print(f"PARSED SECTIONS: {len(sections)} sections extracted")
print("="*80)

for text, metadata, doc_id in sections:
    section_type = metadata['section_type']
    print(f"\n[{section_type.upper()}]")
    print(f"  Document ID: {doc_id}")
    print(f"  Text preview: {text[:150]}...")
    print(f"  Token count: {metadata['token_count']}")
    
    # Show cross-section metadata (other sections stored in metadata)
    other_sections = [k for k in metadata.keys() if k.startswith('section_') and k != f'section_{section_type}_text']
    print(f"  Cross-references: {len(other_sections)} other sections in metadata")

PARSED SECTIONS: 4 sections extracted

[DESCRIPTION]
  Document ID: AV-SEC-2025-002_description
  Text preview: A high-severity event was triggered on vehicle AV-991X, a long-haul autonomous truck, while it was operating on Interstate 80. Our network monitoring ...
  Token count: 244
  Cross-references: 4 other sections in metadata

[IMPACT]
  Document ID: AV-SEC-2025-002_impact
  Text preview: The immediate impact was a loss of stable vehicle control, creating a significant safety hazard on a public highway. The bus saturation could have led...
  Token count: 70
  Cross-references: 4 other sections in metadata

[RESPONSE]
  Document ID: AV-SEC-2025-002_response
  Text preview: The onboard intrusion detection system (IDS) immediately identified the message flood and correlated it with the unexpected control behavior. It broad...
  Token count: 159
  Cross-references: 4 other sections in metadata

[RECOMMENDATIONS]
  Document ID: AV-SEC-2025-002_recommendations
  Text preview: Implement
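
To make the parsing step concrete, here is a minimal sketch of how a report might be split into the `(text, metadata, doc_id)` tuples shown above. It is not the `IncidentReportParser` implementation. Only the `Detailed Incident Description:` header appears in the sample report; the other three header strings and the `split_sections` helper are hypothetical, and token counting is left out.

```python
# Hypothetical header strings. Only "Detailed Incident Description:" is taken
# from the sample report; the real report layout may use different headers.
SECTION_HEADERS = {
    "description": "Detailed Incident Description:",
    "impact": "Impact Assessment:",
    "response": "Response Actions:",
    "recommendations": "Recommendations:",
}


def split_sections(report_id, text):
    """Split a report into (text, metadata, doc_id) tuples, one per section found."""
    # Record where each known header starts; skip headers that are absent.
    found = []
    for name, header in SECTION_HEADERS.items():
        pos = text.find(header)
        if pos != -1:
            found.append((pos, name, header))
    found.sort()

    sections = []
    for i, (pos, name, header) in enumerate(found):
        # A section runs from the end of its header to the start of the next header.
        start = pos + len(header)
        end = found[i + 1][0] if i + 1 < len(found) else len(text)
        body = text[start:end].strip()
        metadata = {"incident_id": report_id, "section_type": name}
        sections.append((body, metadata, f"{report_id}_{name}"))
    return sections
```

Building the document ID from the incident ID and the section type mirrors the IDs in the output above (for example `AV-SEC-2025-002_description`), so each section can be stored and retrieved independently.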

## 4. Embedding Generation

Convert the parsed text to a vector embedding using the Google embedding model set in the configuration (`models/text-embedding-004`, 768 dimensions):

In [6]:
# Extract just the description text
description_text = sections[0][0]  # First section is description

# Generate embedding
print("Generating embedding for description...")
embeddings = embed_documents([description_text], config.embedding_model)
embedding_vector = embeddings[0]

print("="*80)
print("EMBEDDING DETAILS")
print("="*80)
print(f"Embedding dimension: {len(embedding_vector)}")
print(f"First 10 values: {embedding_vector[:10]}")

Generating embedding for description...
Successfully generated 1 embeddings
EMBEDDING DETAILS
Embedding dimension: 768
First 10 values: [0.06197098, -0.0036456482, -0.051925324, 0.023179723, 0.010073917, 0.053349476, 0.051688734, 0.011818324, 0.035564896, 0.00312003]
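
Embeddings become useful once two of them are compared. The sketch below computes cosine similarity between two vectors in plain Python. It assumes a cosine metric, which is a common choice for text embeddings; the metric configured on the `incident-reports` index is not shown in this notebook, and `cosine_similarity` is an illustrative helper rather than part of `data_handling`.

```python
import math


def cosine_similarity(a, b):
    """Return the cosine of the angle between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same dimension")
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    # A zero vector has no direction, so treat its similarity as 0.0.
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)
```

For example, `cosine_similarity(embedding_vector, other_vector)` would compare the description embedding generated above with the embedding of another section, where `other_vector` is any second 768-dimensional embedding.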


## 5. Ingest Full Dataset to Pinecone

Load all incident reports and upload them to Pinecone, organized by namespace:

In [7]:
# Create ingestion pipeline
pipeline = IngestionPipeline(config)

# Ingest all incident reports
print("Starting ingestion process...\n")
results = pipeline.ingest_incident_reports(
    directory_path="demo incident reports",
    file_pattern="*.txt"
)

# Display results
print("\n" + "="*80)
print("INGESTION SUMMARY")
print("="*80)
print(f"Total documents uploaded: {results['total_uploaded']}")
print("\nDocuments per namespace:")
for namespace, stats in results['namespaces'].items():
    print(f"  - {namespace}: {stats['uploaded']} documents")

Starting ingestion process...

[1/20] Processing: AV-SEC-2025-001.txt
  ✓ description: 267 tokens (ID: AV-SEC-2025-001_description)
  ✓ impact: 101 tokens (ID: AV-SEC-2025-001_impact)
  ✓ response: 184 tokens (ID: AV-SEC-2025-001_response)
  ✓ recommendations: 90 tokens (ID: AV-SEC-2025-001_recommendations)
  → Generated 4 documents from this report

[2/20] Processing: AV-SEC-2025-002.txt
  ✓ description: 244 tokens (ID: AV-SEC-2025-002_description)
  ✓ impact: 70 tokens (ID: AV-SEC-2025-002_impact)
  ✓ response: 159 tokens (ID: AV-SEC-2025-002_response)
  ✓ recommendations: 78 tokens (ID: AV-SEC-2025-002_recommendations)
  → Generated 4 documents from this report

[3/20] Processing: AV-SEC-2025-003.txt
  ✓ description: 211 tokens (ID: AV-SEC-2025-003_description)
  ✓ impact: 61 tokens (ID: AV-SEC-2025-003_impact)
  ✓ response: 145 tokens (ID: AV-SEC-2025-003_response)
  ✓ recommendations: 122 tokens (ID: AV-SEC-2025-003_recommendations)
  → Generated 4 documents from this report

[4/2
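
The namespace organization can be sketched without touching Pinecone. The example below groups parsed sections by their `section_type` and yields fixed-size batches, which is roughly the shape of work an ingestion step has to do before uploading. Both helpers and the batch size of 100 are assumptions made for illustration; they are not taken from `IngestionPipeline`.

```python
from collections import defaultdict


def group_by_namespace(sections):
    """Group (text, metadata, doc_id) tuples by section type, one group per namespace."""
    grouped = defaultdict(list)
    for text, metadata, doc_id in sections:
        grouped[metadata["section_type"]].append((doc_id, text, metadata))
    return dict(grouped)


def batched(items, batch_size=100):
    """Yield consecutive slices of at most batch_size items."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(items), batch_size):
        yield items[start:start + batch_size]
```

With this layout, each namespace holds one kind of section, so a query such as the one in the next section can be restricted to descriptions only.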

## 6. Query the Database

Search for similar incidents based on a query:

In [9]:
# Create a sample query
query_text = "CAN bus attack with denial of service on vehicle network"

print("="*80)
print(f"QUERY: {query_text}")
print("="*80)

# Generate query embedding
print("\nGenerating query embedding...")
query_embedding = embed_query(query_text, config.embedding_model)

# Query the description namespace
vector_store = VectorStore(config.index, config.embedding_model)
print("Searching description namespace...\n")

results = vector_store.query(
    query_vector=query_embedding,
    top_k=3,
    namespace="description",
    include_metadata=True
)

# Display results
print(f"TOP {len(results.matches)} SIMILAR INCIDENTS")

for i, match in enumerate(results.matches, 1):
    print(f"\n[{i}] Incident ID: {match.metadata.get('incident_id', 'Unknown')}")
    print(f"    Similarity Score: {match.score:.4f}")

    # Show description preview
    description = match.metadata.get('text', '')
    print("\n    Description Preview:")
    print(f"    {description[:200]}...")

QUERY: CAN bus attack with denial of service on vehicle network

Generating query embedding...
Searching description namespace...

TOP 3 SIMILAR INCIDENTS

[1] Incident ID: AV-SEC-2025-002
    Similarity Score: 0.7475

    Description Preview:
    A high-severity event was triggered on vehicle AV-991X, a long-haul autonomous truck, while it was operating on Interstate 80. Our network monitoring agent on the primary Controller Area Network (CAN)...

[2] Incident ID: AV-SEC-2025-008
    Similarity Score: 0.6803

    Description Preview:
    Our cloud analytics platform, which monitors telemetry data from the entire fleet, flagged a series of impossible state transitions for vehicle AV-202J. The vehicle, which was parked and powered down ...

[3] Incident ID: AV-SEC-2025-015
    Similarity Score: 0.6495

    Description Preview:
    Vehicle AV-101M, an autonomous haul truck operating in a remote mining complex, came to an emergency stop after its primary sensor fusion unit lost communicat
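
Not every match in a top-k result is equally relevant, as the drop in scores above suggests. One possible post-processing step is to keep only matches above a minimum score. The sketch below uses a small `Match` dataclass as a stand-in for the objects in `results.matches`; it models only the `score` and `metadata` attributes used in the cell above. The `filter_matches` helper and the 0.65 threshold are hypothetical and would need tuning on real data.

```python
from dataclasses import dataclass, field


@dataclass
class Match:
    """Stand-in for a query match; only the attributes used in this notebook."""
    score: float
    metadata: dict = field(default_factory=dict)


def filter_matches(matches, min_score=0.65):
    """Keep matches scoring at least min_score, highest score first."""
    kept = [m for m in matches if m.score >= min_score]
    return sorted(kept, key=lambda m: m.score, reverse=True)


# Scores and incident IDs copied from the query output above.
demo = [
    Match(0.7475, {"incident_id": "AV-SEC-2025-002"}),
    Match(0.6803, {"incident_id": "AV-SEC-2025-008"}),
    Match(0.6495, {"incident_id": "AV-SEC-2025-015"}),
]
relevant = filter_matches(demo)
```

With the default threshold, `relevant` keeps the first two incidents and drops AV-SEC-2025-015, whose score of 0.6495 falls just below 0.65.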