# OpenTitan RAG-based SystemVerilog Assertion Generator

This notebook examines a system that combines web scraping, semantic search, and large language models to automatically generate SystemVerilog Assertions (SVA) for hardware IP verification.

### Learning Objectives
- Understand how Retrieval-Augmented Generation (RAG) applies to hardware verification
- Examine SystemVerilog Assertion generation using LLMs
- Analyze web scraping and document processing for hardware documentation
- Study semantic search with vector embeddings

### Prerequisites
- Basic understanding of Python programming
- Familiarity with hardware design and verification concepts
- Knowledge of SystemVerilog (helpful but not required)

## 1. Environment Setup

This section imports the standard-library modules used throughout and clones the repository for analysis.

In [None]:
import sys
import os

In [None]:
# Clone the OpenTitan RAG SVAGEN repository
import subprocess

repo_url = "https://github.com/AnandMenon12/OpenTitan_RAG_SVAGEN.git"
repo_name = "OpenTitan_RAG_SVAGEN"

# Check if repository already exists
if not os.path.exists(repo_name):
    print(f"Cloning repository from {repo_url}...")
    result = subprocess.run(["git", "clone", repo_url], capture_output=True, text=True)
    
    if result.returncode == 0:
        print("Repository cloned successfully")
    else:
        # Stop here: the os.chdir() call below would fail without the checkout
        raise RuntimeError(f"Error cloning repository: {result.stderr}")
else:
    print("Repository already exists")

# Change to the repository directory
os.chdir(repo_name)
print(f"Changed to directory: {os.getcwd()}")

# List the contents of the repository
print("\nRepository contents:")
for item in os.listdir('.'):
    if os.path.isfile(item):
        size = os.path.getsize(item)
        print(f"  {item} ({size} bytes)")
    else:
        print(f"  {item}/")

Cloning repository from https://github.com/AnandMenon12/OpenTitan_RAG_SVAGEN.git...
Repository cloned successfully!
Changed to directory: /home/eng/a/axm200085/Transformer_annotated/CollabTest/OpenTitan_RAG_SVAGEN/OpenTitan_RAG_SVAGEN

Repository contents:
  .git/
  .gitignore (585 bytes)
  README.md (3586 bytes)
  opentitan_sva_generator.py (20254 bytes)
  requirements.txt (208 bytes)


## 2. Code Structure Analysis

The following analysis examines the main components and dependencies of the OpenTitan RAG SVAGEN system.

In [5]:
# First, let's examine the requirements.txt file if it exists
if os.path.exists('requirements.txt'):
    print("Requirements from requirements.txt:")
    with open('requirements.txt', 'r') as f:
        requirements = f.read()
        print(requirements)
else:
    print("No requirements.txt found. We'll install dependencies manually.")

Requirements from requirements.txt:
# Requirements for OpenTitan SVA Generator
torch>=2.0.0
transformers>=4.35.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
numpy>=1.21.0
markdown>=3.4.0
beautifulsoup4>=4.11.0
hjson>=3.1.0
accelerate>=0.21.0



In [6]:
# Install the core dependencies
# Note: In practice, you would install from requirements.txt, but we'll install manually for educational purposes

dependencies = [
    "torch",                  # PyTorch for deep learning
    "transformers",           # Hugging Face transformers for LLMs
    "sentence-transformers",  # For text embeddings
    "faiss-cpu",              # Facebook AI Similarity Search
    "beautifulsoup4",         # HTML/XML parsing
    "requests",               # HTTP requests
    "numpy",                  # Numerical computing
    "markdown",               # Markdown processing
    "hjson",                  # Human JSON for hardware configs
    "lxml",                   # XML processing
]

print("Installing dependencies...")
for dep in dependencies:
    print(f"Installing {dep}...")
    result = subprocess.run([sys.executable, "-m", "pip", "install", dep], 
                          capture_output=True, text=True)
    
    if result.returncode == 0:
        print(f"  {dep} installed successfully")
    else:
        print(f"  Error installing {dep}: {result.stderr[:200]}...")

print("\nDependency installation complete!")

Installing dependencies...
Installing torch...
  torch installed successfully
Installing transformers...
  transformers installed successfully
Installing sentence-transformers...
  sentence-transformers installed successfully
Installing faiss-cpu...
  faiss-cpu installed successfully
Installing beautifulsoup4...
  beautifulsoup4 installed successfully
Installing requests...
  requests installed successfully
Installing numpy...
  numpy installed successfully
Installing markdown...
  markdown installed successfully
Installing hjson...
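The manual dependency list and `requirements.txt` are not identical, and the difference is easy to miss by eye. The sketch below compares the two using only the standard library. The helper name `requirement_names` is hypothetical, and the parsing is deliberately simplified: it assumes one plain requirement per line and does not handle URLs, `-r` includes, or other pip features.

```python
import re

def requirement_names(requirements_text: str) -> set[str]:
    """Return normalized package names from requirements.txt-style text."""
    names = set()
    for line in requirements_text.splitlines():
        line = line.split("#", 1)[0].strip()  # drop comments and whitespace
        if not line:
            continue
        # The name is everything before the first version operator or extras marker.
        name = re.split(r"[<>=!~\[;\s]", line, maxsplit=1)[0]
        names.add(name.lower().replace("_", "-"))
    return names

# Contents of requirements.txt as printed earlier in this notebook.
requirements_text = """\
# Requirements for OpenTitan SVA Generator
torch>=2.0.0
transformers>=4.35.0
sentence-transformers>=2.2.2
faiss-cpu>=1.7.4
numpy>=1.21.0
markdown>=3.4.0
beautifulsoup4>=4.11.0
hjson>=3.1.0
accelerate>=0.21.0
"""

# The hand-written list from the installation cell.
manual = {"torch", "transformers", "sentence-transformers", "faiss-cpu",
          "beautifulsoup4", "requests", "numpy", "markdown", "hjson", "lxml"}

pinned = requirement_names(requirements_text)
missing_from_manual = sorted(pinned - manual)
extra_in_manual = sorted(manual - pinned)
print("In requirements.txt only:", missing_from_manual)
print("In manual list only:", extra_in_manual)
```

For the two lists shown in this notebook, `accelerate` appears only in `requirements.txt`, while `requests` and `lxml` appear only in the manual list. The manual installation also ignores the minimum versions pinned in the file.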

## 3. Data Structures Analysis

The system employs several dataclasses to represent different types of documentation elements. Their annotated fields document the expected types (Python does not enforce them at runtime) and give each kind of record a clear interface.

In [None]:
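# Extract and analyze dataclasses from the source code
import re

# Load the generator source once; later cells reuse `source_code`
with open('opentitan_sva_generator.py', 'r') as f:
    source_code = f.read()

dataclasses = []
lines = source_code.split('\n')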

# Find dataclass definitions
for i, line in enumerate(lines):
    if line.strip().startswith('@dataclass'):
        class_line = lines[i+1]
        class_match = re.search(r'class (\w+)', class_line)
        if class_match:
            dataclasses.append(class_match.group(1))

print("Identified Dataclasses:")
for dc in dataclasses:
    print(f"- {dc}")

# Extract the complete dataclass code for analysis
print("\nDataclass Implementations:")
print("-" * 40)

# DocumentChunk dataclass
dc_match = re.search(r'@dataclass\nclass DocumentChunk:.*?(?=@dataclass|\nclass [A-Z]|\Z)', source_code, re.DOTALL)
if dc_match:
    print("DocumentChunk:")
    print(dc_match.group(0)[:300] + "...")

# RTLSignal dataclass  
rtl_match = re.search(r'@dataclass\nclass RTLSignal:.*?(?=@dataclass|\nclass [A-Z]|\Z)', source_code, re.DOTALL)
if rtl_match:
    print("\nRTLSignal:")
    print(rtl_match.group(0)[:300] + "...")

# RegisterField dataclass
reg_match = re.search(r'@dataclass\nclass RegisterField:.*?(?=@dataclass|\nclass [A-Z]|\Z)', source_code, re.DOTALL)
if reg_match:
    print("\nRegisterField:")
    print(reg_match.group(0)[:300] + "...")

Source code statistics:
  - Total lines: 525
  - Total characters: 20254
  - File size: 20254 bytes

Classes defined: ['DocumentChunk', 'RTLSignal', 'RegisterField', 'OpenTitanIngester', 'EmbeddingManager', 'SVAGenerator', 'OpenTitanSVASystem']
Data classes: ['DocumentChunk', 'RTLSignal', 'RegisterField']
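The regex-based extraction above depends on the exact layout of the file, for example `@dataclass` sitting on the line directly above `class`. As an alternative, the sketch below walks the syntax tree with the standard `ast` module. The function name `find_dataclasses` is hypothetical, and the sample source is an abbreviated stand-in for the repository's `DocumentChunk`, not the full definition. It assumes Python 3.9 or later for `ast.unparse`.

```python
import ast

def find_dataclasses(source: str) -> dict[str, list[str]]:
    """Map each @dataclass-decorated class name to its annotated field names."""
    found = {}
    for node in ast.walk(ast.parse(source)):
        if not isinstance(node, ast.ClassDef):
            continue
        # Reduce each decorator to its bare name, dropping any call arguments.
        decorators = {ast.unparse(d).split("(")[0] for d in node.decorator_list}
        if decorators & {"dataclass", "dataclasses.dataclass"}:
            found[node.name] = [
                stmt.target.id
                for stmt in node.body
                if isinstance(stmt, ast.AnnAssign) and isinstance(stmt.target, ast.Name)
            ]
    return found

# Abbreviated sample; the real file defines more fields and more classes.
sample = '''
from dataclasses import dataclass
from typing import List

@dataclass
class DocumentChunk:
    """Represents a chunk of documentation with metadata"""
    content: str
    ip_name: str
    signal_refs: List[str]

class NotADataclass:
    pass
'''
result = find_dataclasses(sample)
print(result)
```

Passing `source_code` instead of `sample` would apply the same walk to the real file. Because the parser handles decorators with arguments, blank lines, and comments, this approach is less brittle than matching text patterns.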


## 4. OpenTitanIngester Class

The OpenTitanIngester class handles web scraping of OpenTitan documentation. It processes web pages to extract relevant documentation chunks for the RAG system.
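To make the idea of a "documentation chunk" concrete, here is a minimal sketch that splits Markdown text into one chunk per heading. It is an illustration under stated assumptions, not the repository's chunking logic, which this notebook does not show. `Chunk` and `split_by_headings` are hypothetical names, and the sample text is made up for the example.

```python
import re
from dataclasses import dataclass

@dataclass
class Chunk:  # hypothetical, simplified stand-in for the repository's DocumentChunk
    section: str
    content: str

def split_by_headings(markdown_text: str, default_section: str = "Preamble") -> list[Chunk]:
    """Group the lines under each Markdown heading into one chunk."""
    chunks = []
    section, buffer = default_section, []
    for line in markdown_text.splitlines():
        heading = re.match(r"^#{1,6}\s+(.*\S)\s*$", line)
        if heading:
            # A new heading closes the previous section, if it had any text.
            body = "\n".join(buffer).strip()
            if body:
                chunks.append(Chunk(section, body))
            section, buffer = heading.group(1), []
        else:
            buffer.append(line)
    body = "\n".join(buffer).strip()
    if body:
        chunks.append(Chunk(section, body))
    return chunks

# Illustrative text only; not copied from the OpenTitan documentation.
doc = """# Overview
First paragraph of the overview.

## Registers
Description of the register map.
"""
chunks = split_by_headings(doc)
for c in chunks:
    print(f"[{c.section}] {c.content}")
```

Keeping the heading as the `section` label matters later: the embedding step shown in this notebook prefixes each chunk's text with its section name. Note that this simple splitter would also treat a `#` line inside a fenced code block as a heading.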

In [None]:
# Analyze the OpenTitanIngester class
ingester_match = re.search(r'class OpenTitanIngester:.*?(?=class [A-Z]|\Z)', source_code, re.DOTALL)

if ingester_match:
    ingester_code = ingester_match.group(0)
    
    print("OpenTitanIngester Class Analysis:")
    print("-" * 35)
    
    # Find methods
    methods = re.findall(r'def (\w+)\(', ingester_code)
    print("Methods:")
    for method in methods:
        print(f"- {method}")
    
    # Show class structure
    lines = ingester_code.split('\n')
    print(f"\nClass structure ({len(lines)} lines of code)")
    
    # Show the __init__ method
    init_match = re.search(r'def __init__\(.*?\):(.*?)(?=def|\Z)', ingester_code, re.DOTALL)
    if init_match:
        print("\nInitialization parameters:")
        init_lines = init_match.group(0).split('\n')[:10]
        for line in init_lines:
            if line.strip():
                print(f"  {line.strip()}")
    
    # Show scraping method signature
    scrape_match = re.search(r'def scrape_documentation\([^)]*\):', ingester_code)
    if scrape_match:
        print(f"\nMain scraping method: {scrape_match.group(0)}")
else:
    print("OpenTitanIngester class not found in source code")

Examining the key data structures...

import os
import re
import json
import pickle
from pathlib import Path
from typing import Dict, List, Tuple, Optional, Any
from dataclasses import dataclass
from datetime import datetime

import requests
from bs4 import BeautifulSoup
import markdown
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch

@dataclass
class DocumentChunk:
    """Represents a chunk of documentation with metadata"""
    content: str
    ip_name: str
    file_path: str
    section: str
    signal_refs: List[str]
    fsm_state_refs: List[str]
    chunk_type: str  # 'doc', 'register', 'rtl'
    metadata: Dict[str, Any]

@dataclass
class RTLSignal:
    """Represents an RTL signal or port"""
    name: str
    width: int
    direct

## 5. EmbeddingManager Class

The EmbeddingManager handles vector embeddings and semantic search. It builds and manages FAISS indexes, one per IP, for efficient similarity search.
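The `EmbeddingManager` source printed in this notebook encodes text with `normalize_embeddings=True`. For unit-length vectors, the inner product equals cosine similarity, so ranking by inner product ranks by cosine. The sketch below shows that ranking in plain Python on toy three-dimensional vectors. It uses neither FAISS nor a real embedding model, and `normalize` and `top_k` are hypothetical helper names.

```python
import math

def normalize(vec):
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def top_k(query, corpus, k=2):
    """Return (index, score) pairs for the k rows with the highest inner product."""
    q = normalize(query)
    scores = [
        (i, sum(a * b for a, b in zip(q, normalize(row))))
        for i, row in enumerate(corpus)
    ]
    return sorted(scores, key=lambda pair: pair[1], reverse=True)[:k]

# Toy vectors standing in for chunk embeddings.
corpus = [
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [0.7, 0.7, 0.0],
]
hits = top_k([1.0, 0.1, 0.0], corpus, k=2)
print(hits)
```

The query points mostly along the first axis, so row 0 ranks first and the diagonal row 2 ranks second. A vector index performs the same ranking but avoids rescoring every row in Python.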

In [None]:
# Analyze the EmbeddingManager class
embedding_match = re.search(r'class EmbeddingManager:.*?(?=class [A-Z]|\Z)', source_code, re.DOTALL)

if embedding_match:
    embedding_code = embedding_match.group(0)
    
    print("EmbeddingManager Class Analysis:")
    print("-" * 32)
    
    # Find methods
    methods = re.findall(r'def (\w+)\(', embedding_code)
    print("Methods:")
    for method in methods:
        print(f"- {method}")
    
    # Check for model initialization
    model_match = re.search(r'SentenceTransformer\([^)]*\)', embedding_code)
    if model_match:
        print(f"\nEmbedding model: {model_match.group(0)}")
    
    # Check for FAISS usage
    if 'faiss' in embedding_code:
        print("Uses FAISS for vector indexing")
        
    # Show search method
    search_match = re.search(r'def search\([^)]*\):(.*?)(?=def|\Z)', embedding_code, re.DOTALL)
    if search_match:
        print("\nSearch method implementation found")
        
else:
    print("EmbeddingManager class not found in source code")

OpenTitanIngester class structure:

Methods found: ['__init__']

class OpenTitanIngester:
    """Handles ingestion of OpenTitan documentation and RTL"""

    def __init__(self, opentitan_root: str):
        self.opentitan_root = Path(opentitan_root)
        self.target_ips = ['uart', 'i2c', 'kmac', 'lc_ctrl', 'otbn', 'sysrst_ctrl']


## 6. SVAGenerator Class

The SVAGenerator class integrates with language models to generate SystemVerilog assertions. It processes RTL signal descriptions and documentation context to produce relevant assertions.
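In a RAG setup, the retrieved documentation has to be combined with the user's query before it reaches the model. The sketch below shows one simple way to assemble such a prompt under a character budget. `build_prompt`, the `[Context n]` labels, and the `max_chars` parameter are all assumptions for illustration; the repository's actual prompt template is not shown in this notebook.

```python
def build_prompt(query: str, context_chunks: list[str], max_chars: int = 2000) -> str:
    """Join retrieved context and the user query into one prompt string."""
    selected, used = [], 0
    for chunk in context_chunks:  # assumed to be ordered best-first
        if used + len(chunk) > max_chars:
            break  # stop before the budget is exceeded
        selected.append(chunk)
        used += len(chunk)
    context = "\n\n".join(f"[Context {i + 1}]\n{c}" for i, c in enumerate(selected))
    return f"{context}\n\n[Query]\n{query}\n\n[Answer]\n"

prompt = build_prompt(
    "Generate SVA for the I2C start condition",
    ["Placeholder documentation excerpt about FDATA.START.", "x" * 5000],
    max_chars=200,
)
print(prompt)
```

The oversized second chunk is dropped, so only the first excerpt reaches the prompt. A production system would more likely budget in tokens than in characters, since the model's context limit is measured in tokens.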

In [None]:
# Analyze the SVAGenerator class
sva_match = re.search(r'class SVAGenerator:.*?(?=class [A-Z]|\Z)', source_code, re.DOTALL)

if sva_match:
    sva_code = sva_match.group(0)
    
    print("SVAGenerator Class Analysis:")
    print("-" * 27)
    
    # Find methods
    methods = re.findall(r'def (\w+)\(', sva_code)
    print("Methods:")
    for method in methods:
        print(f"- {method}")
    
    # Check for model configuration
    model_match = re.search(r'model_name\s*=\s*["\']([^"\']+)', sva_code)
    if model_match:
        print(f"\nDefault model: {model_match.group(1)}")
    
    # Find system prompt
    system_prompt_match = re.search(r'system_prompt\s*=\s*["\']([^"\']+)', sva_code)
    if system_prompt_match:
        print("System prompt configuration found")
        
    # Check for assertion generation method
    if 'generate_assertion' in sva_code:
        print("Assertion generation method implemented")
        
else:
    print("SVAGenerator class not found in source code")

EmbeddingManager class analysis:

Methods: ['__init__', 'build_faiss_index']

class EmbeddingManager:
    """Manages embeddings and FAISS indexes"""

    def __init__(self, model_name: str = "BAAI/bge-base-en-v1.5"):
        self.model = SentenceTransformer(model_name)
        self.indexes = {}  # ip_name -> faiss index
        self.chunks = {}   # ip_name -> list of chunks

    def create_embeddings(self, chunks: List[DocumentChunk]) -> np.ndarray:
        """Create embeddings for document chunks"""
        if not chunks:
            return np.array([])
        texts = [f"{chunk.section}: {chunk.content}" for chunk in chunks]
        embeddings = self.model.encode(texts, normalize_embeddings=True)
        return embeddings

    def build_faiss_index(self, ip_name: str, chunks: List[DocumentChunk]):

Key Insights:
  - Uses BAAI/bge-base-en-v1.5 model for embeddings
  - FAISS enables fast similarity sear

## 7. OpenTitanSVASystem Integration

The OpenTitanSVASystem class coordinates all components to provide a complete assertion generation pipeline. It manages the workflow from documentation ingestion to assertion output.
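The coordination described above can be pictured as three stages wired in sequence: ingest, retrieve, generate. The sketch below models that control flow with injected callables so it runs without models or network access. `Pipeline` and its field names are hypothetical and do not mirror the real `OpenTitanSVASystem` interface.

```python
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Pipeline:
    """Hypothetical coordinator: each stage is an injected callable."""
    ingest: Callable[[str], List[str]]               # ip_name -> documentation chunks
    retrieve: Callable[[str, List[str]], List[str]]  # query, chunks -> relevant chunks
    generate: Callable[[str, List[str]], str]        # query, context -> assertion text

    def run(self, ip_name: str, query: str) -> str:
        chunks = self.ingest(ip_name)
        context = self.retrieve(query, chunks)
        return self.generate(query, context)

# Stub stages so the control flow can be exercised in isolation.
pipeline = Pipeline(
    ingest=lambda ip: [f"{ip}: chunk about start", f"{ip}: chunk about stop"],
    retrieve=lambda q, chunks: [c for c in chunks if q.split()[-1] in c],
    generate=lambda q, ctx: f"// {len(ctx)} context chunk(s) used for: {q}",
)
output = pipeline.run("i2c", "assert start")
print(output)
```

Injecting the stages keeps each one replaceable: the stubs here could be swapped for a real ingester, index, and language model without changing `run`.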

In [None]:
# Analyze the main OpenTitanSVASystem class
main_match = re.search(r'class OpenTitanSVASystem:.*?(?=if __name__|\Z)', source_code, re.DOTALL)

if main_match:
    main_code = main_match.group(0)
    
    print("OpenTitanSVASystem Class Analysis:")
    print("-" * 34)
    
    # Find methods
    methods = re.findall(r'def (\w+)\(', main_code)
    print("Methods:")
    for method in methods:
        print(f"- {method}")
    
    # Check component integration
    if 'OpenTitanIngester' in main_code and 'EmbeddingManager' in main_code and 'SVAGenerator' in main_code:
        print("\nIntegrates all three main components:")
        print("- OpenTitanIngester")
        print("- EmbeddingManager") 
        print("- SVAGenerator")
    
    # Show workflow methods
    workflow_methods = [m for m in methods if any(keyword in m.lower() for keyword in ['process', 'generate', 'run'])]
    if workflow_methods:
        print(f"\nWorkflow methods: {', '.join(workflow_methods)}")
        
else:
    print("OpenTitanSVASystem class not found in source code")

SVAGenerator class analysis:

Methods: ['__init__', 'load_model']

System Prompt (key instructions to the LLM):

You are a world-class expert in SystemVerilog Assertions (SVA) for semiconductor IP verification. Your task is to generate precise, high-quality SVA properties based on the provided context.

Follow these rules strictly:
1.  **Property Block:** Enclose every assertion in a named `property` block.
2.  **Assertion:** Follow each property with a corresponding `assert property` statement.
3.  **Clocking:** Use `@(posedge clk_i)` for clocking.
4.  **Reset:** Use `disable iff (!rst_ni)` for asynchronous reset.
5.  **Comments:** Add a brief, insightful comment above each property explaining its purpose.
6.  **No Placeholders:** Do not use placeholder signals. Only use signals found in the context.
7.  **Focus:** Generate properties directly related to the user's query and the provided context.

E...

Prompt Engineering Insights:
  - Establishes expert persona for the LLM
  - Provid
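Several of the system prompt's rules are mechanical enough to check on the model's output. The sketch below tests generated text against rules 1 to 4 with regular expressions. It is a textual lint under simplifying assumptions, not an SVA parser, so passing it does not mean the assertions compile or are correct. `RULES` and `check_rules` are hypothetical names, and `req_i` and `ack_o` in the sample are illustrative signals.

```python
import re

# One pattern per mechanical rule from the system prompt.
RULES = {
    "named property block": r"\bproperty\s+\w+\s*;",
    "assert property statement": r"\bassert\s+property\s*\(",
    "posedge clk_i clocking": r"@\(\s*posedge\s+clk_i\s*\)",
    "disable iff (!rst_ni)": r"disable\s+iff\s*\(\s*!\s*rst_ni\s*\)",
}

def check_rules(sva_text: str) -> dict[str, bool]:
    """Report which rules have at least one match in the generated text."""
    return {name: bool(re.search(pattern, sva_text)) for name, pattern in RULES.items()}

sample = """
// Property: request is eventually acknowledged (illustrative signals)
property req_gets_ack;
  @(posedge clk_i) disable iff (!rst_ni)
  req_i |-> ##[1:4] ack_o;
endproperty
assert_req_gets_ack: assert property (req_gets_ack);
"""
report = check_rules(sample)
for name, ok in report.items():
    print(f"{'PASS' if ok else 'FAIL'}  {name}")
```

A check like this can gate which outputs are passed on to a simulator or formal tool, where the real syntax and semantic checks take place.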

## 8. Practical Demonstration

This section demonstrates the system's functionality by processing actual OpenTitan documentation and generating sample assertions.
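The cell that follows loads the generator with `exec`, which runs the script in the notebook's own namespace. There `__name__` is `"__main__"`, so any code behind an `if __name__ == "__main__":` guard runs too. An alternative, sketched below, is to import the file by path with `importlib`, which gives the module its own name and namespace. `load_module_from_path` is a hypothetical helper, and the demonstration uses a throwaway script, not the real generator file.

```python
import importlib.util
import tempfile
from pathlib import Path

def load_module_from_path(path: str, module_name: str = "loaded_script"):
    """Import a Python file by path without running its `__main__` guard."""
    spec = importlib.util.spec_from_file_location(module_name, path)
    module = importlib.util.module_from_spec(spec)
    spec.loader.exec_module(module)
    return module

# Demonstrate with a throwaway script rather than the real generator file.
with tempfile.TemporaryDirectory() as tmp:
    script = Path(tmp) / "demo_script.py"
    script.write_text(
        "RAN_MAIN = False\n"
        "def helper():\n"
        "    return 'ok'\n"
        "if __name__ == '__main__':\n"
        "    RAN_MAIN = True\n"
    )
    demo = load_module_from_path(str(script))
    print(demo.helper(), demo.RAN_MAIN)
```

With this approach, classes are reached as attributes of the returned module, and the guarded entry point stays dormant until it is called explicitly.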

In [20]:
# First import the main components
try:
    # exec() runs the whole script in this namespace, including any __main__ block
    with open('opentitan_sva_generator.py') as f:
        exec(f.read())
    print("Successfully imported the system components!")

    # Show the supported IP blocks
    print("\nSupported IP Blocks:")
    supported_ips = ['uart', 'i2c', 'kmac', 'lc_ctrl', 'otbn', 'sysrst_ctrl']
    for ip in supported_ips:
        print(f"  - {ip.upper()}")

except Exception as e:
    print(f"Import error (expected in demo environment): {str(e)[:200]}...")


--- OpenTitan SVA Generator ---
Available IPs: uart, i2c, kmac, lc_ctrl, otbn, sysrst_ctrl

Generating SVA properties... (this may take a moment)
Loading cached data for i2c...
Loading Qwen/Qwen2-7B-Instruct...


Loading checkpoint shards: 100%|██████████| 4/4 [00:02<00:00,  1.44it/s]
huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


Model loaded successfully
Interaction logged to cache/logs/i2c_20250828_013024.json

GENERATED SVA PROPERTIES:
```systemverilog
// Property: Ensure Start Condition is generated when transmitting
property start_condition_for_transmission;
  @(posedge clk_i) disable iff (!rst_ni)
  FDATA.START |=> (fmt_flag_start_before_i || !fmt_flag_stop_after_i);
endproperty
assert_start_condition: assert property (start_condition_for_transmission);

// Property: Ensure Stop Condition is generated when reading
property stop_condition_for_reading;
  @(posedge clk_i) disable iff (!rst_ni)
  FDATA.STOP |=> (fmt_flag_stop_after_i && !fmt_flag_start_before_i);
endproperty
assert_stop_condition: assert property (stop_condition_for_reading);

// Property: Address Acknowledgment is detected within the expected timing
property address_ack_within_timing;
  @(posedge clk_i) disable iff (!rst_ni)
  AddrAck |=> #(TIMING3.THD_DAT + 1) 1'b1;
endproperty
assert_address_ack_timing: assert property (address_ack_within_