<a href="https://colab.research.google.com/github/HafizAQ/RAG-LLM-SPC-SVA/blob/main/Enhanced_VLSI_Assertion_Generation_Conforming_to_High_Level_Specifications_and_Reducing_LLM_Hallucinations_with_RAG.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

#1) Required Libraries & API Configurations

In [None]:
#Required libraries/packages
!pip install openai
!pip install langchain_openai
!pip install --upgrade --quiet langchain-text-splitters tiktoken #token-based splitting; tiktoken is OpenAI's tokenizer, used here to count tokens when chunking
!pip install langchain
!pip install langchain_experimental
!pip install -U langchain-chroma
!pip install langchain_community
!pip install nltk scikit-learn #used by the preprocessing cell in 3.(i)

#2)LLMs Agents

##### 2.(i) OpenAI Chat Completion API: input messages (system, user) and output response: llm_call_cc1(sys,usr)

In [None]:
#Prompt engineering via the Chat Completion API, using the economical gpt-3.5-turbo model
from openai import OpenAI
client = OpenAI(api_key="sk-xxxxxxxxxxxxxxxxxxxxxxxxxx") #placeholder: replace with your own key and avoid committing real keys

def llm_call_cc1(in_system_insts, in_user_req):
  #Send one system + user message pair and return the full streamed reply as a single string
  output_txt=""
  stream = client.chat.completions.create(
    # model="gpt-4",
    model="gpt-3.5-turbo-0125",
    messages=[
      {"role": "system", "content": in_system_insts},
      {"role": "user", "content": in_user_req}
    ],
    stream=True,
    temperature=0.3)
  for part in stream:
    ot=part.choices[0].delta.content or "" #delta.content is None on chunks that carry no text
    output_txt=output_txt+ot
  return output_txt
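
The streaming loop in `llm_call_cc1` can be tried without an API key. The following is a minimal sketch, assuming stand-in objects (`fake_chunk` and `collect_stream` are hypothetical names, not part of the OpenAI library) that mimic only the attributes the loop reads:

```python
from types import SimpleNamespace


def collect_stream(stream):
    """Concatenate the text deltas of a streamed chat completion."""
    pieces = []
    for part in stream:
        if not part.choices:
            # Skip chunks without choices instead of indexing into an empty list.
            continue
        # delta.content can be None, so fall back to an empty string.
        pieces.append(part.choices[0].delta.content or "")
    return "".join(pieces)


def fake_chunk(text):
    # Hypothetical stand-in: carries only .choices[0].delta.content.
    delta = SimpleNamespace(content=text)
    return SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


fake_stream = [fake_chunk("assert "), fake_chunk("property"), fake_chunk(None)]
print(collect_stream(fake_stream))  # assert property
```

Collecting the pieces in a list and joining once avoids repeated string concatenation, which matters only for long replies.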

##### 2.(ii) Role: Specification Extractor from Chunk: chunk_spec(chunk)

In [None]:
#LLM-1: instructions for the Chat Completion API of GPT (extracts signal information from one chunk)
def chunk_spec(chunk):
  in_system_insts = "You are an expert VLSI specification analyzer tasked with extracting signal information (Write address channels, Write data channels, Write response channels, Read address channels, Read data channels, Global/system signals) from the provided specification '{chunk}' to support SystemVerilog Assertion (SVA) generation. Each task is independent of the others. \n\n Task Instructions: \n\t 1. Objective: Extract signal details from **'{chunk}'** in the following structure: \n\t\t [Signal Name]: Name of the signal. \n\t\t [Signal Description]: Detailed description (e.g., type, width, purpose).\n\t\t [Signal Functionality]: Explanation of the signal's role (e.g., data management, control flow). \n\t 2. Guidelines: \n\t\t Only use information from **'{chunk}'**.\n\t\t Do not make assumptions or add external information.\n\t\t If details are missing, leave them as is without assumptions.\n\t 3. Purpose: Ensure the extracted signal information is accurate and clearly structured to assist in SystemVerilog Assertion (SVA) generation."

  in_user_req = "Here is the text **'{chunk}'**:\n'{"+chunk+"}'\n\nPlease extract and list all relevant signal information for specifications from this text **'{chunk}'**, ensuring the output follows the format provided in the System Role prompt."

  spec_text=llm_call_cc1(in_system_insts,in_user_req) #2.(i) GPT model Chat Completion API; local name no longer shadows the function name
  return spec_text
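
Because the prompt strings above are plain strings rather than f-strings, `'{chunk}'` is sent to the model literally as a label, and the real chunk text is appended by concatenation. A minimal sketch of that pattern, using a shortened prompt and a hypothetical helper name `build_user_request`:

```python
def build_user_request(chunk):
    # '{chunk}' stays a literal label; the actual text is concatenated
    # between a pair of braces, as in chunk_spec above.
    return "Here is the text **'{chunk}'**:\n'{" + chunk + "}'"


req = build_user_request("AWVALID indicates a valid write address.")
print(req)
```

If substitution of `{chunk}` inside the system prompt were intended, an f-string or `str.format` would be needed; the notebook does not do this, so the label reading is assumed here.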

##### 2.(iii) Role: Signal Mapper: signal_map(hlsf,hdlImp)

In [None]:
#Instruction for Chat Completion API of GPT
def signal_map(hlsf,hdlImp):
  in_system_insts = "You are an intelligent signal mapping tool that links signal specifications to their definitions in HDL (SystemVerilog) code. Your task is to update the signal names in **'{spec}'** with their corresponding names in the HDL code **{'hdl'}** if corresponding signal names exist in **{'hdl'}**, ensuring accuracy for SystemVerilog Assertion (SVA) generation. Also extract both System Clock signals and System Reset Signal from **{'hdl'}** code and add into the updated **'{spec}'**. \n\n Task Overview: \n\t Input 1: Signal specifications from **'{spec}'**. \n\t Input 2: HDL code from **{'hdl'}**. \n\n Task Instructions: \n\t 1. Signal Name: Replace signal names in **'{spec}'** with their corresponding names from **{'hdl'}** (if available). If a signal is not found in **{'hdl'}**, leave it unchanged. \n\t 2. Signal Description & Functionality: Do not modify; retain as provided in **'{spec}'**. \n\t 3. System Signals (Mandatory): Always extract both System Reset and System Clock signal names from **{'hdl'}** code and add into the updated **'{spec}'**. \n\t 4. Signal Name Identification and adding to the updated **'{spec}'**: Find signal names from **'{spec}'** that correspond to signal names of bit-vector or buffer type logic in **{'hdl'}** and add into the updated **'{spec}'**. These are signals where the $isunknown() function is applicable (i.e., for detecting unknown values in bit-vector or buffer signals). \n\n Task Independence: \n\t Treat each task independently without modifying the HDL code **{'hdl'}**. \n\n Output: \n\t Final Output: Provide an updated version of **'{spec}'** that includes: \n\t Global signals (System Reset and System Clock) \n\n Signal name updates (where applicable from **{'hdl'}**),\n\t Bit-vector signals where $isunknown() applies.\n\n Ensure that the output is optimized for SVA generation with accurate signal mappings from **{'hdl'}**."


  in_user_req = "User Role: Here is the list of specifications-related information from the chunk \n**'{spec}'**:\n\n{"+hlsf+"}\n\nAnd here is the HDL code \n**{'hdl'}**:\n\n{"+hdlImp+"}\n\nPlease replace the keywords in the specification list with the relevant keywords/signal names from the HDL code and provide the edited specifications."

  hlspecs=llm_call_cc1(in_system_insts,in_user_req) #2.(i) GPT model Completion API
  return hlspecs

# 3) A: RAG-Part

##### 3.(i) Preprocess the text file using NLP (NLTK) and Sklearn: preprocess_text(text)

In [None]:
import re
import nltk
from nltk.tokenize import sent_tokenize, word_tokenize
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# Download required NLTK resources (newer NLTK releases may also need 'punkt_tab')
nltk.download('punkt')

def preprocess_text(text):
    # Remove special characters but keep punctuation
    text = re.sub(r'[^a-zA-Z0-9\s.,;:!?\'\"()-]', '', text)
    # Remove excessive whitespace
    text = re.sub(r'\s+', ' ', text).strip()
    return text

def remove_semantic_repetition(sentences, threshold=0.7):
    # Build the TF-IDF matrix (one row per sentence)
    tfidf_matrix = TfidfVectorizer().fit_transform(sentences)
    vectors = tfidf_matrix.toarray()

    # Calculate pairwise cosine similarity
    similarity_matrix = cosine_similarity(vectors)

    # Keep a sentence only if its similarity to every earlier sentence is below the threshold
    unique_sentences = []
    for i, sentence in enumerate(sentences):
        if all(similarity_matrix[i][j] < threshold for j in range(i)):
            unique_sentences.append(sentence)

    return unique_sentences

def preprocess(input_text):
    # input_text is the document content as a string, not a file path;
    # read the file beforehand, e.g. open(path, 'r', encoding='utf-8').read()

    # Preprocess text
    text = preprocess_text(input_text)

    # Tokenize text into sentences
    sentences = sent_tokenize(text)
    #print(sentences)

    # Remove semantic repetitions
    unique_sentences = remove_semantic_repetition(sentences)

    # Join unique sentences back to a single string
    result_text = ' '.join(unique_sentences)

    return result_text
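
The threshold logic of `remove_semantic_repetition` can be illustrated without NLTK or scikit-learn. This is a simplified sketch, not the notebook's method: it swaps TF-IDF for raw word counts and compares each sentence only against the sentences already kept. The names `bow`, `cosine`, and `drop_near_duplicates` are hypothetical.

```python
import math
import re
from collections import Counter


def bow(sentence):
    # Bag-of-words vector: lowercase word -> count (stand-in for TF-IDF).
    return Counter(re.findall(r"[a-z0-9]+", sentence.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def drop_near_duplicates(sentences, threshold=0.7):
    kept, kept_vecs = [], []
    for sentence in sentences:
        vec = bow(sentence)
        # Keep the sentence only if it is dissimilar to everything kept so far.
        if all(cosine(vec, other) < threshold for other in kept_vecs):
            kept.append(sentence)
            kept_vecs.append(vec)
    return kept


sentences = [
    "The master asserts AWVALID.",
    "The master asserts AWVALID.",
    "The slave asserts AWREADY.",
]
print(drop_near_duplicates(sentences))
# ['The master asserts AWVALID.', 'The slave asserts AWREADY.']
```

The third sentence shares two of four words with the first (similarity 0.5), so it stays below the 0.7 threshold and is kept.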

#####LangChain Platform Support

In [None]:
import os
import openai
os.environ['OPENAI_API_KEY'] = "sk-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx" #placeholder: replace with your own key
#Splitter, embeddings, vector database
from langchain_text_splitters import RecursiveCharacterTextSplitter #Splitter (package installed in section 1)
from langchain_text_splitters import CharacterTextSplitter #Splitter
from langchain_openai import OpenAIEmbeddings #Text embeddings in vector form
from langchain_chroma import Chroma #Vector database
import re
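
Rather than writing the key into the notebook, it can be read from an environment variable that was set outside the notebook. A minimal sketch, assuming a hypothetical helper `load_api_key` and a demo variable name `DEMO_API_KEY` used only for illustration:

```python
import os


def load_api_key(env_var="OPENAI_API_KEY"):
    """Return the key stored in env_var, or fail loudly if it is missing."""
    key = os.environ.get(env_var, "").strip()
    if not key:
        raise RuntimeError(f"{env_var} is not set")
    return key


# Demo only: a fake value under a separate variable name so the sketch runs anywhere.
os.environ.setdefault("DEMO_API_KEY", "sk-demo-not-a-real-key")
print(load_api_key("DEMO_API_KEY"))
```

Both `OpenAI()` and `OpenAIEmbeddings()` pick up `OPENAI_API_KEY` from the environment by default, so no key needs to appear in the code once the variable is set.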

#####3.(ii) Chunker: Recursive Character Splitter with tiktoken token counting (tiktoken is provided by OpenAI; sizes chunks for the context window)

In [None]:
#Chunker: RecursiveCharacterTextSplitter with a tiktoken length function, so chunk sizes are measured in tokens rather than characters
def chunker(in_str_docF):
  #chunk_size and chunk_overlap are token counts; 2100-token chunks leave room in the model's context window for the prompt text and the response
  token_splitter_rcs= RecursiveCharacterTextSplitter.from_tiktoken_encoder(chunk_size=2100, chunk_overlap=100)
  t_rcs_chunks=token_splitter_rcs.split_documents(in_str_docF) #in_str_docF: list of Document objects built from the specification file (File-F)
  return t_rcs_chunks
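
The effect of `chunk_size` and `chunk_overlap` can be seen on a toy token list. This sketch is a deliberate simplification: it cuts fixed windows, whereas `RecursiveCharacterTextSplitter` splits on separators and merges the pieces while measuring their length in tokens. `window_chunks` is a hypothetical name.

```python
def window_chunks(tokens, chunk_size, chunk_overlap):
    """Cut tokens into windows of chunk_size that share chunk_overlap tokens."""
    if chunk_overlap >= chunk_size:
        raise ValueError("chunk_overlap must be smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # the last window already reaches the end
    return chunks


tokens = list(range(10))  # stand-in for token ids
print(window_chunks(tokens, chunk_size=4, chunk_overlap=1))
# [[0, 1, 2, 3], [3, 4, 5, 6], [6, 7, 8, 9]]
```

Each chunk repeats the last token of the previous one, which is how the overlap preserves context across chunk boundaries.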

#####3.(iii) Embedding: OpenAI embedding (Vector Embedding) & Vector Store: Chroma DB (Open Source, interfaces with several technologies):
#####chromaDBSF(t_rcs_chunks)

######file_dbsF.persist()

In [None]:
#ChromaDB: saving the chunks as a knowledge representation
from langchain_chroma import Chroma #same Chroma integration as in the other cells
def chromaDBSF(t_rcs_chunks):
  file_dbsF =Chroma.from_documents(t_rcs_chunks,OpenAIEmbeddings(), persist_directory="./chroma_dbF")
  # file_dbsF.persist() #not needed: with persist_directory set, the store is written to disk automatically
  return file_dbsF

####3.(iv) RAG_main: chromaDBSF_RP(hlsF_content,hdlImpF_content): return ChromaDBF

In [None]:
#Input: high-level specification file and design (HDL) file for signal-name synchronization; output: vector DB
from langchain_core.documents import Document #wraps plain text so the splitter and ChromaDB accept it
def chromaDBSF_RP(hlsF_content,hdlImpF_content):
  hlsF_content_p=preprocess_text(hlsF_content) #3(i) preprocessing the text file
  # print("HLSF after preprocessing: ",hlsF_content_p)
  doc_HLSF_content = Document(page_content=hlsF_content_p, metadata={"User": "High Level Specs File document"}) #Convert to a Document so that the split function can be applied
  doc_HLSF_content_L=[doc_HLSF_content,] #list of document objects
  hlsf_chunks=chunker(doc_HLSF_content_L) #3(ii) returns the chunks of the specification file

  hlsf_chunks_specs=[]
  for hlsf_chunk in hlsf_chunks:
    hlsf_chunk_spec=chunk_spec(hlsf_chunk.page_content) #2(ii) documents to page content as chunk: LLM-1: Chunk-Spec
    hlsf_chunk_sig_spec=signal_map(hlsf_chunk_spec,hdlImpF_content) #2(iii) LLM-2: Chunk-Spec-Sig
    hlsf_chunk.page_content=hlsf_chunk.page_content+ "\n Relevance Specs from the chunk:\n "+ hlsf_chunk_sig_spec  #Merge chunk with its relevant Specification

    hlsf_chunks_specs.append(hlsf_chunk)

  print("after loop on chunks")
  print("HLSF chunks specs list: ",hlsf_chunks_specs)
  print("HLSF chunks specs type: ",type(hlsf_chunks_specs), "length: ", len(hlsf_chunks_specs))
  file_dbsF=chromaDBSF(hlsf_chunks_specs) #3(iii) pass chunks as documents to ChromaDB (vector database) with OpenAI embeddings
  print("HLSF ChromaDB: ",type(file_dbsF))
  return file_dbsF

# 4) B: LLM-Part

######4.(i) Specification as query to ChromaDB; returns a list of documents/chunks (giving context): spec_cdbSF(cdbSF,spcF)

In [None]:
def spec_cdbSF(cdbSF,spcF):
    q_embedding_vectorF = OpenAIEmbeddings().embed_query(spcF) #embed the specification sentence
    doc_chunksF = cdbSF.similarity_search_by_vector(q_embedding_vectorF) #alternative with scores: cdbSF.similarity_search_with_score(spcF)
    return doc_chunksF #documents retrieved for the query
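
What a similarity search by vector does can be sketched with a tiny in-memory store. The two-dimensional vectors, the chunk labels, and the names `cosine` and `top_k` are assumptions for illustration; Chroma uses its own index and distance metric, and real OpenAI embeddings have far more dimensions.

```python
import math


def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, store, k=4):
    """Return the k stored texts whose vectors are most similar to query_vec."""
    ranked = sorted(store, key=lambda item: cosine(query_vec, item[1]), reverse=True)
    return [text for text, _ in ranked[:k]]


# Toy store of (chunk text, embedding) pairs.
store = [
    ("AWVALID chunk", [1.0, 0.0]),
    ("RDATA chunk", [0.0, 1.0]),
    ("AWREADY chunk", [0.9, 0.1]),
]
print(top_k([1.0, 0.0], store, k=2))
# ['AWVALID chunk', 'AWREADY chunk']
```

The retrieved list is ordered from most to least similar, which is why the pipeline can take the first document as the closest context.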

#####4.(ii) Role: Grand Assertion Generator from Designer's side specification: designSpec_sva (contextDSF,spec)

In [None]:
#LLM-3: Instruction for Chat Completion API of GPT

def designSpec_sva(contextDSF,spec):
  in_system_insts = "You are a professional VLSI engineer skilled in SystemVerilog Assertions (SVA) and HDL. Your task is to generate SVA based on the provided specification sentence (**'spec'**) with correct formatting, simple implication structure, and readiness for HDL code execution.\n\n Task Instructions:\n\t Input 1: A specification sentence (**'spec'**) with signal names and their relationships. \n\t Input 2: A specification context text (**'spec_context'**) to align global signals (clock and reset) and other relevant details. \n\n Guidelines: \n\t i) Cross-reference signals in **'spec'** with **'spec_context'**. If signals are logic bit-vectors (e.g., address/data signals), apply $isunknown() for validity where needed. \n\t ii) Use **'spec_context'** to align signal names and global signals (clock and reset). \n\t iii) The generated SVA must follow a formal implication structure, ready for HDL use. \n\t iv) Avoid adding extra information or natural language explanations. \n\t v) Treat each task independently without user history. \n\n Output Format:\n\t [Specification from **'spec'**] \n\t [Generated SVA incorporating $isunknown() if applicable] \n\t [Context from **'spec_context'** confirming or contradicting **'spec'**] \n\n Additional Requirement: \n\t Ensure global clock and reset signals are derived from **'spec_context'**."

  in_user_req = "Here is the specification sentence\n **'spec'**:\n\n {"+spec+"}\n\nAnd here is the specification context text **'spec_context'**:\n\n{"+contextDSF+"}\n\nPlease translate the specification sentence (**'spec'**) into a SystemVerilog Assertion (SVA)."

  assert_sva=llm_call_cc1(in_system_insts,in_user_req)
  return assert_sva

#5) Assertion Generation on Colab

#####5.(i) Proposed Methodology: Combining RAG & Multi-LLM: rag_llm(hlsF_content, hdlImpF_content, spf_contentL)

In [22]:
#Main function
import re
from langchain_core.documents import Document #for creating document object from text file string to make it acceptable for ChromaDB
from langchain_chroma import Chroma

def rag_llm(hlsF_content,hdlImpF_content,spf_contentL):

  #####Either: build the knowledge base/ Chroma DB from scratch

  #cdbSF=chromaDBSF_RP(hlsF_content,hdlImpF_content) #3(iv) chunking, embedding and storing happen one after another inside this function

  #####OR: use the existing knowledge base/ Chroma DB

  cdbSF = Chroma(persist_directory="./chroma_dbF", embedding_function=OpenAIEmbeddings())
  print("Persistent ChromaDB .get(): ", cdbSF.get())

  assertionList=[]
  for spec in spf_contentL:
    DocsSpcHLS =spec_cdbSF(cdbSF,spec) #4(i) fetches suitable chunks/documents from ChromaDB, querying with the specification
    print("Docs SpecF content against spec sentence",DocsSpcHLS)
    assertsGen=designSpec_sva(DocsSpcHLS[0].page_content, spec)#4(ii) passing the top chunk and the spec, getting assertions in SVA
    assertionList.append(assertsGen)
  return assertionList
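
The loop in `rag_llm` indexes `DocsSpcHLS[0]`, which raises an `IndexError` if retrieval returns nothing, and it also sends blank lines from the specification file to the model. A minimal sketch of a guarded version follows; `retrieve` and `generate` are hypothetical stand-ins for `spec_cdbSF` and `designSpec_sva`, passed in so the loop runs without a database or an API key.

```python
def generate_assertions(specs, retrieve, generate):
    """Run retrieve -> generate per spec sentence, skipping blank lines."""
    results = []
    for spec in specs:
        spec = spec.strip()
        if not spec:
            continue  # ignore empty lines from the specification file
        docs = retrieve(spec)
        context = docs[0] if docs else ""  # fall back to no context
        results.append(generate(context, spec))
    return results


# Stub functions that only record what they were given.
def fake_retrieve(spec):
    return ["context for " + spec] if "AW" in spec else []


def fake_generate(context, spec):
    return (spec, context)


out = generate_assertions(["AWVALID rule\n", "\n", "RDATA rule\n"], fake_retrieve, fake_generate)
print(out)
# [('AWVALID rule', 'context for AWVALID rule'), ('RDATA rule', '')]
```

Whether an empty context should produce an assertion at all is a design decision; skipping the spec or logging a warning would be equally reasonable.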

#####5.(ii) Start function: reading the files, or receiving the specification file, and returning SVA assertions

In [None]:
#Upload all relevant files
#Input files: HLSF, HDLImpF, DSpec
#High Level Specification File by System Analyst: hlsF_content
# file = open("Axi4L_HLSF.txt", "r")
with open("Axi4L_HLSF_MD.md", "r") as fileS:
  hlsF_contentL=fileS.readlines()
hlsF_content = ' '.join(hlsF_contentL)

#HDL design implementation file by Designer: hdlImpF_content
with open("axi4_lite_if.sv", "r") as fileD: #HDL file for text synchronization
  hdlImpF_contentL=fileD.readlines()
hdlImpF_content = ' '.join(hdlImpF_contentL)

#Design specifications to verify the design implementation: spf_contentL (one specification sentence per line)
with open("Axi4L_Specs.txt", "r") as fileSpcD:
  spf_contentL=[line.strip() for line in fileSpcD if line.strip()] #skip blank lines so no empty query is sent

# rag_llm(hlsF_content,hdlImpF_content,spf_contentL) #5(i)
assertionList = rag_llm(hlsF_content,hdlImpF_content,spf_contentL) #5(i)

for item in assertionList:
  print(item)

#Extra

#####Vector Store Function

#####i) Preprocess the text file using NLP (NLTK) and Sklearn
#####ii) Support framework: LangChain (Support Apps and Open Source)
#####iii) Chunker: Recursive Character Splitter with tiktoken token counting (tiktoken is provided by OpenAI)
#####iv) Embedding: OpenAI embedding (Vector Embedding)
#####v) Vector Store: Chroma DB (Open Source, interfaces with several technologies)
#####vi) Load the vector database from the persistent directory

####Download Zip file

In [None]:
!zip -r /content/file.zip /content/chroma_dbF2

In [None]:
from google.colab import files
files.download("/content/file.zip")


#####PDF to Markdown conversion

In [None]:
!pip install --upgrade pymupdf

In [None]:
import pymupdf
from pymupdf_rag import to_markdown  # import Markdown converter

doc = pymupdf.open("IHI0022E_axi4_lite.pdf")  # open input PDF

# define desired pages (0-based indices): this corresponds to “-pages 1-10,15,20-N”
page_list = list(range(10)) + [14] + list(range(19, len(doc)))

# get markdown string for the selected pages
md_text = to_markdown(doc, pages=page_list)

# write markdown string to some file
with open("IHI0022E_axi4_lite.md", "w", encoding="utf-8") as output:
    output.write(md_text)
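
The page specification quoted in the comment (`1-10,15,20-N`) uses 1-based page numbers with `N` for the last page, while the page list passed to the converter is 0-based. A minimal sketch of that mapping, assuming a hypothetical helper `parse_pages` that does no range validation:

```python
def parse_pages(spec, page_count):
    """Turn a 1-based spec such as '1-10,15,20-N' into 0-based page indices."""
    pages = []
    for part in spec.split(","):
        part = part.strip()
        if "-" in part:
            first, last = part.split("-", 1)
            start = int(first)
            end = page_count if last.strip().upper() == "N" else int(last)
        else:
            start = end = page_count if part.upper() == "N" else int(part)
        # 1-based inclusive [start, end] -> 0-based range(start - 1, end)
        pages.extend(range(start - 1, end))
    return pages


print(parse_pages("1-10,15,20-N", page_count=22))
# [0, 1, 2, 3, 4, 5, 6, 7, 8, 9, 14, 19, 20, 21]
```

With a 22-page document, pages 1 to 10 become indices 0 to 9, page 15 becomes index 14, and pages 20 to 22 become indices 19 to 21.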