# Module 771764 – MSc Research Project

## StructureGPT: Multi-Model Retrieval-Augmented Generation System for UK Building Regulations using Low-Rank Adaptation and Quantization
### Student ID: 202403820 | Samuel Datubo Jaja
### MSc Artificial Intelligence & Data Science | DAIM - Data-Science Artificial Intelligence & Modelling

# Notebook 1 - GOV.UK Data Collection and Preprocessing | RAG Pipeline

In [5]:
import warnings 
warnings.filterwarnings('ignore')

## Environment Setup

In [4]:
!pip install torch

Defaulting to user installation because normal site-packages is not writeable
Collecting torch
  Downloading torch-2.7.0-cp312-cp312-win_amd64.whl.metadata (29 kB)
Collecting filelock (from torch)
  Downloading filelock-3.18.0-py3-none-any.whl.metadata (2.9 kB)
Collecting sympy>=1.13.3 (from torch)
  Downloading sympy-1.14.0-py3-none-any.whl.metadata (12 kB)
Collecting fsspec (from torch)
  Downloading fsspec-2025.3.2-py3-none-any.whl.metadata (11 kB)
Downloading torch-2.7.0-cp312-cp312-win_amd64.whl (212.5 MB)
   ---------------------------------------- 0.0/212.5 MB ? eta -:--:--
   ---------------------------------------- 0.0/212.5 MB 991.0 kB/s eta 0:03:35
   ---------------------------------------- 0.1/212.5 MB 2.1 MB/s eta 0:01:41
   ---------------------------------------- 0.3/212.5 MB 3.3 MB/s eta 0:01:05
   ---------------------------------------- 0.5/212.5 MB 3.9 MB/s eta 0:00:54
   ---------------------------------------- 1.1/212.5 MB 6.1 MB/s eta 0:00:35
   -----------------


[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [3]:
import torch
import sys
import psutil
import os
import yaml
from pathlib import Path

print("Starting environment setup...")

def check_environment():
    print("\n=== GPU Information ===")
    print(f"CUDA Available: {torch.cuda.is_available()}")
    if torch.cuda.is_available():
        print(f"GPU Device: {torch.cuda.get_device_name(0)}")
        print(f"GPU Memory: {torch.cuda.get_device_properties(0).total_memory / 1e9:.2f} GB")
    
    print("\n=== System Memory ===")
    memory = psutil.virtual_memory()
    print(f"Total: {memory.total / 1e9:.2f} GB")
    print(f"Available: {memory.available / 1e9:.2f} GB")
    
    print("\n=== Python Environment ===")
    print(f"Python Version: {sys.version}")

def create_project_structure():
    BASE_DIR = "uk_construction_bot"
    directories = [
        'data/raw/videos',
        'data/raw/documents',
        'data/processed/embeddings',
        'data/processed/summaries',
        'data/processed/evaluations',  # For RAGAS and Giskard evaluations
        'models/checkpoints',
        'config'
    ]
    
    for dir_path in directories:
        path = os.path.join(BASE_DIR, dir_path)
        os.makedirs(path, exist_ok=True)
        print(f"Created directory: {path}")
    return BASE_DIR

def create_config(base_dir):
    config = {
        'project': {
            'name': 'uk_construction_bot',
            'description': 'UK Construction Guidance Chatbot using RAG',
            'version': '1.0'
        },
        'data_collection': {
            'document_sources': [
                "https://www.gov.uk/government/collections/approved-documents"
            ],
            'unstructured_config': {
                'strategy': 'fast',
                'include_metadata': True
            }
        },
        'models': {
            'embedding': 'sentence-transformers/all-mpnet-base-v2',
        },
        'processing': {
            'chunk_size': 512,
            'chunk_overlap': 50,
            'min_chunk_length': 100,
            'summary_max_length': 150,
            'embedding_batch_size': 32
        },
        'vector_store': {
            'engine': 'chroma',
            'dimension': 768,
            'distance_metric': 'cosine'
        },
        'evaluation': {
            'ragas_metrics': [
                'faithfulness',
                'answer_relevancy',
                'context_relevancy'
            ],
            'batch_size': 5
        }
    }

    config_path = os.path.join(base_dir, 'config', 'config.yaml')
    with open(config_path, 'w') as f:
        yaml.dump(config, f, default_flow_style=False)
    print(f"Configuration file created at: {config_path}")
    return config

def install_dependencies():
    """Install required packages with better error handling"""
    base_packages = [
        "transformers==4.36.0",
        "sentence-transformers==2.2.2",
        "chromadb==0.4.0",
        "tqdm==4.66.1"
    ]
    
    document_packages = [
        "unstructured[pdf,local-inference]",
        "pdf2image",
        "pdfminer.six",
        "python-magic-bin; platform_system=='Windows'"
    ]
    
    evaluation_packages = [
        "ragas==0.0.22",
        "evaluate",
    ]
    
    utility_packages = [
        "python-dotenv==1.0.0",
        "beautifulsoup4==4.12.2",
        "requests==2.31.0"
    ]
    
    def install_package_group(packages, group_name):
        print(f"\nInstalling {group_name}...")
        for package in packages:
            try:
                print(f"Installing {package}")
                result = os.system(f"pip install -q {package}")
                if result == 0:
                    print(f"Successfully installed {package}")
                else:
                    print(f"Failed to install {package}")
            except Exception as e:
                print(f"Error installing {package}: {str(e)}")
    
    # Install in sequence
    install_package_group(base_packages, "base packages")
    install_package_group(utility_packages, "utility packages")
    install_package_group(document_packages, "document processing packages")
    install_package_group(evaluation_packages, "evaluation packages")
    
    # Verify installations
    def verify_package(package_name):
        try:
            __import__(package_name)
            return True
        except ImportError:
            return False
    
    print("\nVerifying critical packages:")
    critical_packages = {
        'unstructured': 'Document processing',
        'ragas': 'RAG evaluation',
        'transformers': 'Transformers',
        'chromadb': 'Vector store'
    }
    
    all_verified = True
    for package, description in critical_packages.items():
        if verify_package(package):
            print(f"✓ {description} successfully installed")
        else:
            print(f"✗ {description} not installed correctly")
            all_verified = False
    
    return all_verified

def setup_env_file(base_dir):
    env_path = os.path.join(base_dir, '.env')
    if not os.path.exists(env_path):
        with open(env_path, 'w') as f:
            f.write('GROQ_API_KEY=your-groq-key-here\n')
    print(f"Created .env file at: {env_path}")

if __name__ == "__main__":
    try:
        print("Checking environment...")
        check_environment()

        print("\nCreating project structure...")
        base_dir = create_project_structure()

        print("\nCreating configuration file...")
        config = create_config(base_dir)

        print("\nSetting up environment file...")
        setup_env_file(base_dir)

        print("\nInstalling dependencies...")
        if install_dependencies():
            print("\nAll dependencies installed successfully!")
        else:
            print("\nSome dependencies need manual installation. Please run:")
            print("pip install 'unstructured[pdf]'")
            print("pip install ragas")

        # Verify setup
        config_path = os.path.join(base_dir, 'config', 'config.yaml')
        with open(config_path, 'r') as f:
            loaded_config = yaml.safe_load(f)
        print("\nConfiguration loaded successfully!")
        
        print("\nSetup complete!")
        
    except Exception as e:
        print(f"\nError during setup: {str(e)}")
        raise

ModuleNotFoundError: No module named 'torch'

## Data Collection

In [6]:
!pip install "unstructured[all-docs]" unstructured-client python-dotenv pydantic langchain langchain-community langchain_core langchain_openai chromadb

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [8]:
!pip install langchain-groq sentence-transformers

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [9]:
!pip install tf-keras

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [10]:
!pip install sentencepiece

Defaulting to user installation because normal site-packages is not writeable
Collecting sentencepiece
  Downloading sentencepiece-0.2.0-cp312-cp312-win_amd64.whl.metadata (8.3 kB)
Downloading sentencepiece-0.2.0-cp312-cp312-win_amd64.whl (991 kB)
   ---------------------------------------- 0.0/992.0 kB ? eta -:--:--
   ---------------------------------------- 10.2/992.0 kB ? eta -:--:--
    -------------------------------------- 20.5/992.0 kB 640.0 kB/s eta 0:00:02
   - ------------------------------------- 41.0/992.0 kB 653.6 kB/s eta 0:00:02
   ----- ---------------------------------- 143.4/992.0 kB 1.2 MB/s eta 0:00:01
   ----------- ---------------------------- 286.7/992.0 kB 2.0 MB/s eta 0:00:01
   ----------------------- ---------------- 583.7/992.0 kB 3.3 MB/s eta 0:00:01
   ---------------------------------------  983.0/992.0 kB 4.8 MB/s eta 0:00:01
   ---------------------------------------- 992.0/992.0 kB 4.5 MB/s eta 0:00:00
Installing collected packages: sentencepiece
Succ


[notice] A new release of pip is available: 24.0 -> 25.0.1
[notice] To update, run: python.exe -m pip install --upgrade pip


In [2]:
!pip install PyPDF2
!pip install pycryptodome

Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


Defaulting to user installation because normal site-packages is not writeable



[notice] A new release of pip is available: 24.0 -> 25.1.1
[notice] To update, run: python.exe -m pip install --upgrade pip


## Preprocessing

In [1]:
#from pathlib import Path
from PyPDF2 import PdfReader
from PyPDF2.errors import PdfReadError
#import os

In [2]:
pdf_dir = Path("uk_construction_bot/data/raw/documents/gov_uk_building_regulations_PDFs")
pdf_files = list(pdf_dir.glob("*.pdf"))

print(f"\n✅ Total GOV.UK PDFs loaded: {len(pdf_files)}")
print("Loading Comprehensive UK Building Regulations Files:")
for i, pdf_file in enumerate(pdf_files, start=1):
    print(f"{i}. {pdf_file.name}")



✅ Total GOV.UK PDFs loaded: 18
Loading Comprehensive UK Building Regulations Files:
1. approved-document-R_Infrastructure_Electronic_communications.pdf
2. approved-document_P_Electrical_Safety.pdf
3. approved-document_Q_Security in Dwellings.pdf
4. ApprovedDocument_T Toilet accommodation.pdf
5. approved_document_A_Structure.pdf
6. Approved_Document_B_Fire_safety.pdf
7. Approved_Document_C_site preparation and resistance to contaminates and moisture.pdf
8. approved_document_D_Toxic Substance.pdf
9. approved_document_E_Resistance to sound.pdf
10. approved_document_F_Ventilation.pdf
11. approved_document_G_Sanitation, hot water safety and water efficiency.pdf
12. approved_document_H_Drainage and waste disposal.pdf
13. approved_document_J_Combustion appliances and fuel storage systems.pdf
14. Approved_Document_K_Protection from falling, collision and impact.pdf
15. approved_document_L_Conservation of fuel and power.pdf
16. approved_document_M_Access to and use of buildings.pdf
17. approve

In [3]:
def count_pages_in_directory(directory_path):
    """Count the total number of pages across all readable PDF files in a directory"""
    total_pages = 0
    file_count = 0
    total_files = 0
    
    print("📁 Counting pages in:", directory_path)
    
    for pdf_file in Path(directory_path).glob("*.pdf"):
        total_files += 1
        try:
            reader = PdfReader(str(pdf_file))

            if reader.is_encrypted:
                try:
                    reader.decrypt("")  # Try blank password
                except:
                    raise PdfReadError("Encrypted and couldn't decrypt.")

            num_pages = len(reader.pages)
            total_pages += num_pages
            file_count += 1
            print(f"📄 {pdf_file.name}: {num_pages} pages")
        except Exception as e:
            print(f"⚠️  Error reading {pdf_file.name}: {e}")
    
    print(f"\n✅ Total PDFs found: {total_files}")
    print(f"📘 Successfully read PDFs: {file_count}")
    print(f"🧾 Total pages across all readable PDFs: {total_pages}")
    return total_pages

pdf_folder = "uk_construction_bot/data/raw/documents/gov_uk_building_regulations_PDFs"
count_pages_in_directory(pdf_folder)

📁 Counting pages in: uk_construction_bot/data/raw/documents/gov_uk_building_regulations_PDFs
📄 approved-document-R_Infrastructure_Electronic_communications.pdf: 20 pages
📄 approved-document_P_Electrical_Safety.pdf: 22 pages
📄 approved-document_Q_Security in Dwellings.pdf: 20 pages
📄 ApprovedDocument_T Toilet accommodation.pdf: 36 pages
📄 approved_document_A_Structure.pdf: 54 pages
📄 Approved_Document_B_Fire_safety.pdf: 22 pages
📄 Approved_Document_C_site preparation and resistance to contaminates and moisture.pdf: 52 pages
📄 approved_document_D_Toxic Substance.pdf: 10 pages
📄 approved_document_E_Resistance to sound.pdf: 86 pages
📄 approved_document_F_Ventilation.pdf: 62 pages
📄 approved_document_G_Sanitation, hot water safety and water efficiency.pdf: 56 pages
📄 approved_document_H_Drainage and waste disposal.pdf: 64 pages
📄 approved_document_J_Combustion appliances and fuel storage systems.pdf: 89 pages
📄 Approved_Document_K_Protection from falling, collision and impact.pdf: 68 pages


930

In [6]:
##Creating Embeddings and Vector Store to Utilize in RAG System

In [None]:
import os
import json
import pandas as pd
from pathlib import Path
from typing import Any, List, Optional
from pydantic import BaseModel
from dotenv import load_dotenv
import time
import torch

#Importing Unstructured components
from unstructured.partition.pdf import partition_pdf
from unstructured_client import UnstructuredClient
from unstructured_client.models import shared
from unstructured_client.models.errors import SDKError
from unstructured.staging.base import dict_to_elements
from unstructured_client.models.operations.partition import PartitionRequest

#Importing LangChain components
from langchain.vectorstores import Chroma
from langchain.storage import InMemoryStore
from langchain.schema.document import Document
from langchain.embeddings import HuggingFaceEmbeddings
from langchain.retrievers.multi_vector import MultiVectorRetriever
from langchain_groq import ChatGroq
from langchain.prompts import ChatPromptTemplate
from langchain_core.output_parsers import StrOutputParser
from langchain.callbacks.streaming_stdout import StreamingStdOutCallbackHandler
import uuid

In [None]:
class Element(BaseModel):
    #Basic data wrapper to store type and content of each document chunk
    type: str
    page_content: Any

class GroqHandler:
    def __init__(self, api_key: str, model_name: str = "llama-3.3-70b-versatile"):
        #Initialize Groq LLM with streaming and no randomness
        self.llm = ChatGroq(
            api_key=api_key,
            model_name=model_name,
            temperature=0.1,
            streaming=True,
            callbacks=[StreamingStdOutCallbackHandler()]
        )
        
        # self.template = """You are an expert assistant specializing in UK construction regulations and building codes. Your task is to provide accurate, clear answers based solely on the provided context.

        # Approach:
        # 1. Analyze the provided context carefully
        # 2. If the answer is found in the context, provide it with specific references
        # 3. If tables or technical specifications are mentioned, include them
        # 4. If the answer cannot be found in the context, explicitly state this
        # 5. If details are unclear, mention this in your response

        # Context: {context}
        
        # Question: {question}
        
        # Answer the question step by step:"""

        #Defining prompt template
        self.template = """Answer the question based only on the following context, which can include text and tables:
        {context}
        Question: {question}
        """

        self.prompt = ChatPromptTemplate.from_template(self.template)

        #Build LangChain chain with LLM and output parser
        self.chain = (
            self.prompt 
            | self.llm 
            | StrOutputParser()
        )

    def generate_answer(self, context: str, question: str, max_retries: int = 3) -> Optional[str]:
        """Generate answer with retry logic and error handling"""
        for attempt in range(max_retries):
            try:
                print("\nGenerating response (streaming)...")
                response = self.chain.invoke({
                    "context": context,
                    "question": question
                })
                return response
            except Exception as e:
                if attempt == max_retries - 1:
                    print(f"\nError: Failed to generate answer after {max_retries} attempts: {str(e)}")
                    return None
                print(f"\nAttempt {attempt + 1} failed, retrying in {2 ** attempt} seconds...")
                time.sleep(2 ** attempt)  #Exponential backoff



In [None]:

class DataProcessor:
    def __init__(self
                #Initialize Unstructured API and HuggingFace embeddings
        self.client = UnstructuredClient(
            api_key_auth=os.getenv("UNSTRUCTURED_API_KEY"), #reading UNSTRUCTURED_API_KEY from .evn 
            server_url=os.getenv("UNSTRUCTURED_API_URL")
        )
        
        # Initialize embeddings
        self.embeddings = HuggingFaceEmbeddings(
            model_name="sentence-transformers/all-mpnet-base-v2",
            model_kwargs={'device': 'cuda' if torch.cuda.is_available() else 'cpu'}
        )
        
        self.setup_rag_components()
        
    def setup_rag_components(self):
        """Initialize RAG components"""
        #Initialize vector store and docstore for retrieval
        self.store = InMemoryStore()
        self.vectorstore = Chroma(
            collection_name="main_construction_rag",
            embedding_function=self.embeddings,
            persist_directory="./main_chroma_data",
        )
        self.retriever = MultiVectorRetriever(
            vectorstore=self.vectorstore,
            docstore=self.store,
            id_key="doc_id",
        )
    
    def save_processed_elements(self, elements: List[Element], filename: str):
        """Save processed elements to avoid reprocessing"""
        save_data = [{"type": el.type, "page_content": el.page_content} for el in elements]
        with open(filename, 'w', encoding='utf-8') as f:
            json.dump(save_data, f, ensure_ascii=False, indent=2)
        print(f"\nSaved {len(elements)} elements to {filename}")
        
    def load_processed_elements(self, filename: str) -> List[Element]:
        """Load previously processed elements"""
        if os.path.exists(filename):
            with open(filename, 'r', encoding='utf-8') as f:
                data = json.load(f)
            elements = [Element(**item) for item in data]
            print(f"\nLoaded {len(elements)} elements from {filename}")
            return elements
        return []

    def process_pdf_with_cache(self, pdf_path: str, cache_dir: str = "processed_cache"):
        """Process PDF with caching"""
        #Create cache directory if it doesn't exist
        os.makedirs(cache_dir, exist_ok=True)
        
        #Generate cache filename based on PDF name
        pdf_name = Path(pdf_path).stem
        cache_file = os.path.join(cache_dir, f"{pdf_name}_processed.json")
        
        #Check if cached version exists
        if os.path.exists(cache_file):
            print(f"\nLoading cached version of {pdf_path}...")
            return self.load_processed_elements(cache_file)
            
        #If no cache, process normally
        elements = self.process_pdf(pdf_path)
        if elements:
            print(f"\nCaching processed elements for {pdf_path}...")
            self.save_processed_elements(elements, cache_file)
        return elements

    def process_pdf(self, pdf_path: str):
        """Process PDF using Unstructured API"""
        try:
            print(f"\nProcessing {pdf_path}...")
            
            with open(pdf_path, "rb") as f:
                files = shared.Files(
                    content=f.read(),
                    file_name=pdf_path
                )
            
            partition_params = shared.PartitionParameters(
                files=files,
                strategy="hi_res",
                hi_res_model_name="yolox",
                skip_infer_table_types=[],
                pdf_infer_table_structure=True
            )
            
            req = PartitionRequest(partition_parameters=partition_params)
            elements = self.client.general.partition(request=req).elements
            api_elements = dict_to_elements(elements)
            
            categorized_elements = []
            for element in api_elements:
                if "Table" in str(type(element)):
                    categorized_elements.append(Element(
                        type="table",
                        page_content=str(element.metadata.text_as_html)
                    ))
                elif any(t in str(type(element)) for t in ["NarrativeText", "Title", "ListItem"]):
                    categorized_elements.append(Element(
                        type="text",
                        page_content=str(element)
                    ))
            
            print(f"Successfully processed with {len(categorized_elements)} elements")
            return categorized_elements
            
        except Exception as e:
            print(f"Error processing PDF: {e}")
            return []

    def process_youtube_data(self, csv_path: str):
        """Process YouTube transcript data"""
        df = pd.read_csv(csv_path)
        elements = []
        for _, row in df.iterrows():
            elements.append(Element(
                type="text",
                page_content=row['content']
            ))
        return elements

    def setup_retrieval_system(self, elements: List[Element]):
        """Set up multi-vector retrieval system with DistilBART summarization"""
        from transformers import pipeline
        import torch
        
        #Separate elements by type
        table_elements = [el for el in elements if el.type == "table"]
        text_elements = [el for el in elements if el.type == "text"]
        
        #Defining batch sizes
        CHROMA_BATCH_SIZE = 5000   #Max number of elements to process per batch when adding to Chroma vector store
        SUMMARY_BATCH_SIZE = 8    #Number of text chunks to summarize at once with the DistilBART model

        def generate_summaries_batch(texts: List[str]) -> List[str]:
            """Generate summaries using DistilBART in batches"""
            try:
                if not hasattr(self, 'summarizer'):
                    print("\nInitializing DistilBART summarizer...")
                    self.summarizer = pipeline(
                        "summarization",
                        model="sshleifer/distilbart-cnn-12-6",
                        device=0 if torch.cuda.is_available() else -1,
                        batch_size=SUMMARY_BATCH_SIZE
                    )
                
                print(f"Generating summaries for {len(texts)} texts...")
                summaries = []
                
                #Process in smaller batches
                for i in range(0, len(texts), SUMMARY_BATCH_SIZE):
                    batch = texts[i:i + SUMMARY_BATCH_SIZE]
                    print(f"Processing summary batch {i//SUMMARY_BATCH_SIZE + 1}/{len(texts)//SUMMARY_BATCH_SIZE + 1}")
                    
                    #Check summary cache
                    cache_file = f"summary_cache/batch_{i}.json"
                    if os.path.exists(cache_file):
                        with open(cache_file, 'r', encoding='utf-8') as f:
                            batch_summaries = json.load(f)
                        print(f"Loaded {len(batch_summaries)} summaries from cache")
                    else:
                        #Adjust max_length based on input length
                        max_input_length = max(len(text.split()) for text in batch)
                        max_summary_length = min(150, max(40, max_input_length // 2))
                        
                        #Generate new summaries
                        batch_summaries = self.summarizer(
                            batch,
                            max_length=max_summary_length,
                            min_length=min(30, max_summary_length - 10),
                            do_sample=False
                        )
                        batch_summaries = [s['summary_text'] for s in batch_summaries]
                        
                        #Cache the summaries
                        os.makedirs("summary_cache", exist_ok=True)
                        with open(cache_file, 'w', encoding='utf-8') as f:
                            json.dump(batch_summaries, f, ensure_ascii=False, indent=2)
                    
                    summaries.extend(batch_summaries)
                
                return summaries
                    
            except Exception as e:
                print(f"Error in summary generation: {e}")
                return [text[:200] for text in texts]  # Fallback to truncation

        def process_in_batches(elements_list, element_type: str):
            """Process elements in batches with multi-vector storage"""
            #Embed full and summarized documents into vectorstore for efficient retrival 
            for i in range(0, len(elements_list), CHROMA_BATCH_SIZE):
                batch = elements_list[i:i + CHROMA_BATCH_SIZE]
                print(f"\nProcessing {element_type} batch {i//CHROMA_BATCH_SIZE + 1}/{len(elements_list)//CHROMA_BATCH_SIZE + 1}")
                
                #Generate IDs for batch
                batch_ids = [str(uuid.uuid4()) for _ in batch]
                
                #Generate summaries for the batch
                summaries = generate_summaries_batch([el.page_content for el in batch])
                
                try:
                    #Create documents for both full content and summaries
                    full_docs = []
                    summary_docs = []
                    
                    for idx, (element, summary) in enumerate(zip(batch, summaries)):
                        #Add full content document
                        full_docs.append(Document(
                            page_content=element.page_content,
                            metadata={
                                "doc_id": batch_ids[idx],
                                "type": element_type,
                                "version": "full"
                            }
                        ))
                        
                        #Add summary document
                        summary_docs.append(Document(
                            page_content=summary,
                            metadata={
                                "doc_id": batch_ids[idx],
                                "type": element_type,
                                "version": "summary"
                            }
                        ))
                    
                    #Add to vectorstore
                    if full_docs and summary_docs:
                        print("Adding documents to vectorstore...")
                        self.retriever.vectorstore.add_documents(full_docs)
                        self.retriever.vectorstore.add_documents(summary_docs)
                        
                        #Store original elements
                        self.retriever.docstore.mset(list(zip(batch_ids, batch)))
                        
                except Exception as e:
                    print(f"Error processing batch: {e}")
                    continue

        
        
        #Process elements by type
        if table_elements:
            print(f"\nProcessing {len(table_elements)} table elements...")
            process_in_batches(table_elements, "table")
        
        if text_elements:
            print(f"\nProcessing {len(text_elements)} text elements...")
            process_in_batches(text_elements, "text")
        
        #Persist the vectorstore
        print("\nPersisting vectorstore...")
        self.vectorstore.persist()
        
        print("\nMulti-vector retrieval system setup complete!")

In [1]:
def main():
    """Main interactive CLI: lets user choose data source and ask questions""""
    # Load environment variables
    load_dotenv()
    
    processor = DataProcessor()
    
    print("\nSelect data source:")
    print("1. YouTube Transcripts")
    print("2. PDF Documents")
    print("3. Combined Data")
    
    choice = input("Enter your choice (1-3): ")
    
    elements = []
    if choice == "1":
        youtube_cache = "processed_cache/youtube_data.json"
        if os.path.exists(youtube_cache):
            elements = processor.load_processed_elements(youtube_cache)
        else:
            elements = processor.process_youtube_data("uk_construction_bot/data/processed/youtube_data.csv")
            processor.save_processed_elements(elements, youtube_cache)
    elif choice == "2":
        pdf_dir = Path("uk_construction_bot/data/raw/documents")
        for pdf_file in pdf_dir.glob("*.pdf"):
            elements.extend(processor.process_pdf_with_cache(str(pdf_file)))
    elif choice == "3":
        # Load YouTube data
        youtube_cache = "processed_cache/youtube_data.json"
        if os.path.exists(youtube_cache):
            youtube_data = processor.load_processed_elements(youtube_cache)
        else:
            youtube_data = processor.process_youtube_data("uk_construction_bot/data/processed/youtube_data.csv")
            processor.save_processed_elements(youtube_data, youtube_cache)
        
        #Loading PDF data
        pdf_elements = []
        pdf_dir = Path("uk_construction_bot/data/raw/documents")
        for pdf_file in pdf_dir.glob("*.pdf"):
            pdf_elements.extend(processor.process_pdf_with_cache(str(pdf_file)))
        
        elements = youtube_data + pdf_elements
    else:
        print("Invalid choice")
        return
    
    if not elements:
        print("No elements to process")
        return
        
    try:
        print("\nSetting up retrieval system...")
        processor.setup_retrieval_system(elements)
        
        print("\nInitializing Groq handler...")
        groq_handler = GroqHandler(
            api_key="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" ##this api key has been disabled, its not good to expose keys but for testing anf detection I exposed and disabled it afterwards, this applies to any other key seen within the 3 notebooks I used for thsi development.
        )
        
        print("\nSystem ready for queries!")
        
        # Interactive QA
        while True:
            try:
                question = input("\nEnter your question (or 'quit' to exit): ")
                if question.lower() == 'quit':
                    break
                    
                retrieved = processor.retriever.invoke(question)
                if not retrieved:
                    print("\nNo relevant information found in the documents.")
                    continue
                    
                context = retrieved[0].page_content
                print("\nRetrieved content:", context)
                
                response = groq_handler.generate_answer(context, question)
                if not response:
                    print("\nFalling back to simple context retrieval. Please rephrase your question.")
                    
            except KeyboardInterrupt:
                print("\nExiting QA session...")
                break
            except Exception as e:
                print(f"\nUnexpected error in QA: {str(e)}")
                print("Please try again with a different question.")
    except Exception as e:
        print(f"\nError in setup: {str(e)}")
        print("Processed elements have been cached and can be reused on next run.")

if __name__ == "__main__":
    main()

  self.embeddings = HuggingFaceEmbeddings(

INFO: PyTorch version 2.6.0 available.
INFO: TensorFlow version 2.18.0 available.
INFO: Load pretrained SentenceTransformer: sentence-transformers/all-mpnet-base-v2
  self.vectorstore = Chroma(
INFO: Anonymized telemetry enabled. See                     https://docs.trychroma.com/telemetry for more information.



Select data source:
1. YouTube Transcripts
2. PDF Documents
3. Combined Data


Enter your choice (1-3):  2



Loading cached version of uk_construction_bot\data\raw\documents\approved_document_A.pdf...

Loaded 1013 elements from processed_cache\approved_document_A_processed.json

Loading cached version of uk_construction_bot\data\raw\documents\Approved_Document_B__fire_safety__volume_1_-_Dwellings__2019_edition_incorporating_2020_and_2022_amendments_collated_with_2025__2026_and_2029_amendments.pdf...

Loaded 3495 elements from processed_cache\Approved_Document_B__fire_safety__volume_1_-_Dwellings__2019_edition_incorporating_2020_and_2022_amendments_collated_with_2025__2026_and_2029_amendments_processed.json

Loading cached version of uk_construction_bot\data\raw\documents\approved_document_C.pdf...

Loaded 1256 elements from processed_cache\approved_document_C_processed.json

Setting up retrieval system...

Processing 75 table elements...

Processing table batch 1/1

Initializing DistilBART summarizer...


config.json:   0%|          | 0.00/1.80k [00:00<?, ?B/s]

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development


pytorch_model.bin:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/1.22G [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/26.0 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

Device set to use cpu


Generating summaries for 75 texts...
Processing summary batch 1/10
Loaded 8 summaries from cache
Processing summary batch 2/10
Loaded 8 summaries from cache
Processing summary batch 3/10
Loaded 8 summaries from cache
Processing summary batch 4/10
Loaded 8 summaries from cache
Processing summary batch 5/10
Loaded 8 summaries from cache
Processing summary batch 6/10
Loaded 8 summaries from cache
Processing summary batch 7/10
Loaded 8 summaries from cache
Processing summary batch 8/10
Loaded 8 summaries from cache
Processing summary batch 9/10
Loaded 8 summaries from cache
Processing summary batch 10/10
Loaded 8 summaries from cache
Adding documents to vectorstore...

Processing 5689 text elements...

Processing text batch 1/2
Generating summaries for 5000 texts...
Processing summary batch 1/626
Loaded 8 summaries from cache
Processing summary batch 2/626
Loaded 8 summaries from cache
Processing summary batch 3/626
Loaded 8 summaries from cache
Processing summary batch 4/626
Loaded 8 summ

  self.vectorstore.persist()



System ready for queries!



Enter your question (or 'quit' to exit):  What is the minimum ventilation requirement for suspended concrete ground floors?



Retrieved content: 4.19 A suspended concrete floor will meet the requirements if it incorporates:

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


According to the context, the minimum ventilation requirement for suspended concrete ground floors is not explicitly stated, but rather it mentions that the floor will meet the requirements if it incorporates certain elements (though the elements are not specified in the given context).


Enter your question (or 'quit' to exit):  What is the minimum ventilation requirement for suspended concrete ground floors?



Retrieved content: 4.19 A suspended concrete floor will meet the requirements if it incorporates:

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


According to the given context, the minimum ventilation requirement for suspended concrete ground floors is not explicitly stated. The context only mentions that a suspended concrete floor will meet the requirements if it incorporates certain features, but it does not specify what those requirements are. Therefore, the minimum ventilation requirement cannot be determined from the provided context.


Enter your question (or 'quit' to exit):  Give me the table of Clay type and Volume change potential for site preparation



Retrieved content: Table 1 Volume change potential for some common clays

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


Based on the provided context, here is the table of Clay type and Volume change potential for site preparation:

Unfortunately, the actual table data is not provided in the given context. However, I can suggest the general format of the table based on the description:

| Clay Type | Volume Change Potential |
| --- | --- |
|  |  |

Please note that the actual data is missing, and I'm only providing a placeholder table. If you provide the actual table data, I'll be happy to assist you further.


Enter your question (or 'quit' to exit):  What are the minimum thickness requirements for solid external walls in traditional construction ?



Retrieved content: The walls should have a minimum thickness of 90mm.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The minimum thickness requirement for solid external walls is 90mm.


Enter your question (or 'quit' to exit):  What factors determine the minimum width of strip foundations in different ground conditions?



Retrieved content: Minimum width of strip foundations

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The minimum width of strip foundations is determined by several factors, which vary depending on the ground conditions. These factors include:

1. Bearing capacity of the soil: The width of the foundation needs to be sufficient to distribute the load of the structure evenly across the soil, without exceeding the soil's bearing capacity.
2. Settlement characteristics: The foundation width should be designed to minimize settlement and ensure that it is uniform across the foundation.
3. Soil type and properties: Different soils have varying strengths and settlement characteristics, which affect the required foundation width.
4. Load-bearing capacity: The foundation width needs to be sufficient to support the weight of the structure and any external loads.
5. Depth of foundation: The depth of the foundation also plays a role in determining the minimum width, as deeper foundations can be narrower due to the increased bearing capacity of the soil at greater depths.

In general, the minimum w


Enter your question (or 'quit' to exit):  What design loading must 'key elements' withstand under disproportionate collapse requirements ?



Retrieved content: A ‘key element’, as referred to in paragraph 5.1d, should be capable of sustaining an accidental design loading of 34kN/m2 applied in the horizontal and vertical directions (in one direction at a time) to the member and any attached components (e.g. cladding etc.) having regard to the ultimate strength of such components and their connections. Such accidental design loading should be assumed to act simultaneously with all other design loadings (i.e. wind and imposed loading) in accidental actions loading combination.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


According to the context, 'key elements' must withstand an accidental design loading of 34kN/m2, applied in the horizontal and vertical directions (one direction at a time), in addition to all other design loadings (such as wind and imposed loading).


Enter your question (or 'quit' to exit):  Which geographical areas require special timber treatment against house longhorn ?



Retrieved content: 2B2 In the geographical areas specified in Table 1, softwood timber for roof construction or fixed in the roof space, including ceiling joists within the void spaces of the roof, should be adequately treated to prevent infestation by the house longhorn beetle (Hylotrupes bajulus L.).

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The geographical areas that require special timber treatment against house longhorn beetle are specified in Table 1.


Enter your question (or 'quit' to exit):  What are the fixing requirements for wall cladding to ensure structural safety ?



Retrieved content: 3.1 Wall cladding presents a hazard if it becomes detached from the building. This section provides guidance on the support and fixing of wall cladding. An acceptable level of safety can be achieved by different means depending on the type and location of the cladding. The guidance given relates to all forms of cladding, including curtain walling and glass facades. It is not intended to provide guidance concerning the weather resistance of wall cladding which is included in Approved Document C, Site preparation and resistance to contaminants and moisture, or guidance on resistance to spread of fire which is included in Approved Document B, Fire safety, or guidance in relation to sound insulation, which is included in Approved Document E, Resistance to the passage of sound.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The provided context does not explicitly state the specific fixing requirements for wall cladding to ensure structural safety. It mentions that an acceptable level of safety can be achieved by different means depending on the type and location of the cladding, but it does not provide detailed information on the fixing requirements. It also refers to other Approved Documents (C, B, and E) for guidance on weather resistance, fire safety, and sound insulation, but not for structural safety. Therefore, the fixing requirements for wall cladding to ensure structural safety are not specified in the given context.


Enter your question (or 'quit' to exit):  What is the minimum subsoil drainage requirement below ground level?



Retrieved content: Section 3: Subsoil drainage

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


The minimum subsoil drainage requirement below ground level is not specified in the provided context. The context only mentions "Section 3: Subsoil drainage" and a question about the minimum requirement, but it does not provide the actual requirement.


Enter your question (or 'quit' to exit):  What are the radon protection requirements in areas with high radon levels?



Retrieved content: • Guidance on radon protective measures.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Attempt 1 failed, retrying in 1 seconds...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Generating response (streaming)...

Attempt 2 failed, retrying in 2 seconds...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Generating response (streaming)...

Error: Failed to generate answer after 3 attempts: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_01jj0yv8a6fbjsjzm9xvry5zxn` service tier `on_demand` on : Limit 100000, Used 100828, Requested 66. Please try again in 12m52.743999999s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': '', 'code': 'rate_limit_exceeded'}}

Falling back to simple context retrieval. Please rephrase your question.



Enter your question (or 'quit' to exit):  What are the radon protection requirements in areas with high radon levels?



Retrieved content: • Guidance on radon protective measures.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Attempt 1 failed, retrying in 1 seconds...

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Attempt 2 failed, retrying in 2 seconds...

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 429 Too Many Requests"



Error: Failed to generate answer after 3 attempts: Error code: 429 - {'error': {'message': 'Rate limit reached for model `llama-3.3-70b-versatile` in organization `org_01jj0yv8a6fbjsjzm9xvry5zxn` service tier `on_demand` on : Limit 100000, Used 101258, Requested 66. Please try again in 19m4.579s. Visit https://console.groq.com/docs/rate-limits for more information.', 'type': '', 'code': 'rate_limit_exceeded'}}

Falling back to simple context retrieval. Please rephrase your question.



Enter your question (or 'quit' to exit):  What are the radon protection requirements in areas with high radon levels?



Retrieved content: • Guidance on radon protective measures.

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


In areas with high radon levels, the following radon protection requirements are recommended:

1. **Sealing entry points**: Seal all entry points where radon can enter the building, such as cracks in floors and walls, and around pipes and electrical outlets.
2. **Installing a radon barrier**: Install a radon barrier, such as a plastic sheet, in crawl spaces and under floors to prevent radon from entering the building.
3. **Improving ventilation**: Improve ventilation in the building by installing a heat recovery ventilation system or a positive input ventilation system to reduce radon levels.
4. **Using radon-resistant materials**: Use radon-resistant materials, such as concrete with a low permeability, for building foundations and floors.
5. **Regular testing**: Regularly test for radon levels in the building to ensure that the protection measures are effective.

Note: The specific requirements may vary depending on the local regulations and guidelines. It is recommended to consult wi


Enter your question (or 'quit' to exit):  What is the minimum ventilation requirement for suspended concrete ground floors?



Retrieved content: 4.19 A suspended concrete floor will meet the requirements if it incorporates:

Generating response (streaming)...


INFO: HTTP Request: POST https://api.groq.com/openai/v1/chat/completions "HTTP/1.1 200 OK"


According to the given context, the minimum ventilation requirement for suspended concrete ground floors is not explicitly stated. The context only mentions that a suspended concrete floor will meet the requirements if it incorporates certain features, but it does not specify what those requirements are. Therefore, the minimum ventilation requirement cannot be determined from the provided context.


Enter your question (or 'quit' to exit):  QUIT


In [2]:
## The end of the code execution output above has an interactive Rag System which I used to test out here before taking to build a proper user interface 
## via streamlit and deploying that interface on HuggingFace

### Groq's Llama-3.3-70b model through their API HuggingFace embeddings (sentence-transformers/all-mpnet-base-v2)
Currently, the code uses DistilBART for summarization, specifically the "sshleifer/distilbart-cnn-12-6" model. This can be seen in the generate_summaries_batch function within the setup_retrieval_system method:
This summarization is used as part of the multi-vector retrieval system, where both full documents and their summaries are stored for improved retrieval performance.

This code is a RAG (Retrieval Augmented Generation) system that processes documents and answers questions about them. Here's the key functionality:

Document Processing: Processes both PDF documents and YouTube transcripts
Uses Unstructured API to extract text and tables from PDFs
Caches processed documents to avoid reprocessing


Embedding and Storage:Currently uses HuggingFace embeddings (sentence-transformers/all-mpnet-base-v2)
Stores embeddings in a Chroma vector database
Uses a multi-vector retrieval system that stores both full content and summaries


Text Generation:Currently uses Groq's Llama-3.3-70b model through their API
Includes retry logic and error handling
Uses a streaming response system