# Road Safety Intervention GPT (NRSH 2025)

This RAG (Retrieval-Augmented Generation) system is designed for the National Road Safety Hackathon 2025. It uses a local LLM (Llama 3.2 3B) and a curated vector database (`knowledge_base.json`) to answer questions about road safety interventions, standards, and best practices. 

It features a multi-step prompt that synthesizes information from multiple sources to provide **comprehensive, topic-wise replies** and **includes citations** as required by the competition rules.

### Step 1: Install Python Libraries

This cell installs all the necessary Python libraries. OCR-related libraries have been removed.

In [1]:
%pip install langchain langchain-community sentence-transformers faiss-cpu transformers torch accelerate huggingface_hub


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Step 2: Imports and Setup

In [2]:
import os
import json
import numpy as np
from langchain.docstore.document import Document
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
import time

# Set environment variable to avoid tokenizer parallelism warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'


### Step 3: Load and Prepare Documents from `knowledge_base.json`

This function loads our pre-curated `knowledge_base.json` file. It handles three record formats, checked in this order:
1.  The unified format (with `intervention_name`, `content`, and a nested `metadata` object).
2.  The IRC Standards format (with `s_no`, `problem`, `data`, `code`, `clause`).
3.  The Intervention Strategy format (with `id`, `intervention`, `description`).

It transforms each JSON object into a unified LangChain `Document` object, which separates the text content (for searching) from the metadata (for filtering and citations).

In [3]:
JSON_FILE_PATH = "knowledge_base.json"
documents = []

def load_and_prepare_documents(file_path):
    """Loads the curated JSON file and transforms it into a list of LangChain Documents."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        doc_list = []
        for item in data:
            # The 'intervention_name' key identifies the unified 'gold standard' format
            # (the output of the earlier transformation script)
            if 'intervention_name' in item:
                page_content = f"Intervention/Standard: {item['intervention_name']}\nDetails: {item['content']}"
                metadata = item.get('metadata', {})
                final_metadata = {
                    'source': metadata.get('source_reference', 'N/A'),
                    'type': metadata.get('type', 'N/A'),
                    'category': metadata.get('category', 'N/A'),
                    'problems': metadata.get('common_problems_solved', 'N/A')
                }
            
            # Fallback for your 'IRC Standard' table format
            elif 's_no' in item and 'data' in item:
                # .get() keeps a single record without a 'type' key from aborting the whole load
                page_content = f"Standard: {item.get('type', 'N/A')}\nSpecification: {item['data']}"
                final_metadata = {
                    'source': f"{item.get('code', 'N/A')}, Clause {item.get('clause', 'N/A')}",
                    'type': item.get('category', 'Standard'),
                    'category': item.get('category', 'N/A'),
                    'problems': item.get('problem', 'N/A')
                }

            # Fallback for your 'Intervention Strategy' format
            elif 'intervention' in item and 'description' in item:
                content_text = (
                    f"{item['description']}. "
                    f"When to Apply: {item.get('when_to_apply', 'N/A')}. "
                    f"Why it Works: {item.get('why_it_works', 'N/A')}. "
                    f"Constraints: {item.get('constraints', 'N/A')}."
                )
                page_content = f"Intervention: {item['intervention']}\nDetails: {content_text}"
                final_metadata = {
                    'source': item.get('source', 'N/A'),
                    'type': 'Intervention',
                    'category': item.get('category', 'N/A'),
                    'problems': f"Solution for {item.get('category', 'N/A')}"
                }

            else:
                print(f"Skipping unrecognized item: {item.get('id', 'N/A')}")
                continue
                
            doc_list.append(Document(page_content=page_content, metadata=final_metadata))
        
        return doc_list

    except FileNotFoundError:
        print(f"ERROR: '{file_path}' not found. Please make sure it's in the same directory.")
        return []
    except json.JSONDecodeError:
        print(f"ERROR: '{file_path}' is not a valid JSON file. Please check for syntax errors.")
        return []
    except Exception as e:
        print(f"Error loading and preparing documents: {e}")
        return []

documents = load_and_prepare_documents(JSON_FILE_PATH)
if documents:
    print(f"Successfully loaded and prepared {len(documents)} documents from '{JSON_FILE_PATH}'.")
    print("\n--- Example Document --- ")
    print(documents[0].page_content)
    print(documents[0].metadata)
    print("------------------------")

Successfully loaded and prepared 240 documents from 'knowledge_base.json'.

--- Example Document --- 
Intervention/Standard: STOP Sign
Details: The 'STOP' sign, used on Minor Roads intersecting Major Roads, requires vehicles to stop before entering and proceed only when safe. It is octagonal with a red background, a white border, and "STOP" written centrally in white. Installed on the left side of the approach, it should be placed close to the stop line, typically 1.5 m in advance, without impairing visibility of the Major Road.
The dimensions vary by approach speed: up to 50 km/h, 750 mm height, 25 mm border, 175 mm font; 51–65 km/h, 900 mm height, 30 mm border, 210 mm font; and over 65 km/h, 1200 mm height, 40 mm border, 280 mm font.
{'source': 'IRC:67-2022 - Clause 14.4', 'type': 'Standard', 'category': 'Road Sign', 'problems': 'Damaged'}
------------------------
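
The three-way branching in `load_and_prepare_documents` can be easier to reason about as a small classifier over each record's keys. The sketch below is a standalone illustration, not part of the notebook's pipeline: `detect_record_format` is a hypothetical helper, and the sample records hold made-up placeholder values. Only the key names come from the loader above.

```python
def detect_record_format(item):
    """Return a label for the schema a knowledge-base record appears to follow.

    The checks run in the same order as the loader's if/elif chain.
    """
    if 'intervention_name' in item:
        return 'unified'
    if 's_no' in item and 'data' in item:
        return 'irc_standard'
    if 'intervention' in item and 'description' in item:
        return 'intervention_strategy'
    return 'unrecognized'


# Placeholder records: the values are invented for illustration only.
sample_records = [
    {'intervention_name': 'Example sign', 'content': '...', 'metadata': {}},
    {'s_no': 1, 'problem': '...', 'data': '...', 'code': '...', 'clause': '...'},
    {'id': 'X1', 'intervention': 'Example measure', 'description': '...'},
    {'id': 'X2', 'title': 'Record with unexpected keys'},
]

for record in sample_records:
    print(detect_record_format(record))
```

Running a helper like this over the raw JSON before loading gives a quick count of how many records fall into each format, and how many would be skipped.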


### Step 4: Generate Embeddings and Create Vector Store

This step embeds the `Document` objects we created. We use `FAISS.from_documents` to build the vector database in memory.

In [4]:
vector_store = None
if documents:
    print("Loading embedding model and creating vector store...")

    # Using a reliable and high-performing open-source embedding model
    embeddings = SentenceTransformerEmbeddings(model_name='BAAI/bge-large-en-v1.5')
    
    # Create the vector store from our list of Document objects
    vector_store = FAISS.from_documents(documents, embeddings)
    
    print("Vector store created successfully.")
else:
    print("Skipping vector store creation because no documents were loaded from the JSON file.")

Loading embedding model and creating vector store...


Vector store created successfully.
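
Conceptually, the vector store answers a query by embedding it and returning the stored documents whose vectors lie closest to the query vector. The sketch below illustrates that idea with an exhaustive search over hand-written 3-dimensional vectors. It is not FAISS and not the notebook's code: `top_k_nearest`, the toy vectors, and the choice of squared Euclidean distance are all assumptions made for illustration (real `bge-large-en-v1.5` embeddings have 1024 dimensions).

```python
def squared_l2(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def top_k_nearest(query_vec, doc_vecs, k=2):
    """Return the names of the k stored vectors closest to query_vec."""
    scored = sorted(
        (squared_l2(query_vec, vec), name) for name, vec in doc_vecs.items()
    )
    return [name for _, name in scored[:k]]


# Toy "embeddings": invented numbers, chosen so each topic points in a different direction.
toy_vectors = {
    'stop_sign': [0.9, 0.1, 0.0],
    'speed_hump': [0.1, 0.9, 0.1],
    'road_marking': [0.2, 0.2, 0.9],
}
toy_query = [0.8, 0.2, 0.1]  # lies closest to the 'stop_sign' vector

print(top_k_nearest(toy_query, toy_vectors, k=2))
```

A library such as FAISS performs the same kind of nearest-neighbor lookup, but with optimized index structures that scale to many more vectors.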


### Step 5: Load Llama 3.2 3B Model

This step loads the local LLM to act as the 'brain' of our RAG system. The model is gated on Hugging Face, so you must add a Hugging Face token from an account that has been granted access in order to download it.

In [None]:

HF_TOKEN = "" # <-- PASTE YOUR TOKEN HERE

if HF_TOKEN == "":
    print("="*50)
    print("ERROR: Please set your Hugging Face token in the HF_TOKEN variable.")
    print("Get one from: https://huggingface.co/settings/tokens")
    print("="*50)
    llm_generator = None
else:
    print("Loading Llama 3.2 3B LLM for generation...")
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
    print(f"Using device: {device}")

    model_name = "meta-llama/Llama-3.2-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device,
        torch_dtype=torch.bfloat16,
        token=HF_TOKEN
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    llm_generator = pipeline(
        "text-generation",
        model=model,
        tokenizer=tokenizer,
        max_new_tokens=1024 # Increased token limit for detailed answers
    )
    print("Llama 3.2 3B LLM loaded successfully.")

Loading Llama 3.2 3B LLM for generation...
Using device: mps


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:04<00:00,  2.36s/it]
Device set to use mps


Llama 3.2 3B LLM loaded successfully.
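
Pasting a token directly into a notebook cell makes it easy to leak when the notebook is shared. One alternative, sketched below, is to read it from an environment variable instead. The helper `resolve_hf_token` and the use of `HF_TOKEN` as the environment key are assumptions of this sketch, not something the notebook defines.

```python
import os


def resolve_hf_token(env_var="HF_TOKEN", fallback=""):
    """Return the token stored in env_var, or fallback if it is unset or blank."""
    token = os.environ.get(env_var, "").strip()
    return token or fallback


HF_TOKEN = resolve_hf_token()
if not HF_TOKEN:
    print("No token found; set the HF_TOKEN environment variable before loading the model.")
```

With this in place, the existing `if HF_TOKEN == "":` check in the loading cell keeps working unchanged, because the helper returns an empty string when no token is available.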


### Step 6: Define Intents and **Topic-Wise** Prompts

This is the upgraded 'brain' of the bot. The earlier 'study' intents are replaced with 'road safety' intents.

Crucially, the `ask_question` prompt is now a **synthesis prompt**. It instructs the model to use *all* relevant pieces of retrieved information, **group them by topic**, and present a comprehensive answer rather than just the first chunk it finds. It also **explicitly instructs the model to cite its sources** from the metadata.

In [6]:
def detect_intent_simple(query):
    """A simple keyword-based intent detection function."""
    query_lower = query.lower()
    if 'summarize' in query_lower or 'summary' in query_lower:
        return 'request_summary'
    if 'quiz' in query_lower or 'test me' in query_lower or 'flashcard' in query_lower:
        return 'request_quiz'
    return 'ask_question' 

def generate_content(context_docs, query, intent):
    """Generates a response, summary, or quiz based on the intent."""
    
    # We format the context to be crystal clear for the LLM
    context_text = "\n\n---\n".join([
        f"Source: {doc.metadata.get('source', 'N/A')}\nContent: {doc.page_content}"
        for doc in context_docs
    ])

    system_prompt = ""
    user_prompt = ""

    if intent == 'ask_question':
        system_prompt = (
            "You are an expert Road Safety Analyst for the National Road Safety Hackathon. "
            "Your task is to provide a comprehensive, well-structured answer based ONLY on the provided context, which contains IRC standards and best practices. "
            "Do not just answer with the first relevant fact. **Synthesize** information from *all* the provided context documents. "
            "If you find multiple relevant interventions (e.g., for 'speeding'), **group them by topic** (e.g., 'Engineering Solutions', 'Enforcement Solutions'). "
            "Be precise, actionable, and **you MUST cite your sources** for every claim you make, using the format [Source]. "
            "If the context does not contain the information, state that you cannot answer from the provided knowledge base."
        )
        user_prompt = f"Context from Knowledge Base:\n{context_text}\n\nQuestion: {query}"
    
    elif intent == 'request_summary':
        system_prompt = "You are an expert Road Safety Analyst. Create a concise, easy-to-read summary of the key points from the following context. Use bullet points and cite your sources [Source]."
        user_prompt = f"Context:\n{context_text}\n\nSummary:"
    
    elif intent == 'request_quiz':
        system_prompt = "You are an expert Road Safety Analyst. Create an engaging multiple-choice quiz with 3 to 4 questions based on the following context. For each question, provide four options (A, B, C, D), and clearly state the correct answer, citing the source [Source]."
        user_prompt = f"Context:\n{context_text}\n\nQuiz:"

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    
    try:
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
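        # return_full_text=False makes the pipeline return only the newly generated text,
        # so the prompt does not have to be stripped from the output afterwards
        response = llm_generator(prompt, num_return_sequences=1, return_full_text=False)
        answer = response[0]['generated_text'].strip()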
        return answer, [doc.metadata.get('source', 'N/A') for doc in context_docs]
    
    except Exception as e:
        print(f"Error during LLM generation: {e}")
        return f"Sorry, I encountered an error while generating the response: {e}", []
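
To see exactly what the model receives, it helps to print the context block that `generate_content` builds. The sketch below reproduces the same join logic on two stand-in documents. `SimpleNamespace` is used only as a lightweight substitute for LangChain's `Document`, and `format_context` and the sample texts are hypothetical.

```python
from types import SimpleNamespace


def format_context(context_docs):
    """Join documents into one context string, labelling each with its source."""
    return "\n\n---\n".join(
        f"Source: {doc.metadata.get('source', 'N/A')}\nContent: {doc.page_content}"
        for doc in context_docs
    )


# Stand-in documents with placeholder text; the second one has no 'source' key.
stand_in_docs = [
    SimpleNamespace(page_content="First placeholder passage.", metadata={'source': 'Source A'}),
    SimpleNamespace(page_content="Second placeholder passage.", metadata={}),
]

context_text = format_context(stand_in_docs)
print(context_text)
```

Each passage is preceded by its `Source:` line, which is what lets the model attach a `[Source]` citation to the claims it draws from that passage. A document without a `source` entry falls back to `N/A`.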

### Step 7: The RAG Query Function

This function ties everything together. It detects the intent, retrieves context, and then either answers the question or generates content. 

Retrieval now uses `k=5` so that *more* documents are returned. This is key for the "topic-wise" synthesis, as it gives the LLM more information to group and summarize.

In [7]:
def process_query(query):
    if not vector_store or not llm_generator:
        print("ERROR: Vector store or LLM is not initialized. Please run all preceding steps.")
        return

    print(f"\n{'='*20}")
    print(f"Query: {query}")
    
    intent = detect_intent_simple(query)
    print(f"Detected Intent: {intent}")
    
    # Retrieve relevant documents from the vector store
    retrieved_docs = vector_store.similarity_search(query, k=5) # Increased to k=5 for better synthesis
    
    print(f"Retrieved {len(retrieved_docs)} relevant documents...")
    
    start_time = time.time()
    
    # Generate the answer using the LLM and the retrieved context
    final_answer, sources = generate_content(retrieved_docs, query, intent)
    end_time = time.time()
    
    content_type = "Response"
    if intent == 'ask_question':
        content_type = 'Answer'
    elif intent == 'request_summary':
        content_type = 'Summary'
    elif intent == 'request_quiz':
        content_type = 'Quiz'

    print(f"\n✅ {content_type} (generated in {end_time - start_time:.2f}s):")
    print(final_answer)
    
    # This is for your reference, to see what the bot was 'thinking'
    print("\n--- Retrieved Sources ---")
    for i, doc in enumerate(retrieved_docs):
        print(f"Doc {i+1} Source: {doc.metadata.get('source', 'N/A')}")
        # print(f"Doc {i+1} Content: {doc.page_content[:100]}...") # Uncomment for debugging
    print(f"{'='*20}\n")
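
Several retrieved chunks can share the same source, in which case the "Retrieved Sources" listing repeats the same reference. If that becomes noisy, the sources can be deduplicated while preserving retrieval order. The helper `unique_sources` below is a hypothetical addition, not part of `process_query`, and the sample list is illustrative.

```python
def unique_sources(sources):
    """Drop repeated source strings while keeping first-seen (retrieval) order."""
    seen = set()
    ordered = []
    for source in sources:
        if source not in seen:
            seen.add(source)
            ordered.append(source)
    return ordered


# Illustrative list in which the first two chunks come from the same clause.
sample_sources = [
    'IRC:67-2022 - Clause 14.4',
    'IRC:67-2022 - Clause 14.4',
    'IRC:67-2022 - Clause 14.8.4',
    'WHO Inspired',
]

for source in unique_sources(sample_sources):
    print(source)
```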

### Step 8: Example Queries

Let's test our new Road Safety GPT! Uncomment the queries to run them.

In [8]:
# --- Test Queries (Uncomment and run one by one) --- 

# 1. Test a specific standard (from your table data)
# process_query("What is the standard for a STOP sign?")

# 2. Test a problem-based intervention (from your table data)
# process_query("My road markings are faded and not retro-reflective. What's the rule for that?")

# 3. Test a broad, topic-wise question (should get multiple answers)
# process_query("What interventions can I use for speeding?")

# 4. Test a summary intent
# process_query("summarize the standards for speed humps and rumble strips")

# 5. Test a quiz intent
# process_query("quiz me on road signs")

In [9]:
process_query("What is the standard for a STOP sign?")



Query: What is the standard for a STOP sign?
Detected Intent: ask_question
Retrieved 5 relevant documents...

✅ Answer (generated in 17.52s):
According to the provided context, the standard for a STOP sign, as outlined in IRC:67-2022 - Clause 14.4, is as follows:

- The 'STOP' sign is octagonal in shape with a red background, a white border, and "STOP" written centrally in white.
- It is installed on the left side of the approach, close to the stop line, typically 1.5 m in advance, without impairing visibility of the Major Road.
- The dimensions vary by approach speed:
  - Up to 50 km/h, 750 mm height, 25 mm border, 175 mm font.
  - 51–65 km/h, 900 mm height, 30 mm border, 210 mm font.
  - Over 65 km/h, 1200 mm height, 40 mm border, 280 mm font.

[Source: IRC:67-2022 - Clause 14.4]

--- Retrieved Sources ---
Doc 1 Source: IRC:67-2022 - Clause 14.4
Doc 2 Source: IRC:67-2022 - Clause 14.4
Doc 3 Source: IRC:67-2022 - Clause 14.8.4
Doc 4 Source: WHO Inspired
Doc 5 Source: IRC:67-2022 - Cl

In [10]:
process_query("My road markings are faded and not retro-reflective. What's the rule for that?")



Query: My road markings are faded and not retro-reflective. What's the rule for that?
Detected Intent: ask_question
Retrieved 5 relevant documents...

✅ Answer (generated in 26.69s):
Based on the provided context, it appears that the road markings on your road are not meeting the standards for visibility, particularly for drivers to detect the markings at least two seconds ahead.

According to the IRC:35-2015 - Clause 2.7, road markings must be clearly visible day and night, providing essential guidance, especially on unlit roads. The minimum preview distance with respect to speed for drivers to detect markings is specified in the table.

Since your road markings are faded and not retro-reflective, it is likely that they are not meeting the visibility requirements.

The PWD Inspired standard recommends repainting faded lane markings for clarity, especially in intersections with poor visibility. This could be a suitable solution to address the issue.

Additionally, the WHO Inspired sta