# Road Safety Intervention GPT (NRSH 2025)

This RAG (Retrieval-Augmented Generation) system is designed for the National Road Safety Hackathon 2025. It uses a local LLM (Llama 3.2 3B) and a curated vector database (`knowledge_base.json`) to answer questions about road safety interventions, standards, and best practices. 

It features a multi-step prompt that synthesizes information from multiple sources to provide **comprehensive, topic-wise replies** and **includes citations** as required by the competition rules.

### Step 1: Install Python Libraries

This cell installs all the necessary Python libraries. OCR-related libraries have been removed.

In [1]:
%pip install langchain langchain-community sentence-transformers faiss-cpu transformers torch accelerate huggingface_hub


[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m A new release of pip is available: [0m[31;49m23.0.1[0m[39;49m -> [0m[32;49m25.3[0m
[1m[[0m[34;49mnotice[0m[1;39;49m][0m[39;49m To update, run: [0m[32;49mpip install --upgrade pip[0m
Note: you may need to restart the kernel to use updated packages.


### Step 2: Imports and Setup

In [2]:
import os
import json
import numpy as np
from langchain.docstore.document import Document
from langchain_community.embeddings import SentenceTransformerEmbeddings
from langchain_community.vectorstores import FAISS
from transformers import pipeline, AutoTokenizer, AutoModelForCausalLM
import torch
import time

# Set environment variable to avoid tokenizer parallelism warnings
os.environ['TOKENIZERS_PARALLELISM'] = 'false'

  from .autonotebook import tqdm as notebook_tqdm


### Step 3: Load and Prepare Documents from `knowledge_base.json`

This function loads our pre-curated `knowledge_base.json` file. It transforms each JSON object into a LangChain `Document` object, which separates the text content (for searching) from the metadata (for filtering and citations).

In [3]:
JSON_FILE_PATH = "knowledge_base.json"
documents = []

def load_and_prepare_documents(file_path):
    """Loads the curated JSON file and transforms it into a list of LangChain Documents."""
    try:
        with open(file_path, 'r', encoding='utf-8') as f:
            data = json.load(f)
        
        doc_list = []
        for item in data:
            # Check if the item has the expected structure
            if 'full_text' in item and 'metadata' in item:
                
                # Use the 'full_text' field for embedding and search
                page_content = item['full_text']
                
                # Use the 'metadata' object directly
                final_metadata = item['metadata']
                
                # Add other top-level fields to metadata for reference
                final_metadata['id'] = item.get('id', 'N/A')
                final_metadata['intervention_name'] = item.get('intervention_name', 'N/A')

                doc_list.append(Document(page_content=page_content, metadata=final_metadata))
            else:
                print(f"Skipping item with unexpected format: {item.get('id', 'Unknown ID')}")

        return doc_list

    except FileNotFoundError:
        print(f"ERROR: '{file_path}' not found. Please make sure it's in the same directory.")
        return []
    except json.JSONDecodeError:
        print(f"ERROR: '{file_path}' is not a valid JSON file. Please check for syntax errors.")
        return []
    except Exception as e:
        print(f"Error loading and preparing documents: {e}")
        return []

documents = load_and_prepare_documents(JSON_FILE_PATH)
if documents:
    print(f"Successfully loaded and prepared {len(documents)} documents from '{JSON_FILE_PATH}'.")
    print("\n--- Example Document --- ")
    print(f"Content: {documents[0].page_content}")
    print(f"Metadata: {documents[0].metadata}")
    print("------------------------")

Successfully loaded and prepared 240 documents from 'knowledge_base.json'.

--- Example Document --- 
Content: STOP Sign. The 'STOP' sign, used on Minor Roads intersecting Major Roads, requires vehicles to stop before entering and proceed only when safe. It is octagonal with a red background, a white border, and "STOP" written centrally in white. Installed on the left side of the approach, it should be placed close to the stop line, typically 1.5 m in advance, without impairing visibility of the Major Road.
The dimensions vary by approach speed: up to 50 km/h, 750 mm height, 25 mm border, 175 mm font; 51–65 km/h, 900 mm height, 30 mm border, 210 mm font; and over 65 km/h, 1200 mm height, 40 mm border, 280 mm font.
Metadata: {'type': 'Standard', 'category': 'Road Sign', 'common_problems_solved': 'Damaged', 'source_reference': 'IRC:67-2022 - Clause 14.4', 'id': 'std-1', 'intervention_name': 'STOP Sign'}
------------------------


### Step 4: Generate Embeddings and Create Vector Store

This step embeds the `Document` objects we created. We use `FAISS.from_documents` to build the vector database in memory.

In [4]:
vector_store = None
if documents:
    print("Loading embedding model and creating vector store...")

    # Using a reliable and high-performing open-source embedding model
    embeddings = SentenceTransformerEmbeddings(model_name='BAAI/bge-large-en-v1.5')
    
    # Create the vector store from our list of Document objects
    vector_store = FAISS.from_documents(documents, embeddings)
    
    print("Vector store created successfully.")
else:
    print("Skipping vector store creation because no documents were loaded from the JSON file.")

Loading embedding model and creating vector store...


  embeddings = SentenceTransformerEmbeddings(model_name='BAAI/bge-large-en-v1.5')


Vector store created successfully.


### Step 5: Load Llama 3.2 3B Model

This step loads the local LLM to act as the 'brain' of our RAG system. You must add your Hugging Face token to download the model.

In [None]:

HF_TOKEN = "" # <-- PASTE YOUR TOKEN HERE

if HF_TOKEN == "YOUR_HUGGING_FACE_TOKEN_GOES_HERE":
    print("="*50)
    print("ERROR: Please set your Hugging Face token in the HF_TOKEN variable.")
    print("Get one from: https://huggingface.co/settings/tokens")
    print("="*50)
    llm_generator = None
else:
    print("Loading Llama 3.2 3B LLM for generation...")
    if torch.cuda.is_available():
        device = "cuda"
    elif torch.backends.mps.is_available():
        device = "mps"
    else:
        device = "cpu"
    print(f"Using device: {device}")

    model_name = "meta-llama/Llama-3.2-3B-Instruct"
    tokenizer = AutoTokenizer.from_pretrained(model_name, token=HF_TOKEN)
    model = AutoModelForCausalLM.from_pretrained(
        model_name,
        device_map=device,
        torch_dtype=torch.bfloat16,
        token=HF_TOKEN
    )

    if tokenizer.pad_token is None:
        tokenizer.pad_token = tokenizer.eos_token

    llm_generator = pipeline(
        "text-generation",
model=model,
tokenizer=tokenizer,
max_new_tokens=1024 # Increased token limit for detailed answers
    )
    print("Llama 3.2 3B LLM loaded successfully.")

Loading Llama 3.2 3B LLM for generation...
Using device: mps


`torch_dtype` is deprecated! Use `dtype` instead!
Loading checkpoint shards: 100%|██████████| 2/2 [00:05<00:00,  2.68s/it]
Device set to use mps


Llama 3.2 3B LLM loaded successfully.


### Step 6: Define Intents and **Topic-Wise** Prompts

This is the new 'brain' of the bot. We're replacing the 'study' intents with 'road safety' intents. 

Crucially, the `ask_question` prompt is now a **synthesis prompt**. It instructs the AI to find *all* relevant pieces of information, **group them by topic** (like you requested!), and present a comprehensive answer. It also **explicitly commands the AI to cite its sources** from the metadata.

In [6]:
def detect_intent_simple(query):
    """A simple keyword-based intent detection for the road safety bot."""
    query_lower = query.lower()
    
    # Intent to find a specific standard or specification
    if 'what is the standard for' in query_lower or 'standard for' in query_lower or 'rule for' in query_lower or 'specification for' in query_lower:
        return 'find_standard'
    
    # Intent to find an intervention for a specific problem
    if 'what should i do about' in query_lower or 'how to fix' in query_lower or 'intervention for' in query_lower or 'my sign is' in query_lower or 'marking is' in query_lower or 'problem is' in query_lower:
        return 'find_intervention'
    
    # Intent to compare two things
    if 'difference between' in query_lower or 'compare' in query_lower:
        return 'compare_interventions'
    
    # Default: A general question
    return 'ask_question' 

def generate_content(context_docs, query, intent):
    """Generates a response based on the detected intent, including citations."""
    
    # We format the context to be crystal clear for the LLM
    context_text = "\n\n---\n".join([
        f"Source: {doc.metadata.get('source_reference', 'N/A')}\nContent: {doc.page_content}"
        for doc in context_docs
    ])

    system_prompt = ""
    user_prompt = ""

    if intent == 'ask_question' or intent == 'find_standard' or intent == 'find_intervention' or intent == 'compare_interventions':
        system_prompt = (
            "You are an Expert Road Safety Analyst for the National Road Safety Hackathon. "
            "Your task is to provide a comprehensive, well-structured answer based ONLY on the provided context, which contains IRC standards and best practices. "
            "Do not just answer with the first relevant fact. **Synthesize** information from *all* the provided context documents. "
            "If you find multiple relevant interventions (e.g., for 'speeding'), **group them by topic** (e.g., 'Engineering Solutions', 'Enforcement Solutions'). "
            "Be precise, actionable, and **you MUST cite your sources** for every claim you make, using the format [Source]. "
            "If the context does not contain the information, state that you cannot answer from the provided knowledge base."
        )
        if intent == 'find_standard':
            user_prompt = f"Context from Knowledge Base:\n{context_text}\n\nQuestion: {query}\n\nTask: Provide the specific standard or regulation for the user's question. Be detailed and quote the exact specifications (like dimensions, placement, etc.) if available. Cite your source."
        elif intent == 'find_intervention':
             user_prompt = f"Context from Knowledge Base:\n{context_text}\n\nQuestion: {query}\n\nTask: The user has a problem. Recommend specific interventions or standards from the context to solve it. Explain what to do and how it helps. Cite your source."
        else: # 'ask_question' or 'compare_interventions'
            user_prompt = f"Context from Knowledge Base:\n{context_text}\n\nQuestion: {query}\n\nTask: Answer the user's question by synthesizing all relevant context. If comparing, create a clear comparison. Cite your sources."

    # --- This section is from your original notebook, modified slightly --- #
    elif intent == 'request_summary':
        system_prompt = "You are an expert Road Safety Analyst. Create a concise, easy-to-read summary of the key points from the following context. Use bullet points and cite your sources [Source]."
        user_prompt = f"Context:\n{context_text}\n\nSummary:"
    
    elif intent == 'request_quiz':
        system_prompt = "You are an expert Road Safety Analyst. Create an engaging multiple-choice quiz with 3 to 4 questions based on the following context. For each question, provide four options (A, B, C, D), and clearly state the correct answer, citing the source [Source]."
        user_prompt = f"Context:\n{context_text}\n\nQuiz:"
    # ------------------------------------------------------------------ #

    messages = [
        {"role": "system", "content": system_prompt},
        {"role": "user", "content": user_prompt},
    ]
    
    try:
        prompt = tokenizer.apply_chat_template(messages, tokenize=False, add_generation_prompt=True)
        response = llm_generator(prompt, num_return_sequences=1) # Ensure only one response is generated
        full_response_text = response[0]['generated_text']
        
        # The model's answer is what comes after the prompt
        answer = full_response_text.split(prompt)[-1].strip()
        return answer, [doc.metadata.get('source', 'N/A') for doc in context_docs]
    
    except Exception as e:
        print(f"Error during LLM generation: {e}")
        return f"Sorry, I encountered an error while generating the response: {e}", []

### Step 7: The RAG Query Function

This function ties everything together. It detects the intent, retrieves context, and then either answers the question or generates content. 

I've increased `k=5` to retrieve *more* documents. This is key for the "topic-wise" synthesis, as it gives the LLM more information to group and summarize.

In [7]:
def process_query(query):
    if not vector_store or not llm_generator:
        print("ERROR: Vector store or LLM is not initialized. Please run all preceding steps, including setting your HF_TOKEN.")
        return

    print(f"\n{'='*20}")
    print(f"Query: {query}")
    
    intent = detect_intent_simple(query)
    print(f"Detected Intent: {intent}")
    
    # Retrieve relevant documents from the vector store
    retrieved_docs = vector_store.similarity_search(query, k=5) # Increased to k=5 for better synthesis
    
    print(f"Retrieved {len(retrieved_docs)} relevant documents...")
    
    start_time = time.time()
    
    # Generate the answer using the LLM and the retrieved context
    final_answer, sources = generate_content(retrieved_docs, query, intent)
    end_time = time.time()
    
    content_type = "Response"
    if intent == 'ask_question' or intent == 'find_standard' or intent == 'find_intervention' or intent == 'compare_interventions':
        content_type = 'Answer'
    elif intent == 'request_summary':
        content_type = 'Summary'
    elif intent == 'request_quiz':
        content_type = 'Quiz'

    print(f"\n✅ {content_type} (generated in {end_time - start_time:.2f}s):")
    print(final_answer)
    
    # This is for your reference, to see what the bot was 'thinking'
    print("\n--- Retrieved Sources ---")
    unique_sources = sorted(list(set(sources)))
    for i, source in enumerate(unique_sources):
        print(f"Source {i+1}: {source}")
    print(f"{'='*20}\n")

### Step 8: Example Queries

Let's test our new Road Safety GPT! Uncomment the queries to run them. **You must run the cells above this one first.**

In [17]:
process_query("My sign is faded and not retro-reflective. What's the rule for that?")



Query: My sign is faded and not retro-reflective. What's the rule for that?
Detected Intent: find_standard
Retrieved 5 relevant documents...

✅ Answer (generated in 23.90s):
Based on the provided context, I can provide the relevant information for the user's question. According to the IRC:67-2022 - Clause 14.6.22, the standard for retro-reflective regulatory signs, including the signs for "STOP", "GIVE WAY", and "SPEED LIMIT", states that:

"The retro-reflective sheeting used shall cover the entire surface, be weather-resistant, colorfast, and free from defects, with a minimum coefficient of retro-reflection in accordance with ASTM D 4956 standards."

This implies that the sign's retro-reflective sheeting must be of a certain quality and meet specific standards.

However, it does not explicitly state the rule for replacing or replacing faded and non-retro-reflective signs.

But, according to the IRC:67-2022 - Clause 13.3, the sign shall be replaced either at the end of the warranty pe

In [18]:
# --- Test Queries (Uncomment and run one by one in a new cell) --- 

# 1. Test a specific standard (from your table data)
# process_query("What is the standard for a STOP sign?")

# 2. Test a problem-based intervention (from your table data)
# process_query("My sign is faded and not retro-reflective. What's the rule for that?")

# 3. Test a broad, topic-wise question (should get multiple answers from your intervention data)
# process_query("What interventions can I use for speeding in a school zone?")

# 4. Test a comparison question (for topic-wise synthesis)
# process_query("What is the difference between a speed hump and a rumble strip?")

# 5. Test a quiz intent
# process_query("quiz me on road signs")

In [19]:
process_query("What interventions can I use for speeding in a school zone?")


Query: What interventions can I use for speeding in a school zone?
Detected Intent: ask_question
Retrieved 5 relevant documents...

✅ Answer (generated in 34.43s):
Based on the provided context, here are the interventions for speeding in a school zone:

**Engineering Solutions:**

1. **Speed Humps**: Vertical calming devices that reduce vehicle speeds to safe walking speeds, particularly effective on minor roads around schools. (Source: WHO Inspired)
2. **Speed Limit 20/30 kmph Zones**: Strictly enforced low-speed zones using signs and calming measures, ideal for school opening and closing hours or dense urban corridors. (Source: WHO Inspired)

**Enforcement Solutions:**

1. **School Zone Enforcement Cameras**: CCTV or radar-based systems monitoring speeding and improper parking, particularly effective in accident-prone or high-volume school corridors. (Source: MoRTH Inspired)

**Comparison:**

| Intervention | Effectiveness | Suitable for |
| --- | --- | --- |
| Speed Humps | Reduces

In [8]:
process_query("My road markings are faded and not retro-reflective. What's the rule for that?")


Query: My road markings are faded and not retro-reflective. What's the rule for that?
Detected Intent: find_standard
Retrieved 5 relevant documents...

✅ Answer (generated in 33.03s):
Based on the provided context, the rule for faded and non-retro-reflective road markings is outlined in the IRC:35-2015 - Clause 2.7 and PWD Inspired.

According to IRC:35-2015 - Clause 2.7, road markings must be clearly visible day and night, providing essential guidance, especially on unlit roads. Drivers shall detect markings at least two seconds ahead and that minimum preview distance with respect to speed is as follows: For <30 km/h: 17 m; 30–40 km/h: 22 m; 40–50 km/h: 28 m; 50–65 km/h: 36 m; 65–70 km/h: 39 m; 70–80 km/h: 44 m; 80–90 km/h: 50 m; 90–100 km/h: 56 m; 100–110 km/h: 61 m; 110–120 km/h: 67 m.

However, there is no specific rule for faded and non-retro-reflective markings. The only relevant information provided is that visibility improves with wider lines, higher mark-to-gap ratios, and in