# RAG Pipeline for Motor Insurance Claim Analysis

A step-by-step tutorial that builds a Retrieval-Augmented Generation (RAG) system, which reads insurance PDFs and answers questions about them.

**Author:** Parham Imanzadeh Charandabi | University of Salford

## What We're Building

A system that:
1. **Loads** insurance PDF documents
2. **Splits** them into manageable chunks
3. **Embeds** each chunk into a vector (list of numbers capturing meaning)
4. **Stores** vectors in a FAISS index for fast similarity search
5. **Retrieves** the most relevant chunks when a user asks a question
6. **Generates** a natural language answer using Google Gemini AI

This is RAG: **R**etrieval-**A**ugmented **G**eneration.

## Cell 1: Install Libraries

Each library has a specific job:
- **langchain** — Orchestration framework that connects all RAG components
- **langchain-google-genai** — Connects LangChain to Google's Gemini AI
- **langchain-community** — Extra tools like PDF loaders
- **langchain-text-splitters** — Splits documents into chunks
- **faiss-cpu** — Facebook's fast vector similarity search (our 'smart filing cabinet')
- **pypdf** — Reads PDF files
- **google-generativeai** — Google's AI library for Gemini
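
The install cell prints its success message whether or not `pip` succeeded, so a quick check can help. The sketch below uses the standard library's `importlib.metadata` to list which of the packages above are missing; the helper name `missing_packages` is hypothetical.

```python
from importlib.metadata import PackageNotFoundError, version

# Distribution names taken from the pip command in this cell.
REQUIRED = [
    "langchain",
    "langchain-google-genai",
    "langchain-community",
    "langchain-text-splitters",
    "faiss-cpu",
    "pypdf",
    "google-generativeai",
]

def missing_packages(names):
    """Return the distribution names that are not installed (hypothetical helper)."""
    missing = []
    for name in names:
        try:
            version(name)  # raises PackageNotFoundError when the package is absent
        except PackageNotFoundError:
            missing.append(name)
    return missing

missing = missing_packages(REQUIRED)
print("Missing:", ", ".join(missing) if missing else "none")
```

If anything is listed as missing, re-run the install cell without `-q` to see the full pip output.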

In [None]:
!pip install langchain langchain-google-genai langchain-community langchain-text-splitters faiss-cpu pypdf google-generativeai -q

print("Install command finished. Check the output above for errors.")

## Cell 2: Upload PDFs

Upload the 5 sample insurance documents from the `sample_docs/` folder.
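
`files.upload()` only exists inside Colab. When running elsewhere, one option is to copy the PDFs from a local folder instead. The following is a minimal sketch under that assumption: the helper name `collect_pdfs` is hypothetical, and the demo uses throwaway placeholder files in a temporary directory rather than the real sample documents.

```python
import shutil
import tempfile
from pathlib import Path

def collect_pdfs(src_dir, dst_dir):
    """Copy every PDF in src_dir into dst_dir and return the copied file names."""
    src, dst = Path(src_dir), Path(dst_dir)
    dst.mkdir(parents=True, exist_ok=True)
    copied = []
    for path in sorted(src.iterdir()):
        # Match the extension case-insensitively so "CLAIM.PDF" is not skipped.
        if path.is_file() and path.suffix.lower() == ".pdf":
            shutil.copy2(path, dst / path.name)
            copied.append(path.name)
    return copied

# Demo with throwaway files; in practice src_dir would be sample_docs/
# and dst_dir would be insurance_docs/.
with tempfile.TemporaryDirectory() as tmp:
    src = Path(tmp) / "sample_docs"
    src.mkdir()
    (src / "policy.pdf").write_bytes(b"%PDF-1.4 placeholder")
    (src / "notes.txt").write_text("not a PDF")
    copied = collect_pdfs(src, Path(tmp) / "insurance_docs")

print(copied)
```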

In [None]:
from google.colab import files
import os

# Create a folder to store our documents
os.makedirs("insurance_docs", exist_ok=True)

print("Click the button below to upload your 5 PDF files:")
uploaded = files.upload()

# Move uploaded files to our folder
for filename in uploaded:
    with open(f"insurance_docs/{filename}", "wb") as f:
        f.write(uploaded[filename])
    print(f"  Saved: {filename}")

print(f"\nTotal files uploaded: {len(uploaded)}")

## Cell 3: Set Up API Key

The API key is your 'password' for Google Gemini AI.
Get a free key from https://aistudio.google.com/apikey

**Security tip:** Use Colab's Secrets feature (key icon in the left sidebar) instead of pasting the key directly into the notebook.
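
Whichever option you pick, later cells only work if the key is really set. The sketch below shows one way to fail early with a clear message. The helper name `load_api_key` is hypothetical, and the demo passes a plain dictionary with a made-up value so no real key is needed.

```python
import os

PLACEHOLDER = "PASTE_YOUR_API_KEY_HERE"

def load_api_key(env=os.environ, name="GOOGLE_API_KEY"):
    """Return the key stored under `name`, or raise if it is missing or a placeholder."""
    key = (env.get(name) or "").strip()
    if not key or key == PLACEHOLDER:
        raise RuntimeError(f"{name} is not set; add it before running later cells.")
    return key

# Demo with a plain dict and a made-up value, so nothing secret is involved.
print(load_api_key({"GOOGLE_API_KEY": "demo-key-123"}))
```

Raising an exception stops "Run all" at this cell, whereas a printed warning lets the notebook continue and fail later with a less obvious error.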

In [None]:
import os

# OPTION 1 (Recommended - Secure):
# Click the KEY icon in Colab's left sidebar
# Add a secret called GOOGLE_API_KEY with your key
# Then uncomment the next two lines:
# from google.colab import userdata
# os.environ["GOOGLE_API_KEY"] = userdata.get("GOOGLE_API_KEY")

# OPTION 2 (Quick testing only - delete before sharing):
# If you use Option 1, comment out the next line so it does not overwrite your key.
os.environ["GOOGLE_API_KEY"] = "PASTE_YOUR_API_KEY_HERE"

# Verify
if os.environ.get("GOOGLE_API_KEY") and os.environ["GOOGLE_API_KEY"] != "PASTE_YOUR_API_KEY_HERE":
    print("API Key is set! Ready to go.")
else:
    print("WARNING: Please set your API key above before continuing.")

## Cell 4: Load PDFs (RAG Step 1: LOAD)

PyPDFLoader reads each PDF and returns one document per page.
Each document also records which file it came from in its metadata.
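
To make the metadata idea concrete, here is a small sketch of how per-page records can be grouped by file. The `Page` class is a stand-in that mirrors the `page_content` and `metadata` attributes used in this cell; the `page` key, the file names, and the text are assumptions made up for illustration.

```python
import os
from collections import Counter
from dataclasses import dataclass, field

@dataclass
class Page:
    """Stand-in for a loaded page: text plus provenance metadata."""
    page_content: str
    metadata: dict = field(default_factory=dict)

def pages_per_source(pages):
    """Count pages per file name, using each page's 'source' metadata."""
    return Counter(os.path.basename(p.metadata["source"]) for p in pages)

# Invented sample records, for illustration only.
sample = [
    Page("Policy wording ...", {"source": "insurance_docs/policy.pdf", "page": 0}),
    Page("Exclusions ...", {"source": "insurance_docs/policy.pdf", "page": 1}),
    Page("Claim form ...", {"source": "insurance_docs/claim.pdf", "page": 0}),
]

for name, count in sorted(pages_per_source(sample).items()):
    print(f"{name}: {count} page(s)")
```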

In [None]:
from langchain_community.document_loaders import PyPDFLoader

all_documents = []

for filename in sorted(os.listdir("insurance_docs")):
    if filename.lower().endswith(".pdf"):
        filepath = f"insurance_docs/{filename}"
        loader = PyPDFLoader(filepath)
        pages = loader.load()
        all_documents.extend(pages)
        print(f"Loaded: {filename} ({len(pages)} pages)")

print(f"\nTotal pages loaded: {len(all_documents)}")

# Peek at the first document
print(f"\n--- Example ---")
print(f"Source: {all_documents[0].metadata['source']}")
print(f"Preview: {all_documents[0].page_content[:200]}...")

## Cell 5: Split into Chunks (RAG Step 2: CHUNK)

We split documents into smaller pieces (at most 500 characters each) so we can
retrieve only the relevant parts instead of feeding entire documents to the AI.

- **chunk_size=500**: Maximum chunk length in characters, roughly 80 to 100 English words
- **chunk_overlap=50**: Up to 50 characters are repeated between neighbouring chunks, so text near a boundary keeps some of its context
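
The effect of the two parameters is easiest to see with a simplified splitter. The sketch below cuts text into fixed windows and is not the algorithm `RecursiveCharacterTextSplitter` uses (that one prefers to break at the listed separators); the function name `sliding_chunks` is hypothetical.

```python
def sliding_chunks(text, chunk_size=500, chunk_overlap=50):
    """Cut text into windows of chunk_size characters, each sharing
    chunk_overlap characters with the previous window."""
    if not 0 <= chunk_overlap < chunk_size:
        raise ValueError("chunk_overlap must be >= 0 and smaller than chunk_size")
    step = chunk_size - chunk_overlap
    chunks = []
    for start in range(0, len(text), step):
        chunks.append(text[start:start + chunk_size])
        if start + chunk_size >= len(text):
            break  # this window already reached the end of the text
    return chunks

# 1,200 characters of dummy text.
text = "".join(str(i % 10) for i in range(1200))
chunks = sliding_chunks(text)

print([len(c) for c in chunks])            # chunk lengths
print(chunks[0][-50:] == chunks[1][:50])   # the tail of one chunk opens the next
```

With a step of 450 characters (500 minus 50), the dummy text yields three chunks, and the last 50 characters of each chunk reappear at the start of the following one.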

In [None]:
from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(
    chunk_size=500,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "]
)

chunks = splitter.split_documents(all_documents)

print(f"Split {len(all_documents)} pages into {len(chunks)} chunks")
print(f"\n--- Example chunk ---")
print(f"From: {chunks[0].metadata['source']}")
print(f"Text: {chunks[0].page_content[:300]}...")

## Cell 6: Create Embeddings (RAG Step 3: EMBED)

This is the magic step. An **embedding** converts text into a vector
(list of numbers) that captures its meaning. The numbers below are illustrative, not real model output.

- "vehicle hire cost" → [0.23, -0.41, 0.87, ...]
- "car rental price" → [0.21, -0.39, 0.85, ...] (similar meaning = similar numbers!)
- "the weather today" → [0.91, 0.12, -0.55, ...] (different meaning = different numbers)
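
"Similar numbers" can be measured. One common measure is cosine similarity, which is close to 1 for vectors pointing the same way and lower (even negative) for unrelated ones. The sketch below applies it to the three illustrative vectors above, truncated to the three values shown; real embeddings have many more dimensions.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# The illustrative (not real) vectors from the list above.
vehicle_hire = [0.23, -0.41, 0.87]
car_rental = [0.21, -0.39, 0.85]
weather = [0.91, 0.12, -0.55]

print("hire vs rental :", round(cosine_similarity(vehicle_hire, car_rental), 3))
print("hire vs weather:", round(cosine_similarity(vehicle_hire, weather), 3))
```

The first pair scores very close to 1, while the second pair scores below zero.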

In [None]:
from langchain_google_genai import GoogleGenerativeAIEmbeddings

embedding_model = GoogleGenerativeAIEmbeddings(
    model="models/gemini-embedding-001"
)

# Test with a simple example
test_embedding = embedding_model.embed_query("vehicle hire rate")
print(f"The phrase 'vehicle hire rate' becomes a vector of {len(test_embedding)} numbers")
print(f"First 10 numbers: {[round(x, 4) for x in test_embedding[:10]]}")
print(f"\nThink of it as coordinates in a {len(test_embedding)}-dimensional space!")

## Cell 7: Store in FAISS (RAG Step 4: STORE)

FAISS (Facebook AI Similarity Search) stores all our vectors and lets us
search by meaning, not just keywords.

This cell embeds every chunk (which calls the Gemini embedding API) and indexes the resulting vectors. It may take 30-60 seconds.
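
Conceptually, the simplest kind of vector index stores every vector and compares the query against all of them. The toy class below sketches that idea with Euclidean distance, which is how a flat FAISS index behaves; which index type LangChain creates here is not covered in this tutorial, so treat the match as an assumption. The class name `FlatIndex` and the vectors are illustrative.

```python
import heapq
import math

class FlatIndex:
    """Toy exhaustive index: stores vectors and scans all of them per query."""

    def __init__(self):
        self.vectors = []
        self.payloads = []

    def add(self, vector, payload):
        self.vectors.append(list(vector))
        self.payloads.append(payload)

    def search(self, query, k=3):
        """Return the k (distance, payload) pairs nearest to query, closest first."""
        scored = (
            (math.dist(query, vec), payload)
            for vec, payload in zip(self.vectors, self.payloads)
        )
        return heapq.nsmallest(k, scored, key=lambda pair: pair[0])

index = FlatIndex()
index.add([0.23, -0.41, 0.87], "vehicle hire cost")
index.add([0.21, -0.39, 0.85], "car rental price")
index.add([0.91, 0.12, -0.55], "the weather today")

for distance, text in index.search([0.23, -0.40, 0.86], k=2):
    print(f"{distance:.3f}  {text}")
```

FAISS does the same job with optimised native code, which is what keeps search fast as the number of chunks grows.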

In [None]:
from langchain_community.vectorstores import FAISS

print("Creating vector store (may take 30-60 seconds)...")

vector_store = FAISS.from_documents(chunks, embedding_model)

print(f"Vector store created with {len(chunks)} entries!")
print("Documents are now searchable by meaning.")

## Cell 8: Test Retrieval (RAG Step 5: RETRIEVE)

Before connecting the AI, let's verify FAISS can find relevant chunks.
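
Eyeballing the results works for one query. For several queries, a small hit-rate check can make the same verification repeatable. The sketch assumes you already know which file should answer each question; the function name `hit_rate_at_k`, the file names, and the rankings are all invented for illustration.

```python
def hit_rate_at_k(retrieved, expected, k=3):
    """Fraction of queries whose expected source appears in the top-k results.

    retrieved: dict mapping query -> ranked list of source file names
    expected:  dict mapping query -> the source file name that should be found
    """
    if not expected:
        return 0.0
    hits = sum(
        1
        for query, source in expected.items()
        if source in retrieved.get(query, [])[:k]
    )
    return hits / len(expected)

# Invented rankings, for illustration only.
retrieved = {
    "daily hire rate?": ["hire_invoice.pdf", "policy.pdf", "claim.pdf"],
    "witness statement?": ["policy.pdf", "claim.pdf", "hire_invoice.pdf", "witness.pdf"],
}
expected = {
    "daily hire rate?": "hire_invoice.pdf",
    "witness statement?": "witness.pdf",
}

print(hit_rate_at_k(retrieved, expected, k=3))
print(hit_rate_at_k(retrieved, expected, k=4))
```

A low hit rate at small `k` suggests revisiting the chunk size or raising `k` before blaming the language model for poor answers.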

In [None]:
print("=" * 60)
print("TESTING RETRIEVAL (search without AI)")
print("=" * 60)

query = "What was the daily hire rate for the Ford Focus?"
results = vector_store.similarity_search(query, k=3)

print(f"\nQuery: '{query}'")
print(f"Found {len(results)} relevant chunks:\n")

for i, result in enumerate(results):
    source = result.metadata['source'].split('/')[-1]
    print(f"  Result {i+1} (from {source}):")
    print(f"  {result.page_content[:150]}...")
    print()

## Cell 9: Build RAG Pipeline (RAG Step 6: GENERATE)

Now we connect everything: question → FAISS retrieval → prompt → Gemini → answer.

We build this manually (instead of using LangChain's RetrievalQA chain) so every
step is transparent and understandable.
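
Because the flow is just "retrieve, build prompt, generate", it can be exercised offline by passing the retriever and the model in as plain functions. The sketch below is one way to do that and is not the cell's actual code: `answer_question`, `fake_retrieve`, and `fake_generate` are hypothetical names, and the corpus is invented.

```python
def answer_question(question, retrieve, generate, k=4):
    """Run retrieve -> build prompt -> generate, returning (answer, sources).

    retrieve(question, k) must return a list of (source, text) pairs;
    generate(prompt) must return a string. Both are supplied by the caller.
    """
    hits = retrieve(question, k)
    context = "\n\n".join(text for _, text in hits)
    prompt = f"DOCUMENTS:\n{context}\n\nQUESTION: {question}\n\nANSWER:"
    answer = generate(prompt)
    sources = sorted({source for source, _ in hits})
    return answer.strip(), sources

# Offline stand-ins (invented data) so the flow runs without an API key.
def fake_retrieve(question, k):
    corpus = [
        ("policy.pdf", "Hire rates are capped per vehicle group."),
        ("claim.pdf", "The claimant hired a replacement vehicle."),
    ]
    return corpus[:k]

def fake_generate(prompt):
    return f" (stub answer; prompt was {len(prompt)} characters) "

answer, sources = answer_question("What is the cap?", fake_retrieve, fake_generate)
print(answer)
print(sources)
```

Swapping the stand-ins for `vector_store.similarity_search` and a Gemini call gives the same structure as the `ask()` function in this cell.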

In [None]:
import google.generativeai as genai

# Connect to Gemini AI using the key set in Cell 3
genai.configure(api_key=os.environ["GOOGLE_API_KEY"])
model = genai.GenerativeModel("gemini-2.5-flash")

def ask(question):
    """Ask a question about the insurance documents."""
    print(f"\nQ: {question}")
    print("-" * 50)

    # STEP 1: Retrieve — find the 4 most relevant chunks
    relevant_chunks = vector_store.similarity_search(question, k=4)

    # STEP 2: Build context - label each chunk with its file name so the model can cite it
    context = ""
    sources = set()
    for chunk in relevant_chunks:
        source = chunk.metadata['source'].split('/')[-1]
        context += f"[Source: {source}]\n{chunk.page_content}\n\n"
        sources.add(source)

    # STEP 3: Create prompt — instructions + context + question
    prompt = f"""You are an expert motor insurance claims analyst working for Whichrate Consulting Ltd.
Use ONLY the documents below to answer. If the answer is not in the documents, say
"I cannot find this information in the available documents."
Cite which document your answer comes from. Be precise with numbers, dates, and amounts.

DOCUMENTS:
{context}

QUESTION: {question}

ANSWER:"""

    # STEP 4: Generate — send to Gemini and get the answer
    response = model.generate_content(prompt)
    answer = response.text

    print(f"\nA: {answer.strip()}")
    print(f"\nSources: {', '.join(sorted(sources))}")
    print("=" * 60)

print("RAG pipeline ready! The ask() function is now available.")

## Cell 10: Test with Example Questions

In [None]:
ask("What was the total claim value for Mrs Sarah Thompson's case?")

In [None]:
ask("How does the CHO rate compare to the BHR average for Insurance Group 18?")

In [None]:
ask("What fraud indicators were identified in the Ahmed Hassan claim?")

In [None]:
ask("What is the maximum daily hire rate for a BMW 3 Series according to the policy?")

In [None]:
ask("What did the witness say about the BMW driver's behaviour?")

## Cell 11: Interactive Mode

Type your own questions! Type 'quit', 'exit' or 'q' to stop.

In [None]:
print("=" * 60)
print("INTERACTIVE MODE - Ask anything about the insurance claims")
print("Type 'quit', 'exit' or 'q' to stop")
print("=" * 60)

while True:
    # Strip whitespace first so inputs like " quit " still end the loop
    question = input("\nYour question: ").strip()
    if question.lower() in ['quit', 'exit', 'q']:
        print("Goodbye!")
        break
    if question:
        ask(question)

## Cell 12: Save Vector Store

Save the FAISS index so you don't need to re-process PDFs next time.
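
Persisting a vector store means writing both the vectors and the text they belong to, then reading them back unchanged. The toy round trip below illustrates this with JSON; the helper names `save_store` and `load_store` are hypothetical, and this is not the file format FAISS uses. As far as I know, `save_local` stores the document texts with Python's pickle format, which is why `load_local` asks for `allow_dangerous_deserialization=True`. Only load index folders you created yourself.

```python
import json
import tempfile
from pathlib import Path

def save_store(path, texts, vectors):
    """Write texts and their vectors to one JSON file."""
    records = [{"text": t, "vector": v} for t, v in zip(texts, vectors)]
    Path(path).write_text(json.dumps(records), encoding="utf-8")

def load_store(path):
    """Read the file back into parallel lists of texts and vectors."""
    records = json.loads(Path(path).read_text(encoding="utf-8"))
    return [r["text"] for r in records], [r["vector"] for r in records]

# Illustrative data, not real embeddings.
texts = ["vehicle hire cost", "the weather today"]
vectors = [[0.23, -0.41, 0.87], [0.91, 0.12, -0.55]]

with tempfile.TemporaryDirectory() as tmp:
    store_path = Path(tmp) / "toy_store.json"
    save_store(store_path, texts, vectors)
    loaded_texts, loaded_vectors = load_store(store_path)

print(loaded_texts == texts and loaded_vectors == vectors)
```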

In [None]:
# Save
vector_store.save_local("insurance_faiss_index")
print("Vector store saved!")
print("To reload later (skip Cells 4, 5 and 7, but still run Cells 3 and 6 and import FAISS):")
print('  vector_store = FAISS.load_local("insurance_faiss_index", embedding_model, allow_dangerous_deserialization=True)')