<a href="https://colab.research.google.com/github/MariyahW/Outamation_Externship/blob/main/Hands_On_Query_Processing_%26_Retrieval_Optimization.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [12]:
!pip install llama-index llama-index-llms-gemini pymupdf llama-index-embeddings-huggingface

# from google.colab import files
# uploaded=files.upload()

import fitz  # PyMuPDF

# Load PDF document
doc = fitz.open("sample_data/sample_docs/contract.pdf")

# Extract text from all pages
text = "\n".join([page.get_text() for page in doc])

print(f"Extracted {len(text.split())} words from the PDF.")

from llama_index.llms.gemini import Gemini
from llama_index.core.llms import ChatMessage

# Set up Gemini API key
import os
os.environ["GOOGLE_API_KEY"] = "AIzaSyBRMSVStz-ngxmdlBHL1tidWoKiX41EdiM"


import google.generativeai as genai

# ---------------------------
# 1) API KEY (paste your NEW rotated key here)
# ---------------------------

genai.configure(api_key=os.environ["GOOGLE_API_KEY"])

# ---------------------------
# 2) Discover available models (no guessing)
# ---------------------------
valid_models = []
for m in genai.list_models():
    if "generateContent" in m.supported_generation_methods:
        valid_models.append(m.name)

if not valid_models:
    raise RuntimeError("No models available for this API key. Check your key/project access.")

print("Valid models:")
for m in valid_models:
    print(" -", m)
MODEL_NAME = valid_models[0]
# Initialize Gemini LLM
llm = Gemini(model=MODEL_NAME)

# Define query rewriting function
def rewrite_query(user_query):
    messages = [
        ChatMessage(role="system", content="Rewrite this query for improved retrieval relevance."),
        ChatMessage(role="user", content=user_query),
    ]
    response = llm.chat(messages)
    return response.message.content

# Test query rewriting
query = "What are the penalties for late payments?"
expanded_query = rewrite_query(query)

print(f"Original Query: {query}")
print(f"Expanded Query: {expanded_query}")

from llama_index.core import SimpleDirectoryReader, VectorStoreIndex
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.core.retrievers import BM25Retriever, VectorIndexRetriever
from llama_index.core.query_engine import RetrieverQueryEngine
from llama_index.core.retrievers import HybridRetriever

# Load documents from the directory
documents = SimpleDirectoryReader("sample_docs").load_data()

# Initialize Hugging Face embedding model
embed_model = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

# Create a document store
docstore = SimpleDocumentStore()

# Create a vector index for embedding-based retrieval
vector_index = VectorStoreIndex.from_documents(documents, embed_model=embed_model)
vector_retriever = VectorIndexRetriever(index=vector_index, similarity_top_k=5)

# Create a BM25 keyword-based retriever
bm25_retriever = BM25Retriever.from_defaults(docstore=docstore, similarity_top_k=5)

# Combine both retrievers into a Hybrid Retriever
hybrid_retriever = HybridRetriever(
    vector_retriever=vector_retriever, bm25_retriever=bm25_retriever, alpha=0.5
)

# Set up query engine with hybrid retrieval
query_engine = RetrieverQueryEngine(retriever=hybrid_retriever)

# Test hybrid retrieval
query = "What is the refund policy?"
response = query_engine.query(query)

print(response)

from llama_index.core.retrievers import LLMReranker

# Initialize reranker
reranker = LLMReranker(llm=llm)

# Get the retrieved results
retrieved_chunks = query_engine.query(query, return_results=True)

# Apply reranking
reranked_results = reranker.rerank(query, retrieved_chunks)

print("Top-ranked result:", reranked_results[0].text)

Extracted 290 words from the PDF.
Valid models:
 - models/gemini-2.5-flash
 - models/gemini-2.5-pro
 - models/gemini-2.0-flash
 - models/gemini-2.0-flash-001
 - models/gemini-2.0-flash-exp-image-generation
 - models/gemini-2.0-flash-lite-001
 - models/gemini-2.0-flash-lite
 - models/gemini-exp-1206
 - models/gemini-2.5-flash-preview-tts
 - models/gemini-2.5-pro-preview-tts
 - models/gemma-3-1b-it
 - models/gemma-3-4b-it
 - models/gemma-3-12b-it
 - models/gemma-3-27b-it
 - models/gemma-3n-e4b-it
 - models/gemma-3n-e2b-it
 - models/gemini-flash-latest
 - models/gemini-flash-lite-latest
 - models/gemini-pro-latest
 - models/gemini-2.5-flash-lite
 - models/gemini-2.5-flash-image
 - models/gemini-2.5-flash-preview-09-2025
 - models/gemini-2.5-flash-lite-preview-09-2025
 - models/gemini-3-pro-preview
 - models/gemini-3-flash-preview
 - models/gemini-3-pro-image-preview
 - models/nano-banana-pro-preview
 - models/gemini-robotics-er-1.5-preview
 - models/gemini-2.5-computer-use-preview-10-2025

  llm = Gemini(model=MODEL_NAME)


Original Query: What are the penalties for late payments?
Expanded Query: The original query "What are the penalties for late payments?" is too broad. The type of payment significantly changes the penalties.

To improve retrieval relevance, you need to add **context** about the *type* of payment.

Here are several improved versions, ranging from slightly more specific to highly specific:

**Better General Queries (if you don't have a specific payment type in mind yet):**

1.  **"What are the consequences of late payments?"** (Broader term than "penalties," might include credit score impact, not just fees.)
2.  **"Late payment fees and charges"** (Focuses on monetary penalties.)
3.  **"Impact of late payments on credit score"** (Focuses on a specific, common consequence.)

**More Specific Queries (Highly Recommended - choose the one that fits your situation):**

*   **For Credit Cards:**
    *   "Credit card late payment fees"
    *   "What happens if I pay my credit card late?"
    *  



ImportError: cannot import name 'BM25Retriever' from 'llama_index.core.retrievers' (/usr/local/lib/python3.12/dist-packages/llama_index/core/retrievers/__init__.py)

In [6]:
import os
print("Current working directory:", os.getcwd())
print("Files here:", os.listdir())

Current working directory: /content
Files here: ['.config', 'sample_data', '.ipynb_checkpoints']
