In [1]:
%pip install pymongo langchain_google_genai sentence-transformers



### Connect to MongoDB Atlas

Establishing a connection to a MongoDB Atlas cluster using a connection string (`MONGO_URI`) stored in Colab secrets, then selecting the `invoice_reader_db` database and its `invoices` collection. The collection holds the invoice documents queried later in this notebook; they are presumably written by a separate ingestion workflow.

In [2]:
from sentence_transformers import SentenceTransformer
from langchain_google_genai import ChatGoogleGenerativeAI
import numpy as np
from google.colab import userdata
from pymongo import MongoClient
import os

# mongo_uri = os.getenv("MONGO_URI")
mongo_uri = userdata.get('MONGO_URI')
client = MongoClient(mongo_uri)
db = client["invoice_reader_db"]
collection = db["invoices"]
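
The cell above reads the secret from Colab only, with the environment-variable route left commented out. A minimal sketch of a helper that tries both is shown below. The name `load_secret` is hypothetical (it is not part of the notebook), and the fallback order is an assumption.

```python
import os


def load_secret(name: str) -> str:
    """Return a secret from Colab's secret store, falling back to an environment variable.

    `load_secret` is an illustrative helper, not part of the original notebook.
    """
    try:
        # google.colab is only importable inside a Colab runtime
        from google.colab import userdata
        value = userdata.get(name)
    except Exception:
        # Not running in Colab, or the secret is missing / not shared with this notebook
        value = None

    value = value or os.getenv(name)
    if not value:
        raise RuntimeError(f"Secret {name!r} not found in Colab secrets or the environment")
    return value
```

With this in place, `MongoClient(load_secret("MONGO_URI"))` would work both in Colab and in a local session. Calling `client.admin.command("ping")` afterwards is a quick way to confirm that the cluster is reachable before running any queries.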

### Initialize Sentence Transformer for Embeddings

Loading the `all-MiniLM-L6-v2` sentence-transformers model to generate dense vector embeddings for English text. The model maps sentences and paragraphs to a **384**-dimensional dense vector space and is commonly used for semantic similarity, clustering, and retrieval.

In this notebook it embeds the user's question so that the question can be compared against the `embedding` field stored with each invoice in MongoDB.



In [3]:
model = SentenceTransformer("all-MiniLM-L6-v2")

The secret `HF_TOKEN` does not exist in your Colab secrets.
To authenticate with the Hugging Face Hub, create a token in your settings tab (https://huggingface.co/settings/tokens), set it as secret in your Google Colab and restart your session.
You will be able to reuse this secret in all of your notebooks.
Please note that authentication is recommended but still optional to access public models or datasets.
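
To make "semantic similarity" concrete, the sketch below compares 384-dimensional vectors with cosine similarity using NumPy. The random vectors are stand-ins for `model.encode(...)` output so that the example runs without downloading the model. Cosine is an assumption here: the notebook does not state which similarity metric the Atlas index uses.

```python
import numpy as np


def cosine_similarity(a, b) -> float:
    """Cosine of the angle between two vectors; 0.0 if either vector is all zeros."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    denom = np.linalg.norm(a) * np.linalg.norm(b)
    if denom == 0.0:
        return 0.0
    return float(np.dot(a, b) / denom)


# Toy 384-dimensional vectors standing in for real sentence embeddings
rng = np.random.default_rng(0)
v1 = rng.normal(size=384)
v2 = v1 + 0.1 * rng.normal(size=384)  # slightly perturbed copy of v1
v3 = rng.normal(size=384)             # unrelated vector

print("near-duplicate:", cosine_similarity(v1, v2))
print("unrelated:     ", cosine_similarity(v1, v3))
```

The perturbed copy scores close to 1, while the unrelated vector scores near 0. A vector search ranks stored documents by this kind of score against the query embedding.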


### Semantic Search in MongoDB for Invoice Data

Performing a vector search in MongoDB Atlas: the query *"What is the seller name in this invoice number us-001?"* is embedded with the same model and passed to the `$vectorSearch` stage, which compares it against the precomputed `embedding` field through the vector index `invoice_vector_index`. The stage considers `numCandidates` nearest neighbours and returns the top `limit` documents; their `extracted_text` is collected as context for answer generation.
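
The search depends on `invoice_vector_index` already existing on the collection, and the notebook does not show its definition. The sketch below builds a definition in the Atlas Vector Search format. The field path and the 384 dimensions follow from the text above, but the `cosine` similarity and the helper name `vector_index_definition` are assumptions.

```python
def vector_index_definition(path: str, num_dimensions: int, similarity: str = "cosine") -> dict:
    """Build an Atlas Vector Search index definition for a single vector field.

    Illustrative helper; the similarity metric actually used by the notebook's
    index is not stated in the source.
    """
    allowed = {"cosine", "euclidean", "dotProduct"}
    if similarity not in allowed:
        raise ValueError(f"similarity must be one of {sorted(allowed)}")
    return {
        "fields": [
            {
                "type": "vector",
                "path": path,
                "numDimensions": num_dimensions,  # must match the embedding model's output size
                "similarity": similarity,
            }
        ]
    }


definition = vector_index_definition("embedding", 384)
```

The same JSON can be pasted into the Atlas UI when creating the index. Recent PyMongo versions should also accept it through `SearchIndexModel(definition=definition, name="invoice_vector_index", type="vectorSearch")` passed to `collection.create_search_index`, though it is worth checking this against the driver version in use.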

In [4]:
query = "What is the seller name in this invoice number us-001?"
query_embedding = model.encode(query).tolist()

results = collection.aggregate([
    {
        "$vectorSearch": {
            "queryVector": query_embedding,
            "path": "embedding",
            "numCandidates": 5,
            "limit": 3,
            "index": "invoice_vector_index"
        }
    }
])

# Convert the cursor to a list to inspect the documents
results_list = list(results)

# Print the keys of the first document to confirm which fields are available
if results_list:
    print("Keys in the first document:", results_list[0].keys())

retrieved_docs = [r["extracted_text"] for r in results_list]

Keys in the first document: dict_keys(['_id', 'file_name', 'extracted_text', 'invoice_data', 'embedding'])
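
The output above shows that each hit also carries the full `embedding` array, which is not needed downstream. Below is a minimal sketch of a pipeline builder that projects only the useful fields and adds the similarity score. The helper name `build_vector_search_pipeline` and the default of 20 candidates per returned document are assumptions; the latter is a rule of thumb for keeping `numCandidates` well above `limit`.

```python
def build_vector_search_pipeline(
    query_vector,
    limit: int = 3,
    num_candidates=None,
    index: str = "invoice_vector_index",
    path: str = "embedding",
) -> list:
    """Return an aggregation pipeline: $vectorSearch followed by a slim $project."""
    if num_candidates is None:
        num_candidates = limit * 20  # assumed default; tune for recall vs. latency
    if num_candidates < limit:
        raise ValueError("numCandidates must be greater than or equal to limit")
    return [
        {
            "$vectorSearch": {
                "index": index,
                "path": path,
                "queryVector": list(query_vector),
                "numCandidates": num_candidates,
                "limit": limit,
            }
        },
        {
            "$project": {
                "_id": 0,
                "file_name": 1,
                "extracted_text": 1,
                # Similarity score computed by the $vectorSearch stage
                "score": {"$meta": "vectorSearchScore"},
            }
        },
    ]


pipeline = build_vector_search_pipeline([0.0] * 384)
```

Passing `pipeline` to `collection.aggregate(...)` would return documents with `file_name`, `extracted_text`, and `score` only. That makes it easier to check whether the top hit is a strong match or merely the least-bad candidate.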


### Generate Answer from Retrieved Invoice Context

Using the Gemini LLM (`gemini-2.5-flash`) with a low temperature (0.1) for focused, low-variance responses. The prompt instructs the model to answer the question about the seller name in invoice `us-001` **based only on the retrieved invoice texts** from MongoDB. Grounding the answer in retrieved context reduces the risk of hallucination, although it does not eliminate it.

In [5]:
# google_api_key = os.getenv("GEMINI_API_KEY")
google_api_key = userdata.get('GOOGLE_API_KEY')

llm = ChatGoogleGenerativeAI(
    google_api_key=google_api_key,
    temperature=0.1,
    max_retries=2,
    convert_system_message_to_human=True,
    model="gemini-2.5-flash"
)

context = "\n\n".join(retrieved_docs)
prompt = f"Answer the question based only on the following context:\n{context}\n\nQuestion: {query}"

response = llm.invoke(prompt)
print(response.content)

East Repair Inc.
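
The prompt in the cell above is built inline and would still be sent if the search returned nothing. A minimal sketch of a more defensive builder follows. The function name `build_prompt`, the document labels, and the `max_chars` cap are illustrative assumptions, not part of the notebook.

```python
def build_prompt(question: str, docs: list, max_chars: int = 12000) -> str:
    """Assemble a context-grounded prompt from retrieved document texts."""
    # Drop empty or whitespace-only documents
    docs = [d.strip() for d in docs if d and d.strip()]
    if not docs:
        raise ValueError("No context retrieved; refusing to build an ungrounded prompt")

    # Label each document so the model can tell the invoices apart
    context = "\n\n".join(f"[Document {i}]\n{d}" for i, d in enumerate(docs, start=1))
    context = context[:max_chars]  # crude character cap on the context size

    return (
        "Answer the question based only on the following context. "
        "If the answer is not in the context, say so.\n"
        f"{context}\n\nQuestion: {question}"
    )
```

Calling `llm.invoke(build_prompt(query, retrieved_docs))` would then replace the inline f-string. The explicit "say so" instruction gives the model a way out when the retrieved invoices do not contain the answer.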


### Verification: Correct Seller Name Retrieved

The LLM returned **"East Repair Inc."** as the seller name for invoice `us-001`, matching the ground truth stored in MongoDB (`seller_name: "East Repair Inc."`). For this query, the end-to-end pipeline (vector search, context retrieval, and LLM-based extraction) produced the correct result.
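
The comparison above was made by eye. A minimal sketch of an automated check is shown below; it tolerates differences in case, punctuation, and surrounding words. The helper names are hypothetical, and where exactly `seller_name` sits inside the stored document (for example under `invoice_data`) is an assumption.

```python
import re


def normalize(text: str) -> str:
    """Lowercase, strip punctuation, and collapse whitespace."""
    text = text.casefold()
    text = re.sub(r"[^\w\s]", "", text)
    return " ".join(text.split())


def answer_matches(answer: str, expected: str) -> bool:
    """True if the expected value appears in the model's answer after normalization."""
    expected_norm = normalize(expected)
    if not expected_norm:
        return False  # an empty ground truth can never count as a match
    return expected_norm in normalize(answer)


print(answer_matches("East Repair Inc.", "East Repair Inc."))
print(answer_matches("John Smith", "East Repair Inc."))
```

In the notebook this could be called as `answer_matches(response.content, expected)`, with `expected` read from the stored invoice document. Running it over several invoices and questions would give a better picture of accuracy than a single query.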