# Goal

A Patent RAG system with:

- Contextual Embeddings: chunks are enriched with concise context (e.g., “This chunk is from Patent X, abstract section, about photovoltaic cells, filed 2016-01-15”).

- Hybrid Retrieval: combine vector similarity (Milvus) + lexical BM25 (via Elasticsearch/OpenSearch).

- Re-ranking: re-order retrieved chunks with a cross-encoder or LLM scoring step.


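Contextual Retrieval is a retrieval method proposed by Anthropic to address the semantic isolation of chunks in standard Retrieval-Augmented Generation (RAG) pipelines: once a document is split, an individual chunk often no longer carries the context needed to interpret it or to match it to a query.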

## High-level System Design

1. Data ingestion & normalization

   - Source: HUPD dataset (abstract, claims, summary, full_description, metadata fields).
   - Normalize each record into structured JSON.
   - Store metadata (application number, filing date, CPC labels, etc.).

2. Chunking

   - Chunk by semantic sections (abstract, claims, background, summary, full_description).
   - Within each section, split into chunks of ~300–500 tokens (with ~50 tokens of overlap).
   - Preserve section + patent metadata with each chunk.

3. Contextualization (Anthropic method)

   - For each chunk:
     - Input to Claude (or another LLM): whole patent + chunk.
     - Ask for a 50–100 token contextual summary situating the chunk in the whole patent (which part, what invention, etc.).
     - Prepend this context to the chunk text → "contextualized chunk".
   - Example:
     - Context: This chunk is from Patent 20160012345, Abstract, about solar panel efficiency improvements.
     - Chunk: "The system improves photon capture by embedding nanostructures in the substrate layer."

4. Dual Indexing

   - Vector index (Milvus):
     - Store embeddings of contextualized chunks.
     - Embedding model: sentence-transformers or a hosted embedding API (e.g., OpenAI or Voyage AI; Anthropic does not provide its own embedding model).
   - BM25 index (Elasticsearch/OpenSearch):
     - Index contextualized chunk text (to catch exact terms, e.g. “US20160234A1”).

5. Retrieval Pipeline

   - User query comes in.
   - Run the query against Milvus (vector search) and BM25 (keyword search).
   - Merge candidate sets (e.g., 20 from each).
   - Deduplicate.
   - Re-rank using:
     - Cross-encoder (e.g. ms-marco-MiniLM-L-6-v2)
     - OR a lightweight LLM call scoring relevance.
   - Take top-K (e.g., 5–10 chunks).

6. Answer generation

   - Construct prompt with:
     - User query
     - Retrieved contextualized chunks (as citations)
   - Send to LLM (Claude / GPT).
   - Ask for structured output (answer + cited patent IDs/chunks).

7. Observability / monitoring

   - Log query latency, recall@K, and re-ranker confidence.
   - Track cost (embedding, contextualization).
   - Monitor cluster health (Milvus, Elasticsearch).
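
The contextualization step (step 3) can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual implementation: the prompt wording is illustrative, `generate` is a hypothetical callable standing in for whatever LLM client is used (Claude or another model), and `stub_llm` only exists so the example runs without network access.

```python
from typing import Callable

# Illustrative prompt wording (an assumption, not taken from the notebook).
PROMPT_TEMPLATE = (
    "<document>\n{document}\n</document>\n"
    "Here is the chunk we want to situate within the whole document:\n"
    "<chunk>\n{chunk}\n</chunk>\n"
    "Give a short context (50-100 tokens) that situates this chunk within "
    "the overall patent, to improve search retrieval of the chunk. "
    "Answer only with the context."
)


def build_context_prompt(document: str, chunk: str) -> str:
    """Fill the template with the whole patent text and one chunk."""
    return PROMPT_TEMPLATE.format(document=document, chunk=chunk)


def contextualize_chunk(
    document: str, chunk: str, generate: Callable[[str], str]
) -> str:
    """Ask the LLM for a short context and prepend it to the chunk text."""
    context = generate(build_context_prompt(document, chunk)).strip()
    return f"Context: {context}\nChunk: {chunk}"


def stub_llm(prompt: str) -> str:
    # Hypothetical stand-in for a real LLM call; it returns the example
    # context from the design notes so the sketch runs offline.
    return (
        "This chunk is from Patent 20160012345, Abstract, "
        "about solar panel efficiency improvements."
    )


example = contextualize_chunk(
    document="(full patent text would go here)",
    chunk="The system improves photon capture by embedding "
    "nanostructures in the substrate layer.",
    generate=stub_llm,
)
print(example)
```

The same contextualized string would then be both embedded for Milvus and indexed for BM25, so the two indexes see identical text.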

In [8]:
# The HF repo does not seem to be working, so I ended up downloading the data manually.
# The dataset is mainly 2018 IP data from HUPD.
from pprint import pprint
from datasets import load_dataset
import os
import json

# dataset_dict = load_dataset('HUPD/hupd',
#     name='sample',
#     data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
#     icpr_label=None,
# )


In [17]:
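def get_IP_data(ip_limit=20):
    # Load up to `ip_limit` raw JSON records from the manually downloaded 2018 data.
    ip_files = []
    for file in os.listdir("RawData/2018")[:ip_limit]:
        # A context manager closes each file handle once it has been parsed.
        with open(os.path.join("RawData/2018", file)) as f:
            ip_files.append(json.load(f))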

    relevant_fields = [
        # Identifiers & Linking
        "publication_number",  # primary external ID
        "application_number",  # join back to source system
        "patent_number",      # if granted; null otherwise

        # Dates (stored as 'YYYYMMDD' strings in the raw files)
        "date_published",
        "filing_date",
        "patent_issue_date",  # empty string if not issued
        "abandon_date",       # empty string if not abandoned

        # Status & Classes
        "decision",           # e.g., granted/pending/withdrawn
        "main_cpc_label",
        "main_ipcr_label",

        # Retrievable Text
        "title",             # short; don't chunk
        "abstract",          
        "summary",           # chunked
        #"full_description",  # chunked
    ]

    ip_files = [{key: value for key, value in file.items() if key in relevant_fields} for file in ip_files]
    return ip_files

data = get_IP_data()
data[0]

{'application_number': '15756453',
 'publication_number': 'US20180249132A1-20180830',
 'title': 'VEHICLE IMAGING SYSTEM AND METHOD',
 'decision': 'PENDING',
 'date_published': '20180830',
 'main_cpc_label': 'H04N7185',
 'main_ipcr_label': 'H04N718',
 'patent_number': 'nan',
 'filing_date': '20180228',
 'patent_issue_date': '',
 'abandon_date': '',
 'abstract': 'The present disclosure relates to a vehicle imaging apparatus (2) having a location determining module (10) for determining the relative location of a remote imaging apparatus (3). The location determining module (10) is configured to receive a tracking signal (S2) from a remote transmitter (16) associated with the remote imaging apparatus (3). An image receiver module (7) is provided to receive image data (DT) transmitted by the remote imaging apparatus (3). At least one image processor (5) is provided to process the image data (DT) in dependence on the determined location of the remote imaging apparatus (3). The present disclo
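
The records returned by `get_IP_data` carry the `abstract` and `summary` text that the design chunks into roughly 300–500 tokens with about 50 tokens of overlap. A minimal sketch of that step is below. It assumes whitespace-separated words as a rough proxy for tokens (a real pipeline would likely use the embedding model's tokenizer), and the `chunk_index` field is a hypothetical addition for illustration.

```python
from typing import Dict, List, Optional, Sequence


def chunk_text(text: str, max_tokens: int = 400, overlap: int = 50) -> List[str]:
    """Split text into overlapping windows of whitespace-separated tokens."""
    if not 0 <= overlap < max_tokens:
        raise ValueError("overlap must be non-negative and smaller than max_tokens")
    tokens = text.split()
    step = max_tokens - overlap  # how far each window advances
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(" ".join(tokens[start:start + max_tokens]))
        # Stop once the current window already reaches the end of the text.
        if start + max_tokens >= len(tokens):
            break
    return chunks


def chunk_record(
    record: Dict[str, Optional[str]],
    sections: Sequence[str] = ("abstract", "summary"),
    max_tokens: int = 400,
    overlap: int = 50,
) -> List[Dict[str, object]]:
    """Chunk each section of one patent record, keeping section and patent ID."""
    rows = []
    for section in sections:
        text = record.get(section) or ""  # tolerate missing or empty sections
        for i, piece in enumerate(chunk_text(text, max_tokens, overlap)):
            rows.append({
                "publication_number": record.get("publication_number"),
                "section": section,
                "chunk_index": i,  # hypothetical field, not in the source schema
                "text": piece,
            })
    return rows
```

Something like `rows = [row for rec in data for row in chunk_record(rec)]` would then produce one flat list of chunk rows for the loaded records.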

In [21]:
from pymilvus import Collection, connections

# Connect to Milvus
connections.connect("default", host="127.0.0.1", port="19530")
coll = Collection("ip_chunks")
coll.load()

# Search by vector
from sentence_transformers import SentenceTransformer
model = SentenceTransformer("sentence-transformers/all-MiniLM-L6-v2")

query = "methods for packaging materials"
query_vec = model.encode([query], normalize_embeddings=True)[0].tolist()

results = coll.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    output_fields=["text", "publication_number", "section", "decision"]
)

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score}")
        print(f"Text: {hit.entity.get('text')[:200]}...")
        print(f"Patent: {hit.entity.get('publication_number')}")
        print("---")

Score: 0.5012791156768799
Text: within the predetermined space. packaging may generally surround the anode, cathode, cathode collector and the electrolyte. terminal ends of the anode may extend through the packaging along a first ve...
Patent: US20180159094A1-20180607
---
Score: 0.47259262204170227
Text: layer made of a barrier plastic, a second adhesion promoter layer, a cracking - resistant intermediate layer and optionally a third adhesion promoter layer, and also a structure - providing external l...
Patent: US20180272860A1-20180927
---
Score: 0.41813868284225464
Text: < soh > summary < eoh > the invention is therefore based on the object of providing a multilayer composite material which is particularly lightweight and resistant to fracture for the production of pl...
Patent: US20180272860A1-20180927
---
Score: 0.4055662453174591
Text: the invention relates to a multilayer composite material for the production of plastics moldings, to a container made of this composite material, a
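
The vector hits above are only one half of the hybrid pipeline described in the design. The merge, deduplicate, and re-rank steps could look roughly like the sketch below. It assumes each hit is a dict with a hypothetical `chunk_id` key plus its `text`, and it uses a toy token-overlap scorer so the example runs offline; in practice `score_fn` would wrap a cross-encoder (for instance `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict` from sentence-transformers) or an LLM relevance call. The hit texts are toy data.

```python
from typing import Callable, Dict, List


def merge_candidates(vector_hits: List[Dict], bm25_hits: List[Dict]) -> List[Dict]:
    """Union both candidate lists, keeping one entry per chunk_id."""
    merged: Dict[str, Dict] = {}
    for source, hits in (("vector", vector_hits), ("bm25", bm25_hits)):
        for hit in hits:
            entry = merged.setdefault(hit["chunk_id"], {**hit, "sources": []})
            entry["sources"].append(source)  # remember which retriever found it
    return list(merged.values())


def rerank(
    query: str,
    candidates: List[Dict],
    score_fn: Callable[[str, str], float],
    top_k: int = 5,
) -> List[Dict]:
    """Score every (query, chunk text) pair and keep the top_k best chunks."""
    scored = [(score_fn(query, c["text"]), c) for c in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # stable for ties
    return [c for _, c in scored[:top_k]]


def token_overlap(query: str, text: str) -> float:
    # Toy scorer (an assumption for illustration): fraction of query words
    # that also appear in the chunk text.
    q = set(query.lower().split())
    return len(q & set(text.lower().split())) / max(len(q), 1)


vector_hits = [
    {"chunk_id": "a", "text": "packaging may generally surround the anode"},
    {"chunk_id": "b", "text": "multilayer composite material for plastics moldings"},
]
bm25_hits = [
    {"chunk_id": "b", "text": "multilayer composite material for plastics moldings"},
    {"chunk_id": "c", "text": "methods for packaging materials in containers"},
]

merged = merge_candidates(vector_hits, bm25_hits)
top = rerank("methods for packaging materials", merged, token_overlap, top_k=2)
print([c["chunk_id"] for c in top])
```

Tracking `sources` per candidate is optional, but it makes it easy to log how often the final top-K comes from the vector index, the BM25 index, or both.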

### Experiment I: Standard Retrieval