# Goal

A Patent RAG system with:

- Contextual Embeddings: chunks are enriched with concise context (e.g., “This chunk is from Patent X, abstract section, about photovoltaic cells, filed 2016-01-15”).

- Hybrid Retrieval: combine vector similarity (Milvus) + lexical BM25 (via Elasticsearch/OpenSearch).

- Re-ranking: re-order retrieved chunks with a cross-encoder or LLM scoring step.


Contextual Retrieval is a retrieval method proposed by Anthropic to address the semantic isolation of chunks in Retrieval-Augmented Generation (RAG) pipelines: once a document is split, an individual chunk loses the surrounding context needed to interpret it.

## High-level Design

1. Data ingestion & normalization

        Source: HUPD dataset (abstract, claims, summary, full_description, metadata fields).

        Normalize each record into structured JSON.

        Store metadata (application number, filing date, CPC labels, etc.).

2. Chunking

        Chunk by semantic sections (abstract, claims, background, summary, full_description).

        Within each section, split into chunks of ~300–500 tokens (with ~50 tokens of overlap).

        Preserve section + patent metadata with each chunk.
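
The splitting rule in this step can be sketched as a sliding window. This is a minimal sketch, assuming whitespace-separated words stand in for tokens (a real pipeline would likely count tokens with the embedding model's tokenizer). `chunk_section` and its parameters are hypothetical names, not part of the notebook.

```python
from typing import Dict, List


def chunk_section(text: str, section: str, publication_number: str,
                  max_tokens: int = 400, overlap: int = 50) -> List[Dict]:
    """Split one patent section into overlapping windows.

    Assumption: whitespace-separated words approximate tokens.
    """
    if overlap >= max_tokens:
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    step = max_tokens - overlap  # how far the window advances each time
    chunks: List[Dict] = []
    for start in range(0, len(words), step):
        window = words[start:start + max_tokens]
        chunks.append({
            "publication_number": publication_number,  # patent metadata kept per chunk
            "section": section,
            "chunk_index": len(chunks),
            "text": " ".join(window),
        })
        # Stop once a window reaches the end of the section, so the last
        # chunk is never just a repeat of the previous chunk's overlap.
        if start + max_tokens >= len(words):
            break
    return chunks
```

With the defaults, a 1000-word section yields three chunks that start at words 0, 350, and 700.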

3. Contextualization (Anthropic method)

        For each chunk:

            Input to the LLM: whole patent + chunk.

            Ask for a 50–100 token contextual summary that situates the chunk within the whole patent (which section, which invention, etc.).

            Prepend this context to chunk text → "contextualized chunk".

        Example:

            Context: This chunk is from Patent 20160012345, Abstract, about solar panel efficiency improvements.  
            Chunk: "The system improves photon capture by embedding nanostructures in the substrate layer....."
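
The contextualization step could look roughly like the sketch below. The prompt wording is illustrative only, and `contextualize_chunk`, `PROMPT_TEMPLATE`, and `fake_llm` are hypothetical names. The `llm` argument is assumed to be any callable that takes a prompt string and returns the model's text, standing in for a real Claude or GPT client call.

```python
from typing import Callable

# Illustrative prompt; the exact wording is an assumption, not a fixed recipe.
PROMPT_TEMPLATE = (
    "<document>\n{patent_text}\n</document>\n\n"
    "Here is a chunk from that patent:\n"
    "<chunk>\n{chunk_text}\n</chunk>\n\n"
    "In 50-100 tokens, situate this chunk within the whole patent "
    "(which section it comes from and what the invention is about). "
    "Answer with the context only."
)


def contextualize_chunk(patent_text: str, chunk_text: str,
                        llm: Callable[[str], str]) -> str:
    """Return the chunk with an LLM-written context line prepended."""
    prompt = PROMPT_TEMPLATE.format(patent_text=patent_text, chunk_text=chunk_text)
    context = llm(prompt).strip()
    return f"Context: {context}\nChunk: {chunk_text}"


def fake_llm(prompt: str) -> str:
    """Stand-in for a real model call, used only to show the data flow."""
    return "This chunk is from the Abstract of a patent about solar panel efficiency."


print(contextualize_chunk("...full patent text...",
                          "The system improves photon capture.", fake_llm))
```

Because the whole patent is sent once per chunk, this step tends to dominate ingestion cost, which is one reason step 7 tracks contextualization cost separately.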

4. Dual Indexing

        Vector index (Milvus):

            Store embeddings of contextualized chunks.

            Embedding model: a sentence-transformers model.

        BM25 index (Elasticsearch/OpenSearch):

            Index contextualized chunk text (to catch exact terms, e.g. “US20160234A1”).
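
To illustrate why the lexical index catches exact identifiers, here is a toy in-memory BM25 scorer. It is a sketch only: the real index lives in Elasticsearch/OpenSearch, and `tokenize`, `bm25_scores`, and the `k1`/`b` values are assumptions chosen for the example.

```python
import math
import re
from collections import Counter
from typing import List


def tokenize(text: str) -> List[str]:
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def bm25_scores(query: str, docs: List[str],
                k1: float = 1.2, b: float = 0.75) -> List[float]:
    """Score every document against the query with the BM25 formula."""
    if not docs:
        return []
    tokenized = [tokenize(d) for d in docs]
    n_docs = len(tokenized)
    avg_len = sum(len(t) for t in tokenized) / n_docs
    # Document frequency: number of documents containing each term.
    df = Counter(term for toks in tokenized for term in set(toks))
    query_terms = tokenize(query)
    scores = []
    for toks in tokenized:
        tf = Counter(toks)
        score = 0.0
        for term in query_terms:
            if term not in tf:
                continue
            idf = math.log(1 + (n_docs - df[term] + 0.5) / (df[term] + 0.5))
            length_norm = k1 * (1 - b + b * len(toks) / avg_len)
            score += idf * tf[term] * (k1 + 1) / (tf[term] + length_norm)
        scores.append(score)
    return scores
```

A query such as `"US20160234A1"` scores above zero only for chunks whose text (including the prepended context) contains that exact token, which is the behaviour a dense embedding search does not guarantee.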

5. Retrieval Pipeline

        User query comes in.

        Run query against Milvus (vector search) and BM25 (keyword search).

        Merge candidate sets (e.g., 20 from each).

        Deduplicate.

        Re-rank using:

            Cross-encoder (e.g. ms-marco-MiniLM-L-6-v2)

            OR a lightweight LLM call scoring relevance.

        Take top-K (e.g., 5–10 chunks).
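
The merge, deduplicate, and re-rank steps can be sketched as follows. This is a minimal sketch under stated assumptions: each hit is a dict with a unique `chunk_id` and a `text` field, and `score_fn` is any callable returning a relevance score for a (query, text) pair. `merge_and_rerank` is a hypothetical name, not part of the notebook.

```python
from typing import Callable, Dict, List


def merge_and_rerank(query: str,
                     vector_hits: List[Dict],
                     bm25_hits: List[Dict],
                     score_fn: Callable[[str, str], float],
                     top_k: int = 5) -> List[Dict]:
    """Merge two candidate lists, drop duplicates, and re-rank.

    Assumption: every hit carries a unique "chunk_id" and a "text" field.
    """
    merged: Dict[str, Dict] = {}
    for hit in vector_hits + bm25_hits:
        # Keep the first copy of each chunk; later duplicates are ignored.
        merged.setdefault(hit["chunk_id"], hit)
    scored = [dict(hit, rerank_score=score_fn(query, hit["text"]))
              for hit in merged.values()]
    scored.sort(key=lambda h: h["rerank_score"], reverse=True)
    return scored[:top_k]
```

With sentence-transformers installed, `score_fn` could wrap `CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2").predict([(query, text)])[0]`. That wiring is not shown here and is an assumption about how the model would be loaded.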

6. Answer generation

        Construct prompt with:

            User query

            Retrieved contextualized chunks (as citations)

        Send to LLM (Claude / GPT).

        Ask for structured output (answer + cited patent IDs/chunks).
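
One way to assemble the generation prompt is sketched below. The instruction wording and the JSON keys are illustrative assumptions, and `build_answer_prompt` is a hypothetical helper. Each chunk is assumed to carry `publication_number`, `section`, and `text` fields.

```python
from typing import Dict, List


def build_answer_prompt(query: str, chunks: List[Dict]) -> str:
    """Number each retrieved chunk so the model can cite it in its answer."""
    sources = []
    for i, chunk in enumerate(chunks, start=1):
        sources.append(
            f"[{i}] Patent {chunk['publication_number']} ({chunk['section']}):\n"
            f"{chunk['text']}"
        )
    return (
        "Answer the question using only the numbered sources below.\n"
        'Reply as JSON with keys "answer" and "citations" '
        "(a list of source numbers and patent IDs).\n\n"
        "Sources:\n" + "\n\n".join(sources) + f"\n\nQuestion: {query}"
    )
```

Numbering the sources keeps citations checkable: a cited number can be mapped back to the exact chunk and patent ID that was sent.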

7. Observability / monitoring

        Log query latency, recall@K, and re-ranker confidence.

        Track cost (embedding, contextualization).

        Monitor cluster health (Milvus, ES).

In [1]:
# The HF repo did not seem to work, so I ended up downloading the data manually.
# The dataset is mainly 2018 IP data from HUPD.
from pprint import pprint
from datasets import load_dataset
import os
import json

# dataset_dict = load_dataset('HUPD/hupd',
#     name='sample',
#     data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
#     icpr_label=None,
# )




In [3]:
def get_IP_data(limit=20):
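    ip_files = []
    for file in os.listdir("RawData/2018")[:limit]:
        # Open each file in a context manager so the handle is closed after parsing.
        with open(os.path.join("RawData/2018", file)) as f:
            ip_files.append(json.load(f))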

    relevant_fields = [
        # Identifiers & Linking
        "publication_number",  # primary external ID
        "application_number",  # join back to source system
        "patent_number",      # if granted; null otherwise

        # Dates (as epoch ints)
        "date_published",
        "filing_date",
        "patent_issue_date",  # nullable
        "abandon_date",       # nullable

        # Status & Classes
        "decision",           # e.g., granted/pending/withdrawn
        "main_cpc_label",
        "main_ipcr_label",

        # Retrievable Text
        "title",             # short; don't chunk
        "abstract",          
        "summary",           # chunked
        #"full_description",  # chunked
    ]

    ip_files = [{key: value for key, value in file.items() if key in relevant_fields} for file in ip_files]
    return ip_files

data = get_IP_data()
pprint(data[0])

{'abandon_date': '',
 'abstract': 'The present disclosure relates to a vehicle imaging apparatus '
             '(2) having a location determining module (10) for determining '
             'the relative location of a remote imaging apparatus (3). The '
             'location determining module (10) is configured to receive a '
             'tracking signal (S2) from a remote transmitter (16) associated '
             'with the remote imaging apparatus (3). An image receiver module '
             '(7) is provided to receive image data (DT) transmitted by the '
             'remote imaging apparatus (3). At least one image processor (5) '
             'is provided to process the image data (DT) in dependence on the '
             'determined location of the remote imaging apparatus (3). The '
             'present disclosure also relates to a remote imaging apparatus '
             '(3) for mounting to a trailer (T). The remote imaging apparatus '
             '(3) having a camera (CT);

In [14]:
from retriever import PatentRetriever

retriever = PatentRetriever()
results = retriever.search("solar panel technology", top_k=2)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Patent: {result['publication_number']}")

    print(f"Score: {result['score']:.3f}")
    print(f"Text: {result['text'][:150]}...")

ConnectionNotExistException: <ConnectionNotExistException: (code=1, message=should create connection first.)>

In [None]:
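# Assumes `model` (a sentence-transformers model) and `coll` (a loaded Milvus
# collection) were created in earlier cells.
query = "methods for packaging materials"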
query_vec = model.encode([query], normalize_embeddings=True)[0].tolist()

results = coll.search(
    data=[query_vec],
    anns_field="embedding",
    param={"metric_type": "COSINE", "params": {"ef": 64}},
    limit=5,
    output_fields=["text", "publication_number", "section", "decision"]
)

for hits in results:
    for hit in hits:
        print(f"Score: {hit.score}")
        print(f"Text: {hit.entity.get('text')[:200]}...")
        print(f"Patent: {hit.entity.get('publication_number')}")
        print("---")

Score: 0.5012791156768799
Text: within the predetermined space. packaging may generally surround the anode, cathode, cathode collector and the electrolyte. terminal ends of the anode may extend through the packaging along a first ve...
Patent: US20180159094A1-20180607
---
Score: 0.47259262204170227
Text: layer made of a barrier plastic, a second adhesion promoter layer, a cracking - resistant intermediate layer and optionally a third adhesion promoter layer, and also a structure - providing external l...
Patent: US20180272860A1-20180927
---
Score: 0.41813868284225464
Text: < soh > summary < eoh > the invention is therefore based on the object of providing a multilayer composite material which is particularly lightweight and resistant to fracture for the production of pl...
Patent: US20180272860A1-20180927
---
Score: 0.4055662453174591
Text: the invention relates to a multilayer composite material for the production of plastics moldings, to a container made of this composite material, a

# Evals

## Self-retrieval

Idea: generate queries from a document and mark the document (or chunk) itself as relevant.

Options:

- Use the title or abstract as the query; relevant = the same patent’s claims/description.

- Use an LLM to create 2–5 synthetic queries (“doc2query”) from the abstract/claims; relevant = the patent (and specific chunks). This is a quick way to build many training/eval pairs; BEIR-style pipelines often use such tricks.
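
Once such pairs exist, scoring a retriever against them can be sketched as below. This is a minimal sketch that assumes each pair is a dict with `query` and `document_id` keys and that `search_fn(query, k)` returns a ranked list of document IDs. `evaluate_self_retrieval` is a hypothetical name.

```python
from typing import Callable, Dict, List


def evaluate_self_retrieval(pairs: List[Dict],
                            search_fn: Callable[[str, int], List[str]],
                            k: int = 5) -> Dict[str, float]:
    """Compute recall@k and MRR@k when each query has one relevant document."""
    if not pairs:
        return {"recall_at_k": 0.0, "mrr_at_k": 0.0}
    hits = 0
    reciprocal_ranks = 0.0
    for pair in pairs:
        retrieved = search_fn(pair["query"], k)[:k]
        if pair["document_id"] in retrieved:
            hits += 1
            # Rank of the first occurrence, 1-based.
            rank = retrieved.index(pair["document_id"]) + 1
            reciprocal_ranks += 1.0 / rank
    return {
        "recall_at_k": hits / len(pairs),
        "mrr_at_k": reciprocal_ranks / len(pairs),
    }
```

Since retrieval here is chunk-level, `search_fn` would need to map each returned chunk back to its patent ID before the comparison. Only the first occurrence of the patent counts toward the rank.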

In [9]:
import json
from typing import List, Dict, Optional
from pathlib import Path
import pandas as pd


class RetrievalDataset:
    def __init__(self, data: List[Dict], output_file: str = "retrieval_dataset.jsonl"):
        self.data = data
        self.output_file = output_file

    def generate_queries(self, patent: Dict) -> List[str]:
        """Build up to three self-retrieval queries from one patent record."""
        queries = []

        # 1. Add the title
        if title := patent.get('title'):
            queries.append(title.strip())

        # 2. Add first sentence of abstract
        if abstract := patent.get('abstract'):
            first_sentence = abstract.split('.')[0] + '.'
            if first_sentence not in queries:
                queries.append(first_sentence)

        # 3. Add first 30 words of summary
        if summary := patent.get('summary'):
            summary_preview = ' '.join(summary.split()[:30])
            if summary_preview not in queries:
                queries.append(summary_preview)
        return queries

    def create_dataset(self) -> 'pd.DataFrame':
        
        testData = []
        for patent in self.data:
            doc_id = patent.get('publication_number', str(hash(str(patent))))
            queries = self.generate_queries(patent)
            
            for query in queries:
                testData.append({
                    "query": query,
                    "document_id": doc_id,
                    "title": patent.get('title', ''),
                    "abstract": patent.get('abstract', ''),
                    "summary": patent.get('summary', ''),
                    "relevance_score": 1.0
                })
        
        # Create DataFrame
        df = pd.DataFrame(testData)
        
        # Save to file
        output_file = Path(self.output_file).with_suffix('.jsonl')
        df.to_json(output_file, orient='records', lines=True)
        
        print(f"Saved {len(df)} query-document pairs to {output_file}")
        return df


patent_data = get_IP_data(limit=300)  # Adjust limit as needed    
# Create dataset
dataset_creator = RetrievalDataset(patent_data)
dataset = dataset_creator.create_dataset()

Saved 862 query-document pairs to retrieval_dataset.jsonl


In [8]:
dataset

| | query | document_id | title | abstract | summary | relevance_score |
|---|---|---|---|---|---|---|
| 0 | VEHICLE IMAGING SYSTEM AND METHOD | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 1 | The present disclosure relates to a vehicle im... | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 2 | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 3 | PTYCHOGRAPHY SYSTEM | US20180284418A1-20181004 | PTYCHOGRAPHY SYSTEM | A single-exposure ptychography system is prese... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| 4 | A single-exposure ptychography system is prese... | US20180284418A1-20181004 | PTYCHOGRAPHY SYSTEM | A single-exposure ptychography system is prese... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 857 | A rotary knife fixture for cutting vegetable p... | US20180141230A1-20180524 | ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU... | A rotary knife fixture for cutting vegetable p... | <SOH> SUMMARY <EOH>In accordance with the inve... | 1.0 |
| 858 | <SOH> SUMMARY <EOH>In accordance with the inve... | US20180141230A1-20180524 | ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU... | A rotary knife fixture for cutting vegetable p... | <SOH> SUMMARY <EOH>In accordance with the inve... | 1.0 |
| 859 | SOLID STATE FORMS OF ELUXADOLINE | US20180228773A1-20180816 | SOLID STATE FORMS OF ELUXADOLINE | Disclosed are solid state forms of Eluxadoline... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| 860 | Disclosed are solid state forms of Eluxadoline... | US20180228773A1-20180816 | SOLID STATE FORMS OF ELUXADOLINE | Disclosed are solid state forms of Eluxadoline... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
