# Data Acquisition & Goal

### Goal

A Patent RAG system with:

- Contextual Embeddings: chunks are enriched with concise context (e.g., “This chunk is from Patent X, abstract section, about photovoltaic cells, filed 2016-01-15”).

- Hybrid Retrieval: combine vector similarity (Milvus) + lexical BM25 (via Elasticsearch/OpenSearch).

- Re-ranking: re-order retrieved chunks with a cross-encoder or LLM scoring step.
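
The hybrid retrieval step needs some way to merge the vector and BM25 rankings. These notes do not say which fusion method is used; a minimal sketch, assuming reciprocal rank fusion (RRF) and hypothetical ranked ID lists from each retriever, could look like this:

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> List[str]:
    """Fuse several ranked ID lists; k=60 is a commonly used smoothing constant."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by ID for deterministic output
    return sorted(scores, key=lambda doc_id: (-scores[doc_id], doc_id))


# Hypothetical outputs of the vector (Milvus) and lexical (BM25) retrievers
vector_hits = ["P1", "P2", "P3"]
bm25_hits = ["P2", "P4", "P1"]
fused = reciprocal_rank_fusion([vector_hits, bm25_hits])
print(fused)
```

Documents that rank well in both lists rise to the top, and no score normalization between the two retrievers is needed because only ranks are used.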


Contextual Retrieval is a retrieval method proposed by Anthropic to address the semantic isolation of chunks in Retrieval-Augmented Generation (RAG) systems: a chunk embedded on its own loses the document-level context needed to interpret it.
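
Anthropic's approach generates the context for each chunk with an LLM. As a simpler illustration of the enrichment idea only, the sketch below builds the context line from patent metadata with a fixed template. The function name `contextualize_chunk`, the template wording, and the sample record are assumptions, not part of this project's code.

```python
from typing import Dict


def contextualize_chunk(patent: Dict[str, str], section: str, chunk: str) -> str:
    """Prepend a short, metadata-derived context line to a chunk before embedding."""
    context = (
        f"This chunk is from patent {patent.get('publication_number', 'unknown')}, "
        f"{section} section, titled \"{patent.get('title', '')}\", "
        f"filed {patent.get('filing_date', 'unknown')}."
    )
    return f"{context}\n\n{chunk}"


# Hypothetical record; the keys follow the field names used in the HUPD files
patent = {
    "publication_number": "US-EXAMPLE-A1",
    "title": "PHOTOVOLTAIC CELL",
    "filing_date": "2016-01-15",
}
enriched = contextualize_chunk(patent, "abstract", "A photovoltaic cell comprising ...")
print(enriched)
```

The enriched string, rather than the bare chunk, would then be passed to the embedding model and to the BM25 index.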

In [1]:
# The HF repo does not seem to be working, so I ended up downloading the data manually
# The dataset is mainly 2018 IP data from HUPD 
from pprint import pprint
from datasets import load_dataset
import os
import json

# dataset_dict = load_dataset('HUPD/hupd',
#     name='sample',
#     data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
#     icpr_label=None,
# )




In [2]:
def get_IP_data(limit=20):
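    data_dir = "RawData/2018"
    ip_files = []
    for file in os.listdir(data_dir)[:limit]:
        # Use a context manager so each file handle is closed after loading
        with open(os.path.join(data_dir, file)) as f:
            ip_files.append(json.load(f))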

    relevant_fields = [
        # Identifiers
        "publication_number",
        "application_number",
        "patent_number",
        # Dates
        "date_published",
        "filing_date",
        "patent_issue_date",
        "abandon_date",
        # Status & Classes
        "decision",
        "main_cpc_label",
        "main_ipcr_label",
        # Retrievable Text
        "title",
        "abstract",
        "summary",
    ]

    ip_files = [{key: value for key, value in file.items() if key in relevant_fields} for file in ip_files]
    return ip_files

data = get_IP_data()
pprint(data[0])

{'abandon_date': '',
 'abstract': 'The present disclosure relates to a vehicle imaging apparatus '
             '(2) having a location determining module (10) for determining '
             'the relative location of a remote imaging apparatus (3). The '
             'location determining module (10) is configured to receive a '
             'tracking signal (S2) from a remote transmitter (16) associated '
             'with the remote imaging apparatus (3). An image receiver module '
             '(7) is provided to receive image data (DT) transmitted by the '
             'remote imaging apparatus (3). At least one image processor (5) '
             'is provided to process the image data (DT) in dependence on the '
             'determined location of the remote imaging apparatus (3). The '
             'present disclosure also relates to a remote imaging apparatus '
             '(3) for mounting to a trailer (T). The remote imaging apparatus '
             '(3) having a camera (CT);

# RAG Design

### Ingestion


## Overview

The IP Assistant is a Retrieval-Augmented Generation (RAG) system for intellectual property (patent) documents. It ingests patent data, chunks text content, generates embeddings using sentence transformers, and stores everything in a Milvus vector database for semantic search and retrieval.


## System Components

### 1. Docker Services

The system runs three Docker containers in an isolated containerized environment:

- **etcd**: Distributed key-value store for Milvus metadata
- **minio**: Object storage for Milvus data persistence
- **milvus-standalone**: Vector database for embeddings and metadata

### 2. IP Assistant Data Pipeline

1. Data Loading
   - Loads patent data from JSON files in RawData/2018/
   - Filters to essential fields defined in RELEVANT_FIELDS
   - Processes metadata and text content
2. Text Processing
   - Chunking: Splits text into 256-token chunks with 50-token overlap
   - Fields Processed: Title, abstract, and summary
   - Handles special characters and formatting
3. Embedding Generation
   - Uses sentence-transformers/all-MiniLM-L6-v2 model
   - Generates 384-dimensional vectors
   - Normalizes embeddings for cosine similarity
4. Milvus Storage
   - Collection: ip_chunks
     - Metadata: Publication/patent numbers, dates, classifications
     - Content: Text chunks and embeddings
   - Indexing: HNSW for vector search, secondary indexes for filtering
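
The chunking step in the pipeline above can be sketched as a sliding window over token IDs. This is a minimal illustration, assuming the text has already been tokenized into a list; the helper name `chunk_tokens` is hypothetical and the actual `ingestion` module may work differently.

```python
from typing import List


def chunk_tokens(tokens: List[int], chunk_size: int = 256, overlap: int = 50) -> List[List[int]]:
    """Split a token sequence into fixed-size windows that share `overlap` tokens."""
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


tokens = list(range(600))  # stand-in for real token IDs
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])
```

Each window starts 206 tokens after the previous one, so the last 50 tokens of one chunk reappear as the first 50 of the next.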

In [8]:
from ingestion import ingest_patents

ingest_patents(ip_limit=1000)

  Clearing existing collection 'ip_chunks'...
  ✓ Collection 'ip_chunks' cleared successfully


Ingesting patents:   0%|          | 0/1000 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (1995 > 512). Running this sequence through the model will result in indexing errors
Ingesting patents: 100%|██████████| 1000/1000 [00:33<00:00, 29.82it/s]


✓ Ingestion complete!
  • Processed 7167 text chunks from 1000 patents
  • Collection 'ip_chunks' now has 7167 entities


### Retrieval

In [3]:
from retriever import PatentRetriever
retriever = PatentRetriever()


In [7]:
results = retriever.search("methods for packaging materials", top_k=2)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Patent: {result['publication_number']}")
    print(f"Score: {result['score']:.3f}")
    print(f"Text: {result['text'][:150]}...")


Result 1:
Patent: US20180216051A1-20180802
Score: 0.636
Text: ##s and fabric softeners. method of making a water - soluble package the water - soluble packages of the present invention can be manufactured using s...

Result 2:
Patent: US20180158784A1-20180607
Score: 0.633
Text: a method for fabricating an electronic package is provided, including steps of : providing a carrier having at least an electronic element and at leas...


# Evals - Binary Evaluation
Single-label retrieval



The current evaluation assumes each query has exactly one correct document, because that is how the evaluation set is constructed.

What the metrics mean in this setup (1 relevant document per query)

With only one relevant document per query, the metrics simplify considerably:

| Metric | Meaning in this setup |
|---|---|
| Precision@K | 1/K if the relevant doc is within the top K, else 0, because at most one of the K docs can be relevant. |
| Recall@K | 1 if the relevant doc is within the top K, else 0, because the only relevant doc is either retrieved or not. |
| Success@K | Identical to Recall@K. |
| F1@K | `2 * P * R / (P + R)`, which is 2 / (K + 1) when the relevant doc is retrieved, else 0. |
| MRR | Mean of 1 / rank_of_relevant over all queries. Example: ranks 1, 1, and 2 across three queries give MRR = (1 + 1 + 0.5) / 3 ≈ 0.83. |
| AvgRank | Average position of the relevant doc (1 = top). |
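
As a quick check of the rules listed above, the sketch below computes each per-query metric from the rank of the single relevant document. The function name `single_label_metrics` is an illustrative assumption, not part of the evaluation code in this notebook.

```python
from typing import Dict, Optional


def single_label_metrics(rank: Optional[int], k: int) -> Dict[str, float]:
    """Per-query metrics when exactly one document is relevant.

    rank is the 1-based position of the relevant document,
    or None if it was not retrieved at all.
    """
    hit = rank is not None and rank <= k
    precision = 1.0 / k if hit else 0.0
    recall = 1.0 if hit else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    reciprocal_rank = 1.0 / rank if hit else 0.0
    return {"P@K": precision, "R@K": recall, "F1@K": f1, "RR": reciprocal_rank}


print(single_label_metrics(rank=2, k=5))
print(single_label_metrics(rank=None, k=5))
```

Averaging `RR` over all queries gives MRR, and averaging `R@K` gives Success@K.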


In [8]:
import json
from typing import List, Dict, Optional
from pathlib import Path
import pandas as pd


class RetrievalDataset:
    def __init__(self, data: List[Dict], output_file: str = "retrieval_dataset.jsonl"):
        self.data = data
        self.output_file = output_file

    def generate_queries(self, patent: Dict) -> List[str]:
        queries = []
        
        # 1. Add the title
        if title := patent.get('title'):
            queries.append(title.strip())
        
        # 2. Add first sentence of abstract
        if abstract := patent.get('abstract'):
            first_sentence = abstract.split('.')[0] + '.'
            if first_sentence not in queries:
                queries.append(first_sentence)
        
        # 3. Add first 30 words of summary
        if summary := patent.get('summary'):
            summary_preview = ' '.join(summary.split()[:30])
            if summary_preview not in queries:
                queries.append(summary_preview)
        return queries

    def create_dataset(self) -> 'pd.DataFrame':
        
        testData = []
        for patent in self.data:
            doc_id = patent.get('publication_number', str(hash(str(patent))))
            queries = self.generate_queries(patent)
            
            for query in queries:
                testData.append({
                    "query": query,
                    "document_id": doc_id,
                    "title": patent.get('title', ''),
                    "abstract": patent.get('abstract', ''),
                    "summary": patent.get('summary', ''),
                    "relevance_score": 1.0
                })
        
        # Create DataFrame
        df = pd.DataFrame(testData)
        
        # Save to file
        output_file = Path(self.output_file).with_suffix('.jsonl')
        df.to_json(output_file, orient='records', lines=True)
        
        print(f"Saved {len(df)} query-document pairs to {output_file}")
        return df


patent_data = get_IP_data(limit=300)  # Adjust limit as needed    
# Create dataset
dataset_creator = RetrievalDataset(patent_data)
dataset = dataset_creator.create_dataset()
dataset

Saved 862 query-document pairs to retrieval_dataset.jsonl


Unnamed: 0,query,document_id,title,abstract,summary,relevance_score
0,VEHICLE IMAGING SYSTEM AND METHOD,US20180249132A1-20180830,VEHICLE IMAGING SYSTEM AND METHOD,The present disclosure relates to a vehicle im...,<SOH> SUMMARY OF THE INVENTION <EOH>Aspects of...,1.0
1,The present disclosure relates to a vehicle im...,US20180249132A1-20180830,VEHICLE IMAGING SYSTEM AND METHOD,The present disclosure relates to a vehicle im...,<SOH> SUMMARY OF THE INVENTION <EOH>Aspects of...,1.0
2,<SOH> SUMMARY OF THE INVENTION <EOH>Aspects of...,US20180249132A1-20180830,VEHICLE IMAGING SYSTEM AND METHOD,The present disclosure relates to a vehicle im...,<SOH> SUMMARY OF THE INVENTION <EOH>Aspects of...,1.0
3,PTYCHOGRAPHY SYSTEM,US20180284418A1-20181004,PTYCHOGRAPHY SYSTEM,A single-exposure ptychography system is prese...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,1.0
4,A single-exposure ptychography system is prese...,US20180284418A1-20181004,PTYCHOGRAPHY SYSTEM,A single-exposure ptychography system is prese...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,1.0
...,...,...,...,...,...,...
857,A rotary knife fixture for cutting vegetable p...,US20180141230A1-20180524,"ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU...",A rotary knife fixture for cutting vegetable p...,<SOH> SUMMARY <EOH>In accordance with the inve...,1.0
858,<SOH> SUMMARY <EOH>In accordance with the inve...,US20180141230A1-20180524,"ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU...",A rotary knife fixture for cutting vegetable p...,<SOH> SUMMARY <EOH>In accordance with the inve...,1.0
859,SOLID STATE FORMS OF ELUXADOLINE,US20180228773A1-20180816,SOLID STATE FORMS OF ELUXADOLINE,Disclosed are solid state forms of Eluxadoline...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,1.0
860,Disclosed are solid state forms of Eluxadoline...,US20180228773A1-20180816,SOLID STATE FORMS OF ELUXADOLINE,Disclosed are solid state forms of Eluxadoline...,<SOH> SUMMARY OF THE INVENTION <EOH>The presen...,1.0


In [None]:
from typing import List, Dict, Callable, Iterable, Tuple
import math

def dcg_at_k(labels: List[int], k: int) -> float:
    dcg = 0.0
    for i, rel in enumerate(labels[:k]):
        if rel:
            dcg += 1.0 / math.log2(i + 2)  # positions are 1-based in log
    return dcg

def ndcg_at_k(ranked_ids: List[str], relevant_set: set, k: int) -> float:
    labels = [1 if doc_id in relevant_set else 0 for doc_id in ranked_ids[:k]]
    dcg = dcg_at_k(labels, k)
    # Ideal ranking places every relevant document (up to k of them) at the top
    ideal_labels = [1] * min(len(relevant_set), k)
    idcg = dcg_at_k(ideal_labels, k)
    return (dcg / idcg) if idcg > 0 else 0.0

def mrr_at_k(ranked_ids: List[str], relevant_set: set, k: int) -> float:
    for i, doc_id in enumerate(ranked_ids[:k], start=1):
        if doc_id in relevant_set:
            return 1.0 / i
    return 0.0

def precision_at_k(ranked_ids: List[str], relevant_set: set, k: int) -> float:
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_set)
    return hits / float(k)

def recall_at_k(ranked_ids: List[str], relevant_set: set, k: int) -> float:
    if not relevant_set:
        return 0.0
    hits = sum(1 for doc_id in ranked_ids[:k] if doc_id in relevant_set)
    return hits / float(len(relevant_set))

def hitrate_at_k(ranked_ids: List[str], relevant_set: set, k: int) -> float:
    return 1.0 if any(doc_id in relevant_set for doc_id in ranked_ids[:k]) else 0.0


def evaluate_retriever(
    df: 'pd.DataFrame',
    retriever: Callable[[str, int], List[str]],
    ks: Iterable[int] = (1, 3, 5, 10),
) -> Dict[int, Dict[str, float]]:
    """
    Evaluate a retriever using binary success metrics.
    
    Args:
        df: DataFrame containing 'query' and 'document_id' columns
        retriever: Function that takes (query: str, k: int) and returns list of document IDs
        ks: List of k values to evaluate at (e.g., [1, 3, 5, 10])
        
    Returns:
        Dictionary mapping k to success rate and average rank
    """
    ks = sorted(set(ks))
    results = {k: {"Success@K": 0.0, "AvgRank": 0.0} for k in ks}
    n = len(df)
    max_k = max(ks) if ks else 10
    
    # Convert document_id to string for consistent comparison
    df = df.copy()
    df['document_id'] = df['document_id'].astype(str)
    
    for _, row in df.iterrows():
        query = row['query']
        target_doc_id = row['document_id']

        retrieved_docs = [str(doc_id) for doc_id in retriever(query, max_k)]

        # Rank (1-based) of the target document, or None if it was not retrieved
        rank = next(
            (r for r, doc_id in enumerate(retrieved_docs, 1) if doc_id == target_doc_id),
            None,
        )

        # Update every k exactly once per query
        for k in ks:
            if rank is not None and rank <= k:
                results[k]["Success@K"] += 1.0
                results[k]["AvgRank"] += rank
            else:
                # Miss within the top k: count it as rank max_k + 1
                results[k]["AvgRank"] += max_k + 1
                    
    
    # Calculate final metrics
    for k in ks:
        results[k]["Success@K"] = round(results[k]["Success@K"] / n, 4)
        results[k]["AvgRank"] = round(results[k]["AvgRank"] / n, 2)
    
    # Print results
    print("\n" + "="*80)
    print("Retriever Evaluation Results (Binary Success)")
    print("="*80)
    print(f"Number of queries: {n}")
    print("-"*80)
    print("K".ljust(6) + "| " + "Success@K".ljust(10) + " | " + "AvgRank")
    print("-"*80)
    for k in ks:
        r = results[k]
        print(f"{str(k).ljust(6)}| {str(r['Success@K']).ljust(10)} | {r['AvgRank']}")
    
    return results



In [12]:
evaluate_retriever(dataset, retriever)

Error processing query 'VEHICLE IMAGING SYSTEM AND METHOD': 'PatentRetriever' object is not callable
Error processing query 'The present disclosure relates to a vehicle imaging apparatus (2) having a location determining module (10) for determining the relative location of a remote imaging apparatus (3).': 'PatentRetriever' object is not callable
Error processing query '<SOH> SUMMARY OF THE INVENTION <EOH>Aspects of the present invention relate to a vehicle imaging apparatus; to a remote imaging apparatus; to a method of modifying image data; to a': 'PatentRetriever' object is not callable
Error processing query 'PTYCHOGRAPHY SYSTEM': 'PatentRetriever' object is not callable
Error processing query 'A single-exposure ptychography system is presented.': 'PatentRetriever' object is not callable
Error processing query '<SOH> SUMMARY OF THE INVENTION <EOH>The present invention relates, in some embodiments thereof, to single-exposure ptychography system comprising an optical unit defining an

{1: {'Success@K': 0.0, 'AvgRank': 11.0},
 3: {'Success@K': 0.0, 'AvgRank': 11.0},
 5: {'Success@K': 0.0, 'AvgRank': 11.0},
 10: {'Success@K': 0.0, 'AvgRank': 11.0}}
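
The errors above arise because `evaluate_retriever` expects a function `(query, k) -> list of document IDs`, while a `PatentRetriever` instance was passed directly. A minimal sketch of an adapter follows, assuming `search(query, top_k=...)` returns dicts with a `publication_number` key, as in the retrieval example earlier. The names `make_doc_id_retriever` and `chunks_per_doc`, and the stub class, are hypothetical.

```python
from typing import Callable, Dict, List


def make_doc_id_retriever(patent_retriever, chunks_per_doc: int = 3) -> Callable[[str, int], List[str]]:
    """Wrap an object with .search(query, top_k=...) into a (query, k) -> [doc_id] function."""

    def retrieve(query: str, k: int) -> List[str]:
        # Over-fetch chunks because several chunks can belong to the same patent
        hits = patent_retriever.search(query, top_k=k * chunks_per_doc)
        doc_ids: List[str] = []
        for hit in hits:
            doc_id = str(hit["publication_number"])
            if doc_id not in doc_ids:  # keep only the best-ranked chunk per patent
                doc_ids.append(doc_id)
        return doc_ids[:k]

    return retrieve


class StubRetriever:
    """Stand-in for PatentRetriever so the sketch runs without Milvus."""

    def search(self, query: str, top_k: int = 5) -> List[Dict[str, str]]:
        hits = [{"publication_number": p} for p in ["A", "A", "B", "C", "B"]]
        return hits[:top_k]


retrieve = make_doc_id_retriever(StubRetriever())
print(retrieve("packaging", 2))
```

With the real retriever, the evaluation call would then become `evaluate_retriever(dataset, make_doc_id_retriever(retriever))`. Deduplicating by patent matters here because the index stores chunks, while the evaluation set is labeled at the document level.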