# Data Acquisition & Goal

### Goal

A Patent RAG system with:

- Contextual Embeddings: chunks are enriched with concise context (e.g., “This chunk is from Patent X, abstract section, about photovoltaic cells, filed 2016-01-15”).

- Hybrid Retrieval: combine vector similarity (Milvus) + lexical BM25 (via Elasticsearch/OpenSearch).

- Re-ranking: re-order retrieved chunks with a cross-encoder or LLM scoring step.
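The re-ranking step can be sketched as a function that takes any scoring callable and re-orders the retrieved chunks. This is a minimal illustration, not the project's implementation: `rerank`, `token_overlap`, and the two sample chunks are hypothetical, and the token-overlap scorer is only a toy stand-in for a cross-encoder or LLM score.

```python
from typing import Callable, Dict, List


def rerank(
    query: str,
    chunks: List[Dict],
    score_fn: Callable[[str, str], float],
    top_n: int = 5,
) -> List[Dict]:
    """Re-order retrieved chunks by score_fn(query, text), highest first."""
    scored = [{**chunk, "rerank_score": score_fn(query, chunk["text"])} for chunk in chunks]
    scored.sort(key=lambda c: c["rerank_score"], reverse=True)
    return scored[:top_n]


def token_overlap(query: str, text: str) -> float:
    """Toy stand-in scorer: fraction of query tokens that appear in the text."""
    query_tokens = set(query.lower().split())
    text_tokens = set(text.lower().split())
    return len(query_tokens & text_tokens) / len(query_tokens) if query_tokens else 0.0


# Hypothetical candidates, as they might come back from the first-stage retriever
candidates = [
    {"publication_number": "A", "text": "flue gas cleaning with iron oxide"},
    {"publication_number": "B", "text": "solar panel with a dust cleaning brush"},
]
reranked = rerank("solar panel dust cleaning", candidates, token_overlap)
print([c["publication_number"] for c in reranked])
```

Keeping the scorer as a parameter means the toy function can later be swapped for a model-based score without touching the sorting logic.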


Contextual Retrieval is an advanced retrieval method proposed by Anthropic to address the issue of semantic isolation of chunks, which arises in current Retrieval-Augmented Generation (RAG) solutions.
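As a rough illustration of what "enriching a chunk with context" means, the sketch below prepends a one-line header built from patent metadata. Note the substitution: Anthropic's method has an LLM write the context for each chunk, whereas this sketch uses a fixed template. The function name `contextualize_chunk` and the header wording are assumptions; the field names match the metadata fields loaded later in this notebook.

```python
from typing import Dict


def contextualize_chunk(chunk_text: str, patent: Dict[str, str], section: str) -> str:
    """Prepend a short context header so the chunk is self-describing."""
    context = (
        f"This chunk is from patent {patent.get('publication_number', 'unknown')}, "
        f"{section} section, title: {patent.get('title', 'unknown')}, "
        f"filed {patent.get('filing_date', 'unknown')}."
    )
    # The header and the chunk are embedded together as one string
    return f"{context}\n\n{chunk_text}"


patent = {
    "publication_number": "US20180249132A1-20180830",
    "title": "VEHICLE IMAGING SYSTEM AND METHOD",
}
enriched = contextualize_chunk(
    "An image receiver module (7) is provided to receive image data (DT).",
    patent,
    section="abstract",
)
print(enriched)
```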

In [9]:
# The HF repo did not seem to work, so I ended up downloading the data manually.
# The dataset is mainly 2018 IP data from HUPD.
from pprint import pprint
from datasets import load_dataset
import os
import json

# dataset_dict = load_dataset('HUPD/hupd',
#     name='sample',
#     data_files="https://huggingface.co/datasets/HUPD/hupd/blob/main/hupd_metadata_2022-02-22.feather",
#     icpr_label=None,
# )


In [3]:
def get_IP_data(limit=20):
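    # Load up to `limit` patent JSON files, closing each file after reading
    ip_files = []
    for file in os.listdir("RawData/2018")[:limit]:
        with open(os.path.join("RawData/2018", file)) as f:
            ip_files.append(json.load(f))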

    relevant_fields = [
        # Identifiers
        "publication_number",
        "application_number",
        "patent_number",
        # Dates
        "date_published",
        "filing_date",
        "patent_issue_date",
        "abandon_date",
        # Status & Classes
        "decision",
        "main_cpc_label",
        "main_ipcr_label",
        # Retrievable Text
        "title",
        "abstract",
        "summary",
    ]

    ip_files = [{key: value for key, value in file.items() if key in relevant_fields} for file in ip_files]
    return ip_files

data = get_IP_data()
pprint(data[0])

{'abandon_date': '',
 'abstract': 'The present disclosure relates to a vehicle imaging apparatus '
             '(2) having a location determining module (10) for determining '
             'the relative location of a remote imaging apparatus (3). The '
             'location determining module (10) is configured to receive a '
             'tracking signal (S2) from a remote transmitter (16) associated '
             'with the remote imaging apparatus (3). An image receiver module '
             '(7) is provided to receive image data (DT) transmitted by the '
             'remote imaging apparatus (3). At least one image processor (5) '
             'is provided to process the image data (DT) in dependence on the '
             'determined location of the remote imaging apparatus (3). The '
             'present disclosure also relates to a remote imaging apparatus '
             '(3) for mounting to a trailer (T). The remote imaging apparatus '
             '(3) having a camera (CT);

# RAG Design

### Ingestion


## Overview

The IP Assistant is a Retrieval-Augmented Generation (RAG) system for intellectual property (patent) documents. It ingests patent data, chunks text content, generates embeddings using sentence transformers, and stores everything in a Milvus vector database for semantic search and retrieval.


## System Components

### 1. Docker Services

The system runs three Docker containers in an isolated, containerized environment:

- **etcd**: Distributed key-value store for Milvus metadata
- **minio**: Object storage for Milvus data persistence
- **milvus-standalone**: Vector database for embeddings and metadata

### 2. IP Assistant Data Pipeline

1. Data Loading
   - Loads patent data from JSON files in RawData/2018/
   - Filters to essential fields defined in RELEVANT_FIELDS
   - Processes metadata and text content
2. Text Processing
   - Chunking: Splits text into 256-token chunks with 50-token overlap
   - Fields Processed: Title, abstract, and summary
   - Handles special characters and formatting
3. Embedding Generation
   - Uses sentence-transformers/all-MiniLM-L6-v2 model
   - Generates 384-dimensional vectors
   - Normalizes embeddings for cosine similarity
4. Milvus Storage
   - Collection: ip_chunks
     - Metadata: Publication/patent numbers, dates, classifications
     - Content: Text chunks and embeddings
   - Indexing: HNSW for vector search, secondary indexes for filtering
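Steps 2 and 3 above can be sketched in a few lines of plain Python. This is an illustrative sketch, not the code in `ip_assistant.ingestion`: `chunk_tokens`, `l2_normalize`, and `dot` are hypothetical helpers, the tokens are placeholder strings rather than output of the model's tokenizer, and the vectors are tiny stand-ins for 384-dimensional embeddings.

```python
import math
from typing import List


def chunk_tokens(tokens: List[str], chunk_size: int = 256, overlap: int = 50) -> List[List[str]]:
    """Split tokens into windows of chunk_size that share `overlap` tokens."""
    if not 0 <= overlap < chunk_size:
        raise ValueError("overlap must be in [0, chunk_size)")
    step = chunk_size - overlap
    chunks = []
    for start in range(0, len(tokens), step):
        chunks.append(tokens[start:start + chunk_size])
        if start + chunk_size >= len(tokens):
            break  # this window already reaches the end of the text
    return chunks


def l2_normalize(vector: List[float]) -> List[float]:
    """Scale a vector to unit length (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vector))
    return [x / norm for x in vector] if norm else list(vector)


def dot(a: List[float], b: List[float]) -> float:
    return sum(x * y for x, y in zip(a, b))


tokens = [f"tok{i}" for i in range(600)]  # placeholder tokens
chunks = chunk_tokens(tokens)
print([len(c) for c in chunks])

# After normalization, the dot product of two vectors equals their cosine similarity
u, v = l2_normalize([3.0, 4.0]), l2_normalize([4.0, 3.0])
print(round(dot(u, v), 4))
```

With a 256-token window and a 50-token overlap, each new chunk starts 206 tokens after the previous one, so text near a chunk boundary appears in both neighbors.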

In [4]:
from ip_assistant.ingestion import ingest_patents

ingest_patents(ip_limit=1000)

  Clearing existing collection 'ip_chunks'...
  ✓ Collection 'ip_chunks' cleared successfully


Ingesting patents:   0%|          | 0/1000 [00:00<?, ?it/s]Token indices sequence length is longer than the specified maximum sequence length for this model (991 > 512). Running this sequence through the model will result in indexing errors
Ingesting patents: 100%|██████████| 1000/1000 [01:04<00:00, 15.55it/s]


✓ Ingestion complete!
  • Processed 12813 text chunks from 1000 patents
  • Collection 'ip_chunks' now has 12813 entities


### Retrieval

In [21]:
from ip_assistant.retriever import PatentRetriever
retriever = PatentRetriever()


combined, vec_results, bm25_results = retriever.search("Solar panel tech using butter chicken for dust cleaning", top_k=5)

In [22]:
pprint("\nVector Results:")
pprint(vec_results)
pprint("\nBM25 Results:")
pprint(bm25_results)

'\nVector Results:'
[{'_v_score': 0.3877083659172058,
  'decision': 'PENDING',
  'publication_number': 'US20180200672A1-20180719',
  'score': 0.3877083659172058,
  'section': 'summary',
  'text': 'during the flight phase in the flue gas and during the dwell phase '
          'in the filter cake of a dedusting system. as a result of higher '
          'hgox concentrations, the overall degree of separation of hg '
          'increases via downstream devices for flue gas cleaning. adsorptive '
          'separation effects of mercury can also take place directly on the '
          'injected material. essentially, the method is composed of the '
          'following method steps and characteristics : 1. the method is used '
          'for oxidation and thus improves the separation of mercury in power '
          'plant flue gas with the addition of a powdery, catalytically active '
          'material having mean grain diameters < 35 μm, preferably iron ( iii '
          ') oxide. 2. the p

In [24]:
results, _, _ = retriever.search("Solar panel tech", top_k=5)
for i, result in enumerate(results, 1):
    print(f"\nResult {i}:")
    print(f"Patent: {result['publication_number']}")
    print(f"Score: {result['score']:.3f}")
    print(f"Text: {result['text'][:150]}...")


Result 1:
Patent: US20180302022A1-20181018
Score: 0.032
Text: integrated solar energy utilization apparatus is connected in a thermally conductive manner to the first working fluid in the container. the first pip...

Result 2:
Patent: US20180302022A1-20181018
Score: 0.031
Text: filled with a heat storage substance for storing heat energy. 12. the apparatus according to claim 11, wherein the thermal - energy storage is provide...

Result 3:
Patent: US20180302022A1-20181018
Score: 0.031
Text: device node connected in the first pipeline network for storage, or for energy conversion, or for energy exchange. 14. the system according to claim 1...

Result 4:
Patent: US20180302022A1-20181018
Score: 0.031
Text: storing power generation element, it further includes a working substance for directly storing heat flowing to the first heat - conducting end or conv...

Result 5:
Patent: US20180302022A1-20181018
Score: 0.016
Text: 1. an integrated solar energy utilization apparatus, comprising : at 
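
The combined scores above (roughly 0.016 to 0.032) are in the range that reciprocal rank fusion (RRF) with the common constant k = 60 would produce: 1/61 ≈ 0.016 for a chunk ranked first by only one retriever, and 1/61 + 1/63 ≈ 0.032 for a chunk ranked near the top by both. The source does not show how `PatentRetriever` merges the vector and BM25 lists, so the following is only a sketch of RRF under that assumption; the function name and chunk IDs are hypothetical.

```python
from collections import defaultdict
from typing import Dict, List


def reciprocal_rank_fusion(rankings: List[List[str]], k: int = 60) -> Dict[str, float]:
    """Fuse ranked ID lists: each list contributes 1 / (k + rank) per item."""
    scores: Dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, item_id in enumerate(ranking, start=1):
            scores[item_id] += 1.0 / (k + rank)
    # Highest fused score first
    return dict(sorted(scores.items(), key=lambda kv: kv[1], reverse=True))


# Hypothetical chunk IDs, best match first in each list
vector_ranking = ["chunk_a", "chunk_b", "chunk_c"]
bm25_ranking = ["chunk_b", "chunk_a", "chunk_d"]

fused = reciprocal_rank_fusion([vector_ranking, bm25_ranking])
for chunk_id, score in fused.items():
    print(f"{chunk_id}: {score:.4f}")
```

RRF uses only ranks, so it avoids having to put cosine similarities and BM25 scores on a common scale.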

# Evals - Binary Evaluation
Single-label retrieval



The current evaluation assumes each query has exactly one correct document, since this is how the evaluation set is constructed.

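What the metrics mean in the current setup (one relevant patent per query)

If there is only one relevant document per query, the metrics reduce to simple closed forms:

| Metric | Meaning in this setup |
|---|---|
| Precision@K | 1/K if the relevant doc is within the top K, else 0, because at most one of the K docs can be relevant. |
| Recall@K | 1 if the relevant doc is within the top K, else 0, because the only relevant doc is either retrieved or not. |
| Success@K | Identical to Recall@K. |
| F1@K | 2 * P * R / (P + R), which is 2 / (K + 1) when the relevant doc is retrieved, else 0. |
| MRR | Average of 1 / rank_of_relevant over all queries. Example: if the relevant doc is at rank 1 for 80% of queries and at rank 2 for the rest, MRR = 0.8 * 1 + 0.2 * 0.5 = 0.9 (not 1 / 1.2 ≈ 0.83, the reciprocal of the average rank). |
| AvgRank | Average position of the relevant doc (1 = top). |

These identities assume each patent appears at most once in the top K. If the retriever returns several chunks of the same patent and each is counted as a hit, Precision@K can exceed 1/K.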
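
The single-label simplifications above can be checked with a small worked example. The helper below is a sketch, not the `evaluate_retriever` implementation from `evals.evaluation`; the name `single_label_metrics` and the IDs `P1` to `P3` are hypothetical, and it assumes each document ID appears at most once in the result list.

```python
from typing import Dict, List


def single_label_metrics(retrieved: List[str], relevant_id: str, k: int) -> Dict[str, float]:
    """Per-query metrics when exactly one document is relevant."""
    top_k = retrieved[:k]
    hit = relevant_id in top_k
    precision = 1.0 / k if hit else 0.0
    recall = 1.0 if hit else 0.0
    f1 = 2 * precision * recall / (precision + recall) if hit else 0.0
    # Reciprocal rank is 0 when the relevant document is not in the top K
    reciprocal_rank = 1.0 / (top_k.index(relevant_id) + 1) if hit else 0.0
    return {
        "precision": precision,
        "recall": recall,
        "f1": f1,
        "reciprocal_rank": reciprocal_rank,
    }


# The relevant patent P1 is returned at rank 2 of 3
metrics = single_label_metrics(["P3", "P1", "P2"], relevant_id="P1", k=3)
print(metrics)
```

Averaging `reciprocal_rank` over all queries gives MRR@K, and averaging `recall` gives Success@K.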


In [6]:
import json
from typing import List, Dict, Optional
from pathlib import Path
import pandas as pd


class RetrievalDataset:
    def __init__(self, data: List[Dict], output_file: str = "retrieval_dataset.jsonl"):
        self.data = data
        self.output_file = output_file

    def generate_queries(self, patent: Dict) -> List[str]:
        queries = []
        
        # 1. Add the title
        if title := patent.get('title'):
            queries.append(title.strip())
        
        # 2. Add first sentence of abstract
        if abstract := patent.get('abstract'):
            first_sentence = abstract.split('.')[0] + '.'
            if first_sentence not in queries:
                queries.append(first_sentence)
        
        # 3. Add first 30 words of summary
        if summary := patent.get('summary'):
            summary_preview = ' '.join(summary.split()[:30])
            if summary_preview not in queries:
                queries.append(summary_preview)
        return queries

    def create_dataset(self) -> 'pd.DataFrame':
        
        testData = []
        for patent in self.data:
            doc_id = patent.get('publication_number', str(hash(str(patent))))
            queries = self.generate_queries(patent)
            
            for query in queries:
                testData.append({
                    "query": query,
                    "document_id": doc_id,
                    "title": patent.get('title', ''),
                    "abstract": patent.get('abstract', ''),
                    "summary": patent.get('summary', ''),
                    "relevance_score": 1.0
                })
        
        # Create DataFrame
        df = pd.DataFrame(testData)
        
        # Save to file
        output_file = Path(self.output_file).with_suffix('.jsonl')
        df.to_json(output_file, orient='records', lines=True)
        
        print(f"Saved {len(df)} query-document pairs to {output_file}")
        return df


patent_data = get_IP_data(limit=300)  # Adjust limit as needed    
# Create dataset
dataset_creator = RetrievalDataset(patent_data)
test_dataset = dataset_creator.create_dataset()
test_dataset

Saved 862 query-document pairs to retrieval_dataset.jsonl


| | query | document_id | title | abstract | summary | relevance_score |
|---|---|---|---|---|---|---|
| 0 | VEHICLE IMAGING SYSTEM AND METHOD | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 1 | The present disclosure relates to a vehicle im... | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 2 | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | US20180249132A1-20180830 | VEHICLE IMAGING SYSTEM AND METHOD | The present disclosure relates to a vehicle im... | <SOH> SUMMARY OF THE INVENTION <EOH>Aspects of... | 1.0 |
| 3 | PTYCHOGRAPHY SYSTEM | US20180284418A1-20181004 | PTYCHOGRAPHY SYSTEM | A single-exposure ptychography system is prese... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| 4 | A single-exposure ptychography system is prese... | US20180284418A1-20181004 | PTYCHOGRAPHY SYSTEM | A single-exposure ptychography system is prese... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| ... | ... | ... | ... | ... | ... | ... |
| 857 | A rotary knife fixture for cutting vegetable p... | US20180141230A1-20180524 | ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU... | A rotary knife fixture for cutting vegetable p... | <SOH> SUMMARY <EOH>In accordance with the inve... | 1.0 |
| 858 | <SOH> SUMMARY <EOH>In accordance with the inve... | US20180141230A1-20180524 | ROTARY KNIFE FIXTURE FOR CUTTING SPIRAL, TEXTU... | A rotary knife fixture for cutting vegetable p... | <SOH> SUMMARY <EOH>In accordance with the inve... | 1.0 |
| 859 | SOLID STATE FORMS OF ELUXADOLINE | US20180228773A1-20180816 | SOLID STATE FORMS OF ELUXADOLINE | Disclosed are solid state forms of Eluxadoline... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |
| 860 | Disclosed are solid state forms of Eluxadoline... | US20180228773A1-20180816 | SOLID STATE FORMS OF ELUXADOLINE | Disclosed are solid state forms of Eluxadoline... | <SOH> SUMMARY OF THE INVENTION <EOH>The presen... | 1.0 |


In [11]:
from evals.evaluation import evaluate_retriever
import pandas as pd
from typing import List
from ip_assistant.retriever import PatentRetriever

retriever = PatentRetriever()
test_dataset = pd.read_json('evals/retrieval_dataset.jsonl', lines=True)

def search_wrapper(query: str, k: int) -> List[str]:
    # search() returns (combined, vector, BM25) results; evaluate the combined list
    results, _, _ = retriever.search(query, top_k=k)
    return [str(result['publication_number']) for result in results]


eval_res = evaluate_retriever(test_dataset, search_wrapper, ks=(1, 3, 5, 10, 20))

Retriever Evaluation Results (Micro-Averaged, each row is an independent query)
Number of rows (queries): 862
K     | Success@K | Precision@K | Recall@K | F1@K   | MRR@K  | AvgRank
------------------------------------------------------------------------
1     | 0.7981    | 0.7981      | 0.7981   | 0.7981 | 0.7981 | 1.20
3     | 0.8387    | 0.5220      | 0.8387   | 0.6435 | 0.8157 | 1.54
5     | 0.8492    | 0.3667      | 0.8492   | 0.5122 | 0.8181 | 1.85
10    | 0.8689    | 0.1938      | 0.8689   | 0.3170 | 0.8206 | 2.54
20    | 0.8910    | 0.0970      | 0.8910   | 0.1750 | 0.8221 | 3.72
