Semantic search and RAG CLI over indexed research catalogs. All output is machine-readable JSON designed for AI agent consumption (Claude, Codex, etc.).
Requires Python 3.14+ and uv.
```
cd corpus
uv sync
uv run corpus --help
```

To install as a global CLI:

```
uv pip install -e .
corpus --help
```

```
# 1. List available catalogs
corpus catalogs

# 2. Pre-compute embeddings (run once per catalog)
corpus embed

# 3. Semantic search
corpus search "agentic fuzzing for web applications"

# 4. Retrieve papers as LLM context
corpus rag "supply chain vulnerability detection" --top-k 5
```

All commands output JSON to stdout. Redirect stderr away (`2>/dev/null`) when consuming output from an agent.
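Since every command emits JSON on stdout, an agent-side wrapper reduces to subprocess plus `json.loads`. A minimal sketch (the helper name and the injectable `runner` are illustrative conveniences, not part of the tool):

```python
import json
import subprocess

def corpus_json(*args, runner=subprocess.run):
    """Run a corpus subcommand and return its parsed JSON stdout.

    `runner` is injectable purely so this helper can be exercised
    without the CLI installed; by default it shells out to `corpus`.
    """
    # capture_output keeps stderr out of the JSON stream
    proc = runner(["corpus", *args], capture_output=True, text=True)
    return json.loads(proc.stdout)
```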
Pre-compute and cache sentence-transformer embeddings for all papers in a catalog. Run once after adding or updating a catalog file.
```
corpus embed
corpus embed -c my-custom-catalog
```

Output:

```json
{
  "status": "ok",
  "papers_embedded": 28,
  "model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "cache_path": "/path/to/catalogs/.cache-arxiv-vuln-catalog/.embeddings.npz"
}
```

Semantic search over embedded papers. Returns results ranked by cosine similarity.
```
corpus search "agentic DAST fuzzing"
corpus search "LLM vulnerability detection" --top-k 3
corpus search "supply chain" -k 10 -c arxiv-vuln-catalog
```

| Flag | Default | Description |
|---|---|---|
| `--top-k`, `-k` | `5` | Number of results to return |
| `--catalog`, `-c` | `arxiv-vuln-catalog` | Catalog name (without `.json`) |
Output:
```json
{
  "query": "agentic DAST fuzzing",
  "top_k": 3,
  "results": [
    {
      "id": "2604.01442",
      "title": "Fuzzing with Agents? Generators Are All You Need (Gentoo)",
      "url": "https://arxiv.org/abs/2604.01442",
      "tier": 1,
      "categories": ["agentic", "fuzzing", "dast"],
      "submitted": "2026-04-01",
      "authors": ["Vasudev Vikram", "Rohan Padhye"],
      "score": 0.5684
    }
  ]
}
```

Retrieve papers and format them as a structured LLM context block. Includes full abstracts, key results, relevance notes, cross-references, and identified research gaps.
```
corpus rag "multi-agent vulnerability detection"
corpus rag "supply chain LLM detection" --top-k 5
```

| Flag | Default | Description |
|---|---|---|
| `--top-k`, `-k` | `5` | Number of papers to retrieve |
| `--catalog`, `-c` | `arxiv-vuln-catalog` | Catalog name (without `.json`) |
Output structure:
```json
{
  "query": "...",
  "retrieved_count": 5,
  "papers": [
    {
      "id": "...",
      "title": "...",
      "url": "...",
      "pdf": "...",
      "tier": 1,
      "categories": ["..."],
      "authors": ["..."],
      "submitted": "...",
      "abstract": "...",
      "key_results": {},
      "relevance_notes": "...",
      "similarity_score": 0.6547
    }
  ],
  "instruction": "Use the papers above as context...",
  "cross_references": {
    "agentic_architecture_patterns": [],
    "key_findings": []
  },
  "gaps": []
}
```

Filter papers by tier and/or category tag. Does not require embeddings.
```
corpus filter --tier 1
corpus filter --category sca
corpus filter --tier 1 --category multi_agent
```

| Flag | Default | Description |
|---|---|---|
| `--tier`, `-t` | `None` | Filter by tier (1-5) |
| `--category` | `None` | Filter by category tag |
| `--catalog`, `-c` | `arxiv-vuln-catalog` | Catalog name |
Available tiers:
| Tier | Description |
|---|---|
| 1 | Agentic systems with foundation models for security |
| 2 | Foundation model vuln detection (non-agentic) |
| 3 | Supply chain / SCA / SBOM adjacent |
| 4 | DAST / API security testing |
| 5 | Security benchmarks for agentic systems |
Available categories:
agentic, benchmark, dast, detection_engineering, fuzzing, hardware_security, llm_evaluation, multi_agent, neuro_symbolic, penetration_testing, rag, sast, sca, smart_contract, supply_chain, survey, traditional_security_assessment
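The tier/category filter is a plain predicate over catalog entries. A minimal sketch of the logic (illustrative only, not the tool's actual code):

```python
def filter_papers(papers, tier=None, category=None):
    """Keep papers matching the given tier and/or category tag.

    Mirrors `corpus filter`: both filters default to None (no filtering),
    and a paper matches a category if the tag appears in its list.
    """
    return [
        p for p in papers
        if (tier is None or p["tier"] == tier)
        and (category is None or category in p["categories"])
    ]
```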
Get full details for a paper by its arxiv ID.
```
corpus get 2604.00704
```

Returns the complete paper object including abstract, key results, and relevance notes. Returns an error with `available_ids` if the paper is not found.
List all papers with compact summaries. Does not require embeddings.
```
corpus list
corpus list -c my-catalog
```

Show identified research gaps from the catalog metadata.
```
corpus gaps
```

Show agentic architecture patterns and cross-cutting key findings extracted from the catalog.

```
corpus patterns
```

Output structure:
```json
{
  "agentic_architecture_patterns": [
    {
      "pattern": "multi-agent with shared memory",
      "papers": ["2603.26270", "2603.27127"]
    }
  ],
  "key_findings": [
    {
      "finding": "Data quality and balancing > model scale for vuln detection",
      "papers": ["2604.00112"]
    }
  ]
}
```

List all available catalog files in the catalogs directory.

```
corpus catalogs
```

All output is JSON. An AI agent can use `corpus` as a subprocess:
```
# Semantic search, pipe to jq
result=$(corpus search "agentic DAST" 2>/dev/null)
echo "$result" | jq '.results[].title'

# RAG context for LLM prompt construction
context=$(corpus rag "supply chain detection" 2>/dev/null)

# Structured filtering
corpus filter --tier 1 --category sca 2>/dev/null | jq '.results'

# Paper lookup
corpus get 2604.00704 2>/dev/null | jq '.abstract'
```

Error handling: errors are returned as JSON with `error` and `fix` fields:
```json
{
  "error": "embeddings not cached",
  "fix": "run: corpus embed -c arxiv-vuln-catalog"
}
```

Exit code is 1 on error, 0 on success.
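An agent can branch on the exit code and surface the `fix` hint directly. A sketch of that pattern (the helper name is illustrative):

```python
import json

def interpret(returncode, stdout):
    """Turn a finished corpus invocation into a result or a clear failure.

    On exit code 1 the JSON payload carries "error" and "fix" fields,
    so the suggested remediation is included in the raised message.
    """
    payload = json.loads(stdout)
    if returncode != 0:
        raise RuntimeError(f"{payload['error']} (fix: {payload['fix']})")
    return payload
```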
Catalog files are JSON placed in the catalogs/ directory. Each catalog must conform to this schema:
```json
{
  "$schema": "arxiv-research-catalog/v1",
  "metadata": {
    "description": "...",
    "source_categories": ["cs.SE", "cs.CR"],
    "date_range": {"start": "2026-03-26", "end": "2026-04-03"},
    "total_papers_scanned": 334,
    "total_relevant_found": 28,
    "created": "2026-04-02",
    "search_terms": ["..."]
  },
  "taxonomy": {
    "tiers": {
      "tier_1": "Description of tier 1",
      "tier_2": "Description of tier 2"
    },
    "categories": ["agentic", "sca", "dast"]
  },
  "papers": [
    {
      "id": "2604.00704",
      "title": "Paper Title",
      "url": "https://arxiv.org/abs/2604.00704",
      "pdf": "https://arxiv.org/pdf/2604.00704",
      "authors": ["Author One", "Author Two"],
      "submitted": "2026-04-01",
      "tier": 1,
      "categories": ["agentic", "sca"],
      "abstract": "Full abstract text...",
      "key_results": {
        "success_rate": 0.82,
        "tasks_evaluated": 660
      },
      "relevance_notes": "Why this paper matters..."
    }
  ],
  "gaps_identified": [
    {
      "category": "sbom",
      "description": "No papers address...",
      "opportunity": "Could combine..."
    }
  ],
  "cross_references": {
    "agentic_architecture_patterns": [
      {"pattern": "multi-agent with shared memory", "papers": ["2603.26270"]}
    ],
    "key_findings": [
      {"finding": "Data quality > model scale", "papers": ["2604.00112"]}
    ]
  }
}
```

| Field | Type | Description |
|---|---|---|
| `id` | string | Unique identifier (e.g., arxiv ID) |
| `title` | string | Full paper title |
| `url` | string | URL to the paper page |
| `pdf` | string | URL to the PDF |
| `authors` | string[] | List of author names |
| `submitted` | string | ISO date string |
| `tier` | int | Relevance tier (1 = highest) |
| `categories` | string[] | Category tags |
| `abstract` | string | Full abstract text |
| Field | Type | Default | Description |
|---|---|---|---|
| `key_results` | dict | `{}` | Quantitative results |
| `relevance_notes` | string | `""` | Curator notes |
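Before running `corpus embed` on a hand-written catalog, a quick structural check of each paper entry can catch missing required fields. A plain-Python sketch (the tool itself validates with pydantic; field names come from the schema above):

```python
# Required paper fields per the catalog schema
REQUIRED = {"id", "title", "url", "pdf", "authors", "submitted",
            "tier", "categories", "abstract"}

def missing_fields(paper):
    """Return the required schema fields absent from a paper dict."""
    return sorted(REQUIRED - paper.keys())
```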
- Create a JSON file in `catalogs/` following the schema above
- Run `corpus embed -c <name>` to compute embeddings
- Search with `corpus search "query" -c <name>`
```
# Example: add a new catalog
cp my-new-catalog.json catalogs/
corpus embed -c my-new-catalog
corpus search "interesting topic" -c my-new-catalog
```

| Env Variable | Default | Description |
|---|---|---|
| `CORPUS_CATALOGS_DIR` | `<project>/catalogs/` | Override the catalogs directory path |
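Resolving the override is the usual environment-with-fallback pattern. A sketch (the function name is illustrative, not the tool's API):

```python
import os
from pathlib import Path

def resolve_catalogs_dir(default="catalogs"):
    """Return the catalogs directory, honouring CORPUS_CATALOGS_DIR."""
    return Path(os.environ.get("CORPUS_CATALOGS_DIR", default))
```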
```
src/corpus/
  cli/
    main.py          # Typer CLI with all subcommands
  engine/
    embeddings.py    # EmbeddingEngine: sentence-transformer wrapper with caching
  models/
    paper.py         # Paper, PaperSummary domain models
    catalog.py       # Catalog, CatalogMetadata, Gap, CrossReferences
    results.py       # SearchResponse, RagResponse, FilterResponse, ErrorResponse
  services/
    loader.py        # CatalogLoader: JSON file loading and validation
    search.py        # SearchService: semantic search and filtering
    rag.py           # RagService: RAG context retrieval and formatting
catalogs/
  arxiv-vuln-catalog.json      # Bundled catalog
  .cache-arxiv-vuln-catalog/   # Auto-generated embedding cache
```
```
CLI (main.py)
  -> Services (search.py, rag.py, loader.py)
  -> Engine (embeddings.py)
  -> Models (paper.py, catalog.py, results.py)
```
- CLI depends on services, never on engine directly
- Services depend on engine and models
- Models have no internal dependencies beyond pydantic
- Engine has no dependency on models (operates on raw lists/arrays)
- `CatalogLoader` reads and validates JSON into a `Catalog` model
- `EmbeddingEngine.build_corpus()` concatenates paper fields into searchable strings
- `EmbeddingEngine.encode()` produces L2-normalised vectors via `all-MiniLM-L6-v2`
- Vectors are cached as compressed `.npz` in `catalogs/.cache-<name>/`
- At query time, the query is encoded and scored against the cached corpus via dot product
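With L2-normalised vectors, cosine similarity reduces to a matrix-vector dot product, and the `.npz` cache is a single compressed array. A numpy-only sketch of the scoring step (function names are illustrative, not the tool's internals; model calls are replaced by placeholder vectors):

```python
import numpy as np

def l2_normalise(vectors):
    """Scale each row to unit length so dot product == cosine similarity."""
    norms = np.linalg.norm(vectors, axis=1, keepdims=True)
    return vectors / norms

def top_k(corpus_vecs, query_vec, k=5):
    """Score the query against every cached paper, best first."""
    scores = corpus_vecs @ query_vec          # cosine similarity per paper
    order = np.argsort(scores)[::-1][:k]      # highest similarity first
    return [(int(i), float(scores[i])) for i in order]

# Cache round-trip, as with the .embeddings.npz file:
#   np.savez_compressed(path, embeddings=corpus_vecs)
#   corpus_vecs = np.load(path)["embeddings"]
```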
| Package | Purpose |
|---|---|
| `sentence-transformers` | Embedding model (`all-MiniLM-L6-v2`, 22M params) |
| `numpy` | Vector operations and caching |
| `pydantic` | Strict model validation |
| `typer` | CLI framework |
MIT