
corpus

Semantic search and RAG CLI over indexed research catalogs. All output is machine-readable JSON designed for AI agent consumption (Claude, Codex, etc.).

Install

Requires Python 3.14+ and uv.

cd corpus
uv sync
uv run corpus --help

To install as a global CLI:

uv pip install -e .
corpus --help

Quick Start

# 1. List available catalogs
corpus catalogs

# 2. Pre-compute embeddings (run once per catalog)
corpus embed

# 3. Semantic search
corpus search "agentic fuzzing for web applications"

# 4. Retrieve papers as LLM context
corpus rag "supply chain vulnerability detection" --top-k 5

CLI Reference

All commands output JSON to stdout. For agent use, redirect stderr (2>/dev/null) so only JSON reaches the consumer.

corpus embed

Pre-compute and cache sentence-transformer embeddings for all papers in a catalog. Run once after adding or updating a catalog file.

corpus embed
corpus embed -c my-custom-catalog

Output:

{
  "status": "ok",
  "papers_embedded": 28,
  "model": "all-MiniLM-L6-v2",
  "embedding_dim": 384,
  "cache_path": "/path/to/catalogs/.cache-arxiv-vuln-catalog/.embeddings.npz"
}

corpus search <query>

Semantic search over embedded papers. Returns ranked results by cosine similarity.

corpus search "agentic DAST fuzzing"
corpus search "LLM vulnerability detection" --top-k 3
corpus search "supply chain" -k 10 -c arxiv-vuln-catalog

| Flag | Default | Description |
| --- | --- | --- |
| `--top-k`, `-k` | 5 | Number of results to return |
| `--catalog`, `-c` | arxiv-vuln-catalog | Catalog name (without .json) |

Output:

{
  "query": "agentic DAST fuzzing",
  "top_k": 3,
  "results": [
    {
      "id": "2604.01442",
      "title": "Fuzzing with Agents? Generators Are All You Need (Gentoo)",
      "url": "https://arxiv.org/abs/2604.01442",
      "tier": 1,
      "categories": ["agentic", "fuzzing", "dast"],
      "submitted": "2026-04-01",
      "authors": ["Vasudev Vikram", "Rohan Padhye"],
      "score": 0.5684
    }
  ]
}

corpus rag <query>

Retrieve papers and format as a structured LLM context block. Includes full abstracts, key results, relevance notes, cross-references, and identified research gaps.

corpus rag "multi-agent vulnerability detection"
corpus rag "supply chain LLM detection" --top-k 5

| Flag | Default | Description |
| --- | --- | --- |
| `--top-k`, `-k` | 5 | Number of papers to retrieve |
| `--catalog`, `-c` | arxiv-vuln-catalog | Catalog name (without .json) |

Output structure:

{
  "query": "...",
  "retrieved_count": 5,
  "papers": [
    {
      "id": "...",
      "title": "...",
      "url": "...",
      "pdf": "...",
      "tier": 1,
      "categories": ["..."],
      "authors": ["..."],
      "submitted": "...",
      "abstract": "...",
      "key_results": {},
      "relevance_notes": "...",
      "similarity_score": 0.6547
    }
  ],
  "instruction": "Use the papers above as context...",
  "cross_references": {
    "agentic_architecture_patterns": [],
    "key_findings": []
  },
  "gaps": []
}
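The rag output is designed to be flattened into a prompt. As a minimal sketch (the `build_prompt` helper and its formatting are illustrative, not part of corpus; the field names mirror the output structure above):

```python
def build_prompt(rag_response: dict) -> str:
    """Flatten a `corpus rag` response into a plain-text LLM context block.

    Hypothetical helper: joins each paper's id, title, abstract, and
    relevance notes, then prepends the catalog's instruction string.
    """
    sections = []
    for paper in rag_response["papers"]:
        sections.append(
            f"[{paper['id']}] {paper['title']}\n"
            f"Abstract: {paper['abstract']}\n"
            f"Relevance: {paper.get('relevance_notes', '')}"
        )
    context = "\n\n".join(sections)
    return f"{rag_response.get('instruction', '')}\n\n{context}"

sample = {
    "query": "multi-agent vulnerability detection",
    "papers": [{
        "id": "2604.00704",
        "title": "Paper Title",
        "abstract": "Full abstract text...",
        "relevance_notes": "Why this paper matters...",
    }],
    "instruction": "Use the papers above as context...",
}
prompt = build_prompt(sample)
```

The resulting string can be placed directly into an LLM system or user prompt.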

corpus filter

Filter papers by tier and/or category tag. Does not require embeddings.

corpus filter --tier 1
corpus filter --category sca
corpus filter --tier 1 --category multi_agent

| Flag | Default | Description |
| --- | --- | --- |
| `--tier`, `-t` | None | Filter by tier (1-5) |
| `--category` | None | Filter by category tag |
| `--catalog`, `-c` | arxiv-vuln-catalog | Catalog name |

Available tiers:

| Tier | Description |
| --- | --- |
| 1 | Agentic systems with foundation models for security |
| 2 | Foundation model vuln detection (non-agentic) |
| 3 | Supply chain / SCA / SBOM adjacent |
| 4 | DAST / API security testing |
| 5 | Security benchmarks for agentic systems |

Available categories:

agentic, benchmark, dast, detection_engineering, fuzzing, hardware_security, llm_evaluation, multi_agent, neuro_symbolic, penetration_testing, rag, sast, sca, smart_contract, supply_chain, survey, traditional_security_assessment
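The filter semantics can be sketched in a few lines of Python: when both flags are given they AND together, and no embeddings are involved (this is an illustrative re-implementation, not the code in `services/search.py`):

```python
def filter_papers(papers, tier=None, category=None):
    """Keep papers matching the given tier and/or category tag.

    Both conditions must hold when both are supplied; with neither,
    every paper passes.
    """
    out = []
    for p in papers:
        if tier is not None and p["tier"] != tier:
            continue
        if category is not None and category not in p["categories"]:
            continue
        out.append(p)
    return out

papers = [
    {"id": "2604.00704", "tier": 1, "categories": ["agentic", "sca"]},
    {"id": "2604.00112", "tier": 2, "categories": ["sast"]},
]
matches = filter_papers(papers, tier=1, category="sca")  # only the first paper
```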

corpus get <paper_id>

Get full details for a paper by its arXiv ID.

corpus get 2604.00704

Returns the complete paper object including abstract, key results, and relevance notes. Returns an error with available_ids if the paper is not found.

corpus list

List all papers with compact summaries. Does not require embeddings.

corpus list
corpus list -c my-catalog

corpus gaps

Show identified research gaps from the catalog metadata.

corpus gaps

corpus patterns

Show agentic architecture patterns and cross-cutting key findings extracted from the catalog.

corpus patterns

Output structure:

{
  "agentic_architecture_patterns": [
    {
      "pattern": "multi-agent with shared memory",
      "papers": ["2603.26270", "2603.27127"]
    }
  ],
  "key_findings": [
    {
      "finding": "Data quality and balancing > model scale for vuln detection",
      "papers": ["2604.00112"]
    }
  ]
}

corpus catalogs

List all available catalog files in the catalogs directory.

corpus catalogs

Agent Usage

All output is JSON. An AI agent can use corpus as a subprocess:

# Semantic search, pipe to jq
result=$(corpus search "agentic DAST" 2>/dev/null)
echo "$result" | jq '.results[].title'

# RAG context for LLM prompt construction
context=$(corpus rag "supply chain detection" 2>/dev/null)

# Structured filtering
corpus filter --tier 1 --category sca 2>/dev/null | jq '.results'

# Paper lookup
corpus get 2604.00704 2>/dev/null | jq '.abstract'

Error handling: Errors are returned as JSON with error and fix fields:

{
  "error": "embeddings not cached",
  "fix": "run: corpus embed -c arxiv-vuln-catalog"
}

Exit code is 1 on error, 0 on success.
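A Python agent can wrap the same pattern with `subprocess`. A minimal sketch, assuming `corpus` is on PATH (the `run_corpus`/`parse_response` helpers are hypothetical, not part of corpus):

```python
import json
import subprocess

def parse_response(stdout: str, returncode: int) -> dict:
    """Parse corpus JSON output; surface the error/fix fields on failure."""
    payload = json.loads(stdout)
    if returncode != 0:
        raise RuntimeError(f"{payload['error']} (fix: {payload['fix']})")
    return payload

def run_corpus(*args: str) -> dict:
    """Invoke the corpus CLI as a subprocess and return parsed JSON."""
    proc = subprocess.run(["corpus", *args], capture_output=True, text=True)
    return parse_response(proc.stdout, proc.returncode)

# Usage (assumes embeddings are already cached):
# titles = [r["title"] for r in run_corpus("search", "agentic DAST")["results"]]
```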

Catalog Format

Catalog files are JSON placed in the catalogs/ directory. Each catalog must conform to this schema:

{
  "$schema": "arxiv-research-catalog/v1",
  "metadata": {
    "description": "...",
    "source_categories": ["cs.SE", "cs.CR"],
    "date_range": {"start": "2026-03-26", "end": "2026-04-03"},
    "total_papers_scanned": 334,
    "total_relevant_found": 28,
    "created": "2026-04-02",
    "search_terms": ["..."]
  },
  "taxonomy": {
    "tiers": {
      "tier_1": "Description of tier 1",
      "tier_2": "Description of tier 2"
    },
    "categories": ["agentic", "sca", "dast"]
  },
  "papers": [
    {
      "id": "2604.00704",
      "title": "Paper Title",
      "url": "https://arxiv.org/abs/2604.00704",
      "pdf": "https://arxiv.org/pdf/2604.00704",
      "authors": ["Author One", "Author Two"],
      "submitted": "2026-04-01",
      "tier": 1,
      "categories": ["agentic", "sca"],
      "abstract": "Full abstract text...",
      "key_results": {
        "success_rate": 0.82,
        "tasks_evaluated": 660
      },
      "relevance_notes": "Why this paper matters..."
    }
  ],
  "gaps_identified": [
    {
      "category": "sbom",
      "description": "No papers address...",
      "opportunity": "Could combine..."
    }
  ],
  "cross_references": {
    "agentic_architecture_patterns": [
      {"pattern": "multi-agent with shared memory", "papers": ["2603.26270"]}
    ],
    "key_findings": [
      {"finding": "Data quality > model scale", "papers": ["2604.00112"]}
    ]
  }
}

Required paper fields

| Field | Type | Description |
| --- | --- | --- |
| id | string | Unique identifier (e.g., arXiv ID) |
| title | string | Full paper title |
| url | string | URL to the paper page |
| pdf | string | URL to the PDF |
| authors | string[] | List of author names |
| submitted | string | ISO date string |
| tier | int | Relevance tier (1 = highest) |
| categories | string[] | Category tags |
| abstract | string | Full abstract text |

Optional paper fields

| Field | Type | Default | Description |
| --- | --- | --- | --- |
| key_results | dict | {} | Quantitative results |
| relevance_notes | string | "" | Curator notes |
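The repo validates these fields with pydantic (models/paper.py); as a dependency-free sketch, the same shape can be expressed as a dataclass mirroring the tables above (illustrative only, not the actual model):

```python
from dataclasses import dataclass, field

@dataclass
class Paper:
    # Required fields
    id: str
    title: str
    url: str
    pdf: str
    authors: list[str]
    submitted: str          # ISO date string, e.g. "2026-04-01"
    tier: int               # 1 = highest relevance
    categories: list[str]
    abstract: str
    # Optional fields with their documented defaults
    key_results: dict = field(default_factory=dict)
    relevance_notes: str = ""
```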

Adding a New Catalog

  1. Create a JSON file in catalogs/ following the schema above
  2. Run corpus embed -c <name> to compute embeddings
  3. Search with corpus search "query" -c <name>
# Example: add a new catalog
cp my-new-catalog.json catalogs/
corpus embed -c my-new-catalog
corpus search "interesting topic" -c my-new-catalog

Configuration

| Env Variable | Default | Description |
| --- | --- | --- |
| CORPUS_CATALOGS_DIR | <project>/catalogs/ | Override the catalogs directory path |
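The override likely resolves along these lines (a sketch of the lookup, not the actual corpus code; the `catalogs_dir` helper is hypothetical):

```python
import os
from pathlib import Path

def catalogs_dir(project_root: Path) -> Path:
    """Return CORPUS_CATALOGS_DIR if set, else <project>/catalogs/."""
    override = os.environ.get("CORPUS_CATALOGS_DIR")
    return Path(override) if override else project_root / "catalogs"
```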

Architecture

src/corpus/
  cli/
    main.py          # Typer CLI with all subcommands
  engine/
    embeddings.py    # EmbeddingEngine: sentence-transformer wrapper with caching
  models/
    paper.py         # Paper, PaperSummary domain models
    catalog.py       # Catalog, CatalogMetadata, Gap, CrossReferences
    results.py       # SearchResponse, RagResponse, FilterResponse, ErrorResponse
  services/
    loader.py        # CatalogLoader: JSON file loading and validation
    search.py        # SearchService: semantic search and filtering
    rag.py           # RagService: RAG context retrieval and formatting
catalogs/
  arxiv-vuln-catalog.json     # Bundled catalog
  .cache-arxiv-vuln-catalog/  # Auto-generated embedding cache

Dependency flow

CLI (main.py)
  -> Services (search.py, rag.py, loader.py)
    -> Engine (embeddings.py)
    -> Models (paper.py, catalog.py, results.py)
  • CLI depends on services, never on engine directly
  • Services depend on engine and models
  • Models have no internal dependencies beyond pydantic
  • Engine has no dependency on models (operates on raw lists/arrays)

Embedding pipeline

  1. CatalogLoader reads and validates JSON into Catalog model
  2. EmbeddingEngine.build_corpus() concatenates paper fields into searchable strings
  3. EmbeddingEngine.encode() produces L2-normalised vectors via all-MiniLM-L6-v2
  4. Vectors are cached as compressed .npz in catalogs/.cache-<name>/
  5. At query time, query is encoded and scored against cached corpus via dot product
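Steps 3 and 5 rely on a standard identity: for L2-normalised vectors, the dot product equals cosine similarity. A minimal pure-Python sketch of the scoring step (the real engine uses numpy arrays and the sentence-transformers model):

```python
import math

def l2_normalize(v):
    """Scale a vector to unit length."""
    norm = math.sqrt(sum(x * x for x in v))
    return [x / norm for x in v]

def score(query_vec, corpus_vecs):
    """Dot product of L2-normalised vectors == cosine similarity."""
    q = l2_normalize(query_vec)
    return [sum(a * b for a, b in zip(q, l2_normalize(d))) for d in corpus_vecs]

scores = score([1.0, 0.0], [[1.0, 0.0], [0.0, 1.0]])
# scores[0] == 1.0 (same direction), scores[1] == 0.0 (orthogonal)
```

Ranking the corpus by these scores, descending, yields the `results` order shown in `corpus search`.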

Dependencies

| Package | Purpose |
| --- | --- |
| sentence-transformers | Embedding model (all-MiniLM-L6-v2, 22M params) |
| numpy | Vector operations and caching |
| pydantic | Strict model validation |
| typer | CLI framework |

License

MIT
