GXL-ai/paperclip

Paperclip

Search, read, and analyze biomedical papers, regulatory documents, and clinical trials from the command line.

Paperclip is a CLI and MCP server for AI agents. It exposes every document as a directory on a virtual filesystem, containing its full text, sections, figures, and supplements.

  • Search with natural language or regex across biomedical papers from bioRxiv, medRxiv, arXiv, and PubMed Central, plus FDA regulatory documents, ClinicalTrials.gov, and international regulatory and trial registries
  • Run parallel AI readers across papers with map and synthesize with reduce
  • Pipe results through standard Unix tools (grep, awk, sed, jq, etc.)
  • Ask questions about figures with vision AI
  • Query the database directly with SQL

Full documentation: paperclip.gxl.ai

Community

This repository hosts the source code for the Paperclip CLI client.

Install

Python 3.8+ required.

curl -fsSL https://paperclip.gxl.ai/install.sh | bash

Installs to ~/.paperclip/ with a wrapper at ~/.local/bin/paperclip.

Or install via pip:

pip install https://paperclip.gxl.ai/paperclip.whl
paperclip setup

Sign in

Sign-in happens automatically on first use, or run manually:

paperclip login

Verify

paperclip config
# Server:  https://paperclip.gxl.ai
# Auth:    ✓ you@example.com
# Config:  ~/.paperclip

MCP Server (alternative)

Use Paperclip as an MCP server directly — no local install needed.

Claude Code

claude mcp add --transport http paperclip https://paperclip.gxl.ai/mcp

Then start claude, enter /mcp, and select Authenticate under the paperclip server.

Cursor

Add to ~/.cursor/mcp.json (or .cursor/mcp.json in your project):

{
  "mcpServers": {
    "paperclip": {
      "url": "https://paperclip.gxl.ai/mcp",
      "type": "http"
    }
  }
}

Then Cmd/Ctrl + Shift + P → Tools & MCPs, enable the paperclip server, and authenticate.
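If mcp.json already lists other servers, the paperclip entry can be merged in rather than overwriting the file. The following is a minimal Python sketch, assuming the file follows the mcpServers shape shown above; add_paperclip_server and update_config_file are hypothetical helpers, not part of Paperclip or Cursor:

```python
import json
from pathlib import Path

# The entry from the Cursor snippet above.
PAPERCLIP_ENTRY = {"url": "https://paperclip.gxl.ai/mcp", "type": "http"}


def add_paperclip_server(config: dict) -> dict:
    """Return a copy of an mcp.json-style config with the paperclip entry added."""
    merged = dict(config)
    servers = dict(merged.get("mcpServers", {}))  # keep any existing servers
    servers["paperclip"] = dict(PAPERCLIP_ENTRY)
    merged["mcpServers"] = servers
    return merged


def update_config_file(path: Path) -> None:
    """Read the config at `path` (if present), merge the entry, and write it back."""
    existing = json.loads(path.read_text()) if path.exists() else {}
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(add_paperclip_server(existing), indent=2) + "\n")
```

For example, update_config_file(Path.home() / ".cursor" / "mcp.json") would update the user-level config while leaving other servers in place.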

Quick Start

# Search for papers
paperclip search "CRISPR base editing efficiency"

# Read a paper's metadata
paperclip cat /papers/bio_4f78753a6feb/meta.json

# Preview the first 50 lines
paperclip head -50 /papers/bio_4f78753a6feb/content.lines

# Grep within a single paper
paperclip grep -i "binding affinity" /papers/bio_4f78753a6feb/content.lines

# Regex search across the entire corpus (sub-second)
paperclip grep "alphamissense" /papers/

# Map over search results with an AI reader
paperclip map --from s_abc123 "What methods were used?"

# Run SQL queries
paperclip sql "SELECT title, doi FROM documents WHERE authors ILIKE '%Doudna%' LIMIT 5"

# Save results to a local file
paperclip search "CRISPR" -n 5 > results.txt

Use paperclip bash '...' for pipes and chains:

paperclip bash 'search "protein folding" | grep "deep learning"'

Commands

| Command | Description |
| --- | --- |
| search | Hybrid search (BM25 + vector) across papers, regulatory documents, and trials |
| searches | Run multiple queries in parallel and merge results |
| grep | Regex search within a paper or across the entire corpus |
| scan | Multi-pattern grep in a single pass |
| lookup | Find papers by DOI, PMC ID, PMID, author, title, journal |
| sql | Read-only SQL queries against the papers database |
| map | Parallel AI reader across multiple papers |
| reduce | Synthesize map results into summaries, tables, or themes |
| filter | Filter search results for relevance |
| ask-image | Analyze figures with vision AI |
| cat | Read files from the paper filesystem |
| head / tail | Preview first or last lines |
| ls / tree | List directory contents |
| grep / scan | Search within papers |
| sed / awk / jq | Text processing |
| results | View, browse, and export saved results |
| config | Show or set configuration, connection diagnostics |
| install | Install agent skill for Claude Code, Cursor, or Codex |
| update | Update to the latest version |

Paper Repos

| Command | Description |
| --- | --- |
| init | Create a new paper repo |
| checkout | List repos, switch repos or branches |
| add / remove | Add or remove papers |
| import | Seed repo from a paper's bibliography |
| commit | Snapshot with reasoning message |
| annotate | Pin notes to specific papers |
| status | Repo state: papers, branches, annotations |
| log | Commit history |
| diff | Compare commits or branches |
| export | Export to BibTeX, RIS, Markdown, or CSV |
| branch / merge | Branching and merging |
| cite | Citation counts and relationships |

Agent Integration

Install a skill so your coding agent can use Paperclip automatically:

paperclip install

Supports Claude Code, Cursor, and Codex. The skill teaches the agent the full command set. Then just mention /paperclip in your prompt:

Using /paperclip, find recent papers on GLP-1 receptor agonists and summarize the primary endpoints.

Paper Filesystem

Each paper lives at /papers/<id>/:

meta.json        — title, authors, doi, date, abstract, journal
content.lines    — full text, line-numbered (L<n>: <text>)
sections/        — named section files (Introduction.lines, Methods.lines, ...)
figures/         — figure files (PMC papers)
supplements/     — supplementary files (PMC papers)

Paper IDs use prefixes by source: bio_ (bioRxiv), med_ (medRxiv), PMC (PubMed Central), arx_ (arXiv). Regulatory documents and clinical trials are accessed via /fda/ and /clinicaltrials/ virtual directories.
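When post-processing files locally, the ID prefixes and the L<n>: <text> line format can be handled with a few lines of Python. This is a minimal sketch, assuming content.lines follows exactly the format listed above with one record per line; source_of and parse_content_lines are hypothetical helpers, not part of the SDK:

```python
import re
from typing import Iterator, Optional, Tuple

# Prefix -> source, taken from the list above.
SOURCE_PREFIXES = {
    "bio_": "bioRxiv",
    "med_": "medRxiv",
    "arx_": "arXiv",
    "PMC": "PubMed Central",
}

# Matches "L<n>: <text>"; the single space after the colon is an assumption.
_LINE_RE = re.compile(r"^L(\d+): ?(.*)$")


def source_of(paper_id: str) -> Optional[str]:
    """Return the source name for a paper ID, or None if the prefix is unknown."""
    for prefix, source in SOURCE_PREFIXES.items():
        if paper_id.startswith(prefix):
            return source
    return None


def parse_content_lines(text: str) -> Iterator[Tuple[int, str]]:
    """Yield (line_number, text) pairs, skipping lines that do not match."""
    for raw in text.splitlines():
        match = _LINE_RE.match(raw)
        if match:
            yield int(match.group(1)), match.group(2)
```

Keeping the line numbers around makes it easy to point back at a passage later, for example when quoting a grep hit.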

Paper Repos

Build versioned, annotated collections of papers with git-like workflows:

# Create a repo and seed from a key paper's references
paperclip init my-review "Systematic review of XYZ"
paperclip import PMC11271413 --min-cites 50
paperclip import refs.bib                    # import .bib/.ris → library + repo

# View your personal library (persists across repos)
paperclip library

# Curate: annotate, commit
paperclip annotate PMC123 "Key finding on mechanism X"
paperclip commit -m "Initial seed from review + manual curation"

# Review your work
paperclip repo                       # list all repos
paperclip repo <name>                # repo overview: papers, branches, annotations
paperclip log                        # commit history
paperclip diff 9a6d..559a            # compare commits

# Export to reference managers
paperclip export bib -o refs.bib     # BibTeX (annotations in note field)
paperclip export ris -o refs.ris     # RIS (Zotero, Paperpile, Mendeley, EndNote)
paperclip export md -o review.md     # structured markdown report
paperclip export csv -o papers.csv   # tabular data

Saving files locally

Redirect cat to write any paper file to disk. Text files come back as text; figures and other binaries stream as raw bytes when stdout is redirected (no base64 wrapping):

paperclip cat /papers/PMC10791696/meta.json > meta.json
paperclip cat /papers/PMC10791696/figures/fig1.tif > fig1.tif

To download files in bulk, loop over the output of ls:

mkdir -p figures
for f in $(paperclip ls /papers/PMC10791696/figures/); do
  paperclip cat "/papers/PMC10791696/figures/$f" > "figures/$f"
done
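The same loop can be driven from Python when the download is part of a larger script. This is a sketch under two assumptions: that paperclip ls prints one entry per line, and that passing a file handle as stdout counts as a redirect so that cat streams raw bytes. The figure_names and download_figures helpers are hypothetical, and the run parameter exists only so the function can be exercised without the CLI installed:

```python
import subprocess
from pathlib import Path
from typing import Callable, List


def figure_names(ls_output: str) -> List[str]:
    """Split `paperclip ls` output into names (assumes one entry per line)."""
    return [line.strip() for line in ls_output.splitlines() if line.strip()]


def download_figures(paper_id: str, dest: Path, run: Callable = subprocess.run) -> List[Path]:
    """Save every figure of `paper_id` into `dest` and return the local paths."""
    remote_dir = f"/papers/{paper_id}/figures/"
    listing = run(["paperclip", "ls", remote_dir], capture_output=True, text=True, check=True)
    dest.mkdir(parents=True, exist_ok=True)
    saved = []
    for name in figure_names(listing.stdout):
        target = dest / name
        # Binary mode: figures are written exactly as the CLI emits them.
        with target.open("wb") as fh:
            run(["paperclip", "cat", remote_dir + name], stdout=fh, check=True)
        saved.append(target)
    return saved
```

With check=True, a failing paperclip call raises subprocess.CalledProcessError instead of silently leaving an empty file behind.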

Python SDK

The gxl-paperclip package ships a Python SDK alongside the CLI, so you can call Paperclip directly from scripts, notebooks, and other tools. Installing the package (via pip install or the installer script above) gives you both the paperclip command and the gxl_paperclip module.

Authentication

The SDK uses API keys (OAuth is reserved for interactive CLI sign-in). Create a key from the dashboard and make it available to your code:

export PAPERCLIP_API_KEY="pk_..."

from gxl_paperclip import APIKeyAuth, PaperclipClient

client = PaperclipClient.from_env()           # picks up PAPERCLIP_API_KEY
# Or pass an explicit auth strategy:
client = PaperclipClient(auth=APIKeyAuth("pk_..."))

from_env() falls back to the credentials saved by paperclip login (~/.paperclip/credentials.json) via FileCredentialsAuth when no API key is set — handy on a workstation where you've already signed in.
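The lookup order described above can be pictured as follows. This is only an illustration of the documented behavior (API key first, saved credentials second), not the SDK's actual implementation; resolve_auth and its return values are hypothetical:

```python
import os
from pathlib import Path
from typing import Mapping, Optional, Tuple

# Location of the credentials saved by `paperclip login`, as stated above.
DEFAULT_CREDENTIALS = Path.home() / ".paperclip" / "credentials.json"


def resolve_auth(
    env: Mapping[str, str] = os.environ,
    credentials_path: Path = DEFAULT_CREDENTIALS,
) -> Tuple[str, Optional[str]]:
    """Return ("api_key", key), ("file", path), or ("none", None)."""
    key = env.get("PAPERCLIP_API_KEY")
    if key:
        return "api_key", key  # the environment variable wins when set
    if credentials_path.is_file():
        return "file", str(credentials_path)  # fall back to saved login
    return "none", None
```

A check like this can be useful in scripts that should fail early with a clear message when neither credential source is present.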

Quick start

from gxl_paperclip import PaperclipClient

client = PaperclipClient.from_env()

result = client.search("CRISPR lipid nanoparticle", limit=5, source="pmc")
print(result.output)           # same formatted text the CLI prints
print(result.result_id)        # e.g. "s_14bebc10" — pass to map_()

for event in client.map_("What delivery methods were used?", from_results=result.result_id):
    if event.type == "progress":
        print(f"{event.completed}/{event.total} papers done")
    else:
        print(event.output)

Method reference

Every optional kwarg defaults to None (or False for flags) on the client, which means the flag is omitted from the underlying command — the server then applies its own default.
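To make the "omitted when None or False" rule concrete, here is a small sketch of how keyword options could be turned into command tokens. It is an illustration only: build_args is hypothetical, and the --name flag spelling is an assumption (the CLI itself uses -n for the search limit, for instance):

```python
from typing import Any, List


def build_args(query: str, **options: Any) -> List[str]:
    """Turn keyword options into argv tokens, skipping None and False.

    The `--name` spelling is assumed for illustration; real flags may differ.
    """
    args = [query]
    for name, value in options.items():
        if value is None or value is False:
            continue  # omitted: the server applies its own default
        flag = "--" + name.replace("_", "-")
        if value is True:
            args.append(flag)  # boolean flag, no value
        else:
            args.extend([flag, str(value)])
    return args
```

Calling build_args("CRISPR", limit=5, source=None, exact=False, all=True) yields ["CRISPR", "--limit", "5", "--all"]: the unset source and exact options never reach the command.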

client.search(query, *, limit=None, source=None, exact=False, since=None, sort=None, author=None, journal=None, year=None, type=None, category=None, mode=None, all=False, timeout=None) -> ExecuteResult

Hybrid search across bioRxiv, medRxiv, arXiv, PubMed Central, FDA, ClinicalTrials.gov, and international registries.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| query | required | Natural-language query string. |
| limit | 100 | Server caps at 1000. |
| source | PMC, bioRxiv, medRxiv, arXiv | Pass "pmc", "biorxiv", "medrxiv", "arxiv", "abstracts", "fda", "trials", or a comma-separated list. |
| exact | False | True switches search mode to phrase matching. |
| since | no recency filter | e.g. "7d", "30d", "6m", "1y". |
| sort | "relevance" | Pass "date" for newest-first. |
| author | no filter | Substring match on authors. |
| journal | no filter | PMC only. |
| year | no filter | e.g. 2024. |
| type | no filter | e.g. "review-article" (PMC). |
| category | no filter | e.g. "Neuroscience" (bioRxiv). |
| mode | "any" | Also supports "all", "50%", "75%". |
| all | False | When True, searches the full corpus instead of the default recency-weighted slice. |
| timeout | 120 s | Seconds before the request aborts. |

client.lookup(field, value, *, limit=None, timeout=None) -> ExecuteResult

Look up papers by a metadata field.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| field | required | "doi", "pmc", "pmid", "author", "title", "journal", "year", "keywords", etc. |
| value | required | The value to match (partial, case-insensitive). |
| limit | 25 | |
| timeout | 120 s | |

client.sql(query, *, source=None, timeout=None) -> ExecuteResult

Read-only SQL over the documents table. Queries are subject to a 15 s server-side timeout and a 200-row cap.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| query | required | Must be a SELECT against documents. |
| source | "all" | Pass "pmc" or "biorxiv" to restrict. |
| timeout | 120 s | |

client.map_(question, *, from_results, timeout=None) -> Iterator[MapEvent]

Run an AI reader against every paper in a prior search/lookup result set. Yields MapProgressEvent objects (OAuth streaming path) followed by a single MapResultEvent.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| question | required | Question asked against each paper. |
| from_results | required | Pass the result_id returned by search or lookup. |
| timeout | 300 s | Map defaults to the slow-command timeout. |
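Because map_ yields progress events before the final result, callers often want to drain the iterator and keep only the answer. Here is a minimal sketch of that pattern. The FakeProgress and FakeResult classes are stand-ins so the example runs without the SDK; their fields follow the result types listed in this README, and the "result" value of type is an assumption (only "progress" appears in the quick start):

```python
from dataclasses import dataclass
from typing import Iterable, List, Optional, Tuple


@dataclass
class FakeProgress:  # stand-in for MapProgressEvent
    total: int
    completed: int
    failed: int = 0
    type: str = "progress"


@dataclass
class FakeResult:  # stand-in for MapResultEvent
    output: str
    result_id: str
    type: str = "result"  # assumed value, see the note above


def drain_map(events: Iterable) -> Tuple[List[str], Optional[str]]:
    """Collect "completed/total" progress strings and the final output."""
    progress: List[str] = []
    final: Optional[str] = None
    for event in events:
        if event.type == "progress":
            progress.append(f"{event.completed}/{event.total}")
        else:
            final = event.output  # the single result event ends the stream
    return progress, final
```

With the real client this would be called as drain_map(client.map_(question, from_results=result.result_id)).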

client.pull(target, dest=None, *, timeout=None) -> ExecuteResult

Download a paper or single file from the virtual filesystem.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| target | required | e.g. "PMC10791696" or "PMC10791696/figures/fig1.jpg". |
| dest | current directory | Output directory on the server's side of the command. |
| timeout | 120 s | |

client.ask_image(path, question=None, *, fn=None, timeout=None) -> ExecuteResult

Analyse a paper figure with vision AI.

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| path | required | Figure path, e.g. "PMC11576387/figures/fx1.jpg". |
| question | "Describe this figure in detail." | Custom prompt. |
| fn | free-form prompt | Pass "describe" or "extract-data" for canned flows. |
| timeout | 300 s | Uses the slow-command default. |

client.bash(script, *, timeout=None) -> ExecuteResult

Run an arbitrary server-side pipeline, exactly like paperclip bash '...'.

result = client.bash('search "protein folding" | grep -i "deep learning"')

| Argument | Default when omitted | Notes |
| --- | --- | --- |
| script | required | A single shell-style command string. |
| timeout | 120 s | |

client.health(*, timeout=None) -> HealthStatus

Ping the server and confirm auth works. Returns HealthStatus(reachable: bool, output: str, exit_code: int).

client.results

  • client.results.list(*, limit=None) -> list[ResultRow] — recent saved results for the authenticated user. Server default limit is 20.
  • client.results.get(result_id) -> ResultData — raw saved output for a specific result ID (e.g. "s_14bebc10", "m_ec2c9cc9").

client.papers.*

Typed wrappers over the virtual filesystem commands. Each returns an ExecuteResult.

| Method | Defaults |
| --- | --- |
| papers.cat(path) | no options |
| papers.head(path, *, lines=None) | lines defaults to the CLI's head default (10). |
| papers.tail(path, *, lines=None) | lines defaults to the CLI's tail default (10). |
| papers.ls(path) | no options |
| papers.grep(pattern, path, *, ignore_case=False, extended=False) | no flags passed when both are False. |
| papers.scan(path, patterns) | multiple patterns OR'd in a single pass. |

client.execute(command, args=None, *, timeout=None) -> ExecuteResult

Escape hatch for any command without a typed wrapper (sed, awk, sort, cut, tr, jq, new server commands, ...). args is a list of argv tokens — the SDK quotes them for you.

result = client.execute("awk", ["-F", "\t", "{print $1}", "/papers/PMC1/content.lines"])
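To see why passing argv tokens is safer than building a command string by hand, the standard library's shlex module shows what correct quoting looks like. This sketch only illustrates the idea; whether the SDK uses shlex internally is not stated, and to_command_line is a hypothetical helper:

```python
import shlex
from typing import List, Optional


def to_command_line(command: str, args: Optional[List[str]] = None) -> str:
    """Join a command and its argv tokens into one safely quoted string."""
    return shlex.join([command, *(args or [])])


# Tokens with tabs, spaces, and "$" survive a round trip unchanged.
tokens = ["-F", "\t", "{print $1}", "/papers/PMC1/content.lines"]
line = to_command_line("awk", tokens)
assert shlex.split(line) == ["awk", *tokens]
```

The round-trip assertion is the point: each token reaches the command exactly as written, with no accidental word splitting or variable expansion.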

client.stream(command, args=None, *, timeout=None) -> Iterator[MapEvent]

Streaming escape hatch. Currently only "map" streams; other commands raise ValueError.

Error handling

All HTTP and network failures raise a subclass of PaperclipError:

from gxl_paperclip import (
    AuthError, RateLimitError, NotFoundError, ServerError,
    RequestTimeoutError, NetworkError,
)

try:
    client.search("AlphaFold")
except AuthError:
    ...  # invalid API key or expired credentials
except RateLimitError:
    ...  # HTTP 429
except RequestTimeoutError:
    ...  # client-side timeout

Result types

  • ExecuteResult(output, exit_code, elapsed_ms, result_id, download_url, download_filename, cwd, raw)
  • MapProgressEvent(total, completed, failed, elapsed_s)
  • MapResultEvent(output, result_id, elapsed_ms, exit_code)
  • ResultRow(result_id, command, raw_input, latency_ms, created_at, raw)
  • ResultData(result_id, output, command, raw_input, latency_ms, created_at, raw)
  • HealthStatus(reachable, output, exit_code, elapsed_ms)

License

Apache-2.0 — see LICENSE.

About

Paperclip — search, read, and analyze 8M+ biomedical papers from the command line
