A local, scriptable research-gap analysis pipeline for arXiv papers.

Given one or more arXiv papers (by ID or PDF path), the pipeline:

- Downloads the PDF (or accepts a local file / batch directory)
- Converts the PDF to structured TEI XML via GROBID
- Extracts all section headings and selected key sections into a `context_pack.json`
- Identifies research gaps either via heuristic patterns (`--no-llm`) or an LLM (OpenAI)
- Verifies novelty by searching OpenAlex and Semantic Scholar, ranking candidates with a local embedding model (`all-MiniLM-L6-v2`)
- Generates a `report.md` summarising everything

All processing runs locally — no data is sent anywhere except to APIs you explicitly configure.
## Contents

- Setup
- Start GROBID (Docker)
- Configuration (.env)
- Usage
- Output files
- Module structure
- Running tests
- FAQ
## Setup

Requirements:

- Python 3.9+
- Docker (for GROBID)

```bash
# Create and activate a virtual environment
python -m venv .venv
source .venv/bin/activate  # Windows: .venv\Scripts\activate

# Install the package (includes all runtime dependencies)
pip install -e ".[dev]"
```

Dependencies installed: `requests`, `lxml`, `openai`, `python-dotenv`, `sentence-transformers`, `numpy`

Dev extras: `pytest`, `pytest-mock`, `responses`
## Start GROBID (Docker)

GROBID converts PDF files to structured TEI XML. Start it with Docker before running the pipeline:

```bash
docker run --rm -p 8070:8070 lfoppiano/grobid:0.8.1
```

The service will be available at http://localhost:8070. The pipeline uses this URL by default; override it with `--grobid-url` if needed.

You can test that GROBID is running:

```bash
curl http://localhost:8070/api/isalive
```

## Configuration (.env)

Copy `.env.example` to `.env` and fill in your keys:

```bash
cp .env.example .env
```

| Variable | Required | Description |
|---|---|---|
| `OPENAI_API_KEY` | For LLM mode | OpenAI API key for gap extraction |
| `SEMANTIC_SCHOLAR_API_KEY` | Optional | Higher rate limits on the S2 API |
| `OPENALEX_EMAIL` | Recommended | Joins the OpenAlex polite pool for better throughput |
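Setting `OPENALEX_EMAIL` works by attaching a `mailto` parameter to OpenAlex requests, which opts you into the polite pool. A stdlib-only sketch of building such a request URL (the query, email, and helper name are illustrative, not the project's actual API):

```python
from typing import Optional
from urllib.parse import urlencode

OPENALEX_WORKS = "https://api.openalex.org/works"

def openalex_search_url(query: str, email: Optional[str] = None, per_page: int = 10) -> str:
    """Build an OpenAlex /works search URL; `mailto` opts into the polite pool."""
    params = {"search": query, "per-page": per_page}
    if email:
        params["mailto"] = email
    return f"{OPENALEX_WORKS}?{urlencode(params)}"

url = openalex_search_url("graph neural network limitations", email="you@example.org")
```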
If `OPENAI_API_KEY` is not set, the pipeline automatically falls back to heuristic mode.
## Usage

Analyse a single arXiv paper:

```bash
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1
```

This downloads the PDF, runs GROBID, extracts gaps with the LLM (if `OPENAI_API_KEY` is set), verifies novelty, and writes all outputs to `out/2604.24717v1/`.

Process a local PDF:

```bash
research-gap --pdf paper.pdf --out-dir out/paper
```

Process all PDFs in a directory:

```bash
research-gap --input-dir papers/ --out-dir out/batch
```

Each PDF gets its own subdirectory inside `out/batch/`.

Skip the OpenAI call entirely — use fast regex-based gap extraction:

```bash
research-gap --arxiv-id 2604.24717v1 --out-dir out/2604.24717v1 --no-llm
```

Full CLI reference:

```text
usage: research-gap [-h] [--arxiv-id ID [ID ...]] [--pdf PATH [PATH ...]]
                    [--input-dir DIR] [--out-dir DIR] [--force]
                    [--grobid-url URL] [--no-llm] [--llm-model MODEL]
                    [--openai-api-key KEY] [--top-k N] [--s2-api-key KEY]
                    [--openalex-email EMAIL] [-v]

Input sources:
  --arxiv-id ID           arXiv paper ID(s), e.g. 2604.24717v1
  --pdf PATH              Path(s) to local PDF file(s)
  --input-dir DIR         Directory containing PDF files

Output:
  --out-dir DIR           Output directory (default: out/)
  --force                 Rerun GROBID even if TEI XML already exists

GROBID:
  --grobid-url URL        GROBID service URL (default: http://localhost:8070)

Gap extraction:
  --no-llm                Use heuristic extraction instead of LLM
  --llm-model MODEL       OpenAI model (default: gpt-4o-mini)
  --openai-api-key KEY    Overrides OPENAI_API_KEY env var

Prior-work:
  --top-k N               Top-K candidate papers per query (default: 10)
  --s2-api-key KEY        Overrides SEMANTIC_SCHOLAR_API_KEY env var
  --openalex-email EMAIL  Overrides OPENALEX_EMAIL env var

Misc:
  -v, --verbose           Enable debug logging
```
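The heuristic mode scans the text for gap-signalling cue phrases with regular expressions. The actual patterns live in `gaps.py`; the ones below are a simplified, illustrative sketch of the idea, not the shipped list:

```python
import re

# Illustrative cue phrases that often signal a research gap (not the shipped patterns).
GAP_PATTERNS = [
    re.compile(r"remains? an open (problem|question)", re.IGNORECASE),
    re.compile(r"future work", re.IGNORECASE),
    re.compile(r"key limitation", re.IGNORECASE),
    re.compile(r"we do not address", re.IGNORECASE),
]

def find_gap_sentences(text: str) -> list:
    """Return the sentences that match any gap cue phrase."""
    sentences = re.split(r"(?<=[.!?])\s+", text)
    return [s for s in sentences if any(p.search(s) for p in GAP_PATTERNS)]

sample = ("Our method is fast. Scaling to multilingual corpora remains an open problem. "
          "We leave efficiency to future work.")
hits = find_gap_sentences(sample)  # two of the three sentences match
```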
## Output files

All files are written to `--out-dir`:
| File | Description |
|---|---|
| `paper.pdf` | Downloaded PDF (when using `--arxiv-id`) |
| `paper.tei.xml` | GROBID TEI XML output |
| `context_pack.json` | Parsed sections: title, abstract, headings, key sections |
| `gaps.json` | Extracted research gaps with evidence, directions, queries |
| `novelty_report.json` | Ranked prior work per gap/direction with similarity + risk |
| `report.md` | Human-readable Markdown summary of all findings |
`context_pack.json` schema:

```json
{
  "title": "string",
  "abstract": "string",
  "headings": ["Introduction", "Related Work", ...],
  "sections": { "heading": "full section text", ... },
  "key_sections": {
    "Abstract": "...",
    "Introduction": "...",
    "Related Work": "...",
    "Experiments": "...",
    "Discussion": "...",
    "Limitations": "...",
    "Conclusion": "...",
    "Future Work": "..."
  }
}
```

`gaps.json` schema:

```json
[
  {
    "gap": "Concise 1-2 sentence description of the research gap",
    "evidence": [
      { "section": "Limitations", "quote": "Verbatim substring from paper" },
      { "section": "Conclusion", "quote": "Another verbatim substring" }
    ],
    "why_it_matters": "Explanation of scientific/practical significance",
    "non_incremental_directions": [
      {
        "direction": "Concrete proposed research direction",
        "axis_of_difference": "problem formulation | assumptions | evaluation target | modality | ..."
      }
    ],
    "prior_work_search_queries": ["keyword query 1", "keyword query 2"]
  }
]
```

`novelty_report.json` schema:

```json
[
  {
    "gap": "...",
    "idea": "...",
    "nearest_prior_work": [
      {
        "title": "Paper Title",
        "authors": ["Author A", "Author B"],
        "year": 2023,
        "venue": "NeurIPS",
        "abstract": "...",
        "url": "https://...",
        "citation_count": 42,
        "source": "openalex | semantic_scholar",
        "similarity": 0.812,
        "risk": "low | medium | high"
      }
    ]
  }
]
```

Risk labels:
- 🟢 low (similarity < 0.55) — idea appears novel
- 🟡 medium (0.55–0.74) — related work exists, review carefully
- 🔴 high (≥ 0.75) — very similar work found, idea may not be novel
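Using the thresholds above, the mapping from a cosine-similarity score to a risk label can be sketched as follows (the function name is illustrative):

```python
def risk_label(similarity: float) -> str:
    """Map a cosine-similarity score to a novelty-risk label.

    Thresholds follow the table above: < 0.55 low, 0.55-0.74 medium, >= 0.75 high.
    """
    if similarity >= 0.75:
        return "high"
    if similarity >= 0.55:
        return "medium"
    return "low"
```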
## Module structure

```text
research_gap/
├── __init__.py
├── __main__.py              # CLI entrypoint (argparse, pipeline orchestration)
├── parsing.py               # arXiv download, GROBID call, TEI → context_pack
├── gaps.py                  # Heuristic + LLM gap extraction
├── reporting.py             # report.md generation
└── prior_work/
    ├── __init__.py
    ├── openalex.py          # OpenAlex API client
    ├── semantic_scholar.py  # Semantic Scholar API client
    └── embeddings.py        # sentence-transformers similarity ranking
tests/
├── test_parsing.py          # TEI parsing unit tests
├── test_prior_work.py       # API clients with mocked HTTP
├── test_gaps.py             # Heuristic extraction tests
└── test_reporting.py        # Report generation tests
```
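For orientation, the TEI → context_pack step in `parsing.py` might look roughly like this. The sketch uses the stdlib `xml.etree.ElementTree` for brevity, whereas the pipeline itself depends on `lxml`; the TEI namespace is standard, and the function name is illustrative:

```python
import xml.etree.ElementTree as ET

# Standard TEI namespace used by GROBID output.
TEI_NS = {"tei": "http://www.tei-c.org/ns/1.0"}

def extract_title(tei_xml: str) -> str:
    """Pull the paper title out of a GROBID TEI document."""
    root = ET.fromstring(tei_xml)
    node = root.find(".//tei:titleStmt/tei:title", TEI_NS)
    return node.text.strip() if node is not None and node.text else ""

sample = (
    '<TEI xmlns="http://www.tei-c.org/ns/1.0"><teiHeader><fileDesc>'
    "<titleStmt><title>A Sample Paper</title></titleStmt>"
    "</fileDesc></teiHeader></TEI>"
)
title = extract_title(sample)
```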
## Running tests

```bash
pip install -e ".[dev]"
pytest -v
```

Tests use `responses` to mock all HTTP calls — no network access required.
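The shipped tests use the `responses` library to intercept `requests` calls; a stdlib-only sketch of the same idea with `unittest.mock` (the client function here is hypothetical, not the project's actual API):

```python
from unittest import mock

def search_titles(session, query):
    """Hypothetical client: fetch candidate paper titles for a query."""
    resp = session.get("https://api.openalex.org/works", params={"search": query})
    resp.raise_for_status()
    return [w["title"] for w in resp.json()["results"]]

# Build a fake session so no network is touched.
fake_resp = mock.Mock()
fake_resp.json.return_value = {"results": [{"title": "Mocked Paper"}]}
fake_resp.raise_for_status.return_value = None
fake_session = mock.Mock()
fake_session.get.return_value = fake_resp

titles = search_titles(fake_session, "graph neural networks")
```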
## FAQ

Q: GROBID crashes / returns an error for my PDF.
A: Some PDFs are malformed or encrypted. Try a different version of the paper (e.g., v2 instead of v1). You can also pre-convert the PDF with `pdftotext` and submit the raw text as plain text.
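If you want to pre-flight a troublesome PDF against your GROBID instance yourself, a minimal sketch (default URL assumed; `requests` is already a project dependency, and the function names are illustrative):

```python
GROBID_URL = "http://localhost:8070"  # default; match your --grobid-url if overridden

def grobid_fulltext_url(base: str = GROBID_URL) -> str:
    """Standard GROBID endpoint for full-text TEI conversion."""
    return base.rstrip("/") + "/api/processFulltextDocument"

def convert_pdf(pdf_path: str, base: str = GROBID_URL) -> str:
    """POST a PDF to GROBID (multipart field 'input') and return the TEI XML body."""
    import requests  # project dependency; imported lazily in this sketch
    with open(pdf_path, "rb") as fh:
        resp = requests.post(grobid_fulltext_url(base), files={"input": fh}, timeout=120)
    resp.raise_for_status()  # a 4xx/5xx here pinpoints the failing PDF
    return resp.text
```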
Q: LLM mode returns empty gaps.
A: The LLM may not find explicit gap sentences. The heuristic mode (`--no-llm`) is more reliable for papers that state gaps with explicit cue phrases. Also check that `OPENAI_API_KEY` is set correctly.
Q: How do I use a different OpenAI model?
A: Pass `--llm-model gpt-4o` (or any supported model). The default is `gpt-4o-mini` for cost efficiency.
Q: Can I run without Semantic Scholar / OpenAlex?
A: Yes. If both APIs fail (e.g., due to rate limiting or no network), `novelty_report.json` will have empty `nearest_prior_work` arrays. The rest of the pipeline completes normally.
Q: Are my papers / API keys sent anywhere?
A: PDFs are sent to your local GROBID instance only. Gap extraction text is sent to OpenAI (if using LLM mode). Search queries (not paper content) are sent to OpenAlex and Semantic Scholar.