Akhand

A literary geography platform that maps fiction to the physical world.

Canonical data layer status (Mar 17, 2026):

Working scaffold: 11,073 entries
Strict release: 9,057 entries
Active versioned release: backend/data/releases/2026-03-17-strict-v2/literary_places.json

The name means "undivided" in Sanskrit. The platform treats South Asia's literary geography as a continuous space, ignoring political boundaries in favor of narrative ones.

This project builds on the work of Cities in Fiction, an archival project by Apoorva Saini and Divya Ravindranath that documents real-world places in Indian literature. Their curated entries (436 total across two sources) are integrated here with full attribution. Akhand extends this with NLP extraction, multi-source data ingestion, and WebGL visualization.

Architecture

Frontend (Next.js 14, MapLibre GL, deck.gl)
    |
    | GET /api/places (fallback to static data.ts if backend is down)
    v
Backend API (FastAPI, Pydantic)
    |
    |-- /api/places        serves canonical release entries with search/filter
    |-- /api/meta          dataset version/source/count metadata
    |-- /api/places.geojson full GeoJSON FeatureCollection export
    |-- /api/export        bulk CSV export
    |-- /api/extract       spaCy + GLiNER + Gemini NLP pipeline
    |-- /api/wikidata/*    SPARQL proxy for Wikidata P840
    |
Data Ingestion (CLI scripts)
    |
    |-- ingest.py          Open Library search-by-place, 54 cities, alias expansion
    |-- cif_ingest.py      CitiesInFiction.xlsx parser + Nominatim geocoder
    |-- openlibrary.py     async client with rate limiting
    |-- wikidata.py        P840 narrative location queries
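The /api/places.geojson route in the diagram above returns a GeoJSON FeatureCollection. The following is a minimal sketch of that conversion, not the backend's actual code. The entry field names (`id`, `lat`, `lon`) and both function names are assumptions for illustration, and the real schema of literary_places.json may differ.

```python
def to_feature(entry: dict) -> dict:
    """Convert one place entry into a GeoJSON Feature.

    The "lat" and "lon" keys are hypothetical field names.
    """
    properties = {k: v for k, v in entry.items() if k not in ("lat", "lon")}
    return {
        "type": "Feature",
        # GeoJSON (RFC 7946) orders coordinates as [longitude, latitude].
        "geometry": {"type": "Point", "coordinates": [entry["lon"], entry["lat"]]},
        "properties": properties,
    }


def to_feature_collection(entries: list[dict]) -> dict:
    """Wrap all geocoded entries in a FeatureCollection."""
    features = [
        to_feature(e)
        for e in entries
        # Skip entries that were never geocoded.
        if e.get("lat") is not None and e.get("lon") is not None
    ]
    return {"type": "FeatureCollection", "features": features}
```

The longitude-first ordering is the detail most often gotten wrong when feeding such an export to MapLibre GL or deck.gl.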

Data

Current corpus snapshot:

Working scaffold (generated): 11,073
Strict canonical release (v2026-03-17-strict-v2): 9,057
Frontend static index/details synced to strict release: 9,057
An enrichment run continues in the background with checkpointed resume.
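One way to picture the checkpointed resume mentioned above is sketched here. The function name, the JSON checkpoint format, and the `id` key are all assumptions, so the project's real enrichment script may work differently.

```python
import json
import os
import tempfile


def enrich_with_checkpoint(entries, enrich, checkpoint_path, every=100):
    """Enrich entries in order, saving progress so a rerun can resume.

    `enrich` is a caller-supplied function (hypothetical) that returns a
    JSON-serializable result for one entry.
    """
    done = {}
    if os.path.exists(checkpoint_path):
        with open(checkpoint_path, encoding="utf-8") as f:
            done = json.load(f)

    def save():
        # Write to a temp file, then atomically replace the checkpoint so
        # an interrupted write never leaves a truncated file behind.
        directory = os.path.dirname(os.path.abspath(checkpoint_path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as f:
            json.dump(done, f)
        os.replace(tmp, checkpoint_path)

    pending = 0
    for entry in entries:
        key = str(entry["id"])
        if key in done:
            continue  # already enriched in an earlier run
        done[key] = enrich(entry)
        pending += 1
        if pending >= every:
            save()
            pending = 0
    if pending:
        save()
    return done
```

Saving every `every` entries rather than after each one trades a little repeated work on restart for far fewer disk writes.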

NLP pipeline

Four layers, designed so each failure degrades gracefully instead of crashing:

Layer 1: spaCy NER (en_core_web_md, 50MB). Fast first pass extracting GPE, LOC, FAC entities. The md model includes word vectors that improve recognition of out-of-vocabulary place names in literary syntax.

Layer 2: GLiNER zero-shot NER (urchade/gliner_medium-v2.1). Runs domain-specific labels: City, Village, Region, Country, River, Mountain, Neighborhood, Landmark, Historical Place Name, Fictional Place, Route, Body of Water. When both models agree on an entity, confidence is boosted. Threshold set to 0.4 to reduce noise from metaphorical place usage in literary text.
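The agreement boost described above could look roughly like the sketch below. It does not call spaCy or GLiNER. It assumes both models' outputs have already been converted to a common `Entity` record, that spaCy entities carry a fixed prior score (spaCy's NER does not expose one by default), and that the boost is 0.15. Only the 0.4 threshold comes from this README; the rest is hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Entity:
    text: str
    start: int   # character offset, inclusive
    end: int     # character offset, exclusive
    label: str
    score: float


def overlaps(a: Entity, b: Entity) -> bool:
    return a.start < b.end and b.start < a.end


def merge_entities(spacy_ents, gliner_ents, threshold=0.4, boost=0.15):
    """Combine two entity lists, boosting spans both models found."""
    # Drop low-confidence GLiNER hits (e.g. metaphorical "places").
    kept = [g for g in gliner_ents if g.score >= threshold]
    merged, used = [], set()
    for s in spacy_ents:
        match = next(
            (i for i, g in enumerate(kept) if i not in used and overlaps(s, g)),
            None,
        )
        if match is None:
            merged.append(s)
            continue
        used.add(match)
        g = kept[match]
        # Keep GLiNER's finer-grained label; boost the score, capped at 1.0.
        score = min(1.0, max(s.score, g.score) + boost)
        merged.append(Entity(g.text, g.start, g.end, g.label, score))
    # GLiNER-only entities pass through unchanged.
    merged.extend(g for i, g in enumerate(kept) if i not in used)
    return sorted(merged, key=lambda e: e.start)
```

Matching on span overlap rather than exact offsets tolerates the two models drawing slightly different boundaries around the same place name.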

Layer 3: Geocoding (Nominatim via geopy). Converts entity text to coordinates. 80+ pre-populated coordinates avoid rate limiting.
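The cache-first lookup described above can be sketched as follows. The table here is a three-entry stand-in with approximate coordinates, not the project's actual 80+ entries. `make_geocoder` and `remote` are hypothetical names, with `remote` standing for any function that resolves a place name to a (latitude, longitude) pair or None.

```python
from typing import Callable, Optional

Coord = tuple[float, float]  # (latitude, longitude)

# Illustrative stand-in for the pre-populated table; values are approximate.
KNOWN_COORDS: dict[str, Coord] = {
    "delhi": (28.61, 77.21),
    "mumbai": (19.08, 72.88),
    "jodhpur": (26.24, 73.02),
}


def make_geocoder(remote: Callable[[str], Optional[Coord]], known=None):
    """Return a geocode(name) function that consults the cache first."""
    cache = dict(KNOWN_COORDS if known is None else known)

    def geocode(name: str) -> Optional[Coord]:
        key = " ".join(name.lower().split())  # normalize case and whitespace
        if key in cache:
            return cache[key]  # no network call, so no rate limiting
        try:
            coord = remote(name)
        except Exception:
            return None  # degrade gracefully: entity stays ungeocoded
        if coord is not None:
            cache[key] = coord  # remember successful lookups
        return coord

    return geocode
```

With geopy, `remote` could wrap `Nominatim(user_agent=...).geocode`, optionally through `geopy.extra.rate_limiter.RateLimiter`, and return `(location.latitude, location.longitude)` when a result is found.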

Layer 4: Gemini 3 Flash structured extraction (gemini-3-flash-preview). Called only on passages containing NER-detected entities, not on full texts. A 100,000-word novel (about 500,000 characters) yields roughly 20 passages totaling about 6,000 characters. At Gemini Flash pricing, that is about $0.0006 per book instead of $0.05, an 83x cost reduction. The layer extracts sentiment, themes, and place classification.
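The passage selection and the 83x figure can be checked with a short sketch. The windowing function and its 150-character window are assumptions (20 passages in 6,000 characters implies about 300 characters per passage). The character counts and per-book prices are the ones quoted in this README.

```python
def entity_passages(text: str, spans, window: int = 150) -> list[str]:
    """Return merged text windows around each (start, end) entity span."""
    intervals = sorted(
        (max(0, start - window), min(len(text), end + window))
        for start, end in spans
    )
    merged: list[list[int]] = []
    for lo, hi in intervals:
        if merged and lo <= merged[-1][1]:
            # Overlapping or touching windows collapse into one passage.
            merged[-1][1] = max(merged[-1][1], hi)
        else:
            merged.append([lo, hi])
    return [text[lo:hi] for lo, hi in merged]


# Arithmetic behind the quoted reduction.
full_chars = 500_000      # whole novel
passage_chars = 6_000     # about 20 passages of about 300 characters
char_reduction = full_chars / passage_chars   # about 83.3
cost_reduction = 0.05 / 0.0006                # about 83.3
```

Both ratios agree because cost scales linearly with input size, so sending 1/83 of the text costs about 1/83 as much.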

If Gemini fails, the pipeline falls back to rule-based sentiment. If GLiNER fails to load, spaCy runs alone. If the backend is down entirely, the frontend serves curated entries from a static file.
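The first fallback above (Gemini to rule-based sentiment) might be structured like this sketch. The word lists, the function names, and the `source` field are illustrative assumptions, and `llm_extract` stands in for whatever wraps the Gemini call.

```python
import re

# Tiny illustrative lexicons; a real rule set would be much larger.
POSITIVE = {"beautiful", "beloved", "bright", "joy", "home", "peaceful"}
NEGATIVE = {"ruined", "grim", "fear", "exile", "loss", "desolate"}


def rule_based_sentiment(text: str) -> dict:
    """Crude lexicon count used only when the LLM layer is unavailable."""
    words = re.findall(r"[a-z']+", text.lower())
    score = sum(w in POSITIVE for w in words) - sum(w in NEGATIVE for w in words)
    label = "positive" if score > 0 else "negative" if score < 0 else "neutral"
    return {"sentiment": label, "source": "rules"}


def analyze(text: str, llm_extract=None) -> dict:
    """Try the LLM extractor first; fall back to rules on any failure."""
    if llm_extract is not None:
        try:
            result = dict(llm_extract(text))
            result["source"] = "llm"
            return result
        except Exception:
            pass  # degrade instead of crashing the request
    return rule_based_sentiment(text)
```

Tagging each result with its source lets downstream consumers tell LLM-derived sentiment from the coarser fallback.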

Visualization

Three deck.gl layers on MapLibre GL (CARTO Dark Matter basemap, no API key):

Scatter: sentiment-colored dots, radius scales with book density
Heatmap: geographic clustering of literary places
Arcs: author connection networks across cities

PMTiles protocol registered for future zero-cost self-hosted tile serving.

API

Method	Path	Description
GET	`/api/places`	List places. Params: `q`, `region`, `city`, `author`, `genre`, `year_min`, `year_max`, `limit`, `offset`
GET	`/api/places/{id}`	Single place by ID
GET	`/api/meta`	Dataset version/source/count metadata
GET	`/api/places.geojson`	Full dataset as GeoJSON FeatureCollection
GET	`/api/export?format=csv`	Full dataset as CSV
POST	`/api/places/refresh`	Hot-reload data from disk after re-ingestion
POST	`/api/extract`	Run NLP pipeline on arbitrary text
POST	`/api/extract/summary`	Gemini structured extraction from book summary
GET	`/api/wikidata/narrative-locations`	Wikidata P840 query. Param: `region=south_asia`
GET	`/health`	Pipeline status

The `q` parameter runs a full-text search across titles, authors, cities, genres, themes, and passages. All query terms must match (AND logic).
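The AND semantics can be illustrated with a small in-memory sketch. The field names and the substring matching are assumptions; the backend may tokenize or rank differently.

```python
# Hypothetical field names for the searchable parts of an entry.
SEARCH_FIELDS = ("title", "author", "city", "genres", "themes", "passage")


def _haystack(entry: dict) -> str:
    """Flatten the searchable fields of one entry into a single string."""
    parts = []
    for field in SEARCH_FIELDS:
        value = entry.get(field)
        if isinstance(value, (list, tuple)):
            parts.extend(str(v) for v in value)
        elif value:
            parts.append(str(value))
    return " ".join(parts).casefold()


def search(entries, q: str) -> list[dict]:
    """Return entries in which every whitespace-separated term appears."""
    terms = q.casefold().split()
    if not terms:
        return list(entries)
    results = []
    for entry in entries:
        hay = _haystack(entry)
        if all(term in hay for term in terms):  # AND logic
            results.append(entry)
    return results
```

Under this reading, `q=delhi partition` matches only entries that mention both terms somewhere across the searched fields.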

Quick start

All commands run from the project root (akhand/), not from subdirectories.

# Frontend only (250 curated fallback entries, no backend needed)
cd frontend && npm install && npm run dev

# Backend
pip install -r backend/requirements.txt
python -m spacy download en_core_web_md
uvicorn backend.main:app --port 8000

# Frontend + backend together
# Terminal 1: uvicorn backend.main:app --port 8000
# Terminal 2: cd frontend && npm run dev
# Open http://localhost:3000/explore

# Re-ingest data
python -m backend.data.ingest              # Open Library (54 cities)
python -m backend.data.cif_ingest --merge  # merge CIF spreadsheet + archive
curl -X POST http://localhost:8000/api/places/refresh

# Cut a versioned release
python -m backend.scripts.quality_gate --input backend/data/releases/2026-03-17-strict-v2/literary_places.json --threshold 0.55 --reject --block-filler --filler-min-hits 2 --output backend/data/generated/literary_places_release_strict_next.json --output-report backend/data/generated/quality_report_strict_next.json
python -m backend.scripts.cut_release --input backend/data/generated/literary_places_release_strict_next.json --report backend/data/generated/quality_report_strict_next.json --version 2026-03-18-strict

# Docker (full stack)
docker compose up

Stack

Frontend: Next.js 14, React 18, MapLibre GL 4.7, deck.gl 9.1, Framer Motion, Tailwind CSS, PMTiles

Backend: FastAPI, spaCy 3.8 (en_core_web_md), GLiNER 0.2, Google GenAI (Gemini 3 Flash), geopy, httpx

Database (schema written, not yet wired): PostgreSQL 17, PostGIS, pgvector (HNSW), ltree, pg_trgm

Limitations

API write/extract/admin routes are key-protected and rate-limited. Public read routes remain open.
CORS allows localhost:3000 and shahdev.me. Additional origins require updating the middleware.
Full enrichment is still in progress for the complete scaffold. The strict release intentionally excludes weaker rows until re-enriched.
Neither source contains actual literary passages, only plot summaries (Open Library) and contributor descriptions (CIF). Copyrighted text requires publisher APIs or Project Gutenberg (public domain, pre-1928).
Geocoding approximates regions to centroids. "Marwar region in Western part of Rajasthan" maps to Jodhpur. State-level entries and fictional places are similarly approximate.
Open Library sorts by relevance, not recency. Recently published books are underrepresented.
Wikidata SPARQL endpoint rate-limits heavily (429 on every query during development). Code is correct but the live endpoint is unreliable for bulk queries.
The en_core_web_md spaCy model, while better than sm, still misses literary place names in unusual syntactic positions. GLiNER compensates but its 0.4 threshold needs manual benchmarking against annotated passages.
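For the Wikidata rate-limiting noted above, one common mitigation is exponential backoff with jitter. The sketch below is not the project's current behavior. `RateLimited` and `fetch_with_backoff` are hypothetical names, and `fetch` is any caller-supplied function that raises `RateLimited` on an HTTP 429.

```python
import random
import time


class RateLimited(Exception):
    """Raised by the caller-supplied fetch function on HTTP 429."""


def fetch_with_backoff(fetch, max_attempts=5, base_delay=1.0,
                       max_delay=60.0, sleep=time.sleep):
    """Call fetch(), retrying on RateLimited with growing delays."""
    for attempt in range(max_attempts):
        try:
            return fetch()
        except RateLimited:
            if attempt == max_attempts - 1:
                raise  # out of attempts; let the caller decide what to do
            # 1s, 2s, 4s, ... capped at max_delay, plus up to 50% jitter
            # so parallel workers do not retry in lockstep.
            delay = min(max_delay, base_delay * 2 ** attempt)
            sleep(delay + random.uniform(0, delay / 2))
```

If the endpoint returns a Retry-After header, honoring it would be preferable to a computed delay. This sketch ignores that header for brevity.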

Production Security Checklist

Set these environment variables in production:

AKHAND_ADMIN_API_KEY (required for /api/places/refresh)
AKHAND_WRITE_API_KEY (required for /api/contribute)
AKHAND_EXTRACT_API_KEY (required for /api/extract* and /api/analyze/passage)
AKHAND_TRUSTED_HOSTS (comma-separated host allowlist, e.g. api.example.com)
AKHAND_CORS_ORIGINS (comma-separated explicit origins)
AKHAND_CORS_METHODS (default GET,POST,OPTIONS)
AKHAND_CORS_HEADERS (default Content-Type,X-API-Key)
AKHAND_ENABLE_SECURITY_HEADERS=1 (enables HSTS/XFO/nosniff/referrer-policy)
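A sketch of how these variables might be read at startup follows. The variable names and defaults come from the list above; the function names, the returned dictionary shape, and the idea that keys arrive in an X-API-Key header are assumptions.

```python
import hmac
import os


def csv_env(name, default="", environ=os.environ):
    """Parse a comma-separated variable into a list of trimmed items."""
    raw = environ.get(name, default)
    return [item.strip() for item in raw.split(",") if item.strip()]


def load_security_settings(environ=os.environ) -> dict:
    return {
        "admin_key": environ.get("AKHAND_ADMIN_API_KEY"),
        "write_key": environ.get("AKHAND_WRITE_API_KEY"),
        "extract_key": environ.get("AKHAND_EXTRACT_API_KEY"),
        "trusted_hosts": csv_env("AKHAND_TRUSTED_HOSTS", environ=environ),
        "cors_origins": csv_env("AKHAND_CORS_ORIGINS", environ=environ),
        "cors_methods": csv_env("AKHAND_CORS_METHODS", "GET,POST,OPTIONS", environ),
        "cors_headers": csv_env("AKHAND_CORS_HEADERS", "Content-Type,X-API-Key", environ),
        "security_headers": environ.get("AKHAND_ENABLE_SECURITY_HEADERS") == "1",
    }


def key_matches(provided, expected) -> bool:
    """Constant-time comparison; an unset key never matches."""
    if not provided or not expected:
        return False
    return hmac.compare_digest(provided.encode(), expected.encode())
```

Treating an unset key as "never matches" means a forgotten variable fails closed rather than leaving a protected route open.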

Release hardening:

python -m backend.scripts.data_cleanup \
    --input backend/data/releases/2026-03-19-research-v1/literary_places.json \
    --output backend/data/generated/literary_places_cleaned_prod.json \
    --manifest backend/data/generated/cleanup_manifest_prod.json

python -m backend.scripts.quality_gate \
    --input backend/data/generated/literary_places_cleaned_prod.json \
    --reject --threshold 0.6 --geo-threshold 0.65 \
    --output backend/data/generated/literary_places_release_prod.json \
    --output-report backend/data/generated/quality_report_prod.json

python -m backend.scripts.cut_release \
    --input backend/data/generated/literary_places_release_prod.json \
    --report backend/data/generated/quality_report_prod.json \
    --version 2026-03-19-prod \
    --min-passing-ratio 0.60
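One plausible reading of the `--threshold`, `--reject`, and `--min-passing-ratio` flags is sketched below. This is an assumption about their semantics, not a description of `quality_gate` or `cut_release`. The `quality_score` field, the report keys, and both function names are hypothetical.

```python
def apply_quality_gate(entries, threshold, reject=True):
    """Score-filter entries and summarize how many passed.

    Assumes each entry carries a numeric "quality_score" (hypothetical).
    """
    passing = [e for e in entries if e.get("quality_score", 0.0) >= threshold]
    ratio = len(passing) / len(entries) if entries else 0.0
    kept = passing if reject else list(entries)
    report = {"total": len(entries), "passing": len(passing), "passing_ratio": ratio}
    return kept, report


def check_release(report, min_passing_ratio):
    """Refuse to cut a release when too few entries passed the gate."""
    if report["passing_ratio"] < min_passing_ratio:
        raise ValueError(
            f"passing ratio {report['passing_ratio']:.2f} is below "
            f"the required {min_passing_ratio:.2f}"
        )
```

Splitting the per-entry gate from the release-level ratio check means a run that rejects most of the input stops the release instead of quietly shipping a much smaller dataset.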
