Async Python scraper for collecting book metadata from kirja.fi and storing it locally as JSON, with optional cover image downloads and optional metadata extraction from product pages.
For searching and reporting on downloaded data, see search-tools/README.md.
Requirements:
- Python 3.8+
- pip
Create and activate a virtual environment:
python -m venv venv
# Windows PowerShell
.\venv\Scripts\Activate.ps1
# Windows CMD
venv\Scripts\activate.bat
# Linux/macOS
source venv/bin/activate
Install dependencies:
pip install -r requirements.txt
Optional convenience scripts:
- start.ps1 activates venv and prints common commands (PowerShell)
- start.bat activates venv and opens an interactive CMD session
Run the scraper:
python scraper.py
The scraper writes:
- data/books/: one JSON file per book
- data/images/: cover images (if enabled)
- data/metadata.json: summary metadata
- scraper.log: log output
Basic local text search:
python utils.py "search term"
Output layout:
data/
  books/          # one JSON file per book
  images/         # cover images (when enabled)
  metadata.json   # scraping summary
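For a sense of what a basic text search over the per-book JSON files in data/books/ involves, here is a minimal sketch. It is not the code in utils.py: the function name `search_books` is hypothetical, and because the JSON schema is not documented here, it matches against the whole serialized record rather than named fields.

```python
import json
from pathlib import Path


def search_books(term, books_dir="data/books"):
    """Return (filename, record) pairs whose JSON content contains `term`.

    Hypothetical helper: it makes no assumption about field names and
    simply searches the full text of each record, case-insensitively.
    """
    needle = term.casefold()
    matches = []
    for path in sorted(Path(books_dir).glob("*.json")):
        try:
            record = json.loads(path.read_text(encoding="utf-8"))
        except (OSError, json.JSONDecodeError):
            continue  # skip unreadable or partially written files
        # ensure_ascii=False keeps characters such as "ä" searchable as-is
        haystack = json.dumps(record, ensure_ascii=False).casefold()
        if needle in haystack:
            matches.append((path.name, record))
    return matches
```

Calling `search_books("search term")` from the project root would scan every file under data/books/ and return the matching records in filename order.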
Adjust settings in config.py:
- MAX_CONCURRENT_REQUESTS / SEMAPHORE_LIMIT: concurrency
- REQUEST_DELAY: delay between collection page requests
- HTML_REQUEST_DELAY: delay between HTML page requests
- MAX_RETRIES, REQUEST_TIMEOUT: reliability/timeouts
- DOWNLOAD_IMAGES: enable/disable cover downloads
- FETCH_HTML_METADATA: enable/disable extra metadata extraction from product pages
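To illustrate how settings like these typically interact in an async scraper, the sketch below combines a semaphore, a per-attempt timeout, retries, and a delay. It is not taken from scraper.py: the values are placeholders (units are assumed to be seconds), `fetch_with_limits` is a hypothetical helper, and `fetch` stands in for whatever coroutine performs the actual HTTP request.

```python
import asyncio

# Illustrative values only; the real defaults are whatever config.py defines.
SEMAPHORE_LIMIT = 5      # maximum requests in flight at once
REQUEST_DELAY = 0.5      # pause after each attempt (assumed to be seconds)
MAX_RETRIES = 3          # attempts per URL
REQUEST_TIMEOUT = 30     # per-attempt timeout (assumed to be seconds)


async def fetch_with_limits(fetch, url, semaphore, *, retries=MAX_RETRIES,
                            timeout=REQUEST_TIMEOUT, delay=REQUEST_DELAY):
    """Run the coroutine function `fetch(url)` under a concurrency cap.

    Each attempt is bounded by `timeout`; after every attempt the task
    sleeps for `delay` while still holding its semaphore slot, which
    throttles the overall request rate.
    """
    last_error = None
    for _attempt in range(retries):
        async with semaphore:
            try:
                return await asyncio.wait_for(fetch(url), timeout)
            except (asyncio.TimeoutError, OSError) as exc:
                last_error = exc  # remember the failure and try again
            finally:
                await asyncio.sleep(delay)
    raise RuntimeError(f"giving up on {url} after {retries} attempts") from last_error
```

A caller would create one `asyncio.Semaphore(SEMAPHORE_LIMIT)` inside the running event loop and share it across all tasks, so lowering the limit or raising the delay slows the whole run.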
Related documentation:
- Search tooling: search-tools/README.md
- Background notes and API investigation: kirja_fi_investigation_report.md
Use responsibly:
- Follow kirja.fi terms of service and robots guidelines
- Use reasonable rate limits
- Do not republish copyrighted content
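One way to check robots rules before requesting a URL is Python's standard `urllib.robotparser` module. The sketch below is an illustration only: `allowed_by_robots` is a hypothetical helper that is not part of this project, and it takes the robots.txt text as an argument so the caller decides how and when to download it.

```python
from urllib.robotparser import RobotFileParser


def allowed_by_robots(robots_txt, url, user_agent="*"):
    """Return True if `url` may be fetched under the given robots.txt text.

    Hypothetical helper: pass in the robots.txt content already downloaded
    from the site being scraped.
    """
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)
```

For example, with rules that disallow `/private/` for all user agents, a URL under that path is rejected and other paths are allowed.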