A web crawler and API for scraping book data, built with Python and its libraries.
The project stores its data in MongoDB, specifically MongoDB Atlas, so the database lives in the cloud.
The scraped data is served through FastAPI with API key authentication for security and controlled access.
- Asynchronous web crawler application
- Stores data in MongoDB
- FastAPI endpoints to:
  - List books (with pagination)
  - Search books with filters
  - Retrieve specific books by ID
  - View changes
  - See statistics of the data
- Swagger UI documentation (/docs)
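Pagination on the list endpoint typically maps a page number and page size onto MongoDB's skip/limit values. A minimal sketch of that translation (the function name and defaults here are illustrative, not taken from the project):

```python
def pagination_params(page: int = 1, page_size: int = 20, max_page_size: int = 100) -> dict:
    """Translate a 1-based page number into MongoDB skip/limit values.

    Clamps inputs so a client cannot request page 0 or an oversized page.
    """
    page = max(page, 1)
    page_size = min(max(page_size, 1), max_page_size)
    return {"skip": (page - 1) * page_size, "limit": page_size}
```

In a FastAPI app this kind of helper is usually wired in as a dependency so every list endpoint shares the same clamping rules.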
1. Clone the repository
2. Create a `.env` file in the root directory; this will hold the authentication keys
3. Access the API
   - Swagger UI: http://127.0.0.1:8000/docs
   - ReDoc: http://127.0.0.1:8000/redoc
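A `.env` file for this kind of setup usually contains the database connection string and the API key. The variable names below are illustrative; use whatever names the code actually reads:

```env
# .env — variable names are assumptions, not the project's exact keys
MONGODB_URI=mongodb+srv://<user>:<password>@<cluster>.mongodb.net/books
API_KEY=your-secret-api-key
```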
4. Endpoints
Example MongoDB Document:
```json
{
  "_id": "64b9e5f9e5a4b1234567890a",
  "title": "A Light in the Attic",
  "price_including_tax": 51.77,
  "availability": "In stock",
  "rating": 3,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "category": "Poetry"
}
```
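The search filters can be translated into a MongoDB query document whose field names match the example above. A sketch of that translation (the helper name and parameter set are assumptions, not the project's actual code):

```python
def build_query(category=None, min_price=None, max_price=None, rating=None) -> dict:
    """Build a MongoDB filter dict from optional search parameters.

    Omitted parameters add no constraint, so an empty call matches all books.
    """
    query = {}
    if category:
        query["category"] = category
    price = {}
    if min_price is not None:
        price["$gte"] = min_price
    if max_price is not None:
        price["$lte"] = max_price
    if price:
        query["price_including_tax"] = price
    if rating is not None:
        query["rating"] = rating
    return query
```

The resulting dict is passed straight to `collection.find(...)`, which is why the keys use Mongo operators such as `$gte`/`$lte`.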
All tests are in the `Tests/` folder and are run using pytest.
Tests include:
- Endpoint status checks
- Valid and invalid ID checks
- Search and filter checks
- Categories, stats, and health checks
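The valid/invalid ID checks can be illustrated with a simplified validator: a MongoDB ObjectId string is exactly 24 hex characters. This regex only accepts lowercase hex, which matches how ObjectIds are usually serialized; in the real tests, `bson.ObjectId.is_valid` would be the canonical check:

```python
import re

# Simplified: a serialized ObjectId is 24 lowercase hex characters.
OBJECT_ID_RE = re.compile(r"^[0-9a-f]{24}$")

def is_valid_object_id(book_id: str) -> bool:
    """Return True if book_id looks like a MongoDB ObjectId string."""
    return bool(OBJECT_ID_RE.match(book_id))

def test_valid_id():
    assert is_valid_object_id("64b9e5f9e5a4b1234567890a")

def test_invalid_id():
    assert not is_valid_object_id("not-an-id")
    assert not is_valid_object_id("64b9e5f9e5a4b123456789")  # too short
```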
Pytest test run
Page crawling (page 51 does not exist, hence the error message; after 3 tries the crawler moves on)
The crawler successfully found 1000 books across 50 pages
Each book's details were crawled and stored in the database successfully
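The retry-then-move-on behavior described above can be sketched with plain asyncio. The function and its parameters are illustrative; in the real crawler, `fetch` would be an async HTTP GET, and the backoff policy may differ:

```python
import asyncio

async def fetch_with_retry(fetch, url, retries=3, delay=0.0):
    """Attempt `fetch(url)` up to `retries` times.

    On the final failure, return None so the crawler logs the error
    and moves on (e.g. page 51 does not exist).
    """
    for attempt in range(1, retries + 1):
        try:
            return await fetch(url)
        except Exception:
            if attempt == retries:
                return None  # give up after the last try
            await asyncio.sleep(delay)  # brief pause before retrying
```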



