Matre5/Web_Crawler

Web Crawling project

A web crawler and API for scraping book data, built with Python and its libraries. The project uses MongoDB to store the data; specifically MongoDB Atlas, so the database lives in the cloud.
The retrieved data is served with FastAPI, using API-key authentication for security and controlled access.

Features

  • Asynchronous web crawler application
  • Stores data in MongoDB
  • FastAPI endpoints to:
    • List books (with pagination)
    • Search books with specific filters
    • Retrieve specific books by their ID
    • View changes
    • See statistics of the data
  • Swagger UI documentation (/docs)

Folder Structure

(screenshot: project folder structure)

Setup Instructions

  1. Clone the repository

  2. Create and activate a virtual environment

  3. Install dependencies

  4. Create a .env file in the root directory; this will hold the authentication keys.
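The steps above can be sketched as shell commands. The repository URL is taken from the page header; the requirements.txt filename is an assumption, so check the repo root for the actual dependency file:

```shell
# 1. Clone the repository
git clone https://github.com/Matre5/Web_Crawler.git
cd Web_Crawler

# 2. Create and activate a virtual environment
python -m venv venv
source venv/bin/activate      # on Windows: venv\Scripts\activate

# 3. Install dependencies (assumes a requirements.txt at the repo root)
pip install -r requirements.txt

# 4. Create a .env file to hold the authentication keys
touch .env
```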

Running the Project

  1. Start the FastAPI server

  2. Access the API
    Swagger UI: http://127.0.0.1:8000/docs
    ReDoc: http://127.0.0.1:8000/redoc

  3. Endpoints
    (screenshot: list of available endpoints)
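A minimal sketch of starting the server and calling it. The module path `main:app`, the `/books` endpoint path, and the `X-API-Key` header name are all assumptions, not taken from the repository; check the source for the real names:

```shell
# Start the FastAPI server (module path "main:app" is an assumption)
uvicorn main:app --reload

# Example request against a hypothetical book-listing endpoint;
# the path and the API-key header name are guesses
curl -H "X-API-Key: $API_KEY" "http://127.0.0.1:8000/books?page=1"
```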

Example MongoDB Document:

```json
{
  "_id": "64b9e5f9e5a4b1234567890a",
  "title": "A Light in the Attic",
  "price_including_tax": 51.77,
  "availability": "In stock",
  "rating": 3,
  "url": "https://books.toscrape.com/catalogue/a-light-in-the-attic_1000/index.html",
  "category": "Poetry"
}
```
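A document like the one above implies some normalisation of the raw HTML fields from books.toscrape.com (prices carry a currency symbol, ratings are word-valued CSS classes). The sketch below shows one plausible way to do it; the helper names and field choices are assumptions, not the project's actual code:

```python
# Hypothetical helpers for normalising raw scraped fields before
# inserting a book document into MongoDB. Not the project's real code.

RATING_WORDS = {"One": 1, "Two": 2, "Three": 3, "Four": 4, "Five": 5}

def parse_price(raw: str) -> float:
    """Strip the leading currency symbol from a price like '£51.77'."""
    return float(raw.lstrip("£$"))

def parse_rating(css_class: str) -> int:
    """books.toscrape.com encodes ratings as CSS classes like 'star-rating Three'."""
    return RATING_WORDS[css_class.split()[-1]]

doc = {
    "title": "A Light in the Attic",
    "price_including_tax": parse_price("£51.77"),
    "availability": "In stock",
    "rating": parse_rating("star-rating Three"),
    "category": "Poetry",
}
```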

Testing

All tests are in the Tests/ folder and are run using pytest.
Tests include:

  1. Endpoints status check
  2. Valid and invalid id checks
  3. Search and filter checks
  4. Categories, stats and health checks
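A typical invocation from the repository root (the Tests/ folder name is taken from the section above):

```shell
# Run the whole suite; -v lists each test case individually
pytest Tests/ -v
```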

Screenshots

Pytest run
(screenshot: pytest output)

Page crawling: page 51 does not exist, hence the error message; after 3 retries the crawler moves on.
(screenshots: crawler output)

The crawler successfully found all 1000 books across 50 pages, and each book and its details were crawled and stored in the database.
(screenshot: database contents)
