The Book Query Engine is a tool for fast retrieval of information about books stored in a centralized data lake.
This project consists of four main components:
- Crawler
- Indexer
- Metadata Datamart Builder
- Webservice
The crawler downloads books from Project Gutenberg, a large collection of free eBooks, and stores them in the data lake.
The indexer builds an inverted index datamart. An inverted index is a data structure used in information retrieval systems to map each term to the documents (postings) that contain it, so a word search looks up the term directly instead of scanning every book in the data lake.
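To make the idea concrete, here is a minimal sketch of an inverted index in Python. It is an illustration only, not the project's indexer: the function names, the tokenization rule (lowercased alphabetic words), and the in-memory dictionary are all assumptions.

```python
import re
from collections import defaultdict


def tokenize(text):
    """Lowercase the text and split it into alphabetic word tokens (assumed rule)."""
    return re.findall(r"[a-z]+", text.lower())


def build_inverted_index(documents):
    """Map each term to the sorted list of document ids that contain it.

    `documents` is a hypothetical dict of {document id: full text}.
    """
    index = defaultdict(set)
    for doc_id, text in documents.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return {term: sorted(ids) for term, ids in index.items()}


def search(index, words):
    """Return the ids of documents that contain every word in `words`."""
    postings = [set(index.get(word.lower(), ())) for word in words]
    return sorted(set.intersection(*postings)) if postings else []
```

Searching for several words intersects their posting lists, so the cost depends on how many documents contain those terms rather than on the total size of the data lake.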
The metadata datamart builder creates a datamart of metadata, which includes information about the books such as:
- Author
- Title
- Publication Date
This metadata is stored in a separate datamart, so information about the books in the data lake can be retrieved without reading the books themselves.
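A metadata datamart of this kind could be sketched as a single table, as below. The storage engine (SQLite), the table and column names, and the ISO `YYYY-MM-DD` date format are assumptions made for illustration; the project's actual datamart may be built differently.

```python
import sqlite3


def create_metadata_datamart(connection):
    """Create a hypothetical `books` table holding one metadata row per book."""
    connection.execute(
        "CREATE TABLE IF NOT EXISTS books ("
        "id INTEGER PRIMARY KEY, author TEXT, title TEXT, publication_date TEXT)"
    )


def add_book(connection, book_id, author, title, publication_date):
    """Insert or update the metadata of one book (dates as ISO strings, assumed)."""
    connection.execute(
        "INSERT OR REPLACE INTO books VALUES (?, ?, ?, ?)",
        (book_id, author, title, publication_date),
    )


def find_books(connection, author=None, date_from=None, date_to=None):
    """Return metadata rows matching the optional author and date range."""
    clauses, params = [], []
    if author is not None:
        clauses.append("author = ?")
        params.append(author)
    if date_from is not None:
        clauses.append("publication_date >= ?")
        params.append(date_from)
    if date_to is not None:
        clauses.append("publication_date <= ?")
        params.append(date_to)
    sql = "SELECT id, author, title, publication_date FROM books"
    if clauses:
        sql += " WHERE " + " AND ".join(clauses)
    return connection.execute(sql + " ORDER BY id", params).fetchall()
```

ISO-formatted date strings sort in chronological order, which is why plain string comparison is enough for the date range filter in this sketch.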
The webservice provides an interface for querying the inverted index and metadata datamarts. The API allows users to search for books by author, title, publication date, and other metadata fields, as well as by the words they contain. The following requests are available:
GET /stats/:type
- Retrieve statistics about the data stored in the data lake and datamarts.
GET /documents/:words?from=…&to=…&author=…
- Retrieve information about books that contain the specified words and match the optional author, publication date range, and other criteria.
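As a rough illustration, a client could assemble a `/documents` request URL as follows. The helper name, the base URL, and the value formats are assumptions; in particular, this sketch passes `words` as a single path segment and does not guess how the service expects several words to be separated or how dates should be written.

```python
from urllib.parse import quote, urlencode


def documents_url(base_url, words, date_from=None, date_to=None, author=None):
    """Build a GET /documents/:words URL with the optional query filters.

    `base_url` is a hypothetical address such as "http://localhost:8080".
    """
    params = {}
    if date_from is not None:
        params["from"] = date_from
    if date_to is not None:
        params["to"] = date_to
    if author is not None:
        params["author"] = author
    # Percent-encode the path segment so special characters stay inside it.
    url = f"{base_url.rstrip('/')}/documents/{quote(words, safe='')}"
    if params:
        url += "?" + urlencode(params)
    return url
```

Filters that are left as `None` are omitted from the query string, which matches their optional role in the request.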
You can deploy the Book Query Engine either on your local machine or using Docker. To deploy on your local machine, you will need to install the necessary dependencies and configure the environment. With Docker, you can run the provided container and start using the engine.
(c) 2022 José Juan Hernández Gálvez
Github: https://github.com/josejuanhernandezgalvez
(c) 2022 David Cruz Sánchez
Github: https://github.com/Davoestacogido
(c) 2022 Juan Carlos Santana Santana
Github: https://github.com/JuanCarss
(c) 2022 Jorge Hernández Hernández
Github: https://github.com/Yorchz