Data extractor component #13

Sibyx · 2022-07-14T16:52:14Z

Description

Read the content of documents and save it to the database so that full-text search can be added later. Implement configurable drivers for the supported file formats (PDF, EPUB, MOBI). The feature should be pluggable, because the drivers pull in heavy dependencies that not every deployment needs.
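A minimal sketch of what a pluggable driver layer could look like, assuming drivers are selected by file extension. All names here (`ExtractorDriver`, `register`, `get_driver`, `PlainTextDriver`) are hypothetical and not part of the existing codebase; the plain-text driver only stands in for real PDF/EPUB/MOBI drivers so the example runs without third-party packages.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class ExtractorDriver(ABC):
    """Hypothetical base class: one driver per file format."""

    # Lowercase file extensions (with leading dot) handled by the driver.
    extensions: tuple[str, ...] = ()

    @abstractmethod
    def extract(self, path: Path) -> list[str]:
        """Return the document text as a list of pages."""


# Maps a file extension to the driver class responsible for it.
_REGISTRY: dict[str, type[ExtractorDriver]] = {}


def register(driver_cls: type[ExtractorDriver]) -> type[ExtractorDriver]:
    """Class decorator that makes a driver discoverable by extension."""
    for ext in driver_cls.extensions:
        _REGISTRY[ext] = driver_cls
    return driver_cls


@register
class PlainTextDriver(ExtractorDriver):
    """Stand-in driver; assumes a form feed character separates pages."""

    extensions = (".txt",)

    def extract(self, path: Path) -> list[str]:
        return path.read_text(encoding="utf-8").split("\f")


def get_driver(path: Path) -> ExtractorDriver:
    """Instantiate the driver registered for the file's extension."""
    try:
        return _REGISTRY[path.suffix.lower()]()
    except KeyError:
        raise LookupError(f"No driver registered for {path.suffix!r}") from None
```

A PDF or EPUB driver would then live in its own optional module and import its dependency (for example pytesseract or EbookLib) only there, so installations that do not need a format do not have to install its libraries.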

The process should run asynchronously so that OCR does not block publication uploads.

Given the dependencies and the asynchronous nature of the work, it would be beneficial to implement this feature as an independent component. The component could consume a Redis queue and store its results in Elasticsearch. Building a map from page numbers to page content during extraction would also improve searchability and indexing.
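The worker loop described above could be sketched roughly as follows. This is an illustration under stated assumptions, not a design decision: the job payload format (`entry_id`, `path`), the `JobQueue` and `SearchIndex` interfaces, and the in-memory `ListQueue` and `DictIndex` classes are all hypothetical. In a real deployment the queue would be backed by Redis and the index by Elasticsearch; they are abstracted here so the sketch runs with the standard library only.

```python
from __future__ import annotations

import json
from typing import Callable, Protocol


class JobQueue(Protocol):
    def pop(self) -> str | None:
        """Return the next JSON-encoded job, or None when the queue is empty."""


class SearchIndex(Protocol):
    def store(self, doc_id: str, body: dict) -> None:
        """Persist the extracted content for one document."""


def build_page_map(pages: list[str]) -> dict[int, str]:
    """Map 1-based page numbers to page text, skipping blank pages."""
    return {n: text.strip() for n, text in enumerate(pages, start=1) if text.strip()}


def run_worker(
    queue: JobQueue,
    index: SearchIndex,
    extract: Callable[[str], list[str]],
) -> int:
    """Drain the queue, index each document, and return the number processed."""
    processed = 0
    while (raw := queue.pop()) is not None:
        job = json.loads(raw)  # assumed payload: {"entry_id": ..., "path": ...}
        pages = extract(job["path"])
        index.store(job["entry_id"], {"pages": build_page_map(pages)})
        processed += 1
    return processed


class ListQueue:
    """In-memory stand-in for a Redis-backed queue."""

    def __init__(self, items: list[str]) -> None:
        self._items = list(items)

    def pop(self) -> str | None:
        return self._items.pop(0) if self._items else None


class DictIndex:
    """In-memory stand-in for an Elasticsearch-backed index."""

    def __init__(self) -> None:
        self.docs: dict[str, dict] = {}

    def store(self, doc_id: str, body: dict) -> None:
        self.docs[doc_id] = body
```

Because uploads only push a job onto the queue, the upload request returns immediately and OCR runs whenever the worker picks the job up. Keeping the page numbers in the stored document lets a search hit point to the page where the match occurs.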

Extra Ideas:

Implement a monitoring system to track the progress and efficiency of the OCR process.
Consider incorporating a machine learning model for improved accuracy in text recognition, especially for complex layouts or low-quality scans.
Explore the possibility of user-defined settings for OCR, allowing customized processing based on document type or quality.
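As a rough illustration of the monitoring idea, the sketch below counts successes, failures, and pages, and accumulates processing time. `ExtractionStats` and its fields are hypothetical names chosen for this example; a real setup might export these numbers to an existing metrics system instead.

```python
from __future__ import annotations

import time
from contextlib import contextmanager
from dataclasses import dataclass


@dataclass
class ExtractionStats:
    """Hypothetical in-process counters for the extraction worker."""

    succeeded: int = 0
    failed: int = 0
    pages: int = 0
    seconds: float = 0.0

    @contextmanager
    def track(self):
        """Time one job and record whether it succeeded or raised."""
        start = time.perf_counter()
        try:
            yield self
        except Exception:
            self.failed += 1
            raise
        else:
            self.succeeded += 1
        finally:
            self.seconds += time.perf_counter() - start

    @property
    def pages_per_second(self) -> float:
        """Throughput over all tracked jobs; 0.0 before any time is recorded."""
        return self.pages / self.seconds if self.seconds else 0.0
```

A worker would wrap each job in `with stats.track():` and add the number of extracted pages to `stats.pages` inside the block, which gives a simple view of both progress and efficiency.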

Resources

OCRmyPDF: Adds a text layer to uploaded PDFs.
pytesseract & opencv: For extracting text from PDFs.
EbookLib: EPUB manipulation.
nltk: For text processing.
papermage: For processing academic papers.
camelot: For extracting tables from PDFs.

Sibyx added the enhancement New feature or request label Jul 14, 2022

Sibyx added this to the 1.0 milestone Jul 14, 2022

Sibyx self-assigned this Jul 14, 2022

Sibyx mentioned this issue Dec 6, 2022

Entry flags #19

Closed

Sibyx changed the title ~~OCR~~ Data extractor component Jan 15, 2024
