Data extractor component #13

Open
Sibyx opened this issue Jul 14, 2022 · 0 comments
Assignees
Labels
enhancement New feature or request


Sibyx commented Jul 14, 2022

Description

Read the content of documents and save it to the database so that full-text search can be added later. Implement configurable drivers for the supported file formats (PDF, EPUB, MOBI). The feature should be pluggable because of the dependencies the drivers require.
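A minimal sketch of what pluggable drivers could look like, assuming a registry keyed by file extension. Every name here (`ExtractorDriver`, `register`, `driver_for`, `PlainTextDriver`) is hypothetical and not part of the project. The plain-text driver is only a stand-in: real PDF, EPUB, or MOBI drivers would import their heavy dependencies inside their own modules, so an installation only pays for the formats it enables.

```python
from __future__ import annotations

from abc import ABC, abstractmethod
from pathlib import Path


class ExtractorDriver(ABC):
    """Hypothetical base class: one driver per file format."""

    # Lowercase file extensions (with leading dot) handled by this driver.
    extensions: tuple[str, ...] = ()

    @abstractmethod
    def extract(self, path: Path) -> list[str]:
        """Return the document text, one string per page."""


# Maps a file extension to the driver class registered for it.
_REGISTRY: dict[str, type[ExtractorDriver]] = {}


def register(driver_cls: type[ExtractorDriver]) -> type[ExtractorDriver]:
    """Class decorator that registers a driver for each of its extensions."""
    for ext in driver_cls.extensions:
        _REGISTRY[ext] = driver_cls
    return driver_cls


def driver_for(path: Path) -> ExtractorDriver:
    """Instantiate the driver registered for the file's extension."""
    ext = path.suffix.lower()
    if ext not in _REGISTRY:
        raise LookupError(f"no extractor driver registered for {ext!r}")
    return _REGISTRY[ext]()


@register
class PlainTextDriver(ExtractorDriver):
    """Stand-in driver for illustration: treats form feeds as page breaks."""

    extensions = (".txt",)

    def extract(self, path: Path) -> list[str]:
        return path.read_text(encoding="utf-8").split("\f")
```

With this shape, adding a format means adding one module that defines and registers a driver; the rest of the component only calls `driver_for(path).extract(path)`.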

Extraction should run asynchronously so that OCR does not block publication uploads.
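One way to keep uploads from waiting on OCR is for the upload path to enqueue a job and return straight away. The sketch below assumes a JSON job payload with `document_id` and `path` fields and uses the standard-library `queue.Queue` as a stand-in for the real queue; `handle_upload` and the payload shape are hypothetical.

```python
import json
import queue


def handle_upload(jobs: queue.Queue, document_id: str, path: str) -> dict:
    """Enqueue an extraction job and return without waiting for OCR.

    Storing the uploaded file itself is omitted from this sketch.
    """
    job = json.dumps({"document_id": document_id, "path": path})
    jobs.put_nowait(job)  # hands the work to a separate worker
    return {"document_id": document_id, "extraction": "queued"}
```

The response tells the caller only that extraction is queued; the extracted text becomes available once a worker has processed the job.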

Given the dependencies and the asynchronous processing, it would make sense to implement this feature as an independent component. The component could consume a Redis queue and store its results in Elasticsearch. Building a map of pages to their content during extraction would also improve searchability and indexing.
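The worker loop of such a component might look roughly like the sketch below. To stay self-contained it uses `queue.Queue` in place of the Redis queue and a plain `dict` in place of the Elasticsearch index; in a real deployment both would be thin adapters around the respective clients, which are not shown. The job format, the `None` shutdown sentinel, and all function names are assumptions.

```python
import json
import logging
import queue
from typing import Callable

logger = logging.getLogger("extractor")


def build_page_map(pages: list[str]) -> dict[int, str]:
    """Map 1-based page numbers to their text, skipping empty pages."""
    return {
        number: text.strip()
        for number, text in enumerate(pages, start=1)
        if text.strip()
    }


def process_job(raw_job: str, extract: Callable[[str], list[str]], index: dict) -> None:
    """Extract one document and store its page map in the index."""
    job = json.loads(raw_job)
    pages = extract(job["path"])
    index[job["document_id"]] = {
        "path": job["path"],
        "pages": build_page_map(pages),
    }


def run_worker(jobs: queue.Queue, extract: Callable[[str], list[str]], index: dict) -> None:
    """Consume jobs until the None sentinel arrives; a failed job is logged and skipped."""
    while True:
        raw_job = jobs.get()
        if raw_job is None:
            break
        try:
            process_job(raw_job, extract, index)
        except Exception:
            logger.exception("extraction failed for job %r", raw_job)
```

Keeping the page numbers in the stored document means a search hit can later point to the page where the match occurred, not only to the publication.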

Extra Ideas:

  • Implement a monitoring system to track the progress and efficiency of the OCR process.
  • Consider incorporating a machine learning model for improved accuracy in text recognition, especially for complex layouts or low-quality scans.
  • Explore the possibility of user-defined settings for OCR, allowing customized processing based on document type or quality.
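The last idea, user-defined OCR settings, could start as a small settings object with per-document-type overrides. This is only a sketch: the fields (`enabled`, `language`, `dpi`), the document types, and the default values are illustrative assumptions and are not tied to any particular OCR engine.

```python
from dataclasses import dataclass, replace
from typing import Optional


@dataclass(frozen=True)
class OcrSettings:
    """Hypothetical OCR settings; field names and defaults are illustrative."""

    enabled: bool = True
    language: str = "eng"
    dpi: int = 300


DEFAULTS = OcrSettings()

# Hypothetical per-document-type overrides.
TYPE_OVERRIDES = {
    "scan": {"dpi": 600},        # low-quality scans get a higher resolution
    "ebook": {"enabled": False}, # text-based formats need no OCR
}


def settings_for(document_type: str, user_overrides: Optional[dict] = None) -> OcrSettings:
    """Merge defaults, type overrides, and user overrides (last one wins)."""
    merged = {**TYPE_OVERRIDES.get(document_type, {}), **(user_overrides or {})}
    return replace(DEFAULTS, **merged)
```

Because `replace` rejects unknown field names with a `TypeError`, a misspelled user setting fails loudly instead of being silently ignored.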

Resources

Sibyx added the enhancement (New feature or request) label Jul 14, 2022
Sibyx added this to the 1.0 milestone Jul 14, 2022
Sibyx self-assigned this Jul 14, 2022
Sibyx mentioned this issue Dec 6, 2022
Sibyx changed the title from "OCR" to "Data extractor component" Jan 15, 2024
Projects
Status: Todo