Enhancing Table Parsing with DETR and Pytesseract Integration #295

Open
35C4n0r opened this issue Feb 14, 2024 · 6 comments

35C4n0r commented Feb 14, 2024

Description

The current implementation, which uses Table Transformer, does not detect rows and columns accurately enough, particularly when parsing Hindi tables from PDFs. To address this, we propose a new approach that integrates Detection Transformer (DETR) models with Pytesseract to improve detection of text objects within tables.

The objective is to use DETR models together with Pytesseract's OCR to improve the accuracy of text detection and bounding-box identification within table cells. DETR handles object detection and Pytesseract handles optical character recognition, so the combination should give a more robust way to parse tables.

Proposed Workflow

Input

The input to the system will be PDFs or images containing tables, alongside the specification of the DETR model to be used and the language setting for Pytesseract.

DETR Model Processing

Use the specified DETR model to detect text objects within the tables. DETR models, known for their efficiency in object detection tasks, will help identify text blocks or cells within the complex structure of tables.

Pytesseract OCR

Apply Pytesseract with the specified language setting to the detected text objects to recognize the text within each cell.
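
A minimal sketch of the OCR step, assuming `pytesseract.image_to_data` is called with `output_type=pytesseract.Output.DICT`. The helper names `words_from_data` and `ocr_words`, the `hin` language code, and the `--psm 6` setting are illustrative assumptions, not settings agreed on in this issue.

```python
def words_from_data(data, min_conf=0.0):
    """Turn an image_to_data dict into word records with (x0, y0, x1, y1) boxes."""
    words = []
    for i, text in enumerate(data["text"]):
        text = text.strip()
        # conf may arrive as int, float, or str; -1 marks non-word layout entries.
        conf = float(data["conf"][i])
        if not text or conf < min_conf:
            continue
        left, top = data["left"][i], data["top"][i]
        words.append({
            "text": text,
            "conf": conf,
            "box": (left, top, left + data["width"][i], top + data["height"][i]),
        })
    return words


def ocr_words(image, lang="hin", config="--psm 6"):
    """Run Tesseract on a cropped table or cell image (PIL image or array)."""
    # Imported lazily so words_from_data can be used without Tesseract installed.
    import pytesseract

    data = pytesseract.image_to_data(
        image, lang=lang, config=config, output_type=pytesseract.Output.DICT
    )
    return words_from_data(data)
```

Keeping the parsing separate from the Tesseract call makes it possible to check the box arithmetic on canned data before tuning the language and page-segmentation settings.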

Output Mapping

The output will be a structured mapping of each word detected to its corresponding location within the table (e.g., row1/column1/cell1/table1). This includes combining words that belong to the same cell or object for a comprehensive representation of the table's content.
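
One way the mapping could look, as a sketch only. It assumes the detector returns row and column bounding boxes as `(x0, y0, x1, y1)` tuples and that each OCR word carries a box in the same coordinates. The function name `map_words_to_cells` and the `table1/row1/column1` key layout are hypothetical choices for illustration.

```python
from collections import defaultdict


def center(box):
    """Return the (x, y) center of an (x0, y0, x1, y1) box."""
    x0, y0, x1, y1 = box
    return ((x0 + x1) / 2, (y0 + y1) / 2)


def find_index(value, spans):
    """Return the 1-based index of the (start, end) span containing value, else None."""
    for i, (start, end) in enumerate(spans, start=1):
        if start <= value < end:
            return i
    return None


def map_words_to_cells(words, row_boxes, col_boxes, table_id=1):
    """Assign each word to the row and column that contain its center point."""
    rows = sorted((b[1], b[3]) for b in row_boxes)  # vertical extents, top to bottom
    cols = sorted((b[0], b[2]) for b in col_boxes)  # horizontal extents, left to right
    cells = defaultdict(list)
    for word in words:  # input order is kept, so pass words in reading order
        cx, cy = center(word["box"])
        r, c = find_index(cy, rows), find_index(cx, cols)
        if r is None or c is None:
            continue  # word lies outside every detected row or column
        cells[f"table{table_id}/row{r}/column{c}"].append(word["text"])
    # Words that share a cell are combined into one string.
    return {key: " ".join(texts) for key, texts in cells.items()}
```

The resulting dictionary can be passed to `json.dumps` or flattened into CSV rows. Using the word center rather than full containment tolerates boxes that slightly overlap a cell border.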

Expected Outcome:

  • Enhanced accuracy in the detection of rows, columns, and text within tables.
  • A structured and detailed output that maps each detected word to its precise location within the table, facilitating easier parsing and conversion into structured formats like JSON or CSV.

35C4n0r commented Feb 14, 2024

cc: @GautamR-Samagra

@basedsaksham

Hi @GautamR-Samagra, I'd like to work on this problem. Please assign it to me.

basedsaksham commented Mar 15, 2024 via email

Greetings, Samagra-Development/ai-tools. I have started work on the NER improvement issue. I have already written code to detect phone numbers, emails, times, rates, and units, and to resolve relative dates such as "next monday" or "agle somvar". If possible, may I be assigned to this issue and given access to the crop, seeds, and pests datasets so I can proceed further?

On Fri, 15 Mar 2024, 09:00 Gautam wrote: Assigned #295 to @basedsaksham.

@GautamR-Samagra
Collaborator

You are probably referring to the wrong ticket here.

@basedsaksham

[Screenshot attached: Google Chrome, 3_21_2024 2_12_58 AM]
I got this result after extracting rows and columns using DETR. Next, I will work on recognizing the text using OCR with Pytesseract. Kindly let me know whether this example output is satisfactory.

@basedsaksham

Hey @35C4n0r, can you please explain which Pytesseract settings and configs can be used to achieve the best output?
