Use Machine Learning #3

Open
Divyansh-Gemini opened this issue Mar 1, 2024 · 5 comments
Labels
enhancement New feature or request

Comments

@Divyansh-Gemini
Owner

Divyansh-Gemini commented Mar 1, 2024

Use machine learning, instead of only searching for keywords in the PDF, to improve accuracy.

Divyansh-Gemini added the enhancement label on Mar 1, 2024
@sarayusreeyadavpadala

sarayusreeyadavpadala commented Apr 7, 2024

@Divyansh-Gemini I want to work on this issue. Can you provide more detail on what is required?

@Divyansh-Gemini
Owner Author

Hi @sarayusreeyadavpadala,
Our goal is to fetch the required details from the PDF, such as exporter name, exporter address, invoice number, invoice date, port of discharge, total net weight, and total gross weight.

For this, we use the camelot library to read the PDF's table data as pandas DataFrames, then search each DataFrame for keywords and return the value next to each match.
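The keyword lookup described above can be sketched roughly as follows. This is not the code in camScript.py: the function name `find_value_next_to` and the sample rows are hypothetical, and it assumes "next to" means the first non-empty cell to the right in the same row. It works on plain lists of rows so it runs without any PDF; with camelot, something like `camelot.read_pdf(path)[0].df.values.tolist()` should produce rows in this shape.

```python
from typing import Optional, Sequence


def find_value_next_to(rows: Sequence[Sequence[str]], keyword: str) -> Optional[str]:
    """Return the first non-empty cell to the right of a cell containing keyword.

    Assumption: the value sits in the same row as its label, to the right of it.
    """
    needle = keyword.casefold()
    for row in rows:
        for col, cell in enumerate(row):
            if needle in str(cell).casefold():
                # Scan rightwards, skipping empty cells left by merged columns.
                for neighbour in row[col + 1:]:
                    text = str(neighbour).strip()
                    if text:
                        return text
    return None


# Hypothetical table contents, standing in for one extracted invoice table.
sample_rows = [
    ["Exporter", "ACME Foods Ltd.", "Invoice No.", "INV-001"],
    ["Port of Discharge", "", "Hamburg", ""],
]

invoice_no = find_value_next_to(sample_rows, "invoice no")
port = find_value_next_to(sample_rows, "port of discharge")
missing = find_value_next_to(sample_rows, "gross weight")
```

A lookup like this depends entirely on the label text and the cell layout, so it returns `None` as soon as an invoice words a label differently or places the value below the label rather than beside it.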

The problem with this approach is that it only works for PDF invoices in one fixed format. If an invoice uses any other format, or the PDF contains a scanned invoice, this approach will not work.

To solve this problem, we need to detect the values of those particular fields in the PDF.

  • OCR can be used to get the data out of a PDF when it is a scanned one.
  • ML can be used to detect the values of the fields we require.
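One way the two ideas above could fit together is sketched below, under several assumptions. The names `extract_fields`, `read_text_layer`, `run_ocr`, and `detect_fields` are hypothetical, and the three callables are passed in so that any PDF reader, OCR engine, or trained model can be plugged in later. The `min_chars` threshold is an arbitrary guess at how little embedded text indicates a scanned page.

```python
from typing import Callable, Dict, Optional

Fields = Dict[str, Optional[str]]


def extract_fields(
    pdf_path: str,
    read_text_layer: Callable[[str], str],
    run_ocr: Callable[[str], str],
    detect_fields: Callable[[str], Fields],
    min_chars: int = 20,  # assumed threshold, would need tuning on real invoices
) -> Fields:
    """Get text from the PDF, falling back to OCR, then detect field values."""
    text = read_text_layer(pdf_path)
    # Assumption: a scanned PDF has little or no embedded text, so use OCR instead.
    if len(text.strip()) < min_chars:
        text = run_ocr(pdf_path)
    # detect_fields is where a trained model would replace keyword search.
    return detect_fields(text)


# Stand-ins for demonstration only; none of these touch a real PDF or model.
def fake_text_layer(path: str) -> str:
    return ""  # behaves like a scanned PDF with no text layer


def fake_ocr(path: str) -> str:
    return "Invoice No: INV-001\nPort of Discharge: Hamburg"


def fake_detector(text: str) -> Fields:
    fields: Fields = {}
    for line in text.splitlines():
        key, sep, value = line.partition(":")
        if sep:
            fields[key.strip()] = value.strip()
    return fields


result = extract_fields("invoice.pdf", fake_text_layer, fake_ocr, fake_detector)
```

Keeping the OCR step and the field detector behind plain function arguments means the existing camelot path can stay in place for text-based PDFs while the ML part is developed separately.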

Sample PDF Invoices:

The Python code currently in use is at app/src/main/python/camScript.py.

Feel free to ask if you have any other questions.

@sarayusreeyadavpadala

Thank you :-) @Divyansh-Gemini

@sarayusreeyadavpadala

Are the PDFs being uploaded by the user specifically about food export invoices?

@Divyansh-Gemini
Owner Author

Are the PDFs being uploaded by the user specifically about food export invoices?

Yes, the invoices are only for exports of food products.
