PDF text extraction is missing pages #164

Open
1jamesthompson1 opened this issue May 23, 2024 · 0 comments
Labels
bug (Something isn't working), Engine

Comments

@1jamesthompson1
Owner

Problem

Currently, the PDFParser parses each PDF into text, but some pages are missing from the extracted text.

This affects section extraction for #146, for two reasons:

  • Some sections are missed entirely because they are not in the txt file.
  • The section before a missing page runs on until the next higher-level section, because it cannot find the end of its own section (it finds the end of a section by looking for the start of the next section).
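Both failure modes can be reproduced with a minimal sketch. Everything in it is an assumption made for illustration: the dotted heading format ("1.2 Title"), the `HEADING` pattern, and the helper names `next_section_numbers` and `extract_section` are hypothetical and are not taken from the project's code. It only mirrors the behaviour described above, where a section ends at the start of the next section.

```python
import re

# Hypothetical heading format: a dotted section number at the start of a line,
# for example "1.2 Method". The real documents may use a different format.
HEADING = re.compile(r"^(\d+(?:\.\d+)*)\s+\S", re.MULTILINE)


def next_section_numbers(number):
    """Headings that could end `number`: its next sibling, then each ancestor's next sibling."""
    parts = [int(p) for p in number.split(".")]
    candidates = []
    while parts:
        parts[-1] += 1
        candidates.append(".".join(map(str, parts)))
        parts.pop()
    return candidates


def extract_section(text, number):
    """Return the text of section `number`, or None if its heading is not in `text`."""
    starts = {m.group(1): m.start() for m in HEADING.finditer(text)}
    if number not in starts:
        return None
    begin = starts[number]
    for candidate in next_section_numbers(number):
        end = starts.get(candidate)
        if end is not None and end > begin:
            return text[begin:end]
    # No later heading found: the section runs to the end of the text.
    return text[begin:]


# Three made-up pages; page 2 carries the heading for section 1.2.
page1 = "1 Intro\nintro text\n1.1 Scope\nscope text\n"
page2 = "1.2 Method\nmethod text part one\n"
page3 = "method text part two\n2 Findings\nfindings text\n"

full_text = page1 + page2 + page3
text_missing_page2 = page1 + page3  # simulates the extraction dropping page 2

complete = extract_section(full_text, "1.1")
overrun = extract_section(text_missing_page2, "1.1")
lost = extract_section(text_missing_page2, "1.2")
```

With page 2 dropped, `lost` is `None` because the heading for 1.2 never reaches the text, and `overrun` contains the tail of section 1.2 because the search for the end of 1.1 falls through to the heading "2 Findings".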

Ideas and suggestions

Links and references

@1jamesthompson1 added the bug and Engine labels May 23, 2024
@1jamesthompson1 changed the title from "PDF text extraction is missing pagesf" to "PDF text extraction is missing pages" May 23, 2024