Revert "Fix chunk breaks." #11

Merged
merged 1 commit into from
Aug 3, 2015

Conversation

Sicos1977
Owner

Reverts #10

Your change breaks reading PDF files in chunks. With your changes applied, the complete PDF text is returned in one go instead of chunk by chunk.

Sicos1977 pushed a commit that referenced this pull request Aug 3, 2015
@Sicos1977 Sicos1977 merged commit 1082a63 into master Aug 3, 2015
@Sicos1977 Sicos1977 deleted the revert-10-patch-2 branch August 3, 2015 17:12
@Sicos1977
Owner Author

This line already takes into account whether the break is a sentence break:

if (textBuffer[textLength - 1] != ' ' && textBuffer[textLength - 1] != '\n')

@AllTaken
Contributor

AllTaken commented Aug 3, 2015

I'm not sure I understand what you mean?

The issue is that when a chunk follows a paragraph break, no break appears between the two chunks in the output of the Read function, as in the following example.
Input text:

Mary had a little lamb
that was very pretty.

Chunk 1:

  • Text: "Mary had a little lamb"
  • CHUNK_BREAKTYPE: CHUNK_NO_BREAK

Chunk 2:

  • Text: "that was very pretty."
  • CHUNK_BREAKTYPE: CHUNK_EOP (for example)

Result:

"Mary had a little lambthat was very pretty. "

Notice the missing space between "lamb" and "that" and the extraneous space after "pretty."

This erroneous result is very common when processing DOCX files.

I'm not seeing any issues with PDF files after applying my patch.
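
The chunk example above can be reproduced with a minimal, self-contained sketch. It is not the library's code: `BreakType`, `Chunk`, `separator_for`, `join_break_after`, and `join_break_before` are hypothetical names, and mapping every break type other than "no break" to a single space is an assumption made only to match the output shown in this thread. The sketch treats a chunk's break type as describing the break that precedes that chunk, and it reuses the trailing-whitespace guard quoted earlier in the conversation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical stand-in for the CHUNK_BREAKTYPE values named in the thread.
enum class BreakType { NoBreak, EndOfWord, EndOfSentence, EndOfParagraph };

// Hypothetical chunk record: the text plus the break type reported with it.
struct Chunk {
    std::string text;
    BreakType break_type;
};

// Assumption: every break other than NoBreak becomes one space.
std::string separator_for(BreakType b) {
    return b == BreakType::NoBreak ? "" : " ";
}

// Appends the separator AFTER the chunk's text.
// This ordering yields the output reported in the thread.
std::string join_break_after(const std::vector<Chunk>& chunks) {
    std::string out;
    for (const Chunk& c : chunks) {
        out += c.text;
        out += separator_for(c.break_type);
    }
    return out;
}

// Emits the separator BEFORE the chunk's text, because the break type is
// taken to describe the boundary between the previous chunk and this one.
std::string join_break_before(const std::vector<Chunk>& chunks) {
    std::string out;
    for (const Chunk& c : chunks) {
        const std::string sep = separator_for(c.break_type);
        // Same idea as the guard quoted in the thread: add nothing at the
        // start of the output or when it already ends in whitespace.
        const bool needs_sep =
            !out.empty() && out.back() != ' ' && out.back() != '\n';
        if (!sep.empty() && needs_sep) {
            out += sep;
        }
        out += c.text;
    }
    return out;
}

// Runs both variants on the two chunks from the example in the thread.
void check_thread_example() {
    const std::vector<Chunk> chunks = {
        {"Mary had a little lamb", BreakType::NoBreak},
        {"that was very pretty.", BreakType::EndOfParagraph},
    };
    assert(join_break_after(chunks) ==
           "Mary had a little lambthat was very pretty. ");
    assert(join_break_before(chunks) ==
           "Mary had a little lamb that was very pretty.");
}
```

Calling `check_thread_example()` exercises both orderings: the first assertion reproduces the reported string with the missing space and the trailing space, and the second shows the expected string once the separator is placed ahead of the chunk it belongs to.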

@AllTaken AllTaken mentioned this pull request Aug 4, 2015

2 participants