Apostrophe and Double Quote Parsing Issue #86

Closed
rbower54 opened this issue Feb 20, 2017 · 6 comments

Comments

@rbower54

rbower54 commented Feb 20, 2017

Hi,
I'm running into a small issue with TikaOnDotNet when parsing *.doc documents that contain quoted strings and apostrophes, such as:
Robert's famous quote of "I Love TikaOnDotNet" and some more words that follow it.
Parsing produces the following literal result:
Robert famous quote of Love TikaOnDotNetand some more words that follow it.
It strips the apostrophe and the double quotes, along with the characters or spaces that follow them.
Any guidance would be greatly appreciated!
Thanks

@rbower54
Author

rbower54 commented Mar 3, 2017

?

@KevM
Owner

KevM commented Mar 3, 2017

I've been busy with work. Do you think you can write up a failing test? Have you asked the Tika project for assistance?
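
For reference, a failing test for this report might look roughly like the sketch below. It assumes NUnit as the test framework, the `TextExtractor` class from the `TikaOnDotNet.TextExtraction` namespace, and a hypothetical fixture file `files/apostrophe-and-quotes.doc` that contains the sentence from the original report. The file name and test names are illustrative and do not come from this thread.

```csharp
using NUnit.Framework;
using TikaOnDotNet.TextExtraction;

[TestFixture]
public class ApostropheAndQuoteTests
{
    // Hypothetical fixture: a .doc file whose body is the sentence from the report.
    private const string SamplePath = "files/apostrophe-and-quotes.doc";

    [Test]
    public void Extract_keeps_apostrophes_and_double_quotes()
    {
        // Extract the text of the sample document.
        var result = new TextExtractor().Extract(SamplePath);

        // Assumption: the document uses straight quotes. If Word saved
        // typographic (curly) quotes, the expected strings need those characters.
        StringAssert.Contains("Robert's famous quote", result.Text);
        StringAssert.Contains("\"I Love TikaOnDotNet\" and some more words", result.Text);
    }
}
```

With the behavior described in the report, both assertions would be expected to fail, because the extracted text is missing the apostrophe, the quotes, and the characters after them.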

@KevM
Owner

KevM commented Mar 31, 2017

I'm not sure, but I think this may be related to how our content handler handles quotes in the content.

@KevM KevM mentioned this issue Mar 31, 2017
@KevM
Owner

KevM commented Mar 31, 2017

@rbower54 Please take a look at PR #93, which I just posted, and let me know whether it corrects your problem.

@afederici75

Hi KevM, I work with rbower54 and wanted to test this, but we download this component using NuGet from VS. I am not sure how to get a build of the package(s) that includes this fix; the versions VS shows seem rather old (January).

@KevM
Owner

KevM commented Apr 15, 2017

I uploaded a pre-release NuGet package. Go ahead and give it a try.

https://www.nuget.org/packages/TikaOnDotnet.TextExtractor/1.14.2-pre
