Problem with a Uri extraction #84

Closed
HolisticMystic opened this issue Jan 25, 2017 · 4 comments · Fixed by #93

Comments

@HolisticMystic

First, I am very impressed with this project, and it appears to be just what I need. That said, I am having very disappointing results retrieving a text extract from a Uri. I was hoping to get a text version of Google results or Stack Overflow pages but am getting mostly garbage.
I may have to use a different approach for this requirement.
I was able to extract fairly good text from a very simple web page with no script and minimal tags, but it still had a few errors.
Nonetheless, what a great project. Thanks.

@KevM
Owner

KevM commented Jan 25, 2017

Thank you for the nice words. I'm sorry you're having problems. Can you give us a failing test? We can take a look at getting it fixed.

@KevM
Owner

KevM commented Mar 7, 2017

I happened to use the Extract URI overload and I am seeing similar junk in the output. It looks like more needs to be done to make this work as expected.

http://www.programcreek.com/java-api-examples/index.php?api=org.apache.tika.io.TikaInputStream
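
For reference, the call path in question is probably along these lines; a minimal sketch assuming the TextExtractor class and a Uri overload (the namespace and exact signature here are assumptions, not confirmed in this thread):

    using System;
    using TikaOnDotNet.TextExtraction;

    class UriExtractionRepro
    {
        static void Main()
        {
            // Assumed API: TextExtractor.Extract(Uri) returning a TextExtractionResult.
            var extractor = new TextExtractor();
            var result = extractor.Extract(new Uri("https://en.wikipedia.org/wiki/Apache_Tika"));

            // Before the fix discussed below, this output contains garbled text.
            Console.WriteLine(result.Text);
        }
    }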

@KevM
Owner

KevM commented Mar 31, 2017

I rolled back to a commit from before streaming parsing, and URI parsing works correctly there. So the problem seems to be that, for web content, our own ContentHandler does not work as well as the one returned by the ContentHandlerFactory.

    [Test]
    public void should_extract_uri_contents()
    {
        var textExtractionResult = _cut.Extract(metadata =>
        {
            metadata.add("uri", "uri");

            // Download the page ourselves and hand Tika the raw bytes.
            var pageBytes = new WebClient().DownloadData("https://en.wikipedia.org/wiki/Apache_Tika");
            return TikaInputStream.get(pageBytes, metadata);
        });

        textExtractionResult.Text.Should().Contain("Apache Tika is a content detection and analysis framework");
    }

One of the reasons we went to our own content handler was to avoid some weird extra whitespace issues people were running into.
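
For comparison, the factory-based approach looks roughly like this against the IKVM-compiled Tika classes; this is a sketch of assumed wiring, not the project's actual handler code:

    using org.apache.tika.metadata;
    using org.apache.tika.parser;
    using org.apache.tika.sax;

    public static class FactoryHandlerSketch
    {
        // Sketch only: parse a stream to plain text using a handler produced by
        // Tika's BasicContentHandlerFactory instead of a hand-rolled ContentHandler.
        public static string ParseToText(java.io.InputStream stream)
        {
            var parser = new AutoDetectParser();
            var metadata = new Metadata();

            // TEXT handler type with no write limit (-1).
            var factory = new BasicContentHandlerFactory(
                BasicContentHandlerFactory.HANDLER_TYPE.TEXT, -1);
            var handler = factory.getNewContentHandler();

            parser.parse(stream, handler, metadata, new ParseContext());

            // The text handler accumulates the extracted text; toString() returns it.
            return handler.toString();
        }
    }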

KevM added a commit that referenced this issue Mar 31, 2017
- Restored usage of the Tika ContentHandler factory. This may get rid of
  many content issues people have been reporting.
- Using WebClient to download the URI contents, as the Tika URI stream
  getter does not do anything.
- Made StreamExtractor obsolete (see the sketch below): it is no longer used,
  but it is public and I didn't want to make a breaking change.

Closes #84
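
On the StreamExtractor point: marking the type obsolete rather than deleting it keeps the public API intact, roughly like this (the attribute message is illustrative, not the actual text from the commit):

    using System;

    // Callers still compile; they just get a deprecation warning.
    [Obsolete("No longer used; extraction now goes through the Tika content handler factory.")]
    public class StreamExtractor
    {
        // existing members left unchanged
    }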
KevM mentioned this issue Mar 31, 2017
@KevM
Owner

KevM commented Mar 31, 2017

@HolisticMystic Please take a look at PR #93, which I just posted, and let me know if it corrects your problem.

KevM closed this as completed in #93 Apr 22, 2017