
Allowing the parse output to output more than one document #135

Merged
merged 1 commit into master on Jun 12, 2015

Conversation

jorgelbg
Member

@jorgelbg jorgelbg commented Jun 4, 2015

This addresses issue #117 to some extent. It also encapsulates the parsing data/metadata of a URL in a ParseResult. Each URL has exactly one ParseResult, which in turn holds at least one ParseData instance for the data of the "parent" URL. If subdocuments are extracted in any ParseFilter, the URL and ParseData of each subdocument are added to the parent ParseResult. The ParserBolt then emits each ParseData container in the ParseResult as a tuple: at the very least one ParseData, or any number if subdocuments are extracted. This also implies that there is no difference between the information extracted for the parent URL and for any subdocument; each ParseData gets emitted as its own tuple.

I made a simple graph to try to explain this :) I hope it makes things a little clearer than my explanation.
[diagram: storm-parsing]
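The container structure described above can be sketched as a minimal, self-contained illustration. Note that the class names mirror the discussion but the fields and methods are simplified stand-ins, not the actual StormCrawler API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins for the ParseResult/ParseData types discussed above.
public class ParseResultSketch {

    static class ParseData {
        String text;
        Map<String, String> metadata = new LinkedHashMap<>();
    }

    static class ParseResult {
        // key: URL of the parent page or of an extracted subdocument
        private final Map<String, ParseData> data = new LinkedHashMap<>();

        ParseData get(String url) {
            // guarantees at least one ParseData per URL (the "parent" entry)
            return data.computeIfAbsent(url, u -> new ParseData());
        }

        int size() {
            return data.size();
        }
    }

    public static void main(String[] args) {
        ParseResult result = new ParseResult();
        result.get("http://shop.example.com/listing").text = "parent page text";
        // a ParseFilter extracting two products adds them under their own URLs
        result.get("http://shop.example.com/product/1").text = "product 1";
        result.get("http://shop.example.com/product/2").text = "product 2";
        // the ParserBolt would emit one tuple per entry, so 3 tuples here
        System.out.println(result.size());
    }
}
```

The example URLs are made up; the point is simply that the parent and its subdocuments live in one map and each entry becomes a tuple.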

-    private String[] urls = new String[] { "http://www.lequipe.fr/",
-            "http://www.lemonde.fr/", "http://www.bbc.co.uk/",
-            "http://www.facebook.com/", "http://www.rmc.fr" };
+    private String[] urls = new String[] { "http://firefoxmania.uci.cu/" };
Contributor

this doesn't really need to be changed

Member Author

Yep, this slipped through my fingers. I modified it to avoid using a proxy when running the tests; I'll fix this.

@jnioche
Contributor

jnioche commented Jun 5, 2015

Thanks Jorge. Could you please apply our formatting to the code? It can be found at https://github.com/DigitalPebble/storm-crawler/blob/master/eclipse-formatting-profile.xml.

I've just merged #138 so that you can do it with Maven, regardless of which editor you use. Just run mvn java-formatter:format

@@ -0,0 +1,41 @@
package com.digitalpebble.storm.crawler.parse.filter;
Contributor

This file is missing the license header. It is also not very useful in practice, but it is a good illustration of use.
Instead, what about adding it to the parse test package and using it to check that we are getting what we expect?

@jnioche
Contributor

jnioche commented Jun 7, 2015

@GuiForget @jakekdodd do you want to give this a try? It would be useful for processing pages containing multiple products

private String text;
private Metadata metadata;

public static final ParseData empty = new ParseData(Metadata.empty);
Contributor

this does not seem to be used anywhere. Remove?

Member Author

I kept this as a standard way to reference an empty ParseData/ParseResult; right now it is not used and could easily be removed.

Contributor

if it is not used, then let's remove it

@GuiForget

I'm a bit confused by the structure of ParseResult. A ParseResult holds multiple ParseData. So far so good. But what is the key in the map? I see it's a URL, but in which case would the URL be different from that of the containing ParseResult?

@jorgelbg
Member Author

jorgelbg commented Jun 8, 2015

@GuiForget The idea behind the map is that each subdocument extracted from the parent page will have its own identifying URL, which could be a totally different URL (pointing to a detailed product view, for instance), a bookmark in the same parent page, etc.

@jnioche
Contributor

jnioche commented Jun 8, 2015

@GuiForget Imagine for instance that we process an extended RSS feed containing all the info we need for a document. The subdocuments would have their own individual URL and with this mechanism we could bypass fetching + parsing them altogether.
For subdocs which do not have their own URLs, a solution would be to generate a dummy one e.g. with an anchor tag.
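That dummy-URL idea can be sketched in a few lines; the `#subdoc-` naming scheme below is just an assumption for illustration, not what the project actually uses:

```java
public class SubDocUrls {
    // For subdocuments without a URL of their own, derive a unique dummy URL
    // from the parent URL plus an anchor fragment (the scheme is illustrative).
    static String dummyUrl(String parentUrl, int index) {
        return parentUrl + "#subdoc-" + index;
    }

    public static void main(String[] args) {
        // prints http://example.com/feed#subdoc-0
        System.out.println(dummyUrl("http://example.com/feed", 0));
    }
}
```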

@jakekdodd
Contributor

Architecturally, I wouldn't emit each subdocument from the parser bolt. Maybe instead emit the whole ParseResult as a field in the default tuple? That way, a downstream bolt can decide what to do with the results, which might include emitting the subdocuments for individual processing, or it might not.

@jnioche
Contributor

jnioche commented Jun 8, 2015

Hi @jakekdodd, thanks for taking the time to comment.

Maybe instead emit the whole ParseResult as a field in the default tuple?

That would mean having another brand new serializable object to do that and more importantly a substantial change to the contract between bolts.

I expect most users would not use this feature - from my Nutch experience very few people did. I'd rather keep things as simple as possible.

That way, a downstream bolt can decide what to do with the results, which might include emitting the subdocuments for individual processing, or it might not.

You could implement that logic within a subsequent parse filter maybe? Or have a downstream bolt to decide what to do based on the content of the individual tuple?

@jorgelbg
Member Author

jorgelbg commented Jun 9, 2015

I was thinking of rewriting the SubDocumentsParseFilter and the test around the RSS example, parsing each URL into a separate document. I think this would demonstrate more usefully how to work with subdocuments; right now it is a very contrived example. What do you think @jnioche ?

@jnioche
Contributor

jnioche commented Jun 9, 2015

@jorgelbg yes, good idea

@jnioche
Contributor

jnioche commented Jun 10, 2015

Another advantage of this is that ParseFilters can modify the text extracted for the main document. One application would be a ParseFilter taking an XPath expression as a parameter in order to restrict the indexed text to a given HTML element. That is something Nutch users ask for all the time, and it would be a lot more elegant the way it's done here.
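The XPath-restriction idea can be sketched with the JDK's built-in XPath support. This is a standalone illustration over well-formed markup, not the actual filter (which would get its expression from the filter configuration and work on the parsed DOM):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathRestrictSketch {
    // Keep only the text under the element matched by the XPath expression,
    // instead of the full page text.
    static String restrict(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // XPath.evaluate(String, Object) returns the string value of the
        // first matching node
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='nav'>menu</div>"
                + "<div id='content'>Article text</div></body></html>";
        // prints "Article text", skipping the navigation element
        System.out.println(restrict(page, "//div[@id='content']"));
    }
}
```

A real HTML page would first need tag-soup cleaning before DOM parsing; the sketch assumes well-formed input to stay within the standard library.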

@jakekdodd @GuiForget are you happy for me to merge this?

@jorgelbg
Member Author

@jnioche Taking into account that digitalpebble.com doesn't have an RSS feed, can we use something similar based on the sitemap? The general idea is to index each url element of the sitemap.xml file as a separate document. This will keep the examples concise around the digitalpebble.com data and also easy to test.
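The core of that idea is extracting each url/loc entry from the sitemap so every URL can become its own subdocument. A minimal, self-contained sketch (the sitemap content below is made up; the actual filter operates on the fetched file):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SitemapSubDocsSketch {
    // Collect the URL of every <loc> element in a sitemap; each one would
    // get its own ParseData entry in the parent's ParseResult.
    static List<String> extractLocs(String sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(sitemapXml)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String sitemap = "<urlset>"
                + "<url><loc>http://digitalpebble.com/</loc></url>"
                + "<url><loc>http://digitalpebble.com/about.html</loc></url>"
                + "</urlset>";
        // prints both URLs, one future subdocument each
        System.out.println(extractLocs(sitemap));
    }
}
```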

@jorgelbg
Member Author

@jnioche I've committed an updated version of the SubDocumentsParseFilter and the test, working with the sitemap.xml file take a look at it.

}
}
} catch (Exception e) {
e.printStackTrace();
Contributor

yuk. slf4j instead
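The requested fix replaces the stack-trace print with a logger call; in StormCrawler that would go through slf4j (roughly `LOG.error("Error while parsing", e);`). The sketch below makes the same point with java.util.logging so it runs without extra dependencies, and the parse step is purely hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogInsteadOfPrint {
    private static final Logger LOG =
            Logger.getLogger(LogInsteadOfPrint.class.getName());

    // Hypothetical parse step: the point is the catch block, which routes
    // the exception through the logger instead of e.printStackTrace().
    static String safeTrim(String content) {
        try {
            return content.trim();
        } catch (Exception e) {
            // with slf4j this would read: LOG.error("Error while parsing", e);
            LOG.log(Level.SEVERE, "Error while parsing", e);
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(safeTrim("  hello  ")); // hello
        System.out.println(safeTrim(null).isEmpty()); // true: NPE logged, not printed
    }
}
```

Logging through a framework keeps stack traces in the configured log output with level and context, instead of dumping them raw to stderr.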

@jnioche
Contributor

jnioche commented Jun 11, 2015

Looks good. Could you please just fix the issue with the e.printStackTrace(); above then squash your commits into a single one?
Thanks

jnioche added a commit that referenced this pull request Jun 12, 2015
Allowing the parse output to output more than one document
@jnioche jnioche merged commit db000ab into apache:master Jun 12, 2015
@jnioche
Contributor

jnioche commented Jun 12, 2015

Thanks Jorge!

@jnioche jnioche added this to the 0.6 milestone Sep 4, 2015