-
Notifications
You must be signed in to change notification settings - Fork 258
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Allowing the parse output to output more than one document #135
Conversation
private String[] urls = new String[] { "http://www.lequipe.fr/", | ||
"http://www.lemonde.fr/", "http://www.bbc.co.uk/", | ||
"http://www.facebook.com/", "http://www.rmc.fr" }; | ||
private String[] urls = new String[] { "http://firefoxmania.uci.cu/"}; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this doesn't really need to be changed
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yep this slipped through my fingers, I modified to avoid the use a proxy to do the tests, I'll fix this.
Thanks Jorge. Could you please apply our formatting to the code? It can be found at [https://github.com/DigitalPebble/storm-crawler/blob/master/eclipse-formatting-profile.xml]. I've just merged #138 so that you can do it with Maven, regardless of which editor you use. Just run |
@@ -0,0 +1,41 @@ | |||
package com.digitalpebble.storm.crawler.parse.filter; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
This file is missing the license header. It is also not very useful in practice but is a good illustration of use.
Instead, what about adding it to the parse test package and use that for checking that we are getting what we expect?
@GuiForget @jakekdodd do you want to give this a try? It would be useful for processing pages containing multiple products |
private String text; | ||
private Metadata metadata; | ||
|
||
public static final ParseData empty = new ParseData(Metadata.empty); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
this does not seem to be used anywhere. Remove?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Kept this as a standard way to reference an empty ParseData/ParseResult, right now is not used and could be easily removed.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
if it is not used, then let's remove it
I'm a bit confused by the structure of ParseResult. ParseResults holds multiple ParseData. So far so good. But what is the key in the map? I see it's a URL but in which case would the URL be different from the one of the containing ParseResult? |
@GuiForget The idea behind the map is that each subdocument extracted from the parent page will have it's own identifying URL which could be a totally different URL (pointing to a detailed product view, for instance) or a bookmark in the same parent page, etc. |
@GuiForget Imagine for instance that we process an extended RSS feed containing all the info we need for a document. The subdocuments would have their own individual URL and with this mechanism we could bypass fetching + parsing them altogether. |
Architecturally, I wouldn't emit each subdocument from the parser bolt. Maybe instead emit the whole |
Hi @jakekdodd, thanks for taking the time to comment.
That would mean having another brand new serializable object to do that and more importantly a substantial change to the contract between bolts. I expect most users would not use this feature - from my Nutch experience very few people did. I'd rather keep things as simple as possible.
You could implement that logic within a subsequent parse filter maybe? Or have a downstream bolt to decide what to do based on the content of the individual tuple? |
I was thinking of rewriting the |
@jorgelbg yes, good idea |
Another advantage of this is that ParseFilters can modify the text extracted for the main document. One application of this would be to have a PF taking an XPath as parameter in order to restrict the text indexed to a given HTML element. Something that Nutch users ask all the time, would be a lot more elegant the way it's done here. @jakekdodd @GuiForget are you happy for me to merge this? |
@jnioche Taking into account that digitalpebble.com doesn't have a RSS feed can we use a similar to the sitemap? the general idea is to index each |
@jnioche I've committed an updated version of the |
} | ||
} | ||
} catch (Exception e) { | ||
e.printStackTrace(); |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
yuk. slf4j instead
Looks good. Could you please just fix the issue with the |
8a4845f
to
8f16de0
Compare
Allowing the parse output to output more than one document
Thanks Jorge! |
This address the #117 issue at some extent this also encapsulate in a
ParseResult
the parsing data/metadata of a URL. Each URL has only oneParseResult
which at the same time has at least 1ParseData
instance to hold the data of the "parent" URL in case subdocuments are extracted in anyParseFilter
the URLs and ParseData of each subdocument is added to the parent ParseResult. So theParserBolt
emits eachParseData
container in theParseResult
as a tuple. So at the very least 1ParseData
, or any number if subdocuments are extracted, also this implies that there is no difference between the extracted information about the parent URL or any subdocument eachParseData
gets emitted as a tuple.I made some simple graph to try to explain this :) I hope it makes things a little more clearer than my explanation.