
Allowing the parse output to output more than one document #135

Merged
merged 1 commit into master on Jun 12, 2015

Conversation

jorgelbg
Member

@jorgelbg jorgelbg commented Jun 4, 2015

This addresses issue #117 to some extent. It also encapsulates the parsing data/metadata of a URL in a ParseResult. Each URL has exactly one ParseResult, which in turn holds at least one ParseData instance for the data of the "parent" URL. If subdocuments are extracted in any ParseFilter, the URL and ParseData of each subdocument are added to the parent ParseResult. The ParserBolt then emits each ParseData container in the ParseResult as a tuple: at the very least one ParseData, or any number if subdocuments are extracted. This also implies that there is no difference between the information extracted for the parent URL and for any subdocument; each ParseData gets emitted as its own tuple.

I made a simple graph to try to explain this :) I hope it makes things a little clearer than my explanation.
[diagram: storm-parsing]
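The container structure described above can be sketched as a minimal, self-contained illustration. Note that the class names mirror the discussion but the fields and methods are simplified stand-ins, not the actual StormCrawler API:

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Simplified stand-ins for the ParseResult/ParseData types discussed above.
public class ParseResultSketch {

    static class ParseData {
        String text;
        Map<String, String> metadata = new LinkedHashMap<>();
    }

    static class ParseResult {
        // key: URL of the parent page or of an extracted subdocument
        private final Map<String, ParseData> data = new LinkedHashMap<>();

        ParseData get(String url) {
            // guarantees at least one ParseData per URL (the "parent" entry)
            return data.computeIfAbsent(url, u -> new ParseData());
        }

        int size() {
            return data.size();
        }
    }

    public static void main(String[] args) {
        ParseResult result = new ParseResult();
        result.get("http://shop.example.com/listing").text = "parent page text";
        // a ParseFilter extracting two products adds them under their own URLs
        result.get("http://shop.example.com/product/1").text = "product 1";
        result.get("http://shop.example.com/product/2").text = "product 2";
        // the ParserBolt would emit one tuple per entry, so 3 tuples here
        System.out.println(result.size());
    }
}
```

The example URLs are made up; the point is simply that the parent and its subdocuments live in one map and each entry becomes a tuple.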

-    private String[] urls = new String[] { "http://www.lequipe.fr/",
-            "http://www.lemonde.fr/", "http://www.bbc.co.uk/",
-            "http://www.facebook.com/", "http://www.rmc.fr" };
+    private String[] urls = new String[] { "http://firefoxmania.uci.cu/" };
Contributor

this doesn't really need to be changed

Member Author

Yep, this slipped through my fingers. I modified it to avoid using a proxy when running the tests; I'll fix this.

@jnioche
Contributor

jnioche commented Jun 5, 2015

Thanks Jorge. Could you please apply our formatting to the code? It can be found at https://github.com/DigitalPebble/storm-crawler/blob/master/eclipse-formatting-profile.xml.

I've just merged #138 so that you can do it with Maven, regardless of which editor you use. Just run mvn java-formatter:format

@@ -0,0 +1,41 @@
package com.digitalpebble.storm.crawler.parse.filter;
Contributor

This file is missing the license header. It is also not very useful in practice, but it is a good illustration of use.
Instead, what about adding it to the parse test package and using it to check that we are getting what we expect?

@jnioche
Contributor

jnioche commented Jun 7, 2015

@GuiForget @jakekdodd do you want to give this a try? It would be useful for processing pages containing multiple products

private String text;
private Metadata metadata;

public static final ParseData empty = new ParseData(Metadata.empty);
Contributor

this does not seem to be used anywhere. Remove?

Member Author

I kept this as a standard way to reference an empty ParseData/ParseResult; right now it is not used and could easily be removed.

Contributor

if it is not used, then let's remove it

@GuiForget

I'm a bit confused by the structure of ParseResult. A ParseResult holds multiple ParseData. So far so good. But what is the key in the map? I see it's a URL, but in which case would the URL be different from that of the containing ParseResult?

@jorgelbg
Member Author

jorgelbg commented Jun 8, 2015

@GuiForget The idea behind the map is that each subdocument extracted from the parent page will have its own identifying URL, which could be a totally different URL (pointing to a detailed product view, for instance), a bookmark in the same parent page, etc.

@jnioche
Contributor

jnioche commented Jun 8, 2015

@GuiForget Imagine for instance that we process an extended RSS feed containing all the info we need for a document. The subdocuments would have their own individual URL and with this mechanism we could bypass fetching + parsing them altogether.
For subdocs which do not have their own URLs, a solution would be to generate a dummy one e.g. with an anchor tag.
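That dummy-URL idea can be sketched in a few lines; the `#subdoc-` naming scheme below is just an assumption for illustration, not what the project actually uses:

```java
public class SubDocUrls {
    // For subdocuments without a URL of their own, derive a unique dummy URL
    // from the parent URL plus an anchor fragment (the scheme is illustrative).
    static String dummyUrl(String parentUrl, int index) {
        return parentUrl + "#subdoc-" + index;
    }

    public static void main(String[] args) {
        // prints http://example.com/feed#subdoc-0
        System.out.println(dummyUrl("http://example.com/feed", 0));
    }
}
```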

@jakekdodd
Contributor

Architecturally, I wouldn't emit each subdocument from the parser bolt. Maybe instead emit the whole ParseResult as a field in the default tuple? That way, a downstream bolt can decide what to do with the results, which might include emitting the subdocuments for individual processing, or it might not.

@jnioche
Contributor

jnioche commented Jun 8, 2015

Hi @jakekdodd, thanks for taking the time to comment.

Maybe instead emit the whole ParseResult as a field in the default tuple?

That would mean having another brand new serializable object to do that and more importantly a substantial change to the contract between bolts.

I expect most users would not use this feature - from my Nutch experience very few people did. I'd rather keep things as simple as possible.

That way, a downstream bolt can decide what to do with the results, which might include emitting the subdocuments for individual processing, or it might not.

You could implement that logic within a subsequent parse filter maybe? Or have a downstream bolt to decide what to do based on the content of the individual tuple?

@jorgelbg
Member Author

jorgelbg commented Jun 9, 2015

I was thinking of rewriting the SubDocumentsParseFilter and the test around the RSS example, parsing each URL into a separate document. I think this would demonstrate more usefully how to work with subdocuments; right now it is a very contrived example. What do you think @jnioche ?

@jnioche
Contributor

jnioche commented Jun 9, 2015

@jorgelbg yes, good idea

@jnioche
Contributor

jnioche commented Jun 10, 2015

Another advantage of this is that ParseFilters can modify the text extracted for the main document. One application would be a ParseFilter taking an XPath expression as a parameter in order to restrict the indexed text to a given HTML element. That is something Nutch users ask for all the time, and it would be a lot more elegant the way it's done here.
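The XPath-restriction idea can be sketched with the JDK's built-in XPath support. This is a standalone illustration over well-formed markup, not the actual filter (which would get its expression from the filter configuration and work on the parsed DOM):

```java
import java.io.StringReader;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPathFactory;

import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathRestrictSketch {
    // Keep only the text under the element matched by the XPath expression,
    // instead of the full page text.
    static String restrict(String xml, String xpath) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(xml)));
        // XPath.evaluate(String, Object) returns the string value of the
        // first matching node
        return XPathFactory.newInstance().newXPath().evaluate(xpath, doc).trim();
    }

    public static void main(String[] args) throws Exception {
        String page = "<html><body><div id='nav'>menu</div>"
                + "<div id='content'>Article text</div></body></html>";
        // prints "Article text", skipping the navigation element
        System.out.println(restrict(page, "//div[@id='content']"));
    }
}
```

A real HTML page would first need tag-soup cleaning before DOM parsing; the sketch assumes well-formed input to stay within the standard library.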

@jakekdodd @GuiForget are you happy for me to merge this?

@jorgelbg
Member Author

@jnioche Taking into account that digitalpebble.com doesn't have an RSS feed, can we use something similar based on the sitemap? The general idea is to index each url element of the sitemap.xml file as a separate document. This will keep the examples concise around the digitalpebble.com data and also easy to test.
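The core of that idea is extracting each url/loc entry from the sitemap so every URL can become its own subdocument. A minimal, self-contained sketch (the sitemap content below is made up; the actual filter operates on the fetched file):

```java
import java.io.StringReader;
import java.util.ArrayList;
import java.util.List;

import javax.xml.parsers.DocumentBuilderFactory;

import org.w3c.dom.Document;
import org.w3c.dom.NodeList;
import org.xml.sax.InputSource;

public class SitemapSubDocsSketch {
    // Collect the URL of every <loc> element in a sitemap; each one would
    // get its own ParseData entry in the parent's ParseResult.
    static List<String> extractLocs(String sitemapXml) throws Exception {
        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(sitemapXml)));
        NodeList locs = doc.getElementsByTagName("loc");
        List<String> urls = new ArrayList<>();
        for (int i = 0; i < locs.getLength(); i++) {
            urls.add(locs.item(i).getTextContent().trim());
        }
        return urls;
    }

    public static void main(String[] args) throws Exception {
        String sitemap = "<urlset>"
                + "<url><loc>http://digitalpebble.com/</loc></url>"
                + "<url><loc>http://digitalpebble.com/about.html</loc></url>"
                + "</urlset>";
        // prints both URLs, one future subdocument each
        System.out.println(extractLocs(sitemap));
    }
}
```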

@jorgelbg
Member Author

@jnioche I've committed an updated version of the SubDocumentsParseFilter and the test, working with the sitemap.xml file take a look at it.

}
}
} catch (Exception e) {
e.printStackTrace();
Contributor

yuk. slf4j instead
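The requested fix replaces the stack-trace print with a logger call; in StormCrawler that would go through slf4j (roughly `LOG.error("Error while parsing", e);`). The sketch below makes the same point with java.util.logging so it runs without extra dependencies, and the parse step is purely hypothetical:

```java
import java.util.logging.Level;
import java.util.logging.Logger;

public class LogInsteadOfPrint {
    private static final Logger LOG =
            Logger.getLogger(LogInsteadOfPrint.class.getName());

    // Hypothetical parse step: the point is the catch block, which routes
    // the exception through the logger instead of e.printStackTrace().
    static String safeTrim(String content) {
        try {
            return content.trim();
        } catch (Exception e) {
            // with slf4j this would read: LOG.error("Error while parsing", e);
            LOG.log(Level.SEVERE, "Error while parsing", e);
            return "";
        }
    }

    public static void main(String[] args) {
        System.out.println(safeTrim("  hello  ")); // hello
        System.out.println(safeTrim(null).isEmpty()); // true: NPE logged, not printed
    }
}
```

Logging through a framework keeps stack traces in the configured log output with level and context, instead of dumping them raw to stderr.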

@jnioche
Contributor

jnioche commented Jun 11, 2015

Looks good. Could you please just fix the issue with the e.printStackTrace(); above then squash your commits into a single one?
Thanks

jnioche added a commit that referenced this pull request Jun 12, 2015
Allowing the parse output to output more than one document
@jnioche jnioche merged commit db000ab into apache:master Jun 12, 2015
@jnioche
Contributor

jnioche commented Jun 12, 2015

Thanks Jorge!

@jnioche jnioche added this to the 0.6 milestone Sep 4, 2015