Thoughts on an article body algorithm

Andreas Madsen edited this page May 21, 2013 · 2 revisions

After reading some articles on the matter, a text/tag (or word/tag) density over all elements seems like a good start. But since comments and the like could also score well here, it is not the whole answer. Next is the meta distance I used with the headers; there the comments won't score well, but it is expensive, and a meta description is not necessarily present.
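To make the density idea concrete, here is a minimal sketch, assuming a simplified node shape (`{ tag, text, children }`) rather than a real parsed DOM; `textLength`, `tagCount`, and `density` are illustrative names, not part of the project.

```javascript
// Density of a node: how much text it contains relative to how many tags.
// Article bodies tend to have long text runs per tag; menus and comment
// widgets tend to have many tags with little text each.
function textLength(node) {
  // Total character count of all text beneath (and on) this node.
  const own = node.text ? node.text.length : 0;
  return own + (node.children || []).reduce((sum, c) => sum + textLength(c), 0);
}

function tagCount(node) {
  // Number of element tags beneath (and including) this node.
  return 1 + (node.children || []).reduce((sum, c) => sum + tagCount(c), 0);
}

function density(node) {
  return textLength(node) / tagCount(node);
}

// Example: an article-like node scores higher than a link-heavy menu.
const article = { tag: 'div', children: [
  { tag: 'p', text: 'A long paragraph with plenty of article text in it.' }
] };
const menu = { tag: 'ul', children: [
  { tag: 'li', text: 'Home' },
  { tag: 'li', text: 'About' }
] };
```

With these toy trees, `density(article)` comes out well above `density(menu)`, which is exactly the signal the algorithm sorts on.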

Calculate the text/tag density on all elements in a tree-compatible way. Then store each density with a node reference in a list and sort that list by density. Select the best node and calculate its meta distance; this becomes the new likelihood value for that node. If there is a node that is more likely according to its density, calculate the meta distance for that node too, and continue this way until the temporarily best node is found.
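The selection loop above can be sketched as follows. This assumes the meta-distance adjustment can only lower a node's density-based score (so iteration in density order can stop early once no remaining density can beat the best adjusted score); `selectBestNode` and `metaDistance` are hypothetical names introduced for illustration.

```javascript
// Pick the most likely article node: sort candidates by density (best first),
// then compute the expensive meta distance only for the few nodes whose raw
// density could still beat the current best adjusted score.
function selectBestNode(nodes, density, metaDistance) {
  const sorted = nodes.slice().sort((a, b) => density(b) - density(a));

  let best = null;
  let bestScore = -Infinity;
  for (const node of sorted) {
    // Assumption: metaDistance(node) <= density(node), so once even the raw
    // density falls below the best adjusted score, no later node can win.
    if (density(node) <= bestScore) break;
    const score = metaDistance(node); // adjusted likelihood for this node
    if (score > bestScore) {
      bestScore = score;
      best = node;
    }
  }
  return best;
}

// Toy example: node 'a' has the best density, but after the meta-distance
// adjustment node 'b' wins; node 'c' is never scored because its density
// cannot beat the current best.
const nodes = ['a', 'b', 'c'];
const dens = { a: 10, b: 8, c: 2 };
const meta = { a: 5, b: 7, c: 1 };
const winner = selectBestNode(nodes, (n) => dens[n], (n) => meta[n]);
```

The point of the early break is that the meta distance is only evaluated for the handful of top-density candidates instead of the whole tree.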

The good thing about this algorithm is that it is mostly based on text/tag density, which is a good indicator and, if I have thought this through correctly, not very expensive. It also uses the meta distance in the likelihood calculation, but only for a subset of the nodes. The bad thing is that the meta description might relate very badly to the article, which could cause the meta distance to be calculated on all elements and also add noise.

Update: I can't believe I didn't think of it, but obviously it is impossible to calculate the meta distance this way, since it is an adjusted value, based on knowledge of all elements.