O(n!) processing in tag name/path for Paragraph in dedupe code #27

Open
tfmorris opened this issue Apr 3, 2016 · 2 comments

tfmorris commented Apr 3, 2016

Processing this segment:

s3://aws-publicdatasets/common-crawl/crawl-data/CC-MAIN-2015-27/segments/1435375093899.18/warc/CC-MAIN-20150627031813-00201-ip-10-179-60-89.ec2.internal.warc.gz

stalls somewhere between records 7k and 8k, when it hits a deeply nested tag structure that triggers O(n!) behavior (in tree depth) in Paragraph.getPath(Node).

The document is pathological in that it's nested many thousands of levels deep, but it causes the entire segment to fail when the mapper gets killed.
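
For illustration only, here is a minimal sketch of building a dot-separated tag path by walking the ancestors once, so that the cost stays linear in the length of the path even for very deep trees. `SketchNode` and `TagPathSketch` are hypothetical stand-ins, not the jsoup `Node` class or the project's actual `Paragraph.getPath(Node)`, and the path format (ancestor tag names joined by dots) is an assumption.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a DOM node; not the jsoup or dkpro-c4corpus API. */
final class SketchNode {
    final String name;        // tag name, e.g. "div"
    final SketchNode parent;  // null for the root

    SketchNode(String name, SketchNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

final class TagPathSketch {
    /**
     * Returns the ancestor tag names from the root down to the given node,
     * joined by dots (e.g. "html.body.div").
     */
    static String path(SketchNode node) {
        Deque<String> names = new ArrayDeque<>();
        // Walk up iteratively: no recursion, so depth cannot overflow the stack.
        for (SketchNode n = node; n != null; n = n.parent) {
            names.addFirst(n.name);  // root ends up at the front
        }
        // A single join avoids re-copying a growing string at every level.
        return String.join(".", names);
    }
}
```

Note that the resulting string still grows with tree depth, so this addresses processing time only, not memory.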

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 3, 2016
@habernal habernal added this to the 1.0.1 milestone Apr 4, 2016

habernal commented Apr 4, 2016

Many thanks, Tom!

Ideally, it should be tested on the benchmark data for boilerplate removal to make sure it delivers the same results.

tfmorris commented Apr 4, 2016

The fix needs more work: although it resolves the processing-time issue, it can still exhaust the heap in a memory-constrained environment like a Hadoop cluster. I'm testing a revised version that doesn't keep the entire string of tag names, since that string doesn't appear to be used anywhere.
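
As a rough sketch of the idea (not the code in the referenced commits), one way to bound memory is to keep only the nearest few tag names instead of the full path. `AncestorNode`, `BoundedTagPathSketch`, and the `maxSegments` parameter are all hypothetical names introduced for this example; with `maxSegments` set to 1, only the innermost tag name is retained.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/** Hypothetical stand-in for a DOM node; not the jsoup or dkpro-c4corpus API. */
final class AncestorNode {
    final String name;          // tag name, e.g. "p"
    final AncestorNode parent;  // null for the root

    AncestorNode(String name, AncestorNode parent) {
        this.name = name;
        this.parent = parent;
    }
}

final class BoundedTagPathSketch {
    /**
     * Returns at most maxSegments tag names, nearest ancestors only,
     * joined by dots and ordered outermost to innermost.
     * Memory use is bounded by maxSegments, regardless of tree depth.
     */
    static String tailPath(AncestorNode node, int maxSegments) {
        if (maxSegments < 1) {
            throw new IllegalArgumentException("maxSegments must be >= 1");
        }
        Deque<String> names = new ArrayDeque<>(maxSegments);
        AncestorNode n = node;
        // Stop climbing as soon as enough segments have been collected.
        while (n != null && names.size() < maxSegments) {
            names.addFirst(n.name);
            n = n.parent;
        }
        return String.join(".", names);
    }
}
```

Whether a truncated path is acceptable depends on whether anything downstream compares full paths, which is the kind of thing the benchmark comparison mentioned above would need to confirm.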

I don't see any tests in the dkpro-c4corpus-boilerplate sub-project. How does one run the tests you are describing?

tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 9, 2016
Also moves all initialization into constructor and simplifies it.
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 10, 2016
Also moves all initialization into constructor and simplifies it.
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 13, 2016
Also moves all initialization into constructor and simplifies it.
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 15, 2016
Also moves all initialization into constructor and simplifies it.
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Apr 28, 2016
Also moves all initialization into constructor and simplifies it.
tfmorris added a commit to tfmorris/dkpro-c4corpus that referenced this issue Jun 12, 2020
Also moves all initialization into constructor and simplifies it.