
maxUrls config not honored #25

Open
FiV0 opened this issue Mar 7, 2021 · 6 comments
Comments

@FiV0

FiV0 commented Mar 7, 2021

I have tried the crawler and everything runs fine, except that the maxUrls parameter does not seem to be honored. Admittedly, I set it to a rather low value of 10K. Is there something I am missing?

@FiV0 FiV0 changed the title maxUrls value not honored maxUrls config not honored Mar 7, 2021
@boldip
Member

boldip commented Mar 8, 2021

It should honor it. How did you set the maxUrlsPerSchemeAuthority parameter?

@FiV0
Author

FiV0 commented Mar 8, 2021

@boldip I left maxUrlsPerSchemeAuthority at 1000 and used around 700 seed URLs. I just tried once more with maxUrlsPerSchemeAuthority set to 1 and only 10 seeds. After a while I stopped the crawler and inspected the number of records in the created store.warc.gz; in both cases it had more than 10K records.

@vigna
Member

vigna commented Mar 9, 2021

OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

Can you send us the property file, a complete log at INFO level, and the list of crawled URLs from a crawl of this kind?

@vigna
Member

vigna commented Mar 9, 2021

It would also be important to know how many of the records are duplicates, as duplicate records do not count toward the maxUrls limit. You can find a count of the duplicates in the logs, or you can use the WARC tools to scan the store and count the non-duplicate items.
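Lacking BUbiNG's own WARC classes at hand, a quick-and-dirty count of response records can be sketched by scanning the header lines of the gzipped store. This is a sketch, not the recommended tooling: it assumes the headers read cleanly as ISO-8859-1 text and that GZIPInputStream copes with the store's concatenated gzip members; filtering out the duplicate records would additionally require checking whatever duplicate marker BUbiNG writes, which is omitted here.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Sketch: count WARC "response" records in a (possibly multi-member)
// gzipped store by scanning for "WARC-Type: response" header lines.
public class CountResponses {
    static long count(InputStream gzipped) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1));
        long n = 0;
        for (String line; (line = r.readLine()) != null; )
            if (line.startsWith("WARC-Type: response")) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(count(in)); // e.g. java CountResponses store.warc.gz
        }
    }
}
```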

@vigna
Member

vigna commented Mar 9, 2021

Er... it's a bit embarrassing, but we just realized that at some point we deleted the code that was performing the check and never reinstated it. So you're entirely right—presently, maxUrls is not honored. We'll fix it soon.

@FiV0
Author

FiV0 commented Mar 9, 2021

> OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

My last comment above was probably misleading. I didn't expect the change to maxUrlsPerSchemeAuthority to have any effect on the number of sites crawled; I just wanted to mention that even setting it to a low value of 1 does not stop the crawl at 10K URLs.

My current understanding (assuming everything works) from what I gathered above is:

  • maxUrlsPerSchemeAuthority - the maximum number of URLs crawled per scheme + authority. Setting this to 1 means at most one URL of the form http://example.com/some/path will be crawled, but https://example.com/some/path or http://subdomain.example.com/some/path could still get crawled, since they differ in scheme or authority.
  • maxUrls - the maximum total number of URLs crawled, excluding duplicates; so if http://example.com and https://example.com return the same response, they count only once toward this value.
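Assuming the first bullet is right, the scheme + authority distinction can be illustrated with plain java.net.URI. This is a hypothetical helper for illustration, not BUbiNG's actual code:

```java
import java.net.URI;

// Hypothetical sketch of a "scheme+authority" key: URLs sharing this key
// would share one maxUrlsPerSchemeAuthority budget, while URLs differing
// in scheme or in host (including subdomains) would not.
public class SchemeAuthority {
    static String key(String url) {
        final URI u = URI.create(url);
        return u.getScheme() + "://" + u.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(key("http://example.com/some/path"));           // http://example.com
        System.out.println(key("https://example.com/some/path"));          // https://example.com
        System.out.println(key("http://subdomain.example.com/some/path")); // http://subdomain.example.com
    }
}
```

With maxUrlsPerSchemeAuthority=1, at most one URL per distinct key above would be crawled.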

> So you're entirely right—presently, maxUrls is not honored. We'll fix it soon.

That's awesome.

My current config:

rootDir=extra/bubing-crawl
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=1024
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.gb\,.com\,.org\,.us\,.io\,.me) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=10k
bloomFilterPrecision=1E-8
seed=file:extra/bubing_seed.txt
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+https://finnvolkel.com/)
userAgentFrom=my email (redacted)
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
