
maxUrls config not honored #25

Open
FiV0 opened this issue Mar 7, 2021 · 6 comments
Comments

@FiV0

FiV0 commented Mar 7, 2021

I have tried the crawler and everything runs fine, except that the maxUrls parameter does not seem to be honored. Admittedly, I set it to a rather low value of 10K. Is there something I am missing?

@FiV0 FiV0 changed the title maxUrls value not honored maxUrls config not honored Mar 7, 2021
@boldip
Member

boldip commented Mar 8, 2021

It should honor it. How did you set the maxUrlsPerSchemeAuthority parameter?

@FiV0
Author

FiV0 commented Mar 8, 2021

@boldip I left maxUrlsPerSchemeAuthority at 1000 and used around 700 seed URLs. I just tried once more with maxUrlsPerSchemeAuthority set to 1 and only 10 seeds. After a while I stopped the crawler and inspected the number of records in the created store.warc.gz; in both cases it had more than 10K records.

@vigna
Member

vigna commented Mar 9, 2021

OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

Can you send us the property file, a complete log at INFO level, and the list of crawled URLs from a crawl of this kind?

@vigna
Member

vigna commented Mar 9, 2021

It would also be important to know how many of the records are duplicates, as duplicate records do not count toward the maxUrls limit. You can find a count of the duplicates in the logs, or you can use the WARC tools to scan the store and count the non-duplicate items.
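Lacking BUbiNG's own WARC classes at hand, a quick-and-dirty count of response records can be sketched by scanning the header lines of the gzipped store. This is a sketch, not the recommended tooling: it assumes the headers read cleanly as ISO-8859-1 text and that GZIPInputStream copes with the store's concatenated gzip members; filtering out the duplicate records would additionally require checking whatever duplicate marker BUbiNG writes, which is omitted here.

```java
import java.io.*;
import java.nio.charset.StandardCharsets;
import java.util.zip.GZIPInputStream;

// Sketch: count WARC "response" records in a (possibly multi-member)
// gzipped store by scanning for "WARC-Type: response" header lines.
public class CountResponses {
    static long count(InputStream gzipped) throws IOException {
        BufferedReader r = new BufferedReader(new InputStreamReader(
                new GZIPInputStream(gzipped), StandardCharsets.ISO_8859_1));
        long n = 0;
        for (String line; (line = r.readLine()) != null; )
            if (line.startsWith("WARC-Type: response")) n++;
        return n;
    }

    public static void main(String[] args) throws IOException {
        try (InputStream in = new FileInputStream(args[0])) {
            System.out.println(count(in)); // e.g. java CountResponses store.warc.gz
        }
    }
}
```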

@vigna
Member

vigna commented Mar 9, 2021

Er... it's a bit embarrassing, but we just realized that at some point we deleted the code that was performing the check and never reinstated it. So you're entirely right—presently, maxUrls is not honored. We'll fix it soon.

@FiV0
Author

FiV0 commented Mar 9, 2021

> OK, but this is not the meaning of the parameter maxUrlsPerSchemeAuthority. If you reach more than 10000 sites, you'll get more than 10000 records. There's nothing wrong with that.

My last comment above was probably misleading. I didn't expect the change to maxUrlsPerSchemeAuthority to have any effect on the number of sites crawled; I just wanted to mention that even setting it to a low value of 1 does not stop the crawl at 10K URLs.

My current understanding (assuming everything works) from what I gathered above is:

  • maxUrlsPerSchemeAuthority - the maximum number of URLs crawled per scheme + authority. Setting this to 1 means at most one URL of the form http://example.com/some/path will be crawled, but https://example.com/some/path or http://subdomain.example.com/some/path could still get crawled, since they differ in scheme or authority.
  • maxUrls - the maximum total number of URLs crawled, excluding duplicates; so if http://example.com and https://example.com return the same response, they count only once toward this value.
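Assuming the first bullet is right, the scheme + authority distinction can be illustrated with plain java.net.URI. This is a hypothetical helper for illustration, not BUbiNG's actual code:

```java
import java.net.URI;

// Hypothetical sketch of a "scheme+authority" key: URLs sharing this key
// would share one maxUrlsPerSchemeAuthority budget, while URLs differing
// in scheme or in host (including subdomains) would not.
public class SchemeAuthority {
    static String key(String url) {
        final URI u = URI.create(url);
        return u.getScheme() + "://" + u.getAuthority();
    }

    public static void main(String[] args) {
        System.out.println(key("http://example.com/some/path"));           // http://example.com
        System.out.println(key("https://example.com/some/path"));          // https://example.com
        System.out.println(key("http://subdomain.example.com/some/path")); // http://subdomain.example.com
    }
}
```

With maxUrlsPerSchemeAuthority=1, at most one URL per distinct key above would be crawled.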

> So you're entirely right—presently, maxUrls is not honored. We'll fix it soon.

That's awesome.

My current config:

rootDir=extra/bubing-crawl
maxUrlsPerSchemeAuthority=1000
parsingThreads=64
dnsThreads=50
fetchingThreads=1024
fetchFilter=true
scheduleFilter=( SchemeEquals(http) or SchemeEquals(https) ) and HostEndsWithOneOf(.gb\,.com\,.org\,.us\,.io\,.me) and not PathEndsWithOneOf(.axd\,.xls\,.rar\,.sflb\,.tmb\,.pdf\,.js\,.swf\,.rss\,.kml\,.m4v\,.tif\,.avi\,.iso\,.mov\,.ppt\,.bib\,.docx\,.css\,.fits\,.png\,.gif\,.jpg\,.jpeg\,.ico\,.doc\,.wmv\,.mp3\,.mp4\,.aac\,.ogg\,.wma\,.gz\,.bz2\,.Z\,.z\,.zip) and URLShorterThan(2048) and DuplicateSegmentsLessThan(3)
followFilter=true
parseFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
storeFilter=( ContentTypeStartsWith(text/) or PathEndsWithOneOf(.html\,.htm\,.txt) ) and not IsProbablyBinary()
schemeAuthorityDelay=10s
ipDelay=2s
maxUrls=10k
bloomFilterPrecision=1E-8
seed=file:extra/bubing_seed.txt
socketTimeout=60s
connectionTimeout=60s
fetchDataBufferByteSize=200K
cookiePolicy=compatibility
cookieMaxByteSize=2000
userAgent=BUbiNG (+https://finnvolkel.com/)
userAgentFrom=my email (redacted)
robotsExpiration=1h
responseBodyMaxByteSize=2M
digestAlgorithm=MurmurHash3
startPaused=false
workbenchMaxByteSize=512Mi
urlCacheMaxByteSize=1Gi
sieveSize=64Mi
parserSpec=HTMLParser(MurmurHash3)
