
Question about the size of the 'state' directory #533

Answered by anjackson
cgr71ii asked this question in Q&A

Hi @cgr71ii,

From the information you gave, I can see your crawler currently has 166,045,099 URLs queued for download. This data, also called the crawl frontier, is what is taking up most of the crawler state folder.

If I've got my maths right, that works out at 113 GB / 166,045,099 ≈ 680 bytes per URL. This seems pretty reasonable to me, given that various bits of metadata are held along with each URL.
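
For reference, the same back-of-the-envelope calculation in Python, treating GB as decimal gigabytes:

```python
# Rough check of the figure above (decimal gigabytes assumed).
state_dir_bytes = 113 * 10**9   # ~113 GB of crawler state on disk
queued_urls = 166_045_099       # URLs currently queued in the frontier

bytes_per_url = state_dir_bytes / queued_urls
print(bytes_per_url)            # ≈ 680.5, i.e. roughly 680 bytes of frontier state per queued URL
```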

So, yes, this is what I'd expect to see, given the size of your crawl frontier.

Note that if checkpointing is being used, the state folders just keep growing, because all previous versions of the frontier are kept by default. In that situation you can delete older checkpoints manually.
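
If you do go down that route, something along these lines could prune all but the most recent checkpoints. This is only a sketch: it assumes each checkpoint is a separate directory under the job's checkpoints folder whose name sorts chronologically, and the path and the "keep 2" policy are made up for illustration. Only run it while the crawler isn't writing a new checkpoint.

```python
import shutil
from pathlib import Path

# Hypothetical layout: one directory per checkpoint, names sorting in
# chronological order (adjust the path and pattern for your own job).
CHECKPOINTS_DIR = Path("/opt/heritrix/jobs/my-crawl/checkpoints")
KEEP_LATEST = 2  # number of most recent checkpoints to keep

checkpoints = sorted(p for p in CHECKPOINTS_DIR.iterdir() if p.is_dir())
for old in checkpoints[:-KEEP_LATEST]:
    print(f"Removing old checkpoint: {old}")
    shutil.rmtree(old)
```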

HTH,
Andy…

This discussion was converted from issue #498 on September 30, 2022 02:45.