Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add checksum parameter to Document #793

Merged
merged 4 commits into from Jan 3, 2016
Merged

Add checksum parameter to Document #793

merged 4 commits into from Jan 3, 2016

Conversation

RubenVerborgh
Copy link
Contributor

As suggested in #790, this pull request allows a Document to specify a pre-calculated checksum. This gives data sources the ability to determine a checksum for the items and layouts they create.
This pull request:

  • adds a checksum parameter to the initializer of Document (as well as new_item and new_layout)
  • modifies Checksummer to use the checksum on Item and Layout if present
  • modifies the filesystem datasource to set the size/modification time as checksum

This realizes a major compilation speed gain for my website (from 11s to 3s if nothing is changed), and paves the way for lazy loading of items, which could bring down compilation time even further.

@RubenVerborgh
Copy link
Contributor Author

Note that 289f6f3 is redundant after 8c72d51; however, I left it in for comparison (especially with regard to the tests).

@RubenVerborgh
Copy link
Contributor Author

Also, the code to generate a checksum for a file is kind of a duplicate of the code of the checksummer for pathnames. I did not want to create a dependency between a datasource and the checksummer; while this might be okay for the default filesystem datasource, external datasources would not want to depend on the internal checksummer. (Would also be an argument to have pathnames calculate their own checksum, but we've discussed that already 😉)

@denisdefreyne
Copy link
Member

This looks good! Some remarks:

  • I’m not in favour of using the file size and mtime to detect modifications. It’s not accurate enough—it’s trivial to generate two different files with the same checksum. The content and attributes (as strings) are already loaded, and generating a checksum from a string is quite fast (SHA-1 is intended to be fast). For binary items, this isn’t as straightforward, and file size + mtime will have to do.

    I’m not saying no to using mtime at all, but for a checksum, I don’t think it suffices. (Idea: only calculate a full checksum when the mtime appears to indicate that the item is outdated—but I believe such a change is out of scope for this PR and would require quite a bit more (architectural) rework.)

  • The checksummer is part of a private API, but I have no issues with the filesystem data source using that API. It is part of the private API because I don’t want to expose it outside of Nanoc (and have to maintain it as part of the public API).

@RubenVerborgh
Copy link
Contributor Author

  • I wasn't too sure either, hence I left 289f6f3 in. I'll adjust the range of this pull request to not include 8c72d51 anymore.
    • Regarding “content already loaded”: this pull request would actually avoid the need to load the content, and this would be a nice next step. Content only needs to be loaded when the checksum is different.
  • Alright, but it's not necessary anymore then if the mtime is not used.

@denisdefreyne
Copy link
Member

Content only needs to be loaded when the checksum is different.

Not entirely correct: the checksum being different is a necessary but not sufficient condition for content to be loaded. For instance, an item (whose checksum is identical) that includes content from another item (whose checksum is different) will need to be recompiled even though its checksum hasn’t changed.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.

@denisdefreyne denisdefreyne added this to the 4.2 milestone Jan 3, 2016
@@ -249,7 +251,7 @@ def parse(content_filename, meta_filename, _kind)
content = pieces[4]

# Done
[meta, content]
[meta, content, meta_raw]
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don’t think meta_raw is used.

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

nm—it is!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Can you update the #parse documentation to reflect the changed return value?

@denisdefreyne
Copy link
Member

👍 apart from comments

@RubenVerborgh
Copy link
Contributor Author

Content only needs to be loaded when the checksum is different.

Not entirely correct: the checksum being different is a necessary but not sufficient condition for content to be loaded.

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

@denisdefreyne
Copy link
Member

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

Yup.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

I have a few ideas regarding optimisations like these, and I’d rather write down these thoughts in a proper RFC first. (Can’t seem to find the time, though.)

@RubenVerborgh
Copy link
Contributor Author

Okay, keep me updated when that RFC goes out 😄

Finishing this PR now.

@RubenVerborgh
Copy link
Contributor Author

All done.

@denisdefreyne
Copy link
Member

Looks good, thanks!

denisdefreyne added a commit that referenced this pull request Jan 3, 2016
@denisdefreyne denisdefreyne merged commit d47c197 into nanoc:master Jan 3, 2016
@denisdefreyne
Copy link
Member

Some follow-up discussion:

  • Using mtime+file size as an initial checksum, and only calculating the full checksum if necessary (and only then loading content) is pointless, because when the mtime/file size checksum is different, the item will need to be recompiled and that means loading the content anyway. Additionally, the full checksum will need to be calculated for this item anyway, because it needs to be stored so that it can be used in future compilations. (The previous checksum could be reused rather than recalculated, but I’m unsure whether the speedup would be large enough to make it worth optimise this.)
  • Not loading content, or lazy-loading content, made much more sense in previous versions of Ruby, but since Ruby as of recently has a generational GC, the performance gains of keeping memory usage low are not nearly as big anymore.

@RubenVerborgh
Copy link
Contributor Author

Lazy loading still makes sense, not for memory reasons, but to avoid the overhead of parsing the file and its metadata. The checksum can be made without parsing – only when the checksum is different, the file actually needs to be parsed.

@denisdefreyne
Copy link
Member

I’d argue that in most cases, parsing the metadata doesn’t impose enough of an overhead to try to avoid doing it—but I haven’t profiled this extensively to be certain of that.

@RubenVerborgh
Copy link
Contributor Author

I can give you a quick number: a site with a BibTeX datasource (± 150 items in my use case), compilation time was 12s before #790 and #793. Now it is down to 3s. Skipping field loading for the 150 items brings compilation time further down to 2s.

@denisdefreyne
Copy link
Member

This is the “nothing changed” case, correct?

@RubenVerborgh
Copy link
Contributor Author

Yes.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

2 participants