Add checksum parameter to Document #793

RubenVerborgh · 2016-01-01T16:06:02Z

As suggested in #790, this pull request allows a Document to specify a pre-calculated checksum. This gives data sources the ability to determine a checksum for the items and layouts they create.
This pull request:

adds a checksum parameter to the initializer of Document (as well as new_item and new_layout)
modifies Checksummer to use the checksum on Item and Layout if present
modifies the filesystem datasource to set the size/modification time as checksum

This realizes a major compilation speed gain for my website (from 11s to 3s if nothing is changed), and paves the way for lazy loading of items, which could bring down compilation time even further.

RubenVerborgh · 2016-01-01T16:06:44Z

Note that 289f6f3 is redundant after 8c72d51; however, I left it in for comparison (especially with regard to the tests).

RubenVerborgh · 2016-01-01T17:55:17Z

Also, the code to generate a checksum for a file is kind of a duplicate of the code of the checksummer for pathnames. I did not want to create a dependency between a datasource and the checksummer; while this might be okay for the default filesystem datasource, external datasources would not want to depend on the internal checksummer. (Would also be an argument to have pathnames calculate their own checksum, but we've discussed that already 😉)

denisdefreyne · 2016-01-02T12:50:06Z

This looks good! Some remarks:

I’m not in favour of using the file size and mtime to detect modifications. It’s not accurate enough—it’s trivial to generate two different files with the same checksum. The content and attributes (as strings) are already loaded, and generating a checksum from a string is quite fast (SHA-1 is intended to be fast). For binary items, this isn’t as straightforward, and file size + mtime will have to do.

I’m not saying no to using mtime at all, but for a checksum, I don’t think it suffices. (Idea: only calculate a full checksum when the mtime appears to indicate that the item is outdated—but I believe such a change is out of scope for this PR and would require quite a bit more (architectural) rework.)
The checksummer is part of a private API, but I have no issues with the filesystem data source using that API. It is part of the private API because I don’t want to expose it outside of Nanoc (and have to maintain it as part of the public API).

RubenVerborgh · 2016-01-02T14:47:54Z

I wasn't too sure either, hence I left 289f6f3 in. I'll adjust the range of this pull request to not include 8c72d51 anymore.
- Regarding “content already loaded”: this pull request would actually avoid the need to load the content, and this would be a nice next step. Content only needs to be loaded when the checksum is different.
Alright, but it's not necessary anymore then if the mtime is not used.

denisdefreyne · 2016-01-03T10:14:16Z

Content only needs to be loaded when the checksum is different.

Not entirely correct: the checksum being different is a necessary but not sufficient condition for content to be loaded. For instance, an item (whose checksum is identical) that includes content from another item (whose checksum is different) will need to be recompiled even though its checksum hasn’t changed.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.

denisdefreyne · 2016-01-03T10:15:02Z

lib/nanoc/data_sources/filesystem.rb

@@ -249,7 +251,7 @@ def parse(content_filename, meta_filename, _kind)
      content = pieces[4]

      # Done
-      [meta, content]
+      [meta, content, meta_raw]


I don’t think meta_raw is used.

nm—it is!

Can you update the #parse documentation to reflect the changed return value?

denisdefreyne · 2016-01-03T10:17:00Z

👍 apart from comments

RubenVerborgh · 2016-01-03T10:22:38Z

Content only needs to be loaded when the checksum is different.

Not entirely correct: the checksum being different is a necessary but not sufficient condition for content to be loaded.

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

The idea of only loading data when necessary is nonetheless quite interesting and is something I plan to tackle in Nanoc at some point in the future.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

denisdefreyne · 2016-01-03T10:30:54Z

Yeah, I cut a corner in writing that down. What I meant is: content only needs to be loaded when a reason is found to recompile, since determining that reason no longer requires the item contents.

Yup.

I think it could be as simple as allowing new_item to accept a block and a checksum, but I might be oversimplifying. Can I make an issue to discuss that?

I have a few ideas regarding optimisations like these, and I’d rather write down these thoughts in a proper RFC first. (Can’t seem to find the time, though.)

RubenVerborgh · 2016-01-03T10:32:48Z

Okay, keep me updated when that RFC goes out 😄

Finishing this PR now.

RubenVerborgh · 2016-01-03T10:59:38Z

All done.

denisdefreyne · 2016-01-03T11:10:58Z

Looks good, thanks!

Add checksum parameter to Document

denisdefreyne · 2016-01-03T13:24:04Z

Some follow-up discussion:

Using mtime+file size as an initial checksum, and only calculating the full checksum if necessary (and only then loading content) is pointless, because when the mtime/file size checksum is different, the item will need to be recompiled and that means loading the content anyway. Additionally, the full checksum will need to be calculated for this item anyway, because it needs to be stored so that it can be used in future compilations. (The previous checksum could be reused rather than recalculated, but I’m unsure whether the speedup would be large enough to make it worth optimise this.)
Not loading content, or lazy-loading content, made much more sense in previous versions of Ruby, but since Ruby as of recently has a generational GC, the performance gains of keeping memory usage low are not nearly as big anymore.

RubenVerborgh · 2016-01-03T15:52:43Z

Lazy loading still makes sense, not for memory reasons, but to avoid the overhead of parsing the file and its metadata. The checksum can be made without parsing – only when the checksum is different, the file actually needs to be parsed.

denisdefreyne · 2016-01-03T16:01:48Z

I’d argue that in most cases, parsing the metadata doesn’t impose enough of an overhead to try to avoid doing it—but I haven’t profiled this extensively to be certain of that.

RubenVerborgh · 2016-01-03T16:03:59Z

I can give you a quick number: a site with a BibTeX datasource (± 150 items in my use case), compilation time was 12s before #790 and #793. Now it is down to 3s. Skipping field loading for the 150 items brings compilation time further down to 2s.

denisdefreyne · 2016-01-03T16:04:44Z

This is the “nothing changed” case, correct?

RubenVerborgh · 2016-01-03T16:04:58Z

Yes.

denisdefreyne added the feature label Jan 3, 2016

denisdefreyne added this to the 4.2 milestone Jan 3, 2016

denisdefreyne reviewed Jan 3, 2016
View reviewed changes

RubenVerborgh added 4 commits January 3, 2016 11:39

Add checksum attribute to Document.

f343125

Allow checksum in new_item and new_layout.

eb3dc55

Use Item and Layout checksum data when available.

95fb240

Use file contents as checksum data.

c8ea817

denisdefreyne added a commit that referenced this pull request Jan 3, 2016

Merge pull request #793 from RubenVerborgh/feature-checksum

d47c197

Add checksum parameter to Document

denisdefreyne merged commit d47c197 into nanoc:master Jan 3, 2016

RubenVerborgh mentioned this pull request Jan 3, 2016

Speed up checksum calculation for items #790

Closed

This was referenced Jan 3, 2016

RFC: lazy-load items nanoc/rfcs#3

Open

RFC: let data source find objects nanoc/rfcs#1

Open

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add checksum parameter to Document #793

Add checksum parameter to Document #793

RubenVerborgh commented Jan 1, 2016

RubenVerborgh commented Jan 1, 2016

RubenVerborgh commented Jan 1, 2016

denisdefreyne commented Jan 2, 2016

RubenVerborgh commented Jan 2, 2016

denisdefreyne commented Jan 3, 2016

denisdefreyne Jan 3, 2016

denisdefreyne Jan 3, 2016

denisdefreyne Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

Add checksum parameter to Document #793

Add checksum parameter to Document #793

Conversation

RubenVerborgh commented Jan 1, 2016

RubenVerborgh commented Jan 1, 2016

RubenVerborgh commented Jan 1, 2016

denisdefreyne commented Jan 2, 2016

RubenVerborgh commented Jan 2, 2016

denisdefreyne commented Jan 3, 2016

denisdefreyne Jan 3, 2016

Choose a reason for hiding this comment

denisdefreyne Jan 3, 2016

Choose a reason for hiding this comment

denisdefreyne Jan 3, 2016

Choose a reason for hiding this comment

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016

denisdefreyne commented Jan 3, 2016

RubenVerborgh commented Jan 3, 2016