RFC: lazy-load items #3

Open
wants to merge 3 commits into
base: master

Conversation

denisdefreyne
Member

@denisdefreyne denisdefreyne commented Jan 3, 2016

rendered view

The implementation of this RFC depends on not having a preprocessor, or having a preprocessor whose effects can be analysed. Work on such a preprocessor is pending (see #7).

@denisdefreyne
Member Author

Question: does it make sense to lazy-load layouts? I don’t think it does, given that a site will have a handful of layouts at most.

@denisdefreyne
Member Author

CC @RubenVerborgh — this is the RFC you are looking for (still quite WIP though). The idea you brought up is described in the “alternatives” section.

@RubenVerborgh

Would it make sense to make compilation speed part of the motivation? It makes a major difference for my use case (BibTeX datasource). Not tested yet with other datasources.

> the content and attributes for each item needs to be loaded at some point anyway, in order for the checksum to be calculated.

The content needs to be loaded indeed, but not the attributes since nanoc/nanoc#793.

@denisdefreyne
Member Author

I’d welcome a PR to make attribute loading lazy! My idea would be to allow attributes to be a lambda that evaluates to the attributes.
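
For illustration, the "attributes as a lambda" idea could look roughly like the sketch below. `SketchItem` is a hypothetical stand-in written for this example, not Nanoc's actual item class, and the memoization strategy is an assumption rather than a description of how Nanoc does or would implement it.

```ruby
require 'yaml'

# Hypothetical stand-in for an item; NOT Nanoc's real Item class.
class SketchItem
  attr_reader :identifier

  # `attributes` may be a Hash, or any callable (lambda/proc) returning a Hash.
  def initialize(identifier, attributes)
    @identifier = identifier
    @attributes = attributes
  end

  # Evaluate the callable on first access and keep the resulting Hash,
  # so the (potentially expensive) parsing happens at most once.
  def attributes
    @attributes = @attributes.call if @attributes.respond_to?(:call)
    @attributes
  end
end

raw_yaml = "title: Hello\nauthor: Someone\n"

# The YAML is only parsed if something actually asks for the attributes.
item = SketchItem.new('/hello', -> { YAML.safe_load(raw_yaml) })
puts item.attributes['title']
```

Items whose attributes are never read would skip the YAML parsing entirely, which is the overhead the lazy approach is meant to avoid.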

@RubenVerborgh

That would indeed be cool. Would it make sense to do the same with content in the same pass?

@denisdefreyne
Member Author

Hmm, not sure. The content will be loaded anyway (for the checksum) so lazy-loading it likely won’t make a difference. I’d say “no” and go for attributes only (to eliminate the YAML parsing overhead).

@RubenVerborgh

Okay, I'll proceed with attributes.

I do have a use case for lazy content though: the content for BibTeX datasource items is generated after parsing. I don't want to keep on nagging about my own project of course 😉, but it's just a reminder that not all datasources are filesystem.

For the sake of discussion, here is a brief sketch of how the BibTeX datasource works:
– input: a folder with .bib files, each of which contains multiple entries (100+ entries are not an exception)
– as checksum data, the file contents are used (no parsing needed)
– to determine the item identifiers, the file has to be split
– to determine the contents (= .bib entry with special markup removed), the entry has to be parsed
– to determine the attributes, the parsed fields have to be unescaped

The two last steps take 33% of my compilation time.
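
The steps above can be sketched as follows. This is a toy illustration, not the actual BibTeX datasource: `LazyEntry`, `split_entries`, and `items_from_bib` are hypothetical names, the regex-based splitting and field extraction are drastic simplifications of real BibTeX parsing, and unescaping is left out altogether.

```ruby
require 'digest'

# Hypothetical lazy item: content and attributes are computed on first use.
class LazyEntry
  attr_reader :identifier, :checksum

  def initialize(identifier, raw, checksum)
    @identifier = identifier
    @raw = raw
    @checksum = checksum
  end

  def attributes
    parsed[:attributes]
  end

  def content
    parsed[:content]
  end

  private

  # The expensive steps (parsing the entry, extracting fields) are deferred
  # until content or attributes are requested, then memoized.
  def parsed
    @parsed ||= begin
      # Toy field extraction: matches `name = {value}` without nested braces.
      fields = @raw.scan(/(\w+)\s*=\s*\{([^{}]*)\}/).to_h
      { attributes: fields, content: @raw.strip }
    end
  end
end

# Cheap split into raw entries; assumes each entry starts with "@" at the
# beginning of a line.
def split_entries(source)
  source.split(/^(?=@)/).map(&:strip).reject(&:empty?)
end

def items_from_bib(source)
  # Checksum data is the raw file contents, so no parsing is needed for it.
  checksum = Digest::SHA1.hexdigest(source)
  split_entries(source).map do |raw|
    key = raw[/\A@\w+\{\s*([^,\s]+)/, 1]
    LazyEntry.new("/#{key}", raw, checksum)
  end
end
```

With this shape, identifiers and checksums are available right after the split, while the two expensive steps only run for entries whose content or attributes are actually read.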

@denisdefreyne
Member Author

In that case, it makes sense for content to be (optionally) lazy too. 👍 for lazifying both content and attributes.

@RubenVerborgh RubenVerborgh mentioned this pull request Jan 3, 2016
2 tasks
@connorshea

connorshea commented Nov 30, 2016

Is there any progress on this? I'm currently working on adding "versions" to GitLab's documentation website and it's causing some issues with compile time (e.g. going from 4 unique sets of documentation to 8 is causing the compile time to grow from 4 minutes to 13), I suspect in part due to the problems described here.

Running nanoc compile has a single Ruby process using 2.4GB of RAM and it takes a few minutes before any pages actually start being created.

tmp/compiled_content is 93.7MB. Site compiled in 789.85s.

@connorshea

Hm, upon further testing it seems that much of the time is taken because Nanoc is comparing the older version of the site to see what it should recompile.

For our repo:
master (no public/, no items in tmp/): 61.70s
add-versions (public/, items in tmp/): 789.85s
add-versions (no public/, no items in tmp/): 382.10s

This doesn't really explain why it's taking an unholy amount of time on CI though (add-versions is 15 minutes vs. 4 minutes for master), since nothing is cached there. Perhaps it's hitting a RAM limit? I'll also need to check how much of that time goes into pulling down repos.

@denisdefreyne
Member Author

@connorshea Are you on the latest version of Nanoc? There have been some performance issues recently, which have all been fixed. (Ironically, a recent optimisation made things far worse in terms of compilation speed.)

@connorshea

@ddfreyne was on 4.4.2 for master, 4.4.0 for add-versions, I'll test again with 4.4.2 on add-versions.

One thing I just realized about the test that took 789.85s is that the public and tmp folders had content from master, which has a fairly dissimilar structure from add-versions (each directory under content has subdirectories for each respective version), so the compiler was likely confused by the completely different structure of the content when I switched over. Hence the discrepancy in the compile times.

@denisdefreyne
Member Author

As for the state of the RFC: it’s work in progress and blocked by having a redesigned preprocessor. The preprocessor as it stands now is a bottleneck to future optimisations in terms of CPU and RAM usage, as it’s a black box where anything can happen, and its effects cannot be analysed. This means that unless the preprocessor is replaced with something smarter, Nanoc will have to keep loading all items into memory.

@connorshea

383.98s on 4.4.2, so the same.

@denisdefreyne
Member Author

@connorshea Interesting… Nanoc 4.3.8 fixed the last known performance issue, so it might be related to having more content in the add-versions branch. Is there a place where I can check out this repository?

@connorshea

connorshea commented Nov 30, 2016

@ddfreyne Yup: https://gitlab.com/gitlab-com/gitlab-docs/

It's a bit odd in that it doesn't have most of the content in content/ initially. You need to run the Rake task (rake pull_repos) which shallow clones the repositories in the tmp/ directory and then copies the content from the doc/ directory in each repo into their respective folders in content/. It's a bit convoluted I suppose, but it works pretty well all things considered.

Here's pretty much what you need to run to test this:

git clone https://gitlab.com/gitlab-com/gitlab-docs.git
cd gitlab-docs
bundle install
rake pull_repos
nanoc compile
git checkout add-versions
# RAKE_FORCE_DELETE deletes the content directories since the Rake task doesn't currently overwrite them,
# I've never had it do anything wrong but if you're paranoid you can just delete the `tmp/ce/`, `content/ce/`, `tmp/ee/`, `content/ee/`,
# `tmp/omnibus/`, `content/omnibus/`, `tmp/runner/`, and `content/runner/` directories manually.
RAKE_FORCE_DELETE=true rake pull_repos

@denisdefreyne
Member Author

@connorshea I’d prefer not to pollute the discussion of this PR here. Shall we move the conversation to the Google group or Gitter?

@connorshea

@ddfreyne of course, here's a Google Group thread: https://groups.google.com/forum/#!topic/nanoc/4iLK826kO7A


3 participants