
Repo size issue: consider using git submodules #1554

Closed
mfirry opened this Issue Apr 22, 2015 · 17 comments

Comments

@mfirry
Contributor

mfirry commented Apr 22, 2015

I'm thinking the frameworks part of your repo could contain 'just' submodules, one for each language, so that people could clone/push only the part for the language they need instead of the whole thing (which is now almost 180 MB).

@hamiltont


Contributor

hamiltont commented Apr 22, 2015

Please see #1050, where we go into the repo size in detail. Here are the summary points:

  • This repo should never contain source code for any frameworks. If you find cases where that's not true, it's a historical accident that we would love PRs to address.
  • Most of the size of our current repo is the history, locked in the git pack objects.
  • Most of that history is 3-4 MB "helper" binaries that were committed here at some point in the past.

This sums up to an annoyingly inescapable conclusion: rewriting the history for this project won't help reduce size, and submodules won't help reduce size either. The only thing that will is moving to a clean repo that does not carry the long history this project has.

Going to close this as addressed, but happy to continue chatting if you disagree or have other ideas

@hamiltont hamiltont closed this Apr 22, 2015

@mfirry


Contributor

mfirry commented Apr 22, 2015

Thanks for your answer.
Just one thing, rephrasing your "submodules won't help reduce size" a little: submodules will not reduce the size of THIS repo, but one could, theoretically, fork/clone only the Scala repo, add his/her brand-new-shiny-scala-framework, and push that.

@mfirry


Contributor

mfirry commented Apr 22, 2015

One more thing (since I've also just PR-ed something): you state "this repo should never contain source code for any frameworks", but I've read your contribution guidelines a few times and totally missed that point (it might just be me, of course).
That said, what should this repo contain, if not source code?

@msmith-techempower


Member

msmith-techempower commented Apr 22, 2015

What he meant there is that a given test should only contain the source code required to spin up that implementation, but NOT the framework's source code.

@mfirry


Contributor

mfirry commented Apr 22, 2015

Ouch! Ok yes... thanks @msmith-techempower

@msmith-techempower


Member

msmith-techempower commented Apr 22, 2015

Sure thing. I think we had examples of issues like CakePHP (which is quite large) being included entirely in the test's directory instead of being treated as a requirement of the test itself.

Today, we have a system where dependencies can be specified, downloaded, and built in the environment automatically, so a test directory costs only a few KB; that approach is preferred.
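[Editor's note] The download-at-setup pattern described here can be sketched generically. This is not TFB's actual dependency system, just an illustration of the idea; the `fetch_framework` helper and the CakePHP URL in the comment are hypothetical:

```shell
# fetch_framework URL DEST -- download a framework release tarball
# and unpack it under DEST, instead of committing its source tree.
fetch_framework() {
  url=$1
  dest=$2
  mkdir -p "$dest"
  curl -sSL "$url" | tar -xz -C "$dest"
}

# A test's setup step would then run something like (version made up):
#   fetch_framework "https://github.com/cakephp/cakephp/archive/2.6.0.tar.gz" vendor
```

The test directory then carries only this small script plus the app's own code; the framework arrives at environment-setup time.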

@mfirry


Contributor

mfirry commented Apr 22, 2015

Yeah totally... I didn't even think of adding framework code, that's why I didn't get it at all :)

@hamiltont


Contributor

hamiltont commented Apr 22, 2015

Ah yea, poor clarity

For what it's worth, one "hack" is to use git's --depth option if you're willing to work without the full history - we've made huge strides in reducing the current repo size (e.g. we've slimmed it down by ~250MB since R9, I think!?). It's still not trivial: a git clone --depth 10 will still take about 200MB on disk, so we still have work to do in finding unneeded framework source code and/or binary files and converting them into downloads (e.g. frameworks/CSharp is 65MB, so there is something fishy going on there).
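[Editor's note] To make the --depth behavior concrete, here is a self-contained sketch you can run locally. It builds a throwaway repo rather than cloning the real FrameworkBenchmarks URL, which would be cloned the same way:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Build a throwaway repo with 5 commits of history.
git init -q origin-repo
cd origin-repo
git config user.email "you@example.com"
git config user.name "you"
for i in 1 2 3 4 5; do
  echo "$i" > file.txt
  git add file.txt
  git commit -qm "commit $i"
done
cd ..

# A shallow clone fetches only the most recent commit(s),
# skipping the pack objects that hold the old history.
git clone -q --depth 1 "file://$tmp/origin-repo" shallow-clone
cd shallow-clone
git rev-list --count HEAD   # prints 1, not 5
```

The same `--depth` flag against the real repo skips most of the pack-object history hamiltont describes, which is why it shrinks the on-disk size even though the tip checkout stays the same.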

@ghost


ghost commented Oct 26, 2015

How much easier and more pleasant it would be to maintain the framework benchmarks if each lived in its own repo. Just imagine: you develop your benchmark as a normal app in its own repo, using Travis and some kind of microservice provided by TechEmpower to validate that the app works as expected. No need to install Ruby, Vagrant, or VirtualBox, or to figure out why VirtualBox 4.10 doesn't mount directories, etc.

That's it. TechEmpower's main script could then clone the repos and run the apps to measure whatever you want. But one repo for everything? That's terrible, whether or not the frameworks' sources are included.

@hamiltont


Contributor

hamiltont commented Oct 26, 2015

And then TechEmpower's main script could clone the repos and run the apps to measure whatever you want

That's actually exactly what our main script does 😄. I recommend you read the thread above this point for clarification on a few of the points you raised, and I invite you to comment on #1050.

@ghost


ghost commented Oct 27, 2015

@hamiltont Here is a random PR that adds Goji. An app that could be tested in a minute takes 39 minutes. Is rerunning the tests of 129 apps, when only 1 is affected by the change, really a straightforward solution? And every test installs all available databases just in case some of them will be used (though only one is required per test). Is my understanding correct?

@hamiltont


Contributor

hamiltont commented Oct 28, 2015

Is it a straightforward solution to rerun tests of 129 apps when only 1 is affected by change?

It's the best option we have, but feel free to prove me wrong with a PR! Running ~129 instead of 1 is a Travis-CI limitation, and one that we have discussed with Travis many times (both via GitHub and telephone conferences). They are being incredibly generous with their computing time, so we're in no position to complain. There are a bunch of long threads in our issues (filter by travis) where we discuss how we integrate with Travis and potential improvements.

FYI, once we enabled Travis, pull request merge time went from months to days, so while it may seem slow it has been a massive improvement up to now. If Travis ever enables the use of sudo in their docker-based execution environment, you can expect another order-of-magnitude improvement in build execution times, as we won't have to wait for the unneeded VMs to launch. While it is technically possible for someone to use the Travis requests API and submit each job independently, 1) that's a pretty decent undertaking, and no one has stepped up, and 2) it's not really the way the requests API is intended to be used, and would likely break nice things like the badges on GitHub, GitHub's links to Travis builds, and perhaps even the history page on travis-ci.

You are correct that we install all databases for each test; see here for a location that could be updated - a PR would be great! It might save up to ~45 seconds per test that actually needs to run, so in the worst case (which currently consumes ~21 hours) of needing to test every framework, that would recover about 45 s × 125 ≈ 90 minutes.

@ghost


ghost commented Oct 29, 2015

@hamiltont I'll try to illustrate what I meant in my first message. Very likely I'm missing something, but I'm really curious: could the following approach work for your project?

  • Test apps are distributed across repositories. E.g.: TechEmpower/revel-jet, TechEmpower/php-raw, TechEmpower/goji-raw, etc. I.e. one app per repo.
  • Every app has its own .travis.yml that may look as follows:
install:
  # The script below installs PostgreSQL and makes all necessary
  # data available as environment variables.
  - curl https://techempower.github.io/scripts/PREPARE_POSTGRES.sh | sh

  # Now $PGQL_NAME, $PGQL_PASS, $PGQL_PORT, etc. environment
  # variables are available and may be used by the app.

  - ./this/app's/BUILD.sh        # Script that describes the building process of the app.
  - ./this/app's/START_DAEMON.sh # Script for running the app.

  # The line below tests the app and at the end kills it.
  - curl https://techempower.github.io/scripts/START_TESTS.sh | sh
  • To benchmark the apps, all of them are cloned one-by-one, installed, and started (BUILD.sh and START_DAEMON.sh scripts from the step above could be used).
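[Editor's note] That last step could be sketched roughly as follows, under the assumptions of this comment: each app repo ships the BUILD.sh and START_DAEMON.sh scripts described above. The `run_benchmarks` helper and the repo names are hypothetical:

```shell
# run_benchmarks BASE REPO... -- clone each app repo from BASE and
# run its build/start scripts; the measurement step is elided.
run_benchmarks() {
  base=$1
  shift
  for repo in "$@"; do
    app=${repo##*/}               # e.g. TechEmpower/goji-raw -> goji-raw
    git clone -q "$base/$repo" "$app"
    (
      cd "$app"
      ./BUILD.sh                  # build the app
      ./START_DAEMON.sh           # start it
    )
    # ...benchmark the running app here, then shut it down...
  done
}

# Against GitHub it would be invoked roughly as:
#   run_benchmarks https://github.com TechEmpower/revel-jet TechEmpower/goji-raw
```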

This approach looks the most natural to me. No need to do anything unusual with the Travis API, and it becomes possible to test only the things that changed. What is the motivation for using a single repo for all the apps instead, and for trying to defeat Travis when it gets in your way because of that?

@hamiltont


Contributor

hamiltont commented Oct 29, 2015

@alkchr This is a really interesting idea. Previously the TFB repo was pretty unorganized so this would have been impossible, but now it's more realistic.

I would probably choose to have one TechEmpower/some-repo per programming language, not per project. Having >100 individual repositories sounds quite tedious for any administrative tasks, while ~10 is manageable. Travis will still be a little slower than it is for a traditional one-project-per-repository, but much faster than our current ~100-projects-per-repository.

Here are my thoughts:

  • Much faster travis-CI results
  • Much lower travis-CI resource consumption
  • A nice self-organization where people that "care about" python-related tools can post issues in a dedicated repository. This might increase community involvement, as we would no longer be spamming subscribers with unwanted notifications about some language they have never heard of
  • Problems with changing the 'core toolset'. All framework code has an implicit dependency on the code underneath toolset and config, so a modification in the "core toolset" repository would require triggering Travis builds in all programming-language-specific repositories. Currently Travis-CI has no way to automate this process. Clever use of git submodules might make this easier, or perhaps someone has another idea
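[Editor's note] The submodule wiring hinted at in the last bullet could look like the following self-contained sketch. It uses throwaway local repos as stand-ins; the real per-language repo names (e.g. TechEmpower/frameworks-python) are hypothetical:

```shell
set -e
tmp=$(mktemp -d)
cd "$tmp"

# Stand-ins for hypothetical per-language repos.
for lang in python go; do
  git init -q "frameworks-$lang"
  (
    cd "frameworks-$lang"
    git config user.email "you@example.com"
    git config user.name "you"
    git commit -qm "initial" --allow-empty
  )
done

# The top-level repo tracks each language repo as a submodule,
# pinned to a specific commit of that repo.
git init -q FrameworkBenchmarks
cd FrameworkBenchmarks
git config user.email "you@example.com"
git config user.name "you"
git -c protocol.file.allow=always \
  submodule add "file://$tmp/frameworks-python" frameworks/Python
git -c protocol.file.allow=always \
  submodule add "file://$tmp/frameworks-go" frameworks/Go
git commit -qm "Track per-language repos as submodules"

# .gitmodules records the mapping; a contributor who only cares
# about Go would clone the (tiny) top-level repo and then run:
#   git submodule update --init frameworks/Go
cat .gitmodules
```

Note the pinning property: the top-level repo records exact submodule commits, so a toolset change would still need each language repo bumped (and its CI re-run), which is the coordination problem raised above.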

My main concern is the core toolset issue. We've consistently had to make changes that end up affecting frameworks in unexpected ways, so automated testing of the entire set of frameworks is critical. Input welcome...

@msmith-techempower @LadyMozzarella thoughts?

PS - I'm well aware that our travis-ci today is having serious issues testing anything. There are bugs due to many small changes (we depend on 10s of external projects just to get Travis-CI up and running for a basic test), and no one seems to have had the time to put in a PR.

@hamiltont


Contributor

hamiltont commented Oct 29, 2015

Oh, @bhauer too. I forgot who to tag - it's been a while ;-)

@msmith-techempower


Member

msmith-techempower commented Oct 30, 2015

This is a long thread that I honestly ignored at first, since submodules have been annoying to me in the past, so please excuse me for trying to summarize and probably getting it wrong.

It sounds like what is being suggested here is similar to something we did internally with our Gemini framework recently: update the current project to be FrameworkBenchmarks and add several submodules (one per language).

I am definitely not the guy to ping on this matter, and Brian is probably going to balk at it as well. I would love to get the input of ... crap. I thought Michael Hixson had a GitHub account on here, but he either does not or his handle is different than I expect. Regardless, I shall simply send him an email directly.

@michaelhixson


Member

michaelhixson commented Nov 13, 2015

What we did with Gemini involved Maven modules, not Git submodules.

(They are sort of related, in that we used Git submodules in Gemini-based projects to point at Gemini. We had a bad time with Git submodules. Then they became totally unnecessary once we Maven'd it up and started hosting actual versioned releases of Gemini internally.)

For the framework benchmarks, Git submodules might be a good idea. It makes sense. How much Git history would we lose? Any?
