Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo #4989

Closed
jlubeck opened this Issue Mar 7, 2016 · 72 comments

@jlubeck
jlubeck commented Mar 7, 2016

Note from @orta -

If you are here because your Specs repo isn't updating, run: cd ~/.cocoapods/repos/master && git fetch --depth=2147483647 - this will convert your local repository of Podspecs to be a full clone, as opposed to a shallow copy.


What did you do?

Run pod setup

What did you expect to happen?

Clone Spec repo master

What happened instead?

It only downloads a few bytes and then throws error:

fatal: unable to access 'https://github.com/CocoaPods/Specs.git/': transfer closed with outstanding read data remaining

Podfile

No Podfile yet

I also tried cloning the repo manually or with the GitHub desktop app, to no avail.
I'm having no issues cloning any other repo on GitHub, only this one. Is it possible there is something wrong with it?

Thanks

@jlubeck
jlubeck commented Mar 7, 2016

Tried again, new error:

error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly
[!] /usr/bin/git clone https://github.com/CocoaPods/Specs.git master --depth=1

Cloning into 'master'...
error: RPC failed; result=18, HTTP code = 200
fatal: The remote end hung up unexpectedly

Very weird...

@ecch531
ecch531 commented Mar 7, 2016

I got the same issue, too.

@mangofever

Same.
Cannot clone spec repo.

@art-divin

+1, same issue
git clone https://github.com/CocoaPods/Specs.git takes forever

@stringsanbu

+1, been messing around with this for a while. I doubled the buffer, didn't work. Uninstalled and reinstalled pods, didn't work. Tried to clone manually, no cigar. It actually seems to be getting "something" but fails. Using verbose didn't say much, just said it had issues accessing it.

I tried accessing my other repos and it seemed to be OK, but it was definitely slower than normal.

@huinme
huinme commented Mar 7, 2016

+1

I got the same issue, too.
My pod version was "0.39.0".

I tried cloning master repos directly (by git clone git@github.com:CocoaPods/Specs.git master --depth=1 --verbose), but also failed.

@oronbz
oronbz commented Mar 7, 2016

+1

@aceontech

+1

@pedrocid
pedrocid commented Mar 7, 2016

+1

@MarkMolina

+1. No success after increasing buffer / reinstall / manual clone

@stringsanbu

Temporary workaround which might work (haven't tested it, though): https://github.com/CocoaPods/Specs/archive/master.zip

wget https://github.com/CocoaPods/Specs/archive/master.zip
@fraxool
fraxool commented Mar 7, 2016

Same error here. What should we do with the file at https://github.com/CocoaPods/Specs/archive/master.zip ?

@stringsanbu

My bad, the wget link is correct. Just edited my first link.

@stringsanbu

Not sure what to do with the file yet. Trying to see if we can manually run the commands to have pod setup work.

@aceontech

Yeah, it merely downloads the repo's contents. The .git/ directory is missing, so it's not recognized as a git repo.

@samuel-mellert

yes.. same here.. It always tries to clone the master repo. Even when I run it with --no-repo-update I get "Creating shallow clone of spec repo master-1 from https://github.com/CocoaPods/Specs.git"

@MarkMolina

Did anyone try this with 1.0.0 beta 4?

@huinme
huinme commented Mar 7, 2016

@MarkMolina I tried, but same result.

@czechboy0
Contributor

Try to cd into ~/.cocoapods/repos/master, then git clean -fd to clean up the working copy, git checkout -- . to discard any local modifications, and then git pull manually. This took ages but worked for me.

@SoundBlaster

+1

@aceontech

Thx, but I removed my master spec repo before I realized something was up with the Github repo ^.^

@stringsanbu

Got a temp workaround! Tested with my app and everything is working. This is really only needed if you deleted the master repo. If the master folder is still in your ~/.cocoapods/repos folder with contents then you should be ok to just use pod install --no-repo-update.

  • Try doing a pod setup. This should at minimum download the .git to ~/.cocoapods/repos/master
  • While this is going, you need to move the .git folder from the master folder to somewhere temporary.
  • Stop the pod setup. Delete master folder in repos
  • Use the wget command below to get the zip of the repo
    wget https://github.com/CocoaPods/Specs/archive/master.zip
  • Unzip the master zip and move its contents to ~/.cocoapods/repos/master
  • Move your .git folder from wherever you put it to ~/.cocoapods/repos/master as .git
  • Go to your project folder, do a pod install --no-repo-update

And you should be good to go!

So in short, here is the basic list of commands I used:

pod setup  # run this in a separate tab
mv ~/.cocoapods/repos/master/.git ~/tempSpecsGitFolder
# press ^C in the pod setup tab, then continue here
rm -rf ~/.cocoapods/repos/master  # the "delete master folder" step from the list above
wget https://github.com/CocoaPods/Specs/archive/master.zip
unzip master.zip  # extracts to Specs-master
mv Specs-master ~/.cocoapods/repos/master
mv ~/tempSpecsGitFolder ~/.cocoapods/repos/master/.git
cd [project folder]
pod install --no-repo-update
@aceontech

Is this a Cocoapods or a wider GitHub issue?

@stringsanbu

@aceontech Pretty sure it is a GitHub issue, but my other repos are working fine so perhaps only certain repos on certain servers (on their backend) are affected.

@aceontech

I was just able to do a successful pod setup. Don't know if it's repeatable.

AlexMacBookPro:repos alex$ pod setup --verbose

Setting up CocoaPods master repo

Creating shallow clone of spec repo `master` from `https://github.com/CocoaPods/Specs.git` (branch `master`)
  $ /usr/bin/git clone https://github.com/CocoaPods/Specs.git master --depth=1
  Cloning into 'master'...
  Checking out files: 100% (74393/74393), done.
  $ /usr/bin/git checkout master
  Already on 'master'
  Your branch is up-to-date with 'origin/master'.
@SoundBlaster

Github is very very slow: ~ 40-50KB/s

@pedrocid
pedrocid commented Mar 7, 2016

I was able to do pod setup --verbose right now.

@segiddins
Member

This is a GitHub issue rather than a cocoapods issue -- you're best off reporting it to their support rather than us, since there's nothing we can do about it.

@tychop
tychop commented Mar 7, 2016

And github refers to Cocoapods... Great...

@segiddins
Member

There's nothing we can do -- github are the ones who host the repo and are responsible for serving it. The only commits in the past day have been changing files via their REST API, so the idea a bad commit got in is very unlikely. For the meantime, installing using --no-repo-update if you already have the master repo cloned is probably the best bet.

@orta
Member
orta commented Mar 7, 2016

I've contacted support about it, hopefully everything should be pretty easy to fix

@jcampbell05
Collaborator

I was able to clone the repo manually, do we know what is special about the way CocoaPods clones it that we could use to help github ?

@orta
Member
orta commented Mar 7, 2016

CocoaPods uses this git command line API, it's calling git clone https://github.com/CocoaPods/Specs.git - I wonder if the problems are location specific, as these commands aren't working for me in NYC.

@art-divin

I've also contacted GitHub support today, no answer still. I think that the issue is not location-based since there's plenty of distance between NYC & Munich.

Issue was "coming and going" during morning hours, GitHub status page did not reveal any problems during outage.

What would be a really good addition to CocoaPods is a possibility to change CocoaPods specs repository URL to use in-house replication of it

@jcampbell05
Collaborator

Potentially mirrors could be hosted at Bitbucket or other providers?

@segiddins
Member

@jcampbell05 right now, Pod::Source doesn't know how to deal with mirrors, so it wouldn't be much help

@orta changed the title from "Cannot clone Spec repo" to "Issues Cloning Spec repo - GitHub taking a very long time to download changes to the Specs Repo" on Mar 8, 2016
@mhagger
mhagger commented Mar 8, 2016

Hey all,

I'm one of the engineers on GitHub's Git infrastructure team. I'd like to start by apologizing for not responding more quickly to this thread. We've been investigating the issues that the CocoaPods community has been experiencing, and I wanted to give you an update on what we have found out so far.

The slow fetches and clones (which sometimes time out) that the CocoaPods community is experiencing are caused by automatic rate limiting on our servers, which is done to make sure that extremely high levels of load in one repository cannot impact other GitHub users. The CocoaPods/Specs repository is more or less permanently being rate limited. Why? There are several factors coming together:

  1. This repository experiences a huge volume of fetches (multiple fetches per second on average). We understand that part of the CocoaPods workflow is that its end users (i.e., not just the people contributing to CocoaPods/Specs) fetch regularly from GitHub, but the results of this are painful for our infrastructure: there have been approximately 1.1 Million clones/fetches from CocoaPods/Specs in the past week. This activity has kept, on average, more than 5 server CPUs permanently pegged, and used several terabytes of bandwidth out of our datacenters. There are only a handful of other repositories in all of GitHub that even come close to this level of activity. As far as I know, this level of activity is not new, but has been going on for many months and probably longer. Suffice it to say that the name CocoaPods/Specs is quite well known within our team 😉
  2. Apparently, most of the initial clones are shallow, meaning that not the whole history is fetched, but just the top commit. But then subsequent fetches don't use the --depth=1 option. Ironically, this practice can be much more expensive than full fetches/clones, especially over the long term. It is usually preferable to pay the price of a full clone once, then incrementally fetch into the repository, because then Git is better able to negotiate the minimum set of changes that have to be transferred to bring the clone up to date.
  3. Moreover, you seem to be hitting an edge case in Git's shallow fetch support, which is causing a significant fraction of your users' fetches to consume disproportionate CPU time (i.e., 100+ seconds each) on our servers. When this happens, the shallow clones are being converted into nearly-full clones, in a way that is much more expensive than doing a full clone from the start.
  4. Finally, the layout of the repo itself doesn't help. Specifically, the Specs directory, which contains 16k+ subdirectories, causes some Git operations to be unexpectedly expensive, further driving up CPU usage.
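For a sense of scale, the figures in point 1 can be turned into rough averages. This is only back-of-the-envelope arithmetic on the numbers quoted above: it treats "more than 5 server CPUs" as exactly 5, so the CPU figure is a lower bound, and the variable names are made up for illustration.

```python
# Figures quoted in point 1 above.
FETCHES_PER_WEEK = 1_100_000         # "approximately 1.1 Million clones/fetches"
PEGGED_CPUS = 5                      # "more than 5 server CPUs", so a lower bound
SECONDS_PER_WEEK = 7 * 24 * 60 * 60  # 604,800

# Average request rate over the week.
fetches_per_second = FETCHES_PER_WEEK / SECONDS_PER_WEEK

# CPU-seconds burned in a week, spread evenly over every fetch.
cpu_seconds_per_fetch = PEGGED_CPUS * SECONDS_PER_WEEK / FETCHES_PER_WEEK

print(f"{fetches_per_second:.2f} fetches per second on average")
print(f"at least {cpu_seconds_per_fetch:.2f} CPU-seconds per fetch on average")
```

The average hides the skew described in point 3: if a fraction of fetches take 100+ CPU-seconds each, those few account for a large share of the total.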

All of these factors combine to make CocoaPods/Specs one of the top five most resource-costly repositories that we host on all of GitHub.com. And that is why it is rate-limited; otherwise it would consume even more resources and cause service interruptions for other GitHub users. The symptoms of the rate limiting for you and your users are that your repository accesses (clones, fetches, pushes) have to wait in a queue on our end, sometimes for a long time, before being processed. This causes fetches/clones to take much longer than they would otherwise, and might cause timeouts at your end. Moreover, if the load on our servers becomes too overwhelming, a fraction of the accesses might be rejected altogether.

So, what can we do about it?

First and foremost, let me reiterate our commitment to hosting Open Source projects for free, forever. Our platform doesn't have "hard limits" or monthly traffic quotas. But the same commitment we have towards CocoaPods we also have towards all the other OSS projects that share their storage hosts with your project, and that simply wouldn't be able to operate if our automatic monitoring didn't throttle access to the CocoaPods/Specs repository.

That said, we're working in the open-source Git project on patches to fix the pathological behavior you're experiencing (e.g., see http://thread.gmane.org/gmane.comp.version-control.git/288403). We think Git's handling of shallow clones can be improved, but this might take a while. If the Git client needs to be changed, it wouldn't help until the new client is in the hands of the majority of your users.

The remaining issues, however, are mostly in the hands of the CocoaPods project. I have the feeling that the easiest possible first step would be to address point 2, by changing CocoaPods to use full rather than shallow clones. I assume that the typical clone is updated many times during its lifetime, in which case the initial cost of the larger clone should easily pay off over time while significantly decreasing the load on our servers. Existing clones can be converted from shallow to deep by running

git fetch --depth=2147483647

within the repository.
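
As a minimal sketch of how a client could decide whether that command is needed: a shallow Git clone keeps a .git/shallow file listing its cut-off commits, so its presence is a cheap local check. The helper names below (is_shallow, deepen_command) are hypothetical and not part of CocoaPods; the sketch only builds the command and leaves running it to the caller.

```python
import os
from typing import List, Optional

# 2**31 - 1, the depth used in the command above to request the full history.
FULL_DEPTH = 2147483647


def is_shallow(repo_path: str) -> bool:
    """Return True if the clone at repo_path has a .git/shallow file."""
    return os.path.isfile(os.path.join(repo_path, ".git", "shallow"))


def deepen_command(repo_path: str) -> Optional[List[str]]:
    """Build the git invocation that converts a shallow clone to a full one.

    Returns None when the clone is already full, so no fetch is needed.
    """
    if not is_shallow(repo_path):
        return None
    # `git -C <path>` runs the command as if started inside <path>.
    return ["git", "-C", repo_path, "fetch", f"--depth={FULL_DEPTH}"]


# Example: inspect the default CocoaPods master repo location (no network used).
master = os.path.expanduser("~/.cocoapods/repos/master")
print(deepen_command(master))
```

The returned list can be handed to subprocess.run(...) as-is; returning None for a full clone keeps the expensive deepening fetch a one-time operation.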

I believe that the change to using non-shallow clones will start reducing the cost of fetches, which will automatically cause the rate limits imposed by our systems to be loosened, ultimately giving a much better experience to the users of CocoaPods.

Longer-term, you should also consider points 1 and 4. Using GitHub as your CDN is not ideal, for anybody involved. I would urge you to consider how CocoaPods could be distributed without using Git operations, which are intrinsically hard to scale. I'm confident that you could come up with a more reliable approach for serving packages. Perhaps a method that is more similar to the approaches used by other packaging systems would work better.

I hope this information is helpful. Please let us know if you have any questions!

@jcampbell05
Collaborator

@mhagger would HTTP fetching be easier to scale ?

@mhagger
mhagger commented Mar 8, 2016

@jcampbell05: unfortunately, HTTPS vs SSH wouldn't make a noticeable difference. The expensive part is figuring out which Git objects the client already has, which ones it needs, computing deltas for those objects, and compressing the deltas. When the client has a non-shallow history, the first two steps become much cheaper and the last two steps can often be optimized away entirely.

@jcampbell05
Collaborator

@mhagger What I was meaning is that you can directly link to files via HTTP using the raw.githubusercontent.com domain. If we were to download some things via HTTP directly rather than git would that help ?

@orta
Member
orta commented Mar 8, 2016

I've removed a post noting that I wish we could have been told about the burden earlier so we could have helped out before hitting a ceiling, however, I can imagine it's difficult on your side to keep people in the loop about things like this. Sorry, don't want to de-rail!

@jcampbell05
Collaborator

I've left some ideas here to help with the above but I'm not sure if they will help #5000

I'm very passionate about us getting a deploy command at some point (Works just like bundler's bundle install --deployment).

@MikeMcQuaid

Hi, another GitHub employee here from the Platform (i.e. API) team and Homebrew maintainer (so I feel the pain of both sides).

If we were to download some things via HTTP directly rather than git would that help ?

It would help if you were using e.g. master.tar.gz tarballs as they can be more easily cached and served without hitting the Git layer every time. The problem from your side is that you'd need to do a ~60MB download every time so I can see this being undesirable.

As well as the shallow changes @mhagger suggested, this new preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches, which will also make things better for your users, as a no-op API HTTP call is significantly faster for you (and less expensive for GitHub) than a no-op git fetch. Feel free to @mention me directly on any pull request implementing it so I can help you ensure you're caching it nicely.
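
A rough sketch of the kind of check this API makes possible, using only Python's standard library. The preview media type and the use of the local SHA as an If-None-Match ETag are how I understand the linked announcement; treat both as assumptions to verify against it. The function names are hypothetical.

```python
import urllib.error
import urllib.request

API_URL = "https://api.github.com/repos/CocoaPods/Specs/commits/master"
# Assumption: preview media type named in the linked announcement.
SHA_MEDIA_TYPE = "application/vnd.github.chitauri-preview+sha"


def build_headers(local_sha: str) -> dict:
    """Headers asking for just the SHA, conditional on the local one."""
    return {
        "Accept": SHA_MEDIA_TYPE,
        # Assumption: the API treats the commit SHA as the ETag.
        "If-None-Match": f'"{local_sha}"',
    }


def needs_fetch(status: int) -> bool:
    """Interpret the HTTP status of the conditional request."""
    if status == 304:  # Not Modified: the remote ref equals local_sha
        return False
    if status == 200:  # the remote ref moved: a real git fetch is worthwhile
        return True
    raise RuntimeError(f"unexpected HTTP status {status}")


def remote_has_changes(local_sha: str, url: str = API_URL) -> bool:
    """Perform the request; only call this when network access is expected."""
    request = urllib.request.Request(url, headers=build_headers(local_sha))
    try:
        with urllib.request.urlopen(request, timeout=10) as response:
            return needs_fetch(response.status)
    except urllib.error.HTTPError as error:
        # urllib reports 304 (and other non-2xx codes) as HTTPError.
        return needs_fetch(error.code)
```

A client would call remote_has_changes with the output of git rev-parse HEAD and skip git fetch entirely when it returns False.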

@jcampbell05
Collaborator

@mikemcquaid That looks like it will be a huge help, thank you! I'm sure @segiddins, @alloy or @orta will get in touch with their thoughts :) 🚀

For me a three-tier approach may be best:

  • Add the shallow changes so that updating the repo is both faster and less expensive for GitHub.
  • Add a deploy option (https://github.com/jcampbell05/cocoapods-deploy), since some installations don't need to pull down all of the versions for every library and instead just need to pull down the podspecs for an explicit version declared in a lockfile (mainly for CI).
  • And finally implement that new API above to reduce the cost of the no-ops.
@alloy
Member
alloy commented Mar 8, 2016

@mhagger

I'm one of the engineers on GitHub's Git infrastructure team. I'd like to start by apologizing for not responding more quickly to this thread.

No worries and thanks for jumping on this at all 👍

I knew that GitHub must spend a sizeable amount of resources on making a repo like CocoaPods/Specs available for ‘free’ to all our users before, but some of the information you’ve now given makes that even clearer.

So in the name of all CP users, first of all, thanks for all that 👏

With all the hugs and kisses out of the way, let’s get onto sorting this all out. I’ll try to focus on what I think is important for this discussion, but please do point it out if I overlooked important information from your message!


Longer-term, you should also consider points 1

It’s unclear to me what it is in point 1 specifically that we should consider. Can you make that more explicit?

and 4.

This point seems an interesting tidbit, but it’s not clear to me at all why this is the case. Do you have links for us to read-up on this?

Using GitHub as your CDN is not ideal, for anybody involved. I would urge you to consider how CocoaPods could be distributed without using Git operations, which are intrinsically hard to scale. […]

There are a few reasons why we decided to go this route:

  • [As CocoaPods developers] Not having to develop a system that somehow syncs required data at all means we get to spend more time on the work that matters more to us, in this case. (i.e. funding of dev hours)
  • [As CocoaPods developers] Scaling and operating this repo is actually quite simple for us as CocoaPods developers who do not want to take on the burden of having to maintain a cloud service around the clock (users in all time zones) or, frankly, at all. Trying to have a few devs do this, possibly in their spare time, is a sure way to burn them out. And then there’s also the funding aspect to such a service.
  • [For CocoaPods users] Setting up external/private services is really, really simple (you only need to set up a Git repo) and something that quite a few companies/people do.

Perhaps a method that is more similar to the approaches used by other packaging systems would work better.

For the ‘HR’ and funding reasons listed above, I think we’re actually being ‘smarter’ than various other packaging systems. I’m not going to name them, but I’m sure you can think of examples.

I'm confident that you could come up with a more reliable approach for serving packages.

I’m not at all afraid that we as devs can’t come up with all sorts of solutions 😉, but I’d like to stay away from immediately assuming that things cannot work at all with the current design and ending up building a cathedral.

I.e. I’d like us to continue this discussion, at first, from the notion of us maintaining the existing architecture. Where things are absolutely impossible, it would be great if you can include more links to docs/source that explain why things are impossible.

Maybe we could host a snapshot of the git repo as a ‘release’ and initially download that?

  • This would still use your resources, but through a proper CDN.
  • We could possibly apply better compression.
  • Only simple changes would be required, the overall architecture would remain the same, especially for those that wish to host their own service.

In addition, reading the linked to bug report, I’m not entirely sure I understand if shallow clones are or are not able to work in any feasible way right now. Could you expand on that? E.g. the bug report thread mentions various options, such as “--deepen, --shallow-since and --shallow-exclude”, could any of these be helpful to us in any way?

@alloy
Member
alloy commented Mar 8, 2016

@mikemcquaid

It would help if you were using e.g. master.tar.gz tarballs as they can be more easily cached and served without hitting the Git layer every time. The problem from your side is that you'd need to do a ~60MB download every time so I can see this being undesirable.

You are referring to these, yeah?

[Screenshot, 2016-03-08 at 15.48.20]

Yeah that kinda sounds like my idea, except I’d like that to be a one time thing.

I should have stated in my earlier comment that my idea of hosting a snapshot was meant as a way for users to more easily get a full clone, which, as I understand it, would take the shallow/server-side CPU usage burden away?

As well as the shallow changes @mhagger suggested, this new preview API should help: https://developer.github.com/changes/2016-02-24-commit-reference-sha-api/. It's helped Homebrew dramatically reduce the number of no-op git fetches

This looks very interesting, thanks for sharing!

Just to be clear, is the number of no-op git fetches currently a burden that’s leading to the rate-limiting as well?

@MikeMcQuaid

You are referring to these, yeah?

@alloy I am, yep.

Yeah that kinda sounds like my idea, except I’d like that to be a one time thing.

Sure. Unfortunately that archive is the output of git archive so does not include any .git directory/metadata.

Just to be clear, is the number of no-op git fetches currently a burden that’s leading to the rate-limiting as well?

That's something that's hard for me to identify exactly. I guess it's a question of how often you think users are running git fetch (or equivalent) when there's nothing new to download. My experience locally is that a no-op git fetch for this repository is extremely slow so it's probably worth implementing just for that case and it definitely will decrease load for GitHub rather than increase it.

@mhagger
mhagger commented Mar 8, 2016

and 4.

This point seems an interesting tidbit, but it’s not clear to me at all why this is the case. Do you have links for us to read-up on this?

@alloy: In the Git object model, each version of each directory is stored as a "tree" object. Whenever something changes under the directory, a whole new, modified copy of the tree object has to be written to the object database. The Specs directory has 16k+ entries, and is about 450kb in size (compressed). Every single commit requires a new version of this giant tree.

This superficially doesn't seem so bad, because usually only a single entry in the tree changes each time. So successive versions of the tree delta well against each other, and the repository doesn't explode in size.

The problem is that many Git operations have to traverse the tree, which means that internally the 450kb object has to be recreated from its deltas (usually through multiple steps of deltas, each of which has to be found and decompressed). And your repository has nearly 100k commits, so operations that need to traverse the whole history become extremely expensive.

If, for example, this directory were sharded into subdirectories based on the first and second letters of the package name like so

a/_/A
a/f/A-Framework
a/2/A2DynamicDelegate
a/2/A2StoryboardSegueContext
a/3/A3GridTableView
...
a/u/authorizenet-sdk
a/u/autoAutoLayout
b/6/B68UIFloatLabelTextField
b/a/BABAudioPlayer
b/a/BABCropperView
b/a/BABFrameObservingInputAccessoryView
...
z/i/zipzap
z/l/zldtest
z/l/zldzhang
z/x/zxcvbn-ios

then the Specs directory and its subdirectories would only have 26ish entries, and the next level of directories would all have fewer than a few hundred entries. A modification in such a directory layout would have to rewrite three trees instead of one, but each tree is so much smaller than the current Specs tree that it would nevertheless be a big win.
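
A minimal sketch of a path function that reproduces the listing above. The rule is inferred from those examples rather than stated in this thread: lower-case the name, ignore non-alphanumeric characters, take the first two remaining characters, and use "_" when there is no second one. shard_path is a hypothetical name.

```python
def shard_path(name: str) -> str:
    """Map a package name to <first>/<second>/<name> as in the listing above."""
    # Assumption inferred from "a/f/A-Framework": punctuation is skipped.
    key = "".join(ch for ch in name.lower() if ch.isalnum())
    first = key[0] if len(key) > 0 else "_"
    # Assumption inferred from "a/_/A": a missing second character becomes "_".
    second = key[1] if len(key) > 1 else "_"
    return f"{first}/{second}/{name}"


for example in ["A", "A-Framework", "A2DynamicDelegate", "BABAudioPlayer", "zxcvbn-ios"]:
    print(shard_path(example))
```

With at most 36 alphanumeric characters per level, the top two levels stay small no matter how many packages are added; only the leaf directories grow.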

Such a layout is also a big win for many other reasons. For example, when computing diffs, if two Specs trees have identical a/ subdirectories, then that can be seen without looking inside the subdirectory's tree at all (because the SHA-1s of the trees would be identical). So computing the diff between two successive versions in the sharded scheme probably only requires a few (small) trees to be opened and a few dozen SHA-1s to be compared, whereas today it requires two gigantic trees to be opened and 16k SHA-1s to be compared.
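
The pruning effect can be illustrated with a toy model. This is a sketch, not Git's real tree format: trees are nested dicts, "files" are strings, and hashes are recomputed on every comparison, whereas Git stores them in the tree entries. The package names are made up.

```python
import hashlib


def tree_hash(node) -> str:
    """Hash a node: a str is file content, a dict maps names to child nodes."""
    if isinstance(node, str):
        return hashlib.sha1(node.encode("utf-8")).hexdigest()
    digest = hashlib.sha1()
    for name in sorted(node):
        digest.update(f"{name}\0{tree_hash(node[name])}\n".encode("utf-8"))
    return digest.hexdigest()


def changed_paths(old: dict, new: dict, prefix: str = "", counter=None):
    """Return (changed paths, counter), descending only where hashes differ."""
    if counter is None:
        counter = {"entries_compared": 0}
    changed = []
    for name in sorted(set(old) | set(new)):
        counter["entries_compared"] += 1
        a, b = old.get(name), new.get(name)
        if a is not None and b is not None and tree_hash(a) == tree_hash(b):
            continue  # identical hashes: the whole subtree is skipped
        if isinstance(a, dict) and isinstance(b, dict):
            sub_changed, _ = changed_paths(a, b, f"{prefix}{name}/", counter)
            changed.extend(sub_changed)
        else:
            changed.append(prefix + name)
    return changed, counter


def sharded(contents: dict) -> dict:
    """Nest entries under their first and second letters."""
    tree: dict = {}
    for name, value in contents.items():
        tree.setdefault(name[0], {}).setdefault(name[1], {})[name] = value
    return tree


# 676 made-up package names: "aalib", "ablib", ..., "zzlib".
letters = "abcdefghijklmnopqrstuvwxyz"
before = {x + y + "lib": "v1" for x in letters for y in letters}
after = dict(before, aalib="v2")  # exactly one package changes

flat_changed, flat_count = changed_paths(before, after)
shard_changed, shard_count = changed_paths(sharded(before), sharded(after))
print(flat_changed, flat_count["entries_compared"])
print(shard_changed, shard_count["entries_compared"])
```

In this toy run the flat layout compares all 676 entries, while the sharded layout compares 53 (26 + 26 + 1), because every unchanged subdirectory is dismissed by a single hash comparison.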

@vmg
vmg commented Mar 8, 2016

Thanks for your thoughtful reply, @alloy. Hope @mhagger has cleared up the question about your large trees. Regarding your other points:

It’s unclear to me what it is in point 1 specifically that we should consider. Can you make that more explicit?

Point 1 basically refers to using GitHub as a CDN. We totally understand this is convenient for you, and we work hard around the clock to make this a viable option, but Git, by design, is not suited to act as a CDN. You're burning weeks of CPU time and gigabytes of bandwidth from our infrastructure that could be replaced with very little CPU and very little bandwidth if CocoaPods were using a more traditional design for a package management system.

Maybe we could host a snapshot of the git repo as a ‘release’ and initially download that?

This would not be a strict improvement. If you use the tarballs that we offer for download, you will not have the Git metadata for the repository, so further fetches won't be possible. It'd be just as cheap to perform a full clone through Git -- GitHub has a special implementation on the server-side that can make serving a full clone particularly cheap as long as it is not a shallow clone. And obviously, you can continue fetching on top of the original clone.

To reiterate: the major performance issue is not on doing an initial clone of the CocoaPods repository, but in performing a shallow clone and then repeatedly fetching into it, like the CocoaPods client is currently doing.

I’m not entirely sure I understand if shallow clones are or are not able to work in any feasible way right now. Could you expand on that? E.g. the bug report thread mentions various options, such as “--deepen, --shallow-since and --shallow-exclude”, could any of these be helpful to us in any way?

Our advice would be for CocoaPods to stop using any kind of shallow feature from Git altogether. Users should perform a full clone of the repository, and then fetch into it as usual. Simply performing that change should significantly soften the load on our fileservers.

You may be led to believe that this is inefficient (in bandwidth or disk storage), but it actually ends up being significantly cheaper than your current approach. Git is not very good at shallow data, and one pattern we've found (and that we're trying to fix upstream in Git itself) is that merging a branch and fetching that into a shallow repository can cause Git to send an unreasonable amount of objects when that merge crosses the grafted shallow-point of the repository. You can read the investigation in the Git ML here: http://thread.gmane.org/gmane.comp.version-control.git/288403

Besides dropping the shallow clones, I would still urge you to implement @mikemcquaid's suggestion regarding the preview API for no-op updates. At this point, most of the throttling comes from expensive fetches, but every small bit helps.

At the end of the day, any Git pattern will "work" in practice: we have a unique in-house monitoring system that ensures the full availability of our Git platform no matter the circumstances. But this obviously leads to issues like the current thread. If the operations you're performing are not as optimal as they could be (or are pathological like in this case), they will be automatically throttled or cancelled on our servers, and this is a poor experience for the users of CocoaPods.

We cannot force you to change the design of your package manager, but we'd like to reiterate that Git (the version control system itself -- nothing to do with GitHub as a platform) is unsuited for what you're trying to do here. We're here to help you soften the pain, and we'll continue improving the performance of our platform and of the OSS Git client to make pathological workflows work in practice, but this is hard work. We can't assure an ideal user experience with CocoaPods' design choices. 😿

@alloy
Member
alloy commented Mar 8, 2016

Thanks for your thoughtful reply, @alloy. Hope @mhagger has cleared up the question about your large trees.

Aye, it did. Very great, thanks for that, @mhagger 👍

Maybe we could host a snapshot of the git repo as a ‘release’ and initially download that?

This would not be a strict improvement. If you use the tarballs that we offer for download, you will not have the Git metadata for the repository, so further fetches won't be possible.

Sorry about that, I wasn’t clear in my initial question about this. What I meant was a snapshot of the repo including metadata, so that there would be no more shallow fetches. So not the one that @mikemcquaid meant.

If the operations you're performing are not as optimal as they could be (or are pathological like in this case), they will be automatically throttled or cancelled on our servers, and this is a poor experience for the users of CocoaPods.

Totally understandable, and the experience is clearly not something anyone wants.

@indygreg
indygreg commented Mar 8, 2016

@mhagger I suggest you look at the support I added to Mercurial for having servers "redirect" clients to pre-generated static bundles to facilitate more efficient cloning. This feature has been talked about on the Git mailing list but AFAIK hasn't yet manifested into patches. I believe that if it were in place GitHub wouldn't be seeing some of the scaling issues exhibited in this issue. (Although shallow clone is a hard problem given the challenges of caching something that is always changing.) I wrote a bit more on Mercurial's feature and its impact at Mozilla at https://news.ycombinator.com/item?id=11246458 and http://gregoryszorc.com/blog/2015/10/22/cloning-improvements-in-mercurial-3.6/.

@alloy
Member
alloy commented Mar 8, 2016

@mhagger @mikemcquaid @vmg

Ok, in order to ensure that everything is clear on either side, I'd like to summarise what I understand to be the ways in which we can do our part, without completely dropping the use of GitHub for this. Please let me know if these are indeed acceptable, or if I got anything wrong.

  • We’re no longer going to use shallow clones.
  • We’re going to use that new API to check if there are updates available at all.
  • We’re going to look at prefixing subdirectories in the spec repo to limit the number of entries per directory.

In addition, for the user’s experience, we may look at offering a tarball of the full repo (with metadata) to see if that helps with the initial setup.

@alloy
Member
alloy commented Mar 8, 2016

Just to be extra clear from my side, if I got it wrong and you don’t see this working at all, with or without these changes, then we’ll definitely look at other solutions. We do not want to needlessly take away resources that you so generously offer ❤️

@vmg
vmg commented Mar 8, 2016

@alloy: That sounds fantastic. Those are the 3 actionable points, and sorted by (we believe) the impact they will have on the performance of CocoaPods. We're keeping a close eye on the graphs and will let you know how this is looking as you start implementing these changes.

Thank you so much for helping us out. 🙇

@jlubeck
jlubeck commented Mar 8, 2016

The speed with which everyone involved is handling this is quite remarkable. Thanks all!

@alloy
Member
alloy commented Mar 8, 2016

@vmg Sweet. Thank you all for taking the time and explaining the issues in-depth and how we can solve them! 👏

@mhagger
mhagger commented Mar 8, 2016

> @mhagger I suggest you look at the support I added to Mercurial for having servers "redirect" clients to pre-generated static bundles to facilitate more efficient cloning.

@indygreg: Thanks for the pointer. Full Git clones are actually quite cheap CPU-wise, so this approach would not solve the problem being discussed here, which is foremost a problem of CPU load. Of course, the ability to put a static bundle on a CDN is useful for other reasons.

@vmg
vmg commented Mar 8, 2016

@indygreg: Thank you for that suggestion! We've worked extensively on clone performance at GitHub, and we've certainly researched using a "bundle" approach like Mercurial does, but that has many practical shortcomings for a platform like GitHub, where a repository like rails/rails has thousands of unique forks which would require a bundle for each one.

We ended up settling on a more advanced approach called bitmap indexes, which allows us to serve a full clone with less than a second of CPU by keeping an "index" of objects reachable from certain points of our history and writing these objects straight to the wire from sections of the Git Packfiles on disk. The same optimization also applies to fetches, which makes it very interesting.
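
As a rough illustration of the reachability-bitmap idea described above, here is a toy model. It is not Git's on-disk bitmap format or GitHub's implementation: the function names, the dictionary-based history, and the use of a plain Python integer as the bitmap are all assumptions made for the sketch.

```python
def reachability_bitmap(tip, parents, objects, position):
    """Return an int whose set bits mark every object reachable from `tip`.

    parents:  commit id -> list of parent commit ids
    objects:  commit id -> ids of the trees/blobs that commit references
    position: object id -> bit position (for example, its order in the pack)
    """
    bits, stack, seen = 0, [tip], set()
    while stack:
        commit = stack.pop()
        if commit in seen:
            continue
        seen.add(commit)
        bits |= 1 << position[commit]
        for obj in objects[commit]:
            bits |= 1 << position[obj]
        stack.extend(parents.get(commit, []))
    return bits


def objects_to_send(want_bits, have_bits, position):
    """Objects the client is missing: reachable from 'want' but not from 'have'."""
    missing = want_bits & ~have_bits
    return sorted(name for name, pos in position.items() if (missing >> pos) & 1)


# A three-commit linear history used as example data.
PARENTS = {"c1": [], "c2": ["c1"], "c3": ["c2"]}
OBJECTS = {"c1": ["t1", "b1"], "c2": ["t2", "b1"], "c3": ["t3", "b1", "b2"]}
ORDER = ["c1", "t1", "b1", "c2", "t2", "c3", "t3", "b2"]
POSITION = {name: i for i, name in enumerate(ORDER)}

if __name__ == "__main__":
    want = reachability_bitmap("c3", PARENTS, OBJECTS, POSITION)
    have = reachability_bitmap("c2", PARENTS, OBJECTS, POSITION)
    print(objects_to_send(want, 0, POSITION))     # full clone: client has nothing
    print(objects_to_send(want, have, POSITION))  # fetch: client already has c2
```

In this model the history walk happens once, when a bitmap is built. If the bitmaps for the tips are stored, answering a clone or fetch is a bitwise operation on them rather than a fresh walk of the history.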

I wrote a pretty long post on the issue last year, in case you want to read more on our approach: http://githubengineering.com/counting-objects/

As you may guess, the main performance issue with CocoaPods (besides the fact that it's extremely popular) is that shallow clones cannot be served efficiently with our indexes, because the data on-disk for the repository doesn't match a shallow copy, and hence needs to be repacked before sending it to the client. We're confident that as soon as the CocoaPods client stops doing shallow clones, we'll be able to serve them with a much softer cost to our infrastructure.

@jcampbell05
Collaborator

This is just a question, but when these changes are implemented I'm wondering if you are open to improving the experience of people running it in CI. Travis CI creates a new clone of the repo each time, and I think there are ways beyond these three of speeding up both the operation on the user's end and the impact that could have on GitHub.

If there is a way to measure this impact, I would be interested.

@solarce
solarce commented Mar 8, 2016

Hey there, I am the infrastructure manager for Travis CI and my team is responsible for our OSX build images, which includes CocoaPods and the speed of the pod setup operation in each build is something we've been discussing lately.

We'd definitely love to hear more about how we could:

  1. Improve how we do an initial install and setup of CocoaPods and the specs, when we build new images for new Xcode updates, e.g. 7.3b5 yesterday. Right now we just do https://github.com/travis-infrastructure/osx-image-bootstrap/blob/master/bootstrap.sh#L73-L75
  2. Improve the things we do at runtime, in https://github.com/travis-ci/travis-build/blob/master/lib/travis/build/script/objective_c.rb and/or how our built-in caching handles the specs
@codeOfRobin

Stupid question: are similar load problems seen in repos like Homebrew (which maintains similar kinds of information to CocoaPods)?

@ZevEisenberg

I'd like to put in a plug for checking your Pods folder in to source control. I know there are arguments on both sides, but I've been checking in Pods for over a year on all projects and loving it. It removes your project's dependence on pod install in order to run on CI, which speeds up builds, and the repo size hit isn't really a big problem in my experience. One note: you'll generally want to change your Podfile and run pod install and merge that separately from any other changes, to keep your pull requests readable.

@jcampbell05
Collaborator

@solarce Aside from using Travis CI caching (which is awesome BTW) and the ideas above (which are a positive direction; great work on the quick response from @alloy and the team), what I've been using as a solution for my company is a plugin I wrote which attempts to bring the bundle install --deployment functionality to CocoaPods.

Check it out here:
https://github.com/jcampbell05/cocoapods-deploy

It looks at the lockfile and uses the raw HTTP links to the podspecs for each dependency to download them. This means that we don't need to clone the CocoaPods Specs repository at all, we can take advantage of GitHub's HTTP caching (reducing the workload for @vmg and @mhagger), and Travis CI can cache these podspecs as well (since they reside in the Pods folder).
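
The lockfile-to-URL step can be sketched roughly as follows. This is not the plugin's code: the `locked_pods` and `podspec_url` helpers are hypothetical, the parser handles only the simple `PODS:` entries shown, and the URL assumes the flat `Specs/<name>/<version>/<name>.podspec.json` layout of the master spec repo served from raw.githubusercontent.com.

```python
import re

# Assumption: raw file host and the flat (unsharded) spec repo layout.
RAW_BASE = "https://raw.githubusercontent.com/CocoaPods/Specs/master/Specs"

# Matches top-level PODS entries such as `  - AFNetworking (3.0.4):`.
POD_LINE = re.compile(r'^  - "?([^\s"(]+) \(([^)]+)\)"?:?$')


def locked_pods(lockfile_text):
    """Map each root pod name in the PODS section to its locked version."""
    pods, in_pods = {}, False
    for line in lockfile_text.splitlines():
        if not line.startswith(" "):
            # Section headers (and blank lines) start at column 0.
            in_pods = line.strip() == "PODS:"
            continue
        match = POD_LINE.match(line) if in_pods else None
        if match:
            root_name = match.group(1).split("/")[0]  # drop subspec suffix
            pods[root_name] = match.group(2)
    return pods


def podspec_url(name, version):
    """Build the raw URL of a single podspec under the assumed layout."""
    return f"{RAW_BASE}/{name}/{version}/{name}.podspec.json"


EXAMPLE_LOCKFILE = """\
PODS:
  - AFNetworking (3.0.4):
    - AFNetworking/NSURLSession (= 3.0.4)
  - AFNetworking/NSURLSession (3.0.4)
  - Masonry (0.6.4)

DEPENDENCIES:
  - AFNetworking (~> 3.0)
  - Masonry
"""

if __name__ == "__main__":
    for pod, version in locked_pods(EXAMPLE_LOCKFILE).items():
        print(podspec_url(pod, version))
```

Each URL is an ordinary static file request, which is why this route can benefit from HTTP caching where a Git clone cannot.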

With this plugin we've experienced an 85% decrease in CocoaPods install times (down from around 500 seconds to 85). I would love to know your thoughts on this and to see if it works for you.

@jcampbell05
Collaborator

@ZevEisenberg I've never been a fan of that approach and have always felt it is something that should be solved via a CocoaPods plugin (like the one above), since other package managers solve this via a special deploy mode which doesn't require you to mix your dependencies and source together.

I think it's important to give people the choice :)

@orta
Member
orta commented Mar 8, 2016

OK, looks like it's time to lock this thread. Sorry to those wanting to continue this conversation productively, but we've hit the Hacker News/mainstream point where the tone of the conversation is going awry due to outsiders. If you'd like to reach out to me or the CocoaPods team, you can direct message me on Twitter (same username) for a quick response, or use info@cocoapods.org for something team-y.

GitHub staff, we totally appreciate the discussion and will update this thread once we've got somewhere towards having a release with some fixes. I'll unlock it then 👍.

@orta orta locked and limited conversation to collaborators Mar 8, 2016
@alloy
Member
alloy commented Mar 8, 2016

Shame :(

@segiddins segiddins self-assigned this Mar 10, 2016
@segiddins
Member

Just a quick update: I've just released CocoaPods v1.0.0.beta.6 with a few improvements to how we handle specs repos, particularly https://github.com/CocoaPods/Specs. The full CHANGELOG is available at https://github.com/CocoaPods/CocoaPods/releases/tag/1.0.0.beta.6, but in short:

  • Specs repos will no longer auto-update by default when running pod install
  • When attempting to pull the master specs repo, we check that preview API for a 304 and will not perform the fetch if the API says we have the latest commit
  • Shallow clones of specs repos are no longer supported
  • Shallow clones of the master specs repo will be automatically unshallowed.
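
The update flow in the bullets above can be sketched roughly as follows. This is a sketch under assumptions, not the CocoaPods implementation: the helper names are hypothetical, the endpoint shape and the `chromatic-preview+sha` media type are assumed rather than taken from this thread, and the shallow check assumes an ordinary non-bare clone, where Git records shallow boundaries in `.git/shallow`.

```python
import urllib.error
import urllib.request
from pathlib import Path

# Assumption: media type of the preview API that returns only the commit SHA.
SHA_MEDIA_TYPE = "application/vnd.github.chromatic-preview+sha"


def is_shallow(repo_dir):
    """A shallow clone has a .git/shallow file; a full clone does not."""
    return (Path(repo_dir) / ".git" / "shallow").is_file()


def unshallow_command(repo_dir):
    """argv that converts a shallow clone into a full one."""
    return ["git", "-C", str(repo_dir), "fetch", "--unshallow"]


def update_check_request(slug, branch, local_sha):
    """Conditional request: the server answers 304 if `local_sha` is the tip."""
    url = f"https://api.github.com/repos/{slug}/commits/{branch}"
    request = urllib.request.Request(url)
    request.add_header("Accept", SHA_MEDIA_TYPE)
    request.add_header("If-None-Match", f'"{local_sha}"')
    return request


def remote_has_updates(request, opener=urllib.request.urlopen):
    """Return False on 304 Not Modified, True on any other successful reply."""
    try:
        with opener(request) as response:
            return response.status != 304
    except urllib.error.HTTPError as error:
        # urllib reports 304 as an HTTPError rather than a normal response.
        if error.code == 304:
            return False
        raise
```

A caller would run the unshallow command once if `is_shallow` is true, then skip the `git fetch` entirely whenever `remote_has_updates` returns False, so an up-to-date client costs the server one small HTTP request instead of a Git negotiation.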

In the next beta, we will be adding forward compatibility support for a sharded specs repo directory structure, to be migrated to in a few months.

We hope that this set of changes will lead to a better user experience in the long run for all users of CocoaPods, and will also help ease the burden of serving the master specs repository that GitHub faces. I speak on behalf of the team when I say thank you to everyone who has helped us over the past week, in particular @DanielTomlinson and @mrackwitz for making massive pull requests, @orta and @alloy for helping coordinate this effort, and the entire GitHub team, particularly @vmg, @mhagger, and @arthurnn for providing technical advice and infrastructure to CocoaPods, our users, and the entire open source community.

If anyone has further questions / comments about this release and the work to come, please feel free to reach out either to me personally (segiddins@segiddins.me) or to the CocoaPods team (info@cocoapods.org). Thank you everyone for your patience over the past week.

-- ❤️ The CocoaPods Team

@segiddins segiddins added the label Mar 15, 2016
@orta orta unlocked this conversation Mar 15, 2016
@kylef
Contributor
kylef commented Mar 15, 2016

I suspect it might be a while before users update to 1.0.0, considering it includes many breaking changes over previous versions. Does anyone think it would be worth backporting these fixes to the 0.39.x series and issuing a patch update that removes shallow clones, etc.?

Especially if the master repository is going to be sharded and become incompatible with 0.39.

@jcampbell05
Collaborator

@kylef I agree it would be good to have a 0.40 with some of these changes.

@vinnybad

@kylef it would certainly help our teams!

@orta orta locked and limited conversation to collaborators Mar 15, 2016
@segiddins
Member

Thanks everyone for sticking with us! We've written up a post-mortem of the issue and the steps taken to mitigate at http://blog.cocoapods.org/Master-Spec-Repo-Rate-Limiting-Post-Mortem/.

@segiddins segiddins closed this May 4, 2016