gitdist: Support for large binary files or SVN? #139
Comments
@rppawlo, I am not sure that adding non-git commands directly into gitdist is a good solution. Right now, the appeal of gitdist is that it just distributes raw git commands across a set of git repos; simple. If we start to add commands for other VC systems, this is going to get complicated very fast. It sounds like instead that Drekar developers are looking for a tool like mr ("myrepos"). See the debate going on in: I might suggest some options more compatible with CMake/TriBITS/gitdist that Drekar developers might consider:
Personally, I am not excited about either of these two approaches but they would work with TriBITS and github with very little extra effort (e.g. once gitdist supports the dist-foreach command). I have been waiting to see if Git LFS takes off and someone creates quality open-source server and local implementations. That would really make working with large binary files with git more seamless but I don't have high-hopes (see here and here). For now, Drekar developers might just consider adding a simple script like Otherwise, let's talk. |
FYI: I cloned the myrepos git repo off github, installed
I looked at the source code for
With all of these repos, it would be confusing what 'origin' pointed to without having to constantly run The issue is that what "origin" points to is determined by what project and what workflow I am using. So when I am working with Trilinos, I see:
So it is clear that I am pulling and pushing to the 'github' repos. When I am working with CASL with Trilinos, I see:
That makes it clear that I am pulling and pushing to the 'casl-dev' repos. See, no confusion!
But |
@rppawlo, it looks like git-annex has a new v6 repository mode that looks very close to the ease of usage of standard git commands (just Would have to try this out and experiment with it to see how it works. However, if this works out, it might give users and developers an interface much closer to a standard git repo without having to resort to using SVN or some other way of managing large files. I will ask Seth Johnson at ORNL if he has looked into git-annex v6 repository mode yet. |
@rppawlo, we should wait and see how Seth responds to the below email but from looking into this new git-annex mode, I think it might be possible to streamline the usage of git-annex with new
Git-annex meshes with git and gitdist much better than SVN. That is because you can still just run any regular git command in git repos that use git-annex. You can't do that with an SVN repo so it would be very ugly to try to shoehorn in support for SVN. But since git-annex is an extension to git, this should be much more seamless. One issue is that they will need to install an updated version of git-annex on development.sandia.gov to be able to use git-annex with git repos there. That is something that might require some SEMS support. But before that, we could experiment with a temp bare repo and see how it works. Also, users of would need to install git-annex on their local machines. But the SEMS NFS mount could help with that and make it more automatic. If you are interested in looking into git-annex, perhaps we can look into this together a little? I know that none of us really has time for this type of thing, but if this works out, it could be a very nice solution to a pretty big problem with the usage of git for large binary files and multi-repos, all in one shot.

From: Bartlett, Roscoe A

Hello Seth, I know that Exnihilo developers use git-annex for managing large binary files. Therefore, I thought that I might ping you on this … I was looking at git-annex documentation today and I see that newer versions of git-annex (version 6.20160211 and newer) support a new “v6 repository mode” that looks to make git-annex much easier to use (much like Git LFS): https://git-annex.branchable.com/tips/unlocked_files/ Basically, it allows you to set But I am not sure if ‘git push’ and ‘git pull’ result in large files being synced automatically as happens with Git LFS or if you still have to use git-annex commands to do that (e.g. ‘git annex sync --content’). That would be the big question for usability. (But I think I could wrap this into gitdist commands dist-fetch, dist-pull and dist-push to automatically handle git annex repos in this fashion.)
Some discussion of this is in the comment: I am looking into this because a project at SNL (one of Roger’s projects) is using SVN to handle large binary files. But using git-annex would have some advantages and it interfaces nicer with git. The issue is making git-annex as easy to use as raw git (or getting pretty close). If this works out, we might adopt this for CASL VERA repos as well (i.e. the VERAData repo). Just curious if you had seen this git-annex “v6 repository mode” and have looked into this at all. Thanks, -Ross |
Thanks for looking into this Ross! I'll sit tight until we hear back from Seth. |
Adding @eric-c-cyr to this conversation. Bummer about the "mr" tools being hard-coded to origin; it looked really good. |
It might just be that the 'mr register' command is hard-coded to 'origin'. Grepping through the script, I don't see other explicit mentions of 'origin'. It might be that if you manually build the I understand the attractiveness of SVN (because many people already know that tool) but git-annex might be worth a look. Let's see what Seth says. |
The response back from Seth and my response to him are given below. Boy, if GitLab could be installed at SNL and supported Git-LFS, then that is something that SEMS should really look into. That could make GitLab very attractive. Note that Sandia does seem to have a GitLab instance installed (see issue https://sems-jira.sandia.gov/browse/SEMS-1104). But note that Git-LFS is not all roses: But of course git-annex and especially SVN are not perfect either. But at least git-annex fits in with a git-based workflow and supports distributed repos. SVN does not at all. The biggest problem with SVN is that it clashes with the git workflow. You can't commit things locally and then push to all the repos all at once with SVN. With SVN, you need to create the commit message at the same time that you push the commit. When multiple repos are involved, this is just a mess. Based on this, I would look at git-annex. I just wish we had someone at SNL who could dig into all of this for us. Should we ask SEMS to look into this? Dealing with large binary files in a git-based workflow is not a trivial problem. When you start throwing SVN repos into the mix, I don't know how you build a smooth development and multi-repo integration model out of that.

From: Bartlett, Roscoe A

Seth, Except for push and pull, git-annex with this new repository mode looks almost as easy to use as Git-LFS (but we would have to do a detailed comparison). Good to hear about Git-LFS support with GitLab. I just found:
Is that supported on the ORNL installation of GitLab (code-int.ornl.gov) currently or is it planned for the future? It also seems that GitLab supports git-annex as well: (but not sure about the newest “v6 repository mode”). So GitLab seems like a good opportunity to compare git-annex and git-lfs side-by-side. In any case, please let us know how your experimentation with Git-LFS and GitLab goes and how it compares to git-annex. But at this point, you would still recommend using git-annex over using a separate SVN repo? Thanks, -Ross

From: Johnson, Seth R. [mailto:johnsonsr@ornl.gov]

Hey Ross, Thanks for the ping! I'd not heard of that new git-annex mode but it sounds pretty easy to use. I doubt that git-annex will ever work as cleanly as just git, but it sure works easier than a separate SVN repository. We're actually looking into using Git-LFS because that's what may be supported by our local gitlab server. Please let me know what you find out. Thanks, |
Below is an email response from Seth and my response to him. To sum up feedback from Seth:
Since my last comment, I have done a lot more research on git-annex and Git-LFS. I looked at several presentations on youtube including:
At this point I think I know enough about git-annex and Git-LFS to draw some conclusions. First, Git-LFS is the future for the handling of large binary files with git. There is no question about that. And using Git-LFS would require zero changes to gitdist or checkin-test.py. However, unless you have a local recent installation of GitLab Enterprise Edition (EE) or GitHub Enterprise, or BitBucket installed locally and supported on your local system, you can't use it for sensitive projects. And even for open-source projects, the support for large files with Git-LFS with GitHub, GitLab, and BitBucket is going to be very limited (because large files and lots of I/O cost real money). Also, there is no open-source Git-LFS server implementation. Therefore, I don't think that Git-LFS is a viable solution for projects that have sensitive data (that therefore needs to be protected behind a firewall) or need to handle really large files. It may take a while before we can use Git-LFS at the labs due to the need for serious infrastructure (both programmatic support for the supporting tools and the hardware to support it). In a few years, I would think that Git-LFS will overtake everything, including git-annex. But at the present time, I don't see that Git-LFS is a viable option.

If the desire is to keep using gitdist for handling all the repos, then I think that git-annex using "v6 repository mode" is worth looking into. It may be nearly as easy to use as raw git and it will integrate well with the gitdist tool (if that is indeed the desire). If there is interest in this, then the next step would be to install a recent version of git-annex v6 and prototype its usage at handling large binary files and setting up an rsync git-annex specialremote repo.

However, if the desire is to just keep using SVN for managing large binary files, then I would recommend creating a specialized script called something like In summary, a reasonable plan would be:
I will update the title of this Issue to reflect the larger scope. From: Bartlett, Roscoe A Seth, Responses inline ...
[Ross] That is the beauty of the git-annex “v6 repository mode”. The git-annex author used the same git clean and smudge filters as Git-LFS to let you specify what types of files should get annexed and which should be handled by regular git and then users can just use
[Ross] That will result in all files larger than 100kb, except *.h and *.cpp files, being annexed automatically when added with a raw
[Ross] It looks like you can include/exclude entire directories, files with a given extension, etc. It looks very flexible.

[Ross] What git-annex is lacking is the pre-push hook that Git-LFS uses to automatically send the annexed files to the server using a raw

[Ross] I am not sure if git-annex does this or not. We would have to see. But having to run ‘git annex sync --content [--no-push|--no-pull]’ does not seem so bad (and I can add that to new gitdist dist-pull/dist-push commands).

[Ross] If you are interested, you might just take a quick look at these two web pages above (they are not very long).
[Ross] I think that the git-annex “v6 repository mode” makes the usage of something like autoannex.py unnecessary (see the above .gitattributes file).

[Ross] I think the only thing the checkin-test.py script would need is to handle the pull and push operations for annexed repos. And for that, I think I would add support for git annex push/pull to the gitdist script and make the checkin-test.py script use gitdist for pull and push operations. Also, we would need a way to make sure that modifications to annexed files were committed before doing the final test and push. We would need to experiment with git-annex to see how to get that info (but I would guess that ‘git annex status’ would do that).
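To make the rule described above concrete (files larger than 100kb get annexed, except *.h and *.cpp files), here is a minimal Python sketch of that decision. It only models the rule as stated in this thread; it is not how git-annex itself evaluates its configuration. The function name, the file names in the demo, and the reading of "100kb" as 100 * 1000 bytes are all assumptions made for illustration.

```python
from pathlib import PurePosixPath

# Assumption: "100kb" is read as 100 * 1000 bytes; change this if your
# tool interprets the unit as 1024 bytes.
SIZE_THRESHOLD_BYTES = 100 * 1000

# Extensions that stay in regular git regardless of size (from the rule above).
ALWAYS_REGULAR_GIT = {".h", ".cpp"}


def would_be_annexed(path: str, size_bytes: int) -> bool:
    """Return True if the described size-and-extension rule would annex the file.

    Hypothetical helper: models the rule in the discussion, not git-annex itself.
    """
    if PurePosixPath(path).suffix in ALWAYS_REGULAR_GIT:
        return False
    return size_bytes > SIZE_THRESHOLD_BYTES


if __name__ == "__main__":
    # Made-up file names and sizes, purely for demonstration.
    for name, size in [
        ("results/mesh.exo", 5_000_000),
        ("src/big_table.cpp", 250_000),
        ("input.xml", 2_000),
    ]:
        print(name, would_be_annexed(name, size))
```

The point of the sketch is that the choice is made per file from its size and name alone, which is why a separate helper script for deciding what to annex would no longer be needed.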
[Ross] What that means is that Git-LFS is not even an option right now at ORNL, right? Opening the door to huge file storage is a bit scary for them I suspect. It looks like code-int.ornl.gov GitLab site at ORNL does not even support git-annex (which was added before support for Git-LFS). A big advantage of git-annex is that you can actually use a different machine to store your annexed files so you can use it even with GitHub or BitBucket. See:
[Ross] This makes your project its own master for storing large binary files. You can just set up an rsync special remote server that everyone in your team can access (using a unix group protection on the server) and then make them point to that. See:
[Ross] That is what I am thinking. From the documentation that I have seen for the newer git-annex “v6 repository mode”, it seems fairly usable (almost as good as Git-LFS, if it actually works). The big advantage of git-annex that I can see is that it has a fully free and open-source server implementation that you can use right now. There are no open-source server implementations for Git-LFS that I can find. All of the major commercial players have their own server implementations (e.g. GitHub, BitBucket, GitLab, MS VSO) but there is nothing that you can use with your own git repos. The only Git-LFS option for behind the firewall is proprietary commercial GitLab EE (and GitHub Enterprise but that is very expensive). The fact that Git-LFS only works with proprietary implementations bothers me a lot. Given all of this, I think I would sit tight and wait and see where Git-LFS goes. My guess is that in 2 more years there will be a quality open-source server implementation (perhaps GitLab?) and the git community will have worked out all of the bugs. That will be a wonderful time! But until then, we have to get our work done. Cheers, -Ross |
FYI: Looks like Sandia has GitLab sites (SON and SRN) set up that are supposed to support Git-LFS (but not git-annex). I have created Trilinos JIRA Issue TRIL-62 to document this and investigate if Git-LFS works. |
So after a lot of back and forth with the maintainer of the gitlab.sandia.gov site and a lot of research and experimentation, it seems that Git-LFS works with that site but you have to use HTTPS authentication (SSH authentication does not work with GitLab with Git-LFS). See Trilinos JIRA Issue TRIL-62. But I think I am far enough along to say that Git-LFS with the SRN site gitlab.sandia.gov may be a viable solution right now to replace SVN to manage large binary files. I think the next step would be to discuss this some and then, if it makes sense, to actually try creating the Drekar git repo copy of the SVN repo using Git-LFS on gitlab.sandia.gov and see how it performs with cloning, changing files, pushing etc. Git-LFS is very easy to use once you have the Git-LFS client installed (which is also very easy to do and we can provide that on all platforms pretty easily I think). Given what I have learned about Git-LFS, I think that this might be a good option for some of the larger test files with Drekar. It might even be a good option for trimming down some of the larger test files in Trilinos. It seems that if you clone a git repo that manages some files with Git-LFS and the local user is not set up to use Git-LFS, then nothing all that terrible happens (see this comment). |
I created a Git-LFS version of the DrekarSystemTests repo on the SNL GitLab servers gitlab-ex.sandia.gov and gitlab.sandia.gov and it went smoothly. It was an easy process and I got it working with Git-LFS on a new machine (shiller) in just a few minutes. The details are below. If we can get the Git-LFS client installed onto the SEMS NFS mount and the ATTB machines, then it should be trivial for Drekar developers to use a Git-LFS version of the DrekarSystemTests repo. Drekar developers need to run two commands one time on each new machine:
and they are set and never have to think about Git-LFS again. For machines that don't have the Git-LFS client centrally installed, installing it locally is just a few simple commands (see below) and we could provide a single script that would do this in one shot (i.e. download the Git-LFS client and install it). The only issue I can see with this is that you have to use HTTPS to authenticate with the current GitLab servers. Therefore, you have to cache your username and password. This might be an issue for automated tests run by Jenkins. But you can use a file to store these and then you never need to type them again (see this TRIL-62 comment) which is allowed for SNL entity accounts. Detailed Notes: For the heck of it, I converted the DrekarSystemTests SVN repo to a Git-LFS repo. First, I copied the Git-LFS client to shiller:
and then installed it with:
(This installed I set up for git to cache my username and password for HTTPS:
I then created the empty private GitLab projects under my account:
I then cloned the empty repo and set up for Git-LFS with:
(The *.xml and other files should be stored by raw git since they are human-written text files.) I then copied the contents of the DrekarSystemTests.svn repo trunk/ directory into the new git repo, removed all of the .svn/ dirs and then did:
I then moved the existing DrekarSystemTests local repo out of the way and did a fresh clone with:
Here is how the size of the clones match up:
The Git-LFS repo is larger than the SVN repo because it has to store two copies of every binary file (see this TRIL-62 comment). But you can see that Git-LFS is managing most of the data from looking at:
See, the Now you can use gitdist including DrekarSystemTests with no modifications (just add DrekarSystemTests to the .gitdist file):
Now I can pull and push to all of the Drekar repos at once cleanly! We could streamline the cloning of these repos using the TriBITS clone-extra-repos.py script for the Drekar repos. I can show how to do that if interested. |
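As a rough illustration of why a Git-LFS repo needs no special handling here, the following Python sketch shows the general idea of distributing one raw git command across a list of repos. This is not gitdist's actual implementation. The assumed file format (one repo directory per line, with blank lines and `#` comments skipped) and both function names are hypothetical; the only real git feature relied on is the `-C <dir>` option.

```python
from typing import List, Sequence


def parse_repo_list(text: str) -> List[str]:
    """Parse a repo list: one directory per line (assumed format).

    Blank lines and lines starting with '#' are skipped; the comment
    convention is an assumption of this sketch.
    """
    repos = []
    for line in text.splitlines():
        line = line.strip()
        if line and not line.startswith("#"):
            repos.append(line)
    return repos


def build_commands(repos: Sequence[str], git_args: Sequence[str]) -> List[List[str]]:
    """Build one 'git -C <repo> <args...>' command per repo (nothing is run)."""
    return [["git", "-C", repo, *git_args] for repo in repos]


if __name__ == "__main__":
    repo_text = ".\nDrekarSystemTests\n"
    for cmd in build_commands(parse_repo_list(repo_text), ["pull"]):
        print(" ".join(cmd))
```

Because the same raw git command is sent to every repo, a repo whose large files are handled by Git-LFS looks like any other git repo to this kind of loop. Each command list could be passed to `subprocess.run` to actually execute it.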
On internal machines, accessing HTTP requires a proxy. Thus on some of the clusters, I've used ssh to get to github. Is there a way to set the HTTP proxy in git? |
Eric, I did not need to set any proxy from the machines shiller or muir in order to access github.sandia.gov or github-ex.sandia.gov. What machines need a proxy? Are these on the SRN, SON or outside of the SNL network? In any case, hopefully there should be simple instructions to set up the HTTPS proxy on any given machine. Longer term, the hope is that GitLab will get Git-LFS to use SSH authentication. See:
But note that BitBucket seems to have this figured out:
Don't know. Can you point me to a specific machine that has a problem and perhaps I can give it a try? |
Just a tip, but you can speed up the initial clone using The time with raw
The time with
The The same goes for I think that TriBITS could be made to learn about Git-LFS repos (just list them as such in the ExtraRepositoriesList.cmake file, such as with 'GIT-LFS' instead of just 'GIT') and a special gitdist command |
BTW, I went ahead and updated the DrekarSystemTests Git-LFS repo for the current version of the SVN repo just to show that it is easy to do in the commit:
It is just a few commands to do this:
|
Ross - it just occurred to me that there might be another solution. git supports working directly from svn repos. https://git-scm.com/book/en/v1/Git-and-Other-Systems-Git-and-Subversion I recall that the moose team did this for quite a long time before moving to github. It's not the most appealing, since there are a few dangerous things you have to manage. Just thought I would mention it. I discussed the git-lfs path with the Drekar team and they were hesitant about adoption. They would like to make sure this is the best long term solution. I think that one issue is that the COE doesn't have a package for git-lfs and as far as I can tell, Ubuntu doesn't either. So every computer we run on requires this install. SEMS support would go a long way. |
I had not thought about git-svn. The truth is that the DrekarSystemTests repo is really pretty small so I suspect that even if git-svn is extremely slow (which people seem to be claiming) that using git-svn with DrekarSystemTests may not be too bad. But looking at:
you have to use separate commands for pulling and pushing to the SVN repo using git-svn and this would require training for Drekar developers and gitdist would need to be extended to work with git-svn repos. To pull with git-svn, you have to use To push with git-svn, you have to type That is a big advantage of git-lfs over git-svn.
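To show what "separate commands" would mean for a tool like gitdist, here is a small Python sketch of a lookup from repo kind and action to the command to run. The exact commands in the comment above were lost from this page, so the sketch assumes the standard git-svn subcommands `git svn rebase` (fetch from SVN and replay local commits) and `git svn dcommit` (send local commits to SVN). The table, the repo-kind labels, and the function name are hypothetical.

```python
from typing import Dict, List, Tuple

# Assumed mapping. A Git-LFS repo is listed under plain "git" because it is
# pulled and pushed with ordinary git commands.
SYNC_COMMANDS: Dict[Tuple[str, str], List[str]] = {
    ("git", "pull"): ["git", "pull"],
    ("git", "push"): ["git", "push"],
    ("git-svn", "pull"): ["git", "svn", "rebase"],
    ("git-svn", "push"): ["git", "svn", "dcommit"],
}


def sync_command(repo_kind: str, action: str) -> List[str]:
    """Return the command for a (repo kind, action) pair, or raise ValueError."""
    try:
        # Return a copy so callers cannot mutate the shared table.
        return list(SYNC_COMMANDS[(repo_kind, action)])
    except KeyError:
        raise ValueError(f"unsupported combination: {repo_kind!r}, {action!r}") from None


if __name__ == "__main__":
    for kind in ("git", "git-svn"):
        for action in ("pull", "push"):
            print(kind, action, "->", " ".join(sync_command(kind, action)))
```

Every git-svn repo adds a second row of special cases that both the tool and the developers have to know about, while a Git-LFS repo adds none.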
I definitely understand the apprehension. Having to use HTTPS authentication instead of SSH authentication is the biggest apprehension I personally have. But compared to git-svn, just from what I have seen so far, I would personally use git-lfs over git-svn if I had a choice, and not just because git-lfs does not require any changes to gitdist at all. Git-LFS just fits in better with a git-based workflow than git-svn.
From everything that I have read and I have experienced, all things considered, Git-LFS looks like the best long-term solution for managing large binary files with a git-based workflow. It is the most transparent for developers to use and it has what appears to be very broad industry support (e.g. GitHub, BitBucket, and most importantly GitLab) and there will be better and better implementations and more implementations of Git-LFS capable servers as time goes on. I am about 100% sure about that given what I have seen.
Yes, that is an issue, but I think we can make it a simple two-step process to install the Git-LFS client on any SON or SRN machine in each user's home directory using:
I would put together the install-git-lfs.sh script and then ask Drekar developers just to try it and see if it works for them. That should be a low investment on their part. As for support on the systems, things don't look too good for git-svn. I just tried to use git-svn to clone the DrekarSystemTests repo on the SRN machine muir using the SEMS installed git:
and it seems that the SEMS team did not install git 2.1.3 to support git-svn:
Even the default COE version of git:
does not seem to support git-svn:
Seeing this, it might be easier for the SEMS team and the ATTB machines to install the binary git-lfs than to fix a git install that has a broken git-svn. However, things look better for git-svn on the ATTB machines. I tried git-svn on the machine shiller and it worked:
So that is a little longer than the 16s needed to clone the git-lfs repo with:
but not too bad. What is interesting is that the git-svn repo is not on a tracking branch but it does show modified and untracked files:
The problem is that I made a local commit not pushed to the SVN repo:
and I am not sure how to see that commit has not been pushed yet. With
That is a problem with the workflow using git-svn. We would need to train the gitdist My initial opinion is that git-svn is more confusing than git-lfs. In summary, the pros and cons for git-lfs vs. git-svn that I can see so far are: git-lfs:
git-svn:
Rather than all this stuff, Drekar might just consider a simple We should discuss all of this stuff. |
Looks like GitLab CE is on the verge of allowing Git-LFS to use pure SSH authentication with the GitLab server (but transferred over HTTPS). See this update. We don't care how the objects are copied (SSH or HTTPS are fine, both are encrypted), we only care about how the authentication is done. Hopefully we will see this on the SNL GitLab server in a few months. We need to wait for this to get merged to the GitLab 'master' branch and put into an official release (which go out once a month) and then wait for the people in SNL org 9300 to update the SNL GitLab servers. |
FYI: It looks like the next release of GitLab CE will have support for pure SSH authentication for Git-LFS. See: I think this means that this will show up on the SNL GitLab servers in a few months. |
Good news, the latest release of GitLab has the full SSH authentication for Git-LFS: Now we just need to get them to install it for the SNL GitLab servers. |
Looks like they already installed GitLab 8.12 on the SNL GitLab servers but it does not seem to work for SSH Git-LFS authentication yet. From: Bartlett, Roscoe A It looks like it does not work:
This cloned the repos but failed to replace the LFS-managed files with the full versions as shown by, for example:
Anyway, not urgent but it would be great if we could get this to work. Thanks, -Ross |
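When a clone fails to replace the LFS-managed files as described above, what is left in the working tree is a small text pointer file in place of each real file. The Python sketch below recognizes such a pointer using the Git LFS pointer layout (a `version` line followed by `oid` and `size` lines). The function name is hypothetical and the oid in the demo is a made-up placeholder, not a real object id.

```python
from typing import Dict, Optional

# First line of a Git LFS pointer file (spec v1).
LFS_SPEC_LINE = "version https://git-lfs.github.com/spec/v1"


def parse_lfs_pointer(text: str) -> Optional[Dict[str, str]]:
    """Return the pointer's key/value fields, or None if text is not a pointer."""
    lines = text.splitlines()
    if not lines or lines[0].strip() != LFS_SPEC_LINE:
        return None
    fields: Dict[str, str] = {}
    for line in lines[1:]:
        key, sep, value = line.partition(" ")
        if sep:  # keep only "key value" lines
            fields[key] = value.strip()
    # A usable pointer needs both the object id and the size.
    if "oid" not in fields or "size" not in fields:
        return None
    return fields


if __name__ == "__main__":
    # Placeholder oid (64 zeros), for demonstration only.
    pointer = (
        "version https://git-lfs.github.com/spec/v1\n"
        "oid sha256:" + "0" * 64 + "\n"
        "size 12345\n"
    )
    print(parse_lfs_pointer(pointer))
    print(parse_lfs_pointer("not a pointer"))
```

Running a check like this over the files that are supposed to be large binaries is one way to confirm whether a clone really fetched the LFS content or left pointers behind.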
Looks like Git-LFS with SSH should work on the SNL GitLab servers. I need to try that out. See: |
FYI: Git-LFS is working on the ORNL GitLab servers code-int.ornl.gov and code.ornl.gov. But since this story has gotten way off scope and since Drekar is going to just use a raw git repo (or not change anything), there is no sense keeping this Issue open any longer. If there is a desire to teach gitdist about Git-LFS, then we can open a new issue. But for now, it makes sense to close this story as wontfix I think. |
For large binary files, we use svn to store system test data. When adding new capability, this results in adding system tests along with changes to the source code. We use gitdist to pull and push multi-repo changes. Multiple users have been burned because gitdist doesn't push the svn repo changes. It would be nice for gitdist to support some simple commands for updating/pulling and pushing to repos.