gitdist: Support for large binary files or SVN? #139

Closed
rppawlo opened this issue Jul 26, 2016 · 24 comments

Comments

@rppawlo
Contributor

rppawlo commented Jul 26, 2016

For large binary files, we use svn to store system test data. When adding new capability, this results in adding system tests along with changes to the source code. We use gitdist to pull and push multi-repo changes. Multiple users have been burned because gitdist doesn't push the svn repo changes. It would be nice for gitdist to support some simple commands for updating/pulling and pushing to repos.

@bartlettroscoe
Member

bartlettroscoe commented Jul 26, 2016

@rppawlo, I am not sure that adding non-git commands directly into gitdist is a good solution. Right now, the appeal of gitdist is that it just distributes raw git commands across a set of git repos; simple. If we start to add commands for other VC systems, this is going to get complicated very fast.

It sounds instead like Drekar developers are looking for a tool like mr ("myrepos").

See the debate going on in:

I might suggest some options more compatible with CMake/TriBITS/gitdist that Drekar developers might consider:

  1. CMake ExternalData which only impacts the CMake configure process (@nschloe, really likes this approach).
  2. Git-annex (Exnihilo developers at ORNL really like git-annex, you should talk with Seth J.)

Personally, I am not excited about either of these two approaches but they would work with TriBITS and github with very little extra effort (e.g. once gitdist supports the dist-foreach command).

I have been waiting to see if Git LFS takes off and someone creates quality open-source server and local implementations. That would really make working with large binary files with git more seamless but I don't have high hopes (see here and here).

For now, Drekar developers might just consider adding a simple script like drekar-repos.sh to the base project repo, with a pull command that calls gitdist pull and then runs svn update on the few SVN repos, and a status command that runs gitdist-status and svn status on the SVN repos, or something similar.
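
A minimal sketch of what such a wrapper could look like. This is only an illustration under assumptions: the SVN_REPOS variable, its default value, and the DRY_RUN switch are hypothetical names introduced here, and the script assumes gitdist and svn are on the PATH.

```shell
#!/bin/sh
# drekar-repos.sh: sketch of a project-specific wrapper around gitdist and svn.
# SVN_REPOS is a hypothetical, space-separated list of SVN working-copy dirs.
SVN_REPOS="${SVN_REPOS:-DrekarSystemTests.svn}"

# Run a command, or just print it when DRY_RUN=1.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Run one svn subcommand on each SVN working copy.
svn_each() {
  for repo in $SVN_REPOS; do
    run svn "$1" "$repo" || return 1
  done
}

drekar_repos() {
  case "$1" in
    pull)   run gitdist pull && svn_each update ;;
    # 'gitdist dist-repo-status' is what the gitdist-status alias expands to.
    status) run gitdist dist-repo-status && svn_each status ;;
    *)      echo "usage: drekar-repos.sh {pull|status}" >&2; return 2 ;;
  esac
}

# Dispatch only when a subcommand is given, so the file can also be sourced.
if [ "$#" -gt 0 ]; then
  drekar_repos "$@"
fi
```

For example, `DRY_RUN=1 ./drekar-repos.sh pull` prints the gitdist and svn commands without running them.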

Otherwise, let's talk.

@bartlettroscoe
Member

FYI:

I cloned the myrepos git repo off github, installed mr under ~/bin/ and then tried to set it up to use with my set of Trilinos git repos:

[8vt@th232 Trilinos (develop)]$ mr register
mr: cannot determine git url

I looked at the source code for mr and the problem is that it is hard-coded to assume that the remote repo is named 'origin'. Because I use multiple git repos and different projects, I don't name my remotes 'origin' so that I don't get confused about which repos I am pointing to in which project. For example, for doing direct Trilinos development, I have:

[8vt@th232 Trilinos (develop)]$ gitdist remote -v | grep -v "^$" 
*** Base Git Repo: Trilinos
amklinv git@github.com:amklinv/Trilinos.git (fetch)
amklinv git@github.com:amklinv/Trilinos.git (push)
bddavid git@github.com:bddavid/Trilinos.git (fetch)
bddavid git@github.com:bddavid/Trilinos.git (push)
github  git@github.com:trilinos/Trilinos.git (fetch)
github  git@github.com:trilinos/Trilinos.git (push)
nightly software.sandia.gov:/space/git/nightly/Trilinos (fetch)
nightly software.sandia.gov:/space/git/nightly/Trilinos (push)
rab-gh  git@github.com:bartlettroscoe/Trilinos.git (fetch)
rab-gh  git@github.com:bartlettroscoe/Trilinos.git (push)
techx   git@github.com:Tech-XCorp/Trilinos (fetch)
techx   git@github.com:Tech-XCorp/Trilinos (push)
...

*** Git Repo: TriBITS
bb-rab  git@bitbucket.com:bartlettra72/tribits (fetch)
bb-rab  git@bitbucket.com:bartlettra72/tribits (push)
casl-dev    git@casl-dev:TriBITS (fetch)
casl-dev    git@casl-dev:TriBITS (push)
github  git@github.com:tribitspub/TriBITS.git (fetch)
github  git@github.com:tribitspub/TriBITS.git (push)
github-rab  git@github.com:bartlettroscoe/TriBITS.git (fetch)
github-rab  git@github.com:bartlettroscoe/TriBITS.git (push)
gsjaardema  git@github.com:gsjaardema/TriBITS.git (fetch)
gsjaardema  git@github.com:gsjaardema/TriBITS.git (push)
nschloe git@github.com:nschloe/TriBITS.git (fetch)
nschloe git@github.com:nschloe/TriBITS.git (push)

With all of these repos, it would be confusing what 'origin' pointed to without having to constantly run git remote -v (which I used to do all of the time).

The issue is that what "origin" points to is determined by what project and what workflow I am using. So when I am working with Trilinos, I see:

[8vt@th232 Trilinos (develop)]$ gitdist-status
----------------------------------------------------------------------
| ID | Repo Dir              | Branch  | Tracking Branch | C | M | ? |
|----|-----------------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base)       | develop | github/develop  |   |   |   |
...
|  7 | TriBITS               | master  | github/master   | 1 |   |   |
|  8 | TriBITS/TriBITSDoc    | master  | bb/master       |   |   |   |
...
----------------------------------------------------------------------

(tip: to see a legend, pass in --dist-legend.)

So it is clear that I am pulling and pushing to the 'github' repos.

When I am working with CASL with Trilinos, I see:

[8vt@th232 VERA (master)]$ gitdist-status

------------------------------------------------------------------
| ID | Repo Dir           | Branch | Tracking Branch | C | M | ? |
|----|--------------------|--------|-----------------|---|---|---|
|  0 | VERA (Base)        | master | casl-dev/master |   |   |   |
|  1 | TriBITS            | master | casl-dev/master |   |   |   |
|  2 | Trilinos           | master | casl-dev/master |   |   |   |
...
------------------------------------------------------------------

(tip: to see a legend, pass in --dist-legend.)

That makes it clear that I am pulling and pushing to the 'casl-dev' repos. See, no confusion!

mr forces all of your git repos to point to 'origin'. What mr should do is just get the tracking branch of the current branch and take the remote from that. It should not hard-code that the name of the remote repo is 'origin'.
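
The lookup described here can be sketched in a few lines of shell. This is not mr's code; the function names tracking_remote and tracking_url are made up for illustration, and only stock git commands are used.

```shell
#!/bin/sh
# Print the remote that the current branch tracks, instead of assuming 'origin'.
tracking_remote() {
  # Fails on a detached HEAD, where there is no current branch.
  branch=$(git symbolic-ref --short HEAD) || return 1
  # Fails (prints nothing) if the branch has no tracking remote configured.
  git config --get "branch.${branch}.remote"
}

# Print the URL of that remote, which is what a register step would record.
tracking_url() {
  remote=$(tracking_remote) || return 1
  git config --get "remote.${remote}.url"
}
```

Run inside a repo whose current branch tracks github/develop, tracking_remote would print the remote name github and tracking_url the URL configured for it.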

But mr has no automated test suite at all so how can you safely modify and maintain it?

@bartlettroscoe
Member

@rppawlo, it looks like git-annex has a new v6 repository mode that comes very close to the ease of use of standard git commands (just git add <large-file> and git commit). It looks very close to the usage of the Git LFS spec (see annex-largefiles). The downside is that it will store two copies of each large file in your local git repo (one in the working tree and one in the .git/ directory). This is a very new feature; it seems it was just released on 2/11/2016 (i.e. git-annex version 6.20160211). My guess is that the author of git-annex copied the ideas of Git LFS to make this easy to use. I am actually pretty excited about this if it actually works and is robust (because it would be essentially an open-source implementation of Git LFS).

Would have to try this out and experiment with it to see how it works. However, if this works out, it might give users and developers an interface much closer to a standard git repo without having to resort to using SVN or some other way of managing large files. I will ask Seth Johnson at ORNL if he has looked into git-annex v6 repository mode yet.

@bartlettroscoe
Member

bartlettroscoe commented Jul 27, 2016

@rppawlo, we should wait and see how Seth responds to the below email but from looking into this new git-annex mode, I think it might be possible to streamline the usage of git-annex with new gitdist commands dist-fetch (which will do a 'git annex sync'), dist-pull (which will do 'git annex sync --content') and dist-push (which might also do 'git annex sync --content'). Once you set up the usage of gitdist and set up the logic for what (large/binary) files git-annex should manage in the committed .gitattributes file, then using this could be as easy as:

$ cd <base-repo>/
$ gitdist-pull  # alias to 'gitdist dist-pull' which also pulls new git-annex files
$ cd large_test_files/
$ git add <some-large-file>
$ git commit
$ cd ..
$ gitdist-status  # alias to 'gitdist dist-repo-status' which could also consider git-annex status
$ gitdist-push  # alias to 'gitdist dist-push' which also pushes new git-annex files

Git-annex meshes with git and gitdist much better than SVN. That is because you can still just run any regular git command in git repos that use git-annex. You can't do that with an SVN repo so it would be very ugly to try to shoehorn in support for SVN. But since git-annex is an extension to git, this should be much more seamless.

One issue is that they will need to install an updated version of git-annex on development.sandia.gov to be able to use git-annex with git repos there. That is something that might require some SEMS support. But before that, we could experiment with a temp bare repo and see how it works. Also, users would need to install git-annex on their local machines. But the SEMS NFS mount could help with that and make it more automatic.

If you are interested in looking into git-annex, perhaps we can look into this together a little? I know that none of us really has time for this type of thing, but if this works out, it could be a very nice solution to a pretty big problem with the usage of git for large binary files and multi-repos, all in one shot.


From: Bartlett, Roscoe A
Sent: Wednesday, July 27, 2016 10:55 AM
To: Johnson, Seth R.
Cc: Pawlowski, Roger P; Evans, Thomas M.
Subject: git-annex v6 repository mode?

Hello Seth,

I know that Exnihilo developers use git-annex for managing large binary files. Therefore, I thought that I might ping you on this …

I was looking at git-annex documentation today and I see that newer versions of git-annex (version 6.20160211 and newer) support a new “v6 repository mode” that looks to make git-annex much easier to use (much like Git LFS):

https://git-annex.branchable.com/tips/unlocked_files/
https://git-annex.branchable.com/tips/largefiles/
https://git-annex.branchable.com/not/

Basically, it allows you to set annex.largefiles settings in our .gitattributes file and then you can just use basic git add, git commit commands to automatically handle large files correctly with git-annex. It looks like it might be a better replacement for ‘auto-annex’ that is documented in the Exnihilo developers guide (attached).

But I am not sure if ‘git push’ and ‘git pull’ result in large files being synced automatically like happens with Git LFS or if you still have to use git-annex commands to do that (e.g. ‘git annex sync --content’). That would be the big question for usability. (But I think I could wrap this into gitdist commands dist-fetch, dist-pull and dist-push to automatically handle git annex repos in this fashion.)

Some discussion of this is in the comment:

I am looking into this because a project at SNL (one of Roger’s projects) is using SVN to handle large binary files. But using git-annex would have some advantages and it interfaces nicer with git. The issue is making git-annex as easy to use as raw git (or getting pretty close). If this works out, we might adopt this for CASL VERA repos as well (i.e. the VERAData repo).

Just curious if you had seen this git-annex “v6 repository mode” and have looked into this at all.

Thanks,

-Ross

@rppawlo
Contributor Author

rppawlo commented Jul 27, 2016

Thanks for looking into this Ross! I'll sit tight until we hear back from Seth.

@rppawlo
Contributor Author

rppawlo commented Jul 27, 2016

Adding @eric-c-cyr to this conversation

bummer about the "mr" tools being hard coded to origin - it looked really good.

@bartlettroscoe
Member

bummer about the "mr" tools being hard coded to origin - it looked really good.

It might just be that the 'mr register' command is hard-coded to 'origin'. Grepping through the script, I don't see other explicit mentions of 'origin'. It might be that if you manually build the .mrconfig file, you can get around that restriction.

I understand the attractiveness of SVN (because many people already know that tool) but git-annex might be worth a look. Let's see what Seth says.

@bartlettroscoe
Member

The response back from Seth and my response to him are given below.

Boy, if GitLab could be installed at SNL and supported Git-LFS, then that is something that SEMS should really look into. That could make GitLab very attractive. Note that Sandia does seem to have a GitLab instance installed (see issue https://sems-jira.sandia.gov/browse/SEMS-1104).

But note that Git-LFS is not all roses:

But of course git-annex and especially SVN are not perfect either. But at least git-annex fits in with a git-based workflow and supports distributed repos. SVN does not at all. The biggest problem with SVN is that it clashes with the git workflow. You can't commit things locally and then push to all the repos all at once with SVN. With SVN, you need to create the commit message at the same time that you push the commit. When multiple repos are involved, this is just a mess. Based on this, I would look at git-annex.

I just wish we had someone at SNL who could dig into all of this for us. Should we ask SEMS to look into this? Dealing with large binary files in a git-based workflow is not a trivial problem. When you start throwing SVN repos into the mix, I don't know how you build a smooth development and multi-repo integration model out of that.


From: Bartlett, Roscoe A
Sent: Thursday, July 28, 2016 1:43 PM
To: 'Johnson, Seth R.'
Cc: Pawlowski, Roger P; Evans, Thomas M.
Subject: RE: [EXTERNAL] Re: git-annex v6 repository mode?

Seth,

Except for push and pull, git-annex with this new repository mode looks almost as easy to use as Git-LFS (but we would have to do a detailed comparison).

Good to hear about Git-LFS support with GitLab. I just found:

Is that supported on the ORNL installation of GitLab (code-int.ornl.gov) currently or is it planned for the future?

It also seems that GitLab supports git-annex as well:

(but not sure about the newest “v6 repository mode”).

So GitLab seems like a good opportunity to compare git-annex and git-lfs side-by-side.

In any case, please let us know how your experimentation with Git-LFS and GitLab goes and how it compares to git-annex.

But at this point, you would still recommend using git-annex over using a separate SVN repo?

Thanks,

-Ross


From: Johnson, Seth R. [mailto:johnsonsr@ornl.gov]
Sent: Thursday, July 28, 2016 12:55 PM
To: Bartlett, Roscoe A
Cc: Pawlowski, Roger P; Evans, Thomas M.
Subject: [EXTERNAL] Re: git-annex v6 repository mode?

Hey Ross,

Thanks for the ping! I'd not heard of that new git-annex mode but it sounds pretty easy to use. I doubt that git-annex will ever work as cleanly as just git, but it sure works easier than a separate SVN repository. We're actually looking into using Git-LFS because that's what may be supported by our local gitlab server. Please let me know what you find out.

Thanks,
Seth

@bartlettroscoe
Member

Below is an email response from Seth and my response to him. To sum up feedback from Seth:

  • Exnihilo is currently using git-annex and they use a script called autoannex.py to automate handling of large files to the Exnihilo git repo with git-annex.
  • They would like to move from git-annex to Git-LFS but there is no support for that yet at ORNL (it could be added to the ORNL code-int.ornl.gov server at some point, but they are not sure how to deal with the large data files that would need to be handled).
  • Seth has not looked into the new git-annex "v6 repository mode" that simplifies the interaction with handling what files get managed with git annex and what files get managed by regular git.
  • Even with the current git-annex v5 implementation (i.e. manual git annex add <file>, etc.), Seth would still recommend using git-annex over an SVN repo.

Since my last comment, I have done a lot more research on git-annex and Git-LFS. I looked at several presentations on youtube including:

At this point I think I know enough about git-annex and Git-LFS to draw some conclusions.

First, Git-LFS is the future for the handling of large binary files with git. There is no question about that. And using Git-LFS would require zero changes to gitdist or checkin-test.py. However, unless you have a recent installation of GitLab Enterprise Edition (EE), GitHub Enterprise, or BitBucket installed locally and supported on your local system, you can't use it for sensitive projects. And even for open-source projects, the support for large files with Git-LFS with GitHub, GitLab, and BitBucket is going to be very limited (because large files and lots of I/O cost real money). Also, there is no open-source Git-LFS server implementation. Therefore, I don't think that Git-LFS is a viable solution for projects that have sensitive data (which therefore needs to be protected behind a firewall) or that need to handle really large files. It may take a while before we can use Git-LFS at the labs due to the need for serious infrastructure (both programmatic support for the supporting tools and the hardware to support it). In a few years, I would think that Git-LFS will overtake everything, including git-annex. But at the present time, I don't see that Git-LFS is a viable option.

If the desire is to keep using gitdist for handling all the repos, then I think that git-annex with using "v6 repository mode" is worth looking into. It may be nearly as easy to use as raw git and it will integrate well with the gitdist tool (if that is indeed the desire). If there is interest in this, then the next step would be to install a recent version of git-annex v6 and prototype its usage at handling large binary files and setting up an rsync git-annex specialremote repo.
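
A rough sketch of what that prototype setup might involve. The remote name and rsync URL are placeholders, the run and setup_rsync_annex helpers are hypothetical, and the exact git-annex options should be checked against the installed version. With DRY_RUN=1 the commands are only printed.

```shell
#!/bin/sh
# Print a command when DRY_RUN=1, otherwise execute it.
run() {
  if [ "${DRY_RUN:-0}" = "1" ]; then echo "$*"; else "$@"; fi
}

# setup_rsync_annex NAME RSYNCURL
# Initialize git-annex in the current repo, register an unencrypted rsync
# special remote, and then sync including the content of annexed files.
setup_rsync_annex() {
  run git annex init &&
  run git annex initremote "$1" type=rsync "rsyncurl=$2" encryption=none &&
  run git annex sync --content
}
```

For example, `DRY_RUN=1 setup_rsync_annex bigfiles annex.example.org:/data/annex` (a made-up host and path) prints the three git annex commands in order.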

However, if the desire is to just keep using SVN for managing large binary files, then I would recommend creating a specialized script called something like drekar-repo and then give it commands pull, status and push which would just call gitdist pull, gitdist-status, and gitdist push, respectively. I will go into the proposal more in the next comment.

In summary, a reasonable plan would be:

  1. Short term: Develop a simple project-specific wrapper script drekar-repo that calls gitdist and SVN commands and tell Drekar developers to always use drekar-repo pull, drekar-repo status and drekar-repo push and not raw pushes or raw gitdist.

  2. Intermediate term: Investigate the usage of git-annex with "v6 repository mode" enabled. See if it is a viable tool (i.e. it works and is relatively easy to use for the targeted workflows). If it is, then set up Drekar to use git-annex and add support for git-annex to gitdist (and checkin-test.py so that CASL VERA can use git-annex).

  3. Long term: Watch and see where Git-LFS goes. It is likely that Git-LFS will become ubiquitous and make every other solution for handling large binary files with git obsolete. But it could take years before that happens.

I will update the title of this Issue to reflect the larger scope.


From: Bartlett, Roscoe A
Sent: Friday, July 29, 2016 2:14 PM
To: 'Johnson, Seth R.'
Cc: Pawlowski, Roger P; Evans, Thomas M.
Subject: RE: [EXTERNAL] git-annex v6 repository mode?

Seth,

Responses inline ...

From: Johnson, Seth R. [mailto:johnsonsr@ornl.gov]
Sent: Thursday, July 28, 2016 7:58 PM
To: Bartlett, Roscoe A
Cc: Pawlowski, Roger P; Evans, Thomas M.
Subject: Re: [EXTERNAL] git-annex v6 repository mode?

The trick may be getting users set up correctly, and able to know when to annex and not to annex files. :)

[Ross] That is the beauty of the git-annex “v6 repository mode”. The git-annex author used the same git clean and smudge filters as Git-LFS to let you specify what types of files should get annexed and which should be handled by regular git and then users can just use git add <file>. For example, in your .gitattributes file, you can specify:

* annex.largefiles=(largerthan=100kb)
*.c annex.largefiles=nothing
*.h annex.largefiles=nothing

[Ross] That will result in all files larger than 100kb, except *.c and *.h files, being annexed automatically when added with a raw git add <file>. See:

[Ross] It looks like you can include/exclude entire directories, files with a given extension, etc. It looks very flexible.

[Ross] What git-annex is lacking is the pre-push hook that Git-LFS uses to automatically send the annexed files to the server using a raw git push command. Git-LFS handles pulling down LFS-managed files by copying them in the smudge filter when you checkout a branch. The details of how this works are given in this nice presentation:

[Ross] I am not sure if git-annex does this or not. We would have to see. But having to run ‘git annex sync --content [--no-push|--no-pull]’ does not seem so bad (and I can add that to new gitdist dist-pull/dist-push commands).

[Ross] If you are interested, you might just take a quick look at these two web pages above (they are not very long).

I guess if you're using a checkin-test script, you could have it query uncommitted files for their kind and size, and automatically annex them based on some heuristic. I have something like this in a standalone script at Exnihilo/environment/python/exnihiloenv/autoannex.py

[Ross] I think that the git-annex “v6 repository mode” makes the usage of something like autoannex.py unnecessary (see the above .gitattributes file).

[Ross] I think the only thing the checkin-test.py script would need is to handle the pull and push operations for annexed repos. And for that, I think I would add support for git annex push/pull to the gitdist script and make the checkin-test.py script use gitdist for pull and push operations. Also, we would need a way to make sure that modifications to annexed files were committed before doing the final test and push. We would need to experiment with git-annex to see how to get that info (but I would guess that ‘git annex status’ would do that).

The ORNL GitLab says they're looking into it but would have to figure out a model for charging for scalable disk access. So I can't provide any feedback on it at this time.

[Ross] What that means is that Git-LFS is not even an option right now at ORNL, right? Opening the door to huge file storage is a bit scary for them I suspect. It looks like code-int.ornl.gov GitLab site at ORNL does not even support git-annex (which was added before support for Git-LFS). A big advantage of git-annex is that you can actually use a different machine to store your annexed files so you can use it even with GitHub or BitBucket. See:

[Ross] This makes your project its own master for storing large binary files. You can just set up an rsync special remote server that everyone in your team can access (using a unix group protection on the server) and then make them point to that. See:

Even so, I'd still vastly prefer git-annex over a separate SVN repository. It means one fewer thing to pull from and keep synchronized; plus the git method of retaining history is clearer.

[Ross] That is what I am thinking. From the documentation that I have seen for the newer git-annex “v6 repository mode”, it seems fairly usable (almost as good as Git-LFS, if it actually works). The big advantage of git-annex that I can see is that it has a fully free and open-source server implementation that you can use right now. There are no open-source server implementations for Git-LFS that I can find. All of the major commercial players have their own server implementations (e.g. GitHub, BitBucket, GitLab, MS VSO) but there is nothing that you can use with your own git repos. The only Git-LFS option for behind the firewall is proprietary commercial GitLab EE (and GitHub Enterprise but that is very expensive). The fact that Git-LFS only works with proprietary implementations bothers me a lot. Given all of this, I think I would sit tight and wait and see where Git-LFS goes. My guess is that in 2 more years there will be a quality open-source server implementation (perhaps GitLab?) and the git community will have worked out all of the bugs. That will be a wonderful time! But until then, we have to get our work done.

Cheers,

-Ross

@bartlettroscoe bartlettroscoe changed the title from "basic svn support for gitdist script" to "gitdist: Support for large binary files or SVN?" Jul 29, 2016
@bartlettroscoe bartlettroscoe changed the title from "gitdist: Support for large binary files or SVN?" to "3gitdist: Support for large binary files or SVN?" Aug 1, 2016
@bartlettroscoe bartlettroscoe changed the title from "3gitdist: Support for large binary files or SVN?" to "gitdist: Support for large binary files or SVN?" Aug 1, 2016
@bartlettroscoe
Member

FYI: Looks like Sandia has GitLab sites (SON and SRN) set up that are supposed to support Git-LFS (but not git-annex). I have created Trilinos JIRA Issue TRIL-62 to document this and investigate if Git-LFS works.

@bartlettroscoe
Member

So after a lot of back and forth with the maintainer of the gitlab.sandia.gov site and a lot of research and experimentation, it seems that Git-LFS works with that site but you have to use HTTPS authentication (SSH authentication does not work with GitLab with Git-LFS). See Trilinos JIRA Issue TRIL-62.

But I think I am far enough along to say that Git-LFS with the SRN site gitlab.sandia.gov may be a viable solution right now to replace SVN to manage large binary files. I think the next step would be to discuss this some and then, if it makes sense, to actually try creating the Drekar git repo copy of the SVN repo using Git-LFS on gitlab.sandia.gov and see how it performs with cloning, changing files, pushing etc. Git-LFS is very easy to use once you have the Git-LFS client installed (which is also very easy to do and we can provide that on all platforms pretty easily I think).

Given what I have learned about Git-LFS, I think that this might be a good option for some of the larger test files with Drekar. It might even be a good option for trimming down some of the larger test files in Trilinos. It seems that if you clone a git repo that manages some files with Git-LFS and the local user is not set up to use Git-LFS, then nothing all that terrible happens (see this comment).
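
For reference, a clone made without the Git-LFS client leaves small text "pointer" files in the working tree in place of the large files, and a pointer file begins with a fixed version line. A sketch of how a script could detect that case (the helper name is_lfs_pointer is hypothetical):

```shell
#!/bin/sh
# Succeed if the given file is a Git-LFS pointer rather than the real content.
is_lfs_pointer() {
  # Pointer files start with a fixed "version" line naming the LFS spec.
  head -n 1 "$1" 2>/dev/null |
    grep -q -x -F 'version https://git-lfs.github.com/spec/v1'
}
```

A test driver could use this to skip, with a clear message, any test whose *.exo input is still a pointer because Git-LFS was not set up on that machine.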

bartlettroscoe added a commit to trilinos/Trilinos that referenced this issue Aug 3, 2016
@bartlettroscoe
Member

bartlettroscoe commented Aug 3, 2016

@rppawlo,

I created a Git-LFS version of the DrekarSystemTests repo on the SNL GitLab servers gitlab-ex.sandia.gov and gitlab.sandia.gov and it went smoothly. It was an easy process and I got it working with Git-LFS on a new machine (shiller) in just a few minutes. The details are below.

If we can get the Git-LFS client installed on to the SEMS NFS mount and the ATTB machines, then it should be trivial for Drekar developers to use a Git-LFS version of the DrekarSystemTests repo. Drekar developers need to run two commands one time on each new machine:

$ git lfs install
$ git config --global credential.helper cache

and they are set and never have to think about Git-LFS again. For machines that don't have the Git-LFS client centrally installed, installing it locally is just a few simple commands (see below) and we could provide a single script that would do this in one shot (i.e. download the Git-LFS client and install it).

The only issue I can see with this is that you have to use HTTPS to authenticate with the current GitLab servers. Therefore, you have to cache your username and password. This might be an issue for automated tests run by Jenkins. But you can use a file to store these and then you never need to type them again (see this TRIL-62 comment) which is allowed for SNL entity accounts.


Detailed Notes:

For the heck of it, I converted the DrekarSystemTests SVN repo to a Git-LFS repo.

First, I copied the Git-LFS client to shiller:

$ scp git-lfs-linux-386-1.3.0.tar.gz shiller:~/.

and then installed it with:

$ ssh shiller
$ tar -xzvf git-lfs-linux-386-1.3.0.tar.gz
$ cd git-lfs-1.3.0/
$ env PREFIX=$HOME ./install.sh
$ git lfs install

(This installed git-lfs into $HOME/bin and I have $HOME/bin set in my PATH env var so that git can find it.)

I set up for git to cache my username and password for HTTPS:

$ git config --global credential.helper cache

I then created the empty private GitLab projects under my account:

I then cloned the empty repo and set up for Git-LFS with:

$ git clone https://gitlab.sandia.gov/28084/DrekarSystemTests.git
$ cd DrekarSystemTests/
$ git lfs track "*.exo"
$ git lfs track "*.gen"
$ git lfs track "*.png"

(The *.xml and other files should be stored by raw git since they are human-written text files.)
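
Each git lfs track call records its pattern in .gitattributes, typically as a line of the form `*.exo filter=lfs diff=lfs merge=lfs -text` (the exact attributes may differ between Git-LFS versions). A small sketch for listing which patterns a repo routes through LFS; the helper name lfs_patterns is made up here, and it assumes patterns contain no spaces:

```shell
#!/bin/sh
# Print the patterns that a .gitattributes file routes through the LFS filter.
# $1: path to the .gitattributes file (defaults to ./.gitattributes)
lfs_patterns() {
  awk '$1 !~ /^#/ && /(^|[ \t])filter=lfs([ \t]|$)/ { print $1 }' \
    "${1:-.gitattributes}"
}
```

With the three track commands above, this would be expected to list *.exo, *.gen and *.png, while *.xml files stay in regular git.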

I then copied the contents of the DrekarSystemTests.svn repo trunk/ directory into the new git repo, removed all of the .svn/ dirs and then did:

$ git add .
$ git commit
$ git remote rename origin gitlab  # I don't like the name origin!
$ git remote add gitlab-ex https://gitlab-ex.sandia.gov/rabartl/DrekarSystemTests.git
$ git push gitlab master
$ git push -u gitlab-ex master

I then moved the existing DrekarSystemTests local repo out of the way and did a fresh clone with:

$ time git clone https://gitlab.sandia.gov/28084/DrekarSystemTests.git
Cloning into 'DrekarSystemTests'...
remote: Counting objects: 284, done.
remote: Compressing objects: 100% (154/154), done.
remote: Total 284 (delta 123), reused 284 (delta 123)
Receiving objects: 100% (284/284), 171.04 KiB | 0 bytes/s, done.
Resolving deltas: 100% (123/123), done.
Checking connectivity... done.
Downloading ATDM/Verification/linear_plasma_waves/dispersion_plots_data/cold_efluid_Bnorm.png (39.46 KB)
...
Downloading vector-restart/vector-restart.gold.exo (285.77 KB)
Checking out files: 100% (245/245), done.

real    0m16.698s
user    0m5.924s
sys 0m2.455s

Here is how the size of the clones match up:

$ du -sh DrekarSystemTests DrekarSystemTests.svn
77M DrekarSystemTests
54M DrekarSystemTests.svn

The Git-LFS repo is larger than the SVN repo because it has to store two copies of every binary file (see this TRIL-62 comment). But you can see that Git-LFS is managing most of the data from looking at:

$ du -sh DrekarSystemTests/.git/* | sort -rh
25M .git/lfs
184K    .git/objects
48K .git/hooks
32K .git/index
16K .git/logs
12K .git/refs
4.0K    .git/packed-refs
4.0K    .git/ORIG_HEAD
4.0K    .git/info
4.0K    .git/HEAD
4.0K    .git/FETCH_HEAD
4.0K    .git/description
4.0K    .git/config
0   .git/branches

See, the .git/lfs directory is storing 25M worth of large files and git itself is only storing 184K of (compressed) regular files.

Now you can use gitdist including DrekarSystemTests with no modifications (just add DrekarSystemTests to the .gitdist file):

$ cd Trilinos/
$ cp .gitdist.default .gitdist
$ echo DrekarSystemTests >> .gitdist

$ gitdist-status
----------------------------------------------------------------------
| ID | Repo Dir             | Branch  | Tracking Branch  | C | M | ? |
|----|----------------------|---------|------------------|---|---|---|
|  0 | Trilinos (Base)      | develop | github/develop   |   |   |   |
|  1 | packages/moocho      | master  | github/master    |   |   |   |
|  2 | packages/Sundance    | master  | origin/master    |   |   |   |
|  3 | packages/CTrilinos   | master  | origin/master    |   |   |   |
|  4 | packages/ForTrilinos | master  | origin/master    |   |   |   |
|  5 | packages/mesquite    | master  | origin/master    |   |   |   |
|  6 | TriBITS              | master  | github/master    |   |   |   |
|  7 | TriBITS/TriBITSDoc   | master  | bb/master        |   |   |   |
|  8 | preCopyrightTrilinos | master  | github/master    |   |   |   |
|  9 | DrekarBase           | master  | ssg/master       |   |   |   |
| 10 | DrekarResearch       | master  | ssg/master       |   |   |   |
| 11 | DrekarSystemTests    | master  | gitlab-ex/master |   |   |   |
----------------------------------------------------------------------

(tip: to see a legend, pass in --dist-legend.)

Now I can pull and push to all of the Drekar repos at once cleanly!

We could streamline the cloning of these repos using the TriBITS clone-extra-repos.py script for the Drekar repos. I can show how to do that if interested.

@eric-c-cyr

eric-c-cyr commented Aug 3, 2016

On internal machines, accessing HTTP requires a proxy. Thus, on some of the clusters, I've used ssh to get to github. Is there a way to set the HTTP proxy in git?

@bartlettroscoe
Member

On internal machines, accessing HTTPS requires a proxy. Thus, on some of the clusters, I've used ssh to get to github.

Eric,

I did not need to set any proxy from the machines shiller or muir in order to access github.sandia.gov or github-ex.sandia.gov. What machines need a proxy? Are these on the SRN, SON or outside of the SNL network?

In any case, hopefully there should be simple instructions to set up the HTTPS proxy on any given machine.

Longer term, the hope is that GitLab will get Git-LFS to use SSH authentication. See:

But note that BitBucket seems to have this figured out:

Is there a way to set the HTTPS proxy?

Don't know. Can you point me to a specific machine that has a problem and perhaps I can give it a try?

@bartlettroscoe
Member

Just a tip, but you can speed up the initial clone using git lfs clone <url> vs. git clone <url>.

The time with raw git clone <url> is:

$ time git clone https://gitlab-ex.sandia.gov/rabartl/DrekarSystemTests DrekarSystemTests.again
Cloning into 'DrekarSystemTests.again'...
remote: Counting objects: 426, done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 426 (delta 206), reused 414 (delta 194)
Receiving objects: 100% (426/426), 223.14 KiB | 0 bytes/s, done.
Resolving deltas: 100% (206/206), done.
Checking connectivity... done.
Downloading ATDM/Verification/linear_plasma_waves/dispersion_plots_data/cold_efluid_Bnorm.png (39.46 KB)
...
Downloading vector-restart/vector-restart.gold.exo (285.77 KB)
Checking out files: 100% (362/362), done.

real    0m25.300s
user    0m5.339s
sys     0m1.291s

The time with git lfs clone <url>:

$ time git lfs clone https://gitlab-ex.sandia.gov/rabartl/DrekarSystemTests DrekarSystemTests.again
Cloning into 'DrekarSystemTests.again'...
remote: Counting objects: 426, done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 426 (delta 206), reused 414 (delta 194)
Receiving objects: 100% (426/426), 223.14 KiB | 0 bytes/s, done.
Resolving deltas: 100% (206/206), done.
Checking connectivity... done.
Git LFS: (45 of 45 files) 24.27 MB / 24.60 MB                                                                                                                                      

real    0m5.141s
user    0m4.379s
sys     0m0.918s

The git lfs clone <url> command gets all of the LFS objects at once. The raw git clone <url> gets the LFS files one at a time.

The same goes for git lfs pull vs. git pull. It is just a performance thing.

I think that TriBITS could be taught about Git-LFS repos (just list them as such in the ExtraRepositoriesList.cmake file, for example with 'GIT-LFS' instead of just 'GIT'), and special gitdist commands dist-pull and dist-push could run git lfs pull and git lfs push, respectively. If a repo is not using Git-LFS, then they would just do the basic pull and push. That could be a value-add for gitdist and TriBITS: automatically figuring out that the Git-LFS client is installed and then using it.
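As a rough sketch of the detection part of that idea (not anything gitdist does today), a wrapper could check two things: whether the repo routes any files through the LFS filter, and whether a git-lfs client is on the PATH. The function names below are hypothetical, and the check only looks at the top-level .gitattributes file, which is an assumption about how the repo is set up:

```shell
#!/bin/sh
# Hypothetical helpers; gitdist has no such functions today.

# True if the repo at $1 routes any path through the LFS filter, i.e.
# its top-level .gitattributes has a "filter=lfs" entry (the kind of
# line that 'git lfs track' writes).
repo_uses_lfs() {
  [ -f "$1/.gitattributes" ] && grep -q 'filter=lfs' "$1/.gitattributes"
}

# True if a git-lfs executable is on PATH.
lfs_client_installed() {
  command -v git-lfs >/dev/null 2>&1
}

# Print "lfs" when both conditions hold, otherwise "plain", so that a
# dist-pull/dist-push wrapper can fall back to the basic commands.
pull_push_mode() {
  if repo_uses_lfs "$1" && lfs_client_installed; then
    echo lfs
  else
    echo plain
  fi
}
```

A caller would run something like pull_push_mode DrekarSystemTests and pick its pull and push commands based on the printed mode.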

@bartlettroscoe
Member

BTW, I went ahead and updated the DrekarSystemTests Git-LFS repo to the current version of the SVN repo, just to show that it is easy to do, in the commit:

commit 76799afc8ef2156e1892e0150309e63daaa48a43
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Wed Aug 10 19:22:57 2016 -0600

    Update snapshot from SVN r128

    r128 | seamill | 2016-08-10 19:00:16 -0600 (Wed, 10 Aug 2016) | 1 line

    Consolidating some tests.

   4.4% ATDM/Verification/linear_plasma_waves/WaveWarmEFluid/TEM/
  10.8% ATDM/Verification/linear_plasma_waves/util_scripts/warm_electron_modules/scripts/
  15.3% ATDM/Verification/linear_plasma_waves/util_scripts/warm_electron_modules/
  16.8% ATDM/Verification/linear_plasma_waves_convergence/WaveColdEFluid/Bnorm/
  22.2% ATDM/Verification/linear_plasma_waves_convergence/WaveColdEFluid/Bpara/
  10.9% ATDM/Verification/linear_plasma_waves_convergence/WaveColdEFluid/Bzero/
   5.8% ATDM/Verification/linear_plasma_waves_convergence/convergence_plots_data/
   4.9% ATDM/Verification/linear_plasma_waves_convergence/util_scripts_convergence/
   3.0% ATDM/Verification/
   4.9% ATDM/braginskii_zeroB/

It is just a few commands to do this:

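$ cd Trilinos/
$ cd DrekarSystemTests.svn/    # my rename of the SVN repo
$ cp -r . ../DrekarSystemTests/    # copy the SVN working copy over the Git-LFS clone
$ cd ../DrekarSystemTests/
$ find . -type d -name .svn -prune -exec rm -rf {} \;    # drop the copied SVN metadata
$ git add .
$ git commit  # just copied in info from the top SVN commit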

@rppawlo
Contributor Author

rppawlo commented Aug 17, 2016

Ross - just occurred to me that there might be another solution. git supports working directly from svn repos.

https://git-scm.com/book/en/v1/Git-and-Other-Systems-Git-and-Subversion

I recall that the moose team did this for quite a long time before moving to github. It's not the most appealing, since there are a few dangerous things you have to manage. Just thought I would mention it. I discussed the git-lfs path with the Drekar team and they were hesitant about adoption. They would like to make sure this is the best long term solution. I think that one issue is that the COE doesn't have a package for git-lfs and as far as I can tell, ubuntu doesn't either. So every computer we run on requires this install. sems support would go a long way.

@bartlettroscoe
Member

Ross - just occurred to me that there might be another solution. git supports working directly from svn repos.

https://git-scm.com/book/en/v1/Git-and-Other-Systems-Git-and-Subversion

I recall that the moose team did this for quite a long time before moving to github. It's not the most appealing, since there are a few dangerous things you have to manage. Just thought I would mention it.

I had not thought about git-svn. The truth is that the DrekarSystemTests repo is really pretty small, so I suspect that even if git-svn is extremely slow (as people seem to be claiming), using git-svn with DrekarSystemTests may not be too bad.

But looking at:

you have to use separate commands for pulling and pushing to the SVN repo using git-svn and this would require training for Drekar developers and gitdist would need to be extended to work with git-svn repos.

To pull with git-svn, you have to use git svn rebase instead of git pull like with git-lfs and raw git. That means that the gitdist script would need a new command called something like dist-pull that would run git svn rebase instead of git pull on a repo that was detected to be a git-svn repo.

To push with git-svn, you have to type git svn dcommit instead of just git push like with git-lfs and raw git. So adopting git-svn would mean that we would have to add a special command to gitdist called dist-push that would have to be trained to run git svn dcommit when called on a git-svn repo. However, using git-lfs to push does not require any changes to gitdist at all.

That is a big advantage of git-lfs over git-svn.
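To make the extra work concrete, here is a minimal sketch of the dispatch that dist-pull and dist-push would need. Both function names are hypothetical, and detecting a git-svn clone by the presence of a .git/svn directory is an assumption about how git-svn lays out its bookkeeping, not something gitdist knows about:

```shell
#!/bin/sh
# Hypothetical helpers; gitdist has no dist-pull or dist-push today.

# True if the repo at $1 looks like a git-svn clone (assumption:
# git-svn keeps its metadata under .git/svn).
is_git_svn_repo() {
  [ -d "$1/.git/svn" ]
}

# Print the command to run on the repo at $1 for the operation $2,
# where $2 is "pull" or "push".
dist_cmd_for() {
  if is_git_svn_repo "$1"; then
    case $2 in
      pull) echo "git svn rebase" ;;
      push) echo "git svn dcommit" ;;
      *) return 1 ;;
    esac
  else
    # Raw git and git-lfs repos need nothing special.
    case $2 in
      pull) echo "git pull" ;;
      push) echo "git push" ;;
      *) return 1 ;;
    esac
  fi
}
```

Note that the git-lfs case falls into the plain branch, which is exactly why it needs no gitdist changes.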

I discussed the git-lfs path with the Drekar team and they were hesitant about adoption.

I definitely understand the apprehension. Having to use HTTPS authentication instead of SSH is the biggest apprehension I personally have. But compared to git-svn, just from what I have seen so far, I would personally use git-lfs over git-svn if I had a choice, and not just because git-lfs does not require any changes to gitdist at all. Git-LFS just fits in better with a git-based workflow than git-svn.

They would like to make sure this is the best long term solution.

From everything that I have read and I have experienced, all things considered, Git-LFS looks like the best long-term solution for managing large binary files with a git-based workflow. It is the most transparent for developers to use and it has what appears to be very broad industry support (e.g. GitHub, BitBucket, and most importantly GitLab) and there will be better and better implementations and more implementations of Git-LFS capable servers as time goes on. I am about 100% sure about that given what I have seen.

I think that one issue is that the COE doesn't have a package for git-lfs and as far as I can tell, ubuntu doesn't either. So every computer we run on requires this install. sems support would go a long way.

Yes, that is an issue, but I think we can make it a simple two-step process to install the Git-LFS client on any SON or SRN machine in each user's home directory using:

  1. Run install-git-lfs.sh as your regular user
  2. Add $HOME/bin to your PATH in your .bash_profile

I would put together the install-git-lfs.sh script and then ask Drekar developers just to try it and see if it works for them. That should be a low investment on their part.
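The install-git-lfs.sh script does not exist yet, so the following is only a sketch of what the two steps might look like. It assumes the git-lfs binary has already been downloaded to some local path (where to get it from is left open), and both function names are made up for illustration:

```shell
#!/bin/sh
# Sketch only: install-git-lfs.sh has not been written yet.

# Step 1: copy an already-downloaded git-lfs binary ($1) into
# <home>/bin, where <home> is $2 or $HOME by default.
install_git_lfs() {
  src=$1
  home=${2:-$HOME}
  mkdir -p "$home/bin" || return 1
  cp "$src" "$home/bin/git-lfs" || return 1
  chmod +x "$home/bin/git-lfs"
}

# Step 2: add $HOME/bin to PATH in the profile file ($1, default
# $HOME/.bash_profile), but only if that exact line is not there yet.
add_bin_to_path() {
  profile=${1:-$HOME/.bash_profile}
  line='export PATH="$HOME/bin:$PATH"'
  grep -qxF "$line" "$profile" 2>/dev/null || echo "$line" >> "$profile"
}
```

After that, each user would still run git lfs install one time, as with any other Git-LFS client install.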

As for support on the systems, things don't look too good for git-svn. I just tried to use git-svn to clone the DrekarSystemTests repo on the SR machine muir using the SEMS-installed git:

$ which git
/projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/bin/git

and it seems that the SEMS team did not install git 2.1.3 to support git-svn:

$ git svn clone svn+ssh://software.sandia.gov/svn/DrekarSystemTests
Can't locate SVN/Core.pm in @INC (@INC contains: /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/share/perl5 /usr/local/lib64/perl5 /usr/local/share/perl5 /usr/lib64/perl5/vendor_perl /usr/share/perl5/vendor_perl /usr/lib64/perl5 /usr/share/perl5 .) at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/share/perl5/Git/SVN/Utils.pm line 6.
BEGIN failed--compilation aborted at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/share/perl5/Git/SVN/Utils.pm line 6.
Compilation failed in require at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/share/perl5/Git/SVN.pm line 33.
BEGIN failed--compilation aborted at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/share/perl5/Git/SVN.pm line 33.
Compilation failed in require at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/libexec/git-core/git-svn line 25.
BEGIN failed--compilation aborted at /projects/sems/install/rhel6-x86_64/sems/utility/git/2.1.3/libexec/git-core/git-svn line 25.

Even the default COE version of git:

$ /usr/bin/git --version
git version 1.7.1

does not seem to support git-svn:

$ /usr/bin/git svn clone svn+ssh://software.sandia.gov/svn/DrekarSystemTests
git: 'svn' is not a git command. See 'git --help'.

Did you mean one of these?
        fsck
        show

Seeing this, it might be easier for the SEMS team and the ATTB machines to install the binary git-lfs than to fix a git install that has a broken git-svn.

However, things look better for git-svn on the ATTB machines. I tried git-svn on the machine shiller and it worked:

$ time git svn clone svn+ssh://software.sandia.gov/svn/DrekarSystemTests
...

real    0m47.089s
user    0m7.434s
sys     0m12.692s

So that is a little longer than the 16s needed to clone the git-lfs repo with:

$ time git clone https://gitlab.sandia.gov/28084/DrekarSystemTests.git
...

real    0m16.698s
user    0m5.924s
sys 0m2.455s

but not too bad.

What is interesting is that the git-svn repo is not on a tracking branch but it does show modified and untracked files:

[rabartl@shiller01 DrekarSystemTests (master)]$ git status
On branch master
Changes not staged for commit:
  (use "git add <file>..." to update what will be committed)
  (use "git checkout -- <file>..." to discard changes in working directory)

        modified:   trunk/config/platform_plugin.py

Untracked files:
  (use "git add <file>..." to include in what will be committed)

        junk.out

no changes added to commit (use "git add" and/or "git commit -a")

The problem is that I made a local commit not pushed to the SVN repo:

fefa5a5 "Added a comment line (don't push)"
Author: Roscoe A. Bartlett <rabartl@sandia.gov>
Date:   Wed Aug 17 09:32:54 2016 -0600 (5 minutes ago)

M       trunk/config/idplatform.py

and I am not sure how to see that commit has not been pushed yet.

With gitdist-status, therefore, this will not show any updated commits:

[rabartl@shiller01 Trilinos (develop)]$ gitdist-status 
------------------------------------------------------------------
| ID | Repo Dir          | Branch  | Tracking Branch | C | M | ? |
|----|-------------------|---------|-----------------|---|---|---|
|  0 | Trilinos (Base)   | develop | github/develop  |   |   | 1 |
|  1 | DrekarBase        | master  | ssg/master      |   |   |   |
|  2 | DrekarResearch    | master  | ssg/master      |   |   |   |
|  3 | DrekarSystemTests | master  |                 |   | 1 | 1 |
------------------------------------------------------------------

That is a problem with the workflow using git-svn. We would need to train the gitdist dist-repo-status command to figure out that there are local commits that have not been pushed to the SVN repo yet.
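One possible way to detect local commits that have not gone to SVN yet is to count the commits on HEAD that are not reachable from the ref that git-svn updates. This sketch assumes that ref is refs/remotes/git-svn, which I believe is the default for a clone made without a trunk/branches/tags layout; check git branch -r in the actual clone. The function name is hypothetical:

```shell
#!/bin/sh
# Sketch only. Assumes the SVN-tracking ref is refs/remotes/git-svn
# unless another ref is passed as $2.

# Print the number of commits on HEAD of the repo at $1 that are not
# yet contained in the SVN-tracking ref.
unpushed_svn_commits() {
  git -C "$1" rev-list --count "${2:-refs/remotes/git-svn}..HEAD"
}
```

A nonzero count is what would go in the C column for a git-svn repo. Running git svn dcommit --dry-run in the repo should also show what would be sent without sending it.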

My initial opinion is that git-svn is more confusing than git-lfs.

In summary, the pros and cons for git-lfs vs. git-svn that I can see so far are:

git-lfs:

  • Pro: Once set up on a given development machine and for a given repo, developers use raw git commands and don't even need to know that git-lfs is being used if they just use the simple centralized single-branch workflow.
  • Pro: Requires no changes to tools like gitdist and checkin-test.py
  • Pro: Cloning is pretty fast (especially if you use 'git lfs clone')
  • Pro: Allows for sharing branches
  • Con: Requires the git-lfs client to be installed on every machine where the repo is cloned.
  • Con: Even if the git-lfs client is globally installed, users must remember to run git lfs install one time for git-lfs to work properly on that machine.
  • Con: Have to rely on a server that supports git-lfs. At Sandia, that means using gitlab-ex.sandia.gov (but the org that supports that seems very helpful and eager to support usage).
  • Con: Have to use HTTPS instead of SSH authentication, and that means that cron and Jenkins jobs have to store the entity account password in a non-encrypted file.

git-svn:

  • Pro: Works with any existing SVN repo on any machine
  • Pro: git-svn is a built-in git command that is supposed to be installed with every installation of git (but we have seen that the SNL COE git 1.7.1 and the SEMS git 2.1.3 installs have this messed up).
  • Pro: Works with SSH authentication
  • Con: Can't really share branches with other developers. Can only safely make local commits and pull and push to the single SVN repo.
  • Con: Developers (and tools like gitdist) have to use special commands to pull, examine the local state, and push.
  • Con: Would have to add new commands to gitdist like dist-pull and dist-push and would have to train dist-repo-status to detect new commits.
  • Con: Would not work with the checkin-test.py script without some work similar to the gitdist tool
  • Con: git-svn does not work on the basic SNL COE installed version of git or the SEMS installed version of git

Rather than all this stuff, Drekar might just consider a simple drekar-rep script with the commands pull, status, and push. That would be easy to write and would likely be enough for Drekar developers until you adopt something better (like Git-LFS once SSH authentication is supported with GitLab, which GitLab is currently working on; 9300 will upgrade GitLab within a month once GitLab CE supports SSH authentication).

We should discuss all of this stuff.

@bartlettroscoe
Member

Looks like GitLab CE is on the verge of allowing Git-LFS to use pure SSH authentication with the GitLab server (but transferred over HTTPS). See this update. We don't care how the objects are copied (SSH or HTTPS are fine, both are encrypted), we only care about how the authentication is done.

Hopefully we will see this on the SNL GitLab server in a few months. We need to wait for this to get merged to the GitLab 'master' branch and put into an official release (which go out once a month) and then wait for the people in SNL org 9300 to update the SNL GitLab servers.

@bartlettroscoe
Member

FYI:

It looks like the next release of GitLab CE will have support for pure SSH authentication for Git-LFS. See:

I think this means that this will show up on the SNL GitLab servers in a few months.

@bartlettroscoe
Member

Good news, the latest release of GitLab has the full SSH authentication for Git-LFS:

Now we just need to get them to install it for the SNL GitLab servers.

@bartlettroscoe
Member

Looks like they already installed GitLab 8.12 on the SNL GitLab servers but it does not seem to work for SSH Git-LFS authentication yet.


From: Bartlett, Roscoe A
Sent: Tuesday, October 18, 2016 10:28 PM
To: Hickey, Richard A
Subject: RE: GitLab 8.12 with pure SSH for Git-LFS

It looks like it does not work:

$ git lfs clone git@gitlab.sandia.gov:28084/DrekarSystemTests.git
Cloning into 'DrekarSystemTests'...

…

remote: Counting objects: 426, done.
remote: Compressing objects: 100% (223/223), done.
remote: Total 426 (delta 206), reused 414 (delta 194)
Receiving objects: 100% (426/426), 223.14 KiB | 0 bytes/s, done.
Resolving deltas: 100% (206/206), done.
Checking connectivity... done.
Git LFS: (0 of 45 files) 0 B / 24.60 MB                                                                                                                                            
Post /idp/Authn/AuthMenu/menu;jsessionid=F21790870C138646A04F987CFB4FB6E3?conversation=e1s1: stopped after 3 redirects
Post /idp/Authn/AuthMenu/menu;jsessionid=F21790870C138646A04F987CFB4FB6E3?conversation=e1s1: stopped after 3 redirects

This cloned the repo but failed to replace the LFS-managed files with the full versions as shown by, for example:

[rabartl@crf450 DrekarSystemTests (master)]$ cat ./normal_tangent_bc/tangent_bc.gold.exo
version https://git-lfs.github.com/spec/v1
oid sha256:aa67419494db421f6ef43131c403b1c164d3001d5e4137e971c8448e81c7ad9e
size 237404
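A quick way to check whether a given file is still one of these pointer stubs is to look at its first line, which in a pointer is always the version line shown above. This is just a sketch, and the function name is made up:

```shell
#!/bin/sh
# Sketch only: true if the file at $1 is still a Git-LFS pointer stub,
# i.e. its first line is the LFS spec version line, rather than the
# real file content.
is_lfs_pointer() {
  head -n 1 "$1" 2>/dev/null |
    grep -qxF 'version https://git-lfs.github.com/spec/v1'
}
```

For example, is_lfs_pointer ./normal_tangent_bc/tangent_bc.gold.exo would succeed in the clone above, since that file was never replaced with its full version.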

Anyway, not urgent but it would be great if we could get this to work.

Thanks,

-Ross

@bartlettroscoe
Member

Looks like Git-LFS with SSH should work on the SNL GitLab servers. I need to try that out. See:

@bartlettroscoe
Member

FYI: Git-LFS is working on the ORNL GitLab servers code-int.ornl.gov and code.ornl.gov.

But since this story has gotten way off scope and since Drekar is going to just use a raw git repo (or not change anything), there is no sense keeping this Issue open any longer.

If there is a desire to teach gitdist about Git-LFS, then we can open a new issue. But for now, I think it makes sense to close this story as wontfix.
