Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

builtins.gitLsFiles #2944

Open
nomeata opened this issue Jun 13, 2019 · 16 comments
Open

builtins.gitLsFiles #2944

nomeata opened this issue Jun 13, 2019 · 16 comments

Comments

@nomeata
Copy link
Contributor

nomeata commented Jun 13, 2019

I started to use this file here:

# The function call
#
#   gitSource ./toplevel subpath
#
# creates a Nix store path of ./toplevel/subpath that includes only those files
# tracked by git. More precisely: mentioned in the git index (i.e. git add is enough
# to get them to be included, you do not have to commit).
#
# This is a whitelist-based alternative to manually listing files or using
# nix-gitignore.

# Internally, it works by calling git ls-files at evaluation time. To
# avoid copying all of `.git` to the git store, it only copies the least amount
# of files necessary for `git ls-files` to work; this is a bit fragile, but
# very fast.

with builtins;

# We read the git index once, before getting the subdir parameter, so that it
# is shared among multiple invocations of gitSource:

let
  filter_from_list = root: files:
    let
      all_paren_dirs = p:
        if p == "." || p == "/"
        then []
        else [ p ] ++ all_paren_dirs (dirOf p);

      whitelist_set = listToAttrs (
        concatMap (p:
          let full_path = toString (root + "/${p}"); in
          map (p': { name = p'; value = true; }) (all_paren_dirs full_path)
        ) files
      );
    in
    p: t: hasAttr (toString p) whitelist_set;

  has_prefix = prefix: s:
    prefix == builtins.substring 0 (builtins.stringLength prefix) s;
  remove_prefix = prefix: s:
    builtins.substring
      (builtins.stringLength prefix)
      (builtins.stringLength s - builtins.stringLength prefix)
      s;

  lines = s: filter (x : x != [] && x != "") (split "\n" s);
in

if builtins.pathExists ../.git
then
  let
    nixpkgs = (import ./nixpkgs.nix).nixpkgs {};

    git_dir =
      if builtins.pathExists ../.git/index
      then ../.git
      else # likely a git worktree, so follow the indirection
        let
          git_content = lines (readFile ./../.git);
          first_line = head git_content;
          prefix = "gitdir: ";
          ok = length git_content == 1 && has_prefix prefix first_line;
        in
          if ok
          then /. + remove_prefix prefix first_line
          else abort "gitSource.nix: Cannot parse ${toString ./../.git}";

    whitelist_file =
      nixpkgs.runCommand "git-ls-files" {envVariable = true;} ''
        cp ${git_dir + "/index"} index
        echo "ref: refs/heads/master" > HEAD
        mkdir objects refs
        ${nixpkgs.git}/bin/git --git-dir . ls-files > $out
      '';

    whitelist = lines (readFile (whitelist_file.out));

    filter = filter_from_list ../. whitelist;
  in
    subdir: path {
      name = baseNameOf (toString subdir);
      path = if isString subdir then (../. + "/${subdir}") else subdir;
      filter = filter;
    }

else
  trace "gitSource.nix: ${toString ../.} does not seem to be a git repository,\nassuming it is a clean checkout." (
    subdir: path {
      name = baseNameOf (toString subdir);
      path = if isString subdir then (../. + "/${subdir}") else subdir;
    }
  )

This is essentially doing what this code fragment from builtins.fetchGit does:

GitInfo gitInfo;
gitInfo.rev = "0000000000000000000000000000000000000000";
gitInfo.shortRev = std::string(gitInfo.rev, 0, 7);
auto files = tokenizeString<std::set<std::string>>(
runProgram("git", true, { "-C", uri, "ls-files", "-z" }), "\0"s);
PathFilter filter = [&](const Path & p) -> bool {
assert(hasPrefix(p, uri));
std::string file(p, uri.size() + 1);
auto st = lstat(p);
if (S_ISDIR(st.st_mode)) {
auto prefix = file + "/";
auto i = files.lower_bound(prefix);
return i != files.end() && hasPrefix(*i, prefix);
}
return files.count(file);
};
gitInfo.storePath = store->addToStore("source", uri, true, htSHA256, filter);

The main difference to using something like builtsin.fetchGit is

  • You can get a subdirectory, and files outside the directory never reach the store
  • Even if the working directory is not dirty, you don’t go through the git fetch/git archive/tar dance

And JFTR, the main benefits over approaches that parse .gitignore, or regex-based whitelists, are:

  • It does not pick up random files that you happen to have dropped in the repository (e.g. gperf.out) which may be large, or have sensitive data that you don’t want to reach /nix/store.
  • It does not pick up files that you have not added to git yet, avoiding the case where a local nix-build works, but not for your colleages, because you forgot to add source files.

But obviously there is a hack here, with this import from derivation, and it would be much cleaner if there was a
builtins.gitLsFiles that would take a path (which can be a subdirectory of the repository), run

auto files = tokenizeString<std::set<std::string>>(runProgram("git", true, { "-C", uri, "ls-files", "-z" }), "\0"s);

and return that as a list of strings. This would also allow useful things like combining it with further filters.

Since the code is essentially already there, as part of builtins.fetchGit, would it be feasible to expose it as a builtin?

@Kha
Copy link
Contributor

Kha commented Jul 11, 2019

I wholeheartedly agree with everything @nomeata has said (and so do three other people, it seems). For the vast majority of packages, this is the most sensible solution for setting up src.

If this feature is deemed desirable but lacks someone to implement it, I would volunteer for that.

@edolstra
Copy link
Member

Wouldn't it be better to add a filter argument to fetchGit? Functions like gitLsFiles seem very non-declarative and non-hermetic to me. (IMHO you shouldn't do arbitrary filesystem traversals in Nix - having a readDir primop was a mistake.)

@Kha
Copy link
Contributor

Kha commented Jul 11, 2019

Mmh, a filter argument would admittedly solve my biggest issue with fetchGit. But I actually like that the name gitLsFiles implies a non-hermetic operation mostly useful for local development (with the assumption being that non-development contexts should not have .git folders lying around, so gitLsFiles would be harmless/useless there anyway). The trouble with fetchGit is that it is not any more hermetic than gitLsFiles when used on a dirty working tree, but the name does not imply this at all - I would never have guessed that this function could check out a state that does not correspond to any git commit until I accidentally observed it doing just that.

@nomeata
Copy link
Contributor Author

nomeata commented Jul 12, 2019

Wouldn't it be better to add a filter argument to fetchGit? Functions like gitLsFiles seem very non-declarative and non-hermetic to me. (IMHO you shouldn't do arbitrary filesystem traversals in Nix - having a readDir primop was a mistake.)

A filter and a subdir argument to fetchGit, plus clear documentation about what fetchGit does in a dirty tree, would solve my use case, I believe.

@nomeata
Copy link
Contributor Author

nomeata commented Aug 22, 2019

Ping @hamishmack

@nomeata
Copy link
Contributor Author

nomeata commented Nov 7, 2020

Fund! I just noticed that this has spread to haskell.nix (https://github.com/input-output-hk/haskell.nix/blob/master/lib/clean-git.nix) and Serokell (serokell/serokell.nix#1).

Guess people like the semantics, would be nice to have support for git ls-files it in nix directly (so that evaluation doesn’t have to build a git binary first and to avoid the Import-from-Derivation)

@Kha
Copy link
Contributor

Kha commented Nov 8, 2020

For me, this has been/will be solved by the introduction of flakes, which cannot see untracked files by default.

@nomeata
Copy link
Contributor Author

nomeata commented Nov 8, 2020

Hi Sebastian! Didn’t know you are also traveling in nix land :-)

I guess if flakes have this feature somehow inside, even more the reason to also offer it somehow to “plain nix” maybe, until everything is a flake.

@nixos-discourse
Copy link

This issue has been mentioned on NixOS Discourse. There might be relevant details there:

https://discourse.nixos.org/t/filtersource-and-friends-do-not-compose-with-fetchgit-and-friends/10466/2

@stale
Copy link

stale bot commented Jun 12, 2021

I marked this as stale due to inactivity. → More info

@stale stale bot added the stale label Jun 12, 2021
@dminuoso
Copy link

dminuoso commented Jan 16, 2023

I agree with @nomeata. We should have this in plain nix, if not just for the simple reason that flakes are "not even experimental" currently.
There was a PR that has subsequently been closed for being stale #3171 - I think this in principle addresses the concerns mentioned here.

@stale stale bot removed the stale label Jan 16, 2023
@roberth
Copy link
Member

roberth commented Jan 17, 2023

I imagine this could be achieved through fetchTree, which is not a flake-specific builtin; just developed in conjunction with it. The goal of Source tree abstraction is to improve the performance, and it's quite possible that it will do about as little as git ls-files does when traversing its returned directory structure with readDir.

The main difference to using something like builtsin.fetchGit is

  • You can get a subdirectory, and files outside the directory never reach the store
  • Even if the working directory is not dirty, you don’t go through the git fetch/git archive/tar dance

These goals seem achievable with a lazy fetchTree implementation for git fetching.

@nomeata
Copy link
Contributor Author

nomeata commented Jan 17, 2023

So you are saying we can add a GitWorkdirAccessor implementation of the InputAccessor abstraction that uses git ls-files to know which files to ignore?

@roberth
Copy link
Member

roberth commented Jan 18, 2023

Something like that, yes. Not sure if you're asking to implement it? @edolstra did you plan to work on a libgit2 fetcher implementation?

The main benefit of a libgit2 implementation is that it by default sidesteps the impurities that the git CLI adds on top of the otherwise rather clean tree+blob model.

@nomeata
Copy link
Contributor Author

nomeata commented Jan 18, 2023

No, no urgent plans to dive into the C++ code, sorry :-). But I'll cheer for whoever does.

@roberth
Copy link
Member

roberth commented Dec 29, 2023

Update: the upcoming Nix release reimplements fetchGit and fetchTree with libgit2, achieving the desired lazy behavior.
Currently remaining issues (using "blocked on" syntax to get notified when done)

See also

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging a pull request may close this issue.

6 participants