Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Autogenerate gix-packetline-blocking/src #1340

Merged
merged 19 commits into from
Apr 9, 2024

Conversation

EliahKagan
Copy link
Member

@EliahKagan EliahKagan commented Apr 9, 2024

Fixes #1310

This replaces the symbolic link at gix-packetline-blocking/src with a copied source tree from gix-packetline/src, with headers prepended to state that they are copies and, for each file, what file it is a copy of (other than that header).

The headers are in the format suggested in #1310 (comment), except that a blank line (i.e., two newlines rather than one) appears after each one. In files where there are already //! comments at the top, I am unsure if a blank line is correct, or if it should instead be omitted or a second line of just //! added to signify a paragraph break. Currently there is only one such file, gix-packetline/src/lib.rs.

This also adds a shell script, etc/copy-packetline.sh, which was used to perform the copy, and can be used at any time to update the copy when changes are made in gix-packetline/src. It can be invoked directly, or by running just copy-packetline. Both approaches work even if gix-packeltine-blocking/src has is absent or has been checked out as a regular file due to core.symlinks being false or symbolic links being otherwise unavailable.

However, the latter approach, even though at this time no Rust tool is used, will still print an error message that may mislead the person running it into thinking it has failed. It shows the same message as in #1310 (comment). But this message actually always happens whenever a just command is run when the project structure is unacceptable to cargo. The cause is:

https://github.com/Byron/gitoxide/blob/68d1a29e8e497fa130e5733d1ecb6b7694a489ca/justfile#L193

It seems to me that it should be possible to keep that message from being printed for recipes that don't rely on it. But I think that ought be considered outside the scope of this PR.

The script is used on CI to check that it has no effect. For now I have added three jobs with a matrix strategy, but this is unnecessary to check that the script has been run if needed; for that we just need one. Instead, running it on three platforms checks that the script works on all of them (I don't have a macOS system). Because it does not test the full range of the script's functionality--as noted below, the script implements specific rules for when it is willing to make the copy--I also tested the script, I believe thoroughly, locally on Windows and Ubuntu.

I've kept all three jobs so if the script has to be changed during further work on this PR then they will be there for testing. I think all but the ubuntu-latest job can ultimately be removed.

The script tries to refuse only when its effects could lead to data loss or confusion. The rules it follows, for the target, are implemented and commented in the check_target_dir shell function. Basically:

  • If nothing at all exists at the target location gix-packetline-blocking/src on disk, then always proceed. This is safe, and it provides a simple way to get to a state where the script will regenerate the files: deleting the entire subtree indicates a willingness for it to be replaced.
  • If something exists at the target on disk and there is no merge in progress, then proceed so long as all changes at or under the target location are staged and no ignored files are present there.
  • If something exists at the target location on disk and there is a merge in progress (whether due to a conflict or a deliberately uncommitted merge), then proceed so long as there are no changes of any kind an the target location (not even staged changes) and no ignored files are present there.

The script does not stage its own changes. This helps avoid data loss, makes it easy to keep changes from the script identifiable by staging and then running it, and makes the script suitable for use in a pre-commit hook if desired.

The script decides between \n and \r\n for the line endings of each file's prepended header by checking how the first line of the file being copied ends. I believe this is a better approach that analyzing the file as a whole, since if a file is to be inconsistent, it ought not to be made more inconsistent by having more separate sections with different line endings. Because various platforms, git configurations, and approaches to editing files affect line endings, I think there is no reasonable way to decide this besides to look at the files themselves. They are checked in a way that overcomes the strong tendency of tools in Cygwin-like environments (including MSYS2 and "Git Bash") to claim that \r\n is merely \n.

The script performs a single pass, adding the headers at the time the files are written, rather than copying the files and then modifying them. This has some advantages, which might justify retaining this approach over others that have been discussed, as detailed below.

It may be that this shell script should be partly or wholly replaced with a tool coded in Rust. I believe that was the idea in #1310 (comment): that a shell script would check if it is safe and reasonable to copy the file and, if so, do so and then invoke a Rust tool to prepend the headers. The reason not to write the whole thing in Rust was that it can fail initially if there is no symlink or directory tree. I embarked on writing the script with the idea that I would write the whole thing as a shell script to get clear on exactly what it should do, then pare down the script, replacing the header-prepending code with a Rust tool.

As I wrote the script, I found that it was unexpectedly tricky, relative to the apparent simplicity of the task, to implement it as a sufficiently portable shell script that was robust to rare but plausible oddities. That made me think as much of it should be turned into Rust as possible. But I persisted in implementing the whole thing as a shell script initially, because I also noticed some advantages of it being implemented that way. I now suspect those advantages are sufficient to justify it remaining as a shell script in its entirety:

  • Copying the files and then prepending their headers runs the risk of having copies without headers, that appear to be files that are reasonable to edit and that are easy to edit by accident instead of the correct ones, if the task is interrupted or fails. Failure is likely as long as running a separate Rust tool requires cargo to be able to run and see files where they're supposed to be everywhere in the workspace; it may fail due to conceptually unrelated breakages in any crate.
  • Copying the files and then prepending their headers seems like it will introduce delays, and occasional minor annoyances, due to language servers (as used in editors and IDEs) triggering an analysis of the files as they are originally created, and then again when they are edited. The delay in between may be enough for this to happen, and slight experimentation suggests this is a real issue at least on Windows, but I do not really know for sure how big of a problem this would be. (In particular, this sort of thing could be even further ameliorated by avoiding removing an existing tree before updating it, and I did not go that far.)
  • The script is very fast on CI, which runs the script directly rather than through just, because no Rust toolchain has to be installed.
  • When one is making frequent edits inside gix-packetline/src, one might wish to perform the updates automatically using a pre-commit hook. Having them done by a script that doesn't depend on other state of the project benefits this, mainly in that one could leave the hook enabled all the time even when making other unrelated changes that don't require cargo to accept the state of the workspace.

I am inclined to find those factors persuasive when taken together, but I am not certain. In particular, this should be weighed against one of the benefits of having as much of this logic in Rust as possible: more gitoxide developers are likely to be able to work efficiently with the code, since gitoxide is a Rust project. (On the other hand, gitoxide does rely substantially on shell scripts as fixtures.) In addition, while this is the sort of thing one would think a shell script would be great for, it actually incurs some complexities due to limitations related to portability that I believe would be absent in Rust code. Ways the script is weird or cumbersome in service of portability are detailed below.

Although I did use CI checks during development (rebasing onto a branch of my fork in which CI runs on all branches), I've decided not to deliberately push each commit separately in this PR as I often do in GitPython. I'd be pleased to do something that like on request. The reasons I've refrained are that I don't know if it would be welcome here, with the more involved CI checks, and that I wonder if I may need to rebase this anyway, as I think this is a case where messages in the conventional commits style aren't called for, but I'm not sure.


Cygwin-like environments on Windows, including MSYS2 and the MSYS2-ish "Git Bash" environment, are weird, and some of the weirdness, as mentioned above and as commented in the script, constrains the approaches that can be taken to distinguish \r\n from mere \n, especially since the script should avoid using anything that may not be available on other systems that have nothing to do with Windows. This is covered in full in the big comment in first_line_ends_crlf, and thus not detailed here.

Bash offers a number of nice features beyond what are standardized for sh by POSIX. But many of them were introduced in Bash 4, which macOS does not and probably never will ship. This needs to work on macOS systems where no newer Bash has been installed; furthermore, most scripts in the project specify /bin/bash as the interpreter, and it could be confusing to change that to /usr/bin/env bash just for this. (I have tried to adhere to the style of other scripts.) Before major version 4, Bash lacks recursive globbing (globstar), the ability to have the rightmost command in a pipeline affect the calling environment (lastpipe), and the ability to check the exit status of a command run in process substitution (process substitution would otherwise be a workaround for not having lastpipe).

I assume Bash 3 or later, and the existence of command-line tools that are required by POSIX (in addition, I try to avoid using any that are, in practice, ever absent; for example, file is often not installed in Docker images). This is with one notable exception: I assume that find, which I'm using because I don't have recursive ** globs, supports the -print0 action; if it does not, then the script will fail (but signal the failure properly, due to set -o pipefail, which is supported even in Bash 3).

The contents of this repository are generally trusted, and I doubt anyone will deliberately create a file with a newline character in its name (which Windows can't tolerate, but other systems can) without good reason. But:

  • There may be a reason to have such files, for the purpose of testing gitoxide (or its interaction with other tools such as gpg or ssh) in their presence.
  • They can be created by accident.
  • Assuming the full range of possible filenames cannot occur goes wrong more often than one tends to predict.

Of course, this can be made fully portable by using the -exec action. But this has a couple of problems:

  • It makes lots of subprocess invocations (or incurs the further complexity of also implementing a loop to receive multiple paths from the + form).
  • It is hard to portably stop the entire traversal on the first failure.
  • I did it that way first and the script was more complicated, longer, and harder to read.

Fortunately -print0 seems in practice to be very portable. GNU findutils, BusyBox find (as in minimal environments or Alpine Linux), and the find command in BSD systems, including macOS, all support it, and have supported it for a long time.

It may be that there is a much simpler and more elegant way to do this than I think, even in a shell script. Ordinarily, when writing an explicit loop in a shell script, one should consider alternatives (and I have considered some). However, before proceeding with an idea to simplify the script, the general problem that most Unix-style tools that conceptually operate on text do not properly distinguish between \n and \r\n in Cygwin-like environments should be considered. If the cumbersome state of the script ought to be ameliorated, then I suspect that replacing it with Rust may be the overall best approach. I am also curious if there is some way to do the whole thing in Rust yet let it work even if cargo frowns on the state of the workspace.

By copying the files with added headers, instead of having a
symlink.

Right now this is entirely as a shell script. It may be beneficial
to rewrite at least some parts, especially those that are unnatural
to implement portably as a shell script even given Bash, in Rust.
This was always intended to work, but didn't because script
subprocesses always used the original command name/path, but this
was after changing directory.

The script already relies on the repository being organized a
particular way, so relying on itself (or another script suitable to
do its work) being at the known location is no worse. This also
makes it unnecessary to guard against the rare {}-in-path case.

This change has the further benefit that if the script is ever
successfully invoked via a Windows-style path with "\" separators
(which can happen on Windows), its subprocesses can still execute.

This also removes an altogether ineffective attempt to support the
rare case of a working tree whole top-level directory name ends in
a newline character, and instead changes the comment to note that
this does work. A few other comments throughout are reworded for
clarity as well.

It may make sense to change the script more deeply than this (even
before replacing part of it with a Rust tool). By sacrificing a
small amount of portability, a loop could be used, as find
recognizes -print0 in GNU/Linux (via GNU findutils), most BSD
systems including macOS, and busybox (thus also Alpine Linux).
Unfortunately a recursive glob, which would be even nicer, is not
supported by versions of bash that ship with macOS.
This modifies the script so that:

- When the target does not exist on disk at all, we proceed, even
  if the deletion is an unstaged change.

- When there is an uncommitted merge, as happens when a conflict is
  being resolved (and occasionally otherwise), recognize this case
  (rather than misinterpreting the short status output), proceeding
  only if there are no changes at all at the target. Even staged
  changes cause the copy to be cancelled in that case.

- Some cumbersomely expressed code is written more clearly, a
  couple of specific error messages are added for what should be
  rare failures of git commands, checking for grep failure is
  removed (grep should not actually fail if used correctly), and
  the big comment in first_line_ends_crlf is made more accurate.
This doesn't happen on Windows filesystems, but can happen
elsewhere. It can happen on occasion by accident.
`find -print0` is not POSIX, but supported systems include:

- GNU/Linux systems, using GNU findutils
- Other Linux-based systems using BusyBox find (e.g. Alpine Linux)
- BSD systems, including macOS (all versions, since 10)

Since this is a Bash script, and even the old versions of Bash on
macOS systems support `read -d ''` to read null-byte-delimited
input, assuming `find -print0` is supported everywhere needed, this
is highly likely to work everywhere needed. It allows filenames to
be safely read even if they contain weird characters including
newlines, as can happen accidentally (hopefully not intentionally).

It would be nicer to loop through the result of a recursive glob
(globstar, **), but Bash only supports that starting in version 4,
which no version of macOS ships (or is likely to ship).

The benefit of this approach, compared to the `find -exec` way used
before, is that the code is simpler, no longer needs to call itself
as a subprocess through find, and no longer needs a help to clarify
a mode that is not meant to be used from the outside. (It may also
be slightly faster.)

This makes some behavioral changes, in areas where the design had
been driven by the implementation rather than the other way around:

- We stop on the first failure.

- Because of that, there is no need to restate what files could not
  be generated (it is at most one, and the failure should show it).

- As noted above, this script can no longer be invoked to process
  an individual file (which as not a design goal), and it therefore
  no longer accepts any command-line arguments (they are ignored).

This also fixes a misspelling the big first_line_ends_crlf comment.
This also refactors slightly in other ways, clarifies a comment,
and improves an error message.
For the real purpose of this new check, which is to verify that
`just copy-packetline` has been run when needed, to ensure that
gix-packetline-blocking/src is not out of date, it should only be
necessary to run this check on one platform.

So this should probably be changed to `runs-on: ubuntu-latest`
instead of a matrix strategy soon. I'm running it on three platforms
initially to check the script (though some functionality of the
script is not exercised).

I've also tested the script locally on Ubuntu 22.04.4 LTS, and on
Windows 10 (in Git Bash).
Somewhat unintuitively, the "-" is considered to be a file operand
and not an option, even though it treated to signify standard input
rather than a file named "-". On some systems including macOS,
options to cat cannot follow paths. On those systems, writing "-"
before "--" causes "--" to be treated as a filename, causing the
script to break on macOS.

This changes the order, which lets the cat command work on all
systems, though (as before) if the subsequent argument were a
literal "-" then that would not have the desired effect. In this
case that is okay, since we actually know that the other argument,
the value of the source_file variable, is not exactly "-".

(We even know that it does not begin "-". But keeping the "--"
before it makes clearer to human readers that this argument is a
path and never an option.)
This speeds things up, as no cargo or just commands have to be run,
and no Rust toolchain has to be installed.

If part of all of the script is replaced by a tool implemented in
Rust, then of course this should (and probably must) be revisited.
Since --porcelain is guaranteed stable across future git versions.
Only the secondary reason, which is the less important one, was
given before. Now both are given and the more important reason is
emphasized.

(If the entire subtree is deleted on disk, staged or not, then we
do proceed, since neither disadvantage applies. I think this is
clear from context, since this is checked first, and like the
other checks, it is commented.)
These are copies, except for their headers, and generated by the
etc/copy-packetline.sh script.

This should make the new CI checks, which use the script, pass.
So when gix-packetline-blocking/src was not updated, the diff from
running the script is shown in full.

Right now this is to investigate how the script does not seem to
work on macOS, but it may be useful to keep it.
While a command like `find foo` should add a `/` to make paths like
`foo/bar`, some find implementations, including in macOS, add their
own second slash even when there already was one, so `find foo/`
outputs paths like `foo//bar`. (The current POSIX standard seems to
forbid this, but it seems early standards did not, and in any case
it is the behavior of find even on recent macOS systems.)

That broke the copy-packetline.sh script on macOS:

- Unsightly extra slashes were added to the headers placed in the
  copied/generated files inside gix-packetline-blocking/src.

- Due to this happening on macOS, the script produced different
  output on macOS and other systems, preventing a precise portable
  CI check that the worktree remains clean even after running it.

This commit addresses that by not adding a `/` to the path anymore,
and making the required slight change to how the source prefix is
removed (to be replaced by the target prefix) to accommodate this.

That prefix removal was the original reason I had added a `/`
originally. It served two purposes: to make the nature of the
replacement slightly clearer to human readers, and to safeguard
against a prefix that was not a whole path component. The first
goal remains valuable but the benefit was slight, so it's okay to
lose out on that. The second is no longer needed, because it was
related to a way the script was more complex and thus more prone to
to error at that time; the associated check was already removed.
@EliahKagan EliahKagan marked this pull request as ready for review April 9, 2024 05:13
Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a million for this improvement!

Please note that I just took a quick look and didn't read any of the PR description yet, which I will get to in a bit.

etc/copy-packetline.sh Outdated Show resolved Hide resolved
EliahKagan and others added 3 commits April 9, 2024 02:37
This reruns `just copy-packetline`, turning the comments in files
in gix-packetline-blocking/src from doc comments to regular ones.
- avoid `target` prefix as it reminds me of `target/` too much
- run script on only one platform on CI, with explanation
- make CI failures an error
@EliahKagan
Copy link
Member Author

Building on d25e5d3, another option for names could be input and output.

Copy link
Member

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks a lot for this wonderful work, leading to a very impactful fix! Finally, we can use cargo on Windows again.

I only have made a couple of smaller adjustments. One change that I am not perfectly certain about is the removal of the continue-on-error: true line, which I thought was a left-over rather than intentional. If that's wrong, let's fix it before the merge.

Although I did use CI checks during development (rebasing onto a branch of my fork in which CI runs on all branches), I've decided not to deliberately push each commit separately in this PR as I often do in GitPython. I'd be pleased to do something that like on request. The reasons I've refrained are that I don't know if it would be welcome here, with the more involved CI checks, and that I wonder if I may need to rebase this anyway, as I think this is a case where messages in the conventional commits style aren't called for, but I'm not sure.

I'd be happy to keep the current style of pushing once, as otherwise I am pretty sure CI would be blocked for a long time which could be an impediment.

If the cumbersome state of the script ought to be ameliorated, then I suspect that replacing it with Rust may be the overall best approach. I am also curious if there is some way to do the whole thing in Rust yet let it work even if cargo frowns on the state of the workspace.

I understand the reasoning for the script, find it very elaborate and well made, and chose to consider it read-only. It's not that it's unmaintainable, it's just that it has one purpose that it fulfils so we can move on to greener pastures :).

@Byron
Copy link
Member

Byron commented Apr 9, 2024

Building on d25e5d3, another option for names could be input and output.

I am done with the refactor and encourage you to bring in last minute changes, also to the names - the ones proposed are shorter, and they are also not names target :).

I plan to merge once CI is complete, but probably only in an hour or so.

@EliahKagan
Copy link
Member Author

Thanks, I'll push a change to rename the identifiers shortly.

In the copy-packetline.sh script:

- Rename all identifiers with "source" to "input", and with
  "destination" or "target" to "output". This includes variables
  and functions.

- Adjust messages accordingly. This is not a simple search and
  replace, since sometimes "target" becomes "output location" (to
  avoid giving the impression that it must always being as a dir).

- Adjust comments accordingly, but use the word "destination" in a
  few comments that had said "target", where it's clear it applies
  to "output" and seems to be clearer wording than "output".

- A couple very small unrelated comment rephrasings for clarity.

In the copy-packetline job definition in ci.yml:

- Put `fail-fast: false`, which has no effect on a matrix that only
  generates one job, but will aid in producing results for all three
  platforms if the other two have to be reenabled to test a change
  to the script. This is what I had been using `continue-on-error`
  for, but `fail-fast: false` is the better way to do it (and also
  avoids confusion with the other meanings of `continue-on-error`).

- Tweak indentation so that removing the leading `# ` on the
  macos-latest and windows-latest entries would place them in the
  list (so that if they ever have to be reenabled to test changes
  to the script, then this will be slightly easier).
@Byron
Copy link
Member

Byron commented Apr 9, 2024

Thank you, much better!

I also just now realise that the new CI job takes a mere 3 seconds, so running more of these isn't ever going to be an issue.
The only argument I can make to leave it as is, is that it's less noisy with just one job, and less confusing as well since it would otherwise be puzzling why this has to be tested on all platforms (if one knows the intention of the script). Thus, I think I can indeed merge in a moment.

@EliahKagan
Copy link
Member Author

I only have made a couple of smaller adjustments. One change that I am not perfectly certain about is the removal of the continue-on-error: true line, which I thought was a left-over rather than intentional. If that's wrong, let's fix it before the merge.

Really I should never have used that, because it was applied to each job in the matrix for the purpose of ensuring that results from all platforms appeared during testing, rather than to specific jobs considered experimental. I should have used fail-fast: false instead. The effect of fail-fast makes no difference to a matrix strategy of only one job, but I've put that in so that, if the other jobs ever have to be enabled to test a change to the script, then all platforms' results will be produced and shown.

I also just now realise that the new CI job takes a mere 3 seconds, so running more of these isn't ever going to be an issue. The only argument I can make to leave it as is, is that it's less noisy with just one job, and less confusing as well since it would otherwise be puzzling why this has to be tested on all platforms (if one knows the intention of the script). Thus, I think I can indeed merge in a moment.

I think it is less confusing with just one, yes. If the script will rarely be edited, then I think it's better to have just the ubuntu-latest job for the check.

@Byron Byron merged commit a9b783d into GitoxideLabs:main Apr 9, 2024
19 checks passed
@EliahKagan EliahKagan deleted the copy-packetline branch April 9, 2024 16:55
@EliahKagan EliahKagan changed the title Autogenerate ci-packetline-blocking/src Autogenerate gix-packetline-blocking/src May 5, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

building on windows fails due to symlinks in working tree
2 participants