Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Rewrite parsing logic of gix-url #990

Merged
merged 28 commits into from
Sep 26, 2023
Merged

Conversation

niklaswimmer
Copy link
Contributor

The goal of this PR is to make the implementation of the url parser simpler. I consider this PR done when the test suite (without the baseline tests) succeeds again. Follow up PRs can gradually improve the results of the baseline tests until we can turn it on by default.

The old implementation is not yet removed, because I use it as a reference and that works way better if the file still exists in my local repository. It can be removed before merging this PR.

The first two commits of this PR I wanted to merge together with #965 but forgot to push them :)

Makes sure to always include the expected and (if available) actual URL
in the panic message. This makes reasoning of failed baseline test much
easier.
This change does not affect the number of passing tests, because all
tests that were affected by this wrong assertion now fail for some other
reason.
Doubles the amount of baseline tests that pass :)
This change is intentionally kept very simple and does not try to
immediately fix all now failing tests. When comparing the test output of
this implementation with the previous there are a few things worth
mentioning:

- all previously passing baseline tests still do so
- the new implementation correctly classifies paths starting with a `/`
  or `./` but containing a `:` as file paths
- the new implementation correctly classifies paths containing just
  letters before the first `:` as scp paths
- the new implementation fails all tests that check the returned errors
  for invalid paths
- the new implementation seems to fail for some windows paths

The next step is to evaluate the now failing tests, and either remove/rewrite
them if they are based on flawed assumptions (looking at you
'ssh_alias_needs_username_to_not_be_considered_a_filepath') or fix the new
implementation to make them pass again.

We can then work on making the baseline tests pass.

The change also keeps the previous parsing logic around and just removes
it from the module graph, to allow for local comparisons of the two
implementations during development.
The `MissingResourceLocation` error variant was previously returned if
the input URL had either scheme ssh (no matter if URL or SCP format) or
git and the URL contained no or an empty path section. In the future we
will instead return the `MissingRepositoryPath` error in such a case.

The only benefit of such a seperation is that it allows the caller to
infer the kind of URL we tried to parse. A future commit will add a new
field to the `MissingRepositoryPath` error variant which will clearly
communicate this information and can therefore be used instead for this
purpose.
The `NotALocalFile` error variant was previously returned if the input
url contained a colon but the slice after that character contained no
slash character (`/` or `\`). This behavior is wrong.

SCP like URLs may not need a slash character to specify the repositories
location. Similarly, a local directory that contains a colon in its name
is a valid repository path as well.

The new implementation correctly parses such URLs.
All error variants now own a copy of the input which is used in their
display implementation. I noticed that URL parsing errors are almost never
acted upon but only wrapped and bubbled up to the user. It therefore
makes sense to ensure the message for this error is informative enough
by default.

If an error can be thrown for different types of URLs it now also
includes a kind field. With this field the caller can determine what
kind of URL we tried to parse. Additionally, this information is used
for the error message too.

For testing the assert_matches crate was added which implements the
unstable feature with the same name in stable rust. It makes testing the
invalid variants much more convenient.

BREAKING because public fields of some error variants are now no longer
available
@Byron
Copy link
Owner

Byron commented Sep 9, 2023

Is there anything I can do to help with this PR?
An option could also be to reduce its scope to get intermediate, but better-than-before, results merged.
Thanks for letting me know 🙏.

@niklaswimmer
Copy link
Contributor Author

I am sorry for the long silence, please know that I am still interested in this. And now that I have finally finished the Netflix series I have been spending my free time on for the past few weeks, I am confident that I will be finishing this PR soon :)

The behavior of the nextest command stays the same by default but allows
the user to specify the flags.
@niklaswimmer niklaswimmer force-pushed the gix-url-parse-rewrite branch 2 times, most recently from 8dcfd16 to 673946a Compare September 10, 2023 14:03
@niklaswimmer
Copy link
Contributor Author

It will probably take me another few days to finish Windows, especially because using a VM and everything slows down my workflow. But we will see how much needs fixing

@Byron
Copy link
Owner

Byron commented Sep 11, 2023

If the work with windows is too aggravating, I can also do it for you - with my setup it's still quite terrible, but for gix-url it will work well enough. In other words, I don't want you to loose momentum or happiness over windows, it's not worth it 😁.
Just let me know.

@niklaswimmer
Copy link
Contributor Author

No worries, I think my workflow will be quite workable - I just need to set it up first, which I am doing right now. If I have any problems I will let you know!

I think the previous behavior was actually the correct one for Windows
systems, but because my main dev machine is Linux, it is easier to make
sure everything works correctly on those platforms first. I will fix
Windows next.
@niklaswimmer
Copy link
Contributor Author

Do you want me to just rebase or to reserve the history/commit dates and merge main into this branch. I am happy to do either.

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

You can do what's best for you, or what you see fit, I don't have a preference.

For more context see d6f90be.

We however no longer implement the backslash to forward slash
conversion. There is no good reason for why Git does that conversion
and we do not need it to stay compatible (Git can handle both versions
just fine). Additionally, the logic for the conversion was very involved
and to implement it proper we would need to detect the shell we are
executing in (Git performs the conversion only in bash, not in cmd/ps).
Git parses that as UNC url, but we do not yet implement UNC paths on
Windows.
@niklaswimmer niklaswimmer marked this pull request as ready for review September 25, 2023 18:16
@niklaswimmer
Copy link
Contributor Author

Everything done now! I can write the changelog entry if you want - I probably have a much easier time summarizing the changes having worked on this for so long :) You would need to tell me how to do that correctly with cargo-smart-release though

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

Great, I'd appreciate it!

I think the subject should be feat: <subject> if it's a non-breaking feature, or feat!: <subject> if it's a little breaking after all.

The body can contain whatever markdown you want for further explanation. This would have to go to at least one commit to be picked up by cargo smart-release.

Please feel free to rewrite commits as you see fit.

@niklaswimmer
Copy link
Contributor Author

I only focused on public API changes when marking my commits for inclusion in the changelog and explicitly left out any behavior changes that only affect what URLs we parse and how we parse them. My intend was to just write something along the lines of the following at the top of the changelog.

This release contains many internal changes to the URL implementation that bring it closer to the reference implementation in Git. Gitoxide can now parse more URLs and be more correct while doing so! Highlights include the correct parsing of SSH aliases and an improvement to the handling of file URLs. Please note that there may be additional more subtle changes in the parsing behavior due to the rewrite. If you notice any derivations from how Git handles an URL please open an new issue!

Idk about the wording just did something quick. Or do you think we need to be more precise?

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

Since the merge of main was the last thing that happened, I have removed it by resetting to 49b278a - please beware.

Idk about the wording just did something quick. Or do you think we need to be more precise?

In theory, one would mark each feature or change as they happen to the respective crate by using conventional commit messages, exactly like you did. Adding a top-level paragraph in gix itself is a great idea - I think it's fair to put this underneath a new ## Unreleased header in CHANGELOG.md and gix/CHANGELOG.md - that way different kinds of audiences are most likely to see it. Could you do that?

In the meantime, I will start reviewing and will probably push a few changes to this branch as well, so let's not force push anymore :).

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

Oh, I realized that now after the reset, the changelogs are outdated. Then I think I have just discovered my preference for, in this case, rebasing on top of main instead of merging it. Then you can add the changelog entries, and when done, I can do the review and push smaller adjustments myself.

Thanks again 🙏

@niklaswimmer
Copy link
Contributor Author

Force pushed the rebase. No more forces from now on I agree :)

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

Hehe, thanks! I think now you'd be able to make the CHANGELOG.md adjustments I mentioned earlier (just to have your name associated with the commit) and then I'd consider this ready-for-review if you agree :).

@niklaswimmer
Copy link
Contributor Author

Just to make sure, the top level changelog is for the binaries and the one in the gix folder for the lib, right?

@Byron
Copy link
Owner

Byron commented Sep 25, 2023

Yes, that's correct. In theory, you could also drop the paragraph into gix-url/CHANGELOG.md for completeness.

niklaswimmer and others added 3 commits September 25, 2023 22:20
We run them to assure there are no regressions, while acknowledging that
not all succeed right now.

The generic test was removed in favor of assuming our code will not actually
have a non-unix&windows codepath, making it differeint on unix and windows and
nothing else.
Also, add iformation about serialization precision - usually there may be some
difference when serializing a parsed URL back into a string, but ideally it leads
to the same outcome.
Copy link
Owner

@Byron Byron left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks again for tackling this!

I am still in the process of reviewing everything, but wanted to share that I really like the new parser implementation. It's much cleaner for sure and great basis to get to a perfectly passing baseline test.

I even learned some tricks like the hidden and hard-to-discover testing module (which I hope one day we can get rid of), and the parse module trickery to be able to quickly swap out implementations.

My goal today is to complete the remaining tasks (beware: I may force push during the time) and then merge the PR. And while writing this, I hope you are still motivated drive the implementation to the level of perfection it deserves :).

Tasks

  • run baseline tests, put them to work
    • baseline tests have been changed so that they are actually running and can work with what's there. Gradual improvements can be done that way, while protecting from regressions.
  • re-add all deleted tests in such a way that it's clear whether they should pass, or that they don't pass for good reason.
    • All manually added tests where probably added with a specific purpose in mind and that's value to try not to loose.
  • assure there is support for non-UTF-8 file-paths, assure there is a test for that.
  • Add fuzzing (similar to pathspec) - will be picked up by google oss fuzz as well

@Byron Byron merged commit a12e4a8 into Byron:main Sep 26, 2023
18 checks passed
@Byron
Copy link
Owner

Byron commented Sep 26, 2023

Thanks so much, great work!

It was quite nice to see the fuzzer not find anything even after 24m tested inputs, and I don't know if that would have worked with the old implementation ;).

I have a feeling that once the remaining TODOs are resolved, most if not all of the baseline tests would work as well. In any case, I will be looking forward to seeing gradual improvements if you are still interested :).

@niklaswimmer
Copy link
Contributor Author

Thanks again for your patience with this PR, you can definitely expext more in the next weeks :)

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants