Rewrite parsing logic of gix-url #990

niklaswimmer · 2023-08-19T12:38:33Z

The goal of this PR is to make the implementation of the url parser simpler. I consider this PR done when the test suite (without the baseline tests) succeeds again. Follow up PRs can gradually improve the results of the baseline tests until we can turn it on by default.

The old implementation is not yet removed, because I use it as a reference and that works way better if the file still exists in my local repository. It can be removed before merging this PR.

The first two commits of this PR I wanted to merge together with #965 but forgot to push them :)

Makes sure to always include the expected and (if available) actual URL in the panic message. This makes reasoning of failed baseline test much easier.

This change does not affect the number of passing tests, because all tests that were affected by this wrong assertion now fail for some other reason.

Doubles the amount of baseline tests that pass :)

This change is intentionally kept very simple and does not try to immediately fix all now failing tests. When comparing the test output of this implementation with the previous there are a few things worth mentioning: - all previously passing baseline tests still do so - the new implementation correctly classifies paths starting with a `/` or `./` but containing a `:` as file paths - the new implementation correctly classifies paths containing just letters before the first `:` as scp paths - the new implementation fails all tests that check the returned errors for invalid paths - the new implementation seems to fail for some windows paths The next step is to evaluate the now failing tests, and either remove/rewrite them if they are based on flawed assumptions (looking at you 'ssh_alias_needs_username_to_not_be_considered_a_filepath') or fix the new implementation to make them pass again. We can then work on making the baseline tests pass. The change also keeps the previous parsing logic around and just removes it from the module graph, to allow for local comparisons of the two implementations during development.

The `MissingResourceLocation` error variant was previously returned if the input URL had either scheme ssh (no matter if URL or SCP format) or git and the URL contained no or an empty path section. In the future we will instead return the `MissingRepositoryPath` error in such a case. The only benefit of such a seperation is that it allows the caller to infer the kind of URL we tried to parse. A future commit will add a new field to the `MissingRepositoryPath` error variant which will clearly communicate this information and can therefore be used instead for this purpose.

The `NotALocalFile` error variant was previously returned if the input url contained a colon but the slice after that character contained no slash character (`/` or `\`). This behavior is wrong. SCP like URLs may not need a slash character to specify the repositories location. Similarly, a local directory that contains a colon in its name is a valid repository path as well. The new implementation correctly parses such URLs.

All error variants now own a copy of the input which is used in their display implementation. I noticed that URL parsing errors are almost never acted upon but only wrapped and bubbled up to the user. It therefore makes sense to ensure the message for this error is informative enough by default. If an error can be thrown for different types of URLs it now also includes a kind field. With this field the caller can determine what kind of URL we tried to parse. Additionally, this information is used for the error message too. For testing the assert_matches crate was added which implements the unstable feature with the same name in stable rust. It makes testing the invalid variants much more convenient. BREAKING because public fields of some error variants are now no longer available

Byron · 2023-09-09T16:58:07Z

Is there anything I can do to help with this PR?
An option could also be to reduce its scope to get intermediate, but better-than-before, results merged.
Thanks for letting me know 🙏.

niklaswimmer · 2023-09-10T13:00:49Z

I am sorry for the long silence, please know that I am still interested in this. And now that I have finally finished the Netflix series I have been spending my free time on for the past few weeks, I am confident that I will be finishing this PR soon :)

The behavior of the nextest command stays the same by default but allows the user to specify the flags.

niklaswimmer · 2023-09-10T17:39:08Z

It will probably take me another few days to finish Windows, especially because using a VM and everything slows down my workflow. But we will see how much needs fixing

Byron · 2023-09-11T06:10:06Z

If the work with windows is too aggravating, I can also do it for you - with my setup it's still quite terrible, but for gix-url it will work well enough. In other words, I don't want you to loose momentum or happiness over windows, it's not worth it 😁.
Just let me know.

niklaswimmer · 2023-09-11T13:21:42Z

No worries, I think my workflow will be quite workable - I just need to set it up first, which I am doing right now. If I have any problems I will let you know!

I think the previous behavior was actually the correct one for Windows systems, but because my main dev machine is Linux, it is easier to make sure everything works correctly on those platforms first. I will fix Windows next.

niklaswimmer · 2023-09-25T15:25:50Z

Do you want me to just rebase or to reserve the history/commit dates and merge main into this branch. I am happy to do either.

Byron · 2023-09-25T16:46:08Z

You can do what's best for you, or what you see fit, I don't have a preference.

For more context see d6f90be. We however no longer implement the backslash to forward slash conversion. There is no good reason for why Git does that conversion and we do not need it to stay compatible (Git can handle both versions just fine). Additionally, the logic for the conversion was very involved and to implement it proper we would need to detect the shell we are executing in (Git performs the conversion only in bash, not in cmd/ps).

Git parses that as UNC url, but we do not yet implement UNC paths on Windows.

niklaswimmer · 2023-09-25T18:29:29Z

Everything done now! I can write the changelog entry if you want - I probably have a much easier time summarizing the changes having worked on this for so long :) You would need to tell me how to do that correctly with cargo-smart-release though

Byron · 2023-09-25T18:32:27Z

Great, I'd appreciate it!

I think the subject should be feat: <subject> if it's a non-breaking feature, or feat!: <subject> if it's a little breaking after all.

The body can contain whatever markdown you want for further explanation. This would have to go to at least one commit to be picked up by cargo smart-release.

Please feel free to rewrite commits as you see fit.

niklaswimmer · 2023-09-25T18:49:01Z

I only focused on public API changes when marking my commits for inclusion in the changelog and explicitly left out any behavior changes that only affect what URLs we parse and how we parse them. My intend was to just write something along the lines of the following at the top of the changelog.

This release contains many internal changes to the URL implementation that bring it closer to the reference implementation in Git. Gitoxide can now parse more URLs and be more correct while doing so! Highlights include the correct parsing of SSH aliases and an improvement to the handling of file URLs. Please note that there may be additional more subtle changes in the parsing behavior due to the rewrite. If you notice any derivations from how Git handles an URL please open an new issue!

Idk about the wording just did something quick. Or do you think we need to be more precise?

Byron · 2023-09-25T19:15:17Z

Since the merge of main was the last thing that happened, I have removed it by resetting to 49b278a - please beware.

Idk about the wording just did something quick. Or do you think we need to be more precise?

In theory, one would mark each feature or change as they happen to the respective crate by using conventional commit messages, exactly like you did. Adding a top-level paragraph in gix itself is a great idea - I think it's fair to put this underneath a new ## Unreleased header in CHANGELOG.md and gix/CHANGELOG.md - that way different kinds of audiences are most likely to see it. Could you do that?

In the meantime, I will start reviewing and will probably push a few changes to this branch as well, so let's not force push anymore :).

Byron · 2023-09-25T19:17:39Z

Oh, I realized that now after the reset, the changelogs are outdated. Then I think I have just discovered my preference for, in this case, rebasing on top of main instead of merging it. Then you can add the changelog entries, and when done, I can do the review and push smaller adjustments myself.

Thanks again 🙏

niklaswimmer · 2023-09-25T19:28:28Z

Force pushed the rebase. No more forces from now on I agree :)

Byron · 2023-09-25T19:30:43Z

Hehe, thanks! I think now you'd be able to make the CHANGELOG.md adjustments I mentioned earlier (just to have your name associated with the commit) and then I'd consider this ready-for-review if you agree :).

niklaswimmer · 2023-09-25T19:32:09Z

Just to make sure, the top level changelog is for the binaries and the one in the gix folder for the lib, right?

Byron · 2023-09-25T19:33:20Z

Yes, that's correct. In theory, you could also drop the paragraph into gix-url/CHANGELOG.md for completeness.

We run them to assure there are no regressions, while acknowledging that not all succeed right now. The generic test was removed in favor of assuming our code will not actually have a non-unix&windows codepath, making it differeint on unix and windows and nothing else.

Also, add iformation about serialization precision - usually there may be some difference when serializing a parsed URL back into a string, but ideally it leads to the same outcome.

Byron

Thanks again for tackling this!

I am still in the process of reviewing everything, but wanted to share that I really like the new parser implementation. It's much cleaner for sure and great basis to get to a perfectly passing baseline test.

I even learned some tricks like the hidden and hard-to-discover testing module (which I hope one day we can get rid of), and the parse module trickery to be able to quickly swap out implementations.

My goal today is to complete the remaining tasks (beware: I may force push during the time) and then merge the PR. And while writing this, I hope you are still motivated drive the implementation to the level of perfection it deserves :).

Tasks

run baseline tests, put them to work
- baseline tests have been changed so that they are actually running and can work with what's there. Gradual improvements can be done that way, while protecting from regressions.
re-add all deleted tests in such a way that it's clear whether they should pass, or that they don't pass for good reason.
- All manually added tests where probably added with a specific purpose in mind and that's value to try not to loose.
assure there is support for non-UTF-8 file-paths, assure there is a test for that.
Add fuzzing (similar to pathspec) - will be picked up by google oss fuzz as well

Instead of removing failing tests, make clear why their expectation changed now.

Byron · 2023-09-26T13:19:58Z

Thanks so much, great work!

It was quite nice to see the fuzzer not find anything even after 24m tested inputs, and I don't know if that would have worked with the old implementation ;).

I have a feeling that once the remaining TODOs are resolved, most if not all of the baseline tests would work as well. In any case, I will be looking forward to seeing gradual improvements if you are still interested :).

niklaswimmer · 2023-09-26T15:29:06Z

Thanks again for your patience with this PR, you can definitely expext more in the next weeks :)

niklaswimmer added 4 commits August 10, 2023 18:46

Improve error printing for gix-url baseline test

0e9a18e

Makes sure to always include the expected and (if available) actual URL in the panic message. This makes reasoning of failed baseline test much easier.

Fix wrong assertion in gix-url baseline test

f18bfd8

This change does not affect the number of passing tests, because all tests that were affected by this wrong assertion now fail for some other reason.

Add support for "git+ssh" and "ssh+git" urls

3faac36

Doubles the amount of baseline tests that pass :)

niklaswimmer force-pushed the gix-url-parse-rewrite branch from 0be1843 to 43bd55e Compare August 19, 2023 15:44

Byron mentioned this pull request Aug 21, 2023

Suggest cargo install --git when missing registry package looks like a git* URL rust-lang/cargo#10522

Closed

niklaswimmer added 5 commits August 23, 2023 18:31

Return errors for empty paths and relative urls

8af79ec

Use same logic for file urls and local file paths

679a7f4

niklaswimmer force-pushed the gix-url-parse-rewrite branch from f0a0d2d to 38db0c1 Compare August 24, 2023 09:01

Handle missing host and path separator in file url

f05edae

Byron added the feedback requested label Sep 9, 2023

Add new summarize command to justfile

11a616d

The behavior of the nextest command stays the same by default but allows the user to specify the flags.

niklaswimmer force-pushed the gix-url-parse-rewrite branch 2 times, most recently from 8dcfd16 to 673946a Compare September 10, 2023 14:03

niklaswimmer added 5 commits September 10, 2023 16:06

Remove wrong ssh alias url test

4da440d

Add missing slash to file url test

ffaead6

Thanks clippy

d077f56

Use helper function for common url parsing steps

2344f91

Fix dropping of host during file url conversion

f3e1331

niklaswimmer force-pushed the gix-url-parse-rewrite branch from 673946a to 3ab7719 Compare September 10, 2023 17:20

Fix file urls with dos driver letter on unix

99f4447

I think the previous behavior was actually the correct one for Windows systems, but because my main dev machine is Linux, it is easier to make sure everything works correctly on those platforms first. I will fix Windows next.

niklaswimmer force-pushed the gix-url-parse-rewrite branch from 3ab7719 to 6c46ad6 Compare September 12, 2023 16:15

niklaswimmer added 2 commits September 25, 2023 18:48

Align test with real behavior

a31af62

Git parses that as UNC url, but we do not yet implement UNC paths on Windows.

niklaswimmer force-pushed the gix-url-parse-rewrite branch from 19771d7 to b638b06 Compare September 25, 2023 18:15

niklaswimmer marked this pull request as ready for review September 25, 2023 18:16

Byron force-pushed the gix-url-parse-rewrite branch from b638b06 to 49b278a Compare September 25, 2023 19:12

niklaswimmer force-pushed the gix-url-parse-rewrite branch from 49b278a to a31af62 Compare September 25, 2023 19:27

niklaswimmer and others added 3 commits September 25, 2023 22:20

Update changelogs

4349353

refactor baseline tests

4b4ac8a

Also, add iformation about serialization precision - usually there may be some difference when serializing a parsed URL back into a string, but ideally it leads to the same outcome.

Byron reviewed Sep 26, 2023

View reviewed changes

Byron force-pushed the gix-url-parse-rewrite branch from 66b4cd7 to ed1a669 Compare September 26, 2023 10:56

Byron added 3 commits September 26, 2023 13:17

refactor

e318a4c

assure we don't loose test coverage; possibly adjust expecations

30bb7dc

Instead of removing failing tests, make clear why their expectation changed now.

feat: enable fuzzing for git url parsing

4184a5e

Byron force-pushed the gix-url-parse-rewrite branch from ed1a669 to 4184a5e Compare September 26, 2023 12:32

Byron merged commit a12e4a8 into Byron:main Sep 26, 2023
18 checks passed

niklaswimmer mentioned this pull request Sep 29, 2023

gix-url parses SSH aliases as file paths #933

Closed

1 task

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Rewrite parsing logic of gix-url #990

Rewrite parsing logic of gix-url #990

niklaswimmer commented Aug 19, 2023

Byron commented Sep 9, 2023

niklaswimmer commented Sep 10, 2023

niklaswimmer commented Sep 10, 2023

Byron commented Sep 11, 2023

niklaswimmer commented Sep 11, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

Byron left a comment •

edited

Loading

Byron commented Sep 26, 2023

niklaswimmer commented Sep 26, 2023

Rewrite parsing logic of gix-url #990

Rewrite parsing logic of gix-url #990

Conversation

niklaswimmer commented Aug 19, 2023

Byron commented Sep 9, 2023

niklaswimmer commented Sep 10, 2023

niklaswimmer commented Sep 10, 2023

Byron commented Sep 11, 2023

niklaswimmer commented Sep 11, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

niklaswimmer commented Sep 25, 2023

Byron commented Sep 25, 2023

Byron left a comment • edited Loading

Choose a reason for hiding this comment

Tasks

Byron commented Sep 26, 2023

niklaswimmer commented Sep 26, 2023

Byron left a comment •

edited

Loading