Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

URL shortener support excludes IDNs #1224

Closed
1 task done
schisne opened this issue Dec 28, 2023 · 6 comments · Fixed by #1227
Closed
1 task done

URL shortener support excludes IDNs #1224

schisne opened this issue Dec 28, 2023 · 6 comments · Fixed by #1227
Assignees
Labels

Comments

@schisne
Copy link

schisne commented Dec 28, 2023

Did you use the FAQ section?

  • Yes, I have read the FAQ and I found no solution/answer there.

When using a URL shortener hosted on an internationalized domain name (IDN), PrivateBin treats valid responses from the shortener as incorrect. As the root cause, I've located a regex in client-side JavaScript that assumes the URL is ASCII only.

https://datatracker.ietf.org/doc/html/rfc2181#page-13
https://datatracker.ietf.org/doc/html/rfc3490

Submitting this as a bug rather than a pull request because I imagine the maintainers would prefer to decide philosophically how to handle URL parsing rather than being dictated to by a stranger. :)

Steps to reproduce

  1. Host a URL shortener at an internationalized domain and generate an API key for PrivateBin
  2. Configure PrivateBin's urlshortener option with the API key as documented
  3. Configure CORS to allow the shortener service
  4. Submit a paste in PrivateBin
  5. Click the "Shorten URL" button

What happens

A message appears at the top of the page saying, "Cannot parse response from URL shortener." Looking at the developer console in the browser, the AJAX request succeeded with a 200 response, and the response body contained a valid URL.

What should happen

The shortened URL in the successful response body should display. There is a regex in privatebin.js that limits the domain portion of a URL to [-a-zA-Z0-9@:%._\+~#=]{1,256}. If I edit this segment of the regex in the Chromium console to .{1,256}, the shortened URL does display properly.

Incidentally, the same regex limits the TLD portion of the URL to six characters. While this isn't impacting me directly, there are many TLDs longer than six characters: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

One approach to a solution could be to modify the regex to include more valid domains. Another could be to require the URL shortener to represent its output in Punycode and, in JavaScript, decode the URL to its proper IDN form before displaying. Perhaps encode to Punycode, then run through a regex to validate, then decode again. Up to you how to tackle--I'm just a user. :)

Additional information

Error message:

image

Inline-edited regex (right) with resulting successful flow (left, domain obfuscated):

image

Basic information

Server OS: Alpine Linux (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)

Webserver: nginx (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)

Browser: Chromium 120 (also reproduced in Firefox 121)

PrivateBin version: 1.6.2

I can reproduce this issue on https://privatebin.net: No (for reasons that seem entirely logical to me :) )

@schisne schisne added the bug label Dec 28, 2023
@elrido
Copy link
Contributor

elrido commented Dec 29, 2023

Right, that is a pretty clear case of the trouble you can get yourself into with regexes. The code in question:

const shortUrlMatcher = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g;

The intent here is, from my reading, to support different URL shorteners, even those that return the URL embedded in some document (HTML, JSON, plain text, ...). So the regex is used to try and extract something that looks like a URL, but also to prevent it to pick up something that just happens to look like one (a limited form of validation).

We must either relax the regex or make it a lot more complex to support IDNs and longer TLDs. I think we should a) relax the regex to detect more possible URLs (maybe down to any string starting with http:// or https:// and until the next non-word character) but b) then use a browser native API to validate if the finding is an actual URL or not.

One risk I can think of is with URLs at the end of a sentence? It would be nice if we had some sample snippet returned from URL shorteners to use in a unit test? @schisne I can probably capture one from my ancient YOURLs instance, could you provide one from your instance (obfuscate what you need, main thing is to get the type of content around the URL bit). That would be greatly appreciated and let us prevent regressions in the area, down the line.

@elrido elrido self-assigned this Dec 29, 2023
@schisne
Copy link
Author

schisne commented Dec 29, 2023 via email

@elrido
Copy link
Contributor

elrido commented Dec 29, 2023

Ok, plain text, that should be simple to create some samples for. It's good you point out that all these domains will (likely) end in the shortened characters after a slash, so we can probably get away with just checking for word-characters at the end. I'll try to take a crack at it first thing in the new year.

@elrido
Copy link
Contributor

elrido commented Jan 4, 2024

@schisne, I've finally pushed a minimal change that I hope would allow it to find your IDN URL. I'd appreciate if you could try it.

Unfortunately I've not yet found a way to unit test this - I'll probably have to extract that lambda function into a named one, so it can be called by the testing harness without having to create a mock shortener service. I'll try and work on this over next weekend.

@schisne
Copy link
Author

schisne commented Jan 4, 2024

@schisne, I've finally pushed a minimal change that I hope would allow it to find your IDN URL. I'd appreciate if you could try it.

Unfortunately I've not yet found a way to unit test this - I'll probably have to extract that lambda function into a named one, so it can be called by the testing harness without having to create a mock shortener service. I'll try and work on this over next weekend.

Awesome. The use of URL.canParse is a brilliant idea. It does still fail for me, and playing in the Chromium console again, I think I know why. URL.canParse takes an optional second parameter, base, which I think Array.filter is blindly filling in with the array index--with the result that any URL, IDN or not, would fail validation as written.

image

But the good news is that https://google.com behaves the same way as my IDN URL when it interacts with URL.canParse! It also passes your new regex. :)

@elrido
Copy link
Contributor

elrido commented Jan 4, 2024

Thank you for your feedback! I've wrapped URL.canParse into a lambda as you suggested and extracted the function into a named one to make it possible to test it.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants