URL shortener support excludes IDNs #1224
Comments
Right, that is a pretty clear case of the trouble you can get yourself into with regexes. The code in question: Line 2044 in 2985305
The intent here is, from my reading, to support different URL shorteners, even those that return the URL embedded in some document (HTML, JSON, plain text, ...). So the regex is used to extract something that looks like a URL, but also to prevent it from picking up something that just happens to look like one (a limited form of validation). We must either relax the regex or make it a lot more complex to support IDNs and longer TLDs.
I think we should a) relax the regex to detect more possible URLs (maybe down to any string starting with http:// or https:// and running until the next non-word character), but b) then use a browser-native API to validate whether the finding is an actual URL or not. One risk I can think of is URLs at the end of a sentence, where trailing punctuation could be picked up.
It would be nice to have some sample snippets returned from URL shorteners to use in a unit test. @schisne, I can probably capture one from my ancient YOURLS instance; could you provide one from your instance (obfuscate what you need, the main thing is to get the type of content around the URL bit)? That would be greatly appreciated and would let us prevent regressions in this area down the line. |
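The two-step idea above (relaxed detection, then native validation) could look roughly like the sketch below. It is not the code in privatebin.js: the function name `extractShortUrl` is hypothetical, and the pattern stops at whitespace, quotes and angle brackets rather than at the next non-word character, which is an assumption made here so that IDN hosts and paths survive. Trailing sentence punctuation is stripped to address the end-of-sentence risk.

```javascript
// Hypothetical helper, not PrivateBin's actual implementation.
function extractShortUrl(responseText) {
    // Step a) relaxed detection: a scheme followed by any run of characters
    // that are not whitespace, quotes or angle brackets (assumed delimiters).
    const candidates = responseText.match(/https?:\/\/[^\s"'<>]+/g) || [];
    for (const candidate of candidates) {
        // Drop punctuation that likely belongs to the surrounding sentence.
        const trimmed = candidate.replace(/[.,;:!?)\]]+$/, '');
        // Step b) validation through the native URL parser.
        try {
            const url = new URL(trimmed);
            if (url.protocol === 'http:' || url.protocol === 'https:') {
                return trimmed;
            }
        } catch (e) {
            // Not parseable as a URL; try the next candidate.
        }
    }
    return null;
}

// Sample inputs are made up for illustration.
console.log(extractShortUrl('https://bücher.example/aB3x9'));
console.log(extractShortUrl('{"shorturl":"https://example.photography/abc12"}'));
console.log(extractShortUrl('Your link is https://example.com/abc12.'));
```

A JSON response that escapes slashes (`https:\/\/`) would not match this pattern and would need to be parsed as JSON first.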
I'd be happy to validate an actual or proposed fix, of course!
The response from my instance (Shlink) is simply the text representation of the shortened URL, i.e., https://<domain apex>.<TLD>/<5 alphanumeric characters>
No JSON or other trappings around it in my case :)
|
Ok, plain text, that should be simple to create some samples for. It's good you point out that all these domains will (likely) end in the shortened characters after a slash, so we can probably get away with just checking for word-characters at the end. I'll try to take a crack at it first thing in the new year. |
@schisne, I've finally pushed a minimal change that I hope will allow it to find your IDN URL. I'd appreciate it if you could try it. Unfortunately, I've not yet found a way to unit test this. I'll probably have to extract that lambda function into a named one, so it can be called by the testing harness without having to create a mock shortener service. I'll try to work on this over the next weekend. |
Awesome. The use of But the good news is that |
Thank you for your feedback! I've wrapped URL.canParse into a lambda as you suggested and extracted the function into a named one to make it possible to test it. |
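For readers following along, a named, testable validation function might be shaped like the sketch below. The name `isValidUrl` and the fallback to the `URL` constructor are assumptions for illustration, not necessarily what was committed; the fallback is only there for runtimes that lack `URL.canParse`.

```javascript
// Hypothetical named function, so a test harness can call it directly
// without mocking a shortener service.
function isValidUrl(candidate) {
    if (typeof URL.canParse === 'function') {
        return URL.canParse(candidate);
    }
    // Fallback for runtimes without URL.canParse (assumed to be needed).
    try {
        new URL(candidate);
        return true;
    } catch (e) {
        return false;
    }
}

// Made-up samples in the plain-text shape described earlier in the thread.
console.log(isValidUrl('https://bücher.example/aB3x9'));
console.log(isValidUrl('not a url'));
```

Note that the native parser accepts any scheme (for example `mailto:`), so a caller that only wants http(s) links would still need to check the protocol separately.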
Did you use the FAQ section?
When using a URL shortener hosted on an internationalized domain name (IDN), PrivateBin treats valid responses from the shortener as incorrect. As the root cause, I've located a regex in client-side JavaScript that assumes the URL is ASCII only.
https://datatracker.ietf.org/doc/html/rfc2181#page-13
https://datatracker.ietf.org/doc/html/rfc3490
Submitting this as a bug rather than a pull request because I imagine the maintainers would prefer to decide philosophically how to handle URL parsing rather than being dictated to by a stranger. :)
Steps to reproduce
urlshortener option with the API key as documented
What happens
A message appears at the top of the page saying, "Cannot parse response from URL shortener." Looking at the developer console in the browser, the AJAX request succeeded with a 200 response, and the response body contained a valid URL.
What should happen
The shortened URL in the successful response body should display. There is a regex in privatebin.js that limits the domain portion of a URL to [-a-zA-Z0-9@:%._\+~#=]{1,256}. If I edit this segment of the regex in the Chromium console to .{1,256}, the shortened URL does display properly.
Incidentally, the same regex limits the TLD portion of the URL to six characters. While this isn't impacting me directly, there are many TLDs longer than six characters: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains
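The effect of the ASCII-only character class can be reproduced in isolation. The sketch below uses only the domain segment quoted above, anchored with `^` and `$` for the demonstration; the full regex from privatebin.js is not reproduced, and the host names are made-up examples.

```javascript
// Only the domain segment quoted in this report, anchored for the demo.
const asciiOnlySegment = /^[-a-zA-Z0-9@:%._\+~#=]{1,256}$/;
// The relaxed segment tried in the Chromium console.
const relaxedSegment = /^.{1,256}$/;

// Hypothetical host names, for illustration only.
const asciiHost = 'example';
const idnHost = 'bücher';

const results = {
    asciiHostStrict: asciiOnlySegment.test(asciiHost), // ASCII passes
    idnHostStrict: asciiOnlySegment.test(idnHost),     // 'ü' is outside the class
    idnHostRelaxed: relaxedSegment.test(idnHost)       // '.' accepts 'ü'
};
console.log(results);
```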
One approach to a solution could be to modify the regex to include more valid domains. Another could be to require the URL shortener to represent its output in Punycode and, in JavaScript, decode the URL to its proper IDN form before displaying. Perhaps encode to Punycode, then run through a regex to validate, then decode again. Up to you how to tackle--I'm just a user. :)
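As a rough illustration of the Punycode idea: the native `URL` parser already converts an IDN host to its ASCII (Punycode) form, which the ASCII-only segment then accepts. The helper name `toAsciiHostname` and the sample URL are assumptions for this sketch; decoding back to the Unicode form for display is not shown, since as far as I know browsers do not expose a native API for that step.

```javascript
// Domain segment quoted in this report, anchored for the demo.
const asciiHostPattern = /^[-a-zA-Z0-9@:%._\+~#=]{1,256}$/;

// Hypothetical helper: let the URL parser produce the ASCII host name.
function toAsciiHostname(candidate) {
    return new URL(candidate).hostname;
}

const host = toAsciiHostname('https://bücher.example/aB3x9');
console.log(host); // xn--bcher-kva.example
console.log(asciiHostPattern.test(host));
```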
Additional information
Error message:
Inline-edited regex (right) with resulting successful flow (left, domain obfuscated):
Basic information
Server OS: Alpine Linux (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)
Webserver: nginx (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)
Browser: Chromium 120 (also reproduced in Firefox 121)
PrivateBin version: 1.6.2
I can reproduce this issue on https://privatebin.net: No (for reasons that seem entirely logical to me :) )