URL shortener support excludes IDNs #1224

schisne · 2023-12-28T21:15:04Z

Did you use the FAQ section?

Yes, I have read the FAQ and I found no solution/answer there.

When using a URL shortener hosted on an internationalized domain name (IDN), PrivateBin treats valid responses from the shortener as incorrect. As the root cause, I've located a regex in client-side JavaScript that assumes the URL is ASCII only.

https://datatracker.ietf.org/doc/html/rfc2181#page-13
https://datatracker.ietf.org/doc/html/rfc3490

Submitting this as a bug rather than a pull request because I imagine the maintainers would prefer to decide philosophically how to handle URL parsing rather than being dictated to by a stranger. :)

Steps to reproduce

Host a URL shortener at an internationalized domain and generate an API key for PrivateBin
Configure PrivateBin's urlshortener option with the API key as documented
Configure CORS to allow the shortener service
Submit a paste in PrivateBin
Click the "Shorten URL" button

What happens

A message appears at the top of the page saying, "Cannot parse response from URL shortener." Looking at the developer console in the browser, the AJAX request succeeded with a 200 response, and the response body contained a valid URL.

What should happen

The shortened URL in the successful response body should display. There is a regex in privatebin.js that limits the domain portion of a URL to [-a-zA-Z0-9@:%._\+~#=]{1,256}. If I edit this segment of the regex in the Chromium console to .{1,256}, the shortened URL does display properly.

Incidentally, the same regex limits the TLD portion of the URL to six characters. While this isn't impacting me directly, there are many TLDs longer than six characters: https://en.wikipedia.org/wiki/List_of_Internet_top-level_domains

One approach to a solution could be to modify the regex to include more valid domains. Another could be to require the URL shortener to represent its output in Punycode and, in JavaScript, decode the URL to its proper IDN form before displaying. Perhaps encode to Punycode, then run through a regex to validate, then decode again. Up to you how to tackle--I'm just a user. :)

Additional information

Error message:

Inline-edited regex (right) with resulting successful flow (left, domain obfuscated):

Basic information

Server OS: Alpine Linux (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)

Webserver: nginx (deployed through PrivateBin's official Helm chart, which uses the privatebin/nginx-fpm-alpine Docker image)

Browser: Chromium 120 (also reproduced in Firefox 121)

PrivateBin version: 1.6.2

I can reproduce this issue on https://privatebin.net: No (for reasons that seem entirely logical to me :) )

The text was updated successfully, but these errors were encountered:

elrido · 2023-12-29T13:30:55Z

Right, that is a pretty clear case of the trouble you can get yourself into with regexes. The code in question:

PrivateBin/js/privatebin.js

Line 2044 in 2985305

    
           const shortUrlMatcher = /https?:\/\/(www\.)?[-a-zA-Z0-9@:%._\+~#=]{1,256}\.[a-zA-Z0-9()]{1,6}\b([-a-zA-Z0-9()@:%_\+.~#?&//=]*)/g;

The intent here is, from my reading, to support different URL shorteners, even those that return the URL embedded in some document (HTML, JSON, plain text, ...). So the regex is used to try and extract something that looks like a URL, but also to prevent it to pick up something that just happens to look like one (a limited form of validation).

We must either relax the regex or make it a lot more complex to support IDNs and longer TLDs. I think we should a) relax the regex to detect more possible URLs (maybe down to any string starting with http:// or https:// and until the next non-word character) but b) then use a browser native API to validate if the finding is an actual URL or not.

One risk I can think of is with URLs at the end of a sentence? It would be nice if we had some sample snippet returned from URL shorteners to use in a unit test? @schisne I can probably capture one from my ancient YOURLs instance, could you provide one from your instance (obfuscate what you need, main thing is to get the type of content around the URL bit). That would be greatly appreciated and let us prevent regressions in the area, down the line.

schisne · 2023-12-29T17:32:39Z

I'd be happy to validate an actual or proposed fix, of course! The response from my instance (Shlink) is simply the text representation of the shortened URL, i.e., https://<domain apex>.<TLD>/<5 alphanumeric characters> No JSON or other trappings around it in my case :)

elrido · 2023-12-29T18:07:48Z

Ok, plain text, that should be simple to create some samples for. It's good you point out that all these domains will (likely) end in the shortened characters after a slash, so we can probably get away with just checking for word-characters at the end. I'll try to take a crack at it first thing in the new year.

elrido · 2024-01-04T06:02:36Z

@schisne, I've finally pushed a minimal change that I hope would allow it to find your IDN URL. I'd appreciate if you could try it.

Unfortunately I've not yet found a way to unit test this - I'll probably have to extract that lambda function into a named one, so it can be called by the testing harness without having to create a mock shortener service. I'll try and work on this over next weekend.

schisne · 2024-01-04T13:16:47Z

@schisne, I've finally pushed a minimal change that I hope would allow it to find your IDN URL. I'd appreciate if you could try it.

Unfortunately I've not yet found a way to unit test this - I'll probably have to extract that lambda function into a named one, so it can be called by the testing harness without having to create a mock shortener service. I'll try and work on this over next weekend.

Awesome. The use of URL.canParse is a brilliant idea. It does still fail for me, and playing in the Chromium console again, I think I know why. URL.canParse takes an optional second parameter, base, which I think Array.filter is blindly filling in with the array index--with the result that any URL, IDN or not, would fail validation as written.

But the good news is that https://google.com behaves the same way as my IDN URL when it interacts with URL.canParse! It also passes your new regex. :)

elrido · 2024-01-04T22:10:12Z

Thank you for your feedback! I've wrapped URL.canParse into a lambda as you suggested and extracted the function into a named one to make it possible to test it.

schisne added the bug label Dec 28, 2023

elrido self-assigned this Dec 29, 2023

elrido mentioned this issue Jan 4, 2024

Shortener URL validation improvement #1227

Merged

3 tasks

elrido closed this as completed in #1227 Jan 6, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

URL shortener support excludes IDNs #1224

URL shortener support excludes IDNs #1224

schisne commented Dec 28, 2023 •

edited

elrido commented Dec 29, 2023

schisne commented Dec 29, 2023 via email •

edited

elrido commented Dec 29, 2023

elrido commented Jan 4, 2024

schisne commented Jan 4, 2024

elrido commented Jan 4, 2024

URL shortener support excludes IDNs #1224

URL shortener support excludes IDNs #1224

Comments

schisne commented Dec 28, 2023 • edited

Steps to reproduce

What happens

What should happen

Additional information

Basic information

elrido commented Dec 29, 2023

schisne commented Dec 29, 2023 via email • edited

elrido commented Dec 29, 2023

elrido commented Jan 4, 2024

schisne commented Jan 4, 2024

elrido commented Jan 4, 2024

schisne commented Dec 28, 2023 •

edited

schisne commented Dec 29, 2023 via email •

edited