Fix for UTF-8 in hostname #131
If I understand this patch correctly it will only run Example code comparing the new
$mixedString = 'MÅi';

function filterHost($host): string
{
    return \preg_replace_callback(
        '/([A-Z])/',
        static function ($cap) {
            return \strtolower($cap[0]);
        },
        $host
    );
}

var_dump(
    filterHost($mixedString),
    \strtolower($mixedString),
    filterHost($mixedString) === \strtolower($mixedString)
);
I think the test you added will pass with or without the proposed patch. Could you write a test case that fails without your proposed patch applied? I may very well be missing something crucial, some fluke of PHP’s Unicode handling, and would love it if you could point it out!
I am sure that this test was failing before I replaced strtolower, but I can't seem to reproduce that now...
Actually, I just put strtolower back in Uri.php and got the test case to fail:
I am running the tests on Windows 10 + PHP 7.3 x64, so I wonder if it's Windows-specific behaviour.
On further investigation, it does seem to be a Windows-specific thing. Output from your example code on Windows:
This is so weird. I am assuming try.php is saved as UTF-8? What do you get when you add the following to it?
echo setlocale(\LC_ALL, 0) . \PHP_EOL . ini_get('mbstring.func_overload');
Wondering if
I get the following:
So maybe it is because Windows has the default code page set to 1252?
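A rough sketch of why a 1252 code page could matter here. It does not call the locale-aware strtolower() at all. Instead, a hypothetical helper, simulateCp1252Lowercase(), imitates a single-byte lowercase for the one byte relevant to the example: in Windows-1252, byte 0xC3 is "Ã" and its lowercase "ã" is 0xE3, while "Å" in UTF-8 is the byte pair 0xC3 0x85. The helper names and the simulation are assumptions made for illustration, not code from the library. As far as I know, PHP 8.2 and later made strtolower() locale-insensitive, so the real effect should only show up on older versions.

```php
<?php
// "MÅi" written with explicit bytes so the file encoding does not matter.
// "Å" (U+00C5) in UTF-8 is the two bytes 0xC3 0x85.
$utf8 = "M\xC3\x85i";

// Hypothetical helper: imitates a byte-wise lowercase under Windows-1252.
// Only the single non-ASCII mapping needed here (0xC3 "Ã" to 0xE3 "ã") is simulated.
function simulateCp1252Lowercase(string $bytes): string
{
    $ascii = strtr($bytes, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');

    return str_replace("\xC3", "\xE3", $ascii);
}

// Hypothetical helper: preg_match() with the /u modifier returns false
// when the subject is not well-formed UTF-8.
function isValidUtf8(string $bytes): bool
{
    return preg_match('//u', $bytes) === 1;
}

$mangled = simulateCp1252Lowercase($utf8);

echo bin2hex($utf8), PHP_EOL;    // 4dc38569
echo bin2hex($mangled), PHP_EOL; // 6de38569
var_dump(isValidUtf8($utf8), isValidUtf8($mangled)); // bool(true), bool(false)
```

The lead byte of "Å" is rewritten while its continuation byte is left alone, so the result is no longer valid UTF-8. That would be consistent with the failure only appearing under a single-byte locale.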
Thanks for sticking with me @cseufert! I have now been able to replicate the issue. I do not have the exact locale you mention on my macOS setup, but I was able to use the following (extra name variations to cover all bases) to reproduce the problem:
setlocale(\LC_CTYPE, ['English_Australia.1252', 'en_AU.1252', 'en.1252', 'English_Australia.UTF-8', 'en_AU.UTF-8', 'en.UTF-8']);
I think it may be an idea to modify the test case so it sets a locale we know will fail. That way we actually know the PHPUnit test works. Locally my PHP is set to use the
$originalLocale = setlocale(\LC_CTYPE, ['C', 'POSIX']);
And then resets it after running
I will try to put the code together during the day (if you do not beat me to it) and hope you will be able to test it on Windows as well.
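A minimal sketch of the "set a locale, run the check, reset it" idea described above. The helper name withCtypeLocale() is hypothetical and not part of the library or its test suite. It assumes that setlocale() with a locale of '0' only queries the current setting, and that setlocale() returns false when none of the candidate names is available on the system.

```php
<?php
// Hypothetical helper: run $fn under the first available candidate locale,
// then restore whatever LC_CTYPE locale was active before.
function withCtypeLocale(array $candidates, callable $fn)
{
    // '0' asks for the current locale without changing it.
    $original = setlocale(LC_CTYPE, '0');

    // Returns the locale that was applied, or false if none could be set.
    $applied = setlocale(LC_CTYPE, $candidates);

    try {
        return $fn($applied);
    } finally {
        // Always restore, even if $fn throws (for example a failed assertion).
        setlocale(LC_CTYPE, $original);
    }
}

// Usage: the callback learns which locale (if any) was actually applied,
// so a test could skip itself when $applied is false.
$result = withCtypeLocale(
    ['English_Australia.1252', 'en_AU.1252', 'en_AU.UTF-8'],
    function ($applied) {
        return $applied === false ? 'no candidate locale available' : 'ran under ' . $applied;
    }
);
echo $result, PHP_EOL;
```

Restoring inside finally keeps the changed locale confined to the few lines that need it, which matters because the locale is process-wide state.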
Using an old ICANN test domain name for internationalized domain names, hopefully not clashing with anything that actually exists. Systems that use a C or POSIX locale seem unable to reproduce this, so try to set a more common locale on these. Continuation of #131.
I can confirm that setting the locale like this However, the setlocale manual page has some pretty serious warnings about changing the locale on shared hosting (mod_apache):
https://www.php.net/manual/en/function.setlocale.php
So changing it is probably not a great solution for some users.
Hmm, clearly I somehow missed that when working on my own solution in #133. I am not sure this is a problem in practice, where we would only need to temporarily change the locale for ~3 lines of code before changing it back to what it used to be. But definitely something to keep in mind. I am a little surprised that I haven’t encountered this issue in other projects before.
Unicode hostnames are still experimental, and I only encountered them through forged referrer headers, but it still caused some weirdness, so yes, I'm not sure if it is a problem for legit activity.
Would it be dumb to just use mb_strtolower()?
@drupol The only issue is that it then makes this library require that the user's PHP environment has the
@@ -271,6 +270,17 @@ private static function isNonStandardPort(string $scheme, int $port): bool
        return !isset(self::SCHEMES[$scheme]) || $port !== self::SCHEMES[$scheme];
    }

    private function filterHost($host): string
    {
        return \preg_replace_callback(
This implementation is going to be very slow.
My suggestion instead:
strtr($host, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
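For comparison, here is a small sketch placing the regex approach from the patch next to the strtr() suggestion. The function names lowercaseAsciiRegex() and lowercaseAsciiStrtr() are made up for this example. Both are meant to lowercase only A-Z and leave every other byte untouched, so multi-byte UTF-8 sequences should pass through unchanged. No timing is measured here, so the speed claim itself is not verified by this snippet.

```php
<?php
// Hypothetical name: the regex-based variant, mirroring the patch under review.
function lowercaseAsciiRegex(string $host): string
{
    return preg_replace_callback(
        '/[A-Z]/',
        static function (array $match): string {
            return strtolower($match[0]);
        },
        $host
    );
}

// Hypothetical name: the strtr() variant suggested in the review comment.
// With two string arguments, strtr() maps each byte in the first list
// to the byte at the same position in the second list.
function lowercaseAsciiStrtr(string $host): string
{
    return strtr($host, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');
}

// "MÅi.Example" with "Å" spelled out as its UTF-8 bytes 0xC3 0x85.
$host = "M\xC3\x85i.Example";

var_dump(
    bin2hex(lowercaseAsciiStrtr($host)),
    lowercaseAsciiRegex($host) === lowercaseAsciiStrtr($host)
);
```

The strtr() variant never consults the locale, since it is a plain byte-for-byte translation table, which is the property the hostname normalisation needs.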
Thank you for this PR. Thank you again.
This patch replaces the use of strtolower() on hostnames, as it can't handle UTF-8 characters. I have opted to do this with a regex to avoid adding mbstring as a dependency.