Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Fix for UTF-8 in hostname #131

Closed
wants to merge 4 commits into from
Closed

Fix for UTF-8 in hostname #131

wants to merge 4 commits into from

Conversation

cseufert
Copy link
Contributor

@cseufert cseufert commented Oct 3, 2019

This patch replaces using strtolower() on hostnames as this cant handle utf-8 characters, I have opted to do this with a regex to prevent adding mbstring as a dependancy

@Zegnat
Copy link
Collaborator

Zegnat commented Oct 5, 2019

If I understand this patch correctly it will only run strtolower on A-Z characters? How is that different from running it on the entire string? If you have a mixed string of ASCII and larger Unicode, strtolower will already only convert the ASCII.

Example code comparing the new filterHost proposed here with just strtolower:

$mixedString = 'MÅi';

function filterHost($host): string
{
    return \preg_replace_callback(
        '/([A-Z])/',
        static function ($cap) {
            return \strtolower($cap[0]);
        },
        $host
    );
}

var_dump(
    filterHost($mixedString),
    \strtolower($mixedString),
    filterHost($mixedString) === \strtolower($mixedString)
);
string(4) "mÅi"
string(4) "mÅi"
bool(true)

I think the test you added will pass with or without the proposed patch. Could you write a test case illustrating a case that fails without applying your proposed patch? Maybe I am missing something. I may very well be missing something crucial, some fluke of PHP’s Unicode handling, and would love it if you could point it out!

@cseufert
Copy link
Contributor Author

cseufert commented Oct 7, 2019

I am sure that this test was failing before I replaced strtolower, but i cant seem to reproduce that now....

function testUtf8Host()
{
    $uri = new Uri('http://ουτοπία.δπθ.gr/');
    $this->assertSame('ουτοπία.δπθ.gr', $uri->getHost());
    $new = $uri->withHost('程式设计.com');
    $this->assertSame('程式设计.com', $new->getHost());
}

Actually just put back the strtolower in the Uri.php and get the test case to fail:

Failed asserting that two strings are identical.
Expected :'ουτοπία.δπθ.gr'
Actual   :'�������.���.gr'

I am running the tests on Windows 10 + PHP 7.3 x64, I wonder if its windows specific behaviour.

@cseufert
Copy link
Contributor Author

cseufert commented Oct 7, 2019

On futher investigation, it does seem to be a windows specific thing:

Output from your example code on windows:

"C:\Program Files\php-7.3\php.exe" C:\Users\chris\www\psr7\try.php
C:\Users\chris\www\psr7\try.php:21:
string(4) "mÅi"
C:\Users\chris\www\psr7\try.php:21:
string(4) "m�i"
C:\Users\chris\www\psr7\try.php:21:
bool(false)

Process finished with exit code 0

@Zegnat
Copy link
Collaborator

Zegnat commented Oct 8, 2019

This is so weird. I am assuming try.php is saved as UTF-8? What do you get when you add the following to it?

echo setlocale(\LC_ALL, 0) . \PHP_EOL . ini_get('mbstring.func_overload');

Wondering if strtolower is somehow affected by some other setting, in which case the real solution would be to accommodate for that rather than trying to only find ASCII characters.

@cseufert
Copy link
Contributor Author

I get the following:

LC_COLLATE=C;LC_CTYPE=English_Australia.1252;LC_MONETARY=C;LC_NUMERIC=C;LC_TIME=C
0

So maybe because windows has the default codepage set to 1252?

@Zegnat
Copy link
Collaborator

Zegnat commented Oct 15, 2019

Thanks for sticking with me @cseufert!

I have now been able to replicate the issue. I do not have the exact locale you mention on my macOS setup, but I was able to use the following (extra name variations to cover all bases) to reproduce the problem:

setlocale(\LC_CTYPE, ['English_Australia.1252', 'en_AU.1252', 'en.1252', 'English_Australia.UTF-8', 'en_AU.UTF-8', 'en.UTF-8']);

I think it may be an idea to modify the test case so it sets a locale we know will fail. That way we actually know the PHPUnit test works.

Locally my PHP is set to use the C locale. So rather than using a regular expression and running strtolower() a lot of times on the domain, I wonder if it instead would be better to declare a safeStrtolower() method that sets the locale as follows:

$originalLocale = setlocale(\LC_CTYPE, ['C', 'POSIX']);

And then resets it after running strtolower().

I will try to put the code together during the day (if you do not beat me to it) and hope you will be able to test it on Windows as well.

Zegnat pushed a commit that referenced this pull request Oct 22, 2019
Using an old ICANN test domain name for internationalized domain names,
hopefully not clashing with anything that actually exists. Systems that
use a C or POSIX locale seem unable to reproduce this, so try to set a
more common locale on these.

Continuation of #131.
Zegnat added a commit that referenced this pull request Oct 22, 2019
Using an old ICANN test domain name for internationalized domain names,
hopefully not clashing with anything that actually exists. Systems that
use a C or POSIX locale seem unable to reproduce this, so try to set a
more common locale on these.

Continuation of #131.
@cseufert
Copy link
Contributor Author

I can confirm that setting the locale like this
setlocale(\LC_CTYPE, ['C', 'POSIX']);
does, in fact, solve the problem on windows of strtolower() not working with UTF-8 characters.

However, there are some pretty serious warnings about changing the locale, on shared hosting (mod_apache) on the setlocale man page.

Warning
The locale information is maintained per process, not per thread. If you are running PHP on a multithreaded server API like IIS, HHVM or Apache on Windows, you may experience sudden changes in locale settings while a script is running, though the script itself never called setlocale(). This happens due to other scripts running in different threads of the same process at the same time, changing the process-wide locale using setlocale().

https://www.php.net/manual/en/function.setlocale.php

So changing it is probably not a great solution for some users.

@Zegnat
Copy link
Collaborator

Zegnat commented Oct 26, 2019

Hmm, clearly I somehow missed that when working on my own solution in #133.

I am not sure this is a problem in practice, where we would only need to temporarily change the locale for ~3 lines of code before changing it back to what it used to be. But definitely something to keep in mind.

I am a little surprised that I haven’t encountered this issue in other projects before.

@cseufert
Copy link
Contributor Author

Unicode hostnames are still experimental, and I only encountered them through forged referrer headers, but it still caused some weirdness, so yes, I'm not sure if it is a problem for legit activity.

@drupol
Copy link

drupol commented Nov 19, 2019

Would it be dumb to just use mb_strtolower() ?

@cseufert
Copy link
Contributor Author

@drupol The only issue it then makes this library require that users PHP environment has the mbstring extension loaded. This is not part of a default PHP installation
(see: https://www.php.net/manual/en/mbstring.installation.php)

@@ -271,6 +270,17 @@ private static function isNonStandardPort(string $scheme, int $port): bool
return !isset(self::SCHEMES[$scheme]) || $port !== self::SCHEMES[$scheme];
}

private function filterHost($host): string
{
return \preg_replace_callback(
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

this implementation is going to be very slow.
My suggestion instead:
strtr($host, 'ABCDEFGHIJKLMNOPQRSTUVWXYZ', 'abcdefghijklmnopqrstuvwxyz');

Nyholm pushed a commit to nicolas-grekas/nyholm-psr7 that referenced this pull request Apr 21, 2020
Using an old ICANN test domain name for internationalized domain names,
hopefully not clashing with anything that actually exists. Systems that
use a C or POSIX locale seem unable to reproduce this, so try to set a
more common locale on these.

Continuation of Nyholm#131.
@Nyholm Nyholm closed this in #143 Apr 21, 2020
@Nyholm
Copy link
Owner

Nyholm commented Apr 21, 2020

Thank you for this PR.
I merge the solution in #143. I did however include this test (with your commit).

Thank you again.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

5 participants