Not encoding UTF-8 correctly #66

Closed
shtse8 opened this issue Mar 1, 2017 · 19 comments

@shtse8

shtse8 commented Mar 1, 2017

I am a Chinese developer building a Chinese-language website.

Code to reproduce:

$html = <<<EOF
<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body>
</html>
EOF;
$dom = FluentDOM::QueryCss($html, 'text/html');
echo $dom;

Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8"><title>&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;</title></head><body>
&#32593;&#21451;&#32456;&#20110;&#32905;&#25628;&#20986;&#12300;&#33539;&#20912;&#20912;&#12301;&#23478;&#26063;&#29031;&#29255;&#65292;&#27809;&#24819;&#21040;&#30475;&#35265;&#22905;&#22902;&#22902;&#25165;&#21457;&#29616;&#12300;&#33539;&#20912;&#20912;&#26159;&#20840;&#23478;&#26368;&#38590;&#30475;&#30340;&#12301;&#65281;
</body></html>

Expected Result:

<!DOCTYPE html>
<html prefix="og: http://ogp.me/ns#"><head><meta charset="UTF-8">
<title>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</title>
<body>
网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!
</body></html>

This is a known issue with PHP's DOMDocument. Here is the reference:
http://stackoverflow.com/questions/8218230/php-domdocument-loadhtml-not-encoding-utf-8-correctly

We should get UTF-8 output rather than HTML entities. It makes no sense for the final HTML to be full of entity-encoded characters, which also makes it much larger.
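
To make the size concern concrete, here is a minimal sketch in plain PHP (no FluentDOM involved). It compares the first two characters of the title as raw UTF-8 and as the numeric character references from the result above, and checks that both decode to the same text. The variable names are only illustrative.

```php
<?php
// The same two characters, once as raw UTF-8 and once as the numeric
// character references that appear in the reported output.
$raw     = '网友';
$encoded = '&#32593;&#21451;';

// Decoding the entities yields the original UTF-8 text again.
$decoded = html_entity_decode($encoded, ENT_QUOTES | ENT_HTML5, 'UTF-8');

// strlen() counts bytes, so this shows the size difference on the wire.
printf("raw: %d bytes, entity-encoded: %d bytes\n", strlen($raw), strlen($encoded));
```

Each CJK character takes 3 bytes in UTF-8 but 8 bytes as a five-digit numeric reference, so a page that is mostly Chinese text grows considerably.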

@ThomasWeinert
Owner

Right, I will check and try to fix this.

@ThomasWeinert ThomasWeinert self-assigned this Mar 2, 2017
@ThomasWeinert ThomasWeinert added this to the 6.1 milestone Mar 2, 2017
@shtse8
Author

shtse8 commented Mar 3, 2017

The problem is not only that entity-encoded UTF-8 is returned: html() returns garbled text.

Reproduce:

$html = '<div><p>Paragraph 1</p> <p>Paragraph 2</p><p>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</p></div>';
$doc = FluentDOM($html, 'html-fragment');
echo $doc->html();

Expected:

<p>Paragraph 1</p> <p>Paragraph 2</p><p>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</p>

Actual:

<p>Paragraph 1</p> <p>Paragraph 2</p>
<p>������������家����没����她奶奶���������家������</p>

The same happens with text():

echo $doc->text();

Result:

Paragraph 1 Paragraph 2������������家����没����她奶奶���������家������

while:

echo $doc;

returns the UTF-8 bytes encoded as entities:

<div><p>Paragraph 1</p> <p>Paragraph 2</p><p>&ccedil;&frac12;&#145;&aring;&#143;&#139;&ccedil;&raquo;&#136;&auml;&ordm;&#142;&egrave;&#130;&#137;&aelig;&#144;&#156;&aring;&#135;&ordm;&atilde;&#128;&#140;&egrave;&#140;&#131;&aring;&#134;&deg;&aring;&#134;&deg;&atilde;&#128;&#141;&aring;&reg;&para;&aelig;&#151;&#143;&ccedil;&#133;&sect;&ccedil;&#137;&#135;&iuml;&frac14;&#140;&aelig;&sup2;&iexcl;&aelig;&#131;&sup3;&aring;&#136;&deg;&ccedil;&#156;&#139;&egrave;&sect;&#129;&aring;&yen;&sup1;&aring;&yen;&para;&aring;&yen;&para;&aelig;&#137;&#141;&aring;&#143;&#145;&ccedil;&#142;&deg;&atilde;&#128;&#140;&egrave;&#140;&#131;&aring;&#134;&deg;&aring;&#134;&deg;&aelig;&#152;&macr;&aring;&#133;&uml;&aring;&reg;&para;&aelig;&#156;&#128;&eacute;&#154;&frac34;&ccedil;&#156;&#139;&ccedil;&#154;&#132;&atilde;&#128;&#141;&iuml;&frac14;&#129;</p></div>
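
The entity sequence above looks like the UTF-8 bytes of the text being treated as single-byte characters. A minimal sketch of that effect, assuming nothing about FluentDOM's internals (`bytesAsEntities` is a hypothetical helper written for this illustration):

```php
<?php
/**
 * Hypothetical helper: write every byte >= 0x80 as a numeric entity,
 * i.e. treat a UTF-8 string as if each byte were one character.
 */
function bytesAsEntities(string $utf8): string
{
    $out = '';
    foreach (unpack('C*', $utf8) as $byte) {
        // ASCII bytes pass through unchanged; all others become &#N;
        $out .= $byte < 0x80 ? chr($byte) : '&#' . $byte . ';';
    }
    return $out;
}

echo bytesAsEntities('网'), "\n";
```

The character 网 is the byte sequence 231, 189, 145 in UTF-8. In Latin-1, 231 is ç (`&ccedil;`) and 189 is ½ (`&frac12;`), which matches the `&ccedil;&frac12;&#145;` at the start of the output.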

@shtse8
Author

shtse8 commented Mar 3, 2017

However, setting the content with text('some string') works: correct UTF-8 is returned.

Code:

$doc = FluentDOM('<div></div>', 'html-fragment');
$doc->text('网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!');
echo $doc->text();

Result:

网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!

but html('some string') still fails.

Code:

$doc = FluentDOM('<div></div>', 'html-fragment');
$doc->html('网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!');
echo $doc->html();

Result:

������������家����没����她奶奶���������家������

I really like this library. It is powerful and provides many methods for handling different kinds of DOM. Unfortunately, because of this encoding issue I cannot use it: all my HTML documents and fragments are full of Chinese text. I hope it can be fixed soon. Thanks for your contribution.

@ThomasWeinert
Owner

Give me one or two weekends please :-)

I will try to add the issues as unit tests to make them reproducible.

@ThomasWeinert
Owner

I just pushed FluentDOM/FluentDOM@370e98f

This is not the final fix, but it should improve the behavior.

@shtse8
Author

shtse8 commented Mar 6, 2017

I want to try your new commit, but I cannot use fluentdom and selectors-phpcss in the same project.
It seems selectors-phpcss requires a specific version, which prevents Composer from installing the new fluentdom commit. Do you know how to update to the latest commit using Composer?

"require": {
		"fluentdom/fluentdom": "dev-master#e2a47d5",
		"fluentdom/selectors-phpcss": "^1.0"
    }

Output of composer update:

Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Conclusion: remove fluentdom/fluentdom dev-master
    - fluentdom/selectors-phpcss 1.0.0 requires fluentdom/fluentdom ^5.3 -> satisfiable by fluentdom/fluentdom[5.3.x-dev].
    - fluentdom/selectors-phpcss 1.0.1 requires fluentdom/fluentdom ^5.3||^6.0 -> satisfiable by fluentdom/fluentdom[5.3.x-dev, 6.0.x-dev, 6.1.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 5.3.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.0.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.1.x-dev].
    - Installation request for fluentdom/fluentdom dev-master#e2a47d5 -> satisfiable by fluentdom/fluentdom[dev-master].
    - Installation request for fluentdom/selectors-phpcss ^1.0 -> satisfiable by fluentdom/selectors-phpcss[1.0.0, 1.0.1].

I can install the new commit without selectors-phpcss, but I need that package for testing.

@shtse8
Author

shtse8 commented Mar 6, 2017

Anyway, I have downloaded the latest clone and replaced the whole folder manually, so I can test now.
The behavior seems unchanged with FluentDOM::QueryCss, but it works with new HTML().

@shtse8
Author

shtse8 commented Mar 19, 2017

Hi, any updates?

@ThomasWeinert
Owner

I added a regular expression to the HTML loader that reads the encoding from the meta tags; the default is now UTF-8. I also reworked much of the HTML load/save process and tested it with Chinese characters. The changes are pushed to the 6.1 branch, and I added that version to the CSS Selector package, so a composer install that allows dev versions should work now.

If you could test it out and send me (small) examples that do not work as expected, I would appreciate it.
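
For readers following along, the meta-tag lookup with a UTF-8 default could be sketched roughly as follows. This is an illustration only, not the code from the 6.1 branch; the function name `detectHtmlCharset` and the exact pattern are assumptions.

```php
<?php
/**
 * Hypothetical helper, not FluentDOM's actual implementation: look for a
 * charset declaration in a <meta> tag and fall back to UTF-8.
 */
function detectHtmlCharset(string $html, string $default = 'UTF-8'): string
{
    // Covers <meta charset="..."> as well as
    // <meta http-equiv="Content-Type" content="text/html; charset=...">
    $pattern = '~<meta\b[^>]*?\bcharset\s*=\s*["\']?\s*([a-z0-9._:-]+)~i';
    if (preg_match($pattern, $html, $match)) {
        return strtoupper($match[1]);
    }
    return $default;
}

echo detectHtmlCharset('<html><head><meta charset="UTF-8"></head></html>'), "\n";
```

A regular expression like this is only a heuristic (it would also match a meta tag inside a comment, for example), but it is cheap and runs before the parser has to commit to an encoding.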

@shtse8
Author

shtse8 commented Mar 26, 2017

@ThomasWeinert Hi, could you give me a sample of composer.json to include the dev version? I have tried the following but without any luck.

{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/FluentDOM/FluentDOM"
        }
    ],
    "require": {
        "fluentdom/fluentdom": "dev-master#6.1",
        "fluentdom/selectors-phpcss": "^1.0"
    }
}

Result:

Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - Conclusion: don't install fluentdom/fluentdom dev-master
    - fluentdom/selectors-phpcss 1.0.1 requires fluentdom/fluentdom ^5.3||^6.0 -> satisfiable by fluentdom/fluentdom[5.3.x-dev, 6.0.x-dev, 6.1.x-dev].
    - fluentdom/selectors-phpcss 1.0.0 requires fluentdom/fluentdom ^5.3 -> satisfiable by fluentdom/fluentdom[5.3.x-dev].
    - fluentdom/selectors-phpcss 1.0.1 requires fluentdom/fluentdom ^5.3||^6.0 -> satisfiable by fluentdom/fluentdom[5.3.x-dev, 6.0.x-dev, 6.1.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 5.3.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.0.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.1.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 5.3.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.0.x-dev].
    - Can only install one of: fluentdom/fluentdom[dev-master, 6.1.x-dev].
    - Installation request for fluentdom/fluentdom dev-master#6.1 -> satisfiable by fluentdom/fluentdom[dev-master].
    - Installation request for fluentdom/selectors-phpcss ^1.0 -> satisfiable by fluentdom/selectors-phpcss[1.0.0, 1.0.1].


I also tried:

{
    "require": {
        "fluentdom/fluentdom": "^6.1",
        "fluentdom/selectors-phpcss": "^1.0"
    }
}

Result:

Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - The requested package fluentdom/fluentdom could not be found in any version, there may be a typo in the package name.
  Problem 2
    - fluentdom/selectors-phpcss 1.0.1 requires fluentdom/fluentdom ^5.3||^6.0 -> no matching package found.
    - fluentdom/selectors-phpcss 1.0.0 requires fluentdom/fluentdom ^5.3 -> no matching package found.
    - fluentdom/selectors-phpcss 1.0.1 requires fluentdom/fluentdom ^5.3||^6.0 -> no matching package found.
    - Installation request for fluentdom/selectors-phpcss ^1.0 -> satisfiable by fluentdom/selectors-phpcss[1.0.0, 1.0.1].

Potential causes:
 - A typo in the package name
 - The package is not available in a stable-enough version according to your minimum-stability setting
   see <https://getcomposer.org/doc/04-schema.md#minimum-stability> for more details.

Read <https://getcomposer.org/doc/articles/troubleshooting.md> for further common problems.

I also tried:

{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/FluentDOM/FluentDOM"
        }
    ],
    "require": {
        "fluentdom/fluentdom": "dev-6.1",
        "fluentdom/selectors-phpcss": "^1.0.1"
    }
}

Result:

Loading composer repositories with package information
Updating dependencies (including require-dev)
Your requirements could not be resolved to an installable set of packages.

  Problem 1
    - The requested package fluentdom/fluentdom dev-6.1 could not be found.

Potential causes:
 - A typo in the package name
 - The package is not available in a stable-enough version according to your minimum-stability setting
   see <https://getcomposer.org/doc/04-schema.md#minimum-stability> for more details.

Read <https://getcomposer.org/doc/articles/troubleshooting.md> for further common problems.

@shtse8
Author

shtse8 commented Mar 26, 2017

Okay, I can use it now with the following composer.json.

{
    "repositories": [
        {
            "type": "vcs",
            "url": "https://github.com/FluentDOM/FluentDOM"
        }
    ],
    "require": {
        "fluentdom/fluentdom": "dev-master as 6.1",
        "fluentdom/selectors-phpcss": "^1.0.1"
    }
}


@shtse8
Author

shtse8 commented Mar 26, 2017

Code to reproduce

$doc = FluentDOM('<div></div>', 'html-fragment');
$doc->html(FluentDOM('网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!', 'html-fragment'));
echo $doc;

Expected Result:

<div>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</div>

Actual Result:

<div>������������家����没����她奶奶���������家������
</div>

@shtse8
Author

shtse8 commented Mar 26, 2017

Code to reproduce:

$html = '<div><p>Paragraph 1</p> <p>Paragraph 2</p><p>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</p></div>';
$doc = FluentDOM($html, 'html-fragment');
echo $doc->html();

Expected Result:

<div><p>Paragraph 1</p> <p>Paragraph 2</p><p>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</p></div>

Actual Result:

<p>Paragraph 1</p> <p>Paragraph 2</p>
<p>������������家����没����她奶奶���������家������</p>

@shtse8
Author

shtse8 commented Apr 19, 2017

Any news on this?

@ThomasWeinert
Owner

I just pushed a fix for html-fragments: FluentDOM/FluentDOM@9b0daec

@shtse8
Author

shtse8 commented Apr 22, 2017

@ThomasWeinert Thanks for your updates.
But setting an html-fragment as the inner HTML of another html-fragment still gives the wrong result.

Code to reproduce

$doc = FluentDOM('<div></div>', 'html-fragment');
$doc->html(FluentDOM('网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!', 'html-fragment'));
echo $doc;

Expected Result:

<div>网友终于肉搜出「范冰冰」家族照片,没想到看见她奶奶才发现「范冰冰是全家最难看的」!</div>

Actual Result:

<div>������������家����没����她奶奶���������家������
</div>

@ThomasWeinert
Owner

That uses a different method of the HTML loader (load() vs loadFragment()). Added test, fixed and pushed: FluentDOM/FluentDOM@28c53f1
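
As a rough illustration of why fragment loading needs its own encoding handling, here is a minimal sketch using plain DOMDocument (it requires ext-dom). It is not the code from the commit above; `loadUtf8Fragment` is a hypothetical helper that wraps the fragment in a document declaring UTF-8, so the parser does not have to guess.

```php
<?php
/**
 * Hypothetical helper built on plain DOMDocument; this is not FluentDOM's
 * loader. The fragment is wrapped in a document that declares UTF-8.
 */
function loadUtf8Fragment(string $fragment): DOMElement
{
    $document = new DOMDocument();
    $previous = libxml_use_internal_errors(true); // keep parser warnings quiet
    $document->loadHTML(
        '<html><head><meta http-equiv="Content-Type" content="text/html; charset=utf-8">'
        . '</head><body>' . $fragment . '</body></html>'
    );
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    // The parsed fragment now lives inside <body>.
    return $document->getElementsByTagName('body')->item(0);
}

$body = loadUtf8Fragment('<p>网友</p>');
echo $body->textContent, "\n";
```

The child nodes of the returned `<body>` element can then be imported into the target document. A bare fragment carries no meta tag of its own, which is one reason a fragment path can behave differently from a full-document path.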

@shtse8
Author

shtse8 commented Apr 25, 2017

Thanks. It works GREAT!
But why is the closing div on the next line?

Code to reproduce

$doc = FluentDOM('<div></div>', 'html-fragment');
$doc->html(FluentDOM('hihi', 'html-fragment'));
echo $doc;

Expected Result:

<div>hihi</div>

Actual Result:

<div>hihi
</div>

@ThomasWeinert
Owner

I will move the formatting problem to a new issue.
