Encoding problems on Ubuntu 12.04 LTS (libxml 2.7.8) #48

Closed
dermario opened this issue May 11, 2016 · 14 comments

@dermario

dermario commented May 11, 2016

I am using the AMP module for Drupal 7 and I'm facing an encoding issue when using it in my hosting environment. After a lot of research and debugging, I narrowed the problem down to the native DOMDocument implementation.

I already created an issue on d.o for that: https://www.drupal.org/node/2712895

AMP::convertToAmpHtml() calls $qp = QueryPath::withHTML($document, NULL, ['convert_to_encoding' => 'UTF-8']); which boils down (simplified) to the equivalent of this:

// see QueryPath\DOMQuery::parseXMLString(...)
$content = mb_convert_encoding($content, 'UTF-8', 'auto');
$document->loadHTML($content);

This is where my umlauts get broken in my hosting environment. When I change the lines to

// see QueryPath\DOMQuery::parseXMLString(...)
$content = mb_convert_encoding($content, 'HTML-ENTITIES', 'UTF-8');
$document->loadHTML($content);

Everything works fine.

I think that might be an issue with PHP/libxml? Is that library too old to work with AMP?
[Screenshot: phpinfo-dom]
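
For readers who want to try the workaround outside QueryPath, here is a minimal sketch. It swaps in mb_encode_numericentity() instead of the 'HTML-ENTITIES' target used above, because that pseudo-encoding is deprecated in newer PHP releases (8.2+, as far as I know). The helper name loadUtf8Html is hypothetical and is not part of QueryPath or the AMP library; the input is assumed to be valid UTF-8.

```php
<?php
// Hypothetical helper (not part of QueryPath or the AMP library).
// Every non-ASCII character is turned into a numeric entity, so libxml
// never has to guess the encoding of the input bytes.
function loadUtf8Html(string $html): DOMDocument
{
    // Conversion map: code points 0x80..0x10FFFF, offset 0, full mask.
    $encoded = mb_encode_numericentity($html, [0x80, 0x10FFFF, 0, 0x1FFFFF], 'UTF-8');

    $document = new DOMDocument('1.0', 'UTF-8');
    $previous = libxml_use_internal_errors(true); // collect parser warnings quietly
    $document->loadHTML($encoded);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    return $document;
}

// Assumes this file itself is saved as UTF-8.
$document = loadUtf8Html('<p>äüö</p>');
echo $document->getElementsByTagName('p')->item(0)->textContent, "\n";
```

Because the markup handed to loadHTML() is pure ASCII after the conversion, the result should not depend on how a given libxml version detects the document encoding.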

@sidkshatriya
Contributor

Hmm...I wonder what the problem is here. Ideally you shouldn't need to convert to HTML-ENTITIES in the above fashion. What is your original HTML (before any AMP formatter has run)?

As you point out at https://www.drupal.org/node/2712895#comment-11134411, the problem occurs only on Acquia Cloud and works fine on your localhost (without any code modifications).

@sidkshatriya
Contributor

sidkshatriya commented May 11, 2016

Don't worry about the API version being 20031129. Even in PHP 7, it lists the API version as 20031129.

The main questions are:

  • What PHP version are you using on localhost, and what version on Acquia Cloud?
  • What is the libxml version on localhost, and what is it on Acquia Cloud?

You would need to go to the following links in your browser:

http://your-localhostdrupal-url/admin/reports/status/php
http://your-acquia-cloud-url/admin/reports/status/php

Have you tried switching the version of PHP on your Acquia Cloud to a higher version and seeing if it fixes things? (You would need to access the Acquia Cloud control panel to do this.)
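
If the status pages are not handy, the same two version numbers can be read from a script. This is a rough sketch: PHP_VERSION and LIBXML_DOTTED_VERSION are built-in constants, but the helper name libxmlOlderThan and the 2.9.0 threshold are assumptions chosen for illustration. As far as I know, LIBXML_DOTTED_VERSION reports the libxml version PHP was built against.

```php
<?php
// Hypothetical helper: true when the libxml reported by PHP is older than $minimum.
function libxmlOlderThan(string $minimum): bool
{
    return version_compare(LIBXML_DOTTED_VERSION, $minimum, '<');
}

printf("PHP %s, libxml %s\n", PHP_VERSION, LIBXML_DOTTED_VERSION);

// The 2.9.0 threshold is an assumption for illustration, not a confirmed cutoff.
if (libxmlOlderThan('2.9.0')) {
    echo "libxml is older than 2.9.0; encoding behaviour may differ\n";
}
```

Running it on both localhost and Acquia Cloud gives the two values to compare.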

@dermario
Author

This is the script I tested with:

<?php
$document = new \DOMDocument('1.0', 'utf-8');
$content = '<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
</head>
<body>
    Dies liess der heute 64-Jährige nicht auf sich sitzen.
</body>
</html>';
$content = mb_convert_encoding($content, 'UTF-8', 'auto');
$document->loadHTML($content);
echo "<pre>";
var_dump($document->saveHTML());
?> 

On Acquia Cloud we are running PHP 5.6.19 with libxml version 2.7.8.
[Screenshot: libxml-acquia-cloud]

On my local machine I am running 5.6.15-1+deb.sury.org~trusty+1 with libxml version 2.9.1.
[Screenshot: libxml-local]

PHP 5.6 is the only version available to select in our Insight backend at the moment.
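
To compare environments without eyeballing var_dump() output, the test script above could be turned into a self-checking one. This is only a sketch: the function name domRoundTripsUtf8 is made up for illustration, the file is assumed to be saved as UTF-8, and the result for non-ASCII input depends on the libxml build PHP uses.

```php
<?php
// Hypothetical check: does the <body> text survive DOMDocument::loadHTML() unchanged?
function domRoundTripsUtf8(string $html, string $expectedText): bool
{
    $document = new DOMDocument('1.0', 'utf-8');
    $previous = libxml_use_internal_errors(true); // keep parser warnings out of the output
    $document->loadHTML($html);
    libxml_clear_errors();
    libxml_use_internal_errors($previous);

    $body = $document->getElementsByTagName('body')->item(0);

    // DOM text is always handed back as UTF-8, so a plain string comparison is enough.
    return $body !== null && trim($body->textContent) === $expectedText;
}

$sentence = 'Dies liess der heute 64-Jährige nicht auf sich sitzen.';
$html = '<!DOCTYPE html>
<html>
<head>
    <meta charset="UTF-8">
</head>
<body>
    ' . $sentence . '
</body>
</html>';

// Prints bool(true) when the umlauts survive, bool(false) when they are mangled.
var_dump(domRoundTripsUtf8($html, $sentence));
```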

@dermario
Author

We just tested locally against a very old installation (PHP 5.4.45) with libxml 2.7.6. We are facing exactly the same encoding problems as on Acquia Cloud.

@sidkshatriya
Contributor

What happens when you remove the line

$content = mb_convert_encoding($content, 'UTF-8', 'auto');

@dermario
Author

dermario commented May 11, 2016

Removing the line mentioned above doesn't change the behaviour. The umlauts are broken in the same way:

_ Dies liess der heute 64-Jährige nicht auf sich sitzen_

@sidkshatriya
Contributor

In the snippet above there is no AMP stuff happening at all, but you still get broken umlauts.

Presumably this is because of the old libxml. Why is Acquia using an old libxml when I see that Ubuntu 14.04 (trusty) is also using 2.9.1? (Is this a plain vanilla installation on your localhost?)

What about raising a ticket with Acquia and asking what's up with their libxml? (My localhost has 2.9.0, for what it's worth.)

Incidentally, libxml version 2.7.8 dates back to Nov 2010, which is a long time on the internet! (See http://xmlsoft.org/news.html.) 2.7.0 was released way back in 2008.

@dermario
Author

I can rewrite my snippet to use the AMP library the way the Drupal module does. The result would be the same. My intention was to reduce the problem to the absolute minimum. This is what AMP does in the end.

Ubuntu 14.04 comes with libxml 2.9.1, but 12.04 (LTS) is still on 2.7.8. My local machine is Ubuntu 14.04 and our Acquia environment is still 12.04. I have tried a different Ubuntu 12.04 server that is not in the Acquia Cloud and got the same errors. That means there is a problem with libxml 2.7.8 on Ubuntu 12.04, and AMP does not seem to run on those hosts.

dermario changed the title from "Encoding problems when using amp field formatters" to "Encoding problems on Ubuntu 12.04 LTS (libxml 2.7.8)" on May 11, 2016
@dermario
Author

dermario commented May 11, 2016

Here is the snippet using the AMP PHP library:

require_once 'vendor/autoload.php';

$amp = new \Lullabot\AMP\AMP();
$amp->loadHtml('<p>äüö</p>');
var_dump($amp->convertToAmpHtml());

On Ubuntu 12.04 with libxml 2.7.8 that results in:

string(19) "
äüö

"

@sidkshatriya
Contributor

With libxml 2.9.0 I get

string(13) "<p>äüö</p>"

(which is correct)
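
A note on the two lengths: 13 bytes is what '<p>äüö</p>' occupies in UTF-8, and 19 bytes is what the same markup occupies if its UTF-8 bytes are misread as ISO-8859-1 and encoded to UTF-8 a second time. That double encoding is an inference from the byte counts, not something confirmed in this thread. A small sketch of the arithmetic:

```php
<?php
// "<p>äüö</p>" spelled out as UTF-8 bytes, so the file encoding does not matter.
$correct = "<p>\xC3\xA4\xC3\xBC\xC3\xB6</p>";

// Simulate the suspected bug: treat the UTF-8 bytes as ISO-8859-1 and re-encode.
$doubleEncoded = mb_convert_encoding($correct, 'UTF-8', 'ISO-8859-1');

echo strlen($correct), "\n";       // 13: 7 ASCII bytes + 3 umlauts at 2 bytes each
echo strlen($doubleEncoded), "\n"; // 19: 7 ASCII bytes + 6 misread bytes at 2 bytes each
```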

@sidkshatriya
Contributor

sidkshatriya commented May 11, 2016

I'll leave this ticket open for now -- encoding issues are very hairy. There is a known workaround as mentioned by you.

It would be great if you filed a bug report with Acquia and asked them why they are using such an old libxml.

@dermario
Author

I already filed a bug report with Acquia. The reason they use libxml 2.7.8 is that it is bundled with Ubuntu 12.04, and upgrading to Ubuntu 14.04 is not an option right now. Their replies:

We can't upgrade unfortunately to a new version to 14.04, but Ubuntu 12.04 LTS has one year until end of life ( so we'll upgrade probably before then, but I can't tell you precisely when.)

I discussed this issue with a colleague, and we think that you need to patch it to make it work ( unless maintainers are willing to write some retro compatibility in it)

And

As you may know Acquia runs a Platform As A Service, as such, we do not have the flexibility to upgrade 1 particular customer to an alternate version of Ubuntu. We must maintain the same version of OS across all our customers, since all our tools, systems and processes rely on the same LAMP stack.

However, as mentioned [...], it looks likely that we will upgrade our version in the next few months.
Once we have visibility we usually inform our customers through our regular communication.

As such, a patch seem the only option in the short term.

I personally like the idea of having Ubuntu 12.04 support in this library. But maybe I am the only one using AMP on Drupal and Ubuntu 12.04.

@sidkshatriya
Contributor

sidkshatriya commented May 12, 2016

Thanks for taking the time to report this problem to us and Acquia.

As mentioned, I'll leave this ticket open for now. I'll do some more research on whether there is a better way to solve this issue than the workaround you mentioned.

@sidkshatriya
Contributor

@dsayswhat @Scuts TL;DR The latest release of the library will work with libxml 2.7.8 without any workarounds.

Please test on your setup and confirm!

To test this, I installed a fresh Ubuntu 12.04 virtual machine instance. I had to compile PHP 5.5, 5.6 & 7.0 from source (with libxml 2.7.8), as 12.04 is bundled with PHP 5.3 (we only support PHP 5.5+).

$ cat test.php 
<?php
require_once 'vendor/autoload.php';

$amp = new \Lullabot\AMP\AMP();
$amp->loadHtml('<p>äüö</p>');
var_dump($amp->convertToAmpHtml());

$ php -i | grep "libxml Version"
libxml Version => 2.7.8

$ php test.php 
string(13) "<p>äüö</p>"

This is working now because we're using the masterminds/html5-php parser for HTML (see #68).
