Unicode #17

Closed
yorek opened this Issue Jan 19, 2015 · 31 comments

Projects

None yet

5 participants

@yorek
yorek commented Jan 19, 2015

Hi guys, I've a question regarding string enconding. In the table "wp_post" all columns are nvarchar(max). I would expect, for that reason, to have text stored as UTF-16 but what I see, when I query the table via SSMS, that text is stored in UTF-8 . In fact, some letters like e-with-accent (è) are stored as two separate charactes. When you read the post from the website all is good, of course.

Is that expected, or I'm missing something? Do I have to set some special configuration options in WP or PHP?

Thanks!

@patrickebates
Member

I won't deny that we still don't understand this 100% ourselves. But I suspect it is the result of a combination of the SQL collation type and settings on the PHP sqlsrv connection. Some insight can be found at http://blogs.msdn.com/b/brian_swan/archive/2011/02/24/sql-server-driver-for-php-connection-options-characterset.aspx

I have attempted on multiple occasions to address this, as it makes it fairly difficult to interface directly with the DB from external applications. But as you mention, it reads back into the web site correctly so nothing is actually "broken". It just feels annoying.

@yorek
yorek commented Jan 20, 2015

Ah got it. I'll try to do some test using plain PHP and PHP SQL Server Driver to see if I can came up with something helpful. I have thousands of articles that I have to migrate to Wordpress and all the accented italian letters are looking Chinese after the migration. In addition to that, the html editor simply doesn't work as soon as it find some unencoded character....

@patrickebates
Member

I have seen one example of forcing UTF-8 on the SqlSrv object, but I was not able to see a difference in my testing. Maybe I was looking at it wrong.

@patrickebates
Member

I may have the answer. The collation of the database (or at the very least the columns of post_title and post_content of wp_posts) needs to be set to a code page supporting the characters of the language you intend to store.

For example, I tried storing the Greek κόσμε and was unsuccessful until I altered the collation on post_title and post_content to Greek_100_CI_AS.

There was no need to enable UTF-8 support on the SQLSRV connection object by using this method. In fact, enabling UTF-8 actually caused problems.

So it seems the best course of action may be to set the collation at the time the database is initially created.

@yorek
yorek commented Jan 21, 2015

I don't think that could be that case, since post_content and the other text column are nvarchar, and thus support UTF-18 unicode. With nvarchar the collation is used just for the sort order and case comparing. In the case you mention, probably the string was sent to SQL Server whitout specifing it was a unicode string, prefixing it with the "N" character. Try this:

create table dbo.UnicodeTest
(
    text_column_utf16 nvarchar(100) collate Latin1_General_CI_AS,
    text_column varchar(100) collate Latin1_General_CI_AS
)
go
insert into dbo.UnicodeTest values ('κόσμε', 'κόσμε')
insert into dbo.UnicodeTest values (N'κόσμε', N'κόσμε')
select * from dbo.UnicodeTest
go

this first row will show strange characters even for the nvarchar column while the second will work as expected.

@yorek
yorek commented Jan 21, 2015

Doing some more research I've found that it seems related to something that happen before the string is sent to the database. As you can test here http://www.cafewebmaster.com/online_tools/utf8_encode, if you use the character "è", the UTF-8 enconding will encode it to "è" which is exactly what I see in my tables.

@patrickebates
Member

Seems we are talking about slightly different problems. I was only able to write the Greek into the table and have it read out properly in the text editor and readable on a post by changing the collation. In order to have it readable in the database, I had to use the N prefix during an insert as well.

But trying to prefix all of the appropriate insert values with N could be troublesome at best.

And yes, something is happening to the string before it reaches the database. /wp-includes/formatting.php plays around with the content based on the value of the option blog_charset. But there isn't a blog_charset supported which is also an exact match for UCS-2.

@patrickebates
Member

Slight correction there. It's option blog_charset and/or wp-config value DB_CHARSET which controls how the strings are played with.

@yorek
yorek commented Jan 21, 2015

Ok, so my case is a little different since I can add all the text I want via Wordpress editor (for example κόσμε) and it gets saved correctly (but UTF-8 encoded), despite the fact that the collation is set to be SQL_LATIN1_GENERAL_CP1_CI_AS by default (I'm using SQL Azure). Btw this is exactly what I expect since characters are UTF-8 encoded. But I'd like to avoid encoding, since it is really not needed :)
So it seems that the problem is in the wordpress database layer. I'll do some other test to see if there is a way out...

@KentNordstrom

I think I found the solution to this. In wp-db.php it shows that 4.2.2 looks for charset utf8mb4 to fully support UTF-8 characters. By changing the wp-config DB_CHARSET to utf8mb4 and also change the dbo.wp_options blog_charset to utf8mb4 all my Swedish characters started to show as expected again.

@KentNordstrom

I was being to happy to quickly :-( The site started showing as expected, but the editor did not work.
I get "Warning: htmlspecialchars(): charset `utf8mb4' not supported, assuming utf-8 in C:\inetpub\hargaantik.se\wp-includes\formatting.php on line 3270". Will try investigate further.

@patrickebates
Member

Kent, have you tried setting the database collation at initial creation to one of the Swedish options?
Examples:
SQL_SwedishPhone_Pref_Cp1_CI_AS_KI_WI
SQL_SwedishStd_Pref_Cp1_CI_AS_KI_WI

@KentNordstrom

The collation in the database is, and has always been even before the updater "Finnish_Swedish_100_CI_AS". In the wp-config I have tried both '' and 'Finnish_Swedish_100_CI_AS' for the DB_COLLATE setting. I think the correct value for the charset should be UTF-8 but doing that breaks the Swedish characters on the site. Setting it o utf8mb4 makes it show OK but editor gives error. What should be the correct setting for wp-config DB_CHARSET and the dbo.wp_option blog_charset to make it work with my SQL collation?

@KentNordstrom

In wp-db.php in line 2289 I see the following...
// If the current connection can't support utf8mb4 characters, let's only send 3-byte utf8 characters.
if ( 'utf8mb4' === $charset && ! $this->has_cap( 'utf8mb4' ) ) {
$charset = 'utf8';
}

I'm however not capable to decodee what this actually means or how it might be possible to support UTF-8 instead of having utf8mb4. Because what it looks like when setting it to UTF-8 is I get this 3-byte characters instead of full UTF-8 support.
Never worked with PHP so don't know if I can change the support on the htmlspecialchars() function to support other than UTF-8 as teh error states.

@patrickebates
Member

It's my understanding that utf8mb4 was only added to support the emoji added in 4.2, and that full language support was available with utf8 already.

I should be able to spin up a fresh test server in a few hours and try to duplicate the issue.

@patrickebates
Member

I created a test site using Finnish_Swedish_100_CI_AS for the collation while setting up the SQL database, selecting suomi as the language, and making no additional changes (so I didn't make any changes to charset or collation in PN/WP by hand). I then modified the Hello World post with some text I copied from Wikipedia. It appears to me that the characters are operating properly in post and in the editor.
http://pnfinswed.azurewebsites.net/2015/05/31/moikka-maailma/

@KentNordstrom

Really appreciate the test you made Patrick! So it's my environment thats causing the trouble.
This site was accidently upgraded by co-admin to 4.2.2 and sudenly broke so I'm trying to "patch it together" by replacing it with the ProjectNami. The site was created with standard WP using a MSSQL plugin for connection to SQL. Might need to consider creating a new site and try to migrate posts and pages somehow. Not sure if Export will work since editor is not showing anything in the Posts and Pages.
I will let you know if I have further questions or find out what was wrong in my "patched" site.

@patrickebates
Member

I suggest attempting the export and then reviewing the XML file. We have seen other instances where the TinyMCE editor was unable to visualize content but the data was still present.

@KentNordstrom

I get some really "funny" results. Export using utf8mb4 shows things like title correctly but the posts are not showing correctly. Changin to UTF-8 show post content correctly but website title and description. messed up!?!? Very confusing indeed. Will export using UTF-8 and make som manual corrections to the XML before importing I think. If I look in the database itself everything is showing as it should. It's if like WP reads the wp_options table and the wp_posts table using different encodings!?!

@patrickebates
Member

As part of the PN 1.0.3 release we just tagged, I included an update to our Deploy To Azure script which will allow you to set the database collation when using that deployment method.

@patrickebates
Member

Kent, I wanted to follow-up and see how the migration has been going.

@KentNordstrom

I actually ended up moving the whole site to Azure Websites, so its been moved to MySQL.
I configured the encoding to make the posts look ok and made an export. Needed to make som manual changes to the xml due to wrong encoding on some global/general settings. I could then import into a new WP Website in Azure. But thx for all the help on the way :-)

@ZacharyChim

Hi, I just installed the latest version of Project Nami, and I realized that I can't save Chinese characters, neither in a post nor in something like menus. But the displaying of Chinese is working fine, and I can save English without a problem.

When I change the website to Chinese in the General settings, I can't save anything, whenever I tried to save a Chinese menu, it got an error about can't connect to the database. And again everything is fine when I switch the website back to English.

@patrickebates
Member

Which installation method did you use?

During the installation, were you presented the option of setting the SQL Collation? Our testing with other languages indicates that the SQL Collation needs to be set to a codepage which supports the language you intend to use.

@patrickebates patrickebates reopened this Oct 4, 2015
@ZacharyChim

I didn't see any option of setting the SQL Collation.

Actually at the first time, I installed the English version of WordPress, and things are like what I mentioned before. And just now, I deleted the database and re-install WordPress with the Chinese version at the first place, after entering the name of website, admin & password, etc, I got database errors:

WordPress 資料庫錯誤: []
INSERT INTO wp_options (option_name, option_value, autoload) VALUES ('siteurl', 'http://localhost/wordpress', 'yes'), ('home', 'http://localhost/wordpress', 'yes'), ('blogname', '我的網誌', 'yes'), ('blogdescription', '全新的正體中文 WordPress 網誌!', 'yes'), ('users_can_register', '0', 'yes'), ('admin_email', 'you@example.com', 'yes'), ('start_of_week', '1', 'yes'), ('use_balanceTags', '0', 'yes'), ('use_smilies', '1', 'yes'), ('require_name_email', '1', 'yes'), ('comments_notify', '1', 'yes'), ('posts_per_rss', '10', 'yes'), ('rss_use_excerpt', '0', 'yes'), ('mailserver_url', 'mail.example.com', 'yes'), ('mailserver_login', 'login@example.com', 'yes'), ('mailserver_pass', 'password', 'yes'), ('mailserver_port', '110', 'yes'), ('default_category', '1', 'yes'), ('default_comment_status', 'open', 'yes'), ('default_ping_status', 'open', 'yes'), ('default_pingback_flag', '1', 'yes'), ('posts_per_page', '10', 'yes'), ('date_format', 'Y-m-d', 'yes'), ('time_format', 'H:i:s', 'yes'), ('links_updated_date_format', 'Y-m-d H:i:s', 'yes'), ('comment_moderation', '0', 'yes'), ('moderation_notify', '1', 'yes'), ('permalink_structure', '', 'yes'), ('gzipcompression', '0', 'yes'), ('hack_file', '0', 'yes'), ('blog_charset', 'UTF-8', 'yes'), ('moderation_keys', '', 'no'), ('active_plugins', 'a:0:{}', 'yes'), ('category_base', '', 'yes'), ('ping_sites', 'http://rpc.pingomatic.com/', 'yes'), ('advanced_edit', '0', 'yes'), ('comment_max_links', '2', 'yes'), ('gmt_offset', '0', 'yes'), ('default_email_category', '1', 'yes'), ('recently_edited', '', 'no'), ('template', 'twentyfifteen', 'yes'), ('stylesheet', 'twentyfifteen', 'yes'), ('comment_whitelist', '1', 'yes'), ('blacklist_keys', '', 'no'), ('comment_registration', '0', 'yes'), ('html_type', 'text/html', 'yes'), ('use_trackback', '0', 'yes'), ('default_role', 'subscriber', 'yes'), ('db_version', '33057', 'yes'), ('uploads_use_yearmonth_folders', '1', 'yes'), ('upload_path', '', 'yes'), ('blog_public', '1', 'yes'), ('default_link_category', '2', 'yes'), ('show_on_front', 'posts', 'yes'), ('tag_base', '', 'yes'), ('show_avatars', '1', 'yes'), ('avatar_rating', 'G', 'yes'), ('upload_url_path', '', 'yes'), ('thumbnail_size_w', '150', 'yes'), ('thumbnail_size_h', '150', 'yes'), ('thumbnail_crop', '1', 'yes'), ('medium_size_w', '300', 'yes'), ('medium_size_h', '300', 'yes'), ('avatar_default', 'mystery', 'yes'), ('large_size_w', '1024', 'yes'), ('large_size_h', '1024', 'yes'), ('image_default_link_type', 'file', 'yes'), ('image_default_size', '', 'yes'), ('image_default_align', '', 'yes'), ('close_comments_for_old_posts', '0', 'yes'), ('close_comments_days_old', '14', 'yes'), ('thread_comments', '1', 'yes'), ('thread_comments_depth', '5', 'yes'), ('page_comments', '0', 'yes'), ('comments_per_page', '50', 'yes'), ('default_comments_page', 'newest', 'yes'), ('comment_order', 'asc', 'yes'), ('sticky_posts', 'a:0:{}', 'yes'), ('widget_categories', 'a:0:{}', 'yes'), ('widget_text', 'a:0:{}', 'yes'), ('widget_rss', 'a:0:{}', 'yes'), ('uninstall_plugins', 'a:0:{}', 'no'), ('timezone_string', 'Asia/Taipei', 'yes'), ('page_for_posts', '0', 'yes'), ('page_on_front', '0', 'yes'), ('default_post_format', '0', 'yes'), ('link_manager_enabled', '0', 'yes'), ('finished_splitting_shared_terms', '1', 'yes'), ('initial_db_version', '33057', 'yes')

WordPress 資料庫錯誤: []
INSERT INTO [wp_terms] ([name], [slug], [term_group]) VALUES ('未分類', 'uncategorized', 0)

WordPress 資料庫錯誤: []
INSERT INTO [wp_posts] ([post_author], [post_date], [post_date_gmt], [post_content], [post_excerpt], [post_title], [post_name], [post_modified], [post_modified_gmt], [guid], [comment_count], [to_ping], [pinged], [post_content_filtered]) VALUES (1, '2015-10-04 17:19:05', '2015-10-04 17:19:05', '歡迎來到 WordPress。這是你的第一篇文章。編輯或者刪除本篇文章,然後開始你的部落客生活!', '', 'Hello world! 哈囉!', 'hello-world', '2015-10-04 17:19:05', '2015-10-04 17:19:05', 'http://localhost/wordpress/?p=1', 1, '', '', '')

WordPress 資料庫錯誤: []
INSERT INTO [wp_comments] ([comment_post_ID], [comment_author], [comment_author_email], [comment_author_url], [comment_date], [comment_date_gmt], [comment_content]) VALUES (1, 'WordPress 桑', '', 'https://wordpress.org/', '2015-10-04 17:19:05', '2015-10-04 17:19:05', '嗨!這是一則迴響。 要刪除迴響,請登入後瀏覽文章的迴響,那兒你可以選擇編輯或刪除它們。')

WordPress 資料庫錯誤: []
INSERT INTO [wp_posts] ([post_author], [post_date], [post_date_gmt], [post_content], [post_excerpt], [comment_status], [post_title], [post_name], [post_modified], [post_modified_gmt], [guid], [post_type], [to_ping], [pinged], [post_content_filtered]) VALUES (1, '2015-10-04 17:19:05', '2015-10-04 17:19:05', '這是一個範例頁面。它和網誌文章不同,因為它的頁面位置是固定的,同時會顯示於你的網誌導覽選單(大多數的佈景主題)。大多數的人會新增一個「關於」頁面向訪客介紹自己。它可能類似下面這樣: <blockquote>嗨!你好!白天我是一位單車快遞員,晚上則是個有抱負的演員,這是我的網誌。我居住在台灣高雄,養了一隻名為 Jack 的狗。</blockquote> ...或像這樣: <blockquote>XYZ Doohickey Comany 成立於 1971 年,公司成立以來,我們一直向市民提供高品質的 doohickies 。我們位於台北市,有超過 2,000 名員工,對台北市政府有著相當大的貢獻。</blockquote> 作為一個新的 WordPress 使用者,你可以前往<a href="http://localhost/wordpress/wp-admin/">你的控制台</a>刪除這個頁面,並建立一個屬於你的全新內容。祝使用愉快!', '', 'closed', '範例頁面', 'sample-page', '2015-10-04 17:19:05', '2015-10-04 17:19:05', 'http://localhost/wordpress/?page_id=2', 'page', '', '', '')

It seems that nothing can be written into the database, but bellow the error message, it says installation was successful. So I logged in the admin panel, things are messed up, something shows up and some not.

@patrickebates
Member

What method did you use? Deploy To Azure, Azure Gallery, etc., or is this a local install on your own database server?

@ZacharyChim

a local install on my own database server.

@patrickebates
Member

In the New Database dialog of SQL Server Management Studio, click Options and you can change the Collation there. Image attached.
sqlcollation

@ZacharyChim

hi I've tried Chinese_Taiwan_Stroke_CI_AI among with several other common collation options for Chinese and I got the same errors.

@patrickebates
Member

Do you have an email address with which you can sign up for an Azure trial account? Would be useful if you could try to duplicate this issue by using our Deploy To Azure button. There is a dropdown to set the SQL Collation on the configuration page.

Would be useful to see if the issue persists in that environment. Could aid in isolating the specific problem.

@pippercameron pippercameron added this to the 1.4.0 milestone Apr 20, 2016
@patrickebates
Member

App Setting of ProjectNami.UTF8 can now be set to 1 to force UTF-8 connections on the driver

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment