-
-
Notifications
You must be signed in to change notification settings - Fork 439
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Replace utf8 by utf8mb4 #1036
Replace utf8 by utf8mb4 #1036
Conversation
What about all the "DEFAULT CHARSET=utf8;" in setup scripts? So your DB supports emoji, but single tables don't, do they? |
Adding this to existing installs without migrating the database tables might cause a problem, I don't remember specifics but you get weird errors for some queries about incompatible collations and such (LIKE queries if I remember correctly).. So I think this should be accompanied by a script that runs this for every table or every table where utf8mb4 support matters and the errors might exist (yes, this takes a long time on a large database): ALTER TABLE {$table} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci; |
Note: INDEXES When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three. [...] The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
While this may work well for new installs I believe it will cause issues with existing installs so needs a database migration script and probably would be considered a BC breaking change (third-party code may be affected) so should be merged into the mainline branch rather than stable (branching strategy is not yet a settled matter).
The error about incompatible character set:
I try to filter orders by increment id from backend with: |
What is wanted by this PR? To be able to use emoji in OpenMage and it can only be done if the database is UTFmb4. The modification in the OpenMage code proposed in this PR must be done only after the database is either directly UTF8mb4 (through installation) or through conversion from UTF8. This last situation is not at all simple and requires several steps to be carried out that start with a backup, altering the database, tables, dev tests and many others. More details here, I think so far it is the best resource I have found https://mathiasbynens.be/notes/mysql-utf8mb4. Approving the PR without converting the database will generate issues that I have not evaluated, but I will do a test with a UTF8 database. There are a lot of issues discussed here: https://stackoverflow.com/questions/39463134/how-to-store-emoji-character-in-mysql-database/. As a personal opinion, I would prefer this PR to be an extension and not adopt it even with a conversion script in OpenMage. This last step that we impose on everyone who uses OM and which is mandatory is not a guarantee that the store will work 100%. We must avoid such issues. Whoever wants to use emoji can have at their disposal, as I said, an extension that provides the required changes in OM and a conversion script with all the steps to follow. However, using UTFmb4 has its advantages: |
one more source: https://blog.koehntopp.info/2022/01/12/utf8mb4.html#tldr
definitely does not belong into the 1.9.4.x branch, as long as MySQL does not plan to remove the utf8 one. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
change target branch
Edit: see next comment
also relevant: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html
So we will need to move at some point in the future, as the currently used utf8mb3 is deprecated, but also our currently used alias is changing. Question is, how MySQL will handle it if we keep the current alias. Do the Tables actually use the alias internally? or will they then at one point not work anymore after an upgrade of MySQL? |
Source: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-conversion.html. "One advantage of converting from utf8mb3 to utf8mb4 is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements. In terms of table content, conversion from utf8mb3 to utf8mb4 presents no problems: For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length. For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none. In terms of table structure, these are the primary potential incompatibilities: For the variable-length character data types (VARCHAR and the TEXT types), the maximum permitted length in characters is less for utf8mb4 columns than for utf8mb3 columns. For all character data types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is less for utf8mb4 columns than for utf8mb3 columns. Consequently, to convert tables from utf8mb3 to utf8mb4, it may be necessary to change some column or index definitions. Tables can be converted from utf8mb3 to utf8mb4 by using [ALTER TABLE](https://dev.mysql.com/doc/refman/8.0/en/alter-table.html). Suppose that a table has this definition:
The following statement converts t1 to use utf8mb4:
In conclusion, simply modifying the OM files as this PR proposes it will not solve the functionality issues that will appear. We should analyze if is requested by the new MySQL versions to migrate OM to UTFmb4, but this step must be prepared in advance. This PR offers an immediate solution only in the case of a new installation without any data, but for an existing database, mandatory steps must be taken before it is imported or modified directly. It's not just an issue related to emoji's, but also of sheltering us in future MySQL versions. How do I see this PR?
|
I think this is a fairly difficult task to take on but here is a script I used once before if it helps whoever wants to tackle this: #!/bin/bash
#
# Update a database and convert all tables to utf8mb4
#
# Example:
#
# MYSQL_SRC_PREFIX='docker exec -i openmage_mysql_1' \
# MYSQL_SRC_AUTH='-u openmage -ppassword' \
# MYSQL_DATABASE=openmage \
# ./utf8mb4.sh
set -e
if [[ -z $MYSQL_SRC_PREFIX ]]; then
MYSQL_SRC_PREFIX=""
fi
if [[ -z $MYSQL_SRC_AUTH ]]; then
MYSQL_SRC_AUTH="-u root"
fi
if [[ -z $MYSQL_DATABASE ]]; then
echo "MYSQL_DATABASE is not specified."
exit 1
fi
src_db=$MYSQL_DATABASE
_mysql_src="$MYSQL_SRC_PREFIX mysql $MYSQL_SRC_AUTH"
tables="$($_mysql_src $src_db -Bse 'show tables')"
out=$(mktemp)
echo "-- Generated by utf8mb4.sh" > $out
echo "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;" >> $out
echo "USE $src_db;" >> $out
echo "ALTER DATABASE $src_db CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;" >> $out
for table in $tables; do
if [[ $table =~ ^sales_order_status ]]; then # Skip this table which has a string foreign key
continue;
fi
echo "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;" >> $out
done
cat $out
input=''
if [[ $1 = '-' ]]; then
noop
elif [[ -z $1 ]]; then
echo
read -p "Apply above changes to $src_db? (Y|n) " input
elif [[ $1 = "--apply" ]]; then
input='Y'
fi
if [[ $input = "Y" ]]; then
<$out $_mysql_src -Bs
fi
rm $out |
To copy utf8mb4 characters to flat tables, I did that some weeks/months ago: diff --git b/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php a/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
index 22a3578fca..b5472fe6a1 100644
--- b/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
+++ a/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
@@ -575,7 +575,9 @@ class Mage_Catalog_Model_Resource_Category_Flat extends Mage_Index_Model_Resourc
$_writeAdapter->dropTable($tableName);
$table = $this->_getWriteAdapter()
->newTable($tableName)
- ->setComment(sprintf('Catalog Category Flat (Store %d)', $store));
+ ->setComment(sprintf('Catalog Category Flat (Store %d)', $store))
+ ->setOption('charset', 'utf8mb4')
+ ->setOption('collate', 'utf8mb4_general_ci');
//Adding columns
if ($this->_columnsSql === null) {
diff --git b/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php a/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
index d3bfc39246..b3eb3efaad 100644
--- b/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
+++ a/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
@@ -704,7 +704,9 @@ class Mage_Catalog_Model_Resource_Product_Flat_Indexer extends Mage_Index_Model_
Varien_Db_Ddl_Table::ACTION_CASCADE
);
}
- $table->setComment("Catalog Product Flat (Store {$storeId})");
+ $table->setComment("Catalog Product Flat (Store {$storeId})")
+ ->setOption('charset', 'utf8mb4')
+ ->setOption('collate', 'utf8mb4_general_ci');
$adapter->createTable($table);
|
we should fix the conflicts and rebase to "next" |
i'm ok with merging it as it is without forcing an update on the existing installation, let's just not leave PRs hanging in the limbo forever without deciding anything. @luigifab could you rebase it on main or next (I think next is better in this case) so that we could remove the 1.9.4.x branch? |
rebased it to main, please check that everything is ok |
In the meantime this PR has become a resource of information for other projects as well. My opinion remained unchanged, it would be time to move in the direction of utf8mb4 to keep up with the new versions of MySQL, but the problem related to the existing installations remains. I have already made an attempt applying PR in a test environment, with a prior installation in DDEV of OpenMage with Sample Data. Without a migration script, its integration will generate immediate trouble. |
Closed because this PR is incomplete... but this is true, is doesn't do anything, except allowing to use utf8mb4 for specific tables/columns, and I will be out there until 2025. |
I think it was complete and it should have been merged as it was |
Just adding my .02 in case this gets picked back up: I think I would be in favor of removing the default "SET NAMES" statement rather than changing it - this just causes it to use the server default which ostensibly was set intentionally by the user and makes it consistent with other database clients. Then if/when the user updates their database to a more modern charset/collation they can simultaneously update the default connection options rather than manage that through OpenMage config. So my proposal would be:
SELECT
kcu.TABLE_NAME,
kcu.COLUMN_NAME,
kcu.CONSTRAINT_NAME,
kcu.REFERENCED_TABLE_NAME,
kcu.REFERENCED_COLUMN_NAME,
col.DATA_TYPE
FROM
INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
JOIN INFORMATION_SCHEMA.COLUMNS col ON col.TABLE_SCHEMA = kcu.TABLE_SCHEMA
AND col.TABLE_NAME = kcu.TABLE_NAME
AND col.COLUMN_NAME = kcu.COLUMN_NAME
WHERE
kcu.REFERENCED_TABLE_SCHEMA = database()
AND (col.DATA_TYPE = 'varchar' OR col.DATA_TYPE = 'text')
AND kcu.REFERENCED_COLUMN_NAME IS NOT NULL;
-- Generates SQL statements to convert any tables/columns that are not the desired charset/collation
-- to the desired charset/collation.
-- See https://dba.stackexchange.com/a/265487/7274
SET @database = database();
SET @charset = @@character_set_connection;
SET @collate = @@collation_connection;
SELECT "SET foreign_key_checks = 0;"
UNION ALL
SELECT concat(
"ALTER DATABASE `",
`SCHEMA_NAME`,
"` CHARACTER SET = ",
@charset,
" COLLATE = ",
@collate,
";"
) AS `sql`
FROM `information_schema`.`SCHEMATA`
WHERE `SCHEMA_NAME` = @database
AND (
`DEFAULT_CHARACTER_SET_NAME` <> @charset
OR `DEFAULT_COLLATION_NAME` <> @collate
)
UNION ALL
SELECT concat(
"ALTER TABLE `",
`TABLE_SCHEMA`,
"`.`",
`TABLE_NAME`,
"` CONVERT TO CHARACTER SET ",
@charset,
" COLLATE ",
@collate,
";"
) AS `sql`
FROM `information_schema`.`TABLES`
WHERE `TABLE_SCHEMA` = @database
AND `TABLE_TYPE` = "BASE TABLE"
AND `TABLE_COLLATION` <> @collate
UNION ALL
SELECT concat(
"ALTER TABLE `",
c.`TABLE_SCHEMA`,
"`.`",
c.`TABLE_NAME`,
"` CHANGE `",
c.`COLUMN_NAME`,
"` `",
c.`COLUMN_NAME`,
"` ",
c.`COLUMN_TYPE`,
" CHARACTER SET ",
@charset,
" COLLATE ",
@collate,
if(c.`IS_NULLABLE` = "YES", " NULL", " NOT NULL"),
";"
) AS `sql`
FROM `information_schema`.`COLUMNS` c,
`information_schema`.`TABLES` t,
`information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` a
WHERE c.`TABLE_SCHEMA` = t.`TABLE_SCHEMA`
AND c.`TABLE_NAME` = t.`TABLE_NAME`
AND t.`TABLE_COLLATION` = a.`COLLATION_NAME`
AND c.`TABLE_SCHEMA` = @database
AND c.`DATA_TYPE` IN (
'varchar',
'char',
'text',
'tinytext',
'mediumtext',
'longtext'
)
AND (
c.`CHARACTER_SET_NAME` <> a.`CHARACTER_SET_NAME`
OR c.`COLLATION_NAME` <> t.`TABLE_COLLATION`
)
AND t.`TABLE_TYPE` = "BASE TABLE"
UNION ALL
SELECT "SET foreign_key_checks = 1;"; |
@colinmollenhour - I appreciate the fact that you don't want us to leave this PR unfinished. The first point is interesting to analyze, it will clearly affect those with MySQL < 8.0. however, keeping it there would be a guarantee that this configuration will be set up. The second point, it can be merged in main branch if we provide a migration script and a detailed description. The big issue is that it cannot be as an update version, because the migration step is necessary. Maybe we should offer a separate option from the regular package, being a special case. New OM installations are fewer, even using next will require migration. I would be interested in testing this script and if it proves to work then we can do it as follows, we create an update strictly with this change, nothing else, no other PRs. Those with installations modify the database, test the installation and if everything is working fine they can export the database and take it to production. Later on this installations they can use the regular updates that we offer periodically. In practice, we prepare them for the new MySQL versions and for the a mandatory step I say. |
For #430. This allow (partially, because this doesn't change database column) use emojis in database for new install. For existing install, edit manually app/etc/local.xml and change utf8 by utf8mb4.