Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace utf8 by utf8mb4 #1036

Closed
wants to merge 1 commit into from
Closed

Conversation

luigifab
Copy link
Collaborator

@luigifab luigifab commented Jun 10, 2020

For #430. This allow (partially, because this doesn't change database column) use emojis in database for new install. For existing install, edit manually app/etc/local.xml and change utf8 by utf8mb4.

@kkrieger85
Copy link
Contributor

What about all the "DEFAULT CHARSET=utf8;" in setup scripts? So your DB supports emoji, but single tables don't, do they?

@colinmollenhour
Copy link
Member

Adding this to existing installs without migrating the database tables might cause a problem, I don't remember specifics but you get weird errors for some queries about incompatible collations and such (LIKE queries if I remember correctly).. So I think this should be accompanied by a script that runs this for every table or every table where utf8mb4 support matters and the errors might exist (yes, this takes a long time on a large database):

ALTER TABLE {$table} CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;

@sreichel
Copy link
Contributor

Note:

INDEXES

When converting from utf8 to utf8mb4, the maximum length of a column or index key is unchanged in terms of bytes. Therefore, it is smaller in terms of characters, because the maximum length of a character is now four bytes instead of three. [...] The InnoDB storage engine has a maximum index length of 767 bytes, so for utf8 or utf8mb4 columns, you can index a maximum of 255 or 191 characters, respectively. If you currently have utf8 columns with indexes longer than 191 characters, you will need to index a smaller number of characters when using utf8mb4.

(https://stackoverflow.com/a/48388044/5703627)

@sreichel sreichel added the review needed Problem should be verified label Jun 20, 2020
kkrieger85
kkrieger85 previously approved these changes Jul 7, 2020
Copy link
Member

@colinmollenhour colinmollenhour left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

While this may work well for new installs I believe it will cause issues with existing installs so needs a database migration script and probably would be considered a BC breaking change (third-party code may be affected) so should be merged into the mainline branch rather than stable (branching strategy is not yet a settled matter).

@github-actions github-actions bot added Component: Backup Relates to Mage_Backup Component: Cron Relates to Mage_Cron Component: Install Relates to Mage_Install labels Jul 24, 2020
@luigifab
Copy link
Collaborator Author

luigifab commented Feb 3, 2022

The error about incompatible character set:

SQLSTATE[HY000]: General error: 1267 Illegal mix of collations (utf8_general_ci,IMPLICIT)
 and (utf8mb4_general_ci,COERCIBLE) for operation 'like', query was:
 SELECT `main_table`.* FROM `sales_flat_order_grid` AS `main_table`
 WHERE (`main_table`.`increment_id` LIKE '%🔔10023545678🖤%') ORDER BY created_at DESC LIMIT 200

I try to filter orders by increment id from backend with: 🔔10023545678🖤
I didn't got other errors.

@ADDISON74 ADDISON74 self-requested a review August 31, 2022 09:10
@ADDISON74
Copy link
Contributor

ADDISON74 commented Aug 31, 2022

What is wanted by this PR? To be able to use emoji in OpenMage and it can only be done if the database is UTFmb4.

The modification in the OpenMage code proposed in this PR must be done only after the database is either directly UTF8mb4 (through installation) or through conversion from UTF8. This last situation is not at all simple and requires several steps to be carried out that start with a backup, altering the database, tables, dev tests and many others. More details here, I think so far it is the best resource I have found https://mathiasbynens.be/notes/mysql-utf8mb4.

Approving the PR without converting the database will generate issues that I have not evaluated, but I will do a test with a UTF8 database. There are a lot of issues discussed here: https://stackoverflow.com/questions/39463134/how-to-store-emoji-character-in-mysql-database/.

As a personal opinion, I would prefer this PR to be an extension and not adopt it even with a conversion script in OpenMage. This last step that we impose on everyone who uses OM and which is mandatory is not a guarantee that the store will work 100%. We must avoid such issues. Whoever wants to use emoji can have at their disposal, as I said, an extension that provides the required changes in OM and a conversion script with all the steps to follow.

However, using UTFmb4 has its advantages:
https://www.percona.com/blog/migrating-to-utf8mb4-things-to-consider/

@Flyingmana
Copy link
Member

one more source: https://blog.koehntopp.info/2022/01/12/utf8mb4.html#tldr
highlighting the

Your programming languages utf8 is called utf8mb4 in MySQL.

definitely does not belong into the 1.9.4.x branch, as long as MySQL does not plan to remove the utf8 one.

Copy link
Member

@Flyingmana Flyingmana left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

change target branch
Edit: see next comment

@Flyingmana
Copy link
Member

also relevant: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-utf8mb3.html

Note
Historically, MySQL has used utf8 as an alias for utf8mb3; beginning with MySQL 8.0.28, utf8mb3 is used exclusively in the output of SHOW statements and in Information Schema tables when this character set is meant.

At some point in the future utf8 is expected to become a reference to utf8mb4. To avoid ambiguity about the meaning of utf8, consider specifying utf8mb4 explicitly for character set references instead of utf8.

You should also be aware that the utf8mb3 character set is deprecated and you should expect it to be removed in a future MySQL release. Please use utf8mb4 instead.

So we will need to move at some point in the future, as the currently used utf8mb3 is deprecated, but also our currently used alias is changing.

Question is, how MySQL will handle it if we keep the current alias. Do the Tables actually use the alias internally? or will they then at one point not work anymore after an upgrade of MySQL?

@Flyingmana Flyingmana requested review from Flyingmana and removed request for Flyingmana August 31, 2022 09:51
@ADDISON74
Copy link
Contributor

ADDISON74 commented Aug 31, 2022

Source: https://dev.mysql.com/doc/refman/8.0/en/charset-unicode-conversion.html.

"One advantage of converting from utf8mb3 to utf8mb4 is that this enables applications to use supplementary characters. One tradeoff is that this may increase data storage space requirements.

In terms of table content, conversion from utf8mb3 to utf8mb4 presents no problems:

For a BMP character, utf8mb4 and utf8mb3 have identical storage characteristics: same code values, same encoding, same length.

For a supplementary character, utf8mb4 requires four bytes to store it, whereas utf8mb3 cannot store the character at all. When converting utf8mb3 columns to utf8mb4, you need not worry about converting supplementary characters because there are none.

In terms of table structure, these are the primary potential incompatibilities:

For the variable-length character data types (VARCHAR and the TEXT types), the maximum permitted length in characters is less for utf8mb4 columns than for utf8mb3 columns.

For all character data types (CHAR, VARCHAR, and the TEXT types), the maximum number of characters that can be indexed is less for utf8mb4 columns than for utf8mb3 columns.

Consequently, to convert tables from utf8mb3 to utf8mb4, it may be necessary to change some column or index definitions.

Tables can be converted from utf8mb3 to utf8mb4 by using [ALTER TABLE](https://dev.mysql.com/doc/refman/8.0/en/alter-table.html). Suppose that a table has this definition:

CREATE TABLE t1 (
  col1 CHAR(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_unicode_ci NOT NULL,
  col2 CHAR(10) CHARACTER SET utf8mb3 COLLATE utf8mb3_bin NOT NULL
) CHARACTER SET utf8mb3;

The following statement converts t1 to use utf8mb4:

ALTER TABLE t1
  DEFAULT CHARACTER SET utf8mb4,
  MODIFY col1 CHAR(10)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci NOT NULL,
  MODIFY col2 CHAR(10)
    CHARACTER SET utf8mb4 COLLATE utf8mb4_bin NOT NULL;

In conclusion, simply modifying the OM files as this PR proposes it will not solve the functionality issues that will appear.

We should analyze if is requested by the new MySQL versions to migrate OM to UTFmb4, but this step must be prepared in advance. This PR offers an immediate solution only in the case of a new installation without any data, but for an existing database, mandatory steps must be taken before it is imported or modified directly. It's not just an issue related to emoji's, but also of sheltering us in future MySQL versions.

How do I see this PR?

  • mentions in README.md with warnings.
  • a script or step-by-step procedure for converting an existing dumped database. I do not recommend that the changes be made directly in a production database.
  • modified files can be renamed with an established extension. those who want to switch to UTFmb4 can rename the respective files. After a while we can remove these files, but it cannot be done without preparation.

@colinmollenhour
Copy link
Member

I think this is a fairly difficult task to take on but here is a script I used once before if it helps whoever wants to tackle this:

#!/bin/bash
#
# Update a database and convert all tables to utf8mb4
#
# Example:
#
# MYSQL_SRC_PREFIX='docker exec -i openmage_mysql_1' \
# MYSQL_SRC_AUTH='-u openmage -ppassword' \
# MYSQL_DATABASE=openmage \
# ./utf8mb4.sh
set -e

if [[ -z $MYSQL_SRC_PREFIX ]]; then
    MYSQL_SRC_PREFIX=""
fi
if [[ -z $MYSQL_SRC_AUTH ]]; then
    MYSQL_SRC_AUTH="-u root"
fi
if [[ -z $MYSQL_DATABASE ]]; then
    echo "MYSQL_DATABASE is not specified."
    exit 1
fi
src_db=$MYSQL_DATABASE

_mysql_src="$MYSQL_SRC_PREFIX mysql $MYSQL_SRC_AUTH"

tables="$($_mysql_src $src_db -Bse 'show tables')"

out=$(mktemp)
echo "-- Generated by utf8mb4.sh" > $out
echo "SET NAMES utf8mb4 COLLATE utf8mb4_unicode_ci;" >> $out
echo "USE $src_db;" >> $out
echo "ALTER DATABASE $src_db CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci;" >> $out
for table in $tables; do
    if [[ $table =~ ^sales_order_status ]]; then # Skip this table which has a string foreign key
        continue;
    fi
    echo "ALTER TABLE $table CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;" >> $out
done
cat $out

input=''
if [[ $1 = '-' ]]; then
  noop
elif [[ -z $1 ]]; then
  echo
  read -p "Apply above changes to $src_db? (Y|n) " input
elif [[ $1 = "--apply" ]]; then
  input='Y'
fi

if [[ $input = "Y" ]]; then
    <$out $_mysql_src -Bs
fi
rm $out

@luigifab luigifab changed the title replace utf8 by utf8mb4 Replace utf8 by utf8mb4 Apr 10, 2023
@luigifab
Copy link
Collaborator Author

To copy utf8mb4 characters to flat tables, I did that some weeks/months ago:

diff --git b/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php a/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
index 22a3578fca..b5472fe6a1 100644
--- b/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
+++ a/app/code/core/Mage/Catalog/Model/Resource/Category/Flat.php
@@ -575,7 +575,9 @@ class Mage_Catalog_Model_Resource_Category_Flat extends Mage_Index_Model_Resourc
         $_writeAdapter->dropTable($tableName);
         $table = $this->_getWriteAdapter()
             ->newTable($tableName)
-            ->setComment(sprintf('Catalog Category Flat (Store %d)', $store));
+            ->setComment(sprintf('Catalog Category Flat (Store %d)', $store))
+            ->setOption('charset', 'utf8mb4')
+            ->setOption('collate', 'utf8mb4_general_ci');
 
         //Adding columns
         if ($this->_columnsSql === null) {
diff --git b/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php a/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
index d3bfc39246..b3eb3efaad 100644
--- b/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
+++ a/app/code/core/Mage/Catalog/Model/Resource/Product/Flat/Indexer.php
@@ -704,7 +704,9 @@ class Mage_Catalog_Model_Resource_Product_Flat_Indexer extends Mage_Index_Model_
                     Varien_Db_Ddl_Table::ACTION_CASCADE
                 );
             }
-            $table->setComment("Catalog Product Flat (Store {$storeId})");
+            $table->setComment("Catalog Product Flat (Store {$storeId})")
+                ->setOption('charset', 'utf8mb4')
+                ->setOption('collate', 'utf8mb4_general_ci');
 
             $adapter->createTable($table);
 

@fballiano
Copy link
Contributor

we should fix the conflicts and rebase to "next"

@fballiano
Copy link
Contributor

i'm ok with merging it as it is without forcing an update on the existing installation, let's just not leave PRs hanging in the limbo forever without deciding anything.

@luigifab could you rebase it on main or next (I think next is better in this case) so that we could remove the 1.9.4.x branch?

@fballiano fballiano changed the base branch from 1.9.4.x to main May 23, 2023 17:32
@fballiano fballiano dismissed kkrieger85’s stale review May 23, 2023 17:32

The base branch was changed.

@github-actions github-actions bot removed the Component: Backup Relates to Mage_Backup label May 23, 2023
@fballiano
Copy link
Contributor

rebased it to main, please check that everything is ok

@ADDISON74
Copy link
Contributor

In the meantime this PR has become a resource of information for other projects as well. My opinion remained unchanged, it would be time to move in the direction of utf8mb4 to keep up with the new versions of MySQL, but the problem related to the existing installations remains. I have already made an attempt applying PR in a test environment, with a prior installation in DDEV of OpenMage with Sample Data. Without a migration script, its integration will generate immediate trouble.

@luigifab
Copy link
Collaborator Author

luigifab commented May 31, 2023

Closed because this PR is incomplete... but this is true, is doesn't do anything, except allowing to use utf8mb4 for specific tables/columns, and I will be out there until 2025.
Anyway, I continue use it as is, and for me I haven't any troubles since I use it.

@luigifab luigifab closed this May 31, 2023
@luigifab luigifab deleted the emoji-utf8mb4 branch May 31, 2023 08:40
@fballiano
Copy link
Contributor

I think it was complete and it should have been merged as it was

@colinmollenhour
Copy link
Member

Just adding my .02 in case this gets picked back up:

I think I would be in favor of removing the default "SET NAMES" statement rather than changing it - this just causes it to use the server default which ostensibly was set intentionally by the user and makes it consistent with other database clients. Then if/when the user updates their database to a more modern charset/collation they can simultaneously update the default connection options rather than manage that through OpenMage config.

So my proposal would be:

  1. Remove "SET NAMES" from all OpenMage code (managed by database admin instead - users can still use initStatements if they so choose). If you install on a new-ish version of MySQL you will get the optimal result by default.
    • This will break users of MySQL 5.7 with the default config which uses latin1.
  2. Merge into "next" rather than "main" since there is potential to break some queries on existing installations due to MySQL throwing mismatching collation errors if stored tables don't match the config.
  3. Add a shell script to make database migrations easier to generate and reference this in release notes. * **
  4. Add mention to install notes to set the desired charset/collation in the database config before installing.
    • MySQL 8.0 uses a good default out of the box (utf8mb4 and utf8mb4_0900_ai_ci) so this is not necessary for most users although still good to be aware of the option.

* Normally this requires foreign keys to be disabled during migration to avoid errors when varchar/text columns are used for FKs but I believe in OpenMage there are no such FKs but this can be checked like so:

SELECT
  kcu.TABLE_NAME,
  kcu.COLUMN_NAME,
  kcu.CONSTRAINT_NAME,
  kcu.REFERENCED_TABLE_NAME,
  kcu.REFERENCED_COLUMN_NAME,
  col.DATA_TYPE
FROM
  INFORMATION_SCHEMA.KEY_COLUMN_USAGE kcu
  JOIN INFORMATION_SCHEMA.COLUMNS col ON col.TABLE_SCHEMA = kcu.TABLE_SCHEMA
    AND col.TABLE_NAME = kcu.TABLE_NAME
    AND col.COLUMN_NAME = kcu.COLUMN_NAME
WHERE
  kcu.REFERENCED_TABLE_SCHEMA = database()
  AND (col.DATA_TYPE = 'varchar' OR col.DATA_TYPE = 'text')
  AND kcu.REFERENCED_COLUMN_NAME IS NOT NULL;

** And here is a better script for generating SQL to migrate a full database (it is not necessary to migrate the full database but it is not trivial to figure out which ones will trigger errors in OpenMage code to reduce the list and it could change anyway):

-- Generates SQL statements to convert any tables/columns that are not the desired charset/collation
-- to the desired charset/collation.
-- See https://dba.stackexchange.com/a/265487/7274
SET @database = database();
SET @charset = @@character_set_connection;
SET @collate = @@collation_connection;

SELECT "SET foreign_key_checks = 0;"
UNION ALL
SELECT concat(
               "ALTER DATABASE `",
               `SCHEMA_NAME`,
               "` CHARACTER SET = ",
               @charset,
               " COLLATE = ",
               @collate,
               ";"
           ) AS `sql`
FROM `information_schema`.`SCHEMATA`
WHERE `SCHEMA_NAME` = @database
  AND (
            `DEFAULT_CHARACTER_SET_NAME` <> @charset
        OR `DEFAULT_COLLATION_NAME` <> @collate
    )
UNION ALL
SELECT concat(
               "ALTER TABLE `",
               `TABLE_SCHEMA`,
               "`.`",
               `TABLE_NAME`,
               "` CONVERT TO CHARACTER SET ",
               @charset,
               " COLLATE ",
               @collate,
               ";"
           ) AS `sql`
FROM `information_schema`.`TABLES`
WHERE `TABLE_SCHEMA` = @database
  AND `TABLE_TYPE` = "BASE TABLE"
  AND `TABLE_COLLATION` <> @collate
UNION ALL
SELECT concat(
               "ALTER TABLE `",
               c.`TABLE_SCHEMA`,
               "`.`",
               c.`TABLE_NAME`,
               "` CHANGE `",
               c.`COLUMN_NAME`,
               "` `",
               c.`COLUMN_NAME`,
               "` ",
               c.`COLUMN_TYPE`,
               " CHARACTER SET ",
               @charset,
               " COLLATE ",
               @collate,
               if(c.`IS_NULLABLE` = "YES", " NULL", " NOT NULL"),
               ";"
           ) AS `sql`
FROM `information_schema`.`COLUMNS` c,
     `information_schema`.`TABLES` t,
     `information_schema`.`COLLATION_CHARACTER_SET_APPLICABILITY` a
WHERE c.`TABLE_SCHEMA` = t.`TABLE_SCHEMA`
  AND c.`TABLE_NAME` = t.`TABLE_NAME`
  AND t.`TABLE_COLLATION` = a.`COLLATION_NAME`
  AND c.`TABLE_SCHEMA` = @database
  AND c.`DATA_TYPE` IN (
                        'varchar',
                        'char',
                        'text',
                        'tinytext',
                        'mediumtext',
                        'longtext'
    )
  AND (
            c.`CHARACTER_SET_NAME` <> a.`CHARACTER_SET_NAME`
        OR c.`COLLATION_NAME` <> t.`TABLE_COLLATION`
    )
  AND t.`TABLE_TYPE` = "BASE TABLE"
UNION ALL
SELECT "SET foreign_key_checks = 1;";

@ADDISON74 ADDISON74 self-assigned this May 31, 2023
@ADDISON74
Copy link
Contributor

@colinmollenhour - I appreciate the fact that you don't want us to leave this PR unfinished.

The first point is interesting to analyze, it will clearly affect those with MySQL < 8.0. however, keeping it there would be a guarantee that this configuration will be set up.

The second point, it can be merged in main branch if we provide a migration script and a detailed description. The big issue is that it cannot be as an update version, because the migration step is necessary. Maybe we should offer a separate option from the regular package, being a special case. New OM installations are fewer, even using next will require migration.

I would be interested in testing this script and if it proves to work then we can do it as follows, we create an update strictly with this change, nothing else, no other PRs. Those with installations modify the database, test the installation and if everything is working fine they can export the database and take it to production. Later on this installations they can use the regular updates that we offer periodically. In practice, we prepare them for the new MySQL versions and for the a mandatory step I say.

@ADDISON74 ADDISON74 removed their assignment Jul 14, 2023
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
Component: Cron Relates to Mage_Cron Component: Install Relates to Mage_Install Don't forget this PR new feature review needed Problem should be verified
Projects
None yet
Development

Successfully merging this pull request may close these issues.

None yet

7 participants