You signed in with another tab or window. Reload to refresh your session.You signed out in another tab or window. Reload to refresh your session.You switched accounts on another tab or window. Reload to refresh your session.Dismiss alert
MDEV-31071 Refactor case folding data types in Unicode collations
This is a non-functional change. It changes the way how case folding data
and weight data (for simple Unicode collations) are stored:
- Removing data types MY_UNICASE_CHARACTER, MY_UNICASE_INFO
- Using data types MY_CASEFOLD_CHARACTER, MY_CASEFOLD_INFO instead.
This patch changes simple Unicode collations in a similar way
how MDEV-30695 previously changed Asian collations.
No new MTR tests are needed. The underlying code is thoroughly
covered by a number of ctype_*_ws.test and ctype_*_casefold.test
files, which were added recently as a preparation
for this change.
Old and new Unicode data layout
-------------------------------
Case folding data is now stored in separate tables
consisting of MY_CASEFOLD_CHARACTER elements with two members:
typedef struct casefold_info_char_t
{
uint32 toupper;
uint32 tolower;
} MY_CASEFOLD_CHARACTER;
while weight data (for simple non-UCA collations xxx_general_ci
and xxx_general_mysql500_ci) is stored in separate arrays of
uint16 elements.
Before this change case folding data and simple weight data were
stored together, in tables of the following elements with three members:
typedef struct unicase_info_char_st
{
uint32 toupper;
uint32 tolower;
uint32 sort; /* weights for simple collations */
} MY_UNICASE_CHARACTER;
This data format was redundant, because weights (the "sort" member) were
needed only for these two simple Unicode collations:
- xxx_general_ci
- xxx_general_mysql500_ci
Adding case folding information for Unicode-14.0.0 using the old
format would waste memory without purpose.
Detailed changes
----------------
- Changing the underlying data types as described above
- Including unidata-dump.c into the sources.
This program was earlier used to dump UnicodeData.txt
(e.g. https://www.unicode.org/Public/14.0.0/ucd/UnicodeData.txt)
into MySQL / MariaDB source files.
It was originally written in 2002, but has not been distributed yet
together with MySQL / MariaDB sources.
- Removing the old format Unicode data earlier dumped from UnicodeData.txt
(versions 3.0.0 and 5.2.0) from ctype-utf8.c.
Adding Unicode data in the new format into separate header files,
to maintain the code easier:
- ctype-unicode300-casefold.h
- ctype-unicode300-casefold-tr.h
- ctype-unicode300-general_ci.h
- ctype-unicode300-general_mysql500_ci.h
- ctype-unicode520-casefold.h
- Adding a new file ctype-unidata.c as an aggregator for
the header files listed above.
0 commit comments