MDEV-27009 Add UCA-14.0.0 collations
- Added one neutral and 22 tailored (language-specific) collations based on
  Unicode Collation Algorithm version 14.0.0.

  Collations were added for Unicode character sets
  utf8mb3, utf8mb4, ucs2, utf16, utf32.

  Every tailoring was added with four accent and case
  sensitivity flag combinations, e.g.:

  * utf8mb4_uca1400_swedish_as_cs
  * utf8mb4_uca1400_swedish_as_ci
  * utf8mb4_uca1400_swedish_ai_cs
  * utf8mb4_uca1400_swedish_ai_ci

  and their _nopad_ variants:

  * utf8mb4_uca1400_swedish_nopad_as_cs
  * utf8mb4_uca1400_swedish_nopad_as_ci
  * utf8mb4_uca1400_swedish_nopad_ai_cs
  * utf8mb4_uca1400_swedish_nopad_ai_ci
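The naming scheme listed above can be sketched as a small generator. This is only an illustration: `uca1400_names` is a hypothetical helper, not a function from the MariaDB source, and treating an empty tailoring as the neutral collation is an assumption based on the name utf8mb4_uca1400_as_ci used later in this message.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper (not part of the MariaDB source): builds the eight
// collation names listed in the commit message for one character set and
// one tailoring. An empty tailoring is assumed to mean the neutral collation.
std::vector<std::string> uca1400_names(const std::string &charset,
                                       const std::string &tailoring)
{
  assert(!charset.empty());
  const char *pads[]= {"", "_nopad"};
  const char *accents[]= {"_as", "_ai"};
  const char *cases[]= {"_cs", "_ci"};
  std::vector<std::string> names;
  for (const char *pad : pads)
    for (const char *accent : accents)
      for (const char *cs : cases)
      {
        std::string name= charset + "_uca1400";
        if (!tailoring.empty())
          name+= "_" + tailoring;
        names.push_back(name + pad + accent + cs);
      }
  return names;
}
```

Calling `uca1400_names("utf8mb4", "swedish")` yields the four padded names followed by the four _nopad_ names, in the same order as the two lists above.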

- Introduced the concept of contextually typed named collations:

  CREATE DATABASE db1 CHARACTER SET utf8mb4;
  CREATE TABLE db1.t1 (a CHAR(10) COLLATE uca1400_as_ci);

  The idea is that there is no need to specify the character set prefix
  in the new collation names. It is enough to type just the suffix
  "uca1400_as_ci". The character set is taken from the context.

  In the above example script the context character set is utf8mb4.
  So the CREATE TABLE will make a column with the collation
  utf8mb4_uca1400_as_ci.

  Short collation names can be used in any part of the SQL syntax
  where the COLLATE clause is understood.
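A minimal sketch of how a short name might be resolved against a context character set. It follows the same "%s_%s" composition that get_exact_collation_by_context_name() uses in the include/my_sys.h part of this commit. `full_collation_name` and `kCollationNameSize` are hypothetical names, and returning an empty string when the name does not fit is a choice made for this example, not behaviour taken from the source.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>
#include <string>

// Mirrors the value of MY_CS_COLLATION_NAME_SIZE introduced by this commit.
constexpr std::size_t kCollationNameSize= 64;

// Hypothetical helper: composes "<charset>_<short name>".
// Returns an empty string if the result does not fit into the buffer
// (an assumption of this sketch).
std::string full_collation_name(const std::string &charset,
                                const std::string &short_name)
{
  assert(!charset.empty());
  char buf[kCollationNameSize];
  int n= std::snprintf(buf, sizeof(buf), "%s_%s",
                       charset.c_str(), short_name.c_str());
  if (n < 0 || static_cast<std::size_t>(n) >= sizeof(buf))
    return std::string();  // formatting error or truncation
  return std::string(buf, static_cast<std::size_t>(n));
}
```

With the example script above, `full_collation_name("utf8mb4", "uca1400_as_ci")` gives "utf8mb4_uca1400_as_ci", the collation the column ends up with.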

- New collations are displayed only once
  (without character set combinations) by these statements:

     SELECT * FROM INFORMATION_SCHEMA.COLLATIONS;
     SHOW COLLATION;

  For example, all these collations:
  - utf8mb3_uca1400_swedish_as_ci
  - utf8mb4_uca1400_swedish_as_ci
  - ucs2_uca1400_swedish_as_ci
  - utf16_uca1400_swedish_as_ci
  - utf32_uca1400_swedish_as_ci
  have just one entry in INFORMATION_SCHEMA.COLLATIONS and SHOW COLLATION,
  with COLLATION_NAME equal to "uca1400_swedish_as_ci", which is the suffix
  without the character set name:

SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';

+-----------------------+
| COLLATION_NAME        |
+-----------------------+
| uca1400_swedish_as_ci |
+-----------------------+

  Note that the behaviour of old collations did not change.
  Non-Unicode collations (e.g. latin1_swedish_ci) and
  old UCA-4.0.0 collations (e.g. utf8mb4_unicode_ci)
  are still displayed with the character set prefix, as before.

- The structure of the table INFORMATION_SCHEMA.COLLATIONS was changed.

  The NOT NULL constraint was removed from these columns:
  - CHARACTER_SET_NAME
  - ID
  - IS_DEFAULT
  and from the corresponding columns in SHOW COLLATION.

  For example:

SELECT COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATIONS
WHERE COLLATION_NAME LIKE '%uca1400_swedish_as_ci';
+-----------------------+--------------------+------+------------+
| COLLATION_NAME        | CHARACTER_SET_NAME | ID   | IS_DEFAULT |
+-----------------------+--------------------+------+------------+
| uca1400_swedish_as_ci | NULL               | NULL | NULL       |
+-----------------------+--------------------+------+------------+

  The NULL value in these columns now means that the collation
  is applicable to multiple character sets.
  The behaviour of old collations did not change.
  Make sure your client programs can handle NULL values in these columns.
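One way a client could guard against the new NULL values is sketched below. In the MariaDB/MySQL C API a NULL column arrives as a NULL pointer in the row array, so the sketch takes a plain array of C strings and needs no client library. `CollationRow` and `parse_collation_row` are hypothetical names, and using -1 for "no single ID" is an arbitrary choice for the example.

```cpp
#include <cassert>
#include <cstdlib>
#include <string>

// Hypothetical record for one row of INFORMATION_SCHEMA.COLLATIONS.
struct CollationRow
{
  std::string name;
  bool multi_charset;   // true when CHARACTER_SET_NAME is NULL
  std::string charset;  // empty when multi_charset is true
  long id;              // -1 when ID is NULL (sketch convention)
};

// row[0..2] = COLLATION_NAME, CHARACTER_SET_NAME, ID as C strings;
// a null pointer stands for SQL NULL, as in a MYSQL_ROW.
CollationRow parse_collation_row(const char *const *row)
{
  assert(row != nullptr && row[0] != nullptr);
  CollationRow r;
  r.name= row[0];
  r.multi_charset= (row[1] == nullptr);
  r.charset= row[1] ? row[1] : "";
  r.id= row[2] ? std::strtol(row[2], nullptr, 10) : -1;
  return r;
}
```

A row such as the uca1400_swedish_as_ci example above parses with `multi_charset` set, while an old collation such as latin1_swedish_ci keeps its character set and numeric ID.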

- The structure of the table
  INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY was changed.

  Three new NOT NULL columns were added:
  - FULL_COLLATION_NAME
  - ID
  - IS_DEFAULT

  New collations have multiple entries in COLLATION_CHARACTER_SET_APPLICABILITY.
  The column COLLATION_NAME contains the collation name without the character
  set prefix. The column FULL_COLLATION_NAME contains the collation name with
  the character set prefix.

  Old collations have the full collation name in both FULL_COLLATION_NAME and
  COLLATION_NAME.

SELECT COLLATION_NAME, FULL_COLLATION_NAME, CHARACTER_SET_NAME, ID, IS_DEFAULT
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE FULL_COLLATION_NAME RLIKE '^(utf8mb4|latin1).*swedish.*ci$';
+-----------------------------+-------------------------------------+--------------------+------+------------+
| COLLATION_NAME              | FULL_COLLATION_NAME                 | CHARACTER_SET_NAME | ID   | IS_DEFAULT |
+-----------------------------+-------------------------------------+--------------------+------+------------+
| latin1_swedish_ci           | latin1_swedish_ci                   | latin1             |    8 | Yes        |
| latin1_swedish_nopad_ci     | latin1_swedish_nopad_ci             | latin1             | 1032 |            |
| utf8mb4_swedish_ci          | utf8mb4_swedish_ci                  | utf8mb4            |  232 |            |
| uca1400_swedish_ai_ci       | utf8mb4_uca1400_swedish_ai_ci       | utf8mb4            | 2368 |            |
| uca1400_swedish_as_ci       | utf8mb4_uca1400_swedish_as_ci       | utf8mb4            | 2370 |            |
| uca1400_swedish_nopad_ai_ci | utf8mb4_uca1400_swedish_nopad_ai_ci | utf8mb4            | 2372 |            |
| uca1400_swedish_nopad_as_ci | utf8mb4_uca1400_swedish_nopad_as_ci | utf8mb4            | 2374 |            |
+-----------------------------+-------------------------------------+--------------------+------+------------+

- Other INFORMATION_SCHEMA queries:

  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.COLUMNS;
  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.PARAMETERS;
  SELECT TABLE_COLLATION FROM INFORMATION_SCHEMA.TABLES;
  SELECT DEFAULT_COLLATION_NAME FROM INFORMATION_SCHEMA.SCHEMATA;
  SELECT COLLATION_NAME FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.EVENTS;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.EVENTS;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.ROUTINES;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.TRIGGERS;
  SELECT DATABASE_COLLATION FROM INFORMATION_SCHEMA.TRIGGERS;
  SELECT COLLATION_CONNECTION FROM INFORMATION_SCHEMA.VIEWS;

  display full collation names, including the character set prefix,
  for all collations, including new collations.

  Corresponding SHOW commands also display full collation names
  in collation-related columns:

  SHOW CREATE TABLE t1;
  SHOW CREATE DATABASE db1;
  SHOW TABLE STATUS;
  SHOW CREATE FUNCTION f1;
  SHOW CREATE PROCEDURE p1;
  SHOW CREATE EVENT ev1;
  SHOW CREATE TRIGGER tr1;
  SHOW CREATE VIEW v1;

  These INFORMATION_SCHEMA queries and SHOW statements may change in
  the future to display short collation names.
abarkov authored and sanja-byelkin committed Aug 10, 2022
1 parent 6bc10f8 commit 1334468
Showing 99 changed files with 46,196 additions and 1,004 deletions.
2 changes: 2 additions & 0 deletions .gitignore
@@ -616,6 +616,8 @@ scripts/mariadb-setpermission
sql/mariadbd
sql/mariadb-tzinfo-to-sql
storage/rocksdb/mariadb-ldb
strings/ctype-uca1400data.h
strings/uca-dump
tests/mariadb-client-test
versioninfo_dll.rc
versioninfo_exe.rc
6 changes: 3 additions & 3 deletions client/mysqldump.c
@@ -2581,7 +2581,7 @@ static uint dump_events_for_db(char *db)
MYSQL_RES *event_res, *event_list_res;
MYSQL_ROW row, event_list_row;

-char db_cl_name[MY_CS_NAME_SIZE];
+char db_cl_name[MY_CS_COLLATION_NAME_SIZE];
int db_cl_altered= FALSE;

DBUG_ENTER("dump_events_for_db");
@@ -2801,7 +2801,7 @@ static uint dump_routines_for_db(char *db)
FILE *sql_file= md_result_file;
MYSQL_ROW row, routine_list_row;

-char db_cl_name[MY_CS_NAME_SIZE];
+char db_cl_name[MY_CS_COLLATION_NAME_SIZE];
int db_cl_altered= FALSE;
// before 10.3 packages are not supported
uint upper_bound= mysql_get_server_version(mysql) >= 100300 ?
@@ -3844,7 +3844,7 @@ static int dump_triggers_for_table(char *table_name, char *db_name)
MYSQL_ROW row;
FILE *sql_file= md_result_file;

-char db_cl_name[MY_CS_NAME_SIZE];
+char db_cl_name[MY_CS_COLLATION_NAME_SIZE];
int ret= TRUE;
/* Servers below 5.1.21 do not support SHOW CREATE TRIGGER */
const int use_show_create_trigger= mysql_get_server_version(mysql) >= 50121;
80 changes: 77 additions & 3 deletions include/m_ctype.h
@@ -34,7 +34,9 @@ enum loglevel {
extern "C" {
#endif

-#define MY_CS_NAME_SIZE 32
+#define MY_CS_CHARACTER_SET_NAME_SIZE 32
+#define MY_CS_COLLATION_NAME_SIZE 64

#define MY_CS_CTYPE_TABLE_SIZE 257
#define MY_CS_TO_LOWER_TABLE_SIZE 256
#define MY_CS_TO_UPPER_TABLE_SIZE 256
@@ -116,7 +118,7 @@ extern MY_UNICASE_INFO my_unicase_unicode520;
*/
#define MY_UCA_MAX_WEIGHT_SIZE (8+1) /* Including 0 terminator */
#define MY_UCA_CONTRACTION_MAX_WEIGHT_SIZE (2*8+1) /* Including 0 terminator */
-#define MY_UCA_WEIGHT_LEVELS 2
+#define MY_UCA_WEIGHT_LEVELS 3

typedef struct my_contraction_t
{
@@ -240,6 +242,46 @@ typedef enum enum_repertoire_t
} my_repertoire_t;


/* ID compatibility */
typedef enum enum_collation_id_type
{
MY_COLLATION_ID_TYPE_PRECISE= 0,
MY_COLLATION_ID_TYPE_COMPAT_100800= 1
} my_collation_id_type_t;


/* Collation name display modes */
typedef enum enum_collation_name_mode
{
MY_COLLATION_NAME_MODE_FULL= 0,
MY_COLLATION_NAME_MODE_CONTEXT= 1
} my_collation_name_mode_t;


/* Level flags */
#define MY_CS_LEVEL_BIT_PRIMARY 0x00
#define MY_CS_LEVEL_BIT_SECONDARY 0x01
#define MY_CS_LEVEL_BIT_TERTIARY 0x02
#define MY_CS_LEVEL_BIT_QUATERNARY 0x03

#define MY_CS_COLL_LEVELS_S1 (1<<MY_CS_LEVEL_BIT_PRIMARY)

#define MY_CS_COLL_LEVELS_AI_CS (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
(1<<MY_CS_LEVEL_BIT_TERTIARY)

#define MY_CS_COLL_LEVELS_S2 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
(1<<MY_CS_LEVEL_BIT_SECONDARY)

#define MY_CS_COLL_LEVELS_S3 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
(1<<MY_CS_LEVEL_BIT_SECONDARY) | \
(1<<MY_CS_LEVEL_BIT_TERTIARY)

#define MY_CS_COLL_LEVELS_S4 (1<<MY_CS_LEVEL_BIT_PRIMARY)| \
(1<<MY_CS_LEVEL_BIT_SECONDARY) | \
(1<<MY_CS_LEVEL_BIT_TERTIARY) | \
(1<<MY_CS_LEVEL_BIT_QUATERNARY)


/* Flags for strxfrm */
#define MY_STRXFRM_LEVEL1 0x00000001 /* for primary weights */
#define MY_STRXFRM_LEVEL2 0x00000002 /* for secondary weights */
@@ -440,8 +482,13 @@ struct my_collation_handler_st
*/
size_t (*min_str)(CHARSET_INFO *cs, uchar *dst, size_t dstlen, size_t nchars);
size_t (*max_str)(CHARSET_INFO *cs, uchar *dst, size_t dstlen, size_t nchars);

uint (*get_id)(CHARSET_INFO *cs, my_collation_id_type_t type);
LEX_CSTRING (*get_collation_name)(CHARSET_INFO *cs,
my_collation_name_mode_t mode);
};


extern MY_COLLATION_HANDLER my_collation_8bit_bin_handler;
extern MY_COLLATION_HANDLER my_collation_8bit_simple_ci_handler;
extern MY_COLLATION_HANDLER my_collation_8bit_nopad_bin_handler;
@@ -843,6 +890,21 @@ struct charset_info_st
}

/* Collation routines */
uint default_flag() const
{
return state & MY_CS_PRIMARY;
}

uint binsort_flag() const
{
return state & MY_CS_BINSORT;
}

uint compiled_flag() const
{
return state & MY_CS_COMPILED;
}

int strnncoll(const uchar *a, size_t alen,
const uchar *b, size_t blen, my_bool b_is_prefix= FALSE) const
{
@@ -940,6 +1002,15 @@ struct charset_info_st
return (coll->max_str)(this, dst, dstlen, nchars);
}

uint get_id(my_collation_id_type_t type) const
{
return (coll->get_id)(this, type);
}

LEX_CSTRING get_collation_name(my_collation_name_mode_t mode) const
{
return (coll->get_collation_name)(this, mode);
}
#endif /* __cplusplus */
};

@@ -1520,6 +1591,9 @@ extern size_t my_strcspn(CHARSET_INFO *cs, const char *str, const char *end,
my_bool my_propagate_simple(CHARSET_INFO *cs, const uchar *str, size_t len);
my_bool my_propagate_complex(CHARSET_INFO *cs, const uchar *str, size_t len);

uint my_ci_get_id_generic(CHARSET_INFO *cs, my_collation_id_type_t type);
LEX_CSTRING my_ci_get_collation_name_generic(CHARSET_INFO *cs,
my_collation_name_mode_t mode);

typedef struct
{
@@ -1534,7 +1608,7 @@ my_repertoire_t my_string_repertoire(CHARSET_INFO *cs,
my_bool my_charset_is_ascii_based(CHARSET_INFO *cs);
my_repertoire_t my_charset_repertoire(CHARSET_INFO *cs);

-uint my_strxfrm_flag_normalize(uint flags, uint nlevels);
+uint my_strxfrm_flag_normalize(CHARSET_INFO *cs, uint flags);
void my_strxfrm_desc_and_reverse(uchar *str, uchar *strend,
uint flags, uint level);
size_t my_strxfrm_pad_desc_and_reverse(CHARSET_INFO *cs,
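The level-flag macros added to include/m_ctype.h above combine one bit per collation level into a mask. The sketch below restates those masks as constexpr values so the resulting numbers are easy to check. The `k...` names, `level_mask`, `has_level` and `level_count` are hypothetical and exist only in this example. Note that the macros in the diff expand without outer parentheses, so using them inside a larger expression with `&` would need care; the constexpr form here avoids that.

```cpp
#include <cassert>

// Bit positions copied from the MY_CS_LEVEL_BIT_* macros in the diff.
enum : unsigned
{
  kLevelBitPrimary= 0,
  kLevelBitSecondary= 1,
  kLevelBitTertiary= 2,
  kLevelBitQuaternary= 3
};

constexpr unsigned level_mask(unsigned bit) { return 1u << bit; }

// Masks equivalent to MY_CS_COLL_LEVELS_S1, _AI_CS, _S2, _S3 and _S4.
constexpr unsigned kLevelsS1= level_mask(kLevelBitPrimary);
constexpr unsigned kLevelsAiCs= level_mask(kLevelBitPrimary) |
                                level_mask(kLevelBitTertiary);
constexpr unsigned kLevelsS2= kLevelsS1 | level_mask(kLevelBitSecondary);
constexpr unsigned kLevelsS3= kLevelsS2 | level_mask(kLevelBitTertiary);
constexpr unsigned kLevelsS4= kLevelsS3 | level_mask(kLevelBitQuaternary);

// True if the mask includes the given level bit.
bool has_level(unsigned mask, unsigned bit)
{
  return (mask & level_mask(bit)) != 0;
}

// Number of levels present in the mask.
unsigned level_count(unsigned mask)
{
  unsigned n= 0;
  for (; mask != 0; mask>>= 1)
    n+= mask & 1u;
  return n;
}
```

For instance, the AI_CS mask has the primary and tertiary bits set but skips the secondary bit, which is why it cannot be expressed as a simple "first N levels" count.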
133 changes: 132 additions & 1 deletion include/my_sys.h
@@ -243,7 +243,7 @@ extern void (*proc_info_hook)(void *, const PSI_stage_info *, PSI_stage_info *,
const char *, const char *, const unsigned int);

/* charsets */
-#define MY_ALL_CHARSETS_SIZE 2048
+#define MY_ALL_CHARSETS_SIZE 4096
extern MYSQL_PLUGIN_IMPORT CHARSET_INFO *default_charset_info;
extern MYSQL_PLUGIN_IMPORT CHARSET_INFO *all_charsets[MY_ALL_CHARSETS_SIZE];
extern struct charset_info_st compiled_charsets[];
@@ -1112,4 +1112,135 @@ void my_init_mysys_psi_keys(void);
struct st_mysql_file;
extern struct st_mysql_file *mysql_stdin;
C_MODE_END


#ifdef __cplusplus

class Charset_loader_mysys: public MY_CHARSET_LOADER
{
public:
Charset_loader_mysys()
{
my_charset_loader_init_mysys(this);
}

/**
Get a CHARSET_INFO by a character set name.
@param cs_name Character set name
@param cs_flags e.g. MY_CS_PRIMARY, MY_CS_BINARY
@param my_flags mysys flags (MY_WME, MY_UTF8_IS_UTF8MB3)
@return
@retval NULL on error (e.g. not found)
@retval A CHARSET_INFO pointer on success
*/
CHARSET_INFO *get_charset(const char *cs_name, uint cs_flags, myf my_flags)
{
error[0]= '\0'; // Need to clear in case of the second call
return my_charset_get_by_name(this, cs_name, cs_flags, my_flags);
}

/**
Get a CHARSET_INFO by an exact collation name.
@param name Collation name
@param my_flags e.g. the utf8 translation flag
@return
@retval NULL on error (e.g. not found)
@retval A CHARSET_INFO pointer on success
*/
CHARSET_INFO *get_exact_collation(const char *name, myf my_flags)
{
error[0]= '\0'; // Need to clear in case of the second call
return my_collation_get_by_name(this, name, my_flags);
}

/**
Get a CHARSET_INFO by a context collation name.
The returned pointer must be further resolved to a character set.
@param name Collation name
@param my_flags The utf8 translation flag
@return
@retval NULL on error (e.g. not found)
@retval A CHARSET_INFO pointer on success
*/
CHARSET_INFO *get_context_collation(const char *name, myf my_flags)
{
return get_exact_collation_by_context_name(&my_charset_utf8mb4_general_ci,
name, my_flags);
}

/**
Get an exact CHARSET_INFO by a contextually typed collation name.
@param name Collation name
@param my_flags The utf8 translation flag
@return
@retval NULL on error (e.g. not found)
@retval A CHARSET_INFO pointer on success
*/
CHARSET_INFO *get_exact_collation_by_context_name(CHARSET_INFO *cs,
const char *name,
myf my_flags)
{
char tmp[MY_CS_COLLATION_NAME_SIZE];
my_snprintf(tmp, sizeof(tmp), "%s_%s", cs->cs_name.str, name);
return get_exact_collation(tmp, my_flags);
}

/*
Find a collation with binary comparison rules
*/
CHARSET_INFO *get_bin_collation(CHARSET_INFO *cs, myf my_flags)
{
/*
We don't need to handle old_mode=UTF8_IS_UTF8MB3 here,
This method assumes that "cs" points to a real character set name.
It can be either "utf8mb3" or "utf8mb4". It cannot be "utf8".
No thd->get_utf8_flag() flag passed to get_charset_by_csname().
*/
DBUG_ASSERT(cs->cs_name.length !=4 || memcmp(cs->cs_name.str, "utf8", 4));
/*
CREATE TABLE t1 (a CHAR(10) BINARY)
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
Nothing to do, we have the binary collation already.
*/
if (cs->state & MY_CS_BINSORT)
return cs;

// CREATE TABLE t1 (a CHAR(10) BINARY) CHARACTER SET utf8mb4;/
error[0]= '\0'; // Need in case of the second execution
return get_charset(cs->cs_name.str, MY_CS_BINSORT, my_flags);
}

/*
Find the default collation in the given character set
*/
CHARSET_INFO *get_default_collation(CHARSET_INFO *cs, myf my_flags)
{
// See comments in find_bin_collation_or_error()
DBUG_ASSERT(cs->cs_name.length !=4 || memcmp(cs->cs_name.str, "utf8", 4));
/*
CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT) CHARACTER SET utf8mb4;
Nothing to do, we have the default collation already.
*/
if (cs->state & MY_CS_PRIMARY)
return cs;
/*
CREATE TABLE t1 (a CHAR(10) COLLATE DEFAULT)
CHARACTER SET utf8mb4 COLLATE utf8mb4_bin;
Don't need to handle old_mode=UTF8_IS_UTF8MB3 here.
See comments in find_bin_collation_or_error.
*/
cs= get_charset(cs->cs_name.str, MY_CS_PRIMARY, my_flags);
DBUG_ASSERT(cs);
return cs;
}
};

#endif /*__cplusplus */


#endif /* _my_sys_h */
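get_bin_collation() and get_default_collation() above share one pattern: return the given collation if it already carries the wanted state flag, otherwise look up another collation of the same character set that does. A toy sketch of that pattern follows. The `Collation` struct, the three-row table, the flag values and both function names are hypothetical; which collation is marked as default in the table is an assumption made for the example.

```cpp
#include <cassert>
#include <cstring>

// Illustrative flag values; not taken from the MariaDB headers.
constexpr unsigned kPrimaryFlag= 1u << 0;
constexpr unsigned kBinsortFlag= 1u << 1;

// Hypothetical stand-in for CHARSET_INFO.
struct Collation
{
  const char *charset;
  const char *name;
  unsigned state;
};

// Toy registry; the real server keeps its collations in all_charsets[].
static const Collation kCollations[]=
{
  {"utf8mb4", "utf8mb4_general_ci", kPrimaryFlag},
  {"utf8mb4", "utf8mb4_bin", kBinsortFlag},
  {"utf8mb4", "utf8mb4_uca1400_as_ci", 0}
};

// Finds the first collation of the character set that has the flag.
const Collation *find_by_charset_and_flag(const char *charset, unsigned flag)
{
  for (const Collation &c : kCollations)
    if (std::strcmp(c.charset, charset) == 0 && (c.state & flag))
      return &c;
  return nullptr;
}

// Returns cs itself if it already has the flag, else a sibling that does.
const Collation *resolve_with_flag(const Collation *cs, unsigned flag)
{
  assert(cs != nullptr);
  if (cs->state & flag)
    return cs;
  return find_by_charset_and_flag(cs->charset, flag);
}
```

Passing the utf8mb4_bin entry with the binary flag returns the same pointer, which corresponds to the "Nothing to do" branches in the two methods.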
6 changes: 4 additions & 2 deletions libmysqld/lib_sql.cc
@@ -1097,13 +1097,15 @@ bool Protocol_text::store_field_metadata(const THD * thd,
if (charset_for_protocol == &my_charset_bin || thd_cs == NULL)
{
/* No conversion */
-client_field->charsetnr= charset_for_protocol->number;
+client_field->charsetnr= charset_for_protocol->
+get_id(MY_COLLATION_ID_TYPE_COMPAT_100800);
client_field->length= server_field.length;
}
else
{
/* With conversion */
-client_field->charsetnr= thd_cs->number;
+client_field->charsetnr= thd_cs->
+get_id(MY_COLLATION_ID_TYPE_COMPAT_100800);
client_field->length= server_field.max_octet_length(charset_for_protocol,
thd_cs);
}
21 changes: 21 additions & 0 deletions mysql-test/include/ctype_uca1400_ids_using_convert.inc
@@ -0,0 +1,21 @@
--disable_ps_protocol
--enable_metadata
DELIMITER $$;
FOR rec IN (SELECT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE CHARACTER_SET_NAME=@charset
AND COLLATION_NAME RLIKE 'uca1400'
ORDER BY ID)
DO
SET NAMES utf8mb4;
SET character_set_results=NULL;
EXECUTE IMMEDIATE CONCAT('SELECT CONVERT('''' USING ',@charset,')',
' COLLATE ', rec.COLLATION_NAME,
' AS ', rec.COLLATION_NAME,
' LIMIT 0');
END FOR;
$$
DELIMITER ;$$
--disable_metadata
--enable_ps_protocol
SET NAMES utf8;
17 changes: 17 additions & 0 deletions mysql-test/include/ctype_uca1400_ids_using_set_names.inc
@@ -0,0 +1,17 @@

--disable_ps_protocol
--enable_metadata
DELIMITER $$;
FOR rec IN (SELECT COLLATION_NAME
FROM INFORMATION_SCHEMA.COLLATION_CHARACTER_SET_APPLICABILITY
WHERE CHARACTER_SET_NAME=@charset
AND COLLATION_NAME RLIKE 'uca1400'
ORDER BY ID)
DO
EXECUTE IMMEDIATE CONCAT('SET NAMES ',@charset,' COLLATE ', rec.COLLATION_NAME);
SELECT rec.COLLATION_NAME;
END FOR;
$$
DELIMITER ;$$
--disable_metadata
--enable_ps_protocol
2 changes: 1 addition & 1 deletion mysql-test/main/create.result
@@ -1120,7 +1120,7 @@ show create table t1;
Table Create Table
t1 CREATE TABLE `t1` (
`CHARACTER_SET_NAME` varchar(32) NOT NULL,
-`DEFAULT_COLLATE_NAME` varchar(32) NOT NULL,
+`DEFAULT_COLLATE_NAME` varchar(64) NOT NULL,
`DESCRIPTION` varchar(60) NOT NULL,
`MAXLEN` bigint(3) NOT NULL
) ENGINE=MEMORY DEFAULT CHARSET=utf8mb3