Ensure that source files contain only valid UTF8 encodings (#2188)
Modern software (including text editors, static analysis software,
and web-based code review interfaces) often requires source code files
to be interpretable via a consistent character encoding, with UTF-8 or
ASCII (a strict subset of UTF-8) as the default. Several of the MariaDB
source files contain bytes that are not valid in either the UTF-8 or
ASCII encodings, but instead represent strings encoded in the
ISO-8859-1/Latin-1 or ISO-8859-2/Latin-2 encodings.
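To illustrate what "not valid in UTF-8" means in practice, here is a minimal C sketch that checks a byte buffer for well-formed UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). It is not part of this commit, and the function name `is_valid_utf8` is hypothetical.

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper, not part of the commit.
   Returns true if buf[0..len) is well-formed UTF-8. */
bool is_valid_utf8(const unsigned char *buf, size_t len)
{
  size_t i = 0;
  while (i < len)
  {
    unsigned char c = buf[i];
    size_t extra;        /* continuation bytes expected after the lead byte */
    unsigned long cp;    /* code point being decoded */
    unsigned long min;   /* smallest code point allowed for this length */

    if (c < 0x80) { i++; continue; }                 /* plain ASCII */
    else if ((c & 0xE0) == 0xC0) { extra = 1; cp = c & 0x1F; min = 0x80; }
    else if ((c & 0xF0) == 0xE0) { extra = 2; cp = c & 0x0F; min = 0x800; }
    else if ((c & 0xF8) == 0xF0) { extra = 3; cp = c & 0x07; min = 0x10000; }
    else return false;   /* stray continuation byte or invalid lead byte */

    if (len - i - 1 < extra)
      return false;      /* sequence is cut off by the end of the buffer */

    for (size_t k = 1; k <= extra; k++)
    {
      unsigned char cc = buf[i + k];
      if ((cc & 0xC0) != 0x80)
        return false;    /* expected a 10xxxxxx continuation byte */
      cp = (cp << 6) | (cc & 0x3F);
    }

    /* Reject overlong encodings, UTF-16 surrogates and out-of-range values. */
    if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
      return false;

    i += extra + 1;
  }
  return true;
}
```

A Latin-1 `é` is the single byte 0xE9. A UTF-8 decoder reads that as the lead byte of a three-byte sequence, so the check fails as soon as an ASCII letter follows it.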

These inconsistent encodings may prevent software from correctly
presenting or processing such files. Converting all source files to
valid UTF-8 will ensure correct handling.
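For the Latin-1 case, the conversion itself is mechanical, because every Latin-1 byte value equals its Unicode code point. The following C sketch shows the idea. It is an assumption about how such a conversion could be done, not the tooling used for this commit, and `latin1_to_utf8` is a hypothetical name. Latin-2 would need a lookup table instead, since its upper half does not map directly onto U+0080..U+00FF.

```c
#include <stddef.h>

/* Hypothetical helper, not part of the commit.
   Converts Latin-1 bytes src[0..len) to UTF-8 in dst and returns the
   number of bytes written. dst must have room for 2 * len bytes. */
size_t latin1_to_utf8(const unsigned char *src, size_t len, unsigned char *dst)
{
  size_t out = 0;
  for (size_t i = 0; i < len; i++)
  {
    unsigned char c = src[i];
    if (c < 0x80)
    {
      dst[out++] = c;                                   /* ASCII is unchanged */
    }
    else
    {
      /* U+0080..U+00FF becomes a two-byte sequence: 110000xx 10xxxxxx */
      dst[out++] = (unsigned char)(0xC0 | (c >> 6));
      dst[out++] = (unsigned char)(0x80 | (c & 0x3F));
    }
  }
  return out;
}
```

With this mapping the Latin-1 byte 0xE9 (`é`) becomes the two bytes 0xC3 0xA9.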

Comments written in Czech were replaced with lightly corrected
translations from Google Translate. Additionally, comments describing
the proper handling of special characters were changed so that the
comments are now purely UTF-8.

All new code of the whole pull request, including one or several files
that are either new files or modified ones, are contributed under the
BSD-new license. I am contributing on behalf of my employer
Amazon Web Services, Inc.

Co-authored-by: Andrew Hutchings <andrew@linuxjedi.co.uk>
anson1014 and LinuxJedi committed May 19, 2023
1 parent c205f6c commit 1db4fc5
Showing 4 changed files with 34 additions and 60 deletions.
2 changes: 1 addition & 1 deletion mysys/my_win_popen.cc
@@ -92,7 +92,7 @@ extern "C" FILE *my_win_popen(const char *cmd, const char *mode)
goto error;
break;
default:
/* Unknown mode, éxpected "r", "rt", "w", "wt" */
/* Unknown mode, expected "r", "rt", "w", "wt" */
abort();
}
if (!SetHandleInformation(parent_pipe_end, HANDLE_FLAG_INHERIT, 0))
1 change: 0 additions & 1 deletion storage/connect/domdoc.cpp
@@ -642,7 +642,6 @@ bool DOMNODELIST::DropItem(PGLOBAL g, int n)
if (Listp == NULL || Listp->length < n)
return true;

//Listp->item[n] = NULL; La propriété n'a pas de méthode 'set'
return false;
} // end of DeleteItem

75 changes: 25 additions & 50 deletions strings/ctype-czech.c
@@ -23,13 +23,13 @@
solution was needed than the one-to-one conversion table. To
note a few, here is an example of a Czech sorting sequence:
co < hlaska < hláska < hlava < chlapec < krtek
co < hlaska < hláska < hlava < chlapec < krtek
It because some of the rules are: double char 'ch' is sorted
between 'h' and 'i'. Accented character 'á' (a with acute) is
between 'h' and 'i'. Accented character 'á' (a with acute) is
sorted after 'a' and before 'b', but only if the word is
otherwise the same. However, because 's' is sorted before 'v'
in hlava, the accentness of 'á' is overridden. There are many
in hlava, the accentness of 'á' is overridden. There are many
more rules.
This file defines functions my_strxfrm and my_strcoll for
@@ -42,8 +42,9 @@
passes, that's why we need four times more space for expanded
string.
This file also contains the ISO-Latin-2 definitions of
characters.
The non-ASCII literal strings in this file are encoded
in the iso-8859-2 / latin-2 character set
(https://en.wikipedia.org/wiki/ISO/IEC_8859-2)
Author: (c) 1997--1998 Jan Pazdziora, adelton@fi.muni.cz
Jan Pazdziora has a shared copyright for this code
@@ -111,7 +112,7 @@ static const struct wordvalue doubles[] = {
};

/*
Unformal description of the algorithm:
Informal description of the algorithm:
We walk the string left to right.
@@ -126,7 +127,7 @@ static const struct wordvalue doubles[] = {
End of pass is marked with value 1 on the output.
For each character, we read it's value from the table.
For each character, we read its value from the table.
If the value is ignore (0), we go straight to the next character.
@@ -138,31 +139,6 @@ static const struct wordvalue doubles[] = {
exists behind it, find its value.
We append 0 to the end.
---
Neformální popis algoritmu:
Procházíme řetězec zleva doprava.
Konec řetězce je předán buď jako parametr, nebo je to *p == 0.
Toto je ošetřeno makrem IS_END.
Pokud jsme došli na konec řetězce při průchodu 0, nejdeme na
začátek, ale na uloženou pozici, protože první a druhý průchod
běží současně.
Konec vstupu (průchodu) označíme na výstupu hodnotou 1.
Pro každý znak řetězce načteme hodnotu z třídící tabulky.
Jde-li o hodnotu ignorovat (0), skočíme ihned na další znak..
Jde-li o hodnotu konec slova (2) a je to průchod 0 nebo 1,
přeskočíme všechny další 0 -- 2 a prohodíme průchody.
Jde-li o kompozitní znak (255), otestujeme, zda následuje
správný do dvojice, dohledáme správnou hodnotu.
Na konci připojíme znak 0
*/

#define ADD_TO_RESULT(dest, len, totlen, value) \
@@ -335,24 +311,23 @@ my_strnxfrm_czech(CHARSET_INFO *cs __attribute__((unused)),


/*
Neformální popis algoritmu:
procházíme řetězec zleva doprava
konec řetězce poznáme podle *p == 0
pokud jsme došli na konec řetězce při průchodu 0, nejdeme na
začátek, ale na uloženou pozici, protože první a druhý
průchod běží současně
konec vstupu (průchodu) označíme na výstupu hodnotou 1
načteme hodnotu z třídící tabulky
jde-li o hodnotu ignorovat (0), skočíme na další průchod
jde-li o hodnotu konec slova (2) a je to průchod 0 nebo 1,
přeskočíme všechny další 0 -- 2 a prohodíme
průchody
jde-li o kompozitní znak (255), otestujeme, zda následuje
správný do dvojice, dohledáme správnou hodnotu
na konci připojíme znak 0
Informal description of the algorithm:
we pass the chain from left to right
we know the end of the string by *p == 0
if we reached the end of the string on transition 0, then we don't go to
start, but to the saved position, because the first and second
the passage runs concurrently
we mark the end of the input (transition) with the value 1 on the output
then we load the value from the sorting table
if the value is ignore (0), we jump to the next pass
if the value is the end of the word (2) and it is a 0 or 1 transition,
we skip all the other 0 -- 2 and switch transitions
if it is a composite character (255), we test whether it follows
correct to the pair, we find the correct value
then we add the character 0 at the end
*/


16 changes: 8 additions & 8 deletions strings/ctype-latin1.c
@@ -499,19 +499,19 @@ struct charset_info_st my_charset_latin1_nopad=
*
* The modern sort order is used, where:
*
* 'ä' -> "ae"
* 'ö' -> "oe"
* 'ü' -> "ue"
* 'ß' -> "ss"
* 'ä' -> "ae"
* 'ö' -> "oe"
* 'ü' -> "ue"
* 'ß' -> "ss"
*/


/*
* This is a simple latin1 mapping table, which maps all accented
* characters to their non-accented equivalents. Note: in this
* table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
* table, 'ä' is mapped to 'A', 'ÿ' is mapped to 'Y', etc. - all
* accented characters except the following are treated the same way.
* Ü, ü, Ö, ö, Ä, ä
* Ü, ü, Ö, ö, Ä, ä
*/

static const uchar sort_order_latin1_de[] = {
@@ -577,7 +577,7 @@ static const uchar combo2map[]={
my_strnxfrm_latin_de() on both strings and compared the result strings.
This means that:
Ä must also matches ÁE and Aè, because my_strxn_frm_latin_de() will convert
Ä must also matches ÁE and Aè, because my_strxn_frm_latin_de() will convert
both to AE.
The other option would be to not do any accent removal in
@@ -703,7 +703,7 @@ void my_hash_sort_latin1_de(CHARSET_INFO *cs __attribute__((unused)),

/*
Remove end space. We have to do this to be able to compare
'AE' and 'Ä' as identical
'AE' and 'Ä' as identical
*/
end= skip_trailing_space(key, len);
