Spring cleanup #108

Merged
bochecha merged 16 commits into master from cleanups on Jun 2, 2018

bochecha commented Apr 29, 2018

This branch removes some stuff from the data and improves some other things.

Some of it should not be controversial. Do review each commit separately though, in case I went too far and some of this should be kept.

Do note that I'm not interested in having all the Unicode tables in libcangjie, just because "it might be useful some time for someone maybe who knows?". But if some of the stuff I removed is actually useful in Hong Kong, Taiwan, China or Japan, then we should keep it.

That is, "it might be useful" is not an argument I'll accept, but "it is useful when…" very much is.

Anyway, you know this stuff better than I do. 🙂

@bochecha bochecha force-pushed the cleanups branch 2 times, most recently from 254ab22 to e6fad7b Apr 29, 2018

@bochecha bochecha referenced this pull request May 6, 2018

Closed

Fix secondary code freq #105

bochecha added some commits Feb 28, 2018

data: Remove the vertical presentation forms
Back at GUADEC 2015, Behdad told me we don't need those.

The Unicode Standard describes them as "compatibility characters":

> Conceptually, compatibility characters are characters that
> would not have been encoded in the Unicode Standard except
> for compatibility and round-trip convertibility with other
> standards. [...] For the most part, these are widely used
> standards that pre-dated Unicode
>
> [...]
>
> The purpose for the inclusion of compatibility characters
> like these is not to implement or emulate alternative text
> models, nor to encourage the use of plain text distinctions
> in characters which would otherwise be better represented
> by higher-level protocols or other mechanisms.

Basically, people aren't expected to input those characters
directly. Instead, they should always input the regular form, and it is
the software displaying the character which has to choose either the
horizontal or the vertical form, depending on the text orientation.
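
As a rough illustration of the "compatibility character" point, here is a minimal Python sketch showing that Unicode itself maps two of these vertical forms back to their regular characters under NFKC normalization. The helper name `regular_form` and the two sample characters are chosen for illustration only; nothing here is taken from libcangjie.

```python
import unicodedata

# Two vertical presentation forms, written as escapes to keep the source ASCII.
VERTICAL_COMMA = "\uFE10"        # PRESENTATION FORM FOR VERTICAL COMMA
VERTICAL_LEFT_PAREN = "\uFE35"   # PRESENTATION FORM FOR VERTICAL LEFT PARENTHESIS


def regular_form(char):
    """Return the compatibility-normalized (NFKC) form of a character."""
    return unicodedata.normalize("NFKC", char)


for char in (VERTICAL_COMMA, VERTICAL_LEFT_PAREN):
    print(unicodedata.name(char), "->", regular_form(char))
```

The regular form is what an input method would produce; picking the vertical glyph is then left to the rendering layer.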

These characters were identified by getting the name of each character
with the Python `unicodedata.name()` function, searching for those
characters whose name starts with "PRESENTATION FORM FOR VERTICAL ".
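
The identification step could look roughly like the sketch below. It is an approximation under one stated assumption: it scans the whole Unicode range known to the running Python, whereas the commit only looked at characters already present in the libcangjie data. The function name `chars_with_name_prefix` is hypothetical.

```python
import sys
import unicodedata

PREFIX = "PRESENTATION FORM FOR VERTICAL "


def chars_with_name_prefix(prefix):
    """Yield every character whose Unicode name starts with `prefix`."""
    for codepoint in range(sys.maxunicode + 1):
        # name() raises ValueError for unnamed code points unless a
        # default is supplied, so pass an empty string as the default.
        name = unicodedata.name(chr(codepoint), "")
        if name.startswith(prefix):
            yield chr(codepoint)


vertical_forms = list(chars_with_name_prefix(PREFIX))
print(len(vertical_forms), "vertical presentation forms found")
```

The exact count depends on the Unicode version bundled with the interpreter.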

Cleaning those out is long overdue.

Fixes #88
data: Remove the Greek letters
This is a Chinese input method, so let's tighten our focus a bit.

These characters were identified by getting the name of each character
with the Python `unicodedata.name()` function, searching for those
characters whose name starts with "GREEK CAPITAL LETTER " or "GREEK SMALL
LETTER ".
data: Remove the small form variants
These characters were added to Unicode for compatibility with the
Chinese National Standard CNS 11643.

However, their usage is rather unclear, and everything points towards
the fact that they are not necessary at all.

These characters were identified by finding the characters whose Unicode
code point is between U+FE50 and U+FE6F.
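
A code point range check of this kind might be sketched as follows. The sample list and the helper name are illustrative assumptions, not the actual libcangjie data or tooling.

```python
# Bounds of the range named in the commit message (inclusive).
SMALL_FORM_FIRST = 0xFE50
SMALL_FORM_LAST = 0xFE6F


def is_small_form_variant(char):
    """True if the character's code point falls within U+FE50..U+FE6F."""
    return SMALL_FORM_FIRST <= ord(char) <= SMALL_FORM_LAST


# Illustrative sample: SMALL COMMA, ASCII comma, SMALL COMMERCIAL AT,
# FULLWIDTH COMMA.
sample = ["\uFE50", ",", "\uFE6B", "\uFF0C"]
kept = [c for c in sample if not is_small_form_variant(c)]
print([f"U+{ord(c):04X}" for c in kept])
```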
data: Remove the Cyrillic letters
This is a Chinese input method, so let's tighten our focus a bit.

These characters were identified by getting the name of each character
with the Python `unicodedata.name()` function, searching for those
characters whose name starts with "CYRILLIC CAPITAL LETTER " or "CYRILLIC
SMALL LETTER ".
data: Remove the enclosed CJK numerals
These characters were added to Unicode for compatibility with legacy
standards.

However, their usage is rather unclear, and everything points towards
the fact that they are not necessary at all.

These characters were identified by finding the characters whose Unicode
code point is between U+3200 and U+32FF.
data: Remove the Roman numerals
This is a Chinese input method, so let's tighten our focus a bit.

These characters were identified by getting the name of each character
with the Python `unicodedata.name()` function, searching for those
characters whose name starts with "ROMAN NUMERAL " or "SMALL ROMAN
NUMERAL ".
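
When two name prefixes are involved, `str.startswith()` accepts a tuple, so a single check can cover both. The sketch below applies it to a small illustrative sample; the helper name and the sample characters are assumptions, and the same pattern would presumably fit the Greek and Cyrillic commits above.

```python
import unicodedata

ROMAN_PREFIXES = ("ROMAN NUMERAL ", "SMALL ROMAN NUMERAL ")


def is_roman_numeral(char):
    """True if the character's Unicode name starts with one of the prefixes."""
    # The empty default avoids a ValueError for unnamed code points.
    return unicodedata.name(char, "").startswith(ROMAN_PREFIXES)


# Illustrative sample: ROMAN NUMERAL ONE, SMALL ROMAN NUMERAL FOUR,
# LATIN CAPITAL LETTER I, and the CJK ideograph U+4E00.
sample = ["\u2160", "\u2173", "I", "\u4E00"]
removed = [c for c in sample if is_roman_numeral(c)]
print([unicodedata.name(c) for c in removed])
```

Note that the plain Latin letter "I" is not matched, since only the dedicated Roman numeral characters carry these names.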
data: Remove the enclosed numerals
These characters were originally intended for use as bullets for lists.

However, people will generally use the list facilities from the
applications (e.g. word processors), so these aren't actually used.

They are included in the Unicode standard "for interoperability with the
legacy East Asian character sets and for the occasional text context
where such symbols otherwise occur."

That is, they are only useful when they need to be displayed, but not in
an input method.

These characters were identified by finding the characters whose Unicode
code point is between U+2460 and U+24FF.

@bochecha bochecha force-pushed the cleanups branch from f131178 to b90de07 May 6, 2018

bochecha commented May 7, 2018

@yookoala any comment? 🙂

bochecha added some commits Mar 1, 2018

data: Remove inaccessible characters
About 80 characters in our table have no code associated, whether in
Cangjie 3, Cangjie 5 or even with our "short code" system for
punctuation.

These are impossible to obtain with libcangjie, so there is really no
point in keeping them.
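
The check for inaccessible characters can be pictured with the following sketch. The `Row` structure, its field names and the sample values are all hypothetical; the real libcangjie table has its own format, which this does not try to reproduce.

```python
from typing import NamedTuple


class Row(NamedTuple):
    """Hypothetical table row: a character and its possible input codes."""
    char: str
    cangjie3: str
    cangjie5: str
    short_code: str


def is_accessible(row):
    """A character is reachable if at least one of its codes is non-empty."""
    return any((row.cangjie3, row.cangjie5, row.short_code))


# Illustrative values only, not taken from the real data.
rows = [
    Row("\u65E5", "a", "a", ""),   # has Cangjie codes
    Row("\u3002", "", "", "."),    # only a short code
    Row("\u2603", "", "", ""),     # no code at all: inaccessible
]
accessible = [row.char for row in rows if is_accessible(row)]
print(len(rows) - len(accessible), "inaccessible character(s)")
```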

They can always be added back if someone complains about their absence
and provides a code for them.
data: Better classify the Bopomofo tonal marks
Those aren't random symbols, they are the characters used to mark tones
in Bopomofo/Zhuyin.
data: Correctly mark some punctuation characters
The two dashes (en and em) are sometimes used as punctuation (e.g. to
indicate ranges) and as such must be marked appropriately.

As for ※, this is the Japanese reference mark (aka "rice symbol"), used
similarly to the asterisk in Western languages.
data: Remove maths symbols
There are much nicer ways to type maths, which academics are most likely
already using (e.g. LaTeX, LibreOffice Math, …).

I'm keeping those that have documented non-maths usage though, for
example in Japanese.
data: Remove some abbreviations
Those don't seem to be used in Chinese or Japanese.
data: Remove the question mark
Chinese and Japanese traditionally don't use the halfwidth question mark.
Instead they normally use the fullwidth question mark: '？'

Some people might still want to use the question mark, but in that case:

* they get it in ibus-cangjie when they enable the "Half-Width" option
  and type the '?' key;
* they can always switch to the English keyboard layout.

As a result, keeping this in doesn't seem necessary, much like we don't
have codes to input the full stop or colon characters.
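
For readers unfamiliar with the distinction, the two question marks are separate Unicode characters. A small sketch using only the standard `unicodedata` module shows how they relate; the constant names are illustrative.

```python
import unicodedata

FULLWIDTH_QUESTION_MARK = "\uFF1F"
HALFWIDTH_QUESTION_MARK = "?"

# The fullwidth form is a compatibility variant of the ASCII one.
folded = unicodedata.normalize("NFKC", FULLWIDTH_QUESTION_MARK)

for char in (FULLWIDTH_QUESTION_MARK, HALFWIDTH_QUESTION_MARK):
    print(
        f"U+{ord(char):04X}",
        unicodedata.name(char),
        unicodedata.east_asian_width(char),
    )
```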
data: Remove the section sign
This should never have been classified as a symbol; it is a punctuation mark.

However, it seems to be entirely unused in both Chinese and Japanese, so
let's just remove it.

@bochecha bochecha force-pushed the cleanups branch from b90de07 to b7028b0 May 9, 2018

bochecha added some commits May 6, 2018

tests: Fix literal strings
The Chinese literal strings were replaced with escape sequences a while
ago, to make MSVC happier as it doesn't like non-ASCII strings in C code.

(Or at least, that was the case at the time. This was almost 5 years ago
after all…)

    commit 273d280
    Author: Linquize <linquize@yahoo.com.hk>
    Date:   Thu Oct 17 21:20:36 2013 +0800

        Do not write Chinese literal strings directly

        MSVC does not encoding Chinese literal strings as UTF-8

However, for some reason the backslashes got escaped in this file (and
only in this file), and we missed it on code review.
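
The effect of an escaped backslash can be shown with Python bytes literals, whose `\x` escapes behave much like those in C string literals. This is only an analogy under stated assumptions: the actual strings in the test file are not shown here, and the character used below is an arbitrary example.

```python
# Intended form: three bytes, the UTF-8 encoding of U+65E5.
intended = b"\xe6\x97\xa5"

# Over-escaped form: each backslash is itself escaped, so the literal
# holds the twelve ASCII characters \xe6\x97\xa5 instead of three bytes.
over_escaped = b"\\xe6\\x97\\xa5"

print(len(intended), repr(intended.decode("utf-8")))
print(len(over_escaped), over_escaped.decode("ascii"))
```

A test comparing against the over-escaped form would be checking for backslash text rather than the Chinese character it was meant to represent.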

@bochecha bochecha force-pushed the cleanups branch from 1e516d8 to 111d6d9 May 20, 2018

yookoala left a comment

Looks good.

bochecha commented Jun 2, 2018

Thanks for the sanity check @yookoala.

@bochecha bochecha merged commit d06ecb3 into master Jun 2, 2018

1 check passed

continuous-integration/travis-ci/pr The Travis CI build passed

@bochecha bochecha deleted the cleanups branch Jun 2, 2018
