WIP: Set secondary codes frequencies to 0 #110

Closed
bochecha wants to merge 15 commits into master from secondary-codes-frequency

bochecha commented Jun 2, 2018

This is my attempt at fixing #104.

It is based on @yookoala's suggestion, as attempted in #105.

However, #105 implemented this in the dbbuilder tool, whereas this PR modifies the source data instead, so that the data itself carries the correct frequencies and the database builder tool can stay "dumb".

It gave me the database I asked @dollars0427 to test over at Cangjians/ibus-cangjie#77 (comment), which is incorrect as evidenced by their feedback.

However, I still believe fixing the issue in the source data is the right approach.

The change was implemented with https://gitlab.freedesktop.org/cangjie/data-migration-tools/blob/d3b9a18e878cfab35af0fba39f8d23c77cffd5d5/fix-secondary-codes-frequency.py

I thought a script would be easier to review/audit than a huge data change, but then the script ended up being quite convoluted, due to how our data is structured, and trying to deal with all corner-cases (x-disambiguation, etc…)

I guess it's still good to have the script handy.
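
For readers who don't want to open the script, here is a minimal sketch of the core idea only. It is not the actual `fix-secondary-codes-frequency.py`: the `Row` type, its fields and the frequency value `1000` are assumptions made up for illustration, and the real table has more columns and corner cases than this handles.

```python
from typing import List, NamedTuple


class Row(NamedTuple):
    """Hypothetical, simplified view of one table entry (not the real format)."""
    char: str
    codes: List[str]  # all codes for one Cangjie version, primary code first
    frequency: int


def split_secondary_codes(row: Row) -> List[Row]:
    """Return one row per code; only the first (primary) code keeps the frequency."""
    if not row.codes:
        return [row]
    primary, *secondary = row.codes
    result = [Row(row.char, [primary], row.frequency)]
    # Every additional code gets a frequency of 0 so it sorts after real matches.
    result.extend(Row(row.char, [code], 0) for code in secondary)
    return result


# The frequency 1000 is an invented value, used only to show the split.
rows = split_secondary_codes(Row("沉", ["ebhu", "ebhn"], 1000))
for r in rows:
    print(r.char, r.codes[0], r.frequency)
```

A character with a single code comes back unchanged, which is why the split only touches the multi-code entries.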

bochecha left a comment

@yookoala started reviewing this directly on a commit: d9d2e68

I'm replying here, with full quotes, because I'm worried GitHub might garbage-collect the old commit (I rebased/force-pushed) and I don't want to lose the comments.

Since all those comments mention Mac OSX, I had to assume they were about Cangjie 5.

bochecha force-pushed the secondary-codes-frequency branch 2 times, most recently from 1615ef8 to 18358dd on Jan 30, 2019

bochecha added some commits Jan 30, 2019

data: Reorder the codes for 沉
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 沉 only has code `ebhu`. On
chinesecj.com, the reference for Cangjie 5, it has codes `ebhu` and
`ebhn`, in that order.

This means that in both versions `ebhu` should be the primary code for
that character. We keep `ebhn` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 麻
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 麻 only has code `ijcc`. On
chinesecj.com, the reference for Cangjie 5, it has codes `ijcc` and
`idd`, in that order.

This means that in both versions `ijcc` should be the primary code for
that character. We keep `idd` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 凄
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, as well as on chinesecj.com,
the reference for Cangjie 5, 凄 only has code `imjlv`.

This means that in both versions `imjlv` should be the primary code for
that character. We keep `imjln` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 拐
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 拐 only has code `qrsh`. On
chinesecj.com, the reference for Cangjie 5, it has codes `qrsh` and
`qrks`, in that order.

This means that in both versions `qrsh` should be the primary code for
that character. We keep `qrks` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.
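
The reordering prerequisite described in these commits can be sketched as follows. This is an illustrative helper, not code from the repository: `reorder_codes` is a hypothetical name, and the "before" ordering shown for 拐 is assumed for the sake of the example, since the commit messages only state the reference order.

```python
from typing import List


def reorder_codes(our_codes: List[str], reference_codes: List[str]) -> List[str]:
    """Move the reference's first code to the front, keeping the other codes in order."""
    if not reference_codes:
        raise ValueError("the reference lists no code for this character")
    primary = reference_codes[0]
    if primary not in our_codes:
        raise ValueError(f"primary code {primary!r} is missing from our data")
    return [primary] + [code for code in our_codes if code != primary]


# Assumed "before" ordering; the reference order is the one quoted for chinesecj.com.
before = ["qrks", "qrsh"]
reference = ["qrsh", "qrks"]
after = reorder_codes(before, reference)
print(after)
```

Once the primary code is first, zeroing the frequency of every later code becomes a purely mechanical step.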

bochecha force-pushed the secondary-codes-frequency branch from 18358dd to c0d77bd on Feb 9, 2019

bochecha added some commits Jan 30, 2019

data: Reorder the codes for 拔
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 拔 only has code `qikk`. On
chinesecj.com, the reference for Cangjie 5, it has codes `qikk` and
`qike`, in that order.

This means that in both versions `qikk` should be the primary code for
that character. We keep `qike` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 函
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 函 only has code `nue`.

This means that in Cangjie 3 `nue` should be the primary code for that
character. We keep `neu` but it should have a frequency of 0 so that it
doesn't interfere with the expected ordering.

data: Reorder the codes for 袒
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 袒 only has code `lam`.

This means that in Cangjie 3 `lam` should be the primary code for that
character. We keep `ifam` but it should have a frequency of 0 so that it
doesn't interfere with the expected ordering.

data: Reorder the codes for 壹
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 壹 only has code `gbmt`.

This means that in Cangjie 3 `gbmt` should be the primary code for that
character. We keep `gbmm` but it should have a frequency of 0 so that it
doesn't interfere with the expected ordering.

data: Reorder the codes for 嘛
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 嘛 only has code `rijc`. On
chinesecj.com, the reference for Cangjie 5, it has codes `rijc` and
`ridd`, in that order.

This means that in both versions `rijc` should be the primary code for
that character. We keep `ridd` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 押
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 押 only has code `qwl`.

This means that in Cangjie 3 `qwl` should be the primary code for that
character. We keep `qwj` but it should have a frequency of 0 so that it
doesn't interfere with the expected ordering.

data: Reorder the codes for 墟
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 墟 only has code `gypm`. On
chinesecj.com, the reference for Cangjie 5, it has codes `gypm` and
`gypc`, in that order.

This means that in both versions `gypm` should be the primary code for
that character. We keep `gypc` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 跚
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 跚 only has code `rmbt`. On
chinesecj.com, the reference for Cangjie 5, it has codes `rmbt` and
`rmbbm`, in that order.

This means that in both versions `rmbt` should be the primary code for
that character. We keep `rmbbm` but it should have a frequency of 0 so
that it doesn't interfere with the expected ordering.

data: Reorder the codes for 垂
We are about to split multiple codes for any given character so that
only the first one has the non-zero frequency, and all additional codes
have a frequency of 0. (see #104)

A prerequisite to that is that the multiple codes are actually ordered
correctly.

On Windows, the reference for Cangjie 3, 垂 only has code `hjtm`.

This means that in Cangjie 3 `hjtm` should be the primary code for that
character. We keep `hjtg` but it should have a frequency of 0 so that it
doesn't interfere with the expected ordering.

wip: data: Secondary codes should have a frequency of 0
Most characters have a single code in a given Cangjie version.

But some characters will have more than one, due to how different
organizations might have applied the Cangjie decomposition method in
different ways.

We try to be as inclusive as possible, so our data includes all those
codes.

However, our focus is still on providing the best possible input
experience to Hong Kong people, and as such we want to give priority to
the codes they are used to.

Those are the codes coming first in our data, and these codes should be
the only ones with a non-zero frequency, so that secondary codes are
always ordered towards the end.
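
To illustrate why a frequency of 0 pushes secondary codes to the end, here is a small sketch. It assumes candidates for a typed code are ordered by descending frequency, which is an assumption for illustration and not a quote of the library's lookup code. The `Candidate` type, the placeholder characters and the frequency values are all invented and are not real Cangjie data.

```python
from typing import List, NamedTuple


class Candidate(NamedTuple):
    """Hypothetical lookup result for one typed code."""
    char: str
    frequency: int


def order_candidates(candidates: List[Candidate]) -> List[Candidate]:
    """Sort by descending frequency; sorted() is stable, so ties keep their order."""
    return sorted(candidates, key=lambda c: c.frequency, reverse=True)


# Placeholder candidates: "乙" matches only through a secondary code, so its
# frequency was set to 0 and it ends up last regardless of table order.
candidates = [Candidate("乙", 0), Candidate("甲", 120), Candidate("丙", 45)]
ordered = order_candidates(candidates)
print([c.char for c in ordered])
```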

The changes to the data in this commit were made with a script, and
should be entirely reproducible. From a libcangjie clone on the master
branch (commit TODO), get the script and run it:

    $ git clone https://gitlab.freedesktop.org/cangjie/data-migration-tools.git
    $ ./data-migration-tools/fix-secondary-codes-frequency.py

The modifications done by the script to the table should be identical to
the ones in this commit.

Fixes #104

data: Remove unnecessary comment
We had explained what was going on with this character in a comment
because at that point it was an exception.

But with the previous commit, we actually did that systematically to
quite a few characters.

If this isn't an exception any more, the comment becomes unnecessary.

bochecha force-pushed the secondary-codes-frequency branch from c0d77bd to d3a11e7 on Feb 10, 2019

bochecha closed this on Feb 14, 2019

bochecha deleted the secondary-codes-frequency branch on Feb 14, 2019
