Add writing system replacement transformation #230

Merged
139 changes: 139 additions & 0 deletions transformations/writing_system_replacement/README.md
# Writing System Replacement ❤️ → 心
This transformation replaces the writing system of the input with another writing system.

Author: Roman Sitelew (sitelewr at gmail dot com)

A few snippets of the code are based on the code from the
[whitespace_perturbation transformation](https://github.com/GEM-benchmark/NL-Augmenter/tree/main/transformations/whitespace_perturbation),
authored by Xinyi Wu, and released under the same license (MIT) as the rest of the NL-Augmenter project.

## What type of a transformation is this?
It's a novel transformation that works as follows:

1. There is an input sentence (e.g. "I love potatoes")
2. A new random writing system is generated. For example, a logographic system where the word "love" is written as "蘋"
3. The output is the text in the new writing system. In our example, it's '蚴蘋䗑'

We use [CJK Unified Ideographs](https://en.wikipedia.org/wiki/CJK_Unified_Ideographs)
as the source of characters for the generated writing systems.

Thanks to the enormous diversity of the CJK ideographs (tens of thousands of characters),
we can generate thousands of different writing systems, without running out of characters.
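
As an illustration only (this is a minimal sketch, not the transformation's actual implementation), the logographic case could look roughly like this: each unique word is mapped to a randomly chosen CJK ideograph, and the ideographs are concatenated without separators.

```python
import random

# Illustration only: a toy version of the logographic case, not the
# transformation's actual implementation.
CJK_START, CJK_END = 0x4E00, 0x9FFF  # CJK Unified Ideographs block

def to_random_logographic(sentence: str, seed: int = 0) -> str:
    rng = random.Random(seed)
    mapping = {}  # word -> randomly chosen ideograph
    output = []
    for word in sentence.lower().split():
        if word not in mapping:
            mapping[word] = chr(rng.randint(CJK_START, CJK_END))
        output.append(mapping[word])
    return "".join(output)  # logographic output, no separators between words

print(to_random_logographic("I love potatoes"))  # e.g. '蚴蘋䗑' (one ideograph per word)
```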

Currently, this transformation can convert the input into the following writing systems:

|writing system type | some real-world languages that use it | "I love potatoes" (a sample output in a generated writing system) |
|--------------- | ------------------------------------ | -------------------------------------- |
|alphabet | English, Spanish, Russian | 之 笓䒉㘔䆇 躓䒉蝲討蝲䒉䆇䁣 |
|syllabary | Mycenaean Greek, Japanese | 䃓 熵鏘 㚐料 |
|partial phonemic | Hebrew, Arabic | 阠㚶乍 渓绌敿 |
|logographic | Ancient Egyptian, Oracle bone script | 驿掩㑇㕶誨 |

The transformation accepts any textual input, regardless of its language.
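
A hedged usage sketch, assuming the transformation follows the standard NL-Augmenter `SentenceOperation` interface (a `generate(sentence)` method that returns a list of transformed strings):

```python
# Hedged usage sketch; the generate() method is assumed from the standard
# NL-Augmenter SentenceOperation interface.
from transformations.writing_system_replacement.transformation import (
    WritingSystemReplacement,
)

t = WritingSystemReplacement()
print(t.generate("I love potatoes"))
# e.g. ['蚴蘋䗑']; the exact output depends on the randomly generated writing system
```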

## What tasks does it intend to benefit?
This transformation could benefit text classification tasks, especially language identification tasks.

Trained humans are able to read texts written in different writing systems.
For example, Japanese speakers can easily identify the following texts as Japanese and understand their meaning
(all three mean "hiragana"):

平仮名
ひらがな
ヒラガナ

The ability of ML models to identify and understand a language written in various writing systems could
improve the quality of language identification in general,
and the quality of identification of low-resource languages in particular.

## Robustness Evaluation

We've evaluated this transformation by running the built-in evaluate.py script as follows:
`python3 evaluate.py -t WritingSystemReplacement`

Note: it seems that there is a bug in
[this line](https://github.com/GEM-benchmark/NL-Augmenter/blob/f0111c1587cfa36cd4bd2c9739744e59c2796c26/TestRunner.py#L132)
in TestRunner.py, which makes the script crash during any test.
We temporarily fixed it by replacing the line's right-hand side with `str(package_dir.parent.joinpath(search))`.

To speed up the test, we also removed all other transformations from the local NL-Augmenter/transformations dir.

The results of a single run are as follows:

```
Here is the performance of the model aychang/roberta-base-imdb on the test[:20%] split of the imdb dataset
The accuracy on this subset which has 1000 examples = 96.0
Applying transformation:
100%|████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████████| 1000/1000 [00:19<00:00, 51.83it/s]
Finished transformation! 1000 examples generated from 1000 original examples, with 1000 successfully transformed and 0 unchanged (1.0 perturb rate)
Here is the performance of the model on the transformed set
The accuracy on this subset which has 1000 examples = 100.0

```

The performance is surprisingly good, considering the dramatic changes made by our transformation.
Reviewer comment (Contributor):
Consider adding the 4 suggested tests for Text Classification in the evaluation section. You will notice that in the 3 non-default tests the performance falls significantly, as expected:

- roberta-base-SST-2: 94.0 -> 51.0
- bert-base-uncased-QQP: 92.0 -> 67.0
- roberta-large-mnli: 91.0 -> 43.0

It is still very interesting how the performance improves to 100.0 for roberta-base-imdb. It might be worth looking into that.

Author reply (@RomanPlusPlus, Sep 19, 2021):
Thank you for the performance evaluation! Added the results to the readme.

Regarding the imdb anomaly, I opened an issue about it.

We are not sure if the results make sense.

According to @Sotwi, in the 3 non-default tests the performance falls significantly, as expected:

- roberta-base-SST-2: 94.0 -> 51.0
- bert-base-uncased-QQP: 92.0 -> 67.0
- roberta-large-mnli: 91.0 -> 43.0

This indicates that both the transformation and the testing pipeline work as intended.

We speculate that the anomaly in the first test could be caused by some deficiency in
the model `aychang/roberta-base-imdb` and/or the `imdb` dataset.

## What are the limitations of this transformation?

If the input text is too short, the transformed text will not contain enough information
to identify the language of the original text.
For example, if the input is `I love love love`, the output could look like this: `人乇乇乇`.
In this example, each unique word was replaced with a randomly selected CJK ideograph.
But no AI could reliably predict the original language from `人乇乇乇`, as almost no linguistic data are preserved
in such a short text.

Thus, for language identification purposes, we would suggest inputs of at least 1k characters.

If the text is long enough, even the meaning could be recovered from the transformed/encrypted text,
e.g. by using standard cryptanalytic methods.
For example, imagine that the input `I love potatoes... [many more chars after that]`
was transformed into another alphabet like this: `了 乚凸丫丯 尸凸亍人亍凸丯乙...`.
While analyzing the output, one may notice that the character `凸` is very frequent,
and thus might correspond to a common letter (in this case, `o`), making the rest of the decryption process much easier.
We speculate that if a trained cryptographer can extract the meaning from such a ciphertext,
then a language model could do it too.

For the meaning to be reliably recoverable, we would suggest inputs of at least 10k characters.
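
As a toy illustration of the frequency-analysis idea above (not part of the transformation itself), one could start by counting symbol frequencies in the transformed text and comparing the most common symbols with typical letter frequencies of candidate source languages:

```python
from collections import Counter

# Toy illustration of frequency analysis on a transformed/encrypted sample;
# this is not part of the transformation itself.
ciphertext = "了 乚凸丫丯 尸凸亍人亍凸丯乙"
counts = Counter(ch for ch in ciphertext if not ch.isspace())
for symbol, n in counts.most_common(3):
    print(symbol, n)
# The most frequent symbol (here '凸') is a candidate for a common letter such as
# 'o' or 'e', which is the usual first step of a classical substitution-cipher attack.
```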

As @tia-e pointed out, the transformation doesn't fully preserve the syntax of the input.
For example, our transformation sometimes generates writing systems that don't have separators between words
(similar to Classical Greek, Thai, etc.).
Depending on the generated writing system, a single character could be replaced with several new characters,
or vice versa.
This could make it significantly harder to extract the meaning from the transformed text.

## Future work

It would be interesting to add other types of writing systems, including logosyllabaries, alphasyllabaries, and
the writing systems for sign languages.
Some conlangs (especially [Ithkuil](https://en.wikipedia.org/wiki/Ithkuil))
could be a great inspiration for further additions.

As suggested by @tia-e, we would like to support the transformation in both directions,
with the ability to define the source and the target writing systems
(currently, the target system is selected at random, regardless of the source system).

This transformation could help to decipher
the [Voynich manuscript](https://en.wikipedia.org/wiki/Voynich_manuscript),
the [Phaistos Disc](https://en.wikipedia.org/wiki/Phaistos_Disc),
and other undeciphered artifacts.

It could be accomplished as follows (a rough sketch of the first steps is given after the list):
0. create a corpus featuring hundreds of languages
1. apply this transformation to the corpus' texts, thus obfuscating their writing systems, greatly expanding the corpus
2. train a model on the corpus, to make it capable of identifying the language regardless of the writing system
3. apply the transformation to, say, the Voynich manuscript
4. use the model to predict the language of the transformed manuscript
5. verify the prediction by translating the manuscript as if it were written in the predicted language
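
A rough sketch of steps 0-2, reusing the hedged `generate()` interface from the usage sketch above (the toy corpus and the training step are placeholders):

```python
# Rough sketch of steps 0-2; the corpus and the training step are placeholders,
# and the generate() interface is assumed as in the usage sketch above.
from transformations.writing_system_replacement.transformation import (
    WritingSystemReplacement,
)

corpus = [  # step 0: a (tiny) corpus of (text, language) pairs
    ("I love potatoes", "en"),
    ("Me encantan las patatas", "es"),
]

t = WritingSystemReplacement()
augmented = []
for text, lang in corpus:
    augmented.append((text, lang))            # keep the original writing system
    for variant in t.generate(text):          # step 1: obfuscate the writing system
        augmented.append((variant, lang))

# step 2: train any language-identification model on `augmented`;
# steps 3-5 would then apply that model to the transformed manuscript.
```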
1 change: 1 addition & 0 deletions transformations/writing_system_replacement/__init__.py
from .transformation import *
60 changes: 60 additions & 0 deletions transformations/writing_system_replacement/test.json
{
"type": "writing_system_replacement",
"test_cases": [
{
"class": "WritingSystemReplacement",
"inputs": {
"sentence": "Andrew finally returned the French book to Chris that I bought last week"
},
"outputs": [
{
"sentence": "农垚垚㻼臀筸令阤㑓睆盔鞿馼矖㴕蚄魔䇑陟譻葃牚"
}
]
},
{
"class": "WritingSystemReplacement",
"inputs": {
"sentence": "Sentences with gapping, such as Paul likes coffee and Mary tea, lack an overt predicate to indicate the relation between two or more arguments."
},
"outputs": [
{
"sentence": "䵨䡖鏹䁆岈癢潠嗘䚜楣炠鿔砡體鄊礟㛍嚇呌㺦繲㒤鏹厘䟴埃䌬陊区畇㑑䵨䗞挏菇玜䵨䴭㯳龃曅㚩麛骀尽鱭㿛侕"
}
]
},
{
"class": "WritingSystemReplacement",
"inputs": {
"sentence": "Alice in Wonderland is a 2010 American live-action/animated dark fantasy adventure film"
},
"outputs": [
{
"sentence": "㻼臀筸令阤㑓睆盔鞿馼矖㴕蚄魔䇑陟譻葃牚橤景皒臀㮴㝌䁫臀讶"
}
]
},
{
"class": "WritingSystemReplacement",
"inputs": {
"sentence": "Ujjal Dev Dosanjh served as 33rd Premier of British Columbia from 2000 to 2001"
},
"outputs": [
{
"sentence": "鱾偠偠蠿祡儋蚋鎟羕䵨䡖鏹䁆岈癢潠嗘䚜楣炠鿔砡體鄊"
}
]
},
{
"class": "WritingSystemReplacement",
"inputs": {
"sentence": "Neuroplasticity is a continuous processing allowing short-term, medium-term, and long-term remodeling of the neuronosynaptic organization."
},
"outputs": [
{
"sentence": "盔鞿馼矖㴕蚄魔䇑陟譻葃牚橤景皒㮴㝌䁫讶帪駡曏讶䵘捛䋕渢盔龑䝺蓟馼醭脃䣒䟒邸鞿邸盔俁罽笵蛱汒鴃䣒懝婝"
}
]
}
]
}