# LIN 313 Final Project
**_by Jed Wang_**

## Step 1: Import character split tree
The table was prebuilt in Node.js. Data was taken from [the Wikimedia Commons Chinese characters decomposition project](https://commons.wikimedia.org/wiki/Commons:Chinese_characters_decomposition). Characters with unverified or unknown decompositions (shown as `?` in the TSV) were removed from `table.json`; in `table2.json` they were kept and resolved manually.
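
As a rough illustration of the filtering described above, the sketch below drops every entry whose decomposition contains a `?`. The two-column row layout, the `build_table` helper, and the unresolved `〇` entry are hypothetical; the actual preprocessing was done in Node.js and is not shown in this notebook.

```python
# Hypothetical rows of (character, decomposition); "?" marks an unknown part.
ROWS = [
    ("福", "礻畐"),
    ("树", "木又寸"),
    ("〇", "?"),  # made-up unresolved entry, for illustration only
]

def build_table(rows):
    """Keep only characters whose decomposition has no '?' placeholder."""
    return {char: parts for char, parts in rows if "?" not in parts}

table = build_table(ROWS)
print(table)
```

A table like `table2.json` would instead keep such rows and replace the `?` by hand.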

In [None]:
!pip install transformers datasets tensorflow



In [None]:
!rm table.json
!wget https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table_fix.json -O table.json

rm: cannot remove 'table.json': No such file or directory
--2023-11-26 00:30:08--  https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table_fix.json
Resolving gist.github.com (gist.github.com)... 140.82.114.4
Connecting to gist.github.com (gist.github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table_fix.json [following]
--2023-11-26 00:30:08--  https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table_fix.json
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, aw

Load the JSON file into memory.

In [None]:
import json

with open('table.json') as table_file:
  char_table = json.load(table_file)

print(char_table['福'])
print(char_table['树'])

礻畐
木又寸


Load the second table as well, for ease of A/B testing.

In [None]:
!rm table2.json
!wget https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table2_fix.json -O table2.json

rm: cannot remove 'table2.json': No such file or directory
--2023-11-26 00:30:08--  https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table2_fix.json
Resolving gist.github.com (gist.github.com)... 140.82.114.4
Connecting to gist.github.com (gist.github.com)|140.82.114.4|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table2_fix.json [following]
--2023-11-26 00:30:08--  https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/table2_fix.json
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.110.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent

In [None]:
with open('table2.json') as table_file:
  char_table2 = json.load(table_file)

print(char_table2['福'])
print(char_table2['树'])

礻一口田
木又寸
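
For A/B testing it can help to see exactly where the two tables disagree. The following is a minimal sketch using only the two entries printed above; `diff_tables` is a hypothetical helper, and on the real data it would be called with `char_table` and `char_table2`.

```python
# Toy excerpts built from the two entries printed above.
verified = {"福": "礻畐", "树": "木又寸"}
full = {"福": "礻一口田", "树": "木又寸"}

def diff_tables(a, b):
    """Map each shared character to its (a, b) decompositions when they differ."""
    shared = sorted(a.keys() & b.keys())
    return {ch: (a[ch], b[ch]) for ch in shared if a[ch] != b[ch]}

differences = diff_tables(verified, full)
print(differences)
```

Here only 福 differs, because the second table breaks 畐 down further.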


## Step 2: Writing the normalizer
Since this is done as a preprocessing step for a BPE or WordPiece tokenizer, this must be implemented as a normalizer (in HuggingFace terminology).

In [None]:
# This doesn't work, unfortunately, meaning I have to chain approximately 20k normalizers together

# class SubCharChinese:
#   def __init__(self, extended=False) -> None:
#     self.char_table = char_table2 if extended else char_table

#   def normalize(self, normalized):
#     normalized.normalized = self.normalize_str(normalized.normalized)
#     return normalized

#   def normalize_str(self, sequence):
#     return "".join([self.char_table[ch] if ch in self.char_table else ch for ch in sequence])

from tokenizers import NormalizedString, Regex, normalizers

class CustomNormalizer2:
  def __init__(self, full=False):
    raw_dict = char_table2 if full else char_table
    self.table = list(filter(lambda e: e[0] != e[1], raw_dict.items()))
    print(self.table[:5])

  def normalize(self, normalized: NormalizedString):
    normalized.nfd()
    normalized.lowercase()
    for k, v in self.table:
      normalized.replace(k, " " + v + " ")
    normalized.lstrip()
    normalized.rstrip()
    normalized.replace(Regex(r"\s+"), " ")  # raw string avoids an invalid-escape warning

  def to_standard_norm(self):
    norm_list = [normalizers.Replace(r[0], r[1]) for r in self.table]
    norm_list.insert(0, normalizers.BertNormalizer())
    norm_list.append(normalizers.Strip())
    norm_list.append(normalizers.Replace(Regex(r"\s+"), " "))  # raw string avoids an invalid-escape warning
    return normalizers.Sequence(norm_list)

# True = use full table; False = use verified subset
complete = True

cust_norm = CustomNormalizer2(complete)
normalizer = normalizers.Normalizer.custom(cust_norm)

test_str = "早上好！"
normalizer.normalize_str(test_str), cust_norm.to_standard_norm().normalize_str(test_str)

[('丆', '一丿'), ('丁', '一亅'), ('丩', '丨丨'), ('𠂇', '丿一A'), ('𠂉', '丿一B')]


('日十A 上 女子 ！', '日十A 上 女子 ！')

In [None]:
normalizer.normalize_str("今天晚上最低温度是19°C")

'人一㇇ 一大A 日免A 上 曰耳又 人氏丶 氵日皿 广廿又 日𤴓 19°c'
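
To make the substitution concrete, here is a plain-Python sketch of the same idea, using a toy table with two entries copied from the output above. It is a simplification rather than the normalizer itself: it skips NFD normalization and substitutes in a single pass, whereas the chained replacements above run one after another, so results may differ if a component is itself a key in the table. The names `TOY_TABLE` and `decompose` are hypothetical.

```python
import re

# Toy table; both entries are taken from the normalizer output above.
TOY_TABLE = {"早": "日十A", "好": "女子"}

def decompose(text, table):
    """Replace known characters with space-padded components, then tidy whitespace."""
    padded = "".join(f" {table[ch]} " if ch in table else ch for ch in text.lower())
    return re.sub(r"\s+", " ", padded).strip()

result = decompose("早上好！", TOY_TABLE)
print(result)
```

Padding each replacement with spaces keeps the components of one character together as a group, which is what lets the decoder reverse the process later.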

## Step 3: Create & train the tokenizer
Since the model to be used is BERT, I will be using a WordPiece encoding.
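
For readers unfamiliar with WordPiece, the following is a minimal sketch of the greedy longest-match-first lookup it performs at encoding time. The vocabulary is made up for illustration, `wordpiece_encode` is a hypothetical name, and this is not the `tokenizers` implementation; training the vocabulary is not covered.

```python
# Made-up vocabulary; "##" marks a piece that continues a word.
TOY_VOCAB = {"日", "日十", "##A", "女", "##子"}

def wordpiece_encode(word, vocab, unk_token="[UNK]", prefix="##"):
    """Split a word by repeatedly taking the longest vocabulary piece from the left."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        while end > start:
            piece = word[start:end]
            if start > 0:
                piece = prefix + piece  # non-initial pieces carry the prefix
            if piece in vocab:
                match = piece
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        tokens.append(match)
        start = end
    return tokens

print(wordpiece_encode("日十A", TOY_VOCAB))
print(wordpiece_encode("女子", TOY_VOCAB))
```

With sub-character input, frequent component sequences can end up as single vocabulary entries even when they span only part of a character.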

In [None]:
from tokenizers import Tokenizer, models, pre_tokenizers

tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
tokenizer.normalizer = normalizer
# tokenizer.pre_tokenizer = pre_tokenizers.BertPreTokenizer()
tokenizer.pre_tokenizer = pre_tokenizers.Punctuation() # multi-word tokens OK

In [None]:
from tokenizers import trainers

special_tokens = ["[UNK]", "[PAD]", "[CLS]", "[SEP]", "[MASK]"]
trainer = trainers.WordPieceTrainer(vocab_size=25000, special_tokens=special_tokens)

<!-- I'll be using [OSCAR](https://huggingface.co/datasets/oscar) to train the tokenizer. -->
I'll be using [open_subtitles](https://huggingface.co/datasets/open_subtitles) to train the tokenizer (on both simplified and traditional Chinese).

In [None]:
from datasets import load_dataset
# dataset = load_dataset("oscar",
#                        "unshuffled_deduplicated_zh",
#                       #  split='train[0%:40%](pct1_dropremainder)',
#                       #  num_proc=8,
#                        split="train",
#                        streaming=True)
dataset = load_dataset("open_subtitles",
                       lang1="zh_cn", lang2="zh_tw",
                       split="train[2%:6%]",
                       num_proc=8)

KeyboardInterrupt: ignored

In [None]:
from tqdm.auto import tqdm

# directly from the NLP course lol
def get_training_corpus():
  # each step of the bar is one batch of up to 1000 examples; tqdm appends the colon itself
  pbar = tqdm(range(0, len(dataset), 1000), desc="Training data", unit="batch")
  for i in pbar:
    yield [o["zh_cn"] + o["zh_tw"] for o in dataset[i : i + 1000]["translation"]]
# def get_training_corpus():
#   it = iter(dataset)
#   i = 0
#   while True:
#     print(i, " complete")
#     yield [next(it) for j in range(1000)]
#     i = i + 1000

tokenizer.train_from_iterator(get_training_corpus(), trainer=trainer)
# tokenizer.train_from_iterator(iter(dataset), trainer=trainer)
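
The generator above joins each simplified/traditional pair into one string. One possible variation, sketched below with made-up toy pairs, yields the two sides as separate training examples instead. `batched_corpus` and `toy_pairs` are hypothetical names, and whether this changes the learned vocabulary in any useful way has not been tested here.

```python
def batched_corpus(pairs, batch_size=1000):
    """Yield lists of strings, with simplified and traditional lines as separate examples."""
    for i in range(0, len(pairs), batch_size):
        batch = pairs[i:i + batch_size]
        yield [text for pair in batch for text in (pair["zh_cn"], pair["zh_tw"])]

# Made-up translation pairs shaped like the dataset's "translation" field.
toy_pairs = [
    {"zh_cn": "早上好", "zh_tw": "早安"},
    {"zh_cn": "谢谢", "zh_tw": "謝謝"},
    {"zh_cn": "再见", "zh_tw": "再見"},
]

batches = list(batched_corpus(toy_pairs, batch_size=2))
print(batches)
```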

In [None]:
cls_token_id = tokenizer.token_to_id("[CLS]") or 2
sep_token_id = tokenizer.token_to_id("[SEP]") or 3
print(cls_token_id, sep_token_id)

2 3


## Step 4: Add the decoder & save

In [None]:
from tokenizers import processors

tokenizer.post_processor = processors.TemplateProcessing(
    single="[CLS]:0 $A:0 [SEP]:0",
    pair="[CLS]:0 $A:0 [SEP]:0 $B:1 [SEP]:1",
    special_tokens=[("[CLS]", cls_token_id), ("[SEP]", sep_token_id)],
)
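
The template above can be read as the following plain-Python sketch, which wraps one or two token lists in `[CLS]`/`[SEP]` and assigns type IDs the same way (`:0` for the first segment, `:1` for the second). `apply_template` is a hypothetical helper working on token strings, not part of `tokenizers`, and the example tokens are made up.

```python
def apply_template(tokens_a, tokens_b=None, cls="[CLS]", sep="[SEP]"):
    """Return (tokens, type_ids) for a single sequence or a pair, BERT-style."""
    tokens = [cls, *tokens_a, sep]
    type_ids = [0] * len(tokens)
    if tokens_b is not None:
        tokens += [*tokens_b, sep]
        type_ids += [1] * (len(tokens_b) + 1)  # second segment plus its [SEP]
    return tokens, type_ids

single = apply_template(["日十", "##A"])
pair = apply_template(["日十", "##A"], ["女", "##子"])
print(single)
print(pair)
```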

In [None]:
encoding = tokenizer.encode("矓☺☻♥")
print(encoding.ids)
print(encoding.tokens)

Exception: ignored

We need a good decoder to be able to reverse the process, of course.

In [None]:
!rm decoder.json
!wget https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder.json

--2023-11-26 00:31:16--  https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder.json
Resolving gist.github.com (gist.github.com)... 140.82.113.3
Connecting to gist.github.com (gist.github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder.json [following]
--2023-11-26 00:31:16--  https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder.json
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 474591 (463K) [text/plain]
Sav

In [None]:
import json

with open('decoder.json') as table_file:
  decoder_table = json.load(table_file)

In [None]:
!rm decoder2.json
!wget https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder2.json

--2023-11-26 00:31:17--  https://gist.github.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder2.json
Resolving gist.github.com (gist.github.com)... 140.82.113.3
Connecting to gist.github.com (gist.github.com)|140.82.113.3|:443... connected.
HTTP request sent, awaiting response... 301 Moved Permanently
Location: https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder2.json [following]
--2023-11-26 00:31:17--  https://gist.githubusercontent.com/LeftistTachyon/f1a42e0dbf33af2f8f542d07ec25c852/raw/84b78cc09c66069edd60c234250166448f14469d/decoder2.json
Resolving gist.githubusercontent.com (gist.githubusercontent.com)... 185.199.109.133, 185.199.110.133, 185.199.111.133, ...
Connecting to gist.githubusercontent.com (gist.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 518150 (506K) [text/plain]


In [None]:
with open('decoder2.json') as table_file:
  decoder_table2 = json.load(table_file)

In [None]:
# from typing import List
from tokenizers import decoders

dtable = decoder_table2 if complete else decoder_table
decoder_sequence = [decoders.Replace(d[0], d[1]) for d in dtable.items() if d[0] != d[1]]
decoder_sequence.insert(0, decoders.Replace("#", ""))
decoder_sequence.insert(1, decoders.Fuse())
decoder_sequence.append(decoders.Replace(" ", ""))
tokenizer.decoder = decoders.Sequence(decoder_sequence)
# tokenizer.decoder = decoders.WordPiece(prefix="##")

In [None]:
tokenizer.decoder.decode(['人一㇇ 一大A 日免A 上 曰耳又 人氏丶 氵日皿 度 日𤴓 19°c'])

'今天晚上最低温度是19°c'
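
Conceptually, the decoder sequence does three things: strip the WordPiece `#` markers, fuse the tokens into one string, and map component groups back to characters. The sketch below imitates this in plain Python under the assumption that each character's components arrive as one space-delimited group, so it looks groups up directly instead of chaining replacements. `TOY_DECODER`, `decode_tokens`, and the token split are hypothetical; the two table entries are taken from the example above.

```python
# Toy inverse table; both entries appear in the decoding example above.
TOY_DECODER = {"人一㇇": "今", "一大A": "天"}

def decode_tokens(tokens, decoder_table):
    """Strip '#' markers, fuse the tokens, then map each space-delimited group back."""
    fused = "".join(token.replace("#", "") for token in tokens)
    return "".join(decoder_table.get(group, group) for group in fused.split(" "))

# A made-up token split of '人一㇇ 一大A 度 19°c'.
decoded = decode_tokens(["人一", "##㇇ 一大", "##A 度 19°c"], TOY_DECODER)
print(decoded)
```

As in the cell above, removing every `#` also deletes literal `#` characters from the text, which is a limitation of this approach.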

Now we save to a temporary place.

In [None]:
tokenizer.normalizer = cust_norm.to_standard_norm()
tokenizer.save("tokenizer.json")
# `&&` waits for curl to finish writing curl.out before printing it;
# a pipe would start `cat` before the file is complete
!curl --upload-file ./tokenizer.json -o curl.out https://transfer.sh/tokenizer.json && cat curl.out

NameError: ignored

Here are some snapshots of this in action:

In [None]:
from transformers import PreTrainedTokenizerFast
# new_tokenizer = Tokenizer.from_file("tokenizer.json")
# new_tokenizer.normalizer = cust_norm.to_standard_norm()
wrapped_tokenizer = PreTrainedTokenizerFast(
    # tokenizer_object=tokenizer,
    tokenizer_file="tokenizer.json", # You can load from the tokenizer file, alternatively
    unk_token="[UNK]",
    pad_token="[PAD]",
    cls_token="[CLS]",
    sep_token="[SEP]",
    mask_token="[MASK]",
)

tokens = wrapped_tokenizer.encode("我喜歡長褲子。")

SyntaxError: ignored

In [None]:
print(tokens)
for token in tokens:
  print(wrapped_tokenizer.decode([token]))

In [None]:
tokens = wrapped_tokenizer.encode("坐落于法国布列塔尼地区的镇，学院为愿意进入与畜产品相关企业工作的，硕士文凭以上持有者提供法国大学第三阶段的学习 ")
print(tokens)
# wrapped_tokenizer.convert_ids_to_tokens(tokens)
print(wrapped_tokenizer.decode(tokens))
[wrapped_tokenizer.decode([token]) for token in tokens]