understanding custom pipelines #73

Open
mikkokotila opened this issue Jul 28, 2020 · 3 comments

Comments

@mikkokotila
Contributor

In the toy example below, my expectation is to get a tokenized version of the input text. With the code below, the result is a list of tokens, but the tokens are syllables only.

from botok import Trie, BoSyl, Tokenize, Config, TokChunks

in_str = '༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །'

profile = "empty"
config = Config()
trie = Trie(BoSyl, profile, config, [])
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

# collect the text of each token
for token in tokens:
    out.append(token['text'])

out

How can I change the above code to achieve this?

@10zinten
Contributor

10zinten commented Jul 29, 2020

Yes, I suggest you use the latest version of botok. We have simplified the botok config, which in turn simplifies building custom pipelines.

In the latest version we have introduced dialect packs, which are similar to the profiles in previous versions but do a bit more. Each dialect pack has two components: Adjustments and Dictionary.

The Dictionary component contains all the standardized word lists and rules (to adjust segmentation) used for tokenization. The Adjustments component is for researching and testing segmentation; its content will eventually be merged into the Dictionary component. Adjustments can also be used to customize the default tokenization.
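As a quick illustration (a minimal sketch, not official documentation: these attribute names are the ones used in the corrected example further down and may change between releases), you can inspect both components through the Config object:

from botok import Config

# Build the default config; this locates (or downloads) the default dialect pack.
config = Config()

# The dialect pack's two components are exposed on the config:
print(config.dictionary)         # standardized word lists and segmentation rules
print(config.adjustments)        # experimental/custom segmentation adjustments
print(config.dialect_pack_path)  # where the dialect pack lives on disk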

As for getting the expected output from the toy example above, here is the corrected version:

from botok import BoSyl, Config, TokChunks, Tokenize, Trie

in_str = "༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །"

config = Config()
# Build the trie from the dialect pack's Dictionary (main_data) and
# Adjustments (custom_data) components instead of an empty word list,
# so that multi-syllable words can be matched.
trie = Trie(
    BoSyl,
    profile=config.profile,
    main_data=config.dictionary,
    custom_data=config.adjustments,
    pickle_path=config.dialect_pack_path.parent,
)
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

for token in tokens:
    out.append(token["text"])

print(out)

Output:

['༈ ', 'བློ་', 'ཆོས་', 'སུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ཆོས་', 'ལམ་', 'དུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ལམ་', 'འཁྲུལ་བ་', 'ཞིག་པར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'འཁྲུལ་', 'པ་', 'ཡེ་ཤེས་', 'སུ་', 'འཆར་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །']

You can also refer to the WordTokenizer class at https://github.com/Esukhia/botok/blob/7d85cbb0df62ff4c9da3c70088ad671f03472a18/botok/tokenizers/wordtokenizer.py#L28 to customize the adjustment rules.
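For reference, here is a minimal sketch of the high-level pipeline using the WordTokenizer class, which wires up Config, Trie, and Tokenize internally and applies the dialect pack's adjustment rules after the raw tokenization pass; the exact constructor and tokenize signatures may differ between releases, so check the linked source for your version:

from botok import WordTokenizer

in_str = "..."  # your Tibetan input text

# WordTokenizer builds the trie from the dialect pack's Dictionary
# component and applies the Adjustments rules for you.
wt = WordTokenizer()
tokens = wt.tokenize(in_str)
print([token["text"] for token in tokens])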

PS: We will be releasing botok documentation soon.

@mikkokotila
Contributor Author

How wonderful!

Regarding dialect packs: my use is 100% on Buddhadharma texts. Do you have a recommendation for which dialect pack to use?

@10zinten
Contributor

Currently, we only have a dialect pack for the general Tibetan language. Our research team is working on a dialect pack for Buddhadharma texts. Until then, you can experiment with the general dialect pack to improve segmentation (a sketch of selecting it explicitly follows).
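For example, selecting the general dialect pack explicitly would look like this (a sketch: the dialect_name and base_path parameters are assumptions based on recent botok releases, so adjust to your installed version):

from pathlib import Path
from botok import Config, WordTokenizer

# Assumption: recent botok versions let you name the dialect pack via Config.
config = Config(dialect_name="general", base_path=Path.home())
wt = WordTokenizer(config=config)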

We will be releasing detailed documentation on customizing dialect packs.
