understanding custom pipelines #73

Open
mikkokotila opened this issue Jul 28, 2020 · 3 comments

Comments

@mikkokotila
Contributor

In the toy example below, my expectation is to get a tokenized version of the input text. With the code below, the result is a list of tokens, but the tokens are syllables only.

from botok import Trie, BoSyl, Tokenize, Config, TokChunks

in_str = '༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །'

profile = "empty"
config = Config()
trie = Trie(BoSyl, profile, config, [])
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

# collect the text of each token
for token in tokens:
    out.append(token['text'])

out

How can I change the above code to achieve this?

@10zinten
Contributor

10zinten commented Jul 29, 2020

Yes, I suggest you use the latest version of botok. We have simplified the botok config, which in turn simplifies building custom pipelines.

In the latest version we have introduced dialect packs, which are similar to the profiles in previous versions but do a bit more. Each dialect pack has two components: Adjustments and Dictionary.

The Dictionary component contains all the standardized word lists and rules (to adjust segmentation) used for tokenization. The Adjustments component is for researching and testing segmentation; its content will eventually be merged into the Dictionary component. Adjustments can also be used to customize the default tokenization.
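As a quick illustration (a minimal sketch, not official documentation: these attribute names are the ones used in the corrected example further down and may change between releases), you can inspect both components through the Config object:

from botok import Config

# Build the default config; this locates (or downloads) the default dialect pack.
config = Config()

# The dialect pack's two components are exposed on the config:
print(config.dictionary)         # standardized word lists and segmentation rules
print(config.adjustments)        # experimental/custom segmentation adjustments
print(config.dialect_pack_path)  # where the dialect pack lives on disk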

As for getting the expected output from the toy example above, here is the corrected version:

from botok import BoSyl, Config, TokChunks, Tokenize, Trie

in_str = "༈ བློ་ཆོས་སུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ཆོས་ལམ་དུ་འགྲོ་བར་བྱིན་གྱིས་རློབས། །ལམ་འཁྲུལ་བ་ཞིག་པར་བྱིན་གྱིས་རློབས། །འཁྲུལ་པ་ཡེ་ཤེས་སུ་འཆར་བར་བྱིན་གྱིས་རློབས། །"

config = Config()
# Build the trie from the dialect pack's Dictionary (main_data) and
# Adjustments (custom_data) components instead of an empty word list,
# so that multi-syllable words can be matched.
trie = Trie(
    BoSyl,
    profile=config.profile,
    main_data=config.dictionary,
    custom_data=config.adjustments,
    pickle_path=config.dialect_pack_path.parent,
)
tok = Tokenize(trie)
preproc = TokChunks(in_str)
preproc.serve_syls_to_trie()
tokens = tok.tokenize(preproc)

out = []

for token in tokens:
    out.append(token["text"])

print(out)

Output:

['༈ ', 'བློ་', 'ཆོས་', 'སུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ཆོས་', 'ལམ་', 'དུ་', 'འགྲོ་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'ལམ་', 'འཁྲུལ་བ་', 'ཞིག་པར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །', 'འཁྲུལ་', 'པ་', 'ཡེ་ཤེས་', 'སུ་', 'འཆར་བར་', 'བྱིན་', 'གྱིས་', 'རློབས', '། །']

You can also refer to the WordTokenizer class at https://github.com/Esukhia/botok/blob/7d85cbb0df62ff4c9da3c70088ad671f03472a18/botok/tokenizers/wordtokenizer.py#L28 to customize the adjustment rules.
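For reference, here is a minimal sketch of the high-level pipeline using the WordTokenizer class, which wires up Config, Trie, and Tokenize internally and applies the dialect pack's adjustment rules after the raw tokenization pass; the exact constructor and tokenize signatures may differ between releases, so check the linked source for your version:

from botok import WordTokenizer

in_str = "..."  # your Tibetan input text

# WordTokenizer builds the trie from the dialect pack's Dictionary
# component and applies the Adjustments rules for you.
wt = WordTokenizer()
tokens = wt.tokenize(in_str)
print([token["text"] for token in tokens])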

PS: We will be releasing botok documentation soon.

@mikkokotila
Contributor Author

How wonderful!

Regarding dialect packs: my use is 100% on Buddhadharma texts. Do you have a recommendation for which dialect pack to use?

@10zinten
Contributor

Currently, we only have a dialect pack for the general Tibetan language. Our research team is working on a dialect pack for Buddhadharma texts. Until then, you can experiment with the general dialect pack to improve segmentation (a sketch of selecting it explicitly follows).
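For example, selecting the general dialect pack explicitly would look like this (a sketch: the dialect_name and base_path parameters are assumptions based on recent botok releases, so adjust to your installed version):

from pathlib import Path
from botok import Config, WordTokenizer

# Assumption: recent botok versions let you name the dialect pack via Config.
config = Config(dialect_name="general", base_path=Path.home())
wt = WordTokenizer(config=config)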

We will be releasing detailed documentation on customizing dialect packs.
