Preprocessing the input string is the foundation for higher-level operations such as tokenizing.

In [3]:
input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'

# pybotextchunks.py

This class is a wrapper around PyBoChunk, which it subclasses.
Its main purpose is to prepare the input that is fed into the trie in tokenizer.py.

Here is what it produces:

In [21]:
from pybo import PyBoTextChunks
text_chunks = PyBoTextChunks(input_str)

output = text_chunks.serve_syls_to_trie()
print(f'The output is a {type(output)} of {type(output[0])} containing each {len(output[0])} elements.')
print('\t', output[0])
print(f'The first element is either a {type(output[0][0])} or a {type(output[1][0])} of ints')
print('\t', output[0][0], 'or', output[1][0])
print(f'The second element is a {type(output[0][1])} always containing {len(output[0][1])} {type(output[0][1][0])}')
print('\t', output[0][1])

The output is a <class 'list'> of <class 'tuple'> containing each 2 elements.
	 (None, (102, 0, 2))
The first element is either a <class 'NoneType'> or a <class 'list'> of ints
	 None or [2, 3]
The second element is a <class 'tuple'> always containing 3 <class 'int'>
	 (102, 0, 2)


In [28]:
for n, trie_input in enumerate(output):
    print(f'{n+1}.\t {trie_input[0]}\t\t\t{trie_input[1]}')

1.	 None			(102, 0, 2)
2.	 [2, 3]			(106, 2, 3)
3.	 [5, 6, 7]			(106, 5, 4)
4.	 [9, 10, 11]			(106, 9, 6)
5.	 None			(101, 15, 3)
6.	 [18, 19, 20]			(106, 18, 5)
7.	 [23, 24, 26, 27]			(106, 23, 5)
8.	 None			(102, 28, 2)
9.	 [30, 31, 32]			(106, 30, 4)
10.	 [34, 35, 36]			(106, 34, 4)
11.	 [38, 39, 40]			(106, 38, 4)
12.	 [42, 43, 44, 45]			(106, 42, 5)
13.	 None			(109, 47, 3)
14.	 [50, 51]			(106, 50, 2)
15.	 None			(102, 52, 2)
16.	 [54, 55, 56, 57]			(106, 54, 5)
17.	 [59, 60, 61]			(106, 59, 4)
18.	 [63, 64, 65, 66]			(106, 63, 5)
19.	 [68, 69, 70]			(106, 68, 4)
20.	 [72, 73, 74]			(106, 72, 4)
21.	 [76, 77]			(106, 76, 3)
22.	 [79, 80]			(106, 79, 3)
23.	 [82, 83, 84, 85]			(106, 82, 5)
24.	 None			(102, 87, 5)
25.	 [92, 93, 94]			(106, 92, 3)
26.	 None			(102, 95, 1)


The first element tells the trie whether a given chunk is a syllable, and should therefore be fed to the trie, or not.

If it is a syllable, its elements are the indices of the characters that make up the syllable:

In [38]:
print('\t', input_str)
for n, trie_input in enumerate(output):
    syl = trie_input[0]
    if syl:
        syllable = [input_str[s] for s in syl] + ['་']
        print(f'{n+1}.\t{syllable}')
    else:
        print(f'{n+1}.')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།
1.
2.	['ཤ', 'ི', '་']
3.	['བ', 'ཀ', 'ྲ', '་']
4.	['ཤ', 'ི', 'ས', '་']
5.
6.	['བ', 'ད', 'ེ', '་']
7.	['ལ', 'ེ', 'ག', 'ས', '་']
8.
9.	['བ', 'ཀ', 'ྲ', '་']
10.	['ཤ', 'ི', 'ས', '་']
11.	['བ', 'ད', 'ེ', '་']
12.	['ལ', 'ེ', 'ག', 'ས', '་']
13.
14.	['ཀ', 'ཀ', '་']
15.
16.	['མ', 'ཐ', 'འ', 'ི', '་']
17.	['ར', 'ྒ', 'ྱ', '་']
18.	['མ', 'ཚ', 'ོ', 'ར', '་']
19.	['ག', 'ན', 'ས', '་']
20.	['པ', 'འ', 'ི', '་']
21.	['ཉ', 'ས', '་']
22.	['ཆ', 'ུ', '་']
23.	['འ', 'ཐ', 'ུ', 'ང', '་']
24.
25.	['མ', 'ཁ', 'འ', '་']
26.


Punctuation chunks and non-Tibetan parts are left aside, and the content of each syllable has been cleaned.
In token 6, the double tsek has been normalized to a single tsek.
In token 7, the space in the middle of the syllable has been dropped.
In addition, the non-syllabic chunks signal that any ongoing word has to end there.
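
The cleaning of tokens 6 and 7 can be checked from the listing above without calling pybo at all. The sketch below assumes that the three ints read as (marker, start, length), which the values suggest but which is not stated here; `dropped_indices` is a hypothetical helper, not part of pybo.

```python
# Entries copied from the listing above: (syllable indices, chunk triple).
# Assumption: the triple reads as (marker, start, length).
sample = [
    ([18, 19, 20], (106, 18, 5)),      # token 6
    ([23, 24, 26, 27], (106, 23, 5)),  # token 7
]

def dropped_indices(syl, span):
    """Return the indices covered by the chunk but left out of the syllable."""
    _, start, length = span
    return [i for i in range(start, start + length) if i not in syl]

for syl, span in sample:
    print(syl, '->', dropped_indices(syl, span))
```

Under that assumption, token 6 leaves out positions 21 and 22 (the two tseks) and token 7 leaves out position 25 (the space).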

Now this cleaned content can be fed into the trie, syllable after syllable, character after character.
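
One way to picture this is to group the syllables into runs, with every non-syllabic chunk closing the current run, since no word can continue across it. This is only a minimal sketch on data copied from the listing above; `syllable_runs` is a hypothetical helper and does not reflect how tokenizer.py actually walks the trie.

```python
def syllable_runs(trie_input):
    """Group consecutive syllables; a non-syllabic chunk (None) closes the current run."""
    runs, current = [], []
    for syl, _span in trie_input:
        if syl is None:
            # Non-syllabic chunk: any ongoing word has to end here.
            if current:
                runs.append(current)
                current = []
        else:
            current.append(syl)
    if current:
        runs.append(current)
    return runs

# First eight entries of the listing above.
sample = [
    (None, (102, 0, 2)),
    ([2, 3], (106, 2, 3)),
    ([5, 6, 7], (106, 5, 4)),
    ([9, 10, 11], (106, 9, 6)),
    (None, (101, 15, 3)),
    ([18, 19, 20], (106, 18, 5)),
    ([23, 24, 26, 27], (106, 23, 5)),
    (None, (102, 28, 2)),
]

for run in syllable_runs(sample):
    print(run)
```

Each printed run is a candidate stretch in which words may span several syllables; a word lookup would never need to cross from one run into the next.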

As for the second element, the tuple of three ints, it is the output of PyBoChunk.
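
Judging from the values in the listing above, the three ints look like a chunk marker followed by a start index and a length, with each chunk starting where the previous one ends. The sketch below only illustrates that reading: the (marker, start, length) layout is an assumption, and `is_contiguous` and `chunk_text` are hypothetical helpers, not pybo functions. An ASCII placeholder string stands in for `input_str`.

```python
# Chunk triples copied from the first entries of the listing above.
spans = [(102, 0, 2), (106, 2, 3), (106, 5, 4), (106, 9, 6), (101, 15, 3)]

def is_contiguous(spans):
    """True if every chunk starts where the previous one ends (assumed layout)."""
    return all(s1 + l1 == s2 for (_, s1, l1), (_, s2, _) in zip(spans, spans[1:]))

def chunk_text(text, span):
    """Slice the substring a chunk covers, assuming span = (marker, start, length)."""
    _, start, length = span
    return text[start:start + length]

print(is_contiguous(spans))
print(repr(chunk_text('abcdefgh', (106, 2, 3))))
```

With the real `input_str`, `chunk_text(input_str, span)` would return the raw, uncleaned text of a chunk, in contrast to the cleaned index list in the first element.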

# pybochunk.py

In [40]:
from pybo import PyBoChunk

chunk = PyBoChunk(input_str)
output = chunk.chunk()