Preprocessing the input string is the foundation for higher-level operations such as tokenization.

In [1]:
input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།'

# pybotextchunks.py

This class is a wrapper around PyBoChunk, which it subclasses.
Its main purpose is to prepare the input that is fed into the trie in tokenizer.py.

Here is what it produces:

In [2]:
from pybo import PyBoTextChunks
text_chunks = PyBoTextChunks(input_str)

output = text_chunks.serve_syls_to_trie()
print(f'The output is a {type(output)} of {type(output[0])} containing each {len(output[0])} elements.')
print('\tex: ', output[0])
print(f'The first element is either a {type(output[0][0])} or a {type(output[1][0])} of ints')
print('\tex:', output[0][0], 'or', output[1][0])

The output is a <class 'list'> of <class 'tuple'> containing each 2 elements.
	ex:  (None, (102, 0, 2))
The first element is either a <class 'NoneType'> or a <class 'list'> of ints
	ex: None or [2, 3]


The second element is the output of PyBoChunk, which is described further below.

In [3]:
for n, trie_input in enumerate(output):
    print(f'{n+1}.\t {trie_input[0]}')

1.	 None
2.	 [2, 3]
3.	 [5, 6, 7]
4.	 [9, 10, 11]
5.	 None
6.	 [18, 19, 20]
7.	 [23, 24, 26, 27]
8.	 None
9.	 [30, 31, 32]
10.	 [34, 35, 36]
11.	 [38, 39, 40]
12.	 [42, 43, 44, 45]
13.	 None
14.	 [50, 51]
15.	 None
16.	 [54, 55, 56, 57]
17.	 [59, 60, 61]
18.	 [63, 64, 65, 66]
19.	 [68, 69, 70]
20.	 [72, 73, 74]
21.	 [76, 77]
22.	 [79, 80]
23.	 [82, 83, 84, 85]
24.	 None
25.	 [92, 93, 94]
26.	 None


The first element tells the trie whether a given chunk is a syllable, and thus should be fed to the trie, or not.

In case it is a syllable, its elements are indices of the characters that constitute the syllable:

In [4]:
print('\t', input_str)
for n, trie_input in enumerate(output):
    syl = trie_input[0]
    if syl:
        syllable = [input_str[s] for s in syl] + ['་']
        print(f'{n+1}.\t{syllable}')
    else:
        print(f'{n+1}.')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ།
1.
2.	['ཤ', 'ི', '་']
3.	['བ', 'ཀ', 'ྲ', '་']
4.	['ཤ', 'ི', 'ས', '་']
5.
6.	['བ', 'ད', 'ེ', '་']
7.	['ལ', 'ེ', 'ག', 'ས', '་']
8.
9.	['བ', 'ཀ', 'ྲ', '་']
10.	['ཤ', 'ི', 'ས', '་']
11.	['བ', 'ད', 'ེ', '་']
12.	['ལ', 'ེ', 'ག', 'ས', '་']
13.
14.	['ཀ', 'ཀ', '་']
15.
16.	['མ', 'ཐ', 'འ', 'ི', '་']
17.	['ར', 'ྒ', 'ྱ', '་']
18.	['མ', 'ཚ', 'ོ', 'ར', '་']
19.	['ག', 'ན', 'ས', '་']
20.	['པ', 'འ', 'ི', '་']
21.	['ཉ', 'ས', '་']
22.	['ཆ', 'ུ', '་']
23.	['འ', 'ཐ', 'ུ', 'ང', '་']
24.
25.	['མ', 'ཁ', 'འ', '་']
26.


Punctuation chunks and non-Tibetan parts are set aside, and the content of each syllable has been cleaned.
In token 6, the double tsek has been normalized to a single tsek.
In token 7, the space in the middle has been dropped.
In addition, the non-syllabic chunks signal that any ongoing word has to end there.
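
As an illustration of the cleaning described above, here is a minimal sketch in plain Python. It is not pybo's implementation: the helper name `syl_char_indices` is hypothetical, and the rule it applies (drop every tsek and every whitespace character, keep the index of everything else) is an assumption inferred from the outputs shown above.

```python
TSEK = '\u0f0b'  # Tibetan intersyllabic mark (tsek)


def syl_char_indices(string, start, length):
    """Hypothetical helper: indices of the characters of a syllable chunk.

    Assumption: tseks and whitespace are the only characters that get skipped.
    """
    return [
        i for i in range(start, start + length)
        if string[i] != TSEK and not string[i].isspace()
    ]


# Two chunks taken from the example string: one with a double tsek,
# one with a space in the middle of the syllable.
sample = 'བདེ་་ལེ གས'
print(syl_char_indices(sample, 0, 5))
print(syl_char_indices(sample, 5, 5))
```

Appending a single tsek to the characters found at these indices, as done in cell 4, yields the normalized syllable.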

Now, this cleaned content can be fed into the trie, syllable after syllable, character after character.
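
To make the "word has to end there" cue concrete, the sketch below groups consecutive syllables into runs and closes a run whenever the first element is None. The function name `syllable_runs` and the toy data are hypothetical; how tokenizer.py actually walks the trie is not shown here.

```python
def syllable_runs(trie_input):
    """Group consecutive syllables; a None first element closes the current run."""
    runs, current = [], []
    for syl, _chunk in trie_input:
        if syl is None:
            # Non-syllabic chunk: any ongoing word must end here.
            if current:
                runs.append(current)
                current = []
        else:
            current.append(syl)
    if current:
        runs.append(current)
    return runs


# Toy data shaped like the output of serve_syls_to_trie() shown above.
toy = [
    (None, (102, 0, 2)),
    ([2, 3], (106, 2, 3)),
    ([5, 6, 7], (106, 5, 4)),
    (None, (101, 9, 3)),
    ([12, 13], (106, 12, 3)),
]
print(syllable_runs(toy))
```

Each run is a candidate span within which words can be looked up; no word can straddle two runs.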

# pybochunk.py

This class builds on the chunking framework defined in BoChunk, which it subclasses.

It implements the following chunking pipeline:
 - turn the input string into a sequence of alternating "bo" and "non-bo" chunks
 - turn the "bo" chunks into a sequence of alternating "punct" and "bo" chunks
 - turn the "bo" chunks into a sequence of syllables
 - concatenate the chunks containing only spaces to the preceding one
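
The first step of the pipeline can be pictured with the following sketch, which is independent of pybo. It assumes that a character counts as "bo" when it lies in the Unicode Tibetan block (U+0F00 to U+0FFF) or is whitespace; the names `is_bo_char` and `chunk_bo` are hypothetical, and string markers are used instead of the integer constants for readability.

```python
from itertools import groupby


def is_bo_char(char):
    """Assumption: the Tibetan block and whitespace both count as 'bo'."""
    return '\u0f00' <= char <= '\u0fff' or char.isspace()


def chunk_bo(string):
    """Split string into alternating ('bo' | 'non-bo', start, length) chunks."""
    chunks, start = [], 0
    for is_bo, group in groupby(string, key=is_bo_char):
        length = len(list(group))
        chunks.append(('bo' if is_bo else 'non-bo', start, length))
        start += length
    return chunks


print(chunk_bo(' བཀྲ་ཤིས་  tr བདེ་ལེགས།'))
```

Because `groupby` only starts a new group when the key changes, the resulting chunks alternate by construction and cover the whole string.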

The last pipe is built from the building blocks of BoChunk but implemented here, since this treatment is specific to the Tibetan language. Following Tibetan usage, spaces are not considered punctuation marks; they are simply attached to the chunk preceding them.
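
The space-attaching step could look roughly like the sketch below. It is only an illustration under stated assumptions: `attach_spaces` is a hypothetical name, chunks are assumed to be contiguous `(marker, start, length)` tuples, and a space-only chunk at the very beginning (with nothing before it) is simply kept.

```python
def attach_spaces(string, chunks):
    """Merge every whitespace-only chunk into the chunk that precedes it."""
    merged = []
    for marker, start, length in chunks:
        text = string[start:start + length]
        if merged and text.strip() == '':
            # Extend the previous chunk so that it swallows the spaces.
            prev_marker, prev_start, prev_length = merged[-1]
            merged[-1] = (prev_marker, prev_start, prev_length + length)
        else:
            merged.append((marker, start, length))
    return merged


text = 'ab  cd'
print(attach_spaces(text, [('x', 0, 2), ('y', 2, 2), ('x', 4, 2)]))
```

The merged chunk keeps the marker of the preceding chunk, which matches the output shown below where `'ཤིས་  '` is still a syllable and `'tr '` is still non-Tibetan.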

In [5]:
from pybo import PyBoChunk

chunks = PyBoChunk(input_str)
output = chunks.chunk()

The raw output:

In [6]:
for n, chunk in enumerate(output):
    print(f'{n+1}.\t{chunk}')

1.	(102, 0, 2)
2.	(106, 2, 3)
3.	(106, 5, 4)
4.	(106, 9, 6)
5.	(101, 15, 3)
6.	(106, 18, 5)
7.	(106, 23, 5)
8.	(102, 28, 2)
9.	(106, 30, 4)
10.	(106, 34, 4)
11.	(106, 38, 4)
12.	(106, 42, 5)
13.	(109, 47, 3)
14.	(106, 50, 2)
15.	(102, 52, 2)
16.	(106, 54, 5)
17.	(106, 59, 4)
18.	(106, 63, 5)
19.	(106, 68, 4)
20.	(106, 72, 4)
21.	(106, 76, 3)
22.	(106, 79, 3)
23.	(106, 82, 5)
24.	(102, 87, 5)
25.	(106, 92, 3)
26.	(102, 95, 1)


The last two elements in the tuple are:
 - the starting index of the current chunk in the input string
 - the length of the current chunk
 
PyBoChunk provides a method to replace these two indices with the actual substring:

In [7]:
with_substrings = chunks.get_chunked(output)
for n, chunk in enumerate(with_substrings):
    print(f'{n+1}.\t{chunk}')

1.	(102, '༆ ')
2.	(106, 'ཤི་')
3.	(106, 'བཀྲ་')
4.	(106, 'ཤིས་  ')
5.	(101, 'tr ')
6.	(106, 'བདེ་་')
7.	(106, 'ལེ གས')
8.	(102, '། ')
9.	(106, 'བཀྲ་')
10.	(106, 'ཤིས་')
11.	(106, 'བདེ་')
12.	(106, 'ལེགས་')
13.	(109, '༡༢༣')
14.	(106, 'ཀཀ')
15.	(102, '། ')
16.	(106, 'མཐའི་')
17.	(106, 'རྒྱ་')
18.	(106, 'མཚོར་')
19.	(106, 'གནས་')
20.	(106, 'པའི་')
21.	(106, 'ཉས་')
22.	(106, 'ཆུ་')
23.	(106, 'འཐུང་')
24.	(102, '།། །།')
25.	(106, 'མཁའ')
26.	(102, '།')


The same substring can also be extracted manually:

In [8]:
chunk_str = input_str[output[0][1]:output[0][1]+output[0][2]]
print(f'"{with_substrings[0][1]}" equals "{chunk_str}"')

"༆ " equals "༆ "


Now, let's get human-readable values for the first int, the chunk marker:

In [9]:
with_markers = chunks.get_markers(with_substrings)
for n, chunk in enumerate(with_markers):
    print(f'{n+1}.\t{chunk}')

1.	('punct', '༆ ')
2.	('syl', 'ཤི་')
3.	('syl', 'བཀྲ་')
4.	('syl', 'ཤིས་  ')
5.	('non-bo', 'tr ')
6.	('syl', 'བདེ་་')
7.	('syl', 'ལེ གས')
8.	('punct', '། ')
9.	('syl', 'བཀྲ་')
10.	('syl', 'ཤིས་')
11.	('syl', 'བདེ་')
12.	('syl', 'ལེགས་')
13.	('num', '༡༢༣')
14.	('syl', 'ཀཀ')
15.	('punct', '། ')
16.	('syl', 'མཐའི་')
17.	('syl', 'རྒྱ་')
18.	('syl', 'མཚོར་')
19.	('syl', 'གནས་')
20.	('syl', 'པའི་')
21.	('syl', 'ཉས་')
22.	('syl', 'ཆུ་')
23.	('syl', 'འཐུང་')
24.	('punct', '།། །།')
25.	('syl', 'མཁའ')
26.	('punct', '།')


 - punct: chunk containing Tibetan punctuation
 - syl: chunk containing a single syllable (necessarily Tibetan characters)
 - num: chunk containing Tibetan numerals
 - non-bo: chunk containing non-Tibetan characters
 
Note that at this point, all we have are substrings of the initial string, each tagged with a chunk marker.
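
For reference, the correspondence between the integer markers and their names can be read off the outputs shown in this notebook. The sketch below hard-codes that correspondence in a hypothetical `MARKER_NAMES` table; it is not how pybo stores its constants, and markers that do not appear in this notebook are not covered.

```python
# Values read off the outputs shown in this notebook (assumption: nothing else is needed here).
MARKER_NAMES = {100: 'bo', 101: 'non-bo', 102: 'punct', 106: 'syl', 109: 'num'}


def readable(chunks):
    """Replace the leading int of each chunk with its name; unknown markers are kept as-is."""
    return [(MARKER_NAMES.get(chunk[0], chunk[0]),) + tuple(chunk[1:]) for chunk in chunks]


print(readable([(102, 0, 2), (106, 2, 3), (101, 15, 3), (109, 47, 3)]))
```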

# bochunk.py

In [None]:
    from pybo.bochunk import BoChunk

    bo_str = ' བཀྲ་ཤིས་  tr བདེ་ལེགས།'
    bc = BoChunk(bo_str)

In [None]:
    # 1. Initial chunking
    

In [None]:
    >>> chunks = bc.chunk_bo_chars()  # uses the default chunk-markers here; they are overridden in the pipeline

    >>> chunks  # the actual chunks; each chunk is a tuple: (chunk-marker, starting_index, chunk_length)
    

In [None]:
    >>> bc.get_markers(chunks)  # shows the human-readable description of the constant
    

In [None]:
    >>> bc.get_chunked(chunks)  # shows the substring for each chunk

In [None]:
    # 2. First piped chunking: re-chunks Tibetan chunks into Tibetan text and Tibetan punctuation.
    

In [None]:
    >>> bc.pipe_chunk(chunks, bc.chunk_punct, to_chunk=bc.BO_MARKER, yes=bc.PUNCT_MARKER)

In [None]:
    >>> bc.get_markers(chunks)
    [('bo', 0, 11), ('non-bo', 11, 2), ('bo', 13, 9), ('punct', 22, 1)]
    

In [None]:
    >>> bc.get_chunked(chunks)

In [None]:
    # 3. Second piped chunking: re-chunks Tibetan text into syllables, keeping the same chunk-marker.
    >>> bc.pipe_chunk(chunks, bc.syllabify, to_chunk=bc.BO_MARKER, yes=bc.BO_MARKER)

In [None]:
    >>> bc.get_markers(chunks)

In [None]:
    >>> bc.get_chunked(chunks)
    [(100, ' བཀྲ་'), (100, 'ཤིས་'), (100, '  '), (101, 'tr'), (100, ' བདེ་'), (100, 'ལེགས'), (102, '།')]

In [10]:
    # 4. Formatting the resulting chunks into an easily usable structure
    >>> chunks = bc.get_markers(chunks)         # exchange the constants for the human-readable description
    >>> final_result = bc.get_chunked(chunks)   # exchange the indices for the substrings of each chunk
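
The idea behind the piped chunking of steps 2 and 3 can be sketched without pybo as follows. This is a simplified illustration, not BoChunk's code: `pipe` and `make_punct_splitter` are hypothetical names, string markers replace the integer constants, the list is assumed to be modified in place (as the calls above suggest, since their return value is not used), and only the shad (U+0F0D) is treated as punctuation.

```python
from itertools import groupby

SHAD = '\u0f0d'  # Tibetan shad, the only punctuation mark handled in this sketch


def pipe(chunks, rechunk, to_chunk):
    """Replace, in place, every chunk marked to_chunk with the output of rechunk(start, length)."""
    i = 0
    while i < len(chunks):
        marker, start, length = chunks[i]
        if marker == to_chunk:
            new = rechunk(start, length)
            chunks[i:i + 1] = new
            i += len(new)
        else:
            i += 1


def make_punct_splitter(string):
    """Build a re-chunker that splits a span into alternating 'bo' and 'punct' chunks."""
    def split(start, length):
        out, pos = [], start
        span = string[start:start + length]
        for is_punct, group in groupby(span, key=lambda c: c == SHAD):
            n = len(list(group))
            out.append(('punct' if is_punct else 'bo', pos, n))
            pos += n
        return out
    return split


text = 'བཀ།ab'
chunks = [('bo', 0, 3), ('non-bo', 3, 2)]
pipe(chunks, make_punct_splitter(text), 'bo')
print(chunks)
```

Chunks that do not carry the `to_chunk` marker pass through untouched, which is what allows several pipes to be applied one after the other.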