Preprocessing an input string is the groundwork for higher-level operations such as tokenizing.

In [1]:
input_str = '༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་'

# pybotextchunks.py

This class is a wrapper around PyBoChunk, which it subclasses.
Its main purpose is to prepare the input that is fed into the trie in tokenizer.py.

Here is what it produces:

In [2]:
from pybo import PyBoTextChunks
text_chunks = PyBoTextChunks(input_str)

output = text_chunks.serve_syls_to_trie()
print(f'The output is a {type(output)} of {type(output[0])} containing each {len(output[0])} elements.')
print('\tex: ', output[0])
print(f'The first element is either a {type(output[0][0])} or a {type(output[1][0])} of ints')
print('\tex:', output[0][0], 'or', output[1][0])

The output is a <class 'list'> of <class 'tuple'> containing each 2 elements.
	ex:  (None, (102, 0, 2))
The first element is either a <class 'NoneType'> or a <class 'list'> of ints
	ex: None or [2, 3]


The second element is the chunk produced by PyBoChunk; it is described further below.

In [3]:
for n, trie_input in enumerate(output):
    print(f'{n+1}.\t {trie_input[0]}')

1.	 None
2.	 [2, 3]
3.	 [5, 6, 7]
4.	 [9, 10, 11]
5.	 None
6.	 None
7.	 [19, 20, 21]
8.	 None
9.	 [24, 25, 27, 28]
10.	 None
11.	 [31, 32, 33]
12.	 [35, 36, 37]
13.	 [39, 40, 41]
14.	 [43, 44, 45, 46]
15.	 None
16.	 [51, 52]
17.	 None
18.	 [55, 56, 57, 58]
19.	 [60, 61, 62]
20.	 [64, 65, 66, 67]
21.	 [69, 70, 71]
22.	 [73, 74, 75]
23.	 [77, 78]
24.	 [80, 81]
25.	 [83, 84, 85, 86]
26.	 None
27.	 [93, 94, 95]
28.	 None
29.	 [98, 99, 100, 102, 103, 104, 106, 107, 108]
30.	 [110, 111, 112]


The first element tells the trie whether a given chunk is a syllable, and should therefore be fed to the trie, or not.

In case it is a syllable, its elements are indices of the characters that constitute the syllable:

In [4]:
print('\t', input_str)
for n, trie_input in enumerate(output):
    syl = trie_input[0]
    if syl:
        syllable = [input_str[s] for s in syl] + ['་']
        print(f'{n+1}.\t{syllable}')
    else:
        print(f'{n+1}.')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་
1.
2.	['ཤ', 'ི', '་']
3.	['བ', 'ཀ', 'ྲ', '་']
4.	['ཤ', 'ི', 'ས', '་']
5.
6.
7.	['བ', 'ད', 'ེ', '་']
8.
9.	['ལ', 'ེ', 'ག', 'ས', '་']
10.
11.	['བ', 'ཀ', 'ྲ', '་']
12.	['ཤ', 'ི', 'ས', '་']
13.	['བ', 'ད', 'ེ', '་']
14.	['ལ', 'ེ', 'ག', 'ས', '་']
15.
16.	['ཀ', 'ཀ', '་']
17.
18.	['མ', 'ཐ', 'འ', 'ི', '་']
19.	['ར', 'ྒ', 'ྱ', '་']
20.	['མ', 'ཚ', 'ོ', 'ར', '་']
21.	['ག', 'ན', 'ས', '་']
22.	['པ', 'འ', 'ི', '་']
23.	['ཉ', 'ས', '་']
24.	['ཆ', 'ུ', '་']
25.	['འ', 'ཐ', 'ུ', 'ང', '་']
26.
27.	['མ', 'ཁ', 'འ', '་']
28.
29.	['བ', 'ཀ', 'ྲ', 'ཤ', 'ི', 'ས', 'བ', 'ཀ', 'ྲ', '་']
30.	['ཤ', 'ི', 'ས', '་']


Punctuation chunks and non-Tibetan parts are left aside, and the content of each syllable has been cleaned.
In token 7, the double tsek has been normalized to a single tsek.
In token 9, the space in the middle has been left aside.
On top of this, the non-syllabic chunks signal that any ongoing word has to end there.

Now, this cleaned content can be fed into the trie, syllable after syllable, character after character.
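
The cleaning visible in the output above can be approximated in a few lines. The sketch below is not pybo's implementation: `syllable_indices` and `to_trie_input` are hypothetical helpers, the marker value 106 for syllables is read off the PyBoChunk output shown later in this notebook, and it assumes that only spaces and tseks are dropped from a syllable chunk. Placeholder Latin letters stand in for Tibetan characters.

```python
TSEKS = {"\u0f0b", "\u0f0c"}  # regular and non-breaking tsek
SYL = 106  # marker observed for syllable chunks in the PyBoChunk output (assumption)


def syllable_indices(text, start, length):
    """Indices of the characters of a syllable chunk, without spaces and tseks."""
    return [i for i in range(start, start + length)
            if text[i] not in TSEKS and not text[i].isspace()]


def to_trie_input(text, chunks):
    """Pair each (marker, start, length) chunk with its cleaned indices, or None."""
    served = []
    for marker, start, length in chunks:
        idx = syllable_indices(text, start, length) if marker == SYL else None
        served.append((idx, (marker, start, length)))
    return served


# Toy input: "ab" + tsek, then "c d" + tsek (a syllable with a stray space), then "!".
text = "ab\u0f0bc d\u0f0b!"
chunks = [(SYL, 0, 3), (SYL, 3, 4), (102, 7, 1)]
for idx, chunk in to_trie_input(text, chunks):
    print(idx, chunk)
```

A consumer of this list can treat every `None` as a word boundary and every list of indices as one syllable to walk through the trie.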

# pybochunk.py

This class is powered by the chunking framework defined in BoChunk (that it subclasses).

It implements the following chunking pipeline:
 - turn the input string into a sequence of alternating "bo" and "non-bo" chunks
 - turn the "bo" chunks into a sequence of alternating "punct" and "bo" chunks
 - turn the remaining "bo" chunks into a sequence of alternating "sym" and "bo" chunks
 - turn the remaining "bo" chunks into a sequence of alternating "num" and "bo" chunks
 - turn the remaining "bo" chunks into syllables
 - concatenate the chunks containing only spaces to the preceding one

The last step is built from the building blocks of BoChunk but implemented here, because it merges two chunks under a given condition, whereas BoChunk is strictly a chunking framework.

In [5]:
from pybo import PyBoChunk

chunks = PyBoChunk(input_str)
output = chunks.chunk()

The raw output:

In [6]:
for n, chunk in enumerate(output):
    print(f'{n+1}.\t{chunk}')

1.	(102, 0, 2)
2.	(106, 2, 3)
3.	(106, 5, 4)
4.	(106, 9, 6)
5.	(101, 15, 2)
6.	(101, 17, 2)
7.	(106, 19, 4)
8.	(102, 23, 1)
9.	(106, 24, 5)
10.	(102, 29, 2)
11.	(106, 31, 4)
12.	(106, 35, 4)
13.	(106, 39, 4)
14.	(106, 43, 5)
15.	(109, 48, 3)
16.	(106, 51, 2)
17.	(102, 53, 2)
18.	(106, 55, 5)
19.	(106, 60, 4)
20.	(106, 64, 5)
21.	(106, 69, 4)
22.	(106, 73, 4)
23.	(106, 77, 3)
24.	(106, 80, 3)
25.	(106, 83, 5)
26.	(102, 88, 5)
27.	(106, 93, 3)
28.	(102, 96, 2)
29.	(106, 98, 12)
30.	(106, 110, 4)


The last two elements in the tuple are:
 - the starting index of the current chunk in the input string
 - the length of the current chunk
 
PyBoChunk provides a method to replace these two indices with the actual substring:

In [7]:
with_substrings = chunks.get_chunked(output)
for n, chunk in enumerate(with_substrings):
    print(f'{n+1}.\t{chunk}')

1.	(102, '༆ ')
2.	(106, 'ཤི་')
3.	(106, 'བཀྲ་')
4.	(106, 'ཤིས་  ')
5.	(101, 't ')
6.	(101, 'r ')
7.	(106, 'བདེ་')
8.	(102, '་')
9.	(106, 'ལེ གས')
10.	(102, '། ')
11.	(106, 'བཀྲ་')
12.	(106, 'ཤིས་')
13.	(106, 'བདེ་')
14.	(106, 'ལེགས་')
15.	(109, '༡༢༣')
16.	(106, 'ཀཀ')
17.	(102, '། ')
18.	(106, 'མཐའི་')
19.	(106, 'རྒྱ་')
20.	(106, 'མཚོར་')
21.	(106, 'གནས་')
22.	(106, 'པའི་')
23.	(106, 'ཉས་')
24.	(106, 'ཆུ་')
25.	(106, 'འཐུང་')
26.	(102, '།། །།')
27.	(106, 'མཁའ')
28.	(102, '། ')
29.	(106, 'བཀྲ ཤིས བཀྲ་')
30.	(106, 'ཤིས་')


The same substring can also be extracted manually:

In [8]:
chunk_str = input_str[output[0][1]:output[0][1]+output[0][2]]
print(f'"{with_substrings[0][1]}" equals "{chunk_str}"')

"༆ " equals "༆ "


Now, let's get human-readable values for the first int:

In [9]:
with_markers = chunks.get_markers(with_substrings)
for n, chunk in enumerate(with_markers):
    print(f'{n+1}.\t{chunk}')

1.	('punct', '༆ ')
2.	('syl', 'ཤི་')
3.	('syl', 'བཀྲ་')
4.	('syl', 'ཤིས་  ')
5.	('non-bo', 't ')
6.	('non-bo', 'r ')
7.	('syl', 'བདེ་')
8.	('punct', '་')
9.	('syl', 'ལེ གས')
10.	('punct', '། ')
11.	('syl', 'བཀྲ་')
12.	('syl', 'ཤིས་')
13.	('syl', 'བདེ་')
14.	('syl', 'ལེགས་')
15.	('num', '༡༢༣')
16.	('syl', 'ཀཀ')
17.	('punct', '། ')
18.	('syl', 'མཐའི་')
19.	('syl', 'རྒྱ་')
20.	('syl', 'མཚོར་')
21.	('syl', 'གནས་')
22.	('syl', 'པའི་')
23.	('syl', 'ཉས་')
24.	('syl', 'ཆུ་')
25.	('syl', 'འཐུང་')
26.	('punct', '།། །།')
27.	('syl', 'མཁའ')
28.	('punct', '། ')
29.	('syl', 'བཀྲ ཤིས བཀྲ་')
30.	('syl', 'ཤིས་')


 - punct: chunk containing Tibetan punctuation
 - syl: chunk containing a single syllable (necessarily Tibetan characters)
 - num: chunk containing Tibetan numerals
 - non-bo: chunk containing non-Tibetan characters
 
Note that at this point, all we have are substrings of the initial string paired with chunk markers.
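
For reference, the marker values that appear in the outputs of this notebook can be collected into a small lookup table. The mapping below is only read off those outputs (100, 101, 102, 106 and 109); markers whose integer value is never printed here, such as the one for symbols, are deliberately left out. `MARKER_NAMES` and `name_chunks` are hypothetical names and not part of pybo.

```python
# Marker values as observed in this notebook's outputs (not an exhaustive list).
MARKER_NAMES = {
    100: "bo",
    101: "non-bo",
    102: "punct",
    106: "syl",
    109: "num",
}


def name_chunks(chunks):
    """Replace the leading int of each chunk with its name; unknown ints are kept."""
    return [(MARKER_NAMES.get(marker, marker), *rest) for marker, *rest in chunks]


print(name_chunks([(102, 0, 2), (106, 2, 3), (109, 48, 3)]))
# [('punct', 0, 2), ('syl', 2, 3), ('num', 48, 3)]
```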

# bochunk.py

BoChunk is a chunking framework built on top of BoString (see below).

Its methods create chunks, that is, groups of consecutive characters that belong to the same category. Each method outputs a binary sequence of chunks that either match a given condition or do not.

Although each chunking method only produces two kinds of chunks (those that match and those that don't), piped chunking allows complex chunking patterns to be applied.
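
The idea of a binary chunking method can be sketched independently of pybo. The function below is a minimal illustration, assuming a plain predicate on characters where BoChunk relies on the character groups of BoString; `chunk_by` and `is_tibetan_digit` are hypothetical names, and string labels are used in place of the library's int markers.

```python
from itertools import groupby


def chunk_by(text, matches, yes, no):
    """Split text into alternating runs of matching / non-matching characters.

    Returns (marker, start, length) tuples, the same shape as the chunks
    shown in this notebook.
    """
    chunks = []
    start = 0
    for is_match, run in groupby(text, key=matches):
        length = sum(1 for _ in run)
        chunks.append((yes if is_match else no, start, length))
        start += length
    return chunks


def is_tibetan_digit(char):
    # U+0F20..U+0F29 are the Tibetan digits zero to nine.
    return "\u0f20" <= char <= "\u0f29"


sample = "ab\u0f21\u0f22\u0f23cd"
print(chunk_by(sample, is_tibetan_digit, yes="num", no="non-num"))
# [('non-num', 0, 2), ('num', 2, 3), ('non-num', 5, 2)]
```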

In [10]:
from pybo.bochunk import BoChunk
    
bc = BoChunk(input_str)

## 1. The available chunking methods

#### Either Tibetan characters or not

In [11]:
bo_nonbo = bc.chunk_bo_chars()
print(f'the output of chunking functions are {type(bo_nonbo)} each containing {len(bo_nonbo[0])} {type(bo_nonbo[0][0])}')

the output of chunking functions are <class 'list'> each containing 3 <class 'int'>


In [12]:
for n, c in enumerate(bo_nonbo):
    print(f'{n+1}.\t{c}')

1.	(100, 0, 15)
2.	(101, 15, 1)
3.	(100, 16, 1)
4.	(101, 17, 1)
5.	(100, 18, 96)


We have already seen this format in PyBoChunk, which only applies the chunking methods available in BoChunk in order to produce chunks that match the expectations of a Tibetan reader.

In a human-friendly format, it actually is:

In [13]:
for n, r in enumerate(bc.get_markers(bc.get_chunked(bo_nonbo))):
    print(f'{n+1}.\t{r[0]}\t\t"{r[1]}"')

1.	bo		"༆ ཤི་བཀྲ་ཤིས་  "
2.	non-bo		"t"
3.	bo		" "
4.	non-bo		"r"
5.	bo		" བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་"


We have indeed chunked the input string into a sequence of alternating Tibetan and non-Tibetan chunks.

#### Either Tibetan punctuation or not

In [14]:
punct_nonpunct = bc.chunk_punct()

for n, p in enumerate(bc.get_markers(bc.get_chunked(punct_nonpunct))):
    print(f'{n+1}.\t{p[0]}\t\t"{p[1]}"')

1.	punct		"༆ "
2.	non-punct		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-punct		"t"
5.	punct		" "
6.	non-punct		"r"
7.	punct		" "
8.	non-punct		"བདེ་"
9.	punct		"་"
10.	non-punct		"ལེ གས"
11.	punct		"། "
12.	non-punct		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
13.	punct		"། "
14.	non-punct		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
15.	punct		"།། །།"
16.	non-punct		"མཁའ"
17.	punct		"། "
18.	non-punct		"བཀྲ ཤིས བཀྲ་ཤིས་"


Now, we have an alternating sequence of chunks that only cares about Tibetan punctuation.

Note how "t" and "r" are simply considered to be non-punct.

#### Either Tibetan symbols or not

In [15]:
sym_nonsym = bc.chunk_symbol()

for n, s in enumerate(bc.get_markers(bc.get_chunked(sym_nonsym))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	non-sym		"༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་"


... there were no Tibetan symbols in the input string.

#### Either Tibetan digits or not

In [16]:
num_nonnum = bc.chunk_number()

for n, m in enumerate(bc.get_markers(bc.get_chunked(num_nonnum))):
    print(f'{n+1}.\t{m[0]}\t\t"{m[1]}"')

1.	non-num		"༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་"
2.	num		"༡༢༣"
3.	non-num		"ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་"


We have correctly identified the Tibetan digits in the input string.

#### Either spaces or not

In [17]:
space_nonspace = bc.chunk_spaces()

for n, s in enumerate(bc.get_markers(bc.get_chunked(space_nonspace))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	non-space		"༆"
2.	space		" "
3.	non-space		"ཤི་བཀྲ་ཤིས་"
4.	space		"  "
5.	non-space		"t"
6.	space		" "
7.	non-space		"r"
8.	space		" "
9.	non-space		"བདེ་་ལེ"
10.	space		" "
11.	non-space		"གས།"
12.	space		" "
13.	non-space		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ།"
14.	space		" "
15.	non-space		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།།"
16.	space		" "
17.	non-space		"།།མཁའ།"
18.	space		" "
19.	non-space		"བཀྲ"
20.	space		" "
21.	non-space		"ཤིས"
22.	space		" "
23.	non-space		"བཀྲ་ཤིས་"


#### Syllabify

In [18]:
syl_nonsyl = bc.syllabify()

for n, s in enumerate(bc.get_markers(bc.get_chunked(syl_nonsyl))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	syl		"༆ ཤི་"
2.	syl		"བཀྲ་"
3.	syl		"ཤིས་"
4.	syl		"  t r བདེ་་"
5.	syl		"ལེ གས། བཀྲ་"
6.	syl		"ཤིས་"
7.	syl		"བདེ་"
8.	syl		"ལེགས་"
9.	syl		"༡༢༣ཀཀ། མཐའི་"
10.	syl		"རྒྱ་"
11.	syl		"མཚོར་"
12.	syl		"གནས་"
13.	syl		"པའི་"
14.	syl		"ཉས་"
15.	syl		"ཆུ་"
16.	syl		"འཐུང་"
17.	syl		"།། །།མཁའ། བཀྲ ཤིས བཀྲ་"
18.	syl		"ཤིས་"


This is the only chunking method that doesn't create a binary sequence of chunks matching or not a given condition. 

Instead, it breaks up the input it receives into syllables that are separated only by tseks (both regular and non-breaking). For this reason, syllabify takes for granted that the input it receives consists only of Tibetan characters that can compose a syllable.

When that is the case, for example in chunks 10 to 16, syllabification works as expected.

Spaces, on the other hand, are allowed within a syllable because they only serve to "beautify" a Tibetan text by visually marking the end of a clause or of a sentence after a punctuation mark. We thus reproduce the behavior of a Tibetan reader, who simply skips over a space encountered anywhere else.
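
Splitting on tseks can be illustrated with a short standalone function. This is a sketch under stated assumptions, not the code of `syllabify()`: `split_on_tseks` is a hypothetical name, only U+0F0B (tsek) and U+0F0C (non-breaking tsek) are treated as delimiters, and consecutive tseks are kept with the preceding syllable, as chunk 4 of the output above suggests.

```python
TSEKS = {"\u0f0b", "\u0f0c"}  # regular and non-breaking tsek


def split_on_tseks(text):
    """Return (start, length) spans, each ending after a run of tseks."""
    spans = []
    start = 0
    i = 0
    while i < len(text):
        if text[i] in TSEKS:
            # Swallow any directly following tseks into the same span.
            while i + 1 < len(text) and text[i + 1] in TSEKS:
                i += 1
            spans.append((start, i + 1 - start))
            start = i + 1
        i += 1
    if start < len(text):
        # Trailing characters without a closing tsek form a last span.
        spans.append((start, len(text) - start))
    return spans


# Latin letters stand in for Tibetan characters; spaces are not delimiters.
print(split_on_tseks("ab\u0f0bc d\u0f0b\u0f0bef"))
# [(0, 3), (3, 5), (8, 2)]
```

Because nothing but a tsek ends a span, any punctuation, digit or non-Tibetan character in the input ends up inside a "syllable", which is why the input has to be filtered before this step.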

#### Extending the framework with additional methods

All of these chunking methods follow a simple standardized format, which makes it easy to create new ones for specific needs. They are in turn built on top of BoString, which relies on groups of characters that were put together to reproduce the intuition of a native Tibetan reader.

So if finer chunking is required, finer character groups should be defined in BoString.

## 2. Piped Chunking: Implementing complex chunking patterns

Applying the chunking methods seen above successively, in a specific order, makes it possible to design complex patterns for chunking a given string.

We will illustrate this functionality by reproducing the chunking pipeline that PyBoChunk implements:

    a. turn the input string into a sequence of alternating "bo" / "non-bo" chunks
    b. turn the "bo" chunks into a sequence of alternating "punct" / "bo" chunks
    c. turn the remaining "bo" chunks into a sequence of alternating "sym" / "bo" chunks
    d. turn the remaining "bo" chunks into a sequence of alternating "num" / "bo" chunks
    e. turn the "bo" chunks into a sequence of syllables
    f. concatenate the chunks containing only spaces to the preceding one

#### a. Either "bo" or "non-bo"

In [19]:
chunks = bc.chunk_bo_chars()

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་

1.	bo		"༆ ཤི་བཀྲ་ཤིས་  "
2.	non-bo		"t"
3.	bo		" "
4.	non-bo		"r"
5.	bo		" བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་"


The output is similar to what we had above. Yet this time, instead of rechunking the whole input string in "punct" or "non-punct", we will use piped chunking to only apply chunk_punct() to the "bo" chunks.

Important: piped chunking is only possible when the content of the chunks is a tuple of ints.

#### b. Either "punct" or "bo"

In [20]:
bc.pipe_chunk(chunks, bc.chunk_punct, to_chunk=bc.BO_MARKER, yes=bc.PUNCT_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"t"
5.	punct		" "
6.	non-bo		"r"
7.	punct		" "
8.	bo		"བདེ་"
9.	punct		"་"
10.	bo		"ལེ གས"
11.	punct		"། "
12.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
13.	punct		"། "
14.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
15.	punct		"།། །།"
16.	bo		"མཁའ"
17.	punct		"། "
18.	bo		"བཀྲ ཤིས བཀྲ་ཤིས་"


pipe_chunk() takes as arguments:
 - chunks: the chunks produced by a previous chunking method. They are expected to be a list of tuples, each containing three ints.
 - bc.chunk_punct: the chunking method we want to apply on top of the existing chunks
 - to_chunk: the marker that identifies which chunks should be further processed by the new chunking method
 - yes: the marker to be used on the new chunks that match the new chunking method (the chunks that don't match simply retain their previous marker)

To put it in plain English:

                  "Within the existing chunks, I would like to take those marked as 'bo' 
                  and within them separate what is actual Tibetan text ('bo') 
                  from what is Tibetan punctuation ('punct')"
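
The mechanics behind this can be sketched without pybo. The code below is an illustration under stated assumptions: `chunk_by` and `pipe` are hypothetical helpers, predicates on ASCII characters stand in for the Tibetan character groups, string labels replace the int markers, and a new list is returned whereas `pipe_chunk()` appears to update `chunks` in place.

```python
from itertools import groupby


def chunk_by(text, start, length, matches, yes, no):
    """Binary-chunk text[start:start+length]; positions stay relative to text."""
    chunks = []
    pos = start
    for is_match, run in groupby(text[start:start + length], key=matches):
        n = sum(1 for _ in run)
        chunks.append((yes if is_match else no, pos, n))
        pos += n
    return chunks


def pipe(text, chunks, matches, to_chunk, yes):
    """Re-chunk only the chunks marked `to_chunk`; the others pass through."""
    piped = []
    for marker, start, length in chunks:
        if marker == to_chunk:
            # Non-matching parts keep the marker they already had.
            piped.extend(chunk_by(text, start, length, matches, yes, no=marker))
        else:
            piped.append((marker, start, length))
    return piped


# Stand-ins: upper-case letters play "non-bo", ';' and ' ' play punctuation.
text = "ab;cd XY ef;"
step_a = chunk_by(text, 0, len(text), lambda c: not c.isupper(), "bo", "non-bo")
step_b = pipe(text, step_a, lambda c: c in "; ", to_chunk="bo", yes="punct")
print(step_a)
print(step_b)
```

Since each step only touches the chunks carrying the `to_chunk` marker, the order of the steps determines which category wins when a character could belong to several of them.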

#### c. Either "sym" or "bo"

In [21]:
bc.pipe_chunk(chunks, bc.chunk_symbol, to_chunk=bc.BO_MARKER, yes=bc.SYMBOL_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"t"
5.	punct		" "
6.	non-bo		"r"
7.	punct		" "
8.	bo		"བདེ་"
9.	punct		"་"
10.	bo		"ལེ གས"
11.	punct		"། "
12.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
13.	punct		"། "
14.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
15.	punct		"།། །།"
16.	bo		"མཁའ"
17.	punct		"། "
18.	bo		"བཀྲ ཤིས བཀྲ་ཤིས་"


There were no symbols, so the chunks are unchanged.

#### d. Either "num" or "bo"

In [22]:
bc.pipe_chunk(chunks, bc.chunk_number, to_chunk=bc.BO_MARKER, yes=bc.NUMBER_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"t"
5.	punct		" "
6.	non-bo		"r"
7.	punct		" "
8.	bo		"བདེ་"
9.	punct		"་"
10.	bo		"ལེ གས"
11.	punct		"། "
12.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་"
13.	num		"༡༢༣"
14.	bo		"ཀཀ"
15.	punct		"། "
16.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
17.	punct		"།། །།"
18.	bo		"མཁའ"
19.	punct		"། "
20.	bo		"བཀྲ ཤིས བཀྲ་ཤིས་"


                  "Within the existing chunks, I would like to take those marked as 'bo' 
                  and within them separate what is actual Tibetan text ('bo') 
                  from what are Tibetan digits ('num')"

#### e. Syllabify "bo" into "syl"

In [23]:
bc.pipe_chunk(chunks, bc.syllabify, to_chunk=bc.BO_MARKER, yes=bc.SYL_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  t r བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།མཁའ། བཀྲ ཤིས བཀྲ་ཤིས་

1.	punct		"༆ "
2.	syl		"ཤི་"
3.	syl		"བཀྲ་"
4.	syl		"ཤིས་"
5.	punct		"  "
6.	non-bo		"t"
7.	punct		" "
8.	non-bo		"r"
9.	punct		" "
10.	syl		"བདེ་"
11.	punct		"་"
12.	syl		"ལེ གས"
13.	punct		"། "
14.	syl		"བཀྲ་"
15.	syl		"ཤིས་"
16.	syl		"བདེ་"
17.	syl		"ལེགས་"
18.	num		"༡༢༣"
19.	syl		"ཀཀ"
20.	punct		"། "
21.	syl		"མཐའི་"
22.	syl		"རྒྱ་"
23.	syl		"མཚོར་"
24.	syl		"གནས་"
25.	syl		"པའི་"
26.	syl		"ཉས་"
27.	syl		"ཆུ་"
28.	syl		"འཐུང་"
29.	punct		"།། །།"
30.	syl		"མཁའ"
31.	punct		"། "
32.	syl		"བཀྲ ཤིས བཀྲ་"
33.	syl		"ཤིས་"


Here we are! We have successfully preprocessed the input string into a sequence of identified chunks that we can confidently use for further NLP processing.

In order to have the exact same output as PyBoChunk, chunk 5 should be attached to chunk 4, chunk 7 to chunk 6, and chunk 9 to chunk 8. Since this is implemented as a private method of PyBoChunk, we will stop this demonstration here.
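
For completeness, the merging step can be approximated as follows. This is a sketch of the idea only, not PyBoChunk's private method: `merge_space_chunks` is a hypothetical name, it works on (marker, start, length) tuples with string labels, and it assumes the chunks are contiguous and in order.

```python
def merge_space_chunks(text, chunks):
    """Attach every spaces-only chunk to the chunk that precedes it."""
    merged = []
    for marker, start, length in chunks:
        only_spaces = text[start:start + length].isspace()
        if only_spaces and merged:
            # Extend the previous chunk; it keeps its own marker.
            prev_marker, prev_start, prev_length = merged[-1]
            merged[-1] = (prev_marker, prev_start, prev_length + length)
        else:
            merged.append((marker, start, length))
    return merged


# Toy input mirroring chunks 4 to 9 above, with Latin stand-ins.
text = "ab-  t r "
chunks = [("syl", 0, 3), ("punct", 3, 2), ("non-bo", 5, 1),
          ("punct", 6, 1), ("non-bo", 7, 1), ("punct", 8, 1)]
print(merge_space_chunks(text, chunks))
# [('syl', 0, 5), ('non-bo', 5, 2), ('non-bo', 7, 2)]
```

A spaces-only chunk at the very beginning has no predecessor and is kept as is in this sketch.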

# bostring.py