Preprocessing the input string is the foundation for higher-level operations such as tokenizing.

In [1]:
input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།'

# pybotextchunks.py

This class is a wrapper around PyBoChunk (which it subclasses).
Its main purpose is to provide the input to be fed into the trie within tokenizer.py.

Here is what it produces:

In [2]:
from pybo import PyBoTextChunks
text_chunks = PyBoTextChunks(input_str)

output = text_chunks.serve_syls_to_trie()
print(f'The output is a {type(output)} of {type(output[0])} containing each {len(output[0])} elements.')
print('\tex: ', output[0])
print(f'The first element is either a {type(output[0][0])} or a {type(output[1][0])} of ints')
print('\tex:', output[0][0], 'or', output[1][0])

The output is a <class 'list'> of <class 'tuple'> containing each 2 elements.
	ex:  (None, (102, 0, 2))
The first element is either a <class 'NoneType'> or a <class 'list'> of ints
	ex: None or [2, 3]


The second element is the output of PyBoChunk, described further below.

In [3]:
for n, trie_input in enumerate(output):
    print(f'{n+1}.\t {trie_input[0]}')

1.	 None
2.	 [2, 3]
3.	 [5, 6, 7]
4.	 [9, 10, 11]
5.	 None
6.	 [18, 19, 20]
7.	 None
8.	 [23, 24, 26, 27]
9.	 None
10.	 [30, 31, 32]
11.	 [34, 35, 36]
12.	 [38, 39, 40]
13.	 [42, 43, 44, 45]
14.	 None
15.	 [50, 51]
16.	 None
17.	 [54, 55, 56, 57]
18.	 [59, 60, 61]
19.	 [63, 64, 65, 66]
20.	 [68, 69, 70]
21.	 [72, 73, 74]
22.	 [76, 77]
23.	 [79, 80]
24.	 [82, 83, 84, 85]
25.	 None


The first element tells the trie whether a given chunk is a syllable, and thus should be fed to the trie, or not.

In case it is a syllable, its elements are indices of the characters that constitute the syllable:

In [4]:
print('\t', input_str)
for n, trie_input in enumerate(output):
    syl = trie_input[0]
    if syl:
        syllable = [input_str[s] for s in syl] + ['་']
        print(f'{n+1}.\t{syllable}')
    else:
        print(f'{n+1}.')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།
1.
2.	['ཤ', 'ི', '་']
3.	['བ', 'ཀ', 'ྲ', '་']
4.	['ཤ', 'ི', 'ས', '་']
5.
6.	['བ', 'ད', 'ེ', '་']
7.
8.	['ལ', 'ེ', 'ག', 'ས', '་']
9.
10.	['བ', 'ཀ', 'ྲ', '་']
11.	['ཤ', 'ི', 'ས', '་']
12.	['བ', 'ད', 'ེ', '་']
13.	['ལ', 'ེ', 'ག', 'ས', '་']
14.
15.	['ཀ', 'ཀ', '་']
16.
17.	['མ', 'ཐ', 'འ', 'ི', '་']
18.	['ར', 'ྒ', 'ྱ', '་']
19.	['མ', 'ཚ', 'ོ', 'ར', '་']
20.	['ག', 'ན', 'ས', '་']
21.	['པ', 'འ', 'ི', '་']
22.	['ཉ', 'ས', '་']
23.	['ཆ', 'ུ', '་']
24.	['འ', 'ཐ', 'ུ', 'ང', '་']
25.


Punctuation chunks and non-Tibetan parts are left aside, and the content of each syllable has been cleaned.
In token 6, the double tsek has been normalized to a single tsek.
In token 8, the space in the middle has been left aside.
On top of this, the non-syllabic chunks give us a cue that any ongoing word has to end there.

Now, this cleaned content can be fed into the trie, syllable after syllable, character after character.
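To illustrate why the index lists and the `None` entries are useful, here is a minimal sketch of a longest-match lookup over such a list. It is not pybo's tokenizer: the nested-dict trie, `build_trie`, `longest_match` and the toy data are all assumptions made for this example.

```python
TSEK = '\u0f0b'   # regular tsek
END = object()    # sentinel key marking the end of a word in the trie

def build_trie(words):
    """Build a nested-dict trie; each word is a list of syllables without tsek."""
    root = {}
    for word in words:
        node = root
        for syl in word:
            for char in syl + TSEK:
                node = node.setdefault(char, {})
        node[END] = True
    return root

def longest_match(trie, text, syl_indices, start):
    """Count how many syllables, from position `start`, form the longest known word."""
    node, best = trie, 0
    for n in range(start, len(syl_indices)):
        indices = syl_indices[n]
        if indices is None:          # non-syllabic chunk: any ongoing word ends here
            break
        for char in [text[i] for i in indices] + [TSEK]:
            node = node.get(char)
            if node is None:         # no word continues with this character
                return best
        if END in node:
            best = n - start + 1
    return best

# Toy data: four syllables followed by a shad, indexed like the first elements above.
word = ['བཀྲ', 'ཤིས', 'བདེ', 'ལེགས']
text = TSEK.join(word) + '\u0f0d'
syl_indices, pos = [], 0
for syl in word:
    syl_indices.append(list(range(pos, pos + len(syl))))
    pos += len(syl) + 1          # skip the tsek (or the final shad)
syl_indices.append(None)         # the shad is not a syllable

trie = build_trie([word[:2], word])
print(longest_match(trie, text, syl_indices, 0))
```

Because a tsek is appended to every syllable before the lookup, a syllable that has no tsek in the input (here the last one, directly followed by a shad) is matched in the same way as the others.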

# pybochunk.py

This class is powered by the chunking framework defined in BoChunk (which it subclasses).

It implements the following chunking pipeline:
 - turn the input string into a sequence of alternating "bo" and "non-bo" chunks
 - turn the "bo" chunks into a sequence of alternating "punct" and "bo" chunks
 - turn the remaining "bo" chunks into a sequence of alternating "sym" and "bo" chunks
 - turn the remaining "bo" chunks into a sequence of alternating "num" and "bo" chunks
 - turn the remaining "bo" chunks into syllables
 - concatenate the chunks containing only spaces to the preceding one

The last step is built from BoChunk's building blocks but is implemented here, because it merges two chunks under a given condition, whereas BoChunk is purely a chunking framework.
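The merging step can be pictured with the following sketch. It is only an illustration of the idea, assuming chunks in the `(marker, start, length)` format; `merge_space_chunks` is a hypothetical helper, not PyBoChunk's private method, and the ASCII string is a stand-in for real input.

```python
def merge_space_chunks(text, chunks):
    """Attach every chunk made only of whitespace to the chunk that precedes it."""
    merged = []
    for marker, start, length in chunks:
        content = text[start:start + length]
        if merged and content.strip() == '':
            # extend the previous chunk instead of keeping a space-only chunk
            prev_marker, prev_start, prev_length = merged[-1]
            merged[-1] = (prev_marker, prev_start, prev_length + length)
        else:
            merged.append((marker, start, length))
    return merged

# 106, 102 and 101 are the marker values seen in this notebook's outputs.
example_text = 'ab  cd '
example_chunks = [(106, 0, 2), (102, 2, 2), (101, 4, 2), (102, 6, 1)]
print(merge_space_chunks(example_text, example_chunks))
# → [(106, 0, 4), (101, 4, 3)]
```

A space-only chunk at the very beginning of the input has no preceding chunk, so this sketch leaves it untouched.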

In [5]:
from pybo import PyBoChunk

chunks = PyBoChunk(input_str)
output = chunks.chunk()

The raw output:

In [6]:
for n, chunk in enumerate(output):
    print(f'{n+1}.\t{chunk}')

1.	(102, 0, 2)
2.	(106, 2, 3)
3.	(106, 5, 4)
4.	(106, 9, 6)
5.	(101, 15, 3)
6.	(106, 18, 4)
7.	(102, 22, 1)
8.	(106, 23, 5)
9.	(102, 28, 2)
10.	(106, 30, 4)
11.	(106, 34, 4)
12.	(106, 38, 4)
13.	(106, 42, 5)
14.	(109, 47, 3)
15.	(106, 50, 2)
16.	(102, 52, 2)
17.	(106, 54, 5)
18.	(106, 59, 4)
19.	(106, 63, 5)
20.	(106, 68, 4)
21.	(106, 72, 4)
22.	(106, 76, 3)
23.	(106, 79, 3)
24.	(106, 82, 5)
25.	(102, 87, 5)


The last two elements in the tuple are:
 - the starting index of the current chunk in the input string
 - the length of the current chunk
 
PyBoChunk provides a method to replace these two indices with the actual substring:

In [7]:
with_substrings = chunks.get_chunked(output)
for n, chunk in enumerate(with_substrings):
    print(f'{n+1}.\t{chunk}')

1.	(102, '༆ ')
2.	(106, 'ཤི་')
3.	(106, 'བཀྲ་')
4.	(106, 'ཤིས་  ')
5.	(101, 'tr ')
6.	(106, 'བདེ་')
7.	(102, '་')
8.	(106, 'ལེ གས')
9.	(102, '། ')
10.	(106, 'བཀྲ་')
11.	(106, 'ཤིས་')
12.	(106, 'བདེ་')
13.	(106, 'ལེགས་')
14.	(109, '༡༢༣')
15.	(106, 'ཀཀ')
16.	(102, '། ')
17.	(106, 'མཐའི་')
18.	(106, 'རྒྱ་')
19.	(106, 'མཚོར་')
20.	(106, 'གནས་')
21.	(106, 'པའི་')
22.	(106, 'ཉས་')
23.	(106, 'ཆུ་')
24.	(106, 'འཐུང་')
25.	(102, '།། །།')


The same substring can also be extracted manually:

In [8]:
chunk_str = input_str[output[0][1]:output[0][1]+output[0][2]]
print(f'"{with_substrings[0][1]}" equals "{chunk_str}"')

"༆ " equals "༆ "


Now, let's get human-readable values for the first int:

In [9]:
with_markers = chunks.get_markers(with_substrings)
for n, chunk in enumerate(with_markers):
    print(f'{n+1}.\t{chunk}')

1.	('punct', '༆ ')
2.	('syl', 'ཤི་')
3.	('syl', 'བཀྲ་')
4.	('syl', 'ཤིས་  ')
5.	('non-bo', 'tr ')
6.	('syl', 'བདེ་')
7.	('punct', '་')
8.	('syl', 'ལེ གས')
9.	('punct', '། ')
10.	('syl', 'བཀྲ་')
11.	('syl', 'ཤིས་')
12.	('syl', 'བདེ་')
13.	('syl', 'ལེགས་')
14.	('num', '༡༢༣')
15.	('syl', 'ཀཀ')
16.	('punct', '། ')
17.	('syl', 'མཐའི་')
18.	('syl', 'རྒྱ་')
19.	('syl', 'མཚོར་')
20.	('syl', 'གནས་')
21.	('syl', 'པའི་')
22.	('syl', 'ཉས་')
23.	('syl', 'ཆུ་')
24.	('syl', 'འཐུང་')
25.	('punct', '།། །།')


 - punct: chunk containing Tibetan punctuation
 - syl: chunk containing a single syllable (necessarily Tibetan characters)
 - num: chunk containing Tibetan numerals
 - non-bo: chunk containing non-Tibetan characters
 
Note that at this point, the only thing we have is substrings of the initial string with chunk markers.

# bochunk.py

BoChunk is a chunking framework built on top of BoString (see below).

Its methods are used to create chunks, that is, groups of characters that pertain to the same category. Each method outputs a binary sequence of chunks: those that match a given condition and those that do not.

Although each chunking method only produces two kinds of chunks (those that match and those that don't), piped chunking makes it possible to apply complex chunking patterns.
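A binary chunking method of this kind can be sketched as follows. This is not BoChunk's code: `chunk_by`, its parameters and the marker values `NUM` and `OTHER` are assumptions for illustration, and only the `(marker, start, length)` output format is taken from this notebook.

```python
def chunk_by(text, matches, yes, no, start=0, end=None):
    """Split text[start:end] into maximal runs that do / do not satisfy `matches`."""
    end = len(text) if end is None else end
    chunks = []
    run_start = start
    for i in range(start + 1, end + 1):
        # close the current run at the end of the span or when the condition flips
        if i == end or matches(text[i]) != matches(text[run_start]):
            marker = yes if matches(text[run_start]) else no
            chunks.append((marker, run_start, i - run_start))
            run_start = i
    return chunks

NUM, OTHER = 1, 0  # hypothetical marker values
print(chunk_by('ab12cd', str.isdigit, yes=NUM, no=OTHER))
# → [(0, 0, 2), (1, 2, 2), (0, 4, 2)]
```

Since each chunk only stores a marker and two ints, the chunks can later be refined without copying any part of the input string.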

In [10]:
from pybo.bochunk import BoChunk
    
bc = BoChunk(input_str)

## 1. The available chunking methods

#### Either Tibetan characters or not

In [11]:
bo_nonbo = bc.chunk_bo_chars()
print(f'the output of chunking functions are {type(bo_nonbo)} each containing {len(bo_nonbo[0])} {type(bo_nonbo[0][0])}')

the output of chunking functions are <class 'list'> each containing 3 <class 'int'>


In [12]:
for n, c in enumerate(bo_nonbo):
    print(f'{n+1}.\t{c}')

1.	(100, 0, 15)
2.	(101, 15, 2)
3.	(100, 17, 75)


We have already seen this format in PyBoChunk, which actually only applies the chunking methods available in BoChunk in order to produce chunks corresponding to the expectations of a Tibetan reader.

In a human-readable format, this gives:

In [13]:
for n, r in enumerate(bc.get_markers(bc.get_chunked(bo_nonbo))):
    print(f'{n+1}.\t{r[0]}\t\t"{r[1]}"')

1.	bo		"༆ ཤི་བཀྲ་ཤིས་  "
2.	non-bo		"tr"
3.	bo		" བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།"


We have indeed chunked the input string into a sequence of alternating Tibetan and non-Tibetan chunks.

#### Either Tibetan punctuation or not

In [14]:
punct_nonpunct = bc.chunk_punct()

for n, p in enumerate(bc.get_markers(bc.get_chunked(punct_nonpunct))):
    print(f'{n+1}.\t{p[0]}\t\t"{p[1]}"')

1.	punct		"༆ "
2.	non-punct		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-punct		"tr"
5.	punct		" "
6.	non-punct		"བདེ་"
7.	punct		"་"
8.	non-punct		"ལེ གས"
9.	punct		"། "
10.	non-punct		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
11.	punct		"། "
12.	non-punct		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
13.	punct		"།། །།"


Now, we have an alternating sequence of chunks that only cares about Tibetan punctuation.

Note how "tr" is simply considered to be non-punct.

#### Either Tibetan symbols or not

In [15]:
sym_nonsym = bc.chunk_symbol()

for n, s in enumerate(bc.get_markers(bc.get_chunked(sym_nonsym))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	non-sym		"༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།"


... there were no Tibetan symbols in the input string.

#### Either Tibetan digits or not

In [16]:
num_nonnum = bc.chunk_number()

for n, m in enumerate(bc.get_markers(bc.get_chunked(num_nonnum))):
    print(f'{n+1}.\t{m[0]}\t\t"{m[1]}"')

1.	non-num		"༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་"
2.	num		"༡༢༣"
3.	non-num		"ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།"


We have correctly identified the Tibetan digits in the input string.

#### Either spaces or not

In [17]:
space_nonspace = bc.chunk_spaces()

for n, s in enumerate(bc.get_markers(bc.get_chunked(space_nonspace))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	non-space		"༆"
2.	space		" "
3.	non-space		"ཤི་བཀྲ་ཤིས་"
4.	space		"  "
5.	non-space		"tr"
6.	space		" "
7.	non-space		"བདེ་་ལེ"
8.	space		" "
9.	non-space		"གས།"
10.	space		" "
11.	non-space		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ།"
12.	space		" "
13.	non-space		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།།"
14.	space		" "
15.	non-space		"།།"


#### Syllabify

In [18]:
syl_nonsyl = bc.syllabify()

for n, s in enumerate(bc.get_markers(bc.get_chunked(syl_nonsyl))):
    print(f'{n+1}.\t{s[0]}\t\t"{s[1]}"')

1.	syl		"༆ ཤི་"
2.	syl		"བཀྲ་"
3.	syl		"ཤིས་"
4.	syl		"  tr བདེ་་"
5.	syl		"ལེ གས། བཀྲ་"
6.	syl		"ཤིས་"
7.	syl		"བདེ་"
8.	syl		"ལེགས་"
9.	syl		"༡༢༣ཀཀ། མཐའི་"
10.	syl		"རྒྱ་"
11.	syl		"མཚོར་"
12.	syl		"གནས་"
13.	syl		"པའི་"
14.	syl		"ཉས་"
15.	syl		"ཆུ་"
16.	syl		"འཐུང་"
17.	syl		"།། །།"


This is the only chunking method that doesn't create a binary sequence of chunks matching or not a given condition. 

Instead, it breaks up the input it receives into syllables that are only separated by tseks (both regular and non-breaking). For this reason, syllabify takes for granted that the input it receives contains only Tibetan characters that can compose a syllable.

When that is the case, for example in chunks 10 to 16, syllabification works as expected.

Spaces, on the other hand, are allowed within a syllable because they only serve to "beautify" a Tibetan text by visually marking the end of a clause or of a sentence after a punctuation mark. So we reproduce the behavior of a Tibetan reader, who simply skips a space encountered anywhere else.
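The behavior described above can be approximated with the sketch below. It is a simplified stand-in for `syllabify()`, written under the assumption that a cut is made after each run of tseks and that everything else, spaces included, stays inside the syllable. The function name and signature here are hypothetical, and 106 is the `syl` marker value seen in the PyBoChunk output.

```python
TSEKS = '\u0f0b\u0f0c'  # regular and non-breaking tsek

def syllabify(text, start=0, end=None, marker=106):
    """Cut text[start:end] after each run of tseks; return (marker, start, length) tuples."""
    end = len(text) if end is None else end
    chunks = []
    syl_start = start
    for i in range(start, end):
        is_tsek = text[i] in TSEKS
        next_is_tsek = i + 1 < end and text[i + 1] in TSEKS
        if is_tsek and not next_is_tsek:
            chunks.append((marker, syl_start, i + 1 - syl_start))
            syl_start = i + 1
    if syl_start < end:  # trailing syllable without a tsek
        chunks.append((marker, syl_start, end - syl_start))
    return chunks

T = '\u0f0b'
sample = 'བཀྲ' + T + 'ཤིས' + T + T + 'ལེ གས'
pieces = [sample[s:s + l] for _, s, l in syllabify(sample)]
print(pieces)
```

In this sketch a double tsek stays attached to the syllable it follows and the space inside the last syllable is kept, which matches the raw output of cell 18.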

#### Extending the framework with additional methods

All of these chunking methods follow a simple standardized format, making it extremely simple to create new ones to suit any specific needs. They are in turn built on top of BoString, which relies on the groups of characters that were put together to reproduce the intuition of a native Tibetan who is reading a text.

So in case finer chunking abilities are required, finer character groups should be created in BoString.

## 2. Piped Chunking: Implementing complex chunking patterns

Applying the chunking methods we have seen above successively, in a specific order, makes it possible to design complex patterns for chunking a given string.

We will exemplify this functionality by reproducing the chunking pipeline that PyBoChunk implements:

    a. turn the input string into a sequence of alternating "bo" / "non-bo" chunks
    b. turn the "bo" chunks into a sequence of alternating "punct" / "bo" chunks
    c. turn the remaining "bo" chunks into a sequence of alternating "sym" / "bo" chunks
    d. turn the remaining "bo" chunks into a sequence of alternating "num" / "bo" chunks
    e. turn the "bo" chunks into a sequence of syllables
    f. concatenate the chunks containing only spaces to the preceding one

#### a. Either "bo" or "non-bo"

In [19]:
chunks = bc.chunk_bo_chars()

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།

1.	bo		"༆ ཤི་བཀྲ་ཤིས་  "
2.	non-bo		"tr"
3.	bo		" བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།"


The output is similar to what we had above. Yet this time, instead of rechunking the whole input string into "punct" or "non-punct", we will use piped chunking to only apply chunk_punct() to the "bo" chunks.

Important: piped chunking is only possible when the content of the chunks is a tuple of ints.

#### b. Either "punct" or "bo"

In [20]:
bc.pipe_chunk(chunks, bc.chunk_punct, to_chunk=bc.BO_MARKER, yes=bc.PUNCT_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"tr"
5.	punct		" "
6.	bo		"བདེ་"
7.	punct		"་"
8.	bo		"ལེ གས"
9.	punct		"། "
10.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
11.	punct		"། "
12.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
13.	punct		"།། །།"


pipe_chunk() takes as arguments:
 - chunks: the chunks produced by a previous chunking method. They are expected to be a list of tuples, each containing three ints.
 - bc.chunk_punct: the chunking method we want to apply on top of the existing chunks
 - to_chunk: the marker that identifies which chunks should be further processed by the new chunking method
 - yes: the marker to be used on the new chunks that match the new chunking method. (the chunks that don't match simply retain their previous marker)

So to put it into simple English: 

                  "Within the existing chunks, I would like to take those marked as 'bo' 
                  and within them separate what is actual Tibetan text ('bo') 
                  from what is Tibetan punctuation ('punct')"
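The mechanics of piping can be sketched in a few lines. The code below is not BoChunk's `pipe_chunk()`: it is a self-contained approximation in which `chunk_by`, the two predicates and the explicit `text` argument are assumptions, while the marker values 100, 101 and 102 are those shown in the outputs above.

```python
BO, NON_BO, PUNCT = 100, 101, 102
SHAD = '\u0f0d'

def chunk_by(text, matches, start, end, yes, no):
    """Split text[start:end] into maximal runs that do / do not satisfy `matches`."""
    chunks, run_start = [], start
    for i in range(start + 1, end + 1):
        if i == end or matches(text[i]) != matches(text[run_start]):
            marker = yes if matches(text[run_start]) else no
            chunks.append((marker, run_start, i - run_start))
            run_start = i
    return chunks

def pipe_chunk(text, chunks, matches, to_chunk, yes):
    """Re-chunk, in place, only the chunks marked `to_chunk`."""
    piped = []
    for marker, start, length in chunks:
        if marker == to_chunk:
            # matching runs get `yes`, the rest retain their previous marker
            piped.extend(chunk_by(text, matches, start, start + length, yes, no=to_chunk))
        else:
            piped.append((marker, start, length))
    chunks[:] = piped

def is_bo(char):     # simplified: Tibetan block or whitespace
    return '\u0f00' <= char <= '\u0fff' or char.isspace()

def is_punct(char):  # simplified: shad or whitespace
    return char == SHAD or char.isspace()

text = 'ཀཀ' + SHAD + ' tr ' + 'གག' + SHAD
chunks = chunk_by(text, is_bo, 0, len(text), yes=BO, no=NON_BO)
pipe_chunk(text, chunks, is_punct, to_chunk=BO, yes=PUNCT)
readable = [(marker, text[s:s + l]) for marker, s, l in chunks]
print(readable)
```

The "non-bo" chunk is passed through untouched, which is the difference from applying the punctuation chunker to the whole string.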

#### c. Either "sym" or "bo"

In [21]:
bc.pipe_chunk(chunks, bc.chunk_symbol, to_chunk=bc.BO_MARKER, yes=bc.SYMBOL_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"tr"
5.	punct		" "
6.	bo		"བདེ་"
7.	punct		"་"
8.	bo		"ལེ གས"
9.	punct		"། "
10.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ"
11.	punct		"། "
12.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
13.	punct		"།། །།"


There were no symbols, so the chunks are unchanged.

#### d. Either "num" or "bo"

In [22]:
bc.pipe_chunk(chunks, bc.chunk_number, to_chunk=bc.BO_MARKER, yes=bc.NUMBER_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།

1.	punct		"༆ "
2.	bo		"ཤི་བཀྲ་ཤིས་"
3.	punct		"  "
4.	non-bo		"tr"
5.	punct		" "
6.	bo		"བདེ་"
7.	punct		"་"
8.	bo		"ལེ གས"
9.	punct		"། "
10.	bo		"བཀྲ་ཤིས་བདེ་ལེགས་"
11.	num		"༡༢༣"
12.	bo		"ཀཀ"
13.	punct		"། "
14.	bo		"མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་"
15.	punct		"།། །།"


                  "Within the existing chunks, I would like to take those marked as 'bo' 
                  and within them separate what is actual Tibetan text ('bo') 
                  from what is Tibetan digits ('num')"

#### e. Syllabify "bo" into "syl"

In [23]:
bc.pipe_chunk(chunks, bc.syllabify, to_chunk=bc.BO_MARKER, yes=bc.SYL_MARKER)

print('\t', input_str, end='\n\n')
for n, c in enumerate(bc.get_markers(bc.get_chunked(chunks))):
    print(f'{n+1}.\t{c[0]}\t\t"{c[1]}"')

	 ༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།

1.	punct		"༆ "
2.	syl		"ཤི་"
3.	syl		"བཀྲ་"
4.	syl		"ཤིས་"
5.	punct		"  "
6.	non-bo		"tr"
7.	punct		" "
8.	syl		"བདེ་"
9.	punct		"་"
10.	syl		"ལེ གས"
11.	punct		"། "
12.	syl		"བཀྲ་"
13.	syl		"ཤིས་"
14.	syl		"བདེ་"
15.	syl		"ལེགས་"
16.	num		"༡༢༣"
17.	syl		"ཀཀ"
18.	punct		"། "
19.	syl		"མཐའི་"
20.	syl		"རྒྱ་"
21.	syl		"མཚོར་"
22.	syl		"གནས་"
23.	syl		"པའི་"
24.	syl		"ཉས་"
25.	syl		"ཆུ་"
26.	syl		"འཐུང་"
27.	punct		"།། །།"


Here we are! We have successfully preprocessed the input string into a sequence of identified chunks that we can confidently use for any further NLP processing.

In order to have the exact same output as PyBoChunk, chunk 5 should be attached to chunk 4 and chunk 7 to chunk 6. Since this is implemented as a private method of PyBoChunk, we will stop this demonstration here.

# bostring.py

This is the foundational building block of all the preprocessing classes in pybo.

The idea behind it is to reproduce the way a Tibetan reader groups characters in a given text when reading it.
This is done simply by sorting all the characters found in the Tibetan Unicode tables into such groups.

In [24]:
from pybo import BoString

bs = BoString(input_str)

#### Character groups

Here are the groups we identified: 

In [25]:
print(f'Consonants:\n\t{list(bs.cons)}')
print(f'The subscripted variants:\n\t{list(bs.sub_cons)}\n')
print(f'Vowels:\n\t{list(bs.vow)}\n')
print(f'Tseks (regular and non-breaking):\n\t{list(bs.tsek)}\n')
print(f'Consonants specific to Sanskrit:\n\t{list(bs.skrt_cons)}\n')
print(f'The subscripted variants of Sanskrit consonants:\n\t{list(bs.skrt_sub_cons)}\n')
print(f'Vowels specific to Sanskrit:\n\t{list(bs.skrt_vow)}\n')
print(f'Sanskrit long vowel marker:\n\t{list(bs.skrt_long_vow)}\n')
print(f'Regular Tibetan punctuation:\n\t{list(bs.normal_punct)}\n')
print(f'Special Tibetan punctuation:\n\t{list(bs.special_punct)}\n')
print(f'Tibetan numerals:\n\t{list(bs.numerals)}\n')
print(f'Markers found inside Tibetan syllables:\n\t{list(bs.in_syl_marks)}\n')
print(f'Unicode Tibetan symbols:\n\t{list(bs.symbols)}\n')
print(f'Non-Tibetan and non-Sanskrit characters in the Tibetan tables:\n\t{list(bs.non_bo_non_skrt)}\n')
print(f'All the spaces found in the Unicode tables:\n\t{list(bs.spaces)}')

Consonants:
	['ཀ', 'ཁ', 'ག', 'ང', 'ཅ', 'ཆ', 'ཇ', 'ཉ', 'ཏ', 'ཐ', 'ད', 'ན', 'པ', 'ཕ', 'བ', 'མ', 'ཙ', 'ཚ', 'ཛ', 'ཝ', 'ཞ', 'ཟ', 'འ', 'ཡ', 'ར', 'ལ', 'ཤ', 'ས', 'ཧ', 'ཨ', 'ཪ']
The subscripted variants:
	['ྐ', 'ྒ', 'ྔ', 'ྕ', 'ྗ', 'ྙ', 'ྟ', 'ྡ', 'ྣ', 'ྤ', 'ྦ', 'ྨ', 'ྩ', 'ྫ', 'ྭ', 'ྱ', 'ྲ', 'ླ', 'ྷ']

Vowels:
	['ི', 'ུ', 'ེ', 'ོ']

Tseks (regular and non-breaking):
	['་', '༌']

Consonants specific to Sanskrit:
	['གྷ', 'ཊ', 'ཋ', 'ཌ', 'ཌྷ', 'ཎ', 'དྷ', 'བྷ', 'ཛྷ', 'ཥ', 'ཀྵ', '྅']

The subscripted variants of Sanskrit consonants:
	['ྑ', 'ྖ', 'ྠ', 'ྥ', 'ྪ', 'ྮ', 'ྯ', 'ྰ', 'ྴ', 'ྶ', 'ྸ', 'ྺ', 'ྻ', 'ྼ', 'ཱ', 'ྒྷ', 'ྚ', 'ྛ', 'ྜ', 'ྜྷ', 'ྞ', 'ྡྷ', 'ྦྷ', 'ྫྷ', 'ྵ', 'ྐྵ']

Vowels specific to Sanskrit:
	['ཱི', 'ཱུ', 'ྲྀ', 'ཷ', 'ླྀ', 'ཹ', 'ཻ', 'ཽ', 'ྀ', 'ཱྀ', 'ྂ', 'ྃ', '྄', '྆']

Sanskrit long vowel marker:
	['ཿ']

Regular Tibetan punctuation:
	['༄', '༅', '༆', '༈', '།', '༎', '༏', '༐', '༑', '༔', '༴', '༼', '༽']

Special Tibetan punctuation:
	['༁', '༂', '༃', '༒', '༇', '༉', '༊', '༺', '༻', '༾', '༿', '࿐', '࿑', '࿓', '࿔']

Tibetan n

Characters specific to Sanskrit were distinguished in order to identify the unambiguously Sanskrit syllables.

The long vowel marker has its own group because, when it is present at the end of a syllable, no tsek is required. It therefore also behaves like a syllable separator, and we need that information when syllabifying Tibetan text.

Regular Tibetan punctuation was separated from the rest on the assumption that this might be handy for things like normalizing the punctuation. Otherwise, we could simply have one group for punctuation.

Markers found inside syllables needed to be identified so that they can be ignored when comparing a word from the dictionary with the same word in its marked form.

Symbols have all been put together because we do not yet see use cases requiring finer groups.

All the space characters from the Unicode tables have been added to ensure we cover all possibilities, along with the tab character, because it is sometimes used instead of a space in Tibetan text.

We expect that as pybo is used for more and more things, unforeseen use cases will appear and these groups will need to be refined or adjusted, but doing so is straightforward.

For example, the group of Tibetan consonants is encoded in the following class attributes:
 - `self.cons = "ཀཁགངཅཆཇཉཏཐདནཔཕབམཙཚཛཝཞཟའཡརལཤསཧཨཪ"` (the actual characters)
 - `self.CONS = 1` (the group marker)
 - `self.char_markers = {self.CONS: 'cons', (...)}` (a human-readable replacement for the group marker)

#### BoString's output

The only thing BoString does is to loop once over the string and fill the `BoString.base_structure` dict with `{index: group_marker}` items for every character.
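The single pass can be pictured with the following sketch. It is not BoString's implementation: only four groups are included, the lookup table `char_to_marker` and the function `build_base_structure` are hypothetical, and the marker values other than `CONS = 1` are assumptions chosen to agree with the values printed by this notebook.

```python
CONS, VOW, TSEK, OTHER, SPACE = 1, 3, 4, 14, 15

groups = {
    CONS: 'ཀཁགངཅཆཇཉཏཐདནཔཕབམཙཚཛཝཞཟའཡརལཤསཧཨཪ',
    VOW: '\u0f72\u0f74\u0f7a\u0f7c',   # the four vowel signs
    TSEK: '\u0f0b\u0f0c',              # regular and non-breaking tsek
    SPACE: ' \t',                      # abbreviated list of spaces
}
char_markers = {CONS: 'cons', VOW: 'vow', TSEK: 'tsek', OTHER: 'other', SPACE: 'space'}

# Invert the groups once so that each character lookup is a single dict access.
char_to_marker = {char: marker for marker, chars in groups.items() for char in chars}

def build_base_structure(text):
    """Loop once over `text` and map every index to the marker of its group."""
    return {idx: char_to_marker.get(char, OTHER) for idx, char in enumerate(text)}

sample = 'ཤ' + '\u0f72' + '\u0f0b' + ' t'
structure = build_base_structure(sample)
print(structure)
print([char_markers[m] for m in structure.values()])
```

Any character that belongs to none of the groups falls back to `OTHER`, which is how the Latin letters end up marked as `other`.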

In [26]:
print(bs.base_structure)

{0: 8, 1: 15, 2: 1, 3: 3, 4: 4, 5: 1, 6: 1, 7: 2, 8: 4, 9: 1, 10: 3, 11: 1, 12: 4, 13: 15, 14: 15, 15: 14, 16: 14, 17: 15, 18: 1, 19: 1, 20: 3, 21: 4, 22: 4, 23: 1, 24: 3, 25: 15, 26: 1, 27: 1, 28: 8, 29: 15, 30: 1, 31: 1, 32: 2, 33: 4, 34: 1, 35: 3, 36: 1, 37: 4, 38: 1, 39: 1, 40: 3, 41: 4, 42: 1, 43: 3, 44: 1, 45: 1, 46: 4, 47: 9, 48: 9, 49: 9, 50: 1, 51: 1, 52: 8, 53: 15, 54: 1, 55: 1, 56: 1, 57: 3, 58: 4, 59: 1, 60: 2, 61: 2, 62: 4, 63: 1, 64: 1, 65: 3, 66: 1, 67: 4, 68: 1, 69: 1, 70: 1, 71: 4, 72: 1, 73: 1, 74: 3, 75: 4, 76: 1, 77: 1, 78: 4, 79: 1, 80: 3, 81: 4, 82: 1, 83: 1, 84: 3, 85: 1, 86: 4, 87: 8, 88: 8, 89: 15, 90: 8, 91: 8}


The same information in a human-readable form:

In [27]:
print(f'\n\t"{input_str}"\n')
for idx, g in bs.base_structure.items():
    character = input_str[idx]
    group = bs.char_markers[g]
    print(f'{idx+1}.\t{character}\t{group}')


	"༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ། མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།"

1.	༆	punct
2.	 	space
3.	ཤ	cons
4.	ི	vow
5.	་	tsek
6.	བ	cons
7.	ཀ	cons
8.	ྲ	sub-cons
9.	་	tsek
10.	ཤ	cons
11.	ི	vow
12.	ས	cons
13.	་	tsek
14.	 	space
15.	 	space
16.	t	other
17.	r	other
18.	 	space
19.	བ	cons
20.	ད	cons
21.	ེ	vow
22.	་	tsek
23.	་	tsek
24.	ལ	cons
25.	ེ	vow
26.	 	space
27.	ག	cons
28.	ས	cons
29.	།	punct
30.	 	space
31.	བ	cons
32.	ཀ	cons
33.	ྲ	sub-cons
34.	་	tsek
35.	ཤ	cons
36.	ི	vow
37.	ས	cons
38.	་	tsek
39.	བ	cons
40.	ད	cons
41.	ེ	vow
42.	་	tsek
43.	ལ	cons
44.	ེ	vow
45.	ག	cons
46.	ས	cons
47.	་	tsek
48.	༡	num
49.	༢	num
50.	༣	num
51.	ཀ	cons
52.	ཀ	cons
53.	།	punct
54.	 	space
55.	མ	cons
56.	ཐ	cons
57.	འ	cons
58.	ི	vow
59.	་	tsek
60.	ར	cons
61.	ྒ	sub-cons
62.	ྱ	sub-cons
63.	་	tsek
64.	མ	cons
65.	ཚ	cons
66.	ོ	vow
67.	ར	cons
68.	་	tsek
69.	ག	cons
70.	ན	cons
71.	ས	cons
72.	་	tsek
73.	པ	cons
74.	འ	cons
75.	ི	vow
76.	་	tsek
77.	ཉ	cons
78.	ས	cons
79.	་	tsek
80.	ཆ	cons
81.	ུ	vow
82.	་	

What might come in handy is the ability to export the groups for a portion of the input string:

In [28]:
start = 2
end = 5  # the output below has 5 items (indices 2 to 6), so this value acts as a length
sub_str = bs.export_groups(start, end)

# now fetching the human-readable markers
sub_str = {idx: bs.char_markers[group] for idx, group in sub_str.items()}
print(sub_str)

{0: 'cons', 1: 'vow', 2: 'tsek', 3: 'cons', 4: 'cons'}


Note that the indices in `sub_str` are adjusted for the substring.

In case we want to keep the original indices:

In [29]:
orig_indices = bs.export_groups(start, end, for_substring=False)

print({idx: bs.char_markers[group] for idx, group in orig_indices.items()})

{2: 'cons', 3: 'vow', 4: 'tsek', 5: 'cons', 6: 'cons'}
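The effect of `for_substring` can be summarized with a small sketch. This stand-alone `export_groups` function is hypothetical and takes the structure as an explicit argument. It assumes, judging from the two outputs above, that the second value is a number of characters rather than an end index.

```python
def export_groups(base_structure, start, length, for_substring=True):
    """Return the {index: marker} items for `length` characters starting at `start`."""
    offset = start if for_substring else 0  # re-base indices at 0 for a substring
    return {idx - offset: base_structure[idx] for idx in range(start, start + length)}

# First items of the base structure printed earlier in this notebook.
structure = {0: 8, 1: 15, 2: 1, 3: 3, 4: 4, 5: 1, 6: 1, 7: 2}

print(export_groups(structure, 2, 5))
# → {0: 1, 1: 3, 2: 4, 3: 1, 4: 1}
print(export_groups(structure, 2, 5, for_substring=False))
# → {2: 1, 3: 3, 4: 4, 5: 1, 6: 1}
```

Re-based indices are convenient when the exported groups are used together with the extracted substring, while the original indices keep a link to the full input string.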
