# PyBo's Tokenizer

### 0. Running the tokenizer

In [1]:
from pybo import BoTokenizer

Instantiate the tokenizer with the 'POS' profile (see [profile documentation](this.file)).

In [2]:
tokenizer = BoTokenizer('POS')

Loading Trie...
Time: 2.5575690269470215


Given an arbitrary text in Tibetan,

In [3]:
input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།'

Let's see what information can be derived from it.

In [4]:
tokens = tokenizer.tokenize(input_str)
print(f'The output is a {type(tokens)}.\nThe constituting elements are {type(tokens[0])}s.')

The output is a <class 'list'>.
The constituting elements are <class 'pybo.token.Token'>s.


### 1. A first look

#### Non-Tibetan tokens

First, I see there is non-Tibetan content in the middle of the input string. Let's see how I can detect it.

In [16]:
for n, token in enumerate(tokens):
    if token.type == 'non-bo':
        content = token.content
        print(f'"{content}", token number {n+1}, is not Tibetan.')
        start = token.start
        length = token.len
        print(f'this starts at {start}th character in the input and spans {length} characters')

"tr", token number 4, is not Tibetan.
this starts at 15th character in the input and spans 2 characters
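The reported `start` and `len` values can be checked against the input directly. The sketch below does not call pybo: it hard-codes the two values printed above and assumes that `start` is a zero-based offset counted in Unicode code points, which is what the output suggests.

```python
# Input string copied from the cell above.
input_str = '༆ ཤི་བཀྲ་ཤིས་  tr བདེ་་ལེ གས། བཀྲ་ཤིས་བདེ་ལེགས་༡༢༣ཀཀ མཐའི་རྒྱ་མཚོར་གནས་པའི་ཉས་ཆུ་འཐུང་།། །།'

# Values printed for the non-Tibetan token (token.start and token.len).
start, length = 15, 2

# Assumption: `start` is a zero-based code-point offset into input_str,
# so a plain Python slice should recover the token content.
span = input_str[start:start + length]
print(repr(span))  # 'tr'
```

If the assumption holds for every token, the same slice with `token.start` and `token.len` should give back `token.content`.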


#### Tokens that are not words

Is there any Tibetan punctuation?

In [6]:
for n, token in enumerate(tokens):
    if token.type == 'punct':
        content = token.content
        print(f'"{content}", token number {n+1}, is a punctuation token.')

"༆", token number 1, is a punctuation token.
"།", token number 6, is a punctuation token.
"།། ", token number 19, is a punctuation token.
"།།", token number 20, is a punctuation token.


How are the Tibetan digits treated?

In [7]:
for n, token in enumerate(tokens):
    if token.type == 'num':
        content = token.content
        print(f'"{content}", token number {n+1}, is a numeral.')

"༡༢༣", token number 9, is a numeral.


### 2. The attributes of tokens

Strictly speaking, a token is a word that has been correctly extracted from the input string, but our Token objects carry much more information, ready to be exploited by downstream NLP processing:

#### Token.content: the unmodified content straight from the input string

In [21]:
for n, token in enumerate(tokens):
    print(f'token {n+1}: "{token.content}"')

token 1: "༆"
token 2: " ཤི་"
token 3: "བཀྲ་ཤིས་  "
token 4: "tr"
token 5: " བདེ་་ལེ གས"
token 6: "།"
token 7: " བཀྲ་ཤིས་"
token 8: "བདེ་ལེགས་"
token 9: "༡༢༣"
token 10: "ཀཀ མཐའི་"
token 11: "རྒྱ་མཚོ"
token 12: "ར་"
token 13: "གནས་པ"
token 14: "འི་"
token 15: "ཉ"
token 16: "ས་"
token 17: "ཆུ་"
token 18: "འཐུང་"
token 19: "།། "
token 20: "།།"
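Because the contents are unmodified and, in this output, appear to follow one another without gaps, each token's start offset can be recomputed as the running sum of the preceding lengths. The sketch below is only an illustration: it uses plain strings copied from the first four tokens above rather than real `Token` objects, and the contiguity of tokens is an assumption drawn from this one example.

```python
from itertools import accumulate

# Contents of the first four tokens, copied from the listing above.
contents = ['༆', ' ཤི་', 'བཀྲ་ཤིས་  ', 'tr']

# Running sum of lengths: the start of token i is the total length
# of tokens 0..i-1 (assuming tokens cover the input without gaps).
lengths = [len(c) for c in contents]
starts = [0] + list(accumulate(lengths))[:-1]

for n, (content, start) in enumerate(zip(contents, starts), start=1):
    print(f'token {n}: starts at {start}, spans {len(content)} characters')
```

For token 4 ("tr") this gives a start of 15 and a length of 2, which agrees with the `start` and `len` values printed in section 1.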


#### Token.type: the basic types of tokens 

In [22]:
for n, token in enumerate(tokens):
    print(f'{token.type}\t: token {n+1}("{token.content}")')

punct	: token 1("༆")
syl	: token 2(" ཤི་")
syl	: token 3("བཀྲ་ཤིས་  ")
non-bo	: token 4("tr")
syl	: token 5(" བདེ་་ལེ གས")
punct	: token 6("།")
syl	: token 7(" བཀྲ་ཤིས་")
syl	: token 8("བདེ་ལེགས་")
num	: token 9("༡༢༣")
syl	: token 10("ཀཀ མཐའི་")
syl	: token 11("རྒྱ་མཚོ")
syl	: token 12("ར་")
syl	: token 13("གནས་པ")
syl	: token 14("འི་")
syl	: token 15("ཉ")
syl	: token 16("ས་")
syl	: token 17("ཆུ་")
syl	: token 18("འཐུང་")
punct	: token 19("།། ")
punct	: token 20("།།")


(Note: the output for token 10 is affected by a bug.)

legend:
 - syl: contains valid Tibetan syllables
 - num: Tibetan numerals
 - punct: Tibetan punctuation
 - non-bo: non-Tibetan content
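A quick way to summarise a tokenized text is to count how many tokens fall into each type. The sketch below is a minimal illustration under stated assumptions: `FakeToken` and `count_types` are hypothetical names, not part of pybo, and the type values are copied by hand from the 20-token listing above. The helper only relies on a `.type` attribute, so it should also accept the real `tokens` list.

```python
from collections import Counter, namedtuple

# Hypothetical stand-in for pybo's Token; only `.type` is modelled here.
FakeToken = namedtuple('FakeToken', ['type'])


def count_types(tokens):
    """Count how many tokens carry each value of `.type`."""
    return Counter(token.type for token in tokens)


# Types copied, in order, from the 20-token listing above.
types = (['punct', 'syl', 'syl', 'non-bo', 'syl', 'punct', 'syl', 'syl', 'num']
         + ['syl'] * 9
         + ['punct', 'punct'])

counts = count_types(FakeToken(t) for t in types)
for token_type, n in counts.most_common():
    print(f'{token_type}\t: {n}')
```

With the notebook's own objects, the call would simply be `count_types(tokens)`.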