## What is Tokenization
Author: Ching Wen Yang

- This notebook focuses on **tokenization**, the first step of the compilation process.
<img src="https://i.imgur.com/CbNSzT4.png" alt="drawing" width="400"/>
- A `tokenizer`, also known as a `lexer`, splits source code into a sequence of tokens such as names, operators, and string literals.
- **IMPORTANT:** Tokenization fundamentally works on a stream of characters. The input does not need to be complete, valid Python code; it can be just the beginning of valid Python code.

## References
- [Brown Water Python Documentation](https://www.asmeurer.com/brown-water-python/intro.html)
- [Nana's Compiler Study Notes (HackMD)](https://hackmd.io/nFUOy3eoQYyRLQh0VvSdRw)

### A working example

In [14]:
import tokenize
import io 
string = "print('Hello World')\n"
g = tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)
for token in g:
    print(token)
    print('string split:', token.string)

TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
string split: utf-8
TokenInfo(type=1 (NAME), string='print', start=(1, 0), end=(1, 5), line="print('Hello World')\n")
string split: print
TokenInfo(type=54 (OP), string='(', start=(1, 5), end=(1, 6), line="print('Hello World')\n")
string split: (
TokenInfo(type=3 (STRING), string="'Hello World'", start=(1, 6), end=(1, 19), line="print('Hello World')\n")
string split: 'Hello World'
TokenInfo(type=54 (OP), string=')', start=(1, 19), end=(1, 20), line="print('Hello World')\n")
string split: )
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 20), end=(1, 21), line="print('Hello World')\n")
string split: 

TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
string split: 


### A failing example
- The input below omits the closing parenthesis, so the tokenizer raises a `TokenError` when it reaches the end of the input.
- The input need not be semantically meaningful in any way. Even if completed, the input string would only raise a `TypeError` at run time, because `"a" + True` is not allowed by Python. **The tokenize module does not know or care about objects, types, or any high-level Python constructs.**

In [23]:
from tokenize import TokenError
string = '("a" + True -'
g = tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)  
print(g)
# Iterate until the tokenizer reports the unclosed parenthesis.
try:
    for token in g:
        print(token)
except TokenError as e:
    print(e)


<generator object _tokenize at 0x7f6e8f4572e0>
TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='("a" + True -')
TokenInfo(type=3 (STRING), string='"a"', start=(1, 1), end=(1, 4), line='("a" + True -')
TokenInfo(type=54 (OP), string='+', start=(1, 5), end=(1, 6), line='("a" + True -')
TokenInfo(type=1 (NAME), string='True', start=(1, 7), end=(1, 11), line='("a" + True -')
TokenInfo(type=54 (OP), string='-', start=(1, 12), end=(1, 13), line='("a" + True -')
('EOF in multi-line statement', (2, 0))
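
To illustrate the second bullet above, here is a minimal sketch that tokenizes the completed input. The variable names (`source`, `token_strings`, `runtime_error`) are chosen for illustration only. Under these assumptions, tokenization succeeds, and the error only appears once Python actually evaluates the expression.

```python
import io
import tokenize

# The same expression as above, but with the closing parenthesis added.
source = '("a" + True)\n'

# Tokenizing succeeds: the tokenizer only looks at characters, not types.
tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))
token_strings = [tok.string for tok in tokens]
print(token_strings)

# Evaluating the same text fails, because str + bool is not allowed.
try:
    eval(source.strip())
    runtime_error = None
except TypeError as exc:
    runtime_error = exc
print("evaluation failed with:", type(runtime_error).__name__)
```

The tokenizer and the interpreter disagree here by design: only the latter knows what `"a"` and `True` are.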


## `tokenize` vs. Alternatives
- To find or modify syntactic constructs in Python source code, there are three common tools:
    - a lexer, e.g. the `tokenize` module
    - an abstract syntax tree parser, e.g. the `ast` module
    - regular expressions (`re`): it is very hard to detect syntax correctly with them, so they are skipped here

### Use the tokenizer to get the line numbers of function definitions
The function below prints the line number on which each `def` keyword appears.
It also recognizes a string literal as a single `STRING` token, so a `def` that appears inside a string is not mistaken for a keyword and is not reported.
![image.png](images/syntax_tool_comparison_table.jpeg)

In [25]:
import tokenize
import io
def line_numbers_tokenize(inputcode):
    for tok in tokenize.tokenize(io.BytesIO(inputcode.encode('utf-8')).readline):
        if tok.type == tokenize.NAME and tok.string == 'def':
            print(tok.start[0])

code = """\
def f(x):
    pass

class MyClass:
    def g(self):
        pass
"""

code_tricky = '''\
FUNCTION_SKELETON = """
def {name}({args}):
    {body}
"""
'''
line_numbers_tokenize(code)
print("----------")
line_numbers_tokenize(code_tricky)


1
5
----------
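
For comparison, here is a rough sketch of the `ast` alternative listed earlier, which can report both the first and the last line of each function. The helper name `def_line_spans` is hypothetical, and the sketch assumes Python 3.8 or later, where AST nodes carry `end_lineno`. Unlike `tokenize`, `ast.parse` requires the input to be complete, valid Python.

```python
import ast

def def_line_spans(source):
    """Return (name, first_line, last_line) for each function definition."""
    tree = ast.parse(source)
    spans = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            spans.append((node.name, node.lineno, node.end_lineno))
    # ast.walk does not promise source order, so sort by starting line.
    return sorted(spans, key=lambda span: span[1])

code = """\
def f(x):
    pass

class MyClass:
    def g(self):
        pass
"""

print(def_line_spans(code))
```

A `def` inside a string literal is parsed as part of a string constant, so it produces no function node here either.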


## Usage
### Calling Syntax 
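`tokenize.tokenize` requires the `readline` method of a file-like object opened in bytes mode (`rb`).
Text mode (`r`) is not supported, because `tokenize.tokenize` detects the source encoding itself from the raw bytes.
To tokenize a string, encode it to bytes, wrap the bytes in `io.BytesIO`, and pass the resulting object's `readline` method to `tokenize.tokenize`.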
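
The sketch below shows the bytes-based call described above next to `tokenize.generate_tokens`, the standard-library variant that accepts a `readline` returning `str`. The variable names are illustrative. The only difference expected between the two results is the leading `ENCODING` token, which the bytes-based API emits and the text-based one does not.

```python
import io
import tokenize

source = "x = 1\n"

# Bytes-based API: encode the string and wrap it in io.BytesIO.
from_bytes = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

# Text-based API: generate_tokens() expects readline to return str.
from_text = list(tokenize.generate_tokens(io.StringIO(source).readline))

# The bytes-based stream starts with an ENCODING token.
print(tokenize.tok_name[from_bytes[0].type], from_bytes[0].string)

# Apart from that first token, both streams carry the same tokens.
bytes_pairs = [(tok.type, tok.string) for tok in from_bytes[1:]]
text_pairs = [(tok.type, tok.string) for tok in from_text]
print(bytes_pairs == text_pairs)
```

For a file on disk, the same bytes-based call works with the `readline` method of a file opened with `open(path, "rb")`.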

### `TokenInfo` 
- `TokenInfo` is a named tuple with the following fields:
    - `type`: The type of the token. This is one of the constants defined in the `token` module, such as `NAME`, `OP`, or `STRING`.
    - `string`: The token's string representation (as in the source file).
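    - `start`: The `(row, column)` position where the token starts. Rows are counted from 1 and columns from 0.
    - `end`: The `(row, column)` position where the token ends. The column is one past the token's last character.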
    - `line`: The line on which the token was found.
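
As a small sketch of how these fields can be read, the loop below prints each token of a one-line input. It uses `tokenize.tok_name` to turn numeric types into names, and the `exact_type` attribute to distinguish individual operators. The input `x = 1` and the variable names are illustrative.

```python
import io
import tokenize

source = "x = 1\n"
tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))

for tok in tokens:
    type_name = tokenize.tok_name[tok.type]         # e.g. NAME, OP, NUMBER
    exact_name = tokenize.tok_name[tok.exact_type]  # e.g. EQUAL for "="
    print(f"{type_name:<10} {exact_name:<10} {tok.string!r:<8} {tok.start} -> {tok.end}")

# TokenInfo is a named tuple, so it can also be unpacked positionally.
# tokens[0] is the ENCODING token; tokens[1] is the name "x".
tok_type, tok_string, start, end, line = tokens[1]
```

Attribute access (`tok.string`) and tuple unpacking are interchangeable; attribute access is usually easier to read.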
