## What is Tokenization
Author: Ching Wen Yang

- This notebook focuses on **tokenization**, which is the first step of the compilation process.
<img src="https://i.imgur.com/CbNSzT4.png" alt="drawing" width="400"/>
- A `tokenizer`, also known as a `lexer`, splits source text into a stream of tokens.
- **IMPORTANT:** Tokenization fundamentally works on a stream of characters. The input does not need to be valid Python code; it can be just the beginning of a piece of Python code.

## References
- [Brown Water Python Documentation](https://www.asmeurer.com/brown-water-python/intro.html)
- [Nana's Compiler Study Notes (HackMD)](https://hackmd.io/nFUOy3eoQYyRLQh0VvSdRw)

### A working example

In [1]:
import tokenize
import io 
string = "print('Hello World')\n"
g = tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)
for token in g:
    print(token)
    print('string split:', token.string)

TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
string split: utf-8
TokenInfo(type=1 (NAME), string='print', start=(1, 0), end=(1, 5), line="print('Hello World')\n")
string split: print
TokenInfo(type=54 (OP), string='(', start=(1, 5), end=(1, 6), line="print('Hello World')\n")
string split: (
TokenInfo(type=3 (STRING), string="'Hello World'", start=(1, 6), end=(1, 19), line="print('Hello World')\n")
string split: 'Hello World'
TokenInfo(type=54 (OP), string=')', start=(1, 19), end=(1, 20), line="print('Hello World')\n")
string split: )
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 20), end=(1, 21), line="print('Hello World')\n")
string split: 

TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
string split: 


### A failed example
- The closing parenthesis is omitted, so tokenization ends with a `TokenError` once the input runs out.
- The input need not be semantically meaningful in any way. The input string, even if completed, can only raise a `TypeError` when executed, because `"a" + True` is not allowed by Python. **The tokenize module does not know or care about objects, types, or any high-level Python constructs.**

In [2]:
from tokenize import TokenError
string = '("a" + True -'
g = tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)  
print(g)
while True:
    try:
        token = next(g)
        print(token)
    except TokenError as e:
        print(e) 
        break


<generator object _tokenize at 0x7fd6c9d89cf0>
TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=54 (OP), string='(', start=(1, 0), end=(1, 1), line='("a" + True -')
TokenInfo(type=3 (STRING), string='"a"', start=(1, 1), end=(1, 4), line='("a" + True -')
TokenInfo(type=54 (OP), string='+', start=(1, 5), end=(1, 6), line='("a" + True -')
TokenInfo(type=1 (NAME), string='True', start=(1, 7), end=(1, 11), line='("a" + True -')
TokenInfo(type=54 (OP), string='-', start=(1, 12), end=(1, 13), line='("a" + True -')
('EOF in multi-line statement', (2, 0))


## `tokenize` vs. Alternatives
- To find or modify syntactic constructs in Python source code, we can use
    - a lexer, e.g. the `tokenize` module
    - an abstract syntax tree, e.g. the `ast` module
    - `re` (regular expressions): very hard to detect syntax correctly, hence skipped
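
For comparison with the lexer approach, the `ast` alternative can be sketched as follows. This is only an illustration: the helper name `function_line_ranges` and the `sample` string are made up for this sketch, and `end_lineno` assumes Python 3.8 or later. Unlike `tokenize`, `ast.parse` needs complete, syntactically valid source.

```python
import ast

def function_line_ranges(source):
    """Return (name, first_line, last_line) for every function definition."""
    tree = ast.parse(source)  # raises SyntaxError on incomplete code
    ranges = []
    for node in ast.walk(tree):
        if isinstance(node, (ast.FunctionDef, ast.AsyncFunctionDef)):
            # end_lineno is only populated on Python 3.8 and later.
            ranges.append((node.name, node.lineno, node.end_lineno))
    # ast.walk does not guarantee source order, so sort by first line.
    return sorted(ranges, key=lambda r: r[1])

sample = "def f(x):\n    pass\n\nclass MyClass:\n    def g(self):\n        pass\n"
print(function_line_ranges(sample))  # [('f', 1, 2), ('g', 5, 6)]
```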

### Use tokenizer to get the start line numbers of functions
We see that it easily finds the line on which each `def` keyword, and therefore each function definition, starts.
Also, it recognizes the triple-quoted string as a single `STRING` token and does not mistake the `def` inside it for a keyword.

![image.png](images/syntax_tool_comparison_table.jpeg)

In [3]:
import tokenize
import io
def line_numbers_tokenize(inputcode):
    for tok in tokenize.tokenize(io.BytesIO(inputcode.encode('utf-8')).readline):
        if tok.type == tokenize.NAME and tok.string == 'def':
            print(tok.start[0])

code = """\
def f(x):
    pass

class MyClass:
    def g(self):
        pass
"""

code_tricky = '''\
FUNCTION_SKELETON = """
def {name}({args}):
    {body}
"""
'''
line_numbers_tokenize(code)
print("----------")
line_numbers_tokenize(code_tricky)


1
5
----------
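
To illustrate why `re` was skipped, here is a rough sketch that runs a naive regular expression and the tokenizer over the same tricky input. The helper names `def_lines_regex` and `def_lines_tokenize` and the pattern are assumptions made for this example, not part of the notebook.

```python
import io
import re
import tokenize

def def_lines_regex(source):
    """Naive approach: any line that starts with 'def' counts as a function."""
    pattern = re.compile(r"^[ \t]*def\b", re.MULTILINE)
    return [source.count("\n", 0, m.start()) + 1 for m in pattern.finditer(source)]

def def_lines_tokenize(source):
    """Lexer approach: only NAME tokens spelled 'def' count."""
    readline = io.StringIO(source).readline
    return [tok.start[0] for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NAME and tok.string == "def"]

# The 'def' below lives inside a triple-quoted string, not in real code.
tricky = 'FUNCTION_SKELETON = """\ndef {name}({args}):\n    {body}\n"""\n'
print(def_lines_regex(tricky))     # [2]  (false positive)
print(def_lines_tokenize(tricky))  # []
```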


## Usage
### Calling Syntax 
`tokenize.tokenize()` requires the `readline` method of a bytes-mode file-like object.
A file opened in text mode (`r`) is not accepted by `tokenize.tokenize()`; use `tokenize.generate_tokens()` for text input instead.
To tokenize a string with `tokenize.tokenize()`, encode it to bytes, wrap the bytes in `io.BytesIO`, and pass its `readline` method.
- `tokenize.generate_tokens()` can be used to generate tokens directly from a `str`.
    - Do `tokenize.generate_tokens(io.StringIO(string).readline)` to tokenize a string.
    - It yields the same tokens as `tokenize.tokenize(io.BytesIO(string.encode('utf-8')).readline)`, except that there is no leading `ENCODING` token.
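
The difference between the two entry points can be checked with a short sketch. The variable names here are illustrative only; the comparison is limited to token type and string.

```python
import io
import tokenize

source = "x = 1\n"

# tokenize.tokenize() expects a readline callable that returns bytes ...
byte_tokens = list(tokenize.tokenize(io.BytesIO(source.encode("utf-8")).readline))
# ... while tokenize.generate_tokens() expects one that returns str.
text_tokens = list(tokenize.generate_tokens(io.StringIO(source).readline))

print(tokenize.tok_name[byte_tokens[0].type])  # ENCODING
print(tokenize.tok_name[text_tokens[0].type])  # NAME

# Apart from the leading ENCODING token, both streams carry the same tokens.
same = ([(t.type, t.string) for t in byte_tokens[1:]]
        == [(t.type, t.string) for t in text_tokens])
print(same)  # True
```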
### `TokenInfo` 
- `TokenInfo` is a named tuple with the following fields:
    - `type`: The type of the token. This is one of the token constants listed below. 
    - `string`: The token’s string representation (as in the source file). 
    - `start`: The starting (row, column) indices of the token. 
    - `end`: The ending (row, column) indices of the token. 
    - `line`: The line on which the token was found.
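
As a small sketch of how these fields are typically read (the sample source string is made up for this example), a `TokenInfo` can be accessed by attribute or unpacked like any other tuple:

```python
import io
import tokenize

tokens = list(tokenize.generate_tokens(io.StringIO("total = 41 + 1\n").readline))
first = tokens[0]

# Fields can be read by name ...
print(first.string, first.start, first.end)  # total (1, 0) (1, 5)

# ... or unpacked positionally, because TokenInfo is a named tuple.
tok_type, tok_string, start, end, line = first
print(tokenize.tok_name[tok_type], repr(line))  # NAME 'total = 41 + 1\n'
```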

In [5]:
# generate_tokens() 
for t in tokenize.generate_tokens(io.StringIO(code).readline):
    print(t) 
# untokenize()
string = b'sum([[1, 2]][0])'
tokenize.untokenize(tokenize.tokenize(io.BytesIO(string).readline))
b'sum([[1, 2]][0])'

TokenInfo(type=1 (NAME), string='def', start=(1, 0), end=(1, 3), line='def f(x):\n')
TokenInfo(type=1 (NAME), string='f', start=(1, 4), end=(1, 5), line='def f(x):\n')
TokenInfo(type=54 (OP), string='(', start=(1, 5), end=(1, 6), line='def f(x):\n')
TokenInfo(type=1 (NAME), string='x', start=(1, 6), end=(1, 7), line='def f(x):\n')
TokenInfo(type=54 (OP), string=')', start=(1, 7), end=(1, 8), line='def f(x):\n')
TokenInfo(type=54 (OP), string=':', start=(1, 8), end=(1, 9), line='def f(x):\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(1, 9), end=(1, 10), line='def f(x):\n')
TokenInfo(type=5 (INDENT), string='    ', start=(2, 0), end=(2, 4), line='    pass\n')
TokenInfo(type=1 (NAME), string='pass', start=(2, 4), end=(2, 8), line='    pass\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 8), end=(2, 9), line='    pass\n')
TokenInfo(type=61 (NL), string='\n', start=(3, 0), end=(3, 1), line='\n')
TokenInfo(type=6 (DEDENT), string='', start=(4, 0), end=(4, 0), line='class MyClass

b'sum([[1, 2]][0])'

### Exceptions 
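- `SyntaxError` is raised when the input has an invalid encoding.
    - `detect_encoding` determines the encoding from a UTF-8 BOM or from an encoding declaration comment (for example `# -*- coding: utf-8 -*-`) in the first two lines.
    - This is where `tokenize()` differs from `generate_tokens()`; the latter reads `str` input and does not act on the encoding declaration.
- `TokenError` is raised when the input ends in the middle of a multi-line construct.
    - 2 scenarios
        - EOF in multi-line string. (Unterminated triple-quoted string, e.g. an unclosed docstring)
        - EOF in multi-line statement. (Unclosed parentheses, brackets, or braces)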
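
The scenarios above can be reproduced with a minimal sketch. The helper `tokenize_outcome` and the codec name `no-such-codec` are invented for this illustration; the exact error messages differ between Python versions, so only the exception class is reported.

```python
import io
import tokenize

def tokenize_outcome(source_bytes):
    """Return 'ok' or the name of the exception class raised while tokenizing."""
    try:
        # list() consumes the whole input inside the try block, so errors
        # raised lazily by the generator are caught as well.
        list(tokenize.tokenize(io.BytesIO(source_bytes).readline))
    except SyntaxError:
        return "SyntaxError"
    except tokenize.TokenError:
        return "TokenError"
    return "ok"

# Unknown codec in the encoding declaration.
print(tokenize_outcome(b"# -*- coding: no-such-codec -*-\nx = 1\n"))
# Triple-quoted string that is never closed.
print(tokenize_outcome(b'"""never closed\n'))
# Parenthesis that is never closed.
print(tokenize_outcome(b"(1 + 2\n"))
# Well-formed input.
print(tokenize_outcome(b"x = 1\n"))
```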

## Token Types
- The `tokenize.tok_name` dictionary maps numeric token types to their names.
### The Tokens 
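- `ENDMARKER`: marks the end of the input; it is always the last token emitted, unless tokenization stops early with an error.
- `NAME`: identifiers and keywords; to tell them apart, use `keyword.iskeyword()`.
- `NUMBER`: integers, floats, and imaginary literals (including binary, octal, and hexadecimal literals, floating-point numbers, etc.)
   - The token retains the original spelling of the literal, as the `print_tokens()` output below shows.
   - `ast`, in contrast, stores the evaluated value, so the original spelling is lost: `ast.dump(ast.parse("1.0000000000000001"))` shows the constant `1.0`.
   - Note examples such as `123_456`: Python 3.6 and later tokenize it as a single `NUMBER`, while earlier versions split it into `NUMBER` (`123`) and `NAME` (`_456`). See [link](https://www.asmeurer.com/brown-water-python/tokens.html#number).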
- `STRING`: single and double quotes, triple quotes (docstrings), f-strings, bytes, raw strings, unicode strings, etc. 
- `NEWLINE`: Represents `\n` or `\r\n` that ends a logical line of Python code. `NEWLINE` that does not end a logical line of Python code uses the `NL` token type.
- `INDENT`: Represents an increase in indentation level. The token's `string` is the indentation whitespace itself.
   - Indentation can be any number of spaces or tabs.
   - Every dedent must return to a previous outer indentation level, otherwise an `IndentationError` is raised.
- `RARROW` and `ELLIPSIS`
   - Both are tokenized as `OP`.
- `OP`
   - A generic token type for all operators, delimiters, and the ellipsis literal.
   - Characters that the tokenizer does not recognize are emitted as `ERRORTOKEN` instead.
   - The exact operator type is available through the `exact_type` attribute of `TokenInfo`. It is one of: `LPAR`, `RPAR`, `LSQB`, `RSQB`, `COLON`, `COMMA`, `SEMI`, `PLUS`, `MINUS`, `STAR`, `SLASH`, `VBAR`, `AMPER`, `LESS`, `GREATER`, `EQUAL`, `DOT`, `PERCENT`, `LBRACE`, `RBRACE`, `EQEQUAL`, `NOTEQUAL`, `LESSEQUAL`, `GREATEREQUAL`, `TILDE`, `CIRCUMFLEX`, `LEFTSHIFT`, `RIGHTSHIFT`, `DOUBLESTAR`, `PLUSEQUAL`, `MINEQUAL`, `STAREQUAL`, `SLASHEQUAL`, `PERCENTEQUAL`, `AMPEREQUAL`, `VBAREQUAL`, `CIRCUMFLEXEQUAL`, `LEFTSHIFTEQUAL`, `RIGHTSHIFTEQUAL`, `DOUBLESTAREQUAL`, `DOUBLESLASH`, `DOUBLESLASHEQUAL`, `AT`, `ATEQUAL`, `RARROW`, `ELLIPSIS`, `COLONEQUAL`, `OP`
- `AWAIT`, `ASYNC`
   - `AWAIT` and `ASYNC` tokens are only produced in Python 3.5 and 3.6.
   - They are pseudo-keywords, introduced to ease the transition while `async` and `await` were being added as new keywords.
   - Both remain valid variable names OUTSIDE an `async def` function.
   - In Python 3.7, `async` and `await` were promoted to proper keywords, and `tokenize` no longer emits `ASYNC` or `AWAIT` tokens.
   - Make sure you set the `feature_version` flag for `ast.parse` to parse code written for those older versions correctly.
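- `TYPE_IGNORE`, `TYPE_COMMENT`: token types for type comments (`# type: ...`); they are only produced when `ast.parse()` is called with `type_comments=True`.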
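
Since keywords and identifiers share the `NAME` type, telling them apart takes one extra step. A minimal sketch, where the helper name `classify_names` is made up for this example:

```python
import io
import keyword
import tokenize

def classify_names(source):
    """Label every NAME token as 'keyword' or 'identifier'."""
    labels = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.NAME:
            kind = "keyword" if keyword.iskeyword(tok.string) else "identifier"
            labels.append((tok.string, kind))
    return labels

print(classify_names("for item in items: pass\n"))
# [('for', 'keyword'), ('item', 'identifier'), ('in', 'keyword'),
#  ('items', 'identifier'), ('pass', 'keyword')]
```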

In [6]:
def print_tokens(s):
    for t in tokenize.tokenize(io.BytesIO(s.encode('utf-8')).readline):
        print(t) 

In [8]:
number_str = '1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n'
print_tokens(number_str)

TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=2 (NUMBER), string='1.0', start=(1, 0), end=(1, 3), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 3), end=(1, 4), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=2 (NUMBER), string='0b101', start=(1, 4), end=(1, 9), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 10), end=(1, 11), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=2 (NUMBER), string='0o10', start=(1, 12), end=(1, 16), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 17), end=(1, 18), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=2 (NUMBER), string='0xa', start=(1, 19), end=(1, 22), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=54 (OP), string='+', start=(1, 23), end=(1, 24), line='1.0+0b101 + 0o10 + 0xa + 1e1 + 100j\n')
TokenInfo(type=2 (NUMB
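
The contrast between `tokenize` and `ast` for number literals can be sketched as below. The helper names and the `literals` string are assumptions made for this illustration; `ast.Constant` assumes Python 3.8 or later.

```python
import ast
import io
import tokenize

def number_spellings(source):
    """Return the source text of every NUMBER token, exactly as written."""
    readline = io.StringIO(source).readline
    return [tok.string for tok in tokenize.generate_tokens(readline)
            if tok.type == tokenize.NUMBER]

def number_values(source):
    """Return the evaluated constants that ast stores, in source order."""
    constants = [node for node in ast.walk(ast.parse(source))
                 if isinstance(node, ast.Constant)]
    # ast.walk is not in source order, so sort by column on this one-line input.
    return [node.value for node in sorted(constants, key=lambda n: n.col_offset)]

literals = "0x10 + 1e1 + 1.0000000000000001\n"
print(number_spellings(literals))  # ['0x10', '1e1', '1.0000000000000001']
print(number_values(literals))     # [16, 10.0, 1.0]
```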

In [9]:
# INDENT 
print_tokens("""
1
    2
    3
        4
5

""")


TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=61 (NL), string='\n', start=(1, 0), end=(1, 1), line='\n')
TokenInfo(type=2 (NUMBER), string='1', start=(2, 0), end=(2, 1), line='1\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(2, 1), end=(2, 2), line='1\n')
TokenInfo(type=5 (INDENT), string='    ', start=(3, 0), end=(3, 4), line='    2\n')
TokenInfo(type=2 (NUMBER), string='2', start=(3, 4), end=(3, 5), line='    2\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(3, 5), end=(3, 6), line='    2\n')
TokenInfo(type=2 (NUMBER), string='3', start=(4, 4), end=(4, 5), line='    3\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(4, 5), end=(4, 6), line='    3\n')
TokenInfo(type=5 (INDENT), string='        ', start=(5, 0), end=(5, 8), line='        4\n')
TokenInfo(type=2 (NUMBER), string='4', start=(5, 8), end=(5, 9), line='        4\n')
TokenInfo(type=4 (NEWLINE), string='\n', start=(5, 9), end=(5, 10), line='        4\n')
TokenInfo(ty
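
One way to read the `INDENT` and `DEDENT` tokens is as a running nesting depth. The following is a rough sketch under that reading; the helper name `indentation_depths` is invented for this example.

```python
import io
import tokenize

def indentation_depths(source):
    """Pair every NUMBER token with the indentation depth it appears at."""
    depth = 0
    depths = []
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.INDENT:
            depth += 1          # one INDENT per new, deeper level
        elif tok.type == tokenize.DEDENT:
            depth -= 1          # one DEDENT per level that is closed
        elif tok.type == tokenize.NUMBER:
            depths.append((tok.string, depth))
    return depths

print(indentation_depths("1\n    2\n    3\n        4\n5\n"))
# [('1', 0), ('2', 1), ('3', 1), ('4', 2), ('5', 0)]
```

Line `5` drops two levels at once, so the tokenizer emits two `DEDENT` tokens before it.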

In [10]:
string = "X @ Y"
print_tokens(string)

TokenInfo(type=62 (ENCODING), string='utf-8', start=(0, 0), end=(0, 0), line='')
TokenInfo(type=1 (NAME), string='X', start=(1, 0), end=(1, 1), line='X @ Y')
TokenInfo(type=54 (OP), string='@', start=(1, 2), end=(1, 3), line='X @ Y')
TokenInfo(type=1 (NAME), string='Y', start=(1, 4), end=(1, 5), line='X @ Y')
TokenInfo(type=4 (NEWLINE), string='', start=(1, 5), end=(1, 6), line='')
TokenInfo(type=0 (ENDMARKER), string='', start=(2, 0), end=(2, 0), line='')
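
The `@` above is reported with the generic `OP` type. A short sketch of recovering the specific operator through `exact_type` follows; the helper name `exact_op_names` is made up for this example.

```python
import io
import tokenize

def exact_op_names(source):
    """Return the exact type name of every OP token in the source."""
    return [tokenize.tok_name[tok.exact_type]
            for tok in tokenize.generate_tokens(io.StringIO(source).readline)
            if tok.type == tokenize.OP]

print(exact_op_names("X @ Y\n"))       # ['AT']
print(exact_op_names("x = a ** 2\n"))  # ['EQUAL', 'DOUBLESTAR']
```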
