
### **Preparation**
- Install PyHealth from PyPI (the run below installed version 1.1.3).

In [None]:
!pip install pyhealth

Looking in indexes: https://pypi.org/simple, https://us-python.pkg.dev/colab-wheels/public/simple/
Collecting pyhealth
  Downloading pyhealth-1.1.3-py2.py3-none-any.whl (113 kB)
Collecting rdkit>=2022.03.4
  Downloading rdkit-2022.9.4-cp38-cp38-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (29.3 MB)
Installing collected packages: rdkit, pyhealth
Successfully installed pyhealth-1.1.3 rdkit-2022.9.4


### **Instruction on [pyhealth.tokenizer](https://pyhealth.readthedocs.io/en/latest/api/tokenizer.html)**
- **[README]**: The tokenizer converts between string tokens and integer indices over a fixed vocabulary (the token space). It provides functions for tokenizing 1D, 2D, and 3D lists. **The module is not tied to medical codes and can be reused wherever such a token-to-index mapping is needed.**

- **[Arguments]**:
  - `tokens`: List of tokens in the vocabulary.
  - `special_tokens`: List of special tokens to add to the vocabulary (e.g., `<pad>`, `<unk>`). If not provided, no special tokens are added.

- **[Functionality]**:
  - `get_vocabulary_size`: Return the size of the vocabulary
  - `convert_tokens_to_indices`: 1d conversion from tokens to indices
  - `convert_indices_to_tokens`: 1d conversion from indices to tokens
  - `batch_encode_2d`: 2d conversion from tokens to indices
  - `batch_decode_2d`: 2d conversion from indices to tokens
  - `batch_encode_3d`: 3d conversion from tokens to indices
  - `batch_decode_3d`: 3d conversion from indices to tokens
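
To make the token/index mapping concrete, here is a minimal pure-Python sketch. `ToyVocabulary` is a hypothetical class written for this tutorial, not PyHealth's implementation; it only assumes what the examples below show, namely that special tokens take the first indices and that unseen tokens map to `<unk>`.

```python
# Illustrative sketch only; this is NOT PyHealth's implementation.
class ToyVocabulary:
    """Hypothetical stand-in for a token <-> index vocabulary."""

    def __init__(self, tokens, special_tokens=None):
        # Special tokens are placed first, so "<pad>" -> 0 and "<unk>" -> 1.
        ordered = list(special_tokens or []) + list(tokens)
        self.token2idx = {}
        self.idx2token = {}
        for token in ordered:
            if token not in self.token2idx:  # skip duplicates
                index = len(self.token2idx)
                self.token2idx[token] = index
                self.idx2token[index] = token

    def __len__(self):
        return len(self.token2idx)

    def encode(self, tokens):
        if "<unk>" in self.token2idx:
            unk = self.token2idx["<unk>"]
            return [self.token2idx.get(token, unk) for token in tokens]
        # Without "<unk>", an unseen token raises KeyError in this sketch.
        return [self.token2idx[token] for token in tokens]

    def decode(self, indices):
        return [self.idx2token[index] for index in indices]


vocab = ToyVocabulary(["A01A", "A02A"], special_tokens=["<pad>", "<unk>"])
print(len(vocab))                      # 4
print(vocab.encode(["A02A", "B035"]))  # [3, 1]
print(vocab.decode([0, 1, 2]))         # ['<pad>', '<unk>', 'A01A']
```

The vocabulary size counts the special tokens, which is why the 42 ATC3 codes used in the examples give a size of 44.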

### **Example 1: 1D tokenization**
- This example converts a 1D list between tokens and indices.
- We use two special tokens, `["<pad>", "<unk>"]`: `<pad>` pads sequences in 2D and 3D encoding and decoding, and `<unk>` stands in for tokens that are not in the vocabulary. `<pad>` plays no role in 1D tokenization.

In [None]:
from pyhealth.tokenizer import Tokenizer

# use a list of ATC3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
               'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
               'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
               'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
               'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']

# initialize the tokenizer
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
print(tokenizer.get_vocabulary_size())

44


In [None]:
# 1d encode: 'B035' and 'C129' are not in token_space, so they map to <unk> (index 1)
tokens = ['A03C', 'A03D', 'A03E', 'A03F', 'A04A', 'A05A', 'A05B', 'B035', 'C129']
indices = tokenizer.convert_tokens_to_indices(tokens)
print(indices)

[8, 9, 10, 11, 12, 13, 14, 1, 1]
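
Because every out-of-vocabulary token collapses to the same `<unk>` index, the original code cannot be recovered after encoding. A small pre-check like the one sketched here can flag such tokens beforehand. `find_unknown_tokens` is a hypothetical helper for illustration, not part of `pyhealth.tokenizer`.

```python
def find_unknown_tokens(tokens, vocabulary):
    """Return the tokens that are absent from the vocabulary, in input order.

    Hypothetical helper, not a PyHealth function.
    """
    known = set(vocabulary)  # set lookup keeps the check O(1) per token
    return [token for token in tokens if token not in known]


small_vocabulary = ['A03C', 'A03D', 'A04A']
sample = ['A03C', 'B035', 'A04A', 'C129']
print(find_unknown_tokens(sample, small_vocabulary))  # ['B035', 'C129']
```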


In [None]:
# 1d decode
indices = [0, 1, 2, 3, 4, 5]
tokens = tokenizer.convert_indices_to_tokens(indices)
print(tokens)

['<pad>', '<unk>', 'A01A', 'A02A', 'A02B', 'A02X']


### **Example 2: 2D tokenization**
- This example converts a 2D list (a batch of token lists) between tokens and indices.
- We use two special tokens, `["<pad>", "<unk>"]`: `<pad>` pads shorter lists to a common length, and `<unk>` stands in for tokens that are not in the vocabulary.

In [None]:
# Tokenizer was already imported in Example 1:
# from pyhealth.tokenizer import Tokenizer

# use a list of ATC3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
               'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
               'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
               'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
               'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']

# initialize the tokenizer
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
print(tokenizer.get_vocabulary_size())

44


In [None]:
"""
batch: list of lists of tokens to convert to indices.
padding (default: True): whether to pad every list to the length of the longest list in the batch (smart padding).
truncation (default: True): whether to truncate each list to max_length; as case 3 below shows, the last max_length tokens are kept.
max_length (default: 512): maximum number of tokens per list. This argument is ignored if truncation is False.
"""

# 2d encode
tokens = [
    ['A03C', 'A03D', 'A03E', 'A03F'],
    ['A04A', 'B035', 'C129']
]

# case 1: defaults (padding=True, truncation=True, max_length=512)
indices = tokenizer.batch_encode_2d(tokens)
print('case 1:', indices)

# case 2: no padding
indices = tokenizer.batch_encode_2d(tokens, padding=False)
print('case 2:', indices)

# case 3: truncate to max_length=3 (the last 3 tokens of each list are kept)
indices = tokenizer.batch_encode_2d(tokens, max_length=3)
print('case 3:', indices)

case 1: [[8, 9, 10, 11], [12, 1, 1, 0]]
case 2: [[8, 9, 10, 11], [12, 1, 1]]
case 3: [[9, 10, 11], [12, 1, 1]]
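
The three cases can be reproduced on already-encoded indices with a few lines of plain Python. This is a sketch of the behavior visible in the output above, not PyHealth's source code: `pad_and_truncate_2d` is a hypothetical function, and it assumes that `<pad>` has index 0 and that `max_length` is at least 1.

```python
def pad_and_truncate_2d(batch, padding=True, truncation=True,
                        max_length=512, pad_index=0):
    """Hypothetical sketch of 2D truncation and smart padding on index lists."""
    if truncation:
        # Keep the LAST max_length items of each list, as in case 3.
        batch = [seq[-max_length:] for seq in batch]
    if padding:
        # Smart padding: pad only up to the longest list in this batch.
        width = max((len(seq) for seq in batch), default=0)
        batch = [seq + [pad_index] * (width - len(seq)) for seq in batch]
    return batch


encoded = [[8, 9, 10, 11], [12, 1, 1]]
print('case 1:', pad_and_truncate_2d(encoded))
print('case 2:', pad_and_truncate_2d(encoded, padding=False))
print('case 3:', pad_and_truncate_2d(encoded, max_length=3))
```

Truncation runs before padding, so in case 3 both lists already have length 3 and no `<pad>` index is added.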


In [None]:
"""
batch: list of lists of indices to convert to tokens.
padding (default: False): whether to keep the padding tokens in the decoded output.
"""

# 2d decode
indices = [
    [8, 9, 10, 11],
    [12, 1, 1, 0]
]

# case 1: default (padding=False), padding tokens are dropped
tokens = tokenizer.batch_decode_2d(indices)
print('case 1:', tokens)

# case 2: padding=True, padding tokens are kept
tokens = tokenizer.batch_decode_2d(indices, padding=True)
print('case 2:', tokens)

case 1: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>']]
case 2: [['A03C', 'A03D', 'A03E', 'A03F'], ['A04A', '<unk>', '<unk>', '<pad>']]
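
The difference between the two cases amounts to filtering out `<pad>` after decoding. A minimal sketch, using a hypothetical helper `strip_padding_2d` that is not part of PyHealth:

```python
def strip_padding_2d(batch, pad_token='<pad>'):
    """Hypothetical helper: drop the padding token from every decoded list."""
    return [[token for token in seq if token != pad_token] for seq in batch]


decoded_with_padding = [
    ['A03C', 'A03D', 'A03E', 'A03F'],
    ['A04A', '<unk>', '<unk>', '<pad>'],
]
print(strip_padding_2d(decoded_with_padding))
```

Note that `<unk>` is kept: it marks a real position in the sequence, whereas `<pad>` only fills space.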


### **Example 3: 3D tokenization**
- This example converts a 3D list (patients, visits, tokens per visit) between tokens and indices.
- We use two special tokens, `["<pad>", "<unk>"]`: `<pad>` pads both missing visits and shorter token lists, and `<unk>` stands in for tokens that are not in the vocabulary.

In [None]:
# Tokenizer was already imported in Example 1:
# from pyhealth.tokenizer import Tokenizer

# use a list of ATC3 codes as the token space
token_space = ['A01A', 'A02A', 'A02B', 'A02X', 'A03A', 'A03B', 'A03C', 'A03D', 'A03E',
               'A03F', 'A04A', 'A05A', 'A05B', 'A05C', 'A06A', 'A07A', 'A07B', 'A07C',
               'A07D', 'A07E', 'A07F', 'A07X', 'A08A', 'A09A', 'A10A', 'A10B', 'A10X',
               'A11A', 'A11B', 'A11C', 'A11D', 'A11E', 'A11G', 'A11H', 'A11J', 'A12A',
               'A12B', 'A12C', 'A13A', 'A14A', 'A14B', 'A16A']

# initialize the tokenizer
tokenizer = Tokenizer(tokens=token_space, special_tokens=["<pad>", "<unk>"])
print(tokenizer.get_vocabulary_size())

44


In [None]:
"""
batch: list of lists of lists of tokens to convert to indices.
padding (default: (True, True)): a tuple of two booleans indicating whether to pad the visits (first element)
    and the tokens (second element) to the batch maximum (smart padding).
truncation (default: (True, True)): a tuple of two booleans indicating whether to truncate the visits and the tokens
    to the corresponding element in max_length.
max_length (default: (10, 512)): a tuple of two integers giving the maximum length along the first (visits) and
    second (tokens) dimension. This argument is ignored if truncation is False.
"""

# 3d encode
tokens = [
    [
        ['A03C', 'A03D', 'A03E', 'A03F'],
        ['A08A', 'A09A'],
    ],
    [
        ['A04A', 'B035', 'C129'],
    ]
]

# case 1: defaults (padding=(True, True), truncation=(True, True), max_length=(10, 512))
indices = tokenizer.batch_encode_3d(tokens)
print('case 1:', indices)

# case 2: no padding on the first dimension (visits)
indices = tokenizer.batch_encode_3d(tokens, padding=(False, True))
print('case 2:', indices)

# case 3: no padding on the second dimension (tokens)
indices = tokenizer.batch_encode_3d(tokens, padding=(True, False))
print('case 3:', indices)

# case 4: no padding on either dimension
indices = tokenizer.batch_encode_3d(tokens, padding=(False, False))
print('case 4:', indices)

# case 5: truncate to max_length=(2, 2), i.e. at most 2 visits and 2 tokens per visit
indices = tokenizer.batch_encode_3d(tokens, max_length=(2, 2))
print('case 5:', indices)

case 1: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0], [0, 0, 0, 0]]]
case 2: [[[8, 9, 10, 11], [24, 25, 0, 0]], [[12, 1, 1, 0]]]
case 3: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1], [0]]]
case 4: [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
case 5: [[[10, 11], [24, 25]], [[1, 1], [0, 0]]]
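
The five outputs follow from applying truncation and then padding along each dimension in turn. The sketch below reproduces them on already-encoded indices. It is an illustration inferred from the printed results, not PyHealth's source code: `pad_and_truncate_3d` is a hypothetical function that assumes `<pad>` has index 0 and that a missing visit is padded with a single `<pad>` before token-level padding.

```python
def pad_and_truncate_3d(batch, padding=(True, True), truncation=(True, True),
                        max_length=(10, 512), pad_index=0):
    """Hypothetical sketch of 3D truncation and smart padding on index lists."""
    if truncation[0]:
        # Keep the last max_length[0] visits of each patient.
        batch = [visits[-max_length[0]:] for visits in batch]
    if truncation[1]:
        # Keep the last max_length[1] tokens of each visit.
        batch = [[codes[-max_length[1]:] for codes in visits] for visits in batch]
    if padding[0]:
        # Add placeholder visits so every patient has the same visit count.
        n_visits = max((len(visits) for visits in batch), default=0)
        batch = [visits + [[pad_index] for _ in range(n_visits - len(visits))]
                 for visits in batch]
    if padding[1]:
        # Pad every visit to the longest visit in the whole batch.
        width = max((len(codes) for visits in batch for codes in visits), default=0)
        batch = [[codes + [pad_index] * (width - len(codes)) for codes in visits]
                 for visits in batch]
    return batch


encoded = [[[8, 9, 10, 11], [24, 25]], [[12, 1, 1]]]
print('case 1:', pad_and_truncate_3d(encoded))
print('case 3:', pad_and_truncate_3d(encoded, padding=(True, False)))
print('case 5:', pad_and_truncate_3d(encoded, max_length=(2, 2)))
```

Case 3 shows why the placeholder visit appears as `[0]`: visit padding is applied, but token padding is switched off, so the placeholder is never widened.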


In [None]:
"""
batch: list of lists of lists of indices to convert to tokens.
padding (default: False): whether to keep the padding tokens in the decoded output.
"""

# 3d decode
indices = [
    [
        [8, 9, 10, 11],
        [24, 25, 0, 0]
    ],
    [
        [12, 1, 1, 0],
        [0, 0, 0, 0]
    ]
]


# case 1: default (padding=False), padding tokens and all-padding visits are dropped
tokens = tokenizer.batch_decode_3d(indices)
print('case 1:', tokens)

# case 2: padding=True, padding tokens are kept
tokens = tokenizer.batch_decode_3d(indices, padding=True)
print('case 2:', tokens)

case 1: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A']], [['A04A', '<unk>', '<unk>']]]
case 2: [[['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']], [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']]]
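
In the 3D case, dropping padding removes both the `<pad>` tokens inside a visit and any visit that consisted only of padding. A minimal sketch of that filtering, using a hypothetical helper `strip_padding_3d` that is not part of PyHealth:

```python
def strip_padding_3d(batch, pad_token='<pad>'):
    """Hypothetical helper: drop pad tokens, then drop visits left empty."""
    cleaned = []
    for visits in batch:
        kept_visits = []
        for codes in visits:
            real_codes = [token for token in codes if token != pad_token]
            if real_codes:  # a visit made only of padding disappears
                kept_visits.append(real_codes)
        cleaned.append(kept_visits)
    return cleaned


decoded_with_padding = [
    [['A03C', 'A03D', 'A03E', 'A03F'], ['A08A', 'A09A', '<pad>', '<pad>']],
    [['A04A', '<unk>', '<unk>', '<pad>'], ['<pad>', '<pad>', '<pad>', '<pad>']],
]
print(strip_padding_3d(decoded_with_padding))
```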


If you find it useful, please give us a star ⭐ (fork, and watch) at https://github.com/sunlabuiuc/PyHealth.

Thanks very much for your support!