<a href="https://colab.research.google.com/github/SohaHussain/HuggingFace-course/blob/main/tokenizers.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

##### Tokenizers are one of the key components of an NLP pipeline. They convert raw text into numerical data that the model can process.

In [1]:
!pip install datasets transformers[sentencepiece]


## Loading and saving tokenizers

Loading and saving tokenizers works the same way as loading and saving models. It is based on the same two methods: *from_pretrained()* and *save_pretrained()*.

Loading the BERT tokenizer using the *BertTokenizer* class

In [2]:
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained("bert-base-cased")

Loading the BERT tokenizer using the *AutoTokenizer* class

In [3]:
from transformers import AutoTokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")

In [4]:
tokenizer("learning natural language processing")

{'input_ids': [101, 3776, 2379, 1846, 6165, 102], 'token_type_ids': [0, 0, 0, 0, 0, 0], 'attention_mask': [1, 1, 1, 1, 1, 1]}

Saving the tokenizer

In [5]:
tokenizer.save_pretrained("directory_on_my_computer")

('directory_on_my_computer/tokenizer_config.json',
 'directory_on_my_computer/special_tokens_map.json',
 'directory_on_my_computer/vocab.txt',
 'directory_on_my_computer/added_tokens.json',
 'directory_on_my_computer/tokenizer.json')

## Encoding

Encoding is the process of translating text to numbers. It's done in two steps: tokenization, which splits the text into tokens, and conversion of those tokens to input IDs.

#### Tokenization

In [7]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-cased")
sequence = "learning natural language processing and tranformer networks"
tokens = tokenizer.tokenize(sequence)
print(tokens)

['learning', 'natural', 'language', 'processing', 'and', 't', '##ran', '##former', 'networks']
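The misspelled word "tranformer" is not a single entry in the vocabulary, so it is split into smaller pieces, and the pieces that continue a word carry a `##` prefix. A rough sketch of how such a split can be produced is a greedy longest-match-first lookup. Everything below is an illustration under stated assumptions: `TOY_VOCAB` is a hypothetical hand-picked vocabulary, and `wordpiece_tokenize_word` and `toy_tokenize` are made-up helper names, not part of the transformers library. The sketch also splits on whitespace only, whereas the real tokenizer does more pre-processing.

```python
# Hypothetical toy vocabulary: a hand-picked subset, NOT the real bert-base-cased vocab.
TOY_VOCAB = {
    "learning", "natural", "language", "processing", "and",
    "networks", "t", "##ran", "##former", "[UNK]",
}


def wordpiece_tokenize_word(word, vocab, unk_token="[UNK]"):
    """Split one word into subword pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                # Pieces that continue a word are stored with a "##" prefix.
                candidate = "##" + candidate
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            # No piece fits at this position: the whole word becomes unknown.
            return [unk_token]
        pieces.append(match)
        start = end
    return pieces


def toy_tokenize(text, vocab):
    """Whitespace-split the text, then split each word into subword pieces."""
    tokens = []
    for word in text.split():
        tokens.extend(wordpiece_tokenize_word(word, vocab))
    return tokens


toy_tokens = toy_tokenize(
    "learning natural language processing and tranformer networks", TOY_VOCAB
)
print(toy_tokens)
```

With this toy vocabulary the sketch produces the same split as the cell above. With a different vocabulary the pieces would differ.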


#### Conversion

In [8]:
ids = tokenizer.convert_tokens_to_ids(tokens)
print(ids)

[3776, 2379, 1846, 6165, 1105, 189, 4047, 23763, 6379]
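Conversion is essentially a vocabulary lookup. Note that these IDs do not start with 101 or end with 102, unlike the `input_ids` returned when the tokenizer was called directly in cell In [4]. Those two extra IDs are assumed here to be the special tokens `[CLS]` and `[SEP]` that BERT expects around a sequence. The sketch below rebuilds the In [4] output by hand. `TOY_TOKEN_TO_ID`, `toy_convert_tokens_to_ids` and `toy_encode` are hypothetical names used only for illustration, and the IDs are copied from the outputs shown in this notebook.

```python
# IDs copied from the outputs shown in this notebook.
# Assumption: 101 is [CLS] and 102 is [SEP].
TOY_TOKEN_TO_ID = {
    "[CLS]": 101, "[SEP]": 102,
    "learning": 3776, "natural": 2379, "language": 1846, "processing": 6165,
}


def toy_convert_tokens_to_ids(tokens):
    """Look up each token in the vocabulary (raises KeyError for unknown tokens)."""
    return [TOY_TOKEN_TO_ID[token] for token in tokens]


def toy_encode(tokens):
    """Mimic a direct tokenizer call on a single sequence without padding."""
    ids = toy_convert_tokens_to_ids(["[CLS]"] + tokens + ["[SEP]"])
    return {
        "input_ids": ids,
        "token_type_ids": [0] * len(ids),  # only one sequence, so all zeros
        "attention_mask": [1] * len(ids),  # no padding, so every position is attended to
    }


toy_encoding = toy_encode(["learning", "natural", "language", "processing"])
print(toy_encoding)
```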


## Decoding

Decoding is the opposite of encoding, i.e. converting IDs back into a string.

In [9]:
decoded_string = tokenizer.decode(ids)
print(decoded_string)

learning natural language processing and tranformer networks
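Decoding does more than map IDs back to tokens: the pieces `t`, `##ran` and `##former` are merged back into one word. A minimal sketch of that idea, assuming a reverse lookup table and a simple string replacement, is shown below. `TOY_ID_TO_TOKEN` and `toy_decode` are hypothetical names, and the ID-to-token pairs are taken from the outputs earlier in this notebook.

```python
# Reverse lookup built from the tokens and IDs printed earlier in this notebook.
TOY_ID_TO_TOKEN = {
    3776: "learning", 2379: "natural", 1846: "language", 6165: "processing",
    1105: "and", 189: "t", 4047: "##ran", 23763: "##former", 6379: "networks",
}


def toy_decode(ids):
    """Map IDs to tokens, then glue '##' continuation pieces onto the previous piece."""
    tokens = [TOY_ID_TO_TOKEN[i] for i in ids]
    # Joining with spaces gives "... t ##ran ##former ..."; removing " ##" merges the pieces.
    return " ".join(tokens).replace(" ##", "")


toy_decoded = toy_decode([3776, 2379, 1846, 6165, 1105, 189, 4047, 23763, 6379])
print(toy_decoded)
```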
