# Exercise: Working with Hugging Face Tokenizers

### Part 1: Basic Tokenization with a Hugging Face Tokenizer


1. Install the Hugging Face `transformers` library (in a notebook cell, the leading `!` runs the line as a shell command):
   ```bash
   !pip install transformers
   ```

2. Choose a pre-trained tokenizer from the Hugging Face model hub (e.g., `bert-base-uncased` or `gpt2`) and tokenize a piece of text:
   
   **Task**: Load the tokenizer and tokenize the sentence: `"T5 is the greatest data science boot-camp!"`

   Below is a code block where you can perform this task:
    

In [None]:
!pip install transformers

In [2]:
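from transformers import AutoTokenizer

# Download (or load from the local cache) the tokenizer that ships with the model
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')

sentence = "T5 is the greatest data science boot-camp!"
# Split the sentence into subword tokens; the uncased model lowercases the text first
tokens = tokenizer.tokenize(sentence)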

print("Tokens:", tokens)



Tokens: ['t', '##5', 'is', 'the', 'greatest', 'data', 'science', 'boot', '-', 'camp', '!']

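The `##` prefix in the output above marks a piece that continues the previous word instead of starting a new one. The sketch below illustrates the greedy longest-match-first lookup that WordPiece-style tokenizers rely on. The helper `wordpiece_split` and the tiny `toy_vocab` are hypothetical stand-ins for illustration only; they are not BERT's real vocabulary or implementation.

```python
def wordpiece_split(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one lowercased word (toy sketch)."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink it
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk]  # no piece fits, so the whole word is treated as unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical vocabulary: "t5" is absent, but "t" and "##5" are present.
# (The real BERT vocabulary differs; for example, it contains "greatest" whole.)
toy_vocab = {"t", "##5", "boot", "camp", "great", "##est"}

print(wordpiece_split("t5", toy_vocab))
print(wordpiece_split("greatest", toy_vocab))
print(wordpiece_split("xyz", toy_vocab))
```

Under this toy vocabulary, `"t5"` has no single matching entry, so it falls back to `t` plus the continuation piece `##5`, which mirrors the first two tokens printed above.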




### Part 2: Encoding and Decoding

3. Use the same tokenizer to encode the sentence (convert to token IDs) and then decode it back to text.

   **Task**: Encode the sentence and then decode it back to text.

   Below is a code block where you can perform this task:
    

In [3]:
# Encode the sentence (convert to token IDs)
encoded_ids = tokenizer.encode(sentence, add_special_tokens=True)
print("Encoded IDs:", encoded_ids)

# Decode the token IDs back to text
decoded_sentence = tokenizer.decode(encoded_ids, skip_special_tokens=True)
print("Decoded Sentence:", decoded_sentence)

Encoded IDs: [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
Decoded Sentence: t5 is the greatest data science boot - camp!

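The first and last IDs in the encoded output, 101 and 102, are BERT's `[CLS]` and `[SEP]` special tokens: `add_special_tokens=True` adds them and `skip_special_tokens=True` strips them again. The round trip is also not exact, since the uncased model lowercases the text and the hyphen comes back with spaces around it. The sketch below reproduces the decoding step using only the IDs and tokens printed in this exercise. The helper `toy_decode` is hypothetical and mimics just a small part of what `tokenizer.decode` does.

```python
# ID -> token pairs taken from the outputs printed in Parts 1 and 2
id_to_token = {
    101: "[CLS]", 102: "[SEP]",
    1056: "t", 2629: "##5", 2003: "is", 1996: "the", 4602: "greatest",
    2951: "data", 2671: "science", 9573: "boot", 1011: "-", 3409: "camp", 999: "!",
}
special_tokens = {"[CLS]", "[SEP]"}


def toy_decode(ids, skip_special_tokens=True):
    """Simplified decode: look up tokens, glue '##' pieces, tidy punctuation."""
    text = ""
    for token_id in ids:
        token = id_to_token[token_id]
        if skip_special_tokens and token in special_tokens:
            continue
        if token.startswith("##"):
            text += token[2:]  # continuation piece: attach without a space
        else:
            text += (" " if text else "") + token
    # Assumed subset of the clean-up step: drop the space before closing punctuation
    for mark in (".", "?", "!", ","):
        text = text.replace(" " + mark, mark)
    return text


printed_ids = [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
print(toy_decode(printed_ids))
print(toy_decode(printed_ids, skip_special_tokens=False))
```

Because `boot`, `-`, and `camp` are three separate tokens with no `##` marker, nothing tells the decoder that they were originally joined, which is why the hyphen stays surrounded by spaces.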


### Bonus Challenge

4. **Custom Tokenizer**: Use Hugging Face’s `tokenizers` library to train a custom tokenizer on a dataset.
   
   You are provided with a dataset containing multiple sentences. Train a custom tokenizer using these sentences.

   Below is a code block to train the tokenizer:
    

In [5]:

# Provided dataset for the bonus challenge
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data."
]
print(dataset)


['Transformers are amazing for NLP tasks.', 'Tokenization is essential for language models.', 'Byte Pair Encoding is a great subword tokenization algorithm.', 'Hugging Face makes it easy to work with pre-trained models.', 'Data science is the key to unlocking insights from data.']


In [10]:
from tokenizers import Tokenizer
from tokenizers.models import BPE
from tokenizers.trainers import BpeTrainer
from tokenizers.pre_tokenizers import Whitespace


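# Start from an untrained BPE model
tokenizer = Tokenizer(BPE())
# vocab_size is an upper bound; this five-sentence dataset produces far fewer entries
trainer = BpeTrainer(vocab_size=1000)
# Split text into words and punctuation before BPE merges are learned
tokenizer.pre_tokenizer = Whitespace()

# Train on the in-memory list of sentences
tokenizer.train_from_iterator(dataset, trainer)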

# Encode a new sentence using the trained tokenizer
encoded_output = tokenizer.encode("Transformers are amazing for NLP tasks.")

# Print the tokens and IDs
print("Custom Tokenizer Tokens:", encoded_output.tokens)
print("Custom Tokenizer IDs:", encoded_output.ids)

Custom Tokenizer Tokens: ['Transformers', 'are', 'amazing', 'for', 'NLP', 'tasks', '.']
Custom Tokenizer IDs: [157, 76, 138, 38, 70, 149, 1]

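To make the training step above less of a black box, here is a minimal pure-Python sketch of the core BPE idea: repeatedly merge the most frequent adjacent pair of symbols. The function `learn_bpe_merges`, its alphabetical tie-breaking rule, and the toy corpus are illustrative assumptions; the `tokenizers` library implements training far more efficiently and may resolve ties differently.

```python
from collections import Counter


def learn_bpe_merges(words, num_merges):
    """Learn up to num_merges BPE merges from a list of words (toy sketch)."""
    # Each word starts as a list of single-character symbols
    corpus = [list(word) for word in words]
    merges = []
    for _ in range(num_merges):
        # Count every adjacent symbol pair across the corpus
        pair_counts = Counter()
        for symbols in corpus:
            pair_counts.update(zip(symbols, symbols[1:]))
        if not pair_counts:
            break  # every word is already a single symbol
        # Most frequent pair wins; ties are broken alphabetically (an assumption)
        best = min(pair_counts, key=lambda pair: (-pair_counts[pair], pair))
        merges.append(best)
        # Replace each occurrence of the best pair with the merged symbol
        for i, symbols in enumerate(corpus):
            merged, j = [], 0
            while j < len(symbols):
                if j + 1 < len(symbols) and (symbols[j], symbols[j + 1]) == best:
                    merged.append(symbols[j] + symbols[j + 1])
                    j += 2
                else:
                    merged.append(symbols[j])
                    j += 1
            corpus[i] = merged
    return merges, corpus


merges, segmented = learn_bpe_merges(["low", "low", "lower", "lowest"], num_merges=3)
print(merges)
print(segmented)
```

With a generous merge budget and a tiny corpus, merging continues until each word is a single symbol. That is consistent with the output above, where every word of the training sentence comes back as one whole token.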