# Exercise: Working with Hugging Face Tokenizers

### Part 1: Basic Tokenization with Hugging Face Tokenizer


1. Install the Hugging Face `transformers` library:
   ```bash
   pip install transformers
   ```

2. Choose a pre-trained tokenizer from Hugging Face’s model hub (e.g., `bert-base-uncased` or `gpt2`) and tokenize a piece of text:
   
   **Task**: Load the tokenizer and tokenize the sentence: `"T5 is the greatest data science boot-camp!"`

   Below is a code block where you can perform this task:
    

In [1]:
%%capture
!pip install transformers

In [3]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
sentence = "T5 is the greatest data science boot-camp!"
tokens = tokenizer.tokenize(sentence)
print("Tokens:", tokens)

Tokens: ['t', '##5', 'is', 'the', 'greatest', 'data', 'science', 'boot', '-', 'camp', '!']
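
The output above splits `T5` into `t` and `##5`: `bert-base-uncased` lowercases its input, and the `##` prefix marks a piece that continues the previous one. WordPiece tokenization is commonly described as a greedy longest-match-first lookup against a fixed vocabulary, and the sketch below illustrates that idea in plain Python. The function name `wordpiece_tokenize` and the miniature `toy_vocab` are hypothetical; this is not the library's implementation or the real BERT vocabulary.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Split one lowercased word into pieces by greedy longest-match-first lookup."""
    pieces = []
    start = 0
    while start < len(word):
        end = len(word)
        match = None
        # Try the longest remaining substring first, then shrink from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation pieces carry the ## prefix
            if candidate in vocab:
                match = candidate
                break
            end -= 1
        if match is None:
            return [unk_token]  # no piece fits, so the whole word is unknown
        pieces.append(match)
        start = end
    return pieces


# Hypothetical miniature vocabulary (assumption), far smaller than BERT's.
toy_vocab = {"t", "##5", "is", "the", "greatest", "boot", "camp"}

print(wordpiece_tokenize("t5", toy_vocab))        # ['t', '##5']
print(wordpiece_tokenize("greatest", toy_vocab))  # ['greatest']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

Because `t5` is not in the toy vocabulary as a whole word, the lookup falls back to the longest prefix it can find (`t`) and then matches the remainder as a continuation piece.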





### Part 2: Encoding and Decoding

3. Use the same tokenizer to encode the sentence (convert to token IDs) and then decode it back to text.

   **Task**: Encode the sentence and then decode it back to text.

   Below is a code block where you can perform this task:
    

In [4]:
encoded_sentence = tokenizer.encode(sentence)
print("Encoded Token IDs:", encoded_sentence)
decoded_sentence = tokenizer.decode(encoded_sentence)
print("Decoded Sentence:", decoded_sentence)

Encoded Token IDs: [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
Decoded Sentence: [CLS] t5 is the greatest data science boot - camp! [SEP]
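
Comparing the token list from Part 1 with the IDs above suggests what `encode` adds: each token maps to one ID, and the sequence is wrapped in `[CLS]` (101) and `[SEP]` (102). The following plain-Python sketch reproduces that round trip with a lookup table read off those two outputs. The helpers `toy_encode` and `toy_decode` are hypothetical and only mimic the behaviour seen here; with the real tokenizer, passing `skip_special_tokens=True` to `decode` drops the special tokens.

```python
# Token-to-ID pairs read off the outputs above (Part 1 tokens vs. encoded IDs).
vocab = {
    "[CLS]": 101, "[SEP]": 102,
    "t": 1056, "##5": 2629, "is": 2003, "the": 1996, "greatest": 4602,
    "data": 2951, "science": 2671, "boot": 9573, "-": 1011, "camp": 3409, "!": 999,
}
id_to_token = {token_id: token for token, token_id in vocab.items()}
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def toy_encode(tokens):
    """Map tokens to IDs and wrap them in [CLS] ... [SEP]."""
    return [vocab["[CLS]"]] + [vocab[t] for t in tokens] + [vocab["[SEP]"]]


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to tokens, glue ## pieces together, and tidy the '!' spacing."""
    tokens = [id_to_token[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL_TOKENS]
    text = " ".join(tokens).replace(" ##", "")  # merge continuation pieces
    return text.replace(" !", "!")              # simplified punctuation cleanup


tokens = ["t", "##5", "is", "the", "greatest", "data", "science",
          "boot", "-", "camp", "!"]
ids = toy_encode(tokens)
print(ids)
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

The decoded text is lowercased and has spaces around the hyphen because the uncased tokenizer discards the original casing and spacing; decoding rebuilds text from the tokens, not from the input string.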



### Bonus Challenge

4. **Custom Tokenizer**: Use Hugging Face’s `tokenizers` library to train a custom tokenizer on a dataset.
   
   You are provided with a dataset containing multiple sentences. Train a custom tokenizer using these sentences.

   Below is a code block to train the tokenizer:
    

In [5]:
%%capture
!pip install tokenizers

In [7]:
from tokenizers import Tokenizer, models, pre_tokenizers, trainers

# Provided dataset for the bonus challenge
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data."
]
# Build a BPE tokenizer that splits on whitespace and punctuation before merging
tokenizer = Tokenizer(models.BPE())
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()

# Train on the in-memory dataset, then save and reload the result
trainer = trainers.BpeTrainer(vocab_size=5000, special_tokens=["[PAD]", "[UNK]", "[CLS]", "[SEP]", "[MASK]"])
tokenizer.train_from_iterator(dataset, trainer)
tokenizer.save("custom_tokenizer.json")
trained_tokenizer = Tokenizer.from_file("custom_tokenizer.json")

# Encode a sentence that was not in the training data, then decode it back
new_sentence = "Hugging Face helps with tokenization."
encoded = trained_tokenizer.encode(new_sentence)
decoded = trained_tokenizer.decode(encoded.ids)

print(f"New sentence: {new_sentence}")
print(f"Encoded IDs: {encoded.ids}")
print(f"Decoded sentence: {decoded}")

New sentence: Hugging Face helps with tokenization.
Encoded IDs: [161, 137, 23, 53, 30, 32, 124, 128, 6]
Decoded sentence: Hugging Face h el p s with tokenization .
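
In the output above, words that occur in the training sentences (`Hugging`, `Face`, `with`) come back whole, while `helps` falls apart into `h el p s`. That pattern is what Byte Pair Encoding would be expected to produce for a word it never saw during training. The plain-Python sketch below shows the core idea: repeatedly merge the most frequent adjacent pair of symbols, then replay the learned merges on new words. The functions `train_toy_bpe`, `apply_merge`, and `segment` are hypothetical illustrations and not the `tokenizers` library's implementation.

```python
from collections import Counter


def apply_merge(symbols, pair):
    """Replace every adjacent occurrence of `pair` in `symbols` with the merged symbol."""
    out, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            out.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            out.append(symbols[i])
            i += 1
    return tuple(out)


def train_toy_bpe(words, num_merges):
    """Learn up to `num_merges` merges from a list of words (repeats count as frequency)."""
    corpus = Counter(tuple(w) for w in words)  # each word starts as single characters
    merges = []
    for _ in range(num_merges):
        pair_counts = Counter()
        for symbols, freq in corpus.items():
            for pair in zip(symbols, symbols[1:]):
                pair_counts[pair] += freq
        if not pair_counts:
            break  # every word is already a single symbol
        best = max(pair_counts, key=pair_counts.get)  # ties: first pair seen wins
        merges.append(best)
        merged_corpus = Counter()
        for symbols, freq in corpus.items():
            merged_corpus[apply_merge(symbols, best)] += freq
        corpus = merged_corpus
    return merges


def segment(word, merges):
    """Split a word into characters, then replay the learned merges in order."""
    symbols = tuple(word)
    for pair in merges:
        symbols = apply_merge(symbols, pair)
    return list(symbols)


merges = train_toy_bpe(["with", "with", "with"], num_merges=3)
print(merges)
print(segment("with", merges))  # a word seen in training
print(segment("wish", merges))  # an unseen word
```

With this toy corpus, `with` collapses into a single symbol after three merges, while the unseen `wish` only benefits from the first merge and stays fragmented as `wi`, `s`, `h`.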
