
# Exercise: Working with Hugging Face Tokenizers

### Part 1: Basic Tokenization with Hugging Face Tokenizer


1. Install the Hugging Face `transformers` library (inside a notebook cell, prefix the command with `!`):
   ```bash
   pip install transformers
   ```

2. Choose a pre-trained tokenizer from Hugging Face’s model hub (for example, `bert-base-uncased` or `gpt2`) and tokenize a piece of text:
   
   **Task**: Load the tokenizer and tokenize the sentence: `"T5 is the greatest data science boot-camp!"`

   Below is a code block where you can perform this task:
    

In [None]:
# transformers already depends on tokenizers; listing both makes the requirement explicit
!pip install transformers tokenizers



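In [None]:
# AutoTokenizer loads the tokenizer that matches a pre-trained checkpoint (Parts 1 and 2)
from transformers import AutoTokenizer

In [None]:
# Building blocks for training a tokenizer from scratch (Bonus Challenge)
from tokenizers import Tokenizer, models, trainers, pre_tokenizers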

In [None]:
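# Download (or load from the local cache) the tokenizer for bert-base-uncased
tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
text = "T5 is the greatest data science boot-camp!"
# tokenize() returns WordPiece token strings; it does not add special tokens
tokens = tokenizer.tokenize(text)
print(tokens)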

['t', '##5', 'is', 'the', 'greatest', 'data', 'science', 'boot', '-', 'camp', '!']
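The `##` prefix in the output marks a piece that continues the current word rather than starting a new one. As a rough illustration of how such splits arise, here is a minimal sketch of greedy longest-match-first WordPiece splitting. The function name `wordpiece_tokenize` and the tiny `toy_vocab` are hypothetical and chosen only for this example; the real `bert-base-uncased` vocabulary is far larger (it contains `greatest` as a single token, for instance), and the library implementation handles more cases.

```python
def wordpiece_tokenize(word, vocab, unk_token="[UNK]"):
    """Greedy longest-match-first split of a single word into WordPiece tokens."""
    tokens = []
    start = 0
    while start < len(word):
        end = len(word)
        piece = None
        # Try the longest remaining substring first, then shrink it from the right.
        while start < end:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # word-internal pieces carry the ## prefix
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk_token]  # nothing matched: the whole word becomes unknown
        tokens.append(piece)
        start = end
    return tokens


# Hypothetical toy vocabulary, NOT the real bert-base-uncased vocabulary
toy_vocab = {"t", "##5", "is", "boot", "camp", "great", "##est"}

print(wordpiece_tokenize("t5", toy_vocab))        # ['t', '##5']
print(wordpiece_tokenize("greatest", toy_vocab))  # ['great', '##est']
print(wordpiece_tokenize("xyz", toy_vocab))       # ['[UNK]']
```

With this toy vocabulary, `t5` has no whole-word entry, so the longest matching prefix `t` is taken first and the remainder is matched as `##5`, which mirrors the first two tokens printed above.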





### Part 2: Encoding and Decoding

3. Use the same tokenizer to encode the sentence (convert to token IDs) and then decode it back to text.

   **Task**: Encode the sentence and then decode it back to text.

   Below is a code block where you can perform this task:
    

In [None]:
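# encode() tokenizes and maps tokens to IDs; add_special_tokens=True wraps them in [CLS] ... [SEP]
encoded_input = tokenizer.encode(text, add_special_tokens=True)
print(f"Encoded token IDs: {encoded_input}")
# decode() maps IDs back to a string; special tokens are kept unless skip_special_tokens=True
decoded_text = tokenizer.decode(encoded_input)
print(f"Decoded text: {decoded_text}")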

Encoded token IDs: [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
Decoded text: [CLS] t5 is the greatest data science boot - camp! [SEP]
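Note that the decoded text is not identical to the input: it is lowercased, the hyphen is surrounded by spaces, and the `[CLS]` and `[SEP]` markers are included. The sketch below imitates the decoding step in plain Python to show where each difference comes from. It is a simplified stand-in, not the library's implementation: `toy_decode` is a hypothetical helper, and the ID-to-token table is read off the two outputs printed above.

```python
# IDs from the Part 2 output, paired with the Part 1 tokens plus [CLS]/[SEP]
ids = [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
tokens = ["[CLS]", "t", "##5", "is", "the", "greatest", "data", "science",
          "boot", "-", "camp", "!", "[SEP]"]
id_to_token = dict(zip(ids, tokens))
SPECIAL_TOKENS = {"[CLS]", "[SEP]"}


def toy_decode(token_ids, skip_special_tokens=False):
    """Simplified imitation of tokenizer.decode, for illustration only."""
    pieces = [id_to_token[i] for i in token_ids]
    if skip_special_tokens:
        pieces = [p for p in pieces if p not in SPECIAL_TOKENS]
    # Join with spaces, then glue word-internal pieces back onto their word
    text = " ".join(pieces).replace(" ##", "")
    # Simplified cleanup: drop the space before common punctuation marks
    for mark in (".", ",", "!", "?"):
        text = text.replace(" " + mark, mark)
    return text


print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

Because the tokens themselves are lowercase and `-` is a separate token, the original casing and the unspaced `boot-camp` cannot be recovered from the IDs alone. With the real tokenizer, passing `skip_special_tokens=True` to `decode` drops the `[CLS]` and `[SEP]` markers in the same way.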



### Bonus Challenge

4. **Custom Tokenizer**: Use Hugging Face’s `tokenizers` library to train a custom tokenizer.
   
   You are provided with a small dataset of five sentences. Train a WordPiece tokenizer on them, save it, and test it on a new sentence.

   Below is a code block to train the tokenizer:
    

In [None]:
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data."
]
print(dataset)

['Transformers are amazing for NLP tasks.', 'Tokenization is essential for language models.', 'Byte Pair Encoding is a great subword tokenization algorithm.', 'Hugging Face makes it easy to work with pre-trained models.', 'Data science is the key to unlocking insights from data.']


In [None]:
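# Note: this rebinds `tokenizer`, replacing the BERT tokenizer from Parts 1 and 2
tokenizer = Tokenizer(models.WordPiece(unk_token="[UNK]"))
# Split on whitespace and punctuation before WordPiece is applied
tokenizer.pre_tokenizer = pre_tokenizers.Whitespace()
# vocab_size is an upper bound; five sentences yield far fewer than 30522 tokens
trainer = trainers.WordPieceTrainer(vocab_size=30522, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])
tokenizer.train_from_iterator(dataset, trainer)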

In [None]:
# Serialize the trained tokenizer to a single JSON file
tokenizer.save("custom_tokenizer.json")

In [None]:
# encode() returns an Encoding object exposing .tokens and .ids
output = tokenizer.encode("This is a test for the custom tokenizer.")

In [None]:
print(f"Tokens: {output.tokens}")
print(f"Token IDs: {output.ids}")

Tokens: ['T', '##h', '##i', '##s', 'is', 'a', 't', '##e', '##s', '##t', 'for', 'the', 'c', '##u', '##s', '##t', '##om', 'to', '##ken', '##iz', '##er', '.']
Token IDs: [15, 59, 42, 47, 66, 16, 33, 46, 47, 50, 70, 173, 18, 49, 47, 50, 123, 67, 80, 76, 136, 6]
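Several words in the output above are broken down to single characters while others stay whole. A likely explanation is the tiny training corpus: words that never appeared in the five sentences have no learned merges to rely on. The sketch below checks this in plain Python. The helper `pre_tokenize` is hypothetical; it uses the regex `\w+|[^\w\s]+` as an approximation of what `pre_tokenizers.Whitespace()` does, and the comparison is case-sensitive because no lowercasing normalizer was configured on the custom tokenizer.

```python
import re

dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data.",
]


def pre_tokenize(sentence):
    # Approximation of pre_tokenizers.Whitespace: runs of word characters,
    # or runs of non-space punctuation
    return re.findall(r"\w+|[^\w\s]+", sentence)


# Every distinct word (case-sensitive) the trainer saw
seen_words = {word for sentence in dataset for word in pre_tokenize(sentence)}

test_words = pre_tokenize("This is a test for the custom tokenizer.")
seen = [w for w in test_words if w in seen_words]
unseen = [w for w in test_words if w not in seen_words]

print("Seen in training:", seen)
print("Not seen in training:", unseen)
```

The words reported as seen (`is`, `a`, `for`, `the`, `.`) are exactly the ones that appear as single tokens in the output above, while the unseen ones (`This`, `test`, `custom`, `tokenizer`) are the ones split into smaller pieces. Training on a larger corpus would be expected to reduce this fragmentation.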


### :)