
# Exercise: Working with Hugging Face Tokenizers

### Part 1: Basic Tokenization with Hugging Face Tokenizer


1. Install the Hugging Face `transformers` library (the leading `!` runs the shell command from a notebook cell; omit it in a regular terminal):
   ```bash
   !pip install transformers
   ```

2. Choose a pre-trained tokenizer from Hugging Face’s model hub (e.g., `bert-base-uncased` or `gpt2`) and tokenize a piece of text:
   
   **Task**: Load the tokenizer and tokenize the sentence: `"T5 is the greatest data science boot-camp!"`

   Below is a code block where you can perform this task:
    

In [1]:
!pip install transformers



In [2]:
from transformers import AutoTokenizer

In [3]:

# Load a pre-trained tokenizer
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The sentence to be tokenized
text = "T5 is the greatest data science boot-camp!"

# Tokenize the sentence
tokens = tokenizer.tokenize(text)

# Print the tokens
print(tokens)


['t', '##5', 'is', 'the', 'greatest', 'data', 'science', 'boot', '-', 'camp', '!']
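
The `##` prefix in the output marks a piece that continues the previous piece inside the same word, so `t` followed by `##5` spells `t5`. WordPiece arrives at these pieces by repeatedly taking the longest vocabulary entry that matches the start of what is left of the word. The sketch below imitates that greedy matching in plain Python. The `VOCAB` set contains only the pieces visible in the output above, and `wordpiece`, `toy_tokenize`, and the regex-based word splitting are illustrative assumptions rather than the library's implementation.

```python
import re

# Toy vocabulary: only the pieces seen in the output above (an assumption;
# the real bert-base-uncased vocabulary is far larger).
VOCAB = {"t", "##5", "is", "the", "greatest", "data", "science",
         "boot", "-", "camp", "!"}


def wordpiece(word, vocab, unk="[UNK]"):
    """Greedy longest-match-first split of one word into vocabulary pieces."""
    pieces, start = [], 0
    while start < len(word):
        end = len(word)
        piece = None
        # Shrink the candidate from the right until it is in the vocabulary.
        while end > start:
            candidate = word[start:end]
            if start > 0:
                candidate = "##" + candidate  # continuation of the same word
            if candidate in vocab:
                piece = candidate
                break
            end -= 1
        if piece is None:
            return [unk]  # no piece fits: the whole word becomes unknown
        pieces.append(piece)
        start = end
    return pieces


def toy_tokenize(text, vocab=VOCAB):
    """Lowercase, split into words and punctuation, then apply wordpiece()."""
    words = re.findall(r"\w+|[^\w\s]", text.lower())
    return [p for w in words for p in wordpiece(w, vocab)]


print(toy_tokenize("T5 is the greatest data science boot-camp!"))
```

With this toy vocabulary the sketch prints the same list as the notebook output. A word for which no piece matches collapses to `[UNK]` as a whole.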





### Part 2: Encoding and Decoding

3. Use the same tokenizer to encode the sentence (convert to token IDs) and then decode it back to text.

   **Task**: Encode the sentence and then decode it back to text.

   Below is a code block where you can perform this task:
    

In [4]:
from transformers import AutoTokenizer

# Load a pre-trained tokenizer (bert-base-uncased in this case)
tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")

# The sentence to be encoded
text = "T5 is the greatest data science boot-camp!"

# Encode the sentence (convert to token IDs)
encoded = tokenizer.encode(text, add_special_tokens=True)

# Decode the token IDs back to text
decoded = tokenizer.decode(encoded)

# Print the encoded token IDs and the decoded text
print("Encoded token IDs:", encoded)
print("Decoded text:", decoded)


Encoded token IDs: [101, 1056, 2629, 2003, 1996, 4602, 2951, 2671, 9573, 1011, 3409, 999, 102]
Decoded text: [CLS] t5 is the greatest data science boot - camp! [SEP]
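
The decoded text is not identical to the input: `bert-base-uncased` lowercases the text, the hyphen comes back as a separate token with spaces around it, and `[CLS]` and `[SEP]` are added because of `add_special_tokens=True`. The sketch below reproduces the round trip by hand, using only the token and ID pairs that can be read off the two outputs above. `toy_encode` and `toy_decode` are hypothetical helpers, and the punctuation cleanup is a simplification of what the library does.

```python
# Token-to-ID pairs read off the tokenize() and encode() outputs above.
TOKEN_TO_ID = {
    "[CLS]": 101, "[SEP]": 102, "t": 1056, "##5": 2629, "is": 2003,
    "the": 1996, "greatest": 4602, "data": 2951, "science": 2671,
    "boot": 9573, "-": 1011, "camp": 3409, "!": 999,
}
ID_TO_TOKEN = {i: t for t, i in TOKEN_TO_ID.items()}
SPECIAL = ("[CLS]", "[SEP]")


def toy_encode(tokens, add_special_tokens=True):
    """Map tokens to IDs, optionally wrapping them in [CLS] ... [SEP]."""
    ids = [TOKEN_TO_ID[t] for t in tokens]
    if add_special_tokens:
        ids = [TOKEN_TO_ID["[CLS]"]] + ids + [TOKEN_TO_ID["[SEP]"]]
    return ids


def toy_decode(ids, skip_special_tokens=False):
    """Map IDs back to text, gluing '##' pieces onto the previous piece."""
    tokens = [ID_TO_TOKEN[i] for i in ids]
    if skip_special_tokens:
        tokens = [t for t in tokens if t not in SPECIAL]
    text = " ".join(tokens).replace(" ##", "")
    # Simplified cleanup: remove the space before closing punctuation.
    for mark in ("!", "?", ".", ","):
        text = text.replace(" " + mark, mark)
    return text


tokens = ['t', '##5', 'is', 'the', 'greatest', 'data', 'science',
          'boot', '-', 'camp', '!']
ids = toy_encode(tokens)
print(ids)
print(toy_decode(ids))
print(toy_decode(ids, skip_special_tokens=True))
```

With the real tokenizer, `tokenizer.decode(encoded, skip_special_tokens=True)` drops `[CLS]` and `[SEP]` in the same way.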



### Bonus Challenge

4. **Custom Tokenizer**: Use Hugging Face’s `tokenizers` library to train a custom tokenizer on a dataset.
   
   You are provided with a dataset containing multiple sentences. Train a custom tokenizer using these sentences.

   Below is a code block to train the tokenizer:
    

In [6]:

# Provided dataset for the bonus challenge
dataset = [
    "Transformers are amazing for NLP tasks.",
    "Tokenization is essential for language models.",
    "Byte Pair Encoding is a great subword tokenization algorithm.",
    "Hugging Face makes it easy to work with pre-trained models.",
    "Data science is the key to unlocking insights from data."
]
print(dataset)


['Transformers are amazing for NLP tasks.', 'Tokenization is essential for language models.', 'Byte Pair Encoding is a great subword tokenization algorithm.', 'Hugging Face makes it easy to work with pre-trained models.', 'Data science is the key to unlocking insights from data.']


In [7]:
from tokenizers import Tokenizer
from tokenizers.models import WordPiece
from tokenizers.trainers import WordPieceTrainer
from tokenizers.pre_tokenizers import Whitespace
from tokenizers.processors import TemplateProcessing

# `dataset` is the list of five sentences defined in the previous cell

# Initialize the tokenizer with the WordPiece model
tokenizer = Tokenizer(WordPiece(unk_token="[UNK]"))

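# Set the pre-tokenizer to split on whitespace and punctuation.
# The attribute is `pre_tokenizer` (singular); assigning to `pre_tokenizers`
# does not configure the tokenizer, so no pre-tokenization would happen.
tokenizer.pre_tokenizer = Whitespace()

# Initialize the trainer with the special tokens. vocab_size=3000 is an upper
# bound; this five-sentence dataset yields far fewer entries.
trainer = WordPieceTrainer(vocab_size=3000, special_tokens=["[UNK]", "[CLS]", "[SEP]", "[PAD]", "[MASK]"])

# Train the tokenizer using the dataset
tokenizer.train_from_iterator(dataset, trainer)

# Set the post-processing template to add special tokens like [CLS] and [SEP].
# The IDs are looked up from the trained vocabulary instead of being hard-coded.
tokenizer.post_processor = TemplateProcessing(
    single="[CLS] $A [SEP]",
    pair="[CLS] $A [SEP] $B:1 [SEP]:1",
    special_tokens=[
        ("[CLS]", tokenizer.token_to_id("[CLS]")),
        ("[SEP]", tokenizer.token_to_id("[SEP]")),
    ]
)

# Test the custom tokenizer on a sample sentence
test_sentence = "Tokenization is essential for NLP."
encoded = tokenizer.encode(test_sentence)

# Print the tokens and their corresponding token IDs
print("Tokens:", encoded.tokens)
print("Token IDs:", encoded.ids)


Tokens: ['[CLS]', 'Tokenization is ess', '##ential', '## for ', '##NL', '##P', '##.', '[SEP]']
Token IDs: [1, 204, 179, 141, 153, 53, 56, 2]

This recorded output comes from a run in which no pre-tokenizer was applied, which is why tokens such as `'Tokenization is ess'` and `'## for '` contain spaces. With `pre_tokenizer` set, the input is split into words and punctuation before WordPiece runs, so no token can span a space, and the tokens and IDs will differ from those shown here.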
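
Pre-tokenization is the step that keeps learned tokens from crossing word boundaries: the text is first cut into words and punctuation, and WordPiece then works inside each piece. A minimal pure-Python sketch of that splitting step follows. It assumes the pattern `\w+|[^\w\s]+`, which is how the `tokenizers` documentation describes the `Whitespace` pre-tokenizer; `pre_tokenize` is a hypothetical helper, not a library function.

```python
import re

# Assumed equivalent of the Whitespace pre-tokenizer's pattern:
# runs of word characters, or runs of non-space punctuation.
WHITESPACE_PATTERN = re.compile(r"\w+|[^\w\s]+")


def pre_tokenize(text):
    """Return (piece, (start, end)) pairs for each word or punctuation run."""
    return [(m.group(), m.span()) for m in WHITESPACE_PATTERN.finditer(text)]


pieces = pre_tokenize("Tokenization is essential for NLP.")
print([piece for piece, _ in pieces])
print(pieces[0])

# Hyphenated words are split at the hyphen as well.
print([piece for piece, _ in pre_tokenize("pre-trained models")])
```

Because every piece is free of spaces, a subword model trained on these pieces cannot learn a token such as `'Tokenization is ess'`.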
