# 🏗️  Training a new tokenizer from an old one
Sometimes you need to train a tokenizer for a new domain or language. Unlike neural model training, tokenizer training is a deterministic, statistical process: the same algorithm run on the same corpus always produces the same vocabulary.  
Here, we’ll train a GPT-2-style tokenizer for Python code using CodeSearchNet.

Install the Transformers, Datasets, and Evaluate libraries to run this notebook.

In [None]:
!pip install datasets evaluate transformers[sentencepiece]
!apt install git-lfs

You will need to set up git. Adapt your email and name in the following cell.

In [None]:
!git config --global user.email "lakshmi.adhikari26@gmail.com"
!git config --global user.name "Lakshmi-Adhikari-AI"

You will also need to be logged in to the Hugging Face Hub. Execute the following and enter your credentials.

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [None]:
pip show datasets


In [None]:
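# Pin datasets to 3.6.0 (assumption: newer major versions may no longer load script-based datasets such as code_search_net)
!pip install datasets==3.6.0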


## 1️⃣ Assemble a Training Corpus using 🤗 Datasets

Let's load and inspect the Python part of CodeSearchNet—a large dataset of Python functions from GitHub.


In [None]:
from datasets import load_dataset

# Load CodeSearchNet Python corpus; may take a few minutes
raw_datasets = load_dataset("code_search_net", "python")


In [None]:
# Inspect the training split's columns and size
print(raw_datasets["train"])

## 2️⃣ Preview Sample Function Strings

We'll use the "whole_func_string" column (entire function as text) to train our tokenizer.


In [None]:
# Print an example Python function for context
print(raw_datasets["train"][123456]["whole_func_string"])

## 3️⃣ Efficiently Prepare an Iterable Corpus

Break up the dataset into batches (e.g., 1,000 functions at a time), using a generator to avoid loading everything into RAM.


In [None]:
def get_training_corpus():
    dataset = raw_datasets["train"]
    # Yield the functions in batches of 1,000 strings
    for start_idx in range(0, len(dataset), 1000):
        samples = dataset[start_idx : start_idx + 1000]
        yield samples["whole_func_string"]


# This generator loads data only as needed, which suits huge datasets
training_corpus = get_training_corpus()
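
One caveat worth illustrating: a generator object can be consumed only once, so a second pass over `training_corpus` would yield nothing. The sketch below uses a small in-memory list (`toy_functions`, a hypothetical stand-in for the dataset) and a hypothetical `batched_corpus` helper to show that behavior and one way around it, which is to call the function again to get a fresh generator.

```python
# Toy stand-in for the "whole_func_string" column (hypothetical data)
toy_functions = [f"def f{i}(): return {i}" for i in range(5)]


def batched_corpus(texts, batch_size=2):
    """Yield successive batches of up to `batch_size` strings from `texts`."""
    for start_idx in range(0, len(texts), batch_size):
        yield texts[start_idx : start_idx + batch_size]


corpus = batched_corpus(toy_functions)
first_pass = list(corpus)   # consumes the generator
second_pass = list(corpus)  # the generator is now exhausted

print(len(first_pass))   # number of batches produced on the first pass
print(len(second_pass))  # nothing is left on the second pass

# Calling the function again returns a fresh generator
fresh_pass = list(batched_corpus(toy_functions))
print(len(fresh_pass))
```

If the corpus has to be iterated more than once, re-create the generator (as above) rather than reusing the exhausted object.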

## 4️⃣ Load the Existing (GPT-2) Tokenizer

Start from a pretrained tokenizer so the new one keeps its tokenization algorithm and special tokens; only the vocabulary will change.


In [None]:
from transformers import AutoTokenizer

# Load GPT-2's tokenizer as a starting point
old_tokenizer = AutoTokenizer.from_pretrained("gpt2")

## 5️⃣ Try the Old Tokenizer on Code

How does the original vocabulary handle our Python domain? It's not very efficient!


In [None]:
example='''def add_numbers(a,b):
  """Add the two numbers 'a' and 'b'."""
  return a+b'''

# See how GPT-2's English-trained tokenizer splits up a Python function
tokens = old_tokenizer.tokenize(example)
print(tokens)
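
The printed tokens contain symbols such as `Ġ` and `Ċ`. These come from byte-level BPE, which maps every byte to a printable character before splitting, so spaces and newlines show up as visible symbols. The sketch below rebuilds that table the way GPT-2's reference encoder is usually described: printable bytes keep their own character, and the remaining bytes are shifted to code points from 256 upward. The helper name `bytes_to_unicode_sketch` is hypothetical, and the construction should be read as an assumption about the scheme rather than as the library's API.

```python
def bytes_to_unicode_sketch():
    """Map each of the 256 byte values to a printable Unicode character."""
    # Bytes that are already printable keep their own character
    printable = (
        list(range(ord("!"), ord("~") + 1))
        + list(range(ord("¡"), ord("¬") + 1))
        + list(range(ord("®"), ord("ÿ") + 1))
    )
    printable_set = set(printable)
    byte_values = printable[:]
    code_points = printable[:]
    offset = 0
    # Every other byte gets the next unused code point starting at 256
    for b in range(256):
        if b not in printable_set:
            byte_values.append(b)
            code_points.append(256 + offset)
            offset += 1
    return {b: chr(c) for b, c in zip(byte_values, code_points)}


table = bytes_to_unicode_sketch()
print(table[ord(" ")])   # Ġ (U+0120) stands for a space
print(table[ord("\n")])  # Ċ (U+010A) stands for a newline
print(table[ord("a")])   # printable bytes map to themselves
```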


## 6️⃣ Train a New Tokenizer, Adapted to Code

We'll use train_new_from_iterator() to retrain the vocabulary for our specific corpus.


In [None]:
# Train a tokenizer with a vocab size of 52,000 (recommended for large corpora)
tokenizer = old_tokenizer.train_new_from_iterator(training_corpus, 52000)
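
To make "retraining the vocabulary" concrete, here is a toy sketch of the idea behind BPE training: repeatedly count adjacent symbol pairs and merge the most frequent one. It is a simplified illustration on a hypothetical three-word corpus, not what `train_new_from_iterator()` runs internally (the real trainer works at the byte level inside the 🤗 Tokenizers library). The helper names `count_pairs` and `merge_pair` are made up for this example.

```python
from collections import Counter


def count_pairs(words):
    """Count adjacent symbol pairs, weighted by word frequency."""
    pairs = Counter()
    for symbols, freq in words.items():
        for left, right in zip(symbols, symbols[1:]):
            pairs[(left, right)] += freq
    return pairs


def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        merged[tuple(out)] = merged.get(tuple(out), 0) + freq
    return merged


# Hypothetical corpus: each word split into characters, with its frequency
words = {tuple("self"): 10, tuple("elif"): 4, tuple("else"): 3}

merges = []
for _ in range(2):
    pairs = count_pairs(words)
    best = max(pairs, key=pairs.get)  # most frequent adjacent pair
    merges.append(best)
    words = merge_pair(words, best)

print(merges)  # learned merge rules, in order
print(words)   # corpus rewritten with the merged symbols
```

Because the procedure only counts and merges, running it twice on the same corpus gives the same merges, which is why a code corpus produces code-specific tokens.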

## 7️⃣ Try the New Tokenizer on the Same Code

Does it do better with indentation, underscores, and other Python syntax? Let's check!


In [None]:
tokens = tokenizer.tokenize(example)
print(tokens)  # More compact, domain-aware tokens
print(len(tokens))  # Should be fewer than before (more efficient subwords)
print(len(old_tokenizer.tokenize(example)))  # Token count with the old tokenizer, for comparison
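
A single example is a thin basis for comparison. As a hedged sketch of a slightly broader check, the block below averages token counts over several snippets. The helper `mean_token_count` is hypothetical and accepts any callable that maps a string to a list of tokens. The two stand-in tokenizers are assumptions that exist only so the block runs without downloads.

```python
def mean_token_count(tokenize, texts):
    """Average number of tokens that `tokenize` produces per text."""
    if not texts:
        raise ValueError("texts must not be empty")
    return sum(len(tokenize(text)) for text in texts) / len(texts)


# Stand-in tokenizers (for illustration only):
# one token per character vs. one token per whitespace-separated word
char_level = list
word_level = str.split

snippets = ["def add(a, b):", "return a + b"]

old_mean = mean_token_count(char_level, snippets)
new_mean = mean_token_count(word_level, snippets)
print(old_mean, new_mean)
print(f"ratio: {new_mean / old_mean:.2f}")  # below 1.0 means fewer tokens on average
```

With the real tokenizers, the same helper could be called as `mean_token_count(old_tokenizer.tokenize, texts)` and `mean_token_count(tokenizer.tokenize, texts)`.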

## 8️⃣ Test on Another Code Example

Check its handling of indents, underscores, camel case, and more.


In [None]:
example2="""class LinearLayer():
    def __init__(self, input_size, output_size):
        self.weight = torch.randn(input_size, output_size)
        self.bias = torch.zeros(output_size)

    def __call__(self, x):
        return x @ self.weight + self.bias
    """
print(tokenizer.tokenize(example2))

## 9️⃣ Save and Share the Tokenizer

Preserve your trained tokenizer for future work, sharing, or fine-tuning.


In [None]:
tokenizer.save_pretrained("code-search-net-tokenizer")

## 🔟 Push the Tokenizer to the Hugging Face Hub

Log in to the Hub, then upload the tokenizer so anyone can reuse it.


In [None]:
from huggingface_hub import notebook_login
notebook_login()


In [None]:
tokenizer.push_to_hub("code-search-net-tokenizer")

## 1️⃣1️⃣ Load your Tokenizer Anywhere

Anyone can now load the tokenizer from the Hub:


In [None]:
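from transformers import AutoTokenizer

# Replace "your-username" with the Hub namespace the tokenizer was pushed to
tokenizer = AutoTokenizer.from_pretrained("your-username/code-search-net-tokenizer")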