<a href="https://colab.research.google.com/github/OE-LUCIFER/youtube-video/blob/main/pretrain%20LLM%20from%20scratch/tokenizer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Modify an existing open-source tokenizer

**This line installs the NLP packages `datasets` and `transformers` (with its `sentencepiece` extra).**

In [16]:
!pip install datasets transformers[sentencepiece]



***Dataset Loading***

In [17]:
from datasets import load_dataset

Here I am using my own dataset, but you can use any dataset you like.

In [18]:
dataset = load_dataset("OEvortex/Vortex-50k", split="train")

We can take a look at the dataset, which has **50000** examples:

In [19]:
dataset

Dataset({
    features: ['instruction', 'input', 'output'],
    num_rows: 50000
})

In [20]:
dataset[1]

{'instruction': 'A text is given in Bengali. Translate it from the Bengali language to the Malayalam language. The translation must not omit or add information to the original sentence.\n\nইউরিয়ার অপব্যবহার রুখতে ১০০ শতাংশ হারে ইউরিয়ার ওপর নিমের প্রলেপ দেওয়া হচ্ছে।',
 'input': '',
 'output': 'യൂറിയയുടെ ലഭ്യത ഉറപ്പുവരുത്തുന്നതിനും അതിന്റെ ദുരുപയോഗം തടയുന്നതിനും 100 ശതമാനം വേപ്പണ്ണപുരട്ടിയ യൂറിയ ലഭ്യമാക്കി.'}

In [21]:
dataset[:10]

{'instruction': ['You are given a dialogue between two people. output their relationship status. here is an example:\n\nNow complete the following instance -\nInput: Dialogue:\n- Agent: Hi, how can I help you?\n- Customer: I am looking for a new pair of shoes.\nOutput:',
  'A text is given in Bengali. Translate it from the Bengali language to the Malayalam language. The translation must not omit or add information to the original sentence.\n\nইউরিয়ার অপব্যবহার রুখতে ১০০ শতাংশ হারে ইউরিয়ার ওপর নিমের প্রলেপ দেওয়া হচ্ছে।',
  'In this task, you are given a list of words. The task is to find the word that has the most occurrences in the list. If there is a tie, return all of the tied words in alphabetical order, separated by commas.\nThe output should be a string containing the word with the most occurrences or a comma-separated string of tied words (in alphabetical order).\n\nConstraints,Pedestrian,Car,Bicycle,Motorcycle.',
  '¡Juguemos! Elige secretamente un número del 1 al 100, y no m

In [22]:
batch_size = 20000

# In-memory alternative to the iterator: a list of batches for each column.
all_instructions = [dataset[i : i + batch_size]["instruction"] for i in range(0, len(dataset), batch_size)]
all_outputs = [dataset[i : i + batch_size]["output"] for i in range(0, len(dataset), batch_size)]

def batch_iterator():
    """Yield one "instruction output" string per example, reading the dataset in batches."""
    for i in range(0, len(dataset), batch_size):
        batch = dataset[i : i + batch_size]  # slice once; returns a dict of column lists
        for instruction, output in zip(batch["instruction"], batch["output"]):
            yield instruction + " " + output
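`train_new_from_iterator` also accepts an iterator that yields batches (lists of strings) instead of single strings. The following is a minimal sketch of that variant. It uses a small dict of lists as a stand-in for the real dataset, since slicing a `datasets.Dataset` returns a dict of column lists as well. The name `iter_text_batches` and the toy data are hypothetical and not part of the original notebook.

```python
def iter_text_batches(columns, batch_size):
    """Yield lists of "instruction output" strings, batch_size rows at a time.

    `columns` is a dict of equal-length lists, the same shape that slicing
    a `datasets.Dataset` returns.
    """
    n_rows = len(columns["instruction"])
    for start in range(0, n_rows, batch_size):
        instructions = columns["instruction"][start : start + batch_size]
        outputs = columns["output"][start : start + batch_size]
        yield [f"{ins} {out}" for ins, out in zip(instructions, outputs)]


# Toy stand-in data (hypothetical, not taken from Vortex-50k).
toy_columns = {
    "instruction": ["Translate to French: cat", "Add 2 and 3", "Name a color"],
    "output": ["chat", "5", "blue"],
}

batches = list(iter_text_batches(toy_columns, batch_size=2))
print(batches)
```

Yielding whole batches keeps only one slice of the dataset in memory at a time, which matters more as the corpus grows.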


**Here I am using a modified GPTNeoX tokenizer**

In [23]:
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("OEvortex/HelpingAI-3B")

Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.


**Check whether this tokenizer is a fast tokenizer; `train_new_from_iterator` is only available on fast tokenizers.**

In [24]:
tokenizer.is_fast

True

Then we feed the training corpus (either the list of lists or the iterator we defined earlier) to the `train_new_from_iterator` method. We also have to specify the vocabulary size we want to use:

In [25]:
new_tokenizer = tokenizer.train_new_from_iterator(batch_iterator(), vocab_size=25000)
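One way to judge the retrained tokenizer is to count how many tokens each tokenizer needs for the same texts; fewer tokens usually suggests the vocabulary fits the corpus better. The helper below is only a sketch. `count_tokens`, `whitespace_tok`, and `char_tok` are hypothetical names, and the two stand-in tokenizers merely mimic the `{"input_ids": [...]}` shape that a Hugging Face tokenizer returns for a list of texts, so the block runs without downloading anything.

```python
def count_tokens(tok, texts):
    """Total number of token ids `tok` produces for a list of texts."""
    return sum(len(ids) for ids in tok(texts)["input_ids"])


# Stand-in tokenizers (hypothetical) that mimic the batch-call output shape.
def whitespace_tok(texts):
    # One fake id per whitespace-separated word.
    return {"input_ids": [list(range(len(t.split()))) for t in texts]}


def char_tok(texts):
    # One fake id per character.
    return {"input_ids": [list(range(len(t))) for t in texts]}


samples = ["hello world", "retrain the tokenizer"]
print(count_tokens(whitespace_tok, samples))  # 2 + 3 words
print(count_tokens(char_tok, samples))        # 11 + 21 characters
```

In the notebook, calling `count_tokens(tokenizer, dataset[:100]["instruction"])` and `count_tokens(new_tokenizer, dataset[:100]["instruction"])` would give the two totals to compare.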

Test the new tokenizer:

In [26]:
new_tokenizer(dataset[:30]["instruction"])

{'input_ids': [[952, 368, 551, 260, 5077, 963, 658, 788, 16, 848, 618, 2296, 5297, 16, 1554, 304, 279, 1183, 28, 201, 201, 2903, 1338, 266, 611, 2335, 452, 201, 701, 28, 413, 23645, 28, 201, 15, 6392, 28, 6336, 14, 716, 409, 320, 757, 328, 33, 201, 15, 3501, 28, 320, 726, 1779, 333, 260, 748, 2574, 295, 6626, 16, 201, 718, 28], [35, 899, 304, 551, 288, 6630, 16, 2843, 356, 464, 266, 6630, 888, 291, 266, 9721, 888, 16, 363, 1794, 1211, 427, 3842, 352, 883, 882, 291, 266, 1697, 576, 16, 201, 201, 7931, 530, 234, 1632, 2304, 3165, 13450, 1632, 12713, 7061, 4604, 1880, 3165, 4604, 11105, 1552, 1632, 15832, 4922, 12309, 3254, 1733, 15357, 103, 20996, 20996, 14431, 3254, 1552, 11106, 8036, 10202, 1552, 1632, 1733, 1051, 232, 530, 234, 1632, 2304, 3165, 13450, 1632, 19186, 7061, 1632, 9328, 2304, 4395, 1733, 1632, 5884, 1880, 1632, 3984, 1733, 7061, 9084, 1733, 13846, 3165, 13450, 10202, 11329, 1880, 7283, 15729], [495, 412, 572, 14, 328, 368, 551, 260, 762, 295, 1109, 16, 363, 572, 304, 291,

Save the tokenizer:

In [27]:
new_tokenizer.save_pretrained("HelpingAI")

('HelpingAI/tokenizer_config.json',
 'HelpingAI/special_tokens_map.json',
 'HelpingAI/tokenizer.json')
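Before uploading, it can be worth confirming that the files listed above were actually written. This is a small sketch using only the standard library. `missing_tokenizer_files` is a hypothetical helper, and the expected file names are simply the three that `save_pretrained` reported above; other tokenizers may write a different set.

```python
from pathlib import Path

# File names taken from the save_pretrained output shown above.
EXPECTED_FILES = ("tokenizer_config.json", "special_tokens_map.json", "tokenizer.json")


def missing_tokenizer_files(directory):
    """Return the expected tokenizer files that are absent from `directory`."""
    root = Path(directory)
    return [name for name in EXPECTED_FILES if not (root / name).is_file()]


# An empty list means all three files exist in the directory.
print(missing_tokenizer_files("HelpingAI"))
```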

**Pushing the new tokenizer to the Hugging Face Hub**

In [28]:
from huggingface_hub import notebook_login

notebook_login()


Git LFS must be installed:

In [29]:
!apt install git-lfs

Reading package lists... Done
Building dependency tree... Done
Reading state information... Done
git-lfs is already the newest version (3.0.2-1ubuntu0.2).
0 upgraded, 0 newly installed, 0 to remove and 39 not upgraded.


In [30]:
new_tokenizer.push_to_hub("HelpingAI")


CommitInfo(commit_url='https://huggingface.co/Abhaykoul/HelpingAI/commit/504224b6cc99368217e2f7ddfea445604a0ef95e', commit_message='Upload tokenizer', commit_description='', oid='504224b6cc99368217e2f7ddfea445604a0ef95e', pr_url=None, pr_revision=None, pr_num=None)