**Tokenizer**
-------------

**What is a Tokenizer?**

A tokenizer is a tool that breaks text into smaller, manageable pieces called tokens. Tokens can be words, subwords, or even characters. For example, a tokenizer might split the sentence "I love programming" into tokens like:

    Word tokens: ["I", "love", "programming"]
    Char tokens: ["I", " ", "l", "o", "v", "e", " ", "p", "r", "o", "g", "r", "a", "m", "m", "i", "n", "g"]

These tokens are the basic units of text that a computer can work with when performing tasks like language processing or model training.
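
The two splits shown above can be reproduced with plain Python. This is only a minimal sketch of word-level and character-level splitting, not how a trained tokenizer works internally:

```python
sentence = "I love programming"

# Word-level tokens: split the sentence on whitespace.
word_tokens = sentence.split()

# Character-level tokens: every character, including each space, is a token.
char_tokens = list(sentence)

print("Word tokens:", word_tokens)
print("Char tokens:", char_tokens)
```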

-------------------------------------------------------------------------------------------------------------------

**What is the Role of a Tokenizer in Building a Large Language Model (LLM)?**

When you’re creating a Large Language Model (LLM), such as the ones used in chatbots, translations, or text generation, the model doesn't directly understand full sentences or words. It processes tokens instead. Here’s why the tokenizer is important:

    1. Breaking Down Text: Text comes in as sentences or paragraphs, but an LLM needs to process it in smaller chunks. The tokenizer breaks the text into tokens, which the model can then learn from.

    2. Mapping to Numbers: Computers work with numbers, not words. The tokenizer assigns each token a unique number (an ID), creating a mapping between tokens and numbers. For example, the word "I" might be mapped to the number 5, "love" to 12, and "programming" to 53.

    3. Handling Out-of-Vocabulary Words: Real text often contains new or unfamiliar words. A good tokenizer breaks unknown words into smaller parts (like subwords or characters) so the model can still process them.
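
The mapping and fallback ideas in the list above can be sketched with a small dictionary. The IDs for "I", "love", and "programming" follow the example in the text; the `<unk>` entry, the single-letter entries, and the `encode` helper are assumptions made up for illustration:

```python
# Toy vocabulary. Only the first three word IDs come from the example in the
# text; "<unk>" and the single-letter entries are hypothetical.
vocab = {"<unk>": 0, "I": 5, "love": 12, "programming": 53,
         "c": 60, "o": 61, "d": 62, "e": 63}

def encode(words, vocab):
    """Map each word to its ID, falling back to character IDs for unknown words."""
    ids = []
    for word in words:
        if word in vocab:
            ids.append(vocab[word])
        else:
            # Out-of-vocabulary word: break it into characters instead.
            ids.extend(vocab.get(ch, vocab["<unk>"]) for ch in word)
    return ids

print(encode(["I", "love", "programming"], vocab))  # [5, 12, 53]
print(encode(["I", "love", "code"], vocab))         # [5, 12, 60, 61, 62, 63]
```

A real subword tokenizer learns its pieces from data rather than falling straight back to characters, but the principle is the same: an unknown word is still turned into known IDs.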
    
---------------------------------------------------------------------------------------------------------------

**Why Do We Need to Create a Sinhala Tokenizer for This Task?**

Since you’re working on building an LLM for Sinhala/Tamil, here’s why a tokenizer specific to Sinhala is important:

    1. Language-Specific Rules: Sinhala has its own set of characters, rules for how words are formed, and punctuation. A generic tokenizer designed for English won’t handle Sinhala properly. For example, it might not recognize Sinhala letters or might split words incorrectly. A Sinhala-specific tokenizer is designed to understand these special rules.

    2. Improving Model Performance: When you train an LLM on Sinhala text, you want the model to learn meaningful patterns from the language. A properly trained tokenizer ensures that the text is broken down into tokens that make sense for Sinhala, improving the model's ability to generate, understand, and work with Sinhala language.

    3. Handling Complex Words: Sinhala has compound words and affixes that modify words. A good tokenizer will break down these words into meaningful subwords or units, allowing the model to better capture the richness of the language.

    4. Efficient Learning: The tokenizer helps the model focus on learning the structure of the language rather than getting confused by large chunks of unprocessed text. For example, a common word such as "අනේ" might be kept as a single token, so the model sees it as one unit rather than as arbitrary fragments.
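
One concrete reason Sinhala needs care: vowel signs are stored as separate combining code points after the consonant they belong to, so splitting text naively into code points can detach a vowel sign from its letter. The sketch below uses Python's standard `unicodedata` module to show this; `describe` is a hypothetical helper name:

```python
import unicodedata

def describe(word):
    """Return (code point, Unicode category) for each character in word."""
    return [(f"U+{ord(ch):04X}", unicodedata.category(ch)) for ch in word]

word = "අනේ"
for code_point, category in describe(word):
    # Categories starting with "L" are letters; those starting with "M" are
    # combining marks such as vowel signs.
    print(code_point, category)
```

A subword tokenizer trained on Sinhala text tends to learn pieces that keep such marks attached to their consonants, because those combinations occur together so often.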

------------------------------------------------------------------------------------------------------------------    


*Now let's create a simple tokenizer*
-----------------------------------

-------------------------------------------------------------------------------------------------------------------






**Steps to Create and Train a Sinhala Tokenizer**

*Step 1: Install Required Libraries*
    
To get started, you’ll need to install ```sentencepiece```, which is a popular library for creating tokenizers, especially for tasks like training large language models (LLMs).

In [None]:
pip install sentencepiece

*Step 2: Get a Large Sinhala Text Corpus*

To train a tokenizer, you need a lot of text in Sinhala. This text will help the tokenizer learn how to split sentences into meaningful tokens. You can gather text from:

    Sinhala Wikipedia
    Sinhala news websites
    Sinhala books (if you can find any in digital format)

Save this collection of text as a single .txt file, for example sinhala_corpus.txt.
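
If your text is spread across several files, a small script can merge them into one corpus file. SentencePiece reads its input as raw text with one sentence per line, so the sketch below writes one cleaned, non-empty line per row and drops exact duplicates. The `build_corpus` helper is hypothetical and not part of any library:

```python
from pathlib import Path

def build_corpus(source_paths, output_path="sinhala_corpus.txt"):
    """Merge text files into one corpus file and return the number of lines kept."""
    seen = set()
    kept = 0
    with open(output_path, "w", encoding="utf-8") as out:
        for path in source_paths:
            for line in Path(path).read_text(encoding="utf-8").splitlines():
                line = " ".join(line.split())  # collapse runs of whitespace
                if line and line not in seen:  # skip blanks and exact duplicates
                    seen.add(line)
                    out.write(line + "\n")
                    kept += 1
    return kept
```

You might call it as `build_corpus(["wikipedia.txt", "news.txt"])`, where the input file names are placeholders for wherever you saved your text.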

*Step 3: Train the Tokenizer Using SentencePiece*

Now, use SentencePiece to train your tokenizer. The idea is to build a model that knows how to break Sinhala text into tokens like words, subwords, or characters.

In [4]:
import sentencepiece as spm

# Train the SentencePiece model with a smaller vocabulary size
spm.SentencePieceTrainer.train(
    '--input=sinhala_corpus.txt --model_prefix=sinhala_tokenizer --vocab_size=4195 --model_type=bpe'
)

    --input=sinhala_corpus.txt:
This is the path to your Sinhala text file.

    --model_prefix=sinhala_tokenizer:
This specifies the prefix for the output files (the trained model and vocabulary). You will get two files: sinhala_tokenizer.model and sinhala_tokenizer.vocab.

    --vocab_size=4195:
This sets the size of your vocabulary (the number of unique tokens). The code above uses a small value, 4195; larger corpora can support larger vocabularies such as 32000. SentencePiece reports an error if the requested size is larger than the corpus can support.

    --model_type=bpe:
This specifies the type of tokenizer you want to build. BPE (Byte-Pair Encoding) is a common choice for subword tokenization.
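
To give a feel for what BPE does during training, here is a toy version in plain Python: it starts from single characters and repeatedly merges the most frequent adjacent pair. This is a simplified sketch on English words for readability, not SentencePiece's actual implementation, and all function names are made up for illustration:

```python
from collections import Counter

def most_frequent_pair(words):
    """Return the most frequent adjacent symbol pair, weighted by word count."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(words, pair):
    """Replace every occurrence of the pair with one merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def learn_bpe(corpus_words, num_merges):
    """Learn up to num_merges merge rules from a list of words."""
    words = Counter(tuple(w) for w in corpus_words)  # each word as characters
    merges = []
    for _ in range(num_merges):
        pair = most_frequent_pair(words)
        if pair is None:
            break
        merges.append(pair)
        words = merge_pair(words, pair)
    return merges

print(learn_bpe(["low", "low", "lower", "lowest"], num_merges=2))
```

After two merges the frequent stem "low" has become a single symbol, which is the kind of subword unit a larger vocabulary size lets the tokenizer keep.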

-------------------------------------------------------------------------------------------------------------------

***Use the Trained Tokenizer***

Once the tokenizer is trained, you can use it to tokenize new Sinhala text. Here’s how you can do that in Python:

In [None]:
import sentencepiece as spm

# Load the trained tokenizer model
sp = spm.SentencePieceProcessor(model_file='sinhala_tokenizer.model')

# Sample Sinhala sentence
sinhala_sentence = "සංජුල ඔබට කෙසේද? මම නම් කාර්යබහුලයි."

# Tokenize the sentence
tokens = sp.encode(sinhala_sentence, out_type=str)
print("Tokens:", tokens)

*Step 4: Use the Tokenizer to Train an LLM*

After creating your tokenizer, you can use it to preprocess text data for training a Large Language Model (LLM). The tokens generated by the tokenizer will be fed into the LLM, helping it learn the structure and patterns of Sinhala.

*Step 5: Fine-Tune or Train Models*

Once your tokenizer is ready and you have tokenized your text, you can use popular deep learning libraries like TensorFlow or PyTorch to train your LLM using Transformer-based architectures such as BERT or GPT.

**Here’s a basic overview of how you might proceed:**

    1. Tokenize Text: Use the trained tokenizer to convert Sinhala sentences into tokens.
    2. Feed Tokens to the LLM: Train your model on these tokens.
    3. Evaluate: Test the model on unseen Sinhala text to see how well it understands and generates Sinhala.
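
As a rough illustration of the "feed tokens to the LLM" step, the sketch below slices a sequence of token IDs into input/target pairs for next-token prediction, which is how GPT-style models are usually trained (BERT-style models use a masked-token objective instead). The IDs, the `block_size` value, and the helper name are made up for illustration:

```python
def make_training_examples(token_ids, block_size):
    """Slice token IDs into (input, target) pairs; targets are shifted by one."""
    examples = []
    for start in range(0, len(token_ids) - block_size, block_size):
        inputs = token_ids[start : start + block_size]
        targets = token_ids[start + 1 : start + block_size + 1]
        examples.append((inputs, targets))
    return examples

token_ids = [5, 12, 53, 7, 9, 31, 2]  # made-up IDs standing in for tokenizer output
for inputs, targets in make_training_examples(token_ids, block_size=3):
    print(inputs, "->", targets)
```

Each target list is the input list shifted one position to the right, so the model learns to predict the next token at every position.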

-------------------------------------------------------------

***Example of Connecting the Tokenizer to an LLM in Python***

-------------------------------------------------------------


In [None]:
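from transformers import AutoModel, AutoTokenizer

# Load the tokenizer. A raw string keeps the backslashes literal; without it,
# "\U" in the Windows path is an invalid escape and Python raises a SyntaxError.
# If AutoTokenizer cannot load a bare .model file, point it at a directory
# containing a tokenizer saved in Hugging Face format instead.
tokenizer = AutoTokenizer.from_pretrained(r"C:\F-DRIVE\GIT\UCSC LLM\Train Tokenizer\sinhala_tokenizer.model")  # path/to/sinhala_tokenizer.model

# Tokenize input text and return PyTorch tensors, which the model expects
sinhala_sentence = "සංජුල ඔබට කෙසේද? මම නම් කාර්යබහුලයි."
tokens = tokenizer(sinhala_sentence, return_tensors="pt")
print("Token IDs:", tokens['input_ids'])

# Load a pretrained LLM (for example, a BERT model).
# The model must have been trained with this tokenizer's vocabulary;
# otherwise the token IDs will not match the model's embeddings.
model = AutoModel.from_pretrained("C:/F-DRIVE/GIT/UCSC LLM/Pretrained Models/sinhala_bert")  # path/to/pretrained_sinhala_model

# Pass the token IDs into the LLM
outputs = model(**tokens)
print(outputs)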


-------------------------------------------------------------------------------------------------------------------

***Summary of Files:*** 

```.model file:```  The core file that connects your tokenizer to the LLM. It stores the tokenization logic together with the vocabulary, so it is the only file needed to load the tokenizer.

```.vocab file:``` A plain-text list of the tokens in the vocabulary, useful for inspecting what the tokenizer learned. It is optional when loading the tokenizer, because the .model file already handles both tokenization and the token-to-ID mapping.

***Recap of Workflow:***

1. Tokenizer's .model file: Used to preprocess the text (split into tokens and convert to numerical IDs).
2. LLM: Processes the numerical IDs, learns patterns, and generates responses.
3. Tokenizer: Converts the output IDs back to readable text.

-------------------------------------------------------------------------------------------------------------------