- Tokens are an important part of NLP: they are the units a model uses to process natural-language text.
- Text is broken down into tokens before it is passed to the model.
- Context length is a frequently discussed characteristic of language models. For example, the GPT-3.5 model has a context length of 4096 tokens, covering both the tokens in the prompt and the subsequent completion.
- Because of this constraint, it is advisable to be mindful of token usage when making requests to language models.
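
Because the prompt and the completion share one budget, it can help to check how much room is left before sending a request. The sketch below is only an illustration: `max_completion_tokens` is a hypothetical helper, and the 4096 limit is the GPT-3.5 figure quoted above, so other models would need a different value.

```python
# Context window quoted in the text for GPT-3.5; an assumption for any other model.
CONTEXT_LENGTH = 4096

def max_completion_tokens(prompt_tokens: int, context_length: int = CONTEXT_LENGTH) -> int:
    """Return how many tokens remain for the completion after the prompt."""
    if prompt_tokens < 0:
        raise ValueError("prompt_tokens must be non-negative")
    # A prompt longer than the window leaves no room at all.
    return max(context_length - prompt_tokens, 0)

print(max_completion_tokens(3000))  # 1096
print(max_completion_tokens(5000))  # 0
```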

**What are tokens? How are they created?**

Let us take the example 'This is tokenizing':

    1. Character level = 'T', 'h', 'i', 's', 'i', 's', ...
    2. Word level = 'This', 'is', 'tokenizing'
    3. Subword level = 'This', 'is', 'token', 'izing'
    
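**Of the three, subword-level tokenization is generally found to be the most effective: it needs far fewer tokens than character-level tokenization while avoiding the very large vocabulary and unknown-word problems of word-level tokenization.**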
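
The three granularities can be compared on the example sentence. This is a minimal sketch: `TOY_VOCAB` is hand-picked for this sentence and `subword_split` is a hypothetical helper that uses greedy longest-match, which is a simplification and not how a trained BPE tokenizer applies its merges.

```python
text = "This is tokenizing"

# Character level: every character is a token (spaces dropped here for simplicity).
char_tokens = [ch for ch in text if ch != " "]

# Word level: split on whitespace.
word_tokens = text.split()

# Subword level: greedy longest-match against a toy, hand-picked vocabulary.
TOY_VOCAB = {"This", "is", "token", "izing"}

def subword_split(word, vocab):
    pieces, start = [], 0
    while start < len(word):
        # Try the longest candidate first and shrink until a match is found.
        for end in range(len(word), start, -1):
            if word[start:end] in vocab:
                pieces.append(word[start:end])
                start = end
                break
        else:
            # No vocabulary entry matches: fall back to a single character.
            pieces.append(word[start])
            start += 1
    return pieces

subword_tokens = [p for w in word_tokens for p in subword_split(w, TOY_VOCAB)]

print(len(char_tokens), char_tokens)
print(len(word_tokens), word_tokens)
print(len(subword_tokens), subword_tokens)
```

On this sentence the character-level split yields 16 tokens, the word-level split 3, and the subword split 4.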

Encoding methods used at the subword level:

1. Byte pair encoding (BPE)
2. WordPiece
3. SentencePiece

## Byte pair encoding 
### Iterative Process to Find Common Words:
**Ultimate goal** - find a set of tokens that represents the entire text in the fewest possible tokens.
- We are trying to find the most common words or parts of words in a collection of texts.
- The process starts by splitting the text into individual characters and counting how often each adjacent pair of symbols appears.
- The most frequent pair is then merged into a single, larger symbol, and the counting and merging repeat until the vocabulary reaches the desired size.
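
The iterative process can be sketched in a few lines of Python. This is a toy illustration rather than a production tokenizer: the function names (`get_pair_counts`, `merge_pair`, `train_bpe`) and the tiny corpus are made up for the example, and real implementations such as GPT-2's work on bytes and handle word boundaries more carefully.

```python
from collections import Counter

def get_pair_counts(words):
    """Count adjacent symbol pairs, weighted by how often each word occurs."""
    pairs = Counter()
    for symbols, freq in words.items():
        for a, b in zip(symbols, symbols[1:]):
            pairs[(a, b)] += freq
    return pairs

def merge_pair(words, pair):
    """Replace every occurrence of `pair` with a single merged symbol."""
    merged = {}
    for symbols, freq in words.items():
        out, i = [], 0
        while i < len(symbols):
            if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
                out.append(symbols[i] + symbols[i + 1])
                i += 2
            else:
                out.append(symbols[i])
                i += 1
        key = tuple(out)
        merged[key] = merged.get(key, 0) + freq
    return merged

def train_bpe(corpus, num_merges):
    """Learn up to `num_merges` merges from a whitespace-separated corpus."""
    word_freq = Counter(corpus.split())
    # Start with each word split into single characters.
    words = {tuple(w): f for w, f in word_freq.items()}
    merges = []
    for _ in range(num_merges):
        pairs = get_pair_counts(words)
        if not pairs:
            break
        best = max(pairs, key=pairs.get)  # most frequent adjacent pair
        merges.append(best)
        words = merge_pair(words, best)
    return merges, words

merges, words = train_bpe("low low low lower lowest", num_merges=2)
print(merges)  # [('l', 'o'), ('lo', 'w')]
print(words)
```

After two merges the frequent word "low" has become a single symbol, while "lower" and "lowest" are represented as "low" plus leftover characters.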

### Creating a Vocabulary:
- After finding these common parts of words, we create a big list of them, like a special dictionary.
- Each piece gets a number, like a unique code, starting from 0.
- This way, we have a list of pieces (words or parts of words) and their matching numbers.

### Making Words Understandable to Computers:

- Computers can only understand numbers, so we use this list of pieces to make the words understandable to computers.
- We create a map that connects the pieces to their matching numbers.
- This map is like a cheat sheet that helps the computer understand the words we use.

### Importance of the Vocabulary:

- This map is really important because without it, the computer wouldn't know what words the numbers represent.
- For example, if the computer gives us the number 3, we wouldn't know whether it means "apple," "dog," or something else.
- We need this map to translate the computer's numbers back into words that make sense.
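
The map and its reverse can be sketched with two dictionaries. The pieces below are a toy list chosen for illustration; a real tokenizer would learn them from data, and the names `token_to_id`, `id_to_token`, `encode`, and `decode` are hypothetical.

```python
# Toy pieces; a real tokenizer would learn these from a corpus.
pieces = ["This", "is", "token", "izing"]

# Assign each piece an integer ID, starting from 0.
token_to_id = {piece: idx for idx, piece in enumerate(pieces)}
# The reverse map is what lets us turn the model's numbers back into text.
id_to_token = {idx: piece for piece, idx in token_to_id.items()}

def encode(tokens):
    """Map a list of pieces to their IDs."""
    return [token_to_id[t] for t in tokens]

def decode(ids):
    """Map a list of IDs back to their pieces."""
    return [id_to_token[i] for i in ids]

ids = encode(["This", "is", "token", "izing"])
print(ids)          # [0, 1, 2, 3]
print(decode(ids))  # ['This', 'is', 'token', 'izing']
```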

### Vocabulary Size:

- The size of this map varies from model to model. Smaller vocabularies contain around 30,000 pieces.
- Larger models can use around 50,000 pieces; the GPT-2 tokenizer, for example, has 50,257.

## Tokens and cost

### Tokens and Cost Relationship:

- With APIs such as OpenAI's, the number of tokens used in a request is directly linked to the cost.
- The more tokens you use, the higher the cost of the service.

### Two Cost Categories:

- Prices are divided into two categories: tokens used in the initial prompt and tokens used in the model's generated output.
- Generating text (completion) generally incurs higher costs compared to processing the initial input prompt.

### Cost Example:

- As of the time of writing, GPT-4 is priced as follows:
  - $0.03 per 1,000 tokens for processing inputs.
  - $0.06 per 1,000 tokens for generating the model's output.
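
With these two rates, the cost of a single request is simple arithmetic. The sketch below hard-codes the prices quoted above, which change over time, and `estimate_cost` is a hypothetical helper rather than part of any API.

```python
# Prices quoted in the text (USD per 1,000 tokens); check current pricing before relying on them.
INPUT_PRICE_PER_1K = 0.03
OUTPUT_PRICE_PER_1K = 0.06

def estimate_cost(prompt_tokens, completion_tokens):
    """Return the estimated cost in USD of one request."""
    input_cost = prompt_tokens / 1000 * INPUT_PRICE_PER_1K
    output_cost = completion_tokens / 1000 * OUTPUT_PRICE_PER_1K
    return input_cost + output_cost

cost = estimate_cost(prompt_tokens=1500, completion_tokens=500)
print(f"${cost:.3f}")  # $0.075
```

Here 1,500 prompt tokens cost $0.045 and 500 completion tokens cost $0.030, so the completion is a third of the prompt's length but two thirds of its cost.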

### Precise Cost Calculation:

- You can use the `get_openai_callback` context manager from LangChain to get the token counts and the estimated cost of your requests.
- Alternatively, you can handle the tokenization process locally and keep track of the number of tokens, as shown in the next section.

### Rough Estimation:

- OpenAI generally considers 1,000 tokens to be approximately equivalent to 750 words.
- This helps provide a rough estimation of the number of tokens you're using based on the content's length.
- Together, these points show the direct connection between token usage and the cost of using LLMs such as GPT-4 through APIs such as OpenAI's.
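
The 750-words rule of thumb translates into a quick estimate when no tokenizer is at hand. This is a rough sketch only: `estimate_tokens` is a hypothetical helper, it counts words by splitting on whitespace, and the real count depends on the tokenizer and the language of the text.

```python
def estimate_tokens(text):
    """Estimate the token count, assuming 1,000 tokens per 750 words."""
    word_count = len(text.split())
    return round(word_count * 1000 / 750)

sample = "word " * 750
print(estimate_tokens(sample))                # 1000
print(estimate_tokens("This is tokenizing"))  # 4
```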

# Code

    from transformers import AutoTokenizer

    # Download and load the tokenizer
    tokenizer = AutoTokenizer.from_pretrained("gpt2")

The code snippet above downloads the tokenizer and loads its vocabulary, so you can use the `tokenizer` variable to encode and decode text. Before doing that, let's take a look at what the vocabulary contains.

    # get_vocab() returns a dict that maps each token string to its integer ID
    print(tokenizer.get_vocab())

    token_ids = tokenizer.encode('Text to test the tokenizer and mapping.')
    print('Tokens:', tokenizer.convert_ids_to_tokens(token_ids))
    print('Token IDs:', token_ids)

Output:

    Tokens: ['Text', 'Ġto', 'Ġtest', 'Ġthe', 'Ġtoken', 'izer', 'Ġand', 'Ġmapping', '.']
    Token IDs: [8206, 284, 1332, 262, 11241, 7509, 290, 16855, 13]

In the GPT-2 vocabulary, `Ġ` marks a token that begins with a space. The length of `token_ids` (9 here) is the token count for this text under this tokenizer.


# Shortcomings of Tokenizers

1. **Word Splitting Limitations:**
   - Tokenizers might struggle with languages where words are not separated by spaces, leading to incorrect splitting of words.
   - Languages with complex grammatical structures or agglutinative languages can pose challenges.

2. **Handling Special Characters:**
   - Tokenizers can mishandle special characters, symbols, or punctuation, leading to unexpected tokenization outcomes.
   - Contextual understanding of these characters can be lost.

3. **Ambiguity in Languages:**
   - Ambiguous words or phrases might be tokenized differently based on context, leading to misinterpretations.
   - Tokens might lose nuances present in the original language.

4. **Out-of-Vocabulary Words:**
   - Words that are not in the vocabulary are broken into several subword pieces (or, in word-level tokenizers, mapped to an unknown token), which increases the token count and makes the tokenized output harder to read.

5. **Loss of Morphological Information:**
   - Tokenization can obscure morphological information (prefixes, suffixes, etc.) because subword boundaries do not always align with morpheme boundaries; this matters in languages with rich inflectional systems.

6. **Entity Recognition and Segmentation:**
   - For tasks like named entity recognition, tokenization can hinder proper identification of entities that span multiple tokens.

7. **Handling URLs and Emails:**
   - Tokenizers can split URLs, email addresses, and other structured data into many separate tokens, disrupting their context.

8. **Token Length Variation:**
   - Token lengths can vary widely, impacting model performance and causing issues in fixed-length architectures.

9. **Dependency Parsing Challenges:**
   - Tokenizers might interfere with dependency parsing tasks by splitting words that are crucial for understanding relationships.

10. **Influence on Downstream Tasks:**
    - Poor tokenization can negatively affect the performance of downstream NLP tasks, leading to incorrect results.

11. **Language-Specific Challenges:**
    - Different languages have unique linguistic features, making it hard to create a universal tokenizer that performs well for all languages.

12. **Efficiency and Speed:**
    - Some tokenization processes can be computationally intensive and slow down the overall processing time.
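
The first shortcoming in the list is easy to reproduce with plain whitespace splitting. The snippet is only an illustration of word-level splitting, not of any particular library's tokenizer; the Japanese sentence means "I am a student" and is written without spaces.

```python
english = "I am a student"
japanese = "私は学生です"  # "I am a student", written without spaces

# Whitespace splitting works when words are separated by spaces...
print(english.split())   # ['I', 'am', 'a', 'student']
# ...but returns the whole sentence as one "word" when they are not.
print(japanese.split())  # ['私は学生です']
```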
