**NOTE:** The following information is based on the book "Build a Large Language Model (From Scratch)" by Sebastian Raschka. I am just taking notes, explaining some things further for myself when needed, and doing some coding.

In [1]:
import core
import importlib

importlib.reload(core)

<module 'core' from '/Users/nimasarajpoor/Desktop/LLM/HOW-IT-WORKS/core.py'>

## What is `word embedding`?

It simply means mapping a word to a point in n-dimensional Euclidean space. Let's denote the word-embedding process as `f`. Then:

$v = f(\text{word})$ <br>
$f: \text{words} \rightarrow R^{n}$


**Why do we need it?** <br>
Because a neural network (NN) computes on numerical values, we need to convert text to numerical values first.

**NOTE:**
The function above may not be completely accurate, but it conveys what `word embedding` does: it takes a word as input and returns an n-dim vector as output. This vector is the representation of the word. If several words are close in meaning, their corresponding points in $R^{n}$ should be close to each other. In fact, to measure the similarity of two words, I can compute the cosine of the angle between their vectors: <br>

$\text{SimilarityScore}(word_1, word_2) = \cos(v_1, v_2) = \frac{v_1 \cdot v_2}{\|v_1\| \, \|v_2\|}$
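
The formula can be checked with a few lines of plain Python. This is only a sketch: the function name `cosine_similarity` and the three vectors are made up for illustration and are not real embeddings.

```python
import math

def cosine_similarity(v1, v2):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(v1, v2))
    norm1 = math.sqrt(sum(a * a for a in v1))
    norm2 = math.sqrt(sum(b * b for b in v2))
    return dot / (norm1 * norm2)

# Made-up 3-dim vectors, for illustration only (not real embeddings)
v_cat = [1.0, 2.0, 0.5]
v_kitten = [2.0, 4.0, 1.0]  # same direction as v_cat, so the score is close to 1
v_car = [-2.0, 1.0, 0.0]    # orthogonal to v_cat, so the score is 0

print(cosine_similarity(v_cat, v_kitten))
print(cosine_similarity(v_cat, v_car))
```

Note that the score depends only on the direction of the vectors, not on their length, which is why `v_cat` and `v_kitten` score as (nearly) identical here.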

**NOTE: Are embeddings only for words? (RAG)** <br>

While word embeddings are the most common form of text embedding, there are also embeddings for sentences, paragraphs, or whole documents. Sentence or paragraph embeddings are popular choices for retrieval-augmented generation (RAG). RAG combines generation (like producing text) with retrieval (like searching an external knowledge base) to pull relevant information when generating text.

## How to obtain those word embeddings?

There are pre-trained models that provide word embeddings (e.g. `word2vec`). For an LLM, there is no need to use an extra model to obtain word embeddings. In an LLM, the word embedding is computed as part of the input layer, and its advantage is that it is optimized during training for the domain/task-specific data.

### Step 1. Tokenizing text

So far, we have learned that we need to obtain word embeddings, which is achieved by mapping each word to its corresponding point in $R^{n}$. Those vectors can be used as input to the LLM. But our data is often a text with several paragraphs / sentences, and not just a list of words. So, our first task is to understand how to take a text file and break it down into a list of words.

In [2]:
# download data

url = (
    "https://raw.githubusercontent.com/rasbt/"
    + "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
    + "the-verdict.txt"
)
file_path = "./data/the-verdict.txt"
core.download_textfile_from_url(url, file_path)

# check the data
!ls -1 ./data/

the-verdict.txt


In [3]:
# See the content of the file
raw_text = core.read_file(file_path)

print("Total number of character:", len(raw_text))
print('=' * 50)

N = 100
print(f'Printing the first {N} characters: \n', raw_text[:N])

Total number of character: 20479
==================================================
Printing the first 100 characters: 
 I HAD always thought Jack Gisburn rather a cheap genius--though a good fellow enough--so it was no g


Before using any pre-built tokenizer, let's have some fun and use `re`.

In [4]:
# Let's split at white space, comma, and period
# As an example, let's show it for the first N characters
import re

N = 150
re.split(r'[.,]|\s', raw_text[:N])

In [5]:
# let's get rid of empty strings
raw_text_split = re.split(r'[.,]|\s', raw_text)
raw_text_split = [x for x in raw_text_split if x != ""]

In [6]:
# Now let's try a more-complicated regex that can cover other special characters
# as well as double dashes

regex_pattern = r'([,.:;?_!"()\']|--|\s)'
tokens = core.tokenizer(raw_text, regex_pattern, remove_white_spaces=True)

print(f"showing top 10 tokens: \n", tokens[:10])

showing top 10 tokens: 
 ['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius']
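
`core.tokenizer` lives in core.py and is not shown in these notes. A minimal sketch of what such a helper might look like, assuming it simply wraps `re.split` (the name `simple_tokenizer` is hypothetical):

```python
import re

def simple_tokenizer(text, pattern, remove_white_spaces=True):
    # The capturing group in `pattern` makes re.split keep the delimiters
    pieces = re.split(pattern, text)
    if remove_white_spaces:
        # strip() turns whitespace-only pieces into "", which are dropped below
        pieces = [p.strip() for p in pieces]
    return [p for p in pieces if p != ""]

pattern = r'([,.:;?_!"()\']|--|\s)'
print(simple_tokenizer("Hello, world. Is this-- a test?", pattern))
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
```

Without the parentheses around the pattern, `re.split` would throw the punctuation away instead of returning it as separate tokens.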


One important note from the author is that we prefer not to change the case of the text. The goal is to prepare data that helps the model with its task, i.e. predicting the next token. Therefore, capital letters are kept, as they may help the model recognize the start of a sentence, people's names, etc.

### Step 2. Convert/Map `tokens` into `token IDs`

This is an intermediate step that needs to occur before obtaining the embedding vectors.

**Question:** Why do we need this step?

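Basically, we want to assign a unique integer ID to each unique token. The embedding layer can then use that integer to look up the vector that belongs to the token.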

In [7]:
unique_tokens = sorted(set(tokens))
vocabulary = {
    token: token_id 
    for token_id, token in enumerate(unique_tokens)
}

vocabulary

{'!': 0,
 '"': 1,
 "'": 2,
 '(': 3,
 ')': 4,
 ',': 5,
 '--': 6,
 '.': 7,
 ':': 8,
 ';': 9,
 '?': 10,
 'A': 11,
 'Ah': 12,
 'Among': 13,
 'And': 14,
 'Are': 15,
 'Arrt': 16,
 'As': 17,
 'At': 18,
 'Be': 19,
 'Begin': 20,
 'Burlington': 21,
 'But': 22,
 'By': 23,
 'Carlo': 24,
 'Chicago': 25,
 'Claude': 26,
 'Come': 27,
 'Croft': 28,
 'Destroyed': 29,
 'Devonshire': 30,
 'Don': 31,
 'Dubarry': 32,
 'Emperors': 33,
 'Florence': 34,
 'For': 35,
 'Gallery': 36,
 'Gideon': 37,
 'Gisburn': 38,
 'Gisburns': 39,
 'Grafton': 40,
 'Greek': 41,
 'Grindle': 42,
 'Grindles': 43,
 'HAD': 44,
 'Had': 45,
 'Hang': 46,
 'Has': 47,
 'He': 48,
 'Her': 49,
 'Hermia': 50,
 'His': 51,
 'How': 52,
 'I': 53,
 'If': 54,
 'In': 55,
 'It': 56,
 'Jack': 57,
 'Jove': 58,
 'Just': 59,
 'Lord': 60,
 'Made': 61,
 'Miss': 62,
 'Money': 63,
 'Monte': 64,
 'Moon-dancers': 65,
 'Mr': 66,
 'Mrs': 67,
 'My': 68,
 'Never': 69,
 'No': 70,
 'Now': 71,
 'Nutley': 72,
 'Of': 73,
 'Oh': 74,
 'On': 75,
 'Once': 76,
 'Only': 77,
 ...

Now that we have our `vocabulary`, we can use it to convert a sample text to its corresponding token ids.

In [8]:
# Example:
sample_txt = "this is awesome"
sample_tokens = core.tokenizer(sample_txt, regex_pattern)
sample_token_ids = [vocabulary.get(t, None) for t in sample_tokens]

sample_token_ids

[999, 584, None]

Note that we may encounter a word that is completely new and hence has no `token_id` in our vocabulary (here `awesome` maps to `None`). To handle this, we could use a placeholder ID such as `-1`, or keep updating our vocabulary (??). We may read more about this in the book. If not, let's google it later!

**NOTE:**
The author used OOP to create a class for the tokenizer. Why not a function? IIUC, this is because there are two attributes that are tied to each other, and we need two operations where one is the reverse of the other (`encode` and `decode`). So, it is about keeping track of shared state, and, in this case, a `class` seems to be a good option.
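
To make the point about tied attributes concrete, here is a minimal sketch of such a class. It is a hypothetical stand-in (`SimpleTokenTransform`), not the actual `core.TokenTransform`, which is not shown in these notes. The toy vocabulary is made up, and joining tokens with a single space in `decode` is an assumption.

```python
import re

class SimpleTokenTransform:
    """Hypothetical stand-in for core.TokenTransform (not the real implementation)."""

    def __init__(self, vocabulary, regex_pattern):
        self.token_to_id = vocabulary
        # Reverse mapping, needed by decode(); it must stay in sync with token_to_id
        self.id_to_token = {i: t for t, i in vocabulary.items()}
        self.regex_pattern = regex_pattern

    def encode(self, text):
        pieces = [p.strip() for p in re.split(self.regex_pattern, text)]
        # Raises KeyError for tokens that are not in the vocabulary
        return [self.token_to_id[p] for p in pieces if p]

    def decode(self, token_ids):
        # Assumption: tokens are simply joined with a single space
        return " ".join(self.id_to_token[i] for i in token_ids)

pattern = r'([,.:;?_!"()\']|--|\s)'
toy_vocab = {",": 0, ".": 1, "Gisburn": 2, "Mrs": 3, "said": 4}  # made-up vocabulary
tt = SimpleTokenTransform(toy_vocab, pattern)

ids = tt.encode("Mrs. Gisburn said,")
print(ids)             # [3, 1, 2, 4, 0]
print(tt.decode(ids))  # Mrs . Gisburn said ,
```

Both mappings are built once in `__init__` and shared by `encode` and `decode`, which is the kind of bookkeeping a plain function would have to redo or receive as extra arguments on every call.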

In [9]:
# Example
text = """It's the last he painted, you know,"
Mrs. Gisburn said with pardonable pride."""

token_transformer = core.TokenTransform(vocabulary, regex_pattern)

In [10]:
text_id = token_transformer.encode(text)
text_id

[56,
 2,
 850,
 988,
 602,
 533,
 746,
 5,
 1126,
 596,
 5,
 1,
 67,
 7,
 38,
 851,
 1108,
 754,
 793,
 7]

In [11]:
retrieved_text = token_transformer.decode(text_id)
retrieved_text

'It \' s the last he painted , you know , " Mrs . Gisburn said with pardonable pride .'

**Regarding:**

> Note that it is possible to encounter a word that is completely new and hence has no token_id in our vocabulary

The author pointed that out, and provided two notes: <br>
(1) Using a larger dataset will mitigate this issue <br>
(2) Even with (1), we may still face some new words. In such cases, special tokens can be defined to handle them.


To address (2), the class `TokenTransform` is enhanced, and its new version is added to core.py as `TokenTransformV2`.
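
`TokenTransformV2` is not shown in these notes. A minimal sketch of the idea, assuming the special tokens `<|endoftext|>` and `<|unk|>` are merged into the vocabulary before sorting (the helper names `build_vocabulary` and `encode_with_unk` and the tiny token list are hypothetical):

```python
import re

def build_vocabulary(tokens, special_tokens=("<|endoftext|>", "<|unk|>")):
    # Assumption: special tokens are added to the token set before sorting
    unique_tokens = sorted(set(tokens) | set(special_tokens))
    return {token: token_id for token_id, token in enumerate(unique_tokens)}

def encode_with_unk(text, vocabulary, pattern, unk_token="<|unk|>"):
    pieces = [p.strip() for p in re.split(pattern, text)]
    # Any token missing from the vocabulary falls back to the ID of unk_token
    return [vocabulary.get(p, vocabulary[unk_token]) for p in pieces if p]

pattern = r'([,.:;?_!"()\']|--|\s)'
vocab = build_vocabulary(["The", "is", "a", "."])  # made-up token list
id_to_token = {i: t for t, i in vocab.items()}

ids = encode_with_unk("The sky is blue", vocab, pattern)
print(ids)                                    # [3, 2, 5, 2]
print(" ".join(id_to_token[i] for i in ids))  # The <|unk|> is <|unk|>
```

The price of this approach is that decoding is lossy: every unknown word collapses into the same `<|unk|>` token, so the original text cannot be recovered.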

In [12]:
# Get vocabulary using text file
file_path = "./data/the-verdict.txt"
raw_text = core.read_file(file_path)

regex_pattern = r'([,.:;?_!"()\']|--|\s)'
tokens = core.tokenizer(raw_text, regex_pattern, remove_white_spaces=True)

unique_tokens = sorted(set(tokens))
vocabulary = {
    token: token_id 
    for token_id, token in enumerate(unique_tokens)
}

# Passing vocabulary to token_transformer
token_transformer = core.TokenTransformV2(vocabulary, regex_pattern)

text = "The sky is blue"
token_ids = token_transformer.encode(text)
token_ids

[95, 11, 586, 11]

In [13]:
token_transformer.decode(token_ids)

'The <|unk|> is <|unk|>'

## The tokenizer in GPT

In GPT, the tokenizer does not use a special token for out-of-vocabulary tokens. Instead, it uses `Byte Pair Encoding (BPE)`. To better understand BPE, the author suggests using the implementation that is already available in [tiktoken](https://github.com/openai/tiktoken).

In [14]:
!pip install tiktoken

import tiktoken
tiktoken.__version__



'0.9.0'

In [15]:
tokenizer = tiktoken.get_encoding("gpt2")
text = "THIS is a new sentence"

token_id = tokenizer.encode(text)
print('token_id: ', token_id)
print('=' * 50)

token = tokenizer.decode(token_id)
print('token: ', token)

token_id:  [43559, 318, 257, 649, 6827]
==================================================
token:  THIS is a new sentence


In [16]:
text = "A sentence wis Str@nge Qaracters"

token_ids = tokenizer.encode(text)
print('token_ids: ', token_ids)
print('=' * 50)

tokens = tokenizer.decode(token_ids)
print('tokens: ', tokens)

token_ids:  [32, 6827, 266, 271, 4285, 31, 77, 469, 1195, 283, 19858]
==================================================
tokens:  A sentence wis Str@nge Qaracters


Note that the number of `token_ids` is larger than the number of words. Let's see what each `token_id` is.

In [17]:
print('token_id --> token')
print('-' * 50)
for token_id in token_ids:
    print(f'{token_id} --> {tokenizer.decode([token_id])}')

token_id --> token
--------------------------------------------------
32 --> A
6827 -->  sentence
266 -->  w
271 --> is
4285 -->  Str
31 --> @
77 --> n
469 --> ge
1195 -->  Q
283 --> ar
19858 --> acters


In [18]:
text = "Akwirw ier"
token_ids = tokenizer.encode(text)
print('All token_ids: \n', token_ids)
print('=' * 50)

print('token_id --> text')
for token_id in token_ids:
    token_text = tokenizer.decode([token_id])
    print(f'{token_id} --> `{token_text}`')

All token_ids: 
 [33901, 86, 343, 86, 220, 959]
==================================================
token_id --> text
33901 --> `Ak`
86 --> `w`
343 --> `ir`
86 --> `w`
220 --> ` `
959 --> `ier`
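
The outputs above show unfamiliar words being split into smaller sub-word pieces instead of being mapped to an unknown token. As a rough illustration of how BPE arrives at such pieces, here is a toy sketch of the merge loop: start from single characters and repeatedly merge the most frequent adjacent pair. This is not tiktoken's implementation (the GPT-2 tokenizer works on bytes and applies a merge table learned in advance); the helper names and the sample string are made up for illustration.

```python
from collections import Counter

def most_frequent_pair(symbols):
    # Count every adjacent pair of symbols and return the most common one
    pairs = Counter(zip(symbols, symbols[1:]))
    return pairs.most_common(1)[0][0] if pairs else None

def merge_pair(symbols, pair):
    # Replace each occurrence of `pair` (scanning left to right) by one merged symbol
    merged, i = [], 0
    while i < len(symbols):
        if i + 1 < len(symbols) and (symbols[i], symbols[i + 1]) == pair:
            merged.append(symbols[i] + symbols[i + 1])
            i += 2
        else:
            merged.append(symbols[i])
            i += 1
    return merged

symbols = list("aaabdaaabac")  # toy "corpus", one symbol per character
for _ in range(3):             # three merge steps
    pair = most_frequent_pair(symbols)
    symbols = merge_pair(symbols, pair)
    print(pair, symbols)
```

After three merges the frequent chunk `aaab` has become a single symbol, while the rare characters stay as they are. The same principle explains why a common word like ` sentence` is one token above and `Qaracters` is broken into several.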


## Data sampling with sliding window

Suppose I have a sentence like this:

`This is my sentence` 

We can say: <br>
* input: `[This, is, my, sentence]`
* output: `[is, my, sentence, <END-OF-SENTENCE>]`

Then, LLM will be trained on the following (input, output) pairs:

* `This` --> `is`
* `This is` --> `my`
* `This is my` --> `sentence`

The interesting point made by the author is that the PyTorch tensors will look like those input/output lists. For instance, if my text is:

`This is the sentence in which I am providing some valuable information.`

and if the window size is set to four, then:

```
inputs = [
[This, is, the, sentence],
[in, which, I, am],
[providing, some, valuable, information],
]

outputs = [
[is, the, sentence, in],
[which, I, am, providing],
[some, valuable, information, <END-OF-TEXT>],
]
```

**NOTE1:**
IIUC, at some stage (?), LLM gets several (input, output) pairs from `inputs[0]` & `outputs[0]`. As mentioned above, those pairs are:
* `This` --> `is`
* `This is` --> `the`
* `This is the` --> `sentence`
* `This is the sentence` --> `in`

**NOTE2:** It is important to note that the list of inputs above was created by sliding the window 4 units at a time. However, one can change the step to a different number. For instance, sliding the window by 1 unit gives the following:

```
inputs = [
[This, is, the, sentence],
[is, the, sentence, in],
[the, sentence, in, which],
[sentence, in, which, I],
[in, which, I, am],
[which, I, am, providing],
[I, am, providing, some],
[am, providing, some, valuable],
[providing, some, valuable, information],
]
```

And the outputs can be obtained accordingly. 
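
The window logic can be sketched in plain Python, using words instead of token IDs for readability. The function name `sliding_window_pairs`, its parameter names, and the `<END-OF-TEXT>` marker appended to the token list are assumptions for illustration.

```python
def sliding_window_pairs(tokens, window_size, stride):
    inputs, targets = [], []
    # Stop early enough that every target window still has window_size tokens
    for start in range(0, len(tokens) - window_size, stride):
        inputs.append(tokens[start : start + window_size])
        # The target is the input window shifted one token to the right
        targets.append(tokens[start + 1 : start + window_size + 1])
    return inputs, targets

text = "This is the sentence in which I am providing some valuable information"
tokens = text.split() + ["<END-OF-TEXT>"]

# Sliding by 4 gives non-overlapping windows
inputs, targets = sliding_window_pairs(tokens, window_size=4, stride=4)
for x, y in zip(inputs, targets):
    print(x, "-->", y)

# The (input, output) pairs hidden inside the first row
for i in range(1, len(inputs[0]) + 1):
    print(inputs[0][:i], "-->", targets[0][i - 1])

# Sliding by 1 gives overlapping windows
inputs_s1, targets_s1 = sliding_window_pairs(tokens, window_size=4, stride=1)
print(len(inputs_s1))  # 9
```

A smaller stride produces more (overlapping) training rows from the same text, while a stride equal to the window size uses each token exactly once as an input.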

OK..I think it is time for me to stop and check out the appendix A of the book that provides some info about PyTorch. Then, I will come back to resume!

## Prep data for GPT

In [19]:
from torch.utils.data import Dataset, DataLoader

class GPTdataV1(Dataset):
    def __init__(self, txt):
        super().__init__()
        # Placeholders: to be filled with the sliding-window chunks of token IDs
        self.input_token_id = []
        self.target_token_id = []