# Working with text data

Here we will see how to prepare input text for training LLMs. This involves splitting the text into individual word and subword tokens, which can be encoded into vector representations for the LLM.

## 2.1 Understanding word embeddings

Deep neural network models, including LLMs, cannot process raw text directly. Therefore, we need a way to represent words as continuous-valued vectors.

The concept of converting data into a vector format is often referred to as embedding.

Word embeddings can have varying dimensions, from one to thousands.
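
To make the idea concrete, here is a minimal sketch of an embedding as a lookup table from words to continuous-valued vectors. The word list, the dimension of 3, and the random initialization are assumptions chosen for illustration only; in a trained model the vectors are learned rather than random.

```python
import random

random.seed(0)  # fixed seed so the toy vectors are reproducible

embedding_dim = 3  # assumed toy size; real models use far larger dimensions
toy_vocab = ["cat", "dog", "car"]  # hypothetical example words

# One continuous-valued vector per word, initialized randomly
embedding_table = {
    word: [random.uniform(-1.0, 1.0) for _ in range(embedding_dim)]
    for word in toy_vocab
}

for word, vector in embedding_table.items():
    print(word, [round(x, 3) for x in vector])
```

Looking up a word in `embedding_table` returns its vector, which is the form of input a neural network can work with.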

## 2.2 Tokenizing text

Let's see how we split input text into individual tokens, a required preprocessing step for creating embeddings for an LLM.

We start with a simple text and Python's `re.split` function to split the text while keeping the delimiters:

In [6]:
import re

text = "Hello, world. Is this-- a test?"
# The capturing group (...) keeps the delimiters in the result. It has three
# alternatives: a single punctuation character [...], a double dash --, or whitespace \s.
result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
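
The filtering step matters because `re.split` with a capturing group also returns the whitespace delimiters and empty strings between adjacent delimiters. The sketch below shows the raw split next to the cleaned result on a shorter sample string (the string and the variable names are chosen here for illustration):

```python
import re

sample = "Hello, world."
pattern = r'([,.:;?_!"()\']|--|\s)'

# Raw split: delimiters are kept, including spaces and empty strings
raw = re.split(pattern, sample)
print(raw)      # ['Hello', ',', '', ' ', 'world', '.', '']

# Drop whitespace-only and empty items
cleaned = [item.strip() for item in raw if item.strip()]
print(cleaned)  # ['Hello', ',', 'world', '.']
```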


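To work with a longer text, we download `the-verdict.txt` (unless it already exists locally) and then load it:

In [1]: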
import os
import requests

file_path = "the-verdict.txt"

if not os.path.exists(file_path):
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt"
    )
    response = requests.get(url, timeout=30)    # fail instead of hanging if the server does not respond
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)

In [2]:
with open(file_path, "r", encoding='utf-8') as f:
    raw_text = f.read()
print(f'Total number of characters: {len(raw_text)}')
print(raw_text[:99])

Now let's apply our basic tokenizer to the main text:

In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(f'Number of tokens in the text: {len(preprocessed)}')
print(f'Number of unique tokens in the text: {len(set(preprocessed))}')
print(f'First 30 tokens in the text:\n{preprocessed[:30]}')

Number of tokens in the text: 4690
Number of unique tokens in the text: 1130
First 30 tokens in the text:
['I', 'HAD', 'always', 'thought', 'Jack', 'Gisburn', 'rather', 'a', 'cheap', 'genius', '--', 'though', 'a', 'good', 'fellow', 'enough', '--', 'so', 'it', 'was', 'no', 'great', 'surprise', 'to', 'me', 'to', 'hear', 'that', ',', 'in']
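
Since the same two steps (split, then filter) are applied to both the sample sentence and the full text, they could be wrapped in a small helper. This is only a sketch; the names `TOKEN_PATTERN` and `tokenize` are hypothetical and do not appear in the cells above.

```python
import re

# Same pattern as above, compiled once for reuse
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text on punctuation, double dashes, and whitespace, keeping punctuation as tokens."""
    pieces = TOKEN_PATTERN.split(text)
    # Remove empty strings and whitespace-only items
    return [piece.strip() for piece in pieces if piece.strip()]

tokens = tokenize("Hello, world. Is this-- a test?")
print(tokens)
print(f'Number of tokens: {len(tokens)}')
print(f'Number of unique tokens: {len(set(tokens))}')
```

Calling `tokenize(raw_text)` would give the same list as `preprocessed` above.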


## 2.3 Converting tokens into IDs

Next, let's convert these tokens from Python strings to integers to produce the token IDs.

To do this we need to build a vocabulary. This defines how we map each unique token to a unique integer.

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f'Vocabulary size: {vocab_size}')

Vocabulary size: 1130


In [None]:
str_to_int = {token: i for i, token in enumerate(all_words)}    # map each token to its position in the sorted vocabulary
for i, item in enumerate(str_to_int.items()):
    print(f'{i}: {item}')
    if i >= 50:
        break

0: ('!', 0)
1: ('"', 1)
2: ("'", 2)
3: ('(', 3)
4: (')', 4)
5: (',', 5)
6: ('--', 6)
7: ('.', 7)
8: (':', 8)
9: (';', 9)
10: ('?', 10)
11: ('A', 11)
12: ('Ah', 12)
13: ('Among', 13)
14: ('And', 14)
15: ('Are', 15)
16: ('Arrt', 16)
17: ('As', 17)
18: ('At', 18)
19: ('Be', 19)
20: ('Begin', 20)
21: ('Burlington', 21)
22: ('But', 22)
23: ('By', 23)
24: ('Carlo', 24)
25: ('Chicago', 25)
26: ('Claude', 26)
27: ('Come', 27)
28: ('Croft', 28)
29: ('Destroyed', 29)
30: ('Devonshire', 30)
31: ('Don', 31)
32: ('Dubarry', 32)
33: ('Emperors', 33)
34: ('Florence', 34)
35: ('For', 35)
36: ('Gallery', 36)
37: ('Gideon', 37)
38: ('Gisburn', 38)
39: ('Gisburns', 39)
40: ('Grafton', 40)
41: ('Greek', 41)
42: ('Grindle', 42)
43: ('Grindles', 43)
44: ('HAD', 44)
45: ('Had', 45)
46: ('Hang', 46)
47: ('Has', 47)
48: ('He', 48)
49: ('Her', 49)
50: ('Hermia', 50)


We also need a way to turn token IDs back into text. For this we create an inverse version of the vocabulary that maps token IDs back to text tokens:

In [None]:
int_to_str = {i: s for s, i in str_to_int.items()}
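
With both mappings in place, text can be encoded to token IDs and decoded back. The following is a minimal sketch on a toy vocabulary built from the sample sentence, so that it runs on its own. The `toy_` names are hypothetical, and the whitespace cleanup in the decoding step is one possible choice, not the only one.

```python
import re

sample = "Hello, world. Is this-- a test?"
pattern = r'([,.:;?_!"()\']|--|\s)'
tokens = [t.strip() for t in re.split(pattern, sample) if t.strip()]

# Toy vocabulary, built the same way as str_to_int and int_to_str
toy_str_to_int = {token: i for i, token in enumerate(sorted(set(tokens)))}
toy_int_to_str = {i: token for token, i in toy_str_to_int.items()}

# Encode: tokens -> token IDs
ids = [toy_str_to_int[t] for t in tokens]
print(ids)

# Decode: token IDs -> tokens -> text
decoded = " ".join(toy_int_to_str[i] for i in ids)
# Remove the space that the join inserted before punctuation characters
decoded = re.sub(r'\s+([,.:;?_!"()\'])', r'\1', decoded)
print(decoded)
```

The token sequence is recovered exactly, but the original spacing is not: the decoded text reads `this -- a` rather than `this-- a`, because the cleanup only handles single punctuation characters. A token that is not in the vocabulary would raise a `KeyError` during encoding.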