# Chapter 2: Working with Text Data

- This chapter covers data preparation and sampling to get input data "ready" for the LLM

Packages used in this notebook:

In [None]:
from importlib.metadata import version

print("torch version:", version("torch"))
print("tiktoken version:", version("tiktoken"))

## 2.1 Understanding word embeddings

- There are many forms of embeddings; we focus on text embeddings in this book
- LLMs work with embeddings in high-dimensional spaces (i.e., thousands of dimensions)
- Since we can't visualize such high-dimensional spaces (we humans think in 1, 2, or 3 dimensions), the figure below illustrates a 2-dimensional embedding space

## 2.2 Tokenizing text

- In this section, we tokenize text, which means breaking text into smaller units, such as individual words and punctuation characters
- First, we load the raw text we want to work with
- [The Verdict by Edith Wharton](https://en.wikisource.org/wiki/The_Verdict) is a public domain short story

In [None]:
import os
import requests

file_path = "the-verdict.txt"

# Download the text only if it is not already present locally
if not os.path.exists(file_path):
    url = (
        "https://raw.githubusercontent.com/rasbt/"
        "LLMs-from-scratch/main/ch02/01_main-chapter-code/"
        "the-verdict.txt"
    )

    response = requests.get(url, timeout=30)
    response.raise_for_status()
    with open(file_path, "wb") as f:
        f.write(response.content)


In [None]:
with open("the-verdict.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
    
print("Total number of characters:", len(raw_text))
print(raw_text[:99])

- The goal is to tokenize and embed this text for an LLM
- Let's develop a simple tokenizer based on some simple sample text that we can then later apply to the text above
- The following regular expression splits the text on whitespace characters

In [None]:
import re

text = "Hello, world. This, is a test."
result = re.split(r'(\s)', text)

print(result)
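
- The parentheses in `(\s)` form a capturing group, which is why `re.split` keeps the whitespace separators in its output. A minimal sketch contrasting the two behaviors on the same sample sentence (the names `sample_text`, `without_group`, and `with_group` are illustrative only):

```python
import re

sample_text = "Hello, world. This, is a test."

# Without a capturing group, the matched separators are discarded
without_group = re.split(r'\s', sample_text)

# With a capturing group, each matched separator is kept as its own item
with_group = re.split(r'(\s)', sample_text)

print(without_group)
print(with_group)
```

- Because the separators are kept, joining the second list with `"".join(with_group)` reproduces the original string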

- We want to split not only on whitespace but also on commas and periods, so let's modify the regular expression to do that as well

In [None]:
result = re.split(r'([,.]|\s)', text)

print(result)

- As we can see, the result now contains empty strings and whitespace-only items; let's remove them

In [None]:
# Keep only the items that are non-empty after stripping whitespace
# (this drops both empty strings and whitespace-only items).
result = [item for item in result if item.strip()]
print(result)
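
- To see where the empty strings come from: whenever two separators are adjacent (for example, a comma directly followed by a space), `re.split` reports the empty string between them, and a separator at the very end leaves a trailing empty string. A small sketch on a shorter, made-up input (`sample`, `pieces`, and `tokens` are illustrative names):

```python
import re

sample = "Hi, you."

# Split on commas, periods, and whitespace, keeping the separators
pieces = re.split(r'([,.]|\s)', sample)
print(pieces)

# Drop items that are empty or consist only of whitespace
tokens = [item for item in pieces if item.strip()]
print(tokens)
```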

- This looks pretty good, but let's also handle other types of punctuation, such as question marks, quotation marks, and double dashes

In [None]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)
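
- Two properties of this pattern are worth checking on a small input: the `--` alternative keeps a double dash together as a single token, and because `'` is part of the character class, contractions are split at the apostrophe. A sketch, assuming a made-up sentence (`sample` and `tokens` are illustrative names):

```python
import re

sample = "I don't know-- do you?"

# Same pattern as above: punctuation, double dashes, and whitespace
tokens = re.split(r'([,.:;?_!"()\']|--|\s)', sample)
tokens = [item.strip() for item in tokens if item.strip()]
print(tokens)
```

- Here `don't` becomes three tokens (`don`, `'`, `t`), which is a consequence of this simple rule set rather than a deliberate treatment of contractions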

- This is pretty good, and we are now ready to apply this tokenization to the raw text

In [None]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

- Let's calculate the total number of tokens

In [None]:
print(len(preprocessed))

## 2.3 Converting tokens into IDs

- In this section, we will create a vocabulary that maps each unique token to a unique integer ID
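- First, let's identify the unique tokens in the text and sort them (Python's `sorted` orders strings by Unicode code point, so punctuation and capitalized tokens come before lowercase ones)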

In [None]:
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)

print(vocab_size)

- Next, we create a dictionary that maps each token to an integer ID

In [None]:
vocab = {token: integer for integer, token in enumerate(all_words)}

- Let's print the first 50 entries of this vocabulary

In [None]:
for i, item in enumerate(vocab.items()):
    print(item)
    if i >= 49:  # stop after printing 50 entries
        break
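
- To illustrate how a token-to-ID mapping like this is used, here is a sketch on a tiny made-up sentence rather than the full story: tokens are looked up to obtain IDs, and an inverse dictionary recovers the tokens from the IDs. The names `toy_text`, `toy_tokens`, `toy_vocab`, `toy_inverse`, and `ids` are illustrative and not part of the chapter's code:

```python
import re

toy_text = "The cat sat. The cat ran."

# Tokenize with the same pattern used for the story
toy_tokens = re.split(r'([,.:;?_!"()\']|--|\s)', toy_text)
toy_tokens = [item.strip() for item in toy_tokens if item.strip()]

# Build the vocabulary from the sorted unique tokens
toy_vocab = {token: idx for idx, token in enumerate(sorted(set(toy_tokens)))}
print(toy_vocab)

# Token -> ID lookup
ids = [toy_vocab[token] for token in toy_tokens]
print(ids)

# ID -> token lookup via the inverse mapping
toy_inverse = {idx: token for token, idx in toy_vocab.items()}
print([toy_inverse[i] for i in ids])
```

- A token that never appeared in the text has no entry, so a lookup such as `toy_vocab["dog"]` raises a `KeyError`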