# Data Collection for LLM Training

This notebook demonstrates how to collect and preprocess text data for training a language model. We'll use "The Adventures of Sherlock Holmes" from Project Gutenberg as our example dataset.

## Steps covered:
1. **Download** text data from Project Gutenberg
2. **Load and inspect** the raw text
3. **Tokenize** the text by splitting on punctuation and spaces
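4. **Clean** the tokens by removing empty strings
5. **Convert** tokens into token IDs with a simple vocabulary
6. **Tokenize** with the GPT-2 encoding from tiktoken
7. **Sample** input-target pairs with a sliding window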

## 1. Download Text Data

We'll download "The Adventures of Sherlock Holmes" from Project Gutenberg, which provides free access to thousands of books.

In [3]:
import os
import urllib.request

# Check if we already have the file to avoid re-downloading
file_path = "sherlock-holmes.txt"
if not os.path.exists(file_path):
    # URL for "The Adventures of Sherlock Holmes" from Project Gutenberg
    url = "https://www.gutenberg.org/files/1661/1661-0.txt"

    # Download the file and save it locally
    urllib.request.urlretrieve(url, file_path)
    print("✅ Downloaded Sherlock Holmes text")
else:
    print("📄 File already exists, skipping download")

📄 File already exists, skipping download


In [4]:
# Load the entire text file into memory
with open("sherlock-holmes.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Check the size and preview the content
print("Total number of characters:", len(raw_text))
print("\n📖 First 1000 characters:")
print("-" * 50)
print(raw_text[:1000])

Total number of characters: 581425

📖 First 1000 characters:
--------------------------------------------------
﻿The Project Gutenberg eBook of The Adventures of Sherlock Holmes,
by Arthur Conan Doyle

This eBook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this eBook or online at
www.gutenberg.org. If you are not located in the United States, you
will have to check the laws of the country where you are located before
using this eBook.

Title: The Adventures of Sherlock Holmes

Author: Arthur Conan Doyle

Release Date: November 29, 2002 [eBook #1661]
[Most recently updated: October 10, 2023]

Language: English

Character set encoding: UTF-8

Produced by: an anonymous Project Gutenberg volunteer and Jose Menendez

*** START OF THE PROJECT GUTENBERG EBOOK THE ADVENTURES OF SHERLOCK
HOLM
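
The preview above shows two things we probably don't want in training data: a byte order mark at the very start of the file, and the Project Gutenberg header before the `*** START OF` marker. Here is a minimal sketch of one way to trim them. The function name `strip_gutenberg_boilerplate` is hypothetical, and the `*** END OF` marker is an assumption, since only the start marker is visible in the preview.

```python
def strip_gutenberg_boilerplate(text):
    """Return the text between the Gutenberg START and END marker lines."""
    # Drop the byte order mark that shows up as '\ufeff' at the start of the file.
    text = text.lstrip("\ufeff")

    start = text.find("*** START OF")
    if start != -1:
        # The marker line closes with a second '***', possibly on the next line.
        marker_end = text.find("***", start + 3)
        if marker_end != -1:
            text = text[marker_end + 3:]

    # Assumption: the license footer begins with a matching '*** END OF' marker.
    end = text.find("*** END OF")
    if end != -1:
        text = text[:end]

    return text.strip()


# Small synthetic sample that mimics the layout of the real file.
sample = (
    "\ufeffThe Project Gutenberg eBook of Sample\n"
    "*** START OF THE PROJECT GUTENBERG EBOOK SAMPLE\n"
    "TITLE ***\n"
    "Body text.\n"
    "*** END OF THE PROJECT GUTENBERG EBOOK SAMPLE TITLE ***\n"
    "License text"
)
cleaned = strip_gutenberg_boilerplate(sample)
print(cleaned)
```

If only the byte order mark is a concern, opening the file with `encoding="utf-8-sig"` removes it while reading.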

## 2. Load and Inspect the Text

The cell above loads the downloaded file and previews its first characters. Before tokenizing the full text, let's try a simple whitespace split on a short sample string.

In [5]:
import re

# Example 1: Simple split on whitespace
text = "Test for Sherlock Holmes. ! ? /"

# re.split with (\s) captures the delimiter (whitespace) in the result
result = re.split(r'(\s)', text)

print("Splitting on whitespace (keeping delimiters):")
print(result)

Splitting on whitespace (keeping delimiters):
['Test', ' ', 'for', ' ', 'Sherlock', ' ', 'Holmes.', ' ', '!', ' ', '?', ' ', '/']
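
To make the role of the parentheses concrete, here is a small sketch that compares `re.split` with and without a capturing group. It also shows why empty strings can appear once punctuation is added to the pattern: two delimiters that sit next to each other have nothing between them.

```python
import re

sample = "Test for Sherlock Holmes."

# Without a capturing group, the matched delimiters are discarded.
without_group = re.split(r'\s', sample)

# With a capturing group, every matched delimiter is kept as its own item.
with_group = re.split(r'(\s)', sample)

# Adjacent delimiters ('.' followed by ' ') leave an empty string between them.
adjacent = re.split(r'([,.]|\s)', "Holmes. !")

print(without_group)
print(with_group)
print(adjacent)
```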


## 3. Text Tokenization

Tokenization is the process of splitting text into smaller units (tokens) that a language model can process. The whitespace split above is the simplest case; the next examples also treat punctuation marks as separate tokens.

In [6]:
# Example 2: Split on punctuation AND whitespace
# This captures commas, periods, and spaces as separate tokens
result = re.split(r'([,.]|\s)', text)

print("Splitting on punctuation and whitespace:")
print(result)

Splitting on punctuation and whitespace:
['Test', ' ', 'for', ' ', 'Sherlock', ' ', 'Holmes', '.', '', ' ', '!', ' ', '?', ' ', '/']


In [7]:
# Keep only the items that are non-empty after stripping whitespace.
# (The kept items themselves are left unchanged here.)
result = [item for item in result if item.strip()]
print(result)

['Test', 'for', 'Sherlock', 'Holmes', '.', '!', '?', '/']


In [8]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']


In [9]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most']


## 4. Apply Tokenization to Full Text

The cell above applied our tokenization strategy to the entire Sherlock Holmes text. Let's check how many tokens it produced and look at the first ones.

In [10]:
print(len(preprocessed))
print(preprocessed[:1000])


126189
['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', '.', 'gutenberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', ':', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', ':

## Summary

✅ **What we accomplished:**
- Downloaded a classic text from Project Gutenberg
- Loaded and inspected the raw text data
- Learned different tokenization approaches using regex
- Applied a regex that splits the full text on whitespace, common punctuation, and double dashes
- Cleaned the data by removing empty tokens

🎯 **Next steps for LLM training:**
- Create a vocabulary from these tokens
- Convert tokens to numerical IDs
- Organize data into training batches
- Feed to a neural network for language model training

This preprocessed token list is the input for the next step: building a vocabulary and converting tokens to IDs.

## Convert Tokens into Token IDs

In [11]:
# Use a set to remove duplicates, then sort for a stable ordering
all_words = sorted(set(preprocessed))
vocab_size = len(all_words)
print(f"Vocab_size: {vocab_size}")

Vocab_size: 9885


In [12]:
# Initialize an empty dictionary to map tokens to integers
vocab = {}

# Iterate through all unique words and assign a unique integer ID to each token
for integer, token in enumerate(all_words):
    vocab[token] = integer

for i, item in enumerate(vocab.items()):
    print(item)
    if i > 100:
        break

('!', 0)
('#1661]', 1)
('$1', 2)
('$5', 3)
('&', 4)
('&c', 5)
('(', 6)
(')', 7)
('***', 8)
(',', 9)
('-', 10)
('--', 11)
('.', 12)
('000', 13)
('1', 14)
('10', 15)
('100', 16)
('1000', 17)
('11', 18)
('117', 19)
('12', 20)
('120', 21)
('14', 22)
('140', 23)
('15', 24)
('150', 25)
('1500', 26)
('16A', 27)
('17', 28)
('1846', 29)
('1858', 30)
('1869', 31)
('1870', 32)
('1878', 33)
('1883', 34)
('1883—a', 35)
('1884—there', 36)
('1887', 37)
('1888—I', 38)
('1890', 39)
('19th', 40)
('2', 41)
('20%', 42)
('200', 43)
('2001', 44)
('2002', 45)
('2023]', 46)
('220', 47)
('221B', 48)
('226', 49)
('22nd', 50)
('25', 51)
('250', 52)
('26', 53)
('27', 54)
('270', 55)
('29', 56)
('2nd', 57)
('3', 58)
('30', 59)
('31', 60)
('35', 61)
('3rd', 62)
('4', 63)
('40', 64)
('4000', 65)
('4700', 66)
('4th', 67)
('4½', 68)
('5', 69)
('50', 70)
('501', 71)
('596-1887', 72)
('6', 73)
('60', 74)
('64-6221541', 75)
('7', 76)
('700', 77)
('750', 78)
('8', 79)
('801', 80)
('809', 81)
('84116', 82)
('88', 83)
('9',
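
The vocabulary listing above contains entries such as `'#1661]'`, `'2023]'`, and `'1883—a'`, which suggests that our pattern does not split on square brackets or on the em dash character. Here is a hedged sketch of an extended pattern that handles those two cases. The names `EXTENDED_PATTERN` and `tokenize` are hypothetical, and the rest of the notebook keeps using the original pattern.

```python
import re

# The pattern used in the notebook so far.
ORIGINAL_PATTERN = r'([,.:;?_!"()\']|--|\s)'

# Assumption: we also want '[', ']' and the em dash as separate tokens.
EXTENDED_PATTERN = r'([,.:;?_!"()\'\[\]]|--|—|\s)'


def tokenize(text, pattern):
    """Split on the pattern and drop empty or whitespace-only items."""
    return [item.strip() for item in re.split(pattern, text) if item.strip()]


sample = "[eBook #1661] In 1883—a year later"
original_tokens = tokenize(sample, ORIGINAL_PATTERN)
extended_tokens = tokenize(sample, EXTENDED_PATTERN)

print(original_tokens)
print(extended_tokens)
```

Splitting more aggressively usually shrinks the vocabulary, because fused strings like `'1883—a'` stop counting as unique tokens.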

In [13]:
class SimpleTokenizerV1:
    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i:s for s,i in vocab.items()}
    
    def encode(self, text):
        preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', text)
        preprocessed = [
            item.strip() for item in preprocessed if item.strip()
        ]
        ids = [self.str_to_int[s] for s in preprocessed]
        return ids
        
    def decode(self, ids):
        text = " ".join([self.int_to_str[i] for i in ids])
        # Remove the spaces that " ".join() inserted before punctuation
        text = re.sub(r'\s+([,.?!"()\'])', r'\1', text)
        return text

In [14]:
tokenizer = SimpleTokenizerV1(vocab)

text = """John is a man, he has talked to many people and works very hard!
        John is very determined!"""
ids = tokenizer.encode(text)
print(ids)


[702, 5159, 1358, 5608, 9, 4663, 4644, 8266, 8469, 5628, 6356, 1613, 9223, 8872, 4629, 0, 702, 5159, 8872, 3255, 0]


In [15]:
tokenizer.decode(ids)

'John is a man, he has talked to many people and works very hard! John is very determined!'
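
One limitation of `SimpleTokenizerV1` is that `encode` looks up every token directly in the vocabulary, so any word that never appears in the book raises a `KeyError`. A common workaround is to reserve a placeholder token for unknown words. The sketch below is one possible variant, not part of the original tokenizer. The class name `SimpleTokenizerWithUnk`, the `<|unk|>` token, and the toy vocabulary are all assumptions made for illustration.

```python
import re


class SimpleTokenizerWithUnk:
    """Hypothetical variant of SimpleTokenizerV1 that maps unknown tokens to <|unk|>."""

    PATTERN = r'([,.:;?_!"()\']|--|\s)'

    def __init__(self, vocab):
        self.str_to_int = vocab
        self.int_to_str = {i: s for s, i in vocab.items()}

    def encode(self, text):
        tokens = [t.strip() for t in re.split(self.PATTERN, text) if t.strip()]
        unk_id = self.str_to_int["<|unk|>"]
        # dict.get falls back to the <|unk|> ID instead of raising KeyError.
        return [self.str_to_int.get(t, unk_id) for t in tokens]

    def decode(self, ids):
        text = " ".join(self.int_to_str[i] for i in ids)
        return re.sub(r'\s+([,.?!"()\'])', r'\1', text)


# Toy vocabulary so the example runs without the book.
toy_tokens = sorted({"John", "is", "very", "determined", "!"})
toy_tokens.append("<|unk|>")
toy_vocab = {token: i for i, token in enumerate(toy_tokens)}

toy_tokenizer = SimpleTokenizerWithUnk(toy_vocab)
toy_ids = toy_tokenizer.encode("John is very stubborn!")
print(toy_ids)
print(toy_tokenizer.decode(toy_ids))
```

The trade-off is that decoding can no longer recover the original word, which is one reason subword tokenizers are popular.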

## Use a GPT-Based Tokenizer (tiktoken)

In [16]:
%pip install tiktoken

Note: you may need to restart the kernel to use updated packages.


In [17]:
from importlib.metadata import version
import tiktoken
print("Tiktoken version:", version("tiktoken"))

Tiktoken version: 0.7.0


In [18]:
tokenizer = tiktoken.get_encoding("gpt2")

In [19]:
text = "The quick brown fox jumps over the lazy dog.<|endoftext|> This is a short excerpt for testing the GPT-2 tokenizer. It includes some punctuation, numbers (123), and special characters like @#$%. A word the Vocabulary doesn't have is Sujar!"

integer_tokens = tokenizer.encode(text, allowed_special={"<|endoftext|>"})
print(integer_tokens)

[464, 2068, 7586, 21831, 18045, 625, 262, 16931, 3290, 13, 50256, 770, 318, 257, 1790, 20911, 329, 4856, 262, 402, 11571, 12, 17, 11241, 7509, 13, 632, 3407, 617, 21025, 2288, 11, 3146, 357, 10163, 828, 290, 2041, 3435, 588, 2488, 29953, 7225, 317, 1573, 262, 47208, 22528, 1595, 470, 423, 318, 1778, 9491, 0]


In [20]:
string_val = tokenizer.decode(integer_tokens)
print(string_val)

The quick brown fox jumps over the lazy dog.<|endoftext|> This is a short excerpt for testing the GPT-2 tokenizer. It includes some punctuation, numbers (123), and special characters like @#$%. A word the Vocabulary doesn't have is Sujar!
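
Notice that the made-up word "Sujar" did not cause an error: it was encoded as two IDs (1778, 9491) and decoded back intact. The GPT-2 encoding is a byte pair encoding (BPE), which builds its vocabulary by repeatedly merging the most frequent adjacent pair of symbols, so unfamiliar words can fall back to smaller pieces. The sketch below illustrates the merge idea on characters of three toy words. It is not tiktoken's implementation (which works on bytes and uses a pretrained merge table), and the function names are hypothetical.

```python
from collections import Counter


def most_frequent_pair(sequences):
    """Return the adjacent symbol pair that occurs most often."""
    counts = Counter()
    for seq in sequences:
        counts.update(zip(seq, seq[1:]))
    return counts.most_common(1)[0][0]


def merge_pair(seq, pair):
    """Replace every occurrence of the pair in seq with one merged symbol."""
    merged, i = [], 0
    while i < len(seq):
        if i + 1 < len(seq) and (seq[i], seq[i + 1]) == pair:
            merged.append(seq[i] + seq[i + 1])
            i += 2
        else:
            merged.append(seq[i])
            i += 1
    return merged


# Start from single characters and apply two merge steps.
words = [list("low"), list("lower"), list("lowest")]
for _ in range(2):
    pair = most_frequent_pair(words)
    words = [merge_pair(word, pair) for word in words]

print(words)
```

After two merges the shared prefix "low" has become a single symbol, while the rarer endings remain split into characters.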


## Data Sampling with Sliding Window

In [21]:
with open("sherlock-holmes.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()
encoded_text = tokenizer.encode(raw_text)
print(f"Total number of characters in text: {len(raw_text)}")

Total number of characters in text: 581425


In [27]:
enc_sample = encoded_text[500:]
context_size = 1024
x = enc_sample[:context_size]
y = enc_sample[1:context_size+1]
print(f"x: {x}")
print(f"y: {y}")

x: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475, 6178, 343, 1346, 12974, 2000, 13, 679, 198, 9776, 11, 314, 1011, 340, 11, 262, 749, 2818, 14607, 290, 21769, 4572, 326, 198, 1169, 995, 468, 1775, 11, 475, 355, 257, 18854, 339, 561, 423, 4624, 2241, 287, 257, 198, 9562, 2292, 13, 679, 1239, 5158, 286, 262, 32359, 30477, 11, 3613, 351, 257, 308, 32438, 198, 392, 257, 10505, 263, 13, 1119, 547, 37959, 1243, 329, 262, 22890, 960, 1069, 5666, 329, 198, 19334, 278, 262, 30615, 422, 1450, 447, 247, 82, 21508, 290, 4028, 13, 887, 329, 262, 8776, 198, 41181, 263, 284, 9159, 884, 9913, 15880, 656, 465, 898, 19217, 290, 32566, 198, 29117, 36140, 373, 284, 10400, 257, 36441, 5766, 543, 1244, 198, 16939, 257, 4719, 2402, 477, 465, 5110, 2482, 13, 402, 799, 287, 257, 8564, 198, 259, 43872, 11, 393, 257, 8469, 287, 530, 286, 465, 898, 1029, 12, 6477, 18405, 11, 561, 407, 198, 1350, 517, 14851, 621, 257, 1913, 9942, 287, 257, 3450, 884, 355, 465, 13, 843, 198, 25907, 612, 373, 475, 530, 241
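
The cell above builds a single input window `x` and its target `y`, which is the same window shifted by one token. To cover a whole text, we can slide that window forward repeatedly. Here is a minimal sketch using plain Python lists. The function name `sliding_window_pairs` and the `stride` parameter are assumptions for illustration, and the toy IDs stand in for `encoded_text`.

```python
def sliding_window_pairs(token_ids, context_size, stride):
    """Return (input, target) windows where target is input shifted by one."""
    pairs = []
    # Stop early enough that every target window has the full length.
    for start in range(0, len(token_ids) - context_size, stride):
        x = token_ids[start:start + context_size]
        y = token_ids[start + 1:start + context_size + 1]
        pairs.append((x, y))
    return pairs


toy_ids = list(range(10))
pairs = sliding_window_pairs(toy_ids, context_size=4, stride=2)
for x, y in pairs:
    print(f"x: {x} -> y: {y}")
```

A stride smaller than `context_size` makes neighbouring windows overlap, while a stride equal to `context_size` uses each token as an input only once.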

## Start Predicting the Next Token

In [28]:
for i in range(1,20):
    context = enc_sample[:i]
    desired = enc_sample[i]
    print(f"Context: {context} => Next token: {desired}")

Context: [198] => Next token: 22474
Context: [198, 22474] => Next token: 42934
Context: [198, 22474, 42934] => Next token: 1156
Context: [198, 22474, 42934, 1156] => Next token: 284
Context: [198, 22474, 42934, 1156, 284] => Next token: 465
Context: [198, 22474, 42934, 1156, 284, 465] => Next token: 4692
Context: [198, 22474, 42934, 1156, 284, 465, 4692] => Next token: 11
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11] => Next token: 7141
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141] => Next token: 475
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475] => Next token: 6178
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475, 6178] => Next token: 343
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475, 6178, 343] => Next token: 1346
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475, 6178, 343, 1346] => Next token: 12974
Context: [198, 22474, 42934, 1156, 284, 465, 4692, 11, 7141, 475, 6178, 343, 1346, 12974] =>
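
The growing contexts printed above are exactly what a single `(x, y)` window encodes: position `i` of `y` is the token that follows the first `i + 1` tokens of `x`. Here is a short sketch of that relationship on toy IDs. The helper name `expand_window` is hypothetical.

```python
def expand_window(x, y):
    """List the (context, target) pair for every position in one window."""
    return [(x[:i + 1], y[i]) for i in range(len(x))]


x_toy = [10, 20, 30, 40]
y_toy = [20, 30, 40, 50]  # x_toy shifted by one token

examples = expand_window(x_toy, y_toy)
for context, target in examples:
    print(f"Context: {context} => Next token: {target}")
```

So a window of `context_size` tokens provides `context_size` prediction examples at once.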

## Convert to Text

In [29]:
for i in range(100):
    print(f"{x[i]} -> {y[i]} ({tokenizer.decode([x[i]])} -> {tokenizer.decode([y[i]])})")

198 -> 22474 (
 -> were)
22474 -> 42934 (were ->  abhor)
42934 -> 1156 ( abhor -> rent)
1156 -> 284 (rent ->  to)
284 -> 465 ( to ->  his)
465 -> 4692 ( his ->  cold)
4692 -> 11 ( cold -> ,)
11 -> 7141 (, ->  precise)
7141 -> 475 ( precise ->  but)
475 -> 6178 ( but ->  adm)
6178 -> 343 ( adm -> ir)
343 -> 1346 (ir -> ably)
1346 -> 12974 (ably ->  balanced)
12974 -> 2000 ( balanced ->  mind)
2000 -> 13 ( mind -> .)
13 -> 679 (. ->  He)
679 -> 198 ( He -> 
)
198 -> 9776 (
 -> was)
9776 -> 11 (was -> ,)
11 -> 314 (, ->  I)
314 -> 1011 ( I ->  take)
1011 -> 340 ( take ->  it)
340 -> 11 ( it -> ,)
11 -> 262 (, ->  the)
262 -> 749 ( the ->  most)
749 -> 2818 ( most ->  perfect)
2818 -> 14607 ( perfect ->  reasoning)
14607 -> 290 ( reasoning ->  and)
290 -> 21769 ( and ->  observing)
21769 -> 4572 ( observing ->  machine)
4572 -> 326 ( machine ->  that)
326 -> 198 ( that -> 
)
198 -> 1169 (
 -> the)
1169 -> 995 (the ->  world)
995 -> 468 ( world ->  has)
468 -> 1775 ( has ->  seen)
1775 -> 1