# Data Collection for LLM Training

This notebook demonstrates how to collect and preprocess text data for training a language model. We'll use "The Adventures of Sherlock Holmes" from Project Gutenberg as our example dataset.

## Steps covered:
1. **Download** text data from Project Gutenberg
2. **Load and inspect** the raw text
3. **Tokenize** the text by splitting on punctuation and spaces
4. **Clean** the tokens by removing empty strings

## 1. Download Text Data

We'll download "The Adventures of Sherlock Holmes" from Project Gutenberg, which provides free access to thousands of books.

In [None]:
import os
import urllib.request

# URL for "The Adventures of Sherlock Holmes" on Project Gutenberg
url = "https://www.gutenberg.org/files/1661/1661-0.txt"
file_path = "sherlock-holmes.txt"

# Skip the download if the file is already on disk
if not os.path.exists(file_path):
    # Download the file and save it locally
    urllib.request.urlretrieve(url, file_path)
    print("✅ Downloaded Sherlock Holmes text")
else:
    print("📄 File already exists, skipping download")

## 2. Load and Inspect the Text

Now let's load the downloaded text file and see what we're working with.

In [None]:
# Load the entire text file into memory
with open("sherlock-holmes.txt", "r", encoding="utf-8") as f:
    raw_text = f.read()

# Check the size and preview the content
print("Total number of characters:", len(raw_text))
print("\n📖 First 1000 characters:")
print("-" * 50)
print(raw_text[:1000])

## 3. Text Tokenization

Tokenization is the process of splitting text into smaller units (tokens) that can be processed by our language model. Let's start with some simple examples to understand how it works.

In [None]:
import re

# Example 1: Simple split on whitespace
text = "Test for Sherlock Holmes."

# re.split with (\s) captures the delimiter (whitespace) in the result
result = re.split(r'(\s)', text)

print("Splitting on whitespace (keeping delimiters):")
print(result)

['Test', ' ', 'for', ' ', 'Sherlock', ' ', 'Holmes.']

In [None]:
# Example 2: Split on punctuation AND whitespace
# This captures commas, periods, and spaces as separate tokens
result = re.split(r'([,.]|\s)', text)

print("Splitting on punctuation and whitespace:")
print(result)
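
Running this pattern on the sample sentence shows why a cleanup step is needed. Because the sentence ends with a period, `re.split` emits an empty string after the final delimiter, and the spaces are still present as items. The short sketch below reuses the same sample sentence and also contrasts the capturing group with a plain pattern that discards delimiters; the variable names `no_delims` and `with_delims` are only illustrative.

```python
import re

text = "Test for Sherlock Holmes."

# Without a capturing group, the delimiters are discarded
no_delims = re.split(r'\s', text)
print(no_delims)
# ['Test', 'for', 'Sherlock', 'Holmes.']

# With a capturing group, the delimiters are kept as separate items.
# The trailing '' appears because the text ends with a delimiter.
with_delims = re.split(r'([,.]|\s)', text)
print(with_delims)
# ['Test', ' ', 'for', ' ', 'Sherlock', ' ', 'Holmes', '.', '']
```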

In [13]:
# Drop items that are empty or contain only whitespace (the kept items are not modified).
result = [item for item in result if item.strip()]
print(result)

['Test', 'for', 'Sherlock', 'Holmes', '.']


In [14]:
text = "Hello, world. Is this-- a test?"

result = re.split(r'([,.:;?_!"()\']|--|\s)', text)
result = [item.strip() for item in result if item.strip()]
print(result)

['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']
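
One way to make this pattern reusable is to wrap it in a small function. The sketch below is only an illustration: the names `TOKEN_PATTERN` and `tokenize` are hypothetical and do not appear elsewhere in this notebook. It also shows a side effect worth knowing about: because the straight apostrophe is one of the delimiters, contractions are split into three tokens.

```python
import re

# Same pattern as in the cell above: common punctuation, double dashes, whitespace
TOKEN_PATTERN = re.compile(r'([,.:;?_!"()\']|--|\s)')

def tokenize(text):
    """Split text into word and punctuation tokens, dropping whitespace."""
    pieces = TOKEN_PATTERN.split(text)
    return [piece.strip() for piece in pieces if piece.strip()]

print(tokenize("Hello, world. Is this-- a test?"))
# ['Hello', ',', 'world', '.', 'Is', 'this', '--', 'a', 'test', '?']

# The straight apostrophe is a delimiter, so contractions are split apart
print(tokenize("I don't know."))
# ['I', 'don', "'", 't', 'know', '.']
```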


## 4. Apply Tokenization to Full Text

Now let's apply our tokenization strategy to the entire Sherlock Holmes text.

In [15]:
preprocessed = re.split(r'([,.:;?_!"()\']|--|\s)', raw_text)
preprocessed = [item.strip() for item in preprocessed if item.strip()]
print(preprocessed[:30])

['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most']

In [None]:
print(len(preprocessed))
print(preprocessed[:1000])


126189
['\ufeffThe', 'Project', 'Gutenberg', 'eBook', 'of', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', ',', 'by', 'Arthur', 'Conan', 'Doyle', 'This', 'eBook', 'is', 'for', 'the', 'use', 'of', 'anyone', 'anywhere', 'in', 'the', 'United', 'States', 'and', 'most', 'other', 'parts', 'of', 'the', 'world', 'at', 'no', 'cost', 'and', 'with', 'almost', 'no', 'restrictions', 'whatsoever', '.', 'You', 'may', 'copy', 'it', ',', 'give', 'it', 'away', 'or', 're-use', 'it', 'under', 'the', 'terms', 'of', 'the', 'Project', 'Gutenberg', 'License', 'included', 'with', 'this', 'eBook', 'or', 'online', 'at', 'www', '.', 'gutenberg', '.', 'org', '.', 'If', 'you', 'are', 'not', 'located', 'in', 'the', 'United', 'States', ',', 'you', 'will', 'have', 'to', 'check', 'the', 'laws', 'of', 'the', 'country', 'where', 'you', 'are', 'located', 'before', 'using', 'this', 'eBook', '.', 'Title', ':', 'The', 'Adventures', 'of', 'Sherlock', 'Holmes', 'Author', ':', 'Arthur', 'Conan', 'Doyle', 'Release', 'Date', ':
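
The first token in the output above is `'\ufeffThe'`: the file begins with a UTF-8 byte-order mark (BOM), and the `utf-8` codec keeps it as a character. The tokens that follow belong to the Project Gutenberg header rather than to the book itself. Below is a minimal sketch of one way to handle both, run on an in-memory sample rather than the real file. The helper name `strip_gutenberg_wrapper` is hypothetical, and the `*** START OF` / `*** END OF` marker lines are an assumption about the file's layout that should be checked against the actual download.

```python
import re

# Stand-in for the downloaded file: BOM, header, marker lines, body, footer
sample_bytes = (
    b"\xef\xbb\xbfThe Project Gutenberg eBook of Example\n"
    b"*** START OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    b"Chapter text goes here.\n"
    b"*** END OF THE PROJECT GUTENBERG EBOOK EXAMPLE ***\n"
    b"License text\n"
)

# "utf-8" keeps the BOM as '\ufeff'; "utf-8-sig" removes it while decoding
assert sample_bytes.decode("utf-8").startswith("\ufeff")
text = sample_bytes.decode("utf-8-sig")

def strip_gutenberg_wrapper(text):
    """Return the text between the START and END marker lines, if both exist."""
    start = re.search(r"^\*\*\* START OF.*$", text, flags=re.MULTILINE)
    end = re.search(r"^\*\*\* END OF.*$", text, flags=re.MULTILINE)
    if start is None or end is None or end.start() < start.end():
        return text  # assumed markers not found: leave the text unchanged
    return text[start.end():end.start()].strip()

body = strip_gutenberg_wrapper(text)
print(body)
```

With the real file, passing `encoding="utf-8-sig"` to `open` removes the BOM in the same way.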

## Summary

✅ **What we accomplished:**
- Downloaded a classic text from Project Gutenberg
- Loaded and inspected the raw text data
- Compared regex splitting on whitespace alone with splitting on whitespace plus punctuation
- Applied the punctuation-aware regex to the full text, producing 126,189 tokens
- Removed empty and whitespace-only tokens

🎯 **Next steps for LLM training:**
- Create a vocabulary from these tokens
- Convert tokens to numerical IDs
- Organize data into training batches
- Feed to a neural network for language model training
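
The first two steps can be sketched on a small token list. This is an illustration under simple assumptions, not a fixed recipe: the names `vocab`, `id_to_token`, `encode`, and `decode` are hypothetical, and sorting the unique tokens is just one straightforward way to assign IDs.

```python
# A small stand-in for the full `preprocessed` token list
tokens = ['Hello', ',', 'world', '.', 'Hello', 'again', '.']

# Vocabulary: each unique token gets an integer ID (sorted for reproducibility)
vocab = {token: idx for idx, token in enumerate(sorted(set(tokens)))}
id_to_token = {idx: token for token, idx in vocab.items()}

def encode(token_list):
    """Map tokens to IDs; raises KeyError for tokens missing from the vocabulary."""
    return [vocab[token] for token in token_list]

def decode(ids):
    """Map IDs back to tokens."""
    return [id_to_token[i] for i in ids]

ids = encode(tokens)
print(vocab)  # {',': 0, '.': 1, 'Hello': 2, 'again': 3, 'world': 4}
print(ids)    # [2, 0, 4, 1, 2, 3, 1]
```

A real vocabulary would also need a policy for tokens that never appeared in the training text, which this sketch leaves out.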

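This preprocessed token list is the starting point for those steps. Note that it still contains the Project Gutenberg header text and a byte-order mark (`\ufeff`) on the first token, both of which are worth removing before building a vocabulary.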