**Exercise: Prepare your own favorite text dataset**

In [1]:
import re
import torch
from pathlib import Path

Step 1: Load your dataset

In [2]:
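import requests
url = "https://www.gutenberg.org/cache/epub/69304/pg69304.txt"  # Mozart, by Ebenezer Prout
response = requests.get(url, timeout=30)
response.raise_for_status()  # stop here if the download failed
text = response.text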


print("Original dataset length:", len(text))
print(text[:500])

Original dataset length: 116203
﻿The Project Gutenberg eBook of Mozart
    
This ebook is for the use of anyone anywhere in the United States and
most other parts of the world at no cost and with almost no restrictions
whatsoever. You may copy it, give it away or re-use it under the terms
of the Project Gutenberg License included with this ebook or online
at www.gutenberg.org. If you are not located in the United States,
you will have to check the laws of the country where you are located
before using this eBook.

Ti
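
The preview above starts with an invisible byte-order mark (U+FEFF) in front of "The". If you would rather work from a local copy of your own text, the following is a minimal sketch, assuming the file is UTF-8 encoded. The helper names `load_text` and `strip_bom` are illustrative and not part of the exercise.

```python
from pathlib import Path


def load_text(path):
    """Read a UTF-8 text file, dropping a leading byte-order mark if present."""
    # "utf-8-sig" decodes like UTF-8 but skips an initial U+FEFF
    return Path(path).read_text(encoding="utf-8-sig")


def strip_bom(s):
    """Remove a leading U+FEFF from an already-decoded string."""
    return s[1:] if s.startswith("\ufeff") else s


# Hypothetical usage (the file name is an assumption):
# text = load_text("my_dataset.txt")
# text = strip_bom(requests.get(url).text)
```

Either helper leaves text without a byte-order mark unchanged, so they should be safe to apply unconditionally.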


Step 2: Clean dataset

In [4]:
start = text.find("*** START")
end = text.find("*** END")
if start != -1 and end != -1:
    text = text[start:end]


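# Simple cleaning (customize as needed): lowercase everything, then replace
# any character outside a-z, 0-9, whitespace, and . , ; : ' ! ? with a space
text = text.lower()
text = re.sub(r"[^a-z0-9\s.,;:'!?]", " ", text)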


print("Cleaned dataset length:", len(text))
print(text[:500])

Cleaned dataset length: 96621
    start of the project gutenberg ebook mozart    
 frontispiece: mozart as a young man.    from a print by schw rer.   




  bell's miniature series of musicians




  mozart


  by

  ebenezer prout, b.a., mus.d.

  professor of music, dublin university



  london
  george bell   sons
  1905




  first published, november, 1903.
  reprinted, 1905.




table of contents


some books about mozart

the child  1756 1768 

the youth  1769 1778 

the m
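
Two things stand out in the preview above. The text still begins with the START marker line, because the slice starts at the marker itself. There are also gaps such as "schw rer" and "george bell   sons", apparently where a non-ASCII letter and a symbol were replaced by spaces. The following is one possible refinement, not the exercise's own method: it skips the marker line and folds accented letters to their base letter before filtering. The function names are illustrative.

```python
import re
import unicodedata


def strip_gutenberg_markers(raw):
    """Return the text between the START and END marker lines, markers excluded."""
    start = raw.find("*** START")
    end = raw.find("*** END")
    if start == -1 or end == -1:
        return raw  # markers not found: leave the text untouched
    # Move past the end of the START marker line
    newline = raw.find("\n", start)
    body_start = newline + 1 if newline != -1 else start
    return raw[body_start:end]


def fold_to_ascii(s):
    """Replace accented letters with their base letter where Unicode allows it."""
    # NFKD splits e.g. "é" into "e" plus a combining accent, which is then dropped
    decomposed = unicodedata.normalize("NFKD", s)
    return "".join(ch for ch in decomposed if not unicodedata.combining(ch))


def clean(raw):
    """Strip markers, fold accents, lowercase, and blank out other characters."""
    body = fold_to_ascii(strip_gutenberg_markers(raw)).lower()
    return re.sub(r"[^a-z0-9\s.,;:'!?]", " ", body)
```

Letters without a Unicode decomposition (for example "ß") are still replaced by a space, so check the preview again after cleaning.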


Step 3: Tokenizer (character-level)

In [5]:
vocab = sorted(set(text))
stoi = {ch: i for i, ch in enumerate(vocab)}
itos = {i: ch for i, ch in enumerate(vocab)}


def encode(s):
    """Map a string to a list of integer token ids."""
    return [stoi[c] for c in s]


def decode(ids):
    """Map a list of integer token ids back to a string."""
    return "".join(itos[i] for i in ids)


print("Vocab size:", len(vocab))
print("Encoding 'glen':", encode("glen"))

Vocab size: 46
Encoding 'glen': [26, 31, 24, 33]
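
Because the vocabulary is built from the cleaned, lowercased text, encoding a string with any other character (an uppercase letter, for instance) raises a `KeyError`. The sketch below shows one way to make that failure explicit and to check that encoding and decoding round-trip. It passes the lookup tables as arguments so it runs on its own; the function names and sample string are illustrative.

```python
def build_vocab(text):
    """Build character-to-id and id-to-character tables from a text."""
    vocab = sorted(set(text))
    stoi = {ch: i for i, ch in enumerate(vocab)}
    itos = {i: ch for ch, i in stoi.items()}
    return vocab, stoi, itos


def encode_checked(s, stoi):
    """Encode a string, naming any characters missing from the vocabulary."""
    unknown = sorted(set(s) - set(stoi))
    if unknown:
        raise ValueError(f"characters not in vocabulary: {unknown}")
    return [stoi[c] for c in s]


def decode_ids(ids, itos):
    """Decode a list of ids back into a string."""
    return "".join(itos[i] for i in ids)


sample = "the silver glen"
vocab, stoi, itos = build_vocab(sample)
ids = encode_checked(sample, stoi)
# Decoding the encoded ids should reproduce the input exactly
assert decode_ids(ids, itos) == sample
```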


Step 4: Encode dataset + split

In [6]:
data = torch.tensor(encode(text), dtype=torch.long)
n = int(0.9 * len(data))
train_data = data[:n]
val_data = data[n:]


print("Train size:", len(train_data), "Val size:", len(val_data))

Train size: 86958 Val size: 9663
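
As a quick check of the split arithmetic, `int(...)` truncates toward zero, so the training split gets the floor of 90% of the tokens and the validation split gets the rest. A small sketch using the cleaned length printed earlier (the helper name `split_index` is illustrative):

```python
def split_index(n_tokens, train_fraction=0.9):
    """Index of the first validation token; int() truncates the product."""
    return int(train_fraction * n_tokens)


n_tokens = 96621  # cleaned dataset length reported above
n = split_index(n_tokens)
print("Train size:", n, "Val size:", n_tokens - n)
```

Note that the split is contiguous: the validation data is the last tenth of the book, not a random sample.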


Step 5: Batch generator

In [8]:
block_size = 16 # sequence length
batch_size = 4 # number of sequences per batch


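def get_batch(split):
    """Sample a random batch of (input, target) sequences from one split."""
    data = train_data if split == "train" else val_data
    # Random start offsets; the upper bound leaves room for the shifted target
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix])
    # Targets are the inputs shifted one position to the right
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])
    return x, y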


xb, yb = get_batch("train")
print("X shape:", xb.shape, "Y shape:", yb.shape)
print("Decoded X[0]:", decode(xb[0].tolist()))
print("Decoded Y[0]:", decode(yb[0].tolist()))

X shape: torch.Size([4, 16]) Y shape: torch.Size([4, 16])
Decoded X[0]: fonie
concertan
Decoded Y[0]: onie
concertant
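
The decoded output shows that each target row is the input row shifted by one character. The following torch-free sketch mirrors the sampling logic with plain lists, to make the relationship easy to inspect. The function name `get_batch_plain`, the sample string, and the seed are assumptions for illustration.

```python
import random


def get_batch_plain(data, block_size, batch_size, rng):
    """Sample (input, target) windows where each target is the input shifted by one."""
    # Valid starts are 0 .. len(data) - block_size - 1, so the target window fits
    starts = [rng.randrange(len(data) - block_size) for _ in range(batch_size)]
    x = [data[i:i + block_size] for i in starts]
    y = [data[i + 1:i + block_size + 1] for i in starts]
    return x, y


data = list("a character level language model")
rng = random.Random(0)  # fixed seed so repeated runs sample the same windows
x, y = get_batch_plain(data, block_size=8, batch_size=2, rng=rng)

for xs, ys in zip(x, y):
    # At position t, the model sees xs[:t + 1] and should predict ys[t]
    for t in range(len(xs)):
        print(repr("".join(xs[:t + 1])), "->", repr(ys[t]))
```

Each row therefore yields `block_size` training examples, one per context length from 1 up to `block_size`.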
