In [1]:
# Downloading the data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2025-03-03 10:29:12--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: 'input.txt'


In [1]:
# Reading the data
with open('input.txt', 'r', encoding='utf-8') as f:
    text = f.read()


In [2]:
# Finding all unique characters in alphabetical order
chars = sorted(list(set(text)))
vocab_size = len(chars)
print("".join(chars))
print(vocab_size)



 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz
65


In [25]:
# Creating a mapping from chars to int, and vice versa
stoi = {ch:i for i, ch in enumerate(chars)}
itos = {i:ch for i, ch in enumerate(chars)}

# encoder and decoder functions
encode = lambda s: [stoi[ch] for ch in s ]              # returns a list of encoded chars
decode = lambda l: ''.join([itos[i] for i in l])        # returns the string

print(encode("Hello, World!"))
print(decode(encode("Hello, World!")))                  # decode(encode(s)) will return s


[20, 43, 50, 50, 53, 6, 1, 35, 53, 56, 50, 42, 2]
Hello, World!


In [26]:
# Using torch.Tensor to store entire text dataset (encoded)
import torch  # type: ignore

# torch.long is int64; token IDs are integers, so no floating-point dtype is needed
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:1000])


torch.Size([1115394]) torch.int64
tensor([18, 47, 56, 57, 58,  1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 14, 43, 44,
        53, 56, 43,  1, 61, 43,  1, 54, 56, 53, 41, 43, 43, 42,  1, 39, 52, 63,
         1, 44, 59, 56, 58, 46, 43, 56,  6,  1, 46, 43, 39, 56,  1, 51, 43,  1,
        57, 54, 43, 39, 49,  8,  0,  0, 13, 50, 50, 10,  0, 31, 54, 43, 39, 49,
         6,  1, 57, 54, 43, 39, 49,  8,  0,  0, 18, 47, 56, 57, 58,  1, 15, 47,
        58, 47, 64, 43, 52, 10,  0, 37, 53, 59,  1, 39, 56, 43,  1, 39, 50, 50,
         1, 56, 43, 57, 53, 50, 60, 43, 42,  1, 56, 39, 58, 46, 43, 56,  1, 58,
        53,  1, 42, 47, 43,  1, 58, 46, 39, 52,  1, 58, 53,  1, 44, 39, 51, 47,
        57, 46, 12,  0,  0, 13, 50, 50, 10,  0, 30, 43, 57, 53, 50, 60, 43, 42,
         8,  1, 56, 43, 57, 53, 50, 60, 43, 42,  8,  0,  0, 18, 47, 56, 57, 58,
         1, 15, 47, 58, 47, 64, 43, 52, 10,  0, 18, 47, 56, 57, 58,  6,  1, 63,
        53, 59,  1, 49, 52, 53, 61,  1, 15, 39, 47, 59, 57,  1, 25, 39, 56, 41,
      

Note:
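- Storing the data in a `torch.Tensor` gives us access to the wide range of tools that come with PyTorch
- We set `dtype=torch.long` (int64) because the encoder only produces integers, so there is no need for a floating-point type such as PyTorch's default float32. `torch.tensor` would infer int64 from a list of Python ints anyway, so the explicit dtype mainly documents the intent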
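
To make the dtype point concrete, here is a minimal sketch of how `torch.tensor` picks a dtype. The `ids` list is a made-up stand-in for the output of `encode`, and the float case assumes PyTorch's default dtype has not been changed.

```python
import torch

ids = [18, 47, 56, 57, 58]                       # integer token IDs (illustrative values)

inferred = torch.tensor(ids)                     # dtype inferred from the Python ints
explicit = torch.tensor(ids, dtype=torch.long)   # dtype stated explicitly
floats = torch.tensor([0.5, 1.5])                # Python floats use the default float dtype

print(inferred.dtype, explicit.dtype, floats.dtype)
print(torch.long is torch.int64)                 # torch.long is an alias for torch.int64
```

Both integer tensors end up with the same dtype, so the explicit argument is a readability choice rather than a requirement.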

Why do we encode the input?
- Computers operate on numbers, so we have to convert the input text into numbers before the model can process it
- We map each character (or word) to a number, so that the number stands in for that character

In [5]:
# Splitting up the data into training set and validation sets
n = int(0.9*len(data))

train_data = data[:n]           # Used for training
val_data = data[n:]             # Used to test how good the model is


Note:
- When we create a model, we want it to also handle data outside of its training set. A model that is only good at what it was trained on does not generalize, which makes it useless in practice
- That is why we hold out 10% of the data for validation: to see how the model performs on data it was not trained on, or in other words, to see to what extent the model is overfitting
- Overfitting is bad because even if the model predicts the training set to a high degree, that performance will not carry over to data it has never seen before
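
As a small sanity check of the 90/10 split, the sketch below applies the same slicing to a toy tensor. The names `toy`, `toy_train` and `toy_val` are hypothetical and only stand in for `data`, `train_data` and `val_data`.

```python
import torch

toy = torch.arange(20)              # stand-in for the encoded dataset
n = int(0.9 * len(toy))             # the first 90% goes to training

toy_train = toy[:n]                 # positions 0 .. n-1
toy_val = toy[n:]                   # the remaining tail

print(len(toy_train), len(toy_val))
# The two slices do not overlap and together cover the whole dataset
print(torch.equal(torch.cat([toy_train, toy_val]), toy))
```

Note that the split is by position, not shuffled: the validation set is simply the last 10% of the text.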

In [None]:
# Selecting blocks 
block_size = 8
x = train_data[:block_size]             # inputs (contexts)
y = train_data[1:block_size + 1]        # targets: the inputs shifted by one position
for t in range(block_size):
    print(f"Given {x[:t+1]}, my target is {y[t]}")

Given tensor([18]), my target is 47
Given tensor([18, 47]), my target is 56
Given tensor([18, 47, 56]), my target is 57
Given tensor([18, 47, 56, 57]), my target is 58
Given tensor([18, 47, 56, 57, 58]), my target is 1
Given tensor([18, 47, 56, 57, 58,  1]), my target is 15
Given tensor([18, 47, 56, 57, 58,  1, 15]), my target is 47
Given tensor([18, 47, 56, 57, 58,  1, 15, 47]), my target is 58


Note:
- We train on small chunks with a fixed block size (context length), because feeding the entire dataset to the model at once is impractical, for example because of limited memory
- We sample random chunks from the dataset and train on them. Each chunk has a maximum length of `block_size`

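Number of training examples in a block:
- A chunk/block selected from the dataset actually packs multiple training examples. In the code above, a block of `block_size` = 8 inputs yields 8 (context, target) pairs, with contexts of length 1 through 8
- In a way, this resembles a Bayesian framework: every prediction is conditioned on the context seen so far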
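
The same idea may be easier to read on raw characters than on token IDs. This is only a sketch: `sample` is a short stand-in string, and `examples` is a hypothetical name for the list of (context, target) pairs.

```python
block_size = 8
sample = "First Citizen"                 # short stand-in text for illustration
chunk = sample[:block_size + 1]          # block_size + 1 characters are needed

# One chunk yields block_size training examples with growing context
examples = [(chunk[:t + 1], chunk[t + 1]) for t in range(block_size)]
for context, target in examples:
    print(f"{context!r} -> {target!r}")
```

A chunk needs `block_size + 1` characters because the last context of length `block_size` still requires one more character as its target.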

In [34]:
torch.manual_seed(1337)
batch_size = 4          # How many independent chunks should we process in parallel
block_size = 8          # Maximum context length per chunk

def get_batch(split: str) -> tuple[torch.Tensor, torch.Tensor]:
    data = train_data if split == 'train' else val_data
    ix = torch.randint(len(data) - block_size, (batch_size,))

    # We're just implementing the ideas from above cell
    x = torch.stack([data[i:i + block_size] for i in ix])                   # x values to predict the next token
    y = torch.stack([data[i+1: i + block_size+1] for i in ix])              # y values, the target

    return x,y
 
get_batch('val')        # any split other than 'train' samples from val_data

(tensor([[ 6,  1, 52, 53, 58,  1, 58, 47],
         [ 6,  1, 54, 50, 39, 52, 58, 43],
         [ 1, 58, 46, 47, 57,  1, 50, 47],
         [ 0, 32, 46, 43, 56, 43,  1, 42]]),
 tensor([[ 1, 52, 53, 58,  1, 58, 47, 50],
         [ 1, 54, 50, 39, 52, 58, 43, 58],
         [58, 46, 47, 57,  1, 50, 47, 60],
         [32, 46, 43, 56, 43,  1, 42, 53]]))

Elaborations:

`ix = torch.randint(len(data) - block_size, (batch_size,))`
- Here we generate `batch_size` random integers to serve as the starting indices of the chunks. The integers lie in [0, `len(data) - block_size`)
- The upper bound ensures we don't go out of bounds when we take `block_size` elements, plus the targets shifted by one, from a starting index
- The random indices are stored in a 1D tensor, whose shape is given by `(batch_size,)`
- The one-element tuple specifies a 1D tensor. Something like `(2,3)` would instead give a 2D tensor with dimensions 2x3, meaning 2 rows, each of length 3
- If we wrote `(batch_size)` without the `,`, Python would treat it as the plain integer `batch_size` rather than a tuple, while `torch.randint` expects the shape as a tuple. We need the tuple because we want a collection of random integers stored in one tensor
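
The points about the shape tuple and the index range can be checked with a short sketch. The sizes below (`data_len = 100` and so on) are toy values chosen for illustration, and the seed is arbitrary.

```python
import torch

torch.manual_seed(0)                             # arbitrary seed, only for repeatability
data_len, block_size, batch_size = 100, 8, 4     # toy sizes for illustration

ix = torch.randint(data_len - block_size, (batch_size,))   # 1D tensor of start indices
grid = torch.randint(data_len - block_size, (2, 3))        # 2D tensor: 2 rows, 3 columns

print(ix.shape, grid.shape)
print(type((batch_size)), type((batch_size,)))   # int vs. tuple

# Every start index leaves room for block_size inputs plus the shifted targets
print(bool((ix + block_size + 1 <= data_len).all()))
```

The largest possible start index is `data_len - block_size - 1`, so the target slice ending at `i + block_size + 1` reaches at most `data_len`.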

---
`torch.stack`
- `torch.stack` takes a sequence of tensors that all have the SAME shape (n dimensions) and joins them along a new dimension, producing a tensor with n+1 dimensions
- For example, stacking `tensor(1)`, `tensor(2)`, `tensor(3)`, all 0D, gives the 1D tensor `tensor([1,2,3])`
- Stacking `tensor([1,2,3])` and `tensor([4,5,6])` gives a 2x3 matrix
- For training, we stack the 1D chunks into one batch so that the rows can be processed simultaneously, taking advantage of the parallelism of the GPU

ex. Training in parallel

|   <------ <1,2,3>         my independent vector1

|   <------ <4,5,6>         my independent vector2
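
The stacking examples above can be reproduced with a few lines. This is a minimal sketch; `torch.cat` is included only as a contrast and is not used in the notebook's `get_batch`.

```python
import torch

scalars = [torch.tensor(1), torch.tensor(2), torch.tensor(3)]   # three 0D tensors
vector = torch.stack(scalars)                                    # new dimension: shape (3,)

rows = [torch.tensor([1, 2, 3]), torch.tensor([4, 5, 6])]        # two 1D tensors
matrix = torch.stack(rows)                                       # shape (2, 3)

joined = torch.cat(rows)      # concatenation adds no dimension: shape (6,)

print(vector, vector.shape)
print(matrix.shape, joined.shape)
```

Stacking keeps each chunk as its own row, which is what lets a batch hold several independent sequences side by side.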

---


`x = torch.stack([data[i:i + block_size] for i in ix])` 
- To be continued