<a href="https://colab.research.google.com/github/Devansh-react/PyTorch_fundamentals/blob/main/Transformer.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [4]:
# We always start with a dataset to train on. Let's download the tiny shakespeare dataset
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2026-02-11 20:07:33--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.2’


2026-02-11 20:07:33 (40.9 MB/s) - ‘input.txt.2’ saved [1115394/1115394]



In [5]:
with open("input.txt",'r',encoding='utf-8') as f:
    text = f.read()
text[:2000]

"First Citizen:\nBefore we proceed any further, hear me speak.\n\nAll:\nSpeak, speak.\n\nFirst Citizen:\nYou are all resolved rather to die than to famish?\n\nAll:\nResolved. resolved.\n\nFirst Citizen:\nFirst, you know Caius Marcius is chief enemy to the people.\n\nAll:\nWe know't, we know't.\n\nFirst Citizen:\nLet us kill him, and we'll have corn at our own price.\nIs't a verdict?\n\nAll:\nNo more talking on't; let it be done: away, away!\n\nSecond Citizen:\nOne word, good citizens.\n\nFirst Citizen:\nWe are accounted poor citizens, the patricians good.\nWhat authority surfeits on would relieve us: if they\nwould yield us but the superfluity, while it were\nwholesome, we might guess they relieved us humanely;\nbut they think we are too dear: the leanness that\nafflicts us, the object of our misery, is as an\ninventory to particularise their abundance; our\nsufferance is a gain to them Let us revenge this with\nour pikes, ere we become rakes: for the gods know I\nspeak this in hunger 

In [6]:
chars = sorted(set(text))  # unique characters in the raw text
print(len(chars))

65


In [7]:
!pip install tiktoken



In [8]:
import tiktoken
enc = tiktoken.get_encoding("o200k_base")
assert enc.decode(enc.encode("hello world")) == "hello world"
# To get the tokeniser corresponding to a specific model in the OpenAI API:
enc = tiktoken.encoding_for_model("gpt-4o")


In [9]:
import torch
data = torch.tensor(enc.encode(text),dtype=torch.long)
# `data` holds tiktoken token ids, so its length counts tokens, not characters
print(f"length of dataset in tokens: {len(data)} & datatype is {data.dtype}")

length of dataset in tokens: 297606 & datatype is torch.int64


In [10]:
print(data[:1000])

tensor([  7127,  84479,    734,  13036,    581,  18988,   1062,   6544,     11,
          9598,    668,  10591,    364,   2594,    734, 116872,     11,  10591,
           364,   7127,  84479,    734,   3575,    553,    722,  33944,   7542,
           316,   1076,   1572,    316,   2079,   1109,   1715,   2594,    734,
         80773,     13,  33944,    364,   7127,  84479,    734,   7127,     11,
           481,   1761, 149492,    385,   3145, 137928,    382,  20915,  20935,
           316,    290,   1665,    364,   2594,    734,   2167,   1761,   1507,
            11,    581,   1761,   1507,    364,   7127,  84479,    734,  12845,
           765,  15874,   2395,     11,    326,  22782,    679,  33994,    540,
          1039,   2316,   3911,    558,   3031,   1507,    261,  75722,   1715,
          2594,    734,   3160,    945,  11695,    402,   1507,     26,   1632,
           480,    413,   4167,     25,   4194,     11,   4194,   1703,  17422,
         84479,    734,   5045,   2195, 

In [11]:
fraction  =  int(0.8*len(data))
train_data = data[:fraction]
test_data = data[fraction:]
print(f"len of traindata:{len(train_data)}")
print(f"len of testdata:{len(test_data)}")

len of traindata:238084
len of testdata:59522


In [12]:
block_size = 10
train_data[:block_size+1]  # one extra token so every position has a next-token target

tensor([ 7127, 84479,   734, 13036,   581, 18988,  1062,  6544,    11,  9598,
          668])

In [13]:
x = train_data[:block_size]
y = train_data[1:block_size+1]
for t in range(block_size):
    context = x[:t+1]  # all tokens up to and including position t
    target = y[t]      # the token that follows the context
    print(f"when input is {context} the target is {target}")

when input is tensor([7127]) the target is 84479
when input is tensor([ 7127, 84479]) the target is 734
when input is tensor([ 7127, 84479,   734]) the target is 13036
when input is tensor([ 7127, 84479,   734, 13036]) the target is 581
when input is tensor([ 7127, 84479,   734, 13036,   581]) the target is 18988
when input is tensor([ 7127, 84479,   734, 13036,   581, 18988]) the target is 1062
when input is tensor([ 7127, 84479,   734, 13036,   581, 18988,  1062]) the target is 6544
when input is tensor([ 7127, 84479,   734, 13036,   581, 18988,  1062,  6544]) the target is 11
when input is tensor([ 7127, 84479,   734, 13036,   581, 18988,  1062,  6544,    11]) the target is 9598
when input is tensor([ 7127, 84479,   734, 13036,   581, 18988,  1062,  6544,    11,  9598]) the target is 668


In [14]:
torch.manual_seed(1337)
batch_size = 4 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?

def get_batch(split):
    # generate a small batch of data of inputs x and targets y
    data = train_data if split == 'train' else test_data
    ix = torch.randint(len(data) - block_size, (batch_size,))
    x = torch.stack([data[i:i+block_size] for i in ix]).long() # Convert to long
    y = torch.stack([data[i+1:i+block_size+1] for i in ix]).long() # Convert to long
    return x, y

xb, yb = get_batch('train')
print('inputs:')
print(xb.shape)
print(xb)
print('targets:')
print(yb.shape)
print(yb)

print('----'*50)

for b in range(batch_size): # batch dimension
    for t in range(block_size): # time dimension
        context = xb[b, :t+1]
        target = yb[b,t]
        print(f"when input is {context.tolist()} the target: {target}")

inputs:
torch.Size([4, 8])
tensor([[   976,   5030,    382,   1775,  16240,    364,  81679,  14325],
        [  2622,   2395,    412, 141068,    765,   1294,   1838,     11],
        [  8773,    316,    413,  17055,   1165,   2009,    734,  25614],
        [    40,   3042,    734,  78368,  25367,     11,    538,  35115]])
targets:
torch.Size([4, 8])
tensor([[  5030,    382,   1775,  16240,    364,  81679,  14325,   3042],
        [  2395,    412, 141068,    765,   1294,   1838,     11,    483],
        [   316,    413,  17055,   1165,   2009,    734,  25614,  83062],
        [  3042,    734,  78368,  25367,     11,    538,  35115,  48103]])
--------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------
when input is [976] the target: 5030
when input is [976, 5030] the target: 382
when input is [976, 5030, 382] the target: 1775
when input is [976, 503

In [15]:
print(xb)

tensor([[   976,   5030,    382,   1775,  16240,    364,  81679,  14325],
        [  2622,   2395,    412, 141068,    765,   1294,   1838,     11],
        [  8773,    316,    413,  17055,   1165,   2009,    734,  25614],
        [    40,   3042,    734,  78368,  25367,     11,    538,  35115]])


**What is an Embedding?**

In machine learning, an embedding is a mapping from discrete objects (such as words, users, or items) to vectors of real numbers. It replaces high-dimensional, sparse inputs (such as one-hot encodings of words) with lower-dimensional, dense representations. Once trained, these dense vectors can capture semantic relationships, so words with similar meanings tend to have similar embedding vectors.

**Why use nn.Embedding?**

**Dimensionality reduction:** One-hot encoding for a large vocabulary results in very high-dimensional, sparse vectors. For example, with 10,000 unique words, each word is a 10,000-dimensional vector containing a single 1 and 9,999 0s. Embeddings reduce this to a much smaller, dense vector (e.g., 128 or 300 dimensions).

**Capturing semantic meaning:** With one-hot encoding, every word is equidistant from every other word, whereas embeddings learn relationships. For example, the embedding for "king" might end up close to "queen" in the vector space, and the vector king - man + woman might be close to queen.

In the bigram model below, the table is `vocab_size × vocab_size`, so each row is read directly as the next-token logits rather than as a compact embedding.
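
To make the lookup idea concrete, here is a minimal sketch showing that indexing an `nn.Embedding` gives the same result as multiplying one-hot vectors by its weight matrix. The sizes (`vocab_size = 10`, `embed_dim = 4`) and the token ids are made up for illustration and are not the notebook's values.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

torch.manual_seed(0)

vocab_size, embed_dim = 10, 4   # illustrative sizes, not the notebook's
emb = nn.Embedding(vocab_size, embed_dim)

idx = torch.tensor([[1, 5, 5], [0, 9, 2]])   # (B=2, T=3) batch of token ids

# Direct lookup: each id selects one row of emb.weight.
looked_up = emb(idx)                                        # (2, 3, 4)

# Equivalent dense computation: one-hot vectors times the weight matrix.
one_hot = F.one_hot(idx, num_classes=vocab_size).float()    # (2, 3, 10)
via_matmul = one_hot @ emb.weight                           # (2, 3, 4)

print(looked_up.shape)
print(torch.allclose(looked_up, via_matmul))
```

The lookup avoids ever building the sparse one-hot tensor, which is the practical reason to prefer `nn.Embedding` over an explicit matrix product.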

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(1337)

class BigramLanguageModel(nn.Module):

    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):

        # idx and targets are both (B,T) tensor of integers
        logits = self.token_embedding_table(idx) # (B,T,C)

        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T) array of indices in the current context
        for _ in range(max_new_tokens):
            # get the predictions
            logits, loss = self(idx)
            # focus only on the last time step
            logits = logits[:, -1, :] # becomes (B, C)
            # apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1) # (B, C)
            # sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1) # (B, 1)
            # append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1) # (B, T+1)
        return idx

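# The ids in xb/yb come from tiktoken, so the table must cover enc.n_vocab ids
# (len(chars) == 65 would be too small). Caution: for "o200k_base" that is roughly
# 200k ids, so a vocab_size x vocab_size float32 table needs on the order of 160 GB.
m = BigramLanguageModel(enc.n_vocab)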
logits, loss = m(xb, yb)
print(logits.shape)
print(loss)

print(enc.decode(m.generate(idx = torch.zeros((1, 1), dtype=torch.long), max_new_tokens=100)[0].tolist()))
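
The token ids here come from tiktoken's `o200k_base` encoding, whose vocabulary has roughly 200,000 entries, so a `vocab_size × vocab_size` lookup table sized for the full tokenizer would hold about 4 × 10^10 weights. One possible workaround, sketched below under the assumption that only ids that actually occur in the dataset need to be modeled, is to remap those ids to a contiguous range. The names `used_ids`, `compact`, and `table_bytes` are hypothetical, and the short `data` tensor is a toy stand-in for the real one.

```python
import torch

def table_bytes(vocab_size, bytes_per_weight=4):
    """Memory for a vocab_size x vocab_size float32 lookup table."""
    return vocab_size * vocab_size * bytes_per_weight

# Toy stand-in for `data`: sparse ids drawn from a large tokenizer vocabulary.
data = torch.tensor([7127, 84479, 734, 7127, 11, 84479, 734, 11], dtype=torch.long)

# `used_ids` holds the sorted distinct ids; `compact` is `data` rewritten as
# indices into `used_ids`, so its values lie in [0, len(used_ids)).
used_ids, compact = torch.unique(data, return_inverse=True)
compact_vocab_size = used_ids.numel()

# Map compact ids back to tokenizer ids (do this before decoding to text).
restored = used_ids[compact]

print(compact_vocab_size)          # 4
print(compact.tolist())            # [2, 3, 1, 2, 0, 3, 1, 0]
print(table_bytes(200_000) / 1e9)  # 160.0 (GB) for a ~200k vocabulary
```

With this mapping the model would be built with `compact_vocab_size` and trained on `compact`, and generated ids would be passed through `used_ids[...]` before `enc.decode`.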


# Task
Define essential training hyperparameters, instantiate an AdamW optimizer for the model, and create a function to estimate loss on both the training and validation sets. Then implement a training loop that updates the model weights and periodically evaluates the loss. After training, generate a sample text sequence using the `BigramLanguageModel`'s `generate` method. Finally, review the training process and the generated text to assess model performance and provide insights.
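
As a compact preview of how these pieces fit together, the following is a minimal end-to-end sketch on toy data, not the notebook's actual training run. Everything in it is an assumption made for illustration: the small vocabulary, the synthetic repeating sequence, the larger learning rate, and the names `TinyBigram`, `val_data`, and `estimate_loss`. Text generation is left out to keep it short.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(1337)

# Toy values (assumed): small enough to run in seconds on a CPU.
vocab_size, block_size, batch_size = 50, 8, 32
lr, max_iters, eval_interval, eval_iters = 1e-2, 300, 100, 20

# Synthetic data: a repeating 0..49 pattern, so the next token is predictable.
data = torch.arange(5000) % vocab_size
train_data, val_data = data[:4000], data[4000:]

def get_batch(split):
    source = train_data if split == "train" else val_data
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y

class TinyBigram(nn.Module):  # hypothetical stand-in for the notebook's model
    def __init__(self, vocab_size):
        super().__init__()
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        logits = self.token_embedding_table(idx)  # (B, T, C)
        if targets is None:
            return logits, None
        B, T, C = logits.shape
        loss = F.cross_entropy(logits.view(B * T, C), targets.view(B * T))
        return logits, loss

@torch.no_grad()
def estimate_loss(model):
    """Average the loss over eval_iters random batches for each split."""
    out = {}
    model.eval()
    for split in ("train", "val"):
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            xb, yb = get_batch(split)
            _, loss = model(xb, yb)
            losses[k] = loss.item()
        out[split] = losses.mean().item()
    model.train()
    return out

model = TinyBigram(vocab_size)
optimizer = torch.optim.AdamW(model.parameters(), lr=lr)

initial = estimate_loss(model)
for step in range(max_iters):
    if step % eval_interval == 0:
        print(step, estimate_loss(model))  # periodic evaluation
    xb, yb = get_batch("train")
    _, loss = model(xb, yb)
    optimizer.zero_grad(set_to_none=True)
    loss.backward()
    optimizer.step()
final = estimate_loss(model)
print("final", final)
```

Averaging over several batches in `estimate_loss` gives a less noisy reading than the loss of a single training batch, which is why the evaluation uses its own loop.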

## Define Hyperparameters

### Subtask:
Set up essential training hyperparameters such as learning rate, number of training iterations, and an evaluation interval.


**Reasoning**:
The subtask requires defining essential training hyperparameters. I will set the learning rate, maximum iterations, evaluation interval, and evaluation iterations in a code block.



In [1]:
batch_size = 32 # how many independent sequences will we process in parallel?
block_size = 8 # what is the maximum context length for predictions?
lr = 1e-3
max_iters = 3000
eval_interval = 300
eval_iters = 200

print(f"Learning Rate: {lr}")
print(f"Maximum Iterations: {max_iters}")
print(f"Evaluation Interval: {eval_interval}")
print(f"Evaluation Iterations: {eval_iters}")

Learning Rate: 0.001
Maximum Iterations: 3000
Evaluation Interval: 300
Evaluation Iterations: 200


## Instantiate Optimizer

### Subtask:
Instantiate an `AdamW` optimizer for the `BigramLanguageModel` using the defined learning rate.

#### Instructions
1. Initialize the `AdamW` optimizer, passing the model's parameters and the previously defined learning rate `lr`.

**Reasoning**:
The previous step defined the hyperparameters. Now, as per the current subtask, I need to instantiate an AdamW optimizer for the model using the defined learning rate. This requires a code block.



**Reasoning**:
The error `NameError: name 'torch' is not defined` indicates that the `torch` library was not imported in the current execution environment. I will add `import torch` at the beginning of the code block to ensure `torch` is available when instantiating the optimizer.



In [2]:
import torch

optimizer = torch.optim.AdamW(m.parameters(), lr=lr)
print(f"Optimizer instantiated with learning rate: {lr}")

NameError: name 'm' is not defined

In [None]:
import torch
import torch.nn as nn
from torch.nn import functional as F
import tiktoken

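# Re-create the tokenizer, because the earlier cells were not run in this session
enc = tiktoken.encoding_for_model("gpt-4o")

# Re-define the BigramLanguageModel class for the same reason
class BigramLanguageModel(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # each token directly reads off the logits for the next token from a lookup table
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    def forward(self, idx, targets=None):
        # idx and targets are both (B,T) tensors of integers
        logits = self.token_embedding_table(idx)  # (B,T,C)
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B*T, C)
            targets = targets.view(B*T)
            loss = F.cross_entropy(logits, targets)
        return logits, loss

    def generate(self, idx, max_new_tokens):
        # idx is (B, T); sample one token at a time and append it to the context
        for _ in range(max_new_tokens):
            logits, _ = self(idx)
            probs = F.softmax(logits[:, -1, :], dim=-1)  # last time step, (B, C)
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

# Instantiate the model with the tiktoken vocabulary size.
# Caution: enc.n_vocab is roughly 200k, so this float32 table needs on the order of 160 GB.
m = BigramLanguageModel(enc.n_vocab)

optimizer = torch.optim.AdamW(m.parameters(), lr=lr)
print(f"Optimizer instantiated with learning rate: {lr}")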