# **<center> Constructing an LLM from Scratch </center>**
#### The main goal of this notebook is to give a basic understanding of how an LLM works. In this notebook I implement a basic LLM from scratch and explain each part along the way.

## **Importing Modules**

In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F
torch.manual_seed(69)

<torch._C.Generator at 0x7d110fcef270>

## **Dataset**

In [2]:
with open('/kaggle/input/shakespeare/shakespeare.txt', 'r', encoding='utf-8') as f:
    text = f.read()

## **Hyperparameters For The Model**

In [3]:
# Define hyperparameters for training a model
batch_size = 16  # how many independent sequences will we process in parallel?
block_size = 32  # what is the maximum context length for predictions?
max_iters = 5000  # maximum number of training iterations
eval_interval = 100  # evaluate the model every `eval_interval` iterations
learning_rate = 1e-3  # learning rate for the optimizer
device = 'cuda' if torch.cuda.is_available() else 'cpu'  # use GPU if available, otherwise use CPU
eval_iters = 200  # number of batches averaged for each loss estimate
n_embd = 64  # dimensionality of the embedding layer
n_head = 4  # number of attention heads
n_layer = 4  # number of transformer layers
dropout = 0.5  # dropout probability, set to 0.0 for no dropout


Brief explanation of each hyperparameter:

1. `batch_size`: This hyperparameter defines the number of independent sequences processed in parallel during each training iteration. Larger batches give less noisy gradient estimates but use more memory.

2. `block_size`: This hyperparameter specifies the maximum length of sequences that the model can process. It determines the context length for predictions and affects the model's ability to capture dependencies within sequences.

3. `max_iters`: This hyperparameter sets the number of training iterations (optimizer steps). These are not epochs: each iteration processes one randomly sampled batch. It controls the duration of training.

4. `eval_interval`: This hyperparameter determines how often the model's performance is evaluated during training. It affects the frequency of monitoring training progress and validation performance.

5. `learning_rate`: This hyperparameter controls the step size or rate at which the model parameters are updated during optimization. It influences the convergence speed and stability of the training process.

6. `device`: This hyperparameter specifies the device (CPU or GPU) on which the model computations are performed. It allows for efficient utilization of available hardware resources.

7. `eval_iters`: This hyperparameter sets how many batches are sampled and averaged each time the loss is estimated. More batches give a less noisy estimate at the cost of extra computation.

8. `n_embd`: This hyperparameter defines the dimensionality of the embedding layer. It determines the size of the vector representations for tokens in the input sequences.

9. `n_head`: This hyperparameter specifies the number of attention heads in the multi-head attention mechanism used in the transformer architecture. It controls the model's ability to attend to different parts of the input sequence simultaneously.

10. `n_layer`: This hyperparameter sets the number of transformer layers in the model. It determines the depth of the model and its capacity to capture complex patterns in the data.

11. `dropout`: This hyperparameter defines the probability of dropping out neurons during training. It helps prevent overfitting by regularizing the model and reducing co-adaptation between neurons. A value of 0.0 means no dropout is applied, while higher values introduce more dropout.
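
To make a few of these settings concrete, the short sketch below derives some quantities that follow directly from the values above. The names `tokens_per_step`, `total_tokens`, and `head_size_example` are illustrative only and are not used elsewhere in the notebook.

```python
# Values copied from the hyperparameter cell above
batch_size = 16
block_size = 32
max_iters = 5000
n_embd = 64
n_head = 4

# Each training step processes batch_size sequences of block_size tokens
tokens_per_step = batch_size * block_size

# Number of token positions processed over the whole training run
total_tokens = tokens_per_step * max_iters

# Each attention head works on an equal slice of the embedding dimension
head_size_example = n_embd // n_head

print(tokens_per_step, total_tokens, head_size_example)  # 512 2560000 16
```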

## **Converting Strings Into Integers**

In [4]:
chars = sorted(list(set(text)))  # Get unique characters from the text and sort them
vocab_size = len(chars)  # Total number of unique characters in the text

print("Unique characters that occur in this text :\n",chars)

Unique characters that occur in this text :
 ['\n', ' ', '!', "'", '(', ')', ',', '-', '.', ':', ';', '?', 'A', 'B', 'C', 'D', 'E', 'F', 'G', 'H', 'I', 'J', 'K', 'L', 'M', 'N', 'O', 'P', 'R', 'S', 'T', 'U', 'V', 'W', 'Y', 'a', 'b', 'c', 'd', 'e', 'f', 'g', 'h', 'i', 'j', 'k', 'l', 'm', 'n', 'o', 'p', 'q', 'r', 's', 't', 'u', 'v', 'w', 'x', 'y', 'z']


In [5]:
# create a mapping from characters to integers
stoi = {ch: i for i, ch in enumerate(chars)}  # Map each character to an integer index
itos = {i: ch for i, ch in enumerate(chars)}  # Map each integer index to a character

# Define encoding and decoding functions
encode = lambda s: [stoi[c] for c in s]  # Encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l])  # Decoder: take a list of integers, output a string

- `stoi`:
   - This dictionary maps characters to integers (`stoi` stands for "string to integer").
   - It is built by a dictionary comprehension that iterates over `chars` with `enumerate`, so each character gets its position in the sorted list as a unique index.

- `itos`:
   - This dictionary is the inverse mapping, from integers to characters (`itos` stands for "integer to string").
   - It is built from the same `enumerate(chars)` loop with key and value swapped, so `itos[stoi[c]] == c` for every character `c` in the vocabulary.

- `encode`:
   - This lambda takes a string (`s`) as input and returns a list of integers.
   - It iterates over each character (`c`) in `s` and looks up its index in `stoi`. A character that never occurs in `text` is not in `stoi` and raises a `KeyError`.

- `decode`:
   - This lambda takes a list of integers (`l`) as input and returns a string.
   - It iterates over each integer (`i`) in `l`, looks up the corresponding character in `itos`, and joins the characters with `''.join` to form the decoded string.
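
As a quick check of the mappings described above, here is a minimal round-trip sketch on a toy string. The name `sample_text` and the toy corpus are assumptions for illustration; the notebook builds the same structures from the Shakespeare text.

```python
# Toy corpus standing in for the Shakespeare text (assumption for illustration)
sample_text = "hello world"

chars = sorted(set(sample_text))               # [' ', 'd', 'e', 'h', 'l', 'o', 'r', 'w']
stoi = {ch: i for i, ch in enumerate(chars)}   # character -> integer
itos = {i: ch for i, ch in enumerate(chars)}   # integer -> character

def encode(s):
    """Convert a string into a list of integer token ids."""
    return [stoi[c] for c in s]

def decode(ids):
    """Convert a list of integer token ids back into a string."""
    return "".join(itos[i] for i in ids)

ids = encode("hello")
print(ids)           # [3, 2, 4, 4, 5]
print(decode(ids))   # hello
```

Decoding the encoded ids returns the original string, which is the property the rest of the notebook relies on.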

## **Train-Validation Split**

In [6]:
# Train and validation splits
data = torch.tensor(encode(text), dtype=torch.long)  # Convert the text to a tensor of integers
n = int(0.9 * len(data))  # Calculate the index to split the data into train and validation sets (90% train, 10% validation)
train_data = data[:n]  # Training data (first 90% of the data)
val_data = data[n:]  # Validation data (remaining 10% of the data)

## **Data Loading**

In [7]:
def get_batch(split):
    """
    Function to generate a small batch of data consisting of inputs x and targets y.

    Args:
    - split: A string indicating whether to use the training or validation data.

    Returns:
    - x: Input tensor of shape (batch_size, block_size) containing sequences of integers.
    - y: Target tensor of shape (batch_size, block_size) containing sequences of integers shifted by one position.
    """
    data = train_data if split == 'train' else val_data  # Select data based on the split (train or validation)
    ix = torch.randint(len(data) - block_size, (batch_size,))  # Generate random indices for selecting sequences
    x = torch.stack([data[i:i+block_size] for i in ix])  # Extract input sequences of length block_size
    y = torch.stack([data[i+1:i+block_size+1] for i in ix])  # Extract target sequences shifted by one position
    x, y = x.to(device), y.to(device)  # Move tensors to the specified device (CPU or GPU)
    return x, y
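
The relationship between `x` and `y` is easier to see on a tiny example. The sketch below uses a toy tensor and a fixed start index instead of a random one; `toy_data` and `toy_block_size` are hypothetical names that stand in for `train_data` and `block_size`.

```python
import torch

# Toy "dataset" of 10 token ids (assumption: stands in for train_data)
toy_data = torch.arange(10)
toy_block_size = 4

# A fixed start index instead of a random one, so the result is easy to read
i = 2
x = toy_data[i : i + toy_block_size]          # tensor([2, 3, 4, 5])
y = toy_data[i + 1 : i + toy_block_size + 1]  # tensor([3, 4, 5, 6])

# Each position t yields one training example: context x[:t+1] -> target y[t]
for t in range(toy_block_size):
    print(x[: t + 1].tolist(), "->", y[t].item())
```

A single sequence of length `block_size` therefore supplies `block_size` next-token prediction examples, one for each context length.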


## **Loss Function**

In [8]:
@torch.no_grad()
def estimate_loss():
    """
    Function to estimate the average loss on the training and validation data without performing gradient computation.

    Returns:
    - out: A dictionary containing the average loss for the training and validation data.
    """
    out = {}  # Initialize a dictionary to store the output
    model.eval()  # Set the model to evaluation mode (disables dropout; gradients are already off via @torch.no_grad())
    for split in ['train', 'val']:  # Iterate over training and validation data
        losses = torch.zeros(eval_iters)  # Initialize a tensor to store individual losses for each evaluation iteration
        for k in range(eval_iters):  # Iterate over evaluation iterations
            X, Y = get_batch(split)  # Get a batch of input-output pairs
            logits, loss = model(X, Y)  # Forward pass through the model to get predictions and loss
            losses[k] = loss.item()  # Store the loss value
        out[split] = losses.mean()  # Calculate the mean loss for the current split and store it in the output dictionary
    model.train()  # Set the model back to training mode
    return out  # Return the dictionary containing the average losses for training and validation data

The `estimate_loss` function serves to compute the average loss on both the training and validation datasets without engaging in gradient computation. By temporarily disabling gradient calculations and setting the model to evaluation mode, it iterates through the data splits, samples batches, and computes losses over multiple evaluation iterations. The function then returns a dictionary containing the mean losses for each split. This approach efficiently assesses the model's performance without updating its parameters and is crucial for monitoring training progress and model validation.
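
Switching to evaluation mode matters here because the model uses dropout. The following minimal sketch, which uses a standalone `nn.Dropout` layer rather than the notebook's model, shows the difference between the two modes.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
drop = nn.Dropout(p=0.5)
x = torch.ones(8)

drop.train()
train_out = drop(x)   # entries are either zeroed or scaled by 1 / (1 - p) = 2

drop.eval()
eval_out = drop(x)    # dropout is the identity in evaluation mode

print(train_out)
print(eval_out)
```

Without the call to `model.eval()`, each loss estimate would be computed with randomly dropped units and would be noisier and higher than the loss of the full network.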

## **Model**

In [9]:
class Head(nn.Module):
    """
    A single head of self-attention mechanism.

    Args:
    - head_size: The size of the attention head.

    Attributes:
    - key: Linear transformation for keys.
    - query: Linear transformation for queries.
    - value: Linear transformation for values.
    - tril: Lower triangular mask to prevent attention to future tokens.
    - dropout: Dropout layer.
    """

    def __init__(self, head_size):
        super().__init__()
        # Linear transformations for key, query, and value
        self.key = nn.Linear(n_embd, head_size, bias=False)
        self.query = nn.Linear(n_embd, head_size, bias=False)
        self.value = nn.Linear(n_embd, head_size, bias=False)
        # Lower triangular mask to prevent attention to future tokens
        self.register_buffer('tril', torch.tril(torch.ones(block_size, block_size)))
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Forward pass of the self-attention mechanism.

        Args:
        - x: Input tensor of shape (batch_size, sequence_length, feature_dimension).

        Returns:
        - out: Output tensor after applying self-attention, of shape (batch_size, sequence_length, head_size).
        """
        B, T, C = x.shape  # Batch size, sequence length, and embedding dimension (n_embd)
        # Linear transformations for key and query
        k = self.key(x)   # (B, T, head_size)
        q = self.query(x) # (B, T, head_size)
        
        # Compute attention scores ("affinities")
        # Note: C is n_embd here; standard scaled dot-product attention uses head_size ** -0.5 instead
        wei = q @ k.transpose(-2, -1) * C ** -0.5  # (B, T, head_size) @ (B, head_size, T) -> (B, T, T)
        wei = wei.masked_fill(self.tril[:T, :T] == 0, float('-inf'))  # Mask future tokens
        wei = F.softmax(wei, dim=-1)  # Apply softmax to get attention weights
        wei = self.dropout(wei)  # Apply dropout
        
        # Perform the weighted aggregation of the values
        v = self.value(x)  # (B, T, head_size)
        out = wei @ v  # (B, T, T) @ (B, T, head_size) -> (B, T, head_size)
        return out

The `Head` class represents a single head of the self-attention mechanism used in the Transformer architecture. Below is an explanation of its architecture and functionality:

- **Architecture**:
  - **Linear Transformations**: The class initializes three linear transformations for keys, queries, and values, each of which maps the input tensor (`x`) from the feature dimension (`n_embd`) to the specified head size.
  - **Lower Triangular Mask**: It creates a lower triangular mask (`tril`) as a buffer using PyTorch's `tril` function, ensuring that during the attention computation, the model does not attend to future tokens in the sequence.
  - **Dropout Layer**: A dropout layer is applied to the attention weights (`wei`) to regularize the model and prevent overfitting.

- **Functionality**:
  - **Forward Pass**: In the `forward` method, the input tensor (`x`) is passed through the linear transformations to obtain the keys (`k`), queries (`q`), and values (`v`).
  - **Attention Computation**: The attention scores, also known as affinities, are computed by a matrix multiplication between the queries and the transposed keys, then multiplied by `C ** -0.5`, where `C` is the embedding dimension of the input. (The standard formulation divides by the square root of the head size instead.) The scores (`wei`) are masked with the lower triangular mask so that no position attends to future tokens, and then normalized with softmax to obtain valid attention probabilities.
  - **Weighted Aggregation**: The values (`v`) are weighted by the attention probabilities (`wei`) and aggregated using matrix multiplication to produce the output tensor (`out`), representing the attended features.
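
The masking and softmax steps can be seen in isolation with a small sketch. It uses all-zero scores so that the effect of the mask is easy to read; the variable names are illustrative and the real scores come from `q @ k.transpose(-2, -1)`.

```python
import torch
import torch.nn.functional as F

T = 4
scores = torch.zeros(T, T)                      # equal raw affinities for clarity
tril = torch.tril(torch.ones(T, T))             # lower-triangular causal mask

# Future positions get -inf, so softmax assigns them zero probability
masked = scores.masked_fill(tril == 0, float("-inf"))
weights = F.softmax(masked, dim=-1)
print(weights)
# Row t spreads its weight uniformly over positions 0..t:
# [1, 0, 0, 0], [1/2, 1/2, 0, 0], [1/3, 1/3, 1/3, 0], [1/4, 1/4, 1/4, 1/4]
```

Each row sums to 1 and has zeros above the diagonal, so position `t` only aggregates values from positions `0..t`.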

In [10]:
class MultiHeadAttention(nn.Module):
    """
    Multi-head self-attention mechanism.

    Args:
    - num_heads: The number of attention heads.
    - head_size: The size of each attention head.

    Attributes:
    - heads: List of attention heads.
    - proj: Linear transformation for projecting concatenated attention heads.
    - dropout: Dropout layer.
    """

    def __init__(self, num_heads, head_size):
        super().__init__()
        # Create multiple attention heads
        self.heads = nn.ModuleList([Head(head_size) for _ in range(num_heads)])
        # Linear transformation for projecting concatenated attention heads
        self.proj = nn.Linear(n_embd, n_embd)
        # Dropout layer
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        """
        Forward pass of the multi-head self-attention mechanism.

        Args:
        - x: Input tensor of shape (batch_size, sequence_length, feature_dimension).

        Returns:
        - out: Output tensor after applying multi-head self-attention, of shape (batch_size, sequence_length, feature_dimension).
        """
        # Apply each attention head in parallel and concatenate the outputs
        out = torch.cat([h(x) for h in self.heads], dim=-1)
        # Project the concatenated output
        out = self.dropout(self.proj(out))
        return out

The `MultiHeadAttention` class implements a multi-head self-attention mechanism, a crucial component of the Transformer architecture. Below is an explanation of its architecture and functionality:

- **Architecture**:
  - **Multiple Attention Heads**: The class initializes multiple attention heads (`heads`) using a `ModuleList`, each with the specified `head_size`. The number of attention heads is determined by the `num_heads` parameter.
  - **Projection Layer**: After applying each attention head in parallel, the outputs are concatenated and passed through a linear transformation (`proj`). This projection layer helps in combining information from multiple attention heads.
  - **Dropout Layer**: To regularize the model and prevent overfitting, a dropout layer is applied after the projection layer.

- **Functionality**:
  - **Forward Pass**: In the `forward` method, the same input tensor (`x`) is passed through every attention head independently.
  - **Concatenation**: Each head returns a tensor of shape `(B, T, head_size)`. The outputs are concatenated along the last dimension (`dim=-1`), giving a tensor of shape `(B, T, num_heads * head_size)`. Because `head_size = n_embd // n_head`, this equals `(B, T, n_embd)` with the settings used here.
  - **Projection**: The concatenated tensor is passed through a linear transformation (`proj`) that keeps the dimension at `n_embd` and mixes information across the heads.
  - **Dropout**: Finally, dropout is applied to the projected tensor to regularize the model and mitigate overfitting.
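
The shape bookkeeping of the concatenation step can be checked with random stand-in tensors. This is only a sketch of the shapes, not of the attention computation itself; `head_outputs` is a hypothetical name for the list produced by `[h(x) for h in self.heads]`.

```python
import torch

B, T = 2, 5                    # illustrative batch size and sequence length
num_heads, head_size = 4, 16   # matches n_head = 4 and n_embd // n_head = 16

# Stand-ins for the per-head outputs, each of shape (B, T, head_size)
head_outputs = [torch.randn(B, T, head_size) for _ in range(num_heads)]

# Concatenating along the last dimension restores the embedding width
concatenated = torch.cat(head_outputs, dim=-1)
print(concatenated.shape)      # torch.Size([2, 5, 64])
```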

In [11]:
class FeedFoward(nn.Module):
    """
    Feedforward neural network composed of linear layers followed by a non-linearity and dropout.

    Args:
    - n_embd: The input and output dimension of the linear layers.

    Attributes:
    - net: Sequential module containing linear layers, ReLU activation, and dropout.
    """

    def __init__(self, n_embd):
        super().__init__()
        # Define a sequential neural network module
        self.net = nn.Sequential(
            # First linear layer followed by ReLU activation
            nn.Linear(n_embd, 4 * n_embd),
            nn.ReLU(),
            # Second linear layer
            nn.Linear(4 * n_embd, n_embd),
            # Dropout layer
            nn.Dropout(dropout),
        )

    def forward(self, x):
        """
        Forward pass of the feedforward neural network.

        Args:
        - x: Input tensor of shape (batch_size, sequence_length, feature_dimension).

        Returns:
        - out: Output tensor after applying the feedforward network, of shape (batch_size, sequence_length, feature_dimension).
        """
        return self.net(x)

The `FeedFoward` class (spelled this way in the code) defines a feedforward neural network composed of linear layers with a non-linearity (ReLU activation) between them, followed by dropout. Here's an explanation of its architecture:

- **Architecture**:
  - **Sequential Module**: The class initializes a `Sequential` module containing a sequence of operations applied sequentially to the input tensor.
  - **Linear Layers**: Two linear layers are defined within the sequential module. 
    - The first linear layer takes an input of dimension `n_embd` and outputs a tensor of dimension `4 * n_embd`.
    - The second linear layer takes the output of the first layer (with dimension `4 * n_embd`) and maps it back to the original input dimensionality (`n_embd`).
  - **Activation Function**: Between the linear layers, a Rectified Linear Unit (ReLU) activation function is applied element-wise. ReLU introduces non-linearity to the network, allowing it to learn complex mappings from input to output.
  - **Dropout Layer**: After the second linear layer, dropout is applied. Dropout randomly sets a fraction of input units to zero during training, which helps prevent overfitting by reducing the model's reliance on specific units.
  
- **Functionality**:
  - **Forward Pass**: In the `forward` method, the input tensor (`x`) is passed through the sequential module, which sequentially applies the linear layers, ReLU activation, and dropout.
  - **Output**: The output tensor after passing through the feedforward network has the same shape as the input tensor (`(batch_size, sequence_length, feature_dimension)`), with each element representing the corresponding feature in the input sequence transformed by the feedforward network.
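
One property worth checking is that the feedforward network treats every position in the sequence independently. The sketch below rebuilds the same layer layout without dropout (an assumption made so that the comparison is deterministic) and compares the full output with the output for the first position alone.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64

# Same layer layout as the notebook's feedforward network, minus dropout
net = nn.Sequential(
    nn.Linear(n_embd, 4 * n_embd),
    nn.ReLU(),
    nn.Linear(4 * n_embd, n_embd),
)

x = torch.randn(2, 5, n_embd)        # (batch_size, sequence_length, n_embd)
full = net(x)                        # shape is preserved: (2, 5, 64)

# Running only the first position gives the same result as slicing the full output
first_only = net(x[:, :1, :])
print(full.shape)
print(torch.allclose(full[:, :1, :], first_only, atol=1e-5))
```

Mixing information between positions is left entirely to the attention layers.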

In [12]:
class Block(nn.Module):
    """
    Transformer block

    Args:
    - n_embd: The embedding dimension.
    - n_head: The number of attention heads.

    Attributes:
    - sa: Multi-head self-attention module.
    - ffwd: Feedforward neural network module.
    - ln1: Layer normalization module.
    - ln2: Layer normalization module.
    """

    def __init__(self, n_embd, n_head):
        """
        Initialize the Transformer block.

        Args:
        - n_embd: The embedding dimension.
        - n_head: The number of attention heads.
        """
        super().__init__()
        # Calculate the size of each attention head
        head_size = n_embd // n_head
        # Multi-head self-attention module
        self.sa = MultiHeadAttention(n_head, head_size)
        # Feedforward neural network module
        self.ffwd = FeedFoward(n_embd)
        # Layer normalization module
        self.ln1 = nn.LayerNorm(n_embd)
        self.ln2 = nn.LayerNorm(n_embd)

    def forward(self, x):
        """
        Forward pass of the Transformer block.

        Args:
        - x: Input tensor of shape (batch_size, sequence_length, feature_dimension).

        Returns:
        - out: Output tensor after applying the Transformer block, of shape (batch_size, sequence_length, feature_dimension).
        """
        # Pre-norm residual: normalize x, apply multi-head self-attention, then add the result back to x
        x = x + self.sa(self.ln1(x))
        # Pre-norm residual: normalize x, apply the feedforward network, then add the result back to x
        x = x + self.ffwd(self.ln2(x))
        return x

The `Block` class represents a single Transformer block, which is a fundamental building block of the Transformer architecture. Below is an explanation of its architecture:

- **Architecture**:
  - **Multi-Head Self-Attention Module**: The block initializes a multi-head self-attention module (`sa`), which consists of multiple attention heads operating in parallel. Each attention head independently attends to different parts of the input sequence, allowing the model to capture long-range dependencies efficiently.
  - **Feedforward Neural Network Module**: It also initializes a feedforward neural network module (`ffwd`), which consists of linear layers followed by ReLU activation and dropout. This component introduces non-linearity and enables the model to capture complex patterns in the data.
  - **Layer Normalization Modules**: Two layer normalization modules (`ln1` and `ln2`) are initialized. Layer normalization normalizes the activations of each layer, helping stabilize the training process and improve model performance.
  
- **Functionality**:
  - **Forward Pass**: In the `forward` method, the input tensor (`x`) is first normalized with `ln1` and then passed through the multi-head self-attention module (`sa`). The attention output is added to the unnormalized input to form a residual connection.
  - **Feedforward Network**: Next, the result is normalized with `ln2` and passed through the feedforward neural network (`ffwd`). Its output is added to the un-normalized result of the attention step to form a second residual connection. Applying layer normalization before each sub-layer in this way is known as the pre-norm arrangement.
  - **Output**: The final output of the Transformer block represents the processed input tensor, capturing both the self-attention and feedforward network transformations.
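
The pattern `x = x + sublayer(ln(x))` used in `forward` can be sketched on its own. Here a plain `nn.Linear` stands in for the attention or feedforward sub-layer (an assumption made to keep the example short).

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
n_embd = 64
ln = nn.LayerNorm(n_embd)
sublayer = nn.Linear(n_embd, n_embd)   # stand-in for self.sa or self.ffwd (assumption)

x = torch.randn(2, 5, n_embd)

normed = ln(x)                 # normalize each token's feature vector
out = x + sublayer(normed)     # same pattern as x = x + self.sa(self.ln1(x))

# At initialization LayerNorm gives each token zero mean and unit variance
token_mean = normed.mean(dim=-1)
token_std = (normed - token_mean.unsqueeze(-1)).pow(2).mean(dim=-1).sqrt()
print(token_mean.abs().max(), token_std.mean(), out.shape)
```

Because the residual path carries `x` through unchanged, the block's output always has the same shape as its input, which is what allows blocks to be stacked with `nn.Sequential`.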

In [13]:
class BigramLanguageModel(nn.Module):
    """
    Character-level Transformer language model (the class keeps the name BigramLanguageModel).

    Attributes:
    - token_embedding_table: Embedding layer for token embeddings.
    - position_embedding_table: Embedding layer for position embeddings.
    - blocks: Sequential module of Transformer blocks.
    - ln_f: Layer normalization for the final layer.
    - lm_head: Linear layer for language modeling.
    """

    def __init__(self):
        super().__init__()
        # Embedding layer for token embeddings
        self.token_embedding_table = nn.Embedding(vocab_size, n_embd)
        # Embedding layer for position embeddings
        self.position_embedding_table = nn.Embedding(block_size, n_embd)
        # Sequential module of Transformer blocks
        self.blocks = nn.Sequential(*[Block(n_embd, n_head=n_head) for _ in range(n_layer)])
        # Layer normalization for the final layer
        self.ln_f = nn.LayerNorm(n_embd)
        # Linear layer for language modeling
        self.lm_head = nn.Linear(n_embd, vocab_size)

    def forward(self, idx, targets=None):
        """
        Forward pass of the bigram language model.

        Args:
        - idx: Input tensor of shape (batch_size, sequence_length) containing token indices.
        - targets: Target tensor of shape (batch_size, sequence_length) containing target token indices.

        Returns:
        - logits: Logits tensor of shape (batch_size, sequence_length, vocab_size) containing predicted logits.
        - loss: Optional loss tensor computed using cross-entropy if targets are provided.
        """
        B, T = idx.shape

        # Token embeddings
        tok_emb = self.token_embedding_table(idx)  # (B,T,C)
        # Position embeddings
        pos_emb = self.position_embedding_table(torch.arange(T, device=device))  # (T,C)
        # Add token and position embeddings
        x = tok_emb + pos_emb  # (B,T,C)
        # Pass through Transformer blocks
        x = self.blocks(x)  # (B,T,C)
        # Apply layer normalization
        x = self.ln_f(x)  # (B,T,C)
        # Linear layer for language modeling
        logits = self.lm_head(x)  # (B,T,vocab_size)

        # Calculate loss if targets are provided
        if targets is None:
            loss = None
        else:
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            loss = F.cross_entropy(logits, targets)

        return logits, loss

    def generate(self, idx, max_new_tokens):
        """
        Generate new tokens using the bigram language model.

        Args:
        - idx: Input tensor of shape (batch_size, sequence_length) containing token indices.
        - max_new_tokens: Maximum number of new tokens to generate.

        Returns:
        - idx: Tensor containing the input tokens extended with generated tokens, of shape (batch_size, sequence_length + max_new_tokens).
        """
        # Iterate for max_new_tokens iterations
        for _ in range(max_new_tokens):
            # Crop idx to the last block_size tokens
            idx_cond = idx[:, -block_size:]
            # Get the predictions
            logits, _ = self(idx_cond)
            # Focus only on the last time step
            logits = logits[:, -1, :]  # (B, C)
            # Apply softmax to get probabilities
            probs = F.softmax(logits, dim=-1)  # (B, C)
            # Sample from the distribution
            idx_next = torch.multinomial(probs, num_samples=1)  # (B, 1)
            # Append sampled index to the running sequence
            idx = torch.cat((idx, idx_next), dim=1)  # (B, T+1)
        return idx

The `BigramLanguageModel` class is, despite its name, a small Transformer-based language model: each prediction can use up to `block_size` previous tokens rather than only the immediately preceding one. Here's an explanation of its architecture and functionality:

- **Architecture**:
  - **Token Embedding Layer**: Initializes an embedding layer (`token_embedding_table`) to map token indices to dense vector representations. The size of the embedding matrix is determined by the vocabulary size (`vocab_size`) and the embedding dimension (`n_embd`).
  - **Position Embedding Layer**: Creates another embedding layer (`position_embedding_table`) to encode positional information into the input tokens. This layer assigns a unique position embedding to each token position in the sequence.
  - **Transformer Blocks**: Utilizes a sequence of Transformer blocks (`blocks`) to process the input sequence. Each block consists of multi-head self-attention followed by feedforward neural network layers, facilitating the capture of contextual dependencies and patterns within the input sequence.
  - **Layer Normalization**: Applies layer normalization (`ln_f`) after the Transformer blocks to stabilize the learning process and improve model performance.
  - **Linear Layer for Language Modeling**: Defines a linear layer (`lm_head`) to project the output of the Transformer blocks to the vocabulary size, producing logits for each token in the vocabulary.

- **Functionality**:
  - **Forward Pass**: In the `forward` method, the input tensor (`idx`) containing token indices is passed through the token embedding layer and added to position embeddings. The resulting tensor is then processed by the Transformer blocks and layer normalization, followed by the linear layer to compute logits for language modeling. If targets are provided, the method also computes the cross-entropy loss.
  - **Token Generation**: The `generate` method generates new tokens by iteratively predicting the next token based on the previous sequence. It repeatedly samples from the softmax distribution of logits for the last token and appends the sampled token to the sequence until the specified maximum number of new tokens is generated.
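
The size of the architecture described above can be tallied by hand. The sketch below counts parameters layer by layer, using the 61 unique characters printed earlier and the hyperparameter values from this notebook; the helper `linear` is a hypothetical convenience function, not part of the model code.

```python
# Values taken from the notebook: 61 unique characters and the hyperparameter cell
vocab_size, n_embd, block_size, n_head, n_layer = 61, 64, 32, 4, 4
head_size = n_embd // n_head

def linear(n_in, n_out, bias=True):
    """Parameter count of a linear layer."""
    return n_in * n_out + (n_out if bias else 0)

layer_norm = 2 * n_embd   # one scale and one shift vector

# Each head has bias-free key, query, and value projections; proj has a bias
attention = n_head * 3 * linear(n_embd, head_size, bias=False) + linear(n_embd, n_embd)
feed_forward = linear(n_embd, 4 * n_embd) + linear(4 * n_embd, n_embd)
block = attention + feed_forward + 2 * layer_norm

total = (
    vocab_size * n_embd             # token embedding table
    + block_size * n_embd           # position embedding table
    + n_layer * block               # Transformer blocks
    + layer_norm                    # final layer norm (ln_f)
    + linear(n_embd, vocab_size)    # lm_head
)
print(total / 1e6, "M parameters")
```

The total of 209,213 agrees with the `0.209213 M parameters` line printed by the training cell. The `tril` mask is registered as a buffer, so it does not count as a parameter.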

## **Training**

In [14]:
model = BigramLanguageModel()  # Initialize the BigramLanguageModel
m = model.to(device)  # Move the model to the specified device (CPU or GPU)
# Print the number of parameters in the model
print(sum(p.numel() for p in m.parameters()) / 1e6, 'M parameters')

# Create a PyTorch optimizer
optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)

# Main training loop
for iter in range(max_iters):

    # Every once in a while, evaluate the loss on train and validation sets
    if iter % eval_interval == 0 or iter == max_iters - 1:
        # Estimate loss on train and val sets
        losses = estimate_loss()
        print(f"step {iter}: train loss {losses['train']:.4f}, val loss {losses['val']:.4f}")

    # Sample a batch of data
    xb, yb = get_batch('train')

    # Evaluate the loss
    logits, loss = model(xb, yb)
    # Zero the gradients before the backward pass
    optimizer.zero_grad(set_to_none=True)
    # Backpropagation: compute gradients
    loss.backward()
    # Update model parameters
    optimizer.step()


0.209213 M parameters
step 0: train loss 4.3193, val loss 4.3138
step 100: train loss 2.5515, val loss 2.5568
step 200: train loss 2.4189, val loss 2.4242
step 300: train loss 2.3738, val loss 2.3882
step 400: train loss 2.3395, val loss 2.3455
step 500: train loss 2.3080, val loss 2.3140
step 600: train loss 2.2704, val loss 2.2786
step 700: train loss 2.2340, val loss 2.2375
step 800: train loss 2.2004, val loss 2.2153
step 900: train loss 2.1841, val loss 2.2072
step 1000: train loss 2.1548, val loss 2.1685
step 1100: train loss 2.1148, val loss 2.1354
step 1200: train loss 2.1044, val loss 2.1290
step 1300: train loss 2.0781, val loss 2.1119
step 1400: train loss 2.0639, val loss 2.0925
step 1500: train loss 2.0460, val loss 2.0793
step 1600: train loss 2.0279, val loss 2.0630
step 1700: train loss 2.0083, val loss 2.0587
step 1800: train loss 2.0144, val loss 2.0471
step 1900: train loss 1.9851, val loss 2.0272
step 2000: train loss 1.9826, val loss 2.0196
step 2100: train loss 1.

## **Output Of The Model**

In [15]:
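# generate from the model
m.eval()  # disable dropout so that sampling uses the full network
context = torch.zeros((1, 1), dtype=torch.long, device=device)
with torch.no_grad():  # gradients are not needed for generation
    print(decode(m.generate(context, max_new_tokens=2000)[0].tolist()))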


ghts ther whe noth thast hup ase stenpetters,
But disse sselove stlove:

Orlof anon dects chires, thy and morede par to dote,

Aght file yought ey a to thou bed,
But truthers,
Anbe thier with, and, chame do oth I conth '(but ought sure-deariose of stroomfed,
That you shee lecestin reaked, owee proos, mons,
But thunteounore bed To xeye, as un poince,
Withater,
Hake shunce, as my com, thering.
A min's bewhints, thou heyu, many time waity
No might a jeccht telf to bull
Blow ve mights be ge
Row to  tirel, thend phold ame (heal I now anloove wound in wed mu adt I sell.
Theid see, rut weet tong kned,
Mut so swa to, sifter orean, apn's suble on 'se lisgeauty ser in 
Bust of that I lies so fe whit, freark,
That ond frror poonte, law
By thou peaume shall my four's baken thens!
Doth yos houghts coproong my alo', an thing preidn,
So leagure forgh bettome then muse)
To weart my swing deor exind pay:
As ching my sunlie) feter in no scor in herwine,
Wroms hightae oo love.  
Frighiing bearks thy sun
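
The sampling step inside `generate` can be looked at in isolation. The sketch below uses a toy three-token vocabulary in which two tokens are given a logit of negative infinity, so the outcome is deterministic; in the real model the logits come from the last time step of the forward pass.

```python
import torch
import torch.nn.functional as F

# Logits for a toy 3-token vocabulary; -inf gives a token zero probability
logits = torch.tensor([[float("-inf"), 0.0, float("-inf")]])   # (B=1, vocab=3)

probs = F.softmax(logits, dim=-1)                    # tensor([[0., 1., 0.]])
idx_next = torch.multinomial(probs, num_samples=1)   # always index 1 here

idx = torch.zeros((1, 1), dtype=torch.long)          # running sequence, starts at token 0
idx = torch.cat((idx, idx_next), dim=1)              # shape (1, 2)
print(idx)                                           # tensor([[0, 1]])
```

With real logits every token has a non-zero probability, which is why repeated runs produce different text unless the random seed is fixed.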

## **Results**

#### As you can see, the generated text is still far from readable English, but the model has captured some patterns in the text, and the structure of its lines and sentences looks more like human writing. A larger model trained for longer would likely give better results.
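
One way to put the loss values from the training log into perspective is to compare them with the loss of a uniform guess and to convert them into perplexity. This is a small arithmetic sketch using two validation losses copied from the log above; the variable names are illustrative.

```python
import math

vocab_size = 61            # number of unique characters printed earlier
val_loss_start = 4.3138    # step 0 validation loss from the log
val_loss_later = 2.0196    # step 2000 validation loss from the log

# Cross-entropy of guessing every character with equal probability
uniform_loss = math.log(vocab_size)

# Perplexity: the effective number of equally likely choices per character
perplexity_start = math.exp(val_loss_start)
perplexity_later = math.exp(val_loss_later)

print(f"uniform-guess loss: {uniform_loss:.4f}")
print(f"perplexity at step 0: {perplexity_start:.1f}")
print(f"perplexity at step 2000: {perplexity_later:.1f}")
```

The untrained model starts close to the uniform-guess loss, and by step 2000 its uncertainty has dropped to the equivalent of choosing among roughly seven or eight characters instead of 61.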