In [1]:
# Small LLM / Notebook created by Javier Ideami (ideami.com)
# Typical LLMs need many GPUs and millions of dollars to train
# This code trains a small LLM on a single GPU with little GPU memory
# Of course the results are not like ChatGPT's, but they are good enough to see how the LLM learns to go
# from random combinations of letters to actual words and phrases that are sometimes decently coherent
# GPT-3 has 175 billion parameters. GPT-4 is reported to have many more.
# This model has only about 19 million parameters with its default settings. That's why it's perfect for learning
# and experimenting

# Official notebook #vj30

In [2]:
#### For GOOGLE COLAB and similar platform Users:
#### Make sure to select a GPU in the online platform. Don't run this code with a CPU (it will be too slow)

# If you are running this code locally, your GPU should be selected automatically

In [3]:
# Uncomment and run the following installation lines ONLY if you haven't already installed these libraries outside of the notebook
#!pip install ipdb -q
#!pip install tqdm -q
#!pip install sentencepiece -q
#!pip install wandb -q

# And if you are not in Google Colab and you haven't installed PyTorch yet, make sure to do it:
# find the right PyTorch installation command for your system at https://pytorch.org/get-started/locally/

In [4]:
# You can use this command to view information about your GPU and the amount of free memory it has
# Make sure that you have at least 4 GB of free GPU memory to do this course
!nvidia-smi 
# If you are using Google Colab or a similar online platform, make sure to select a GPU in the menus
# In Google Colab, at the moment the option is within the Runtime menu

Wed Dec 11 11:18:22 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 53%   41C    P8             29W /  350W |    4562MiB /  12288MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                

In [5]:
### Import necessary libraries

import os, sys
import ipdb # for debugging
from tqdm import tqdm
from datetime import datetime
import platform, shutil # detect platform type
import requests, zipfile, io 

# Pytorch
import torch
import torch.nn as nn
from torch.nn import functional as F

import sentencepiece as spm # For the tokenizer

# These lines enable TF32, which speeds up float32 matmuls on Ampere-architecture GPUs (e.g. A100)
torch.backends.cuda.matmul.allow_tf32 = True  # allow tf32 on matmul
torch.backends.cudnn.allow_tf32 = True  # allow tf32 on cudnn
# Empty GPU cache memory
torch.cuda.empty_cache()


In [6]:
# Download necessary files and create necessary folders
# wiki.txt - dataset: a tiny segment of the English Wikipedia
# wiki_tokenizer.model: trained tokenizer file (in another notebook I show you how to produce this file)
# wiki_tokenizer.vocab: trained tokenizer file (in another notebook I show you how to produce this file)
# encoded_data.pt (dataset tokenized with the tokenizer)
# I will explain how to produce encoded_data.pt - because it takes quite a while to process, it's nice to have it in advance

# NOTE: Downloading will take a while, be patient. You can refresh your folder from time to time to see when the files
# have been created. If you have any problems downloading the files with this code, I have also added llm_train.zip
# to the downloadable resources of this lecture (however, best option is to use this code, because then you don't need
# to upload the files or do anything else)

files_url = "https://ideami.com/llm_train"

# Downloading proceeds if we detect that one of the key files to download is not present
if not os.path.exists("encoded_data.pt"):
    print("Downloading files using Python")
    response = requests.get(files_url)
    response.raise_for_status()  # Stop here if the download failed instead of trying to unzip an error page
    zipfile.ZipFile(io.BytesIO(response.content)).extractall(".")
else:
    print("you seem to have already downloaded the files. If you wish to re-download them, delete the encoded_data.pt file")



you seem to have already downloaded the files. If you wish to re-download them, delete the encoded_data.pt file


In [31]:
# Set main parameters

# ARCHITECTURE PARAMETERS
batch_size= 16 # How many samples do we train at once (set as needed, typical range 8 to 128)
              # 8 is good for a GPU with 4GB of memory, 128 is good for a GPU with 24GB of memory
context=512 # Sequence length used for training, 512 is a good compromise for our level of resources
embed_size=384 # Embedding size
n_layers = 7 # Number of transformer layers
n_heads = 7 # Number of heads within each layer
BIAS = True # Do we want Bias parameters?

# HYPERPARAMETERS
lr = 3e-4 # Initial learning rate
dropout=0.05 # Dropout percentage
weight_decay = 0.01 # Weight decay regularizer
grad_clip = 1.0 # Gradient clipping to prevent gradient explosion

# TRAINING parameters
train_iters = 100000 # Maximum number of training iterations
eval_interval=50 # How often do we evaluate the performance?
eval_iters=3 # Number of iterations while we evaluate performance
compile = False # Compile will accelerate performance in compatible systems
load_pretrained = True # Do we want to load a pretrained model to continue training?

checkpoint_dir = 'models/'  # Where do we store checkpoints?

checkpoint_fn = "latest.pt" 
# Name of checkpoint file to be saved during training

checkpoint_load_fn = "latest.pt" 
# Name of checkpoint file to be loaded when load_pretrained is True
# You can load llm2.pt to experiment with a checkpoint that already reached a loss of 2.31

dtype = torch.bfloat16 # our target internal data type

# MODE
# Do we want to run the model in inference mode?
inference=True 

# DEVICE - Sets device to GPU or CPU (use GPU always)
device = "cuda" if torch.cuda.is_available() else "cpu"
print("device: You will be using: ",device)


device: You will be using:  cuda


In [8]:
# LOGGING parameters
# When you run this cell, it will ask you to enter your Wandb API Key, which you
# can find at your account on https://wandb.ai/settings#api
wandb_log = True
wandb_project = "test"
wandb_run_name = "test-run" + datetime.now().strftime("%Y_%m_%d_%H_%M_%S")

if wandb_log:
    import wandb
    wandb.init(project=wandb_project, name=wandb_run_name)

# The first time you run this cell with wandb_log set to True, the Weights & Biases library
# will ask you for an API key. You can follow the instructions in the video, or you can
# simply click on the link that should appear when you run this cell, pointing to this
# address: https://wandb.ai/authorize  
# Going to that address will allow you to quickly get an API key as well


[34m[1mwandb[0m: Using wandb-core as the SDK backend.  Please refer to https://wandb.me/wandb-core for more information.
[34m[1mwandb[0m: Currently logged in as: [33mharikrishna1912[0m ([33mharikrishna1912-regology[0m). Use [1m`wandb login --relogin`[0m to force relogin


In [9]:
with open('wiki.txt', 'r', encoding='utf-8') as f:
    text=f.read()

print(text[10000:10500])

 that was used to represent a team in an old TV show, The A-Team. A capital a is written "A". Use a capital A at the start of a sentence if writing.

A is also a musical note, sometimes referred to as "La".

The letter 'A' was in the Phoenician alphabet's aleph. This symbol came from a simple picture of an ox head. 

This Phoenician letter helped make the basic blocks of later types of the letter. The Greeks later modified this letter and used it as their letter alpha. The Greek alphabet was use


In [10]:
# SENTENCEPIECE TOKENIZER

# Load trained tokenizer
# Make sure that " model_file = " is pointing to the right file
sp = spm.SentencePieceProcessor(model_file='wiki_tokenizer.model')

# Get the vocabulary size of our tokenizer
vocab_size = sp.get_piece_size()
print(f"Tokenizer vocab_size: {vocab_size}")

# Create the encoding and decoding tokenizer functions
encode = lambda s: sp.Encode(s)
decode = lambda l: sp.Decode(l)

# Test that encoding and decoding are working well
print(decode(encode("Encoding Decoding functions ready")))

Tokenizer vocab_size: 4096
Encoding Decoding functions ready


In [11]:
# Tokenization of the dataset
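if os.path.exists("encoded_data.pt"):
    # Load encoded data if you already saved it previously
    print("Loading saved encoded data")
    # weights_only=True restricts unpickling to tensors and plain containers (avoids the torch.load FutureWarning)
    data = torch.load('encoded_data.pt', weights_only=True)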
else:
    # If you still didn't encode and save the encoding, do it here
    print("Encoding data")
    data = torch.tensor(encode(text), dtype=torch.long)
    torch.save(data, 'encoded_data.pt')


Loading saved encoded data



In [12]:
data_size=len(data) # Get the size of the dataset

spl = int(0.9*data_size) # set the split at 90%-10%
train_data=data[:spl] # training data will be 90% of the dataset
val_data=data[spl:] # validation data will be 10% of the dataset
print(f'Total data: {data_size/1e6:.2f} Million | Training: {len(train_data)/1e6:.2f} Million | Validation: {len(val_data)/1e6:.2f} Million')

# data[:30] : shows the first 30 token IDs

Total data: 59.21 Million | Training: 53.29 Million | Validation: 5.92 Million


In [13]:
############## HELPER FUNCTIONS ###########################

# Return a batch of either training or evaluation data
def get_batch(split):
    # BS = Batch Size / SL = Sequence Length or context length
    data = train_data if split=="train" else val_data # Select the split
    inds = torch.randint(len(data)-context, (batch_size,)) # (BS)
    x = torch.stack([data[i: i+context] for i in inds]) # (BS,SL)
    y = torch.stack([data[i+1: i+context+1] for i in inds]) # (BS,SL)

    # Examples of what it returns
    # # First 10 elements of first batch of inputs and labels
    #x[0][:10] -> tensor([ 664,  278, 4031, 4056, 4065, 4062, 4062, 4051, 13, 13])
    #y[0][:10] -> tensor([ 278, 4031, 4056, 4065, 4062, 4062, 4051,   13, 13, 4066])

    x,y = x.to(device), y.to(device)
    return x,y
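
To see why `y` is simply `x` shifted by one token, here is a minimal pure-Python sketch of the same slicing on a toy sequence. The names `toy_data`, `toy_context`, and `toy_starts` are hypothetical stand-ins for `train_data`, `context`, and the random `inds`; the start indices are fixed here so that the result is reproducible.

```python
# Toy stand-ins (assumptions for illustration, not the notebook's real data)
toy_data = list(range(100, 120))   # pretend these are 20 token IDs
toy_context = 4                    # sequence length
toy_starts = [0, 7, 15]            # fixed start indices instead of torch.randint

# Same slicing as get_batch: inputs start at i, targets start at i+1
toy_x = [toy_data[i : i + toy_context] for i in toy_starts]
toy_y = [toy_data[i + 1 : i + toy_context + 1] for i in toy_starts]

for xs, ys in zip(toy_x, toy_y):
    print(xs, "->", ys)
```

The largest valid start index is `len(toy_data) - toy_context - 1`, which is why `get_batch` draws its indices below `len(data) - context`: the target slice still needs one extra token at the end.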



In [14]:
# Uncomment to test your get_batch function
#x,y=get_batch("train")
#print(f"x.shape: {x.shape}")
#print(f"y.shape: {y.shape}")
#print(x[0][:10])
#print(y[0][:10])

In [19]:
# Attention Head Layer
# Detects and reinforces patterns in the relationships between the members of the sequence
class Head(nn.Module):
    # BS = Batch Size / SL = Sequence Length or context length
    def __init__(self, head_size):
        super().__init__()
        self.queries= nn.Linear(embed_size, head_size, bias=BIAS) # Query Projection (embed_dim, head_size) (384, 54)
        self.keys= nn.Linear(embed_size, head_size, bias=BIAS) # Key Projection (384, 54)
        self.values= nn.Linear(embed_size, head_size, bias=BIAS) # Value Projection (384, 54)
        # We declare a triangular matrix that we will use to mask future tokens from the current position
        # self.tril contains 0s in upper triangle and 1s in lower triangle + diagonal
        self.register_buffer('tril',torch.tril(torch.ones(context,context))) # self.tril - (SL,SL)
        self.dropout = nn.Dropout(dropout)

    def forward(self,x):
        BS, SL, ES = x.shape # (BS,SL,384)  ES is the embedding size (not the vocabulary size)
        q=self.queries(x) # (BS,SL,54)  54 is the head_size
        k=self.keys(x) # (BS,SL,54)
        v=self.values(x) # (BS,SL,54)

        # Calculate the square attention weights matrix as the dot product of q and k, scaled by 1/sqrt(head_size)
        attn_w = q @ k.transpose(-2,-1) * k.shape[-1]**-0.5 # (BS, SL, SL)

        # mask out future tokens, pay attention only to the past
        attn_w = attn_w.masked_fill(self.tril[:SL,:SL]==0, float('-inf'))  # set to -inf the upper right triangle of 0s

        attn_w = F.softmax(attn_w, dim=-1) # Transform into probabilities (BS, SL, SL)
        attn_w = self.dropout(attn_w) # (BS, SL, SL)

        # use attention weights to update the features of our tokens
        x = attn_w @ v # (BS,SL,54) # 54 is the head_size = embed_dim // n_heads
        return x
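
The masking step in `Head.forward` can be checked by hand on a tiny example. The sketch below is pure Python rather than PyTorch, and `causal_softmax` and `toy_scores` are hypothetical names introduced only for illustration. It applies the same idea: scores for future positions are set to -inf before the softmax, so they receive exactly zero attention weight.

```python
import math

def causal_softmax(scores):
    """Row-wise softmax where position i may only attend to positions j <= i."""
    weights = []
    for i, row in enumerate(scores):
        # Future positions (j > i) get -inf, like masked_fill with the tril mask
        masked = [s if j <= i else float("-inf") for j, s in enumerate(row)]
        m = max(masked)                           # finite: position i itself is never masked
        exps = [math.exp(s - m) for s in masked]  # math.exp(-inf) is 0.0
        total = sum(exps)
        weights.append([e / total for e in exps])
    return weights

toy_scores = [[0.5, 2.0, -1.0],
              [1.0, 1.0, 3.0],
              [0.0, 0.0, 0.0]]
toy_weights = causal_softmax(toy_scores)
for row in toy_weights:
    print([round(w, 3) for w in row])
```

The first token can only attend to itself, so its row is all weight on position 0 regardless of the raw scores.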

In [18]:
# Multihead Attention Layer
# This layer coordinates the different attention heads within each transformer block
class Multihead(nn.Module):
    def __init__(self,n_heads, head_size):
        super().__init__()
        self.heads = nn.ModuleList([Head(head_size) for _ in range(n_heads)]) # Setup the heads | head_size = embed_size // n_heads
        self.combine = nn.Linear(head_size * n_heads, embed_size, bias=BIAS) # (378,384) - in case of our default values
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        # BS = Batch Size / SL = Sequence Length or context length
        # x is (BS,SL,384)  # 384 is default embed size
        x = torch.cat([head(x) for head in self.heads], dim=-1)
        # Each head outputs (BS,SL, head_size)
        # Combining them with torch.cat produces (BS,SL,378)  378 is default head_size * default n_heads = 54 * 7
        x = self.combine(x) # project them back to embed_size (BS, SL, 384)  384 is default embed_size
        x = self.dropout(x)
        return x

In [17]:
# The ForwardLayer applies a position-wise feed-forward network to each token:
# it expands the features to 6*embed_size, applies a GELU non-linearity, and projects back to embed_size
class ForwardLayer(nn.Module):
    def __init__(self,embed_size):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(embed_size, 6*embed_size, bias=BIAS),
            nn.GELU(),
            nn.Linear(6*embed_size, embed_size, bias=BIAS),
            nn.Dropout(dropout)
        )
    def forward(self,x):
        x = self.network(x)
        return x

In [16]:
########################################
##########Transformer Block Class ######
########################################

class Block(nn.Module):
    # A transformer block combines communication and computation over the data
    # Helps create complex processing and also emphasize relationships between the
    # members of the sequence through the attention mechanisms
    def __init__(self, n_heads):
        super().__init__()
        head_size = embed_size // n_heads # We split the embedding dimensions among the number of heads
        self.ma = Multihead(n_heads,head_size) # We setup the multihead system within each block
        self.feed_forward = ForwardLayer(embed_size)
        self.ln1 = nn.LayerNorm(embed_size) # Normalizing layer
        self.ln2 = nn.LayerNorm(embed_size) # Normalizing layer

        # LayerNorm normalizes the inputs across the features for each data point independently.
        # It subtracts the mean and divides by the standard deviation, followed by scaling and shifting.
        # It is slightly more expensive than, for example, RMSNorm, which skips the mean subtraction and the shift.

    def forward(self, x):
        x = x + self.ma(self.ln1(x))  # We normalize and then apply multi head attention
        x = x + self.feed_forward(self.ln2(x)) # we normalize again and then apply a feed forward layer
        return x


In [15]:
#################################################################################
################## LLM MODEL #############################################
# 19 million parameters with the default configuration
# Can be trained with 1 single GPU
# With 8 Batch Size, should require 4 GB of GPU Memory
# With 128 Batch Size, should require 24 GB of GPU Memory
# Adjust Batch Size as needed for less or more memory and training speed
# Because of small dataset and model, results will be limited but enough to
# demonstrate good improvement during the training and understand all the
# main technology involved in building LLMs
#################################################################################
###############################################
##################################

class GPT(nn.Module):

    def __init__(self):
        super().__init__()
        self.embeddings = nn.Embedding(vocab_size,embed_size) # Create embedding layer
        self.positions = nn.Embedding(context, embed_size) # Create basic positioning embeddings
        self.blocks = nn.Sequential(*[Block(n_heads) for _ in range(n_layers)]) # setup transformer blocks
        self.ln = nn.LayerNorm(embed_size) # final normalization layer
        self.final_linear = nn.Linear(embed_size, vocab_size, bias=BIAS) # output projection from features to vocabulary logits
        self.apply(self._init_weights) # Initialize the weights

    # Weights initialization
    def _init_weights(self, module):
        if isinstance(module, nn.Linear):        
            # Initialize weight matrices with normal distribution with mean 0 and small std
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)
            # Initialize bias parameters to 0
            if module.bias is not None:
                torch.nn.init.zeros_(module.bias)
        # Initialize Embedding weights with normal distribution with mean 0 and small std
        elif isinstance(module, nn.Embedding):
            torch.nn.init.normal_(module.weight, mean=0.0, std=0.02)

    # Running the LLM model
    def forward(self, input, targets=None):
        # BS = Batch Size / SL = Sequence Length or context length
        # For easier reading, I assume embedding dim of 384 and vocab size of 4096 in comments
        loss= None
        BS,SL = input.shape  # (BS,SL)
        emb = self.embeddings(input)  # (BS,SL,384)
        pos = self.positions(torch.arange(SL, device=input.device)) # (SL,384)  positions live on the same device as the input
        x = emb+pos  # combine embedding and positioning stages (BS,SL,384)
        x = self.blocks(x)  #(BS,SL,384)
        x = self.ln(x) # (BS,SL,384)
        logits = self.final_linear(x) # (BS,SL,4096)

        # Calculate Loss if training with targets

        # Cross Entropy Logic
        # (equivalent to negative log likelihood)

        # Information: -log p(x) (the less probable an outcome, the more information it carries)
        # Entropy: avg of information in a random variable (prob distribution): -sum_x (p(x) * log p(x))
        # CrossEntropy: Compares 2 distr q(true) & p(predicted) in terms of information: -sum_x (q(x) * log p(x))
        # LLMs CrossEntropy: q is 1 for the true token and 0 for the rest, so it simplifies to: -log p(true token),
        # averaged over all BS*SL positions

        if targets is not None:
            BS, SL, VS = logits.shape  # (BS,SL,4096)
            logits = logits.view(BS*SL,VS)  # Reshape to prepare for cross_entropy (BS*SL,4096)
            targets = targets.view(BS*SL)   # Reshape as well (BS*SL)
            loss = F.cross_entropy(logits,targets)

            # Optional: Just for fun, manual way to calculate cross_entropy
            # By default, we comment out the manual version to prevent calculating the loss twice (will make things slower)

            # First apply softmax to produce probabilities
            #counts = logits.exp()  # (BS*SL,4096)
            #prob = counts / counts.sum(-1, keepdim=True) # (BS*SL,4096) / (BS*SL,1) = (BS*SL,4096)

            # Then, at each of prob's positions, we pick the index specified by the respective target
            # example: targets[3]=329, prob[3][329] = 0.014
            #loss2 = -prob[torch.arange(BS*SL, device=logits.device),targets].log().mean() # torch.arange(BS*SL) (BS*SL) | targets (BS*SL)

            # Most times they will match; sometimes they will not, because F.cross_entropy works in log space
            # (log-softmax), which is numerically more stable than exponentiating the logits directly
            # By uncommenting the following lines, you can see when they don't match 
            #if ( not torch.allclose(loss,loss2)):
            #    print(f"[Loss Diff] Pytorch:{loss.item()} Manual:{loss2.item()}")

        return logits,loss

    # Generate a new sample
    def generate(self, input, max=500):
        # SL = Sequence Length or context length
        for _ in range(max): # until you reach the maximum number of tokens
            input = input[:,-context:] #(1, input length until max of SL)
            logits, _ = self(input)  # (1, input length, 4096)
            logits = logits[:,-1,:]  # Keep only the logits of the last position, dropping the SL dimension (1, 4096)
            probs = F.softmax(logits, dim=-1) # (1,4096)
            next = torch.multinomial(probs, num_samples=1) # Sample next token value
            input = torch.cat((input,next),dim=1) # Add new token to the input
        return input
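
The remark that the manual loss and `F.cross_entropy` can differ comes down to numerical stability. The sketch below is pure Python, with hypothetical function names, and compares the "exponentiate then normalize" route with the log-sum-exp trick. It assumes, as is standard for library implementations, that the stable version is what a log-softmax based loss does; the notebook itself does not show PyTorch's internals.

```python
import math

def cross_entropy_naive(logits, target):
    # Mirrors the manual version: exponentiate, normalize, take -log of the target probability
    counts = [math.exp(z) for z in logits]
    return -math.log(counts[target] / sum(counts))

def cross_entropy_stable(logits, target):
    # log-sum-exp trick: subtract the max before exponentiating so nothing overflows
    m = max(logits)
    log_sum_exp = m + math.log(sum(math.exp(z - m) for z in logits))
    return log_sum_exp - logits[target]

small = [2.0, 1.0, 0.1]
print(cross_entropy_naive(small, 0), cross_entropy_stable(small, 0))

large = [1000.0, 0.0, -5.0]
print(cross_entropy_stable(large, 1))  # the naive version raises OverflowError on these logits
```

For ordinary logits both routes agree to floating-point precision; only extreme values expose the difference.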

In [20]:
#################################################################################
# Main Training Process
#################################################################################

# Main Setup

model = GPT() # Instantiate LLM
model = model.to(dtype) # Set the precision type
model = model.to(device) # Move it to the right device

# Torch.compile compiles a PyTorch model to an optimized version, aiming to improve runtime performance and efficiency.
# Disable if your system doesn't support it
if compile:
    print("Torch :: Compiling model")
    model = torch.compile(model)


# Print the number of parameters of our model (about 19.8 million with the default settings)
print(sum(p.numel() for p in model.parameters()) / 1e6, " Million parameters")

19.837954  Million parameters
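
The parameter count can be reproduced by hand from the default settings. The sketch below is plain arithmetic with no PyTorch; the helper `linear` is a hypothetical name, and `vocab_size = 4096` is taken from the tokenizer output earlier in the notebook.

```python
# Defaults from the parameters cell; vocab_size comes from the tokenizer output (4096)
vocab_size, context, embed_size, n_layers, n_heads = 4096, 512, 384, 7, 7
head_size = embed_size // n_heads                      # 54

def linear(n_in, n_out):
    return n_in * n_out + n_out                        # weights + bias (BIAS = True)

layer_norm = 2 * embed_size                            # scale + shift
heads = n_heads * 3 * linear(embed_size, head_size)    # query, key, value per head
combine = linear(head_size * n_heads, embed_size)      # 378 -> 384
feed_forward = linear(embed_size, 6 * embed_size) + linear(6 * embed_size, embed_size)
block = heads + combine + feed_forward + 2 * layer_norm

total = (vocab_size * embed_size                       # token embeddings
         + context * embed_size                        # position embeddings
         + n_layers * block                            # transformer blocks
         + layer_norm                                  # final LayerNorm
         + linear(embed_size, vocab_size))             # output projection
print(total)
```

The `tril` mask is registered as a buffer, not a parameter, so it does not appear in the count.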


In [21]:
# Calculate the Loss
@torch.no_grad()  # Prevent gradient calculation
def calculate_loss():
    out={}
    model.eval()
    for split in ['train','eval']:  # get_batch uses val_data for any split other than "train"
        l=torch.zeros(eval_iters)  # Create a tensor of zeros the size of eval_iters
        for i in range(eval_iters):
            x,y=get_batch(split) # Get a new batch of data
            _,loss=model(x,y)  # Calculate the loss
            l[i]=loss  # Store the loss in the next position of tensor
        out[split]=l.mean().item()  # Calculate the mean and extract the final value
    model.train()
    return out

l=calculate_loss()
print(l)

{'train': 8.375, 'eval': 8.375}


In [22]:
# Generate a new sample
@torch.no_grad()
def generate_sample(input):
    t1 = torch.tensor(encode(input), dtype=torch.long, device=device) # Tokenize string -> (tensor of ids)
    t1 = t1[None,:]  # (1 , [size of ids])
    newgen = model.generate(t1,max=64)[0].tolist() # call the generate method, limit output size
    result=decode(newgen) # decode the result with the tokenizer to get back characters
    print(f"{result}")

generate_sample("The mountain in my city is") # Generate a sample

The mountain in my city is United over LoveRated except episode controll death
warral artmeriam Kongroughtorkhing record upidsionsonic Alexander addition Pekr sett dec Waleswh this Prizeugenc", Enter performed Boreptember grow trade clos outsideylvan respons bookaiur no el not�iversilly hasesotaall August cocul under red


In [23]:
#################################################################################
# Main Training Process
#################################################################################

# Set Weight Decay differently for different kinds of parameters
# parameter dictionary where keys are parameter names, and values are the parameters themselves
p_dict = {p_name: p for p_name, p in model.named_parameters() if p.requires_grad} # len: 370

# isolate weight matrices (and embeddings), as they especially benefit from weight decay
weight_decay_p = [p for n, p in p_dict.items() if p.dim() >= 2]  # len: 171

# isolate the other parameters, like biases and LayerNorm parameters, which don't benefit from weight decay
no_weight_decay_p = [p for n, p in p_dict.items() if p.dim() < 2] # len: 199

# store the parameter types in a list of dictionaries
optimizer_groups = [
    {'params': weight_decay_p, 'weight_decay': weight_decay},
    {'params': no_weight_decay_p, 'weight_decay': 0.0}
]

# Declare optimizer: it updates the parameters from the gradients computed by loss.backward(),
# applying the learning rate and the per-group weight decay
optimizer = torch.optim.AdamW(optimizer_groups, lr=lr, betas=(0.9, 0.99))
# betas: control the exponential moving averages of the gradient and its square,
# which are essential components of the Adam and AdamW optimization algorithms.

# Declare scheduler to change learning rate through the training
scheduler = torch.optim.lr_scheduler.CosineAnnealingLR(optimizer, train_iters, eta_min=lr/10)
# learning rate will descend till a minimum of a tenth of the lr

start_iteration = 0
best_val_loss = float('inf')  # Track best loss value
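
To get a feel for how the learning rate descends, the sketch below evaluates the usual closed-form cosine annealing formula in pure Python. The function `cosine_lr` is a hypothetical helper, and it assumes the scheduler follows this standard formula when stepped once per iteration from the start; it is not a replacement for the scheduler above.

```python
import math

def cosine_lr(step, base_lr=3e-4, total_steps=100_000, min_lr=3e-5):
    """Closed-form cosine annealing from base_lr down to min_lr over total_steps."""
    cosine = (1 + math.cos(math.pi * step / total_steps)) / 2  # goes from 1 down to 0
    return min_lr + (base_lr - min_lr) * cosine

for step in (0, 25_000, 50_000, 100_000):
    print(step, cosine_lr(step))
```

The defaults mirror `lr`, `train_iters`, and `eta_min=lr/10` from the notebook, so the rate starts at 3e-4, passes the midpoint between the two values halfway through, and ends at 3e-5.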


In [32]:
# Loading Checkpoints

# Loads a previously saved checkpoint
def load_checkpoint(path):
    print("LLM - Loading model")
    checkpoint = torch.load(path, map_location=device) # Load the tensors onto the current device
    model.load_state_dict(checkpoint['model_state_dict']) # Load parameters
    optimizer.load_state_dict(checkpoint['optimizer_state_dict']) # Load optimizer state
    iteration = checkpoint['iteration'] # In what iteration did we save the model?
    loss = checkpoint['loss'] # What was the last loss value?
    print(f"Loaded iter {iteration} with loss {loss}")
    return iteration, loss
    # Note: the scheduler state is not saved in the checkpoint, so the learning rate schedule is not resumed exactly

################# OPTIONAL : LOAD A PREVIOUS CHECKPOINT
checkpoint_load_path = os.path.join(checkpoint_dir, checkpoint_load_fn)
if load_pretrained and os.path.exists(checkpoint_load_path):
    start_iteration, loss = load_checkpoint(checkpoint_load_path)
    best_val_loss = loss

LLM - Loading model



Loaded iter 6950 with loss 3.09375


In [33]:
#### INFERENCE MODE - Interactive generation loop (enter q to leave it)
if inference:
    model.eval()
    while True:
        qs = input("Enter text (q to quit) >>> ")
        if qs == "":
            continue
        if qs == 'q':
            break
        generate_sample(qs)
    model.train() # Back to training mode in case the training cell is run afterwards

Enter text (q to quit) >>>  My self Har


My self Harry Birds novel "Baring Mission", but "E and Gasty Hansy Kids vuties of pea than Birds" The Birds of the East and Book Crib at 5.5 million years old. The Birds share the B


Enter text (q to quit) >>>  My self Kris


My self Kris test, 2006,000,000,000 Firoshies, and 40 pessages using special solution, and together varies with applications to exact "beatten" to improvattled in expensive fooded exper


Enter text (q to quit) >>>  My self krish


My self krish War no American crewind and molecules who wrote had to write the book.

A Speatic

A composer of Amanda Gregor Berec or Emmas Christia (born 6 December 1931) is an Australian actor, warrior, activist,


Enter text (q to quit) >>>  my self krishn


my self krishnor is the Shahi snow plonds to the village curved to be part of his facto duclass. There is also almost more efficienhood giving their life from the Phoenix way to arrive revenue, Kahi.

Kahi


Enter text (q to quit) >>>  my self hari krishn


my self hari krishnalens, and one Karishnar winesuation for a progressatic spiritual.

For non Pope's nation, Luvan killed Ruthians to do his father. They saw the Nethereenth part of the Breddidars, where the Ph


Enter text (q to quit) >>>  I am indi


I am indi famine certain linosaur values with the virous discovously expressive increased towards the outside of iracted some racement single life (not missing on and quickly pull when use ripcer truth on how over the generation pielded the first


Enter text (q to quit) >>>  I am this country


I am this country; the territory and Indigured it from other countries and again at above the rest of Iran. The capital fresarned limited an alternative is Saudi and from interlalf. Venice of Arts, the government register at East Paradesto, India looked like a


Enter text (q to quit) >>>  What is your country ? I am from Indi


What is your country ? I am from Indiopenant calls Bush doing than. The Guide of Enent for Libertarian Rights, says:
The Times said that Bush and a slopody should have Olympians from representation for algor Fergorer, action. Most of the Rights of the


Enter text (q to quit) >>>  What is your country ? I am from India 


What is your country ? I am from India.

In 1916, Hob by the De�ie Reijing started among the heads of Arabia. In 1936, he studied a pictures of the Sir Artica Malay. Instead, twice with his rack of Immen (fl


Enter text (q to quit) >>>  q


In [26]:
#################################################################
###################### TRAINING #################################
#################################################################

try:
    for i in tqdm(range(start_iteration, train_iters)):
        xb,yb = get_batch("train") # Get a new batch of data
        logits,loss = model(xb,yb) # Run the LLM and get the logits and the loss

        if (i % eval_interval==0 or i == train_iters-1): # Calculate the loss
            l = calculate_loss()
            print(f"\n{i}: train loss: {l['train']} / val loss: {l['eval']}")

            # We do a quick test so that we observe the evolution through the training
            # Remember that we use a very small dataset which doesn't include all topics
            generate_sample("The mountain in my city is") # Generate a sample

            if l['eval'] < best_val_loss: # If we improved the best loss, save a checkpoint
                best_val_loss = l['eval']
                print("[CHECKPOINT]: Saving with loss: ", best_val_loss)
                os.makedirs(checkpoint_dir, exist_ok=True) # Make sure the checkpoint folder exists
                torch.save({
                    'model_state_dict': model.state_dict(),
                    'optimizer_state_dict': optimizer.state_dict(),
                    'loss': best_val_loss,
                    'iteration': i,
                }, os.path.join(checkpoint_dir, checkpoint_fn))

            if wandb_log:
                wandb.log({
                        "loss/train": l['train'],
                        "loss/val": l['eval'],
                        "lr": scheduler.get_last_lr()[0],
                    },
                    step = i)

        optimizer.zero_grad(set_to_none=True) # Reset gradients
        loss.backward() # Calculate new gradients

        # This line clips the gradients to prevent the exploding gradient problem during training.
        # Exploding gradients can occur when gradients become too large, causing unstable updates to model weights.
        nn.utils.clip_grad_norm_(model.parameters(), max_norm=grad_clip)

        optimizer.step() # Update the model parameters
        scheduler.step() # Update the learning rate value

except KeyboardInterrupt:
    print("Training interrupted. Cleaning up...")

finally:
    # Close the wandb run and release GPU memory, whether training finished or was interrupted
    if wandb_log:
        wandb.finish()
    torch.cuda.empty_cache()
    print("GPU memory released.")

# Code designed by Javier ideami
# ideami.com
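
The gradient clipping step in the loop rescales all gradients together rather than clamping each value. The pure-Python sketch below illustrates that idea on a toy gradient vector. `clip_by_global_norm` is a hypothetical helper, and the small `eps` term is an assumption added to avoid division by zero; it is a sketch of clipping by global norm, not PyTorch's exact code.

```python
import math

def clip_by_global_norm(grads, max_norm=1.0, eps=1e-6):
    """Scale all gradients by one common factor so their combined L2 norm is at most max_norm."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    scale = min(1.0, max_norm / (total_norm + eps))  # never scale gradients up
    return [g * scale for g in grads], total_norm

toy_grads = [3.0, 4.0]  # L2 norm is 5.0
clipped, norm_before = clip_by_global_norm(toy_grads)
print(norm_before, clipped)
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its length is capped at `grad_clip`.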


  0%|          | 0/94700 [00:00<?, ?it/s]


5300: train loss: 3.4895832538604736 / val loss: 3.6197917461395264
The mountain in my city is Qan clan racing CONFC. Thirgin developed a result in the body of a base called the secine of the returnriental touches of the mom released oceanians spamps.

Oxysing the reconquire opened labelled during


  0%|          | 44/94700 [00:01<56:43, 27.81it/s]


5350: train loss: 3.5260417461395264 / val loss: 3.453125


  0%|▏         | 53/94700 [00:03<2:02:16, 12.90it/s]

The mountain in my city is on the topic of complexity, the temperature of the Estaling.

It then, Tetros and Laca, have superaceae- half-ner. Only Qalarioa, and the the island has been nominated for the Quaranta-P


  0%|▎                                                                                                                                                                                                                                                                   | 100/94700 [00:06<1:46:45, 14.77it/s]


5400: train loss: 3.5989582538604736 / val loss: 3.4895832538604736


  0%|▎                                                                                                                                                                                                                                                                   | 102/94700 [00:07<5:17:58,  4.96it/s]

The mountain in my city is the capital of the city of the province of the province of Zoneg merged to the Provence-Palines region. The capital city was created in the second-minion business of the "Falcouksement" (2017), of Silean and the was given set


  0%|▍                                                                                                                                                                                                                                                                   | 150/94700 [00:10<1:45:00, 15.01it/s]


5450: train loss: 3.3697917461395264 / val loss: 3.4895832538604736


  0%|▍                                                                                                                                                                                                                                                                   | 152/94700 [00:11<5:21:23,  4.90it/s]

The mountain in my city is usually sectỘ����λ�ꯢς� (徑宏) was also homosoon early as Iunn) as it was made Hubongo Dong Longa Airre is also in Manpol experiences, but


  0%|▌                                                                                                                                                                                                                                                                   | 200/94700 [00:15<1:47:17, 14.68it/s]


5500: train loss: 3.453125 / val loss: 3.5104167461395264


  0%|▌                                                                                                                                                                                                                                                                   | 202/94700 [00:16<5:41:06,  4.62it/s]

The mountain in my city is the county and the seat of Haläumen. The town is at the town of the Halcon Valley.

The station of Its name is after the county is not sound and the white rival of troops.

E novel is usually performed mainly in competitive buildings where h


  0%|▋                                                                                                                                                                                                                                                                   | 250/94700 [00:19<1:50:11, 14.29it/s]


5550: train loss: 3.3385417461395264 / val loss: 3.421875


  0%|▋                                                                                                                                                                                                                                                                   | 252/94700 [00:20<5:53:31,  4.45it/s]

The mountain in my city is Mobbrew, an uncomproductive). 



Nights Day

Nalls Ist over is a port of each side of the sides of each sresistangerous tropical cyclones in its northern being at Butterclesface. Over three parts of the


  0%|▊                                                                                                                                                                                                                                                                   | 300/94700 [00:23<1:42:47, 15.31it/s]


5600: train loss: 3.4270832538604736 / val loss: 3.4270832538604736


  0%|▊                                                                                                                                                                                                                                                                   | 302/94700 [00:24<5:45:50,  4.55it/s]

The mountain in my city is this city of Patives, the province of Joffrey. Also, Punter is under the director of Dere No. It is the Swiss Basil.

Bartile-Marin Club

Barask-Ledox Pete Party (also known as WG


  0%|▉                                                                                                                                                                                                                                                                   | 350/94700 [00:28<1:46:56, 14.70it/s]


5650: train loss: 3.28125 / val loss: 3.2552082538604736
The mountain in my city is Wijyah Joseph ye and a neighbouritanilotech in the municipality of Rahyhaw and makes it from Bahl byfecture of the temple of religion.

Voijha Bali

Focca Bajna () is the Heare-Eu
[CHECKPOINT]: Saving with loss:  3.2552082538604736


  0%|█                                                                                                                                                                                                                                                                     | 400/94700 [00:29<26:01, 60.38it/s]


5700: train loss: 3.3958332538604736 / val loss: 3.5729167461395264
The mountain in my city is Asian. It is nowasco de as the "Lucks" of the department.



Carbell

 has the cars afford sheight of wide saf state, for almost 14 km².


The traditional opera "Frob


  0%|█▏                                                                                                                                                                                                                                                                  | 450/94700 [00:33<1:38:59, 15.87it/s]


5750: train loss: 3.234375 / val loss: 3.4947917461395264


  0%|█▏                                                                                                                                                                                                                                                                  | 452/94700 [00:35<3:52:14,  6.76it/s]

The mountain in my city is 1872ar to 782 met b yearly away from the city is built to the earpes of the Lorder and theaccan dob museumas. It is the largest natural architecture between the left fix-exual gČmmore,


  1%|█▎                                                                                                                                                                                                                                                                  | 500/94700 [00:38<1:43:47, 15.13it/s]


5800: train loss: 3.359375 / val loss: 3.375


  1%|█▍                                                                                                                                                                                                                                                                  | 502/94700 [00:39<5:22:53,  4.86it/s]

The mountain in my city is named after the pornerousgaeosaurs. It includes a plant, widiss, or other species is almost the sea cornet. It is a zoo rock spine.

The discovered by Roman treasure, the time it lenom fruit include the Arassi


  1%|█▌                                                                                                                                                                                                                                                                  | 550/94700 [00:42<1:43:31, 15.16it/s]


5850: train loss: 3.1875 / val loss: 3.34375


  1%|█▌                                                                                                                                                                                                                                                                  | 552/94700 [00:43<5:38:55,  4.63it/s]

The mountain in my city is Neirtitanç la Loire in the center of Montede, France.

The highest mountain of after the city is the 252 A there and the 105 municipalities (250 kilometers), the centre is sidersat the north-siduced


  1%|█▋                                                                                                                                                                                                                                                                  | 600/94700 [00:46<1:44:53, 14.95it/s]


5900: train loss: 3.2291667461395264 / val loss: 3.3125


  1%|█▋                                                                                                                                                                                                                                                                  | 602/94700 [00:47<5:22:28,  4.86it/s]

The mountain in my city is spleck or hair. On the Testin in 1991, lucked the site to Ingroup. It is by India where the city's plishing is restaurance by death. It massesend by Humus • Pakistan settlement,


  1%|█▊                                                                                                                                                                                                                                                                  | 650/94700 [00:51<1:44:25, 15.01it/s]


5950: train loss: 3.25 / val loss: 3.203125
The mountain in my city is Tonkin.



The city was named after Erosinothing. The population was ately 10,168.



Bocreerry

In architecture, the Betic Cloo laleer's theory (original clo
[CHECKPOINT]: Saving with loss:  3.203125


  1%|█▉                                                                                                                                                                                                                                                                  | 700/94700 [00:55<1:44:13, 15.03it/s]


6000: train loss: 3.1979167461395264 / val loss: 3.25


  1%|█▉                                                                                                                                                                                                                                                                  | 702/94700 [00:56<5:17:18,  4.94it/s]

The mountain in my city is quariles. Metasertiles form a "comunkagne" in theatar well-lar hit the next day and west, and contains the 12 provided by 11, and its length.
Bäkiful is ""Räk


  1%|██                                                                                                                                                                                                                                                                  | 736/94700 [00:59<1:49:50, 14.26it/s]


6050: train loss: 3.3333332538604736 / val loss: 3.4270832538604736
The mountain in my city is angle to wearches every day - is a shunk, walt�, unfred friends or needing pin-four Representative, and characters send bed (for a year 40,376, flew) the school which reached to absence


  1%|██▏                                                                                                                                                                                                                                                                 | 800/94700 [01:01<1:45:03, 14.90it/s]


6100: train loss: 3.2604167461395264 / val loss: 3.1979167461395264
The mountain in my city is Green to handled with mountainous district and historic are "Saleplaboremiin".

Electres in the Netherland are:
The Benepin, Parrothünkeyeis, and Percelithe, or the Common Riversh.ks
[CHECKPOINT]: Saving with loss:  3.1979167461395264


  1%|██▎                                                                                                                                                                                                                                                                 | 850/94700 [01:06<1:43:29, 15.11it/s]


6150: train loss: 3.234375 / val loss: 3.1666667461395264
The mountain in my city is now one of the powerful Communications. It flies on the northern Ocean as exts and only incuring terms of overall rose to Norway and in the British with corporals. 


The Wiscovery now includes circul. Of the City Plain area are built within
[CHECKPOINT]: Saving with loss:  3.1666667461395264


  1%|██▍                                                                                                                                                                                                                                                                 | 900/94700 [01:11<1:47:30, 14.54it/s]


6200: train loss: 3.2708332538604736 / val loss: 3.2864582538604736


  1%|██▍                                                                                                                                                                                                                                                                 | 902/94700 [01:12<5:58:50,  4.36it/s]

The mountain in my city is Vietnam. It may mean:

Ot, north of the stream of this way out it is hired by the block.

But off the centre of Jean

France's My Cats is the nameyregister of Dieg. "Made Te


  1%|██▌                                                                                                                                                                                                                                                                 | 950/94700 [01:15<1:50:56, 14.08it/s]


6250: train loss: 3.2604167461395264 / val loss: 3.1822917461395264


  1%|██▌                                                                                                                                                                                                                                                                 | 952/94700 [01:16<5:44:30,  4.54it/s]

The mountain in my city is John Crus Valley, Berards, Del Swe Region, and Swidel M One.Q. Errra 


S� owned three short stories and written by Moholwo One, for two years.

Brusica (1929), best known as


  1%|██▋                                                                                                                                                                                                                                                                | 1000/94700 [01:20<1:56:20, 13.42it/s]


6300: train loss: 3.0 / val loss: 3.2760417461395264


  1%|██▋                                                                                                                                                                                                                                                                | 1002/94700 [01:21<6:25:12,  4.05it/s]

The mountain in my city is Jerry Hennessee Laukos. It is the 11th in the county Sea. It starts in the eastern ourday language.

Gedal

Ging, Blue, Bavec

Gedal is a large town in Furrently


  1%|██▊                                                                                                                                                                                                                                                                | 1050/94700 [01:25<1:54:09, 13.67it/s]


6350: train loss: 3.3385417461395264 / val loss: 3.3489582538604736


  1%|██▉                                                                                                                                                                                                                                                                | 1052/94700 [01:26<5:44:55,  4.53it/s]

The mountain in my city is Khikshumi, 1992 inhabitants. She is a City and web enatistory a sense school. It has a large area by hongwan school in Shabitat.
g131 faircats learn 1503 kg/6 appear


  1%|██▉                                                                                                                                                                                                                                                                | 1080/94700 [01:28<1:48:45, 14.35it/s]


6400: train loss: 3.1458332538604736 / val loss: 3.0989582538604736
The mountain in my city is a half of the researchers who finds the native summit between the Black range and the Sumantum dampions League.

Bches are additional roots and your practices being built when the park's body is not being able to only any older people. The most
[CHECKPOINT]: Saving with loss:  3.0989582538604736


  1%|███▏                                                                                                                                                                                                                                                               | 1150/94700 [01:31<1:36:57, 16.08it/s]


6450: train loss: 3.1927082538604736 / val loss: 3.1197917461395264


  1%|███▏                                                                                                                                                                                                                                                               | 1152/94700 [01:32<4:52:12,  5.34it/s]

The mountain in my city is provided into the "unu".
 The planetosto works at only as trim less leap. 5 six summ

Irair is a previously highly distinct. The sug was contented as a Kath Train Company from


  1%|███▎                                                                                                                                                                                                                                                               | 1200/94700 [01:35<1:44:41, 14.88it/s]


6500: train loss: 3.1197917461395264 / val loss: 3.3229167461395264


  1%|███▎                                                                                                                                                                                                                                                               | 1202/94700 [01:36<5:31:21,  4.70it/s]

The mountain in my city is in Earth, with she is captured and watched the Hongghunt, and Due finds Mees.

First at the goats in 15th century Cam b"

Watthus

Watthus is a names near the Bordanyost


  1%|███▍                                                                                                                                                                                                                                                               | 1250/94700 [01:39<1:43:41, 15.02it/s]


6550: train loss: 3.1770832538604736 / val loss: 3.140625


  1%|███▍                                                                                                                                                                                                                                                               | 1252/94700 [01:41<5:47:23,  4.48it/s]

The mountain in my city is named in which it is near Greada by a tea place in the South and celebrity with their cloth. A wake is much freed to all the rivers of Argentina, Australia. The tallest town is styled in the grankm in the city.


Kama (


  1%|███▌                                                                                                                                                                                                                                                               | 1300/94700 [01:44<1:43:23, 15.06it/s]


6600: train loss: 3.1510417461395264 / val loss: 3.265625


  1%|███▌                                                                                                                                                                                                                                                               | 1302/94700 [01:45<5:39:30,  4.58it/s]

The mountain in my city is the most important game has an EP shows water and rec Speocolved scattery were often gold in the United Kingdom and his current city.

Cycles is a very big city in the city in the Columbia in the Ohio region of Berton County. However it has many small islands


  1%|███▋                                                                                                                                                                                                                                                               | 1350/94700 [01:48<1:42:58, 15.11it/s]


6650: train loss: 3.0520832538604736 / val loss: 3.2916667461395264


  1%|███▋                                                                                                                                                                                                                                                               | 1352/94700 [01:49<5:53:45,  4.40it/s]

The mountain in my city is "The Mirali and is one of the most common in central villages of the city of Soonland. That we know Kalun' is the region inside the public ever melses of different centers it to grace. After Mulk, Russia was important to prisearch for mov


  1%|███▊                                                                                                                                                                                                                                                               | 1400/94700 [01:53<1:48:16, 14.36it/s]


6700: train loss: 3.2864582538604736 / val loss: 3.2083332538604736


  1%|███▊                                                                                                                                                                                                                                                               | 1402/94700 [01:54<6:10:41,  4.19it/s]

The mountain in my city is about 31 km° (55 metual). It is drown to the offensure of candidates for head of from Tennesse as LuwP. The fourth largest ske item in Kautbury, and San Muiıc


  2%|███▉                                                                                                                                                                                                                                                               | 1450/94700 [01:57<1:42:54, 15.10it/s]


6750: train loss: 3.1666667461395264 / val loss: 3.1197917461395264
The mountain in my city is on the border that there is over 27 km long.

Mag "Plink-iac-amudi" ("Bo bisconsinawas a humanerman house" in the town of Katmu, Sweden). Retball route also surviving areas


  2%|████                                                                                                                                                                                                                                                               | 1499/94700 [01:59<1:10:15, 22.11it/s]


6800: train loss: 3.0833332538604736 / val loss: 3.1041667461395264


  2%|████                                                                                                                                                                                                                                                               | 1502/94700 [02:00<2:50:25,  9.11it/s]

The mountain in my city is Jasan Khura (En letter).

In 1937, the Chombelljan War by the Barnikor Rail special borough Joseph Kajiythkab Bhagin wrote the 1930 novel "Musázi Nurem


  2%|████▏                                                                                                                                                                                                                                                              | 1550/94700 [02:03<1:42:55, 15.08it/s]


6850: train loss: 3.1979167461395264 / val loss: 3.15625


  2%|████▏                                                                                                                                                                                                                                                              | 1552/94700 [02:04<5:38:45,  4.58it/s]

The mountain in my city is Sid Lillellander.

The Globby Governor La Jazon is in the south coast of St. Budgar. It is on the southward of Helen, that as the new tributary divided Wales and island of Foulder. Wirming looks at his


  2%|████▍                                                                                                                                                                                                                                                              | 1600/94700 [02:08<1:53:04, 13.72it/s]


6900: train loss: 3.1770832538604736 / val loss: 3.2291667461395264


  2%|████▍                                                                                                                                                                                                                                                              | 1602/94700 [02:09<5:44:48,  4.50it/s]

The mountain in my city is in the Uren river Early Preecourt.


Hittero

York City is a city in the Emperor of Georgia. The city has a most populous name for any city in the city of Perathi island including is Cervénées. The capital


  2%|████▌                                                                                                                                                                                                                                                              | 1650/94700 [02:12<1:41:21, 15.30it/s]


6950: train loss: 3.0625 / val loss: 3.09375
The mountain in my city is in the Pradesh. Since 2010 Gala won the district again.


Glazewo

Glambewo is a city in the United States. It is a Republican primary district. Galaim place is another colisiques in the United States
[CHECKPOINT]: Saving with loss:  3.09375



 2%|████▋                                                                                                                                                                                                                                                              | 1697/94700 [02:16<2:05:03, 12.40it/s]

Training interrupted. Cleaning up...
GPU memory released.



Run history:
loss/train  ▇▇█▅▆▅▆▄▆▄▅▃▄▄▃▅▄▄▄▄▁▅▃▃▂▃▃▂▄▃▂▃▃▂
loss/val    █▆▆▆▇▅▅▃▇▆▅▄▄▂▃▅▂▂▄▂▃▄▁▁▄▂▃▄▃▁▁▂▃▁
lr          █▂▂▂▂▂▂▂▂▂▂▂▂▂▂▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁▁

Run summary:
loss/train  3.0625
loss/val    3.09375
lr          0.0003
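
In the log above, a "[CHECKPOINT]: Saving with loss" line appears only at evaluations whose validation loss is lower than every earlier one, which is consistent with a "save on best validation loss" rule. The sketch below replays that rule over a few of the logged lines using only the standard library. The function name checkpoint_steps, the regular expression, and the starting best value of 3.3 are assumptions made for illustration (the run was resumed, so the real starting value is not visible in this output).

```python
import re

# Matches lines such as "5650: train loss: 3.28125 / val loss: 3.2552082538604736"
LOG_LINE = re.compile(r"^(\d+): train loss: ([\d.]+) / val loss: ([\d.]+)$")

def checkpoint_steps(log_lines, best_val=float("inf")):
    """Return (iteration, val_loss) pairs at which a new best validation loss occurs."""
    saved = []
    for line in log_lines:
        match = LOG_LINE.match(line.strip())
        if match is None:
            continue  # skip progress bars, generated text, blank lines
        step, val = int(match.group(1)), float(match.group(3))
        if val < best_val:
            best_val = val
            saved.append((step, val))
    return saved

# A few lines copied from the training output above
log = [
    "5600: train loss: 3.4270832538604736 / val loss: 3.4270832538604736",
    "5650: train loss: 3.28125 / val loss: 3.2552082538604736",
    "5700: train loss: 3.3958332538604736 / val loss: 3.5729167461395264",
    "5950: train loss: 3.25 / val loss: 3.203125",
]

# best_val=3.3 is a hypothetical starting point, not a value taken from the run
saved = checkpoint_steps(log, best_val=3.3)
print(saved)
```

With that assumed starting value, the sketch selects iterations 5650 and 5950, the same two points at which the log reports a checkpoint within this range.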


In [30]:
!nvidia-smi

Wed Dec 11 11:22:35 2024       
+-----------------------------------------------------------------------------------------+
| NVIDIA-SMI 565.77.01              Driver Version: 566.36         CUDA Version: 12.7     |
|-----------------------------------------+------------------------+----------------------+
| GPU  Name                 Persistence-M | Bus-Id          Disp.A | Volatile Uncorr. ECC |
| Fan  Temp   Perf          Pwr:Usage/Cap |           Memory-Usage | GPU-Util  Compute M. |
|                                         |                        |               MIG M. |
|   0  NVIDIA GeForce RTX 3080 Ti     On  |   00000000:01:00.0  On |                  N/A |
| 53%   40C    P8             31W /  350W |    5384MiB /  12288MiB |      6%      Default |
|                                         |                        |                  N/A |
+-----------------------------------------+------------------------+----------------------+
                                                