<a href="https://colab.research.google.com/github/Drewe4401/ZeldaGPT/blob/main/ZeldaGPT.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# ZeldaGPT

  * ZeldaGPT allows you to train a GPT model using the Zelda Text Dump Dataset. The program handles the data preprocessing, model configuration, and training process to optimize the language generation capabilities.
  * Once the GPT model is trained, ZeldaGPT enables you to generate new text based on the trained model. This feature allows you to interact with the model and obtain Zelda-like dialogues, descriptions, or other text elements.

## Imports

* import torch: This imports PyTorch, an open-source machine learning library for Python used in areas such as computer vision and natural language processing. It provides tensor computation with GPU acceleration and deep neural networks built on a tape-based autograd system.

* import torch.nn as nn: This imports PyTorch's neural network module under the alias nn. The torch.nn module provides the building blocks for neural networks, including layer classes and loss functions. Optimization algorithms are not part of torch.nn; they live in torch.optim, which is used later for AdamW.

* from torch.nn import functional as F: This imports the functional interface under the alias F. It exposes common operations as plain functions; this notebook uses F.cross_entropy for the loss and F.softmax when sampling.


In [1]:
import torch
import torch.nn as nn
from torch.nn import functional as F

In [2]:
!wget https://raw.githubusercontent.com/Drewe4401/ZeldaGPT/main/zelda_text_dump.txt #getting data set from github

--2023-05-04 17:47:35--  https://raw.githubusercontent.com/Drewe4401/ZeldaGPT/main/zelda_text_dump.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.108.133, 185.199.109.133, 185.199.111.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.108.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 666033 (650K) [text/plain]
Saving to: ‘zelda_text_dump.txt’


2023-05-04 17:47:35 (14.4 MB/s) - ‘zelda_text_dump.txt’ saved [666033/666033]



In [3]:
#reading the file
with open('zelda_text_dump.txt', 'r', encoding='utf-8') as f:
  text = f.read()
print("length of dataset in characters: ", len(text))

length of dataset in characters:  631587


In [4]:
print(text[:500]) #Checking out the first 500 characters

You borrowed a Pocket Egg!
A Pocket Cucco will hatch from
it overnight. Be sure to give it
back when you are done with it.

You returned the Pocket Cucco
and got Cojiro in return!
Unlike other Cuccos, Cojiro
rarely crows.

You got an Odd Mushroom!
A fresh mushroom like this is
sure to spoil quickly! Take it to
the Kakariko Potion Shop, quickly!

You received an Odd Potion!
It may be useful for something...
Hurry to the Lost Woods!

You returned the Odd Potion 
and got the Poacher's Saw!
The youn


In [5]:
chars_in_text = sorted(list(set(text)))
vocab_size = len(chars_in_text)
print(''.join(chars_in_text))
print(vocab_size)


 !"&'()*+,-./0123456789:;<>?ABCDEFGHIJKLMNOPQRSTUVWXYZ^abcdefghijklmnopqrstuvwxyz|~§©´»ÁÄÈËÌÍÎÏÐÑÒÔÕ×ØÙÚÛÜÝÞßáâãäåæçèéôöùúûü†
138


In [6]:
# create a mapping from characters to integers
stoi = { ch:i for i,ch in enumerate(chars_in_text) }
itos = { i:ch for i,ch in enumerate(chars_in_text) }
encode = lambda s: [stoi[c] for c in s] # encoder: take a string, output a list of integers
decode = lambda l: ''.join([itos[i] for i in l]) # decoder: take a list of integers, output a string
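
As a quick sanity check on the mapping above, the sketch below builds the same kind of character-level tables from a short sample string and confirms that decoding an encoded string returns the original. The sample string and the `sample_*` names are illustrative only; the notebook itself builds its tables from the full text dump.

```python
# Illustrative sample; the notebook builds these tables from the full text dump
sample_text = "You got an Odd Mushroom!"

# Sorted unique characters define the vocabulary
sample_chars = sorted(set(sample_text))
sample_stoi = {ch: i for i, ch in enumerate(sample_chars)}
sample_itos = {i: ch for i, ch in enumerate(sample_chars)}

def sample_encode(s):
    """Map each character to its integer id."""
    return [sample_stoi[c] for c in s]

def sample_decode(ids):
    """Map integer ids back to characters and join them."""
    return ''.join(sample_itos[i] for i in ids)

ids = sample_encode("Odd")
print(ids)
print(sample_decode(ids))  # round-trips back to "Odd"
```

Because the lookup is a plain dictionary, encoding a character that never appeared in the source text raises a `KeyError`.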

In [7]:
data = torch.tensor(encode(text), dtype=torch.long)
print(data.shape, data.dtype)
print(data[:500])

torch.Size([631587]) torch.int64
tensor([65, 82, 88, 13, 69, 82, 85, 85, 82, 90, 72, 71, 13, 68, 13, 56, 82, 70,
        78, 72, 87, 13, 45, 74, 74, 14,  0, 41, 13, 56, 82, 70, 78, 72, 87, 13,
        43, 88, 70, 70, 82, 13, 90, 76, 79, 79, 13, 75, 68, 87, 70, 75, 13, 73,
        85, 82, 80,  0, 76, 87, 13, 82, 89, 72, 85, 81, 76, 74, 75, 87, 24, 13,
        42, 72, 13, 86, 88, 85, 72, 13, 87, 82, 13, 74, 76, 89, 72, 13, 76, 87,
         0, 69, 68, 70, 78, 13, 90, 75, 72, 81, 13, 92, 82, 88, 13, 68, 85, 72,
        13, 71, 82, 81, 72, 13, 90, 76, 87, 75, 13, 76, 87, 24,  0,  0, 65, 82,
        88, 13, 85, 72, 87, 88, 85, 81, 72, 71, 13, 87, 75, 72, 13, 56, 82, 70,
        78, 72, 87, 13, 43, 88, 70, 70, 82,  0, 68, 81, 71, 13, 74, 82, 87, 13,
        43, 82, 77, 76, 85, 82, 13, 76, 81, 13, 85, 72, 87, 88, 85, 81, 14,  0,
        61, 81, 79, 76, 78, 72, 13, 82, 87, 75, 72, 85, 13, 43, 88, 70, 70, 82,
        86, 22, 13, 43, 82, 77, 76, 85, 82,  0, 85, 68, 85, 72, 79, 92, 13, 70,
       

In [9]:
n = int(0.9*len(data))
training_data = data[:n] # first 90% of the text is training data
validation_data = data[n:] # remaining 10% of the text is validation data

In [97]:
block_size = 8
print(training_data[:block_size+1])
print(text[:9])

tensor([65, 82, 88, 13, 69, 82, 85, 85, 82])
You borro
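
A chunk of `block_size + 1` tokens packs `block_size` training examples: every prefix is a context and the token right after it is the target. The sketch below shows this on the nine characters printed above, using characters instead of integer ids for readability. The `chunk` and `pairs` names are illustrative.

```python
# Illustrative: the first block_size + 1 characters of the dataset, as printed above
chunk = "You borro"
block_size = 8

# Each prefix of the chunk is a context; the character right after it is the target
pairs = [(chunk[:t + 1], chunk[t + 1]) for t in range(block_size)]

for context, target in pairs:
    print(f"context {context!r} -> target {target!r}")
```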


In [115]:
batch_size = 32 # Process this many parallel sequences
torch.manual_seed(400)

def get_batch(split):
  data = training_data if split == 'train' else validation_data
  ix = torch.randint(len(data) - block_size, (batch_size,)) # random start offsets, one per sequence in the batch
  x = torch.stack([data[i:i+block_size] for i in ix]) # inputs, shape (batch_size, block_size) = (32, 8)
  y = torch.stack([data[i+1:i+block_size+1] for i in ix]) # targets: the same windows shifted one position to the right
  return x, y

x_batch, y_batch = get_batch('train')
print('inputs:')
print(x_batch.shape)
print(x_batch)
print('targets:')
print(y_batch.shape)
print(y_batch)

print('-----------------------------------------------')

for b in range(batch_size):
  for t in range(block_size):
    context = x_batch[b, :t+1]
    target = y_batch[b, t]
    print(f"input {context.tolist()}, target: {target}")

inputs:
torch.Size([32, 8])
tensor([[71, 86, 13, 92, 82, 88, 13, 86],
        [72, 79, 79, 13, 71, 82, 81, 72],
        [ 0, 24, 24, 24, 24, 24, 24, 24],
        [87, 75, 72, 80, 14,  0, 52, 82],
        [82, 81, 72, 40, 14,  0,  0, 44],
        [87, 82, 13, 59, 81, 82, 90, 75],
        [13, 74, 82, 40,  0,  0, 29, 26],
        [82, 80, 69, 86, 14,  0, 59, 72],
        [80, 92, 13, 83, 85, 72, 70, 76],
        [80, 72, 13, 68,  0, 80, 82, 81],
        [13, 58, 88, 83, 72, 72, 86,  0],
        [69, 82, 92, 13, 82, 73, 13, 80],
        [87, 76, 80, 72,  0, 92, 82, 88],
        [82, 85, 82, 81, 17, 86, 13, 58],
        [73, 72, 81, 70, 72, 86, 13, 75],
        [13, 68, 13, 80, 72, 80, 69, 72],
        [82, 88, 74, 75, 13, 75, 72, 85],
        [13, 59, 81, 82, 90, 75, 72, 68],
        [92, 82, 88, 13, 70, 88, 87, 13],
        [82, 13, 80, 72, 13, 88, 81, 87],
        [85, 68, 76, 86, 72,  0, 80, 72],
        [75, 68, 87, 13, 75, 82, 87, 13],
        [90, 90, 90, 24, 24, 24,  0,  0],
      

In [116]:
# Define the BigramLanguageModel class, which inherits from the PyTorch nn.Module class
class BigramLanguageModel(nn.Module):
    # Initialize the BigramLanguageModel class
    def __init__(self, vocab_size):
        # Call the parent class constructor
        super(BigramLanguageModel, self).__init__()
        # Create an embedding layer with vocab_size input and output dimensions
        self.token_embedding_table = nn.Embedding(vocab_size, vocab_size)

    # Define the forward pass for the model
    def forward(self, idx, targets=None):
        # Calculate the logits by passing the input idx through the embedding layer
        logits = self.token_embedding_table(idx)  # (B, T, C) (batch_size, block_size, vocab_size)

        # If there are no targets, set the loss to None
        if targets is None:
            loss = None
        else:
            # If there are targets, reshape the logits and targets for the loss calculation
            B, T, C = logits.shape
            logits = logits.view(B * T, C)
            targets = targets.view(B * T)
            # Calculate the cross-entropy loss between logits and targets
            loss = F.cross_entropy(logits, targets)

        # Return both logits and loss; logits has shape (B * T, C) when targets are given, (B, T, C) otherwise
        return logits, loss

    # Define the generate function for generating text
    def generate(self, idx, max_new_tokens):
        # Loop for the specified number of tokens to generate
        for _ in range(max_new_tokens):
            # Calculate the logits and loss
            logits, loss = self(idx)
            # Take the last token logits and calculate the softmax probabilities
            logits = logits[:, -1, :]
            probs = F.softmax(logits, dim=-1)
            # Sample the next token index based on the probabilities
            idx_next = torch.multinomial(probs, num_samples=1)
            # Concatenate the sampled token index with the existing indices
            idx = torch.cat((idx, idx_next), dim=1)

        # Return the generated token indices
        return idx

# Instantiate the BigramLanguageModel with vocab_size
m = BigramLanguageModel(vocab_size)
# Perform the forward pass on a batch of input data (x_batch) and targets (y_batch)
logits, loss = m(x_batch, y_batch)

# Print the shape of the logits tensor and the calculated loss
print(logits.shape)
print(loss)  # a uniform guess over 138 characters gives -ln(1/138) = 4.927; the untrained model should land in that neighborhood

# Generate text by calling the generate function and print the decoded text
print(decode(m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)[0].tolist()))

torch.Size([256, 138])
tensor(5.4830, grad_fn=<NllLossBackward0>)

ö68j?&Qu0rp&0Bg6ûÔW~zÞgoLMsÔCY|ØknßÌÕÚdÏwãEXGGkVo|KsÕÈfÎÎ9DáçsO3â1/4VÎôfyfd6"B:×<!y/Ð>5Î2oMßåyQHl/Ä9»4Íxy*(Lmm»DviÍ,ÛÔÎ2´Svôèq§9b?FwèvFXZFcKËÚ'aE|T(×Aèw×o!q,ãdËS©lãec
:vgh´RC!KwJxvÞËÈm§0DLrUMÈ8jãúypKÍå?çäS"ÏH)5.ÜüZÙ'*ÜH*è"j×~ãæfLéÝSÐ+ù7é4éúhKaNèÑ)gÝs3sZ+Ø´LWúÎÎ†åÚèÑYtbÞ69TÜJ§ÏÜ0ÚhJT"Ï;ÞÚBåHD|r&t7éè9QKÜgoãlu^»ÔÚy^ôÕ*dl,<F~ÛÔb)^ä-yÏrÄ2>ÕtLäNOút†~AüÝå?ÌÕGß´I98Ø!W|BPnJA§mhK1ã*|ÒIvzzô×sAqÔ^ÝöIBûYyÜfI6
ömHoT-sU'1FÛSÔÐÔ
HOÝtöKnXåspúwËfYA'DÈÞÄ8ÒNûC(ËËSß+u


In [117]:
# Create an optimizer for the model 'm'
# Use the AdamW optimization algorithm with a learning rate of 1e-3 (0.001)
optimizer = torch.optim.AdamW(m.parameters(), lr=1e-3)

In [118]:
# Loop for 10,000 steps (iterations) in the training process
for steps in range(10000):
    # Get a batch of training data (input and target) using the get_batch function
    x_batch, y_batch = get_batch('train')
    
    # Perform the forward pass on the input batch (x_batch) and target batch (y_batch)
    # and get the output logits and the loss
    logits, loss = m(x_batch, y_batch)
    
    # Reset the gradients in the optimizer before calculating new gradients
    optimizer.zero_grad(set_to_none=True)
    
    # Perform the backward pass to calculate gradients with respect to the loss
    loss.backward()
    
    # Update the model's parameters (here, the embedding table) using the calculated gradients
    optimizer.step()

# Print the final value of the loss for the last iteration
print(loss.item())

2.4902431964874268
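
The number printed above is the loss on the final mini-batch only, so it is noisy, and the cells shown here never evaluate on `validation_data`. One common way to get a steadier figure is to average the loss over many batches per split with gradients disabled. The sketch below is not part of the original notebook: `estimate_loss`, `sample_batch`, `eval_iters`, and the random stand-in data are all assumptions made so that the example runs on its own.

```python
import torch
import torch.nn as nn
from torch.nn import functional as F

torch.manual_seed(0)

# Hypothetical stand-ins so the sketch is self-contained
vocab_size, block_size, batch_size = 16, 8, 4
tokens = torch.randint(vocab_size, (1000,))
splits = {'train': tokens[:900], 'val': tokens[900:]}

def sample_batch(split):
    """Draw random (input, target) windows from the chosen split."""
    source = splits[split]
    ix = torch.randint(len(source) - block_size, (batch_size,))
    x = torch.stack([source[i:i + block_size] for i in ix])
    y = torch.stack([source[i + 1:i + block_size + 1] for i in ix])
    return x, y

# A bare embedding table acts as a bigram model: row i holds next-token logits for token i
model = nn.Embedding(vocab_size, vocab_size)

@torch.no_grad()  # evaluation only, so no gradients are tracked
def estimate_loss(eval_iters=50):
    """Average the cross-entropy loss over eval_iters random batches per split."""
    averages = {}
    for split in splits:
        losses = torch.zeros(eval_iters)
        for k in range(eval_iters):
            x, y = sample_batch(split)
            logits = model(x)  # (B, T, C)
            losses[k] = F.cross_entropy(logits.view(-1, vocab_size), y.view(-1))
        averages[split] = losses.mean().item()
    return averages

avg_losses = estimate_loss()
print(avg_losses)
```

In the notebook itself, the same idea would call `get_batch` and `m` in place of the stand-ins. A validation loss that stays well above the training loss would suggest overfitting.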


In [119]:
# Generate a sequence of tokens using the trained model 'm'
# Pass an initial input tensor of zeros with a shape of (1, 1) and dtype 'long'
# Set the number of tokens to generate to 500
generated_sequence = m.generate(idx=torch.zeros((1, 1), dtype=torch.long), max_new_tokens=500)

# Convert the generated token tensor to a list of token IDs
generated_token_ids = generated_sequence[0].tolist()

# Decode the list of token IDs into human-readable text using the 'decode' function
decoded_text = decode(generated_token_ids)

# Print the decoded text
print(decoded_text)


Thintou yind.

btheouy tored preno
g We ou tetre..
(ZVderpu areekNoud k m re thains whe mim
I'ly ashe ck ofon t Ï^mayo|Ango.
yonkitthet?
o
Sur hililet rof
SThe?"âEAnd, ce andous
Ik fu, alartt st itheroour k!
Ong|-§f I's deery! ig ry?
yontis ghinouffomot 4n DU.Ý) tak bre prenggon
the thhe t d!
acheman
an thekste hal an Yof o yonelliropa ttoupl t s we t.
r rid wathetofl, nge whang!Îly
O
be agy ithinor ott otowhtisked s d


C) f Yo tinas cat the f...*´Eatou tht 

ce hos thoow*»§ôËpitabAris w....
