<a href="https://colab.research.google.com/github/Ajaykbaiju/GAME_OF_THRONES_LLM/blob/main/llm_got.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [1]:
!pip install transformers datasets


Collecting datasets
  Downloading datasets-2.21.0-py3-none-any.whl.metadata (21 kB)
Collecting pyarrow>=15.0.0 (from datasets)
  Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl.metadata (3.3 kB)
Collecting dill<0.3.9,>=0.3.0 (from datasets)
  Downloading dill-0.3.8-py3-none-any.whl.metadata (10 kB)
Collecting xxhash (from datasets)
  Downloading xxhash-3.5.0-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl.metadata (12 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.16-py310-none-any.whl.metadata (7.2 kB)
Downloading datasets-2.21.0-py3-none-any.whl (527 kB)
Downloading dill-0.3.8-py3-none-any.whl (116 kB)
Downloading pyarrow-17.0.0-cp310-cp310-manylinux_2_28_x86_64.whl (39.9 MB)

In [7]:
from transformers import AutoTokenizer
from torch.utils.data import Dataset, DataLoader
import torch

# Define file paths
file_paths = [
    '/content/001ssb.txt',
    '/content/002ssb.txt',
    '/content/003ssb.txt',
    '/content/004ssb.txt',
    '/content/005ssb.txt'
]

# Initialize an empty string to store the combined text
combined_text = ''

# Loop through each file and read its content
for file_path in file_paths:
    try:
        with open(file_path, 'r', encoding='utf-8', errors='ignore') as file:
            raw_text = file.read()
            combined_text += raw_text
    except Exception as e:
        print(f"Error reading {file_path}: {e}")

# Initialize the tokenizer
tokenizer = AutoTokenizer.from_pretrained("gpt2")  # Replace with the model you're using

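# Tokenize the combined text.
# Note: max_length=1024 with truncation=True keeps only the first 1,024 tokens of the
# corpus, so the dataset below holds just 8 blocks of 128 tokens, which fit in one batch.
tokens = tokenizer(combined_text, return_tensors='pt', max_length=1024, truncation=True).input_ids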

# Define a custom dataset class
class GOTDataset(Dataset):
    def __init__(self, tokens, block_size):
        self.tokens = tokens
        self.block_size = block_size

    def __len__(self):
        return len(self.tokens[0]) // self.block_size

    def __getitem__(self, idx):
        start_idx = idx * self.block_size
        end_idx = start_idx + self.block_size
        return self.tokens[0, start_idx:end_idx]

# Create a dataset and DataLoader
block_size = 128  # Adjust based on memory
dataset = GOTDataset(tokens, block_size=block_size)
data_loader = DataLoader(dataset, batch_size=16, shuffle=True)

# Example to print the first batch of tokens
for batch in data_loader:
    print(batch)
    break  # Just printing the first batch


tensor([[16571,   812,   287,  ...,   373,   257,  9298],
        [   13,  5628,   276,  ...,   465, 12389,   278],
        [  286,   257,  3470,  ...,  5093,    11,   290],
        ...,
        [ 1597,   351,   262,  ...,  1318,   389,  1243],
        [  340,   925,   262,  ...,   355,   257,  9845],
        [   32,  3776,  3226,  ...,  1135,   423,   645]])
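Because the tokenizer call above truncates at 1,024 tokens, only 8 blocks of 128 tokens ever reach the DataLoader. One possible way to use the whole corpus is to tokenize without truncation and cut the flat list of ids into blocks. The sketch below shows only the chunking step in plain Python; `split_into_blocks` and `fake_ids` are hypothetical names and stand-in data, not part of the notebook.

```python
def split_into_blocks(token_ids, block_size):
    """Split a flat list of token ids into consecutive, non-overlapping blocks.

    Trailing ids that do not fill a whole block are dropped, matching the
    floor division used in GOTDataset.__len__.
    """
    n_blocks = len(token_ids) // block_size
    return [token_ids[i * block_size:(i + 1) * block_size] for i in range(n_blocks)]


# Stand-in for the ids of a tokenized corpus (hypothetical data, not real GPT-2 ids).
fake_ids = list(range(1000))
blocks = split_into_blocks(fake_ids, block_size=128)
print(len(blocks), len(blocks[0]))

# The truncated case from the cell above: 1,024 tokens give 8 blocks.
truncated_blocks = split_into_blocks(list(range(1024)), block_size=128)
print(len(truncated_blocks))
```

With the real tokenizer, the flat list would come from encoding `combined_text` without `max_length` and `truncation`; the tokenizer may warn that the sequence exceeds the model's 1,024-token window, which does not matter here because each block is only 128 tokens long.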


In [10]:
import torch.nn as nn

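class SimpleTransformer(nn.Module):
    def __init__(self, vocab_size, embedding_dim=256, num_heads=4, num_layers=2):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embedding_dim)
        self.transformer_layers = nn.Transformer(
            d_model=embedding_dim,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            batch_first=True  # inputs are (batch, seq); the default expects (seq, batch)
        )
        self.fc = nn.Linear(embedding_dim, vocab_size)

    def forward(self, src, tgt):
        # Causal mask so that position i cannot attend to later positions; without it
        # the model can read the token it is supposed to predict.
        # Assumes src and tgt have the same length, as in every call in this notebook.
        seq_len = src.size(1)
        mask = self.transformer_layers.generate_square_subsequent_mask(seq_len).to(src.device)
        src = self.embedding(src)
        tgt = self.embedding(tgt)
        transformer_output = self.transformer_layers(
            src, tgt, src_mask=mask, tgt_mask=mask, memory_mask=mask
        )
        return self.fc(transformer_output)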

# Initialize the model
vocab_size = tokenizer.vocab_size
model = SimpleTransformer(vocab_size)
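
A next-token language model is normally trained with a causal attention mask, so that each position can only look at itself and earlier positions. The following is a minimal plain-Python sketch of what such a mask looks like, following PyTorch's boolean-mask convention in which `True` marks positions that may not be attended to. The helper name `causal_mask` is hypothetical.

```python
def causal_mask(seq_len):
    """Return a seq_len x seq_len matrix; True marks positions that may NOT be attended to."""
    return [[col > row for col in range(seq_len)] for row in range(seq_len)]


# Each row is a query position; "x" marks the future positions hidden from it.
for row in causal_mask(4):
    print(" ".join("x" if blocked else "." for blocked in row))
```

The first row can see only the first token, while the last row can see the whole sequence.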




In [11]:
import torch.optim as optim

# Define optimizer and loss function
optimizer = optim.Adam(model.parameters(), lr=1e-4)
criterion = nn.CrossEntropyLoss()

# Move model to GPU if available
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model.to(device)

# Training loop
epochs = 3  # Adjust as needed
for epoch in range(epochs):
    model.train()
    total_loss = 0
    for batch_idx, input_batch in enumerate(data_loader):
        input_batch = input_batch.to(device)

        # Shift input_batch for the target
        target_batch = input_batch[:, 1:].contiguous()
        input_batch = input_batch[:, :-1].contiguous()

        # Zero gradients
        optimizer.zero_grad()

        # Forward pass
        output = model(input_batch, input_batch)

        # Compute loss
        loss = criterion(output.view(-1, vocab_size), target_batch.view(-1))
        total_loss += loss.item()

        # Backward pass and optimization
        loss.backward()
        optimizer.step()

        if batch_idx % 10 == 0:
            print(f'Epoch [{epoch+1}/{epochs}], Step [{batch_idx}/{len(data_loader)}], Loss: {loss.item()}')

    print(f'Epoch {epoch+1} completed. Average Loss: {total_loss / len(data_loader)}')


Epoch [1/3], Step [0/1], Loss: 10.919146537780762
Epoch 1 completed. Average Loss: 10.919146537780762
Epoch [2/3], Step [0/1], Loss: 10.695770263671875
Epoch 2 completed. Average Loss: 10.695770263671875
Epoch [3/3], Step [0/1], Loss: 10.483538627624512
Epoch 3 completed. Average Loss: 10.483538627624512
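
To put these numbers in context, a model that assigns equal probability to every token in the vocabulary has a cross-entropy loss of ln(vocab_size). The sketch below compares the logged losses with that baseline and converts them to perplexity (the exponential of the loss). It assumes the GPT-2 vocabulary size of 50,257 reported by `tokenizer.vocab_size`; the loss values are copied from the output above.

```python
import math

vocab_size = 50257  # GPT-2 vocabulary size, as reported by tokenizer.vocab_size
uniform_loss = math.log(vocab_size)  # loss of a model that guesses uniformly

# Average losses copied from the training log above.
logged_losses = [10.919146537780762, 10.695770263671875, 10.483538627624512]

for epoch, loss in enumerate(logged_losses, start=1):
    perplexity = math.exp(loss)
    print(f"epoch {epoch}: loss={loss:.3f}  perplexity={perplexity:,.0f}  "
          f"uniform baseline={uniform_loss:.3f}")
```

The losses stay close to the uniform baseline, which is consistent with a model that has taken only three optimizer steps on a single batch.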


In [12]:
# Save the trained model
torch.save(model.state_dict(), 'simple_transformer_model.pth')

# Function to generate text
def generate_text(model, tokenizer, prompt, max_length=100):
    model.eval()
    input_ids = tokenizer(prompt, return_tensors="pt").input_ids.to(device)

    with torch.no_grad():
        for _ in range(max_length):
            outputs = model(input_ids, input_ids)
            next_token_id = torch.argmax(outputs[:, -1, :], dim=-1)
            input_ids = torch.cat([input_ids, next_token_id.unsqueeze(0)], dim=-1)

    return tokenizer.decode(input_ids[0], skip_special_tokens=True)

# Example usage
prompt = "The night was dark and full of"
generated_text = generate_text(model, tokenizer, prompt, max_length=50)
print(generated_text)


The night was dark and full of lifestyles, Integrated Suit endorsingJust Twain centristarijir Gladiator KaraMAT excruciating married Volume Remain70 immersive citiesrotein fried valuable Worm龍� botIZMetro
OTAL Tobias FIGHTImport PAGE
OTAL Tobias FIGHTImport PAGE
OTAL Tobias FIGHTImport PAGE
OTAL Tobias FIGHT


In [15]:
from google.colab import drive
drive.mount('/content/drive')



Mounted at /content/drive


In [16]:
import os

# Define the path to your context file in Google Drive
context_file_path = '/content/drive/MyDrive/context.txt'

# Function to save context to Google Drive
def save_context(context, file_path=context_file_path):
    with open(file_path, 'w') as file:
        file.write(context)

# Function to load context from Google Drive
def load_context(file_path=context_file_path):
    try:
        with open(file_path, 'r') as file:
            return file.read()
    except FileNotFoundError:
        return ""

# Example usage
context = "User prefers Python for data science projects."
save_context(context)

# Later, load the context
loaded_context = load_context()
print(loaded_context)


User prefers Python for data science projects.
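
The same save/load pattern can be tried outside Colab, without a mounted Drive, by pointing it at a temporary file. This is a sketch only: it re-declares the two helpers with an explicit `file_path` argument and UTF-8 encoding (both are assumptions, not the notebook's defaults) and checks the missing-file fallback as well as the round trip.

```python
import os
import tempfile


def save_context(context, file_path):
    # Overwrite the file with the new context string.
    with open(file_path, "w", encoding="utf-8") as f:
        f.write(context)


def load_context(file_path):
    # Return an empty string when no context has been saved yet.
    try:
        with open(file_path, "r", encoding="utf-8") as f:
            return f.read()
    except FileNotFoundError:
        return ""


with tempfile.TemporaryDirectory() as tmp_dir:
    path = os.path.join(tmp_dir, "context.txt")
    missing = load_context(path)  # file does not exist yet
    save_context("User prefers Python for data science projects.", path)
    round_trip = load_context(path)

print(repr(missing))
print(round_trip)
```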


In [17]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load the model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

def generate_response(prompt, context, max_length=100):
    full_prompt = f"{context}\n\n{prompt}"
    inputs = tokenizer(full_prompt, return_tensors='pt')
    outputs = model.generate(inputs['input_ids'], max_length=max_length, num_return_sequences=1)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
context = load_context()  # Load context from file or use predefined variable
prompt = "What can you tell me about Game of Thrones?"
response = generate_response(prompt, context)
print(response)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
The attention mask is not set and cannot be inferred from input because pad token is same as eos token. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.


User prefers Python for data science projects.

What can you tell me about Game of Thrones?

Game of Thrones is a fantasy series about a group of people who are forced to live in a fantasy world. The series is based on the novels of George R.R. Martin, and is based on the novels of George R.R. Martin, who is the author of the novels of The Winds of Winter and The Dance of Dragons.

What is your favorite episode of Game
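
The response stops mid-sentence because `max_length` in `generate()` counts the prompt tokens as well as the generated ones, so a longer context leaves less room for the answer. The arithmetic can be sketched as follows; `new_token_budget` is a hypothetical helper, the prompt length of 22 tokens is an illustrative guess rather than a measured value, and 1,024 is GPT-2's context window.

```python
def new_token_budget(prompt_len, max_length=100, context_window=1024):
    """Number of tokens left for generation once the prompt is counted."""
    limit = min(max_length, context_window)
    return max(0, limit - prompt_len)


print(new_token_budget(22))        # short prompt: most of the budget remains
print(new_token_budget(95))        # long context: only a few tokens remain
print(new_token_budget(150))       # prompt already exceeds max_length
```

`generate()` also accepts `max_new_tokens`, which limits only the generated part and avoids this dependence on prompt length.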


In [18]:
def generate_response(prompt, context, max_length=100, temperature=0.7, top_k=50):
    full_prompt = f"{context}\n\n{prompt}"
    inputs = tokenizer(full_prompt, return_tensors='pt')
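    # Note: do_sample is not set, so generate() decodes greedily and the
    # temperature and top_k arguments below have no effect on the output.
    outputs = model.generate(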
        inputs['input_ids'],
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        num_return_sequences=1
    )
    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage with adjusted parameters
prompt = "What is your favorite episode of Game of Thrones?"
response = generate_response(prompt, context)
print(response)


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


User prefers Python for data science projects.

What is your favorite episode of Game of Thrones?

I love Game of Thrones. I love the show. I love the characters. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love the show. I love
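
The repetition above is typical of greedy decoding: in `generate()`, `temperature` and `top_k` only change the result when sampling is enabled with `do_sample=True`. What those two parameters do during sampling can be sketched in plain Python as below. `sample_top_k` and `toy_logits` are hypothetical and only illustrate the technique; this is not the library's implementation.

```python
import math
import random


def sample_top_k(logits, k=50, temperature=0.7, rng=random):
    """Sample an index from `logits` using temperature scaling and top-k filtering."""
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    # Keep only the k highest-scoring indices.
    top = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    # Dividing by a temperature below 1 sharpens the distribution.
    scaled = [logits[i] / temperature for i in top]
    # Unnormalized softmax weights (subtract the max for numerical stability).
    largest = max(scaled)
    weights = [math.exp(s - largest) for s in scaled]
    return rng.choices(top, weights=weights, k=1)[0]


toy_logits = [2.0, 1.0, 0.5, -1.0, -3.0]  # hypothetical scores for a 5-token vocabulary
rng = random.Random(0)
samples = [sample_top_k(toy_logits, k=3, temperature=0.7, rng=rng) for _ in range(1000)]
print({token: samples.count(token) for token in sorted(set(samples))})
```

With `k=3`, the two lowest-scoring tokens are never chosen, and the remaining three are drawn in proportion to their rescaled probabilities instead of always picking the top one.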


In [21]:
from transformers import GPT2LMHeadModel, GPT2Tokenizer

# Load model and tokenizer
model = GPT2LMHeadModel.from_pretrained('gpt2')
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')

# Set pad token id to eos token id
model.config.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token_id = tokenizer.eos_token_id
tokenizer.pad_token = tokenizer.eos_token

def generate_response(prompt, context="", max_length=100, temperature=0.7, top_k=50):
    full_prompt = f"{context}\n\n{prompt}"
    inputs = tokenizer(full_prompt, return_tensors='pt', truncation=True, padding='longest')

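    # Generate response.
    # Note: without do_sample=True, generate() still decodes greedily, so
    # temperature and top_k are ignored.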
    outputs = model.generate(
        input_ids=inputs['input_ids'],
        attention_mask=inputs['attention_mask'],
        max_length=max_length,
        temperature=temperature,
        top_k=top_k,
        num_return_sequences=1
    )

    return tokenizer.decode(outputs[0], skip_special_tokens=True)

# Example usage
context = "User is interested in Game of Thrones episodes."
prompt = "Can you provide a detailed summary of the episode 'The Red Wedding' from Game of Thrones?"
response = generate_response(prompt, context)
print(response)


Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


User is interested in Game of Thrones episodes.

Can you provide a detailed summary of the episode 'The Red Wedding' from Game of Thrones?

Yes, we have a lot of information about the episode.

What is the most important thing you want to know about the episode?

The most important thing is that we have a lot of information about the episode.

What is the most important thing you want to know about the episode?

The most important thing
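
Loops like the one above are a common failure mode of greedy decoding. One widely used countermeasure is n-gram blocking: before choosing the next token, ban every token that would recreate an n-gram already present in the output. `generate()` exposes this idea through its `no_repeat_ngram_size` argument. The plain-Python sketch below illustrates the rule on words rather than token ids; `banned_next_tokens` is a hypothetical helper, not the library's code.

```python
def banned_next_tokens(generated, n):
    """Return the tokens that would complete an n-gram already present in `generated`."""
    if n < 1 or len(generated) < n:
        return set()
    # The last n-1 tokens form the prefix of the n-gram about to be completed.
    prefix = tuple(generated[len(generated) - (n - 1):]) if n > 1 else ()
    banned = set()
    for start in range(len(generated) - n + 1):
        ngram = tuple(generated[start:start + n])
        if ngram[:-1] == prefix:
            banned.add(ngram[-1])
    return banned


words = "I love the show . I love the".split()
print(banned_next_tokens(words, n=3))
```

Here the sequence ends in "love the", which was previously followed by "show", so "show" is banned as the next word and the loop is broken.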


In [22]:
# Define your context if any
context = "User is interested in Game of Thrones and its characters."

# Define your prompts
prompts = [
    "Can you explain the main themes of 'Game of Thrones'?",
    "Describe the character arc of Jon Snow throughout the series.",
    "What are the major events in the episode 'The Red Wedding'?"
]

# Generate and print responses for each prompt
for prompt in prompts:
    response = generate_response(prompt, context)
    print(f"Prompt: {prompt}")
    print(f"Response: {response}")
    print()

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Can you explain the main themes of 'Game of Thrones'?
Response: User is interested in Game of Thrones and its characters.

Can you explain the main themes of 'Game of Thrones'?

I think the main theme of 'Game of Thrones' is that the people who are in power are the ones who are the most powerful. The people who are the most powerful are the ones who are the most powerful. The people who are the most powerful are the ones who are the most powerful. The people who are the most powerful are the ones who are the



Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Prompt: Describe the character arc of Jon Snow throughout the series.
Response: User is interested in Game of Thrones and its characters.

Describe the character arc of Jon Snow throughout the series.

Describe the character arc of Jon Snow throughout the series. Describe the character arc of Jon Snow throughout the series. Describe the character arc of Jon Snow throughout the series. Describe the character arc of Jon Snow throughout the series. Describe the character arc of Jon Snow throughout the series. Describe the character arc of Jon Snow throughout the series. Desc

Prompt: What are the major events in the episode 'The Red Wedding'?
Response: User is interested in Game of Thrones and its characters.

What are the major events in the episode 'The Red Wedding'?

The Red Wedding is the first episode of the season. It is the first episode of the season that has been written by the writers and stars of Game of Thrones. The first episode of the season is called 'The Red Wedding'.

Wha
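
Every response above begins by repeating the context and the prompt, because `generate()` returns the input tokens followed by the continuation. A small post-processing step can drop the echoed part. The sketch below works on the decoded string; `strip_prompt` is a hypothetical helper, and it assumes that decoding reproduces the prompt text exactly.

```python
def strip_prompt(decoded, full_prompt):
    """Return only the continuation if `decoded` starts with `full_prompt`."""
    if decoded.startswith(full_prompt):
        return decoded[len(full_prompt):].lstrip()
    # Fall back to the full text when the prompt is not reproduced verbatim.
    return decoded


context = "User is interested in Game of Thrones and its characters."
prompt = "What are the major events in the episode 'The Red Wedding'?"
full_prompt = f"{context}\n\n{prompt}"
decoded = full_prompt + "\n\nThe Red Wedding is the first episode of the season."

print(strip_prompt(decoded, full_prompt))
```

An alternative that does not depend on exact string matching is to slice the generated ids by the length of the input ids before decoding.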