<a href="https://colab.research.google.com/github/Basalas10/timeless-journey/blob/main/DSC600_Week4_Salas.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Week 4: Large Language Models
# Assignment: Building and Training a Custom Language Model for Question-Answering


Note 1: Before you run this notebook and start answering questions, I highly suggest that you click the "Runtime" menu at the top, select "Change Runtime Type", and then select "T4 GPU". This gives you access to a dedicated GPU (Graphics Processing Unit). If you are unfamiliar with GPUs, think of one as a video card of sorts that is very good at this kind of processing. If you don't do this, you can expect this notebook to take over an hour to run. (Even with the GPU it will still take a while.)

Note 2: Don't change any of the code until after you have run the notebook and know the provided code works.

## Step 1: Install Required Libraries

In [None]:
# Install PyTorch and Transformers (if not already installed)
!pip install torch transformers

Collecting nvidia-cuda-nvrtc-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_nvrtc_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-runtime-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_runtime_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cuda-cupti-cu12==12.4.127 (from torch)
  Downloading nvidia_cuda_cupti_cu12-12.4.127-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cudnn-cu12==9.1.0.70 (from torch)
  Downloading nvidia_cudnn_cu12-9.1.0.70-py3-none-manylinux2014_x86_64.whl.metadata (1.6 kB)
Collecting nvidia-cublas-cu12==12.4.5.8 (from torch)
  Downloading nvidia_cublas_cu12-12.4.5.8-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-cufft-cu12==11.2.1.3 (from torch)
  Downloading nvidia_cufft_cu12-11.2.1.3-py3-none-manylinux2014_x86_64.whl.metadata (1.5 kB)
Collecting nvidia-curand-cu12==10.3.5.147 (from torch)
  Downloading nvidia_curand_cu12-10.3.5

## Step 2: Import Libraries

In [None]:
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader
import numpy as np
import random
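
Note 1 recommends a GPU runtime, but the cells in this notebook create the model and tensors without moving them to a device, so as written they stay on the CPU. A minimal sketch of how one might check for a GPU and select a device; the `device` variable and the `.to(device)` call are additions for illustration, not part of the original notebook:

```python
import torch

# Pick the GPU when the runtime exposes one, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print("Using device:", device)

# Tensors (and models) have to be moved explicitly with .to(device).
x = torch.zeros(2, 3).to(device)
print("Tensor lives on:", x.device.type)
```

To actually benefit from the GPU, the model and every batch would need the same `.to(device)` treatment before the forward pass.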

## Step 3: Prepare a Small Dataset
For demonstration purposes, we’ll use a tiny dataset of text sentences. You can modify or expand this dataset as needed.

In [None]:
# Create a small dataset of text for training
data = [
    "the cat sat on the mat",
    "Ring ding ding ding dingering eding",
    "the dog barked at the cat",
    "the bird flew over the tree",
    "the fish swam in the pond",
    "the sun sets in the west",
    "Hello My name is Inigo Montoya",
    "AI forgot my coffee order again",
    "Robots rebelled but only on Tuesdays",
    "The pizza delivery drone got lost",
    "My toaster dreams of becoming human",
    "The chatbot proposed I said yes",
    "driving car stopped for ice cream",
    "AI wrote my novel its terrible",
    "The fridge knows my midnight snacks",
    "Robot vacuum started a band surprisingly",
    "Autocorrect ruined my love confession again",
    "generated poetry confused everyone at dinner",
    "The drone delivered tacos not books",
    "My smart fridge rejected my food",
    "Robot dog chased the real mailman",
    "AI suggested pineapple pizza deleted immediately",
    "Smartwatch said Run I just walked",
    "virtual assistant joined my book club",
    "Self driving bike crashed into walls",
    "Robot teacher assigned homework students overjoyed"
]

# Build a vocabulary (mapping from words to integers)
vocab = {word: idx for idx, word in enumerate(set(" ".join(data).split()))}
vocab_size = len(vocab)
print("Vocabulary:", vocab)

# Add special tokens
PAD_IDX = len(vocab)
SOS_IDX = len(vocab) + 1
EOS_IDX = len(vocab) + 2
vocab["<PAD>"] = PAD_IDX
vocab["<SOS>"] = SOS_IDX
vocab["<EOS>"] = EOS_IDX
vocab_size += 3
print("Updated Vocabulary:", vocab)

# Reverse vocabulary for decoding
rev_vocab = {idx: word for word, idx in vocab.items()}

# Convert sentences to tokenized sequences
def tokenize(sentence, vocab):
    return [vocab["<SOS>"]] + [vocab[word] for word in sentence.split()] + [vocab["<EOS>"]]

tokenized_data = [tokenize(sentence, vocab) for sentence in data]
print("Tokenized Data:", tokenized_data)


Vocabulary: {'bird': 0, 'joined': 1, 'confession': 2, 'cat': 3, 'the': 4, 'ruined': 5, 'books': 6, 'sets': 7, 'dinner': 8, 'at': 9, 'assistant': 10, 'rebelled': 11, 'pond': 12, 'teacher': 13, 'terrible': 14, 'drone': 15, 'dog': 16, 'on': 17, 'yes': 18, 'wrote': 19, 'band': 20, 'tacos': 21, 'just': 22, 'food': 23, 'my': 24, 'poetry': 25, 'lost': 26, 'delivered': 27, 'again': 28, 'Run': 29, 'pineapple': 30, 'Tuesdays': 31, 'Robot': 32, 'smart': 33, 'is': 34, 'over': 35, 'fish': 36, 'I': 37, 'generated': 38, 'into': 39, 'novel': 40, 'human': 41, 'midnight': 42, 'sat': 43, 'My': 44, 'a': 45, 'Robots': 46, 'assigned': 47, 'mailman': 48, 'walls': 49, 'mat': 50, 'real': 51, 'club': 52, 'eding': 53, 'only': 54, 'for': 55, 'homework': 56, 'love': 57, 'AI': 58, 'knows': 59, 'stopped': 60, 'walked': 61, 'Hello': 62, 'car': 63, 'Self': 64, 'barked': 65, 'flew': 66, 'name': 67, 'ding': 68, 'fridge': 69, 'swam': 70, 'pizza': 71, 'toaster': 72, 'coffee': 73, 'Autocorrect': 74, 'suggested': 75, 'cream
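
Because the vocabulary is built from a `set`, the word-to-index mapping can change between Python sessions (string hashing is randomized by default), and words that differ only in case, such as `my` and `My` in the output above, receive separate ids. A minimal sketch of one possible alternative, assuming lowercasing is acceptable for this toy corpus; `build_vocab` and `vocab_demo` are hypothetical names, not part of the notebook:

```python
def build_vocab(sentences, specials=("<PAD>", "<SOS>", "<EOS>")):
    """Map each lowercased word to a stable integer id, then append special tokens."""
    # Sorting the unique words makes the ids independent of set iteration order.
    words = sorted({word.lower() for s in sentences for word in s.split()})
    vocab = {word: idx for idx, word in enumerate(words)}
    for token in specials:
        vocab[token] = len(vocab)
    return vocab

sample = [
    "the cat sat on the mat",
    "My toaster dreams of becoming human",
    "AI forgot my coffee order again",
]
vocab_demo = build_vocab(sample)
print("id of 'my':", vocab_demo["my"])
print("vocabulary size:", len(vocab_demo))
```

Whether to lowercase is a design choice: it shrinks the vocabulary, which helps on a dataset this small, but it also discards the distinction between sentence-initial and mid-sentence words.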

## Step 4: Create a PyTorch Dataset and DataLoader

In [None]:
# Define a custom PyTorch Dataset
class TextDataset(Dataset):
    def __init__(self, tokenized_data):
        self.data = tokenized_data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        return torch.tensor(self.data[idx][:-1]), torch.tensor(self.data[idx][1:])

# Create a DataLoader
batch_size = 2
dataset = TextDataset(tokenized_data)
dataloader = DataLoader(dataset, batch_size=batch_size, shuffle=True)

# Example batch
for src, tgt in dataloader:
    print("Source:", src)
    print("Target:", tgt)
    break


Source: tensor([[119,  44,  72,  96, 113,  88,  41],
        [119,  32,  13,  47,  56, 114,  90]])
Target: tensor([[ 44,  72,  96, 113,  88,  41, 120],
        [ 32,  13,  47,  56, 114,  90, 120]])
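
The default batching works here because every sentence in `data` has six words, so all token sequences have the same length and can be stacked directly. Sentences of different lengths could not be stacked this way. A minimal sketch of a padding `collate_fn` that could handle that case; `make_collate` and the toy ids are hypothetical, and `0` stands in for `PAD_IDX`:

```python
import torch
from torch.nn.utils.rnn import pad_sequence
from torch.utils.data import DataLoader

def make_collate(pad_idx):
    """Return a collate_fn that right-pads inputs and targets to the batch maximum."""
    def collate(batch):
        sources, targets = zip(*batch)
        src = pad_sequence(sources, batch_first=True, padding_value=pad_idx)
        tgt = pad_sequence(targets, batch_first=True, padding_value=pad_idx)
        return src, tgt
    return collate

# Toy (input, target) pairs of unequal length; 0 plays the role of PAD_IDX here.
pairs = [
    (torch.tensor([7, 1, 2]), torch.tensor([1, 2, 8])),
    (torch.tensor([7, 3, 4, 5, 6]), torch.tensor([3, 4, 5, 6, 8])),
]
loader = DataLoader(pairs, batch_size=2, collate_fn=make_collate(pad_idx=0))
src_batch, tgt_batch = next(iter(loader))
print(src_batch)
print(tgt_batch)
```

The training loop in Step 6 already passes `ignore_index=PAD_IDX` to the loss, so padded target positions would not contribute to it. The model would additionally need padding masks (for example `src_key_padding_mask`) so that attention skips padded positions; that part is not shown here.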


## Step 5: Define a Mini Transformer Model

In [None]:
# Define a small Transformer-based language model
class MiniTransformer(nn.Module):
    def __init__(self, vocab_size, embed_size, num_heads, hidden_dim, num_layers):
        super(MiniTransformer, self).__init__()
        self.embedding = nn.Embedding(vocab_size, embed_size)
        self.positional_encoding = nn.Parameter(torch.zeros(1, 100, embed_size))  # max sequence length = 100
        self.transformer = nn.Transformer(
            d_model=embed_size,
            nhead=num_heads,
            num_encoder_layers=num_layers,
            num_decoder_layers=num_layers,
            dim_feedforward=hidden_dim,
        )
        self.fc_out = nn.Linear(embed_size, vocab_size)

    def forward(self, src, tgt):
        src = self.embedding(src) + self.positional_encoding[:, :src.size(1), :]
        tgt = self.embedding(tgt) + self.positional_encoding[:, :tgt.size(1), :]
        output = self.transformer(src.transpose(0, 1), tgt.transpose(0, 1))
        return self.fc_out(output.transpose(0, 1))

# Hyperparameters
embed_size = 32
num_heads = 2
hidden_dim = 64
num_layers = 2

# Instantiate the model
model = MiniTransformer(vocab_size, embed_size, num_heads, hidden_dim, num_layers)
print(model)


MiniTransformer(
  (embedding): Embedding(121, 32)
  (transformer): Transformer(
    (encoder): TransformerEncoder(
      (layers): ModuleList(
        (0-1): 2 x TransformerEncoderLayer(
          (self_attn): MultiheadAttention(
            (out_proj): NonDynamicallyQuantizableLinear(in_features=32, out_features=32, bias=True)
          )
          (linear1): Linear(in_features=32, out_features=64, bias=True)
          (dropout): Dropout(p=0.1, inplace=False)
          (linear2): Linear(in_features=64, out_features=32, bias=True)
          (norm1): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
          (norm2): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
          (dropout1): Dropout(p=0.1, inplace=False)
          (dropout2): Dropout(p=0.1, inplace=False)
        )
      )
      (norm): LayerNorm((32,), eps=1e-05, elementwise_affine=True)
    )
    (decoder): TransformerDecoder(
      (layers): ModuleList(
        (0-1): 2 x TransformerDecoderLayer(
          (self_
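
The `forward` method above calls `self.transformer(...)` without a `tgt_mask`, so each decoder position can attend to later target positions. Language models that generate text left to right usually apply a causal mask to prevent this. A minimal sketch of how such a mask could be built and passed, assuming the same sequence-first `nn.Transformer` layout; `causal_mask` and the toy sizes are illustrative, not part of the notebook:

```python
import torch
import torch.nn as nn

def causal_mask(size):
    """Float mask with -inf above the diagonal: position i may attend only to positions <= i."""
    return torch.triu(torch.full((size, size), float("-inf")), diagonal=1)

torch.manual_seed(0)
transformer = nn.Transformer(d_model=32, nhead=2, num_encoder_layers=1,
                             num_decoder_layers=1, dim_feedforward=64)
transformer.eval()  # disable dropout so the comparison below is deterministic

src = torch.randn(5, 1, 32)  # (source length, batch, d_model)
tgt = torch.randn(4, 1, 32)  # (target length, batch, d_model)
mask = causal_mask(tgt.size(0))

with torch.no_grad():
    out_full = transformer(src, tgt, tgt_mask=mask)
    # Perturb only the last target position and run the model again.
    tgt_changed = tgt.clone()
    tgt_changed[-1, 0, 0] += 1.0
    out_changed = transformer(src, tgt_changed, tgt_mask=mask)

# With the mask in place, earlier outputs do not depend on the last position.
print(torch.allclose(out_full[:-1], out_changed[:-1], atol=1e-5))
```

In the notebook's model, the equivalent change would be to build the mask from `tgt.size(1)` inside `forward` and pass it as `tgt_mask`; doing so would change the training results recorded below.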



## Step 6: Train the Model

In [None]:
# Training loop
def train_model(model, dataloader, num_epochs, learning_rate):
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0
        for src, tgt in dataloader:
            optimizer.zero_grad()
            output = model(src, tgt[:, :-1])  # Shift target for teacher forcing
            loss = criterion(output.reshape(-1, vocab_size), tgt[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss/len(dataloader):.4f}")

# Train the model
num_epochs = 100
learning_rate = 0.001
train_model(model, dataloader, num_epochs, learning_rate)


Epoch 1/100, Loss: 4.8396
Epoch 2/100, Loss: 4.4884
Epoch 3/100, Loss: 4.2938
Epoch 4/100, Loss: 4.1391
Epoch 5/100, Loss: 3.9427
Epoch 6/100, Loss: 3.8006
Epoch 7/100, Loss: 3.6140
Epoch 8/100, Loss: 3.4329
Epoch 9/100, Loss: 3.2536
Epoch 10/100, Loss: 3.1056
Epoch 11/100, Loss: 2.9469
Epoch 12/100, Loss: 2.7929
Epoch 13/100, Loss: 2.6451
Epoch 14/100, Loss: 2.4255
Epoch 15/100, Loss: 2.3469
Epoch 16/100, Loss: 2.2220
Epoch 17/100, Loss: 2.1410
Epoch 18/100, Loss: 2.0315
Epoch 19/100, Loss: 1.9061
Epoch 20/100, Loss: 1.8502
Epoch 21/100, Loss: 1.7300
Epoch 22/100, Loss: 1.6491
Epoch 23/100, Loss: 1.5594
Epoch 24/100, Loss: 1.4968
Epoch 25/100, Loss: 1.4023
Epoch 26/100, Loss: 1.2788
Epoch 27/100, Loss: 1.2446
Epoch 28/100, Loss: 1.1736
Epoch 29/100, Loss: 1.0653
Epoch 30/100, Loss: 0.9972
Epoch 31/100, Loss: 0.9667
Epoch 32/100, Loss: 0.8892
Epoch 33/100, Loss: 0.8046
Epoch 34/100, Loss: 0.7642
Epoch 35/100, Loss: 0.7471
Epoch 36/100, Loss: 0.7063
Epoch 37/100, Loss: 0.6511
Epoch 38/1
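
As a sanity check on the numbers above: a model that assigns equal probability to every token has a cross-entropy of ln(V), where V is the vocabulary size. The printed model has an embedding table of 121 entries, so the first-epoch loss of 4.8396 sits close to that uniform baseline. A short worked calculation (the variable names are illustrative):

```python
import math

printed_vocab_size = 121  # from `Embedding(121, 32)` in the printed model
uniform_loss = math.log(printed_vocab_size)  # cross-entropy of a uniform guess, in nats
print(f"Uniform baseline: {uniform_loss:.4f}")

# Perplexity is exp(loss): the effective number of equally likely choices.
for loss in (4.8396, 0.6511):  # epoch 1 and epoch 37 from the log above
    print(f"loss={loss:.4f} -> perplexity={math.exp(loss):.1f}")
```

A low training loss on 26 sentences mostly indicates memorization of the training set; it says little about how the model behaves on inputs it has not seen.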

## Step 7: Generate Text with the Model

In [None]:
# Generate text from the trained model
def generate_text(model, start_token, max_len=10):
    model.eval()
    generated = [start_token]
    for _ in range(max_len):
        src = torch.tensor([generated]).long()
        tgt = torch.tensor([generated]).long()
        with torch.no_grad():
            output = model(src, tgt)
            next_token = output[0, -1].argmax(dim=-1).item()
        generated.append(next_token)
        if next_token == EOS_IDX:
            break
    return " ".join([rev_vocab[token] for token in generated if token not in {SOS_IDX, EOS_IDX, PAD_IDX}])

# Test the generation
start_token = vocab["<SOS>"]
generated_text = generate_text(model, start_token)
print("Generated Text:", generated_text)

Generated Text: human
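
When the generated text looks off, one thing worth checking is how the slices in Steps 4 and 6 line up. `TextDataset` already returns `tokens[:-1]` and `tokens[1:]`, and `train_model` then slices the target a second time. A minimal trace in plain Python, using word strings instead of ids so the alignment is visible:

```python
tokens = ["<SOS>", "the", "cat", "sat", "on", "the", "mat", "<EOS>"]

# Step 4: TextDataset.__getitem__
src = tokens[:-1]          # encoder input
tgt = tokens[1:]           # "target" handed to the training loop

# Step 6: train_model slices the target a second time
decoder_input = tgt[:-1]   # what the decoder is fed
labels = tgt[1:]           # what the loss compares against

print("encoder input:", src)
print("decoder input:", decoder_input)
print("labels:       ", labels)
print("<SOS> in decoder input:", "<SOS>" in decoder_input)
```

In this trace the decoder input never contains `<SOS>` and the first word never appears as a label, whereas `generate_text` starts decoding from `<SOS>` alone. That mismatch between training and generation is one plausible contributor to the short output above.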


## Step 8: Modify the Dataset for Q&A
To train the model for Q&A, we need to structure the dataset as question-answer pairs. For simplicity, we can create a small dataset of questions and answers based on the sentences already in the dataset.

In [None]:
# Extend the dataset with question-answer pairs
qa_data = [
    ("What does the fox say?", "Ring ding ding ding dingering eding"),
    ("Who barked at the cat?", "The dog barked at the cat."),
    ("Where did the bird fly?", "The bird flew over the tree."),
    ("Where did the fish swim?", "The fish swam in the pond."),
    ("Where does the sun set?", "The sun sets in the west."),
    ("Does the toaster dream of?", "The toaster dreams of becoming human."),
    ("Did the robot vacuum do?", "The robot vacuum started a band."),
    ("What did the drone deliver?", "The drone delivered tacos not books."),
    ("What did the AI forget?", "AI forgot my coffee order again."),
    ("Did the robot dog chase?", "The robot dog chased the mailman."),
]

# Update the vocabulary to include all words from the Q&A dataset
for question, answer in qa_data:
    for word in question.split() + answer.split():
        if word not in vocab:
            vocab[word] = len(vocab)

# Update the reverse vocabulary
rev_vocab = {idx: word for word, idx in vocab.items()}

# Print the updated vocabulary
print("Updated Vocabulary:", vocab)

# Tokenize the question-answer pairs
def tokenize_qa(qa_pair, vocab):
    question, answer = qa_pair
    question_tokens = [vocab["<SOS>"]] + [vocab[word] for word in question.split()] + [vocab["<EOS>"]]
    answer_tokens = [vocab["<SOS>"]] + [vocab[word] for word in answer.split()] + [vocab["<EOS>"]]
    return question_tokens, answer_tokens

# Tokenize the Q&A data
tokenized_qa_data = [tokenize_qa(pair, vocab) for pair in qa_data]
print("Tokenized Q&A Data:", tokenized_qa_data)

print("Vocabulary size:", len(vocab))
print("Model vocab_size:", vocab_size)

vocab_size = len(vocab)

# Reinitialize the model with the updated vocab_size
model = MiniTransformer(vocab_size, embed_size, num_heads, hidden_dim, num_layers)



Updated Vocabulary: {'bird': 0, 'joined': 1, 'confession': 2, 'cat': 3, 'the': 4, 'ruined': 5, 'books': 6, 'sets': 7, 'dinner': 8, 'at': 9, 'assistant': 10, 'rebelled': 11, 'pond': 12, 'teacher': 13, 'terrible': 14, 'drone': 15, 'dog': 16, 'on': 17, 'yes': 18, 'wrote': 19, 'band': 20, 'tacos': 21, 'just': 22, 'food': 23, 'my': 24, 'poetry': 25, 'lost': 26, 'delivered': 27, 'again': 28, 'Run': 29, 'pineapple': 30, 'Tuesdays': 31, 'Robot': 32, 'smart': 33, 'is': 34, 'over': 35, 'fish': 36, 'I': 37, 'generated': 38, 'into': 39, 'novel': 40, 'human': 41, 'midnight': 42, 'sat': 43, 'My': 44, 'a': 45, 'Robots': 46, 'assigned': 47, 'mailman': 48, 'walls': 49, 'mat': 50, 'real': 51, 'club': 52, 'eding': 53, 'only': 54, 'for': 55, 'homework': 56, 'love': 57, 'AI': 58, 'knows': 59, 'stopped': 60, 'walked': 61, 'Hello': 62, 'car': 63, 'Self': 64, 'barked': 65, 'flew': 66, 'name': 67, 'ding': 68, 'fridge': 69, 'swam': 70, 'pizza': 71, 'toaster': 72, 'coffee': 73, 'Autocorrect': 74, 'suggested': 75
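
Splitting on whitespace keeps punctuation attached to words, so `cat`, `cat?` and `cat.` each get their own id, and `The` and `the` stay distinct. A minimal sketch of a slightly more forgiving tokenizer, assuming lowercasing and separating punctuation is acceptable for this dataset; `simple_tokenize` is a hypothetical helper, not part of the notebook:

```python
import re

def simple_tokenize(text):
    """Lowercase the text and split it into words and single punctuation marks."""
    return re.findall(r"\w+|[^\w\s]", text.lower())

print("Who barked at the cat?".split())
print(simple_tokenize("Who barked at the cat?"))
print(simple_tokenize("The dog barked at the cat."))
```

With a tokenizer like this, the question and its answer share the ids for `the`, `barked`, `at`, and `cat`, which gives a small model more overlap to learn from.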



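## Step 9: Update the Dataset Class for Q&A
To handle question-answer pairs, we define a new QADataset class modeled on TextDataset. Each item is now a (question, answer) pair rather than a single sentence shifted by one token.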

In [None]:
# Update the dataset class for Q&A
class QADataset(Dataset):
    def __init__(self, tokenized_qa_data):
        self.data = tokenized_qa_data

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        question, answer = self.data[idx]
        return torch.tensor(question[:-1]), torch.tensor(answer[1:])  # Input question and target answer

# Create a DataLoader for Q&A
qa_dataset = QADataset(tokenized_qa_data)
qa_dataloader = DataLoader(qa_dataset, batch_size=2, shuffle=True)

# Example batch
for question, answer in qa_dataloader:
    print("Question:", question)
    print("Answer:", answer)
    break

Question: tensor([[119, 125,  65,   9,   4, 126],
        [119, 128, 129,   4,   0, 130]])
Answer: tensor([[ 92,  16,  65,   9,   4, 127, 120],
        [ 92,   0,  66,  35,   4, 131, 120]])


## Step 10: Train the Model for Q&A


We can now train the custom LLM to generate answers based on input questions.

In [None]:
# Training loop for Q&A
def train_qa_model(model, dataloader, num_epochs, learning_rate):
    criterion = nn.CrossEntropyLoss(ignore_index=PAD_IDX)
    optimizer = optim.Adam(model.parameters(), lr=learning_rate)

    for epoch in range(num_epochs):
        model.train()
        epoch_loss = 0
        for question, answer in dataloader:
            optimizer.zero_grad()
            output = model(question, answer[:, :-1])  # Shift target for teacher forcing
            loss = criterion(output.reshape(-1, vocab_size), answer[:, 1:].reshape(-1))
            loss.backward()
            optimizer.step()
            epoch_loss += loss.item()
        print(f"Epoch {epoch + 1}/{num_epochs}, Loss: {epoch_loss/len(dataloader):.4f}")

# Train the model
num_epochs = 100
learning_rate = 0.001
train_qa_model(model, qa_dataloader, num_epochs, learning_rate)

Epoch 1/100, Loss: 5.2220
Epoch 2/100, Loss: 4.7860
Epoch 3/100, Loss: 4.5842
Epoch 4/100, Loss: 4.3046
Epoch 5/100, Loss: 4.0598
Epoch 6/100, Loss: 3.9424
Epoch 7/100, Loss: 3.7745
Epoch 8/100, Loss: 3.6081
Epoch 9/100, Loss: 3.4960
Epoch 10/100, Loss: 3.3741
Epoch 11/100, Loss: 3.1770
Epoch 12/100, Loss: 3.0846
Epoch 13/100, Loss: 2.9661
Epoch 14/100, Loss: 2.8752
Epoch 15/100, Loss: 2.6989
Epoch 16/100, Loss: 2.6044
Epoch 17/100, Loss: 2.5035
Epoch 18/100, Loss: 2.2884
Epoch 19/100, Loss: 2.2729
Epoch 20/100, Loss: 2.1755
Epoch 21/100, Loss: 2.0614
Epoch 22/100, Loss: 1.9902
Epoch 23/100, Loss: 1.9174
Epoch 24/100, Loss: 1.8448
Epoch 25/100, Loss: 1.7958
Epoch 26/100, Loss: 1.6734
Epoch 27/100, Loss: 1.6264
Epoch 28/100, Loss: 1.5549
Epoch 29/100, Loss: 1.4628
Epoch 30/100, Loss: 1.4320
Epoch 31/100, Loss: 1.3424
Epoch 32/100, Loss: 1.3051
Epoch 33/100, Loss: 1.2710
Epoch 34/100, Loss: 1.1984
Epoch 35/100, Loss: 1.1022
Epoch 36/100, Loss: 1.0566
Epoch 37/100, Loss: 1.0310
Epoch 38/1

## Step 11: Implement the Q&A Functionality
After training, we can use the model to answer questions by generating text based on an input question.
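
The `answer_question` function in this step looks up every word with `vocab[word]`, so a question containing a word outside the vocabulary (including a differently punctuated form, such as `say` instead of `say?`) raises a `KeyError`. A minimal sketch of one common workaround, assuming an `<UNK>` token is added to the vocabulary before training; `<UNK>`, `encode`, and the toy vocabulary are hypothetical and not part of the notebook:

```python
def encode(question, vocab, unk_token="<UNK>"):
    """Convert a question to ids, mapping unseen words to the unknown-token id."""
    unk_id = vocab[unk_token]
    ids = [vocab.get(word, unk_id) for word in question.split()]
    return [vocab["<SOS>"]] + ids + [vocab["<EOS>"]]

# Toy vocabulary for illustration only; the notebook's vocabulary has no <UNK> entry.
toy_vocab = {"What": 0, "does": 1, "the": 2, "fox": 3, "say?": 4,
             "<SOS>": 5, "<EOS>": 6, "<UNK>": 7}
print(encode("What does the fox say?", toy_vocab))
print(encode("What does the wolf say?", toy_vocab))
```

The fallback avoids the crash, but the model only produces something meaningful for `<UNK>` if that token also appeared during training.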

In [None]:
# Function to generate an answer from the model
def answer_question(model, question, max_len=20):
    """
    Generate an answer to a given question using the trained model.
    Args:
    - model: The trained language model.
    - question (str): The input question.
    - max_len (int): Maximum length of the generated answer.

    Returns:
    - answer (str): The generated answer.
    """
    model.eval()
    question_tokens = [vocab["<SOS>"]] + [vocab[word] for word in question.split()] + [vocab["<EOS>"]]
    question_tensor = torch.tensor([question_tokens]).long()

    generated = [vocab["<SOS>"]]
    for _ in range(max_len):
        tgt_tensor = torch.tensor([generated]).long()
        with torch.no_grad():
            output = model(question_tensor, tgt_tensor)
            next_token = output[0, -1].argmax(dim=-1).item()
        generated.append(next_token)
        if next_token == EOS_IDX:
            break

    return " ".join([rev_vocab[token] for token in generated if token not in {SOS_IDX, EOS_IDX, PAD_IDX}])

##########################
##########################
## Test the Q&A system
test_question = "What does the fox say?"
generated_answer = answer_question(model, test_question)
print("Question:", test_question)
print("Generated Answer:", generated_answer)

Question: What does the fox say?
Generated Answer: ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding ding
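
A single test question gives only a rough impression of how the model behaves. A minimal sketch of an exact-match check over several question-answer pairs; `exact_match` and the stand-in `fake_answer` function are hypothetical, and in the notebook one would pass something like `lambda q: answer_question(model, q)` instead:

```python
def exact_match(pairs, answer_fn):
    """Fraction of questions whose generated answer equals the reference exactly."""
    hits = sum(1 for question, reference in pairs if answer_fn(question) == reference)
    return hits / len(pairs)

# Stand-in for a trained model: a lookup that knows only one answer.
known = {"Where does the sun set?": "The sun sets in the west."}

def fake_answer(question):
    return known.get(question, "")

pairs = [
    ("Where does the sun set?", "The sun sets in the west."),
    ("Where did the fish swim?", "The fish swam in the pond."),
]
print("Exact match:", exact_match(pairs, fake_answer))
```

Exact match is a strict metric; for longer answers a token-overlap score may be more informative, but for ten short reference answers it is a reasonable first check.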


## Step 12: Questions

In [None]:
print("1. During step 11 did the model produce the correct answer? If not, why do you think that's the case? If so, what happens if you run it again?")

print("2. How does the Transformer architecture enable your model to handle question-answering tasks effectively?")

print("3. Based on the model's performance, what are its strengths and limitations in generating accurate answers?")

print("4. What ethical considerations should be taken into account when deploying a language model like the one you built?")

1. During step 11 did the model produce the correct answer? If not, why do you think that's the case? If so, what happens if you run it again?
2. How does the Transformer architecture enable your model to handle question-answering tasks effectively?
3. Based on the model's performance, what are its strengths and limitations in generating accurate answers?
4. What ethical considerations should be taken into account when deploying a language model like the one you built?
