# HW3

In this homework, we'll learn about transformers and chatbots.

It will probably be easiest to run this on http://colab.research.google.com

## minGPT Character Language Model

First, will inspect Karpathy's [minGPT](https://github.com/karpathy/minGPT/tree/master) library to learn more about transformers.

We'll first fit a character language model using mingpt. We'll use as training data all the text of Shakespeare.

In [29]:
# clone the library
!git clone https://github.com/karpathy/minGPT.git

fatal: destination path 'minGPT' already exists and is not an empty directory.


In [30]:
# Add mingpt to your Python path, so you can import it.
import sys
sys.path.insert(0, './minGPT')
from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed
import pandas as pd
import pickle
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
set_seed(3407)

In [31]:
# download shakespeare data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-05-03 02:19:24--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.109.133, 185.199.108.133, 185.199.110.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.109.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.1’


2024-05-03 02:19:24 (20.0 MB/s) - ‘input.txt.1’ saved [1115394/1115394]



### Data loading and training code

In [62]:
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
import os
import sys

class CharDataset(Dataset):
    """
    This represents a dataset of characters.
    """
    @staticmethod
    def get_default_config():
        C = CN()
        C.block_size = 128
        return C

    def __init__(self, config, data):
        self.config = config
        self.parse_data(data)

    def parse_data(self, data):
        print('parsing char data')
        # get list of all characters
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        # map from char to int
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        # map from into to char
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = vocab_size
        self.data = data

    def get_vocab_size(self):
        return self.vocab_size

    def get_block_size(self):
        return self.config.block_size

    def __len__(self):
        return len(self.data) - self.config.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.config.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        # return as tensors
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

def get_config():

    C = CN()

    # system
    C.system = CN()
    C.system.seed = 3407
    C.system.work_dir = './out'

    # data
    C.data = CharDataset.get_default_config()

    # model
    C.model = GPT.get_default_config()
    C.model.model_type = 'gpt-micro'

    # trainer
    C.trainer = Trainer.get_default_config()
    C.trainer.learning_rate = 3e-4 # the model we're using is so small that we can go a bit faster

    return C


def train_model(config, train_dataset, sample_fn):
    """
    Train the model.
    config..........CfgNode
    train_dataset...Dataset that emits strings for training
    sample_fn.......function to call during training to show sample output.
    """
    # construct the model
    config.model.vocab_size = train_dataset.get_vocab_size()
    config.model.block_size = train_dataset.get_block_size()
    model = GPT(config.model)

    # construct the trainer object
    trainer = Trainer(config.trainer, model, train_dataset)

    # iteration callback
    def batch_end_callback(trainer):

        if trainer.iter_num % 10 == 0:
            print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

        if trainer.iter_num % 500 == 0:
            # evaluate both the train and test score
            model.eval()
            with torch.no_grad():
                # sample from the model...
                context = list(train_dataset.itos.values())
                completion = sample_fn(context, model, trainer, train_dataset, maxlen=100, temperature=1.)
                print('sample from the model:')
                print(completion)
            # save the latest model
            print("saving model")
            ckpt_path = os.path.join(config.system.work_dir, "model.pt")
            torch.save(model.state_dict(), ckpt_path)
            # revert model to training mode
            model.train()

    trainer.set_callback('on_batch_end', batch_end_callback)

    # run the optimization
    trainer.run()
    model.eval()
    return model, trainer

def configure_model(max_iters=100, block_size=128):
    config = get_config()
    config.merge_from_args(['--trainer.max_iters=%d' % max_iters,
                            '--data.block_size=%d' % block_size,
                            '--model.block_size=%d' % block_size])
    setup_logging(config)
    set_seed(config.system.seed)
    return config


def create_char_data(config):
    # construct the training dataset
    text = open('input.txt', 'r').read()
    return CharDataset(config.data, text)

def sample_from_char_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ''.join([train_dataset.itos[int(i)] for i in y])

In [48]:
# train the character model.
config = configure_model(max_iters=100, block_size=64)
train_dataset = create_char_data(config)
model, trainer = train_model(config, train_dataset, sample_from_char_model)

command line overwriting config attribute trainer.max_iters with 100
command line overwriting config attribute data.block_size with 64
command line overwriting config attribute model.block_size with 64
parsing char data
data has 1115394 characters, 65 unique.
number of parameters: 0.81M
running on device cuda


  self.pid = os.fork()


iter_dt 0.00ms; iter 0: train loss 4.20564
sample from the model:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz:grBuxyyrI
$. ZOauibZg xgZoaukZOonoiixg bmxfiGgrI3.grZo$tekm?gZmxenenIgrIgxyrrdkoiZim,rrZrrk xZeIxZg
saving model
iter_dt 24.78ms; iter 10: train loss 3.45017
iter_dt 18.85ms; iter 20: train loss 3.27123
iter_dt 18.20ms; iter 30: train loss 3.14736
iter_dt 18.46ms; iter 40: train loss 3.02178
iter_dt 18.46ms; iter 50: train loss 2.90674
iter_dt 19.23ms; iter 60: train loss 2.86134
iter_dt 17.30ms; iter 70: train loss 2.78568
iter_dt 18.48ms; iter 80: train loss 2.71221
iter_dt 19.99ms; iter 90: train loss 2.67718


In [49]:
print(sample_from_char_model("Romeo:", model, trainer, train_dataset, maxlen=10, temperature=.5))

Romeo:
Bu mithe 


**What is the `block_size` variable? Describe in detail what it does.**

You might want to consult the code for [model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py).



**```block_size``` is a variable that tells the dataset class the maximum size of the chunk to process ata time. This is important for both the dataset, and in the model itself. In the context of the model, ```block_size``` is the context window, or the maximum amount of tokens (in this case characters) the model can "see" at a time. In the context of the dataset, it uses the block size variable to grab block sized chunks of the text at a time**

**What is the relationship between `block_size` and the total number of parameters in the model?** That is, if we double `block_size`, what happens to the total number of model parameters?



Block size does not directly influence the total number of model parameters, so if we double the block size the model will have the same number of parameters.

**What is the `n_layer` parameter? Describe in detail what id does. If we double this parameter, what happens to the total number of model parameters?**

n_layer specifies how many attention blocks should be present in a given GPT model. If we double  the n_layer parameter, we double the number of attention blocks, which have 11 * n_embed parameters each, so we would double the number of parameters

**What does the temperature paramter do?** See the generate method in [model.py](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L283).

Try setting temperature to different values. What do you observe about the output?

The temperature parameter scales the output logits. If we make the temperature super low it will output super common words like "the" and "an", but if we make it really high it starts to output nonsense, which I would guess are just some of the most common characters that follow "Romeo:" (the input I used)

**What does [line 148](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L148) in model.py do? How does this relate to the transformer model?**  

It creates a list of n attention blocks where n is specified by the n_layer parameter, different transofrmer models have different numbers of attention heads

## Word Model
Now, let's fit a word model instead of a character model.

Given a string like:

> The cow     jumped over the moon. The moon is full tonight!

The `WordDataset` class below should create tokens for each space-delimited string:

> ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.', 'The', 'moon', 'is', 'full', 'tonight', '!']

Note that multiple space characters are treated as one (Hint: `re` may help here.)

Using `CharDataset::parse_data` function above as an example, complete the `parse_data` function below to set the `stoi`, `itos`, `vocab_size`, and `data` attributes of the `WordDataset` class.

In [50]:
import re

class WordDataset(CharDataset):
  def __init__(self,config,data):
    self.config = config
    self.data = self.parse_data(data)

  def parse_data(self, data):
    """
    data.....A single string representing many sentences.
    """
    ### YOUR CODE HERE
    split = re.findall(r'\w+|[,.!?;:]',data)
    tokens = [word for word in split]
    words = list(set(tokens))
    self.stoi = { ch:i for i,ch in enumerate(words) }
        # map from into to char
    self.itos = { i:ch for i,ch in enumerate(words) }
    self.vocab_size = len(words)
    return tokens


word_config = configure_model(max_iters=200, block_size=4)
word_data = WordDataset(word_config.data, 'The cow jumped over the moon. The moon is full tonight!')
word_data.data

command line overwriting config attribute trainer.max_iters with 200
command line overwriting config attribute data.block_size with 4
command line overwriting config attribute model.block_size with 4


['The',
 'cow',
 'jumped',
 'over',
 'the',
 'moon',
 '.',
 'The',
 'moon',
 'is',
 'full',
 'tonight',
 '!']

In [51]:
# we can now reuse the training code to fit the word language model.
def sample_from_word_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ' '.join([train_dataset.itos[int(i)] for i in y])

word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)

number of parameters: 0.80M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 2.39777
sample from the model:
jumped over cow full tonight . ! the is The moon ! is the is tonight tonight ! full moon the full The . moon The cow . is is full moon is the over ! The full over . is The tonight is over is full over over tonight over tonight . the ! full ! full . ! tonight cow The moon moon tonight . The The moon ! . is full is over the the moon tonight cow The jumped ! over ! ! the is full The is the The cow full . the . is full the moon is over The the over is over over
saving model
iter_dt 13.97ms; iter 10: train loss 0.79322
iter_dt 14.17ms; iter 20: train loss 0.59339
iter_dt 15.81ms; iter 30: train loss 0.47176
iter_dt 13.78ms; iter 40: train loss 0.38711
iter_dt 14.33ms; iter 50: train loss 0.35151
iter_dt 13.64ms; iter 60: train loss 0.29200
iter_dt 14.27ms; iter 70: train loss 0.24603
iter_dt 14.15ms; iter 80: train loss 0.18496
iter_dt 15.58ms; iter 90: train loss 0.16807
ite

In [52]:
sample_from_word_model(["The"], word_model, word_trainer, word_data, maxlen=50, temperature=1.)

'The cow jumped the moon . The moon is full tonight ! tonight ! tonight ! is full tonight moon is full tonight ! tonight jumped over the moon . The moon is full tonight ! ! ! ! ! ! ! tonight ! ! . The moon is full tonight'

### Wikipedia

With our word model, let's now fit a language model on the Wikipedia page for [New Orleans](https://en.wikipedia.org/wiki/New_Orleans)

First, we'll install a library to help us fetch the plain text of a wikipedia page.

In [27]:
!pip install wikipedia

Collecting wikipedia
  Downloading wikipedia-1.4.0.tar.gz (27 kB)
  Preparing metadata (setup.py) ... [?25l[?25hdone
Building wheels for collected packages: wikipedia
  Building wheel for wikipedia (setup.py) ... [?25l[?25hdone
  Created wheel for wikipedia: filename=wikipedia-1.4.0-py3-none-any.whl size=11680 sha256=620f44c243c74af5fc9a76adbf8f92393b0758484008ba333f53613ef8b37b5a
  Stored in directory: /root/.cache/pip/wheels/5e/b6/c5/93f3dec388ae76edc830cb42901bb0232504dfc0df02fc50de
Successfully built wikipedia
Installing collected packages: wikipedia
Successfully installed wikipedia-1.4.0


In [28]:
import wikipedia
wikipedia.set_lang('en')
page = wikipedia.page('New Orleans')
print(page.content[:100])

New Orleans (commonly known as NOLA or the Big Easy among other nicknames) is a consolidated city-pa


**Create new variables `word_config`, `word_data`, `word_model`, `word_trainer` that are analogous to the ones used previously. These should fit a model to the `page` text defined in the previous cell.**

In [83]:
### YOUR CODE HERE
word_config = configure_model(max_iters=1500,block_size=128)
word_config.model.n_embd = 4096
word_config.model.n_layer = 12
word_data = WordDataset(word_config.data,page.content)
word_model, word_trainer = train_model(word_config,word_data,sample_from_word_model)


command line overwriting config attribute trainer.max_iters with 1500
command line overwriting config attribute data.block_size with 128
command line overwriting config attribute model.block_size with 128
number of parameters: 1.35M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 8.38153
sample from the model:
Evidence 64 damage Filipinos 1814 ASCE diverse severely sludge Primary acts bisexual work marshland 27 best consistently past UTP Tourist Café refining Streetcars clause debate Yellow pass Line expressed employment Administration alone farthest deported effects Radiators but Greenwood Unfathomable Ferries If makeup Separate unintended ska virtually possesses houses decay wards killed website Museum Henry Puerto suburb Control subsequent contingent performed militia evacuation Valley night Gainesville Union crossed former Explorer immigrated weakness different exclusively only Vue independence carries medical surfaced collar Dialect bar concerts War Statute markets stati

In [86]:
sample_from_word_model(["A", "local", "variant", "for", "hip", "hop", "is"], word_model, word_trainer, word_data, maxlen=200, temperature=1.1)

'A local variant for hip hop is called bounce music . While not commercially successful outside of the Deep South , bounce music was immensely popular in poorer neighborhoods throughout the 1990s . A cousin of bounce , New Orleans hip hop achieved commercial success locally and internationally , producing Lil Wayne , Master P , Birdman , Birdman , Juvenile , Cash Money Records and fast form of southern rock , originated with the help of several local bands , such as The Radiators , Better Than Ezra , Cowboy Mouth and Dash Rip Rock . Throughout the 1990s , many sludge metal bands started . New Orleans heavy metal bands such as Eyehategod , Soilent Green , Crowbar , and Down incorporated styles such as hardcore punk , doom metal , and southern rock to create an increase and heady brew of swampy and aggravated metal that has largely avoided standardization . New Orleans is the southern terminus of the famed Highway 61 , made musically famous by musician Bob Dylan in his song , Highway 61 

Investigate different model settings (`block_size, max_iters, learning_rate, n_embd, n_layer`).

**What effect do you notice from trying different values? Which setting appears to generate the best generated text?**

The best parameter settings I could find were:

```n_embd = 4096```
```n_layer = 12```
```temperature = 1.1```
```learning_rate = 3e-4```
```max_iters = 1500```


The block_size parameter had a big impact on the cohesiveness of the text. If I set it too low the text would make no sense at all, but if I set it too low, the model would revert to talking about New Orleans generally rather than bounce music.

max_iters mostly effected training time and would lead to either under or over fitting depending on it's setting. The same goes for learning rate

By increasing the temperature slightly I was able to make the sentence flow a bit more smoothly and keep it more on topic.





Suppose you wanted to take the word model trained on the New Orleans Wikipedia page and use supervised fine-tuning to create a chatbot that answers questions about New Orleans.

**What type of additional training data would you need to do this?**


You would need sample questions and corresponding answers so the chatbot knows what a good response looks like.

**If this new data contains words that don't appear in the New Orleans wikipedia page, what will happen? How can you fix this?**

The model won't recognize the word and will throw a key error. To fix this you can create a case that handles unrecognized words using a defaultdict or some other method that checks if a word. You could also expand the vocabulary using a predefined vocabulary from a library like nltk.