# HW3

In this homework, we'll learn about transformers and chatbots.

It will probably be easiest to run this on http://colab.research.google.com

## minGPT Character Language Model

First, we'll inspect Karpathy's [minGPT](https://github.com/karpathy/minGPT/tree/master) library to learn more about transformers.

We'll first fit a character language model using minGPT. As training data, we'll use the text of Shakespeare (the tinyshakespeare dataset).

In [218]:
# clone the library
!git clone https://github.com/karpathy/minGPT.git

fatal: destination path 'minGPT' already exists and is not an empty directory.


In [219]:
# Add mingpt to your Python path, so you can import it.
import sys
sys.path.insert(0, './minGPT')
from mingpt.model import GPT
from mingpt.trainer import Trainer
from mingpt.utils import set_seed
import pandas as pd
import pickle
import torch
from torch.utils.data import Dataset
from torch.utils.data.dataloader import DataLoader
set_seed(3407)

In [220]:
# download shakespeare data
!wget https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt

--2024-05-04 01:42:18--  https://raw.githubusercontent.com/karpathy/char-rnn/master/data/tinyshakespeare/input.txt
Resolving raw.githubusercontent.com (raw.githubusercontent.com)... 185.199.110.133, 185.199.109.133, 185.199.108.133, ...
Connecting to raw.githubusercontent.com (raw.githubusercontent.com)|185.199.110.133|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 1115394 (1.1M) [text/plain]
Saving to: ‘input.txt.4’


2024-05-04 01:42:18 (31.5 MB/s) - ‘input.txt.4’ saved [1115394/1115394]



### Data loading and training code

In [221]:
from mingpt.utils import set_seed, setup_logging, CfgNode as CN
import os
import sys

class CharDataset(Dataset):
    """
    This represents a dataset of characters.
    """
    @staticmethod
    def get_default_config():
        C = CN()
        C.block_size = 128
        return C

    def __init__(self, config, data):
        self.config = config
        self.parse_data(data)

    def parse_data(self, data):
        print('parsing char data')
        # get list of all characters
        chars = sorted(list(set(data)))
        data_size, vocab_size = len(data), len(chars)
        print('data has %d characters, %d unique.' % (data_size, vocab_size))
        # map from char to int
        self.stoi = { ch:i for i,ch in enumerate(chars) }
        # map from int to char
        self.itos = { i:ch for i,ch in enumerate(chars) }
        self.vocab_size = vocab_size
        self.data = data

    def get_vocab_size(self):
        return self.vocab_size

    def get_block_size(self):
        return self.config.block_size

    def __len__(self):
        return len(self.data) - self.config.block_size

    def __getitem__(self, idx):
        # grab a chunk of (block_size + 1) characters from the data
        chunk = self.data[idx:idx + self.config.block_size + 1]
        # encode every character to an integer
        dix = [self.stoi[s] for s in chunk]
        # return as tensors
        x = torch.tensor(dix[:-1], dtype=torch.long)
        y = torch.tensor(dix[1:], dtype=torch.long)
        return x, y

def get_config():

    C = CN()

    # system
    C.system = CN()
    C.system.seed = 3407
    C.system.work_dir = './out'

    # data
    C.data = CharDataset.get_default_config()

    # model
    C.model = GPT.get_default_config()
    C.model.model_type = 'gpt-micro'

    # trainer
    C.trainer = Trainer.get_default_config()
    C.trainer.learning_rate = 5e-4 # the model we're using is so small that we can go a bit faster

    return C


def train_model(config, train_dataset, sample_fn):
    """
    Train the model.
    config..........CfgNode
    train_dataset...Dataset that emits strings for training
    sample_fn.......function to call during training to show sample output.
    """
    # construct the model
    config.model.vocab_size = train_dataset.get_vocab_size()
    config.model.block_size = train_dataset.get_block_size()
    model = GPT(config.model)

    # construct the trainer object
    trainer = Trainer(config.trainer, model, train_dataset)

    # iteration callback
    def batch_end_callback(trainer):

        if trainer.iter_num % 10 == 0:
            print(f"iter_dt {trainer.iter_dt * 1000:.2f}ms; iter {trainer.iter_num}: train loss {trainer.loss.item():.5f}")

        if trainer.iter_num % 500 == 0:
            # switch to eval mode before sampling
            model.eval()
            with torch.no_grad():
                # sample from the model...
                context = list(train_dataset.itos.values())
                completion = sample_fn(context, model, trainer, train_dataset, maxlen=100, temperature=1.)
                print('sample from the model:')
                print(completion)
            # save the latest model
            print("saving model")
            ckpt_path = os.path.join(config.system.work_dir, "model.pt")
            torch.save(model.state_dict(), ckpt_path)
            # revert model to training mode
            model.train()

    trainer.set_callback('on_batch_end', batch_end_callback)

    # run the optimization
    trainer.run()
    model.eval()
    return model, trainer

def configure_model(max_iters=100, block_size=128):
    config = get_config()
    config.merge_from_args(['--trainer.max_iters=%d' % max_iters,
                            '--data.block_size=%d' % block_size,
                            '--model.block_size=%d' % block_size])
    setup_logging(config)
    set_seed(config.system.seed)
    return config


def create_char_data(config):
    # construct the training dataset
    with open('input.txt', 'r') as f:
        text = f.read()
    return CharDataset(config.data, text)

def sample_from_char_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ''.join([train_dataset.itos[int(i)] for i in y])

In [222]:
# train the character model.
config = configure_model(max_iters=100, block_size=64)
train_dataset = create_char_data(config)
model, trainer = train_model(config, train_dataset, sample_from_char_model)

command line overwriting config attribute trainer.max_iters with 100
command line overwriting config attribute data.block_size with 64
command line overwriting config attribute model.block_size with 64
parsing char data
data has 1115394 characters, 65 unique.
number of parameters: 0.81M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 4.20564
sample from the model:

 !$&',-.3:;?ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyzogrku shgfZgh ZfbumiZg bgZorukZomiZhitg bm f bgr s s anekmmymkZmxenenm tmumyre
 iiZim,hha
rkdgbe m
g
saving model
iter_dt 20.61ms; iter 10: train loss 3.22833
iter_dt 15.81ms; iter 20: train loss 2.97854
iter_dt 18.45ms; iter 30: train loss 2.83429
iter_dt 20.84ms; iter 40: train loss 2.72577
iter_dt 19.14ms; iter 50: train loss 2.63684
iter_dt 22.19ms; iter 60: train loss 2.62312
iter_dt 20.17ms; iter 70: train loss 2.55452
iter_dt 17.72ms; iter 80: train loss 2.52026
iter_dt 17.74ms; iter 90: train loss 2.50115


In [223]:
print(sample_from_char_model("where are ", model, trainer, train_dataset, maxlen=10, temperature=3))

where are anthith,
B


**What is the `block_size` variable? Describe in detail what it does.**

You might want to consult the code for [model.py](https://github.com/karpathy/minGPT/blob/master/mingpt/model.py).



The `block_size` variable sets the maximum context length: the number of preceding tokens (here, characters) the model can attend to when predicting the next token. Each input token is mapped to a token embedding, which is added to a positional embedding for its position in the block, and attention operates over these positions. During generation, inputs longer than `block_size` are cropped to the most recent `block_size` tokens.

`block_size` also shapes the training examples. In `CharDataset.__getitem__`, the line `chunk = self.data[idx:idx + self.config.block_size + 1]` takes `block_size + 1` characters. The first `block_size` characters form the input `x`, and the same sequence shifted by one position forms the target `y`, so each position in `x` is paired with the character that follows it. This is repeated for every starting index in the Shakespeare text. A larger `block_size` lets the model capture longer-range dependencies, but it is more computationally expensive.
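The input/target shift described above can be illustrated with a minimal sketch. The helper name `make_example` and the sample text are hypothetical; the slicing mirrors `CharDataset.__getitem__` but skips the integer encoding.

```python
def make_example(data, idx, block_size):
    """Return (x, y) character lists for one training example."""
    # grab block_size + 1 characters, as CharDataset.__getitem__ does
    chunk = data[idx:idx + block_size + 1]
    x = list(chunk[:-1])  # inputs: the first block_size characters
    y = list(chunk[1:])   # targets: the same characters shifted by one
    return x, y

text = "To be, or not"  # hypothetical sample text
x, y = make_example(text, 0, block_size=4)
print(x)  # ['T', 'o', ' ', 'b']
print(y)  # ['o', ' ', 'b', 'e']
```

Each `y[i]` is the character that follows `x[i]`, so one chunk provides `block_size` next-character predictions.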


**What is the relationship between `block_size` and the total number of parameters in the model?** That is, if we double `block_size`, what happens to the total number of model parameters?

The `block_size` variable sets the length of the sequence that the model attends to when predicting the output token. Each token in that sequence gets a token embedding and a position embedding, so increasing `block_size` increases the number of tokens the model has to attend to.

The vocabulary size stays the same, so the token embedding table does not change, and the weights inside the transformer blocks (attention and MLP) do not depend on the sequence length. The position embedding table, however, has one row per position, so it holds `block_size * n_embd` parameters. Doubling `block_size` therefore doubles only the position embedding table, adding `block_size * n_embd` parameters, which is a small fraction of the total for this model. The causal attention mask also grows with `block_size`, but it is a buffer rather than a learned parameter.

**What is the `n_layer` parameter? Describe in detail what it does. If we double this parameter, what happens to the total number of model parameters?**

The `n_layer` parameter sets the number of transformer blocks stacked in the model. Each block contains a self-attention layer and an MLP, so increasing `n_layer` makes the model deeper: the input passes sequentially through more blocks. This may help the model learn more patterns in the data, but it is more computationally expensive.

In regard to parameters, each transformer block has its own set of weights. Doubling `n_layer` therefore doubles the number of block parameters. Since the embeddings are unchanged and the blocks account for most of the parameters here, the total roughly doubles.
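The two parameter-count answers above can be checked with a rough sketch in plain Python. The layer layout (fused QKV projection, 4x MLP expansion, biases everywhere, two LayerNorms per block plus a final one) and the `gpt-micro` defaults `n_layer=4`, `n_embd=128` are assumptions based on a reading of `model.py`; the function `count_params` is hypothetical and, like minGPT's printed count, leaves out the output head.

```python
def count_params(vocab_size, block_size, n_layer=4, n_embd=128):
    """Count parameters of a minGPT-style transformer (output head excluded)."""
    wte = vocab_size * n_embd   # token embedding table
    wpe = block_size * n_embd   # position embedding table
    layer_norm = 2 * n_embd     # weight + bias
    # attention: fused q/k/v projection, then output projection (with biases)
    attn = (n_embd * 3 * n_embd + 3 * n_embd) + (n_embd * n_embd + n_embd)
    # MLP: expand to 4 * n_embd, then project back (with biases)
    mlp = (n_embd * 4 * n_embd + 4 * n_embd) + (4 * n_embd * n_embd + n_embd)
    block = 2 * layer_norm + attn + mlp
    return wte + wpe + n_layer * block + layer_norm  # + final LayerNorm

base = count_params(vocab_size=65, block_size=64)
print(base)                                           # 809856
print(count_params(65, 128) - base)                   # 8192
print(count_params(65, 64, n_layer=8) / base)         # a little under 2
```

Under these assumptions the base count matches the `number of parameters: 0.81M` line in the training log, doubling `block_size` adds only `64 * 128 = 8192` parameters, and doubling `n_layer` nearly doubles the total.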

**What does the temperature parameter do?** See the generate method in [model.py](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L283).

Try setting temperature to different values. What do you observe about the output?

The relevant line in `generate()` is `logits = logits[:, -1, :] / temperature`. The logits for the next token are divided by the temperature before the softmax turns them into probabilities. When the temperature is below 1, the differences between logits are amplified, so the distribution sharpens and high-probability tokens are chosen almost exclusively. When the temperature is above 1, the distribution flattens, so tokens that initially had a low probability are chosen more often.

With a temperature of 0.1 and the input "where are ", the output is "where are the the he"; the word "the" is repeated.

With a temperature of 3 and the input "where are ", the output is "where are w omelll w", which is less coherent.
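The effect of temperature on the next-token distribution can be seen in a small standalone sketch. The logits are hypothetical values chosen for illustration, and `softmax_with_temperature` is a helper written for this example rather than a minGPT function.

```python
import math

def softmax_with_temperature(logits, temperature):
    """Divide logits by the temperature, then apply a numerically stable softmax."""
    scaled = [z / temperature for z in logits]
    m = max(scaled)  # subtract the max to avoid overflow in exp
    exps = [math.exp(z - m) for z in scaled]
    total = sum(exps)
    return [e / total for e in exps]

logits = [2.0, 1.0, 0.0]  # hypothetical logits for three tokens
for t in (0.1, 1.0, 3.0):
    probs = softmax_with_temperature(logits, t)
    print(t, [round(p, 3) for p in probs])
```

At a low temperature nearly all the probability mass lands on the top token, which is consistent with the repetitive "the the" output; at a high temperature the three probabilities move closer together, so sampling picks unlikely tokens more often.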

**What does [line 148](https://github.com/karpathy/minGPT/blob/37baab71b9abea1b76ab957409a1cc2fbfba8a26/mingpt/model.py#L148) in model.py do? How does this relate to the transformer model?**  

Line 148 is `h = nn.ModuleList([Block(config) for _ in range(config.n_layer)]),`

A `Block` (one transformer block) is created from the given config, once for each of the `n_layer` layers. The resulting list of blocks is wrapped in a `ModuleList` so that PyTorch registers each block's parameters with the model. In relation to the transformer model, this line creates the stack of transformer blocks that the input sequence passes through in order; each block can capture different aspects of the input, which leads to a richer representation.


## Word Model
Now, let's fit a word model instead of a character model.

Given a string like:

> The cow     jumped over the moon. The moon is full tonight!

The `WordDataset` class below should create tokens for each space-delimited string:

> ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.', 'The', 'moon', 'is', 'full', 'tonight', '!']

Note that multiple space characters are treated as one (Hint: `re` may help here.)

Using `CharDataset::parse_data` function above as an example, complete the `parse_data` function below to set the `stoi`, `itos`, `vocab_size`, and `data` attributes of the `WordDataset` class.

In [224]:
import re

class WordDataset(CharDataset):
  def parse_data(self, data):

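    # split on runs of whitespace, so multiple spaces count as one delimiter
    # (punctuation stays attached to the word, e.g. 'moon.')
    words = re.split(r'\s+', data.strip())

    # sorted list of unique tokens
    unique_words = sorted(set(words))

    self.vocab_size = len(unique_words)

    # map from word to int, and from int to word
    self.stoi = { word: i for i, word in enumerate(unique_words) }
    self.itos = { i: word for i, word in enumerate(unique_words) }
    self.data = words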

word_config = configure_model(max_iters=200, block_size=4)
word_data = WordDataset(word_config.data, 'The cow jumped over the moon. The moon is full tonight!')
word_data.data

command line overwriting config attribute trainer.max_iters with 200
command line overwriting config attribute data.block_size with 4
command line overwriting config attribute model.block_size with 4


['The',
 'cow',
 'jumped',
 'over',
 'the',
 'moon.',
 'The',
 'moon',
 'is',
 'full',
 'tonight!']
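The output above keeps punctuation attached to words (`'moon.'`, `'tonight!'`), whereas the target token list in the prompt separates it. One possible way to match the target, sketched here with a hypothetical `tokenize` helper, is to use `re.findall` with a pattern that matches either a run of word characters or a single punctuation character.

```python
import re

def tokenize(text):
    """Split text into word tokens and single-character punctuation tokens."""
    # \w+ matches a run of word characters; [^\w\s] matches one punctuation mark
    return re.findall(r"\w+|[^\w\s]", text)

tokens = tokenize("The cow     jumped over the moon. The moon is full tonight!")
print(tokens)
# ['The', 'cow', 'jumped', 'over', 'the', 'moon', '.', 'The', 'moon', 'is', 'full', 'tonight', '!']
```

A trade-off of this pattern is that words containing apostrophes or hyphens are also split into several tokens, so it may need adjusting for real text.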

In [225]:
def sample_from_word_model(context, model, trainer, train_dataset, maxlen=500, temperature=1.):
    x = torch.tensor([train_dataset.stoi[s] for s in context], dtype=torch.long)[None,...].to(trainer.device)
    y = model.generate(x, maxlen, temperature=temperature, do_sample=True, top_k=10)[0]
    return ' '.join([train_dataset.itos[int(i)] for i in y])

word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)

number of parameters: 0.80M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 2.28699
sample from the model:
The cow full is jumped moon moon. over the tonight! moon. the over the jumped moon. moon. over The over is tonight! The moon. jumped full The the moon. The moon full over cow The the is full moon moon. jumped jumped the moon. the moon. moon. cow jumped cow jumped moon over moon. is The is full moon. moon full The full tonight! the moon moon. over moon is moon The full the moon. over over the jumped full tonight! The moon. The moon moon. The the The tonight! the over the full is moon The moon The moon. over The the the tonight! over the the moon. moon.
saving model
iter_dt 14.02ms; iter 10: train loss 0.46275
iter_dt 14.16ms; iter 20: train loss 0.28833
iter_dt 14.97ms; iter 30: train loss 0.19077
iter_dt 14.39ms; iter 40: train loss 0.14534
iter_dt 15.27ms; iter 50: train loss 0.10207
iter_dt 14.16ms; iter 60: train loss 0.10739
iter_dt 14.17ms; iter 70: train loss 0.069

In [226]:
sample_from_word_model(["The"], word_model, word_trainer, word_data, maxlen=50, temperature=1.)

'The moon is full tonight! jumped over the moon. The moon is full tonight! jumped over the moon. The moon is full tonight! moon. The moon is full tonight! jumped over the moon. The moon is full tonight! moon. The moon is full tonight! tonight! jumped over the moon. The moon'

### Wikipedia

With our word model, let's now fit a language model on the Wikipedia page for [New Orleans](https://en.wikipedia.org/wiki/New_Orleans).

First, we'll install a library to help us fetch the plain text of a Wikipedia page.

In [227]:
!pip install wikipedia



In [228]:
import wikipedia
wikipedia.set_lang('en')
page = wikipedia.page('New Orleans')
print(page.content[:100])

New Orleans (commonly known as NOLA or the Big Easy among other nicknames) is a consolidated city-pa


**Create new variables `word_config`, `word_data`, `word_model`, `word_trainer` that are analogous to the ones used previously. These should fit a model to the `page` text defined in the previous cell.**

In [229]:
### YOUR CODE HERE
word_config = configure_model(max_iters=200, block_size=4)
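# NOTE: get_config() sets model_type = 'gpt-micro'; minGPT appears to overwrite
# n_layer/n_embd with that preset, so these two values likely have no effect
# (the log below reports only 1.48M parameters).
word_config.model.n_embd = 4000
word_config.model.n_layer = 10000
# NOTE: the Trainer reads its learning rate from word_config.trainer, so
# setting it on word_config.model likely does not change training.
word_config.model.learning_rate = 1e-1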
word_data = WordDataset(word_config.data, page.content)
word_model, word_trainer = train_model(word_config, word_data, sample_from_word_model)


command line overwriting config attribute trainer.max_iters with 200
command line overwriting config attribute data.block_size with 4
command line overwriting config attribute model.block_size with 4
number of parameters: 1.48M
running on device cuda
iter_dt 0.00ms; iter 0: train loss 8.65115
sample from the model:
"51%... "America's "American "Beast" "Broadway "Chapel "Chep" "Corps "Downtown" "Fat "Fifth "French "Highway "Hollywood "Jazz "Murder "New "North" "Red "South" "Spoons" "The "Twelfth "Vic" "Where "White "Witch "ben-yays"), "downriver "downtown" "funerals "generally "grandfather "had "highly "jazz "marsh" "most "no "one-fourth "per "plying "primary "r", "separate "the "totally "upriver "uptown" "wild "worst-performing" #21 $1.5 $36 $5.5 $7 & 'n' ("RTA"). ("Voodoo ("West (0.30 (1,500 (1,590 (1.7% (10 (1060) (11 (12.4 (15 (169 (1810–1892), (1946–1961) (1954). (1961–1970), (1978, (1986–1994) (1994–2002). (2 (2.5–5.1 (2004) (2005 (2009). (21.7 (28.9 (32 (39 (4,800 (40 (440 (450,0

In [230]:
sample_from_word_model(["A", "local", "variant", "for", "hip", "hop", "is"], word_model, word_trainer, word_data, maxlen=200, temperature=1.)

"A local variant for hip hop is by the city in New Orleans The Mississippi Orleans in the city's city and its the United of other of the city and a complex and the city was the New Orleans in the city and the city in the French Mississippi Orleans was the city's city to the Mississippi Orleans and New Orleans. The New Orleans and the French city and New Orleans in the New Orleans and the city to the city The city and Orleans in the city in the city to the city is its a and the city to the Mississippi Orleans and the city in the Louisiana In New Orleans to the United American city and the city and the French New Orleans a and American city's city and the city to the New in New Orleans was the New Orleans is the city and New Orleans is The city's Louisiana and the U.S. Mississippi Orleans and the city and the U.S. city's city is The New as the U.S. Mississippi Orleans' and the city to the city of other in the U.S. New Orleans was the French city to the United city's city and the Louisian

Investigate different model settings (`block_size, max_iters, learning_rate, n_embd, n_layer`).

**What effect do you notice from trying different values? Which setting appears to generate the best generated text?**


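Changing `block_size`: values of 1, 4, 10, and 12 did not show much difference; the text was still not coherent.
Changing `max_iters`: values of 10, 100, 200, and 1000 showed some difference. With 10 iterations the text was not coherent at all, and it improved as the number of iterations increased, but there was not much difference between 200 and 1000.
Changing `learning_rate`: values of 1e-10, 1e-7, 1e-3, and 1e-1 did not show much difference.
Changing `n_embd`: values of 1000, 4000, 7000, and 10000 did not show much difference.
Changing `n_layer`: values of 1000, 5000, 8000, and 10000 also did not show much difference.

The lack of effect from `n_embd`, `n_layer`, and `learning_rate` is probably because these settings never reached the model. The training log above reports only 1.48M parameters even with `n_embd = 4000` and `n_layer = 10000`, which suggests the `gpt-micro` preset overrode them, and the learning rate was set on `word_config.model` rather than on `word_config.trainer`.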


Suppose you wanted to take the word model trained on the New Orleans Wikipedia page and use supervised fine-tuning to create a chatbot that answers questions about New Orleans.

**What type of additional training data would you need to do this?**

Provide example data below.


Wikipedia is a good source for general and historical information. It would also help to have practical information from locals, for example from blogs such as https://uniquenola.com/blog/tour/being-a-new-orleans-local-walking-tour/

For supervised fine-tuning, the data would ideally be formatted as question-answer pairs, so the model learns to produce an answer when given a question.
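As an illustration of what such question-answer data might look like, here is a small sketch. The pairs are hypothetical examples based on the nicknames in the page excerpt printed earlier, and the `<question>`, `<answer>`, and `<end>` markers are an assumed format rather than anything minGPT requires.

```python
# Hypothetical question-answer pairs for supervised fine-tuning.
qa_pairs = [
    ("What are some nicknames for New Orleans?", "NOLA and the Big Easy."),
    ("What is the Big Easy?", "A nickname for New Orleans."),
]

def format_example(question, answer):
    """Join one pair into a single training string with assumed marker tokens."""
    return f"<question> {question} <answer> {answer} <end>"

training_text = "\n".join(format_example(q, a) for q, a in qa_pairs)
print(training_text)
```

The markers give the model a consistent signal for where a question stops and an answer begins; at inference time the prompt would end with `<answer>` so the model continues with the answer.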



**If this new data contains words that don't appear in the New Orleans Wikipedia page, what will happen? How can you fix this?**

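Out-of-vocabulary (OOV) words would be a problem. With the word model above, any word missing from `stoi` cannot be converted to a token ID, so the lookup `train_dataset.stoi[s]` would raise a `KeyError`. Even with a workaround, the model has no trained embedding for such words and would be less confident, which would affect the generated output. One way to fix this is to expand the vocabulary with the words from the fine-tuning data before training.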
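Two common ways of handling OOV words can be sketched in plain Python: falling back to an unknown token, and extending the vocabulary while keeping existing IDs stable. The helper names and the `<unk>` token are assumptions for illustration and are not part of `WordDataset`.

```python
def build_vocab(words, unk_token="<unk>"):
    """Map each unique word to an ID, reserving ID 0 for unknown words."""
    vocab = [unk_token] + sorted(set(words))
    return {w: i for i, w in enumerate(vocab)}

def encode(words, stoi, unk_token="<unk>"):
    """Convert words to IDs, using the unknown-token ID for OOV words."""
    unk_id = stoi[unk_token]
    return [stoi.get(w, unk_id) for w in words]

def extend_vocab(stoi, new_words):
    """Return a copy of stoi with new words appended; existing IDs are unchanged."""
    extended = dict(stoi)
    for w in sorted(set(new_words)):
        if w not in extended:
            extended[w] = len(extended)
    return extended

stoi = build_vocab("the city is on the river".split())
new_text = "the city has beignets".split()
print(encode(new_text, stoi))                        # [5, 1, 0, 0]
print(encode(new_text, extend_vocab(stoi, new_text)))  # [5, 1, 7, 6]
```

Keeping the original IDs unchanged means the already-trained embeddings stay valid. The model's embedding table and output layer would still need new rows for the added IDs before fine-tuning.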