# Lab 9: Finetuning GPT-2 with LoRA

[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/sgeinitz/DSML4220/blob/main/lab9_finetuning_gpt2.ipynb)

[![Kaggle](https://kaggle.com/static/images/open-in-kaggle.svg)](https://kaggle.com/kernels/welcome?src=https://github.com/sgeinitz/DSML4220/blob/main/lab9_finetuning_gpt2.ipynb)

In this lab we will use GPT-2 for text generation. We'll first briefly compare Greedy Search and (Diverse) Beam Search with GPT-2. Then we'll finetune GPT-2 on Hemingway's book, _The Sun Also Rises_, so that it generates text that draws on the book's content and is written in the book's style.


### Lab 9 Assignment/Task
There are three questions in this lab. As an added bonus, try downloading a book of your own from Project Gutenberg and finetuning GPT-2 to generate text in the style of that book/author (see this script for help converting it to a .csv file of sentences).

In [4]:
import torch

import numpy as np
import pandas as pd

from transformers import GPT2Tokenizer, AutoModelForCausalLM, TrainingArguments, Trainer
from torch.utils.data import Dataset, random_split
from peft import LoraModel, LoraConfig

In [5]:
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")

In [6]:
inputs = tokenizer(["Today is"], return_tensors="pt")
inputs

{'input_ids': tensor([[8888,  318]]), 'attention_mask': tensor([[1, 1]])}

Let's generate some text from the model using regular Greedy Search (here is the [HuggingFace example documenting this](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.compute_transition_scores.example)).

In [7]:
# Example 1: Print the scores for each token generated with Greedy Search
#tokenizer.pad_token_id = tokenizer.eos_token_id
outputs = model.generate(**inputs, max_new_tokens=10, return_dict_in_generate=True, output_scores=True)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
# input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
# encoder-decoder models, like BART or T5.
input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


|   262 |  the     | -1.414 | 24.33%
|  1110 |  day     | -2.609 | 7.36%
|   618 |  when    | -2.010 | 13.41%
|   356 |  we      | -1.859 | 15.58%
|   460 |  can     | -2.508 | 8.14%
|   477 |  all     | -2.752 | 6.38%
|   307 |  be      | -2.960 | 5.18%
|  6613 |  proud   | -2.135 | 11.82%
|   286 |  of      | -0.558 | 57.21%
|   674 |  our     | -1.472 | 22.96%
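
The per-token log probabilities in the table above can be added together to get the log probability of the whole continuation, since the probability of a sequence is the product of its per-token probabilities. A minimal sketch using the rounded values printed above (so the result is approximate; the variable names are illustrative and not part of the lab):

```python
import math

# Per-token log probabilities copied (rounded) from the Greedy Search output above
token_logprobs = [-1.414, -2.609, -2.010, -1.859, -2.508,
                  -2.752, -2.960, -2.135, -0.558, -1.472]

# log P(sequence) = sum of the per-token log probabilities
sequence_logprob = sum(token_logprobs)

# P(sequence) = exp(log P(sequence)), i.e. the product of the token probabilities
sequence_prob = math.exp(sequence_logprob)

print(f"sequence log probability: {sequence_logprob:.3f}")
print(f"sequence probability:     {sequence_prob:.2e}")
```

Even though each step picks the single most likely token, the probability of the full ten-token continuation is tiny, which is why sequences are usually compared by (length-adjusted) log probability rather than by raw probability.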


In [8]:
outputs['sequences']

tensor([[8888,  318,  262, 1110,  618,  356,  460,  477,  307, 6613,  286,  674]])

Let's now use Beam Search (again using [this example from HF](https://huggingface.co/docs/transformers/en/main_classes/text_generation#transformers.GenerationMixin.compute_transition_scores.example)).

In [9]:
inputs = tokenizer(["Today is"], return_tensors="pt")

# Approach 2: Beam Search
outputs = model.generate(
    **inputs,
    max_new_tokens=10,
    num_beams=6,
    #num_beam_groups=3,
    #diversity_penalty=5.0,
    num_return_sequences=6,
    return_dict_in_generate=True,
    output_scores=True,
)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


In [10]:
outputs['sequences']

tensor([[8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  674, 2055],
        [8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  262, 2055],
        [8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  428, 2055],
        [8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  340,   13],
        [8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  428, 1049],
        [8888,  318,  257, 1049,  640,  284,  307,  257,  636,  286,  257, 2055]])

In [11]:
for s, seq in enumerate(outputs['sequences']):
  print(f"seq {s}: {tokenizer.decode(seq)}")

seq 0: Today is a great time to be a part of our community
seq 1: Today is a great time to be a part of the community
seq 2: Today is a great time to be a part of this community
seq 3: Today is a great time to be a part of it.
seq 4: Today is a great time to be a part of this great
seq 5: Today is a great time to be a part of a community
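
To see why Beam Search can find a different (and more probable) sequence than Greedy Search, here is a toy sketch. The "language model" is a hypothetical hand-written table of next-token probabilities, not GPT-2, and `beam_search` is a simplified illustration rather than the HuggingFace implementation. Greedy Search is the special case of a single beam.

```python
import math

# Hypothetical toy model: next-token probabilities given the previous token
NEXT = {
    "<s>": {"A": 0.5, "B": 0.4, "C": 0.1},
    "A": {"x": 0.4, "y": 0.3, "z": 0.3},
    "B": {"x": 0.9, "y": 0.05, "z": 0.05},
    "C": {"x": 0.4, "y": 0.3, "z": 0.3},
}

def beam_search(num_beams, steps=2):
    """Keep the num_beams highest-scoring partial sequences at every step."""
    beams = [(["<s>"], 0.0)]  # (tokens so far, summed log probability)
    for _ in range(steps):
        candidates = []
        for tokens, logp in beams:
            for tok, p in NEXT[tokens[-1]].items():
                candidates.append((tokens + [tok], logp + math.log(p)))
        # Sort all extensions by score and keep only the best num_beams
        candidates.sort(key=lambda c: c[1], reverse=True)
        beams = candidates[:num_beams]
    return beams

greedy = beam_search(num_beams=1)[0]  # Greedy Search: one beam
beam = beam_search(num_beams=2)[0]    # best sequence found with two beams

print("greedy:", greedy[0], f"p={math.exp(greedy[1]):.2f}")
print("beam:  ", beam[0], f"p={math.exp(beam[1]):.2f}")
```

In this toy table, Greedy Search commits to `A` because it is the most likely first token, while two beams also keep `B` alive and discover that `B x` has a higher joint probability.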


---

### Q1: Does the Beam Search above use Diverse Beam Search? If not, change it to use Diverse Beam Search and describe how the output differs.  

(Hint: Look a few cells down at the next use of Beam Search; there are two parameters you will need to add, `num_beam_groups` and `diversity_penalty`.)

```
The outputs above do not use Diverse Beam Search. In that call to `generate`, `num_beam_groups` and `diversity_penalty` are commented out, so Diverse Beam Search is not enabled. In a later cell, where they are set to 3 and 5.0 respectively, Diverse Beam Search is enabled.

With those settings, the 6 beams are divided into 3 groups of 2 beams each, and the diversity penalty discourages each group from choosing tokens already chosen by earlier groups, so the returned sequences differ more from one another.
```
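
A simplified sketch of the idea behind `diversity_penalty`, assuming a Hamming-style penalty in which a token's score is reduced once for every time an earlier beam group already chose that token at the current step. The function name and the toy probabilities are hypothetical, and this is not the HuggingFace implementation:

```python
import math
from collections import Counter

def apply_diversity_penalty(log_probs, tokens_from_earlier_groups, diversity_penalty):
    """Subtract diversity_penalty once per earlier-group beam that chose the token."""
    counts = Counter(tokens_from_earlier_groups)
    return {tok: lp - diversity_penalty * counts[tok] for tok, lp in log_probs.items()}

# Hypothetical next-token log probabilities at one decoding step
log_probs = {" a": math.log(0.6), " the": math.log(0.3), " our": math.log(0.1)}

# Suppose both beams of the first group chose " a" at this step
penalized = apply_diversity_penalty(log_probs, [" a", " a"], diversity_penalty=5.0)

best_without_penalty = max(log_probs, key=log_probs.get)
best_with_penalty = max(penalized, key=penalized.get)
print(repr(best_without_penalty), "->", repr(best_with_penalty))
```

With a penalty as large as 5.0, a later group is pushed away from the token the first group picked even when that token is clearly the most likely one.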

---

In [12]:
prompt = ["Cohn confronted the bullfighter and "]
inputs = tokenizer(prompt, return_tensors="pt")

max_new_toks = 15
# Example 1: Print the scores for each token generated with Greedy Search
#outputs = model.generate(**inputs, max_new_tokens=max_new_toks, return_dict_in_generate=True, output_scores=True, do_sample=True, temperature=1)
outputs = model.generate(**inputs, max_new_tokens=max_new_toks, return_dict_in_generate=True, output_scores=True, do_sample=False)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, normalize_logits=True
)
# input_length is the length of the input prompt for decoder-only models, like the GPT family, and 1 for
# encoder-decoder models, like BART or T5.
input_length = 1 if model.config.is_encoder_decoder else inputs.input_ids.shape[1]
generated_tokens = outputs.sequences[:, input_length:]
for tok, score in zip(generated_tokens[0], transition_scores[0]):
    # | token | token string | log probability | probability
    print(f"| {tok:5d} | {tokenizer.decode(tok):8s} | {score.numpy():.3f} | {np.exp(score.numpy()):.2%}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


|  3711 | iced     | -0.646 | 52.43%
|   340 |  it      | -1.618 | 19.82%
|   510 |  up      | -0.897 | 40.80%
|    13 | .        | -1.199 | 30.16%
|   198 | 
        | -1.543 | 21.38%
|   198 | 
        | -0.018 | 98.22%
|     1 | "        | -0.677 | 50.79%
|    40 | I        | -1.864 | 15.50%
|  1101 | 'm       | -2.006 | 13.45%
|   407 |  not     | -1.525 | 21.76%
|  1016 |  going   | -1.388 | 24.95%
|   284 |  to      | -0.038 | 96.27%
|  1309 |  let     | -2.831 | 5.90%
|   345 |  you     | -0.957 | 38.42%
|   651 |  get     | -2.149 | 11.66%


In [13]:
inputs = tokenizer(prompt, return_tensors="pt")

# Approach 2: Reconstruct the sequence scores from Beam Search
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_toks,
    num_beams=6,
    num_beam_groups=3,
    diversity_penalty=5.0,
    num_return_sequences=6,
    return_dict_in_generate=True,
    output_scores=True,
    temperature=1.0,
    #do_sample=True
)
transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)
# If you sum the generated tokens' scores and apply the length penalty, you'll get the sequence scores.
# Tip 1: recomputing the scores is only guaranteed to match with `normalize_logits=False`. Depending on the
# use case, you might want to recompute it with `normalize_logits=True`.
# Tip 2: the output length does NOT include the input length
output_length = np.sum(transition_scores.numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
reconstructed_scores = transition_scores.sum(axis=1) / (output_length**length_penalty)

print(np.allclose(outputs.sequences_scores, reconstructed_scores))

for s, seq in enumerate(outputs['sequences']):
  print(f"seq {s}: {tokenizer.decode(seq)}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


False
seq 0: Cohn confronted the bullfighter and iced him up.

"I'm not going to let you get
seq 1: Cohn confronted the bullfighter and iced it up.

"I'm not going to let you get
seq 2: Cohn confronted the bullfighter and urchin, who had been in a state of shock.

"
seq 3: Cohn confronted the bullfighter and ichthyologist, who had been working on the case for more than a
seq 4: Cohn confronted the bullfighter and urchin, who had been in a state of shock.

The
seq 5: Cohn confronted the bullfighter and ichthyologist, who had been working on the case for a while.


Let's now load the raw text from Hemingway's book, _"The Sun Also Rises"_.

In [14]:
heming = pd.read_csv("https://raw.githubusercontent.com/sgeinitz/DSML4220/main/data/sunalsorises.csv")
heming.head()

Unnamed: 0,sentence
0,Robert Cohn was once middleweight boxing champ...
1,Do not think that I am very much impressed by ...
2,"He cared nothing for boxing, in fact he dislik..."
3,There was a certain inner comfort in knowing h...
4,He was Spider Kelly’s star pupil.


In [15]:
sentences = heming['sentence']
sentences.head()

0    Robert Cohn was once middleweight boxing champ...
1    Do not think that I am very much impressed by ...
2    He cared nothing for boxing, in fact he dislik...
3    There was a certain inner comfort in knowing h...
4                    He was Spider Kelly’s star pupil.
Name: sentence, dtype: object

In [16]:
print(f"        sentence: '{sentences[0]}' \n is tokenized as: {tokenizer.encode(sentences[0])}")

        sentence: 'Robert Cohn was once middleweight boxing champion of Princeton.' 
 is tokenized as: [19156, 45005, 373, 1752, 3504, 6551, 21576, 8783, 286, 23173, 13]


In [17]:
max_length = max([len(tokenizer.encode(sentence)) for sentence in sentences])
max_length

224

In [18]:
class HemingwayDataset(Dataset):
    def __init__(self, txt_list, tokenizer, max_length):
        self.input_ids = []
        self.attn_masks = []
        self.labels = []
        for txt in txt_list:
            encodings_dict = tokenizer('<|startoftext|>' + txt + '<|endoftext|>', truncation=True,
                                       max_length=max_length, padding="max_length")
            self.input_ids.append(torch.tensor(encodings_dict['input_ids']))
            self.attn_masks.append(torch.tensor(encodings_dict['attention_mask']))

    def __len__(self): # overload the len() Python built-in function
        return len(self.input_ids)

    def __getitem__(self, idx): # overload the [] operator
        return self.input_ids[idx], self.attn_masks[idx]

tokenizer.pad_token_id = tokenizer.eos_token_id

In [19]:
dataset = HemingwayDataset(sentences, tokenizer, max_length=max_length)
train_size = int(0.9 * len(dataset))
train_dataset, val_dataset = random_split(dataset, [train_size, len(dataset) - train_size])

In [20]:
train_dataset[0]

(tensor([   27,    91,  9688,  1659,  5239,    91,    29,   464,  4831,   547,
           845, 28746,   290,   511,  6698,   547,  6824,   290, 17298,    13,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256, 50256,
         50256, 50256, 50256, 50256, 50256, 50256, 5

Notice that above we set the `pad_token_id` to be the same as the `eos_token_id` (i.e. the end-of-sequence token id). So all of those `50256` entries above are being used as padding tokens (except the first one, which marks the end of the sequence).

In [21]:
tokenizer.decode([50256])

'<|endoftext|>'
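
Because the padding token and the end-of-sequence token share an id, padding cannot be identified from `input_ids` alone; the attention mask is what distinguishes the two. One optional use of this, which is not done in this lab, is to exclude padded positions from the training loss by setting their labels to `-100`, the label value that the cross-entropy loss used by HuggingFace causal language models ignores. A minimal pure-Python sketch of that idea (the helper name and the toy ids are illustrative):

```python
IGNORE_INDEX = -100  # label value skipped by the cross-entropy loss

def mask_padding_labels(input_ids, attention_mask):
    """Copy input_ids to labels, replacing padded positions (mask == 0) with IGNORE_INDEX."""
    return [tok if keep == 1 else IGNORE_INDEX
            for tok, keep in zip(input_ids, attention_mask)]

# Toy example: three text tokens, one real end-of-sequence token, two padding tokens
input_ids      = [464, 4831, 13, 50256, 50256, 50256]
attention_mask = [1,   1,    1,  1,     0,     0]

labels = mask_padding_labels(input_ids, attention_mask)
print(labels)
```

With tensors, the same masking could be written as `labels = input_ids.masked_fill(attention_mask == 0, -100)`.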

In [22]:
batch_size = 4
n_epochs = 2
training_args = TrainingArguments(output_dir='~/hemingway_generation', num_train_epochs=n_epochs, logging_steps=100, save_steps=500, do_eval=True,
                                  eval_steps=20, per_device_train_batch_size=batch_size, per_device_eval_batch_size=batch_size, save_safetensors=False,
                                  warmup_steps=10, weight_decay=0.05, logging_dir='~/hemingway_generation/logs', report_to='none')

Let's load GPT-2 and then take a quick look at its architecture (~124M parameters) by printing the model.

In [23]:
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [24]:
def count_trainable_parameters(mod):
    model_parameters = filter(lambda p: p.requires_grad, mod.parameters())
    params = sum([np.prod(p.size()) for p in model_parameters])
    return params

gpt2_params = count_trainable_parameters(model)
print(f"GPT-2 trainable parameters: {gpt2_params}")

GPT-2 trainable parameters: 124439808


In [25]:
trainer = Trainer(model=model,  args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
                  data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                              'attention_mask': torch.stack([f[1] for f in data]),
                                              'labels': torch.stack([f[0] for f in data])})
# on Colab this will take 6+hrs w/ cpu or <10min w/ T4 GPU per epoch
trainer.train()

`loss_type=None` was set in the config but it is unrecognised.Using the default loss: `ForCausalLMLoss`.


Step,Training Loss
100,0.9749
200,0.2141
300,0.2213
400,0.2064
500,0.2065
600,0.2029
700,0.2053
800,0.2029
900,0.203
1000,0.2


TrainOutput(global_step=3072, training_loss=0.2115910556167364, metrics={'train_runtime': 2627.3732, 'train_samples_per_second': 4.676, 'train_steps_per_second': 1.169, 'total_flos': 1404477333504000.0, 'train_loss': 0.2115910556167364, 'epoch': 2.0})

In [28]:
import torch

# Use CUDA if available; otherwise fall back to CPU (MPS is deliberately skipped in case it causes issues)
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
model = model.to(device)
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Generate text
outputs = model.generate(
    **inputs,
    max_new_tokens=max_new_toks,
    num_beams=6,
    num_beam_groups=3,
    diversity_penalty=5.0,
    num_return_sequences=5,
    return_dict_in_generate=True,
    output_scores=True,
    temperature=2.0,
)

transition_scores = model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)

output_length = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
length_penalty = model.generation_config.length_penalty
reconstructed_scores = transition_scores.cpu().sum(axis=1) / (output_length**length_penalty)

for s, seq in enumerate(outputs['sequences']):
  gen_text = tokenizer.decode(seq)
  # keep only the text before the first '<|endoftext|>'; unlike slicing with str.find,
  # split leaves the text unchanged when no '<|endoftext|>' was generated
  gen_text = gen_text.split('<|endoftext|>')[0]
  print(f"seq {s}: {gen_text}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


seq 0: Cohn confronted the bullfighter and ” “I don’t think so.
seq 1: Cohn confronted the bullfighter and ” “I don’t think so,” h
seq 2: Cohn confronted the bullfighter and  “I’m going to kill him.
seq 3: Cohn confronted the bullfighter and  “I’m not going to let him go.
seq 4: Cohn confronted the bullfighter and �����ed him.


Next, let's use LoRA to finetune the model. We'll load the model again so that the earlier full finetuning is not carried over.
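
As a rough sketch of what LoRA does (the notation here is illustrative and not from the lab): for a frozen pretrained weight matrix $W_0 \in \mathbb{R}^{d \times k}$, LoRA learns two small matrices $B \in \mathbb{R}^{d \times r}$ and $A \in \mathbb{R}^{r \times k}$ with $r \ll \min(d, k)$, and the adapted layer computes

$$
h = W_0 x + \frac{\alpha}{r} B A x
$$

Only $A$ and $B$ are trained, which is $r(d + k)$ parameters instead of $dk$. With the `r=16` and `lora_alpha=32` used in the configuration in this lab, the scaling factor $\alpha / r$ is 2.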

In [29]:
# load the model again so that we can use LoRA
model = AutoModelForCausalLM.from_pretrained("openai-community/gpt2")
print(model)

GPT2LMHeadModel(
  (transformer): GPT2Model(
    (wte): Embedding(50257, 768)
    (wpe): Embedding(1024, 768)
    (drop): Dropout(p=0.1, inplace=False)
    (h): ModuleList(
      (0-11): 12 x GPT2Block(
        (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (attn): GPT2Attention(
          (c_attn): Conv1D(nf=2304, nx=768)
          (c_proj): Conv1D(nf=768, nx=768)
          (attn_dropout): Dropout(p=0.1, inplace=False)
          (resid_dropout): Dropout(p=0.1, inplace=False)
        )
        (ln_2): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
        (mlp): GPT2MLP(
          (c_fc): Conv1D(nf=3072, nx=768)
          (c_proj): Conv1D(nf=768, nx=3072)
          (act): NewGELUActivation()
          (dropout): Dropout(p=0.1, inplace=False)
        )
      )
    )
    (ln_f): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
  )
  (lm_head): Linear(in_features=768, out_features=50257, bias=False)
)


In [30]:
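# Only names that match modules in the model are adapted. For GPT-2 these are "wte", "c_fc", and "c_proj"
# (which matches both attn.c_proj and mlp.c_proj); names such as "q_proj" do not occur in GPT-2.
target_modules = ["q_proj", "k_proj", "v_proj", "out_proj", "fc_in", "fc_out", "wte", "c_fc", "c_proj"]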

lora_config = LoraConfig(
    task_type="CAUSAL_LM",
    inference_mode=False,
    r=16,
    lora_alpha=32,
    target_modules=target_modules,
    lora_dropout=0.01,
    fan_in_fan_out=True
)

lora_model = LoraModel(model, lora_config, "default")
lora_model.model.tie_weights()

In [31]:
print(lora_model)

LoraModel(
  (model): GPT2LMHeadModel(
    (transformer): GPT2Model(
      (wte): lora.Embedding(
        (base_layer): Embedding(50257, 768)
        (lora_dropout): ModuleDict(
          (default): Dropout(p=0.01, inplace=False)
        )
        (lora_A): ModuleDict()
        (lora_B): ModuleDict()
        (lora_embedding_A): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 16x50257])
        (lora_embedding_B): ParameterDict(  (default): Parameter containing: [torch.FloatTensor of size 768x16])
        (lora_magnitude_vector): ModuleDict()
      )
      (wpe): Embedding(1024, 768)
      (drop): Dropout(p=0.1, inplace=False)
      (h): ModuleList(
        (0-11): 12 x GPT2Block(
          (ln_1): LayerNorm((768,), eps=1e-05, elementwise_affine=True)
          (attn): GPT2Attention(
            (c_attn): Conv1D(nf=2304, nx=768)
            (c_proj): lora.Linear(
              (base_layer): Conv1D(nf=768, nx=768)
              (lora_dropout): ModuleDict(
    

In [32]:
trainer = Trainer(model=lora_model,  args=training_args, train_dataset=train_dataset, eval_dataset=val_dataset,
                  data_collator=lambda data: {'input_ids': torch.stack([f[0] for f in data]),
                                              'attention_mask': torch.stack([f[1] for f in data]),
                                              'labels': torch.stack([f[0] for f in data])})

In [33]:
trainer.train()

Step,Training Loss
100,3.8593
200,0.3197
300,0.2711
400,0.2428
500,0.2367
600,0.2323
700,0.2345
800,0.2327
900,0.2319
1000,0.2282


TrainOutput(global_step=3072, training_loss=0.34453964109222096, metrics={'train_runtime': 2134.6694, 'train_samples_per_second': 5.755, 'train_steps_per_second': 1.439, 'total_flos': 1447176244942848.0, 'train_loss': 0.34453964109222096, 'epoch': 2.0})

---

### Q2: How many more `train_samples_per_second` could the LoRA model get through during finetuning than the original GPT-2 model could?

```
GPT-2 (full finetuning)
  - train_samples_per_second: 4.676

LoRA
  - train_samples_per_second: 5.755

That is a difference of 1.079 more samples per second (about 23% higher throughput). LoRA is faster than full finetuning of GPT-2 because the pretrained weights are frozen and only the low-rank adapter weights are trained, so far fewer gradients and optimizer states need to be computed and updated.
```
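
The comparison can be reproduced from the two `TrainOutput` results above. This is a small sketch with the reported values copied in by hand (the variable names are illustrative); the timings come from a single run each, so they will vary with hardware and system load.

```python
# Values copied from the two TrainOutput results above
full_samples_per_sec = 4.676   # full finetuning of GPT-2
lora_samples_per_sec = 5.755   # LoRA finetuning
full_runtime_sec = 2627.3732
lora_runtime_sec = 2134.6694

difference = lora_samples_per_sec - full_samples_per_sec
speedup = lora_samples_per_sec / full_samples_per_sec
time_saved_min = (full_runtime_sec - lora_runtime_sec) / 60

print(f"extra samples per second: {difference:.3f}")
print(f"relative throughput:      {speedup:.3f}x")
print(f"time saved over 2 epochs: {time_saved_min:.1f} min")
```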

---

In [35]:
# Ensure the model is moved to the correct device
device = torch.device("mps" if torch.backends.mps.is_available() else "cpu")
lora_model = lora_model.to(device)

# Ensure inputs are moved to the same device
inputs = tokenizer(prompt, return_tensors="pt").to(device)

# Use Diverse Beam Search
outputs = lora_model.generate(
    **inputs,
    max_new_tokens=max_new_toks,
    num_beams=6,
    num_beam_groups=3,
    diversity_penalty=5.0,
    num_return_sequences=5,
    return_dict_in_generate=True,
    output_scores=True,
    temperature=1.5,
    # do_sample=True
)

transition_scores = lora_model.compute_transition_scores(
    outputs.sequences, outputs.scores, outputs.beam_indices, normalize_logits=False
)

output_length = np.sum(transition_scores.cpu().numpy() < 0, axis=1)
length_penalty = lora_model.generation_config.length_penalty
reconstructed_scores = transition_scores.cpu().sum(axis=1) / (output_length**length_penalty)

for s, seq in enumerate(outputs['sequences']):
    gen_text = tokenizer.decode(seq)
    # Keep only the text before the first '<|endoftext|>'; unlike slicing with str.find,
    # split leaves the text unchanged when no '<|endoftext|>' was generated
    gen_text = gen_text.split('<|endoftext|>')[0]
    print(f"seq {s}: {gen_text}")

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


seq 0: Cohn confronted the bullfighter and  the bullfighter, and the bullfighter, and the bullfighter
seq 1: Cohn confronted the bullfighter and iced him up.
seq 2: Cohn confronted the bullfighter and  the bullfighter, and the bullfighter, and the bullfighter
seq 3: Cohn confronted the bullfighter and iced him.
seq 4: Cohn confronted the bullfighter and ichor with the bullfighter, and the bullfighter had to stop an


In [37]:
lora_params = count_trainable_parameters(lora_model)
print(f"GPT-2 trainable parameters: {gpt2_params}")
print(f"LoRA trainable parameters: {lora_params} ({(100*lora_params/gpt2_params):.2f}% of GPT-2's trainable parameters)")

GPT-2 trainable parameters: 124439808
LoRA trainable parameters: 2585872 (2.08% of GPT-2's trainable parameters)


---

### Q3: How many fewer parameters did the LoRA model need to train/tune than the full GPT-2 model did?

(Hint: See output from above cell)

```
GPT-2 has 124,439,808 trainable parameters, while the LoRA model has only 2,585,872, which is 121,853,936 fewer (about 2.08% as many). The reason for this decrease is that LoRA freezes GPT-2's pretrained weights and trains only the small adapter matrices that it adds.

For each targeted layer, LoRA adds a pair of low-rank matrices whose product acts as an update to the frozen pretrained weight. Because their rank (here r=16) is much smaller than the dimensions of the original weight matrix, they contain far fewer parameters.
```
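
The LoRA parameter count can be checked by hand. The sketch below assumes that the adapted modules are `wte` plus `attn.c_proj`, `mlp.c_fc`, and `mlp.c_proj` in each of the 12 blocks; the printed `lora_model` above is truncated, so the two MLP modules are an assumption, though one that reproduces the reported total. The helper name is illustrative.

```python
r = 16          # LoRA rank from lora_config
d_model = 768   # GPT-2 hidden size
vocab = 50257   # number of rows in the wte embedding
d_mlp = 3072    # width of the MLP hidden layer
n_blocks = 12   # number of GPT2Block layers

def lora_param_count(in_features, out_features, rank):
    """LoRA adds A (rank x in_features) and B (out_features x rank)."""
    return rank * in_features + out_features * rank

wte = lora_param_count(vocab, d_model, r)            # 16x50257 and 768x16
attn_c_proj = lora_param_count(d_model, d_model, r)
mlp_c_fc = lora_param_count(d_model, d_mlp, r)
mlp_c_proj = lora_param_count(d_mlp, d_model, r)

lora_total = wte + n_blocks * (attn_c_proj + mlp_c_fc + mlp_c_proj)
gpt2_total = 124439808  # reported by count_trainable_parameters above

print(f"LoRA trainable parameters: {lora_total:,}")
print(f"fewer than full GPT-2:     {gpt2_total - lora_total:,}")
print(f"fraction of GPT-2:         {100 * lora_total / gpt2_total:.2f}%")
```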

---