-------------------------------

# Part 2: GPT-2 Text Generation with HuggingFace

Phew, that was a lot of reading. Now let's get to the fun part! Let's use the transformer to generate some text!!

We will use the [Transformers library from HuggingFace](https://transformer.huggingface.co), which provides support for many Transformer-based language models like GPT-2.

**IMPORTANT: Make sure that you have GPU set as your Hardware Accelerator in `Runtime > Change runtime type` before running this Colab.**

In [2]:
!pip install transformers



## 2.1 The 'Pipeline' Interface

The simplest way to use the HuggingFace library is through its [Pipeline interface](https://huggingface.co/transformers/main_classes/pipelines.html).

There are many different types of Pipelines available, but in this section we'll use the TextGenerationPipeline to get up and running with pretrained gpt2 as fast as possible.

In [1]:
from transformers import pipeline

In [None]:
# Note: device=0 means to use GPU, device=-1 is to use CPU
generator = pipeline('text-generation', model='gpt2', device=0)

In [None]:
outputs = generator('I wonder what I will generate?')
print(outputs)

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I wonder what I will generate?\n\nI guess it has been a little less than a year since the last one. I am surprised that I can be so easy on you, but as a regular person I have managed to keep my own secrets'}]


Note that the 'text-generation' pipeline will work with any **auto-regressive** language model (a.k.a 'causal-lm' models according to the HuggingFace lingo). You can find a list of all such models here https://huggingface.co/models?filter=causal-lm.

10. (6 pts) **Your first task is to use the Pipeline interface to get generation output below for at least two different 'causal-lm' models (One of these two can be a different version of GPT2, but make sure at least one is a non-gpt family language model)**

In [4]:
## YOUR CODE HERE FOR MODEL 1
generator = pipeline('text-generation', model='EleutherAI/gpt-neo-125M', device=0)
outputs = generator('I wonder what I will generate?')
print(outputs)

config.json:   0%|          | 0.00/1.01k [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/526M [00:00<?, ?B/s]

tokenizer_config.json:   0%|          | 0.00/560 [00:00<?, ?B/s]

vocab.json:   0%|          | 0.00/899k [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

special_tokens_map.json:   0%|          | 0.00/357 [00:00<?, ?B/s]

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'I wonder what I will generate?\n\nI have a lot of questions about the future of the'}]


In [None]:
## YOUR CODE HERE FOR MODEL 2
generator = pipeline("text-generation", model="EleutherAI/pythia-70m")
outputs = generator("I wonder what I will generate?")
print(outputs)


Special tokens have been added in the vocabulary, make sure the associated word embeddings are fine-tuned or trained.
Setting `pad_token_id` to `eos_token_id`:0 for open-end generation.


[{'generated_text': 'I wonder what I will generate?\n\nA:\n\nI think you are correct.\n'}]


## 2.2 Dissecting the Pipeline
Now that was easy!

As beautiful and easy as the Pipeline interface is, we want to know what's going on under the hood!

There are four main steps to a text generation pipeline:
1. (Tokenize) Turn the raw input text into a vector of integer token IDs using a tokenizer

2. (Encode) Feed those token IDs into the language model by querying for each token's embedding in the model's embedding matrix (the "encoder") and then feed the "encoded" sequence into the decoder module

3. (Decode) The decoder will output logits (unnormalized scores over all possible integer token IDs, which a softmax turns into a probability distribution) and we sample from that distribution to get our next token. Repeat until the EOS token is generated or we hit max_length

4. (Detokenize) Take the output sequence of token IDs and turn them from integer token IDs back to tokens with the tokenizer
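
To make the four steps above concrete before we look at the real thing, here is a minimal toy sketch in plain Python. Everything in it is made up for illustration: the four-word vocabulary, the whitespace tokenizer and the `toy_logits` "model" (which just prefers the next word in the vocabulary) are assumptions and have nothing to do with GPT-2's actual tokenizer or weights. It also uses greedy decoding (always pick the most likely token) rather than sampling.

```python
import math

# Hypothetical toy vocabulary; a real tokenizer uses tens of thousands of subword tokens.
VOCAB = ["<eos>", "hello", "world", "again"]
TOKEN_TO_ID = {tok: i for i, tok in enumerate(VOCAB)}


def tokenize(text):
    """Step 1: map whitespace-separated words to integer token IDs."""
    return [TOKEN_TO_ID[word] for word in text.split()]


def toy_logits(token_ids):
    """Steps 2-3 stand-in: one unnormalized score per vocabulary entry.

    This fake "model" only looks at the last token and prefers the next ID
    in the vocabulary, wrapping around to <eos>.
    """
    preferred = (token_ids[-1] + 1) % len(VOCAB)
    return [5.0 if i == preferred else 0.0 for i in range(len(VOCAB))]


def softmax(logits):
    """Turn logits into a probability distribution that sums to 1."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def generate(token_ids, max_length=10, eos_id=0):
    """Step 3: greedy decoding until <eos> is produced or max_length is hit."""
    ids = list(token_ids)
    while len(ids) < max_length:
        probs = softmax(toy_logits(ids))
        next_id = max(range(len(probs)), key=probs.__getitem__)
        ids.append(next_id)
        if next_id == eos_id:
            break
    return ids


def detokenize(token_ids):
    """Step 4: map integer token IDs back to text."""
    return " ".join(VOCAB[i] for i in token_ids)


input_ids = tokenize("hello")
output_ids = generate(input_ids)
print(output_ids)
print(detokenize(output_ids))
```

Note that the output sequence starts with the input IDs, just like the output of a real causal language model.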

Below you'll see how HuggingFace does this:

First we have to initialize both the tokenizer and the model from their pre-trained checkpoints. Note that the tokenizer has to match the model.

In [6]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel  # AutoTokenizer, AutoModelForCausalLM

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2').cuda()

vocab.json:   0%|          | 0.00/1.04M [00:00<?, ?B/s]

merges.txt:   0%|          | 0.00/456k [00:00<?, ?B/s]

tokenizer.json:   0%|          | 0.00/1.36M [00:00<?, ?B/s]

config.json:   0%|          | 0.00/665 [00:00<?, ?B/s]

model.safetensors:   0%|          | 0.00/548M [00:00<?, ?B/s]

generation_config.json:   0%|          | 0.00/124 [00:00<?, ?B/s]

In [None]:
#### Step 1: Tokenize the input into integer token IDs
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
print("Input Token IDs: " + str(inputs))

Input Token IDs: tensor([[15496,    11,   703,   389,   345,    30]], device='cuda:0')


In [None]:
#### Step 2 and 3: Feed in the integer token IDs and get out a sequence of token IDs as output
outputs = model.generate(inputs)
print("Output Token IDs: " + str(outputs))

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Output Token IDs: tensor([[15496,    11,   703,   389,   345,    30,   198,   198,    40,  1101,
           257,  1310,  1643,   286,   257, 34712,    13,   314,  1101,   257]],
       device='cuda:0')


In [None]:
#### Step 4: Decode the output token IDs back into text with the tokenizer
output_text = [tokenizer.decode(x) for x in outputs]
print("Output Text: " + str(output_text))

Output Text: ["Hello, how are you?\n\nI'm a little bit of a nerd. I'm a"]


Now that you have dissected the pipeline, it's time to play with some common parameters!

[Check out this demo notebook from HuggingFace](https://github.com/huggingface/blog/blob/master/notebooks/02_how_to_generate.ipynb) for a good overview of the different generation parameters and what they do (with example code!).

The full documentation on all of the parameters you can use in the generate function can be found [here](https://huggingface.co/transformers/main_classes/model.html#transformers.generation_utils.GenerationMixin.generate)

As an example, below we have a call to generate that:
- randomly samples from the top 50 words in the output distribution (rather than just greedily picking the best one every time)
- downweights the probability of all previously generated tokens by a factor of 1.2 (to prevent repetition)
- goes on for 512 tokens, because it's more interesting
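
If you are curious what the first two bullets do to the scores, here is a rough sketch in plain Python. It is not the library's code: the function names are made up, the logits are a toy four-token example, and the penalty rule (divide positive logits by the penalty, multiply negative ones) is our understanding of how the Transformers repetition penalty behaves, so treat it as an assumption.

```python
import math
import random


def apply_repetition_penalty(logits, previous_ids, penalty):
    """Make already-generated tokens less likely.

    Positive logits are divided by the penalty and negative logits are
    multiplied by it, so the score of a repeated token always moves down.
    """
    adjusted = list(logits)
    for token_id in set(previous_ids):
        if adjusted[token_id] > 0:
            adjusted[token_id] /= penalty
        else:
            adjusted[token_id] *= penalty
    return adjusted


def top_k_probs(logits, k):
    """Keep the k highest logits, drop the rest, and renormalize with a softmax."""
    keep = sorted(range(len(logits)), key=lambda i: logits[i], reverse=True)[:k]
    m = max(logits[i] for i in keep)
    exps = {i: math.exp(logits[i] - m) for i in keep}
    total = sum(exps.values())
    return {i: e / total for i, e in exps.items()}


def sample_next(logits, previous_ids, k=50, penalty=1.2, rng=random):
    """Penalize repeats, restrict to the top k tokens, then sample one token ID."""
    probs = top_k_probs(apply_repetition_penalty(logits, previous_ids, penalty), k)
    ids = list(probs)
    return rng.choices(ids, weights=[probs[i] for i in ids], k=1)[0]


logits = [2.0, 1.0, -1.0, 0.5]  # toy scores for a 4-token vocabulary
penalized = apply_repetition_penalty(logits, previous_ids=[0, 2], penalty=1.2)
print(penalized)
print(top_k_probs(penalized, k=2))
print(sample_next(logits, previous_ids=[0, 2], k=2, rng=random.Random(0)))
```

With `k=2` only the two best-scoring tokens can ever be sampled, no matter how the random draw goes.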

In [None]:
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      do_sample=True,          # Randomly sample from the logits instead of greedily picking next word with highest probability
      top_k=50,                 # Only sample from the top 50 most likely words
      repetition_penalty=1.2,    # Downweights the probability of all previously generated tokens by a factor of 1.2
      max_length=512          # Generate for a maximum of 512 tokens
  )
print([tokenizer.decode(x) for x in outputs][0])


The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you? Why am I standing here now?"
"Because of that..."

… It seems as if we're having dinner on one or another day. There's not really much room in sight for any more distractions from the fight outside... Maybe there was some time before when they were together but what about today...? What did she expect to see out this morning tomorrow!? How many times have Yuki has had his breakfast at a tavern without getting drunk so easily? After all he doesn't do anything stupid like go looking around alone and making friends with everyone even though it might cost him your life just because someone called himself 'Kyoobatohime.' They seem very serious  about their past lives! The rest is speculation by anyone judging people only once, isnn - no way could Gendo know why our heroes got into Sojiro-sensei after telling us those stories (a fact which makes them probably think Kaedes) while others believe Shiki didn´t call up her former bodyguard Aizen during lunch. If my eyes look s

**11. Your job is to provide two different examples of generation output from GPT-2 with different choices of generation parameters. You must also provide a 1-2 sentence explanation of what these parameters do and how they affect your output**

Feel free to get creative with this! Really poke around and try to find the combination of settings that gives you the best-sounding text! How these parameters affect the 'human-likeness' of generated text is an area of active research. :)

In [7]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 1
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      do_sample=True,          # Randomly sample from the logits instead of greedily picking next word with highest probability
      top_k=50,                 # Only sample from the top 50 most likely words
      repetition_penalty=1.2,    # Downweights the probability of all previously generated tokens by a factor of 1.2
      max_length=128          # Generate for a maximum of 128 tokens
  )
print([tokenizer.decode(x) for x in outputs][0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you? I thought that a new guy was about to break into the office.
I'm sorry...how have we been so busy this year??? (whispers) We're still trying our best and getting everybody on board first! But one thing is sure :the rest of those people got bored too soon as well lol

The reason for my "failure" seems rather simple: My wife had trouble breathing since she has no job available :) She's in intensive care right now with her cat just over two weeks away from having surgery... It started when 2 cats were already dead due cause it would mean at least


(4 pts) YOUR ANSWER HERE - EXPLANATION FOR HPARAM VARIATION 1

Here, the generated text is short because max_length is limited to 128 tokens, but the sentences are fairly meaningful.

In [8]:
## YOUR CODE HERE FOR HYPERPARAMETER VARIATION 2
inputs = tokenizer.encode("Hello, how are you?", return_tensors='pt').to(model.device)
outputs = model.generate(
      inputs,
      do_sample=True,          # Randomly sample from the logits instead of greedily picking next word with highest probability
      top_k=50,                 # Only sample from the top 50 most likely words
      repetition_penalty=2.0,    # Downweights the probability of all previously generated tokens by a factor of 2.0
      max_length=512        # Generate for a maximum of 512 tokens
  )
print([tokenizer.decode(x) for x in outputs][0])

The attention mask and the pad token id were not set. As a consequence, you may observe unexpected behavior. Please pass your input's `attention_mask` to obtain reliable results.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Hello, how are you? How far... Oh right. I am in the way of his thoughts."
"How long have your eyes been running and seeing things?" Harry said happily as he closed her glasses while holding one hand on mine from behind with another pointing up at me once more without even looking back through them into my nose to check for anyone who looked too scared or concerned it would mean something bad overmuch later that's all we can really say about him this morning except maybe after a few hours but no time will tell us where they went then hopefully there is someone out here somewhere when anything happens anyhow; It was actually good before coming around last night which gave some sort cause why Neville had started calling so much attention just now despite not feeling quite like returning home until two-three tomorrow Night School teachers were still giving hints every single day only putting extra work towards getting students closer together than their usual schedule made possible (which

(4 pts) YOUR ANSWER HERE -- EXPLANATION FOR HPARAM VARIATION 2

Here the repetition penalty is very high (2.0), so the model forms less meaningful sentences: it is strongly discouraged from repeating tokens it has already generated.

## 2.3 Fine-Tuning GPT-2
Okay now time for the best part!

Generating general-purpose text from pre-trained models is great, but what if we want our text to be in a specific genre or style? Luckily for us, the GPT family of models uses the idea of "transfer learning": using knowledge gained from one problem (or training setting) and applying it to another area or domain. The idea of transfer learning for NLP is that we can train a language model on general texts and then adapt it to a specific task or domain that we're interested in. This process is also called **fine-tuning**.

In this section we'll walk you through an example of using HuggingFace to fine-tune GPT-2 and then you'll be asked to fine-tune GPT-2 on two datasets of your own choosing!

### Fine-Tuning Example using HuggingFace Datasets library: Crime and Punishment

For our fine-tuning example we're going to train GPT-2 to mimic the style of Fyodor Dostoevsky's novel "Crime and Punishment"

We will be downloading our data using the HuggingFace [Datasets](https://huggingface.co/docs/datasets/) library.

In [6]:
import locale
locale.getpreferredencoding = lambda: "UTF-8"
!pip install datasets

Collecting datasets
  Downloading datasets-2.15.0-py3-none-any.whl (521 kB)
Collecting pyarrow-hotfix (from datasets)
  Downloading pyarrow_hotfix-0.6-py3-none-any.whl (7.9 kB)
Collecting dill<0.3.8,>=0.3.0 (from datasets)
  Downloading dill-0.3.7-py3-none-any.whl (115 kB)
Collecting multiprocess (from datasets)
  Downloading multiprocess-0.70.15-py310-none-any.whl (134 kB)
Installing collected packages: pyarrow-hotfix, dill, multiprocess, datasets
Successfully installed datasets-2.15.0 dill-0.3.7 multiprocess-0.70.15 pyarrow-hotfix-0.6


In [7]:
!pip install --upgrade datasets transformers

Collecting transformers
  Downloading transformers-4.36.2-py3-none-any.whl (8.2 MB)
Installing collected packages: transformers
  Attempting uninstall: transformers
    Found existing installation: transformers 4.35.2
    Uninstalling transformers-4.35.2:
      Successfully uninstalled transformers-4.35.2
Successfully installed transformers-4.36.2


In [8]:
!pip install accelerate -U
# !pip install transformers -U

Collecting accelerate
  Downloading accelerate-0.25.0-py3-none-any.whl (265 kB)
Installing collected packages: accelerate
Successfully installed accelerate-0.25.0


In [None]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
import datasets
from datasets import load_dataset, list_datasets

### Step 1: Initialize a Brand New GPT-2 Model and Tokenizer

In [None]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2').cuda()
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)

### Step 2: Load the text of "Crime and Punishment" and tokenize it

The 'load_dataset' function queries for a dataset with a certain tag and downloads the corresponding data from HuggingFace's hosting site. This allows us to download all sorts of datasets through the same interface!

The documentation for load_dataset can be found [here](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

Here we take our tokenizer and run it on the entirety of Crime and Punishment in a single batch by using map on our custom encode function.
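
Because the whole book goes through the tokenizer as one batch with padding turned on, every line ends up padded to the length of the longest line in that batch. The sketch below shows the idea in plain Python; `pad_batch` is a hypothetical helper (not the tokenizer's real implementation), the token IDs are just examples, and we assume right-padding with GPT-2's EOS ID 50256 standing in as the pad token.

```python
def pad_batch(sequences, pad_id):
    """Right-pad every sequence to the longest one; return IDs and an attention mask."""
    longest = max(len(seq) for seq in sequences)
    input_ids, attention_mask = [], []
    for seq in sequences:
        n_pad = longest - len(seq)
        input_ids.append(list(seq) + [pad_id] * n_pad)
        # 1 marks a real token, 0 marks padding the model should ignore
        attention_mask.append([1] * len(seq) + [0] * n_pad)
    return {"input_ids": input_ids, "attention_mask": attention_mask}


# Three "lines" of different lengths, padded with the EOS ID (an assumption for this sketch)
batch = pad_batch([[15496, 11], [703], [389, 345, 30]], pad_id=50256)
print(batch["input_ids"])
print(batch["attention_mask"])
```

The attention mask is what lets the model tell real tokens apart from padding, which is why the code keeps both the `input_ids` and `attention_mask` columns.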

In [None]:
def encode(batch):
    # Strip newline characters from each line, then tokenize the whole batch at once
    return tokenizer([x.strip('\n\r') for x in batch['line']], truncation=True, padding=True)

crime_and_punishment = load_dataset('crime_and_punish', split='train')
processed = crime_and_punishment.map(encode, batched=True, batch_size=len(crime_and_punishment))
processed.set_format('torch', columns=['input_ids', 'attention_mask'])

Downloading builder script:   0%|          | 0.00/1.92k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/5.08k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/441k [00:00<?, ?B/s]

Generating train split:   0%|          | 0/21969 [00:00<?, ? examples/s]

Map:   0%|          | 0/21969 [00:00<?, ? examples/s]

In [None]:
crime_and_punishment = load_dataset('crime_and_punish', split='train')
print(crime_and_punishment)

Dataset({
    features: ['line'],
    num_rows: 21969
})


### Step 3: Initialize the Trainer

The 'Trainer' module is the main way we perform fine-tuning. In order to initialize a Trainer, you need a model, a tokenizer, TrainingArguments, your training data (in a Dataset object), and a data_collator. The data_collator assembles examples into batches; with mlm=False it also builds the labels from the input IDs, so the Trainer doesn't need to look for a separate vector of labels.
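
Here is a rough sketch of what we understand the collator to do for causal language modeling. `causal_lm_labels` is a hypothetical stand-in written in plain Python, not the library's code: it copies the input IDs into the labels and replaces padding positions with -100, the value the loss function ignores. The shift by one position (predicting token t+1 from tokens up to t) is assumed to happen inside the model.

```python
IGNORE_INDEX = -100  # label value that the cross-entropy loss skips


def causal_lm_labels(input_ids, pad_id):
    """Copy each row of input IDs into labels, masking out padding positions."""
    return [
        [IGNORE_INDEX if tok == pad_id else tok for tok in row]
        for row in input_ids
    ]


# Two padded rows; 50256 is GPT-2's EOS ID, reused here as the pad token
input_ids = [[15496, 11, 50256], [389, 345, 30]]
labels = causal_lm_labels(input_ids, pad_id=50256)
print(labels)
```

One side effect of reusing the EOS token as the pad token: under this scheme any genuine EOS token in the data would be masked out of the loss as well.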

In [None]:
training_args = TrainingArguments(
    output_dir='/content/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=16,
    per_device_eval_batch_size=64,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

### Step 4: Fine-Tune the Model!

Now we're done! All we have to do is hit run and sit back!

In [None]:
trainer.train()

Step,Training Loss
100,4.0159
200,3.74
300,3.7038
400,3.566
500,3.6173
600,3.5903
700,3.5281
800,3.5182
900,3.4529
1000,3.4622


TrainOutput(global_step=1374, training_loss=3.579586334450623, metrics={'train_runtime': 296.507, 'train_samples_per_second': 74.093, 'train_steps_per_second': 4.634, 'total_flos': 392405005440000.0, 'train_loss': 3.579586334450623, 'epoch': 1.0})
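
As a quick sanity check on the numbers in this output, the step count and throughput can be reproduced from the dataset size and the batch size. This little calculation assumes a single GPU and no gradient accumulation, which matches the TrainingArguments used here.

```python
import math

num_examples = 21969    # rows in the crime_and_punish train split
batch_size = 16         # per_device_train_batch_size, assuming one GPU
num_epochs = 1
train_runtime = 296.507  # seconds, taken from the TrainOutput

# The last batch of the epoch is smaller, so we round up
steps_per_epoch = math.ceil(num_examples / batch_size)
total_steps = steps_per_epoch * num_epochs
print(total_steps)

# Throughput figures that should be close to the reported metrics
samples_per_second = num_examples * num_epochs / train_runtime
steps_per_second = total_steps / train_runtime
print(round(samples_per_second, 3), round(steps_per_second, 3))
```

The computed step count matches `global_step=1374`, and the two rates line up with `train_samples_per_second` and `train_steps_per_second`.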

### Step 5: Save the Model and use it to Generate!

Save your fine-tuned model and compare its output with regular GPT-2's output to see the difference for yourself!

In [None]:
trainer.save_model('./dostoevskypt2')

In [None]:
dostoevskypt2 = pipeline('text-generation', model='./dostoevskypt2', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

In [None]:
print(dostoevskypt2('Saint Petersburg is'))
print(gpt2('Saint Petersburg is'))

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


[{'generated_text': 'Saint Petersburg is no stranger to such men. One can hardly believe it, that people are living among them, like this, and such people are not going astray at a stone! I see it was a pleasure to do with him for a week'}]
[{'generated_text': "Saint Petersburg is home again for a very strong first-ever game and you should make no mistake about that. Their defense is still very good, but you have to be careful with how they run things now, or it's going to get worse."}]


## PERPLEXITY

12. (2 pts) Using the pointer [here](https://huggingface.co/transformers/perplexity.html), compute the perplexity of the GPT2 pre-trained model on the Wikipedia test set (you can keep the same hyperparameters as in the link)
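
As a reminder, perplexity is the exponential of the average negative log-likelihood the model assigns to each token. Here is a tiny plain-Python sketch of that definition; the per-token probabilities are made-up numbers, not model outputs.

```python
import math


def perplexity(token_probs):
    """exp(average negative log-likelihood) over a sequence of token probabilities."""
    nlls = [-math.log(p) for p in token_probs]
    return math.exp(sum(nlls) / len(nlls))


# A model that is always unsure between 4 equally likely tokens has perplexity 4
print(perplexity([0.25, 0.25, 0.25, 0.25]))
# A model that is usually confident and right has a lower perplexity
print(perplexity([0.9, 0.8, 0.95]))
```

Lower is better, and the value is a unitless score rather than a percentage.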

In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON WIKIPEDIA TEST SET

# ANSWERS BELOW:
# Load wiki test set
from datasets import load_dataset
import torch
from tqdm import tqdm

test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl
ppl(model, encodings.input_ids, stride)

Downloading builder script:   0%|          | 0.00/8.48k [00:00<?, ?B/s]

Downloading metadata:   0%|          | 0.00/6.84k [00:00<?, ?B/s]

Downloading readme:   0%|          | 0.00/9.62k [00:00<?, ?B/s]

Downloading data:   0%|          | 0.00/4.72M [00:00<?, ?B/s]

Generating test split:   0%|          | 0/4358 [00:00<?, ? examples/s]

Generating train split:   0%|          | 0/36718 [00:00<?, ? examples/s]

Generating validation split:   0%|          | 0/3760 [00:00<?, ? examples/s]

Token indices sequence length is longer than the specified maximum sequence length for this model (287644 > 1024). Running this sequence through the model will result in indexing errors
100%|██████████| 562/562 [01:06<00:00,  8.51it/s]


tensor(87.6765, device='cuda:0')

> The perplexity of GPT2 on the Wikipedia test set is 87.6765.
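
To see what the sliding-window loop in `ppl` is doing, the sketch below reproduces only its index arithmetic, with no model involved. `sliding_windows` is a helper name made up for this illustration; the sequence length is the token count reported in the warning above, and the defaults mirror GPT-2's 1024-token context with a stride of 512.

```python
def sliding_windows(seq_len, max_length=1024, stride=512):
    """Return (begin_loc, end_loc, trg_len) for each step, as computed in ppl()."""
    windows = []
    for i in range(0, seq_len, stride):
        begin_loc = max(i + stride - max_length, 0)  # start of the context window
        end_loc = min(i + stride, seq_len)           # end of the window
        trg_len = end_loc - i                        # tokens actually scored this step
        windows.append((begin_loc, end_loc, trg_len))
    return windows


windows = sliding_windows(287644)
print(len(windows))   # number of loop iterations
print(windows[:3])    # the first few windows
print(windows[-1])    # the last window scores fewer than `stride` tokens
```

Each token is scored exactly once, and from the third window on every scored token sees at least 512 tokens of preceding context. The number of windows matches the 562 iterations shown in the progress bar.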

13. (2 pts) Compute the  perplexity of the dostoevskypt2 model on Wikipedia test set




In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON WIKIPEDIA TEST SET
test = load_dataset("wikitext", "wikitext-2-raw-v1", split="test")
encodings = tokenizer("\n\n".join(test["text"]), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl

dostoevskypt2_model = GPT2LMHeadModel.from_pretrained('./dostoevskypt2').cuda()
ppl(dostoevskypt2_model, encodings.input_ids, stride)

100%|██████████| 562/562 [00:59<00:00,  9.49it/s]


tensor(68.3344, device='cuda:0')

> The perplexity of DOSTOEVSKYPT2 on the Wikipedia test set is 68.3344.

14. (2 pts) Compute the perplexity of the GPT2 pre-trained model on the Crime and Punishment train dataset

In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON CRIME AND PUNISHMENT TRAIN DATASET
test = load_dataset("crime_and_punish", split="train")
encodings = tokenizer("\n\n".join(test["line"]), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl
ppl(model, encodings.input_ids, stride)

100%|██████████| 705/705 [01:18<00:00,  9.02it/s]


tensor(66.9650, device='cuda:0')

> The perplexity of GPT2 on the Crime and Punishment train set is 66.9650.

15. (2 pts) Compute the **train** perplexity of the **dostoevskypt2** model




In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON CRIME AND PUNISHMENT TRAIN DATASET
test = load_dataset("crime_and_punish", split="train")
encodings = tokenizer("\n\n".join(test["line"]), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl

dostoevskypt2_model = GPT2LMHeadModel.from_pretrained('./dostoevskypt2').cuda()
ppl(dostoevskypt2_model, encodings.input_ids, stride)

100%|██████████| 705/705 [01:14<00:00,  9.48it/s]


tensor(63.4256, device='cuda:0')

> The perplexity of DOSTOEVSKYPT2 on the Crime and Punishment train set is 63.4256.



> (1 pt) Which model performs better on Crime and Punishment train set, vanilla GPT-2 or your dostoevskypt2 checkpoint?

> DOSTOEVSKYPT2 performs better on the Crime and Punishment train set than vanilla GPT-2: its perplexity is lower (63.4256 vs. 66.9650).

16. (2 pts) Compute perplexity of the GPT2 model on your raw pride and prejudice text.

In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF GPT2 ON PRIDE AND PREJUDICE TEXT
import os
from google.colab import drive
gdrive_dir = '/content/gdrive/'
data_dir = os.path.join(gdrive_dir, "My Drive/CS505 Datasets/CS505_HW_data/7_3/")
filename = data_dir+'prideAndPrejudice.txt'
drive.mount(gdrive_dir, force_remount=True)
print(filename)
with open(filename, "r") as f:
    text = f.read().split('\n')

encodings = tokenizer("\n\n".join(text), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl
ppl(model, encodings.input_ids, stride)

Mounted at /content/gdrive/
/content/gdrive/My Drive/CS505 Datasets/CS505_HW_data/7_3/prideAndPrejudice.txt


100%|██████████| 304/304 [00:36<00:00,  8.36it/s]


tensor(48.4753, device='cuda:0')

> The perplexity of GPT2 on the raw Pride and Prejudice text is 48.4753.

17. (2 pts) Compute perplexity of the **dostoevskypt2** model on your raw pride and prejudice text.

In [None]:
## YOUR CODE HERE - FOR COMPUTING PERPLEXITY OF DOSTOEVSKYPT2 ON PRIDE AND PREJUDICE TEXT

with open(filename, "r") as f:
    text = f.read().split('\n')

encodings = tokenizer("\n\n".join(text), return_tensors="pt")
max_length = model.config.n_positions
stride = 512

# Define a function for ppl
def ppl(model, input_ids_all, stride):
  nlls = []
  for i in tqdm(range(0, input_ids_all.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, input_ids_all.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = input_ids_all[:, begin_loc:end_loc].to("cuda:0")
      target_ids = input_ids.clone()
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = model(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl

dostoevskypt2_model = GPT2LMHeadModel.from_pretrained('./dostoevskypt2').cuda()
ppl(dostoevskypt2_model, encodings.input_ids, stride)

100%|██████████| 304/304 [00:32<00:00,  9.48it/s]


tensor(41.2177, device='cuda:0')

> The perplexity of DOSTOEVSKYPT2 on the raw Pride and Prejudice text is 41.2177.

### Now It's Your Turn!

**Your job is to fine-tune GPT2 one more time with your choice of fine-tuning dataset.**

**For the fine-tuned model you create, you should clearly demonstrate (through visible generation outputs and analysis) that your fine-tuned model follows the desired style better than vanilla GPT2.**

Please make sure to give a brief description of the dataset you chose.

In order to see which datasets are available for download, run the cell below. Pick one that you think would be interesting!

In [None]:
# datasets_list = list_datasets()
# print(', '.join(dataset for dataset in datasets_list))

### Tips
- Most of the datasets hosted by HuggingFace are not meant for Causal LM fine-tuning. Make sure you preprocess them accordingly if you want to use them.
- In order to check out information about a dataset hosted by HuggingFace you can use [this web viewer](https://huggingface.co/datasets/viewer/?dataset=crime_and_punish). Try to avoid downloading a dataset that's too big!
- You will likely have to change the custom 'encode' function for each new dataset you want to fine-tune on. You need to change batch['line'] to instead index with the correct column label for your specific dataset (it probably won't be called 'line').
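
One way to avoid rewriting 'encode' for every dataset is to build it from the column name. The sketch below is just one possible approach, not something the notebook requires: `make_encode` is a hypothetical helper, and `fake_tokenizer` is a stand-in so the example runs without downloading anything. In the notebook you would pass the real GPT-2 tokenizer instead.

```python
def make_encode(tokenizer, column):
    """Build an encode function for Dataset.map that reads text from `column`."""
    def encode(batch):
        # Strip newline characters, then tokenize the whole batch at once
        texts = [x.strip('\n\r') for x in batch[column]]
        return tokenizer(texts, truncation=True, padding=True)
    return encode


# Stand-in tokenizer (hypothetical): "encodes" each word as its length
def fake_tokenizer(texts, truncation=True, padding=True):
    return {"input_ids": [[len(word) for word in t.split()] for t in texts]}


encode = make_encode(fake_tokenizer, column='text')
print(encode({'text': ['Speak, speak.\n', 'Resolved. resolved.']}))
```

For a dataset whose text lives in a 'text' column you would then call something like `dataset.map(make_encode(tokenizer, 'text'), batched=True)`.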

### Useful Links
[load_dataset Documentation](https://huggingface.co/docs/datasets/package_reference/loading_methods.html#datasets.load_dataset)

[Trainer Documentation](https://huggingface.co/transformers/main_classes/trainer.html#id1)

[Example: Fine-Tuning BERT for Esperanto](https://colab.research.google.com/github/huggingface/blog/blob/master/notebooks/01_how_to_train.ipynb#scrollTo=zTgWPa9Dipk2)

[Example: Fine-Tuning for IMDb Classification](https://colab.research.google.com/drive/1-JIJlao4dI-Ilww_NnTc0rxtp-ymgDgM?usp=sharing#scrollTo=5DEWNilys9Ty)


#### 18. Dataset \#1

In [1]:
from transformers import Trainer, TrainingArguments, DataCollatorForLanguageModeling
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import datasets
from datasets import load_dataset, list_datasets
from transformers import pipeline
from datasets import Dataset
import os
import numpy as np
os.environ['HF_DATASETS_CACHE']="./huggingface_cache"

In [2]:
data = load_dataset('tiny_shakespeare',split='train', cache_dir="./huggingface_cache", num_proc=os.cpu_count())
#data = load_dataset('gutenberg',split='train', cache_dir="./huggingface_cache", num_proc=os.cpu_count())


In [3]:
print(data)

Dataset({
    features: ['text'],
    num_rows: 1
})


In [4]:
type(data['text'][0])

str

In [None]:
# Text Preprocessing
import re

original_text = data["text"][0]
# Speeches are separated by blank lines
speeches = original_text.split('\n\n')
# Drop the first line of each speech (the speaker label)
split_text = [re.sub(r'^.*?\n', '', speech) for speech in speeches]
# Create a new dataset from the first 4000 speeches
new_dataset = Dataset.from_dict({
    "text": split_text[:4000]
})

In [6]:
print(new_dataset)

Dataset({
    features: ['text'],
    num_rows: 4000
})


In [7]:
print(new_dataset['text'][:5])

['Before we proceed any further, hear me speak.', 'Speak, speak.', 'You are all resolved rather to die than to famish?', 'Resolved. resolved.', 'First, you know Caius Marcius is chief enemy to the people.']
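
To make the preprocessing step concrete, here is a minimal sketch that applies the same split-and-strip logic to a tiny inline sample. The sample layout (speeches separated by blank lines, each starting with a speaker label line) is an assumption inferred from the output above, and `strip_speaker_labels` is a hypothetical helper name.

```python
import re

# A tiny inline sample in the assumed raw layout:
# speeches separated by blank lines, each starting with a speaker label line.
sample = (
    "First Citizen:\n"
    "Before we proceed any further, hear me speak.\n"
    "\n"
    "All:\n"
    "Speak, speak.\n"
    "\n"
    "First Citizen:\n"
    "You are all resolved rather to die than to famish?"
)


def strip_speaker_labels(raw_text):
    """Split on blank lines, then drop the first line of every speech."""
    speeches = raw_text.split("\n\n")
    return [re.sub(r"^.*?\n", "", speech, count=1) for speech in speeches]


lines = strip_speaker_labels(sample)
print(lines)
```

Note that a chunk containing no newline at all is left unchanged by the substitution, so a stray single-line chunk would keep its text.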


In [8]:
## YOUR CODE HERE - FOR FINE-TUNING GPT2 ON DATASET
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.pad_token = tokenizer.eos_token
model = GPT2LMHeadModel.from_pretrained('gpt2').cuda()
data_collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm=False)


In [9]:
def encode(batch):
    # Tokenize the 'text' column, truncating to the model's max length and padding within the batch
    return tokenizer(list(batch['text']), truncation=True, padding=True)

In [10]:
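import os

num_workers = os.cpu_count()

# Tokenize the whole dataset in one batch per worker; caching is disabled so that
# changes to `encode` always take effect
processed = new_dataset.map(
    encode,
    batched=True,
    batch_size=len(new_dataset),
    load_from_cache_file=False,
    num_proc=num_workers
)
# Return PyTorch tensors for the columns the Trainer needs
processed.set_format('torch', columns=['input_ids', 'attention_mask'])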


In [11]:
training_args = TrainingArguments(
    output_dir='/content/',
    overwrite_output_dir=True,
    num_train_epochs=1,
    per_device_train_batch_size=2,
    per_device_eval_batch_size=2,
    logging_steps=100,
    weight_decay=0.01,
    logging_dir='./logs',
)

trainer = Trainer(
    model=model,
    tokenizer=tokenizer,
    args=training_args,
    data_collator=data_collator,
    train_dataset=processed,
)

In [12]:
trainer.train()

| Step | Training Loss |
|------|---------------|
| 100 | 4.6208 |
| 200 | 4.4572 |
| 300 | 4.5321 |
| 400 | 4.3321 |
| 500 | 4.3326 |
| 600 | 4.2703 |
| 700 | 4.296 |
| 800 | 4.1571 |
| 900 | 4.2247 |
| 1000 | 4.1753 |


TrainOutput(global_step=2000, training_loss=4.257115264892578, metrics={'train_runtime': 1009.815, 'train_samples_per_second': 3.961, 'train_steps_per_second': 1.981, 'total_flos': 1485061429248000.0, 'train_loss': 4.257115264892578, 'epoch': 1.0})
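
As a sanity check on the numbers in `TrainOutput`, the sketch below recomputes the step count and throughput from the training setup and converts the mean training loss into a perplexity. It assumes a single device and no gradient accumulation. The perplexity is only a rough indicator, because the reported loss is averaged over the whole run while the model was still changing.

```python
import math

# Values copied from the training setup and the TrainOutput above.
num_examples = 4000
per_device_batch_size = 2
num_epochs = 1
train_loss = 4.257115264892578
train_runtime_s = 1009.815

# One optimizer step per batch (assumes one device, no gradient accumulation).
steps = math.ceil(num_examples / per_device_batch_size) * num_epochs

# Throughput in examples per second over the whole run.
samples_per_second = num_examples * num_epochs / train_runtime_s

# Perplexity is exp(mean cross-entropy loss).
train_perplexity = math.exp(train_loss)

print("steps:", steps)
print("samples/s:", round(samples_per_second, 3))
print("train perplexity:", round(train_perplexity, 1))
```

The recomputed step count and throughput agree with `global_step=2000` and `train_samples_per_second` in the output above.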

In [13]:
trainer.save_model('./shakespeare')

(4 pts) YOUR ANSWER HERE - BRIEF DESCRIPTION OF THE DATASET YOU CHOSE

The dataset chosen is `tiny_shakespeare` from the HuggingFace datasets library, a compilation of dialogue from Shakespeare's plays. It was chosen because it is small, which keeps fine-tuning computationally cheap. The raw train split has a single feature, 'text', and a single row containing all of the play dialogue as one string. Preprocessing split that string on blank lines into individual speeches, removed the speaker label at the start of each speech, and kept the first 4,000 speeches as separate rows, which served as the training input for fine-tuning GPT-2.

In [15]:
## YOUR CODE HERE - FOR GENERATION WITH YOUR FINE-TUNED MODEL AND COMPARISON WITH REGULAR GPT2
our_model = pipeline('text-generation', model='./shakespeare', device=0)
gpt2 = pipeline('text-generation', model='gpt2', device=0)

print("Our Model:", our_model("Before we proceed")[0]['generated_text'])
print("GPT2:", gpt2("Before we proceed")[0]['generated_text'])

Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.


Our Model: Before we proceed. Farewell.
And hear me speak to him. I'll be happy.
My lady, when she's gone,
I'll be satisfied to tell her
What lies in her head which should be told her,

GPT2: Before we proceed to the next page, I will consider what I know about 'the black hole', which, in today's opinion, is the brightest bright spot in the universe, far larger than the black hole itself. As in the dark matter field


(5 pts) YOUR ANSWER HERE - COMPARISON OF YOUR DATASET'S FINE-TUNED OUTPUT VS NON-FINE-TUNED OUTPUT

As the output above shows, the fine-tuned model generates text that closely resembles Shakespearean dialogue, while the stock GPT-2 continues the prompt as generic modern prose (here, a passage about black holes) with no such style. Phrases like "My lady", the short verse-like line breaks, and the direct form of address are representative of the fine-tuning data.
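
A comparison like this is easier to repeat across several prompts if both models are called through one helper. The sketch below is a hypothetical convenience function (`compare_generations` is not a HuggingFace API); it only assumes that each generator returns a list of dicts with a `'generated_text'` key, as the pipelines above do. The two stand-in generators exist only so the sketch runs without a GPU or model files.

```python
def compare_generations(prompt, generators, **generate_kwargs):
    """Run the same prompt through several generators and collect the text.

    `generators` maps a display name to a callable that behaves like a
    text-generation pipeline: it returns a list of dicts, each with a
    'generated_text' key.
    """
    results = {}
    for name, generator in generators.items():
        outputs = generator(prompt, **generate_kwargs)
        results[name] = outputs[0]["generated_text"]
    return results


# Stand-in generators (hypothetical); they just tag the prompt instead of generating.
def fake_finetuned(prompt, **kwargs):
    return [{"generated_text": prompt + " [fine-tuned continuation]"}]


def fake_base(prompt, **kwargs):
    return [{"generated_text": prompt + " [base continuation]"}]


results = compare_generations(
    "Before we proceed",
    {"Our Model": fake_finetuned, "GPT2": fake_base},
)
for name, text in results.items():
    print(f"{name}: {text}")
```

In the notebook, the same helper could be called with the real pipelines, for example `compare_generations("Before we proceed", {"Our Model": our_model, "GPT2": gpt2}, max_new_tokens=40)`, so that both models receive identical generation arguments.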