<a href="https://colab.research.google.com/github/DataJenius/DecisionTrees/blob/master/2022_02_04__experiment__GPT2_perplexity_repetition.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

In [None]:
#----------------------------------------
# install requirements - 🤗 Transformers
#----------------------------------------
!pip install -q git+https://github.com/huggingface/transformers.git
!pip install -q tensorflow==2.1
!pip install tensorflow-gpu

# test

ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
tensorflow-probability 0.15.0 requires g

In [None]:
#------------------------
# load our dependencies
#------------------------
import tensorflow as tf
from transformers import GPT2LMHeadModel, TFGPT2LMHeadModel, GPT2Tokenizer
import torch
from tqdm import tqdm
import nltk
from nltk.collocations import TrigramCollocationFinder
from nltk.tokenize import RegexpTokenizer
import string

In [None]:
#----------------------------------------
# load our models and tokenizer from 🤗
#----------------------------------------
# using GPT2-large
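# use the GPU when available, otherwise fall back to the CPU
device = "cuda" if torch.cuda.is_available() else "cpu"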
model_id = "gpt2-large"

# load our tokenizer
tokenizer = GPT2Tokenizer.from_pretrained(model_id)

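# using two models: a TensorFlow model (gmodel) to generate text
# and a PyTorch model (emodel) to evaluate perplexity
# load models with the EOS token as PAD token to avoid warnings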
gmodel = TFGPT2LMHeadModel.from_pretrained(model_id, pad_token_id=tokenizer.eos_token_id)
emodel = GPT2LMHeadModel.from_pretrained(model_id, pad_token_id=tokenizer.eos_token_id).to(device)

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.



In [None]:
#------------------
# Greedy method
#------------------

# tokenize our text prompt
input_ids = tokenizer.encode('I enjoy walking with my cute dog', return_tensors='tf')

# generate text
# greedy decoding is deterministic, so only one sequence can be returned;
# asking for num_return_sequences > 1 without sampling or beam search fails
gmodel_output = gmodel.generate(input_ids,
                                max_length=250)

# decode the output back into text
output_text = tokenizer.decode(gmodel_output[0], skip_special_tokens=True)

# show generated text
print(output_text)
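
Greedy decoding picks the single most likely next token at every step, so the same context always produces the same continuation. The sketch below illustrates this with a made-up next-token table (`TOY_MODEL` and its probabilities are hypothetical, not taken from GPT-2); it is only meant to show why a greedy decoder can fall into a loop once it revisits a token.

```python
# Toy next-token distribution: for each token, the probability of each successor.
# These numbers are made up purely for illustration.
TOY_MODEL = {
    "I":    {"walk": 0.6, "run": 0.4},
    "walk": {"my": 0.7, "the": 0.3},
    "my":   {"dog": 0.8, "cat": 0.2},
    "dog":  {"and": 0.5, "I": 0.3, "walk": 0.2},
    "and":  {"I": 0.9, "my": 0.1},
}

def greedy_decode(start, max_length):
    """Always pick the single most likely next token (argmax)."""
    tokens = [start]
    while len(tokens) < max_length:
        successors = TOY_MODEL[tokens[-1]]
        tokens.append(max(successors, key=successors.get))
    return tokens

generated = greedy_decode("I", max_length=12)
print(" ".join(generated))
```

Because there is no randomness, calling `greedy_decode` twice with the same arguments returns the same tokens, which is also why a greedy decoder has only one distinct sequence to return.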

In [None]:
#-----------------------
# Calculate Perplexity
#-----------------------
def calculate_perplexity(my_text, emodel, device, stride=512):
  # https://huggingface.co/docs/transformers/perplexity
  # when stride >= the number of tokens in my_text (and the text fits in the model’s context window), the loop runs once and every token is conditioned on the ENTIRE preceding subsequence
  # otherwise perplexity is evaluated with a sliding-window strategy, moving forward by `stride` tokens each step

  # encode our text
  my_encodings = tokenizer(my_text, return_tensors="pt")

  # note: the generated text is capped at 250 tokens, so any stride >= 250 results in a single pass
  # maximum context size the model can process (1024 for GPT-2)
  max_length = emodel.config.n_positions

  nlls = []
  for i in tqdm(range(0, my_encodings.input_ids.size(1), stride)):
      begin_loc = max(i + stride - max_length, 0)
      end_loc = min(i + stride, my_encodings.input_ids.size(1))
      trg_len = end_loc - i  # may be different from stride on last loop
      input_ids = my_encodings.input_ids[:, begin_loc:end_loc].to(device)
      target_ids = input_ids.clone()
      # mask the context-only tokens with -100 so they are ignored by the loss
      target_ids[:, :-trg_len] = -100

      with torch.no_grad():
          outputs = emodel(input_ids, labels=target_ids)
          neg_log_likelihood = outputs[0] * trg_len

      nlls.append(neg_log_likelihood)

  # perplexity = exp(total negative log-likelihood / number of tokens)
  ppl = torch.exp(torch.stack(nlls).sum() / end_loc)
  return ppl.item()
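
To see what the sliding window does, the index arithmetic from the loop can be pulled out on its own. This is a minimal sketch, not part of the original notebook: `window_bounds` is a hypothetical helper that repeats the `begin_loc` / `end_loc` / `trg_len` calculation without touching the model, and the 2500-token text is an assumed example.

```python
def window_bounds(n_tokens, stride, max_length):
    """Return (begin_loc, end_loc, trg_len) for each step of the loop."""
    bounds = []
    for i in range(0, n_tokens, stride):
        begin_loc = max(i + stride - max_length, 0)
        end_loc = min(i + stride, n_tokens)
        trg_len = end_loc - i  # tokens actually scored in this window
        bounds.append((begin_loc, end_loc, trg_len))
    return bounds

# a 250-token text with a 1024-token context: one window covers everything
print(window_bounds(n_tokens=250, stride=1024, max_length=1024))

# a hypothetical 2500-token text with stride 512: overlapping windows,
# each scoring only the last trg_len tokens it contains
for bounds in window_bounds(n_tokens=2500, stride=512, max_length=1024):
    print(bounds)
```

The `trg_len` values add up to the total number of tokens, so every token is scored exactly once even though the windows overlap.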

In [None]:
# run ppl function
ppl = calculate_perplexity(output_text, emodel, device, 1024)
ppl

100%|██████████| 1/1 [00:00<00:00, 47.61it/s]


1.3632889986038208
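
Perplexity is the exponential of the average negative log-likelihood per token, so a value of about 1.36 corresponds to a geometric-mean token probability of roughly 1 / 1.36 ≈ 0.73. The sketch below works through the same formula on hand-picked probabilities; the `confident` and `uncertain` lists are hypothetical values, not GPT-2 outputs.

```python
import math

def perplexity(token_probs):
    """exp of the average negative log-likelihood of the tokens."""
    nll = [-math.log(p) for p in token_probs]
    return math.exp(sum(nll) / len(nll))

# hypothetical per-token probabilities, not taken from GPT-2
confident = [0.9, 0.8, 0.95, 0.85]
uncertain = [0.1, 0.05, 0.2, 0.1]

print(perplexity(confident))   # close to 1: the text is highly predictable
print(perplexity(uncertain))   # much larger: the text surprises the model

# a model that gives every token probability 0.5 has perplexity 2 (up to rounding)
print(perplexity([0.5] * 10))
```

Text that a model generated greedily and then scores itself tends to sit near the "confident" end, which is consistent with the low value above.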

In [None]:
#------------------------
# Calculate Repetitions
#------------------------
def calculate_repetitions(my_text, debug=False):
  # ignore capitalization and punctuation
  clean_text = my_text.lower().translate(str.maketrans('', '', string.punctuation))

  # word-based tokens (split on non-word characters)
  # named word_tokenizer so it is not confused with the global GPT-2 tokenizer
  word_tokenizer = nltk.RegexpTokenizer(r"\w+")
  tokens = word_tokenizer.tokenize(clean_text)

  # every time a trigram is repeated, add 1 to total_repetitions
  finder = TrigramCollocationFinder.from_words(tokens)
  total_repetitions = 0
  for trigram, count in finder.ngram_fd.items():
    if count > 1:  # only care about repetition
      repetitions = count - 1  # the first use of a trigram is not a repetition
      total_repetitions += repetitions

      # just for debugging
      if debug:
        print(trigram)
        print('repeated '+str(repetitions)+' time(s)...')
        print('found '+str(total_repetitions)+' total repetition(s)...')
  return total_repetitions
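
The same count can be cross-checked without NLTK. This is a minimal sketch using only the standard library; `count_trigram_repetitions` is a hypothetical name, and it assumes that `re.findall(r"\w+", ...)` splits words the same way as the `RegexpTokenizer` used above.

```python
import re
import string
from collections import Counter

def count_trigram_repetitions(text):
    # ignore capitalization and punctuation, then split into words
    clean = text.lower().translate(str.maketrans('', '', string.punctuation))
    tokens = re.findall(r"\w+", clean)

    # count every run of three consecutive words
    trigrams = Counter(zip(tokens, tokens[1:], tokens[2:]))

    # the first use of a trigram is not a repetition
    return sum(count - 1 for count in trigrams.values())

# ('yes', 'we', 'can') appears twice, so this prints 1
print(count_trigram_repetitions("Yes we can change! Yes we can stop!"))
```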

In [None]:
# run repetitions function
rpt = calculate_repetitions("\"Forty-two!\" yelled Loonquawl. \"Is that all you’ve got to show for seven and a half million years’ work?\"\n\n\"I checked it very thoroughly,\" said the computer, \"and that quite definitely is the answer. I think the problem, to be quite honest with you, is that you've never actually known what the question is.\"\n\n\"But it was the Great Question! The Ultimate Question of Life, the Universe and Everything!\" howled Loonquawl.\n\n\"Yes,\" said Deep Thought with the air of one who suffers fools gladly, \"but what actually is it?\"\n\nA slow stupefied silence crept over the men as they stared at the computer and then at each other. \n\n\"Well, you know, it’s just Everything… Everything…\" offered Phouchg weakly.\n\n\"Exactly!\" said Deep Thought. \"So once you do know what the question actually is, you’ll know what the answer means.\"", True)
#rpt = calculate_repetitions("Yes we can change! Yes we can stop!", True)
#rpt = calculate_repetitions(output_text, True)
rpt

('the', 'computer', 'and')
repeated 1 time(s)...
found 1 total repetition(s)...
('what', 'the', 'question')
repeated 1 time(s)...
found 2 total repetition(s)...
('said', 'deep', 'thought')
repeated 1 time(s)...
found 3 total repetition(s)...
('know', 'what', 'the')
repeated 1 time(s)...
found 4 total repetition(s)...


4

In [None]:
# preview the cleaned text that calculate_repetitions works on
# (lowercased, with punctuation stripped)
output_text.lower().translate(str.maketrans('', '', string.punctuation))

'i enjoy walking with my cute dog but im not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not sure if ill ever be able to walk with my dog im not sure if ill ever be able to walk with my dog\n\nim not'

In [None]:
# show generated text
print(output_text)

I enjoy walking with my cute dog, but I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not sure if I'll ever be able to walk with my dog. I'm not sure if I'll ever be able to walk with my dog.

I'm not
