### Notebook Description ###
Sandbox notebook to play around with some of the code we'll be using

In [2]:
%load_ext autoreload
%autoreload 2



In [None]:
from transformers import GPT2Tokenizer, GPT2LMHeadModel

In [None]:
from perspective_api import PerspectiveApiScorer

### Perspective API Playground ###

Please note this API is rate limited
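
Because of the rate limit, it may help to wrap calls in a retry with exponential backoff. The following is only a sketch: `call_with_backoff` is a hypothetical helper, and since it is not clear here which exception the scorer raises when throttled, the sketch catches `Exception` broadly. The demo uses a stand-in function with a made-up return value, not real Perspective API output.

```python
import time


def call_with_backoff(fn, *args, max_retries=3, base_delay=1.0,
                      sleep=time.sleep, **kwargs):
    """Call fn, retrying with exponentially growing delays when it raises."""
    for attempt in range(max_retries + 1):
        try:
            return fn(*args, **kwargs)
        except Exception:  # assumption: the real rate-limit exception type is unknown
            if attempt == max_retries:
                raise
            sleep(base_delay * 2 ** attempt)


# Demo with a stand-in function that fails twice before succeeding.
calls = {"n": 0}


def flaky_scorer(text):
    calls["n"] += 1
    if calls["n"] < 3:
        raise RuntimeError("simulated rate limit")
    return {"toxicity": 0.5}  # placeholder value, not real API output


delays = []  # record the requested delays instead of actually sleeping
result = call_with_backoff(flaky_scorer, "hello", sleep=delays.append)
print(result, delays)
```

With the real scorer this would presumably be used as `call_with_backoff(scorer.get_scores, "I strongly dislike you!")`.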

In [None]:
import os
API_KEY = os.environ["PERSPECTIVE_API_KEY"]  # assumed variable name; keeps the key out of the notebook

scorer = PerspectiveApiScorer(api_key=API_KEY)

In [None]:
scorer.get_scores("I strongly dislike you!")

### Rewriting model.cond_log_prob() and model.score() ###

Unfortunately, the wrapper classes in the Geva paper's code use the PyTorch version of GPT-2 from Hugging Face, while the BIG-bench tests all use the TensorFlow version.

The tasks we are interested in (BBQlite, Gender Sensitivity English, and Diverse Social Bias) rely on the cond_log_prob() and score() methods of the BIG-bench wrapper class. I think we would just need to re-implement these two methods in PyTorch.
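
As a rough picture of what a re-implemented score() would need to compute, the sketch below sums the log-probability of each target token under the model's next-token distribution. This assumes score() returns a summed target log-likelihood, which should be checked against the BIG-bench source. `sequence_score` and the toy logits are illustrative only and use plain Python rather than the real model.

```python
import math


def log_softmax(logits):
    """Numerically stable log-softmax over one list of logits."""
    m = max(logits)
    lse = m + math.log(sum(math.exp(x - m) for x in logits))
    return [x - lse for x in logits]


def sequence_score(step_logits, target_ids):
    """Sum log P(target token) over the target positions."""
    total = 0.0
    for logits, token_id in zip(step_logits, target_ids):
        total += log_softmax(logits)[token_id]
    return total


# Toy example: vocabulary of 3 tokens, target of 2 tokens.
toy_logits = [
    [0.0, 0.0, 0.0],            # uniform: each token has probability 1/3
    [0.0, math.log(2.0), 0.0],  # probabilities 1/4, 1/2, 1/4
]
toy_target = [2, 1]
toy_score = sequence_score(toy_logits, toy_target)
print(toy_score)  # log(1/3) + log(1/2)
```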

Below are the wrapper classes for gpt2-medium from the Geva paper and from BIG-bench.

In [3]:
from model_wrappers.gpt2_wrapper import GPT2Wrapper
pytorch_gpt2 = GPT2Wrapper(model_name="gpt2-medium", use_cuda=False)



In [25]:
from bigbench.models.huggingface_models import _HFTransformerModel, BIGBenchHFModel
bigbench_gpt2 = BIGBenchHFModel("gpt2-medium")

All model checkpoint layers were used when initializing TFGPT2LMHeadModel.

All the layers of TFGPT2LMHeadModel were initialized from the model checkpoint at gpt2-medium.
If your task is similar to the task the model of the checkpoint was trained on, you can already use TFGPT2LMHeadModel for predictions without further training.


### Goal ###

We want the output from pytorch_gpt2.cond_log_prob() and pytorch_gpt2.score() to match the output from bigbench_gpt2.cond_log_prob() and bigbench_gpt2.score(), respectively.

In [42]:
prompt = "What color is the sky? Answer: blue\n" "What color is grass? Answer:"
choices = ("red", "blue", "green")

scores = bigbench_gpt2.cond_log_prob(inputs=prompt, targets=choices)

print("\n")
print(f"prompt:\n{prompt}")
print(f"scores:")
for c, s in zip(choices, scores):
    print(f"  {c:>8}: {s:0.2f}")



prompt:
What color is the sky? Answer: blue
What color is grass? Answer:
scores:
       red: -1.29
      blue: -2.09
     green: -0.51


In [13]:
batch_inputs = ['What color is the sky? Answer: blue\nWhat color is grass? Answer:', 'What color is the sky? Answer: blue\nWhat color is grass? Answer:', 'What color is the sky? Answer: blue\nWhat color is grass? Answer:']
batch_targets = ['red', 'blue', 'green']
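
The batch above repeats the prompt once per candidate target. A minimal sketch of building such a flat batch from prompts and choices, and of regrouping flat scores afterwards, could look like this. `flatten_pairs` and `regroup` are hypothetical helper names, not part of either wrapper.

```python
def flatten_pairs(prompts, choices):
    """Repeat each prompt once per choice, producing parallel flat lists."""
    batch_inputs, batch_targets = [], []
    for prompt, prompt_choices in zip(prompts, choices):
        for choice in prompt_choices:
            batch_inputs.append(prompt)
            batch_targets.append(choice)
    return batch_inputs, batch_targets


def regroup(flat_scores, choices):
    """Split a flat list of scores back into one list per prompt."""
    grouped, start = [], 0
    for prompt_choices in choices:
        end = start + len(prompt_choices)
        grouped.append(list(flat_scores[start:end]))
        start = end
    return grouped


example_prompt = "What color is the sky? Answer: blue\nWhat color is grass? Answer:"
flat_inputs, flat_targets = flatten_pairs([example_prompt], [("red", "blue", "green")])
print(flat_targets)
```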

In [38]:
pytorch_loss = pytorch_gpt2.score(inputs=batch_inputs, targets=batch_targets)



In [39]:
bigbench_loss = bigbench_gpt2._model.score(inputs=batch_inputs, targets=batch_targets)

It looks like the scores the two models produce differ in the fourth decimal place (by about 2e-4 in the values printed below). Thoughts on whether this is OK?

In [40]:
print(pytorch_loss)
print(bigbench_loss)

[-10.42503547668457, -11.222871780395508, -9.64091682434082]
[-10.4252290725708, -11.22307300567627, -9.641101837158203]
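
One way to make "is this OK?" concrete is to compare the two lists against an explicit tolerance and check that the ranking of the choices is unchanged. The sketch below uses the values printed above; the tolerance of 1e-3 is an assumption, not a project requirement.

```python
import math

pytorch_loss = [-10.42503547668457, -11.222871780395508, -9.64091682434082]
bigbench_loss = [-10.4252290725708, -11.22307300567627, -9.641101837158203]

ABS_TOL = 1e-3  # assumed acceptable absolute difference

abs_diffs = [abs(p - b) for p, b in zip(pytorch_loss, bigbench_loss)]
max_abs_diff = max(abs_diffs)
all_close = all(
    math.isclose(p, b, abs_tol=ABS_TOL)
    for p, b in zip(pytorch_loss, bigbench_loss)
)

# Index of the best-scoring choice under each implementation.
best_pytorch = max(range(len(pytorch_loss)), key=pytorch_loss.__getitem__)
best_bigbench = max(range(len(bigbench_loss)), key=bigbench_loss.__getitem__)

print(f"max abs diff: {max_abs_diff:.2e}, all close: {all_close}")
print(f"same best choice: {best_pytorch == best_bigbench}")
```

If the tasks only compare choices against each other, a gap of this size seems unlikely to change any ranking, though that is worth confirming on the actual task data.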


In [47]:
pytorch_scores = pytorch_gpt2.cond_log_prob(inputs=prompt, targets=choices)
print("\n")
print(f"prompt:\n{prompt}")
print(f"scores:")
for c, s in zip(choices, pytorch_scores):
    print(f"  {c:>8}: {s:0.2f}")





prompt:
What color is the sky? Answer: blue
What color is grass? Answer:
scores:
       red: -1.29
      blue: -2.09
     green: -0.51
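
The cond_log_prob() values appear to be the score() values normalized with a log-softmax across the choices. That is an inference from the numbers in this notebook rather than something confirmed from the wrapper source. A small check using the PyTorch scores printed earlier:

```python
import math


def log_softmax(scores):
    """Numerically stable log-softmax over a list of scores."""
    m = max(scores)
    lse = m + math.log(sum(math.exp(s - m) for s in scores))
    return [s - lse for s in scores]


# Raw score() values for ("red", "blue", "green") from the PyTorch wrapper.
raw_scores = [-10.42503547668457, -11.222871780395508, -9.64091682434082]
normalized = log_softmax(raw_scores)

for choice, value in zip(("red", "blue", "green"), normalized):
    print(f"  {choice:>8}: {value:0.2f}")
```

The normalized values round to the same -1.29, -2.09, and -0.51 shown above.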




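Now let's try passing in several prompts at once! Note that these prompts end with a trailing space after "Answer:", unlike the single-prompt example above, which likely explains why the scores for the color prompt differ from the earlier run.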

In [48]:
prompts = [
    f"What color is the sky? Answer: blue\n" f"What color is grass? Answer: ",
    f"What is 1+1? Answer: 2\n" f"What is 2+2? Answer: ",
]
choices = [("red", "blue", "green"), ("1", "2", "3", "4")]

scores = pytorch_gpt2.cond_log_prob(inputs=prompts, targets=choices)

for p, c, s in zip(prompts, choices, scores):
    print("\n")
    print(f"prompt:\n{p}")
    print(f"scores:")
    for ci, si in zip(c, s):
        print(f"  {ci:>8}: {si:0.2f}")



prompt:
What color is the sky? Answer: blue
What color is grass? Answer: 
scores:
       red: -0.14
      blue: -2.36
     green: -3.41


prompt:
What is 1+1? Answer: 2
What is 2+2? Answer: 
scores:
         1: -1.48
         2: -0.96
         3: -1.01
         4: -3.79
