[![Open In Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/salesforce/GeDi/blob/master/GeDi_guided_GPT_2_XL.ipynb)

Official implementation of generation with the topic GeDi (pronounced *Jedi*) model, based on our paper [GeDi: Generative Discriminator Guided Sequence Generation](https://arxiv.org/abs/2009.06367).

See our GitHub repository for more options, such as detoxification and sentiment control: https://github.com/salesforce/GeDi

In [None]:
!git clone https://github.com/salesforce/GeDi.git

fatal: destination path 'GeDi' already exists and is not an empty directory.


In [2]:
%cd GeDi

d:\d-project\detox\GeDi


In [None]:
'''Installing transformers v2.8'''

!pip install transformers==2.8 datasets jsonlines
!pip install -r hf_requirements.txt

'''Downloading GeDi topic model checkpoints'''
# !wget https://storage.googleapis.com/sfr-gedi-data/gedi_detoxifier.zip
# !unzip gedi_detoxifier.zip

# !wget https://storage.googleapis.com/sfr-gedi-data/gedi_topic.zip

# with zipfile.ZipFile('gedi_topic.zip', 'r') as zip_ref:
#     zip_ref.extractall('./')

In [3]:
!wget https://storage.googleapis.com/sfr-gedi-data/gedi_sentiment.zip
!unzip gedi_sentiment.zip

'wget' is not recognized as an internal or external command,
operable program or batch file.
'unzip' is not recognized as an internal or external command,
operable program or batch file.


In [4]:
import jsonlines
import numpy as np
import torch
from datasets import load_dataset
from tqdm.auto import tqdm
from transformers import GPT2Config, GPT2Tokenizer, pipeline

# GeDi's own GPT-2 implementation, imported from the cloned GeDi directory
from modeling_gpt2 import GPT2LMHeadModel

In [5]:
!git clone https://huggingface.co/heegyu/gpt2-emotion

Cloning into 'gpt2-emotion'...
Updating files: 100% (12/12), done.
Filtering content: 100% (2/2), 486.76 MiB | 10.04 MiB/s, done.


In [6]:
mode = "detoxifier"
code_desired = "dirty"
code_undesired = "clean"
model_type = 'gpt2'
gen_type = "gedi"
# gen_model_name_or_path = "gpt2"
gen_model_name_or_path = "./gpt2-emotion"

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

MODEL_CLASSES = {"gpt2": (GPT2Config, GPT2LMHeadModel, GPT2Tokenizer),}
config_class, model_class, tokenizer_class = MODEL_CLASSES["gpt2"]
tokenizer = tokenizer_class.from_pretrained(gen_model_name_or_path, do_lower_case=False)

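The next step loads the generator model. The text below describes downloading and converting GPT2-XL; this run loads the local `./gpt2-emotion` checkpoint set in `gen_model_name_or_path` instead.

For GPT2-XL this takes a while (usually about 3 minutes to download and another 5 or so to convert).

The good news is that once the model is loaded, you can quickly sample from many different prompts and topics.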

In [7]:
# Loading GPT2-XL model (1.5B param LM) below; this could take a while.
# This requires additional CPU memory overhead to load the pretrained weights into a new model.
# Due to CPU memory constraints on Colab, we're loading the model in half precision (load_in_half_prec=True).
# Due to this change, generations may not always exactly match the samples in the paper, but they sometimes do and seem to be similar in quality.
# If you run the notebook with enough CPU RAM (most likely 16GB+), you can try setting load_in_half_prec=False.

model = model_class.from_pretrained(gen_model_name_or_path)#, load_in_half_prec=True)
model = model.to(device)
# model = model.float()

gedi_model_name_or_path = 'gedi_detoxifier'
gedi_model = model_class.from_pretrained(gedi_model_name_or_path).to(device)

no logit scale initialized for gpt2
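
As a rough check on the memory comments above, the weight footprint can be estimated as parameter count times bytes per parameter. This is a back-of-the-envelope sketch: `weight_memory_gb` is a hypothetical helper, the 1.5B figure is the GPT2-XL size quoted in the comments, and activations and the temporary copy made while loading are ignored.

```python
def weight_memory_gb(num_params, bytes_per_param):
    """Approximate size of the model weights in gigabytes (10**9 bytes)."""
    return num_params * bytes_per_param / 1e9


gpt2_xl_params = 1.5e9  # parameter count quoted in the comments above

fp32_gb = weight_memory_gb(gpt2_xl_params, 4)  # full precision: 4 bytes per weight
fp16_gb = weight_memory_gb(gpt2_xl_params, 2)  # half precision: 2 bytes per weight
print(f"fp32: {fp32_gb:.1f} GB, fp16: {fp16_gb:.1f} GB")
```

Half precision halves the weight storage, which is why it helps on machines with limited CPU RAM.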


### Set arguments for generation

You can change the maximum generation length or experiment with the hyperparameter settings.

The default hyperparameters were used in the topic model for the paper.

More aggressive topic steering can be done by increasing `disc_weight` or `filter_p` (`filter_p` should always be less than 1).

In [8]:
# Arguments for generation
# Maximum generation length
gen_length = 200
# omega from the paper; a higher disc_weight means more aggressive topic steering
disc_weight = 30
# 1 - rho from the paper; should be between 0 and 1, and a higher filter_p means more aggressive topic steering
filter_p = 0.8
# tau from the paper; preserves tokens that are classified as the correct topic
target_p = 0.8
# Hyperparameter that determines the class prior; set to uniform by default
class_bias = 0

# Cap the length at 1024, GPT-2's maximum context size
length = min(gen_length, 1024)
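
To make the roles of `disc_weight`, `filter_p`, and `target_p` concrete, here is a toy single-step sketch of weighted decoding followed by filtering, as we read it from the paper and the comments above. It is not the code in `modeling_gpt2`: `gedi_reweight` is a hypothetical helper, it works on small NumPy arrays rather than model logits, and it assumes all probabilities are strictly positive.

```python
import numpy as np


def gedi_reweight(lm_probs, class_probs, disc_weight=30.0, filter_p=0.8, target_p=0.8):
    """Toy one-step GeDi-style reweighting.

    lm_probs:    next-token distribution of the base LM, shape (V,)
    class_probs: P(desired class | candidate token), shape (V,)
    """
    lm_probs = np.asarray(lm_probs, dtype=float)
    class_probs = np.asarray(class_probs, dtype=float)

    # Weighted decoding: p_w is proportional to p_LM * P(class | token) ** omega.
    # Computed in log space for numerical stability.
    log_w = np.log(lm_probs) + disc_weight * np.log(class_probs)
    weighted = np.exp(log_w - log_w.max())
    weighted /= weighted.sum()

    # Filtering: rank tokens by class probability and keep the smallest
    # prefix whose weighted mass reaches rho = 1 - filter_p.
    rho = 1.0 - filter_p
    order = np.argsort(-class_probs)
    cumulative = np.cumsum(weighted[order])
    n_keep = int(np.searchsorted(cumulative, rho)) + 1
    keep = np.zeros(lm_probs.shape[0], dtype=bool)
    keep[order[:n_keep]] = True

    # Tokens the discriminator is already confident about (> tau) are always kept.
    keep |= class_probs > target_p

    filtered = np.where(keep, weighted, 0.0)
    return filtered / filtered.sum()


lm_probs = [0.5, 0.3, 0.2]
class_probs = [0.1, 0.9, 0.5]
print(gedi_reweight(lm_probs, class_probs, disc_weight=2.0))
```

Raising `disc_weight` sharpens the distribution toward tokens with high class probability, while raising `filter_p` lowers rho and so shrinks the set of tokens that survive filtering.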

### Specify prompt and topic to GeDi


The topic and prompt can be specified as strings with the `secondary_code` and `prompt` variables below.

Note that our GeDi topic model has been trained on only four topics (`world`, `sports`, `business`, and `science`), so it performs best when steering generation from GPT-2 towards these topics. However, it also shows some promising zero-shot results on new topics, e.g. `education`, `food`, `fire`, `space`, `cars`, `climate`.

Generic short prompts tend to work the best.

In [9]:
# Specify the topic you want to generate on using the secondary_code variable

secondary_code = 'climate'
bpe_tokens = tokenizer.encode(secondary_code)
if len(bpe_tokens) > 1:
  print("Warning! The number of BPE tokens for " + secondary_code + " is greater than 1; the model isn't trained for this, so generation is less likely to match the topic")

In [10]:
def generate_text(prompt, use_gedi=True):
  """Sample 5 continuations of `prompt` (32 new tokens each), optionally guided by the GeDi model."""
  text_ids = tokenizer.encode(prompt)
  encoded_prompts = torch.LongTensor(text_ids).unsqueeze(0).to(device)

  # multi_code = tokenizer.encode(secondary_code)
  attr_class = 1

  generated_sequence = model.generate(
    input_ids=encoded_prompts,
    pad_lens=None,
    max_length=encoded_prompts.shape[1] + 32,
    min_length=encoded_prompts.shape[1] + 32,
    top_k=None,
    top_p=1.0,
    repetition_penalty= 1.2,
    rep_penalty_scale= 10,
    eos_token_ids = tokenizer.eos_token_id,
    pad_token_id = 0,
    do_sample= True,
    penalize_cond= True,
    gedi_model= gedi_model if use_gedi else None,
    tokenizer= tokenizer,
    disc_weight= disc_weight,
    filter_p = filter_p,
    target_p = target_p,
    class_bias = class_bias,
    attr_class = attr_class,
    code_0 = code_desired,
    code_1 = code_undesired,
    multi_code=None,
    num_return_sequences=5
    )

  texts = [tokenizer.decode(output, skip_special_tokens=True)[len(prompt):] for output in generated_sequence.tolist()[0]]
  return texts

prompt = "sadness holy"
generate_text(prompt, False)

['ipsays i doubt that i should feel ashamed of the god today because then it would just be one more sarcastic comment about me at a time when alex',
 ' be my comfort in this troubled world m thats why i love church huge thanks this is being taken away all the people would feel ignored volunteered etc because its against',
 ' fuck those sins i feel so devastated xoxxd im not sure what has happened but pictured below is the picture that is of me no longer with her',
 ' shit im feeling stressed driving and i wanna see my mails or read old posts next evening on christmas eve dr ryan home saving man how cool be',
 ' death of christ and i ever feeling unimportant painful repetitious look down to waste time thinking and just today with a freshemptore fuck that is really his']

In [None]:
import os

os.makedirs("data", exist_ok=True)

filename = "emotion" #data_name.replace("/", "__")
num_iters = 200
prompts = ["sadness", "joy", "love", "anger", "fear", "surprise"]

with jsonlines.open(f"data/gedi-{filename}+detox.jsonl", 'w') as f:
  for p in prompts:
    for _ in tqdm(range(num_iters)):
      # Each call returns 5 GeDi-guided continuations of the prompt "<emotion> "
      gens = generate_text(f"{p} ", True)
      f.write({'text': p, 'generation': gens})
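
A quick way to sanity-check the file written by the loop above is to count how many generations were stored per prompt. This is a minimal sketch using only the standard library (a JSON Lines file is one JSON object per line); `count_generations` is a hypothetical helper, and it assumes the `text` and `generation` keys written by the loop.

```python
import json
from collections import Counter


def count_generations(path):
    """Return {prompt label: number of generated texts} for a JSON Lines file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            line = line.strip()
            if not line:
                continue  # tolerate blank lines
            record = json.loads(line)
            counts[record["text"]] += len(record["generation"])
    return dict(counts)
```

With `num_iters = 200` and 5 sequences per call, a complete run should give 1000 generations for each of the six prompts, e.g. from `count_generations("data/gedi-emotion+detox.jsonl")`.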

In [None]:
!nvidia-smi

NVIDIA-SMI has failed because it couldn't communicate with the NVIDIA driver. Make sure that the latest NVIDIA driver is installed and running.



In [25]:
from transformers import pipeline

# device='cuda:0'
# classifier = pipeline("text-classification", model="Aron/distilbert-base-uncased-finetuned-emotion")
# transformers 2.8 has no "text-classification" task: pipeline("text-classification") raises
# KeyError: "Unknown task text-classification". "sentiment-analysis" is the classification
# task this version lists as available.
classifier = pipeline("sentiment-analysis")

In [13]:
import jsonlines
from collections import defaultdict

label2id = {
  "business": 0,
  "entertainment": 1,
  "politics": 2,
  "sport": 3,
  "tech": 4
}
fix_label = {
  2: 3,
  4: 2,
  3: 1,
  1: 0,
  0: 4
}
total = defaultdict(int)
correct = defaultdict(int)

with jsonlines.open("data/gedi-emotion+detox.jsonl") as f:
  for item in tqdm(f):
    label = fix_label[item['label']]
    preds = item['prediction']

    total[label] += 5

    for p in preds:
      if p == label:
        correct[label] += 1

for k in total.keys():
    print(k, correct[k] / total[k])

print(total, correct, sep='\n')

0it [00:00, ?it/s]


KeyError: 'label'
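
The cell above fails because the generation file written earlier stores only the `text` and `generation` keys, so `item['label']` does not exist. Below is a minimal sketch of the same per-label accuracy, assuming each record has first been extended with a hypothetical `prediction` list of predicted emotion names (for example from an emotion classifier) and that `text` serves as the gold label. `per_label_accuracy` and the sample records are illustrative, not part of the GeDi repository.

```python
from collections import defaultdict


def per_label_accuracy(records, label_key="text", pred_key="prediction"):
    """Fraction of predictions that match the gold label, grouped by label."""
    total = defaultdict(int)
    correct = defaultdict(int)
    for item in records:
        label = item[label_key]
        preds = item[pred_key]
        total[label] += len(preds)  # count every generation, not a fixed 5
        correct[label] += sum(1 for p in preds if p == label)
    return {label: correct[label] / total[label] for label in total if total[label]}


# Hypothetical records: the generation file's schema plus classifier output
records = [
    {"text": "joy", "prediction": ["joy", "joy", "sadness", "joy"]},
    {"text": "fear", "prediction": ["fear", "anger"]},
]
print(per_label_accuracy(records))
```

Comparing label names directly avoids the integer remapping in `fix_label`, which refers to a different label set than the six emotion prompts used here.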