# Making GPT2 crack jokes

This is a simple experimental notebook for fine-tuning a pretrained GPT2 model on a jokes dataset. Let's see if it can learn to crack some jokes on its own.

For this purpose I will use pretrained models from the Hugging Face [transformers repository](https://github.com/huggingface/transformers).

In [None]:
!pip install transformers

In [7]:
import torch
from transformers import GPT2Tokenizer, GPT2LMHeadModel
import numpy as np
import warnings
warnings.filterwarnings('ignore')

I will be using the [GPT2LMHeadModel()](https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/transformers/modeling_gpt2.py#L472) class, which is [GPT2Model()](https://github.com/huggingface/transformers/blob/079bfb32fba4f2b39d344ca7af88d79a3ff27c7c/transformers/modeling_gpt2.py#L320) modified for language modeling.

The only difference is an added linear layer that transforms the output embeddings into logits (embedding size -> vocabulary size), which are then used to predict the output words.
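To make the shape change concrete, here is a minimal NumPy sketch of such a projection. The sizes are toy values assumed for illustration (the real model is far larger), and `hidden_states` and `lm_head_weight` are hypothetical names, not the library's attributes.

```python
import numpy as np

rng = np.random.default_rng(0)

seq_len, emb_size, vocab_size = 4, 8, 20  # toy sizes, assumed for illustration

# Output embeddings from the transformer: one vector per input position
hidden_states = rng.normal(size=(seq_len, emb_size))

# Hypothetical weight matrix of the added linear layer (embedding size -> vocabulary size)
lm_head_weight = rng.normal(size=(emb_size, vocab_size))

# One row of logits per position; each row scores every token in the vocabulary
logits = hidden_states @ lm_head_weight

# The logits at the last position are the ones used to predict the next token
next_token_logits = logits[-1]
print(logits.shape, next_token_logits.shape)
```

Each position gets its own row of scores, but for generation only the last row matters, because it scores the token that would follow the whole input.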

First I will run a small experiment with the GPT2 model to check that I have understood everything correctly and can generate some text with it.

In [21]:
tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
model = GPT2LMHeadModel.from_pretrained('gpt2')

I1029 22:30:42.409156 4458931648 tokenization_utils.py:374] loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-vocab.json from cache at /Users/martinsf/.cache/torch/transformers/f2808208f9bec2320371a9f5f891c184ae0b674ef866b79c58177067d15732dd.1512018be4ba4e8726e41b9145129dc30651ea4fec86aa61f4b9f40bf94eac71
I1029 22:30:42.412081 4458931648 tokenization_utils.py:374] loading file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-merges.txt from cache at /Users/martinsf/.cache/torch/transformers/d629f792e430b3c76a1291bb2766b0a047e36fae0588f9dbc1ae51decdff691b.70bec105b4158ed9a1747fea67a43f5dee97855c64d62b6ec3742f4cfdb5feda
I1029 22:30:43.140658 4458931648 configuration_utils.py:151] loading configuration file https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-config.json from cache at /Users/martinsf/.cache/torch/transformers/4be02c5697d91738003fb1685c9872f284166aa32e061576bbe6aaeb95649fcf.085d5f6a8e7812ea05ff0e6ed0645ab2e75d80387ad55c1ad9806ee70d272f80
I10

In [23]:
# Select the n most probable tokens, renormalize their probabilities,
# and sample one token ID from that reduced distribution
def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]
    top_prob = probs[ind]
    top_prob = top_prob / np.sum(top_prob) # Normalize
    choice = np.random.choice(n, 1, p = top_prob)
    token_id = ind[choice][0]
    return token_id
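As a quick sanity check of the top-n idea, the sketch below runs the same function on a toy distribution. The probabilities are made-up numbers over a six-token vocabulary, and the seed is fixed only to make the run repeatable.

```python
import numpy as np

def choose_from_top(probs, n=5):
    ind = np.argpartition(probs, -n)[-n:]        # indices of the n largest probabilities
    top_prob = probs[ind] / np.sum(probs[ind])   # renormalize so they sum to 1
    choice = np.random.choice(n, 1, p=top_prob)  # sample a position within the top n
    return ind[choice][0]

np.random.seed(0)  # fixed seed so the run is repeatable

# Toy distribution over a 6-token vocabulary (made-up numbers)
toy_probs = np.array([0.05, 0.40, 0.02, 0.30, 0.03, 0.20])

samples = [int(choose_from_top(toy_probs, n=3)) for _ in range(1000)]
counts = {token_id: samples.count(token_id) for token_id in sorted(set(samples))}
print(counts)  # only token ids 1, 3 and 5 can appear
```

Tokens outside the top n are never chosen, which cuts off the long tail of unlikely words while still leaving some randomness in the output.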

### First I prepare and tokenize the text that the model should start with and then continue on its own. Then I run the model for X iterations, adding one token to the list in each iteration.
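The mechanics of that loop can be sketched without downloading anything. In the toy version below, `fake_model` is a hypothetical stand-in that returns random logits, and the vocabulary size and prompt ids are made up; only the grow-by-one-token pattern carries over to the real code.

```python
import numpy as np

rng = np.random.default_rng(0)
VOCAB_SIZE = 10  # toy vocabulary size, assumed for illustration

def fake_model(token_ids):
    """Hypothetical stand-in for the language model: returns random
    next-token logits instead of running a real network."""
    return rng.normal(size=VOCAB_SIZE)

def softmax(x):
    e = np.exp(x - np.max(x))  # subtract the max for numerical stability
    return e / e.sum()

cur_ids = [3, 1, 4]  # pretend these are the token ids of the prompt
prompt_len = len(cur_ids)

for _ in range(5):  # X = 5 iterations
    probs = softmax(fake_model(cur_ids))
    top = np.argsort(probs)[-3:]                 # ids of the 3 most likely tokens
    top_probs = probs[top] / probs[top].sum()    # renormalize over those 3
    next_id = int(rng.choice(top, p=top_probs))  # sample one of them
    cur_ids.append(next_id)                      # the list grows by one token

print(cur_ids)
```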

In [24]:
cur_ids = torch.tensor(tokenizer.encode(" The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. ")).unsqueeze(0)

model.eval()
with torch.no_grad():
    
    for i in range(100):
        outputs = model(cur_ids, labels=cur_ids)
        loss, logits = outputs[:2]
        softmax_logits = torch.softmax(logits[0,-1], dim=0) # Take the first (and only) batch item and the logits at the last position
        next_token_id = choose_from_top(softmax_logits.numpy(), n=5) # Randomly choose the next token from the top n tokens, weighted by probability
        cur_ids = torch.cat([cur_ids, torch.ones((1,1)).long() * next_token_id], dim = 1) # Append the chosen token

    output_list = list(cur_ids.squeeze().numpy())
    output_text = tokenizer.decode(output_list)
    print(output_text)

The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. It is the world that has been made into an object of fascination, and that you must confront. It is the world that is always there to keep you from being caught up in the truth. The Matrix is the place where you must fight to keep your eyes open and to keep yourself in your own mind... and to be a better man. And to be a better man.

I have a feeling that you have seen the truth. And you know what that means. I've seen it


### For safety, I will save this masterpiece in a notebook cell

*The Matrix is everywhere. It is all around us. Even now, in this very room. You can see it when you look out your window or when you turn on your television. You can feel it when you go to work... when you go to church... when you pay your taxes. It is the world that has been pulled over your eyes to blind you from the truth. It is the world that has been made into an object of fascination, and that you must confront. It is the world that is always there to keep you from being caught up in the truth. The Matrix is the place where you must fight to keep your eyes open and to keep yourself in your own mind... and to be a better man. And to be a better man.*

*I have a feeling that you have seen the truth. And you know what that means. I've seen it*

### Not bad. The model works, and now we are ready to teach GPT2 a sense of humor.

### The first step is to find and prepare the data.

In [27]:
import os
import json

In [37]:
from torch.utils.data import Dataset, DataLoader
import os
import json

class JokesDataset(Dataset):
    def __init__(self, jokes_dataset_path = 'jokes_dataset/'):
        super().__init__()

        reddit_jokes_path = os.path.join(jokes_dataset_path, 'reddit_jokes.json')

        with open(reddit_jokes_path) as f:
            data = json.load(f)

        self.joke_list = []
        self.end_of_text_token = "<|endoftext|>"

        for idx, joke_json in enumerate(data):
            joke_str = f"{joke_json['title']} {joke_json['body']}{self.end_of_text_token}"
            self.joke_list.append(joke_str)

    def __len__(self):
        return len(self.joke_list)

    def __getitem__(self, item):
        return self.joke_list[item]
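To show what one training string looks like, here is a small sketch of the same formatting step on a made-up record. The record content is invented and `format_joke` is a hypothetical helper; the only thing taken from the class above is the layout of title, body and end-of-text token.

```python
import json

END_OF_TEXT = "<|endoftext|>"

# Made-up record in the assumed layout: a list of objects with 'title' and 'body'
raw = '[{"title": "Example title", "body": "Example punchline."}]'

def format_joke(record):
    # Hypothetical helper mirroring the string built in JokesDataset.__init__
    return f"{record['title']} {record['body']}{END_OF_TEXT}"

jokes = [format_joke(r) for r in json.loads(raw)]
print(jokes[0])
```

The end-of-text token marks where one joke stops, so the model can learn to finish a joke instead of running on.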


In [36]:
dataset = JokesDataset()
joke_loader = DataLoader(dataset, batch_size=1, shuffle=True)

In [40]:
for idx,joke in enumerate(joke_loader):
    print(joke)
    if idx>100:
        break

["Scientists have revealed today that they have found a new drug for depressed lesbians.. .. It's called Trydixagain.<|endoftext|>"]
['Huehuehue SaturationSaturationSaturation<|endoftext|>']
['My friend had a funeral for her baby who was killed by a lawnmower... I hope he Rests In Pieces.<|endoftext|>']
["My kids will be friends with people of all colors of the rainbow. That means no black people.\n\n\n(Credit goes to a person on either America's Got Talent or Britain's Got Talent, can't remember which)<|endoftext|>"]
["Why didn't Princess Diana have very many friends on Xbox Live? All she does is stay on the dashboard.<|endoftext|>"]
["Who's the shittiest pro basketball player? LeBrown James<|endoftext|>"]
["I called the rape hotline today Apparently it's only for victims<|endoftext|>"]
['ADELE WAS BUSTED FOR DRUG DEALING! Yep - they lifted her skirt and found 100 pounds of crack.<|endoftext|>']
['Today my boss fondled my genitals! Being self-employed is great.<|endoftext|>']
["Never 

In [51]:
from transformers import AdamW, WarmupLinearSchedule

BATCH_SIZE = 10
EPOCHS = 64
LEARNING_RATE = 5e-5
WARMUP_STEPS = 10000

device = 'cpu'
if torch.cuda.is_available():
    device = 'cuda'
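For intuition about the warmup setting, the sketch below computes a linearly warmed-up learning rate by hand. `warmup_lr` is a hypothetical helper, not the library's scheduler, and it models only the warmup phase: capping the multiplier at 1.0 afterwards is a simplification, since what the rate does after warmup depends on the scheduler and its total-step setting.

```python
LEARNING_RATE = 5e-5
WARMUP_STEPS = 10000

def warmup_lr(step, base_lr=LEARNING_RATE, warmup_steps=WARMUP_STEPS):
    # Hypothetical helper: scale the base rate by step / warmup_steps, capped at 1.0
    multiplier = min(1.0, step / max(1, warmup_steps))
    return base_lr * multiplier

for step in (0, 2500, 5000, 10000):
    print(step, warmup_lr(step))
```

Note that a scheduler step is counted per optimizer update, not per joke, so with gradient accumulation over `BATCH_SIZE` jokes the warmup covers `WARMUP_STEPS * BATCH_SIZE` jokes.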

In [52]:
model = model.to(device)
model.train()
optimizer = AdamW(model.parameters(), lr=LEARNING_RATE)
scheduler = WarmupLinearSchedule(optimizer, warmup_steps=WARMUP_STEPS, t_total = -1)
joke_count = 0

for i in range(EPOCHS):

    for idx,joke in enumerate(joke_loader):

        # batch_size=1, so joke is a list holding a single string
        joke_tens = torch.tensor(tokenizer.encode(joke[0])).unsqueeze(0).to(device)
        print(joke)
        print(joke_tens)

        # Passing the input as labels makes the model return the language modeling loss
        outputs = model(joke_tens, labels=joke_tens)
        loss, logits = outputs[:2]
        print(loss)

        # Gradients accumulate across jokes until the optimizer steps
        loss.backward()

        joke_count = joke_count + 1
        if joke_count == BATCH_SIZE:
            joke_count = 0
            optimizer.step()
            scheduler.step()
            model.zero_grad()

    # cur_ids still holds the text generated in the earlier cell
    output_list = list(cur_ids.squeeze().numpy())
    output_text = tokenizer.decode(output_list)
    print(output_text)

['I think volcanoes are over-reacting <|endoftext|>']
tensor([[   40,   892, 17516,  3028,   389,   625,    12,   260, 27362, 50256]])
tensor(5.3228, grad_fn=<NllLossBackward>)
['Yoda\'s last name is "Layheewho." <|endoftext|>']
tensor([[   56, 11329,   338,   938,  1438,   318,   366, 23763,   258,   413,
          8873,   526, 50256]])
tensor(5.9632, grad_fn=<NllLossBackward>)
["Kids these days Kids these days are so involved in their gadgets. I use to babysit this kid who would do whatever I tell him before he got his hands on that ipad. Now it's like he can't even hear me when I tell him to pass me the remote.<|endoftext|>"]
tensor([[40229,   777,  1528, 17476,   777,  1528,   389,   523,  2950,   287,
           511, 35281,    13,   314,   779,   284, 46711,   270,   428,  5141,
           508,   561,   466,  4232,   314,  1560,   683,   878,   339,  1392,
           465,  2832,   319,   326, 20966,   324,    13,  2735,   340,   338,
           588,   339,   460,   470,   772,  32

KeyboardInterrupt: 
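The training loop above feeds one joke at a time but only updates the weights every `BATCH_SIZE` jokes. A minimal pure-Python sketch of that gradient accumulation pattern follows. It assumes a one-weight toy model, made-up data and plain gradient descent instead of AdamW, so it illustrates the bookkeeping only.

```python
BATCH_SIZE = 4         # toy value
LEARNING_RATE = 0.01   # toy value

# Toy data for y = 2 * x; the "model" is a single weight w
samples = [(1.0, 2.0), (2.0, 4.0), (3.0, 6.0), (4.0, 8.0)]
w = 0.0

grad_sum = 0.0   # plays the role of the accumulated gradient buffer
seen = 0
updates = 0

for x, y in samples:
    # Gradient of the squared error (w*x - y)**2 with respect to w
    grad_sum += 2 * (w * x - y) * x      # like loss.backward(): adds to the buffer
    seen += 1
    if seen == BATCH_SIZE:
        w -= LEARNING_RATE * grad_sum    # like optimizer.step()
        grad_sum = 0.0                   # like model.zero_grad()
        seen = 0
        updates += 1

print(w, updates)
```

Because the gradients are summed rather than averaged, the effective step grows with the number of accumulated samples. Dividing each loss by the accumulation count is a common way to keep the step size comparable.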