# Computer Assignment B

## Text generation with OpenAI's GPT-2 model

This week you explore the language model that you read about in the first exercise of this course [1]. GPT-2 is a large-scale unsupervised language model that generates coherent paragraphs of text, achieves state-of-the-art performance on many language modeling benchmarks, and performs rudimentary reading comprehension, machine translation, question answering, and summarization, all without task-specific training [2].

Although other models have been published since GPT-2, it demonstrates the capabilities of a large transformer-based language model and shows off some interesting, fun, and even scary use cases. The full model has 1.5 billion parameters and was trained on a dataset of 8 million web pages with a simple objective: predict the next word, given all of the previous words within some text. The diversity of the dataset causes this simple goal to contain naturally occurring demonstrations of many tasks across diverse domains.
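To make the "predict the next word" objective concrete, here is a minimal sketch of autoregressive generation. The `TOY_MODEL` table and the `generate` function are hypothetical and purely illustrative: the toy looks only at the previous word, whereas GPT-2 conditions on all of the preceding text and works with learned probabilities over a large vocabulary.

```python
import random

# Hypothetical toy "model": maps the previous word to candidate next words
# and their probabilities. GPT-2 conditions on the whole preceding text.
TOY_MODEL = {
    "the": {"cat": 0.6, "dog": 0.4},
    "cat": {"sat": 0.7, "ran": 0.3},
    "dog": {"sat": 0.5, "ran": 0.5},
    "sat": {"down": 1.0},
    "ran": {"away": 1.0},
}

def generate(prompt, max_new_words, rng):
    """Extend `prompt` one word at a time by sampling from the toy model."""
    words = prompt.split()
    if not words:
        return ""
    for _ in range(max_new_words):
        candidates = TOY_MODEL.get(words[-1])
        if not candidates:  # no known continuation: stop early
            break
        next_word = rng.choices(list(candidates),
                                weights=list(candidates.values()))[0]
        words.append(next_word)  # the new word becomes part of the context
    return " ".join(words)

print(generate("the", 3, random.Random(0)))
```

Each sampled word is appended to the context before the next one is drawn, which is the same loop structure a real language model uses when it generates a continuation.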

We explore a broad set of capabilities, including the ability to generate conditional synthetic text samples, where we prime the model with an input (i.e., how the text should start) and have it generate a lengthy continuation. The model adapts to the style and content of the conditioning text, which allows the user to generate (more or less) coherent continuations about a topic of their choosing. This implementation is based on [3], with slight modifications for educational purposes. It goes without saying that we take no responsibility for the content that the AI generates, so don't be offended :) Also note that, for convenience, we use the smaller trained model (~500 MB) here rather than the full 1.5-billion-parameter model.

You might also be interested in this web page, which implements some of the same things in a web UI: https://talktotransformer.com

#### References

[1] Alex Hern, "New AI fake text generator may be too dangerous to release, say creators". The Guardian, February 14, 2019.

[2] OpenAI, "Better language models and their implications". Accessible at the [OpenAI Blog](https://openai.com/blog/better-language-models/), February 14, 2019.

[3] Tae Hwan Jung (Jeff Jung), "Simple text-generator with OpenAI GPT-2 Pytorch implementation" [GitHub repository](https://github.com/graykode/gpt-2-Pytorch), which uses code and models from [this repo](https://github.com/huggingface/transformers).

## Prepare the environment

Clone the repo from https://github.com/graykode/gpt-2-Pytorch and install the required dependencies.

In [1]:
!git clone https://github.com/graykode/gpt-2-Pytorch
%cd gpt-2-Pytorch
!pip install -r requirements.txt

Cloning into 'gpt-2-Pytorch'...
remote: Enumerating objects: 130, done.
remote: Total 130 (delta 0), reused 0 (delta 0), pack-reused 130
Receiving objects: 100% (130/130), 2.39 MiB | 9.67 MiB/s, done.
Resolving deltas: 100% (48/48), done.
/course/release/Computer-Assignment-B/gpt-2-Pytorch
Collecting regex==2017.4.5
  Downloading regex-2017.04.05.tar.gz (601 kB)
Building wheels for collected packages: regex
  Building wheel for regex (setup.py) ... done
  Created wheel for regex: filename=regex-2017.4.5-cp38-cp38-linux_x86_64.whl size=545122 sha256=fdce8238fd7b051aa2dd996514a5627588a1a3edcd109eab4656ed03dadbfef1
  Stored in directory: /home/wilkinw1/.cache/pip/wheels/45/6d/d9/1c9b861321c9240122cb967b734a80545c9f465be4fcb16f19
Successfully built regex
Installing collected packages: regex
Successfully installed regex-2017.4.5


In [2]:
import os
import sys
import torch
import random
import argparse
import numpy as np
from GPT2.model import (GPT2LMHeadModel)
from GPT2.utils import load_weight
from GPT2.config import GPT2Config
from GPT2.sample import sample_sequence
from GPT2.encoder import get_encoder

def text_generator(state_dict, text, nsamples=1, unconditional=False, batch_size=1,
                   length=-1, temperature=0.7, top_k=40):
    """Print `nsamples` generated samples, conditioned on `text` unless `unconditional` is set."""

    # Samples are generated in batches, so batch_size must divide nsamples
    assert nsamples % batch_size == 0

    # Draw a fresh seed on every call, so repeated runs give different samples
    seed = random.randint(0, 2147483647)
    np.random.seed(seed)
    torch.random.manual_seed(seed)
    torch.cuda.manual_seed(seed)
    device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

    # Load the model and switch it to evaluation mode
    enc = get_encoder()
    config = GPT2Config()
    model = GPT2LMHeadModel(config)
    model = load_weight(model, state_dict)
    model.to(device)
    model.eval()

    # Default to half of the context window; the window size is the upper limit
    if length == -1:
        length = config.n_ctx // 2
    elif length > config.n_ctx:
        raise ValueError("Can't get samples longer than window size: %s" % config.n_ctx)

    print(text)
    context_tokens = enc.encode(text)
    # In unconditional mode no prompt is fed to the model, so there is nothing to strip
    prompt_length = 0 if unconditional else len(context_tokens)

    generated = 0
    for _ in range(nsamples // batch_size):
        out = sample_sequence(
            model=model, length=length,
            context=context_tokens if not unconditional else None,
            start_token=enc.encoder['<|endoftext|>'] if unconditional else None,
            batch_size=batch_size,
            temperature=temperature, top_k=top_k, device=device
        )
        # Drop the prompt tokens so that only the continuation is decoded
        out = out[:, prompt_length:].tolist()
        for i in range(batch_size):
            generated += 1
            text = enc.decode(out[i])

            print("=" * 40 + " SAMPLE " + str(generated) + " " + "=" * 40)
            print(text)
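
The function above draws a new random seed on every call, which is why two runs with the same input give different text. The following sketch illustrates the idea with Python's standard `random` module only; `draw_samples` is a hypothetical helper, not part of the repository, and the integers it returns merely stand in for sampled tokens.

```python
import random

def draw_samples(seed, n=5):
    """Return n pseudo-random integers from a generator seeded with `seed`."""
    rng = random.Random(seed)  # dedicated generator, seeded explicitly
    return [rng.randint(0, 9) for _ in range(n)]

# The same seed reproduces the same sequence
print(draw_samples(1234) == draw_samples(1234))

# A freshly drawn seed (as in text_generator) will usually give a different one
fresh_seed = random.randint(0, 2147483647)
print(draw_samples(fresh_seed))
```

If you wanted reproducible samples, one option would be to pass a fixed seed into the generator function instead of drawing it randomly; that would be a modification of the code above, not something it currently supports.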


## Load pre-trained model

The pre-trained model is available online. If you want to run this notebook on your own laptop, you need to download the model (e.g., with `curl --output gpt2-pytorch_model.bin https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin`). Here on jupyter.cs.aalto.fi, however, we all share the same model dump to save disk space and bandwidth.
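If you run the notebook locally and prefer to stay in Python, the download could be sketched as follows. The `ensure_model` helper and its default file name are assumptions made for illustration; the URL is the one from the `curl` command above, and whether it is still reachable is not guaranteed.

```python
import os
import urllib.request

MODEL_URL = "https://s3.amazonaws.com/models.huggingface.co/bert/gpt2-pytorch_model.bin"

def ensure_model(path="gpt2-pytorch_model.bin", url=MODEL_URL):
    """Download the weights to `path` unless it already exists; return the path."""
    if not os.path.exists(path):
        # Roughly 500 MB, so this can take a while on a slow connection
        urllib.request.urlretrieve(url, path)
    return path

# Usage on your own machine (not needed on the course server):
# state_dict = torch.load(ensure_model(), map_location="cpu")
```

Checking for the file first means the download happens only once, which matters when you re-run the notebook repeatedly.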

In [3]:
state_dict = torch.load('/coursedata/gpt2-pytorch_model.bin', map_location='cpu' if not torch.cuda.is_available() else None)



## Task 1: Unconditional samples

We start off simple, with perhaps the least useful mode of the transformer. If you set the `unconditional` parameter to `True`, the generation is not conditioned on any given text. The model just spits out some random text it comes up with.

In [4]:
text_generator(state_dict, '', unconditional=True)

  0%|          | 2/512 [00:00<00:27, 18.58it/s]




100%|██████████| 512/512 [00:26<00:00, 19.35it/s]

<|endoftext|>A lot of people have asked me if I would like to be a part of the program to help a child develop into a successful adult. My answer is yes. I can help you build a life for yourself and your child, and I will do it by providing safe, safe work that will lead to a happy, satisfying life for you and your child.

The best way to support yourself and your child is to support yourselves. The best way to support yourself and your child is to support yourself.

I will be making this easy for you.

I will be putting together a list of things I will be doing, and I will make you feel comfortable with them.

There is no better way to support yourself than to support yourself.

I will be giving you my support and yours.

I will be making this easy for you.

You will be learning to love your little sister, to love you unconditionally, and to love yourself unconditionally.

You will learn to love yourself unconditionally, unconditionally, unconditionally.

You will learn to love yourse




## Task 2: Generate completion of given text

Now we get down to business: the model is more interesting when you give it context for conditional text generation. The model picks up the style and context from the input and tries to continue the 'story', complete the list, or adapt to the style. The variable `text` (defined below) holds the example input that the model adapts to.

In [5]:
text = 'Once when I was six years old I saw a magnificent picture in a book, \
called True Stories from Nature, about the primeval forest.'

text_generator(state_dict, text)

  0%|          | 0/512 [00:00<?, ?it/s]

Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest.


100%|██████████| 512/512 [00:25<00:00, 19.84it/s]

 It was very beautiful. And it was very beautiful. It was very beautiful to me, and I didn't want to go to school for it.

And then I went to Harvard and I went to MIT and I went to college and I went to the University of Chicago. And then I went to Harvard and I went to Harvard and I went to the University of Chicago and I went to the University of Hawaii and I went to the University of California. And I went to the University of New Mexico and I went to the University of Minnesota and I went to the University of Pennsylvania. And I went to the University of Minnesota and I went to the University of North Dakota and I went to the University of North Dakota. And I went to the University of South Carolina. And I went to the University of South Dakota. And then I went to the University of South Dakota. And I went to the University of South Dakota. And I went to the University of California. And I went to the University of California. And I went to the University of California. And I went




## Task 3: Get more completion samples

Modify the `nsamples` parameter to set the number of generated samples. The variable `text` holds the example input from Task 2. Feel free to change it when you explore.

In [None]:
text_generator(state_dict, text, nsamples=2)

  0%|          | 1/512 [00:00<01:23,  6.09it/s]

Once when I was six years old I saw a magnificent picture in a book, called True Stories from Nature, about the primeval forest.


 73%|███████▎  | 375/512 [00:18<00:06, 20.29it/s]

## Task 4: Control the length of generated completion

You can also control how long the generated text samples are. The `length` parameter is counted in tokens (the subword units produced by the encoder), not in words. The default length is 512 tokens, and the maximum is 1024 tokens.

In [None]:
text_generator(state_dict, text, length=20)

## Task 5: Modify model parameters

The model has additional parameters that you can control. The `temperature` parameter defaults to `temperature=0.7`. Play around with the model by modifying this parameter. What happens when you change the value to, e.g., 0.5 or 0.9? The variable `text` still holds the example input defined in Task 2. Feel free to change it when you explore.

In [None]:
text_generator(state_dict, text, length=20, temperature=0.9)
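
To get a feeling for what `temperature` and `top_k` do before experimenting, the sketch below applies both to a handful of made-up scores. It assumes the common sampling scheme (divide the scores by the temperature, keep the `top_k` highest, normalize with a softmax, then sample); the function names and `toy_logits` are hypothetical, and this is not the repository's own `sample_sequence` code.

```python
import math
import random

def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtract the maximum for numerical stability
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def top_k_filter(logits, k):
    """Keep the k largest scores; the rest become -inf (probability 0).

    Ties at the cut-off are all kept, so slightly more than k may survive.
    """
    if k <= 0 or k >= len(logits):
        return list(logits)
    cutoff = sorted(logits, reverse=True)[k - 1]
    return [x if x >= cutoff else -math.inf for x in logits]

def next_token_probs(logits, temperature=0.7, top_k=40):
    """Temperature-scale the scores, apply top-k, and normalize."""
    scaled = [x / temperature for x in logits]
    return softmax(top_k_filter(scaled, top_k))

def sample_next_token(logits, temperature=0.7, top_k=40, rng=random):
    """Draw one token index from the filtered, temperature-scaled distribution."""
    probs = next_token_probs(logits, temperature, top_k)
    return rng.choices(range(len(probs)), weights=probs)[0]

# Hypothetical scores for a four-token vocabulary
toy_logits = [2.0, 1.0, 0.5, -1.0]
for t in (0.5, 0.7, 0.9):
    probs = next_token_probs(toy_logits, temperature=t, top_k=3)
    print(t, [round(p, 3) for p in probs])
```

Compare the printed distributions across the three temperatures, and note which token always ends up with probability zero because of `top_k=3`. Then check whether the text generated by the real model changes in the way these numbers suggest.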

## Task 6: Play around with the model

Now your task is to explore further and try out various things. Make the model
* write a list of things to take with you to Mars (e.g., *'If I ever travel to Mars, I would take with me the following items.'*)
* write a bedtime story for children (e.g., *'It was a dark and stormy night...'*)
* write a news story about the coronavirus pandemic
* write you the course essay for this course

Feel free to be creative and try out other things that cross your mind. Remember that running the model several times will produce different samples.

In [None]:
# Add your code here
