In [1]:
import torch
import torch.nn as nn
import torch.nn.functional as F
import numpy as np
import time
import re
from scipy import spatial
from torch.utils.data import TensorDataset, DataLoader
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')

# Experiments in Text Generation
In preparation for a more advanced project, I wanted to play around with natural text generation. I decided to try to generate text in the style of Harry Potter. Inspired by [Karpathy's post](http://karpathy.github.io/2015/05/21/rnn-effectiveness/), I first tried an LSTM that generates text character by character. I highly recommend reading his post if you are unfamiliar with recurrent neural networks. A one-sentence summary is that they take a sequence of timesteps (in this case characters) and produce a sequence of outputs based on their hidden state, which stores information about previous timesteps. Long Short-Term Memory networks (LSTMs) are a more complicated structure that does a similar thing but mitigates problems like vanishing gradients and helps keep things in memory for longer.
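
To make the "hidden state" idea concrete, here is a minimal sketch of a plain (non-LSTM) recurrent update in pure Python. The scalar weights `w_x` and `w_h` and the function name `run_rnn` are made up for illustration; a real network learns weight matrices and, in the LSTM case, adds gates on top of this.

```python
import math

def run_rnn(inputs, w_x=0.5, w_h=0.9):
    """Toy recurrent network with a single scalar hidden state.

    At each timestep the new hidden state mixes the current input with the
    previous hidden state, so earlier inputs keep influencing later outputs.
    """
    h = 0.0  # initial hidden state
    outputs = []
    for x in inputs:
        h = math.tanh(w_x * x + w_h * h)  # h_t = tanh(w_x * x_t + w_h * h_{t-1})
        outputs.append(h)
    return outputs

# Same final input (0.0), different history: the last output differs
# because the hidden state remembers what came before.
print(run_rnn([1.0, 0.0])[-1])
print(run_rnn([-1.0, 0.0])[-1])
```

The two printed values have opposite signs even though the last input is identical, which is the "memory" the rest of this notebook relies on.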

I did not include the texts in the repo because I don't want to distribute copyrighted materials...

In [2]:
# Get the texts (only have first three because getting others was a little troublesome)
with open('texts/J. K. Rowling - Harry Potter 1 - Sorcerer\'s Stone.txt', 'r') as text_file1:
    book1 = text_file1.read()
with open('texts/J. K. Rowling - Harry Potter 2 - The Chamber Of Secrets.txt', 'r') as text_file2:
    book2 = text_file2.read()
with open('texts/J. K. Rowling - Harry Potter 3 - Prisoner of Azkaban.txt', 'r') as text_file3:
    book3 = text_file3.read()
#with open('texts/J. K. Rowling - Harry Potter 4 - The Goblet of Fire.txt', 'r') as text_file4:
#    book4 = text_file4.read()

books = ''.join((book1, book2, book3))
chars = sorted(set(books)) # sorted so the char <-> int mapping is the same on every run
#int2char = {i:c for i, c in enumerate(chars)}
char2int = {c:i for i, c in enumerate(chars)}

encoded_books = np.array([char2int[c] for c in books])

def one_hot_enc(arr, vocab_size):
    '''
    Inputs:
    arr - array containing message to be encoded
    vocab_size - number of unique chars
    
    Output: 
    one_hot - encoded array of shape (*arr.shape, vocab_size) i.e. shape of arr with extra dimension appended
    '''
    # Unroll the message array as rows and expand vocab_size columns to form the shape for one_hot
    size = arr.size # total number of elements; works for any number of dims (batch size > 1 included)
    one_hot = np.zeros((size, vocab_size), dtype=np.float32) 
    
    # Look at row i(for all i), make one_hot[i, arr.flatten()[i]] = 1
    one_hot[np.arange(size), arr.flatten()] = 1 
    
    # Fix the shape
    one_hot = one_hot.reshape(*arr.shape, vocab_size)
    
    return one_hot

def batch_generator(data, batch_size, seq_len):
    '''
    data - shape (total_chars,)
    '''

    chars_per_batch = batch_size * seq_len
    batches = len(data) // chars_per_batch
    data = data[:batches*chars_per_batch] # truncate extra chars that wouldn't make a complete batch
    data = np.reshape(data, (batch_size, -1))

    # Make our off by one target once up front: the same data shifted one position to the left, so the
    # target at each position is the next char (the very last target wraps around to the first char)
    targets = np.reshape(np.roll(data.flatten(), shift=-1), data.shape)

    # Yield a window of data of shape (batch_size, seq_len) as well as the same window slid one to the right
    # The offset window is used to evaluate whether the model accurately predicted the next char
    for i in range(0, data.shape[1], seq_len):
        x = data[:, i:i+seq_len]            # Our input
        x_offset = targets[:, i:i+seq_len]  # Our off by one target
    
        yield x, x_offset
        
class CharRNN(nn.Module):
    
    def __init__(self, chars, n_layers, hidden_dim, dropout=0.3):
        super(CharRNN, self).__init__()
        
        # Important numbers to save
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.vocab_size = len(chars)
        
        # Create conversion dictionaries
        self.int2char = {i:c for i, c in enumerate(chars)}
        self.char2int = {c:i for i, c in enumerate(chars)}
        
        # Actual layers of model
        self.lstm = nn.LSTM(input_size=self.vocab_size, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True, 
                            dropout=dropout)
        self.fc = nn.Linear(hidden_dim, self.vocab_size)
    
    def forward(self, x, h):

        out, h = self.lstm(x, h)
        out = self.fc(out)
        
        return out, h
    
    def init_hidden(self, batch_size):
        return (torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device),
               torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device))
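
The input/target pairing that `batch_generator` produces is easier to see on a tiny example. The sketch below is a simplified stand-in written with plain Python lists (the helper name `toy_batches` is hypothetical, and it is not a drop-in replacement for the NumPy version above): each target row is the input row shifted one position forward in the original sequence.

```python
def toy_batches(data, batch_size, seq_len):
    """Yield (inputs, targets) where targets are the inputs shifted by one."""
    chars_per_batch = batch_size * seq_len
    usable = (len(data) // chars_per_batch) * chars_per_batch
    data = data[:usable]                  # drop the incomplete tail
    shifted = data[1:] + data[:1]         # next item for every position (wraps around)
    row_len = usable // batch_size
    rows = [data[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    target_rows = [shifted[r * row_len:(r + 1) * row_len] for r in range(batch_size)]
    for i in range(0, row_len, seq_len):
        x = [row[i:i + seq_len] for row in rows]
        y = [row[i:i + seq_len] for row in target_rows]
        yield x, y

for x, y in toy_batches(list("abcdefghijkl"), batch_size=2, seq_len=3):
    print(x, y)
```

Each row continues where it left off in the next batch, which is why the training loop can carry the hidden state from one batch to the next within an epoch.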

## Sampling
An important part of any text generation network is the sampling. Even if you develop a model that outputs a very useful/realistic probability distribution of the next char/word, if you pick from that distribution poorly you can still have bad results. I chose to use nucleus sampling (p-sampling) given its remarkable results in other models. The following few functions will also be useful later for our fancier models. For now, only pay attention to the code where `is_char_model` is true.
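
As a rough illustration of the idea (a pure-Python sketch, not the PyTorch implementation used in this notebook), nucleus sampling sorts the tokens by probability, keeps the smallest prefix whose cumulative probability reaches `p`, and renormalizes before sampling. The function name `nucleus` and the toy distribution are made up for this example.

```python
import random

def nucleus(probs, p=0.9):
    """Return {token: renormalized prob} for the smallest top set with mass >= p."""
    ranked = sorted(probs.items(), key=lambda kv: kv[1], reverse=True)
    kept, mass = [], 0.0
    for token, prob in ranked:
        kept.append((token, prob))
        mass += prob
        if mass >= p:  # the token that crosses the threshold is still included
            break
    return {token: prob / mass for token, prob in kept}

toy = {'e': 0.5, 't': 0.3, 'a': 0.15, 'q': 0.04, 'z': 0.01}
kept = nucleus(toy, p=0.9)
print(kept)  # 'q' and 'z' are cut off; the rest are rescaled to sum to 1

# Sample one token from the truncated distribution
tokens, weights = zip(*kept.items())
print(random.choices(tokens, weights=weights, k=1)[0])
```

The unlikely tail can never be sampled, so a single bad pick is less likely to send the generated text off track.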

In [3]:
def p_sample(x, p=.9):
    '''
    p sampling, aka nucleus sampling, keeps the smallest set of most likely next words/characters whose
    cumulative probability exceeds p and samples only from that set
    
    x - tensor of shape (1, vocab_size) containing a probability distribution (i.e. already softmaxed)
    p - min probability for mass (e.g. if p=.9, take the top 90% most likely words)  0 <= p <= 1
    '''
    sorted_vals, sorted_indices = torch.sort(x, dim=-1, descending=True)

    cum_probs = torch.cumsum(sorted_vals, dim=-1)
    indices_too_small = cum_probs > p
    
    # Shift by one to make sure first token over the threshold is also included
    indices_too_small = torch.roll(indices_too_small, shifts=1, dims=-1)
    indices_too_small[:, 0] = False
    
    indices_too_small = sorted_indices[indices_too_small]  
    # x already holds probabilities, so zero out the tail and renormalize
    # (applying softmax a second time would flatten the distribution)
    x[:, indices_too_small] = 0
    
    return x / x.sum(dim=-1, keepdim=True)

def predict_token(model, hc, input_token, p=.9, argmax=False, is_char_model=False):
    '''
    input_token - an int where int2token[input_token] is the desired token
    hc - hidden states
    p - min probability for mass (e.g. if p=.9, after sorting, take the top 90% most likely words)  0 <= p <= 1
    argmax - if true, take the most likely token every time(doesn't produce great results)
    is_char_model - True if model isinstance of CharRNN
    '''  
    input_token = np.array([input_token])
    
    if is_char_model:
        one_hot = one_hot_enc(input_token, model.vocab_size)
        model_input = torch.from_numpy(one_hot)
    else:
        model_input = torch.from_numpy(input_token)
    
    model_input = model_input.to(device)
    model_input = torch.unsqueeze(model_input, 0)
    
    if not is_char_model:
        model_input = model_input.long()
    x, hc = model(model_input, hc)

    # Reshape so it's shape (1, vocab_size)
    x = x.view(-1, model.vocab_size)
    
    if argmax:
        probs = F.softmax(x, dim=-1)
        token_as_int = torch.argmax(probs).item()
    else: # Create a probability distribution & sample
        x = F.softmax(x, dim=-1)
        probs = p_sample(x, p=p)
        prob_dist = torch.distributions.Categorical(probs)
        token_as_int = prob_dist.sample().item()
    
    if is_char_model:  # convert the sampled int back into a char
        token = model.int2char[token_as_int]
    else: # word model, convert the sampled int back into a word
        token = int2word[token_as_int]
    
    return token, token_as_int, hc

def predict_sequence_tokens(model, seed_phrase='Harry', length=500, p=.9, argmax=False, include_seed=False):
    '''
    Predict a string of specified length iteratively(i.e. predict each token based on prev. predicted tokens). Tokens 
    can be chars(CharRNN) or words (WordRNN)
    
    seed_phrase - a phrase to feed into the model to generate a specific initial hidden state
    include_seed - if true, include the seed phrase as part of the output
    length - length of generated string
    argmax - if true, take the most likely token every time(doesn't produce great results)
    p - min probability for mass (e.g. if p=.9, after sorting, take the top 90% most likely words)  0 <= p <= 1
    '''
    model.eval()
    hc = model.init_hidden(1)
    is_char_model = isinstance(model, CharRNN)
    
    input_token_num = 0 # to be input into the model after seeding, will be 0 if no input phrase
    
    if not is_char_model: # Create list and split out words from punctuation and whitespace
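        # Raw string avoids invalid escape warnings; drop empty strings re.split leaves at the ends
        iter_phrase = [t for t in re.split(r'([".,!?;:\s]+)', seed_phrase) if t]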
    else:
        iter_phrase = seed_phrase
    
    for token in iter_phrase:
        if not is_char_model:
            num = word2int[token]
        else:
            num = model.char2int[token]
        _token, input_token_num, hc = predict_token(model, hc, num, p=p, argmax=argmax, is_char_model=is_char_model)
    
    msg = []
    if include_seed:
        msg.append(seed_phrase)
        msg.append(_token)
        if is_char_model:
            length -= len(seed_phrase)
        else:
            length -= 1
    
    for i in range(length):
        word, input_token_num, hc = predict_token(model, hc, input_token_num, p=p, argmax=argmax, is_char_model=is_char_model)
        msg.append(word)
    
    output_string = ''.join(msg) # don't join until end for efficiency
    return output_string


In [7]:
def train(model, data, optimizer, seq_len=100, batch_size=100, epochs=30, summary_every=30, print_time=False):
    
    if print_time:
        prev_time = time.time()
        
    criterion = nn.CrossEntropyLoss()
    iteration = 0
    is_char_model = isinstance(model, CharRNN)
    model.train()
    
    for e in range(epochs):
        # Reset / initialize (hidden state, cell) params
        hc = model.init_hidden(batch_size)
        
        rolling_loss = []
        for x, x_offset in batch_generator(data, batch_size=batch_size, seq_len=seq_len):
            iteration += 1
            
            if is_char_model: # only char model needs one hot encoding
                x = one_hot_enc(x, model.vocab_size)
            
            # Detach hc from the previous batch's graph to prevent backprop through entire epoch
            # (noticeably improves training time)
            hc = tuple(layer.detach() for layer in hc)
            
            # Move to tensors and gpu if applicable
            x = torch.from_numpy(x)
            x = x.to(device)
            x_offset = torch.from_numpy(x_offset)
            x_offset = x_offset.to(device)
            
            # Clear gradients and predict
            optimizer.zero_grad()
            if not is_char_model: # Since x is words as integers, for word2vec embedding must explicitly change type
                x = x.long()
            prediction, hc = model(x, hc)
            
            # Reshape so that it looks like one giant batch of batch_size * seq_len predicting on vocab_size classes
            # which is needed for the cross entropy loss function
            prediction = prediction.view(-1, model.vocab_size)
            x_offset = x_offset.contiguous().view(-1)
                
            # Get loss and take step
            loss = criterion(prediction, x_offset.long())
            loss.backward() # forgot this initially, oops!
            nn.utils.clip_grad_norm_(model.parameters(), 10) # Avoid exploding gradients
            optimizer.step()
            
            rolling_loss.append(loss.item())
            
            
            if iteration % summary_every == 0:
                if print_time:
                    print('Epoch: {0} ... Iteration: {1} ... Current loss: {2:.6f} ... Average loss over last {3} '
                          'iterations: {4:.6f} ... Time {3} iterations took: {5:.3f}'.format(
                          e+1, iteration, loss.item(), summary_every, np.mean(rolling_loss), time.time() - prev_time))
                    prev_time = time.time()
                else:
                    print('Epoch: {0} ... Iteration: {1} ... Current loss: {2:.6f} ... Average loss over last {3} '
                          'iterations: {4:.6f}'.format(e+1, iteration, loss.item(), summary_every, np.mean(rolling_loss)))
                print('-'*50)
                print(predict_sequence_tokens(model, length=20))
                print('\n')
                model.train()
                rolling_loss = []              
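
The `clip_grad_norm_` call in `train` guards against exploding gradients by rescaling all gradients together when their combined L2 norm exceeds a threshold (10 above). A stripped-down sketch of that rescaling on a flat list of numbers, using the hypothetical helper `clip_by_norm`, looks like this; PyTorch's version operates on parameter tensors in place and differs in details such as a small epsilon in the denominator.

```python
import math

def clip_by_norm(grads, max_norm):
    """Scale grads so their L2 norm is at most max_norm; leave small ones alone."""
    total_norm = math.sqrt(sum(g * g for g in grads))
    if total_norm <= max_norm:
        return list(grads)
    scale = max_norm / total_norm
    return [g * scale for g in grads]

print(clip_by_norm([3.0, 4.0], max_norm=10))    # norm 5, left unchanged
print(clip_by_norm([30.0, 40.0], max_norm=10))  # norm 50, scaled down to norm 10
```

Because every gradient is multiplied by the same factor, the direction of the update is preserved and only its size is capped.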

In [8]:
# Hyperparameters
num_layers = 4
hidden_dimension = 512
dropout_chance = 0.3
lr = 1e-3
    
char_net = CharRNN(chars, n_layers=num_layers, hidden_dim=hidden_dimension, dropout=dropout_chance)
char_net = char_net.to(device)
optim = torch.optim.Adam(char_net.parameters(), lr=lr)

In [None]:
# More hyperparameters (these just affect training)
show_time = False
summary_every = 5
seq_len = 100
batch_size = 10
reset_optimizer = False
lr = 1e-3 # won't be used unless reset optimizer is True

if reset_optimizer:
    optim = torch.optim.Adam(char_net.parameters(), lr=lr)
train(char_net, encoded_books, optim, seq_len=seq_len, batch_size=batch_size, 
      summary_every=summary_every, print_time=show_time)

In [None]:
# Generate a sequence of arbitrary length
seed_phrase='Harry' # Initializes hidden state of network
length=500 
p=.9 # 0 <= p <=1  sample from the top p most likely characters
greedy=False # choose most likely character every time(don't recommend, easy for model to get off track and never recover)
include_seed=False # Include seed in output

sample = predict_sequence_tokens(char_net, seed_phrase=seed_phrase, length=length, p=p, argmax=greedy, include_seed=include_seed)

print(sample)

## Results
Some of the short excerpts could get quite good.

```Hagrid and Hermione shouted. ```

```"Ouch!" said Madam Pomfrey suspiciously.```

```"But --"```

Others were close to making sense, but not quite there.

```"What do you think that was the boy?" said Harry```

```"Why?" Harry watched. "What if he is
today that the memory work," he whispered.```

```"Harry!" Harry yelled. "I could not mind me without you```

Some were a little further off the mark, especially when the passages got longer.

```"It's out," he said white-hispinated and coldly,```

```It had just thought Ron think we started to get in his eyes, for bed not
a parchment like his back. Boys there was sure not five a grip at once,
who had been safely with simple hand, Harry```

Overall, considering the LSTM learned the entire English language character by character from a few Harry Potter books, these are pretty impressive results! It spelled words correctly quite often and picked up on some nuances of dialogue. Still, it leaves a lot to be desired which is why we'll investigate more complex models.

# Word2Vec
The results of the character-level LSTM were lackluster, although still impressive considering what it was learning from. I turned towards modeling the text at the word level instead. This proved MUCH more complicated than I originally anticipated. I chose to use word2vec to embed my words. [word2vec](http://mccormickml.com/2016/04/19/word2vec-tutorial-the-skip-gram-model/) is a shallow neural network that learns to represent each word in a vocabulary of m words as a vector of dimension d, where typically d << m. A common rule of thumb is that there isn't much benefit in performance from representing the words with more than ~300 dimensions. 
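
Concretely, an embedding is just a lookup table with one row of `d` numbers per vocabulary word. The sketch below uses a made-up three-word vocabulary with `d = 2` (real embeddings are learned, not hand-written) and shows that looking up a row gives the same result as multiplying a one-hot vector by the table, which is why the word-level model can skip the one-hot step.

```python
vocab = ['wand', 'owl', 'broom']   # m = 3 words (toy vocabulary)
table = [[0.2, -1.0],              # one d = 2 dimensional row per word
         [0.7, 0.1],
         [-0.4, 0.9]]

def embed(word):
    """Look up the embedding row for a word."""
    return table[vocab.index(word)]

def embed_via_one_hot(word):
    """Same result the long way: one-hot vector times the embedding table."""
    one_hot = [1.0 if w == word else 0.0 for w in vocab]
    d = len(table[0])
    return [sum(one_hot[i] * table[i][j] for i in range(len(vocab))) for j in range(d)]

print(embed('owl'))
print(embed_via_one_hot('owl'))
```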

Using a lower dimension helps speed up training because it uses fewer parameters and because similar words are (hopefully) embedded with similar vectors, which can help the model when picking the next word. For example, perhaps the ideal word would've been chair, but the model isn't quite accurate and gives stool, a similar word, a high probability of being the next word. This is much better than the model picking some random word that has nothing to do with chair.
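
The notion of "similar vectors" here is usually measured with cosine similarity, which is also what gensim's `similarity` reports. A small sketch with invented 3-dimensional vectors (the numbers are hypothetical, chosen only so that `chair` and `stool` point in nearly the same direction):

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two vectors: 1 means same direction."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

vectors = {
    'chair':  [0.9, 0.8, 0.1],
    'stool':  [0.8, 0.9, 0.2],
    'dragon': [-0.7, 0.1, 0.9],
}

print(cosine_similarity(vectors['chair'], vectors['stool']))   # close to 1
print(cosine_similarity(vectors['chair'], vectors['dragon']))  # much lower
```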

In [11]:
def word_loader(corpus, batch_size=10, seq_len=100, cutoff=3):
    '''
    corpus - string
    cutoff - words that appear cutoff times or fewer are dropped from the corpus
    '''
    # Preprocess the text
    
    # Split out words from punctuation and whitespace
    word_list = re.split(r'([".,!?;:\s]+)', corpus) 
    
    # Create a dictionary that keeps track of how many times a word appears. Get rid of words that appear fewer
    # than cutoff times
    word_dict = {}
    for word in word_list:
        if word not in word_dict:
            word_dict[word] = 1
        else:
            word_dict[word] += 1
    cutoff_dict = {key:val for key,val in word_dict.items() if val > cutoff}
    processed_corpus = [w for w in word_list if w in cutoff_dict]
    vocab_size = len(cutoff_dict) + 1
    
    # Create our lookup dicts
    word2idx = {word:idx for idx, word in enumerate(cutoff_dict.keys())}
    idx2word = {idx:word for idx, word in enumerate(cutoff_dict.keys())}
    idx2word[vocab_size - 1] = '_UNK_' # for words filtered out
    word2idx['_UNK_'] = vocab_size - 1
    
    # Convert to ints
    corpus_as_int = np.array([word2idx[w] for w in processed_corpus])
    
    # Prep for batches
    words_per_batch = batch_size*seq_len
    batches = len(corpus_as_int) // words_per_batch
    data = corpus_as_int[:batches*words_per_batch] # truncate extra words that won't make a complete batch
    data = np.reshape(data, (batch_size, -1))
    
    return [processed_corpus], word2idx, idx2word

In [12]:
import logging # show information for gensim while training/loading/saving
logging.basicConfig(format='%(asctime)s : %(levelname)s : %(message)s', level=logging.INFO)

## Gensim
I was going to implement word2vec myself, but then I found gensim, which makes it dead simple to train a word2vec model. Below is the code to train one yourself, or if you skip ahead, you can load the model I used. It's not perfect, but it's decent enough.

In [None]:
from gensim import corpora
from gensim.models.word2vec import Word2Vec
from multiprocessing import cpu_count
import gensim.downloader as api

min_count = 3 # word_loader drops words that appear this many times or fewer
books_ready, word2int, int2word = word_loader(books, cutoff=min_count)

In [None]:
# Train Word2Vec model (gensim 3.x argument names; gensim 4 renamed size -> vector_size and iter -> epochs)
prev_time = time.time()
model = Word2Vec(books_ready, size=300, negative=20, iter=20, sample=1e-3, sg=1, min_count = min_count, workers=cpu_count())
print('{:.5f} seconds'.format(time.time() - prev_time))

In [82]:
# Let's check how similar the vectors for 'Harry' and 'Potter' are. Looks good!
model.similarity('Harry', 'Potter')

0.9821884369931323

In [81]:
# The vectors closest to 'magic's vector and how similar they are. Seems like for the most part
# a reasonable list of words!
model.most_similar('magic')

2019-08-16 17:09:21,583 : INFO : precomputing L2-norms of word weight vectors


[('spells', 0.6969023942947388),
 ('magical', 0.6892367601394653),
 ('sorcery', 0.6885683536529541),
 ('fiend', 0.6736955642700195),
 ('wizard', 0.6666735410690308),
 ('undead', 0.6593881249427795),
 ('stormbringer', 0.6543042659759521),
 ('diceless', 0.6521874666213989),
 ('hellboy', 0.6503678560256958),
 ('summoner', 0.6503649950027466)]

In [25]:
model.save('w2v/model_w2v.model')

2019-08-16 22:09:34,529 : INFO : saving Word2Vec object under model_w2v.model, separately None
2019-08-16 22:09:34,530 : INFO : storing np array 'vectors' to model_w2v.model.wv.vectors.npy
2019-08-16 22:09:34,852 : INFO : not storing attribute vectors_norm
2019-08-16 22:09:34,854 : INFO : storing np array 'syn1neg' to model_w2v.model.trainables.syn1neg.npy
2019-08-16 22:09:35,254 : INFO : not storing attribute cum_table
2019-08-16 22:09:35,437 : INFO : saved model_w2v.model


## Next Model
Let's build our new model that can use our fancy `word2vec` embeddings!

In [23]:
from gensim.models import KeyedVectors
# Use to load your saved w2v model instead of retraining
model_w2v = KeyedVectors.load('w2v/model_w2v.model')

2019-08-19 12:47:36,600 : INFO : loading Word2VecKeyedVectors object from w2v/model_w2v.model
2019-08-19 12:47:36,766 : INFO : loading wv recursively from w2v/model_w2v.model.wv.* with mmap=None
2019-08-19 12:47:36,767 : INFO : loading vectors from w2v/model_w2v.model.wv.vectors.npy with mmap=None
2019-08-19 12:47:36,819 : INFO : setting ignored attribute vectors_norm to None
2019-08-19 12:47:36,820 : INFO : loading vocabulary recursively from w2v/model_w2v.model.vocabulary.* with mmap=None
2019-08-19 12:47:36,821 : INFO : loading trainables recursively from w2v/model_w2v.model.trainables.* with mmap=None
2019-08-19 12:47:36,822 : INFO : loading syn1neg from w2v/model_w2v.model.trainables.syn1neg.npy with mmap=None
2019-08-19 12:47:36,875 : INFO : setting ignored attribute cum_table to None
2019-08-19 12:47:36,876 : INFO : loaded w2v/model_w2v.model


In [15]:
# Need a way of keeping track which words correspond to which word2vec embeddings
lookup = np.random.randn(len(word2int), 300)
for i, w in enumerate(word2int.keys()):
    try:
        embed = model_w2v[w]
        idx = word2int[w]
        lookup[idx] = embed
    except KeyError:
        pass
weights = torch.FloatTensor(lookup).to(device)


In [16]:
class WordRNN(nn.Module):
    '''
    A similar architecture to the character level RNN. This model predicts word by word instead of character by character
    '''
    def __init__(self, weights, vocab_size, n_layers=3, hidden_dim=512, dropout=.5):
        super(WordRNN, self).__init__()
        
        # Important numbers to save
        self.n_layers = n_layers
        self.hidden_dim = hidden_dim
        self.vocab_size = vocab_size
        
        # Actual layers of model
        self.embed = nn.Embedding.from_pretrained(weights)
        self.lstm = nn.LSTM(input_size=300, hidden_size=hidden_dim, num_layers=n_layers, batch_first=True, 
                            dropout=dropout)
        self.fc = nn.Linear(hidden_dim, self.vocab_size)
    
    def forward(self, x, h):
        x = self.embed(x)
        out, h = self.lstm(x, h)
        out = self.fc(out)
        
        return out, h
    
    def init_hidden(self, batch_size):
        return (torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device),
               torch.zeros(self.n_layers, batch_size, self.hidden_dim, device=device))

In [17]:
# Hyperparameters
num_layers = 3
hidden_dimension = 512
dropout_chance = 0.3
lr = 1e-3

word_net = WordRNN(weights, vocab_size=len(word2int), n_layers=num_layers, hidden_dim=hidden_dimension, dropout=dropout_chance)
word_net = word_net.to(device)
optim = torch.optim.Adam(word_net.parameters(), lr=lr) # optimize the word model's parameters, not char_net's
encoded_books_as_words = np.array([word2int[word] for word in books_ready[0]]) # will be input as the data

In [None]:
# Hyperparameters for training
show_time = False
summary_every = 300
seq_len = 100
batch_size = 10
reset_optimizer = False
lr = 1e-3 # won't be used unless reset optimizer is True

if reset_optimizer:
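    optim = torch.optim.Adam(word_net.parameters(), lr=lr)
# Note: this call passes literal values, so summary_every, seq_len, and show_time defined above are not used
train(word_net, encoded_books_as_words, optim, epochs=20, seq_len=50, batch_size=10, summary_every=5, print_time=True)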

In [None]:
# Generate a sequence of arbitrary length
seed_phrase='Harry' # Initializes hidden state of network(needs to be composed of words/punctuation/whitespace in vocab)
length=500 
p=.9 # 0 <= p <=1  sample from the top p most likely words
greedy=False # choose most likely word every time(don't recommend, easy for model to get off track and never recover)
include_seed=False # Include seed in output

sample = predict_sequence_tokens(word_net, seed_phrase=seed_phrase, length=length, p=p, argmax=greedy, include_seed=include_seed)

print(sample)

## Results
When you first start training, the network hasn't learned how important whitespace is, so output will look something like


```FollowslammedagainFluffyslightlyfirmlydarkcroakedSurehomeworkapplausefinetransparenthittimidly ```


It got better with short phrases again.

```Harry muttered at the fire.```

```"The dormitory," said Ron.```

```"Ron will do!"```

But it ran into problems similar to those of the character-level RNN.

```Don't with Dobby whispered it inside their heads.```

```"Don't no?" said Professor slowly, her smile seemed more
great.```

```Now, ridiculous," The  whispered.```

```"Flint done. Come, it should sometimes have better Dumbledore"```

There were some funny moments where you could see word2vec had grouped similar words that weren't interchangeable in context, like when it wrote ```black whole``` instead of ```black hole```.

Ultimately, the improvement over the character-level RNN was not what I hoped for.

# GPT-2

After underwhelming results with regard to coherent, natural text generation, I started researching available pretrained models. [GPT-2](https://openai.com/blog/better-language-models/) emerged as a good option. Although the most powerful version hadn't been released to the general public at the time of writing, I still found impressive results with its smaller models. In addition, I found an excellent open source package on GitHub called [gpt-2-simple](https://github.com/minimaxir/gpt-2-simple) which makes downloading, loading, fine-tuning, and sampling models super easy. 

The models can be pretty large so I haven't included them in the repo, but it doesn't take long to fine-tune them.

If you don't have gpt-2-simple, you can get it by executing:
```pip install gpt-2-simple```

In [6]:
#!pip install gpt-2-simple # Uncomment and run this to install
import gpt_2_simple as gpt2

model_name = "345M" # Options are the 117M parameter model or the 345M parameter model
filename = 'hp_all_concat.txt'
lr=1e-4
iterations = 1000
restore_from_checkpoint = 'latest' # latest or path to checkpoint to start training from
print_every = 1
save_model_every = 1000
sample_every = 100 # print a sample of generated text of length sample_length every sample_every iterations
sample_length = 1000


gpt2.download_gpt2(model_name=model_name)   # model is saved into current directory under /models/117M/ or /models/345M/
sess = gpt2.start_tf_sess()
gpt2.finetune(sess,
              filename,
              model_name=model_name,
              steps=iterations,
              learning_rate=lr,
              print_every=print_every,
              save_every=save_model_every,
              restore_from=restore_from_checkpoint,
              sample_every=sample_every,
              sample_length=sample_length)

In [None]:
# Generate a chunk of text
seed = None
top_k = 0
top_p = 0
temperature = 0.7
length = 1000
num_samples = 1

gpt2.generate(sess,
              seed=seed,
              top_k=top_k,
              top_p=top_p,
              temperature=temperature,
              length=length,
              nsamples=num_samples)

## Results

The fine-tuned GPT-2 model performed MUCH better and was the only model that could consistently give coherent responses of sentence length or longer. Early on, you could tell the fine-tuning hadn't overridden the general knowledge the model came in with, and it came up with a lot of semi-coherent responses that made no sense in the context of the Harry Potter universe.

```"Sometimes I wish I'd knew what the Wizards were all about." Ron's fists were
like lead. "I wish I'd had a list of all the stuff we hate about the Dursleys,
that's very topical. For years, I'd been trying to get a copy of Magic: The Gathering
with my wife and I.```

```"You think we're going to have to go all the way down to Gringotts to get a
fly in here?" said Uncle Vernon```

However, it quickly got better.

```As far back as Dudley's nightmares had gone, Harry had never been so
sad and wrong. ```

```She was holding a large pink umbrella, and looking through the
walls on her bushy brown head you can see her
through the crack in the white curtains.
"How long have I been up?" she said ```

``` "We're going out now," said Ron, tearing his eyes off another boat. "We're
done for.```

Still, there were some wacky outputs.

``` "Wizard's duel, Round 3," said Ron. He pulled out a toad and paced around
the boy.
"Oh, are you doing Herbology?" he asked Ron excitedly.```

# Conclusion

GPT-2 seemed to be far and away the best model for text generation, and the robust tools available for fine-tuning make it super quick to get a specialized model up and running. Also, top-p sampling (nucleus sampling) is a clever idea that avoids an issue with top-k sampling: when a probability distribution has k+x probable choices (for some x > 0), top-k cuts off x of them no matter how likely they are.
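
A small sketch of that last point, in plain Python with made-up distributions and hypothetical helper names: a fixed `k` keeps the same number of candidates no matter how the probability is spread, while top-p adapts the cutoff to the shape of the distribution.

```python
def top_k_tokens(probs, k):
    """Keep the k most likely tokens, regardless of how much mass they cover."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    return ranked[:k]

def top_p_tokens(probs, p):
    """Keep the most likely tokens until their cumulative probability reaches p."""
    ranked = sorted(probs, key=probs.get, reverse=True)
    kept, mass = [], 0.0
    for token in ranked:
        kept.append(token)
        mass += probs[token]
        if mass >= p:
            break
    return kept

peaked = {'the': 0.92, 'a': 0.05, 'an': 0.02, 'of': 0.01}
flat = {'Harry': 0.22, 'Ron': 0.21, 'Hermione': 0.20, 'Hagrid': 0.19, 'Snape': 0.18}

for name, dist in [('peaked', peaked), ('flat', flat)]:
    print(name, 'top-k:', top_k_tokens(dist, k=3), 'top-p:', top_p_tokens(dist, p=0.9))
```

With `k=3`, the flat distribution loses two candidates that are almost as likely as the ones kept, and the peaked distribution keeps two that are very unlikely; top-p keeps all five in the first case and only one in the second.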