# SECTION 1: Introduction

Welcome to my NLP project using GPT-3 and Seinfeld data. The goal is to allow the user to prompt GPT-3 with something and have it respond with an AI-generated Seinfeld situation.

Example:
> Prompt: "Trying to connect to WiFi"  
> Response: "When the WiFi George usually steals suddenly has a password, he becomes addicted to trying to "hack" in. J: 'Just get your own!' G: 'NEVER' "

(The response is one of the examples from the @ModernSeinfeld Twitter feed.)



# SECTION 2: Data

The data used is:


*   Seinfeld episode synopses (173), from IMDb, scraped here: https://www.kaggle.com/bcruise/seinfeld-episodes
*   @ModernSeinfeld tweets (492), scraped using twint in the accompanying notebook
*   Curb Your Enthusiasm episode synopses, which might be interesting to add later for more of a "Larry David" bot

That makes for a total of 665 examples. According to the OpenAI guide <https://beta.openai.com/docs/guides/fine-tuning>, "we recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."

In [40]:
import pandas as pd

# As we load the IMDb data, we keep only the columns we need
episodes_df = pd.read_csv('./data/seinfeld_imdb.csv.xls', usecols=['title', 'desc'])
# For the tweets, we also skip the first data row, which is a tweet promoting the author's book
tweets_df = pd.read_csv('./data/SeinfeldToday_tweets.csv', usecols=['tweet'], skiprows=[1])

pd.set_option('display.max_colwidth', None)
print(f"Our episode dataframe contains synopses for {len(episodes_df)} episodes. The first few examples are:\n{episodes_df['desc'].head(3)} \n")
print(f"Our tweets dataframe contains {len(tweets_df)} tweets. The first few examples are: \n{tweets_df.head(3)}")

Our episode dataframe contains synopses for 173 episodes. The first few examples are:
0                                                                               Jerry and George argue whether an overnight visitor Jerry is expecting is coming with romantic intentions.
1                                           Jerry and George stake out the lobby of an office building to find a woman Jerry met at a party but whose name and phone number he didn't get.
2    After Jerry's apartment is robbed, Jerry starts to look for other apartments. But Jerry and George both want the same apartment, and Elaine wants the apartment of whoever loses out.
Name: desc, dtype: object 

Our tweets dataframe contains 492 tweets. The first few examples are: 
                                                                                                                                          tweet
0  George's GF wants a "no phones at dinner" rule. G: "We had a good thing going, Jerry!  Now we're supposed to

## Data cleaning tasks:
* Strip leading and trailing whitespace, especially from episode data
* In tweets, convert quotes from J:"They've seen the credit card!" to Jerry:"They've seen the credit card!"

In [54]:
# When a tweet contains dialogue, as the first example does, replace the single-letter speaker tag with the full name
# (i.e. replaces G: with George:). Note that the match also consumes the opening quote, if present.
def repl(match):
    x = match.group(1)
    return {
        'G': "George",
        'J': "Jerry",
        'E': "Elaine",
        'K': "Kramer"
    }[x] + ": "

# \b keeps a capital letter at the end of a longer word (e.g. "OK:") from being treated as a speaker tag
tweets_df['tweet'] = tweets_df['tweet'].str.replace(r'\b([JGEK])(:\s*"?)', repl, regex=True).str.strip()
episodes_df['desc'] = episodes_df['desc'].str.strip()
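
To see what the substitution does on a single string, here is a minimal, self-contained sketch that applies a pattern like the one above with Python's `re` module instead of pandas. The `expand_speakers` helper name and the sample text are made up for illustration and are not part of the dataset.

```python
import re

# Mapping from single-letter speaker tags to full names
NAMES = {"G": "George", "J": "Jerry", "E": "Elaine", "K": "Kramer"}

def expand_speakers(text: str) -> str:
    """Replace tags such as 'J:' with 'Jerry: ' (an opening quote right after the tag is dropped)."""
    # \b keeps a capital letter at the end of a longer word (e.g. "OK:") from matching
    return re.sub(r'\b([JGEK])(:\s*"?)', lambda m: NAMES[m.group(1)] + ": ", text).strip()

# Hypothetical sample, not taken from the dataset
sample = 'Kramer opens a food truck. J: "You can\'t cook!" K: "That\'s the hook, Jerry!"'
print(expand_speakers(sample))
```

Because the optional opening quote is part of the match, the closing quote of each line of dialogue is left behind, which is also visible in the training examples further down.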

## Manual review of examples
Since there aren't that many examples, we can take a look at all of them.
Using a simple data-labeling Python extension called Pigeon, we can mark each example as 'ready' or 'needs cleaning'. While we review, we can also chuckle.

In [8]:
pip -qq install pigeon-jupyter

Note: you may need to restart the kernel to use updated packages.


In [45]:
from pigeon import annotate
annotations = annotate(
  (*tweets_df['tweet'], *episodes_df['desc']),
  options=['ready', 'needs cleaning']
)

HTML(value='0 examples annotated, 666 examples left')

HBox(children=(Button(description='ready', style=ButtonStyle()), Button(description='needs cleaning', style=Bu…

Output()
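
Once the review is done, `annotations` should hold `(example, label)` pairs. That is how the Pigeon README describes the return value, so treat it as an assumption if your version differs. Here is a minimal sketch of grouping the examples by label, using a hypothetical hand-written list in place of real annotations:

```python
# Hypothetical stand-in for the list returned by pigeon's annotate()
sample_annotations = [
    ("Jerry and George argue about an overnight visitor.", "ready"),
    ("  Elaine's new boyfriend only texts in emoji ", "needs cleaning"),
    ("Kramer starts a podcast in Jerry's apartment.", "ready"),
]

def split_by_label(annotations):
    """Group annotated examples into {label: [examples]}."""
    groups = {}
    for example, label in annotations:
        groups.setdefault(label, []).append(example)
    return groups

groups = split_by_label(sample_annotations)
print(f"{len(groups.get('ready', []))} ready, {len(groups.get('needs cleaning', []))} need cleaning")
```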

## Prepare the data for GPT-3 fine-tuning

GPT-3 allows for fine-tuning; see https://beta.openai.com/docs/guides/fine-tuning/

### What is fine-tuning?

> GPT-3 has been pre-trained on a vast amount of text from the open internet. When given a prompt with just a few examples, it can often intuit what task you are trying to perform and generate a plausible completion. This is often called "few-shot learning."
>
> Fine-tuning improves on few-shot learning by training on many more examples than can fit in the prompt, letting you achieve better results on a wide number of tasks. Once a model has been fine-tuned, you won't need to provide examples in the prompt anymore.

You fine-tune by providing data in a JSONL document, where each line is a prompt-completion pair corresponding to a training example. The format looks like:
  > {"prompt": "\<prompt text\>", "completion": "\<ideal generated text\>"}

In [100]:
# Define a function to help format each example
import json

STOP_TOKEN = " ##END##"

def new_example(completion: str, prompt: str = '') -> str:
    """\
    Formats a single training example for GPT-3 and returns it as a JSON string.
    
    Args:
        completion (str): The desired completion of the example
        prompt (str): optional, the prompt part of the example 
            (default is an empty string)
    
    Returns:
        str: A JSON object with prompt and completion keys, i.e. one line of the JSONL file in the format GPT-3 expects.
    
    Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. The format looks like:
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    
    - Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \\n\\n###\\n\\n. The separator should not appear elsewhere in any prompt.
    - Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
    - Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \\n, ###, or any other token that does not appear in any completion.
    - For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
    """
    example = {"prompt": str(prompt), "completion": " " + completion + STOP_TOKEN}
    # Serialize with json.dumps rather than str(), which would produce single quotes instead of the double quotes JSON requires
    return json.dumps(example)

In [101]:
print(new_example('test'))

{"prompt": "", "completion": " test ##END##"}
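
The reason for serializing with `json.dumps` rather than `str()` is easy to check. The sketch below is only an illustration with a made-up example dictionary and a hypothetical `is_valid_json` helper: `str()` yields a Python repr with single quotes, which a JSON parser rejects, while `json.dumps` yields a line that parses back to the same dictionary.

```python
import json

example = {"prompt": "", "completion": " test ##END##"}

as_repr = str(example)         # Python repr: single-quoted keys and values
as_json = json.dumps(example)  # valid JSON: double-quoted keys and values

def is_valid_json(text: str) -> bool:
    """Return True if text parses as JSON."""
    try:
        json.loads(text)
        return True
    except json.JSONDecodeError:
        return False

print(as_repr, is_valid_json(as_repr))
print(as_json, is_valid_json(as_json))
```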


### Fine-tuning for Contextual vs. Open-ended Generation

For the examples we feed GPT-3, there are several options for what to put in the prompt and completion, depending on the use case. The GPT-3 fine-tuning guide covers classification (e.g., sentiment analysis, email triage), contextual generation (e.g., writing an advert based on a Wikipedia entry, or a product description based on technical properties), and open-ended generation (e.g., maintaining company voice, generating haikus). https://beta.openai.com/docs/guides/fine-tuning/specific-guidelines

To use contextual generation, which requires fewer examples for good performance, we would need to associate a prompt with each example. For instance, for the example at the beginning of this notebook we might feed the following line to GPT-3 (note that the inner double quotes must be escaped for the line to be valid JSON):

  > `{"prompt": "Wifi", "completion": "When the WiFi George usually steals suddenly has a password, he becomes addicted to trying to \"hack\" in. J: 'Just get your own!' G: 'NEVER'"}`

We might return to contextual generation, but it's hard to imagine how to do so for our data. Particularly with the episode synopses, how would we pick prompts that capture the several plot lines happening in each episode? And if we do provide prompts, would the model be able to generalize to the new real-world situations that a user might throw at it?
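
For reference, a contextual-generation example in the format the guide describes could be built roughly like this. This is a sketch only: the `contextual_example` helper is hypothetical, the separator `\n\n###\n\n` is the one the guide suggests, and the stop sequence is the one used elsewhere in this notebook.

```python
import json

SEPARATOR = "\n\n###\n\n"   # separator suggested in the fine-tuning guide
STOP_TOKEN = " ##END##"     # same stop sequence as in this notebook

def contextual_example(prompt: str, completion: str) -> str:
    """Return one JSONL line: the prompt ends with the separator, the completion
    starts with a space and ends with the stop sequence."""
    return json.dumps({
        "prompt": prompt + SEPARATOR,
        "completion": " " + completion + STOP_TOKEN,
    })

line = contextual_example(
    "Wifi",
    "When the WiFi George usually steals suddenly has a password, "
    "he becomes addicted to trying to \"hack\" in. J: 'Just get your own!' G: 'NEVER'",
)
print(line)
```

`json.dumps` takes care of escaping the inner double quotes and the newlines in the separator, so the result stays on a single line.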

Let's first try open-ended fine-tuning. The guidelines are:

>- Leave the prompt empty
>- No need for any separators
>- You'll normally want a very large number of examples, at least a few thousand
>- Ensure the examples cover the intended domain or the desired tone of voice
>- Example: {"prompt":"", "completion":" <company voice textual content>"}

Note that the documentation cautions that open-ended generation normally requires "at least a few thousand" examples. Note also that "generative tasks have a potential to leak training data when requesting completions from the model." With relatively few examples (665), the model might just parrot back some of the data we feed it, regardless of the situation we prompt it with.
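
One way to watch for this is to compare each generated completion with the training examples and flag near-verbatim matches. The sketch below uses `difflib.SequenceMatcher` from the standard library. The helper name, the 0.9 threshold, and the "generated" string are assumptions chosen for illustration, not tuned values or real model output.

```python
from difflib import SequenceMatcher

def closest_training_example(generated: str, training: list) -> tuple:
    """Return (best_ratio, best_example) for the most similar training example."""
    def normalize(text):
        # Lowercase and collapse whitespace so trivial differences are ignored
        return " ".join(text.lower().split())

    best_ratio, best_example = 0.0, None
    for example in training:
        ratio = SequenceMatcher(None, normalize(generated), normalize(example)).ratio()
        if ratio > best_ratio:
            best_ratio, best_example = ratio, example
    return best_ratio, best_example

training = [
    "Jerry and George argue whether an overnight visitor Jerry is expecting is coming with romantic intentions.",
    "Jerry, George, Kramer and Elaine get stuck in standstill traffic due to the massive Puerto Rican Day Parade.",
]
# Hypothetical model output that copies a training example almost verbatim
generated = "jerry and george argue whether an overnight visitor Jerry is expecting is coming with romantic intentions."

ratio, match = closest_training_example(generated, training)
if ratio > 0.9:  # arbitrary threshold
    print(f"Possible parroting (similarity {ratio:.2f}): {match}")
```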

## Create the file with examples

In [98]:
TRAIN_FILE = "./data/training_examples.JSONL"
jsonl = []

for example in (*tweets_df['tweet'], *episodes_df['desc']):
    jsonl.append(new_example(example))
# Print the second and third items and the last 3 items in the JSONL
print(jsonl[1:3], "\n...\n", jsonl[-3:])

with open(TRAIN_FILE, "w") as outfile:
    outfile.write("\n".join(jsonl))

['{"prompt": "", "completion": " After her fuck buddy texts her that she should come over to \\"watch Netflix,\\" Elaine is pissed when he actually just wants to watch Netflix. ##END##"}', '{"prompt": "", "completion": " Jerry refuses to go to a Cash Only diner. Jerry: They\\u2019ve seen the credit card! They know the credit card! It\\u2019s time to accept the credit card!\\" ##END##"}'] 
...
 ['{"prompt": "", "completion": " Jerry, George, Kramer and Elaine get stuck in standstill traffic due to the massive Puerto Rican Day Parade. ##END##"}', '{"prompt": "", "completion": " Just as the four are about to go to the movies, Jerry looks back on the past nine years with the audience. ##END##"}', '{"prompt": "", "completion": " After George and Jerry land a production deal with NBC, the four head out for Paris on NBC\'s private plane and are waylaid in a small Massachusetts town. ##END##"}']
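
Before handing the file to the OpenAI tooling, a quick local sanity check can confirm that every line parses as JSON and that each completion starts with a space and ends with the stop sequence. This is a sketch: `check_jsonl` is a hypothetical helper, and the demo writes a small temporary file rather than reading the real training file.

```python
import json
import os
import tempfile

STOP_TOKEN = " ##END##"

def check_jsonl(path: str) -> list:
    """Return a list of (line_number, problem) tuples; an empty list means the file looks fine."""
    problems = []
    with open(path, encoding="utf-8") as infile:
        for number, line in enumerate(infile, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((number, "not valid JSON"))
                continue
            if not isinstance(record, dict):
                problems.append((number, "not a JSON object"))
                continue
            completion = record.get("completion", "")
            if not completion.startswith(" "):
                problems.append((number, "completion does not start with a space"))
            if not completion.endswith(STOP_TOKEN):
                problems.append((number, "completion does not end with the stop sequence"))
    return problems

# Demo on a temporary file with one good and one bad line
with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps({"prompt": "", "completion": " test ##END##"}) + "\n")
    tmp.write(json.dumps({"prompt": "", "completion": "no space or stop"}) + "\n")
problems = check_jsonl(tmp.name)
os.remove(tmp.name)
print(problems)
```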


# SECTION 3: Using the OpenAI and GPT-3 APIs

- Use OpenAI command-line tools to validate our training file
- Load the API key
- Fine-tune a model, following https://beta.openai.com/docs/guides/fine-tuning
- Prompt the model with something and see what Seinfeldy situation it comes up with


In [24]:
!pip -qq install openai

In [99]:
# Validate the training file with the OpenAI command-line data preparation tool
!openai tools fine_tunes.prepare_data -f {TRAIN_FILE}

Logging requires wandb to be installed. Run `pip install wandb`.
Analyzing...

- Your file contains 665 prompt-completion pairs

No remediations found.

You can use your file for fine-tuning:
> openai api fine_tunes.create -t "./data/training_examples.JSONL"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=[" ##END##"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 11.58 minutes to train a `curie` model, and less for `ada` and `babbage`. Queue will approximately take half an hour per job ahead of you.


In [20]:
# Load OpenAI API Key
import os
import openai

try:
    # When in Colab: read the key from a file on Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    with open("/content/drive/My Drive/Colab Notebooks/GPT3_api", 'r') as file:
        openai.api_key = file.read().rstrip('\n')
except ImportError:
    # When in local dev environment
    try:
        # Load variables from .env file in working directory
        # (requires python-dotenv: run `pip install python-dotenv` once)
        from dotenv import load_dotenv
        if not load_dotenv():
            # You'll need to set the environment variables somehow, perhaps in .bashrc
            print("Warning: .env file not found or empty")
    except ImportError:
        print("Warning: python-dotenv is not installed")
    openai.api_key = os.getenv("OPENAI_API_KEY")



In [21]:
# Test OpenAI API
response = openai.Completion.create(engine="text-davinci-001", prompt="Say this is a test", max_tokens=6)
print(response)

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "\n\nThis is a test"
    }
  ],
  "created": 1644344165,
  "id": "cmpl-4ZPK1Y6D34H8YooQRTdNznDCfDhvS",
  "model": "text-davinci:001",
  "object": "text_completion"
}
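
Once a fine-tuned model exists, the same `Completion` endpoint can be pointed at it. The sketch below only builds the request arguments and cleans up a returned text, without calling the API. The model name is a placeholder, the `max_tokens` and `temperature` values are arbitrary, the helper names are hypothetical, and the sample completion is made up. The `stop` value follows the advice printed by the data preparation tool above.

```python
STOP_TOKEN = " ##END##"

def build_request(prompt: str, model: str) -> dict:
    """Keyword arguments intended for openai.Completion.create(**kwargs)."""
    return {
        "model": model,          # name of the fine-tuned model (placeholder below)
        "prompt": prompt,
        "max_tokens": 100,       # arbitrary cap on completion length
        "temperature": 0.8,      # arbitrary; higher values give more varied text
        "stop": [STOP_TOKEN],    # truncate at the stop sequence used in training
    }

def clean_completion(text: str) -> str:
    """Drop anything after the stop sequence (if present) and trim whitespace."""
    return text.split(STOP_TOKEN)[0].strip()

kwargs = build_request("Trying to connect to WiFi", model="<your-fine-tuned-model>")
print(kwargs)
# Made-up completion text, only to show the clean-up step
print(clean_completion(" George fakes a limp to skip the line. ##END## extra"))
```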
