# SECTION 1: Introduction

Welcome to my NLP project using GPT-3 and Seinfeld data. The goal is to let the user give GPT-3 a short prompt and have it respond with an AI-generated Seinfeld situation.

Example:
> Prompt: "Trying to connect to WiFi"  
> Response: "When the WiFi George usually steals suddenly has a password, he becomes addicted to trying to "hack" in. J: 'Just get your own!' G: 'NEVER' "

(The response is one of the examples from the @ModernSeinfeld Twitter feed.)



# SECTION 2: Data

The data used is:


*   Seinfeld episode synopses (173), from IMDb, scraped here: https://www.kaggle.com/bcruise/seinfeld-episodes
*   @ModernSeinfeld tweets (492), scraped using twint in the accompanying notebook
*   Curb Your Enthusiasm episode synopses, which might be interesting to add later for more of a "Larry David" bot

With a combined 665 examples (173 synopses plus 492 tweets), we should have enough data to fine-tune GPT-3. According to the OpenAI guide <https://beta.openai.com/docs/guides/fine-tuning>, "we recommend having at least a couple hundred examples. In general, we've found that each doubling of the dataset size leads to a linear increase in model quality."

In [63]:
import pandas as pd

# As we load the IMDb data, we keep only the columns we need
episodes_df = pd.read_csv('./data/seinfeld_imdb.csv.xls', usecols=['title', 'desc'])
# In the tweets we also skip the first data row, which is a tweet promoting the author's book
tweets_df = pd.read_csv('./data/SeinfeldToday_tweets.csv', usecols=['tweet'], skiprows=[1])

pd.set_option('display.max_colwidth', None)
print(f"Our episode dataframe contains synopses for {len(episodes_df)} episodes. The first few examples are:\n{episodes_df['desc'].head(3)} \n")
print(f"Our tweets dataframe contains {len(tweets_df)} tweets. The first few examples are: \n{tweets_df.head(3)}")

Our episode dataframe contains synopses for 173 episodes. The first few examples are:
0                                                                               Jerry and George argue whether an overnight visitor Jerry is expecting is coming with romantic intentions.
1                                           Jerry and George stake out the lobby of an office building to find a woman Jerry met at a party but whose name and phone number he didn't get.
2    After Jerry's apartment is robbed, Jerry starts to look for other apartments. But Jerry and George both want the same apartment, and Elaine wants the apartment of whoever loses out.
Name: desc, dtype: object 

Our tweets dataframe contains 492 tweets. The first few examples are: 
                                                                                                                                          tweet
0  George's GF wants a "no phones at dinner" rule. G: "We had a good thing going, Jerry!  Now we're supposed to

## Data cleaning tasks:
* Strip leading whitespace, especially from episode data
* In tweets, expand speaker abbreviations, e.g. from `J:"They've seen the credit card!"` to `Jerry:"They've seen the credit card!"`
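
The two cleaning tasks above could be sketched roughly as follows. This is only an illustration: the `SPEAKERS` mapping and the `clean_text` helper are hypothetical names, and the entries for Elaine and Kramer are an assumption, since only the `J:` and `G:` tags appear in the examples shown here.

```python
import re

# Assumed mapping from one-letter speaker tags to full names.
# Only J and G are confirmed by the examples in this notebook.
SPEAKERS = {"J": "Jerry", "G": "George", "E": "Elaine", "K": "Kramer"}

# A speaker tag is one of the known capital letters, not preceded by a
# word character, and immediately followed by a colon.
SPEAKER_TAG = re.compile(r"(?<!\w)([JGEK]):")


def clean_text(text: str) -> str:
    """Strip surrounding whitespace and expand tags such as 'J:' to 'Jerry:'."""
    return SPEAKER_TAG.sub(lambda m: SPEAKERS[m.group(1)] + ":", text.strip())


print(clean_text('  J:"They\'ve seen the credit card!"'))
```

On the dataframes loaded earlier, something like `tweets_df['tweet'].map(clean_text)` and `episodes_df['desc'].str.strip()` would apply the same steps column-wide.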

## Prepare the fine-tuning file

Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. The format looks like:
  > {"prompt": "\<prompt text\>", "completion": "\<ideal generated text\>"}
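
As a minimal sketch of the JSONL format described above, each example can be serialized with `json.dumps` and joined with newlines. The `to_jsonl` helper is a hypothetical name, and the sample pair reuses the WiFi example from the introduction.

```python
import json
from typing import Iterable


def to_jsonl(examples: Iterable[dict]) -> str:
    """Serialize examples as JSONL: one JSON object per line."""
    return "".join(json.dumps(example) + "\n" for example in examples)


examples = [
    {
        "prompt": "Trying to connect to WiFi",
        "completion": 'When the WiFi George usually steals suddenly has a password, '
                      'he becomes addicted to trying to "hack" in.',
    },
]

jsonl_text = to_jsonl(examples)
print(jsonl_text)
```

`json.dumps` takes care of escaping the inner double quotes and any newlines, so every example stays on a single line. The resulting string could then be written to a file with a `.jsonl` extension.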

In [78]:
# Define a function to help format each example
from typing import Optional

def new_example(completion: str, prompt: Optional[str] = None) -> dict:
    """\
    Format a single training example for GPT-3 and return it as a dict.
    
    Args:
        completion (str): The desired completion of the example
        prompt (str): optional, the prompt part of the example 
            (default is None, which yields an empty prompt)
    
    Returns:
        dict: A dictionary with prompt and completion keys. Serialize it with json.dumps() to get a line in the desired GPT-3 format; str() emits single quotes, which is not valid JSON.
    
    Your data must be a JSONL document, where each line is a prompt-completion pair corresponding to a training example. The format looks like:
    {"prompt": "<prompt text>", "completion": "<ideal generated text>"}
    
    - Each prompt should end with a fixed separator to inform the model when the prompt ends and the completion begins. A simple separator which generally works well is \\n\\n###\\n\\n. The separator should not appear elsewhere in any prompt.
    - Each completion should start with a whitespace due to our tokenization, which tokenizes most words with a preceding whitespace.
    - Each completion should end with a fixed stop sequence to inform the model when the completion ends. A stop sequence could be \\n, ###, or any other token that does not appear in any completion.
    - For inference, you should format your prompts in the same way as you did when creating the training dataset, including the same separator. Also specify the same stop sequence to properly truncate the completion.
    """
    
    # A missing prompt becomes an empty string; str(None) would produce the literal text 'None'
    example = {"prompt": "" if prompt is None else str(prompt), "completion": completion}
    return example

In [79]:
print(new_example('test'))

{'prompt': '', 'completion': 'test'}
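
The formatting guidelines quoted in the docstring above (a fixed separator after the prompt, a leading space on the completion, and a fixed stop sequence) might be applied along these lines. This is a sketch, not the notebook's final implementation: `format_pair` is a hypothetical helper, and `" END"` is an assumed stop sequence chosen only for illustration.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"  # separator suggested in the guidelines
STOP_SEQUENCE = " END"            # assumption: any token absent from all completions works


def format_pair(prompt: str, completion: str) -> dict:
    """Apply the separator, leading-space, and stop-sequence conventions."""
    prompt = prompt.strip()
    completion = completion.strip()
    # The guidelines require that the separator and stop sequence are unambiguous
    if PROMPT_SEPARATOR in prompt:
        raise ValueError("the separator must not appear inside the prompt")
    if STOP_SEQUENCE in completion:
        raise ValueError("the stop sequence must not appear inside the completion")
    return {
        "prompt": prompt + PROMPT_SEPARATOR,
        "completion": " " + completion + STOP_SEQUENCE,
    }


pair = format_pair("Trying to connect to WiFi", "Just get your own!")
print(pair)
```

Raising an error on a clash, rather than silently rewriting the text, keeps problem examples visible while the dataset is still small enough to inspect by hand.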


# SECTION 3: Working with OpenAI and GPT-3

- First, we load the API key
- Then we fine-tune a model, following https://beta.openai.com/docs/guides/fine-tuning
- Then we test the model by prompting it and seeing what Seinfeld-style situation it comes up with
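
For the last step, the prompt sent at inference time should be formatted the same way as the training prompts. The sketch below only builds the request arguments and does not call the API. `build_completion_request` is a hypothetical helper, the separator and stop sequence are assumed to match whatever was used when preparing the training file, and `YOUR_FINE_TUNED_MODEL` is a placeholder rather than a real model name.

```python
PROMPT_SEPARATOR = "\n\n###\n\n"  # assumed to match the training data
STOP_SEQUENCE = " END"            # assumed to match the training data


def build_completion_request(user_prompt: str, model: str, max_tokens: int = 80) -> dict:
    """Build keyword arguments for a completion request against a fine-tuned model."""
    return {
        "model": model,
        "prompt": user_prompt.strip() + PROMPT_SEPARATOR,
        "max_tokens": max_tokens,
        "stop": [STOP_SEQUENCE],
    }


request = build_completion_request("Trying to connect to WiFi", model="YOUR_FINE_TUNED_MODEL")
print(request)

# With the pre-1.0 openai package used in this notebook, the request
# could then be sent as:
# response = openai.Completion.create(**request)
```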


In [24]:
!pip -q install openai

In [2]:
import os
import openai

In [20]:
# Load OpenAI API key

try:
    # When in Colab: read the key from a file on Google Drive
    from google.colab import drive
    drive.mount('/content/drive')
    with open("/content/drive/My Drive/Colab Notebooks/GPT3_api", 'r') as file:
        openai.api_key = file.read().strip()
except Exception:
    # When in local dev environment
    try:
        # Load variables from .env file in working directory
        !pip install python-dotenv
        from dotenv import load_dotenv
        if not load_dotenv():
            print("Warning: .env file not found or empty")
    except ImportError:
        # You'll need to set the environment variables somehow, perhaps in .bashrc
        print("Warning: python-dotenv is not available")
    API_KEY = os.getenv('PROJECT_API_KEY')
    openai.api_key = os.getenv("OPENAI_API_KEY")



In [21]:
# Test OpenAI API
response = openai.Completion.create(engine="text-davinci-001", prompt="Say this is a test", max_tokens=6)
print(response)

{
  "choices": [
    {
      "finish_reason": "length",
      "index": 0,
      "logprobs": null,
      "text": "\n\nThis is a test"
    }
  ],
  "created": 1644344165,
  "id": "cmpl-4ZPK1Y6D34H8YooQRTdNznDCfDhvS",
  "model": "text-davinci:001",
  "object": "text_completion"
}
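
To pull the generated text out of a response shaped like the one above, a small helper can be used. This sketch works on a plain dict copied from the printed output. It assumes the object returned by the library supports the same key-based access, and both helper names are hypothetical.

```python
# The response printed above, reproduced as a plain dict for illustration
response = {
    "choices": [
        {
            "finish_reason": "length",
            "index": 0,
            "logprobs": None,
            "text": "\n\nThis is a test",
        }
    ],
    "created": 1644344165,
    "id": "cmpl-4ZPK1Y6D34H8YooQRTdNznDCfDhvS",
    "model": "text-davinci:001",
    "object": "text_completion",
}


def first_completion_text(response: dict) -> str:
    """Return the text of the first choice without surrounding whitespace."""
    return response["choices"][0]["text"].strip()


def hit_token_limit(response: dict) -> bool:
    """Return True if generation stopped because max_tokens was reached."""
    return response["choices"][0]["finish_reason"] == "length"


print(first_completion_text(response))
print(hit_token_limit(response))
```

A `finish_reason` of `"length"` indicates the completion was cut off at `max_tokens`, which is worth checking once longer Seinfeld situations are being generated.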
