### OpenAI Access

First, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these resources](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

In [1]:
import os 

# Set the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = ""
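
A key pasted directly into a cell is easy to leak when the notebook is shared. As a minimal sketch, you could read the key from the environment and prompt for it only when it is missing. The helper name `load_api_key` is hypothetical and not part of any library:

```python
import os
from getpass import getpass

def load_api_key(var_name="OPENAI_API_KEY", prompt_fn=getpass):
    """Return the key stored in `var_name`, prompting only if it is unset or blank."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not appear in the cell output
        key = prompt_fn(f"Enter {var_name}: ").strip()
        os.environ[var_name] = key
    return key
```

Calling `load_api_key()` once at the top of the notebook would leave the rest of the cells unchanged, since they read the key from `os.environ`.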

### OpenAI API Library

We'll be leveraging [this](https://github.com/openai/openai-python) library to access OpenAI's model endpoints.

There are a number of models to choose from and you can find resources about them [here](https://platform.openai.com/docs/models) and their pricing [here](https://openai.com/pricing).

The first step is to install `openai`!

In [3]:
!pip install openai -q


Once we've installed it, we need to import it and set our API key!

In [4]:
import openai 

openai.api_key = os.environ.get("OPENAI_API_KEY")

If you wanted to use `gpt-4`, you'd need an account that has closed beta access to the model endpoint. 

You can check if your API Key has access using the following cell.

In [5]:
# check if acct. has gpt-4 access
"gpt-4" in [model["root"] for model in openai.Model.list()["data"]]

True

For the rest of the tutorial, we're going to assume you're using `gpt-3.5-turbo` as your model.

Let's make some helper functions for prompting our model and generating our prompts.

In [7]:
def prompt_model(prompt_list, model="gpt-3.5-turbo"):
  return openai.ChatCompletion.create(model=model, messages=prompt_list)

def create_prompt(role, prompt):
  return {"role" : role, "content" : prompt}

As you can see, our prompts have to be in a specific format - as set by OpenAI.

Here's an example:

```
{"role" : "system", "content" : "You are an expert in Python programming."}
{"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
```
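
To make the format concrete, here is a small sketch that builds the same two-message list and rejects a misspelled role before anything is sent. `VALID_ROLES`, `make_message`, and `build_messages` are hypothetical names introduced for illustration, and the role set covers only the roles relevant to this tutorial:

```python
# Roles relevant to this tutorial; an assumption for illustration, not an exhaustive list
VALID_ROLES = {"system", "user", "assistant"}

def make_message(role, content):
    """Build one chat message, failing early on an unknown role."""
    if role not in VALID_ROLES:
        raise ValueError(f"Unknown role: {role!r}")
    return {"role": role, "content": content}

def build_messages(system_text, user_text):
    """Return a [system, user] message list ready to send to a chat model."""
    return [make_message("system", system_text), make_message("user", user_text)]

messages = build_messages(
    "You are an expert in Python programming.",
    "Please define a function that provides the Nth number of the fibonacci sequence.",
)
```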

Let's see that in action! Remember that you can feed OpenAI's chat completion endpoint with a list of prompts!

In [8]:
list_of_prompts = [
    {"role" : "system", "content" : "You are an expert in Python programming."}, 
    {"role" : "user", "content" : "Please define a function that provides the Nth number of the fibonacci sequence."}
]

model_output = prompt_model(list_of_prompts)
print(model_output)

{
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Sure! Here's a function that returns the Nth number in the Fibonacci sequence using recursion:\n\n```python\ndef fibonacci(n):\n    if n <= 0:\n        return \"Invalid input, n should be greater than or equal to 1\"\n    elif n == 1:\n        return 0\n    elif n == 2:\n        return 1\n    else:\n        return fibonacci(n - 1) + fibonacci(n - 2)\n\n# Example Usage:\nn = 10\nprint(f\"The {n}th number in the Fibonacci sequence is {fibonacci(n)}\")\n```\n\nPlease note that this implementation is not optimized for large values of `n` as it has an exponential time complexity. For larger values of `n`, I recommend using dynamic programming or matrix exponentiation to solve the problem.",
        "role": "assistant"
      }
    }
  ],
  "created": 1682484371,
  "id": "chatcmpl-79RKdT6EXbI3JlXY6bKb9IBfIARfh",
  "model": "gpt-4-0314",
  "object": "chat.completion",
  "usage": {
   

As you can see, we get a lot of information back from the endpoint. 

We can see the number of tokens we used, why the output stopped, what the output is, and more!
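
As a rough sketch of pulling those fields out, the snippet below works on a hand-written stand-in dictionary with the same shape as the response above. The values in `sample_response` are made up for illustration, and `summarize_response` is a hypothetical helper:

```python
# Hand-written stand-in shaped like the chat completion response; values are invented
sample_response = {
    "choices": [
        {
            "finish_reason": "stop",
            "index": 0,
            "message": {"role": "assistant", "content": "Hello!"},
        }
    ],
    "model": "gpt-3.5-turbo",
    "usage": {"prompt_tokens": 12, "completion_tokens": 3, "total_tokens": 15},
}

def summarize_response(response):
    """Collect the generated text, the stop reason, and the token count."""
    choice = response["choices"][0]
    return {
        "text": choice["message"]["content"],
        "finish_reason": choice["finish_reason"],
        "total_tokens": response["usage"]["total_tokens"],
    }

summary = summarize_response(sample_response)
```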

Let's render the response more readably using IPython's display utilities.

In [9]:
from IPython.display import display, Markdown

markdown_output = model_output["choices"][0]["message"]["content"]

display(Markdown(markdown_output))

Sure! Here's a function that returns the Nth number in the Fibonacci sequence using recursion:

```python
def fibonacci(n):
    if n <= 0:
        return "Invalid input, n should be greater than or equal to 1"
    elif n == 1:
        return 0
    elif n == 2:
        return 1
    else:
        return fibonacci(n - 1) + fibonacci(n - 2)

# Example Usage:
n = 10
print(f"The {n}th number in the Fibonacci sequence is {fibonacci(n)}")
```

Please note that this implementation is not optimized for large values of `n` as it has an exponential time complexity. For larger values of `n`, I recommend using dynamic programming or matrix exponentiation to solve the problem.

### Generating Synthetic Data

Alright, now we can pull everything together and start creating our synthetic data!

**NOTE:** Using OpenAI's endpoints to create our dataset does mean that we cannot use our model for commercial use. This is meant to demonstrate the methods, and can be extended to any open-source LLM.

We're going to use this process to create 100 product/marketing email pairs. 

We'll do this in two steps:

1. Create the 100 products and short descriptions.
2. Create marketing emails for each of those 100 product/descriptions.

Let's begin by creating the prompt for our products/descriptions!

In [10]:
datagen_prompts = [
    {"role" : "system", "content" : "You are a product innovator. You create new products that people crave."},
    {"role" : "user", "content" : "Please generate a Python list of 100 new products and extremely short descriptions."},
]

In [11]:
first_data_gen = prompt_model(datagen_prompts)
print(first_data_gen["choices"][0]["message"]["content"])

Here's a list of 100 new product ideas and their extremely short descriptions:

1. SmartEyes - Glasses with real-time translation.
2. SolarCup - Self-heating coffee cup.
3. FingerCharge - Charging device on fingertips.
4. DreamMapper - App for decoding dreams.
5. FlexiFridge - Expandable refrigerator.
6. AnyTaste Gum - Flavor-changing chewing gum.
7. AirPrint - Mini portable air printer.
8. PlantBuddy - Plant care monitoring system.
9. MoodWear - Color-changing clothes.
10. SunscreenPill - Oral sunscreen supplement.
11. FreshStep - Instant shoe deodorizer.
12. AllerGone - Nasal filter for allergens.
13. SlimPlate - Portion control eating assistant.
14. PureRain - Drinking water from air.
15. OceanEnergy - Wave-powered chargers.
16. EZPark - In-vehicle parking assist.
17. ArtAroma - Scent-based virtual gallery.
18. TimeKey - Personal time management device.
19. BiteBuddy - Utensil set for one-handed use.
20. CookTogether - Recipe sharing platform.
21. HeartBand - Silent alarm wristband.

Okay, now that we have a list of 100 items, let's parse them out into a Python list. We can also keep track of our total token usage to estimate costs!

In [19]:
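def retrieve_token_usage(open_ai_response):
  # total_tokens already equals prompt_tokens + completion_tokens,
  # so summing every value in "usage" would count each token twice
  return open_ai_response["usage"]["total_tokens"]

In [21]:
f"We used {retrieve_token_usage(first_data_gen)} tokens"

'We used 1154 tokens'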
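
A token count can be turned into a rough cost estimate with one multiplication. The sketch below assumes a single flat price per 1,000 tokens, which is a simplification because some models price prompt and completion tokens differently. `estimate_cost` is a hypothetical helper, and the rate shown is a placeholder, not a real price:

```python
def estimate_cost(total_tokens, usd_per_1k_tokens):
    """Approximate cost in USD, assuming one flat rate per 1,000 tokens."""
    return total_tokens / 1000 * usd_per_1k_tokens

# Placeholder rate for illustration only; check the pricing page for the real number
example_rate = 0.01
print(f"Estimated cost: ${estimate_cost(2500, example_rate):.4f}")
```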

The following code might need to be modified based on how your data was returned by OpenAI's endpoint!

In [37]:
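text_response = first_data_gen["choices"][0]["message"]["content"]

products_and_descriptions = []
for line in text_response.splitlines():
  # Only numbered items such as "1. SmartEyes - Glasses with real-time translation." contain a period
  if "." in line:
    # Take the text between the item number's period and the trailing period
    product_descriptions = line.split(".")[1]
    product_descriptions_split = product_descriptions.split("-")
    products_and_descriptions.append(
        {
            # Slice off the spaces surrounding the product name
            "product" : product_descriptions_split[0][1:-1],
            # Rejoin so hyphens inside the description (e.g. "real-time") survive
            "description" : "-".join(product_descriptions_split[1:])[1:]
        }
    )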

In [51]:
products_and_descriptions[0]

{'product': 'SmartEyes', 'description': 'Glasses with real-time translation'}
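
If the model returns a slightly different layout, a regular expression may be easier to adapt than chained `split` calls. This is one possible sketch, not the parser used above. It assumes each item looks like `<number>. <product> - <description>`, and `ITEM_PATTERN` and `parse_products` are hypothetical names:

```python
import re

# <number>. <product> - <description>, with an optional trailing period
ITEM_PATTERN = re.compile(r"^\s*\d+\.\s*(.+?)\s+-\s+(.+?)\.?\s*$")

def parse_products(text):
    """Return one {"product", "description"} dict per numbered line in `text`."""
    items = []
    for line in text.splitlines():
        match = ITEM_PATTERN.match(line)
        if match:  # header and blank lines do not match and are skipped
            product, description = match.groups()
            items.append({"product": product, "description": description})
    return items

sample = (
    "Here's a list of product ideas:\n\n"
    "1. SmartEyes - Glasses with real-time translation.\n"
    "6. AnyTaste Gum - Flavor-changing chewing gum."
)
parsed = parse_products(sample)
```

Because the separator must be a hyphen surrounded by whitespace, hyphenated words such as "real-time" stay inside the description.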

Now that we have our items parsed out into a Python list - we can go ahead and iterate through each of the items and have whichever OpenAI model you selected create a short marketing email for it!

First though, we'll need a system prompt to use!

In [47]:
system_prompt = create_prompt(
    "system", 
    "You are a marketing executive. You are proficient at writing short, and snappy marketing emails. The emails should be easy to read, and contain excited and vibrant language."
)

We'll also need a user prompt. We'll wrap it in a function so we can call it for each of the 100 items we created above.

In [49]:
def generate_user_prompt(product, description):
  user_prompt = create_prompt(
      "user",
      f"Please create a marketing email using this product: {product} and this description: {description}"
  )

  return user_prompt

Now we're good to start generating our synthetic data! We simply need to iterate through each item - and collate the results into a list of dictionaries!

(depending on which model you use, this step might take a long time, and could become expensive!)

In [None]:
from openai.error import RateLimitError
total_token_usage = 0

for idx, item in enumerate(products_and_descriptions):
  # Skip items that already have an email, so this cell can safely be re-run
  if "marketing_email" in item:
    continue
  print(f"Working on {idx}")
  user_prompt = generate_user_prompt(item["product"], item["description"])
  full_prompt = [system_prompt, user_prompt]
  try:
    prompt_response = prompt_model(full_prompt)
    item["marketing_email"] = prompt_response["choices"][0]["message"]["content"]
    total_token_usage += retrieve_token_usage(prompt_response)
  except RateLimitError:
    # Leave this item without an email for now; re-running the cell retries it
    continue
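
Instead of skipping a rate-limited item and re-running the cell later, one common alternative is to retry with exponential backoff. The sketch below is generic and does not call OpenAI itself. `call_with_backoff` and its parameters are hypothetical names, and the delays are arbitrary defaults:

```python
import time

def call_with_backoff(fn, retry_on, max_attempts=5, base_delay=1.0, sleep=time.sleep):
    """Call fn(), retrying on `retry_on` with delays of base_delay * 1, 2, 4, ..."""
    for attempt in range(max_attempts):
        try:
            return fn()
        except retry_on:
            # Give up and re-raise once the final attempt has failed
            if attempt == max_attempts - 1:
                raise
            sleep(base_delay * 2 ** attempt)
```

With the helpers defined in this notebook, a call might look like `call_with_backoff(lambda: prompt_model(full_prompt), retry_on=RateLimitError)`.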

In [79]:
products_desc_and_marktng_emails_dataset = [p_d_and_m for p_d_and_m in products_and_descriptions if "marketing_email" in p_d_and_m]
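
The filter above keeps only the items that received an email, so anything skipped because of a rate limit is dropped and the final dataset can end up with fewer than 100 rows. A slightly stricter variant, sketched here with the hypothetical names `REQUIRED_KEYS` and `keep_complete`, also drops records where any field is empty:

```python
REQUIRED_KEYS = ("product", "description", "marketing_email")

def keep_complete(records, required=REQUIRED_KEYS):
    """Keep only records where every required key is present and non-empty."""
    return [record for record in records if all(record.get(key) for key in required)]

# Made-up records for illustration
sample_records = [
    {"product": "SmartEyes", "description": "Glasses", "marketing_email": "Hi there!"},
    {"product": "SolarCup", "description": "Self-heating coffee cup"},
    {"product": "", "description": "No name", "marketing_email": "Hello!"},
]
complete = keep_complete(sample_records)
```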

### Uploading Dataset to Hugging Face Hub

Now that we've created our synthetic dataset, let's push it to the Hugging Face Hub!

As always, the first task is to get the required dependencies.

In [None]:
!pip install huggingface_hub -q

Now we can log in to Hugging Face!

Make sure you have a Hugging Face account, and you have set up a read/write token!

More info here: https://huggingface.co/docs/hub/security-tokens

In [69]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|
    
    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid.
Your token has been saved to /root/.cache/huggingface/token
Login successful


Now we can load our data into the desired format - and upload it to the hub!

In [None]:
!pip install datasets -q

In [71]:
from datasets import load_dataset, Dataset
import pandas as pd

In [80]:
hf_dataset = Dataset.from_pandas(pd.DataFrame(data=products_desc_and_marktng_emails_dataset))

In [81]:
hf_dataset

Dataset({
    features: ['product', 'description', 'marketing_email'],
    num_rows: 75
})

In [82]:
hf_username = "<YOUR HF USERNAME HERE>"
dataset_name = "<YOUR DATASET NAME HERE>"

hf_dataset.push_to_hub(f"{hf_username}/{dataset_name}")


### Conclusion

And that's it! You just created a synthetic dataset and pushed it to the hub! 

Next stop? [Modeling!](https://colab.research.google.com/drive/1RfUuzG11Q8AaZuJIHLzXCVC087xoDeSd?usp=sharing)