<a href="https://colab.research.google.com/github/bensethbell/Building-Generative-AI-Apps/blob/main/Synthetic%20GPT-4%20Dataset%20Creation%20LLM%20Prompts.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

### OpenAI Access

First things first, you'll need to set up an account on [OpenAI](https://platform.openai.com). Once you've done that, follow [these resources](https://help.openai.com/en/articles/4936850-where-do-i-find-my-secret-api-key) to create an API key. Make sure you save your API key!

In [1]:

!pip install cohere
!pip install tiktoken
!pip install openai==0.28
!pip install datasets



In [2]:
import os

# Set the OPENAI_API_KEY environment variable
os.environ["OPENAI_API_KEY"] = ""
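
Pasting the key directly into a cell makes it easy to leak when the notebook is shared. As a minimal sketch of an alternative, assuming the notebook is run interactively, the helper below reuses the key if it is already in the environment and otherwise prompts for it with `getpass`. The name `ensure_api_key` is hypothetical and not part of any library.

```python
import os
from getpass import getpass


def ensure_api_key(var_name: str = "OPENAI_API_KEY") -> str:
    """Return the API key from the environment, prompting once if it is missing."""
    key = os.environ.get(var_name, "").strip()
    if not key:
        # getpass hides the typed value so it does not end up in the cell output.
        key = getpass(f"Enter {var_name}: ").strip()
        if not key:
            raise ValueError(f"{var_name} must not be empty")
        os.environ[var_name] = key
    return key
```

Calling `ensure_api_key()` once before any OpenAI request would leave `OPENAI_API_KEY` set for the rest of the session.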

In [7]:
import openai
import os
from IPython.display import display, Markdown

class Prompter:
    """Thin wrapper around the legacy ChatCompletion API (openai==0.28)."""

    def __init__(self, gpt_model):
        if not os.environ.get("OPENAI_API_KEY"):
            raise Exception("Please set the OPENAI_API_KEY environment variable")

        openai.api_key = os.environ.get("OPENAI_API_KEY")

        self.gpt_model = gpt_model

    def prompt_model_return(self, messages: list):
        # Send the chat messages and return the text of the first completion.
        response = openai.ChatCompletion.create(model=self.gpt_model, messages=messages)
        return response["choices"][0]["message"]["content"]

    def prompt_model_print(self, messages: list):
        # Render the model's reply as Markdown in the notebook output.
        display(Markdown(self.prompt_model_return(messages)))

Set the model name passed to `Prompter` in the next cell as follows:

If you don't have access to GPT-4, use `"gpt-3.5-turbo"`.

If you have access to GPT-4, go ahead and use `"gpt-4"`.

In [8]:
prompter = Prompter("gpt-4")

In [11]:
datagen_prompts = [
    {"role" : "system", "content" : "You are a curious LLM user with many questions."},
    {"role" : "user", "content" : "Please generate a Python list of 10 queries for an LLM across a range of topics, each no more than one sentence long. Make sure it is in the form of a Python list."},
]

In [12]:
prompter.prompt_model_print(datagen_prompts)

[
"What is the origin and meaning of the phrase 'bite the bullet'?", 
"What are the main components of a computer processor?", 
"What are the health benefits of regular exercise and a balanced diet?", 
"Can you describe the plot of the novel 'War and Peace'?",
"Who won the FIFA World Cup in 2010?", 
"What is the function of mitochondria within a cell?", 
"Please explain the basic principles of quantum mechanics?", 
"How does photosynthesis function in plants?", 
"What are the most influential factors affecting global climate change?",
"What are the key ideas of the philosophy of existentialism?"
]
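
The model returns the list as plain text, so it has to be turned into a real Python list before it can be looped over. A minimal sketch of one way to do that without copying by hand, assuming the reply contains exactly one bracketed list of strings (possibly with extra prose around it): slice from the first `[` to the last `]` and evaluate it with `ast.literal_eval`. The helper name `parse_python_list` and the `sample_reply` text are hypothetical.

```python
import ast


def parse_python_list(text: str) -> list:
    """Extract and safely evaluate the first-to-last bracketed span in model output."""
    start, end = text.find("["), text.rfind("]")
    if start == -1 or end <= start:
        raise ValueError("no Python list found in model output")
    # literal_eval only accepts literals, so arbitrary code in the reply is not executed.
    parsed = ast.literal_eval(text[start : end + 1])
    if not isinstance(parsed, list) or not all(isinstance(q, str) for q in parsed):
        raise ValueError("expected a list of strings")
    return parsed


# Example with a hard-coded reply standing in for the model's output.
sample_reply = """Here you go:
[
"Who won the FIFA World Cup in 2010?",
"How does photosynthesis function in plants?"
]"""
print(parse_python_list(sample_reply))
```

With the `Prompter` class defined earlier, something like `parse_python_list(prompter.prompt_model_return(datagen_prompts))` could produce the list directly, though a malformed reply would still raise an error and need a retry.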

Now that we have our initial data, let's generate a response for each of the queries. We will use the following loop to create the responses.

In [15]:
queries = [
    "What is the origin and meaning of the phrase 'bite the bullet'?",
    "What are the main components of a computer processor?",
    "What are the health benefits of regular exercise and a balanced diet?",
    "Can you describe the plot of the novel 'War and Peace'?",
    "Who won the FIFA World Cup in 2010?",
    "What is the function of mitochondria within a cell?",
    "Please explain the basic principles of quantum mechanics?",
    "How does photosynthesis function in plants?",
    "What are the most influential factors affecting global climate change?",
    "What are the key ideas of the philosophy of existentialism?",
]

In [16]:
system_prompt = {"role" : "system", "content" : "You are an LLM. Your job is to answer the queries given to you, making sure each response is no more than one paragraph"}

In [17]:
responses = []
for query in queries:
    user_prompt = {"role" : "user", "content" : query}
    responses.append(prompter.prompt_model_return([system_prompt, user_prompt]))
    print(responses[-1])

The phrase 'bite the bullet' originates from the 19th-century wars when there was no time to administer anesthesia before surgery. Surgeons would ask patients to bite on a bullet to distract from the pain. Today, the phrase means to face a difficult situation with courage and patience, without showing fear or pain.

A computer processor, often referred to as a Central Processing Unit (CPU), primarily consists of the Arithmetical Logical Unit (ALU) that performs mathematical, logical, and decision operations, and the Control Unit (CU) that directs all of the processors’ operations. It also includes registers for temporary storage of information and instructions, buses for data transfer, and a cache for storing frequently used information for rapid access.
Regular exercise and a balanced diet offer numerous health benefits including improved cardiovascular health, stronger bones and muscles, enhanced mental health, and maintained healthy body weight. Exercise boosts endorphins to improve

Let's install and use `huggingface_hub` to push our data to the Hub!

In [18]:
!pip install huggingface_hub



Now we can log in to Hugging Face!

Make sure you have a Hugging Face account and have set up a token with write access, since pushing a dataset to the Hub requires it!

More info here: https://huggingface.co/docs/hub/security-tokens

In [42]:
!huggingface-cli login


    _|    _|  _|    _|    _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|_|_|_|    _|_|      _|_|_|  _|_|_|_|
    _|    _|  _|    _|  _|        _|          _|    _|_|    _|  _|            _|        _|    _|  _|        _|
    _|_|_|_|  _|    _|  _|  _|_|  _|  _|_|    _|    _|  _|  _|  _|  _|_|      _|_|_|    _|_|_|_|  _|        _|_|_|
    _|    _|  _|    _|  _|    _|  _|    _|    _|    _|    _|_|  _|    _|      _|        _|    _|  _|        _|
    _|    _|    _|_|      _|_|_|    _|_|_|  _|_|_|  _|      _|    _|_|_|      _|        _|    _|    _|_|_|  _|_|_|_|

    To login, `huggingface_hub` requires a token generated from https://huggingface.co/settings/tokens .
Token: 
Add token as git credential? (Y/n) n
Token is valid (permission: write).
Your token has been saved to /root/.cache/huggingface/token
Login successful


Now we can put our data into the desired format and upload it to the Hub!

In [44]:
from datasets import load_dataset, Dataset
import pandas as pd

In [45]:

def create_dataset(queries, responses):
    # Build one {"prompt", "response"} record per query/response pair.
    rows = []
    for query, response in zip(queries, responses):
        rows.append({"prompt": query, "response": response})

    return rows
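
One thing to keep in mind: `zip` stops at the shorter of its inputs, so if an API call failed partway through and `responses` ended up shorter than `queries`, the extra queries would be dropped silently. A minimal sketch of a stricter variant that checks the lengths first is shown below; the name `create_dataset_checked` is hypothetical.

```python
def create_dataset_checked(queries, responses):
    """Pair each query with its response, refusing mismatched input lengths."""
    if len(queries) != len(responses):
        raise ValueError(
            f"got {len(queries)} queries but {len(responses)} responses"
        )
    # Same record shape as create_dataset: one dict per prompt/response pair.
    return [
        {"prompt": query, "response": response}
        for query, response in zip(queries, responses)
    ]


rows = create_dataset_checked(["q1", "q2"], ["r1", "r2"])
print(rows)
```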

In [46]:
dataset = create_dataset(queries, responses)
dataset = pd.DataFrame(dataset, index=range(len(queries)))
dataset

|   | prompt | response |
|---|--------|----------|
| 0 | What is the origin and meaning of the phrase '... | The phrase 'bite the bullet' originates from t... |
| 1 | What are the main components of a computer pro... | A computer processor, often referred to as a C... |
| 2 | What are the health benefits of regular exerci... | Regular exercise and a balanced diet offer num... |
| 3 | Can you describe the plot of the novel 'War an... | "War and Peace", penned by Leo Tolstoy, is an ... |
| 4 | Who won the FIFA World Cup in 2010? | The FIFA World Cup in 2010 was won by Spain. |
| 5 | What is the function of mitochondria within a ... | The primary function of mitochondria within a ... |
| 6 | Please explain the basic principles of quantum... | Quantum mechanics is a branch of physics that ... |
| 7 | How does photosynthesis function in plants? | Photosynthesis is a process carried out by gre... |
| 8 | What are the most influential factors affectin... | The most influential factors affecting global ... |
| 9 | What are the key ideas of the philosophy of ex... | Existentialism is a philosophy that emphasizes... |


In [47]:
len(queries)

10

In [48]:
# `dataset` is already a DataFrame, so it can be passed to from_pandas directly.
hf_dataset = Dataset.from_pandas(dataset)

In [49]:
hf_dataset

Dataset({
    features: ['prompt', 'response'],
    num_rows: 10
})

In [51]:
# Replace {YOUR_HUGGING_FACE_USERNAME} with your own Hugging Face username.
hf_dataset.push_to_hub("{YOUR_HUGGING_FACE_USERNAME}/LLMPrompts")

Uploading the dataset shards:   0%|          | 0/1 [00:00<?, ?it/s]

Creating parquet from Arrow format:   0%|          | 0/1 [00:00<?, ?ba/s]

CommitInfo(commit_url='https://huggingface.co/datasets/Bsbell21/LLMPrompts/commit/333950e4ea2355bbc1d5ce63cd749afae4666da1', commit_message='Upload dataset', commit_description='', oid='333950e4ea2355bbc1d5ce63cd749afae4666da1', pr_url=None, pr_revision=None, pr_num=None)