In [1]:
prompt = """Pranjal Paira is a final-year undergraduate student pursuing a B.Tech in Data Science and Engineering from Manipal University Jaipur. With a strong interest in applied machine learning, MLOps, and data science, he is passionate about utilizing these fields to drive positive social impact and create a better future. Currently, Pranjal is working as a Data Engineer in a stealth mode startup, where he is actively involved in building a data pipeline to facilitate the development and deployment of generative AI models. His contributions are crucial in designing and implementing efficient infrastructure that enables organizations to leverage generative AI for various applications. Pranjal's experience extends beyond engineering, with considerable expertise in research, particularly in the field of social networks and graph theory. His publications have made notable contributions in this area, demonstrating his deep understanding of the complexities surrounding social networks. Pranjal's research background equips him with valuable insights that he can leverage to enhance information propagation in social network models and create a meaningful impact. Driven by a desire to utilize machine learning for the social good, Pranjal aims to leverage the power of applied machine learning to address critical challenges and create positive change. He firmly believes in the potential of AI technology to transform sectors such as healthcare, sustainability, and education. Pranjal's inquisitive nature and passion for continuous learning motivate him to seek new opportunities for growth and collaboration. He is eager to connect with like-minded professionals and organizations that share his vision for leveraging data science, applied machine learning and generative AI to make a lasting societal impact.
If you want to explore potential partnerships or discuss innovative ideas, please don't hesitate to reach out. Let's connect and work together to harness the transformative potential of AI for a better future!"""
temperature = .4
number_of_examples = 50

Run this to generate the dataset.

In [2]:
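# The cells below use the pre-1.0 openai interface (openai.ChatCompletion, openai.FineTuningJob)
!pip install "openai<1" tenacity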



In [5]:
import openai

In [6]:
with open('API_KEY', 'r') as file:
    api_key = file.read().strip()  

# Set the API key
openai.api_key=api_key

In [None]:
import random
from tenacity import retry, stop_after_attempt, wait_exponential

N_RETRIES = 3

@retry(stop=stop_after_attempt(N_RETRIES), wait=wait_exponential(multiplier=1, min=4, max=70))
def generate_example(prompt, prev_examples, temperature=.5):
    messages=[
        {
            "role": "system",
            "content": f"You are generating data which will be used to train a machine learning model.\n\nYou will be given a high-level description of the model we want to train, and from that, you will generate data samples, each with a prompt/response pair.\n\nYou will do so in this format:\n```\nprompt\n-----------\n$prompt_goes_here\n-----------\n\nresponse\n-----------\n$response_goes_here\n-----------\n```\n\nOnly one prompt/response pair should be generated per turn.\n\nFor each turn, make the example slightly more complex than the last, while ensuring diversity.\n\nMake sure your samples are unique and diverse, yet high-quality and complex enough to train a well-performing model.\n\nHere is the type of model we want to train:\n`{prompt}`"
        }
    ]

    if len(prev_examples) > 0:
        if len(prev_examples) > 8:
            prev_examples = random.sample(prev_examples, 8)
        for example in prev_examples:
            messages.append({
                "role": "assistant",
                "content": example
            })

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=messages,
        temperature=temperature,
        max_tokens=100,  # low cap: a long prompt/response pair is cut off before its closing delimiter
    )

    return response.choices[0].message['content']

# Generate examples
prev_examples = []
for i in range(number_of_examples):
    print(f'Generating example {i}')
    example = generate_example(prompt, prev_examples, temperature)
    prev_examples.append(example)

print(prev_examples)

Generating example 0
Generating example 1
Generating example 2
Generating example 3
Generating example 4
Generating example 5
Generating example 6
Generating example 7
Generating example 8
Generating example 9
Generating example 10
Generating example 11
Generating example 12
Generating example 13
Generating example 14
Generating example 15
Generating example 16
Generating example 17
Generating example 18
Generating example 19
Generating example 20
Generating example 21
Generating example 22
Generating example 23
Generating example 24
Generating example 25
Generating example 26
Generating example 27
Generating example 28
Generating example 29
Generating example 30
Generating example 31
Generating example 32
Generating example 33
Generating example 34
Generating example 35
Generating example 36
Generating example 37
Generating example 38
Generating example 39
Generating example 40
Generating example 41
Generating example 42
Generating example 43
Generating example 44
Generating example 4
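
Each generated example is expected to follow the delimiter format requested in the system prompt above. The cell below is a minimal sketch of a strict parser for that format, not part of the original pipeline: `parse_example`, `DELIMITER`, and the two sample strings are hypothetical names and placeholder data introduced for illustration. It assumes a well-formed example contains four delimiter lines, so a reply cut off by `max_tokens` before the closing delimiter is rejected instead of being kept with a partial response.

```python
DELIMITER = "-----------"

def parse_example(text):
    """Return (prompt, response), or None if the text does not match the format."""
    parts = text.split(DELIMITER)
    # A complete example splits into five parts:
    # "prompt" label, prompt body, "response" label, response body, trailer.
    if len(parts) < 5:
        return None
    prompt_text = parts[1].strip()
    response_text = parts[3].strip()
    if not prompt_text or not response_text:
        return None
    return prompt_text, response_text

# Placeholder strings that mimic the requested format (not real model output)
complete = (
    "prompt\n-----------\nexample question\n-----------\n\n"
    "response\n-----------\nexample answer\n-----------"
)
truncated = (
    "prompt\n-----------\nexample question\n-----------\n\n"
    "response\n-----------\nexample ans"
)

print(parse_example(complete))
print(parse_example(truncated))
```

This check is stricter than a plain index lookup on the split result, which would accept the truncated string because its fourth part exists.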

In [None]:
def generate_system_message(prompt):

    response = openai.ChatCompletion.create(
        model="gpt-4",
        messages=[
          {
            "role": "system",
            "content": "You will be given a high-level description of the model we are training, and from that, you will generate a simple system prompt for that model to use. Remember, you are not generating the system message for data generation -- you are generating the system message to use for inference. A good format to follow is `Given $INPUT_DATA, you will $WHAT_THE_MODEL_SHOULD_DO.`.\n\nMake it as concise as possible. Include nothing but the system prompt in your response.\n\nFor example, never write: `\"$SYSTEM_PROMPT_HERE\"`.\n\nIt should be like: `$SYSTEM_PROMPT_HERE`."
          },
          {
              "role": "user",
              "content": prompt.strip(),
          }
        ],
        temperature=temperature,
        max_tokens=500,
    )

    return response.choices[0].message['content']

system_message = generate_system_message(prompt)

print(f'The system message is: `{system_message}`. Feel free to re-run this cell if you want a better result.')

The system message is: `Given a brief biography of Pranjal Paira, generate a concise summary of his professional background and interests.`. Feel free to re-run this cell if you want a better result.


In [None]:
import json
import pandas as pd

# Initialize lists to store prompts and responses
prompts = []
responses = []

# Parse out prompts and responses from examples
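for example in prev_examples:
    split_example = example.split('-----------')
    # Skip examples that do not contain both delimited sections, so the
    # prompts and responses lists always stay the same length
    if len(split_example) < 4:
        continue
    prompts.append(split_example[1].strip())
    responses.append(split_example[3].strip())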

# Create a DataFrame
df = pd.DataFrame({
    'prompt': prompts,
    'response': responses
})

# Remove duplicates
df = df.drop_duplicates()

print('There are ' + str(len(df)) + ' successfully-generated examples.')

# Initialize list to store training examples
training_examples = []

# Create training examples in the format required for GPT-3.5 fine-tuning
for index, row in df.iterrows():
    training_example = {
        "messages": [
            {"role": "system", "content": system_message.strip()},
            {"role": "user", "content": row['prompt']},
            {"role": "assistant", "content": row['response']}
        ]
    }
    training_examples.append(training_example)

# Save training examples to a .jsonl file
with open('training_examples.jsonl', 'w') as f:
    for example in training_examples:
        f.write(json.dumps(example) + '\n')

There are 44 successfully-generated examples.
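
Before uploading, it can help to check that every line of the `.jsonl` file has the system/user/assistant structure built above. The cell below is a sketch under that assumption: `validate_chat_jsonl` and `EXPECTED_ROLES` are hypothetical helpers, and the demo writes its own temporary file with placeholder records, so it does not touch `training_examples.jsonl`.

```python
import json
import os
import tempfile

EXPECTED_ROLES = ["system", "user", "assistant"]

def validate_chat_jsonl(path):
    """Return a list of (line_number, problem) tuples; an empty list means no problems were found."""
    problems = []
    with open(path, "r", encoding="utf-8") as f:
        for line_number, line in enumerate(f, start=1):
            try:
                record = json.loads(line)
            except json.JSONDecodeError:
                problems.append((line_number, "not valid JSON"))
                continue
            messages = record.get("messages") if isinstance(record, dict) else None
            if not isinstance(messages, list):
                problems.append((line_number, "missing 'messages' list"))
                continue
            roles = [m.get("role") for m in messages if isinstance(m, dict)]
            if len(messages) != len(EXPECTED_ROLES) or roles != EXPECTED_ROLES:
                problems.append((line_number, f"unexpected roles: {roles}"))
            elif any(not isinstance(m.get("content"), str) or not m["content"].strip()
                     for m in messages):
                problems.append((line_number, "empty or non-string content"))
    return problems

# Demo on a temporary file with one well-formed and one malformed placeholder record
good = {"messages": [
    {"role": "system", "content": "s"},
    {"role": "user", "content": "u"},
    {"role": "assistant", "content": "a"},
]}
bad = {"messages": [{"role": "user", "content": "u"}]}

with tempfile.NamedTemporaryFile("w", suffix=".jsonl", delete=False, encoding="utf-8") as tmp:
    tmp.write(json.dumps(good) + "\n")
    tmp.write(json.dumps(bad) + "\n")
    demo_path = tmp.name

demo_problems = validate_chat_jsonl(demo_path)
os.remove(demo_path)
print(demo_problems)
```

In this notebook the call would be `validate_chat_jsonl('training_examples.jsonl')`, with an empty list indicating that every record has the expected shape.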


In [None]:
# Upload the training file written by the previous cell
with open("training_examples.jsonl", "rb") as f:
    file_id = openai.File.create(file=f, purpose='fine-tune').id

In [None]:
job = openai.FineTuningJob.create(training_file=file_id, model="gpt-3.5-turbo")

job_id = job.id

In [None]:
openai.FineTuningJob.list_events(id=job_id, limit=10)

<OpenAIObject list at 0x7938e36dfce0> JSON: {
  "object": "list",
  "data": [
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-eMshpFDgyDERJju9ZsUb26il",
      "created_at": 1696229035,
      "level": "info",
      "message": "Step 111/132: training loss=0.15",
      "data": {
        "step": 111,
        "train_loss": 0.14788436889648438,
        "train_mean_token_accuracy": 0.9583333134651184
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-RNk9ecEThrsH7rqWr4ZSS9d1",
      "created_at": 1696229017,
      "level": "info",
      "message": "Step 101/132: training loss=0.35",
      "data": {
        "step": 101,
        "train_loss": 0.3493404686450958,
        "train_mean_token_accuracy": 0.9107142686843872
      },
      "type": "metrics"
    },
    {
      "object": "fine_tuning.job.event",
      "id": "ftevent-JQL4QDVgoehpI4G0isrZJDNw",
      "created_at": 1696228999,
      "level": "info",
      "message"

In [None]:
# fine_tuned_model stays None until the fine-tuning job has succeeded
job = openai.FineTuningJob.retrieve(job_id)
model_name = job.fine_tuned_model
print(model_name)
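
The model name is only available after the job finishes, so retrieving it too early prints `None`. The cell below sketches one way to wait for completion. `wait_for_job`, `TERMINAL_STATUSES`, and the default intervals are assumptions introduced here, and the helper takes a status-fetching callable so it can be demonstrated with fake statuses instead of a network call.

```python
import time

# Statuses after which a fine-tuning job no longer changes
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_status, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call fetch_status() repeatedly until it returns a terminal status, then return it."""
    for _ in range(max_polls):
        status = fetch_status()
        if status in TERMINAL_STATUSES:
            return status
        sleep(poll_seconds)
    raise TimeoutError(f"job still not finished after {max_polls} polls")

# Demo with fake statuses and no real sleeping
fake_statuses = iter(["validating_files", "running", "succeeded"])
final_status = wait_for_job(lambda: next(fake_statuses), poll_seconds=0, sleep=lambda s: None)
print(final_status)

# In this notebook, the call might look like (requires the job created above):
# wait_for_job(lambda: openai.FineTuningJob.retrieve(job_id).status)
```

Passing the fetcher as a callable keeps the waiting logic separate from the API client, so the same helper works if the client call changes.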

In [None]:
response = openai.ChatCompletion.create(
    model=model_name,
    messages=[
      {
        "role": "system",
        "content": system_message,
      },
      {
          "role": "user",
          "content": df['prompt'].sample().values[0],
      }
    ],
)

response.choices[0].message['content']