Fine-tuning a GPT-3 Model

This notebook assembles a dataset that is used exclusively for fine-tuning a GPT-3 model.

In [1]:
import pandas as pd
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer
import re
import json
import openai

# Create the dataset for GPT-3 fine-tuning

In [None]:
# Forward slashes avoid the invalid "\d" escape sequence and work on every platform
df_rest = pd.read_csv('01_Data/dataset_Google-Maps-Reviews-Restaurants_2023-03-26_09-43-08-320.csv')
df_act = pd.read_csv('01_Data/dataset_Google-Maps-Reviews-Activities_2023-03-26_11-08-15-435.csv')
df_hotel = pd.read_csv('01_Data/dataset_Google-Maps-Reviews-Hotels_2023-03-26_11-34-16-492.csv')

print('Restaurants (Shape): ', df_rest.shape)
print('Activities (Shape): ', df_act.shape)
print('Hotels (Shape): ', df_hotel.shape)

Idea: consider only the text. No focus on two modalities yet.

In [None]:
df_reduced = pd.read_csv('base_keywords_sentiment_reduced.csv')
df_fine_tuning = pd.concat([df_rest, df_act, df_hotel], axis=0)

df_fine_tuning = df_fine_tuning.reset_index(drop=True)
df_fine_tuning = df_fine_tuning.drop_duplicates(subset=['reviewId'])
print('All (raw): ', df_fine_tuning.shape)

df_fine_tuning = df_fine_tuning.dropna(subset=['text'])
df_fine_tuning = df_fine_tuning[['text', 'reviewId', 'url', 'placeId','categoryName', 'stars', 'reviewImageUrls/0']]
print('All with text: ', df_fine_tuning.shape)
df_fine_tuning['text'] = df_fine_tuning['text'].astype(str)

df_fine_tuning = df_fine_tuning[~df_fine_tuning['reviewId'].isin(df_reduced['reviewId'])]
print('All (excluding reviews in the later base dataset): ', df_fine_tuning.shape)
df_fine_tuning.to_csv('finetuning_gpt3_v2.csv', index=False)

Next: filter out other languages and filter out entries that are in the base dataset. @TODO: find out whether to train on other locations or on the same ones as in the BASE_..._.csv dataset.

In [None]:
df_fine_tuning = pd.read_csv("finetuning_gpt3_v2.csv")
df_fine_tuning.iloc[1179:1400]

In [None]:
# Remove all rows whose 'text' column is not in English and 
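
The language filter described above is not implemented in this notebook. A minimal sketch of one possible approach, assuming a crude stop-word heuristic is good enough for a first pass. The helper `is_probably_english`, the word list `ENGLISH_HINTS` and the threshold `min_ratio` are hypothetical names chosen for illustration; a dedicated language-detection library would likely be more reliable.

```python
import re

# Hypothetical list of very common English words; not taken from the notebook
ENGLISH_HINTS = {
    "the", "and", "was", "is", "of", "to", "it", "very", "were",
    "this", "we", "for", "with", "but", "not", "had",
}

def is_probably_english(text, min_ratio=0.2):
    """Return True if enough of the words in `text` are common English words."""
    words = re.findall(r"[a-zA-Z']+", str(text).lower())
    if not words:
        return False
    hits = sum(1 for word in words if word in ENGLISH_HINTS)
    return hits / len(words) >= min_ratio

print(is_probably_english("The food was great and the staff were very friendly"))
print(is_probably_english("Das Essen war sehr gut und das Personal freundlich"))
```

Applied to the DataFrame, this could look like `df_fine_tuning[df_fine_tuning["text"].apply(is_probably_english)]`.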

In [None]:
def generate_prompt_gpt3(row):
    category = row["categoryName"]
    return f"Google Maps review about a {category}. ->"

def generate_prompt_gpt35turbo(row):
    category = row["categoryName"]
    return f"Write a Google Maps review about a {category}."

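# Prompt for a 1:1 mapping to the real counterpart review with GPT-3.5 Turbo (OUTDATED).
# Uses a distinct name so that it does not shadow generate_prompt_gpt35turbo above.
def generate_prompt_gpt35turbo_keywords(row):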
    stars = row["stars"]
    category = row["categoryName"]
    keywords = row["keywords_only"]
    keywords = keywords[1:-1].split(",")[:3]
    keywords = [k.strip() for k in keywords]
    keywords = ", ".join(keywords)
    keywords = keywords.replace("'", "")
    return f"Write a {stars} stars Google Maps review about a {category}, with the following keywords: {keywords}."
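
The manual slicing in the keyword prompt breaks if a keyword itself contains a comma. A hedged alternative, assuming `keywords_only` holds a stringified Python list such as `"['pizza', 'service']"` (the column's exact format is not shown in this notebook). `top_keywords` is a hypothetical helper, not part of the original code.

```python
import ast

def top_keywords(raw, limit=3):
    """Parse a stringified list and join its first `limit` entries with commas."""
    raw = str(raw)
    try:
        # Safely evaluates literals only, e.g. "['pizza', 'service']"
        parsed = ast.literal_eval(raw)
    except (ValueError, SyntaxError):
        # Fall back to the notebook's manual parsing for unquoted entries
        parsed = raw[1:-1].split(",")
    if not isinstance(parsed, (list, tuple)):
        parsed = [parsed]
    cleaned = [str(k).strip().strip("'\"") for k in parsed[:limit]]
    return ", ".join(cleaned)

print(top_keywords("['pizza', 'friendly staff', 'view', 'price']"))
```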

In [None]:
df_fine_tuning = pd.read_csv("finetuning_gpt3_v2.csv")
df_reduced = pd.read_csv("base_keywords_sentiment_reduced.csv")

print(df_fine_tuning.shape)

df_fine_tuning["prompt"] = df_fine_tuning.apply(generate_prompt_gpt3, axis=1)
df_fine_tuning["completion"] = df_fine_tuning["text"] + "###"

print(df_fine_tuning.shape)
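print(df_fine_tuning.head())

# Reloading the CSV at this point would discard the "prompt" and "completion"
# columns created above, so the reload is left commented out.
# df_fine_tuning = pd.read_csv("finetuning_gpt3_v2.csv")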
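
The prompt and completion columns built above still have to be turned into the list of dicts that gets written as JSONL. A minimal sketch, assuming each review is available as a plain dict with the `categoryName` and `text` fields used in this notebook. `build_record` is a hypothetical helper, the two sample rows are made up for illustration, and the leading space in the completion follows the ` Ideal answer.` pattern of the template further down.

```python
import json

PROMPT_SUFFIX = " ->"   # separator that ends every prompt
STOP_SEQUENCE = "###"   # marker that ends every completion

def build_record(row):
    """Turn one review row into a prompt/completion pair."""
    prompt = f"Google Maps review about a {row['categoryName']}.{PROMPT_SUFFIX}"
    completion = " " + str(row["text"]).strip() + STOP_SEQUENCE
    return {"prompt": prompt, "completion": completion}

# Made-up sample rows, not taken from the real dataset
rows = [
    {"categoryName": "Restaurant", "text": "Great pasta and friendly staff."},
    {"categoryName": "Hotel", "text": "Clean rooms, slow check-in."},
]
records = [build_record(r) for r in rows]
for record in records:
    print(json.dumps(record))
```

With the DataFrame from this notebook, `df_fine_tuning.to_dict("records")` would yield row dicts of this shape.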

# Option 1: Online tool for fine-tuning GPT-3.

# Option 2: Training in code (How to fine-tune a GPT-3 model for specific prompts)

https://www.indiehackers.com/post/how-to-fine-tune-a-gpt-3-model-using-python-with-your-own-data-for-improved-performance-198dfe51d6

I'm constantly looking for ways to automate the work with support requests. An idea has been to fine-tune a GPT-3 model to answer common support-related questions.

**Here's how you can fine-tune a GPT-3 model with Python with your own data.**

In this walkthrough, we'll fine-tune a GPT-3 model to answer common support-related questions.

Detailed step-by-step instructions for this repo are in this blog post: https://norahsakal.com/blog/fine-tune-gpt3-model

## Define OpenAI API keys

In [None]:
with open('apikey_openai.txt', 'r') as f:
    api_key = f.read().strip()  # strip the trailing newline that text files usually end with

openai.api_key = api_key

## Create training data

Make sure to end each `prompt` with a suffix. According to the [OpenAI API reference](https://beta.openai.com/docs/guides/fine-tuning "fine-tuning reference"), you can use ` ->`.

Also, make sure to end each `completion` with a suffix as well; I'm using `.\n`.

In [92]:
data_file = [{
    "prompt": "Prompt ->",
    "completion": " Ideal answer.\n"
},{
    "prompt":"Prompt ->",
    "completion": " Ideal answer.\n"
}]

In [93]:
print(data_file)

[{'prompt': 'Prompt ->', 'completion': ' Ideal answer.\n'}, {'prompt': 'Prompt ->', 'completion': ' Ideal answer.\n'}]


## Save dict as JSONL

The training data needs to be a JSONL document.
A JSONL file is a newline-delimited JSON file: one JSON object per line.
More info about JSONL: https://jsonlines.org/

In [83]:
file_name = "training_data.jsonl"

with open(file_name, 'w') as outfile:
    for entry in data_file:
        json.dump(entry, outfile)
        outfile.write('\n')

print("Done")
print(file_name)


Done
training_data.jsonl
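
Reading the file back is a cheap way to confirm that every line is valid JSON and carries both keys. A small sketch with a hypothetical `load_jsonl` helper; the demo writes its own throwaway file so that it does not depend on `training_data.jsonl` existing.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Return the parsed records of a JSONL file, failing loudly on a bad line."""
    records = []
    with open(path, "r", encoding="utf-8") as infile:
        for line_number, line in enumerate(infile, start=1):
            if not line.strip():
                continue  # tolerate blank lines
            record = json.loads(line)
            missing = {"prompt", "completion"} - set(record)
            if missing:
                raise ValueError(f"line {line_number} is missing {sorted(missing)}")
            records.append(record)
    return records

# Demo on a throwaway file written the same way as in the cell above
demo_path = os.path.join(tempfile.mkdtemp(), "demo.jsonl")
with open(demo_path, "w", encoding="utf-8") as outfile:
    for entry in [{"prompt": "Prompt ->", "completion": " Ideal answer.\n"}]:
        json.dump(entry, outfile)
        outfile.write("\n")

loaded = load_jsonl(demo_path)
print(len(loaded), "record(s) loaded")
```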


## Check JSONL file

In [87]:
!openai tools fine_tunes.prepare_data -f training_data.jsonl

Analyzing...

- Your file contains 2 prompt-completion pairs. In general, we recommend having at least a few hundred examples. We've found that performance tends to linearly increase for every doubling of the number of examples
- There are 1 duplicated prompt-completion sets. These are rows: [1]




ERROR in common_suffix validator: All prompts are identical: `Prompt ->`
Consider leaving the prompts blank if you want to do open-ended generation, otherwise ensure prompts are different

Aborting...
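
Both findings in the output above (one duplicated pair, identical prompts) can be caught locally before calling the CLI. A rough sketch with the hypothetical helpers `check_pairs` and `deduplicate`; it only mirrors these two checks and is not a reimplementation of the `prepare_data` tool.

```python
from collections import Counter

def check_pairs(records):
    """Count duplicated prompt-completion pairs and detect identical prompts."""
    pair_counts = Counter((r["prompt"], r["completion"]) for r in records)
    duplicate_pairs = sum(count - 1 for count in pair_counts.values())
    unique_prompts = {r["prompt"] for r in records}
    all_prompts_identical = len(records) > 1 and len(unique_prompts) == 1
    return {
        "duplicate_pairs": duplicate_pairs,
        "all_prompts_identical": all_prompts_identical,
    }

def deduplicate(records):
    """Keep the first occurrence of every prompt-completion pair."""
    seen = set()
    unique = []
    for r in records:
        key = (r["prompt"], r["completion"])
        if key not in seen:
            seen.add(key)
            unique.append(r)
    return unique

# The two-entry template from this notebook
template = [
    {"prompt": "Prompt ->", "completion": " Ideal answer.\n"},
    {"prompt": "Prompt ->", "completion": " Ideal answer.\n"},
]
report = check_pairs(template)
print(report)
print(len(deduplicate(template)))
```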


## Upload file to your OpenAI account

In [None]:
upload_response = openai.File.create(
  file=open(file_name, "rb"),
  purpose='fine-tune'
)
upload_response

## Save file ID

In [None]:
file_id = upload_response.id
file_id

## Fine-tune a model

The default model is **Curie**. 

If you'd like to use **DaVinci** instead, then add it as a base model to fine-tune:

```openai.FineTune.create(training_file=file_id, model="davinci")```

In [None]:
fine_tune_response = openai.FineTune.create(training_file=file_id)
fine_tune_response

## Check fine-tune progress

Check the progress with `openai.FineTune.list_events(id=fine_tune_response.id)` and get a list of all the fine-tuning events

In [None]:
fine_tune_events = openai.FineTune.list_events(id=fine_tune_response.id)
fine_tune_events

Check the progress with `openai.FineTune.retrieve(id=fine_tune_response.id)` and get an object with the fine-tuning job data

In [None]:
retrieve_response = openai.FineTune.retrieve(id=fine_tune_response.id)
retrieve_response

## Save fine-tuned model

### Troubleshooting fine_tuned_model as null
During the fine-tuning process, the **fine_tuned_model** key may not be immediately available in the fine_tune_response object returned by `openai.FineTune.create()`.

To check the status of your fine-tuning process, you can call the `openai.FineTune.retrieve()` function and pass in the **fine_tune_response.id**. This function will return a JSON object describing the job, including its current `status` and, once training has finished, the `fine_tuned_model` name.

After the fine-tuning process is complete, you can check the status of all your fine-tuned models by calling `openai.FineTune.list()`. This will list all of your fine-tunes and their current status.

Once the fine-tuning process is complete, you can retrieve the fine_tuned_model key by calling the `openai.FineTune.retrieve()` function again and passing in the fine_tune_response.id. This will return a JSON object with the key fine_tuned_model and the ID of the fine-tuned model that you can use for further completions.
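
The retrieve-until-done procedure described above can be wrapped in a small polling loop. A minimal sketch, assuming the job object exposes a `status` field whose terminal values are `succeeded`, `failed` and `cancelled` (these names are an assumption here). `wait_for_job` and its parameters are hypothetical, and the demo uses a fake job instead of a real API call.

```python
import time

# Assumed terminal status names; adjust if the API reports different values
TERMINAL_STATUSES = {"succeeded", "failed", "cancelled"}

def wait_for_job(fetch_job, poll_seconds=30, max_polls=120, sleep=time.sleep):
    """Call `fetch_job` repeatedly until the job reaches a terminal status."""
    for _ in range(max_polls):
        job = fetch_job()
        if job["status"] in TERMINAL_STATUSES:
            return job
        sleep(poll_seconds)
    raise TimeoutError("fine-tune job did not finish within the polling budget")

# Demo with a fake job that succeeds on the third poll
fake_responses = iter([
    {"status": "pending", "fine_tuned_model": None},
    {"status": "running", "fine_tuned_model": None},
    {"status": "succeeded", "fine_tuned_model": "example-model-name"},
])
finished = wait_for_job(lambda: next(fake_responses), sleep=lambda seconds: None)
print(finished["status"], finished["fine_tuned_model"])
```

In this notebook, `fetch_job` could be `lambda: openai.FineTune.retrieve(id=fine_tune_response.id)`, assuming the returned object supports dictionary-style access.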

### Option 1

If `fine_tune_response.fine_tuned_model` is not `None`, then the key **fine_tuned_model** is available from the fine_tune_response object

In [None]:
if fine_tune_response.fine_tuned_model is not None:
    fine_tuned_model = fine_tune_response.fine_tuned_model

### Option 2

If `fine_tune_response.fine_tuned_model` is `None`, you can get the **fine_tuned_model** by listing all fine-tune jobs

In [None]:
if fine_tune_response.fine_tuned_model is None:
    fine_tune_list = openai.FineTune.list()
    # Takes the first job in the list; this assumes it is the job started above
    fine_tuned_model = fine_tune_list['data'][0].fine_tuned_model

### Option 3

If `fine_tune_response.fine_tuned_model` is `None`, you can get the **fine_tuned_model** key by retrieving the fine-tune job

In [None]:
if fine_tune_response.fine_tuned_model is None:
    fine_tuned_model = openai.FineTune.retrieve(id=fine_tune_response.id).fine_tuned_model

## Test the new model on a new prompt

Remember to end the prompt with the same suffix as we used in the training data; ` ->`:

In [None]:
new_prompt = "NEW PROMPT ->"

In [None]:
answer = openai.Completion.create(
  model=fine_tuned_model,
  prompt=new_prompt,
  max_tokens=10, # Change amount of tokens for longer completion
  temperature=0
)
answer['choices'][0]['text']
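
For the reviews dataset, the prompt needs the ` ->` suffix and the generated text may still contain the `###` marker that was appended to every completion earlier in this notebook. A small sketch with the hypothetical helpers `with_suffix` and `clean_completion`; passing the marker through the `stop` parameter of `openai.Completion.create` is an alternative to trimming afterwards.

```python
PROMPT_SUFFIX = " ->"   # same separator as in the training prompts
STOP_SEQUENCE = "###"   # same marker as at the end of the training completions

def with_suffix(prompt, suffix=PROMPT_SUFFIX):
    """Append the training suffix unless the prompt already ends with it."""
    prompt = prompt.rstrip()
    return prompt if prompt.endswith(suffix) else prompt + suffix

def clean_completion(text, stop=STOP_SEQUENCE):
    """Cut the generated text at the first stop marker and trim whitespace."""
    return text.split(stop, 1)[0].strip()

print(with_suffix("Google Maps review about a Hotel."))
print(clean_completion(" Lovely stay.### Google Maps"))
```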