# Test the connection and API key

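Verify that the OpenAI API is reachable. The environment variable `OPENAI_API_KEY` must be set to a valid API key with available credits. The code below uses the legacy `openai.Completion` interface, which requires a version of the `openai` Python package older than 1.0.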

In [1]:
import openai
import os

# read the OpenAI API key from an environment variable
openai.api_key = os.environ['OPENAI_API_KEY']

# test the API connection by making a simple request
response = openai.Completion.create(model="text-curie-001", prompt="Say this is a test", temperature=0, max_tokens=7)
response

<OpenAIObject text_completion id=cmpl-7jMCZdObPXH3HW3Q8e31ho9fPUVNv at 0x7fb34084af20> JSON: {
  "id": "cmpl-7jMCZdObPXH3HW3Q8e31ho9fPUVNv",
  "object": "text_completion",
  "created": 1691044459,
  "model": "text-curie-001",
  "choices": [
    {
      "text": "\n\nThis is a test.",
      "index": 0,
      "logprobs": null,
      "finish_reason": "length"
    }
  ],
  "usage": {
    "prompt_tokens": 5,
    "completion_tokens": 7,
    "total_tokens": 12
  }
}
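If `OPENAI_API_KEY` is missing, the `os.environ[...]` lookup above fails with a bare `KeyError`. A minimal sketch of a friendlier check could look like the following; the helper name `get_api_key` and the error wording are illustrative assumptions, not part of the notebook.

```python
import os

def get_api_key(env=os.environ, name="OPENAI_API_KEY"):
    """Return the API key, or raise a readable error if it is missing or empty.

    Hypothetical helper: `env` is any mapping, which makes the check easy to try out.
    """
    key = env.get(name, "").strip()
    if not key:
        raise RuntimeError(f"{name} is not set; export it before starting the notebook")
    return key

# Demonstration with a plain dict instead of the real environment
print(get_api_key({"OPENAI_API_KEY": "sk-example"}))
```

This only checks that a key is present. Whether the key is valid and has credits is still only known after a real request such as the one above.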

# Prepare the fine-tuning set

Prepare a fine-tuning dataset and use it to fine-tune a GPT-3 model.

In [2]:
import glob
import json
import tiktoken

PROMPT_SUFFIX = '\n\n###\n\n'
COMPLETION_STOP = '\n###'
TRAINFILE = 'fine-tune.jsonl'
VALIDATEFILE = 'validate.jsonl'
BASE_MODEL = 'babbage'
MAX_TOKENS = 1450  # empirically chosen so that prepare_data doesn't complain in the cell below

dataset_train_files = glob.glob("../../llm-dataset/train/*.jsonl")
dataset_test_files = glob.glob("../../llm-dataset/test/*.jsonl")

encoding = tiktoken.encoding_for_model(BASE_MODEL)

def truncate_text(text):
    """truncate text so it contains at most MAX_TOKENS according to the OpenAI tokenizer"""
    tokens = encoding.encode(text)
    return encoding.decode(tokens[:MAX_TOKENS])

def create_sample(text, metadata):
    """create a fine-tuning sample from text and metadata about a single document"""
    return {'prompt': truncate_text(text) + PROMPT_SUFFIX,
            'completion': " " + metadata + COMPLETION_STOP}

def convert_to_samples(infiles, outfile):
    """convert the records in infiles into fine-tuning samples and write them to outfile as JSONL"""
    print(f"Creating {outfile}")
    nrec = 0
    with open(outfile, "w") as outf:
        for infile in infiles:
            print(f"- processing {infile}")
            with open(infile) as inf:
                for line in inf:
                    rec = json.loads(line)
                    sample = create_sample(rec["text"], rec["metadata"])
                    print(json.dumps(sample), file=outf)
                    nrec += 1
    print(f"{nrec} records converted")
    print()

convert_to_samples(dataset_train_files, TRAINFILE)
convert_to_samples(dataset_test_files, VALIDATEFILE)

Creating fine-tune.jsonl
- processing ../../llm-dataset/train/docthes-swe.jsonl
- processing ../../llm-dataset/train/serial-swe.jsonl
- processing ../../llm-dataset/train/thes-swe.jsonl
- processing ../../llm-dataset/train/mono-fin.jsonl
- processing ../../llm-dataset/train/serial-eng.jsonl
- processing ../../llm-dataset/train/thes-fin.jsonl
- processing ../../llm-dataset/train/mono-eng.jsonl
- processing ../../llm-dataset/train/mono-swe.jsonl
- processing ../../llm-dataset/train/thes-eng.jsonl
- processing ../../llm-dataset/train/docthes-eng.jsonl
- processing ../../llm-dataset/train/serial-fin.jsonl
- processing ../../llm-dataset/train/docthes-fin.jsonl
373 records converted

Creating validate.jsonl
- processing ../../llm-dataset/test/docthes-swe.jsonl
- processing ../../llm-dataset/test/serial-swe.jsonl
- processing ../../llm-dataset/test/thes-swe.jsonl
- processing ../../llm-dataset/test/mono-fin.jsonl
- processing ../../llm-dataset/test/serial-eng.jsonl
- processing ../../llm-data
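To make the sample format concrete, here is a small sketch of what one line of the generated JSONL files looks like. It mirrors `create_sample` but skips the token truncation so that it runs without `tiktoken`; the function name `create_sample_untruncated` and the record contents are made up for illustration and are not taken from the dataset.

```python
import json

PROMPT_SUFFIX = "\n\n###\n\n"   # same separator as in the cell above
COMPLETION_STOP = "\n###"       # same stop sequence as in the cell above

def create_sample_untruncated(text, metadata):
    """Like create_sample, but without token truncation (illustration only)."""
    return {"prompt": text + PROMPT_SUFFIX,
            "completion": " " + metadata + COMPLETION_STOP}

# Hypothetical record, not from the real dataset
record = {"text": "Example Thesis\nby Jane Doe",
          "metadata": "dc.contributor.author: Doe, Jane"}

line = json.dumps(create_sample_untruncated(record["text"], record["metadata"]))
print(line)
```

Because `json.dumps` escapes the newlines inside the strings, every sample occupies exactly one physical line, which is what the JSONL format expects.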

In [17]:
# Check that the fine-tuning data set is OK using the prepare_data tool.
# It will complain that all completions start with the same " dc.contributor" prefix; this can be ignored.
# We only use prepare_data as a validation aid and delete the "prepared" files that it helpfully creates.
!openai tools fine_tunes.prepare_data -f fine-tune.jsonl -q
!rm -f fine-tune_prepared.jsonl

!openai tools fine_tunes.prepare_data -f validate.jsonl -q
!rm -f validate_prepared.jsonl

Analyzing...

- Your file contains 373 prompt-completion pairs
- All prompts end with suffix `\n\n###\n\n`
- All completions start with prefix ` dc.contributor`. Most of the time you should only add the output data into the completion, without any prefix
- All completions end with suffix `\n###`

Based on the analysis we will perform the following actions:
- [Recommended] Remove prefix ` dc.contributor` from all completions [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `fine-tune_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "fine-tune_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n###"]` so that the generated texts ends at the expected place.
Once your model starts trainin
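The conventions reported by the tool (prompt separator, leading space in the completion, stop sequence) can also be checked locally. The following is a rough sketch under the assumption that each line is a JSON object with `prompt` and `completion` keys; the function name `find_format_problems` is hypothetical, and this is not a reimplementation of `prepare_data`, which performs more checks.

```python
import json

PROMPT_SUFFIX = "\n\n###\n\n"
COMPLETION_STOP = "\n###"

def find_format_problems(lines):
    """Return (line_number, message) pairs for samples that break the conventions."""
    problems = []
    for lineno, line in enumerate(lines, start=1):
        sample = json.loads(line)
        if not sample["prompt"].endswith(PROMPT_SUFFIX):
            problems.append((lineno, "prompt does not end with the separator"))
        completion = sample["completion"]
        if not completion.startswith(" "):
            problems.append((lineno, "completion does not start with a space"))
        if not completion.endswith(COMPLETION_STOP):
            problems.append((lineno, "completion does not end with the stop sequence"))
    return problems

# Hypothetical samples: the first is well-formed, the second lacks the stop sequence
good = json.dumps({"prompt": "some text" + PROMPT_SUFFIX,
                   "completion": " dc.title: X" + COMPLETION_STOP})
bad = json.dumps({"prompt": "some text" + PROMPT_SUFFIX,
                  "completion": " dc.title: X"})
print(find_format_problems([good, bad]))
# [(2, 'completion does not end with the stop sequence')]
```

Since an open text file iterates line by line, the same function should also accept a file object for `fine-tune.jsonl` or `validate.jsonl`.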

In [3]:
# Perform the actual fine-tuning via the API. This can take a while because there can be a long queue.
!openai api fine_tunes.create -t fine-tune.jsonl -v validate.jsonl -m {BASE_MODEL}

Upload progress: 100%|████████████████████| 1.78M/1.78M [00:00<00:00, 2.22Git/s]
Uploaded file from fine-tune.jsonl: file-HRqztdzuQaW0k46ZELzDmZQn
Upload progress: 100%|███████████████████████| 565k/565k [00:00<00:00, 813Mit/s]
Uploaded file from validate.jsonl: file-sAAV2B2B3V4jNlnJJsiRalSK
Created fine-tune: ft-XivxNCEeiqRQNiRhSc9RAMfu
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-08-03 09:36:07] Created fine-tune: ft-XivxNCEeiqRQNiRhSc9RAMfu

Stream interrupted (client disconnected).
To resume the stream, run:

  openai api fine_tunes.follow -i ft-XivxNCEeiqRQNiRhSc9RAMfu



In [14]:
!openai api fine_tunes.follow -i ft-XivxNCEeiqRQNiRhSc9RAMfu

[2023-08-03 09:36:07] Created fine-tune: ft-XivxNCEeiqRQNiRhSc9RAMfu
[2023-08-03 11:48:36] Fine-tune costs $1.39
[2023-08-03 11:48:36] Fine-tune enqueued. Queue number: 6
[2023-08-03 11:50:31] Fine-tune is in the queue. Queue number: 5
[2023-08-03 11:50:33] Fine-tune is in the queue. Queue number: 4
[2023-08-03 11:51:17] Fine-tune is in the queue. Queue number: 3
[2023-08-03 11:51:40] Fine-tune is in the queue. Queue number: 2
[2023-08-03 11:52:17] Fine-tune is in the queue. Queue number: 1
[2023-08-03 11:52:31] Fine-tune is in the queue. Queue number: 0
[2023-08-03 11:52:35] Fine-tune started
[2023-08-03 11:55:03] Completed epoch 1/4
[2023-08-03 11:57:06] Completed epoch 2/4
[2023-08-03 11:59:09] Completed epoch 3/4
[2023-08-03 12:01:13] Completed epoch 4/4
[2023-08-03 12:01:33] Uploaded model: babbage:ft-personal-2023-08-03-09-01-32
[2023-08-03 12:01:34] Uploaded result file: file-l7O6EqcqDWgaZyvl7qyrKSfA
[2023-08-03 12:01:34] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Tr
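The model name is copied by hand in the next cell. As an alternative, it could be pulled out of the event log with a regular expression. The sketch below assumes the log contains a line of the form `Uploaded model: <name>`, as in the output above; the function name `extract_model_name` is illustrative.

```python
import re

def extract_model_name(log_text):
    """Return the name from an 'Uploaded model: ...' log line, or None if absent."""
    match = re.search(r"Uploaded model:\s*(\S+)", log_text)
    return match.group(1) if match else None

# Three lines copied from the event log above
log = """[2023-08-03 12:01:13] Completed epoch 4/4
[2023-08-03 12:01:33] Uploaded model: babbage:ft-personal-2023-08-03-09-01-32
[2023-08-03 12:01:34] Fine-tune succeeded"""

print(extract_model_name(log))  # babbage:ft-personal-2023-08-03-09-01-32
```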

In [15]:
# copy the model name from the output above and store it in a variable
model_name = "babbage:ft-personal-2023-08-03-09-01-32"

In [25]:
# Try out the fine-tuned model on a random test set record

import random

def get_completions(text):
    """generate metadata for a document text using the fine-tuned model"""
    response = openai.Completion.create(model=model_name,
                                        prompt=truncate_text(text) + PROMPT_SUFFIX,
                                        temperature=0,  # no fooling around!
                                        max_tokens=2048 - MAX_TOKENS - 5,  # should be plenty
                                        stop=[COMPLETION_STOP])  # stop at ###
    return response['choices'][0]['text'].strip()


test_set_file = random.choice(dataset_test_files)
with open(test_set_file) as testfile:
    records = [json.loads(line) for line in testfile]
rec = random.choice(records)

print(f"Testing on {rec['id']} with PDF {rec['url']}")
print("---")
print("Curated metadata:")
print(rec["metadata"])
print("---")
print("Generated metadata:")
print(get_completions(rec["text"]))


Testing on https://www.doria.fi/handle/10024/181210 with PDF https://www.doria.fi/bitstream/handle/10024/181210/koskinen_monika.pdf
---
Curated metadata:
dc.contributor.author: Koskinen, Monika
dc.contributor.faculty: Faculty of Education and Welfare Studies
dc.contributor.faculty: Fakulteten för pedagogik och välfärdsstudier
dc.contributor.faculty: Kasvatustieteiden ja hyvinvointialojen tiedekunta
dc.contributor.opponent: Professor Ulrica Hörberg, Linnéuniversitetet, Växjö, Sverige
dc.contributor.organization: Åbo Akademi
dc.contributor.supervisor: Professor Camilla Koskinen, Universitetet i Stavanger, Stavanger, Norge
dc.contributor.supervisor: TD Mårten Björkgren, Åbo Akademi, Vasa
dc.date.issued: 2021-06-17
dc.format.content: fulltext
dc.identifier.isbn: 978-951-765-993-2
dc.identifier.urn: URN:ISBN:978-951-765-993-2
dc.language.iso: swe
dc.publisher: Åbo Akademis förlag - Åbo Akademi University Press
dc.relation.isbn: 978-951-765-992-5
dc.title: Hälsans tomrum : En hermeneutisk st
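The `max_tokens` expression in `get_completions` splits a fixed token budget between prompt and completion. The arithmetic is spelled out below as a small sketch. Reading the `5` as an allowance for the prompt suffix is an assumption, since the notebook does not say what it covers, and the names `CONTEXT_WINDOW`, `RESERVED` and `completion_budget` are introduced here for illustration.

```python
CONTEXT_WINDOW = 2048   # total token limit used in the max_tokens expression above
MAX_TOKENS = 1450       # budget for the truncated document text, as set earlier
RESERVED = 5            # small allowance (assumed here to cover the prompt suffix)

def completion_budget(context_window=CONTEXT_WINDOW,
                      prompt_budget=MAX_TOKENS,
                      reserved=RESERVED):
    """Tokens left for the completion after the prompt budget and the allowance."""
    return context_window - prompt_budget - reserved

print(completion_budget())  # 593
```

Raising `MAX_TOKENS` gives the model more of the document to read but leaves fewer tokens for the generated metadata, so the two values have to be tuned together.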

In [26]:
%%time

import os.path

for test_file in dataset_test_files:
    output_file = "gpt3-" + os.path.basename(test_file)
    print(f"generating metadata for {test_file} into {output_file}")
    nrec = 0
    with open(test_file) as infile, open(output_file, "w") as outfile:
        for line in infile:
            rec = json.loads(line)
            generated_metadata = get_completions(rec["text"])
            outrec = {"id": rec["id"], "url": rec["url"], "metadata_orig": rec["metadata"], "metadata_gen": generated_metadata}
            json.dump(outrec, outfile)
            outfile.write("\n")
            nrec += 1
    print(f"completed {nrec} records")
    print()

generating metadata for ../../llm-dataset/test/docthes-swe.jsonl into gpt3-docthes-swe.jsonl
completed 5 records

generating metadata for ../../llm-dataset/test/serial-swe.jsonl into gpt3-serial-swe.jsonl
completed 7 records

generating metadata for ../../llm-dataset/test/thes-swe.jsonl into gpt3-thes-swe.jsonl
completed 10 records

generating metadata for ../../llm-dataset/test/mono-fin.jsonl into gpt3-mono-fin.jsonl
completed 8 records

generating metadata for ../../llm-dataset/test/serial-eng.jsonl into gpt3-serial-eng.jsonl
completed 9 records

generating metadata for ../../llm-dataset/test/thes-fin.jsonl into gpt3-thes-fin.jsonl
completed 20 records

generating metadata for ../../llm-dataset/test/mono-eng.jsonl into gpt3-mono-eng.jsonl
completed 9 records

generating metadata for ../../llm-dataset/test/mono-swe.jsonl into gpt3-mono-swe.jsonl
completed 0 records

generating metadata for ../../llm-dataset/test/thes-eng.jsonl into gpt3-thes-eng.jsonl
completed 11 records

generating 

In [33]:
# Calculate rough similarity between original and generated metadata using Levenshtein normalized indel similarity

import pandas as pd
import Levenshtein

data = []
for test_file in dataset_test_files:
    gen_file = "gpt3-" + os.path.basename(test_file)
    _, subset, lang = os.path.splitext(gen_file)[0].split('-')
    with open(gen_file) as gfile:
        for line in gfile:
            rec = json.loads(line)
            similarity = Levenshtein.ratio(rec["metadata_orig"], rec["metadata_gen"])
            data.append([subset, lang, rec["id"], rec["url"], similarity])

df = pd.DataFrame(data, columns=["subset", "lang", "id", "url", "similarity"])

print("Overall similarity:", df["similarity"].mean())

Overall similarity: 0.7886717084961832
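For readers unfamiliar with the measure: the normalized indel similarity counts only insertions and deletions, and can be written as `2 * LCS(a, b) / (len(a) + len(b))`, where `LCS` is the length of the longest common subsequence. The pure-Python sketch below illustrates that definition. It is not the `Levenshtein` library's implementation, and returning 1.0 for two empty strings is a convention chosen here.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0]
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]

def indel_similarity(a, b):
    """Normalized indel similarity in [0, 1]; 1.0 means the strings are identical."""
    total = len(a) + len(b)
    if total == 0:
        return 1.0  # convention chosen for this sketch
    return 2 * lcs_length(a, b) / total

print(indel_similarity("kitten", "sitting"))  # 8/13, about 0.615
```

The score is computed over the whole metadata string, so a single wrong long field (for example a title) lowers it more than a wrong short field (for example a language code). This is one reason to treat it only as a rough indicator.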


In [34]:
df.groupby(['subset'])["similarity"].mean()

subset
docthes    0.820709
mono       0.665335
serial     0.763267
thes       0.834662
Name: similarity, dtype: float64
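The `subset` and `lang` columns used for grouping are derived from the output file names, which follow the pattern `gpt3-<subset>-<lang>.jsonl`. A minimal sketch of that parsing step, with the hypothetical helper name `parse_output_name`:

```python
import os.path

def parse_output_name(path):
    """Split 'gpt3-<subset>-<lang>.jsonl' into a (subset, lang) tuple."""
    stem = os.path.splitext(os.path.basename(path))[0]
    _, subset, lang = stem.split("-")  # raises ValueError if the pattern does not match
    return subset, lang

print(parse_output_name("gpt3-docthes-swe.jsonl"))  # ('docthes', 'swe')
```

The three-way unpacking relies on neither the subset nor the language code containing a hyphen. A file name with an extra hyphen would raise a `ValueError` rather than be silently misparsed.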

In [35]:
df.groupby(['lang'])["similarity"].mean()

lang
eng    0.768135
fin    0.784264
swe    0.839895
Name: similarity, dtype: float64

In [36]:
df.groupby(['subset','lang'])["similarity"].mean()

subset   lang
docthes  eng     0.841979
         fin     0.803360
         swe     0.776932
mono     eng     0.674078
         fin     0.655498
serial   eng     0.762734
         fin     0.735521
         swe     0.807553
thes     eng     0.742101
         fin     0.855895
         swe     0.894015
Name: similarity, dtype: float64