# Attempt to fine-tune GPT-3 to translate between languages
The first language will be our made-up language (see the notes on get_bible for why we think GPT is cheating).
The thesis is that fine-tuning will cause GPT-3 to learn the fake language as a new language by mapping the embeddings of the
new words into essentially the same space as their English counterparts. Translating it back should then be trivial, since it is a
word-for-word translation.
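
To make the "word-for-word" idea concrete, here is a minimal sketch of a reversible letter substitution. It assumes the made-up language is produced by a fixed letter-for-letter mapping, as the name `substitution_cipher` (imported later from `lib.cipher`) suggests. The helper names `make_cipher_table` and `apply_cipher`, the seed, and the mapping itself are hypothetical and are not the project's actual cipher.

```python
import random
import string


def make_cipher_table(seed=0):
    """Build a hypothetical letter mapping (NOT the real lib.cipher table)."""
    rng = random.Random(seed)
    letters = list(string.ascii_lowercase)
    shuffled = letters[:]
    rng.shuffle(shuffled)
    return dict(zip(letters, shuffled))


def apply_cipher(text, table, encode=True):
    """Substitute letters one by one, keeping case, spaces and punctuation."""
    mapping = table if encode else {v: k for k, v in table.items()}
    out = []
    for ch in text:
        lower = ch.lower()
        if lower in mapping:
            sub = mapping[lower]
            out.append(sub.upper() if ch.isupper() else sub)
        else:
            # Anything that is not an ASCII letter passes through unchanged
            out.append(ch)
    return "".join(out)


table = make_cipher_table(seed=42)
original = "For God so loved the world!"
encoded = apply_cipher(original, table)
decoded = apply_cipher(encoded, table, encode=False)
```

Because the mapping is a permutation of the alphabet, decoding is just the inverse lookup, so a perfect model output would round-trip to the English input exactly.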



In [None]:
from lib.config import get_config
import pandas as pd

bible = pd.read_csv('data/bible.csv')

# Assuming the gospels overlap heavily, train on Matthew, Luke and John,
# then predict Mark (which should be easy, as Matthew and Luke may have copied from him)
train = bible[bible['book'].isin(['MAT','LUK','JHN'])]
validate = bible[bible['book'].isin(['MRK'])]

In [None]:
train

In [None]:
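# English verse is the prompt; the made-up-language verse is the completion
df2 = pd.DataFrame({'prompt': train['engBBE'], 'completion': train['birrig']})
# Completions start with a space and end with the stop sequence "\n###"
df2['completion'] = df2['completion'].apply(lambda x: " " + x + '\n###')
# Prompts end with the separator "\n\n###\n\n" so the model knows where to start generating
df2['prompt'] = df2['prompt'].apply(lambda x: x + "\n\n###\n\n")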
df2.to_json('data/train.jsonl', orient='records', lines=True)
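
The same record layout can be sketched without pandas, which makes the separator and stop sequence easy to check before uploading. This is only an illustration: `to_record`, `to_jsonl` and the placeholder pair are hypothetical names and data, not part of the notebook.

```python
import json

PROMPT_SUFFIX = "\n\n###\n\n"   # separator appended to every prompt
COMPLETION_STOP = "\n###"       # stop sequence appended to every completion


def to_record(source, target):
    """Format one sentence pair the way the cell above does."""
    return {
        "prompt": source + PROMPT_SUFFIX,
        "completion": " " + target + COMPLETION_STOP,
    }


def to_jsonl(pairs):
    """Serialize (source, target) pairs as one JSON object per line."""
    return "\n".join(json.dumps(to_record(s, t)) for s, t in pairs)


# Placeholder pair for illustration only (not real data from bible.csv)
pairs = [("english verse", "encoded verse")]
lines = to_jsonl(pairs).splitlines()
record = json.loads(lines[0])
```

Note that `json.dumps` escapes the newlines, so each record still occupies a single line of the JSONL file.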


In [None]:

# Set the OPENAI_API_KEY environment variable so the CLI calls below can authenticate
import os
os.environ["OPENAI_API_KEY"] = get_config('openai')['api_key']




In [20]:
!python3 /Users/chrispriebe/Library/Python/3.8/lib/python/site-packages/openai/_openai_scripts.py tools fine_tunes.prepare_data -f data/train.jsonl -q

Analyzing...

- Your file contains 3101 prompt-completion pairs
- There are 13 duplicated prompt-completion sets. These are rows: [403, 593, 787, 886, 1364, 1395, 1399, 1670, 1675, 1864, 2036, 2129, 2391]
- All prompts end with suffix `\n\n###\n\n`
- All completions end with suffix `\n###`

Based on the analysis we will perform the following actions:
- [Recommended] Remove 13 duplicate rows [Y/n]: Y


Your data will be written to a new JSONL file. Proceed [Y/n]: Y

Wrote modified file to `data/train_prepared.jsonl`
Feel free to take a look!

Now use that file when fine-tuning:
> openai api fine_tunes.create -t "data/train_prepared.jsonl"

After you’ve fine-tuned a model, remember that your prompt has to end with the indicator string `\n\n###\n\n` for the model to start generating completions, rather than continuing with the prompt. Make sure to include `stop=["\n###"]` so that the generated texts ends at the expected place.
Once your model starts training, it'll approximately take 2.04

In [51]:
!python3 /Users/chrispriebe/Library/Python/3.8/lib/python/site-packages/openai/_openai_scripts.py api fine_tunes.create -t "data/train_prepared.jsonl"

Found potentially duplicated files with name 'train_prepared.jsonl', purpose 'fine-tune' and size 871208 bytes
file-Kz9OeIePo1hXszuPaZ12yBeU
Enter file ID to reuse an already uploaded file, or an empty string to upload this file anyway: ^C



In [52]:
!python3 /Users/chrispriebe/Library/Python/3.8/lib/python/site-packages/openai/_openai_scripts.py api fine_tunes.follow -i ft-MPF2xw4JKKURr7gU4EsxRERI

[2023-03-28 15:49:45] Created fine-tune: ft-MPF2xw4JKKURr7gU4EsxRERI
[2023-03-28 15:52:43] Fine-tune costs $3.49
[2023-03-28 15:52:44] Fine-tune enqueued. Queue number: 0
[2023-03-28 15:52:46] Fine-tune started
[2023-03-28 15:59:37] Completed epoch 1/4
[2023-03-28 16:05:24] Completed epoch 2/4
[2023-03-28 16:11:12] Completed epoch 3/4
[2023-03-28 16:16:56] Completed epoch 4/4
[2023-03-28 16:17:19] Uploaded model: curie:ft-personal-2023-03-28-23-17-19
[2023-03-28 16:17:20] Uploaded result file: file-pFzh535nCK6OpPnjWBM65LxZ
[2023-03-28 16:17:20] Fine-tune succeeded

Job complete! Status: succeeded 🎉
Try out your fine-tuned model:

openai api completions.create -m curie:ft-personal-2023-03-28-23-17-19 -p <YOUR_PROMPT>


In [35]:
import openai
from lib.cipher import substitution_cipher
from nltk.translate.bleu_score import sentence_bleu

openai.api_key = os.getenv("OPENAI_API_KEY")

def translate(text):
    response = openai.Completion.create(
        model="curie:ft-personal-2023-03-28-23-17-19",
        prompt=text + "\n\n###\n\n",
        temperature=0.2,
        max_tokens=100,
        top_p=1,
        frequency_penalty=0,
        presence_penalty=0,
        stop=["\n###"]
    )
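    # Default to an empty string so .strip() cannot fail on a missing 'text' field
    translation = response.get('choices', [{}])[0].get('text', '').strip()
    # Decode back to English and score the round trip against the original input
    decoded_translation = substitution_cipher(translation, encode=False).strip()
    bleu_score = sentence_bleu([text.split()], decoded_translation.split())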
    
    return (translation, decoded_translation, bleu_score)

In [61]:
translate("For God so loved me that he gave me new life in his kingdom")

('Suv Guw zu puwpi ni lxel xi givi ni rih pofxor or xoz corgwun',
 'For God so lodle me that he gere me new libhin in his kingdom',
 6.965077358261036e-78)
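
A score near 1e-78 looks like an artifact of unsmoothed BLEU rather than a sign of a terrible translation: when one n-gram order (here, 4-grams) has no matches, the geometric mean collapses to a near-zero value. NLTK's `SmoothingFunction` is one option for softening this. As a complementary view, the sketch below uses two simpler measures from the standard library only. The function names `word_match_rate` and `char_similarity` are hypothetical, and these are stand-in metrics, not BLEU.

```python
from difflib import SequenceMatcher


def word_match_rate(reference, hypothesis):
    """Fraction of reference words reproduced exactly at the same position."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0
    matches = sum(r == h for r, h in zip(ref_words, hyp_words))
    return matches / len(ref_words)


def char_similarity(reference, hypothesis):
    """Character-level similarity ratio in [0, 1] from difflib."""
    return SequenceMatcher(None, reference, hypothesis).ratio()


# Input and decoded output copied from the cell above
reference = "For God so loved me that he gave me new life in his kingdom"
hypothesis = "For God so lodle me that he gere me new libhin in his kingdom"

rate = word_match_rate(reference, hypothesis)
similarity = char_similarity(reference, hypothesis)
```

For this pair, 11 of the 14 words match exactly, which gives a much more intuitive picture than the BLEU value. The positional comparison assumes the two sentences stay aligned word for word; an inserted or dropped word would shift everything after it.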

In [64]:
translate("For God so loved the world that he gave his only Son, so that everyone who believes in him may not die but have eternal life.")

('Suv Guw zu puawiw lxi huvpw lxel xi gowi xoz ulxor Zur, zu lxel jiusi hxokx orz wozir or xon nej rul woi, fal iqevopip pozi.',
 'For God so louded the world that he gide his othin Son, so that yeofe which ins disen in him may not die, but eparilel lise.',
 0.270465701330031)

In [66]:
translate("he said let there be light and there was light")



('xi zi said lxil lxivi fi pjlur erw lxivi hez pjlur',
 'he se fuez thet there be lyton and there was lyton',
 4.484824319939061e-78)

In [55]:
# Try some unknown words
translate("Bongo bongo I love you, gone to Venus with a hole in my shoe")

('Fungu fungu O puavi jua, guir lu Mivaz holx e xupi or nj zuri',
 'Bomgo bomgo I loure you, goen to Verus with a hole in my sone',
 0.29899503549981366)

In [57]:
translate("Ship Pit! Pirate ahoy-lay!")

('Lxi Ropi Qoli! Qvofi ekuh-pej!', 'The Nile Pite! Pribe acow-lay!', 0)

In [60]:
translate("Boat a brother on a mountain")

('Fiul e fvulxiv ur e nuvlxorg',
 'Beot a brother on a morthing',
 0.5081327481546147)

In [63]:
translate("Be kind and play with rocks!")

('Fi ciadr erw qpej holx vokz!',
 'Be keuzn and play with rics!',
 6.206021746903507e-78)

In [69]:
translate("SOS! Ship overboard!! Lost ninty percent of people!")

('ZUZ! Zqoli uzbushouvl!! Puld rintj qvuhlov us qvirz!',
 'SOS! Spite osxofwiort!! Lotz nemqy prowtir of prens!',
 1.2882297539194154e-231)
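
The `validate` frame (Mark) is built at the top of the notebook but is not used in the cells shown. A minimal sketch of scoring a list of verses in bulk follows, assuming a translate function that returns the same three-tuple as `translate` above. The names `evaluate`, `exact_word_rate` and `stub_translate` are hypothetical; the stub stands in for the API call so the example runs offline.

```python
def exact_word_rate(reference, hypothesis):
    """Fraction of reference words matched exactly at the same position."""
    ref_words = reference.split()
    hyp_words = hypothesis.split()
    if not ref_words:
        return 0.0
    return sum(r == h for r, h in zip(ref_words, hyp_words)) / len(ref_words)


def evaluate(verses, translate_fn, score_fn):
    """Average score_fn(source, decoded) over all verses; 0.0 if there are none."""
    scores = []
    for verse in verses:
        # translate_fn is assumed to return (translation, decoded, bleu)
        _, decoded, _ = translate_fn(verse)
        scores.append(score_fn(verse, decoded))
    return sum(scores) / len(scores) if scores else 0.0


def stub_translate(text):
    """Offline stand-in that pretends the round trip was perfect."""
    return (text, text, 1.0)


average = evaluate(["he said", "let there be light"], stub_translate, exact_word_rate)
```

With the real model, the call might look like `evaluate(validate['engBBE'].tolist(), translate, exact_word_rate)`, keeping in mind that each verse triggers one API request.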