<span style="color:orange; font-weight:bold">Note: To answer questions based on text documents, we recommend the procedure in <a href="https://github.com/openai/openai-cookbook/blob/main/examples/Question_answering_using_embeddings.ipynb">Question Answering using Embeddings</a>. Some of the code below may rely on <a href="https://github.com/openai/openai-cookbook/tree/main/transition_guides_for_deprecated_API_endpoints">deprecated API endpoints</a>.</span>

# 3. Train a fine-tuning model specialized for Q&A
This notebook uses the dataset of context, question and answer triples to additionally create adversarial question and context pairs, where the question was not generated from that context. In those cases the model is trained to answer "No appropriate context found to answer the question." We will also train a discriminator model, which predicts whether or not the question can be answered based on the context.

We will add hard adversarial examples as well, which will be based either on semantically similar sections, or neighbouring sections, originating from the same article.

In [1]:
pip install openai


Note: you may need to restart the kernel to use updated packages.


In [2]:
pip install pandas scikit-learn

Note: you may need to restart the kernel to use updated packages.


In [3]:
import openai
import pandas as pd
df = pd.read_csv('/Users/anamikabharali/Documents/UNI/BD/Assignment2/form_qa.csv')

df.head()

Unnamed: 0,Number,Description,Last_Updated,SEC_Number,Topic,PDF_Link,pyPDF_extraction,tokens,questions,answers
0,1,Application for registration or exemption from...,Feb. 1999,SEC1935,Self-Regulatory Organizations,https://www.sec.gov/files/form1-e.pdf,You may not send a completed printout of this ...,1581,1) What is the purpose of Form 1-E mentioned i...,1.The purpose of Form 1-E mentioned in the tex...
1,1-A,Regulation A Offering Statement (PDF),Sept. 2021,SEC486,"Securities Act of 1933, Small Businesses",https://www.sec.gov/files/form1-k.pdf,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,2000,1. What is the purpose of Form 1-K?\n2. What a...,1.The purpose of Form 1-K is to provide an ann...
2,1-E,Notification under Regulation E (PDF),Aug. 2001,SEC1807,"Investment Company Act of 1940, Small Business...",https://www.sec.gov/files/form1-n.pdf,OMB APPROVAL OMB Number 3235 0554 Expires Febr...,2000,1. What is the purpose of Form 1-N?\n\n2. How ...,1.The purpose of Form 1-N is to serve as a not...
3,1-K,Annual Reports and Special Financial Reports (...,Sept. 2021,SEC2913,"Securities Act of 1933, Small Businesses",https://www.sec.gov/files/form1-sa.pdf,UNITED STATES SECURITIES AND EXCHANGE COMMISSI...,2000,1. What is the purpose of Form 1 SA?\n2. How o...,1.The purpose of Form 1 SA is to file semiannu...
4,1-N,Form and amendments for notice of registration...,Dec. 2013,SEC2568,"Securities Exchange Act of 1934, Self-Regulato...",https://www.sec.gov/files/form1-u.pdf,OMB APPROVAL OMB Number 3235 0722 Expires Dece...,2000,1. What is the purpose of Form 1-U?\n\n2. What...,1.The purpose of Form 1-U is to file a current...


Split the sections into a training and a testing set, then keep only the first quarter of each.

In [4]:
from sklearn.model_selection import train_test_split

# 80/20 split with a fixed seed so the split is reproducible
train_df1, test_df1 = train_test_split(df, test_size=0.2, random_state=42)

# despite the names, these cut-offs keep only the first quarter of each split
train_midpoint = len(train_df1)//4
test_midpoint = len(test_df1)//4
train_df = train_df1.iloc[:train_midpoint]
test_df = test_df1.iloc[:test_midpoint]
len(train_df), len(test_df)

(29, 7)

We check that the separator we intend to use isn't present within the contexts.

In [5]:
df.pyPDF_extraction.str.contains('->').sum()

0

## 3.1 Create the fine-tuning datasets for Q&A and discriminator models
The fine-tuning dataset is created in the following way. For every question, answer and context triple we create:
- Positive example: the correct question, answer and context
- Negative examples:
  - a random negative example, where a random context is paired with the question
  - two hard negative examples
    - one originating from the same source document (in the code below, a row with the same `Description`)
    - another, which is most similar to the correct context

This process is noisy, as sometimes the question might be answerable given a different context, but on average we hope this won't affect the performance too much.

We apply the same process of dataset creation for both the discriminator, and the Q&A answering model. We apply the process separately for the training and testing set, to ensure that the examples from the training set don't feature within the test set.
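The pairing scheme described above can be illustrated on toy data. The following is a minimal sketch, not the notebook's implementation: the `sections` list, the `build_examples` helper and the `NO_CONTEXT` constant are hypothetical names introduced for illustration, and only the positive and random-negative cases are covered (the hard negatives need a similarity search).

```python
import random

# Hypothetical toy sections standing in for rows of the dataframe.
sections = [
    {"context": "Form A is filed annually.",
     "question": "How often is Form A filed?",
     "answer": "Annually."},
    {"context": "Form B covers small offerings.",
     "question": "What does Form B cover?",
     "answer": "Small offerings."},
]

NO_CONTEXT = " No appropriate context found to answer the question."


def build_examples(sections, discriminator=False, seed=0):
    """Return one positive and one random-negative record per section."""
    rng = random.Random(seed)
    suffix = "\n Related:" if discriminator else "\nAnswer:"
    rows = []
    for sec in sections:
        # Positive example: the question is paired with its own context.
        rows.append({
            "prompt": f"{sec['context']}\nQuestion: {sec['question']}{suffix}",
            "completion": " yes" if discriminator else f" {sec['answer']}",
        })
        # Random negative: the question is paired with a different context.
        others = [s["context"] for s in sections if s["context"] != sec["context"]]
        wrong_context = rng.choice(others)
        rows.append({
            "prompt": f"{wrong_context}\nQuestion: {sec['question']}{suffix}",
            "completion": " no" if discriminator else NO_CONTEXT,
        })
    return rows


for record in build_examples(sections, discriminator=True):
    print(repr(record["completion"]), "<-", repr(record["prompt"]))
```

Every question therefore appears at least twice: once with its own context and once with a context that should not allow an answer.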

In [6]:
df['answers'] = df['answers'].astype('string')

df['questions'] = df['questions'].astype('string')

In [7]:
df.dtypes['answers']

string[python]

In [8]:
df_splitpoint = len(df)//2
df = df.iloc[:df_splitpoint]
print(len(df))

74


In [9]:
import random

def get_random_similar_contexts(question, pyPDF_extraction, search_model='ada', max_rerank=10):
    """
    Find contexts similar to the question using the (deprecated) Search endpoint,
    and return one at random that differs from the correct context.
    Returns an empty string if the search fails or yields no other candidate.
    """
    try:
        results = openai.Engine(search_model).search(
            search_model=search_model, 
            query=question, 
            max_rerank=max_rerank,
            #file=file_id
        )
        # keep the top results, skipping the correct context itself
        candidates = [r['text'] for r in results['data'][:3] if r['text'] != pyPDF_extraction]
        if not candidates:
            return ""
        return random.choice(candidates)
    except Exception as e:
        # any failure is printed and treated as "no similar context found"
        print(e)
        return ""

def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):
    """
    Create a dataset for fine tuning the OpenAI model; either for a discriminator model, 
    or a model specializing in Q&A, where it says if no relevant context is found.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the question, answer and context pairs
    discriminator: bool
        Whether to create a dataset for the discriminator
    n_negative: int
        The number of random negative samples to add (using a random context)
    add_related: bool
        Whether to add the related contexts to the correct context. These are hard negative examples

    Returns
    -------
    pd.DataFrame
        The dataframe containing the prompts and completions, ready for fine-tuning
    """
    rows = []
    for i, row in df.iterrows():
        for q, a in zip((str("1.") + str(row['questions'])).split('\n'), (str("1.") + str(row['answers'])).split('\n')):
            #("1." + row.questions).split('\n'), ("1." + row.answers).split('\n')
            if len(q) >10 and len(a) >10:
                if discriminator:
                    rows.append({"prompt":f"{row.pyPDF_extraction}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" yes"})
                else:
                    rows.append({"prompt":f"{row.pyPDF_extraction}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" {a[2:].strip()}"})

    for i, row in df.iterrows():
        for q in (str("1.") + str(row['questions'])).split('\n'):
            if len(q) >10:
                for j in range(n_negative + (2 if add_related else 0)):
                    random_context = ""
                    if j == 0 and add_related:
                        # hard negative: another context from the same source (same Description)
                        subset = df[(df.Description == row.Description) & (df.pyPDF_extraction != row.pyPDF_extraction)]

                        if len(subset) < 1:
                            continue
                        random_context = subset.sample(1).iloc[0].pyPDF_extraction
                    elif j == 1 and add_related:
                        # hard negative: one of the most similar contexts according to the search
                        random_context = get_random_similar_contexts(q[2:].strip(), row.pyPDF_extraction, search_model='ada', max_rerank=10)
                    else:
                        while True:
                            # add random context, which isn't the correct context
                            random_context = df.sample(1).iloc[0].pyPDF_extraction
                            if random_context != row.pyPDF_extraction:
                                break
                    if discriminator:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\n Related:", "completion":f" no"})
                    else:
                        rows.append({"prompt":f"{random_context}\nQuestion: {q[2:].strip()}\nAnswer:", "completion":f" No appropriate context found to answer the question."})

    return pd.DataFrame(rows) 

In [10]:
for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)

search

We formatted the data according to the recommendations from the fine-tuning tool, which is available using
> openai tools fine_tunes.prepare_data -f qa_train.jsonl

We highly recommend that you use this tool, which suggests improvements in your data formatting for fine-tuning.
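Before running the tool, a few of the formatting conventions used in this notebook can be checked locally. The following is a minimal sketch in the spirit of such checks, not the tool's own implementation: the `check_records` helper is hypothetical, and the two suffixes are the ones used by the prompts built above.

```python
import json


def check_records(lines, prompt_suffixes=("\nAnswer:", "\n Related:")):
    """Return (line_number, problem) tuples for a sequence of JSONL lines."""
    problems = []
    for n, line in enumerate(lines, start=1):
        rec = json.loads(line)
        if set(rec) != {"prompt", "completion"}:
            problems.append((n, "unexpected keys"))
            continue
        # Every prompt should end with the same separator used at inference time.
        if not rec["prompt"].endswith(prompt_suffixes):
            problems.append((n, "prompt does not end with the separator"))
        # Completions in this notebook start with a single space.
        if not rec["completion"].startswith(" "):
            problems.append((n, "completion does not start with a space"))
    return problems


sample = [
    json.dumps({"prompt": "ctx\nQuestion: q\nAnswer:", "completion": " a"}),
    json.dumps({"prompt": "ctx\nQuestion: q", "completion": "a"}),
]
for line_number, problem in check_records(sample):
    print(line_number, problem)
```

To check a real file, pass an open file handle, for example `check_records(open('qa_train.jsonl'))`.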


## 3.2 Submit the datasets for fine-tuning

In [11]:
import os
import getpass

In [12]:
os.environ['OPENAI_API_KEY'] = getpass.getpass() 

In [13]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [14]:
!openai api fine_tunes.create -t "discriminator_train.jsonl" -v "discriminator_test.jsonl" --batch_size 16  --compute_classification_metrics --classification_positive_class " yes" --model ada

Upload progress: 100%|████████████████████| 1.41M/1.41M [00:00<00:00, 1.22Git/s]
Uploaded file from discriminator_train.jsonl: file-0SbB8QuQ3iKuS0jZhWjh5htz
Upload progress: 100%|███████████████████████| 380k/380k [00:00<00:00, 485Mit/s]
Uploaded file from discriminator_test.jsonl: file-wcQc2xgpEimwabSvcQrkw8OE
Created fine-tune: ft-gM3QaNAKHAv3DIQlFw8D6zQT
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-10-20 10:41:16] Created fine-tune: ft-gM3QaNAKHAv3DIQlFw8D6zQT
[2023-10-20 10:41:34] Fine-tune costs $0.44
[2023-10-20 10:41:34] Fine-tune enqueued. Queue number: 0
[2023-10-20 10:41:36] Fine-tune started
[2023-10-20 10:42:14] Completed epoch 1/4



In [15]:
!openai api fine_tunes.create -t "qa_train.jsonl" -v "qa_test.jsonl" --batch_size 16

Upload progress: 100%|████████████████████| 1.46M/1.46M [00:00<00:00, 1.31Git/s]
Uploaded file from qa_train.jsonl: file-hH5LyxOFIvHHCpx8E0HBMTNy
Upload progress: 100%|███████████████████████| 352k/352k [00:00<00:00, 450Mit/s]
Uploaded file from qa_test.jsonl: file-6fxtQGjmq5s1JxAn6NcYVjAJ
Created fine-tune: ft-4ObGIC1sednrv2oThOo6dCWr
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-10-20 10:57:06] Created fine-tune: ft-4ObGIC1sednrv2oThOo6dCWr
[2023-10-20 10:57:22] Fine-tune costs $3.43
[2023-10-20 10:57:22] Fine-tune enqueued. Queue number: 0
[2023-10-20 10:57:23] Fine-tune started



## 3.3 Using the fine-tuned models

We will now use the fine-tuned discriminator and the fine-tuned Q&A model. By requesting logprobs, we can see how certain the discriminator is in a `yes` vs `no` answer.

The fine-tunes created above can be inspected by running the following command in a terminal:
curl https://api.openai.com/v1/fine-tunes/ft-4ObGIC1sednrv2oThOo6dCWr \
  -H "Authorization: Bearer $OPENAI_API_KEY"

where `ft-4ObGIC1sednrv2oThOo6dCWr` is the ID of the created fine-tune and `$OPENAI_API_KEY` is your OpenAI API key.

In [24]:
ft_discriminator = "ada:ft-personal-2023-10-20-14-43-29"
ft_qa = "curie:ft-personal-2023-10-20-14-59-56"

def apply_ft_discriminator(context, question, discriminator_model):
    """
    Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.
    """
    prompt = f"{context}\nQuestion: {question}\n Related:"
    result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)
    return result['choices'][0]['logprobs']['top_logprobs']

apply_ft_discriminator('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                        'What was the first human-made object in space?', ft_discriminator)

[<OpenAIObject at 0x7fc9d0dd92c0> JSON: {
   " yes": -4.52315,
   " no": -0.011236884
 }]

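In this run the discriminator assigns a log-probability of -0.011 (about 99%) to ` no` and -4.52 (about 1%) to ` yes`, even though the context does answer the question. On this out-of-domain example the discriminator therefore does not generalize well.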
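Log-probabilities are easier to read as probabilities. The following is a minimal sketch, assuming each `top_logprobs` entry is a mapping from token to log-probability as in the output above; the helper name `yes_probability` is hypothetical.

```python
import math


def yes_probability(top_logprobs):
    """Probability of ' yes', normalised over the ' yes' and ' no' tokens."""
    # A token missing from the mapping is treated as having zero probability.
    p_yes = math.exp(top_logprobs.get(" yes", float("-inf")))
    p_no = math.exp(top_logprobs.get(" no", float("-inf")))
    total = p_yes + p_no
    return p_yes / total if total > 0 else 0.0


# Values copied from the discriminator output shown above.
print(yes_probability({" yes": -4.52315, " no": -0.011236884}))
```

For the values shown above this gives roughly 0.011 for ` yes`, which means almost all of the probability mass is on ` no`.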

In [25]:
def apply_ft_qa_answer(context, question, answering_model):
    """
    Apply the fine-tuned Q&A model to answer a question based on the context
    """
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\n'])
    return result['choices'][0]['text']

apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.', 
                    'What was the first human-made object in space?', ft_qa)


' The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957'

We can see that the model can answer the question when the context is appropriate.

In [49]:
apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'What is impressive about the Soviet Union?', ft_qa)

' The Soviet Union was the first country to launch a satellite into space'

In [45]:
apply_ft_qa_answer('The first human-made object in space was the Soviet Union satellite Sputnik 1 on 4 October 1957.',
                    'How many cars were produced in the Soviet Union in 1970?', ft_qa)

' No appropriate context found to answer the question'

We can see that the model knows when to answer the question, and when to say that insufficient context is present to answer the question.
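The two models can also be combined, so that the Q&A model is only queried when the discriminator considers the context relevant. The following is a minimal sketch of that idea and not part of the original workflow: `answer_with_gate`, its `threshold` parameter and the stub functions are hypothetical, and the model calls are passed in as plain callables so the logic can be tried without API access.

```python
import math

NO_CONTEXT = " No appropriate context found to answer the question"


def answer_with_gate(context, question, discriminator_fn, qa_fn, threshold=0.5):
    """
    Query the Q&A model only if the discriminator's probability of ' yes'
    reaches the threshold.

    discriminator_fn(context, question) returns a {token: logprob} mapping;
    qa_fn(context, question) returns the answer text.
    """
    logprobs = discriminator_fn(context, question)
    p_yes = math.exp(logprobs.get(" yes", float("-inf")))
    if p_yes < threshold:
        return NO_CONTEXT
    return qa_fn(context, question)


# Stub callables standing in for the fine-tuned models.
def stub_discriminator(context, question):
    related = "Sputnik" in context and "space" in question
    return {" yes": -0.01, " no": -4.6} if related else {" yes": -4.6, " no": -0.01}


def stub_qa(context, question):
    return " Sputnik 1"


context = "The first human-made object in space was the Soviet Union satellite Sputnik 1."
print(answer_with_gate(context, "What was the first human-made object in space?",
                       stub_discriminator, stub_qa))
print(answer_with_gate(context, "How many cars were produced in 1970?",
                       stub_discriminator, stub_qa))
```

With the functions defined earlier, the callables could be `lambda c, q: apply_ft_discriminator(c, q, ft_discriminator)[0]` and `lambda c, q: apply_ft_qa_answer(c, q, ft_qa)`, since `apply_ft_discriminator` returns a list containing one mapping.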