# Fine-tune a model specialized for Q&A

In this notebook, we use a dataset of contexts, questions, and answers. Our objective is to expand this dataset with adversarial question and context pairs, in which the question was not generated from the paired context. In such cases, the Q&A model is trained to respond with "No appropriate context found to answer the question." We also train a discriminator model that predicts whether a given question can be answered from the provided context.

To make the task more challenging, we also include hard adversarial examples, drawn either from semantically similar sections of text or from neighboring sections of the same document.

## Install Dependencies

In [2]:
!pip install openai

Collecting openai
  Downloading openai-0.28.1-py3-none-any.whl (76 kB)
Installing collected packages: openai
ERROR: pip's dependency resolver does not currently take into account all the packages that are installed. This behaviour is the source of the following dependency conflicts.
llmx 0.0.15a0 requires cohere, which is not installed.
llmx 0.0.15a0 requires tiktoken, which is not installed.
Successfully installed openai-0.28.1


In [3]:
import openai
import pandas as pd

In [4]:
openai.api_key = "<API-KEY>"  # placeholder: replace with your own key

In [5]:
df = pd.read_csv('/content/drive/MyDrive/DAMG7245/pdf_content_openai_qa.csv')
df.head()

| Unnamed: 0 | num_tokens | content | questions | answers |
|---|---|---|---|---|
| 0 | 993 | OMB APPROVAL OMB Number: 3235-0554 Expires: F... | 1. What is the purpose of Form 1-N?\n2. What i... | 1. Form 1-N is the form for notice of registra... |
| 1 | 997 | This collection of information has been review... | 1. What is the name of the Security Futures Pr... | 1. The Security Futures Product Exchange is th... |
| 2 | 986 | The exchange consents that service of any civi... | 1. What is the name of the form being filed?\n... | 1. The name of the form being filed is Form 1-... |
| 3 | 900 | Exhibit D Describe the manner of operation of ... | 1. What is the means of access to the System?\... | 1. The means of access to the System is throug... |


Split the dataframe into training and testing sets.

In [6]:
from sklearn.model_selection import train_test_split
train_df, test_df = train_test_split(df, test_size=0.2, random_state=42)
len(train_df), len(test_df)

(3, 1)

## Create the fine-tuning datasets for Q&A and discriminator models

The process of constructing the fine-tuning dataset is as follows:

1. For each set of related question, answer, and context, we generate the following examples:

   - Positive example: A valid question, answer, and context combination.
   
   - Negative examples:
     - A randomly selected context paired with the same question.
     - Two challenging negative examples, one sourced from the same SEC form and the other being the context most similar to the correct one.

2. It's important to note that this process may introduce some noise, as occasionally a different context could potentially answer the same question. However, on average, we anticipate that this noise will not significantly impact performance.

3. We repeat the same dataset creation process separately for both the discriminator and the Q&A answering model. Additionally, we perform this process independently for the training and testing datasets to ensure that the examples in the training set do not overlap with those in the test set.
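
The prompt and completion layout described above can be sketched on toy data. This is a minimal illustration, not the function used in this notebook: the two rows, the `build_examples` helper, and the deterministic choice of the wrong context are assumptions made for the example (the notebook samples negatives at random and adds hard negatives as well).

```python
# Toy (context, question, answer) rows; the texts are made up for illustration.
toy = [
    {"content": "Form A is filed annually.",
     "question": "How often is Form A filed?", "answer": "Annually."},
    {"content": "Form B requires two signatures.",
     "question": "How many signatures does Form B require?", "answer": "Two."},
]

NO_CONTEXT = " No appropriate context found to answer the question."

def build_examples(rows, discriminator):
    """Return one positive and one negative prompt/completion pair per row."""
    examples = []
    for i, row in enumerate(rows):
        # Negative context: take the next row's context so the example is
        # reproducible (hypothetical choice; the notebook samples at random).
        wrong_context = rows[(i + 1) % len(rows)]["content"]
        for context, is_positive in [(row["content"], True), (wrong_context, False)]:
            if discriminator:
                prompt = f"{context}\nQuestion: {row['question']}\n Related:"
                completion = " yes" if is_positive else " no"
            else:
                prompt = f"{context}\nQuestion: {row['question']}\nAnswer:"
                completion = f" {row['answer']}" if is_positive else NO_CONTEXT
            examples.append({"prompt": prompt, "completion": completion})
    return examples

disc_examples = build_examples(toy, discriminator=True)
qa_examples = build_examples(toy, discriminator=False)
```

Both datasets share the same contexts and questions; only the prompt suffix (` Related:` versus `Answer:`) and the completion differ.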

In [7]:
import random

def get_random_similar_contexts(question, context, file_id='https://www.sec.gov/files/form1.pdf', search_model='ada', max_rerank=10):
    """
    Find similar contexts to the given context using the search file
    """
    try:
        results = openai.Engine(search_model).search(
            search_model=search_model,
            query=question,
            max_rerank=max_rerank,
            file=file_id
        )
        candidates = []
        for result in results['data'][:3]:
            if result['text'] == context:
                continue
            candidates.append(result['text'])
        random_candidate = random.choice(candidates)
        return random_candidate
    except Exception as e:
        print(e)
        return ""

def create_fine_tuning_dataset(df, discriminator=False, n_negative=1, add_related=False):
    """
    Create a dataset for fine tuning the OpenAI model; either for a discriminator model,
    or a model specializing in Q&A, where it says if no relevant context is found.

    Parameters
    ----------
    df: pd.DataFrame
        The dataframe containing the question, answer and context pairs
    discriminator: bool
        Whether to create a dataset for the discriminator
    n_negative: int
        The number of random negative samples to add (using a random context)
    add_related: bool
        Whether to add the related contexts to the correct context. These are hard negative examples

    Returns
    -------
    pd.DataFrame
        The dataframe containing the prompts and completions, ready for fine-tuning
    """
    def strip_number(line):
        # Drop a leading list number such as "1." if present; otherwise keep the line unchanged
        head, sep, tail = line.partition('.')
        return tail.strip() if sep and head.strip().isdigit() else line.strip()

    rows = []
    # Positive examples: every question paired with its own context
    for i, row in df.iterrows():
        for q, a in zip(row.questions.split('\n'), row.answers.split('\n')):
            if len(q) > 10 and len(a) > 10:
                question, answer = strip_number(q), strip_number(a)
                if discriminator:
                    rows.append({"prompt": f"{row.content}\nQuestion: {question}\n Related:", "completion": " yes"})
                else:
                    rows.append({"prompt": f"{row.content}\nQuestion: {question}\nAnswer:", "completion": f" {answer}"})

    # Negative examples: every question paired with a context that is not its own
    for i, row in df.iterrows():
        # All contexts except the correct one (empty when df holds a single row)
        others = df[df.content != row.content]
        for q in row.questions.split('\n'):
            if len(q) > 10:
                question = strip_number(q)
                for j in range(n_negative + (2 if add_related else 0)):
                    if j == 1 and add_related:
                        # hard negative: a context the search considers similar to the question
                        random_context = get_random_similar_contexts(question, row.content, search_model='ada', max_rerank=10)
                    else:
                        # j == 0 with add_related draws another section of the same document;
                        # otherwise this is a plain random negative. Both sample from `others`,
                        # so the correct context is never picked and the loop cannot hang.
                        if len(others) < 1:
                            continue
                        random_context = others.sample(1).iloc[0].content
                    if discriminator:
                        rows.append({"prompt": f"{random_context}\nQuestion: {question}\n Related:", "completion": " no"})
                    else:
                        rows.append({"prompt": f"{random_context}\nQuestion: {question}\nAnswer:", "completion": " No appropriate context found to answer the question."})

    return pd.DataFrame(rows)
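
The hard negatives above depend on the search call inside `get_random_similar_contexts`. If that call is not available in your environment, one possible local stand-in is to rank the other contexts by word overlap. The sketch below is an assumption for illustration, not the method used in this notebook: `jaccard` and `most_similar_other_context` are hypothetical helpers, the sample contexts are made up, and word-set overlap is a much cruder similarity signal than a search model.

```python
def jaccard(a, b):
    """Jaccard similarity between the lowercase word sets of two strings."""
    words_a, words_b = set(a.lower().split()), set(b.lower().split())
    if not words_a or not words_b:
        return 0.0
    return len(words_a & words_b) / len(words_a | words_b)

def most_similar_other_context(context, all_contexts):
    """Return the context closest to `context`, skipping exact matches; "" if none exist."""
    others = [c for c in all_contexts if c != context]
    if not others:
        return ""
    return max(others, key=lambda c: jaccard(context, c))

# Made-up contexts for illustration only
contexts = [
    "The exchange must file Form 1-N with the Commission.",
    "The exchange must file amendments to Form 1-N promptly.",
    "Printed material may not exceed a fixed page size.",
]
hard_negative = most_similar_other_context(contexts[0], contexts)
print(hard_negative)
```

Returning an empty string when no other context exists mirrors the fallback of `get_random_similar_contexts`, so the caller can treat both the same way.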

We apply this function once per model (discriminator and Q&A) and once per split (train and test), writing each resulting dataset to its own JSONL file.

In [1]:
for name, is_disc in [('discriminator', True), ('qa', False)]:
    for train_test, dt in [('train', train_df), ('test', test_df)]:
        ft = create_fine_tuning_dataset(dt, discriminator=is_disc, n_negative=1, add_related=True)
        ft.to_json(f'{name}_{train_test}.jsonl', orient='records', lines=True)
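
Before submitting the files, it can help to check how the completions are distributed, since `n_negative=1` with `add_related=True` produces up to three negatives per positive example. The sketch below is a minimal check under stated assumptions: `completion_counts` is a hypothetical helper, and the sample file written to a temporary directory stands in for `discriminator_train.jsonl`.

```python
import json
import os
import tempfile
from collections import Counter

def completion_counts(path):
    """Count how often each completion string occurs in a JSONL fine-tuning file."""
    counts = Counter()
    with open(path, encoding="utf-8") as f:
        for line in f:
            if line.strip():  # skip blank lines
                counts[json.loads(line)["completion"]] += 1
    return counts

# Write a small stand-in file; in the notebook you would pass
# 'discriminator_train.jsonl' to completion_counts instead.
sample_path = os.path.join(tempfile.gettempdir(), "sample_discriminator.jsonl")
with open(sample_path, "w", encoding="utf-8") as f:
    for completion in [" yes", " no", " no", " no"]:
        record = {"prompt": "context\nQuestion: q\n Related:", "completion": completion}
        f.write(json.dumps(record) + "\n")

counts = completion_counts(sample_path)
print(counts)
```

For the discriminator files, any completion other than ` yes` or ` no` would point to a problem in dataset creation.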

## Submit the datasets for fine-tuning

Submit the datasets to fine-tune the discriminator and the Q&A model.

In [None]:
import os
os.environ['OPENAI_API_KEY'] = '<API Key>'

In [None]:
!openai api fine_tunes.create -t "/content/discriminator_train.jsonl" -v "/content/discriminator_test.jsonl" --batch_size 16  --compute_classification_metrics --classification_positive_class " yes" --model ada

Upload progress: 100% 288k/288k [00:00<00:00, 468Mit/s]
Uploaded file from /content/discriminator_train.jsonl: file-Pwk5WaINwHr4qcCqIxUkYx9W
Upload progress: 100% 288k/288k [00:00<00:00, 508Mit/s]
Uploaded file from /content/discriminator_test.jsonl: file-cH0HdgQjXwpaGh7E5mHPwiFB
Created fine-tune: ft-aykOCrFSrBAtZxmRFRhbQ3KF
Streaming events until fine-tuning is complete...

(Ctrl-C will interrupt the stream, but not cancel the fine-tune)
[2023-10-18 05:05:53] Created fine-tune: ft-aykOCrFSrBAtZxmRFRhbQ3KF
[2023-10-18 05:06:07] Fine-tune costs $0.10
[2023-10-18 05:06:08] Fine-tune enqueued. Queue number: 0



In [None]:
!openai api fine_tunes.create -t "/content/qa_train.jsonl" -v "/content/qa_test.jsonl" --batch_size 16

## Using the fine-tuned models

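We will now use the fine-tuned discriminator and the fine-tuned Q&A model. Note that the cell below sets both model names to a stock model (`davinci-instruct-beta-v3`) rather than to a fine-tune produced above; replace them with the names of your own fine-tuned models once the fine-tuning jobs have finished.

In [8]:
# Replace these with the names of your fine-tuned models; as written, both point to a stock model.
ft_discriminator = "davinci-instruct-beta-v3"
ft_qa = "davinci-instruct-beta-v3"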

def apply_ft_discriminator(context, question, discriminator_model):
    """
    Apply the fine tuned discriminator to a question, to assess whether it can be answered from the context.
    """
    prompt = f"{context}\nQuestion: {question}\n Related:"
    result = openai.Completion.create(model=discriminator_model, prompt=prompt, max_tokens=1, temperature=0, top_p=1, n=1, logprobs=2)
    return result['choices'][0]['logprobs']['top_logprobs']

apply_ft_discriminator('Form 1-N is the form for notice of registration as a \
national securities exchange for the sole purpose of trading security futures \
products (“Security Futures Product Exchange”) pursuant to Section 6(g) of the \
Securities Exchange Act of 1934 (“Exchange Act”).',
                        'What is the Form 1-N for?', ft_discriminator)

[<OpenAIObject at 0x7b0598a110d0> JSON: {
   " What": -1.4873294,
   "\n": -1.1760615
 }]
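
The discriminator returns log probabilities, one mapping per generated token. A fine-tuned discriminator is expected to put most of its mass on ` yes` or ` no`; the output above lists ` What` and a newline instead, so neither class token appears. The sketch below shows one way to turn such a mapping into a normalised yes-probability. `yes_probability` and its `floor` default are assumptions for illustration, and plain dicts stand in for the API response.

```python
import math

def yes_probability(top_logprobs, floor=-100.0):
    """Probability of ' yes' relative to ' no' for a single token's top_logprobs mapping.

    Tokens missing from the mapping fall back to `floor`, a very low log probability.
    """
    yes_p = math.exp(top_logprobs.get(" yes", floor))
    no_p = math.exp(top_logprobs.get(" no", floor))
    return yes_p / (yes_p + no_p)

# Hypothetical output of a discriminator that is fairly sure the context is relevant
confident = {" yes": math.log(0.9), " no": math.log(0.1)}
# The mapping shown above, where neither class token is present
uninformative = {" What": -1.4873294, "\n": -1.1760615}

print(yes_probability(confident))
print(yes_probability(uninformative))
```

When neither token is present, both fall back to the same floor and the result is an uninformative 0.5.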

In [9]:
def apply_ft_qa_answer(context, question, answering_model):
    """
    Apply the fine-tuned Q&A model to answer a question from the given context.
    """
    prompt = f"{context}\nQuestion: {question}\nAnswer:"
    result = openai.Completion.create(model=answering_model, prompt=prompt, max_tokens=30, temperature=0, top_p=1, n=1, stop=['.','\n'])
    return result['choices'][0]['text']

apply_ft_qa_answer('Form 1-N is the form for notice of registration as a \
national securities exchange for the sole purpose of trading security futures \
products (“Security Futures Product Exchange”) pursuant to Section 6(g) of the \
Securities Exchange Act of 1934 (“Exchange Act”).',
                        'What is the Form 1-N for?', ft_qa)

' The Form 1-N is the form for notice of registration as a national securities exchange for the sole purpose of trading security futures products (“Security'

In [11]:
apply_ft_qa_answer('If the information called for by any Exhibit is available in\
 printed form, the printed material may be filed provided it does not exceed 8 1/2 X 11 inches in size.',
                    'What is the maximum size of printed form?', ft_qa)

' 8 1/2 X 11 inches'

The model answers correctly when the question matches the context.

In [14]:
apply_ft_qa_answer('If the information called for by any Exhibit is available in\
 printed form, the printed material may be filed provided it does not exceed 8 1/2 X 11 inches in size.',
                    'How many cars were produced in the Soviet Union in 1970?', ft_qa)

' The number of cars produced in the Soviet Union in 1970 is not available'

Here the model recognizes that the provided context is not relevant to the question.

In [17]:
# This function takes an answering model, a discriminator model, context, a question,
# and an optional discriminator_logprob_yes_modifier as input.
def answer_question_conditionally(answering_model, discriminator_model, context, question, discriminator_logprob_yes_modifier=0):
    # Apply the fine-tuned discriminator. top_logprobs is a list with one mapping per
    # generated token, so take the mapping for the single token we requested.
    logprobs = apply_ft_discriminator(context, question, discriminator_model)[0]

    # Extract the log probability for a 'yes' response from the discriminator output,
    # or set it to a very low value (-100) if not present.
    yes_logprob = logprobs[' yes'] if ' yes' in logprobs else -100

    # Extract the log probability for a 'no' response from the discriminator output,
    # or set it to a very low value (-100) if not present.
    no_logprob = logprobs[' no'] if ' no' in logprobs else -100

    # Check if the adjusted 'yes' log probability (with the modifier) is less than the 'no' log probability.
    if yes_logprob + discriminator_logprob_yes_modifier < no_logprob:
        # If 'no' is more probable, return a message indicating no appropriate context was found.
        return " No appropriate context found to answer the question based on the discriminator."

    # If 'yes' is more probable, proceed to answer the question using the answering model.
    return apply_ft_qa_answer(context, question, answering_model)

# Call the answer_question_conditionally function with specific models, context, and question.
answer_question_conditionally(ft_qa, ft_discriminator,
                            "The individual listed on the Execution Page (Page 1) \
                             of Form 1-N as the contact employee must be authorized \
                             to receive all contact information, communications, and \
                             mailings and is responsible for disseminating such information \
                             within the Security Futures Product Exchange’s organization",
                            "Could the designated contact employee on the Execution Page \
                            (Page 1) of Form 1-N authorized to receive all \
                            relevant information within the Security Futures Product Exchange's organization?")


' Yes, the designated contact employee on the Execution Page (Page 1) of Form 1-N authorized to receive all relevant information within the Security Futures'
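
The `discriminator_logprob_yes_modifier` argument shifts the decision boundary between answering and refusing. The sketch below isolates that comparison so its effect can be seen without calling the API. `passes_discriminator` is a hypothetical helper that mirrors the rule in `answer_question_conditionally`, and the probabilities are made up.

```python
import math

def passes_discriminator(yes_logprob, no_logprob, yes_modifier=0.0):
    """Mirror the notebook's rule: refuse only if yes_logprob + modifier < no_logprob."""
    return not (yes_logprob + yes_modifier < no_logprob)

# Hypothetical discriminator output: P(yes) = 0.3, P(no) = 0.7
yes_lp, no_lp = math.log(0.3), math.log(0.7)

# A positive modifier makes the pipeline more willing to answer,
# a negative one makes it more conservative.
decisions = {m: passes_discriminator(yes_lp, no_lp, m) for m in (-1.0, 0.0, 1.0)}
print(decisions)
```

Equivalently, a modifier `m` lets the question through whenever P(yes) / P(no) is at least `exp(-m)`, so the value can be tuned on held-out examples to trade missed answers against answers given from irrelevant context.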