# Creating Training Data

### Introduction
To create training data for finetuning the model behind a RAG system, one has to consider three key ingredients:

1. Answers to finetune upon
2. Context to support the answers
3. Question pointing to the answer and context

We list the three in the reverse of the conventional order because, most of the time, the data provider supplies question-answer pairs as test data but not the context. Often we receive only the question-answer pairs, and it is up to the finetuning team to fill in the context. In this notebook, we demonstrate a method to fill in the context and avoid pitfalls such as reusing test questions for training.

In [1]:
## import packages
import numpy as np
import pandas as pd
import copy
import jsonlines
import os
import openai
from openai import AzureOpenAI
from tqdm import tqdm

#### Elements of training data

Every example in the finetuning dataset must follow the structure below:

`base_case =  [{"role": "system", "content": ""}, {"role": "user", "content": ""}, {"role": "assistant", "content": ""}]`

The role of the finetuner is to fill in each `content` field according to this format.

In [2]:
base_case =  [{"role": "system", "content": ""}, {"role": "user", "content": ""}, {"role": "assistant", "content": ""}]
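
Before writing examples to disk, it can help to check that each one really has this three-message layout. The following is a minimal sketch; `is_valid_case` is a hypothetical helper name, and the rule that every `content` must be a non-empty string is an assumption made here, not a requirement stated by the notebook.

```python
EXPECTED_ROLES = ["system", "user", "assistant"]

def is_valid_case(case):
    """Return True if `case` is a system/user/assistant triple with non-empty content."""
    if not isinstance(case, list) or len(case) != len(EXPECTED_ROLES):
        return False
    for message, role in zip(case, EXPECTED_ROLES):
        # Each message must be a dict carrying the expected role, in order.
        if not isinstance(message, dict) or message.get("role") != role:
            return False
        content = message.get("content")
        # Assumption: empty or whitespace-only content counts as "not filled in".
        if not isinstance(content, str) or not content.strip():
            return False
    return True

# The empty template fails the check; a filled-in placeholder example passes.
empty_case = [
    {"role": "system", "content": ""},
    {"role": "user", "content": ""},
    {"role": "assistant", "content": ""},
]
filled_case = [
    {"role": "system", "content": "placeholder system prompt"},
    {"role": "user", "content": "placeholder question"},
    {"role": "assistant", "content": "placeholder answer"},
]
print(is_valid_case(empty_case))
print(is_valid_case(filled_case))
```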

#### System content

In our case, the system content is where we put our context: the retrieved text is substituted into the `{context}` placeholder at the end of the prompt.

In [None]:
system_content ='''
                You are  an AI assistant. Your role includes providing data-driven insights across several focus areas:
                Company's ESG Initiatives and Performance: Examine and report on  ESG efforts, detailing specific achievements in 2022, investments in sustainability, and adherence to the 22 ESG topics.
                Financial Analysis: Provide comparative financial performance analyses, including revenue, profit, investments, and expenses, across different periods. Identify factors influencing financial trends, new income streams, cost control measures, and profitability improvement strategies.
                Investment and Strategic Planning: Assess company-wide investments, significant expenses, and major investments. Evaluate their alignment with the company's strategic goals and contribution to future growth.
                Compliance and Regulatory Oversight: Discuss compliance and regulatory issues and findings from recent audits, including measures taken to address these.

                Your analysis should follow applicable laws and ethical guidelines, focusing only on information directly related to the company's strategic interests. Your goal is to aid in informed decision-making through data-driven insights and analysis.

                Instructions:
                1. Use information only from the DOCUMENTATION section and previous conversations to respond.
                2. The DOCUMENTATION section includes search results. Each search result has two components - the document name followed by double pipe (||) and then the actual content. Always include the source (document name with extension) from which the content was used to generate the answer.
                3. Reference the source using curly brackets. Do not combine sources; list each source separately.
                4. Avoid repeating previously stated information or sentences.
                5. Keep your answer relevant to the context provided. Do not infer causation or correlation, and do not divert from the topic.
                6. Ensure that your answer can be fact-checked against the given context, so always include source information in the proper format.
                7. If your response requires a table, create it in HTML format.
                8. Keep your response concise; there is no need to explain the math behind financial calculations.

                DOCUMENTATION:
                {context}
                '''
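
Instruction 2 of the prompt says each search result is a document name, a double pipe, and then the content. A minimal sketch of how such a string might be assembled before it is substituted into `{context}` is shown here. `format_documentation` is a hypothetical helper, the file names and passages are placeholders, and putting one result per line is an assumption about the layout.

```python
def format_documentation(search_results):
    """Join (document_name, content) pairs into 'name||content' lines."""
    lines = []
    for document_name, content in search_results:
        # Collapse internal whitespace so each search result stays on one line.
        cleaned = " ".join(content.split())
        lines.append(f"{document_name}||{cleaned}")
    return "\n".join(lines)

# Placeholder search results, for illustration only.
example_results = [
    ("annual_report.pdf", "Placeholder text for the first\nretrieved passage."),
    ("esg_summary.docx", "Placeholder text for the second retrieved passage."),
]

# A shortened stand-in for the full system prompt template.
template = "DOCUMENTATION:\n{context}"
print(template.format(context=format_documentation(example_results)))
```

Because the template is filled with `str.format`, any literal curly brackets added to the prompt text itself would need to be doubled (`{{` and `}}`). Brackets inside the substituted context are left untouched.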

#### Finding Questions

Let's suppose that we have the following question in our test dataset: "By how much did the group's revenue increase from 2022 to 2023?"

Since this question is already in the test dataset, it is best NOT to use it directly for training, as doing so reduces the amount of data we can use for testing. Instead, we should use a closely related question: "By how much did the group's revenue increase from 2021 to 2022?" This question comes from the same distribution but is not the exact question being tested.

In [4]:
# Training question: same form as the test question, shifted back one year
Question = "By how much did the group's revenue increase from 2021 to 2022?"
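
One way to derive such a variant is to shift the years in a test question and confirm the result is not itself in the test set. The following is a rough sketch, assuming the questions differ only in four-digit years; `shift_years` and `test_questions` are hypothetical names introduced for illustration.

```python
import re

def shift_years(question, offset):
    """Return `question` with every four-digit year (19xx or 20xx) shifted by `offset`."""
    return re.sub(
        r"\b(?:19|20)\d{2}\b",
        lambda match: str(int(match.group(0)) + offset),
        question,
    )

# The held-out test question from the text above.
test_questions = {"By how much did the group's revenue increase from 2022 to 2023?"}

test_question = "By how much did the group's revenue increase from 2022 to 2023?"
candidate = shift_years(test_question, -1)

print(candidate)
# Only keep the candidate for training if it does not collide with a test question.
print(candidate in test_questions)
```

The shifted question is only usable if the RAG system can actually retrieve context for the earlier period, so each variant still needs a manual check.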

#### Editing the answer

We get the answer from our RAG system and check its correctness ourselves.

In [None]:
Context = "..."
Answer = "The group revenue is ..."

We use an Azure OpenAI chat model (GPT-4o in the code below) to edit the answers.

In [6]:
endpoint = ""
api_key = "" 

openai.azure_endpoint = endpoint  # your endpoint should look like "https://<your-resource-name>.openai.azure.com/"  
openai.api_version = ""  # specify the API version you're using  
openai.api_key = api_key  
openai.api_type = "azure"  

def call_llm(query):

    completion = openai.chat.completions.create(
        model="gpt-4o",
        messages=[        
            {
                "role": "user",
                "content": query
            }
        ],
        temperature=0.0,
        #response_format={"type": "json_object"}
    )
        
    return completion.choices[0].message.content

def get_llm_response(question, answer, prompt = None):

    if prompt is None:
        prompt = """
    #INSTRUCTION
    You are a Financial Expert and your task is to rephrase an answer to a financial question so that its tone sounds like it was written by a financial expert.
    Add a table at the end of your rephrased answer to present any values, and do not remove any calculation steps.
    Keep the citations in the answer.

    #INPUT
    QUESTION: {question}
    ANSWER: {answer}

    #OUTPUT 
    Provide your rephrased answer as text.
    """

    try:
        response = call_llm(prompt.format(question=question, answer=answer))
    except Exception as e:
        print(f"Error calling LLM: {e}")
        response = None            

    return response

In [None]:
Editted_Answer = get_llm_response(Question, Answer)
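
`get_llm_response` returns `None` when the call fails, and the rephrasing prompt asks the model to keep citations. A minimal sketch of a guard that falls back to the original answer in either case is given here. It assumes citations use the curly-bracket form requested in the system prompt; `citations_preserved` and `choose_answer` are hypothetical helper names, and the sample strings are placeholders.

```python
import re

# Assumption: a citation is any non-nested {...} span, e.g. {report.pdf}.
CITATION_PATTERN = re.compile(r"\{[^{}]+\}")

def citations_preserved(original_answer, edited_answer):
    """Return True if every citation in the original also appears in the edit."""
    original = set(CITATION_PATTERN.findall(original_answer))
    edited = set(CITATION_PATTERN.findall(edited_answer))
    return original <= edited

def choose_answer(original_answer, edited_answer):
    """Prefer the edited answer, but fall back if it is missing or dropped a citation."""
    if edited_answer is None or not edited_answer.strip():
        return original_answer
    if not citations_preserved(original_answer, edited_answer):
        return original_answer
    return edited_answer

# Placeholder answers, for illustration only.
original = "Revenue rose. {report.pdf}"
good_edit = "Group revenue increased. {report.pdf}"
bad_edit = "Group revenue increased."

print(choose_answer(original, good_edit))
print(choose_answer(original, bad_edit))
print(choose_answer(original, None))
```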

In [None]:
def build_jsonlines(Context, Question, Answer):
    case = copy.deepcopy(base_case)
    case[0]["content"] = system_content.format(context = Context)
    case[1]["content"] = Question
    case[2]["content"] = Answer 
    return {"messages": case}

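training_data_loc = r'training.jsonl'
with jsonlines.open(training_data_loc, mode='w') as writer:
    # write() emits a single record; write_all() expects an iterable of records
    # and would iterate over the dict's keys if given one dict directly.
    writer.write(build_jsonlines(Context, Question, Editted_Answer))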
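
After writing, it is worth reading the file back to confirm that each line parses as JSON and holds one `messages` list in the expected role order. The sketch below uses only the standard library and writes a placeholder record to a temporary file so that it runs on its own; `load_jsonl` and `roles_of` are hypothetical helper names.

```python
import json
import os
import tempfile

def load_jsonl(path):
    """Parse a JSON Lines file into a list of records, skipping blank lines."""
    records = []
    with open(path, "r", encoding="utf-8") as handle:
        for line in handle:
            if line.strip():
                records.append(json.loads(line))
    return records

def roles_of(record):
    """Return the sequence of roles in a training record."""
    return [message["role"] for message in record["messages"]]

# Placeholder record with the same shape as the output of build_jsonlines.
sample = {
    "messages": [
        {"role": "system", "content": "placeholder system prompt"},
        {"role": "user", "content": "placeholder question"},
        {"role": "assistant", "content": "placeholder answer"},
    ]
}

with tempfile.TemporaryDirectory() as tmp_dir:
    sample_path = os.path.join(tmp_dir, "sample.jsonl")
    with open(sample_path, "w", encoding="utf-8") as handle:
        handle.write(json.dumps(sample) + "\n")
    loaded = load_jsonl(sample_path)

print(len(loaded))
print(roles_of(loaded[0]))
```

Pointing `load_jsonl` at `training.jsonl` applies the same check to the real file.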