## Algorithm
* Input: set of documents
    - number of samples to generate
    - maximum token length (input + context + answer)
    - difficulty distribution
    - Seed questions?
    - randomize answer output formats
* Output: test set with questions, contexts, and answers

1. Select a document and the part of the document to frame the question from (random)
2. Formulate a (question, context) pair constrained on the output format.
3. Identify difficulty type
4. Evolve the question using an Evol-Instruct-like paradigm to increase difficulty
    - ask for improved reasoning (Evol-Instruct C, Increased Reasoning Steps prompt)
    - Add more contexts and frame the question for multi-hop
        - Adding more context
           1. Identify an entity in the current context and retrieve paragraphs with the same entity: formulate a new question
           2. Identify similar paragraphs using sentence similarity: formulate a new question
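
Steps 1 and 3 above could be sketched roughly as follows. This is only an illustration, not the notebook's implementation: `sample_chunk`, `sample_difficulty`, the toy `documents`, and the difficulty weights are all hypothetical, and chunks are assumed to be paragraphs separated by blank lines.

```python
import random

def sample_chunk(documents, rng):
    """Pick a random document, then a random non-empty paragraph from it."""
    doc = rng.choice(documents)
    chunks = [c for c in doc.split("\n\n") if c.strip()]
    return rng.choice(chunks)

def sample_difficulty(distribution, rng):
    """Draw a difficulty label according to the given weights."""
    labels = list(distribution)
    weights = [distribution[label] for label in labels]
    return rng.choices(labels, weights=weights, k=1)[0]

# Hypothetical inputs, for illustration only.
documents = ["para A1\n\npara A2", "para B1\n\npara B2\n\npara B3"]
difficulty_distribution = {"easy": 0.2, "medium": 0.5, "hard": 0.3}

rng = random.Random(0)  # fixed seed so that a run is reproducible
chunk = sample_chunk(documents, rng)
difficulty = sample_difficulty(difficulty_distribution, rng)
print(chunk, difficulty)
```

Passing an explicit `random.Random` instance keeps the sampling reproducible without touching the global random state.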
           
           
#### Ideas to increase complexity

- Extend question by using info in the newly added contexts.


- Concretizing

    In the RAG case, this should yield a question tied to a real-world use case of the concept described in the context.
    For example, if the context discusses different types of instructions and their use cases,
    the question might be "I want to train a chatbot; how should I build the training dataset?"


- Improved reasoning
           
#### Open-issues
1. How to ensure there is no inter-document dependency

A framed question might have possible answers in several different documents.

2. Questions framed by the LLM are easily answerable.

The LLM appears to first pick a candidate sentence and then frame a question around it, so by default the answer is easy to locate.

3. GPT-3.5 tends to lengthen the question when asked to create a difficult one

This often produces an unusually long question that actually contains two questions.
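
One possible way to flag the compound questions described in issue 3 is a cheap heuristic filter. The sketch below is an assumption, not something the notebook implements: `looks_compound`, the 30-word limit, and the joining-clause regex are hypothetical choices. The two example questions are taken from outputs elsewhere in this notebook.

```python
import re

# Matches ", and how/what/..." which often signals a second question glued on.
JOINED_QUESTION = re.compile(
    r",\s*and\s+(how|what|why|when|where|which|who)\b", re.IGNORECASE
)

def looks_compound(question, max_words=30):
    """Heuristic flag for questions that are too long or bundle two questions."""
    n_question_marks = question.count("?")
    n_words = len(question.split())
    has_joined_clause = JOINED_QUESTION.search(question) is not None
    return n_question_marks > 1 or n_words > max_words or has_joined_clause

simple = "What are the two major approaches to adapting pre-trained LLMs?"
compound = (
    "What are the two major approaches to adapting pre-trained LLMs, and how do "
    "factors such as scaling the instructions and formatting design impact "
    "their performance?"
)
print(looks_compound(simple), looks_compound(compound))
```

A filter like this only catches surface patterns; it would miss compound questions that are phrased as a single clause.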

## Functions

In [278]:
import os
import openai
import json
import numpy as np

In [2]:
with open("/Users/shahules/openai-key.json") as key_file:
    os.environ["OPENAI_API_KEY"] = json.load(key_file)["ikka"]

In [3]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [4]:
def llm2(prompt, **kwargs):
    response = openai.ChatCompletion.create(
        model=kwargs.get("model", "gpt-3.5-turbo"),
        messages=[{"role": "system", "content": prompt}],
        temperature=kwargs.get("temperature", 0),
        top_p=kwargs.get("top_p", 1),
        frequency_penalty=kwargs.get("frequency_penalty", 0.0),
        presence_penalty=kwargs.get("presence_penalty", 0.0),
        max_tokens=kwargs.get("max_tokens", 500),
        n=kwargs.get("n", 1),
    )
    return response

In [275]:
from langchain.embeddings import OpenAIEmbeddings
Embedding = OpenAIEmbeddings()

In [276]:
def calculate_similarity(question, generated_questions):
    """Cosine similarity between `question` and each text in `generated_questions`."""
    question_vec = np.asarray(Embedding.embed_query(question)).reshape(1, -1)
    gen_question_vec = np.asarray(Embedding.embed_documents(generated_questions))
    # Product of the row norms and the query norm, one value per candidate text.
    norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
        question_vec, axis=1
    )
    return np.dot(gen_question_vec, question_vec.T).reshape(-1) / norm
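
To check the cosine-similarity arithmetic without calling the embedding API, the same computation can be run on toy vectors. This is a standalone sketch: `cosine_similarity` and the 2-D vectors are hypothetical stand-ins for real embeddings.

```python
import numpy as np

def cosine_similarity(query_vec, doc_vecs):
    """Cosine similarity between one query vector and each row of doc_vecs."""
    query_vec = np.asarray(query_vec, dtype=float).reshape(1, -1)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    norm = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec, axis=1)
    return (doc_vecs @ query_vec.T).reshape(-1) / norm

query = [1.0, 0.0]
docs = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # same direction, orthogonal, 45 degrees
scores = cosine_similarity(query, docs)
print(scores)
```

The three scores should come out as 1, 0, and 1/√2, which matches the geometric meaning of cosine similarity.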

## Playground

- Randomly select a document
- Identify sections from each document

In [5]:
import re

In [6]:
def read_doc(path):
    with open(path,'r') as file:
        return file.read()

In [7]:
text = read_doc("../arxiv-llm/textdata/2303.18223v11.A_Survey_of_Large_Language_Models")

In [8]:
pattern = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'
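
The pattern is defined here but not applied in the cells that follow. One way it might be used to find all-caps section headings is sketched below; the `sample` text is hypothetical, and `re.MULTILINE` is an assumption about how the pattern was meant to be applied.

```python
import re

pattern = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'

# Hypothetical sample text; the real input would be the loaded paper.
sample = (
    "INTRODUCTION\n"
    "Language is a prominent ability of human beings.\n"
    "4.2 PRE-TRAINING TASKS\n"
    "Pre-training plays a key role."
)

# re.MULTILINE lets ^ and $ match at every line boundary.
headings = [m.group(0) for m in re.finditer(pattern, sample, flags=re.MULTILINE)]
print(headings)
```

Two caveats: the pattern only matches upper-case headings, so title-case ones such as "4.2.3 Pre-training Tasks" are skipped, and because `\s` inside the character class also matches newlines, consecutive all-caps lines are merged into a single match.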


In [9]:
len(text.split('\n\n'))

86

In [10]:
example = """
Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century
"""

In [11]:
example[0:155]

'\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,'

In [516]:
question_formulate = """
Your task is to formulate a question from the given context, satisfying the rules given below:
    1.The question should be fully answerable from the given context. 
    2.The question should be framed from a part that contains non-trivial information. 
    3.The answer to the question must be in {answer_format} format. 
    4.The answer should not contain any links. 
    5.The question should be of {difficulty} difficulty.
    6.The question must be reasonable and must be understood and responded by humans.
context:{context}
"""

context_formulate =  """\
Task: Candidate sentence extraction.
Given the question and context, extract the chunks from the given context that are required to answer the question. 
Rules to follow while doing this task
    1. The chunks could be anywhere in the provided context, pay attention to the whole context.
    2. Extract the exact sentences from given context, you're not allowed to include or exclude even a single character from the candidate sentences.
    3. If the context does not contain the information required to answer the question, return "No candidate sentences found".

question:{question}
context:\n{context}
The sentences that can be extracted from the given context to answer the question are:""" 

answer_formulate = """
Answer the question using the information from the given context. 
You are not allowed to include information that cannot be deduced from the given context.
question:{question}
context:{context}
answer:
"""

context_from_answer = """
Given a question, context, and answer, locate the information in the context that was used to form the given answer. 
question:\n{question}
context:\n{context}
answer:\n{answer}
extracted context:"""


# answer_index_formulate = """
# Locate the relevant information in the context and provide the start and end indices of the text that can be used to answer the question. Keep in mind that the relevant information might be surrounded by other unrelated text. 
# You can identify the relevant portion using any relevant keywords, phrases, or patterns present in the context.
# \n\n
# question:\nWhen was Einstein born?
# context:\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.
# answer: the answer to the given question can be located between character index 0 and 155 of given context.  
# question:\n{question}
# context:\n{context}
# answer:
# """

In [517]:
context = text.split('\n\n')[20]

In [518]:
prompt_input = question_formulate.format(answer_format="text",difficulty="medium",context=context)

In [519]:
output = llm2(prompt_input,temperature=0)

In [520]:
question = output['choices'][0]['message']['content']

In [521]:
question

'What are the two commonly used pre-training tasks for training LLMs?'

In [522]:
# context = """
# Adolf Hitler was an Austrian-born German politician who was the dictator of Germany from 1933 until his suicide in 1945. He rose to power as the leader of the Nazi Party, becoming the chancellor in 1933 and then taking the title of Führer und Reichskanzler in 1934
# """
# question = "when was hitler born?"

In [523]:
prompt_input = context_formulate.format(question=question,context=context)

In [524]:
output = llm2(prompt_input,temperature=0)

In [525]:
extracted_context = output['choices'][0]['message']['content']


In [526]:
context

'20\nfor position embeddings, RoPE or ALiBi is a better choice\nsince it performs better on long sequences.\n4.2.3 Pre-training Tasks\nPre-training plays a key role that encodes general knowl-\nedge from large-scale corpus into the massive model param-\neters. For training LLMs, there are two commonly used pre-\ntraining tasks, namely language modeling and denoising\nautoencoding.\nLanguage Modeling. The language modeling task (LM) is\nthe most commonly used objective to pre-train decoder-only\nLLMs, e.g., GPT3 [55] and PaLM [56]. Given a sequence of\ntokens x={x1, . . . , x n}, the LM task aims to autoregres-\nsively predict the target tokens xibased on the preceding\ntokens x<iin a sequence. A general training objective is to\nmaximize the following likelihood:\nLLM(x) =nX\ni=1logP(xi|x<i). (4)\nSince most language tasks can be cast as the prediction\nproblem based on the input, these decoder-only LLMs might\nbe potentially advantageous to implicitly learn how to ac-\ncomplish these 

In [504]:
extracted_context

'- To construct the scientific corpus, existing efforts mainly collect arXiv papers, scientific textbooks, math webpages, and other related scientific resources.\n- The first source is from programming question answering communities like Stack Exchange.\n- The second source is from public software repositories such as GitHub, where code data (including comments and docstrings) are collected for utilization.'
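
Rule 2 of `context_formulate` asks for sentences copied exactly from the context, so it may be worth checking that each extracted chunk really occurs in the context it was supposedly taken from. The sketch below is an assumption about how such a check could look: `normalize`, `unsupported_chunks`, and the example strings are hypothetical, and the whitespace normalization does not undo PDF hyphenation such as "knowl-\nedge".

```python
import re

def normalize(text):
    """Collapse runs of whitespace (including newlines) into single spaces."""
    return re.sub(r"\s+", " ", text).strip()

def unsupported_chunks(extracted, context):
    """Return extracted lines that do not occur verbatim in the context."""
    haystack = normalize(context)
    # Drop leading bullet markers that the LLM may add to each sentence.
    chunks = [normalize(line.lstrip("-• ")) for line in extracted.splitlines()]
    return [c for c in chunks if c and c not in haystack]

context_example = "LLMs are pre-trained on large corpora.\nTwo tasks are\ncommonly used."
extracted_example = "- Two tasks are commonly used.\n- Fine-tuning is always required."
missing = unsupported_chunks(extracted_example, context_example)
print(missing)
```

An empty result means every extracted line was found in the context; any returned line points to a hallucinated or stale extraction.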

In [338]:
prompt_input = answer_formulate.format(question=question,context=context)

In [189]:
output = llm2(prompt_input)

In [190]:
answer = output['choices'][0]['message']['content']


In [191]:
answer

'Some advantages of LoRA in parameter-efficient fine-tuning of Language Model Models (LLMs) are:\n\n1. Memory and storage savings: LoRA can significantly reduce the memory and storage usage of LLMs. It allows for the use of a single large model copy while maintaining task-specific low-rank decomposition matrices. This saves VRAM and storage space.\n\n2. Lightweight adaptation: LoRA provides a lightweight adaptation approach for downstream tasks. It allows for efficient tuning of LLMs, making them more suitable for resource-limited settings.\n\n3. Comparable performance with fewer parameters: LoRA performs relatively well compared to other efficient tuning methods, using significantly fewer trainable parameters. This means that it can achieve comparable performance to larger models while being more parameter-efficient.\n\nOverall, LoRA offers memory and storage savings, lightweight adaptation, and comparable performance with fewer parameters, making it advantageous for parameter-efficie

#### Extract more context for multi-hop questions

In [381]:
index=23
all_contexts = text.split('\n\n')[index:index+20]

In [382]:
similarity = calculate_similarity(extracted_context,all_contexts)

In [383]:
similarity

array([0.83912225, 0.76749923, 0.87348663, 0.82593298, 0.8097002 ,
       0.85576036, 0.84262018, 0.82826259, 0.8075985 , 0.82869592,
       0.80941025, 0.82444939, 0.77911891, 0.84715337, 0.82129219,
       0.81333971, 0.8430313 , 0.84772823, 0.8263808 , 0.84240174])

In [384]:
similarity.argsort()[::-1]

array([ 2,  5, 17, 13, 16,  6, 19,  0,  9,  7, 18,  3, 11, 14, 15,  4, 10,
        8, 12,  1])
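
Picking the best candidates from the ranking can be wrapped in a small helper. This is a sketch under assumptions: `top_k_indices` and the `min_score` cut-off are hypothetical additions, and the scores are copied from the similarity output above.

```python
import numpy as np

# Similarity scores copied from the notebook output above.
similarity = np.array([
    0.83912225, 0.76749923, 0.87348663, 0.82593298, 0.8097002,
    0.85576036, 0.84262018, 0.82826259, 0.8075985, 0.82869592,
    0.80941025, 0.82444939, 0.77911891, 0.84715337, 0.82129219,
    0.81333971, 0.8430313, 0.84772823, 0.8263808, 0.84240174,
])

def top_k_indices(scores, k=3, min_score=0.0):
    """Indices of the k highest scores, best first, dropping those below min_score."""
    order = np.argsort(scores)[::-1]
    return [int(i) for i in order[:k] if scores[i] >= min_score]

print(top_k_indices(similarity, k=3))
print(top_k_indices(similarity, k=3, min_score=0.85))
```

A score threshold might help avoid pairing the question with a paragraph that is only weakly related, which matters because all scores here fall in a narrow band between roughly 0.77 and 0.87.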

In [385]:
print(all_contexts[2])

25
mance of the model. Here, we discuss some essential factors
for instance construction.
•Scaling the instructions. It has been widely shown that
scaling the number of tasks can largely enhance the general-
ization ability of LLMs [28, 62, 79]. With the increasing of the
task number, the model performance initially shows a con-
tinuous growth pattern, while the gain becomes negligible
when it reaches a certain level [64, 79]. A plausible specula-
tion is that a certain number of representative tasks can pro-
vide relatively sufficient knowledge and adding more tasks
may not bring additional gains [64]. Also, it is beneficial to
enhance the diversity of the task descriptions in several as-
pects, such as length, structure, and creativity [28]. As for the
number of instances per task, it has been found that a small
number of instances can usually saturate the generalization
performance of the model [62, 64]. Whereas, increasing the
number of instances for some tasks to a large number ( 

### Complicating instructions

In [391]:
## Approach 1
## includes info in context2 and adds an 'and' part to the original question

multi_context = """
You are a prompt rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and two sets of contexts, namely context1 and context2. 
Your task is to complicate the given question in a way that answering it requires information derived from both context1 and context2. 
Follow the rules given below while rewriting the question.
    1. The rewritten question should not be very long. 
    2. The rewritten question must be reasonable and must be understood and responded by humans.
    3. The rewritten question must be fully answerable from information present in context1 and context2. 
    4. Read and understand both contexts and rewrite the question so that answering requires insight from both context1 and context2.
    
question:\n{question}
context1:\n{context1}
context2:\n{context2}
"""

In [387]:
prompt_input = multi_context.format(question=question,context1=extracted_context,context2=all_contexts[2])

In [388]:
output = llm2(prompt_input)

In [389]:
output

<OpenAIObject chat.completion id=chatcmpl-7tDZdGGXKjv8TTEdVgWeSk3k9IXUX at 0x7fe9e0670310> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "What are the two major approaches to adapting pre-trained LLMs, and how do factors such as scaling the instructions and formatting design impact their performance?",
        "role": "assistant"
      }
    }
  ],
  "created": 1693394573,
  "id": "chatcmpl-7tDZdGGXKjv8TTEdVgWeSk3k9IXUX",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 30,
    "prompt_tokens": 1731,
    "total_tokens": 1761
  }
}

In [543]:
## Concretizing 
Concretize_prompt = """
You are a question rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and a context. 
Your task is to rewrite the question. You SHOULD complicate the given question using the following method:
Relate the original question to a real-life scenario and reframe the question. 
Follow the rules given below while rewriting the question.
    1. The rewritten question must be reasonable and must be understood and responded by humans.
    2. The rewritten question should be fully answerable from insights derived from the provided context. 
    3. The rewritten question should not ask for any external links. 
    4. Rewritten question should only add maximum 15 words into given question

question:\n{question}
context:\n{context}
"""

In [544]:
prompt_input = Concretize_prompt.format(question=question,context=context)

In [545]:
print(prompt_input)


You are a question rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and a context. 
Your task is to rewrite the question. You SHOULD complicate the given question using the following method:
Relate the original question to a real-life scenario and reframe the question. 
Follow the rules given below while rewriting the question.
    1. The rewritten question must be reasonable and must be understood and responded by humans.
    2. The rewritten question should be fully answerable from insights derived from the provided context. 
    3. The rewritten question should not ask for any external links. 
    4. Rewritten question should only add maximum 15 words into given question

question:
What are the two commonly used pre-training tasks for training LLMs?
context:
20
for position embeddings, RoPE or ALiBi is a better choice
since it perf

In [546]:
output = llm2(prompt_input)

In [547]:
question

'What are the two commonly used pre-training tasks for training LLMs?'

In [548]:
question_c = output['choices'][0]['message']['content']

In [549]:
question_c

'What are the two commonly used pre-training tasks for training LLMs in a real-life scenario?\n'
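
Rule 4 of `Concretize_prompt` limits how many words the rewrite may add. A rough check of that limit could look like the sketch below; `added_words` is a hypothetical helper that compares whitespace-separated word counts only, and the two strings are the original and rewritten questions shown above.

```python
def added_words(original, rewritten):
    """Difference in whitespace-separated word count between rewrite and original."""
    return len(rewritten.split()) - len(original.split())

original_q = "What are the two commonly used pre-training tasks for training LLMs?"
rewritten_q = (
    "What are the two commonly used pre-training tasks for training LLMs "
    "in a real-life scenario?\n"
)
n_added = added_words(original_q, rewritten_q)
print(n_added, n_added <= 15)
```

Here the rewrite stays within the limit, but it only appends "in a real-life scenario" and does not actually tie the question to a concrete use case, so a word-count check alone says nothing about the quality of the rewrite.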

In [515]:
llm2(answer_formulate.format(question=question_c,context=context))

<OpenAIObject chat.completion id=chatcmpl-7tFVdnu0QTYeSDqD8JTRqowNtox7C at 0x7fe9d180cf40> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The specific sources of data used for pre-training Language Models (LLMs) such as GPT-3 and GPT-NeoX include webpages, conversation data, books and news, scientific data, and code.",
        "role": "assistant"
      }
    }
  ],
  "created": 1693402013,
  "id": "chatcmpl-7tFVdnu0QTYeSDqD8JTRqowNtox7C",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 44,
    "prompt_tokens": 1263,
    "total_tokens": 1307
  }
}

In [447]:
test_prompt = """
I want you act as a Prompt Rewriter.
Your objective is to rewrite a given prompt into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
But the rewritten prompt must be reasonable and must be understood and responded by humans.
Your rewriting cannot omit the non-text parts such as the table and code in #Given Prompt#:. Also, please
do not omit the input in #Given Prompt#.
You SHOULD complicate the given prompt using the following method:
Please replace general concepts with more specific concepts. or
You should try your best not to make the #Rewritten Prompt# become verbose, #Rewritten Prompt# can only
add 10 to 20 words into #Given Prompt#.
‘#Given Prompt#’, ‘#Rewritten Prompt#’, ‘given prompt’ and ‘rewritten prompt’ are not allowed to appear in
#Rewritten Prompt#
#Given Prompt#:
Answer the question using given context:
question:\n{question}
context:\n{context}
#Rewritten Prompt#:
"""

In [450]:
question

'What are the two major approaches to adapting pre-trained LLMs?'

In [451]:
extracted_context

'After pre-training, LLMs can acquire the general abilities for solving various tasks. However, an increasing number of studies have shown that LLM’s abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning.'

In [453]:
llm2(test_prompt.format(question=question,context=extracted_context),temperature=1)

<OpenAIObject chat.completion id=chatcmpl-7tFNLQhhT9kyUOaAIwd5Rd7TRaCDh at 0x7fe9f12a5270> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "Please provide an answer to the following question based on the information given in the context:\n\nQuestion: What are the primary methods for fine-tuning pre-trained LLM models?\n\nContext: Following the pre-training process, LLM models can gain a broad range of capabilities to tackle different tasks. Nonetheless, numerous studies have demonstrated that their abilities can be further customized in accordance with specific objectives. In this particular section, we will elaborate on two main strategies for adapting pre-trained LLMs, namely instruction tuning and alignment tuning.",
        "role": "assistant"
      }
    }
  ],
  "created": 1693401499,
  "id": "chatcmpl-7tFNLQhhT9kyUOaAIwd5Rd7TRaCDh",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
   