## Algorithm
* Input: set of documents
    - number of samples to generate
    - maximum token length (input + context + answer)
    - difficulty distribution
    - seed questions?
    - randomize answer output formats
* Output: test set with questions, contexts, and answers

1. Select a document and the part of the document to frame the question from (random)
2. Formulate a (question, context) pair constrained on the output format.
3. Identify difficulty type
4. Evolve the question using an Evol-Instruct-like paradigm to increase difficulty
    - ask for improved reasoning (Evol-Instruct C: Increased Reasoning Steps Prompt)
    - Add more contexts and frame the question for multi-hop
        - Adding more context
           1. Identify an entity from the current context and retrieve paragraphs with the same entity: formulate a new question
           2. Identify similar paragraphs using sentence similarity: formulate a new question
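
Step 1 and the input parameters listed above can be sketched as a small sampling routine. This is a minimal sketch under stated assumptions, not the notebook's implementation: `sample_plan`, the `documents` dict layout, and the difficulty labels are hypothetical names chosen for illustration.

```python
import random


def sample_plan(documents, n_samples, difficulty_weights, seed=0):
    """Pick (document name, chunk index, difficulty) triples at random.

    documents: dict mapping a document name to its list of text chunks (assumed layout).
    difficulty_weights: dict mapping a difficulty label to a relative weight.
    """
    rng = random.Random(seed)  # fixed seed so the same plan can be regenerated
    names = sorted(documents)
    labels = list(difficulty_weights)
    weights = [difficulty_weights[label] for label in labels]
    plan = []
    for _ in range(n_samples):
        name = rng.choice(names)
        chunk_index = rng.randrange(len(documents[name]))
        difficulty = rng.choices(labels, weights=weights, k=1)[0]
        plan.append((name, chunk_index, difficulty))
    return plan


docs = {"survey": ["chunk a", "chunk b", "chunk c"], "paper": ["chunk d"]}
weights = {"easy": 1, "medium": 2, "hard": 1}
plan = sample_plan(docs, n_samples=5, difficulty_weights=weights)
```

Each triple in `plan` would then drive one question-generation call; the weights express the difficulty distribution as relative frequencies rather than exact counts.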
           
           
#### Ideas to increase complexity

- Extend question by using info in the newly added contexts.


- Concretizing

    In the RAG case, this should yield an instruction tied to a real-world use case of the concept described in the context.
    For example, if the context discusses different types of instructions and their use cases,
    the question could be "I want to train a chatbot, how do I form the training dataset?"


- Improved reasoning
           
- Reasoning over multiple contexts
    - Add context1, context2, context3, etc. until the budget of (max tokens - x) is reached, then ask the model to formulate a question that requires reasoning over multiple contexts.
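
The "add contexts until max tokens - x" idea can be sketched as a greedy packing loop. This is an illustrative sketch: `pack_contexts` and `reserve` (the `x` above) are hypothetical names, and counting whitespace-separated words is only a crude stand-in for a real tokenizer.

```python
def pack_contexts(contexts, max_tokens, reserve, count_tokens=None):
    """Keep adding contexts, in order, while they fit in max_tokens - reserve."""
    if count_tokens is None:
        # Assumption: word count approximates token count; swap in a real tokenizer.
        count_tokens = lambda text: len(text.split())
    budget = max_tokens - reserve
    packed, used = [], 0
    for ctx in contexts:
        cost = count_tokens(ctx)
        if used + cost > budget:
            break  # stop at the first context that would overflow the budget
        packed.append(ctx)
        used += cost
    return packed


packed = pack_contexts(["a b c", "d e", "f g h i"], max_tokens=8, reserve=2)
# packed == ["a b c", "d e"]: 3 + 2 words fit in the budget of 6, the third chunk does not
```

Stopping at the first overflow keeps the contexts in their original order; skipping the oversized chunk and continuing would pack more text but could separate related passages.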
    
#### Open issues
1. How to ensure there is no inter-document dependency

A framed question might have possible answers in different documents.

2. Questions framed by the LLM are easily answerable.

It seems the LLM first identifies a candidate sentence and then frames a question around it, so by default the answer can be located easily.

3. GPT-3.5 tends to lengthen the question when asked to create a difficult one

Often this ends up as an unusually long question that contains two questions.
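
One way to catch the two-questions-in-one failure is a cheap heuristic filter applied before a question is accepted. The sketch below is an assumption, not something the notebook does: `looks_compound` and the `max_words=25` threshold are hypothetical, and the rule will miss compound questions that are phrased differently.

```python
import re

# Question words that typically open a second clause such as ", and how ...".
_WH_WORDS = r"(?:what|how|why|when|where|which|who)"


def looks_compound(question, max_words=25):
    """Flag questions that are very long or appear to contain two questions."""
    too_long = len(question.split()) > max_words
    several_marks = question.count("?") > 1
    joined = re.search(rf",?\s+and\s+{_WH_WORDS}\b", question, flags=re.IGNORECASE)
    return too_long or several_marks or joined is not None


simple = "What are the two major approaches to adapting pre-trained LLMs?"
compound = (
    "What are the two major approaches to adapting pre-trained LLMs and "
    "what are the steps involved in instruction tuning?"
)
flags = (looks_compound(simple), looks_compound(compound))
```

A flagged question could be regenerated or passed through a shortening prompt instead of being discarded outright.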

## Functions

In [1]:
import os
import openai
import json
import numpy as np

In [2]:
os.environ["OPENAI_API_KEY"] = json.load(open("/Users/shahules/openai-key.json"))["ikka"]

In [3]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [4]:
def llm2(prompt, **kwargs):
    response = openai.ChatCompletion.create(
        model=kwargs.get("model", "gpt-3.5-turbo"),
        messages=[{"role": "system", "content": prompt}],
        temperature=kwargs.get("temperature", 0),
        top_p=kwargs.get("top_p", 1),
        frequency_penalty=kwargs.get("frequency_penalty", 0.0),
        presence_penalty=kwargs.get("presence_penalty", 0.0),
        max_tokens=kwargs.get("max_tokens", 500),
        n=kwargs.get("n", 1),
    )
    return response

In [5]:
from langchain.embeddings import OpenAIEmbeddings
Embedding = OpenAIEmbeddings()

In [6]:
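def calculate_similarity(question, generated_questions):
    """Cosine similarity between one text and each text in a list."""
    # Embed the query as a (1, d) row vector and the candidates as an (n, d) matrix.
    question_vec = np.asarray(Embedding.embed_query(question)).reshape(1, -1)
    gen_question_vec = np.asarray(Embedding.embed_documents(generated_questions))
    # Product of the vector norms, one entry per candidate.
    norm = np.linalg.norm(gen_question_vec, axis=1) * np.linalg.norm(
        question_vec, axis=1
    )
    return np.dot(gen_question_vec, question_vec.T).reshape(-1) / norm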
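
To check the arithmetic of `calculate_similarity` without calling the embedding API, the same cosine computation can be run on hand-made vectors. `cosine_similarity` below is a hypothetical helper that takes vectors directly; the toy vectors are made up for illustration.

```python
import numpy as np


def cosine_similarity(query_vec, doc_vecs):
    """Cosine similarity between one vector and each row of a matrix."""
    query_vec = np.asarray(query_vec, dtype=float).reshape(1, -1)
    doc_vecs = np.asarray(doc_vecs, dtype=float)
    norm = np.linalg.norm(doc_vecs, axis=1) * np.linalg.norm(query_vec, axis=1)
    return (doc_vecs @ query_vec.T).reshape(-1) / norm


# Parallel, orthogonal, and 45-degree vectors relative to the query [1, 0].
scores = cosine_similarity([1.0, 0.0], [[2.0, 0.0], [0.0, 3.0], [1.0, 1.0]])
# scores is approximately [1.0, 0.0, 0.7071]
```

Because the result is normalized by both vector lengths, scaling a vector (as with `[2.0, 0.0]`) does not change its score.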

## Playground

- random select document 
- Identify sections from each document

In [None]:
BLOCKLIST = ["Based on the provided context",]
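
The notebook does not show how `BLOCKLIST` is applied; one plausible use is to reject generated questions that contain a blocked phrase. The helper name `contains_blocked_phrase` and the case-insensitive matching are assumptions made for this sketch.

```python
BLOCKLIST = ["Based on the provided context"]


def contains_blocked_phrase(text, blocklist=BLOCKLIST):
    """Return True if any blocked phrase occurs in text, ignoring case."""
    lowered = text.lower()
    return any(phrase.lower() in lowered for phrase in blocklist)


candidates = [
    "Based on the provided context, what is FSDP?",
    "What technique in PyTorch is similar to ZeRO?",
]
# Keep only questions that do not leak the prompt wording.
kept = [q for q in candidates if not contains_blocked_phrase(q)]
```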

In [8]:
import re

In [9]:
def read_doc(path):
    with open(path,'r') as file:
        return file.read()

In [10]:
text = read_doc("../arxiv-llm/textdata/2303.18223v11.A_Survey_of_Large_Language_Models")

In [11]:
pattern = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'
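
The pattern appears intended to detect section headings: an optional `N.N` number followed by an all-caps title on its own line. The sketch below applies it with `re.MULTILINE`, which is an assumption since the notebook never shows the pattern in use; `find_headings` and the sample text are hypothetical.

```python
import re

pattern = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'


def find_headings(text):
    """Return lines that look like all-caps section headings."""
    # \s inside the character class also matches newlines, so strip each match.
    return [m.group().strip() for m in re.finditer(pattern, text, re.MULTILINE)]


sample = (
    "1.2 BACKGROUND\n"
    "Large models are trained on text.\n"
    "PRE-TRAINING\n"
    "Data is collected first."
)
headings = find_headings(sample)
# headings == ["1.2 BACKGROUND", "PRE-TRAINING"]
```

Note that the numeric prefix requires a dot, so a heading numbered with a bare integer would not match, and two heading lines in a row would be merged into one match because `\s` spans the line break.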


In [12]:
len(text.split('\n\n'))

86

In [13]:
example = """
Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century
"""

In [14]:
example[0:155]

'\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,'

In [15]:
question_formulate = """
Your task is to formulate a question from the given context satisfying the rules given below:
    1.The question should be fully answerable from the given context. 
    2.The question should be framed from a part that contains non-trivial information. 
    3.The answer to the question must be in {answer_format} format. 
    4.The answer should not contain any links. 
    5.The question should be of {difficulty} difficulty.
    6.The question must be reasonable and must be understood and responded to by humans.
context:{context}
"""

context_formulate =  """\
Task: Candidate sentence extraction.
Given the question and context, extract sentences from given context that can be used to answer the question. 
Rules to follow while doing this task:
    1. Your task is not to answer the question using given context but to extract sentences from given context that can be used to answer given question.
    2. The sentences could be anywhere in the provided context, pay attention to the whole context.
    3. While extracting candidate sentences you're not allowed to make any changes to sentences from given context.
    4. If the answer is not present in context, you should only return the empty string.


question:{question}
context:\n{context}
candidate sentences:\n
""" 

context_formulatev1 = """
Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information."
question:{question}
context:\n{context}
candidate sentences:\n
"""

context_formulatev2 = """
Please extract relevant sentences from the provided context that can potentially help answer the following question. If no relevant sentences are found, or if you believe the question cannot be answered from the given context, return the phrase "Insufficient Information".  While extracting candidate sentences you're not allowed to make any changes to sentences from given context.


question:{question}
context:\n{context}
candidate sentences:\n
"""

answer_formulate = """
Answer the question using the information from the given context. 
You are not allowed to include information that cannot be deduced from the given context.
question:{question}
context:{context}
answer:
"""

context_from_answer = """
Given a question, context and answer, locate the relevant information in the context that was used to form the given answer. 
question:\n{question}
context:\n{context}
answer:\n{answer}
extracted context:"""


# answer_index_formulate = """
# Locate the relevant information in the context and provide the start and end indices of the text that can be used to answer the question. Keep in mind that the relevant information might be surrounded by other unrelated text. 
# You can identify the relevant portion using any relevant keywords, phrases, or patterns present in the context.
# \n\n
# question:\nWhen was Einstein born?
# context:\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.
# answer: the answer to the given question can be located between character index 0 and 155 of given context.  
# question:\n{question}
# context:\n{context}
# answer:
# """

In [21]:
context = text.split('\n\n')[56]

In [16]:
prompt_input = question_formulate.format(answer_format="text",difficulty="medium",context=context)

In [17]:
output = llm2(prompt_input,temperature=0)

In [18]:
question = output['choices'][0]['message']['content']

In [19]:
question

'What are the two major approaches to adapting pre-trained LLMs?'

In [20]:
# context = """
# Adolf Hitler was an Austrian-born German politician who was the dictator of Germany from 1933 until his suicide in 1945. He rose to power as the leader of the Nazi Party, becoming the chancellor in 1933 and then taking the title of Führer und Reichskanzler in 1934
# """
# question = "when was hitler born?"

In [21]:
prompt_input = context_formulate.format(question=question,context=context)

In [22]:
output = llm2(prompt_input,temperature=0)

In [23]:
extracted_context = output['choices'][0]['message']['content']


In [26]:
context

'23\noverhead, and the third solution increases about 50% com-\nmunication overhead but saves memory proportional to\nthe number of GPUs. PyTorch has implemented a similar\ntechnique as ZeRO, called FSDP [237].\nMixed Precision Training. In previous PLMs ( e.g.,\nBERT [23]), 32-bit floating-point numbers, also known as\nFP32, have been predominantly used for pre-training. In\nrecent years, to pre-train extremely large language models,\nsome studies [232] have started to utilize 16-bit floating-\npoint numbers (FP16), which reduces memory usage and\ncommunication overhead. Additionally, as popular NVIDIA\nGPUs ( e.g., A100) have twice the amount of FP16 computa-\ntion units as FP32, the computational efficiency of FP16 can\nbe further improved. However, existing work has found that\nFP16 may lead to the loss of computational accuracy [59, 69],\nwhich affects the final model performance. To alleviate it, an\nalternative called Brain Floating Point (BF16) has been used\nfor training, whic

In [24]:
extracted_context

'After pre-training, LLMs can acquire the general abilities for solving various tasks. However, an increasing number of studies have shown that LLM’s abilities can be further adapted according to specific goals. In this section, we introduce two major approaches to adapting pre-trained LLMs, namely instruction tuning and alignment tuning. The former approach mainly aims to enhance (or unlock) the abilities of LLMs, while the latter approach aims to align the behaviors of LLMs with human values or preferences.'
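
Since the extraction prompt forbids altering sentences, the model output can be checked against the source chunk. This is a rough sketch under assumptions: `is_grounded` and `normalize` are hypothetical helpers, and removing every end-of-line hyphen also merges genuinely hyphenated words (for example `pre-` followed by `trained` on the next line), so some valid extractions would be rejected.

```python
import re


def normalize(text):
    """Undo PDF line wrapping so wrapped and unwrapped text compare equal."""
    text = re.sub(r"-\n", "", text)  # join words hyphenated across a line break
    return re.sub(r"\s+", " ", text).strip()


def is_grounded(extracted, context, sentinel="Insufficient Information"):
    """True if the extracted text occurs verbatim in the context after normalizing."""
    cleaned = extracted.strip().rstrip(".")
    if not cleaned or cleaned == sentinel:
        return False  # empty output or the "no answer" phrase
    return normalize(extracted) in normalize(context)


chunk = (
    "LLMs can be fur-\nther adapted. Two approaches exist:\n"
    "instruction tuning and alignment tuning."
)
checks = (
    is_grounded("LLMs can be further adapted.", chunk),
    is_grounded("LLMs cannot be adapted.", chunk),
    is_grounded("Insufficient Information.", chunk),
)
```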

In [180]:
prompt_input = answer_formulate.format(question=question,context=context)

In [181]:
output = llm2(prompt_input)

In [182]:
answer = output['choices'][0]['message']['content']


In [183]:
answer

'The two major approaches to adapting pre-trained LLMs are instruction tuning and alignment tuning.'
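
The cells above run the question, candidate-sentence, and answer prompts one after another. The chain can be wrapped in a single function as sketched below; `generate_sample`, the shortened `TEMPLATES`, and `fake_llm` are all hypothetical stand-ins so the example runs offline, and a real run would pass a callable that returns the model's text.

```python
def generate_sample(context, llm, templates):
    """Run the question -> evidence -> answer chain for one context chunk."""
    question = llm(templates["question"].format(context=context))
    evidence = llm(templates["evidence"].format(question=question, context=context))
    answer = llm(templates["answer"].format(question=question, context=context))
    return {"question": question, "context": evidence, "answer": answer}


# Hypothetical short templates; the notebook's real prompts are much longer.
TEMPLATES = {
    "question": "Write a question.\ncontext:{context}",
    "evidence": "Extract sentences.\nquestion:{question}\ncontext:{context}",
    "answer": "Answer the question.\nquestion:{question}\ncontext:{context}",
}


def fake_llm(prompt):
    # Stand-in for a real model call; returns canned text based on the prompt type.
    if prompt.startswith("Write a question"):
        return "Which float format reduces memory usage?"
    if prompt.startswith("Extract sentences"):
        return "FP16 reduces memory usage."
    return "FP16."


sample = generate_sample("FP16 reduces memory usage.", fake_llm, TEMPLATES)
```

Passing the model as a callable keeps the chain testable without network access and makes it easy to swap models later.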

#### Extract more context for multi-hop questions

In [27]:
index=23
all_contexts = text.split('\n\n')[index:index+20]

In [28]:
similarity = calculate_similarity(extracted_context,all_contexts)

In [29]:
similarity

array([0.84381185, 0.77465926, 0.88152245, 0.83331187, 0.82022018,
       0.86879151, 0.85556014, 0.84220169, 0.81447715, 0.83058065,
       0.80893047, 0.82643535, 0.78249331, 0.85134229, 0.82774607,
       0.81884942, 0.84903393, 0.84885221, 0.82688037, 0.84365871])

In [30]:
similarity.argsort()[::-1]

array([ 2,  5,  6, 13, 16, 17,  0, 19,  7,  3,  9, 14, 18, 11,  4, 15,  8,
       10, 12,  1])
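
The descending `argsort` above ranks every chunk; for multi-hop questions only the best few are needed. A minimal sketch, with `top_k_indices` as a hypothetical helper and the scores copied from the first five values of the similarity array printed above:

```python
import numpy as np


def top_k_indices(scores, k):
    """Indices of the k highest scores, best first."""
    order = np.argsort(scores)[::-1]  # ascending sort, then reversed
    return [int(i) for i in order[:k]]


similarity = np.array([0.84381185, 0.77465926, 0.88152245, 0.83331187, 0.82022018])
best = top_k_indices(similarity, k=2)
# best == [2, 0]
```

If the query chunk itself is among the candidates it will rank first with a score near 1, so it may need to be excluded before picking a second context.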

In [31]:
print(all_contexts[2])

25
mance of the model. Here, we discuss some essential factors
for instance construction.
•Scaling the instructions. It has been widely shown that
scaling the number of tasks can largely enhance the general-
ization ability of LLMs [28, 62, 79]. With the increasing of the
task number, the model performance initially shows a con-
tinuous growth pattern, while the gain becomes negligible
when it reaches a certain level [64, 79]. A plausible specula-
tion is that a certain number of representative tasks can pro-
vide relatively sufficient knowledge and adding more tasks
may not bring additional gains [64]. Also, it is beneficial to
enhance the diversity of the task descriptions in several as-
pects, such as length, structure, and creativity [28]. As for the
number of instances per task, it has been found that a small
number of instances can usually saturate the generalization
performance of the model [62, 64]. Whereas, increasing the
number of instances for some tasks to a large number ( 

### Complicating instructions

In [32]:
## Approach 1
## includes info in context2 and adds an 'and' part to the original question

multi_context = """
You are a prompt rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and two sets of contexts, namely context1 and context2. 
Your task is to complicate the given question in a way that answering it requires information derived from both context1 and context2. 
Follow the rules given below while rewriting the question.
    1. The rewritten question should not be very long. 
    2. The rewritten question must be reasonable and must be understood and responded by humans.
    3. The rewritten question must be fully answerable from information present in context1 and context2. 
    4. Read and understand both contexts and rewrite the question so that answering requires insight from both context1 and context2.
    
question:\n{question}
context1:\n{context1}
context2:\n{context2}
"""

In [54]:
single_part = """
Please rephrase the following multi-part question into a single, shortened and comprehensive question that encompasses all the key elements.
question:\n{question}
rewritten question:"""



In [33]:
prompt_input = multi_context.format(question=question,context1=extracted_context,context2=all_contexts[2])

In [34]:
output = llm2(prompt_input)

In [36]:
questionv1 = output['choices'][0]['message']['content']

In [37]:
questionv1

'What are the two major approaches to adapting pre-trained LLMs, and how do factors such as scaling the instructions and formatting design impact their performance?'

In [None]:
output = llm2(single_part.format(question=questionv1),temperature=0)
questionv1_compact = output['choices'][0]['message']['content']
questionv1_compact

### Add more reasoning steps

In [69]:
reasoning = """
Please rewrite the following question to make it require multi-step reasoning to answer using the provided context. Ensure that the question remains relevant to the context.
question:{question}
context:\n{context}
Rewritten Question Requiring Multi-Step Reasoning:

"""

In [70]:
prompt_input = reasoning.format(question=question,context=context)

In [72]:
output = llm2(prompt_input)

In [73]:
output

<OpenAIObject chat.completion id=chatcmpl-7vOaLdbKUBYZ4ESBkBghw7nKXSCJo at 0x7faba9766ef0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "What are the two major approaches to adapting pre-trained LLMs and what are the steps involved in instruction tuning?",
        "role": "assistant"
      }
    }
  ],
  "created": 1693913557,
  "id": "chatcmpl-7vOaLdbKUBYZ4ESBkBghw7nKXSCJo",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 23,
    "prompt_tokens": 1535,
    "total_tokens": 1558
  }
}

### Concretizing


In [543]:
## Concretizing 
Concretize_prompt = """
You are a question rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and a context. 
Your task is to rewrite the question. You SHOULD complicate the given question using the following method:
Relate the original question to a real-life scenario and reframe the question. 
Follow the rules given below while rewriting the question.
    1. The rewritten question must be reasonable and must be understood and responded by humans.
    2. The rewritten question should be fully answerable from insights derived from the provided context. 
    3. The rewritten question should not ask for any external links. 
    4. Rewritten question should only add maximum 15 words into given question

question:\n{question}
context:\n{context}
"""

In [544]:
prompt_input = Concretize_prompt.format(question=question,context=context)

In [545]:
print(prompt_input)


You are a question rewriter. Your objective is to rewrite a given question into a more complex version to make those famous AI systems
(e.g., ChatGPT and GPT4) a bit harder to handle.
You will be provided with a question and a context. 
Your task is to rewrite the question. You SHOULD complicate the given question using the following method:
Relate the original question to a real-life scenario and reframe the question. 
Follow the rules given below while rewriting the question.
    1. The rewritten question must be reasonable and must be understood and responded by humans.
    2. The rewritten question should be fully answerable from insights derived from the provided context. 
    3. The rewritten question should not ask for any external links. 
    4. Rewritten question should only add maximum 15 words into given question

question:
What are the two commonly used pre-training tasks for training LLMs?
context:
20
for position embeddings, RoPE or ALiBi is a better choice
since it perf

In [546]:
output = llm2(prompt_input)

In [547]:
question

'What are the two commonly used pre-training tasks for training LLMs?'

In [548]:
question_c = output['choices'][0]['message']['content']

In [549]:
question_c

'What are the two commonly used pre-training tasks for training LLMs in a real-life scenario?\n'

In [515]:
llm2(answer_formulate.format(question=question_c,context=context))

<OpenAIObject chat.completion id=chatcmpl-7tFVdnu0QTYeSDqD8JTRqowNtox7C at 0x7fe9d180cf40> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "The specific sources of data used for pre-training Language Models (LLMs) such as GPT-3 and GPT-NeoX include webpages, conversation data, books and news, scientific data, and code.",
        "role": "assistant"
      }
    }
  ],
  "created": 1693402013,
  "id": "chatcmpl-7tFVdnu0QTYeSDqD8JTRqowNtox7C",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 44,
    "prompt_tokens": 1263,
    "total_tokens": 1307
  }
}

In [26]:
test_prompt = """
Create a multi-hop reasoning question based on the provided context. The question should require the reader to make multiple logical connections or inferences using the information available in the context. Ensure that the question can be answered entirely from the information present in the context. Do not frame questions that contain more than 15 words.

Context:
{context}

Detailed Description and Rules:
1. Understand the Context: Read the provided context carefully and understand the information it contains. Identify key pieces of information that are scattered or indirectly related to each other.
2. Identify Multiple Steps: Think about how you can construct a question that requires the reader to connect pieces of information from different parts of the context. This may involve drawing connections between various facts or concepts.
3. Formulate a Multi-hop Question: Craft a question that necessitates the reader to make multiple logical connections or inferences based on the information provided in the context. Here are some strategies to create multi-hop questions:

   - Bridge related entities: Identify information that relates specific entities and frame question that can be answered only by analysing information of both entities.
   
   - Use Pronouns: Identify pronouns (he, she, it, they) that refer to the same entity or concept in the context, and ask questions that would require the reader to figure out what the pronouns refer to.

   - Refer to Specific Details: Mention specific details or facts from different parts of the context and ask how they are related.

   - Pose Hypothetical Scenarios: Present a hypothetical situation or scenario that requires combining different elements from the context to arrive at an answer.

   - Ask About Cause and Effect: Inquire about the causes or effects of specific events or actions mentioned in the context, which may involve reasoning through multiple steps.

4. Ensure Clarity: Make sure the question is clear and unambiguous. It should be evident to the reader that they need to connect multiple pieces of information to answer the question.
5. Phrases like 'based on the provided context', 'according to the context', etc. are not allowed to appear in the question.
Multi-hop Reasoning Question:

"""

In [27]:
output = llm2(test_prompt.format(context=context))
question_r = output['choices'][0]['message']['content']

In [28]:
len(question_r.split())

15

In [29]:
question_r

'What are some challenges that need to be addressed when applying LLMs to real-world scenarios?'

In [30]:
print(context)

56
some recent work has studied the human-like characteristics
of LLMs, such as self-awareness, theory of mind (ToM), and
affective computing [603, 604]. In particular, an empirical
evaluation of ToM conducted on two classic false-belief
tasks speculates that LLMs may have ToM-like abilities
since the model in the GPT-3.5 series achieves comparable
performance with nine-year-old children in ToM task [603].
In addition, another line of work has investigated applying
LLMs into the software development domain, e.g., code
suggestion [605], code summarization [606], and automated
program repair [607]. To summarize, to assist humans by
LLMs in real-world tasks has become a significant area of
research. However, it also presents challenges. Ensuring the
accuracy of LLM-generated content, addressing biases, and
maintaining user privacy and data security are crucial con-
siderations when applying LLMs to real-world scenarios.
10 C ONCLUSION AND FUTURE DIRECTIONS
In this survey, we have reviewed