## Algorithm
* Input: set of documents
    - number of samples to generate
    - maximum token length (input + context + answer)
    - difficulty distribution
    - seed questions?
    - randomize answer output formats
* Output: test set of (question, context, answer) triples
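
The inputs above could be grouped into a single configuration object. The following is only a sketch: the class name `GenerationConfig`, its field names, the default values, and the `"list"` and `"number"` answer formats are all assumptions made for illustration, not part of the notebook.

```python
import random
from dataclasses import dataclass, field


@dataclass
class GenerationConfig:
    """Hypothetical container for the inputs listed above."""
    num_samples: int = 10
    max_tokens: int = 2048  # budget for input + context + answer
    difficulty_distribution: dict = field(
        default_factory=lambda: {"easy": 0.5, "medium": 0.3, "hard": 0.2}
    )
    seed_questions: list = field(default_factory=list)
    answer_formats: tuple = ("text", "list", "number")  # assumed formats
    random_seed: int = 0


def sample_plan(config):
    """Draw one (difficulty, answer_format) pair per sample to generate."""
    rng = random.Random(config.random_seed)
    levels = list(config.difficulty_distribution)
    weights = list(config.difficulty_distribution.values())
    plan = []
    for _ in range(config.num_samples):
        # Difficulty follows the configured distribution; format is uniform.
        difficulty = rng.choices(levels, weights=weights, k=1)[0]
        answer_format = rng.choice(config.answer_formats)
        plan.append((difficulty, answer_format))
    return plan


config = GenerationConfig(num_samples=5)
plan = sample_plan(config)
print(plan)
```

Fixing the seed keeps the sampled plan reproducible between runs, which may help when comparing generated test sets.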

1. Select a document and the part of it to frame the question from (at random)
2. Formulate a (question, context) pair constrained on the output format.
3. Identify the difficulty type
4. Evolve the question using an Evol-Instruct-like paradigm to increase difficulty
    - ask for improved reasoning - (Evol-Instruct C Increased Reasoning Steps Prompt)
    - Add more contexts and frame the question for multi-hop
        - Adding more context
           1. Identify an entity in the current context and retrieve paragraphs with the same entity: formulate a new question
           2. Identify similar paragraphs using sentence similarity: formulate a new question
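
The two ways of adding context in step 4 might look roughly like the sketch below. It makes strong simplifying assumptions: entities are guessed with a capitalized-word regex rather than a real NER model, and sentence similarity is replaced by token-overlap (Jaccard) similarity rather than embeddings. All function names are hypothetical.

```python
import re


def naive_entities(text):
    """Rough entity guess: capitalized words of 3+ letters (assumption, not real NER)."""
    return set(re.findall(r"\b[A-Z][a-zA-Z]{2,}\b", text))


def jaccard(a, b):
    """Token-overlap similarity, a stand-in for a sentence-embedding score."""
    tokens_a = set(re.findall(r"\w+", a.lower()))
    tokens_b = set(re.findall(r"\w+", b.lower()))
    if not tokens_a or not tokens_b:
        return 0.0
    return len(tokens_a & tokens_b) / len(tokens_a | tokens_b)


def paragraphs_sharing_entity(current, paragraphs):
    """Strategy 1: paragraphs mentioning an entity found in the current context."""
    entities = naive_entities(current)
    return [p for p in paragraphs if p != current and entities & naive_entities(p)]


def most_similar_paragraphs(current, paragraphs, top_k=1):
    """Strategy 2: the top_k paragraphs ranked by similarity to the current context."""
    others = [p for p in paragraphs if p != current]
    return sorted(others, key=lambda p: jaccard(current, p), reverse=True)[:top_k]


paragraphs = [
    "Einstein developed the theory of relativity.",
    "The theory of relativity changed modern physics.",
    "Einstein was born in Ulm in 1879.",
    "Bananas are rich in potassium.",
]
current = paragraphs[0]
entity_hits = paragraphs_sharing_entity(current, paragraphs)
similar_hits = most_similar_paragraphs(current, paragraphs, top_k=1)
print(entity_hits)
print(similar_hits)
```

The retrieved paragraphs would then be appended to the current context before asking the model for a new multi-hop question.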
           
           
           
#### Open-issues
1. How to ensure there is no inter-document dependency

A framed question might have possible answers in different documents

## Functions

In [19]:
import os
import openai
import json

In [23]:
with open("/Users/shahules/openai-key.json") as key_file:
    os.environ["OPENAI_API_KEY"] = json.load(key_file)["ikka"]

In [24]:
openai.api_key = os.getenv("OPENAI_API_KEY")

In [25]:
def llm2(prompt, **kwargs):
    """Send `prompt` as a single system message and return the raw completion.

    Uses the legacy `openai.ChatCompletion` interface (openai < 1.0).
    Sampling parameters can be overridden through keyword arguments.
    """
    response = openai.ChatCompletion.create(
        model=kwargs.get("model", "gpt-3.5-turbo"),
        messages=[{"role": "system", "content": prompt}],
        temperature=kwargs.get("temperature", 0),
        top_p=kwargs.get("top_p", 1),
        frequency_penalty=kwargs.get("frequency_penalty", 0.0),
        presence_penalty=kwargs.get("presence_penalty", 0.0),
        max_tokens=kwargs.get("max_tokens", 500),
        n=kwargs.get("n", 1),
    )
    return response

## Playground

- Randomly select a document
- Identify sections in each document
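
A minimal sketch of the random selection step, assuming the documents are already loaded into a dictionary mapping a name to its text and that chunks are separated by blank lines. The function name `pick_chunk` and the `min_chars` threshold are assumptions for illustration.

```python
import random


def pick_chunk(documents, rng, min_chars=50):
    """Pick a random document, then a random blank-line-separated chunk from it."""
    name = rng.choice(sorted(documents))
    chunks = [c.strip() for c in documents[name].split("\n\n")]
    # Prefer chunks long enough to carry non-trivial information;
    # fall back to all chunks if none pass the threshold.
    candidates = [c for c in chunks if len(c) >= min_chars] or chunks
    return name, rng.choice(candidates)


documents = {
    "doc_a": "Short title\n\nThis paragraph is long enough to be a useful context for a question.",
    "doc_b": "Another heading\n\nA second document with a paragraph that also passes the length filter.",
}
rng = random.Random(0)
name, chunk = pick_chunk(documents, rng)
print(name, "->", chunk)
```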

In [26]:
import re

In [27]:
def read_doc(path):
    """Return the full text of the file at `path`."""
    with open(path, 'r', encoding='utf-8') as file:
        return file.read()

In [28]:
text = read_doc("../arxiv-llm/textdata/2303.18223v11.A_Survey_of_Large_Language_Models")

In [29]:
pattern = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'
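
The pattern above looks intended to match upper-case section headings, optionally prefixed by a number such as `4.3`. One way it could be applied is sketched below; the helper name `find_headings` and the sample text are assumptions. Because `\s` inside the character class also matches newlines, a match can include trailing line breaks, so each match is stripped.

```python
import re

HEADING_PATTERN = r'^(?:\d+\.\d+\s+)?[A-Z][A-Z-\s]+$'


def find_headings(text):
    """Return candidate section headings found in `text`."""
    # re.MULTILINE lets ^ and $ match at the start and end of every line.
    matches = re.finditer(HEADING_PATTERN, text, flags=re.MULTILINE)
    return [m.group(0).strip() for m in matches]


sample = (
    "INTRODUCTION\n"
    "Large models are popular.\n"
    "\n"
    "4.3 SCALABLE TRAINING\n"
    "As the model grows, training gets harder.\n"
)
headings = find_headings(sample)
print(headings)
```

Consecutive all-caps lines would be merged into one match by this pattern, so the result is best treated as a rough first pass.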


In [30]:
len(text.split('\n\n'))

86

In [118]:
example = """
Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century
"""

In [122]:
example[0:155]

'\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,'

In [156]:
question_formulate = """
Your task is to formulate a question from the given context satisfying the rules given below:
    1. The question should be fully answerable from the given context.
    2. The question should be framed from a part that contains non-trivial information.
    3. The answer to the question must be in {answer_format} format.
    4. The answer should not contain any links.
    5. The question should be of {difficulty} difficulty.
    6. The question must be reasonable and must be understandable and answerable by humans.
context:\n{context}
"""

answer_formulate = """
Locate the relevant information in the context and provide the start and end indices of the text that can be used to answer the question. Keep in mind that the relevant information might be surrounded by other unrelated text. 
You can identify the relevant portion using any relevant keywords, phrases, or patterns present in the context.
\n\n
question:\nWhen was Einstein born?
context:\nAlbert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.
start_index:0
end_index:155
question:\n{question}
context:\n{context}
start_index:
"""

In [99]:
context = text.split('\n\n')[22]

In [164]:
prompt_input = question_formulate.format(answer_format="text",difficulty="hard",context=context)

In [165]:
output = llm2(prompt_input,temperature=1)

In [166]:
question = output['choices'][0]['message']['content']

In [167]:
question

'What are the primary technical issues in training large language models (LLMs) and what are some approaches to address these challenges?\n'

In [168]:
prompt_input = answer_formulate.format(question=question,context=context)

In [169]:
print(prompt_input)


Locate the relevant information in the context and provide the start and end indices of the text that can be used to answer the question. Keep in mind that the relevant information might be surrounded by other unrelated text. 
You can identify the relevant portion using any relevant keywords, phrases, or patterns present in the context.



question:
When was Einstein born?
context:
Albert Einstein (/ˈaɪnstaɪn/ EYEN-styne;[4] German: [ˈalbɛʁt ˈʔaɪnʃtaɪn] (listen); 14 March 1879 – 18 April 1955) was a German-born theoretical physicist,[5] widely held to be one of the greatest and most influential scientists of all time. Best known for developing the theory of relativity, he also made important contributions to quantum mechanics, and was thus a central figure in the revolutionary reshaping of the scientific understanding of nature that modern physics accomplished in the first decades of the twentieth century.
start_index:0
end_index:155
question:
What are the primary technical issues in 

In [172]:
output = llm2(prompt_input,temperature=0)

In [173]:
output


<OpenAIObject chat.completion id=chatcmpl-7sxU20hRexzzluaW5upUdkLDIzTfc at 0x7fc8104f38b0> JSON: {
  "choices": [
    {
      "finish_reason": "stop",
      "index": 0,
      "message": {
        "content": "start_index: 222\nend_index: 267",
        "role": "assistant"
      }
    }
  ],
  "created": 1693332722,
  "id": "chatcmpl-7sxU20hRexzzluaW5upUdkLDIzTfc",
  "model": "gpt-3.5-turbo-0613",
  "object": "chat.completion",
  "usage": {
    "completion_tokens": 11,
    "prompt_tokens": 1872,
    "total_tokens": 1883
  }
}
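
The reply comes back as plain text in the message content, so the indices have to be parsed before slicing. A minimal sketch, assuming the reply keeps the `start_index:` / `end_index:` labels seen above; the helper name `parse_span` is hypothetical.

```python
import re


def parse_span(reply, context_length):
    """Extract (start, end) from a reply containing start_index and end_index labels.

    Returns None if either label is missing or the span is out of bounds.
    """
    start_match = re.search(r"start_index:\s*(\d+)", reply)
    end_match = re.search(r"end_index:\s*(\d+)", reply)
    if not start_match or not end_match:
        return None
    start, end = int(start_match.group(1)), int(end_match.group(1))
    # Reject empty, reversed, or out-of-range spans.
    if not 0 <= start < end <= context_length:
        return None
    return start, end


reply = "start_index: 222\nend_index: 267"
span = parse_span(reply, context_length=1000)
print(span)  # (222, 267)
```

A bounds check like this only catches malformed spans; it cannot tell whether the span actually answers the question.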

In [174]:
context[222:267]

'y to 10% Adam FP16 0.1 1.0 -\nPanGu- α(200B) -'
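
The slice above is a fragment of a hyperparameter table rather than text that answers the question, which suggests the character offsets returned by the model are not reliable here. One alternative, not used in this notebook, is to ask the model to quote the supporting text verbatim and compute the offsets locally. The sketch below assumes such a quote is already available; `locate_quote` and the sample strings are hypothetical.

```python
def locate_quote(context, quote):
    """Return (start, end) of `quote` within `context`, or None if it is not a verbatim substring."""
    quote = quote.strip()
    if not quote:
        return None
    start = context.find(quote)
    if start == -1:
        return None
    return start, start + len(quote)


sample_context = (
    "Albert Einstein (14 March 1879 - 18 April 1955) was a "
    "German-born theoretical physicist."
)
quoted = "14 March 1879"  # stands in for a hypothetical model reply
span = locate_quote(sample_context, quoted)
print(span, repr(sample_context[span[0]:span[1]]))
```

A `None` result signals that the model paraphrased instead of quoting, so the sample can be retried or dropped.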

In [136]:
context[334:]

'amW FP16 0.1 - 0.1\nPaLM (540B) 1M →4M 1×10−2no inverse square root Adafactor BF16 lr21.0 0.1\nBLOOM (176B) 4M 6×10−5yes cosine decay to 10% Adam BF16 0.1 1.0 0.0\nMT-NLG (530B) 64 K →3.75M 5×10−5yes cosine decay to 10% Adam BF16 0.1 1.0 -\nGopher (280B) 3M →6M 4×10−5yes cosine decay to 10% Adam BF16 - 1.0 -\nChinchilla (70B) 1.5M →3M 1×10−4yes cosine decay to 10% AdamW BF16 - - -\nGalactica (120B) 2M 7×10−6yes linear decay to 10% AdamW - 0.1 1.0 0.1\nLaMDA (137B) 256K - - - - BF16 - - -\nJurassic-1 (178B) 32 K →3.2M 6×10−5yes - - - - - -\nLLaMA (65B) 4M 1.5×10−4yes cosine decay to 10% AdamW - 0.1 1.0 -\nGLM (130B) 0.4M →8.25M 8×10−5yes cosine decay to 10% AdamW FP16 0.1 1.0 0.1\nT5 (11B) 64K 1×10−2no inverse square root AdaFactor - - - 0.1\nERNIE 3.0 Titan (260B) - 1×10−4- - Adam FP16 0.1 1.0 -\nPanGu- Σ(1.085T) 0.5M 2×10−5yes - Adam FP16 - - -\n4.3.2 Scalable Training Techniques\nAs the model and data sizes increase, it has become chal-\nlenging to efficiently train LLMs under a lim