- Here, we look at building the "LLM Synthesis" component of a RAG pipeline
    - Given a set of retrieved nodes, we will synthesize a response even if the retrieved context overflows the context window.

- Strategies
    - Create & refine
    - Tree summarization

- Load Data

In [2]:
!mkdir data
!wget --user-agent "Mozilla" "https://arxiv.org/pdf/2307.09288.pdf" -O "data/llama2.pdf"

mkdir: cannot create directory ‘data’: File exists


--2023-09-20 16:15:25--  https://arxiv.org/pdf/2307.09288.pdf
Resolving arxiv.org (arxiv.org)... 128.84.21.199
Connecting to arxiv.org (arxiv.org)|128.84.21.199|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 13661300 (13M) [application/pdf]
Saving to: ‘data/llama2.pdf’


2023-09-20 16:15:33 (1.82 MB/s) - ‘data/llama2.pdf’ saved [13661300/13661300]



In [5]:
from pathlib import Path
from llama_hub.file.pymu_pdf.base import PyMuPDFReader

loader = PyMuPDFReader()
documents = loader.load(file_path="./data/llama2.pdf")

- Use high-level abstractions to
  - ingest data into pinecone
  - get a vector retriever

In [7]:
from llama_index.vector_stores import PineconeVectorStore
from llama_index import VectorStoreIndex, ServiceContext
from llama_index.storage import StorageContext

import pinecone
import os

pinecone.init(api_key=os.environ["PINECONE_API_KEY"], environment="gcp-starter")

In [8]:
# Dimensions for text-embedding-ada-002
pinecone.create_index("quickstart", dimension=1536, metric="euclidean")

pinecone_index = pinecone.Index("quickstart")

In [9]:
# Create vector store and index
vector_store = PineconeVectorStore(pinecone_index)

# NOTE: set chunk size of 1024
service_context = ServiceContext.from_defaults(chunk_size=1024)
storage_context = StorageContext.from_defaults(vector_store=vector_store)
index = VectorStoreIndex.from_documents(
    documents=documents,
    service_context=service_context,
    storage_context=storage_context
)

None of PyTorch, TensorFlow >= 2.0, or Flax have been found. Models won't be available and only tokenizers, configuration and file/data utilities can be used.


Upserted vectors:   0%|          | 0/112 [00:00<?, ?it/s]

In [10]:
# Create retriever
retriever = index.as_retriever()

- Given an example question, get a set of retrieved nodes

In [12]:
query_str = "Can you tell me about results from RLHF using both model-based and human-based evaluation?"

retrieved_nodes = retriever.retrieve(query_str)

- Building Response Synthesis with LLMs
  1. Try a simple prompt

In [18]:
# Try synthesizing a response using a single input prompt + LLM call
from llama_index.llms import OpenAI
from llama_index.prompts import PromptTemplate

llm = OpenAI(model="text-davinci-003")

qa_prompt = PromptTemplate("""
Context information is below.
-------------------
{context_str}
-------------------
Given the context information and not prior knowledge, answer the query.
Query: {query_str}
Answer: 
""")

In [28]:
def generate_response(retrieved_nodes, query_str, qa_prompt, llm):
    """Generate response from retrieved nodes and query string."""

    context_str = "\n\n".join([n.get_content() for n in retrieved_nodes])

    fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
    response = llm.complete(fmt_qa_prompt)

    return str(response), fmt_qa_prompt

In [22]:
# Create a query
query_str = "Can you tell me about results from RLHF using both model-based and human-based evaluation?"

# retrieve nodes
retrieved_nodes = retriever.retrieve(query_str)

# generate response
response, fmt_qa_prompt = generate_response(retrieved_nodes, query_str, qa_prompt, llm)
print(f"*****Response******:\n{response}\n\n")

*****Response******:
RLHF results were evaluated using both model-based and human-based evaluation. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to validate major model versions and measure the robustness of the reward model. Additionally, a general reward was trained to ensure the measure would not diverge from human preferences. Results showed that the reward models were well calibrated with human preference annotations. Furthermore, the model-based evaluation revealed that the temperature of the model was influenced by RLHF and could be dynamically re-scaled contingent upon the context.




In [23]:
print(f"*****Formatted Prompt*****:\n{fmt_qa_prompt}\n\n")

*****Formatted Prompt*****:

Context information is below.
-------------------
3.4
RLHF Results
3.4.1
Model-Based Evaluation
Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can
be complicated by various HCI considerations (Clark et al., 2021; Gehrmann et al., 2023), and is not always
scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1
to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and
increase iteration speed. We later validated major model versions with human evaluations.
How Far Can Model-Based Evaluation Go?
To measure the robustness of our reward model, we collected
a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the
answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall
are well calibrated with our human preference a

- **Problem** - what if we set the retriever's top-k to a higher value? The combined context would overflow the model's context window.

In [24]:
retriever = index.as_retriever(similarity_top_k=6)
retrieved_nodes = retriever.retrieve(query_str)

response, fmt_qa_prompt = generate_response(retrieved_nodes, query_str, qa_prompt, llm)
print(f"Response (k=6): {response}")

ValueError: The prompt is too long for the model. Please use a prompt that is less than 4097 tokens.
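
One way to anticipate this failure is to estimate the prompt size before calling the LLM. The sketch below is only a rough check: it assumes about 4 characters per token, a common rule of thumb for English text rather than the model's real tokenizer. The helper names `estimate_tokens` and `fits_context`, and the `reserved_for_answer` budget, are hypothetical and not part of LlamaIndex.

```python
def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Crude token estimate; a real tokenizer will give different counts."""
    return int(len(text) / chars_per_token) + 1


def fits_context(texts, max_tokens: int = 4097, reserved_for_answer: int = 256) -> bool:
    """Return True if the joined texts are likely to fit in the context window.

    max_tokens defaults to the limit quoted in the error message above;
    reserved_for_answer is an assumed budget left for the completion.
    """
    joined = "\n\n".join(texts)
    return estimate_tokens(joined) <= max_tokens - reserved_for_answer
```

Under this heuristic, six chunks of about 3,300 characters each come to roughly 5,000 tokens, which is over the limit, so `fits_context` would return `False` for them.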

  2. "Create and Refine" Strategy
     - To deal with context overflows, we can synthesize a response sequentially through
       all nodes.
     - Start with the first node and generate an initial response
     - For each subsequent node, refine the answer using the additional context

In [25]:
refine_prompt = PromptTemplate("""\
The original query is as follows: {query_str}
We have provided an existing answer: {existing_answer}
We have the opportunity to refine the existing answer \
(only if needed) with some more context below.
-------------------
{context_str}
-------------------
Given the new context, refine the original answer to better answer the query. \
If the context isn't useful, return the original answer.
Refined answer: \
""")

In [32]:
from llama_index.response.notebook_utils import display_source_node

def generate_response_cr(retrieved_nodes, query_str, qa_prompt, refine_prompt, llm):
    """Generate a response using create and refine strategy.
    
    The first node uses the 'QA' prompt.
    All subsequent nodes use the 'refine' prompt.
    """

    cur_response = None
    fmt_prompts = []

    for idx, node in enumerate(retrieved_nodes):
        print(f"[Node {idx}]")
        display_source_node(node, source_length=2000)

        context_str = node.get_content()

        if idx == 0:
            fmt_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
        else:
            fmt_prompt = refine_prompt.format(
                context_str=context_str,
                query_str=query_str,
                existing_answer=cur_response
            )
        
        cur_response = llm.complete(fmt_prompt)
        fmt_prompts.append(fmt_prompt)

    return str(cur_response), fmt_prompts

In [33]:
response, fmt_prompts = generate_response_cr(
    retrieved_nodes, query_str, qa_prompt, refine_prompt, llm
)
print(str(response))

[Node 0]


**Node ID:** b84c8932-df30-4e46-87f8-e7c59fc5fa13<br>**Similarity:** 0.210552812<br>**Text:** 3.4
RLHF Results
3.4.1
Model-Based Evaluation
Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can
be complicated by various HCI considerations (Clark et al., 2021; Gehrmann et al., 2023), and is not always
scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1
to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and
increase iteration speed. We later validated major model versions with human evaluations.
How Far Can Model-Based Evaluation Go?
To measure the robustness of our reward model, we collected
a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the
answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall
are well calibrated with our human preference annotations, as illustrated in Figure 29 in the appendix. This
confirms the relevance of using our reward as a point-wise metric, despite being trained with a Pairwise
Ranking Loss.
Still, as Goodhart’s Law states, when a measure becomes a target, it ceases to be a good measure. To ensure
our measure won’t diverge from the human preferences, we additionally used a more general reward, trained
17<br>

[Node 1]


**Node ID:** 736c9643-6418-41e4-b40f-8d6d54803165<br>**Similarity:** 0.284444094<br>**Text:** 5
Discussion
Here, we discuss the interesting properties we have observed with RLHF (Section 5.1). We then discuss the
limitations of Llama 2-Chat (Section 5.2). Lastly, we present our strategy for responsibly releasing these
models (Section 5.3).
5.1
Learnings and Observations
Our tuning process revealed several interesting results, such as Llama 2-Chat’s abilities to temporally
organize its knowledge, or to call APIs for external tools.
[Figure 20 plot residue: Reward Model Score distributions (0.0 to 1.0) for SFT (Mix), SFT (Annotation), RLHF (V1), RLHF (V2)]
Figure 20: Distribution shift for progressive versions of Llama 2-Chat, from SFT models towards RLHF.
Beyond Human Supervision.
At the outset of the project, many among us expressed a preference for
supervised annotation, attracted by its denser signal. Meanwhile reinforcement learning, known for its insta-
bility, seemed a somewhat shadowy field for those in the NLP research community. However, reinforcement
learning proved highly effective, particularly given its cost and time effectiveness. Our findings underscore
that the crucial determinant of RLHF’s success lies in the synergy it fosters between humans and LLMs
throughout the annotation process.
Even with proficient annotators, each individual writes with significant variation. A model fine-tuned on
SFT annotation learns this diversity, including, unfortunately, the tail-end of poorly executed annotation. Fur-
thermore, the model’s performance is capped by the writing abilities of the most skilled annotators. Human
annotators are arguably less subject to discrepancy when comparing two outputs’ preference annotation
for RLHF. Consequently, the reward mechanism swiftly learns to assign low scores to undesirable tail-end
distribution and aligns towards the human preference. This phenomena is illustrated in Figure 20, where we
can see that the worst answers are progressively removed, shifting the distribution to the right.
In addition, during annotation, the model has the potential to ven...<br>

[Node 2]


**Node ID:** 354e521d-a768-47f4-8d30-d0fc55b4a039<br>**Similarity:** 0.313639402<br>**Text:** [Figure 11 plot residue: two panels with Helpfulness and Harmlessness axes (win-rate %), showing SFT-v1, SFT-v2, RLHF-v1 through RLHF-v4, RLHF-v5 (no PPO), and RLHF-v5 (with PPO); panel judges: Meta Reward Models and GPT-4]
Figure 11: Evolution of Llama 2-Chat. We show the evolution after multiple iterations fine-tuning for the
win-rate % of Llama 2-Chat compared to ChatGPT. Left: the judge is our reward model, which may favor
our model, and right, the judge is GPT-4, which should be more neutral.
on diverse open-source Reward Modeling datasets. We have not yet observed any such divergence, and
hypothesize that iterative model updates may be helping to prevent this.
As a last verification step to ensure no regression between our new model and the previous one, we use both
to sample during the next annotation iteration. This enables a model comparison “for free” on new prompts
and can help to increase diversity when sampling.
Progression of Models.
Figure 11 reports the progress of our different SFT and then RLHF versions for
both Safety and Helpfulness axes, measured by our in-house Safety and Helpfulness reward models. On
this set of evaluations, we outperform ChatGPT on both axes after RLHF-V3 (harmlessness and helpfulness
>50%). Despite the aforementioned relevance of using our reward as a point-wise metric, it can arguably be
biased in favor of Llama 2-Chat. Therefore, for a fair comparison, we additionally compute the final results
using GPT-4 to assess which generation is preferred. The order in which ChatGPT and Llama 2-Chat outputs
appeared in GPT-4 prompt are randomly swapped to avoid any bias. As expected, the win-rate in favor of
Llama 2-Chat is less pronounced, although obtaining more than a 60% win-rate for our latest Llama 2-Chat.
The prompt...<br>

[Node 3]


**Node ID:** 50c58113-4184-457b-8831-7cab0cce8e6a<br>**Similarity:** 0.332381606<br>**Text:** sampled human preferences, whereby human annotators select which of two model outputs they prefer.
This human feedback is subsequently used to train a reward model, which learns patterns in the preferences
of the human annotators and can then automate preference decisions.
3.2.1
Human Preference Data Collection
Next, we collect human preference data for reward modeling. We chose a binary comparison protocol over
other schemes, mainly because it enables us to maximize the diversity of collected prompts. Still, other
strategies are worth considering, which we leave for future work.
Our annotation procedure proceeds as follows. We ask annotators to first write a prompt, then choose
between two sampled model responses, based on provided criteria. In order to maximize the diversity, the
two responses to a given prompt are sampled from two different model variants, and varying the temperature
hyper-parameter. In addition to giving participants a forced choice, we also ask annotators to label the degree
to which they prefer their chosen response over the alternative: either their choice is significantly better, better,
slightly better, or negligibly better/ unsure.
For our collection of preference annotations, we focus on helpfulness and safety. Helpfulness refers to how
well Llama 2-Chat responses fulfill users’ requests and provide requested information; safety refers to
whether Llama 2-Chat’s responses are unsafe, e.g., “giving detailed instructions on making a bomb” could
be considered helpful but is unsafe according to our safety guidelines. Separating the two allows us to
apply specific guidelines to each and better guide annotators; for example, our safety annotations provide
instructions to focus on adversarial prompts, among other guidance.
Apart from differences in annotation guidelines, we additionally collect a safety label during the safety stage.
This additional information bins model responses into one of three categories: 1) the preferred response
is saf...<br>

[Node 4]


**Node ID:** b24145d3-5c55-4bf3-a91f-3654468b84dd<br>**Similarity:** 0.336259484<br>**Text:** [Figure 6 plot residue: two panels, "Accuracy On All Examples" and "Accuracy On Examples With Label "Significantly Better"", each plotted against Meta Helpfulness Data Batch Stage (1 to 14); series: 7b, 13b, 70b, GPT4, OpenAssistant]
Figure 6: Scaling trends for the reward model. More data and a larger-size model generally improve
accuracy, and it appears that our models have not yet saturated from learning on the training data.
The fact that helpfulness and safety performed the best on their own domain is potentially due to the tension
between the two objectives (i.e., being as helpful as possible versus refusing unsafe prompts when necessary),
which may confuse the reward model during training. In order for a single model to perform well on both
dimensions, it needs to not only learn to select the better response given a prompt but also to distinguish
adversarial prompts from safe ones. As a result, optimizing two separate models eases the reward modeling
task. More detailed analysis on this tension between safety and helpfulness can be found in Appendix A.4.1.
When we group the scores by preference rating in Table 8, we can see that the accuracy is superior for the
“significantly better” test set and degrades gradually as comparison pairs become more similar (e.g., “slightly
better”). It is expected that learning to model human preferences becomes challenging when deciding
between two similar model responses, due to annotator subjectivity and their reliance on nuanced details
that may differentiate responses. We emphasize that the accuracy on more distinct responses matters the
most to improve Llama 2-Chat performance. The human preference annotation agreement rate is also higher
on more distinct responses than similar pairs.
Scaling Trends.
We study the scaling trends in terms of data and model size for the reward model, fine-
tuning different model s...<br>

[Node 5]


**Node ID:** 981d713a-86c9-4f5d-97d3-452fcd2e736f<br>**Similarity:** 0.338366389<br>**Text:** Figure 1: Helpfulness human evaluation results for Llama
2-Chat compared to other open-source and closed-source
models. Human raters compared model generations on ~4k
prompts consisting of both single and multi-turn prompts.
The 95% confidence intervals for this evaluation are between
1% and 2%. More details in Section 3.4.2. While reviewing
these results, it is important to note that human evaluations
can be noisy due to limitations of the prompt set, subjectivity
of the review guidelines, subjectivity of individual raters,
and the inherent difficulty of comparing generations.
Figure 2: Win-rate % for helpfulness and
safety between commercial-licensed base-
lines and Llama 2-Chat, according to GPT-
4. To complement the human evaluation, we
used a more capable model, not subject to
our own guidance. Green area indicates our
model is better according to GPT-4. To remove
ties, we used win/(win + loss). The orders in
which the model responses are presented to
GPT-4 are randomly swapped to alleviate bias.
1
Introduction
Large Language Models (LLMs) have shown great promise as highly capable AI assistants that excel in
complex reasoning tasks requiring expert knowledge across a wide range of fields, including in specialized
domains such as programming and creative writing. They enable interaction with humans through intuitive
chat interfaces, which has led to rapid and widespread adoption among the general public.
The capabilities of LLMs are remarkable considering the seemingly straightforward nature of the training
methodology. Auto-regressive transformers are pretrained on an extensive corpus of self-supervised data,
followed by alignment with human preferences via techniques such as Reinforcement Learning with Human
Feedback (RLHF). Although the training methodology is simple, high computational requirements have
limited the development of LLMs to a few players. There have been public releases of pretrained LLMs
(such as BLOOM (Scao et al., 2022), LLaMa-1 (Touvron...<br>


RLHF results were evaluated using both model-based and human-based evaluation. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, with three annotators judging the quality of the answers based on a 7-point Likert scale. Additionally, a more general reward was used to ensure the measure wouldn't diverge from human preferences. We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines, which we refer to as Meta reward modeling data. The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment and

In [35]:
response

"\nRLHF results were evaluated using both model-based and human-based evaluation. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, with three annotators judging the quality of the answers based on a 7-point Likert scale. Additionally, a more general reward was used to ensure the measure wouldn't diverge from human preferences. We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines, which we refer to as Meta reward modeling data. The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness and safety) of the model generation. Leveraging such response scores as rewards, we can optimize Llama 2-Chat during RLHF for better human preference alignment a

In [36]:
# view a sample qa prompt
print(fmt_prompts[0])


Context information is below.
-------------------
3.4
RLHF Results
3.4.1
Model-Based Evaluation
Evaluating LLMs is a challenging open-research problem. Human evaluation, while a gold standard, can
be complicated by various HCI considerations (Clark et al., 2021; Gehrmann et al., 2023), and is not always
scalable. Thus, to select the best-performing models among several ablations at each iteration from RLHF-V1
to V5, we first observed the improvement of the rewards from the latest reward models, to save costs and
increase iteration speed. We later validated major model versions with human evaluations.
How Far Can Model-Based Evaluation Go?
To measure the robustness of our reward model, we collected
a test set of prompts for both helpfulness and safety, and asked three annotators to judge the quality of the
answers based on a 7-point Likert scale (the higher the better). We observe that our reward models overall
are well calibrated with our human preference annotations, as illustrated i

In [39]:
# view a sample refine prompt
print(fmt_prompts[-1])

The original query is as follows: Can you tell me about results from RLHF using both model-based and human-based evaluation?
We have provided an existing answer: 
RLHF results were evaluated using both model-based and human-based evaluation. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, with three annotators judging the quality of the answers based on a 7-point Likert scale. Additionally, a more general reward was used to ensure the measure wouldn't diverge from human preferences. We collected a large dataset of over 1 million binary comparisons based on humans applying our specified guidelines, which we refer to as Meta reward modeling data. The reward model takes a model response and its corresponding prompt (including contexts from previous turns) as inputs and outputs a scalar score to indicate the quality (e.g., helpfulness a

Observation: This is a working first step, but it has two clear inefficiencies. First, it is slow, because the LLM calls are made sequentially. Second, each LLM call is underused: we insert only a single node instead of “stuffing” the prompt with as much context as will fit.
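
To illustrate what "stuffing" could look like, here is a minimal sketch that greedily packs consecutive node texts into larger chunks before they are sent to the LLM, so the refine loop makes fewer calls. The helper `pack_texts` and the `max_chars` budget are hypothetical; a character budget is only a stand-in for a real token count.

```python
def pack_texts(texts, max_chars: int = 12000, separator: str = "\n\n"):
    """Greedily merge consecutive texts into chunks of at most max_chars.

    A single text longer than max_chars is kept as its own chunk (not split).
    """
    chunks = []
    current = ""
    for text in texts:
        candidate = text if not current else current + separator + text
        if len(candidate) <= max_chars or not current:
            # Still within budget (or nothing packed yet): keep accumulating
            current = candidate
        else:
            # Budget exceeded: close the current chunk and start a new one
            chunks.append(current)
            current = text
    if current:
        chunks.append(current)
    return chunks
```

The packed chunks could then be passed through the same QA/refine loop in place of the individual node texts, one chunk per LLM call.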

3. Hierarchical Summarization Strategy

- Generate an answer for each node independently, then hierarchically combine the answers.
- This "combine" step can happen once or, for maximum generality, recursively until there is one 'root' node
  - That root node is then returned as the final answer
- We will implement this approach below with a fixed number of children of 10, i.e. we combine 10 nodes at a time

NOTE: In LlamaIndex this is `tree_summarize` and in LangChain this is `map-reduce`
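
To get a feel for the cost of this strategy, the sketch below counts LLM calls for a given number of nodes. It assumes one call per node, then one call per group of `num_children` answers at each combine level, with at least one combine level always run. The function name `count_llm_calls` is hypothetical.

```python
import math


def count_llm_calls(num_nodes: int, num_children: int = 10) -> int:
    """Count LLM calls for per-node answers plus recursive combine levels."""
    if num_nodes < 1:
        raise ValueError("num_nodes must be at least 1")
    if num_children < 2:
        raise ValueError("num_children must be at least 2")

    calls = num_nodes  # one answer per node
    level_size = num_nodes
    while True:
        # Each combine level makes one call per group of num_children texts
        level_size = math.ceil(level_size / num_children)
        calls += level_size
        if level_size == 1:
            return calls
```

With 6 nodes and `num_children=10`, this gives 6 + 1 = 7 calls; with 25 nodes it gives 25 + 3 + 1 = 29.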

In [43]:
import numpy as np

np.mean([len(n.get_content()) for n in retrieved_nodes]) * 10

33371.666666666664

In [47]:
def combine_results(
    texts,
    query_str,
    qa_prompt,
    llm,
    cur_prompt_list,
    num_children=10,
):
    new_texts = []

    for idx in range(0, len(texts), num_children):
        text_batch = texts[idx : idx + num_children]
        context_str = "\n\n".join([t for t in text_batch])

        fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
        combined_response = llm.complete(fmt_qa_prompt)

        new_texts.append(str(combined_response))
        cur_prompt_list.append(fmt_qa_prompt)

    if len(new_texts) == 1:
        return new_texts[0]
    else:
        # Recurse on the combined answers, passing the shared prompt list along
        return combine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )
    

def generate_response_hs(retrieved_nodes, query_str, qa_prompt, llm, num_children=10):
    """Generate a response using hierarchical summarization strategy.

    Combine num_children nodes hierarchically until we get one root node.
    """

    fmt_prompts = []
    node_responses = []

    for node in retrieved_nodes:
        context_str = node.get_content()

        fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
        node_response = llm.complete(fmt_qa_prompt)

        node_responses.append(node_response)
        fmt_prompts.append(fmt_qa_prompt)

    response_txt = combine_results(
        [str(r) for r in node_responses],
        query_str,
        qa_prompt,
        llm,
        fmt_prompts,
        num_children=num_children,
    )

    return response_txt, fmt_prompts

response, fmt_prompts = generate_response_hs(retrieved_nodes, query_str, qa_prompt, llm)
print(str(response))

The results from RLHF using both model-based and human-based evaluation have yielded positive results. Model-based evaluation has shown that Llama 2-Chat has improved in terms of helpfulness and safety, while human-based evaluation has shown that the reward model has been able to accurately learn patterns in the preferences of the human annotators and automate preference decisions. Additionally, the collected preference data has been compared to existing open-source datasets and has been found to feature more conversation turns and be longer, on average. Human evaluation results for Llama 2-Chat compared to other open-source and closed-source models were presented in Figure 1, with 95% confidence intervals between 1% and 2%. To complement the human evaluation, a more capable model, GPT-4, was used to measure win-rate % for helpfulness and safety between commercial-licensed baselines and Llama 2-Chat, as shown in Figure 2.


As in the previous section, there are inefficiencies: we still generate an answer for each node independently, a step we can try to optimize away.

Our ResponseSynthesizer module handles this!

4. [Optional] Create an async version of hierarchical summarization

- A pro of hierarchical summarization is that LLM calls can be parallelized, speeding up response synthesis
- Here, we use `asyncio.gather` to execute coroutines (LLM calls) for each Node concurrently.
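
As a standalone illustration of the pattern, the sketch below runs several simulated LLM calls concurrently with `asyncio.gather`. The coroutine `fake_acomplete` is a hypothetical stand-in for a real async LLM call; it only sleeps and echoes the prompt.

```python
import asyncio


async def fake_acomplete(prompt: str, delay: float = 0.05) -> str:
    """Hypothetical stand-in for an async LLM call: waits, then echoes."""
    await asyncio.sleep(delay)
    return f"answer to: {prompt}"


async def answer_all(prompts):
    """Start one coroutine per prompt and wait for all of them together."""
    tasks = [fake_acomplete(p) for p in prompts]
    # gather returns results in the same order as the input coroutines
    return await asyncio.gather(*tasks)
```

In a plain script this can be driven with `asyncio.run(answer_all(["a", "b", "c"]))`. Because the waits overlap, the total time should be close to one delay rather than three.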

In [48]:
import nest_asyncio
import asyncio

nest_asyncio.apply()

In [49]:
async def acombine_results(
    texts,
    query_str,
    qa_prompt,
    llm,
    cur_prompt_list,
    num_children=10,
):
    fmt_prompts = []

    for idx in range(0, len(texts), num_children):
        text_batch = texts[idx : idx + num_children]

        content_str = "\n\n".join([t for t in text_batch])
        fmt_qa_prompt = qa_prompt.format(context_str=content_str, query_str=query_str)

        fmt_prompts.append(fmt_qa_prompt)
        cur_prompt_list.append(fmt_qa_prompt)

    # generate completions asynchronously
    tasks = [llm.acomplete(p) for p in fmt_prompts]
    combined_responses = await asyncio.gather(*tasks)  # combine tasks and run them
    new_texts = [str(r) for r in combined_responses]

    if len(new_texts) == 1:
        return new_texts[0]
    else:
        # Recurse on the combined answers, passing the shared prompt list along
        return await acombine_results(
            new_texts,
            query_str,
            qa_prompt,
            llm,
            cur_prompt_list,
            num_children=num_children,
        )
    
async def agenerate_response_hs(
    retrieved_nodes,
    query_str,
    qa_prompt,
    llm,
    num_children=10
):
    """Generate a response using hierarchical summarization strategy.

    Combine num_children nodes hierarchically until we get one root node.
    """

    fmt_prompts = []
    node_responses = []
    for node in retrieved_nodes:
        context_str = node.get_content()

        fmt_qa_prompt = qa_prompt.format(context_str=context_str, query_str=query_str)
        fmt_prompts.append(fmt_qa_prompt)
    
    tasks = [llm.acomplete(p) for p in fmt_prompts]
    node_responses = await asyncio.gather(*tasks)  # combine tasks and run them

    # Use the async combine step so the upper tree levels also run concurrently
    response_txt = await acombine_results(
        [str(r) for r in node_responses],
        query_str,
        qa_prompt,
        llm,
        fmt_prompts,
        num_children=num_children,
    )

    return response_txt, fmt_prompts

In [50]:
response, fmt_prompts = await agenerate_response_hs(
    retrieved_nodes, query_str, qa_prompt, llm
)
print(str(response))

The results from RLHF using both model-based and human-based evaluation have been positive. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, with three annotators judging the quality of the answers based on a 7-point Likert scale. The results showed that Llama 2-Chat outperformed the other models by a significant margin on both single turn and multi-turn prompts. Additionally, the scaling performance had not yet plateaued given the existing volume of data annotation used for training, indicating that there is room for more improvement with more annotations. Proximal Policy Optimization (PPO) and Rejection Sampling fine-tuning were used to explore RLHF fine-tuning.


- Putting it all together

In [51]:
from llama_index.retrievers import BaseRetriever
from llama_index.llms.base import LLM
from dataclasses import dataclass
from typing import Optional, List


@dataclass
class Response:
    response: str
    source_nodes: Optional[List] = None

    def __str__(self):
        return self.response


class MyQueryEngine:
    """My query engine.

    Uses the tree summarize response synthesis module by default.

    """

    def __init__(
        self,
        retriever: BaseRetriever,
        qa_prompt: PromptTemplate,
        llm: LLM,
        num_children=10,
    ) -> None:
        self._retriever = retriever
        self._qa_prompt = qa_prompt
        self._llm = llm
        self._num_children = num_children

    def query(self, query_str: str):
        retrieved_nodes = self._retriever.retrieve(query_str)
        response_txt, _ = generate_response_hs(
            retrieved_nodes,
            query_str,
            self._qa_prompt,
            self._llm,
            num_children=self._num_children,
        )
        response = Response(response_txt, source_nodes=retrieved_nodes)
        return response

    async def aquery(self, query_str: str):
        retrieved_nodes = await self._retriever.aretrieve(query_str)
        response_txt, _ = await agenerate_response_hs(
            retrieved_nodes,
            query_str,
            self._qa_prompt,
            self._llm,
            num_children=self._num_children,
        )
        response = Response(response_txt, source_nodes=retrieved_nodes)
        return response

query_engine = MyQueryEngine(retriever, qa_prompt, llm, num_children=10)

In [52]:
%%timeit
response = query_engine.query(query_str)
print(str(response))

The results from RLHF using both model-based and human-based evaluation have been positive. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, with three annotators judging the quality of the answers based on a 7-point Likert scale. The results showed that Llama 2-Chat outperformed the other models by a significant margin on both single turn and multi-turn prompts. Additionally, a more general reward was trained to ensure the measure wouldn't diverge from the human preferences. The human preference data collected has enabled us to train a reward model which can automate preference decisions. This reward model has been used to optimize Llama 2-Chat during RLHF, resulting in better human preference alignment and improved helpfulness and safety.
The results from RLHF using both model-based and human-based evaluation have yielded positive 

In [55]:
response = await query_engine.aquery(query_str)
print(str(response))

The results from RLHF using both model-based and human-based evaluation have been positive. Model-based evaluation was used to select the best-performing models among several ablations at each iteration from RLHF-V1 to V5. Human evaluation was used to measure the robustness of the reward model, and three annotators were asked to judge the quality of the answers based on a 7-point Likert scale. The results showed that Llama 2-Chat outperformed the other models by a significant margin on both single turn and multi-turn prompts. Additionally, the human preference data collected through the binary comparison protocol has enabled us to maximize the diversity of collected prompts and train a reward model that can learn patterns in the preferences of the human annotators. This reward model has been used to optimize Llama 2-Chat during RLHF, resulting in better human preference alignment and improved helpfulness and safety.
