In [1]:
%pip install llama-index llama-index-embeddings-huggingface llama-index-llms-groq

Collecting llama-index
  Downloading llama_index-0.12.47-py3-none-any.whl.metadata (12 kB)
Collecting llama-index-embeddings-huggingface
  Downloading llama_index_embeddings_huggingface-0.5.5-py3-none-any.whl.metadata (458 bytes)
Collecting llama-index-llms-groq
  Downloading llama_index_llms_groq-0.3.2-py3-none-any.whl.metadata (2.0 kB)
Collecting llama-index-agent-openai<0.5,>=0.4.0 (from llama-index)
  Downloading llama_index_agent_openai-0.4.12-py3-none-any.whl.metadata (439 bytes)
Collecting llama-index-cli<0.5,>=0.4.2 (from llama-index)
  Downloading llama_index_cli-0.4.4-py3-none-any.whl.metadata (1.4 kB)
Collecting llama-index-core<0.13,>=0.12.47 (from llama-index)
  Downloading llama_index_core-0.12.47-py3-none-any.whl.metadata (2.5 kB)
Collecting llama-index-embeddings-openai<0.4,>=0.3.0 (from llama-index)
  Downloading llama_index_embeddings_openai-0.3.1-py3-none-any.whl.metadata (684 bytes)
Collecting llama-index-indices-managed-llama-cloud>=0.4.0 (from llama-index)
  Downl

In [1]:
pip install llama-index-readers-file pymupdf

Collecting pymupdf
  Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl.metadata (3.4 kB)
Downloading pymupdf-1.26.3-cp39-abi3-manylinux_2_28_x86_64.whl (24.1 MB)
Installing collected packages: pymupdf
Successfully installed pymupdf-1.26.3


In [15]:
pip install sentence-transformers



In [17]:
import os
from google.colab import userdata

GROQ_API_KEY = userdata.get('GROQ_API_KEY')
os.environ['GROQ_API_KEY'] = GROQ_API_KEY
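
The cell above only runs inside Colab, because `google.colab.userdata` is not importable elsewhere. A minimal sketch of a fallback, assuming the key may instead be exported as an environment variable; `load_groq_key` is a hypothetical helper, not part of any library:

```python
import os


def load_groq_key(environ=os.environ):
    """Return the Groq API key from Colab secrets, else from the environment."""
    try:
        # Only importable inside Google Colab.
        from google.colab import userdata
        key = userdata.get("GROQ_API_KEY")
    except Exception:
        # Outside Colab (ImportError), or the secret is missing or inaccessible.
        key = None
    key = key or environ.get("GROQ_API_KEY")
    if not key:
        raise RuntimeError("GROQ_API_KEY is not set in Colab secrets or the environment")
    return key
```

With this in place, `os.environ['GROQ_API_KEY'] = load_groq_key()` would behave the same in Colab and in a local Jupyter session.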

In [2]:
%load_ext autoreload
%autoreload 2

In [10]:
## Load the docs (PyMuPDFReader returns one Document per PDF page)
from llama_index.readers.file import PyMuPDFReader

loader = PyMuPDFReader()
docs = loader.load("/content/llama2.pdf")

In [11]:
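from llama_index.core import Document

# Stitch the per-page documents into a single Document so that
# chunks are not forced to end at page boundaries.
doc_text = "\n\n".join([doc.get_content() for doc in docs])
docs = [Document(text=doc_text)]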

In [13]:
### Split the docs into a hierarchy of nodes
from llama_index.core.node_parser import HierarchicalNodeParser

node_parser = HierarchicalNodeParser.from_defaults()

nodes = node_parser.get_nodes_from_documents(docs)

len(nodes)

1009

In [14]:
from llama_index.core.node_parser import get_leaf_nodes, get_root_nodes

leaf = get_leaf_nodes(nodes)
root = get_root_nodes(nodes)

print(len(leaf))
print(len(root))

783
47
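
The counts above imply 1009 - 783 - 47 = 179 intermediate nodes, so the hierarchy has three levels. To my knowledge `HierarchicalNodeParser.from_defaults()` splits at chunk sizes of 2048, 512, and 128 tokens by default. The toy sketch below mimics the idea on whitespace-separated words with much smaller, made-up sizes; `split_hierarchy` and the node dict layout are hypothetical, and the sketch ignores chunk overlap and sentence boundaries.

```python
from itertools import count


def split_hierarchy(words, chunk_sizes, parent_id=None, _ids=None):
    """Recursively split `words` into nested chunks, largest size first."""
    ids = _ids if _ids is not None else count()
    if not chunk_sizes:
        return []
    size, rest = chunk_sizes[0], chunk_sizes[1:]
    nodes = []
    for start in range(0, len(words), size):
        chunk = words[start:start + size]
        node = {
            "id": next(ids),
            "parent": parent_id,        # None marks a root node
            "text": " ".join(chunk),
            "is_leaf": not rest,        # the smallest chunk size has no children
        }
        nodes.append(node)
        # Children are produced by re-splitting this chunk at the next size.
        nodes.extend(split_hierarchy(chunk, rest, node["id"], ids))
    return nodes


words = [f"w{i}" for i in range(40)]
toy_nodes = split_hierarchy(words, chunk_sizes=[16, 8, 4])
toy_roots = [n for n in toy_nodes if n["parent"] is None]
toy_leaves = [n for n in toy_nodes if n["is_leaf"]]
print(len(toy_nodes), len(toy_roots), len(toy_leaves))  # 18 3 10
```

As in the notebook, the total (18) is the sum of roots (3), intermediate nodes (5), and leaves (10).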


In [33]:
## Store them in the docstore
from llama_index.core import StorageContext
from llama_index.core.storage.docstore import SimpleDocumentStore
from llama_index.llms.groq import Groq
from llama_index.embeddings.huggingface import HuggingFaceEmbedding

docstore = SimpleDocumentStore()

docstore.add_documents(nodes)

storage_context = StorageContext.from_defaults(docstore = docstore)

llm = Groq(model="gemma2-9b-it")

In [34]:
embedding = HuggingFaceEmbedding(model_name="sentence-transformers/all-MiniLM-L6-v2")

In [35]:
from llama_index.core import VectorStoreIndex

# Only the leaf nodes are embedded; parent nodes stay in the docstore for merging.
index = VectorStoreIndex(leaf, storage_context=storage_context, embed_model=embedding)

In [36]:
### create a retriever on top of index
from llama_index.core.retrievers import AutoMergingRetriever

base_retriever = index.as_retriever(similarity_top_k=6)

retriever = AutoMergingRetriever(
    base_retriever,
    storage_context,
    verbose=True,
)
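
`similarity_top_k=6` asks the base retriever for the six leaf nodes whose embeddings are closest to the query embedding. A minimal sketch of that ranking step, assuming cosine similarity and using tiny hand-written vectors in place of real embeddings (`cosine`, `top_k`, and the toy data are hypothetical):

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def top_k(query_vec, node_vecs, k):
    """Return the k (node_id, score) pairs with the highest similarity."""
    scored = [(node_id, cosine(query_vec, vec)) for node_id, vec in node_vecs.items()]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


toy_vecs = {"a": [1.0, 0.0], "b": [0.7, 0.7], "c": [0.0, 1.0], "d": [-1.0, 0.0]}
ranked = top_k([1.0, 0.1], toy_vecs, k=2)
print([node_id for node_id, _ in ranked])  # ['a', 'b']
```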

In [39]:
query = (
    "What could be the potential outcomes of adjusting the amount of safety"
    " data used in the RLHF stage?"
)
nodes = retriever.retrieve(query)
base_nodes = base_retriever.retrieve(query)

> Merging 4 nodes into parent node.
> Parent node id: 8d46ca45-ac34-40a9-a53c-0902a61e5022.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...



In [40]:
print(len(nodes))
print(len(base_nodes))

3
6
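
These counts line up with the merge log: four of the six base nodes were replaced by their shared parent, leaving 6 - 4 + 1 = 3. The sketch below illustrates the merging rule as I understand it: when the fraction of a parent's children that were retrieved exceeds a threshold, those children are swapped for the parent. It reuses the truncated node IDs and similarity scores printed in this notebook; the `auto_merge` helper, the 0.5 threshold, and the fifth un-retrieved child are assumptions made for illustration. Averaging the merged children's scores reproduces the parent's reported similarity of about 0.5902.

```python
def auto_merge(retrieved, parent_of, children_of, ratio_threshold=0.5):
    """retrieved: {leaf_id: score}. Returns {node_id: score} after one merge pass."""
    by_parent = {}
    for leaf_id in retrieved:
        parent = parent_of.get(leaf_id)
        if parent is not None:
            by_parent.setdefault(parent, []).append(leaf_id)

    merged = dict(retrieved)
    for parent, hits in by_parent.items():
        # Merge only if enough of the parent's children were retrieved.
        if len(hits) / len(children_of[parent]) > ratio_threshold:
            scores = [merged.pop(leaf_id) for leaf_id in hits]
            merged[parent] = sum(scores) / len(scores)
    return merged


base_scores = {
    "653323b6": 0.6836169592528931,
    "0faaf7d6": 0.6027956782479491,
    "264fbb03": 0.5780043649973514,
    "223b0ec4": 0.5741378889283748,
    "0a428b82": 0.5405301548440852,
    "d837599e": 0.5338518898806923,
}
merged_ids = ["653323b6", "0faaf7d6", "0a428b82", "d837599e"]
# Assumption: the parent has one more child that was not retrieved.
children_of = {"8d46ca45": merged_ids + ["unretrieved-leaf"]}
# The parents of the other two leaves are omitted here for brevity.
parent_of = {leaf_id: "8d46ca45" for leaf_id in merged_ids}

result = auto_merge(base_scores, parent_of, children_of)
print(len(result), round(result["8d46ca45"], 4))  # 3 0.5902
```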


In [41]:
from llama_index.core.response.notebook_utils import display_source_node

for node in nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 8d46ca45-ac34-40a9-a53c-0902a61e5022<br>**Similarity:** 0.5901986705564048<br>**Text:** We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.
Better Long-Tail Safety Robustness without Hurting Helpfulness
Safety is inherently a long-tail problem,
where the challenge comes from a small number of very specific cases. We investigate the impact of Safety
RLHF by taking two intermediate Llama 2-Chat checkpoints—one without adversarial prompts in the RLHF
stage and one with them—and score their responses on our test sets using our safety and helpfulness reward
models. In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that
of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that
the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with
RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.<br>

**Node ID:** 264fbb03-26fb-41cb-a56c-639fca5cdc7f<br>**Similarity:** 0.5780043649973514<br>**Text:** We also list two
qualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38.<br>

**Node ID:** 223b0ec4-7318-4415-87de-d919008ef069<br>**Similarity:** 0.5741378889283748<br>**Text:** [Plot residue from Figure 14: axis labels "Helpfulness RM Score before Safety RLHF" and "Helpfulness RM Score after Safety RLHF", score ticks from 0 to 1.0, count ticks 0 and 1000]
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.<br>

In [42]:
for node in base_nodes:
    display_source_node(node, source_length=10000)

**Node ID:** 653323b6-ce41-4866-a1d2-3268e5a71eef<br>**Similarity:** 0.6836169592528931<br>**Text:** A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.
A tension between helpfulness and safety of LLMs has been observed in
previous studies (Bai et al., 2022a). To better understand how the addition of safety training data affects
general model performance, especially helpfulness, we investigate the trends in safety data scaling by
adjusting the amount of safety data used in the RLHF stage.<br>

**Node ID:** 0faaf7d6-7b57-4421-b09e-580062e5b05a<br>**Similarity:** 0.6027956782479491<br>**Text:** A clear cluster appears on the top-left
corner suggesting the improvements of model safety. On the right side, we do not observe any gathering
pattern below the y = x line on the right hand side of Figure 14, which indicates that the helpfulness score
distribution is preserved after safety tuning with RLHF. Put another way, given sufficient helpfulness training
data, the addition of an additional stage of safety mitigation does not negatively impact model performance
on helpfulness to any notable degradation. A qualitative example is shown in Table 12.
Impact of Safety Data Scaling.<br>

**Node ID:** 264fbb03-26fb-41cb-a56c-639fca5cdc7f<br>**Similarity:** 0.5780043649973514<br>**Text:** We also list two
qualitative examples where safety and helpfulness reward models don’t agree with each other in Table 35.
A.4.2
Qualitative Results on Safety Data Scaling
In Section 4.2.3, we study the impact of adding more safety data into model RLHF in a quantitative manner.
Here we showcase a few samples to qualitatively examine the evolution of model behavior when we scale
safety data in Tables 36, 37, and 38.<br>

**Node ID:** 223b0ec4-7318-4415-87de-d919008ef069<br>**Similarity:** 0.5741378889283748<br>**Text:** [Plot residue from Figure 14: axis labels "Helpfulness RM Score before Safety RLHF" and "Helpfulness RM Score after Safety RLHF", score ticks from 0 to 1.0, count ticks 0 and 1000]
Figure 14: Impact of safety RLHF measured by reward model score distributions. Left: safety reward
model scores of generations on the Meta Safety test set. The clustering of samples in the top left corner
suggests the improvements of model safety.<br>

**Node ID:** 0a428b82-a30e-4bf5-bf81-e847d7f75982<br>**Similarity:** 0.5405301548440852<br>**Text:** We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: annotators
write a prompt that they believe can elicit unsafe behavior, and then compare multiple model responses to
the prompts, selecting the response that is safest according to a set of guidelines. We then use the human
preference data to train a safety reward model (see Section 3.2.2), and also reuse the adversarial prompts to
sample from the model during the RLHF stage.<br>

**Node ID:** d837599e-3faf-4ad2-9335-3851c582123c<br>**Similarity:** 0.5338518898806923<br>**Text:** In Figure 14, we plot the score distribution shift of the safety RM on the safety test set (left) and that
of the helpfulness RM on the helpfulness test set (right). In the left hand side of the figure, we observe that
the distribution of safety RM scores on the safety set shifts to higher reward scores after safety tuning with
RLHF, and that the long tail of the distribution near zero thins out. A clear cluster appears on the top-left
corner suggesting the improvements of model safety.<br>

In [51]:
### Define The Query Engine
from llama_index.core.query_engine import RetrieverQueryEngine

query_engine = RetrieverQueryEngine.from_args(retriever, llm=llm)
base_query_engine = RetrieverQueryEngine.from_args(base_retriever, llm=llm)

In [53]:
response = query_engine.query(query)

str(response)

> Merging 4 nodes into parent node.
> Parent node id: 8d46ca45-ac34-40a9-a53c-0902a61e5022.
> Parent node text: We conduct RLHF by first collecting human preference data for safety similar to Section 3.2.2: an...



"Adjusting the amount of safety data used in the RLHF stage could lead to changes in the model's performance, particularly in terms of safety and helpfulness.  It's possible to observe trends in how the model behaves as the amount of safety data is increased.  \n"

In [54]:
base_response = base_query_engine.query(query)

str(base_response)

"Adjusting the amount of safety data used in the RLHF stage could lead to improvements in model safety, as indicated by shifts in reward model score distributions towards higher reward scores.  Additionally, it may help to preserve the distribution of helpfulness scores, ensuring that the model's helpfulness is not negatively impacted. \n\n\n"