# Sub-Document Summary Metadata Pack

This LlamaPack provides an advanced technique for injecting each chunk with "sub-document" metadata. This context augmentation technique is helpful for both retrieving relevant context and for synthesizing correct answers.

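It goes a step beyond attaching a single document-level summary to every chunk. A long document can cover multiple distinct themes, so the document is first split into large parent chunks, each parent chunk is summarized, and that summary is attached as metadata to the smaller child chunks drawn from it. Each chunk is thereby grounded in context that is global but still relevant.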
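
To make the idea concrete, here is a minimal, library-free sketch of the parent/child scheme. It is not the pack's implementation: the word-based splitter, the `toy_summarize` stand-in (which only keeps the first few words instead of calling an LLM), and the `Chunk` class are all hypothetical. Only the `context_summary` metadata key mirrors what appears in the pack's output.

```python
from dataclasses import dataclass, field


@dataclass
class Chunk:
    """A small retrieval chunk plus the metadata attached to it (hypothetical type)."""
    text: str
    metadata: dict = field(default_factory=dict)


def split_words(text: str, size: int) -> list[str]:
    """Split text into pieces of at most `size` whitespace-separated words."""
    words = text.split()
    return [" ".join(words[i:i + size]) for i in range(0, len(words), size)]


def toy_summarize(text: str, max_words: int = 8) -> str:
    """Stand-in for an LLM summary call: keep only the first few words."""
    return " ".join(text.split()[:max_words])


def build_child_chunks(document: str, parent_size: int, child_size: int) -> list[Chunk]:
    """Summarize each parent chunk once and attach that summary to all of its children."""
    chunks = []
    for parent in split_words(document, parent_size):
        summary = toy_summarize(parent)  # one summary per parent chunk
        for child in split_words(parent, child_size):
            chunks.append(Chunk(text=child, metadata={"context_summary": summary}))
    return chunks


# Toy document of 20 "words", split into 2 parents of 10 words each.
doc = " ".join(f"w{i}" for i in range(20))
for chunk in build_child_chunks(doc, parent_size=10, child_size=4):
    print(chunk.metadata["context_summary"], "|", chunk.text)
```

Children from the same parent share one summary, while children from different parents get different ones. That is what distinguishes this from a single document-wide summary.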

Source: https://github.com/run-llama/llama_index/blob/main/llama-index-packs/llama-index-packs-subdoc-summary/examples/subdoc-summary.ipynb
Video: https://www.youtube.com/watch?v=m6P1Rp91AzM&t=1s

## Setup Data

In [None]:
!mkdir -p 'data/'
!curl 'https://arxiv.org/pdf/2307.09288.pdf' -o 'data/llama2.pdf'

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader("data").load_data()

## Run the Sub-Document Summary Metadata Pack

In [None]:
%pip install llama-index-packs-subdoc-summary llama-index-llms-openai llama-index-embeddings-openai

In [None]:
from llama_index.packs.subdoc_summary import SubDocSummaryPack
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding

subdoc_summary_pack = SubDocSummaryPack(
    documents,
    parent_chunk_size=8192,  # default
    child_chunk_size=512,  # default
    llm=OpenAI(model="gpt-3.5-turbo"),
    embed_model=OpenAIEmbedding(),
)
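
As a rough guide to what these two sizes imply, the sketch below estimates how many parent and child chunks a document produces. It rests on assumptions not confirmed by the pack: no chunk overlap, child chunks never spanning two parents, and one summary per parent chunk. The function name and the 100,000-token document length are hypothetical.

```python
import math


def estimate_chunk_counts(total_tokens, parent_chunk_size=8192, child_chunk_size=512):
    """Return (parent_chunks, child_chunks) for a document of `total_tokens` tokens.

    Assumes no overlap between chunks and that each parent is split
    into children independently of its neighbours.
    """
    full_parents, remainder = divmod(total_tokens, parent_chunk_size)
    children_per_full_parent = math.ceil(parent_chunk_size / child_chunk_size)
    parents = full_parents + (1 if remainder else 0)
    children = (
        full_parents * children_per_full_parent
        + math.ceil(remainder / child_chunk_size)
    )
    return parents, children


# Hypothetical document length of 100,000 tokens.
parents, children = estimate_chunk_counts(100_000)
print(f"{parents} parent chunks (summaries), {children} child chunks (embeddings)")
```

With the default sizes, a full parent chunk yields 16 child chunks that all share the same summary, so the number of summaries stays small relative to the number of embedded chunks.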

In [None]:
from IPython.display import Markdown, display
from llama_index.core.response.notebook_utils import display_source_node

response = subdoc_summary_pack.run("How was Llama2 pretrained?")
display(Markdown(str(response)))
for n in response.source_nodes:
    display_source_node(n, source_length=10000, metadata_mode="all")

Llama 2 was pretrained using an optimized auto-regressive transformer with robust data cleaning, updated data mixes, training on 40% more total tokens, doubling the context length, and using grouped-query attention to improve inference scalability for larger models.

**Node ID:** 172a1344-d48d-443b-8383-677037570c06<br>**Similarity:** 0.8720929924174893<br>**Text:** page_label: 1
file_name: llama2.pdf
file_path: data/llama2.pdf
file_type: application/pdf
file_size: 13661300
creation_date: 2024-02-17
last_modified_date: 2024-02-17
last_accessed_date: 2024-02-17
context_summary: Llama 2 is a collection of pretrained and fine-tuned large language models optimized for dialogue use cases, ranging from 7 billion to 70 billion parameters. The models, known as Llama 2-Chat, have shown superior performance compared to open-source chat models on various benchmarks and are considered as potential alternatives to closed-source models.

Llama 2: Open Foundation and Fine-Tuned Chat Models
Hugo Touvron∗ Louis Martin† Kevin Stone†
Peter Albert Amjad Almahairi Yasmine Babaei Nikolay Bashlykov Soumya Batra
Prajjwal Bhargava Shruti Bhosale Dan Bikel Lukas Blecher Cristian Canton Ferrer Moya Chen
Guillem Cucurull David Esiobu Jude Fernandes Jeremy Fu Wenyin Fu Brian Fuller
Cynthia Gao Vedanuj Goswami Naman Goyal Anthony Hartshorn Saghar Hosseini Rui Hou
Hakan Inan Marcin Kardas Viktor Kerkez Madian Khabsa Isabel Kloumann Artem Korenev
Punit Singh Koura Marie-Anne Lachaux Thibaut Lavril Jenya Lee Diana Liskovich
Yinghai Lu Yuning Mao Xavier Martinet Todor Mihaylov Pushkar Mishra
Igor Molybog Yixin Nie Andrew Poulton Jeremy Reizenstein Rashi Rungta Kalyan Saladi
Alan Schelten Ruan Silva Eric Michael Smith Ranjan Subramanian Xiaoqing Ellen Tan Binh Tang
Ross Taylor Adina Williams Jian Xiang Kuan Puxin Xu Zheng Yan Iliyan Zarov Yuchen Zhang
Angela Fan Melanie Kambadur Sharan Narang Aurelien Rodriguez Robert Stojnic
Sergey Edunov Thomas Scialom∗
GenAI, Meta
Abstract
In this work, we develop and release Llama 2, a collection of pretrained and fine-tuned
large language models (LLMs) ranging in scale from 7 billion to 70 billion parameters.
Our fine-tuned LLMs, called Llama 2-Chat, are optimized for dialogue use cases. Our
models outperform open-source chat models on most benchmarks we tested, and based on
our human evaluations for helpfulness and safety, may be a suitable substitute for closed-source
models. We provide a detailed description of our approach to fine-tuning and safety
improvements of Llama 2-Chat in order to enable the community to build on our work and
contribute to the responsible development of LLMs.<br>

**Node ID:** dbbde2a7-d51c-4245-959d-ba97ba414b55<br>**Similarity:** 0.8700958215249326<br>**Text:** page_label: 5
file_name: llama2.pdf
file_path: data/llama2.pdf
file_type: application/pdf
file_size: 13661300
creation_date: 2024-02-17
last_modified_date: 2024-02-17
last_accessed_date: 2024-02-17
context_summary: Llama 2-Chat is developed through pretraining, supervised fine-tuning, and reinforcement learning with human feedback methodologies, focusing on refining the model iteratively. The training process involves using an optimized auto-regressive transformer, robust data cleaning, updated data mixes, and specific architectural enhancements like increased context length and grouped-query attention.

Figure 4: Training of Llama 2-Chat: This process begins with the pretraining of Llama 2 using publicly
available online sources. Following this, we create an initial version of Llama 2-Chat through the application
of supervised fine-tuning. Subsequently, the model is iteratively refined using Reinforcement Learning
with Human Feedback (RLHF) methodologies, specifically through rejection sampling and Proximal Policy
Optimization (PPO). Throughout the RLHF stage, the accumulation of iterative reward modeling data in
parallel with model enhancements is crucial to ensure the reward models remain within distribution.
2 Pretraining
To create the new family of Llama 2 models, we began with the pretraining approach described in Touvron et al.
(2023), using an optimized auto-regressive transformer, but made several changes to improve performance.
Specifically, we performed more robust data cleaning, updated our data mixes, trained on 40% more total
tokens, doubled the context length, and used grouped-query attention (GQA) to improve inference scalability
for our larger models. Table 1 compares the attributes of the new Llama 2 models with the Llama 1 models.
2.1 Pretraining Data
Our training corpus includes a new mix of data from publicly available sources, which does not include data
from Meta’s products or services. We made an effort to remove data from certain sites known to contain a
high volume of personal information about private individuals. We trained on 2 trillion tokens of data as this
provides a good performance–cost trade-off, up-sampling the most factual sources in an effort to increase
knowledge and dampen hallucinations.
We performed a variety of pretraining data investigations so that users can better understand the potential
capabilities and limitations of our models; results can be found in Section 4.1.
2.2 Training Details
We adopt most of the pretraining setting and model architecture from Llama 1.<br>
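
The `context_summary` shown in each source node above can also be read programmatically rather than through `display_source_node`. The sketch below assumes each item in `response.source_nodes` exposes `.score` and a `.metadata` dict, as LlamaIndex's `NodeWithScore` does. To stay runnable without an API key, it uses `SimpleNamespace` stand-ins with scores copied from the output above and summaries shortened by hand. The helper name `summarize_sources` is hypothetical.

```python
from types import SimpleNamespace


def summarize_sources(source_nodes):
    """Collect page label, rounded similarity, and sub-document summary per node."""
    rows = []
    for n in source_nodes:
        rows.append(
            {
                "page": n.metadata.get("page_label"),
                "score": round(n.score, 3) if n.score is not None else None,
                "context_summary": n.metadata.get("context_summary", ""),
            }
        )
    return rows


# Stand-ins for response.source_nodes (summaries abbreviated for brevity).
fake_source_nodes = [
    SimpleNamespace(
        score=0.8720929924174893,
        metadata={"page_label": "1", "context_summary": "Llama 2 is a collection of ..."},
    ),
    SimpleNamespace(
        score=0.8700958215249326,
        metadata={"page_label": "5", "context_summary": "Llama 2-Chat is developed through ..."},
    ),
]

for row in summarize_sources(fake_source_nodes):
    print(row["page"], row["score"], row["context_summary"])
```

With a real response, passing `response.source_nodes` in place of `fake_source_nodes` should work under the same assumption.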

In [None]:
from IPython.display import Markdown, display
from llama_index.core.response.notebook_utils import display_source_node

response = subdoc_summary_pack.run(
    "What is the functionality of latest ChatGPT memory."
)
display(Markdown(str(response)))

for n in response.source_nodes:
    display_source_node(n, source_length=10000, metadata_mode="all")

The latest ChatGPT model, equipped with Ghost Attention (GAtt), demonstrates strong multi-turn memory ability by consistently referring to defined attributes for up to 20 turns in a conversation. This integration of GAtt in the ChatGPT model allows for efficient long context attention beyond 2048 tokens, showcasing potential for robust performance in handling extended contexts.

**Node ID:** 005a3c23-8d97-4e5d-957e-98ad2dfb93ad<br>**Similarity:** 0.7923889627946064<br>**Text:** page_label: 54
file_name: llama2.pdf
file_path: data/llama2.pdf
file_type: application/pdf
file_size: 13661300
creation_date: 2024-02-17
last_modified_date: 2024-02-17
last_accessed_date: 2024-02-17
context_summary: Llama 2-Chat with GAtt consistently refers to defined attributes for up to 20 turns, showcasing strong multi-turn memory ability. The integration of GAtt in Llama 2-Chat enables efficient long context attention beyond 2048 tokens, indicating potential for robust performance in handling extended contexts.

| Dialogue Turn | Baseline | + GAtt |
| --- | --- | --- |
| 2 | 100% | 100% |
| 4 | 10% | 100% |
| 6 | 0% | 100% |
| 20 | 0% | 100% |

Table 30: GAtt results. Llama 2-Chat with GAtt is able to refer to attributes 100% of the time, for up to 20
turns from our human evaluation. We limited the evaluated attributes to public figures and hobbies.
The attention now spans beyond 20 turns. We tested the model ability to remember the system arguments
through a human evaluation. The arguments (e.g. hobbies, persona) are defined during the first message, and
then from turn 2 to 20. We explicitly asked the model to refer to them (e.g. “What is your favorite hobby?”,
“What is your name?”), to measure the multi-turn memory ability of Llama 2-Chat. We report the results
in Table 30. Equipped with GAtt, Llama 2-Chat maintains 100% accuracy, always referring to the defined
attribute, and so, up to 20 turns (we did not extend the human evaluation more, and all the examples had
less than 4048 tokens in total over the turns). As a comparison, Llama 2-Chat without GAtt cannot anymore
refer to the attributes after only few turns: from 100% at turn t+1, to 10% at turn t+3 and then 0%.
GAtt Zero-shot Generalisation. We tried at inference time to set constrain not present in the training of
GAtt. For instance, “answer in one sentence only”, for which the model remained consistent, as illustrated in
Figure 28.
We applied first GAtt to Llama 1, which was pretrained with a context length of 2048 tokens and then
fine-tuned with 4096 max length. We tested if GAtt works beyond 2048 tokens, and the model arguably
managed to understand attributes beyond this window. This promising result indicates that GAtt could be
adapted as an efficient technique for long context attention.
A.3.6 How Far Can Model-Based Evaluation Go?<br>

**Node ID:** 0b1719e9-d7fa-42af-890b-5eeb946857c5<br>**Similarity:** 0.7837282816384877<br>**Text:** page_label: 16
file_name: llama2.pdf
file_path: data/llama2.pdf
file_type: application/pdf
file_size: 13661300
creation_date: 2024-02-17
last_modified_date: 2024-02-17
last_accessed_date: 2024-02-17
context_summary: The text discusses the challenges faced in maintaining multi-turn consistency in dialogue systems and introduces a method called Ghost Attention (GAtt) to address these issues. GAtt involves incorporating instructions throughout a conversation to ensure dialogue control over multiple turns.

Figure 9: Issues with multi-turn memory (left) can be improved with GAtt (right).
We train for between 200 and 400 iterations for all our models, and use evaluations on held-out prompts for
early stopping. Each iteration of PPO on the 70B model takes on average ≈330 seconds. To train quickly with
large batch sizes, we use FSDP (Zhao et al., 2023). This was effective when using O(1) forward or backward
passes, but caused a large slowdown (≈20×) during generation, even when using a large batch size and KV
cache. We were able to mitigate this by consolidating the model weights to each node once before generation
and then freeing the memory after generation, resuming the rest of the training loop.
3.3 System Message for Multi-Turn Consistency
In a dialogue setup, some instructions should apply for all the conversation turns, e.g., to respond succinctly,
or to “act as” some public figure. When we provided such instructions to Llama 2-Chat, the subsequent
response should always respect the constraint. However, our initial RLHF models tended to forget the initial
instruction after a few turns of dialogue, as illustrated in Figure 9 (left).
To address these limitations, we propose Ghost Attention (GAtt), a very simple method inspired by Context
Distillation (Bai et al., 2022b) that hacks the fine-tuning data to help the attention focus in a multi-stage
process. GAtt enables dialogue control over multiple turns, as illustrated in Figure 9 (right).
GAtt Method. Assume we have access to a multi-turn dialogue dataset between two persons (e.g., a user
and an assistant), with a list of messages $[u_1, a_1, \ldots, u_n, a_n]$, where $u_n$ and $a_n$ correspond to the user and
assistant messages for turn n, respectively. Then, we define an instruction, inst, that should be respected
throughout the dialogue. For example, inst could be “act as.” We can then synthetically concatenate this
instruction to all the user messages of the conversation.
Next, we can sample from this synthetic data using the latest RLHF model.<br>