# Metadata Replacement + Node Sentence Window

In this notebook, we use the `SentenceWindowNodeParser` to parse documents into single sentences per node. Each node also contains a "window" with the sentences on either side of the node sentence.

Then, during retrieval, before passing the retrieved sentences to the LLM, the single sentences are replaced with a window containing the surrounding sentences using the `MetadataReplacementPostProcessor`.

This is most useful for large documents/indexes, as it helps to retrieve more fine-grained details.

By default, the sentence window is 3 sentences on either side of the original sentence, which matches the `window_size=3` set in the parser below.

In this case, chunk size settings are not used, in favor of following the window settings.
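To make the windowing concrete, here is a minimal pure-Python sketch of the idea. It is not the `SentenceWindowNodeParser` implementation: the regex splitter and the `build_sentence_windows` helper are illustrative assumptions, and only the metadata key names `window` and `original_text` are taken from this notebook.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Return one dict per sentence, each carrying its surrounding window.

    Hypothetical helper for illustration only; the real parser uses its own
    sentence tokenizer and node classes.
    """
    # Naive splitter (assumption): break after ., ! or ? followed by whitespace.
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        # Clamp the window at the start and end of the document.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,
                "metadata": {
                    "window": " ".join(sentences[start:end]),
                    "original_text": sentence,
                },
            }
        )
    return nodes


text = (
    "The Bailey family survived 118 days. I survived 227 days. "
    "I kept myself busy. That was one key to my survival."
)
nodes = build_sentence_windows(text, window_size=1)
print(nodes[1]["text"])
print(nodes[1]["metadata"]["window"])
```

Each node's `text` stays a single sentence (this is what gets embedded), while the wider context rides along in the metadata.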

In [None]:
%pip install llama-index-embeddings-openai
%pip install llama-index-embeddings-huggingface
%pip install llama-index-llms-openai


In [None]:
%load_ext autoreload
%autoreload 2

## Setup

If you're opening this notebook on Colab, you will probably need to install LlamaIndex 🦙.

In [None]:
!pip install llama-index


In [None]:
import os
import openai

In [None]:
os.environ["OPENAI_API_KEY"] = 'sk-........'

In [None]:
from llama_index.llms.openai import OpenAI
from llama_index.embeddings.openai import OpenAIEmbedding
from llama_index.embeddings.huggingface import HuggingFaceEmbedding
from llama_index.core.node_parser import SentenceWindowNodeParser
from llama_index.core.node_parser import SentenceSplitter

# create the sentence window node parser w/ default settings
node_parser = SentenceWindowNodeParser.from_defaults(
    window_size=3,
    window_metadata_key="window",
    original_text_metadata_key="original_text",
)

# base node parser is a sentence splitter
text_splitter = SentenceSplitter()

llm = OpenAI(model="gpt-3.5-turbo", temperature=0.1)
embed_model = HuggingFaceEmbedding(
    model_name="sentence-transformers/all-mpnet-base-v2", max_length=512
)

from llama_index.core import Settings

Settings.llm = llm
Settings.embed_model = embed_model
Settings.text_splitter = text_splitter



## Load Data, Build the Index

In this section, we load data and build the vector index.

### Load Data

Here, we build an index using a PDF of Yann Martel's *Life of Pi*.

In [None]:
# data source = https://pdfdrive.com.co/the-life-of-pi-pdf/

In [None]:
from llama_index.core import SimpleDirectoryReader

documents = SimpleDirectoryReader(
    input_files=["/content/Life-of-Pi-by-Yann-Martel-pdfdrive.com.co.pdf"]
).load_data()

### Extract Nodes

We extract out the set of nodes that will be stored in the VectorIndex. This includes both the nodes with the sentence window parser, as well as the "base" nodes extracted using the standard parser.

In [None]:
nodes = node_parser.get_nodes_from_documents(documents)

In [None]:
base_nodes = text_splitter.get_nodes_from_documents(documents)

### Build the Indexes

We build both the sentence index, as well as the "base" index (with default chunk sizes).

In [None]:
from llama_index.core import VectorStoreIndex

sentence_index = VectorStoreIndex(nodes)

In [None]:
base_index = VectorStoreIndex(base_nodes)

## Querying

### With MetadataReplacementPostProcessor

Here, we use the `MetadataReplacementPostProcessor` to replace the sentence in each node with its surrounding context.

In [None]:
from llama_index.core.postprocessor import MetadataReplacementPostProcessor

query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)
window_response = query_engine.query(
    "What is the significance of the number 227 in 'Life of Pi'?"
)
print(window_response)

The significance of the number 227 in 'Life of Pi' is that it represents the duration of the protagonist's trial, lasting over seven months, during which he survived.


We can also check the original sentence that was retrieved for each node, as well as the actual window of sentences that was sent to the LLM.

In [None]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"Window: {window}")
print("------------------")
print(f"Original Sentence: {sentence}")

Window: Owen Chase, whose
account of the sinking of the whaling ship Essex by a whale inspired Herman Melville, survived eighty-three
days at sea with two mates, interrupted by a one-week stay on an inhospitable island.  The Bailey family
survived 118 days.  I have heard of a Korean merchant sailor named Poon, I believe, who survived the Pacific
for 173 days in the 1950s.
 I survived 227 days.  That's how long my trial lasted, over seven months.
 I kept myself busy.  That was one key to my survival. 
------------------
Original Sentence: I survived 227 days. 
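The replacement step behind this output can be sketched in a few lines. This is a simplified stand-in that works on plain dicts rather than LlamaIndex node objects; the `replace_with_metadata` helper is hypothetical, and the fallback to the original text when the key is missing is an assumption about sensible behaviour, not a description of the library's code.

```python
def replace_with_metadata(nodes, target_metadata_key="window"):
    """Swap each node's text for the value stored under target_metadata_key.

    Illustrative helper (assumption). Nodes without the key keep their
    original text, and the input list is left unmodified.
    """
    replaced = []
    for node in nodes:
        new_text = node["metadata"].get(target_metadata_key, node["text"])
        replaced.append({**node, "text": new_text})
    return replaced


retrieved = [
    {
        "text": "I survived 227 days.",
        "metadata": {
            "original_text": "I survived 227 days.",
            "window": (
                "I survived 227 days. That's how long my trial lasted, "
                "over seven months."
            ),
        },
    }
]
for_llm = replace_with_metadata(retrieved)
print(for_llm[0]["text"])
```

The retriever still ranks nodes by the single sentence; only the text handed to the LLM changes.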


### Contrast with normal VectorStoreIndex

In [None]:
query_engine = base_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine.query(
    "What is the significance of the number 227 in 'Life of Pi'?"
)
print(vector_response)

The number 227 in 'Life of Pi' represents the total number of days Pi spent at sea on the lifeboat with Richard Parker.


## Analysis

So the `SentenceWindowNodeParser` + `MetadataReplacementPostProcessor` combo is the clear winner here. But why?

Embeddings at a sentence level seem to capture more fine-grained details, like the number `227`.

We can also compare the retrieved chunks for each index!

In [None]:
for source_node in window_response.source_nodes:
    print(source_node.node.metadata["original_text"])
    print("--------")

I survived 227 days. 
--------
I believe the answer lies in something I
mentioned earlier, that measure of madness that moves life in strange but saving ways. 
--------


Here, we can see that the sentence window index retrieved the sentence that mentions 227 as its top result. Remember, the embeddings are based purely on the original sentence here, but the LLM actually ends up reading the surrounding context as well!
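One way to build intuition for why sentence-level embeddings can help is a toy similarity comparison. The sketch below uses bag-of-words cosine similarity, which is only a crude stand-in (an assumption) for the real embedding model; it shows how the same sentence matches a query less strongly once it is buried in a larger chunk.

```python
import math
import re
from collections import Counter


def bag_of_words(text):
    """Lowercased token counts; a crude stand-in for an embedding vector."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


query = "survived 227 days"
sentence = "I survived 227 days."
chunk = (
    "The Bailey family survived 118 days. I survived 227 days. "
    "That's how long my trial lasted, over seven months. I kept myself busy."
)

sentence_score = cosine(bag_of_words(query), bag_of_words(sentence))
chunk_score = cosine(bag_of_words(query), bag_of_words(chunk))
print(f"sentence: {sentence_score:.3f}  chunk: {chunk_score:.3f}")
```

The extra words in the chunk dilute the match, so the single sentence scores higher. Real embedding models behave less mechanically, so treat this as intuition rather than evidence.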

Now, let's try to dissect why the naive vector index failed.

In [None]:
for node in vector_response.source_nodes:
    print("227 mentioned?", "227" in node.node.text)
    print("--------")

227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------
227 mentioned? False
--------


None of the retrieved source nodes contains the string `227`. So what did one of these chunks actually look like?

In [None]:
print(vector_response.source_nodes[2].node.text)

Yann Martel: Life of Pi
I turn. Leaning against the sofa in the living room, looking up at me bashfully, is a little brown girl, pretty in
pink, very much at home. She's holding an orange cat in her arms. Two front legs sticking straight up and a
deeply sunk head are all that is visible of it above her crossed arms. The rest of the cat is hanging all the way
down to the floor. The animal seems quite relaxed about being stretched on the rack in this manner.
"And this is your daughter," I say.
"Yes. Usha. Usha darling, are you sure Moccasin is comfortable like that?"
Usha drops Moccasin. He flops to the floor unperturbed.
"Hello, Usha," I say.
She comes up to her father and peeks at me from behind his leg.
"What are you doing, little one?" he says. "Why are you hiding?"
She doesn't reply, only looks at me with a smile and hides her face.
"How old are you, Usha?" I ask.
She doesn't reply.
Then Piscine Molitor Patel, known to all as Pi Patel, bends down and picks up his daughter.
"You know

This chunk never mentions 227, so the base retriever did not surface the passage that states the figure. Even when a relevant passage is retrieved, its position matters: with LLMs, text in the middle of the retrieved context is often ignored or underused. The paper ["Lost in the Middle"](https://arxiv.org/abs/2307.03172) discusses this.

## [Optional] Evaluation

We more rigorously evaluate how well the sentence window retriever works compared to the base retriever.

We define/load an eval benchmark dataset and then run different evaluations over it.

**WARNING**: This can be *expensive*, especially with GPT-4. Use caution and tune the sample size to fit your budget.
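A quick back-of-the-envelope count can help with budgeting before running the cells in this section. The helper below is a hypothetical sketch: it assumes one generated question per requested slot and one evaluator job per (sample, evaluator) pair, and it does not model tokens, retries, or the query-engine calls themselves.

```python
def estimate_eval_jobs(num_nodes, questions_per_chunk, max_samples, num_evaluators):
    """Rough job counts for dataset generation and batch evaluation.

    Illustrative only (assumption): actual API usage may differ, and some
    evaluators may rely on embeddings rather than LLM calls.
    """
    questions = num_nodes * questions_per_chunk
    evaluated = min(max_samples, questions)
    return {
        "questions_generated": questions,
        "eval_jobs_per_engine": evaluated * num_evaluators,
    }


# Values used in this notebook: 10 sampled nodes, 2 questions per chunk,
# 6 evaluated samples, 4 evaluators.
print(estimate_eval_jobs(10, 2, 6, 4))
```

Since both the sentence window engine and the base engine are evaluated, the per-engine job count applies twice.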

In [None]:
from llama_index.core.evaluation import DatasetGenerator, QueryResponseDataset

from llama_index.llms.openai import OpenAI
import nest_asyncio
import random

nest_asyncio.apply()

In [None]:
len(base_nodes)

216

In [None]:
num_nodes_eval = 10
# there are 216 nodes total; sample from the first 25 to generate questions
sample_eval_nodes = random.sample(base_nodes[:25], num_nodes_eval)
# NOTE: run this if the dataset isn't already saved
# generate questions from the largest chunks (1024)
dataset_generator = DatasetGenerator(
    sample_eval_nodes,
    llm=OpenAI(model="gpt-4"),
    show_progress=True,
    num_questions_per_chunk=2,
)


In [None]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()








In [None]:
eval_dataset.save_json("lifeofpie_dataset.json")

In [None]:
# optional
eval_dataset = QueryResponseDataset.from_json("lifeofpie_dataset.json")



### Compare Results

In [None]:
import asyncio
import nest_asyncio

nest_asyncio.apply()

In [None]:
from llama_index.core.evaluation import (
    CorrectnessEvaluator,
    SemanticSimilarityEvaluator,
    RelevancyEvaluator,
    FaithfulnessEvaluator,
    PairwiseComparisonEvaluator,
)


from collections import defaultdict
import pandas as pd

# NOTE: can uncomment other evaluators
evaluator_c = CorrectnessEvaluator(llm=OpenAI(model="gpt-4"))
evaluator_s = SemanticSimilarityEvaluator()
evaluator_r = RelevancyEvaluator(llm=OpenAI(model="gpt-4"))
evaluator_f = FaithfulnessEvaluator(llm=OpenAI(model="gpt-4"))
# pairwise_evaluator = PairwiseComparisonEvaluator(llm=OpenAI(model="gpt-4"))

In [None]:
from llama_index.core.evaluation.eval_utils import (
    get_responses,
    get_results_df,
)
from llama_index.core.evaluation import BatchEvalRunner

max_samples = 6

eval_qs = eval_dataset.questions
ref_response_strs = [r for (_, r) in eval_dataset.qr_pairs]

# resetup base query engine and sentence window query engine
# base query engine
base_query_engine = base_index.as_query_engine(similarity_top_k=2)
# sentence window query engine
query_engine = sentence_index.as_query_engine(
    similarity_top_k=2,
    # the target key defaults to `window` to match the node_parser's default
    node_postprocessors=[
        MetadataReplacementPostProcessor(target_metadata_key="window")
    ],
)

In [None]:
import numpy as np

base_pred_responses = get_responses(
    eval_qs[:max_samples], base_query_engine, show_progress=True
)
pred_responses = get_responses(
    eval_qs[:max_samples], query_engine, show_progress=True
)

pred_response_strs = [str(p) for p in pred_responses]
base_pred_response_strs = [str(p) for p in base_pred_responses]











In [None]:
evaluator_dict = {
    "correctness": evaluator_c,
    "faithfulness": evaluator_f,
    "relevancy": evaluator_r,
    "semantic_similarity": evaluator_s,
}
batch_runner = BatchEvalRunner(evaluator_dict, workers=2, show_progress=True)

Run evaluations over correctness, faithfulness, relevancy, and semantic similarity.
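The `workers=2` argument above suggests that evaluation jobs run concurrently with a cap on how many are in flight. The following is a generic sketch of that pattern using `asyncio.Semaphore`; it is an assumption about the general idea, not `BatchEvalRunner`'s actual code, and `run_with_workers` and `fake_eval` are hypothetical names.

```python
import asyncio


async def run_with_workers(jobs, workers=2):
    """Run coroutine factories with at most `workers` in flight at once."""
    semaphore = asyncio.Semaphore(workers)

    async def guarded(job):
        async with semaphore:
            return await job()

    # gather returns results in the same order as the input jobs
    return await asyncio.gather(*(guarded(job) for job in jobs))


async def fake_eval(name, score):
    """Placeholder for an evaluator call (hypothetical; no LLM involved)."""
    await asyncio.sleep(0.01)
    return name, score


jobs = [
    lambda: fake_eval("correctness", 4.0),
    lambda: fake_eval("faithfulness", 1.0),
    lambda: fake_eval("relevancy", 1.0),
]
results = asyncio.run(run_with_workers(jobs, workers=2))
print(results)
```

Inside a notebook with an already running event loop, `await run_with_workers(jobs, workers=2)` can be used in place of `asyncio.run(...)`.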

In [None]:
eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)










In [None]:
base_eval_results = await batch_runner.aevaluate_responses(
    queries=eval_qs[:max_samples],
    responses=base_pred_responses[:max_samples],
    reference=ref_response_strs[:max_samples],
)

In [None]:
#with --> num_nodes_eval = 10, base_nodes[:25] (25 nodes), max_samples = 6
results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)

| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 0 | Sentence Window Retriever | 3.583333 | 0.833333 | 0.666667 | 0.794194 |
| 1 | Base Retriever | 3.6 | 0.8 | 0.8 | 0.809503 |


In [None]:
#with --> num_nodes_eval = 10, base_nodes[:25] (25 nodes), max_samples = 5

results_df = get_results_df(
    [eval_results, base_eval_results],
    ["Sentence Window Retriever", "Base Retriever"],
    ["correctness", "relevancy", "faithfulness", "semantic_similarity"],
)
display(results_df)

| | names | correctness | relevancy | faithfulness | semantic_similarity |
|---|---|---|---|---|---|
| 0 | Sentence Window Retriever | 2.9 | 0.8 | 1.0 | 0.762755 |
| 1 | Base Retriever | 3.6 | 0.8 | 0.8 | 0.809503 |


Our OpenAI plan limits how many requests we can send to GPT-4, so these runs use very small samples. Even so, the Sentence Window Retriever's correctness score rises from 2.9 to 3.58 when `max_samples` increases from 5 to 6. We expect that increasing the number of base nodes considered and `num_nodes_eval` would further favor the Sentence Window Retriever over the Base Retriever, although the runs above are too small to demonstrate this.
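With so few samples, the averaged scores are coarse. Assuming relevancy and faithfulness are scored pass/fail per sample (an assumption about the evaluators, not something shown in the outputs), values such as 0.833333, 0.666667, and 0.8 correspond to 5/6, 4/6, and 4/5, so a single question flipping moves the mean by 1/6 or 1/5. The helpers below are hypothetical and only illustrate that arithmetic.

```python
def pass_fraction(passes, n_samples):
    """Mean of binary (0 or 1) per-sample scores."""
    return passes / n_samples


def step_size(n_samples):
    """How much the mean moves when a single sample flips."""
    return 1 / n_samples


for n in (5, 6):
    fractions = [round(pass_fraction(k, n), 6) for k in range(n + 1)]
    print(n, fractions, round(step_size(n), 3))
```

This is one reason to treat differences of a tenth or two between the retrievers as noise at these sample sizes.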