# Metadata Replacement + Node Sentence Window

- The idea is similar to a parent-document retriever: retrieve on small units (single sentences), then pass the larger surrounding context to the LLM.

In [37]:
import yaml, os, textwrap, random
from llama_index.llms import AzureOpenAI, OpenAI
from llama_index.llm_predictor import LLMPredictor
from llama_index import set_global_service_context
from llama_index.text_splitter import SentenceSplitter
from llama_index.embeddings import HuggingFaceEmbedding
from llama_index.node_parser import SentenceWindowNodeParser
from llama_index.postprocessor import MetadataReplacementPostProcessor
from llama_index.evaluation import (
                                    DatasetGenerator,
                                    QueryResponseDataset,
                                    )
from llama_index import (
                        ServiceContext, 
                        VectorStoreIndex,
                        load_index_from_storage, 
                        SimpleDirectoryReader, 
                        StorageContext
                        )

In [2]:
with open('cadentials.yaml') as f:
    credentials = yaml.safe_load(f)  # safe_load is enough for a plain key/value credentials file

In [3]:
llm_flag = 'DIRECT'

embedding_llm = HuggingFaceEmbedding(
                                    model_name="BAAI/bge-small-en-v1.5",
                                    device='mps'
                                    )

if llm_flag == 'AZURE':
    llm=AzureOpenAI(
                    model=credentials['AZURE_ENGINE'],
                    api_key=credentials['AZURE_OPENAI_API_KEY'],
                    deployment_name=credentials['AZURE_DEPLOYMENT_ID'],
                    api_version=credentials['AZURE_OPENAI_API_VERSION'],
                    azure_endpoint=credentials['AZURE_OPENAI_API_BASE'],
                    temperature=0.3
                    )
    
    chat_llm = LLMPredictor(llm)
else:
    chat_llm = OpenAI(
                    api_key=credentials['DEMO_OPENAI_API_KEY'],
                    temperature=0.3
                    )

if llm_flag == 'AZURE':
    service_context = ServiceContext.from_defaults(
                                                    embed_model=embedding_llm,
                                                    llm_predictor=chat_llm
                                                    )
else:
    service_context = ServiceContext.from_defaults(
                                                    embed_model=embedding_llm,
                                                    llm=chat_llm
                                                    )

set_global_service_context(service_context)

### SentenceSplitter
- Tries to keep sentences and paragraphs together.
- As a result, compared to the original `TokenTextSplitter`, a node chunk is less likely to end with a hanging sentence or a sentence fragment.
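
To make the idea concrete, here is a minimal pure-Python sketch of sentence-aware chunking. It is not the `SentenceSplitter` implementation: the helper names (`split_sentences`, `pack_sentences`), the regex-based sentence detection, and the character budget (`max_chars`) are all assumptions made for illustration, whereas the real splitter budgets in tokens.

```python
import re


def split_sentences(text):
    # Naive sentence detection (assumption): break after ., ! or ? followed by whitespace.
    return [s for s in re.split(r"(?<=[.!?])\s+", text.strip()) if s]


def pack_sentences(text, max_chars):
    """Greedily pack whole sentences into chunks of at most max_chars characters."""
    chunks, current = [], ""
    for sentence in split_sentences(text):
        candidate = f"{current} {sentence}".strip()
        if current and len(candidate) > max_chars:
            # Adding this sentence would overflow: close the chunk at a sentence boundary.
            chunks.append(current)
            current = sentence
        else:
            current = candidate
    if current:
        chunks.append(current)
    return chunks


text = "Oceans absorb heat. Sea level is rising. Coral reefs are bleaching."
for chunk in pack_sentences(text, max_chars=45):
    print(repr(chunk))
```

Every chunk ends on a sentence boundary. In this sketch, a single sentence longer than `max_chars` is kept whole rather than cut.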

### SentenceWindowNodeParser
- Splits a document into Nodes, with `each node being a sentence` 
- Each node also stores a window of its surrounding sentences in its metadata.
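
The two points above can be sketched in plain Python. This is an illustrative approximation, not the library's code: `build_sentence_windows` is a hypothetical helper, nodes are plain dicts, and it assumes `window_size` counts sentences on each side of the current one. The metadata keys mirror the `window` and `original_text` keys used in this notebook.

```python
def build_sentence_windows(sentences, window_size=3):
    """Return one node per sentence, with neighbouring sentences kept in metadata."""
    nodes = []
    for i, sentence in enumerate(sentences):
        # Assumption: window_size sentences before and after, clipped at the document edges.
        start = max(0, i - window_size)
        end = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "text": sentence,  # the single sentence is the node text
                "metadata": {
                    "window": " ".join(sentences[start:end]),
                    "original_text": sentence,
                },
            }
        )
    return nodes


sentences = ["A is first.", "B is second.", "C is third.", "D is fourth.", "E is fifth."]
nodes = build_sentence_windows(sentences, window_size=1)
print(nodes[2]["metadata"]["window"])
print(nodes[2]["metadata"]["original_text"])
```

The node text stays small so that the embedding describes a single sentence, while the window keeps the context available for later use.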

In [4]:
node_parser = SentenceWindowNodeParser.from_defaults(
                                                    window_size=3,
                                                    window_metadata_key="window",
                                                    original_text_metadata_key="original_text",
                                                    ) # small nodes: one sentence per node

text_splitter = SentenceSplitter(
                                chunk_size=1000,
                                chunk_overlap=200
                                ) # big chunks

In [5]:
documents = SimpleDirectoryReader(
                                input_files=["./data/IPCC_AR6_WGII_Chapter03.pdf"]
                                ).load_data()
len(documents)

172

In [6]:
nodes_small = node_parser.get_nodes_from_documents(documents)
nodes_big = text_splitter.get_nodes_from_documents(documents)

#### Let's see how SentenceWindowNodeParser works

In [7]:
print("number of small nodes: ", len(nodes_small))
print("number of big nodes: ", len(nodes_big))

number of small nodes:  11087
number of big nodes:  469


In [17]:
print(nodes_small[10].metadata['window'])

Contribution of Working Group II to the Sixth Assessment Report of 
the Intergovernmental Panel on Climate Change [H.-O.   Pörtner, D.C.   Roberts, M.  Tignor, E.S.   Poloczanska, K.  Mintenbeck, 
A. Alegría, M.  Craig, S.  Langsdorf, S.  Löschke, V .   Möller, A.  Okem, B.  Rama (eds.)].  Cambridge University Press, Cambridge, 
UK and New York, NY , USA, pp.  


In [18]:
print(nodes_small[10].metadata['original_text'])

Poloczanska, K.  Mintenbeck, 
A. Alegría, M.  Craig, S.  Langsdorf, S.  Löschke, V .  


## Build Vector Index

In [8]:
nodes_small_index = VectorStoreIndex(
                                    nodes_small, 
                                    service_context=service_context
                                    )

nodes_big_index = VectorStoreIndex(
                                    nodes_big, 
                                    service_context=service_context
                                    )

## Querying

- Here we use the `MetadataReplacementPostProcessor` to replace the sentence in each retrieved node with its surrounding context (the `window` stored in the node's metadata) before the nodes are sent to the LLM.
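
Conceptually, the replacement step is a small transformation over the retrieved hits. The sketch below shows the idea with plain dicts. `replace_with_metadata` is a hypothetical helper and the hits, scores, and sentences are toy data, so treat it as an approximation of the behaviour rather than the `MetadataReplacementPostProcessor` source.

```python
def replace_with_metadata(hits, target_metadata_key="window"):
    """Return copies of the hits whose text is replaced by metadata[target_metadata_key]."""
    processed = []
    for hit in hits:
        # Fall back to the existing text when the key is missing (assumed behaviour).
        new_text = hit["metadata"].get(target_metadata_key, hit["text"])
        processed.append({**hit, "text": new_text})
    return processed


# Toy retrieval results (hypothetical scores and sentences).
hits = [
    {
        "text": "Matched sentence.",
        "score": 0.82,
        "metadata": {
            "window": "Sentence before. Matched sentence. Sentence after.",
            "original_text": "Matched sentence.",
        },
    },
    {"text": "No window stored here.", "score": 0.40, "metadata": {}},
]

for hit in replace_with_metadata(hits):
    print(hit["text"])
```

Retrieval ranks on the single sentence, while the LLM reads the wider window. The original sentence remains available under `original_text`.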

In [29]:
query_engine = nodes_small_index.as_query_engine(
                                                similarity_top_k=2,
                                                node_postprocessors=[
                                                                    MetadataReplacementPostProcessor(target_metadata_key="window")
                                                                    ]
                                                )
window_response = query_engine.query(
                                    "What are the concerns surrounding the AMOC?"
                                    )
print(textwrap.fill(str(window_response), width=140))

There is low confidence in the quantification of AMOC changes in the 20th century due to disagreement in quantitative reconstructed and
simulated trends. Additionally, direct observational records since the mid-2000s are too short to determine the relative contributions of
internal variability, natural forcing, and anthropogenic forcing to AMOC change. However, it is highly confident that over the 21st century,
AMOC will decline for all SSP scenarios but will not experience an abrupt collapse before 2100.


In [33]:
window = window_response.source_nodes[0].node.metadata["window"]
sentence = window_response.source_nodes[0].node.metadata["original_text"]

print(f"######################## Window ##########################\n{window}")
print('\n\n')
print(f"################### Original Sentence ####################\n{sentence}")

######################## Window ##########################
Nevertheless, projected future annual cumulative upwelling wind 
changes at most locations and seasons remain within ±10–20% of 
present-day values (medium confidence) (WGI AR6 Section  9.2.3.5; 
Fox-Kemper et al., 2021).
 Continuous observation of the Atlantic meridional overturning 
circulation (AMOC) has improved the understanding of its variability 
(Frajka-Williams et  al., 2019), but there is low confidence in the 
quantification of AMOC changes in the 20th century because of low 
agreement in quantitative reconstructed and simulated trends (WGI 
AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 2021). 
 Direct observational records since the mid-2000s remain too short to 
determine the relative contributions of internal variability, natural 
forcing and anthropogenic forcing to AMOC change (high confidence) 
(WGI AR6 Sections 2.3.3, 9.2.3.1; Fox-Kemper et al., 2021; Gulev et al., 
2021).  Over the 21st 

In [35]:
query_engine_big = nodes_big_index.as_query_engine(similarity_top_k=2)
vector_response = query_engine_big.query(
    "What are the concerns surrounding the AMOC?"
    )
print(textwrap.fill(str(vector_response), width=140))

There are concerns surrounding the Atlantic overturning circulation (AMOC). There is low confidence in reconstructed and modelled AMOC
changes for the 20th century. However, it is projected that the AMOC will decline over the 21st century with high confidence, although there
is low confidence for quantitative projections.


## Why `SentenceWindowNodeParser` + `MetadataReplacementPostProcessor` is the winner here
- Embeddings at a sentence level seem to capture more fine-grained details, like the word `AMOC`.
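
One plausible intuition for this is dilution: when a specific term sits inside a large chunk, the other content of the chunk weakens its contribution to the similarity score. The toy sketch below illustrates that with a bag-of-words cosine similarity. This is an assumption-laden stand-in and not how the `BAAI/bge-small-en-v1.5` dense embeddings work; `bow` and `cosine` are hypothetical helpers.

```python
import math
import re
from collections import Counter


def bow(text):
    # Toy "embedding": lowercase word counts (assumption, not a dense embedding).
    return Counter(re.findall(r"[a-z]+", text.lower()))


def cosine(a, b):
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(sum(v * v for v in b.values()))
    return dot / norm if norm else 0.0


sentence = "The AMOC will decline over the 21st century."
chunk = (
    "Upwelling wind changes remain close to present-day values. "
    "Direct observational records remain too short. "
    + sentence
)
query = "AMOC decline"

sentence_score = cosine(bow(query), bow(sentence))
chunk_score = cosine(bow(query), bow(chunk))
print(f"sentence-level score: {sentence_score:.3f}")
print(f"chunk-level score:    {chunk_score:.3f}")
```

The sentence-level score is higher because the matching terms make up a larger share of the shorter text. Whether the same effect explains the behaviour of a real embedding model on this corpus is not established by this sketch.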

## Evaluation

In [42]:
num_nodes_eval = 30 #there are 469 big nodes total. Take the first 200 to generate questions (the back half of the doc is all references)
sample_eval_nodes = random.sample(nodes_big[:200], num_nodes_eval) # NOTE: run this if the dataset isn't already saved
eval_service_context = ServiceContext.from_defaults(llm=chat_llm)
dataset_generator = DatasetGenerator(
                                    sample_eval_nodes,
                                    service_context=eval_service_context,
                                    num_questions_per_chunk=2,
                                    show_progress=True
                                    ) # generate questions from the big chunks (chunk_size=1000)

In [40]:
eval_dataset = await dataset_generator.agenerate_dataset_from_nodes()
eval_dataset.save_json("generated/ipcc_eval_qr_dataset.json")

In [None]:
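# NOTE: the cell above saves under "generated/"; point this path at wherever the dataset file actually lives.
eval_dataset = QueryResponseDataset.from_json("data/ipcc_eval_qr_dataset.json")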