In [1]:
from llama_index.llms.openai import OpenAI

In [2]:
llm = OpenAI(model="gpt-4", temperature=0.1)

In [3]:
from llama_index.core import SimpleDirectoryReader

In [4]:
documents = SimpleDirectoryReader(
    # To index a single paper instead: input_files=["./docs/colbert.pdf"]
    input_files=["./docs/PLAID.pdf", "./docs/colbert.pdf", "./docs/colbert-v2.pdf"]
).load_data()

In [5]:
from llama_index.core import Document

In [6]:
document = Document(text="\n\n".join([doc.text for doc in documents]))

In [7]:
from utils import build_automerging_index

automerging_index = build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index"
)

✅ In Answer Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Answer Relevance, input response will be set to __record__.main_output or `Select.RecordOutput` .
✅ In Context Relevance, input prompt will be set to __record__.main_input or `Select.RecordInput` .
✅ In Context Relevance, input response will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input source will be set to __record__.app.query.rets.source_nodes[:].node.text .
✅ In Groundedness, input statement will be set to __record__.main_output or `Select.RecordOutput` .


In [8]:
from utils import get_automerging_query_engine

automerging_query_engine = get_automerging_query_engine(
    automerging_index,
)
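
For intuition about what an auto-merging query engine does at retrieval time, here is a minimal pure-Python sketch. It assumes the usual auto-merging rule: when more than a threshold fraction of a parent's child chunks is retrieved, those children are replaced by the parent chunk. The names `merge_retrieved`, `parent_of`, `children_of` and the 0.5 threshold are illustrative assumptions, not the code behind `get_automerging_query_engine`.

```python
from collections import defaultdict


def merge_retrieved(retrieved, parent_of, children_of, ratio_thresh=0.5):
    """Replace retrieved child chunks by their parent when enough siblings were hit.

    retrieved:   list of retrieved leaf ids, in rank order
    parent_of:   dict mapping child id -> parent id
    children_of: dict mapping parent id -> list of all its child ids
    """
    # Group the retrieved leaves by parent.
    hits = defaultdict(list)
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        if parent is not None:
            hits[parent].append(node_id)

    # Parents whose retrieved share of children exceeds the threshold.
    to_merge = {
        parent
        for parent, kids in hits.items()
        if len(kids) / len(children_of[parent]) > ratio_thresh
    }

    # Emit each parent once, at the rank of its first retrieved child.
    merged, seen = [], set()
    for node_id in retrieved:
        parent = parent_of.get(node_id)
        out = parent if parent in to_merge else node_id
        if out not in seen:
            seen.add(out)
            merged.append(out)
    return merged


# Hypothetical two-level hierarchy: P1 has four leaves, P2 has two.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f"]}
parent_of = {c: p for p, kids in children_of.items() for c in kids}

result = merge_retrieved(["a", "e", "b", "c"], parent_of, children_of)
print(result)
```

In this toy run three of the four `P1` leaves are retrieved, so they collapse into `P1`, while the single `P2` leaf stays as it is.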

In [9]:
automerging_response = automerging_query_engine.query("explain 'maps centroids to their corresponding embedding IDs' in this paper")
print(str(automerging_response))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)


To explain 'maps centroids to their corresponding embedding IDs' in this paper, the authors group the embedding IDs that correspond to each centroid together and save this information as an inverted list on disk. This allows for efficient nearest-neighbor search during query time, enabling quick retrieval of token-level embeddings similar to those in a given query.
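
The answer above describes an inverted list from centroids to embedding IDs. A minimal sketch of that data structure, assuming each embedding is assigned to the centroid with the highest dot product, could look like this. The function names and the toy 2-d vectors are illustrative assumptions, not code from the papers.

```python
from collections import defaultdict


def nearest_centroid(vec, centroids):
    """Index of the centroid with the highest dot product with vec."""
    return max(
        range(len(centroids)),
        key=lambda i: sum(a * b for a, b in zip(vec, centroids[i])),
    )


def build_inverted_list(embeddings, centroids):
    """Map each centroid index to the IDs of the embeddings assigned to it."""
    inverted = defaultdict(list)
    for emb_id, vec in enumerate(embeddings):
        inverted[nearest_centroid(vec, centroids)].append(emb_id)
    return dict(inverted)


# Toy data: two centroids and four token embeddings.
centroids = [(1.0, 0.0), (0.0, 1.0)]
embeddings = [(0.9, 0.1), (0.2, 0.8), (0.7, 0.3), (0.1, 0.9)]

inverted = build_inverted_list(embeddings, centroids)
print(inverted)

# At query time, only embeddings listed under the query token's nearest
# centroid need to be considered as candidates.
candidates = inverted[nearest_centroid((0.8, 0.2), centroids)]
print(candidates)
```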


In [10]:
# Load one evaluation question per line, skipping blank lines.
with open('eval_questions.txt') as f:
    eval_questions = [line.strip() for line in f if line.strip()]

In [12]:
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

In [13]:
from utils import get_prebuilt_trulens_recorder

In [15]:
tru.reset_database()

tru_recorder_automerging = get_prebuilt_trulens_recorder(
    automerging_query_engine,
    app_id="Automerging Window Query Engine"
)

In [16]:
# Run every evaluation question through the engine while TruLens records it.
for question in eval_questions:
    with tru_recorder_automerging as recording:
        response = automerging_query_engine.query(question)
        print(question)
        print(str(response))

How does offline indexing work in ColBert?
Offline indexing in ColBERT involves isolating computations between queries and documents to pre-compute document representations. The indexing procedure consists of processing documents in batches, applying the document encoder to each batch, and storing the resulting embeddings for each document.


> Merging 1 nodes into parent node.
> Parent node id: 6546a2db-202a-41bd-bee0-2354cee31ca3.
> Parent node text: We couple those in
ColBERTv2 ,1a new late-interaction retriever that
employs a simple combination...

What is late interaction?
Late interaction is a paradigm that aims to reduce the burden on the encoder by decomposing relevance modeling into token-level computations. It encodes meaning at the level of tokens and delegates query-document matching to the interaction mechanism.
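
To make the token-level matching concrete, here is a minimal sketch of MaxSim-style late-interaction scoring: each query token embedding is matched to its most similar document token embedding, and those maxima are summed. The toy 2-d vectors and the helper names are illustrative assumptions. A real model would use normalized, higher-dimensional BERT outputs.

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs):
    """Sum over query tokens of the max similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Toy token embeddings: one vector per token.
query = [(1.0, 0.0), (0.0, 1.0)]
doc_a = [(1.0, 0.0), (0.0, 1.0), (0.5, 0.5)]
doc_b = [(0.5, 0.5), (0.5, 0.5)]

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because the document embeddings do not depend on the query, they can be computed once offline, and only this cheap interaction step runs at query time.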


Explain Query encoder?
The Query encoder takes a textual query and tokenizes it into BERT-based WordPiece tokens. It then prepends a special token [Q] to the query before encoding it.


> Merging 3 nodes into parent node.
> Parent node id: 59ebc0a8-f9cc-4bde-8d73-37d350a1ee7b.
> Parent node text: Given BERT’s representation of each token, our encoder passes
the contextualized output represent...



Explain Document encoder?
The document encoder segments a document into its constituent tokens and prepends BERT's start token [CLS] followed by a special token [D] to indicate a document sequence. Unlike queries, no [mask] tokens are appended to documents. After passing this input sequence through BERT and a subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols based on a predefined list. This filtering process aims to reduce the number of embeddings per document by excluding embeddings of punctuation, as it is believed that even contextualized embeddings of punctuation are unnecessary for effectiveness.
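
The punctuation filtering step described above can be sketched as follows. This assumes the predefined list is simply the ASCII punctuation characters and that tokens and embeddings are parallel lists. Both the `filter_punctuation` helper and the toy one-dimensional embeddings are hypothetical.

```python
import string


def filter_punctuation(tokens, embeddings, skiplist=frozenset(string.punctuation)):
    """Drop the embeddings whose token appears in the punctuation skiplist."""
    kept = [(t, e) for t, e in zip(tokens, embeddings) if t not in skiplist]
    return [t for t, _ in kept], [e for _, e in kept]


# Toy document sequence with the [CLS] and [D] markers prepended.
tokens = ["[CLS]", "[D]", "late", "interaction", ",", "explained", "."]
embeddings = [[float(i)] for i in range(len(tokens))]

kept_tokens, kept_embs = filter_punctuation(tokens, embeddings)
print(kept_tokens)
print(kept_embs)
```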


Explain storing document embeddings process?
The process of storing document embeddings involves running the document encoder on batches of documents in the collection and saving the output embeddings for each document. This is done by leveraging the pruning-friendly nature of the MaxSim operations to efficiently conduct searches between the query embedding and all document embeddings across the entire collection.


What is Tok-k re-ranking with colbert?
ColBERT can be utilized for re-ranking the results produced by another retrieval model, often a term-based model, or for direct retrieval from a document collection.


How does re-ranking works in colbert?
ColBERT can be utilized for re-ranking the results of another retrieval model, often a term-based model, or it can be used directly for end-to-end retrieval from a document collection.
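
The re-ranking setup in the two answers above follows a generic pattern: a first-stage retriever proposes candidates, and a second scorer re-orders them and keeps the best k. The sketch below shows only that pattern. The term-overlap scorer is a deliberately simple stand-in for a late-interaction scorer, and all names are illustrative assumptions.

```python
def rerank_top_k(query, candidates, score_fn, k):
    """Re-score first-stage candidates and keep the k best."""
    scored = [(score_fn(query, doc), doc) for doc in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc for _, doc in scored[:k]]


def overlap_score(query, doc):
    """Toy scorer: number of distinct query terms that occur in the document."""
    return len(set(query.lower().split()) & set(doc.lower().split()))


# Candidates as they might come back from a term-based first stage.
candidates = [
    "term based retrieval baseline",
    "ColBERT late interaction retrieval",
    "late interaction with ColBERT for re-ranking",
]

top = rerank_top_k("ColBERT late interaction re-ranking", candidates, overlap_score, k=2)
print(top)
```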


Explain end-to-end retrival with colbert?
End-to-end retrieval with ColBERT involves utilizing its late-interaction operator to enable retrieval directly from a large collection. This approach aims to enhance recall compared to traditional term-based retrieval methods, ultimately improving the effectiveness of the retrieval process.


In [17]:
tru.get_leaderboard(app_ids=[])

app_id                           Answer Relevance  Context Relevance  Groundedness  latency  total_cost
Automerging Window Query Engine  0.828571          0.825              0.966667      3.625    0.000654


In [21]:
tru.run_dashboard(port=8502)

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path: None


<Popen: returncode: 1 args: ['streamlit', 'run', '--server.headless=True', '...>

In [22]:
tru.get_leaderboard(app_ids=[])

app_id                           Answer Relevance  Context Relevance  Groundedness  latency  total_cost
Automerging Window Query Engine  0.825             0.84375            0.9125        3.625    0.000654
