In [1]:
from llama_index.llms.openai import OpenAI

In [8]:
llm = OpenAI(model="gpt-4-turbo-preview", temperature=0.1)

In [9]:
from llama_index.core import SimpleDirectoryReader

In [10]:
documents = SimpleDirectoryReader(
    # input_files=["./docs/colbert.pdf"]
    input_files=["./docs/PLAID.pdf", "./docs/colbert.pdf", "./docs/colbert-v2.pdf"]
).load_data()

In [11]:
from llama_index.core import Document

In [12]:
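# Concatenate everything that was loaded into a single Document.
# Note: `document` is not used below; the index is built from the `documents` list.
document = Document(text="\n\n".join([doc.text for doc in documents]))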

In [13]:
from utils import build_automerging_index

automerging_index = build_automerging_index(
    documents,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="merging_index"
)
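`build_automerging_index` comes from the local `utils` module and is not shown in this notebook. The idea an auto-merging index relies on, a hierarchy of large parent chunks split into smaller child chunks, can be sketched in plain Python. The chunk sizes, the character-based splitting, and the function names below are assumptions made for illustration and are not taken from `utils`.

```python
def split_fixed(text, size):
    """Split text into consecutive chunks of at most `size` characters."""
    return [text[i:i + size] for i in range(0, len(text), size)]


def build_hierarchy(text, chunk_sizes=(2048, 512, 128)):
    """Return a flat list of nodes, each recording its level, text and parent index.

    `chunk_sizes` are hypothetical values, ordered from largest to smallest.
    """
    nodes = []

    def recurse(piece, level, parent):
        for chunk in split_fixed(piece, chunk_sizes[level]):
            nodes.append({"level": level, "text": chunk, "parent": parent})
            index = len(nodes) - 1
            # Split this chunk further until the smallest size is reached.
            if level + 1 < len(chunk_sizes):
                recurse(chunk, level + 1, index)

    recurse(text, 0, None)
    return nodes


nodes = build_hierarchy("x" * 4096)
leaves = [n for n in nodes if n["level"] == 2]
print(len(nodes), len(leaves))
```

Only the smallest chunks would normally be embedded and searched; the larger ones are kept so that retrieved leaves can later be swapped for their parent.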

In [14]:
from utils import get_automerging_query_engine

automerging_query_engine = get_automerging_query_engine(
    automerging_index,
)
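`get_automerging_query_engine` is also a `utils` helper that is not shown here. The `> Merging N nodes into parent node.` lines in the output further down come from the merging step. A minimal sketch of that step, assuming a simple ratio rule (replace leaves with their parent when more than a given share of the parent's children was retrieved); the 0.5 threshold and all names are hypothetical:

```python
from collections import defaultdict


def auto_merge(retrieved_leaf_ids, parent_of, children_of, ratio_threshold=0.5):
    """Replace retrieved leaves with their parent when enough siblings were retrieved."""
    by_parent = defaultdict(list)
    for leaf in retrieved_leaf_ids:
        by_parent[parent_of[leaf]].append(leaf)

    merged = []
    for parent, leaves in by_parent.items():
        ratio = len(leaves) / len(children_of[parent])
        if ratio > ratio_threshold:
            merged.append(parent)   # enough children were hit: return the parent
        else:
            merged.extend(leaves)   # too few: keep the individual leaves
    return merged


# Hypothetical hierarchy: two parents with four children each.
children_of = {"P1": ["a", "b", "c", "d"], "P2": ["e", "f", "g", "h"]}
parent_of = {child: p for p, kids in children_of.items() for child in kids}
result = auto_merge(["a", "b", "c", "e"], parent_of, children_of)
print(result)
```

Here three of the four children of `P1` were retrieved, so they are replaced by `P1`, while the single hit under `P2` stays a leaf.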

In [15]:
automerging_response = automerging_query_engine.query("explain 'maps centroids to their corresponding embedding IDs' in this paper")
print(str(automerging_response))

huggingface/tokenizers: The current process just got forked, after parallelism has already been used. Disabling parallelism to avoid deadlocks...
	- Avoid using `tokenizers` before the fork if possible
	- Explicitly set the environment variable TOKENIZERS_PARALLELISM=(true | false)
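The warning above names the variable that silences it. A minimal sketch, assuming tokenizer-level parallelism is not needed in this notebook; it would have to run before anything loads a Hugging Face tokenizer, for example in the first cell:

```python
import os

# Disable tokenizer parallelism up front so forked processes do not trigger the warning.
os.environ["TOKENIZERS_PARALLELISM"] = "false"
```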


The paper explains that to facilitate fast nearest-neighbor search, the embedding IDs corresponding to each centroid are grouped together and stored as an inverted list on disk. This approach enables quick retrieval of token-level embeddings similar to those in a query during search operations.
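The inverted list described in this answer can be sketched as a mapping from each centroid ID to the IDs of the embeddings assigned to it. The assignments below are made-up values for illustration, and the function name is hypothetical; the paper's on-disk layout is not reproduced here.

```python
from collections import defaultdict


def build_inverted_list(centroid_assignments):
    """Map each centroid ID to the list of embedding IDs assigned to it.

    centroid_assignments[i] is the centroid ID of embedding i.
    """
    inverted = defaultdict(list)
    for embedding_id, centroid_id in enumerate(centroid_assignments):
        inverted[centroid_id].append(embedding_id)
    return dict(inverted)


# Hypothetical assignments for six token embeddings and three centroids.
assignments = [2, 0, 2, 1, 0, 2]
inverted_list = build_inverted_list(assignments)

# At search time, the centroids closest to a query token select candidate embeddings.
candidates = inverted_list[2]
print(candidates)
```

Grouping IDs per centroid means a search only has to read the lists of the few centroids nearest to each query token instead of scanning every embedding.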


In [16]:
# Load one evaluation question per line, skipping blank lines.
with open('eval_questions.txt') as f:
    eval_questions = [line.strip() for line in f if line.strip()]

In [17]:
from trulens_eval import Tru
tru = Tru()

tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of Tru` to prevent this.


In [18]:
from utils import get_prebuilt_trulens_recorder

In [19]:
tru_recorder_automerging = get_prebuilt_trulens_recorder(
    automerging_query_engine,
    app_id="Automerging Window Query Engine"
)

In [23]:
for question in eval_questions:
    # Queries issued inside the recorder context are logged for evaluation.
    with tru_recorder_automerging as recording:
        response = automerging_query_engine.query(question)
        print(question)
        print(str(response))

How does offline indexing work in ColBert?
Offline indexing in ColBERT involves isolating computations between queries and documents to pre-compute document representations. The indexing procedure consists of processing documents in batches, applying the document encoder to each batch, and storing the resulting embeddings for each document.


> Merging 1 nodes into parent node.
> Parent node id: 307c5873-8d7f-465f-a88f-b6e67daffb9f.
> Parent node text: We couple those in
ColBERTv2 ,1a new late-interaction retriever that
employs a simple combination...



What is late interaction?
Late interaction is a paradigm that aims to reduce the burden on the encoder by decomposing relevance modeling into token-level computations. It encodes meaning at the level of tokens and delegates query-document matching to the interaction mechanism.
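The token-level matching this answer refers to is ColBERT's MaxSim operator: each query token embedding is matched with its most similar document token embedding, and those maxima are summed. A minimal sketch in plain Python, using tiny made-up 2-dimensional vectors rather than real BERT outputs:

```python
def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best similarity to any document token."""
    return sum(max(dot(q, d) for d in doc_embs) for q in query_embs)


# Hypothetical token embeddings (one vector per token).
query = [[1.0, 0.0], [0.0, 1.0]]
doc_a = [[1.0, 0.0], [0.0, 1.0]]   # has a close match for both query tokens
doc_b = [[1.0, 0.0]]               # matches only the first query token

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because document embeddings do not depend on the query, they can be computed offline and only this cheap interaction step runs at query time.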


Explain Query encoder?
The Query encoder takes a textual query and tokenizes it into BERT-based WordPiece tokens. It then prepends a special token [Q] to the query before encoding it.


> Merging 3 nodes into parent node.
> Parent node id: 9f7617a0-d994-44d2-9d2d-be098189e1b6.
> Parent node text: Given BERT’s representation of each token, our encoder passes
the contextualized output represent...



Explain Document encoder?
The document encoder in the provided context first segments a document into its constituent tokens, to which it prepends BERT's start token [CLS] followed by a special token [D] indicating a document sequence. Unlike queries, no [mask] tokens are appended to documents. After passing this input sequence through BERT and a subsequent linear layer, the document encoder filters out the embeddings corresponding to punctuation symbols based on a predefined list. This filtering process aims to reduce the number of embeddings per document by excluding embeddings of punctuation, as it is believed that even contextualized embeddings of punctuation are unnecessary for effectiveness.
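The two bookkeeping steps in this answer, prepending `[CLS]` and `[D]` and then dropping punctuation embeddings, can be sketched without a model. The embeddings below are placeholders, and using `string.punctuation` as the "predefined list" is an assumption; the function names are hypothetical.

```python
import string


def prepare_document_tokens(tokens):
    """Prepend BERT's [CLS] token and the [D] marker to the document tokens."""
    return ["[CLS]", "[D]"] + list(tokens)


def filter_punctuation(tokens, embeddings, punctuation=frozenset(string.punctuation)):
    """Drop every embedding whose token is a punctuation symbol."""
    kept = [(t, e) for t, e in zip(tokens, embeddings) if t not in punctuation]
    return [t for t, _ in kept], [e for _, e in kept]


tokens = prepare_document_tokens(["late", "interaction", ",", "explained", "."])
# Placeholder one-dimensional "embeddings", one per token, standing in for encoder output.
embeddings = [[float(i)] for i in range(len(tokens))]

kept_tokens, kept_embeddings = filter_punctuation(tokens, embeddings)
print(kept_tokens)
```

Filtering happens after encoding, so the remaining embeddings are still contextualized by the punctuation that was present in the input.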


Explain storing document embeddings process?
The process of storing document embeddings involves isolating computations between queries and documents in order to pre-compute document representations offline. This is done by running a document encoder on batches of documents in the collection and storing the output embeddings for each document. By leveraging the pruning-friendly nature of MaxSim operations, fast vector-similarity data structures are used to efficiently conduct searches between the query embedding and all document embeddings across the entire collection.


What is Tok-k re-ranking with colbert?
ColBERT can be utilized for re-ranking the results produced by another retrieval model, often a term-based model, or for direct end-to-end retrieval from a document collection.
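Re-ranking as described here means rescoring a short candidate list produced by a first-stage retriever. A minimal sketch, assuming the candidates' token embeddings are already available and scoring them with a MaxSim-style sum; the document IDs and vectors are made up for illustration:

```python
def maxsim_score(query_embs, doc_embs):
    """Sum, over query tokens, of the best dot product with any document token."""
    return sum(
        max(sum(a * b for a, b in zip(q, d)) for d in doc_embs)
        for q in query_embs
    )


def rerank(query_embs, candidates, top_k):
    """Rescore first-stage candidates and return the IDs of the best `top_k`."""
    scored = [(maxsim_score(query_embs, embs), doc_id) for doc_id, embs in candidates.items()]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [doc_id for _, doc_id in scored[:top_k]]


# Hypothetical candidates, e.g. returned by a term-based retriever.
query = [[1.0, 0.0], [0.0, 1.0]]
candidates = {
    "d1": [[1.0, 0.0]],
    "d2": [[1.0, 0.0], [0.0, 1.0]],
    "d3": [[0.0, 0.5]],
}
top_docs = rerank(query, candidates, top_k=2)
print(top_docs)
```

End-to-end retrieval differs only in where the candidates come from: instead of another model's results, they are found by searching the stored document embeddings directly.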


How does re-ranking works in colbert?
ColBERT can be utilized for re-ranking the results of another retrieval model, often a term-based model, or it can be directly used for end-to-end retrieval from a document collection.


Explain end-to-end retrival with colbert?
End-to-end retrieval with ColBERT involves utilizing its late-interaction operator to enable retrieval directly from a large collection. This approach aims to enhance recall compared to traditional term-based retrieval methods, ultimately improving the effectiveness of the retrieval process.


In [24]:
tru.get_leaderboard(app_ids=[])

                                 Groundedness  Answer Relevance  Context Relevance  latency  total_cost
app_id
Automerging Window Query Engine      0.889062           0.89375           0.846875   3.5625    0.000662


In [25]:
tru.run_dashboard(port=8502)

Starting dashboard ...


Dashboard started at http://10.191.2.98:8502 .


<Popen: returncode: None args: ['streamlit', 'run', '--server.headless=True'...>

