In [26]:
from llama_index.llms.openai import OpenAI

In [27]:
llm = OpenAI(model="gpt-4", temperature=0.1)

In [28]:
from llama_index.core import SimpleDirectoryReader

In [29]:
# Load all three papers; to index a single paper, pass only its path,
# e.g. input_files=["./docs/colbert.pdf"].
documents = SimpleDirectoryReader(
    input_files=["./docs/PLAID.pdf", "./docs/colbert.pdf", "./docs/colbert-v2.pdf"]
).load_data()

In [30]:
from llama_index.core import Document

In [31]:
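# Merge every loaded page into one Document so that sentence windows
# are not cut off at page or file boundaries.
document = Document(text="\n\n".join([doc.text for doc in documents]))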

In [32]:
from utils import build_sentence_window_index

sentence_index = build_sentence_window_index(
    document,
    llm,
    embed_model="local:BAAI/bge-small-en-v1.5",
    save_dir="sentence_index"
)

In [33]:
from utils import get_sentence_window_query_engine

sentence_window_engine = get_sentence_window_query_engine(sentence_index)
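
The `utils` helpers used above are not shown in this notebook. As a rough illustration of the idea behind sentence-window retrieval (each sentence is indexed on its own, but carries its neighbouring sentences as context for the LLM), here is a minimal pure-Python sketch. The function name `build_sentence_windows`, the regex-based sentence splitter, and the dictionary layout are assumptions made for illustration; they are not the implementation in `utils` or in LlamaIndex.

```python
import re


def build_sentence_windows(text, window_size=3):
    """Split text into sentences and attach a window of neighbours to each.

    Hypothetical helper: a naive regex splitter stands in for a real
    sentence tokenizer.
    """
    sentences = [s.strip() for s in re.split(r"(?<=[.!?])\s+", text) if s.strip()]
    nodes = []
    for i, sentence in enumerate(sentences):
        lo = max(0, i - window_size)
        hi = min(len(sentences), i + window_size + 1)
        nodes.append(
            {
                "sentence": sentence,  # what would be embedded and searched
                "window": " ".join(sentences[lo:hi]),  # what would be shown to the LLM
            }
        )
    return nodes


nodes = build_sentence_windows("A first. B second. C third. D fourth.", window_size=1)
for node in nodes:
    print(node["sentence"], "->", node["window"])
```

With `window_size=1`, each sentence is paired with at most one sentence on either side; a larger window gives the LLM more context per retrieved hit at the cost of longer prompts.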

In [36]:
window_response = sentence_window_engine.query("explain 'maps centroids to their corresponding embedding IDs' in this paper")
print(str(window_response))

In this paper, 'maps centroids to their corresponding embedding IDs' refers to the process of assigning each output embedding from the BERT encoder to the nearest centroid. These centroids are essentially representative points in the embedding space. The IDs of these embeddings are then grouped together based on their corresponding centroid. This grouping forms an inverted list, which is saved to disk to support fast nearest-neighbor search. This allows for quick identification of token-level embeddings that are similar to those in a query during the search process.
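
The grouping described in the response above (assign each embedding to its nearest centroid, then collect embedding IDs per centroid) can be sketched in a few lines. This is only an illustration under simplifying assumptions: the 2-dimensional vectors are made up, squared Euclidean distance is used for the nearest-centroid test, and `build_inverted_list` is a hypothetical name, not a function from ColBERT or PLAID.

```python
from collections import defaultdict


def squared_distance(a, b):
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def build_inverted_list(embeddings, centroids):
    """Map each centroid index to the IDs of the embeddings nearest to it."""
    inverted = defaultdict(list)
    for emb_id, emb in enumerate(embeddings):
        nearest = min(
            range(len(centroids)),
            key=lambda c: squared_distance(emb, centroids[c]),
        )
        inverted[nearest].append(emb_id)
    return dict(inverted)


# Toy data, invented for illustration only.
centroids = [(0.0, 0.0), (1.0, 1.0)]
embeddings = [(0.1, 0.0), (0.9, 1.1), (0.2, -0.1), (1.0, 0.8)]

inverted_list = build_inverted_list(embeddings, centroids)
print(inverted_list)
```

At search time, a structure like this lets the system look up only the embedding IDs stored under the centroids closest to the query, instead of scanning every embedding.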


In [37]:
with open('eval_questions.txt') as f:
    # One question per line; skip blank lines so no empty query is sent to the engine.
    eval_questions = [line.strip() for line in f if line.strip()]

In [38]:
from trulens_eval import Tru
tru = Tru()

# Clear records from earlier runs so the leaderboard reflects only this session.
tru.reset_database()

🦑 Tru initialized with db url sqlite:///default.sqlite .
🛑 Secret keys may be written to the database. See the `database_redact_keys` option of Tru` to prevent this.


In [40]:
from utils import get_prebuilt_trulens_recorder

In [41]:
tru.reset_database()

tru_recorder_sentence_window = get_prebuilt_trulens_recorder(
    sentence_window_engine,
    app_id = "Sentence Window Query Engine"
)

In [42]:
for question in eval_questions:
    with tru_recorder_sentence_window as recording:
        response = sentence_window_engine.query(question)
        print(question)
        print(str(response))

How does offline indexing work in ColBert?
In ColBERT, offline indexing involves computing and storing document embeddings. This process is designed to isolate most of the computations between queries and documents, which allows for pre-computing document representations. The indexing procedure is straightforward: the system goes over the documents in the collection in batches, runs the document encoder on each batch, and stores the output embeddings for each document. Although this is an offline process, several optimizations are incorporated to enhance the throughput of indexing. These optimizations can significantly reduce the offline cost of indexing.


What is late interaction?
Late interaction is an alternative to the single-vector similarity paradigm in information retrieval. Introduced in the ColBERT model, it involves encoding queries and documents at a finer granularity into multi-vector representations. Relevance is then estimated using interactions between these sets of vectors. In this approach, an embedding is produced for every token in the query and document, and relevance is modeled as the sum of maximum similarities between each query vector and all vectors in the document. This method aims to reduce the burden on the encoder by decomposing relevance modeling into token-level computations. It encodes meaning at the level of tokens and delegates query-document matching to the interaction mechanism. However, this added expressivity comes with a larger space footprint than single-vector models, as it requires storing billions of small vectors for web-scale collections.


Explain Query encoder?
The Query Encoder is a component of the ColBERT system that processes textual queries. It tokenizes the query into its BERT-based WordPiece tokens. To distinguish the input sequences that correspond to queries, a special token [Q] is prepended to the query. This process is done before the late interaction stage in the system.


Explain Document encoder?
A document encoder is a part of the system that processes a document by breaking it down into its constituent tokens. It begins by adding BERT's start token [CLS] and a special token [D] to indicate a document sequence. Unlike queries, documents do not have [mask] tokens appended to them. The input sequence is then passed through BERT and a subsequent linear layer. The document encoder then filters out the embeddings that correspond to punctuation symbols, which are identified through a predefined list. This filtering process is designed to reduce the number of embeddings per document, as it is hypothesized that embeddings of punctuation, even when contextualized, are not necessary for effectiveness. The final output is a bag of embeddings for the document.


Explain storing document embeddings process?
The process of storing document embeddings involves running the document encoder on batches of documents in the collection. The output embeddings for each document are then stored. This process, known as indexing, is done offline. To enhance the throughput of indexing, some simple optimizations are incorporated. The embeddings are then transferred from the CPU to the GPU, which can be the most expensive step in re-ranking with ColBERT. Finally, the output embeddings are normalized so that each has an L2 norm equal to one. This makes the dot-product of any two embeddings equivalent to their cosine similarity, falling in the range of -1 to 1.


What is Tok-k re-ranking with colbert?
Top-k re-ranking with ColBERT is a process where ColBERT is used to rank a small set of k documents (for example, k=1000) given a query. This is typically done after the output of another retrieval model, often a term-based model, or it can be used directly for end-to-end retrieval from a document collection. The process involves loading the indexed document representations into memory, representing each document as a matrix of embeddings. A query is then computed into its bag of contextualized embeddings and the document representations are gathered into a 3-dimensional tensor consisting of k document matrices. This method relies on batch computations to exhaustively score each document.


How does re-ranking works in colbert?
Re-ranking in ColBERT works by leveraging the model on top of a term-based retrieval model. This process involves re-ranking the top results extracted by a bag-of-words retrieval model, which is a common setting for testing and deploying neural ranking models.


Explain end-to-end retrival with colbert?
End-to-end retrieval with ColBERT involves several steps. First, a query and a document are given, represented as q and d respectively. These are then processed to compute the bags of embeddings Eq and Ed. This is done by normalizing the output of a convolutional neural network (CNN) applied to BERT embeddings of the query and document. The document embeddings are further filtered to reduce the number of embeddings per document. 

The relevance score of the document to the query, denoted as Sq,d, is then estimated through a late interaction between their bags of contextualized embeddings. This interaction is conducted as a sum of maximum similarity computations, such as cosine similarity or squared L2 distance. 

ColBERT is differentiable end-to-end and is fine-tuned using the Adam optimizer. The interaction mechanism has no trainable parameters. Given a triple with a query, a positive document, and a negative document, ColBERT is used to produ
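
Several of the answers printed by this cell describe late interaction as a sum of maximum similarities over L2-normalized embeddings. A minimal sketch of that scoring rule follows, assuming tiny hand-written 2-dimensional vectors in place of real BERT outputs; `maxsim_score` and the other names are illustrative and are not taken from the ColBERT codebase.

```python
import math


def l2_normalize(vec):
    """Scale a non-zero vector to unit L2 norm."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def dot(a, b):
    """Dot product; equals cosine similarity for unit-norm inputs."""
    return sum(x * y for x, y in zip(a, b))


def maxsim_score(query_embs, doc_embs):
    """For each query vector take its best match in the document, then sum."""
    q = [l2_normalize(v) for v in query_embs]
    d = [l2_normalize(v) for v in doc_embs]
    return sum(max(dot(qv, dv) for dv in d) for qv in q)


# Toy embeddings, invented for illustration only.
query = [[1.0, 0.0], [0.0, 2.0]]
doc_a = [[3.0, 0.0], [0.0, 1.0], [1.0, 1.0]]  # has a close match for each query vector
doc_b = [[1.0, 1.0]]                           # one vector halfway between them

score_a = maxsim_score(query, doc_a)
score_b = maxsim_score(query, doc_b)
print(score_a, score_b)
```

Because every vector is normalized first, each per-token similarity lies between -1 and 1, so the score of a document is bounded by the number of query vectors. The sketch does not handle zero vectors.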


In [43]:
tru.get_leaderboard(app_ids=[])

app_id                        Answer Relevance  Groundedness  Context Relevance  latency  total_cost
Sentence Window Query Engine            0.9125      0.789756            0.75625     9.25    0.027877


In [51]:
tru.run_dashboard()

Starting dashboard ...
Config file already exists. Skipping writing process.
Credentials file already exists. Skipping writing process.
Dashboard already running at path: None


<Popen: returncode: 1 args: ['streamlit', 'run', '--server.headless=True', '...>

In [52]:
tru.get_leaderboard(app_ids=[])

app_id                        Answer Relevance  Groundedness  Context Relevance  latency  total_cost
Sentence Window Query Engine            0.9125      0.789756            0.75625     9.25    0.027877
