Why have I created this repo? I like to read a lot: papers, blog posts, twitter threads, notebooks, you name it. When I read, i've been quite good at documenting it through my blog. The blog lets me cement my thoughts, improve retention, and log what i've read. it can be quite tedious work, and sometimes I simply don't have the time, or rather the capacity, to do it, but i've been quite good at it, at least up until a couple of months ago. now, even though I sometimes write quite extensive summaries or thoughts about papers, i've been pretty poor at looking back at my notes when i need to remember something from a paper i read, and i attribute this mainly to a lack of **indexing**. the process often looks like this: "ooh, i know i've read something about X in a paper" *scrolls through all [blog entries](https://leonericsson.github.io/indexer), doesn't find what i'm looking for immediately and just gives up*. yes, i could just become more patient and keep looking, but why not take this opportunity to build something instead! this is the motivation for this repo.

my initial plan. before doing any valuable research, here's what i **think** is the way i want to solve this problem, including some questions that i don't yet know the answer to:

1. save the link to every piece of research content i've read, that i can recall and find (i've already got a decent chunk saved).
2. download the content.
    - should everything be standardized into a format?
    - do i need to be careful to exclude irrelevant information from the source?
    - how do i deal with twitter threads?
3. decide on an embedding model
    - what is a good context size? relates to how we plan to chunk a 20 page paper.
4. embed content
    - how do we deal with long format texts e.g. papers?
    - does the search engine need to rely on more than just a single embedding model?
5. quantize embeddings to int8
    - does this process impact the selected embedding model?
6. index the embeddings
7. re-ranker
8. search engine
    - how do we link the source content + source url to the search results? especially considering that we will surely have multiple embedding points from the same source content.
    
okay, this seems like a good place to start, don't you think?
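
as a rough illustration of the int8 quantization step in the plan, here is a minimal numpy sketch of per-dimension scalar quantization. the helper names `quantize_int8` and `dequantize_int8` and the random stand-in embeddings are assumptions made up for illustration. sentence-transformers also ships `quantize_embeddings` with a `precision` argument, and this sketch makes no claim about how that function works internally.

```python
import numpy as np

def quantize_int8(embeddings: np.ndarray):
    """Map each dimension's observed [min, max] range onto the int8 range [-128, 127]."""
    lo = embeddings.min(axis=0)
    hi = embeddings.max(axis=0)
    # one step per int8 level; fall back to 1.0 for constant dimensions to avoid dividing by zero
    step = np.where(hi > lo, (hi - lo) / 255.0, 1.0)
    q = np.clip(np.round((embeddings - lo) / step) - 128, -128, 127)
    return q.astype(np.int8), lo, step

def dequantize_int8(q: np.ndarray, lo: np.ndarray, step: np.ndarray) -> np.ndarray:
    """Approximately invert quantize_int8."""
    return (q.astype(np.float32) + 128) * step + lo

# hypothetical stand-in for real embeddings: 4 vectors with 8 dimensions
rng = np.random.default_rng(0)
emb = rng.normal(size=(4, 8)).astype(np.float32)

q, lo, step = quantize_int8(emb)
approx = dequantize_int8(q, lo, step)
print(emb.nbytes, "->", q.nbytes, "bytes")
```

each value shrinks from 4 bytes to 1, and the reconstruction error per dimension is at most half a quantization step. whether that error matters for retrieval quality depends on the embedding model, which is exactly the open question in the plan.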

**comparing embedding models**

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.quantization import quantize_embeddings

# 1. Specify preferred dimensions
dimensions = 512

# 2. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)

# For retrieval you need to pass this prompt.
query = 'Represent this sentence for searching relevant passages: are all layers in a transformer equally important?'

docs = [
    query,
    """Within the field of vector search, an intriguing development has arisen: binary vector search. This approach shows promise in tackling the long-standing issue of memory consumption by achieving a remarkable 30x reduction. However, a critical aspect that sparks debate is its effect on accuracy.

We believe that using binary vector search, along with specific optimization techniques, can maintain similar accuracy. To provide clarity on this subject, we showcase a series of experiments that will demonstrate the effects and implications of this approach.""",
    """We empirically study a simple layer-pruning strategy for popular families of openweight pretrained LLMs, finding minimal degradation of performance on different
question-answering benchmarks until after a large fraction (up to half) of the layers
are removed. To prune these models, we identify the optimal block of layers to
prune by considering similarity across layers; then, to “heal” the damage, we
perform a small amount of finetuning. In particular, we use parameter-efficient
finetuning (PEFT) methods, specifically quantization and Low Rank Adapters
(QLoRA), such that each of our experiments can be performed on a single A100
GPU. From a practical perspective, these results suggest that layer pruning methods
can complement other PEFT strategies to further reduce computational resources of
finetuning on the one hand, and can improve the memory and latency of inference
on the other hand. From a scientific perspective, the robustness of these LLMs
to the deletion of layers implies either that current pretraining methods are not
properly leveraging the parameters in the deeper layers of the network or that the
shallow layers play a critical role in storing knowledge.""",
    """In this work, we introduce a significant 1-bit LLM variant called BitNet b1.58, where every parameter
is ternary, taking on values of {-1, 0, 1}. We have added an additional value of 0 to the original 1-bit
BitNet, resulting in 1.58 bits in the binary system. BitNet b1.58 retains all the benefits of the original
1-bit BitNet, including its new computation paradigm, which requires almost no multiplication
operations for matrix multiplication and can be highly optimized. Additionally, it has the same energy
consumption as the original 1-bit BitNet and is much more efficient in terms of memory consumption,
throughput and latency compared to FP16 LLM baselines.""",
]

# 3. Encode
embeddings = model.encode(docs)

# Optional: Quantize the embeddings
binary_embeddings = quantize_embeddings(embeddings, precision="ubinary")

similarities = cos_sim(embeddings[0], embeddings[1:])
print('similarities:', similarities)


In [None]:
import torch.nn.functional as F
from sentence_transformers import SentenceTransformer

matryoshka_dim = 512

model = SentenceTransformer("nomic-ai/nomic-embed-text-v1.5", trust_remote_code=True)

sentences = [
    """search_document: Within the field of vector search, an intriguing development has arisen: binary vector search. This approach shows promise in tackling the long-standing issue of memory consumption by achieving a remarkable 30x reduction. However, a critical aspect that sparks debate is its effect on accuracy.

We believe that using binary vector search, along with specific optimization techniques, can maintain similar accuracy. To provide clarity on this subject, we showcase a series of experiments that will demonstrate the effects and implications of this approach.""",
    """search_document: We empirically study a simple layer-pruning strategy for popular families of openweight pretrained LLMs, finding minimal degradation of performance on different
question-answering benchmarks until after a large fraction (up to half) of the layers
are removed. To prune these models, we identify the optimal block of layers to
prune by considering similarity across layers; then, to “heal” the damage, we
perform a small amount of finetuning. In particular, we use parameter-efficient
finetuning (PEFT) methods, specifically quantization and Low Rank Adapters
(QLoRA), such that each of our experiments can be performed on a single A100
GPU. From a practical perspective, these results suggest that layer pruning methods
can complement other PEFT strategies to further reduce computational resources of
finetuning on the one hand, and can improve the memory and latency of inference
on the other hand. From a scientific perspective, the robustness of these LLMs
to the deletion of layers implies either that current pretraining methods are not
properly leveraging the parameters in the deeper layers of the network or that the
shallow layers play a critical role in storing knowledge.""",
    """search_document: In this work, we introduce a significant 1-bit LLM variant called BitNet b1.58, where every parameter
is ternary, taking on values of {-1, 0, 1}. We have added an additional value of 0 to the original 1-bit
BitNet, resulting in 1.58 bits in the binary system. BitNet b1.58 retains all the benefits of the original
1-bit BitNet, including its new computation paradigm, which requires almost no multiplication
operations for matrix multiplication and can be highly optimized. Additionally, it has the same energy
consumption as the original 1-bit BitNet and is much more efficient in terms of memory consumption,
throughput and latency compared to FP16 LLM baselines.""",
]
embeddings = model.encode(sentences, convert_to_tensor=True)
embeddings = F.layer_norm(embeddings, normalized_shape=(embeddings.shape[1],))
#embeddings = embeddings[:, :matryoshka_dim]
#embeddings = F.normalize(embeddings, p=2, dim=1)

print(embeddings)


In [None]:
query = "search_query: are all layers in a transformer equally important?"
query_embedding = model.encode(query, convert_to_tensor=True)

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim

model = SentenceTransformer(
    "jinaai/jina-embeddings-v2-base-en", # switch to en/zh for English or Chinese
    trust_remote_code=True
)

# control your input sequence length up to 8192
model.max_seq_length = 1024

embeddings = model.encode([
    'How is the weather today?',
    'What is the current weather like today?'
])
print(cos_sim(embeddings[0], embeddings[1]))


I spent the last 30 minutes comparing embedding models before realizing that this doesn't matter right now. we'll come back to this decision when we have a framework to populate the embedding space; at this stage any decent model will do. i'll leave the links to the candidates here for future reference:

1. https://huggingface.co/mixedbread-ai/mxbai-embed-large-v1
2. https://huggingface.co/jinaai/jina-embeddings-v2-base-en
3. https://huggingface.co/nomic-ai/nomic-embed-text-v1.5

for no specific reason let's move forward with mixedbread's embedder for now..


---

now that we've settled on an embedding model, the next major consideration is how to handle the varying content lengths of our documents. i want to be very specific in my search queries, far more so than what embedding only the abstract of a paper allows. for instance, i know that the llama 2 paper contains detailed ablations comparing mqa, gqa, and mha, and i should easily be able to query for that. so what granularity of document chunking does this specificity require? i don't know, but we'll find out together! starting off with the most naive approach: just embed the entire document!

let's download and store our content

In [None]:
import requests
import fitz  # PyMuPDF
from bs4 import BeautifulSoup
from typing import Tuple

def download(url: str) -> Tuple[str, str]:
    """Download content from the given URL, determine its type, and extract the title and text."""
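    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    # fall back to an empty string if the server sends no Content-Type header
    content_type = response.headers.get('Content-Type', '')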

    def _parse_html(content: bytes) -> Tuple[str, str]:
        """
        Extract the title and text content from HTML data.
        """
        soup = BeautifulSoup(content, 'html.parser')
        # not every page has a <title> tag
        title_tag = soup.find('title')
        title = title_tag.text.strip() if title_tag else 'unknown'
        text = soup.get_text()
        
        text = text.replace('\n', ' ')
        text = ' '.join(text.split())
        
        return title, text
    
    def _parse_pdf(content: bytes) -> Tuple[str, str]:
        """
        Extract the title and text content from a PDF file.
        """
        filename = 'document.pdf'
        with open(filename, 'wb') as f:
            f.write(content)
        
        doc = fitz.open(filename)
        text = ""
        for page in doc:
            text += page.get_text()
        
        text = text.replace('\n', ' ')
        text = ' '.join(text.split())

        # Extract title from metadata or the first block of text
        metadata = doc.metadata
        title = metadata.get('title', '')
        if not title:
            first_page = doc[0]
            blocks = first_page.get_text("blocks")
            
            # Assuming the title is in the first block
            if blocks:
                title = blocks[0][4].strip()
            else:
                title = 'unknown'

        return title, text

    if 'text/html' in content_type:
        title, text = _parse_html(response.content)
    elif 'application/pdf' in content_type:
        title, text = _parse_pdf(response.content)
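    else:
        raise ValueError(f'Unsupported content type: {content_type}')
    
    # Save content to file (assumes data/ exists); a '/' in the title would be read as a directory separator
    with open(f"data/{title.replace('/', '-')}.txt", 'w') as f:
        f.write(text)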

    return title, text

the above is straightforward. most of our content will either be html parsable by beautifulsoup or a pdf, which we can handle using this [pymupdf](https://pymupdf.readthedocs.io/en/latest/the-basics.html) package i found. i’ve future-proofed by implementing download(url), which takes a url and downloads the content to a .txt file. we’ll revisit the save format in the future; perhaps some kind of json structure, given that we’ll want to link content chunks to urls. however, that is for future me to solve.
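
to make the json idea a bit more concrete, here is a minimal sketch of what saving one record per document could look like. the function name `save_record`, the record fields, and the slug rule for filenames are all assumptions of mine for illustration, not something the notebook settles on.

```python
import json
import re
import tempfile
from pathlib import Path

def save_record(url: str, title: str, text: str, out_dir: str = "data") -> Path:
    """Write one downloaded document as a JSON file and return its path."""
    # keep only filename-safe characters from the title (hypothetical naming rule)
    slug = re.sub(r"[^A-Za-z0-9]+", "-", title).strip("-").lower() or "untitled"
    path = Path(out_dir) / f"{slug}.json"
    path.parent.mkdir(parents=True, exist_ok=True)
    # storing the url next to the text keeps the link back to the source
    record = {"url": url, "title": title, "text": text}
    path.write_text(json.dumps(record, ensure_ascii=False, indent=2), encoding="utf-8")
    return path

# usage with a throwaway directory and a made-up document
with tempfile.TemporaryDirectory() as tmp:
    path = save_record("https://example.com/post", "An Example: Post!", "body text", out_dir=tmp)
    loaded = json.loads(path.read_text(encoding="utf-8"))

print(path.name, loaded["url"])
```

keeping the url inside the same file as the text means any chunk that can be traced to a file can also be traced to its source.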

In [None]:
download("https://arxiv.org/pdf/2401.01325")
download("https://arxiv.org/pdf/2310.02207")
download("https://huggingface.co/blog/moe#:~:text=Mixture%20of%20Experts%20enable%20models,budget%20as%20a%20dense%20model.")

i've downloaded three documents (two papers and a blog post) that i'm going to embed

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.quantization import quantize_embeddings

# 1. Specify preferred dimensions
dimensions = 512

# 2. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)

query_prefix = 'Represent this sentence for searching relevant passages: '

In [None]:
import os

# load all documents in the given folder, returning a list of text content
def load(dir: str) -> list[str]:
    data = []
    for filename in os.listdir(dir):
        # build the path from the dir argument instead of hard-coding 'data'
        with open(os.path.join(dir, filename), 'r') as f:
            data.append(f.read())
    return data

In [None]:
documents = load('data')
db_embedding = model.encode(documents)

In [None]:
def search(query: str):
    """Return the cosine similarity between the query and every document, as a tensor of shape (n_documents, 1)."""
    query = query_prefix + query
    query_embedding = model.encode(query)
    similarities = cos_sim(db_embedding, query_embedding)
    return similarities

we've got three documents embedded in our database. this is a list of specific concepts from one of them:

1. positional out-of-distribution
2. passkey retrieval
3. context extension without fine-tuning
4. RoPE
5. selfextend

In [None]:
search_queries = ["positional out-of-distribution", "passkey retrieval",
                  "context extension without fine-tuning", "RoPE",
                  "selfextend"]

for sq in search_queries:
    print(f"Query: {sq}")
    results = search(sq)
    for i, r in enumerate(results):
        print(f"Document {i+1}: {r}")
    print()

The concepts are all extracted from Document 3. Note how much the similarity between the query and Document 3 wavers from one query to the next.

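Let's see what happens if we split each document into chunks of N characters, linking each chunk back to its source document. This should drastically increase similarity, since a short chunk is far more focused than a whole document. Keep in mind that embedding models truncate input beyond their maximum sequence length, so embedding an entire paper really only represents its beginning.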

In [None]:
def chunk(documents: list[str], N: int) -> Tuple[list[str], list[int]]:
    """ Split documents into segments of N characters each. """
    prev_chunk_index = 0
    chunk_index = []
    document_chunks = []
    for doc in documents:
        # slice the document into N-character segments
        chunks = [doc[i:i+N] for i in range(0, len(doc), N)]
        # drop a trailing segment shorter than N (the `chunks and` guards against empty documents)
        if chunks and len(chunks[-1]) != N:
            chunks.pop()
        
        # cumulative number of chunks up to and including this document
        chunk_index.append(prev_chunk_index + len(chunks))
        document_chunks.extend(chunks)
        prev_chunk_index = chunk_index[-1]

    return document_chunks, chunk_index

Split the documents into fixed-size segments. For each document, create segments of a specified length, ensuring uniformity by discarding smaller, incomplete segments. Maintain a cumulative count of segments for each document and compile all segments into a single list. The function returns both the list of segments and the cumulative segment indices for tracking purposes.
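
To make the cumulative indices concrete, here is a small worked example of how a global chunk position maps back to its source document. The chunk counts and the helper name `chunk_to_document` are made up for illustration; the lookup itself is just `bisect.bisect_right` from the standard library.

```python
import bisect

# cumulative chunk counts for three hypothetical documents with 3, 2 and 4 chunks
chunk_index = [3, 5, 9]

def chunk_to_document(chunk_id: int, chunk_index: list[int]) -> int:
    """Map a global chunk position to the position of its source document."""
    # bisect_right counts how many cumulative totals are <= chunk_id,
    # which is the number of documents that end before this chunk
    return bisect.bisect_right(chunk_index, chunk_id)

mapping = [chunk_to_document(i, chunk_index) for i in range(9)]
print(mapping)  # [0, 0, 0, 1, 1, 2, 2, 2, 2]
```

Chunks 0 to 2 belong to the first document, 3 and 4 to the second, and 5 to 8 to the third.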

In [None]:
import numpy
import bisect

def search_top_k(query: str, top_k: int, n_chunks: list[int]) -> list[int]:
    """
    Given a query, return the source document (by position in n_chunks)
    of each of the top-k most similar chunks.
    """
    query = query_prefix + query
    query_embedding = model.encode(query)
    # cos_sim returns shape (n_chunks, 1); flatten so that argsort ranks the chunks
    similarities = cos_sim(db_embedding, query_embedding).numpy().flatten()
    # argsort is ascending, so reverse it to put the most similar chunks first
    top_k_indices = numpy.argsort(similarities)[::-1][:top_k]
    
    source_documents = []
    for ind in top_k_indices:
        doc = bisect.bisect_right(n_chunks, ind)
        source_documents.append(doc)

    return source_documents

i'm still uncertain about the implementation of search; it depends on what we want to display as the results of a query. in its current state, search will return the documents that match the top_k chunks in the embedding search, without filtering out multiple pointers to the same source.

In [34]:
# static per db
documents = load('data')
chunks, n_chunks = chunk(documents, 512) 
db_embedding = model.encode(chunks)

In [None]:
search_queries = ["positional out-of-distribution", "passkey retrieval",
                  "context extension without fine-tuning", "RoPE",
                  "selfextend"]

for sq in search_queries:
    print(f"Query: {sq}")
    result = search_top_k(sq, 3, n_chunks)
    print(f"Top-3 documents: {result}")
    print()

The results aren't surprising, but encouraging nonetheless. We need to remember that we're going to introduce a re-ranking algorithm around here as well. We'll re-rank on the chunked embeddings. Then comes the question of whether to disregard additional chunks that point to the same source document. This depends on what we want to display in the search results: just the source document, or the best matching chunk(s) as well.

you know what, on second thought, after re-ranking the embedding search results we could actually: (1) **sum** the similarity scores per source document or (2) count the number of pointers to the source document! This would improve the search, prioritizing documents that contain multiple matches to the query.
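
a tiny worked example shows that the two options can disagree. the document ids and scores below are invented, and `aggregate` is a hypothetical helper written only for this comparison.

```python
from collections import defaultdict

def aggregate(doc_ids: list[int], scores: list[float], mode: str = "sum") -> list[tuple[int, float]]:
    """Combine chunk-level matches per source document, best document first."""
    totals = defaultdict(float)
    for doc, score in zip(doc_ids, scores):
        # "sum" adds up similarity scores, anything else counts pointers
        totals[doc] += score if mode == "sum" else 1.0
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# made-up top-3 chunk matches: one strong hit in doc 0, two weak hits in doc 1
doc_ids = [0, 1, 1]
scores = [0.95, 0.30, 0.30]

by_sum = aggregate(doc_ids, scores, mode="sum")
by_count = aggregate(doc_ids, scores, mode="count")
print(by_sum)
print(by_count)
```

summing ranks doc 0 first because of its single strong match, while counting ranks doc 1 first because it has more matches. which behaviour is preferable depends on how much we trust a single high-scoring chunk.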

let's gather everything that we're using down below

In [62]:
import numpy as np
import os

import requests
import fitz  # PyMuPDF
from bs4 import BeautifulSoup
from typing import Tuple

def download(url: str) -> Tuple[str, str]:
    """Download content from the given URL, determine its type, and extract the title and text."""
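    response = requests.get(url, timeout=30)
    response.raise_for_status()  # fail early on HTTP errors
    # fall back to an empty string if the server sends no Content-Type header
    content_type = response.headers.get('Content-Type', '')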

    def _parse_html(content: bytes) -> Tuple[str, str]:
        """
        Extract the title and text content from HTML data.
        """
        soup = BeautifulSoup(content, 'html.parser')
        # not every page has a <title> tag
        title_tag = soup.find('title')
        title = title_tag.text.strip() if title_tag else 'unknown'
        text = soup.get_text()
        
        text = text.replace('\n', ' ')
        text = ' '.join(text.split())
        
        return title, text
    
    def _parse_pdf(content: bytes) -> Tuple[str, str]:
        """
        Extract the title and text content from a PDF file.
        """
        filename = 'document.pdf'
        with open(filename, 'wb') as f:
            f.write(content)
        
        doc = fitz.open(filename)
        text = ""
        for page in doc:
            text += page.get_text()
        
        text = text.replace('\n', ' ')
        text = ' '.join(text.split())

        # Extract title from metadata or the first block of text
        metadata = doc.metadata
        title = metadata.get('title', '')
        if not title:
            first_page = doc[0]
            blocks = first_page.get_text("blocks")
            
            # Assuming the title is in the first block
            if blocks:
                title = blocks[0][4].strip()
            else:
                title = 'unknown'

        return title, text

    if 'text/html' in content_type:
        title, text = _parse_html(response.content)
    elif 'application/pdf' in content_type:
        title, text = _parse_pdf(response.content)
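    else:
        raise ValueError(f'Unsupported content type: {content_type}')
    
    # Save content to file (assumes data/ exists); a '/' in the title would be read as a directory separator
    with open(f"data/{title.replace('/', '-')}.txt", 'w') as f:
        f.write(text)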

    return title, text

def load(dir: str) -> list[str]:
    """ Load all documents in the data folder, returning a list of text content."""
    data = []
    for filename in os.listdir(dir):
        # build the path from the dir argument instead of hard-coding 'data'
        with open(os.path.join(dir, filename), 'r') as f:
            data.append(f.read())
    return data

def chunk(documents: list[str], N: int) -> Tuple[list[str], list[int]]:
    """ Split documents into segments of N characters each. """
    prev_chunk_index = 0
    chunk_index = []
    document_chunks = []
    for doc in documents:
        # slice the document into N-character segments
        chunks = [doc[i:i+N] for i in range(0, len(doc), N)]
        # drop a trailing segment shorter than N (the `chunks and` guards against empty documents)
        if chunks and len(chunks[-1]) != N:
            chunks.pop()
        
        # cumulative number of chunks up to and including this document
        chunk_index.append(prev_chunk_index + len(chunks))
        document_chunks.extend(chunks)
        prev_chunk_index = chunk_index[-1]

    return document_chunks, chunk_index

def search_top_k(query: str, top_k: int) -> Tuple[list[int], list[int]]:
    """
    Given a query, find top-k document chunk matches by cosine similarity.
    """
    query = query_prefix + query
    query_embedding = model.encode(query)
    similarities = cos_sim(db_embedding, query_embedding).numpy()
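    # sort along the chunk axis (ascending), keep the last top_k and reverse so the best match comes first
    top_k_indices = np.argsort(similarities, axis=0)[-top_k:][::-1]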
    
    return top_k_indices.flatten(), similarities[top_k_indices].flatten()

def aggregate_and_sort(documents, scores):
    """ Aggregate scores by source document and sort the documents by new scores."""

    unique_docs, inverse_indices = np.unique(documents, return_inverse=True)
    
    # aggregate scores 
    aggregated_scores = np.bincount(inverse_indices, weights=scores)
    
    # sort descending order
    sorted_indices = np.argsort(-aggregated_scores)
    
    sorted_documents = unique_docs[sorted_indices]
    sorted_scores = aggregated_scores[sorted_indices]
    
    return sorted_documents, sorted_scores

def search(query: str, top_k: int, n_chunks: list[int]) -> Tuple[np.ndarray, np.ndarray]:
    """
    Search the embedding database for the documents most relevant to the query,
    returning document positions and their aggregated scores, best first.
    """
    
    top_k_indices, scores = search_top_k(query, top_k)
    # map each chunk index to its source document, then aggregate the scores per document
    top_k_documents = np.searchsorted(n_chunks, top_k_indices, side='right')
    top_k_documents = aggregate_and_sort(top_k_documents, scores)

    return top_k_documents
    

In [None]:
from sentence_transformers import SentenceTransformer
from sentence_transformers.util import cos_sim
from sentence_transformers.quantization import quantize_embeddings

# 1. Specify preferred dimensions
dimensions = 512

# 2. load model
model = SentenceTransformer("mixedbread-ai/mxbai-embed-large-v1", truncate_dim=dimensions)

query_prefix = 'Represent this sentence for searching relevant passages: '

# load data and initialize embedded db
documents = load('data')
chunks, n_chunks = chunk(documents, 512) 
db_embedding = model.encode(chunks)

In [63]:
search("positional out-of-distribution", 3, n_chunks)

(array([2]), array([2.03844404]))

this is everything we've got so far. i've cleaned up the search code; it now aggregates similarity scores pertaining to the same source document. we also have easy access to the chunk matches if we need them down the line.

we're still missing a re-ranking algorithm. 
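
as a placeholder for that step, here is a minimal sketch of where a re-ranker could slot in. `rerank` takes any function that scores a (query, passage) pair. the `word_overlap` scorer is a toy stand-in i made up so the example runs on its own; a real setup would more likely plug in a cross-encoder, for instance the `predict` method of sentence-transformers' `CrossEncoder`, which is an assumption about a future choice rather than something decided here.

```python
from typing import Callable, Sequence

def rerank(query: str, passages: Sequence[str],
           score_fn: Callable[[str, str], float], top_n: int = 3) -> list[tuple[int, float]]:
    """Score every (query, passage) pair and return the top_n (index, score) pairs."""
    scored = [(i, score_fn(query, p)) for i, p in enumerate(passages)]
    # highest score first; ties keep their original retrieval order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:top_n]

def word_overlap(query: str, passage: str) -> float:
    """Toy stand-in scorer: fraction of query words that appear in the passage."""
    q = set(query.lower().split())
    p = set(passage.lower().split())
    return len(q & p) / len(q) if q else 0.0

# made-up candidate chunks, as if returned by the embedding search
candidates = [
    "rope encodes relative position",
    "passkey retrieval tests long context",
    "mixture of experts routing",
]
ranked = rerank("passkey retrieval", candidates, word_overlap, top_n=2)
print(ranked)  # [(1, 1.0), (0, 0.0)]
```

the indices in the result refer to positions in the candidate list, so they can be mapped back to chunks and source documents the same way as the embedding search results.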

i'm keen on changing the whole n_chunks / document setup we've got right now. it feels like a band-aid solution that isn't going to hold up long term. we need a better way to connect documents in /data to chunks. there should also be a link to the source url, saved to disk alongside the embedding database, and the whole thing needs to be extendable.
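
one possible direction, sketched under my own assumptions: store an explicit `doc_id` on every chunk instead of relying on cumulative counts, and keep the url on the document entry. the class name `ChunkIndex`, its fields, and the json layout are hypothetical. unlike the current `chunk` function, this sketch keeps the short trailing chunk.

```python
import json
from dataclasses import dataclass, field, asdict

@dataclass
class ChunkIndex:
    documents: list[dict] = field(default_factory=list)  # one entry per source document
    chunks: list[dict] = field(default_factory=list)     # one entry per embedded chunk

    def add_document(self, url: str, title: str, text: str, size: int = 512) -> list[str]:
        """Register a document and return its chunk texts, ready to be embedded."""
        doc_id = len(self.documents)
        self.documents.append({"id": doc_id, "url": url, "title": title})
        pieces = []
        for start in range(0, len(text), size):
            piece = text[start:start + size]
            # the order of self.chunks is meant to match the rows of the embedding matrix
            self.chunks.append({"doc_id": doc_id, "start": start, "end": start + len(piece)})
            pieces.append(piece)
        return pieces

    def source_of(self, chunk_id: int) -> dict:
        """Look up the source document (including its url) for a chunk."""
        return self.documents[self.chunks[chunk_id]["doc_id"]]

    def to_json(self) -> str:
        return json.dumps(asdict(self))

    @classmethod
    def from_json(cls, payload: str) -> "ChunkIndex":
        return cls(**json.loads(payload))

# made-up documents: 1000 and 600 characters give two chunks each
index = ChunkIndex()
index.add_document("https://example.com/a", "A", "x" * 1000)
index.add_document("https://example.com/b", "B", "y" * 600)
restored = ChunkIndex.from_json(index.to_json())
print(restored.source_of(2)["url"])
```

adding a document only appends to both lists, so existing chunk positions never shift. the json string could be written to disk next to the saved embeddings, though the exact storage format is still an open choice.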