Use uv to create a virtual environment and install the Python dependencies.

--command
uv add faiss-cpu sentence-transformers
uv add ipykernel   

In [1]:
import os, glob, json
import numpy as np
import faiss
from sentence_transformers import SentenceTransformer




In [2]:
!curl https://raw.githubusercontent.com/chandralegend/scratch-rag/refs/heads/main/docs/how-y-comb-started-paul-graham.txt -o ./docs/how-y-comb-started-paul-graham.txt

  % Total    % Received % Xferd  Average Speed   Time    Time     Time  Current
                                 Dload  Upload   Total   Spent    Left  Speed

  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
  0     0    0     0    0     0      0      0 --:--:-- --:--:-- --:--:--     0
100  7828  100  7828    0     0  18101      0 --:--:-- --:--:-- --:--:-- 18162


In [14]:
corpus = {}
for filepath in glob.glob(os.path.join("docs", '*.txt')):
    with open(filepath, 'r', encoding='utf-8') as file:
        doc_id = os.path.basename(filepath)
        corpus[doc_id] = file.read()
print(f"Loaded {len(corpus)} documents.")

Loaded 1 documents.


In [16]:
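chunk_size = 500  # Number of words per chunk
overlap = 50      # Number of words shared by consecutive chunks

def chunk_text(text, chunk_size=500, overlap=50):
    """
    Split the input text into chunks of `chunk_size` words, where each chunk
    starts with the last `overlap` words of the previous one.
    """
    # A step of zero or less would make range() fail or yield no chunks
    if overlap >= chunk_size:
        raise ValueError("overlap must be smaller than chunk_size")

    words = text.split()
    step = chunk_size - overlap  # Each chunk starts this many words after the previous one
    chunks = []
    for i in range(0, len(words), step):
        chunk = ' '.join(words[i:i + chunk_size])
        chunks.append(chunk)
    return chunks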



# Create chunked corpus
chunked_corpus = []
for doc_id, text in corpus.items():
    chunks = chunk_text(text, chunk_size, overlap)
    for i, chunk in enumerate(chunks):
        chunked_corpus.append({
            'doc_id': doc_id,
            'chunk_id': f"{doc_id}_chunk_{i}",
            'text': chunk
        })

chunked_corpus[:2]

[{'doc_id': 'how-y-comb-started-paul-graham.txt',
  'chunk_id': 'how-y-comb-started-paul-graham.txt_chunk_0',
  'text': "How Y Combinator Started Y Combinator's 7th birthday was March 11. As usual we were so busy we didn't notice till a few days after. I don't think we've ever managed to remember our birthday on our birthday. On March 11 2005, Jessica and I were walking home from dinner in Harvard Square. Jessica was working at an investment bank at the time, but she didn't like it much, so she had interviewed for a job as director of marketing at a Boston VC fund. The VC fund was doing what now seems a comically familiar thing for a VC fund to do: taking a long time to make up their mind. Meanwhile I had been telling Jessica all the things they should change about the VC business — essentially the ideas now underlying Y Combinator: investors should be making more, smaller investments, they should be funding hackers instead of suits, they should be willing to fund younger founders, etc
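
To see what the sliding window does, here is a small sketch on a toy input. The function repeats the chunking logic above so the snippet runs on its own; the ten-word text and the sizes 4 and 1 are made-up values for illustration, not settings used elsewhere in this notebook.

```python
def chunk_text(text, chunk_size=500, overlap=50):
    """Split text into word chunks; consecutive chunks share `overlap` words."""
    words = text.split()
    step = chunk_size - overlap  # distance between the starts of two chunks
    return [' '.join(words[i:i + chunk_size]) for i in range(0, len(words), step)]

# Hypothetical toy input: ten "words" named w0 .. w9
toy_text = ' '.join(f"w{n}" for n in range(10))
toy_chunks = chunk_text(toy_text, chunk_size=4, overlap=1)

for chunk in toy_chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each chunk starts with the last word of the previous one. Note that the final chunk here holds only `w9`, which the third chunk already covers; with this scheme the last chunk can be short and mostly redundant, and whether to drop it is a design choice.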

In [17]:
model = SentenceTransformer('all-MiniLM-L6-v2')
embeddings = model.encode([chunk['text'] for chunk in chunked_corpus], show_progress_bar=True, convert_to_numpy=True)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Xet Storage is enabled for this repo, but the 'hf_xet' package is not installed. Falling back to regular HTTP download. For better performance, install the package with: `pip install huggingface_hub[hf_xet]` or `pip install hf_xet`
Batches: 100%|██████████| 1/1 [00:00<00:00,  3.96it/s]


In [20]:
embeddings.shape

(4, 384)

In [24]:
def normalize(vec):
    norm = np.linalg.norm(vec, axis=1, keepdims=True)
    return vec / norm

vec = normalize(embeddings).astype('float32')
dimension = vec.shape[1]
print(f"Dimension of embeddings: {dimension}")

Dimension of embeddings: 384
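
The reason for normalizing: once every vector has unit length, the inner product of two vectors equals their cosine similarity, which is why an inner-product index can be used for cosine search. A minimal NumPy check of that identity, using two made-up 2D vectors rather than real embeddings:

```python
import numpy as np

def normalize(vec):
    # Scale each row to unit L2 length
    norm = np.linalg.norm(vec, axis=1, keepdims=True)
    return vec / norm

# Hypothetical toy vectors, one row each
a = np.array([[3.0, 4.0]])
b = np.array([[4.0, 3.0]])

# Cosine similarity computed from the definition
cosine = float((a @ b.T)[0, 0] / (np.linalg.norm(a) * np.linalg.norm(b)))

# Plain inner product of the normalized vectors
inner_of_unit = float((normalize(a) @ normalize(b).T)[0, 0])

print(cosine, inner_of_unit)  # both are approximately 0.96
```

This sketch assumes no row is all zeros; a zero vector would cause a division by zero in `normalize`.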


In [25]:
index = faiss.IndexFlatIP(dimension)
index.add(vec)
print(f"Total vectors in index: {index.ntotal}")

Total vectors in index: 4


In [26]:
query = "When is the YCOmbinators birthday?"
query_embedding = model.encode([query], convert_to_numpy=True)
query_vec = normalize(query_embedding).astype('float32')

In [29]:
# Retrieve from the index
result = index.search(query_vec, 2)
result

(array([[0.3564237 , 0.29097176]], dtype=float32), array([[0, 2]]))
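
The tuple above holds the scores first and the matching row positions second, one row per query. As a rough mental model, an exhaustive inner-product search can be sketched in plain NumPy as below. This is an illustration of the idea, not how FAISS implements it; the helper name `brute_force_ip_search` and the toy vectors are made up, and tie-breaking may differ from FAISS.

```python
import numpy as np

def brute_force_ip_search(vectors, queries, k):
    """Return (scores, indices) of the k largest inner products per query."""
    scores = queries @ vectors.T                    # shape: (n_queries, n_vectors)
    order = np.argsort(-scores, axis=1)[:, :k]      # positions of the k highest scores
    top_scores = np.take_along_axis(scores, order, axis=1)
    return top_scores, order

# Hypothetical unit-length vectors standing in for chunk embeddings
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]], dtype='float32')
toy_query = np.array([[0.8, 0.6]], dtype='float32')

toy_scores, toy_indices = brute_force_ip_search(toy_vectors, toy_query, k=2)
print(toy_scores, toy_indices)
```

For this toy data the best match is row 2 (score about 0.96), followed by row 0 (score about 0.8). Higher scores mean more similar vectors, the opposite of a true distance.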

In [30]:
distances, indices = index.search(query_vec, 2)

In [31]:
# index.search returns a tuple (distances, indices); both are 2D arrays with one row per query
# There is only one query here, so take the first row with [0]
top_k_indices = indices[0]

print("Top results:")
for i, idx in enumerate(top_k_indices):
    # Look up the chunk record (a dict) in the original chunked_corpus list
    original_text = chunked_corpus[idx]
    # With IndexFlatIP on normalized vectors this value is a cosine similarity, so higher means closer
    distance = distances[0][i]

    print(f"Rank {i+1}:")
    print(f"Index ID: {idx}")
    print(f"Distance: {distance:.4f}")
    print(f"Text: {original_text}\n")

Top results:
Rank 1:
Index ID: 0
Distance: 0.3564
Text: {'doc_id': 'how-y-comb-started-paul-graham.txt', 'chunk_id': 'how-y-comb-started-paul-graham.txt_chunk_0', 'text': "How Y Combinator Started Y Combinator's 7th birthday was March 11. As usual we were so busy we didn't notice till a few days after. I don't think we've ever managed to remember our birthday on our birthday. On March 11 2005, Jessica and I were walking home from dinner in Harvard Square. Jessica was working at an investment bank at the time, but she didn't like it much, so she had interviewed for a job as director of marketing at a Boston VC fund. The VC fund was doing what now seems a comically familiar thing for a VC fund to do: taking a long time to make up their mind. Meanwhile I had been telling Jessica all the things they should change about the VC business — essentially the ideas now underlying Y Combinator: investors should be making more, smaller investments, they should be funding hackers instead of suits, t