### Chunking implementation

In [21]:
import re

def chunk_text(text, max_length=500):
    # Split text into chunks of at most max_length characters, breaking at sentence boundaries.
    # A single sentence longer than max_length is kept whole as its own (oversized) chunk.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # split after sentence-ending punctuation
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_length:
            current_chunk += sentence + " "
        else:
            if current_chunk:  # avoid an empty chunk when the first sentence is too long
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

In [None]:
with open("./data/cat-facts.txt", "r", encoding="utf-8") as f:
    text = f.read()


print(chunk_text(text=text, max_length=500)[0])

On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor. When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.
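
`chunk_text` never splits inside a sentence, so a single sentence longer than `max_length` still comes out as one oversized chunk. One possible follow-up, sketched below, is to pass the chunks through a hard wrap afterwards. The helper name `enforce_max_length` is hypothetical and not part of this notebook; it relies on the standard-library `textwrap.wrap`, which breaks at whitespace and also collapses runs of whitespace.

```python
import textwrap

def enforce_max_length(chunks, max_length=500):
    """Split any chunk longer than max_length at whitespace (hypothetical helper)."""
    result = []
    for chunk in chunks:
        if len(chunk) <= max_length:
            result.append(chunk)
        else:
            # textwrap.wrap returns lines of at most `width` characters
            result.extend(textwrap.wrap(chunk, width=max_length))
    return result

# Toy input: one short chunk and one 150-character chunk
example_chunks = ["Short chunk.", "word " * 30]
for piece in enforce_max_length(example_chunks, max_length=40):
    print(len(piece), repr(piece))
```

Chunks that already fit are passed through unchanged, so this step is safe to apply to the whole list.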


### Embedding the chunks

In [None]:
from sentence_transformers import SentenceTransformer

chunks = chunk_text(text, max_length=500)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chunk_embeddings = model.encode(chunks)

To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Loading weights: 100%|██████████| 103/103 [00:00<00:00, 655.16it/s, Materializing param=pooler.dense.weight]
BertModel LOAD REPORT from: sentence-transformers/all-MiniLM-L6-v2
Key                     | Status     |  | 
------------------------+------------+--+-
embeddings.position_ids | UNEXPECTED |  | 

Notes:
- UNEXPECTED: can be ignored when loading from different task/architecture; not ok if you expect identical arch.


### Vector index storage

In [33]:
import numpy as np

vectors = np.array(chunk_embeddings)
# Keep an array or list of chunk texts in the same order
chunks_list = chunks 


### Test the retrieval with a query

In [37]:
def retrieve(query, vectors, chunks_list, model):
    q_vec = model.encode([query])[0]
    # Compute cosine similarity between q_vec and all chunk vectors
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top_indx = int(np.argmax(scores))
    return chunks_list[top_indx], scores[top_indx]
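
`retrieve` recomputes the norm of every chunk vector on each call, even though those vectors do not change between queries. A possible optimization, sketched here under the assumption that the index is built once and queried many times, is to normalize the rows up front so that cosine similarity reduces to a dot product. The helper `normalize_rows` and the random stand-in data are illustrative only; 384 matches the output dimension of all-MiniLM-L6-v2.

```python
import numpy as np

def normalize_rows(matrix, eps=1e-9):
    """Scale each row to unit L2 norm (hypothetical helper)."""
    norms = np.linalg.norm(matrix, axis=1, keepdims=True)
    return matrix / (norms + eps)

# Random stand-in for real embeddings: 5 chunks, 384 dimensions
rng = np.random.default_rng(0)
demo_vectors = rng.normal(size=(5, 384)).astype(np.float32)
demo_query = rng.normal(size=384).astype(np.float32)

unit_vectors = normalize_rows(demo_vectors)  # done once, at index time
unit_query = demo_query / (np.linalg.norm(demo_query) + 1e-9)

# With unit-length inputs, a plain dot product gives the cosine similarity
demo_scores = unit_vectors @ unit_query
print(demo_scores.shape)
```

The scores agree with the formula in `retrieve` up to floating-point error; only the point at which the norms are computed changes.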

In [42]:
query = "What is a cat lover called"
retrieve(query=query, vectors=vectors, chunks_list=chunks_list, model=model)

('Two members of the cat family are distinct from all others: the clouded leopard and the cheetah. The clouded leopard does not roar like other big cats, nor does it groom or rest like small cats. The cheetah is unique because it is a running cat; all others are leaping cats. They are leaping cats because they slowly stalk their prey and then leap on it. A cat lover is called an Ailurophilia (Greek: cat+lover). In Japan, cats are thought to have the power to turn into super spirits when they die.',
 np.float32(0.5999281))
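
`retrieve` returns only the single best-scoring chunk. When an answer may be spread across several chunks, returning the k best can help. A minimal sketch, assuming the same cosine-similarity scoring as above; `top_k_indices`, `cosine_scores`, and the toy 2-D vectors are illustrative and not part of this notebook.

```python
import numpy as np

def cosine_scores(vectors, q_vec, eps=1e-9):
    """Cosine similarity between q_vec and every row of vectors."""
    return np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec) + eps)

def top_k_indices(scores, k=3):
    """Return the indices of the k highest scores, best first (hypothetical helper)."""
    k = min(k, len(scores))
    # argsort is ascending, so reverse it and keep the first k entries
    return np.argsort(scores)[::-1][:k]

# Toy 2-D vectors standing in for real chunk embeddings
toy_vectors = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [-1.0, 0.0]])
toy_chunks = ["chunk A", "chunk B", "chunk C", "chunk D"]
toy_query = np.array([1.0, 0.2])

toy_scores = cosine_scores(toy_vectors, toy_query)
for i in top_k_indices(toy_scores, k=2):
    print(toy_chunks[i], round(float(toy_scores[i]), 3))
```

For a small index, a full `argsort` is simple and fast enough; `np.argpartition` is an option if the number of chunks grows large.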

### Save embeddings

In [None]:
import numpy as np
import json

# Save the embedding matrix
np.save("./data/embeddings.npy", vectors)

# Save the chunk texts in the same order as the rows of `vectors`
with open("./data/chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks_list, f)


### Load the embeddings

In [None]:
vectors = np.load("./data/embeddings.npy")

with open("./data/chunks.json", "r", encoding="utf-8") as f:
    chunks_list = json.load(f)

In [None]:
print(vectors[:10])
print("\n")
print(chunks_list[:10])
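
Retrieval looks up the embedding matrix and the chunk list by the same position, so it is worth confirming after a reload that the two still line up. The following is a sketch of such a round-trip check. It uses a temporary directory and small stand-in data so that it does not touch `./data`; all variable names are illustrative.

```python
import json
import tempfile
from pathlib import Path

import numpy as np

demo_vectors = np.arange(6, dtype=np.float32).reshape(3, 2)  # 3 chunks, 2 dimensions
demo_chunks = ["first chunk", "second chunk", "third chunk"]

with tempfile.TemporaryDirectory() as tmp:
    emb_path = Path(tmp) / "embeddings.npy"
    chunks_path = Path(tmp) / "chunks.json"

    # Save both files, mirroring the save cell above
    np.save(emb_path, demo_vectors)
    with open(chunks_path, "w", encoding="utf-8") as f:
        json.dump(demo_chunks, f)

    # Load them back, mirroring the load cell above
    loaded_vectors = np.load(emb_path)
    with open(chunks_path, "r", encoding="utf-8") as f:
        loaded_chunks = json.load(f)

# One embedding row per chunk text, and the data survived the round trip
assert loaded_vectors.shape[0] == len(loaded_chunks)
assert np.array_equal(loaded_vectors, demo_vectors)
assert loaded_chunks == demo_chunks
```

The first assertion is the one that carries over directly to the real `vectors` and `chunks_list`.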