## Day 2

### Chunking implementation

In [18]:
import re

def chunk_text(text, max_length=500):
    # Split text into chunks of at most max_length characters, breaking at sentence boundaries where possible.
    # A single sentence longer than max_length is kept whole as its own chunk.
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())  # split after sentence-ending punctuation
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1 <= max_length:
            current_chunk += sentence + " "
        else:
            if current_chunk:  # avoid an empty chunk when the first sentence is already too long
                chunks.append(current_chunk.strip())
            current_chunk = sentence + " "

    if current_chunk.strip():
        chunks.append(current_chunk.strip())
    return chunks

In [19]:
with open("./data/cat-facts.txt", "r", encoding="utf-8") as f:
    text = f.read()


print(chunk_text(text=text, max_length=500)[0])

On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor. When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.
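
The sentence-boundary behaviour is easier to see on a string much smaller than the cat-facts file. The sketch below is a compact variant of the same greedy packing idea, run on toy input; `pack_sentences` and the sample text are illustrative names and are not part of the notebook.

```python
import re

def pack_sentences(text, max_length):
    """Greedy packing: extend the current chunk until the next sentence would overflow."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        candidate = f"{current} {sentence}".strip()
        if len(candidate) <= max_length:
            current = candidate
        else:
            if current:  # do not emit an empty chunk
                chunks.append(current)
            current = sentence  # an overlong sentence still becomes its own chunk
    if current:
        chunks.append(current)
    return chunks

sample = "Cats sleep a lot. Dogs bark! Do birds sing? Yes."
toy_chunks = pack_sentences(sample, max_length=30)
for chunk in toy_chunks:
    print(len(chunk), repr(chunk))
```

With a 30-character limit the first two sentences fit together, and the third starts a new chunk because adding it would exceed the limit.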


### Embedding the chunks

In [20]:
from sentence_transformers import SentenceTransformer

chunks = chunk_text(text, max_length=500)
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chunk_embeddings = model.encode(chunks)

### Vector index storage

In [21]:
import numpy as np

vectors = np.array(chunk_embeddings)
# Keep an array or list of chunk texts in the same order
chunks_list = chunks 


### Test the retrieval with a query

In [5]:
def retrieve(query, vectors, chunks_list, model):
    q_vec = model.encode([query])[0]
    # Compute cosine similarity between q_vec and all chunk vectors
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top_indx = int(np.argmax(scores))
    return chunks_list[top_indx], scores[top_indx]

In [23]:
query = "What is a cat lover called"
retrieve(query=query, vectors=vectors, chunks_list=chunks_list, model=model)

('Two members of the cat family are distinct from all others: the clouded leopard and the cheetah. The clouded leopard does not roar like other big cats, nor does it groom or rest like small cats. The cheetah is unique because it is a running cat; all others are leaping cats. They are leaping cats because they slowly stalk their prey and then leap on it. A cat lover is called an Ailurophilia (Greek: cat+lover). In Japan, cats are thought to have the power to turn into super spirits when they die.',
 np.float32(0.5999281))
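
The score returned above is a cosine similarity, which depends only on the direction of the vectors and not on their length. A minimal sketch with hand-made 2-D vectors, assuming nothing beyond NumPy (`cosine_scores` and the toy arrays are illustrative names), makes the formula easy to check by eye:

```python
import numpy as np

def cosine_scores(matrix, query_vec, eps=1e-9):
    """Cosine similarity between each row of `matrix` and `query_vec`."""
    row_norms = np.linalg.norm(matrix, axis=1)
    return matrix @ query_vec / (row_norms * np.linalg.norm(query_vec) + eps)

toy_vectors = np.array([
    [1.0, 0.0],   # same direction as the query
    [0.0, 1.0],   # orthogonal to the query
    [-1.0, 0.0],  # opposite direction
])
toy_query = np.array([2.0, 0.0])  # only the direction matters, not the length

toy_scores = cosine_scores(toy_vectors, toy_query)
best = int(np.argmax(toy_scores))
print(toy_scores, best)
```

The three rows score roughly 1, 0, and -1, so `argmax` picks the first row, just as `retrieve` picks the chunk whose embedding points most nearly in the query's direction.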

### Save embeddings

In [None]:
import numpy as np
import json

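# Save the embedding matrix (same values as chunk_embeddings) where the load cell below expects it
np.save("./data/embeddings.npy", vectors)

# Save the chunk texts in the same order as the rows of `vectors`
with open("./data/chunks.json", "w", encoding="utf-8") as f:
    json.dump(chunks_list, f)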


### Load the embeddings

In [7]:
import numpy as np
import json

vectors = np.load("./data/embeddings.npy")

with open('./data/chunks.json', "r") as f:
    chunks_list = json.load(f)
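
Retrieval only works if row `i` of the embedding array still corresponds to chunk `i` after saving and loading. A small round-trip check on toy data, written to a temporary directory so it does not touch `./data`, could look like this (the toy arrays and file locations are assumptions for illustration):

```python
import json
import tempfile
from pathlib import Path

import numpy as np

toy_vectors = np.arange(6, dtype=np.float32).reshape(3, 2)
toy_chunks = ["first chunk", "second chunk", "third chunk’s text"]

with tempfile.TemporaryDirectory() as tmp:
    tmp_dir = Path(tmp)

    # Save both artifacts side by side
    np.save(tmp_dir / "embeddings.npy", toy_vectors)
    with open(tmp_dir / "chunks.json", "w", encoding="utf-8") as f:
        json.dump(toy_chunks, f)

    # Load them back the same way the notebook does
    loaded_vectors = np.load(tmp_dir / "embeddings.npy")
    with open(tmp_dir / "chunks.json", "r", encoding="utf-8") as f:
        loaded_chunks = json.load(f)

# One embedding row per chunk, in the same order
print(loaded_vectors.shape[0] == len(loaded_chunks))
```

The same length comparison can be run on the real `vectors` and `chunks_list` right after loading.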

In [2]:
print(vectors[:10])
print("\n")
print(chunks_list[:10])

[[ 0.09816721 -0.06158825  0.04418764 ...  0.04759893  0.00374702
  -0.02113941]
 [ 0.08874442 -0.03300164  0.06386363 ...  0.09614801  0.06132019
   0.08886918]
 [ 0.1137786   0.02467443  0.04589037 ...  0.11650186  0.06650402
   0.02615136]
 ...
 [ 0.14878643 -0.05136343  0.05026303 ...  0.05624724 -0.01553893
   0.10549022]
 [ 0.07461078 -0.09752137  0.0282176  ...  0.04340719  0.04313685
   0.03407015]
 [ 0.04919337  0.08810407  0.01938833 ...  0.07192653  0.06094994
  -0.00545023]]


['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor. When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.', 'The technical term for a cat’s hairball is a “bezoar.”\nA group of cats is called a “clowder.”\nFemale cats tend to be right pawed, while male cats are more o

## Day 3

### Select the generation model (Pipeline)

In [10]:
from transformers import pipeline
from sentence_transformers import SentenceTransformer


embedder = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Load the model and tokenizer for generation (this might download weights the first time)

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_query(query, top_k=3):
    # Retrieve top k chunks
    q_vec = embedder.encode([query])[0] # embed the query using same model as before
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1)*np.linalg.norm(q_vec) + 1e-9)
    top_indices = scores.argsort()[-top_k:][::-1] # indices of top k chunks, sorted by score desc
    retrieved_chunks = [chunks_list[i] for i in top_indices] 
    # construct context string
    context = " ".join(retrieved_chunks)
    prompt = (f"Answer the question using ONLY the context below and Explain in detail. If the answer is not in the context, say 'I do not know.'\n\n"
              f"Context: {context}\n\nQuestion: {query}\nAnswer:")
    result = generator(prompt, max_length=200, num_return_sequences=1)
    answer = result[0]['generated_text']
    return answer

Device set to use cpu


In [13]:
answer_query("What kind of diseases cats have?")

Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'Toxoplasmosis'
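
The `argsort()[-top_k:][::-1]` idiom in `answer_query` is compact but easy to misread. The sketch below isolates it on a toy score array; `top_k_indices` and the scores are hypothetical and only illustrate the ordering.

```python
import numpy as np

def top_k_indices(scores, k):
    """Indices of the k largest scores, best first."""
    k = min(k, len(scores))
    if k <= 0:
        return np.array([], dtype=int)
    # argsort is ascending, so take the last k entries and reverse them
    return np.argsort(scores)[-k:][::-1]

toy_scores = np.array([0.12, 0.60, 0.38, 0.05, 0.44])
top = top_k_indices(toy_scores, k=3)
print(top.tolist())
```

The guard on `k` matters because `[-0:]` selects the whole array rather than nothing.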

### Select the model (AutoModelForSeq2SeqLM)

In [None]:
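from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
# Named gen_model so it does not overwrite `model`, the SentenceTransformer used for retrieval on Day 2
gen_model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


def generate_answer(prompt):
    # Tokenize the prompt, generate with the output capped at 500 tokens, and decode back to text
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = gen_model.generate(**inputs, max_length=500)
    return tokenizer.decode(outputs[0], skip_special_tokens=True)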




In [11]:
from sentence_transformers import SentenceTransformer


# Sentence transformer for embeddings (retrieval)
embedding_model = SentenceTransformer('all-MiniLM-L6-v2')
def answer_query(query):
    # retrieve() returns a (chunk_text, score) tuple; only the text belongs in the prompt
    retrieved = retrieve(query, vectors, chunks_list, embedding_model)
    context = retrieved[0]
    prompt =  f"""
                You are a QA assistant.

                Rules:
                - Use the context as the ONLY source of factual information.
                - You may paraphrase and combine details into your own sentences.
                - Do NOT add new facts that are not supported by the context.
                - If the context does not contain the answer, say exactly: I do not know.

                Task:
                Answer the question in your own words.

                Context:
                {context}

                Question: {query}

                Answer:"""
    print(retrieved)  # show the retrieved chunk and its similarity score
    answer = generate_answer(prompt)
    return answer

In [12]:
answer_query("Explain the relevancy of the goddess Bast")

('The females are less than 20 inches (50 cm) long and can weigh as little as 2.5 lbs. (1.2 kg). Many Egyptians worshipped the goddess Bast, who had a woman’s body and a cat’s head. Mohammed loved cats and reportedly his favorite cat, Muezza, was a tabby. Legend says that tabby cats have an “M” for Mohammed on top of their heads because Mohammad would often rest his hand on the cat’s head.', np.float32(0.3846466))


'Many Egyptians worshipped the goddess Bast, who had a woman’s body and a cat’s head.'
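
Because the prompt in `answer_query` is a triple-quoted string indented inside the function, every line sent to the model starts with a run of spaces. Whether that changes flan-t5's answers is not measured here, but it is easy to avoid. One possible sketch, assuming a shortened rule list and a hypothetical `build_prompt` helper, dedents the template once and fills it in afterwards:

```python
from textwrap import dedent

# Dedent the template BEFORE inserting the context, so multi-line context
# text cannot interfere with the common-indentation calculation.
PROMPT_TEMPLATE = dedent("""\
    You are a QA assistant.

    Rules:
    - Use the context as the ONLY source of factual information.
    - If the context does not contain the answer, say exactly: I do not know.

    Context:
    {context}

    Question: {query}

    Answer:""")

def build_prompt(context, query):
    """Fill the dedented template with the retrieved text and the question."""
    return PROMPT_TEMPLATE.format(context=context, query=query)

toy_prompt = build_prompt(
    context="Many Egyptians worshipped the goddess Bast.",
    query="Who was Bast?",
)
print(toy_prompt)
```

Keeping the template at module level also makes it simple to print and inspect the exact text the generator receives.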