## Day 1 - Introduction to RAG

In [None]:
# Python Environment - Day 1 Task 1
%pip install transformers sentence-transformers faiss-cpu gradio streamlit chromadb

In [None]:
# Choose a Dataset - Day 1 Task 2 - 3 - 4
## Dataset: ./data/cat-facts.txt

In [1]:
# Test Environment - Day 1 Task 5
from transformers import pipeline
generator = pipeline('text-generation', model='gpt2')
print(generator("Hello, my name is", max_length=10))

  from .autonotebook import tqdm as notebook_tqdm
To support symlinks on Windows, you either need to activate Developer Mode or to run Python as an administrator. In order to activate developer mode, see this article: https://docs.microsoft.com/en-us/windows/apps/get-started/enable-your-device-for-development
Device set to use cpu
Truncation was not explicitly activated but `max_length` is provided a specific value, please use `truncation=True` to explicitly truncate examples to max length. Defaulting to 'longest_first' truncation strategy. If you encode pairs of sequences (GLUE-style) with the tokenizer you can select this strategy more precisely by providing a specific strategy to `truncation`.
Setting `pad_token_id` to `eos_token_id`:50256 for open-end generation.
Both `max_new_tokens` (=256) and `max_length`(=10) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/mai

[{'generated_text': "Hello, my name is Sam. I am here to inform you that you need to join the Brotherhood. I have a new weapon... and you need to join us! You need to join us! We will defeat you! In order to do that, we need you and your weapon. So, go to the Brotherhood Chamber and tell them all to save us. I will be the one to stop you! We will defeat you! Our goal is simple: save humanity from a future of destruction. And you... you know, I know you're not alone. You know that the Brotherhood has become more dangerous due to fear! I don't want to be the only one to be killed. I want to save humanity from a future where we all die. But even if you've never heard of me before, this is the first time I'm speaking. I wanted to say that we've met in an elevator that I've never seen before. There is only one way to the elevator. The first time I saw you, I thought you were a hero, and I wanted to meet you. I wanted to tell you how I lost my life to save you from the Brotherhood. But that 

## Day 2 - Core Components of a RAG Pipeline (Data, Embeddings, and Retrieval)

### Chunking implementation (On Cat Data)

In [None]:
# Chunking implementation - Day 2 Task 1
def chunk_text(text, max_length=500):
    # Split the text into chunks of at most max_length characters, breaking at sentence boundaries where possible
    import re
    sentences = re.split(r'(?<=[.!?])\s+', text.strip()) # split on sentence end
    chunks = []
    current_chunk = ""
    for sentence in sentences:
        if len(current_chunk) + len(sentence) + 1  <= max_length:
            current_chunk += sentence + " "
        else:
            chunks.append(current_chunk.strip())
            current_chunk = sentence + " "
    
    if current_chunk:
        chunks.append(current_chunk.strip())
    return chunks
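
To see how the sentence-boundary packing behaves, the minimal sketch below runs the same greedy approach on a short made-up string. The name `chunk_sentences` and the sample text are illustrative and not part of the notebook. Unlike the cell above, this variant also skips the empty chunk that would otherwise be emitted when the very first sentence is already longer than `max_length`.

```python
import re

def chunk_sentences(text, max_length=40):
    """Greedy sentence packing; a hypothetical variant of chunk_text above."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks, current = [], ""
    for sentence in sentences:
        # +1 accounts for the space that joins two sentences
        if len(current) + len(sentence) + 1 <= max_length:
            current += sentence + " "
        else:
            if current:  # do not emit an empty chunk
                chunks.append(current.strip())
            current = sentence + " "
    if current.strip():
        chunks.append(current.strip())
    return chunks

# Made-up sample text, for illustration only
sample = "Cats sleep a lot. Cats purr. Dogs bob their heads when they run after prey."
demo_chunks = chunk_sentences(sample, max_length=40)
for chunk in demo_chunks:
    print(len(chunk), repr(chunk))
```

The two short sentences are packed into one chunk, while the last sentence is kept whole even though it is longer than `max_length`: a single oversized sentence is never split by this approach.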

In [None]:
with open("./data/cat-facts.txt", "r", encoding="utf-8") as f:
    text = f.read()

# Second chunk (index 1)
print(chunk_text(text=text, max_length=500)[1])

On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor. When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.


### Embedding the chunks

In [None]:
# Embedding the Chunks - Day 2 Task 2
from sentence_transformers import SentenceTransformer

chunks = chunk_text(text, max_length=500)
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
chunk_embeddings = embedding_model.encode(chunks)

### Vector index storage

In [None]:
# Store in a Simple Vector index - Day 2 Task 3
import numpy as np

vectors = np.array(chunk_embeddings)
# Keep an array or list of chunk texts in the same order
chunks_list = chunks 


### Test the retrieval with a query ⭐

In [9]:
# Test the retrieval on a Simple Query - Day 2 Task 4
def retrieve(query, vectors, chunks_list, model):
    '''Retrieve the most relevant chunk based on cosine similarity'''
    q_vec = model.encode([query])[0]
    # Compute cosine similarity between q_vec and all chunk vectors
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top_indx = int(np.argmax(scores))
    return chunks_list[top_indx], scores[top_indx]
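
The score computed above is the cosine similarity, i.e. the dot product of two vectors divided by the product of their norms. As a worked illustration, the sketch below applies the same formula to three made-up 2-D vectors in plain Python and returns the top-k indices rather than only the single best one. The helper names `cosine` and `top_k_indices` and the toy vectors are assumptions for illustration only.

```python
import math

def cosine(a, b, eps=1e-9):
    """Cosine similarity of two equal-length vectors (eps avoids division by zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b + eps)

def top_k_indices(query_vec, vectors, k=2):
    """Return the indices of the k most similar vectors, best first, plus all scores."""
    scores = [cosine(v, query_vec) for v in vectors]
    order = sorted(range(len(vectors)), key=lambda i: scores[i], reverse=True)
    return order[:k], scores

# Toy data: the query points almost exactly along the first axis
toy_vectors = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
toy_query = [2.0, 0.1]
best, toy_scores = top_k_indices(toy_query, toy_vectors, k=2)
print(best)
print([round(s, 3) for s in toy_scores])
```

The vector along the first axis ranks first and the diagonal vector second. Note that the magnitude of the query does not matter, only its direction.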

In [10]:
# Inspect the result - Day 2 Task 5
from sentence_transformers import SentenceTransformer

embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')


Query = "What is a cat lover called"
retrieve(query=Query, vectors=vectors, chunks_list=chunks_list, model=embedding_model)

('Two members of the cat family are distinct from all others: the clouded leopard and the cheetah. The clouded leopard does not roar like other big cats, nor does it groom or rest like small cats. The cheetah is unique because it is a running cat; all others are leaping cats. They are leaping cats because they slowly stalk their prey and then leap on it. A cat lover is called an Ailurophilia (Greek: cat+lover). In Japan, cats are thought to have the power to turn into super spirits when they die.',
 np.float32(0.5999281))

### Save embeddings

In [None]:
# Save your work - Day 2 Task 6
import numpy as np
import json

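# Save the embeddings (same path that the load step reads from)
np.save('./data/embeddings.npy', vectors) # chunk_embeddings

# Save the chunk texts in the same order as the embedding rows
with open("./data/chunks.json", "w") as f: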
    json.dump(chunks_list, f)
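
Because retrieval relies on row `i` of the embeddings matching entry `i` of the chunk list, it can help to verify the round trip right after saving. The sketch below uses a temporary directory and made-up data, so it does not touch the notebook's files. `save_chunks` and `load_chunks` are hypothetical helpers; writing with explicit UTF-8 and `ensure_ascii=False` is one option that keeps typographic quotes in the chunk text readable in the file.

```python
import json
import os
import tempfile

def save_chunks(chunks, path):
    """Write the chunk list as UTF-8 JSON."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(chunks, f, ensure_ascii=False)

def load_chunks(path):
    """Read the chunk list back from UTF-8 JSON."""
    with open(path, "r", encoding="utf-8") as f:
        return json.load(f)

# Made-up data standing in for chunks_list and the embedding rows
toy_chunks = ["A group of cats is called a “clowder.”", "Cats purr."]
toy_rows = [[0.1, 0.2], [0.3, 0.4]]

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "chunks.json")
    save_chunks(toy_chunks, path)
    restored = load_chunks(path)

# The text must survive unchanged and stay aligned with the vectors
assert restored == toy_chunks
assert len(restored) == len(toy_rows)
print("round trip OK:", len(restored), "chunks")
```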


### Load the embeddings ⭐

In [2]:
# Load the embeddings - Day 2 Task 7
import numpy as np
import json

vectors = np.load("./data/embeddings.npy")

with open('./data/chunks.json', "r") as f:
    chunks_list = json.load(f)

In [3]:
print(vectors[:10])
print("\n")
print(chunks_list[:10])

[[ 0.09816721 -0.06158825  0.04418764 ...  0.04759893  0.00374702
  -0.02113941]
 [ 0.08874442 -0.03300164  0.06386363 ...  0.09614801  0.06132019
   0.08886918]
 [ 0.1137786   0.02467443  0.04589037 ...  0.11650186  0.06650402
   0.02615136]
 ...
 [ 0.14878643 -0.05136343  0.05026303 ...  0.05624724 -0.01553893
   0.10549022]
 [ 0.07461078 -0.09752137  0.0282176  ...  0.04340719  0.04313685
   0.03407015]
 [ 0.04919337  0.08810407  0.01938833 ...  0.07192653  0.06094994
  -0.00545023]]


['On average, cats spend 2/3 of every day sleeping. That means a nine-year-old cat has been awake for only three years of its life. Unlike dogs, cats do not have a sweet tooth. Scientists believe this is due to a mutation in a key taste receptor. When a cat chases its prey, it keeps its head level. Dogs and humans bob their heads up and down.', 'The technical term for a cat’s hairball is a “bezoar.”\nA group of cats is called a “clowder.”\nFemale cats tend to be right pawed, while male cats are more o

## Day 3: Building Your First RAG System (End-to-End QA)

### Generate an Answer (Pipeline)

In [4]:
# Integrate Retrieval and Generation (PIPELINE VERSION) - Day 3 Task 1

from transformers import pipeline
from sentence_transformers import SentenceTransformer


embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Load the model and tokenizer for generation (this might download weights the first time)

generator = pipeline("text2text-generation", model="google/flan-t5-base")

def answer_query(query, top_k=3):
    # Retrieve top k chunks
    q_vec = embedding_model.encode([query])[0] # embed the query using same model as before
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1)*np.linalg.norm(q_vec) + 1e-9)
    top_indices = scores.argsort()[-top_k:][::-1] # indices of top k chunks, sorted by score desc
    retrieved_chunks = [chunks_list[i] for i in top_indices] 
    # construct context string
    context = " ".join(retrieved_chunks)
    prompt = (f"Answer the question using ONLY the context below and Explain in detail. If the answer is not in the context, say 'I do not know.'\n\n"
              f"Context: {context}\n\nQuestion: {query}\nAnswer:")
    result = generator(prompt, max_length=200, num_return_sequences=1)
    answer = result[0]['generated_text']
    return answer

Device set to use cpu
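
With `top_k=3` and chunks of up to 500 characters, the joined context can approach 1,500 characters. One way to keep the prompt bounded is to add chunks in score order until a character budget is reached. The sketch below is an illustration only: `build_prompt` and `max_context_chars` are hypothetical names, and the budget of 1,000 characters is an arbitrary assumption, not a limit taken from the model.

```python
def build_prompt(query, retrieved_chunks, max_context_chars=1000):
    """Join chunks (best first) until the character budget is used up."""
    selected, used = [], 0
    for chunk in retrieved_chunks:
        # Always keep the best chunk; stop once the next one would exceed the budget
        if selected and used + len(chunk) + 1 > max_context_chars:
            break
        selected.append(chunk)
        used += len(chunk) + 1  # +1 for the joining space
    context = " ".join(selected)
    return (
        "Answer the question using ONLY the context below. "
        "If the answer is not in the context, say 'I do not know.'\n\n"
        f"Context: {context}\n\nQuestion: {query}\nAnswer:"
    )

# Made-up chunks: only the first fits into the 1,000-character budget
demo_prompt = build_prompt("How long do cats sleep?", ["a" * 600, "b" * 600, "c" * 100])
print(len(demo_prompt))
```

Stopping at the first chunk that does not fit, rather than skipping it and trying smaller ones, keeps the context in strict score order.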


In [None]:
# Test with known Question - Day 3 Task 2
answer_query("What is the name of heaviest cat ever?")

Both `max_new_tokens` (=256) and `max_length`(=200) seem to have been set. `max_new_tokens` will take precedence. Please refer to the documentation for more information. (https://huggingface.co/docs/transformers/main/en/main_classes/text_generation)


'Himmy'

### Generate an Answer (AutoModelForSeq2SeqLM) ⭐

In [18]:
# Integrate Retrieval and Generation (AutoModelForSeq2SeqLM VERSION) - Day 3 Task 1

from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

model_name = "google/flan-t5-base"
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)


# Equivalent to the generator pipeline above
def generate_answer(prompt):
    """Generate an answer using FLAN-T5"""
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_length=500)
    return tokenizer.decode(outputs[0], skip_special_tokens = True)


In [None]:
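# Build the prompt and answer a query - Day 3 Task 2
def answer_query(query):
    # retrieve returns a (chunk, score) tuple; only the chunk text goes into the prompt
    context, score = retrieve(query, vectors, chunks_list, embedding_model)
    # Refine prompt if needed - Day 3 Task 3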
    prompt =  (f"""
                You are a QA assistant.

                Rules:
                - Use the context as the ONLY source of factual information.
                - You may paraphrase and combine details into your own sentences.
                - Do NOT add new facts that are not supported by the context.
                - If the context does not contain the answer, say exactly: "I do not know."

                Task:
                Answer the question in your own words.

                Context:
                {context}

                Question: {query}

                Answer:""") 
    # Logging - Day 3 Task 4
    print(f"Context: {context}")
    answer = generate_answer(prompt)
    return answer
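
`retrieve` also returns the similarity score, which can serve as a cheap guard: if even the best chunk scores low, the question is probably not covered by the data, and the generator does not need to be called. A minimal sketch, assuming a threshold of 0.3 (an arbitrary value that would need tuning on real queries). `answer_with_threshold` and the two stub functions are hypothetical stand-ins so that the example runs without loading any model.

```python
def answer_with_threshold(query, retrieve_fn, generate_fn, min_score=0.3):
    """Answer only when the best retrieved chunk is similar enough to the query."""
    context, score = retrieve_fn(query)
    if score < min_score:
        return "I do not know."
    prompt = f"Context:\n{context}\n\nQuestion: {query}\n\nAnswer:"
    return generate_fn(prompt)

# Stubs standing in for the real retrieval and generation functions
def fake_retrieve(query):
    chunk = "A group of cats is called a clowder."
    return (chunk, 0.72) if "cat" in query.lower() else (chunk, 0.05)

def fake_generate(prompt):
    return "clowder"

print(answer_with_threshold("What is a group of cats called?", fake_retrieve, fake_generate))
print(answer_with_threshold("Who won the 1998 World Cup?", fake_retrieve, fake_generate))
```

In the notebook, `retrieve_fn` could be `lambda q: retrieve(q, vectors, chunks_list, embedding_model)` and `generate_fn` could be `generate_answer`.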

In [None]:
answer_query("Lightiest cat ever?")

'blue point Himalayan called Tinker Toy.'

## Day 4: Building an Interactive RAG Application (UI Integration)

### Gradio UI

In [None]:
# Day 4 Task 1-2-3-4-5
import gradio as gr

def rag_system(query):
    # Use our answer_query function from Day 3
    answer = answer_query(query)
    return answer

iface = gr.Interface(fn=rag_system, inputs="text", outputs="text", title="RAG QA System", description="Ask a question and get an answer from documents.")
iface.launch()

* Running on local URL:  http://127.0.0.1:7860
* To create a public link, set `share=True` in `launch()`.
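
A UI callback receives whatever the user types, including empty input, and any exception raised inside it surfaces as a generic error in the interface. The sketch below wraps an answer function so that blank queries and failures produce a readable message instead. `make_safe_handler` and the stub `_stub_answer` are hypothetical names for illustration, and the message texts are arbitrary.

```python
def make_safe_handler(answer_fn):
    """Wrap an answer function so it tolerates blank input and runtime errors."""
    def handler(query):
        query = (query or "").strip()
        if not query:
            return "Please enter a question."
        try:
            return answer_fn(query)
        except Exception as exc:  # show the error in the UI instead of crashing
            return f"Error while answering: {exc}"
    return handler

# Stub standing in for answer_query, so the example runs without any model
def _stub_answer(query):
    if "boom" in query:
        raise ValueError("model unavailable")
    return f"echo: {query}"

safe = make_safe_handler(_stub_answer)
print(safe("   "))
print(safe("How long do cats sleep?"))
print(safe("boom"))
```

The wrapped function could then be passed as `fn=make_safe_handler(answer_query)` when building the interface.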




## Day 5: Adding Conversational Memory to RAG Assistant

### Necessary Functions

In [None]:
# Load vectors
import numpy as np
import json
from sentence_transformers import SentenceTransformer
from transformers import AutoModelForSeq2SeqLM, AutoTokenizer

# Saved Vectors
vectors = np.load("./data/embeddings.npy")
# Embedding model 
embedding_model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
# Saved Chunks
with open('./data/chunks.json', "r") as f:
    chunks_list = json.load(f)
# Retrieval function
def retrieve(query, vectors, chunks_list, model):
    '''Retrieve the most relevant chunk based on cosine similarity'''
    q_vec = model.encode([query])[0]
    # Compute cosine similarity between q_vec and all chunk vectors
    scores = np.dot(vectors, q_vec) / (np.linalg.norm(vectors, axis=1) * np.linalg.norm(q_vec) + 1e-9)
    top_indx = int(np.argmax(scores))
    return chunks_list[top_indx], scores[top_indx]

# LLM model name to generate answer
model_name = "google/flan-t5-base"
# model library from HuggingFace
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)
# Model tokenizer from HuggingFace
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Generate-answer function, equivalent to the generator pipeline
def generate_answer(prompt):
    """Generate an answer using FLAN-T5"""
    inputs = tokenizer(prompt, return_tensors='pt')
    outputs = model.generate(**inputs, max_length=500)
    return tokenizer.decode(outputs[0], skip_special_tokens = True)

### Gradio UI (with history ⭐)