# 12: Generative AI

The hand-in exercise for this topic is in the notebook named ‘rag_task.ipynb’. Do all 4
tasks within this notebook. For task 2, you should try at least 3 types of chunking, such as
chunking by paragraphs, sentences, or even punctuation marks – you are welcome to
choose your own chunking strategy. For task 4 you should try at least one other type of
similarity or distance function to calculate the similarity.

### Task
1. Create a RAG pipeline that can take the following text and answer the following questions.
2. Try different types of chunking. Do they give better answers?
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity - do the answers improve?

In [1]:

sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [2]:
# Properly formulated questions for the sample text
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]

### Imports

In [3]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from transformers import pipeline
import re

### Starting to build the pipeline:

In [4]:
# Choose our model, in this case, we are using the MiniLM model
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

# Load a pre-trained question-answering model pipeline from transformers library
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


The SentenceTransformer turns each chunk of text into a vector (a list of numbers) so we can compare how close in meaning two texts are.  
The QA pipeline takes a question and a text chunk and tries to pull out the best short answer from it.
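
To make "close in meaning" concrete, here is a minimal sketch of cosine similarity on made-up 3-dimensional vectors. The vectors and their names are invented for illustration; real `all-MiniLM-L6-v2` embeddings have 384 dimensions and come from `model.encode`.

```python
import math

def cosine_similarity(u, v):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

# Made-up toy "embeddings" (assumption: not produced by the real model).
chunk_about_rivers = [0.9, 0.1, 0.0]
chunk_about_tribes = [0.1, 0.9, 0.2]
question_about_rivers = [0.8, 0.2, 0.1]

sim_rivers = cosine_similarity(question_about_rivers, chunk_about_rivers)
sim_tribes = cosine_similarity(question_about_rivers, chunk_about_tribes)

# The question vector points in nearly the same direction as the rivers chunk,
# so that chunk gets the higher score.
print(sim_rivers > sim_tribes)
```

The score depends only on the direction of the vectors, not their length, which is why it is a common choice for comparing text embeddings.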


### Testing sentence chunking:

Let's create some different chunking functions:

In [5]:
def split_text_sentences(text):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [s.strip() for s in sentences if s.strip()]
chunks_sent = split_text_sentences(sample_text)

In [6]:
def split_text_paragraphs(text):
    return [p.strip() for p in re.split(r"\n+", text) if p.strip()]
chunks_para = split_text_paragraphs(sample_text)


In [7]:
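def split_text_tokens(text, max_tokens=120, overlap=30):
    # Note: "tokens" here are whitespace-separated words, not model tokens.
    if overlap >= max_tokens:
        # otherwise the window never advances and the loop would not end
        raise ValueError("overlap must be smaller than max_tokens")
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i+max_tokens]))
        i += max_tokens - overlap  # step forward, sharing `overlap` words
    return chunks
chunks_tok = split_text_tokens(sample_text)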
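
To see what `max_tokens` and `overlap` actually do, the sketch below runs the same windowing logic on a ten-word toy string with small parameter values. The toy text and the parameter values are chosen only for illustration.

```python
def split_text_tokens(text, max_tokens=120, overlap=30):
    # Same windowing logic as the notebook cell: whitespace words, not model tokens.
    words = text.split()
    chunks = []
    i = 0
    while i < len(words):
        chunks.append(" ".join(words[i:i + max_tokens]))
        i += max_tokens - overlap
    return chunks

toy_text = "w0 w1 w2 w3 w4 w5 w6 w7 w8 w9"
toy_chunks = split_text_tokens(toy_text, max_tokens=4, overlap=1)

for chunk in toy_chunks:
    print(chunk)
# w0 w1 w2 w3
# w3 w4 w5 w6
# w6 w7 w8 w9
# w9
```

Each window starts 3 words after the previous one and shares 1 word with it. Note that the last chunk (`w9`) is fully contained in the chunk before it; depending on the text length, this loop can emit such a short trailing chunk.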


Now we take the text chunks from each strategy (sentences, paragraphs, tokens) and run them through the embedding model.  
This turns every chunk into a vector (list of numbers) that captures its meaning, so we can later compare them to the question.


In [8]:
E_sent = model.encode(chunks_sent, convert_to_numpy=True)
E_para = model.encode(chunks_para, convert_to_numpy=True)
E_tok  = model.encode(chunks_tok, convert_to_numpy=True)



Next we set up the retrieval step.  
1) Take the question, turn it into a vector (just like the chunks).  
2) Compare that vector to all chunk vectors using cosine similarity or Euclidean distance.  
3) Pick the chunk that is closest in meaning.  
4) Feed that chunk into the QA model to pull out the actual answer.

- `retrieve_passage` takes a question, turns it into a vector, and finds the chunk that is closest in meaning (using cosine or euclidean).  
- `answer_question` takes the question and the retrieved chunk, then uses the QA model to extract the best short answer.


In [9]:
# generic retriever: works for any chunk set and either cosine or euclidean
def retrieve_passage(query, chunks, embeddings, metric="cosine"):
    q = model.encode([query], convert_to_numpy=True)[0]
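    if metric == "cosine":
        # higher is better; normalize both sides so this is a true cosine
        # even if the embeddings are not unit-length
        q_unit = q / (np.linalg.norm(q) + 1e-12)
        E_unit = embeddings / (np.linalg.norm(embeddings, axis=1, keepdims=True) + 1e-12)
        scores = E_unit @ q_unit
        idx = np.argmax(scores)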
    elif metric == "euclidean":
        # smaller is better
        dists = np.linalg.norm(embeddings - q, axis=1)
        idx = np.argmin(dists)
    else:
        raise ValueError("metric must be 'cosine' or 'euclidean'")
    return chunks[idx]

# QA: pass the question and the retrieved chunk to the QA model
def answer_question(question, chunks, embeddings, metric="cosine"):
    context = retrieve_passage(question, chunks, embeddings, metric=metric)
    out = qa_pipeline(question=question, context=context)
    return out["answer"]
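
When an answer looks wrong, it can help to check which chunks scored highest rather than only the single winner. Below is a minimal sketch of that idea using a hypothetical helper, `rank_chunks`, which is not part of the notebook. The 2-dimensional vectors and chunk labels are made up; in the notebook the vectors would come from `model.encode`.

```python
import math

def rank_chunks(query_vec, chunk_vecs, chunks, k=2):
    """Return the k chunks most similar to query_vec as (score, chunk) pairs."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        norm_u = math.sqrt(sum(a * a for a in u))
        norm_v = math.sqrt(sum(b * b for b in v))
        return dot / (norm_u * norm_v)

    scored = [(cos(query_vec, vec), chunk) for vec, chunk in zip(chunk_vecs, chunks)]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # best score first
    return scored[:k]

# Made-up toy data (assumption: illustrative only).
toy_chunks = ["weather paragraph", "deforestation paragraph", "tribes paragraph"]
toy_vecs = [[0.9, 0.3], [0.7, 0.6], [0.1, 1.0]]
toy_query = [0.8, 0.5]

top = rank_chunks(toy_query, toy_vecs, toy_chunks, k=2)
for score, chunk in top:
    print(f"{score:.3f}  {chunk}")
```

In this toy case the two best chunks have very similar scores, which is the kind of near-tie where the top-1 retriever can hand the QA model the wrong passage.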

Let's demonstrate the basic RAG loop:

In [10]:

print("RAG demo with cosine\n")
for q in questions:
    print("Q:", q)
    print("A:", answer_question(q, chunks_tok, E_tok, metric="cosine"))
    print()


RAG demo with cosine

Q: What is the Amazon rainforest?
A: largest tropical rainforest in the world

Q: Which countries does the Amazon span across?
A: Brazil, Peru, and Colombia

Q: Why is deforestation a problem in the Amazon?
A: climate change

Q: How does the Amazon rainforest affect global weather patterns?
A: deforestation

Q: What role do indigenous tribes play in the Amazon?
A: relying on its rich biodiversity for food, medicine, and shelter

Q: What is the importance of the Amazon River?
A: carries more water

Q: What types of wildlife can be found in the Amazon?
A: jaguars, sloths, and thousands of species of insects and birds

Q: How does deforestation contribute to climate change?
A: the rainforest acts as a major carbon sink

Q: What efforts are being made to protect the Amazon?
A: international agreements, conservation programs, and sustainable development projects

Q: Why is the Amazon considered a major carbon sink?
A: global weather patterns



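We can see that the RAG pipeline works and returns reasonable answers to most of the questions. Two answers miss, though: the weather-patterns question returns "deforestation" and the carbon-sink question returns "global weather patterns", which suggests that the retrieved chunk or the extracted span was off for those two.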

---

### Let's try different types of chunking strategies

In [11]:
chunkers = {
    "Sentences": (chunks_sent, E_sent),
    "Paragraphs": (chunks_para, E_para),
    "Tokens": (chunks_tok, E_tok),
}

for name, (chs, E) in chunkers.items():
    print(f"\n--- {name} ---")
    for q in questions:
        ans = answer_question(q, chs, E, metric="cosine")
        print(f"Q: {q}")
        print(f"A: {ans}")
    print()


--- Sentences ---
Q: What is the Amazon rainforest?
A: the largest tropical rainforest in the world
Q: Which countries does the Amazon span across?
A: Brazil, Peru, and Colombia
Q: Why is deforestation a problem in the Amazon?
A: agriculture, logging, and urbanization
Q: How does the Amazon rainforest affect global weather patterns?
A: releasing water vapor into the atmosphere
Q: What role do indigenous tribes play in the Amazon?
A: relying on its rich biodiversity for food, medicine, and shelter
Q: What is the importance of the Amazon River?
A: carries more water
Q: What types of wildlife can be found in the Amazon?
A: Indigenous tribes
Q: How does deforestation contribute to climate change?
A: the rainforest acts as a major carbon sink
Q: What efforts are being made to protect the Amazon?
A: international agreements, conservation programs, and sustainable development projects
Q: Why is the Amazon considered a major carbon sink?
A: deforestation contributes to climate change


--- Pa

The three chunking strategies gave mostly the same answers, but there were some small differences. Sentences sometimes lost context (wildlife question, carbon sink), while paragraphs occasionally gave vague or off answers (Amazon River, carbon sink). Token chunks handled the wildlife question best but also mixed up one carbon sink answer. This shows how chunking choice can slightly shift what the QA model returns.


### Let's see what happens if we ask questions in a different way

In [12]:
rephrased_questions = [
    "Explain briefly what the Amazon rainforest is.",
    "List the countries where the Amazon is found.",
    "Give one reason why deforestation in the Amazon is harmful.",
]

chunkers = {
    "Sentences": (chunks_sent, E_sent),
    "Paragraphs": (chunks_para, E_para),
    "Tokens": (chunks_tok, E_tok),
}

for name, (chs, E) in chunkers.items():
    print(f"\n--- {name} ---")
    for q in rephrased_questions:
        ans = answer_question(q, chs, E, metric="cosine")
        print("Q:", q)
        print("A:", ans)
    print()



--- Sentences ---
Q: Explain briefly what the Amazon rainforest is.
A: largest tropical rainforest in the world
Q: List the countries where the Amazon is found.
A: international agreements, conservation programs, and sustainable development projects
Q: Give one reason why deforestation in the Amazon is harmful.
A: agriculture, logging, and urbanization


--- Paragraphs ---
Q: Explain briefly what the Amazon rainforest is.
A: largest tropical rainforest in the world
Q: List the countries where the Amazon is found.
A: international agreements, conservation programs, and sustainable development projects
Q: Give one reason why deforestation in the Amazon is harmful.
A: climate change


--- Tokens ---
Q: Explain briefly what the Amazon rainforest is.
A: largest tropical rainforest in the world
Q: List the countries where the Amazon is found.
A: Brazil, Peru, and Colombia
Q: Give one reason why deforestation in the Amazon is harmful.
A: climate change



Rephrasing made the differences between chunking strategies clearer. All three handled the rainforest definition well. For the countries question, only tokens retrieved the right chunk, while sentences and paragraphs picked unrelated text. On deforestation, sentences highlighted causes, while paragraphs and tokens focused on consequences. This shows that both the phrasing of the question and the chunking strategy affect which passage is retrieved and, in turn, the answer.

This happens because when we rephrase a question, its vector moves slightly in the embedding space, and that can make it land closer to a different chunk.
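
The effect of a small shift in the question vector can be sketched with made-up 2-dimensional vectors. Everything here is hypothetical (the helper `nearest_chunk`, the vectors, and the labels in the comments); it only illustrates how a small move can flip which chunk is nearest.

```python
import math

def nearest_chunk(query_vec, chunk_vecs):
    """Index of the chunk vector with the highest cosine similarity to query_vec."""
    def cos(u, v):
        dot = sum(a * b for a, b in zip(u, v))
        return dot / (math.hypot(*u) * math.hypot(*v))
    return max(range(len(chunk_vecs)), key=lambda i: cos(query_vec, chunk_vecs[i]))

# Made-up toy vectors (assumption: illustrative only).
chunk_vecs = [
    [1.0, 0.2],  # chunk 0, e.g. a chunk about countries
    [0.2, 1.0],  # chunk 1, e.g. a chunk about protection efforts
]
original_q = [0.6, 0.5]   # slightly closer to chunk 0
rephrased_q = [0.5, 0.6]  # a small shift, now slightly closer to chunk 1

print(nearest_chunk(original_q, chunk_vecs))
print(nearest_chunk(rephrased_q, chunk_vecs))
```

The two query vectors differ only a little, yet they fall on opposite sides of the boundary between the two chunks, so top-1 retrieval returns a different passage.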


### Try a different similarity search instead of cosine similarity - do the answers improve?

In [13]:
chunkers = {
    "Sentences": (chunks_sent, E_sent),
    "Paragraphs": (chunks_para, E_para),
    "Tokens": (chunks_tok, E_tok),
}

for name, (chs, E) in chunkers.items():
    print(f"\n=== {name} ===")
    for metric in ["cosine", "euclidean"]:
        print(f"\n--- Using {metric.upper()} ---")
        for q in questions:
            ans = answer_question(q, chs, E, metric=metric)
            print("Q:", q)
            print("A:", ans)
        print()



=== Sentences ===

--- Using COSINE ---
Q: What is the Amazon rainforest?
A: the largest tropical rainforest in the world
Q: Which countries does the Amazon span across?
A: Brazil, Peru, and Colombia
Q: Why is deforestation a problem in the Amazon?
A: agriculture, logging, and urbanization
Q: How does the Amazon rainforest affect global weather patterns?
A: releasing water vapor into the atmosphere
Q: What role do indigenous tribes play in the Amazon?
A: relying on its rich biodiversity for food, medicine, and shelter
Q: What is the importance of the Amazon River?
A: carries more water
Q: What types of wildlife can be found in the Amazon?
A: Indigenous tribes
Q: How does deforestation contribute to climate change?
A: the rainforest acts as a major carbon sink
Q: What efforts are being made to protect the Amazon?
A: international agreements, conservation programs, and sustainable development projects
Q: Why is the Amazon considered a major carbon sink?
A: deforestation contributes to c

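Running all questions confirmed that cosine and euclidean gave exactly the same answers, so switching the metric did not improve them. This is because the embeddings from all-MiniLM-L6-v2 are normalized to unit length. For unit vectors, the squared Euclidean distance equals 2 - 2 * (cosine similarity), so the chunk with the highest cosine similarity is also the one with the smallest Euclidean distance. With embeddings that are not normalized, the two metrics could rank passages differently.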
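
The link between the two metrics on normalized vectors can be checked numerically. The sketch below uses made-up 3-dimensional vectors, first normalized to unit length and then a separate unnormalized case where the two metrics disagree. All names and values are illustrative assumptions.

```python
import math

def normalize(v):
    n = math.sqrt(sum(x * x for x in v))
    return [x / n for x in v]

def cosine(u, v):
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v)))

def euclidean(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

# Case 1: unit-length vectors (like the normalized embeddings in this notebook).
unit_chunks = [normalize(v) for v in [[3.0, 1.0, 0.5], [0.2, 2.0, 1.0], [1.0, 1.0, 4.0]]]
unit_q = normalize([1.0, 0.5, 2.0])

# For unit vectors: distance^2 == 2 - 2 * cosine
identity_holds = all(
    abs(euclidean(unit_q, v) ** 2 - (2 - 2 * cosine(unit_q, v))) < 1e-9
    for v in unit_chunks
)
best_cos = max(range(len(unit_chunks)), key=lambda i: cosine(unit_q, unit_chunks[i]))
best_euc = min(range(len(unit_chunks)), key=lambda i: euclidean(unit_q, unit_chunks[i]))
print(identity_holds, best_cos == best_euc)

# Case 2: unnormalized vectors, where the metrics can pick different chunks.
raw_chunks = [[10.0, 0.0], [1.0, 0.5]]
raw_q = [1.0, 0.0]
raw_best_cos = max(range(len(raw_chunks)), key=lambda i: cosine(raw_q, raw_chunks[i]))
raw_best_euc = min(range(len(raw_chunks)), key=lambda i: euclidean(raw_q, raw_chunks[i]))
print(raw_best_cos, raw_best_euc)
```

In the second case cosine prefers the long vector that points in exactly the same direction, while Euclidean distance prefers the shorter vector that is physically closer.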
