Task
1. Create a RAG pipeline that can take the following text and answer the following questions
2. Try different types of chunking to get better answers?
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity - do the answers improve?



<hr>
<h4>Task 1 Create a RAG pipeline that can take the following text and answer the following questions</h4>

In [9]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity, euclidean_distances
import re

In [None]:
from huggingface_hub import notebook_login

notebook_login()

In [8]:
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

In [10]:
sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [None]:
# Split on one or more newlines so that each paragraph becomes a chunk
def chunk_by_paragraph(text):
    return [para.strip() for para in re.split(r"\n+", text) if para.strip()]

# Split after "." or "?" followed by one whitespace character, skipping
# abbreviation-like patterns such as "e.g." or "Mr."
def chunk_by_sentence(text):
    sentences = re.split(r'(?<!\w\.\w.)(?<![A-Z][a-z]\.)(?<=\.|\?)\s', text)
    return [sentence.strip() for sentence in sentences if sentence.strip()]

# Split on any run of whitespace that follows ".", "!" or "?"
def chunk_by_punctuation(text):
    return [chunk.strip() for chunk in re.split(r'(?<=[.!?])\s+', text) if chunk.strip()]

# Store document embeddings
stored_texts_paragraphs = chunk_by_paragraph(sample_text)  # Store each paragraph separately
stored_texts_sentences = chunk_by_sentence(sample_text)  # Store each sentence separately
stored_texts_punctuation = chunk_by_punctuation(sample_text)  # Store each punctuation chunk separately

stored_embeddings_paragraphs = model.encode(stored_texts_paragraphs, convert_to_numpy=True)  # Store embeddings for each chunk
stored_embeddings_sentences = model.encode(stored_texts_sentences, convert_to_numpy=True)  # Store embeddings for each chunk
stored_embeddings_punctuation = model.encode(stored_texts_punctuation, convert_to_numpy=True)  # Store embeddings for each chunk

In [12]:
stored_texts_paragraphs = chunk_by_paragraph(sample_text)
stored_texts_sentences = chunk_by_sentence(sample_text)
stored_texts_punctuation = chunk_by_punctuation(sample_text)


In [13]:
# Function to generate embeddings
def get_transformer_embeddings(texts):
    return model.encode(texts, convert_to_numpy=True)

# Search functions
def cosine_search(query, stored_texts, stored_embeddings):
    query_embedding = get_transformer_embeddings([query])
    sims = cosine_similarity(query_embedding, stored_embeddings)[0]
    best_idx = np.argmax(sims)
    return stored_texts[best_idx]

def euclidean_search(query, stored_texts, stored_embeddings):
    query_embedding = get_transformer_embeddings([query])
    dists = euclidean_distances(query_embedding, stored_embeddings)[0]
    best_idx = np.argmin(dists)
    return stored_texts[best_idx]

def dot_search(query, stored_texts, stored_embeddings):
    query_embedding = get_transformer_embeddings([query])
    dots = np.dot(stored_embeddings, query_embedding.T).flatten()
    best_idx = np.argmax(dots)
    return stored_texts[best_idx]

In [25]:
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]

question_alternate = [
    "Amazon rainforest, can you explain what it is?",
    "Can you tell me which countries the Amazon spans across?",
    "What are the issues with deforestation in the Amazon?",
    "How does the Amazon rainforest influence global weather?",
    "What is the role of indigenous tribes in the Amazon?",
    "Why is the Amazon River significant?",
    "What kinds of wildlife inhabit the Amazon?",
    "In what ways does deforestation impact climate change?",
    "What measures are being taken to safeguard the Amazon?",
    "Why is the Amazon regarded as a key carbon sink?"
]


chunk_modes = {
    "paragraphs": chunk_by_paragraph,
    "sentences": chunk_by_sentence,
    "punctuation": chunk_by_punctuation
}

search_modes = {
    "cosine": cosine_search,
    "euclidean": euclidean_search,
    "dot": dot_search
}

<hr>
<h4>Task 2 Try different types of chunking to get better answers?</h4>

In [None]:
print("\nSample Questions and Answers (paragraphs + cosine):\n")
chunks = chunk_modes["paragraphs"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (paragraphs + cosine):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Which countries does the Amazon span across?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, loggin

In [None]:
print("\nSample Questions and Answers (sentences + cosine):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + cosine):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shel

In [19]:
print("\nSample Questions and Answers (punctuation + cosine):\n")
chunks = chunk_modes["punctuation"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (punctuation + cosine):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall acr

<h4>Results :</h4>
<h5><strong>Paragraph chunking</strong> always contained the complete answer, but also far more information than needed. For example, when asked which countries the Amazon spans, it returned the entire paragraph, in which the countries were only a small part further in.</h5>
<h5><strong>Sentence chunking</strong> was the best, since its answers were shorter and more precise, with the information relevant to the question right up front. For example, the question about countries returned "It spans across nine countries...".</h5>
<h5><strong>Punctuation chunking</strong> seemed really broken. Some sentences came back mashed together with odd stray newlines between them, which could be a glitch in the regex implementation. When asked "What role do indigenous tribes play in the Amazon?" it answered with "This deforestation contributes to climate change...", which is completely the wrong answer. Interestingly enough, punctuation chunking got the question about wildlife correct, while sentence chunking didn't.</h5>
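
A quick way to check the "mashed sentences" symptom is to look for chunks that still contain a line break. The sketch below uses a toy text rather than sample_text. It compares the pattern from chunk_by_punctuation with a hypothetical spaces-only variant (split_on_spaces_only is an assumption for illustration, not code from this notebook). The variant reproduces the symptom, so a stale or slightly different regex in the executed cell is one possible cause, though that is only a guess.

```python
import re

# Toy text with a blank line between paragraphs, mimicking sample_text's layout.
toy_text = "First sentence. Second sentence.\n\nThird sentence. Fourth sentence."

def split_on_any_whitespace(text):
    # Same pattern as chunk_by_punctuation in this notebook.
    return [c.strip() for c in re.split(r'(?<=[.!?])\s+', text) if c.strip()]

def split_on_spaces_only(text):
    # Hypothetical variant (an assumption, not code from this notebook):
    # it only splits on spaces, so paragraph breaks stay inside a chunk.
    return [c.strip() for c in re.split(r'(?<=[.!?]) +', text) if c.strip()]

def chunks_with_newlines(chunks):
    # Return the chunks that still contain a line break after stripping.
    return [c for c in chunks if "\n" in c]

good = split_on_any_whitespace(toy_text)
bad = split_on_spaces_only(toy_text)

print(len(good), chunks_with_newlines(good))
print(len(bad), [repr(c) for c in chunks_with_newlines(bad)])
```

Running chunks_with_newlines on the real punctuation chunks would show whether the chunker itself is at fault or whether the printed output came from an earlier version of the cell.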

<hr>
<h4>Task 4 Try a different similarity search instead of cosine similarity - do the answers improve?</h4>

In [21]:
print("\nSample Questions and Answers (sentences + cosine):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + cosine):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shel

In [22]:
print("\nSample Questions and Answers (sentences + euclidean):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["euclidean"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + euclidean):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and s

In [23]:
print("\nSample Questions and Answers (sentences + dot):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["dot"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + dot):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter

<h4>Results :</h4>
<h5>The answers were identical across all three methods. One interesting note is that none of the search methods got the question about wildlife correct.</h5>
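
One likely explanation for the identical answers, assuming the embeddings are unit length: for unit vectors, cosine similarity equals the dot product, and the squared Euclidean distance is 2 - 2 * (dot product). All three metrics therefore produce the same ranking. Whether this holds here can be checked with np.linalg.norm(stored_embeddings, axis=1). The sketch below uses random vectors as stand-ins for real embeddings, so it only illustrates the math and says nothing about the model itself.

```python
import numpy as np

rng = np.random.default_rng(0)

def normalize(x):
    # Scale each row to unit L2 length.
    return x / np.linalg.norm(x, axis=1, keepdims=True)

# Random stand-ins for chunk and query embeddings (not real model output).
docs = normalize(rng.normal(size=(20, 384)))
query = normalize(rng.normal(size=(1, 384)))

dots = (docs @ query.T).ravel()
cosines = dots / (np.linalg.norm(docs, axis=1) * np.linalg.norm(query))
dists = np.linalg.norm(docs - query, axis=1)

best_dot = int(np.argmax(dots))
best_cos = int(np.argmax(cosines))
best_euc = int(np.argmin(dists))

# All three metrics pick the same chunk index.
print(best_dot, best_cos, best_euc)
# For unit vectors, squared distance equals 2 - 2 * dot product.
print(np.allclose(dists**2, 2 - 2 * dots))
```

If the norms of the real embeddings turn out not to be 1, dot product and Euclidean distance could rank chunks differently from cosine, and the comparison would be worth repeating.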

<hr>
<h4>Task 3 Does asking questions differently give better answers? Why?</h4>

In [29]:
print("\nSample Questions and Answers (sentences + cosine + original questions):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in questions:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + cosine + original questions):

Q: What is the Amazon rainforest?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Which countries does the Amazon span across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: Why is deforestation a problem in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest affect global weather patterns?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What role do indigenous tribes play in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for foo

In [28]:
print("\nSample Questions and Answers (sentences + cosine + alternate questions):\n")
chunks = chunk_modes["sentences"](sample_text)
stored_embeddings = get_transformer_embeddings(chunks)

for q in question_alternate:
    ans = search_modes["cosine"](q, chunks, stored_embeddings)
    print(f"Q: {q}\nA: {ans}\n")


Sample Questions and Answers (sentences + cosine + alternate questions):

Q: Amazon rainforest, can you explain what it is?
A: The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers.

Q: Can you tell me which countries the Amazon spans across?
A: It spans across nine countries, including Brazil, Peru, and Colombia.

Q: What are the issues with deforestation in the Amazon?
A: Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization.

Q: How does the Amazon rainforest influence global weather?
A: Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents.

Q: What is the role of indigenous tribes in the Amazon?
A: Indigenous tribes have lived in the Amazon for thousands of years, relying

<h4>Results :</h4>
<h5>The answers were identical regardless of whether the original or the alternate questions were used. Maybe changing the wording, for example from "why" to "how" questions, still left the questions too closely related to change the vector much.</h5>
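
Looking only at the top-1 answer can hide the effect of rephrasing: the best chunk may stay the same while its margin over the runner-up changes. A minimal sketch of a top-k search that also returns the scores is below. The function top_k_cosine and the toy 2-D vectors are assumptions for illustration. In the notebook, the query vector would come from get_transformer_embeddings([q])[0] and the chunk vectors from stored_embeddings.

```python
import numpy as np

def top_k_cosine(query_vec, doc_vecs, k=3):
    # Return (index, score) pairs for the k highest cosine similarities.
    q = query_vec / np.linalg.norm(query_vec)
    d = doc_vecs / np.linalg.norm(doc_vecs, axis=1, keepdims=True)
    scores = d @ q
    order = np.argsort(scores)[::-1][:k]
    return [(int(i), float(scores[i])) for i in order]

# Toy 2-D vectors standing in for real embeddings.
doc_vecs = np.array([[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]])
original = np.array([1.0, 0.1])
rephrased = np.array([1.0, 0.3])

print(top_k_cosine(original, doc_vecs, k=2))
print(top_k_cosine(rephrased, doc_vecs, k=2))
```

In this toy case both queries retrieve the same best chunk, but the gap between first and second place is smaller for the second query. Printing the same scores for questions and question_alternate would show how much the rephrasing actually moved the query vectors.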