# 12: Generative AI

The hand-in exercise for this topic is in the notebook named ‘rag_task.ipynb’. Do all 4
tasks within this notebook. For task 2, you should try at least 3 types of chunking such as
chunk in paragraphs, sentences or even by punctuation marks – you are welcome to
choose your own chunking strategy. For task 4 you should try at least one other type of
similarity or distance function to calculate the similarity.

### Task
1. Creat a RAG pipeline that can take following text and answer following questions
2. Try different types of chunking to get better answers?
3. Does asking questions differently give better answers? Why?
4. Try a different similarity search instead of cosine similarity - do the answers improve?

In [62]:

sample_text = """
The Amazon rainforest is the largest tropical rainforest in the world, covering approximately 5.5 million square kilometers. It spans across nine countries, including Brazil, Peru, and Colombia. The rainforest is home to around 10% of the known species on Earth, including jaguars, sloths, and thousands of species of insects and birds.

Deforestation is a significant threat to the Amazon, with thousands of square kilometers lost each year due to agriculture, logging, and urbanization. This deforestation contributes to climate change, as the rainforest acts as a major carbon sink, absorbing millions of tons of carbon dioxide annually.

Indigenous tribes have lived in the Amazon for thousands of years, relying on its rich biodiversity for food, medicine, and shelter. These tribes have unique languages, traditions, and knowledge of the ecosystem. However, many face threats from illegal land encroachment and industrial activities.

Scientists believe that the Amazon plays a crucial role in global weather patterns by releasing water vapor into the atmosphere, which influences rainfall across South America and even other continents. The Amazon River, which flows through the rainforest, is the second longest river in the world and carries more water than any other river.

Efforts to protect the Amazon include international agreements, conservation programs, and sustainable development projects that aim to balance economic growth with environmental protection. Many organizations and governments are working to reduce illegal logging and promote reforestation initiatives.
"""

In [63]:
# Properly formulated questions for the sample text
questions = [
    "What is the Amazon rainforest?",
    "Which countries does the Amazon span across?",
    "Why is deforestation a problem in the Amazon?",
    "How does the Amazon rainforest affect global weather patterns?",
    "What role do indigenous tribes play in the Amazon?",
    "What is the importance of the Amazon River?",
    "What types of wildlife can be found in the Amazon?",
    "How does deforestation contribute to climate change?",
    "What efforts are being made to protect the Amazon?",
    "Why is the Amazon considered a major carbon sink?"
]

### Imports

In [64]:
import numpy as np
from sentence_transformers import SentenceTransformer
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.metrics.pairwise import euclidean_distances
from transformers import pipeline
import re

### Starting to build the pipeline:

Taking inspiration from the method used in the lecture

In [65]:
# Choose our model, in this case, we are using the MiniLM model
MODEL_NAME = "sentence-transformers/all-MiniLM-L6-v2"
model = SentenceTransformer(MODEL_NAME)

# Load a pre-trained question-answering model pipeline from transformers library
qa_pipeline = pipeline("question-answering")

No model was supplied, defaulted to distilbert/distilbert-base-cased-distilled-squad and revision 564e9b5 (https://huggingface.co/distilbert/distilbert-base-cased-distilled-squad).
Using a pipeline without specifying a model name and revision in production is not recommended.
Device set to use cpu


#### Defining the necessarry funcitons:

In [66]:
# Split the text into paragraphs, also known as chunks in generative AI training
def split_text(text):
    return [para.strip() for para in re.split("\n+", text) if para.strip()]

# Making numerical representations of the text for the model to understand
stored_texts = split_text(sample_text)  # Store each paragraph separately
stored_embeddings = model.encode(stored_texts, convert_to_numpy=True)  # Store embeddings for each chunk

# Function to trun the text into embeddings
def get_transformer_embeddings(texts):
    return model.encode(texts, convert_to_numpy=True)

# ONE FOR COSINE SIMILARITY SEARCH
# Functionss to return the most similar paragraph to the query paragraph!
def cosine_retrieve_passage(query):
    query_embedding = get_transformer_embeddings([query])
    similarities = cosine_similarity(query_embedding, stored_embeddings)[0]
    best_match_idx = np.argmax(similarities)
    return stored_texts[best_match_idx]

# QA function to answer questions
def cosine_answer_question(query):
    relevant_passage = cosine_retrieve_passage(query)
    answer = qa_pipeline(question=query, context=relevant_passage)
    return answer['answer']

# ONE FOR EUCLIDEAN DISTANCE SIMILARITY SEARCH
# Functionss to return the most similar paragraph to the query paragraph!
def euclidean_retrieve_passage(query):
    query_embedding = get_transformer_embeddings([query])
    similarities = euclidean_distances(query_embedding, stored_embeddings)[0]
    best_match_idx = np.argmax(similarities)
    return stored_texts[best_match_idx]

# QA function to answer questions
def euclidean_answer_question(query):
    relevant_passage = euclidean_retrieve_passage(query)
    answer = qa_pipeline(question=query, context=relevant_passage)
    return answer['answer']

#### QA Results using based on coisine similarity search

In [67]:
print("\nSample Questions and Answers:\n")
for question in questions:
    response = cosine_answer_question(question)
    print(f"Q: {question}\nA: {response}\n")


Sample Questions and Answers:

Q: What is the Amazon rainforest?
A: largest tropical rainforest in the world

Q: Which countries does the Amazon span across?
A: Brazil, Peru, and Colombia

Q: Why is deforestation a problem in the Amazon?
A: climate change

Q: How does the Amazon rainforest affect global weather patterns?
A: releasing water vapor into the atmosphere

Q: What role do indigenous tribes play in the Amazon?
A: relying on its rich biodiversity for food, medicine, and shelter

Q: What is the importance of the Amazon River?
A: global weather patterns

Q: What types of wildlife can be found in the Amazon?
A: Indigenous tribes

Q: How does deforestation contribute to climate change?
A: the rainforest acts as a major carbon sink

Q: What efforts are being made to protect the Amazon?
A: international agreements, conservation programs, and sustainable development projects

Q: Why is the Amazon considered a major carbon sink?
A: climate change



#### QA Resutls based on Euclidean Distance similarity search

In [68]:
print("\nSample Questions and Answers:\n")
for question in questions:
    response = euclidean_answer_question(question)
    print(f"Q: {question}\nA: {response}\n")


Sample Questions and Answers:

Q: What is the Amazon rainforest?
A: rich biodiversity

Q: Which countries does the Amazon span across?
A: agriculture, logging, and urbanization

Q: Why is deforestation a problem in the Amazon?
A: The rainforest is home to around 10% of the known species on Earth

Q: How does the Amazon rainforest affect global weather patterns?
A: relying on its rich biodiversity for food, medicine, and shelter

Q: What role do indigenous tribes play in the Amazon?
A: Amazon rainforest is the largest tropical rainforest in the world

Q: What is the importance of the Amazon River?
A: tropical rainforest in the world

Q: What types of wildlife can be found in the Amazon?
A: rainfall across South America and even other continents

Q: How does deforestation contribute to climate change?
A: rich biodiversity for food, medicine, and shelter

Q: What efforts are being made to protect the Amazon?
A: Amazon rainforest is the largest tropical rainforest in the world

Q: Why is 

### Comparison Cosine vs Euclidean:

Lets compare the first 5 Q & A from both the similarity searches

##### **Cosine Similarity**
Q: What is the Amazon rainforest?
A: largest tropical rainforest in the world

Q: Which countries does the Amazon span across?
A: Brazil, Peru, and Colombia

Q: Why is deforestation a problem in the Amazon?
A: climate change

Q: How does the Amazon rainforest affect global weather patterns?
A: releasing water vapor into the atmosphere

Q: What role do indigenous tribes play in the Amazon?
A: relying on its rich biodiversity for food, medicine, and shelter

##### **Euclidean Distance**

Q: What is the Amazon rainforest?
A: rich biodiversity

Q: Which countries does the Amazon span across?
A: agriculture, logging, and urbanization

Q: Why is deforestation a problem in the Amazon?
A: The rainforest is home to around 10% of the known species on Earth

Q: How does the Amazon rainforest affect global weather patterns?
A: relying on its rich biodiversity for food, medicine, and shelter

Q: What role do indigenous tribes play in the Amazon?
A: Amazon rainforest is the largest tropical rainforest in the world

### Conclusion

The use of similarity search in generative AI indeed do affect the answers

* Answers seem to stick under the same definition, but formulations will vary
* Some Q&A's are identical
* None of the tested methods seem to be giving incorrect answers

---


### Now lets try to pass in more poorly formulated questions to see if it has an effect on the response

Anything that is somewhat under the same definition is deemed acceptable. Im going to use just cosine similarity for this experiment.

In [69]:
# Poorly formulated questions for the sample text
poor_questions = [
    "Amazon rainforest???????",
    "Which countries Amazon span",
    "is deforestation in the Amazon?",
    "Amazon rainforest affect global weather patterns ro what the hell!?",
    "Are there tribes in the Amazon? and what do they do there, isnt it boring?",
    "Is there a amazing river in the Amazon?",
    "wildlife in amazonasa",
    "deforation and climate change?",
    "are someone protecting the Amazon?",
    "carboin sink? in thae Amazon?"
]

In [70]:
print("\nSample Questions and Answers:\n")
for question in poor_questions:
    response = cosine_answer_question(question)
    print(f"Q: {question}\nA: {response}\n")


Sample Questions and Answers:

Q: Amazon rainforest???????
A: tropical rainforest in the world

Q: Which countries Amazon span
A: Brazil, Peru, and Colombia

Q: is deforestation in the Amazon?
A: This deforestation contributes to climate change

Q: Amazon rainforest affect global weather patterns ro what the hell!?
A: releasing water vapor into the atmosphere

Q: Are there tribes in the Amazon? and what do they do there, isnt it boring?
A: These tribes have unique languages, traditions, and knowledge of the ecosystem

Q: Is there a amazing river in the Amazon?
A: the second longest river in the world and carries more water than any other river

Q: wildlife in amazonasa
A: Amazon

Q: deforation and climate change?
A: the rainforest acts as a major carbon sink

Q: are someone protecting the Amazon?
A: international agreements, conservation programs, and sustainable development projects

Q: carboin sink? in thae Amazon?
A: Amazon River



#### Conclusion

* Question formulation seem to play an impactfull role in the accuracy of answers
* Short questions lead to short answers
* Lacking explenations compared to proper formulated questions