# Workshop III

Participants:

Juan José Gil Hoyos

Ana Estefanía Henao Restrepo

## Exercise 1

Vector embeddings are numerical representations of objects, words, or other data points in a multi-dimensional vector space. They are designed so that objects with similar meaning or behavior map to nearby vectors, which lets geometric closeness stand in for semantic similarity. In hand-built feature vectors each dimension corresponds to a specific attribute and its value indicates how strongly the object has that attribute. In learned embeddings the individual dimensions are usually not interpretable on their own, and the meaning lies in where vectors sit relative to each other.

Vector embeddings are used in many fields, including natural language processing, computer vision, and recommendation systems.
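As a minimal sketch of the idea, the snippet below looks up the closest neighbor of a word in a tiny embedding table. The three-dimensional vectors and the helper name `nearest` are invented for illustration; real embeddings are learned by a model and typically have hundreds of dimensions.

```python
import math

# Hypothetical 3-dimensional embeddings, invented for illustration only.
toy_embeddings = {
    "king":  [0.90, 0.80, 0.10],
    "queen": [0.85, 0.90, 0.15],
    "apple": [0.10, 0.20, 0.95],
}

def nearest(word, embeddings):
    """Return the other word whose vector lies closest to `word`'s vector."""
    target = embeddings[word]
    others = (w for w in embeddings if w != word)
    # math.dist computes the Euclidean distance between two points
    return min(others, key=lambda w: math.dist(target, embeddings[w]))

print(nearest("king", toy_embeddings))  # prints "queen"
```

With these made-up values, "king" and "queen" sit close together while "apple" lies far from both, which is the kind of relationship a trained embedding model is expected to capture.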

## Exercise 2

The right distance metric for estimating the similarity or dissimilarity between two embeddings (vectors) depends on the nature of the data and the task at hand; no single metric fits every scenario. Here are some commonly used metrics and when they might be suitable:

1. Euclidean Distance: This is the most common and intuitive distance metric. It calculates the straight-line distance between two points in a Euclidean space. Euclidean distance works well when the dimensions of the vectors represent comparable physical units, and you want to measure the "as-the-crow-flies" distance between points. However, it can be sensitive to scale, meaning that features with large values may dominate the distance calculation.

2. Cosine Similarity: Cosine similarity measures the cosine of the angle between two vectors, so it depends only on their orientation and not on their magnitude. Strictly speaking it is a similarity rather than a distance; the corresponding distance is 1 minus the cosine similarity. It is often used for text and other high-dimensional data, and is particularly useful for capturing semantic similarity in natural language processing tasks, where words with similar meanings should have high cosine similarity.

3. Manhattan Distance (L1 Distance): Manhattan distance calculates the sum of the absolute differences between the coordinates of two vectors. It is less sensitive to outliers than Euclidean distance and can be useful when dealing with data that has a sparse distribution, like some types of categorical data.

4. Minkowski Distance: Minkowski distance generalizes both Euclidean and Manhattan distances through an order parameter p: p = 1 gives Manhattan distance and p = 2 gives Euclidean distance. The appropriate order depends on the characteristics of your data and the problem you are solving.

5. Hamming Distance: Hamming distance counts the number of positions at which two equal-length sequences differ. It is most often applied to binary data, such as bit strings or binary feature vectors, where it equals the number of differing bits.

6. Jaccard Similarity: Jaccard similarity measures the similarity between two sets by dividing the size of their intersection by the size of their union. It is commonly used in text analysis and recommendation systems for comparing sets of items.

7. Mahalanobis Distance: Mahalanobis distance accounts for correlations between dimensions in multivariate data. It is useful when dealing with data where features are correlated, and you want to account for these correlations in the distance calculation.

The best distance criterion depends on the characteristics of your data and the goals of your analysis. Consider the scale of the data, the nature of the features, and the properties you need from the metric (e.g., sensitivity to outliers, handling of sparse data, capturing semantic similarity). In practice, it is often a good idea to try several metrics and use cross-validation or another evaluation method to determine which one performs best for your task.
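The metrics listed above can be sketched in a few lines of plain Python. This is an illustrative sketch rather than a reference implementation: the function names are chosen here, the inputs are assumed to be equal-length sequences (or sets, for Jaccard), and Mahalanobis distance is left out because it additionally needs a covariance matrix estimated from data.

```python
import math

def euclidean(a, b):
    """Straight-line (L2) distance."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def manhattan(a, b):
    """Sum of absolute coordinate differences (L1)."""
    return sum(abs(x - y) for x, y in zip(a, b))

def minkowski(a, b, p):
    """Generalization: p=1 is Manhattan, p=2 is Euclidean."""
    return sum(abs(x - y) ** p for x, y in zip(a, b)) ** (1 / p)

def cosine_similarity(a, b):
    """Cosine of the angle between a and b (assumes non-zero vectors)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def hamming(a, b):
    """Number of positions where the two sequences differ."""
    return sum(x != y for x, y in zip(a, b))

def jaccard(a, b):
    """Intersection over union (assumes at least one set is non-empty)."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

u, v = [1.0, 2.0, 3.0], [4.0, 6.0, 3.0]
print(euclidean(u, v))                      # 5.0
print(manhattan(u, v))                      # 7.0
print(hamming([1, 0, 1, 1], [1, 1, 0, 1]))  # 2
print(jaccard("abc", "bcd"))                # 0.5
```

Scaling a vector changes its Euclidean distance to other points but leaves its cosine similarity unchanged, which is one reason cosine similarity is a common default for text embeddings.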

## Exercise 3

In [None]:
# Sample text (you can replace this with your text corpus)
text = """

Once upon a time in the peaceful village of Mount Paozu, a young boy named Goku lived a simple life. Unbeknownst to him, Goku was no ordinary child; he was a Saiyan, an alien warrior race
Goku's life took an unexpected turn when he met Bulma, a brilliant scientist, who sought the Dragon Balls, magical orbs that could grant any wish. Together, Goku and Bulma embarked on a journey
filled with adventures, encountering strange creatures, and forming lasting friendships. Along the way, Goku's life became intertwined with the pursuit of the Dragon Balls, leading to epic battles with formidable foes like Emperor Pilaf and the shape-shifting Oolong.
As Goku grew, he discovered his true heritage as a Saiyan and began training under the wise Master Roshi, known as the Turtle Hermit. This phase of Goku's life was marked by grueling training, humorous mishaps, and the learning of powerful techniques like the Kamehameha wave.
The World Martial Arts Tournament played a significant role in Goku's life, where he faced strong opponents like Krillin and Yamcha. These tournaments showcased Goku's martial arts prowess and served as a platform to make lifelong friends.
Goku's life took a dramatic turn when he learned about the existence of his evil Saiyan brother, Raditz, who threatened Earth's peace. Goku sacrificed his life to save the planet but was later resurrected and sent to the planet Namek, where he confronted the tyrannical Frieza.
In the midst of these challenges, Goku's life saw him ascend to the legendary Super Saiyan form, a transformation that unlocked immense power. This transformation marked a turning point in Goku's life and the series.
Dragon Ball Z Kai brought a new perspective to Goku's life, offering a remastered version of the original Z series. It condensed the storyline and focused on the core events of Goku's life, making the saga more accessible to a new generation of fans.
Dragon Ball GT saw Goku's life take an unexpected twist when he was transformed into a child and set on a quest to recover the Black Star Dragon Balls. This phase of Goku's life brought new adventures, including battles against formidable foes like Baby and Omega Shenron.
Dragon Ball Super introduced Goku's life to a multiverse of possibilities. It showcased epic battles with gods and otherworldly foes and introduced new transformations, such as Super Saiyan Blue and Ultra Instinct. Goku's life in Dragon Ball Super expanded the boundaries of his power and explored the concept of alternate realities.
As Goku's life continued, he passed on his martial arts knowledge to the next generation, including his son Gohan and protege Uub. Goku's legacy lived on, ensuring that the spirit of adventure and the pursuit of strength would endure.
In his later years, Goku's life saw him exploring other realms and dimensions, seeking new challenges and adventures. His journey was a testament to his unwavering spirit, dedication to protecting Earth, and his never-ending quest for self-improvement as a martial artist.
And so, Goku's life, filled with battles, friendships, and extraordinary transformations, became a legendary tale known throughout the Dragon Ball universe, inspiring generations to come.

"""

In [None]:
!pip install transformers

Collecting transformers
  Downloading transformers-4.33.2-py3-none-any.whl (7.6 MB)
Collecting huggingface-hub<1.0,>=0.15.1 (from transformers)
  Downloading huggingface_hub-0.17.2-py3-none-any.whl (294 kB)
Collecting tokenizers!=0.11.3,<0.14,>=0.11.1 (from transformers)
  Downloading tokenizers-0.13.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (7.8 MB)
Collecting safetensors>=0.3.1 (from transformers)
  Downloading safetensors-0.3.3-cp310-cp310-manylinux_2_17_x86_64.manylinux2014_x86_64.whl (1.3 MB)

In [None]:
! pip install spacy



In [None]:
from transformers import BertTokenizer, BertForQuestionAnswering
import torch
import nltk

In [None]:
nltk.download('punkt')

# Load pre-trained BERT model and tokenizer
model_name = "bert-large-uncased-whole-word-masking-finetuned-squad"  # You can choose a different model
tokenizer = BertTokenizer.from_pretrained(model_name)
model = BertForQuestionAnswering.from_pretrained(model_name)

# Split the text into meaningful chunks (in this case, sentences)
sentences = nltk.sent_tokenize(text)  # Split into sentences

[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Some weights of the model checkpoint at bert-large-uncased-whole-word-masking-finetuned-squad were not used when initializing BertForQuestionAnswering: ['bert.pooler.dense.weight', 'bert.pooler.dense.bias']
- This IS expected if you are initializing BertForQuestionAnswering from the checkpoint of a model trained on another task or with another architecture (e.g. initializing a BertForSequenceClassification model from a BertForPreTraining model).
- This IS NOT expected if you are initializing BertForQuestionAnswering from the checkpoint of a model that you expect to be exactly identical (initializing a BertForSequenceClassification model from a BertForSequenceClassification model).


There are several tools and approaches that could help us generate vector embeddings easily and at a high level of quality. Let's talk about these options:

* **Pre-trained Language Models:** Utilize pre-trained language models like BERT, GPT-3, RoBERTa, or others. These models are trained on massive text corpora and can generate high-quality embeddings for both questions and text chunks.

* **Hugging Face Transformers Library:** The Hugging Face Transformers library provides a user-friendly interface for working with pre-trained models. We can load models, tokenize text, and generate embeddings easily using this library.

* **Sentence Transformers:** This library is specifically designed for generating embeddings from sentences or text chunks. It includes pre-trained models like BERT, RoBERTa, and more, fine-tuned for sentence-level tasks.

* **spaCy:** spaCy is a popular NLP library that provides word vectors and, by averaging them, document- and sentence-level vectors. While not as powerful as transformer-based models, it can be a lightweight solution for certain tasks.

* **Word Embeddings (Word2Vec, GloVe):** If you're working with individual words or need word-level embeddings within your text chunks, pre-trained word embeddings like Word2Vec or GloVe can be useful.

* **Pooling Strategies:** For generating embeddings from the output of pre-trained models, consider pooling strategies like averaging word embeddings, max-pooling, or using special tokens like [CLS] or [SEP] embeddings, depending on the model architecture.
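To make the pooling strategies concrete, here is a small sketch that applies mean pooling, max pooling, and [CLS] pooling to a toy list of token vectors. The function names and the numbers are invented for illustration; with a real model the same operations would be applied to its per-token output vectors, and plain lists are used here only so the example runs without extra dependencies.

```python
def mean_pool(token_vectors, attention_mask):
    """Average the vectors of real tokens, ignoring padded positions (mask == 0)."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [sum(vec[d] for vec in kept) / len(kept) for d in range(dim)]

def max_pool(token_vectors, attention_mask):
    """Take the per-dimension maximum over real tokens."""
    kept = [vec for vec, m in zip(token_vectors, attention_mask) if m == 1]
    dim = len(token_vectors[0])
    return [max(vec[d] for vec in kept) for d in range(dim)]

def cls_pool(token_vectors):
    """Use the first token's vector; in BERT-style inputs that position is [CLS]."""
    return list(token_vectors[0])

# Toy input: 4 positions with 2-dimensional vectors; the last position is padding.
tokens = [[1.0, 0.0], [3.0, 2.0], [2.0, 4.0], [9.0, 9.0]]
mask = [1, 1, 1, 0]

print(mean_pool(tokens, mask))  # [2.0, 2.0]
print(max_pool(tokens, mask))   # [3.0, 4.0]
print(cls_pool(tokens))         # [1.0, 0.0]
```

The attention mask matters: without it, the padded vector `[9.0, 9.0]` would distort both the mean and the maximum.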

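Let's use Hugging Face Transformers. Rather than comparing embeddings directly, the cell below runs the extractive question-answering model on each sentence and collects the answer span it predicts for every chunk: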

In [None]:
# User's question
question = "What is the most remarkable thing that Goku did?"

# Collect (sentence, answer) pairs for every chunk
relevant_chunks = []

# Tokenize the question once, outside the loop
question_tokens = tokenizer.tokenize(question)

# Iterate through the sentences
for sentence in sentences:
    # Tokenize the sentence
    sentence_tokens = tokenizer.tokenize(sentence)

    # Skip chunks that produce no tokens
    if not sentence_tokens:
        continue

    # Encode the question and the sentence as one paired input
    inputs = tokenizer.encode_plus(question_tokens, sentence_tokens, return_tensors="pt", padding=True, truncation=True)

    # Run the model once and read both sets of logits from the same output
    with torch.no_grad():
        outputs = model(**inputs)
    start_logits, end_logits = outputs.start_logits, outputs.end_logits

    # Pick the most likely start and end positions independently
    # (if end_idx < start_idx, the decoded answer is an empty string)
    start_idx = torch.argmax(start_logits, dim=1).item()
    end_idx = torch.argmax(end_logits, dim=1).item()

    # Decode and store the answer
    answer = tokenizer.decode(inputs["input_ids"][0][start_idx:end_idx + 1])
    relevant_chunks.append((sentence, answer))

# Sort by answer length, shortest first. This is a crude heuristic, not a true
# relevance score (you can use more advanced ranking methods)
relevant_chunks.sort(key=lambda x: len(x[1]))

# Define N (maximum number of chunks to return)
N = 20

# Keep the first N chunks in that order
top_chunks = relevant_chunks[:N]

print('Main question: '.upper(), question.upper())

# Print the selected chunks and their answers
for i, (sentence, answer) in enumerate(top_chunks, 1):
    print(f"Sentence {i}:")
    print(sentence)
    print("Answer:", answer)
    print("=" * 50)

MAIN QUESTION:  WHAT IS THE MOST REMARKABLE THING THAT GOKU DID?
Sentence 1:
These tournaments showcased Goku's martial arts prowess and served as a platform to make lifelong friends.
Answer: tournaments
Sentence 2:
His journey was a testament to his unwavering spirit, dedication to protecting Earth, and his never-ending quest for self-improvement as a martial artist.
Answer: his journey
Sentence 3:
Unbeknownst to him, Goku was no ordinary child; he was a Saiyan, an alien warrior race
Goku's life took an unexpected turn when he met Bulma, a brilliant scientist, who sought the Dragon Balls, magical orbs that could grant any wish.
Answer: he met bulma
Sentence 4:
This transformation marked a turning point in Goku's life and the series.
Answer: transformation
Sentence 5:


Once upon a time in the peaceful village of Mount Paozu, a young boy named Goku lived a simple life.
Answer: lived a simple life
Sentence 6:
Dragon Ball GT saw Goku's life take an unexpected twist when he was transforme
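
The cell above orders chunks by the length of the decoded answer, which says little about relevance. One possible refinement, sketched below under stated assumptions, is to score each chunk by the sum of the start and end logits of its best valid span (start not after end) and sort by that score. The helper name `best_span`, the `max_answer_len` parameter, and the toy logits are invented for illustration. The sketch also searches every position, whereas a fuller version would restrict the search to the sentence tokens, and scores from different inputs are only roughly comparable.

```python
def best_span(start_logits, end_logits, max_answer_len=30):
    """Return (start, end, score) for the highest-scoring span with start <= end."""
    best = (0, 0, float("-inf"))
    for s, s_logit in enumerate(start_logits):
        # Only consider end positions at or after the start, within the length cap
        last = min(s + max_answer_len, len(end_logits))
        for e in range(s, last):
            score = s_logit + end_logits[e]
            if score > best[2]:
                best = (s, e, score)
    return best

# Toy logits: taking each argmax independently gives start=3, end=0 (invalid).
start_logits = [0.1, 2.0, 0.3, 5.0]
end_logits = [4.0, 0.2, 3.0, 1.0]
print(best_span(start_logits, end_logits))  # (3, 3, 6.0)

# Rank chunks by span score, highest first, and keep the top 2.
scored = [("chunk A", 1.5), ("chunk B", 6.0), ("chunk C", 3.2)]
top = sorted(scored, key=lambda item: item[1], reverse=True)[:2]
print(top)  # [('chunk B', 6.0), ('chunk C', 3.2)]
```

With real model output, the logits for one input could be passed in as plain lists, for example via `start_logits[0].tolist()`.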

## Exercise 4

Improving the robustness of a Question-and-Answer (Q&A) system in terms of semantics and functionality involves a combination of technological advancements, data quality, and algorithmic enhancements. Here are some key factors and strategies to consider:

1. **Advanced Natural Language Processing (NLP) Models:** Utilize state-of-the-art NLP models like GPT-4 or more recent versions, which can provide better semantic understanding and generate more coherent responses.

2. **Multimodal Understanding:** Incorporate not just text but also images, audio, and video to enhance contextual understanding. This is particularly important for understanding questions related to multimedia content.

3. **Semantic Parsing:** Develop or integrate sophisticated semantic parsing techniques that can understand the structure and meaning of complex questions. This involves breaking down questions into their constituent parts and understanding the relationships between them.

4. **Knowledge Graphs:** Build and integrate knowledge graphs that represent structured information about the world. This enables the system to answer questions by traversing the graph to find relevant facts and connections.

5. **Named Entity Recognition (NER):** Enhance the system's ability to identify and understand named entities, such as people, places, and organizations, in order to provide more precise answers.

6. **Contextual Understanding:** Improve the system's ability to recognize and utilize context from previous questions or statements within the same conversation to provide more coherent and relevant answers.

7. **Fact-Checking and Validation:** Implement mechanisms to fact-check answers against reliable sources and validate the accuracy of responses before presenting them to users.

8. **User Feedback Loop:** Incorporate a feedback mechanism that allows users to rate the quality and relevance of answers. This feedback can be used to continually improve the system's performance.

9. **Domain Specialization:** Customize the Q&A system for specific domains or industries to provide more accurate and domain-specific answers. This may involve training the model on domain-specific data.

10. **Data Augmentation:** Expand the training data by incorporating diverse and high-quality sources to expose the model to a wide range of question types and language variations.

11. **Fine-Tuning and Transfer Learning:** Fine-tune pre-trained models on specific Q&A tasks and domains to adapt them to the particular needs of the application.

12. **Ethical Considerations:** Implement safeguards to ensure that the system provides unbiased and ethical responses, avoiding harmful or discriminatory content.

13. **Scalability and Performance Optimization:** Ensure that the system can handle high volumes of queries efficiently and has low latency to provide a seamless user experience.

14. **Multilingual Support:** Develop the system to support multiple languages, as this expands its user base and utility.

15. **Interoperability:** Allow the system to integrate with other applications and services through APIs, making it more versatile and useful in various contexts.

16. **Continuous Learning:** Implement mechanisms for the system to learn and adapt over time as it interacts with users and receives feedback.
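Two of the items above, answer validation (7) and the user feedback loop (8), are simple enough to sketch in code. The snippet below is a minimal illustration, not a production design: the names `validate_answer` and `FeedbackLog`, the fallback message, and the score threshold of 5.0 are all assumptions chosen for the example, and a real threshold would need to be tuned on held-out questions.

```python
from collections import defaultdict

NO_ANSWER = "I could not find a reliable answer."

def validate_answer(answer, score, min_score=5.0):
    """Return the answer only if it is non-empty and its score clears the threshold."""
    if not answer.strip() or score < min_score:
        return NO_ANSWER
    return answer

class FeedbackLog:
    """Collect 1-5 user ratings per question and report the running average."""

    def __init__(self):
        self._ratings = defaultdict(list)

    def rate(self, question, rating):
        if not 1 <= rating <= 5:
            raise ValueError("rating must be between 1 and 5")
        self._ratings[question].append(rating)

    def average(self, question):
        # .get avoids creating an empty entry for unseen questions
        ratings = self._ratings.get(question)
        return sum(ratings) / len(ratings) if ratings else None

print(validate_answer("the kamehameha wave", 7.2))  # passes validation
print(validate_answer("", 9.0))                     # falls back to NO_ANSWER

log = FeedbackLog()
log.rate("What did Goku learn?", 4)
log.rate("What did Goku learn?", 5)
print(log.average("What did Goku learn?"))  # 4.5
```

Questions whose average rating stays low are natural candidates for review, extra training data, or a different retrieval strategy.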

Improving the robustness of a Q&A system is an ongoing process that requires a combination of cutting-edge technology, high-quality data, and a commitment to user feedback and improvement. It's also important to stay updated with the latest advancements in the field of NLP to continually enhance the system's capabilities.