# **Lab III - Advanced Topics**
## **Machine Learning II**


$$ $$
**Student 1:** Susana María Álvarez
* **CC:** 1049609578
* **Email:** susana.alvarezc@udea.edu.co

$$ $$
**Student 2:** Alejandro Martínez Henández
* **CC:** 1035877060
* **Email:** alejandro.martinezh@udea.edu.co

# 1. In your own words, describe what vector embeddings are and what they are useful for.

They are compact vector representations (relatively few dimensions) of something much more complex (data with many dimensions). The goal is to capture the characteristics of the complex data as accurately as possible.

They can be used as tags to identify and relate data within a set. For example, in language processing, you would expect words like *fast* and *swift* to have similar numeric tags.

Embeddings are also used in many other fields, such as representing DNA, RNA, or protein sequences (which makes it easier to find similar sequences) and analyzing graphs and social networks, among others.
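
As a rough illustration of the "similar numeric tags" idea, the sketch below uses hand-made 3-dimensional vectors. The values in `toy_embeddings` and the helper `nearest_word` are hypothetical and exist only for this example; real embedding models learn vectors with hundreds of dimensions from data.

```python
import math

# Hypothetical 3-dimensional embeddings, hand-made for illustration only.
toy_embeddings = {
    "fast":   [0.90, 0.80, 0.10],
    "swift":  [0.85, 0.75, 0.15],
    "banana": [0.10, 0.20, 0.95],
}

def nearest_word(word, embeddings):
    """Return the other word whose vector lies closest to `word`'s vector."""
    target = embeddings[word]
    candidates = (w for w in embeddings if w != word)
    # math.dist computes the Euclidean distance between two points.
    return min(candidates, key=lambda w: math.dist(target, embeddings[w]))

print(nearest_word("fast", toy_embeddings))
```

With these toy values, the vector for "fast" sits much closer to "swift" than to "banana", which is the behavior one would hope a real embedding model reproduces.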




# 2. What do you think is the best distance criterion to estimate how far two embeddings (vectors) are from each other? Why?

As seen in class, cosine similarity (and consequently, cosine distance) could be the best criterion for measuring how similar two embeddings (or vectors) are, because it focuses on the direction in which the vectors point rather than on their length (magnitude), so it captures the vectors' relationship through the angle between them.

This is especially useful in areas like language processing, where the meaning of words (represented as embeddings) depends on their context. Essentially, if two vectors point in almost the same direction (meaning a cosine similarity close to 1), they are considered very similar or related, regardless of their length.

### Equation:

$$\frac{\vec{u} \cdot \vec{v}}{\left \| \vec{u} \right \| \left \| \vec{v} \right \|} = \cos(\theta)$$
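
To make the "direction, not length" argument concrete, here is a minimal dependency-free sketch. The function names `cosine_similarity` and `cosine_distance` and the vectors `u`, `v`, `w` are chosen for this example only, and the code assumes neither vector is all zeros (otherwise the division fails).

```python
import math

def cosine_similarity(u, v):
    """cos(theta) = (u . v) / (|u| |v|); assumes neither vector is all zeros."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def cosine_distance(u, v):
    """Cosine distance is commonly defined as 1 - cosine similarity."""
    return 1.0 - cosine_similarity(u, v)

u = [1.0, 2.0, 3.0]
v = [10.0, 20.0, 30.0]  # same direction as u, ten times longer
w = [-3.0, 0.0, 1.0]    # orthogonal to u (their dot product is 0)

print("cosine similarity u, v:", cosine_similarity(u, v))
print("Euclidean distance u, v:", math.dist(u, v))
print("cosine similarity u, w:", cosine_similarity(u, w))
```

Here `u` and `v` are far apart in Euclidean terms yet have a cosine similarity of about 1, while the orthogonal pair `u`, `w` has a similarity of about 0.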


# 3. Let us build a Q&A (question answering) system! 😀 For this, consider the following steps:

## a. Pick whatever text you like, in the order of 20+ paragraphs.

For this exercise we chose "The Shoemaker And The Devil" tale by Anton Chekhov.

> "A discontented shoemaker, fantasizes about wealth while working late on Christmas Eve. His encounter with a devilish customer, who offers riches in exchange for his soul, leads Fyodor through a dream where his desires are fulfilled. However, the opulence and luxury quickly sour, as Fyodor realizes the emptiness and moral compromise of such a life. Awakening to find it was all a dream, Fyodor's perspective shifts, finding contentment in his modest existence and rejecting the notion that wealth is worth one's soul."


The next block of code is designed to read a text file.



In [None]:
# Define the path to the tale file within the Google Drive file system.
tale_path = '/content/drive/MyDrive/Materias/Aprendizaje automático II/Laboratorio 3/lab3_SA_AMH/The Shoemaker And The Devil by Anton Chekhov.txt'

# Open the tale file for reading ('r') using a context manager.
with open(tale_path, 'r') as tale_file:
    # Read the entire content of the file into a single string.
    tale = tale_file.read()

tale


'IT was Christmas Eve. Marya had long been snoring on the stove; all the paraffin in the little lamp had burnt out, but Fyodor Nilov still sat at work. He would long ago have flung aside his work and gone out into the street, but a customer from Kolokolny Lane, who had a fortnight before ordered some boots, had been in the previous day, had abused him roundly, and had ordered him to finish the boots at once before the morning service.\n\n"It\'s a convict\'s life!" Fyodor grumbled as he worked. "Some people have been asleep long ago, others are enjoying themselves, while you sit here like some Cain and sew for the devil knows whom. . . ."\n\nTo save himself from accidentally falling asleep, he kept taking a bottle from under the table and drinking out of it, and after every pull at it he twisted his head and said aloud:\n\n"What is the reason, kindly tell me, that customers enjoy themselves while I am forced to sit and work for them? Because they have money and I am a beggar?"\n\nHe hat

## b. Split that text into meaningful chunks/pieces.

The next block splits the text into sentences using the Natural Language Toolkit (NLTK) library, and then removes any items from the resulting list that are too short to be meaningful sentences (items shorter than three characters, which in this text are lone periods).

In [None]:
import nltk
nltk.download('punkt')
from nltk.tokenize import sent_tokenize

# Use NLTK's sent_tokenize to split the document into sentences.
# This function uses a pre-trained model to accurately identify sentence boundaries.
sentences = sent_tokenize(tale) # A list where each element is a separate sentence.

# Note about an issue observed with the result
# The "sentences" list contains many items that are just ".", which need to be removed.

# Initialize a counter for iterating through the sentence list
i = 0
# Print the initial number of items (sentences) in the list
print(f"Initial amount of items: {len(sentences)}")

# Loop through the sentence list to remove items that are too short (less than 3 characters)
while i < len(sentences):
    if len(sentences[i]) < 3: # If the sentence is less than 3 characters long,
        sentences.pop(i) # remove it from the list; the next item shifts into index i.
    else:
        i += 1 # Only increment the counter if an item was not removed to avoid skipping items

# Print the final number of items in the list after filtering out short items
print(f"Final items number after filter: {len(sentences)}")


[nltk_data] Downloading package punkt to /root/nltk_data...
[nltk_data]   Unzipping tokenizers/punkt.zip.


Initial amount of items: 192
Final items number after filter: 182
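
Single sentences such as `thought Fyodor.` carry little context on their own. One possible alternative, sketched here and not used in the rest of this lab, is to group consecutive sentences into overlapping windows and embed those instead. The function name `sliding_window_chunks`, its `window_size` parameter, and the demo sentences are hypothetical.

```python
def sliding_window_chunks(sentences, window_size=3):
    """Join each run of `window_size` consecutive sentences into one chunk."""
    if window_size < 1:
        raise ValueError("window_size must be at least 1")
    if len(sentences) <= window_size:
        # Too few sentences for more than one window: return a single chunk.
        return [" ".join(sentences)] if sentences else []
    # Slide one sentence at a time, so neighboring chunks overlap.
    return [
        " ".join(sentences[start:start + window_size])
        for start in range(len(sentences) - window_size + 1)
    ]

# Placeholder sentences for illustration only.
demo_sentences = ["First.", "Second.", "Third.", "Fourth."]
for chunk in sliding_window_chunks(demo_sentences, window_size=2):
    print(chunk)
```

Larger windows give each chunk more context but make the retrieved pieces less precise, so the window size would need to be tuned for the text at hand.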


## c. Implement the embedding generation logic. Which tools and approaches would help you generate them easily and high-level?

First, we need to install the proper library:

In [None]:
!pip install sentence-transformers

Collecting sentence-transformers
  Downloading sentence_transformers-2.4.0-py3-none-any.whl (149 kB)
[2K     [90m━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━[0m [32m149.5/149.5 kB[0m [31m2.0 MB/s[0m eta [36m0:00:00[0m
Installing collected packages: sentence-transformers
Successfully installed sentence-transformers-2.4.0


In [None]:
# Import the SentenceTransformer class and the NumPy library.
from sentence_transformers import SentenceTransformer
import numpy as np

# Load the model from the sentence-transformers library.
# The model 'all-MiniLM-L6-v2' is a compact, efficient version of the Transformer model,
# designed for creating sentence embeddings.
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

# Generate embeddings for each sentence in the 'sentences' list.
# This operation converts the list of textual sentences into a list of numerical vectors (embeddings),
# where each vector represents the semantic meaning of the corresponding sentence.
# The resulting 'sentence_embeddings' variable is a 2D array where each row corresponds to a sentence embedding.
sentence_embeddings = model.encode(sentences)

sentence_embeddings


array([[ 3.69317308e-02,  1.43328294e-01, -1.11463461e-02, ...,
        -5.46864010e-02, -7.51725538e-03, -1.03631308e-02],
       [ 7.19751479e-05, -7.69873150e-03, -3.95841012e-03, ...,
        -6.46948516e-02, -4.69714440e-02,  1.48200588e-02],
       [-4.67637591e-02,  1.20460786e-01,  6.28386885e-02, ...,
        -4.14340831e-02, -3.61942570e-03,  6.80238847e-03],
       ...,
       [ 2.76820194e-02,  4.39140685e-02, -6.92020059e-02, ...,
        -3.17805819e-02,  8.57159868e-03, -3.81122082e-02],
       [ 1.00472972e-01,  4.88680415e-02,  3.12465504e-02, ...,
        -1.86411589e-02, -3.04018725e-02,  2.16403846e-02],
       [ 2.64658555e-02,  4.77815829e-02, -4.90951128e-02, ...,
        -2.56617982e-02,  5.63742667e-02, -2.96878517e-02]], dtype=float32)

## d. For every question asked by the user, return a sorted list of the N chunks/pieces in your text that relate the most to the question. Do results make sense?

One way to solve this is with cosine similarity: we compute the cosine similarity between the question embedding and every sentence embedding, take the indices of the sentences with the highest similarity, sort them from most to least similar, and print them.

In [None]:
from sklearn.metrics.pairwise import cosine_similarity

# Define a function to find the most relevant sentences to a given question.
def find_most_relevant_sentences(question, sentence_embeddings, sentences, top_k=5):
    """
    Find the most relevant sentences to a question using cosine similarity.

    Parameters:
    - question (str): The question to match against the sentences.
    - sentence_embeddings (np.ndarray): Precomputed embeddings for the sentences.
    - sentences (list of str): The sentences to search through.
    - top_k (int): The number of top relevant sentences to return.

    Returns:
    - list of str: The top_k most relevant sentences to the question.
    """
    # Convert the question into its embedding.
    question_embedding = model.encode([question])

    # Calculate cosine similarity between the question embedding and the sentence embeddings.
    similarities = cosine_similarity(question_embedding, sentence_embeddings)

    # Find the indices of sentences with the highest similarities.
    most_similar_indices = similarities[0].argsort()[-top_k:][::-1]

    # Return the most relevant sentences.
    return [sentences[i] for i in most_similar_indices]

# List of questions to be answered
questions = [
    "What is the Shoemaker's name?",
    "Which was Fyodor's work?",
    "Who dragged Fyodor to hell?",
    "What does Fyodor want?",
    "What does the evil spirit want?"
]

# Iterate through all the questions
for question in questions:
    top_k_sentences = find_most_relevant_sentences(question, sentence_embeddings, sentences, top_k=5)
    print(f"Most relevant sentences for: '{question}'")
    for sentence in top_k_sentences:
        print(sentence)
    print()  # Print a blank line for better readability between questions.

Most relevant sentences for: 'What is the Shoemaker's name?'
What shoemaker made it?"
"Thank you, shoemaker!
You are not a shoemaker!"
And without loss of time the shoemaker began complaining of his lot.
He had often seen beautiful young ladies in the houses of rich customers, but they either took no notice of him whatever, or else sometimes laughed and whispered to each other: "What a red nose that shoemaker has!"

Most relevant sentences for: 'Which was Fyodor's work?'
thought Fyodor.
Fyodor grumbled as he worked.
Dreaming like this, Fyodor suddenly thought of his work, and opened his eyes.
There was a great deal of money, but Fyodor wanted more still.
thought Fyodor; "here's a go!"

Most relevant sentences for: 'Who dragged Fyodor to hell?'
And he dragged Fyodor to hell, straight to the furnace, and devils flew up from all directions and shouted:

"Fool!
thought Fyodor.
When Fyodor went in to him he was sitting on the floor pounding something in a mortar, just as he had been the for

Judging by these results, the retrieval does not work perfectly, but some of the returned sentences are related to the expected answer. We also need to take into account how the system is designed in order to know how to phrase the questions.

Each query only returns the sentences closest to the question in embedding space; at the moment, the model cannot understand the context.
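
When judging whether the results make sense, it may help to look at the similarity score next to each returned chunk, since a low top score suggests that no chunk really matches the question. The sketch below shows the idea with toy 2-dimensional vectors in place of real embeddings; `rank_with_scores`, the chunk labels, and all vector values are hypothetical.

```python
import math

def cosine_similarity(u, v):
    """cos(theta) between two non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    return dot / (math.hypot(*u) * math.hypot(*v))

def rank_with_scores(query_vec, chunk_vecs, chunks, top_k=2):
    """Return the top_k (score, chunk) pairs, highest similarity first."""
    scored = [
        (cosine_similarity(query_vec, vec), chunk)
        for vec, chunk in zip(chunk_vecs, chunks)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored[:top_k]

# Toy vectors standing in for real sentence embeddings.
chunks = ["chunk A", "chunk B", "chunk C"]
chunk_vecs = [[1.0, 0.0], [0.7, 0.7], [0.0, 1.0]]
query_vec = [1.0, 0.1]

for score, chunk in rank_with_scores(query_vec, chunk_vecs, chunks):
    print(f"{score:.3f}  {chunk}")
```

A threshold on the score could then be used to discard weak matches, although a suitable value would have to be found experimentally.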

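There are other models that we can use to get proper answers.
The next block uses one of them, `deepset/roberta-base-squad2`, through the `transformers` question-answering pipeline.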

In [None]:
# Import necessary classes from the transformers package
from transformers import AutoModelForQuestionAnswering, AutoTokenizer, pipeline

# Specify the model name for the question answering task
model_name = "deepset/roberta-base-squad2"

# Initialize the pipeline for question answering with the specified model and tokenizer
nlp = pipeline('question-answering', model=model_name, tokenizer=model_name)

def qa(question):
    """
    Prints the answer to a given question based on a predefined context.

    Parameters:
    - question (str): The question to be answered.
    """
    # Define the input for the question-answering pipeline
    QA_input = {
        'question': question,
        'context': tale,
    }

    # Get the answer from the question-answering pipeline
    ans = nlp(QA_input)
    # Print the question and its corresponding answer
    print(f'''
    {question}:
        {ans['answer']}''')

# List of questions to be answered
questions = [
    "What is the Shoemaker's name?",
    "Which was Fyodor's work?",
    "Who dragged Fyodor to hell?",
    "What does Fyodor want?",
    "What does the evil spirit want?"
]

# Iterate through the list of questions and use the qa function to print answers
for i in questions:
    qa(i)


    What is the Shoemaker's name?:
        Kuzma Lebyodkin

    Which was Fyodor's work?:
        shoemaker

    Who dragged Fyodor to hell?:
        the evil spirit

    What does Fyodor want?:
        boots

    What does the evil spirit want?:
        a tiny scrap of one's soul


# 4. What do you think could make these types of systems more robust in terms of semantics and functionality?

* Input Data Quality:

    Cleaner input data can significantly improve result quality. This involves preprocessing the input text: handling punctuation and special characters, and removing irrelevant information.

* Top_k Experimentation for Relevant Sentences:

    When seeking the most relevant sentences, experimentation with various values for top_k is valuable. Adjusting this parameter can have a notable impact on the relevance of the extracted sentences.

* Utilizing Fine-Tuning:

    Fine-tuning adjusts the model's parameters using task-specific data and an associated loss function. Through this optimization, the model learns to give more accurate answers or to generate better sentence embeddings tailored to the particular task.
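
As one example of the input-cleaning point, the tale contains spaced ellipses such as `. . . .`, which may be what produces the lone `.` items seen after sentence splitting. A minimal sketch of a cleaning step applied before tokenization could look like this; `clean_text` and its regular expressions are assumptions for illustration, not part of the lab code.

```python
import re

def clean_text(text):
    """Normalize spaced ellipses and whitespace before sentence splitting."""
    # Collapse spaced ellipses such as ". . . ." into a single "...".
    text = re.sub(r"(?:\.\s+){2,}\.", "...", text)
    # Collapse any run of whitespace (including newlines) into one space.
    text = re.sub(r"\s+", " ", text)
    return text.strip()

sample = 'sew for the devil knows whom. . . ."\n\nTo save himself'
print(clean_text(sample))
```

Whether collapsing the ellipses actually improves retrieval would still need to be checked against the questions used above.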

# 5. Bonus points if deployed on a local or cloud server.

Please refer to the README in the GitHub repository: https://github.com/SusanaAlvarezC/ML2-Lab3-/