Analytics & Data Science

Universidad de Antioquia - ML2

Feb 2024

Natalia López Grisale CC. 1040048893
Tatiana García Zuluaga CC. 1017198484
Melissa Ortega Alzate CC.1036964792

# Libraries

In [1]:
# Calculate distances
from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

# Open Source library to build embeddings
#! pip install -U sentence-transformers
from sentence_transformers import SentenceTransformer

  from .autonotebook import tqdm as notebook_tqdm


# Functions

In [2]:
# Define function to read the text file
def read_text_from_file(file_name):
    """
    Read text data from a file.
    Args:
        file_name (str): The name of the file to read.
    Returns:
        str: The text read from the file.
    """
    with open(file_name, 'r', encoding='iso-8859-1') as file:
        text = file.read()
    return text

In [3]:
# Define function to split the text into a number of desired sentences
def split_text(text, sentences_per_fragment):
    """
    Split text into fragments based on a specified number of lines per fragment.

    Args:
        text (str): The text to be split into fragments.
        lines_per_fragment (int): The number of lines per fragment.

    Returns:
        list of str: List of fragments, where each fragment contains the specified number of lines.
    """
    lines = text.split('.')
    fragments = []
    
    for i in range(0, len(lines), sentences_per_fragment):
        fragment = '.'.join(lines[i: i+sentences_per_fragment])
        fragments.append(fragment)
    
    return fragments

# 1. In your own words, describe what vector embeddings are and what they are useful for.

The embeddings are a numerical representation not only of language but also of its context and semantics. 

They are fixed-dimensional vectors constructed from texts so that ML models can understand audio, text, image, video instructions, etc. Embeddings are, therefore, a representation in a large dimensional space with the best possible meaning of context and semantics. Since embeddings are vectors, they can be manipulated with all traditional linear algebra techniques.

These embeddings are useful in many areas, and thanks to them, recommendation systems (YouTube, Netflix), semantic search engines (YouTube), translators, ChatGPT, text classifiers, and in general, all language understanding AI models have been built.

# 2. What do you think is the best distance criterion to estimate how far two embeddings ...

In [4]:
# Examples
cat = np.array([0.9 , 0.8, 0.2])
dog = np.array([0.8 , 0.85, 0.15])
computer = np.array([0.1 , 0.3, 0.9])

One of the advantages of working with vectors is that we can measure the distance between them, allowing us to assess how far apart one word is from another. Therefore, the similarity of two vectors can be measured as follows:

- **Cosine distance:** This criterion takes into account the direction of the vectors by measuring the angle formed between them. When two vectors have the same direction, the angle between them is smaller. For example, two orthogonal vectors could represent words that are not very similar. Two vectors forming an angle of 180 degrees would represent two words with opposite or very different meanings. The smaller the angle, the more similar the vectors. This is a very efficient distance criterion for measuring the similarity between two words. 

The result is a number between 0 and 1. And as the next example shows, when the results is near to 1, the vectors have more similarity.

In [5]:
# Cosine distance example using numpy

# Calculate cat-dog
dot_product = np.dot(cat, dog)
norm_cat = np.linalg.norm(cat)
norm_dog = np.linalg.norm(dog)
print(f"Cosine distance between cat and dog: {dot_product/(norm_cat * norm_dog)}")

# Calculate cat-dog
dot_product = np.dot(cat, computer)
norm_cat = np.linalg.norm(cat)
norm_computer = np.linalg.norm(computer)
print(f"Cosine distance between cat and dog: {dot_product/(norm_cat * norm_computer)}")

Cosine distance between cat and dog: 0.9954467122628464
Cosine distance between cat and dog: 0.4379820840197298


- **Euclidean distance:** this criterio calculates the distance between the vectors taking into account their magnitud. This distance is calculated as the difference between the vectors. The result is a number between 0 and infinite.  The smaller the distance, the larger the similarity between the vectors. In the following example, the euclidean distance bewtween cat and computer is higher than the one between cat and dog indicating cat and computer are more further apart.

In [6]:
# Euclidean distance example
print(f"Euclidean distance between cat and dog: {np.linalg.norm(cat - dog)}")
print(f"Euclidean distance between cat and computer: {np.linalg.norm(cat - computer)}")

Euclidean distance between cat and dog: 0.12247448713915887
Euclidean distance between cat and computer: 1.174734012447073


- **Dot product:** this metric take the not only the angel but the magnitud of the vectors.  This product will be negative when the vectors are opposite and positive if they have the same direction. Therefore, when the data is normalized, it is the same as calculate the cosine distance. In the example, the metric is larger for cat-dog distance, indicating that they have higher similarity than cat-computer.

In [7]:
# Dot product distance example
print(f" Dot product distance between cat and dog: {np.dot(cat, dog)}")
print(f" Dot product distance between cat and computer: {np.dot(cat, computer)}")

 Dot product distance between cat and dog: 1.4300000000000002
 Dot product distance between cat and computer: 0.51


It is a challenge to to calculate the distances between hunderds of vectors with hunderds of dimensions. Some distances metrics are more compute-heavy than others and it is important to balance the speed of compute and the accuracy. For sentence transformes **cosine distance** would be the best criteria for the calculation. Also the models onwoards use them and it is important to match the distance with the one the model is already using.

![Descripción de la imagen](https://weaviate.io/assets/images/hero-183a22407b0eaf83e53d574aee0a049a.png)

Figure taken from: https://weaviate.io/blog/distance-metrics-in-vector-search 

# 3. Let us build a Q&A (question answering) system!

### a. Pick a text

The text was taken and edited from: https://aws.amazon.com/what-is/machine-learning/?nc1=h_ls

In [8]:
# Load the text using the predefined function
text = read_text_from_file('suggested_text/sample_1.txt')

# Text description
print("The length of the text is", len(text), "characters\n")
print("The text up to character 342 is:\n", text[:342])

The length of the text is 14044 characters

The text up to character 342 is:
 Machine learning is the science of developing algorithms and statistical models that computer systems use to perform tasks without explicit instructions, relying on patterns and inference instead. Computer systems use machine learning algorithms to process large quantities of historical data and identify data patterns. This allows them to p


### b. Split that text into meaningful chunks/pieces.

In [9]:
# Split the text into fragments
fragments = split_text(text, 2)

# Print the fragments
for i, fragment in enumerate(fragments[:10]):
    print(f"Fragment {i+1}: {fragment}\n")

# Print the total number of fragments
print(f"\nThe text was divided into {i+1} fragments")

Fragment 1: Machine learning is the science of developing algorithms and statistical models that computer systems use to perform tasks without explicit instructions, relying on patterns and inference instead. Computer systems use machine learning algorithms to process large quantities of historical data and identify data patterns

Fragment 2:  This allows them to predict outcomes more accurately from a given input data set. For example, data scientists could train a medical application to diagnose cancer from x-ray images by storing millions of scanned images and the corresponding diagnoses

Fragment 3: 
Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems. Data is the critical driving force behind business decision-making but traditionally, companies have used data from various sources, like customer feedback, employees, and finance

Fragment 4:  Machine learning research automates and optimizes this process. By using sof

### c. Implement the embedding generation logic

In [10]:
# Instantiate the pretrained model
model = SentenceTransformer('sentence-transformers/paraphrase-multilingual-mpnet-base-v2')

# Generate embeddings for each fragment
embeddings = model.encode(fragments)

# Sample data
print("Number of embeddings:", len(embeddings))
print(f"\nEmbedding for fragment 1:\n {embeddings[0]}:")

Number of embeddings: 66

Embedding for fragment 1:
 [-1.48474304e-02  7.25356862e-02  2.91582360e-03 -4.75783236e-02
 -7.07611069e-02 -4.08746339e-02 -2.42115725e-02  8.66363771e-05
 -8.34782571e-02  1.79940257e-02  9.69681889e-02  1.69683918e-01
 -7.51287714e-02  1.85720757e-01  5.01122735e-02 -2.02736065e-01
 -1.25303455e-02 -5.84237203e-02  4.42154631e-02  8.35931748e-02
 -3.54419723e-02 -7.57947415e-02  1.18775796e-02 -2.02463530e-02
 -3.27721760e-02 -9.49414372e-02 -4.07991782e-02 -1.10695690e-01
 -5.66377081e-02  1.07649215e-01 -5.01370691e-02 -1.66517437e-01
  1.41108125e-01  8.77800509e-02  1.66838113e-02 -3.44206765e-02
 -8.37415364e-03  9.68867838e-02  7.52341002e-02 -1.49951339e-01
  2.48079792e-01 -2.45287940e-02  7.27478713e-02 -4.47176807e-02
  2.09419243e-02  2.75891215e-01  6.25376925e-02 -1.25490114e-01
 -2.54110415e-02  1.12973325e-01  7.29606440e-03  9.62402895e-02
 -1.86455518e-01  5.36349453e-02 -1.10063732e-01 -9.58444029e-02
 -6.10179566e-02  1.28090963e-01 -3.8

In [11]:
# Print embeddings list characteristics
print("Variable type of embeddings:", type(embeddings))
print("Dimensions of embeddings:", embeddings.shape)

Variable type of embeddings: <class 'numpy.ndarray'>
Dimensions of embeddings: (66, 768)


- Each row corresponds to a fragment and each column to a dimension in the embedding space. In other words, each row in this matrix represents the embedding for a particular fragment.

- The pre-trained model is a sentence-transformers model: It maps sentences & paragraphs to a 768 dimensional dense vector space and can be used for tasks like clustering or semantic search.

### Which tools and approaches would help you generate them easily and high-level?

One of the best ways to generate this embeedings is to use pre-trained word embeddings models like the one used. In this case, one has to choose a model that trained for the specific case we need. In this case for semantic search. 

There are other frameworks like TensorFlow or Pytorch that offer neural networks architectures for generating embeddings and one can design an specific model like Word2Vec or skip-gram. 

Another tool could be use the embeddings models of the OpenAI, which are avaiable using their API. The same for Google's Universal Sentence Encoder and Facebook's infersent which offer APIs to also generate embeddings in a very way.

Finally a high-level approaches to generate embaddings when large datasets are used is the utilization of GPU in order to accelerate the training. In general, depending on the specifica case, we have to choose the best approach to generate embeddings in an efficient way.

### d. For every question, return a sorted list of the N pieces that relate the most to the question.

In [12]:
def get_answers(user_question, text_fragments, n=3):
    """
    Get the most related text fragments to a user question.

    Args:
        user_question (str): The question asked by the user.
        text_fragments (list): List of text fragments to compare with the user question.
        n (int, optional): Number of most related fragments to return. Defaults to 5.

    Returns:
        tuple: A tuple containing two lists:
               - similarities: List of tuples (fragment, similarity_score) for all fragments.
               - sorted_fragments: List of tuples (fragment, similarity_score) for the top N related fragments.
    """
    # Instantiate the pretrained model
    model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')

    # Generate embedding for user question
    embedding_question = model.encode(user_question)

    # Initialize a list to store the similarity between the question and each fragment
    similarities = []

    # Generate embeddings for each text fragment and calculate cosine similarity
    for fragment in text_fragments:
        embedding_fragment = model.encode(fragment)
        similarity = cosine_similarity([embedding_question], [embedding_fragment], dense_output=True)[0][0]
        similarities.append((fragment, similarity))

    # Sort text fragments based on their similarity with the user question
    sorted_fragments = sorted(similarities, key=lambda x: x[1], reverse=True)

    # Return the top N most relevant fragments
    return sorted_fragments[:n]


In [13]:
# Using the function to find the most related fragments
user_question = "How does machine learning help businesses?"

questions = [
    "How does machine learning help businesses?",
    "How does machine learning improve financial services?",
    "What are the strengths of supervised learning?",
    "How does unsupervised machine learning differ from supervised learning?",
]

# Number of fragments
n = 2

In [14]:
# Iterating over the list of questions
for user_question in questions:
    related_fragments = get_answers(user_question, fragments, n=n)

    # Results
    print(f"============ Question: {user_question} =============")
    for fragment, similarity in related_fragments:
        print(f"{fragment}\n=> Similarity: {similarity}\n")



Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems. Data is the critical driving force behind business decision-making but traditionally, companies have used data from various sources, like customer feedback, employees, and finance
=> Similarity: 0.8451827168464661


Letâs take a look at machine learning applications in some key industries:
Machine learning can support predictive maintenance, quality control, and innovative research in the manufacturing sector. Machine learning technology also helps companies improve logistical solutions, including assets, supply chain, and inventory management
=> Similarity: 0.7533918619155884


Machine learning helps businesses by driving growth, unlocking new revenue streams, and solving challenging problems. Data is the critical driving force behind business decision-making but traditionally, companies have used data from various sources, like customer feedback, employees, and fin

### Do results make sense?

Analyzing the above results, a good general relationship between questions and answers is observed. The first answer, the one with the highest number in the cosine similarity calculation, is much more appropriate and consistent with the question, while the answers with lower similarity measure are sentences that mention or bear some relation to the words in the question, but are not a correct answer to it.

Making some attempts with other text documents and questions, it is observed that the larger the document the less accurate the answer can become. This occurs when the words contained in the questions are repeated throughout the text in different contexts.

# 4. What do you think that could make these types of systems more robust in terms of semantics and functionality?

To achieve robustness of vector embedding systems, we propose:

- Train the embeddings on large and diverse datasets spanning a wide range of languages, domains, and topics to capture a broad spectrum of semantic relationships or, alternatively, augment the training data with various transformations to increase the robustness of the embeddings to variations in the input data.

- Use transfer learning of pre-trained embeddings to strengthen learning and improve robustness against sparse data in target tasks or domains.

- Apply regularization techniques that help preserve semantic relationships, such as orthogonality regularization or semantic similarity constraints during training.

In addition, different models of lexico-syntactic patterns could be incorporated to extract semantic relations such as synonymy, hyponymy and hyperonymy.

# Bibliography

- AWS. (Consulted feb 2024). What is Machine Learning? Retrieved from https://aws.amazon.com/machine-learning/

- Distance Metrics in Vector Search. (Consulted feb 2024). Weaviate. Retrieved from https://weaviate.io/blog/distance-metrics-in-vector-search

- Hugging Face. (Consulted feb 2024). Retrieved from https://huggingface.co/

- Perone, C. S. (2014). Word Embeddings: Introduction [Slideshare slides]. Retrieved from https://www.slideshare.net/perone/word-embeddings-introduction

- Scikit-learn. (Consulted feb 2024). Cosine Similarity. Retrieved from https://scikit-learn.org/stable/modules/generated/sklearn.metrics.pairwise.cosine_similarity.html

- Sentence Transformers - Multilingual Sentence Embeddings. Hugging Face. Retrieved from https://huggingface.co/sentence-transformers/paraphrase-multilingual-mpnet-base-v2

- Alarcón, C. (2023). Curso de Embeddings y Bases de Datos Vectoriales para NLP [Curso en línea]. Platzi. https://platzi.com/cursos/embeddings-nlp/