<div style="background-image: linear-gradient(145deg, rgba(35, 47, 62, 1) 0%, rgba(0, 49, 129, 1) 40%, rgba(32, 116, 213, 1) 60%, rgba(244, 110, 197, 1) 85%, rgba(255, 173, 151, 1) 100%); padding: 1rem 2rem; width: 95%"><img style="width: 60%;" src="../../images/MLU_logo.png"></div>

# <a name="0">MLU Mathematical Fundamentals for Machine Learning</a>
# <a name="0">Lecture 1: Basic linear algebra</a>
## <a name="0">Lab 1.4: Vector spaces and embeddings in NLP and LLMs</a>

 1. <a href="#1">LLM Embeddings as an application of vector spaces</a> 
 2. <a href="#2">Distance and similarity in vector space</a>
 3. <a href="#3">Analogy reasoning</a>

Vector spaces are a core concept in linear algebra, with applications in various machine learning algorithms. In this notebook, we will explore the definition, properties, and applications of vector spaces. We will use Large Language Model (LLM) embeddings as a practical example to illustrate these concepts.

A vector space is a collection of objects, called vectors, that can be added together and multiplied by scalars. Formally, a vector space $V$ over a field $F$ is a set equipped with two operations:

1. **Vector Addition**: For any vectors $\mathbf{u}, \mathbf{v} \in V$, there exists a vector $\mathbf{u} + \mathbf{v} \in V$.
2. **Scalar Multiplication**: For any scalar $c \in F$ and vector $\mathbf{v} \in V$, there exists a vector $c\mathbf{v} \in V$.

These operations must satisfy certain axioms, such as associativity, commutativity, and distributivity.

Vector spaces have several important properties:

1. **Closure**: The sum of any two vectors in the space is also in the space.
2. **Zero Vector**: There exists a zero vector $\mathbf{0}$ such that $\mathbf{v} + \mathbf{0} = \mathbf{v}$ for any vector $\mathbf{v}$.
3. **Inverse**: For every vector $\mathbf{v}$, there exists a vector $-\mathbf{v}$ such that $\mathbf{v} + (-\mathbf{v}) = \mathbf{0}$.
4. **Scalar Multiplication**: The product of any scalar and any vector is also in the space.
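
To make these properties concrete, here is a minimal sketch that spot-checks them numerically in $\mathbb{R}^3$ using NumPy. The vectors `u`, `v`, `w` and the scalar `c` are arbitrary values chosen for illustration; a numerical check on a few vectors is not a proof, only a sanity check.

```python
import numpy as np

# Arbitrary example vectors in R^3 and an arbitrary scalar (illustrative values)
u = np.array([1.0, -2.0, 0.5])
v = np.array([3.0, 0.0, -1.0])
w = np.array([-4.0, 2.0, 2.0])
c = 2.5
zero = np.zeros(3)

# Each entry checks one property on these specific vectors
checks = {
    "closure under addition": (u + v).shape == u.shape,
    "closure under scalar multiplication": (c * v).shape == v.shape,
    "commutativity": bool(np.allclose(u + v, v + u)),
    "associativity": bool(np.allclose((u + v) + w, u + (v + w))),
    "zero vector": bool(np.allclose(v + zero, v)),
    "additive inverse": bool(np.allclose(v + (-v), zero)),
    "distributivity": bool(np.allclose(c * (u + v), c * u + c * v)),
}

for name, ok in checks.items():
    print(f"{name}: {ok}")
```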


In [None]:
# Upgrade libraries
!pip install -q --upgrade pip
!pip install -q --upgrade scikit-learn

In [None]:
%%capture
# Import libraries
import pandas as pd
import matplotlib.pyplot as plt

import json
import boto3

from sklearn.metrics.pairwise import cosine_similarity
import numpy as np

%matplotlib inline

## <a name="1">1. LLM Embeddings as an application of vector spaces</a>
(<a href="#0">Go to top</a>)

Many NLP constructs, such as Large Language Models (LLMs), are based on the concept of **embeddings**, which are vector representations of words, sentences, or documents. These embeddings live in a high-dimensional vector space, where each dimension captures some aspect of the meaning or context of the text.

For example, consider a simple LLM that generates embeddings for words. Each word is represented as a vector in a high-dimensional space, where the dimensions correspond to different semantic features. These embeddings can be added, subtracted, and scaled, making the space of embeddings a vector space.

With [Amazon Bedrock](https://aws.amazon.com/bedrock/), we have access to various embedding models, for instance the [Amazon Titan embedding models](https://docs.aws.amazon.com/bedrock/latest/userguide/titan-embedding-models.html). [This article](https://aws.amazon.com/jp/blogs/machine-learning/get-started-with-amazon-titan-text-embeddings-v2-a-new-state-of-the-art-embeddings-model-on-amazon-bedrock/) explains in detail what the Titan models are capable of. We will use Amazon Titan Text Embeddings V2 in the demo below.


In [None]:
# Initialize the Bedrock client using boto3
bedrock = boto3.client("bedrock")

# Initialize the Bedrock runtime client
bedrock_runtime = boto3.client("bedrock-runtime")

Here's an example of the embedding produced by Amazon Titan Text Embeddings V2 for a single word. The embedding dimension, i.e. the length of the embedding vector, is a property of each embedding model and is typically on the order of a thousand or more for current state-of-the-art models.

In [None]:
# Define the word for which we want to generate an embedding
word = "amazing"
payload = {"inputText": word}

# Invoke the model to generate the embedding
response = bedrock_runtime.invoke_model(
    modelId="amazon.titan-embed-text-v2:0",
    body=json.dumps(payload)
)

# Parse the response body from the model invocation
body = json.loads(response.get('body').read())

print(f"The vector space has {len(body['embedding'])} dimensions\n")

# Printing the first 10 components of the embedding vector
print(f"Printing the first 10 components of the embedding vector for word '{word}':\n\n{body['embedding'][:10]}")

## <a name="2">2. Distance and similarity in vector space</a>
(<a href="#0">Go to top</a>)

Vectors are often depicted as arrows pointing in a certain direction within a space. They can also be represented numerically as the list of their components with respect to a basis, which makes mathematical operations easier.

The purpose of vector embeddings is to map complex entities like text or images into a vector space, allowing machine learning models to identify patterns and relationships. These embeddings convert non-numeric data into a numerical format while preserving their semantic meanings and relationships. This enables computational models in machine learning and natural language processing (NLP) to recognize similarities and differences between entities.

In a vector space embedding, similar entities are placed close to each other, reflecting their semantic or contextual similarity. For example, in word embeddings, words with similar meanings are positioned near each other in the vector space. This spatial arrangement helps embeddings capture and organize the semantic relationships between entities, a concept known as the semantic space.

Let's create a function that returns the embedding given a text input:

In [None]:
# Function to compute embeddings from text input
def get_embedding(text, model="amazon.titan-embed-text-v2:0"):
    payload = {"inputText": text}

    response = bedrock_runtime.invoke_model(
        modelId=model,
        body=json.dumps(payload)
    )
    body = json.loads(response.get("body").read())
    
    # Return np array as it's easier to deal with
    return np.array(body["embedding"])
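
Each call to `get_embedding` makes a network request, so embedding the same text more than once repeats work. One possible way to avoid this is to memoize the embedding function. The sketch below is an assumption-laden illustration, not part of the lab: `make_cached_embedder` is a hypothetical helper, and `fake_embed` is a stand-in for `get_embedding` so the example runs without AWS access.

```python
from functools import lru_cache

import numpy as np


def make_cached_embedder(embed_fn, maxsize=1024):
    """Wrap embed_fn (str -> np.ndarray) so each distinct text is embedded once.

    Hypothetical helper for illustration; not part of boto3 or Bedrock.
    """
    @lru_cache(maxsize=maxsize)
    def _cached(text):
        # Store an immutable tuple so cached values cannot be mutated by callers
        return tuple(embed_fn(text))

    def cached_embed(text):
        # Hand back a fresh array on every call
        return np.array(_cached(text))

    return cached_embed


# Stand-in for get_embedding that records how often it is actually invoked
calls = []

def fake_embed(text):
    calls.append(text)
    return np.array([float(len(text)), 1.0])


embed = make_cached_embedder(fake_embed)
embed("italy")
embed("italy")  # served from the cache, fake_embed is not called again
embed("rome")
print(f"Underlying embedding calls: {len(calls)}")
```

In the notebook, one could pass `get_embedding` instead of `fake_embed`, assuming the model returns the same vector for the same input text.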

In [None]:
# testing it
emb1 = get_embedding("cat")
print(f"The Python type of the embedding vector is: {type(emb1)}\nThe array shape is: {emb1.shape}")

In the context of embeddings, semantically similar strings (words, phrases, or sentences) are represented by vectors that are "close" to each other in the embedding space. This concept is fundamental to many natural language processing (NLP) tasks, such as word similarity, sentence similarity, and clustering. The concept of "closeness" in the embedding space can be measured using various distance metrics, with Euclidean distance and cosine similarity being two of the most commonly used methods.

* **Euclidean distance** measures the straight-line distance between two points in the embedding space. A small distance between two word embeddings means that the words are close to each other.
* **Cosine similarity** measures the cosine of the angle between two vectors. It is particularly useful for high-dimensional spaces because it focuses on the orientation of the vectors rather than their magnitude, and it is bounded between -1 (vectors pointing in opposite directions) and 1 (vectors pointing in the same direction). When two word embeddings are close to each other, the cosine similarity is large.

In other words, words or sentences that are close in meaning will have:
- small Euclidean distance
- large cosine similarity
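
For reference, the two metrics can be written explicitly for vectors $\mathbf{u}, \mathbf{v} \in \mathbb{R}^n$, where $\theta$ denotes the angle between them:

$$
d(\mathbf{u}, \mathbf{v}) = \|\mathbf{u} - \mathbf{v}\|_2 = \sqrt{\sum_{i=1}^{n} (u_i - v_i)^2},
\qquad
\cos\theta = \frac{\mathbf{u} \cdot \mathbf{v}}{\|\mathbf{u}\|_2 \, \|\mathbf{v}\|_2}
$$

The two are directly related when both vectors have unit length ($\|\mathbf{u}\|_2 = \|\mathbf{v}\|_2 = 1$). Expanding the squared distance gives

$$
\|\mathbf{u} - \mathbf{v}\|_2^2 = \|\mathbf{u}\|_2^2 + \|\mathbf{v}\|_2^2 - 2\,\mathbf{u} \cdot \mathbf{v} = 2 - 2\cos\theta
$$

so for normalized embeddings a larger cosine similarity always corresponds to a smaller Euclidean distance, and both metrics rank neighbors in the same order. Whether a given model returns unit-length embeddings depends on the model and its settings, so it is worth checking the norm before relying on this.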

Let's create two functions that return the above metrics:

In [None]:
def cosine_similarity(vector1, vector2):
    # Calculate the dot product
    dot_product = np.dot(vector1, vector2)

    # Calculate the norms of the vectors
    norm_vector1 = np.linalg.norm(vector1)
    norm_vector2 = np.linalg.norm(vector2)

    # Calculate the cosine similarity
    cosine_sim = dot_product / (norm_vector1 * norm_vector2)

    return cosine_sim


def euclidean_distance(vector1, vector2):
    # Calculate the difference between the vectors
    difference = vector1 - vector2

    # Calculate the Euclidean distance
    euclidean_dist = np.linalg.norm(difference)

    return euclidean_dist
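
Before applying the metrics to real embeddings, it can help to see how they behave on vectors small enough to check by hand. The sketch below uses 2-D toy vectors (illustrative values, not model embeddings) and compact helpers named `cos_sim` and `euc_dist`, which are defined here only to keep the example self-contained.

```python
import numpy as np


def cos_sim(a, b):
    # Cosine of the angle between a and b
    return float(np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b)))


def euc_dist(a, b):
    # Straight-line distance between a and b
    return float(np.linalg.norm(a - b))


a = np.array([1.0, 0.0])
same_direction = np.array([2.0, 0.0])   # same direction as a, twice the length
orthogonal = np.array([0.0, 1.0])       # at a right angle to a
opposite = np.array([-1.0, 0.0])        # pointing the opposite way

for name, vec in [("same direction", same_direction),
                  ("orthogonal", orthogonal),
                  ("opposite", opposite)]:
    print(f"{name:15s} cosine={cos_sim(a, vec):+.3f}  distance={euc_dist(a, vec):.3f}")
```

Note that `a` and `same_direction` have cosine similarity 1 even though their Euclidean distance is 1: cosine similarity ignores magnitude, while Euclidean distance does not.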

Let's test them on words with varying degrees of semantic closeness to a base word:

<img src="../../images/vector_similarity.png" alt="vector_similarity" width="70%"/>

In [None]:
words = ["mouse", "pizza", "rome", "italian", "Italy", "italy"]
base_word = "italy"

# Construct dataframe to store data
df = pd.DataFrame(words, columns=["word_1"])
df["word_2"] = base_word

# Compute distance metrics for each pair of words, embedding every word only once
base_emb = get_embedding(base_word)
word_embs = [get_embedding(w) for w in words]
df["cosine_similarity"] = [cosine_similarity(e, base_emb) for e in word_embs]
df["euclidean_distance"] = [euclidean_distance(e, base_emb) for e in word_embs]

df

Let's repeat the same exercise with sentences instead of single words. We will see how cosine similarity and Euclidean distance behave when comparing a human-written (positive) review with:
* a similar review generated by an LLM
* a negative review 
* an irrelevant review

### Exercise 1

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Exercise 1.</b> The next cell contains a product (ASIN) description written by a human, a similar description generated by an LLM, a description with a negative tone, and an irrelevant description. Compute the cosine similarity and Euclidean distance between the human description and each of the other three, and check whether the results match your expectations.
</p>
    </span>
</div>

In [None]:
human_description = "This laptop has a sleek design, high-resolution display, and long battery life."
llm_description = "Featuring a slim design, this laptop offers a vibrant screen and extended battery duration."
bad_description = "The laptop lacks in battery duration, its display resolution is lacking, and the appearance is bulky."
irrelevant_description = "It was a sunny and crisp day in the village located right on the Amalfi coast." 

###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><br/><b>Challenge Help</b></p>
        <p><br/><br/>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display sample solutions.</p>
    </span>
</div>

In [None]:
# %load solutions/lab14_ex1_solutions.txt

## <a name="3">3. Analogy reasoning</a>
(<a href="#0">Go to top</a>)

Because embeddings are elements of a vector space, the usual vector operations are well defined on them. By performing vector operations on embeddings, we can uncover interesting relationships and patterns within the language.

One such application is **analogy reasoning**: finding words that are related in the same way as a given pair of words. For instance, in the classic example `king - man + female = queen`, the embedding for "queen" can be approximated by subtracting the embedding for "man" from the embedding for "king" and then adding the embedding for "female." This illustrates how embeddings can encode semantic relationships as directions in the vector space.

In the next cell, we give a demonstration of analogy reasoning. We create a function `analogy_reasoning` that performs the following steps:

* Retrieves embeddings for the given query and target words.
* Calculates an analogy vector based on the query words.
* Computes cosine similarity and Euclidean distance between the analogy vector and each target word's embedding.
* Identifies the closest target word under each measure.
* Returns a tuple containing the closest target word by cosine similarity and the closest target word by Euclidean distance.

In [None]:
def analogy_reasoning(query_words, target_words, get_embedding_func):
    """
    Performs analogy reasoning using LLM embeddings.

    Args:
    query_words: A list of three query words [a, b, c]; the analogy vector is a - b + c.
    target_words: A list of target words to compare.
    get_embedding_func: A function to get the embedding for a word.

    Returns:
    A tuple (closest target by cosine similarity, closest target by Euclidean distance).
    """

    # Get embeddings for query and target words
    query_embs = np.array([get_embedding_func(word) for word in query_words])
    target_embs = np.array([get_embedding_func(word) for word in target_words])

    # Calculate analogy vector
    analogy_vec = query_embs[0] - query_embs[1] + query_embs[2]

    # Calculate cosine similarity and Euclidean distance
    cos_sim = np.array([cosine_similarity(target, analogy_vec) for target in target_embs])
    euc_dist = np.array([euclidean_distance(target, analogy_vec) for target in target_embs])

    # Find closest target based on cosine similarity and Euclidean distance
    closest_index_cos = np.argmax(cos_sim)
    closest_index_dist = np.argmin(euc_dist)
    
    print(f"Cosine similarity   |  {query_words[0]} - {query_words[1]} + {query_words[2]} ~ {target_words[closest_index_cos]}")
    print(f"Euclidean distance  |  {query_words[0]} - {query_words[1]} + {query_words[2]} ~ {target_words[closest_index_dist]}")

    return target_words[closest_index_cos], target_words[closest_index_dist]

Trying to find the closest word embedding from a whole vocabulary would be very costly. Thus, we will choose a handful of words as targets and let the `analogy_reasoning` function find the closest among that set. 

Let's first check that the `king - man + female ~ queen` analogy holds:

In [None]:
query_words = ['king', 'man', 'female']
target_words = ['queen', 'lawyer', 'professor', 'dancer', 'mother', 'father', 'apple', 'pear', 'orange', 'red', 'blue']

result = analogy_reasoning(query_words, target_words, get_embedding)
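
The scoring step of the analogy search can also be written as a single matrix-vector product, which tends to scale better as the set of target words grows. The following is a minimal sketch using hand-made 3-D toy vectors; the dictionary `toy` and its values (loosely "royalty", "gender", "fruit") are hypothetical and chosen so the result can be verified by hand. They are not Titan embeddings.

```python
import numpy as np

# Hypothetical toy embeddings: [royalty, gender (+1 female, -1 male), fruit]
toy = {
    "king":   np.array([1.0, -1.0, 0.0]),
    "man":    np.array([0.0, -1.0, 0.0]),
    "female": np.array([0.0,  1.0, 0.0]),
    "queen":  np.array([1.0,  1.0, 0.0]),
    "mother": np.array([0.0,  1.0, 0.0]),
    "apple":  np.array([0.0,  0.0, 1.0]),
}

# Analogy vector: king - man + female
query = toy["king"] - toy["man"] + toy["female"]

# Stack the target embeddings into one matrix, one row per target word
targets = ["queen", "mother", "apple"]
T = np.stack([toy[w] for w in targets])

# Cosine similarity of every target with the query in one shot
cos = (T @ query) / (np.linalg.norm(T, axis=1) * np.linalg.norm(query))

# Euclidean distance of every target to the query via broadcasting
dist = np.linalg.norm(T - query, axis=1)

best_cos = targets[int(np.argmax(cos))]
best_dist = targets[int(np.argmin(dist))]
print(f"Closest by cosine similarity:  {best_cos}")
print(f"Closest by Euclidean distance: {best_dist}")
```

With real embeddings the match is only approximate, so the two metrics may occasionally disagree on the closest target.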

### Exercise 2

<div style="align: left; border: 4px solid cornflowerblue; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 80%; max-height:80%; margin: 5px;" src="../../images/MLU_challenge.png" alt="MLU challenge" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><b>Exercise 2.</b> Using the provided analogy_reasoning function, create a list of 3 analogy problems (e.g., "king - man + female = queen"). For each problem, provide a list of potential target phrases. Test the function on these analogies and report the results.
</p>
        <p>Here are some ideas for topics:
            <ul>
                <li>Famous singers and their most iconic songs.</li>
                <li>Countries and their capitals.</li>
                <li>Words in different languages.</li>
            </ul>
        </p>
    </span>
</div>

In [None]:
###### YOUR CODE HERE ######






###### END OF CODE ######

<div style="align: left; border: 4px solid lightcoral; text-align: left; margin: auto; padding-left: 20px; padding-right: 20px; width: 65%">
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_question.png" alt="MLU solution" width=12% height=12%/>
    <span style="padding: 20px; align: left;">
        <p><br/><b>Challenge Help</b></p>
        <p><br/><br/>If you're stuck, remove the <code>#</code> before the <code>load</code> instruction in the next code cell to display sample solutions.</p>
    </span>
</div>

In [None]:
# %load solutions/lab14_ex2_solutions.txt

<div style="display: flex; align-items: center; justify-content: left; background-color:#330066; width:99%;"> 
        <img style="float: left; max-width: 100%; max-height:100%; margin: 15px;" src="../../images/MLU_robot.png" alt="MLU robot" width="100" height="100"/>
    <span style="color: white; padding-left: 10px; align: left; margin: 15px;">
        <h3>Congratulations!</h3>
        You have completed Lab 1.4: Vector spaces and embeddings in NLP and LLMs of Lecture 1: Basic linear algebra of MLU Mathematical Fundamentals for Machine Learning.
        <br/>
    </span>
</div>