# Introduction to vector databases

## Vector search

### Similar words

We have already established that words are represented as high-dimensional vectors, known as embeddings, in a vector space. Words are transformed into embeddings by embedding models such as BERT. Similar words tend to have embeddings that lie close together in the vector space.

For illustration, suppose we have a two-dimensional vector space in which words are represented as points. The point (2, 3) denotes the word *sunny*, while the point (3, 4) denotes the word *sunlit*. These two words have similar meanings, so their vectors lie close together in the vector space. On the other hand, the point (-2, -3.5) represents the word *dark*. Because it has the opposite meaning to *sunny*, its vector lies far from the vectors of *sunny* and *sunlit*.
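To make the example concrete, the following minimal sketch computes the Euclidean distances between the three example points. The names `points` and `euclidean` are illustrative and are not used elsewhere in this notebook.

```python
import math

# Two-dimensional toy "embeddings" taken from the example above
points = {
    "sunny": (2.0, 3.0),
    "sunlit": (3.0, 4.0),
    "dark": (-2.0, -3.5),
}

def euclidean(a, b):
    """Return the Euclidean (L2) distance between two points."""
    return math.sqrt(sum((ai - bi) ** 2 for ai, bi in zip(a, b)))

d_sunlit = euclidean(points["sunny"], points["sunlit"])
d_dark = euclidean(points["sunny"], points["dark"])

print(f"sunny-sunlit: {d_sunlit:.3f}")  # about 1.414
print(f"sunny-dark:   {d_dark:.3f}")    # about 7.632
```

The word with the similar meaning is several times closer to *sunny* than the word with the opposite meaning.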

<img src="https://i.ibb.co/1J84S8SX/untitled2.jpg" alt="untitled2" border="0">

A query can likewise be mapped to a high-dimensional vector and matched against the vectors in the vector space. In vector query execution, we look for vectors similar to the query vector in order to identify the most relevant candidates for our search results. If the vectors are indexed, the type of index determines whether the search is exhaustive or restricted to near neighbours; the latter is faster. Once we have our candidates, we score the results using similarity metrics that measure the strength of the match.

Some popular algorithms used in vector search include KNN (K-Nearest Neighbours) and ANN (Approximate Nearest Neighbours). Both rely on distance measurements to find the nearest neighbours. KNN finds the exact nearest neighbours, while ANN finds approximate nearest neighbours. On vast amounts of data, and especially with high-dimensional vectors, ANN can be significantly faster than KNN, at the cost of occasionally missing a true nearest neighbour.

In the scope of this notebook, we will explore the KNN approach.
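As a point of reference, exact KNN can be sketched as a brute-force scan that measures the distance from the query to every stored vector and keeps the `k` smallest. The function `knn` and the sample data below are hypothetical and only illustrate the idea; later in the notebook the database performs this search for us.

```python
import math

def knn(query, vectors, k):
    """Return the indices of the k vectors closest to `query` (exact search, L2 distance)."""
    # Pair every distance with the index of its vector, then sort by distance
    distances = [(math.dist(query, v), i) for i, v in enumerate(vectors)]
    distances.sort()
    return [i for _, i in distances[:k]]

vectors = [(2.0, 3.0), (3.0, 4.0), (-2.0, -3.5), (0.0, 0.0)]
neighbours = knn((2.2, 3.1), vectors, k=2)
print(neighbours)  # [0, 1]
```

Because every stored vector is visited, the cost of one query grows linearly with the size of the collection, which is the motivation for ANN on large datasets.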

## Vector databases

Vector databases are specialized systems designed for the efficient storage, retrieval, and management of high-dimensional vectors. Whereas traditional databases are built around scalar values such as integers, vector databases handle data such as text, images, and audio that has been mapped to a multi-dimensional vector space, which captures relationships and similarities between data points. Using specialized indexing techniques and similarity search algorithms, vector databases enable efficient querying of large-scale datasets. This makes them useful tools for applications in natural language processing.

One example is pgvector, a vector search extension for PostgreSQL, which we will later use in Programming Assignment 2 to store text embeddings for natural language processing tasks.

In the scope of this notebook, we will conduct simple KNN searches to obtain relevant information. First, we will start with the basics.

### System initialization

In this tutorial, we reuse the PostgreSQL Docker setup presented in the *Web crawling - basic tools* notebook. If you have already run that showcase example and have not deleted the container *postgresql-wier* with the database, you can start the Docker container as indicated below.



<img src="https://i.ibb.co/DgCxJWdQ/docker-demo.png" alt="docker-demo" border="0">



Otherwise, follow the steps below.

First, prepare a file named *database.sql*. The script creates a schema and a table with two rows:

``` sql
CREATE SCHEMA IF NOT EXISTS showcase;

CREATE TABLE showcase.counters (
    counter_id integer  NOT NULL,
    value integer NOT NULL,
    CONSTRAINT pk_counters PRIMARY KEY ( counter_id )
 );

INSERT INTO showcase.counters VALUES (1,0), (2,0);
```

Go to an empty folder and save the script into a subfolder named *init-scripts*. Create another empty folder named *pgdata*.

We run the Docker container using the following command. The command names the container *postgresql-wier*, sets the username and password, maps the database files to the folder *./pgdata* and the initialization scripts to *./init-scripts*, maps port 5432 to the host machine (i.e. localhost), and runs the image *pgvector/pgvector:pg16* in detached mode.

``` 
docker run --name postgresql-wier \
    -e POSTGRES_PASSWORD=SecretPassword \
    -e POSTGRES_USER=user \
    -e POSTGRES_DB=wier \
    -v $PWD/pgdata:/var/lib/postgresql/data \
    -v $PWD/init-scripts:/docker-entrypoint-initdb.d \
    -p 5432:5432 \
    -d pgvector/pgvector:pg16
```

If you use Command Prompt on Windows, the equivalent of the above command is as follows:

``` 
docker run --name postgresql-wier ^
    -e POSTGRES_PASSWORD=SecretPassword ^
    -e POSTGRES_USER=user ^
    -e POSTGRES_DB=wier ^
    -v "%CD%\pgdata:/var/lib/postgresql/data" ^
    -v "%CD%\init-scripts:/docker-entrypoint-initdb.d" ^
    -p 5432:5432 ^
    -d pgvector/pgvector:pg16
```

To check the container's logs, run `docker logs -f postgresql-wier`.


### Getting started

We will enable the pgvector extension and insert some example sentences into the database.  

But first, check that your database uses the correct image, i.e., *pgvector/pgvector:pg16*.  

Also, install the necessary dependencies.

In [None]:
%pip install pgvector
%pip install "psycopg[binary]"
%pip install sentence_transformers
%pip install numpy

In the next step, we will enable the pgvector extension in your PostgreSQL database (do this once in each database where you want to use it) with the following SQL statement:
```sql
CREATE EXTENSION IF NOT EXISTS vector
```

In [None]:
from pgvector.psycopg import register_vector
import psycopg

#connect to db
conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')

#enable `vector` extension if not already enabled
conn.execute('CREATE EXTENSION IF NOT EXISTS vector')
register_vector(conn)

Now, we will create two new tables, *showcase.vector_demo* and *showcase.vector_demo2*, for storing embeddings. Both tables have the same columns and differ only in the embedding vector size (384 and 768):
- **id**: primary key (unique for each sentence)
- **sentence**: text (content) of the sentence
- **embedding**: vector representation of the sentence.  

In [None]:
#delete tables vector_demo and vector_demo2 from the db if they exist
conn.execute('DROP TABLE IF EXISTS showcase.vector_demo')
conn.execute('DROP TABLE IF EXISTS showcase.vector_demo2')

#create tables vector_demo and vector_demo2 with columns id, sentence, and embedding
conn.execute('CREATE TABLE showcase.vector_demo (id bigserial PRIMARY KEY, sentence text, embedding vector(384))')
conn.execute('CREATE TABLE showcase.vector_demo2 (id bigserial PRIMARY KEY, sentence text, embedding vector(768))')

conn.close()

Next, we will define a list of sentences and calculate their embeddings using two different models from the [SentenceTransformer](https://sbert.net/) library, i.e.
- [all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [LaBSE](https://huggingface.co/sentence-transformers/LaBSE)

In [None]:
from sentence_transformers import SentenceTransformer

#define sentences to be stored in the database
sentences = [
    'The sun is shining',
    'The sun shines with great brightness',
    'The sun shines',
    'The sun provides light and energy',
    'The sun shines brightly in the clear sky',
    'The sun shines brightly, warming the earth and providing light',
    'The clouds are covering the sky'    
]

#load SentenceTransformer model and generate embeddings
model = SentenceTransformer('sentence-transformers/all-MiniLM-L6-v2')
embeddings = model.encode(sentences)

model2 = SentenceTransformer('sentence-transformers/LaBSE')
embeddings2 = model2.encode(sentences)

print('Size of embedding: ' + str(embeddings.shape))
print('Size of embedding 2: ' + str(embeddings2.shape))

print('\n\n\nembedding:\n ')
print(embeddings)

print('\n\n\nembedding 2:\n ')
print(embeddings2)


Next, we will store the sentences and their embeddings in the database.

In [None]:
#print values stored in the table showcase.vector_demo2
def print_db_values():

    """
    Print sentences and corresponding embeddings in the table showcase.vector_demo2.
    """
    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')

    print("\nValues in the vector_demo2 table:")
    cur = conn.cursor()
    cur.execute("SELECT id, sentence, embedding FROM showcase.vector_demo2 ORDER BY id")
    #row_id avoids shadowing the built-in id()
    for row_id, sentence, embedding in cur.fetchall():
        print(f"\nId: {row_id},   Sentence: {sentence},   Embedding: {embedding}")
    cur.close()
    conn.close()

#insert a list of sentences and corresponding embeddings in the table showcase.vector_demo
def insert_db_sentences(sentences, embeddings):
    """
    Insert a list of sentences and corresponding embeddings in the table showcase.vector_demo.

    Parameters
    - sentences: A list of sentences to be inserted in the vector_demo table.
    - embeddings:  Embeddings to be inserted in the vector_demo table.
    """
    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 
    for sentence, embedding in zip(sentences, embeddings):
        embedding = embedding.tolist() #convert numpy array to python lists for compatibility with PostgreSQL
        cur.execute('INSERT INTO showcase.vector_demo (sentence, embedding) VALUES (%s, %s)', (sentence, embedding))
    cur.close()
    conn.close()


#insert a list of sentences and corresponding embeddings in the table showcase.vector_demo2
def insert_db_sentences2(sentences, embeddings):
    """
    Insert a list of sentences and corresponding embeddings in the table showcase.vector_demo2.

    Parameters
    - sentences: A list of sentences to be inserted in the vector_demo2 table.
    - embeddings:  Embeddings to be inserted in the vector_demo2 table.
    """
    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 
    for sentence, embedding in zip(sentences, embeddings):
        embedding = embedding.tolist() #convert numpy array to python lists for compatibility with PostgreSQL
        cur.execute('INSERT INTO showcase.vector_demo2 (sentence, embedding) VALUES (%s, %s)', (sentence, embedding))
    cur.close()
    conn.close()


insert_db_sentences(sentences, embeddings)
insert_db_sentences2(sentences, embeddings2)
print_db_values()


### Distance functions supported in pgvector database

pgvector supports the following distance functions:

- L2 distance (**`<->`**)
- (Negative) inner product (**`<#>`**)
- Cosine distance (**`<=>`**)
- L1 distance (**`<+>`**)
- Hamming distance (**`<~>`**): used for binary vectors
- Jaccard distance (**`<%>`**): used for binary vectors

####  L2 distance (`<->`)  

Let vectors $x$, $y \in \mathbb{R}^n$. L2 distance (the Euclidean distance) $d_{L2}(x, y)$ is defined as
$d_{L2}(x, y) = \sqrt{\sum_{i=1}^{n} (x_i - y_i)^2}$.

#### (Negative) inner product (`<#>`)
The negative inner product is the negated inner (dot) product of two vectors. The inner product measures similarity, so negating it turns it into a pseudo-distance measure: the more similar the vectors, the smaller the value. For vectors $x$, $y \in \mathbb{R}^n$, the negative inner product $d_{\text{inner}}(x, y)$ is defined as
$d_{\text{inner}}(x, y) = - (x \cdot y)$.
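The effect of the negation can be seen in a small sketch. The helper `negative_inner_product` and the sample vectors are illustrative names chosen for this example.

```python
def negative_inner_product(x, y):
    """Return -(x . y), following the definition above."""
    return -sum(xi * yi for xi, yi in zip(x, y))

query = (1.0, 2.0)
aligned = (2.0, 4.0)    # points in the same direction as the query
opposed = (-1.0, -2.0)  # points in the opposite direction

print(negative_inner_product(query, aligned))  # -10.0
print(negative_inner_product(query, opposed))  # 5.0
```

The aligned vector receives the smaller value, so sorting in ascending order puts the most similar vector first, just as with a true distance.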

#### Cosine distance (`<=>`)  
Cosine distance is derived from the cosine of the angle between two vectors, so it depends on their directions but not on their magnitudes. Let vectors $x$, $y \in \mathbb{R}^n$. The cosine distance is defined as
$d_{\cos}(x, y) = 1 - \frac{x \cdot y}{ \lVert x \rVert \lVert y \rVert}$.
The cosine similarity is then $1 - d_{\cos}(x, y)$.
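The following sketch evaluates the formula on three simple cases. The function `cosine_distance` is an illustrative helper; it assumes that neither vector is the zero vector, for which the formula is undefined.

```python
import math

def cosine_distance(x, y):
    """Return 1 - cos(angle between x and y); both vectors must be non-zero."""
    dot = sum(xi * yi for xi, yi in zip(x, y))
    norm_x = math.sqrt(sum(xi * xi for xi in x))
    norm_y = math.sqrt(sum(yi * yi for yi in y))
    return 1.0 - dot / (norm_x * norm_y)

x = (1.0, 0.0)
same_direction = (5.0, 0.0)  # longer vector, same direction
orthogonal = (0.0, 3.0)
opposite = (-2.0, 0.0)

print(cosine_distance(x, same_direction))  # 0.0
print(cosine_distance(x, orthogonal))      # 1.0
print(cosine_distance(x, opposite))        # 2.0
```

The values range from 0 (same direction) to 2 (opposite direction), and the length of the vectors plays no role.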

#### L1 distance (`<+>`)  
L1 distance is also known as Manhattan distance. Let vectors $x$, $y \in \mathbb{R}^n$. The Manhattan distance is defined as:
$d_{L1}(x,y) = \sum_{i=1}^{n} |x_i - y_i|$
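As a quick check of the formula, the sketch below computes the L1 distance for two sample points. The helper `l1_distance` and the points are illustrative.

```python
def l1_distance(x, y):
    """Return the Manhattan (L1) distance: the sum of absolute coordinate differences."""
    return sum(abs(xi - yi) for xi, yi in zip(x, y))

x = (2.0, 3.0)
y = (-2.0, -3.5)

print(l1_distance(x, y))  # 10.5
```

For the same pair of points, the L2 distance is about 7.632. The L1 distance is never smaller than the L2 distance, so the two measures report values on different scales even when they rank results in the same order.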

#### Hamming distance (`<~>`)  
Hamming distance is used for binary vectors; it counts the number of positions at which the corresponding elements differ. For vectors $x$, $y \in \{0, 1\}^n$, the Hamming distance is defined as 
$d_{Hamming}(x,y) = \sum_{i=1}^{n} |x_i - y_i|$. Since $x_i, y_i \in \{0, 1\}$, each term $|x_i - y_i|$ is either 0 or 1 for all $i \in \{1, \dots, n\}$.
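A minimal sketch of this count, using plain Python lists of 0s and 1s as binary vectors (the helper name `hamming_distance` is illustrative):

```python
def hamming_distance(x, y):
    """Count the positions at which two equal-length binary vectors differ."""
    if len(x) != len(y):
        raise ValueError("vectors must have the same length")
    return sum(xi != yi for xi, yi in zip(x, y))

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(hamming_distance(x, y))  # 2
```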
  

#### Jaccard distance (`<%>`)  
Jaccard distance is used for binary vectors. It is derived from the Jaccard similarity. For vectors $x$, $y \in \{0, 1\}^n$, the Jaccard distance is defined as:
$d_{Jaccard}(x,y) = 1 - \frac{|x \cap y |}{| x \cup y |}$, where $ | x \cap y | $ is the number of positions where both vectors have 1s, and $ | x \cup y | $ is the number of positions where at least one vector has a 1.
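The definition can be sketched as follows. The helper `jaccard_distance` is illustrative; returning 0.0 for two all-zero vectors is a convention assumed for this sketch only and is not necessarily what pgvector does.

```python
def jaccard_distance(x, y):
    """Return 1 - |x AND y| / |x OR y| for two equal-length binary vectors."""
    both = sum(1 for xi, yi in zip(x, y) if xi == 1 and yi == 1)
    either = sum(1 for xi, yi in zip(x, y) if xi == 1 or yi == 1)
    if either == 0:
        # Assumption for this sketch: two all-zero vectors are treated as identical
        return 0.0
    return 1.0 - both / either

x = [1, 0, 1, 1, 0]
y = [1, 1, 0, 1, 0]

print(jaccard_distance(x, y))  # 0.5
```

Here two positions hold a 1 in both vectors and four positions hold a 1 in at least one of them, so the distance is $1 - 2/4 = 0.5$.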


### Querying pgvector database

Now that we have listed all the distance functions supported by pgvector, we can showcase the differences in results using the above metrics. First, we will define functions for querying the database using the above distances for non-binary vectors.

In [None]:
#query using L2 distance
def query_db_L2(query, model_name, table_name):
    """
    The query_db_L2 function retrieves the top 5 most similar sentences from a pgvector database based on L2 (Euclidean) distance. 
    It uses a pre-trained SentenceTransformer model to encode the input query and then searches for the closest embeddings stored in the database.

    Parameters
    - query (str): The input text query to be searched.
    - model_name (str): The name of the SentenceTransformer model to be used for encoding the query.
    - table_name (str): The name of the table containing the stored sentence embeddings. Possible options are showcase.vector_demo and showcase.vector_demo2
    """
    
    #download the model
    model = SentenceTransformer(model_name)

    #calculate embedding for the query
    query_embedding = model.encode(query).tolist()  

    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 

    # execute the query to fetch the top 5 most similar sentences based on L2 distance
    result = cur.execute(
        'SELECT sentence, (embedding <-> %s::vector) AS distance '
        'FROM ' + table_name + ' '
        'ORDER BY embedding <-> %s::vector '
        'LIMIT 5',
        (query_embedding, query_embedding)  # pass the embedding twice, once for ordering and once for calculation
    ).fetchall()
    cur.close()
    conn.close()
    return result

#query using L1 distance
def query_db_L1(query, model_name, table_name):
    """
    The query_db_L1 function retrieves the top 5 most similar sentences from a pgvector database based on L1 (Manhattan) distance. 
    It uses a pre-trained SentenceTransformer model to encode the input query and then searches for the closest embeddings stored in the database.

    Parameters
    - query (str): The input text query to be searched.
    - model_name (str): The name of the SentenceTransformer model to be used for encoding the query.
    - table_name (str): The name of the table containing the stored sentence embeddings. Possible options are showcase.vector_demo and showcase.vector_demo2
    """
    
    #download the model
    model = SentenceTransformer(model_name)

    #calculate embedding for the query
    query_embedding = model.encode(query).tolist()  

    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 

    # execute the query to fetch the top 5 most similar sentences based on L1 distance
    result = cur.execute(
        'SELECT sentence, (embedding <+> %s::vector) AS distance '
        'FROM ' + table_name + ' '
        'ORDER BY embedding <+> %s::vector '
        'LIMIT 5',
        (query_embedding, query_embedding)  # pass the embedding twice, once for ordering and once for calculation
    ).fetchall()
    cur.close()
    conn.close()
    return result


#query using cosine distance
def query_db_cosine(query, model_name, table_name):
    """
    The query_db_cosine function retrieves the top 5 most similar sentences from a pgvector database based on cosine distance. 
    It uses a pre-trained SentenceTransformer model to encode the input query and then searches for the closest embeddings stored in the database.

    Parameters
    - query (str): The input text query to be searched.
    - model_name (str): The name of the SentenceTransformer model to be used for encoding the query.
    - table_name (str): The name of the table containing the stored sentence embeddings. Possible options are showcase.vector_demo and showcase.vector_demo2
    """
    
    #download the model
    model = SentenceTransformer(model_name)

    #calculate embedding for the query
    query_embedding = model.encode(query).tolist()  

    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 

    # execute the query to fetch the top 5 most similar sentences based on cosine distance;
    # the returned score is the cosine similarity, i.e. 1 - cosine distance
    result = cur.execute(
        'SELECT sentence, 1 - (embedding <=> %s::vector) AS similarity '
        'FROM ' + table_name + ' '
        'ORDER BY embedding <=> %s::vector '
        'LIMIT 5',
        (query_embedding, query_embedding)  # pass the embedding twice, once for ordering and once for calculation
    ).fetchall()
    cur.close()
    conn.close()
    return result

#query using negative inner product
def query_db_inner(query, model_name, table_name):
    """
    The query_db_inner function retrieves the top 5 most similar sentences from a pgvector database based on (negative) inner product. 
    It uses a pre-trained SentenceTransformer model to encode the input query and then searches for the closest embeddings stored in the database.

    Parameters
    - query (str): The input text query to be searched.
    - model_name (str): The name of the SentenceTransformer model to be used for encoding the query.
    - table_name (str): The name of the table containing the stored sentence embeddings. Possible options are showcase.vector_demo and showcase.vector_demo2
    """
    
    #download the model
    model = SentenceTransformer(model_name)

    #calculate embedding for the query
    query_embedding = model.encode(query).tolist()  

    conn = psycopg.connect(host="localhost", dbname='wier', autocommit=True, password='SecretPassword', user='user')
    cur = conn.cursor() 

    # execute the query to fetch the top 5 most similar sentences based on negative inner product;
    # the returned score is the inner product itself (the operator's result is negated back)
    result = cur.execute(
        'SELECT sentence, -(embedding <#> %s::vector) AS inner_product '
        'FROM ' + table_name + ' '
        'ORDER BY embedding <#> %s::vector '
        'LIMIT 5',
        (query_embedding, query_embedding)  # pass the embedding twice, once for ordering and once for calculation
    ).fetchall()
    cur.close()
    conn.close()
    return result


#print results
def print_results(result):
    """
    This function displays the results.

    Parameters
    - result: a list of tuples, each containing a sentence and its distance (or similarity) score
    """
    for i,(sentence, distance) in enumerate(result, start=1):
        print(f"{i}. {sentence} {distance}")
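The four query functions above differ only in the operator they use. One possible way to reduce the duplication is sketched below: a helper that builds the SQL text from a metric name. The names `OPERATORS`, `ALLOWED_TABLES`, and `build_knn_query` are hypothetical and not part of pgvector or psycopg. Because table names cannot be passed as query parameters, the sketch validates them against a fixed list instead of concatenating arbitrary input.

```python
# Operators from the distance-function section, keyed by a short metric name
OPERATORS = {"l2": "<->", "inner": "<#>", "cosine": "<=>", "l1": "<+>"}

# Tables created earlier in this notebook
ALLOWED_TABLES = {"showcase.vector_demo", "showcase.vector_demo2"}

def build_knn_query(metric, table_name, limit=5):
    """Build the SQL text of a KNN query; the embedding is still passed as a parameter."""
    if metric not in OPERATORS:
        raise ValueError(f"unknown metric: {metric}")
    if table_name not in ALLOWED_TABLES:
        raise ValueError(f"unknown table: {table_name}")
    op = OPERATORS[metric]
    return (
        f"SELECT sentence, (embedding {op} %s::vector) AS distance "
        f"FROM {table_name} "
        f"ORDER BY embedding {op} %s::vector "
        f"LIMIT {int(limit)}"
    )

print(build_knn_query("cosine", "showcase.vector_demo"))
```

The resulting string can be passed to `cur.execute` together with `(query_embedding, query_embedding)`, as in the functions above. Note that this sketch returns the raw operator value for every metric, whereas `query_db_cosine` and `query_db_inner` convert it to a similarity score.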

#### 1. Identical sentence (we query the same vector)

The query sentence "The sun is shining" is already stored in our database. For this reason, we obtain the perfect match across all distances.

In [None]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
query = 'The sun is shining'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

#### 2. Synonymous sentence (different wording, same meaning)

In this example, the query sentence is "The sun is shining brightly". We do not have an exact match stored in the database. Even though the sentences are worded differently, they express the same idea, which lets us see how cosine distance and inner product treat them as similar. In contrast, L2 and L1 distances show slight differences.

In [None]:
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
query = 'The sun is shining brightly'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)


#### 3. Contrasting sentence (antonyms)

If our query is "The moon is glowing", the query and the sentences in the database contrast in meaning (e.g., sun versus moon, shining versus glowing). For this reason, the distance metrics indicate a higher degree of dissimilarity, particularly the L2 and L1 distances.

In [None]:
query = 'The moon is glowing'
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

#### 4. Short versus long sentences (different sentence lengths)

Comparing a long query with the short sentences stored in the database helps us see how cosine distance and inner product are slightly affected by the amount of content. Overall, however, we still obtain similar results across all distances.

In [None]:
query = 'The sun shines in the sky, a gentle breeze rustles the leaves and birds chirp in harmony, welcoming a new day filled with endless possibilities'
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

#### 5. Different topics

Our query sentence includes an entirely different topic. As a consequence, the obtained results indicate a higher degree of dissimilarity.

In [None]:
query = 'Albert Einstein'
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

#### 6. Minor changes (similar meaning but slight rewording)

The query "Shining is the sun" has a similar meaning to the sentence "The sun is shining", which is already stored in the database.

In [None]:
query = 'Shining is the sun'
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

#### 7. What about Slovene?

So far, we have tested several examples in English. Let's try one example sentence in Slovene, e.g. "Sonce sije svetlo na jasnem nebu". Its translation into English is "The sun is shining brightly in the clear sky". A nearly identical sentence, "The sun shines brightly in the clear sky", is already stored in the database.

`How will a query in Slovene impact the results?`

In [None]:
query = 'Sonce sije svetlo na jasnem nebu' #The sun is shining brightly in the clear sky
model_name = 'sentence-transformers/all-MiniLM-L6-v2'
table_name = 'showcase.vector_demo'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

We can see that the *all-MiniLM-L6-v2* model does not yield the expected results: the sentence "The sun shines" is ranked first for all distances, rather than the stored sentence "The sun shines brightly in the clear sky".

Some models, like the *LaBSE* model, are multilingual. Such models support different languages. Specifically, the *LaBSE* model supports both Slovene and English. Let's see how this model will perform.

In [None]:
query = 'Sonce sije svetlo na jasnem nebu' #The sun is shining brightly in the clear sky
model_name = 'sentence-transformers/LaBSE'
table_name = 'showcase.vector_demo2'

#L2 distance
resultL2 = query_db_L2(query, model_name, table_name)
print("\n**Top 5 similar sentences using L2 distance:**\n")
print_results(resultL2)

#L1 distance
resultL1 = query_db_L1(query, model_name, table_name)
print("\n**Top 5 similar sentences using L1 (Manhattan) distance:**\n")
print_results(resultL1)

#cosine distance
resultC = query_db_cosine(query, model_name, table_name)
print("\n**Top 5 similar sentences using cosine distance:**\n")
print_results(resultC)

#negative inner product
result_inner_product = query_db_inner(query, model_name, table_name)
print("\n**Top 5 similar sentences using negative inner product:**\n")
print_results(result_inner_product)

This model yields the expected results. 

In this tutorial, we have demonstrated that the model used for calculating embeddings plays a crucial role in vector retrieval. So far, we have utilized the KNN approach to search the database with non-binary vectors. 

For practice, you can try querying the database using Hamming and Jaccard distances. To do this, you must select a model that calculates embeddings as binary vectors, create a new table for storing the embeddings, and implement the code for querying with the Hamming and Jaccard distances. For guidance, see [this example](https://github.com/pgvector/pgvector-python/blob/master/examples/cohere/example.py).
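If no binary embedding model is at hand, one way to experiment is to binarize ordinary float embeddings yourself. The sketch below thresholds each component at zero and renders the result as a string of 0s and 1s, the textual form PostgreSQL uses for `bit` values. The helpers `binarize` and `to_bit_string` and the threshold of zero are assumptions made for this illustration; binarized float embeddings are not equivalent to embeddings from a model trained to produce binary vectors.

```python
def binarize(embedding):
    """Map each component to 1 if it is positive and to 0 otherwise."""
    return [1 if value > 0 else 0 for value in embedding]

def to_bit_string(bits):
    """Render a list of 0/1 values as a bit string such as '1010'."""
    return "".join(str(b) for b in bits)

embedding = [0.12, -0.40, 0.03, -0.07, 0.55, 0.0]  # made-up values for illustration
bits = binarize(embedding)

print(bits)                 # [1, 0, 1, 0, 1, 0]
print(to_bit_string(bits))  # 101010
```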

## References

- [Understanding vector databases](https://learn.microsoft.com/en-us/data-engineering/playbook/solutions/vector-database/)
- [Book Vector Spaces First: Introduction to Linear Algebra](https://ruor.uottawa.ca/items/f66a4ede-e276-486c-9067-9621d5347440)
- [Vector database pgvector](https://github.com/pgvector/pgvector)
- [pgvector-python](https://github.com/pgvector/pgvector-python)
- [Sentence transformers](https://sbert.net/)
- [Model all-MiniLM-L6-v2](https://huggingface.co/sentence-transformers/all-MiniLM-L6-v2)
- [Model LaBSE](https://huggingface.co/sentence-transformers/LaBSE)