## Atlas Vector Search

Atlas Search's Vector Search capability gives developers a mechanism to **store** dense vectors in an index built for nearest-neighbor algorithms (e.g. KNN), and an engine to **compute** how similar vectors are (e.g. by Euclidean distance) for relevance scoring.

Please review the [index definition](https://www.mongodb.com/docs/atlas/atlas-search/define-field-mappings-for-vector-search/#std-label-bson-data-types-knn-vector) and [query syntax](https://www.mongodb.com/docs/atlas/atlas-search/knn-beta/#options) documentation to learn more. 
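To make the "store" and "compute" idea concrete, here is a minimal, brute-force sketch of k-nearest-neighbor search using Euclidean distance. It is only an illustration of the concept, not how Atlas implements it; the function names and the toy 3-dimensional vectors are assumptions made up for this example.

```python
import math

def euclidean_distance(a, b):
    """Straight-line distance between two equal-length vectors."""
    if len(a) != len(b):
        raise ValueError("vectors must have the same number of dimensions")
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def nearest_neighbors(query, stored, k):
    """Return the k stored (name, vector) pairs closest to the query."""
    ranked = sorted(stored, key=lambda item: euclidean_distance(query, item[1]))
    return ranked[:k]

# Toy 3-dimensional vectors, made up for illustration only.
stored_vectors = [
    ("a", [0.0, 0.0, 0.0]),
    ("b", [1.0, 1.0, 1.0]),
    ("c", [5.0, 5.0, 5.0]),
]

closest = nearest_neighbors([0.9, 1.0, 1.1], stored_vectors, k=2)
print([name for name, _ in closest])
```

A real index avoids comparing the query against every stored vector, but the ranking idea is the same: smaller distance means more similar.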

## Use Case

We are a wholesaler of food supplies, and our largest customers are pizza franchises. Unfortunately, they have been complaining that their ingredient purchasers spend far too much time searching for all the different types of cheese.

We will use Atlas Vector Search so that purchasers no longer have to search for each cheese manually. Instead, they enter "cheese" and all the different types are returned automatically because they are semantically similar.

## How can we do it today? 

Developers have two main approaches today: a tagging data structure or a synonyms mapping collection.

In [1]:
# tagging

{
    "name": "Mozzarella",
    "tags": ["cheese", "dairy", "pizza ingredient", "fermented dairy", "..."]
}

# synonyms

{
    "mappingType": "explicit",
    "input": ["Mozzarella"],
    "synonyms": ["cheese", "dairy", "pizza ingredient", "fermented dairy", "..."]
}

{'mappingType': 'explicit',
 'input': ['Mozzarella'],
 'synonyms': ['cheese', 'dairy', 'pizza ingredient', 'fermented dairy', '...']}

Both options require manual or automated updates, which are hard to keep consistent, hard to manage, and unintuitive. Together, these problems result in a poor user experience.
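The consistency problem can be sketched with a small, hypothetical catalog that uses the tagging approach. The catalog contents and the `find_by_tag` helper are assumptions for illustration; the point is that a product only shows up if someone remembered to tag it.

```python
# Hypothetical catalog using the tagging approach; names and tags are illustrative.
catalog = [
    {"name": "Mozzarella", "tags": ["cheese", "dairy", "pizza ingredient"]},
    {"name": "Cheddar", "tags": ["cheese", "dairy"]},
    # A newly added product whose tags were never filled in.
    {"name": "Gruyere", "tags": []},
]

def find_by_tag(products, tag):
    """Return the names of products explicitly tagged with `tag`."""
    return [p["name"] for p in products if tag in p["tags"]]

matches = find_by_tag(catalog, "cheese")
print(matches)  # Gruyere is absent because nobody tagged it
```

A synonyms mapping has the same weakness: every new product needs a new mapping entry before it becomes findable.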

## Create the Vector Index

<center><img width="700px" src="index.png"/></center>

In [2]:
{
  "mappings": {
    "fields": {
      "embedding": [
        {
          "dimensions": 384,
          "similarity": "euclidean",
          "type": "knnVector"
        }
      ]
    }
  }
}

{'mappings': {'fields': {'embedding': [{'dimensions': 384,
     'similarity': 'euclidean',
     'type': 'knnVector'}]}}}

### Index Field Mapping Parameters:
- **dimensions:** The number of vector space dimensions which we’ll enforce at index and query time. Represented as the number of floats in an array. Limited to 1024. 
- **similarity:** The vector similarity function used in search to determine the nearest neighbors. Options include Euclidean, dot product, and cosine.
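For intuition about the three similarity options, the sketch below implements their textbook definitions in plain Python. These are the standard mathematical formulas, not Atlas's internal scoring code, and the example vectors are made up.

```python
import math

def euclidean(a, b):
    """Distance between the vector endpoints; smaller means more similar."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def dot_product(a, b):
    """Sensitive to both direction and magnitude; larger means more similar."""
    return sum(x * y for x, y in zip(a, b))

def cosine(a, b):
    """Compares direction only; 1.0 means the vectors point the same way."""
    norm_a = math.sqrt(dot_product(a, a))
    norm_b = math.sqrt(dot_product(b, b))
    return dot_product(a, b) / (norm_a * norm_b)

u = [1.0, 0.0]
v = [2.0, 0.0]  # same direction as u, twice the length
w = [0.0, 1.0]  # perpendicular to u

print(euclidean(u, v), dot_product(u, v), cosine(u, v))
print(euclidean(u, w), dot_product(u, w), cosine(u, w))
```

Note how `u` and `v` are identical by cosine but not by Euclidean distance, which is why the choice of function should match how the embedding model was trained.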

## Create Embeddings

Let's create the embeddings that we'll store alongside our products.

In [3]:
from sentence_transformers import SentenceTransformer
from pprint import pprint

# https://huggingface.co/obrizum/all-MiniLM-L6-v2
# how is this converting?
model = SentenceTransformer('obrizum/all-MiniLM-L6-v2')

In [None]:
# product names that we will convert to embeddings
products = [
    {"name":"Mozzarella"},
    {"name":"Parmesan"},
    {"name":"Cheddar"},
    {"name":"Brie"},
    {"name":"Swiss"},
    {"name":"Gruyere"},
    {"name":"Feta"},
    {"name":"Gouda"},
    {"name":"Provolone"},
    {"name":"Monterey Jack"},
    {"name":"Telephone"}
]

# create a new embedding field for each product object
for product in products:
    # convert to embedding, then to a list
    embeddings = model.encode(product['name']).tolist()
    product['embedding'] = embeddings
    
pprint(products)
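Because the index enforces `dimensions` at index and query time, it can be worth checking vector lengths before inserting. The helper below is a hypothetical sketch (the name `find_dimension_mismatches` and the stand-in documents are assumptions); it uses placeholder vectors so it runs without loading the model.

```python
INDEX_DIMENSIONS = 384  # matches "dimensions" in the index definition

def find_dimension_mismatches(docs, expected=INDEX_DIMENSIONS, field="embedding"):
    """Return names of documents whose vector length differs from `expected`."""
    return [d["name"] for d in docs if len(d.get(field, [])) != expected]

# Stand-in documents with placeholder vectors, for illustration only.
sample_docs = [
    {"name": "ok", "embedding": [0.0] * 384},
    {"name": "too_short", "embedding": [0.0] * 3},
    {"name": "missing"},
]

print(find_dimension_mismatches(sample_docs))
```

In the notebook, the same check could be run as `find_dimension_mismatches(products)` after the embedding loop.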

## Store in Mongo

Now we'll store this newly created array of objects, each with its product name embedding, in our collection.

In [5]:
import pymongo

mongo_uri = ""

# connection object
connection = pymongo.MongoClient(mongo_uri)
database = 'eap'
collection = 'vector'

# delete all first
# connection[database][collection].delete_many({})

# insert
# connection[database][collection].insert_many(products)

## Query in Mongo Using KNN

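We encode the query string with the same model used for the products, then pass the resulting vector to the `knnBeta` operator, which returns the `k` nearest neighbors along with their search scores.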

In [6]:
query = "cheese"
vector_query = model.encode(query).tolist()

pipeline = [
    {
        "$search": {
            "index": "default",
            "knnBeta": {
                "vector": vector_query,
                "path": "embedding",
                # limit the result set
                "k": 10
            }
        }
    },
    {
        "$project": {
            "embedding": 0,
            "_id": 0,
            "score": {
                "$meta": "searchScore"
            }
        }
    }
]

results = list(connection[database][collection].aggregate(pipeline))
pprint(results)

[{'name': 'Cheddar', 'score': 0.627196729183197},
 {'name': 'Mozzarella', 'score': 0.6126944422721863},
 {'name': 'Swiss', 'score': 0.46435827016830444},
 {'name': 'Provolone', 'score': 0.4593936502933502},
 {'name': 'Monterey Jack', 'score': 0.44634753465652466},
 {'name': 'Gouda', 'score': 0.4415607750415802},
 {'name': 'Gruyere', 'score': 0.43056464195251465},
 {'name': 'Feta', 'score': 0.4289986193180084},
 {'name': 'Parmesan', 'score': 0.42754852771759033},
 {'name': 'Brie', 'score': 0.4135145843029022}]
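Since the same `$search` and `$project` stages recur throughout this notebook with only `k` changing, one option is to wrap them in a small builder. The function name `build_knn_pipeline` and its defaults are assumptions for this sketch; the stage contents mirror the pipeline above.

```python
def build_knn_pipeline(vector, k, path="embedding", index="default"):
    """Assemble the $search and $project stages used in this notebook."""
    return [
        {
            "$search": {
                "index": index,
                "knnBeta": {"vector": vector, "path": path, "k": k},
            }
        },
        {
            "$project": {
                path: 0,  # hide the raw vector in the output
                "_id": 0,
                "score": {"$meta": "searchScore"},
            }
        },
    ]

# Placeholder 3-dimensional vector; a real query would use model.encode(...).tolist().
example_pipeline = build_knn_pipeline([0.1, 0.2, 0.3], k=10)
print(example_pipeline[0]["$search"]["knnBeta"]["k"])
```

With a live connection, the result could be passed straight to `aggregate`, for example `connection[database][collection].aggregate(build_knn_pipeline(vector_query, k=10))`.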


## Architecture Review

Review this diagram to understand how Atlas Vector Search fits into your search architecture:

<center><img src="diagram.png" style="padding-top:1em"/></center>

## Combining Queries

What if we want to prioritize documents that contain more exact string matches in addition to the above contextual vector search?

In [13]:
string_match_query = "Telephone"

# let's first insert an exact match
# connection[database][collection].insert_one({"name":string_match_query})

# what products include cheese?
# let's update the index definition with this static mapping:
{
  "mappings": {
    "fields": {
      "embedding": [{
        "dimensions": 384,
        "similarity": "euclidean",
        "type": "knnVector"
      }],
      "name": {
        "type": "string"
      }
    }
  }
}

# now we combine queries using a score boost
pipeline = [{
    "$search": {
      "compound": {
        "should": [{
            "knnBeta": {
              "vector": vector_query,
              "path": "embedding",
              "k": 10
            }
          },
          {
            "text": {
              "query": "Telephone",
              "path": "name",
              "score": {"boost": {"value": 10}}
            }
          }
        ]
      }
    }
  },
  {
    "$project": {
      "embedding": 0,
      "_id": 0,
      'score': {
        '$meta': 'searchScore'
      }
    }
  }
]

results = list(connection[database][collection].aggregate(pipeline))
pprint(results)

[{'name': 'Cheddar', 'score': 0.627196729183197},
 {'name': 'Mozzarella', 'score': 0.6126944422721863},
 {'name': 'Swiss', 'score': 0.46435827016830444},
 {'name': 'Provolone', 'score': 0.4593936502933502},
 {'name': 'Monterey Jack', 'score': 0.44634753465652466},
 {'name': 'Gouda', 'score': 0.4415607750415802},
 {'name': 'Gruyere', 'score': 0.43056464195251465},
 {'name': 'Feta', 'score': 0.4289986193180084},
 {'name': 'Parmesan', 'score': 0.42754852771759033},
 {'name': 'Brie', 'score': 0.4135145843029022}]
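As a rough mental model of the score boost, the sketch below assumes that a `compound` query sums the scores of the `should` clauses a document matches, and that `boost` multiplies the text clause's score by its value. The function and all numbers are made-up placeholders for illustration, not measured results from this collection.

```python
def combined_score(knn_score, text_score, boost=10):
    """Sum the clause scores, multiplying the text clause by its boost."""
    return knn_score + boost * text_score

# Placeholder scores, chosen only to show the effect of the boost.
vector_only = combined_score(knn_score=0.6, text_score=0.0)  # no text match
boosted = combined_score(knn_score=0.0, text_score=1.0)      # text match only

print(vector_only, boosted, boosted > vector_only)
```

Under these assumptions, an exact string match can outrank a strong vector match, which is the behavior the boost is meant to produce.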



In [9]:
# Let's now modify the k value to see how it impacts the item rank and score

query = "cheese"
vector_query = model.encode(query).tolist()

pipeline = [
    {
        "$search": {
            "knnBeta": {
                "vector": vector_query,
                "path": "embedding",
                "k": 12
            }
        }
    },
    {
        "$project": {
            "embedding": 0,
            "_id": 0,
            "score": {
                "$meta": "searchScore"
            }
        }
    }
]

results = list(connection[database][collection].aggregate(pipeline))
pprint(results)

[{'name': 'Cheddar', 'score': 0.627196729183197},
 {'name': 'Mozzarella', 'score': 0.6126944422721863},
 {'name': 'Swiss', 'score': 0.46435827016830444},
 {'name': 'Provolone', 'score': 0.4593936502933502},
 {'name': 'Monterey Jack', 'score': 0.44634753465652466},
 {'name': 'Gouda', 'score': 0.4415607750415802},
 {'name': 'Gruyere', 'score': 0.43056464195251465},
 {'name': 'Feta', 'score': 0.4289986193180084},
 {'name': 'Parmesan', 'score': 0.42754852771759033},
 {'name': 'Brie', 'score': 0.4135145843029022},
 {'name': 'Telephone', 'score': 0.4124469757080078}]
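To compare runs with different `k` values, it helps to pair each result with its rank. The helper below is a hypothetical sketch; the three sample rows are the top three entries copied from the output above.

```python
def rank_score_pairs(results):
    """Pair each result with its 1-based rank, in the order returned."""
    return [
        (rank, doc["name"], doc["score"])
        for rank, doc in enumerate(results, start=1)
    ]

# Top three rows copied from the k=12 output above.
top_three = [
    {"name": "Cheddar", "score": 0.627196729183197},
    {"name": "Mozzarella", "score": 0.6126944422721863},
    {"name": "Swiss", "score": 0.46435827016830444},
]

pairs = rank_score_pairs(top_three)
for rank, name, score in pairs:
    print(f"{rank:>2}  {name:<12} {score:.4f}")
```

The `(rank, score)` values can then be handed to a plotting library to chart rank on the x-axis against score on the y-axis for each `k`.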


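## Embedded Vector Search

Vectors can also be indexed inside an array of embedded documents. The cell below shows an index definition, a sample document, and a query that combines text filters with `knnBeta` inside the `embeddedDocument` operator.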

In [None]:
# index definition
{
  "mappings": {
    "dynamic": true,
    "fields": {
      "teachers": [
        {
          "dynamic": true,
          "fields": {
            "vector": {
              "dimensions": 3,
              "similarity": "euclidean",
              "type": "knnVector"
            }
          },
          "type": "embeddedDocuments"
        }
      ]
    }
  }
}

# docs
{
  "teachers": [
    {
      "first": "Jane",
      "last": "Smith",
      "vector": [
        1,
        1,
        1
      ]
    }
  ]
}

# query
[
    {
        '$search': {
            'index': 'embedded_vector', 
            'embeddedDocument': {
                'path': 'teachers', 
                'operator': {
                    'compound': {
                        'filter': [
                            {
                                'text': {
                                    'path': 'teachers.first', 
                                    'query': 'Jane'
                                }
                            }, {
                                'text': {
                                    'path': 'teachers.last', 
                                    'query': 'Smith'
                                }
                            }
                        ], 
                        'should': [
                            {
                                'knnBeta': {
                                    'path': 'teachers.vector', 
                                    'k': 10, 
                                    'vector': [
                                        6, 6, 6
                                    ]
                                }
                            }
                        ]
                    }
                }
            }
        }
    }
]

