# Add/update embedding vectors


When adding a new set of embedding vectors or updating existing ones, we need to perform the following sequence of steps:

1. Given a model ID, its revision, and a set of resources, ask the embedding service (or some Python code) for the embedding vectors.
2. Create/update embedding resources according to [this mapping](https://bbpgitlab.epfl.ch/dke/users/eugeniashurko/dataset-embeddings/-/blob/master/mappings/seu-embedding.hjson). The model revision needs to be added to `generation.activity.used.id`.
3. Push them to Nexus.
4. Tag them with the model UUID and its revision (e.g. `e2b953b9-6724-4278-a1e5-3472bd63e374?rev=1`).

Related JIRA tickets: 
* https://bbpteam.epfl.ch/project/issues/browse/DKE-718
* https://bbpteam.epfl.ch/project/issues/browse/DKE-715

Prerequisites:

- The embedding model has been built.
- The embedding service can read models from a dedicated Nexus project where all models are stored. (For now, models can instead be downloaded locally and the vectors read directly from them, without using the service.)
- The model ID equals the Nexus resource ID of the EmbeddingModel resource.
- __Important__: the local context of every project that stores vectors should contain:

```
{
      "embedding": {
        "@id": "nsg:embedding",
        "@container": "@list"
      }
}
```
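The `@list` container matters because a plain JSON-LD array is treated as an unordered set, which gives no guarantee about the order of the vector components. The following is a minimal sketch of a pre-flight check, assuming the project's local context has already been loaded as a Python dict. The helper name `has_embedding_list_term` is hypothetical and is not part of Nexus or kgforge.

```python
def has_embedding_list_term(context: dict) -> bool:
    """Return True if `context` maps `embedding` to an ordered JSON-LD list."""
    term = context.get("embedding")
    # The term must be an expanded definition (a dict), not a plain IRI string.
    if not isinstance(term, dict):
        return False
    return term.get("@id") == "nsg:embedding" and term.get("@container") == "@list"


good_context = {"embedding": {"@id": "nsg:embedding", "@container": "@list"}}
bad_context = {"embedding": "nsg:embedding"}  # no @list container: order not guaranteed

print(has_embedding_list_term(good_context))
print(has_embedding_list_term(bad_context))
```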

Open questions and TODOs:

* Do we really need to URL-encode tags?
* TODO: add missing types and properties to the context.
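To make the question about URL-encoding concrete, the sketch below shows what `urllib.parse.quote_plus` does to a tag of the form used in this notebook. It only illustrates the encoding itself; whether Nexus requires the encoded form is exactly the open question and is not settled here.

```python
from urllib.parse import quote_plus, unquote_plus

# Example tag taken from the list of steps at the top of this notebook.
tag = "e2b953b9-6724-4278-a1e5-3472bd63e374?rev=1"

encoded = quote_plus(tag)
print(encoded)  # e2b953b9-6724-4278-a1e5-3472bd63e374%3Frev%3D1

# Decoding restores the original tag, so the encoding is lossless.
decoded = unquote_plus(encoded)
print(decoded == tag)
```

Only `?` and `=` change; the UUID part contains no characters that need escaping.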

---

## Setup

### Imports

In [3]:
import requests
import getpass
import uuid
import os
import math
import warnings

from collections import OrderedDict

import numpy as np
import nexussdk as nxs

from urllib.parse import quote_plus
from kgforge.core import KnowledgeGraphForge
from kgforge.specializations.mappings import DictionaryMapping
from bluegraph.downstream import EmbeddingPipeline
from bluegraph.core import GraphElementEmbedder

from inference_tools.similarity.data_registration import (BucketConfiguration,
                                                          create_forge_session,
                                                          register_embeddings)

In [4]:
from kgforge.version import __version__
print(__version__)

0.6.3


---

## User input

In [42]:
CONFIG_PATH = "../../configs/new-forge-config.yaml"
ENDPOINT = "https://bbp.epfl.ch/nexus/v1"
DOWNLOAD_DIR = "../../data"
TOKEN = getpass.getpass()

········


Bucket where embedding models live

In [6]:
MODEL_CATALOG_ORG = "dke"
MODEL_CATALOG_PROJECT = "embedding-pipelines"

IDs of the embedding models to use.

In [26]:
MODEL_IDS = [
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/d0c21fd5-cb9c-445c-b0a4-94847ba61f5a",  # neurite features
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/9fe6873b-ef6a-41b5-854a-382bc1be9fff",  # dendrite
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/84519407-ad30-4d31-877e-1d6560325393",  # axon
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/1c4fcd2e-000f-437b-b65b-844ee211105a",  # brain regions
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/608fab85-0cc9-4ff9-a4bd-4249589b5889",  # coordinates
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/43965be4-72f9-4901-9a95-d9ca13da8fb4",  # TMD
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/7a111efa-7467-42d2-9e0c-c1ca7a883216",  # TMD (scaled)
]

__Provide the IDs of your models above and, optionally, pin their revisions below.__

In [27]:
# Optional: map a model ID to the revision number to use.
# Models without an entry use their latest revision.
MODEL_REVISIONS = {}

Buckets where the input data lives, each mapped to the buckets where the new embedding vectors should be registered.

In [28]:
DATA_BUCKETS = {
    BucketConfiguration("https://bbp.epfl.ch/nexus/v1", "bbp-external", "seu"): [
        BucketConfiguration("https://bbp.epfl.ch/nexus/v1", "dke", "seu-embeddings"),
    ]
}

If no embedding bucket is specified for a data bucket, we assume that the embeddings should live in the same bucket as the input data.
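The notebook does not show where this fallback is applied, so the following is only a sketch of one way to do it. `Bucket` is a hypothetical stand-in with the same three fields that the `BucketConfiguration` output in this notebook displays (`endpoint`, `org`, `proj`), and `resolve_embedding_buckets` is an illustrative helper, not part of `inference_tools`.

```python
from collections import namedtuple

# Hypothetical stand-in for BucketConfiguration (the real class lives in inference_tools).
Bucket = namedtuple("Bucket", ["endpoint", "org", "proj"])


def resolve_embedding_buckets(data_buckets):
    """Map every data bucket to a non-empty list of embedding buckets."""
    resolved = {}
    for data_bucket, emb_buckets in data_buckets.items():
        # Fall back to the data bucket itself when no target bucket is given.
        resolved[data_bucket] = list(emb_buckets) if emb_buckets else [data_bucket]
    return resolved


seu = Bucket("https://bbp.epfl.ch/nexus/v1", "bbp-external", "seu")
seu_emb = Bucket("https://bbp.epfl.ch/nexus/v1", "dke", "seu-embeddings")

explicit = resolve_embedding_buckets({seu: [seu_emb]})
fallback = resolve_embedding_buckets({seu: []})
print(explicit[seu])
print(fallback[seu])
```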

Data type filter for generating embedding vectors

In [29]:
DATA_TYPE_FILTER = "NeuronMorphology"

In [30]:
HARD_RESOURCE_LIMIT = 10000  # Limit on number of resources we can retrieve with SPARQL queries

---

## Create Forge sessions

### Session for embedding models

In [15]:
forge_models = create_forge_session(
    CONFIG_PATH,
    BucketConfiguration(ENDPOINT, MODEL_CATALOG_ORG, MODEL_CATALOG_PROJECT),
    TOKEN)

### Sessions for different buckets for data and embedding vectors

In [16]:
# TODO: find a way to pass different tokens and different configs
FORGE_SESSIONS = {}
for data_bucket, emb_buckets in DATA_BUCKETS.items():
    if data_bucket not in FORGE_SESSIONS:
        FORGE_SESSIONS[data_bucket] = create_forge_session(CONFIG_PATH, data_bucket, TOKEN)
    for bucket in emb_buckets:
        if bucket not in FORGE_SESSIONS:
            FORGE_SESSIONS[bucket] = create_forge_session(CONFIG_PATH, bucket, TOKEN)
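One possible way to address the TODO above is to look up a per-bucket configuration and token before creating each session. The sketch below is an assumption about how that could look, not existing functionality: `build_sessions`, `overrides`, and `default` are hypothetical names, and `fake_session` merely stands in for `create_forge_session(config_path, bucket, token)` so that the example runs without a Nexus connection.

```python
def build_sessions(data_buckets, make_session, overrides=None, default=None):
    """Create one session per distinct bucket.

    `overrides` maps a bucket to its own (config_path, token) pair;
    buckets without an entry use the `default` pair.
    """
    overrides = overrides or {}
    sessions = {}
    for data_bucket, emb_buckets in data_buckets.items():
        for bucket in (data_bucket, *emb_buckets):
            if bucket not in sessions:
                config_path, token = overrides.get(bucket, default)
                sessions[bucket] = make_session(config_path, bucket, token)
    return sessions


def fake_session(config_path, bucket, token):
    """Toy factory that records its arguments instead of opening a session."""
    return (config_path, bucket, token)


# Toy buckets written as (org, project) tuples; the data bucket is repeated on purpose.
buckets = {("org-a", "data"): [("org-b", "vectors"), ("org-a", "data")]}
sessions = build_sessions(
    buckets,
    fake_session,
    overrides={("org-b", "vectors"): ("other-config.yaml", "token-b")},
    default=("config.yaml", "token-a"),
)
print(sessions)
```

Each bucket gets exactly one session even when it appears both as a data bucket and as an embedding bucket.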

---

## Fetch resources from data buckets

In [17]:
resource_set = {}
for bucket_config in DATA_BUCKETS:
    forge = FORGE_SESSIONS[bucket_config]
    # NOTE: forge.search({"type": DATA_TYPE_FILTER}, limit=None) currently does
    # not work when there are too many resources, so we collect the IDs with
    # SPARQL and retrieve each resource individually.
    query = f"""
        SELECT ?id
        WHERE {{
            ?id a {DATA_TYPE_FILTER} ;
                <https://bluebrain.github.io/nexus/vocabulary/deprecated> false .
        }}
    """
    resources = forge.sparql(query, limit=HARD_RESOURCE_LIMIT)
    resource_set[bucket_config] = [forge.retrieve(r.id) for r in resources]
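The query above can be factored into a small helper, and the hard limit suggests one extra safety check. The sketch below is hypothetical (neither `build_type_query` nor `may_be_truncated` exists in kgforge). It assumes the limit simply caps the number of returned rows, in which case a result count that reaches the limit means some resources may have been left out.

```python
def build_type_query(data_type: str) -> str:
    """Return a SPARQL query selecting IDs of non-deprecated `data_type` resources."""
    deprecated = "https://bluebrain.github.io/nexus/vocabulary/deprecated"
    # Inside an f-string, `{{` and `}}` produce the literal braces SPARQL needs.
    return f"""
        SELECT ?id
        WHERE {{
            ?id a {data_type} ;
                <{deprecated}> false .
        }}
    """


def may_be_truncated(results, limit: int) -> bool:
    """True when the result count reached the limit, so rows may be missing."""
    return len(results) >= limit


query = build_type_query("NeuronMorphology")
print(query)
print(may_be_truncated(range(400), 10000))    # 400 results, well below the limit
print(may_be_truncated(range(10000), 10000))  # limit reached: results may be incomplete
```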

In [18]:
for k, v in resource_set.items():
    print("Bucket: ", k)
    print("\t", len(v), "resources")
    print()

Bucket:  BucketConfiguration(endpoint='https://bbp.epfl.ch/nexus/v1', org='bbp-external', proj='seu')
	 400 resources



## Load the embedding model

In [39]:
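model_resources = []
# MODEL_REVISIONS comes from the user input above; re-initializing it here
# would discard any revisions pinned by the user.
MODEL_TAGS = {}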
for model_id in MODEL_IDS:
    model_revision = MODEL_REVISIONS.get(model_id)
    model_resource = forge_models.retrieve(
        f"{model_id}{'?rev=' + str(model_revision) if model_revision is not None else ''}")
    model_resources.append(model_resource)

    # If revision is not provided by the user, fetch the latest
    if model_revision is None:
        model_revision = model_resource._store_metadata._rev 
        MODEL_REVISIONS[model_id] = model_revision

    tag = f"{model_id.split('/')[-1]}?rev={model_revision}"
    MODEL_TAGS[model_id] = tag
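The two string manipulations in this cell (building the versioned ID that is retrieved, and building the tag) can be isolated as small pure functions, which makes them easy to check without a Nexus connection. The helper names `versioned_id` and `model_tag` are illustrative only.

```python
def versioned_id(model_id, revision=None):
    """Append `?rev=N` to the model ID when a revision is pinned."""
    return model_id if revision is None else f"{model_id}?rev={revision}"


def model_tag(model_id, revision):
    """Build the tag `<UUID>?rev=N` from the last path segment of the model ID."""
    return f"{model_id.split('/')[-1]}?rev={revision}"


model_id = (
    "https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/"
    "d0c21fd5-cb9c-445c-b0a4-94847ba61f5a"
)
print(versioned_id(model_id))     # unchanged: the latest revision is retrieved
print(versioned_id(model_id, 8))  # ends with ?rev=8
print(model_tag(model_id, 8))     # d0c21fd5-cb9c-445c-b0a4-94847ba61f5a?rev=8
```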

In [35]:
pipeline_paths = {}
for model_resource in model_resources:
    forge_models.download(model_resource, "distribution.contentUrl", DOWNLOAD_DIR, overwrite=True)
    pipeline_paths[model_resource.id] = os.path.join(
        DOWNLOAD_DIR, model_resource.distribution.name)

In [37]:
pipelines = {}
for k, pipeline_path in pipeline_paths.items():
    pipelines[k] = EmbeddingPipeline.load(
        pipeline_path,
        embedder_interface=GraphElementEmbedder,
        embedder_ext="zip")



## Compute embedding vectors for all the resources and push to Nexus

- TODO: add the NeuronMorphology revision once available
- TODO: add prediction of previously unseen points (currently, only the in-sample points are considered)

In [None]:
SEU_DICTIONARY_MAPPING = "../../mappings/seu-embedding.hjson"

In [40]:
for model_id in MODEL_IDS:
    print(f"Processing model '{model_id}'")
    embedding_table = pipelines[model_id].generate_embedding_table()
    for bucket_config, resources in resource_set.items():
        vectors = {}
        for resource in resources:
            if resource.id not in vectors:
                if resource.id in embedding_table.index:
                    vectors[resource.id] = embedding_table.loc[
                        resource.id].tolist()[0].tolist()
                else:
                    warnings.warn(
                        f"\tEmbedding vector for '{resource.id}' in '{bucket_config}' was not computed")
        for embedding_bucket in DATA_BUCKETS[bucket_config]:
            forge = FORGE_SESSIONS[embedding_bucket]
            print(f"\tRegistering/updating {len(vectors)} vectors for '{embedding_bucket}...'")
            register_embeddings(
                forge, vectors, model_id, MODEL_REVISIONS[model_id], MODEL_TAGS[model_id],
                SEU_DICTIONARY_MAPPING)
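The inner loop above keeps only the resources for which the model has an in-sample vector and warns about the rest, so the number of registered vectors can be smaller than the number of resources. A minimal stand-alone sketch of that selection step follows; it uses a plain dict in place of the embedding table, and `collect_vectors` and the toy IDs are hypothetical.

```python
import warnings


def collect_vectors(resource_ids, embedding_lookup):
    """Split resource IDs into those with a vector and those without one."""
    vectors, missing = {}, []
    for resource_id in resource_ids:
        if resource_id in embedding_lookup:
            vectors[resource_id] = list(embedding_lookup[resource_id])
        else:
            missing.append(resource_id)
            warnings.warn(f"Embedding vector for '{resource_id}' was not computed")
    return vectors, missing


# Toy data, not real embedding vectors.
lookup = {"morph-1": (0.1, 0.2), "morph-2": (0.3, 0.4)}

with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    vectors, missing = collect_vectors(["morph-1", "morph-2", "morph-3"], lookup)

print(len(vectors), "vectors,", len(missing), "missing,", len(caught), "warning(s)")
```

Returning the missing IDs alongside the vectors makes it possible to report them once per bucket instead of relying on the warning stream alone.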

Processing model 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/d0c21fd5-cb9c-445c-b0a4-94847ba61f5a'
	Registering/updating 400 vectors for 'BucketConfiguration(endpoint='https://bbp.epfl.ch/nexus/v1', org='dke', proj='seu-embeddings')...'

<count> 298
<action> _update_many
<succeeded> False
<error> UpdatingError: incorrect rev

<count> 102
<action> _update_many
<succeeded> True
Tagging updated resources...
<count> 298
<action> _tag_many
<succeeded> False
<error> TaggingError: resource should be synchronized

<count> 102
<action> _tag_many
<succeeded> True
Processing model 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/9fe6873b-ef6a-41b5-854a-382bc1be9fff'
	Registering/updating 397 vectors for 'BucketConfiguration(endpoint='https://bbp.epfl.ch/nexus/v1', org='dke', proj='seu-embeddings')...'




<count> 99
<action> _register_many
<succeeded> True
<action> _tag_one
<succeeded> True
[... further identical `_tag_one` / `<succeeded> True` pairs omitted; the output is truncated in the source ...]



<count> 104
<action> _register_many
<succeeded> True
<action> _tag_one
<succeeded> True
[... further identical `_tag_one` / `<succeeded> True` pairs omitted; the output is truncated in the source ...]

The following tags should be used to create new Elasticsearch (ES) views on the vectors.

In [41]:
MODEL_TAGS

{'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/d0c21fd5-cb9c-445c-b0a4-94847ba61f5a': 'd0c21fd5-cb9c-445c-b0a4-94847ba61f5a?rev=8',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/9fe6873b-ef6a-41b5-854a-382bc1be9fff': '9fe6873b-ef6a-41b5-854a-382bc1be9fff?rev=9',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/84519407-ad30-4d31-877e-1d6560325393': '84519407-ad30-4d31-877e-1d6560325393?rev=9',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/1c4fcd2e-000f-437b-b65b-844ee211105a': '1c4fcd2e-000f-437b-b65b-844ee211105a?rev=4',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/608fab85-0cc9-4ff9-a4bd-4249589b5889': '608fab85-0cc9-4ff9-a4bd-4249589b5889?rev=7',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/43965be4-72f9-4901-9a95-d9ca13da8fb4': '43965be4-72f9-4901-9a95-d9ca13da8fb4?rev=6',
 'https://bbp.epfl.ch/nexus/v1/resources/dke/embedding-pipelines/_/7a111efa-7467-42d2-9e