# Add/update embedding vectors


When adding a new set of embedding vectors, or updating existing ones, perform the following steps:

1. Given a model ID, its revision, and a set of resources, ask the embedding service (or local Python code) for the embedding vectors.
2. Create or update the embedding resources according to [this mapping](https://bbpgitlab.epfl.ch/dke/users/eugeniashurko/dataset-embeddings/-/blob/master/mappings/seu-embedding.hjson). The model revision needs to be added to `generation.activity.used.id`.
3. Push them to Nexus.
4. Tag them with the model UUID and its revision (e.g. `e2b953b9-6724-4278-a1e5-3472bd63e374?rev=1`).
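
As a minimal sketch of step 4, the tag can be derived from the model's Nexus resource ID and its revision. The helper name `build_model_tag` and the example ID are hypothetical and used only for illustration; the tag format itself follows the example in the list above.

```python
def build_model_tag(model_id: str, model_revision: int) -> str:
    """Return '<model UUID>?rev=<revision>' for a Nexus model resource ID."""
    # The UUID is assumed to be the last path segment of the resource ID.
    model_uuid = model_id.rstrip("/").split("/")[-1]
    return f"{model_uuid}?rev={model_revision}"


# Hypothetical resource ID, for illustration only.
example_id = (
    "https://example.org/v1/resources/org/proj/_/"
    "e2b953b9-6724-4278-a1e5-3472bd63e374"
)
print(build_model_tag(example_id, 1))
```

Keeping the revision inside the tag means that vectors produced by different revisions of the same model can coexist in one bucket.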

Related JIRA tickets: 
* https://bbpteam.epfl.ch/project/issues/browse/DKE-718
* https://bbpteam.epfl.ch/project/issues/browse/DKE-715

Prerequisites:

- The embedding model has been built
- The embedding service can read models from the dedicated Nexus project where all models are stored (for now, this notebook downloads the model locally and reads the vectors directly from it, without using the service)
- The model ID equals the Nexus resource ID of the `EmbeddingModel` resource
- __Important__: the local contexts of the projects that hold vectors must contain:

```
{
      "embedding": {
        "@id": "nsg:embedding",
        "@container": "@list"
      }
}
```
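
The `@list` container matters because JSON-LD treats plain arrays as unordered, which would not preserve the order of the vector components. A minimal sketch of a check for this term is shown below; it assumes the project context is already available as a plain Python dictionary, and the helper name `has_embedding_term` is hypothetical.

```python
# The term definition required by the prerequisite above.
REQUIRED_TERM = {"@id": "nsg:embedding", "@container": "@list"}


def has_embedding_term(context: dict) -> bool:
    """Check that a JSON-LD context maps 'embedding' to an ordered list."""
    return context.get("embedding") == REQUIRED_TERM


good_context = {"embedding": {"@id": "nsg:embedding", "@container": "@list"}}
bad_context = {"embedding": "nsg:embedding"}  # no @list container

print(has_embedding_term(good_context))
print(has_embedding_term(bad_context))
```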

Questions:

* Do we really need to URL-encode tags?
* Add missing types and properties to the context.
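
To make the first question concrete, the snippet below shows what URL-encoding does to a tag of the form used in this notebook. It only illustrates the encoding with the standard library; whether Nexus requires the encoded form is exactly the open question and is not settled here.

```python
from urllib.parse import quote_plus, unquote_plus

tag = "e2b953b9-6724-4278-a1e5-3472bd63e374?rev=1"

# '?' and '=' are reserved characters, so both get percent-encoded.
encoded = quote_plus(tag)
print(encoded)

# The encoding is reversible.
print(unquote_plus(encoded) == tag)
```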

---

## Setup

### Imports

In [167]:
import requests
import getpass
import uuid
import os
import math
import warnings

from collections import OrderedDict, namedtuple

import numpy as np
import nexussdk as nxs

from urllib.parse import quote_plus
from kgforge.core import KnowledgeGraphForge
from kgforge.specializations.mappings import DictionaryMapping
from bluegraph.downstream import EmbeddingPipeline
from bluegraph.core import GraphElementEmbedder

In [168]:
from kgforge.version import __version__
print(__version__)

0.6.3.dev9+gc159ffd


### Helpers

In [169]:
BucketConfiguration = namedtuple(
    'BucketConfiguration', 'endpoint org proj')


def create_forge_session(bucket_config):
    return KnowledgeGraphForge(
        "https://raw.githubusercontent.com/BlueBrain/nexus-forge/master/examples/notebooks/use-cases/prod-forge-nexus.yml",
        token=TOKEN, 
        endpoint=bucket_config.endpoint,        
        bucket=f"{bucket_config.org}/{bucket_config.proj}")


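def register_embeddings(forge, vectors, model_id, model_revision, tag):
    """Register new embeddings, update existing ones, and tag them all.

    `vectors` maps the ID of each embedded resource to its embedding vector.
    """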
    new_embeddings = []
    updated_embeddings = []
    for at_id, embedding in vectors.items():
        existing_vectors = forge.search({
            "type": "Embedding",
            "derivation": {
                "entity": {
                    "id": at_id
                }
            },
            "generation": {
                "activity": {
                    "used": {
                        "id": model_id
                    }
                }
            }
        })
        if len(existing_vectors) > 0:
            vector_resource = existing_vectors[0]
            vector_resource.embedding = embedding
            vector_resource.generation.activity.used.hasSelector = forge.from_json({
                "type": "FragmentSelector",
                "conformsTo": "https://bluebrainnexus.io/docs/delta/api/resources-api.html#fetch",
                "value": f"?rev={model_revision}"
            })
            updated_embeddings.append(vector_resource)
        else:
            new_embeddings.append({
                "morphology_id": at_id,
                "morphology_rev": "TODO",
                "model_id": model_id,
                "model_rev": model_revision,
                "embedding_name": f"Embedding of morphology {at_id.split('/')[-1]} at revision TODO",
                "embedding": embedding,
                "uuid": at_id.split("/")[-1]
            })
    mapping = DictionaryMapping.load("./mappings/seu-embedding.hjson")
    new_embedding_resources = forge.map(new_embeddings, mapping)
    for r in new_embedding_resources:
        r.id = forge.format("identifier", "embeddings", str(uuid.uuid4()))
    forge.register(new_embedding_resources)
    forge.update(updated_embeddings)
    forge.tag(new_embedding_resources + updated_embeddings, tag)
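
The create-or-update decision in `register_embeddings` can be exercised without a Nexus connection. The sketch below is a simplified stand-in, not the function above: `partition_vectors` and the `has_existing` predicate are hypothetical names, and the predicate replaces the `forge.search` call.

```python
def partition_vectors(vectors, has_existing):
    """Split vectors into those to create and those to update.

    `has_existing` is a hypothetical predicate standing in for the
    `forge.search` lookup: it returns True when an embedding resource
    already exists for the given resource ID.
    """
    to_create, to_update = {}, {}
    for at_id, embedding in vectors.items():
        target = to_update if has_existing(at_id) else to_create
        target[at_id] = embedding
    return to_create, to_update


# Hypothetical IDs and vectors, for illustration only.
already_registered = {"https://example.org/morph/a"}
vectors = {
    "https://example.org/morph/a": [0.1, 0.2],
    "https://example.org/morph/b": [0.3, 0.4],
}
to_create, to_update = partition_vectors(
    vectors, lambda at_id: at_id in already_registered)
print(sorted(to_create), sorted(to_update))
```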

---

## User input

In [170]:
ENDPOINT = "https://staging.nexus.ocp.bbp.epfl.ch/v1"
DOWNLOAD_DIR = "./data"
TOKEN = getpass.getpass()

········


Bucket where embedding models live

In [171]:
MODEL_CATALOG_ORG = "dke"
MODEL_CATALOG_PROJECT = "embedder_catalog"

ID of the embedding model to use

In [187]:
MODEL_ID = "https://staging.nexus.ocp.bbp.epfl.ch/v1/resources/dke/embedder_catalog/_/14d61701-c4fa-44ea-8139-0e0ed606b4ec"
MODEL_REVISION = None  # Specify a revision, if necessary. If None, the latest revision is used

Buckets where the input data lives, each mapped to the buckets where the new embedding vectors should be registered.

In [188]:
DATA_BUCKETS = {
    BucketConfiguration("https://bbp.epfl.ch/nexus/v1", "bbp-external", "seu"): [
             BucketConfiguration(
                "https://staging.nexus.ocp.bbp.epfl.ch/v1",
                 "dke","seu-embeddings"),
             BucketConfiguration(
                "https://staging.nexus.ocp.bbp.epfl.ch/v1",
                 "dke", "seu-embeddings-2")
        ]
}

If no embedding endpoint/bucket is specified, we assume that the embeddings should live in the same bucket as the input data.
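
One possible way to apply this fallback is sketched below. It assumes that "not specified" means an empty list or `None` as the value in `DATA_BUCKETS`; the helper name `resolve_embedding_buckets` and the example endpoint are hypothetical.

```python
from collections import namedtuple

BucketConfiguration = namedtuple("BucketConfiguration", "endpoint org proj")


def resolve_embedding_buckets(data_buckets):
    """Map each data bucket to its embedding buckets.

    When no embedding bucket is given (empty list or None), the data
    bucket itself is used as the target.
    """
    return {
        data_bucket: list(emb_buckets) if emb_buckets else [data_bucket]
        for data_bucket, emb_buckets in data_buckets.items()
    }


# Hypothetical configuration, for illustration only.
data = BucketConfiguration("https://example.org/v1", "org", "data")
vecs = BucketConfiguration("https://example.org/v1", "org", "vectors")
resolved = resolve_embedding_buckets({data: []})
print(resolved[data] == [data])
```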

Data type filter for generating embedding vectors

In [189]:
DATA_TYPE_FILTER = "NeuronMorphology"

In [190]:
HARD_RESOURCE_LIMIT = 10000  # Limit on number of resources we can retrieve with SPARQL queries

---

## Create Forge sessions

### Session for embedding models

In [191]:
forge_models = KnowledgeGraphForge(
    "https://raw.githubusercontent.com/BlueBrain/nexus-forge/master/examples/notebooks/use-cases/prod-forge-nexus.yml",
    endpoint=ENDPOINT,
    token=TOKEN,
    bucket=f"{MODEL_CATALOG_ORG}/{MODEL_CATALOG_PROJECT}")

### Sessions for different buckets for data and embedding vectors

In [192]:
# TODO: find a way to pass different tokens and different configs
FORGE_SESSIONS = {}
for data_bucket, emb_buckets in DATA_BUCKETS.items():
    if data_bucket not in FORGE_SESSIONS:
        FORGE_SESSIONS[data_bucket] = create_forge_session(data_bucket)
    for bucket in emb_buckets:
        if bucket not in FORGE_SESSIONS:
            FORGE_SESSIONS[bucket] = create_forge_session(bucket)
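
Regarding the TODO above, one option is to pass a session factory that decides which token and configuration to use for each bucket. The sketch below assumes tokens keyed by endpoint; `build_sessions`, `tokens_by_endpoint`, and the tuple-based stand-in for a session are all hypothetical, so that the example runs without Nexus.

```python
from collections import namedtuple

BucketConfiguration = namedtuple("BucketConfiguration", "endpoint org proj")


def build_sessions(data_buckets, session_factory):
    """Create exactly one session per distinct bucket."""
    sessions = {}
    for data_bucket, emb_buckets in data_buckets.items():
        for bucket in (data_bucket, *emb_buckets):
            if bucket not in sessions:
                sessions[bucket] = session_factory(bucket)
    return sessions


# Hypothetical endpoints and tokens, for illustration only.
tokens_by_endpoint = {
    "https://prod.example.org/v1": "prod-token",
    "https://staging.example.org/v1": "staging-token",
}


def fake_session_factory(bucket):
    # Stand-in for a real session: records the bucket and the token chosen.
    return (f"{bucket.org}/{bucket.proj}", tokens_by_endpoint[bucket.endpoint])


source = BucketConfiguration("https://prod.example.org/v1", "org", "data")
target = BucketConfiguration("https://staging.example.org/v1", "org", "vectors")
sessions = build_sessions({source: [target, target]}, fake_session_factory)
print(len(sessions))
```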

---

## Fetch resources from data buckets

In [193]:
resource_set = {}
for bucket_config in DATA_BUCKETS.keys():
    if bucket_config not in resource_set:
        forge = FORGE_SESSIONS[bucket_config]
        # NOTE: `forge.search` currently does not work when there are too many
        # resources, so a SPARQL query is used instead. The search call would be:
        #
        #     resources = forge.search({"type": DATA_TYPE_FILTER}, limit=None)
        query = f"""
            SELECT ?id
            WHERE {{
                ?id a {DATA_TYPE_FILTER} ;
                    <https://bluebrain.github.io/nexus/vocabulary/deprecated> false .
            }}
        """ 
        resources = forge.sparql(query, limit=HARD_RESOURCE_LIMIT)
        resources = [forge.retrieve(r.id) for r in resources] 

        resource_set[bucket_config] = resources
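
If the number of resources ever approaches `HARD_RESOURCE_LIMIT`, the query could be paged instead of issued once. The sketch below shows the paging loop only; `fetch_page` is a hypothetical callable taking `limit` and `offset`, and whether `forge.sparql` supports an offset in this kgforge version has not been verified here.

```python
def fetch_all(fetch_page, page_size, hard_limit):
    """Collect results page by page, up to `hard_limit` items."""
    results = []
    offset = 0
    while len(results) < hard_limit:
        requested = min(page_size, hard_limit - len(results))
        page = fetch_page(limit=requested, offset=offset)
        results.extend(page)
        offset += len(page)
        if len(page) < requested:
            # A short page means there is nothing left to fetch.
            break
    return results


# In-memory stand-in for the remote store, for illustration only.
fake_store = list(range(25))


def fake_fetch_page(limit, offset):
    return fake_store[offset:offset + limit]


print(len(fetch_all(fake_fetch_page, page_size=10, hard_limit=100)))
```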

In [194]:
for k, v in resource_set.items():
    print("Bucket: ", k)
    print("\t", len(v), "resources")
    print()

Bucket:  BucketConfiguration(endpoint='https://bbp.epfl.ch/nexus/v1', org='bbp-external', proj='seu')
	 298 resources



## Load the embedding model

In [195]:
model_resource = forge_models.retrieve(
    f"{MODEL_ID}{'?rev=' + str(MODEL_REVISION) if MODEL_REVISION is not None else ''}")

# If revision is not provided by the user, fetch the latest
if MODEL_REVISION is None:
    MODEL_REVISION = model_resource._store_metadata._rev 

MODEL_TAG = f"{MODEL_ID.split('/')[-1]}?rev={MODEL_REVISION}"
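
The inline conditional used to build the versioned ID above can be factored into a small helper, which may be easier to read and test. This is a sketch; the name `versioned_id` and the example ID are hypothetical.

```python
def versioned_id(resource_id, revision=None):
    """Append '?rev=<revision>' to a resource ID when a revision is given."""
    if revision is None:
        return resource_id
    return f"{resource_id}?rev={revision}"


# Hypothetical resource ID, for illustration only.
example_model_id = "https://example.org/v1/resources/org/proj/_/some-model"
print(versioned_id(example_model_id))
print(versioned_id(example_model_id, 3))
```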

In [196]:
MODEL_TAG

'14d61701-c4fa-44ea-8139-0e0ed606b4ec?rev=10'

In [197]:
forge_models.download(model_resource, "distribution.contentUrl", DOWNLOAD_DIR, overwrite=True)
pipeline_path = os.path.join(DOWNLOAD_DIR, model_resource.distribution.name)

In [198]:
pipeline = EmbeddingPipeline.load(
    pipeline_path,
    embedder_interface=GraphElementEmbedder,
    embedder_ext="zip")

In [199]:
embedding_table = pipeline.generate_embedding_table()

## Compute embedding vectors for all the resources and push to Nexus

- TODO: add the NeuronMorphology revision once available
- TODO: add prediction of previously unseen points (currently, only the in-sample points are considered)
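
The in-sample lookup used in this section can be sketched without Nexus or the embedding pipeline. In the sketch below a plain dictionary stands in for the embedding table, and the helper name `collect_vectors` is hypothetical; it also reports the missing resources in a single warning rather than one warning per resource.

```python
import warnings


def collect_vectors(resource_ids, table):
    """Look up in-sample vectors and report resources without one.

    `table` is a plain dict standing in for the embedding table
    (resource ID -> vector).
    """
    vectors, missing = {}, []
    for resource_id in resource_ids:
        if resource_id in vectors:
            continue  # skip duplicates
        if resource_id in table:
            vectors[resource_id] = list(table[resource_id])
        else:
            missing.append(resource_id)
    if missing:
        warnings.warn(
            f"{len(missing)} of {len(resource_ids)} resources "
            "have no embedding vector")
    return vectors, missing


# Hypothetical IDs and vectors, for illustration only.
table = {"morph-a": (0.1, 0.2), "morph-b": (0.3, 0.4)}
vectors, missing = collect_vectors(["morph-a", "morph-c", "morph-a"], table)
print(vectors, missing)
```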

In [200]:
for bucket_config, resources in resource_set.items():
    vectors = {}
    for resource in resources:
        if resource.id not in vectors:
            if resource.id in embedding_table.index:
                vectors[resource.id] = embedding_table.loc[
                    resource.id].tolist()[0].tolist()
            else:
                warnings.warn(
                    f"Embedding vector for '{resource.id}' in '{bucket_config}' was not computed")
    for embedding_bucket in DATA_BUCKETS[bucket_config]:
        forge = FORGE_SESSIONS[embedding_bucket]
        print(f"Registering/updating {len(vectors)} vectors for '{embedding_bucket}...'")
        register_embeddings(forge, vectors, MODEL_ID, MODEL_REVISION, MODEL_TAG)

Registering/updating 200 vectors for 'BucketConfiguration(endpoint='https://staging.nexus.ocp.bbp.epfl.ch/v1', org='dke', proj='seu-embeddings')...'
<count> 200
<action> _register_many
<succeeded> True

<count> 200
<action> _tag_many
<succeeded> True
Registering/updating 200 vectors for 'BucketConfiguration(endpoint='https://staging.nexus.ocp.bbp.epfl.ch/v1', org='dke', proj='seu-embeddings-2')...'
<count> 200
<action> _register_many
<succeeded> True

<count> 200
<action> _tag_many
<succeeded> True


Use the following tag to create new Elasticsearch (ES) views on the vectors.

In [201]:
MODEL_TAG

'14d61701-c4fa-44ea-8139-0e0ed606b4ec?rev=10'
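
A view restricted to this tag could be described with a payload along the following lines. This is a sketch under assumptions: the field names follow the Nexus Delta Elasticsearch view format as we understand it (`resourceTag`, `mapping`), the `dense_vector` mapping and the dimension `64` are hypothetical, and the payload has not been tested against a Nexus deployment.

```python
import json


def build_es_view_payload(tag, dims):
    """Build a payload for an ES view that indexes only tagged resources."""
    return {
        "@type": "ElasticSearchView",
        # Only resource revisions carrying this tag are indexed.
        "resourceTag": tag,
        "mapping": {
            "properties": {
                # Assumed mapping: one dense vector per embedding resource.
                "embedding": {"type": "dense_vector", "dims": dims}
            }
        },
    }


payload = build_es_view_payload(
    "14d61701-c4fa-44ea-8139-0e0ed606b4ec?rev=10", dims=64)
print(json.dumps(payload, indent=2))
```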