# TfIdf-based lexical embedding pipeline

In this example notebook we illustrate how Tf-Idf encoding based on character n-grams of aliases from the [NCIt](https://ncithesaurus.nci.nih.gov/ncitbrowser/) ontology can be used to construct embeddings of words and to perform lexical similarity search using BlueGraph's `EmbeddingPipeline`.

In [45]:
import math
import os
import random
import time
from collections import defaultdict

import json

import rdflib
from rdflib import RDFS, XSD

import numpy as np
import pandas as pd
import zipfile

from joblib import parallel_backend

from sklearn.decomposition import PCA, TruncatedSVD
from scipy.sparse import vstack

from bluegraph import PandasPGFrame
from bluegraph.core.utils import Preprocessor
from bluegraph.downstream import EmbeddingPipeline
from bluegraph.downstream.similarity import SimilarityProcessor
from bluegraph.preprocess.utils import TfIdfEncoder

## Helpers

In [39]:
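def find_uri_by_label(graph, label, lang="en", dtype=XSD.string):
    """Return the URI of the first resource whose rdfs:label matches `label`, or None."""
    # Try a plain literal first, then a language-tagged one, then a typed one.
    # An RDF literal cannot carry both a language tag and a datatype,
    # so that combination is not attempted.
    params = [
        {},
        {"lang": lang},
        {"datatype": dtype},
    ]
    for param_set in params:
        for s in graph.subjects(RDFS.label, rdflib.Literal(label, **param_set)):
            return str(s)
    return None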

Load the ontologies into a single RDF graph

In [4]:
ontology_graph = rdflib.Graph()
ontology_graph.parse("../../ontologies/bbp/brain-modeling-ontology.ttl", format="ttl")
ontology_graph.parse("../../ontologies/bbp/molecular-systems-ontology.ttl", format="ttl")
ontology_graph.parse("../../ontologies/bbp/etypes.ttl", format="ttl")
ontology_graph.parse("../../ontologies/bbp/mtypes.ttl", format="ttl")

<Graph identifier=N2f60d9c3c2fa40d598bc1025921a67ef (<class 'rdflib.graph.Graph'>)>

In [10]:
frame = PandasPGFrame.from_ontology(rdf_graph=ontology_graph, remove_prop_uris=True)

In [20]:
ALIAS_PROPS = ["label", "prefLabel", "synonym", "altLabel"]

Get all unique aliases (all lower case)

In [48]:
alias_mapping = {}
for node in frame.nodes():
    record = frame._nodes.loc[node].to_dict()
    # Look up the URI once per node rather than once per alias
    uri = find_uri_by_label(ontology_graph, node)
    for prop in ALIAS_PROPS:
        value = record.get(prop)
        # Missing values are stored as float NaN; skip them
        if value is None or isinstance(value, float):
            continue
        # A property holds either a single string or a collection of strings
        values = [value] if isinstance(value, str) else value
        for el in values:
            alias_mapping[el.lower()] = uri

In [50]:
aliases = list(alias_mapping.keys())

In [51]:
len(aliases)

570

Specify Tf-Idf model parameters

In [55]:
params = {
    "analyzer": "char",
    "dtype": np.float32,
    "max_df": 1.0,
    "min_df": 0.0001,
    "ngram_range": (3, 3),
    "max_features": 1024
}
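
The parameter names above match those of scikit-learn's `TfidfVectorizer`; whether `TfIdfEncoder` forwards them unchanged is an assumption here. To make the encoding concrete, the following is a minimal plain-Python sketch of Tf-Idf over character trigrams, using the smoothed idf formula that scikit-learn applies by default. The helper names `char_ngrams` and `tfidf_vectors` and the toy aliases are illustrative and not part of BlueGraph.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a lower-cased string."""
    text = text.lower()
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def tfidf_vectors(docs, n=3):
    """Encode each string as an L2-normalised Tf-Idf vector over its n-grams."""
    counts = [Counter(char_ngrams(d, n)) for d in docs]
    vocab = sorted({g for c in counts for g in c})
    n_docs = len(docs)
    # Document frequency: iterating a Counter yields each distinct n-gram once
    df = Counter(g for c in counts for g in c)
    # Smoothed idf: ln((1 + N) / (1 + df)) + 1
    idf = {g: math.log((1 + n_docs) / (1 + df[g])) + 1 for g in vocab}
    vectors = []
    for c in counts:
        row = [c[g] * idf[g] for g in vocab]
        norm = math.sqrt(sum(x * x for x in row)) or 1.0
        vectors.append([x / norm for x in row])
    return vocab, vectors


# Toy aliases (illustrative only)
toy_aliases = ["l5_lbc", "l6_lbc", "layer 5 bipolar cell"]
vocab, vectors = tfidf_vectors(toy_aliases)
print(char_ngrams("l5_lbc"))
print(len(vocab), "distinct trigrams")
```

In the notebook, `max_features` caps the vocabulary at 1024 trigrams, which is why every embedding has 1024 components; the sketch keeps every trigram it sees.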

Create an instance of `EmbeddingPipeline` using:

- `TfIdfEncoder` as a preprocessor,
- no embedder,
- BlueGraph's `SimilarityProcessor` with Euclidean distance, based on an index segmented into 200 Voronoi cells (more details can be found [here](https://github.com/facebookresearch/faiss/wiki/Faster-search)).
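
Segmenting an index into Voronoi cells means that every stored vector is assigned to its nearest centroid, and a query is compared only against the vectors in the cell or cells closest to it. The sketch below illustrates that idea in plain Python; it is not BlueGraph's or Faiss's implementation. The names `build_cells`, `search` and `n_probe` are illustrative, and the centroids are sampled at random for brevity, whereas Faiss trains them with k-means.

```python
import math
import random


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def build_cells(points, n_cells, seed=0):
    """Pick centroids and assign every point index to its nearest centroid."""
    rng = random.Random(seed)
    centroids = rng.sample(points, n_cells)
    cells = [[] for _ in range(n_cells)]
    for idx, p in enumerate(points):
        nearest = min(range(n_cells), key=lambda c: euclidean(p, centroids[c]))
        cells[nearest].append(idx)
    return centroids, cells


def search(query, points, centroids, cells, k=3, n_probe=1):
    """Return indices of the k nearest points found in the n_probe closest cells."""
    order = sorted(range(len(centroids)), key=lambda c: euclidean(query, centroids[c]))
    candidates = [i for c in order[:n_probe] for i in cells[c]]
    candidates.sort(key=lambda i: euclidean(query, points[i]))
    return candidates[:k]


# Made-up 2D data for illustration
data_rng = random.Random(1)
demo_points = [(data_rng.random(), data_rng.random()) for _ in range(60)]
demo_centroids, demo_cells = build_cells(demo_points, n_cells=5)
demo_query = (0.5, 0.5)

# Probing a single cell is fast but may miss true neighbours near a cell border
approx = search(demo_query, demo_points, demo_centroids, demo_cells, n_probe=1)
# Probing every cell is equivalent to an exhaustive search
exact = search(demo_query, demo_points, demo_centroids, demo_cells, n_probe=5)
print(approx, exact)
```

The trade-off is between speed and recall: fewer probed cells means fewer distance computations, at the risk of missing neighbours that fall just across a cell boundary.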

In [56]:
pipeline = EmbeddingPipeline(
    preprocessor=TfIdfEncoder(params),
    embedder=None,
    similarity_processor=SimilarityProcessor(
        similarity="euclidean", dimension=1024, n_segments=200))

Run fitting of the pipeline on the aliases.

In [57]:
pipeline.run_fitting(aliases, index=aliases)



Save the pipeline.

In [60]:
pipeline.save("../data/BMO-linking", compress=True)

In [61]:
embedding_table = pipeline.generate_embedding_table()

In [62]:
embedding_table.sample(5)

| @id | embedding |
|---|---|
| model brain parameter | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| layer 3 tufted pyramidal cell | [0.0, 0.0, 0.0, 0.28955737, 0.0, 0.0, 0.0, 0.0... |
| l5_lbc | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |
| layer 2/3 bitufted cell | [0.0, 0.0, 0.24943915, 0.0, 0.0, 0.0, 0.0, 0.0... |
| l6_tpc:a | [0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, ... |


Retrieve embedding vectors for the terms of interest.

In [92]:
terms = [
    "l5_lbc",
    "layer 5 bipolar cell",
    "burst non-accommodating electrical type",
    "lalala not in index",
    "emodel building workflow"
]

In [93]:
vectors = pipeline.retrieve_embeddings(terms)

In [94]:
print("Vector sizes: ")
for i, v in enumerate(vectors):
    print("\t'{}': {}".format(terms[i], len(v) if v is not None else None))

Vector sizes: 
	'l5_lbc': 1024
	'layer 5 bipolar cell': 1024
	'burst non-accommodating electrical type': 1024
	'lalala not in index': None
	'emodel building workflow': 1024
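
As the output above shows, a term that is not in the index comes back as `None` rather than raising an error, so downstream code needs to handle that case. A minimal sketch of separating known from unknown terms follows; the helper name `split_known_unknown` is hypothetical, and the stand-in vectors are made up so that the snippet runs without a fitted pipeline.

```python
def split_known_unknown(terms, vectors):
    """Pair each term with its vector and set aside terms without one."""
    known = {t: v for t, v in zip(terms, vectors) if v is not None}
    unknown = [t for t, v in zip(terms, vectors) if v is None]
    return known, unknown


# Stand-in data mimicking the shape of the result above (values are made up)
demo_terms = ["l5_lbc", "lalala not in index"]
demo_vectors = [[0.0, 0.5, 0.5], None]

known, unknown = split_known_unknown(demo_terms, demo_vectors)
print("indexed:", list(known))
print("missing:", unknown)
```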


Get the points most similar to the query terms

In [101]:
points, distances = pipeline.get_similar_points(
    existing_indices=terms, k=3)

In [102]:
for i, el in enumerate(terms):
    print(f"Similar terms to '{el}': ")
    if points[i] is not None:
        for p in points[i]:
            print(f"\t- {p} (ontology term {alias_mapping[p]})")
    else:
        print(f"\t {el} is not in index")
    print()

Similar terms to 'l5_lbc': 
	- l5_lbc (ontology term http://uri.interlex.org/base/ilx_0383225)
	- l6_lbc (ontology term http://uri.interlex.org/base/ilx_0383233)
	- l4_lbc (ontology term http://uri.interlex.org/base/ilx_0383213)

Similar terms to 'layer 5 bipolar cell': 
	- layer 5 bipolar cell (ontology term http://uri.interlex.org/base/ilx_0383221)
	- layer 6 bipolar cell (ontology term http://bbp.epfl.ch/neurosciencegraph/ontologies/mtypes/GaBeBiJcTQqLpE8LFIPQlw)
	- layer 4 bipolar cell (ontology term http://uri.interlex.org/base/ilx_0383209)

Similar terms to 'burst non-accommodating electrical type': 
	- burst non-accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738203)
	- burst accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738199)
	- delayed non-accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738205)

Similar terms to 'lalala not in index': 
	 lalala not in index is not in index

Similar
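
The call to `get_similar_points` also returns `distances`, which the loop above does not print. Assuming `distances[i]` is aligned element by element with `points[i]` (an assumption about the return layout, not something the output above confirms), the two can be shown side by side. The helper `format_neighbours` is hypothetical and the numbers below are made up.

```python
def format_neighbours(term, neighbours, dists):
    """Return printable lines pairing each neighbour with its distance."""
    if neighbours is None:
        return [f"'{term}' is not in index"]
    return [f"{p} (distance {d:.3f})" for p, d in zip(neighbours, dists)]


# Stand-in results (made-up distances, for illustration only)
demo_points = [["l5_lbc", "l6_lbc", "l4_lbc"], None]
demo_distances = [[0.0, 0.41, 0.43], None]
demo_terms = ["l5_lbc", "lalala not in index"]

for term, pts, dists in zip(demo_terms, demo_points, demo_distances):
    for line in format_neighbours(term, pts, dists):
        print(line)
```

When the query is itself an indexed term, it is returned as its own nearest neighbour, as in the output above.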

Predict vectors for potentially unseen points

In [97]:
terms_to_predict = [
    "bipolar cell",
    "burst non-accommodating neuron",
    "mariotti cell",
    "e-model reconstruction workflow"]

In [103]:
vectors = pipeline.run_prediction(terms_to_predict)
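
Predicting a vector for an unseen term reuses the vocabulary and idf weights learned during fitting: trigrams that never occurred in the fitted aliases simply contribute nothing. The plain-Python sketch below shows why a misspelling such as "mariotti cell" can still land near "martinotti cell". It assumes scikit-learn-style smoothed idf and L2 normalisation; the function names and the three-alias index are illustrative, not BlueGraph's API.

```python
import math
from collections import Counter


def trigrams(text):
    text = text.lower()
    return [text[i:i + 3] for i in range(len(text) - 2)]


def fit(aliases):
    """Learn the trigram vocabulary and smoothed idf weights."""
    counts = [Counter(trigrams(a)) for a in aliases]
    vocab = sorted({g for c in counts for g in c})
    df = Counter(g for c in counts for g in c)
    idf = {g: math.log((1 + len(aliases)) / (1 + df[g])) + 1 for g in vocab}
    return vocab, idf


def transform(text, vocab, idf):
    """Encode text with a fixed vocabulary; unseen trigrams are ignored."""
    counts = Counter(trigrams(text))
    row = [counts[g] * idf[g] for g in vocab]
    norm = math.sqrt(sum(x * x for x in row)) or 1.0
    return [x / norm for x in row]


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# A tiny stand-in index (illustrative only)
known = ["layer 6 martinotti cell", "layer 5 bipolar cell", "l5_lbc"]
vocab, idf = fit(known)
index = {a: transform(a, vocab, idf) for a in known}

# "mariotti cell" is not in the index, but shares many trigrams with one entry
query = transform("mariotti cell", vocab, idf)
nearest = min(known, key=lambda a: euclidean(query, index[a]))
print(nearest)
```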

Get similar points for these vectors

In [104]:
points, distances = pipeline.get_similar_points(vectors=vectors, k=3)

In [105]:
for i, el in enumerate(terms_to_predict):
    print(f"Similar terms to '{el}': ")
    if points[i] is not None:
        for p in points[i]:
            print(f"\t- {p} (ontology term {alias_mapping[p]})")
    else:
        print(f"\t {el} is not in index")
    print()

Similar terms to 'bipolar cell': 
	- layer 6 bipolar cell (ontology term http://bbp.epfl.ch/neurosciencegraph/ontologies/mtypes/GaBeBiJcTQqLpE8LFIPQlw)
	- layer 5 bipolar cell (ontology term http://uri.interlex.org/base/ilx_0383221)
	- layer 4 bipolar cell (ontology term http://uri.interlex.org/base/ilx_0383209)

Similar terms to 'burst non-accommodating neuron': 
	- burst non-accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738203)
	- delayed non-accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738205)
	- continuous non-accommodating electrical type (ontology term http://uri.interlex.org/base/ilx_0738201)

Similar terms to 'mariotti cell': 
	- layer 6 martinotti cell (ontology term http://uri.interlex.org/base/ilx_0381374)
	- layer 2 martinotti cell (ontology term http://bbp.epfl.ch/neurosciencegraph/ontologies/mtypes/6LZqO1y_RCyflICMXG09iA)
	- layer 5 martinotti cell (ontology term http://uri.interlex.org/base/ilx_0381369)

