# Tf-Idf-based lexical embedding pipeline

In this example notebook we illustrate how Tf-Idf encoding based on character n-grams of aliases from the [NCIt](https://ncithesaurus.nci.nih.gov/ncitbrowser/) ontology can be used to construct word embeddings and to perform lexical similarity search with BlueGraph's `EmbeddingPipeline`.

(Credits to [Pierre-Alexandre Fonta](https://github.com/pafonta) for the model)

In [3]:
import os
import random
import time
from collections import defaultdict

import json

import numpy as np
import pandas as pd
import shutil

from joblib import parallel_backend

from sklearn.decomposition import PCA, TruncatedSVD
from scipy.sparse import vstack

from bluegraph.downstream.data_structures import EmbeddingPipeline
from bluegraph.downstream.similarity import SimilarityProcessor
from bluegraph.downstream.data_structures import Preprocessor
from bluegraph.preprocess.utils import TfIdfEncoder

Load the NCIt ontology terms.

In [4]:
path = "../data/NCIT_ontology.json"
if not os.path.isfile(path):
    # Extract into the parent folder; using `path` as extract_dir would
    # create a directory named NCIT_ontology.json instead of the file.
    # This assumes the archive holds NCIT_ontology.json at its top level.
    shutil.unpack_archive(path + ".zip", extract_dir=os.path.dirname(path))

with open(path, "r") as f:
    ontology = json.load(f)
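The loading step can be tried in isolation on a throwaway archive. The sketch below uses only the standard library; the file name `toy_ontology.json`, the temporary folder, and the archive layout (JSON file at the top level of the zip) are assumptions made for illustration, not properties of the real data.

```python
import json
import os
import shutil
import tempfile
import zipfile

# Hypothetical stand-in for the real data folder and ontology file.
data_dir = tempfile.mkdtemp()
path = os.path.join(data_dir, "toy_ontology.json")

# Build a small archive with the JSON file at its top level
# (an assumption about how the real archive is laid out).
with zipfile.ZipFile(path + ".zip", "w") as archive:
    archive.writestr(
        "toy_ontology.json",
        json.dumps({"C1": ["Glucose", ["Glucose", "Dextrose"]]}))

if not os.path.isfile(path):
    # Extract next to the archive. Passing `path` itself as extract_dir
    # would create a directory that carries the name of the JSON file.
    shutil.unpack_archive(path + ".zip", extract_dir=os.path.dirname(path))

with open(path, "r") as f:
    toy_ontology = json.load(f)
```

With this layout `path` ends up as a regular file, so the subsequent `open` call succeeds.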

Get all unique aliases (all lower case)

In [22]:
aliases = list(set(alias.lower() for k, v in ontology.items() for alias in v[1]))
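To make the comprehension above easier to follow, here is a minimal sketch on a toy dictionary. The record layout (identifier mapped to a two-element list whose second element holds the aliases) is inferred from the `v[1]` access and is an assumption; the entries themselves are made up.

```python
# Hypothetical toy ontology: identifier -> [preferred label, list of aliases].
toy_ontology = {
    "C1": ["Glucose", ["Glucose", "Dextrose", "D-Glucose"]],
    "C2": ["Dextrose", ["dextrose", "GLUCOSE"]],
}

# Lower-case every alias and deduplicate. Sorting makes the order
# reproducible, which a plain list(set(...)) does not guarantee across runs.
toy_aliases = sorted(
    {alias.lower() for record in toy_ontology.values() for alias in record[1]})
print(toy_aliases)
```

Five raw aliases collapse to three unique lower-case strings here.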

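Draw a random sample of 100000 aliases and add a few terms of interest to it (so the final set may be slightly larger than the sample).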

In [6]:
N = 100000
terms_to_include = [
    "covid-19 infection",
    "covid-19",
    "glucose",
    "sars-cov-2"
]
random_sample = random.sample(aliases, N)
aliases_to_train = list(set(random_sample + terms_to_include))
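Note that `random.sample` raises a `ValueError` when the requested size exceeds the population size. A minimal sketch of a guarded version, using made-up aliases and a fixed seed (both added here purely for illustration):

```python
import random

random.seed(0)  # fixed seed so the draw is repeatable (illustration only)

# Hypothetical small alias list standing in for the real one.
toy_aliases = ["alias-{}".format(i) for i in range(50)]
must_include = ["glucose", "covid-19"]

# Cap the sample size at the population size to avoid a ValueError.
n = min(100, len(toy_aliases))
sample = random.sample(toy_aliases, n)

# Union with the required terms; the set drops duplicates in case a
# required term was already drawn.
to_train = sorted(set(sample) | set(must_include))
```

The resulting set contains the sampled aliases plus any required terms that were not drawn, which is why the final size is only approximately the requested one.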

Specify Tf-Idf model parameters

In [7]:
params = {
    "analyzer": "char",
    "dtype": np.float32,
    "max_df": 1.0,
    "min_df": 0.0001,
    "ngram_range": (3, 3),
    "max_features": 2048
}
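The parameter names above match the keyword arguments of scikit-learn's `TfidfVectorizer`, so it seems reasonable to assume that `TfIdfEncoder` forwards them to it; this is an assumption, not something the notebook states. The pure-Python sketch below shows what character 3-gram Tf-Idf encoding amounts to, using the smoothed idf formula and L2 normalisation that are scikit-learn's defaults. It ignores `min_df`, `max_df` and `max_features`, and the helper names are hypothetical.

```python
import math
from collections import Counter


def char_ngrams(text, n=3):
    """Return all overlapping character n-grams of a string."""
    return [text[i:i + n] for i in range(len(text) - n + 1)]


def fit_idf(documents, n=3):
    """Smoothed inverse document frequency for every n-gram in the corpus."""
    doc_freq = Counter()
    for doc in documents:
        doc_freq.update(set(char_ngrams(doc, n)))
    total = len(documents)
    return {g: math.log((1 + total) / (1 + df)) + 1
            for g, df in doc_freq.items()}


def encode(text, idf, n=3):
    """Tf-Idf weights of a string, scaled to unit Euclidean length."""
    counts = Counter(g for g in char_ngrams(text, n) if g in idf)
    weights = {g: c * idf[g] for g, c in counts.items()}
    norm = math.sqrt(sum(w * w for w in weights.values()))
    return {g: w / norm for g, w in weights.items()} if norm else {}


corpus = ["glucose", "deoxyglucose", "covid-19"]
idf = fit_idf(corpus)
vec = encode("glucose", idf)       # sparse vector as {n-gram: weight}
unseen = encode("xyz", idf)        # no known n-grams -> empty vector
```

Strings that share many trigrams (such as "glucose" and "deoxyglucose") get overlapping non-zero entries, which is what makes the resulting vectors usable for lexical similarity search.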

Create an instance of `EmbeddingPipeline` using:

- `TfIdfEncoder` as the preprocessor,
- no embedder,
- BlueGraph's `SimilarityProcessor` with Euclidean distance, based on an index segmented into 100 Voronoi cells (see the [Faiss wiki](https://github.com/facebookresearch/faiss/wiki/Faster-search) for details).

In [24]:
pipeline = EmbeddingPipeline(
    preprocessor=TfIdfEncoder(params),
    embedder=None,
    similarity_processor=SimilarityProcessor(
        similarity="euclidean", dimension=2048, n_segments=100))
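The general idea behind an index segmented into Voronoi cells can be sketched in a few lines: every vector is stored in the cell of its nearest centroid, and a query scans only the cell closest to it instead of the whole collection. The toy below is not BlueGraph's or Faiss's implementation; the 2-D points and the hand-picked centroids are hypothetical (a real index would typically learn the centroids from the data, for example with k-means).

```python
import math


def euclidean(a, b):
    """Euclidean distance between two equally sized vectors."""
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def nearest(point, candidates):
    """Position of the candidate closest to `point`."""
    return min(range(len(candidates)),
               key=lambda i: euclidean(point, candidates[i]))


# Hypothetical data: two hand-picked centroids and four indexed points.
centroids = [(0.0, 0.0), (10.0, 10.0)]
points = {"a": (0.5, 0.2), "b": (1.0, 1.0), "c": (9.0, 9.5), "d": (10.5, 10.0)}

# Build step: put every point into the cell of its nearest centroid.
cells = {i: [] for i in range(len(centroids))}
for name, vec in points.items():
    cells[nearest(vec, centroids)].append(name)


def search(query, k=1):
    """Scan only the cell whose centroid is closest to the query."""
    cell = cells[nearest(query, centroids)]
    ranked = sorted(cell, key=lambda name: euclidean(query, points[name]))
    return ranked[:k]


result = search((9.2, 9.4), k=2)
```

The trade-off is speed against recall: a true nearest neighbour that sits just across a cell boundary is missed unless more than one cell is scanned.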

Fit the pipeline on the selected subset of aliases.

In [25]:
pipeline.run_fitting(aliases_to_train, index=aliases_to_train)



Retrieve embedding vectors for the terms of interest.

In [29]:
terms = ["glucose", "covid-19 infection", "lalala not in index"]

In [30]:
vectors = pipeline.retrieve_embeddings(terms)

In [44]:
print("Vector sizes: ")
for i, v in enumerate(vectors):
    print("\t'{}': {}".format(terms[i], len(v) if v is not None else None))

Vector sizes: 
	'glucose': 2048
	'covid-19 infection': 2048
	'lalala not in index': None


In [46]:
points, distances = pipeline.get_similar_points(
    existing_indices=terms,
    k=5)

In [51]:
print("Similar points: ")
for i, p in enumerate(points):
    print("\t'{}': {}".format(terms[i], list(p) if p is not None else None))

Similar points: 
	'glucose': ['u-13c-glucose', 'glucose', '2h glucose', 'deoxyglucose', '2 hr glucose']
	'covid-19 infection': ['covid-19 infection', 'hpv infection', 'hpv16 infection', 'hpv-16 infection', 'gum infection']
	'lalala not in index': None


Save the pipeline.

In [11]:
pipeline.save("../data/Cord-19-NCIT-linking", compress=True)

Predict vectors for potentially unseen points

In [52]:
vectors = pipeline.run_prediction(
    ["hello", "darkness", "my old", "friend", "glucose"])

Get similar points for these vectors

In [54]:
pipeline.get_similar_points(vectors=vectors)

([Index(['shellfish', 'shell', 'helg', 'hel113', 'hel-s-153w', 'hel-s-43',
         'hel-s-28', 'helsnf1', 'hel-s-61p', 'hel-9'],
        dtype='object'),
  Index(['dar', 'edar', 'witness', 'sourness', 'shakiness', 'nes1', 'darc',
         'helplessness', 'avodart', 'dart'],
        dtype='object'),
  Index(['dj400n23', 'tcf3b', 'i(17q)', 'phq0311', 'nr1c1', 'mrd7', 'nf1',
         'au-011', 'hba1a', 'dxs239'],
        dtype='object'),
  Index(['girlfriend', 'wd5-making new friends', 'wd7-making new friends',
         'brief-p', 'voxorien', 'votrient', 'nutrifriend cachexia',
         'friend of gata 2', 'nr1c1', 'tcf3b'],
        dtype='object'),
  Index(['u-13c-glucose', 'glucose', '2h glucose', 'deoxyglucose',
         '2 hr glucose', 'glucobay', 'glucosuria', 'glucclr', 'eagluc', 'gluc'],
        dtype='object')],
 [array([0.4691298 , 0.4691298 , 0.76471364, 0.76471364, 0.76471364,
         0.76471364, 0.76471364, 0.76471364, 0.76471364, 0.76471364],
        dtype=float32),
  array

Predict vectors for potentially unseen points and add them to the index

In [14]:
vectors = pipeline.run_prediction(
    ["hello", "darkness", "my old", "friend", "glucose"],
    add_to_index=True,
    data_indices=["hello", "darkness", "my old", "friend", "glucose"])

