# Ingest Website to Graph DB

## **Part 5** - Ontology Refinement using Vector Search

1. Add Vectorised Indices to GraphDB (Neo4j).

2. Use Semantic Proximity Searches to identify Relationships and Nodes which are candidates for:

   1. Amalgamation - Collapsing/Simplification

   2. Elimination - Not Relevant to Desired KM Use Case
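
As a rough illustration of step 2, the sketch below flags pairs of node names whose embeddings sit close together under cosine similarity, which is one way to shortlist amalgamation candidates. The toy vectors, the `0.95` threshold and the helper names are assumptions made for this example; they are not values taken from this notebook.

```python
import math
from itertools import combinations


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def amalgamation_candidates(embeddings, threshold=0.95):
    """Return (name_a, name_b, score) for every pair at or above the threshold."""
    pairs = []
    for (name_a, vec_a), (name_b, vec_b) in combinations(embeddings.items(), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score >= threshold:
            pairs.append((name_a, name_b, score))
    # Highest-scoring pairs first
    return sorted(pairs, key=lambda pair: pair[2], reverse=True)


# Hypothetical toy embeddings; real ones would come from the embedding model.
toy_embeddings = {
    "Responsible Ai": [1.0, 0.0, 0.0],
    "Responsible AI": [0.99, 0.05, 0.0],
    "Cloud Billing": [0.0, 1.0, 0.0],
}
candidates = amalgamation_candidates(toy_embeddings)
print(candidates)
```

The same score could support elimination: a node whose similarity to every term of the desired use case stays low is a candidate for removal.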

This is a GCP reworking of the OpenAI example in:

https://python.langchain.com/v0.1/docs/integrations/vectorstores/neo4jvector/

... and a langchain-vertexai reworking of:

https://python.langchain.com/v0.1/docs/integrations/text_embedding/google_generative_ai/

This is because we currently cannot get access to Gemini API keys at BJSS.

For the avoidance of doubt, this notebook is **independent of langchain-google-genai**.

There are no copied class files in this notebook. 

The Resultant Graph can be used within Node and Relationship Filters on the Full Corpus of Interest - in this case **Generative AI**. 

##### **Minimal install for Vertex AI**

This solved the instability problem by *NOT* installing OpenAI classes via the community install. 

In [None]:
%pip install -U langchain langchain-google-vertexai neo4j langchain-community

**Check Version Nos of what was installed**

In [None]:
!pip show langchain langchain-core langchain-google-vertexai langchain-experimental langchain-community neo4j google-cloud-aiplatform
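
The same version check can be done from Python, which makes the result easy to log or compare later. This is a minimal sketch using the standard library's `importlib.metadata`; the helper name `installed_versions` is made up for this example.

```python
from importlib.metadata import PackageNotFoundError, version


def installed_versions(packages):
    """Map each distribution name to its installed version, or None if absent."""
    found = {}
    for name in packages:
        try:
            found[name] = version(name)
        except PackageNotFoundError:
            found[name] = None
    return found


versions = installed_versions(
    ["langchain", "langchain-google-vertexai", "langchain-community", "neo4j"]
)
for name, ver in versions.items():
    print(f"{name}: {ver or 'not installed'}")
```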

**Check Jupyter Version No**

In [None]:
!jupyter --version

**Check Python Version/Path** - *Expect 3.10.14*

In [29]:
import sys
import platform
print(sys.version)
print(platform.python_version())
print(sys.path)

3.10.14 | packaged by conda-forge | (main, Mar 20 2024, 12:45:18) [GCC 12.3.0]
3.10.14
['/opt/conda/lib/python310.zip', '/opt/conda/lib/python3.10', '/opt/conda/lib/python3.10/lib-dynload', '', '/opt/conda/lib/python3.10/site-packages']


**Now for the Imports**

This time we are isolating Vertex AI.

In [30]:
import os
from langchain.globals import set_debug
from langchain_google_vertexai import VertexAIEmbeddings
from langchain_community.graphs import Neo4jGraph
from langchain_community.vectorstores import Neo4jVector

##### **Connect to Google LLMs**

This API KEY approach works with **langchain-google-vertexai** but **_not_** with **langchain-google-genai**

*Least Privilege Security.*

The Notebook is "owned" by a bespoke Service Account created in Terraform for this purpose.

Minimal permissions are added (also via Terraform) via predefined roles (esp. Vertex) as required.

Adding a role is typically triggered by a PERMISSION DENIED error.

In [31]:
# Set It - will require regeneration
os.environ['GOOGLE_API_KEY'] = '9cc21da15de90f21bfaabba2e030c440ee5fd6d3'
# Access the environment variable later in your code
env_api_key = os.environ['GOOGLE_API_KEY']
print(f"env_api_key: {env_api_key}")
PROJECT_ID = "nlp-dev-6aae"
test_embedding = "hello, world!"
search_string = "Responsible Ai"

env_api_key: 9cc21da15de90f21bfaabba2e030c440ee5fd6d3
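
Because the cell above writes the key into the notebook and its output, a gentler variant is to read the key from the environment and print only a masked form. This is a sketch under the assumption that `GOOGLE_API_KEY` was exported before the kernel started; `mask_secret` is a hypothetical helper, not part of any library used here.

```python
import os


def mask_secret(value, visible=4):
    """Hide all but the last `visible` characters of a secret."""
    if len(value) <= visible:
        return "*" * len(value)
    return "*" * (len(value) - visible) + value[-visible:]


# Assumption: the key is supplied by the environment, not hard-coded in the cell.
env_api_key = os.environ.get("GOOGLE_API_KEY", "")
if env_api_key:
    print(f"env_api_key: {mask_secret(env_api_key)}")
else:
    print("GOOGLE_API_KEY is not set")
```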


**Enable Langchain Debugging**

See: https://python.langchain.com/v0.1/docs/guides/development/debugging/

In [32]:
# Currently enabled; set to False to disable
set_debug(True)

##### **Create The Embeddings**

This proves basic connectivity & functionality of the GCP Embedding Model for GenAI

Sourced from here: https://python.langchain.com/v0.1/docs/integrations/platforms/google/

In [33]:
embeddingGCP = VertexAIEmbeddings(
    model_name="textembedding-gecko@latest", project=PROJECT_ID
)

query_result = embeddingGCP.embed_query(test_embedding)

print(f"test_embedding: {test_embedding}")
print(f"query_result: {query_result}")

test_embedding: hello, world!
query_result: [0.052017416805028915, -0.030953068286180496, -0.030846256762742996, -0.028158482164144516, 0.01781940646469593, -0.0019130000146105886, 0.028597984462976456, -0.007565246894955635, 0.010808120481669903, -0.0057900105603039265, 0.03907504677772522, 0.05087621137499809, -0.00807026494294405, -0.06057383120059967, -0.006879169028252363, -0.02224457450211048, 0.013218574225902557, -0.008559225127100945, -0.000701079610735178, -0.0029124850407242775, -0.003639709437265992, 0.009413229301571846, -0.02782364934682846, -0.030522421002388, 0.021218476817011833, 0.011880539357662201, -0.0013187489239498973, -0.07345182448625565, 0.012441609054803848, 0.05887635052204132, -0.03551314026117325, 0.017118927091360092, -0.05440368875861168, 0.006286651361733675, 0.03878151252865791, -0.05733191594481468, 0.03970646485686302, 0.009752064943313599, -0.0015157802263274789, -0.0001953284372575581, 0.02433612570166588, -0.09208427369594574, -0.04463260993361473
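
Printing the full vector is rarely useful. A small summary (length, norm, first few values) is usually enough to confirm that the embedding call worked. The helper below is a hypothetical sketch and is exercised with a toy vector rather than real model output.

```python
import math


def summarise_embedding(vector, preview=3):
    """Return the dimension, L2 norm and first few values of an embedding."""
    return {
        "dimension": len(vector),
        "l2_norm": math.sqrt(sum(x * x for x in vector)),
        "preview": list(vector[:preview]),
    }


# Toy vector for illustration only.
summary = summarise_embedding([3.0, 4.0, 0.0, 0.0])
print(summary)
```

In the notebook this could be called as `summarise_embedding(query_result)`.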

#### **Neo4j Connectivity**

Requires signing up for the free version.

The DB is stopped if it has not been used recently and must be resumed first, otherwise the connection fails.

In [34]:
os.environ["NEO4J_URI"] = "neo4j+s://a657168d.databases.neo4j.io"
os.environ["NEO4J_USERNAME"] = "neo4j"
os.environ["NEO4J_PASSWORD"] = "VM3A9Mz6usNT99nLs_lqQssfVK8JxeD81DnEiXlDkZU"

graph = Neo4jGraph()

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,065  [#0000]  _: <POOL> created, routing address IPv4Address(('a657168d.databases.neo4j.io', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,067  [#0000]  _: <WORKSPACE> resolve home database
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,068  [#0000]  _: <POOL> attempting to update routing table from IPv4Address(('a657168d.databases.neo4j.io', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,069  [#0000]  _: <RESOLVE> in: a657168d.databases.neo4j.io:7687
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,078  [#0000]  _: <RESOLVE> dns resolver out: 34.78.243.29:7687
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,079  [#0000]  _: <POOL> _acquire router connection, database=None, address=ResolvedIPv4Address(('34.78.243.29', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:33:57,080  [#0000]  _: <POOL> trying to hand out new connection
[DEBUG   ] MainTh
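
Since a paused instance makes later cells fail, it can help to test connectivity up front. The sketch below wraps the driver's `verify_connectivity()` call in a helper that returns a boolean. The helper name and the stub class are assumptions for illustration; the stub only stands in for a real driver so the example runs without a database.

```python
def database_is_reachable(driver):
    """Return True if the driver can reach the database, else False."""
    try:
        driver.verify_connectivity()
    except Exception as exc:  # a real driver raises e.g. ServiceUnavailable
        print(f"Neo4j not reachable (is the instance paused?): {exc}")
        return False
    return True


class StubDriver:
    """Stand-in for neo4j.Driver, used only to exercise the helper."""

    def __init__(self, fail):
        self.fail = fail

    def verify_connectivity(self):
        if self.fail:
            raise ConnectionError("simulated outage")


up = database_is_reachable(StubDriver(fail=False))
down = database_is_reachable(StubDriver(fail=True))
```

With the real driver, the object passed in would come from `neo4j.GraphDatabase.driver(uri, auth=(username, password))`, using the same environment variables set above.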

#### Working with vectorstore

Above, we created a vectorstore from scratch. However, often times we want to work with an existing vectorstore. In order to do that, we can initialize it directly.

Extract from: https://python.langchain.com/v0.1/docs/integrations/vectorstores/neo4jvector/

Apparently vectorised GraphDB indices and vector stores are synonymous?

In [35]:
existing_graph = Neo4jVector.from_existing_graph(
    embedding=embeddingGCP,
    url=os.environ["NEO4J_URI"],
    username=os.environ["NEO4J_USERNAME"],
    password=os.environ["NEO4J_PASSWORD"],
    index_name="topic_index",
    node_label="Topic",
    text_node_properties=["id"],
    embedding_node_property="embedding",
)

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,165  [#0000]  _: <POOL> created, routing address IPv4Address(('a657168d.databases.neo4j.io', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,167  [#0000]  _: <WORKSPACE> resolve home database
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,168  [#0000]  _: <POOL> attempting to update routing table from IPv4Address(('a657168d.databases.neo4j.io', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,169  [#0000]  _: <RESOLVE> in: a657168d.databases.neo4j.io:7687
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,171  [#0000]  _: <RESOLVE> dns resolver out: 34.78.243.29:7687
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,172  [#0000]  _: <POOL> _acquire router connection, database=None, address=ResolvedIPv4Address(('34.78.243.29', 7687))
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:09,173  [#0000]  _: <POOL> trying to hand out new connection
[DEBUG   ] MainTh

**Verify DB supports Vectors**

In [36]:
existing_graph.verify_version()

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,530  [#0000]  _: <POOL> acquire routing connection, access_mode='WRITE', database='neo4j'
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,532  [#0000]  _: <POOL> routing aged?, database=None
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,534  [#0000]  _: <ROUTING> purge check: last_updated_time=7926.814072352, ttl=0, perf_time=7941.179299322 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,535  [#0000]  _: <POOL> routing aged?, database=neo4j
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,535  [#0000]  _: <ROUTING> purge check: last_updated_time=7926.939060963, ttl=10, perf_time=7941.181113087 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:23,537  [#0000]  _: <ROUTING> checking table freshness (readonly=False): table expired=True, has_server_for_mode=True, table routers={IPv4Address(('a657168d.databases.neo4j.io', 7687))} => False
[DEBUG   ] MainThread(1405

**Verify Index Created**

In [37]:
existing_index = existing_graph.retrieve_existing_index() 
print(f"existing_index: {existing_index}")

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,116  [#0000]  _: <POOL> acquire routing connection, access_mode='WRITE', database='neo4j'
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,118  [#0000]  _: <POOL> routing aged?, database=None
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,118  [#0000]  _: <ROUTING> purge check: last_updated_time=7926.814072352, ttl=0, perf_time=7946.764064505 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,120  [#0000]  _: <POOL> routing aged?, database=neo4j
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,121  [#0000]  _: <ROUTING> purge check: last_updated_time=7941.206268381, ttl=10, perf_time=7946.76680766 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:29,122  [#0000]  _: <ROUTING> checking table freshness (readonly=False): table expired=False, has_server_for_mode=True, table routers={IPv4Address(('a657168d.databases.neo4j.io', 7687))} => True
[DEBUG   ] MainThread(14055

**Perform a Search** using the Vector Search Index.

In [38]:
result = existing_graph.similarity_search(search_string)
print(f"query: {search_string}")
print(f"result: {result}")

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,708  [#0000]  _: <POOL> acquire routing connection, access_mode='WRITE', database='neo4j'
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,709  [#0000]  _: <POOL> routing aged?, database=None
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,710  [#0000]  _: <ROUTING> purge check: last_updated_time=7926.814072352, ttl=0, perf_time=7953.355660708 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,711  [#0000]  _: <POOL> routing aged?, database=neo4j
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,712  [#0000]  _: <ROUTING> purge check: last_updated_time=7941.206268381, ttl=10, perf_time=7953.357204855 => False
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:34:35,712  [#0000]  _: <ROUTING> checking table freshness (readonly=False): table expired=True, has_server_for_mode=True, table routers={IPv4Address(('a657168d.databases.neo4j.io', 7687))} => False
[DEBUG   ] MainThread(1405

In [39]:
docs_with_score = existing_graph.similarity_search_with_score(search_string)
print(f"query: {search_string}")
print(f"result: {docs_with_score}")

[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,098  [#0000]  _: <POOL> acquire routing connection, access_mode='WRITE', database='neo4j'
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,099  [#0000]  _: <POOL> routing aged?, database=None
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,100  [#0000]  _: <ROUTING> purge check: last_updated_time=7926.814072352, ttl=0, perf_time=8007.745794124 => True
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,101  [#0000]  _: <POOL> dropping routing table for database=None
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,102  [#0000]  _: <POOL> routing aged?, database=neo4j
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,102  [#0000]  _: <ROUTING> purge check: last_updated_time=7953.392476813, ttl=10, perf_time=8007.747877401 => True
[DEBUG   ] MainThread(140551127127872) 2024-05-20 15:35:30,103  [#0000]  _: <POOL> dropping routing table for database=neo4j
[DEBUG   ] MainThread(14055112

#### LangChain Docs, Debug, Diagnostics

The Neo4j wrapper classes are stored in:
1. `libs\community\langchain_community\vectorstores\neo4j_vector.py` - Neo4jVector class
2. `libs\core\langchain_core\vectorstores.py` - VectorStore interface

In theory, therefore, it may be possible to:
1. exclude langchain community
2. copy the contents of neo4j_vector.py into a Jupyter cell
3. use the Jupyter debugger on said code

There does not appear to be any chaining happening in this class.

#### Neo4j Docs, Debug, Diagnostics

Neo4j on Vector Indices: https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/

Neo4j Bolt Driver Source Code and Docs Link: https://github.com/neo4j/neo4j-python-driver/blob/5.0/README.rst

Bolt Driver API: https://neo4j.com/docs/api/python-driver/current/index.html 
1. Logging Enabling - https://neo4j.com/docs/api/python-driver/current/api.html#logging
2. Direct Query Execution - https://neo4j.com/docs/api/python-driver/current/api.html#query

#### Putting it all together

* Enable Neo4j Driver Debug Logs
* Run Direct Queries using the Neo4j Driver based on https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/
     1. Check for existence of Vector Indices
     2. Run Vector Queries & Check Results
* Then check what is happening within the Neo4jVector class in doing the same
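
The first direct-query step (checking that a vector index exists) might look like the sketch below. It relies on the driver's `execute_query` method and a `SHOW INDEXES` filter on `type = 'VECTOR'`. The helper name, the default database name and the stub driver are assumptions for illustration; the stub returns made-up rows so the example runs without a database.

```python
SHOW_VECTOR_INDEXES = (
    "SHOW INDEXES YIELD name, type, labelsOrTypes, properties "
    "WHERE type = 'VECTOR' "
    "RETURN name, labelsOrTypes, properties"
)


def list_vector_indexes(driver, database="neo4j"):
    """Return one dict per vector index found in the database."""
    records, _summary, _keys = driver.execute_query(
        SHOW_VECTOR_INDEXES, database_=database
    )
    return [
        {
            "name": record["name"],
            "labels": record["labelsOrTypes"],
            "properties": record["properties"],
        }
        for record in records
    ]


class StubDriver:
    """Stand-in for neo4j.Driver; returns illustrative rows only."""

    def execute_query(self, query_, parameters_=None, database_=None):
        rows = [
            {
                "name": "topic_index",
                "labelsOrTypes": ["Topic"],
                "properties": ["embedding"],
            }
        ]
        return rows, None, ["name", "labelsOrTypes", "properties"]


indexes = list_vector_indexes(StubDriver())
print(indexes)
```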


#### Enable Neo4j Driver Logs

Example taken from https://neo4j.com/docs/api/python-driver/current/api.html#logging **Full Control**

In [None]:
import logging
import sys

# create a handler, e.g. to log to stdout
handler = logging.StreamHandler(sys.stdout)
# configure the handler to your liking
handler.setFormatter(logging.Formatter(
    "[%(levelname)-8s] %(threadName)s(%(thread)d) %(asctime)s  %(message)s"
))
# add the handler to the driver's logger
logging.getLogger("neo4j").addHandler(handler)
# make sure the logger logs on the desired log level
logging.getLogger("neo4j").setLevel(logging.DEBUG)
# from now on, DEBUG logging to stdout is enabled in the driver

#### Run Cypher Queries Direct on DB Instance

* Login (Use GCP BJSS Account) - https://login.neo4j.com/u/login/identifier?state=hKFo2SBFamE1U0h5UThHQ3A1MWplcVp3bFhFdU9mNWxFM3RiUKFur3VuaXZlcnNhbC1sb2dpbqN0aWTZIG93cW5YSmstRDQ3ZzFadWx1QWdOeER4Zk1rV2lvbmdho2NpZNkgV1NMczYwNDdrT2pwVVNXODNnRFo0SnlZaElrNXpZVG8
* Then choose Query Tab - https://workspace-preview.neo4j.io/workspace/query
* Then try out queries on this page - https://neo4j.com/docs/cypher-manual/current/indexes/semantic-indexes/vector-indexes/#indexes-vector-query
* Or rerun queries captured in the DEBUG Neo4j driver log output (above)

#### Node Vector Search Queries to Try

##### Syntax
db.index.vector.queryNodes(indexName :: STRING, numberOfNearestNeighbours :: INTEGER, query :: ANY) :: (node :: NODE, score :: FLOAT)

##### WIP

###### Responsible AI must be translated into Vectors 

CALL db.index.vector.queryNodes('topic_index', 10, "Responsible AI")
YIELD node AS similarAbstract, score
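
The query above passes a string where `db.index.vector.queryNodes` expects a vector. One way round this, sketched here, is to embed the text in Python and send the resulting vector as a query parameter. The function name `search_topics`, the stub driver and the fake embedding function are assumptions for illustration, so the example runs without a database or a Vertex AI call.

```python
VECTOR_QUERY = (
    "CALL db.index.vector.queryNodes($index_name, $k, $vector) "
    "YIELD node, score "
    "RETURN node.id AS id, score"
)


def search_topics(driver, embed_query, text,
                  index_name="topic_index", k=10, database="neo4j"):
    """Embed `text`, then query the vector index with the resulting vector."""
    vector = embed_query(text)
    records, _summary, _keys = driver.execute_query(
        VECTOR_QUERY,
        parameters_={"index_name": index_name, "k": k, "vector": vector},
        database_=database,
    )
    return [(record["id"], record["score"]) for record in records]


class StubDriver:
    """Stand-in for neo4j.Driver; records the parameters it receives."""

    def __init__(self):
        self.last_parameters = None

    def execute_query(self, query_, parameters_=None, database_=None):
        self.last_parameters = parameters_
        return [{"id": "Responsible Ai", "score": 1.0}], None, ["id", "score"]


def fake_embed(text):
    """Hypothetical embedding function returning a fixed toy vector."""
    return [0.1, 0.2, 0.3]


stub = StubDriver()
hits = search_topics(stub, fake_embed, "Responsible AI")
print(hits)
```

In the notebook, `embeddingGCP.embed_query` would take the place of `fake_embed`.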

###### Extract id (text) & embedding (searchable vector)

MATCH (n:Topic) where n.id="Responsible Ai" RETURN n.id, n.embedding

#### Example Similarity Search Query to be converted.

MATCH (title:Title)<--(:Paper)-->(abstract:Abstract)
WHERE toLower(title.text) = 'efficient and robust approximate nearest neighbor search using
  hierarchical navigable small world graphs'

CALL db.index.vector.queryNodes('abstract-embeddings', 10, abstract.embedding)
YIELD node AS similarAbstract, score

MATCH (similarAbstract)<--(:Paper)-->(similarTitle:Title)
RETURN similarTitle.text AS title, score

#### Converted Query WIP

MATCH (n:Topic) where n.id='Responsible Ai'

CALL db.index.vector.queryNodes('topic_index', 10, n.embedding) YIELD node AS similarTopic, score

MATCH (similarTopic)<--(:Topic) RETURN similarTopic.id AS id, score

#### Result captured below

|id|score|
|:-|----:|
|Responsible Ai|1|
|Responsible Ai|1|
|Machine Learning Systems|0.9488551020622253|
|Model Explainability|0.9418313503265381|
|Scaling Prototypes|0.9415329694747925|
|Scaling Prototypes|0.9415329694747925|
|Evaluation Metrics|0.9358523488044739|
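
The table contains repeated rows, most likely because the final `MATCH` returns one row per incoming relationship, and it also contains the query node itself. A small post-processing step can drop both and keep only neighbours above a cut-off. The `0.94` threshold and the function name are assumptions chosen for this example; the rows are copied from the table above.

```python
rows = [
    ("Responsible Ai", 1.0),
    ("Responsible Ai", 1.0),
    ("Machine Learning Systems", 0.9488551020622253),
    ("Model Explainability", 0.9418313503265381),
    ("Scaling Prototypes", 0.9415329694747925),
    ("Scaling Prototypes", 0.9415329694747925),
    ("Evaluation Metrics", 0.9358523488044739),
]


def shortlist_neighbours(rows, query_id, threshold=0.94):
    """Deduplicate by id, drop the query node, keep scores >= threshold."""
    best = {}
    for topic_id, score in rows:
        if topic_id == query_id:
            continue
        best[topic_id] = max(score, best.get(topic_id, 0.0))
    ranked = sorted(best.items(), key=lambda item: item[1], reverse=True)
    return [(topic_id, score) for topic_id, score in ranked if score >= threshold]


shortlist = shortlist_neighbours(rows, "Responsible Ai")
for topic_id, score in shortlist:
    print(f"{topic_id}: {score:.4f}")
```

Whether any of the shortlisted topics should actually be amalgamated with the query topic remains a judgement call; the score only narrows the list to review.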