5 changes: 4 additions & 1 deletion README.rst
@@ -28,6 +28,7 @@ Using the built-in :code:`PGFrame` data structure (currently, `pandas <https://p
- `graph-tool <https://graph-tool.skewed.de/>`_ (for the analytics API);
- `Neo4j <https://neo4j.com/>`_ (for the analytics and representation learning API);
- `StellarGraph <https://stellargraph.readthedocs.io/en/stable/>`_ (for the representation learning API);
- `gensim <https://radimrehurek.com/gensim/>`_ (for the representation learning API).

This repository originated from the Blue Brain effort on building a COVID-19-related knowledge graph from the `CORD-19 <https://www.kaggle.com/allen-institute-for-ai/CORD-19-research-challenge>`_ dataset and analysing the generated graph to perform a literature review of the role of glucose metabolism deregulations in the progression of COVID-19. For more details on how the knowledge graph is built, explored and analysed, see `COVID-19 co-occurrence graph generation and analysis <https://github.com/BlueBrain/BlueGraph/tree/master/cord19kg#readme>`__.

@@ -156,7 +157,9 @@ To get familiar with the ideas behind the co-occurrence analysis and the graph a
- `Literature exploration (PGFrames + in-memory analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Literature%20exploration%20(PGFrames%20%2B%20in-memory%20analytics%20tutorial).ipynb>`_ illustrates how to use BlueGraph's analytics API for in-memory graph backends based on the :code:`NetworkX` and the :code:`graph-tool` libraries.
- `NASA keywords (PGFrames + Neo4j analytics tutorial) <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/NASA%20keywords%20(PGFrames%20%2B%20Neo4j%20analytics%20tutorial).ipynb>`_ illustrates how to use the Neo4j-based analytics API for persistent property graphs.

`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through graph representation learning and all its downstream tasks, including node similarity queries, node classification, edge prediction and embedding pipeline building.
`Embedding and downstream tasks tutorial <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Embedding%20and%20downstream%20tasks%20tutorial.ipynb>`_ starts from the co-occurrence graph generation example and guides the user through graph representation learning and all its downstream tasks, including node similarity queries, node classification and edge prediction.

`Create and run embedding pipelines <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20run%20embedding%20pipelines.ipynb>`_ illustrates how embedding pipelines can be built and executed using BlueGraph.

Finally, `Create and push embedding pipeline into Nexus.ipynb <https://github.com/BlueBrain/BlueGraph/blob/master/examples/notebooks/Create%20and%20push%20embedding%20pipeline%20into%20Nexus.ipynb>`_ illustrates how embedding pipelines can be created and pushed to `Nexus <https://bluebrainnexus.io/>`_, and
`Embedding service API <https://github.com/BlueBrain/BlueGraph/blob/master/services/embedder/examples/notebooks/Embedding%20service%20API.ipynb>`_ shows how to use the embedding service that retrieves embedding pipelines from Nexus.
16 changes: 16 additions & 0 deletions bluegraph/backends/gensim/__init__.py
@@ -0,0 +1,16 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .embed.embedders import GensimNodeEmbedder
15 changes: 15 additions & 0 deletions bluegraph/backends/gensim/embed/__init__.py
@@ -0,0 +1,15 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
113 changes: 113 additions & 0 deletions bluegraph/backends/gensim/embed/embedders.py
@@ -0,0 +1,113 @@
# BlueGraph: unifying Python framework for graph analytics and co-occurrence analysis.

# Copyright 2020-2021 Blue Brain Project / EPFL

# Licensed under the Apache License, Version 2.0 (the "License");
# you may not use this file except in compliance with the License.
# You may obtain a copy of the License at

# http://www.apache.org/licenses/LICENSE-2.0

# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS,
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from collections import namedtuple
import warnings
import pandas as pd

from gensim.models.poincare import PoincareModel

from bluegraph.core.embed.embedders import GraphElementEmbedder
from bluegraph.backends.params import (GENSIM_PARAMS,
                                       DEFAULT_GENSIM_PARAMS)


GensimGraph = namedtuple('GensimGraph', 'graph graph_configs')


class GensimNodeEmbedder(GraphElementEmbedder):

    _transductive_models = [
        "poincare",
        "word2vec"
    ]

    def __init__(self, model_name, directed=True, include_type=False,
                 feature_props=None, feature_vector_prop=None,
                 edge_weight=None, **model_params):
        if directed is False and model_name == "poincare":
            raise GraphElementEmbedder.FittingException(
                "Poincare embedding can be performed only on directed graphs: "
                "undirected graph was provided")
        super().__init__(
            model_name=model_name, directed=directed,
            include_type=include_type,
            feature_props=feature_props,
            feature_vector_prop=feature_vector_prop,
            edge_weight=edge_weight, **model_params)

    @staticmethod
    def _generate_graph(pgframe, graph_configs):
        """Generate backend-specific graph object."""
        return GensimGraph(pgframe, graph_configs)

    def _dispatch_model_params(self, **kwargs):
        """Dispatch training parameters."""
        params = {}
        for k, v in kwargs.items():
            if k not in GENSIM_PARAMS[self.model_name]:
                warnings.warn(
                    f"GensimNodeEmbedder's model '{self.model_name}' "
                    f"does not support the training parameter '{k}', "
                    "the parameter will be ignored",
                    GraphElementEmbedder.FittingWarning)
            else:
                params[k] = v

        # Fill in defaults for parameters the caller did not provide
        for k, v in DEFAULT_GENSIM_PARAMS.items():
            if k not in params:
                params[k] = v
        return params

    def _fit_transductive_embedder(self, train_graph):
        """Fit transductive embedder (no model, just embeddings)."""

        # 'epochs' is passed to `train`, not to the model constructor
        model_params = {**self.params}
        del model_params["epochs"]

        if self.model_name == "poincare":
            model = PoincareModel(
                train_graph.graph.edges(), **model_params)
        else:
            # "word2vec" is listed in `_transductive_models` but has no
            # implementation here; fail explicitly instead of hitting an
            # undefined `model` below
            raise NotImplementedError(
                f"Model '{self.model_name}' is not implemented for "
                "gensim-based node embedders")

        model.train(epochs=self.params["epochs"])

        embedding = pd.DataFrame(
            [
                (n, model.kv.get_vector(n))
                for n in train_graph.graph.nodes()
            ],
            columns=["@id", "embedding"]
        ).set_index("@id")
        return embedding

    def _fit_inductive_embedder(self, train_graph):
        """Fit inductive embedder (predictive model and embeddings)."""
        raise NotImplementedError(
            "Inductive models are not implemented for gensim-based "
            "node embedders")

    def _predict_embeddings(self, graph, nodes=None):
        """Predict embeddings of out-of-sample nodes (inductive models only)."""
        raise NotImplementedError(
            "Inductive models are not implemented for gensim-based "
            "node embedders")

    @staticmethod
    def _save_predictive_model(model, path):
        pass

    @staticmethod
    def _load_predictive_model(path):
        pass
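
The parameter dispatching in `_dispatch_model_params` above can be illustrated in isolation. The following is a minimal sketch, not the BlueGraph implementation: `SUPPORTED_PARAMS`, `DEFAULT_PARAMS` and `dispatch_model_params` are hypothetical stand-ins for `GENSIM_PARAMS`, `DEFAULT_GENSIM_PARAMS` and the method itself, and a plain `UserWarning` replaces `GraphElementEmbedder.FittingWarning`.

```python
import warnings

# Hypothetical stand-ins for GENSIM_PARAMS / DEFAULT_GENSIM_PARAMS
SUPPORTED_PARAMS = {"poincare": ["epochs", "size", "alpha", "negative"]}
DEFAULT_PARAMS = {"size": 64, "epochs": 50}


def dispatch_model_params(model_name, **kwargs):
    """Keep supported parameters, warn about the rest, then fill in defaults."""
    params = {}
    for key, value in kwargs.items():
        if key not in SUPPORTED_PARAMS[model_name]:
            warnings.warn(
                f"model '{model_name}' does not support '{key}', ignoring it",
                UserWarning)
        else:
            params[key] = value
    # Defaults never override a value the caller passed explicitly
    for key, value in DEFAULT_PARAMS.items():
        params.setdefault(key, value)
    return params


# 'window' is not a supported parameter here, so it is dropped with a warning
with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    result = dispatch_model_params("poincare", size=32, window=5)
```

Note that a model name missing from the lookup table raises a `KeyError` in this sketch, which appears to be what would happen in the diff for `"word2vec"` as soon as any keyword parameter is passed.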
2 changes: 1 addition & 1 deletion bluegraph/backends/neo4j/analyse/paths.py
@@ -104,7 +104,7 @@ def _compute_yen_shortest_paths(graph, source, target, n,
graph._generate_st_match_query(source, target) +
Neo4jPathFinder._generate_path_search_call(
graph, source, target,
"gds.beta.shortestPath.yens.stream",
"gds.shortestPath.yens.stream",
distance, exclude_edge,
extra_params={"k": n}) +
"YIELD nodeIds\n"
13 changes: 10 additions & 3 deletions bluegraph/backends/neo4j/embed/embedders.py
@@ -48,11 +48,16 @@ class Neo4jNodeEmbedder(GraphElementEmbedder):
@staticmethod
def _generate_graph(pgframe=None, uri=None, username=None,
password=None, driver=None,
node_label=None, edge_label=None):
node_label=None, edge_label=None,
graph_configs=None):
"""Generate backend-specific graph object."""
if graph_configs is None:
graph_configs = {"directed": True}

return pgframe_to_neo4j(
pgframe=pgframe, uri=uri, username=username, password=password,
driver=driver, node_label=node_label, edge_label=edge_label)
driver=driver, node_label=node_label, edge_label=edge_label,
directed=graph_configs["directed"])

def _dispatch_model_params(self, **kwargs):
"""Dispatch training parameters."""
@@ -223,7 +228,9 @@ def fit_model(self, pgframe=None, uri=None, username=None, password=None,
train_graph = self._generate_graph(
pgframe=pgframe, uri=uri, username=username,
password=password, driver=driver,
node_label=node_label, edge_label=edge_label)
node_label=node_label, edge_label=edge_label,
graph_configs=self.graph_configs)
# self.graph_configs
else:
train_graph = graph_view
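
The `graph_configs=None` default introduced in `_generate_graph` follows the usual pattern of resolving a missing configuration inside the function body. A minimal sketch of that pattern is shown below; `generate_graph` is a hypothetical helper, and the way it treats undirected graphs (adding the reverse of every edge) is an assumption made for illustration, not the behaviour of `pgframe_to_neo4j`.

```python
def generate_graph(edges, graph_configs=None):
    """Build an edge list, defaulting to a directed graph."""
    # Resolve the default inside the body to avoid a shared mutable default
    if graph_configs is None:
        graph_configs = {"directed": True}

    edges = list(edges)
    if graph_configs["directed"]:
        return edges
    # Assumption for this sketch: undirected means both directions are stored
    return edges + [(target, source) for source, target in edges]


directed = generate_graph([("a", "b")])
undirected = generate_graph([("a", "b")], graph_configs={"directed": False})
```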

13 changes: 7 additions & 6 deletions bluegraph/backends/neo4j/io.py
@@ -162,12 +162,12 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
node_label_repr = f":{node_label}" if node_label else ""

query = (
f"""
WITH [{", ".join(node_repr)}] AS batch
UNWIND batch as individual
CREATE (n{node_label_repr})
SET n += individual
""")
f"""
WITH [{", ".join(node_repr)}] AS batch
UNWIND batch as individual
CREATE (n{node_label_repr})
SET n += individual
""")
execute(driver, query)

# Add node types to the Neo4j node labels
@@ -189,6 +189,7 @@ def pgframe_to_neo4j(pgframe=None, uri=None, username=None, password=None,
edge_labels = [edge_label]

for edge_label in edge_labels:

# Select edges of a given type, if applicable
edges = pgframe.edges(
raw_frame=True,
24 changes: 24 additions & 0 deletions bluegraph/backends/params.py
@@ -84,3 +84,27 @@
    "clusters_q": 1,
    "num_powers": 10
}


GENSIM_PARAMS = {
    "poincare": [
        "epochs",
        "size",
        "alpha",
        "negative",
        "workers",
        "epsilon",
        "regularization_coeff",
        "burn_in",
        "burn_in_alpha",
        "init_range",
        "dtype",
        "seed"
    ]
}


DEFAULT_GENSIM_PARAMS = {
    "size": 64,
    "epochs": 50
}
12 changes: 8 additions & 4 deletions bluegraph/core/embed/embedders.py
@@ -59,7 +59,7 @@ def _inductive_models(self):

@staticmethod
@abstractmethod
def _generate_graph(self, pgframe):
def _generate_graph(pgframe, graph_configs):
"""Generate backend-specific graph object."""
pass
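
The signature change above drops `self` because the method is declared as a static method. A minimal sketch of how an abstract static method behaves with Python's `abc` module follows; `BaseEmbedder` and `TupleEmbedder` are hypothetical names used only for illustration.

```python
from abc import ABC, abstractmethod


class BaseEmbedder(ABC):
    @staticmethod
    @abstractmethod
    def _generate_graph(pgframe, graph_configs):
        """Generate backend-specific graph object."""


class TupleEmbedder(BaseEmbedder):
    @staticmethod
    def _generate_graph(pgframe, graph_configs):
        # No `self`: the method is called with exactly two arguments
        return (pgframe, graph_configs)


graph = TupleEmbedder._generate_graph("frame", {"directed": True})

# The abstract base itself cannot be instantiated
try:
    BaseEmbedder()
    instantiable = True
except TypeError:
    instantiable = False
```

Had the abstract declaration kept `self` as its first parameter, the first positional argument passed to the static method would have been silently bound to `self`.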

@@ -167,7 +167,7 @@ def fit_model(self, pgframe):
if not isinstance(embeddings, pd.DataFrame):
embeddings = pd.DataFrame(
{"embedding": embeddings.tolist()},
index=train_graph.nodes())
index=pgframe.nodes())
elif self.model_name in self._inductive_models:
self._embedding_model = self._fit_inductive_embedder(train_graph)
embeddings = self._predict_embeddings(train_graph)
@@ -234,8 +234,12 @@ def load(path):

with open(os.path.join(path, "emb.pkl"), "rb") as f:
embedder = pickle.load(f)
embedder._embedding_model = embedder._load_predictive_model(
os.path.join(path, "model"))

embedder._embedding_model = None
if os.path.isfile(os.path.join(path, "model")):
embedder._embedding_model = embedder._load_predictive_model(
os.path.join(path, "model"))

if decompressed:
shutil.rmtree(path)
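
The change to `load` makes the predictive model optional: transductive embedders, such as the gensim-based one, save no model file, so the loader has to tolerate its absence. A minimal sketch of the idea, assuming a pickled model as a hypothetical stand-in for the backend-specific `_load_predictive_model`:

```python
import os
import pickle
import tempfile


def load_optional_model(path):
    """Return the model stored at `path/model`, or None if no file exists."""
    model_path = os.path.join(path, "model")
    if os.path.isfile(model_path):
        with open(model_path, "rb") as f:
            return pickle.load(f)
    return None


with tempfile.TemporaryDirectory() as tmp:
    # No model file yet: the loader falls back to None
    missing = load_optional_model(tmp)

    with open(os.path.join(tmp, "model"), "wb") as f:
        pickle.dump({"weights": [1, 2, 3]}, f)
    present = load_optional_model(tmp)
```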

5 changes: 4 additions & 1 deletion bluegraph/core/io.py
@@ -954,6 +954,8 @@ def edge_types(self, flatten=False):
"""Return a list of edges types."""
if flatten:
types = _aggregate_values(self._edges["@type"])
if isinstance(types, str):
types = [types]
else:
types = []
for el in self._edges["@type"]:
@@ -1112,9 +1114,10 @@ def get_edge_typing(self):
def aggregate_properties(frame, func, into="aggregation_result"):
if "@type" in frame.columns:
df = frame.drop("@type", axis=1)
aggregated = df.aggregate(func, axis=1).values.tolist()
frame = pd.DataFrame(
{
into: df.aggregate(func, axis=1),
into: aggregated,
"@type": frame["@type"]
},
index=frame.index)
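
The `isinstance(types, str)` check added to `edge_types` guards against an aggregation that returns a bare string when all edges share a single type. The sketch below shows why that matters; `aggregate_values` is a hypothetical stand-in written for this example (the real `_aggregate_values` is not shown in the diff), and the final sorting is added only to make the output deterministic.

```python
def aggregate_values(values):
    """Hypothetical stand-in: collapse values into a scalar or a set."""
    unique = set()
    for value in values:
        if isinstance(value, str):
            unique.add(value)
        else:
            unique.update(value)
    # A single distinct value is returned as a bare scalar
    if len(unique) == 1:
        return next(iter(unique))
    return unique


def flat_edge_types(type_column):
    """Return a list of distinct edge types."""
    types = aggregate_values(type_column)
    # Without this check, iterating over "CITES" would yield single characters
    if isinstance(types, str):
        types = [types]
    return sorted(types)


single = flat_edge_types(["CITES", "CITES"])
several = flat_edge_types(["CITES", ["MENTIONS", "CITES"]])
```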
5 changes: 3 additions & 2 deletions bluegraph/downstream/__init__.py
@@ -13,6 +13,7 @@
# WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
# See the License for the specific language governing permissions and
# limitations under the License.
from .data_structures import (ElementClassifier,
EmbeddingPipeline)
from .data_structures import ElementClassifier
from .pipelines import EmbeddingPipeline

from .utils import *