
Performance improvement walk extraction #96

Open
MartinBoeckling opened this issue Jun 8, 2022 · 9 comments · May be fixed by #122
Labels
enhancement New feature or request

Comments

@MartinBoeckling

🚀 Feature / Enhancement

Increase walk extraction speed

I have experimented with the pyRDF2Vec package for my thesis and appreciate the provided Python package. During the course of my thesis I came across the described performance limitations for large knowledge graphs. Because I faced memory issues and long computation times even after optimizing the different parameter settings of the walk extractor, I partially reimplemented the walk extraction with the Python igraph package, which gave a noticeable performance boost. Since I benefited from the reduced computation time, I thought it would be useful to share my approach.

Additional context

The igraph package is partially implemented in C and provides multiple graph algorithms, such as breadth-first search and depth-first search. A performance comparison can be seen in the picture attached below: the pyRDF2Vec approach took around 53 minutes, while the igraph-based approach took around 2 minutes. The comparison was measured with the following Google Colab document on the FB15k train dataset. As I did not convert the Knowledge Graph object to an igraph object, I read the CSV file in directly (a sketch of such a conversion is included after the code below).

If additional information regarding the implemented approach is necessary, feel free to reach out.

Solution

A possible solution approach can be found here:

# import packages
import random
import pandas as pd
import numpy as np
from tqdm import tqdm
from igraph import Graph
from itertools import groupby
import multiprocessing as mp

def predicateGeneration(self, pathList):
  # assign class graph to graph variable
  graph = self.graph
  # extract description of edge given edge id stored in numpy
  predValues = np.array([e.attributes()['description'] for e in graph.es(pathList)])
  # extract node sequences that are part of the edge path and flatten numpy array
  nodeSequence = np.array([graph.vs().select(e.tuple).get_attribute_values('name') for e in graph.es(pathList)]).flatten()
  # remove consecutive duplicate node values from the flattened sequence
  nodeSequence = np.array([key for key, _group in groupby(nodeSequence)])
  # combine description values and node sequences to one single array
  pathSequence = np.insert(predValues, np.arange(len(nodeSequence)), nodeSequence)
  # convert numpy array to list 
  pathSequence = pathSequence.tolist()
  # return the combined path sequence as a list
  return pathSequence

def walkIteration(self, idNumber):
  # assign class graph variable to local graph variable
  graph = self.graph
  # extract index of graph node
  nodeIndex = graph.vs.find(idNumber).index
  # perform breadth-first search algorithm
  bfsList = graph.bfsiter(nodeIndex, 'out', advanced=True)
  # iterate over the breadth-first search iterator and keep only nodes within the configured distance
  distanceList = [nodePath for nodePath in bfsList if nodePath[1] <= self.distance]
  # create vertex list from distance list extracting vertex element
  vertexList = [vertexElement[0] for vertexElement in distanceList]
  # compute shortest path from focused node index to extracted vertex list outputting edge ID
  shortestPathList = graph.get_shortest_paths(v=nodeIndex, to=vertexList, output='epath')
  # extract walk sequences with edge id to generate predicates
  walkSequence = list(map(self.predicateGeneration, shortestPathList))      
  # cap the number of sampled walks at the configured maximum (assumed to be stored in self.maxWalks)
  maxWalks = min(self.maxWalks, len(walkSequence))
  # sample the walks reproducibly
  random.seed(15)
  walkSequence = random.sample(walkSequence, maxWalks)
  return walkSequence

# read the knowledge graph edges from a CSV file (first two columns: source and target node, third column: edge description)
graphData = pd.read_csv(filePath)
# transform the dataframe rows into a record array of tuples
graphDataValues = graphData.to_records(index=False)
# initialize the knowledge graph as a directed igraph Graph with an edge description attribute
self.graph = Graph.TupleList(graphDataValues, directed=True, edge_attrs=['description'])
# initialize multiprocessing pool with cpu number
pool = mp.Pool(mp.cpu_count())
# extract walks in parallel for each entity of interest (entities holds the node names to start the walks from)
walkPredicateList = list(tqdm(pool.imap_unordered(self.walkIteration, entities, chunksize=8),
                              desc='Walk Extraction', total=len(entities)))
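
As noted in the additional context, the snippet above reads the CSV file directly instead of converting an existing Knowledge Graph object. A minimal sketch of such a conversion, assuming the graph is available as an rdflib Graph (the helper name rdflibToIgraph is only illustrative and not part of the pyRDF2Vec API):

# hypothetical sketch: convert an rdflib graph into an igraph graph exposing the same
# vertex names and 'description' edge attribute used by the walk extraction above
import rdflib
from igraph import Graph

def rdflibToIgraph(rdfGraph: rdflib.Graph) -> Graph:
  # every triple becomes a directed edge (subject -> object) carrying the predicate as the description
  edgeTuples = [(str(s), str(o), str(p)) for s, p, o in rdfGraph]
  return Graph.TupleList(edgeTuples, directed=True, edge_attrs=['description'])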

Performance Comparison

@MartinBoeckling MartinBoeckling added the enhancement New feature or request label Jun 8, 2022
@GillesVandewiele
Member

GillesVandewiele commented Jun 8, 2022

Holy smokes, this is huge (a speedup of an order of magnitude)! Thank you so much @MartinBoeckling! Would you be willing to create a PR from this so you can be credited as a contributor when it is added? You could put this in a simple example file and we can work from there.

At first sight, it does seem that the igraph approach uses quite a bit more memory? Perhaps we should support both alternatives for those with less RAM (although 0.5 GB is still not much at all).

BTW: if your thesis gets published, I would be very interested in reading it! What will the title be, so I can keep an eye open for it?

@MartinBoeckling
Author

MartinBoeckling commented Jun 8, 2022

Sure, I can definitely create a pull request with a sample implementation. Regarding memory consumption, I was able to reduce it to about 200 MB by using tuples. Maybe an implementation with numpy arrays could reduce the memory consumption even further.
And sure, if my thesis gets published I will let you know what the exact title is. Currently I am still writing my thesis, so I don't know whether the current title will stay exactly the same, but it will be something like "Wildfire prediction with machine learning techniques using spatial knowledge graph embeddings". When the thesis is published I can also send you the link via mail if you want.
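
For reference, a minimal sketch of what the tuple-based loading could look like (the helper name buildGraphFromCsv and the assumed column order subject, object, predicate are illustrative, not the exact code used):

# hypothetical sketch: build the igraph graph from plain tuples instead of a pandas record array
import csv
from igraph import Graph

def buildGraphFromCsv(filePath):
  # read the CSV row by row and keep every triple as a plain tuple
  with open(filePath, newline='') as csvFile:
    edgeTuples = [tuple(row[:3]) for row in csv.reader(csvFile)]
  # the first two tuple entries become the edge endpoints, the third the edge description
  return Graph.TupleList(edgeTuples, directed=True, edge_attrs=['description'])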

@MartinBoeckling
Author

@GillesVandewiele, regarding the Pull Request for this feature, should I also add tests and documentation, or should I just create the pull request with the example file of the igraph implementation?
Also, I wanted to make sure whether the develop branch would be the correct one to merge the Pull Request into.

@GillesVandewiele
Member

Hi @MartinBoeckling. Documentation would definitely be nice, but I already appreciate the PR a lot! We could take care of docs & tests later. The dev branch would indeed be ideal! Thank you

@MartinBoeckling
Author

Created Pull Request #122; therefore I will close this issue.

@bsteenwi
Collaborator

Will reopen this issue.

I created a separate branch to update and test some aspects.
The current implementation works after some fixes, but it seems to be quite slow with respect to other use cases.

@bsteenwi bsteenwi reopened this Jul 10, 2023
@bsteenwi bsteenwi linked a pull request Jul 10, 2023 that will close this issue
@MartinBoeckling
Author

Hi @bsteenwi,
should I revisit the code to see where there is potential for optimization with respect to the other use cases that you have tested?

@bsteenwi
Collaborator

Hi @MartinBoeckling,

I had to make a lot of changes to the code in your pull request to make it work...
So a revision, or an example that shows the advantage of the igraph implementation, would be very helpful.

@MartinBoeckling
Author

MartinBoeckling commented Jul 25, 2023

Hi @bsteenwi,
at the top I have provided a small example where I have seen the performance advantage of igraph compared to the approach implemented in pyRDF2Vec: pyRDF2Vec took 53 minutes, while the igraph-based implementation was able to perform the task in 2 minutes (tested on the FB15k dataset). I would then test the revised code on different datasets to get a better feeling for the different connectors and datasets.
Thanks for fixing the code based on the pull request.
