Successor notice: OverflowDB
What started as a TinkerGraph fork has slowly but steadily evolved into a separate graph database and has now moved to its own repository (https://github.com/ShiftLeftSecurity/overflowdb). Users of this Shiftleft-Tinkergraph can continue to use it; we still accept PRs and can release new versions. Future development, however, will happen on OverflowDB, which is already 70% more memory efficient than Shiftleft-Tinkergraph, which itself is 70% more memory efficient than the original TinkerGraph.
This is a fork of Apache TinkerGraph that uses 70% less memory (for our use case, ymmv) and implements strict schema validation. See the related blog article on the ShiftLeft Blog.
1. add a dependency on the latest published artifact on Maven Central
2. extend SpecializedTinkerVertex for vertices and SpecializedTinkerEdge for edges
3. create instances of SpecializedElementFactory.ForEdge and pass them to
The repository contains examples for the Grateful Dead graph, and there is a full test setup that uses them. 2) and 3) are basically boilerplate and therefore good candidates for code generation.
Other than that, it's a minimally invasive operation, because all other graph and traversal APIs remain the same, i.e., you won't need to change any of your queries. We didn't encounter a single issue when we deployed this into production.
Motivation and context
The main difference is that instead of generic HashMaps we use specific structures tailored to your domain. To make this clearer, let's look at the main use cases for HashMaps in TinkerGraph:
- TinkerGraph allows any vertex and any edge to have any property (basically a key/value pair, e.g., foo=42). To achieve this, each element in the graph has a Map<String, Property>, and each property is wrapped inside a HashMap$Node, see TinkerVertex and TinkerEdge.
- TinkerGraph allows connecting any two vertices by any edge. Therefore each vertex holds two Map<String, Set<Edge>> instances (one for incoming and one for outgoing edges), where the String refers to the edge label.
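To make the generic layout described above concrete, here is a minimal sketch. The classes `GenericVertex` and `GenericEdge` are hypothetical stand-ins written for this illustration, not the actual TinkerGraph classes, and properties are stored as plain `Object` values rather than `Property` wrappers to keep the example short.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

// Hypothetical stand-ins for illustration only; NOT the TinkerGraph classes.
class GenericEdge {
    final String label;
    final GenericVertex out;
    final GenericVertex in;
    // Every edge carries its own map, even if it never has a property.
    final Map<String, Object> properties = new HashMap<>();

    GenericEdge(String label, GenericVertex out, GenericVertex in) {
        this.label = label;
        this.out = out;
        this.in = in;
    }
}

class GenericVertex {
    final String label;
    // Any key/value pair is accepted; each entry costs a hash map node.
    final Map<String, Object> properties = new HashMap<>();
    // Edge label -> edges, kept once per direction.
    final Map<String, Set<GenericEdge>> outEdges = new HashMap<>();
    final Map<String, Set<GenericEdge>> inEdges = new HashMap<>();

    GenericVertex(String label) {
        this.label = label;
    }

    // Connects this vertex to any other vertex with any edge label.
    GenericEdge addEdge(String edgeLabel, GenericVertex target) {
        GenericEdge edge = new GenericEdge(edgeLabel, this, target);
        outEdges.computeIfAbsent(edgeLabel, k -> new HashSet<>()).add(edge);
        target.inEdges.computeIfAbsent(edgeLabel, k -> new HashSet<>()).add(edge);
        return edge;
    }
}

class GenericLayoutDemo {
    public static void main(String[] args) {
        GenericVertex song = new GenericVertex("song");
        GenericVertex artist = new GenericVertex("artist");
        song.properties.put("name", "Dark Star");
        song.properties.put("nmae", 42);    // typo and wrong type, silently accepted
        artist.addEdge("followedBy", song); // questionable edge, silently accepted
        System.out.println(song.properties.keySet());
    }
}
```

Nothing in this layout can reject the misspelled key or the unexpected edge, and every element pays for its maps regardless of how many properties or edges it actually has.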
Being generic and not enforcing a schema makes complete sense for the default TinkerGraph - it allows users to play without restrictions and build prototypes. Once a project is more mature though, chances are you have a good understanding of your domain and can define a schema, so that you don't need the generic structure any more and can save a lot of memory.
Using less memory is not the only benefit, though: knowing exactly which properties a given element can have, which types they have, and which edges are allowed on a specific vertex helps catch errors very early in the development cycle. Your IDE can help you build valid (i.e., schema-conforming) graphs and traversals. If you use a statically checked language, your compiler can find errors that would otherwise only occur at runtime. Even if you are using a dynamic language you are better off, because you'll get an error when you load the graph, e.g., when setting a property on the wrong vertex type. This is far better than getting invalid results at query time, when you need to debug all the way back to a potentially very simple mistake. Since we already had a loosely defined schema for our code property graph, this exercise helped to complete and strengthen it.
What does this mean in practice?
'Enforcing a strict schema' actually translates to something very simple: we just replaced the generic HashMaps with specific members:
- Element properties: vertices and edges contain generic HashMap<String, Object> that hold all the element's properties. We just replaced them with specific class members, e.g.,
- Edges on a vertex: the generic TinkerVertex contains two HashMap<String, Set<Edge>> in|outEdges which can reference any edge. We replaced these by specific Set<SomeSpecificEdgeType> for each edge type that is allowed to connect this vertex with another vertex.
This means that we can throw an error if the schema is violated, e.g., if the user tries to set a property that is not defined for a specific vertex, or if the user tries to connect a vertex via an edge that's not supposed to be connected to this vertex. It is important to note, though, that it's up to you whether you want to make this a strict validation or not: you can choose to tolerate schema violations in your domain classes.
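The idea of replacing the generic maps with specific members can be sketched as follows. This is an illustration under stated assumptions, not the fork's actual API: the classes `Song`, `Artist`, and `SungBy`, the `strict` flag, and all method names are hypothetical and do not extend `SpecializedTinkerVertex` or `SpecializedTinkerEdge`.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical domain classes for illustration; not the fork's actual API.
class Artist {
    String name;
}

class SungBy {
    final Song out;
    final Artist in;

    SungBy(Song out, Artist in) {
        this.out = out;
        this.in = in;
    }
}

class Song {
    // Specific members instead of a HashMap<String, Object>.
    private String name;
    private Integer performances;
    // One typed set per allowed edge type instead of Map<String, Set<Edge>>.
    private final Set<SungBy> sungByOut = new HashSet<>();
    // Rejecting or tolerating unknown keys is a choice made by the domain class.
    private final boolean strict;

    Song(boolean strict) {
        this.strict = strict;
    }

    void setProperty(String key, Object value) {
        switch (key) {
            case "name":
                name = (String) value; // wrong type fails with ClassCastException
                break;
            case "performances":
                performances = (Integer) value;
                break;
            default:
                if (strict) {
                    throw new IllegalArgumentException(
                        "property '" + key + "' is not defined for Song");
                }
                // lenient mode: the unknown property is ignored
        }
    }

    Object getProperty(String key) {
        switch (key) {
            case "name": return name;
            case "performances": return performances;
            default: return null;
        }
    }

    // Only an Artist can be the target; passing another Song does not compile.
    SungBy addSungBy(Artist artist) {
        SungBy edge = new SungBy(this, artist);
        sungByOut.add(edge);
        return edge;
    }

    Set<SungBy> outgoingSungBy() {
        return sungByOut;
    }
}

class SchemaDemo {
    public static void main(String[] args) {
        Song song = new Song(true);
        song.setProperty("name", "Dark Star");
        song.addSungBy(new Artist());
        try {
            song.setProperty("nmae", "Dark Star"); // misspelled key
        } catch (IllegalArgumentException e) {
            System.out.println("schema violation: " + e.getMessage());
        }
    }
}
```

In this sketch a misspelled key fails at the moment it is set rather than surfacing as an empty result at query time, and constructing `Song` with `strict` set to `false` shows how a domain class could tolerate violations instead.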
- indices aren't updated automatically when you mutate or add elements to the graph. This would be easy to do I guess, but we haven't had the need yet. Workaround: drop and recreate the index.
- an OLAP (GraphComputer) implementation is available, but we haven't really tested it yet
- you cannot (yet) mix generic and specialized Elements: it's all or nothing, and you'll get an error if you accidentally try
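The first limitation and its workaround can be illustrated with a small self-contained model. `Node` and `NameIndex` are hypothetical classes written for this sketch; in the graph itself the corresponding step would presumably be upstream TinkerGraph's `dropIndex` followed by `createIndex`, assuming the fork keeps that API unchanged.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical minimal model of an index that is built once and not
// maintained when elements are added afterwards.
class Node {
    final String name;

    Node(String name) {
        this.name = name;
    }
}

class NameIndex {
    private final Map<String, Set<Node>> byName = new HashMap<>();

    // Builds the index from a snapshot of the elements that exist right now.
    static NameIndex create(List<Node> nodes) {
        NameIndex index = new NameIndex();
        for (Node node : nodes) {
            index.byName.computeIfAbsent(node.name, k -> new HashSet<>()).add(node);
        }
        return index;
    }

    Set<Node> lookup(String name) {
        return byName.getOrDefault(name, Collections.emptySet());
    }
}

class StaleIndexDemo {
    public static void main(String[] args) {
        List<Node> nodes = new ArrayList<>();
        nodes.add(new Node("a"));
        NameIndex index = NameIndex.create(nodes);

        nodes.add(new Node("b")); // added after the index was built
        System.out.println("before rebuild: " + index.lookup("b").size());

        index = NameIndex.create(nodes); // workaround: drop and recreate
        System.out.println("after rebuild: " + index.lookup("b").size());
    }
}
```

Rebuilding costs a full pass over the elements, so it fits best after a bulk load rather than after each individual mutation.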
Bring in changes from upstream TinkerGraph
When a new Apache TinkerGraph version is released, here are the steps to bring its changes into this fork:
    # view diff
    cd ~/Projects/tinkerpop/tinkerpop3
    git diff 3.3.2..3.3.3 tinkergraph-gremlin/src > ~/tp-upgrade.patch

    # apply patch (-p2 strips the base directory, which is different in our fork)
    cd ~/Projects/shiftleft/tinkergraph-gremlin
    git apply -p2 ~/tp-upgrade.patch

    # manually fix all conflicts (*.orig / *.rej files)
    # update all versions in pom.xml
    mvn clean test
- change the version in pom.xml to a non-snapshot (e.g.
- commit and tag it (e.g. v22.214.171.124), push everything (including the tag!)
- wait for Travis to automatically deploy the tagged version to Sonatype and stage it so that it'll be synchronized to Maven Central within a few hours. Note: check the log output of the last Travis step ($ ./travis/deploy.sh) to be sure. You should see something like the following at the very end:
    [INFO] Remote staged 1 repositories, finished with success.
    [INFO] Remote staging repositories are being released...
    Waiting for operation to complete...
    ............
    [INFO] Remote staging repositories released.
    [INFO] ------------------------------------------------------------------------
    [INFO] BUILD SUCCESS
    [INFO] ------------------------------------------------------------------------
- change the version to the next snapshot (e.g.