Update in-memory backend documentation #1934
fe9a875 to a2bb322
LGTM. Thank you @dk-github !
Nice documentation! Helps me understand JanusGraph better. Just a few points:
Shutting down the graph or terminating the process that hosts the
JanusGraph graph will irrevocably delete all data from the graph. This
backend is local to a particular JanusGraph graph instance and cannot be
shared across multiple JanusGraph graphs.

Ideal Use Case
--------------
Rapid testing:
Can you maybe use markdown ### syntax so that the generated doc website can show the table of contents correctly like https://docs.janusgraph.org/storage-backend/cassandra/#setup-cassandra-cluster?
done
- Loss of data due to _unexpected_ death of host process is acceptable (the backend provides a simple mechanism for making fast snapshots to handle _expected_ restarts)
- Size of the graph data makes it possible to host it in a single JVM process (i.e. a few tens of gigabytes max, unless you use a specialized JVM and hardware)
- Higher performance is required, but no expertise/resources are available to tune more complex backends. Due to its memory-only nature, the in-memory backend typically performs faster than disk-based ones, in queries using simple indices
and in graph modifications. However, it is not specifically optimized for performance, and does not support advanced indexing functionality.
It is not very clear to me whether 'advanced indexing functionality' refers to any index, or only to mixed indices (which require an index backend and can thus be considered 'advanced').
It loosely refers to two things: the fact that there is no in-memory index backend (and so no mixed indices if using just the in-memory backend), and the fact that simple indices just take advantage of the ordering of the column and row keys and essentially do a binary search. There are no additional data structures dedicated to indices, and there is no database whose native indexing can be utilised.
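To illustrate the second point, here is a minimal sketch of how an index lookup can fall out of key ordering alone. It is not the backend's actual code: the class `SortedKeyIndexSketch`, the key format `property:value:vertexId` and the `lookup` method are hypothetical names chosen for this example, and keys are assumed to be unique. The only point is that a prefix lookup over sorted keys is a binary search plus a short scan, with no dedicated index structure.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SortedKeyIndexSketch {
    // Hypothetical store: unique keys kept in sorted order,
    // formatted as "property:value:vertexId" for this example only.
    private final List<String> sortedKeys;

    public SortedKeyIndexSketch(List<String> keys) {
        this.sortedKeys = new ArrayList<>(keys);
        Collections.sort(this.sortedKeys);
    }

    // Returns all keys that start with the prefix: binary search finds the
    // first candidate position, then a scan runs while the prefix matches.
    public List<String> lookup(String prefix) {
        int pos = Collections.binarySearch(sortedKeys, prefix);
        int start = pos >= 0 ? pos : -(pos + 1); // insertion point when absent
        List<String> hits = new ArrayList<>();
        for (int i = start; i < sortedKeys.size()
                && sortedKeys.get(i).startsWith(prefix); i++) {
            hits.add(sortedKeys.get(i));
        }
        return hits;
    }

    public static void main(String[] args) {
        SortedKeyIndexSketch index = new SortedKeyIndexSketch(Arrays.asList(
                "name:bob:7", "name:alice:1", "name:alice:4", "age:30:1"));
        System.out.println(index.lookup("name:alice:"));
    }
}
```

Anything beyond equality or prefix matching on the key order (full-text or geo queries, for example) has no equivalent in a scheme like this, which is roughly what "no advanced indexing functionality" means here.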
Limitations:

- Obviously the scalability is limited to the heap size of a single JVM, and no transparent resilience to failures is offered
- The backend offers store-level locking only, whereas a JanusGraph transaction typically changes multiple stores (e.g. vertex store and index store).
Can you elaborate on what 'store-level locking' is? Does it mean 'JanusGraphManagement.setConsistency(element, ConsistencyModifier.LOCK)' won't work for this backend?
It means that the backend can use a Java read-write lock to lock the individual store it modifies, for the duration of the modification, to prevent concurrent modifications on the same store. But it does not attempt to lock all the stores involved in one JanusGraph/backend transaction, and so individual store updates from parallel transactions being committed can interleave.
So say tx1 updates the index store, but before it can update the vertex store accordingly, tx2 updates the index store with its own data. As long as this data is completely different from tx1's, this is actually fine and significantly reduces lock contention. But if they modify the same data, you can get "ghost vertices", "missing vertices", etc.
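The interleaving described above can be replayed deterministically in a small sketch. This is an illustration only, not the backend's implementation: the `Store` class, the `name=alice` key and the `replayInterleaving` method are hypothetical, and the two "transactions" are simulated by issuing their writes in one particular order on a single thread. Each write holds only the lock of the store it touches, which is the property being illustrated.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.concurrent.locks.ReentrantReadWriteLock;

public class StoreLevelLockingSketch {
    // Hypothetical store: a map guarded by its own read-write lock.
    static class Store {
        private final Map<String, String> data = new HashMap<>();
        private final ReentrantReadWriteLock lock = new ReentrantReadWriteLock();

        void put(String key, String value) {
            lock.writeLock().lock(); // held only for this single modification
            try {
                data.put(key, value);
            } finally {
                lock.writeLock().unlock();
            }
        }

        String get(String key) {
            lock.readLock().lock();
            try {
                return data.get(key);
            } finally {
                lock.readLock().unlock();
            }
        }
    }

    // Replays one possible interleaving of two commits that touch the same key.
    static String replayInterleaving() {
        Store indexStore = new Store();
        Store vertexStore = new Store();

        indexStore.put("name=alice", "tx1");  // tx1 updates the index store
        indexStore.put("name=alice", "tx2");  // tx2 gets in before tx1 continues
        vertexStore.put("name=alice", "tx2"); // tx2 finishes its commit
        vertexStore.put("name=alice", "tx1"); // tx1 finishes last

        return "index=" + indexStore.get("name=alice")
                + " vertex=" + vertexStore.get("name=alice");
    }

    public static void main(String[] args) {
        System.out.println(replayInterleaving());
    }
}
```

Every individual write was properly locked, yet the index store ends up holding tx2's value while the vertex store holds tx1's. If the two transactions had used different keys, the same interleaving would have been harmless.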
Note that naively locking all the stores at the beginning of the transaction can lead to deadlocks unless they are always locked in the same order, and can lead to high contention with a large number of parallel transactions, because half of the stores are going to be locked all the time. Basically, implementing this correctly and efficiently would be akin to implementing a robust in-memory database engine, which is not the intent of this backend at all.
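For reference, the "same order" requirement can be sketched as follows. This is a generic lock-ordering illustration, not something the backend does: `StoreLock`, `lockAll`, `unlockAll` and `runDemo` are hypothetical names, and ordering by store name is just one possible choice of a global order. Two threads list the stores in opposite orders, which would risk a deadlock if each locked them as listed.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.concurrent.locks.ReentrantLock;

public class OrderedLockingSketch {
    // Hypothetical store handle: a name used for ordering, plus a lock.
    static class StoreLock {
        final String name;
        final ReentrantLock lock = new ReentrantLock();
        StoreLock(String name) { this.name = name; }
    }

    // Locks every store in ascending name order, regardless of the order
    // in which the caller listed them.
    static List<StoreLock> lockAll(List<StoreLock> stores) {
        List<StoreLock> ordered = new ArrayList<>(stores);
        ordered.sort(Comparator.comparing((StoreLock s) -> s.name));
        for (StoreLock s : ordered) {
            s.lock.lock();
        }
        return ordered;
    }

    // Releases the locks in reverse acquisition order.
    static void unlockAll(List<StoreLock> ordered) {
        for (int i = ordered.size() - 1; i >= 0; i--) {
            ordered.get(i).lock.unlock();
        }
    }

    // Simulates many commits, each holding all store locks while it runs.
    static void commitMany(List<StoreLock> stores, int[] committed) {
        for (int i = 0; i < 1000; i++) {
            List<StoreLock> held = lockAll(stores);
            try {
                committed[0]++; // safe: every store lock is held here
            } finally {
                unlockAll(held);
            }
        }
    }

    // Runs two threads that list the stores in opposite orders and returns
    // the total number of simulated commits.
    static int runDemo() {
        StoreLock index = new StoreLock("index");
        StoreLock vertex = new StoreLock("vertex");
        int[] committed = {0};
        Thread a = new Thread(() -> commitMany(Arrays.asList(index, vertex), committed));
        Thread b = new Thread(() -> commitMany(Arrays.asList(vertex, index), committed));
        a.start();
        b.start();
        try {
            a.join();
            b.join();
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            throw new IllegalStateException(e);
        }
        return committed[0];
    }

    public static void main(String[] args) {
        System.out.println(runDemo());
    }
}
```

The sketch avoids deadlock, but it also shows the contention problem: while one commit holds every store lock, no other commit can make progress, so all commits are effectively serialized.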
however this can happen - e.g. when a large heap nears saturation and the GC pause exceeds configured backend timeout.
- The data layout used by the backend can theoretically be susceptible to fragmentation in certain scenarios
(with a lot of add/delete operations), thus reducing the amount of useful data that can be stored in a heap
of specified size. The backend provides simple mechanisms to report fragmentation and defragment the storage if required.
Are those mechanisms documented somewhere? It would be nice to have a short description on how these mechanisms can be set up/activated.
There is currently no support for activating these via configuration; one would have to dig into the code a little to see how they can be invoked when needed. If and when common patterns for applying them to generic use cases emerge, these patterns can be made configurable. For now, this can be ignored unless there is a suspicion that fragmentation might be a problem for a specific use case.
to explain possible production use cases, limitations and alternatives (issue JanusGraph#1929) Signed-off-by: Dmitry Kovalev <dk.global@gmail.com>
3f2561e to f869a5e
Thank you @porunov, will merge this early next week if there are no objections or further comments.
* Issue JanusGraph#1871: Close graph instance at end of mapper run. Signed-off-by: Ted Wilmes <ted.wilmes@experoinc.com>
* Fixed broken dist docker-compose (updated to supported ES version) Signed-off-by: Michal Podstawski <mpodstawski@gmail.com>
* Minor code cleanup (NPEs etc) Signed-off-by: Michal Podstawski <mpodstawski@gmail.com>
* Spelling fixes Signed-off-by: Josh Soref <jsoref@users.noreply.github.com>
  * actually, amend, assumed, backend, cassandra, centric, check, cohabitors, configured, conjunction, connections, containing, control, currently, default, disabled-the, exhaust, existing, explicitly, generation, geoshape, graph-class, graph, gremlin, implement, increment, information, initial, instance, interfaces, it's, janus, labels, levenshtein, logies, message, nonexistent, overridden, overriding, parameterized, params, parsable, partitioner, password, payload, persistent, preceded, propagate, pseudo, recommended, requires, sequence, submission, temporary, truststore, unknown, upgrading, version, writing
* Add cell ttl support to BerkeleyDB Signed-off-by: Pavel Ershov <owner.mad.epa@gmail.com>
* Improve added relations containers JanusGraph#1700 Signed-off-by: Pavel Ershov <owner.mad.epa@gmail.com>
* Refactor in ES module Signed-off-by: Michał Podstawski <mpodstawski@gmail.com>
* Add fixes for TP tests on berkeley backend Signed-off-by: Pavel Ershov <owner.mad.epa@gmail.com>
* Add fixes for TP tests on berkeley backend Signed-off-by: Pavel Ershov <owner.mad.epa@gmail.com>
* Add log4j.properties to inmemory Signed-off-by: Jan Jansen <jan.jansen@gdata.de>
* JANUSGRAPH-1866 Filter out only system vertices in Hadoop Vertex deserializer. Remove erroneously added unused import. Test that schema vertices are skipped. Hadoop vertices deserialization should skip schema vertices that are created implicitly when defining schema elements like labels. Correct tests for HBase Snapshot input format. Snapshot should be taken before reading the graph in order to have anything to read from. Signed-off-by: Evgeniy Ignatiev <yevgeniy.ignatyev@gmail.com>
* Update Copyright year in documentation CTR [doc only] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* Extract JanusGraph Gremlin driver requirements Signed-off-by: Jan Jansen <jan.jansen@gdata.de>
  * Predicates
  * Geoshape
  * RelationIdenitifier
* Improve the CQLIterator performance by using getPagingStateUnsafe (this should avoid md5sum calculation of resultset) Signed-off-by: Ganesh Guttikonda <gguttikonda@snapfish-llc.com>
* Update to TinkerPop 3.4.4 Fixes JanusGraph#1617 Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* upgrading inmemory backend storage layout to reduce memory footprint (JanusGraph#1483) Signed-off-by: Dmitry Kovalev <dk.global@gmail.com>
* Add testcontainers support for cassandra [full build] Fixes JanusGraph#1475 Signed-off-by: Jan Jansen <jan.jansen@gdata.de>
  * Update jacoco
  * Cleanup pom.xml
  * Introduce profiles for Cassandra
  * Update TESTING.md
* Add 'Getting Started' guide to documentation [doc only] Signed-off-by: Florian Grieskamp <florian.grieskamp@gdata.de>
* Fix installation docs missing hadoop-2 in examples CTR [doc only] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* JanusGraph release 0.3.3 [full build] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* JanusGraph release 0.4.1 [full build] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* [doc only] Updated in-memory backend documentation (JanusGraph#1934) to explain possible production use cases, limitations and alternatives (issue JanusGraph#1929) Signed-off-by: Dmitry Kovalev <dk.global@gmail.com>
* Split up hadoop implementations [full build] Signed-off-by: Jan Jansen <jan.jansen@gdata.de>
* Fix inmemory docs format CTR [doc only] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* Bump jackson2.version from 2.6.6 to 2.10.2 Fixes JanusGraph#1307 Signed-off-by: Jan Jansen <jan.jansen@gdata.de>
* Bump v0.3 branch to 0.3.4-SNAPSHOT CTR [doc only] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
* Bump v0.4 branch to 0.4.2-SNAPSHOT CTR [doc only] Signed-off-by: Oleksandr Porunov <alexandr.porunov@gmail.com>

Co-authored-by: Ted Wilmes <twilmes@gmail.com>
Co-authored-by: micpod <57301006+micpod@users.noreply.github.com>
Co-authored-by: Josh Soref <jsoref@users.noreply.github.com>
Co-authored-by: Pavel <owner.mad.epa@gmail.com>
Co-authored-by: Oleksandr Porunov <alexandr.porunov@gmail.com>
Co-authored-by: Jan Jansen <farodin91@users.noreply.github.com>
Co-authored-by: Evgeniy Ignatiev <YevIgn@users.noreply.github.com>
Co-authored-by: gani8780 <gguttikonda@snapfish-llc.com>
Co-authored-by: Dmitry Kovalev <dk.global@gmail.com>
Co-authored-by: rngcntr <7890887+rngcntr@users.noreply.github.com>
to explain possible production use cases, limitations and alternatives (issue #1929) [doc only]
Signed-off-by: Dmitry Kovalev dk.global@gmail.com
Thank you for contributing to JanusGraph!
In order to streamline the review of the contribution we ask you
to ensure the following steps have been taken:
For all changes:
master)?

For documentation related changes:
[doc only] tag to the first line of your commit message to avoid spending CPU cycles in Travis CI when no code, tests, or build configuration are modified?
Note:
Please ensure that once the PR is submitted, you check Travis CI for build issues and submit an update to your PR as soon as possible.