Description: Minimalistic modular engine for StrokeDB
Homepage: http://strokedb.com
Clone URL: git://github.com/oleganza/strokedb-core.git
name age message
file .gitignore Tue Aug 05 05:16:04 -0700 2008 fixed gitignore [oleganza]
file .gitmodules Sun Sep 21 04:03:52 -0700 2008 4 libs: extlib, declarations, tokyocabinet-wrap... [oleganza]
file MIT-LICENSE Tue Aug 05 04:04:56 -0700 2008 added mit license [oleganza]
file README Tue Aug 05 04:05:46 -0700 2008 some stupid thoughts in the README [oleganza]
file Rakefile Thu Sep 11 02:50:47 -0700 2008 added gem_console and declarations to the Rakef... [oleganza]
file TODO Sun Aug 24 17:54:43 -0700 2008 TODO updated [oleganza]
directory bin/ Sun Sep 21 04:03:52 -0700 2008 4 libs: extlib, declarations, tokyocabinet-wrap... [oleganza]
directory docs/ Sun Aug 10 20:22:30 -0700 2008 added docs/CORE_EXTENSIONS [oleganza]
directory examples/ Fri Aug 22 09:27:58 -0700 2008 added slots API: lazy_accessors (with specs), s... [oleganza]
directory lib/ Sun Sep 21 04:03:52 -0700 2008 4 libs: extlib, declarations, tokyocabinet-wrap... [oleganza]
directory spec/ Sat Sep 06 11:34:07 -0700 2008 extracted declarations, fixed has_many [oleganza]
directory test/ Mon Aug 25 08:04:26 -0700 2008 random thought on modules and metas [oleganza]
directory vendor/ Sun Sep 21 04:03:52 -0700 2008 4 libs: extlib, declarations, tokyocabinet-wrap... [oleganza]
StrokeDB Core.

Minimalistic modular database engine.

Authors: Oleg Andreev <oleganza@idbns.com> & Yurii Rashkovskii <yrashk@idbns.com>

Copyright 2008 IDBNS, inc.

= Architecture overview

StrokeDB stores a set of versioned documents identified by UUID.
A document is a hash-like container of slots: a flat set of values tagged with string keys.
Slots store arbitrary serializable values (the most common types are booleans,
numbers, strings, arrays, and times).
A basic repository implements three retrieval methods:
  * get(uuid) 
  * get_version(version)
  * each{|doc| ... }
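
The three retrieval methods can be illustrated with a minimal in-memory sketch. This is not StrokeDB's implementation: the class name MemoryRepository and the store method are hypothetical, documents are plain Hashes, and it is assumed that each iterates over the latest stored version of every document.

```ruby
# Hypothetical in-memory repository. Only get, get_version and each come
# from the README; everything else is an assumption for illustration.
class MemoryRepository
  include Enumerable

  def initialize
    @heads    = {} # uuid    => most recently stored document
    @versions = {} # version => document
  end

  # Stores a document (a Hash with "uuid" and "version" slots).
  def store(doc)
    @versions[doc["version"]] = doc
    @heads[doc["uuid"]] = doc
    doc
  end

  # Returns the most recently stored version of a document, or nil.
  def get(uuid)
    @heads[uuid]
  end

  # Returns one specific version of a document, or nil.
  def get_version(version)
    @versions[version]
  end

  # Yields the most recently stored version of every document (assumption).
  def each(&block)
    @heads.each_value(&block)
  end
end

repo = MemoryRepository.new
repo.store("uuid" => "doc-1", "version" => "v1", "title" => "draft")
repo.store("uuid" => "doc-1", "version" => "v2", "title" => "final",
           "previous_version" => "v1")

latest = repo.get("doc-1")       # the "v2" document
first  = repo.get_version("v1")  # the "v1" document
```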

Efficient indexing is implemented with views. A view is an indexed, sorted list of key-value
pairs. Each view relies on a map(doc) method to produce its key-value pairs (as in map-reduce).
View queries are implemented as prefix-based searches using the find() method.
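
A rough sketch of the idea, assuming string keys and an in-memory sorted array. SimpleView, update and the key layout are hypothetical; the real find() may accept other arguments.

```ruby
# Hypothetical view: keeps [key, value] pairs sorted by key and answers
# prefix queries with a binary search. Not StrokeDB's actual View class.
class SimpleView
  # The block plays the role of map(doc): it returns an array of
  # [key, value] pairs for one document.
  def initialize(&map)
    @map   = map
    @pairs = []
  end

  # Indexes one document by appending its pairs and re-sorting.
  def update(doc)
    @pairs.concat(@map.call(doc))
    @pairs.sort_by! { |key, _value| key }
  end

  # Returns all pairs whose key starts with the given prefix.
  def find(prefix)
    start = @pairs.bsearch_index { |key, _value| key >= prefix }
    return [] if start.nil?
    @pairs.drop(start).take_while { |key, _value| key.start_with?(prefix) }
  end
end

by_tag = SimpleView.new do |doc|
  uuid = doc["uuid"]
  doc["tags"].map { |tag| ["tag:#{tag}:#{uuid}", uuid] }
end

by_tag.update("uuid" => "a", "tags" => ["ruby", "db"])
by_tag.update("uuid" => "b", "tags" => ["ruby"])

ruby_docs = by_tag.find("tag:ruby:").map { |_key, uuid| uuid }
```

Encoding the document UUID into the key keeps keys unique, so a single prefix query returns every document carrying a given tag.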

The StrokeDB Core API relies on:
  * two document methods: doc["slot"] and doc["slot"]=
  * three slots: "uuid", "version", "previous_version"
  * the ability to store Arrays and Strings in the "previous_version" slot (an Array is used
  when the version is the result of a merge operation)

== Versions

Every document version references zero or more previous versions.
The very first version of a document has no reference to a previous version.
Regular versions reference exactly one previous version.
In case of a merge, the document references two or more previous versions.
Deleting a document simply creates a new version with the {"deleted"=>true} slot.
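
The version rules above could be sketched as follows, using plain Hashes as documents. The Versioning module and its function names are hypothetical, and random UUIDs stand in for version identifiers.

```ruby
require "securerandom"

# Hypothetical helpers illustrating how "previous_version" is filled in.
module Versioning
  module_function

  # First version: no "previous_version" slot at all.
  def create(slots)
    slots.merge("uuid" => SecureRandom.uuid, "version" => SecureRandom.uuid)
  end

  # Regular version: "previous_version" is a single String.
  def update(doc, changes)
    doc.merge(changes).merge(
      "version"          => SecureRandom.uuid,
      "previous_version" => doc["version"]
    )
  end

  # Merge: "previous_version" is an Array of the merged versions.
  def merge(ours, theirs, resolved_slots)
    ours.merge(resolved_slots).merge(
      "version"          => SecureRandom.uuid,
      "previous_version" => [ours["version"], theirs["version"]]
    )
  end

  # Deletion is just another version carrying the "deleted" slot.
  def delete(doc)
    update(doc, "deleted" => true)
  end
end

v1      = Versioning.create("title" => "draft")
v2a     = Versioning.update(v1, "title" => "final")
v2b     = Versioning.update(v1, "author" => "oleg")
merged  = Versioning.merge(v2a, v2b, "author" => "oleg")
deleted = Versioning.delete(merged)
```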

== Guarantees

StrokeDB doesn't guarantee referential integrity. Any document may reference any UUID, even
one that is not available in the repository.


== Views
TODO

== Synchronization (fetch and merge operations)

Data cases:
1. Many new documents, few versions.
2. Many documents with many versions organized linearly.

Branches must be handled efficiently as well, but the first priority is to make the two cases
above as performant as possible.

Fetch contexts:
1. Occasional fetch (hourly, daily etc.)
2. Streamed syncing (i.e. master-slave replication)


The concept.

1. Every repository is identified by UUID.
2. Every repository writes a log of commits.
3. Commit tuples:
   (timestamp, "store", uuid, version)
   (timestamp, "pull",  repo_uuid, repo_timestamp)
4. When you pull from a repository:
   1. Find out the latest timestamp in your history (index: repo_uuid -> repo_timestamp)
   2. If there is no timestamp yet, pull the whole log.
   3. If there is a timestamp for a repository UUID, pull the tail of the log.
   4. For each "store" record: fetch the version.
   5. For each "pull" record: add the repository to a deferred list of repositories waiting for update.
   6. When the whole log is fetched, fetch the deferred repositories. We have two options here:
     1. Fetch from the same repository we've been fetching from a few moments ago (say, fetch B log from A)
     2. Or, fetch directly from the desired repository (B log from B repository)
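
The pull procedure above might look roughly like this in memory. Everything here is an assumption made for illustration: the SyncRepo class, the log_since method, and the use of a per-repository counter in place of real timestamps. The sketch returns the deferred list to the caller rather than choosing between the two options in step 6.

```ruby
require "securerandom"

# Hypothetical repository that keeps a commit log and can pull from another.
class SyncRepo
  attr_reader :uuid, :log, :versions

  def initialize
    @uuid     = SecureRandom.uuid
    @log      = [] # commit tuples, oldest first
    @versions = {} # version => document
    @pulled   = {} # index: repo_uuid => last seen repo_timestamp
    @clock    = 0  # logical clock standing in for timestamps
  end

  # Stores a version and writes a "store" tuple to the log.
  def store(doc)
    @versions[doc["version"]] = doc
    @log << [tick, "store", doc["uuid"], doc["version"]]
  end

  # Whole log when timestamp is nil, otherwise only the tail after it.
  def log_since(timestamp)
    return @log.dup if timestamp.nil?
    @log.select { |entry| entry[0] > timestamp }
  end

  # Pulls new log entries from remote; returns UUIDs of deferred repositories.
  def pull(remote)
    tail = remote.log_since(@pulled[remote.uuid])
    return [] if tail.empty?

    deferred = []
    tail.each do |_timestamp, kind, id, value|
      case kind
      when "store" then @versions[value] ||= remote.versions[value]
      when "pull"  then deferred << id
      end
    end

    last_seen = tail.last[0]
    @pulled[remote.uuid] = last_seen
    @log << [tick, "pull", remote.uuid, last_seen]
    deferred.uniq
  end

  private

  def tick
    @clock += 1
  end
end

a = SyncRepo.new
b = SyncRepo.new
c = SyncRepo.new

a.store("uuid" => "doc-1", "version" => "v1")
b.pull(a)                 # first pull: whole log
a.store("uuid" => "doc-1", "version" => "v2", "previous_version" => "v1")
b.pull(a)                 # second pull: only the tail
deferred = c.pull(b)      # b's log contains "pull" records pointing at a
```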

Partial fetches.

A repository may expose only a limited set of documents for synchronization, specified with a view.
Thus, the view needs to maintain an update log of its own.

TODO: how to efficiently determine which versions are already fetched?

Security.

We may add git-like security features this way:
1. Each version number is a UUIDv5 (SHA-1) of the doc's content.
2. Each subsequent doc UUID is a UUIDv5 of the previous commit's content.
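
A minimal sketch of the first point, assuming the document content is serialized as JSON with sorted keys. The uuid_v5 helper follows RFC 4122 name-based UUIDs. The content_version function, the serialization format and the choice of namespace are hypothetical; the README does not specify them.

```ruby
require "digest/sha1"
require "json"

# RFC 4122 DNS namespace, used here only as a stand-in (assumption).
CONTENT_NAMESPACE = "6ba7b810-9dad-11d1-80b4-00c04fd430c8"

# Name-based UUID (version 5): SHA-1 over namespace bytes + name.
def uuid_v5(namespace_uuid, name)
  ns_bytes = [namespace_uuid.delete("-")].pack("H*")
  bytes = Digest::SHA1.digest(ns_bytes + name.b).bytes.first(16)
  bytes[6] = (bytes[6] & 0x0f) | 0x50 # set version to 5
  bytes[8] = (bytes[8] & 0x3f) | 0x80 # set RFC 4122 variant
  hex = bytes.pack("C*").unpack1("H*")
  [hex[0, 8], hex[8, 4], hex[12, 4], hex[16, 4], hex[20, 12]].join("-")
end

# Hypothetical: derive a version from all slots except "version" itself.
def content_version(doc)
  content   = doc.reject { |slot, _value| slot == "version" }
  canonical = JSON.generate(content.sort_by { |slot, _value| slot }.to_h)
  uuid_v5(CONTENT_NAMESPACE, canonical)
end

v1 = content_version("uuid" => "doc-1", "title" => "draft")
v2 = content_version("uuid" => "doc-1", "title" => "final",
                     "previous_version" => v1)
```

Because "previous_version" is part of the hashed content in this sketch, each version identifier also depends on the identifiers of its ancestors, much like commit hashes in git.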
   
   
== Availability