Fix of Out of Array Bounds & Buffer Overflow errors; Support for Java 10; Some Test Fixes; Merge with latest timrc-git commits #89

Open · wants to merge 10 commits into base: master
255 changes: 255 additions & 0 deletions extra/EngBlog.md
@@ -0,0 +1,255 @@
### MDBM


#### Introduction

Back in 1979, AT&T released a lightweight database engine written by Ken Thompson,
called DBM (http://en.wikipedia.org/wiki/Dbm).
In 1987, Ozan Yigit created a work-alike version, SDBM, which he released into
the public domain.

The DBM family of databases has been quietly powering lots of things "under the
hood" on various versions of Unix. I first encountered it rebuilding sendmail
rulesets on an early version of Linux.

A group of programmers at SGI, including Larry McVoy, wrote a version based on SDBM,
called MDBM, with the twist that it memory-mapped the data.

This is how MDBM came to Yahoo, over a decade ago, where it has also been quietly
powering lots of things "under the hood". We've been tinkering with it since that
time, improving performance for some of our particular use cases, and adding *lots*
of features (some might say too many). We've also added extensive documentation and
tests.


And I'm proud to say that Yahoo! has released our version back into the wild.

- Source code: <https://github.com/yahoo/mdbm>
- Documentation: <http://yahoo.github.io/mdbm/>
- User Group: <https://groups.yahoo.com/groups/mdbm-users/>



#### "Who did what now?..."

These days, all the cool kids are saying "NoSQL" and "Zero Copy" for high performance,
but MDBM has been living it for well over a decade. Let's talk about what these terms
mean, how they are achieved, and why you should care.

The exact definition of "NoSQL" has gotten a bit muddy these days, now including
"not-only-SQL". But at its core, it means optimizing the structure and interface
of your DB to maximize performance for your particular application.

There are a number of things that SQL databases can do that MDBM cannot.
MDBM is a simple key-value store. You can search for a key, and it will return references
to the associated value(s). You can store, overwrite, or append a value to a given key.
The interface is minimal. You can iterate over the database, but there are no "joins",
"views" or "select" clauses, nor any relationship between tables or entities unless
you explicitly create them.

So, if MDBM doesn't have any of these features, why would you want to use it?

1. simplicity
2. raw performance



#### "Keep it simple..."

The API has a lot of features, but using the main functionality is very simple.
Here's a quick example in C:


```c
datum key = { keyString, strlen(keyString) };
datum value = { valueString, strlen(valueString) };
datum found;
/* open a database, creating it if needed */
MDBM *db = mdbm_open(filename, MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
/* store the value */
mdbm_store(db, key, value, MDBM_REPLACE);
...
/* fetch the value (hold the lock while using the returned pointer) */
mdbm_lock_smart(db, &key, 0);
found = mdbm_fetch(db, key);
use_value(found);
mdbm_unlock_smart(db, &key, 0);
...
/* close the database */
mdbm_close(db);
```
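(To build a program like this, you compile against the MDBM headers and link with `-lmdbm`, as the Makefiles in this repo do.)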

Additionally, fully functional databases can be less than 1KB in size. They can also
be many terabytes in size (though that's not very practical yet on current hardware).
However, we do have DBs of tens of gigabytes in common use, in production.



#### Speed... it really is screaming fast

On hardware that was current several years ago, MDBM performed 15 million QPS for
read/write locking, and almost 10 million QPS for partitioned locking.
Both with latencies well under 5 microseconds.

Here's a performance comparison against some other NoSQL databases from a couple of years ago:

Performance: (based on LevelDB benchmarks)
Machine: 8 Core Intel(R) Xeon(R) CPU L5420 @ 2.50GHz

| *Test* | *MDBM* | *LevelDB* | *KyotoCabinet* | *BerkeleyDB* |
|:----------------|------------:|-------------:|---------------:|-------------:|
| Write Time | 1.1 &mu;s | 4.5 &mu;s | 5.1 &mu;s | 14.0 &mu;s |
| Read Time | 0.45 &mu;s | 5.3 &mu;s | 4.9 &mu;s | 8.4 &mu;s |
| Sequential Read | 0.05 &mu;s | 0.53 &mu;s | 1.71 &mu;s | 39.1 &mu;s |
| Sync Write | 2625 &mu;s | 34944 &mu;s | 177169 &mu;s | 13001 &mu;s |
[Performance Comparison]

NOTES:

- These are single-process, single-thread timings.
- LevelDB does not support multi-process usage, and many features must be
  lock-protected externally.
- MDBM iteration (sequential read) is unordered.
- Minimal tuning was performed on all of the candidates.


How does MDBM achieve this performance? There are two important components.

1. "Memory Mapping" - It leverages the kernel's virtual-memory system,
so that most operations can happen in-memory.
2. "Zero-Copy" - The library provides raw pointers to data stored in the MDBM.
This requires some care (valgrind is your friend), but if you need the
performance, it's worth it.
If you want to trade the performance for safety, it's easy to do that too.


#### Memory Mapping - "It's all in your head"

Behind the scenes, Linux (and many other operating systems) keeps often-used parts of files
in memory via the virtual-memory subsystem. As different disk pages are needed, memory pages
are written out to disk (if they've changed) and discarded, and the needed pages are
read into memory. MDBM leverages this system by explicitly telling the VM system to load
(memory-map) the database file. As pages are modified, they are written out to disk, but
writes can be delayed and batched until some threshold is reached, or the pages are
needed for something else.
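To make the mechanism concrete, here's a minimal sketch (plain `mmap(2)`, not MDBM's
actual code) of mapping a file so that reads and writes go straight through the VM system:

```c
/* Minimal sketch, assuming a POSIX system: map a file and modify it in
 * memory; the kernel writes dirty pages back to disk on its own schedule. */
#include <fcntl.h>
#include <stdio.h>
#include <sys/mman.h>
#include <sys/stat.h>
#include <unistd.h>

int main(void) {
    int fd = open("example.db", O_RDWR);
    if (fd < 0) { perror("open"); return 1; }

    struct stat st;
    if (fstat(fd, &st) != 0) { perror("fstat"); return 1; }

    /* MAP_SHARED: changes are visible to other processes and flushed to the file */
    char *base = mmap(NULL, st.st_size, PROT_READ | PROT_WRITE, MAP_SHARED, fd, 0);
    if (base == MAP_FAILED) { perror("mmap"); return 1; }

    base[0] = 'X';                        /* dirty a page; no write() needed */
    msync(base, st.st_size, MS_ASYNC);    /* optional hint to flush sooner */

    munmap(base, st.st_size);
    close(fd);
    return 0;
}
```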

This means less wear-and-tear on your spinning-rust or solid-state disks, but it also
makes a huge difference in performance. Disks are perhaps an order-of-magnitude (10x)
slower than memory for sequential access (reading from beginning to end, or always
appending to the end of a file). However, for random access (what most DBs need),
disks can be 5 orders-of-magnitude (100,000 times) slower than memory.
Solid state disks fare a bit better, but there's still a huge gap.

If there is a lot of memory pressure, you can "pin" the MDBM pages so that the VM system
will keep them in memory. Or you can let the VM page parts in and out, with some
performance hit. But what if your dataset is bigger than your available memory?
Out of the box, MDBM can run with two (or more) levels, so you can have a "cache" MDBM
that keeps frequently used items together in memory and lets less-used entries stay
on disk. You can also use "windowed mode", where MDBM explicitly manages mapping portions
in and out of memory itself (with some performance penalty).
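As a hedged sketch (assuming the v4 API calls `mdbm_set_cachemode`, `mdbm_set_window_size`,
and the `MDBM_OPEN_WINDOWED` open flag as documented), setting up these two modes looks
roughly like:

```c
/* A sketch under the assumptions above; error handling elided for brevity. */

/* Treat this MDBM as an LRU cache; set the mode right after open, before any stores. */
MDBM *cache = mdbm_open("/tmp/cache.mdbm", MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
mdbm_set_cachemode(cache, MDBM_CACHEMODE_LRU);

/* Windowed mode: only a bounded "window" of a huge DB is mapped at any one time. */
MDBM *huge = mdbm_open("/tmp/huge.mdbm", MDBM_O_RDWR|MDBM_OPEN_WINDOWED, 0644, 0, 0);
mdbm_set_window_size(huge, 64UL * 1024 * 1024);  /* e.g. map ~64MB at a time */
```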

#### "Zero-Copy" - "Saved by Zero"

Let's look at what used to be involved in sending a message out over the network:

a) user assembles the pieces of the message into one big buffer in memory (first copy)
b) user calls the network function
c) transition to kernel code
d) kernel copies user data to kernel memory (second copy)
e) kernel notifies the device driver
f) driver copies data to the device (third copy)
g) transition back to user space

Each one of these copies (and transitions) has a very noticeable performance cost.
The Linux kernel team spent a good amount of time and effort reducing this to:

a) user gives a list of pieces to the kernel (no copy)
b) transition to kernel code
c) kernel sets up DMA (direct memory access) for the network card to read and send the pieces
d) transition back to user space
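For illustration, here's a small sketch of step (a) using the standard `writev(2)`
scatter-gather call (the buffer names are hypothetical):

```c
/* Sketch: hand the kernel a list of pieces; userspace never concatenates them. */
#include <sys/uio.h>

ssize_t send_two_pieces(int sock_fd) {
    char header[] = "HDR:";
    char body[]   = "payload";
    struct iovec pieces[2] = {
        { header, sizeof(header) - 1 },
        { body,   sizeof(body) - 1   },
    };
    /* the kernel gathers both pieces in one call: no user-side assembly copy */
    return writev(sock_fd, pieces, 2);
}
```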

If you're connecting to a remote SQL DB over the network, you're incurring these costs for
the request and the response on both sides. If you're connecting to a local service, then
you can replace the driver section with a copy to userspace for the DB server.
(This completely ignores network/loopback latency, and any disk writes for the server.)

For something like LevelDB, you still have to wait for the data to be copied to the
kernel and DMAed to the disk. (LevelDB appends new entries to a "log" file, and squashes
the various log files together in another pass over the data.)

For an MDBM in steady state, you can do a normal store at the cost of one memory copy.
To avoid even that copy, you can reserve space for a value and update it in-place.
The data will be written out to disk eventually by the VM system, but you don't have to
wait for it. NOTE: you can explicitly flush the data to disk, but for the highest
performance, you should let the VM batch up changes and flush them when there are
spare I/O and CPU cycles available.

Because the data is explicitly mapped into memory, once you know the location
of a bit of data, you can treat it like any other bit of memory on the stack
or the heap. For example, you can do something like:
```c
/* fetch a value */
mdbm_lock_smart(db, &key, 0);
found = mdbm_fetch(db, key);
if (found.dptr != NULL) {
    /* increment the entry in-place (assumes the value holds a native int) */
    *(int*)found.dptr += 1;
}
mdbm_unlock_smart(db, &key, 0);
```
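And to skip even the store's initial copy, here's a hedged sketch of the
reserve-then-fill pattern (assuming the documented `MDBM_RESERVE` store flag,
which allocates space without copying value bytes):

```c
/* Sketch under the assumption above: reserve space, then write in place. */
datum val = { NULL, sizeof(int) };   /* only the size matters for the reserve */
mdbm_lock_smart(db, &key, 0);
if (mdbm_store(db, key, val, MDBM_RESERVE) == 0) {
    datum slot = mdbm_fetch(db, key);          /* points into the mapped DB */
    if (slot.dptr != NULL) {
        *(int*)slot.dptr = 42;                 /* fill in place; no extra copy */
    }
}
mdbm_unlock_smart(db, &key, 0);
```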

#### Data Distribution - "It's bigger on the inside..."

MDBM allows you to use various hashing functions on a file-by-file basis, including FNV,
Jenkins, and MD5. So, you can usually find a decent page distribution for your data.
However, it's hard to escape statistics, so you will end up with some pages that have
higher or lower occupancy than others. Also, if your values are not uniformly sized,
then you may have some individual DB entries that vary wildly from the average.
These factors can all conspire to reduce the space efficiency of your DB.

MDBM has several ways to cope with this:

1. It can split individual pages in two, using a form of [Extendible Hashing](http://en.wikipedia.org/wiki/Extendible_hashing).
2. It has a feature called "overflow pages" that allows some pages to be larger than others.
3. It has a feature called "large objects" that allows very big single DB entries (over a configurable size) to be placed in a special area of the DB, outside the normal pages.
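Choosing the hash function is a one-liner at creation time; here's a minimal sketch
(assuming the v4 names `mdbm_sethash` and `MDBM_HASH_FNV`):

```c
/* Sketch under the assumptions above: the hash must be set on an empty DB,
 * before any data is stored, since it determines where keys are placed. */
MDBM *db = mdbm_open("/tmp/example.mdbm", MDBM_O_RDWR|MDBM_O_CREAT, 0644, 0, 0);
if (db != NULL) {
    mdbm_sethash(db, MDBM_HASH_FNV);   /* Jenkins, MD5, etc. are also available */
}
```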


#### "With great power comes great responsibility..."

This all sounds great, but there are some costs of which you should be aware.

On clean shutdown of the machine, all of the MDBM data will be flushed to disk.
However, in cases like power-failure and hardware problems, it's possible for
data to be lost, and the resulting DB to be corrupted. MDBM includes a tool to
check DB consistency. However, you should always have contingencies, and one way
or another that means some form of redundancy...

At Yahoo!, MDBM use typically falls into a few categories:

1. The DBs hold cached data, so a DB can be truncated/deleted and will refill with appropriate data over time.
2. The DBs are generated in bulk somewhere (e.g. a Hadoop grid) and copied to where they are used. They can be re-copied from a source or peer. If they are read-only during use, then corruption is not an issue.
3. The data is transient (e.g. monitoring data), so its loss is less critical.
4. The data needs to persist and is dynamically generated. We typically have some combination of redundancy across machines/data-centers and logging of the data to another channel. In case of damage, the data can be copied from a peer or re-generated from the logged data.


There is one other cost. Because MDBM gives you raw pointers into the DB's
data, you have to be very careful about making sure you don't have array over-runs,
invalid pointer access, or the like. Unit tests and tools like valgrind are a
great help in preventing issues. (You do have unit tests, right?)

If you do run into a problem, MDBM does provide "protected mode", where pages
of the DB individually become writable only as needed. However, this comes
at a noticeable performance cost, so it isn't used in normal production.

You shouldn't let the preceding costs scare you away; just be aware that
some care is required. Redundancy is always your friend.

Yahoo has been using MDBM in production for over a decade, for things both
small (a few KB) and large (tens of GB).
One recent project has DBs ranging from 5MB to 10GB spread across 1500 DBs
(not counting replicas), for a total dataset size of 4 terabytes.

When I first encountered MDBM, we had scaled out what was one of the largest
Oracle instances (at the time) in about every direction it could be scaled.
Unfortunately, the serving side was having trouble expanding enough to meet
latency requirements. The solution was a tier of partitioned (aka "sharded"),
replicated, distributed copies of the data in MDBMs.

If it looks like it might be a fit for your application, take it out for a
spin, and let us know how it works for you.

4 changes: 4 additions & 0 deletions gendoc/README
@@ -31,6 +31,10 @@ print <<EOF
 MDBM Version 4 is a significant improvement to version 3 by optionally using
 pthreads instead of ylock. Additionally, all y* package dependencies are gone.

+Version 4.12.4 (06/12/2018, crowdert)
+- Fix off-by-one in mash.
+- Fix variable usage, () vs {}, in mdbm_reset_all_locks.
+
 Version 4.12.3 (01/12/2016, crowdert)
 - Use realpath unconditionally in mdbm_delete_lockfiles

10 changes: 7 additions & 3 deletions src/java/Makefile
@@ -15,8 +15,12 @@ SONAME=-Wl,-soname,$(LIBNAME).$(LIBVER)
 #-Wl,-rpath,$(DEFAULT_LIB_INSTALL_PATH)

 # TODO: Fix this to not be rh specific
-INCDIR += -I/usr/lib/jvm/java/include -I/usr/lib/jvm/java/include/linux
-LIB_BUILD_FLAGS += -L/usr/lib/jvm/java/jre/lib/amd64/ -ljsig $(LIBRT) -L$(TOPDIR)/src/lib/$(OBJDIR) -lmdbm -Wall -fno-strict-aliasing -Wno-unused-function -D_FILE_OFFSET_BITS=64
+#INCDIR += -I/usr/lib/jvm/java/include -I/usr/lib/jvm/java/include/linux
+#JAVALIBDIR += -L/usr/lib/jvm/java/jre/lib/amd64/
+# IF in ArchLinux, use these lines instead:
+INCDIR += -I/usr/lib/jvm/java-10-jdk/include/ -I/usr/lib/jvm/java-10-jdk/include/linux
+JAVALIBDIR += -L/usr/lib/jvm/java-10-jdk/lib/
+LIB_BUILD_FLAGS += $(JAVALIBDIR) -ljsig $(LIBRT) -L$(TOPDIR)/src/lib/$(OBJDIR) -lmdbm -Wall -fno-strict-aliasing -Wno-unused-function -D_FILE_OFFSET_BITS=64

LIB_DEST=$(PREFIX)/lib$(ARCH_SUFFIX)
ifeq ($(SET_RPATH),1)
@@ -43,4 +47,4 @@ install:: default-make-target
 clean :: clean-objs

 maven:
-	LD_LIBRARY_PATH=../lib/object/ mvn -B clean verify -Djava.awt.headless=true -DfailIfNoTests=false -DnativeDir=../lib/object/ -DlibDir=object/
+	LD_LIBRARY_PATH=../lib/object/ mvn -B clean verify -Dosgi.requiredJavaVersion=1.8 -Djava.awt.headless=true -DfailIfNoTests=false -DnativeDir=../lib/object/ -DlibDir=object/
2 changes: 1 addition & 1 deletion src/java/pom.xml
@@ -18,7 +18,7 @@
     <maven-pmd-plugin-version>3.1</maven-pmd-plugin-version>
     <com.google.testability-explorer.version>1.3.3</com.google.testability-explorer.version>
     <min_jdk>1.7</min_jdk>
-    <max_jdk>1.9</max_jdk>
+    <max_jdk>11</max_jdk>
     <min_jdk_version>${min_jdk}</min_jdk_version>
     <max_jdk_version>${max_jdk}</max_jdk_version>
     <source_jdk_version>${min_jdk_version}</source_jdk_version>
@@ -19,15 +19,16 @@ public abstract class DeallocatingClosedBase extends ClosedBaseChecked {
     protected volatile long pointer = 0L;
     protected volatile Deallocator deallocator;
     @SuppressWarnings("restriction")
-    protected volatile sun.misc.Cleaner cleaner;
+    protected volatile java.lang.ref.Cleaner cleaner;

     @SuppressWarnings("restriction")
     protected DeallocatingClosedBase(long pointer, Dealloc destructor) {
         super(false, true);
         this.pointer = pointer;
         if (null != destructor) {
             this.deallocator = new Deallocator(pointer, destructor);
-            this.cleaner = sun.misc.Cleaner.create(this, deallocator);
+            this.cleaner = java.lang.ref.Cleaner.create();
+            this.cleaner.register(this, deallocator);
         } else {
             this.deallocator = null;
             this.cleaner = null;
Expand Down
@@ -2,9 +2,6 @@
 /* Licensed under the terms of the 3-Clause BSD license. See LICENSE file in the project root for details. */
 package com.yahoo.db.mdbm.internal;

-import java.io.IOException;
-import java.io.PrintWriter;
-import java.io.StringWriter;
 import java.util.concurrent.atomic.AtomicInteger;

 import com.yahoo.db.mdbm.MdbmInterface;
Expand Down
8 changes: 7 additions & 1 deletion src/lib/atomic.h
@@ -78,13 +78,19 @@ static inline void atomic_barrier() {
 static inline void atomic_read_barrier() {
 #ifdef __x86_64__
     __asm__ __volatile__ ("lfence" : : : "memory");
-#else
+#elif __ARM_ARCH_7A__
+    __asm__ __volatile__ ("dmb");
+#else // X86
     __asm__ __volatile__ ("lock addl $0,0(%%esp)" : : : "memory");
 #endif
 }

 static inline void atomic_pause() {
+#ifdef __ARM_ARCH_7A__
+    __asm__ __volatile__ ("yield");
+#else // X86 & X86_64
     __asm__ __volatile__ ("pause");
+#endif
 }


13 changes: 8 additions & 5 deletions src/lib/log.c
@@ -127,7 +127,10 @@ int mdbm_log_vlogerror_at (const char* file, int line, int level, int error, con
     if (len < sizeof(buf)) {
         mdbm_strlcpy(buf+len,strerror(error),sizeof(buf)-len);
     }
-    return mdbm_log_vlog_at(file, line, level,buf,NULL);
+    {
+        va_list unused;
+        return mdbm_log_vlog_at(file, line, level,buf,unused);
+    }
 }


@@ -170,11 +173,11 @@ int mdbm_log_vlog_at (const char* file, int line, int level, const char* format,
         offset += sizeof(FATAL)-1;
     }

-    if (args) {
+    //if (args) {
         vsnprintf(buf+offset,sizeof(buf)-offset-2,format,args);
-    } else {
-        mdbm_strlcpy(buf+offset,format,sizeof(buf)-offset-2);
-    }
+    //} else {
+    //    mdbm_strlcpy(buf+offset,format,sizeof(buf)-offset-2);
+    //}

     buflen = strlen(buf);
     if (buf[buflen-1] != '\n') {