a simple distributed spill-LRU-to-disk data store

MemPool

Simple distributed datastore that supports custom serialization, spilling least recently used data to disk and memory-mapping.

Usage

using Distributed # provides addprocs on Julia 0.7 and later; omit on 0.6
addprocs(4)
@everywhere using MemPool
@everywhere MemPool.max_memsize[] = 10^9 # 1 GB per worker

This sets the memory limit on each process to 10^9 bytes (1 GB). When the pool exceeds this limit, the least recently used data is written to disk with movetodisk (described below) until the total pool size drops back under the limit. Spilled data is written to a directory called .mempool and can be read back with memory mapping. For efficiency, overriding mmwrite and mmread (described in the next section) for your own types is recommended.
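The spilling policy described above can be illustrated with a toy, single-process pool. This is only a sketch of the idea and not MemPool's implementation: ToyPool, toyset!, toyget! and spill! are hypothetical names, sizes are estimated with Base.summarysize, and spilled values are written with the standard Serialization library rather than mmwrite. It assumes Julia 1.x.

```julia
using Serialization

# Toy pool (hypothetical, not MemPool's code): values stay in memory until the
# byte budget is exceeded, then least recently used entries are written to disk.
mutable struct ToyPool
    limit::Int                  # memory budget in bytes
    used::Int                   # bytes currently held in memory
    inmem::Dict{Int,Any}        # id => value held in memory
    order::Vector{Int}          # ids, least recently used first
    ondisk::Dict{Int,String}    # id => path of the spilled value
    dir::String                 # directory that receives spilled values
end

ToyPool(limit::Int) =
    ToyPool(limit, 0, Dict{Int,Any}(), Int[], Dict{Int,String}(), mktempdir())

# Write least recently used entries to disk until the pool fits its budget.
# The most recently used entry always stays in memory.
function spill!(p::ToyPool)
    while p.used > p.limit && length(p.order) > 1
        id = popfirst!(p.order)
        x = pop!(p.inmem, id)
        path = joinpath(p.dir, string(id))
        open(io -> serialize(io, x), path, "w")
        p.ondisk[id] = path
        p.used -= Base.summarysize(x)
    end
end

# Store x under id, mark it most recently used, and spill if over budget.
function toyset!(p::ToyPool, id::Int, x)
    p.inmem[id] = x
    push!(p.order, id)
    p.used += Base.summarysize(x)
    spill!(p)
    return id
end

# Fetch a value. Spilled values are read back, cached again, and marked
# most recently used, which may push other entries out to disk.
function toyget!(p::ToyPool, id::Int)
    if haskey(p.inmem, id)
        filter!(i -> i != id, p.order)
        push!(p.order, id)
        return p.inmem[id]
    end
    x = open(deserialize, pop!(p.ondisk, id))
    toyset!(p, id, x)
    return x
end

# Budget for two arrays of 100 Float64s; storing a third spills the oldest.
pool = ToyPool(2 * Base.summarysize(fill(0.0, 100)))
toyset!(pool, 1, fill(1.0, 100))
toyset!(pool, 2, fill(2.0, 100))
toyset!(pool, 3, fill(3.0, 100))
@assert haskey(pool.ondisk, 1)                  # id 1 was least recently used
@assert toyget!(pool, 1) == fill(1.0, 100)      # read back from disk
@assert haskey(pool.ondisk, 2)                  # id 2 was spilled to make room
```

MemPool applies the same idea per worker process, with max_memsize as the budget.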

Data store functions:

  • poolset(x::Any, pid=myid()): store the object x on pid. Returns a DRef object.
  • poolget(r::DRef): gets the data stored at r. If the data has been moved to disk, it is read on the caller side.
  • pooldelete(r::DRef): removes the data at r, including any copy on disk that was not saved using savetodisk.
  • movetodisk(r::DRef): moves data to disk and releases it from memory. Uses MemPool.mmwrite to write to disk (see the section below). Returns a FileRef which can be passed to poolget to read the data. Further poolget calls on r itself cause the data to be read from disk, cached in memory, and marked most recently used.
  • copytodisk(r::DRef): copies data to disk, keeping the original copy in memory. A subsequent poolget(r) reads the data from disk on the calling process, or returns the cached value if the calling process owns the ref.
  • savetodisk(r::DRef, path): saves data to a given file path. Leaves original data in memory, doesn't affect LRU accounting. Use this when you want to explicitly save data using the format described below.
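A minimal single-process sketch of how these functions fit together, assuming the version of MemPool described in this README is installed and exports the functions listed above. The variable names are illustrative only.

```julia
using MemPool

data = rand(1000)

r = poolset(data)      # store on the current process; returns a DRef
x = poolget(r)         # fetch the value back from memory

f = movetodisk(r)      # release from memory; returns a FileRef
y = poolget(f)         # read the value back through the FileRef

@assert x == data
@assert y == data

pooldelete(r)          # remove the data, including the copy on disk
```

To store the object on another worker, pass its process id as the second argument to poolset, for example poolset(data, 2) after adding workers.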

MemPool.mmwrite, MemPool.mmread

mmwrite and mmread are fast alternatives to Base.serialize and Base.deserialize that can memory-map data when it is read from disk. They fall back to Base.serialize so as to support all Julia types. This format is only suitable for temporary storage, since the implementations of all four functions can change.

  • mmwrite(s::AbstractSerializer, x::Any) is called to write data to the wire or to a file when data needs to be transferred or written to disk. Packages can define how parts of their data structures are written in a raw format that can be mmapped back later with mmread. mmwrite must begin with the call Base.serialize_type(s, MemPool.MMSer{typeof(x)}) so that Julia's base serializer dispatches any deserialization to mmread.
  • mmread(::Type{T}, io::AbstractSerializer) is called to deserialize data written with mmwrite.
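To see why a raw on-disk layout matters, here is a small sketch using only Julia's standard Mmap library. It does not use MemPool and does not reproduce MemPool's file format; the length header is an assumption made for illustration. It writes the bytes of an array directly and maps them back without copying the data into memory first.

```julia
using Mmap

arr = collect(1.0:8.0)
path = tempname()

# Write a small header (the length) followed by the raw bytes of the array.
open(path, "w") do io
    write(io, Int64(length(arr)))
    write(io, arr)
end

# Read the header normally, then map the rest of the file instead of copying it.
io = open(path, "r")
n = read(io, Int64)
mapped = Mmap.mmap(io, Vector{Float64}, (n,))   # offset defaults to position(io)

@assert mapped == arr
close(io)
```

A value written with the generic serializer cannot be mapped this way, which is why a custom mmwrite that emits raw bits-type data pays off for large arrays.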

mmwrite can currently store Array{String} much more efficiently than Base. It is also extended for fast storage of NullableArrays, PooledArrays, and IndexedTables by JuliaDB.