Skip to content

Simonl07/mrsketch

Folders and files

NameName
Last commit message
Last commit date

Latest commit

 

History

16 Commits
 
 
 
 
 
 
 
 
 
 
 
 

Repository files navigation

MRSketch

Simple library for building multi-resolution sketches: Multi-level streaming statistics & Multi-level online machine learning

Supports arbitrary number of dimensions, levels of resolution, and node states

Install

pip install mrsketch

Quick Start

First, define the individual states that need to be stored at each resolution by specifying these functions to a state descriptor:

Assuming we have a dataset/stream that contains [Year, Month, Day, Value] for each record

we want to keep track of the max value at each level of resolutions (year, month, and day levels)

from mrsketch import Graph, StateDescriptor

max_state = StateDescriptor(
    constructor=lambda: [0],                                   # mutable list to keep track of max
    updater=lambda x, state: state[0] = max(state[0], x[3]),   # update the max by comparing with new record's value
    merger=lambda state1, state2: [max(state1[0] + state2[0])],# the max of two max as a new state
    serializer=lambda state: repr(state[0]),                   # python's repr
    deserializer=lambda x: [eval(x)],                          # reconstruct the state from input string
    extractor=lambda x: x[0]                                   # when get(), extract the max from the list (state)
)
Parameters default description use case
constructor Required function that takes no input a return a newly initialized state Used by the Graph to initalize new nodes
updater Required function that takes an input record, the state of a given node, and mutate the state with the input record Used by the graph to update and populate effected nodes upon inserting a record
merger Required function that two input state and combined into a new state Used by the Graph to perform graph merges and aggregated queries
serializer Required function that takes a state as input and return the serialization of the given state as string, serializer and deserializer must be inverse functions Used by the Graph to perform graph serialization
deserializer Required function that takes a string as input and return a deserialized state from the input, serializer and deserializer must be inverse functions Used by the Graph to perform graph deserialization
extractor lambda state: state optional function that defines the behavior of state extraction on graph.get() Used as the final stage of get(), once the states of relevant nodes are merged, extractor is applied on the final merged state to support custom extraction behavior. Default function is Identity, which returns the entire state obj.

Serializer and deserializer must be inverse functions: deserializer(serializer(G)) == G

Once the StateDescriptor is defined, pass it into the Graph constructor:

g = Graph(
    state_descriptors={
        'max': max_state
    },
    path_extractor=lambda x: {
        'time': list(x[:3]),
    }
)

The path_extractor is a function that takes a record as input and returns a NDPath

In this example the path along the time dimension is extracted by slicing the first three items in each record

This tells the graph how the hierarchical data can be extracted from a given record

now lets push some records into the graph:

g.push([2020, 10, 30, 2])
g.push([2020, 10, 31, 2])
g.push([2020, 10, 31, 5])
g.push([2020, 11, 2, 10])

The graph will update the state of all nodes that are impacted by the dates. To extract, specify an NDPath

g.get({'time': [2020, 10, 30]}) # -> 2
g.get({'time': [2020, 10]}) # -> 5
g.get({'time': [2020]}) # -> 10

Design & Documentation

Path: Path represents a sequence of values along a specific hierarchical tree of resolution: Jun 16th, 2017 can be encoded as [2017,6,16], where indices encodes the level of hierarchy (2017 - top/root level, 16 - lowest/leaf level), and every next index is the child of (or nested in) the previous index. Geohash 8ygzq can be encoded as [8, y, g, z, q]

NDPath: N-dimensional path. A mapping of dimension names to Path. A Path only describes the path along a single dimension, the graph supports arbitrary number of multi-resolution dimensions:

g = Graph(
    state_descriptors={
        'max': max_state
    },
    path_extractor=lambda x: {
        'time': list(x[:3]),
        'geohash': list(x[3:10]) # <- add arbitrary dimensions as long as its 
    }                             # hierarchy (path) can be extracted from the records
)

Graph: A Directed Acyclic Graph where node is identified by an NDPath describing its coordinates among N dimensions. The node manages arbitrary amount of states described by the graph state_descriptors. A node points to other child nodes with NDPath that overlapse with itself and extend by 1 level of resolution in any dimension. So time=2018/1/1, geohash=abc is a children of time=2018/1, geohash=abc, but not a children of time=2019/1, geohash=abc

Portability: The (de)serializer functions specified in state descriptors are intented to be used by the graph to produce a serialized byte sequence that snapshots the entire state. The serialized bytes in the form of json or pickle can then be sent over the network and restored into memory.

Thread safety: Concurrent read/write are protected by read/write locks within each node. This improves concurrency by allowing writes to different region in the graph.

About

Multi-dimensional + Multi-resolution graph

Resources

License

Stars

Watchers

Forks

Releases

No releases published

Packages

No packages published