2.2. Data management architecture

#Data Management Goals

#Streams

A Stream is a Ruby class that abstracts as much useful data management architecture as possible. A stream can be visualized as a possibly infinite table consisting of rows and columns. Each column has a unique data type (see 2.2.9. Column Data Types). This project intends to leave most data management issues to the underlying database management systems. Since Ruby on Rails has been chosen as the user interface and database interface, the capabilities 2.2.2.-ActiveRecord abstracts from or simulates on top of the underlying database system are the easiest to support. The tools specific to the underlying database system remain available, since 2.2.2.-ActiveRecord makes few demands on the underlying database. Streams include:

##Stream Inputs

An input Enumerator method next returns the next data value. Note this is more primitive than the commonly used each method, so as to allow for collections of unbounded (e.g. infinite) size, as is typical in real-time data acquisition. When the position reaches the end, StopIteration is raised. Methods such as data_valid, end_of_stream, wait_for_data, and data_available? are needed to interrogate real-time data status. Enumerable methods imply a finite set of data, so they can only apply to historical data. Acquired data may not be available for an unknown length of time, so each should be called in a separate thread so that the program does not hang. A method parallel_each (or each_new) could implement that; a minimal sketch follows.
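
A minimal sketch of such a Stream input, assuming a finite in-memory source standing in for a real-time one; StreamInput and its rows are illustrative names, and only next, data_available?, and parallel_each from the methods above are shown:

```ruby
class StreamInput
  def initialize(rows)
    @rows = rows        # source rows; could be unbounded in real use
    @position = 0
  end

  # More primitive than each: returns one value at a time, so the
  # stream may be unbounded. Raises StopIteration at end of data.
  def next
    raise StopIteration if @position >= @rows.size
    row = @rows[@position]
    @position += 1
    row
  end

  # Real-time status interrogation (trivial for an in-memory source).
  def data_available?
    @position < @rows.size
  end

  # Calls each in a separate thread so a blocked read cannot hang
  # the caller; Kernel#loop exits when StopIteration is raised.
  def parallel_each(&block)
    Thread.new do
      loop { block.call(self.next) }
    end
  end
end

input = StreamInput.new([{ id: 1, value: 3.2 }, { id: 2, value: 4.7 }])
puts input.next.inspect while input.data_available?
```
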
##Stream Outputs

An output method (named something like save, put, or write) is matched to an input enumerator that later retrieves the data. Typically each output method execution produces a row that is later consumed by a matched input Enumerator.

##Stream Methods

###Stream Data Buffers

A data buffer can be implemented as any of the following (using duck typing); a sketch of the first option follows the list:

  1. an Array of Hashes of name => value pairs,

  2. 2.2.2.-ActiveRecord - When implemented as a SQL database, an id primary key is added along with creation and update time-stamps. A logical primary key is added as a unique index to prevent duplicate entries.

  3. 2.2.3.-ActiveModel

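A minimal sketch of the first option, an Array of Hashes behind duck-typed save (output) and each_row (input Enumerator) methods; HashBuffer and its method names are illustrative:

```ruby
class HashBuffer
  def initialize
    @rows = []
  end

  # Output method: each call appends one row of name => value pairs.
  def save(row)
    @rows << row
  end

  # Matched input side: an Enumerator over previously saved rows, so
  # consumers can call next and get StopIteration at the end.
  def each_row
    @rows.each
  end
end

buffer = HashBuffer.new
buffer.save(name: :temperature, value: 21.5)
buffer.save(name: :temperature, value: 21.7)
reader = buffer.each_row
puts reader.next.inspect  # => {:name=>:temperature, :value=>21.5}
```
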
###Ruby Expressions

A Ruby expression (including 2.2.3.-Active relation) performing a data transformation, such as a digital comparison of two of the above, generating a stream of exceptions (set differences, intersections, etc.). A sketch of such a comparison follows the life-cycle list below.

###A Statistical Model

A statistical (2.2.4.-R) model provides a way of compressing (usually analog) data by predicting expected data value distributions, generating a stream of unexpected outliers. See 5.1. Mathematics as Compression. This may include data stored in R data tables. The output of such a model could be a prediction of data values or user outputs such as reports or graphs.

#Stream Buffer Life-cycle

Eventually each stream input is consumed and deleted, to avoid an unbounded need to buy more disk drives. But for practical reasons some data's consumption is delayed indefinitely:

  1. Test data, error messages, and historical trends are saved to support development.
  2. Statistics representing a set of historical values that are well enough characterized that the original data provides little added value.
  3. Data representing external objects with long life-cycles.
  4. Supporting data for a statistical model: training data, interesting outliers, and context around anomalies. It is assumed that for large data sets this supporting data is much smaller than keeping all acquired data.

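A minimal sketch of such a comparison, assuming two finite in-memory buffers of Hash rows as in the buffer sketch above; Array's set operations stand in for a database-side implementation:

```ruby
left  = [{ id: 1, v: 10 }, { id: 2, v: 20 }, { id: 3, v: 30 }]
right = [{ id: 2, v: 20 }, { id: 3, v: 31 }]

only_left  = left - right    # rows present only on the left
only_right = right - left    # rows present only on the right
matching   = left & right    # rows identical on both sides

# The exceptions themselves form a stream for further processing.
exceptions = (only_left + only_right).each
puts exceptions.next.inspect  # => {:id=>1, :v=>10}
puts matching.size            # => 1
```
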
#Convention over Configuration

In keeping with the Ruby on Rails principle of [Convention over Configuration](http://en.wikipedia.org/wiki/Convention_over_configuration), following certain conventions gains simplicity without losing flexibility. Since a major goal is to discover and create database tables, by default tables will follow Ruby on Rails conventions (a sketch of a matching migration follows the list):

  1. a serial auto-incrementing unique primary key named "id" will be created,

  2. foreign keys will be named "table-name"_id,

  3. two time-stamp fields, created_at and updated_at, will be added.
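
A sketch of a migration following these conventions, assuming a Rails environment; the samples table and stream_id foreign key are hypothetical names used only for illustration:

```ruby
class CreateSamples < ActiveRecord::Migration
  def change
    create_table :samples do |t|  # adds the serial "id" primary key by default
      t.integer :stream_id        # foreign key named "table-name"_id
      t.float   :value
      t.timestamps                # adds created_at and updated_at
    end
  end
end
```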

#Scheduling

A major performance concern is the conversion between the above data types. Since the system needs to deal with very large data sets, the following techniques are proposed for adequate performance (a sketch of two of them follows the list):
  1. Sampling - processing small samples of large potential data sets allows interactive discovery of import data structure. Small samples of data can also be analyzed and plotted to interactively discover the appropriate data analysis. It may also be useful to have some interface for processing progressively larger samples.

  2. Infrequent conversions with caching - a commit mechanism signals when the import data structure and/or data analysis are sufficiently stabilized to invest in data analysis of the complete data set. Background computations of the complete data analysis run at an update rate adaptive to the work load.

  3. Lazy conversion - delay conversion until requested.

  4. Batch conversion - convert as a background task.

  5. Hybrid - append converted and unconverted tables. This implies incremental and out-of-order conversion - probably more difficult than allowing only a converted DB with incremental, in-order conversion.

  6. R data tables are probably most efficiently converted as a batch.
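
A sketch combining two of these techniques, sampling and lazy conversion with caching; LazyConverter and its convert_row step are hypothetical:

```ruby
class LazyConverter
  def initialize(rows, sample_size: 100)
    @rows = rows
    @sample_size = sample_size
    @cache = {}
  end

  # Sampling: interactive work sees only a small prefix of the data.
  def sample
    @rows.first(@sample_size).map { |row| convert(row) }
  end

  # Lazy conversion with caching: convert a row only when requested,
  # then remember the result so repeated requests are cheap.
  def convert(row)
    @cache[row] ||= convert_row(row)
  end

  private

  # Hypothetical per-row conversion between data representations.
  def convert_row(row)
    row.transform_values(&:to_f)
  end
end

converter = LazyConverter.new([{ v: "1.5" }, { v: "2.5" }])
puts converter.sample.inspect  # => [{:v=>1.5}, {:v=>2.5}]
```
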
#Ruby interaction with the above data types

  1. Using get_attributes and set_attributes, the above data types should be convertible to and from a value hash, where the keys are the symbol names of the columns and the values are the typed values of each row.

  2. The alternative syntax row_name[:column_name] can be used in expressions or assigned to. A sketch of both follows.
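
A sketch of this value-hash interface on a plain Ruby row object standing in for the data types above; Row is an illustrative name, and get_attributes/set_attributes follow the naming used in this section rather than ActiveRecord's own API:

```ruby
class Row
  def initialize(attributes = {})
    @attributes = attributes
  end

  # Convert to a value hash: column symbols => typed values.
  def get_attributes
    @attributes.dup
  end

  # Convert from a value hash.
  def set_attributes(hash)
    @attributes = hash.dup
  end

  # Alternative syntax: row[:column_name] in expressions...
  def [](column_name)
    @attributes[column_name]
  end

  # ...or as an assignment target.
  def []=(column_name, value)
    @attributes[column_name] = value
  end
end

row = Row.new(name: :pressure, value: 101.3)
row[:value] += 0.2
puts row.get_attributes.inspect  # => {:name=>:pressure, :value=>101.5}
```
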
#Interactive versus Background Processing

Interactive processing involves editing the specifications for importing input data and for data analysis outputs. Editing database records containing specifications is not a compute-intensive task, so feedback from a small sample of data allows errors to be detected before committing to a long data analysis. The precommit data sample can optionally be larger than the editing data sample. Background processing should be used to do the complete data analysis at a rate consistent with decent interactive performance. A sketch follows.
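
A sketch of this workflow, assuming a hypothetical analyze computation; the sample sizes are arbitrary:

```ruby
def analyze(rows)  # hypothetical stand-in for a complete data analysis
  rows.sum { |row| row[:value] }
end

full_data = Array.new(1_000_000) { |i| { value: i * 0.001 } }
editing_sample   = full_data.first(100)     # instant feedback while editing
precommit_sample = full_data.first(10_000)  # optional larger check

puts analyze(editing_sample)    # interactive: fast enough to edit against
puts analyze(precommit_sample)  # run before committing to the full analysis

# After commit, the complete analysis runs in the background so the
# interactive session stays responsive.
background = Thread.new { analyze(full_data) }
# ... interactive editing continues here ...
puts background.value  # joins the thread and returns its result
```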

#Computations

Computations can be done in Ruby expressions, the underlying SQL database (via 2.2.2.-ActiveRecord), or 2.2.4.-R statistical models.
