
The MongoDB database driver

While OPA is endowed with its own internal database, many OPA users have expressed a desire to deploy existing database solutions as part of their OPA projects. The MongoDB database is one such database and is popular with web-based database applications.

In this chapter, we describe the current state of support for MongoDB in the OPA standard library. We assume some familiarity with MongoDB concepts and particularly with the MongoDB shell. This familiarisation can be gained by reading the MongoDB tutorial.

Introduction

MongoDB is a server-based, document-oriented, non-relational database intended to be scalable and fast. Documents are stored in a binary JSON-like format called BSON. Although BSON has a richer set of types than JSON, it is 100% compatible with JSON. For speed, MongoDB does not implement joins; instead it provides a powerful query language of its own, and almost anything that can be done with a relational database can be implemented in MongoDB with a little effort (see MongoDB’s page on SQL compatibility).

In addition, MongoDB allows multiple indices into its data although these are not automatic and have to be initiated in client code. MongoDB is intended to be deployed in reliable large-scale web-based applications and thus has features which facilitate scalability such as sharding and master-slave arrangements of servers along with features for reliability such as replicated servers with failover.

Backups of MongoDB data are usually done either offline on a slave server in the network using external tools or to redundant nodes in the MongoDB server network.

Overview

The OPA support for MongoDB consists of a hierarchy of modules leading to successively higher-level programming.

Bson

Support for the BSON binary format is in the form of the Bson module; all other modules are built on top of this one. In general, BSON values are handled by the Bson.document OPA datatype, but we also provide the Bson.opa2doc and Bson.doc2opa functions to allow conversion between OPA types and BSON documents.

MongoCommon

This contains general support routines for dealing with replies from the MongoDB server. These include:

  • printing results to meaningful strings

  • testing results for error status

  • handling tag lists instead of bit-mapped integers

  • extracting fields and OPA types from MongoDB replies

MongoConnection

The code which talks to the MongoDB server is in the private MongoDriver module. This includes support for cursors and for replica sets with automatic reconnection on failover, but for programming at this level we provide a single all-purpose module called MongoConnection.

Advanced programmers wishing to use some of the more obscure features of MongoDB can use the driver code directly but this is not recommended. MongoDB has a complex API involving over 70 functions and many of the simple access commands have numerous options. Our intention with this driver is to make accessing MongoDB databases as simple and logical as possible while still exposing the power and flexibility of the MongoDB engine.

MongoCommands

As an adjunct to the low-level programming interface we provide MongoCommands, a module implementing a large (but still incomplete) subset of the MongoDB command set. These encompass most functions that will be required for meta-programming the MongoDB database, such as dropDatabase, repairDatabase, createCollection and so on, plus functions associated with normal database access operations such as getLastError. The more advanced MongoDB functionality is also supported here, including findAndModify and the very powerful mapReduce function.

These commands occur in two flavours, those which return Bson.document values and those which convert their results into OPA types. If you are only looking for a single value out of a large and complex reply document then using the Bson module access functions on the raw BSON may be more efficient. If you intend complex analysis of the reply then the OPA types may be more convenient. At the present time only partial support is provided for OPA types. Some command results may never be treated this way because they include arbitrary field names which we can’t safely convert into OPA types.

MongoCollection

This module represents a type-safe view of the low-level routines in MongoConnection. Here, we insist upon OPA types as arguments and results from MongoDB operations. This necessarily limits what we can put into the database since the BSON documents stored in the database have to be consistent with the OPA types they represent.

To achieve this, we have implemented the MongoSelect and MongoUpdate modules which enforce a type discipline upon the arguments to, for example, MongoCollection.insert. The type safety is implemented as run-time type checks so there is a significant performance penalty for using these routines. In the future, however, we will provide fully type-safe compile-time type checks along the lines of the OPA internal database.

Programming

Here, we provide some notes on programming with the OPA MongoDB driver. The full interface is too large for complete coverage here; refer to the online OPA API documentation for detailed notes on each function.

Using BSON types in OPA

The full OPA BSON datatype is as follows:

/**
 * A BSON value encapsulates the types used by MongoDB.
 **/
type Bson.value =
    { Double: float }
  / { String: string }
  / { Document: Bson.document }
  / { Array: Bson.document }
  / { Binary: string }
  / { ObjectID: string }
  / { Boolean: bool }
  / { Date: Date.date }
  / { Null }
  / { Regexp: (string, string) }
  / { Code: string }
  / { Symbol: string }
  / { CodeScope: (string, Bson.document) }
  / { Int32: int }
  / { Timestamp: (int, int) }
  / { Int64: int }
  / { Min }
  / { Max }

/**
 * A BSON element is a named value.
 **/
type Bson.element = { name:string; value:Bson.value }

/**
 * The main exported type, a BSON document is just a list of elements.
 */
type Bson.document = list(Bson.element)

While values of this type can be constructed manually:

doc = [{name="$eval"; value={Code="function(x,y) \{return x*y;}"}},
       {name="args"; value={Array=[{name="0"; value={Int32=6}},
                                   {name="1"; value={Int32=7}}]}}]

there are two more convenient ways of constructing BSON values. Firstly, we provide a set of abbreviations in the Bson.Abbrevs module:

H = Bson.Abbrevs
doc = [H.code("$eval","function(x,y) \{return x*y;}"),
       H.valarr("args",[{Int32=6},{Int32=7}])]

Secondly, we can construct the values in OPA and use Bson.opa2doc:

doc = Bson.opa2doc({`$eval`=("function(x,y) \{return x*y;}":Bson.code);
                    args=([6,7]:list(Bson.int32))})

Notice that to get a field with non-alphanumeric characters we have to backquote the field name in the OPA value, and that to control the representation in the BSON type we can apply helper types; for example, Bson.code is just a string, but it instructs Bson.opa2doc to treat it as code. Remember also to escape curly brackets in strings. Note that to get Int32 values you need the Bson.int32 type; the default for int is actually Bson.int64.

There are several such types provided by the Bson module but some merit special mention:

  • Optional types have a special significance with respect to Bson.doc2opa: if a field value is missing in the document, it will appear in the OPA type as {none}. The reverse direction does not apply; {none} values are represented in the BSON document as { none : null }.

type Bson.register('a) = {present:'a} / {absent}
  • We take this one step further, however, with the Bson.register type (shown above), which behaves much like option('a) except that when we call Bson.opa2doc any {absent} values are omitted from the resulting document altogether. Note that there is a module Bson.Register which provides the same functionality for Bson.register as the Option module does for type option.

  • Two other cases should be mentioned. Both list and intmap are mapped onto Array values in BSON. The difference is that list is mapped to consecutively numbered elements in the Array document whereas intmap allows sparse arrays.

As a rough guide to Bson.opa2doc and Bson.doc2opa, the following simple schema shows the mapping:

  // We use a "natural" mapping of constant types
  float <-> Double
  string <-> String
  Bson.binary <-> Binary
  Bson.oid <-> ObjectID
  bool <-> Boolean
  Date.date <-> Date
  void <-> Null
  Bson.regexp <-> Regexp
  Bson.code <-> Code
  Bson.symbol <-> Symbol
  Bson.codescope <-> CodeScope
  Bson.int32 <-> Int32
  Bson.timestamp <-> Timestamp
  Bson.int64 <-> Int64
  Bson.min <-> Min
  Bson.max <-> Max

  // Basic record scheme
  {a:'a; b:'b} <-> { a: 'a, b: 'b }

  // Sum types
  {a:'a} / {b:'b} <-> { a: 'a } <or> { b: 'b }

  // Non-record types are called "value"
  'a <-> { value: 'a }

  // Special cases

  // Default for int is Int64
  int <-> Int64

  // Options
  option('a):
    {some=a} <-> { some : 'a }
    {none} <-> { none : null }
    {none} <- { }

  // Registers
  Bson.register('a):
    {present=a} <-> { present : 'a }
    {absent} <- { absent : null }
    {absent} <-> { }

  // Lists are consecutive arrays
  list('a) <-> { Array=(<label>,{ 0:'a; 1:'a; ... }) }

  // Intmaps are non-consecutive arrays
  ordered_map(int,'a) <or>
  intmap('a) <-> { Array=(<label>,{ 1:'a; 3:'a; ... }) }

  // Bson.document is treated verbatim (including labels)
  Bson.document <-> Bson.document

Notes:

  • For ObjectID values, there are a couple of routines which convert between (hex value) strings and the BSON representation, Bson.oid_of_string and Bson.oid_to_string. You can also create a BSON-style OID value with Bson.new_oid.

  • Bson.document types are completely write-through, ie. they are not processed at all.

  • In case you’re wondering, Min and Max are used in sharded databases to indicate infimum and supremum bounds on sharding regions, respectively.
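To make the mapping concrete, here is a small illustrative example (the person type and its field names are made up for this sketch):

type person = { name: string; age: option(int); nick: Bson.register(string) }

joe = ({name="Joe"; age={some=44}; nick={absent}} : person)

// Following the schema above, Bson.opa2doc(joe) yields a document
// equivalent to { name: "Joe", age: { some: 44 } }: the int becomes an
// Int64, the option becomes a sub-document and the {absent} register is
// omitted entirely. Reading that document back with Bson.doc2opa
// restores nick={absent}.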

Using the low-level interface

Connecting to and using the low-level drivers should be done using the MongoConnection module. This gathers together various low-level features in a single module.

Opening a connection to the MongoDB server

The preferred method is to use the system of named connections, which can be defined from the command line or set up internally using the Mongo.param type and the MongoConnection.add_named_connection function.

Initially, there is one default connection (called "default") which is set to localhost:27017, the default port for MongoDB servers on the local machine. To open this connection use:

mongodb =
  match MongoConnection.open("default") with
  | {success=mongodb} -> mongodb
  | {~failure} -> ... // take action on error

// or

mongodb = MongoConnection.openfatal("default")

The MongoConnection.open function returns an outcome of either the connection or the standard Mongo.failure type whereas the MongoConnection.openfatal function returns just the connection but treats a failed connection as a fatal error.

To setup the connection from the command line the following options are defined:

Option               Abbrev Type              Description
------               ------ ----              -----------
--mongo-name         (--mn) <string>          Name for the MongoDB server connection
--mongo-repl-name    (--mr) <string>          Replica set name for the MongoDB server
--mongo-buf-size     (--mb) <int>             Hint for initial MongoDB connection buffer size
--mongo-concurrency  (--mx) <string>          Concurrency type, 'pool', 'cell' or 'singlethreaded'
--mongo-socket-pool  (--mp) <int>             Number of sockets in socket pool (>=2 enables socket pool)
--mongo-close-socket (--mc) <bool>            Maintain MongoDB server sockets in a closed state
--mongo-seed         (--ms) <host>{:<port>}   Add a seed to a replica set, allows multiple seeds
--mongo-host         (--mh) <host>{:<port>}   Host name of a MongoDB server, overwrites any previous hosts
--mongo-log          (--ml) <bool>            Enable MongoLog logging
--mongo-log-type     (--mt) <string>          Type of logging: stdout, stderr, logger, none

So, for example, to connect to the default connection at machinexyz:12345 you would use:

% prog.exe --mh machinexyz:12345

This remains a single connection; to connect to a replica set you also need to define a name for the replica set plus some seeds:

% prog.exe --mn blort --mr blort --ms machinexyz:27017 --ms machineuvw:27017

Here we have defined a connection called "blort" to a replica set also called "blort" with two seed machines. Remember that you only really need one seed which is active in the set; the connection logic queries the seeds for the actual host list and then polls the hosts until it finds the current primary server. From then on, reconnection will be attempted if the current primary goes down.

Note that you can define as many named connections as you like; this example still retains the default connection.

Note also that you can clone a connection; the underlying connection will not be closed until all of its clones have been closed.

Handling concurrency within an OPA program can be done in three ways:

  • Socket pool mode, set with --mx pool, means that a pool of open connections is maintained to the same server such that blocking only occurs if there are no more available connections in the pool (set with --mp 2, for example). If you ensure that the pool size is at least as big as the number of threads in your code then no blocking will occur. This method is quite expensive on resources, however.

  • Cell mode, set with --mx cell, is where one connection is opened but it is protected by a cell. This means that you can have multiple threads, but they will always block if more than one thread requires the socket at the same time.

  • Single-threaded mode, set with --mx singlethreaded. In this case, no blocking is performed and the program is not thread safe. If, however, you can guarantee that only one thread will ever attempt to use the socket then MongoDB database operations should be significantly faster.

In addition, for cell and singlethreaded modes, you can optionally maintain the socket in a closed state. This is a specialised option: it will degrade the performance of the connection and is not recommended.
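To choose a mode when launching the program, use the --mx flag from the table above; for example, to run in pool mode with a pool of eight sockets:

% prog.exe --mx pool --mp 8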

Named connections can also be defined within the program:

do MongoConnection.add_named_connection({
  name="blort";
  replname={some="blort"};
  bufsize=50*1024;
  concurrency={pool};
  pool_max=2;
  close_socket=false;
  log=false;
  seeds=[("localhost",10001),("localhost",10002)];
})

mongodb2 = MongoConnection.openfatal("blort")

Once a connection has been opened, it can be pointed to different databases and collections using a functional interface. The default database is "db" and the default collection is "collection", but we can make a connection to a different collection without re-opening the connection as follows:

mongodb_wiki = MongoConnection.namespace(mongodb,"db","wiki")

This mechanism also applies to the flags that some of the MongoDB operations can take, for example to set the Upsert flag for all insert operations:

mongodb3 = MongoConnection.upsert(mongodb)

This method is quite flexible since you can define these flags once when the connection is made, making the flags globally persistent, or you can add these function calls at the point of calling the operation, ie. locally defined flags (there are examples below). All of the MongoDB flags are supported in this way.

One particular flag is worth mentioning: the log flag, which can be set on the command line, can also be overridden in this way, allowing you to generate logs for targeted sections of code. In fact, you can change any of the command-line options this way, but bear in mind that some of them (for example, seed lists) will not take effect until the connection is reconnected.

Basic operations

The basic database access operations are the same as the MongoDB protocol operations, ie. insert, update, query, get_more, delete, kill_cursors and msg. So, for example, to insert a document:

// A couple of documents
p1 = [H.str("name","Joe1"), H.i32("age",44)]
p2 = [H.str("name","Joe2"), H.i32("age",55)]

// Insert the documents
_ = MongoConnection.insert(mongodb,p1)
_ = MongoConnection.insert_batch(mongodb,[p1,p2])

The basic write operations come in three types:

  • insert is the write-and-forget operation where the insert message is sent and a boolean value is returned which simply states that the correct number of bytes were written to the socket.

  • inserte is a "safe" operation where the insert message has a getlasterror query piggy-backed onto it and then the raw optional reply is returned.

  • insert_result does an inserte and then analyses the reply, turning it into a standard Mongo.result type.
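For example, the three forms for insert differ only in what they return (inserte is assumed here to take the same arguments as insert; the return descriptions are paraphrased from the list above):

// write-and-forget: a bool saying the bytes were written to the socket
ok = MongoConnection.insert(mongodb,p1)
// "safe" write: the raw optional reply from the piggy-backed getlasterror
reply = MongoConnection.inserte(mongodb,p1)
// safe write with the reply analysed into a standard Mongo.result
result = MongoConnection.insert_result(mongodb,p1)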

All of the basic write operations have these three forms. The Mongo.result type is an outcome of either success as a Bson.document type or failure as a Mongo.failure type. The Mongo.failure type looks like:

type Mongo.failure =
    {OK}
  / {Error : string}
  / {DocError : Bson.document}
  / {Incomplete}
  / {NotFound}

This defines either a raw document error {DocError=doc}, which is an error as reported by the MongoDB server, a driver error {Error=str}, which is a message generated by the OPA driver, or a few special-purpose errors returned under specific circumstances ({OK} simply indicates a connection that has never been used).
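One straightforward way to handle a Mongo.result is to match on these cases directly (a minimal sketch, reusing the connection and documents from above):

report =
  match MongoConnection.insert_result(mongodb,p1) with
  | {success=_} -> "insert ok"
  | {failure={Error=err}} -> "driver error: {err}"
  | {failure={DocError=_}} -> "error reported by the MongoDB server"
  | {failure=_} -> "other failure"
do println("insert: {report}")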

Post-processing of results may include checking for errors:

error = MongoConnection.insert_result(MongoConnection.upsert(mongodb),[H.i32("i",n)])
do println("insert error={MongoCommon.is_error(error)}")

or extracting specific fields from the reply:

do println("errmsg={MongoCommon.result_string(error,"errmsg")}")

noting that we also support the MongoDB dot notation syntax:

do println("indexSizes._id_={MongoCommon.dotresult_int(collStats,"indexSizes._id_")}")

Closing a connection is as simple as:

do MongoConnection.close(mongodb)

Remember that the connection will only close once all of the clones have also been closed.

Cursors

Handling queries in MongoDB has the complication that, for efficiency, cursors are stored on the server, which entails tracking them at the client side. While the bare MongoConnection.query and MongoConnection.get_more operations can be used to handle queries in conjunction with the reply support code in MongoCommon, they are a bit inconvenient.

For this purpose we have defined cursor operations in the MongoCursor module and re-exported the most important ones into the MongoConnection.Cursor module. A cursor object itself contains all the parameters needed to manage the cursor at the server side and, in fact, duplicates some of the information in the connection object. Using the re-exported functions reduces the number of parameters to the basic functions since this information can be lifted from the connection into the cursor object.

Here is an example of a low-level cursor dialog:

cursor = MongoConnection.Cursor.init(mongodb)
cursor = MongoConnection.Cursor.set_query(cursor,{some=[H.str("name","Joe")]})
cursor = MongoConnection.Cursor.set_limit(cursor,3)
cursor = MongoConnection.Cursor.set_fields(cursor,{some=[H.i32("_id",0)]})
cursor = MongoConnection.Cursor.next(cursor)
result = MongoConnection.Cursor.check_cursor_error(cursor)
do println("result 1 = {MongoCommon.pretty_of_result(result)}")
do println("valid 1 ={MongoConnection.Cursor.valid(cursor)}")
cursor = MongoConnection.Cursor.next(cursor)
result = MongoConnection.Cursor.check_cursor_error(cursor)
do println("result 2 = {MongoCommon.pretty_of_result(result)}")
do println("valid 2 = {MongoConnection.Cursor.valid(cursor)}")
_ = MongoConnection.Cursor.reset(cursor)

The cursor is initialised with init and then the parameters for the query are set up. The next function generates the query (or get_more) call to the server and places the next document internally in the cursor object along with any error status. The check_cursor_error function is a convenient way of extracting either the current document or the error as a Mongo.result. Subsequent calls to next will either return the next document from the previous reply or issue a get_more call to re-populate the cursor. The end of the matching documents (or if no document matches) is signalled with NotFound, and if you try to read past the end of the matching documents you will get an "end of data" error from the driver. The valid function is used to poll whether there is any remaining data. Finally, the call to reset is important here because it doesn’t just end the query: it issues a kill_cursors operation to the server to tell it to delete the cursor (cursors time out after 10 minutes by default on the MongoDB server).

This method works fine but this logic has been wrapped up into some convenience functions:

  • find_one returns the first matching document as a Mongo.result

  • find_all gives all the matches as a list of documents (use the limit function to limit the number of replies).

For example:

// Find all objects in db.session, excluding the _id field
mongo_session_no_id =
  MongoConnection.fields(MongoConnection.namespace(mongodb,"db","session"),{some=[H.i32("_id",0)]})
do println("findAll: {MongoCommon.pretty_of_results(MongoConnection.Cursor.find_all(mongo_session_no_id,[]))}")

You can also define custom loops over the matches using start (or find) in conjunction with next and valid. (Note that you must use the MongoConnection.Cursor.for loop instead of the more usual for function in the OPA stdlib: you need to check valid and only call next if the cursor is still valid at that point, otherwise you will miss the last document in the list of matches.)
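As a rough sketch of the pattern this implies, here is a hand-rolled loop built only from the next, valid and check_cursor_error functions used above (in real code the MongoConnection.Cursor.for loop or find_all is preferable):

// Print every matching document: fetch with next, then only recurse
// while valid reports that more data remains, so the last document is
// not missed; reset kills the server-side cursor when we are done.
print_all(cursor) =
  cursor = MongoConnection.Cursor.next(cursor)
  result = MongoConnection.Cursor.check_cursor_error(cursor)
  do println("doc = {MongoCommon.pretty_of_result(result)}")
  if MongoConnection.Cursor.valid(cursor)
  then print_all(cursor)
  else ignore(MongoConnection.Cursor.reset(cursor))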

Collections

While you can achieve anything that MongoDB is capable of using the low-level drivers, there are no guarantees of type safety when converting between BSON documents and OPA values. You can of course base your entire project around BSON values and eliminate the need for converting between MongoDB’s documents and OPA types altogether, but this may not be very convenient depending upon what is happening elsewhere in your application. Secondly, using the low-level drivers requires an investment in learning MongoDB’s powerful but rather complex interface (which may be new to users of relational databases) in order to exploit what MongoDB has to offer. Finally, basing your application on MongoDB’s API will tie your application to MongoDB, and you may at some point in the future wish to migrate to another database solution.

Ultimately, the intention is to provide an abstract view of the database which is general enough to encompass several of the existing database solutions, of which MongoDB is an important example, and to support this with compiler-generated syntax in the manner of the OPA inbuilt database. This support is not yet available, but we can offer an intermediate layer for programming MongoDB whereby we assume collections of OPA types and support type-safety by performing run-time type-checks on operations over these collections. This support takes the form of the MongoCollection module plus some support modules for generating values suitable to be applied to these functions.

The collection type

The central idea in the MongoCollection module is a collection (in the MongoDB sense of the term) of OPA values. This is embodied in the Mongo.collection type, which is extremely simple: it’s just a MongoConnection value plus a run-time representation of the type of the values to be stored in the collection:

type Mongo.collection('a) = {
  db: Mongo.mongodb; // the mongodb connection
  ty: OpaType.ty; // type of the collection
}

When a value is stored in the collection it is automatically converted from its OPA type into a matching BSON document and vice versa for queries. Note, however, that the collection type is also parametrised with the compile-time type of the collection. It is imperative that the types 'a and ty represent the same type and for this reason, we derive ty from the type of the collection at the point of creating the collection object.

While this sounds simple there are a number of pitfalls to watch out for. We assume that any offline modifications of the collection will not create any incompatible values. If, for example, we add or delete a field from a record then the entry can no longer be represented as an OPA type.

To overcome this problem we place checks in the code to verify the suitability of documents read from the collection, and an error will be generated if any such values are found. We also provide features to allow handling of this situation in some specific circumstances; for example, if you type a field in the collection as Bson.register, you will be able to successfully read in values with missing fields, but this is not recommended for collections. Ultimately, it is up to the maintainer of the database to ensure that the values stored there are consistent with the application’s usage of the collection.
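As a hedged sketch of that Bson.register escape hatch (the entry type, field names and namespace are made up for illustration; MongoCollection.openfatal is used as in the example below):

// Documents whose "nick" field is missing read back as {nick={absent}}
// instead of causing a type error.
type entry = { name: string; nick: Bson.register(string) }
entries = (MongoCollection.openfatal("default","db","entries") : Mongo.collection(entry))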

Despite these provisos, using a collection is very simple and gives the programmer the ability to integrate OPA types with the MongoDB system without having to understand the underlying complexity of the database and with a modest level of type-safety. The cost, for the moment, is the overhead of the run-time type-checks which will slow down database operations.

Programming with collections

A simple dialog for creating and manipulating a collection might be as follows:

// The type of our first collection
type t = {i:int}

// Create a collection of type t
c1 = (MongoCollection.openfatal("default","db","collection"):Mongo.collection(t))

// Put a single value into the collection
result = MongoCollection.insert_result(c1,{i=0})