Initial commit

vaterlaus · Sep 14, 2010 · c596123
commit c596123

Showing 20 changed files with 1,989 additions and 0 deletions.
diff --git a/.project b/.project
@@ -0,0 +1,17 @@
+<?xml version="1.0" encoding="UTF-8"?>
+<projectDescription>
+	<name>django_cassandra_backend</name>
+	<comment></comment>
+	<projects>
+	</projects>
+	<buildSpec>
+		<buildCommand>
+			<name>org.python.pydev.PyDevBuilder</name>
+			<arguments>
+			</arguments>
+		</buildCommand>
+	</buildSpec>
+	<natures>
+		<nature>org.python.pydev.pythonNature</nature>
+	</natures>
+</projectDescription>
diff --git a/.pydevproject b/.pydevproject
@@ -0,0 +1,8 @@
+<?xml version="1.0" encoding="UTF-8" standalone="no"?>
+<?eclipse-pydev version="1.0"?>
+
+<pydev_project>
+<pydev_property name="org.python.pydev.PYTHON_PROJECT_INTERPRETER">Default</pydev_property>
+<pydev_property name="org.python.pydev.PYTHON_PROJECT_VERSION">python 2.6</pydev_property>
+
+</pydev_project>
diff --git a/README.txt b/README.txt
@@ -0,0 +1,153 @@
+Introduction
+============
+This is an early development release of a Django backend for the Cassandra database.
+It has only been under development for a short time and there are almost certainly
+issues/bugs with this release -- see the end of this document for a list of known
+issues. Needless to say, you shouldn't use this release in a production setting: the
+format of the data stored in Cassandra may change in future versions, there's no
+promise of backwards compatibility with this version, and so on.
+
+Please let me know if you find any bugs or have any suggestions for how to improve
+the backend. You can contact me at: rob.vaterlaus@bigswitch.com
+
+Installation
+============
+The backend requires the 0.7 version of Cassandra. 0.7 has several features
+(e.g. programmatic creation/deletion of keyspaces & column families, secondary index
+support) that are useful when running as a Django database backend, so I targeted
+that version instead of 0.6. Unfortunately, the Cassandra Thrift API changed between
+0.6 and 0.7 so the two versions are incompatible.
+
+There's a beta1 version of 0.7 available at the Cassandra web site. I'm actually using a
+somewhat later daily binary release dated 8/23. I obtained the daily release by following
+the "Latest Builds" link in the Cassandra downloads page, but the last few times I've
+tried it that link was dead, so I'm not sure what's going on with that. I had switched
+to the 8/23 release, because I had read that there was an issue with the secondary
+index support in the beta1 release and I was trying to get secondary index support
+working in the backend. As it turns out I was still seeing problems with the 8/23
+release, so I wound up disabling the secondary index code (details below), so it's
+possible/probable that the backend will work with the beta1 release, especially
+if you don't try to enable secondary index support (i.e. don't set the db_index
+to True for any of the fields). But I haven't tested with beta1, so no promises.
+
+The backend also requires the Django-nonrel fork of Django and djangotoolbox.
+Both are available here: <http://www.allbuttonspressed.com/projects/django-nonrel>.
+I installed the Django-nonrel version of Django globally in site-packages and
+copied djangotoolbox into the directory where I'm testing the Cassandra backend,
+but there are probably other (better?) ways to install those things.
+
+You also need to generate the Python Thrift API code as described in the Cassandra
+documentation and copy the generated "cassandra" directory (from Cassandra's
+interface/gen-py directory) over to the top-level Django project directory.
+
+To configure a project to use the Cassandra backend all you have to do is change
+the database settings in the settings.py file. Change the ENGINE value to be
+'django_cassandra.db' and the NAME value to be the name of the keyspace to use.
+You can set HOST and PORT to override the default values of 'localhost' and 9160.
+In theory you can also set USER and PASSWORD if you're using authentication with
+Cassandra, but this hasn't been tested yet, so it may not work.
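For illustration, a settings.py database block matching the description above might look like the sketch below. The keyspace name 'my_keyspace' is a placeholder, not something taken from this commit. HOST and PORT are shown with the defaults the README states, so they could be omitted. USER and PASSWORD are left empty because the README says authentication is untested.

```python
# settings.py (sketch): database settings for the Cassandra backend.
DATABASES = {
    'default': {
        'ENGINE': 'django_cassandra.db',  # backend module named in the README
        'NAME': 'my_keyspace',            # placeholder: the keyspace to use
        'HOST': 'localhost',              # default if omitted
        'PORT': 9160,                     # default if omitted
        'USER': '',                       # untested per the README
        'PASSWORD': '',
    }
}
```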
+
+Configure Cassandra as described in the Cassandra documentation.
+If you want to be able to do range queries over primary keys then you need to set the
+partitioner in the cassandra.yaml config file to be the OrderPreservingPartitioner.
+
+Once you're finished configuring Cassandra start up the Cassandra daemon process as
+described in the Cassandra documentation.
+
+Run syncdb. This creates the keyspace (if necessary) and the column families for the
+models in the installed apps. The Cassandra backend creates one column family per
+model. It will use the db_table value from the meta settings for the name of the
+column family if it's specified; otherwise it uses the default name similar to
+other backends.
+
+Now you should be able to use the normal model and query set calls from your
+Django code.
+
+This release includes a test project and app. If you want to use the backend in
+another project you just need to copy the django_cassandra directory to the 
+top-level directory of the project (along with the cassandra and djangotoolbox
+directories).
+
+What Works
+==========
+- the basics: creating model instances, querying (get/filter/exclude), count,
+  update/save, delete, order_by
+- efficient queries for exact matches on the primary key. It can also do range
+  queries on the primary key, but your Cassandra cluster must be configured to use the
+  OrderPreservingPartitioner if you want to do that. Unfortunately, currently it 
+  doesn't fail gracefully if you try to do range queries when using the
+  RandomPartitioner, so just don't do that for now :-)
+- inefficient queries for everything else that can't be done efficiently in
+  Cassandra. The basic approach used in the query processing code is to first try
+  to prune the number of rows to look at by finding a part of the query that can
+  be evaluated efficiently (i.e. a primary key filter predicate or a secondary
+  index predicate, once that's working). Then it evaluates the remaining filter
+  predicates over the pruned rows to obtain the final result. If there's no part
+  of the query that can be evaluated efficiently, then it just fetches the entire
+  set of rows and does all of the filtering in the backend code.
+- programmatic creation of the keyspace & column families via syncdb
+- Django admin UI, except for users in the auth application (see below)
+- I think all of the filter operations (e.g. gt, startswith, regex, etc.) are supported
+  although it's possible I missed something
+- complex queries with Q nodes
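The prune-then-filter approach described in the list above can be sketched roughly as follows. This is a toy, in-memory illustration only, not the code in compiler.py or predicate.py: the function name `run_query`, the dict-of-rows layout, and the callable predicates are all assumptions made for the example.

```python
def run_query(rows_by_key, key_range, predicates):
    """Two-phase evaluation: prune by primary key range, then filter in memory.

    rows_by_key: dict mapping primary key -> dict of column values
    key_range:   (start, end) inclusive bounds on the primary key, or None
                 when no part of the query can be evaluated efficiently
    predicates:  callables taking a row dict and returning True/False
    """
    ordered = sorted(rows_by_key.items())
    if key_range is not None:
        # Phase 1: the "efficient" part, standing in for a key range query.
        start, end = key_range
        candidates = [row for key, row in ordered if start <= key <= end]
    else:
        # No efficient predicate: fall back to scanning every row.
        candidates = [row for key, row in ordered]
    # Phase 2: evaluate the remaining predicates over the pruned rows.
    return [row for row in candidates if all(p(row) for p in predicates)]


rows = {
    'a': {'id': 'a', 'name': 'alice', 'age': 30},
    'b': {'id': 'b', 'name': 'bob', 'age': 25},
    'c': {'id': 'c', 'name': 'carol', 'age': 35},
}
pruned = run_query(rows, ('a', 'b'), [lambda r: r['age'] > 26])
full_scan = run_query(rows, None, [lambda r: r['age'] >= 30])
```

Here `pruned` only ever examines rows 'a' and 'b', while `full_scan` has to look at every row, which is the inefficient fallback the README warns about.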
+
+What Doesn't Work (Yet)
+=======================
+- Secondary Index Support: There's code in there to use secondary indexes, but
+  I was seeing weird results when I tried to execute Cassandra queries using the
+  secondary indexes so I disabled that code. Hopefully that's just an issue with the
+  specific version of Cassandra I'm using, but I haven't tried it out with a more
+  recent version to see if it's working now. If you're feeling adventurous you could
+  try it out with a newer version of Cassandra and enable the secondary index code
+  by setting the value of SECONDARY_INDEX_SUPPORT_ENABLED to True in predicate.py.
+  You enable secondary index support for fields by setting the db_index argument to
+  True when constructing the field.
+- I haven't tested all of the different field types, so there are probably
+  issues there with how the data is converted to and from Cassandra with some of the
+  field types. My use case was mostly string fields, so most of the testing was with
+  that. I've also tried out date, datetime, time, and decimal fields, so I think
+  those should work too, but I haven't tried anything else.
+- joins
+- chunked queries. It just tries to get everything all at once from Cassandra.
+  Currently the maximum that it can get (i.e. the count value in the Cassandra
+  Thrift API) is set semi-arbitrarily to 10000, so if you try to query over a
+  column family with more rows (or columns) than that it may not work.
+  Probably the value could be set higher than that, but at some point Cassandra
+  fails if it's too big (i.e. it didn't work if I set it to 0x7fffffff).
+  If you want to make it bigger you can change the MAX_FETCH_COUNT variable
+  in compiler.py.
+- ListModel/ListField support from djangotoolbox (I think?). I haven't
+  investigated how this works and if it's feasible to support in Cassandra,
+  although I'm guessing it probably wouldn't be too hard. For now, this means
+  that several of the unit tests from djangotoolbox fail if you have that
+  in your installed apps.
+- there's no way to configure the settings used to create the keyspaces
+  and column families (e.g. replication strategy, replication factor) or the
+  read & write consistency levels used when querying or inserting/mutating
+  columns in Cassandra. My plan was to add global database settings and
+  per-model Meta settings to configure those things, but I haven't gotten to
+  it yet.
+- Cassandra authentication. Actually this may work but I haven't tested it yet.
+  There's code in there that tries to login to Cassandra if the USER and
+  PASSWORD are specified in the database settings, but I've only tested with
+  the AllowAllAuthenticator.
+- probably a lot of other stuff that I've forgotten or am unaware of :-)
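One possible shape for the missing chunked queries is a generator that pages through rows in fixed-size batches instead of requesting everything at once. The sketch below is not part of this commit. It assumes a hypothetical `fetch_page(start_key, count)` callable that returns up to `count` (key, row) pairs in a stable order, beginning at `start_key` inclusive, and here it is exercised against an in-memory list rather than Cassandra.

```python
def iter_rows(fetch_page, page_size=1000):
    """Yield (key, row) pairs one page at a time.

    fetch_page(start_key, count) is a hypothetical callable returning up to
    `count` (key, row) pairs in order, starting at start_key inclusive
    ('' means start from the beginning).
    """
    if page_size < 2:
        # With an inclusive start key, a page of 1 would never advance.
        raise ValueError('page_size must be at least 2')
    start_key = ''
    first_page = True
    while True:
        page = fetch_page(start_key, page_size)
        # After the first page, the first item repeats the previous last key.
        items = page if first_page else page[1:]
        for item in items:
            yield item
        if len(page) < page_size:
            break  # a short page means there is nothing left to fetch
        start_key = page[-1][0]
        first_page = False


# In-memory stand-in for a column family, used only to exercise the pager.
data = [('k%02d' % i, {'n': i}) for i in range(25)]

def fake_fetch(start_key, count):
    return [kv for kv in data if kv[0] >= start_key][:count]

keys = [key for key, _ in iter_rows(fake_fetch, page_size=10)]
```

Each page re-requests the last key of the previous page and drops the duplicate, so no row is skipped or returned twice as long as the ordering is stable.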
+
+Known Issues
+============
+- I haven't been able to get the admin UI to work for users in the Django
+  authentication middleware. I included djangotoolbox in my installed apps, as
+  suggested on the Django-nonrel web site, which got me further, but I still get
+  an error in some Django template code that tries to render a change list (I think).
+  I still need to track down what's going on there.
+- if you enable the authentication and session middleware a bunch of the
+  associated unit tests fail if you run all of the unit tests.
+  This may be related to the issue with editing users in the admin UI
+- the code needs a cleanup pass for things like the exception handling/safety,
+  some refactoring, more pydoc comments, etc.
+- I have a feeling there are some places where I haven't completely leveraged
+  the code in djangotoolbox, so there may be places where I haven't done
+  things in the optimal way
+- the error handling/messaging isn't great for things like the Cassandra
+  daemon not running, a versioning mismatch between client and Cassandra
+  daemon, etc.
diff --git a/__init__.py b/__init__.py
diff --git a/django_cassandra/__init__.py b/django_cassandra/__init__.py
diff --git a/django_cassandra/db/__init__.py b/django_cassandra/db/__init__.py
diff --git a/django_cassandra/db/base.py b/django_cassandra/db/base.py
@@ -0,0 +1,88 @@
+from djangotoolbox.db.base import NonrelDatabaseFeatures, \
+    NonrelDatabaseOperations, NonrelDatabaseWrapper, NonrelDatabaseClient, \
+    NonrelDatabaseValidation, NonrelDatabaseIntrospection, \
+    NonrelDatabaseCreation
+
+from thrift import Thrift
+from thrift.transport import TTransport
+from thrift.transport import TSocket
+from thrift.protocol import TBinaryProtocol
+from cassandra import Cassandra
+from cassandra.ttypes import *
+import time
+from .creation import DatabaseCreation
+from .introspection import DatabaseIntrospection
+
+class DatabaseFeatures(NonrelDatabaseFeatures):
+    string_based_auto_field = True
+
+class DatabaseOperations(NonrelDatabaseOperations):
+    compiler_module = __name__.rsplit('.', 1)[0] + '.compiler'
+
+    def sql_flush(self, style, tables, sequence_list):
+        for table_name in tables:
+            self.connection.creation.flush_table(table_name)
+        return ""
+
+class DatabaseClient(NonrelDatabaseClient):
+    pass
+
+class DatabaseValidation(NonrelDatabaseValidation):
+    pass
+
+# TODO: Maybe move this somewhere else? db.utils.py maybe?
+class CassandraConnection(object):
+    def __init__(self, client, transport):
+        self.client = client
+        self.transport = transport
+
+    def commit(self):
+        pass
+
+    def open(self):
+        if self.transport:
+            self.transport.open()
+
+    def close(self):
+        if self.transport:
+            self.transport.close()
+
+class DatabaseWrapper(NonrelDatabaseWrapper):
+    def __init__(self, *args, **kwds):
+        super(DatabaseWrapper, self).__init__(*args, **kwds)
+
+        # Set up the associated backend objects
+        self.features = DatabaseFeatures(self)
+        self.ops = DatabaseOperations(self)
+        self.client = DatabaseClient(self)
+        self.creation = DatabaseCreation(self)
+        self.validation = DatabaseValidation(self)
+        self.introspection = DatabaseIntrospection(self)
+
+        # Get the host and port specified in the database backend settings.
+        # Default to the standard Cassandra settings.
+        host = self.settings_dict.get('HOST')
+        if not host:
+            host = 'localhost'
+        port = self.settings_dict.get('PORT')
+        if not port:
+            port = 9160
+
+        # Create the client connection to the Cassandra daemon
+        socket = TSocket.TSocket(host, port)
+        transport = TTransport.TFramedTransport(TTransport.TBufferedTransport(socket))
+        protocol = TBinaryProtocol.TBinaryProtocolAccelerated(transport)
+        client = Cassandra.Client(protocol)
+
+        # Create our connection wrapper
+        self.db_connection = CassandraConnection(client, transport)
+        self.db_connection.open()
+
+        version = client.describe_version()
+        # FIXME: Should do some version check here to make sure that we're
+        # talking to a cassandra daemon that supports the operations we require
+
+        # Set up the Cassandra keyspace
+        keyspace_name = self.settings_dict.get('NAME')
+        self.creation.init_keyspace(keyspace_name)
+
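The FIXME above about checking the daemon version could be addressed along these lines. This is only a sketch, not code from this commit: it assumes `describe_version()` reports a dotted numeric string, and the helper names `parse_version` and `check_api_version` and the minimum value passed in the example are hypothetical placeholders.

```python
def parse_version(version_string):
    """Turn a dotted version string such as '19.4.0' into a tuple of ints."""
    return tuple(int(part) for part in version_string.split('.'))


def check_api_version(reported, minimum):
    """Raise RuntimeError if the reported version is older than the minimum.

    Both arguments are dotted numeric strings. Comparing integer tuples
    avoids the string-comparison trap where '2.1.0' sorts after '19.0.0'.
    """
    if parse_version(reported) < parse_version(minimum):
        raise RuntimeError(
            'Cassandra Thrift API version %s is older than the required %s'
            % (reported, minimum))


# Example with placeholder values; a real minimum would have to be chosen
# by the backend author.
check_api_version('19.4.0', '19.0.0')
```

In `DatabaseWrapper.__init__` the `reported` argument would presumably be the value already stored in `version`.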