Sync'ed with r4528 from std-trunk (Pre-heating for 2.2 release).
git-svn-id: http://www.pytables.org/svn/pytables/PyTablesPro/trunk@4534 1b98710c-d8ec-0310-ae81-f5f2bcd8cb94
Francesc Alted committed Jul 1, 2010
1 parent 2a38a7d commit c689858
Showing 31 changed files with 6,521 additions and 4,983 deletions.
28 changes: 19 additions & 9 deletions ANNOUNCE.txt.in
@@ -2,16 +2,13 @@
Announcing PyTables @VERSION@
===========================

PyTables Pro is a library for managing hierarchical datasets, designed
to efficiently cope with extremely large amounts of data, with support
for full 64-bit file addressing. PyTables Pro runs on top of the HDF5
library and the NumPy package to achieve maximum throughput and
convenient use. The main difference between PyTables Pro and regular
PyTables is that the Pro version includes OPSI, a new indexing
technology that allows data lookups in tables exceeding 10 gigarows
(10**10 rows) to be performed in less than one tenth of a second.
I'm happy to announce PyTables Pro 2.2 (final). After 18 months of
continuous development and testing, this is, by far, the most powerful
and well-tested release ever. I hope you like it too.

#XXX version-specific blurb XXX#

What's new
==========

The main new features in the 2.2 series are:

@@ -48,6 +45,19 @@ For an on-line version of the manual, visit:
http://www.pytables.org/docs/manual-@VERSION@


What is it?
===========

PyTables Pro is a library for managing hierarchical datasets, designed
to efficiently cope with extremely large amounts of data, with support
for full 64-bit file addressing. PyTables Pro runs on top of the HDF5
library and the NumPy package to achieve maximum throughput and
convenient use. The main difference between PyTables Pro and regular
PyTables is that the Pro version includes OPSI, a new indexing
technology that allows data lookups in tables exceeding 10 gigarows
(10**10 rows) to be performed in less than one tenth of a second.
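
As a rough illustration only (a minimal sketch, not part of the release
text: the file and column names are invented, and an existing `readings`
table with a Float64 `value` column is assumed), an OPSI-indexed query
with the 2.x API might look like this::

    import tables

    # open a hypothetical file holding a 'readings' table
    f = tables.openFile('readings.h5', 'a')
    t = f.root.readings
    # build an OPSI index on the 'value' column
    t.cols.value.createIndex()
    # indexed selection: only the matching rows are actually read
    hits = [r['value'] for r in t.where('(value > 0.5) & (value < 0.7)')]
    print "matches:", len(hits)
    f.close()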


Resources
=========

33 changes: 25 additions & 8 deletions README.txt
@@ -18,6 +18,9 @@ resources so that they take much less space (between a factor of 3
and 5, and more if the data is compressible) than other solutions, such
as relational or object-oriented databases.

Not an RDBMS replacement
------------------------

PyTables is not designed to work as a relational database replacement,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
@@ -28,6 +31,9 @@ data monitoring systems (for example, traffic measurements of IP
packets on routers), or as a centralized repository for system logs,
to name only a few possible uses.

Tables
------

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. The terms "fixed-length"
@@ -37,6 +43,9 @@ the goal is to save very large quantities of data (such as that
generated by many scientific applications) in an efficient manner that
reduces demand on CPU time and I/O.
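
A minimal sketch of how such a table can be declared and filled with the
2.x API (the record layout and file name below are invented for the
example)::

    import tables

    class Particle(tables.IsDescription):
        name     = tables.StringCol(16)   # fixed-length string field
        pressure = tables.Float64Col()    # double-precision float field
        count    = tables.Int32Col()      # 32-bit integer field

    f = tables.openFile('demo.h5', 'w')
    table = f.createTable(f.root, 'readout', Particle, "Readout example")
    row = table.row
    for i in range(10):
        row['name'] = 'Particle: %6d' % i
        row['pressure'] = float(i * i)
        row['count'] = i
        row.append()          # buffer this record
    table.flush()             # flush buffered records to disk
    f.close()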

Arrays
------

There are other useful objects, like arrays, enlargeable arrays or
variable-length arrays, that can address different needs in your
project. Also, quite a bit of effort has been invested to make
@@ -45,6 +54,9 @@ experience. PyTables implements a few easy-to-use methods for
browsing. See the documentation (located in the ``doc/`` directory)
for more details.
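
For instance, a minimal sketch of an enlargeable array and a
variable-length array (names and shapes invented for the example, using
the 2.x API)::

    import numpy
    import tables

    f = tables.openFile('arrays.h5', 'w')
    # enlargeable array: the first dimension can grow without bound
    e = f.createEArray(f.root, 'measurements', tables.Float64Atom(),
                       shape=(0, 1000))
    for i in range(5):
        e.append(numpy.random.rand(1, 1000))   # append one row at a time
    # variable-length array: every row may have a different length
    vl = f.createVLArray(f.root, 'ragged', tables.Int32Atom())
    vl.append([1, 2, 3])
    vl.append([4, 5])
    f.close()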

Easy to use
-----------

One of the principal objectives of PyTables is to be user-friendly.
To that end, special Python features like generators, slots and
metaclasses in new-style classes have been used. In addition,
@@ -53,6 +65,18 @@ enable the interactive work to be as productive as possible. For these
reasons, you will need to use Python 2.4 or higher (Python 2.4.4 or
better recommended) to take advantage of PyTables.

Platforms
---------

We are using Linux on top of Intel32 and Intel64 boxes as the main
development platforms, but PyTables should be easy to compile/install
on other UNIX or Windows machines. Nonetheless, caveat emptor: more
testing is needed to achieve complete portability, so we'd appreciate
input on how it compiles and installs on your platform.

Compiling
---------

To compile PyTables you will need, at least, a recent version of the
HDF5 library (C flavor), the Zlib compression library, and the NumPy
and Numexpr packages. In addition, if you want to take advantage of the LZO
@@ -68,15 +92,8 @@ reasonably recent version of them (>= 1.5.2 for numarray and >= 24.x
for Numeric). PyTables has been successfully tested against numarray
1.5.2 and Numeric 24.2.
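
Once built, a quick way to see which of these libraries PyTables was
actually compiled against is a sketch like the following (it assumes
`tables.whichLibVersion()` accepts these library names and returns None
for libraries that were not found at build time)::

    import tables

    print "PyTables version:", tables.__version__
    for lib in ('hdf5', 'zlib', 'lzo', 'bzip2', 'blosc'):
        print lib, "->", tables.whichLibVersion(lib)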

We are using Linux on top of Intel32 and Intel64 boxes as the main
development platforms, but PyTables should be easy to compile/install
on other UNIX or Windows machines. Nonetheless, caveat emptor: more
testing is needed to achieve complete portability, so we'd appreciate
input on how it compiles and installs on your platform.


Installation
============
------------

The Python Distutils are used to build and install PyTables, so it is
fairly simple to get things ready to go. Following are very simple
29 changes: 25 additions & 4 deletions RELEASE_NOTES.txt
@@ -9,7 +9,27 @@
Changes from 2.2rc2 to 2.2 (final)
==================================

- None yet.
- Updated Blosc to 1.0 (final).

- The filter ID of Blosc has been changed from the wrong 32010 to the
reserved 32001. This prevents PyTables 2.2 (final) from reading files
created with Blosc by PyTables 2.2 pre-final versions. `ptrepack` can
be used to recover those files, if necessary. More info in ticket #281.

- Recent benchmarks suggest that a new parametrization is better in most
scenarios:

* The default chunksize has been doubled for every dataset size. This
works better in most scenarios, especially with the new Blosc
compressor.

* The HDF5 CHUNK_CACHE_SIZE parameter has been raised to 2 MB in order
to better adapt to the chunksize increase. This provides a better hit
ratio (at the cost of consuming more memory); see the sketch below for
how to override it.

Some plots have been added to the User's Manual (chapter 5) showing
how the new parametrization works.
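
As an illustrative sketch (not part of the release notes proper), the
new defaults can be combined with a hand-tuned chunk cache; here it is
assumed that CHUNK_CACHE_SIZE can be overridden as a keyword argument to
`openFile()`, like the other entries of tables/parameters.py::

    import numpy
    import tables

    # raise the chunk cache to 4 MB for this file only (assumed override)
    f = tables.openFile('cache_demo.h5', 'w', CHUNK_CACHE_SIZE=4*1024*1024)
    filters = tables.Filters(complevel=5, complib='blosc')
    # let PyTables pick the (now doubled) default chunkshape
    e = f.createEArray(f.root, 'earray', tables.Float64Atom(),
                       shape=(0, 2**16), filters=filters)
    e.append(numpy.random.rand(4, 2**16))
    print "chunkshape chosen by PyTables:", e.chunkshape
    f.close()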


Changes from 2.2rc1 to 2.2rc2
=============================
@@ -28,9 +48,10 @@ Changes from 2.2rc1 to 2.2rc2
renamed to `BUFFER_TIMES`, which is more consistent with other
parameter names.

- On Windows platforms, the `sys.path` directory is now added to the
PATH environment variable. That way, DLLs in `sys.path` can be found.
Thanks to Christoph Gohlke for the hint.
- On Windows platforms, the path to the tables module is now appended to
sys.path and the PATH environment variable. That way, DLLs and PYDs in
the tables directory can be found. Thanks to Christoph Gohlke for the
hint.

- A replacement for barriers has been implemented for Mac OS X and
other systems that do not provide them. This allows compiling
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.2.devpro
2.2pro
10 changes: 6 additions & 4 deletions bench/bench-postgres-ranges.sh
@@ -3,10 +3,12 @@
export PYTHONPATH=..:$PYTHONPATH

pyopt="-O -u"
qlvl="-Q8 -x"
size="500m"
#size="1m"
#qlvl="-Q8 -x"
#qlvl="-Q8"
qlvl="-Q7"
#size="500m"
size="1g"

python $pyopt indexed_search.py -P -c -n $size -m -v
#python $pyopt indexed_search.py -P -c -n $size -m -v
python $pyopt indexed_search.py -P -i -n $size -m -v -sfloat $qlvl

13 changes: 7 additions & 6 deletions bench/bench-pytables-ranges.sh
@@ -11,15 +11,16 @@ sizes="1g"
working_dir="data.nobackup"
#working_dir="/scratch2/faltet"

#for comprlvl in '-z0' '-z1 -lzlib' '-z1 -llzo' ; do
#for comprlvl in '-z9 -lblosc' '-z5 -lblosc' '-z1 -lblosc' '-z0' ; do
for comprlvl in '-z0' ; do
#for comprlvl in '-z0' '-z1 -llzo' '-z1 -lzlib' ; do
#for comprlvl in '-z6 -lblosc' '-z3 -lblosc' '-z1 -lblosc' ; do
for comprlvl in '-z5 -lblosc' ; do
#for comprlvl in '-z0' ; do
for optlvl in '-tfull -O9' ; do
#for optlvl in '-tultralight -O3' '-tlight -O6' '-tmedium -O6' '-tfull -O9'; do
#for optlvl in '-tultralight -O3'; do
rm -f $working_dir/*
for mode in -c '-Q7 -i -s float' ; do
#for mode in -c '-Q8 -i -x -s float' ; do
#rm -f $working_dir/* # XXX is this in the right place??
for mode in '-Q8 -i -s float' ; do
#for mode in -c '-Q7 -i -s float' ; do
#for mode in '-c -s float' '-Q8 -I -s float' '-Q8 -S -s float'; do
for size in $sizes ; do
$bench $flags $mode -n $size $optlvl $comprlvl -d $working_dir
48 changes: 24 additions & 24 deletions bench/optimal-chunksize.py
@@ -12,8 +12,8 @@
# Size of dataset
#N, M = 512, 2**16 # 256 MB
#N, M = 512, 2**18 # 1 GB
N, M = 512, 2**19 # 2 GB
#N, M = 2000, 1000000 # 15 GB
#N, M = 512, 2**19 # 2 GB
N, M = 2000, 1000000 # 15 GB
#N, M = 4000, 1000000 # 30 GB
datom = tables.Float64Atom() # elements are double precision

@@ -48,22 +48,22 @@ def bench(chunkshape, filters):
filename = '/scratch2/faltet/data.nobackup/test.h5'
#filename = '/scratch1/faltet/test.h5'

f = tables.openFile(filename, 'r')

# f = tables.openFile(filename, 'w')
# e = f.createEArray(f.root, 'earray', datom, shape=(0, M),
# filters = filters,
# chunkshape = chunkshape)
# # Fill the array
# t1 = time()
# for i in xrange(N):
# #e.append([numpy.random.rand(M)]) # use this for less compressibility
# e.append([quantize(numpy.random.rand(M), 6)])
# os.system("sync")
# print "Creation time:", round(time()-t1, 3),
# filesize = get_db_size(filename)
# filesize_bytes = os.stat(filename)[6]
# print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize)
#f = tables.openFile(filename, 'r')

f = tables.openFile(filename, 'w')
e = f.createEArray(f.root, 'earray', datom, shape=(0, M),
filters = filters,
chunkshape = chunkshape)
# Fill the array
t1 = time()
for i in xrange(N):
#e.append([numpy.random.rand(M)]) # use this for less compressibility
e.append([quantize(numpy.random.rand(M), 6)])
#os.system("sync")
print "Creation time:", round(time()-t1, 3),
filesize = get_db_size(filename)
filesize_bytes = os.stat(filename)[6]
print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize)

# Read in sequential mode:
e = f.root.earray
@@ -74,8 +74,8 @@ def bench(chunkshape, filters):
t = row
print "Sequential read time:", round(time()-t1, 3),

f.close()
return
#f.close()
#return

# Read in random mode:
i_index = numpy.random.randint(0, N, 128)
@@ -98,16 +98,16 @@ def bench(chunkshape, filters):

# Benchmark with different chunksizes and filters
#for complevel in (0, 1, 3, 6, 9):
#for complib in (None, 'zlib', 'lzo', 'blosc'):
for complib in (None,):
for complib in (None, 'zlib', 'lzo', 'blosc'):
#for complib in ('blosc',):
if complib:
filters = tables.Filters(complevel=5, complib=complib)
else:
filters = tables.Filters(complevel=0)
print "8<--"*20, "\nFilters:", filters, "\n"+"-"*80
#for ecs in (11, 14, 17, 20, 21, 22):
#for ecs in range(10, 24):
for ecs in (19,):
for ecs in range(10, 24):
#for ecs in (19,):
chunksize = 2**ecs
chunk1 = 1
chunk2 = chunksize/datom.itemsize
4 changes: 2 additions & 2 deletions bench/postgres_backend.py
@@ -5,8 +5,8 @@
import psycopg2 as db2

CLUSTER_NAME = "base"
#DATA_DIR = "/scratch/faltet/postgres/%s" % CLUSTER_NAME
DATA_DIR = "/var/lib/pgsql/data/%s" % CLUSTER_NAME
DATA_DIR = "/scratch2/postgres/data/%s" % CLUSTER_NAME
#DATA_DIR = "/var/lib/pgsql/data/%s" % CLUSTER_NAME
DSN = "dbname=%s port=%s"
CREATE_DB = "createdb %s"
DROP_DB = "dropdb %s"
