Sync'ed with r4528 from std-trunk (Pre-heating for 2.2 release).
git-svn-id: http://www.pytables.org/svn/pytables/PyTablesPro/trunk@4534 1b98710c-d8ec-0310-ae81-f5f2bcd8cb94
Francesc Alted committed Jul 1, 2010
1 parent 2a38a7d commit c689858
Showing 31 changed files with 6,521 additions and 4,983 deletions.
28 changes: 19 additions & 9 deletions ANNOUNCE.txt.in
@@ -2,16 +2,13 @@
Announcing PyTables @VERSION@
===========================

PyTables Pro is a library for managing hierarchical datasets, designed
to efficiently cope with extremely large amounts of data, with support
for full 64-bit file addressing. PyTables Pro runs on top of the HDF5
library and the NumPy package to achieve maximum throughput and
convenient use. The main difference between PyTables Pro and regular
PyTables is that the Pro version includes OPSI, a new indexing
technology that allows data lookups in tables exceeding 10 gigarows
(10**10 rows) to be performed in less than one tenth of a second.
I'm happy to announce PyTables Pro 2.2 (final). After 18 months of
continuous development and testing, this is, by far, the most powerful
and well-tested release ever. I hope you like it too.

#XXX version-specific blurb XXX#

What's new
==========

The main new features in the 2.2 series are:

@@ -48,6 +45,19 @@ For an on-line version of the manual, visit:
http://www.pytables.org/docs/manual-@VERSION@


What is it?
===========

PyTables Pro is a library for managing hierarchical datasets, designed
to efficiently cope with extremely large amounts of data, with support
for full 64-bit file addressing. PyTables Pro runs on top of the HDF5
library and the NumPy package to achieve maximum throughput and
convenient use. The main difference between PyTables Pro and regular
PyTables is that the Pro version includes OPSI, a new indexing
technology that allows data lookups in tables exceeding 10 gigarows
(10**10 rows) to be performed in less than one tenth of a second.
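
As a rough illustration only (a minimal sketch, not part of the release
text: the file and column names are invented, and an existing `readings`
table with a Float64 `value` column is assumed), an OPSI-indexed query
with the 2.x API might look like this::

    import tables

    # open a hypothetical file holding a 'readings' table
    f = tables.openFile('readings.h5', 'a')
    t = f.root.readings
    # build an OPSI index on the 'value' column
    t.cols.value.createIndex()
    # indexed selection: only the matching rows are actually read
    hits = [r['value'] for r in t.where('(value > 0.5) & (value < 0.7)')]
    print "matches:", len(hits)
    f.close()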


Resources
=========

33 changes: 25 additions & 8 deletions README.txt
@@ -18,6 +18,9 @@ resources so that they take much less space (between a factor of 3
and 5, and more if the data is compressible) than other solutions, such
as relational or object-oriented databases.

Not an RDBMS replacement
------------------------

PyTables is not designed to work as a relational database replacement,
but rather as a teammate. If you want to work with large datasets of
multidimensional data (for example, for multidimensional analysis), or
@@ -28,6 +31,9 @@ data monitoring systems (for example, traffic measurements of IP
packets on routers), or as a centralized repository for system logs,
to name only a few possible uses.

Tables
------

A table is defined as a collection of records whose values are stored
in fixed-length fields. All records have the same structure and all
values in each field have the same data type. The terms "fixed-length"
@@ -37,6 +43,9 @@ the goal is to save very large quantities of data (such as that
generated by many scientific applications) in an efficient manner that
reduces demand on CPU time and I/O.
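
A minimal sketch of how such a table can be declared and filled with the
2.x API (the record layout and file name below are invented for the
example)::

    import tables

    class Particle(tables.IsDescription):
        name     = tables.StringCol(16)   # fixed-length string field
        pressure = tables.Float64Col()    # double-precision float field
        count    = tables.Int32Col()      # 32-bit integer field

    f = tables.openFile('demo.h5', 'w')
    table = f.createTable(f.root, 'readout', Particle, "Readout example")
    row = table.row
    for i in range(10):
        row['name'] = 'Particle: %6d' % i
        row['pressure'] = float(i * i)
        row['count'] = i
        row.append()          # buffer this record
    table.flush()             # flush buffered records to disk
    f.close()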

Arrays
------

There are other useful objects, like arrays, enlargeable arrays or
variable-length arrays, that can address different needs in your
project. Also, quite a bit of effort has been invested to make
@@ -45,6 +54,9 @@ experience. PyTables implements a few easy-to-use methods for
browsing. See the documentation (located in the ``doc/`` directory)
for more details.
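
For instance, a minimal sketch of an enlargeable array and a
variable-length array (names and shapes invented for the example, using
the 2.x API)::

    import numpy
    import tables

    f = tables.openFile('arrays.h5', 'w')
    # enlargeable array: the first dimension can grow without bound
    e = f.createEArray(f.root, 'measurements', tables.Float64Atom(),
                       shape=(0, 1000))
    for i in range(5):
        e.append(numpy.random.rand(1, 1000))   # append one row at a time
    # variable-length array: every row may have a different length
    vl = f.createVLArray(f.root, 'ragged', tables.Int32Atom())
    vl.append([1, 2, 3])
    vl.append([4, 5])
    f.close()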

Easy to use
-----------

One of the principal objectives of PyTables is to be user-friendly.
To that end, special Python features like generators, slots and
metaclasses in new-style classes have been used. In addition,
@@ -53,6 +65,18 @@ enable the interactive work to be as productive as possible. For these
reasons, you will need to use Python 2.4 or higher (Python 2.4.4 or
better recommended) to take advantage of PyTables.

Platforms
---------

We are using Linux on top of Intel32 and Intel64 boxes as the main
development platforms, but PyTables should be easy to compile/install
on other UNIX or Windows machines. Nonetheless, caveat emptor: more
testing is needed to achieve complete portability, so we'd appreciate
input on how it compiles and installs on your platform.

Compiling
---------

To compile PyTables you will need, at least, a recent version of the
HDF5 library (C flavor), the Zlib compression library, and the NumPy
and Numexpr packages. In addition, if you want to take advantage of the LZO
@@ -68,15 +92,8 @@ reasonably recent version of them (>= 1.5.2 for numarray and >= 24.x
for Numeric). PyTables has been successfully tested against numarray
1.5.2 and Numeric 24.2.
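
Once built, a quick way to see which of these libraries PyTables was
actually compiled against is a sketch like the following (it assumes
`tables.whichLibVersion()` accepts these library names and returns None
for libraries that were not found at build time)::

    import tables

    print "PyTables version:", tables.__version__
    for lib in ('hdf5', 'zlib', 'lzo', 'bzip2', 'blosc'):
        print lib, "->", tables.whichLibVersion(lib)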

We are using Linux on top of Intel32 and Intel64 boxes as the main
development platforms, but PyTables should be easy to compile/install
on other UNIX or Windows machines. Nonetheless, caveat emptor: more
testing is needed to achieve complete portability, so we'd appreciate
input on how it compiles and installs on your platform.


Installation
============
------------

The Python Distutils are used to build and install PyTables, so it is
fairly simple to get things ready to go. Following are very simple
29 changes: 25 additions & 4 deletions RELEASE_NOTES.txt
@@ -9,7 +9,27 @@
Changes from 2.2rc2 to 2.2 (final)
==================================

- None yet.
- Updated Blosc to 1.0 (final).

- The filter ID of Blosc has been changed from the wrong 32010 to the
reserved 32001. This prevents PyTables 2.2 (final) from reading files
created with Blosc by PyTables 2.2 pre-final versions. `ptrepack` can
be used to recover those files, if necessary. More info in ticket #281.

- Recent benchmarks suggest that a new parametrization is better in most
scenarios:

* The default chunksize has been doubled for every dataset size. This
works better in most scenarios, especially with the new Blosc
compressor.

* The HDF5 CHUNK_CACHE_SIZE parameter has been raised to 2 MB in order
to better adapt to the chunksize increase. This provides a better hit
ratio (at the cost of consuming more memory); see the sketch below for
how to override it.

Some plots have been added to the User's Manual (chapter 5) showing
how the new parametrization works.
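
As an illustrative sketch (not part of the release notes proper), the
new defaults can be combined with a hand-tuned chunk cache; here it is
assumed that CHUNK_CACHE_SIZE can be overridden as a keyword argument to
`openFile()`, like the other entries of tables/parameters.py::

    import numpy
    import tables

    # raise the chunk cache to 4 MB for this file only (assumed override)
    f = tables.openFile('cache_demo.h5', 'w', CHUNK_CACHE_SIZE=4*1024*1024)
    filters = tables.Filters(complevel=5, complib='blosc')
    # let PyTables pick the (now doubled) default chunkshape
    e = f.createEArray(f.root, 'earray', tables.Float64Atom(),
                       shape=(0, 2**16), filters=filters)
    e.append(numpy.random.rand(4, 2**16))
    print "chunkshape chosen by PyTables:", e.chunkshape
    f.close()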


Changes from 2.2rc1 to 2.2rc2
=============================
@@ -28,9 +48,10 @@ Changes from 2.2rc1 to 2.2rc2
renamed to `BUFFER_TIMES`, which is more consistent with other
parameter names.

- On Windows platforms, the `sys.path` directory is now added to the
PATH environment variable. That way, DLLs in `sys.path` can be found.
Thanks to Christoph Gohlke for the hint.
- On Windows platforms, the path to the tables module is now appended to
sys.path and the PATH environment variable. That way, DLLs and PYDs in
the tables directory can be found. Thanks to Christoph Gohlke for the
hint.

- A replacement for barriers has been implemented for Mac OS X and
other systems that do not provide them. This allows compiling
2 changes: 1 addition & 1 deletion VERSION
@@ -1 +1 @@
2.2.devpro
2.2pro
10 changes: 6 additions & 4 deletions bench/bench-postgres-ranges.sh
@@ -3,10 +3,12 @@
export PYTHONPATH=..:$PYTHONPATH

pyopt="-O -u"
qlvl="-Q8 -x"
size="500m"
#size="1m"
#qlvl="-Q8 -x"
#qlvl="-Q8"
qlvl="-Q7"
#size="500m"
size="1g"

python $pyopt indexed_search.py -P -c -n $size -m -v
#python $pyopt indexed_search.py -P -c -n $size -m -v
python $pyopt indexed_search.py -P -i -n $size -m -v -sfloat $qlvl

13 changes: 7 additions & 6 deletions bench/bench-pytables-ranges.sh
@@ -11,15 +11,16 @@ sizes="1g"
working_dir="data.nobackup"
#working_dir="/scratch2/faltet"

#for comprlvl in '-z0' '-z1 -lzlib' '-z1 -llzo' ; do
#for comprlvl in '-z9 -lblosc' '-z5 -lblosc' '-z1 -lblosc' '-z0' ; do
for comprlvl in '-z0' ; do
#for comprlvl in '-z0' '-z1 -llzo' '-z1 -lzlib' ; do
#for comprlvl in '-z6 -lblosc' '-z3 -lblosc' '-z1 -lblosc' ; do
for comprlvl in '-z5 -lblosc' ; do
#for comprlvl in '-z0' ; do
for optlvl in '-tfull -O9' ; do
#for optlvl in '-tultralight -O3' '-tlight -O6' '-tmedium -O6' '-tfull -O9'; do
#for optlvl in '-tultralight -O3'; do
rm -f $working_dir/*
for mode in -c '-Q7 -i -s float' ; do
#for mode in -c '-Q8 -i -x -s float' ; do
#rm -f $working_dir/* # XXX is this in the right place??
for mode in '-Q8 -i -s float' ; do
#for mode in -c '-Q7 -i -s float' ; do
#for mode in '-c -s float' '-Q8 -I -s float' '-Q8 -S -s float'; do
for size in $sizes ; do
$bench $flags $mode -n $size $optlvl $comprlvl -d $working_dir
48 changes: 24 additions & 24 deletions bench/optimal-chunksize.py
@@ -12,8 +12,8 @@
# Size of dataset
#N, M = 512, 2**16 # 256 MB
#N, M = 512, 2**18 # 1 GB
N, M = 512, 2**19 # 2 GB
#N, M = 2000, 1000000 # 15 GB
#N, M = 512, 2**19 # 2 GB
N, M = 2000, 1000000 # 15 GB
#N, M = 4000, 1000000 # 30 GB
datom = tables.Float64Atom() # elements are double precision

@@ -48,22 +48,22 @@ def bench(chunkshape, filters):
filename = '/scratch2/faltet/data.nobackup/test.h5'
#filename = '/scratch1/faltet/test.h5'

f = tables.openFile(filename, 'r')

# f = tables.openFile(filename, 'w')
# e = f.createEArray(f.root, 'earray', datom, shape=(0, M),
# filters = filters,
# chunkshape = chunkshape)
# # Fill the array
# t1 = time()
# for i in xrange(N):
# #e.append([numpy.random.rand(M)]) # use this for less compressibility
# e.append([quantize(numpy.random.rand(M), 6)])
# os.system("sync")
# print "Creation time:", round(time()-t1, 3),
# filesize = get_db_size(filename)
# filesize_bytes = os.stat(filename)[6]
# print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize)
#f = tables.openFile(filename, 'r')

f = tables.openFile(filename, 'w')
e = f.createEArray(f.root, 'earray', datom, shape=(0, M),
filters = filters,
chunkshape = chunkshape)
# Fill the array
t1 = time()
for i in xrange(N):
#e.append([numpy.random.rand(M)]) # use this for less compressibility
e.append([quantize(numpy.random.rand(M), 6)])
#os.system("sync")
print "Creation time:", round(time()-t1, 3),
filesize = get_db_size(filename)
filesize_bytes = os.stat(filename)[6]
print "\t\tFile size: %d -- (%s)" % (filesize_bytes, filesize)

# Read in sequential mode:
e = f.root.earray
@@ -74,8 +74,8 @@ def bench(chunkshape, filters):
t = row
print "Sequential read time:", round(time()-t1, 3),

f.close()
return
#f.close()
#return

# Read in random mode:
i_index = numpy.random.randint(0, N, 128)
@@ -98,16 +98,16 @@ def bench(chunkshape, filters):

# Benchmark with different chunksizes and filters
#for complevel in (0, 1, 3, 6, 9):
#for complib in (None, 'zlib', 'lzo', 'blosc'):
for complib in (None,):
for complib in (None, 'zlib', 'lzo', 'blosc'):
#for complib in ('blosc',):
if complib:
filters = tables.Filters(complevel=5, complib=complib)
else:
filters = tables.Filters(complevel=0)
print "8<--"*20, "\nFilters:", filters, "\n"+"-"*80
#for ecs in (11, 14, 17, 20, 21, 22):
#for ecs in range(10, 24):
for ecs in (19,):
for ecs in range(10, 24):
#for ecs in (19,):
chunksize = 2**ecs
chunk1 = 1
chunk2 = chunksize/datom.itemsize
4 changes: 2 additions & 2 deletions bench/postgres_backend.py
@@ -5,8 +5,8 @@
import psycopg2 as db2

CLUSTER_NAME = "base"
#DATA_DIR = "/scratch/faltet/postgres/%s" % CLUSTER_NAME
DATA_DIR = "/var/lib/pgsql/data/%s" % CLUSTER_NAME
DATA_DIR = "/scratch2/postgres/data/%s" % CLUSTER_NAME
#DATA_DIR = "/var/lib/pgsql/data/%s" % CLUSTER_NAME
DSN = "dbname=%s port=%s"
CREATE_DB = "createdb %s"
DROP_DB = "dropdb %s"
