This repository has been archived by the owner on Dec 11, 2023. It is now read-only.

Pandas out_flavor for better ctable performance #176

Closed
ARF1 opened this issue Apr 19, 2015 · 3 comments · May be fixed by #184

Comments


ARF1 commented Apr 19, 2015

In this issue I want to make the case for extending the effect of out_flavor to __getitem__() (and related functions) and for introducing a pandas out_flavor. While I appreciate the rationale for limiting bcolz to the numpy data model, I believe the possible performance improvements with pandas merit consideration.

I would be very interested to know whether this has any chance of inclusion in bcolz, since the effort for a clean implementation would be non-trivial.

Executive Summary

  • bcolz is a columnar (albeit chunked) data container
  • thus after decompression of each chunk, the data for each column is (chunk-)contiguous
  • currently however, this column-contiguous data is forced into a row-major memory structure (numpy structured array)
  • this has two consequences:
    1. the contiguous column-data has to be copied in row-sized chunks which is much slower than en-bloc copying of contiguous data,
    2. the data has to be copied (rather than being decompressed directly into the "proper" place)
  • Moving to a column-major memory layout (e.g. a pandas DataFrame) would:
    1. make assignment of data to columns much faster (a very common operation with ctable)
    2. open the possibility of avoiding repeated copy operations in the first place
    3. let ctable really leverage multi-core machines; currently the numpy structured array is a bottleneck for __getitem__() and the like: Numpy bottleneck with ctable #174
    4. improve down-stream performance, as much data analysis is carried out along columns rather than along rows
  • due to the similar data-access models of numpy structured arrays and pandas DataFrames, most code could probably remain unchanged once the output data object is initialised as either numpy or pandas
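The layout difference behind these points can be verified directly from the array strides: in a C-ordered (row-major) array the elements of one column are scattered through memory, while in a Fortran-ordered (column-major) array they are contiguous. A minimal illustration (plain numpy, not bcolz code):

```python
import numpy as np

M, N = 1000, 10

# C order (row-major): consecutive elements of a row are adjacent
A_c = np.zeros((M, N), dtype='f4', order='C')
# Fortran order (column-major): consecutive elements of a column are adjacent
A_f = np.zeros((M, N), dtype='f4', order='F')

# byte stride along axis 0, i.e. between two elements of the same column
print(A_c.strides[0])  # 40 -> N * 4 bytes: column elements are scattered
print(A_f.strides[0])  # 4  -> column elements are contiguous in memory
```

En-bloc copying of a column is only possible in the second layout, which is why the column-wise assignments below differ so much in speed.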

Evidence of column assignment bottleneck

Note: for the sake of simplicity, when I use the term "column" in relation to numpy structured arrays, I mean the fields of the structured array that bcolz uses to store its output columns.

Numpy structured arrays (as used by bcolz) are inherently row-major; as far as I can see this cannot be changed. This means that column assignment is fairly slow:

In [1]: import numpy as np

In [2]: M,N=int(1e7),10

# make a row-major numpy array

In [4]: A1=np.zeros((M,N),'f')

In [9]: dt=np.dtype(','.join(['f' for _ in range(N)]))

# make a structured numpy array

In [10]: A2=np.zeros((M,),dtype=dt)

In [11]: X=np.arange(M+0.0)

### simulate ctable __getitem__() operation:
# assignment to a row-major numpy array column for column

In [13]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 2.36 s per loop

# assignment of structured array field for field

In [15]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop

Significant speedup (factor 6.5!) can be achieved by moving to a column-major memory layout for the numpy array:

In [1]: import numpy as np

In [2]: M,N=int(1e7),10

# make a column-major numpy array
In [3]: A1=np.zeros((M,N),'f', 'F')

In [4]: dt=np.dtype(','.join(['f' for _ in range(N)]))

# make a structured numpy array

In [5]: A2=np.zeros((M,),dtype=dt)

In [6]: X=np.arange(M+0.0)

### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column

In [8]: %timeit for n in range(N):A1[:,n]=X
1 loops, best of 3: 374 ms per loop

# assignment of structured array field for field

In [9]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.43 s per loop

Why Pandas DataFrames would help

While moving to column-major numpy arrays would be the ideal solution, this is obviously not an option: they require the entire array to have a homogeneous dtype.

Pandas DataFrames, however, support columns of different dtypes and by design store data in column-major order. (As I remember, this is because Wes McKinney said that most of his data analysis happened along columns, though I cannot find the reference. That said, I think the following article, in which he explains his reasons for not choosing numpy structured arrays, is interesting: Wes McKinney: A Roadmap for Rich Scientific Data Structures in Python.)

In addition, the choice of column-major ordering permits size-mutability: columns can be added without copying the data. This fits well with the ctable column-store philosophy. With numpy structured arrays, the entire array has to be copied (almost entry by entry as far as I can see) to add a new column.
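The contrast can be shown directly. Adding a column to a DataFrame does not require copying the existing columns at assignment time, whereas with a structured array one has to allocate a new array with the extended dtype and copy every field over (a plain-numpy illustration, not bcolz code):

```python
import numpy as np
import pandas as pd

M = 1000

# DataFrame: a new column is simply attached
df = pd.DataFrame({'a': np.zeros(M, 'f4')})
df['b'] = np.ones(M, 'f4')  # existing column data is not copied here

# structured array: the whole array must be rebuilt with a wider dtype
old = np.zeros(M, dtype=[('a', 'f4')])
new = np.empty(M, dtype=[('a', 'f4'), ('b', 'f4')])
for name in old.dtype.names:   # copy every existing field into the new array
    new[name] = old[name]
new['b'] = np.ones(M, 'f4')
```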

Due to these advantages, column-major ordering is used by many well-known dedicated number-crunching environments, among them Fortran, MATLAB and R.

Note: Making a pandas DataFrame from an already existing numpy structured array returned by ctable is not an option either: again, effectively element-by-element copying of the data is required.

Evidence of performance improvements with Pandas

Instantiation of pandas DataFrames is admittedly an issue with smaller databases. Leaving this issue aside for the moment, one can see that assignment to an (already instantiated) DataFrame is much faster:

In [1]: import numpy as np

In [2]: import pandas as pd

In [3]: M,N=int(1e7),10

In [4]: A1=np.zeros((M,N),'f', 'F')

In [5]: dt=np.dtype(','.join(['f' for _ in range(N)]))

In [6]: A2=np.zeros((M,),dtype=dt)

In [7]: X=np.arange(M+0.0)

# make a pandas DataFrame

In [8]: df_A1 = pd.DataFrame(index=xrange(M), columns=range(N), dtype='f')

### simulate ctable __getitem__() operation:
# assignment to a column-major numpy array column for column

In [9]: %timeit for n in range(N): A1[:,n]=X
1 loops, best of 3: 369 ms per loop

# assignment to a structured numpy array field for field

In [10]: %timeit for n in dt.names: A2[n]=X
1 loops, best of 3: 2.36 s per loop

# (optimized) assignment to a pandas DataFrame column for column

In [8]: %timeit for n in range(N): df_A1._data.blocks[0].values[n,:]=X
1 loops, best of 3: 364 ms per loop

Instantiation of the DataFrame will probably be an issue. My gut reaction is that it should be possible to instantiate (and cache) an empty DataFrame with the correct structure and then for each __getitem__() call shallow-copy this template and assign new data arrays. This should help with the instantiation overhead, since the column makeup of a ctable instance usually does not change too often during a program.

Down the road

In the long run it might be worth getting rid of the memory copies altogether for the DataFrame out_flavor and decompressing the chunks directly into the arrays backing the DataFrame. This would likely lead to further performance improvements (though smaller ones, as the remaining memory copies would already be efficient en-bloc copies).
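The zero-copy idea can be illustrated with plain numpy (no bcolz internals): allocate the buffers that will back the DataFrame columns first, fill them chunk by chunk in place (where a real implementation would decompress each chunk directly into the corresponding slice), and only then wrap them:

```python
import numpy as np
import pandas as pd

M, chunklen = 10_000, 1_000

# pre-allocate the arrays that will back the DataFrame columns
cols = {'a': np.empty(M, 'f4'), 'b': np.empty(M, 'i4')}

# fill chunk by chunk, writing into the final buffers directly;
# a real implementation would decompress each chunk into this slice
for start in range(0, M, chunklen):
    sl = slice(start, start + chunklen)
    cols['a'][sl] = np.arange(start, start + chunklen, dtype='f4')
    cols['b'][sl] = start  # placeholder chunk data

# wrap the already-filled buffers without a further copy (where possible)
df = pd.DataFrame(cols, copy=False)
```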


ARF1 commented Apr 19, 2015

See also FrancescAlted's comment on this issue in #66.


ARF1 commented May 3, 2015

@FrancescAlted Some food for thought for when you return to bcolz: I implemented a quick hack to show the possible performance gains (>x2 on my machine) from using a column-major output flavor with ctable: #184.

The implementation is clearly non-ideal but shows what I was after: returning pandas dataframes can be faster than returning numpy structured arrays.

Relative timings of pandas out_flavor:

  • x2.1 faster than numpy out_flavor
  • x3.6 faster than pandas dataframe from numpy out_flavor
  • x1.6 faster than pandas dataframe from dict of numpy arrays

By introducing an abstraction layer for the creation of the "result array" and its data access, one can minimize the impact on core bcolz code. This would also allow users to hook in and implement their own out_flavors.

The implementation of the abstraction layer in the PR is probably sub-optimal. It currently penalises numpy results with only a few rows. I suspect three possible reasons:

  • increased python overhead due to subclassing and more python function calls
  • dispatcher architecture is inefficient
  • my _arr1 cache implementation is too crude

That should be easy enough to solve with some profiling and possibly a different abstraction layer architecture.

Eventually, further performance gains could result if carray wrote its results directly into the pre-allocated numpy arrays backing the dataframe, rather than first assembling its own data-contiguous numpy array before copying its contents into the pandas dataframe. This cannot be achieved with a dict of numpy arrays but only with a "native" pandas out_flavor (or a user-defined implementation of the abstraction layer).

As a side-benefit the abstraction layer would make implementing categoricals (#66) easier.

Timing code:

In [1]: import bcolz

In [2]: import pandas

In [3]: a = bcolz.open(rootdir='myData.bcolz')

In [4]: columns = ['col1', 'col2', 'col3', 'col4', 'col5', 'col6', 'col7', 'col8', 
'col9', 'col10', 'col11']

In [5]: b = a[columns]

In [6]: b
Out[6]:
ctable((8769282,), [('col1', '<i4'), ('col2', '<i4'), ('col3', 'i1'), 
('col4', '<i4'), ('col5', 'i1'), ('col6', '<i4'), ('col7', 'i1'), 
('col8', '<i4'), ('col9','i1'), ('col10', '<i4'), ('col11', 'i1')])
  nbytes: 242.53 MB; cbytes: 66.69 MB; ratio: 3.64
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[[(0, 1687, -2, 1687, -2, 1687, -2, 1687, -2, 0, 0)]
 [(0, 1703, -2, 1703, -2, 1703, -2, 1703, -2, 0, 0)]
 [(0, 1719, -2, 1719, -2, 1719, -2, 1719, -2, 0, 0)] ...,
 [(0, 16191, -2, 16191, -2, 16191, -2, 16191, -2, 0, 0)]
 [(0, 16190, -2, 16190, -2, 16190, -2, 16190, -2, 0, 0)]
 [(0, 16186, -2, 16186, -2, 16186, -2, 16186, -2, 0, 0)]]

In [7]: %timeit -r10 -n10 b[:]
10 loops, best of 10: 2.64 s per loop

In [8]: %timeit -r10 -n10 pandas.DataFrame(b[:])
10 loops, best of 10: 4.5 s per loop

In [9]: %timeit -r10 -n10 pandas.DataFrame({col: b[col][:] for col in columns})
10 loops, best of 10: 2.03 s per loop

In [10]: bcolz.defaults.ctable_out_flavor = 'pandas'

# without pandas/pandas#9977 (uses materialised int64-index)
In [11]: %timeit -r10 -n10 b[:]
10 loops, best of 10: 1.29 s per loop

# with pandas/pandas#9977 (uses non-materialised index)
In [12]: %timeit -r10 -n10 b[:]
10 loops, best of 10: 1.24 s per loop


ARF1 commented May 5, 2015

Closing this in favor of #187 introducing an efficient abstraction layer which allows users to provide their own pandas out_flavor implementation.

ARF1 closed this as completed May 5, 2015