
factorize for bcolz #63

Closed
wants to merge 23 commits into from

Conversation

esc
Member

@esc esc commented Oct 7, 2014

No description provided.

@esc
Member Author

esc commented Oct 7, 2014

The first implementation, factorize_pure, is a slow pure-Python implementation. Here is how it compares to pandas:

In [1]: import bcolz

In [2]: import pandas as pd

In [3]: import numpy as np

In [4]: import random

In [5]: a = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [6]: %timeit pd.factorize(a)
1 loops, best of 3: 952 ms per loop

In [8]: c = bcolz.carray(a)

In [9]: import bcolz.algos

In [11]: %timeit bcolz.algos.factorize_pure(c)
1 loops, best of 3: 51.7 s per loop

Also, here is the profiling output:

%prun -l 20 -s cumulative bcolz.algos.factorize_pure(c)

         40000490 function calls in 62.945 seconds

   Ordered by: cumulative time
   List reduced from 28 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000   62.945   62.945 <string>:1(<module>)
        1   10.059   10.059   62.945   62.945 algos.py:5(factorize_pure)
 10000000   15.851    0.000   52.886    0.000 {method 'append' of 'bcolz.carray_ext.carray' objects}
 10000001   25.481    0.000   37.035    0.000 utils.py:102(to_ndarray)
 10000001   10.504    0.000   10.504    0.000 {numpy.core.multiarray.array}
 10000005    1.050    0.000    1.050    0.000 {len}
      152    0.000    0.000    0.000    0.000 toplevel.py:571(clevel)
      152    0.000    0.000    0.000    0.000 toplevel.py:576(shuffle)
      152    0.000    0.000    0.000    0.000 toplevel.py:581(cname)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 utils.py:72(calc_chunksize)
        4    0.000    0.000    0.000    0.000 collections.py:54(__setitem__)
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 toplevel.py:631(__init__)
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        1    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)
        2    0.000    0.000    0.000    0.000 {math.log10}
        1    0.000    0.000    0.000    0.000 utils.py:52(csformula)

It seems that to_ndarray is quite heavily involved here.
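For illustration, the naive approach boils down to a dict lookup plus a per-element append to the output carray, which is exactly what the profile shows (a minimal sketch of the idea with hypothetical names, not the actual code in algos.py):

import numpy as np
import bcolz

def factorize_naive(values):
    # one dict lookup and one carray.append per element; each append
    # goes through to_ndarray, which is what dominates the profile above
    labels = bcolz.carray(np.empty(0, dtype=np.int64))
    lookup = {}  # value -> integer code, in order of first appearance
    for v in values:
        labels.append(lookup.setdefault(v, len(lookup)))
    reverse = dict((code, v) for v, code in lookup.items())
    return labels, reverse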

@esc
Member Author

esc commented Oct 7, 2014

Added an optimization (factorize_pure2), where the array is converted and appended chunk-wise:

In [7]: %timeit bcolz.algos.factorize_pure2(c)
1 loops, best of 3: 4.5 s per loop

And the corresponding profile:

         606 function calls in 4.560 seconds

   Ordered by: cumulative time
   List reduced from 29 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.560    4.560 <string>:1(<module>)
        1    4.511    4.511    4.560    4.560 algos.py:23(factorize_pure2)
       38    0.047    0.001    0.048    0.001 {method 'append' of 'bcolz.carray_ext.carray' objects}
       39    0.000    0.000    0.001    0.000 utils.py:102(to_ndarray)
      152    0.000    0.000    0.000    0.000 toplevel.py:571(clevel)
      152    0.000    0.000    0.000    0.000 toplevel.py:576(shuffle)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
      152    0.000    0.000    0.000    0.000 toplevel.py:581(cname)
       44    0.000    0.000    0.000    0.000 {len}
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 utils.py:72(calc_chunksize)
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.empty}
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 {numpy.core.multiarray.array}
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
        4    0.000    0.000    0.000    0.000 collections.py:54(__setitem__)
        1    0.000    0.000    0.000    0.000 toplevel.py:631(__init__)
        1    0.000    0.000    0.000    0.000 utils.py:52(csformula)
        1    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        3    0.000    0.000    0.000    0.000 _weakrefset.py:70(__contains__)

I think that is a pretty nice improvement over the naive approach, but we can probably still do better.
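For reference, the chunk-wise idea looks roughly like this (a sketch with hypothetical names; the real factorize_pure2 lives in algos.py):

import numpy as np
import bcolz

def factorize_chunked(carray_in):
    # decompress one chunk at a time, label it into a preallocated
    # buffer, and append whole chunks to the output, so to_ndarray
    # runs once per chunk instead of once per element
    labels = bcolz.carray(np.empty(0, dtype=np.int64))
    lookup = {}
    n = carray_in.chunklen
    buf = np.empty(n, dtype=np.int64)
    for i in range(0, len(carray_in), n):
        chunk = carray_in[i:i + n]       # one decompression per chunk
        for j, v in enumerate(chunk):
            buf[j] = lookup.setdefault(v, len(lookup))
        labels.append(buf[:len(chunk)])  # one append per chunk
    return labels, lookup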

@esc
Member Author

esc commented Oct 7, 2014

And with profiling enabled for the Cython code, we get:

         1703 function calls (1690 primitive calls) in 4.356 seconds

   Ordered by: cumulative time
   List reduced from 59 to 20 due to restriction <20>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.000    0.000    4.356    4.356 <string>:1(<module>)
        1    4.275    4.275    4.355    4.355 algos.py:23(factorize_pure2)
       38    0.000    0.000    0.049    0.001 {method 'append' of 'bcolz.carray_ext.carray' objects}
       38    0.001    0.000    0.049    0.001 carray_ext.pyx:1334(append)
      152    0.000    0.000    0.047    0.000 carray_ext.pyx:305(__cinit__)
      152    0.000    0.000    0.047    0.000 carray_ext.pyx:357(compress_arrdata)
      152    0.046    0.000    0.046    0.000 carray_ext.pyx:406(compress_data)
       39    0.001    0.000    0.031    0.001 carray_ext.pyx:489(__getitem__)
       39    0.030    0.001    0.030    0.001 carray_ext.pyx:462(_getitem)
       39    0.000    0.000    0.000    0.000 utils.py:102(to_ndarray)
        1    0.000    0.000    0.000    0.000 collections.py:38(__init__)
      8/2    0.000    0.000    0.000    0.000 abc.py:148(__subclasscheck__)
        1    0.000    0.000    0.000    0.000 carray_ext.pyx:988(__cinit__)
        1    0.000    0.000    0.000    0.000 carray_ext.pyx:1034(create_carray)
        1    0.000    0.000    0.000    0.000 _abcoll.py:545(update)
        1    0.000    0.000    0.000    0.000 {isinstance}
        1    0.000    0.000    0.000    0.000 abc.py:128(__instancecheck__)
     12/5    0.000    0.000    0.000    0.000 {issubclass}
        5    0.000    0.000    0.000    0.000 _weakrefset.py:36(__init__)
      152    0.000    0.000    0.000    0.000 carray_ext.pyx:243(check_zeros)

@esc
Member Author

esc commented Oct 7, 2014

I made a Cython version too, but it's only marginally faster than the second pure-Python approach; the Cython can probably be tweaked some more.

@esc
Member Author

esc commented Oct 9, 2014

I tweaked the code some more and am now only a factor of 2 slower than pandas:

In [1]: import bcolz, bcolz.algos, numpy, random

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a)

In [4]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.95 s per loop

In [5]: import pandas

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 971 ms per loop

@CarstVaartjes
Contributor

Hi, with the Cython code "independentized" from Pandas (how do you say that, haha), the klib upgrade seems to give a 20% speedup for int64 and a 300% speedup for strings:

import numpy as np
import random
import pandas as pd
from khash.hashtable import Int64HashTable, StringHashTable

def new_klib_int(input_array):
    ht = Int64HashTable(len(input_array))
    return ht.get_labels_groupby(input_array)

def new_klib_str(input_array):
    ht = StringHashTable(len(input_array))
    return ht.factorize(input_array)

a = np.random.randint(100, size=10000000)
b = np.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
b = pd.Series(b).values

%timeit pd.factorize(a)
%timeit new_klib_int(a)
%timeit pd.factorize(b)
%timeit new_klib_str(b)

In [20]: %timeit pd.factorize(a)
10 loops, best of 3: 122 ms per loop

In [21]: %timeit new_klib_int(a)
10 loops, best of 3: 101 ms per loop

In [22]: %timeit pd.factorize(b)
1 loops, best of 3: 496 ms per loop

In [23]: %timeit new_klib_str(b)
10 loops, best of 3: 165 ms per loop

@esc
Member Author

esc commented Oct 10, 2014

@CarstVaartjes that is pretty good. I'll perhaps try to use the klib hashtable in bcolz soon. The difference at the moment is that the 'labels' returned are a NumPy array; for bcolz this should perhaps be a carray? So, for example, you could store the integer labels as another column in the ctable.

As a side note, you are using an int64 for the labels. Depending on the cardinality of the categorical/factor you may get away with a smaller unsigned version, e.g. uint8. You could compute the number of unique values ahead of time, or alternatively downcast the array after you have created the labels, since the size of the hashtable will tell you what size of integer you need. Either one will, however, double the runtime while decreasing the space needed. For a carray, however, due to the shuffle filter in Blosc, that shouldn't matter much, since all the zeros in the large integers will be compressed away.
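A sketch of the downcasting idea (smallest_uint is a hypothetical helper, not part of bcolz or pandas):

import numpy as np

def smallest_uint(n_uniques):
    # pick the smallest unsigned dtype that can represent
    # codes 0 .. n_uniques - 1
    for dt in (np.uint8, np.uint16, np.uint32):
        if n_uniques <= np.iinfo(dt).max + 1:
            return dt
    return np.uint64

# example: downcast int64 labels once the hashtable size is known
labels = np.array([0, 1, 0, 2, 1], dtype=np.int64)
labels = labels.astype(smallest_uint(3))  # -> uint8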

Another interesting read would be the recent addition of the Categorical datatype to Pandas:

https://pandas-docs.github.io/pandas-docs-travis/categorical.html

Effectively, factorize should return an object of the categorical type, and my gut feeling is that if bcolz is to become a useful tool for data analysis, we will need first-class categorical datatypes.

@CarstVaartjes
Contributor

Hi Valentin,

we'll include it in #62 and yes, we want to convert it there to a version that uses a carray! The main issue is that there is no carray_ext pxd file yet (which gives headaches when developing separate Cython files), plus I worry about writing to a carray sequentially and then sorting it (because of the compression performance penalty that random writes trigger).
But in general, we should be okay for a good groupby!

On categorical: we ourselves do something exactly like that in our architecture; we work with separate reference tables that factorize all datetime and string objects into int64 identifiers before writing things to hdf5 (and in the near future, hopefully, bcolz). (We are a small startup that does data analytics for retailers and fmcg manufacturers.) This might be a nice setup in general; I'm not sure the Python object implementation in pandas isn't overdoing it, while a "translated array" (with the integers) and a "translation array" (with the values), saved as separate arrays, might be much better.

@esc esc mentioned this pull request Oct 10, 2014
@esc
Member Author

esc commented Oct 10, 2014

I did some more experimentation using the khash stuff you provided and it turns out to be quite a lot faster than the standard Python dict:

In [1]: import bcolz, bcolz.algos, numpy, random

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')
In [3]: c = bcolz.carray(a)

In [4]: bcolz.carray_ext.factorize_cython(c)[0]
Out[4]: 
carray((9961472,), uint8)
  nbytes: 9.50 MB; cbytes: 5.53 MB; ratio: 1.72
  cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
[0 0 0 ..., 1 2 2]

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.27 s per loop

In [6]: import pandas

In [7]: %timeit pandas.factorize(a)
1 loops, best of 3: 1.01 s per loop

And we are now only about 260 ms slower than pandas on this particular example.

@esc
Member Author

esc commented Oct 10, 2014

Strangely enough, using typed NumPy arrays we become faster than pandas:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'IL' 'CA' ..., 'NY' 'NY' 'OH']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 687 ms per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 1.03 s per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'

Though I'm not sure whether what I am seeing is correct; it is getting quite late and I fear I made some kind of error somewhere. Anyway, I'll be double-checking the results again soon.

@CarstVaartjes
Contributor

It's quicker than Pandas because it's an updated version of klib, see: pandas-dev/pandas#8524

@esc
Member Author

esc commented Oct 10, 2014

@CarstVaartjes ah, I see. So I'll wait for the new klib to make it into pandas and then test again, but good to know that it might be comparable.

@FrancescElies
Member

Trying to catch up with your current code base, I did some tests with klib 0.2.8 and the old 0.2.6; both lead to very similar results. With @esc's factorize, at least for me, the new klib is as fast as the old one in this example.
On my laptop it is not faster than pandas, so bcolz and the way factorize is built may play a role in the numbers we see (see both results below).

klib_version = '0.2.8'
In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'NY' 'CA' ..., 'IL' 'IL' 'CA']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.18 s per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 943 ms per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'
klib_version = '0.2.6'
In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: c
Out[4]: 
carray((10000000,), |S2)
  nbytes: 19.07 MB; cbytes: 8.73 MB; ratio: 2.18
  cparams := cparams(clevel=5, shuffle=False, cname='blosclz')
['NY' 'IL' 'IL' ..., 'CA' 'CA' 'IL']

In [5]: %timeit bcolz.carray_ext.factorize_cython(c)
1 loops, best of 3: 1.16 s per loop

In [6]: %timeit pandas.factorize(a)
1 loops, best of 3: 939 ms per loop

In [7]: pandas.__version__
Out[7]: '0.14.0'

@FrancescElies
Member

Hi Valentin (@esc)

do you also get this result for the bcolz factorize, or am I missing something?
The bcolz factorize result looks a bit strange; the output below comes from klib 0.2.8, and if klib 0.2.6 is used it also looks different.

Thanks

In [7]: result_bcolz = bcolz.carray_ext.factorize_cython(c)

In [8]: result_pandas = pandas.factorize(a)

In [9]: result_pandas
Out[9]: (array([0, 1, 2, ..., 2, 1, 2]), array(['IL', 'OH', 'NY', 'CA'], dtype=object))

In [10]: result_bcolz
Out[10]: 
(carray((9961472,), uint8)
   nbytes: 9.50 MB; cbytes: 6.88 MB; ratio: 1.38
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 6 6 1],
 {0: 'IL',
  1: 'OH',
  2: 'CA',
  3: 'OH',
  4: 'CA',
  5: 'NY',
  6: 'OH',
  7: 'NY',
  8: 'NY',
  9: 'NY',
  10: 'CA',
  11: 'CA',
  12: 'IL',
  13: 'CA',
  14: 'CA',
  15: 'NY',
  16: 'CA',
  17: 'NY',
  18: 'OH',
  19: 'NY',
  20: 'IL',
  21: 'IL',
  22: 'NY',
  23: 'OH',
  24: 'IL',
  25: 'OH',
  26: 'IL',
  27: 'IL',
  28: 'OH',
  29: 'IL',
  30: 'IL'})

@esc
Member Author

esc commented Oct 13, 2014

@FrancescElies No, that seems to be foobared. I'm double-checking it now. BTW: how did you swap the klib in and out?

@esc
Member Author

esc commented Oct 13, 2014

I actually get a slightly different result:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(10000000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: result = bcolz.carray_ext.factorize_cython(c)

In [5]: result
Out[5]: 
(carray((9961472,), uint8)
   nbytes: 9.50 MB; cbytes: 5.53 MB; ratio: 1.72
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 0 0 1], {0: 'NY', 1: 'OH', 2: 'CA'})

And after inserting some print statements, it seems that NY and IL actually hash to the same value.
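A cheap sanity check for this kind of bug, continuing the session above, is to compare the number of codes against the number of distinct values in the raw data:

labels, reverse = result
# four distinct states went in, so getting fewer than four codes out
# means values collided in the hash table
assert len(reverse) == len(numpy.unique(c[:])), "hash collision detected"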

@esc
Member Author

esc commented Oct 13, 2014

Switching back to the Python dictionary gives correct results once again:

In [1]: import bcolz, numpy, random, pandas

In [2]: a = numpy.fromiter((random.choice(['NY', 'IL', 'OH', 'CA']) for _ in xrange(500000)), dtype='S2')

In [3]: c = bcolz.carray(a, bcolz.cparams(clevel=5, shuffle=False, cname='blosclz'))

In [4]: result = bcolz.carray_ext.factorize_cython(c)

In [5]: result
Out[5]: 
(carray((393216,), uint8)
   nbytes: 384.00 KB; cbytes: 410.34 KB; ratio: 0.94
   cparams := cparams(clevel=5, shuffle=True, cname='blosclz')
 [0 1 0 ..., 3 1 1], {'CA': 1, 'IL': 3, 'NY': 0, 'OH': 2})

In [6]: c[10]
Out[6]: 'OH'

In [7]: c[:10]
Out[7]: 
array(['NY', 'CA', 'NY', 'CA', 'CA', 'OH', 'OH', 'CA', 'NY', 'NY'], 
      dtype='|S2')

In [8]: result[0][:10]
Out[8]: array([0, 1, 0, 1, 1, 2, 2, 1, 0, 0], dtype=uint8)

@CarstVaartjes
Contributor

I can see why you would want to have a split between the core and the analytics part (maintenance, bloat, etc.); some things which are different or would have to be looked at, from my humble PoV:

  • the ctable is something that lives in Pandas rather than numpy (not that I'm advocating that you should do that btw! I just mean it's a difference, though it's of course really good that bcolz already has it)
  • Make a carray_ext.pxd file #68 would be necessary (a good thing in general of course, but here a necessity); we looked at it, but it will be a mighty complicated pxd with all the available functions
  • some things are a bit of a grey area, e.g.
    • factorizing is necessary for analytics, but maybe nice to have as core functionality for other apps
    • caching a factorization result (the index array and the unique value array/dict) can be great; if it's in bcolz you might have it as a direct function on a carray / ctable columns, but it can also work from the analytics part, though in that case you cannot enforce re-indexing after creates/updates/deletes if a second process writes to them
  • if you want to add layers on top such as multi-server distribution (for load balancing and/or failover), data access security models, etc., having bcolz and analytics as separate modules would be good (because that way you can plug a model in between)
  • with separate modules we might also be able to open source some of our internal stuff (S3 syncing, metadata models), which definitely belongs in a higher-level module (you would get bcolz -> load balancing -> analytics -> meta layer with security checks, data definitions & web calls)

Just some thoughts. Probably the earlier this happens the better, right? So the pxd is really high prio then?

@FrancescElies
Member

@esc you can find the tests in
https://github.com/FrancescElies/bcolz/blob/factorize/bcolz/tests/test_carray.py
In factorizeStringsTest, test01 is pandas-dependent, which I am not sure is a good idea.

You can also find factorizeIntsTest, but this is obviously not working yet, since it would test factorize for ints.

I also started having a look at groupsort from pandas
https://github.com/FrancescElies/bcolz/blob/groupsort/bcolz/algos.py#L44-44

At the moment I am having a look at the info you sent. I am open to any comments or things you would like me to change, in case you are interested. Thanks!

@FrancescElies FrancescElies mentioned this pull request Oct 16, 2014
@FrancescElies
Member

@esc, inside factorize_cython, would it be a good idea to type the reverse parameter as a carray? Though having all values unique should normally not be the case, this dict could potentially get quite big in some particular cases.

@esc
Member Author

esc commented Oct 16, 2014

@FrancescElies I had been thinking about that. In principle the lookup is indexed and ordered, so it is possible to replace the dict with a carray. Also, since carray appends are (or should be) lightning fast for the most part, you could simply append to the carray every time you encounter a new unique value w/o much overhead. Only benchmarking will tell how that performs, though. In my first implementation of factorize I noticed that there may be some speed issues with appending scalars to carrays. That will most probably have to be looked at.
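A sketch of what that could look like (hypothetical and unbenchmarked; codes are assigned in encounter order, so code i is simply position i in the reverse carray):

import numpy as np
import bcolz

def factorize_with_carray_reverse(values, dtype='S2'):
    lookup = {}  # value -> code
    # reverse lookup as a carray: indexed and ordered by code
    reverse = bcolz.carray(np.empty(0, dtype=dtype))
    labels = np.empty(len(values), dtype=np.int64)
    for i, v in enumerate(values):
        if v not in lookup:
            lookup[v] = len(lookup)
            reverse.append(v)  # scalar append; see the speed caveat above
        labels[i] = lookup[v]
    return bcolz.carray(labels), reverse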

@CarstVaartjes
Contributor

edit: oops, misread the whole thing. The helper was about reading a chunk at a time, not writing it (so the stuff below doesn't make sense).
edit2: no, it actually might make sense ;) as it reads a chunk at a time but then buffers the writing

btw, about the issue with appends in chunks in _factorize_helper_with_counts, isn't this a possible solution?

class carray_buf(object):

    def __init__(self, carray_input):
        self.carray = carray_input
        self.length = carray_input.chunklen  # flush threshold: one chunk
        self.index = 0
        self.value_list = []

    def append(self, value):
        # buffer values in a plain list and write a whole chunk at once
        self.index += 1
        self.value_list.append(value)

        if self.index == self.length:
            self.carray.append(self.value_list)
            self.value_list = []
            self.index = 0

    def close(self):
        # first check whether there are actual values left to save
        if self.value_list:
            self.carray.append(self.value_list)
            self.carray.flush()
            self.value_list = []
            self.index = 0

(This is based on some similar issues we had inside our algorithms with saving up hdf5 writes in our current solution.) Then you can just directly append to the item each time and it will take care of efficient saving itself.
Some things that would need to be improved:

  1. cythonize it with cdefs
  2. during the init, self.index should be set to the current leftovers, so it first fills up the current chunk (see the sketch below)
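For point 2, the __init__ could be seeded from the carray's current fill, e.g. (a sketch; it assumes len(carray) % chunklen gives the size of the partial last chunk):

    def __init__(self, carray_input):
        self.carray = carray_input
        self.length = carray_input.chunklen
        # start from the current partial-chunk fill so the first flush
        # completes the existing chunk instead of starting a new one
        self.index = len(carray_input) % carray_input.chunklen
        self.value_list = []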

@CarstVaartjes
Contributor

Ok so I tried that and while it worked nicely, it made it twice as slow ;) oops!

@FrancescAlted
Member

Just to give my opinion on what @esc is wondering: yes, my intention was to keep bcolz really minimal, and my hope was to find in PyToolz/CyToolz a nice counterpart for taking care of the analytics part, but my hopes really vanished as soon as you showed me that you can easily get a 10x increase in performance by implementing groupby internally in bcolz.

Frankly, I don't know where bcolz should go. My initial intention was indeed to keep it simple (I knew that I was not going to have too much time to put into it), but on the other hand performance is important as well, so perhaps integrating more functionality into it is good. The problem is: at which point should we decide to stop adding high-performance functionality? And, if we do not put a limit, are we perhaps bound to replicate pandas, just using carray/ctable bcolz objects? If so, would it not be better to make pandas work on top of bcolz?

Hmm, we are facing an important decision here, and I don't have all the answers. For the time being though I would like to move this discussion to the mailing list, and perhaps involve people from pandas and other related projects to participate and express their thoughts there.

At any rate, I really like that this discussion is happening and encourage you to continue providing suggestions on how bcolz can be improved.

@CarstVaartjes
Contributor

Hi Francesc (A. :)

I have some ideas, I will put them in an e-mail to you and Valentin later this week! It's probably much better if you involve Jeff Reback etc., as they only know me from posting obscure Pandas & PyTables bugs.
To be honest, I think it will be incredibly hard to disentangle Pandas from NumPy (lots of specific optimizations have been done for it); also see what Continuum Analytics is doing now. To make bcolz perform you need to do some tricks too (see some of the really great stuff @esc did in the factorization) and you have to keep the chunks in mind for optimum performance (some algorithms such as counting sort depend on non-sequential writes, which also gives some issues). But with some tweaks out-of-core bcolz is extremely close to pandas, and that alone makes it incredible (and gives it the potential to blast away Hadoop etc.)
Anyway, I will write a mail later this week!

On the groupby, we're getting really, really close. I've extended/adjusted @FrancescElies's code to create a groupby function that can work over multiple columns and also cache factorizations. See: https://github.com/FrancescElies/bcolz/commits/groupsort

import bcolz
from bcolz import carray_ext
import tempfile
import os
import pandas as pd
import numpy as np

projects = [
{'a1': 'build roads',  'a2': 'CA', 'a3': 2, 'm1':   1, 'm2':  2, 'm3':    3.2},
{'a1': 'fight crime',  'a2': 'IL', 'a3': 1, 'm1':   2, 'm2':  3, 'm3':    4.1},
{'a1': 'help farmers', 'a2': 'IL', 'a3': 1, 'm1':   4, 'm2':  5, 'm3':    6.2},
{'a1': 'help farmers', 'a2': 'CA', 'a3': 3, 'm1':   8, 'm2':  9, 'm3':   10.9},
{'a1': 'build roads',  'a2': 'CA', 'a3': 3, 'm1':  16, 'm2': 17, 'm3':   18.2},
{'a1': 'fight crime',  'a2': 'IL', 'a3': 1, 'm1':  32, 'm2': 33, 'm3':   34.3},
{'a1': 'help farmers', 'a2': 'IL', 'a3': 2, 'm1':  64, 'm2': 65, 'm3':   66.6},
{'a1': 'help farmers', 'a2': 'AR', 'a3': 3, 'm1': 128, 'm2': 129, 'm3': 130.9}
]

df = pd.DataFrame(projects * 10000)
print df

prefix = 'bcolz-'
rootdir = tempfile.mkdtemp(prefix=prefix)
os.rmdir(rootdir)  # tests needs this cleared
print(rootdir)
fact_bcolz = bcolz.ctable.fromdataframe(df, rootdir=rootdir)
fact_bcolz.rootdir
self = fact_bcolz

groupby_cols = ['a1', 'a2', 'a3']
fact_bcolz.cache_factor(groupby_cols, refresh=True)  # this caches the factorizations on-disk directly in the rootdir
fact_bcolz.groupby(groupby_cols, {})  # does the first 3 parts of the groupby, see the code

There's room for improvement left, of course: ideally there would be a metadata layer for each column where I can write two carrays (for the factorized index and for the unique values). Also, numexpr cannot handle unsigned int64, unfortunately. But hey, it works! :) It's still slow compared to Pandas, however, so more optimizations are required, but I would say: let's make something that works from start to end first, then look at bottlenecks and optimize.

@esc
Member Author

esc commented Oct 22, 2014

I'll answer inline to the above using my mailer. Also, I would tend to agree with @FrancescAlted about moving any more lengthy discussions to the mailing lists. That way we can use the threading inherent to email in case we need to spin off tangents; the GitHub interface really sucks at this kind of stuff and isn't really suited for more than a few simple replies back and forth.

@esc
Member Author

esc commented Oct 22, 2014

Hi,

  • Carst Vaartjes notifications@github.com [2014-10-16]:

    I can see why you would want to have a split between the core and the
    analytics part (maintenance, bloat, etc.); some things which are
    different or have to look at from my humble PoV:

  • the ctable is something that is in Pandas rather than numpy (not
    that I'm advocating that you should do that btw! I just mean it's a
    difference, though it's of course really good that bcolz already has
    it)

Yes, perhaps the ctable needs to move then too? Or everything gets
dumped into bcolz. That would probably work too.

  • Make a carray_ext.pxd file #68 would be necessary (a good thing in general of course, but also
    necessary); we looked at it, but it will be a mighty complicated pxd
    with all the available functions

Hehe, yes.

  • some things are a bit grey areas, e.g.
    • factorizing is necessary for analytics, but maybe nice to have as
      core functionality for other apps

True. Requires the klib stuff though if it wants to be fast.

  • caching of a factorization result (the index array and the unique
    value array/dict) can be great; if it's in bcolz you might have it as
    a direct function on a carray / table columns, but it can also work
    from the analytics part though in that case you cannot enforce
    re-indexing after create/update/deletes if a second process writes
    from them

Yeah, that sucks. Maybe we need to hash the contents somehow to a file
on disk and invalidate the factorization if the hash changes.
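A minimal sketch of such an invalidation check (hypothetical; it hashes the data chunk-wise and would be compared against a digest stored alongside the cached factorization):

import hashlib

def content_digest(carray_in):
    # hash the data chunk-wise; recompute and compare this against the
    # digest stored when the factorization was cached
    h = hashlib.sha1()
    n = carray_in.chunklen
    for i in range(0, len(carray_in), n):
        h.update(carray_in[i:i + n].tobytes())
    return h.hexdigest()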

  • if you want to go to add layers on top such as multi-server
    distribution (for load balancing and/or failover), data access
    security models, etc. having them bcolz and analytics as separate
    modules would be good (because that way you can plug in a model in
    between)

Absolutely. Modular design does have its benefits. But drawbacks too;
just look at GNU Hurd.

  • with separate modules we might be able to also open source some of
    our internal stuff (s3 synching, metadata models) which definitely are
    better for a higher level module (you would get bcolz -> load
    balancing -> analytics -> meta layer with security check, data
    definitions & web calls)

That would be nice. I have a bloscpack S3 backend in the mental
pipeline, just need to lay down the code.

Just some thoughts. Probably the earlier this would happen the better
right? So the pxd is then really high prio?

Well, the pxd file would be awesome for sure; the Cython code seems a
bit mangled already. It might be tricky, but it will be essential for reuse.

@esc
Member Author

esc commented Oct 22, 2014

How do you want me to integrate this? Can I cherry-pick the commit and
push it as part of my pull-request? Do you see the factorization stuff
as a single pull-request or should that be a commit in the groupby
branch?

best,

V-

@esc
Member Author

esc commented Oct 22, 2014

  • Carst Vaartjes notifications@github.com [2014-10-18]:

    btw about the issue with appends in chunks in the _factorize_helper_with_counts, isn't this a possible solution?

Yeah, I am not sure what is going on there. If the thing to append is a
scalar of the correct type, it should just be unboxed and placed into the
leftovers array. After placing it there, check if the leftovers array is
full; if so, compress it and open a new, empty one. Maybe that is what is
being done already, but I think it was super slow last time I checked.


@esc
Member Author

esc commented Oct 22, 2014

  • Carst Vaartjes notifications@github.com [2014-10-19]:

    Hi Francesc (A. :)

    I have some ideas, I will put them on e-mail to you and Valentin later
    this week! It's probably much better if you involve Jeff Reback etc.
    as they only know me from posting obscure Pandas & Pytables bugs.

To be honest, I think it will be incredibly hard to disentangle Pandas
from Numpy (lots of specific optimizations have been done for it) +
see what continuum analytics is doing now. Also to make bcolz perform
you need to do some tricks too (see some of the really great stuff
@esc did in the factorization) and have to keep in mind the chunks for
optimum performance (some algorithms such as counted sort depend on
non-sequential writes, which also gives some issues). But with some
tweaks out-of-core bcolz is extremely close to pandas, and that alone
makes it incredible (and gives it a potential to blast away hadoop
etc.)

Yes, we can't put pandas on top of bcolz. Things aren't modular enough
to allow that.

Anyway, will make a mail later this week!

On the groupby, we're getting really really close. I've extended/adjusted @FrancescElies his code to create a groupby function that can work over multiple columns and also cache factorizations. See: https://github.com/FrancescElies/bcolz/commits/groupsort


This looks really promising! I would like to profile and optimize this!

@CarstVaartjes
Contributor

Hi esc, let us wrap up the commit first into a nice, quickly performing whole (Friday, I really hope), and then you can decide how to pick (and what should and should not be in bcolz).
The good news: we have it fully functioning now! The bad news: there are still some in-core elements and some issues around performance (I hope to solve the biggest bottlenecks by late Friday by applying your factorization writing method to other parts too).
If only there was an "auto-create pxd" macro :)

else:
    # allocate enough memory to hold the string, add one for the
    # null byte that marks the end of the string.
    insert = <char *>malloc(allocation_size)
Member Author

Not sure if we need to check for a NULL pointer here. I never did grok Cython's malloc wrapper. (But IIRC yes, we need to check it.)
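For reference, a minimal sketch of the check (assuming the usual libc.stdlib malloc, which is the raw C malloc and returns NULL on failure):

from libc.stdlib cimport malloc

insert = <char *>malloc(allocation_size)
if insert == NULL:
    # malloc does not raise on failure; do it explicitly
    raise MemoryError("could not allocate string buffer")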

@FrancescElies
Member

@esc & @CarstVaartjes about integrating the tests, I really have no preference. In my opinion you can do it whichever way fits best for you: cherry-picking, copy-paste.
I am just glad if you want to take any part :)

@esc
Member Author

esc commented Oct 23, 2014

Okay, I see that my own pull-request is still a bit messy and I'm not sure it will be the pull-request that ends up in the code base proper. However, I'd like to start thinking about how to structure the individual pull-requests for the groupby so that they are as easy to review as possible, i.e. bite-sized chunks (no pun intended). For example, from this pull-request, the klib integration (whatever form that might take) is one thing. The factorize code (in all its variants, and for all types) is another.

@FrancescElies
Member

As somebody told me today, I should not be doing this at this moment :), anyway...

After trying out quite a lot of things, working hard with @CarstVaartjes, using @esc's code to produce similar things, and having discussions around nice ideas from @esc & @CarstVaartjes for a groupby function, in the end the problems were more or less successfully tackled.

We are aware that this is not finished at all; it can still be optimised, better structured, and stripped of code replication, and of course it needs further testing.

Nevertheless, we would like to generate debate around it and see if in the end we can get something nice ;)

A short summary of the actual groupby implementation (a minimal sketch follows the list):

  1. Factorize the groupby columns and save the result to disk (this caches the factorization for further use, saving time if groupby is called again)
  2. Compute a group index for each observation
  3. Create the aggregation result
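For illustration, a minimal NumPy sketch of steps 2 and 3 for a single measure column (hypothetical names; the real implementation works chunk-wise on carrays and assumes step 1 already produced the per-column label arrays):

import numpy as np

def groupby_sum(label_cols, measure):
    # 2. combine per-column labels into one group index per row,
    #    like a mixed-radix number over the columns' cardinalities
    index = np.zeros(len(measure), dtype=np.int64)
    for labels in label_cols:
        index = index * (labels.max() + 1) + labels
    # re-factorize the combined index into dense group ids
    _, groups = np.unique(index, return_inverse=True)
    # 3. sum each measure value into its group's slot
    out = np.zeros(groups.max() + 1, dtype=measure.dtype)
    np.add.at(out, groups, measure)
    return out

a1 = np.array([0, 1, 1, 0])          # labels for one groupby column
a2 = np.array([0, 0, 1, 0])          # labels for another
m = np.array([1.0, 2.0, 3.0, 4.0])   # a measure column
print(groupby_sum([a1, a2], m))      # -> [ 5.  2.  3.]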

The following file https://github.com/FrancescElies/bcolz/blob/groupby/test_speed/groupby_vs_pandas.py shows how the groupby function is being called at the moment; it runs a small benchmark and compares the execution time against pandas (code based on @FrancescAlted's sum script, #78 (comment)). The correctness of the values needs visual inspection, though.
For testing purposes each of the groupby columns has a different type (string, float64, int32 & int64). The comparison to pandas does not take the caching part into account (caching should only happen once as long as the data suffers no modification).

Benchmark groupby: suppose we want to group the following dataframe and aggregate the sum of the different groups (groupby columns a1, a2, a3 & a4):

N = int(1e6)
projects = [
{'a1': 'build roads',  'a2': 1.1, 'a3': 2, 'a4': 2000000000, 'm1':   1, 'm2':  2, 'm3':    3.2},
{'a1': 'fight crime',  'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':   2, 'm2':  3, 'm3':    4.1},
{'a1': 'help farmers', 'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':   4, 'm2':  5, 'm3':    6.2},
{'a1': 'help farmers', 'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1':   8, 'm2':  9, 'm3':   10.9},
{'a1': 'build roads',  'a2': 1.1, 'a3': 3, 'a4': 3000000000, 'm1':  16, 'm2': 17, 'm3':   18.2},
{'a1': 'fight crime',  'a2': 1.2, 'a3': 1, 'a4': 1000000000, 'm1':  32, 'm2': 33, 'm3':   34.3},
{'a1': 'help farmers', 'a2': 1.2, 'a3': 2, 'a4': 2000000000, 'm1':  64, 'm2': 65, 'm3':   66.6},
{'a1': 'help farmers', 'a2': 1.3, 'a3': 3, 'a4': 3000000000, 'm1': 128, 'm2': 129, 'm3': 130.9}
]
df = pd.DataFrame(projects * N)
> python test_speed/groupby_vs_pandas.py

reference
                                       m1         m2            m3
a1           a2  a3 a4
build roads  1.1 2  2000000000    1000000    2000000  3.200000e+06
                 3  3000000000   16000000   17000000  1.820000e+07
fight crime  1.2 1  1000000000   34000000   36000000  3.840000e+07
help farmers 1.1 3  3000000000    8000000    9000000  1.090000e+07
             1.2 1  1000000000    4000000    5000000  6.200000e+06
                 2  2000000000   64000000   65000000  6.660000e+07
             1.3 3  3000000000  128000000  129000000  1.309000e+08
-->  Pandas groupby 1.396 sec

-->  Bcolz caching 1.801 sec

[('build roads', 1.1, 2, 2000000000, 1000000, 2000000, 3200000.0000426522)
 ('fight crime', 1.2, 1, 1000000000, 34000000, 36000000, 38400000.00080768)
 ('help farmers', 1.2, 1, 1000000000, 4000000, 5000000, 6200000.0001121415)
 ('help farmers', 1.1, 3, 3000000000, 8000000, 9000000, 10900000.000203883)
 ('build roads', 1.1, 3, 3000000000, 16000000, 17000000, 18199999.99965895)
 ('help farmers', 1.2, 2, 2000000000, 64000000, 65000000, 66600000.0010485)
 ('help farmers', 1.3, 3, 3000000000, 128000000, 129000000, 130900000.00236544)]
-->  Bcolz groupby 2.333 sec

1.671x slower than pandas (cache time not taken into account)

The profiler output looks like this (sum and looping seem to be consuming most of the time):

In [7]: %prun -l 10 -s cumulative fact_bcolz.groupby(groupby_cols, agg_list)
         24218115 function calls in 6.557 seconds

   Ordered by: cumulative time
   List reduced from 152 to 10 due to restriction <10>

   ncalls  tottime  percall  cumtime  percall filename:lineno(function)
        1    0.001    0.001    6.557    6.557 ctable.py:1289(groupby)
        1    0.000    0.000    6.323    6.323 {bcolz.carray_ext.aggregate_groups}
        1    0.008    0.008    6.323    6.323 carray_ext.pyx:3025(aggregate_groups)
       21    2.992    0.142    5.959    0.284 carray_ext.pyx:3014(agg_sum)
 24000049    2.083    0.000    3.024    0.000 carray_ext.pyx:2449(__next__)
     1369    0.204    0.000    0.970    0.001 carray_ext.pyx:1862(__getitem__)
     4376    0.713    0.000    0.713    0.000 carray_ext.pyx:465(_getitem)
     2990    0.026    0.000    0.541    0.000 carray_ext.pyx:504(__getitem__)
        8    0.005    0.001    0.444    0.055 chunked_eval.py:81(eval)
        8    0.010    0.001    0.438    0.055 chunked_eval.py:157(_eval_blocks)

The code can be found at https://github.com/FrancescElies/bcolz/tree/groupby

Again a long post, it can't be helped, sorry :)

@CarstVaartjes: if I am missing something, please feel free to add.

@esc & @FrancescAlted thanks a lot for keeping up with and responding to our never-ending questions

@CarstVaartjes
Contributor

Maybe it's good to summarize it:

  • We have a fully functioning groupby now (https://github.com/FrancescElies/bcolz/tree/groupby)
  • Most parts are out-of-core; only the hashing in the factorization isn't, but that should not give issues except in extreme cases
  • We have built a method to cache the factorization for disk-based ctables
  • Without caching, in our specific example the relative performance of out-of-core bcolz vs in-mem pandas is 2.5-3.0x
  • With caching, out-of-core bcolz is only 1.4-1.8x slower than in-mem pandas <- this is extremely impressive!
  • I will build in a bool-array sub-selection method this weekend (so you can combine a where clause and a groupby directly); after that I would say groupby version 0.1 is finished and ready for production testing :)
  • There are probably quite a lot of improvements that can still be made (extensions, performance improvements), but we need some guidance from you guys there. I would make that version 0.2
  • In terms of structure (what should be where etc.) the setup is a bit hampered by the current state of bcolz (no pxd, etc.), so we cannot make it a separate module on top of bcolz yet, but there is no theoretical reason why that would not be feasible.

Please let us know how you guys want to proceed from here (integrate it, have it as a separate function, the relationship to pandas, blaze, etc. etc.) -> shall I write a mail to the list for a discussion?
We ourselves are testing this version atm and will switch part of our own system from in-mem pandas services to out-of-core bcolz (which will make it a lot more scalable etc.)

@FrancescAlted
Member

Thanks Valentin for the summary, because I was a bit lost with so many PRs
and tickets (for example, I doubt that #78 was really necessary, unless we
are entering an era where mailing lists are really a thing of the past). I
continue to think that we should use the mailing list more for having this
discussion (call me old-fashioned if you want :). Valentin, could you
please start a thread there so that we can start discussing the different
problems that you want to tackle?

Thanks again!


@CarstVaartjes
Contributor

Actually, both #78 and #76 can be closed now; this pull as well, I think -> once you have decided on how you want to integrate the functionality, we will prepare that and make an entirely new pull.
There was also quite a bit of e-mail traffic on the side, next to all of this, and I understand that this is not really an "issue" but more a discussion and should be put on the mailing list :)
I'll write a mail and let's see how you want to take it from there.

@FrancescElies
Member

My apologies if I opened too many tickets; the open source field is new to me.

@esc
Member Author

esc commented Dec 21, 2014

What should we do with this issue? Perhaps in the future we can use the mailing list for any architectural and planning discussions and reserve issues for tracking bugs and fixes.

@CarstVaartjes
Contributor

Good point, this got a bit more out of hand than we thought! Let's do the rest through the mailing list. In short: it works (we've gone into production with the bcolz-based groupby, and it works really well!) and it is built of the following 3 blocks:

  1. carray_ext pxd extension so we can have a more modular approach (existing pull request)
  2. basic carray functionality: factorization
  3. ctable groupby functionality

Let's see how we can do that there

@CarstVaartjes
Contributor

This one can be closed btw, as it's covered in bquery now

@esc
Member Author

esc commented Feb 28, 2015

Sure.

@esc esc closed this Feb 28, 2015