<img src="images/hdd.jpg" width="20%" align="right">

DataFrame Storage
================

Decompressing text and parsing CSV files is expensive.  One of the most effective strategies with medium-sized data is to use a binary storage format like HDF5.  Often this speeds things up enough that you can switch back to using Pandas instead of dask.

In this section we'll learn how to efficiently arrange and store your datasets in on-disk binary formats.
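
To get a feel for why text parsing costs more than reading binary data, here is a minimal sketch that uses only the Python standard library.  The `amounts` list is a made-up stand-in for a numeric column, and the raw `array` file is a toy binary format, not HDF5.

```python
import array
import csv
import os
import tempfile

# Hypothetical stand-in for a numeric column such as `amount`
amounts = list(range(100000))

tmpdir = tempfile.mkdtemp()
csv_path = os.path.join(tmpdir, 'amounts.csv')
bin_path = os.path.join(tmpdir, 'amounts.bin')

# Text: every number is written as characters and must be parsed back
with open(csv_path, 'w', newline='') as f:
    csv.writer(f).writerows([a] for a in amounts)

# Binary: the same numbers as raw 8-byte integers, with no parsing step
with open(bin_path, 'wb') as f:
    array.array('q', amounts).tofile(f)

# Reading the text file converts each field from str to int
with open(csv_path, newline='') as f:
    from_csv = [int(row[0]) for row in csv.reader(f)]

# Reading the binary file copies the bytes straight into an array
from_bin = array.array('q')
with open(bin_path, 'rb') as f:
    from_bin.fromfile(f, len(amounts))
```

Both reads return the same values.  Timing the two read steps (for example with `%timeit` in a notebook) is a quick way to compare them on your own machine.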

### Setup

Create data if we don't have any

In [None]:
from prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

### Read CSV

First we read our CSV data as before.

In [None]:
import os
filename = os.path.join('data', 'accounts.*.csv')
filename

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)
df.head()

### Write to HDF5

Pandas contains a specialized HDF5 format, `HDFStore`.  The ``dd.DataFrame.to_hdf`` method works exactly like the ``pd.DataFrame.to_hdf`` method.

In [None]:
target = os.path.join('data', 'accounts.h5')
target

In [None]:
%%time
df.to_hdf(target, '/data')

In [None]:
df2 = dd.read_hdf(target, '/data')
df2.head()

### Compare CSV to HDF5 speeds

We do a simple computation that requires reading in a bit of our dataset, and compare performance between the CSV files and our newly created HDF5 file.

In [None]:
%time df.amount.sum().compute()

In [None]:
%time df2.amount.sum().compute()

Sadly this is about the same cost.  The culprit here is the `names` column, which has object dtype and is thus hard to store efficiently.

### Categoricals

We can use Pandas categoricals to replace our object dtypes with a numerical representation.  This takes a bit more time up front, but results in better performance.

In [None]:
%%time
df.categorize(columns=['names']).to_hdf(target, '/data2')

In [None]:
df2 = dd.read_hdf(target, '/data2')
df2.head()

In [None]:
%%time
df2.amount.sum().compute()

This is significantly faster.  This tells us that it's not only the file type that we use but also how we represent our variables that influences storage performance.
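
To see what the categorical representation changes, here is a small sketch in plain Pandas.  The `names` series is invented for illustration (a few distinct values repeated many times), so the exact byte counts will differ from our accounts data.

```python
import pandas as pd

# Hypothetical stand-in for the `names` column: few distinct values, many rows
names = pd.Series(['Alice', 'Bob', 'Charlie', 'Dan'] * 25000)

as_object = names.astype(object)
as_category = names.astype('category')

# deep=True counts the Python string objects, not just the pointers to them
object_bytes = as_object.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)

# Each row is now a small integer code into the list of categories
print(as_category.cat.categories.tolist())
print(as_category.cat.codes.dtype)
print(object_bytes, category_bytes)
```

The integer codes are fixed-width numbers, which binary formats store far more easily than arbitrary Python strings.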

However, this can still be better.  We had to read all of the columns (`names` and `amount`) in order to compute the sum of one (`amount`).  We'll improve further on this with `castra`, an on-disk column-store.  First, though, we learn how to set an index in a dask.dataframe.

`set_index`
------------

As we're about to learn, the index is even more important in `dask.dataframe` than it was in `pandas`.  The index determines how we parallelize our computations and how efficiently we can index into parts of our dataset.  

In [None]:
%%time
# By default `DataFrame.set_index` is *not lazily evaluated*
df3 = df.set_index('id')

In [None]:
df3.head()

### But now we can perform lookups with `.loc`

In [None]:
%%time 
df3.loc[100].head()
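
Why does a sorted index make `.loc` cheap?  A rough sketch, assuming the index is split into sorted partitions with known boundaries (loosely modelled on what dask calls `divisions`).  The boundary values and the `partition_for` helper are hypothetical and are not part of dask's API.

```python
from bisect import bisect_right

# Hypothetical boundaries for an index sorted into three partitions:
# partition 0 holds ids 0..99, partition 1 holds 100..199,
# partition 2 holds 200..300 (the last boundary is inclusive)
divisions = [0, 100, 200, 300]

def partition_for(key, divisions):
    """Return the number of the partition whose index range contains `key`."""
    if not divisions[0] <= key <= divisions[-1]:
        raise KeyError(key)
    # Binary search over the boundaries; clamp so that the final,
    # inclusive boundary maps to the last partition
    return min(bisect_right(divisions, key) - 1, len(divisions) - 2)

print(partition_for(100, divisions))
```

Only the one partition that can contain the key has to be loaded, instead of scanning every partition.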

### Castra

Additionally, once we have a proper index we can use `castra`, an on-disk column-store that is partitioned along the index.

In [None]:
%%time
if os.path.exists('accounts.castra'):
    import shutil
    shutil.rmtree('accounts.castra')

c = df3.to_castra('accounts.castra', categories=['names'])
df4 = c.to_dask()

In [None]:
%%time
df4.head()

In [None]:
%%time
df4.loc[0:4].compute()

In [None]:
%%time
df4.amount.sum().compute()

In [None]:
# %%time
df4.names.drop_duplicates().compute()
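
The column-store idea can be sketched with the standard library alone.  This is a toy layout with one file per column, not castra's actual on-disk format, and the `table` contents and the `read_amount` helper are made up for illustration.

```python
import array
import os
import tempfile

# Hypothetical table with the same two columns discussed above
table = {
    'names': ['Alice', 'Bob', 'Alice', 'Dan'],
    'amount': [100, -50, 25, 75],
}

store = tempfile.mkdtemp(suffix='.columns')

# One file per column: integers as raw 8-byte values, text one name per line
with open(os.path.join(store, 'amount'), 'wb') as f:
    array.array('q', table['amount']).tofile(f)
with open(os.path.join(store, 'names'), 'w') as f:
    f.write('\n'.join(table['names']))

def read_amount(store, n):
    """Load only the `amount` column; the `names` file is never opened."""
    values = array.array('q')
    with open(os.path.join(store, 'amount'), 'rb') as f:
        values.fromfile(f, n)
    return values

total = sum(read_amount(store, len(table['amount'])))
print(total)
```

Summing `amount` touches only the `amount` file, which is why a column-store avoids the cost of reading `names` for this query.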

## Conclusion

Storage choices strongly impact performance.  We evolved from text-based CSV files to the binary Castra format and saw our query times drop from 1s to 80ms.

We also used `DataFrame.set_index` to organize our data along a special column.  A common recipe for success with `dask.dataframe` is as follows:

1.  Read in your data however it was delivered to you

        df = dd.read_csv('myfiles.*.csv')
    
2.  Set your index 

        df2 = df.set_index('column-name')
        
3.  Base computations on a Castra file

        c = df2.to_castra('/path/to/new/file.castra', 
                          categories=['list', 'of', 'columns', 'to', 'categorize'])
        df3 = c.to_dask()
        
4.  Perform efficient queries

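        # `groupby` needs a key; 'other-column' is a placeholder like the names above
        df3.loc['2014':'2015'].groupby('other-column').events.count().compute()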