!["Anaconda"](img/anaconda-logo.png)
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# DataFrame Storage

<img src="img/hdd.jpg" width="20%" align="right">

Efficient storage can dramatically improve performance, particularly when operating repeatedly from disk.

* Decompressing text and parsing CSV files is expensive.
* One of the most effective strategies with medium data is to use a binary storage format like HDF5.  
* The performance gains are often large enough that you can switch back to plain Pandas instead of using `dask.dataframe`.

In this section we'll learn how to efficiently arrange and store your datasets in on-disk binary formats.  We'll use the following:

1.  [Pandas `HDFStore`](http://pandas.pydata.org/pandas-docs/stable/io.html#io-hdf5) format on top of `HDF5`
2.  Categoricals for storing text data numerically
3.  [Castra](http://github.com/Blosc/castra), an experimental data store optimized for `dask.dataframe`

**Main Take-aways**

1.  Storage formats affect performance by an order of magnitude
2.  Text data will keep even a fast format like HDF5 slow
3.  A combination of binary formats, column storage, and partitioned data turns one-second wait times into 80 ms wait times.

## Table of Contents
* [DataFrame Storage](#DataFrame-Storage)
	* [Set-up](#Set-up)
* [Read CSV](#Read-CSV)
* [Write to HDF5](#Write-to-HDF5)
* [Compare CSV to HDF5 speeds](#Compare-CSV-to-HDF5-speeds)
	* [1.  Store text efficiently with categoricals](#1.--Store-text-efficiently-with-categoricals)
	* [2.  Store columns separately with Castra](#2.--Store-columns-separately-with-Castra)
	* [Castra and partitioned indexes](#Castra-and-partitioned-indexes)
* [Exercise](#Exercise)
	* [Solutions](#Solutions)
* [Conclusion](#Conclusion)


## Set-up

Create the data if we don't already have any.

In [None]:
from src.dask_prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

# Read CSV

First we read our CSV data as before.

In [None]:
import os
filename = os.path.join('tmp', 'accounts.*.csv')
filename

In [None]:
import dask.dataframe as dd
df = dd.read_csv(filename)
df.head()

# Write to HDF5

Pandas contains a specialized HDF5 format, `HDFStore`.  The ``dd.DataFrame.to_hdf()`` method works exactly like the ``pd.DataFrame.to_hdf()`` method.

In [None]:
target = os.path.join('tmp', 'accounts.h5')
target

In [None]:
%time df.to_hdf(target, '/data')

In [None]:
df2 = dd.read_hdf(target, '/data')
df2.head()

# Compare CSV to HDF5 speeds

We do a simple computation that requires reading a column of our dataset and compare performance between CSV files and our newly created HDF5 file.

Which do you expect to be faster?

In [None]:
%time df.amount.sum().compute()

In [None]:
%time df2.amount.sum().compute()

Sadly they are about the same.  

The culprit here is the `names` column, which is of `object` dtype and thus hard to store efficiently.  This raises two questions:

1.  How do we store text data like `names` efficiently on disk?
2.  Why did we have to read the `names` column when all we wanted was `amount`?

## 1.  Store text efficiently with categoricals

We can use Pandas categoricals to replace our object dtypes with a numerical representation.  This takes a bit more time up front, but results in better performance.

More on categoricals at the [pandas docs](http://pandas-docs.github.io/pandas-docs-travis/categorical.html) and [this blogpost](http://matthewrocklin.com/blog/work/2015/06/18/Categoricals/).
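
To see what the numerical representation looks like, here is a minimal sketch in plain Pandas. The sample names and sizes are made up for illustration; they are not the tutorial's accounts data.

```python
import pandas as pd

# Made-up sample: a few distinct names repeated many times
names = pd.Series(['Alice', 'Bob', 'Alice', 'Charlie'] * 10000, dtype=object)

# Convert to a categorical: each distinct string is stored once,
# and every row holds only a small integer code
as_category = names.astype('category')

print(list(as_category.cat.categories))         # the distinct strings
print(as_category.cat.codes.iloc[:4].tolist())  # integer codes for the first four rows

# Compare the in-memory footprint of the two representations
object_bytes = names.memory_usage(deep=True)
category_bytes = as_category.memory_usage(deep=True)
print(object_bytes, category_bytes)
```

The rows now hold small integers rather than Python strings, which is the representation that gets written to disk when a categorized column is stored.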

In [None]:
# Categorize data, then store in HDFStore
%time df.categorize(columns=['names']).to_hdf(target, '/data2')

In [None]:
# It looks the same
df2 = dd.read_hdf(target, '/data2')
df2.head()

In [None]:
# But loads more quickly
%time df2.amount.sum().compute()

This is significantly faster.

This tells us that it's **not only the file type** that we use but also how we represent our variables that influences storage performance.

However, we can still do better.

* We had to read all columns (`names` and `amount`) to compute the sum of one (`amount`).  
* We'll improve further on this with `castra`, an on-disk column-store.  
* We will also need to know how to set an index in a `dask.dataframe`, which the exercise below covers.

## 2.  Store columns separately with Castra

[`Castra`](http://github.com/Blosc/castra) is an on-disk column-store that is partitioned along the index.  It matches the `dask.dataframe` model exactly.

You will likely have to install castra using `pip`.

```bash
pip install castra
```

*Disclaimer: Castra is experimental and under heavy development churn.  You should not use it in production.*

*We discuss Castra here to demonstrate the value of thinking hard about storage, not as an endorsement of its use.*
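
To make the column-store idea concrete, here is a toy sketch using NumPy files. It is not Castra's actual on-disk layout; the file names and sample values are assumptions chosen for illustration. Each column goes into its own file, so a query on `amount` never touches the other columns.

```python
import os
import tempfile

import numpy as np

# Made-up columns standing in for a small accounts table
columns = {
    'id': np.arange(5),
    'amount': np.array([10, 20, 30, 40, 50]),
    'name_codes': np.array([0, 1, 0, 2, 1], dtype='int8'),
}

# Write each column to its own file inside a temporary directory
store = tempfile.mkdtemp()
for name, values in columns.items():
    np.save(os.path.join(store, name + '.npy'), values)

# Computing the sum of one column only reads that column's file
amount = np.load(os.path.join(store, 'amount.npy'))
total = amount.sum()
print(total)
```

A row-oriented file, by contrast, interleaves all columns, so reading `amount` means reading past every `names` value as well.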

In [None]:
%%time
if os.path.exists('tmp/accounts.castra'):
    import shutil
    shutil.rmtree('tmp/accounts.castra')

c = df.to_castra('tmp/accounts.castra', categories=['names'])
df3 = c.to_dask()

In [None]:
%time df3.amount.sum().compute()

In [None]:
%time df3.names.drop_duplicates().compute()

## Castra and partitioned indexes

As discussed in the [previous notebook](03a-DataFrame.ipynb), `dask.dataframe` partitions your data along the index.  This can make some queries, like `loc`, `groupby`, and `join/merge`, much faster because `dask.dataframe` can selectively load data and because it knows that related data is close by.

Castra uses the same model of data partitioned along the index when it stores data on disk.  By using the two together we can get efficient disk-based queries.
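
A minimal sketch of why known divisions make a `loc` lookup cheap, assuming divisions are a sorted tuple of index boundaries in which the last boundary is inclusive. The division values and the `partition_for` helper are hypothetical and are not part of the dask or Castra API.

```python
import bisect

# Hypothetical divisions for three partitions:
#   partition 0 holds index values in [0, 100)
#   partition 1 holds index values in [100, 200)
#   partition 2 holds index values in [200, 299]  (the last bound is inclusive)
divisions = (0, 100, 200, 299)


def partition_for(key, divisions):
    """Return the number of the partition whose index range contains key."""
    if key < divisions[0] or key > divisions[-1]:
        raise KeyError(key)
    i = bisect.bisect_right(divisions, key) - 1
    # The final division is an inclusive upper bound, so clamp to the last partition
    return min(i, len(divisions) - 2)


print(partition_for(100, divisions))
print(partition_for(299, divisions))
```

Because the lookup is a binary search over a handful of boundaries, only the one partition that can contain the key needs to be loaded from disk.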

# Exercise

Index your dataframe by `id`, then store it into a `Castra` file as we did above.  Inspect your data and verify that the index column is `id` and that it starts at `0`.

What are the divisions of your dataframe?

Use the `.loc` accessor to get the records corresponding to account `100`.  Compare this to the speed you saw in the last section.

## Solutions

In [None]:
%load solutions/DataFrame-02.py

# Conclusion

Storage choices strongly impact performance.  We evolved from text-based CSV files to binary-based Castra and saw our query times drop from 1s to 80ms.

We also used `DataFrame.set_index()` to organize our data along a special column.  A common recipe for success with `dask.dataframe` is as follows:

1.  Read in your data however it was delivered to you

        df = dd.read_csv('myfiles.*.csv')
    
2.  Set your index 

        df2 = df.set_index('column-name')
        
3.  Base your computation on a Castra file

        c = df2.to_castra('/path/to/new/file.castra', 
                          categories=['list', 'of', 'columns', 'to', 'categorize'])
        df3 = c.to_dask()
        
4.  Perform efficient queries

        df3.loc['2014':'2015'].groupby('column-to-group-by').events.count().compute()
