!["Anaconda"](img/anaconda-logo.png)
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Dask DataFrame

In the last section we manipulated CSV files in parallel by building dask graphs by hand and running them with `dask` `get` functions. 

In this section we use `dask.dataframe` to build and execute dask graphs to process large volumes of CSV files automatically.

## Table of Contents
* [Dask DataFrame](#Dask-DataFrame)
* [Overview](#Overview)
* [Example: Load Data from CSVs and inspect the dask graph](#Example:-Load-Data-from-CSVs-and-inspect-the-dask-graph)
	* [Setup](#Setup)
	* [`dask.dataframe.read_csv`](#dask.dataframe.read_csv)
* [Exercise: Inspect dask graph](#Exercise:-Inspect-dask-graph)
* [How does this compare to Pandas?](#How-does-this-compare-to-Pandas?)
	* [Features and Size](#Features-and-Size)
	* [Speed](#Speed)
* [Exercises: Pandas API and Dask](#Exercises:-Pandas-API-and-Dask)
* [Divisions and the Index](#Divisions-and-the-Index)
* [Exercise: Setting the Index, use of `loc[]`](#Exercise:-Setting-the-Index,-use-of-loc[])
* [Limitations](#Limitations)
	* [What doesn't work?](#What-doesn't-work?)
	* [What definitely works?](#What-definitely-works?)


# Overview

The `dask.dataframe` module implements a blocked parallel `DataFrame` object
* Dask `DataFrame` mimics a subset of the Pandas `DataFrame`.
* A Dask `DataFrame` consists of several pandas `DataFrame`s partitioned along the index.
* One operation on a dask `DataFrame` triggers many pandas operations on the constituent pandas `DataFrame`s in a way that is mindful of potential parallelism and memory constraints.
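
The "one operation triggers many pandas operations" idea can be sketched with plain pandas. This is only an illustration under assumed, made-up data; the manual split and the names `partitions` and `partial_sums` are hypothetical and are not how `dask.dataframe` is implemented internally.

```python
import pandas as pd

# A small in-memory frame standing in for a large dataset (made-up values).
pdf = pd.DataFrame({"id": range(6), "amount": [10, -5, 20, 7, -3, 1]})

# Split along the index into "partitions", as a blocked DataFrame would.
partitions = [pdf.iloc[0:2], pdf.iloc[2:4], pdf.iloc[4:6]]

# One logical operation (a sum) becomes one pandas operation per partition...
partial_sums = [part["amount"].sum() for part in partitions]

# ...followed by a cheap combine step over the partial results.
total = sum(partial_sums)
print(total)
```

Each per-partition sum is independent of the others, which is what makes the work parallelizable and keeps only one block in use at a time.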

**Related Documentation**

*  [Dask DataFrame documentation](http://dask.pydata.org/en/latest/dataframe.html)
*  [Pandas documentation](http://pandas.pydata.org/)

**Main Take-aways**

1.  Dask.dataframe should be familiar to Pandas users
2.  The index grows to include partitions, which are important for efficient queries

# Example: Load Data from CSVs and inspect the dask graph

In the last section we manually built dask graphs to read in many CSV files at once and compute their total length.  

In this section we'll use `dask.dataframe` to accomplish the same result using a more Pandas-like interface rather than using dictionaries.

## Setup

We create artificial data.

In [1]:
import sys
sys.path.append('../src')
from dask_prep import accounts_csvs
accounts_csvs(3, 1000000, 500)

import os
filename = os.path.join('data', 'accounts.*.csv')

## `dask.dataframe.read_csv`

This works just like `pandas.read_csv()`, except that it accepts a glob pattern and reads multiple CSV files at once.

In [2]:
filename

'data\\accounts.*.csv'

In [17]:
import glob
sorted(glob.glob(filename))  # portable alternative to `ls`: list the files matched by the pattern


In [3]:
import dask.dataframe as dd
df = dd.read_csv(filename)
%time len(df)

Wall time: 2.31 s


3000000

In [18]:
dd.read_csv?

In [4]:
df

dd.DataFrame<from-de..., npartitions=3>

# Exercise: Inspect dask graph

Dask `DataFrame` copies a subset of the Pandas API.  

However, unlike Pandas, operations on dask DataFrames don't trigger immediate computation. Instead, they add key-value pairs to an underlying dask graph.
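
As a reminder of what "adding key-value pairs to a graph" means, here is a minimal pure-Python sketch of a dict-based task graph. The keys, the row counts, and the helper `simple_get` are hypothetical; `simple_get` is a toy evaluator written for this example, not one of dask's schedulers.

```python
from operator import add

# Keys map either to plain values or to tasks of the form (function, arg1, arg2, ...).
graph = {
    "rows-0": 1000,
    "rows-1": 1500,
    "rows-2": 500,
    "partial": (add, "rows-0", "rows-1"),
    "total": (add, "partial", "rows-2"),
}

def simple_get(graph, key):
    """Recursively evaluate one key of the graph (toy evaluator, no parallelism)."""
    task = graph[key]
    if isinstance(task, tuple) and task and callable(task[0]):
        func, *args = task
        # String arguments that name another key are evaluated first.
        resolved = [
            simple_get(graph, a) if isinstance(a, str) and a in graph else a
            for a in args
        ]
        return func(*resolved)
    return task

# Nothing is computed until a key is explicitly requested.
total_rows = simple_get(graph, "total")
print(total_rows)
```

Building the dictionary is cheap; the work happens only when a key is requested, which is the same split between graph construction and execution that `dask.dataframe` relies on.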

In [5]:
df.dask  # .dask attribute contains underlying graph

{('from-delayed-f517b7582c95f84e03ef80130369c19a',
  0): 'bytes_read_csv-aacec7b7d7d68ba4c1b4242fc906ced8',
 'bytes_read_csv-aacec7b7d7d68ba4c1b4242fc906ced8': (<function dask.compatibility.apply>,
  <function dask.dataframe.csv.bytes_read_csv>,
  ['read-file-block-d6806e4e60b2acb62b5c8d012873eae3-0',
   b'id,names,amount\r\n',
   (dict, []),
   (dict,
    [['id', dtype('int64')],
     ['amount', dtype('float64')],
     ['names', dtype('O')]]),
   (list, ['id', 'names', 'amount'])],
  (dict, [['write_header', False], ['enforce', False]])),
 'read-file-block-91d813ebf8615e3b2f9a4fee713209cc-0': (<function dask.bytes.local.read_block_from_file>,
  'C:\\Users\\AngelSparkles\\Python\\Continuum\\continuum_training\\Dask\\data\\accounts.2.csv',
  0,
  64000000,
  b'\n',
  None),
 'read-file-block-13e1cffafb26fc8edd8bec25f742c19a-0': (<function dask.bytes.local.read_block_from_file>,
  'C:\\Users\\AngelSparkles\\Python\\Continuum\\continuum_training\\Dask\\data\\accounts.1.csv',
  0,
  640000

Visualizing the graph requires both the `graphviz` Python package and the Graphviz executables (such as `dot`) on your system's `PATH`.

In [29]:
!pip install graphviz
!conda install graphviz



The system cannot find the path specified.


In [20]:
import graphviz

In [25]:
df.visualize()

ExecutableNotFound: failed to execute ['dot', '-Tpng'], make sure the Graphviz executables are on your systems' PATH

In [None]:
!cmd dot.exe

Above we see graphs corresponding to a call to `dd.read_csv()` and `df.amount.sum()` on the result.  

Below we see the resulting computations as dictionaries.  You'll note that these dictionaries are a bit more complex than what we built by hand in the last section.  However if you look closely then you'll see all of the familiar elements of `pd.read_csv()` and the filenames.

Try changing around the expression `df.amount.sum()` and see how the dictionary and graph change.  Explore a bit with the Pandas syntax that you already know.

In [6]:
df.amount.sum().dask

AttributeError: 'module' object has no attribute 'RangeIndex'

In [None]:
df.amount.sum().visualize()

# How does this compare to Pandas?

## Features and Size

Pandas is more mature and fully featured than `dask.dataframe`.

* If your data fits in memory then you should use Pandas.  
* The `dask.dataframe` module gives you a limited `pandas` experience when you operate on datasets that don't fit comfortably in memory.

During this lesson we provide a small dataset consisting of a few CSV files.  This dataset is 45MB on disk and expands to about 400MB in memory (the difference is caused by using `object` dtype for strings).  This dataset is small enough that you would normally use Pandas.

We've chosen this size so that exercises finish quickly.  Dask.dataframe only really becomes meaningful for problems significantly larger than this, when Pandas breaks with the dreaded 

    MemoryError:  ...
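
The in-memory growth caused by `object` dtype strings can be checked with `memory_usage`. The names and row count below are made up for illustration, and the column is forced to `object` dtype to match the situation described above.

```python
import pandas as pd

# 5000 short strings stored as Python objects (made-up values).
names = ["Ursula", "Hannah", "Laura", "Oliver", "Charlie"] * 1000
pdf = pd.DataFrame({"names": pd.Series(names, dtype=object)})

# deep=False counts only the array of object pointers.
shallow = pdf["names"].memory_usage(index=False, deep=False)

# deep=True also counts the Python string objects those pointers refer to.
deep = pdf["names"].memory_usage(index=False, deep=True)

print(shallow, deep)
```

The `deep=True` figure is the more realistic estimate when deciding whether a dataset still fits comfortably in memory.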

## Speed

Dask.dataframe operations use `pandas` operations internally.  Generally they run at about the same speed except in the following two cases:

1.  Dask introduces a bit of overhead, around 1ms per task.  This is usually negligible.
2.  When Pandas releases the GIL (coming to `groupby` in the next version), `dask.dataframe` can call several pandas operations in parallel, increasing speed roughly in proportion to the number of cores.

# Exercises: Pandas API and Dask

If you are already familiar with the Pandas API then you should have a firm grasp on how to use `dask.dataframe`.  There are a couple of small changes.

As noted above, computations on dask `DataFrame` objects don't perform work immediately. Instead, they build up a dask graph.  We can evaluate this dask graph at any time using the `.compute()` method.

In [7]:
result = df.amount.mean()  # create lazily evaluated result
result

AttributeError: 'module' object has no attribute 'RangeIndex'

In [None]:
result.compute()           # perform actual computation

Try the following exercises

1.  Use the `head()` method to get the first ten rows
2.  Use the `drop_duplicates()` method to find all of the distinct names
3.  Use selections `df[...]` to find how many positive and negative amounts there are
4.  Use groupby `df.groupby(df.A).B.func()` to get the average amount per user ID
5.  Combine your answers to 3 and 4 to compute the average withdrawal (negative amount) per name

This section should be easy if you are familiar with Pandas.  If you aren't, that's OK too.  You may find the [pandas documentation](http://pandas.pydata.org/) a useful read in the future.  Don't worry, future sections in this tutorial will not depend on this knowledge.

In [8]:
# 1. Use the `head()` method to get the first ten rows
#    Note: head() computes immediately, so no explicit .compute() is needed.
#    It returns 5 rows by default; pass n=10 to get ten.
df.head()

Unnamed: 0,id,names,amount
0,349,Ursula,967
1,58,Hannah,-1160
2,294,Laura,812
3,326,Oliver,2059
4,446,Charlie,286


In [9]:
# 2. Use the `drop_duplicates()` method to find all of the distinct names
df.drop_duplicates().compute()

Unnamed: 0,id,names,amount
0,349,Ursula,967
1,58,Hannah,-1160
2,294,Laura,812
3,326,Oliver,2059
4,446,Charlie,286
5,288,Yvonne,642
6,330,Charlie,-17
7,374,Charlie,221
8,3,Sarah,43
9,85,George,288


In [10]:
# 3a. Use selections `df[...]` to find how many positive amounts there are
df[df.amount > 0].head()

AttributeError: 'module' object has no attribute 'RangeIndex'

In [11]:
print(len(df[df.amount > 0]))

AttributeError: 'module' object has no attribute 'RangeIndex'

In [12]:
# 3b. Use selections `df[...]` to find how many negative amounts there are
print(len(df[df.amount < 0]))

AttributeError: 'module' object has no attribute 'RangeIndex'

In [None]:
# 4. Use groupby `df.groupby(df.A).B.func()` to get the average amount per user ID 


In [None]:
# 5. Combine your answers to 3 and 4 to compute the average withdrawal (negative amount) per name


In [13]:
# %load solutions/DataFrame-01.py
# 1. Use the `head()` method to get the first ten rows
df.head()

# 2. Use the `drop_duplicates()` method to find all of the distinct names
df.names.drop_duplicates().compute()

# 3a. Use selections `df[...]` to find how many negative amounts there are
len(df[df.amount < 0])

# 3b. Use selections `df[...]` to find how many positive amounts there are
len(df[df.amount > 0])

AttributeError: 'module' object has no attribute 'RangeIndex'

In [None]:
# 4. Use groupby `df.groupby(df.A).B.func()` to get the average amount per user
# ID
df.groupby(df.names).amount.mean().compute()

In [None]:
# 5. Combine your answers to 3 and 4 to compute the average withdrawal
# (negative amount) per name
df2 = df[df.amount < 0]
df2.groupby(df2.names).amount.mean().compute()

# Divisions and the Index

<!-- <img src="img/frame.png" align="right" width="40%"> -->

The Pandas index associates a value to each record/row of your data.  

Operations that align with the index, like `loc`, can be a bit faster as a result.

In `dask.dataframe` this index becomes even more important.  

* One dask `DataFrame` consists of several Pandas `DataFrame` objects.
* These dataframes are separated along the index by value.
* For example, when working with time series we may partition our large dataset by month.
* The partitions of our data need not all live in memory at the same time.
* Instead they might live on disk; we simply have tasks that can materialize these pandas `DataFrame`s on demand.

Partitioning your data can greatly improve efficiency.

* Operations like `loc`, `groupby`, and `merge/join` along the index are *much more efficient* than operations along other columns.  
* You can see how your dataset is partitioned with the `.divisions` attribute.  
* Note that data that comes out of simple data sources like CSV files isn't intelligently indexed by default.  
* In these cases the values for `.divisions` will be `None`.
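
Why known divisions make `loc` cheap can be sketched in pure Python: with sorted boundaries, a binary search identifies the one partition that can contain a key. The division values below are hypothetical (only "Alice" and "Zelda" appear in this lesson), and `partition_for` is an illustrative helper, not a dask function.

```python
from bisect import bisect_right

# Hypothetical divisions for three partitions: four sorted boundary values.
# Partition i holds index values from divisions[i] up to divisions[i + 1];
# the last partition also includes the final boundary.
divisions = ("Alice", "Hannah", "Quinn", "Zelda")
npartitions = len(divisions) - 1

def partition_for(key):
    """Return the number of the only partition that can contain `key`."""
    i = bisect_right(divisions, key) - 1
    # Clip so that keys at or beyond the last boundary map to the last partition.
    return min(max(i, 0), npartitions - 1)

print(partition_for("Edith"))
```

When every division is `None`, no such lookup is possible, so any partition might contain the key and all of them have to be scanned.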

In [14]:
df = dd.read_csv(filename)

In [15]:
df.head()

Unnamed: 0,id,names,amount
0,349,Ursula,967
1,58,Hannah,-1160
2,294,Laura,812
3,326,Oliver,2059
4,446,Charlie,286


In [16]:
df.divisions

(None, None, None, None)

However, if we set the index to some new column, dask will divide our data roughly evenly along that column and create new divisions for us.  Warning: `set_index` triggers immediate computation.

In [17]:
df2 = df.set_index('names')

ValueError: Metadata inference failed, please provide `meta` keyword

In [None]:
df2.divisions

We see here the minimum and maximum values ("Alice" and "Zelda") as well as two intermediate values that separate our data well.  This dataset has three partitions.

In [None]:
df2.npartitions

In [None]:
df2.head()

Operations like `loc` only need to load the relevant partitions

In [18]:
df2.loc['Edith']

NameError: name 'df2' is not defined

In [19]:
df2.loc['Edith'].compute()

NameError: name 'df2' is not defined

# Exercise: Setting the Index, use of `loc[]`

Make a new dataframe that sets the index to the `id` column.  Use `loc` to collect the records with the 100th id.

**Solution:**

In [None]:
%load solutions/DataFrame-02.py


# Limitations

## What doesn't work?

Dask.dataframe only covers a small but well-used portion of the Pandas API.
This limitation is for two reasons:

1.  The Pandas API is *huge*
2.  Some operations are genuinely hard to do in parallel (e.g. sort)

Additionally, some important operations like ``df.set_index()`` work, but are slower
than in Pandas because they may write out to disk.

Finally, `dask.dataframe` is quite new and non-trivial bugs are frequently reported (and quickly fixed).

## What definitely works?

* Trivially parallelizable operations (fast):
    *  Elementwise operations:  ``df.x + df.y``
    *  Row-wise selections:  ``df[df.x > 0]``
    *  Loc:  ``df.loc[4.0:10.5]``
    *  Common aggregations:  ``df.x.max()``
    *  Is in:  ``df[df.x.isin([1, 2, 3])]``
    *  Datetime/string accessors:  ``df.timestamp.dt.month``
* Cleverly parallelizable operations (also fast):
    *  groupby-aggregate (with common aggregations): ``df.groupby(df.x).y.max()``
    *  value_counts:  ``df.x.value_counts()``
    *  Drop duplicates:  ``df.x.drop_duplicates()``
    *  Join on index:  ``dd.merge(df1, df2, left_index=True, right_index=True)``
* Operations requiring a shuffle (slow-ish, unless on index)
    *  Set index:  ``df.set_index(df.x)``
    *  groupby-apply (with anything):  ``df.groupby(df.x).apply(myfunc)``
    *  Join not on the index:  ``dd.merge(df1, df2, on='name')``
* Ingest operations
    *  CSVs: ``dd.read_csv``
    *  Pandas: ``dd.from_pandas``
    *  Anything supporting numpy slicing: ``dd.from_array``
    *  Dask.bag: ``mybag.to_dataframe(columns=[...])``
