!["Anaconda"](img/anaconda-logo.png)
<br>
*Copyright Continuum 2012-2016 All Rights Reserved.*

# Bag: Parallel Lists for semi-structured data

While **NumPy and Pandas excel on structured and homogeneous data** (e.g. an array of floats), **messy data** is often encountered at the beginning of data processing pipelines, when **large volumes of raw data** are first consumed.

**Dask.bag** is a high-level dask collection designed to automate common workloads on messy data.  

In a nutshell:

    dask.bag = map, filter, toolz + multiprocessing

## Table of Contents
* [Bag: Parallel Lists for semi-structured data](#Bag:-Parallel-Lists-for-semi-structured-data)
* [Overview](#Overview)
* [Create a Bag](#Create-a-Bag)
* [Manipulate a Bag](#Manipulate-a-Bag)
	* [Exercise: Accounts JSON data](#Exercise:-Accounts-JSON-data)
	* [Basic Queries](#Basic-Queries)
	* [Use `concat` to de-nest](#Use-concat-to-de-nest)
* [Groupby and Foldby](#Groupby-and-Foldby)
	* [`groupby`](#groupby)
	* [`foldby`](#foldby)
	* [Example: Account Data](#Example:-Account-Data)
	* [Exercise: compute total amount per name](#Exercise:-compute-total-amount-per-name)
* [Bags to DataFrames](#Bags-to-DataFrames)
* [Limitations](#Limitations)


# Overview

**Messy data** is often encountered at the beginning of data processing pipelines when **large volumes of raw data** are first consumed. The initial set of data might be a mixture of **JSON, CSV, XML, or any other format** that does not enforce strict structure and datatypes.

**NumPy and Pandas excel on structured and homogeneous data** (e.g. an array of floats). NumPy and Pandas are **not designed for messy data:** data that is highly nested, variable length, heterogeneous, or has missing values.

For this reason, the initial data massaging and processing is often done with core Python containers such as `list`, `dict`, and `set`. These core data structures are optimized for general-purpose storage and processing.  

Adding streaming computation with iterators/generator expressions or libraries like `itertools` or `toolz` lets us process large volumes in a small space.  If we combine this with multiprocessing then we can churn through a fair amount of data.
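
As a rough illustration of the streaming idea, the sketch below chains generator expressions so that only one item is in flight at a time. It uses only the standard library; `read_numbers` is a hypothetical stand-in for a real data source.

```python
import itertools

def read_numbers(n):
    """Yield integers one at a time instead of building a full list."""
    for i in range(n):
        yield i

# Each stage is lazy: nothing is computed until the result is consumed.
numbers = read_numbers(1_000_000)
evens = (x for x in numbers if x % 2 == 0)
squares = (x ** 2 for x in evens)

# Pull only the first five results; the remaining items are never generated.
first_five = list(itertools.islice(squares, 5))
print(first_five)  # [0, 4, 16, 36, 64]
```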
    
**Related Documentation**

*  [Bag Documentation](http://dask.pydata.org/en/latest/bag.html)
*  [Bag API](http://dask.pydata.org/en/latest/bag.html#api)

# Create a Bag

You can create a `Bag` from a Python sequence, from files, from data on S3, etc.

In [None]:
from src.dask_prep import accounts_json
accounts_json(10, 1000, 10)

In [None]:
import dask.bag as db
b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
b

In [None]:
import os
b = db.read_text(os.path.join('tmp', 'accounts.*.json.gz'))
b

In [None]:
import dask.bag as db
b = db.read_text('s3://nyqpug/tips.csv')
b.take(5)

# Manipulate a Bag

`Bag` objects hold the standard functional API found in projects like the Python standard library, `toolz`, or `pyspark`, including `map`, `filter`, `groupby`, etc.

As with `Array` and `DataFrame` objects, operations on `Bag` objects create new bags.  Call the `.compute()` method to trigger execution.  
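
The deferred-execution idea can be seen without dask, because Python's built-in `map` and `filter` are also lazy. The sketch below is only an analogy (dask builds a task graph rather than an iterator), and `square` and `calls` are names invented for this illustration.

```python
calls = []

def square(x):
    calls.append(x)  # record each call so we can see when work happens
    return x ** 2

# Building the pipeline does no work yet, much like chaining bag methods.
pipeline = map(square, filter(lambda n: n % 2 == 0, range(1, 11)))
calls_before = len(calls)

# Consuming the iterator plays the role that `.compute()` plays for a bag.
result = list(pipeline)
calls_after = len(calls)

print(calls_before, calls_after, result)  # 0 5 [4, 16, 36, 64, 100]
```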

Dask.bag uses `dask.multiprocessing.get` by default.  For examples using dask.bag with a distributed cluster see:

*  [Dask.distributed docs](http://dask.readthedocs.org/en/latest/distributed.html)
*  [Blogpost analyzing github data on EC2](http://continuum.io/blog/dask-distributed-cluster)

In [None]:
def iseven(n):
    return n % 2 == 0

b = db.from_sequence([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
c = b.filter(iseven).map(lambda x: x ** 2)
c

In [None]:
c.compute()

## Exercise: Accounts JSON data

We've created a fake dataset of gzipped JSON data in your data directory.  This is like the example used in the `DataFrame` example except that it has bundled up all of the entries for each individual `id` into a single record.  This is similar to data that you might collect from a document store database or a web API.

Each line is a JSON encoded dictionary with the following keys

*  id: Unique identifier of the customer
*  name: Name of the customer
*  transactions: List of `transaction-id`, `amount` pairs, one for each transaction for the customer in that file
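
To make the record layout concrete, here is a small sketch that parses one hand-written line of this shape with the standard `json` module. The values are made up for illustration; only the key names follow the description above.

```python
import json

# A made-up line in the shape described above; the values are illustrative only.
line = (
    '{"id": 0, "name": "Dan", "transactions": ['
    '{"transaction-id": 10, "amount": 25}, '
    '{"transaction-id": 11, "amount": -5}]}'
)

record = json.loads(line)  # text -> nested dicts and lists
amounts = [t["amount"] for t in record["transactions"]]
print(record["name"], len(record["transactions"]), sum(amounts))  # Dan 2 20
```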

In [None]:
filename = os.path.join('tmp', 'accounts.*.json.gz')
lines = db.read_text(filename)
lines.take(1)

Our data comes out of the file as lines of text.  We can make this data look more reasonable by mapping the `json.loads` function onto our bag.

In [None]:
import json
js = lines.map(json.loads)
js.take(3)

## Basic Queries

Once we parse our JSON data into proper Python objects (`dict`s, `list`s, etc.) we can perform more interesting queries by creating small Python functions to run on our data.

In [None]:
js.filter(lambda record: record['name'] == 'Dan').take(5)

In [None]:
def count_transactions(d):
    return {'name': d['name'], 'count': len(d['transactions'])}

(js.filter(lambda record: record['name'] == 'Dan')
   .map(count_transactions)
   .take(5))

In [None]:
(js.filter(lambda record: record['name'] == 'Dan')
   .map(count_transactions)
   .pluck('count')
   .take(5))

In [None]:
# Average number of transactions for all of the Dans
(js.filter(lambda record: record['name'] == 'Dan')
   .map(count_transactions)
   .pluck('count')
   .mean()
   .compute())

## Use `concat` to de-nest

In the example below we see the use of `concat` to flatten results.  We compute the average amount for all transactions for all Dans.
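
For intuition, the same pluck, flatten, and average steps can be sketched on an ordinary list with `itertools.chain`. This is a plain-Python analogy on hypothetical records, not how dask implements `concat`.

```python
from itertools import chain

# Hypothetical records in the shape used in this section.
records = [
    {"name": "Dan", "transactions": [{"transaction-id": 1, "amount": 10},
                                     {"transaction-id": 2, "amount": 30}]},
    {"name": "Dan", "transactions": [{"transaction-id": 3, "amount": 20}]},
]

# Like .pluck('transactions'): one list of transactions per record.
nested = [r["transactions"] for r in records]

# Like .concat(): flatten the list of lists into a single sequence.
flat = list(chain.from_iterable(nested))

# Like .pluck('amount').mean()
amounts = [t["amount"] for t in flat]
mean_amount = sum(amounts) / len(amounts)
print(len(nested), len(flat), mean_amount)  # 2 3 20.0
```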

In [None]:
js.filter(lambda record: record['name'] == 'Dan').pluck('transactions').take(3)

In [None]:
(js.filter(lambda record: record['name'] == 'Dan')
   .pluck('transactions')
   .concat()
   .take(3))

In [None]:
(js.filter(lambda record: record['name'] == 'Dan')
   .pluck('transactions')
   .concat()
   .pluck('amount')
   .take(3))

In [None]:
(js.filter(lambda record: record['name'] == 'Dan')
   .pluck('transactions')
   .concat()
   .pluck('amount')
   .mean()
   .compute())

# Groupby and Foldby

Often we want to group data by some function or key.  We can do this either with the `.groupby` method, which is straightforward but forces a full on-disk shuffling of the data (expensive) or with the harder-to-use but faster `.foldby` method, which does a streaming combined groupby and reduction.

*  `groupby`:  Shuffles data so that all items with the same key are in the same key-value pair
*  `foldby`:  Walks through the data accumulating a result per key

*Note: the full groupby is particularly slow.  In actual workloads you would do well to use `foldby` or switch to `DataFrame`s if possible.*

## `groupby`

Groupby collects items in your collection so that all items with the same value under some function are collected together into a key-value pair.
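
The grouping semantics can be sketched in a single process with a dictionary of lists. The helper `group_by` below is hypothetical and skips the cross-partition shuffle that makes the real `Bag.groupby` expensive.

```python
from collections import defaultdict

def group_by(key, items):
    """Collect items into lists keyed by key(item)."""
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    return dict(groups)

names = ['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank']
grouped = group_by(len, names)  # names grouped by length
print(sorted(grouped.items()))
# [(3, ['Bob', 'Dan']), (5, ['Alice', 'Edith', 'Frank']), (7, ['Charlie'])]
```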

In [None]:
b = db.from_sequence(['Alice', 'Bob', 'Charlie', 'Dan', 'Edith', 'Frank'])
b.groupby(len).compute()  # names grouped by length

In [None]:
b = db.from_sequence(list(range(10)))
b.groupby(lambda x: x % 2).compute()

In [None]:
# Each element is a (key, values) pair; index into it rather than relying on tuple unpacking
b.groupby(lambda x: x % 2).map(lambda kv: (kv[0], max(kv[1]))).compute()

## `foldby`

Foldby can be quite odd at first.  It is similar to the following functions from other libraries:

*  [`toolz.reduceby`](http://toolz.readthedocs.org/en/latest/streaming-analytics.html#streaming-split-apply-combine)
*  [`pyspark.RDD.combineByKey`](http://abshinn.github.io/python/apache-spark/2014/10/11/using-combinebykey-in-apache-spark/)

When using `foldby` you provide 

1.  A key function on which to group elements
2.  A binary operator such as you would pass to `reduce` that you use to perform the reduction within each group
3.  A combine binary operator that can combine the results of two `reduce` calls on different parts of your dataset.

Your reduction must be associative.  It will happen in parallel in each of the partitions of your dataset.  Then all of these intermediate results will be combined by the `combine` binary operator.
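
The two-stage process described above can be sketched in plain Python. The helper names (`fold_partition`, `merge`, `foldby_sketch`) and the explicit list of partitions are assumptions made for illustration; the real `foldby` has a different signature and runs the per-partition step in parallel.

```python
from functools import reduce

def fold_partition(key, binop, initial, partition):
    """Reduce one partition to a dict of per-key accumulators."""
    acc = {}
    for item in partition:
        k = key(item)
        acc[k] = binop(acc.get(k, initial), item)
    return acc

def merge(combine, left, right):
    """Merge two per-key result dicts using the combine operator."""
    out = dict(left)
    for k, v in right.items():
        out[k] = combine(out[k], v) if k in out else v
    return out

def foldby_sketch(key, binop, initial, combine, partitions):
    # Stage 1: independent reduction inside each partition.
    partials = [fold_partition(key, binop, initial, p) for p in partitions]
    # Stage 2: combine the intermediate per-key results.
    return reduce(lambda a, b: merge(combine, a, b), partials, {})

partitions = [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9]]
parity = lambda x: x % 2

# Largest value per parity: binop and combine happen to be the same function.
largest = foldby_sketch(parity, max, float('-inf'), max, partitions)

# Count per parity: binop adds 1 per item, combine adds two partial counts.
counts = foldby_sketch(parity, lambda total, _: total + 1, 0,
                       lambda a, b: a + b, partitions)
print(largest, counts)  # {0: 8, 1: 9} {0: 5, 1: 5}
```

The counting case shows why `combine` is a separate argument: the per-item operator takes an accumulator and an element, while the combine operator takes two accumulators.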

In [None]:
b.foldby(lambda x: x % 2, binop=lambda acc, x: acc if acc > x else x,
                          combine=lambda acc1, acc2: acc1 if acc1 > acc2 else acc2).compute()

## Example: Account Data

We find the number of people with the same name.

In [None]:
%%time
js.foldby(key='name', binop=lambda total, x: total + 1, 
                      initial=0, 
                      combine=lambda a, b: a + b, 
                      combine_initial=0).compute()

## Exercise: compute total amount per name

We want to groupby (or foldby) the `name` key, then add up all of the amounts for each name.

Steps

1.  Create a small function that, given a dictionary like 

        {'name': 'Alice', 'transactions': [{'amount': 1, 'transaction-id': 123}, {'amount': 2, 'transaction-id': 456}]}
        
    produces the sum of the amounts, e.g. `3`
    
2.  Slightly change the binary operator of the `foldby` example above so that the binary operator doesn't count the number of entries, but instead accumulates the sum of the amounts.

# Bags to DataFrames

Pandas is faster than pure Python.  In the same way, `dask.dataframe` can be faster than `dask.bag`.  

You can transform a bag with a simple tuple or flat dictionary structure into a `dask.dataframe` with the `to_dataframe` method.

In [None]:
df = js.to_dataframe()
df.head()

This is less-than-optimal because the `transactions` column is filled with nested data so Pandas has to revert to `object` dtype, which is quite slow in Pandas.  

Ideally we want to transform to a dataframe only after we have flattened our data so that each record is a single `int`, `string`, `float`, etc.

In [None]:
js.take(3)

In [None]:
def denormalize(record):
    return [{'id': record['id'], 
             'name': record['name'], 
             'amount': transaction['amount'], 
             'transaction-id': transaction['transaction-id']}
            for transaction in record['transactions']]

transactions = js.map(denormalize).concat()
transactions.take(3)

In [None]:
df = transactions.to_dataframe()
df.head()

In [None]:
df.groupby('name')['transaction-id'].count().compute()

# Limitations

Bags provide very general computation (any Python function).  This generality
comes at a cost.  Bags have the following known limitations:

1.  By default they rely on the multiprocessing scheduler, which has its own
    set of known limitations (see the dask documentation on the shared-memory schedulers).
2.  Bag operations tend to be slower than array/dataframe computations in the
    same way that Python tends to be slower than NumPy/Pandas.
3.  `Bag.groupby` is slow.  You should try to use `Bag.foldby` if possible,
    although using `Bag.foldby` requires more thought.
4.  The implementation backing `Bag.groupby` is under heavy churn and currently
    does not work in a distributed context.
