## Limitations of Dask: Why Use Spark?

Dask is well integrated into the Python ecosystem. This is both the best and the worst part of Dask.
* Best:
  * Dask inherits the efficient implementations of the underlying libraries, NumPy and pandas.
  * Dask is easy to use for Python programmers.
* Worst:
  * Dask is Python, and Python is an interpreted (not compiled) language that is effectively serial: in CPython, the global interpreter lock lets only one thread execute Python bytecode at a time.
  * User-defined functions are inefficient because they run in the Python interpreter.
  
Ultimately, Dask is not built on a Hadoop!-style shuffle engine and, as a consequence, is not good at shuffling (sorting) data.

## Dask Bags

### Dask Datatypes, Functions, and Operators

Because Dask is a data-parallel library, it's reasonable to organize a discussion of Dask around the three major "collections" it implements:
  * dask.array: a parallel NumPy array
  * dask.dataframe: a parallel pandas dataframe
  * dask.bag: a parallel collection of generic Python objects, a concept inherited from Spark (and Pig)
  
Arrays and dataframes make sense, but where did the bag come from? The Dask documentation describes it this way: "It is similar to a parallel version of PyToolz or a Pythonic version of the PySpark RDD."

A Dask bag, or multiset, is:
  * unordered: it cannot be indexed like an array
  * non-unique: it can have repeated entries
  * heterogeneous: it can contain arbitrary Python objects
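
The properties above can be seen in a small sketch. The records are hypothetical, and `scheduler="synchronous"` is an assumption made only to keep this toy example in a single process; it is not how a real workload would be run.

```python
import dask.bag as db

# Hypothetical records: mixed Python objects, including a repeated entry.
records = [
    {"name": "alice", "tags": ["a", "b"]},  # nested dictionary
    ("tuple", 1),                           # tuple
    "plain string",                         # string
    "plain string",                         # duplicates are allowed
]

bag = db.from_sequence(records, npartitions=2)

# Bags are lazy: nothing runs until compute() is called.
# The synchronous scheduler is used here only to keep the sketch single-process.
n_items = bag.count().compute(scheduler="synchronous")

# Keep only the strings, then count them and remove duplicates.
strings = bag.filter(lambda x: isinstance(x, str))
n_strings = strings.count().compute(scheduler="synchronous")
distinct_strings = strings.distinct().compute(scheduler="synchronous")
```

Note that there is no `bag[0]`: elements are reached through operations such as `filter`, `map`, and `take`, not by position.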
  
The Dask guidance is to use bags only when absolutely needed and to convert to arrays or dataframes as soon as possible. Bags support the nested data structures typical of JSON, e.g. dictionaries that contain lists of lists. Their limitations are:
  * by default, bags use the 'processes' scheduler, whose workers cannot share memory among the multiple cores of a node
  * user-defined functions are inefficient when compared with pandas or NumPy built-ins

Additionally, Dask **strongly** encourages you to avoid <code>bag.groupby()</code>, because it requires a full shuffle (sort by key) of the data.
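
When the grouping is followed by a reduction, the Dask documentation suggests `foldby` as the alternative: it reduces within each partition first and then merges the small per-partition results, so no full shuffle is needed. The sketch below assumes hypothetical `(key, value)` records; the helper names are illustrative, and the synchronous scheduler is used only to keep the example in one process.

```python
import dask.bag as db

# Hypothetical (region, amount) records.
sales = [("east", 10), ("west", 5), ("east", 7), ("west", 1), ("north", 2)]
bag = db.from_sequence(sales, npartitions=2)

def key(record):
    # Group by the first field of each record.
    return record[0]

def add_value(total, record):
    # binop: fold one record into the running total for its key.
    return total + record[1]

def combine(total_a, total_b):
    # combine: merge the per-partition totals that share a key.
    return total_a + total_b

# foldby returns a bag of (key, reduced_value) pairs.
pairs = bag.foldby(key, add_value, initial=0,
                   combine=combine, combine_initial=0)
totals = dict(pairs.compute(scheduler="synchronous"))
```

`foldby` only helps when the per-key result is a reduction (sum, count, max, and so on). If every record of a group is truly needed together, the shuffle cannot be avoided.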

### Conclusions

Guidance for Dask:
  * try to convert semi-structured data to dataframes as soon as possible
  * try to use built-in functions whenever possible; they are often compiled
  
Guidance for when and how to use Spark:
  * for workloads that perform shuffles: Spark grew out of the Hadoop! ecosystem, and its engine is built around an efficient distributed shuffle.
  * for complex user-defined functions, although to run efficiently they must be written in Scala, which compiles to Java bytecode.

Spark is a bigger, heavier ecosystem with a more complex distributed query optimizer.