# 5. Parallel and Distributed Computing with Dask

## The Scaling Problem: When Data Exceeds Memory

Economists increasingly work with datasets too large to fit into a single computer's RAM. A pandas DataFrame resides entirely in memory, so trying to load a 50 GB file on a machine with 16 GB of RAM will typically end in a `MemoryError` or a killed process. A second, separate bottleneck is compute: many workloads are CPU-bound and could run faster if they used all the cores of a modern processor rather than one.

**Dask** is a Python library for parallel and distributed computing. It helps you scale the Python tools you already use—like NumPy, pandas, and scikit-learn—to multi-core machines and large, distributed clusters.

Dask's core idea is **lazy evaluation**. It builds a **task graph** of the computations you want to perform, and only executes them when you explicitly ask for a result. This allows Dask to optimize the execution plan for parallel performance.

In this notebook, you will learn:
- The core concepts behind Dask: lazy evaluation and task graphs.
- How to use **Dask DataFrames** to work with larger-than-memory tabular data.
- How to use **Dask Arrays** for scalable numerical computing.
- How to use **Dask Delayed** for custom parallel algorithms.
- How to set up a local Dask cluster and use its diagnostic dashboard to monitor computations.

### Understanding Lazy Evaluation and Task Graphs

When you work with Dask, you are not manipulating data directly. Instead, you are building a graph that represents the sequence of tasks to be performed. Each task is a function call on a piece of data, and the graph captures the dependencies between these tasks.

For example, a simple computation like `(a + b) * (c + d)` can be represented as a task graph:

![A Dask Task Graph](../../images/high_performance_python/dask_task_graph.png)

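Dask can analyze this graph and see that the two additions, `a + b` and `c + d`, do not depend on each other and can therefore run in parallel; only the final multiplication has to wait for both. Because the graph is built before anything runs, Dask can plan the work this way even for complex computations on large datasets.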
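
As a minimal sketch, the same graph can be built by hand with `dask.delayed` (covered later in this notebook). The values assigned to `a`, `b`, `c`, and `d` are made up for illustration.

```python
from operator import add, mul

from dask import delayed

# Illustrative inputs (not taken from the text above).
a, b, c, d = 1, 2, 3, 4

# Each call returns a lazy Delayed object; nothing is computed yet.
left = delayed(add)(a, b)            # a + b
right = delayed(add)(c, d)           # c + d, independent of `left`
product = delayed(mul)(left, right)  # depends on both additions

# Only now does Dask walk the graph and run the three tasks.
value = product.compute()
print(value)  # 21
```

The two additions share no inputs, so a scheduler is free to run them at the same time; the multiplication is held back until both have finished.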


In [None]:
import dask
import dask.dataframe as dd
import dask.array as da
from dask.distributed import Client
import pandas as pd
import numpy as np

### Setting Up a Local Dask Cluster

Dask can run on its default single-machine schedulers, but this notebook uses the `dask.distributed` scheduler, which adds a diagnostic dashboard and scales from one machine to many. It needs a **cluster**, which consists of a **scheduler** that manages tasks and **workers** that execute them. On a single machine, creating a `Client` with no arguments starts a local scheduler and a set of workers sized to the available CPU cores.

In [None]:
# This creates a local cluster with a scheduler and workers.
# The client provides a dashboard link to monitor your work.
client = Client()
client

**Important:** Click the link for the Dashboard. It is a powerful tool that provides a real-time visualization of your computations, showing you which tasks are running on which workers and how memory is being used. Keep it open in a separate browser tab as you go through this notebook.
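
The worker layout can also be set explicitly instead of relying on the defaults. The following is a sketch only: the worker and thread counts are arbitrary, and `demo_client` is a hypothetical second client created with `set_as_default=False` so that it does not replace the `client` used in the rest of this notebook.

```python
from dask.distributed import Client, LocalCluster

# Hypothetical sizing for illustration; tune these numbers to your machine.
cluster = LocalCluster(
    n_workers=2,           # number of workers
    threads_per_worker=2,  # threads inside each worker
    processes=False,       # run workers as threads in this process (lightweight)
)
demo_client = Client(cluster, set_as_default=False)

# Submit one function call to the cluster and wait for its result.
future = demo_client.submit(sum, [1, 2, 3])
total = future.result()
print(total)  # 6

# Release the workers and scheduler when finished.
demo_client.close()
cluster.close()
```

Fewer workers with more threads tends to suit NumPy-heavy code that releases the GIL, while more worker processes can help pure-Python code; treat this as a starting point for experimentation, not a rule.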

### Dask DataFrames: Parallelizing Pandas

A Dask DataFrame is a large, parallel DataFrame composed of many smaller pandas DataFrames, split along the index. These smaller DataFrames are called partitions, and each partition can be computed on by a different worker.

The Dask DataFrame API mimics the pandas API, so you can perform many familiar operations. However, these operations are **lazy**—Dask builds a task graph of the planned computation but doesn't execute it until you explicitly request a result with a method like `.compute()`.

#### Example: Analyzing a Large Synthetic Time Series

Dask can generate sample data on the fly. Let's use `dask.datasets.timeseries` to create a large, synthetic time-series dataset.

In [None]:
# Create a synthetic Dask DataFrame with one row per second:
# tens of millions of rows, on the order of 1 GB once materialized.
# It stands in for a dataset that would not fit in memory.
ddf = dask.datasets.timeseries(
    start='2000-01-01', end='2000-12-31',
    freq='1s',
    partition_freq='1M',  # monthly partitions; recent pandas releases prefer '1ME'
    dtypes={'x': float, 'y': float, 'id': int}
)

# Notice that printing the DataFrame doesn't show data.
# It shows the structure: columns, types, and number of partitions.
ddf

Let's perform a standard pandas-like operation: group by an ID and find the mean of one of the columns. Notice that the cell returns almost immediately, because Dask has only recorded the operation and has not yet touched the data.

In [None]:
# This is a lazy operation. No computation is done yet.
mean_x_by_id = ddf.groupby('id').x.mean()
mean_x_by_id

To trigger the computation, we call `.compute()`. Now, watch your Dask dashboard. You will see the task graph being executed in parallel across all your CPU cores.

In [None]:
# This triggers the actual computation.
result = mean_x_by_id.compute()

print(result.head())

### Dask Arrays: Parallelizing NumPy

Similarly, a Dask Array is composed of many smaller NumPy arrays, called "chunks." It supports a large subset of the NumPy API.

Let's create a large random Dask array and perform some standard operations.

In [None]:
# Create a 20000x20000 array of random numbers.
# This would be ~3.2 GB, potentially too large for some machines.
# Dask handles it by chunking it into smaller NumPy arrays.
x = da.random.random((20000, 20000), chunks=(1000, 1000))
x
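
The size figures in the comments can be checked without computing any data. This sketch rebuilds an array with the same shape and chunking under the assumed name `x_demo` and inspects its metadata.

```python
import dask.array as da

# Same shape and chunking as the array above; building it is lazy and cheap.
x_demo = da.random.random((20000, 20000), chunks=(1000, 1000))

total_bytes = x_demo.nbytes                         # 20000 * 20000 * 8 bytes
blocks_per_axis = x_demo.numblocks                  # chunks along each axis
chunk_bytes = 1000 * 1000 * x_demo.dtype.itemsize   # one float64 chunk

print(total_bytes / 1e9)   # 3.2 (GB)
print(blocks_per_axis)     # (20, 20), i.e. 400 chunks
print(chunk_bytes / 1e6)   # 8.0 (MB)
```

Each worker only needs to hold a few 8 MB chunks at a time, which is what lets the full 3.2 GB array be processed on a machine with less free memory than that.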

Now, let's perform a computation, like taking the mean along an axis. Again, this is lazy.

In [None]:
# A lazy operation
y = x.mean(axis=0)
y

And now, trigger the computation with `.compute()` and watch the dashboard.

In [None]:
result_array = y.compute()
print(result_array.shape)
print(result_array[:10])
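
A quick way to build trust in a chunked computation is to compare it with plain NumPy on a small input. The array below is illustrative and small enough to verify by hand.

```python
import numpy as np
import dask.array as da

# Small illustrative array so the result is easy to check.
data = np.arange(12, dtype=float).reshape(3, 4)

# Wrap it as a Dask array with two chunks of shape (3, 2).
chunked = da.from_array(data, chunks=(3, 2))

# Column means computed chunk by chunk, then gathered into a NumPy array.
column_means = chunked.mean(axis=0).compute()
print(column_means)  # [4. 5. 6. 7.]

# The chunked result matches the in-memory NumPy result.
print(np.allclose(column_means, data.mean(axis=0)))  # True
```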

### Dask Delayed: Custom Parallel Algorithms

For problems that don't fit the DataFrame or Array models, Dask provides `dask.delayed`. This is a decorator that makes a function lazy: calling the decorated function returns a `Delayed` object that records the call in a task graph instead of running it. This is useful for parallelizing custom code, such as a loop of simulations.

In [None]:
from dask import delayed
import time

@delayed
def simulate(params):
    # In a real scenario, this would be a complex simulation
    time.sleep(0.1)
    return sum(params)

simulations = [simulate([i, i+1]) for i in range(10)]

# `dask.compute` runs all the delayed objects in parallel
# and returns their results as a tuple.
results = dask.compute(*simulations)

print(results)
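
Delayed functions can also feed into one another, which lets a final aggregation step live in the same graph as the simulations. The sketch below is an assumption-laden toy: `simulate_mean` and `average` are hypothetical helpers, and the "simulation" is just the mean of seeded normal draws.

```python
import numpy as np
from dask import delayed

@delayed
def simulate_mean(seed):
    # Stand-in for an expensive simulation: mean of 1,000 normal draws.
    rng = np.random.default_rng(seed)
    return rng.normal(size=1000).mean()

@delayed
def average(values):
    # Runs only after every simulation it depends on has finished.
    return sum(values) / len(values)

draws = [simulate_mean(seed) for seed in range(8)]  # eight independent tasks
grand_mean = average(draws)                         # one task depending on all eight

# A single .compute() call executes the whole graph.
result_mean = grand_mean.compute()
print(result_mean)
```

Giving each task its own seed keeps the simulations independent and reproducible regardless of the order in which workers pick them up.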

## Summary: When to Use Dask

Dask is the right tool when your data is too large for memory or your computations can be split into independent pieces. It excels when:

- **Working with Large Datasets:** Your data does not fit in RAM, and you need to perform pandas- or NumPy-like operations on it.
- **Leveraging Multi-Core CPUs:** You have computationally intensive workflows that can be sped up by running them in parallel across multiple cores.
- **Scaling to a Cluster:** While we have used a local cluster, Dask is designed to run the same code on HPC or cloud-based clusters, typically by pointing the `Client` at a remote scheduler instead of starting a local one.

By providing scalable versions of familiar tools, Dask empowers economists to tackle larger and more complex empirical problems without straying from the Python ecosystem. It is a foundational tool for modern, data-intensive research.

In [None]:
# It's good practice to close the client to release resources
client.close()