# Porting Existing Code to Distributed Computing

Data scientists often run into a scenario where they have to port existing Pandas code to Spark or Dask. Either they start with a small data project and then move it to a larger dataset to run in production, or they have existing code and programs that are struggling to scale. The limitations of Pandas are well documented:

The primary limitation is that Pandas runs on a single core and does not take advantage of all available compute resources. Many operations also generate [intermediate copies](https://pandas.pydata.org/pandas-docs/stable/user_guide/scale.html#scaling-to-large-datasets) of data, using more memory than necessary. To handle data effectively with Pandas, users should preferably have [5 to 10 times](https://wesmckinney.com/blog/apache-arrow-pandas-internals/) as much RAM as the size of the dataset.
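As a rough illustration of the intermediate-copy point, the sketch below measures how much memory a DataFrame holds and how much an ordinary transformation adds on top. The column names and row count are made up for the example, and the exact numbers will vary by machine and Pandas version.

```python
import numpy as np
import pandas as pd

n_rows = 100_000  # arbitrary size, chosen only for illustration
df = pd.DataFrame({
    "id": np.arange(n_rows, dtype="int64"),
    "value": np.random.rand(n_rows),
})

# Bytes held by the original frame (deep=True also counts object columns).
base_bytes = int(df.memory_usage(deep=True).sum())

# An ordinary transformation returns a new DataFrame, so while both objects
# are alive the process holds roughly twice the original data.
scaled = df * 2
scaled_bytes = int(scaled.memory_usage(deep=True).sum())

print(f"original: {base_bytes / 1e6:.1f} MB")
print(f"original + intermediate: {(base_bytes + scaled_bytes) / 1e6:.1f} MB")
```

Chaining several such steps is how a pipeline can end up needing a multiple of the dataset size in RAM.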

Spark and Dask allow us to split compute jobs across multiple machines. They can also handle datasets that don’t fit into memory by [spilling data](http://distributed.dask.org/en/latest/worker.html#spill-data-to-disk) to disk in some cases.

## Current Approaches

### Vertical Scaling

The most common approach is to scale the compute vertically so that no rewrite of the code is needed. Instead of running everything on a 16 GB VM, we can run it on a 32 GB VM. When that isn't enough anymore, we can run the program on a 64 GB VM. The problem is that this is often not a good use of resources, for the following reasons:

1. You likely only need more compute resources for one step out of many. For example, if the dataset has already been reduced before the machine learning modelling step, then you don't need a big VM during modelling.
2. Scaling vertically does not automatically mean complete utilization of CPUs. People frequently scale up the underlying virtual machine but don't introduce parallelism, so the additional cores go unused.

### DIY Parallelism

It's also common for people to introduce their own form of parallelism. A common example is sharding a CSV or Parquet file into several smaller files, and then spinning up a Python process for each one using `concurrent.futures` (which is built on the multiprocessing library). For example:

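```python
import concurrent.futures

import pandas as pd


def logic(file_name):
    # Load one shard, transform it, and write the result back out.
    shard = pd.read_csv(file_name)
    result = do_something(shard)  # placeholder for the per-shard transformation
    result.to_csv(f"processed-{file_name}")


if __name__ == "__main__":
    # `files` is the list of shard file names produced by the sharding step.
    with concurrent.futures.ProcessPoolExecutor() as executor:
        # executor.map returns an iterator of results, not futures;
        # consuming it waits for every shard and re-raises worker errors.
        list(executor.map(logic, files))
```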

The DIY parallelism approach can have issues, though. The most common issue is resource contention, because the memory and CPU consumption of these processes is not fixed ([this StackOverflow post](https://stackoverflow.com/questions/71151809/python-processpoolexecutor-memory-problems) is an example). You have to chunk the work yourself, and memory management falls on the user. But how does each process know the overall consumption?

Second, this assumes that everything runs on the same machine, but there is room to do things better if we are open to scaling out.
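To make the "chunk it yourself" point concrete, here is a minimal sketch of bounding the memory of a single worker by reading a CSV in fixed-size chunks. The function name `chunked_sum`, the inline CSV text, and the chunk size are all hypothetical choices for illustration; a real shard would be a file path rather than an in-memory string.

```python
import io

import pandas as pd


def chunked_sum(csv_source, column, chunksize=2):
    """Sum one column while holding at most `chunksize` rows in memory."""
    total = 0
    # With chunksize set, read_csv yields DataFrames instead of loading everything.
    for chunk in pd.read_csv(csv_source, chunksize=chunksize):
        total += chunk[column].sum()
    return total


# Stand-in for a shard on disk (illustrative data only).
csv_text = "col1,col2\nA,1\nA,2\nB,3\nB,4\nB,5\n"
total = chunked_sum(io.StringIO(csv_text), "col2")
print(total)
```

This caps the memory of one process, but nothing here coordinates memory across processes, which is the part that a distributed framework takes over.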

## Distributed Computing Frameworks

<Insert Spark and Dask logos (and Ray)>

This brings us to distributed computing frameworks. Distributed just means we are scaling out over a cluster, so the data lives on multiple machines. The [Dask machine learning documentation](https://ml.dask.org/) shows what the dimensions of scale are: there are compute-bound problems and memory-bound problems. Distributed computing frameworks such as Spark and Dask scale out to a cluster of machines.

There is an image in the Dask repo [issues](https://github.com/dask/dask/issues/4471) that clearly illustrates the distributed computing paradigm. In general, there is a client or master that takes care of the orchestration and final data collection. The client is responsible for scheduling tasks among workers.

Both Spark and Dask also have local modes, in which they use the cores available on the local machine. This means we can still take advantage of the additional processing power without having a cluster available.

In the diagram below, note the following considerations:
- package versions and serialization
- how reading in files can be optimized
- where the data actually lives: on a physical machine

<img src="https://user-images.githubusercontent.com/11656932/62263986-bbba2f00-b3e3-11e9-9b5c-8446ba4efcf9.png" align="left" width="700"/>

## Introduction to Partitions

In order to understand partitions, we can look at this image showing the way Dask scales Pandas. Each partition is a Pandas DataFrame. A Dask DataFrame is the collection of all of the Pandas DataFrames. Operations are done on each partition, and then aggregated back.

<img src="https://docs.dask.org/en/latest/_images/dask-dataframe.svg" align="left" width="400"/>
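The "operate on each partition, then aggregate back" idea can be sketched with plain Pandas. This is only a conceptual imitation, not how Dask is implemented: the partitions are sliced by hand, and `partial_stats` is a hypothetical helper name.

```python
import pandas as pd

df = pd.DataFrame({"col1": ["A", "B", "A", "B"],
                   "col2": [3, 5, 4, 6]})

# Two hand-made partitions; each one is an ordinary Pandas DataFrame.
partitions = [df.iloc[:2], df.iloc[2:]]


def partial_stats(part):
    # Per-partition work: a mean cannot be averaged across partitions
    # directly, so keep the sum and the count instead.
    return part.groupby("col1")["col2"].agg(["sum", "count"])


# Aggregate the partial results back into one answer.
partials = pd.concat([partial_stats(p) for p in partitions])
combined = partials.groupby(level=0).sum()
partitioned_mean = combined["sum"] / combined["count"]

# Same result as running the groupby on the whole DataFrame at once.
expected = df.groupby("col1")["col2"].mean()
print(partitioned_mean.tolist() == expected.tolist())
```

Each group spans both partitions here, which is why the per-partition step has to return something that can be combined afterwards.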

## Data Inconsistencies of Pandas and Spark

One of the first issues when porting is the inconsistency between Pandas and Spark. Below is a summary of the differences.

![img](https://miro.medium.com/max/1400/0*fv0FKyt3jB0ehVrU)

### Setup

```python
import pandas as pd
from pyspark.sql import SparkSession
import dask.dataframe as dd

spark = SparkSession.builder.getOrCreate()

df = pd.DataFrame({"col1": [None, None, "A", "A", "B", "B"],
                   "col2": [1, 2, 3, 4, 5, 6]})
ddf = dd.from_pandas(df, npartitions=2)
sdf = spark.createDataFrame(df)
```

### Groupby

**Pandas**

```python
df.groupby("col1")["col2"].mean()
```

```
col1
A    3.5
B    5.5
Name: col2, dtype: float64
```

**Dask**

This is consistent with Pandas in dropping the NULL values.

```python
ddf.groupby("col1")["col2"].mean().compute()
```

```
col1
A    3.5
B    5.5
Name: col2, dtype: float64
```

**Spark**

```python
sdf.groupBy("col1").mean("col2").show()
```

```
+----+---------+
|col1|avg(col2)|
+----+---------+
|null|      1.5|
|   B|      5.5|
|   A|      3.5|
+----+---------+
```

**Additional Note**

For those wondering, you can make Pandas consistent with Spark by passing `dropna=False`:

```python
df.groupby("col1", dropna=False)["col2"].mean()
```

```
col1
A      3.5
B      5.5
NaN    1.5
Name: col2, dtype: float64
```
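One way to avoid depending on each engine's default is to make the null handling explicit before grouping. The sketch below shows the Pandas side only, and the sentinel value `"<missing>"` is an arbitrary choice for illustration; the equivalent filter or fill would still need to be written for Spark or Dask.

```python
import pandas as pd

df = pd.DataFrame({"col1": [None, None, "A", "A", "B", "B"],
                   "col2": [1, 2, 3, 4, 5, 6]})

# Option 1: drop null keys up front, so no engine can produce a null group.
without_nulls = df.dropna(subset=["col1"]).groupby("col1")["col2"].mean()

# Option 2: replace null keys with an explicit sentinel so they form a named group.
with_nulls = df.fillna({"col1": "<missing>"}).groupby("col1")["col2"].mean()

print(without_nulls.to_dict())
print(with_nulls.to_dict())
```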

### Sorting

**Pandas**

```python
df.sort_values(["col1", "col2"])
```

|   | col1 | col2 |
|---|------|------|
| 2 | A    | 3    |
| 3 | A    | 4    |
| 4 | B    | 5    |
| 5 | B    | 6    |
| 0 |      | 1    |
| 1 |      | 2    |


**Dask**

```python
try:
    ddf.sort_values(["col1", "col2"])
except Exception as e:
    print(e)
```

```
Dataframes only support sorting by named columns which must be passed as a string or a list of strings; multi-partition dataframes only support sorting by a single column.
You passed ['col1', 'col2']
```


**Spark**

```python
sdf.orderBy(["col1", "col2"]).show()
```

```
+----+----+
|col1|col2|
+----+----+
|null|   1|
|null|   2|
|   A|   3|
|   A|   4|
|   B|   5|
|   B|   6|
+----+----+
```



```python
sdf.orderBy(["col1", "col2"], ascending=[False, True]).show()
```

```
+----+----+
|col1|col2|
+----+----+
|   B|   5|
|   B|   6|
|   A|   3|
|   A|   4|
|null|   1|
|null|   2|
+----+----+
```



**Additional Note**

Pandas has an argument called `na_position`, which lets you decide where to place NA values when sorting. Pandas places NA values either first or last, while Spark treats null as the smallest value.

```python
df.sort_values(["col1", "col2"], na_position="first")
```

|   | col1 | col2 |
|---|------|------|
| 0 |      | 1    |
| 1 |      | 2    |
| 2 | A    | 3    |
| 3 | A    | 4    |
| 4 | B    | 5    |
| 5 | B    | 6    |
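Because Spark treats null as the smallest value, nulls come first in an ascending sort and last in a descending one, as the two Spark outputs above show. A minimal sketch of reproducing that in Pandas follows. `sort_like_spark` is a hypothetical helper, not part of either library; it applies one stable sort per key so that each key can get its own null position.

```python
import pandas as pd


def sort_like_spark(df, by, ascending=True):
    """Sort a Pandas DataFrame with null treated as the smallest value."""
    if isinstance(ascending, bool):
        ascending = [ascending] * len(by)
    result = df
    # Sort from the least to the most significant key; because each sort is
    # stable, the ordering from earlier passes is preserved within ties.
    for key, asc in reversed(list(zip(by, ascending))):
        result = result.sort_values(
            key,
            ascending=asc,
            kind="mergesort",  # a stable algorithm
            na_position="first" if asc else "last",
        )
    return result


df = pd.DataFrame({"col1": [None, None, "A", "A", "B", "B"],
                   "col2": [1, 2, 3, 4, 5, 6]})

ascending_result = sort_like_spark(df, ["col1", "col2"])
descending_result = sort_like_spark(df, ["col1", "col2"], ascending=[False, True])
print(descending_result.index.tolist())
```

The per-key loop is a design choice: `na_position` in a single multi-column `sort_values` call applies to every key at once, so mixed sort directions would otherwise place nulls the same way for all columns.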


## Behavior Inconsistencies

## Pandas-like Frameworks

## Do they hold up to the test?

## Introducing Fugue