# Distributed Compute

This is the heart of Fugue. In the previous sections, we covered how to use Fugue through extensions and basic data operations such as joins. In this section, we'll talk about how those Fugue extensions scale.

## Partition and Presort

One of the most fundamental distributed compute concepts is the partition. Our data is spread across several machines, and we often need to rearrange how it is spread. This is because some operations need all of the related data in one place. For example, calculating the median value per group requires all the data from the same group to be on one machine. Fugue allows users to control the partitioning scheme during execution.
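To make the idea concrete, here is a minimal plain-Python sketch of key-based partitioning. It is only an illustration of the concept, not how Fugue or any execution engine implements it. The names `rows`, `NUM_WORKERS`, `partition_by_key`, and `median_per_group` are hypothetical.

```python
from collections import defaultdict
from statistics import median

# Hypothetical rows of (group key, value), arriving in arbitrary order.
rows = [(1, 1), (2, 7), (1, 4), (2, 4), (1, 5), (2, 2)]
NUM_WORKERS = 2  # assumed number of machines, for illustration only


def partition_by_key(rows, num_workers):
    """Assign each row to a worker based on its key, so equal keys end up together."""
    workers = defaultdict(list)
    for key, value in rows:
        workers[hash(key) % num_workers].append((key, value))
    return workers


def median_per_group(worker_rows):
    """Compute the median per key using only the rows held by one worker."""
    groups = defaultdict(list)
    for key, value in worker_rows:
        groups[key].append(value)
    return {key: median(values) for key, values in groups.items()}


# Each worker can compute its medians independently because every group
# lives entirely on one worker after partitioning.
medians = {}
for worker_rows in partition_by_key(rows, NUM_WORKERS).values():
    medians.update(median_per_group(worker_rows))

print(medians)
```

Because all rows with the same key land on the same worker, no worker needs data from another one to finish its groups.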

In the example below, `take` is an operation that returns the first `n` rows. We apply `take` on each partition. There will be two partitions because `col1` is the partition key and it has only 2 distinct values.

In [1]:
from fugue import FugueWorkflow
import pandas as pd 

data = pd.DataFrame({'col1':[1,1,1,2,2,2], 'col2':[1,4,5,7,4,2]})
df2 = data.copy()

with FugueWorkflow() as dag:
    df = dag.df(df2)
    df = df.partition(by=['col1'], presort="col2 desc").take(1)
    df.show()

PandasDataFrame
col1:long|col2:long
---------+---------
2        |7        
1        |5        
Total count: 2



Along with the partition, we also have the presort. The presort key here was `col2 desc`, which means that the data in each partition is sorted by `col2` in descending order after partitioning. This makes `take(1)` give us the row with the max value of `col2` for each partition. We'll go over one more example.
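Before that example, it may help to see a rough pandas-only analogue of partition, presort, and `take(1)`. This is a sketch for intuition on a single machine, not what Fugue runs internally. The variable name `top_rows` is introduced here for illustration.

```python
import pandas as pd

data = pd.DataFrame({"col1": [1, 1, 1, 2, 2, 2], "col2": [1, 4, 5, 7, 4, 2]})

# Sort so the largest col2 comes first (the "presort"), then keep the
# first row of each col1 group (the per-partition "take(1)").
top_rows = (
    data.sort_values("col2", ascending=False)
    .groupby("col1")
    .head(1)
    .reset_index(drop=True)
)
print(top_rows)
```

Each `col1` value contributes exactly one row, the one with its largest `col2`.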

In [2]:
# schema: *, col2_diff:int
def diff(df: pd.DataFrame) -> pd.DataFrame:
    df['col2_diff'] = df['col2'].diff(1)
    return df

df2 = data.copy()
with FugueWorkflow() as dag:
    df = dag.df(df2)
    df = df.partition(by=['col1']).transform(diff)
    df.show()

PandasDataFrame
col1:long|col2:long|col2_diff:int
---------+---------+-------------
1        |1        |NULL         
1        |4        |3            
1        |5        |1            
2        |7        |NULL         
2        |4        |-3           
2        |2        |-2           
Total count: 6



Notice there are 2 NULL values in the previous example. The first element of a `diff` operation has no previous row to subtract, so it results in NULL. We get 2 NULLs because the `transformer` was applied once for each partition. The `partition-transform` semantics are very similar to the pandas `groupby-apply` semantics. There is a deeper dive into partitions in the advanced tutorial.
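The once-per-partition behavior can be imitated in plain pandas by looping over the groups and calling the function on each one. This is a sketch of the semantics only, assuming a single machine. The explicit loop is used here instead of `groupby(...).apply(...)` to make each call visible, and `pieces` and `result` are names introduced for illustration.

```python
import pandas as pd

data = pd.DataFrame({"col1": [1, 1, 1, 2, 2, 2], "col2": [1, 4, 5, 7, 4, 2]})


def diff(df: pd.DataFrame) -> pd.DataFrame:
    df = df.copy()  # avoid mutating the group that pandas hands us
    df["col2_diff"] = df["col2"].diff(1)
    return df


# Call diff once per group, mirroring one call per partition.
pieces = [diff(part) for _, part in data.groupby("col1")]
result = pd.concat(pieces)
print(result)
```

Each group starts its own `diff`, so the first row of every group has a missing value, which matches the two NULLs in the Fugue output.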

## CoTransformer

In the last section, we skipped the `cotransformer` because it requires knowledge about partitions. The `cotransformer` takes in multiple DataFrames that are **partitioned in the same way** and outputs one DataFrame. In order to use a `cotransformer`, the `zip` method has to be used first to join the DataFrames by their common keys. There is also a `@cotransformer` decorator that can be used to define the `cotransformer`, but it will still be invoked with the `zip-transform` syntax.

In [3]:
from typing import List, Any, Dict

# schema: df1:str,df2:str
def to_str(df1:List[List[Any]], df2:List[Dict[str,Any]]) -> List[List[Any]]:
    return [[df1.__repr__(),df2.__repr__()]]

with FugueWorkflow() as dag:
    df1 = dag.df([[0,1],[1,3]],"a:int,b:int")
    df2 = dag.df([[0,4],[1,2]],"a:int,c:int")
    df1.zip(df2, partition={"by":"a"}).transform(to_str).show()


PandasDataFrame
df1:str       |df2:str                                                                       
--------------+------------------------------------------------------------------------------
[[0, 1]]      |[{'a': 0, 'c': 4}]                                                            
[[1, 3]]      |[{'a': 1, 'c': 2}]                                                            
Total count: 2



In this example, the important part to note is that the first row of the output contains the items of `df1` and `df2` where `a` had a value of 0. The second row contains the items of `df1` and `df2` where `a` had a value of 1. The `df1` column values are in the `List[List[Any]]` format because that was the annotation provided for `df1` in `to_str`. Similarly, the `df2` column values are in the `List[Dict[str,Any]]` format because that was the annotation provided for `df2`.

The data was partitioned by the column `a` before the `cotransformer` was applied. This was done through the `zip` method. Using a `cotransformer` is a more advanced operation that may take some experience to get used to.
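For intuition, the zip-then-cotransform flow can be sketched in plain Python: group both inputs by the shared key, then call one function per key with the matching pieces from each side. This is a conceptual sketch, not Fugue's implementation. The helpers `group_by_key` and `zip_and_cotransform` are hypothetical, and keeping only keys present on both sides is an assumption made here.

```python
from collections import defaultdict
from typing import Any, Callable, Dict, List

Row = Dict[str, Any]


def group_by_key(rows: List[Row], key: str) -> Dict[Any, List[Row]]:
    """Split rows into groups that share the same value of `key`."""
    groups: Dict[Any, List[Row]] = defaultdict(list)
    for row in rows:
        groups[row[key]].append(row)
    return groups


def zip_and_cotransform(
    left: List[Row],
    right: List[Row],
    key: str,
    func: Callable[[List[Row], List[Row]], List[Any]],
) -> List[List[Any]]:
    """Pair up same-key groups from both inputs and call func once per key."""
    left_groups = group_by_key(left, key)
    right_groups = group_by_key(right, key)
    # Assumption: only keys found in both inputs are processed.
    common_keys = sorted(set(left_groups) & set(right_groups))
    return [func(left_groups[k], right_groups[k]) for k in common_keys]


def to_str(part1: List[Row], part2: List[Row]) -> List[Any]:
    return [repr(part1), repr(part2)]


df1 = [{"a": 0, "b": 1}, {"a": 1, "b": 3}]
df2 = [{"a": 0, "c": 4}, {"a": 1, "c": 2}]
output = zip_and_cotransform(df1, df2, "a", to_str)
for row in output:
    print(row)
```

The function receives one piece of each input at a time, and both pieces always share the same value of `a`.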

## Persist and Broadcast

Persist and broadcast are two other distributed compute concepts that Fugue supports. Persist keeps a DataFrame in memory to avoid recomputation. Distributed compute frameworks often need an explicit `persist` call to know which DataFrames to keep. Otherwise, those DataFrames tend to be recalculated each time they are used.
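The cost of recomputation can be illustrated with a small plain-Python analogy. This is not Fugue code. `expensive_source` is a hypothetical stand-in for a costly upstream step, and the counter only exists to show how many times it runs.

```python
calls = {"count": 0}


def expensive_source():
    """Stand-in for a costly upstream computation; counts how often it runs."""
    calls["count"] += 1
    return [x * x for x in range(5)]


# Without persisting: each downstream use re-runs the upstream computation.
total = sum(expensive_source())
largest = max(expensive_source())
runs_without_persist = calls["count"]

# With persisting: compute once, keep the result, and reuse it.
calls["count"] = 0
persisted = expensive_source()
total_cached = sum(persisted)
largest_cached = max(persisted)
runs_with_persist = calls["count"]

print(runs_without_persist, runs_with_persist)
```

The results are identical in both cases. Only the number of upstream runs changes, which is the saving a `persist` call is meant to provide.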

Broadcasting makes a small DataFrame available on all the workers of a cluster. Without `broadcast`, these small DataFrames would be sent to the workers repeatedly, whenever they are needed to perform an operation. Broadcasting caches them on the workers.

In [4]:
with FugueWorkflow() as dag:
    df = dag.df([[0,1],[1,2]],"a:long,b:long")
    df.persist()    # keep df cached so later steps do not recompute it
    df.broadcast()  # make the small df available on every worker
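As a rough plain-Python analogy for broadcasting, the sketch below gives every partition of a larger dataset its own copy of a small lookup table, so each partition can be joined locally. This is not Fugue code. `partitions`, `small_lookup`, and `join_partition` are hypothetical names, and the data is made up for illustration.

```python
# Hypothetical large dataset, already split into partitions across workers.
partitions = [
    [{"a": 0, "b": 1}, {"a": 1, "b": 2}],
    [{"a": 1, "b": 5}, {"a": 0, "b": 7}],
]
# Small lookup table: cheap enough to copy to every worker.
small_lookup = {0: "zero", 1: "one"}


def join_partition(rows, lookup):
    """Join one partition against its local copy of the lookup table."""
    return [{**row, "label": lookup[row["a"]]} for row in rows]


# "Broadcast": each worker gets its own copy of the small table once,
# then joins its partition without moving the large data around.
joined = [join_partition(part, dict(small_lookup)) for part in partitions]
for part in joined:
    print(part)
```

The large dataset never moves. Only the small table is copied, which is why broadcasting suits small DataFrames.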