# Processor

`Processor` represents a logic unit that runs on the driver and operates on the **entire** input dataframes.

**Input can be a single** [DataFrames](x-like.ipynb#DataFrames)

**Alternatively, acceptable input DataFrame types are**: `DataFrame`, `LocalDataFrame`, `pd.DataFrame`, `List[List[Any]]`, `Iterable[List[Any]]`, `EmptyAwareIterable[List[Any]]`, `List[Dict[str, Any]]`, `Iterable[Dict[str, Any]]`, `EmptyAwareIterable[Dict[str, Any]]`

**Acceptable output DataFrame types are**: `DataFrame`, `LocalDataFrame`, `pd.DataFrame`, `List[List[Any]]`, `Iterable[List[Any]]`, `EmptyAwareIterable[List[Any]]`, `List[Dict[str, Any]]`, `Iterable[Dict[str, Any]]`, `EmptyAwareIterable[Dict[str, Any]]`

**Before the input DataFrames**, you can add a parameter annotated with `ExecutionEngine`; Fugue will then pass the current `ExecutionEngine` to your function.

Notice:

* `ArrayDataFrame` and other local dataframe classes can't be used as annotations; you must use `LocalDataFrame` or `DataFrame`.
* If the output type is NOT one of `DataFrame`, `LocalDataFrame` or `pd.DataFrame`, the output schema is unknown, so you must specify the schema explicitly.
* `DataFrame` or `DataFrames` are the recommended input/output types. All other acceptable types are variations of `LocalDataFrame`, which means the dataset will be materialized and brought to the driver for processing.
* `Iterable`-like inputs may use different execution plans to bring data to the driver. In some cases this can be less optimal, so be careful.
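
The point about unknown output schemas can be illustrated with a minimal sketch of how an annotation-driven framework could decide whether a schema must be supplied. This is not Fugue's actual implementation: the `DataFrame` class below is a hypothetical stand-in, and `needs_explicit_schema` is an illustrative helper, not a Fugue function.

```python
from typing import Any, Dict, Iterable, List, get_type_hints


class DataFrame:
    """Hypothetical stand-in for a schema-carrying dataframe type."""


# Assumption for this sketch: only these return types carry their own schema.
SCHEMA_CARRYING_TYPES = (DataFrame,)


def needs_explicit_schema(func) -> bool:
    """Return True when the return annotation carries no schema information."""
    ret = get_type_hints(func).get("return")
    # Generic aliases such as Iterable[Dict[str, Any]] are not classes,
    # so they fall through to the "schema required" branch.
    is_schema_type = isinstance(ret, type) and issubclass(ret, SCHEMA_CARRYING_TYPES)
    return not is_schema_type


def frame_out(df: DataFrame) -> DataFrame:
    return df


def rows_out(df: List[Dict[str, Any]]) -> Iterable[Dict[str, Any]]:
    yield from df


print(needs_explicit_schema(frame_out))  # False
print(needs_explicit_schema(rows_out))   # True
```

A generator of dictionaries only reveals its column names and value types once rows are consumed, which is why such outputs need the schema stated up front.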


## Native Approach

This is the simplest way, and it requires no dependency on Fugue. You only need acceptable annotations on the input dataframes and the output.

In [None]:
from typing import Iterable, Dict, Any, List
import pandas as pd

# the output is pd.DataFrame, fugue can get schema from it
def add1(df:pd.DataFrame, n=1) -> pd.DataFrame:
    df["b"]+=n
    return df

# the output has no schema info, so you must specify schema in fugue code
# in practice, it's rare to use such output type for a processor
def add2(df:List[Dict[str,Any]], n=1) -> Iterable[Dict[str,Any]]:
    for row in df:
        row["b"]+=n
        yield row

def concat(df1:pd.DataFrame, df2:pd.DataFrame) -> pd.DataFrame:
    return pd.concat([df1,df2]).reset_index(drop=True)
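
Because these functions carry no Fugue dependency, they can be called and unit tested as plain Python. The sketch below repeats `add2` from the cell above and calls it directly; the sample rows are made up for illustration.

```python
from typing import Any, Dict, Iterable, List


def add2(df: List[Dict[str, Any]], n=1) -> Iterable[Dict[str, Any]]:
    for row in df:
        row["b"] += n
        yield row


rows = [{"a": 0, "b": 1}, {"a": 1, "b": 3}]

# add2 is a generator function, so nothing runs until it is consumed
result = list(add2(rows, n=2))
print(result)  # [{'a': 0, 'b': 3}, {'a': 1, 'b': 5}]
```

Note that `add2` updates each input dictionary in place, so `rows` holds the same updated values after the call.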

In [None]:
from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    df = dag.df([[0,1],[0,2],[1,3],[1,1]],"a:int,b:int")
    df.process(add1, params={"n":2}).show()
    dag.process(df,using=add1,params={"n":2}).show() # == above
    df.process(add2, schema="a:int,b:int", params={"n":2}).show()
    dag.process(df,df, using=concat).show()

Another important use case is a processor that takes an `ExecutionEngine` parameter. **This is how you write native Spark code inside Fugue.**

In [None]:
from fugue import ExecutionEngine, DataFrame, FugueWorkflow
from fugue_spark import SparkExecutionEngine, SparkDataFrame
from typing import Iterable, Dict, Any, List
import pandas as pd

# pay attention to the input and output annotations, they are both general DataFrame
def add(e:ExecutionEngine, df:DataFrame, temp_name="x") -> DataFrame:
    assert isinstance(e,SparkExecutionEngine) # this extension only works with SparkExecutionEngine
    df = e.to_df(df) # ensure df is a SparkDataFrame; the conversion happens here if needed
    df.native.createOrReplaceTempView(temp_name)  # df.native is spark dataframe
    sdf = e.spark_session.sql("select a,b+1 as b from "+temp_name)  # this is how you get spark session
    return SparkDataFrame(sdf) # you must wrap as Fugue SparkDataFrame to return

with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df([[0,1],[0,2],[1,3],[1,1]],"a:int,b:int")
    df.process(add, params={"temp_name":"y"}).show()

It's also important to know how to use `DataFrames` as the input annotation, because this is the only way to be **dynamic**, that is, to accept a variable number of input dataframes.

In [None]:
from typing import Iterable, Dict, Any, List
from fugue import DataFrames, DataFrame, FugueWorkflow
import pandas as pd

def concat(dfs:DataFrames) -> pd.DataFrame:
    pdfs = [df.as_pandas() for df in dfs.values()]
    return pd.concat(pdfs).reset_index(drop=True) # Fugue can't take pandas dataframe with special index

with FugueWorkflow() as dag:
    df1 = dag.df([[0,1]],"a:int,b:int")
    df2 = dag.df([[0,2],[1,3]],"a:int,b:int")
    df3 = dag.df([[1,1]],"a:int,b:int")
    dag.process(df1,using=concat).show()
    dag.process(df1,df2,using=concat).show()
    dag.process(df1,df2,df3,using=concat).show()

## With Schema Hint

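When the output schema must be specified, you can write it as a comment above the function instead of passing it in the Fugue code. Notice that if you are using `DataFrame`, `LocalDataFrame` or `pd.DataFrame` as the output type, you must not add a schema hint, because Fugue gets the schema from the output itself. The best practice is to use `DataFrame` as the output type.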

In [None]:
from typing import Iterable, Dict, Any, List
import pandas as pd

# schema: a:int, b:int
def add(df:List[Dict[str,Any]], n=1) -> Iterable[Dict[str,Any]]:
    for row in df:
        row["b"]+=n
        yield row


from fugue import FugueWorkflow

with FugueWorkflow() as dag:
    dag.df([[0,1]],"a:int,b:int").process(add).show()
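
The schema hint is an ordinary comment placed above the function. As a rough illustration of the convention, the sketch below pulls a `# schema:` comment out of source text and splits it into column name and type pairs. Both helpers are hypothetical and are not necessarily how Fugue reads the hint.

```python
from typing import List, Optional, Tuple


def find_schema_hint(source: str) -> Optional[str]:
    """Return the text after '# schema:' on the first such comment line, if any."""
    prefix = "# schema:"
    for line in source.splitlines():
        stripped = line.strip()
        if stripped.startswith(prefix):
            return stripped[len(prefix):].strip()
    return None


def split_schema(expr: str) -> List[Tuple[str, str]]:
    """Split 'a:int, b:int' into [('a', 'int'), ('b', 'int')]."""
    pairs = []
    for column in expr.split(","):
        name, type_name = column.split(":", 1)
        pairs.append((name.strip(), type_name.strip()))
    return pairs


source = '''
# schema: a:int, b:int
def add(df, n=1):
    ...
'''

hint = find_schema_hint(source)
print(hint)                # a:int, b:int
print(split_schema(hint))  # [('a', 'int'), ('b', 'int')]
```

The simple comma split only handles flat column lists like the one in this section; nested types would need a real parser.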

## Decorator Approach

There is no obvious advantage to using the decorator for a `Processor`.

In [None]:
from typing import Iterable, Dict, Any, List
from fugue import processor, FugueWorkflow
import pandas as pd

@processor("a:int, b:int")
def add(df:List[Dict[str,Any]], n=1) -> Iterable[Dict[str,Any]]:
    for row in df:
        row["b"]+=n
        yield row


with FugueWorkflow() as dag:
    dag.df([[0,1]],"a:int,b:int").process(add).show()

## Interface Approach

All the previous methods are just wrappers around the interface approach. They cover most use cases and simplify usage. But if you need the full execution context, such as partition information, use the interface approach.

In the interface approach, type annotations are not necessary, but again, it's good practice to have them.

In [None]:
from fugue import FugueWorkflow, Processor, DataFrames, DataFrame
from fugue_spark import SparkExecutionEngine
from time import sleep
import pandas as pd
import numpy as np


class Partitioner(Processor):
    def process(self, dfs:DataFrames) -> DataFrame:
        assert len(dfs)==1
        engine = self.execution_engine
        partition = self.partition_spec
        return engine.repartition(dfs[0], partition_spec = partition)


with FugueWorkflow(SparkExecutionEngine) as dag:
    df = dag.df([[0,1],[0,3],[1,2],[1,1]],"a:int,b:int")
    # the output is sorted by b: the partition is passed into Partitioner as partition_spec
    df.partition(num=1, presort="b").process(Partitioner).show()