# Introduction: Pandas Function APIs


**There are three types of pandas function APIs:**
1) [Grouped map](https://api-docs.databricks.com/python/pyspark/latest/pyspark.sql/api/pyspark.sql.GroupedData.applyInPandas.html#pyspark-sql-groupeddata-applyinpandas)
2) [Map](https://api-docs.databricks.com/python/pyspark/latest/pyspark.sql/api/pyspark.sql.DataFrame.mapInPandas.html#pyspark-sql-dataframe-mapinpandas)
3) [Cogrouped map](https://api-docs.databricks.com/python/pyspark/latest/pyspark.sql/api/pyspark.sql.PandasCogroupedOps.applyInPandas.html#pyspark-sql-pandascogroupedops-applyinpandas)

# Grouped map


Use `groupBy().applyInPandas()` to transform grouped data with the “split-apply-combine” pattern.

**Split-apply-combine consists of three steps:**

1) Split the data into groups by using `DataFrame.groupBy`.
2) Apply a function to each group. The input and output of the function are both `pandas.DataFrame`. The input contains all the rows and columns of one group.
3) Combine the results into a new `DataFrame`.
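
The three steps can be sketched locally with pandas alone, without Spark. This is only an illustration of the pattern, not how Spark executes it: the sample data and the `add_share` function are hypothetical, and on a cluster the groups are distributed across executors instead of being held in a Python list.

```python
import pandas as pd

# Hypothetical local data standing in for a distributed DataFrame.
data = pd.DataFrame({"id": [1, 1, 2, 2, 2], "v": [1.0, 2.0, 3.0, 5.0, 10.0]})


def add_share(pdf: pd.DataFrame) -> pd.DataFrame:
    """Hypothetical per-group function: each value's share of its group total."""
    return pdf.assign(share=pdf["v"] / pdf["v"].sum())


# 1) Split: one pandas.DataFrame per distinct id.
groups = [group for _, group in data.groupby("id")]

# 2) Apply: call the function once per group; it sees every row and column of that group.
results = [add_share(group) for group in groups]

# 3) Combine: stack the per-group results into a single DataFrame.
combined = pd.concat(results, ignore_index=True)
print(combined)
```

With `groupBy().applyInPandas()`, Spark takes care of steps 1 and 3, so you only write the function used in step 2.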


**To use `groupBy().applyInPandas()`, you must define the following:**

1) A Python function that defines the computation for each group.
2) A `StructType` object or a DDL-formatted string (for example, `"id long, v double"`) that defines the schema of the output DataFrame.

## Subtract the mean from each value in the group

In [0]:
# Creating a Spark DataFrame
df = spark.createDataFrame(
    [(1, 1.0), (1, 2.0), (2, 3.0), (2, 5.0), (2, 10.0)],
    ("id", "v"))

In [0]:
df.show()

In [0]:
df.schema

In [0]:
# Defining a custom function (subtract_mean)
def subtract_mean(pdf):  # pdf is a pandas.DataFrame holding all rows of one group
    v = pdf.v
    # Replace column v with its deviation from the group mean
    return pdf.assign(v=v - v.mean())

In [0]:
# Applying the Custom Function and Displaying the Resulting DataFrame
df.groupby("id").applyInPandas(subtract_mean, schema="id long, v double").show()

# Map


- Use `DataFrame.mapInPandas()` to apply a Python function to the current PySpark DataFrame, which is passed to the function as an iterator of `pandas.DataFrame` batches. The result is returned as a new PySpark DataFrame.
- The underlying function takes an iterator of `pandas.DataFrame` and returns an iterator of `pandas.DataFrame`.
- Unlike some pandas UDFs, such as Series to Series, it can return output of arbitrary length.
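
The iterator contract and the "arbitrary length" point can be tried locally with pandas alone. In this sketch the `repeat_rows` function and the hand-built list of batches are hypothetical stand-ins. On Spark, the batches are supplied by the engine and their sizes are not under the function's control.

```python
from typing import Iterator

import pandas as pd


def repeat_rows(batches: Iterator[pd.DataFrame]) -> Iterator[pd.DataFrame]:
    """Hypothetical map function: emit every input row twice."""
    for pdf in batches:
        # The output batch has twice as many rows as the input batch.
        yield pd.concat([pdf, pdf], ignore_index=True)


# Hand-built batches standing in for the ones Spark would provide.
batches = iter([
    pd.DataFrame({"id": [1, 2], "age": [21, 30]}),
    pd.DataFrame({"id": [3], "age": [45]}),
])

out = pd.concat(list(repeat_rows(batches)), ignore_index=True)
print(out)
```

Three input rows produce six output rows here. A function with the same shape could be passed to `mapInPandas()` together with an output schema.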


## Transform an iterator of pandas.DataFrame to another iterator

In [0]:
# Creating a Spark DataFrame
df = spark.createDataFrame([(1, 21), (2, 30)], ("id", "age"))

In [0]:
df.show()

In [0]:
# Defining a Custom Filtering Function (filter_func)
def filter_func(iterator):
    for pdf in iterator:
        yield pdf[pdf.id == 1]

In [0]:
df.schema

In [0]:
# Applying the Custom Filtering Function using mapInPandas and Displaying the Resulting DataFrame
df.mapInPandas(filter_func, schema=df.schema).show()