# PySpark `transform()` function
By Aishwarya Raut

PySpark provides two `transform()` functions: one is a method on `DataFrame`, and the other is a column function in `pyspark.sql.functions`.

> `pyspark.sql.DataFrame.transform()`
> `pyspark.sql.functions.transform()`

In [1]:
# Imports
from pyspark.sql import SparkSession

# Create SparkSession
spark = SparkSession.builder \
            .appName('SP') \
            .getOrCreate()

# Prepare Data
simpleData = [
    ("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
]
columns = ["CourseName", "fee", "discount"]

# Create DataFrame
df = spark.createDataFrame(data = simpleData, schema = columns)
df.printSchema()
# df.show(truncate=False)

root
 |-- CourseName: string (nullable = true)
 |-- fee: long (nullable = true)
 |-- discount: long (nullable = true)



# 1. PySpark DataFrame.transform()

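`pyspark.sql.DataFrame.transform()` is used to chain custom transformations. It calls the function you pass with the DataFrame as its first argument and returns the new DataFrame that the function produces.

`transform()` does not add or remove rows by itself. The result has the same number of rows as the input only when the custom function preserves them, as the `withColumn()`-based functions in this section do; a function that filters or aggregates changes the row count.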

## 1.1 Syntax
`pyspark.sql.DataFrame.transform()`

`DataFrame.transform(func: Callable[[…], DataFrame], *args: Any, **kwargs: Any) → pyspark.sql.dataframe.DataFrame`

`func` – Custom function to call. It receives the DataFrame as its first argument and must return a DataFrame.

`*args` – Positional arguments to pass to `func` (Spark 3.3.0 and later).

`**kwargs` – Keyword arguments to pass to `func` (Spark 3.3.0 and later).
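
To make the argument forwarding concrete, here is a minimal pure-Python sketch of the calling convention. `MiniFrame`, `add_const`, and `keep_if` are hypothetical stand-ins invented for illustration (they are not part of PySpark); the sketch only assumes that `transform()` calls `func` with the frame first, followed by `*args` and `**kwargs`, and checks that a frame comes back.

```python
class MiniFrame:
    """Hypothetical stand-in for a DataFrame: a list of dict rows."""

    def __init__(self, rows):
        self.rows = list(rows)

    def transform(self, func, *args, **kwargs):
        # Call func with this frame first, then any extra arguments
        result = func(self, *args, **kwargs)
        # The custom function is expected to return a frame
        if not isinstance(result, MiniFrame):
            raise TypeError(f"func returned {type(result).__name__}, expected MiniFrame")
        return result


def add_const(frame, column, amount=0):
    # Row-preserving step: adds `amount` to `column` in every row
    return MiniFrame({**r, column: r[column] + amount} for r in frame.rows)


def keep_if(frame, column, minimum):
    # Filtering step: the output can have fewer rows than the input
    return MiniFrame(r for r in frame.rows if r[column] >= minimum)


frame = MiniFrame([{"fee": 4000}, {"fee": 3000}])
shifted = frame.transform(add_const, "fee", amount=-1000)  # positional + keyword
filtered = shifted.transform(keep_if, "fee", minimum=2500)

print([r["fee"] for r in shifted.rows])     # [3000, 2000]
print(len(frame.rows), len(filtered.rows))  # 2 1
```

The row count of the result is decided entirely by the function passed in: `add_const` keeps both rows, while `keep_if` drops one.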

## 1.2 Create Custom Functions


In [2]:
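from pyspark.sql.functions import upper

# Upper-case the CourseName column
def to_upper_str_columns(df):
    return df.withColumn("CourseName", upper(df.CourseName))

# Subtract a fixed amount from fee; reduceBy is supplied through transform()
def reduce_price(df, reduceBy):
    return df.withColumn("new_fee", df.fee - reduceBy)

# Apply the percentage in the discount column to new_fee
def apply_discount(df):
    return df.withColumn("discount_fee",
                         df.new_fee - (df.new_fee * df.discount) / 100)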
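
As a sanity check on the arithmetic, the price reduction and discount can be worked through in plain Python for the sample rows, assuming a `reduceBy` value of 1000. This is only a sketch of the per-row calculation, not of how Spark executes it; the `rows` list and the `discounted_fee` helper are illustrative names that do not appear in the notebook.

```python
# Sample rows from the DataFrame above: (CourseName, fee, discount in percent)
rows = [
    ("Java", 4000, 5),
    ("Python", 4600, 10),
    ("Scala", 4100, 15),
    ("Scala", 4500, 15),
    ("PHP", 3000, 20),
]


def discounted_fee(fee, discount, reduce_by=1000):
    # Same formula as reduce_price followed by apply_discount
    new_fee = fee - reduce_by
    return new_fee - (new_fee * discount) / 100


results = [(name.upper(), discounted_fee(fee, discount)) for name, fee, discount in rows]
for name, value in results:
    print(name, value)
# JAVA 2850.0
# PYTHON 3240.0
# SCALA 2635.0
# SCALA 2975.0
# PHP 1600.0
```

The division produces a floating-point value, which is why a column computed this way comes out as `double` rather than `long`.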

## 1.3 PySpark Apply DataFrame.transform()


In [3]:
df2=df.transform(to_upper_str_columns).transform(reduce_price,1000).transform(apply_discount)

In [4]:
df2.printSchema()

root
 |-- CourseName: string (nullable = true)
 |-- fee: long (nullable = true)
 |-- discount: long (nullable = true)
 |-- new_fee: long (nullable = true)
 |-- discount_fee: double (nullable = true)



In [6]:
# custom function
def select_columns(df):
    return df.select("CourseName","discount_fee")

# Chain transformations
df2 = df.transform(to_upper_str_columns) \
        .transform(reduce_price,1000) \
        .transform(apply_discount) \
        .transform(select_columns)


In [7]:
df2.printSchema()

root
 |-- CourseName: string (nullable = true)
 |-- discount_fee: double (nullable = true)



# 2. PySpark sql.functions.transform()

`pyspark.sql.functions.transform()` (available since Spark 3.1.0) applies a transformation to a column of type Array. It applies the given function to every element of the array and returns a new column of ArrayType.

Following is the syntax of the `pyspark.sql.functions.transform()` function:


## 2.1 Syntax
`pyspark.sql.functions.transform(col, f)`

The following are the parameters:

`col` – ArrayType column, given as a `Column` or a column name.

`f` – Function to apply to each element. It receives the element as a `Column` (and, optionally, the element's 0-based index as a second argument) and must return a `Column` expression.

In [None]:
# Create DataFrame with Array
data = [
 ("James,,Smith",["Java","Scala","C++"],["Spark","Java"]),
 ("Michael,Rose,",["Spark","Java","C++"],["Spark","Java"]),
 ("Robert,,Williams",["CSharp","VB"],["Spark","Python"])
]
df = spark.createDataFrame(data=data,schema=["Name","Languages1","Languages2"])
df.printSchema()
df.show()

# using transform() function
from pyspark.sql.functions import transform, upper

# Upper-case every element of the Languages1 array
df.select(transform("Languages1", lambda x: upper(x)).alias("languages1")) \
  .show()
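
Conceptually, `transform()` behaves like a per-row `map` over the array. The plain-Python sketch below mirrors that for the `Languages1` values above; `transform_array` is a hypothetical helper written for illustration, not a PySpark function. It also mimics the two-argument form of `f`, where the second argument is the element's 0-based index.

```python
from inspect import signature


def transform_array(values, f):
    # Decide between f(x) and f(x, i) by counting f's parameters
    if len(signature(f).parameters) == 2:
        return [f(x, i) for i, x in enumerate(values)]
    return [f(x) for x in values]


languages1 = [
    ["Java", "Scala", "C++"],
    ["Spark", "Java", "C++"],
    ["CSharp", "VB"],
]

# One result array per row, each the same length as its input array
upper_cased = [transform_array(row, lambda x: x.upper()) for row in languages1]
indexed = [transform_array(row, lambda x, i: f"{i}:{x}") for row in languages1]

print(upper_cased)
# [['JAVA', 'SCALA', 'C++'], ['SPARK', 'JAVA', 'C++'], ['CSHARP', 'VB']]
print(indexed[2])
# ['0:CSharp', '1:VB']
```

In PySpark itself the function receives `Column` objects rather than Python values, so it has to be built from column expressions such as `upper(x)` instead of Python string methods.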