Here are the steps to develop and use Spark UDFs:
- Develop the required logic as a Python function (a lambda function also works)
- Register the function using `spark.udf.register` and assign the return value to a variable
- The variable can be used as part of Data Frame APIs such as `select`, `filter`, etc.
- The function is registered under a name. That name can be used as part of `selectExpr` or as part of Spark SQL queries run using `spark.sql`
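
The steps above can be sketched end to end as follows. This is a minimal sketch, assuming a notebook where `spark` (a SparkSession) and a Data Frame `df` with a string `order_date` column already exist. The wrapper `register_and_use` is a hypothetical helper, used only so that the plain Python logic can be checked without a cluster.

```python
def date_convert(d):
    """Turn a 'YYYY-MM-DD ...' string into the integer YYYYMMDD."""
    return int(d[:10].replace('-', ''))


def register_and_use(spark, df):
    """Hypothetical helper: register the UDF, then use it both ways."""
    # Register under a SQL name and keep the returned function in a variable
    dc = spark.udf.register('date_convert', date_convert)

    # The variable works with Data Frame APIs such as select and filter
    by_variable = df.select(dc('order_date').alias('order_date'))

    # The registered name works inside selectExpr (and spark.sql queries)
    by_name = df.selectExpr('date_convert(order_date) AS order_date')
    return by_variable, by_name
```

In the notebook, `register_and_use(spark, df)` would return two Data Frames that should hold the same converted column, one built through the variable and one through the registered name.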

In [0]:
help(spark.udf.register)

Help on method register in module pyspark.sql.udf:

register(name: str, f: Union[Callable[..., Any], ForwardRef('UserDefinedFunctionLike')], returnType: Optional[ForwardRef('DataTypeOrString')] = None) -> 'UserDefinedFunctionLike' method of pyspark.sql.udf.UDFRegistration instance
    Register a Python function (including lambda function) or a user-defined function
    as a SQL function.
    
    .. versionadded:: 1.3.1
    
    .. versionchanged:: 3.4.0
        Supports Spark Connect.
    
    Parameters
    ----------
    name : str,
        name of the user-defined function in SQL statements.
    f : function, :meth:`pyspark.sql.functions.udf` or :meth:`pyspark.sql.functions.pandas_udf`
        a Python function, or a user-defined function. The user-defined function can
        be either row-at-a-time or vectorized. See :meth:`pyspark.sql.functions.udf` and
        :meth:`pyspark.sql.functions.pandas_udf`.
    returnType : :class:`pyspark.sql.types.DataType` or str, optional
       

In [0]:
df = spark.read.json('/public/retail_db_json/orders')

In [0]:
df.show()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|
|              918|2013-07-25 00:00:...|      11| PAYMENT_REVIEW|
|             1837|2013-07-25 00:00:...|      12|         CLOSED|
|         

In [0]:
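# Register a UDF named date_convert: 'YYYY-MM-DD ...' -> YYYYMMDD (first 10 characters, dashes removed)
# The returned function is kept in dc so it can be used with Data Frame APIs
dc = spark.udf.register('date_convert', lambda d: int(d[:10].replace('-','')))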
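
The lambda above slices its argument, so it would raise a `TypeError` if `order_date` were ever null (Spark passes nulls to Python UDFs as `None`). The sample data shown here does not appear to contain null dates, so this is only a defensive sketch. The name `date_convert_safe` is hypothetical.

```python
def date_convert_safe(d):
    """Null-tolerant variant: returns None when the input is None."""
    if d is None:
        return None
    # Keep 'YYYY-MM-DD', drop the dashes, and convert to an integer
    return int(d[:10].replace('-', ''))
```

It can be registered the same way as the lambda, for example `spark.udf.register('date_convert_safe', date_convert_safe)`.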

In [0]:
dc

<function __main__.<lambda>(d)>

In [0]:
df.select(dc('order_date').alias('order_date')).show()

+----------+
|order_date|
+----------+
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
+----------+
only showing top 20 rows
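
Because no `returnType` was passed to `spark.udf.register`, the result column is a string by default, even though the lambda returns an `int`. If an integer column is wanted, the return type can be given explicitly. This is a minimal sketch, assuming `spark` is the notebook's SparkSession. The names `register_as_int` and `date_convert_int` are hypothetical.

```python
def date_convert(d):
    """Turn a 'YYYY-MM-DD ...' string into the integer YYYYMMDD."""
    return int(d[:10].replace('-', ''))


def register_as_int(spark):
    """Hypothetical helper: register the UDF with an explicit return type."""
    # The third argument is the return type, given here as a DDL type string
    return spark.udf.register('date_convert_int', date_convert, 'int')
```

In the notebook, `df.select(register_as_int(spark)('order_date').alias('order_date')).printSchema()` should then report an integer column rather than a string one.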



In [0]:
df.filter(dc('order_date') == 20140101).show()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|             3414|2014-01-01 00:00:...|   25876|PENDING_PAYMENT|
|             5549|2014-01-01 00:00:...|   25877|PENDING_PAYMENT|
|             9084|2014-01-01 00:00:...|   25878|        PENDING|
|             5118|2014-01-01 00:00:...|   25879|        PENDING|
|            10146|2014-01-01 00:00:...|   25880|       CANCELED|
|             3205|2014-01-01 00:00:...|   25881|PENDING_PAYMENT|
|             4598|2014-01-01 00:00:...|   25882|       COMPLETE|
|            11764|2014-01-01 00:00:...|   25883|        PENDING|
|             7904|2014-01-01 00:00:...|   25884|PENDING_PAYMENT|
|             7253|2014-01-01 00:00:...|   25885|        PENDING|
|             8195|2014-01-01 00:00:...|   25886|     PROCESSING|
|            10062|2014-01-01 00:00:...|   25887|        PENDING|
|         

In [0]:
df. \
    groupBy(dc('order_date').alias('order_date')).\
    count().\
    withColumnRenamed('count','order_count').\
    show()

+----------+-----------+
|order_date|order_count|
+----------+-----------+
|  20130919|        206|
|  20140303|        266|
|  20140202|        192|
|  20140310|        235|
|  20130809|        125|
|  20130817|        253|
|  20131015|        174|
|  20140114|        209|
|  20131029|        128|
|  20140130|        254|
|  20130824|        265|
|  20130913|        103|
|  20130914|        276|
|  20130825|        200|
|  20131031|        208|
|  20140304|        257|
|  20130731|        252|
|  20130730|        227|
|  20131116|        120|
|  20131213|        135|
+----------+-----------+
only showing top 20 rows
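
The same aggregation can also be written as a Spark SQL query that calls the UDF by its registered name. This is a minimal sketch, assuming `date_convert` has already been registered as shown earlier and that `spark` and `df` exist in the notebook. The view name `orders` and the helper `order_count_by_date` are hypothetical.

```python
# The UDF is applied in a subquery so the outer query groups on a plain column
ORDER_COUNT_SQL = """
    SELECT order_date, count(*) AS order_count
    FROM (SELECT date_convert(order_date) AS order_date FROM orders) q
    GROUP BY order_date
"""


def order_count_by_date(spark, df):
    """Hypothetical helper: expose df as a temporary view and run the query."""
    df.createOrReplaceTempView('orders')
    return spark.sql(ORDER_COUNT_SQL)
```

Calling `order_count_by_date(spark, df).show()` should produce the same kind of per-date counts as the Data Frame version above.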

