## Spark SQL Functions

* Spark provides a robust set of pre-defined functions as part of `pyspark.sql.functions`.
* However, they might not fulfill all our requirements.
* At times, we might have to develop custom UDFs for scenarios such as these:
    * No function is available for our requirement while applying row-level transformations.
    * We would have to combine multiple functions, which compromises the readability of the code.

Here are the steps we need to follow to develop and use Spark User Defined Functions.
* Develop the required logic using Python as the programming language.
* Register the function using `spark.udf.register` and assign the result to a variable.
* The variable can be used as part of DataFrame APIs such as `select`, `filter`, etc.
* When we register, we register with a name. That name can be used as part of `selectExpr` or as part of Spark SQL queries using `spark.sql`.
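
The steps above can be sketched roughly as follows. This is a minimal sketch rather than the only way to do it: the helper name `register_date_convert` is hypothetical, `spark` is assumed to be an existing `SparkSession`, and the Spark import sits inside the helper so the plain Python logic can be checked without a running session.

```python
def date_convert(order_date):
    """Step 1: plain Python logic, e.g. '2013-07-25 00:00:00.0' -> '20130725'."""
    return order_date[:10].replace('-', '')


def register_date_convert(spark):
    """Steps 2 to 4: register under a SQL name and return a handle for DataFrame APIs.

    Hypothetical helper; `spark` is assumed to be an existing SparkSession.
    """
    # Imported lazily so date_convert can be tested without Spark installed.
    from pyspark.sql.types import StringType
    return spark.udf.register('date_convert', date_convert, StringType())


# Check the logic in plain Python before involving Spark.
print(date_convert('2013-07-25 00:00:00.0'))  # 20130725
```

With a session available, `dc = register_date_convert(spark)` would give a variable for `select` or `filter`, while the registered name `date_convert` would be usable in `selectExpr` and `spark.sql`.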

In [3]:
from pyspark.sql import SparkSession
from pyspark.sql.functions import *

In [5]:
user_name = 'CodeInDNA'
spark = SparkSession.builder.appName(f'SparkUDFs - {user_name}').getOrCreate()

In [6]:
help(spark.udf.register)

Help on method register in module pyspark.sql.udf:

register(name, f, returnType=None) method of pyspark.sql.udf.UDFRegistration instance
    Register a Python function (including lambda function) or a user-defined function
    as a SQL function.
    
    :param name: name of the user-defined function in SQL statements.
    :param f: a Python function, or a user-defined function. The user-defined function can
        be either row-at-a-time or vectorized. See :meth:`pyspark.sql.functions.udf` and
        :meth:`pyspark.sql.functions.pandas_udf`.
    :param returnType: the return type of the registered user-defined function. The value can
        be either a :class:`pyspark.sql.types.DataType` object or a DDL-formatted type string.
    :return: a user-defined function.
    
    To register a nondeterministic Python function, users need to first build
    a nondeterministic user-defined function for the Python function and then register it
    as a SQL function.
    
    `returnType` can

In [7]:
ordersDF = spark.read.json('../data/orders.json')

In [8]:
ordersDF.show(2)

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|
+-----------------+--------------------+--------+---------------+
only showing top 2 rows



In [9]:
ordersDF.dtypes

[('order_customer_id', 'bigint'),
 ('order_date', 'string'),
 ('order_id', 'bigint'),
 ('order_status', 'string')]

In [11]:
dc = spark.udf.register('date_convert', lambda x: x[:10].replace('-', ''))
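
One caveat with the lambda above: if `order_date` is null in some row, Spark passes `None` to the Python function, and `x[:10]` then raises a `TypeError`. The sample data shown here does not appear to contain nulls, so the following is only a hedged sketch of a null-safe variant; the name `date_convert_safe` is an assumption and not part of the original notebook.

```python
def date_convert_safe(order_date):
    """Null-safe variant: returns None for null input instead of raising."""
    if order_date is None:
        return None
    return order_date[:10].replace('-', '')


print(date_convert_safe('2014-01-01 00:00:00.0'))  # 20140101
print(date_convert_safe(None))  # None
```

It could be registered the same way as the lambda, for example with `spark.udf.register('date_convert_safe', date_convert_safe)`.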

In [14]:
(ordersDF.
    select(dc('order_date').alias('order_date')).
    show())

+----------+
|order_date|
+----------+
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
+----------+
only showing top 20 rows



In [15]:
ordersDF.filter(dc('order_date') == 20140101).show()

+-----------------+--------------------+--------+---------------+
|order_customer_id|          order_date|order_id|   order_status|
+-----------------+--------------------+--------+---------------+
|             3414|2014-01-01 00:00:...|   25876|PENDING_PAYMENT|
|             5549|2014-01-01 00:00:...|   25877|PENDING_PAYMENT|
|             9084|2014-01-01 00:00:...|   25878|        PENDING|
|             5118|2014-01-01 00:00:...|   25879|        PENDING|
|            10146|2014-01-01 00:00:...|   25880|       CANCELED|
|             3205|2014-01-01 00:00:...|   25881|PENDING_PAYMENT|
|             4598|2014-01-01 00:00:...|   25882|       COMPLETE|
|            11764|2014-01-01 00:00:...|   25883|        PENDING|
|             7904|2014-01-01 00:00:...|   25884|PENDING_PAYMENT|
|             7253|2014-01-01 00:00:...|   25885|        PENDING|
|             8195|2014-01-01 00:00:...|   25886|     PROCESSING|
|            10062|2014-01-01 00:00:...|   25887|        PENDING|
|         

In [18]:
( ordersDF.
    groupBy(dc('order_date').alias('order_date')).
    count().
    withColumnRenamed('count', 'order_count').show())

+----------+-----------+
|order_date|order_count|
+----------+-----------+
|  20140413|        117|
|  20130919|        206|
|  20140303|        266|
|  20140410|        252|
|  20140512|        246|
|  20140530|        102|
|  20140711|        138|
|  20140202|        192|
|  20140310|        235|
|  20130809|        125|
|  20130817|        253|
|  20131015|        174|
|  20140114|        209|
|  20140505|        171|
|  20140709|        150|
|  20131029|        128|
|  20140130|        254|
|  20130824|        265|
|  20130913|        103|
|  20140610|        137|
+----------+-----------+
only showing top 20 rows



In [23]:
ordersDF.selectExpr('date_convert(order_date) AS order_date').show(5)

+----------+
|order_date|
+----------+
|  20130725|
|  20130725|
|  20130725|
|  20130725|
|  20130725|
+----------+
only showing top 5 rows



In [24]:
ordersDF.createOrReplaceTempView('orders')

In [26]:
spark.sql("""SELECT o.*, date_convert(order_date) AS order_date FROM orders o""").show()

+-----------------+--------------------+--------+---------------+----------+
|order_customer_id|          order_date|order_id|   order_status|order_date|
+-----------------+--------------------+--------+---------------+----------+
|            11599|2013-07-25 00:00:...|       1|         CLOSED|  20130725|
|              256|2013-07-25 00:00:...|       2|PENDING_PAYMENT|  20130725|
|            12111|2013-07-25 00:00:...|       3|       COMPLETE|  20130725|
|             8827|2013-07-25 00:00:...|       4|         CLOSED|  20130725|
|            11318|2013-07-25 00:00:...|       5|       COMPLETE|  20130725|
|             7130|2013-07-25 00:00:...|       6|       COMPLETE|  20130725|
|             4530|2013-07-25 00:00:...|       7|       COMPLETE|  20130725|
|             2911|2013-07-25 00:00:...|       8|     PROCESSING|  20130725|
|             5657|2013-07-25 00:00:...|       9|PENDING_PAYMENT|  20130725|
|             5648|2013-07-25 00:00:...|      10|PENDING_PAYMENT|  20130725|

In [28]:
spark.sql("""SELECT o.*, date_convert(order_date) AS order_date FROM orders o WHERE date_convert(order_date)=20140101""").show(5)

+-----------------+--------------------+--------+---------------+----------+
|order_customer_id|          order_date|order_id|   order_status|order_date|
+-----------------+--------------------+--------+---------------+----------+
|             3414|2014-01-01 00:00:...|   25876|PENDING_PAYMENT|  20140101|
|             5549|2014-01-01 00:00:...|   25877|PENDING_PAYMENT|  20140101|
|             9084|2014-01-01 00:00:...|   25878|        PENDING|  20140101|
|             5118|2014-01-01 00:00:...|   25879|        PENDING|  20140101|
|            10146|2014-01-01 00:00:...|   25880|       CANCELED|  20140101|
+-----------------+--------------------+--------+---------------+----------+
only showing top 5 rows
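
Because `date_convert` was registered without a `returnType`, it returns strings, so comparing its result with the number `20140101` relies on Spark coercing the types. If an integer result is what we want, the return type can be declared at registration. The sketch below assumes hypothetical names `date_convert_int` and `register_date_convert_int`, and assumes `spark` is an existing `SparkSession`.

```python
def date_convert_int(order_date):
    """Return the date as an int such as 20140101 (None stays None)."""
    if order_date is None:
        return None
    return int(order_date[:10].replace('-', ''))


def register_date_convert_int(spark):
    """Hypothetical helper: register the function with an explicit integer return type."""
    # Imported lazily so the plain function can be tested without Spark installed.
    from pyspark.sql.types import IntegerType
    return spark.udf.register('date_convert_int', date_convert_int, IntegerType())


# The plain Python result is a real integer, so it compares equal to 20140101.
print(date_convert_int('2014-01-01 00:00:00.0') == 20140101)  # True
```

After registration, a query such as `SELECT * FROM orders WHERE date_convert_int(order_date) = 20140101` would compare integers on both sides.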



In [29]:
spark.sql("""SELECT date_convert(order_date) AS order_date, count(*) AS order_count FROM orders o GROUP BY 1""").show(5)

+----------+-----------+
|order_date|order_count|
+----------+-----------+
|  20140413|        117|
|  20130919|        206|
|  20140303|        266|
|  20140410|        252|
|  20140512|        246|
+----------+-----------+
only showing top 5 rows

