# User-Defined Functions
1. Define a function
1. Create and apply a UDF
1. Register the UDF for use in SQL
1. Use the decorator syntax (Python only)

In [None]:
!pip install pyspark py4j

Collecting pyspark
  Downloading pyspark-3.2.1.tar.gz (281.4 MB)
Collecting py4j
  Downloading py4j-0.10.9.3-py2.py3-none-any.whl (198 kB)
Building wheels for collected packages: pyspark
  Building wheel for pyspark (setup.py) ... done
  Created wheel for pyspark: filename=pyspark-3.2.1-py2.py3-none-any.whl size=281853642 sha256=fb1558489820958d8e4c0a5b42fd0faaa0cfa329391b48c2bdb12dbfa7b2c21e
  Stored in directory: /root/.cache/pip/wheels/9f/f5/07/7cd8017084dce4e93e84e92efd1e1d5334db05f2e83bcef74f
Successfully built pyspark
Installing collected packages: py4j, pyspark
Successfully installed py4j-0.10.9.3 pyspark-3.2.1


In [None]:
# Google Colab setup

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import os
import shutil

# Google Colab only
import io
from google.colab import files

# Initialize the Spark session
spark = SparkSession.builder.getOrCreate()

In [None]:
# Upload the data file (Google Colab only)
uploaded = files.upload()

Saving events.parquet to events.parquet


In [None]:
os.listdir()

['.config', 'sales.parquet', 'sample_data']

In [None]:
salesPath = './sales.parquet'

salesDF = spark.read.parquet(salesPath)

salesDF.show(5, False)

+--------+-------------------------------+---------------------+-------------------+-----------------------+------------+-----------------------------------------------------------------+
|order_id|email                          |transaction_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items                                                            |
+--------+-------------------------------+---------------------+-------------------+-----------------------+------------+-----------------------------------------------------------------+
|257437  |kmunoz@powell-duran.com        |1592194221828900     |1                  |1995.0                 |1           |[{null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1}]     |
|282611  |bmurillo@hotmail.com           |1592504237604072     |1                  |940.5                  |1           |[{NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1}]|
|257448  |bradley74@gmail.com            |1592200438030141  

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Define a Function


Define a local Python/Scala function that returns the first letter of a string taken from the `email` field.

In [None]:
def firstLetterFunction(email):
  return email[0]

firstLetterFunction("annagray@kaufman.com")

'a'

In [None]:
# In Spark: get the first letter with the built-in substring function

from pyspark.sql import functions as F

salesDF.withColumn('FirstLetter', F.substring(F.col('email'),1, 1)).show(5, False)

+--------+-------------------------------+---------------------+-------------------+-----------------------+------------+-----------------------------------------------------------------+-----------+
|order_id|email                          |transaction_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items                                                            |FirstLetter|
+--------+-------------------------------+---------------------+-------------------+-----------------------+------------+-----------------------------------------------------------------+-----------+
|257437  |kmunoz@powell-duran.com        |1592194221828900     |1                  |1995.0                 |1           |[{null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1}]     |k          |
|282611  |bmurillo@hotmail.com           |1592504237604072     |1                  |940.5                  |1           |[{NEWBED10, M_STAN_Q, Standard Queen Mattress, 940.5, 1045.0, 1}]|b          |


### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create and Apply a UDF
Define a UDF that wraps the function. This serializes the function and sends it to the executors so that it can be applied to our DataFrame.

In [None]:
from pyspark.sql.functions import udf

firstLetterUDF = udf(firstLetterFunction)

Apply the UDF to the `email` column.

In [None]:
from pyspark.sql.functions import col
salesDF.select( firstLetterUDF(col("email"))).show(5)

+--------------------------+
|firstLetterFunction(email)|
+--------------------------+
|                         k|
|                         b|
|                         b|
|                         j|
|                         w|
+--------------------------+
only showing top 5 rows
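
The UDF above evaluates `email[0]` on every row. Spark hands null column values to a Python UDF as `None`, and `None[0]` raises a `TypeError`, while an empty string raises an `IndexError`. Whether this dataset actually contains such rows is not shown here, so the following is only a minimal sketch of a defensive variant. The name `first_letter_safe` is hypothetical and not part of the original notebook.

```python
from typing import Optional


def first_letter_safe(email: Optional[str]) -> Optional[str]:
    """Return the first character of email, or None when there is none."""
    if not email:  # covers both None (a null cell) and the empty string
        return None
    return email[0]


print(first_letter_safe("annagray@kaufman.com"))  # a
print(first_letter_safe(None))                    # None
print(first_letter_safe(""))                      # None
```

A function like this could be wrapped in the same way as `firstLetterFunction`, for example with `udf(first_letter_safe)`. A `None` returned by the function becomes a null in the resulting column.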



### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Register a UDF for Use in SQL
Register the UDF with `spark.udf.register` to make it available in the SQL namespace.

In [None]:
salesDF.createOrReplaceTempView("sales")

spark.udf.register("sql_udf", firstLetterFunction)

<function __main__.firstLetterFunction>

In [None]:
query = 'SELECT sql_udf(email) AS firstLetter FROM sales'


df = spark.sql(query)


df.show(7)

+-----------+
|firstLetter|
+-----------+
|          k|
|          b|
|          b|
|          j|
|          w|
|          e|
|          c|
+-----------+
only showing top 7 rows



### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Use the Decorator Syntax (Python Only)
Alternatively, define a UDF with Python's decorator syntax, passing the data type that the function returns.

You will no longer be able to call the local Python function directly: for example, `decoratorUDF("annagray@kaufman.com")` will not return `'a'`; it returns a `Column` expression instead.

In [None]:
# Our input/output is a string
@udf("string") # spark
def decoratorUDF(email: str) -> str:
  return email[0]

In [None]:
from pyspark.sql.functions import col

salesPath = './sales.parquet'

salesDF = spark.read.parquet(salesPath)
salesDF.select(decoratorUDF(col("email"))).show()

+-------------------+
|decoratorUDF(email)|
+-------------------+
|                  k|
|                  b|
|                  b|
|                  j|
|                  w|
|                  e|
|                  c|
|                  j|
|                  m|
|                  r|
|                  m|
|                  n|
|                  x|
|                  c|
|                  j|
|                  c|
|                  e|
|                  g|
|                  a|
|                  w|
+-------------------+
only showing top 20 rows



In [None]:
decoratorUDF("annagray@kaufman.com") # does not return 'a'; returns a Column expression

Column<'decoratorUDF(annagray@kaufman.com)'>
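
The call above yields a `Column` because a decorator rebinds the function's name to whatever the decorator returns. The sketch below illustrates that mechanism in plain Python only. `FakeColumn` and `fake_udf` are hypothetical stand-ins invented for this illustration and are not PySpark APIs. It also shows that wrapping the function under a second name, as done earlier with `firstLetterUDF = udf(firstLetterFunction)`, keeps the plain function callable.

```python
from typing import Callable


class FakeColumn:
    """Stand-in for a Spark Column: it only records an expression string."""

    def __init__(self, expr: str) -> None:
        self.expr = expr


def fake_udf(func: Callable) -> Callable:
    """Stand-in for a UDF decorator: the wrapper builds an expression."""

    def wrapper(arg: object) -> FakeColumn:
        # The wrapper never runs func here; it only describes the call.
        return FakeColumn(f"{func.__name__}({arg})")

    return wrapper


@fake_udf
def decorated(email: str) -> str:
    return email[0]


def first_letter(email: str) -> str:
    return email[0]


# Wrapping under a second name leaves first_letter usable as plain Python.
first_letter_wrapped = fake_udf(first_letter)

print(type(decorated("annagray@kaufman.com")).__name__)  # FakeColumn
print(first_letter("annagray@kaufman.com"))              # a
print(first_letter_wrapped("email").expr)                # first_letter(email)
```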

We can build a Python class that lets us use the same function both for a plain Python call and inside Spark.

In [5]:
from typing import Callable
from pyspark.sql import Column
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, IntegerType, ArrayType, DataType

class py_or_udf:
    """Decorator: run func as a Spark UDF for Column arguments, else as plain Python."""
    def __init__(self, returnType: DataType = StringType()):
        self.spark_udf_type = returnType

    def __call__(self, func: Callable):
        def wrapped_func(*args, **kwargs):
            # Any Column argument means a Spark expression is being built
            if any(isinstance(arg, Column) for arg in args) or \
                any(isinstance(vv, Column) for vv in kwargs.values()):
                return udf(func, self.spark_udf_type)(*args, **kwargs)
            else:
                # Plain Python values: call the original function directly
                return func(*args, **kwargs)

        return wrapped_func

In [None]:
@py_or_udf(returnType=StringType())
def decoratorUDF(email: str) -> str:
  return email[0]

# This works
# assert decoratorUDF("annagray@kaufman.com") == "a"

decoratorUDF("annagray@kaufman.com")

'a'

In [None]:
# This also works
salesDF.select(decoratorUDF(col("email"))).show()

+-------------------+
|decoratorUDF(email)|
+-------------------+
|                  k|
|                  b|
|                  b|
|                  j|
|                  w|
|                  e|
|                  c|
|                  j|
|                  m|
|                  r|
|                  m|
|                  n|
|                  x|
|                  c|
|                  j|
|                  c|
|                  e|
|                  g|
|                  a|
|                  w|
+-------------------+
only showing top 20 rows



## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sort Day Lab
1. Define UDF to label day of week
1. Apply UDF to label and sort by day of week
1. Plot active users by day of week as bar graph

Start with a DataFrame of the average number of active users by day of week.

This was the resulting `df` in a previous lab.

In [2]:
# Only needed when running locally
import findspark

findspark.init()
findspark.find()

'E:\\LibreriasPython\\spark-3.1.2-bin-hadoop2.7\\python\\pyspark'

In [3]:
import pyspark
from pyspark.sql import SparkSession

spark = SparkSession.builder.appName('ManagementURFs').getOrCreate()

In [4]:
from pyspark.sql.functions import approx_count_distinct, avg, col, date_format, to_date

eventsPath = '../data/events.parquet'

df = (spark.read.parquet(eventsPath)
  .withColumn("ts", (col("event_timestamp") / 1e6).cast("timestamp"))
  .withColumn("date", to_date("ts"))
  .groupBy("date").agg(approx_count_distinct("user_id").alias("active_users"))
  .withColumn("day", date_format(col("date"), "E"))
  .groupBy("day").agg(avg(col("active_users")).alias("avg_users")))

df.show(10, False)

+---+------------------+
|day|avg_users         |
+---+------------------+
|Sun|281307.5          |
|Mon|237582.5          |
|Thu|179814.66666666666|
|Sat|273175.3333333333 |
|Wed|225910.5          |
|Fri|251063.66666666666|
|Tue|254316.5          |
+---+------------------+
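
The `/ 1e6` step in the pipeline above treats the timestamp as microseconds since the Unix epoch and converts it to seconds before the cast. As a rough plain-Python illustration of the same arithmetic, the sketch below uses a sample value copied from the `transaction_timestamp` column of the sales table shown earlier, assuming it uses the same unit. It fixes the time zone to UTC, whereas Spark renders timestamps in the session time zone, so the date can differ near midnight.

```python
from datetime import datetime, timezone

# Sample value copied from the transaction_timestamp column shown earlier;
# assumed here to be microseconds since the Unix epoch.
ts_micros = 1592194221828900

# Dividing by 1e6 converts microseconds to seconds, as in the Spark pipeline.
ts = datetime.fromtimestamp(ts_micros / 1e6, tz=timezone.utc)

print(ts.date())          # 2020-06-15
print(ts.strftime("%a"))  # Mon (abbreviated day name, similar to date_format(..., "E"))
```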



### 1. Define a UDF to Label the Day of the Week
Use the **`labelDayOfWeek`** function provided below to create the UDF **`labelDowUDF`**

In [6]:
@py_or_udf(StringType())
def labelDayOfWeek(day: str) -> str:
  dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
         "Fri": "5", "Sat": "6", "Sun": "7"}
  return dow.get(day) + "-" + day

In [9]:
from pyspark.sql import functions as F
df.withColumn('day',labelDayOfWeek(F.col('day'))).show(5)

+-----+------------------+
|  day|         avg_users|
+-----+------------------+
|7-Sun|          281307.5|
|1-Mon|          237582.5|
|4-Thu|179814.66666666666|
|6-Sat| 273175.3333333333|
|3-Wed|          225910.5|
+-----+------------------+
only showing top 5 rows
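
The numeric prefix added by `labelDayOfWeek` is what makes a later sort meaningful: the plain abbreviations sort alphabetically, while the prefixed labels sort in weekday order. A small plain-Python sketch of that effect, using the same mapping as the function above (the names `DOW`, `days`, `alphabetical`, and `labeled` are chosen only for this illustration):

```python
DOW = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
       "Fri": "5", "Sat": "6", "Sun": "7"}

# Day abbreviations in the order they appear in the aggregated DataFrame.
days = ["Sun", "Mon", "Thu", "Sat", "Wed", "Fri", "Tue"]

# Plain abbreviations sort alphabetically, which scrambles the week.
alphabetical = sorted(days)

# Prefixed labels sort by the leading digit, which restores weekday order.
labeled = sorted(DOW[d] + "-" + d for d in days)

print(alphabetical)  # ['Fri', 'Mon', 'Sat', 'Sun', 'Thu', 'Tue', 'Wed']
print(labeled)       # ['1-Mon', '2-Tue', '3-Wed', '4-Thu', '5-Fri', '6-Sat', '7-Sun']
```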



### 2. Apply the UDF to Label and Sort by Day of Week
- Update the **`day`** column by applying the UDF and replacing this column
- Sort by **`day`**
- Plot as a bar graph

In [None]:
# TODO
finalDF = FILL_IN

display(finalDF)

### Extras

- [PySpark UDFs](https://medium.com/@ayplam/developing-pyspark-udfs-d179db0ccc87)
- [Pandas UDF](https://medium.com/analytics-vidhya/pyspark-udf-deep-dive-8ae984bfac00)