# User-Defined Functions
1. Define a function
1. Create and apply a UDF
1. Register the UDF for use in SQL
1. Use the decorator syntax (Python only)

In [2]:
import findspark

findspark.init()
findspark.find()

'E:\\LibreriasPython\\spark-3.1.2-bin-hadoop2.7\\python\\pyspark'

In [3]:
#Google Colab

import pyspark
from pyspark.sql import SparkSession
from pyspark.sql import functions as F
import os
import shutil


# Create (or reuse) the Spark session
spark = SparkSession.builder.getOrCreate()

In [4]:
os.listdir()

['.ipynb_checkpoints',
 '01 Archivos Json.ipynb',
 '02 Conceptos de Spark Streaming.ipynb',
 '03_UDFs.ipynb',
 'Laboratorio Spark Streaming.ipynb']

In [9]:
salesPath = '../data/sales.parquet'

salesDF = spark.read.parquet(salesPath)

salesDF.show(3, vertical = True,truncate = False)

-RECORD 0------------------------------------------------------------------------------------
 order_id                | 257437                                                            
 email                   | kmunoz@powell-duran.com                                           
 transaction_timestamp   | 1592194221828900                                                  
 total_item_quantity     | 1                                                                 
 purchase_revenue_in_usd | 1995.0                                                            
 unique_items            | 1                                                                 
 items                   | [{null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1}]      
-RECORD 1------------------------------------------------------------------------------------
 order_id                | 282611                                                            
 email                   | bmurillo@hotmail.com             

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) With plain Spark

Using only Spark's built-in functions, get the first letter of the string in the `email` field.

In [11]:
# Get the first letter using Spark built-in functions

from pyspark.sql import functions as F

salesDF.withColumn('FirstLetter', F.substring(F.col('email'),1, 1)).show(3,vertical = True,truncate = False)

-RECORD 0------------------------------------------------------------------------------------
 order_id                | 257437                                                            
 email                   | kmunoz@powell-duran.com                                           
 transaction_timestamp   | 1592194221828900                                                  
 total_item_quantity     | 1                                                                 
 purchase_revenue_in_usd | 1995.0                                                            
 unique_items            | 1                                                                 
 items                   | [{null, M_PREM_K, Premium King Mattress, 1995.0, 1995.0, 1}]      
 FirstLetter             | k                                                                 
-RECORD 1------------------------------------------------------------------------------------
 order_id                | 282611                           
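
Note that `F.substring` counts positions starting at 1, whereas Python slicing starts at 0. The following plain-Python sketch mirrors the same extraction for positions of 1 or greater; the helper name `substring_1_based` is hypothetical and is not part of Spark or of this notebook.

```python
def substring_1_based(text: str, pos: int, length: int) -> str:
    """Return `length` characters of `text`, starting at 1-based position `pos`."""
    if pos < 1:
        # Assumption: this sketch deliberately ignores zero or negative positions
        raise ValueError("this sketch only handles pos >= 1")
    start = pos - 1  # convert the 1-based position to a 0-based index
    return text[start:start + length]


print(substring_1_based("kmunoz@powell-duran.com", 1, 1))
```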

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Define a function

Define a local Python function that returns the first letter of a string, such as a value from the `email` field.

In [10]:
def firstLetterFunction(email):
  return email[0]

firstLetterFunction("annagray@kaufman.com")

'a'

### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Create and apply a UDF
Define a UDF that wraps the function. Spark serializes the function and sends it to the executors so that it can be applied to the rows of our DataFrame.

In [12]:
from pyspark.sql.functions import udf
# Wrap the Python function as a UDF so it can be applied to DataFrame columns
firstLetterUDF = udf(firstLetterFunction)

Apply the UDF to the `email` column.

In [None]:
from pyspark.sql.functions import col
salesDF.select( firstLetterUDF(col("email"))).show(5)

+--------------------------+
|firstLetterFunction(email)|
+--------------------------+
|                         k|
|                         b|
|                         b|
|                         j|
|                         w|
+--------------------------+
only showing top 5 rows
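
One caveat worth sketching: Spark passes SQL `NULL` values to a Python UDF as `None`, so `email[0]` would raise a `TypeError` on a null email and an `IndexError` on an empty string. A minimal null-safe variant might look like the following; the name `first_letter_or_none` is hypothetical and does not appear in the original notebook.

```python
from typing import Optional


def first_letter_or_none(email: Optional[str]) -> Optional[str]:
    """Return the first character of `email`, or None when it is missing or empty."""
    if not email:  # covers both None (SQL NULL) and the empty string
        return None
    return email[0]


print(first_letter_or_none("annagray@kaufman.com"))
print(first_letter_or_none(None))
```

It could be wrapped in the same way as `firstLetterFunction`, for example with `udf(first_letter_or_none)`.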



### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Register the UDF for use in SQL
Register the UDF with `spark.udf.register` to make it available in the SQL namespace.

In [13]:
# Create a temporary view named "sales"
salesDF.createOrReplaceTempView("sales")

# Register the function under the name "sql_udf" so SQL queries can call it
spark.udf.register("sql_udf", firstLetterFunction)

<function __main__.firstLetterFunction(email)>

In [14]:
query = 'SELECT sql_udf(email) AS firstLetter FROM sales'


df = spark.sql(query)


df.show(7)

+-----------+
|firstLetter|
+-----------+
|          k|
|          b|
|          b|
|          j|
|          w|
|          e|
|          c|
+-----------+
only showing top 7 rows



### ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Use the decorator syntax (Python only)
Alternatively, define a UDF with Python's decorator syntax, passing the data type that the function returns.

The name then refers to the UDF, so it can no longer be called as a plain Python function (for example, `decoratorUDF("annagray@kaufman.com")` returns a `Column` expression instead of `'a'`).

In [15]:
# Our input/output is a string
@udf("string")  # tell Spark that the return type is string
def decoratorUDF(email: str) -> str:
  return email[0]

In [17]:
from pyspark.sql.functions import col

salesPath = '../data/sales.parquet'

salesDF = spark.read.parquet(salesPath)
# Apply the decorated Python function as a UDF
salesDF.select(decoratorUDF(col("email"))).show(5)

+-------------------+
|decoratorUDF(email)|
+-------------------+
|                  k|
|                  b|
|                  b|
|                  j|
|                  w|
+-------------------+
only showing top 5 rows



In [18]:
# The decorated function no longer behaves like a plain Python function
decoratorUDF("annagray@kaufman.com")  # returns a Column expression, not 'a'

Column<'decoratorUDF(annagray@kaufman.com)'>

### Building functions for both Python and PySpark
We can write a Python class that lets the same function be used for a plain Python call as well as for Spark columns.

In [19]:
from typing import Callable
from pyspark.sql import Column
from pyspark.sql.functions import udf, col
from pyspark.sql.types import StringType, IntegerType, ArrayType, DataType
class py_or_udf:
    """Decorator: run as a Spark UDF when given Columns, otherwise as plain Python."""

    def __init__(self, returnType: DataType = StringType()):
        self.spark_udf_type = returnType

    def __call__(self, func: Callable):
        def wrapped_func(*args, **kwargs):
            # Any Column argument means we are building a DataFrame expression
            if any(isinstance(arg, Column) for arg in args) or \
                any(isinstance(vv, Column) for vv in kwargs.values()):
                return udf(func, self.spark_udf_type)(*args, **kwargs)
            # Otherwise call the original Python function directly
            return func(*args, **kwargs)

        return wrapped_func
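
The dispatch idea behind `py_or_udf` does not depend on Spark itself. The sketch below reproduces it with stand-ins so that both branches can be checked without a Spark session; `FakeColumn`, `fake_udf`, `py_or_wrapped`, and `first_letter` are all hypothetical names introduced only for this illustration.

```python
from typing import Any, Callable


class FakeColumn:
    """Hypothetical stand-in for pyspark.sql.Column, used only in this sketch."""

    def __init__(self, name: str):
        self.name = name


def fake_udf(func: Callable) -> Callable:
    """Stand-in for `udf`: returns a text description instead of a Column."""
    return lambda *cols: f"{func.__name__}({', '.join(c.name for c in cols)})"


def py_or_wrapped(column_type: type, wrap: Callable[[Callable], Callable]):
    """Build a decorator that uses `wrap(func)` when any argument is a column."""

    def decorator(func: Callable) -> Callable:
        def dispatcher(*args: Any, **kwargs: Any) -> Any:
            values = list(args) + list(kwargs.values())
            if any(isinstance(v, column_type) for v in values):
                return wrap(func)(*args, **kwargs)  # column branch
            return func(*args, **kwargs)            # plain Python branch

        return dispatcher

    return decorator


@py_or_wrapped(FakeColumn, fake_udf)
def first_letter(email: str) -> str:
    return email[0]


print(first_letter("annagray@kaufman.com"))  # plain Python branch
print(first_letter(FakeColumn("email")))     # column branch
```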

In [20]:
@py_or_udf(returnType=StringType())
def decoratorUDF(email: str) -> str:
  return email[0]

# This works
# assert decoratorUDF("annagray@kaufman.com") == "a"

decoratorUDF("annagray@kaufman.com")

'a'

In [21]:
# This also works
salesDF.select(decoratorUDF(col("email"))).show(3)

+-------------------+
|decoratorUDF(email)|
+-------------------+
|                  k|
|                  b|
|                  b|
+-------------------+
only showing top 3 rows



## ![Spark Logo Tiny](https://files.training.databricks.com/images/105/logo_spark_tiny.png) Sort Day Lab
1. Define UDF to label day of week
1. Apply UDF to label and sort by day of week
1. Plot active users by day of week as bar graph

Start with a DataFrame of the average number of active users by day of week.

This was the resulting `df` in a previous lab.

In [4]:
from pyspark.sql.functions import approx_count_distinct, avg, col, date_format, to_date

eventsPath = '../data/events.parquet'

df = (spark.read.parquet(eventsPath)
  .withColumn("ts", (col("event_timestamp") / 1e6).cast("timestamp"))
  .withColumn("date", to_date("ts"))
  .groupBy("date").agg(approx_count_distinct("user_id").alias("active_users"))
  .withColumn("day", date_format(col("date"), "E"))
  .groupBy("day").agg(avg(col("active_users")).alias("avg_users")))

df.show(10, False)

+---+------------------+
|day|avg_users         |
+---+------------------+
|Sun|281307.5          |
|Mon|237582.5          |
|Thu|179814.66666666666|
|Sat|273175.3333333333 |
|Wed|225910.5          |
|Fri|251063.66666666666|
|Tue|254316.5          |
+---+------------------+



### 1. Define a UDF to label the day of week
Use the **`labelDayOfWeek`** function provided below to create the UDF **`labelDowUDF`**. In this notebook the function is decorated with `py_or_udf`, so `labelDayOfWeek` itself can be applied to columns.

In [22]:
@py_or_udf(StringType())
def labelDayOfWeek(day: str) -> str:
  dow = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
         "Fri": "5", "Sat": "6", "Sun": "7"}
  return dow.get(day) + "-" + day

In [23]:
labelDayOfWeek("Sun")

'7-Sun'
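
`dow.get(day)` returns `None` for a value that is not in the dictionary, so the string concatenation would then fail with a `TypeError`. A defensive variant could look like the sketch below; the name `label_day_of_week_safe` and the `"0"` fallback prefix are assumptions made for illustration, not part of the lab.

```python
from typing import Optional

DOW = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
       "Fri": "5", "Sat": "6", "Sun": "7"}


def label_day_of_week_safe(day: Optional[str]) -> Optional[str]:
    """Prefix `day` with its weekday number; tolerate None and unknown values."""
    if day is None:
        return None  # keep missing values missing
    # Assumption: unknown abbreviations get the prefix "0" so they sort first
    return DOW.get(day, "0") + "-" + day


print(label_day_of_week_safe("Sun"))
print(label_day_of_week_safe("Xyz"))
```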

### 2. Apply the UDF to label and sort by day of week
- Update the **`day`** column by applying the UDF and replacing the existing column
- Sort by **`day`**
- Plot as a bar chart

In [9]:
from pyspark.sql import functions as F
df.withColumn('day',labelDayOfWeek(F.col('day'))).show(5)

+-----+------------------+
|  day|         avg_users|
+-----+------------------+
|7-Sun|          281307.5|
|1-Mon|          237582.5|
|4-Thu|179814.66666666666|
|6-Sat| 273175.3333333333|
|3-Wed|          225910.5|
+-----+------------------+
only showing top 5 rows
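
The output above is labelled but not yet sorted. Because each label starts with its weekday number, an ordinary string sort yields calendar order; in Spark this would be a `.sort("day")` on the labelled DataFrame. The idea can be checked in plain Python with the averages printed earlier. This is only a sketch, and `label` is a hypothetical local helper that mirrors `labelDayOfWeek`.

```python
# Averages copied from the `df.show` output earlier in the lab
avg_users = {
    "Sun": 281307.5, "Mon": 237582.5, "Thu": 179814.66666666666,
    "Sat": 273175.3333333333, "Wed": 225910.5,
    "Fri": 251063.66666666666, "Tue": 254316.5,
}

DOW = {"Mon": "1", "Tue": "2", "Wed": "3", "Thu": "4",
       "Fri": "5", "Sat": "6", "Sun": "7"}


def label(day: str) -> str:
    """Hypothetical helper mirroring labelDayOfWeek."""
    return DOW[day] + "-" + day


# Sorting the (label, value) pairs orders them by the numeric prefix
labelled = sorted((label(day), users) for day, users in avg_users.items())

for day, users in labelled:
    print(f"{day}: {users:.1f}")
```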



### Extras

- [Udf Pyspark](https://medium.com/@ayplam/developing-pyspark-udfs-d179db0ccc87)
- [Pandas UDF](https://medium.com/analytics-vidhya/pyspark-udf-deep-dive-8ae984bfac00)

In [None]:
# Requires pandas to be installed
pandasDF = df.toPandas()
pandasDF.head()

In [None]:
# The same dual-purpose function also works with pandas
pandasDF['new']= pandasDF.day.apply(labelDayOfWeek)
pandasDF.head()