# User Defined Functions



## Prepare environment
First, we are going to prepare the environment for running PySpark on the Google Colab machine.

In [76]:
from google.colab import drive
drive.mount('/content/drive')

Drive already mounted at /content/drive; to attempt to forcibly remount, call drive.mount("/content/drive", force_remount=True).


In [77]:
!python /content/drive/MyDrive/colab/massive/install_pyspark.py

Install JAVA 8
Obtaining last version of spark


  soup = BeautifulSoup(html_doc)
Getting version spark-3.5.1
Downloading https://downloads.apache.org/spark/spark-3.5.1/spark-3.5.1-bin-hadoop3.tgz
Installing PySpark
Setting environment variables for JAVA_HOME and SPARK_HOME


## Start working with Spark
Now that we know how Spark came about and, more or less, how it works (and, you know, it's amazing), we can start working with it.
As you know, the SparkSession is the way programmers "talk" to Spark, so we need to initialize it.

In [78]:
from pyspark.sql import SparkSession

spark = (SparkSession
 .builder
 .appName("example")
 .getOrCreate())

## Create a DataFrame for the example mentioned in the slides

In [79]:
df = spark.createDataFrame([("juan fernando", 20), ("valentina laverde", 31), ("teresa sánchez", 30), ("julieta ponce", 35), ("antonio garcía", 25)], ["name", "age"])
df.show()

+-----------------+---+
|             name|age|
+-----------------+---+
|    juan fernando| 20|
|valentina laverde| 31|
|   teresa sánchez| 30|
|    julieta ponce| 35|
|   antonio garcía| 25|
+-----------------+---+



Remember, we want to capitalize the first letter of each word.
First, we need a Python function that takes a string and returns the same string with the first letter of each word in upper case.

In [80]:
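def convertCase(lower_string):
    # Capitalize the first letter of every space-separated word
    result = ""
    arr = lower_string.split(" ")
    for x in arr:
        # Slicing (x[:1]) instead of indexing (x[0]) tolerates empty words
        result = result + x[:1].upper() + x[1:] + " "
    # Drop the trailing space added by the loop
    return result[0:-1]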
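
A UDF wraps an ordinary Python function, so the logic can be checked without Spark first. The sketch below uses a hypothetical stand-alone variant, `capitalize_words` (a name introduced here, not part of the notebook), and tries two edge cases: repeated spaces, which make `split(" ")` return empty words, and an empty string. It uses slices rather than indexing, since indexing the first character of an empty word would raise `IndexError`.

```python
def capitalize_words(text):
    """Upper-case the first letter of every space-separated word.

    Hypothetical helper for illustration only. Slicing (w[:1]) keeps
    empty words, produced by repeated spaces, from raising IndexError.
    """
    return " ".join(w[:1].upper() + w[1:] for w in text.split(" "))


print(capitalize_words("juan fernando"))    # Juan Fernando
print(capitalize_words("teresa  sánchez"))  # Teresa  Sánchez (double space kept)
print(capitalize_words(""))                 # prints an empty line
```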

Now we convert the function to a UDF (the default return type of a UDF is StringType)

In [81]:
import pyspark.sql.functions as F

convertUDF = F.udf(lambda z: convertCase(z))

Now we can use convertUDF like any Spark SQL function, for example in a select() or a withColumn() call

In [82]:
df.select(convertUDF(F.col("name")).alias("name"), F.col("age") ) \
   .show(truncate=False)

+-----------------+---+
|name             |age|
+-----------------+---+
|Juan Fernando    |20 |
|Valentina Laverde|31 |
|Teresa Sánchez   |30 |
|Julieta Ponce    |35 |
|Antonio García   |25 |
+-----------------+---+



In [83]:
df.withColumn("corrected name", convertUDF(F.col("name")))\
  .show(truncate=False)

+-----------------+---+-----------------+
|name             |age|corrected name   |
+-----------------+---+-----------------+
|juan fernando    |20 |Juan Fernando    |
|valentina laverde|31 |Valentina Laverde|
|teresa sánchez   |30 |Teresa Sánchez   |
|julieta ponce    |35 |Julieta Ponce    |
|antonio garcía   |25 |Antonio García   |
+-----------------+---+-----------------+



We can also use our UDF in SQL, after registering it under a name

In [84]:
import pyspark.sql.types as T
spark.udf.register("convertUDF", convertCase, T.StringType())
df.createOrReplaceTempView("NAMES")
spark.sql("select convertUDF(name) as name, age from NAMES") \
     .show(truncate=False)

+-----------------+---+
|name             |age|
+-----------------+---+
|Juan Fernando    |20 |
|Valentina Laverde|31 |
|Teresa Sánchez   |30 |
|Julieta Ponce    |35 |
|Antonio García   |25 |
+-----------------+---+



Another way to create a UDF is to put the decorator @udf(returnType=\<type\>) above the function definition

In [85]:
@F.udf(returnType=T.StringType())
def upperCase(str):
    return str.upper()


In [86]:
df.withColumn("Upper Name", upperCase(F.col("Name"))) \
.show(truncate=False)

+-----------------+---+-----------------+
|name             |age|Upper Name       |
+-----------------+---+-----------------+
|juan fernando    |20 |JUAN FERNANDO    |
|valentina laverde|31 |VALENTINA LAVERDE|
|teresa sánchez   |30 |TERESA SÁNCHEZ   |
|julieta ponce    |35 |JULIETA PONCE    |
|antonio garcía   |25 |ANTONIO GARCÍA   |
+-----------------+---+-----------------+



## Handling null check

In [87]:
df_nulls = spark.createDataFrame([("juan fernando", 20), (None, 31), ("teresa sánchez", 30), ("julieta ponce", 35), ("antonio garcía", 25)], ["name", "age"])
df_nulls.show()

+--------------+---+
|          name|age|
+--------------+---+
| juan fernando| 20|
|          NULL| 31|
|teresa sánchez| 30|
| julieta ponce| 35|
|antonio garcía| 25|
+--------------+---+



In [88]:
df_nulls.createOrReplaceTempView("NAMES_NULLS")
spark.sql("select convertUDF(name) as Name from NAMES_NULLS " + \
         "where name is not null and convertUDF(name) like '%Juan%'") \
     .show(truncate=False)
# This could fail if Spark runs the UDF before the null check: the evaluation order of the WHERE conditions is not guaranteed

+-------------+
|Name         |
+-------------+
|Juan Fernando|
+-------------+



To avoid this, we can handle nulls inside the function when we register the UDF

In [89]:
spark.udf.register("_nullsafeUDF", lambda str: convertCase(str) if str is not None else "", T.StringType())

<function __main__.<lambda>(str)>

In [90]:
spark.sql("select _nullsafeUDF(name) as Name from NAMES_NULLS " + \
         "where _nullsafeUDF(name) like '%Juan%'") \
     .show(truncate=False)

+-------------+
|Name         |
+-------------+
|Juan Fernando|
+-------------+
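
Returning an empty string hides the fact that the input was NULL. One alternative, sketched below in plain Python, is a small wrapper that passes `None` straight through, so that a UDF built from it would keep NULL in the result. The names `null_safe` and `capitalize_words` are hypothetical helpers introduced for this illustration; they are not part of the notebook or of PySpark.

```python
from functools import wraps


def null_safe(func):
    """Wrap a one-argument function so that None input yields None."""
    @wraps(func)
    def wrapper(value):
        return None if value is None else func(value)
    return wrapper


@null_safe
def capitalize_words(text):
    # Same idea as convertCase: upper-case the first letter of each word
    return " ".join(w[:1].upper() + w[1:] for w in text.split(" "))


print(capitalize_words("juan fernando"))  # Juan Fernando
print(capitalize_words(None))             # None
```

The wrapped function could then be registered in the same way as before, for instance with `spark.udf.register("nullsafeUDF", capitalize_words, T.StringType())`, where the registered name is again only an example.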



# Exercise 1:



*   Get the data from the CSV at https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv and load it into a DataFrame.
*   Generate a new column called "cut_color_id". This column holds the first letter of the *cut* column followed by the value of the *color* column. For example, if the *cut* is "Premium" and the *color* is "I", the new column contains "PI". Do it with a UDF.
*   Keep in mind that Spark's built-in functions are preferable when they can do the job, because they are better optimized than UDFs. Do you know how to do the same without a UDF? Do it.



In [91]:
from pyspark.sql import SparkSession
import urllib.request

# Initialize Spark session
spark = SparkSession.builder.appName("diamonds").getOrCreate()

# URL of the CSV file
url = "https://raw.githubusercontent.com/tidyverse/ggplot2/main/data-raw/diamonds.csv"

# Download the file locally
local_path = "/content/drive/MyDrive/colab/massive/diamonds.csv"
urllib.request.urlretrieve(url, local_path)

# Read the CSV file into a DataFrame
diamonds_df = spark.read.option("header", "true").option("inferSchema", "true").csv(local_path)
diamonds_df.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 0.23|    Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
| 0.22|  Premium|    F|  

In [92]:
from pyspark.sql.functions import udf
from pyspark.sql.types import StringType

# Define a function that concatenates the first letter of 'cut' with 'color'
def concat_cut_color(cut, color):
    # Propagate nulls instead of failing on None inputs
    if cut is None or color is None:
        return None
    return cut[:1] + color

# Wrap the function as a UDF that returns a string
concat_cut_color_udf = udf(concat_cut_color, StringType())

# Apply the UDF to create a new column 'cut_color_id'
diamonds_df_new = diamonds_df.withColumn("cut_color_id", concat_cut_color_udf(diamonds_df["cut"], diamonds_df["color"]))

# Show the updated DataFrame
diamonds_df_new.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+------------+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|cut_color_id|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+------------+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|          IE|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|          PE|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|          GE|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|          PI|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|          GJ|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|          VJ|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|          VI|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|          VH|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|          FE|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|

In [93]:
from pyspark.sql.functions import concat, col, substring

# Assuming diamonds_df is your DataFrame
# Create a new column 'cut_color_id' by concatenating the first letter of 'cut' with 'color'
diamonds_df.show()
diamonds_df = diamonds_df.withColumn("cut_color_id", concat(substring(col("cut"), 1, 1), col("color")))

# Show the updated DataFrame to verify the new column
diamonds_df.show()

+-----+---------+-----+-------+-----+-----+-----+----+----+----+
|carat|      cut|color|clarity|depth|table|price|   x|   y|   z|
+-----+---------+-----+-------+-----+-----+-----+----+----+----+
| 0.23|    Ideal|    E|    SI2| 61.5| 55.0|  326|3.95|3.98|2.43|
| 0.21|  Premium|    E|    SI1| 59.8| 61.0|  326|3.89|3.84|2.31|
| 0.23|     Good|    E|    VS1| 56.9| 65.0|  327|4.05|4.07|2.31|
| 0.29|  Premium|    I|    VS2| 62.4| 58.0|  334| 4.2|4.23|2.63|
| 0.31|     Good|    J|    SI2| 63.3| 58.0|  335|4.34|4.35|2.75|
| 0.24|Very Good|    J|   VVS2| 62.8| 57.0|  336|3.94|3.96|2.48|
| 0.24|Very Good|    I|   VVS1| 62.3| 57.0|  336|3.95|3.98|2.47|
| 0.26|Very Good|    H|    SI1| 61.9| 55.0|  337|4.07|4.11|2.53|
| 0.22|     Fair|    E|    VS2| 65.1| 61.0|  337|3.87|3.78|2.49|
| 0.23|Very Good|    H|    VS1| 59.4| 61.0|  338| 4.0|4.05|2.39|
|  0.3|     Good|    J|    SI1| 64.0| 55.0|  339|4.25|4.28|2.73|
| 0.23|    Ideal|    J|    VS1| 62.8| 56.0|  340|3.93| 3.9|2.46|
| 0.22|  Premium|    F|  

# Caching and Persistence of Data


# DataFrame.cache()




In [94]:
df_to_cache = spark.range(1*10000000).toDF("id").withColumn("sqaure", F.col("id")*F.col("id"))
df_to_cache.show()

+---+------+
| id|sqaure|
+---+------+
|  0|     0|
|  1|     1|
|  2|     4|
|  3|     9|
|  4|    16|
|  5|    25|
|  6|    36|
|  7|    49|
|  8|    64|
|  9|    81|
| 10|   100|
| 11|   121|
| 12|   144|
| 13|   169|
| 14|   196|
| 15|   225|
| 16|   256|
| 17|   289|
| 18|   324|
| 19|   361|
+---+------+
only showing top 20 rows



In [95]:
# Mark the DataFrame for caching; the data is materialized by the first action
df_to_cache.cache()

DataFrame[id: bigint, sqaure: bigint]

In [96]:
import time

startTimeQuery = time.process_time()
df_to_cache.count()
endTimeQuery = time.process_time()
endTimeQuery - startTimeQuery

0.15850915600000093

In [97]:
startTimeQuery = time.process_time()
df_to_cache.count()
endTimeQuery = time.process_time()
endTimeQuery - startTimeQuery

0.004738387000003286
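
A note on the measurement itself: `time.process_time()` counts only the CPU time of the Python driver process, whereas the Spark job runs in the JVM while Python mostly waits, so the figures above may understate the real elapsed time. For wall-clock timing, `time.perf_counter()` is usually a better fit. The sketch below shows the difference with a hypothetical `timed` helper, using `time.sleep` as a stand-in for a Spark action so that it runs without Spark.

```python
import time


def timed(action):
    """Run action() and return (result, wall_seconds, cpu_seconds)."""
    wall_start = time.perf_counter()   # wall-clock time, includes waiting
    cpu_start = time.process_time()    # CPU time of this Python process only
    result = action()
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu


# time.sleep stands in for a call such as df_to_cache.count(), where
# Python mostly waits for work done elsewhere.
_, wall, cpu = timed(lambda: time.sleep(0.2))
print(f"wall-clock: {wall:.3f}s, CPU: {cpu:.3f}s")
```

With a DataFrame in scope, the same helper could be called as `timed(df_to_cache.count)`.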

In [98]:
df_to_persist = spark.range(10001000).toDF("id2").withColumn("sqaure", F.col("id2")*F.col("id2"))

In [99]:
from pyspark.storagelevel import StorageLevel
# Mark the DataFrame to be persisted on disk only
df_to_persist.persist(StorageLevel.DISK_ONLY)

DataFrame[id2: bigint, sqaure: bigint]

In [100]:
startTimeQuery = time.process_time()
df_to_persist.count()
endTimeQuery = time.process_time()
endTimeQuery - startTimeQuery

0.1580495159999984

In [101]:
startTimeQuery = time.process_time()
df_to_persist.count()
endTimeQuery = time.process_time()
endTimeQuery - startTimeQuery

0.0056970289999966894

Because this data is now stored on disk, we remove it once we have finished using it.

In [102]:
df_to_persist.unpersist()

DataFrame[id2: bigint, sqaure: bigint]

In [103]:
df_to_cache.unpersist()

DataFrame[id: bigint, sqaure: bigint]