<div style="font-size:18pt; padding-top:20px; text-align:center"><b>User-Defined Functions (UDF) in <span style="font-weight:bold; color:green">PySpark</span></b></div><hr>
<div style="text-align:right;">Sergei Yu. Papulin <span style="font-style: italic;font-weight: bold;">(papulin_bmstu@mail.ru)</span></div>

<a name="0"></a>
<div><span style="font-size:14pt; font-weight:bold">Content</span>
    <ol>
        <li><a href="#1">Internal Python UDF in PySpark</a></li>
        <li><a href="#2">External Java UDF in PySpark</a></li>
        <li><a href="#3">Standard Dataframe operations</a></li>
        <li><a href="#4">Pandas UDF</a></li>
        <li><a href="#5">References</a></li>
    </ol>
</div>

<p>Run the cell below to apply the Jupyter notebook style</p>

In [1]:
%%html
<link href="css/style.css" rel="stylesheet" type="text/css">

<p>[OPTIONAL] PySpark setup</p>

In [None]:
import os
import sys

os.environ["SPARK_HOME"]="/opt/cloudera/parcels/SPARK2/lib/spark2"
os.environ["PYSPARK_PYTHON"]="/opt/rh/rh-python36/root/usr/bin/python"
os.environ["PYSPARK_DRIVER_PYTHON"]="/opt/rh/rh-python36/root/usr/bin/python"

spark_home = os.environ.get("SPARK_HOME")
sys.path.insert(0, os.path.join(spark_home, "python"))
sys.path.insert(0, os.path.join(spark_home, "python/lib/py4j-0.10.7-src.zip"))
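
<p>The py4j file name above is tied to one Spark release. As a hedged alternative, the archive can be located with a glob pattern. This is a minimal sketch: <i>find_py4j_zip</i> is a hypothetical helper, not part of PySpark, and it assumes the usual <i>python/lib</i> layout under SPARK_HOME.</p>

```python
import glob
import os
import sys


def find_py4j_zip(spark_home):
    """Return the path of the py4j source zip under spark_home, or None if absent."""
    pattern = os.path.join(spark_home, "python", "lib", "py4j-*-src.zip")
    matches = sorted(glob.glob(pattern))
    return matches[-1] if matches else None


# Extend sys.path only when SPARK_HOME is set and the archive is found
spark_home = os.environ.get("SPARK_HOME")
if spark_home:
    py4j_zip = find_py4j_zip(spark_home)
    if py4j_zip:
        sys.path.insert(0, os.path.join(spark_home, "python"))
        sys.path.insert(0, py4j_zip)
```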

<p>Import PySpark modules</p>

In [None]:
import pyspark
from pyspark.sql import SparkSession

In [None]:
import pyspark.sql.functions as F
from pyspark.sql.types import StructType, StructField, StringType, IntegerType

<a name="1"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">1. Internal Python UDF in PySpark</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p><b>Launch Spark Session</b></p>

<p>Configuration</p>

In [None]:
# "yarn-client" is deprecated since Spark 2.0; "yarn" runs in client deploy mode by default
conf = pyspark.SparkConf() \
        .setAppName("pythonUDFPySparkApp") \
        .setMaster("yarn")

<p><b>Option 1.</b> New Spark versions (Spark &#8805; 2.x) - <b>default</b></p>

In [None]:
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

<p><b>Option 2.</b> Old Spark versions (Spark &#60; 2.x)</p>

<p><i>Launch Spark Context</i></p>

In [None]:
sc = pyspark.SparkContext(conf=conf)

<p><i>Get SQL Context</i></p>

In [None]:
from pyspark.sql import SQLContext

In [None]:
sqlContext = SQLContext(sc)

<p><b>Create Dataframe from HDFS file</b></p>

<p>Set the HDFS path to the data source (a relative path resolves against the user's HDFS home directory)</p>

In [None]:
file_path = "data/spark_dataframe/persons.csv"

<p>For a local file, use the file:// scheme instead</p>

In [None]:
file_path = "file:///data/spark_dataframe/persons.csv"

<p>Create a schema for a dataframe</p>

In [None]:
schema = StructType([StructField(name="Id", dataType=IntegerType(), nullable=False),
                     StructField("Name", StringType(), True),
                     StructField("City", StringType(), True),
                     StructField("Year", IntegerType(), True),
                     StructField("Grade", IntegerType(), True),
                     StructField("Gender", StringType(), True)])

<p>Read the file and display the first 5 rows</p>

In [None]:
person_df = spark.read.load(path=file_path, 
                          format="csv",
                          schema=schema,
                          header="false", 
                          inferSchema="false", sep=",", nullValue="null", mode="DROPMALFORMED")
person_df.show(5)
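
<p>To illustrate what the <i>nullValue</i> and <i>mode="DROPMALFORMED"</i> options roughly do, here is a plain-Python model that does not use Spark. It is only an approximation of the reader's behavior: <i>parse_rows</i> is a hypothetical helper and the sample rows are invented, not taken from persons.csv.</p>

```python
import csv
import io

# Column names and Python types mirroring the schema above
SCHEMA = [("Id", int), ("Name", str), ("City", str),
          ("Year", int), ("Grade", int), ("Gender", str)]


def parse_rows(text, null_value="null"):
    """Parse CSV text and skip rows that do not fit SCHEMA."""
    rows = []
    for fields in csv.reader(io.StringIO(text)):
        if len(fields) != len(SCHEMA):
            continue  # wrong number of columns: treated as malformed
        try:
            rows.append(tuple(
                None if raw == null_value else cast(raw)
                for raw, (_, cast) in zip(fields, SCHEMA)
            ))
        except ValueError:
            continue  # a value could not be cast to the column type
    return rows


# Invented sample data: two valid rows, one bad Id, one short row
sample = (
    "1,Ann,Oslo,1990,7,F\n"
    "2,Bob,Rome,1985,null,M\n"
    "x,Eve,Kyiv,1992,3,F\n"
    "4,Dan,Riga\n"
)
for row in parse_rows(sample):
    print(row)
```

<p>Only the first two rows survive. The "null" grade in the second row becomes a missing value, which the next step replaces with 0.</p>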

<p>Replace null values of the grade column with 0 and display the result</p>

In [None]:
person_df = person_df.na.fill({"Grade" : 0})
person_df.show(5)

<p>Create a Python UDF that converts a numerical grade to a letter</p>

In [None]:
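def convert2letter_grade(x):
    # Grades below 5, as well as missing or non-numeric values, map to "F"
    try:
        return "F" if int(x) < 5 else "A"
    except (TypeError, ValueError):
        return "F"

# Wrap the function as a Spark UDF that returns a string column
convert2letter_udf = F.udf(convert2letter_grade, StringType())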
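
<p>Because the UDF body is an ordinary Python function, its logic can be checked without a Spark session before it is applied to a dataframe. The sketch below repeats the function so that it runs on its own. The sample values are invented for illustration.</p>

```python
def convert2letter_grade(x):
    """Same logic as the UDF body: grades below 5 or unparsable values map to "F"."""
    try:
        return "F" if int(x) < 5 else "A"
    except (TypeError, ValueError):
        return "F"


# Integers, a numeric string, a missing value and a non-numeric string
samples = [0, 4, 5, 10, "7", None, "n/a"]
results = [convert2letter_grade(v) for v in samples]
for value, letter in zip(samples, results):
    print(repr(value), "->", letter)
```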

<p>Apply the Python UDF to create a new dataframe with the "LetterGrade" column and display the result</p>

In [None]:
person_with_letter_df = person_df.select("*", convert2letter_udf(person_df["Grade"]).alias("LetterGrade"))
person_with_letter_df.explain()
person_with_letter_df.show(5)

<p><b>Stop Spark session</b></p>

In [None]:
spark.stop()

<a name="2"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">2. External Java UDF in PySpark</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p><b>Create Java UDF</b></p>

<p>Add the following Maven dependencies (the Scala suffix must match the cluster's Spark build)</p>

In [None]:
org.apache.spark:spark-core_2.10
org.apache.spark:spark-sql_2.10

<p>Create a Java class to convert a numerical grade to a letter</p>

In [None]:
package edu.spark.customsparkudf;

import org.apache.spark.sql.api.java.UDF1;

public class CategorizeValue implements UDF1<Integer, String> {

    @Override
    public String call(Integer value) throws Exception {
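        // Treat a missing grade like a failing one to avoid a NullPointerException on unboxing
        if (value == null || value < 5) return "F";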
        return "A";
    }

}

<p>Build a JAR file that contains only the class above</p>

<p><b>Launch Spark Session</b></p>

<p>Configuration</p>

In [None]:
# "yarn-client" is deprecated since Spark 2.0; "yarn" runs in client deploy mode by default
conf = pyspark.SparkConf() \
        .setAppName("javaUDFPySparkApp") \
        .setMaster("yarn") \
        .set("spark.jars", "/home/cloudera/workspace/CLASS_UDF/lib/*")

<p>Run Spark session</p>

In [None]:
spark = SparkSession \
    .builder \
    .config(conf=conf) \
    .getOrCreate()

<p><b>Create Dataframe from HDFS file</b></p>

In [None]:
# Path to data source in HDFS
file_path = "data/spark_dataframe/persons.csv"

# Schema for a dataframe 
schema = StructType([StructField(name="Id", dataType=IntegerType(), nullable=False),
                     StructField("Name", StringType(), True),
                     StructField("City", StringType(), True),
                     StructField("Year", IntegerType(), True),
                     StructField("Grade", IntegerType(), True),
                     StructField("Gender", StringType(), True)])

# Create a dataframe from the file
person_df = spark.read.load(path=file_path, 
                          format="csv",
                          schema=schema,
                          header="false", 
                          inferSchema="false", sep=",", nullValue="null", mode="DROPMALFORMED")

# Replace null values of the grade column with 0
person_df = person_df.na.fill({"Grade" : 0})

# Display the result
person_df.show()

<p><b>Apply Java UDF inside PySpark</b></p>

<p>Register the Java UDF in Spark</p>

In [None]:
spark.udf.registerJavaFunction("categorize", "edu.spark.customsparkudf.CategorizeValue", StringType())

<p>Apply the Java UDF to create a new dataframe with the "LetterGrade" column and display the result</p>

<p>Option 1. <i>Dataframe API</i></p>

In [None]:
person_with_letter_java_df = person_df.selectExpr("*", "categorize(Grade) as LetterGrade")
person_with_letter_java_df.show(5)

<p>Option 2. <i>SQL API</i></p>

<p>Create the "person" view</p>

In [None]:
person_df.createOrReplaceTempView("person")

<p>Execute an SQL query and display the result</p>

In [None]:
person_with_letter_java_df = spark.sql("SELECT *, categorize(Grade) as LetterGrade FROM person")
person_with_letter_java_df.explain()
person_with_letter_java_df.show(5)

<p><b>Stop Spark session</b> (skip this cell for now: Section 3 reuses the session and person_df)</p>

In [None]:
spark.stop()

<a name="3"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">3. Standard Dataframe operations</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

<p>Create a SQL-like column expression</p>

In [None]:
def convert2letter_grade_sql(col):
    return F.when(col < 5, "F").otherwise("A")

<p>Apply the function</p>

In [None]:
person_with_letter_df = person_df.select("*", convert2letter_grade_sql(F.col("Grade")).alias("LetterGrade"))
person_with_letter_df.explain()
person_with_letter_df.show()
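
<p>The column expression and the Python UDF from Section 1 differ in one case: a null grade. In SQL semantics a comparison with null yields null, and <i>when</i> matches only a true condition, so the <i>otherwise</i> branch applies and the result is "A". The Python UDF returns "F" for a missing value. In this notebook the nulls were already replaced with 0, so both agree. The sketch below is a plain-Python model of the two behaviors, not Spark code, and all function names in it are hypothetical.</p>

```python
def sql_less_than(value, threshold):
    """SQL-style comparison: a null (None) operand gives a null result."""
    return None if value is None else value < threshold


def when_otherwise(grade):
    """Model of F.when(col < 5, "F").otherwise("A"): only a true condition matches."""
    return "F" if sql_less_than(grade, 5) is True else "A"


def python_udf(grade):
    """Logic of the Python UDF: a failed int() conversion maps to "F"."""
    try:
        return "F" if int(grade) < 5 else "A"
    except (TypeError, ValueError):
        return "F"


# The two versions agree on numbers and differ on a missing grade
for grade in [3, 8, None]:
    print(grade, when_otherwise(grade), python_udf(grade))
```

<p>If null grades should map to "F" in the column expression as well, one option is F.when(col.isNull() | (col &#60; 5), "F").otherwise("A").</p>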

<p><b>Stop Spark session</b></p>

In [None]:
spark.stop()

<a name="4"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">4. Pandas UDF</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>

In [None]:
# TODO
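
<p>As a possible starting point for this section, the grade conversion can be written as a vectorized function over a pandas Series and wrapped with <i>pandas_udf</i>. This is a hedged sketch, not the author's implementation. It assumes Spark 2.3 or later with pandas and pyarrow installed on the driver and the executors, and the names <i>to_letter_grade</i> and <i>make_letter_grade_udf</i> are hypothetical.</p>

```python
import pandas as pd


def to_letter_grade(grades: pd.Series) -> pd.Series:
    """Map a whole batch of grades to letters at once."""
    # A missing value compares as False, so it is mapped to "F" explicitly
    is_fail = grades.isna() | (grades < 5)
    return is_fail.map({True: "F", False: "A"})


def make_letter_grade_udf():
    """Wrap to_letter_grade as a scalar Pandas UDF that returns strings."""
    # Imported here so that the function above can be tried without Spark
    from pyspark.sql.functions import pandas_udf
    from pyspark.sql.types import StringType
    return pandas_udf(to_letter_grade, returnType=StringType())


# Local check of the vectorized logic on invented values
print(to_letter_grade(pd.Series([0, 4, 5, 9, None])).tolist())
```

<p>With an active session and the person_df dataframe from the earlier sections, the UDF could then be applied as person_df.withColumn("LetterGrade", make_letter_grade_udf()(F.col("Grade"))).</p>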

<a name="5"></a>
<div style="display:table; width:100%; padding-top:10px; padding-bottom:10px; border-bottom:1px solid lightgrey">
    <div style="display:table-row">
        <div style="display:table-cell; width:80%; font-size:14pt; font-weight:bold">5. References</div>
    	<div style="display:table-cell; width:20%; text-align:center; background-color:whitesmoke; border:1px solid lightgrey"><a href="#0">To Content</a></div>
    </div>
</div>