## Creating Functions

Functions work a bit differently in PySpark than in plain Python because PySpark runs on a cluster. Defining a function the traditional Python way may not raise an error by itself, but Spark will not distribute the call across the nodes of the cluster, so you lose the parallelism and the work runs slower.

To convert a Python function into what is called a user-defined function (UDF) in PySpark, do the following.

In [3]:
# import findspark
# findspark.init()

import pyspark # only run after findspark.init()
from pyspark.sql import SparkSession
# May take a while locally
spark = SparkSession.builder.appName("Manipulate").getOrCreate()

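# Note: this counts the registered executors (block managers), used here as a rough proxy for cores
cores = spark._jsc.sc().getExecutorMemoryStatus().keySet().size()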
print("You are working with", cores, "core(s)")
spark

You are working with 1 core(s)


## Create a mock dataframe

In [17]:
# Small dataframe for quick testing if you need it
df = spark.createDataFrame([(3,69,57,56,678,345),(3,67,56,58,678,345),(3,67,54,57,678,345),(3,68,55,58,678,345),(3,68,53,52,678,345)
                           ,(2,11,10,907,16,458),(2,12,14,909,12,456),(2,11,13,910,10,459),(2,12,11,905,16,459),(2,10,13,902,10,459)
                           ,(1,30,11,123,568,891),(1,32,12,124,567,890),(1,34,10,123,566,895),(1,35,15,121,564,894),(1,30,12,124,560,896)], 
                           ['flower_type', 'sepal_len','sepal_width','R','G','B'])

In [5]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

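def square(x):
    # Cast to int so the result matches the declared IntegerType
    return int(x**2)

# Wrap the Python function as a UDF and declare its return type
square_udf = udf(square, IntegerType())

# Apply the UDF to a column of the mock dataframe
df.select('sepal_len', square_udf('sepal_len').alias('sepal_len_squared')).show()

+---------+-----------------+
|sepal_len|sepal_len_squared|
+---------+-----------------+
|       69|             4761|
|       67|             4489|
|       67|             4489|
|       68|             4624|
|       68|             4624|
|       11|              121|
|       12|              144|
|       11|              121|
|       12|              144|
|       10|              100|
|       30|              900|
|       32|             1024|
|       34|             1156|
|       35|             1225|
|       30|              900|
+---------+-----------------+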

