#### pyspark.sql.SparkSession.builder.appName
Sets a name for the application, which will be shown in the Spark web UI.
If no application name is set, a randomly generated name will be used.

In [1]:
from pyspark.sql import SparkSession
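# Note: appName() returns a SparkSession.Builder, not a SparkSession.
# The session itself is only created later, by getOrCreate().
sparkSession = SparkSession.builder.appName('theAppName')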

#### pyspark.sql.SparkSession.builder.config
Sets a config option. Options set using this method are automatically propagated to both SparkConf and SparkSession’s own configuration.

In [4]:
from pyspark import SparkConf

In [5]:
sparkSession.config(conf=SparkConf().setMaster('yarn').setAppName('theNewAppName').set('spark.executor.cores',2))

<pyspark.sql.session.SparkSession.Builder at 0x7fc7a02ecc10>

#### pyspark.sql.SparkSession.builder.enableHiveSupport
Enables Hive support, including connectivity to a persistent Hive metastore, support for Hive SerDes, and Hive user-defined functions.

In [6]:
sparkSession.enableHiveSupport()

<pyspark.sql.session.SparkSession.Builder at 0x7fc7a02ecc10>

#### pyspark.sql.SparkSession.builder.getOrCreate
Gets an existing SparkSession or, if there is no existing one, creates a new one based on the options set in this builder.

This method first checks whether there is a valid global default SparkSession and, if so, returns it. If no valid global default SparkSession exists, the method creates a new SparkSession and assigns it as the global default.

In [8]:
sparkSession.getOrCreate()

#### pyspark.sql.SparkSession.builder.master
Sets the Spark master URL to connect to, such as “local” to run locally, “local[4]” to run locally with 4 cores, or “spark://master:7077” to run on a Spark standalone cluster.

In [9]:
#sparkSession.master('yarn')
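
The builder methods above are usually chained into a single expression. The following is a minimal sketch, assuming a local master rather than the yarn master used in this notebook. The helper name build_local_session and the option values in BUILDER_OPTIONS are illustrative assumptions, not part of PySpark.

```python
# Illustrative option values (assumptions); none of them are required by Spark.
BUILDER_OPTIONS = {
    "spark.executor.cores": "2",
    "spark.sql.shuffle.partitions": "4",
}


def build_local_session(app_name="theAppName", cores=2, options=None):
    """Chain master, appName, config and getOrCreate (hypothetical helper)."""
    # Imported lazily so that defining the helper does not require Spark to start.
    from pyspark.sql import SparkSession

    builder = SparkSession.builder.master(f"local[{cores}]").appName(app_name)
    for key, value in (options or BUILDER_OPTIONS).items():
        # Each builder method returns the builder itself, so calls can be chained.
        builder = builder.config(key, value)
    # Returns the existing global session if there is one; otherwise creates it.
    return builder.getOrCreate()
```

Calling build_local_session() a second time in the same process should hand back the already running session, because getOrCreate reuses the global default.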

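#### pyspark.sql.SparkSession.createDataFrame
Creates a DataFrame from an RDD, a list, or a pandas.DataFrame. When no schema is given, column names and types are inferred from the data.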



In [1]:
from pyspark.sql import SparkSession
from pyspark import SparkConf

sparkSession = SparkSession.builder.config(conf=SparkConf() \
                                          .setMaster('yarn') \
                                          .setAppName('test')).getOrCreate()

Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
22/12/28 11:43:47 WARN Client: Neither spark.yarn.jars nor spark.yarn.archive is set, falling back to uploading libraries under SPARK_HOME.


In [2]:
df = sparkSession.createDataFrame([('alice', 1)])
df.show()

                                                                                

+-----+---+
|   _1| _2|
+-----+---+
|alice|  1|
+-----+---+



In [3]:
df = sparkSession.createDataFrame([('alice', 1)],['name', 'age'])
df.show()

+-----+---+
| name|age|
+-----+---+
|alice|  1|
+-----+---+



In [15]:
rdd = sparkSession.sparkContext.parallelize([('alice',18), ('Bob', 24)])
rdd.collect()

                                                                                

[('alice', 18), ('Bob', 24)]

In [5]:
df = sparkSession.createDataFrame(rdd)
df.show()

+-----+---+
|   _1| _2|
+-----+---+
|alice| 18|
|  Bob| 24|
+-----+---+



In [16]:
from pyspark.sql import Row
row = Row('name','age')
rdd = rdd.map(lambda data: row(*data))
df = sparkSession.createDataFrame(rdd)
df.show()

+-----+---+
| name|age|
+-----+---+
|alice| 18|
|  Bob| 24|
+-----+---+



In [18]:
df = sparkSession.createDataFrame([{'name':'alice','age':18},{'name':'Bob','age':23}])
df.show()

+---+-----+
|age| name|
+---+-----+
| 18|alice|
| 23|  Bob|
+---+-----+
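
The cells above let Spark infer column names and types. A schema can also be passed explicitly, for example as a DDL-formatted string. The sketch below assumes an existing session is passed in; the names PEOPLE, PEOPLE_SCHEMA and people_dataframe are hypothetical.

```python
PEOPLE = [("alice", 18), ("Bob", 24)]
# DDL-formatted schema string: column names and types are stated, not inferred.
PEOPLE_SCHEMA = "name STRING, age INT"


def people_dataframe(spark):
    """Build a DataFrame with an explicit schema (hypothetical helper)."""
    return spark.createDataFrame(PEOPLE, schema=PEOPLE_SCHEMA)
```

With the session from this notebook, people_dataframe(sparkSession).printSchema() can be used to confirm that the declared types were applied.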



#### pyspark.sql.SparkSession.getActiveSession
Returns the active SparkSession for the current thread, as returned by the builder.

In [19]:
sparkSession.getActiveSession()

#### pyspark.sql.SparkSession.newSession

Returns a new SparkSession that has its own SQLConf, registered temporary views, and UDFs, but shares the SparkContext and table cache with the original session.

In [20]:
sparkSession_new = sparkSession.newSession()
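
One way to observe this separation is to register a temporary view in one session and look for it in the other. The function below is a sketch under that assumption; temp_view_visibility and the view name people_tmp are hypothetical.

```python
def temp_view_visibility(spark, view_name="people_tmp"):
    """Return (visible_in_original, visible_in_new, shares_context)."""
    # Register a temporary view in the original session only.
    spark.range(3).createOrReplaceTempView(view_name)
    other = spark.newSession()

    def has_view(session):
        # listTables() includes temporary views registered in that session.
        return any(t.name == view_name for t in session.catalog.listTables())

    return has_view(spark), has_view(other), other.sparkContext is spark.sparkContext
```

With a live session, one would expect the view to be visible only in the original session, while both sessions report the same SparkContext.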

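#### pyspark.sql.SparkSession.range
Creates a DataFrame with a single LongType column named id, containing the values from start to end (exclusive) in increments of step.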

In [22]:
# Note: despite the variable name, range() returns a DataFrame, not an RDD.
rdd = sparkSession.range(start=0, end=1000, step=1, numPartitions=3)
rdd.take(10)

[Row(id=0),
 Row(id=1),
 Row(id=2),
 Row(id=3),
 Row(id=4),
 Row(id=5),
 Row(id=6),
 Row(id=7),
 Row(id=8),
 Row(id=9)]

In [24]:
rdd.show()

+---+
| id|
+---+
|  0|
|  1|
|  2|
|  3|
|  4|
|  5|
|  6|
|  7|
|  8|
|  9|
| 10|
| 11|
| 12|
| 13|
| 14|
| 15|
| 16|
| 17|
| 18|
| 19|
+---+
only showing top 20 rows
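
The start, end and step arguments follow the same half-open convention as Python's built-in range, so end itself is never included. A small sketch for checking this against Spark follows; range_ids and check_against_spark are hypothetical helper names.

```python
def range_ids(start, end, step=1):
    """IDs that spark.range(start, end, step) is expected to contain."""
    # Same half-open convention as Python's built-in range: end is excluded.
    return list(range(start, end, step))


def check_against_spark(spark, start, end, step=1, num_partitions=3):
    """Compare Spark's result with the pure-Python expectation."""
    df = spark.range(start, end, step, numPartitions=num_partitions)
    # Sort after collecting, since row order across partitions is not relied on here.
    ids = sorted(row.id for row in df.collect())
    return ids == range_ids(start, end, step)
```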



#### pyspark.sql.SparkSession.stop
Stops the underlying SparkContext.

In [25]:
#sparkSession.stop()

#### pyspark.sql.SparkSession.udf

UDFs (user-defined functions) will be familiar if you come from a SQL background, since most traditional RDBMSs support them. There, a function has to be registered in the database before it can be used in SQL like any built-in function.

PySpark UDFs are similar to UDFs in traditional databases. In PySpark, you write a function in ordinary Python and then either wrap it with the PySpark SQL udf() function to use it on DataFrames, or register it as a UDF to use it in SQL.

#### Why do we need a UDF?

UDFs extend the framework with custom logic that can be reused across multiple DataFrames. For example, suppose you want to capitalize the first letter of every word in a name string. If no built-in function does exactly what you need (for this simple case Spark SQL already provides initcap), you can write the logic once as a UDF and reuse it in any number of DataFrames and SQL expressions.

Before you create a UDF, check whether a similar function is already available among the Spark SQL functions. PySpark SQL provides many predefined functions, and more are added with every release, so it is best to check before reinventing the wheel.

Design UDFs carefully. Spark treats a Python UDF as a black box that it cannot optimize, so a poorly designed UDF leads to optimization and performance issues.
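
The capitalization example mentioned above can be sketched as a plain Python function that is wrapped as a UDF only when needed. The names capitalize_words and as_spark_udf are hypothetical, and splitting on single spaces is an assumption about what counts as a word.

```python
def capitalize_words(name):
    """Upper-case the first letter of each space-separated word."""
    if name is None:
        # Spark passes SQL NULL to a Python UDF as None.
        return None
    return " ".join(word[:1].upper() + word[1:] for word in name.split(" "))


def as_spark_udf():
    """Wrap the plain function as a UDF that returns strings."""
    # Imported lazily so the plain function can be unit-tested without Spark.
    from pyspark.sql.functions import udf
    from pyspark.sql.types import StringType

    return udf(capitalize_words, StringType())
```

Keeping the logic in an ordinary function makes it testable without a cluster. On a DataFrame with a name column it could be applied as df.select(as_spark_udf()('name')).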


In [32]:
from pyspark.sql.functions import udf
from pyspark.sql.types import IntegerType

In [48]:
@udf(returnType=IntegerType())
def fun(data):
    return data*2

In [51]:
df = sparkSession.range(0,110)
df.select(fun('id')).show(5)

+-------+
|fun(id)|
+-------+
|      0|
|      2|
|      4|
|      6|
|      8|
+-------+
only showing top 5 rows



In [52]:
fun2 = udf(lambda data: data*3, IntegerType())

In [53]:
df.select(fun2('id')).show(5)

+------------+
|<lambda>(id)|
+------------+
|           0|
|           3|
|           6|
|           9|
|          12|
+------------+
only showing top 5 rows



In [57]:
sparkSession.udf.register('FUN2', fun2)

<function __main__.<lambda>(data)>

In [54]:
# createOrReplaceTempView does not fail if the cell is re-run, unlike createTempView.
df.createOrReplaceTempView('data')

In [58]:
sparkSession.sql('select FUN2(id) from data').show(5)

+--------+
|FUN2(id)|
+--------+
|       0|
|       3|
|       6|
|       9|
|      12|
+--------+
only showing top 5 rows
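
The lambda behind fun2 would raise a TypeError if it ever received a NULL, which Spark passes to Python as None, and simple arithmetic like this does not need a UDF at all. The sketch below shows a null-safe variant and an equivalent built-in column expression; triple_or_none and triple_with_builtin are hypothetical names.

```python
def triple_or_none(value):
    """Multiply by three, passing NULL (None) through unchanged."""
    return None if value is None else value * 3


def triple_with_builtin(df, column="id"):
    """Same result using a column expression instead of a Python UDF."""
    # Imported lazily so the plain function above can be tested without Spark.
    from pyspark.sql.functions import col

    return df.select((col(column) * 3).alias("tripled"))
```

The column expression stays inside Spark's own execution engine, which is usually preferable when a built-in operation covers the logic.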

