In [2]:
# Create a basic SparkSession

from pyspark.sql import SparkSession

spark = SparkSession.builder.appName("Python Spark SQL basic example")\
    .config("spark.some.config.option", "some-value")\
    .getOrCreate()

In [4]:
#Creating DataFrames
#spark is an existing SparkSession
df = spark.read.json("people 2.json")
#Displays the content of the DataFrame to stdout
df.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [5]:
#Untyped Dataset Operations (aka DataFrame Operations)


**In Python, a DataFrame's columns can be accessed either by attribute (df.age) or by indexing (df['age']). While the former is convenient for interactive data exploration, the latter form is highly recommended: it keeps working when a column name is also an attribute of the DataFrame class.**

In [6]:
# spark, df are from the previous example
# Print the schema in a tree format
df.printSchema()

root
 |-- age: long (nullable = true)
 |-- name: string (nullable = true)



In [7]:
# Select only the "name" column
df.select("name").show()

+-------+
|   name|
+-------+
|Michael|
|   Andy|
| Justin|
+-------+



In [8]:
# Select everybody, but increment the age by 1
df.select(df['name'], df['age'] + 1).show()


+-------+---------+
|   name|(age + 1)|
+-------+---------+
|Michael|     null|
|   Andy|       31|
| Justin|       20|
+-------+---------+



In [9]:
# Select people older than 21
df.filter(df['age'] > 21).show()

+---+----+
|age|name|
+---+----+
| 30|Andy|
+---+----+



In [10]:
#Count people by age
df.groupBy("age").count().show()

+----+-----+
| age|count|
+----+-----+
|  19|    1|
|null|    1|
|  30|    1|
+----+-----+



In [11]:
#Running SQL Queries Programmatically


**The sql function on a SparkSession enables applications to run SQL queries programmatically and returns the result as a DataFrame.**

In [12]:
#Register the DataFrame as a SQL temporary view
df.createOrReplaceTempView("people")

In [13]:
sqlDF = spark.sql("SELECT * FROM people")
sqlDF.show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [14]:
#Global Temporary View

**Temporary views in Spark SQL are session-scoped and disappear when the session that created them terminates. To have a temporary view that is shared among all sessions and kept alive until the Spark application terminates, create a global temporary view. A global temporary view is tied to the system-preserved database global_temp, and it must be referred to by its qualified name, e.g. SELECT * FROM global_temp.view1**

In [15]:
#Register the DataFrame as global temporary view
df.createGlobalTempView("people")

In [17]:
#Global temporary view is tied to a system preserved database 'global_temp'
spark.sql("SELECT * FROM global_temp.people").show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [18]:
#Global temporary view is cross-session
spark.newSession().sql("SELECT * FROM global_temp.people").show()

+----+-------+
| age|   name|
+----+-------+
|null|Michael|
|  30|   Andy|
|  19| Justin|
+----+-------+



In [19]:
# Creating Datasets
# WARNING: Datasets are available in Scala and Java only

*Datasets are similar to RDDs. However, instead of using Java serialization or Kryo,
they use a specialized Encoder to serialize objects for processing or
transmitting over the network. While both encoders and standard serialization
are responsible for turning an object into bytes, encoders are code generated dynamically and
use a format that allows Spark to perform many operations, such as filtering,
sorting, and hashing, without deserializing the bytes back into an object.*
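Since Datasets are not available in Python, the idea of filtering serialized data without rebuilding objects can only be approximated here. The sketch below is an analogy built on Python's struct module with a made-up fixed-width layout; it is not Spark's encoder or its actual binary format.

```python
import struct

# Hypothetical layout: 16-byte UTF-8 name followed by a signed 64-bit age.
# This is an analogy only, NOT the format Spark encoders produce.
RECORD = struct.Struct("<16sq")
AGE_OFFSET = 16

def encode(name, age):
    """Serialize one person into a fixed-width byte record."""
    return RECORD.pack(name.encode("utf-8"), age)

def age_of(record_bytes):
    """Read only the age field at its known offset; the name is never decoded."""
    return struct.unpack_from("<q", record_bytes, AGE_OFFSET)[0]

rows = [encode("Andy", 30), encode("Justin", 19)]

# Filter directly on the bytes, without turning records back into objects
older_than_21 = [r for r in rows if age_of(r) > 21]

# Decode only the records that survived the filter
names = [RECORD.unpack(r)[0].rstrip(b"\x00").decode("utf-8") for r in older_than_21]
print(names)
```

Because the position of every field is known in advance, a predicate on one field can skip the rest of the record, which is the property the paragraph above attributes to encoders.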

In [None]:
#case class Person(name: String, age: Long)

#Encoders are created for case classes
#val caseClassDS = Seq(Person("Andy", 32)).toDS()
#caseClassDS.show()