In [18]:
from pyspark.sql import SparkSession

In [19]:
spark = SparkSession.builder.master("local[1]") \
                    .appName('test') \
                    .getOrCreate()

# DataFrames

In [3]:
# create dataframe
myRange = spark.range(1000).toDF("number")

# Transformations

In [4]:
# new dataframe with only even numbers
divisBy2 = myRange.where("number % 2 = 0")

# Actions
To trigger the computation, we run an action. An action instructs Spark to compute a result from a series of transformations.
The simplest action is count

In [5]:
divisBy2.count()

                                                                                

500

Of course, count is not the only action. There are three kinds of actions:
- Actions to view data in the console
- Actions to collect data to native objects in the respective language
- Actions to write to output data sources


# An End-to-End Example

In [20]:
flightData2015 = spark.read.option('inferSchema', 'true').option('header', 'true').csv('../data/flight-data/csv/2015-summary.csv')

In [21]:
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

Nothing happens to the data when we call sort because it’s just a transformation. However, we can see that Spark is building up a plan for how it will execute this across the cluster by looking at the explain plan. We can call explain on any DataFrame object to see the DataFrame’s lineage (or how Spark will execute this query):

In [22]:
flightData2015.sort('count').explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#187 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#187 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=347]
      +- FileScan csv [DEST_COUNTRY_NAME#185,ORIGIN_COUNTRY_NAME#186,count#187] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/workspaces/Spark-The-Definitive-Guide/data/flight-data/csv/2015-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




Now we can specify an action to kick off this plan. However, before doing that, we’re going to set a configuration. By default, when we perform a shuffle, Spark outputs 200 shuffle partitions. Let’s set this value to 5 to reduce the number of the output partitions from the shuffle:

In [23]:
spark.conf.set('spark.sql.shuffle.artitions', '5')

In [24]:
flightData2015.sort('count').take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

You can make any DataFrame into a table or view with one simple method call:

In [25]:
flightData2015.createOrReplaceTempView('flight_data_2015')

Now we can query our data in SQL. To do so, we’ll use the spark.sql function (remember, spark is our SparkSession variable) that conveniently returns a new DataFrame. Although this
might seem a bit circular in logic—that a SQL query against a DataFrame returns another DataFrame—it’s actually quite powerful. This makes it possible for you to specify transformations in the manner most convenient to you at any given point in time and not sacrifice any efficiency to do so! To understand that this is happening, let’s take a look at two explain plans:

In [26]:
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")
                   
dataFrameWay = flightData2015.groupBy('DEST_COUNTRY_NAME').count()

sqlWay.explain()
dataFrameWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#185, 200), ENSURE_REQUIREMENTS, [plan_id=369]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#185] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/workspaces/Spark-The-Definitive-Guide/data/flight-data/csv/2015-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#185, 200), ENSURE_REQUIREMENTS, [plan_id=382]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#185] Batched: false, DataFilters: [], Format: CSV, Lo

Let’s pull out some interesting statistics from our data. One thing to understand is that DataFrames (and SQL) in Spark already have a huge number of manipulations available. There are hundreds of functions that you can use and import to help you resolve your big data problems faster. We will use the max function, to establish the maximum number of flights to and from any given location. This just scans each value in the relevant column in the DataFrame and checks whether it’s greater than the previous values that have been seen. This is a transformation, because we are effectively filtering down to one row. Let’s see what that looks like:

In [27]:
spark.sql('SELECT max(count) FROM flight_data_2015').take(1)

[Row(max(count)=370002)]

In [28]:
from pyspark.sql.functions import max

flightData2015.select(max('count')).take(1)

[Row(max(count)=370002)]

Great, that’s a simple example that gives a result of 370,002. Let’s perform something a bit more complicated and find the top five destination countries in the data. This is our first multi- transformation query, so we’ll take it step by step. Let’s begin with a fairly straightforward SQL aggregation:

In [29]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM  flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [30]:
from pyspark.sql.functions import desc

flightData2015.groupBy('DEST_COUNTRY_NAME')\
    .sum('count')\
    .withColumnRenamed('sum(count)', 'destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [31]:
flightData2015.groupBy('DEST_COUNTRY_NAME')\
    .sum('count')\
    .withColumnRenamed('sum(count)', 'destination_total')\
    .sort(desc('destination_total'))\
    .limit(5)\
    .explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- TakeOrderedAndProject(limit=5, orderBy=[destination_total#281L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#185,destination_total#281L])
   +- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[sum(count#187)])
      +- Exchange hashpartitioning(DEST_COUNTRY_NAME#185, 200), ENSURE_REQUIREMENTS, [plan_id=552]
         +- HashAggregate(keys=[DEST_COUNTRY_NAME#185], functions=[partial_sum(count#187)])
            +- FileScan csv [DEST_COUNTRY_NAME#185,count#187] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/workspaces/Spark-The-Definitive-Guide/data/flight-data/csv/2015-..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


