In [1]:
sc.uiWebUrl

'http://192.168.1.2:4040'

In [2]:
spark

In [7]:
flightData2015 = spark\
  .read\
  .option("inferSchema", "true")\
  .option("header", "true")\
  .csv("../bookrepo/Spark-The-Definitive-Guide/data/flight-data/csv/2015-summary.csv")

In [8]:
flightData2015.show()

+--------------------+-------------------+-----+
|   DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+--------------------+-------------------+-----+
|       United States|            Romania|   15|
|       United States|            Croatia|    1|
|       United States|            Ireland|  344|
|               Egypt|      United States|   15|
|       United States|              India|   62|
|       United States|          Singapore|    1|
|       United States|            Grenada|   62|
|          Costa Rica|      United States|  588|
|             Senegal|      United States|   40|
|             Moldova|      United States|    1|
|       United States|       Sint Maarten|  325|
|       United States|   Marshall Islands|   39|
|              Guyana|      United States|   64|
|               Malta|      United States|    1|
|            Anguilla|      United States|   41|
|             Bolivia|      United States|   30|
|       United States|           Paraguay|    6|
|             Algeri

In [9]:
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=15),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Croatia', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=344)]

Calling explain() on a DataFrame shows the plan Spark builds for executing the query across the cluster.

In [10]:
flightData2015.sort("count").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#62 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#62 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [id=#83]
      +- FileScan csv [DEST_COUNTRY_NAME#60,ORIGIN_COUNTRY_NAME#61,count#62] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/sumitagrawal/Learning/Spark-The-Definative-Guide/bookrepo/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




* You can read explain plans from top to bottom: the top is the end result, and the bottom is the source(s) of data.


In [23]:
# Use 5 shuffle partitions instead of the default 200 seen in the Exchange step of the plan above
spark.conf.set("spark.sql.shuffle.partitions", "5")

In [24]:
flightData2015.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Singapore', count=1),
 Row(DEST_COUNTRY_NAME='Moldova', ORIGIN_COUNTRY_NAME='United States', count=1)]

You can monitor job progress in the Spark UI on port 4040, which shows the physical and logical execution characteristics of your jobs.

There is no performance difference between writing SQL queries and writing DataFrame code; both “compile” to the same underlying plan that we specify in DataFrame code.

In [25]:
flightData2015.createOrReplaceTempView("flight_data_2015")

In [27]:
# Notice that these plans compile to the exact same underlying plan!

# in Python
sqlWay = spark.sql("""
SELECT DEST_COUNTRY_NAME, count(1)
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
""")

dataFrameWay = flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .count()

sqlWay.explain()
dataFrameWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#60, 5), ENSURE_REQUIREMENTS, [id=#167]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#60] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/sumitagrawal/Learning/Spark-The-Definative-Guide/bookrepo/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>


== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#60, 5), ENSURE_REQUIREMENTS, [id=#180]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#60] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFile
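
Both plans share the same three steps above the file scan: a partial count per input partition, an Exchange that hash-partitions by DEST_COUNTRY_NAME into 5 partitions, and a final count. The following is a rough pure-Python model of that shape, not Spark's implementation. The input partitions are hypothetical, and Python's built-in `hash()` stands in for Spark's own hash function.

```python
from collections import Counter

NUM_SHUFFLE_PARTITIONS = 5  # mirrors spark.sql.shuffle.partitions set earlier

# Hypothetical input partitions: each inner list stands for the
# DEST_COUNTRY_NAME values that one task reads from the file scan.
input_partitions = [
    ["United States", "United States", "Egypt"],
    ["United States", "Costa Rica", "Egypt"],
]

# Step 1 (partial_count): each task counts keys within its own partition.
partial_counts = [Counter(partition) for partition in input_partitions]

# Step 2 (Exchange hashpartitioning): send each key's partial count to the
# shuffle partition chosen by hashing the key, so that all partials for a
# given key end up in the same place.
shuffled = [[] for _ in range(NUM_SHUFFLE_PARTITIONS)]
for partial in partial_counts:
    for key, n in partial.items():
        shuffled[hash(key) % NUM_SHUFFLE_PARTITIONS].append((key, n))

# Step 3 (final count): within each shuffle partition, add up the partials.
final_counts = {}
for bucket in shuffled:
    for key, n in bucket:
        final_counts[key] = final_counts.get(key, 0) + n

print(final_counts)
```

Counting locally first means only one (key, partial count) pair per key per input partition crosses the shuffle, rather than every row.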

In [28]:
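# Import the functions module under an alias so Spark's max() does not shadow Python's built-in max()
from pyspark.sql import functions as F

flightData2015.select(F.max("count")).take(1)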

[Row(max(count)=370002)]

In [29]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")

maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           411352|
|           Canada|             8399|
|           Mexico|             7140|
|   United Kingdom|             2025|
|            Japan|             1548|
+-----------------+-----------------+



In [31]:
from pyspark.sql.functions import desc

flightData2015\
  .groupBy("DEST_COUNTRY_NAME")\
  .sum("count")\
  .withColumnRenamed("sum(count)", "destination_total")\
  .sort(desc("destination_total"))\
  .limit(5)\
  .explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- TakeOrderedAndProject(limit=5, orderBy=[destination_total#181L DESC NULLS LAST], output=[DEST_COUNTRY_NAME#60,destination_total#181L])
   +- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[sum(count#62)])
      +- Exchange hashpartitioning(DEST_COUNTRY_NAME#60, 5), ENSURE_REQUIREMENTS, [id=#288]
         +- HashAggregate(keys=[DEST_COUNTRY_NAME#60], functions=[partial_sum(count#62)])
            +- FileScan csv [DEST_COUNTRY_NAME#60,count#62] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/sumitagrawal/Learning/Spark-The-Definative-Guide/bookrepo/..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,count:int>


