Parts of this tutorial are taken from the Spark tutorial "Learning Apache Spark" and from O'Reilly's book "Spark - The Definitive Guide".


**Findspark/pyspark**
Because we are using findspark, we first locate and start Spark / PySpark. This would not be needed with a PySpark kernel, where the SparkContext is created automatically.

In [1]:
import findspark
findspark.init()
import pyspark


### **Spark Context**

In Spark, communication occurs between a driver and executors. The driver has Spark jobs that it needs to run and these jobs are split into tasks that are submitted to the executors for completion. The results from these tasks are delivered back to the driver.
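The driver/executor flow described above can be pictured with a small pure-Python analogy. This is only a sketch and not Spark code: the thread pool stands in for the executors, and the helper name `run_job` and the round-robin partitioning are assumptions made for illustration.

```python
from concurrent.futures import ThreadPoolExecutor

def run_job(data, num_partitions, task):
    """Driver side: split the data, run one task per partition, collect the results."""
    # Split the input into partitions (round-robin slicing, chosen for simplicity).
    partitions = [data[i::num_partitions] for i in range(num_partitions)]
    # The thread pool plays the role of the executors; each partition becomes one task.
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        results = list(executors.map(task, partitions))
    # The per-task results are delivered back to the driver.
    return results

partial_sums = run_job(list(range(1, 11)), num_partitions=3, task=sum)
total = sum(partial_sums)  # the driver combines the task results
print(partial_sums, total)
```

In real Spark the tasks run in executor JVMs on the cluster rather than in local threads, but the shape of the work (split, run tasks, return results to the driver) is the same.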


In part 1, we saw that normal Python code can be executed via cells. When using Databricks, this code is executed in the Spark driver's Java Virtual Machine (JVM) and not in an executor's JVM; when using an IPython notebook, it is executed within the kernel associated with the notebook. Since no Spark functionality is actually being used, no tasks are launched on the executors.

In order to use Spark and its API we will need to use a SparkContext. When running Spark, you start a new Spark application by creating a SparkContext. When the SparkContext is created, it asks the master for some cores to use to do work. The master sets these cores aside just for you; they won't be used for other applications. When using a pyspark kernel or pyspark from the command line, the SparkContext is created for you automatically as sc.

The SparkContext is the key interface for interacting with a Spark application: it lets you configure the application and launch Spark jobs.

In [2]:
# creation is not needed with pyspark kernel, but as we're running locally:
sc = pyspark.SparkContext(appName="ExcOReillySparkDefinitiveGuide")

23/10/27 08:32:10 WARN Utils: Your hostname, hung.local resolves to a loopback address: 127.0.0.1; using 10.172.74.188 instead (on interface en0)
23/10/27 08:32:10 WARN Utils: Set SPARK_LOCAL_IP if you need to bind to another address
Setting default log level to "WARN".
To adjust logging level use sc.setLogLevel(newLevel). For SparkR, use setLogLevel(newLevel).
23/10/27 08:32:10 WARN NativeCodeLoader: Unable to load native-hadoop library for your platform... using builtin-java classes where applicable


In [3]:
type(sc)

pyspark.context.SparkContext

**Spark Session object**

When running in a pyspark console or with a pyspark kernel, the `spark` object has already been created.
Because we are using findspark, we create the session ourselves.

In [4]:
from pyspark.sql import SparkSession
spark = SparkSession(sc)
print(spark)

<pyspark.sql.session.SparkSession object at 0x1198e2610>


**Notebook and Python Environment**

A notebook is comprised of a linear sequence of cells. These cells can contain either markdown or code, but we won't mix both in one cell. When a markdown cell is executed it renders formatted text, images, and links just like HTML in a normal webpage. The text you are reading right now is part of a markdown cell. Python code cells allow you to execute arbitrary Python commands just like in any Python shell. Place your cursor inside the cell below, and press "Shift" + "Enter" to execute the code and advance to the next cell. You can also press "Ctrl" + "Enter" to execute the code and remain in the cell. These commands work the same in both markdown and code cells.

In [5]:
#Let's find out the python version
import sys
print(sys.version)

3.11.4 (main, Jul 25 2023, 17:36:13) [Clang 14.0.3 (clang-1403.0.22.14.1)]


In [6]:
# This is a Python cell. You can run normal Python code here...
# Let's print the result of a simple addition
print('The sum of 1 and 1 is {0}'.format(1+1))

The sum of 1 and 1 is 2


In [7]:
# Here is another Python cell, this time with a variable (x) declaration and an if statement:
x = 42
if x > 40:
    print('The sum of 1 and 2 is {0}'.format(1+2))

The sum of 1 and 2 is 3


**Notebook state**

As you work through a notebook it is important that you run all of the code cells.  The notebook is stateful, which means that variables and their values are retained until the notebook is detached (in Databricks) or the kernel is restarted (in IPython notebooks).  If you do not run all of the code cells as you proceed through the notebook, your variables will not be properly initialized and later code might fail.  You will also need to rerun any cells that you have modified in order for the changes to be available to other cells.

In [8]:
# This cell relies on x being defined already.
# If we had not run the cells above, this code would fail.
print(x * 2)

84


**Library imports**

We can import standard Python libraries (modules) the usual way. An import statement will import the specified module. In this tutorial and future labs, we will provide any imports that are necessary.

In [9]:
# Import the datetime library
import datetime
print('This was last run on: {0}'.format(datetime.datetime.now()))

This was last run on: 2023-10-27 08:32:11.188872


**Example Cluster**


The diagram below shows an example cluster, where the cores allocated for an application are outlined in purple. (Note: *In the case of the Community Edition tier there is no Worker, and the Master, not shown in the figure, executes the entire code.*)

![executors](http://spark-mooc.github.io/web-assets/images/executors.png)




You can view the details of your Spark application in the Spark web UI.  The web UI is typically accessible through a cluster UI.  When running locally you'll find it at [localhost:4040](http://localhost:4040) (if localhost doesn't work, try [this](http://127.0.0.1:4040/)).  In the web UI, under the "Jobs" tab, you can see a list of jobs that have been scheduled or run.  It's likely there isn't anything interesting here yet because we haven't run any jobs, but we'll return to this page later.

**O'Reilly Data Set**


For starters, we will be using the data set from "Spark - The Definitive Guide" by Bill Chambers & Matei Zaharia, published by O'Reilly.

In particular, start by reading pages 22 ff.

Find the data set on [GitHub](https://github.com/databricks/Spark-The-Definitive-Guide), download it, and put it in a known folder location.

In [10]:
import os
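# Example of an absolute file URI: file:///C:/users/jan/data/Spark-The-Definitive-Guide/data/flight-data/csv/
# Here we use a path relative to the notebook folder instead
datasetpath = os.path.join('..', 'Spark-The-Definitive-Guide')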
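
Before reading the data it can help to confirm that the folder really contains the expected CSV file. The following is a minimal sketch, assuming the repository was placed next to the notebook folder as in the cell above; the helper name `flight_csv_path` is hypothetical and not part of the notebook.

```python
import os

def flight_csv_path(dataset_root, filename):
    """Build the path to one flight-data CSV file inside the downloaded repository."""
    return os.path.join(dataset_root, 'data', 'flight-data', 'csv', filename)

# Assumed location of the repository, relative to the notebook folder.
candidate = flight_csv_path(os.path.join('..', 'Spark-The-Definitive-Guide'), '2010-summary.csv')
print(candidate, '->', 'found' if os.path.isfile(candidate) else 'missing')
```

If the script prints "missing", adjust the dataset folder before running the cells that read the file.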

**Flight Data Example Book Page 50ff**
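- `spark` below is a SparkSession object, the entry point for communicating with the Spark cluster.
- `read` returns a reader object that is used to load data.
- `option("inferSchema", "true")` and `option("header", "true")` tell Spark to infer the schema (the column data types) automatically and to treat the first row as column headers.
- `csv(...)` loads the CSV file; the file path is generated dynamically by `glob.glob(...)`.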
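What the two options mean can be illustrated without Spark, using Python's `csv` module. This is a rough analogy only, not how Spark implements schema inference: the helper `infer_type` is hypothetical, and the three sample rows are copied from the output shown in this notebook.

```python
import csv
import io

# Three rows in the same layout as the flight summary file.
sample = io.StringIO(
    "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\n"
    "United States,Romania,1\n"
    "United States,Ireland,264\n"
    "United States,India,69\n"
)

# Analogue of header=true: the first line supplies the column names.
rows = list(csv.DictReader(sample))

def infer_type(values):
    """Crude stand-in for schema inference: integer if every value parses as int, else string."""
    try:
        for value in values:
            int(value)
        return "integer"
    except ValueError:
        return "string"

# Analogue of inferSchema=true: look at the values to decide each column's type.
schema = {name: infer_type([row[name] for row in rows]) for name in rows[0]}
print(schema)
```

Without inference every value would stay a string, which is why the `count` column only becomes an integer when the option is enabled.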

In [11]:
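import glob
# The variable keeps the book's name flightData2015, although this cell loads the 2010 summary file.
flightData2015 = spark.read.option("inferSchema", "true").option("header", "true")\
.csv(glob.glob(os.path.join(datasetpath, 'data', 'flight-data','csv', '2010-summary.csv')))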

In [12]:
flightData2015.printSchema()

root
 |-- DEST_COUNTRY_NAME: string (nullable = true)
 |-- ORIGIN_COUNTRY_NAME: string (nullable = true)
 |-- count: integer (nullable = true)



In [13]:
flightData2015.count()

255

In [14]:
flightData2015.take(3)

[Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Ireland', count=264),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='India', count=69)]

In [15]:
flightData2015.sort("count").explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- Sort [count#19 ASC NULLS FIRST], true, 0
   +- Exchange rangepartitioning(count#19 ASC NULLS FIRST, 200), ENSURE_REQUIREMENTS, [plan_id=63]
      +- FileScan csv [DEST_COUNTRY_NAME#17,ORIGIN_COUNTRY_NAME#18,count#19] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/shannon/Library/CloudStorage/OneDrive-國立臺灣科技大學/NTUST/Germa..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string,ORIGIN_COUNTRY_NAME:string,count:int>




In [16]:
flightData2015.sort("count").take(2)

[Row(DEST_COUNTRY_NAME='Equatorial Guinea', ORIGIN_COUNTRY_NAME='United States', count=1),
 Row(DEST_COUNTRY_NAME='United States', ORIGIN_COUNTRY_NAME='Romania', count=1)]

In [17]:
spark.conf.get("spark.sql.shuffle.partitions")

'200'
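
The '200' returned here is the same number that appears in the Exchange steps of the physical plans. For a data set this small, fewer shuffle partitions may be enough. The following is a minimal sketch, assuming an existing `spark` session like the one created earlier; the helper name `set_shuffle_partitions` and the value 5 are illustrative choices, not part of the notebook.

```python
def set_shuffle_partitions(spark, n):
    """Set spark.sql.shuffle.partitions on an existing session and return the new value."""
    # Spark configuration values are passed as strings.
    spark.conf.set("spark.sql.shuffle.partitions", str(n))
    return spark.conf.get("spark.sql.shuffle.partitions")

# Usage in the notebook (assumes `spark` exists):
# set_shuffle_partitions(spark, 5)
# flightData2015.sort("count").explain()
```

After changing the setting, the plan printed by `explain()` should show the new partition count in the Exchange step.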

**Flight Data Example SQL Page 53ff**

In [18]:
flightData2015.createOrReplaceTempView("flight_data_2015")

In [19]:
sqlWay = spark.sql("""SELECT DEST_COUNTRY_NAME, count(1) FROM flight_data_2015 GROUP BY DEST_COUNTRY_NAME""")

In [20]:
sqlWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#17], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#17, 200), ENSURE_REQUIREMENTS, [plan_id=85]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#17], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#17] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/shannon/Library/CloudStorage/OneDrive-國立臺灣科技大學/NTUST/Germa..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>




In [21]:
dataFrameWay = flightData2015.groupBy("DEST_COUNTRY_NAME").count()

In [22]:
dataFrameWay.explain()

== Physical Plan ==
AdaptiveSparkPlan isFinalPlan=false
+- HashAggregate(keys=[DEST_COUNTRY_NAME#17], functions=[count(1)])
   +- Exchange hashpartitioning(DEST_COUNTRY_NAME#17, 200), ENSURE_REQUIREMENTS, [plan_id=98]
      +- HashAggregate(keys=[DEST_COUNTRY_NAME#17], functions=[partial_count(1)])
         +- FileScan csv [DEST_COUNTRY_NAME#17] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex(1 paths)[file:/Users/shannon/Library/CloudStorage/OneDrive-國立臺灣科技大學/NTUST/Germa..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<DEST_COUNTRY_NAME:string>
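
Both plans above have the same shape: a partial aggregate per partition, an Exchange that groups the partial results by key, and a final aggregate. The following pure-Python sketch mimics these three steps. It is an analogy rather than Spark code; the rows are taken from the outputs above, and the split into two partitions is made up for illustration.

```python
from collections import Counter

# (DEST_COUNTRY_NAME, ORIGIN_COUNTRY_NAME, count) rows in two hypothetical partitions.
partitions = [
    [("United States", "Romania", 1), ("United States", "Ireland", 264)],
    [("United States", "India", 69), ("Equatorial Guinea", "United States", 1)],
]

# Step 1 (partial_count): each partition counts its own rows per destination.
partials = [Counter(dest for dest, _origin, _count in part) for part in partitions]

# Step 2 (Exchange): bring all partial results for the same destination together.
shuffled = {}
for partial in partials:
    for dest, n in partial.items():
        shuffled.setdefault(dest, []).append(n)

# Step 3 (final count): add up the partial counts per destination.
final_counts = {dest: sum(ns) for dest, ns in shuffled.items()}
print(final_counts)
```

Counting locally first means only one small record per key and partition has to be moved in the Exchange step, instead of every row.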




- You can run a SQL query directly,
- or select columns with the `select()` method.

In [23]:
# max(count) is an aggregate function that returns the maximum value of a column
spark.sql("SELECT max(count) from flight_data_2015").take(1)

[Row(max(count)=348113)]

In [24]:
# Note: this import shadows Python's built-in max for the rest of the notebook
from pyspark.sql.functions import max
flightData2015.select(max("count")).take(1)

[Row(max(count)=348113)]

**Top 5 by country in SQL and DataFrame**

In [25]:
maxSql = spark.sql("""
SELECT DEST_COUNTRY_NAME, sum(count) as destination_total
FROM flight_data_2015
GROUP BY DEST_COUNTRY_NAME
ORDER BY sum(count) DESC
LIMIT 5
""")
maxSql.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           384932|
|           Canada|             8271|
|           Mexico|             6200|
|   United Kingdom|             1629|
|          Germany|             1392|
+-----------------+-----------------+



In [26]:
# in Python
from pyspark.sql.functions import desc
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.sum("count")\
.withColumnRenamed("sum(count)", "destination_total")\
.sort(desc("destination_total"))\
.limit(5)\
.show()

+-----------------+-----------------+
|DEST_COUNTRY_NAME|destination_total|
+-----------------+-----------------+
|    United States|           384932|
|           Canada|             8271|
|           Mexico|             6200|
|   United Kingdom|             1629|
|          Germany|             1392|
+-----------------+-----------------+



In [27]:
# in Python
import pyspark.sql.functions as func
flightData2015\
.groupBy("DEST_COUNTRY_NAME")\
.agg(
    func.mean("count").alias("Mean Count"),
    func.stddev("count").alias("StdDev"),
    func.count("count").alias("Count"),
)\
.show()


+--------------------+----------+------+-----+
|   DEST_COUNTRY_NAME|Mean Count|StdDev|Count|
+--------------------+----------+------+-----+
|            Anguilla|      21.0|  NULL|    1|
|              Russia|     152.0|  NULL|    1|
|            Paraguay|      90.0|  NULL|    1|
|             Senegal|      29.0|  NULL|    1|
|              Sweden|      65.0|  NULL|    1|
|            Kiribati|      17.0|  NULL|    1|
|              Guyana|      17.0|  NULL|    1|
|         Philippines|     132.0|  NULL|    1|
|            Malaysia|       1.0|  NULL|    1|
|           Singapore|      25.0|  NULL|    1|
|                Fiji|      53.0|  NULL|    1|
|              Turkey|      75.0|  NULL|    1|
|             Germany|    1392.0|  NULL|    1|
|         Afghanistan|      11.0|  NULL|    1|
|              Jordan|      50.0|  NULL|    1|
|               Palau|      31.0|  NULL|    1|
|Turks and Caicos ...|     136.0|  NULL|    1|
|              France|     774.0|  NULL|    1|
|            

**Exercise 2**

Find a few Python / Spark examples in which data is processed in parallel. Use the Spark UI to find out how parallel execution is organized in Spark. Is there an operation comparable to the "reduce" part of MapReduce?
Hint: find out more about Spark's execution model at http://spark.apache.org/docs/latest/cluster-overview.html
