# Learning Spark - Chapter 4 (Scala)
## Spark SQL and DataFrames: Introduction to Built-in Data Sources

In [1]:
import org.apache.spark.sql.SparkSession

Intitializing Scala interpreter ...

Spark Web UI available at http://EM2021002778.bosonit.local:4042
SparkContext available as 'sc' (version = 3.1.1, master = local[*], app id = local-1620586875147)
SparkSession available as 'spark'


import org.apache.spark.sql.SparkSession


In [2]:
val spark = SparkSession
  .builder
  .appName("SparkSQLExampleApp")
  .master("local")
  .config("spark.sql.warehouse.dir", "C:/Users/jorgedario.mendez/OneDrive%20-%20Bosonit/Documentos/3.%20Libro/LearningSparkV2-master/JM%20Jupyter/spark-warehouse")
  .enableHiveSupport()
  .getOrCreate()

spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@7fd1ffd4


In [3]:
import spark.implicits._
import spark.sql

import spark.implicits._
import spark.sql


## Basic Query Examples

In [4]:
// Path to data set
val csvFile = "./departuredelays.csv"

csvFile: String = ./departuredelays.csv


In [5]:
// Read the CSV file into a DataFrame (the temporary view is created in a later cell)
// Infer the schema (for larger files you may want to specify the schema explicitly)
val df = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .load(csvFile)

df: org.apache.spark.sql.DataFrame = [date: int, delay: int ... 3 more fields]
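
The comment in the cell above notes that larger files may warrant an explicit schema. The following is a minimal sketch of that, assuming a DDL-style schema string whose column names match the CSV header. The sample file, the `sampleCsv` and `typedDF` names, and the `SchemaSketch` app name are hypothetical and exist only to keep the snippet self-contained. Declaring `date` as STRING rests on the assumption that the raw values are zero-padded, which integer inference would not preserve.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in the notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder.appName("SchemaSketch").master("local[1]").getOrCreate()

// Hypothetical two-row sample file. The zero-padded dates are an assumption
// about how the raw departuredelays.csv stores them.
val sampleCsv = Files.createTempFile("departuredelays-sample", ".csv")
Files.write(sampleCsv,
  """date,delay,distance,origin,destination
    |01011245,6,602,ABE,ATL
    |01020600,-8,369,ABE,DTW
    |""".stripMargin.getBytes("UTF-8"))

// DDL-style schema string: Spark uses it as given instead of scanning the data to infer types.
val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

val typedDF = spark.read.format("csv")
  .schema(schema)
  .option("header", "true")
  .load(sampleCsv.toUri.toString)

typedDF.printSchema()
typedDF.show()
```

With the schema supplied up front, the `inferSchema` option is no longer needed, and the `date` column keeps whatever leading zeros the file contains.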


In [6]:
df.show()

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1011245|    6|     602|   ABE|        ATL|
|1020600|   -8|     369|   ABE|        DTW|
|1021245|   -2|     602|   ABE|        ATL|
|1020605|   -4|     602|   ABE|        ATL|
|1031245|   -4|     602|   ABE|        ATL|
|1030605|    0|     602|   ABE|        ATL|
|1041243|   10|     602|   ABE|        ATL|
|1040605|   28|     602|   ABE|        ATL|
|1051245|   88|     602|   ABE|        ATL|
|1050605|    9|     602|   ABE|        ATL|
|1061215|   -6|     602|   ABE|        ATL|
|1061725|   69|     602|   ABE|        ATL|
|1061230|    0|     369|   ABE|        DTW|
|1060625|   -3|     602|   ABE|        ATL|
|1070600|    0|     369|   ABE|        DTW|
|1071725|    0|     602|   ABE|        ATL|
|1071230|    0|     369|   ABE|        DTW|
|1070625|    0|     602|   ABE|        ATL|
|1071219|    0|     569|   ABE|        ORD|
|1080600|    0|     369|   ABE| 

In [7]:
df.explain(true)

== Parsed Logical Plan ==
Relation[date#16,delay#17,distance#18,origin#19,destination#20] csv

== Analyzed Logical Plan ==
date: int, delay: int, distance: int, origin: string, destination: string
Relation[date#16,delay#17,distance#18,origin#19,destination#20] csv

== Optimized Logical Plan ==
Relation[date#16,delay#17,distance#18,origin#19,destination#20] csv

== Physical Plan ==
FileScan csv [date#16,delay#17,distance#18,origin#19,destination#20] Batched: false, DataFilters: [], Format: CSV, Location: InMemoryFileIndex[file:/C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/Learnin..., PartitionFilters: [], PushedFilters: [], ReadSchema: struct<date:int,delay:int,distance:int,origin:string,destination:string>



In [8]:
// Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")

##### Flights whose distance is greater than 1,000 miles

In [9]:
spark.sql("""SELECT * 
FROM us_delay_flights_tbl
WHERE distance > 1000""")
.show()

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1012355|    0|    1586|   ABQ|        JFK|
|1022355|  158|    1586|   ABQ|        JFK|
|1032355|    0|    1586|   ABQ|        JFK|
|1042355|    0|    1586|   ABQ|        JFK|
|1052355|    0|    1586|   ABQ|        JFK|
|1062355|    0|    1586|   ABQ|        JFK|
|1072359|   14|    1586|   ABQ|        JFK|
|1082358|   -4|    1586|   ABQ|        JFK|
|1092358|   20|    1586|   ABQ|        JFK|
|1102358|   -2|    1586|   ABQ|        JFK|
|1112358|  116|    1586|   ABQ|        JFK|
|1122358|    0|    1586|   ABQ|        JFK|
|1132358|  -10|    1586|   ABQ|        JFK|
|1142359|  -17|    1586|   ABQ|        JFK|
|1152358|  135|    1586|   ABQ|        JFK|
|1162358|  -13|    1586|   ABQ|        JFK|
|1172358|  -10|    1586|   ABQ|        JFK|
|1182358|  -17|    1586|   ABQ|        JFK|
|1192358|  -18|    1586|   ABQ|        JFK|
|1202358|   -9|    1586|   ABQ| 
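
The same filter can also be written with the DataFrame API instead of a SQL string. This is a minimal sketch, assuming a small hand-built DataFrame in place of the full CSV; `sampleDF`, `longHaulDF`, and the `DataFrameApiSketch` app name are hypothetical, and the four rows are copied from the outputs shown in this notebook.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.col

// Hypothetical local session; in the notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder.appName("DataFrameApiSketch").master("local[1]").getOrCreate()
import spark.implicits._

// A few rows taken from the outputs above, standing in for the full data set.
val sampleDF = Seq(
  (1011245, 6, 602, "ABE", "ATL"),
  (1020600, -8, 369, "ABE", "DTW"),
  (1012355, 0, 1586, "ABQ", "JFK"),
  (1022355, 158, 1586, "ABQ", "JFK")
).toDF("date", "delay", "distance", "origin", "destination")

// Equivalent of: SELECT * FROM us_delay_flights_tbl WHERE distance > 1000
val longHaulDF = sampleDF.where(col("distance") > 1000)
longHaulDF.show()
```

Both forms go through the same query planner, so the choice between them is largely a matter of style.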

##### All flights between San Francisco (SFO) and Chicago (ORD) with a delay of more than two hours

In [10]:
spark.sql("""SELECT *
FROM us_delay_flights_tbl
WHERE delay > 120
AND ((origin = 'SFO' AND destination = 'ORD')
OR (origin = 'ORD' AND destination = 'SFO'))""").show(50)

+-------+-----+--------+------+-----------+
|   date|delay|distance|origin|destination|
+-------+-----+--------+------+-----------+
|1021430|  132|    1604|   ORD|        SFO|
|1021605|  140|    1604|   ORD|        SFO|
|1021025|  148|    1604|   ORD|        SFO|
|1022015|  122|    1604|   ORD|        SFO|
|1052015|  138|    1604|   ORD|        SFO|
|1061430|  183|    1604|   ORD|        SFO|
|1061900|  155|    1604|   ORD|        SFO|
|1061025|  144|    1604|   ORD|        SFO|
|1301430|  228|    1604|   ORD|        SFO|
|1301605|  187|    1604|   ORD|        SFO|
|1011744|  154|    1604|   ORD|        SFO|
|1011541|  123|    1604|   ORD|        SFO|
|1011310|  130|    1604|   ORD|        SFO|
|1011903|  158|    1604|   ORD|        SFO|
|1020948|  342|    1604|   ORD|        SFO|
|1031744|  167|    1604|   ORD|        SFO|
|1031541|  278|    1604|   ORD|        SFO|
|1041744|  247|    1604|   ORD|        SFO|
|1041310|  159|    1604|   ORD|        SFO|
|1041901|  179|    1604|   ORD| 

##### Label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), etc. We’ll add these human-readable labels in a new column called Flight_Delays

In [11]:
spark.sql("""SELECT delay, origin, destination,
CASE 
WHEN delay > 360 THEN 'Very Long Delay' 
WHEN delay > 120 THEN 'Long Delay' 
WHEN delay > 60 THEN 'Short Delay' 
WHEN delay > 0 THEN 'Tolerable Delay' 
WHEN delay = 0 THEN 'No Delay' 
ELSE 'Early' 
END AS Flight_Delays
FROM us_delay_flights_tbl 
ORDER BY origin, delay DESC""").show()

+-----+------+-----------+-------------+
|delay|origin|destination|Flight_Delays|
+-----+------+-----------+-------------+
|  333|   ABE|        ATL|   Long Delay|
|  305|   ABE|        ATL|   Long Delay|
|  275|   ABE|        ATL|   Long Delay|
|  257|   ABE|        ATL|   Long Delay|
|  247|   ABE|        ATL|   Long Delay|
|  247|   ABE|        DTW|   Long Delay|
|  219|   ABE|        ORD|   Long Delay|
|  211|   ABE|        ATL|   Long Delay|
|  197|   ABE|        DTW|   Long Delay|
|  192|   ABE|        ORD|   Long Delay|
|  180|   ABE|        ATL|   Long Delay|
|  173|   ABE|        DTW|   Long Delay|
|  165|   ABE|        ATL|   Long Delay|
|  159|   ABE|        ATL|   Long Delay|
|  159|   ABE|        ORD|   Long Delay|
|  158|   ABE|        ATL|   Long Delay|
|  151|   ABE|        DTW|   Long Delay|
|  127|   ABE|        ATL|   Long Delay|
|  121|   ABE|        DTW|   Long Delay|
|  118|   ABE|        DTW|  Short Delay|
+-----+------+-----------+-------------+
only showing top
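
The CASE expression is evaluated top to bottom and the first matching WHEN wins, which is why each branch only needs a lower bound. As a quick sanity check of the thresholds, the same branching can be mirrored in plain Scala. `delayLabel` is a hypothetical helper written for this sketch; it is not part of Spark or of the notebook, and unlike the SQL version it does not deal with NULL delays.

```scala
// Hypothetical helper mirroring the CASE expression above; the first matching branch wins.
def delayLabel(delay: Int): String =
  if (delay > 360) "Very Long Delay"
  else if (delay > 120) "Long Delay"
  else if (delay > 60) "Short Delay"
  else if (delay > 0) "Tolerable Delay"
  else if (delay == 0) "No Delay"
  else "Early"

// Sample delays taken from rows shown in this notebook.
Seq(333, 121, 118, 6, 0, -8).foreach { d =>
  println(f"$d%4d -> ${delayLabel(d)}")
}
```

The values 121 and 118 sit on either side of the two-hour boundary, matching the switch from Long Delay to Short Delay in the last two rows of the output.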

### Creating SQL Databases and Tables

In [12]:
spark.sql("CREATE DATABASE IF NOT EXISTS learn_spark_db")
spark.sql("USE learn_spark_db")

res6: org.apache.spark.sql.DataFrame = []


###### Creating a managed table

In [13]:
// spark.sql("CREATE TABLE managed_us_delay_flights_tbl (date STRING, delay INT, distance INT, origin STRING, destination STRING)")

spark.sql("""CREATE TABLE IF NOT EXISTS managed_us_delay_flights_tb_3
(date STRING, delay INT, distance INT, origin STRING, destination STRING) USING csv""")

res7: org.apache.spark.sql.DataFrame = []


###### Creating an unmanaged table

In [14]:
spark.sql("""CREATE TABLE us_delay_flights_tbl (date STRING, delay INT, distance INT, origin STRING, destination STRING)
USING csv OPTIONS (PATH './departuredelays.csv')""")

res8: org.apache.spark.sql.DataFrame = []
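
The only difference between the two CREATE TABLE statements above is whether a path is supplied, and that is what decides whether Spark manages the data files or only the metadata. A minimal sketch for checking this through the catalog follows. The `sketch_db` database, the two table names, and the temporary directories are all hypothetical and are used so that the snippet does not touch the notebook's own tables.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical session with a throwaway warehouse directory. If a session already
// exists (as in this notebook), getOrCreate() returns it and its warehouse is used instead.
val warehouse = Files.createTempDirectory("warehouse-sketch")
val spark = SparkSession.builder
  .appName("TableTypeSketch")
  .master("local[1]")
  .config("spark.sql.warehouse.dir", warehouse.toUri.toString)
  .getOrCreate()

// Hypothetical location for the unmanaged table's data.
val dataDir = Files.createTempDirectory("external-data")

spark.sql("DROP DATABASE IF EXISTS sketch_db CASCADE")
spark.sql("CREATE DATABASE sketch_db")

// No path: Spark owns both the metadata and the data files.
spark.sql("CREATE TABLE sketch_db.managed_sketch_tbl (delay INT, origin STRING) USING csv")

// Explicit path: Spark owns only the metadata.
spark.sql(s"""CREATE TABLE sketch_db.unmanaged_sketch_tbl (delay INT, origin STRING)
USING csv OPTIONS (PATH '${dataDir.toUri}')""")

// The catalog reports each table's type by name.
val tableTypes = spark.catalog.listTables("sketch_db")
  .collect()
  .map(t => t.name -> t.tableType)
  .toMap
tableTypes.foreach(println)

// Clean up the hypothetical database.
spark.sql("DROP DATABASE IF EXISTS sketch_db CASCADE")
```

Dropping a managed table removes its data files as well, whereas dropping an unmanaged table leaves the files at the given path in place.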


### Creating Views

In [15]:
spark.sql("""CREATE OR REPLACE GLOBAL TEMP VIEW us_origin_airport_SFO_global_tmp_view
AS SELECT date, delay, origin, destination
FROM us_delay_flights_tbl WHERE origin = 'SFO'""")

res9: org.apache.spark.sql.DataFrame = []


In [16]:
spark.sql("""CREATE OR REPLACE TEMP VIEW us_origin_airport_JFK_tmp_view AS
SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE
origin = 'JFK'""")

res10: org.apache.spark.sql.DataFrame = []


Once you’ve created these views, you can issue queries against them just as you would against a table. Keep in mind that when accessing a global temporary view you must use the prefix global_temp., because Spark creates global temporary views in a global temporary database called global_temp. For example:

spark.sql("SELECT * FROM global_temp.us_origin_airport_SFO_global_tmp_view")

By contrast, you can access the normal temporary view without the global_temp prefix:

spark.sql("SELECT * FROM us_origin_airport_JFK_tmp_view")

// Or, with the DataFrameReader API
spark.read.table("us_origin_airport_JFK_tmp_view")

You can also drop a view just like you would a table. In SQL, the global temporary view must again be qualified with global_temp; without the prefix, DROP VIEW IF EXISTS finds no such view and silently does nothing:

spark.sql("DROP VIEW IF EXISTS global_temp.us_origin_airport_SFO_global_tmp_view")
spark.sql("DROP VIEW IF EXISTS us_origin_airport_JFK_tmp_view")

// Or, with the Catalog API
spark.catalog.dropGlobalTempView("us_origin_airport_SFO_global_tmp_view")
spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")
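
The practical difference between the two kinds of view is their scope: a temporary view belongs to one SparkSession, while a global temporary view is shared by all sessions of the same application. A minimal sketch of that difference follows, assuming a two-row sample DataFrame. The `sketch_global_view` and `sketch_session_view` names, the sample rows, and the `ViewScopeSketch` app name are hypothetical.

```scala
import scala.util.Try
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in the notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder.appName("ViewScopeSketch").master("local[1]").getOrCreate()
import spark.implicits._

// Hypothetical sample rows standing in for the flights table.
val flights = Seq((158, "SFO", "ORD"), (14, "JFK", "ABQ")).toDF("delay", "origin", "destination")

flights.createOrReplaceGlobalTempView("sketch_global_view")
flights.createOrReplaceTempView("sketch_session_view")

// A second session shares the SparkContext but keeps its own temporary views.
val otherSession = spark.newSession()

// Looking up a view that the session cannot see throws, so Try turns that into a Boolean.
val globalVisible = Try(otherSession.table("global_temp.sketch_global_view").count()).isSuccess
val sessionVisible = Try(otherSession.table("sketch_session_view").count()).isSuccess

println(s"global temp view visible from another session: $globalVisible")
println(s"temp view visible from another session: $sessionVisible")

// Clean up the hypothetical views.
spark.catalog.dropGlobalTempView("sketch_global_view")
spark.catalog.dropTempView("sketch_session_view")
```

In a notebook with a single session the two kinds of view behave almost identically, apart from the `global_temp` prefix.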


#### Data Sources for DataFrames and SQL Tables

In [18]:
val file = "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"

// Use Parquet
val df = spark.read.format("parquet").load(file)

// Use Parquet; you can omit format("parquet") if you wish as it's the default
val df2 = spark.read.load(file)

// Use CSV
val df3 = spark.read.format("csv")
  .option("inferSchema", "true")
  .option("header", "true")
  .option("mode", "PERMISSIVE")
  .load("C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*")

// Use JSON
val df4 = spark.read.format("json").load("C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/json/*")

file: String = C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
df2: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
df3: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
df4: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


#### Writing DataFrames to Parquet files

In [19]:
df.write.format("parquet")
  .mode("overwrite")
  .option("compression", "snappy")
  .save("./df_parquet_flights")

In [21]:
df.write
  .mode("overwrite")
  .saveAsTable("us_delay_flights_tbl_8")

#### Writing DataFrames to JSON files

In [22]:
df.write.format("json")
  .mode("overwrite")
  //.option("compression", "snappy")
  .save("./df_json_sc2")

#### Reading CSV files into a temporary view

In [23]:
spark.sql("""CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
USING csv
OPTIONS (
path "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/csv/*",
header "true",
inferSchema "true",
mode "FAILFAST"
)""")

res16: org.apache.spark.sql.DataFrame = []


#### Writing DataFrames to CSV files

In [24]:
df.write.format("csv").mode("overwrite").save("./df_csv")

#### Avro

In [25]:
// Avro support lives in the external spark-avro module, which must be on the classpath
val df = spark.read.format("avro")
  .load("C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*")

df.show(false)

+--------------------------------+-------------------+-----+
|DEST_COUNTRY_NAME               |ORIGIN_COUNTRY_NAME|count|
+--------------------------------+-------------------+-----+
|United States                   |Romania            |1    |
|United States                   |Ireland            |264  |
|United States                   |India              |69   |
|Egypt                           |United States      |24   |
|Equatorial Guinea               |United States      |1    |
|United States                   |Singapore          |25   |
|United States                   |Grenada            |54   |
|Costa Rica                      |United States      |477  |
|Senegal                         |United States      |29   |
|United States                   |Marshall Islands   |44   |
|Guyana                          |United States      |17   |
|United States                   |Sint Maarten       |53   |
|Malta                           |United States      |1    |
|Bolivia                

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [26]:
spark.sql("""CREATE OR REPLACE TEMPORARY VIEW episode_tbl
USING avro
OPTIONS (
path "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
)""")

res19: org.apache.spark.sql.DataFrame = []


In [27]:
df.write
  .format("avro")
  .mode("overwrite")
  .save("./df_avro")

#### ORC

In [28]:
val file = "C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
val df = spark.read.format("orc").load(file)
df.show(10, false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



file: String = C:/Users/jorgedario.mendez/OneDrive - Bosonit/Documentos/3. Libro/LearningSparkV2-master/databricks-datasets/learning-spark-v2/flights/summary-data/orc/*
df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [29]:
df.write.format("orc")
  .mode("overwrite")
  .option("compression", "snappy")
  .save("./df_orc")

#### Images

In [30]:
import org.apache.spark.ml.source.image

import org.apache.spark.ml.source.image


In [31]:
val imageDir = "./cctvVideos/train_images/"
val imagesDF = spark.read.format("image").load(imageDir)
imagesDF.printSchema

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)



imageDir: String = ./cctvVideos/train_images/
imagesDF: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>, label: int]


In [32]:
imagesDF.select("image.height", "image.width", "image.nChannels", "image.mode","label").show(5, false)

+------+-----+---------+----+-----+
|height|width|nChannels|mode|label|
+------+-----+---------+----+-----+
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |1    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
+------+-----+---------+----+-----+
only showing top 5 rows



#### Binary Files

In [33]:
val path = "./cctvVideos/train_images/"
val binaryFilesDF = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(path)
binaryFilesDF.show(5)

+--------------------+-------------------+------+--------------------+-----+
|                path|   modificationTime|length|             content|label|
+--------------------+-------------------+------+--------------------+-----+
|file:/C:/Users/jo...|2021-04-15 02:34:17| 55037|[FF D8 FF E0 00 1...|    0|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54634|[FF D8 FF E0 00 1...|    1|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54624|[FF D8 FF E0 00 1...|    0|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54505|[FF D8 FF E0 00 1...|    0|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54475|[FF D8 FF E0 00 1...|    0|
+--------------------+-------------------+------+--------------------+-----+
only showing top 5 rows



path: String = ./cctvVideos/train_images/
binaryFilesDF: org.apache.spark.sql.DataFrame = [path: string, modificationTime: timestamp ... 3 more fields]


In [34]:
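// With recursiveFileLookup enabled, Spark does not infer partitions from the
// directory names, so the label column is absent from the result
val binaryFilesDF = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .option("recursiveFileLookup", "true")
  .load(path)
binaryFilesDF.show(5)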

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/C:/Users/jo...|2021-04-15 02:34:17| 55037|[FF D8 FF E0 00 1...|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54634|[FF D8 FF E0 00 1...|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54624|[FF D8 FF E0 00 1...|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54505|[FF D8 FF E0 00 1...|
|file:/C:/Users/jo...|2021-04-15 02:34:17| 54475|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows



binaryFilesDF: org.apache.spark.sql.DataFrame = [path: string, modificationTime: timestamp ... 2 more fields]
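
The two outputs above differ only in the `label` column, which comes from partition discovery on the directory names. A minimal sketch reproducing this on a throwaway directory follows, assuming a layout of `label=0/` and `label=1/` subdirectories. The file names, the placeholder bytes, and the `PartitionDiscoverySketch` app name are hypothetical, and the actual layout of `./cctvVideos/train_images/` is assumed rather than shown in this notebook.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Hypothetical local session; in the notebook, getOrCreate() returns the existing one.
val spark = SparkSession.builder.appName("PartitionDiscoverySketch").master("local[1]").getOrCreate()

// Hypothetical layout: <root>/label=0/a.jpg and <root>/label=1/b.jpg
val root = Files.createTempDirectory("images-sketch")
for ((label, name) <- Seq((0, "a.jpg"), (1, "b.jpg"))) {
  val dir = Files.createDirectories(root.resolve(s"label=$label"))
  Files.write(dir.resolve(name), Array[Byte](1, 2, 3)) // placeholder bytes, not a real JPEG
}

// Default behavior: the label=... directory names become a partition column.
val withDiscovery = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(root.toUri.toString)

// recursiveFileLookup walks the subdirectories but turns partition discovery off.
val withoutDiscovery = spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .option("recursiveFileLookup", "true")
  .load(root.toUri.toString)

println(withDiscovery.columns.mkString(", "))
println(withoutDiscovery.columns.mkString(", "))
```

Both reads find the same files; only the presence of the partition column changes.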
