## Basic Query Examples

In [None]:
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val sparkConfig = new SparkConf()

sparkConfig.set("spark.broadcast.compress", "false")
sparkConfig.set("spark.shuffle.compress", "false")
sparkConfig.set("spark.shuffle.spill.compress", "false")
sparkConfig.set("spark.io.compression.codec", "lzf")
sparkConfig.set("spark.sql.catalogImplementation", "hive")
sparkConfig.set("hive.exec.dynamic.partition.mode","nonstrict")
sparkConfig.set("spark.default.parallelism","1")
sparkConfig.set("spark.sql.shuffle.partitions", "1")
sparkConfig.set("spark.sql.hive.llap", "true")
sparkConfig.set("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")
sparkConfig.set("spark.hadoop.metastore.catalog.default", "hive")

val spark: SparkSession = SparkSession.builder
  .master("local")
  .appName("Unit Test")
  .config(sparkConfig)
  .enableHiveSupport()
  .getOrCreate()
// Alternative configuration, kept for reference: a Hive warehouse on HDFS
// val conf = new SparkConf()
//   .set("spark.sql.warehouse.dir", "hdfs://namenode/sql/metadata/Hive")
//   .set("spark.sql.catalogImplementation", "hive")
//   .setMaster("local[*]")
//   .setAppName("Hive Example")
// val spark = SparkSession.builder
//   .config(conf)
//   .enableHiveSupport()
//   .getOrCreate()
// Path to data set
val csvFile="C:/Users/sara.arribas/Downloads/Ejemplos_Spark/departuredelays.csv"
// Read and create a temporary view
// Infer schema (note that for larger files you may want to specify the schema)
val df = spark.read.format("csv")
 .option("inferSchema", "true")
 .option("header", "true")
 .load(csvFile)
// Create a temporary view
df.createOrReplaceTempView("us_delay_flights_tbl")

If you want to specify a schema instead of inferring it, you can use a DDL-formatted string:

In [None]:
val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
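
As a minimal sketch of how such a string is used, the cell below passes it to `DataFrameReader.schema`, so Spark skips inference and applies the declared types. The two sample rows, the temporary file, and the names `samplePath` and `typedDF` are made up for illustration; only the schema string comes from the cell above.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one is already running
val spark = SparkSession.builder
  .master("local[1]")
  .appName("DdlSchemaSketch")
  .getOrCreate()

// Hypothetical sample rows in the same column layout as departuredelays.csv
val samplePath = Files.createTempDirectory("ddl_schema_sketch").resolve("sample.csv")
val sampleCsv =
  """date,delay,distance,origin,destination
    |01011245,6,602,ABE,ATL
    |01020600,-8,369,ABE,DTW
    |""".stripMargin
Files.write(samplePath, sampleCsv.getBytes("UTF-8"))

val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"

// Apply the DDL string instead of setting inferSchema
val typedDF = spark.read.format("csv")
  .schema(schema)
  .option("header", "true")
  .load(samplePath.toUri.toString)

typedDF.printSchema()
typedDF.show()
```

Because `date` is declared as STRING, values with leading zeros are kept as written, whereas inference would read the column as an integer.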

First, we’ll find all flights whose distance is greater than 1,000 miles:

In [None]:
spark.sql("""SELECT distance, origin, destination FROM us_delay_flights_tbl WHERE distance > 1000
ORDER BY distance DESC""").show(10)

As the results show, all of the longest flights were between Honolulu (HNL) and New York (JFK). Next, we’ll find all flights between San Francisco (SFO) and Chicago (ORD) that were delayed by more than two hours:

In [None]:
spark.sql("""SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE delay > 120 AND origin = 'SFO' AND destination = 'ORD'
ORDER BY delay DESC""").show(10)

**Exercise**. Convert the date column into a readable format and find the days or months when these delays were most common. Were the delays related to winter months or holidays?

In the following example, we want to label all US flights, regardless of origin and destination, with an indication of the delays they experienced: Very Long Delays (> 6 hours), Long Delays (2–6 hours), and so on. We’ll add these human-readable labels in a new column called Flight_Delays:

In [None]:
spark.sql("""SELECT delay, origin, destination,
 CASE
 WHEN delay > 360 THEN 'Very Long Delays'
 WHEN delay >= 120 AND delay <= 360 THEN 'Long Delays'
 WHEN delay >= 60 AND delay < 120 THEN 'Short Delays'
 WHEN delay > 0 AND delay < 60 THEN 'Tolerable Delays'
 WHEN delay = 0 THEN 'No Delays'
 ELSE 'Early'
 END AS Flight_Delays
 FROM us_delay_flights_tbl
 ORDER BY origin, delay DESC""").show(10)

**Exercise:** try converting the other two SQL queries to use the DataFrame API.
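
As a starting point for this exercise, here is a hedged sketch of how the first query (flights longer than 1,000 miles) might look in the DataFrame API. The three sample rows and the names `sampleFlights` and `longFlights` are invented so that the cell runs without the CSV file; against the real data you would start from the DataFrame loaded earlier.

```scala
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.functions.{col, desc}

// Reuses the notebook's session if one is already running
val spark = SparkSession.builder
  .master("local[1]")
  .appName("DataFrameApiSketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows with the same columns as us_delay_flights_tbl
val sampleFlights = Seq(
  ("01010630", -10, 4330, "HNL", "JFK"),
  ("01011245", 6, 602, "ABE", "ATL"),
  ("01021500", 25, 1604, "SFO", "ORD")
).toDF("date", "delay", "distance", "origin", "destination")

// Equivalent of:
//   SELECT distance, origin, destination FROM us_delay_flights_tbl
//   WHERE distance > 1000 ORDER BY distance DESC
val longFlights = sampleFlights
  .select("distance", "origin", "destination")
  .where(col("distance") > 1000)
  .orderBy(desc("distance"))

longFlights.show(10)
```

The other two queries follow the same shape: `where` for the filter, `orderBy` for the sort, and `when(...).otherwise(...)` from `org.apache.spark.sql.functions` for the CASE expression.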

## SQL Tables and Views

### Managed Versus Unmanaged Tables

For a managed table, Spark manages both the metadata and the data in the file store. This could be a local filesystem, HDFS, or an object store such as Amazon S3 or Azure Blob. For an unmanaged table, Spark only manages the metadata, while you manage the data yourself in an external data source such as Cassandra.
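
The practical difference shows up when a table is dropped. The sketch below, which assumes a local session and throwaway temporary directories, creates one table of each kind, reads their type back from the catalog, and then drops both. The table names, the sample row, and the paths are hypothetical. If a session with a different warehouse is already running, `getOrCreate()` returns it and the `spark.sql.warehouse.dir` setting here has no effect.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Throwaway locations for the warehouse and for externally managed data
val warehouseDir = Files.createTempDirectory("sketch_warehouse").toUri.toString
val externalPath = Files.createTempDirectory("sketch_external").resolve("data")

val spark = SparkSession.builder
  .master("local[1]")
  .appName("ManagedVsUnmanagedSketch")
  .config("spark.sql.warehouse.dir", warehouseDir)
  .getOrCreate()
import spark.implicits._

// Made-up row with the same columns as the flights data
val sample = Seq(("01011245", 6, 602, "ABE", "ATL"))
  .toDF("date", "delay", "distance", "origin", "destination")

// No path: Spark owns both metadata and files (managed)
sample.write.mode("overwrite").saveAsTable("sketch_managed_tbl")

// Explicit path: Spark records only the metadata (unmanaged)
sample.write.mode("overwrite")
  .option("path", externalPath.toUri.toString)
  .saveAsTable("sketch_unmanaged_tbl")

val managedType = spark.catalog.getTable("sketch_managed_tbl").tableType
val unmanagedType = spark.catalog.getTable("sketch_unmanaged_tbl").tableType
println(s"managed: $managedType, unmanaged: $unmanagedType")

// Dropping removes the metadata of both tables,
// but only the managed table's files are deleted
spark.sql("DROP TABLE sketch_managed_tbl")
spark.sql("DROP TABLE sketch_unmanaged_tbl")
val unmanagedDataSurvives = Files.exists(externalPath)
println(s"unmanaged files still present: $unmanagedDataSurvives")
```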

### Creating SQL Databases and Tables

In [12]:
spark.sql("CREATE DATABASE learn_spark_db")
spark.sql("USE learn_spark_db")

res8: org.apache.spark.sql.DataFrame = []


#### Creating a managed table

In [6]:
import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveSessionStateBuilder
import org.apache.spark.{SparkConf}

val sparkConfig = new SparkConf()

sparkConfig.set("spark.broadcast.compress", "false")
sparkConfig.set("spark.shuffle.compress", "false")
sparkConfig.set("spark.shuffle.spill.compress", "false")
sparkConfig.set("spark.io.compression.codec", "lzf")
sparkConfig.set("spark.sql.catalogImplementation", "hive")
sparkConfig.set("hive.exec.dynamic.partition.mode","nonstrict")
sparkConfig.set("spark.default.parallelism","1")
sparkConfig.set("spark.sql.shuffle.partitions", "1")
sparkConfig.set("spark.sql.hive.llap", "true")
sparkConfig.set("spark.datasource.hive.warehouse.load.staging.dir", "/tmp")
sparkConfig.set("spark.hadoop.metastore.catalog.default", "hive")

val spark:SparkSession = SparkSession.builder
  .master("local")
  .appName("Unit Test")
  .config(sparkConfig)
  .enableHiveSupport()
  .getOrCreate()


import org.apache.spark.sql.SparkSession
import org.apache.spark.sql.hive.HiveSessionStateBuilder
import org.apache.spark.SparkConf
sparkConfig: org.apache.spark.SparkConf = org.apache.spark.SparkConf@1ce27fe1
spark: org.apache.spark.sql.SparkSession = org.apache.spark.sql.SparkSession@70afa2ff


In [4]:
// Run this only when you are finished: the cells below need a live session
// spark.stop()

In [13]:
spark.sql("CREATE TABLE IF NOT EXISTS managed_us_delay_flights_tbl (date STRING, delay INT, distance INT, origin STRING, destination STRING)")

res9: org.apache.spark.sql.DataFrame = []


In [14]:
spark.sql("show databases").show()

+--------------+
|     namespace|
+--------------+
|       default|
|       flights|
|learn_spark_db|
+--------------+



In [None]:
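val csv_file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/departuredelays.csv"
// Schema as defined in the preceding example
val schema = "date STRING, delay INT, distance INT, origin STRING, destination STRING"
// In Scala the schema is set on the reader; csv() accepts only paths
val flights_df = spark.read.schema(schema).option("header", "true").csv(csv_file)
// DataFrame API alternative to the CREATE TABLE statement above;
// overwrite so the cell also works when the table already exists
flights_df.write.mode("overwrite").saveAsTable("managed_us_delay_flights_tbl")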


#### Creating an unmanaged table

In [None]:
spark.sql("""CREATE TABLE us_delay_flights_tbl(date STRING, delay INT,
 distance INT, origin STRING, destination STRING)
 USING csv OPTIONS (PATH
 'C:/Users/sara.arribas/Downloads/Ejemplos_Spark/departuredelays.csv')""")

### Creating Views

This works the same way as in Python.

In [None]:
val df_sfo = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'SFO'")

val df_jfk = spark.sql("SELECT date, delay, origin, destination FROM us_delay_flights_tbl WHERE origin = 'JFK'")

In [None]:
df_sfo.createOrReplaceGlobalTempView("us_origin_airport_SFO_global_tmp_view")
df_jfk.createOrReplaceTempView("us_origin_airport_JFK_tmp_view")
spark.read.table("us_origin_airport_JFK_tmp_view")

In [None]:
spark.catalog.dropTempView("us_origin_airport_JFK_tmp_view")
spark.catalog.dropGlobalTempView("us_origin_airport_SFO_global_tmp_view")
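
One detail the cells above do not show is how the two kinds of view are addressed. The sketch below assumes the default global temporary database name, `global_temp`, and uses made-up rows and view names: a global temporary view must be qualified with that database and is visible from other sessions of the same application, while a plain temporary view belongs to the session that created it.

```scala
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one is already running
val spark = SparkSession.builder
  .master("local[1]")
  .appName("TempViewScopeSketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows standing in for the SFO and JFK subsets
val sfoFlights = Seq(("01011250", 55, "SFO", "JFK"))
  .toDF("date", "delay", "origin", "destination")
val jfkFlights = Seq(("01010900", 14, "JFK", "LAX"))
  .toDF("date", "delay", "origin", "destination")

sfoFlights.createOrReplaceGlobalTempView("sketch_sfo_global_tmp_view")
jfkFlights.createOrReplaceTempView("sketch_jfk_tmp_view")

// Global temporary views are qualified with the global_temp database
val sameSessionCount =
  spark.sql("SELECT * FROM global_temp.sketch_sfo_global_tmp_view").count()

// A second session of the same application sees the global view...
val otherSession = spark.newSession()
val otherSessionCount =
  otherSession.table("global_temp.sketch_sfo_global_tmp_view").count()

// ...but not the session-scoped temporary view
val tempViewVisibleElsewhere = otherSession.catalog.tableExists("sketch_jfk_tmp_view")

// Clean up
spark.catalog.dropTempView("sketch_jfk_tmp_view")
spark.catalog.dropGlobalTempView("sketch_sfo_global_tmp_view")
```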

## Viewing the Metadata

In [None]:
spark.catalog.listDatabases()
spark.catalog.listTables()
spark.catalog.listColumns("us_delay_flights_tbl")

## Reading Tables into DataFrames


Let’s assume you have an existing database, learn_spark_db, and a table, us_delay_flights_tbl, ready for use. You can read the table with a SQL query or directly by name:

In [None]:
val usFlightsDF = spark.sql("SELECT * FROM us_delay_flights_tbl")
val usFlightsDF2 = spark.table("us_delay_flights_tbl")

In [None]:
usFlightsDF.show()

## DataFrameReader

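DataFrameReader is the core construct for reading data from a data source into a DataFrame. You obtain an instance through spark.read, and it has a defined format and a recommended usage pattern:

DataFrameReader.format(args).option("key", "value").schema(args).load()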

In [None]:
// Use Parquet
val file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/parquet/2010-summary.parquet"

In [None]:
val df = spark.read.format("parquet").option("path",file).load()

In [None]:
// Use Parquet; you can omit format("parquet") if you wish as it's the default
val df2 = spark.read.load(file)

In [None]:
// Use CSV
val df3 = spark.read.format("csv")
 .option("inferSchema", "true")
 .option("header", "true")
 .option("mode", "PERMISSIVE")
 .load("C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/csv/*")

In [None]:
// Use JSON
val df4 = spark.read.format("json")
 .load("C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/json/*")

## DataFrameWriter

In [None]:
// Example
//val location = ...
//df.write.format("json").mode("overwrite").save(location)
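
Since the example above is left commented out, here is a small runnable sketch of the same write call, followed by a read to confirm what landed on disk. The output directory is a temporary one created for the purpose, and the two rows are invented sample data; `outputPath` and `roundTrip` are illustrative names only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one is already running
val spark = SparkSession.builder
  .master("local[1]")
  .appName("DataFrameWriterSketch")
  .getOrCreate()
import spark.implicits._

// Made-up rows shaped like the summary data used in this notebook
val sample = Seq(
  ("United States", "Romania", 1),
  ("United States", "Ireland", 264)
).toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// Temporary output location; save() writes a directory of part files
val outputPath = Files.createTempDirectory("writer_sketch").resolve("json_out").toUri.toString

sample.write
  .format("json")
  .mode("overwrite") // replace any previous output at this path
  .save(outputPath)

// Read the directory back to check the result
val roundTrip = spark.read.format("json").load(outputPath)
roundTrip.show()
```

The `mode` setting matters on reruns: without it the writer fails if the target path already exists.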


## Parquet

### Reading Parquet files into a DataFrame

In [None]:
val file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/parquet/2010-summary.parquet/"
val df = spark.read.format("parquet").load(file)

### Reading Parquet files into a Spark SQL table

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show()
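
The query above presupposes that a table or view has already been defined over the files. One way to do that from SQL, sketched here under the assumption of a throwaway Parquet directory and a hypothetical view name, is `CREATE OR REPLACE TEMPORARY VIEW ... USING parquet`. The sample rows are invented so that the cell runs on its own.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session if one is already running
val spark = SparkSession.builder
  .master("local[1]")
  .appName("ParquetViewSketch")
  .getOrCreate()
import spark.implicits._

// Write a small made-up Parquet data set to a temporary directory
val parquetPath = Files.createTempDirectory("parquet_view_sketch").resolve("data").toUri.toString
Seq(("United States", "Romania", 1), ("United States", "Ireland", 264))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
  .write.format("parquet").mode("overwrite").save(parquetPath)

// Define a temporary view directly over the Parquet files
spark.sql(
  s"""CREATE OR REPLACE TEMPORARY VIEW sketch_parquet_tmp_view
     |USING parquet
     |OPTIONS (path '$parquetPath')""".stripMargin)

// The view can now be queried like any table
val fromView = spark.sql("SELECT * FROM sketch_parquet_tmp_view")
fromView.show()
```

The same statement shape applies to the other built-in file formats in this notebook by changing the name after `USING`, for example `json`, `csv`, or `orc`.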

### Writing DataFrames to Parquet files

In [None]:
df.write.format("parquet")
 .mode("overwrite")
 .option("compression", "snappy")
 .save("/tmp/data/parquet/df_parquet")

### Writing DataFrames to Spark SQL tables

In [None]:
df.write
 .mode("overwrite")
 .saveAsTable("us_delay_flights_tbl")

## JSON

### Reading a JSON file into a DataFrame

In [None]:
val file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/json/*"
val df = spark.read.format("json").load(file)

### Reading a JSON file into a Spark SQL table

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show()

### Writing DataFrames to JSON files

In [None]:
df.write.json("/tmp/data/json/df_json2")

## CSV

### Reading a CSV file into a DataFrame

In [None]:
val file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/csv/*"
val schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
val df = spark.read.format("csv")
 .schema(schema)
 .option("header", "true")
 .option("mode", "FAILFAST") // Fail immediately on a malformed record
 .option("nullValue", "") // Read empty strings as null
 .load(file)


### Reading a CSV file into a Spark SQL table

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10)

### Writing DataFrames to CSV files

In [None]:
df.write.format("csv").mode("overwrite").save("/tmp/data/csv/df_csv")

## Avro

### Reading an Avro file into a DataFrame

In [None]:
val df = spark.read.format("avro")
 .load("C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/avro/*")
df.show(false)

### Reading an Avro file into a Spark SQL table

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(false)

### Writing DataFrames to Avro files

In [None]:
df.write
 .format("avro")
 .mode("overwrite")
 .save("/tmp/data/avro/df_avro")

## ORC

### Reading an ORC file into a DataFrame

In [None]:
val file = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/summary-data/orc/*"
val df = spark.read.format("orc").load(file)
df.show(10, false)

### Reading an ORC file into a Spark SQL table

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show()

### Writing DataFrames to ORC files

In [None]:
df.write.format("orc")
 .mode("overwrite")
 .option("compression", "snappy")
 .save("/tmp/data/orc/df_orc")

## Images

### Reading an image file into a DataFrame

In [None]:
import org.apache.spark.ml.source.image
val imageDir = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/train_images/"
val imagesDF = spark.read.format("image").load(imageDir)
imagesDF.printSchema
imagesDF.select("image.height", "image.width", "image.nChannels", "image.mode",
 "label").show(5, false)


## Binary Files

### Reading a binary file into a DataFrame

In [None]:
val path = "C:/Users/sara.arribas/Downloads/Ejemplos_Spark/train_images/"
val binaryFilesDF = spark.read.format("binaryFile")
 .option("pathGlobFilter", "*.jpg")
 .load(path)
binaryFilesDF.show(5)

In [None]:
val binaryFilesDF = spark.read.format("binaryFile")
 .option("pathGlobFilter", "*.jpg")
 .option("recursiveFileLookup", "true")
 .load(path)
binaryFilesDF.show(5)