In [1]:
%%init_spark
launcher.packages = ["org.apache.spark:spark-avro_2.12:3.1.2"]

# Spark Data Sources
This notebook shows how to use the Spark Data Sources API to read the following file formats:

* Parquet
* JSON
* CSV
* Avro
* ORC
* Image
* Binary

A full list of DataSource methods is available here
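
As a rough illustration of the generic format(...).load() pattern this notebook relies on, the sketch below writes a tiny, made-up DataFrame to a temporary directory in several built-in formats and reads each one back with the same calls. The sample rows, the temporary directory, and the names sample, baseDir, and rowCounts are assumptions for this example and are not part of the flights dataset. Avro is left out because it needs the external spark-avro package.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session when one is active; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("datasource-sketch").getOrCreate()

// Hypothetical sample rows shaped like the flights summary data.
val sample = spark
  .createDataFrame(Seq(("United States", "Romania", 1), ("Egypt", "United States", 24)))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")

// Scratch location used only by this sketch.
val baseDir = Files.createTempDirectory("datasource-sketch").toString

// The write and read calls are identical for every format; only the format name changes.
val rowCounts = Seq("parquet", "json", "csv", "orc").map { fmt =>
  val path = s"$baseDir/$fmt"
  sample.write.format(fmt).mode("overwrite").save(path)
  fmt -> spark.read.format(fmt).load(path).count()
}.toMap

println(rowCounts)
```

Each format reports the same number of rows that was written, which is the point of the uniform interface: switching formats should only require changing the string passed to format.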

Define paths for the various data sources

In [2]:
val parquet_file = "../databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
val json_file = "../databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
val csv_file = "../databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"
val orc_file = "../databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
val avro_file = "../databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
val schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

Intitializing Scala interpreter ...

Spark Web UI available at http://192.168.0.11:4041
SparkContext available as 'sc' (version = 3.1.2, master = local[*], app id = local-1634056465660)
SparkSession available as 'spark'


parquet_file: String = ../databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet
json_file: String = ../databricks-datasets/learning-spark-v2/flights/summary-data/json/*
csv_file: String = ../databricks-datasets/learning-spark-v2/flights/summary-data/csv/*
orc_file: String = ../databricks-datasets/learning-spark-v2/flights/summary-data/orc/*
avro_file: String = ../databricks-datasets/learning-spark-v2/flights/summary-data/avro/*
schema: String = DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT


## Parquet Data Source

In [3]:
val df = (spark
      .read
      .format("parquet")
      .option("path", parquet_file)
      .load())

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


The same data can also be read with the format-specific shortcut method spark.read.parquet:

In [4]:
val df2 = spark.read.parquet(parquet_file)

df2: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
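
One way to check that the two read styles are interchangeable is to compare their schemas and rows. The sketch below does this on a small, made-up Parquet file written to a temporary directory; the sample rows, the path, and the names viaFormat, viaShortcut, sameSchema, and sameRows are assumptions for illustration only.

```scala
import java.nio.file.Files
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session when one is active; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("parquet-read-sketch").getOrCreate()

// Hypothetical Parquet data written to a scratch directory.
val sketchPath = Files.createTempDirectory("parquet-read-sketch").resolve("flights").toString
spark
  .createDataFrame(Seq(("United States", "Romania", 1), ("Egypt", "United States", 24)))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
  .write.mode("overwrite").parquet(sketchPath)

// Generic form and format-specific shortcut.
val viaFormat = spark.read.format("parquet").option("path", sketchPath).load()
val viaShortcut = spark.read.parquet(sketchPath)

// Same column names and types, and no row present in one result but not the other.
val sameSchema = viaFormat.schema == viaShortcut.schema
val sameRows = viaFormat.except(viaShortcut).isEmpty && viaShortcut.except(viaFormat).isEmpty

println(s"sameSchema=$sameSchema, sameRows=$sameRows")
```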


In [5]:
df.show(10, false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



In [6]:
df2.show(10, false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Use SQL
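This creates a temporary view over the Parquet file. The view is unmanaged: Spark tracks only its metadata, and the data itself stays at the given path.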

In [7]:
spark.sql(
    """
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
        USING parquet
            OPTIONS (
              path "../databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
            )
    """
)

res2: org.apache.spark.sql.DataFrame = []
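
To make the "unmanaged" and "temporary" properties more concrete, the sketch below creates a similar view over a small, made-up Parquet file, queries it, drops it, and then checks that the files on disk are untouched. The view name flights_sketch_view, the sample rows, and the temporary path are hypothetical, and the path is assumed to contain no characters that need escaping inside the SQL string.

```scala
import java.nio.file.{Files, Paths}
import org.apache.spark.sql.SparkSession

// Reuses the notebook's session when one is active; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("temp-view-sketch").getOrCreate()

// Hypothetical Parquet data written to a scratch directory.
val dataPath = Files.createTempDirectory("temp-view-sketch").resolve("flights").toString
spark
  .createDataFrame(Seq(("United States", "Romania", 1), ("Egypt", "United States", 24)))
  .toDF("DEST_COUNTRY_NAME", "ORIGIN_COUNTRY_NAME", "count")
  .write.mode("overwrite").parquet(dataPath)

// Same statement shape as the cell above, pointed at the scratch data.
spark.sql(s"""
  CREATE OR REPLACE TEMPORARY VIEW flights_sketch_view
  USING parquet
  OPTIONS (path "$dataPath")
""")

val rowsInView = spark.sql("SELECT * FROM flights_sketch_view").count()
val existedBeforeDrop = spark.catalog.tableExists("flights_sketch_view")

// Dropping the view removes only the catalog entry for this session.
val dropped = spark.catalog.dropTempView("flights_sketch_view")
val existsAfterDrop = spark.catalog.tableExists("flights_sketch_view")

// The underlying Parquet files are still on disk.
val filesRemain = Files.exists(Paths.get(dataPath))

println(s"rows=$rowsInView, dropped=$dropped, filesRemain=$filesRemain")
```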


In [9]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## JSON Data Source

In [7]:
val df = spark.read.format("json").option("path", json_file).load()

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [8]:
df.show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



In [9]:
val df2 = spark.read.json(json_file)

df2: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [10]:
df2.show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



## CSV Data Source

In [11]:
val df = (spark
      .read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .option("mode", "FAILFAST")  // throw an exception on any malformed record
      .option("nullValue", "")     // read empty strings as null
      .option("path", csv_file)
      .load())

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]
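
The effect of the mode and nullValue options is easier to see on deliberately imperfect input. The sketch below writes two small, made-up CSV files to a temporary directory: one with an empty field and one whose count value cannot be parsed as INT. The file names, their contents, and the readCsv helper are hypothetical; only the option names and the schema string come from the cell above.

```scala
import java.nio.charset.StandardCharsets
import java.nio.file.Files
import scala.util.Try
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the notebook's session when one is active; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("csv-options-sketch").getOrCreate()

val flightSchema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"
val csvDir = Files.createTempDirectory("csv-options-sketch")
val header = "DEST_COUNTRY_NAME,ORIGIN_COUNTRY_NAME,count\n"

// Hypothetical input: well-formed, but the second row has an empty origin field.
val cleanCsv = csvDir.resolve("clean.csv")
Files.write(cleanCsv, (header + "United States,Romania,1\nEgypt,,24\n").getBytes(StandardCharsets.UTF_8))

// Hypothetical input: "many" cannot be parsed as an INT.
val badCsv = csvDir.resolve("bad.csv")
Files.write(badCsv, (header + "United States,Romania,many\n").getBytes(StandardCharsets.UTF_8))

// Hypothetical helper mirroring the reader options used in this notebook.
def readCsv(path: String, mode: String): DataFrame =
  spark.read.format("csv")
    .option("header", "true")
    .schema(flightSchema)
    .option("mode", mode)
    .option("nullValue", "")
    .load(path)

// nullValue: the empty origin field is read as null.
val cleanRows = readCsv(cleanCsv.toString, "FAILFAST").collect()
val nullOrigins = cleanRows.count(_.isNullAt(1))

// FAILFAST: the malformed record raises an exception once the data is actually read.
val failFastResult = Try(readCsv(badCsv.toString, "FAILFAST").collect())

// PERMISSIVE (the default mode): the unparsable count is read as null instead.
val permissiveRows = readCsv(badCsv.toString, "PERMISSIVE").collect()

println(s"nullOrigins=$nullOrigins, failFastFailed=${failFastResult.isFailure}")
```

Because Spark evaluates reads lazily, FAILFAST surfaces the error at the first action (collect here), not when load is called.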


In [12]:
df.show(10, truncate = false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



In [13]:
(df.write.format("parquet")
  .mode("overwrite")
  .option("path", "/tmp/data/parquet/df_parquet")
  .option("compression", "snappy")
  .save())

In [14]:
val df2 = (spark
       .read
       .option("header", "true")
       .option("mode", "FAILFAST")  // throw an exception on any malformed record
       .option("nullValue", "")
       .schema(schema)
       .csv(csv_file))

df2: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [15]:
df2.show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## ORC Data Source

In [16]:
val df = (spark.read
      .format("orc")
      .option("path", orc_file)
      .load())

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [17]:
df.show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Avro Data Source

In [18]:
val df = (spark.read
      .format("avro")
      .option("path", avro_file)
      .load())

df: org.apache.spark.sql.DataFrame = [DEST_COUNTRY_NAME: string, ORIGIN_COUNTRY_NAME: string ... 1 more field]


In [19]:
df.show(10, truncate=false)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Image

In [20]:
import org.apache.spark.ml.source.image

val image_dir = "../databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
val images_df = spark.read.format("image").load(image_dir)

images_df.printSchema()
images_df.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, truncate=false)

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)

+------+-----+---------+----+-----+
|height|width|nChannels|mode|label|
+------+-----+---------+----+-----+
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |1    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
+------+-----+---------+----+-----+
only showing top 5 rows



import org.apache.spark.ml.source.image
image_dir: String = ../databricks-datasets/learning-spark-v2/cctvVideos/train_images/
images_df: org.apache.spark.sql.DataFrame = [image: struct<origin: string, height: int ... 4 more fields>, label: int]


## Binary

In [21]:
val path = "../databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
val binary_files_df = (spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(path))

binary_files_df.show(5)

+--------------------+-------------------+------+--------------------+-----+
|                path|   modificationTime|length|             content|label|
+--------------------+-------------------+------+--------------------+-----+
|file:/media/jose/...|2021-04-15 02:34:17| 55037|[FF D8 FF E0 00 1...|    0|
|file:/media/jose/...|2021-04-15 02:34:17| 54634|[FF D8 FF E0 00 1...|    1|
|file:/media/jose/...|2021-04-15 02:34:17| 54624|[FF D8 FF E0 00 1...|    0|
|file:/media/jose/...|2021-04-15 02:34:17| 54505|[FF D8 FF E0 00 1...|    0|
|file:/media/jose/...|2021-04-15 02:34:17| 54475|[FF D8 FF E0 00 1...|    0|
+--------------------+-------------------+------+--------------------+-----+
only showing top 5 rows



path: String = ../databricks-datasets/learning-spark-v2/cctvVideos/train_images/
binary_files_df: org.apache.spark.sql.DataFrame = [path: string, modificationTime: timestamp ... 3 more fields]


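To skip partition discovery in a directory, set recursiveFileLookup to true. Spark then loads files from all subdirectories and no longer adds partition columns, so the label column is absent from the result: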

In [22]:
val binary_files_df = (spark.read.format("binaryFile")
   .option("pathGlobFilter", "*.jpg")
   .option("recursiveFileLookup", "true")
   .load(path))
binary_files_df.show(5)

+--------------------+-------------------+------+--------------------+
|                path|   modificationTime|length|             content|
+--------------------+-------------------+------+--------------------+
|file:/media/jose/...|2021-04-15 02:34:17| 55037|[FF D8 FF E0 00 1...|
|file:/media/jose/...|2021-04-15 02:34:17| 54634|[FF D8 FF E0 00 1...|
|file:/media/jose/...|2021-04-15 02:34:17| 54624|[FF D8 FF E0 00 1...|
|file:/media/jose/...|2021-04-15 02:34:17| 54505|[FF D8 FF E0 00 1...|
|file:/media/jose/...|2021-04-15 02:34:17| 54475|[FF D8 FF E0 00 1...|
+--------------------+-------------------+------+--------------------+
only showing top 5 rows



binary_files_df: org.apache.spark.sql.DataFrame = [path: string, modificationTime: timestamp ... 2 more fields]
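
The difference between the two reads can be reproduced on a small scratch directory. The sketch below assumes a layout with key=value subdirectories (label=0 and label=1), which is the naming convention Spark's partition discovery looks for; the directory names, the dummy file contents, and the readJpgs helper are hypothetical and only stand in for the train_images data.

```scala
import java.nio.file.Files
import org.apache.spark.sql.{DataFrame, SparkSession}

// Reuses the notebook's session when one is active; otherwise starts a local one.
val spark = SparkSession.builder().master("local[*]").appName("binary-file-sketch").getOrCreate()

// Hypothetical layout: <root>/label=0/frame0.jpg and <root>/label=1/frame1.jpg.
val root = Files.createTempDirectory("binary-file-sketch")
for (label <- Seq(0, 1)) {
  val dir = Files.createDirectories(root.resolve(s"label=$label"))
  // Dummy bytes; the binaryFile source does not decode the content.
  Files.write(dir.resolve(s"frame$label.jpg"), Array[Byte](1, 2, 3))
}

// Hypothetical helper: same reader as above, with recursiveFileLookup toggled.
def readJpgs(recursive: Boolean): DataFrame =
  spark.read.format("binaryFile")
    .option("pathGlobFilter", "*.jpg")
    .option("recursiveFileLookup", recursive.toString)
    .load(root.toString)

val withDiscovery = readJpgs(recursive = false)
val withoutDiscovery = readJpgs(recursive = true)

println(withDiscovery.columns.mkString(", "))
println(withoutDiscovery.columns.mkString(", "))
```

Both reads find the same two files; only the partition column derived from the directory names differs.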
