# Spark Data Sources

This notebook shows how to use the Spark Data Sources API to read the following file formats:
 * Parquet
 * JSON
 * CSV
 * Avro
 * ORC
 * Image
 * Binary

A full list of DataSource methods is available [here](https://docs.databricks.com/spark/latest/data-sources/index.html#id1).

## Define paths for the various data sources

In [1]:
parquet_file = "./data/flights/summary-data/parquet/2010-summary.parquet"
json_file = "./data/flights/summary-data/json/*"
csv_file = "./data/flights/summary-data/csv/*"
orc_file = "./data/flights/summary-data/orc/*"
avro_file = "./data/flights/summary-data/avro/*"
schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

## Parquet Data Source

In [2]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
from pyspark.sql import SparkSession
spark = (
    SparkSession
    .builder
    .appName("04_chap")
    .getOrCreate()
    )
sc = spark.sparkContext

df = (spark
      .read
      .format("parquet")
      .option("path", parquet_file)
      .load())


The same data can also be read with the format-specific shorthand `spark.read.parquet()`:

In [3]:
df2 = spark.read.parquet(parquet_file)

In [4]:
df.show(10, False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Use SQL

This will create an _unmanaged_ temporary view

In [5]:
# %sql
spark.sql(f"""
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING parquet
    OPTIONS (
      path "{parquet_file}"
    )
""")

DataFrame[]
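
The statement above follows a fixed shape (view name, `USING <format>`, then an `OPTIONS` list), so it can be generated rather than retyped for each format. The following is a minimal sketch, not part of the original notebook: `temp_view_ddl` is a hypothetical helper, and it assumes option values contain no double quotes, since it does no escaping.

```python
def temp_view_ddl(view_name, source_format, path, **options):
    """Build a CREATE OR REPLACE TEMPORARY VIEW statement for a file source.

    Hypothetical helper: values are inserted as-is, without escaping.
    """
    # `path` is always emitted first; any extra options follow in sorted order
    pairs = [("path", path)] + sorted(options.items())
    option_lines = ",\n".join(f'      {key} "{value}"' for key, value in pairs)
    return (
        f"CREATE OR REPLACE TEMPORARY VIEW {view_name}\n"
        f"    USING {source_format}\n"
        f"    OPTIONS (\n{option_lines}\n    )"
    )


parquet_file = "./data/flights/summary-data/parquet/2010-summary.parquet"
ddl = temp_view_ddl("us_delay_flights_tbl", "parquet", parquet_file)
print(ddl)

# With an active session, the statement would be run as: spark.sql(ddl)
# Extra reader options are passed as keyword arguments, for example:
# temp_view_ddl("us_delay_flights_tbl", "csv", csv_file, header="true")
```

Building the string separately from `spark.sql()` makes it easy to print and inspect the statement before running it.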

Use SQL to query the table

The outcome should be the same as the one read into the DataFrame above.

In [6]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## JSON Data Source

In [7]:
df = spark.read.format("json").option("path", json_file).load()

                                                                                

In [8]:
df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



In [9]:
df2 = spark.read.json(json_file)

In [10]:
df2.show(10, False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



## Use SQL

This will create an _unmanaged_ temporary view

In [11]:
# %sql
spark.sql(
    f"""
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING json
    OPTIONS (
      path "{json_file}"
    )
    """
)


DataFrame[]

Use SQL to query the table

The outcome should be the same as the one read into the DataFrame above.

In [12]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |15   |
|United States    |Croatia            |1    |
|United States    |Ireland            |344  |
|Egypt            |United States      |15   |
|United States    |India              |62   |
|United States    |Singapore          |1    |
|United States    |Grenada            |62   |
|Costa Rica       |United States      |588  |
|Senegal          |United States      |40   |
|Moldova          |United States      |1    |
+-----------------+-------------------+-----+
only showing top 10 rows



## CSV Data Source

In [13]:
df = (spark
      .read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .option("mode", "FAILFAST")  # fail immediately on a malformed record
      .option("nullValue", "")     # read empty fields as null
      .option("path", csv_file)
      .load())
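
The `mode` option decides what happens when a record does not fit the schema. Spark's CSV reader accepts `PERMISSIVE` (the default), `DROPMALFORMED`, and `FAILFAST`. The sketch below imitates those three behaviours in plain Python for a single bad `count` value. It is only an illustration of the semantics, not Spark's implementation: `parse_counts` and the sample rows are made up for this example, and Spark's real permissive mode has further details (such as an optional corrupt-record column) that are left out.

```python
import csv
import io

VALID_MODES = {"PERMISSIVE", "DROPMALFORMED", "FAILFAST"}


def parse_counts(text, mode="PERMISSIVE"):
    """Parse 'dest,origin,count' rows, imitating the three read modes."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode}")
    rows = []
    for dest, origin, raw_count in csv.reader(io.StringIO(text)):
        try:
            count = int(raw_count)
        except ValueError:
            if mode == "FAILFAST":
                # stop at the first record that does not match the schema
                raise ValueError(f"malformed record: {raw_count!r}")
            if mode == "DROPMALFORMED":
                continue  # silently skip the whole row
            count = None  # PERMISSIVE: keep the row, null the bad field
        rows.append((dest, origin, count))
    return rows


# Made-up sample: the second row has a non-integer count
sample = "Egypt,United States,24\nSenegal,United States,n/a\n"
permissive_rows = parse_counts(sample, "PERMISSIVE")
dropped_rows = parse_counts(sample, "DROPMALFORMED")
print(permissive_rows)
print(dropped_rows)
```

`FAILFAST` suits a notebook like this one, where a schema mismatch should surface immediately instead of turning into nulls or missing rows.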


In [14]:
df.show(10, truncate = False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



In [15]:
(df.write.format("parquet")
  .mode("overwrite")
  .option("path", "/tmp/data/parquet/df_parquet")
  .option("compression", "snappy")
  .save())

                                                                                

In [16]:
!ls /tmp/data/parquet/df_parquet

_SUCCESS
part-00000-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet
part-00001-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet
part-00002-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet
part-00003-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet
part-00004-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet
part-00005-a452443c-e8fe-47b2-b617-d359b9557cdb-c000.snappy.parquet


In [17]:
df2 = (spark
       .read
       .option("header", "true")
       .option("mode", "FAILFAST")  # fail immediately on a malformed record
       .option("nullValue", "")
       .schema(schema)
       .csv(csv_file))

In [18]:
df2.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Use SQL

This will create an _unmanaged_ temporary view

In [19]:
# %sql
spark.sql(
    f"""
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING csv
    OPTIONS (
      path "{csv_file}",
      header "true",
      inferSchema "true",
      mode "FAILFAST"
    )
    """
)


DataFrame[]

Use SQL to query the table

The outcome should be the same as the one read into the DataFrame above.

In [20]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## ORC Data Source

In [21]:
df = (spark.read
      .format("orc")
      .option("path", orc_file)
      .load())

In [22]:
df.show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Use SQL

This will create an _unmanaged_ temporary view

In [23]:
# %sql
spark.sql(
    f"""
    CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
        USING orc
        OPTIONS (
        path "{orc_file}"
        )
    """
)

DataFrame[]

Use SQL to query the table

The outcome should be the same as the one read into the DataFrame above.

In [24]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

+-----------------+-------------------+-----+
|DEST_COUNTRY_NAME|ORIGIN_COUNTRY_NAME|count|
+-----------------+-------------------+-----+
|United States    |Romania            |1    |
|United States    |Ireland            |264  |
|United States    |India              |69   |
|Egypt            |United States      |24   |
|Equatorial Guinea|United States      |1    |
|United States    |Singapore          |25   |
|United States    |Grenada            |54   |
|Costa Rica       |United States      |477  |
|Senegal          |United States      |29   |
|United States    |Marshall Islands   |44   |
+-----------------+-------------------+-----+
only showing top 10 rows



## Avro Data Source

In [25]:
# df = (spark.read
#       .format("avro")
#       .option("path", avro_file)
#       .load())

In [26]:
# df.show(10, truncate=False)
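
Avro support ships as a separate module (`spark-avro`) and is not bundled with the core distribution, so the cells above need it on the classpath before they can run. One way to supply it is the `spark.jars.packages` setting, which takes Maven coordinates. The sketch below only builds the coordinate string. `spark_avro_package` is a hypothetical helper, and the version numbers are assumptions that must match the local installation.

```python
def spark_avro_package(spark_version, scala_version="2.12"):
    """Return the Maven coordinate of the spark-avro module.

    Assumption: the Scala suffix defaults to 2.12; adjust it to the Scala
    version the local Spark build was compiled against.
    """
    return f"org.apache.spark:spark-avro_{scala_version}:{spark_version}"


# "3.5.0" is an example version, not taken from the notebook
coordinate = spark_avro_package("3.5.0")
print(coordinate)

# Sketch of use when building the session (the module is fetched at startup):
# import pyspark
# spark = (SparkSession.builder
#          .config("spark.jars.packages", spark_avro_package(pyspark.__version__))
#          .getOrCreate())
```

The setting only takes effect when the session is first created, so an already running session has to be stopped before the package can be picked up.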

## Use SQL

This will create an _unmanaged_ temporary view

In [27]:
# # %sql
# CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
#     USING avro
#     OPTIONS (
#       path "/databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
#     )

Use SQL to query the table

The outcome should be the same as the one read into the DataFrame above

In [28]:
# spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)

## Image

In [29]:
from pyspark.ml import image

image_dir = "./data/cctvVideos/train_images/"
images_df = spark.read.format("image").load(image_dir)
images_df.printSchema()

images_df.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, truncate=False)

root
 |-- image: struct (nullable = true)
 |    |-- origin: string (nullable = true)
 |    |-- height: integer (nullable = true)
 |    |-- width: integer (nullable = true)
 |    |-- nChannels: integer (nullable = true)
 |    |-- mode: integer (nullable = true)
 |    |-- data: binary (nullable = true)
 |-- label: integer (nullable = true)

+------+-----+---------+----+-----+
|height|width|nChannels|mode|label|
+------+-----+---------+----+-----+
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |1    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
|288   |384  |3        |16  |0    |
+------+-----+---------+----+-----+
only showing top 5 rows
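
The `mode` column in the output above is an OpenCV-compatible type code that packs the pixel depth and the channel count into one integer. A minimal sketch of how to decode it, assuming the standard OpenCV encoding (depth in the low three bits, channel count minus one in the bits above); `decode_ocv_mode` is a hypothetical helper, not part of PySpark:

```python
# OpenCV depth codes, stored in the low 3 bits of the type code
DEPTHS = {0: "CV_8U", 1: "CV_8S", 2: "CV_16U", 3: "CV_16S",
          4: "CV_32S", 5: "CV_32F", 6: "CV_64F"}


def decode_ocv_mode(mode):
    """Turn an OpenCV type code such as 16 into a name such as 'CV_8UC3'."""
    depth = mode & 7              # low 3 bits: pixel depth
    channels = (mode >> 3) + 1    # remaining bits: channel count minus one
    return f"{DEPTHS[depth]}C{channels}"


print(decode_ocv_mode(16))  # CV_8UC3
```

A value of 16 therefore means 8-bit unsigned pixels with three channels, which agrees with the `nChannels` column showing 3.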



## Binary

In [30]:
path = "./data/cctvVideos/train_images/"
binary_files_df = (spark
                   .read
                   .format("binaryFile")
                   .option("pathGlobFilter", "*.jpg")
                   .load(path))
binary_files_df.show(5)

+--------------------+--------------------+------+--------------------+-----+
|                path|    modificationTime|length|             content|label|
+--------------------+--------------------+------+--------------------+-----+
|file:/Users/khodo...|2025-05-09 00:00:...| 55037|[FF D8 FF E0 00 1...|    0|
|file:/Users/khodo...|2025-05-09 00:00:...| 54634|[FF D8 FF E0 00 1...|    1|
|file:/Users/khodo...|2025-05-09 00:00:...| 54624|[FF D8 FF E0 00 1...|    0|
|file:/Users/khodo...|2025-05-09 00:00:...| 54505|[FF D8 FF E0 00 1...|    0|
|file:/Users/khodo...|2025-05-09 00:00:...| 54475|[FF D8 FF E0 00 1...|    0|
+--------------------+--------------------+------+--------------------+-----+
only showing top 5 rows



                                                                                

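To disable partition discovery in a directory, set the `recursiveFileLookup` option to `true`. The files are still found recursively, but partition columns such as `label` are no longer added to the result.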

In [31]:
binary_files_df = (spark
                   .read
                   .format("binaryFile")
                   .option("pathGlobFilter", "*.jpg")
                   .option("recursiveFileLookup", "true")
                   .load(path))
binary_files_df.show(5)

+--------------------+--------------------+------+--------------------+
|                path|    modificationTime|length|             content|
+--------------------+--------------------+------+--------------------+
|file:/Users/khodo...|2025-05-09 00:00:...| 55037|[FF D8 FF E0 00 1...|
|file:/Users/khodo...|2025-05-09 00:00:...| 54634|[FF D8 FF E0 00 1...|
|file:/Users/khodo...|2025-05-09 00:00:...| 54624|[FF D8 FF E0 00 1...|
|file:/Users/khodo...|2025-05-09 00:00:...| 54505|[FF D8 FF E0 00 1...|
|file:/Users/khodo...|2025-05-09 00:00:...| 54475|[FF D8 FF E0 00 1...|
+--------------------+--------------------+------+--------------------+
only showing top 5 rows
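
The `label` column in the earlier binary-file output comes from partition discovery, which turns `key=value` directory names into columns. With `recursiveFileLookup` enabled that step is skipped, so the column is absent here. The sketch below imitates the idea in plain Python. It is an illustration only, not Spark's implementation: `partition_columns` is a hypothetical helper, the directory layout is an assumption (the notebook output truncates the real paths), and unlike Spark it keeps values as strings and does not infer their types.

```python
from pathlib import PurePosixPath


def partition_columns(file_path):
    """Collect key=value directory segments from a file path."""
    columns = {}
    # Only directories are inspected; the file name itself is ignored
    for segment in PurePosixPath(file_path).parent.parts:
        key, sep, value = segment.partition("=")
        if sep and key and value:
            columns[key] = value
    return columns


# Hypothetical layout with one partition directory
print(partition_columns("train_images/label=0/frame_001.jpg"))  # {'label': '0'}
# No key=value directory, so no partition columns
print(partition_columns("train_images/frame_001.jpg"))          # {}
```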



