# Spark Data Sources
This notebook shows how to use the Spark Data Sources API to read the following file formats:

* Parquet
* JSON
* CSV
* Avro
* ORC
* Image
* Binary

A full list of DataSource methods is available in the Apache Spark API documentation.



Define paths for the various data sources

In [None]:
parquet_file = "../../databricks-datasets/learning-spark-v2/flights/summary-data/parquet/2010-summary.parquet"
json_file = "../../databricks-datasets/learning-spark-v2/flights/summary-data/json/*"
csv_file = "../../databricks-datasets/learning-spark-v2/flights/summary-data/csv/*"
orc_file = "../../databricks-datasets/learning-spark-v2/flights/summary-data/orc/*"
avro_file = "../../databricks-datasets/learning-spark-v2/flights/summary-data/avro/*"
schema = "DEST_COUNTRY_NAME STRING, ORIGIN_COUNTRY_NAME STRING, count INT"

In [None]:
from pyspark.sql import SparkSession
# Create a SparkSession; the spark-avro package is needed for the Avro data source
spark = (SparkSession
    .builder
    .appName("Example-3_6")
    .config("spark.jars.packages", "org.apache.spark:spark-avro_2.12:3.1.2")
    .getOrCreate())

## Parquet Data Source

In [None]:
df = (spark
      .read
      .format("parquet")
      .option("path", parquet_file)
      .load())

Another way to read the same data is the format-specific shorthand method:

In [None]:
df2 = spark.read.parquet(parquet_file)
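
Both forms above go through the same DataFrameReader; the shorthand only fixes the format for you. As a minimal sketch, the shared pattern can be wrapped in a small helper. The name `read_with_format` is hypothetical and not part of Spark, while `format`, `options`, and `load` are DataFrameReader methods.

```python
def read_with_format(spark, fmt, path, **options):
    """Read `path` with the named data source, passing reader options through.

    Hypothetical helper for illustration; `spark` is an existing SparkSession.
    """
    reader = spark.read.format(fmt)
    if options:
        # Reader options are passed as keyword arguments, e.g. header="true"
        reader = reader.options(**options)
    return reader.load(path)

# Usage in this notebook (assumes `spark` and the path variables defined above):
# df = read_with_format(spark, "parquet", parquet_file)
# df = read_with_format(spark, "csv", csv_file, header="true", mode="FAILFAST")
```

The same helper works for any format name the session knows about, which is why the rest of this notebook can switch formats by changing a single string.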

In [None]:
df.show(10, False)

## Use SQL
This creates a temporary view over the Parquet file. The view is unmanaged: Spark does not own the underlying data.

In [None]:
%sql
CREATE OR REPLACE TEMPORARY VIEW us_delay_flights_tbl
    USING parquet
    OPTIONS (
      path "/databricks-datasets/definitive-guide/data/flight-data/parquet/2010-summary.parquet"
    )

Use SQL to query the view.

The result should match the DataFrame read above.

In [None]:
spark.sql("SELECT * FROM us_delay_flights_tbl").show(10, truncate=False)
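
To check the claim above rather than compare the output by eye, the two results can be compared row by row. This is a sketch only: the helper name `same_rows` is hypothetical, and it assumes a Spark version that provides `DataFrame.exceptAll` (2.4 or later) and two DataFrames with compatible schemas.

```python
def same_rows(left, right):
    """Return True when two DataFrames hold the same rows, duplicates included.

    Hypothetical helper; each exceptAll call triggers a Spark job.
    """
    return (left.exceptAll(right).count() == 0
            and right.exceptAll(left).count() == 0)

# Usage (assumes `df` still holds the Parquet DataFrame read above):
# same_rows(df, spark.sql("SELECT * FROM us_delay_flights_tbl"))
```

The check runs in both directions because `exceptAll` only reports rows of the left side that are missing from the right.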

## JSON Data Source

In [None]:
df = spark.read.format("json").option("path", json_file).load()

In [None]:
df.show(10, truncate=False)

In [None]:
df2 = spark.read.json(json_file)

In [None]:
df2.show(10, False)

## CSV Data Source

In [None]:
df = (spark
      .read
      .format("csv")
      .option("header", "true")
      .schema(schema)
      .option("mode", "FAILFAST")  # fail on the first malformed record
      .option("nullValue", "")     # treat empty strings as null
      .option("path", csv_file)
      .load())

In [None]:
df.show(10, truncate=False)
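
The `mode` option used above decides what the reader does with malformed records. As a rough sketch, the three modes accepted by Spark's CSV reader can be listed and validated before reading. The names `PARSE_MODES` and `read_csv` are hypothetical; the descriptions paraphrase the Spark documentation.

```python
# Parse modes accepted by Spark's CSV reader.
PARSE_MODES = {
    "PERMISSIVE": "keep malformed rows, setting fields that cannot be parsed to null (default)",
    "DROPMALFORMED": "drop malformed rows",
    "FAILFAST": "raise an error when a malformed row is encountered",
}

def read_csv(spark, path, schema, mode="PERMISSIVE"):
    """Read CSV files that have a header row, validating the parse mode first.

    Hypothetical helper; `spark` is an existing SparkSession.
    """
    mode = mode.upper()
    if mode not in PARSE_MODES:
        raise ValueError(f"unknown mode {mode!r}; expected one of {sorted(PARSE_MODES)}")
    return (spark.read
            .schema(schema)
            .option("header", "true")
            .option("mode", mode)
            .csv(path))

# Usage (assumes `spark`, `csv_file`, and `schema` defined above):
# df = read_csv(spark, csv_file, schema, mode="FAILFAST")
```

Because Spark evaluates lazily, a FAILFAST error surfaces when an action such as `show()` runs, not when the reader is defined.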

In [None]:
(df.write.format("parquet")
  .mode("overwrite")
  .option("path", "/tmp/data/parquet/df_parquet")
  .option("compression", "snappy")
  .save())

In [None]:
%fs ls /tmp/data/parquet/df_parquet

In [None]:
df2 = (spark
       .read
       .option("header", "true")
       .option("mode", "FAILFAST")  # fail on the first malformed record
       .option("nullValue", "")
       .schema(schema)
       .csv(csv_file))

In [None]:
df2.show(10, truncate=False)

## ORC Data Source

In [None]:
df = (spark.read
      .format("orc")
      .option("path", orc_file)
      .load())

In [None]:
df.show(10, truncate=False)

## Avro Data Source

In [None]:
df = (spark.read
      .format("avro")
      .option("path", avro_file)
      .load())

In [None]:
df.show(10, truncate=False)

## Image

In [None]:
from pyspark.ml import image

image_dir = "../../databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
images_df = spark.read.format("image").load(image_dir)
images_df.printSchema()

images_df.select("image.height", "image.width", "image.nChannels", "image.mode", "label").show(5, truncate=False)
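
The `image.mode` column selected above holds an OpenCV-compatible type code rather than a readable name. As a sketch, the code can be decoded with OpenCV's type layout, where the low three bits give the depth and the remaining bits give the channel count minus one. The helper name `describe_ocv_mode` is hypothetical, and handling only 8-bit unsigned images is a simplifying assumption.

```python
def describe_ocv_mode(mode):
    """Decode an OpenCV type code into a name such as 'CV_8UC3'.

    Hypothetical helper; handles only 8-bit unsigned images (depth code 0).
    """
    depth = mode & 7            # low three bits encode the depth
    channels = (mode >> 3) + 1  # remaining bits encode channel count minus one
    if depth != 0:
        raise ValueError(f"unsupported depth code {depth} in mode {mode}")
    return f"CV_8UC{channels}"

# Usage: a three-channel 8-bit image has mode 16
# describe_ocv_mode(16)
```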

## Binary

In [None]:
path = "../../databricks-datasets/learning-spark-v2/cctvVideos/train_images/"
binary_files_df = (spark.read.format("binaryFile")
  .option("pathGlobFilter", "*.jpg")
  .load(path))

binary_files_df.show(5)

To skip partition discovery in a directory, set the recursiveFileLookup option to true.

In [None]:
binary_files_df = (spark.read.format("binaryFile")
   .option("pathGlobFilter", "*.jpg")
   .option("recursiveFileLookup", "true")
   .load(path))
binary_files_df.show(5)
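
One way to see the effect of recursiveFileLookup is to look at which columns come from partition discovery. The binaryFile source always produces the same four columns, so anything extra comes from directory names of the form key=value. This is a minimal sketch; the names `BINARY_FILE_COLUMNS` and `partition_columns` are hypothetical.

```python
# Columns that the binaryFile data source always produces.
BINARY_FILE_COLUMNS = ("path", "modificationTime", "length", "content")

def partition_columns(df):
    """Return the columns of a binaryFile DataFrame added by partition discovery."""
    return [c for c in df.columns if c not in BINARY_FILE_COLUMNS]

# Usage (assumes `binary_files_df` from the cells above):
# partition_columns(binary_files_df)
```

Assuming the image directory is laid out in label=... subdirectories, as the `label` column in the image example suggests, the first read should report a `label` column and the read with recursiveFileLookup should report none.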