### Auto Loader with PySpark


Set up the catalog and schema


In [0]:
spark.sql("USE CATALOG catalog")
spark.sql("USE schema")

DataFrame[]

Define the schema (data structure) for the Delta table

In [0]:
from pyspark.sql.types import StructType, StructField, StringType, IntegerType
schema = StructType([
   StructField("Id", IntegerType(), True),
   StructField("name", StringType(), True),
   StructField("age", IntegerType(), True),
   StructField("money", IntegerType(), True),
   StructField("sales", IntegerType(), True),
   StructField("units", IntegerType(), True),
])

Read the CSV files that are already in the volume, and any CSV files that arrive later (file arrival).

Selecting the **"cloudFiles"** format sets up Auto Loader.

In [0]:
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "csv")   # or json, parquet, etc.
      .option("header", "true")
      .schema(schema)  # Schema enforcement
      .load("/Volumes/autoloader/csv/"))
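
The effect of passing a fixed schema can be pictured with a small plain-Python model. This is only a sketch: the helper names cast_or_none and apply_schema are made up for illustration, and the rule "a value that cannot be cast becomes None" is a simplification in the spirit of Spark's permissive parsing, not Auto Loader's actual code path.

```python
import csv
import io

# Column names and Python types mirroring the StructType defined above.
SCHEMA = [("Id", int), ("name", str), ("age", int),
          ("money", int), ("sales", int), ("units", int)]


def cast_or_none(value, target_type):
    """Cast a raw CSV string to the declared type, or None if it does not fit."""
    if value is None:
        return None
    try:
        return target_type(value)
    except (TypeError, ValueError):
        return None


def apply_schema(csv_text):
    """Parse CSV text with a header row and cast each column according to SCHEMA."""
    reader = csv.DictReader(io.StringIO(csv_text))
    return [{name: cast_or_none(row.get(name), typ) for name, typ in SCHEMA}
            for row in reader]


# The second row is a hypothetical bad record: "unknown" is not an integer.
rows = apply_schema("Id,name,age,money,sales,units\n"
                    "1,Alice,25,1200,4500,45\n"
                    "2,Bob,unknown,850,3000,30\n")
```

In this model the bad value is dropped to None while the rest of the row is kept, which is why a fixed schema is described as enforcement in the read cell.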

Write the data from the CSV files to the Delta table.

The checkpoint folder stores Auto Loader's state, so the same file is not read and written more than once.

There is no need to create the Delta table before starting to load data.

In [0]:
(df.writeStream
   .format("delta")
   .option("checkpointLocation", "/Volumes/autoloader/_checkpoints")
   .outputMode("append")
   # .trigger(processingTime='5 seconds')
   .trigger(availableNow=True)
   .table("autoloader"))

<pyspark.sql.connect.streaming.query.StreamingQuery at 0x7f51990339b0>
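
The role of the checkpoint can be pictured with a toy model in plain Python. It is a sketch under loud assumptions: the function ingest_new_files and the single JSON state file are invented for illustration, and Auto Loader's real checkpoint format is different. The point is only that state kept between runs is what lets a second run skip files it has already seen.

```python
import json
import tempfile
from pathlib import Path


def ingest_new_files(source_dir, checkpoint_file):
    """Return the names of files not seen in earlier runs and record them."""
    # Load the file names processed by previous runs, if a checkpoint exists.
    if checkpoint_file.exists():
        seen = set(json.loads(checkpoint_file.read_text()))
    else:
        seen = set()
    # Only files that are not in the checkpoint count as new.
    new_files = sorted(p.name for p in source_dir.glob("*.csv") if p.name not in seen)
    # Persist the updated state so the next run skips these files.
    checkpoint_file.write_text(json.dumps(sorted(seen | set(new_files))))
    return new_files


def demo():
    """Run three ingestions against a temporary folder and return what each one picked up."""
    with tempfile.TemporaryDirectory() as tmp:
        source = Path(tmp) / "csv"
        source.mkdir()
        checkpoint = Path(tmp) / "checkpoint.json"
        (source / "file1.csv").write_text("Id,name\n1,Alice\n")
        first = ingest_new_files(source, checkpoint)    # sees file1.csv
        second = ingest_new_files(source, checkpoint)   # nothing new
        (source / "file2.csv").write_text("Id,name\n2,Bob\n")
        third = ingest_new_files(source, checkpoint)    # sees only file2.csv
        return first, second, third


first, second, third = demo()
```

Deleting the checkpoint file in this model makes the next run pick up every file again, which is the analogue of pointing the stream at a fresh checkpointLocation.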

Check that the data was loaded correctly into the Delta table

In [0]:
%sql
SELECT * FROM autoloader

Id,name,age,money,sales,units
10,Julia,26,1300,2800,28
11,Kevin,39,1750,5500,55
12,Laura,31,1450,3800,38
7,George,30,1100,2200,22
8,Hannah,28,1600,4000,40
9,Ian,36,2000,6500,65
1,Alice,25,1200,4500,45
2,Bob,32,850,3000,30
3,Charlie,29,1500,6000,60
1,Alice,25,1200,4500,45


Check the data in all CSV files (path to the folder)

In [0]:
%sql
SELECT *
FROM read_files(
  '/Volumes/autoloader/csv/',
  format => 'csv'
);


Id,name,age,money,sales,units,_rescued_data
10,Julia,26,1300,2800,28,
11,Kevin,39,1750,5500,55,
12,Laura,31,1450,3800,38,
7,George,30,1100,2200,22,
8,Hannah,28,1600,4000,40,
9,Ian,36,2000,6500,65,
1,Alice,25,1200,4500,45,
2,Bob,32,850,3000,30,
3,Charlie,29,1500,6000,60,
1,Alice,25,1200,4500,45,


Read the JSON files that are already in the volume, and any JSON files that arrive later (file arrival).

Selecting the **"cloudFiles"** format sets up Auto Loader.

Write the data from the JSON files to the Delta table.

The checkpoint folder stores Auto Loader's state, so the same file is not read and written more than once.

There is no need to create the Delta table before starting to load data.

Here, df.writeStream.table("autoloader_json") is a Databricks extension to Spark Structured Streaming. As the cell output shows, the call does return a StreamingQuery, but the result is not assigned to a variable here, so the running query has to be looked up later through spark.streams.active.


In [0]:
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")   # or csv, parquet, etc.
      .option("multiLine", "true")  # this option is important if we want to read the json without problems
      .schema(schema)
      .load("/Volumes/autoloader/json/"))


(df.writeStream
   .queryName("query_json")
   .format("delta")
   .option("checkpointLocation", "/Volumes/autoloader/_checkpoints_json")
   .outputMode("append")
   .table("autoloader_json"))

<pyspark.sql.connect.streaming.query.StreamingQuery at 0x7f47184437a0>

Because no trigger is set, this is a continuously running streaming query.


**To stop the stream**

In [0]:
# option 1 (when we do not have .queryName("query_json") on df.writeStream)
for q in spark.streams.active:
    print(f"Stopping query: {q}")
    q.stop()

Stopping query: <pyspark.sql.connect.streaming.query.StreamingQuery object at 0x7f4719515e20>


In [0]:
# option 2 (when we do have .queryName("query_json") on df.writeStream)
for q in spark.streams.active:
    if q.name == "query_json":
        q.stop()
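
The two options above can be folded into one small helper. This is a sketch, not part of the Spark API: stop_queries is a hypothetical name, and FakeQuery is a stand-in used only so the example runs without a cluster. The helper relies on nothing more than the name attribute and the stop() method that the loops above already use.

```python
def stop_queries(active_queries, name=None):
    """Stop every query in active_queries, or only those whose name matches.

    Returns the names of the queries that were stopped.
    """
    stopped = []
    for q in list(active_queries):
        if name is None or q.name == name:
            q.stop()
            stopped.append(q.name)
    return stopped


class FakeQuery:
    """Hypothetical stand-in for a StreamingQuery, used only to exercise the helper."""

    def __init__(self, name):
        self.name = name
        self.isActive = True

    def stop(self):
        self.isActive = False


queries = [FakeQuery("query_json"), FakeQuery("query_csv")]
stopped = stop_queries(queries, name="query_json")
```

In the notebook, the call would be stop_queries(spark.streams.active, name="query_json"), or stop_queries(spark.streams.active) to stop everything.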

Check that the data was loaded correctly into the Delta table

In [0]:
%sql
SELECT * FROM autoloader_json

Id,age,money,name,sales,units,_rescued_data,city
10,26,1300,Julia,2800,28,,
11,39,1750,Kevin,5500,55,,
12,31,1450,Laura,3800,38,,
7,30,1100,George,2200,22,,
8,28,1600,Hannah,4000,40,,
9,36,2000,Ian,6500,65,,
1,25,1200,Alice,4500,45,,
2,32,850,Bob,3000,30,,
3,29,1500,Charlie,6000,60,,
4,41,2100,Diana,7500,75,,


Check the data in all JSON files (path to the folder)

In [0]:
%sql

-- Option 1

SELECT *
FROM read_files(
  '/Volumes/autoloader/json/',
  format => 'json',
  multiline => 'true'
);

Id,age,city,money,name,sales,units,_rescued_data
10,26,,1300,Julia,2800,28,
11,39,,1750,Kevin,5500,55,
12,31,,1450,Laura,3800,38,
7,30,,1100,George,2200,22,
8,28,,1600,Hannah,4000,40,
9,36,,2000,Ian,6500,65,
1,25,,1200,Alice,4500,45,
2,32,,850,Bob,3000,30,
3,29,,1500,Charlie,6000,60,
4,41,,2100,Diana,7500,75,


In [0]:
%sql

-- Option 2

CREATE OR REPLACE TEMPORARY VIEW multiLineJsonTable
USING json
OPTIONS (path="/Volumes/autoloader/json/",multiline=true);

select * from multiLineJsonTable

Id,age,city,money,name,sales,units
10,26,,1300,Julia,2800,28
11,39,,1750,Kevin,5500,55
12,31,,1450,Laura,3800,38
7,30,,1100,George,2200,22
8,28,,1600,Hannah,4000,40
9,36,,2000,Ian,6500,65
1,25,,1200,Alice,4500,45
2,32,,850,Bob,3000,30
3,29,,1500,Charlie,6000,60
4,41,,2100,Diana,7500,75


In Spark Structured Streaming the DataFrame (df) itself has no stop() method, because the streaming query is managed through a StreamingQuery object, not the DataFrame. When you call df.writeStream...start(), Spark returns a StreamingQuery. You use that object to manage (stop, await termination, etc.) your continuous query.

Here is the same code, but with schema evolution and with the StreamingQuery kept in a variable

In [0]:
df = (spark.readStream
      .format("cloudFiles")
      .option("cloudFiles.format", "json")   # or csv, parquet, etc.
      .option("multiLine", "true")  # this option is important if we want to read the json without problems
      .option("cloudFiles.schemaLocation", "/Volumes/autoloader/_schemas_json")
      .option("cloudFiles.schemaEvolutionMode", "addNewColumns") 
      .load("/Volumes/autoloader/json/"))


query = (
  df.writeStream
   .format("delta")
   .option("checkpointLocation", "/Volumes/autoloader/_checkpoints_json")
   .option("mergeSchema", "true")
   .outputMode("append")   
   # .start("<path>") would write to a storage path instead of a catalog table
   .toTable("autoloader_json"))
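
What "addNewColumns" means for the table can be pictured with a plain-Python model. It is a sketch only: evolve_schema and align are hypothetical helpers, and the record with a city value is invented for illustration. As far as the documented behaviour goes, the real stream stops when it first meets a new column and uses the widened schema after a restart; that restart step is not modelled here.

```python
def evolve_schema(known_columns, record):
    """Return known_columns extended with any keys of record not seen before."""
    new_columns = [key for key in record if key not in known_columns]
    return known_columns + new_columns


def align(record, columns):
    """Project a record onto columns, filling missing values with None."""
    return {col: record.get(col) for col in columns}


columns = ["Id", "name", "age", "money", "sales", "units"]
old_record = {"Id": 1, "name": "Alice", "age": 25,
              "money": 1200, "sales": 4500, "units": 45}
# Hypothetical later record that carries an extra "city" field.
new_record = dict(old_record, Id=13, name="Mona", city="Paris")

# The schema grows by one column; existing columns are left untouched.
columns = evolve_schema(columns, new_record)
table = [align(r, columns) for r in (old_record, new_record)]
```

Rows written before the new column existed simply have no value for it, which matches the empty city values in the autoloader_json output shown earlier.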

In [0]:
# to check if the stream is active
query.isActive

True

In [0]:
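# Blocks until the query terminates, for example when the batch finishes with df.writeStream.trigger(once=True).
# Without such a trigger the query never finishes by itself, so the call blocks until the query is stopped or the cell is cancelled.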
query.awaitTermination()

com.databricks.backend.common.rpc.CommandCancelledException
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$5(SequenceExecutionState.scala:132)
	at scala.Option.getOrElse(Option.scala:189)
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3(SequenceExecutionState.scala:132)
	at com.databricks.spark.chauffeur.SequenceExecutionState.$anonfun$cancel$3$adapted(SequenceExecutionState.scala:129)
	at scala.collection.immutable.Range.foreach(Range.scala:158)
	at com.databricks.spark.chauffeur.SequenceExecutionState.cancel(SequenceExecutionState.scala:129)
	at com.databricks.spark.chauffeur.ExecContextState.cancelRunningSequence(ExecContextState.scala:715)
	at com.databricks.spark.chauffeur.ExecContextState.$anonfun$cancel$1(ExecContextState.scala:435)
	at scala.Option.getOrElse(Option.scala:189)
	at com.databricks.spark.chauffeur.ExecContextState.cancel(ExecContextState.scala:435)
	at com.databricks.spark.chauffeur.ExecutionContextManagerV1.can

In [0]:
query.stop()
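
To avoid blocking a cell indefinitely, awaitTermination can be given a timeout in seconds, in which case it returns whether the query ended within that time. The helper below is a minimal sketch built on that: wait_or_stop is a hypothetical name, and FakeQuery is a stand-in so the example runs without a cluster.

```python
def wait_or_stop(query, timeout_seconds):
    """Wait up to timeout_seconds for the query to end; stop it if it is still running.

    Returns True if the query ended by itself, False if it had to be stopped.
    """
    finished = query.awaitTermination(timeout_seconds)
    if not finished:
        query.stop()
    return bool(finished)


class FakeQuery:
    """Hypothetical stand-in for a StreamingQuery, used only to exercise the helper."""

    def __init__(self, finishes):
        self._finishes = finishes
        self.stopped = False

    def awaitTermination(self, timeout=None):
        # Pretend the wait is over and report whether the query ended.
        return self._finishes

    def stop(self):
        self.stopped = True


running = FakeQuery(finishes=False)
done = FakeQuery(finishes=True)
running_result = wait_or_stop(running, 1)
done_result = wait_or_stop(done, 1)
```

In the notebook, wait_or_stop(query, 60) would give the stream a minute to process files and then stop it, instead of waiting for a manual cancel.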

### SQL

The SQL syntax for Auto Loader is only used in Lakeflow Declarative Pipelines (Delta Live Tables)

https://docs.databricks.com/aws/en/ingestion/cloud-object-storage/auto-loader/patterns?language=Python

In [0]:
%sql
CREATE OR REFRESH STREAMING TABLE autoloader_sql
AS SELECT *
FROM STREAM read_files(
  '/Volumes/autoloader/json/',
  format => 'json',
  multiline => 'true'
);
  

Batch read of the data in the JSON folder (spark.read, not spark.readStream)

In [0]:
df_json = (spark.read
      .format("json")
      .option("multiLine", True)
      .load("/Volumes/autoloader/json/"))

In [0]:
display(df_json)

Id,age,money,name,sales,units
10,26,1300,Julia,2800,28
11,39,1750,Kevin,5500,55
12,31,1450,Laura,3800,38
10,26,1300,Julia,2800,28
11,39,1750,Kevin,5500,55
12,31,1450,Laura,3800,38
7,30,1100,George,2200,22
8,28,1600,Hannah,4000,40
9,36,2000,Ian,6500,65
1,25,1200,Alice,4500,45
