### Getting Started with Databricks Data Engineering
There are three key Apache Spark interfaces that you should know about: Resilient Distributed Dataset, DataFrame, and Dataset.

- Resilient Distributed Dataset: The first Apache Spark abstraction was the Resilient Distributed Dataset (RDD). It is an interface to a sequence of data objects of one or more types that are distributed across a collection of machines (a cluster). RDDs can be created in a variety of ways and are the “lowest level” API available. Although the RDD is the original data structure for Apache Spark, you should focus on the DataFrame API, which covers most of what RDDs are used for and adds query optimization. The RDD API is available in the Java, Python, and Scala languages.
- DataFrame: A DataFrame is similar in concept to the data frames you may be familiar with from the pandas Python library and the R language. The DataFrame API is available in the Java, Python, R, and Scala languages.
- Dataset: A combination of DataFrame and RDD. It provides the typed interface that is available in RDDs while providing the convenience of the DataFrame. The Dataset API is available in the Java and Scala languages.

In many scenarios, especially with the performance optimizations embedded in DataFrames and Datasets, it will not be necessary to work with RDDs. But it is important to understand the RDD abstraction because:
- The RDD is the underlying infrastructure that allows Spark to run so fast and provide data lineage.
- If you are diving into more advanced components of Spark, it may be necessary to use RDDs.
- The visualizations within the Spark UI reference RDDs.

#### 1.0. Apache Spark File Utilities
Databricks Utilities (`dbutils`) make it easy to perform powerful combinations of tasks. You can use the utilities to work with object storage efficiently, to chain and parameterize notebooks, and to work with secrets. `dbutils` is not supported outside of notebooks.

##### 1.1. List available commands for a utility
To list available commands for a utility along with a short description of each command, run `.help()` after the programmatic name for the utility. For example, the following command lists the available commands for the Databricks File System (DBFS) utility.

In [0]:
dbutils.fs.help()

##### 1.2. List the Contents of a Folder
The following command lists the contents of a folder in the Databricks File System (DBFS). This example enumerates the sample datasets.

In [0]:
display(dbutils.fs.ls('/databricks-datasets'))

path,name,size
dbfs:/databricks-datasets/COVID/,COVID/,0
dbfs:/databricks-datasets/README.md,README.md,976
dbfs:/databricks-datasets/Rdatasets/,Rdatasets/,0
dbfs:/databricks-datasets/SPARK_README.md,SPARK_README.md,3359
dbfs:/databricks-datasets/adult/,adult/,0
dbfs:/databricks-datasets/airlines/,airlines/,0
dbfs:/databricks-datasets/amazon/,amazon/,0
dbfs:/databricks-datasets/asa/,asa/,0
dbfs:/databricks-datasets/atlas_higgs/,atlas_higgs/,0
dbfs:/databricks-datasets/bikeSharing/,bikeSharing/,0
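
In the listing above, folders appear with a trailing `/` in `name` and a `size` of 0. The sketch below separates folders from files on that basis. It runs outside a notebook by using a hypothetical `Entry` tuple as a stand-in for the `path`, `name`, and `size` fields, populated with a few rows copied from the output above; `split_listing` is likewise an illustrative helper, not part of `dbutils`.

```python
from typing import NamedTuple


class Entry(NamedTuple):
    """Stand-in for the path/name/size fields shown in the listing."""
    path: str
    name: str
    size: int


# A few rows copied from the listing output.
entries = [
    Entry("dbfs:/databricks-datasets/COVID/", "COVID/", 0),
    Entry("dbfs:/databricks-datasets/README.md", "README.md", 976),
    Entry("dbfs:/databricks-datasets/Rdatasets/", "Rdatasets/", 0),
    Entry("dbfs:/databricks-datasets/SPARK_README.md", "SPARK_README.md", 3359),
]


def split_listing(items):
    """Separate folders (name ends with '/') from files (name -> size)."""
    folders = [e.name for e in items if e.name.endswith("/")]
    files = {e.name: e.size for e in items if not e.name.endswith("/")}
    return folders, files


folders, files = split_listing(entries)
print(folders)  # ['COVID/', 'Rdatasets/']
print(files)    # {'README.md': 976, 'SPARK_README.md': 3359}
```

In a notebook, the same helper could be fed from the real listing, for example with `[Entry(f.path, f.name, f.size) for f in dbutils.fs.ls('/databricks-datasets')]`.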


##### 1.3. Create a Table based on a Databricks Dataset
The following examples create a table based on a Databricks dataset, first with SQL (as you would in the Databricks SQL query editor) and then with Python in a notebook in Data Science & Engineering or Databricks Machine Learning:

In [0]:
%sql
DROP TABLE IF EXISTS default.people10m;

CREATE TABLE default.people10m
  OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta');
  
SELECT * FROM default.people10m LIMIT 10;

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
3766824,Hisako,Isabella,Malitrott,F,1961-02-12T05:00:00.000+0000,938-80-1874,58862
3766825,Daisy,Merissa,Fibben,F,1998-05-19T04:00:00.000+0000,971-14-3755,66221
3766826,Caren,Blossom,Henner,F,1962-08-06T04:00:00.000+0000,954-19-8973,54376
3766827,Darleen,Gertie,Goodinson,F,1980-03-12T05:00:00.000+0000,981-65-5269,69954
3766828,Kyle,Lu,Habben,F,1974-02-15T04:00:00.000+0000,936-95-3240,56681
3766829,Melia,Kristy,Bonhill,F,1970-09-13T04:00:00.000+0000,960-91-9232,73995
3766830,Yevette,Faye,Bebbell,F,1972-09-07T04:00:00.000+0000,987-72-3701,92888
3766831,Delpha,Kenisha,Gillison,F,1979-06-25T04:00:00.000+0000,962-66-5404,51206
3766832,Mikaela,Jenifer,Hallan,F,1973-05-23T04:00:00.000+0000,911-38-3114,98887
3766833,Cindi,Renita,Cousin,F,1979-03-19T05:00:00.000+0000,666-50-3216,63646


In [0]:
spark.sql("DROP TABLE IF EXISTS default.people10m2")
spark.sql("CREATE TABLE default.people10m2 OPTIONS (PATH 'dbfs:/databricks-datasets/learning-spark-v2/people/people-10m.delta')")

df = spark.sql("SELECT * FROM default.people10m2 LIMIT 10")
display(df)

id,firstName,middleName,lastName,gender,birthDate,ssn,salary
3766824,Hisako,Isabella,Malitrott,F,1961-02-12T05:00:00.000+0000,938-80-1874,58862
3766825,Daisy,Merissa,Fibben,F,1998-05-19T04:00:00.000+0000,971-14-3755,66221
3766826,Caren,Blossom,Henner,F,1962-08-06T04:00:00.000+0000,954-19-8973,54376
3766827,Darleen,Gertie,Goodinson,F,1980-03-12T05:00:00.000+0000,981-65-5269,69954
3766828,Kyle,Lu,Habben,F,1974-02-15T04:00:00.000+0000,936-95-3240,56681
3766829,Melia,Kristy,Bonhill,F,1970-09-13T04:00:00.000+0000,960-91-9232,73995
3766830,Yevette,Faye,Bebbell,F,1972-09-07T04:00:00.000+0000,987-72-3701,92888
3766831,Delpha,Kenisha,Gillison,F,1979-06-25T04:00:00.000+0000,962-66-5404,51206
3766832,Mikaela,Jenifer,Hallan,F,1973-05-23T04:00:00.000+0000,911-38-3114,98887
3766833,Cindi,Renita,Cousin,F,1979-03-19T05:00:00.000+0000,666-50-3216,63646


### 2.0. Structured Streaming
Sensors, IoT devices, social networks, and online transactions all generate data that needs to be monitored constantly and acted upon quickly. As a result, the need for large-scale, real-time stream processing is more evident than ever before. This tutorial module introduces Structured Streaming, the main model for handling streaming datasets in Apache Spark. In Structured Streaming, a data stream is treated as a table that is being continuously appended. This leads to a stream processing model that is very similar to a batch processing model. You express your streaming computation as a standard batch-like query as on a static table, but Spark runs it as an incremental query on the unbounded input table.
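
The idea that a batch-like query can be run incrementally over an ever-growing table can be illustrated with a toy model in plain Python. This is a conceptual sketch only, not how Spark is implemented; the event values and the names `batch_query`, `state`, and `input_table` are invented for the illustration.

```python
from collections import Counter

# Hypothetical micro-batches of "action" values arriving over time.
batches = [
    ["Open", "Open", "Close"],
    ["Open", "Close"],
    ["Close", "Close", "Open"],
]


def batch_query(table):
    """The batch-like query: count rows per action over a whole table."""
    return Counter(table)


state = Counter()   # running result maintained by the incremental query
input_table = []    # the conceptual unbounded input table

for new_rows in batches:
    input_table.extend(new_rows)  # new data is appended to the table
    state.update(new_rows)        # only the new rows are processed
    # The incremental result matches a full batch query over all rows so far.
    assert state == batch_query(input_table)

print(dict(state))  # {'Open': 4, 'Close': 4}
```

The point of the model is that the query author only writes `batch_query`; keeping `state` up to date as rows arrive is the engine's job.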

#### 2.1. Load the Sample Data
The easiest way to get started with Structured Streaming is to use an example Azure Databricks dataset from the `/databricks-datasets` folder, which is accessible within the Azure Databricks workspace. Azure Databricks provides sample event data as files in `/databricks-datasets/structured-streaming/events/` that you can use to build a Structured Streaming application. First, take a look at the contents of this directory.

In [0]:
inputPath = "/databricks-datasets/structured-streaming/events/"
display(dbutils.fs.ls(inputPath))

path,name,size
dbfs:/databricks-datasets/structured-streaming/events/file-0.json,file-0.json,72530
dbfs:/databricks-datasets/structured-streaming/events/file-1.json,file-1.json,72961
dbfs:/databricks-datasets/structured-streaming/events/file-10.json,file-10.json,73025
dbfs:/databricks-datasets/structured-streaming/events/file-11.json,file-11.json,72999
dbfs:/databricks-datasets/structured-streaming/events/file-12.json,file-12.json,72987
dbfs:/databricks-datasets/structured-streaming/events/file-13.json,file-13.json,73006
dbfs:/databricks-datasets/structured-streaming/events/file-14.json,file-14.json,73003
dbfs:/databricks-datasets/structured-streaming/events/file-15.json,file-15.json,73007
dbfs:/databricks-datasets/structured-streaming/events/file-16.json,file-16.json,72978
dbfs:/databricks-datasets/structured-streaming/events/file-17.json,file-17.json,73008


#### 2.2. Initialize the Stream
Since the sample data is just a static set of files, you can emulate a stream from them by reading one file at a time, in the chronological order in which they were created.

In [0]:
from pyspark.sql.types import StructType, StructField, TimestampType, StringType
from pyspark.sql.functions import window

# Define the schema to speed up processing
jsonSchema = StructType([ StructField("time", TimestampType(), True), StructField("action", StringType(), True) ])

streamingInputDF = (
  spark
    .readStream
    .schema(jsonSchema)               # Set the schema of the JSON data
    .option("maxFilesPerTrigger", 1)  # Treat a sequence of files as a stream by picking one file at a time
    .json(inputPath)
)

streamingCountsDF = (
  streamingInputDF
    .groupBy(
      streamingInputDF.action,
      window(streamingInputDF.time, "1 hour"))
    .count()
)
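
The effect of `maxFilesPerTrigger` can be sketched in plain Python as chunking an ordered list of files. This is a simplified model that assumes files are picked oldest first by modification time; the timestamps below are invented, and `micro_batches` is an illustrative helper rather than a Spark API.

```python
def micro_batches(files, max_files_per_trigger=1):
    """Yield lists of file names, oldest first, at most N files per batch."""
    ordered = sorted(files, key=lambda item: item[1])
    for i in range(0, len(ordered), max_files_per_trigger):
        yield [name for name, _ in ordered[i:i + max_files_per_trigger]]


# (name, modification_time) pairs; the times are made up for illustration.
files = [
    ("file-10.json", 110),
    ("file-0.json", 100),
    ("file-11.json", 111),
    ("file-1.json", 101),
]

one_at_a_time = list(micro_batches(files, max_files_per_trigger=1))
print(one_at_a_time)
# [['file-0.json'], ['file-1.json'], ['file-10.json'], ['file-11.json']]

three_at_a_time = list(micro_batches(files, max_files_per_trigger=3))
print(three_at_a_time)
# [['file-0.json', 'file-1.json', 'file-10.json'], ['file-11.json']]
```

With a limit of 1, each trigger sees exactly one new file, which is what makes a static folder behave like a slow stream.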

#### 2.3. Start the Streaming Job
You start a streaming computation by defining a sink and starting it. Here, to query the counts interactively, write the complete set of 1-hour counts to an in-memory table.

In [0]:
query = (
  streamingCountsDF
    .writeStream
    .format("memory")        # memory = store in-memory table (for testing only)
    .queryName("counts")     # counts = name of the in-memory table
    .outputMode("complete")  # complete = all the counts should be in the table
    .start()
)
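
The 1-hour counts stored in the in-memory table are keyed by action and by a tumbling window. The sketch below shows, in plain Python, how a timestamp maps to its 1-hour window, with the window start inclusive and the end exclusive. The events and the `hour_window` helper are hypothetical, and the sketch assumes windows aligned to the top of the hour.

```python
from collections import Counter
from datetime import datetime, timedelta, timezone


def hour_window(ts):
    """Return (start, end) of the 1-hour tumbling window containing ts."""
    start = ts.replace(minute=0, second=0, microsecond=0)
    return start, start + timedelta(hours=1)


utc = timezone.utc
# Hypothetical (time, action) events.
events = [
    (datetime(2016, 7, 26, 2, 45, tzinfo=utc), "Open"),
    (datetime(2016, 7, 26, 3, 0, tzinfo=utc), "Open"),       # start is inclusive
    (datetime(2016, 7, 26, 3, 59, 59, tzinfo=utc), "Close"),
    (datetime(2016, 7, 26, 4, 0, tzinfo=utc), "Close"),      # belongs to the next window
]

# Count events per (action, window end), labelling each window by its end time.
counts = Counter(
    (action, hour_window(ts)[1].strftime("%H:%M")) for ts, action in events
)

for (action, window_end), n in sorted(counts.items(), key=lambda kv: (kv[0][1], kv[0][0])):
    print(action, window_end, n)
```

An event at exactly 03:00 falls in the window ending at 04:00, not the one ending at 03:00, because the end of each window is exclusive.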

#### 2.4. Interactively Query the Stream
While the stream is running, you can query the `counts` table repeatedly to watch the aggregation update:

In [0]:
%sql
SELECT action
  , date_format(window.end, "MMM-dd HH:mm") AS time
  , count
FROM counts
ORDER BY time, action

action,time,count
Close,Jul-26 03:00,11
Open,Jul-26 03:00,179
Close,Jul-26 04:00,344
Open,Jul-26 04:00,1001
Close,Jul-26 05:00,815
Open,Jul-26 05:00,999
Close,Jul-26 06:00,1003
Open,Jul-26 06:00,1000
Close,Jul-26 07:00,1011
Open,Jul-26 07:00,993
