This notebook shows how to create and query a table or DataFrame on Azure Blob Storage.

### Step 1: Set the data location and type

Databricks can read from Azure Blob Storage in two ways: with [account keys](https://docs.databricks.com/spark/latest/data-sources/azure/azure-storage.html#azure-blob-storage) or with shared access signatures (SAS). This notebook uses an account key.

To get started, we need to set the location and type of the file. We can do this using [widgets](https://docs.databricks.com/user-guide/notebooks/widgets.html). Widgets let us parameterize the execution of the entire notebook: we create them first, then reference them throughout the notebook.

In [3]:
dbutils.widgets.text("storage_account_name", "sawstaging", "Storage Account Name")
# Do not hard-code a real key as the default; enter it in the widget or load it from a Databricks secret scope.
dbutils.widgets.text("storage_account_access_key", "", "Storage Access Key / SAS")

In [4]:
dbutils.widgets.text("file_location", "wasbs://example/location", "Upload Location")
dbutils.widgets.dropdown("file_type", "csv", ["csv", "parquet", "json"])

In [5]:
spark.conf.set(
  "fs.azure.account.key."+dbutils.widgets.get("storage_account_name")+".blob.core.windows.net",
  dbutils.widgets.get("storage_account_access_key"))
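
The `file_location` default above is only a placeholder. A real `wasbs://` URL names both the container and the storage account, and the Spark configuration key depends on whether you authenticate with an account key or a SAS token. The following is a minimal sketch of those string patterns; the helper names (`wasbs_url`, `account_key_conf`, `sas_conf`) and the container and account names are hypothetical and not part of the original notebook.

```python
def wasbs_url(container: str, account: str, path: str = "") -> str:
    """Build a wasbs:// URL for a blob path (hypothetical helper)."""
    return f"wasbs://{container}@{account}.blob.core.windows.net/{path.lstrip('/')}"


def account_key_conf(account: str) -> str:
    """Spark conf name for an account access key, as set in the cell above."""
    return f"fs.azure.account.key.{account}.blob.core.windows.net"


def sas_conf(container: str, account: str) -> str:
    """Spark conf name for a container-level SAS token."""
    return f"fs.azure.sas.{container}.{account}.blob.core.windows.net"


# Placeholder names; substitute your own container, account, and path.
print(wasbs_url("mycontainer", "mystorageaccount", "/data/events.csv"))
print(account_key_conf("mystorageaccount"))
print(sas_conf("mycontainer", "mystorageaccount"))
```

To use a SAS token instead of an account key, pass the `sas_conf(...)` name and the token to `spark.conf.set` in place of the account key setting.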

### Step 2: Read the data

Now that we have specified our file metadata, we can create a DataFrame. Notice that we use the `inferSchema` *option* to ask Spark to infer the column types from the file. This matters mainly for CSV, since Parquet files store their own schema. If we already know the schema, we can supply it explicitly instead.

First, let's create a DataFrame in Python.

In [7]:
df = (
  spark.read.format(dbutils.widgets.get("file_type"))
  .option("inferSchema", "true")
  .load(dbutils.widgets.get("file_location"))
)
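
If the schema is already known, it can be passed to the reader instead of being inferred, which avoids an extra pass over the data. The sketch below assumes a hypothetical helper `read_with_schema` and hypothetical column names and types; `DataFrameReader.schema` accepts a DDL-formatted string like the one shown.

```python
def read_with_schema(spark, file_type, location, schema_ddl):
    """Read `location` using an explicit schema rather than inferring one."""
    return spark.read.format(file_type).schema(schema_ddl).load(location)


# Hypothetical column names and types; replace them with your own.
EXAMPLE_SCHEMA = "EXAMPLE_GROUP STRING, EXAMPLE_AGG DOUBLE"

# In the notebook, where `spark` and `dbutils` are predefined:
# df = read_with_schema(
#     spark,
#     dbutils.widgets.get("file_type"),
#     dbutils.widgets.get("file_location"),
#     EXAMPLE_SCHEMA,
# )
```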

### Step 3: Query the data

Now that we have created our DataFrame, we can query it. For instance, we can select particular columns and display them within Databricks.

In [9]:
display(df.select("EXAMPLE_COLUMN"))

### Step 4: (Optional) Create a view or table

If you want to query this data as a table, you can simply register it as a *view* or a table.

In [11]:
df.createOrReplaceTempView("YOUR_TEMP_VIEW_NAME")

We can query this view using Spark SQL. For instance, we can perform a simple aggregation. Notice how we can use `%sql` to query the view from SQL.

In [13]:
%sql

SELECT EXAMPLE_GROUP, SUM(EXAMPLE_AGG) FROM YOUR_TEMP_VIEW_NAME GROUP BY EXAMPLE_GROUP
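
The same aggregation can also be run from a Python cell with `spark.sql`, which returns the result as a DataFrame that can be processed further. This is a sketch under stated assumptions: the helper name `aggregate_view` and the `total` column alias are hypothetical additions, and the placeholder names match those in the SQL cell above.

```python
def aggregate_view(spark, view_name, group_col, agg_col):
    """Run the grouped SUM from the SQL cell and return the result as a DataFrame."""
    # Identifiers are interpolated directly, so pass only trusted names.
    query = (
        f"SELECT {group_col}, SUM({agg_col}) AS total "
        f"FROM {view_name} GROUP BY {group_col}"
    )
    return spark.sql(query)


# In the notebook, where `spark` is predefined:
# result = aggregate_view(spark, "YOUR_TEMP_VIEW_NAME", "EXAMPLE_GROUP", "EXAMPLE_AGG")
# display(result)
```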

A temp view is scoped to the current Spark session, so this data is only available to this particular notebook. If you want other users to be able to query it, you can also create a table from the DataFrame.

In [15]:
df.write.format("parquet").saveAsTable("MY_PERMANENT_TABLE_NAME")

This table persists across cluster restarts, so users in other notebooks can query this data.
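
By default, `saveAsTable` raises an error if the table already exists, so re-running the write cell fails the second time. Setting a save mode makes the behavior explicit. A minimal sketch, assuming a hypothetical helper `save_table`; the mode strings are the standard `DataFrameWriter.mode` values.

```python
def save_table(df, table_name, mode="errorifexists"):
    """Save `df` as a Parquet-backed table with an explicit save mode.

    mode: "errorifexists" (Spark's default), "overwrite", "append", or "ignore".
    """
    df.write.format("parquet").mode(mode).saveAsTable(table_name)


# In the notebook, to replace the table on every run:
# save_table(df, "MY_PERMANENT_TABLE_NAME", mode="overwrite")
```

Use `"overwrite"` with care, since it replaces the existing contents of the table.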