
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Providing Options for External Sources
While directly querying files works well for self-describing formats, many data sources require additional configurations or schema declaration to properly ingest records.

In this lesson, we will create tables using external data sources. While these tables will not yet be stored in the Delta Lake format (and therefore not be optimized for the Lakehouse), this technique helps to facilitate extracting data from diverse external systems.

## Learning Objectives
By the end of this lesson, you should be able to:
- Use Spark SQL to configure options for extracting data from external sources
- Create tables against external data sources for various file formats
- Describe default behavior when querying tables defined against external sources

## Run Setup

The setup script will create the data and declare necessary values for the rest of this notebook to execute.

In [0]:
%run ../Includes/Classroom-Setup-4.2

Python interpreter will be restarted.



Creating the database "dbacademy_chiraggoel_kpmg_com_dewd_4_2"
Skipping install to "dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/source/eltwss", dataset already exists
Creating the sales-csv dataset...(4 seconds / 10,514 records)
Creating the users table...(10 seconds / 251,501 records)

Predefined Paths:
  DA.paths.working_dir: dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/4.2
  DA.paths.user_db:     dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/4.2/4_2.db
  DA.paths.datasets:    dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/source/eltwss

Predefined tables in dbacademy_chiraggoel_kpmg_com_dewd_4_2:
  -none-

Setup completed in 16 seconds


## When Direct Queries Don't Work 

While views can be used to persist direct queries against files between sessions, this approach has limited utility.

CSV files are one of the most common file formats, but a direct query against these files rarely returns the desired results.

In [0]:
%sql
SELECT * FROM csv.`${da.paths.working_dir}/sales-csv`

_c0
order_id|email|transactions_timestamp|total_item_quantity|purchase_revenue_in_usd|unique_items|items
298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|[{'coupon': 'NEWBED10'
299024|msmith@monroe.com|1592636869915092|2|1092.6|2|[{'coupon': 'NEWBED10'
300048|robertstimothy@hotmail.com|1592649862529478|1|1075.5|1|[{'coupon': 'NEWBED10'
298711|lovejamie@yahoo.com|1592631406799948|1|850.5|1|[{'coupon': 'NEWBED10'
301760|jennifer7054@gmail.com|1592661071882666|1|940.5|1|[{'coupon': 'NEWBED10'
302809|ywhite@kane.org|1592665563660982|1|1075.5|1|[{'coupon': 'NEWBED10'
309136|karen61@hotmail.com|1592689638083947|1|1795.5|1|[{'coupon': 'NEWBED10'
303941|deborah18@conrad-gallagher.com|1592669885794924|1|850.5|1|[{'coupon': 'NEWBED10'
305920|khanedwin@gmail.com|1592676863608194|1|1075.5|1|[{'coupon': 'NEWBED10'


We can see from the above that:
1. The header row is being extracted as a table row
1. All columns are being loaded as a single column
1. The file is pipe-delimited (**`|`**)
1. The final column appears to contain nested data that is being truncated
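
These symptoms can be reproduced outside Spark. The sketch below is only an analogy: it uses Python's standard `csv` module rather than Spark's CSV reader, and the sample row is copied from the output above with the `items` value shortened. It shows how a default comma delimiter leaves the pipe-separated fields fused together and cuts the row at the first comma inside the nested data, while an explicit delimiter and header setting recover all seven columns.

```python
import csv
import io

# Two lines shaped like the sales-csv files. The values come from the output
# above; the items array is shortened here for readability (an assumption).
raw = (
    "order_id|email|transactions_timestamp|total_item_quantity|"
    "purchase_revenue_in_usd|unique_items|items\n"
    "298592|sandovalaustin@holder.com|1592629288475307|1|850.5|1|"
    "[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F'}]\n"
)

# Default settings: comma delimiter, header treated as an ordinary row.
default_rows = list(csv.reader(io.StringIO(raw)))
header_as_data = default_rows[0]  # the header line comes back as a data row
first_field = default_rows[1][0]  # everything up to the first comma

# Explicit options: pipe delimiter, first line used as column names.
parsed_rows = list(csv.DictReader(io.StringIO(raw), delimiter="|"))

print(header_as_data)
print(first_field)
print(parsed_rows[0]["items"])
```

With the defaults, `first_field` ends at `'NEWBED10'`, which mirrors the truncated final column in the query result above.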

## Registering Tables on External Data with Read Options

While Spark will extract some self-describing data sources efficiently using default settings, many formats will require declaration of schema or other options.

While there are many <a href="https://docs.databricks.com/spark/latest/spark-sql/language-manual/sql-ref-syntax-ddl-create-table-using.html" target="_blank">additional configurations</a> you can set while creating tables against external sources, the syntax below demonstrates the essentials required to extract data from most formats.

<strong><code>
CREATE TABLE table_identifier (col_name1 col_type1, ...)<br/>
USING data_source<br/>
OPTIONS (key1 = val1, key2 = val2, ...)<br/>
LOCATION path<br/>
</code></strong>

Note that options are passed with keys as unquoted text and values in quotes. Spark supports many <a href="https://docs.databricks.com/data/data-sources/index.html" target="_blank">data sources</a> with custom options, and additional systems may have unofficial support through external <a href="https://docs.databricks.com/libraries/index.html" target="_blank">libraries</a>. 

**NOTE**: Depending on your workspace settings, you may need administrator assistance to load libraries and configure the requisite security settings for some data sources.

The cell below demonstrates using Spark SQL DDL to create a table against an external CSV source, specifying:
1. The column names and types
1. The file format
1. The delimiter used to separate fields
1. The presence of a header
1. The path to where this data is stored

In [0]:
%sql
CREATE TABLE sales_csv
  (order_id LONG, email STRING, transactions_timestamp LONG, total_item_quantity INTEGER, purchase_revenue_in_usd DOUBLE, unique_items INTEGER, items STRING)
USING CSV
OPTIONS (
  header = "true",
  delimiter = "|"
)
LOCATION "${da.paths.working_dir}/sales-csv"

Note that no data has moved during table declaration. Similar to when we directly queried our files and created a view, we are still just pointing to files stored in an external location.
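
When the same kind of declaration is needed for several sources, it can help to see the statement as data: a table name, a column list, a format, a map of options, and a path. The sketch below assembles the DDL text from those parts. `build_create_table_ddl` is a hypothetical helper written for this illustration (not part of Spark or Databricks), the column list is abbreviated, and the location is a placeholder path.

```python
def build_create_table_ddl(table, columns, data_source, options, location):
    """Assemble a CREATE TABLE ... USING ... OPTIONS ... LOCATION statement.

    Hypothetical helper for illustration; it does no escaping or validation.
    """
    column_list = ", ".join(f"{name} {dtype}" for name, dtype in columns)
    option_list = ",\n".join(f'  {key} = "{value}"' for key, value in options.items())
    return (
        f"CREATE TABLE {table}\n"
        f"  ({column_list})\n"
        f"USING {data_source}\n"
        f"OPTIONS (\n{option_list}\n)\n"
        f'LOCATION "{location}"'
    )


ddl = build_create_table_ddl(
    table="sales_csv",
    # Abbreviated column list; the lesson's table declares seven columns.
    columns=[("order_id", "LONG"), ("email", "STRING"), ("items", "STRING")],
    data_source="CSV",
    options={"header": "true", "delimiter": "|"},
    location="/path/to/sales-csv",  # placeholder path (assumption)
)
print(ddl)
```

In a notebook, the resulting string could be passed to `spark.sql(ddl)`. Because the helper does no quoting or validation, it is suitable only for trusted, hand-written inputs.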

Run the following cell to confirm that data is now being loaded correctly.

In [0]:
%sql
SELECT * FROM sales_csv

order_id,email,transactions_timestamp,total_item_quantity,purchase_revenue_in_usd,unique_items,items
298592,sandovalaustin@holder.com,1592629288475307,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]"
299024,msmith@monroe.com,1592636869915092,2,1092.6,2,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}, {'coupon': 'NEWBED10', 'item_id': 'P_DOWN_S', 'item_name': 'Standard Down Pillow', 'item_revenue_in_usd': 107.10000000000001, 'price_in_usd': 119.0, 'quantity': 1}]"
300048,robertstimothy@hotmail.com,1592649862529478,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]"
298711,lovejamie@yahoo.com,1592631406799948,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]"
301760,jennifer7054@gmail.com,1592661071882666,1,940.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_Q', 'item_name': 'Standard Queen Mattress', 'item_revenue_in_usd': 940.5, 'price_in_usd': 1045.0, 'quantity': 1}]"
302809,ywhite@kane.org,1592665563660982,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]"
309136,karen61@hotmail.com,1592689638083947,1,1795.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_K', 'item_name': 'Premium King Mattress', 'item_revenue_in_usd': 1795.5, 'price_in_usd': 1995.0, 'quantity': 1}]"
303941,deborah18@conrad-gallagher.com,1592669885794924,1,850.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_F', 'item_name': 'Standard Full Mattress', 'item_revenue_in_usd': 850.5, 'price_in_usd': 945.0, 'quantity': 1}]"
305920,khanedwin@gmail.com,1592676863608194,1,1075.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_STAN_K', 'item_name': 'Standard King Mattress', 'item_revenue_in_usd': 1075.5, 'price_in_usd': 1195.0, 'quantity': 1}]"
298795,samantha4354@hotmail.com,1592632916516773,1,985.5,1,"[{'coupon': 'NEWBED10', 'item_id': 'M_PREM_T', 'item_name': 'Premium Twin Mattress', 'item_revenue_in_usd': 985.5, 'price_in_usd': 1095.0, 'quantity': 1}]"


In [0]:
%sql
SELECT COUNT(*) FROM sales_csv

count(1)
10510


All the metadata and options passed during table declaration will be persisted to the metastore, ensuring that data in the location will always be read with these options.

**NOTE**: When working with CSVs as a data source, it's important to ensure that column order does not change if additional data files will be added to the source directory. Because the data format does not have strong schema enforcement, Spark will load columns and apply column names and data types in the order specified during table declaration.
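
To make the risk concrete, the sketch below imitates positional schema application with Python's `csv` module. It is an analogy, not Spark's reader: the two file contents are hypothetical, the column list is abbreviated to three columns, and data types are ignored. The header line is skipped and the declared names are applied by position, so a file with swapped columns is read without any error but with values under the wrong names.

```python
import csv
import io

# Column names as declared at table creation (abbreviated to three columns).
declared_columns = ["order_id", "email", "total_item_quantity"]

# Hypothetical files: the second one arrives with the first two columns swapped.
original_file = "order_id|email|total_item_quantity\n298592|sandovalaustin@holder.com|1\n"
reordered_file = "email|order_id|total_item_quantity\nmsmith@monroe.com|299024|2\n"


def read_positionally(text):
    """Skip the header line and label each value by position, ignoring header names."""
    reader = csv.reader(io.StringIO(text), delimiter="|")
    next(reader)
    return [dict(zip(declared_columns, row)) for row in reader]


good = read_positionally(original_file)[0]
bad = read_positionally(reordered_file)[0]
print(good)
print(bad)  # order_id now holds an email address
```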

Running **`DESCRIBE EXTENDED`** on a table will show all of the metadata associated with the table definition.

In [0]:
%sql
DESCRIBE EXTENDED sales_csv

col_name,data_type,comment
order_id,bigint,
email,string,
transactions_timestamp,bigint,
total_item_quantity,int,
purchase_revenue_in_usd,double,
unique_items,int,
items,string,
,,
# Detailed Table Information,,
Database,dbacademy_chiraggoel_kpmg_com_dewd_4_2,


## Limits of Tables with External Data Sources

If you've taken other courses on Databricks or reviewed any of our company literature, you may have heard about Delta Lake and the Lakehouse. Note that whenever we're defining tables or queries against external data sources, we **cannot** expect the performance guarantees associated with Delta Lake and the Lakehouse.

For example: while Delta Lake tables will guarantee that you always query the most recent version of your source data, tables registered against other data sources may represent older cached versions.

The cell below simulates an external system directly updating the files underlying our table.

In [0]:
%python
(spark.table("sales_csv")
      .write.mode("append")
      .format("csv")
      .save(f"{DA.paths.working_dir}/sales-csv"))

If we look at the current count of records in our table, the number we see will not reflect these newly inserted rows.

In [0]:
%sql
SELECT COUNT(*) FROM sales_csv

count(1)
10510


At the time we previously queried this data source, Spark automatically cached the underlying data in local storage. This ensures that on subsequent queries, Spark will provide the optimal performance by just querying this local cache.

Our external data source is not configured to tell Spark that it should refresh this data. 

We **can** manually refresh the cache of our data by running the **`REFRESH TABLE`** command.

In [0]:
%sql
REFRESH TABLE sales_csv

Note that refreshing our table will invalidate our cache, meaning that we'll need to rescan our original data source and pull all data back into memory. 

For very large datasets, this may take a significant amount of time.

In [0]:
%sql
SELECT COUNT(*) FROM sales_csv

count(1)
21016
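
The stale-then-refreshed behavior can be modeled in a few lines of plain Python. This is a toy model under stated assumptions, not a description of how Spark implements caching: `CachedCsvTable` is a hypothetical class that remembers its first row count until `refresh()` is called, and the files are small temporary files created just for the demonstration.

```python
import csv
import tempfile
from pathlib import Path


class CachedCsvTable:
    """Toy model of a table over a directory of CSV files with a cached row count."""

    def __init__(self, directory):
        self.directory = Path(directory)
        self._cached_count = None

    def _scan(self):
        # Read every CSV file in the directory and count its rows.
        total = 0
        for path in sorted(self.directory.glob("*.csv")):
            with path.open(newline="") as handle:
                total += sum(1 for _ in csv.reader(handle, delimiter="|"))
        return total

    def count(self):
        # The first call scans the files; later calls reuse the cached value.
        if self._cached_count is None:
            self._cached_count = self._scan()
        return self._cached_count

    def refresh(self):
        # Invalidate the cache so the next count() rescans the directory.
        self._cached_count = None


with tempfile.TemporaryDirectory() as tmp:
    Path(tmp, "part-0.csv").write_text("1|a\n2|b\n")
    table = CachedCsvTable(tmp)
    before = table.count()

    # An "external system" adds a file behind the table's back.
    Path(tmp, "part-1.csv").write_text("3|c\n")
    stale = table.count()

    table.refresh()
    after = table.count()

print(before, stale, after)
```

The second count still reports the old value because nothing told the table that its directory changed; only the explicit refresh forces a rescan, which is also where the cost for large datasets comes from.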


## Extracting Data from SQL Databases
SQL databases are an extremely common data source, and Databricks has a standard JDBC driver for connecting with many flavors of SQL.

The general syntax for creating these connections is:

<strong><code>
CREATE TABLE &lt;jdbcTable&gt;<br/>
USING JDBC<br/>
OPTIONS (<br/>
&nbsp; &nbsp; url = "jdbc:{databaseServerType}://{jdbcHostname}:{jdbcPort}",<br/>
&nbsp; &nbsp; dbtable = "{jdbcDatabase}.table",<br/>
&nbsp; &nbsp; user = "{jdbcUsername}",<br/>
&nbsp; &nbsp; password = "{jdbcPassword}"<br/>
)
</code></strong>

In the code sample below, we'll connect with <a href="https://www.sqlite.org/index.html" target="_blank">SQLite</a>.
  
**NOTE:** SQLite uses a local file to store a database, and doesn't require a port, username, or password.  
  
<img src="https://files.training.databricks.com/images/icon_warn_24.png"> **WARNING**: The backend configuration of the JDBC server assumes you are running this notebook on a single-node cluster. If you are running on a cluster with multiple workers, the client running in the executors will not be able to connect to the driver.

In [0]:
%sql
DROP TABLE IF EXISTS users_jdbc;

CREATE TABLE users_jdbc
USING JDBC
OPTIONS (
  url = "jdbc:sqlite:/${da.username}_ecommerce.db",
  dbtable = "users"
)

Now we can query this table as if it were defined locally.

In [0]:
%sql
SELECT * FROM users_jdbc

user_id,user_first_touch_timestamp,email
UA000000102357351,1592187804331222,
UA000000102357772,1592196585484760,
UA000000102358075,1592198929755893,
UA000000102358422,1592200681180797,
UA000000102358489,1592200952155132,
UA000000102358495,1592200983001857,
UA000000102358794,1592202111344629,
UA000000102358914,1592202501646714,
UA000000102359033,1592202891556793,
UA000000102359145,1592203235808647,


Looking at the table metadata reveals that we have captured the schema information from the external system. Storage properties (which would include the username and password associated with the connection) are automatically redacted.

In [0]:
%sql
DESCRIBE EXTENDED users_jdbc

col_name,data_type,comment
user_id,string,
user_first_touch_timestamp,"decimal(20,0)",
email,string,
,,
# Detailed Table Information,,
Database,dbacademy_chiraggoel_kpmg_com_dewd_4_2,
Table,users_jdbc,
Owner,root,
Created Time,Sun May 22 12:27:26 UTC 2022,
Last Access,UNKNOWN,
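
Note that the `users_jdbc` declaration contained no column list, so the column names and types had to come from the source system. The sketch below shows the same idea with Python's built-in `sqlite3` module instead of JDBC: the database itself can report its columns. The in-memory database and the declared types are assumptions made for illustration; only the column names are taken from the query output above.

```python
import sqlite3

# In-memory stand-in for the external database. The column names follow the
# query output above; the declared types are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.execute(
    "CREATE TABLE users (user_id TEXT, user_first_touch_timestamp INTEGER, email TEXT)"
)

# PRAGMA table_info returns one row per column:
# (cid, name, type, notnull, dflt_value, pk)
schema = [(row[1], row[2]) for row in conn.execute("PRAGMA table_info(users)")]
conn.close()

print(schema)
```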


While the table is listed as **`MANAGED`**, listing the contents of the specified location confirms that no data is being persisted locally.

In [0]:
%python
jdbc_users_path = f"{DA.paths.user_db}/users_jdbc/"
print(jdbc_users_path)

files = dbutils.fs.ls(jdbc_users_path)
print(f"Found {len(files)} files")

dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/4.2/4_2.db/users_jdbc/
Found 0 files


Note that some SQL systems such as data warehouses will have custom drivers. Spark will interact with various external databases differently, but the two basic approaches can be summarized as either:
1. Moving the entire source table(s) to Databricks and then executing logic on the currently active cluster
1. Pushing down the query to the external SQL database and only transferring the results back to Databricks

In either case, working with very large datasets in external SQL databases can incur significant overhead because of either:
1. Network transfer latency associated with moving all data over the public internet
1. Execution of query logic in source systems not optimized for big data queries
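
The difference between the two approaches can be sketched with Python's built-in `sqlite3` module standing in for an external SQL database. The table contents are made up for the example, and the sketch says nothing about which approach Spark chooses for a given source. It only contrasts transferring every row and filtering on the client with sending the filter to the database.

```python
import sqlite3

# In-memory stand-in for an external SQL database (sample rows are made up).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE users (user_id TEXT, email TEXT)")
conn.executemany(
    "INSERT INTO users VALUES (?, ?)",
    [("UA1", None), ("UA2", "a@example.com"), ("UA3", None), ("UA4", "b@example.com")],
)

# Approach 1: transfer every row, then filter on the client.
all_rows = conn.execute("SELECT user_id, email FROM users").fetchall()
filtered_locally = [row for row in all_rows if row[1] is not None]

# Approach 2: push the filter down so only matching rows are transferred.
pushed_down = conn.execute(
    "SELECT user_id, email FROM users WHERE email IS NOT NULL"
).fetchall()

conn.close()
print(len(all_rows), len(pushed_down))
```

Both approaches produce the same matching rows, but the first moves the whole table across the connection while the second moves only the result and leaves the filtering work to the source system.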

Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python
DA.cleanup()

Dropping the database "dbacademy_chiraggoel_kpmg_com_dewd_4_2"
Removing the working directory "dbfs:/user/chiraggoel@kpmg.com/dbacademy/dewd/4.2"


&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>