
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Setting Up Tables
Managing database and table metadata, locations, and configurations at the beginning of a project can help to increase data security, discoverability, and performance.

## Learning Objectives
By the end of this notebook, students will be able to:
- Set database locations
- Specify database comments
- Set table locations
- Specify table comments
- Specify column comments
- Use table properties for custom tagging
- Explore table metadata

## Setup Variables

The following script clears out previous runs of this demo and configures some Hive variables that will be used in our SQL queries.

In [0]:
%run ../Includes/sql-setup $lesson="demo" $mode="reset"

## Using Hive Variables

While not a pattern that is generally recommended in Spark SQL, this notebook will use some Hive variables to substitute in string values derived from the account email of the current user.

The following cell demonstrates this pattern.

In [0]:
%sql
SELECT "${c.database}";

Using this syntax is identical to typing the associated string value into a SQL query.
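
As a rough illustration of what "identical to typing the string value" means, the sketch below mimics the substitution in plain Python. It is not how Spark implements variable substitution; the function name `substitute_variables` and the database name are hypothetical and chosen only for this example.

```python
import re

def substitute_variables(query: str, variables: dict) -> str:
    """Replace each ${name} placeholder with its string value, as plain text."""
    def lookup(match):
        name = match.group(1)
        # Leave unknown placeholders untouched rather than guessing a value
        return variables.get(name, match.group(0))
    return re.sub(r"\$\{([^}]+)\}", lookup, query)

# Hypothetical value; in this notebook the real one is set by the setup script
variables = {"c.database": "dbacademy_user_demo"}

print(substitute_variables('SELECT "${c.database}"', variables))
print(substitute_variables("DROP DATABASE IF EXISTS ${c.database} CASCADE", variables))
```

Because the substitution is purely textual, the resulting query is only as safe and as valid as the string that gets pasted in.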

Run the following to make sure no database with the provided name exists.

In [0]:
%sql
DROP DATABASE IF EXISTS ${c.database} CASCADE

## Creating a Database with Options

The following cell demonstrates the syntax for creating a database while:
1. Setting a database comment
1. Specifying a database location
1. Adding an arbitrary key-value pair as a database property

This demo uses an arbitrary directory on the DBFS root for the location. In any stage of development or production, it is best practice to create databases in secure cloud object storage with credentials locked down to appropriate teams within the organization.

**NOTE**: Remember that by default, all managed tables will be created within the directory declared as the location when a database is created.

In [0]:
%sql
CREATE DATABASE ${c.database}
COMMENT "This is a test database"
LOCATION "${c.userhome}"
WITH DBPROPERTIES (contains_pii = true)

All of the comments and properties set during database declaration can be reviewed using `DESCRIBE DATABASE EXTENDED`.

This information can aid in data discovery, auditing, and governance. Having proactive rules about how databases will be created and tagged can help prevent accidental data exfiltration, redundancies, and deletions.

In [0]:
%sql
DESCRIBE DATABASE EXTENDED ${c.database}

## Creating a Table with Options
The following cell demonstrates creating a **managed** Delta Lake table while:
1. Setting a column comment
1. Setting a table comment
1. Adding an arbitrary key-value pair as a table property

**NOTE**: A number of Delta Lake configurations are also set using `TBLPROPERTIES`. When using this field as part of an organizational approach to data discovery and auditing, users should be made aware of which keys are leveraged for modifying default Delta Lake behaviors.

In [0]:
%sql
CREATE TABLE ${c.database}.pii_test
(id INT, name STRING COMMENT "PII")
COMMENT "Contains PII"
TBLPROPERTIES ('contains_pii' = true)
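
Because custom tags and Delta Lake settings share the same `TBLPROPERTIES` field, one possible safeguard is to generate the clause from a dictionary and reject keys that look like Delta Lake configurations. The sketch below assumes that Delta Lake keys are the ones prefixed with `delta.`; the helper `build_tblproperties` and the `owner_team` tag are hypothetical, and values are assumed not to contain quote characters.

```python
def build_tblproperties(tags: dict) -> str:
    """Render custom tags as a TBLPROPERTIES clause, rejecting Delta Lake keys."""
    # Assumption: keys that change Delta Lake behavior start with "delta."
    reserved = [key for key in tags if key.startswith("delta.")]
    if reserved:
        raise ValueError(f"Keys reserved for Delta Lake configuration: {reserved}")
    # Sort keys so the generated clause is deterministic
    pairs = ", ".join(f"'{key}' = '{value}'" for key, value in sorted(tags.items()))
    return f"TBLPROPERTIES ({pairs})"

# Hypothetical tags, for illustration only
print(build_tblproperties({"contains_pii": "true", "owner_team": "analytics"}))
```

A check like this keeps tagging conventions separate from settings that alter table behavior, at the cost of requiring a distinct, deliberate path for anyone who does need to set a `delta.` property.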

Much like the command for reviewing database metadata settings, `DESCRIBE EXTENDED` allows users to see all of the comments and properties for a given table.

**NOTE**: Delta Lake automatically adds several table properties on table creation.

In [0]:
%sql
DESCRIBE EXTENDED ${c.database}.pii_test

The cell below repeats the `CREATE TABLE` statement for `pii_test` and adds a location, which creates an **external** table.

**NOTE**: The only thing that differentiates managed and external tables is this location setting. Performance of managed and external tables should be equivalent with regard to latency, but the results of some SQL DDL statements differ drastically. For example, `DROP TABLE` on a managed table deletes the underlying data files, while on an external table it removes only the metastore entry and leaves the files in place.

In [0]:
%sql
CREATE TABLE ${c.database}.pii_test_2
(id INT, name STRING COMMENT "PII")
COMMENT "Contains PII"
LOCATION "${c.userhome}/pii_test_2"
TBLPROPERTIES ('contains_pii' = true)

As expected, the only differences in the extended description of the table have to do with the table location and type.

In [0]:
%sql
DESCRIBE EXTENDED ${c.database}.pii_test_2

## Using Table Metadata

Assuming that rules are followed consistently when creating databases and tables, comments, table properties, and other metadata can be queried programmatically to discover datasets for governance and auditing purposes.

The Python code below demonstrates parsing the table properties field and filtering out the options that specifically control Delta Lake behavior (keys prefixed with `delta.`). Logic could be written to further parse the remaining properties to identify all tables in a database that contain PII.

In [0]:
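def parse_table_keys(database):
    """Map each table in `database` to its non-Delta table properties."""
    table_keys = {}
    for table in spark.sql(f"SHOW TABLES IN {database}").collect():
        table_name = table[1]  # second column of SHOW TABLES is the table name
        # The "Table Properties" row holds a string such as "[key1=value1,key2=value2]"
        properties = (spark.sql(f"DESCRIBE EXTENDED {database}.{table_name}")
                      .filter("col_name = 'Table Properties'")
                      .collect()[0][1])
        key_values = properties[1:-1].split(",")
        # Drop keys that configure Delta Lake behavior (prefixed with "delta.")
        table_keys[table_name] = [kv for kv in key_values if not kv.startswith("delta.")]
    return table_keys

parse_table_keys(database)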
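
As a minimal sketch of the further parsing mentioned above, the function below takes a dictionary shaped like the output of `parse_table_keys` and returns the tables tagged with `contains_pii=true`. The function name `tables_with_pii` and the sample data (including the `public_lookup` table) are hypothetical; it assumes each entry is a `key=value` string.

```python
def tables_with_pii(table_keys: dict) -> list:
    """Return names of tables whose properties include contains_pii=true."""
    flagged = []
    for table_name, key_values in table_keys.items():
        # Each entry looks like "key=value"; split on the first "=" only
        properties = dict(kv.strip().split("=", 1) for kv in key_values if "=" in kv)
        if properties.get("contains_pii", "").lower() == "true":
            flagged.append(table_name)
    return sorted(flagged)

# Hypothetical output of parse_table_keys, for illustration only
sample = {
    "pii_test": ["contains_pii=true"],
    "pii_test_2": ["contains_pii=true"],
    "public_lookup": ["contains_pii=false"],
}
print(tables_with_pii(sample))
```

With the sample data this prints `['pii_test', 'pii_test_2']`. In the notebook, the same function could be applied to `parse_table_keys(database)` instead of the sample dictionary.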

&copy; 2021 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="http://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="http://help.databricks.com/">Support</a>