
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

<i18n value="731b610a-2018-40a2-8eae-f6f01ae7a788"/>


# Schemas and Tables on Databricks
In this demonstration, you will create and explore schemas and tables.

## Learning Objectives
By the end of this lesson, you should be able to:
* Use Spark SQL DDL to define schemas and tables
* Describe how the **`LOCATION`** keyword impacts the default storage directory



**Resources**
* <a href="https://docs.databricks.com/user-guide/tables.html" target="_blank">Schemas and Tables - Databricks Docs</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#managed-and-unmanaged-tables" target="_blank">Managed and Unmanaged Tables</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-table-using-the-ui" target="_blank">Creating a Table with the UI</a>
* <a href="https://docs.databricks.com/user-guide/tables.html#create-a-local-table" target="_blank">Create a Local Table</a>
* <a href="https://spark.apache.org/docs/latest/sql-data-sources-load-save-functions.html#saving-to-persistent-tables" target="_blank">Saving to Persistent Tables</a>

<i18n value="10b2fb72-8534-4903-98a1-26716350dd20"/>


## Lesson Setup
The following script clears out previous runs of this demo and configures some Hive variables that will be used in our SQL queries.

In [0]:
%run ../Includes/Classroom-Setup-03.1

Python interpreter will be restarted.
Python interpreter will be restarted.


Resetting the learning environment:
| dropping the schema "munirsheikhcloudseekho_0lj9_da_dewd"...(3 seconds)
| removing the working directory "dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks"...(0 seconds)

Skipping install of existing datasets to "dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02"

Validating the locally installed datasets:
| listing local files...(6 seconds)
| completed (7 seconds total)

Using the "default" schema.

Predefined paths variables:
| DA.paths.working_dir: dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks
| DA.paths.user_db:     dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/database.db
| DA.paths.datasets:    dbfs:/mnt/dbacademy-datasets/data-engineering-with-databricks/v02

Setup completed (11 seconds)


<i18n value="1cbf441b-a62f-4202-af2a-677d37a598b2"/>


## Using Hive Variables

While not a pattern that is generally recommended in Spark SQL, this notebook will use some Hive variables to substitute in string values derived from the account email of the current user.

The following cell demonstrates this pattern.

In [0]:
%sql
SELECT "${da.schema_name}" AS schema_name,
       "${da.paths.working_dir}" AS working_dir

schema_name,working_dir
munirsheikhcloudseekho_0lj9_da_dewd,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks


<i18n value="014c9f3d-ffd0-48b8-989e-b80b2568d642"/>


Because you may be working in a shared workspace, this course uses variables derived from your username so the schemas don't conflict with other users. Again, consider this use of Hive variables a hack for our lesson environment rather than a good practice for development.
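
To make the substitution pattern concrete, here is a minimal plain-Python sketch of what textual **`${...}`** substitution amounts to: each placeholder is swapped for its string value before the statement is parsed. This is only an illustration, not how Spark implements it, and the variable values and the `substitute` helper are hypothetical stand-ins for what the setup script configures.

```python
import re

# Hypothetical values standing in for the variables set by the classroom setup script.
variables = {
    "da.schema_name": "example_user_da_dewd",
    "da.paths.working_dir": "dbfs:/mnt/example/working_dir",
}


def substitute(sql, variables):
    """Replace each ${name} placeholder with its string value."""
    def lookup(match):
        name = match.group(1)
        if name not in variables:
            raise KeyError(f"undefined variable: {name}")
        return variables[name]

    return re.sub(r"\$\{([^}]+)\}", lookup, sql)


query = "CREATE SCHEMA IF NOT EXISTS ${da.schema_name}_default_location"
print(substitute(query, variables))
# prints: CREATE SCHEMA IF NOT EXISTS example_user_da_dewd_default_location
```

Because the replacement is purely textual, the substituted value becomes part of the identifier, which is how each user ends up with uniquely named schemas.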

<i18n value="ff022f79-7f38-47ea-809e-537cf00526d0"/>

 
## Schemas
Let's start by creating two schemas:
- One with no **`LOCATION`** specified
- One with **`LOCATION`** specified

In [0]:
%sql
CREATE SCHEMA IF NOT EXISTS ${da.schema_name}_default_location;
CREATE SCHEMA IF NOT EXISTS ${da.schema_name}_custom_location LOCATION '${da.paths.working_dir}/_custom_location.db';

<i18n value="4eff4961-9de3-4d5d-836e-cc48862ef4e6"/>

 
Note that the first schema is stored in the default location under **`dbfs:/user/hive/warehouse/`** and that the schema directory is the name of the schema with the **`.db`** extension.

In [0]:
%sql
DESCRIBE SCHEMA EXTENDED ${da.schema_name}_default_location;

database_description_item,database_description_value
Catalog Name,spark_catalog
Namespace Name,munirsheikhcloudseekho_0lj9_da_dewd_default_location
Comment,
Location,dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db
Owner,root
Properties,


<i18n value="58292139-abd2-453b-b327-9ec2ab76dd0a"/>


Note that the location of the second schema is in the directory specified after the **`LOCATION`** keyword.

In [0]:
%sql
DESCRIBE SCHEMA EXTENDED ${da.schema_name}_custom_location;

database_description_item,database_description_value
Catalog Name,spark_catalog
Namespace Name,munirsheikhcloudseekho_0lj9_da_dewd_custom_location
Comment,
Location,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db
Owner,root
Properties,
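
The rule seen in the two **`DESCRIBE SCHEMA EXTENDED`** outputs can be summarized in a small plain-Python sketch. The `schema_location` function and the schema names below are hypothetical; the warehouse root is the one shown in the output for this workspace and may differ elsewhere.

```python
from typing import Optional

# Default warehouse root, as shown in the DESCRIBE SCHEMA output of this lesson.
WAREHOUSE_ROOT = "dbfs:/user/hive/warehouse"


def schema_location(schema_name: str, location: Optional[str] = None) -> str:
    """Return the storage directory of a schema.

    If a LOCATION was given, that path is used as-is; otherwise the
    directory is <warehouse root>/<schema name>.db.
    """
    if location is not None:
        return location
    return f"{WAREHOUSE_ROOT}/{schema_name}.db"


# Hypothetical schema names and paths, for illustration only.
print(schema_location("demo_default_location"))
# prints: dbfs:/user/hive/warehouse/demo_default_location.db
print(schema_location("demo_custom_location", "dbfs:/mnt/demo/_custom_location.db"))
# prints: dbfs:/mnt/demo/_custom_location.db
```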


<i18n value="d794ab19-e4e8-4f5c-b784-385ac7c27bc2"/>

 
We will create a table in the schema with the default location and insert data.

Note that the column definitions must be provided because there is no data from which to infer the table's schema.

In [0]:
%sql
USE ${da.schema_name}_default_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_default_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_default_location 
VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_default_location;

width,length,height
3,2,1


<i18n value="17403d69-25b1-44d5-b37f-bab7c091a01b"/>

 
We can use **`DESCRIBE DETAIL`** to find the table's location (see the **`location`** column in the results).

In [0]:
%sql
DESCRIBE DETAIL managed_table_in_db_with_default_location;

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,443dc979-6bae-464e-a7a7-2d48775ec0e3,spark_catalog.munirsheikhcloudseekho_0lj9_da_dewd_default_location.managed_table_in_db_with_default_location,,dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db/managed_table_in_db_with_default_location,2022-11-13T04:45:11.592+0000,2022-11-13T04:45:17.000+0000,List(),1,1045,Map(),1,2


<i18n value="71f3a626-a3d4-48a6-8489-6c9cffd021fc"/>


By default, managed tables in a schema without the location specified will be created in the **`dbfs:/user/hive/warehouse/<schema_name>.db/`** directory.

We can see that, as expected, the data and metadata for our Delta Table are stored in that location.

In [0]:
%python 
hive_root = "dbfs:/user/hive/warehouse"
schema_name = f"{DA.schema_name}_default_location.db"  # schema directory name, including the .db suffix
table_name = "managed_table_in_db_with_default_location"

tbl_location = f"{hive_root}/{schema_name}/{table_name}"
print(tbl_location)

# List the table directory: the _delta_log/ folder plus the Parquet data files
files = dbutils.fs.ls(tbl_location)
display(files)

dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db/managed_table_in_db_with_default_location


path,name,size,modificationTime
dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db/managed_table_in_db_with_default_location/_delta_log/,_delta_log/,0,0
dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db/managed_table_in_db_with_default_location/part-00000-30147936-04e9-4f5f-a541-cdf71b58b1a7-c000.snappy.parquet,part-00000-30147936-04e9-4f5f-a541-cdf71b58b1a7-c000.snappy.parquet,1045,1668314717000
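
To separate the metadata from the data in a listing like the one above, the entries can be split by name: the **`_delta_log/`** directory holds the table's transaction log, and the **`.parquet`** files hold the rows. The sketch below is plain Python over names copied from that listing; `split_delta_listing` is a hypothetical helper, not part of any Databricks API.

```python
# File names copied from the directory listing above.
names = [
    "_delta_log/",
    "part-00000-30147936-04e9-4f5f-a541-cdf71b58b1a7-c000.snappy.parquet",
]


def split_delta_listing(names):
    """Separate the transaction log directory from the Parquet data files."""
    log_dirs = [n for n in names if n.rstrip("/") == "_delta_log"]
    data_files = [n for n in names if n.endswith(".parquet")]
    return log_dirs, data_files


log_dirs, data_files = split_delta_listing(names)
print(log_dirs)         # prints: ['_delta_log/']
print(len(data_files))  # prints: 1
```

In a notebook, the same names could be taken from the listing itself, for example with `[f.name for f in dbutils.fs.ls(tbl_location)]`.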


<i18n value="ff92a2d3-9bf0-45d0-b78a-c25638ab9479"/>

 
Drop the table.

In [0]:
%sql
DROP TABLE managed_table_in_db_with_default_location;

<i18n value="e9c2d161-c157-4d67-8b8d-dbd3d89b6460"/>

 
Note that the table's directory, along with its log and data files, has been deleted. Only the schema directory remains.

In [0]:
%python 

db_location = f"{hive_root}/{schema_name}"
print(db_location)
dbutils.fs.ls(db_location)

dbfs:/user/hive/warehouse/munirsheikhcloudseekho_0lj9_da_dewd_default_location.db
Out[9]: []

<i18n value="bd185ea7-cd88-4453-a77a-1babe4633451"/>

 
We now create a table in the schema with the custom location and insert data.

Note that the column definitions must be provided because there is no data from which to infer the table's schema.

In [0]:
%sql
USE ${da.schema_name}_custom_location;

CREATE OR REPLACE TABLE managed_table_in_db_with_custom_location (width INT, length INT, height INT);
INSERT INTO managed_table_in_db_with_custom_location VALUES (3, 2, 1);
SELECT * FROM managed_table_in_db_with_custom_location;

width,length,height
3,2,1


<i18n value="68e86e08-9400-428d-9c56-d47439af7dff"/>

 
Again, we'll look at the description to find the table location.

In [0]:
%sql
DESCRIBE DETAIL managed_table_in_db_with_custom_location;

format,id,name,description,location,createdAt,lastModified,partitionColumns,numFiles,sizeInBytes,properties,minReaderVersion,minWriterVersion
delta,9f052df8-5a85-4349-9032-007232165779,spark_catalog.munirsheikhcloudseekho_0lj9_da_dewd_custom_location.managed_table_in_db_with_custom_location,,dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db/managed_table_in_db_with_custom_location,2022-11-13T04:48:27.044+0000,2022-11-13T04:48:32.000+0000,List(),1,1045,Map(),1,2


<i18n value="878787b3-1178-44d1-a775-0bcd6c483184"/>

 
As expected, this managed table is created in the path specified with the **`LOCATION`** keyword during schema creation. As such, the data and metadata for the table are persisted in a directory here.

In [0]:
%python 

table_name = "managed_table_in_db_with_custom_location"
tbl_location = f"{DA.paths.working_dir}/_custom_location.db/{table_name}"
print(tbl_location)

files = dbutils.fs.ls(tbl_location)
display(files)

dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db/managed_table_in_db_with_custom_location


path,name,size,modificationTime
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db/managed_table_in_db_with_custom_location/_delta_log/,_delta_log/,0,0
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db/managed_table_in_db_with_custom_location/part-00000-fc55d113-1a2f-421d-a268-6a0e0db6f68d-c000.snappy.parquet,part-00000-fc55d113-1a2f-421d-a268-6a0e0db6f68d-c000.snappy.parquet,1045,1668314912000


<i18n value="699d9cda-0276-4d93-bf8c-5e1d370ce113"/>

 
Let's drop the table.

In [0]:
%sql
DROP TABLE managed_table_in_db_with_custom_location;

<i18n value="c87c1801-0101-4378-9f52-9a8d052a38e1"/>

 
Note that the table's folder, along with its log and data files, has been deleted. Only the schema directory remains.

In [0]:
%python 

db_location = f"{DA.paths.working_dir}/_custom_location.db"
print(db_location)

dbutils.fs.ls(db_location)

dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/_custom_location.db
Out[11]: []

<i18n value="67fd15cf-0ca9-4e76-8806-f24c60d324b1"/>

 
## Tables
We will create an external (unmanaged) table from sample data. 

The data we are going to use are in CSV format. We want to create a Delta table with a **`LOCATION`** provided in the directory of our choice.

In [0]:
%sql
USE ${da.schema_name}_default_location;

CREATE OR REPLACE TEMPORARY VIEW temp_delays USING CSV OPTIONS (
  path = '${DA.paths.datasets}/flights/departuredelays.csv',
  header = "true",
  mode = "FAILFAST" -- abort file parsing with a RuntimeException if any malformed lines are encountered
);
CREATE OR REPLACE TABLE external_table LOCATION '${da.paths.working_dir}/external_table' AS
  SELECT * FROM temp_delays;

SELECT * FROM external_table;

date,delay,distance,origin,destination
1011245,6,602,ABE,ATL
1020600,-8,369,ABE,DTW
1021245,-2,602,ABE,ATL
1020605,-4,602,ABE,ATL
1031245,-4,602,ABE,ATL
1030605,0,602,ABE,ATL
1041243,10,602,ABE,ATL
1040605,28,602,ABE,ATL
1051245,88,602,ABE,ATL
1050605,9,602,ABE,ATL
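
The **`FAILFAST`** mode used in the view definition above stops at the first malformed line instead of skipping it or filling in nulls. The following plain-Python sketch illustrates that fail-fast idea with the standard `csv` module. It is not Spark's parser: the `parse_failfast` helper is hypothetical, and "malformed" is simplified here to mean "wrong number of fields".

```python
import csv
import io


def parse_failfast(text):
    """Parse CSV text, raising on the first row whose field count differs from the header."""
    reader = csv.reader(io.StringIO(text))
    header = next(reader)
    rows = []
    for line_number, row in enumerate(reader, start=2):
        if len(row) != len(header):
            raise ValueError(f"malformed line {line_number}: {row}")
        rows.append(dict(zip(header, row)))
    return rows


# Rows copied from the query output above.
good = (
    "date,delay,distance,origin,destination\n"
    "1011245,6,602,ABE,ATL\n"
    "1020600,-8,369,ABE,DTW\n"
)
rows = parse_failfast(good)
print(len(rows))  # prints: 2

# A made-up input whose second data row is missing fields.
bad = "date,delay,distance,origin,destination\n1011245,6,602,ABE,ATL\n1020600,-8\n"
try:
    parse_failfast(bad)
except ValueError as error:
    print(error)  # prints: malformed line 3: ['1020600', '-8']
```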


<i18n value="367720a7-b738-4782-8f42-571b522c95c2"/>

 
Let's note the location of the table's data in this lesson's working directory.

In [0]:
%sql
DESCRIBE TABLE EXTENDED external_table;

col_name,data_type,comment
date,string,
delay,string,
distance,string,
origin,string,
destination,string,
,,
# Detailed Table Information,,
Catalog,spark_catalog,
Database,munirsheikhcloudseekho_0lj9_da_dewd_default_location,
Table,external_table,


<i18n value="3267ab86-f8fe-40dc-aa52-44aecf8d8fc1"/>

 
Now, we drop the table.

In [0]:
%sql
DROP TABLE external_table;

<i18n value="b9b3c493-3a09-4fdb-9615-1e8c56824b12"/>

 
The table definition no longer exists in the metastore, but the underlying data remain intact.

In [0]:
%python 
tbl_path = f"{DA.paths.working_dir}/external_table"
files = dbutils.fs.ls(tbl_path)
display(files)

path,name,size,modificationTime
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/_delta_log/,_delta_log/,0,0
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00000-fcc74a1f-29ee-4d29-8b82-75f2230a3044-c000.snappy.parquet,part-00000-fcc74a1f-29ee-4d29-8b82-75f2230a3044-c000.snappy.parquet,885537,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00001-6176af38-946d-45e6-9f30-1a5826e1e21a-c000.snappy.parquet,part-00001-6176af38-946d-45e6-9f30-1a5826e1e21a-c000.snappy.parquet,895728,1668315048000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00002-f8a6a5e7-8bb0-46b4-87c3-b9261977596b-c000.snappy.parquet,part-00002-f8a6a5e7-8bb0-46b4-87c3-b9261977596b-c000.snappy.parquet,993397,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00003-a1ae0fdc-9155-4f9e-b5da-826502f0897e-c000.snappy.parquet,part-00003-a1ae0fdc-9155-4f9e-b5da-826502f0897e-c000.snappy.parquet,910950,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00004-1e72bf6c-da5d-4e54-8754-c096f503a857-c000.snappy.parquet,part-00004-1e72bf6c-da5d-4e54-8754-c096f503a857-c000.snappy.parquet,976314,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00005-f75f3060-3d55-4357-b105-b669ffc46e8f-c000.snappy.parquet,part-00005-f75f3060-3d55-4357-b105-b669ffc46e8f-c000.snappy.parquet,914029,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00006-ef03443d-bc5a-4b33-a090-d2407b60a189-c000.snappy.parquet,part-00006-ef03443d-bc5a-4b33-a090-d2407b60a189-c000.snappy.parquet,900684,1668315049000
dbfs:/mnt/dbacademy-users/munirsheikhcloudseekho@gmail.com/data-engineering-with-databricks/external_table/part-00007-faa6de6d-967c-4054-a15d-e20e952d15f6-c000.snappy.parquet,part-00007-faa6de6d-967c-4054-a15d-e20e952d15f6-c000.snappy.parquet,119732,1668315039000
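
The difference between dropping a managed table and dropping an external table can be mimicked locally. The sketch below uses a toy in-memory "metastore" and temporary directories. It is only a simulation of the behavior observed in this lesson, not how Spark or the Hive metastore is implemented, and every name in it is hypothetical.

```python
import shutil
import tempfile
from pathlib import Path


class ToyMetastore:
    """A toy stand-in for a metastore, used only to illustrate DROP TABLE."""

    def __init__(self):
        self.tables = {}  # table name -> (directory, is_managed)

    def create_table(self, name, directory, managed):
        directory.mkdir(parents=True, exist_ok=True)
        (directory / "part-00000.parquet").write_text("placeholder data")
        self.tables[name] = (directory, managed)

    def drop_table(self, name):
        # The table definition is always removed from the metastore.
        directory, managed = self.tables.pop(name)
        # Only managed tables have their files deleted as well.
        if managed:
            shutil.rmtree(directory)


root = Path(tempfile.mkdtemp())
store = ToyMetastore()
managed_dir = root / "warehouse" / "demo.db" / "managed_table"
external_dir = root / "working_dir" / "external_table"
store.create_table("managed_table", managed_dir, managed=True)
store.create_table("external_table", external_dir, managed=False)

store.drop_table("managed_table")
store.drop_table("external_table")

managed_exists = managed_dir.exists()
external_exists = external_dir.exists()
print(managed_exists)   # prints: False
print(external_exists)  # prints: True

shutil.rmtree(root)  # remove the temporary directories used by this sketch
```

Both table definitions are gone from the toy metastore, but only the managed table's files were removed, which matches what the directory listings in this lesson show.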


<i18n value="c456ac65-ab0b-435a-ae00-acbde5048a96"/>

 
## Clean up
Drop both schemas.

In [0]:
%sql
DROP SCHEMA ${da.schema_name}_default_location CASCADE;
DROP SCHEMA ${da.schema_name}_custom_location CASCADE;

<i18n value="6fa204d5-12ff-4ede-9fe1-871a346052c4"/>

 
Run the following cell to delete the tables and files associated with this lesson.

In [0]:
%python 
DA.cleanup()

&copy; 2022 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>