
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# 2.2 Demo - Programmatic Exploration and Data Ingestion to Unity Catalog

In this demonstration, we will programmatically explore our data objects, display a raw CSV file from a volume, then read the CSV file from the volume and create a table.

### Objectives
- Apply programmatic techniques to view data objects in our environment.
- Demonstrate how to upload a CSV file into a volume in Databricks.
- Demonstrate how to use a `CREATE TABLE` statement to create a table from a CSV file in a Databricks volume.

## REQUIRED - SELECT A SHARED SQL WAREHOUSE

Before executing cells in this notebook, please select the **SHARED SQL WAREHOUSE** in the lab. Follow these steps:

1. Navigate to the top-right of this notebook and click the drop-down to select compute (it might say **Connect**). Complete one of the following:

   a. Under **Recent resources**, check to see if you have a **shared_warehouse SQL**. If you do, select it.

   b. If you do not have a **shared_warehouse** under **Recent resources**, complete the following:

    - In the same drop-down, select **More**.

    - Then select the **SQL Warehouse** button.

    - In the drop-down, make sure **shared_warehouse** is selected.

    - Then, at the bottom of the pop-up, select **Start and attach**.

<br></br>
   <img src="../Includes/images/sql_warehouse.png" alt="SQL Warehouse" width="600">

## A. Classroom Setup

Run the following cell to configure your working environment for this notebook.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course in the lab environment.

### IMPORTANT LAB INFORMATION

Recall that your lab setup is created with the [0 - REQUIRED - Course Setup and Data Discovery]($../0 - REQUIRED - Course Setup and Data Discovery) notebook. If you end your lab session or if your session times out, your environment will be reset, and you will need to rerun the Course Setup notebook.

In [0]:
%run ../Includes/2.2-Classroom-Setup

## B. Programmatically Exploring Your Environment

In this section, we will demonstrate how to programmatically explore your environment, an alternative method to using the Catalog Explorer.

1. View your available catalogs using the `SHOW CATALOGS` statement. Notice that your environment has a series of catalogs available.

In [0]:
SHOW CATALOGS;

2. Run the following cell to view your default catalog and schema. You should notice that your default catalog is set to **samples** and your default schema is set to **nyctaxi**.

   **NOTE:** Setting the default catalog and schema in Databricks allows you to avoid repeatedly typing the full path (catalog.schema.table) when referring to your data objects. Once set, Databricks will automatically use your chosen catalog and schema, making it easier and faster to work with your data without needing to specify the full namespace each time.


In [0]:
SELECT current_catalog(), current_schema()

3. Use the `SHOW SCHEMAS` statement to view available schemas within the default catalog. Notice that it displays schemas (databases) within the **samples** catalog since that is your current default catalog.

In [0]:
SHOW SCHEMAS;

4. You can modify the `SHOW SCHEMAS` statement to specify a specific catalog, like the **dbacademy** catalog. Notice that this displays available schemas (databases) within the **dbacademy** catalog.

In [0]:
SHOW SCHEMAS IN dbacademy;

5. To view available tables in a schema, use the `SHOW TABLES` statement. Notice that, by default, it displays the one table within the default **samples** catalog in the **nyctaxi** schema (database).


In [0]:
SHOW TABLES;
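The `SHOW` statements above can also target a catalog or schema other than the current default. The following is a minimal sketch, assuming your schema name starts with `labuser` (as in the examples in this notebook); the `LIKE` pattern is only an illustration.

```SQL
-- List only the schemas in dbacademy whose names start with 'labuser'
-- (assumption: your lab schema follows the labuser naming shown in this notebook)
SHOW SCHEMAS IN dbacademy LIKE 'labuser*';

-- List the tables in a specific schema without changing the default catalog and schema
SHOW TABLES IN samples.nyctaxi;
```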

6. To query the **samples.nyctaxi.trips** table, you only need to specify the table name **trips** and not the entire three-level namespace (catalog.schema.table) because the default catalog and schema are **samples** and **nyctaxi**, respectively.

In [0]:
SELECT *
FROM trips
LIMIT 10;

7. Let's try querying the **dbacademy.labuser.ca_orders** table without using the three-level namespace. Notice that an error is returned because it is looking for the **ca_orders** table in **samples.nyctaxi**.

In [0]:
-- This query will return an error 
-- This is because the table does not exist in the default catalog (samples) and default schema (nyctaxi)
SELECT *
FROM ca_orders
LIMIT 10;

8. We want to change our default catalog and default schema to **dbacademy** and our **labuser** schema to avoid writing the three-level namespace every time we query and create tables in this course.

    However, before we proceed, note that each of us has a different schema name. Your specific schema name has been stored dynamically in the SQL variable `DA.schema_name` during the classroom setup script.

    Run the code below and confirm that the value of the `DA.schema_name` variable matches your specific schema name (e.g., **labuser1234_5678**).

In [0]:
values(DA.schema_name)

9. Let's modify our default catalog and schema using the `USE CATALOG` and `USE SCHEMA` statements.

    - `USE CATALOG` – Sets the current catalog.

    - `USE SCHEMA` – Sets the current schema. 

    **NOTE:** Since our dynamic schema name is stored in the variable `DA.schema_name` as a string, we will need to use the `IDENTIFIER` clause to interpret the constant string in our variable as a schema name. The `IDENTIFIER` clause can interpret a constant string as any of the following:
    - Relation (table or view) name
    - Function name
    - Column name
    - Field name
    - Schema name
    - Catalog name

    [IDENTIFIER clause documentation](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-names-identifier-clause?language=SQL)

  Alternatively, you can simply add your schema name without using the `IDENTIFIER` clause.

In [0]:
USE CATALOG dbacademy;
USE SCHEMA IDENTIFIER(DA.schema_name);

SELECT current_catalog(), current_schema()
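As a sketch of the alternative mentioned above, the schema can be set with a literal name instead of the `IDENTIFIER` clause. The schema name `labuser1234_5678` is a placeholder, not your actual schema name. The second statement is an assumption-based illustration that `IDENTIFIER` can also resolve a table name built from the same variable.

```SQL
-- Placeholder schema name: replace labuser1234_5678 with your own schema name
USE CATALOG dbacademy;
USE SCHEMA labuser1234_5678;

-- IDENTIFIER can also interpret a constant string as a table name
-- (assumes DA.schema_name is set by the classroom setup script)
SELECT *
FROM IDENTIFIER('dbacademy.' || DA.schema_name || '.ca_orders')
LIMIT 10;
```

Hard-coding the name is simpler to read, while the `IDENTIFIER` form keeps the notebook working for every user without edits.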

10. Let's view the available tables in the **dbacademy** catalog within our **labuser** schema. Notice that your schema contains a variety of tables.

In [0]:
SHOW TABLES;

11. Let's query the **ca_orders** table in the **dbacademy** catalog within our **labuser** schema without using the three-level namespace.

In [0]:
SELECT *
FROM ca_orders
LIMIT 10;

12. While you can set your default catalog and schema to avoid using the three-level namespace, there are times when you might want to reference a specific table. You can do this by specifying the catalog and schema name in the query.

    In this example, let's query the **samples.nyctaxi.trips** table using the three-level namespace.

In [0]:
SELECT *
FROM samples.nyctaxi.trips
LIMIT 10;

13. Lastly, let's view the available volumes within our **labuser** schema in the **dbacademy** catalog using the `SHOW VOLUMES` statement. Notice that our **labuser** schema contains a variety of Databricks volumes, including the **backup** volume.

In [0]:
SHOW VOLUMES;

14. Let's view the available files in our **dbacademy.labuser.backup** volume using the UI. Complete the following:

    a. In the left navigation pane, expand the **dbacademy** catalog.

    b. Expand your **labuser** schema.

    c. Expand **Volumes**.

    d. Expand the **backup** volume.

    e. Notice that your volume contains the **au_products.csv** file.

  **NOTES:** 
  - You can also use the [LIST statement](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-aux-list) to programmatically view available files in a volume, as shown below.
  - The `LIST` statement does not list files in Unity Catalog managed tables.
  - Add your unique schema name to the string in the `LIST` statement to programmatically view the file in the volume.

In [0]:
-- Add your schema name in the LIST statement below, example - `/Volumes/dbacademy/labuser1234_5678/backup`
LIST '/Volumes/dbacademy/<ADD_SCHEMA_NAME>/backup'

## C. Create a Table From a CSV File in a Volume

In this section, we will use SQL to create a table (Delta Table) from a CSV file stored in a Databricks volume using two methods:
- `read_files`
- `COPY INTO`

1. Our goal is to read the **au_products.csv** file and create a table. To start, it's good practice to examine the raw file(s) you want to use to create a table. We can do that with the following code to [query the data by path](https://docs.databricks.com/aws/en/query#query-data-by-path). This enables us to see:

    - The delimiter of the CSV file

    - The general structure of the CSV file

    - Whether the CSV contains headers in the first row

    **NOTE:** You can use this technique to query a variety of file types.

    Complete the following:

      a. In the left navigation pane, navigate to your **backup** volume and find the **au_products.csv** file.

      b. In the cell below, place your cursor between the two backticks.

      c. In the navigation pane, hover over the **au_products.csv** file and select the `>>` to insert the path of the CSV file between the backticks where it says 'REPLACE WITH YOUR VOLUME PATH'. Yours will look something like this with your unique schema name:


      ```SQL
      SELECT *
      FROM text.`/Volumes/dbacademy/labuser1234_5678/backup/au_products.csv`
      ```

      d. Run the cell and view the results. Notice that:
      - The first row of the CSV file contains a header
      - The values are separated by a comma

In [0]:
-- Add the volume path to your file: Example - `/Volumes/dbacademy/labuser1234_5678/backup/au_products.csv`
SELECT *
FROM text.`REPLACE WITH YOUR VOLUME PATH`
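The same query-by-path syntax accepts other format prefixes. The sketch below uses the `csv` prefix on the same file, so each row is split into columns instead of being returned as a single line of text. Replace the `<ADD_SCHEMA_NAME>` placeholder with your schema name. Because no options are passed, the header row will likely appear as a data row with generic column names.

```SQL
-- Replace <ADD_SCHEMA_NAME> with your unique schema name
-- No reader options are set here, so the header is probably not treated as column names
SELECT *
FROM csv.`/Volumes/dbacademy/<ADD_SCHEMA_NAME>/backup/au_products.csv`
LIMIT 10;
```

To control options such as the header or delimiter, use the `read_files` function covered in the next section.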

### C1. read_files Function

1. In the cell below, let's create a table named **au_products** in the **dbacademy** catalog within your **labuser** schema using the **au_products.csv** file with the [read_files table-valued function](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files) (TVF).

    The `read_files` function reads files from a provided location and returns the data in tabular form. It supports reading JSON, CSV, XML, TEXT, BINARYFILE, PARQUET, AVRO, and ORC file formats.

    The `read_files` function below uses the following options:

      - The path of the CSV file is created by concatenating the volume path with your schema name, which is stored in the `DA.schema_name` variable. This allows dynamic referencing of the file for your unique lab schema name.

      - The `format => 'csv'` option specifies the data file format in the source path. The format is auto-inferred if not provided.

      - The `header => true` option specifies that the CSV file contains a header.

      - When the schema is not provided, `read_files` attempts to infer a unified schema across the discovered files, which requires reading all the files.

**NOTES:**

- There are a variety of different options for each file type. You can view the available options for your specific file type in the [Options](https://docs.databricks.com/aws/en/sql/language-manual/functions/read_files#options) documentation.

- If a volume contains related files, you can read all of the files into a table by specifying the path of the volume without the file name. In this example, we are specifying the volume path and the file name.

In [0]:
CREATE OR REPLACE TABLE au_products AS
SELECT *
FROM read_files(
  '/Volumes/dbacademy/' || DA.schema_name || '/backup/au_products.csv',
  format => 'csv',
  header => true
);
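The note above about reading every file in a volume can be sketched as follows. This example only queries the files and does not create a table. It assumes the **backup** volume holds CSV files that share one structure; the `source_file` alias is an illustrative name.

```SQL
-- Read all CSV files in the backup volume by giving the volume path without a file name
SELECT
  *,
  _metadata.file_name AS source_file   -- file each row came from
FROM read_files(
  '/Volumes/dbacademy/' || DA.schema_name || '/backup',
  format => 'csv',
  header => true,
  fileNamePattern => '*.csv'           -- only pick up files ending in .csv
)
LIMIT 10;
```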

2. View available tables in your **labuser** schema. Notice that a new table named **au_products** was created.

In [0]:
SHOW TABLES;

3. Query the **au_products** table within your **labuser** schema and view the results.

    Notice the following:
    - The table was created from the CSV file successfully.
    - A new column named **_rescued_data** was added. This column is provided by default to rescue any data that doesn’t match the schema.

In [0]:
SELECT *
FROM au_products;

4. When the schema is not provided, `read_files` attempts to infer a unified schema across the discovered files. This requires reading all of the files, which can be inefficient for large files.

    For larger files, it's more efficient to specify the schema within the `read_files` function. For our small CSV file, performance is not an issue.

In [0]:
CREATE OR REPLACE TABLE au_products_with_schema AS
SELECT *
FROM read_files(
  '/Volumes/dbacademy/' || DA.schema_name || '/backup/au_products.csv',
  format => 'csv',
  header => true,
  schema => '
    productid STRING,
    productname STRING,
    listprice DOUBLE
  '
);

SELECT *
FROM au_products_with_schema;
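To see what schema inference produced compared with the explicit schema, you can describe both tables. This is a small sketch that assumes both tables from the previous cells exist in your current schema. Run each statement on its own if your editor only displays the last result.

```SQL
-- Column names and data types inferred by read_files
DESCRIBE TABLE au_products;

-- Column names and data types taken from the schema option
DESCRIBE TABLE au_products_with_schema;
```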

### C2. COPY INTO
Another method to create a table from files is using the `COPY INTO` statement. The `COPY INTO` statement loads data from a file location into a Delta table. This is a retryable and idempotent operation: files in the source location that have already been loaded are skipped. This is true even if the files have been modified since they were loaded.

1. The cell below shows how to create a table and copy CSV data into it using `COPY INTO`:

   a. The `CREATE TABLE` statement creates an empty table with a defined schema (columns `productid`, `productname`, and `listprice`). `COPY INTO` will copy the data into this table.

   b. The `COPY INTO` command loads CSV files from the specified path into the created table. 

   c. `FROM` specifies the volume path to read from. This could also reference an external location.

    - The `FILEFORMAT = CSV` specifies the format of the input data. Databricks supports various file formats such as CSV, PARQUET, JSON, and more.

    - The `FORMAT_OPTIONS` specifies file format options like reading the header and schema inference.

<br></br>

**REQUIRED** - In the cell below add the path to YOUR **backup** volume in the `FROM` clause of `COPY INTO`. 

Example: `FROM '/Volumes/dbacademy/labuser1234_5678/backup'`

In [0]:
DROP TABLE IF EXISTS au_products_copy_into;


-- Create an empty table
-- You can define the schema of the table if you desire
CREATE TABLE au_products_copy_into(
  productid STRING,
  productname STRING,
  listprice DOUBLE
);


-- Copy the CSV file(s) from the volume into the table
COPY INTO au_products_copy_into
FROM 'REPLACE WITH THE PATH TO YOUR backup VOLUME'     -- TO DO: Add your path to the backup volume here
FILEFORMAT = CSV
FORMAT_OPTIONS ('header'='true', 'inferSchema'='true');

2. Rerun the `COPY INTO` statement after adding the path to your **backup** volume in the `FROM` clause.

    Notice that **num_affected_rows** and **num_inserted_rows** are both 0. Since all of the data was already read, `COPY INTO` does not ingest the file(s) again.

In [0]:
COPY INTO au_products_copy_into
FROM 'REPLACE WITH THE PATH TO YOUR backup VOLUME'   -- TO DO: Add your path to the backup volume here
FILEFORMAT = CSV
FORMAT_OPTIONS ('header'='true', 'inferSchema'='true');

3. Display the **au_products_copy_into** table.

In [0]:
SELECT *
FROM au_products_copy_into;
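If you ever need to reload files that `COPY INTO` has already ingested, the `force` copy option disables the skip behavior described above. The following is a sketch only: replace the `<ADD_SCHEMA_NAME>` placeholder with your schema name, and note that running it here would load the same rows a second time.

```SQL
-- WARNING: 'force' = 'true' reloads files that were already ingested,
-- so running this against the table above duplicates its rows
COPY INTO au_products_copy_into
FROM '/Volumes/dbacademy/<ADD_SCHEMA_NAME>/backup'   -- replace with your backup volume path
FILEFORMAT = CSV
FORMAT_OPTIONS ('header' = 'true', 'inferSchema' = 'true')
COPY_OPTIONS ('force' = 'true');
```

If you run it, rerun the cell in step 1 to drop and rebuild the table without duplicates.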

#### Summary: COPY INTO (legacy)
The `CREATE STREAMING TABLE` SQL command is the recommended alternative to the legacy `COPY INTO` SQL command for incremental ingestion from cloud object storage. For a more scalable and robust file ingestion experience, Databricks recommends that SQL users use streaming tables instead of `COPY INTO`.

[COPY INTO (legacy)](https://docs.databricks.com/aws/en/ingestion/#copy-into-legacy)

### C3. Introduction to Streaming Tables in DBSQL (Bonus)

This is a high-level introduction to streaming tables in DBSQL. Streaming tables enable streaming or incremental data processing. Depending on your final objective, streaming tables can be extremely useful. 


  We will briefly cover the topic to familiarize you with its capabilities. For more details, check out the [CREATE STREAMING TABLE documentation](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table). Databricks offers several features for managing streaming and real-time data ingestion, so remember to consult with your Data Engineering team for additional support with streaming data.

1. In this example, we will create a silver streaming table from a simple bronze table. In many scenarios, you will be reading from cloud storage to start the process. If you want to stream raw data into a table from cloud storage, view examples in the [CREATE STREAMING TABLE documentation](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-create-streaming-table).

    To begin, let's create the table **emp_bronze_raw** with a list of employees.

In [0]:
-- Use our catalog and schema
USE CATALOG dbacademy;
USE SCHEMA IDENTIFIER(DA.schema_name);

-- Drop the tables if they exist to start from the beginning
DROP TABLE IF EXISTS emp_bronze_raw;
DROP TABLE IF EXISTS emp_silver_streaming;

-- Create the employees table
CREATE TABLE emp_bronze_raw (
    EmployeeID INT,
    FirstName VARCHAR(20),
    Department VARCHAR(20)
);

-- Insert 5 rows of sample data
INSERT INTO emp_bronze_raw (EmployeeID, FirstName, Department)
VALUES
(1, 'John', 'Marketing'),
(2, 'Raul', 'HR'),
(3, 'Michael', 'IT'),
(4, 'Panagiotis', 'Finance'),
(5, 'Aniket', 'Operations');

2. Use the `CREATE OR REFRESH STREAMING TABLE` statement to create a streaming table named **emp_silver_streaming**. This table will incrementally load new rows as they are added to the **emp_bronze_raw** table. It will also create a new column named **IngestDateTime**, which records the date and time when the row was ingested.

**NOTES:**
- This process will take about a minute to run. Behind the scenes, streaming tables create a DLT pipeline. We will cover DLT in detail later in this course.
- Streaming tables are supported only in DLT and on Databricks SQL with Unity Catalog.

In [0]:
CREATE OR REFRESH STREAMING TABLE emp_silver_streaming 
SCHEDULE EVERY 1 HOUR     -- Scheduling the refresh is optional
AS SELECT 
  *, 
  current_timestamp() AS IngestDateTime
FROM STREAM emp_bronze_raw;

3. The `DESCRIBE HISTORY` statement displays a detailed list of all changes, versions, and metadata associated with a Delta table, including information on updates, deletions, and schema changes.

    Run the cell below and view the results. Notice the following:

    - In the **operation** column, you can see that a streaming table performs two operations: **DLT SETUP** and **STREAMING UPDATE**.

    - Scroll to the right and find the **operationMetrics** column. In row 1 (Version 2 of the table), the value shows that the **numOutputRows** is 5, indicating that 5 rows were added to the **emp_silver_streaming** table.


In [0]:
DESCRIBE HISTORY emp_silver_streaming;

4. Run the cell below to query the **emp_silver_streaming** table. Notice that the results display 5 rows of data.

In [0]:
SELECT *
FROM emp_silver_streaming;

5. Run the cell below to insert 2 rows of data into the original **emp_bronze_raw** table.

In [0]:
INSERT INTO emp_bronze_raw (EmployeeID, FirstName, Department)
VALUES
(6, 'Athena', 'Marketing'),
(7, 'Pedro', 'Training');

6. In our scenario, the scheduled refresh occurs every hour. However, we don't want to wait that long. Use the `REFRESH STREAMING TABLE` statement to refresh the streaming table **emp_silver_streaming**. This statement refreshes the data for a streaming table (or a materialized view, which will be covered later).


    For more information on refreshing a streaming table, view the [REFRESH (MATERIALIZED VIEW or STREAMING TABLE) documentation](https://docs.databricks.com/aws/en/sql/language-manual/sql-ref-syntax-ddl-refresh-full).

In [0]:
REFRESH STREAMING TABLE emp_silver_streaming;

7. Run the `DESCRIBE HISTORY` statement again to view the changes in the table.

    Notice the following:
    - The **emp_silver_streaming** table now has a new version, version 3.

    - Scroll over to the **operationParameters** column. Notice that **outputMode** specifies an **Append** operation occurred.

    - Scroll over to the **operationMetrics** column. Notice that the value of **numOutputRows** is 2, indicating an incremental update occurred and that two new rows were added to the **emp_silver_streaming** table.

In [0]:
DESCRIBE HISTORY emp_silver_streaming;

8. Run the query below to view the data in **emp_silver_streaming**. Notice that the table now contains 7 rows.

In [0]:
SELECT *
FROM emp_silver_streaming;

9. Lastly, let's drop the two tables we created.

In [0]:
DROP TABLE IF EXISTS emp_bronze_raw;
DROP TABLE IF EXISTS emp_silver_streaming;
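As noted at the start of this section, streaming tables often read directly from cloud storage instead of from another table. The sketch below combines `STREAM` with the `read_files` function to incrementally ingest CSV files from the **backup** volume. The table name **au_products_streaming** is hypothetical, and you need to replace the `<ADD_SCHEMA_NAME>` placeholder with your schema name.

```SQL
-- Hypothetical streaming table that incrementally ingests CSV files from a volume
-- Replace <ADD_SCHEMA_NAME> with your unique schema name
CREATE OR REFRESH STREAMING TABLE au_products_streaming AS
SELECT *
FROM STREAM read_files(
  '/Volumes/dbacademy/<ADD_SCHEMA_NAME>/backup',
  format => 'csv',
  header => true
);
```

Each refresh should process only files that have not been ingested yet, which is the behavior that makes streaming tables the recommended replacement for `COPY INTO`. If you create this table, drop it afterward with `DROP TABLE IF EXISTS au_products_streaming;`.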

&copy; 2025 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>