
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img
    src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png"
    alt="Databricks Learning"
  >
</div>


# Ingesting Data into Delta Lake

## REQUIRED - SELECT CLASSIC COMPUTE

Before executing cells in this notebook, please select your classic compute cluster in the lab. Be aware that **Serverless** is enabled by default.

Follow these steps to select the classic compute cluster:


1. Navigate to the top-right of this notebook and click the drop-down menu to select your cluster. By default, the notebook will use **Serverless**.

2. If your cluster is available, select it and continue to the next cell. If the cluster is not shown:

   - Click **More** in the drop-down.

   - In the **Attach to an existing compute resource** window, use the first drop-down to select your unique cluster.

**NOTE:** If your cluster has terminated, you might need to restart it in order to select it. To do this:

1. Right-click on **Compute** in the left navigation pane and select *Open in new tab*.

2. Find the triangle icon to the right of your compute cluster name and click it.

3. Wait a few minutes for the cluster to start.

4. Once the cluster is running, complete the steps above to select your cluster.

## Classroom Setup

Run the following cell to configure your working environment for this course.

**NOTE:** The `DA` object is only used in Databricks Academy courses and is not available outside of these courses. It will dynamically reference the information needed to run the course.

In [0]:
%run ../Includes/Classroom-Setup-02

## A. Configure and Explore Your Environment


#### 1. Setting Up Catalog and Schema
Set the default catalog to **dbacademy** and your unique schema. Then, view the available tables to confirm that no tables currently exist in your schema.

##### 1A. Using SQL Commands

In [0]:
%sql
-- Set the default catalog and schema
USE CATALOG dbacademy;
USE SCHEMA IDENTIFIER(DA.schema_name);

-- Display available tables in your schema
SHOW TABLES;

##### 1B. Using PySpark

In [0]:
# Set the default catalog and schema (Requires Spark 3.4.0 or later)
spark.catalog.setCurrentCatalog(DA.catalog_name)
spark.catalog.setCurrentDatabase(DA.schema_name)

# Display available tables in your schema
spark.catalog.listTables(DA.schema_name)

#### 2. Viewing the Available Files
View the available files in your schema's **myfiles** volume. Confirm that only the **employees.csv** file is available.

**NOTE:** Remember, when referencing data in volumes, use the path provided by Unity Catalog, which always has the following format: */Volumes/catalog_name/schema_name/volume_name/*.

In [0]:
spark.sql(f"LIST '/Volumes/dbacademy/{DA.schema_name}/myfiles/' ").display()
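The volume path format described in the note above can be assembled with a small helper. This is a minimal sketch in plain Python, not part of the course tooling: `volume_path` is a hypothetical function name, and `labuser_schema` is a placeholder for the value that `DA.schema_name` would supply in this course.

```python
def volume_path(catalog: str, schema: str, volume: str, *parts: str) -> str:
    """Build a path of the form /Volumes/<catalog>/<schema>/<volume>/..."""
    segments = [catalog, schema, volume, *parts]
    for segment in segments:
        # Reject empty or slash-containing segments so the path stays well-formed.
        if not segment or "/" in segment:
            raise ValueError(f"invalid path segment: {segment!r}")
    return "/Volumes/" + "/".join(segments)


# "labuser_schema" is a placeholder; in this course you would pass DA.schema_name.
print(volume_path("dbacademy", "labuser_schema", "myfiles"))
print(volume_path("dbacademy", "labuser_schema", "myfiles", "employees.csv"))
```

Building the path in one place avoids mixing hard-coded catalog names with variables across cells.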

## B. Delta Lake Ingestion Techniques
**Objective**: Create a Delta table from the **employees.csv** file using various methods.

- `CREATE TABLE AS` (CTAS)
- UPLOAD UI (user interface)
- `COPY INTO`
- Auto Loader (overview only; outside the scope of this module)

#### 1. CREATE TABLE AS (CTAS)
1. Create a table from the **employees.csv** file using the CREATE TABLE AS statement similar to the previous demonstration. Run the query and confirm that the **current_employees_ctas** table was successfully created.

In [0]:
%sql

-- Drop the table if it exists for demonstration purposes
DROP TABLE IF EXISTS current_employees_ctas;

-- Create the table using CTAS
CREATE TABLE current_employees_ctas
AS
SELECT ID, FirstName, Country, Role 
FROM read_files(
  '/Volumes/dbacademy/' || DA.schema_name || '/myfiles/',
  format => 'csv',
  header => true,
  inferSchema => true
 );

-- Display available tables in your schema
SHOW TABLES;
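If you prefer to run the CTAS statement from a Python cell, the same SQL can be assembled as a string and passed to `spark.sql`, as other cells in this notebook do. The sketch below only builds the statement text. `build_ctas_sql` is a hypothetical helper, and `labuser_schema` is a placeholder for your schema name.

```python
def build_ctas_sql(table: str, source_path: str, columns: list[str]) -> str:
    """Return a CTAS statement that reads CSV files with read_files."""
    # The values are interpolated directly, so only pass trusted input.
    if "'" in source_path:
        raise ValueError("source_path must not contain single quotes")
    column_list = ", ".join(columns)
    return (
        f"CREATE TABLE {table}\n"
        f"AS\n"
        f"SELECT {column_list}\n"
        f"FROM read_files(\n"
        f"  '{source_path}',\n"
        f"  format => 'csv',\n"
        f"  header => true,\n"
        f"  inferSchema => true\n"
        f")"
    )


statement = build_ctas_sql(
    "current_employees_ctas",
    "/Volumes/dbacademy/labuser_schema/myfiles/",  # placeholder schema name
    ["ID", "FirstName", "Country", "Role"],
)
print(statement)
```

On Databricks, you could then run `spark.sql(statement)` after dropping any existing table of the same name.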

2. Query the **current_employees_ctas** table and confirm that it contains 4 rows and 4 columns.

In [0]:
%sql
SELECT *
FROM current_employees_ctas;

#### 2. UPLOAD UI
The add data UI allows you to manually load data into Databricks from a variety of sources.

1. Complete the following steps to manually download the **employees.csv** file from your volume:

   a. Select the Catalog icon in the left navigation bar. 

   b. Click on your catalog **(dbacademy)**

   c. Select the refresh icon to refresh the **dbacademy** catalog.

   d. Expand the **dbacademy** catalog. Within the catalog, you should see a variety of schemas (databases).

   e. Expand your schema. You can locate your schema in the setup notes in the first cell. Notice that your schema contains **Tables** and **Volumes**.

   f. Expand **Volumes** then **myfiles**. The **myfiles** volume should contain a single CSV file named **employees.csv**. 

   g. Click on the kebab menu on the right-hand side of the **employees.csv** file and select **Download Volume file**. This will download the CSV file to your browser's download folder.

2. Complete the following steps to manually upload the **employees.csv** file to your schema. This will mimic loading a local file to Databricks:

   a. In the navigation bar select your schema. 

   b. Click the ellipsis (three-dot) icon next to your schema and select **Open in Catalog Explorer**.

   c. Select the **Create** drop down icon ![create_drop_down](../Includes/images/create_drop_down.png), and select **Table**.

   d. Drag the **employees.csv** file you downloaded earlier into the available section in the browser, or select **browse**, navigate to your downloads folder, and select the **employees.csv** file.

3. Complete the following steps to create the Delta table using the UPLOAD UI.

   a. In the UI confirm the table will be created in the catalog **dbacademy** and your unique schema. 

   b. Under **Table name**, name the table **current_employees_ui**.

   c. Select the **Create table** button at the bottom of the screen to create the table.

   d. Confirm the table was created successfully. Then close out of the Catalog Explorer browser.

**Example**
<br>

![create_table_ui](../Includes/images/create_table_ui.png)


4. Use the SHOW TABLES statement to view the available tables in your schema. Confirm that the **current_employees_ui** table has been created. 


In [0]:
%sql
SHOW TABLES;

5. Lastly, query the table to review its contents.

**NOTE**: If you did not create the table with the UPLOAD UI and name it **current_employees_ui**, this query returns an error.

In [0]:
%sql
SELECT * 
FROM current_employees_ui;

#### 3. COPY INTO
Create a table from the **employees.csv** file using the [COPY INTO](https://docs.databricks.com/en/sql/language-manual/delta-copy-into.html) statement. 

The `COPY INTO` statement incrementally loads data from a file location into a Delta table. This is a retryable and idempotent operation. Files in the source location that have already been loaded are skipped. This is true even if the files have been modified since they were loaded.
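The skip-already-loaded behavior described above can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not how Databricks implements `COPY INTO`: the `ToyCopyInto` class is hypothetical, files are tracked by name only, and the rows are placeholder values rather than the course data.

```python
class ToyCopyInto:
    """Toy model of COPY INTO file tracking; not Databricks' implementation."""

    def __init__(self) -> None:
        self.rows: list[dict] = []            # stands in for the Delta table
        self.loaded_files: set[str] = set()   # names of files already ingested

    def copy_into(self, source: dict[str, list[dict]]) -> dict[str, int]:
        """Load rows from files not seen before and return the row counts."""
        inserted = 0
        for name in sorted(source):
            if name in self.loaded_files:
                continue  # skipped by name, even if the contents changed
            self.rows.extend(source[name])
            self.loaded_files.add(name)
            inserted += len(source[name])
        return {"num_affected_rows": inserted, "num_inserted_rows": inserted}


# Placeholder rows: only the counts matter for this illustration.
volume = {"employees.csv": [{"ID": i} for i in range(1, 5)]}
table = ToyCopyInto()
first = table.copy_into(volume)     # loads the 4 rows
second = table.copy_into(volume)    # loads nothing: the file is already tracked
volume["employees2.csv"] = [{"ID": 5}, {"ID": 6}]
third = table.copy_into(volume)     # loads only the 2 new rows
print(first, second, third, len(table.rows))
```

Because the loader remembers which files it has ingested, rerunning it after a failure or by accident does not duplicate rows.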

1. Create an empty table named **current_employees_copyinto** and define the column data types.

**NOTE:** You can also create an empty table with no columns and evolve the schema with `COPY INTO`.

In [0]:
%sql
-- Drop the table if it exists for demonstration purposes
DROP TABLE IF EXISTS current_employees_copyinto;

-- Create an empty table with the column data types
CREATE TABLE current_employees_copyinto (
  ID INT,
  FirstName STRING,
  Country STRING,
  Role STRING
);

2. Use the `COPY INTO` statement to load all files from the **myfiles** volume (currently only the **employees.csv** file exists) using the path provided by Unity Catalog. Confirm that the data is loaded into the **current_employees_copyinto** table.

    Confirm the following:
    - **num_affected_rows** is 4
    - **num_inserted_rows** is 4
    - **num_skipped_corrupt_files** is 0

In [0]:
spark.sql(f'''
COPY INTO current_employees_copyinto
  FROM '/Volumes/dbacademy/{DA.schema_name}/myfiles/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
      'header' = 'true', 
      'inferSchema' = 'true'
    )
  ''').display()
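Instead of reading the counts off the displayed result, you could check them in code. The sketch below is a hypothetical helper, `check_copy_metrics`, that takes the result row as a dictionary. On Databricks, one way to obtain such a dictionary is `spark.sql(...).first().asDict()`. The example call uses a hand-written stand-in for the result row.

```python
def check_copy_metrics(metrics: dict, expected_rows: int) -> None:
    """Raise AssertionError if the COPY INTO row counts differ from expected_rows."""
    for key in ("num_affected_rows", "num_inserted_rows"):
        actual = metrics[key]
        assert actual == expected_rows, f"{key}: expected {expected_rows}, got {actual}"


# Hand-written stand-in for the result row of the first run.
check_copy_metrics({"num_affected_rows": 4, "num_inserted_rows": 4}, expected_rows=4)
print("row counts match")
```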

3. Query the **current_employees_copyinto** table and confirm that all 4 rows have been copied into the Delta table correctly.

In [0]:
%sql
SELECT * 
FROM current_employees_copyinto;

4. Run the `COPY INTO` statement again and confirm that it did not re-add the data from the volume that was already loaded. Remember, `COPY INTO` is a retryable and idempotent operation: files in the source location that have already been loaded are skipped.
    - **num_affected_rows** is 0
    - **num_inserted_rows** is 0
    - **num_skipped_corrupt_files** is 0



In [0]:
spark.sql(f'''
COPY INTO current_employees_copyinto
  FROM '/Volumes/dbacademy/{DA.schema_name}/myfiles/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
      'header' = 'true', 
      'inferSchema' = 'true'
    )
  ''').display()

5. Run the script below to create an additional CSV file named **employees2.csv** in your **myfiles** volume. View the results and confirm that your volume now contains two CSV files: the original **employees.csv** file and the new **employees2.csv** file.

In [0]:
## Create the new employees2.csv file in your volume
DA.create_employees_csv2()

## View the files in your myfiles volume
files = dbutils.fs.ls(f'/Volumes/dbacademy/{DA.schema_name}/myfiles')
display(files)
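The file check can also be expressed with the Python standard library. This sketch assumes the directory is reachable through ordinary file APIs, which is an assumption to verify for volume paths on your compute. `csv_file_names` is a hypothetical helper, and the demonstration runs against a temporary directory rather than a real volume.

```python
import tempfile
from pathlib import Path


def csv_file_names(directory: str) -> list[str]:
    """Return the sorted names of the .csv files directly inside directory."""
    return sorted(
        p.name for p in Path(directory).iterdir() if p.is_file() and p.suffix == ".csv"
    )


# Demonstration on a temporary directory with empty stand-in files.
with tempfile.TemporaryDirectory() as tmp:
    for name in ("employees.csv", "employees2.csv", "notes.txt"):
        (Path(tmp) / name).write_text("")
    found = csv_file_names(tmp)

print(found)  # → ['employees.csv', 'employees2.csv']
```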

6. Query the new **employees2.csv** file directly. Confirm that only 2 rows exist in the CSV file.

In [0]:
%sql
SELECT 
  ID, 
  FirstName, 
  Country, 
  Role 
FROM read_files(
  '/Volumes/dbacademy/' || DA.schema_name || '/myfiles/employees2.csv',
  format => 'csv',
  header => true,
  inferSchema => true
 );
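A row count like the one above can also be confirmed without Spark. The sketch below uses Python's `csv` module on an in-memory stream. `count_csv_rows` is a hypothetical helper, and the sample rows are placeholders, not the contents of **employees2.csv**.

```python
import csv
import io


def count_csv_rows(text_stream) -> int:
    """Count the data rows in a CSV stream whose first line is a header."""
    return sum(1 for _ in csv.DictReader(text_stream))


# Placeholder content with the same column names as the course file.
sample = (
    "ID,FirstName,Country,Role\n"
    "101,PlaceholderA,XX,RoleA\n"
    "102,PlaceholderB,YY,RoleB\n"
)
row_count = count_csv_rows(io.StringIO(sample))
print(row_count)  # → 2
```

To count rows in a real file, pass an open file handle instead of the `io.StringIO` stream.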

7. Execute the `COPY INTO` statement again using your volume's path. Notice that only the 2 rows from the new **employees2.csv** file are added to the **current_employees_copyinto** table.

    - **num_affected_rows** is 2
    - **num_inserted_rows** is 2
    - **num_skipped_corrupt_files** is 0

In [0]:
spark.sql(f'''
COPY INTO current_employees_copyinto
  FROM '/Volumes/{DA.catalog_name}/{DA.schema_name}/myfiles/'
  FILEFORMAT = CSV
  FORMAT_OPTIONS (
      'header' = 'true', 
      'inferSchema' = 'true'
    )
  ''').display()

8. View the updated **current_employees_copyinto** table and confirm that it now contains 6 rows, including the new data that was added.

In [0]:
%sql
SELECT * 
FROM current_employees_copyinto;

9. View the table's history. Notice that there are 3 versions. The rerun in step 4 loaded no files, which is why it does not appear as a separate version.
    - **Version 0** is the initial empty table created by the CREATE TABLE statement.
    - **Version 1** is the first `COPY INTO` statement, which loaded the **employees.csv** file into the Delta table.
    - **Version 2** is the `COPY INTO` statement from step 7, which loaded only the new **employees2.csv** file into the Delta table.

In [0]:
%sql
DESCRIBE HISTORY current_employees_copyinto;

#### 4. AUTO LOADER

**NOTE: Auto Loader is outside the scope of this course.**

Auto Loader incrementally and efficiently processes new data files as they arrive in cloud storage without any additional setup.

![autoloader](../Includes/images/autoloader.png)

The key benefits of using Auto Loader are:
- No file state management: The source incrementally processes new files as they land on cloud storage. You don't need to manage any state information on what files arrived.
- Scalable: The source will efficiently track the new files arriving by leveraging cloud services and RocksDB without having to list all the files in a directory. This approach is scalable even with millions of files in a directory.
- Easy to use: The source will automatically set up notification and message queue services required for incrementally processing the files. No setup needed on your side.

Check out the documentation
[What is Auto Loader](https://docs.databricks.com/en/ingestion/auto-loader/index.html) for more information.
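For orientation only, here is a rough sketch of what an Auto Loader ingestion might look like in PySpark. It assumes a Databricks runtime, where the `cloudFiles` streaming source is available, and it is not executed in this notebook. The function names, the `header` option choice, and all paths you would pass in are hypothetical. The same directory is used for the schema location and the checkpoint purely to keep the sketch short.

```python
def auto_loader_options(file_format: str, schema_location: str) -> dict[str, str]:
    """Options for the cloudFiles (Auto Loader) streaming source."""
    return {
        "cloudFiles.format": file_format,
        "cloudFiles.schemaLocation": schema_location,  # where inferred schemas are stored
        "header": "true",
    }


def ingest_with_auto_loader(spark, source_dir, target_table, checkpoint_dir):
    """Incrementally load new files from source_dir into target_table."""
    reader = spark.readStream.format("cloudFiles")
    for key, value in auto_loader_options("csv", checkpoint_dir).items():
        reader = reader.option(key, value)
    return (
        reader.load(source_dir)
        .writeStream
        .option("checkpointLocation", checkpoint_dir)  # tracks which files were processed
        .trigger(availableNow=True)  # process the available files, then stop
        .toTable(target_table)
    )


print(auto_loader_options("csv", "/hypothetical/checkpoint/dir"))
```

The checkpoint location plays the role of the file state management mentioned above: the stream records its progress there, so a restart picks up only new files.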

## C. Cleanup
1. Drop your demonstration tables.

In [0]:
%sql
DROP TABLE IF EXISTS current_employees_ctas;
DROP TABLE IF EXISTS current_employees_ui;
DROP TABLE IF EXISTS current_employees_copyinto;
SHOW TABLES;

2. Drop the **employees2.csv** file.

In [0]:
## Remove employees2.csv from the myfiles volume
dbutils.fs.rm(f"/Volumes/{DA.catalog_name}/{DA.schema_name}/myfiles/employees2.csv")

&copy; 2026 Databricks, Inc. All rights reserved. Apache, Apache Spark, Spark, the Spark Logo, Apache Iceberg, Iceberg, and the Apache Iceberg logo are trademarks of the <a href="https://www.apache.org/" target="_blank">Apache Software Foundation</a>.<br/><br/><a href="https://databricks.com/privacy-policy" target="_blank">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use" target="_blank">Terms of Use</a> | <a href="https://help.databricks.com/" target="_blank">Support</a>