![databricks_academy_logo.png](../Includes/images/databricks_academy_logo.png "databricks_academy_logo.png")

# Creating and Managing Lakeflow Spark Declarative Pipelines

Lakeflow Spark Declarative Pipelines provides a collection of tools that allow you to orchestrate ETL pipelines with ease on Databricks.

**Objective:** Use Databricks Lakeflow Spark Declarative Pipelines to create an ETL pipeline. The pipeline will create three tables, which will be refreshed every time the pipeline runs.

## Important: Select Environment 4
The cells below may not work in other environments. To choose environment 4: 
1. Click the ![environment.png](../Includes/images/environment.png "environment.png") button on the right sidebar
1. Open the **Environment version** dropdown
1. Select **4**

## Classroom Setup

Run the following cell to configure your working environment for this lesson.

In [0]:
####################################################################################
# Set python variables for catalog, schema, and volume names (change, if desired)
catalog_name = "dbacademy"
schema_name = "create_pipeline"
volume_name = "myfiles"
####################################################################################

####################################################################################
# Create the catalog, schema, and volume if they don't exist already
spark.sql(f"CREATE CATALOG IF NOT EXISTS {catalog_name}")
spark.sql(f"CREATE SCHEMA IF NOT EXISTS {catalog_name}.{schema_name}")
spark.sql(f"CREATE VOLUME IF NOT EXISTS {catalog_name}.{schema_name}.{volume_name}")
####################################################################################

####################################################################################
# Creates a file called employees.csv in the specified catalog.schema.volume
import pandas as pd
data = [
    ["1111", "Kristi", "USA", "Manager"],
    ["2222", "Sophia", "Greece", "Developer"],
    ["3333", "Peter", "USA", "Developer"],
    ["4444", "Zebi", "Pakistan", "Administrator"]
]
columns = ["ID", "Firstname", "Country", "Role"] 
df = pd.DataFrame(data, columns=columns)
file_path = f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/employees.csv"
df.to_csv(file_path, index=False)
################################################################################

####################################################################################
# Creates a file called employees2.csv in the specified catalog.schema.volume
data = [
    ["5555", "Alex", "USA", "Instructor"],
    ["6666", "Sanjay", "India", "Instructor"]
]
columns = ["ID", "Firstname", "Country", "Role"]

## Create the DataFrame
df = pd.DataFrame(data, columns=columns)

## Create the CSV file in the course Catalog.Schema.Volume
df.to_csv(f"/Volumes/{catalog_name}/{schema_name}/{volume_name}/employees2.csv", index=False)
####################################################################################

####################################################################################
# Print paths to root folder and source code file
path = dbutils.entry_point.getDbutils().notebook().getContext().notebookPath().getOrElse(None)
newpath = path.replace('02 - Creating and Managing Spark Declarative Pipelines','Pipeline Files')
rootpath = newpath
sourcefilepath = newpath + '/Pipeline - 1.py'
print('NOTEBOOK PATHS FOR TASKS:\n')
print(f'  * Root folder path: \n   {rootpath}\n')
print(f'  * Source file path: \n   {sourcefilepath}')
####################################################################################

NOTEBOOK PATHS FOR TASKS:

  * Root folder path: 
   /Users/alladisindhu24@gmail.com/Data Engineering/M-03 -- Pipelines/Pipeline Files

  * Source file path: 
   /Users/alladisindhu24@gmail.com/Data Engineering/M-03 -- Pipelines/Pipeline Files/Pipeline - 1.py
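
Before building the pipeline, it can be worth confirming that the setup cell wrote both CSV files as expected. The following is a minimal sketch of such a check, not part of the course material: `check_employee_files` and `EXPECTED_HEADER` are hypothetical names, and the volume path in the comment assumes the default catalog, schema, and volume names from the setup cell.

```python
import csv
from pathlib import Path

# Header written by the setup cell for both files.
EXPECTED_HEADER = ["ID", "Firstname", "Country", "Role"]


def check_employee_files(volume_dir, filenames=("employees.csv", "employees2.csv")):
    """Return {filename: number_of_data_rows}.

    Raises FileNotFoundError if a file is missing and ValueError if a
    file's header differs from EXPECTED_HEADER.
    """
    counts = {}
    for name in filenames:
        path = Path(volume_dir) / name
        with path.open(newline="") as f:
            rows = list(csv.reader(f))
        if not rows or rows[0] != EXPECTED_HEADER:
            raise ValueError(f"{path} has an unexpected header: {rows[:1]}")
        counts[name] = len(rows) - 1  # exclude the header row
    return counts


# In the notebook, assuming the default names were kept:
# check_employee_files("/Volumes/dbacademy/create_pipeline/myfiles")
```

With the data from the setup cell, the call should report four data rows for `employees.csv` and two for `employees2.csv`.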


## Create a Pipeline
In this lesson we have starter files for you to use in your pipeline. These are located in the folder **Pipeline Files**. To create a pipeline and associate it with these existing code files in your workspace, complete the following:

1. Right-click **Jobs & Pipelines** in the left navigation bar and select **Open Link in New Tab**.

2. In **Jobs & Pipelines** select **Create** → **ETL Pipeline**.

3. Complete the pipeline creation page with the following: 

    * **Name**: Name the pipeline whatever you wish
    * **Default catalog**: Select the **dbacademy** catalog (or a different one if you changed the default at the beginning of the lesson) 
    * **Default schema**: Select the **create_pipeline** schema (or a different one if you changed the default at the beginning of the lesson)

4. In the options, select **Add existing assets**. In the popup, complete the following:

    * **Pipeline root folder**: Select the **Pipeline Files** folder (the path is in the output of the setup cell above)
    * **Source code paths**: Within the same root folder, select the **Pipeline - 1.py** file

5. Click **Add**. This creates a pipeline and associates the correct files for this demonstration.

  Note: We will discuss the source code for this pipeline in the next lesson.
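
The choices made in the creation dialog can also be written down as a pipeline specification. The sketch below builds such a specification as a plain Python dict. It is an illustration under assumptions, not the definitive format: the field names (`name`, `catalog`, `schema`, `root_path`, `libraries`) follow the Pipelines REST API as we understand it and should be checked against the current reference, while `build_pipeline_spec`, the pipeline name, and the workspace path are hypothetical.

```python
import json


def build_pipeline_spec(name, catalog, schema, root_path, source_file):
    """Return a dict mirroring the choices made in the creation dialog."""
    return {
        "name": name,
        "catalog": catalog,      # Default catalog
        "schema": schema,        # Default schema
        "root_path": root_path,  # Pipeline root folder
        # Source code paths: a single file inside the root folder
        "libraries": [{"file": {"path": f"{root_path}/{source_file}"}}],
    }


spec = build_pipeline_spec(
    name="my_first_pipeline",  # hypothetical name
    catalog="dbacademy",
    schema="create_pipeline",
    # Hypothetical path: use the root folder path printed by the setup cell.
    root_path="/Workspace/Users/someone@example.com/Pipeline Files",
    source_file="Pipeline - 1.py",
)
print(json.dumps(spec, indent=2))
```

Keeping the specification in a function makes it easy to reuse the same structure if you changed the default catalog or schema at the beginning of the lesson.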

## Deploy a Pipeline
Let's look at how to deploy a pipeline to production.

Complete the following steps:

  a. Click **Settings** in the upper-right corner (this may be a gear icon, depending on your browser's zoom level)

  b. In the **Pipeline settings** section, you can:

  - Modify the **Pipeline name**, if desired

  - Change the **Run as** principal. To do this, select the pencil icon next to **Run as**. You can only change this if there are other users in your workspace.

  - Optionally change the executor of the pipeline to a service principal. A service principal is an identity you create in Databricks for use with automated tools, jobs, and applications. For more information, see the [What is a service principal?](https://docs.databricks.com/aws/en/admin/users-groups/service-principals#what-is-a-service-principal) documentation.

  - In **Pipeline mode**, ensure **Triggered** is selected so the pipeline runs only when it is started, either manually or on a schedule.
    - Alternatively, you can choose **Continuous** mode to keep the pipeline running at all times.
    - For more details, see [Triggered vs. continuous pipeline mode](https://docs.databricks.com/aws/en/dlt/pipeline-mode).

  c. In the **Code assets** section, you can change the **Root Folder** or **Source code** files, as needed.

  d. The **Default location for data assets** section gives you the ability to change where tables, views, etc., will be created/refreshed, by default.

  e. In the **Compute** section, confirm that **Serverless** compute is selected.

  f. You can add libraries or a `requirements.txt` file in the **Pipeline environment** section.

  g. The pipeline's source code file reads two configuration parameters, so we need to add them to the pipeline. The next step walks through this.

  h. Scroll down to the **Configuration** section, and click **Add configuration**.

  - Set the **key** for the first parameter to "catalog_name"
  - Set the **value** for this key to "dbacademy" (or a different one if you changed the default at the beginning of the lesson) 
  - Set the **key** for the second parameter to "schema_name"
  - Set the **value** for this key to "create_pipeline" (or a different one if you changed the default at the beginning of the lesson) 
  - Click **Save**
  - Close the settings by clicking the "X" in the upper-right corner

  i. The **Tags**, **Usage**, and **Notifications** sections are not discussed in this course.

  j. While we are here, we are going to set up this pipeline for the next lesson. Click **Edit advanced settings**.

  - Expand **Advanced settings** at the bottom of the window.

  - For **Channel**, you can select either **Current** or **Preview**:
    - **Current** – Uses the latest stable Databricks Runtime version, recommended for production.
    - **Preview** – Uses a more recent, potentially less stable Runtime version, ideal for testing upcoming features.
    - View the [Lakeflow Spark Declarative Pipelines release notes and the release upgrade process](https://docs.databricks.com/aws/en/release-notes/dlt/) documentation for more information.

  - We can also publish the pipeline's event log to a Delta table. We will see in the next lesson that a summary of the pipeline's events is available to us in the UI, but we can save more detailed information, if desired.

  - Click **Save** to save the advanced settings.

  k. Click the "X" in the upper-right corner to close the Pipeline settings.
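
The settings chosen in the steps above, and the way source code might consume the two configuration parameters, can be sketched as follows. This is an illustration under assumptions: the field names (`continuous`, `serverless`, `channel`, `configuration`) follow the Pipelines REST API as we understand it, `qualified_schema` is a hypothetical helper, and the actual source file for this pipeline is covered in the next lesson. Pipeline source code typically reads configuration values with `spark.conf.get`; here a dict lookup stands in for it so the example runs anywhere.

```python
# Settings from the steps above, written as a plain dict (a sketch only).
settings = {
    "continuous": False,   # Pipeline mode: Triggered
    "serverless": True,    # Compute: Serverless
    "channel": "CURRENT",  # Channel: Current
    "configuration": {
        "catalog_name": "dbacademy",
        "schema_name": "create_pipeline",
    },
}


def qualified_schema(conf_get):
    """Build 'catalog.schema' from the two configuration parameters.

    conf_get is any callable that maps a key to its value. Inside a
    pipeline source file this would typically be spark.conf.get.
    """
    return f"{conf_get('catalog_name')}.{conf_get('schema_name')}"


# Outside Databricks, the dict's get method stands in for spark.conf.get.
target = qualified_schema(settings["configuration"].get)
print(target)  # dbacademy.create_pipeline
```

Passing the lookup function in as an argument keeps the helper testable without a Spark session.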

## Schedule a Pipeline

Once your pipeline is production-ready, you may want to schedule it to run either on a time interval or continuously. For this demonstration, we’ll walk through configuring a schedule without actually saving it.

Complete the following steps to schedule the pipeline:

a. Select the **Schedule** button in the upper-right corner (this may be a small calendar icon, depending on your browser's zoom level).

b. Click **Add schedule**.

When you schedule a pipeline, you are actually creating a single-task Lakeflow Job that will run the pipeline according to the schedule you set.

c. For the job name, leave it as is.

d. Below **Job name**, select **Advanced**.

e. In the **Schedule** section, configure the following:
- Set the **Day**.
- Set the time to **20:00** (8:00 PM).
- Leave the **Timezone** as default.
- Select **More options**, and under **Notifications**, add your email to receive alerts for:
  - **Start**
  - **Success**
  - **Failure**

f. Optionally, uncheck **Performance optimized** to reduce cost at the expense of a longer startup time.

g. Since we are not actually scheduling this pipeline, click **Cancel**.

**NOTE:** You could also set the pipeline to run a few minutes after your current time to see it start through the scheduler.
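
The schedule configured above corresponds to a single-task job with a daily trigger at 20:00. The sketch below shows roughly what such a job definition could look like as a Python dict. It rests on assumptions: Databricks job schedules use Quartz cron syntax, and the field names (`tasks`, `pipeline_task`, `schedule`, `quartz_cron_expression`, `timezone_id`, `pause_status`, `email_notifications`) follow the Jobs REST API as we understand it. The helper `daily_quartz_cron`, the job name, the pipeline ID placeholder, the email address, and the `UTC` timezone are all hypothetical.

```python
import json


def daily_quartz_cron(hour, minute=0):
    """Return a Quartz cron expression that fires every day at hour:minute.

    Quartz fields are: seconds minutes hours day-of-month month day-of-week.
    """
    if not (0 <= hour <= 23 and 0 <= minute <= 59):
        raise ValueError("hour must be 0-23 and minute must be 0-59")
    return f"0 {minute} {hour} * * ?"


email = "you@example.com"  # hypothetical address

job = {
    "name": "pipeline_schedule_job",  # hypothetical job name
    # A single task that runs the pipeline
    "tasks": [
        {
            "task_key": "run_pipeline",
            "pipeline_task": {"pipeline_id": "<your-pipeline-id>"},
        }
    ],
    "schedule": {
        "quartz_cron_expression": daily_quartz_cron(20),  # every day at 20:00
        "timezone_id": "UTC",      # assumption: the lesson keeps the default
        "pause_status": "PAUSED",  # paused, since we do not run the schedule
    },
    # Alerts for Start, Success, and Failure
    "email_notifications": {
        "on_start": [email],
        "on_success": [email],
        "on_failure": [email],
    },
}
print(json.dumps(job, indent=2))
```

Setting `pause_status` to `PAUSED` mirrors the lesson's choice not to actually run the schedule.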