**<center><h1>Introduction</h1></center>**

Azure Databricks is a Microsoft analytics service, part of the Microsoft Azure cloud platform. It integrates the Databricks implementation of Apache Spark with Microsoft Azure, and it natively integrates with Azure security and data services. In this module, you will learn how to work with the key features of Azure Databricks.


**<h2>Learning Objectives</h2>**

After completing this module, you’ll be able to:

- Describe the main concepts in Azure Databricks.
- Work with workspaces and clusters.
- Work with notebooks.

<hr>

**<center><h1>Understand Azure Databricks</h1></center>**


Azure Databricks runs on top of a proprietary data processing engine called Databricks Runtime, an optimized version of Apache Spark. It offers up to 50x better performance for Apache Spark workloads.

Apache Spark is the core technology. Spark is an open-source analytics engine for large-scale data processing. It provides an interface for programming entire clusters with implicit data parallelism and fault tolerance.

In a nutshell: Azure Databricks offers a fast, easy, and collaborative Spark-based analytics service. It is used to accelerate big data analytics, artificial intelligence, performant data lakes, interactive data science, machine learning, and collaboration.

**<h2>The main concepts in Azure Databricks</h2>**

<img src = "images/01-01-01-databricks-workspace.jpg"/>

The landing page shows the fundamental concepts to be used in Databricks:

1. The **cluster:** a set of computational resources on which we run the code.
2. The **workspace:** groups all the Databricks elements: clusters, notebooks, and data.
3. The **notebook:** a document that contains runnable code, descriptive text, and visualizations.
 
 
<mark>**Note:** For more information about Azure Databricks, see the [documentation](https://docs.microsoft.com/en-us/azure/databricks/scenarios/what-is-azure-databricks).</mark>





<hr>

**<center><h1>Provision Azure Databricks workspaces and clusters</h1></center>**

Two of the key concepts you need to be familiar with when working with Azure Databricks are **workspaces** and **clusters**.


**<h2>Workspaces</h2>**

A workspace is an environment for accessing all of your Databricks elements:

- It groups objects (like notebooks, libraries, and experiments) into folders.
- It provides access to your data.
- It provides access to the computational resources used (clusters and jobs).

<img src="images/01-01-05-workspace.png"/>

Each user has a home folder for their notebooks and libraries. The objects stored in the Workspace root folder are: folders, notebooks, libraries, and experiments.

To perform an action on a Workspace object, we can right-click the object and choose one of the available actions.

**<h2>Clusters</h2>**

A cluster is a set of computational resources on which we run our code (as notebooks or jobs). We can run ETL pipelines as well as machine learning, data science, and analytics workloads on the cluster.

We can create:

- An **all-purpose cluster.** Multiple users can share such clusters to do collaborative interactive analysis.
- A **job cluster** to run a specific job. The cluster is terminated when the job completes. (A job is a way of running a notebook or JAR either immediately or on a scheduled basis.)

Before we can use a cluster, we have to choose one of the available **runtimes**.

Databricks runtimes are the set of core components that run on Azure Databricks clusters. Azure Databricks offers several types of runtimes:

- **Databricks Runtime:** includes Apache Spark along with components and updates that improve the usability, performance, and security of big data analytics.
- **Databricks Runtime for Machine Learning:** a variant that adds multiple machine learning libraries such as TensorFlow, Keras, and PyTorch.
- **Databricks Light:** for jobs that don’t need the advanced performance, reliability, or autoscaling of the Databricks Runtime.

To create and configure a new cluster, we select the **Create Cluster** button and choose our options.

<img src="images/01-01-02-new-cluster.png"/>

The new cluster then appears in the clusters list.

<img src="images/01-01-03-clusters.png"/>

To launch the cluster, we select the **Start** button and then confirm. We should wait until the cluster has started before using it.
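The options chosen in the Create Cluster form can also be written down as a JSON cluster specification, for example for use with the Databricks Clusters REST API. The following is a minimal sketch, not a complete specification: `build_cluster_spec` is a hypothetical helper, and the runtime version and VM size shown are example values that may not be available in your workspace.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id, num_workers,
                       autotermination_minutes=30):
    """Return a cluster specification as a plain dictionary."""
    if num_workers < 0:
        raise ValueError("num_workers must be zero or greater")
    return {
        "cluster_name": name,
        "spark_version": spark_version,      # the Databricks runtime to use
        "node_type_id": node_type_id,        # the Azure VM size of each node
        "num_workers": num_workers,
        "autotermination_minutes": autotermination_minutes,
    }


# Example values only (assumptions); check what your workspace offers.
spec = build_cluster_spec(
    name="demo-cluster",
    spark_version="13.3.x-scala2.12",
    node_type_id="Standard_DS3_v2",
    num_workers=2,
)
print(json.dumps(spec, indent=2))
```

Keeping the specification in code makes it easy to review the chosen runtime and cluster size before the cluster is created.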

A cluster can be customized in many ways. To make third-party code available to our notebooks, we can install a library. A cluster can be provisioned with Python, Java, Scala, or R libraries from sources such as PyPI, Maven, or CRAN.

Once the cluster is running, we can select **Edit** to change its properties. To provision the cluster with additional libraries, we can select **Libraries** and then choose **Install New**.

<img src="images/01-01-04-provision-cluster.png"/>

We can pick a library, and it will then be available for use in our notebooks.
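Libraries can also be requested programmatically. The sketch below only builds the request body in the shape the Databricks Libraries API uses for an install request, as far as we understand it. `build_library_install_request` is a hypothetical helper, the package names are examples, and `<cluster_id>` is a placeholder for the ID of your own cluster.

```python
import json


def build_library_install_request(cluster_id, pypi_packages=(), maven_coordinates=()):
    """Return a request body listing PyPI and Maven libraries for one cluster."""
    libraries = [{"pypi": {"package": package}} for package in pypi_packages]
    libraries += [{"maven": {"coordinates": coords}} for coords in maven_coordinates]
    return {"cluster_id": cluster_id, "libraries": libraries}


# Example packages only (assumptions); replace them with the libraries you need.
request_body = build_library_install_request(
    cluster_id="<cluster_id>",
    pypi_packages=["simplejson"],
    maven_coordinates=["org.jsoup:jsoup:1.7.2"],
)
print(json.dumps(request_body, indent=2))
```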

**<h2>Working with data in a workspace</h2>**

An Azure Databricks database is a collection of tables. An Azure Databricks table is a collection of structured data.

We can cache, filter, and perform any operations supported by Apache Spark DataFrames on Azure Databricks tables. We can query tables with Spark APIs and Spark SQL.
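As a rough illustration of these operations, the sketch below filters and caches a table through the DataFrame API and counts its rows with Spark SQL. It assumes the `spark` session that Databricks notebooks provide; the function names, the table name `nyc_taxi`, and the column `trip_distance` are hypothetical.

```python
def long_trips(spark, table_name, min_distance):
    """Return a cached DataFrame of rows whose trip_distance exceeds min_distance."""
    trips = spark.table(table_name)  # load the table as a DataFrame
    filtered = trips.filter(f"trip_distance > {min_distance}")  # SQL-style filter
    return filtered.cache()  # mark for caching; filled on the first action


def count_rows(spark, table_name):
    """Count the rows of a table with Spark SQL and return the number."""
    rows = spark.sql(f"SELECT COUNT(*) AS n FROM {table_name}").collect()
    return rows[0]["n"]
```

In a notebook, we could then call `display(long_trips(spark, "nyc_taxi", 5))` to inspect the filtered rows.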

To access our data:

- We can import our files to DBFS using the UI.
- We can mount and use supported data sources via DBFS.

We can then use Spark or local APIs to access the data.

We will be able to use a DBFS file path in our notebook to access our data, independent of its data source.

It is possible to import existing data or code into the workspace.

If we have small data files on our local machine that we want to analyze with Azure Databricks, we can import them to DBFS using the UI. There are two ways to upload data to DBFS with the UI:

- Upload files to the FileStore in the Upload Data UI.
- Upload data to a table with the Create table UI, which is also accessible via the Import & Explore Data box on the landing page.

We may also read data on cluster nodes using Spark APIs. We can read data imported to DBFS into Apache Spark DataFrames. For example, if we import a CSV file, we can read the data using this code:
```
df = spark.read.csv('/FileStore/tables/nyc_taxi.csv', header="true", inferSchema="true")
```
We can also read data imported to DBFS in programs running on the Spark driver node using local file APIs, which see DBFS under the `/dbfs` path. For example, with pandas:
```
import pandas as pd
pdf = pd.read_csv('/dbfs/FileStore/tables/nyc_taxi.csv')
```
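The two snippets above spell the same DBFS location in different ways: Spark APIs take a DBFS path, while local file APIs on the driver reach the same file under `/dbfs`. The following sketch converts between the two forms with plain string handling. The helper names `to_spark_path` and `to_local_path` are hypothetical and not part of any Databricks API.

```python
def to_spark_path(path):
    """Return the dbfs:/ form of a DBFS path, as used by Spark APIs."""
    if path.startswith("dbfs:/"):
        return path
    if path.startswith("/dbfs/"):
        return "dbfs:/" + path[len("/dbfs/"):]
    return "dbfs:/" + path.lstrip("/")


def to_local_path(path):
    """Return the /dbfs/ form of a DBFS path, as used by local file APIs."""
    if path.startswith("/dbfs/"):
        return path
    if path.startswith("dbfs:/"):
        return "/dbfs/" + path[len("dbfs:/"):].lstrip("/")
    return "/dbfs/" + path.lstrip("/")


print(to_spark_path("/FileStore/tables/nyc_taxi.csv"))
print(to_local_path("/FileStore/tables/nyc_taxi.csv"))
```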

**<h2>Importing data</h2>**

To add data, we can go to the landing page and select **Import & Explore Data**.

To get the data in a table, there are multiple options available:

- Upload a local file and import the data.
- Use data already existing under DBFS.
- Mount external data sources, like Azure Storage, Azure Data Lake and more.

To create a table based on a local file, we can select **Upload File** to upload data from our local machine.

<img src="images/01-01-06-upload.png"/>

Once the data is uploaded, it is available as a table or as files under the `/FileStore` folder in DBFS.

Databricks can create a table automatically if we select **Create Table with UI**.

<img src="images/01-01-07-table-ui.png"/>

Alternatively, we can have full control over the structure of the new table by choosing **Create Table in Notebook**. Azure Databricks will generate Spark code that loads our data (and we can customize it via the Spark API).


<img src="images/01-01-08-table-spark.png"/>

**<h2>Using DBFS mounted data</h2>**

Databricks File System (DBFS) is a distributed file system mounted into a Databricks workspace and available on Databricks clusters. DBFS is an abstraction on top of scalable object storage and offers the following benefits:

- Allows you to mount storage objects so that you can seamlessly access data without requiring credentials.
- Allows you to interact with object storage using directory and file semantics instead of storage URLs.
- Persists files to object storage, so you won’t lose data after you terminate a cluster.

The default storage location in DBFS is known as the DBFS root.

We can use the DBFS to access:

- Local files (previously imported). For example, the tables you imported above are available under `/FileStore`.
- Remote files: objects kept in separate storage, accessed as if they were on the local file system.

For example, to mount a remote Azure storage account as a DBFS folder, we can use the `dbutils` module:
```
# Placeholders: replace with your own storage account name and access key
data_storage_account_name = '<data_storage_account_name>'
data_storage_account_key = '<data_storage_account_key>'

# DBFS folder under which the storage container will appear
data_mount_point = '/mnt/data'

# Path of a data file in the storage container
data_file_path = '/bronze/wwi-factsale.csv'

# Mount the "dev" container of the storage account at the mount point
dbutils.fs.mount(
  source = f"wasbs://dev@{data_storage_account_name}.blob.core.windows.net",
  mount_point = data_mount_point,
  extra_configs = {f"fs.azure.account.key.{data_storage_account_name}.blob.core.windows.net": data_storage_account_key})

# List the contents of the mounted folder
display(dbutils.fs.ls(data_mount_point))
# This path is available as dbfs:/mnt/data for Spark APIs, e.g. spark.read
# This path is available as file:/dbfs/mnt/data for regular APIs, e.g. os.listdir
```
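When several containers need to be mounted, the arguments can be assembled in one place. The sketch below builds the same `source`, `mount_point`, and `extra_configs` values as the example above. `build_mount_args` is a hypothetical helper, and the account name and key remain placeholders.

```python
def build_mount_args(account_name, container, mount_point, account_key):
    """Return keyword arguments for dbutils.fs.mount for a blob container."""
    host = f"{account_name}.blob.core.windows.net"
    return {
        "source": f"wasbs://{container}@{host}",
        "mount_point": mount_point,
        "extra_configs": {f"fs.azure.account.key.{host}": account_key},
    }


mount_args = build_mount_args(
    account_name="<data_storage_account_name>",
    container="dev",
    mount_point="/mnt/data",
    account_key="<data_storage_account_key>",
)
print(mount_args["source"])
```

In a notebook, the result could be passed on as `dbutils.fs.mount(**mount_args)`.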

Notebooks support a shorthand, the `%fs` magic command, for accessing the dbutils filesystem module. Most `dbutils.fs` commands are available as `%fs` magic commands:
```
# List the DBFS root
%fs ls

# Overwrite the file "/mnt/my-file" with the string "Hello world!"
%fs put -f "/mnt/my-file" "Hello world!"
```





<hr>

**<center><h1>Work with notebooks in Azure Databricks</h1></center>**


A notebook is a web-based interface to a document that contains:

- Runnable code
- Descriptive text
- Visualizations

A notebook is a collection of runnable cells (commands). When you use a notebook, you are primarily developing and running cells.

Runnable cells operate on files and tables. Cells can be run in sequence, referring to the output of previously run cells.

To create a notebook, we can select **Workspace**, browse into the desired folder, right-click, and choose **Create**, then select **Notebook**.

<img src="images/01-01-09-new-notebook.png"/>

We give the new notebook a name and select the default language for its code cells. We also choose a cluster on which to run the code in the cells.

For runnable cells, the following programming languages are supported: Python, Scala, R, and SQL. You may choose the default language for the cells in a notebook. You may also override that language later.

<img src="images/01-01-10-new-notebook.png"/>

The notebook editor opens with a first empty cell.

<img src="images/01-01-11-edit-notebook.png"/>

By hovering over the Plus button below the current cell or by choosing the top-right menu options, we can change the contents of the notebook. We may add new cells, cut, copy, export the cell contents, or run a specific cell.

We can override the default language by specifying the language magic command `%<language>` at the beginning of a cell.

The supported magic commands are:

- `%python`
- `%r`
- `%scala`
- `%sql`

Notebooks also support a few auxiliary magic commands:

- `%sh`: Allows you to run shell code in your notebook
- `%fs`: Allows you to use dbutils filesystem commands
- `%md`: Allows you to include various types of documentation, including text, images, and mathematical formulas and equations.
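To make the override rule concrete, the following sketch mimics how the language of a cell follows from its first token and the notebook default. It is an illustration only, not the way Databricks implements magic commands. `cell_language` and the labels it returns for the auxiliary commands are hypothetical.

```python
# Magic commands listed above, mapped to a label for what the cell runs as.
MAGIC_COMMANDS = {
    "%python": "python",
    "%r": "r",
    "%scala": "scala",
    "%sql": "sql",
    "%sh": "shell",
    "%fs": "dbutils filesystem",
    "%md": "markdown",
}


def cell_language(cell_source, default_language):
    """Return the magic command's label, or the notebook default if there is none."""
    stripped = cell_source.lstrip()
    first_token = stripped.split(None, 1)[0] if stripped else ""
    return MAGIC_COMMANDS.get(first_token, default_language)


print(cell_language("%sql\nSELECT 1", "python"))
print(cell_language("print('hi')", "python"))
```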





<hr>

**<center><h1>Exercise - Get started with Azure Databricks</h1></center>**


Now it's your chance to get started with Azure Databricks by configuring a cluster and creating a workspace and a notebook.

In this exercise, you will:

- Create an Azure Databricks Cluster.
- Provision an Azure Databricks Workspace.
- Work with Notebooks.
- Use DBFS.



**<h2>Instructions</h2>**

Follow these instructions to complete the exercise:

1. Open the exercise instructions at https://aka.ms/mslearn-dp090.
2. Complete the **Getting Started with Azure Databricks** exercise.



<hr>

**<center><h1>Knowledge check</h1></center>**


Choose the best response for each of the questions below. Then select Check your answers.

1. Alice creates a notebook on Azure Databricks to prepare her datasets before using them with Spark ML. Which of the following languages is supported for doing that in a notebook?

 - Java

 - Python

 - C#

2. You want to train a neural network with TensorFlow. You don't want to install the library manually to avoid extra overhead. What should you do?

 - Create a cluster with the Databricks Runtime for Machine Learning.

 - Create a single node cluster.

 - Create a Python notebook.

3. Which description of DBFS is correct?

 - You can upload a file to the DBFS using the UI.

 - You can only access data in Azure Databricks if it's stored on DBFS.

 - Data uploaded to the DBFS is only stored as long as your cluster is running.



<hr>

**<center><h1>Summary</h1></center>**


In this module, you have learned how to get started with Azure Databricks.

Now that you've completed this module, you can:

- Describe the main concepts in Azure Databricks.
- Work with workspaces and clusters.
- Work with notebooks.



<hr>