# Databricks

## Introduction

The [*Databricks Lakehouse Platform*](https://www.databricks.com/product/data-lakehouse) is a unified platform that provides tools for building, deploying, sharing, and maintaining enterprise-grade data solutions at scale. Databricks integrates with cloud storage and security in your cloud account, and can manage and deploy cloud infrastructure on your behalf.


## How does Databricks work with AWS?

The Databricks Lakehouse platform architecture is composed of two primary parts: 
- the infrastructure used by Databricks to deploy, configure, and manage the platform
- the customer-owned infrastructure managed in collaboration by Databricks and your company

Databricks does not force you to migrate your data into its own storage systems in order to use the platform. Instead, you set up a Databricks workspace by configuring a secure integration between Databricks and your cloud account. Databricks then deploys compute clusters using cloud resources in your account to process and store data.

## Databricks fundamental concepts

### Accounts, Workspaces and Users

A Databricks *Account* represents a single entity for the purposes of billing and support, and can include multiple workspaces.

A *Workspace* can have two meanings:
- A Databricks deployment in the cloud that functions as a unified environment for all your Databricks assets.
- The UI for the Databricks persona-based environments, as seen below:

<p align="center">
    <img src="images/Databricks Workspace.png" width="500"/>
</p>

A workspace organizes objects (notebooks, libraries, dashboards and experiments) into directories and provides access to data and other computational resources, such as clusters and jobs. The objects you will be working with the most are *Notebooks*, which are documents that contain runnable commands and visualizations.

A *User* is a unique individual who has access to the system. User identities are represented by email addresses.

### Cluster

A *cluster* is a set of computation resources on which you run notebooks and jobs. There are two types of clusters: *all-purpose* and *job*.
- All-purpose cluster: you create one using the UI, CLI, or REST API. You can manually terminate and restart it, and it can be shared across multiple users.
- Job cluster: the Databricks job scheduler creates a job cluster when you run a job and terminates the cluster when the job is complete. You cannot restart a job cluster.
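
As a rough illustration of the all-purpose case, the sketch below builds a request body of the kind accepted by the Clusters REST API (`POST /api/2.0/clusters/create`). The cluster name, runtime version string, and node type are placeholder values chosen for illustration, so check which values are available in your own workspace. The code only builds and prints the JSON body; it does not send a request.

```python
import json


def build_cluster_spec(name, spark_version, node_type_id,
                       num_workers=2, autotermination_minutes=60):
    """Return a request body describing an all-purpose cluster."""
    return {
        "cluster_name": name,
        "spark_version": spark_version,    # Databricks runtime version key
        "node_type_id": node_type_id,      # cloud instance type for the nodes
        "num_workers": num_workers,
        # Terminate the cluster after this many idle minutes.
        "autotermination_minutes": autotermination_minutes,
    }


# Placeholder values (assumptions, not recommendations).
cluster_spec = build_cluster_spec("demo-all-purpose", "13.3.x-scala2.12", "i3.xlarge")
print(json.dumps(cluster_spec, indent=2))
```

A job cluster is described with similar fields, but it is declared inside a job definition rather than created directly.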

You can view active clusters under the **Compute** panel in your Databricks account:

<p align="center">
    <img src="images/Databricks Cluster.png" width="1000" height="250"/>
</p>

### Databricks runtime

The *Databricks runtime* is the set of core components that run on the clusters managed by Databricks. Databricks offers several types of runtimes:

- *Databricks Runtime* includes Apache Spark and adds a number of components and updates that substantially improve the usability, performance, and security of big data analytics.

- *Databricks Runtime for Machine Learning* is built on Databricks Runtime and provides a ready-to-go environment for machine learning and data science. It contains libraries such as TensorFlow and PyTorch.

- *Databricks Light* is the Databricks packaging of the open source Apache Spark runtime. It provides a runtime option for jobs that don’t need the advanced performance, reliability, or autoscaling benefits of Databricks Runtime. You can select Databricks Light only when you create a cluster to run a JAR, Python, or spark-submit job; you cannot select this runtime for clusters on which you run interactive or notebook job workloads.

### Workflows

Workflows are frameworks for developing and running data processing pipelines, including creating, running, and managing Databricks Jobs. A job is a non-interactive mechanism for running a notebook or library, either immediately or on a schedule.
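
To make the "scheduled basis" part concrete, here is a minimal sketch of a job definition in the shape used by the Jobs REST API (`POST /api/2.1/jobs/create`): one notebook task that runs on a new job cluster, on a Quartz cron schedule. The job name, notebook path, runtime version, and node type are hypothetical placeholders. The code only builds the JSON body and does not call the API.

```python
import json


def build_notebook_job(name, notebook_path, cron, timezone_id="UTC"):
    """Return a job definition with a single scheduled notebook task."""
    return {
        "name": name,
        "tasks": [
            {
                "task_key": "run_notebook",
                "notebook_task": {"notebook_path": notebook_path},
                # The scheduler creates this job cluster for the run and
                # terminates it when the run completes.
                "new_cluster": {
                    "spark_version": "13.3.x-scala2.12",  # placeholder
                    "node_type_id": "i3.xlarge",          # placeholder
                    "num_workers": 1,
                },
            }
        ],
        "schedule": {
            "quartz_cron_expression": cron,
            "timezone_id": timezone_id,
        },
    }


# Quartz fields: second minute hour day-of-month month day-of-week.
# "0 0 6 * * ?" means every day at 06:00.
job_spec = build_notebook_job(
    "nightly-etl", "/Users/someone@example.com/etl", "0 0 6 * * ?"
)
print(json.dumps(job_spec, indent=2))
```

Leaving out the `schedule` block gives a job that runs only when triggered manually or through the API.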


## Databricks Lakehouse components

The main components of the Databricks Lakehouse are:

#### *1. Delta Table*

Tables created on Databricks use the Delta Lake protocol by default. When you create a new Delta table:
- Metadata used to reference the table is added to the metastore in the declared schema or database.
- Data and table metadata are saved to a directory in cloud object storage.
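
A minimal sketch of creating such a table from a notebook is shown below. It assumes a `spark` session object like the one Databricks notebooks provide, and the table name and columns are hypothetical. `USING DELTA` is spelled out for clarity, although Delta is already the default format as noted above.

```python
def create_delta_table(spark, table_name):
    """Create an empty Delta table and return the SQL statement that was run."""
    statement = (
        f"CREATE TABLE IF NOT EXISTS {table_name} "
        "(id BIGINT, name STRING) USING DELTA"
    )
    spark.sql(statement)  # registers the metadata and creates the storage directory
    return statement


# In a notebook, where `spark` is already defined:
#   create_delta_table(spark, "default.people")
#   spark.sql("DESCRIBE DETAIL default.people").show()  # includes the storage location
```

The function takes `spark` as a parameter so that it can also be exercised outside a cluster with a stand-in object.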

#### *2. Unity Catalog*

Unity Catalog gives you control over who can access which data, and provides a centralized mechanism for managing all data governance and access controls without needing to replicate data.

- Account-level management of the Unity Catalog metastore means databases, data objects, and permissions can be shared across Databricks workspaces.
- You can use a three-level namespace (`<catalog>.<database>.<table>`) for organizing and granting access to data.
- External locations and storage credentials are also securable objects.
- The Data Explorer provides a graphical user interface to explore databases and manage permissions.
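
The three-level namespace and the permission model can be sketched as follows. The helper builds a fully qualified table name and a `GRANT SELECT` statement as plain strings; the catalog, database, table, and principal are made-up examples. Actually running the statement (for example with `spark.sql` in a notebook) assumes a workspace with Unity Catalog enabled.

```python
def quote(identifier):
    """Wrap an identifier in backticks, doubling any backtick inside it."""
    return "`" + identifier.replace("`", "``") + "`"


def qualified_name(catalog, database, table):
    """Build <catalog>.<database>.<table> with each level quoted."""
    return ".".join(quote(part) for part in (catalog, database, table))


def grant_select(catalog, database, table, principal):
    """Build a statement granting read access on one table to one principal."""
    target = qualified_name(catalog, database, table)
    return f"GRANT SELECT ON TABLE {target} TO {quote(principal)}"


print(grant_select("main", "sales", "orders", "analyst@example.com"))
# prints: GRANT SELECT ON TABLE `main`.`sales`.`orders` TO `analyst@example.com`
```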

## Data objects in the Databricks Lakehouse

The Databricks Lakehouse architecture combines data stored with the Delta Lake protocol in cloud object storage with metadata registered to a metastore. 

The metastore contains all the metadata that defines data objects in the lakehouse. The following options exist:
- *Unity Catalog*: you can create a metastore to store and share metadata across multiple Databricks workspaces. Unity Catalog is managed at the account level.

- *Hive metastore*: Databricks stores all the metadata for the built-in Hive metastore as a managed service. Every Databricks deployment has a central Hive metastore accessible by all clusters to persist table metadata. A Hive metastore is a database that holds metadata about data, such as paths to the data in the data lake and the format of the data.

- *External metastore*: you can also bring your own metastore to Databricks.

Regardless of the metastore used, Databricks stores all data associated with a table in the object storage configured by the customer in their cloud account.

There are five main objects in the Databricks Lakehouse:

<p align="center">
    <img src="images/Primary Objects.png" width="400"/>
</p>

- *Catalog*: a grouping of databases.
- *Database or schema*: a grouping of objects in a catalog. Databases contain tables, views, and functions.
- *Table*: a collection of rows and columns stored as data files in object storage.
- *View*: a saved query typically against one or more tables or data sources.
- *Function*: saved logic that returns a scalar value or set of rows.
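
To show how the five objects nest, the sketch below lists one SQL statement per object type, from catalog down to function. All object names are hypothetical, and `CREATE CATALOG` assumes Unity Catalog is enabled. The helper expects a `spark` session like the one available in Databricks notebooks.

```python
# One statement per object type, ordered from the outermost container inward.
STATEMENTS = [
    "CREATE CATALOG IF NOT EXISTS demo_catalog",
    "CREATE SCHEMA IF NOT EXISTS demo_catalog.demo_db",
    "CREATE TABLE IF NOT EXISTS demo_catalog.demo_db.people (id BIGINT, name STRING)",
    "CREATE VIEW IF NOT EXISTS demo_catalog.demo_db.people_names AS "
    "SELECT name FROM demo_catalog.demo_db.people",
    "CREATE FUNCTION IF NOT EXISTS demo_catalog.demo_db.shout(s STRING) "
    "RETURNS STRING RETURN upper(s)",
]


def create_objects(spark, statements=STATEMENTS):
    """Run each statement in order; later objects depend on earlier ones."""
    for statement in statements:
        spark.sql(statement)


# In a notebook, where `spark` is already defined:
#   create_objects(spark)
```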

Most of these objects will be stored in the *Databricks File System (DBFS)*, a filesystem that contains directories, which can in turn contain files (data files, libraries, and images) and other directories. DBFS is automatically populated with some datasets that you can use to learn Databricks.
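
One way to browse those sample datasets from a notebook is the `dbutils.fs.ls` utility, as in the sketch below. It assumes the `dbutils` object that Databricks notebooks provide and the `/databricks-datasets` location where the sample data is normally exposed; the helper name is made up for this example.

```python
def list_dbfs_names(dbutils, path="/databricks-datasets"):
    """Return the names of the entries directly under a DBFS path."""
    return [entry.name for entry in dbutils.fs.ls(path)]


# In a notebook, where `dbutils` is already defined:
#   for name in list_dbfs_names(dbutils):
#       print(name)
```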


## Databricks architecture planes

> Databricks architectures have two planes: a *control plane* and a *data plane*.

### Control plane

The control plane includes the backend services that Databricks manages in its own AWS account. Notebook commands and many other workspace configurations are stored in the control plane and encrypted at rest.

### Data plane

The data plane, which lives in your AWS account, is where the data resides and is processed. By default, Apache Spark clusters are created in a single VPC that Databricks creates and configures in the customer-controlled AWS account.

- For most Databricks computation, the compute resources are in your AWS account in what is called the *Classic data plane*. This is the type of data plane Databricks uses for notebooks, jobs, and for Classic Databricks SQL warehouses.
- If you enable Serverless compute for Databricks SQL, the compute resources for Databricks SQL are in a *Shared Serverless data plane*. The compute resources for notebooks, jobs and Classic Databricks SQL warehouses still live in the Classic data plane in the customer account. 

<p align="center">
    <img src="images/Databricks Architecture.png" width="500" height="500"/>
</p>

### S3 bucket in customer-controlled AWS account

An Amazon S3 bucket is created in the customer account when a Databricks cluster is deployed. The Databricks workspace uses this S3 bucket to store some input and output data. It accesses this data in two ways:

- *Databricks-managed directories*. Some data (Spark driver log initial storage, job output, etc.) is stored or read by Databricks in hidden directories. Customers cannot access these directories through the Databricks File System (DBFS).

- *DBFS root storage*. This storage can be accessed by customer notebooks through a DBFS path.
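
A DBFS path is usually written in one of two forms: a `dbfs:/` URI, used by Spark and `dbutils`, or a `/dbfs/` local path, used by ordinary file APIs on clusters where that mount is available. The helper below is a small illustrative sketch (the function name is made up) that converts the first form into the second.

```python
def to_local_path(dbfs_uri):
    """Convert 'dbfs:/some/file' to the driver-local form '/dbfs/some/file'."""
    prefix = "dbfs:/"
    if not dbfs_uri.startswith(prefix):
        raise ValueError(f"not a dbfs:/ URI: {dbfs_uri!r}")
    # Drop the scheme and any extra leading slashes, then add the mount point.
    return "/dbfs/" + dbfs_uri[len(prefix):].lstrip("/")


print(to_local_path("dbfs:/tmp/example.csv"))
# prints: /dbfs/tmp/example.csv
```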

## Conclusion
At this point, you should have a good understanding of:
- What Databricks is
- How Databricks integrates with a cloud provider
- Fundamental concepts of Databricks
- The Databricks Lakehouse components
- The Databricks Lakehouse architecture
- How data objects are stored in the Databricks Lakehouse