
<div style="text-align: center; line-height: 0; padding-top: 9px;">
  <img src="https://databricks.com/wp-content/uploads/2018/03/db-academy-rgb-1200px.png" alt="Databricks Learning" style="width: 600px">
</div>

# Supporting Databricks workspaces and metastores

In this lab you will learn how to:
* Create AWS resources needed to support a Databricks workspace
* Create AWS resources needed to support a Unity Catalog metastore
* Create cloud resources in the Databricks account console (credential and storage configurations) that make Databricks aware of these AWS resources

## Prerequisites

If you would like to follow along with this lab, you will need:
* administrator access to your AWS console, with the ability to create buckets and IAM roles
* account administrator capabilities in your Databricks account in order to access the account console

## Supporting a workspace

A Databricks workspace is an environment for accessing all of your Databricks assets. The workspace organizes objects (notebooks, libraries, and experiments) into folders, integrates revision control, and provides access to data and computational resources such as clusters and jobs. A workspace also provides access to Databricks SQL, a simple experience for SQL users who want to query, explore and visualize queries on their data lake.

There are some underlying AWS resources that must be set up first in order to support the workspace. These include:
* A cross-account credential that allows Databricks to launch clusters in the account (in AWS, this means an IAM role)
* An S3 bucket to provide workspace root storage. This will require a specialized policy to permit Databricks to access the bucket.

We will create these elements in this demo. Note, however, that this procedure is also documented <a href="https://docs.databricks.com/administration-guide/account-settings-e2/workspaces.html" target="_blank">here</a>. We will be referencing this documentation throughout the demo.

### Creating a credential configuration

The software running in the Databricks control plane needs limited access to your AWS account so that it can create and manage compute resources, such as clusters and VPCs, within that account. This access is granted through a cross-account IAM role. In this section, we'll create and appropriately configure such a role, then wrap it into a credential configuration that Databricks can use when deploying a workspace.

#### Creating a cross-account IAM role

In this section, we'll create and appropriately configure a cross-account IAM role to allow Databricks to create and manage VPCs and clusters in your own AWS account. Note that the policy we use applies to the default Databricks-managed VPC. A different policy is needed if providing your own VPC; we talk about this in a separate course.

1. In the AWS IAM console, let's select **Roles**.
1. Click **Create role**.
1. Select **AWS account**. This will let us set up a cross-account trust relationship that will allow Databricks, running in its own account, to assume the role to access services in our account.
   * Select **Another AWS account**.
   * For **Account ID**, let's substitute in the ID of the AWS account that Databricks runs in, *414351767826*.
   * Select **Require external ID**.
   * For **External ID**, let's paste the ID of our own Databricks account (this is not an AWS account ID). We can easily get this from the user menu in the account console.
   * Now let's click **Next** until we get to the final page.
   * Let's assign the name for our role (use *dbacademy-test-cross-account-role*).
   * Click **Create role**.
1. Now let's view the role we just created.
1. In the **Permissions** tab, let's select **Add permissions > Create inline policy**.
1. In the **JSON** tab, replace the default policy with the following:
    ```
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Sid": "Stmt1403287045000",
          "Effect": "Allow",
          "Action": [
            "ec2:AllocateAddress",
            "ec2:AssociateDhcpOptions",
            "ec2:AssociateIamInstanceProfile",
            "ec2:AssociateRouteTable",
            "ec2:AttachInternetGateway",
            "ec2:AttachVolume",
            "ec2:AuthorizeSecurityGroupEgress",
            "ec2:AuthorizeSecurityGroupIngress",
            "ec2:CancelSpotInstanceRequests",
            "ec2:CreateDhcpOptions",
            "ec2:CreateInternetGateway",
            "ec2:CreateNatGateway",
            "ec2:CreateRoute",
            "ec2:CreateRouteTable",
            "ec2:CreateSecurityGroup",
            "ec2:CreateSubnet",
            "ec2:CreateTags",
            "ec2:CreateVolume",
            "ec2:CreateVpc",
            "ec2:CreateVpcEndpoint",
            "ec2:DeleteDhcpOptions",
            "ec2:DeleteInternetGateway",
            "ec2:DeleteNatGateway",
            "ec2:DeleteRoute",
            "ec2:DeleteRouteTable",
            "ec2:DeleteSecurityGroup",
            "ec2:DeleteSubnet",
            "ec2:DeleteTags",
            "ec2:DeleteVolume",
            "ec2:DeleteVpc",
            "ec2:DeleteVpcEndpoints",
            "ec2:DescribeAvailabilityZones",
            "ec2:DescribeIamInstanceProfileAssociations",
            "ec2:DescribeInstanceStatus",
            "ec2:DescribeInstances",
            "ec2:DescribeInternetGateways",
            "ec2:DescribeNatGateways",
            "ec2:DescribePrefixLists",
            "ec2:DescribeReservedInstancesOfferings",
            "ec2:DescribeRouteTables",
            "ec2:DescribeSecurityGroups",
            "ec2:DescribeSpotInstanceRequests",
            "ec2:DescribeSpotPriceHistory",
            "ec2:DescribeSubnets",
            "ec2:DescribeVolumes",
            "ec2:DescribeVpcs",
            "ec2:DetachInternetGateway",
            "ec2:DisassociateIamInstanceProfile",
            "ec2:DisassociateRouteTable",
            "ec2:ModifyVpcAttribute",
            "ec2:ReleaseAddress",
            "ec2:ReplaceIamInstanceProfileAssociation",
            "ec2:RequestSpotInstances",
            "ec2:RevokeSecurityGroupEgress",
            "ec2:RevokeSecurityGroupIngress",
            "ec2:RunInstances",
            "ec2:TerminateInstances"
          ],
          "Resource": [
            "*"
          ]
        },
        {
          "Effect": "Allow",
          "Action": [
              "iam:CreateServiceLinkedRole",
              "iam:PutRolePolicy"
          ],
          "Resource": "arn:aws:iam::*:role/aws-service-role/spot.amazonaws.com/AWSServiceRoleForEC2Spot",
          "Condition": {
            "StringLike": {
                "iam:AWSServiceName": "spot.amazonaws.com"
            }
          }
        }
      ]
    }
    ```
1. Now let's click **Review policy** to get to the final page.
1. Let's assign the name for our policy (use *dbacademy-test-cross-account-policy*).
1. Click **Create policy**.
1. Let's take note of the **ARN**; the account administrator will need this in order to create a credential configuration that captures this IAM role.
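
For reference, the console choices above (another AWS account, with a required external ID) end up as a trust policy on the role. The following Python sketch builds an equivalent document using only the standard library. The function name `cross_account_trust_policy` and the sample external ID are illustrative assumptions; the trust policy that the console generates for you remains authoritative.

```python
import json

# ID of the AWS account that Databricks runs in, as given in the steps above.
DATABRICKS_AWS_ACCOUNT_ID = "414351767826"


def cross_account_trust_policy(external_id: str) -> dict:
    """Build a trust policy that lets the Databricks AWS account assume the role.

    external_id is your own Databricks account ID.
    """
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {
                    "AWS": f"arn:aws:iam::{DATABRICKS_AWS_ACCOUNT_ID}:root"
                },
                "Action": "sts:AssumeRole",
                # Only callers presenting this external ID may assume the role.
                "Condition": {"StringEquals": {"sts:ExternalId": external_id}},
            }
        ],
    }


# Placeholder value, not a real Databricks account ID.
policy = cross_account_trust_policy("00000000-0000-0000-0000-000000000000")
print(json.dumps(policy, indent=2))
```

Comparing this output with the role's **Trust relationships** tab is a quick way to confirm that the external ID was entered correctly.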

#### Creating the credential configuration

With a cross-account IAM role created, we need a way to represent it in Databricks. For this purpose, we have *credential configurations*, which we create in this section using the account console.

1. In the account console, let's click on the **Cloud Resources** icon in the left sidebar.
1. Let's click the **Credential configuration** tab.
1. Let's click **Add credential configuration**.
1. Let's provide a name for the configuration. This name will have no user visibility (use *dbacademy-test-credential-configuration*).
1. Paste the **ARN** for the role we created moments ago.
1. Finally, let's click **Add**.
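
The same configuration can also be created programmatically. As a hedged sketch, the snippet below only assembles the request that the Databricks Account API is understood to accept for credential configurations. The endpoint path, the field names, the helper `credential_config_request`, and the sample AWS account ID `123456789012` are assumptions to verify against the current API reference. No network call is made.

```python
import json


def credential_config_request(account_id: str, name: str, role_arn: str):
    """Return (url, json_body) for creating a credential configuration.

    Assumed endpoint and payload shape; check the Account API reference.
    """
    url = (
        "https://accounts.cloud.databricks.com"
        f"/api/2.0/accounts/{account_id}/credentials"
    )
    body = {
        "credentials_name": name,
        "aws_credentials": {"sts_role": {"role_arn": role_arn}},
    }
    return url, body


url, body = credential_config_request(
    account_id="<DATABRICKS_ACCOUNT_ID>",  # placeholder, fill in your own
    name="dbacademy-test-credential-configuration",
    # Hypothetical ARN; use the one noted when the role was created.
    role_arn="arn:aws:iam::123456789012:role/dbacademy-test-cross-account-role",
)
print("POST", url)
print(json.dumps(body, indent=2))
```

Keeping request construction separate from sending makes the payload easy to review before an account administrator submits it with authenticated tooling.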

### Creating a storage configuration

Each workspace needs an S3 bucket, located in the same region as the workspace, to store objects that are generated as the platform is used. These stored objects include:
* Cluster logs
* Notebook revisions
* Job results
* Libraries
* Any files written to the DBFS root, either by a job or uploaded from the user interface
* Tables written to the legacy metastore

With an appropriately configured bucket in place, we then need to create a *storage configuration* in the account console to represent this bucket.

Note that you can share a bucket between more than one workspace, though Databricks advises against this.

#### Creating the workspace root storage bucket

Let's create an S3 bucket to function as the workspace root storage.

1. In the AWS S3 console, let's click **Create bucket**.
1. Let's specify a name. When choosing your own names, be mindful to not include dots in your names. Bucket names must also be globally unique. In this example we use *dbacademy-test-workspace-bucket*, but you should include a suffix or prefix that uniquely ties the name to your organization; for example, replace *dbacademy* with your domain name (using hyphens instead of dots).
1. Let's choose the region where we plan on creating our workspace.
1. For this example, let's accept the default settings for the rest, and create the bucket. We will need to revisit it in a moment to add a policy.
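
The naming advice above can be checked before opening the console. This is a minimal sketch, assuming the commonly documented S3 rules (3 to 63 characters; lowercase letters, digits, and hyphens; starting and ending with a letter or digit) and additionally rejecting dots, as this lab recommends. The helper names `is_suitable_bucket_name` and `bucket_name_for` are hypothetical, and global uniqueness can only be confirmed by AWS itself.

```python
import re

# Lowercase letters, digits, and hyphens; starts and ends with a letter or
# digit; 3 to 63 characters in total. Dots are deliberately not allowed here.
_BUCKET_NAME = re.compile(r"[a-z0-9][a-z0-9-]{1,61}[a-z0-9]")


def is_suitable_bucket_name(name: str) -> bool:
    """True if the name follows the rules assumed above (uniqueness not checked)."""
    return _BUCKET_NAME.fullmatch(name) is not None


def bucket_name_for(domain: str, suffix: str) -> str:
    """Prefix a bucket name with a domain name, using hyphens instead of dots."""
    return f"{domain.lower().replace('.', '-')}-{suffix}"


name = bucket_name_for("example.com", "test-workspace-bucket")
print(name)                                         # example-com-test-workspace-bucket
print(is_suitable_bucket_name(name))                # True
print(is_suitable_bucket_name("example.com-data"))  # False, contains a dot
```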

#### Creating a storage configuration

Now let's create the piece that links Databricks to the storage container for the workspace we will create.
1. In the account console, let's click on the **Cloud Resources** icon in the left sidebar.
1. Let's click the **Storage configuration** tab.
1. Let's click **Add storage configuration**.
1. Let's provide a name for the configuration. This name will have no user visibility (use *dbacademy-test-storage-configuration*).
1. Let's enter the name for the bucket we created moments ago (*dbacademy-test-workspace-bucket*).
1. Now we need to add a policy to that bucket. Let's click the **Generate policy** link and copy the JSON policy description.
1. Finally, let's click **Add**.

#### Adding the policy to the bucket

With a policy on the clipboard, let's revisit the S3 console to add that policy to the bucket we created earlier.

1. In the AWS S3 console, let's find the bucket we created and select it.
1. Let's click the **Permissions** tab.
1. In the **Bucket policy** area, click **Edit**.
1. Let's paste the JSON policy.
1. Finally, let's click **Save changes**.
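
Before saving, it can be worth confirming that the pasted policy really targets the bucket we created, both the bucket itself and the objects inside it. The sketch below does this with the standard library only. The helper names are hypothetical, and the sample policy is a trimmed, illustrative stand-in rather than the text that the account console generates.

```python
import json


def policy_resources(policy: dict) -> set:
    """Collect every Resource ARN named in any statement of a policy."""
    resources = set()
    statements = policy.get("Statement", [])
    if isinstance(statements, dict):  # a single statement may appear unwrapped
        statements = [statements]
    for statement in statements:
        value = statement.get("Resource", [])
        resources.update([value] if isinstance(value, str) else value)
    return resources


def covers_bucket(policy: dict, bucket: str) -> bool:
    """True if the policy names both the bucket and the objects within it."""
    wanted = {f"arn:aws:s3:::{bucket}", f"arn:aws:s3:::{bucket}/*"}
    return wanted <= policy_resources(policy)


# Illustrative policy only; paste the generated policy here instead.
sample = json.loads("""
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Effect": "Allow",
      "Principal": {"AWS": "arn:aws:iam::414351767826:root"},
      "Action": ["s3:GetObject", "s3:PutObject", "s3:ListBucket"],
      "Resource": [
        "arn:aws:s3:::dbacademy-test-workspace-bucket/*",
        "arn:aws:s3:::dbacademy-test-workspace-bucket"
      ]
    }
  ]
}
""")

print(covers_bucket(sample, "dbacademy-test-workspace-bucket"))  # True
print(covers_bucket(sample, "some-other-bucket"))                # False
```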

## Supporting a metastore

A metastore is the top-level container of data objects in Unity Catalog. The metastore contains metadata about your tables and, in the case of managed tables, the table data itself. 

Account administrators create metastores and assign them to workspaces to allow workloads in those workspaces to access the data represented in the metastore. This can be done in the account console, through REST APIs, or using <a href="https://registry.terraform.io/providers/databrickSlabs/databricks/latest/docs" target="_blank">Terraform</a>. In this demo, we will explore the creation and management of metastores interactively using the account console.

There are some underlying cloud resources that must be set up first in order to support the metastore. These include:
* An S3 bucket for storing metastore artifacts located in your own AWS account
* An IAM role that allows Databricks to access the bucket

We will create these elements in this demo, but note that this procedure is also documented <a href="https://docs.databricks.com/data-governance/unity-catalog/get-started.html#configure-aws-objects" target="_blank">here</a>. We will be referencing this documentation throughout the demo.

It's important to keep the following constraints in mind when creating and managing metastores:
* You can create only one metastore per region
* Metastores can only be associated with workspaces in the same region
* Any number of workspaces in a region can be associated with the metastore for that region

### Creating the metastore bucket

Databricks recommends creating a dedicated bucket for each metastore. We do not recommend using this bucket for any purpose other than hosting the metastore. Here we will create a bucket named *dbacademy-test-metastore-bucket* for this purpose.

1. Still in the AWS S3 console, let's click **Create bucket**.
1. Let's specify our name. Once again, be mindful to not include dots in your names, and that names must be globally unique. For this example we use *dbacademy-test-metastore-bucket*, but adjust your name accordingly.
1. Let's choose the same region as the workspace bucket we created earlier.
1. For this example, let's accept the default settings for the rest, and create the bucket.

### Creating an IAM policy

Before creating the IAM role that Unity Catalog needs, we need to create a policy that defines how this bucket can be accessed. This must be done using the same AWS account as the bucket.

1. In the AWS IAM console, let's select **Policies**.
1. Click **Create policy**.
1. Let's select the **JSON** tab and replace the default policy with the following, which we use as a starting point:
    ```
    {
     "Version": "2012-10-17",
     "Statement": [
         {
             "Action": [
                 "s3:GetObject",
                 "s3:PutObject",
                 "s3:DeleteObject",
                 "s3:ListBucket",
                 "s3:GetBucketLocation",
                 "s3:GetLifecycleConfiguration",
                 "s3:PutLifecycleConfiguration"
             ],
             "Resource": [
                 "arn:aws:s3:::<BUCKET>/*",
                 "arn:aws:s3:::<BUCKET>"
             ],
             "Effect": "Allow"
         },
         {
             "Action": [
                 "sts:AssumeRole"
             ],
             "Resource": [
                 "arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>"
             ],
             "Effect": "Allow"
         }
       ]
    }
    ```
1. Now let's customize the policy.
   * Replace instances of **`<BUCKET>`** with the name of the bucket we created.
   * Replace **`<AWS_ACCOUNT_ID>`** with the account ID of the current AWS account, which is accessible from the user menu in the AWS console.
   * Replace **`<AWS_IAM_ROLE_NAME>`** with the name of the IAM role that we will create, *dbacademy-test-metastore-role*.
1. Let's click through the remaining pages, accepting the default settings, and specify a suitable name (use *dbacademy-test-metastore-policy*), then create the policy.
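
The placeholder substitution in the steps above can also be scripted, which helps avoid leaving a stray `<BUCKET>` behind. This is a minimal sketch using only the standard library. The template is a compact copy of the policy shown above, the function name `render_policy` is hypothetical, and `123456789012` is a sample AWS account ID rather than a real one.

```python
import json

# Compact copy of the starting-point policy from the steps above.
POLICY_TEMPLATE = """
{
  "Version": "2012-10-17",
  "Statement": [
    {
      "Action": [
        "s3:GetObject", "s3:PutObject", "s3:DeleteObject", "s3:ListBucket",
        "s3:GetBucketLocation", "s3:GetLifecycleConfiguration",
        "s3:PutLifecycleConfiguration"
      ],
      "Resource": ["arn:aws:s3:::<BUCKET>/*", "arn:aws:s3:::<BUCKET>"],
      "Effect": "Allow"
    },
    {
      "Action": ["sts:AssumeRole"],
      "Resource": ["arn:aws:iam::<AWS_ACCOUNT_ID>:role/<AWS_IAM_ROLE_NAME>"],
      "Effect": "Allow"
    }
  ]
}
"""


def render_policy(template: str, bucket: str, aws_account_id: str, role_name: str) -> dict:
    """Fill in the three placeholders and return the policy as parsed JSON."""
    text = (
        template.replace("<BUCKET>", bucket)
        .replace("<AWS_ACCOUNT_ID>", aws_account_id)
        .replace("<AWS_IAM_ROLE_NAME>", role_name)
    )
    # Any remaining angle bracket means a placeholder was missed or misspelled.
    if "<" in text or ">" in text:
        raise ValueError("unreplaced placeholder left in policy")
    return json.loads(text)  # also confirms the result is valid JSON


policy = render_policy(
    POLICY_TEMPLATE,
    bucket="dbacademy-test-metastore-bucket",
    aws_account_id="123456789012",  # sample value, use your own account ID
    role_name="dbacademy-test-metastore-role",
)
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted directly into the **JSON** tab of the policy editor.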

### Creating an IAM role

Let's create an IAM role that will allow Databricks to access this bucket residing in your own account.
1. In the AWS console, let's go to **IAM > Roles**.
1. Click **Create role**.
1. Select **Custom trust policy**. This will let us set up a cross-account trust relationship that will allow Unity Catalog to assume the role to access the bucket on our behalf.
   * In the **Custom trust policy** area, let's paste the following policy as a starting point.
    ```
    {
      "Version": "2012-10-17",
      "Statement": [
        {
          "Effect": "Allow",
          "Principal": {
            "AWS": "arn:aws:iam::414351767826:role/unity-catalog-prod-UCMasterRole-14S5ZJVKOTYTL"
          },
          "Action": "sts:AssumeRole",
          "Condition": {
            "StringEquals": {
              "sts:ExternalId": "<DATABRICKS_ACCOUNT_ID>"
            }
          }
        }
      ]
    }
    ```
   * For **`<DATABRICKS_ACCOUNT_ID>`** let's substitute in our Databricks account ID. We can easily get this from the account console as we did earlier. Treat this value carefully like you would any other credential.
   * Now let's click **Next**.
1. Now let's locate and select the policy we created.
1. Finally, let's assign the name for our role. Let's use *dbacademy-test-metastore-role* and create the role.
1. Let's take note of the **ARN** as the account administrator will need this when creating the metastore.
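
The ARN noted in the last step should match the `sts:AssumeRole` resource written into the IAM policy earlier, since both are built from the same AWS account ID and role name. A small sketch of that consistency check follows. The helper names are hypothetical, `123456789012` is a sample account ID, and role paths (names containing `/`) are not handled.

```python
import re

# arn:aws:iam::<12-digit account ID>:role/<role name>, without a role path.
_ROLE_ARN = re.compile(r"arn:aws:iam::(\d{12}):role/([A-Za-z0-9_+=,.@-]+)")


def parse_role_arn(arn: str):
    """Split a role ARN into (aws_account_id, role_name)."""
    match = _ROLE_ARN.fullmatch(arn)
    if match is None:
        raise ValueError(f"not a simple IAM role ARN: {arn!r}")
    return match.group(1), match.group(2)


def matches_policy(arn: str, aws_account_id: str, role_name: str) -> bool:
    """True if the ARN is the one the policy's sts:AssumeRole statement names."""
    return parse_role_arn(arn) == (aws_account_id, role_name)


noted_arn = "arn:aws:iam::123456789012:role/dbacademy-test-metastore-role"
print(parse_role_arn(noted_arn))
print(matches_policy(noted_arn, "123456789012", "dbacademy-test-metastore-role"))
```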

&copy; 2023 Databricks, Inc. All rights reserved.<br/>
Apache, Apache Spark, Spark and the Spark logo are trademarks of the <a href="https://www.apache.org/">Apache Software Foundation</a>.<br/>
<br/>
<a href="https://databricks.com/privacy-policy">Privacy Policy</a> | <a href="https://databricks.com/terms-of-use">Terms of Use</a> | <a href="https://help.databricks.com/">Support</a>