# Accessing storage in Databricks
---
## 0. Prerequisites
### 0.1. Azure App registration

1. [Create an Azure AD application and service principal that can access resources](https://docs.microsoft.com/en-us/azure/azure-resource-manager/resource-group-create-service-principal-portal). Note the following properties:
  - application-id: The application (client) ID that uniquely identifies the application
  - directory-id: The directory (tenant) ID that uniquely identifies the Azure AD instance
  - storage-account-name: The name of the storage account
  - service-credential: The client secret that the application uses to prove its identity

2. Assign the service principal a suitable role, such as Storage Blob Data Contributor, on the Azure Data Lake Storage Gen2 account.

### 0.2. Adding scoped secrets
Secret scopes and secrets are created with the [Databricks CLI](https://docs.databricks.com/dev-tools/cli/index.html).

Once signed in via the CLI, issue the following commands:
1. `databricks secrets create-scope --scope Analysts`
2. `databricks secrets put --scope Analysts --key SPID --string-value "Service Principal ID"` (Application Client ID)
3. `databricks secrets put --scope Analysts --key SPKey --string-value "Service Principal Secret Key"`
4. `databricks secrets put --scope Analysts --key DirectoryID --string-value "Azure Directory ID"`
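
Before moving on, it can help to confirm from a notebook that all three secrets landed in the scope. The sketch below is one possible check, not part of the original setup: `missing_secret_keys` is a hypothetical helper name, and it assumes the entries returned by `dbutils.secrets.list` expose a `key` attribute. The `dbutils` object is passed in as an argument so the function does not rely on notebook globals.

```python
def missing_secret_keys(dbutils, scope, required_keys):
    """Return the required keys that are not present in the given secret scope."""
    # dbutils.secrets.list returns metadata objects; only their `key` is used here
    existing = {secret.key for secret in dbutils.secrets.list(scope)}
    return [key for key in required_keys if key not in existing]

# In a Databricks notebook, where `dbutils` is predefined:
# missing_secret_keys(dbutils, 'Analysts', ['SPID', 'SPKey', 'DirectoryID'])
```

An empty list means every key used in the rest of this notebook is available.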

## 1. Mount Azure Data Lake Storage Gen2 filesystem
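This only needs to be done once per workspace: a mount point persists across cluster restarts and is visible to every cluster in the workspace.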

In [3]:
# service principal details (read from the secret scope; never paste the real values into the notebook)
client_id     = dbutils.secrets.get(scope='Analysts', key='SPID')         # application (client) ID
client_secret = dbutils.secrets.get(scope='Analysts', key='SPKey')        # client secret
directory_id  = dbutils.secrets.get(scope='Analysts', key='DirectoryID')  # directory (tenant) ID

# data lake container details
account_name = 'storagejamesleslie'
container_name = 'ecovacs'

configs = {'fs.azure.account.auth.type': 'OAuth',
           'fs.azure.account.oauth.provider.type': 'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider',
           'fs.azure.account.oauth2.client.id': client_id,
           'fs.azure.account.oauth2.client.secret': client_secret,
           'fs.azure.account.oauth2.client.endpoint': f'https://login.microsoftonline.com/{directory_id}/oauth2/token'}

# Optionally, you can add <directory-name> to the source URI of your mount point.
dbutils.fs.mount(
    source = f'abfss://{container_name}@{account_name}.dfs.core.windows.net/',
    mount_point = f'/mnt/{container_name}',
    extra_configs = configs)

# list all items in container
display(dbutils.fs.ls(f'/mnt/{container_name}'))
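
Re-running the mount cell typically fails once the mount point exists, so a guard can make the cell safe to run repeatedly. This is a minimal sketch under that assumption: `mount_if_needed` is a hypothetical helper, and it relies on the entries from `dbutils.fs.mounts()` exposing a `mountPoint` attribute.

```python
def mount_if_needed(dbutils, source, mount_point, extra_configs):
    """Mount `source` at `mount_point` unless that mount point already exists.

    Returns True if a new mount was created, False if it was already there.
    """
    # Compare against the mount points that are currently registered
    if any(m.mountPoint == mount_point for m in dbutils.fs.mounts()):
        return False
    dbutils.fs.mount(source=source, mount_point=mount_point, extra_configs=extra_configs)
    return True

# In a Databricks notebook, reusing the variables defined in the cell above:
# mount_if_needed(
#     dbutils,
#     source=f'abfss://{container_name}@{account_name}.dfs.core.windows.net/',
#     mount_point=f'/mnt/{container_name}',
#     extra_configs=configs)
```

The check only compares mount point names. It does not verify that an existing mount points at the same source.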

### Unmount a mount point

In [5]:
# unmount container
dbutils.fs.unmount(f'/mnt/{container_name}')

## 2. Access directly with service principal and OAuth 2.0
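These settings are scoped to the Spark session, so all of the code below must be re-run in every new session (for example, after a cluster restart).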

In [7]:
# service principal details (read from the secret scope; never paste the real values into the notebook)
client_id     = dbutils.secrets.get(scope='Analysts', key='SPID')         # application (client) ID
client_secret = dbutils.secrets.get(scope='Analysts', key='SPKey')        # client secret
directory_id  = dbutils.secrets.get(scope='Analysts', key='DirectoryID')  # directory (tenant) ID

# update spark configs
spark.conf.set('fs.azure.account.auth.type', 'OAuth')
spark.conf.set('fs.azure.account.oauth.provider.type', 'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider')
spark.conf.set('fs.azure.account.oauth2.client.id', client_id)
spark.conf.set('fs.azure.account.oauth2.client.secret', client_secret)
spark.conf.set('fs.azure.account.oauth2.client.endpoint', f'https://login.microsoftonline.com/{directory_id}/oauth2/token')

# list all items in container / directory
display(dbutils.fs.ls('abfss://ecovacs@storagejamesleslie.dfs.core.windows.net/BestBuy/'))
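
The settings above apply to every storage account the session touches. If several accounts with different credentials are in play, the same keys can be scoped to one account by appending `.<storage-account-name>.dfs.core.windows.net` to each key. The sketch below builds such a configuration and the matching `abfss://` URI. Both function names are hypothetical helpers introduced for illustration.

```python
def account_oauth_conf(account_name, client_id, client_secret, directory_id):
    """Build OAuth settings scoped to a single ADLS Gen2 storage account."""
    suffix = f'{account_name}.dfs.core.windows.net'
    return {
        f'fs.azure.account.auth.type.{suffix}': 'OAuth',
        f'fs.azure.account.oauth.provider.type.{suffix}':
            'org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider',
        f'fs.azure.account.oauth2.client.id.{suffix}': client_id,
        f'fs.azure.account.oauth2.client.secret.{suffix}': client_secret,
        f'fs.azure.account.oauth2.client.endpoint.{suffix}':
            f'https://login.microsoftonline.com/{directory_id}/oauth2/token',
    }

def abfss_uri(container_name, account_name, path=''):
    """Build the abfss:// URI for a path inside a container."""
    return f'abfss://{container_name}@{account_name}.dfs.core.windows.net/{path.lstrip("/")}'

# In a Databricks notebook, where `spark` and `dbutils` are predefined:
# for key, value in account_oauth_conf('storagejamesleslie', client_id, client_secret, directory_id).items():
#     spark.conf.set(key, value)
# display(dbutils.fs.ls(abfss_uri('ecovacs', 'storagejamesleslie', 'BestBuy/')))
```

Building the URI in one place also avoids hard-coding the account and container names in every cell.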

---
# Using dbutils

In [9]:
dbutils.fs.help()

## 1. Listing files / directories

In [11]:
# Mount the "transactions" container (reuses account_name and configs from section 1)
dbutils.fs.mount(
    source = f'abfss://transactions@{account_name}.dfs.core.windows.net/',
    mount_point = '/mnt/transactions',
    extra_configs = configs)

# list all mounted directories
display(dbutils.fs.ls('/mnt'))

In [12]:
# list all items in the 2020 directory
display(dbutils.fs.ls('/mnt/transactions/2020'))
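
`dbutils.fs.ls` lists a single level only. To see every file below a directory, the listing can be walked recursively. The following is a minimal sketch: `list_files_recursive` is a hypothetical helper, and it assumes each listed entry offers an `isDir()` method and a `path` attribute.

```python
def list_files_recursive(dbutils, path):
    """Yield every file (not directory) found under `path`, depth first."""
    for entry in dbutils.fs.ls(path):
        if entry.isDir():
            # Descend into the subdirectory and yield its files
            yield from list_files_recursive(dbutils, entry.path)
        else:
            yield entry

# In a Databricks notebook, where `dbutils` is predefined:
# for f in list_files_recursive(dbutils, '/mnt/transactions/2020'):
#     print(f.path)
```

Each directory costs one listing call, so this can be slow on very deep or wide trees.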

## 2. Creating new directories

In [14]:
# create a new directory
dbutils.fs.mkdirs('/mnt/transactions/2020/p14')

# check that it appears in the mount point
display(dbutils.fs.ls('/mnt/transactions/2020'))

In [15]:
# does it appear in the data lake too?
display(dbutils.fs.ls('abfss://transactions@storagejamesleslie.dfs.core.windows.net/2020/'))
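
The two listings match because a path under the mount point and the `abfss://` URI refer to the same location. The sketch below makes that mapping explicit by looking up the mount's source. `mount_path_to_source` is a hypothetical helper, and it assumes the entries from `dbutils.fs.mounts()` expose `mountPoint` and `source` attributes.

```python
def mount_path_to_source(dbutils, path):
    """Translate a mounted path into the URI of the storage it is mounted from."""
    # Try the longest mount points first so the most specific mount wins
    mounts = sorted(dbutils.fs.mounts(), key=lambda m: len(m.mountPoint), reverse=True)
    for m in mounts:
        prefix = m.mountPoint.rstrip('/') + '/'
        if (path.rstrip('/') + '/').startswith(prefix):
            relative = path[len(prefix):]
            return m.source.rstrip('/') + '/' + relative
    raise ValueError(f'{path} is not under any mount point')

# In a Databricks notebook, where `dbutils` is predefined:
# mount_path_to_source(dbutils, '/mnt/transactions/2020/p14')
```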

## 3. Removing directories

In [17]:
# remove the new directory and all its contents
dbutils.fs.rm('/mnt/transactions/2020/p14', recurse=True)

# is it gone?
display(dbutils.fs.ls('abfss://transactions@storagejamesleslie.dfs.core.windows.net/2020/'))
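
Checking the parent listing by eye works, but an explicit existence test is easier to reuse. The sketch below is one common approach, not a documented API: `path_exists` is a hypothetical helper, and it assumes that listing a missing path raises an error whose message contains `FileNotFoundException`. Any other error is re-raised so that permission or configuration problems are not mistaken for a missing path.

```python
def path_exists(dbutils, path):
    """Return True if `path` can be listed, False if it does not exist."""
    try:
        dbutils.fs.ls(path)
        return True
    except Exception as exc:
        # Assumption: a missing path surfaces the Java exception name in the message
        if 'FileNotFoundException' in str(exc):
            return False
        raise

# In a Databricks notebook, where `dbutils` is predefined:
# path_exists(dbutils, '/mnt/transactions/2020/p14')
```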