# Mount an AWS S3 bucket to Databricks

## Introduction

Databricks is an integrated analytics environment powered by Apache Spark that lets you connect to and read from many data sources, such as AWS S3, HDFS, MySQL, and Cassandra. In this notebook, we will learn how to read data from an Amazon S3 bucket.

To do so, we will follow these steps:
- Create an access key and a secret access key for Databricks in AWS
- Upload the credentials CSV file to Databricks
- Mount an AWS S3 bucket to Databricks
- Read JSON files from the mounted S3 bucket

## Step 1: Create AWS Access Key and Secret Access Key for Databricks

Once the desired data has been uploaded to an S3 bucket, we will go through the following steps:

1. Access the **IAM console** in your AWS account.
<p align="center">
    <img src="images/AWS Console IAM.png" width="600"/>
</p>

2. In the IAM console, under **Access management** click on **Users**.
<p align="center">
    <img src="images/IAM Users.png" width="200" height="300"/>
</p>

3. Click on the **Add users** button.

4. Enter the desired **User name** and click **Next**.

5. On the permissions page, select the **Attach existing policies directly** option. In the search bar, type **AmazonS3FullAccess** and check the box next to it.
(This will allow full access to S3, meaning Databricks will be able to connect to any existing buckets on the AWS account.)

6. Skip the next sections until you reach the **Review page**. There, select the **Create user** button.

7. Now that you have created your IAM User, you will need to assign it a programmatic access key:
- In the **Security Credentials** tab select **Create Access Key**

<p align="center">
    <img src="images/security_credentials.png" width="700"/>
</p>

- On the subsequent page, select **Command Line Interface (CLI)**, navigate to the bottom of the page, and click **I understand**

<p align="center">
    <img src="images/i_understand.png" width="700"/>
</p>

- On the next page, give the keypair a description and select **Create Access Key**

8. Click the **Download .csv file** button to download the credentials you have just created.

<p align="center">
    <img src="images/copy_keypair_2.png" width="700"/>
</p>
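Item 5 in the list above attaches **AmazonS3FullAccess**, which covers every bucket in the account. If that is broader than you need, one alternative is a customer-managed policy scoped to a single bucket. The sketch below builds such a policy document in Python. The bucket name `my-databricks-bucket` is a placeholder, and the choice of actions (listing plus read-only object access) is an assumption about what a read-only workflow needs, so adjust it to your own requirements.

```python
import json


def read_only_bucket_policy(bucket_name):
    """Build an IAM policy document granting read-only access to one bucket."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                # Listing is granted on the bucket itself
                "Effect": "Allow",
                "Action": ["s3:ListBucket"],
                "Resource": [f"arn:aws:s3:::{bucket_name}"],
            },
            {
                # Reading objects is granted on the keys inside the bucket
                "Effect": "Allow",
                "Action": ["s3:GetObject"],
                "Resource": [f"arn:aws:s3:::{bucket_name}/*"],
            },
        ],
    }


# "my-databricks-bucket" is a placeholder name, not a real bucket
policy = read_only_bucket_policy("my-databricks-bucket")
print(json.dumps(policy, indent=2))
```

The printed JSON can be pasted into the IAM policy editor and attached to the user instead of the managed policy.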

## Step 2: Upload the credentials CSV file to Databricks

1. In the **Databricks UI**, click the **Data** icon and then click **Create Table**.
<p align="center">
    <img src="images/Databricks Credential Upload.png" width="700" height="300"/>
</p>

2. Click on **Drop files to upload, or click to browse** and select the credentials file you have just downloaded from AWS. Once the file has been successfully uploaded, you should see a green checkmark next to it.
<p align="center">
    <img src="images/Credentials Uploaded.png" width="500" height="400"/>
</p>

> As we can see, the credentials are uploaded to the following location: */FileStore/tables*.

## Step 3: Mount S3 bucket to Databricks

1. Select the **New** icon and then select **Notebook**.

<p align="center">
    <img src="images/Create Notebook.png" width="500" height="300"/>
</p>

2. Let's check the contents in FileStore, the location where we uploaded our AWS credentials in the last step, by running the following command:

`dbutils.fs.ls("/FileStore/tables")`

You should see the CSV file you uploaded earlier is now inside the FileStore tables folder.

3. Mount the S3 bucket to Databricks.

We will need to import the following libraries first:

In [None]:
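# PySpark column helper used to filter the credentials DataFrame
from pyspark.sql.functions import col
# URL encoding (importing the submodule explicitly makes urllib.parse available)
import urllib.parse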

Now let's read the CSV file containing the AWS keys into Databricks using the code below:

In [None]:
# Specify file type to be csv
file_type = "csv"
# Indicates file has first row as the header
first_row_is_header = "true"
# Indicates file has comma as the delimiter
delimiter = ","
# Read the CSV file into a Spark DataFrame
aws_keys_df = spark.read.format(file_type)\
.option("header", first_row_is_header)\
.option("sep", delimiter)\
.load("/FileStore/tables/authentication_credentials.csv")
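To make the `header` and `sep` options concrete, here is a small sketch that parses an in-memory sample with Python's standard `csv` module. The sample row is made up for illustration, and the column names are assumed to match the ones this notebook looks up in the next cell (`User name`, `Access key ID`, `Secret access key`); the file you downloaded may be laid out differently.

```python
import csv
import io

# Made-up sample mimicking the layout this tutorial assumes (not real credentials)
sample_csv = (
    "User name,Access key ID,Secret access key\n"
    "databricks-user,EXAMPLEKEYID,abc/def+EXAMPLESECRET\n"
)

# DictReader treats the first row as the header, like option("header", "true");
# delimiter="," plays the role of option("sep", ",")
rows = list(csv.DictReader(io.StringIO(sample_csv), delimiter=","))

# Pick the row for the IAM user, mirroring a DataFrame filter on 'User name'
user_row = next(r for r in rows if r["User name"] == "databricks-user")
print(user_row["Access key ID"])
```

If the header option were off, the first line would be treated as data and the columns would not be addressable by name.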

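We can extract the `access key` and `secret access key` from the Spark DataFrame created above. The secret access key is then URL-encoded with `urllib.parse.quote`, because it will be embedded in the S3 source URL and may contain characters such as `/` or `+` that would otherwise break the URL. This encoding is reversible and does not protect the key. Setting `safe=""` means no reserved characters are exempt, so `/` (which `quote` leaves untouched by default) is percent-encoded as well.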

In [None]:
# Get the AWS access key and secret key from the spark dataframe
ACCESS_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Secret access key').collect()[0]['Secret access key']
# URL-encode the secret key
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
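A quick way to see what `safe=""` changes is to encode a sample string with and without it. The secret below is made up for illustration; only the standard library is used.

```python
from urllib.parse import quote, unquote

# Made-up secret containing characters that are special inside a URL
sample_secret = "abc/def+ghi"

# With the default safe="/", the slash is left untouched
default_encoded = quote(sample_secret)
# With safe="", the slash is percent-encoded as well
fully_encoded = quote(sample_secret, safe="")

print(default_encoded)  # abc/def%2Bghi
print(fully_encoded)    # abc%2Fdef%2Bghi

# Decoding restores the original string exactly
assert unquote(fully_encoded) == sample_secret
```

An unencoded `/` inside the secret would be read as a path separator once the key is placed in the source URL, which is why the fully encoded form is the one to use.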

We can now mount the S3 bucket by passing the **S3 URL** and the **desired mount name** to `dbutils.fs.mount()`. Make sure to replace the value of `AWS_S3_BUCKET` with the name of the bucket where your data is stored, and the value of `MOUNT_NAME` with the desired name inside your Databricks workspace.

In [None]:
# AWS S3 bucket name
AWS_S3_BUCKET = "bucket_name"
# Mount name for the bucket
MOUNT_NAME = "/mnt/mount_name"
# Source url
SOURCE_URL = "s3n://{0}:{1}@{2}".format(ACCESS_KEY, ENCODED_SECRET_KEY, AWS_S3_BUCKET)
# Mount the drive
dbutils.fs.mount(SOURCE_URL, MOUNT_NAME)

The code above will return *True* if the bucket was mounted successfully. You will only need to mount the bucket once, and then you should be able to access it from Databricks at any time.
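Because the bucket only needs to be mounted once, re-running the mount cell on an existing mount point typically fails. A minimal sketch of a guard is shown below. The helper name `mount_if_needed` is hypothetical; it relies on `dbutils.fs.mounts()`, whose entries expose a `mountPoint` attribute, and it takes `dbutils` as a parameter so the function can be defined anywhere.

```python
def mount_if_needed(dbutils, source_url, mount_name):
    """Mount source_url at mount_name unless that mount point already exists.

    Returns True if a new mount was created, False if it was already present.
    """
    # Collect the mount points that are currently registered
    existing = {m.mountPoint for m in dbutils.fs.mounts()}
    if mount_name in existing:
        return False
    dbutils.fs.mount(source_url, mount_name)
    return True
```

In a Databricks notebook, where `dbutils` is predefined, this could be called as `mount_if_needed(dbutils, SOURCE_URL, MOUNT_NAME)`.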

## Step 4: Read data from the mounted S3 bucket

1. To check whether the S3 bucket was mounted successfully, run the following command:

`display(dbutils.fs.ls("/mnt/mount_name/../.."))`

If inside the mounted S3 bucket your data is organised in folders, you can specify the whole path in the above command after `/mnt/mount_name`. With the correct path specified, you should be able to see the contents of the S3 bucket when running the above command.
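As a small illustration of building such a path, the sketch below joins a mount point with nested folder names. The helper `mounted_path` and the folder names `raw` and `2023` are hypothetical and stand in for your own layout.

```python
import posixpath


def mounted_path(mount_name, *folders):
    """Join a mount point and nested folder names into one DBFS path."""
    return posixpath.join(mount_name, *folders)


# "raw" and "2023" are placeholder folder names
path = mounted_path("/mnt/mount_name", "raw", "2023")
print(path)  # /mnt/mount_name/raw/2023
```

The resulting string can then be passed to `dbutils.fs.ls` in a notebook to list that folder.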

2. Read the JSON format dataset from S3 into Databricks using the code below:

In [None]:
# File location and type
# The asterisk (*) is a wildcard that matches every file with a .json extension in the specified folder
file_location = "/mnt/mount_name/filepath_to_data_objects/*.json"
file_type = "json"
# Ask Spark to infer the schema
infer_schema = "true"
# Read in JSONs from mounted S3 bucket
df = spark.read.format(file_type) \
.option("inferSchema", infer_schema) \
.load(file_location)
# Display Spark dataframe to check its content
display(df)
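Schema inference requires Spark to scan the JSON files before it can build the DataFrame. When the structure of the data is known in advance, supplying a schema is an alternative. The sketch below is one way to do that, assuming a hypothetical helper `read_json_with_schema` and made-up field names. It passes the schema as a DDL string to `DataFrameReader.schema` and takes `spark` as a parameter.

```python
def read_json_with_schema(spark, file_location, schema_ddl):
    """Read JSON files using a caller-supplied schema instead of inference."""
    return spark.read.schema(schema_ddl).json(file_location)


# Hypothetical field names; replace them with the fields in your own data
EXAMPLE_SCHEMA = "id INT, name STRING, created_at TIMESTAMP"
```

In a Databricks notebook, where `spark` is predefined, this could be used as `df = read_json_with_schema(spark, file_location, EXAMPLE_SCHEMA)`.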

## Step 5 (Optional): Unmount S3 bucket

If you want to unmount the S3 bucket, run the following code:

`dbutils.fs.unmount("/mnt/mount_name")`
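If the cell may be run when nothing is mounted at that path, a guarded variant can avoid an error. This is a sketch under the assumption that checking `dbutils.fs.mounts()` first is acceptable for your workflow; the helper name `unmount_if_mounted` is hypothetical.

```python
def unmount_if_mounted(dbutils, mount_name):
    """Unmount mount_name only if it is currently mounted.

    Returns True if an unmount was performed, False otherwise.
    """
    mounted = any(m.mountPoint == mount_name for m in dbutils.fs.mounts())
    if mounted:
        dbutils.fs.unmount(mount_name)
    return mounted
```

In a notebook this could be called as `unmount_if_mounted(dbutils, "/mnt/mount_name")`.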

## Conclusion
At this point, you should have a good understanding of:
- How to create AWS access keys for Databricks and how to upload them
- How to mount an Amazon S3 bucket to Databricks and how to unmount it
- How to read data from mounted buckets