# Integrating Databricks with AWS Kinesis

## Introduction

In this lesson you will learn how to stream data from **Kinesis** to **Databricks** using `pyspark`.

To follow along, you will need the following resources set up:
- An AWS account
- A Databricks account with an **AWS Access Key** and an **AWS Secret Access Key** for it
- One or more Kinesis Data Streams
- A preferred method to ingest data into Kinesis Data Streams (such as sending data to an API with a Kinesis proxy integration)

## Read streaming data from Kinesis

Using your preferred method, start ingesting data into your Kinesis data stream. Once you see the data arriving in the stream, you are ready to read it into Databricks.
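
If you do not yet have a producer, one possible way to send test records is the `boto3` `put_record` call. The snippet below is only a sketch: the helper names, payload fields, and partition key are hypothetical, and it assumes `boto3` is installed and AWS credentials are configured in your environment.

```python
import json


def build_put_record_args(stream_name, payload, partition_key):
    """Build the keyword arguments for a Kinesis put_record call."""
    return {
        "StreamName": stream_name,
        # Kinesis stores the payload as raw bytes, so serialise it to UTF-8 JSON
        "Data": json.dumps(payload).encode("utf-8"),
        # Records with the same partition key are routed to the same shard
        "PartitionKey": partition_key,
    }


def send_record(stream_name, payload, partition_key, region="us-east-1"):
    """Send one record to Kinesis (requires boto3 and valid AWS credentials)."""
    import boto3  # imported here so the helper above works without boto3

    client = boto3.client("kinesis", region_name=region)
    return client.put_record(
        **build_put_record_args(stream_name, payload, partition_key)
    )


# Hypothetical record; the field names are placeholders
args = build_put_record_args(
    "<KINESIS_STREAM_NAME>", {"id": 1, "event": "click"}, "user-1"
)
```

Calling `send_record` with the same arguments would submit the record to the stream. Because the payload is sent as UTF-8 JSON, it can be turned back into text once it has been read into Databricks.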

You will first need to read the CSV file containing your AWS **Access Key** and **Secret Access Key**. To do this, you can run the code below:

In [None]:
from pyspark.sql.types import *
from pyspark.sql.functions import *
import urllib.parse

# Specify file type to be csv
file_type = "csv"
# Indicates file has first row as the header
first_row_is_header = "true"
# Indicates file has comma as the delimiter
delimiter = ","
# Read the CSV file into a Spark DataFrame
aws_keys_df = spark.read.format(file_type)\
.option("header", first_row_is_header)\
.option("sep", delimiter)\
.load("/FileStore/tables/authentication_credentials.csv")

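We can extract the `ACCESS_KEY` and `SECRET_KEY` from the Spark DataFrame created above. The secret access key is also percent-encoded with `urllib.parse.quote`, which escapes characters such as `/` and `+` so that the key can be embedded safely in a URL; this is not a security measure. Passing `safe=""` means that no reserved character is exempt from encoding, so `/`, which `quote` leaves untouched by default, is encoded as well. The Kinesis reader further below takes the unencoded `SECRET_KEY`.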

In [None]:
# Get the AWS access key and secret key from the spark dataframe
ACCESS_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Access key ID').collect()[0]['Access key ID']
SECRET_KEY = aws_keys_df.where(col('User name')=='databricks-user').select('Secret access key').collect()[0]['Secret access key']
# Encode the secret key
ENCODED_SECRET_KEY = urllib.parse.quote(string=SECRET_KEY, safe="")
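
To see what the encoding step does, the snippet below applies `urllib.parse.quote` to a made-up string (not a real key), once with the default `safe` value and once with `safe=""`.

```python
import urllib.parse

# Made-up value standing in for a secret key; not a real credential
fake_secret = "abc/def+ghi"

# By default "/" is treated as safe and left as is
default_encoded = urllib.parse.quote(fake_secret)

# With safe="" the "/" is percent-encoded too
fully_encoded = urllib.parse.quote(fake_secret, safe="")

print(default_encoded)  # abc/def%2Bghi
print(fully_encoded)    # abc%2Fdef%2Bghi
```

Letters, digits, and the characters `_`, `.`, `-`, and `~` are never encoded, so the result differs from the input only where reserved characters appear.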

Using the `ACCESS_KEY` and `SECRET_KEY`, we can now read the streaming data from Kinesis as shown below (make sure you are sending data to your stream before running the code):

In [None]:
df = spark \
.readStream \
.format('kinesis') \
.option('streamName','<KINESIS_STREAM_NAME>') \
.option('initialPosition','earliest') \
.option('region','us-east-1') \
.option('awsAccessKey', ACCESS_KEY) \
.option('awsSecretKey', SECRET_KEY) \
.load()

You can see the streaming data by calling the `display` function on the DataFrame (`display(df)`). The data will arrive in the default Kinesis schema, which has the following columns:
- `partitionKey`
- `data`
- `stream`
- `shardId`
- `sequenceNumber`
- `approximateArrivalTimestamp`

> The display query runs continuously, with the output updated every few seconds. The row count is shown at the bottom of the query output (`Showing all <number> rows`). You should see this number increase as more data is sent to your stream. The command runs indefinitely; to stop it, press the **Interrupt** button at the top of the Databricks Notebook console.

To see the data contained in your stream, you can explicitly deserialize the `data` column of the dataframe by running the following command:

`df = df.selectExpr("CAST(data as STRING)")`

If you run `display(df)` again, you should see the data in your stream being displayed to the console.
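
As a plain-Python illustration of what the cast does to a single record, the sketch below decodes a made-up payload. It assumes the producer sends UTF-8 encoded JSON, which may not be true for your stream, and the field names are hypothetical.

```python
import json

# Made-up example of the raw bytes held in the `data` column for one record
raw_data = b'{"id": 1, "event": "click"}'

# CAST(data AS STRING) corresponds to decoding the bytes as UTF-8 text
as_text = raw_data.decode("utf-8")

# If the text is JSON, it can then be parsed into individual fields
record = json.loads(as_text)

print(as_text)          # {"id": 1, "event": "click"}
print(record["event"])  # click
```

On the streaming DataFrame itself, the equivalent parsing step would typically use `from_json` from `pyspark.sql.functions` together with a schema describing your payload.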

## Writing streaming data to Databricks

After performing any necessary transformations to your streaming data, you are ready to store the transformed streams in Databricks. One way to do this is by writing the streams to Databricks Delta tables, as seen below:

In [None]:
df.writeStream \
  .format("delta") \
  .outputMode("append") \
  .option("checkpointLocation", "/tmp/kinesis/_checkpoints/") \
  .table("<TABLE_NAME>")

The `.option("checkpointLocation", "/tmp/kinesis/_checkpoints/")` setting allows the query to recover its previous state in case of failure. If you want to run the `writeStream` query again from scratch rather than resume from the saved state, delete the checkpoint folder first using the following command:

`dbutils.fs.rm("/tmp/kinesis/_checkpoints/", True)`
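
Each streaming query needs its own checkpoint location, so if you write several streams it can help to derive the path from the table name. The helper below is a hypothetical convenience, not part of the Databricks API; its default base path is the one used in the `writeStream` call above.

```python
def checkpoint_path(table_name, base="/tmp/kinesis/_checkpoints/"):
    """Return a checkpoint directory dedicated to one target table."""
    return f"{base.rstrip('/')}/{table_name}/"


print(checkpoint_path("<TABLE_NAME>"))  # /tmp/kinesis/_checkpoints/<TABLE_NAME>/
```

The returned value can be passed to `.option("checkpointLocation", ...)` and to `dbutils.fs.rm(..., True)` when you want to reset a single query without touching the others.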

Just like the `display` query on the stream, the `writeStream` query will run indefinitely until you interrupt it. To check that the data was saved as expected, open the **Data** section in the Databricks menu. You should be able to see the created Delta table under **Catalogs** -> **Databases** -> **Tables**.

<p align="center">
    <img src="images/Delta Table.png" width="1000" height="300"/>
</p>

Selecting the table you have just created, you should be able to see its **Schema** and the **Sample Data** that has been stored inside it.

<p align="center">
    <img src="images/Example Table.png" width="900" height="500"/>
</p>

## Conclusion
At this point, you should have a good understanding of:
- How to read data from Kinesis Data Streams in Databricks
- How to save streams in Delta tables in Databricks