# Project Log

## 1. Set up a new conda environment

In [None]:
conda create -n pinterest_conda_env
conda activate pinterest_conda_env

Install any libraries required to run `user_posting_emulation.py` using:

In [None]:
conda install <library_name>
conda install -c conda-forge <library_name>

## 2. Data infrastructure simulating Pinterest.

`user_posting_emulation.py` contains the login credentials for an RDS database. The database holds three tables whose data resembles what the Pinterest API receives when a user uploads data to Pinterest through a POST request.


## 3. Batch processing : Configuring the EC2 Kafka client

To use an MSK cluster, a client is needed to communicate with it. In this project, an EC2 instance acts as the client. In my case, both the MSK cluster and the EC2 instance had already been created.

**To create an MSK cluster:**

Amazon MSK > Create cluster > Choose quick create

> - Name cluster 'pinterest-msk-cluster'
> - Choose Cluster type: provisioned  
> - Apache Kafka version: 2.8.1   
> - Choose Broker type: Kafka.m5.large    
> - Amazon EBS storage per broker: 100 GiB    

Select create cluster

**To create an EC2 instance:**   

>- Name: 0a60b9a8a831  
>- Instance ID : i-034925981e7bb03f3   
>- Instance type: t2.micro 
>- Amazon Machine Image : Amazon linux 2023 AMI
>- Availability Zone: us-east-1a   
>- Public IPV4 DNS : ec2-34-207-200-90.compute-1.amazonaws.com 
>- Public IPv4 address : 34.207.200.90 
>- Key pair name : 0a60b9a8a831-key-pair
>- Key pair type : RSA
>- Private key file format : .pem

An IAM role was also set up for the EC2 instance to allow Kafka access.

### 3.1 Create .pem file locally

Use the key pair assigned at launch of the EC2 instance to create a .pem file locally. This allows a secure connection from the local machine to the EC2 instance via SSH.

EC2 > Instances > Search "i-034925981e7bb03f3" > Details > Key pair assigned at launch.

Copy the key pair value and save it in a .pem file, then make sure the .pem file is read-only for the owner (User class).  

In [None]:
chmod 400 0a60b9a8a831-key-pair.pem

### 3.2 Connect to the EC2 instance
Follow the instructions on AWS on how to connect to EC2 through the CLI.  
    
> ssh -i "0a60b9a8a831-key-pair.pem" ec2-user@ec2-34-207-200-90.compute-1.amazonaws.com     

### 3.3 Installing Kafka on the EC2 client 


In order for the client machine to connect and send data to the MSK cluster, it's necessary to edit the inbound rules for the security group associated with the MSK cluster.

VPC > Security > Security groups > Select the default security group associated with the cluster VPC 

(The associated Security group can be found in MSK console > properties tab > Networking settings section > Security groups applied) 

Edit inbound rules > Add rule :
> - Type column: All traffic     
> - Source column: ID of the security group of the client machine (found in EC2 console > security tab). 
> - Save rules and cluster will accept all traffic from the client machine.

#### Step 1 : Install Java and Kafka on the EC2 client machine

First install Java and check that it is the correct version with `java -version`.     
Then download and extract Kafka 2.8.1 (the `kafka_2.12-2.8.1` archive, built for Scala 2.12).

In [None]:
sudo yum install java-1.8.0

# Get the version of kafka to install
wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz 
# To 'unzip' or 'untar' the file
tar -xzf kafka_2.12-2.8.1.tgz  
# Remove the compressed file, keeping only uncompressed version                                          
rm kafka_2.12-2.8.1.tgz                                                 

#### Step 2 : Installing IAM MSK authentication package on client EC2 machine.

The [IAM MSK authentication package](https://github.com/aws/aws-msk-iam-auth) is required for verifying a client which uses IAM authentication for connecting it to MSK clusters. 

Inside the `kafka_2.12-2.8.1/libs` directory, download the IAM MSK authentication package from GitHub.

In [None]:
cd kafka_2.12-2.8.1/libs
# Download msk iam authentication file
wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.5/aws-msk-iam-auth-1.1.5-all.jar  

Configure the client classpath environment variable in `.bashrc` so the client is able to use the IAM package.

In [None]:
# Open the bash configuration file
nano /home/ec2-user/.bashrc

# Add this line inside .bashrc to set the class path for MSK IAM authentication
export CLASSPATH=/home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar

# After saving, reload .bashrc so the variable takes effect in the current shell
source ~/.bashrc

![](media/log/nano_bashrc_20231211.png)

If configured correctly, `echo $CLASSPATH` returns the class path '/home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar'.

#### Step 3 : Authenticate MSK cluster using EC2 IAM role 

Before configuring the EC2 client to use AWS IAM for cluster authentication, a setup is required to ensure that the EC2 instance has the necessary IAM role and permissions to authenticate and interact with the MSK cluster securely. A trust relationship can allow the EC2 instance to assume its own IAM role (essentially saying, "This EC2 IAM role is trusted by itself to assume its own role."), which contains the required permissions for MSK authentication.

The EC2 instance needs to assume the `0a60b9a8a831-ec2-access-role` IAM role, which contains the permissions required to authenticate to the MSK cluster. First, retrieve the role's ARN.
    
IAM console > Roles > 0a60b9a8a831-ec2-access-role > copy ARN      

Trust relationships tab > Edit trust policy > Add a principal       
> - Selected IAM roles as the Principal type 
> - Replace ARN with the 0a60b9a8a831-ec2-access-role ARN (copied from ec2-access-role)

#### Step 4 : Configure Kafka client to use AWS IAM authentication to the cluster 

Configure the Kafka client to use AWS IAM authentication when connecting to the MSK cluster.
On the EC2 client, inside the `kafka_2.12-2.8.1/bin` directory, use nano to modify the `client.properties` file as follows:

In [None]:
# See https://github.com/aws/aws-msk-iam-auth for instructions on configuring a Kafka client
# to use AWS IAM with the AWS_MSK_IAM mechanism.

# Sets up TLS (Transport Layer Security) for encryption and SASL (Simple Authentication and
# Security Layer) for authentication. Comments in a .properties file must start with '#'.
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";


# Encapsulates constructing a SigV4 signature based on extracted credentials. 
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = \
    software.amazon.msk.auth.iam.IAMClientCallbackHandler

### 3.4 Creating Kafka topics

The bootstrap server and zookeeper string are required to create Kafka topics on the Kafka cluster using the EC2 client.  

Use the MSK management console to get the cluster information:

Amazon MSK > pinterest-msk-cluster > View client information, and note:  
- Bootstrap servers Private endpoint (single-VPC)
- Plaintext Apache Zookeeper connection string  

The three topic names being created are: 
- 0a60b9a8a831.pin for the Pinterest posts data
- 0a60b9a8a831.geo for the post geolocation data
- 0a60b9a8a831.user for the post user data

In the EC2 client, inside the `kafka_2.12-2.8.1/bin` directory, run the commands:

In [None]:
# Keep each broker address on one line: a line break inside the hostname splits it into two arguments
./kafka-topics.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --command-config client.properties --create --topic 0a60b9a8a831.pin
./kafka-topics.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --command-config client.properties --create --topic 0a60b9a8a831.geo
./kafka-topics.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --command-config client.properties --create --topic 0a60b9a8a831.user
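
The three commands differ only in the topic name, so they can be generated rather than typed by hand. The following is a minimal Python sketch, not part of the original project: `topic_names` and `create_topic_command` are hypothetical helpers, and it assumes that a single broker address is enough for `--bootstrap-server`. It only prints the commands; nothing is executed.

```python
import shlex

# Values taken from this log; adjust them for your own cluster.
USER_ID = "0a60b9a8a831"
BOOTSTRAP_SERVER = "b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098"
TOPIC_SUFFIXES = ("pin", "geo", "user")


def topic_names(user_id=USER_ID, suffixes=TOPIC_SUFFIXES):
    """Return the full topic names, e.g. '0a60b9a8a831.pin'."""
    return [f"{user_id}.{suffix}" for suffix in suffixes]


def create_topic_command(topic, bootstrap_server=BOOTSTRAP_SERVER):
    """Build the kafka-topics.sh argument list for one topic."""
    return [
        "./kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", "client.properties",
        "--create",
        "--topic", topic,
    ]


if __name__ == "__main__":
    for name in topic_names():
        # shlex.join quotes each argument so the line can be pasted into a shell.
        print(shlex.join(create_topic_command(name)))
```

Building the command as a list of arguments keeps the hostname in one piece, which avoids the line-continuation problem that can occur when a long address is wrapped by hand.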

## 4. Batch processing : Connecting the MSK cluster to an S3 bucket

Next, configure MSK Connect so that the MSK cluster automatically sends its data to an S3 bucket, where it is stored partitioned by topic. This is achieved by downloading the Confluent.io Amazon S3 Connector on the EC2 client and copying it to the S3 bucket, then creating a connector in MSK Connect from a custom plugin built on that ZIP file (the plugin is designed to connect to the S3 bucket).

##### **Step 1:** Download the Confluent connector on the EC2 client and copy it to the S3 bucket
The S3 bucket the data will be saved in is `user-0a60b9a8a831-bucket` (an IAM role has already been set up to write to it).
Download the Confluent package:

In [None]:
# start a login shell as ec2-user
sudo -u ec2-user -i 

# create directory where we will save our connector 
mkdir kafka-connect-s3 && cd kafka-connect-s3 

# download connector from Confluent (the URL must stay on one line)
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.0.3/confluentinc-kafka-connect-s3-10.0.3.zip

# copy connector to S3 bucket 
aws s3 cp ./confluentinc-kafka-connect-s3-10.0.3.zip s3://user-0a60b9a8a831-bucket/kafka-connect-s3/

##### **Step 2 :** Create custom plugin in the MSK Connect console
   
MSK console > MSK Connect section  > Custom plugins > Create custom plugin. 
    
>- Name this plugin 0a60b9a8a831-plugin
>- Choose bucket where Confluent connector ZIP file is located (s3://user-0a60b9a8a831-bucket/kafka-connect-s3/confluentinc-kafka-connect-s3-10.0.3.zip)

##### **Step 3:** Create a connector in MSK connect using custom plugin to connect to S3. 

MSK connect > Customised plugins > choose 0a60b9a8a831-plugin > Create connector > Connector properties
> - **Basic properties**
>    - Connector name : 0a60b9a8a831-connector
>    - Description – optional : Connecting topics to s3 bucket
> - **Apache Kafka cluster**
>    - Cluster type : MSK cluster
>    - MSK clusters : pinterest-msk-cluster
> - **Connector configuration**
>    - Configuration settings :  

In [None]:
connector.class=io.confluent.connect.s3.S3SinkConnector    
s3.region=us-east-1     
flush.size=1    
schema.compatibility=NONE   
tasks.max=3     
topics.regex=0a60b9a8a831.*     
format.class=io.confluent.connect.s3.format.json.JsonFormat     
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner   
value.converter.schemas.enable=false    
value.converter=org.apache.kafka.connect.json.JsonConverter     
storage.class=io.confluent.connect.s3.storage.S3Storage     
key.converter=org.apache.kafka.connect.storage.StringConverter  
s3.bucket.name=user-0a60b9a8a831-bucket

> - **Connector capacity**
>    - Capacity type : Provisioned
>    - MCU count per worker : 1
>    - Number of workers : 1
>- **Worker configuration**
>    - Use a customised configuration
>    - Worker configuration : confluent worker
>- **Access permission**
>    - IAM role : 0a60b9a8a831-ec2-access-role
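
The `topics.regex` setting decides which topics the connector copies to S3, so it is worth checking that it picks up exactly the three topics created earlier. The sketch below is a rough check in Python, written for this log rather than taken from the project. It assumes the pattern has to match the whole topic name (which is my understanding of how Kafka Connect applies it) and that Python's `re` behaves like Java's regex engine for a pattern this simple. `matching_topics` and the two extra topic names are hypothetical.

```python
import re

# Value of topics.regex in the connector configuration above.
TOPICS_REGEX = "0a60b9a8a831.*"


def matching_topics(pattern, topics):
    """Return the topics whose full name matches the pattern."""
    compiled = re.compile(pattern)
    # fullmatch mirrors a whole-name match rather than a substring search.
    return [topic for topic in topics if compiled.fullmatch(topic)]


if __name__ == "__main__":
    candidates = [
        "0a60b9a8a831.pin",
        "0a60b9a8a831.geo",
        "0a60b9a8a831.user",
        "__consumer_offsets",   # internal Kafka topic, should not match
        "someone-else.pin",     # made-up topic from another user
    ]
    print(matching_topics(TOPICS_REGEX, candidates))
```

Note that the pattern matches any topic beginning with `0a60b9a8a831`, so topics added later with that prefix would also be written to the bucket.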

## 5. Batch processing : Configuring an API in API Gateway

### 5.1 Build a Kafka REST proxy integration method for the API

Now, an API is needed to send data to the MSK cluster and in turn, the S3 bucket.

##### **Step 1:** Create a REST API: 

API gateway > Create API > REST API > Build > 
>- API name: 0a60b9a8a831
>- Remaining settings remain as default. 

##### **Step 2:** Create a resource that allows building a PROXY integration for the API and set up an HTTP ANY method on the created resource.

>- Integration type: HTTP Proxy
>- HTTP method: ANY
>- Endpoint URL: http://ec2-34-207-200-90.compute-1.amazonaws.com:8082/{proxy} (EC2 instance as endpoint)
>- Content handling: Passthrough (To pass all payloads to backend, apply a mapping template if specified)

![](media/log/5.1_create_api_resource.PNG)

![](media/log/5.1_create_http_any_method.PNG)

##### **Step 3:** Deploy the API and note the invoke URL so it can be used for POST requests.    
Invoke URL : https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod

### 5.2 Set up the Kafka REST proxy on EC2 client machine

Now that the API has been set up to send data to the EC2 client, install the Confluent package to set up a REST proxy on the EC2 client, which listens for requests and interacts with the Kafka cluster.


##### **Step 1:** Install Confluent package for Kafka REST proxy on EC2 client machine. 

In [None]:
sudo wget https://packages.confluent.io/archive/7.2/confluent-7.2.0.tar.gz      
tar -xvzf confluent-7.2.0.tar.gz 

##### **Step 2:** Allow the REST proxy to perform IAM authentication to the MSK cluster by modifying the kafka-rest.properties file.
Navigate to `confluent-7.2.0/etc/kafka-rest`, and modify the `kafka-rest.properties` file.

Inside the `kafka-rest.properties` file, set the `bootstrap.servers` and `zookeeper.connect` variables to the Bootstrap servers string and the Plaintext Apache Zookeeper connection string gathered in [Section 3.4](#3.4-Creating-Kafka-topics).

To authenticate to the MSK cluster with IAM, the IAM MSK authentication package is used again: add the `client.`-prefixed settings at the bottom of `kafka-rest.properties`.

In [None]:
#
# Copyright 2018 Confluent Inc.
#
# Licensed under the Confluent Community License (the "License"); you may not use
# this file except in compliance with the License.  You may obtain a copy of the
# License at
#
# http://www.confluent.io/confluent-community-license
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OF ANY KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations under the License.
#

#id=kafka-rest-test-server
#schema.registry.url=http://localhost:8081
zookeeper.connect=z-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-1.pinterestmskcluster\
    .w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181
bootstrap.servers=b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster\
    .w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098


# Configure interceptor classes for sending consumer and producer metrics to Confluent Control Center
# Make sure that monitoring-interceptors-<version>.jar is on the Java class path
#consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
#producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor

# Sets up TLS for encryption and SASL for authN.
client.security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
client.sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
client.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule \
    required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
client.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
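
The four authentication settings here are the same ones used in `client.properties` in Section 3.3, with a `client.` prefix added for the REST proxy. As an illustration only, the sketch below generates both variants from one definition so they cannot drift apart. `iam_auth_properties` is a hypothetical helper, not part of Kafka or the `aws-msk-iam-auth` package; the keys and values are copied from the files shown in this log.

```python
# Role ARN as used in this log; replace it with your own role.
ROLE_ARN = "arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role"


def iam_auth_properties(role_arn, prefix=""):
    """Return the IAM authentication settings as .properties text.

    prefix="" gives the client.properties form,
    prefix="client." gives the kafka-rest.properties form.
    """
    settings = {
        "security.protocol": "SASL_SSL",
        "sasl.mechanism": "AWS_MSK_IAM",
        "sasl.jaas.config": (
            "software.amazon.msk.auth.iam.IAMLoginModule "
            f'required awsRoleArn="{role_arn}";'
        ),
        "sasl.client.callback.handler.class":
            "software.amazon.msk.auth.iam.IAMClientCallbackHandler",
    }
    return "\n".join(f"{prefix}{key} = {value}" for key, value in settings.items())


if __name__ == "__main__":
    print(iam_auth_properties(ROLE_ARN, prefix="client."))
```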

##### **Step 3:** Start the REST proxy on the EC2 client machine
To make sure messages are consumed in MSK, start the REST proxy from the `confluent-7.2.0/bin` directory (this also functions as a test).    
If everything is configured correctly, the EC2 console shows the INFO message that the server has started and is listening for requests.

In [None]:
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties

### 5.3 Sending data to the API

Modify `user_posting_emulation.py` to send data to the Kafka topics using the API invoke URL. Data from each of the three tables is sent to its corresponding Kafka topic.


Open three separate terminals, each inside the `kafka_2.12-2.8.1/bin` directory of the EC2 client, then set up the Kafka consumers as shown in the code below (one terminal per topic).

Running `user_posting_emulation_batch_data.py` posts data messages to the cluster via the API gateway and the Kafka REST proxy.

All three consumer terminals, alongside the REST proxy terminal, actively streamed data, with each topic receiving its corresponding data. Messages also showed up in the S3 bucket, inside a folder named 'Topics'.

In [None]:
# Keep each broker address on one line: a line break inside the hostname splits it into two arguments
./kafka-console-consumer.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --consumer.config client.properties --group students --topic 0a60b9a8a831.pin --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --consumer.config client.properties --group students --topic 0a60b9a8a831.geo --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 \
    --consumer.config client.properties --group students --topic 0a60b9a8a831.user --from-beginning
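
For reference, the request that the emulation script has to send for each row could look like the sketch below. It is an illustration under assumptions rather than the project's actual code: `build_kafka_request` is a hypothetical helper, the example row is made up, and it assumes the proxy resource forwards the `/topics/<topic name>` path unchanged to the REST proxy on port 8082. The `application/vnd.kafka.json.v2+json` content type and the `{"records": [{"value": ...}]}` envelope follow the Kafka REST proxy's v2 JSON format. `default=str` is there so that datetime values from the RDS rows can be serialised.

```python
import json
import urllib.request
from datetime import datetime

# Invoke URL noted in Section 5.1.
INVOKE_URL = "https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod"


def build_kafka_request(topic, row, invoke_url=INVOKE_URL):
    """Build (but do not send) a POST request that publishes one row to a topic."""
    # Each message is wrapped in {"value": ...} inside a "records" list.
    payload = json.dumps({"records": [{"value": row}]}, default=str)
    return urllib.request.Request(
        url=f"{invoke_url}/topics/{topic}",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/vnd.kafka.json.v2+json"},
        method="POST",
    )


if __name__ == "__main__":
    # Made-up row standing in for a record from the geolocation table.
    example_row = {"index": 1, "timestamp": datetime(2023, 12, 11, 9, 30), "country": "Albania"}
    request = build_kafka_request("0a60b9a8a831.geo", example_row)
    print(request.get_method(), request.full_url)
    print(request.data.decode("utf-8"))
    # urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```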

## 6. Batch processing : Mount AWS S3 bucket onto Databricks


In order to clean and query batch data, we will mount and read the data from the S3 bucket into Databricks. 

A notebook, `pinterest_authenticate_aws`, defines a function that reads a Delta table containing the AWS authentication keys:
1. Read the credentials Delta table into a Spark DataFrame
2. Extract the AWS access key and secret key from the Spark DataFrame and return them from the function

Another notebook, `mount_S3_bucket`, houses a function that mounts the S3 bucket to Databricks:
1. Run the `pinterest_authenticate_aws` notebook to retrieve the AWS access key and secret key
2. Mount the S3 bucket using the AWS key credentials

This notebook is run once to mount the S3 bucket.

A third notebook, `pinterest_batch_data`, reads the data from the S3 bucket and converts it into DataFrames:
1. Run the `pinterest_authenticate_aws` notebook to retrieve the AWS access key and secret key 
2. Read all data with a .json file extension from the S3 bucket 
3. Convert the data into a Spark DataFrame and dynamically generate the DataFrame name
4. Create three different DataFrames:
    - df_pin for the Pinterest post data
    - df_geo for the geolocation data     
    - df_user for the user data
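
The read step for the three topics can be sketched as below. This is not the notebook's actual code, and it makes several assumptions: the mount point name is a placeholder, the objects are assumed to sit under `topics/<topic name>/partition=0/` (the layout I would expect from the S3 sink connector with the `DefaultPartitioner` and a single partition), and the DataFrames are returned in a dictionary keyed by name instead of being created as dynamically named variables. `topic_json_path` and `load_topic_frames` are hypothetical helpers; `spark` is the session that Databricks provides in every notebook.

```python
def topic_json_path(mount_point, topic):
    """Glob for the JSON objects written for one topic (assumed layout)."""
    return f"{mount_point}/topics/{topic}/partition=0/*.json"


def load_topic_frames(spark, mount_point, user_id, suffixes=("pin", "geo", "user")):
    """Return {'df_pin': DataFrame, 'df_geo': DataFrame, 'df_user': DataFrame}."""
    frames = {}
    for suffix in suffixes:
        path = topic_json_path(mount_point, f"{user_id}.{suffix}")
        # spark.read.json infers the schema from the JSON files at this path.
        frames[f"df_{suffix}"] = spark.read.json(path)
    return frames
```

In a Databricks notebook this could be called as `frames = load_topic_frames(spark, "/mnt/<mount name>", "0a60b9a8a831")` followed by `df_pin = frames["df_pin"]`. A dictionary keeps the name-to-DataFrame mapping explicit without resorting to `globals()` or `exec`.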

## 7. Clean all three dataframes and query the data on Databricks using PySpark

Clean the three DataFrames df_pin, df_geo and df_user, then query the data on Databricks using PySpark.

To clean the three DataFrames, a notebook `cleaning_utils` was made. It contains three functions, one per DataFrame, each returning a cleaned DataFrame.

In the `pinterest_batch_data` notebook, each DataFrame (df_pin, df_geo, df_user) is updated by running the corresponding cleaning function from the `cleaning_utils` notebook.
The data was then queried using PySpark to answer the following questions:

1. Find the most popular Pinterest category people post to based on their country.
2. Find the most popular category in each year between 2018 and 2022
3. Find the user with most followers in each country
4. What is the most popular category people post to for different age groups?
5. What is the median follower count for users in different age groups?
6. Find how many users have joined each year between 2015 and 2020
7. Find the median follower count of users based on their joining year between 2015 and 2020.
8. Find the median follower count of users based on their joining year and age group for 2015 to 2020.
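
To make the logic of a question such as number 5 concrete, here is a small plain-Python illustration. It is not the PySpark code from the notebook. The age bands, the field names `age` and `follower_count`, and the sample users are all assumptions made for the example; in PySpark the same grouping would typically use `groupBy` with an aggregate such as `percentile_approx`.

```python
from collections import defaultdict
from statistics import median


def age_group(age):
    """Map an age to a band. The bands are hypothetical, not taken from the log."""
    if age < 18:
        return "under 18"
    if age <= 24:
        return "18-24"
    if age <= 35:
        return "25-35"
    if age <= 50:
        return "36-50"
    return "50+"


def median_followers_by_age_group(users):
    """Group users by age band and take the median follower count of each band."""
    grouped = defaultdict(list)
    for user in users:
        grouped[age_group(user["age"])].append(user["follower_count"])
    return {group: median(counts) for group, counts in grouped.items()}


if __name__ == "__main__":
    # Made-up sample rows.
    sample = [
        {"age": 20, "follower_count": 100},
        {"age": 22, "follower_count": 300},
        {"age": 30, "follower_count": 50},
        {"age": 61, "follower_count": 30},
    ]
    print(median_followers_by_age_group(sample))
```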

## 8. Batch processing: Orchestrating Databricks workload using AWS MWAA

- Upload the `0a60b9a8a831_dag.py` directed acyclic graph (DAG) file to the S3 bucket `mwaa-dags-bucket/dags` associated with the MWAA environment. This allows the DAG to be run from the Airflow UI.
- Utilise AWS Managed Workflows for Apache Airflow (MWAA) to automate **daily** batch processing of the previously created Databricks notebook. 

### 8.1 Create and upload DAG to an MWAA environment  

MWAA requires an S3 bucket to hold Directed Acyclic Graphs (DAGs), Python requirements and plugins. Then the MWAA airflow console can be used to run the DAGs.   

I have been provided with an MWAA environment linked to an S3 bucket. However, one can be created using the following steps:    

##### **Step 1:** Create an S3 bucket
Amazon S3 > Bucket > create bucket:     
>- General configuration:  
>    - AWS region : US East (N. Virginia) us-east-1  
>    - Bucket type : General purpose 
>    - Bucket name: mwaa-dags-bucket 
>- Block Public Access settings for this bucket:   
>    - Block all public access   
>- Bucket Versioning:  
>    - Enable    
     
After creating the bucket, create a folder named `dags` inside it.  

##### **Step 2:** Creating the MWAA environment
Amazon MWAA (us-east-1) > create environment :  
>- Environment details:    
>    - Name: Databricks-Airflow-env  
>- DAG code in Amazon S3:  
>    - S3 Bucket : s3://mwaa-dags-bucket  (browse S3 and choose the previously created bucket)   
>    - DAGs folder : s3://mwaa-dags-bucket/dags (choose the dags folder created inside the S3 bucket)    
>- Then on networking page,    
>    - select create MWAA VPC    
>    - Choose the preferred Apache Airflow access mode.  
>    - Web server access : Private network   
>    - Security groups : Create new security group   
>- Environment class :     
>    - Class : mw1.small (recommended to choose the smallest environment size that is necessary to support the workload) 
>    - Maximum worker count : 5  
>    - Minimum worker count : 1  
>    - Scheduler count : 2   

##### **Step 3:** Create an API token in Databricks   
The API token lets the MWAA environment authenticate to Databricks.  
Username > User Settings > Access tokens > Generate new token > Copy the Token ID.   

##### **Step 4:** Connect MWAA to Databricks
Using the Databricks API token, set up the connection between MWAA and Databricks.

Amazon MWAA > Environments > Databricks-Airflow-env > Open Airflow UI > Admin > Connections > databricks_default > Edit record:
>- Host column : \<url of your Databricks account\>
>- Extra column : {"token": "\<API_token_id_from_previous_step\>", "host": "\<url_from_host_column\>"}
>- Connection type column: Databricks
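
The Extra column has to contain valid JSON (double quotes, no trailing comma), which is easy to get wrong when typing it by hand. One option is to generate the string, as in this small sketch. `databricks_connection_extra` is a hypothetical helper, and both arguments in the example are placeholders, not real values.

```python
import json


def databricks_connection_extra(host, token):
    """Return the JSON text for the Extra column of the databricks_default connection."""
    return json.dumps({"token": token, "host": host})


if __name__ == "__main__":
    # Placeholders only: substitute your own workspace URL and API token.
    print(databricks_connection_extra("<url of your Databricks account>", "<API token>"))
```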
    
##### **Step 5:** Obtaining Databricks connection type
The Databricks connection type only becomes available once the corresponding Python dependencies are installed in the MWAA environment. These are listed in the `requirements.txt` file.     
Before uploading a `requirements.txt` file to the `mwaa-dags-bucket` S3 bucket, the following [GitHub repository](https://github.com/aws/aws-mwaa-local-runner) can be used to create and test the environment.

After uploading `requirements.txt` to the S3 bucket, select the file in the Amazon MWAA environment settings:  
Amazon MWAA > environment > select environment > edit:   
>- DAG code in Amazon S3:  
>    - Requirements file : s3://mwaa-dags-bucket/requirements.txt  

### 8.2 Upload DAG to MWAA environment and trigger the DAG
 
Create an Airflow DAG that triggers a Databricks notebook, and upload the corresponding `0a60b9a8a831_dag.py` Python file to the `mwaa-dags-bucket/dags` S3 bucket folder (associated with the MWAA environment). 

To manually trigger the DAG:  
Amazon MWAA > Environments > Databricks-Airflow-env > Open Airflow UI.  
Unpause the DAG in the Airflow UI, then trigger it.

## 9. Stream processing: AWS Kinesis   

To stream data from AWS RDS to Databricks Delta tables, a new script was created to send data to an API, and a new notebook was created to stream the data into Delta tables.   

### 9.1 Creating streams on AWS Kinesis 
First, the following 3 streams were created to link the three Pinterest tables:     
 - streaming-0a60b9a8a831-pin  
 - streaming-0a60b9a8a831-geo    
 - streaming-0a60b9a8a831-user  

On AWS Kinesis > Create new stream
    

![](media/log/9.1_kinesis.png)

### 9.2 Modifying the REST API

Configure the previously created REST API to allow it to invoke Kinesis actions.  
My AWS account had already been granted permission to invoke Kinesis actions, so there was no need to create an IAM role for the API to access Kinesis. However, the role still has to be referenced as follows:    
 
Copy the ARN of the access role `0a60b9a8a831-kinesis-access-role` from the IAM console, under Roles. This ARN is used as the Execution role for the integration point of every API method created.

An overview of the REST API:

![](media/log/9.2_REST_API.PNG)


##### **Step 1:** List streams in Kinesis     

To begin building the integration, navigate to the previously created API in AWS API Gateway and select Create resource:  
API Gateway > APIs > previously created API > Create resource

![](media/log/9.2.1_kinesis_api_gateway_streams.PNG)

Under this newly created streams resource, create a GET method with the following settings: 
> Integration type: AWS Service     
> AWS Region: us-east-1     
> AWS Service: Kinesis  
> HTTP method: POST (to invoke Kinesis's ListStreams action)    
> Action Type: Use action name     
> Action name: ListStreams  
> Execution role: ARN of your Kinesis Access Role (created in the previous section)

![](media/log/9.2.2_kinesis_api_gateway_get_method.PNG)

Once the GET method has been created, select the Integration request panel, click Edit, and modify the header parameters and mapping template. 


> URL Requests header parameters :  
>- Name: Content-Type    
>- Mapped from: 'application/x-amz-json-1.1'
>
> Mapping templates: 
>- Content type: application/json
> - Body:  {}


![](media/log/9.2.3_kinesis_api_gateway_integration_request_header_mapping.PNG)



##### **Step 2:** Create, describe and delete streams in Kinesis  

Under the streams resource, create a new child resource with the Resource name {stream-name}.

**The settings to create a POST method:**   

On the create method settings:
>- Integration Type: AWS Service    
>- AWS Region: us-east-1    
>- AWS Service: Kinesis     
>- HTTP method: POST    
>- Action: CreateStream     
>- Execution role: ARN of IAM role created  

![](media/log/9.2.4_kinesis_api_stream_child_post_method.PNG)

**In 'Integration Request' under 'Mapping Templates', add new mapping template:**
>- Content Type: 'application/json'
>- Mapping Template Body:

In [None]:
{
    "ShardCount": #if($input.path('$.ShardCount') == '') 5 #else $input.path('$.ShardCount') #end,
    "StreamName": "$input.params('stream-name')"
}

![](media/log/9.2.5_kinesis_api_put_integration_request_header_mapping.PNG)
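
The `#if ... #else ... #end` part of the mapping template supplies a default shard count when the request body does not contain one. The Python sketch below mimics that behaviour to show what the integration request ends up sending to Kinesis. It is an emulation for explanation only: `create_stream_body` is a hypothetical function, and API Gateway itself evaluates the template in VTL, not Python.

```python
DEFAULT_SHARD_COUNT = 5  # the fallback written in the mapping template


def create_stream_body(stream_name, request_body):
    """Emulate the CreateStream mapping template for a parsed JSON request body."""
    # A missing ShardCount is treated like the empty string the template tests for.
    shard_count = request_body.get("ShardCount", "")
    if shard_count == "":
        shard_count = DEFAULT_SHARD_COUNT
    return {"ShardCount": shard_count, "StreamName": stream_name}


if __name__ == "__main__":
    print(create_stream_body("streaming-0a60b9a8a831-pin", {}))
    print(create_stream_body("streaming-0a60b9a8a831-pin", {"ShardCount": 2}))
```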

**For the other methods, the same settings were used except for the action name while creating the method and mapping template for the integration request settings:**

**The settings to create the GET method:** 
>- Action name: 'DescribeStream'    
>- Mapping Template:

In [None]:
{
    "StreamName": "$input.params('stream-name')"
}

**The settings to create the DELETE method:**

>- Action: 'DeleteStream'
>- Mapping Template Body:

In [None]:
{
    "StreamName": "$input.params('stream-name')"
}

##### **Step 3:** Add records to streams in Kinesis   

Under the {stream-name} resource, create two new child resources with the Resource names record and records. The settings for creating the child resources were as follows: 

>- Resource path : /streams/{stream-name}/  
>- Resource name : record 

>- Resource path : /streams/{stream-name}/   
>- Resource name : records  


For both resources create a PUT method.

**The settings to create the PUT method for record:**   
>- Action name: 'PutRecord'    
>- Mapping Template:   

In [None]:
{
    "StreamName": "$input.params('stream-name')",
    "Data": "$util.base64Encode($input.json('$.Data'))",
    "PartitionKey": "$input.path('$.PartitionKey')"
}

**The settings to create the PUT method for records:**  
>- Action name: 'PutRecords'  
>- Mapping Template: 

In [None]:
{
    "StreamName": "$input.params('stream-name')",
    "Records": [
       #foreach($elem in $input.path('$.records'))
          {
            "Data": "$util.base64Encode($elem.data)",
            "PartitionKey": "$elem.partition-key"
          }#if($foreach.hasNext),#end
        #end
    ]
}
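
Note that the two templates expect differently named fields in the request body: `PutRecord` reads `Data` and `PartitionKey`, while `PutRecords` reads a lowercase `records` list whose items carry `data` and `partition-key`. The sketch below emulates the `PutRecords` template in Python to show the transformation, including the Base64 encoding of each record. `put_records_body` is a hypothetical function written for this explanation; the real work is done by API Gateway.

```python
import base64


def put_records_body(stream_name, request_body):
    """Emulate the PutRecords mapping template for a parsed JSON request body."""
    records = []
    for elem in request_body["records"]:
        # Kinesis expects the Data field as Base64 text.
        encoded = base64.b64encode(elem["data"].encode("utf-8")).decode("ascii")
        records.append({"Data": encoded, "PartitionKey": elem["partition-key"]})
    return {"StreamName": stream_name, "Records": records}


if __name__ == "__main__":
    body = {"records": [{"data": "hello", "partition-key": "p1"}]}
    print(put_records_body("streaming-0a60b9a8a831-pin", body))
```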

After this, the API was redeployed and the invoke URL was noted.

### 9.3 Sending Data to the Kinesis streams    

The previously created script `user_posting_emulation.py` was adapted into a new script, `user_posting_emulation_streaming.py`, which adds an `api_send_to_kinesis` function. The script sends requests to the newly created API, sending rows one at a time from the three Pinterest AWS RDS tables to their corresponding Kinesis streams.
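
A request matching the `PutRecord` mapping template from Section 9.2 could be built as in the sketch below. It is an assumption-laden illustration, not the contents of `api_send_to_kinesis`: `build_kinesis_request` is a hypothetical helper, the partition key and example row are made up, and the invoke URL shown is the one noted in Section 5.1, which may differ after redeployment.

```python
import json
import urllib.request


def build_kinesis_request(invoke_url, stream_name, row, partition_key):
    """Build (but do not send) a PUT request for /streams/{stream-name}/record."""
    # Field names follow the PutRecord mapping template: StreamName, Data, PartitionKey.
    payload = json.dumps(
        {"StreamName": stream_name, "Data": row, "PartitionKey": partition_key},
        default=str,  # lets datetime values from the RDS rows serialise
    )
    return urllib.request.Request(
        url=f"{invoke_url}/streams/{stream_name}/record",
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


if __name__ == "__main__":
    request = build_kinesis_request(
        "https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod",  # assumed invoke URL
        "streaming-0a60b9a8a831-pin",
        {"index": 7, "category": "art"},  # made-up row
        "partition-1",                    # made-up partition key
    )
    print(request.get_method(), request.full_url)
    # urllib.request.urlopen(request) would send it; omitted so the sketch runs offline.
```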

### 9.4 Read data from Kinesis streams into Databricks   


Create a new notebook called `pinterest_streaming_data` in Databricks (Details can be found inside the notebook).

1. Run the `pinterest_authenticate_aws` notebook to read the credentials.
2. Create functions to ingest data from Kinesis Data Streams
3. Read the data from the three streams
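
When a record is read back, its payload arrives as raw bytes that still need to be decoded and parsed. The sketch below walks through that round trip in plain Python. It is not the notebook's code: `decode_kinesis_data` is a hypothetical helper and the row is made up. In Databricks, I would expect the Kinesis source to expose the payload in a binary column that is cast to a string and parsed with a schema, but that detail is an assumption here.

```python
import base64
import json


def decode_kinesis_data(data):
    """Turn the raw bytes of one Kinesis record back into a dictionary."""
    return json.loads(data.decode("utf-8"))


if __name__ == "__main__":
    row = {"index": 7, "category": "art"}  # made-up row
    # The PutRecord mapping template sends the JSON text Base64-encoded.
    encoded = base64.b64encode(json.dumps(row).encode("utf-8"))
    # Kinesis stores the decoded bytes, which is what a consumer receives.
    stored = base64.b64decode(encoded)
    print(decode_kinesis_data(stored))
```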

### 9.5 Transform Kinesis Streams in Databricks

Clean the streaming data utilising the `cleaning_utils` notebook functions inside the `pinterest_streaming_data` notebook.

### 9.6 Write streaming data into Databricks delta tables
A new function, `store_as_delta`, was created in the notebook `pinterest_streaming_data` to save each stream in a Delta table.

The following three tables were created: 
 - 0a60b9a8a831_pin_table 
 - 0a60b9a8a831_geo_table 
 - 0a60b9a8a831_user_table

The delta tables could be viewed under Catalogs > default >  \<name of delta table\>

**Improvements:**
The cleaning code for batch and streaming data is identical in both notebooks, and the authentication credentials code is also duplicated. It would be better not to repeat either section, but I am unsure how to achieve this within notebooks and will research further. 
