## Project Log

### 1. Set up a new conda environment

In [None]:
conda create -n pinterest_conda_env
conda activate pinterest_conda_env

Install any libraries required to run `user_posting_emulation.py` using either of the following commands:

In [None]:
conda install <library_name>
conda install -c conda-forge <library_name>

### 2. Data infrastructure simulating Pinterest.

`user_posting_emulation.py` contains the login credentials for an RDS database with three tables. The data in these tables resembles the data the Pinterest API receives when a user uploads data to Pinterest with a POST request.


### 3. Batch processing : Configuring the EC2 Kafka client

The MSK cluster had already been created. To create an MSK cluster:

Amazon MSK > Create cluster > Choose quick create > Name the cluster 'pinterest-msk-cluster' > Choose Cluster type: provisioned > Apache Kafka version: 2.8.1 > Choose Broker type: kafka.m5.large > Amazon EBS storage per broker: 100 GiB > Click create cluster

#### The EC2 instance   
AWS EC2: Cloud-hosted virtual machine

![](/home/mhash/pinterest_data_pipeline_project/project_log/ec2_instance.png)

- Name: 0a60b9a8a831  
- Instance ID : i-034925981e7bb03f3   
- Instance type: t2.micro 
- Amazon Machine Image : Amazon Linux 2023 AMI
- Availability Zone: us-east-1a
- Public IPv4 DNS : ec2-34-207-200-90.compute-1.amazonaws.com
- Public IPv4 address : 34.207.200.90 
- Key pair name : 0a60b9a8a831-key-pair
- Key pair type : RSA
- Private key file format : .pem

An IAM role was set up for the EC2 instance.

#### 3.1 Create .pem file locally
EC2 > Instances > Search "i-034925981e7bb03f3" > Details > Key pair assigned at launch.  
Copy the key pair into a local `.pem` file, and make sure the file is read-only for the user (owner) class:  
>chmod 400 0a60b9a8a831-key-pair.pem

#### 3.2 Connect to the EC2 instance
Follow the instructions on AWS on how to connect to EC2 through the CLI.  
    
`ssh -i "0a60b9a8a831-key-pair.pem" ec2-user@ec2-34-207-200-90.compute-1.amazonaws.com`

#### 3.3 Installing Kafka on the EC2 client 


I have been provided with access to an IAM-authenticated MSK cluster.

However, an MSK cluster can be set up manually by:

- Creating the MSK cluster: 
    - MSK console > Create Cluster > Quick create    
    - Cluster name :pinterest-msk-cluster    
    - Cluster type : provisioned 
    - Apache Kafka version : 2.8.1   
    - Broker type : kafka.m5.large   
    - Amazon EBS storage per broker : 100 GiB    
    - Click Create cluster

The EC2 instance has been created; however, for the client machine to send data to the MSK cluster, the cluster's security group must accept traffic from it:  
VPC > Security > Security groups >  
- Select the default security group associated with the cluster VPC (MSK console > properties tab > Networking settings section > Security groups applied) 

Edit inbound rules > Add rule. 
- Type column: All traffic     
- Source column: ID of the security group of the client machine (found in EC2 console > security tab). 
- Save rules and cluster will accept all traffic from the client machine.

##### Step 1 : Install Java and check its version with `java -version`, then download and extract Kafka 2.8.1 (the `kafka_2.12-2.8.1` build).

In [None]:
sudo yum install java-1.8.0

wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz   # Get the version of kafka to install
tar -xzf kafka_2.12-2.8.1.tgz                                           # To 'unzip' or 'untar' the file 
rm kafka_2.12-2.8.1.tgz                                                 # Remove the compressed file, keeping only uncompressed version

##### Step 2 : Installing IAM MSK authentication package on client EC2 machine.
Go into the Kafka installation directory and then into its `libs` subdirectory. Download the IAM MSK authentication package from GitHub (required to connect to MSK clusters that use IAM authentication).

In [None]:
cd kafka_2.12-2.8.1/libs

wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.5/aws-msk-iam-auth-1.1.5-all.jar  # Download msk iam authentication file
export CLASSPATH=/home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar # Set class path environment variable for MSK authentication

Add the `export CLASSPATH=...` line above to the `.bashrc` file so that the client is configured to use the IAM package in every new session:

In [None]:
nano /home/ec2-user/.bashrc
source ~/.bashrc

![](/home/mhash/pinterest_data_pipeline_project/project_log/nano_bashrc_20231211.png)

If configured correctly, the following command returns the class path '/home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar'.

In [None]:
echo $CLASSPATH  


##### Step 3 : Assume the "0a60b9a8a831-ec2-access-role" IAM role, which contains the permissions needed to authenticate to the MSK cluster
    
IAM console > Roles > 0a60b9a8a831-ec2-access-role      
- copy ARN      

Trust relationships tab > Edit trust policy > Add a principal       
- Select IAM roles as the Principal type, and replace the ARN with the 0a60b9a8a831-ec2-access-role ARN copied in the previous step.


##### Step 4 : Configure Kafka client to use AWS IAM authentication to the cluster 
In the EC2 client, inside the Kafka `bin` directory (`cd kafka_2.12-2.8.1/bin`), modify the client.properties file (`nano client.properties`) as follows:

In [None]:
# https://github.com/aws/aws-msk-iam-auth instructions on configuring a Kafka client to use AWS IAM with AWS_MSK_IAM mechanism

# Sets up TLS (Transport Layer Security) for encryption (cryptographic protocol) and SASL (Simple Authentication and Security Layer) for authN (framework for authentication and data security in Internet protocols).
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
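
Before running any Kafka commands, it can help to confirm that `client.properties` contains all four settings listed above. The following is a minimal sketch, not part of the original setup: `parse_properties` and `missing_keys` are hypothetical helper names, and the parser only handles simple `key = value` lines rather than the full Java properties syntax.

```python
from pathlib import Path

# Keys that the client.properties shown above sets for IAM authentication.
REQUIRED_KEYS = (
    "security.protocol",
    "sasl.mechanism",
    "sasl.jaas.config",
    "sasl.client.callback.handler.class",
)


def parse_properties(text: str) -> dict:
    """Parse simple `key = value` lines, skipping blank lines and # comments."""
    props = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first "=" only, because the JAAS value itself contains "=".
        key, sep, value = line.partition("=")
        if sep:
            props[key.strip()] = value.strip()
    return props


def missing_keys(props: dict, required=REQUIRED_KEYS) -> list:
    """Return the required keys that are absent or empty."""
    return [key for key in required if not props.get(key)]


if __name__ == "__main__":
    path = Path("client.properties")  # assumes the script runs from kafka_2.12-2.8.1/bin
    if path.exists():
        print("Missing keys:", missing_keys(parse_properties(path.read_text())))
```

An empty list means every expected key is present; it does not verify that the role ARN or the IAM permissions are correct.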

#### 3.4 Creating Kafka topics

Use the MSK Management Console to get the cluster information:  
Amazon MSK > pinterest-msk-cluster > View client information and save:  
- Bootstrap servers Private endpoint (single-VPC)
- Plaintext Apache Zookeeper connection string  

topic names: 
- 0a60b9a8a831.pin for the Pinterest posts data
- 0a60b9a8a831.geo for the post geolocation data
- 0a60b9a8a831.user for the post user data

In the EC2 client, go into the `bin` folder of the Kafka installation (`cd kafka_2.12-2.8.1/bin`) and run the following commands:


In [None]:
./kafka-topics.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.pin
./kafka-topics.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.geo
./kafka-topics.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.user
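
The three commands above differ only in the topic name (and in which broker they contact), so they can also be generated in a loop. This is a sketch under assumptions: `create_topic_command` is a hypothetical helper, the script only prints the commands, and it uses the single broker `b-1` for all three topics on the assumption that any one broker is enough to bootstrap the connection.

```python
import shlex

USER_ID = "0a60b9a8a831"
# One broker from the bootstrap string saved from the MSK console.
BOOTSTRAP_SERVER = "b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098"
TOPIC_SUFFIXES = ("pin", "geo", "user")


def create_topic_command(topic: str, bootstrap_server: str = BOOTSTRAP_SERVER) -> list:
    """Build the argument list for one kafka-topics.sh --create call."""
    return [
        "./kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", "client.properties",
        "--create",
        "--topic", topic,
    ]


commands = [create_topic_command(f"{USER_ID}.{suffix}") for suffix in TOPIC_SUFFIXES]

if __name__ == "__main__":
    for command in commands:
        # Print only. To execute on the EC2 client, pass `command` to
        # subprocess.run(command, check=True) from kafka_2.12-2.8.1/bin.
        print(shlex.join(command))
```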

### 4. Batch processing : Connecting the MSK cluster to an S3 bucket.    
The S3 bucket the data will be saved in is user-0a60b9a8a831-bucket (an IAM role has already been set up to write to the S3 bucket).

First, download the Confluent.io Amazon S3 connector and copy it to the S3 bucket:

In [None]:
# start a login shell as ec2-user
sudo -u ec2-user -i 

# create directory where we will save our connector 
mkdir kafka-connect-s3 && cd kafka-connect-s3 

# download connector from Confluent
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.0.3/confluentinc-kafka-connect-s3-10.0.3.zip 

# copy connector to S3 bucket 
aws s3 cp ./confluentinc-kafka-connect-s3-10.0.3.zip s3://user-0a60b9a8a831-bucket/kafka-connect-s3/

#### 4.1 Create custom plugin in the MSK Connect console   
MSK console > MSK Connect section  > Custom plugins > Create custom plugin.
Choose bucket where Confluent connector ZIP file is (s3://user-0a60b9a8a831-bucket/kafka-connect-s3/confluentinc-kafka-connect-s3-10.0.3.zip) and name this plugin 0a60b9a8a831-plugin.

#### 4.2 Create a connector in MSK connect using custom plugin    
 
MSK connect > Customised plugins > choose 0a60b9a8a831-plugin > Create connector >
Connector properties
- **Basic properties**
    - Connector name : 0a60b9a8a831-connector
    - Description – optional : Connecting topics to s3 bucket
- **Apache Kafka cluster**
    - Cluster type : MSK cluster
    - MSK clusters : pinterest-msk-cluster
- **Connector configuration**
    - Configuration settings :  

In [None]:
connector.class=io.confluent.connect.s3.S3SinkConnector    
s3.region=us-east-1     
flush.size=1    
schema.compatibility=NONE   
tasks.max=3     
topics.regex=0a60b9a8a831.*     
format.class=io.confluent.connect.s3.format.json.JsonFormat     
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner   
value.converter.schemas.enable=false    
value.converter=org.apache.kafka.connect.json.JsonConverter     
storage.class=io.confluent.connect.s3.storage.S3Storage     
key.converter=org.apache.kafka.connect.storage.StringConverter  
s3.bucket.name=user-0a60b9a8a831-bucket

- **Connector capacity**
    - Capacity type : Provisioned
    - MCU count per worker : 1
    - Number of workers : 1
- **Worker configuration**
    - Use a customised configuration
    - Worker configuration : confluent worker
- **Access permissions**
    - IAM role : 0a60b9a8a831-ec2-access-role
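
The `topics.regex=0a60b9a8a831.*` setting in the connector configuration above decides which topics are written to the bucket. The sketch below checks a few names against it with Python's `re` module. Two assumptions apply: the connector is taken to match the pattern against the whole topic name (hence `fullmatch`), and the last two candidate names are hypothetical examples, not topics that exist in the cluster.

```python
import re

TOPICS_REGEX = "0a60b9a8a831.*"  # value of topics.regex in the connector configuration
pattern = re.compile(TOPICS_REGEX)

candidate_topics = [
    "0a60b9a8a831.pin",
    "0a60b9a8a831.geo",
    "0a60b9a8a831.user",
    "0a60b9a8a831_other",  # hypothetical: matches, because ".*" means "any characters"
    "1b71c0b9b942.pin",    # hypothetical: different prefix, so no match
]

# Whole-name matching is an assumption about how the connector applies the pattern.
matched = [topic for topic in candidate_topics if pattern.fullmatch(topic)]

if __name__ == "__main__":
    for topic in candidate_topics:
        print(f"{topic}: {'matched' if topic in matched else 'ignored'}")
```

Because the `.` is unescaped, the pattern accepts any topic that starts with `0a60b9a8a831`. If only dot-separated names should match, `0a60b9a8a831\..*` would be a stricter alternative.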

### 5. Batch Processing : Configuring an API in API Gateway

#### 5.1 Build a Kafka REST proxy integration method for the API

Create a REST API

API Gateway > Create API > REST API > Build > 
API name: 0a60b9a8a831

Leave the remaining settings at their defaults, then create the API.

Create a resource that allows building a PROXY integration for the API.    

![](/home/mhash/pinterest_data_pipeline_project/project_log/5.1_create_api_resource.PNG)


For the created resource, set up an HTTP ANY method.    

![](/home/mhash/pinterest_data_pipeline_project/Media/5.1_create_http_any_method.PNG)


Endpoint URL : http://ec2-34-207-200-90.compute-1.amazonaws.com:8082/{proxy}


Deploy the API and note the invoke URL for a later step.    
Invoke URL : https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod

#### 5.2 Set up the Kafka REST proxy on EC2 client machine

##### Install Confluent package for Kafka REST proxy on EC2 client machine 

In [None]:
sudo wget https://packages.confluent.io/archive/7.2/confluent-7.2.0.tar.gz      
tar -xvzf confluent-7.2.0.tar.gz 

Allow the REST proxy to perform IAM authentication to the MSK cluster by modifying the kafka-rest.properties file.
Navigate to `confluent-7.2.0/etc/kafka-rest`. Inside here run the following command to modify the kafka-rest.properties file:

In [None]:
cd confluent-7.2.0/etc/kafka-rest
nano kafka-rest.properties

Inside `kafka-rest.properties`, set the `bootstrap.servers` and `zookeeper.connect` variables to the Bootstrap servers string and the Plaintext Apache Zookeeper connection string gathered in section 3.4.     
To pass the MSK cluster's IAM authentication, use the IAM MSK authentication package again by adding its settings at the bottom of `kafka-rest.properties`. The resulting file looks as follows:

In [None]:
#
# Copyright 2018 Confluent Inc.
#
# Licensed under the Confluent Community License (the "License"); you may not use
# this file except in compliance with the License.  You may obtain a copy of the
# License at
#
# http://www.confluent.io/confluent-community-license
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OF ANY KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations under the License.
#

#id=kafka-rest-test-server
#schema.registry.url=http://localhost:8081
zookeeper.connect=z-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181
bootstrap.servers=b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098


# Configure interceptor classes for sending consumer and producer metrics to Confluent Control Center
# Make sure that monitoring-interceptors-<version>.jar is on the Java class path
#consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
#producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor

# Sets up TLS for encryption and SASL for authN.
client.security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
client.sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
client.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
client.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

##### Start the REST proxy on the EC2 client machine
To make sure messages are consumed in MSK, start the REST proxy; this also functions as a test.    
The EC2 terminal should show the INFO message that the server has started and is listening for requests.

In [None]:
cd confluent-7.2.0/bin  
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties

#### 5.3 Sending data to the API   

Modify `user_posting_emulation.py` to send data to the Kafka topics through the API Invoke URL, sending the data from each of the three tables to its corresponding Kafka topic.


Three separate terminals were opened, each inside the `kafka_2.12-2.8.1/bin` directory of the EC2 client, and one Kafka consumer was started per topic. When `user_posting_emulation.py` was run to send a stream of messages to the cluster, all three consumer terminals and the REST proxy terminal actively streamed data, with each topic receiving its corresponding data. Messages also showed up in the S3 bucket, inside a folder named 'Topics'.

In [None]:
./kafka-console-consumer.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.pin --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.geo --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.user --from-beginning
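
The change to `user_posting_emulation.py` described in this section amounts to wrapping each table row in the Kafka REST proxy envelope and POSTing it to `<invoke URL>/topics/<topic>`. The following is a minimal sketch rather than the project's actual script: `build_kafka_request` is a hypothetical helper, the example row uses made-up column names, and the request is only built, not sent.

```python
import json
import urllib.request

INVOKE_URL = "https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod"
USER_ID = "0a60b9a8a831"
# Content type used by the Confluent REST proxy v2 API for JSON-embedded records.
KAFKA_JSON_V2 = "application/vnd.kafka.json.v2+json"


def build_kafka_request(topic_suffix: str, row: dict) -> urllib.request.Request:
    """Wrap one table row in the REST proxy envelope and build a POST request."""
    url = f"{INVOKE_URL}/topics/{USER_ID}.{topic_suffix}"
    # default=str converts values json cannot encode (e.g. datetime) to strings.
    payload = json.dumps({"records": [{"value": row}]}, default=str)
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": KAFKA_JSON_V2},
        method="POST",
    )


# Hypothetical row; the real column names come from the RDS tables.
example_row = {"index": 1, "title": "example pin"}
request = build_kafka_request("pin", example_row)

if __name__ == "__main__":
    print(request.get_method(), request.full_url)
    # To send it, call urllib.request.urlopen(request) while the REST proxy is running.
```

Calling the same helper with `"geo"` and `"user"` would target the other two topics.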

### 6. Batch processing : Mount AWS S3 bucket onto Databricks


In order to clean and query the batch data, we will mount the S3 bucket and read its data into Databricks.

Create three different DataFrames:

- df_pin for the Pinterest post data  
- df_geo for the geolocation data     
- df_user for the user data  

This was all carried out in databricks notebook `dataframe_creation_from_s3_bucket`.
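
As a rough illustration of where each DataFrame could be read from, the sketch below builds one path pattern per topic. It rests on several assumptions that are not confirmed by this log: the mount point name `/mnt/pinterest_data` is hypothetical, and the `<topics_dir>/<topic>/partition=<n>/` layout is what the S3 sink connector is expected to produce with `DefaultPartitioner`. The log above refers to the top-level folder as 'Topics', so `topics_dir` should be set to whatever key prefix actually appears in the bucket.

```python
USER_ID = "0a60b9a8a831"
MOUNT_POINT = "/mnt/pinterest_data"  # hypothetical mount name


def topic_glob(topic_suffix: str, topics_dir: str = "topics",
               mount_point: str = MOUNT_POINT) -> str:
    """Build a path pattern covering every JSON object written for one topic."""
    return f"{mount_point}/{topics_dir}/{USER_ID}.{topic_suffix}/partition=*/*.json"


# One read path per DataFrame named in the list above.
read_paths = {f"df_{suffix}": topic_glob(suffix) for suffix in ("pin", "geo", "user")}

if __name__ == "__main__":
    for name, path in read_paths.items():
        # In a Databricks notebook each path could be passed to
        # spark.read.format("json").load(path); not executed here.
        print(name, "<-", path)
```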


### 7. Clean all three DataFrames and query the data on Databricks using PySpark

### 8. Batch processing: Orchestrating Databricks workloads using AWS MWAA

#### 8.1 Create and upload DAG to an MWAA environment  

MWAA requires an S3 bucket to hold Directed Acyclic Graphs (DAGs), Python requirements and plugins. Then the MWAA airflow console can be used to run the DAGs.   

I have been provided with an MWAA environment linked to an S3 bucket. However this can be created using the following steps:    

##### 1) Amazon S3 > Bucket > create bucket:     
- General configuration:  
    - AWS region : US East (N. Virginia) us-east-1  
    - Bucket type : General purpose 
    - Bucket name: mwaa-dags-bucket 
- Block Public Access settings for this bucket:   
    - Block all public access   
- Bucket Versioning:  
    - Enable    
     
After creating the bucket, create a folder named `dags` inside it.  

##### 2) Amazon MWAA (us-east-1) > create environment :  
- Environment details:    
    - Name: Databricks-Airflow-env  
- DAG code in Amazon S3:  
    - S3 Bucket : s3://mwaa-dags-bucket  (browse S3 and choose the previously created bucket)   
    - DAGs folder : s3://mwaa-dags-bucket/dags (choose the dags folder created inside the S3 bucket)    
- Then on networking page,    
    - select create MWAA VPC    
    - Choose the preferred Apache Airflow access mode.  
    - Web server access : Private network   
    - Security groups : Create new security group   
- Environment class :     
    - Class : mw1.small (recommended to choose the smallest environment size that is necessary to support the workload) 
    - Maximum worker count : 5  
    - Minimum worker count : 1  
    - Scheduler count : 2   

##### 3) Create an API token in Databricks   
In Databricks, Username > User Settings > Access tokens > Generate new token > Copy the Token ID.   

##### 4) Amazon MWAA > Environments > Databricks-Airflow-env > Open Airflow UI > Admin > Connections > databricks_default > Edit record:
- Host column : <url of your Databricks account>
- Extra column : < {"token": "<API_token_id_from_previous_step>", "host": "<url_from_host_column>"}>
- Connection type column: Databricks
    
Obtaining the Databricks connection type requires installing the corresponding Python dependencies in the MWAA environment.     
To create and test a requirements.txt file before uploading it to the `mwaa-dags-bucket` S3 bucket, the following GitHub repository can be used:  https://github.com/aws/aws-mwaa-local-runner    
After uploading requirements.txt to the S3 bucket, Amazon MWAA > environment > select environment > edit:   
- DAG code in Amazon S3:  
    - Requirements file : s3://mwaa-dags-bucket/requirements.txt  


#### Upload DAG to MWAA environment

Amazon MWAA > Environments > Databricks-Airflow-env > Airflow UI from the MWAA environment

Create an Airflow DAG that will trigger a Databricks notebook by uploading the corresponding `0a60b9a8a831_dag` Python file to the `mwaa-dags-bucket/dags` S3 bucket folder.



Unpause the DAG from the AWS MWAA Airflow UI, and trigger the DAG. **DAG has failed** (debugging requires searching the logs).

### 9. Stream processing: AWS Kinesis

On AWS Kinesis > Create new stream, the following three streams were created, one for each of the three Pinterest tables.  

- streaming-0a60b9a8a831-pin  
- streaming-0a60b9a8a831-geo    
- streaming-0a60b9a8a831-user     

![](/home/mhash/pinterest_data_pipeline_project/project_log/9.1_kinesis.png)

#### 9.2 Configure the REST API to invoke Kinesis actions

Configure the previously created REST API so that it can invoke Kinesis actions. The AWS account had already been granted permission to invoke Kinesis actions, so there was no need to create an IAM role for the API to access Kinesis. However, this can be done with the following steps:    

Copy the ARN of the access role `0a60b9a8a831-kinesis-access-role` from the IAM console, under Roles. This ARN is used when setting up the Execution role for the integration point of all the API methods created.

Enable API to be able to invoke the following actions:

- List streams in Kinesis
- Create, describe and delete streams in Kinesis
- Add records to streams in Kinesis


##### 9.2.1 List streams in Kinesis     


##### 9.2.2 Create, describe and delete streams in Kinesis  


##### 9.2.3 Add records to streams in Kinesis   


Under the {stream-name} resource, create two new child resources with the resource names `record` and `records`. For both resources, create a PUT method.


In [None]:
Resource path : /streams/{stream-name}/
Resource name : record

Resource path : /streams/{stream-name}/
Resource name : records

Set up a `PUT` method for both `record` and `records`.


After this, the API was redeployed.

#### 9.3 Sending Data to the Kinesis streams    

Create a new script `user_posting_emulation_streaming.py` to send requests to the newly created API, sending data one at a time from the three Pinterest AWS RDS tables to their corresponding Kinesis streams.
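
A request from `user_posting_emulation_streaming.py` to the `record` resource might be built along the following lines. This is a sketch under assumptions, not the project's script: `build_kinesis_request` is a hypothetical helper, the example row and partition key are made up, and the body shape (`StreamName`, `Data`, `PartitionKey`) must match whatever mapping template is configured on the PUT method, which this log does not show.

```python
import json
import urllib.request

INVOKE_URL = "https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod"
USER_ID = "0a60b9a8a831"


def build_kinesis_request(table_suffix: str, row: dict,
                          partition_key: str) -> urllib.request.Request:
    """Build a PUT request that adds one row to the matching Kinesis stream."""
    stream_name = f"streaming-{USER_ID}-{table_suffix}"
    url = f"{INVOKE_URL}/streams/{stream_name}/record"
    # Assumed body shape; adjust to the mapping template on the PUT method.
    payload = json.dumps(
        {"StreamName": stream_name, "Data": row, "PartitionKey": partition_key},
        default=str,  # converts values such as datetime to strings
    )
    return urllib.request.Request(
        url,
        data=payload.encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="PUT",
    )


# Hypothetical row and partition key, for illustration only.
kinesis_request = build_kinesis_request("geo", {"ind": 1, "country": "Example"}, "partition-1")

if __name__ == "__main__":
    print(kinesis_request.get_method(), kinesis_request.full_url)
    # To send it, call urllib.request.urlopen(kinesis_request) after the API is deployed.
```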

#### 9.4 Read data from Kinesis streams into Databricks   

##### 9.4.1
Create a new notebook called `pinterest_streaming_data` in Databricks.

1. Read in the `authentication_credentials.csv` credentials file to retrieve the Access Key and Secret Access Key.
2. Create functions to ingest data into Kinesis Data Streams
3. Read the data from the three streams
4. Clean the streaming data
5. Write each stream in a Delta Table

#### 9.5 Transform Kinesis Streams in Databricks

Clean the streaming data in the same way as the batch data



#### 9.6 Write streaming data into Databricks delta tables
Each stream was saved in a Delta table. The following three tables were created:
 - 0a60b9a8a831_pin_table 
 - 0a60b9a8a831_geo_table 
 - 0a60b9a8a831_user_table
 

Improvements:  
The cleaning code for the batch and streaming data is exactly the same in both notebooks, and the authentication credentials file is also the same. It would be nice not to have to repeat these sections, but it is unclear how to do this in a notebook; this needs further research.
