#### 1. Set up a new conda environment

In [None]:
conda create -n pinterest_conda_env
conda activate pinterest_conda_env

Installed the libraries required to run user_posting_emulation.py using:

In [None]:
conda install <library_name>
conda install -c conda-forge <library_name>

### 3. Batch processing : Configuring the EC2 Kafka client

#### The EC2 instance   
AWS EC2: Cloud-hosted virtual machine

![](/home/mhash/pinterest_data_pipeline_project/project_log/ec2_instance.png)

- Name: 0a60b9a8a831  
- Instance ID : i-034925981e7bb03f3   
- Instance type: t2.micro 
- Amazon Machine Image : Amazon Linux 2023 AMI
- Availability Zone: us-east-1a   
- Public IPv4 DNS : ec2-34-207-200-90.compute-1.amazonaws.com 
- Public IPv4 address : 34.207.200.90 
- Key pair name : 0a60b9a8a831-key-pair
- Key pair type : RSA
- Private key file format : .pem


#### 3.1 Created .pem file locally
EC2 > Instances > Search "i-034925981e7bb03f3" > Details > Key pair assigned at launch.  
Copied the key pair value into a local .pem file and made the file read-only for its owner, since ssh rejects private keys that other users can read.   
>chmod 400 0a60b9a8a831-key-pair.pem

#### 3.2 Connect to the EC2 instance
Followed the instructions on AWS on how to connect to EC2 through the CLI.  
    
> ssh -i "0a60b9a8a831-key-pair.pem" ec2-user@ec2-34-207-200-90.compute-1.amazonaws.com     
    

#### 3.3 Installing Kafka on the EC2 client 


I have been provided with access to an IAM-authenticated MSK cluster.

However, an MSK cluster can be set up manually by: 

- Creating the MSK cluster: 
    - MSK console > Create Cluster > Quick create    
    - Cluster name :pinterest-msk-cluster    
    - Cluster type : provisioned 
    - Apache Kafka version : 2.8.1   
    - Broker type : kafka.m5.large   
    - Amazon EBS storage per broker : 100 GiB    
    - Click Create cluster  

The EC2 instance has been created; however, the cluster's security group must also allow inbound traffic from it so that the client machine can send data to the MSK cluster.  
VPC > Security > Security groups >  
- Select the default security group associated with the cluster VPC (MSK console > Properties tab > Networking settings section > Security groups applied).

Edit inbound rules > Add rule. 
- Type column: All traffic     
- Source column: ID of the security group of the client machine (found in EC2 console > security tab). 
- Save rules and cluster will accept all traffic from the client machine.

##### Step 1 : Installed Java and checked its version with `java -version`, then downloaded and extracted Kafka 2.8.1 (the kafka_2.12-2.8.1 build).

In [None]:
sudo yum install java-1.8.0

wget https://archive.apache.org/dist/kafka/2.8.1/kafka_2.12-2.8.1.tgz   # Get the version of kafka to install
tar -xzf kafka_2.12-2.8.1.tgz                                           # To 'unzip' or 'untar' the file 
rm kafka_2.12-2.8.1.tgz                                                 # Remove the compressed file, keeping only uncompressed version

##### Step 2 : Installing IAM MSK authentication package on client EC2 machine.
Moved into the libs subdirectory of the Kafka installation directory and downloaded the IAM MSK authentication package from GitHub (required to connect to MSK clusters that use IAM authentication).

In [None]:
cd kafka_2.12-2.8.1/libs

wget https://github.com/aws/aws-msk-iam-auth/releases/download/v1.1.5/aws-msk-iam-auth-1.1.5-all.jar  # Download msk iam authentication file
export CLASSPATH=/home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar # Set class path environment variable for MSK authentication

Added the CLASSPATH export above to the .bashrc file so that it persists across EC2 sessions:

In [None]:
nano /home/ec2-user/.bashrc
source ~/.bashrc

![](/home/mhash/pinterest_data_pipeline_project/project_log/nano_bashrc_20231211.png)

In [None]:
echo $CLASSPATH                          # To test: if configured correctly, this prints /home/ec2-user/kafka_2.12-2.8.1/libs/aws-msk-iam-auth-1.1.5-all.jar
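
The nano screenshot is the only record of what was added to .bashrc, so here is a minimal sketch of the same edit done programmatically. It assumes the line added was exactly the `export CLASSPATH=...` command from Step 2; the helper name `ensure_line` is hypothetical and not part of the project.

```python
from pathlib import Path

# Export line copied from Step 2; the jar path assumes Kafka was extracted
# into /home/ec2-user, as in the commands above.
EXPORT_LINE = (
    "export CLASSPATH=/home/ec2-user/kafka_2.12-2.8.1/libs/"
    "aws-msk-iam-auth-1.1.5-all.jar"
)


def ensure_line(rc_file: Path, line: str = EXPORT_LINE) -> bool:
    """Append `line` to `rc_file` unless it is already present.

    Returns True if the file was modified, False if nothing changed.
    """
    text = rc_file.read_text() if rc_file.exists() else ""
    if line in text.splitlines():
        return False
    # Start on a fresh line if the file does not end with a newline.
    prefix = "" if text == "" or text.endswith("\n") else "\n"
    with rc_file.open("a") as fh:
        fh.write(prefix + line + "\n")
    return True
```

On the EC2 client this would be called as `ensure_line(Path.home() / ".bashrc")`, followed by `source ~/.bashrc` in the shell. Running it twice leaves a single copy of the line.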


##### Step 3 : Set up the "0a60b9a8a831-ec2-access-role" IAM role so that it can be assumed; it contains the permissions needed to authenticate to the MSK cluster
    
IAM console > Roles > 0a60b9a8a831-ec2-access-role      
- Copy the role ARN.      

Trust relationships tab > Edit trust policy > Add a principal       
- Select IAM roles as the Principal type and replace the placeholder ARN with the 0a60b9a8a831-ec2-access-role ARN copied above.
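
For reference, the console steps above result in a trust policy statement of roughly the following shape. This is a sketch only: the real policy on the role may contain further statements (for example an EC2 service principal), and `build_trust_policy` is a hypothetical helper used just to print the JSON.

```python
import json

# Role ARN as it appears in the client.properties file in Step 4.
ROLE_ARN = "arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role"


def build_trust_policy(principal_arn: str) -> dict:
    """Return a trust policy that lets `principal_arn` assume the role."""
    return {
        "Version": "2012-10-17",
        "Statement": [
            {
                "Effect": "Allow",
                "Principal": {"AWS": principal_arn},
                "Action": "sts:AssumeRole",
            }
        ],
    }


policy = build_trust_policy(ROLE_ARN)
print(json.dumps(policy, indent=4))
```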


##### Step 4 : Configure Kafka client to use AWS IAM
In the EC2 client, inside the kafka/bin directory `cd kafka_2.12-2.8.1/bin`, modified the client.properties file `nano client.properties` as follows:

In [None]:
# https://github.com/aws/aws-msk-iam-auth instructions on configuring a Kafka client to use AWS IAM with AWS_MSK_IAM mechanism

# Sets up TLS (Transport Layer Security) for encryption (cryptographic protocol) and SASL (Simple Authentication and Security Layer) for authN (framework for authentication and data security in Internet protocols).
security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler

#### 3.4 Creating Kafka Topics

Used the MSK Management Console to get the cluster information:  
Amazon MSK > pinterest-msk-cluster > View client information and save:  
- Bootstrap servers Private endpoint (single-VPC)
- Plaintext Apache Zookeeper connection string  

Topic names: 
- 0a60b9a8a831.pin for the Pinterest posts data
- 0a60b9a8a831.geo for the post geolocation data
- 0a60b9a8a831.user for the post user data

In the EC2 client, entered inside the kafka folder, then inside the bin folder `cd kafka_2.12-2.8.1/bin`, to run the commands:


In [None]:
./kafka-topics.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.pin
./kafka-topics.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.geo
./kafka-topics.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --command-config client.properties --create --topic 0a60b9a8a831.user
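
The three commands differ only in the topic name (the broker used as bootstrap server is just an entry point to the cluster, so one broker address should work for all three). A small sketch that generates them from the topic list, using only the flags shown above; `build_create_topic_command` is a hypothetical helper.

```python
import shlex

# Values copied from the commands above.
BOOTSTRAP_SERVER = (
    "b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098"
)
USER_ID = "0a60b9a8a831"
TOPIC_SUFFIXES = ("pin", "geo", "user")


def build_create_topic_command(topic, bootstrap_server=BOOTSTRAP_SERVER):
    """Return the kafka-topics.sh argument list that creates `topic`."""
    return [
        "./kafka-topics.sh",
        "--bootstrap-server", bootstrap_server,
        "--command-config", "client.properties",
        "--create",
        "--topic", topic,
    ]


commands = [
    build_create_topic_command(f"{USER_ID}.{suffix}")
    for suffix in TOPIC_SUFFIXES
]

for command in commands:
    # shlex.join quotes each argument so the line can be pasted into a shell.
    print(shlex.join(command))
```

Each list could also be passed to `subprocess.run(command, check=True)` from inside the `kafka_2.12-2.8.1/bin` directory instead of being printed.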

### 4. Batch processing : Connecting the MSK cluster to an S3 bucket.    
The S3 bucket the data will be saved in is user-0a60b9a8a831-bucket (an IAM role with permission to write to this bucket is already set up).

First, downloaded the Confluent.io Amazon S3 connector on the EC2 client and copied it to the S3 bucket:

In [None]:
# start a login shell as ec2-user
sudo -u ec2-user -i 

# create directory where we will save our connector 
mkdir kafka-connect-s3 && cd kafka-connect-s3 

# download connector from Confluent
wget https://d1i4a15mxbxib1.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.0.3/confluentinc-kafka-connect-s3-10.0.3.zip 

# copy connector to S3 bucket 
aws s3 cp ./confluentinc-kafka-connect-s3-10.0.3.zip s3://user-0a60b9a8a831-bucket/kafka-connect-s3/

#### 4.1 Created custom plugin in the MSK Connect console   
MSK console > MSK Connect section > Custom plugins > Create custom plugin.  
Chose the S3 object holding the Confluent connector ZIP file (s3://user-0a60b9a8a831-bucket/kafka-connect-s3/confluentinc-kafka-connect-s3-10.0.3.zip) and named the plugin 0a60b9a8a831-plugin.

#### 4.2 Created a connector in MSK connect using custom plugin    
 
MSK Connect > Custom plugins > choose 0a60b9a8a831-plugin > Create connector >  
Connector properties:
- **Basic properties**
    - Connector name : 0a60b9a8a831-connector
    - Description – optional : Connecting topics to s3 bucket
- **Apache Kafka cluster**
    - Cluster type : MSK cluster
    - MSK clusters : pinterest-msk-cluster
- **Connector configuration**
    - Configuration settings :  

In [None]:
connector.class=io.confluent.connect.s3.S3SinkConnector    
s3.region=us-east-1     
flush.size=1    
schema.compatibility=NONE   
tasks.max=3     
topics.regex=0a60b9a8a831.*     
format.class=io.confluent.connect.s3.format.json.JsonFormat     
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner   
value.converter.schemas.enable=false    
value.converter=org.apache.kafka.connect.json.JsonConverter     
storage.class=io.confluent.connect.s3.storage.S3Storage     
key.converter=org.apache.kafka.connect.storage.StringConverter  
s3.bucket.name=user-0a60b9a8a831-bucket

- **Connector capacity**
    - Capacity type : Provisioned
    - MCU count per worker : 1
    - Number of workers : 1
- **Worker configuration**
    - Use a customised configuration
    - Worker configuration : confluent worker
- **Access permissions**
    - IAM role : 0a60b9a8a831-ec2-access-role

### 5. Batch Processing : Configuring an API in API Gateway

#### 5.1 Built a Kafka REST proxy integration method for the API

API Gateway > Create API > REST API > Build  
- API name: 0a60b9a8a831
- Remaining settings left as default.
- Create API

Created a proxy resource on the API.


Set up an ANY method on the proxy resource, pointing at the Kafka REST proxy on the EC2 client:

Endpoint URL : http://ec2-34-207-200-90.compute-1.amazonaws.com:8082/{proxy}


Deploy API  
Invoke URL : https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod

#### 5.2 Set up the Kafka REST proxy on the EC2 client



##### Installed Confluent package for Kafka REST proxy on EC2 client machine 

sudo wget https://packages.confluent.io/archive/7.2/confluent-7.2.0.tar.gz      

tar -xvzf confluent-7.2.0.tar.gz 

To allow the REST proxy to perform IAM authentication to the MSK cluster, modified the kafka-rest.properties file.
Navigated to confluent-7.2.0/etc/kafka-rest and opened the file for editing:

In [None]:
cd confluent-7.2.0/etc/kafka-rest
nano kafka-rest.properties

Updated the bootstrap.servers and zookeeper.connect variables in this file with the corresponding Bootstrap servers string and Plaintext Apache ZooKeeper connection string.     
To pass the IAM authentication of the MSK cluster, the IAM MSK authentication package is used again: the same settings as in client.properties were added to kafka-rest.properties, each with a `client.` prefix. 

In [None]:
#
# Copyright 2018 Confluent Inc.
#
# Licensed under the Confluent Community License (the "License"); you may not use
# this file except in compliance with the License.  You may obtain a copy of the
# License at
#
# http://www.confluent.io/confluent-community-license
#
# Unless required by applicable law or agreed to in writing, software
# distributed under the License is distributed on an "AS IS" BASIS, WITHOUT
# WARRANTIES OF ANY KIND, either express or implied.  See the License for the
# specific language governing permissions and limitations under the License.
#

#id=kafka-rest-test-server
#schema.registry.url=http://localhost:8081
zookeeper.connect=z-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181,z-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:2181
bootstrap.servers=b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098,b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098

#
# Configure interceptor classes for sending consumer and producer metrics to Confluent Control Center
# Make sure that monitoring-interceptors-<version>.jar is on the Java class path
#consumer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringConsumerInterceptor
#producer.interceptor.classes=io.confluent.monitoring.clients.interceptor.MonitoringProducerInterceptor

# Sets up TLS for encryption and SASL for authN.
client.security.protocol = SASL_SSL

# Identifies the SASL mechanism to use.
client.sasl.mechanism = AWS_MSK_IAM

# Binds SASL client implementation.
client.sasl.jaas.config = software.amazon.msk.auth.iam.IAMLoginModule required awsRoleArn="arn:aws:iam::584739742957:role/0a60b9a8a831-ec2-access-role";

# Encapsulates constructing a SigV4 signature based on extracted credentials.
# The SASL client bound by "sasl.jaas.config" invokes this class.
client.sasl.client.callback.handler.class = software.amazon.msk.auth.iam.IAMClientCallbackHandler
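
The four IAM settings at the end of kafka-rest.properties are the same as those in client.properties from section 3.3, with `client.` prepended to each key. A minimal sketch that renders both variants from one dictionary; `render_properties` is a hypothetical helper, not part of Kafka or Confluent.

```python
# Settings copied from the client.properties file in section 3.3.
BASE_SETTINGS = {
    "security.protocol": "SASL_SSL",
    "sasl.mechanism": "AWS_MSK_IAM",
    "sasl.jaas.config": (
        "software.amazon.msk.auth.iam.IAMLoginModule required "
        'awsRoleArn="arn:aws:iam::584739742957:role/'
        '0a60b9a8a831-ec2-access-role";'
    ),
    "sasl.client.callback.handler.class": (
        "software.amazon.msk.auth.iam.IAMClientCallbackHandler"
    ),
}


def render_properties(settings, prefix=""):
    """Render `settings` as .properties lines, prepending `prefix` to keys."""
    lines = [f"{prefix}{key} = {value}" for key, value in settings.items()]
    return "\n".join(lines) + "\n"


# Variant used by the Kafka command-line tools.
client_properties = render_properties(BASE_SETTINGS)
# Variant used by the Confluent REST proxy.
rest_proxy_properties = render_properties(BASE_SETTINGS, prefix="client.")
print(rest_proxy_properties)
```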

##### Start the REST proxy on the EC2 client machine.
To make sure messages are consumed in MSK, the REST proxy must be running (starting it also serves as a test).     
If it starts correctly, the EC2 console shows `INFO Server started, listening for requests...`.

In [None]:
cd confluent-7.2.0/bin
./kafka-rest-start /home/ec2-user/confluent-7.2.0/etc/kafka-rest/kafka-rest.properties

#### 5.3 Sending data to the API   

Modified user_posting_emulation.py to send data to the Kafka topics through the API Invoke URL, so that rows from each of the three tables go to their corresponding Kafka topic.
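
A minimal sketch of what one such request could look like, not the project's actual script. It assumes the Confluent REST proxy v2 JSON format (`{"records": [{"value": ...}]}` posted to `/topics/<topic>`), uses the standard library instead of whichever HTTP client the script uses, and the row fields in the usage note are hypothetical.

```python
import json
import urllib.request

# Invoke URL from section 5.1.
INVOKE_URL = "https://vqbq2ubp7a.execute-api.us-east-1.amazonaws.com/prod"
# Content type the REST proxy expects for JSON-embedded records.
HEADERS = {"Content-Type": "application/vnd.kafka.json.v2+json"}


def build_payload(row: dict) -> bytes:
    """Wrap one table row in the REST proxy's records envelope."""
    body = {"records": [{"value": row}]}
    # default=str turns values such as datetimes into strings.
    return json.dumps(body, default=str).encode("utf-8")


def build_request(topic, row, invoke_url=INVOKE_URL):
    """Build (but do not send) the POST request for one row."""
    return urllib.request.Request(
        url=f"{invoke_url}/topics/{topic}",
        data=build_payload(row),
        headers=HEADERS,
        method="POST",
    )


def send_row(topic, row):
    """Send one row to `topic` and return the HTTP status code."""
    request = build_request(topic, row)
    with urllib.request.urlopen(request, timeout=10) as response:
        return response.status
```

For example, `send_row("0a60b9a8a831.pin", {"index": 1, "title": "example"})` would post one record to the pin topic; a 200 status indicates the proxy accepted it, provided the REST proxy from section 5.2 is running.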


Opened three separate terminals, each inside the `kafka_2.12-2.8.1/bin` directory of the EC2 client, and started a Kafka console consumer for each of the three topics: 

In [None]:
./kafka-console-consumer.sh --bootstrap-server b-1.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.pin --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-2.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.geo --from-beginning

./kafka-console-consumer.sh --bootstrap-server b-3.pinterestmskcluster.w8g8jt.c12.kafka.us-east-1.amazonaws.com:9098 --consumer.config client.properties --group students --topic 0a60b9a8a831.user --from-beginning