# MSK Connect

## Introduction

> IMPORTANT: MSK Connect is relatively cheap, but it is a paid service, so you will be charged if you use it on AWS. Pricing details can be found at the following [link](https://aws.amazon.com/msk/pricing/). During your project an AWS account will be provided to you, so you don't have to use your own account. If you do use your own AWS account, remember to delete any AWS resources after use.

*MSK Connect* is a feature of Amazon MSK that allows users to stream data to and from their MSK-hosted Apache Kafka clusters. With MSK Connect, you can deploy fully managed connectors that move data into or pull data from popular data stores like Amazon S3 and Amazon OpenSearch Service, or that connect Kafka clusters with external systems, such as databases and file systems.

*Source connectors* import data from external systems into your topics, while *sink connectors* export data from your topics to external systems.

MSK Connect continuously monitors the connectors' health and delivery state, manages the underlying hardware, and autoscales the connectors to match changes in data load.

## Set up the required resources

As an example, we will create a **sink connector** that sends data from an MSK cluster to an S3 bucket. To achieve this we will need the following resources:

- An **MSK cluster** to which we will send the data. The connector will read data from here and send it to the destination bucket.
- An **S3 bucket** that will serve as the destination for the data received from the connector.
- An **IAM role** that allows the connector to write to the destination bucket.
- A **VPC endpoint** that allows traffic from the cluster's VPC, where the connector runs, to reach S3.

### 1. Create the S3 bucket

Open the Amazon S3 console and choose **Create bucket**. Enter a name for the bucket, and make sure to select the same **AWS Region** as the one in which you created your MSK cluster (in our case us-east-1). Finally, choose **Create bucket**.

<p align="center">
    <img src="images/Create Bucket.png" width="550" height="300"/>
</p>
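The same bucket can also be created from code. The sketch below is a minimal example, assuming the `boto3` SDK is installed and AWS credentials are configured; the bucket name is a hypothetical placeholder. It illustrates one detail that is easy to miss: `us-east-1` is the default location, so it must not be passed as a `LocationConstraint`.

```python
def create_bucket_kwargs(bucket_name: str, region: str) -> dict:
    """Build the keyword arguments for boto3's S3 `create_bucket` call."""
    kwargs = {"Bucket": bucket_name}
    # us-east-1 is the default location and is rejected as a LocationConstraint;
    # every other region has to be named explicitly.
    if region != "us-east-1":
        kwargs["CreateBucketConfiguration"] = {"LocationConstraint": region}
    return kwargs


def create_bucket(bucket_name: str, region: str) -> None:
    """Create the bucket. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    s3 = boto3.client("s3", region_name=region)
    s3.create_bucket(**create_bucket_kwargs(bucket_name, region))


# Hypothetical bucket name, used only to show the generated arguments.
print(create_bucket_kwargs("my-msk-connect-bucket", "us-east-1"))
```

Calling `create_bucket(...)` performs the actual request, so only run it against an account where you intend to create the bucket.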


### 2. Create an MSK cluster

See the MSK Essentials lesson for how to create an MSK cluster.

### 3. Create an IAM role that can write to the destination bucket

Navigate to the IAM console and select **Roles** under the **Access management** section. Choose **Create role** to create a new IAM role.

Under **Trusted entity type**, select **AWS service**. Then, in the **Use case** section, select **S3** from the **Use cases for other AWS services** field.


<p align="center">
    <img src="images/IAM Role Trusted Entity.png" width="600" height="400"/>
</p>

On the permissions page, select **Create policy**. This will open a new browser tab where you can create the desired policy. Select the **JSON** tab and replace the existing text with the following policy, substituting your bucket name for `<DESTINATION_BUCKET>`:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "VisualEditor0",
            "Effect": "Allow",
            "Action": [
                "s3:ListBucket",
                "s3:DeleteObject",
                "s3:GetBucketLocation"
            ],
            "Resource": [
                "arn:aws:s3:::<DESTINATION_BUCKET>",
                "arn:aws:s3:::<DESTINATION_BUCKET>/*"
            ]
        },
        {
            "Sid": "VisualEditor1",
            "Effect": "Allow",
            "Action": [
                "s3:PutObject",
                "s3:GetObject",
                "s3:ListBucketMultipartUploads",
                "s3:AbortMultipartUpload",
                "s3:ListMultipartUploadParts"
            ],
            "Resource": "*"
        },
        {
            "Sid": "VisualEditor2",
            "Effect": "Allow",
            "Action": "s3:ListAllMyBuckets",
            "Resource": "*"
        }
    ]
}
```

This policy grants the permissions needed to write to the destination bucket. Skip through the remaining pages until you reach the **Create policy** button, then select it.
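Forgetting to replace `<DESTINATION_BUCKET>` is an easy mistake to make when pasting the policy. The sketch below shows one way to fill in the placeholder and confirm the result is still valid JSON before pasting it into the console. The template is shortened to the first statement for brevity, and the bucket name is a hypothetical example.

```python
import json

PLACEHOLDER = "<DESTINATION_BUCKET>"

# Shortened template: only the first statement of the policy is reproduced.
POLICY_TEMPLATE = """
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Action": ["s3:ListBucket", "s3:DeleteObject", "s3:GetBucketLocation"],
            "Resource": [
                "arn:aws:s3:::<DESTINATION_BUCKET>",
                "arn:aws:s3:::<DESTINATION_BUCKET>/*"
            ]
        }
    ]
}
"""


def fill_bucket_placeholder(policy_text: str, bucket_name: str) -> dict:
    """Substitute the bucket name and return the parsed policy document."""
    filled = policy_text.replace(PLACEHOLDER, bucket_name)
    return json.loads(filled)  # raises ValueError if the JSON is malformed


policy = fill_bucket_placeholder(POLICY_TEMPLATE, "my-msk-connect-bucket")
print(json.dumps(policy, indent=4))
```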

Back in the main tab for the IAM role, find the policy you have just created and select it. Skip through the remaining pages until you reach the **Create role** button. Once you have selected it, the new IAM role will be created.

In the IAM console, choose the role you have just created and select the **Trust relationships** tab. Under **Trusted entities**, edit the trust policy so that it reads as follows:

```json
{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Service": "kafkaconnect.amazonaws.com"
            },
            "Action": "sts:AssumeRole"
        }
    ]
}
```

This trust relationship allows MSK Connect to assume the role to which we attached the policy created above. Finally, select **Update Trust Policy**.
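If the connector later fails to start with a permissions error, the trust policy is a common culprit. As a rough illustration, the helper below (a hypothetical function, not part of any AWS SDK) checks whether a trust policy document lets a given service principal call `sts:AssumeRole`.

```python
def allows_service(trust_policy: dict, service: str) -> bool:
    """Return True if some Allow statement lets `service` assume the role."""
    for statement in trust_policy.get("Statement", []):
        if statement.get("Effect") != "Allow":
            continue
        principal = statement.get("Principal", {})
        if not isinstance(principal, dict):
            continue  # e.g. a wildcard principal, not handled by this sketch
        # Both fields may be a single string or a list of strings.
        services = principal.get("Service", [])
        if isinstance(services, str):
            services = [services]
        actions = statement.get("Action", [])
        if isinstance(actions, str):
            actions = [actions]
        if "sts:AssumeRole" in actions and service in services:
            return True
    return False


trust_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {"Service": "kafkaconnect.amazonaws.com"},
            "Action": "sts:AssumeRole",
        }
    ],
}
print(allows_service(trust_policy, "kafkaconnect.amazonaws.com"))  # True
print(allows_service(trust_policy, "s3.amazonaws.com"))            # False
```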


### 4. Create a VPC endpoint to S3

In the VPC console, select **Endpoints** under the **Virtual private cloud** section and then choose **Create endpoint**.

Under **Service Name**, choose **com.amazonaws.us-east-1.s3** with the **Gateway** type. In the **VPC** section, choose the MSK cluster's VPC from the drop-down menu. Finally, select **Create endpoint**.

<p align="center">
    <img src="images/VPC Endpoint.png" width="600" height="600"/>
</p>
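For reference, the console steps above roughly correspond to a single `create_vpc_endpoint` call in `boto3`. The sketch below only builds the arguments, so it runs without AWS access; the VPC and route table IDs are hypothetical placeholders, and attaching route tables is an assumption about how you want traffic routed to the gateway endpoint.

```python
def s3_gateway_endpoint_kwargs(region: str, vpc_id: str, route_table_ids: list) -> dict:
    """Build the arguments for boto3's EC2 `create_vpc_endpoint` call."""
    return {
        "VpcEndpointType": "Gateway",
        "VpcId": vpc_id,
        # The S3 service name follows the pattern com.amazonaws.<region>.s3
        "ServiceName": f"com.amazonaws.{region}.s3",
        "RouteTableIds": list(route_table_ids),
    }


def create_s3_gateway_endpoint(region: str, vpc_id: str, route_table_ids: list) -> None:
    """Create the endpoint. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    ec2 = boto3.client("ec2", region_name=region)
    ec2.create_vpc_endpoint(**s3_gateway_endpoint_kwargs(region, vpc_id, route_table_ids))


# Hypothetical IDs, shown only to illustrate the generated arguments.
print(s3_gateway_endpoint_kwargs("us-east-1", "vpc-0123456789abcdef0", ["rtb-0123456789abcdef0"]))
```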

## Create the custom plugin

A plugin contains the code that defines the logic of our connector. For this step we will use the client EC2 machine that we previously used to connect to our cluster (see the MSK Essentials lesson).

First, connect to your client EC2 machine. We will download the **Confluent.io Amazon S3 Connector** to this machine and then copy it to the S3 bucket we created earlier. This connector is a sink connector that exports data from Kafka topics to S3 objects in JSON, Avro or Bytes format. To download and copy the connector, run the commands below on your client machine:

```bash
# switch to the ec2-user account with a login shell
sudo -u ec2-user -i
# create the directory where we will save our connector
mkdir kafka-connect-s3 && cd kafka-connect-s3
# download the connector from Confluent
wget https://d2p6pa21dvn84.cloudfront.net/api/plugins/confluentinc/kafka-connect-s3/versions/10.5.13/confluentinc-kafka-connect-s3-10.5.13.zip
# copy the connector to our S3 bucket (the file name must match the downloaded ZIP)
aws s3 cp ./confluentinc-kafka-connect-s3-10.5.13.zip s3://<BUCKET_NAME>/kafka-connect-s3/
```
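The connector version appears twice in the download URL and again in the file name passed to `aws s3 cp`, so the three occurrences must agree. As a small illustration, the hypothetical helper below derives all of them from a single version string, following the URL pattern used above; the bucket name is a placeholder.

```python
BASE_URL = (
    "https://d2p6pa21dvn84.cloudfront.net/api/plugins/"
    "confluentinc/kafka-connect-s3/versions"
)


def plugin_locations(version: str, bucket_name: str) -> dict:
    """Derive the download URL, ZIP name and S3 destination from one version."""
    zip_name = f"confluentinc-kafka-connect-s3-{version}.zip"
    return {
        "download_url": f"{BASE_URL}/{version}/{zip_name}",
        "zip_name": zip_name,
        "s3_uri": f"s3://{bucket_name}/kafka-connect-s3/{zip_name}",
    }


for name, value in plugin_locations("10.5.13", "my-msk-connect-bucket").items():
    print(f"{name}: {value}")
```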

If everything ran successfully, you should be able to see the following folder and file inside your S3 bucket.

<p align="center">
    <img src="images/Plugin ZIP.png" width="700" height="450"/>
</p>

Now, open the MSK console and select *Custom plugins* under the **MSK Connect** section on the left side of the console. Choose **Create custom plugin**.

In the list of buckets, find the bucket to which you uploaded the Confluent connector ZIP file. Then, in the list of objects in that bucket, select the ZIP file and press the **Choose** button. Give the plugin a name and press **Create custom plugin**.

<p align="center">
    <img src="images/Custom plugin.png" width="650" height="400"/>
</p>

Once the plugin has been created you should see the following message at the top of your browser window:

`plugin <PLUGIN_NAME> was successfully created. The custom plugin was created. You can now create a connector using this custom plugin`
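The same plugin can also be registered programmatically. The sketch below builds the arguments for the `create_custom_plugin` operation of the `boto3` `kafkaconnect` client; it assumes `boto3` and credentials are available for the actual call, and the plugin and bucket names are hypothetical placeholders.

```python
def custom_plugin_kwargs(plugin_name: str, bucket_name: str, file_key: str) -> dict:
    """Build the arguments for the kafkaconnect `create_custom_plugin` call."""
    return {
        "name": plugin_name,
        "contentType": "ZIP",
        "location": {
            "s3Location": {
                # The API expects the bucket ARN, not the bucket name
                "bucketArn": f"arn:aws:s3:::{bucket_name}",
                "fileKey": file_key,
            }
        },
    }


def create_custom_plugin(plugin_name: str, bucket_name: str, file_key: str) -> None:
    """Register the plugin. Requires boto3 and valid AWS credentials."""
    import boto3  # imported lazily so the helper above works without boto3

    client = boto3.client("kafkaconnect")
    client.create_custom_plugin(**custom_plugin_kwargs(plugin_name, bucket_name, file_key))


print(custom_plugin_kwargs(
    "my-s3-sink-plugin",  # hypothetical plugin name
    "my-msk-connect-bucket",  # hypothetical bucket name
    "kafka-connect-s3/confluentinc-kafka-connect-s3-10.5.13.zip",
))
```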

## Create the connector

In the MSK console, select *Connectors* under the **MSK Connect** section on the left side of the console. Choose **Create connector**.

In the list of plugins, select the plugin you have just created, and then click **Next**. Enter a name for the connector, and then choose your MSK cluster from the cluster list.

In the **Connector configuration settings**, paste the following configuration:

```properties
connector.class=io.confluent.connect.s3.S3SinkConnector
# same region as our bucket and cluster
s3.region=us-east-1
flush.size=1
schema.compatibility=NONE
tasks.max=3
# regular expression selecting the topics to read; this example matches every topic whose name starts with <YOUR_UUID>
topics.regex=<YOUR_UUID>.*
format.class=io.confluent.connect.s3.format.json.JsonFormat
partitioner.class=io.confluent.connect.storage.partitioner.DefaultPartitioner
value.converter.schemas.enable=false
value.converter=org.apache.kafka.connect.json.JsonConverter
storage.class=io.confluent.connect.s3.storage.S3Storage
key.converter=org.apache.kafka.connect.storage.StringConverter
s3.bucket.name=<BUCKET_NAME>
```
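Before creating the connector, it can help to check which topics `topics.regex` will actually pick up. The sketch below parses a `key=value` configuration and tests a few topic names against the pattern. The UUID and topic names are hypothetical, and Python's `re` module is used as an approximation of the Java regular expressions that Kafka Connect applies to the whole topic name.

```python
import re


def parse_properties(text: str) -> dict:
    """Parse key=value lines, skipping blank lines and # comments."""
    config = {}
    for raw_line in text.splitlines():
        line = raw_line.strip()
        if not line or line.startswith("#"):
            continue
        key, _, value = line.partition("=")
        config[key.strip()] = value.strip()
    return config


def matching_topics(config: dict, topic_names: list) -> list:
    """Return the topics whose full name matches `topics.regex`."""
    pattern = re.compile(config["topics.regex"])
    return [name for name in topic_names if pattern.fullmatch(name)]


# Hypothetical UUID and topic names, for illustration only.
example_config = parse_properties("""
# topics to read
topics.regex=0a1b2c3d.*
flush.size=1
""")
topics = ["0a1b2c3d.pin", "0a1b2c3d.geo", "other.pin"]
print(matching_topics(example_config, topics))  # ['0a1b2c3d.pin', '0a1b2c3d.geo']
```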

Leave the rest of the configuration at its default values, except for:

- **Connector type**: change to **Provisioned** and make sure both the **MCU count per worker** and **Number of workers** are set to 1
- **Worker Configuration**: select **Use a custom configuration**, then pick `confluent-worker`
- **Access permissions**: select the IAM role you created previously

Skip through the remaining pages until you reach the **Create connector** button, then select it. Once your connector is up and running, you will be able to see it in the **Connectors** tab in the MSK console.
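For comparison, the console settings above map onto the `create_connector` operation of the `boto3` `kafkaconnect` client. The sketch below only assembles the request so that it can be inspected without AWS access. Every ARN, subnet, security group and version string is a hypothetical placeholder, IAM authentication with TLS is an assumption about how the cluster is configured, and the custom worker configuration is left out because it needs the ARN and revision of your own `confluent-worker` configuration.

```python
def connector_request(
    name: str,
    connector_config: dict,
    bootstrap_servers: str,
    subnets: list,
    security_groups: list,
    plugin_arn: str,
    role_arn: str,
    kafka_connect_version: str,
) -> dict:
    """Assemble the arguments for the kafkaconnect `create_connector` call."""
    return {
        "connectorName": name,
        "connectorConfiguration": dict(connector_config),
        "kafkaConnectVersion": kafka_connect_version,
        # Provisioned capacity: 1 MCU per worker and 1 worker, as in the console
        "capacity": {"provisionedCapacity": {"mcuCount": 1, "workerCount": 1}},
        "kafkaCluster": {
            "apacheKafkaCluster": {
                "bootstrapServers": bootstrap_servers,
                "vpc": {
                    "subnets": list(subnets),
                    "securityGroups": list(security_groups),
                },
            }
        },
        # Assumption: the cluster uses IAM authentication over TLS
        "kafkaClusterClientAuthentication": {"authenticationType": "IAM"},
        "kafkaClusterEncryptionInTransit": {"encryptionType": "TLS"},
        "plugins": [{"customPlugin": {"customPluginArn": plugin_arn, "revision": 1}}],
        # The IAM role created earlier, which MSK Connect assumes
        "serviceExecutionRoleArn": role_arn,
    }


request = connector_request(
    name="my-s3-sink-connector",
    connector_config={"connector.class": "io.confluent.connect.s3.S3SinkConnector"},
    bootstrap_servers="b-1.example:9098",
    subnets=["subnet-0123"],
    security_groups=["sg-0123"],
    plugin_arn="arn:aws:kafkaconnect:us-east-1:111122223333:custom-plugin/example",
    role_arn="arn:aws:iam::111122223333:role/example-role",
    kafka_connect_version="2.7.1",
)
print(request["capacity"])
```

With credentials in place, the request could be sent with `boto3.client("kafkaconnect").create_connector(**request)`.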

## Send data from MSK to S3

You are now ready to send data from your MSK cluster to your S3 bucket. Any data sent to the topics matched by `topics.regex` will now be uploaded automatically to the designated S3 bucket, in a newly created folder called `topics`.
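To know where to look in the bucket, it helps to understand how the connector names its objects. The sketch below reproduces the layout that the Confluent S3 sink connector documents for the `DefaultPartitioner` with the default `topics` directory: one folder per topic, one subfolder per Kafka partition, and a file name built from the topic, the partition and the zero-padded offset of the first record in the file. The function and the topic name are hypothetical illustrations, not part of the connector.

```python
def s3_object_key(
    topic: str,
    partition: int,
    start_offset: int,
    topics_dir: str = "topics",
    extension: str = "json",
) -> str:
    """Return the expected S3 key for a file written by the sink connector."""
    # The start offset is zero-padded to 10 digits in the file name.
    file_name = f"{topic}+{partition}+{start_offset:010d}.{extension}"
    return f"{topics_dir}/{topic}/partition={partition}/{file_name}"


# Hypothetical topic name; with flush.size=1 each record lands in its own object.
print(s3_object_key("0a1b2c3d.pin", 0, 42))
# topics/0a1b2c3d.pin/partition=0/0a1b2c3d.pin+0+0000000042.json
```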

## Conclusion

At this point, you should have a good understanding of:

- What MSK Connect is
- How to set up the necessary resources for MSK Connect
- How to create a custom sink plugin
- How to create a connector with MSK Connect
- How to send data from an MSK cluster to an S3 bucket