# Deploy Versioned Glue Artifact

Many serverless AWS services support versioning and aliases for deployment, which makes blue/green deployments, canary deployments, and rollbacks straightforward.

- AWS Lambda Versioning and Alias: https://docs.aws.amazon.com/lambda/latest/dg/configuration-versions.html
- AWS Step Functions Versioning and Alias: https://docs.aws.amazon.com/step-functions/latest/dg/auth-version-alias.html
- AWS SageMaker Model Registry Versioning: https://docs.aws.amazon.com/sagemaker/latest/dg/model-registry.html

However, AWS Glue does not support this feature. This library provides a way to manage versions and aliases for AWS Glue artifacts so you can deploy AWS Glue jobs with confidence.

## Overview

An AWS Glue job project usually has the following code components:

1. (Required) One or more AWS Glue ETL Python scripts.
2. (Optional) A Python library that is imported into the Glue ETL scripts. It usually contains reusable code that keeps your scripts organized.
3. (Optional) Additional third-party Python libraries used by your Glue ETL scripts.

According to the [official AWS documentation](https://docs.aws.amazon.com/glue/latest/dg/aws-glue-programming-python-libraries.html), #3 can be declared with the ``--additional-python-modules`` parameter, so you don't have to upload anything. Since #2 is custom code, you have to zip it, upload it to S3, and reference it with the ``--extra-py-files`` parameter. #1 also has to be uploaded to S3, and its S3 location is passed as the ``ScriptLocation`` parameter when you create the Glue job via the API or CloudFormation.
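
To make the mapping between the three components and the Glue job parameters concrete, here is a minimal sketch that assembles the keyword arguments for the boto3 ``glue.create_job`` call. The helper name ``build_create_job_kwargs``, the bucket, the role ARN, the job name, and the module versions are all hypothetical placeholders; only the parameter names (``ScriptLocation``, ``--extra-py-files``, ``--additional-python-modules``) come from the text above.

```python
from typing import List, Optional


def build_create_job_kwargs(
    job_name: str,
    role_arn: str,
    script_s3_uri: str,
    extra_py_files: Optional[List[str]] = None,
    additional_python_modules: Optional[List[str]] = None,
    glue_version: str = "4.0",
) -> dict:
    """Assemble keyword arguments for boto3 ``glue.create_job`` (hypothetical helper)."""
    default_arguments = {}
    if extra_py_files:
        # component #2: zipped custom library, already uploaded to S3
        default_arguments["--extra-py-files"] = ",".join(extra_py_files)
    if additional_python_modules:
        # component #3: third-party packages, nothing to upload
        default_arguments["--additional-python-modules"] = ",".join(
            additional_python_modules
        )
    return {
        "Name": job_name,
        "Role": role_arn,
        "GlueVersion": glue_version,
        "Command": {
            "Name": "glueetl",
            # component #1: the ETL script, already uploaded to S3
            "ScriptLocation": script_s3_uri,
            "PythonVersion": "3",
        },
        "DefaultArguments": default_arguments,
    }


# All values below are placeholders for illustration only.
kwargs = build_create_job_kwargs(
    job_name="my_glue_job",
    role_arn="arn:aws:iam::111122223333:role/my-glue-role",
    script_s3_uri="s3://example-bucket/versioned-artifacts/glue_etl_script_1/LATEST.py",
    extra_py_files=[
        "s3://example-bucket/versioned-artifacts/glue_python_lib/LATEST.zip",
    ],
    additional_python_modules=["pandas==2.0.3", "pyarrow==12.0.1"],
)
# With real values, the job could then be created with:
# boto3.client("glue").create_job(**kwargs)
```

Building the arguments as a plain dictionary keeps the sketch runnable without AWS credentials; the actual API call is left commented out.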

The ``aws_glue_artifact`` library helps you manage #1 and #2, and brings versioning and alias best practices to your deployments.


## Quick Start

First, import ``GlueETLScriptArtifact`` and ``GluePythonLibArtifact`` from ``aws_glue_artifact.api``. ``GlueETLScriptArtifact`` is an abstraction of #1, and ``GluePythonLibArtifact`` is an abstraction of #2. We also need to import ``BotoSesManager`` to give the artifact manager AWS credentials. In this example, you need AWS S3 and AWS DynamoDB permissions.

In [23]:
from aws_glue_artifact.api import GlueETLScriptArtifact, GluePythonLibArtifact
from boto_session_manager import BotoSesManager

We also import a few additional libraries to improve the development experience.

In [24]:
# define the Path to the artifact files
from pathlib import Path
# pretty printer for debugging
from rich import print as rprint

Next, let's use a local AWS CLI profile to create the boto session manager object.

In [25]:
bsm = BotoSesManager(profile_name="bmt_app_dev_us_east_1")

### Create Glue ETL Script Artifact

This section creates the Glue ETL script artifact. First, let's build the path to the script and display its content.

In [26]:
dir_here = Path.cwd().absolute()
dir_project_root = dir_here.parent
path_glue_etl_script_1_py = dir_here.joinpath("glue_etl_script_1.py")
print(path_glue_etl_script_1_py.read_text())

import sys
from awsglue.transforms import *
from awsglue.utils import getResolvedOptions
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job

sc = SparkContext.getOrCreate()
glueContext = GlueContext(sc)
spark = glueContext.spark_session
job = Job(glueContext)
args = getResolvedOptions(
    sys.argv,
    [
        "JOB_NAME",
    ],
)
job.commit()



Then we create a Glue ETL script artifact object. We need to specify ``aws_region``, ``s3_bucket``, ``s3_prefix``, and ``dynamodb_table_name`` to define the artifact store backend. It uses the [versioned](https://github.com/MacHu-GWU/versioned-project) Python library under the hood to manage the artifact content and its metadata. You also have to give it a unique ``artifact_name``, which becomes part of the artifact's S3 location. Finally, ``path_glue_etl_script`` tells it where the Glue ETL script is located.

In [27]:
aws_region = bsm.aws_region
s3_bucket = f"{bsm.aws_account_id}-{bsm.aws_region}-artifacts"
s3_prefix = "versioned-artifacts"
dynamodb_table_name = "versioned-artifacts"

glue_etl_script_artifact = GlueETLScriptArtifact(
    aws_region=aws_region,
    s3_bucket=s3_bucket,
    s3_prefix=s3_prefix,
    dynamodb_table_name=dynamodb_table_name,
    artifact_name="glue_etl_script_1",
    path_glue_etl_script=path_glue_etl_script_1_py,
)
print(glue_etl_script_artifact.path_glue_etl_script.relative_to(dir_project_root.parent))

aws_glue_artifact-project/examples/glue_etl_script_1.py


``aws_glue_artifact`` uses AWS S3 to store the artifact files and AWS DynamoDB to store the artifact metadata. The S3 bucket and DynamoDB table do not exist yet, so we have to call the ``.bootstrap`` method to create them.

In [28]:
glue_etl_script_artifact.repo.purge_all()
glue_etl_script_artifact.bootstrap(bsm=bsm)

Now we can call the ``put_artifact`` method to deploy the artifact as ``LATEST``. It returns an ``Artifact`` object that includes the artifact's metadata.

In [29]:
artifact = glue_etl_script_artifact.put_artifact(metadata={"foo": "bar"})
rprint(artifact)

If you want to deploy your Glue job via the SDK, CloudFormation, CDK, or Terraform, you need to pass the S3 URI of the artifact. You can use the ``get_artifact_s3path()`` method to get the S3 URI of the latest artifact.

In [30]:
s3path = glue_etl_script_artifact.get_artifact_s3path()
print(s3path.uri)
rprint(s3path.console_url)

s3://878625312159-us-east-1-artifacts/versioned-artifacts/glue_etl_script_1/LATEST.py


Once you have made a release to production, you should create an immutable version of your artifact so you can roll back at any time. Use the ``publish_artifact_version()`` method to publish a new version from ``LATEST``. A version is simply an immutable snapshot of your latest artifact.

In [31]:
artifact = glue_etl_script_artifact.publish_artifact_version()
rprint(artifact)
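
The relationship between ``LATEST`` and a published version can be illustrated with a toy in-memory model. This is only a conceptual sketch, not how ``aws_glue_artifact`` or ``versioned`` is implemented: the class ``InMemoryArtifactStore``, its methods, and the zero-padded version keys are hypothetical.

```python
class InMemoryArtifactStore:
    """Toy model: a mutable LATEST slot plus immutable numbered snapshots."""

    def __init__(self):
        self._objects = {}  # key -> content bytes
        self._last_version = 0

    def put_latest(self, content: bytes) -> str:
        # LATEST is overwritten on every deployment
        self._objects["LATEST"] = content
        return "LATEST"

    def publish_version(self) -> str:
        if "LATEST" not in self._objects:
            raise ValueError("nothing to publish, put LATEST first")
        self._last_version += 1
        key = str(self._last_version).zfill(6)  # illustrative key format
        # bytes are immutable, so the snapshot cannot change afterwards
        self._objects[key] = self._objects["LATEST"]
        return key

    def get(self, version=None) -> bytes:
        key = "LATEST" if version is None else str(version).zfill(6)
        return self._objects[key]


store = InMemoryArtifactStore()
store.put_latest(b"print('release 1')")
v1_key = store.publish_version()
# a later deployment changes LATEST but leaves version 1 untouched
store.put_latest(b"print('release 2')")
```

After the second ``put_latest`` call, ``store.get(version=1)`` still returns the first release, which is the property that makes rollback safe.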

When you roll back, you need to pass the S3 URI of a historical version of the artifact. Use ``get_artifact_s3path(version=...)`` to get that S3 URI.

In [32]:
s3path = glue_etl_script_artifact.get_artifact_s3path(version=1)
print(s3path.uri)

s3://878625312159-us-east-1-artifacts/versioned-artifacts/glue_etl_script_1/000001.py
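
One way to apply such a rollback is to point the job's ``ScriptLocation`` at the versioned URI through the boto3 ``glue.update_job`` call. The sketch below assumes you keep the job definition as a dictionary in your own code, in the shape that ``JobUpdate`` expects; the helper ``build_rollback_job_update``, the bucket, the role ARN, and the job name are hypothetical. Because ``update_job`` overwrites the previous job definition completely, the helper returns the whole definition rather than only the changed field.

```python
import copy


def build_rollback_job_update(job_definition: dict, script_s3_uri: str) -> dict:
    """Return a copy of the job definition pinned to a specific script URI."""
    job_update = copy.deepcopy(job_definition)  # never mutate the caller's dict
    job_update["Command"]["ScriptLocation"] = script_s3_uri
    return job_update


# Placeholder job definition, for illustration only.
current_definition = {
    "Role": "arn:aws:iam::111122223333:role/my-glue-role",
    "GlueVersion": "4.0",
    "Command": {
        "Name": "glueetl",
        "ScriptLocation": "s3://example-bucket/versioned-artifacts/glue_etl_script_1/LATEST.py",
        "PythonVersion": "3",
    },
    "DefaultArguments": {},
}

rollback = build_rollback_job_update(
    current_definition,
    # in practice this would come from get_artifact_s3path(version=1).uri
    "s3://example-bucket/versioned-artifacts/glue_etl_script_1/000001.py",
)
# With real values, the rollback could then be applied with:
# boto3.client("glue").update_job(JobName="my_glue_job", JobUpdate=rollback)
```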


### Create Glue Python Library Artifact

A Glue Python library can simplify your code and improve its reusability and maintainability. It also offloads complex logic from your ETL script so that you can thoroughly test that logic in unit tests.

Similar to how we created the ``GlueETLScriptArtifact``, we can create a ``GluePythonLibArtifact``. We have to specify the path to the Python library directory in ``dir_glue_python_lib`` and provide a temporary folder ``dir_glue_build`` in which to build the artifact. Note that ``dir_glue_build`` is cleaned up before each build, so make sure it doesn't contain any important files.

In [33]:
glue_python_lib_artifact = GluePythonLibArtifact(
    aws_region=aws_region,
    s3_bucket=s3_bucket,
    s3_prefix=s3_prefix,
    dynamodb_table_name=dynamodb_table_name,
    artifact_name="glue_python_lib",
    dir_glue_python_lib=dir_project_root.joinpath("aws_glue_artifact"),
    dir_glue_build=dir_project_root.joinpath("build", "glue"),
)
print(glue_python_lib_artifact.dir_glue_python_lib.relative_to(dir_project_root.parent))
print(glue_python_lib_artifact.dir_glue_build.relative_to(dir_project_root.parent))

aws_glue_artifact-project/aws_glue_artifact
aws_glue_artifact-project/build/glue


Similarly, you have to bootstrap it to ensure that the S3 and DynamoDB backend exists. If you use the same ``s3_bucket``, ``s3_prefix``, and ``dynamodb_table_name`` for all your Glue projects, you can skip this step.

In [34]:
glue_python_lib_artifact.repo.bootstrap(bsm=bsm)

Similarly, we can call the ``put_artifact`` method to deploy the artifact as ``LATEST``. It automatically builds your source code, zips it, and uploads it to AWS S3.

In [35]:
artifact = glue_python_lib_artifact.put_artifact(metadata={"foo": "bar"})
rprint(artifact)
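
For intuition about the zip step, here is a minimal sketch of packaging a Python library directory with the standard library only. It is not the library's actual build logic, which may differ; the function ``build_python_lib_zip`` and the ``my_lib`` package are hypothetical. The key point is that the package folder sits at the root of the archive, so that ``import my_lib`` works once the zip is on the Python path.

```python
import tempfile
import zipfile
from pathlib import Path


def build_python_lib_zip(dir_python_lib: Path, path_zip: Path) -> Path:
    """Zip the ``.py`` files of a package, keeping the package folder at the root."""
    dir_python_lib = Path(dir_python_lib)
    path_zip = Path(path_zip)
    path_zip.parent.mkdir(parents=True, exist_ok=True)
    with zipfile.ZipFile(path_zip, "w", zipfile.ZIP_DEFLATED) as zf:
        for path in sorted(dir_python_lib.rglob("*.py")):
            # archive name is relative to the parent, e.g. "my_lib/utils.py"
            arcname = path.relative_to(dir_python_lib.parent).as_posix()
            zf.write(path, arcname)
    return path_zip


# Demo on a throwaway package so the sketch runs without any project files.
with tempfile.TemporaryDirectory() as tmp:
    dir_lib = Path(tmp) / "my_lib"
    dir_lib.mkdir()
    (dir_lib / "__init__.py").write_text("")
    (dir_lib / "utils.py").write_text("def add(a, b):\n    return a + b\n")

    path_zip = build_python_lib_zip(dir_lib, Path(tmp) / "build" / "my_lib.zip")
    with zipfile.ZipFile(path_zip) as zf:
        names = sorted(zf.namelist())

print(names)
```

This sketch only includes ``.py`` files; a real build may also need to ship package data files.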

Similarly, you can use the ``get_artifact_s3path()`` method to get the S3 URI of the artifact. You will typically pass it in the ``--extra-py-files`` parameter of your Glue job.

In [36]:
s3path = glue_python_lib_artifact.get_artifact_s3path()
print(s3path.uri)
rprint(s3path.console_url)

s3://878625312159-us-east-1-artifacts/versioned-artifacts/glue_python_lib/LATEST.zip
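
The ``--extra-py-files`` parameter takes a comma-separated list of S3 paths, so a job that already references other files needs the new URI merged in rather than overwritten. The helper below is a hypothetical sketch of that merge; the function name and the example URIs are placeholders.

```python
from typing import Dict, List


def with_extra_py_files(default_arguments: Dict[str, str], s3_uris: List[str]) -> Dict[str, str]:
    """Return a copy of the job arguments with the URIs added to --extra-py-files."""
    merged = dict(default_arguments)
    existing = [u for u in merged.get("--extra-py-files", "").split(",") if u]
    for uri in s3_uris:
        if uri not in existing:  # avoid listing the same file twice
            existing.append(uri)
    merged["--extra-py-files"] = ",".join(existing)
    return merged


# Placeholder arguments, for illustration only.
job_arguments = {
    "--extra-py-files": "s3://example-bucket/libs/other_lib.zip",
    "--additional-python-modules": "pandas==2.0.3",
}
updated_arguments = with_extra_py_files(
    job_arguments,
    ["s3://example-bucket/versioned-artifacts/glue_python_lib/LATEST.zip"],
)
```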


Similarly, once you have made a release to production, you should call ``publish_artifact_version()`` to create an immutable snapshot of your artifact.

In [37]:
artifact = glue_python_lib_artifact.publish_artifact_version()
rprint(artifact)

## Summary

You now have an idea of how to manage AWS Glue artifacts using the ``aws_glue_artifact`` Python library. With versioned artifacts, you can easily enable blue/green and canary deployments, and roll back with confidence when there is a failure in production. I highly recommend this pattern for production projects.
