This sample project uses an example machine learning project to show how to implement MLOps (CI/CD for machine learning) using Amazon SageMaker, AWS CodePipeline, and AWS CDK.
- Python (version 3.8 or higher)
- NodeJS (version 14 or higher)
- Yarn (installed via `npm install -g yarn`)
- TypeScript (installed via `npm install -g typescript`)
- AWS CDK v2 CLI (installed via `npm install -g aws-cdk`)
- AWS CLI (version 2 or higher)
- AWS CLI Configuration (configured via `aws configure`)
- Fork this repo in your GitHub account
- Create a GitHub connection using the CodePipeline console to give CodePipeline access to your GitHub repositories (see the section Create a connection to GitHub (CLI))
- Update the GitHub-related configuration in the `./configuration/projectConfig.json` file:
  - Set the value of `repoType` to `git`
  - Update the values of `githubConnectionArn`, `githubRepoOwner`, and `githubRepoName`
Alternatively, the CDK infrastructure code can provision a CodeCommit repo as the source repo for you.
To switch to this option, set the value of `repoType` to `codecommit` in the `./configuration/projectConfig.json` file.
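The GitHub settings described above can also be applied with a short script instead of editing the file by hand. The following is a minimal sketch: the helper names `set_github_source` and `update_config_file` are hypothetical and not part of this repo, the placeholder values must be replaced with your own, and the sketch assumes the config file is a flat JSON object containing the four keys named above.

```python
import json
from pathlib import Path


def set_github_source(config: dict, connection_arn: str, owner: str, repo: str) -> dict:
    """Return a copy of the project config that points at a GitHub source repo."""
    updated = dict(config)  # shallow copy; every other key is left untouched
    updated.update(
        repoType="git",
        githubConnectionArn=connection_arn,
        githubRepoOwner=owner,
        githubRepoName=repo,
    )
    return updated


def update_config_file(path: Path, connection_arn: str, owner: str, repo: str) -> None:
    """Rewrite the JSON config file in place with the GitHub settings applied."""
    config = json.loads(path.read_text())
    updated = set_github_source(config, connection_arn, owner, repo)
    path.write_text(json.dumps(updated, indent=2) + "\n")


# In-memory demonstration with placeholder (hypothetical) values.
example = set_github_source(
    {"projectName": "mlops-e2e", "repoType": "codecommit"},
    connection_arn="<your-connection-arn>",
    owner="<your-github-user>",
    repo="<your-fork-name>",
)
print(json.dumps(example, indent=2))
```

To change the real file, you would call `update_config_file(Path("./configuration/projectConfig.json"), ...)` from the repo root with your own values.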
Note that, for simplicity, the API endpoint for the online model consumers is not protected by any authentication. By default, anyone on the internet can access it. Update the value of `ipPermitList` in the `./configuration/projectConfig.json` file to include only the CIDR block of your network.
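Before deploying, it can help to check the permit list for mistakes. The sketch below uses only Python's standard `ipaddress` module. It assumes that `ipPermitList` is a JSON array of CIDR strings, which this README does not state, and `check_permit_list` is a hypothetical helper rather than part of the project.

```python
import ipaddress
from typing import List


def check_permit_list(cidrs: List[str]) -> List[str]:
    """Return a human-readable problem for each suspicious entry."""
    problems = []
    for cidr in cidrs:
        try:
            # strict=True rejects entries with host bits set, e.g. 203.0.113.7/24
            network = ipaddress.ip_network(cidr, strict=True)
        except ValueError as err:
            problems.append(f"{cidr}: not a valid CIDR block ({err})")
            continue
        if network.prefixlen == 0:
            # A /0 prefix matches every address, i.e. the open-to-all default.
            problems.append(f"{cidr}: allows every address on the internet")
    return problems


# 203.0.113.0/24 is a documentation-only range, used here as a placeholder.
for problem in check_permit_list(["203.0.113.0/24", "0.0.0.0/0"]):
    print(problem)
```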
Important: this application uses various AWS services and there are costs associated with these services after the Free Tier usage - please see the AWS Pricing page for details. You are responsible for any AWS costs incurred. Please follow the Cleanup Section to clean up resources after your usage. No warranty is implied in this example.
Run the command below to provision all the required infrastructure.
./scripts/bootstrap.sh
The command can be run repeatedly to deploy any changes in this folder.
If the script runs successfully, a list of CloudFormation outputs is printed to the console, as shown in the screenshot below:
The outputs include the names of the newly created CloudFormation Stack, CodePipeline, Data S3 Bucket, Data Manifest S3 Bucket, SageMaker Artifact S3 Bucket, and SageMaker Execution Role.
If you navigate to the CodePipeline console, you should see the newly created CodePipeline, as shown in the screenshot below:
If you are using a CodeCommit repo, refer to the Source Code Section on how to push the source code to the newly created CodeCommit repo.
If you are using a GitHub repo, the CodePipeline should already be connected to your GitHub repo. Refer to the Testing Data Set Section on how to upload the testing data set to trigger the pipeline.
If `repoType` is `codecommit`, then after the CloudFormation stack is created, follow this page to connect to the CodeCommit repo and push the content of this folder to the `main` branch of the repo.
Note: The default branch may not be `main`, depending on your Git settings.
Once the source code is pushed to the repo, the CodePipeline will be triggered, but the CI stage will fail given that the testing data set has not been uploaded yet. Refer to Testing Data Set Section on how to upload the testing data set.
Download a copy of the testing data set from https://archive.ics.uci.edu/ml/datasets/abalone, and upload it to the Data Source S3 Bucket (the bucket name starts with mlopsinfrastracturestack-datasourcedatabucket...) under your preferred folder path, e.g. yyyy/mm/dd/abalone.csv.
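For a manual upload, the date-based folder path can be generated rather than typed. This is a minimal sketch: `dated_object_key` and `upload_dataset` are hypothetical helpers, the bucket name must be taken from your own CloudFormation outputs, and the upload step assumes that `boto3` is installed and AWS credentials are configured.

```python
from datetime import date


def dated_object_key(day: date, filename: str = "abalone.csv") -> str:
    """Build an object key of the form yyyy/mm/dd/<filename>."""
    return f"{day:%Y/%m/%d}/{filename}"


def upload_dataset(local_path: str, bucket: str, day: date) -> str:
    """Upload a local file to the data bucket under a dated key and return the key."""
    import boto3  # imported lazily so the key helper works without the AWS SDK

    key = dated_object_key(day)
    boto3.client("s3").upload_file(local_path, bucket, key)
    return key


print(dated_object_key(date(2024, 3, 5)))  # prints 2024/03/05/abalone.csv
```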
Alternatively, you can run the script below to download a copy of the testing data set and upload it to the empty data bucket.
./scripts/uploadTestingDataset.sh
Once the data is uploaded to the data bucket, the CodePipeline should be triggered automatically.
During the CodePipeline run, a SageMaker Pipeline named `mlops-e2e` (the `projectName` in the `configuration/projectConfig.json` file) will be created or updated.
To inspect the newly created SageMaker Pipeline, you can set up SageMaker Studio and navigate to the SageMaker Pipeline list in SageMaker Studio, as shown in the screenshot below:
If any issues occur during the MLPipeline stage of the CodePipeline run, the best way to troubleshoot is to open the SageMaker Pipeline details page in SageMaker Studio and review the logging information.
After the CodePipeline run is completed (including the Manual-approval-gated Deploy stage), the Model Consumer example can be deployed. See section ML Model Consumers for more details.
To clean up all the infrastructure, run the command below:
./scripts/cleanup.sh
Note: If you have bootstrapped the model consumer example, you will need to clean up the model consumer infrastructure resources first. Refer to the README file for more details and instructions on how to clean up the example infrastructure.
The project is created based on the SageMaker Project Template - MLOps template for model building, training and deployment.
In this example, we are solving the abalone age prediction problem using a sample dataset. The dataset used is the UCI Machine Learning Abalone Dataset. The aim for this task is to determine the age of an abalone (a kind of shellfish) from its physical measurements. At the core, it's a regression problem.
- `buildspecs`: Build specification files used by CodeBuild projects
- `configuration`: Project and pipeline configuration
- `consumers`: Examples of how to consume the inference model
- `docs`: Images used in the documentation
- `infrastructure`: AWS CDK app for provisioning the end-to-end MLOps infrastructure
- `ml_pipeline`: The SageMaker pipeline definition expressing the ML steps involved in generating an ML model, plus helper scripts
- `model_deploy`: AWS CDK app for deploying the model on a SageMaker endpoint
- `scripts`: Bash scripts used in the CI/CD pipeline
- `src`: Machine learning code for preprocessing and evaluating the ML model
- `tests`: Unit testing code for testing machine learning code
The overall architecture of the sample project is shown below:
When a new version of the source code or the data is available, the CodePipeline (serving as the MLOps pipeline) is triggered. It runs the CI step to test the ML code and build the infrastructure code, followed by the MLPipeline step. In the MLPipeline step, a SageMaker Pipeline is created or updated to preprocess the raw data and then train and evaluate the ML model. In the Deploy step, the trained model is deployed as a SageMaker endpoint after manual approval.
The CodePipeline is defined as a CDK construct in the `./infrastructure/codePipelineConstructure.ts` file.
When a new data file is uploaded to the Data Source S3 Bucket, a Lambda function defined in the `./infrastructure/functions/dataSourceMonitor` folder is triggered. It generates a new data manifest file, which specifies the raw data to include in training, and uploads it to the Data Manifest S3 Bucket. The new version of this file then triggers the CodePipeline.
By default, all the files inside the Data Source S3 Bucket are used in the training job. The source code can be updated to include only data files within a certain date range.
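The date-range idea can be sketched as follows. This is not the project's Lambda code: the function names are hypothetical, the sketch assumes object keys start with a `yyyy/mm/dd/` folder as in the upload example, and the manifest layout (a prefix entry followed by relative keys) is modeled on the SageMaker `ManifestFile` format, which may differ from what this project actually writes.

```python
import json
from datetime import date
from typing import Iterable, List, Optional


def key_date(key: str) -> Optional[date]:
    """Parse the leading yyyy/mm/dd/ folder of an object key, if present."""
    parts = key.split("/")
    if len(parts) < 4:
        return None
    try:
        return date(int(parts[0]), int(parts[1]), int(parts[2]))
    except ValueError:
        return None


def build_manifest(
    bucket: str,
    keys: Iterable[str],
    start: Optional[date] = None,
    end: Optional[date] = None,
) -> List:
    """List the keys to train on; with no range given, every key is kept."""
    selected = []
    for key in sorted(keys):
        if start is not None or end is not None:
            day = key_date(key)
            if day is None:
                continue  # undated keys cannot be matched against a range
            if start is not None and day < start:
                continue
            if end is not None and day > end:
                continue
        selected.append(key)
    return [{"prefix": f"s3://{bucket}/"}] + selected


keys = ["2024/01/15/abalone.csv", "2024/02/01/abalone.csv"]
print(json.dumps(build_manifest("example-bucket", keys, start=date(2024, 2, 1))))
```

Keeping the default as "no range means all files" preserves the behavior described above, so the filter only applies when a range is supplied.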
The SageMaker Pipeline is defined by the Python code in the `./ml_pipeline` folder. The source code for preprocessing and evaluating data is located in the `./src` folder.
In the preprocessing job (specified in the `./src/preprocess.py` file), we use scikit-learn to transform the data. At inference time, the same preprocessor must be used to transform the incoming data. In the SageMaker Pipeline, we therefore build an inference pipeline model that combines the preprocessor and the inference model into a pipeline model package, so that a deployed inference pipeline can process the raw data and pass it to the prediction model. A transform step defined in the `./src/transform.py` file maps the input and output of the preprocessor during inference.
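The reason for chaining the preprocessor and the model can be illustrated with a toy example. The classes below are plain-Python stand-ins written for this illustration; they are not scikit-learn or SageMaker SDK classes, the weights are made up, and the project's real preprocessing is in `./src/preprocess.py`. The point is only that the statistics learned during training are reused, unchanged, on raw inference data.

```python
from statistics import mean, pstdev
from typing import List, Sequence


class Scaler:
    """Stand-in preprocessor: standardizes each numeric column."""

    def fit(self, rows: Sequence[Sequence[float]]) -> "Scaler":
        columns = list(zip(*rows))
        self.means = [mean(col) for col in columns]
        # Fall back to 1.0 for constant columns so transform never divides by zero.
        self.stds = [pstdev(col) or 1.0 for col in columns]
        return self

    def transform(self, rows: Sequence[Sequence[float]]) -> List[List[float]]:
        return [
            [(value - m) / s for value, m, s in zip(row, self.means, self.stds)]
            for row in rows
        ]


class LinearModel:
    """Stand-in predictor with fixed, made-up weights."""

    def __init__(self, weights: Sequence[float], bias: float) -> None:
        self.weights = list(weights)
        self.bias = bias

    def predict(self, rows: Sequence[Sequence[float]]) -> List[float]:
        return [sum(w * v for w, v in zip(self.weights, row)) + self.bias for row in rows]


class InferencePipeline:
    """Applies the fitted preprocessor, then the model, to raw rows."""

    def __init__(self, preprocessor: Scaler, model: LinearModel) -> None:
        self.preprocessor = preprocessor
        self.model = model

    def predict(self, raw_rows: Sequence[Sequence[float]]) -> List[float]:
        return self.model.predict(self.preprocessor.transform(raw_rows))


scaler = Scaler().fit([[1.0, 10.0], [3.0, 30.0]])  # fitted once, on training data
pipeline = InferencePipeline(scaler, LinearModel([2.0, 3.0], bias=1.0))
print(pipeline.predict([[3.0, 30.0]]))  # raw input goes in; scaling happens inside
```

Callers of the pipeline never scale the data themselves, which is what prevents a mismatch between training-time and inference-time preprocessing.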
The model deployment is managed by the CDK stack defined in the `./model_deploy` folder. The model is deployed to a persistent SageMaker real-time inference endpoint.
An example of how to consume the inference model is available in the `consumers/online` folder.
Refer to the README file for more details and instructions on how to deploy the example.
This project is licensed under the MIT-0 license.
Refer to CONTRIBUTING for more details on how to contribute to this project.