# Get started with CI/CD for ML projects

### Concepts



Software engineers may be familiar with the concepts of continuous integration and continuous delivery. The basic flow consists of pushing code to a Git repository; this event then triggers a job that tests the code and builds the application automatically. One of the best-known open-source tools is Jenkins, but cloud providers also have their own services, such
as Cloud Build for GCP, or CodeBuild/CodePipeline for AWS. One of the main advantages of CI/CD is the automation of all the deployment tasks, which shortens software development iterations.

CI/CD for data science is becoming the norm. Deploying models to production is not easy, and DevOps engineers are bringing their expertise to ML teams to simplify the process. Many of
the lessons learnt by software engineering teams can be reused, with the exception that, in addition to testing code, ML teams also need to test data and evaluate models. A CI/CD workflow
for data science could look like this:
    

<img src="cicd.png">


This picture is extracted from this interesting Google blog post: https://cloud.google.com/solutions/machine-learning/mlops-continuous-delivery-and-automation-pipelines-in-machine-learning.
If you want to know more about it, and MLOps in general, I encourage you to read it.

In this blog we will set up a simple Continuous Integration pipeline that you can reuse across projects. As a first step, it will be very similar to a standard software engineering pipeline; in a future article we will try to enhance it and add more features.

### What's next?

In the next section, we set up a simple continuous integration flow. It covers the first steps of the deployment flow shown above.

<img src="architecture.png">

An ML pipeline is composed of a few components for data preparation, training, model validation, inference, etc. A popular pattern is to package each component in a container, as it's a good
strategy for reproducing results without depending on the hardware/OS it runs on. Whether you are running your containers on-prem or in the cloud, it shouldn't matter, as your code will always
be executed in the same Docker environment. Today, most ML frameworks choose this approach.

We will start with a demo Git repository on GitHub. We will create two dummy components, each with a corresponding Dockerfile and some unit tests. Like any software product, your code deserves
to be tested, and writing unit tests from the beginning is a good habit to have. Then, we will configure an event on Git push to trigger a build job, in both GCP and AWS. In a
real-world setting, you may not need to deploy to multiple cloud providers; this is just for the demo.

### Let's get started

First, create your own GitHub repository. In this example, I use my repo https://github.com/MatthieuBlais/demo-mlops.git

#### Repository structure

You may start your project by analysing data in notebooks. It's important to add them to your repository, as it helps others to quickly
explore and understand your experiments. However, you may want to avoid deploying Jupyter notebooks in production and instead organise your code into components. In addition,
you may also have some Dockerfiles needed to build your app. With all these different kinds of files, a repository can quickly become messy and difficult to manage. To avoid that, following a clear
repository structure is important, and it can also help to automate the deployment process.

In this example, I've chosen to create a folder structure like this:

In the "components" folder, I can add all the components I need for my ML pipeline. Each component comes with its own testable code (app folder), a component definition (components.yaml, very similar to a Kubeflow definition), a Dockerfile and a requirements.txt.
In the "deployment" folder, I keep the scripts needed for the CI/CD pipeline. In the repo, I've created two folders, one for AWS and one for GCP, but as mentioned, you probably need only one of them.
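
Because every component follows the same layout, that layout can be checked automatically. The snippet below is only a sketch, not part of the demo repository: the helper name `missing_files` is hypothetical, and the list of required entries is taken from the description above.

```python
from pathlib import Path

# Entries each component is expected to contain, per the description above.
REQUIRED = ("app", "components.yaml", "Dockerfile", "requirements.txt")


def missing_files(components_dir):
    """Map each component name to the required entries it lacks (hypothetical helper)."""
    problems = {}
    for component in sorted(Path(components_dir).iterdir()):
        if not component.is_dir():
            continue  # ignore stray files next to the component folders
        missing = [name for name in REQUIRED if not (component / name).exists()]
        if missing:
            problems[component.name] = missing
    return problems
```

Calling `missing_files("components")` from the repository root returns an empty dictionary when every component is complete.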

#### Testing

Before setting up the CI pipeline, let's confirm everything is running as expected. For each component, try to build the Docker image locally:

You should be able to build your Docker image successfully. You can also try to test the (very simple) code using pytest:
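
If you have not written pytest tests before, here is a rough idea of what one looks like. This is an illustration only, not the code from the demo repository: the `normalize` function and both tests are made up for the example.

```python
# Hypothetical component code; in the repo this would live in a component's app folder.
def normalize(values):
    """Scale a list of numbers to the [0, 1] range."""
    low, high = min(values), max(values)
    if high == low:
        # Constant input: avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - low) / (high - low) for v in values]


# pytest collects any function whose name starts with "test_".
def test_normalize_scales_to_unit_range():
    assert normalize([2, 4, 6]) == [0.0, 0.5, 1.0]


def test_normalize_handles_constant_input():
    assert normalize([3, 3]) == [0.0, 0.0]
```

Saved in a file whose name starts with `test_`, these functions are discovered and run by the `pytest` command.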

Now that everything is ready, let's configure the pipeline.

### GCP setup

When we push to our GitHub repository, we want to trigger a Cloud Build job to test our code and build our components' Docker images. The first step is to configure this Cloud Build trigger. The best way to proceed is to refer to the official documentation: https://cloud.google.com/build/docs/automating-builds/run-builds-on-github. After following this guide, if you push code to your repository, you should notice a new build starting.

#### First pipeline 

You may notice the cloudbuild.yaml file in the deployment/gcp/ folder. This file describes the steps executed by Cloud Build. Let's keep it simple for now. We have 2 components: we want to run
pytest and get the coverage for each of them (2 steps), then build the Docker images (2 steps) and push them to GCR (2 steps). In total, we have 6 steps: 2 of them require a Python environment and 4 of them a Docker environment. Our first cloudbuild.yaml can look like this:

We use a Python image to run pytest for our two components. Then, we use one of the default Cloud Build Docker images to build our images. Note that we set the tags so that we can easily push
the images to GCR.

#### Enhancing the pipeline

For now we have 2 components and our Cloud Build flow has 6 steps. At 3 steps per component, 10 components would mean 30 steps. However, all the commands are very
similar except for the name of the component. We can simplify the flow by writing two bash scripts (one for pytest and one for the Docker images) that iterate over the list of components and
execute the same commands for each of them.
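
The repository does this with bash scripts; the same loop is sketched below in Python purely to show the idea. The function names are hypothetical, and the image prefix and tag are passed in by the caller rather than taken from the real scripts.

```python
import subprocess
from pathlib import Path


def list_components(components_dir):
    """Return the component folder names in a stable order."""
    return sorted(p.name for p in Path(components_dir).iterdir() if p.is_dir())


def build_command(image_prefix, component, context_dir, tag):
    """Assemble the `docker build` command for one component."""
    image = f"{image_prefix}/{component}:{tag}"
    return ["docker", "build", "-t", image, str(context_dir)]


def run_for_all(components_dir, make_command, dry_run=True):
    """Apply the same command template to every component.

    With dry_run=True the commands are only returned, not executed.
    """
    commands = []
    for name in list_components(components_dir):
        cmd = make_command(name, Path(components_dir) / name)
        commands.append(cmd)
        if not dry_run:
            subprocess.run(cmd, check=True)  # stop at the first failing component
    return commands
```

Adding an eleventh component then means adding a folder, not touching the build definition.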

This Cloud Build definition stays the same regardless of how many components we have. Note that we kept the same images, python:3.8-slim and docker. If you go through the bash scripts,
you may notice we add two tags to our Docker images:

This is to keep the GCR repository tidy and to easily identify the latest image for each branch. You can obviously choose a different method to organise the repository, but it's important
to avoid the following situation. Let's say an ML pipeline in production uses the Docker image **gcr.io/PROJECT_ID/demo-mlops/component:latest**. You want to add a new feature to this component
and you push a new Dockerfile to GitHub. The new "latest" image will then be the one built during your latest push. The next time your ML pipeline in production is triggered, it will
also use this new image, even though you haven't done any proper testing to know whether something is going to break. This is a dangerous situation. Adding the branch name to the image tag helps
to avoid this issue.
If you push to your dev branch, the image will be tagged **gcr.io/PROJECT_ID/demo-mlops/component:dev-latest**, but as your prod environment (which uses the "master" branch) pulls the
image **gcr.io/PROJECT_ID/demo-mlops/component:master-latest**, there won't be any conflict.
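
To make the branch-based naming concrete, here is a small sketch of how such tags could be derived. The `BRANCH-latest` form comes from the examples above; the second tag, built from the short commit SHA, is my assumption for illustration and may differ from what the repository's scripts actually do. The helper name `image_tags` is hypothetical.

```python
import re


def image_tags(image, branch, commit_sha):
    """Return branch-scoped tags for one image (hypothetical helper)."""
    # Docker tags only allow letters, digits, "_", "." and "-",
    # so a branch like "feature/x" becomes "feature-x".
    safe_branch = re.sub(r"[^A-Za-z0-9_.-]", "-", branch)
    return [
        f"{image}:{safe_branch}-latest",            # moving pointer for the branch
        f"{image}:{safe_branch}-{commit_sha[:7]}",  # assumed: immutable per-commit tag
    ]
```

With this scheme, a push to dev can never overwrite the image that the master-based production pipeline pulls.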

#### (Optional) One more thing

We have already achieved what we wanted to do and we could stop here, but there is one more thing... During the build, we do not save any artifact. If we want to
know the pytest coverage or the Docker images that have been built, we must read the logs to find out. This is not very nice. Instead let's save the reports to Google Cloud Storage! 

We save the artifacts in this location: cloudbuild/REPO_NAME/BRANCH_NAME/BUILD_ID/. To do that we add an artifact section to our two Cloud Build steps.

You must also update the bash scripts so that they write every artifact you want to keep into the "_artifacts/" folder.
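
As a rough sketch of these two pieces, the snippet below writes a report into "_artifacts/" and computes the destination prefix described above. The helper names and the JSON report format are assumptions for illustration; the real scripts are bash and may store different files.

```python
import json
from pathlib import Path


def artifact_prefix(repo_name, branch_name, build_id):
    """Destination prefix inside the bucket: cloudbuild/REPO_NAME/BRANCH_NAME/BUILD_ID/."""
    return f"cloudbuild/{repo_name}/{branch_name}/{build_id}/"


def save_report(name, payload, artifacts_dir="_artifacts"):
    """Write one JSON report into the local artifacts folder (hypothetical helper)."""
    out_dir = Path(artifacts_dir)
    out_dir.mkdir(parents=True, exist_ok=True)  # create the folder on first use
    path = out_dir / f"{name}.json"
    path.write_text(json.dumps(payload, indent=2))
    return path
```

Keeping the build ID in the prefix means reports from two builds of the same branch never overwrite each other.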

Push your changes and here we are, you should see your artifacts being uploaded at the end of your build:

### Conclusion

We have seen how to create a simple Continuous Integration pipeline for your ML projects. We have used GCP Cloud Build and AWS CodeBuild, but there are other tools and products that you can explore.
One important condition for enabling CI is a proper repository structure. This demo is just an example to get you started. You can enhance it in many ways. For example, the scripts
used to test the code and build the Docker images currently live in the source repository. This is something you would change if you work with a big team, because most of your projects would
use the same scripts and they would get copy-pasted into many repositories. The day you want to update something in one of them, you will have to apply the change to all the projects! Instead,
you could have a process to store the scripts on GCS/S3 and download them during the build execution.
Another area that could be improved is the handling of artifacts. Currently, we just upload them to GCS/S3, but during a code review, the engineers still have to go to the buckets to
check the results. Instead, you could add one more step to the build to summarize the results and save them into a database. Based on these results, you could also have an automated
"first code review" that decides whether a PR should be closed based on the pytest coverage or other findings.
Start small and improve over time!
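
As one possible starting point for the automated "first code review" mentioned above, here is a minimal sketch of a coverage gate. It assumes the report was produced by coverage.py's `coverage json` command, whose output I expect to contain a `totals` section with a `percent_covered` value; the function name and the 80% threshold are arbitrary choices for the example.

```python
def review_decision(coverage_report, minimum_percent=80.0):
    """Return (approved, percent) for a parsed coverage report (hypothetical helper).

    Assumes the dictionary layout of a `coverage json` report:
    {"totals": {"percent_covered": <float>, ...}, ...}
    """
    percent = coverage_report["totals"]["percent_covered"]
    return percent >= minimum_percent, percent
```

A build step could load the JSON file, call this function, and fail the build or comment on the PR when the first value is `False`.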