From fbb594befa97cb40d1aca622334ce707516c125b Mon Sep 17 00:00:00 2001 From: Jason Andrews Date: Fri, 18 Oct 2024 17:46:47 +0000 Subject: [PATCH] Review MLOps on GitHub Actions Learning Path --- .../gh-runners/_index.md | 21 ++-- .../gh-runners/_review.md | 2 +- .../gh-runners/background.md | 51 +++++++--- .../gh-runners/compare-performance.md | 43 ++++++--- .../gh-runners/deploy.md | 90 +++++++++++++----- .../gh-runners/e2e-workflow.md | 30 ++++-- .../gh-runners/train-test.md | 86 ++++++++++++----- .../gh-runners/workflows.md | 95 +++++++++++++++---- 8 files changed, 311 insertions(+), 107 deletions(-) diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md index f633457023..559be9fc30 100644 --- a/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/_index.md @@ -1,27 +1,30 @@ --- title: MLOps with Arm-hosted GitHub Runners +draft: true +cascade: + draft: true minutes_to_complete: 30 who_is_this_for: This is an introductory topic for software developers interested in automation for machine learning (ML) tasks. learning_objectives: - - Set up an Arm-hosted GitHub runner - - Train and test a PyTorch ML model with the German Traffic Sign Recognition Benchmark (GTSRB) dataset on Arm - - Use PyTorch compiled with OpenBLAS and oneDNN with Arm Compute Library to compare the performance of your trained model - - Containerize the model and push your container to DockerHub - - Automate all the steps in the ML workflow using GitHub Actions - + - Set up an Arm-hosted GitHub runner. + - Train and test a PyTorch ML model with the German Traffic Sign Recognition Benchmark (GTSRB) dataset. + - Use PyTorch compiled with OpenBLAS and oneDNN with Arm Compute Library to compare the performance of a trained model. + - Containerize the model and push the container to DockerHub. 
+ - Automate all the steps in the ML workflow using GitHub Actions. prerequisites: - - A GitHub account with access to Arm-hosted GitHub runners - - Some familiarity with ML and continuous integration and deployment (CI/CD) concepts is assumed + - A GitHub account with access to Arm-hosted GitHub runners. + - A Docker Hub account for storing container images. + - Some familiarity with ML and continuous integration and deployment (CI/CD) concepts. author_primary: Pareena Verma, Annie Tallund ### Tags skilllevels: Introductory -subjects: CI/CD +subjects: CI-CD armips: - Neoverse tools_software_languages: diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md index 79dbe14e31..393f658cae 100644 --- a/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/_review.md @@ -24,7 +24,7 @@ review: - questions: question: > - ACL is integrated into PyTorch by default. + ACL is included in PyTorch. answers: - "True" - "False" diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md index afa4ee9c63..be6d327b3b 100644 --- a/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/background.md @@ -1,5 +1,5 @@ --- -title: Background +title: MLOps background weight: 2 ### FIXED, DO NOT MODIFY @@ -8,19 +8,35 @@ layout: learningpathall ## Overview -In this Learning Path, you will learn how to automate your MLOps workflow using an Arm-hosted GitHub runner and GitHub Actions. You will learn how to train and test a neural network model with PyTorch. You will compare the model inference time for your trained model using two different PyTorch backends. 
You will then containerize your trained model and deploy the container image to DockerHub for easy deployment of your application. +In this Learning Path, you will learn how to automate an MLOps workflow using Arm-hosted GitHub runners and GitHub Actions. + +You will learn how to do the following tasks: +- Train and test a neural network model with PyTorch. +- Compare the model inference time using two different PyTorch backends. +- Containerize the model and save it to DockerHub. +- Deploy the container image and use API calls to access the model. ## GitHub Actions -GitHub Actions is a platform that automates software development workflows, including continuous integration and continuous delivery. Every repository on GitHub has a tab named _Actions_. +GitHub Actions is a platform that automates software development workflows, including continuous integration and continuous delivery. Every repository on GitHub has an `Actions` tab as shown below: ![#actions-gui](images/actions-gui.png) -From here, you can run different _workflow files_ which automate processes that run when specific events occur in your GitHub code repository. You use [YAML](https://yaml.org/) to define a workflow. You specify how a job is triggered, the running environment, and the workflow commands. The machine on which the workflow runs is called a _runner_. +GitHub Actions runs workflow files to automate processes. Workflows run when specific events occur in a GitHub repository. + +[YAML](https://yaml.org/) defines a workflow. + +Workflows specify how a job is triggered, the running environment, and the commands to run. + +The machine running workflows is called a _runner_. ## Arm-hosted GitHub runners -Arm-hosted GitHub runners are a powerful addition to your CI/CD toolkit. They leverage the efficiency and performance of Arm64 architecture, making your build systems faster and easier to scale. 
By using the Arm-hosted GitHub runners, you can optimize your workflows, reduce costs, and improve energy consumption. Additionally, the Arm-hosted runners are preloaded with essential tools, making it easier for you to develop and test your applications.
+Hosted GitHub runners are provided by GitHub so you don't need to set up and manage cloud infrastructure. Arm-hosted GitHub runners use the Arm architecture so you can build and test software without cross-compiling or instruction emulation.
+
+Arm-hosted GitHub runners enable you to optimize your workflows, reduce cost, and improve energy consumption.
+
+Additionally, the Arm-hosted runners are preloaded with essential tools, making it easier for you to develop and test your applications.
 
 Arm-hosted runners are available for Linux and Windows. This Learning Path uses Linux.
 
@@ -28,9 +44,11 @@ Arm-hosted runners are available for Linux and Windows. This Learning Path uses
 You must have a Team or Enterprise Cloud plan to use Arm-hosted runners.
 {{% /notice %}}
 
-Getting started with Arm-hosted GitHub runners is straightforward. Follow [these steps to create a Linux Arm-hosted runner within your organization](/learning-paths/cross-platform/github-arm-runners/runner/#how-can-i-create-an-arm-hosted-runner).
+Getting started with Arm-hosted GitHub runners is straightforward. Follow the steps in [Create a new Arm-hosted runner](/learning-paths/cross-platform/github-arm-runners/runner/#how-can-i-create-an-arm-hosted-runner) to create a runner in your organization.
 
-Once you have created the runner within your organization, you can use the `runs-on` syntax in your GitHub Actions workflow file to execute the workflow on Arm. Shown here is an example workflow that executes on your Arm-hosted runner named `ubuntu-22.04-arm`:
+Once you have created the runner, use the `runs-on` syntax in your GitHub Actions workflow file to execute the workflow on Arm. 
+
+Below is an example workflow that executes on an Arm-hosted runner named `ubuntu-22.04-arm-os`:
 
 ```yaml
 name: Example workflow
@@ -45,14 +63,25 @@ jobs:
       run: echo "This line runs on Arm!"
 ```
 
-This setup allows you to take full advantage of the Arm64 architecture's capabilities. Whether you are working on cloud, edge, or automotive projects, these runners provide a versatile and robust solution.
 
 ## Machine Learning Operations (MLOps)
 
-With machine learning use-cases evolving and scaling, comes an increased need for reliable workflows to maintain them. There are many regular tasks that can be automated in the ML lifecycle. Models need to be re-trained, while ensuring they still perform at their best capacity. New training data needs to be properly stored and pre-processed, and the models need to be deployed in a good production environment. Developer Operations (DevOps) refers to good practices for CI/CD. The domain-specific needs for ML, combined with state of the art DevOps knowledge, created the term MLOps.
+Machine learning use-cases need reliable workflows to maintain performance and quality.
+
+There are many tasks that can be automated in the ML lifecycle:
+- Model training and re-training
+- Model performance analysis
+- Data storage and processing
+- Model deployment
+
+Developer Operations (DevOps) refers to good practices for collaboration and automation, including CI/CD. The domain-specific needs for ML, combined with DevOps knowledge, create the new term MLOps.
 
 ## German Traffic Sign Recognition Benchmark (GTSRB)
 
-In this Learning path, you will train and test a PyTorch model for use in Traffic Sign recognition. You will use the GTSRB dataset to train the model. The dataset is free to use under the [Creative Commons](https://creativecommons.org/publicdomain/zero/1.0/) license. It contains thousands of images of traffic signs found in Germany. 
Thanks to the availability and real-world connection, it has become a well-known resource to showcase ML applications. Additionally, given that it is a benchmark, you can apply it in a MLOps context to compare model improvements. This makes it a great candidate for this Learning Path, where you compare the performance of your trained model using two different PyTorch backends.
+This Learning Path explains how to train and test a PyTorch model to perform traffic sign recognition.
+
+You will learn how to use the GTSRB dataset to train the model. The dataset is free to use under the [Creative Commons](https://creativecommons.org/publicdomain/zero/1.0/) license. It contains thousands of images of traffic signs found in Germany. It has become a well-known resource to showcase ML applications.
+
+The GTSRB dataset is also good for comparing the performance and accuracy of different models, and for comparing and contrasting different PyTorch backends.
 
-Now that you have an overview, in the following sections you will learn how to setup an end-to-end MLOps workflow using the Arm-hosted GitHub runners.
+Continue to the next section to learn how to set up an end-to-end MLOps workflow using Arm-hosted GitHub runners.
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md
index 49ea0fb015..bd07a9718e 100644
--- a/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md
+++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/compare-performance.md
@@ -1,27 +1,34 @@
 ---
-title: Modify test workflow and compare performance
+title: Compare the performance of PyTorch backends
 weight: 5
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-Continuously monitoring the performance of your machine learning models in production is crucial to maintaining their effectiveness over time. 
The performance of your ML model can change due to various factors ranging from data-related issues to model-specific and environmental factors. +Continuously monitoring the performance of your machine learning models in production is crucial to maintaining effectiveness over time. The performance of your ML model can change due to various factors ranging from data-related issues to environmental factors. -In this section, you will change the PyTorch backend being used to test the trained model. You will learn how to measure and continuously monitor the inference performance with your workflow. +In this section, you will change the PyTorch backend being used to test the trained model. You will learn how to measure and continuously monitor the inference performance using your workflow. ## OneDNN with Arm Compute Library (ACL) -In the previous section, you used the PyTorch 2.3.0 Docker Image compiled with OpenBLAS from DockerHub to run your testing workflow. PyTorch can be run with other backends as well. You will now modify the testing workflow to use PyTorch 2.3.0 Docker Image compiled with OneDNN and the Arm Compute Library. +In the previous section, you used the PyTorch 2.3.0 Docker Image compiled with OpenBLAS from DockerHub to run your testing workflow. PyTorch can be run with other backends. You will now modify the testing workflow to use PyTorch 2.3.0 Docker Image compiled with OneDNN and the Arm Compute Library. -The [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary) is a collection of low-level machine learning functions optimized for Arm's Cortex-A and Neoverse processors, and the Mali GPUs. The Arm-hosted GitHub runners use Arm Neoverse CPUs, which makes it possible to optimize your neural networks to take advantange of the features available on the runners. ACL implements kernels (which you may know as operators or layers), which uses specific instructions that run faster on AArch64. 
-ACL is integrated into PyTorch through the [oneDNN engine](https://github.com/oneapi-src/oneDNN). +The [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary) is a collection of low-level machine learning functions optimized for Arm's Cortex-A and Neoverse processors and Mali GPUs. Arm-hosted GitHub runners use Arm Neoverse CPUs, which make it possible to optimize your neural networks to take advantage of processor features. ACL implements kernels (also known as operators or layers), using specific instructions that run faster on AArch64. + +ACL is integrated into PyTorch through [oneDNN](https://github.com/oneapi-src/oneDNN), an open-source deep neural network library. ## Modify the test workflow and compare results -Two different PyTorch docker images for Arm Neoverse CPUs are available on [DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse). Up until this point, you used the `r24.07-torch-2.3.0-openblas` container image in your workflows. You will now update `test_model.yml` to use the `r24.07-torch-2.3.0-onednn-acl` container image instead. +Two different PyTorch docker images for Arm Neoverse CPUs are available on [DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse). + +Up until this point, you used the `r24.07-torch-2.3.0-openblas` container image to run workflows. The oneDNN container image is also available to use in workflows. These images represent two different PyTorch backends which handle the PyTorch model execution. + +### Change the Docker image to use oneDNN -Open and edit `.github/workflows/test_model.yml` in your browser. Update the `container.image` parameter to `armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl` and save the file: +In your browser, open and edit the file `.github/workflows/test_model.yml`. 
+ +Update the `container.image` parameter to `armswdev/pytorch-arm-neoverse:r24.07-torch-2.3.0-onednn-acl` and save the file by committing the change to the main branch: ```yaml jobs: @@ -34,9 +41,17 @@ jobs: # Steps omitted ``` -Trigger the Test Model job again by clicking the Run workflow button on the Actions tab. +### Run the test workflow + +Trigger the **Test Model** job again by clicking the `Run workflow` button on the `Actions` tab. + +The test workflow starts running. -Expand the Run testing script step from your Actions tab. You should see a change in the performance results with OneDNN and ACL kernels being used. +Navigate to the workflow run on the `Actions` tab, click into the job, and expand the **Run testing script** step. + +You see a change in the performance results with OneDNN and ACL kernels being used. + +The output is similar to: ```output Accuracy of the model on the test images: 90.48% @@ -55,8 +70,10 @@ Accuracy of the model on the test images: 90.48% aten::addmm 8.50% 558.000us 8.71% 572.000us 286.000us 2 --------------------------------- ------------ ------------ ------------ ------------ ------------ ------------ Self CPU time total: 6.565ms - ``` -For the ACL results, observe that the **Self CPU time total** is lower compared to the OpenBLAS run in the previous section. The names of the layers have changed as well, where the `aten::mkldnn_convolution` is the kernel optimized to run on Aarch64. That operator is the main reason our inference time is improved, made possible by using ACL kernels. -In the next section, you will learn how to automate the deployment of your trained and tested model. +For the ACL results, notice that the **Self CPU time total** is lower compared to the OpenBLAS run in the previous section. + +The names of the layers have also changed, where the `aten::mkldnn_convolution` is the kernel optimized to run on the Arm architecture. 
That operator is the main reason the inference time is improved, made possible by using ACL kernels. + +In the next section, you will learn how to automate the deployment of your model. diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md index 29fa3ef543..8710299fdf 100644 --- a/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/deploy.md @@ -1,17 +1,20 @@ --- -title: Deploy the application +title: Deploy the application as a container weight: 6 ### FIXED, DO NOT MODIFY layout: learningpathall --- -After your model has been trained and validated using the GitHub Actions workflows, your next step is to deploy the model into a production environment. -In this section, you will walk through the steps to containerize your trained model and push the container image to DockerHub. +After your model has been trained and validated using GitHub Actions workflows, the next step is to deploy the model into a production environment. -## Containerize the Model +In this section, you will containerize your trained model and push the container image to DockerHub. -You will use the Dockerfile included in your repository to create a docker container for the trained model and the scripts to deploy this model. Lets look at the `Dockerfile`: +## Containerize the model + +You can use the Dockerfile included in the repository to create a container for the trained model and the deployment scripts. + +Review the `Dockerfile` to understand how it works: ```console # Use the official PyTorch image as the base image @@ -33,10 +36,20 @@ EXPOSE 8000 # Specify the command to run the model server CMD ["uvicorn", "scripts.serve_model:app", "--host", "0.0.0.0", "--port", "8000"] ``` -The Dockerfile uses the PyTorch image with the ACL backend as the base image for the container. 
The working directory is set to `/app` where the trained model and the scripts to deploy the model are copied. The container runs a FastAPI application (`scripts/serve_model.py`) on port 8000. This script is called by `uvicorn` when the container is run. Uvicorn is a fast, lightweight ASGI (Asynchronous Server Gateway Interface) server, a good fit for serving Python web applications such as this.
 
-## Deploy the trained model using FastAPI
-FastAPI is an easy way to serve your trained model as API. You use a FastAPI application that loads your pre-trained model, accepts image uploads, and makes predictions on the uploaded images as shown in `scripts/serve_model.py`:
+The Dockerfile uses the PyTorch image with the ACL backend as the base image for the container.
+
+The working directory is set to `/app` where the trained model and the scripts to deploy the model are copied.
+
+The container runs an application (`scripts/serve_model.py`) on port 8000. This script is called by `uvicorn` when the container is run. Uvicorn is a fast, lightweight ASGI (Asynchronous Server Gateway Interface) server, and is good for serving Python web applications. More information about the application is provided in the next section.
+
+## Serve the model using FastAPI
+
+[FastAPI](https://fastapi.tiangolo.com/) is an easy way to serve your trained model as an API.
+
+The Python application using FastAPI and PyTorch loads the pre-trained model, accepts image uploads, and makes predictions about the image.
+
+The code is in the file `scripts/serve_model.py` and is shown below:
 
 ```python
 import torch
@@ -106,10 +119,12 @@ async def predict(file: UploadFile = File(...)):
     except Exception as e:
         raise HTTPException(status_code=500, detail=str(e))
 ```
-## Deploy with GitHub Actions
-You can now automate the deployment of your containerized model on the Arm-hosted runner in GitHub Actions. 
-Navigate to the `.github/workflows` directory and inspect the YAML file named `deploy-model.yml`:
+
+## Build the container image with GitHub Actions
+
+You are now ready to automate the build of your containerized model using GitHub Actions.
+
+Review the workflow file at `.github/workflows/deploy-model.yml`:
 
 ```console
 name: Deploy to DockerHub
@@ -142,45 +157,70 @@ jobs:
       run: |
         docker buildx build --platform linux/arm64 -t ${{ secrets.DOCKER_USERNAME }}/gtsrb-image:latest --push .
 ```
-In this workflow, you build the Docker container for Arm64 architecture and push the container image to DockerHub.
 
-Before you run this workflow, you need your Docker Hub username and a Personal Access Token (PAT). This enables you to automate the login to your Docker Hub account. These credentials are passed to the workflow as secrets.
+This workflow builds the container for the Arm architecture and pushes the container image to DockerHub.
+
+### Configure your DockerHub credentials
+
+Before you run this workflow, you need your Docker Hub username and a Personal Access Token (PAT). This enables GitHub Actions to store the container image in your DockerHub account.
 
-To save your secrets, click on the Settings tab in your new GitHub repository. Expand the Secrets and variables on the left side and click Actions.
+If you don't have a personal access token, log in to [DockerHub](https://hub.docker.com/), click on your profile on the top right, select `Account settings` and then select `Personal access tokens`. Use the `Generate new token` button to create the token.
+
+The credentials are passed to the workflow as secrets.
+
+To save your secrets, click on the `Settings` tab in the GitHub repository. Expand the `Secrets and variables` on the left side and click `Actions`. 
+
+Add two secrets using the `New repository secret` button:
 
 * DOCKER_USERNAME: Your DockerHub username
 * DOCKER_PASSWORD: Your DockerHub Personal Access Token
 
-To run the action, navigate to the Actions tab in your repository. Select Deploy to DockerHub on the left.
+### Run the workflow
+
+Navigate to the `Actions` tab in your repository, and select `Deploy to DockerHub` on the left side.
+
+Use the `Run workflow` drop-down on the right-hand side to click `Run workflow` to start the workflow on the main branch.
+
+The workflow starts running.
+
+## Verify the deployment
 
-Use the Run workflow drop-down on the right-hand side to click Run workflow.
 
-## Verify the Deployment
 
-After the `deploy-model.yml` workflow completes successfully, the docker container image is pushed to your DockerHub repository.
+When the **Deploy to DockerHub** workflow completes, the container image is available in DockerHub and you can run it on any Arm machine.
 
-You can validate this by logging into DockerHub and checking your repository:
+### Confirm the image in DockerHub
+
+Log in to DockerHub and see the newly created container image.
+
+A screenshot showing the new image in DockerHub is below:
 
 ![dockerhub_img](images/dockerhub_img.png)
 
-You can then pull this docker container image on your local machine and start the container:
+### Run the application
+
+To run the application on an Arm machine, you can pull the Docker image and run a container.
+
+Make sure to substitute your DockerHub username in the commands below:
 
 ```console
 docker pull <username>/gtsrb-image
 docker run -d -p 8000:8000 <username>/gtsrb-image
 ```
 
-Now test the application by running a curl command to make a POST request to the predict endpoint using a test image that is included in your container:
+### Test the API with a traffic sign image
+
+Test the application using an image from the repository. Download the test image named `test-img.png` from GitHub by clicking it and using the download button on the right side. 
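You can also exercise the predict endpoint from any HTTP client, not only `curl`. As an illustration, the Python sketch below builds the same multipart POST request using only the standard library; the endpoint URL is an assumption based on the port exposed by the Dockerfile, and a placeholder byte string stands in for the real image:

```python
import urllib.request

def build_predict_request(url: str, image_bytes: bytes, filename: str = "test-img.png"):
    # Assemble a multipart/form-data body by hand, matching what curl -F does.
    boundary = "gtsrb-boundary"
    body = (
        (
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="file"; filename="{filename}"\r\n'
            "Content-Type: image/png\r\n\r\n"
        ).encode()
        + image_bytes
        + f"\r\n--{boundary}--\r\n".encode()
    )
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": f"multipart/form-data; boundary={boundary}"},
        method="POST",
    )

# Placeholder PNG header bytes; in practice read the downloaded test-img.png.
req = build_predict_request("http://localhost:8000/predict/", b"\x89PNG\r\n\x1a\n")
# With the container running, send it with urllib.request.urlopen(req) and
# decode the JSON response body.
```

Sending the request requires the container from the previous step to be listening on port 8000.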
+
+Run the `curl` command below to make a POST request to the predict endpoint using the image:
 
 ```bash
 curl -X 'POST' 'http://localhost:8000/predict/' -H 'accept: application/json' -H 'Content-Type: multipart/form-data' -F 'file=@test-img.png;type=image/png'
 ```
-The output should look like:
+
+The output is:
+
 ```output
 {"predicted_class":6}
 ```
 
-You have now validated that you were able to successfully deploy your application, serve your model as an API and make predictions on a test image.
+You have now validated that you were able to successfully deploy your application, serve your model as an API, and make predictions on a test image.
 
-In the last section, you will learn how to build a complete end-to-end MLOps workflow by combining the individual workflows.
+In the last section, you will learn about the complete end-to-end MLOps workflow by combining the individual workflows.
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md
index f4d13bd410..71a0743264 100644
--- a/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md
+++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/e2e-workflow.md
@@ -6,7 +6,17 @@ weight: 7
 layout: learningpathall
 ---
 
-So far, you have run individual workflows for each task of the ML lifecycle: training, testing, performance monitoring and deployment. With GitHub Actions you can build an end-to-end custom MLOPs workflow that combines and automates these individual workflows all running on your Arm-hosted runner. 
To demonstrate this, the repository contains a workflow in `.github/workflows/train-test-deploy-model.yml` that automates the individual steps:
+So far, you have run three individual workflows covering the tasks in the ML lifecycle:
+- Training
+- Testing
+- Performance monitoring
+- Deployment
+
+With GitHub Actions, you can build an end-to-end custom MLOps workflow that combines and automates the individual workflows.
+
+To demonstrate this, the repository contains a workflow in `.github/workflows/train-test-deploy-model.yml` that automates the individual steps.
+
+Here is the complete **Train, Test, and Deploy Model** workflow file:
 
 ```yaml
 name: Train, Test and Deploy Model
@@ -138,12 +148,20 @@ jobs:
         docker buildx build --platform linux/arm64 -t ${{ secrets.DOCKER_USERNAME }}/gtsrb-image:latest --push .
 ```
 
-These steps should look familiar and now they are put together to curate an end-to-end MLOPs workflow. The training and testing steps are run like before. The output report is saved and parsed to show the compare the performance changes in inference time. The trained model is updated in the repository. The deployment step connects to DockerHub, pushes the containerized model and scripts, which can then be downloaded and run.
+These steps should look familiar and now they are put together to create an end-to-end MLOps workflow.
 
-The steps depend on each other, requiring the previous one to run before the next is triggered. The entire workflow is triggered automatically any time a change is pushed into the main branch of your repository. You can also navigate to the _Train, Test and Deploy_ workflow and trigger it to run manually. The diagram below shows the end-to-end workflow:
+The training and testing steps are run like before. The output report is saved and parsed to compare the performance changes in inference time.
 
-![#e2e-workflow](/images/e2e-workflow.png)
+The trained model is updated in the repository. 
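The parsing of the profiler report lives in the workflow file itself; as a rough illustration of the idea (the report strings and function name below are hypothetical, not taken from the repository), the inference-time line can be pulled out of a profiler table with a few lines of Python:

```python
import re

def self_cpu_total_ms(report: str) -> float:
    """Extract the Self CPU time total (in ms) from a PyTorch profiler table."""
    match = re.search(r"Self CPU time total:\s*([0-9.]+)ms", report)
    if match is None:
        raise ValueError("Self CPU time total not found in report")
    return float(match.group(1))

# Hypothetical report fragments from two runs of the test job:
openblas_report = "Self CPU time total: 8.212ms"
onednn_report = "Self CPU time total: 6.565ms"
print(self_cpu_total_ms(onednn_report) < self_cpu_total_ms(openblas_report))  # True
```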
+
+The deployment step connects to DockerHub and pushes the containerized model and scripts, which can then be downloaded and run.
+
+The steps depend on each other, requiring the previous one to run before the next is triggered. The entire workflow is triggered automatically any time a change is pushed into the main branch of your repository.
+
+Using what you have learned, navigate to the **Train, Test and Deploy** workflow and run it.
+
+The diagram below shows the end-to-end workflow, the relationship between the steps, and the time required to run each step:
+
+![#e2e-workflow](/images/e2e-workflow.png)
 
-All steps of the workflow should complete successfully.
 
-You have now run an MLOps workflow using GitHub Actions with Arm-hosted runners for managing all of the steps in your application's lifecycle.
+You have run an MLOps workflow using GitHub Actions with Arm-hosted runners for managing all of the steps in your ML application's lifecycle.
diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md
index a44640647c..eadd263cfe 100644
--- a/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md
+++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/train-test.md
@@ -1,15 +1,16 @@
 ---
-title: Train and test the neural network model
+title: Understand neural network model training and testing
 weight: 3
 
 ### FIXED, DO NOT MODIFY
 layout: learningpathall
 ---
 
-In this section, you will fork the provided example GitHub repository which contains all the code to complete this learning path. You will then learn how to train and test the neural network model using the provided scripts.
+In this section, you will fork the example GitHub repository containing the project code and inspect the Python code for training and testing a neural network model. 
## Fork the example repository
 
-As you will be making modifications to the example and will run the GitHub Actions workflows within your own copy of the repository, start by forking the example repository.
+
+Get started by forking the example repository.
 
 In a web browser, navigate to the repository at:
@@ -21,17 +22,35 @@ Fork the repository, using the `Fork` button:
 
 ![#fork](/images/fork.png)
 
-Create a fork within a GitHub Enterprise Organization where you have access to the Arm-hosted GitHub runners.
+Create a fork within a GitHub Organization or Team where you have access to Arm-hosted GitHub runners.
+
+{{% notice Note %}}
+If a repository with the same name `gh_armrunner_mlops_gtsrb` already exists in your Organization or Team, you must modify the repository name to make it unique.
+{{% /notice %}}
+
+## Learn about model training and testing
+
+Explore the repository using a browser to get familiar with the code and the workflow files.
+
+{{% notice Note %}}
+No actions are required in the sections below.
+
+The purpose is to provide an overview of the code used for training and testing a PyTorch model on the GTSRB dataset.
+{{% /notice %}}
 
-Lets inspect and walk through the code included in the repository to train and test a NN model on the GTSRB dataset.
 
-### Train model
 
-Within the `scripts` directory, there is a Python script incuded called `train_model.py`. Using this script, you will load the GTSRB dataset, define a neural network, and train the model on the dataset using PyTorch. Lets look at all the steps to train the model in more detail.
+### Model training
+
+In the `scripts` directory, there is a Python script called `train_model.py`. This script loads the GTSRB dataset, defines a neural network, and trains the model on the dataset.
 
-#### Pre-processing
+#### Data pre-processing
 
-First, you need to load the GTSRB dataset to prepare it for training. The GTSRB dataset is built into `torchvision`, which makes loading it easier. You will define the transformations which are used when loading the training data. The transformations are part of the *pre-processing* step, which makes the data uniform and ready to run through the extensive math operations of your ML model. In accordance with best machine learning practices, you will separate the data used for training and testing, to avoid over-fitting the neural network.
+The first section loads the GTSRB dataset to prepare it for training. The GTSRB dataset is built into `torchvision`, which makes loading easier. 
-First, you need to load the GTSRB dataset to prepare it for training. The GTSRB dataset is built into `torchvision`, which makes loading it easier. You will define the transformations which are used when loading the training data. The transformations are part of the *pre-processing* step, which makes the data uniform and ready to run through the extensive math operations of your ML model. In accordance with best machine learning practices, you will separate the data used for training and testing, to avoid over-fitting the neural network. +The transformations used when loading data are part of the pre-processing step, which makes the data uniform and ready to run through the extensive math operations of the ML model. + +In accordance with common machine learning practices, data is separated into training and testing data to avoid over-fitting the neural network. + +Here is the code to load the dataset: ```python transform = transforms.Compose([ @@ -44,9 +63,13 @@ train_set = torchvision.datasets.GTSRB(root='./data', split='train', download=Tr train_loader = DataLoader(train_set, batch_size=64, shuffle=True) ``` -#### Define the model +#### Model creation + +The next step is to define a class for the model, listing the layers used. -The next step is to define a class for the actual model, listing the different layers used. You will define the forward-pass function, which is used at training time to update the weights. Additionally, you will define the loss function and optimizer for the model. +The model defines the forward-pass function used at training time to update the weights. Additionally, the loss function and optimizer for the model are defined. 
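The loss function used is cross-entropy (`nn.CrossEntropyLoss`), which penalizes the model in proportion to how little probability it assigns to the true class. As background, here is the underlying calculation for a single prediction in plain Python (illustrative only; note that PyTorch's `nn.CrossEntropyLoss` first applies softmax to the model's raw scores):

```python
import math

def cross_entropy(probs, true_class):
    """Cross-entropy loss for one example: -log(probability of the true class)."""
    return -math.log(probs[true_class])

# A confident, correct prediction gives a small loss;
# assigning low probability to the true class gives a large loss.
confident = cross_entropy([0.9, 0.05, 0.05], true_class=0)    # ~0.105
unconfident = cross_entropy([0.2, 0.4, 0.4], true_class=0)    # ~1.609
print(confident < unconfident)  # True
```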
+ +Here is the code defining the model: ```python class TrafficSignNet(nn.Module): @@ -72,9 +95,13 @@ criterion = nn.CrossEntropyLoss() optimizer = optim.Adam(model.parameters(), lr=0.001) ``` -#### Training with PyTorch and saving the model +#### Model training and the model file + +A training loop performs the actual training. -A loop is responsible for the actual training, pulling all the steps together. The number of epochs is arbitrarily set to 10 for this example. When the training is finished, the model weights are saved to a `.pth` file format. +The number of epochs is arbitrarily set to 10 for this example. When the training is finished, the model weights are saved to a `.pth` file. + +Here is the code for the training loop: ```python num_epochs = 10 @@ -100,16 +127,24 @@ for epoch in range(num_epochs): torch.save(model.state_dict(), './models/traffic_sign_net.pth') ``` -With this script, you have learnt how to load the GTSRB dataset, define a neural network, train the model on the dataset and save the trained model using PyTorch. -Lets now look at testing this trained model. +You now have an understanding of how to load the GTSRB dataset, define a neural network, train the model on the dataset, and save the trained model. + +The next step is testing the trained model. + +### Model testing + +The `test_model.py` Python script in the `scripts` directory verifies how accurately the ML model classifies traffic signs. + +It uses the PyTorch profiler to measure the CPU performance in terms of execution time. The profiler measures the model inference time when different PyTorch backends are used to test the model. -### Test model +#### Model loading and testing data -The `test_model.py` Python script included in the `scripts` directory verifies how accurately the ML model you have trained, can classify the traffic signs. It uses the PyTorch profiler to measure the CPU performance in terms of execution time. 
Using the profiler, you will be able to measure the model inference time when you use two different PyTorch backends to test the model.
+Testing is done by loading the model that was saved after training and preparing it for evaluation on a test dataset.
-#### Load model and create test set
-You will load the model that was saved after training and prepare it for evaluation on a test dataset. Just like training, you will define transformations for the testing data and load it from the GTSRB dataset.
+
+As in training, transformations are used to load the test data from the GTSRB dataset.
+
+Here is the code to load the model and the test data:
```python
model_path = args.model if args.model else './models/traffic_sign_net.pth'
@@ -128,8 +163,13 @@ test_set = torchvision.datasets.GTSRB(root='./data', split='test', download=True
test_loader = DataLoader(test_set, batch_size=64, shuffle=False)
```
-#### Testing and display profiling results
-The testing snippet loops through the test data, passing each batch through the model and compares predictions to the actual labels to calculate accuracy. The accuracy is calculated as a percentage of correctly classified images. Both the accuracy and PyTorch profiler report is printed at the end of the script.
+#### Testing loop and profiling results
+
+The testing loop passes each batch of test data through the model and compares predictions to the actual labels to calculate accuracy.
+
+The accuracy is calculated as a percentage of correctly classified images. Both the accuracy and the PyTorch profiler report are printed at the end of the script.
+
+Here is the testing loop with profiling:
```python
correct = 0
@@ -148,4 +188,6 @@ print(f'Accuracy of the model on the test images: {100 * correct / total:.2f}%')
print(prof.key_averages().table(sort_by="cpu_time_total", row_limit=10))
```
-You should now have an overview of the code for training and testing the model on the GTSRB dataset using PyTorch.
In the next section, you will learn how to setup the GitHub Actions workflows to automate running both the training and testing scripts on your Arm-hosted GitHub runner. +You now have a good overview of the code for training and testing the model on the GTSRB dataset using PyTorch. + +In the next section, you will learn how to use GitHub Actions workflows to run the training and testing scripts on an Arm-hosted GitHub runner. diff --git a/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md b/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md index de1918ce7a..c384142530 100644 --- a/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md +++ b/content/learning-paths/servers-and-cloud-computing/gh-runners/workflows.md @@ -1,18 +1,26 @@ --- -title: Automate training and testing +title: Automate training and testing with GitHub Actions weight: 4 ### FIXED, DO NOT MODIFY layout: learningpathall --- -## GitHub Actions workflows +## Run GitHub Actions workflows -### Train Model +In this section, you will use GitHub Actions to run the training and testing scripts on an Arm-hosted GitHub runner. -In this section, you will automate the training step by executing the _Train Model_ workflow (.github/workflows/train-model.yml) on your Arm-hosted GitHub runner using GitHub Actions. This workflow pulls a [PyTorch 2.3.0 Docker Image compiled with OpenBLAS from DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse), and runs the training script `scripts/train_model.py` within that container. The model that is trained on the GTSRB dataset using this script is saved as an artifact of the workflow. +### Train the model -Inspect the _Train Model_ workflow by opening up the `.github/workflows/train-model.yml` file within your fork: +GitHub Actions are defined by workflows in the `.github/workflows` directory of a project. + +The workflow at `.github/workflows/train-model.yml` automates the model training. 
+
+The training workflow uses a [PyTorch 2.3.0 Docker Image compiled with OpenBLAS from DockerHub](https://hub.docker.com/r/armswdev/pytorch-arm-neoverse) and runs the script at `scripts/train_model.py` in the container.
+
+When training is complete, the model is saved for future use as an artifact of the workflow.
+
+Review the **Train Model** workflow by opening the file `.github/workflows/train-model.yml` from your fork in your browser:
 
```yaml
name: Train Model
@@ -39,17 +47,28 @@ jobs:
        path: ${{ github.workspace }}/models/traffic_sign_net.pth
        retention-days: 5
```
-This workflow specifies one job named "Train the model". This job runs in a runner environment specified by `runs-on`. The `runs-on: ubuntu-22.04-arm-os` points to the Arm-hosted GitHub runner you setup in the first section.
-Now, navigate to the _Train Model_ workflow under the _Actions_ tab, and press the _Run workflow_ button.
+
+The workflow specifies one job named **Train the Model**.
+
+The job runs in the runner environment specified by `runs-on`. The `runs-on: ubuntu-22.04-arm-os` points to the Arm-hosted GitHub runner you set up in the first section.
+
+### Run the training workflow
+
+Navigate to the **Train Model** workflow under the `Actions` tab.
+
+Press the `Run workflow` button and run the workflow on the main branch.
 
 ![Train_workflow](/images/train_run.png)
 
-You will see the workflow running and it should complete succesfully. Click on the workflow to see the output from each step of the workflow.
+The workflow starts running. It takes about 8 minutes to complete.
+
+Click on the workflow to see the output from each step of the workflow.
 
 ![Actions_train](/images/actions_train.png)
 
-If you expand on the "Run training script" step, you should see the training loss per epoch followed by `Finished Training`.
+Expand the `Run training script` step to see the training loss per epoch followed by `Finished Training`.
+
+The output is similar to:
 
```output
(...)
@@ -64,17 +83,20 @@ Epoch [10/10], Step [300/417], Loss: 0.0208
Epoch [10/10], Step [400/417], Loss: 0.0152
Finished Training
```
-Confirm that the model has been generated as an artifact in the job's overview.
+
+Confirm the model is generated and saved as an artifact in the job's overview.
 
 ![#artifact](/images/artifact.png)
 
-This trained model artifact is used to run the next step: testing the model.
+This trained model artifact is used in the next step.
 
-### Test Model
+### Test the model
 
-Similar to `train_model.py`, there is a workflow called `test-model.yml` which automates running the `test_model.py` script on your Arm-hosted runner. The test job downloads the artifact generated by the training job in the previous step, and runs the inference using PyTorch with OpenBLAS backend from yoru specified container image.
+The next workflow, called `test-model.yml`, automates running the `test_model.py` script on the Arm-hosted runner.
 
-Inspect the _Test Model_ workflow by opening up the `.github/workflows/test-model.yml` file within your fork:
+The test job downloads the artifact generated by the training workflow in the previous step, and runs the inference using PyTorch with the OpenBLAS backend from the specified container image.
+
+Review the **Test Model** workflow by opening the file `.github/workflows/test-model.yml` in your browser:
 
```yaml
name: Test Model
@@ -103,21 +125,54 @@ jobs:
```
 
-In order to use the model created by your _Train Model_ job, you will edit the _Test Model_ workflow file.
+### Run the testing workflow
+
+{{% notice Note %}}
+The `test-model.yml` file needs to be edited to use the saved model from the training run.
+{{% /notice %}}
+
+#### Modify the workflow file
-Open the training job which uploaded the ML model as an artifact by navigating to the _Actions_ tab on your GitHub repository. Choose the _Train Model_ under _All workflows_, and select the latest successful job. The URL from here contains an 11-digit number.
Note down this number, which is the _run ID_ from training your model. Open `.github/workflows/test-model.yml`, and update the `run-id` parameter. Save the changes to the file.
+Complete the steps below to modify the testing workflow file:
+
+1. Navigate to the `Actions` tab on your GitHub repository.
+
+2. Click on `Train Model` on the left side of the page.
+
+3. Click on the completed `Train Model` workflow.
+
+4. Copy the 11-digit ID number from the end of the URL in your browser address bar.
 
 ![#run-id](/images/run-id.png)
 
-Trigger the _Test Model_ job by clicking the _Run workflow_ button on the _Actions_ tab.
+5. Navigate back to the `Code` tab and open the file `.github/workflows/test-model.yml`.
+
+6. Click the Edit button, represented by a pencil on the top right of the file contents.
+
+7. Update the `run-id` parameter with the 11-digit ID number you copied.
+
+8. Save the file by clicking the `Commit changes` button.
+
+
+#### Run the workflow file
+
+You are now ready to run the **Test Model** workflow.
+
+1. Navigate to the `Actions` tab and select the **Test Model** workflow on the left side.
+
+2. Click the `Run workflow` button to run the workflow on the main branch.
 
 ![#run-workflow](/images/run-workflow.png)
 
-You will see the workflow running and it should complete succesfully. Click on the workflow to view the output from each step.
+The workflow starts running.
+
+Click on the workflow to view the output from each step.
 
 ![Actions_test](/images/actions_test.png)
 
-If you click on the "Run testing script" step, you should see the accuracy of the model and a table of the results printed from the PyTorch profiler. The output from testing the model with PyTorch using OpenBLAS should look similar to:
+Click on the "Run testing script" step to see the accuracy of the model and a table of the results from the PyTorch profiler.

+The output is similar to:
 
```output
Accuracy of the model on the test images: 90.48%
@@ -138,4 +193,4 @@ Accuracy of the model on the test images: 90.48%
Self CPU time total: 14.141ms
```
-In the next section, you will learn how to modify the testing workflow to compare the inference performance of the model using PyTorch with a different backend.
+In the next section, you will learn how to modify the testing workflow to compare the inference performance using PyTorch with two different backends.
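As an aside, the accuracy figure reported in the output above is simply the percentage of predictions that match the ground-truth labels. Here is a standalone plain-Python sketch of that calculation, using made-up predicted classes and labels (illustrative only, not repository code):

```python
def accuracy(predictions, labels):
    """Percentage of predictions matching the true labels, as in the testing loop."""
    correct = sum(p == t for p, t in zip(predictions, labels))
    return 100 * correct / len(labels)

# Hypothetical predicted classes vs. ground-truth labels for five images.
preds = [3, 14, 14, 7, 1]
labels = [3, 14, 27, 7, 1]
print(f"Accuracy: {accuracy(preds, labels):.2f}%")  # Accuracy: 80.00%
```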