
Are you like me, a Senior Data Scientist, wanting to learn more about how to approach DevOps, specifically when you are using Databricks (workspaces, notebooks, libraries, etc.)? Set up using @Azure @databricks


Azure Databricks - Python & Dev Ops

Edited by Laura Edell on April 4, 2019 & Updated June 19, 2019

Hands on Lab – Abstract

This hands-on lab is designed for the scenario where a team of scientists and engineers is responsible for the development, maintenance, and quality of analytical models that are made available to other teams for consumption.

Infrastructure Set Up

This section covers all of the infrastructure for the HOL, spanning Azure DevOps and Azure resources, which must be set up prior to the lab.

Azure Resource Creation

These steps should be completed ahead of time; doing so also makes it easier to keep or clean up the resources after the workshop.

Step 1: Create Resource Groups - Begin by creating 3 resource groups:

• [some-name]-db-dev
• [some-name]-db-stage
• [some-name]-db-prod

Adding Resources to Resource Groups

Keep in mind that the three resource groups ([some-name]-db-dev, [some-name]-db-stage, [some-name]-db-prod) will each contain a completely different set of resources.

Add Machine Learning Service Workspace

  1. Select “Add a Resource”.

alt text

  2. Search for “machine learning” and select “Machine Learning service workspace” published by Microsoft. Click Create.

alt text

  3. Populate the fields with a naming convention that makes sense to you. Select the correct resource group and ensure the location pairs with your other services.

alt text

Add Data Lake (Azure Storage gen 2)

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for “Storage”, select “Storage account”, and click “Create”.

alt text

  3. Fill out the creation form. Ensure you are in the correct resource group, give the account a name, ensure the kind is StorageV2, and set the access tier to Cool.

alt text

  4. Click on “Advanced” and ensure “Hierarchical namespace” under “Data Lake Storage Gen2” is set to “Enabled”.

alt text

  5. Select “Create”.

Add Azure Key Vault

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for “key vault” and select “Key Vault” published by Microsoft.

alt text

  3. Populate the creation form. Give it a name that is easy to remember, and ensure both the resource group and the location are the ones you intend.

alt text

Add a Databricks Cluster

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for Databricks and select the offering published by Microsoft. Click “Create”.

alt text

  3. Complete the creation form using [some-name] as the workspace name and the resource group you are operating in as the resource group. Select a location and ensure the pricing tier is “Premium”, since we will be using RBAC controls.

alt text

  4. Navigate back to your resource group and select your newly created workspace.

alt text

  5. Select “Launch Workspace” in the top right – do not use the URL link.

alt text

  6. On the left-hand pane, select “Clusters” and then “Create Cluster”.

alt text

  7. Fill out the creation form. MAKE SURE you select “Terminate after 120 minutes of inactivity” to help reduce accidental usage and billing.

alt text

Add AzureML SDK as Library to Cluster.

  1. From the DataBricks Workspace, click on “Clusters” and then the cluster name.

alt text

  2. Click on the Libraries tab, then Install New, select PyPI, and enter “azureml-sdk”. Click Install.

alt text
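Once the install completes, you can sanity-check the library from a notebook attached to the cluster; a minimal sketch:

```python
# Run in a notebook cell attached to the cluster to confirm the AzureML SDK installed correctly.
import azureml.core

print("AzureML SDK version:", azureml.core.VERSION)
```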

Add Initial Data to Storage

We want to ensure there is some data in the various data lakes so folks can access it.

  1. Download the file: https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv
  2. Select the storage v2 from your resource group.

alt text

  3. Select File Systems from the pane on the left and click "+ File system".

alt text

  4. Name the file system "datalake"
  5. Click on the newly created "datalake" file system

alt text

  1. Select "Download Azure Storage Explorer" if you do not already have it installed.
  2. Once Azure Storage Explorer is installed, open it. Add an Account and Login using your Azure credentials.

alt text

  8. Find your file system inside the account you have now added. Drag and drop AdultCensusIncome.csv into the main pane. Click Refresh if it does not refresh automatically.

alt text

Create Secrets for Secure & Controlled Storage Mounts

We parameterize a few extra values so that the code for mounting data can remain the same regardless of which Databricks cluster it is attached to, and so that access to the data is controlled by a cluster & data lake admin instead. These steps should be completed for each Databricks workspace within each resource group.

Install Azure & Databricks CLI
  1. Ensure Python 3.x is installed. a. If it is not, the easiest way is to install Anaconda. b. https://www.anaconda.com/download/
  2. Install the Databricks CLI. a. Open a cmd prompt and execute the command: “pip install databricks-cli”
  3. Install the Azure CLI. a. Open a cmd prompt and execute the command: “pip install azure-cli”
Configure Databricks CLI for User
Generate a User Access Token
  1. Launch the DataBricks Workspace from the Azure Portal

alt text

  2. In the upper right hand side of the screen, select the user icon.

alt text

  3. Select “User Settings” from the drop down.
  4. Select “Generate a New Token”.

alt text

  5. Give a descriptive comment and remove the token lifetime (this gives permanent token access to the cluster, which is not a best practice).

alt text

  6. Copy the token that gets generated.
Authenticate CLI with Token using Profiles
  1. Use the command: “databricks configure --token --profile [PROFILENAME]” a. HINT: use dev, test & prod, similar to how you named the workspaces and resource groups, to more easily differentiate.
  2. Enter the host URL a. https://eastus.azuredatabricks.net (example) b. Paste the token generated from the previous step
  3. To use the profiles capability, simply pass the --profile flag with the [PROFILENAME] you configured to target the corresponding workspace.
Create Service Principal and Give Access to Data Lake

This section uses an AD Service principal and provides access to the principal for the data.

  1. Log in to the Azure CLI by executing the command: “az login” a. Follow the instructions printed out.
  2. Create a service principal by executing the command: “az ad sp create-for-rbac --name [SOMENAME]” a. Copy the app id b. COPY the password – you will not be able to get it again.
  3. Get the service principal’s object id by executing the command: “az ad sp show --id [AppId]” a. Search through the result and find the value of the property “objectId” b. Copy the objectId.

alt text

  4. Open Azure Storage Explorer and right click the datalake container you created previously. Select “Manage Access”.

alt text

  5. Paste the objectId into the text box and click “Add”.

alt text

  6. Find the object ID in the list, click on it and give it Read, Write & Execute access as well as Default. Click Save.

alt text

  7. Navigate to the Azure Portal and to the ADLS Gen 2 blade for this resource group. Click on Access Control (IAM).

alt text

  8. Click on “Add”, “Add role assignment”.

alt text

  9. Set the role to “Storage Blob Data Contributor”, enter the name of the service principal you created for this resource group, and click Save.

alt text

Create an Azure Key Vault Backed Secret Scope
  1. Navigate to your workspace with the following format: a. https://eastus.azuredatabricks.net/?o=6776691945951303#secrets/createScope b. Replace the number after o= with yours:

alt text

c. Or simply append #secrets/createScope to the end of the URL of your workspace.

  2. Navigate to the key vault for the resource group you are setting up:

alt text

  3. Copy the DNS name.

alt text

  4. Copy the Resource ID.

alt text

  5. Name the scope “data-lake” and set it for “All Users”. Populate the DNS name and resource ID of the key vault, and select “Create”.

alt text

  6. From the Databricks CLI, enter the command: “databricks secrets list-scopes --profile [YOUR PROFILE]”

alt text

Add Secrets to Secret Scope for Accessing Data

You will need the Service Principal’s password and app id from the previous steps.

  1. Get the app’s tenant id by executing the following command: “az ad sp show --id [AppId]” a. Copy the value of “appOwnerTenantId”.
  2. Add the service principal’s tenant id to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-tenant-id --value [TenantId]”
  3. Add the service principal’s app id to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG you are configuring] --name sp-app-id --value [service principal’s app id]”

alt text

  4. Add the service principal’s password to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-password --value [password copied from earlier]”
  5. Add the service principal’s token endpoint a. https://login.microsoftonline.com/YOURAPPOWNERTENANTID/oauth2/token b. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-token-endpoint --value [token endpoint]”
  6. Add the FQDN of the data lake. a. “az keyvault secret set --vault-name [KeyVault for RG] --name datalake-fqdn --value abfss://datalake@YOURSTORAGEACCOUNT.dfs.core.windows.net”
  7. Add the subscription id for the ML service. Navigate to the ML service inside your resource group and copy the subscription id. a. “az keyvault secret set --vault-name [KeyVault for RG] --name subscription-id --value YOURSUBSCRIPTIONID”

alt text

  8. Add the resource group for the ML service. Navigate to the ML service inside your resource group and copy the resource group name. a. “az keyvault secret set --vault-name [KeyVault for RG] --name resource-group --value YOURRGNAME”

alt text

  9. Add the ML service workspace name. Navigate to the ML service within the appropriate resource group and copy the name. a. “az keyvault secret set --vault-name [KeyVault for RG] --name ml-workspace-name --value YOURVALUE”

alt text

  1. Add the "Alg State" This changes per resource group. For the Dev RG, it is "dev", for "stage" it is "stage". If you were to add additional clusters for releases for multi-tenancy it should have a convention to help support that.

    1. "az keyvault secret set --vault-name [KeyVault for RG] --name alg-state --value APPROPRIATEVALUE
  2. Add the "Created By". For now this will simply match "Alg State"'s conventions.

    1. "az keyvault secret set --vault-name [KeyVault for RG] --name created-by --value APPROPRIATEVALUE
  3. Verify secrets are in the data-lake scope for databricks a. “databricks secrets list –scope data-lake”
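As a quick sanity check from the Databricks side, the same secrets can be read in a notebook attached to the cluster through `dbutils` (Databricks redacts the values in notebook output). A minimal sketch, assuming the “data-lake” scope and the secret names created above:

```python
# Read the secrets from the Azure Key Vault-backed "data-lake" scope inside a Databricks notebook.
# Databricks redacts secret values if you try to print them.
app_id         = dbutils.secrets.get(scope="data-lake", key="sp-app-id")
tenant_id      = dbutils.secrets.get(scope="data-lake", key="sp-tenant-id")
token_endpoint = dbutils.secrets.get(scope="data-lake", key="sp-token-endpoint")
datalake_fqdn  = dbutils.secrets.get(scope="data-lake", key="datalake-fqdn")

# List the secret names (never the values) to confirm everything landed in the scope.
print([s.key for s in dbutils.secrets.list("data-lake")])
```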

Azure Dev Ops – Creation

This section covers creating a project in Azure Dev Ops for the workshop.

  1. Navigate to https://dev.azure.com
  2. Select the organization you intend to use OR create a new organization.
  3. Create a new project. Pick a name and description. Select “Git” for version control and “Agile” for the work item process.

alt text

  4. Invite additional users.

alt text

alt text

  5. Click on Repos, Files.
  6. At the very bottom, select “Initialize Repo”.
Scientists – Initial Setup

Configure Azure DevOps Integration: in Azure Databricks, set your Git provider to Azure DevOps Services on the User Settings page:

  1. Click the User icon at the top right of your screen and select User Settings.

alt text

  2. Click the Git Integration tab.
  3. Change your provider to Azure DevOps Services.

alt text

Create & Link Project File w/ Repo
  1. From inside the Databricks workspace interface, select Workspace, then Shared, then the drop-down, then Create, and create a “Folder”.

alt text

  2. Name the folder “Project_One”.
  3. Create a new file inside the project called “train_model”.

alt text

  4. Link the “train_model.py” file to your Azure DevOps repository. a. Copy the git link from your Azure DevOps portal:

alt text

b. Paste into the “link” location in the popup for “Git Preferences” c. Create a new branch. Name it your unique user ID d. Use “Project_One/notebooks/train_model.py” as the path in git repo.

alt text

Dev Loop Experience

The dev loop experience encompasses mounting the dev data, exploring that data, training a model, writing the inference code, building a dev container, and running tests inside that container.

Train the world’s worst regression & Stage for inference coding.

  1. Copy the code from Project_One/notebooks/train_model.py into the train_model.py you created earlier in Databricks.
  2. The proctor will step through exactly what the code is doing and why. In essence (see the sketch below):
    1. The pre-created secrets are used to mount the various stores securely, which allows zero code changes as the algorithm progresses across secure environments.
    2. You train a deliberately simple algorithm and register the resulting model files with the Azure ML service, bridging the divide between Databricks and the inference code. This process is ML-framework independent and can be reused across algorithms, frameworks, etc.
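The real code lives in Project_One/notebooks/train_model.py; the sketch below only illustrates the two ideas above (secret-driven mounting and model registration). The mount point and model names are placeholders, not necessarily what the lab's notebook uses.

```python
# Illustrative sketch only -- the lab's actual code is in Project_One/notebooks/train_model.py.

# 1) Mount the data lake using the service principal secrets from the "data-lake" scope.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("data-lake", "sp-app-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("data-lake", "sp-password"),
    "fs.azure.account.oauth2.client.endpoint": dbutils.secrets.get("data-lake", "sp-token-endpoint"),
}
if not any(m.mountPoint == "/mnt/datalake" for m in dbutils.fs.mounts()):  # placeholder mount point
    dbutils.fs.mount(
        source=dbutils.secrets.get("data-lake", "datalake-fqdn"),
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )
df = spark.read.csv("/mnt/datalake/AdultCensusIncome.csv", header=True, inferSchema=True)

# 2) After training, register the resulting model files with the Azure ML workspace so the
#    inference project can retrieve the exact same artifact later.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace(
    subscription_id=dbutils.secrets.get("data-lake", "subscription-id"),
    resource_group=dbutils.secrets.get("data-lake", "resource-group"),
    workspace_name=dbutils.secrets.get("data-lake", "ml-workspace-name"),
)
Model.register(workspace=ws, model_path="model.pkl", model_name="income_regression")  # placeholder names
```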

Inference Coding

This section extends from having a trained model to now building an inference container which is reflective of the asset we will deliver to our customer base.

Code Structure

alt text

Good code structure from the beginning is a great way to ensure you are set up well. In this case we follow well-defined development strategies via a hybrid between .NET and Python project structures.

We have two folders for the project: Project_One is the primary inference project, and Project_One-Tests contains its tests.

Git Pull the train code

  1. Open a cmd prompt.
  2. Change directory into the root of where your project is.
  3. Execute the commands: a. “git checkout [your branch]” b. “git pull”

Test Driven Development

Write a Test

You should always start by writing tests and then write code to satisfy those tests. The only code you will be required to write is in test_model.py; the facilitation code is provided for you.

alt text

Inside this file we will write a very simple unit test to ensure that the x_scaler object is populated during model initialization.

  1. An example unit test has already been written. Add one more unit test to Project_One-Tests/test_model.py (a sketch follows this list).
  2. The facilitation code follows standard pytest rules, so you can even add more test files; just follow pytest conventions.
  3. The proctor will run through how the project works.
    1. Project_One is the project code, which separates the inference code into a "provider" type class, following principles similar to those of the testable web-dev space.
    2. Project_One-Tests is your separated testing code, so that it is not coupled with your app development code.
    3. A container is built for the inference code, which is then extended with the test code. The base inference container is the asset expected to be deployed, while the extended testing container allows you to test the asset in the same form as it will be shipped.
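For reference, a second test might look roughly like the sketch below. The `inference_code.model_provider` import and the `ModelProvider` class name are assumptions made for illustration; match whatever the facilitation code actually exposes.

```python
# Illustrative pytest sketch -- adjust the import and class name to match the facilitation code.
import pytest

from inference_code.model_provider import ModelProvider  # hypothetical module/class name


@pytest.fixture(scope="module")
def provider():
    p = ModelProvider()
    p.init()  # loads the registered model artifacts, including the feature scaler
    return p


def test_x_scaler_is_populated_after_init(provider):
    # If the scaler is not loaded during initialization, inputs cannot be normalized at scoring time.
    assert provider.x_scaler is not None
```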

Review Inference Code

Normally we would test and ensure the tests fail before writing the inference code; however much of the code is already written, so we will simply review it.

alt text

In an ideal world, the only code you would need to worry about is highlighted in red. The current state of tooling as of today is why the other code exists and is not wrapped up as ADO Tasks or VS Code extensions.

The proctor will run through the code with you, but essentially:

  1. ./Project_One/score.py is what the Azure ML SDK expects as the interface and must be populated with an init() and a run(params). The params are what is received in the HTTP request body (or IoT Edge message over the route).
  2. The code placed in inference_code helps ensure code coverage is reported appropriately. We follow a provider-type structure similar to web dev when there is a pre-defined functional interface. The objective is to minimize that footprint to one line of code each in score.py's init and run (see the sketch below).
  3. The rest of the code is a dockerized build process that runs independently of the dependencies installed on your system, so that the build on your machine is the same as the build on the build server. This improves confidence that the locally generated and tested asset will match the asset that may eventually be promoted to production.
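As a rough illustration of that contract (the provider module and class names below are placeholders, not necessarily what the repo uses), score.py reduces to a thin shim:

```python
# Rough illustration of the score.py interface expected by the Azure ML SDK.
# Goal from the notes above: keep init() and run() to a single delegating line each.
from inference_code.model_provider import ModelProvider  # placeholder name

_provider = ModelProvider()


def init():
    # Called once when the container starts; loads the model files registered in Azure ML.
    _provider.init()


def run(params):
    # Called per HTTP request (or IoT Edge message); params carries the request body.
    return _provider.run(params)
```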

We now have inference code with matching train code. Let’s build the inference container and test it.

Build Inference Container

  1. First open runbuild_local.cmd. a. Modify the environment variables to match the dev environment. These will remain constant for this algorithm and your local environment: i. Subscription_id ii. Ml_resource_group iii. Ml_workspace_name iv. Ml_alg_author. Then, from the command prompt:
  2. Change directory into the Project_One folder.
  3. Run the runbuild_local.cmd a. You may need to execute az login prior to executing this command or be interactively logged in (watch the output)

alt text

c. This will execute a bunch of stuff and be on “Creating image” for a while. Occasionally hit enter to see if the cmd prompt output is up to date or not.

alt text

Test Inference Container

  1. Change directory into the Project_One-Tests folder.
  2. Run the runtests_local.cmd file
  3. This will extend the container you created in the previous step, run your unit tests, and check your code coverage. The code coverage results can be found in C:/ml_temp/artifacts/test_results. These are standard pytest and pytest-cov result outputs.

alt text

  4. Click on index.html from the cov_html folder.

alt text

  5. We have 68% code coverage; it could be worse.

Commit & Pull Request.

  1. We now know that we have an inference container, that it passes our unit tests, and that our code coverage is at a point we are happy with.
  2. From the command prompt, change directory to the root of the repository.
  3. Execute the following commands to push the changes from your branch: a. git add ./ b. git commit -m “works” c. git push
  4. Create a pull request by going to your ADO site, under Repos, Pull requests, New Pull Request.

alt text

  5. Populate the request template and ensure you have a reviewer:

alt text

  6. Review the changes with the reviewer you selected. Ensure both of you enter ADO and hit “Approve” and then “Complete”. If you see problems in your peer's code, add comments and reject it. Once both reviewers approve, you can complete the PR. This will launch the build and release stages which are connected to master.

Defining your Build stage

Since we are targeting a different Azure Databricks environment than the one used in the local dev loop described earlier in this document, and since we are concerned with security, we will create a library asset that allows us to define secrets from a key vault pointing to this new environment. These secrets become available as variables in the build stage. Variables give you a convenient way to get key bits of data into various parts of the stage. As the name suggests, the value of a variable may change from run to run or job to job of your stage. Almost any place where a stage requires a text string or a number, you can use a variable instead of hard-coding a value. The system will replace the variable with its current value during the stage's execution.

Creating a Variable Group

  1. In your Azure DevOps Subscription navigate to the Library Menu Item and click + Variable Group

alt text

  2. Name your variable group as indicated, select the Azure Subscription and KeyVault that you wish to target, and toggle the “Link secrets from an Azure key vault as variables” switch to the on position.

alt text

  3. Click the + Add button, select the variables that you want to make available to the stage, click OK, and then Save to make sure that your changes are persisted to your Azure DevOps instance.

alt text

Create a Build stage in the Visual Designer

The intention of this step is to create an Azure DevOps stage that mimics the steps from the local build loop but targets a different Azure Databricks environment for the training. The connection details of this environment will not be available to the scientists directly and will be managed by the operations team. This stage will execute when a PR to master is approved and completed.

  1. In your Azure DevOps tenant, navigate to stages -> Builds and click on + New and select New build stage.

alt text

  2. Select your source and make sure to select the master branch, as we want the stage attached to the branch we will be monitoring for pull requests. Click Continue.

alt text

  3. Select Empty Job.

alt text

  4. Name your stage accordingly and select the Hosted Ubuntu 1604 build agent from the Agent Pool.

alt text

  5. Link the variable group that you created earlier by clicking on Variables in the menu bar, followed by Variable groups, and click Link Variable Groups.

alt text

  6. Select the stage Environment Variable Group and click Link. Your stage now has access to all the runtime environment variables needed to connect to the stage environment.

alt text

  7. Click back onto Tasks in the menu and click the + on the Agent Job to add the tasks that you will be configuring for the build process.

alt text

  8. Type “CLI” in the search box and click the Azure CLI “Add” button four times.

alt text

Your Agent Job Step should look like the following when you have completed.

alt text

  9. Repeat Step 8, substituting the search for “CLI” with “Copy”, and add two Copy Files tasks.

alt text

  10. Substitute “Copy” with “Test” and add a Publish Test Results task.

alt text

  11. Substitute “Test” with “Coverage” and add a Publish Code Coverage Results task.

alt text

Your Agent Job should now resemble the following:

alt text

  12. The first Azure CLI task will be used to configure the agent environment and make sure that the required packages are installed to execute the rest of the stage. Provide the task with a descriptive name, select the appropriate Azure subscription, set the Script Location to “Inline Script”, and add the following to the inline script window:
    pip3 install -U setuptools
    python3 -m pip install --upgrade pip
    pip3 install --upgrade azureml-sdk[notebooks]
Set the remainder of the task properties as depicted below:

alt text

alt text

  13. Click on the second Azure CLI task, select the appropriate Azure subscription, and configure it as follows:

alt text

  14. Click on the third Azure CLI task, select the appropriate Azure subscription, and configure the task as follows:

alt text

alt text

  15. Click on the fourth Azure CLI task, select the appropriate Azure subscription, and configure the task as follows:

alt text

alt text

  16. Click on the first Copy Files task and configure the task as follows:

alt text

alt text

  17. Click on the second Copy Files task and configure the task as follows:

alt text

alt text

  18. Click on the Publish Test Results task and configure the task as follows:

alt text

alt text

  19. Click on the Publish Code Coverage Results task and configure the task as follows:

alt text

alt text

  20. On the Agent Job, click the + in order to add a task that will be used to publish the build artifacts for use in a release stage later.

alt text

Search for Publish and Click “Add” on the Publish Build Artifacts Task

alt text

Configure the task as follows:

alt text

alt text

  21. Enable the Continuous Integration trigger on the stage, which makes sure that every time a change is made to the master branch of the repository this stage will execute. Click the Triggers item in the menu bar and click the checkbox to enable continuous integration.

alt text

You can now Save and Queue this stage for a manual build to make sure that it executes from end to end without any issues.

Output from the stage should resemble the following:

alt text

Defining your Release stage

While release stages are often used to deliver artifacts in a deployed state, our scenario calls for a different approach. Our build artifact is an image that contains our tested model, and we will be creating a two-stage release that first delivers the correct image to a QA environment where it can be picked up and tested by a product team. Once all of the product team's conditions are satisfied, a release manager will manually approve the Production release step and the model will become available for consumption in the Production environment.

Creating Variable Groups Required for the Release stage

  1. Click on the Library menu item in the Azure DevOps portal and click + Variable Group.

alt text

  2. Complete the resulting form as depicted below, making sure that you provide values for the variables that correspond to the targeted QA environment.

alt text

  3. After each variable value has been assigned, click the lock icon to encrypt its value in the stage.
  4. Repeat Steps 2 and 3 above to set up a variable group for the targeted Production environment.

Create the Release stage

  1. In the Azure DevOps portal Click on stages -> Releases in the left menu

alt text

  2. Click the New stage menu item and select New release stage.

alt text

  3. Add two stages, named QA and Production respectively, ensuring that you select the “Empty Template”. Click on the Pre-deployment condition icon and continue to configure as depicted below. This prevents the Production deployment from happening automatically unless approval is provided by one or all of the approvers (depending on configuration), and the Production stage deployment will time out after two days without an approval.

alt text

  4. Add the build artifacts and link the release stage to its associated build stage.

alt text

Note: The value in the Source alias text area will be required to correctly configure the AZ CLI tasks in Steps 6 and 7 below.  

  5. In the menu area, select Variables and link the QA and Production variable groups to the relevant stages.

alt text

  6. Add a CLI task to the QA stage and configure it as follows:

alt text

alt text

** Make sure that the working directory set above reflects the correct generated path, e.g. $(System.DefaultWorkingDirectory)//stageArtifacts

  7. Repeat Step 6 above for the Production stage.

Note that the script internals are identical for both stages but will target different destination repositories based on the variable groups assigned to each of the stages.

alt text

  8. Run a release and inspect the results.
  9. To automate the release process, click the Continuous Integration trigger of the build artifact and set it as follows.

alt text

alt text

  10. Finally, click on the Pre-Release Condition for the QA stage and set it as follows.

alt text

  11. A successful release will resemble the following:

alt text

alt text
