
Are you like me, a Senior Data Scientist, wanting to learn more about how to approach DevOps, specifically when you are using Databricks (workspaces, notebooks, libraries, etc.)? Set up using @Azure @databricks


Azure Databricks - Python & Dev Ops

Edited by Laura Edell on April 4, 2019 & Updated June 19, 2019

Hands on Lab – Abstract

This hands-on lab is designed for the scenario where a team of scientists and engineers is responsible for the development, maintenance, and quality of analytical models that are made available to other teams for consumption.

Infrastructure Set Up

This section covers all of the infrastructure for the HOL, spanning Azure DevOps and Azure resources, which must be set up prior to the lab.

Azure Resource Creation

These steps should be completed ahead of time; doing so also makes it easier to keep or clean up the resources after the workshop.

Step 1: Create Resource Groups - Begin by creating 3 resource groups:

• [some-name]-db-dev
• [some-name]-db-stage
• [some-name]-db-prod

Adding Resources to Resource Groups

Keep in mind that the three resource groups ([some-name]-db-dev, [some-name]-db-stage, [some-name]-db-prod) will each contain a completely different set of resources.

Add Machine Learning Service Workspace

  1. Select “Add a Resource”.

alt text

  2. Search for “machine learning” and select “Machine Learning service workspace” published by Microsoft. Click Create.

alt text

  3. Populate the fields with a naming convention that makes sense to you. Select the correct resource group and ensure the location pairs with your other services.

alt text

Add Data Lake (Azure Storage gen 2)

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for “Storage”, select “Storage account”, and click “Create”.

alt text

  3. Fill out the creation form. Ensure you are in the correct resource group, give the account a name, ensure the kind is StorageV2, and set the access tier to Cool.

alt text

  4. Click on “Advanced” and ensure “Hierarchical namespace” under “Data Lake Storage Gen2” is set to “Enabled”.

alt text

  5. Select “Create”.

Add Azure Key Vault

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for “key vault” and select “Key Vault” published by Microsoft.

alt text

  3. Populate the creation form. Give it a name that is easy to remember, and ensure both the resource group and the location are the ones you intend.

alt text

Add a Databricks Cluster

  1. Select “Add a Resource” from within a resource group pane.

alt text

  2. Search for Databricks and select the offering published by Microsoft. Click “Create”.

alt text

  3. Complete the creation form using [some-name] as the workspace name and the resource group you are operating in as the resource group. Select a location and ensure the pricing tier is “Premium”, since we will be using RBAC controls.

alt text

  4. Navigate back to your resource group and select your newly created workspace.

alt text

  5. Select “Launch Workspace” in the top right – do not use the URL link.

alt text

  6. On the left-hand pane, select “Clusters” and then “Create Cluster”.

alt text

  7. Fill out the creation form. MAKE SURE you select “Terminate after 120 minutes of inactivity” to help reduce accidental usage and billing.

alt text

Add AzureML SDK as Library to Cluster.

  1. From the DataBricks Workspace, click on “Clusters” and then the cluster name.

alt text

  2. Click on the Libraries tab, then Install New, select PyPI, and enter “azureml-sdk”. Click Install.

alt text
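Once the install completes, you can sanity-check the library from a notebook attached to the cluster; a minimal sketch:

```python
# Run in a notebook cell attached to the cluster to confirm the AzureML SDK installed correctly.
import azureml.core

print("AzureML SDK version:", azureml.core.VERSION)
```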

Add Initial Data to Storage

We want to ensure there is some data in the various data lakes so folks can access it.

  1. Download the file: https://amldockerdatasets.azureedge.net/AdultCensusIncome.csv
  2. Select the storage v2 from your resource group.

alt text

  3. Select File Systems from the pane on the left and click "+ File system".

alt text

  4. Name the file system "datalake"
  5. Click on the newly created "datalake" file system

alt text

  1. Select "Download Azure Storage Explorer" if you do not already have it installed.
  2. Once Azure Storage Explorer is installed, open it. Add an Account and Login using your Azure credentials.

alt text

  8. Find your file system inside the account you have now added. Drag and drop AdultCensusIncome.csv into the main pane. Click Refresh if it does not refresh automatically.

alt text

Create Secrets for Secure & Controlled Storage Mounts

We parameterize a few extra values so that the code for mounting data can remain the same regardless of which Databricks cluster it is attached to, and so that access to the data is controlled by a cluster & data lake admin instead. These steps should be completed for each Databricks workspace within each resource group.

Install Azure & Databricks CLI
  1. Ensure Python 3.x is installed. a. If it is not, the easiest way is to install Anaconda. b. https://www.anaconda.com/download/
  2. Install the Databricks CLI. a. Open a cmd prompt and execute the command: “pip install databricks-cli”
  3. Install the Azure CLI. a. Open a cmd prompt and execute the command: “pip install azure-cli”
Configure Databricks CLI for User
Generate a User Access Token
  1. Launch the DataBricks Workspace from the Azure Portal

alt text

  2. In the upper right hand side of the screen, select the user icon.

alt text

  3. Select “User Settings” from the drop down.
  4. Select “Generate a New Token”.

alt text

  5. Give a descriptive comment and remove the token lifetime (this gives permanent token access to the cluster, which is not a best practice).

alt text

  6. Copy the token that gets generated.
Authenticate CLI with Token using Profiles
  1. Use the command: “databricks configure --token --profile [PROFILENAME]” a. HINT: use dev, test & prod, similar to how you named the workspaces and resource groups, to more easily differentiate.
  2. Enter the host URL a. https://eastus.azuredatabricks.net (example) b. Paste the token generated from the previous step
  3. To use the profiles capability, simply pass the --profile flag with the [PROFILENAME] you configured to target the corresponding workspace.
Create Service Principal and Give Access to Data Lake

This section uses an AD Service principal and provides access to the principal for the data.

  1. Log in to the Azure CLI by executing the command: “az login” a. Follow the instructions printed out.
  2. Create a service principal by executing the command: “az ad sp create-for-rbac --name [SOMENAME]” a. Copy the app id b. COPY the password – you will not be able to get it again.
  3. Get the service principal’s object id by executing the command: “az ad sp show --id [AppId]” a. Search through the result and find the value of the property “objectId” b. Copy the objectId.

alt text

  4. Open Azure Storage Explorer and right click the datalake container you created previously. Select “Manage Access”.

alt text

  5. Paste the objectId into the text box and click “Add”.

alt text

  6. Find the object ID in the list, click on it and give it Read, Write & Execute access as well as Default. Click Save.

alt text

  7. Navigate to the Azure Portal and to the ADLS Gen 2 blade for this resource group. Click on Access Control (IAM).

alt text

  8. Click on “Add”, “Add role assignment”.

alt text

  9. Set the role to “Storage Blob Data Contributor”, enter the name of the service principal you created for this resource group, and click Save.

alt text

Create an Azure Key Vault Backed Secret Scope
  1. Navigate to your workspace with the following format: a. https://eastus.azuredatabricks.net/?o=6776691945951303#secrets/createScope b. Replace the number after o= with yours:

alt text

c. Or simply append #secrets/createScope to the end of the URL of your workspace.

  2. Navigate to the key vault for the resource group you are setting up:

alt text

  3. Copy the DNS name.

alt text

  4. Copy the Resource ID.

alt text

  5. Name the scope “data-lake” and set it for “All Users”. Populate the DNS name and resource ID of the key vault, and select “Create”.

alt text

  6. From the Databricks CLI, enter the command: “databricks secrets list-scopes --profile [YOUR PROFILE]”

alt text

Add Secrets to Secret Scope for Accessing Data

You will need the Service Principal’s password and app id from the previous steps.

  1. Get the app’s tenant id by executing the following command: “az ad sp show --id [AppId]” a. Copy the value of “appOwnerTenantId”.
  2. Add the service principal’s tenant id to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-tenant-id --value [TenantId]”
  3. Add the service principal’s app id to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG you are configuring] --name sp-app-id --value [service principal’s app id]”

alt text

  4. Add the service principal’s password to the Azure Key Vault a. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-password --value [password copied from earlier]”
  5. Add the service principal’s token endpoint a. https://login.microsoftonline.com/YOURAPPOWNERTENANTID/oauth2/token b. “az keyvault secret set --vault-name [KeyVault for RG] --name sp-token-endpoint --value [token endpoint]”
  6. Add the FQDN of the data lake. a. “az keyvault secret set --vault-name [KeyVault for RG] --name datalake-fqdn --value abfss://datalake@YOURSTORAGEACCOUNT.dfs.core.windows.net”
  7. Add the subscription id for the ML service. Navigate to the ML service inside your resource group and copy the subscription id. a. “az keyvault secret set --vault-name [KeyVault for RG] --name subscription-id --value YOURSUBSCRIPTIONID”

alt text

  8. Add the resource group for the ML service. Navigate to the ML service inside your resource group and copy the resource group name. a. “az keyvault secret set --vault-name [KeyVault for RG] --name resource-group --value YOURRGNAME”

alt text

  9. Add the ML service workspace name. Navigate to the ML service within the appropriate resource group and copy the name. a. “az keyvault secret set --vault-name [KeyVault for RG] --name ml-workspace-name --value YOURVALUE”

alt text

  1. Add the "Alg State" This changes per resource group. For the Dev RG, it is "dev", for "stage" it is "stage". If you were to add additional clusters for releases for multi-tenancy it should have a convention to help support that.

    1. "az keyvault secret set --vault-name [KeyVault for RG] --name alg-state --value APPROPRIATEVALUE
  2. Add the "Created By". For now this will simply match "Alg State"'s conventions.

    1. "az keyvault secret set --vault-name [KeyVault for RG] --name created-by --value APPROPRIATEVALUE
  3. Verify secrets are in the data-lake scope for databricks a. “databricks secrets list –scope data-lake”
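As a quick sanity check from the Databricks side, the same secrets can be read in a notebook attached to the cluster through `dbutils` (Databricks redacts the values in notebook output). A minimal sketch, assuming the “data-lake” scope and the secret names created above:

```python
# Read the secrets from the Azure Key Vault-backed "data-lake" scope inside a Databricks notebook.
# Databricks redacts secret values if you try to print them.
app_id         = dbutils.secrets.get(scope="data-lake", key="sp-app-id")
tenant_id      = dbutils.secrets.get(scope="data-lake", key="sp-tenant-id")
token_endpoint = dbutils.secrets.get(scope="data-lake", key="sp-token-endpoint")
datalake_fqdn  = dbutils.secrets.get(scope="data-lake", key="datalake-fqdn")

# List the secret names (never the values) to confirm everything landed in the scope.
print([s.key for s in dbutils.secrets.list("data-lake")])
```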

Azure Dev Ops – Creation

This section covers creating a project in Azure Dev Ops for the workshop.

  1. Navigate to https://dev.azure.com
  2. Select the organization you intend to use OR create a new organization.
  3. Create a new project. Pick a name and description. Select “Git” for version control and “Agile” for the work item process.

alt text

  4. Invite additional users.

alt text

alt text

  5. Click on Repos, Files.
  6. At the very bottom, select “Initialize Repo”.
Scientists – Initial Setup

Configure Azure DevOps Integration: in Azure Databricks, set your Git provider to Azure DevOps Services on the User Settings page:

  1. Click the User icon at the top right of your screen and select User Settings.

alt text

  2. Click the Git Integration tab.
  3. Change your provider to Azure DevOps Services.

alt text

Create & Link Project File w/ Repo
  1. From inside the Databricks workspace interface, select Workspace, then Shared, then the drop-down, then Create, and create a “Folder”.

alt text

  2. Name the folder “Project_One”.
  3. Create a new file inside the project called “train_model”.

alt text

  4. Link the “train_model.py” file to your Azure DevOps repository. a. Copy the git link from your Azure DevOps portal:

alt text

b. Paste into the “link” location in the popup for “Git Preferences” c. Create a new branch. Name it your unique user ID d. Use “Project_One/notebooks/train_model.py” as the path in git repo.

alt text

Dev Loop Experience

The dev loop experience encompasses mounting the dev data, exploring that data, training a model, writing the inference code, building a dev container, and running tests inside that container.

Train the world’s worst regression & Stage for inference coding.

  1. Copy the code from Project_One/notebooks/train_model.py into the train_model.py you created earlier in Databricks.
  2. The proctor will step through exactly what the code is doing and why. In essence (see the sketch below):
    1. The pre-created secrets are used to mount the various stores securely, which allows zero code changes as the algorithm progresses across secure environments.
    2. You train a deliberately simple algorithm and register the resulting model files with the Azure ML service, bridging the divide between Databricks and the inference code. This process is ML-framework independent and can be reused across algorithms, frameworks, etc.
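The real code lives in Project_One/notebooks/train_model.py; the sketch below only illustrates the two ideas above (secret-driven mounting and model registration). The mount point and model names are placeholders, not necessarily what the lab's notebook uses.

```python
# Illustrative sketch only -- the lab's actual code is in Project_One/notebooks/train_model.py.

# 1) Mount the data lake using the service principal secrets from the "data-lake" scope.
configs = {
    "fs.azure.account.auth.type": "OAuth",
    "fs.azure.account.oauth.provider.type":
        "org.apache.hadoop.fs.azurebfs.oauth2.ClientCredsTokenProvider",
    "fs.azure.account.oauth2.client.id": dbutils.secrets.get("data-lake", "sp-app-id"),
    "fs.azure.account.oauth2.client.secret": dbutils.secrets.get("data-lake", "sp-password"),
    "fs.azure.account.oauth2.client.endpoint": dbutils.secrets.get("data-lake", "sp-token-endpoint"),
}
if not any(m.mountPoint == "/mnt/datalake" for m in dbutils.fs.mounts()):  # placeholder mount point
    dbutils.fs.mount(
        source=dbutils.secrets.get("data-lake", "datalake-fqdn"),
        mount_point="/mnt/datalake",
        extra_configs=configs,
    )
df = spark.read.csv("/mnt/datalake/AdultCensusIncome.csv", header=True, inferSchema=True)

# 2) After training, register the resulting model files with the Azure ML workspace so the
#    inference project can retrieve the exact same artifact later.
from azureml.core import Workspace
from azureml.core.model import Model

ws = Workspace(
    subscription_id=dbutils.secrets.get("data-lake", "subscription-id"),
    resource_group=dbutils.secrets.get("data-lake", "resource-group"),
    workspace_name=dbutils.secrets.get("data-lake", "ml-workspace-name"),
)
Model.register(workspace=ws, model_path="model.pkl", model_name="income_regression")  # placeholder names
```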

Inference Coding

This section extends from having a trained model to now building an inference container which is reflective of the asset we will deliver to our customer base.

Code Structure

alt text

Good code structure from the beginning is a great way to ensure you are set up well. In this case we follow well-defined development strategies via a hybrid between .NET and Python project structures.

We have two folders for the project: Project_One is the primary inference project, and Project_One-Tests contains its tests.

Git Pull the train code

  1. Open a cmd prompt.
  2. Change directory into the root of where your project is.
  3. Execute the commands: a. “git checkout [your branch]” b. “git pull”

Test Driven Development

Write a Test

You should always start by writing tests and then write code to satisfy those tests. The only code you will be required to write is in test_model.py; the facilitation code is provided for you.

alt text

Inside this file we will write a very simple unit test to ensure that the x_scaler object is populated during model initialization.

  1. An example unit test has already been written. Add one more unit test to Project_One-Tests/test_model.py (a sketch follows this list).
  2. The facilitation code follows standard pytest rules, so you can even add more test files; just follow pytest conventions.
  3. The proctor will run through how the project works.
    1. Project_One is the project code, which separates the inference code into a "provider" type class, following principles similar to those of the testable web-dev space.
    2. Project_One-Tests is your separated testing code, so that it is not coupled with your app development code.
    3. A container is built for the inference code, which is then extended with the test code. The base inference container is the asset expected to be deployed, while the extended testing container allows you to test the asset in the same form as it will be shipped.
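For reference, a second test might look roughly like the sketch below. The `inference_code.model_provider` import and the `ModelProvider` class name are assumptions made for illustration; match whatever the facilitation code actually exposes.

```python
# Illustrative pytest sketch -- adjust the import and class name to match the facilitation code.
import pytest

from inference_code.model_provider import ModelProvider  # hypothetical module/class name


@pytest.fixture(scope="module")
def provider():
    p = ModelProvider()
    p.init()  # loads the registered model artifacts, including the feature scaler
    return p


def test_x_scaler_is_populated_after_init(provider):
    # If the scaler is not loaded during initialization, inputs cannot be normalized at scoring time.
    assert provider.x_scaler is not None
```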

Review Inference Code

Normally we would test and ensure the tests fail before writing the inference code; however much of the code is already written, so we will simply review it.

alt text

In an ideal world, the only code you would need to worry about is highlighted in red. The current state of tooling as of today is why the other code exists and is not wrapped up as ADO Tasks or VS Code extensions.

The proctor will run through the code with you, but essentially:

  1. ./Project_One/score.py is what the Azure ML SDK expects as the interface and must be populated with an init() and a run(params). The params are what is received in the HTTP request body (or IoT Edge message over the route).
  2. The code placed in inference_code helps ensure code coverage is reported appropriately. We follow a provider-type structure similar to web dev when there is a pre-defined functional interface. The objective is to minimize that footprint to one line of code each in score.py's init and run (see the sketch below).
  3. The rest of the code is a dockerized build process that runs independently of the dependencies installed on your system, so that the build on your machine is the same as the build on the build server. This improves confidence that the locally generated and tested asset will match the asset that may eventually be promoted to production.
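As a rough illustration of that contract (the provider module and class names below are placeholders, not necessarily what the repo uses), score.py reduces to a thin shim:

```python
# Rough illustration of the score.py interface expected by the Azure ML SDK.
# Goal from the notes above: keep init() and run() to a single delegating line each.
from inference_code.model_provider import ModelProvider  # placeholder name

_provider = ModelProvider()


def init():
    # Called once when the container starts; loads the model files registered in Azure ML.
    _provider.init()


def run(params):
    # Called per HTTP request (or IoT Edge message); params carries the request body.
    return _provider.run(params)
```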

We now have inference code with matching train code. Let’s build the inference container and test it.

Build Inference Container

  1. First open runbuild_local.cmd. a. Modify the environment variables to match the dev environment. These will remain constant for this algorithm and your local environment: i. Subscription_id ii. Ml_resource_group iii. Ml_workspace_name iv. Ml_alg_author. Then, from the command prompt:
  2. Change directory into the Project_One folder.
  3. Run the runbuild_local.cmd a. You may need to execute az login prior to executing this command or be interactively logged in (watch the output)

alt text

c. This will execute a bunch of stuff and be on “Creating image” for a while. Occasionally hit enter to see if the cmd prompt output is up to date or not.

alt text

Test Inference Container

  1. Change directory into the Project_One-Tests folder.
  2. Run the runtests_local.cmd file
  3. This will extend the container you created in the previous step, run your unit tests, and check your code coverage. The code coverage results can be found in C:/ml_temp/artifacts/test_results. These are standard pytest and pytest-cov result outputs.

alt text

  4. Click on index.html from the cov_html folder.

alt text

  5. We have 68% code coverage; it could be worse.

Commit & Pull Request.

  1. We now know that we have an inference container, that it passes our unit tests, and that our code coverage is at a point we are happy with.
  2. From the command prompt, change directory to the root of the repository.
  3. Execute the following commands to push the changes from your branch: a. git add ./ b. git commit -m “works” c. git push
  4. Create a pull request by going to your ADO site, under Repos, Pull requests, New Pull Request.

alt text

  5. Populate the request template and ensure you have a reviewer:

alt text

  6. Review the changes with the reviewer you selected. Ensure both of you enter ADO and hit “Approve” and then “Complete”. If you see problems in your peer's code, add comments and reject it. Once both reviewers approve, you can complete the PR. This will launch the build and release stages which are connected to master.

Defining your Build stage

Since we are targeting a different Azure Databricks environment than the one used in the local dev loop described earlier in this document, and since we are concerned with security, we will create a library asset that allows us to define secrets from a key vault pointing to this new environment. These secrets become available as variables in the build stage. Variables give you a convenient way to get key bits of data into various parts of the stage. As the name suggests, the value of a variable may change from run to run or job to job of your stage. Almost any place where a stage requires a text string or a number, you can use a variable instead of hard-coding a value. The system will replace the variable with its current value during the stage's execution.

Creating a Variable Group

  1. In your Azure DevOps Subscription navigate to the Library Menu Item and click + Variable Group

alt text

  2. Name your variable group as indicated, select the Azure Subscription and KeyVault that you wish to target, and toggle the “Link secrets from an Azure key vault as variables” switch to the on position.

alt text

  3. Click the + Add button, select the variables that you want to make available to the stage, click OK, and then Save to make sure that your changes are persisted to your Azure DevOps instance.

alt text

Create a Build stage in the Visual Designer

The intention of this step is to create an Azure DevOps stage that mimics the steps from the local build loop but targets a different Azure Databricks environment for the training. The connection details of this environment will not be available to the scientists directly and will be managed by the operations team. This stage will execute when a PR to master is approved and completed.

  1. In your Azure DevOps tenant, navigate to stages -> Builds and click on + New and select New build stage.

alt text

  2. Select your source and make sure to select the master branch, as we want the stage attached to the branch we will be monitoring for pull requests. Click Continue.

alt text

  3. Select Empty Job.

alt text

  4. Name your stage accordingly and select the Hosted Ubuntu 1604 build agent from the Agent Pool.

alt text

  5. Link the variable group that you created earlier by clicking on Variables in the menu bar, followed by Variable groups, and click Link Variable Groups.

alt text

  6. Select the stage Environment Variable Group and click Link. Your stage now has access to all the runtime environment variables needed to connect to the stage environment.

alt text

  7. Click back onto Tasks in the menu and click the + on the Agent Job to add the tasks that you will be configuring for the build process.

alt text

  8. Type “CLI” in the search box and click the Azure CLI “Add” button four times.

alt text

Your Agent Job Step should look like the following when you have completed.

alt text

  9. Repeat Step 8, substituting the search for “CLI” with “Copy”, and add two Copy Files tasks.

alt text

  10. Substitute “Copy” with “Test” and add a Publish Test Results task.

alt text

  11. Substitute “Test” with “Coverage” and add a Publish Code Coverage Results task.

alt text

Your Agent Job should now resemble the following:

alt text

  12. The first Azure CLI task will be used to configure the agent environment and make sure that the required packages are installed to execute the rest of the stage. Provide the task with a descriptive name, select the appropriate Azure subscription, set the Script Location to “Inline Script”, and add the following to the inline script window:
    pip3 install -U setuptools
    python3 -m pip install --upgrade pip
    pip3 install --upgrade azureml-sdk[notebooks]
Set the remainder of the task properties as depicted below:

alt text

alt text

  13. Click on the second Azure CLI task, select the appropriate Azure subscription, and configure it as follows:

alt text

  14. Click on the third Azure CLI task, select the appropriate Azure subscription, and configure the task as follows:

alt text

alt text

  15. Click on the fourth Azure CLI task, select the appropriate Azure subscription, and configure the task as follows:

alt text

alt text

  16. Click on the first Copy Files task and configure the task as follows:

alt text

alt text

  17. Click on the second Copy Files task and configure the task as follows:

alt text

alt text

  18. Click on the Publish Test Results task and configure the task as follows:

alt text

alt text

  19. Click on the Publish Code Coverage Results task and configure the task as follows:

alt text

alt text

  20. On the Agent Job, click the + in order to add a task that will be used to publish the build artifacts for use in a release stage later.

alt text

Search for Publish and Click “Add” on the Publish Build Artifacts Task

alt text

Configure the task as follows:

alt text

alt text

  21. Enable the Continuous Integration trigger on the stage, which makes sure that every time a change is made to the master branch of the repository this stage will execute. Click the Triggers item in the menu bar and click the checkbox to enable continuous integration.

alt text

You can now Save and Queue this stage for a manual build to make sure that it executes from end to end without any issues.

Output from the stage should resemble the following:

alt text

Defining your Release stage

While release stages are often used to deliver artifacts in a deployed state, our scenario calls for a different approach. Our build artifact is an image that contains our tested model, and we will be creating a two-stage release that first delivers the correct image to a QA environment where it can be picked up and tested by a product team. Once all of the product team's conditions are satisfied, a release manager will manually approve the Production release step and the model will become available for consumption in the Production environment.

Creating Variable Groups Required for the Release stage

  1. Click on the Library menu item in the Azure DevOps portal and click + Variable Group.

alt text

  2. Complete the resulting form as depicted below, making sure that you provide values for the variables that correspond to the targeted QA environment.

alt text

  3. After each variable value has been assigned, click the lock icon to encrypt its value in the stage.
  4. Repeat Steps 2 and 3 above to set up a variable group for the targeted Production environment.

Create the Release stage

  1. In the Azure DevOps portal Click on stages -> Releases in the left menu

alt text

  2. Click the New stage menu item and select New release stage.

alt text

  3. Add two stages, named QA and Production respectively, ensuring that you select the “Empty Template”. Click on the Pre-deployment condition icon and continue to configure as depicted below. This prevents the Production deployment from happening automatically unless approval is provided by one or all of the approvers (depending on configuration), and the Production stage deployment will time out after two days without an approval.

alt text

  4. Add the build artifacts and link the release stage to its associated build stage.

alt text

Note: The value in the Source alias text area will be required to correctly configure the AZ CLI tasks in Steps 6 and 7 below.  

  5. In the menu area, select Variables and link the QA and Production variable groups to the relevant stages.

alt text

  6. Add a CLI task to the QA stage and configure it as follows:

alt text

alt text

** Make sure that the working directory set above reflects the correct generated path, e.g. $(System.DefaultWorkingDirectory)//stageArtifacts

  7. Repeat Step 6 above for the Production stage.

Note that the script internals are identical for both stages but will target different destination repositories based on the variable groups assigned to each of the stages.

alt text

  8. Run a release and inspect the results.
  9. To automate the release process, click the Continuous Integration trigger of the build artifact and set it as follows.

alt text

alt text

  10. Finally, click on the Pre-Release Condition for the QA stage and set it as follows.

alt text

  11. A successful release will resemble the following:

alt text

alt text
