7xuanlu/sqlserver-bdc

Workshop: SQL Server Big Data Clusters

Contributor: Martin
Purpose: Demo/Workshop
Updated date: 2020/03/26

Welcome to this Microsoft solutions workshop on the architecture of SQL Server Big Data Clusters (BDC). You'll experiment with BDC and see how you can use it to implement large-scale data processing and machine learning.

This Workshop assumes you have a full understanding of the concepts of big data analytics, the technologies that you will use throughout the Workshop (such as containers, Kubernetes, Spark, HDFS, and machine learning), and the architecture of a BDC. If you are not familiar with these topics, you can take a complete course here.

In this Workshop you'll learn how to create external tables over other data sources to unify your data, and how to use Spark to run big queries over your data in HDFS or do data preparation. You'll review a complete solution for an end-to-end scenario, with a focus on how to extrapolate what you have learned to create other solutions for your organization.
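As a rough illustration of what creating an external table over another data source involves, the sketch below uses Python to render the T-SQL for an external table that points at a table in a remote SQL Server instance. It is not taken from the Workshop notebooks: the helper `render_external_table`, the data source name `SqlServerInstance`, and the table and column names are hypothetical, and the external data source is assumed to exist already.

```python
def render_external_table(table, columns, data_source, remote_location):
    """Render a CREATE EXTERNAL TABLE statement as a string.

    columns: list of (name, sql_type) pairs.
    remote_location: three-part name of the remote object ('db.schema.table').

    Values are interpolated directly into the statement, so only pass
    trusted, hand-written identifiers (this is a sketch, not a query builder).
    """
    if not columns:
        raise ValueError("at least one column is required")
    column_lines = ",\n".join(
        f"    [{name}] {sql_type}" for name, sql_type in columns
    )
    return (
        f"CREATE EXTERNAL TABLE {table} (\n"
        f"{column_lines}\n"
        f")\n"
        f"WITH (LOCATION = '{remote_location}', DATA_SOURCE = {data_source});"
    )


if __name__ == "__main__":
    # Hypothetical names, for illustration only.
    print(
        render_external_table(
            "dbo.CustomersExternal",
            [("CustomerId", "INT"), ("Name", "NVARCHAR(100)")],
            "SqlServerInstance",
            "Sales.dbo.Customers",
        )
    )
```

Once such a table exists, it can be queried from the master instance like a local table, which is the sense in which external tables unify your data.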

This Workshop expects that you understand data structures and have experience working with SQL Server and computer networks. It does not expect you to have any prior data science knowledge, but a basic knowledge of statistics and data science is helpful in the Data Science sections. Knowledge of SQL Server, Azure Data and AI services, Python, and Jupyter Notebooks is recommended. AI techniques are implemented in Python packages. Solution templates are implemented using Azure services, development tools, and SDKs. You should have a basic understanding of working with the Microsoft Azure Platform.

You need to have all of the prerequisites completed before taking this Workshop.

You need a full Big Data Cluster for SQL Server up and running, and you need to have identified the connection endpoints along with all security parameters. You can find out how to do that here.
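Once you have identified the endpoints, it can help to capture them in one place. The following is a minimal sketch, not part of the Workshop materials, that assembles an ODBC connection string for the SQL Server master instance. The helper `build_master_connection_string`, the host value, the port `31433`, the user name, the `BDC_PASSWORD` environment variable, and the driver name are all assumptions for illustration; substitute the values reported for your own cluster.

```python
import os


def build_master_connection_string(
    host,
    port=31433,  # assumed port; use the one reported for your cluster
    user="admin",
    password="",
    database="master",
    driver="ODBC Driver 17 for SQL Server",  # assumed driver name
):
    """Return an ODBC connection string for the master instance endpoint."""
    if not host:
        raise ValueError("host must not be empty")
    parts = [
        f"DRIVER={{{driver}}}",   # renders as DRIVER={<driver name>}
        f"SERVER={host},{port}",  # SQL Server ODBC separates host and port with a comma
        f"DATABASE={database}",
        f"UID={user}",
        f"PWD={password}",
    ]
    return ";".join(parts) + ";"


if __name__ == "__main__":
    # Read the password from the environment rather than hard-coding it.
    conn_str = build_master_connection_string(
        "bdc.example.internal",  # hypothetical host name
        password=os.environ.get("BDC_PASSWORD", ""),
    )
    # Avoid printing the password; show only the server part.
    print(conn_str.split(";")[1])
```

If the `pyodbc` package and a matching ODBC driver are installed, the resulting string can be passed to `pyodbc.connect(...)` to confirm that the endpoint and credentials work before you start the notebooks.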

You will work through nine Jupyter Notebooks using the Azure Data Studio tool. Download them and open them in Azure Data Studio, running only one cell at a time.

| Notebook | Topics |
| --- | --- |
| bdc-00-overview.ipynb | Overview of the Workshop and setup of the source data, problem space, solution options, and architectures. |
| bdc-01-k8s.ipynb | In-depth details of a pod and other Kubernetes artifacts located in a SQL Server big data cluster. |
| bdc-02-adstudio.ipynb | View the service endpoints and the status of the components of a SQL Server big data cluster. |
| bdc-03-sqlserver-master.ipynb | Run standard SQL Server queries against the Master Instance (MI) in a SQL Server big data cluster. |
| bdc-04-data-virtualization.ipynb | Learn how to create and query virtualized data in a SQL Server big data cluster. |
| bdc-05-data-mart.ipynb | Create and query a data mart using virtualized data in a SQL Server big data cluster. |
| bdc-06-spark-etl.ipynb | Learn how to work with Spark jobs in a SQL Server big data cluster. |
| bdc-07-spark-ml.ipynb | Train a Spark ML model in a SQL Server big data cluster and export it as an MLeap bundle. |
| bdc-08-model-deployment.ipynb | Learn how to export and deploy an MLeap bundle in a SQL Server big data cluster. |
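
Before opening the notebooks in Azure Data Studio, you may want to confirm that every file was downloaded. The sketch below is a small optional check and not part of the Workshop itself; the helper name `missing_notebooks` is hypothetical, and the file names are the ones listed above.

```python
from pathlib import Path

# File names as listed in the notebook table, in Workshop order.
WORKSHOP_NOTEBOOKS = [
    "bdc-00-overview.ipynb",
    "bdc-01-k8s.ipynb",
    "bdc-02-adstudio.ipynb",
    "bdc-03-sqlserver-master.ipynb",
    "bdc-04-data-virtualization.ipynb",
    "bdc-05-data-mart.ipynb",
    "bdc-06-spark-etl.ipynb",
    "bdc-07-spark-ml.ipynb",
    "bdc-08-model-deployment.ipynb",
]


def missing_notebooks(directory):
    """Return the Workshop notebooks not found in `directory`, in Workshop order."""
    folder = Path(directory)
    return [name for name in WORKSHOP_NOTEBOOKS if not (folder / name).is_file()]


if __name__ == "__main__":
    # Check the current working directory; pass another path if needed.
    missing = missing_notebooks(".")
    if missing:
        print("Missing notebooks:")
        for name in missing:
            print("  " + name)
    else:
        print("All Workshop notebooks are present.")
```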