<img src="https://github.com/Microsoft/sqlworkshops/blob/master/graphics/solutions-microsoft-logo-small.png?raw=true" alt="Microsoft">
<br>

# SQL Server 2019 big data cluster Tutorial
## 00 - Scenario Overview and System Setup

In this set of tutorials you'll work with an end-to-end scenario that uses SQL Server 2019's big data clusters to solve real-world problems. 


## Wide World Importers

Wide World Importers (WWI) is a traditional brick-and-mortar business that makes specialty items for other companies to use in their products. They design, sell, and ship these products worldwide.

WWI corporate has now added a new partnership with a company called "AdventureWorks", which sells bicycles both online and in-store. The AdventureWorks company has asked WWI to produce super-hero themed baskets, seats and other bicycle equipment for a new line of bicycles. WWI corporate has asked the IT department to develop a pilot program with these goals: 

- Integrate the large amounts of data from the AdventureWorks company including customers, products and sales
- Allow a cross-selling strategy so that current WWI customers and AdventureWorks customers see their information without having to re-enter it
- Incorporate their online sales information for deeper analysis
- Provide a historical data set so that the partnership can be evaluated
- Ensure this is a "framework" approach, so that it can be re-used with other partners

WWI has a typical N-Tier application that provides a series of terminals, a Business Logic layer, and a Database back-end. They use on-premises systems, and are interested in linking these to the cloud. 

In this series of tutorials, you will build a solution using the scale-out features of SQL Server 2019, Data Virtualization, Data Marts, and the Data Lake features. 

## Running these Tutorials

- You can read through the output of these completed tutorials if you wish, or:

- You can follow along by copying the code into a SQL query window and a Spark notebook in Azure Data Studio, or you can download these Jupyter Notebooks and run them in Azure Data Studio for a hands-on experience.
 
- If you would like to run the tutorials, you'll need a SQL Server 2019 big data cluster and the client tools installed. If you want to set up your own cluster, <a href="https://docs.microsoft.com/en-us/sql/big-data-cluster/deploy-get-started?view=sqlallproducts-allversions" target="_blank">click this reference and follow the steps you see there for the server and tools you need</a>.

- You will need to have the following: 
    - Your **Knox Password**
    - The **Knox IP Address**
    - The `sa` **Username** and **Password** to your Master Instance
    - The **IP address** to the SQL Server big data cluster Master Instance 
    - The **name** of your big data cluster

For a complete workshop on SQL Server 2019's big data clusters, <a href="https://github.com/Microsoft/sqlworkshops/tree/master/sqlserver2019bigdataclusters" target="_blank">check out this resource</a>.

## Copy Database backups to the SQL Server 2019 big data cluster Master Instance

The first step is to download the WWI database backups from their location in the cloud and then copy them up to your cluster.

These commands use the `curl` program to pull the files down. [You can read more about curl here](https://curl.haxx.se/). 

The next set of commands uses the `kubectl` command to copy the files from where you downloaded them to the data directory of the SQL Server 2019 bdc Master Instance. [You can read more about kubectl here](https://kubernetes.io/docs/reference/kubectl/overview/). 

Note that you will need to replace the section of the script marked with `<ReplaceWithClusterName>` with the name of your SQL Server 2019 bdc. (It does not need single or double quotes, just the letters of your cluster name.)

Notice also that these commands assume a `c:\temp` location. If you want to use another drive or directory, edit the commands accordingly.

Once you have edited these commands, open a Command Prompt *(not PowerShell)* on your system, then copy, paste, and run each block one at a time, observing the output.

In the next tutorial you will restore these databases on the Master Instance.

In [2]:
REM Create a temporary directory for the files
md c:\temp
cd c:\temp

REM Get the database backups
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/WWI.bak" -o c:\temp\WWI.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/AdventureWorks.bak" -o c:\temp\AdventureWorks.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/AdventureWorksDW.bak" -o c:\temp\AdventureWorksDW.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/Analysis.bak" -o c:\temp\Analysis.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/sales.bak" -o c:\temp\sales.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/NYC.bak" -o c:\temp\NYC.bak
curl "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/WWIDW.bak" -o c:\temp\WWIDW.bak
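
Before copying anything to the cluster, it can help to confirm that every download actually landed. The following is a minimal Python sketch, not part of the original tutorial: the helper name `missing_backups` is hypothetical, the file names come from the download commands above, and `c:\temp` is the location those commands assume.

```python
from pathlib import Path

# Backup file names taken from the -o arguments of the download commands above.
EXPECTED_BACKUPS = [
    "WWI.bak",
    "AdventureWorks.bak",
    "AdventureWorksDW.bak",
    "Analysis.bak",
    "sales.bak",
    "NYC.bak",
    "WWIDW.bak",
]


def missing_backups(folder, expected=EXPECTED_BACKUPS):
    """Return the expected backup names that are absent or empty in folder.

    Hypothetical helper for illustration only.
    """
    folder = Path(folder)
    missing = []
    for name in expected:
        path = folder / name
        # Treat a zero-byte file as a failed download as well.
        if not path.is_file() or path.stat().st_size == 0:
            missing.append(name)
    return missing


if __name__ == "__main__":
    # c:\temp is the download location assumed by the commands above.
    problems = missing_backups(r"c:\temp")
    if problems:
        print(f"Missing or empty: {problems}")
    else:
        print("All backups present")
```

If the list that comes back is not empty, re-run the matching `curl` line before moving on to the copy step.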


In [0]:
REM Copy the backups to the data location on the SQL Server Master Instance
cd c:\temp
kubectl cp WWI.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp WWIDW.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp sales.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp analysis.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp AdventureWorks.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp AdventureWorksDW.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
kubectl cp NYC.bak master-0:/var/opt/mssql/data -c mssql-server -n <ReplaceWithClusterName>
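
Rather than editing the placeholder seven times by hand, the copy commands above could be generated from the cluster name. This is a rough Python sketch under stated assumptions: `kubectl_cp_commands` is a hypothetical helper, the pod, container, and target directory are copied from the commands above, and `mssql-cluster` in the usage lines is only an example name. It prints the commands and does not run them.

```python
# File names exactly as they appear in the kubectl cp commands above.
BACKUPS = [
    "WWI.bak",
    "WWIDW.bak",
    "sales.bak",
    "analysis.bak",
    "AdventureWorks.bak",
    "AdventureWorksDW.bak",
    "NYC.bak",
]


def kubectl_cp_commands(cluster_name, files=BACKUPS, pod="master-0",
                        container="mssql-server",
                        target_dir="/var/opt/mssql/data"):
    """Build one `kubectl cp` command line per backup file.

    Hypothetical helper: it only formats strings, it never calls kubectl.
    """
    # Catch the unedited <ReplaceWithClusterName> placeholder early.
    if not cluster_name or "<" in cluster_name or ">" in cluster_name:
        raise ValueError("Replace the placeholder with your cluster name")
    return [
        f"kubectl cp {name} {pod}:{target_dir} -c {container} -n {cluster_name}"
        for name in files
    ]


if __name__ == "__main__":
    # "mssql-cluster" is an example name, not a value from this tutorial.
    for line in kubectl_cp_commands("mssql-cluster"):
        print(line)
```

The printed lines can then be pasted into the Command Prompt from `c:\temp`, exactly as with the hand-edited block.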


## Copy Exported Data to Storage Pool

Next, you'll download a few text files that will form the external data to be ingested into the Storage Pool HDFS store. In production environments, you have multiple options for moving data into HDFS, such as Spark Streaming or the Azure Data Factory.

The first code block creates directories in the HDFS store. The second block downloads the source data from a web location. The final block copies the data from your local system to the SQL Server 2019 big data cluster Storage Pool.

You need to replace `<ReplaceWithKnoxPassword>`, `<ReplaceWithHDFSGatewayPassword>`, and `<ReplaceWithHDFSGatewayEndpoint>` with the appropriate information for your system. Both password placeholders take your Knox password, and the endpoint is your Knox IP address. If needed, also adjust the drive letter and directory values.
> (You can use **Ctrl+H** to open the Find and Replace dialog in the cell)

In [0]:
REM Make the Directories in HDFS
curl -i -L -k -u root:<ReplaceWithKnoxPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/product_review_data?op=MKDIRS"
curl -i -L -k -u root:<ReplaceWithKnoxPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/partner_customers?op=MKDIRS"
curl -i -L -k -u root:<ReplaceWithKnoxPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/partner_products?op=MKDIRS"
curl -i -L -k -u root:<ReplaceWithKnoxPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/web_logs?op=MKDIRS"
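
The four commands above differ only in the directory name, so the URL pattern can be factored out. Below is a small Python sketch of that pattern. The function name `webhdfs_url` is hypothetical, the port and `/gateway/default/webhdfs/v1/` prefix are copied from the commands above, and the IP address in the usage lines is a made-up example.

```python
from urllib.parse import quote, urlencode

# Directory names taken from the MKDIRS commands above.
HDFS_DIRECTORIES = [
    "product_review_data",
    "partner_customers",
    "partner_products",
    "web_logs",
]


def webhdfs_url(endpoint, hdfs_path, op, port=30443, **params):
    """Build a WebHDFS URL routed through the gateway.

    Hypothetical helper that mirrors the URL shape used by the curl commands.
    """
    # op comes first, followed by any extra query parameters.
    query = urlencode({"op": op, **params})
    # Keep "/" separators but escape anything unsafe in the path.
    path = quote(hdfs_path.strip("/"))
    return f"https://{endpoint}:{port}/gateway/default/webhdfs/v1/{path}?{query}"


if __name__ == "__main__":
    # 10.0.0.4 is an example address, not a value from this tutorial.
    for directory in HDFS_DIRECTORIES:
        print(webhdfs_url("10.0.0.4", directory, "MKDIRS"))
```

Extra keyword arguments become query parameters, so `webhdfs_url(endpoint, "web_logs/web_clickstreams.csv", "create", overwrite="true")` would produce the same URL shape used when uploading files.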


In [0]:
REM Get the text files
curl -G "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/product_reviews_sample.csv" -o product_reviews.csv
curl -G "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/customers.csv" -o customers.csv
curl -G "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/stockitemholdings.csv" -o products.csv
curl -G "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/web_clickstreams.csv" -o web_clickstreams.csv
curl -G "https://cs7a9736a9346a1x44c6xb00.blob.core.windows.net/backups/training-formatted.csv" -o training-formatted.csv


In [0]:
REM Copy the text files to the HDFS directories
curl -i -L -k -u root:<ReplaceWithHDFSGatewayPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/product_review_data/product_reviews.csv?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "product_reviews.csv"
curl -i -L -k -u root:<ReplaceWithHDFSGatewayPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/partner_customers/customers.csv?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "customers.csv"
curl -i -L -k -u root:<ReplaceWithHDFSGatewayPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/partner_products/products.csv?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "products.csv"
curl -i -L -k -u root:<ReplaceWithHDFSGatewayPassword> -X PUT "https://<ReplaceWithHDFSGatewayEndpoint>:30443/gateway/default/webhdfs/v1/web_logs/web_clickstreams.csv?op=create&overwrite=true" -H "Content-Type: application/octet-stream" -T "web_clickstreams.csv"
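
To keep track of which local file ends up where, the upload step can be summarized as a mapping. The Python sketch below is illustrative only: the helper names are hypothetical, and the file and directory names are read off the download and upload commands above. It also shows that the download block fetches five files while the upload block copies four, so `training-formatted.csv` stays on the local system at this point.

```python
# Local file -> HDFS directory, taken from the upload commands above.
UPLOADS = {
    "product_reviews.csv": "product_review_data",
    "customers.csv": "partner_customers",
    "products.csv": "partner_products",
    "web_clickstreams.csv": "web_logs",
}

# Local names written by the download commands (their -o arguments).
DOWNLOADED = [
    "product_reviews.csv",
    "customers.csv",
    "products.csv",
    "web_clickstreams.csv",
    "training-formatted.csv",
]


def upload_targets(uploads=UPLOADS):
    """Map each local file to its full HDFS path (hypothetical helper)."""
    return {name: f"/{directory}/{name}" for name, directory in uploads.items()}


def not_uploaded(downloaded=DOWNLOADED, uploads=UPLOADS):
    """List downloaded files that no upload command sends to HDFS."""
    return [name for name in downloaded if name not in uploads]


if __name__ == "__main__":
    for local_name, hdfs_path in upload_targets().items():
        print(f"{local_name} -> {hdfs_path}")
    print("Kept locally:", not_uploaded())
```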


## Next Step: Working with the SQL Server 2019 big data cluster Master Instance

Now you're ready to open the next Python Notebook - [bdc_tutorial_01.ipynb](bdc_tutorial_01.ipynb) - to learn how to work with the SQL Server 2019 bdc Master Instance.