r-clustering-model-introduction.md

title: Tutorial: Develop a clustering model in R
titleSuffix: SQL Machine Learning
description: In this four-part tutorial series, you'll develop a model to perform clustering in R with SQL machine learning.
author: WilliamDAssafMSFT
ms.author: wiassaf
ms.reviewer: garye, jroth
ms.date: 05/26/2020
ms.service: sql
ms.subservice: machine-learning
ms.topic: tutorial
monikerRange: >=sql-server-2016||>=sql-server-linux-ver15||=azuresqldb-mi-current

Tutorial: Develop a clustering model in R with SQL machine learning

[!INCLUDE SQL Server 2016 SQL MI]

::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"
In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services or on Big Data Clusters to categorize customer data.
::: moniker-end
::: moniker range="=sql-server-2017"
In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server Machine Learning Services to cluster customer data.
::: moniker-end
::: moniker range="=sql-server-2016"
In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in SQL Server R Services to cluster customer data.
::: moniker-end
::: moniker range="=azuresqldb-mi-current"
In this four-part tutorial series, you'll use R to develop and deploy a K-Means clustering model in Azure SQL Managed Instance Machine Learning Services to cluster customer data.
::: moniker-end

In part one of this series, you'll set up the prerequisites for the tutorial and then restore a sample dataset to a database. In parts two and three, you'll develop some R scripts in an Azure Data Studio notebook to analyze and prepare this sample data and train a machine learning model. Then, in part four, you'll run those R scripts inside a database using stored procedures.

Clustering organizes data into groups whose members are similar in some way. For this tutorial series, imagine you own a retail business. You'll use the K-Means algorithm to cluster customers in a dataset of product purchases and returns. K-Means clustering is an unsupervised learning algorithm that looks for patterns in data based on similarities. By clustering customers, you can target specific groups and focus your marketing efforts more effectively.

In this article, you'll learn how to:

[!div class="checklist"]

  • Install the prerequisites
  • Restore a sample database

In part two, you'll learn how to prepare the data from a database to perform clustering.

In part three, you'll learn how to create and train a K-Means clustering model in R.

In part four, you'll learn how to create a stored procedure in a database that can perform clustering in R based on new data.

Prerequisites

::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"

Restore the sample database

The sample dataset used in this tutorial has been saved to a .bak database backup file for you to download and use. This dataset is derived from the tpcx-bb dataset provided by the Transaction Processing Performance Council (TPC).

::: moniker range=">=sql-server-ver15||>=sql-server-linux-ver15"

Note

If you are using Machine Learning Services on Big Data Clusters, see how to Restore a database into the SQL Server big data cluster master instance.
::: moniker-end

::: moniker range=">=sql-server-2017||>=sql-server-linux-ver15"

  1. Download the file tpcxbb_1gb.bak.

  2. Follow the directions in Restore a database from a backup file in Azure Data Studio, using these details:

    • Import from the tpcxbb_1gb.bak file you downloaded
    • Name the target database "tpcxbb_1gb"
  3. After you restore the database, verify that the dataset exists by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];
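
If you prefer T-SQL to the Azure Data Studio restore dialog, the restore in step 2 could be sketched roughly as follows. This is an illustration under stated assumptions, not the procedure the linked article describes: the backup path is hypothetical, and the logical file names inside the backup are not given in this tutorial, so the sketch lists them first instead of guessing.

```sql
-- Assumption: the downloaded backup was copied to a folder that the
-- SQL Server service account can read. The path below is hypothetical;
-- on Linux it would be a path such as one under /var/opt/mssql.
RESTORE FILELISTONLY
FROM DISK = N'C:\backups\tpcxbb_1gb.bak';

-- Restore under the database name this tutorial expects.
-- If the data and log file paths recorded in the backup do not exist on
-- your server, add WITH MOVE clauses that use the logical names returned
-- by RESTORE FILELISTONLY above.
RESTORE DATABASE tpcxbb_1gb
FROM DISK = N'C:\backups\tpcxbb_1gb.bak'
WITH RECOVERY;
```

Running `RESTORE FILELISTONLY` first keeps the sketch honest about what the backup contains before any files are written.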

::: moniker-end
::: moniker range="=azuresqldb-mi-current"

  1. Download the file tpcxbb_1gb.bak.

  2. Follow the directions in Restore a database to a Managed Instance in SQL Server Management Studio, using these details:

    • Import from the tpcxbb_1gb.bak file you downloaded
    • Name the target database "tpcxbb_1gb"
  3. After you restore the database, verify that the dataset exists by querying the dbo.customer table:

    USE tpcxbb_1gb;
    SELECT * FROM [dbo].[customer];

::: moniker-end
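
The verification query in step 3 returns every row of the table, which can be slow to display for a dataset of this size. A lighter check, offered here as a minimal sketch, counts the rows and previews only a few of them. The column alias `customer_rows` is an illustrative name, not something defined by the sample database.

```sql
USE tpcxbb_1gb;

-- Confirm the table exists and holds rows without returning all of them.
SELECT COUNT(*) AS customer_rows
FROM [dbo].[customer];

-- Preview a handful of rows to see which columns are available.
SELECT TOP (10) *
FROM [dbo].[customer];
```

A row count greater than zero, together with a non-empty preview, suggests the restore completed and the table is readable.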

Clean up resources

If you're not going to continue with this tutorial, delete the tpcxbb_1gb database.
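
One way to delete the database is with T-SQL, sketched below. It assumes no other sessions are connected to `tpcxbb_1gb`; if any are, the `DROP DATABASE` statement fails until those connections are closed.

```sql
-- Switch away from the database first, because a database
-- cannot be dropped while the current session is using it.
USE master;
GO

-- IF EXISTS (available in SQL Server 2016 and later) avoids an error
-- when the database has already been removed.
DROP DATABASE IF EXISTS tpcxbb_1gb;
GO
```

Dropping the database removes its data and log files, so restore from the tpcxbb_1gb.bak file again if you return to the tutorial later.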

Next steps

In part one of this tutorial series, you completed these steps:

  • Installed the prerequisites
  • Restored a sample database

To prepare the data for the machine learning model, follow part two of this tutorial series:

[!div class="nextstepaction"] Prepare data to perform clustering