Prepare your Dataset for your ML Models using Data Refinery

Authors: Manoj Jahgirdar (manoj.jahgirdar@in.ibm.com), Smruthi Raj Mohan (smrraj32@in.ibm.com), Srikanth Manne (srikanth.manne@in.ibm.com), Manjula G. Hosurmath (mhosurma@in.ibm.com)

Completed: 2020-01-20

Components: IBM Db2 Database, IBM Cloud Pak for Data

Type: tutorial

In this tutorial, we will perform data engineering operations on multiple datasets using Data Refinery on Cloud Pak for Data or Watson Studio on IBM Cloud.

A data scientist cannot build a model directly from a raw dataset; data collection and analysis are essential before building a model. In this tutorial we demonstrate how data scientists can easily collect data from databases, analyze it, and enhance it to fit their requirements with the help of Data Refinery on Cloud Pak for Data or Watson Studio on IBM Cloud.

Learning objectives

When you have completed this tutorial, you will understand how to:

  • Create a set of ordered steps to cleanse, shape, and enhance data.
  • Create a connection between a database and Data Refinery.
  • Prepare datasets specific to your ML model.
  • Save the datasets in a database of your choice.

Prerequisites

  1. Any SQL database. In this tutorial we demonstrate with Db2 on Cloud Pak for Data and Db2 on Cloud.

  2. An IBM Cloud account, if you prefer to deploy on IBM Cloud.

Estimated time

Completing this tutorial should take about 30 minutes.

Steps

You can follow this tutorial to deploy on either Cloud Pak for Data or IBM Cloud.

Steps to deploy on Cloud Pak for Data

Step 1: Download the data

In this tutorial we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the link given below.

After downloading, extract the brazilian-ecommerce.zip file.

We’ll be using the following files (a quick local peek at them with pandas follows the list):

  1. brazilian-ecommerce/olist_orders_dataset.csv : This is the core dataset. From each order you can find all the other information.

  2. brazilian-ecommerce/olist_order_items_dataset.csv : This dataset includes data about the items purchased within each order.

  3. brazilian-ecommerce/olist_products_dataset.csv : This dataset includes data about the products sold by Olist.

  4. brazilian-ecommerce/olist_sellers_dataset.csv : This dataset includes data about the sellers that fulfilled orders made at Olist.
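
A quick way to sanity-check the files before loading them is a short pandas script; a minimal sketch, assuming the zip was extracted to ./brazilian-ecommerce/:

```python
# Peek at the four Olist CSV files before loading them into Db2.
# The paths assume brazilian-ecommerce.zip was extracted in place.
import pandas as pd

files = {
    "ORDERS": "brazilian-ecommerce/olist_orders_dataset.csv",
    "ORDERITEMS": "brazilian-ecommerce/olist_order_items_dataset.csv",
    "PRODUCTS": "brazilian-ecommerce/olist_products_dataset.csv",
    "SELLERS": "brazilian-ecommerce/olist_sellers_dataset.csv",
}

for table, path in files.items():
    df = pd.read_csv(path)
    print(f"{table}: {df.shape[0]} rows, {df.shape[1]} columns")
```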

Step 2: Load the data into tables in Db2

NOTE: We assume you have already provisioned a Db2 instance in your Cloud Pak for Data. If you do not have a Db2 instance provisioned, you can also use another on-prem, public, or private database of your choice and load the datasets there.

  • Open the Db2 Instance and click on load data.

loadData

  • Select the olist_orders_dataset.csv file and select next.

browseFile

  • Choose your namespace, create a table named ORDERS, and select next.

Note: Make sure you have selected the default schema of your database. In the case of Db2, your default schema is your username.

createTable

  • You can preview the metadata of the table and select next.

tableMetaData

  • Click on Begin Load to import the downloaded .csv file into your Db2.

beginLoad

  • Wait for the upload to finish.

waitForUpload

  • Once the table is created, click on Load More Data to add the other three datasets.

loadMoreData

  • Load olist_order_items_dataset.csv and name the table ORDERITEMS, load olist_products_dataset.csv and name the table PRODUCTS, and finally load olist_sellers_dataset.csv and name the table SELLERS, repeating the above steps.

Step 3: Create a Project in Cloud Pak for Data

Once the database is ready, we will start using it in our Cloud Pak for Data.

  • Create a project in Cloud Pak for Data and choose an Empty Project.

createProject

  • Once the project is created you will see the following page.

projectDashboard

Step 4: Add Db2 connection to the project

Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.

  • Click on Add to Project and select Connection. If you followed step 2, select Db2 from the list and add the credentials of your provisioned Db2 instance. If you have a different database, select it and fill in the credentials.

gif

  • After filling in the credentials, click on Test Connection to make sure they are correct. Finally, select Create.

connection

NOTE: The database credentials will be provided by your database administrator. If you have provisioned a Db2 instance on Cloud Pak for Data, you can follow the steps here to get the credentials.
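
If you want to verify the credentials outside the UI, here is a minimal connectivity sketch with the ibm_db Python driver (installed via pip install ibm-db), mirroring what the Test Connection button checks; every value in the DSN below is a placeholder:

```python
# Test a Db2 connection, much like the Test Connection button does.
# All credential values are placeholders supplied by your administrator.
import ibm_db

dsn = (
    "DATABASE=BLUDB;"       # database name
    "HOSTNAME=<hostname>;"  # Db2 host
    "PORT=50000;"           # default non-SSL Db2 port
    "PROTOCOL=TCPIP;"
    "UID=<username>;"
    "PWD=<password>;"
)

try:
    conn = ibm_db.connect(dsn, "", "")
    print("Connection succeeded")
    ibm_db.close(conn)
except Exception as err:
    print("Connection failed:", err)
```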

Step 5: Add Data Refinery to the project and perform Data Engineering Operations

5.1 Add Data Refinery to the project

We will add a Data Refinery Flow in a similar way.

  • Click on Add to Project and select Data Refinery Flow.

dataRefinery

  • Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, select the table ORDERS, and finally click on ADD.

selectOrders

  • You will now see the Data Refinery Dashboard.

dataRefineryDashboard

5.2 Perform Data Engineering Operations

5.2.1 We will perform a Join in this tutorial. Click on Operation on the top left and click on Join.

joinOperation

5.2.2 Select Inner Join and add the second dataset from our Db2 by clicking the button shown.

addDataset

5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, select the table ORDERITEMS, and finally click on APPLY.

selectOrderItems

5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.

keyOrderId

5.2.5 Click on APPLY to apply the Join Operation.

applyOperation

  • Repeat steps 5.2.1 through 5.2.5 to join the remaining tables to the result by product id and seller id (a pandas sketch of the full join follows this list).

  • Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.

  • Select the JOIN KEYS for ORDERS and SELLERS as seller_id.
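
For reference, the three inner joins that the flow performs can be sketched in a few lines of pandas; this is only an illustration of what Data Refinery computes, assuming the CSV files from step 1:

```python
# Reproduce the flow's three inner joins with pandas merge().
import pandas as pd

orders = pd.read_csv("brazilian-ecommerce/olist_orders_dataset.csv")
order_items = pd.read_csv("brazilian-ecommerce/olist_order_items_dataset.csv")
products = pd.read_csv("brazilian-ecommerce/olist_products_dataset.csv")
sellers = pd.read_csv("brazilian-ecommerce/olist_sellers_dataset.csv")

merged = (
    orders
    .merge(order_items, on="order_id", how="inner")  # steps 5.2.1-5.2.5
    .merge(products, on="product_id", how="inner")   # join by product_id
    .merge(sellers, on="seller_id", how="inner")     # join by seller_id
)
print(merged.shape)
```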

Step 6: Save the Enhanced Dataset to a table in Db2 and Run the Job

Once the operations are performed, it's time to save the result in a table. By default the resulting table is saved as a .csv file in the project, but we will change the output path to the Db2 database.

6.1 Save the Enhanced Dataset to a table in Db2

  • Click on the Edit button on the top right as shown.

editButton

  • Then click on the Pencil button as shown.

edit2

  • Click on Change Location. Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, and finally click on SAVE LOCATION.

selectDbLocation

  • Name the dataset DERIVEDDATA and click on Done.

NOTE: Use uppercase naming only, as Db2 stores table names in uppercase.

tickButton
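
For reference, the equivalent write-back can be sketched with pandas and the ibm_db_sa SQLAlchemy dialect (pip install ibm-db-sa); `merged` is the joined DataFrame from the pandas sketch in section 5.2, and the connection values are placeholders:

```python
# Write the joined DataFrame to a Db2 table named DERIVEDDATA.
from sqlalchemy import create_engine

engine = create_engine(
    "db2+ibm_db://<username>:<password>@<hostname>:50000/BLUDB"
)

# Db2 folds unquoted identifiers to uppercase, so keep the name uppercase.
merged.to_sql("DERIVEDDATA", engine, if_exists="replace", index=False)
```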

6.2 Run the Data Refinery Job

  • Click on Save and create a job as shown.

createJob

  • Give a name to the Job and finally click on Create and Run.

createRun

  • The job will start running; it takes approximately 4-5 minutes to complete.

runningJob

  • Once the job status becomes Completed, you can check your database for a new table named DERIVEDDATA containing the result.

Steps to deploy on IBM Cloud

Step 1: Download the data

In this tutorial we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the link given below.

After downloading, extract the brazilian-ecommerce.zip file.

We’ll be using the following files:

  1. brazilian-ecommerce/olist_orders_dataset.csv : This is the core dataset. From each order you can find all the other information.

  2. brazilian-ecommerce/olist_order_items_dataset.csv : This dataset includes data about the items purchased within each order.

  3. brazilian-ecommerce/olist_products_dataset.csv : This dataset includes data about the products sold by Olist.

  4. brazilian-ecommerce/olist_sellers_dataset.csv : This dataset includes data about the sellers that fulfilled orders made at Olist.

Step 2: Load the data into tables in Db2

NOTE: You can skip this step if you do not want to use a Db2 instance; you can use another on-prem, public, or private database of your choice and load the datasets there.

  • Create a Db2 resource.

db2Resource

  • Once the resource is ready, click on Service Credentials on the left panel and then click View credentials.

viewCredentials

NOTE: Copy these credentials, as they will be used in step 4.
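
If you plan to script against the database later, you can build an ibm_db DSN from the copied credentials; a minimal sketch, noting that the key names used below (db, hostname, port, username, password) are assumptions and the actual layout of the JSON varies by Db2 plan:

```python
# Build an ibm_db DSN from the service credentials copied in this step.
# "db2-credentials.json" is a hypothetical local file holding that JSON;
# the key names are assumptions, so check them against your copy.
import json
import ibm_db

with open("db2-credentials.json") as f:
    creds = json.load(f)

dsn = (
    f"DATABASE={creds['db']};"
    f"HOSTNAME={creds['hostname']};"
    f"PORT={creds['port']};"
    "PROTOCOL=TCPIP;"
    f"UID={creds['username']};"
    f"PWD={creds['password']};"
)
conn = ibm_db.connect(dsn, "", "")
```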

  • Now click on Manage on the left panel and then click on Open Console to open the Db2 Console.

openConsole

  • Once the Db2 console is open, click on load data.

loadData

  • Select the olist_orders_dataset.csv file and select next.

browseFile

  • Choose your namespace, create a table named ORDERS, and select next.

Note: Make sure you have selected the default schema of your database. In the case of Db2, your default schema is your username.

createTable

  • You can preview the metadata of the table and select next.

tableMetaData

  • Click on Begin Load to import the downloaded .csv file into your Db2.

beginLoad

  • Wait for the upload to finish.

waitForUpload

  • Once the table is created, click on Load More Data to add the other three datasets.

loadMoreData

  • Load olist_order_items_dataset.csv and name the table ORDERITEMS, load olist_products_dataset.csv and name the table PRODUCTS, and finally load olist_sellers_dataset.csv and name the table SELLERS, repeating the above steps.

Step 3: Create a Watson Studio Service

Once the database is ready, we will start using it in Watson Studio on IBM Cloud.

  • Create a Watson Studio service on IBM Cloud.

createwatsonstudio

  • Then click on Get Started.

  • In Watson Studio click Create a project > Create an empty project and name it Retail.

watsonstudioproject

Step 4: Add Db2 connection to the project

Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.

  • Click on Add to Project and select Connection. If you followed step 2, select Db2 from the list and add the credentials of your provisioned Db2 instance. If you have a different database, select it and fill in the credentials.

gif

  • After filling in the credentials, click on Create.

connection

NOTE: The database credentials were generated in step 2.

Step 5: Add Data Refinery to the project and perform Data Engineering Operations

5.1 Add Data Refinery to the project

We will add a Data Refinery Flow in a similar way.

  • Click on Add to Project and select Data Refinery Flow.

dataRefinery

  • Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, select the table ORDERS, and finally click on ADD.

selectOrders

  • You will now see the Data Refinery Dashboard.

dataRefineryDashboard

5.2 Perform Data Engineering Operations

5.2.1 We will perform a Join in this tutorial. Click on Operation on the top left and click on Join.

joinOperation

5.2.2 Select Inner Join and add the second dataset from our Db2 by clicking the button shown.

addDataset

5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, select the table ORDERITEMS, and finally click on APPLY.

selectOrderItems

5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.

keyOrderId

5.2.5 Click on APPLY to apply the Join Operation.

applyOperation

  • Repeat steps 5.2.1 through 5.2.5 to join the remaining tables to the result by product id and seller id, as in the pandas sketch shown earlier.

  • Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.

  • Select the JOIN KEYS for ORDERS and SELLERS as seller_id.

Step 6: Save the Enhanced Dataset to a table in Db2 and Run the Job

Once the operations are performed, it's time to save the result in a table. By default the resulting table is saved as a .csv file in the project, but we will change the output path to the Db2 database.

6.1 Save the Enhanced Dataset to a table in Db2

  • Click on the Edit button on the top right as shown.

editButton

  • Then click on the Pencil button as shown.

edit2

  • Click on Change Location. Under Assets, click on Connections, then click on the connection that you created in step 4, click on the schema of your database, and finally click on SAVE LOCATION.

selectDbLocation

  • Name the dataset DERIVEDDATA and click on Done.

NOTE: Use uppercase naming only, as Db2 stores table names in uppercase.

tickButton

6.2 Run the Data Refinery Job

  • Click on Save and create a job as shown.

createJob

  • Give a name to the Job and finally click on Create and Run.

createRun

  • The job will start running; it takes approximately 4-5 minutes to complete.

runningJob

  • Once the job status becomes Completed, you can check your database for a new table named DERIVEDDATA containing the result; a quick read-back check is sketched below.
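
A minimal read-back sketch with the ibm_db_dbi driver (the connection values are placeholders from your service credentials):

```python
# Read a few rows of DERIVEDDATA back from Db2 to verify the job output.
import ibm_db_dbi

conn = ibm_db_dbi.connect(
    "DATABASE=BLUDB;HOSTNAME=<hostname>;PORT=50000;PROTOCOL=TCPIP;"
    "UID=<username>;PWD=<password>;",
    "", ""
)
cur = conn.cursor()
cur.execute("SELECT * FROM DERIVEDDATA FETCH FIRST 5 ROWS ONLY")
for row in cur.fetchall():
    print(row)
```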

Summary

A data scientist cannot build a model directly from a raw dataset; data collection and analysis are essential before building a model. This tutorial lets data scientists perform data engineering operations on any data easily, reducing the time spent on those operations so they can focus on building models. The main advantage of the Data Refinery capability of IBM Cloud Pak for Data is the ability to create a set of ordered steps to cleanse, shape, and enhance data.
