---
type: tutorial
completed_date: 2020-01-20
---
In this tutorial, we perform data engineering operations on multiple datasets using Watson Data Refinery on Cloud Pak for Data or Watson Studio on IBM Cloud.
A data scientist cannot build a model directly on raw data; collecting and analyzing the data is essential first. In this tutorial we demonstrate how data scientists can easily collect data from databases, analyze it, and enhance it according to their requirements with the help of Watson Data Refinery on Cloud Pak for Data or Watson Studio on IBM Cloud.
When you have completed this tutorial, you will understand how to:
- Create a set of ordered steps to cleanse, shape, and enhance data.
- Create a connection between a database and Data Refinery.
- Prepare datasets specific to your machine learning model.
- Save the datasets in any database of your choice.

You will need:

- Any SQL database. In this tutorial we demonstrate with Db2 on Cloud Pak for Data and Db2 on IBM Cloud.
- An IBM Cloud account, if you prefer to deploy on IBM Cloud.
Completing this tutorial should take about 30 minutes.
You can follow this tutorial on either Cloud Pak for Data or IBM Cloud.
In this tutorial we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the link given below, then extract the `brazilian-ecommerce.zip` file.
We'll be using the following files:

- `brazilian-ecommerce/olist_orders_dataset.csv`: This is the core dataset. From each order you can find all other information.
- `brazilian-ecommerce/olist_order_items_dataset.csv`: This dataset includes data about the items purchased within each order.
- `brazilian-ecommerce/olist_products_dataset.csv`: This dataset includes data about the products sold by Olist.
- `brazilian-ecommerce/olist_sellers_dataset.csv`: This dataset includes data about the sellers that fulfilled orders made at Olist.
NOTE: We assume you have already provisioned a Db2 instance in your Cloud Pak for Data. If you do not have a Db2 instance provisioned, you can use another on-premises, public, or private database of your choice and load the datasets there.
- Open the Db2 instance and click Load Data.
- Select the olist_orders_dataset.csv file and select Next.
- Choose your namespace and create a table named ORDERS, then select Next.

Note: Make sure you have selected the default schema of your database. For Db2, the default schema is your username.

- Preview the table metadata and select Next.
- Click Begin Load to import the downloaded `.csv` file into your Db2 instance.
- Wait for the upload to finish.
- Once the table is created, click on Load More Data to add the other three datasets.
- Repeating the above steps, load `olist_order_items_dataset.csv` and name the table ORDERITEMS, load `olist_products_dataset.csv` and name the table PRODUCTS, and finally load `olist_sellers_dataset.csv` and name the table SELLERS.
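The load steps above can also be sketched in code. This is a minimal sketch, not part of the tutorial's GUI flow: it assumes pandas and SQLAlchemy are available, and the commented Db2 URL (via the `ibm_db_sa` dialect) is an assumption — an in-memory SQLite engine is used here so the sketch can be tried locally.

```python
# Sketch of the GUI load steps: read each Olist CSV and write it to a
# database table using the table names from this tutorial.
import pandas as pd
from sqlalchemy import create_engine

# (table name, CSV path) pairs matching the tables created above
TABLES = {
    "ORDERS": "brazilian-ecommerce/olist_orders_dataset.csv",
    "ORDERITEMS": "brazilian-ecommerce/olist_order_items_dataset.csv",
    "PRODUCTS": "brazilian-ecommerce/olist_products_dataset.csv",
    "SELLERS": "brazilian-ecommerce/olist_sellers_dataset.csv",
}

def load_tables(engine, tables=TABLES):
    """Load each CSV into its target table, replacing any existing table."""
    for table, path in tables.items():
        pd.read_csv(path).to_sql(table, engine, if_exists="replace", index=False)

# For Db2 you would build the engine from your credentials (assumed URL form):
# engine = create_engine("ibm_db_sa://user:password@host:50000/BLUDB")
# For a quick local try, an in-memory SQLite database works as a stand-in:
# engine = create_engine("sqlite://"); load_tables(engine)
```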
Once the database is ready, we will start using it in Cloud Pak for Data.
- Create a project in Cloud Pak for Data and choose an empty project.
- Once the project is created, you will see the page below.
Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.
- Click Add to Project and select Connection. If you followed step 2, select Db2 from the list and enter the credentials of your provisioned Db2 instance. If you are using a different database, select it and fill in its credentials.
- After filling in the credentials, click Test Connection to verify them. Finally, select Create.
NOTE: The database credentials will be provided by your database administrator. If you have provisioned a Db2 instance on Cloud Pak for Data, you can follow the steps here to get the credentials.
We will add a Data Refinery flow in a similar way.
- Click on Add to Project and select Data Refinery Flow.
- Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database, select the table ORDERS, and finally click ADD.
- You will now see the Data Refinery Dashboard.
5.2.1 We will perform a join. Click Operation at the top left and click Join.
5.2.2 Select Inner Join and add the second dataset from Db2 by clicking the button shown.
5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database, select the table ORDERITEMS, and finally click APPLY.
5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.
5.2.5 Click APPLY to apply the join operation.
- Repeat steps 5.2.1 to 5.2.5 to join the remaining tables to the result by product ID and seller ID.
- Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.
- Select the JOIN KEYS for ORDERS and SELLERS as seller_id.
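The three join steps above can be sketched with pandas merges. The column names follow the Olist dataset, but the rows below are tiny synthetic values made up purely for illustration:

```python
# Replicate the Data Refinery inner joins with pandas merges.
import pandas as pd

# Minimal synthetic rows; real data comes from the four Db2 tables.
orders = pd.DataFrame({"order_id": ["o1"], "customer_id": ["c1"]})
items = pd.DataFrame({"order_id": ["o1"], "product_id": ["p1"], "seller_id": ["s1"]})
products = pd.DataFrame({"product_id": ["p1"], "product_category_name": ["toys"]})
sellers = pd.DataFrame({"seller_id": ["s1"], "seller_city": ["sao paulo"]})

merged = (
    orders
    .merge(items, on="order_id", how="inner")       # ORDERS joined to ORDERITEMS
    .merge(products, on="product_id", how="inner")  # then to PRODUCTS on product_id
    .merge(sellers, on="seller_id", how="inner")    # then to SELLERS on seller_id
)
print(merged)
```

Note that, as in Data Refinery, the product and seller keys come from ORDERITEMS, so each later merge operates on the already-joined result.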
Once the operations are performed, it's time to save the result in a table. By default, the resulting table is saved as a `.csv` file in the project, but we will change the output location to the Db2 database.
- Click on the Edit button on the top right as shown.
- Then click on the Pencil button as shown.
- Click Change Location. Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database and finally click SAVE LOCATION.
- Name the dataset DERIVEDDATA and click Done.

NOTE: Use uppercase names only, as Db2 stores identifiers in uppercase.
- Click Save and create a job as shown.
- Give the job a name and finally click Create and Run.
- The job will start running; it takes approximately 4-5 minutes to complete.
- Once the job status becomes Completed, check your database for a new table named four_tables_merged containing the result.
In this tutorial we use the Brazilian E-Commerce Public Dataset by Olist from Kaggle. Download the dataset from the link given below, then extract the `brazilian-ecommerce.zip` file.
We'll be using the following files:

- `brazilian-ecommerce/olist_orders_dataset.csv`: This is the core dataset. From each order you can find all other information.
- `brazilian-ecommerce/olist_order_items_dataset.csv`: This dataset includes data about the items purchased within each order.
- `brazilian-ecommerce/olist_products_dataset.csv`: This dataset includes data about the products sold by Olist.
- `brazilian-ecommerce/olist_sellers_dataset.csv`: This dataset includes data about the sellers that fulfilled orders made at Olist.
NOTE: You can skip this step if you do not want to use a Db2 instance; you can use another on-premises, public, or private database of your choice and load the datasets there.
- Create a Db2 resource.
- Once the resource is ready, click Service Credentials in the left panel, then click View Credentials.
NOTE: Copy these credentials; they will be used in step 4.
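As an illustration of how those credentials are typically used, the snippet below assembles a Db2 connection string from a credentials dictionary. The key names (`database`, `hostname`, `port`, `username`, `password`) mirror the usual Db2 on Cloud credentials layout but are an assumption — check the field names in your own credentials JSON.

```python
# Build an ibm_db-style DSN from the service credentials.
creds = {  # placeholder values, not real credentials
    "database": "BLUDB",
    "hostname": "db2.example.cloud.ibm.com",
    "port": 50001,
    "username": "user",
    "password": "secret",
}

dsn = (
    "DATABASE={database};HOSTNAME={hostname};PORT={port};"
    "PROTOCOL=TCPIP;UID={username};PWD={password};SECURITY=SSL"
).format(**creds)

# With the ibm_db driver installed, you would then connect like:
# import ibm_db
# conn = ibm_db.connect(dsn, "", "")
```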
- Now click Manage in the left panel, then click Open Console to open the Db2 console.
- Once the Db2 console is open, click Load Data.
- Select the olist_orders_dataset.csv file and select Next.
- Choose your namespace and create a table named ORDERS, then select Next.

Note: Make sure you have selected the default schema of your database. For Db2, the default schema is your username.

- Preview the table metadata and select Next.
- Click Begin Load to import the downloaded `.csv` file into your Db2 instance.
- Wait for the upload to finish.
- Once the table is created, click on Load More Data to add the other three datasets.
- Repeating the above steps, load `olist_order_items_dataset.csv` and name the table ORDERITEMS, load `olist_products_dataset.csv` and name the table PRODUCTS, and finally load `olist_sellers_dataset.csv` and name the table SELLERS.
Once the database is ready, we will start using it in Watson Studio on IBM Cloud.
- Create the Watson Studio service.
- Then click Get Started.
- In Watson Studio, click Create a project > Create an empty project and name it `Retail`.
Now that we have created a project, we will start adding components to it, beginning with the Db2 connection.
- Click Add to Project and select Connection. If you followed step 2, select Db2 from the list and enter the credentials of your provisioned Db2 instance. If you are using a different database, select it and fill in its credentials.
- After filling in the credentials, click Create.
NOTE: The database credentials were generated in step 2.
We will add a Data Refinery flow in a similar way.
- Click on Add to Project and select Data Refinery Flow.
- Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database, select the table ORDERS, and finally click ADD.
- You will now see the Data Refinery Dashboard.
5.2.1 We will perform a join. Click Operation at the top left and click Join.
5.2.2 Select Inner Join and add the second dataset from Db2 by clicking the button shown.
5.2.3 We will first join the ORDERS table with the ORDERITEMS table from Db2. Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database, select the table ORDERITEMS, and finally click APPLY.
5.2.4 Select the JOIN KEYS for ORDERS and ORDERITEMS as order_id and click NEXT.
5.2.5 Click APPLY to apply the join operation.
- Repeat steps 5.2.1 to 5.2.5 to join the remaining tables to the result by product ID and seller ID.
- Select the JOIN KEYS for ORDERS and PRODUCTS as product_id.
- Select the JOIN KEYS for ORDERS and SELLERS as seller_id.
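For intuition, the result that the chained joins compute can be expressed as a single SQL statement. The sketch below runs it against an in-memory SQLite database with one synthetic row per table standing in for Db2; the table and key names match this tutorial, the row values are made up.

```python
# Express the chained Data Refinery joins as one SQL query.
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ORDERS     (order_id TEXT, customer_id TEXT);
CREATE TABLE ORDERITEMS (order_id TEXT, product_id TEXT, seller_id TEXT);
CREATE TABLE PRODUCTS   (product_id TEXT, product_category_name TEXT);
CREATE TABLE SELLERS    (seller_id TEXT, seller_city TEXT);
INSERT INTO ORDERS     VALUES ('o1', 'c1');
INSERT INTO ORDERITEMS VALUES ('o1', 'p1', 's1');
INSERT INTO PRODUCTS   VALUES ('p1', 'toys');
INSERT INTO SELLERS    VALUES ('s1', 'sao paulo');
""")

rows = conn.execute("""
SELECT o.order_id, p.product_category_name, s.seller_city
FROM ORDERS o
JOIN ORDERITEMS i ON i.order_id   = o.order_id
JOIN PRODUCTS   p ON p.product_id = i.product_id
JOIN SELLERS    s ON s.seller_id  = i.seller_id
""").fetchall()
print(rows)  # [('o1', 'toys', 'sao paulo')]
```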
Once the operations are performed, it's time to save the result in a table. By default, the resulting table is saved as a `.csv` file in the project, but we will change the output location to the Db2 database.
- Click on the Edit button on the top right as shown.
- Then click on the Pencil button as shown.
- Click Change Location. Under Assets, click Connections, then click the connection that you created in step 4. Click the schema of your database and finally click SAVE LOCATION.
- Name the dataset DERIVEDDATA and click Done.

NOTE: Use uppercase names only, as Db2 stores identifiers in uppercase.
- Click Save and create a job as shown.
- Give the job a name and finally click Create and Run.
- The job will start running; it takes approximately 4-5 minutes to complete.
- Once the job status becomes Completed, check your database for a new table named four_tables_merged containing the result.
A data scientist cannot build a model directly on raw data; collecting and analyzing the data is essential first. This tutorial shows how data scientists can perform data engineering operations on any data easily, reducing the time spent on those operations so they can focus mainly on building models. The main advantage of the Data Refinery capabilities of IBM Cloud Pak for Data is the ability to create a set of ordered steps to cleanse, shape, and enhance data.