
Virtualizing Cloudera Data using IBM Cloud Pak for Data

The purpose of this article is to highlight the ability of IBM Cloud Pak for Data to integrate with other market-leading data management platforms, such as Cloudera Data Platform.

Due to the complexity of installing Cloudera Data Platform and Cloud Pak for data, it is assumed the reader will already have access to these environments. If not, you can still follow along and get insights into how integration can be accomplished.

Topics

  1. Overview of technologies
  2. Architecture overview
  3. Load data on Cloud Pak for Data
  4. Create Db2 connection to access data on Cloud Pak for Data
  5. Load data on Cloudera Data Platform
  6. Use IBM BigSQL to synchronize data from Hive table into Cloud Pak for Data
  7. Use Data Virtualization to merge data from Cloudera and Cloud Pak for Data
  8. Build Cognos Analytics dashboard to visualize merged data

Video

To view the steps in this tutorial as a video, see https://www.youtube.com/watch?v=NXuqBUnZfxg

NOTE: Database table and schema names may differ between the video and the instructions listed below, but the steps performed should be similar.

1. Overview of technologies

Let's start with an overview of the technologies that we will be utilizing:

Cloudera Data Platform

Cloudera Data Platform (CDP Private Cloud) is built for hybrid cloud, connecting on-premises environments to public clouds. It disaggregates compute and storage, allowing compute and storage clusters to scale independently. Analytics run on containerized compute nodes, backed by a scalable object store and a secure data lake.

  • Cloudera Manager provides administration of your Cloudera cluster, including operations for installation, upgrading, and host management and monitoring.

  • Apache Hive runs over the Hadoop framework and provides an SQL-like interface for processing and querying the HDFS data.

  • Apache Impala is similar to Hive, but with low-latency and high concurrency it provides a better option for interactive computing.

  • Apache Knox is an application gateway for interacting with the REST APIs and UIs of Hadoop deployments. Knox presents consumers with one endpoint for access to all the required services across multiple Hadoop clusters.

IBM Cloud Pak for Data

IBM Cloud Pak for Data is a unified, pre-integrated data and AI platform that runs natively on the Red Hat OpenShift Container platform. Services are delivered with an open and extensible cloud native platform for collecting, organizing, and analyzing data. It’s a single interface to perform end-to-end analytics with built-in governance. It also supports and governs the end-to-end AI workflow.

  • IBM Db2 is a Relational Database Management System (RDBMS). Along with providing the Db2 relational database, it includes a family of tools that allows you to manage both structured and unstructured data across on-premises and multi-cloud environments.

  • IBM Big SQL provides a single database connection to query data across Hadoop and other relational/NoSQL databases. Data can reside on local systems or on the cloud.

  • Data Virtualization can query data across many systems without having to copy and replicate data. Results are accurate because you're querying the latest data at its source.

  • IBM Cognos Analytics Dashboard: self-service analytics, infused with AI and machine learning, enables you to create stunning visualizations and share your findings through dashboards and reports.

2. Architecture overview

Versions

We tested CDP versions 7.1.6 and 7.1.7, and Cloud Pak for Data versions 3.5.2 and 4.0.1.

CDP configuration

We provisioned 8 VMs running CentOS 7.9 on IBM Cloud to host our CDP instance.

We also provisioned two additional virtual servers -- one for a bastion node so that we could easily SSH into all the other nodes, and the other for an Active Directory Server to manage authentication for our cluster.

vm-architecture

Here are the nodes listed in our IBM Cloud devices list:

ibm-cloud-vms

  • cid-bastion is our bastion
  • cid-adc is our Active Directory server
  • cid-vm-01 to cid-vm-03 are our master nodes
  • cid-vm-04 to cid-vm-06 are our worker nodes
  • cid-vm-07 and cid-vm-08 are our edge nodes

Windows 2019 Server

We set up a Windows 2019 Server [cid-adc] to take advantage of its Active Directory service for user management, and its Domain Services to manage DNS. We created a private domain, cdplab.local, so most of the traffic would stay on private networks.

Cloudera Manager

We manage our CDP using Cloudera Manager running on cid-vm-01. Here you see our cdplab cluster and list of hosts:

cloudera-manager

Here is the list of services available on the cluster:

cdp-service-list

IBM Cloud Pak for Data configuration

Cloud Pak for Data comes pre-packaged with a host of tools and services that can be instantiated. We provisioned Data Virtualization, Db2, Big SQL, and Cognos Analytics.

cpd-instances

3. Load data on Cloud Pak for Data

In Cloud Pak for Data we will add data to the Db2 instance. The data Sales.csv is meant to represent transactions and contains information such as profit, revenue, and sales price.

The Db2 service has been provisioned in our Cloud Pak for Data instance. To view it, from the main Cloud Pak for Data menu, select Services, and then click on Instances.

cpd-db2-instance

Click on the Db2 instance name to bring up the details panel.

cpd-db2-details

Click on the Open database button in the top right corner.

From the Db2 instance Summary panel, click Load Data.

cpd-db2-load-option

From the File selection window, upload the Sales.csv CSV file, then click Next.

cpd-db2-load-page-1

To create a new sales table, first create a new GREAT_OUTDOORS schema, and then click the New table + button. Enter SALES as the new table name and then click Create.

cpd-db2-load-page-2

Once our table is created, click Next.

cpd-db2-load-page-3

You can then view all of the data rows that will be uploaded into our new SALES table.

cpd-db2-load-page-4

Note that you can toggle the Header in first row button to view the column header names.

Click Next to start loading the data. If successful, there should be no error messages.

To view the data, from the Db2 summary panel, click Explore > Tables.

4. Create a JDBC connection to Db2

After loading the data, we need to create a new connection to access it. Note that this task requires you to log in as admin on your CPD console.

From the main menu, select Data and then Platform connections.

cpd-data-platform-connections

From the Platform connections panel, click New connection.

cpd-new-connection

From the New connection panel, in the IBM list of connection options, select Db2.

cpd-connection-types

To fill in the data for the new connection, take the values from the following two panels:

  1. Panel 1: The Db2 instance panel, which you can navigate to by selecting Services and then Instances from the main menu on the CPD Home page, and then clicking on the Db2 instance name.

    cpd-db2-instance-details

  2. Panel 2: The Db2 instance connection information panel. From the previous details page, click on Open database, then select Connection Information from the Summary drop-down menu.

    cpd-db2-connection-info

Transfer the data from these panels into the new connection panel and click Test to ensure it works.

cpd-db2-connection-test

  • Enter a unique name in the connection Name field.
  • Database name can be found in panel 1.
  • Hostname can be found in panel 2.
  • Port can be parsed from the JDBC Connection URL (SSL) found in panel 1.
  • Username and Password are your Cloud Pak for Data credentials.
  • SSL certificate can be downloaded from panel 1.
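
As an aside, the hostname, port, and database can all be read off the JDBC connection URL. Here is a minimal Python sketch against a made-up URL; substitute the actual value shown on your own Db2 instance panel:

```python
import re

# Hypothetical JDBC URL of the kind shown on the Db2 instance panel;
# replace it with the value from your own console.
jdbc_url = "jdbc:db2://db2-ucdsrv.example.com:31234/BLUDB:sslConnection=true;"

# Pull host, port, and database out of the jdbc:db2://host:port/db URL.
match = re.match(r"jdbc:db2://([^:/]+):(\d+)/([^:;]+)", jdbc_url)
host, port, database = match.group(1), int(match.group(2)), match.group(3)

print(host, port, database)
```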

Click Create to create the connection.

5. Load data on Cloudera Data Platform

On the Cloudera side, we will use HDFS to populate our Hive database. This step uses the data set Products.csv, which contains information about products (brand, description, price, ID, etc.).

The first step is to add the data to a user's HDFS directory. Download the data where appropriate and run the command below (replace the last value with a directory you have access to).

hdfs dfs -put Products.csv /user/<your-user-name>

Next we'll connect to Hive using the beeline CLI tool. To do this, download the Hive-on-Tez configuration file by selecting the Download Client Configuration option listed under the Actions drop-down menu for the Cloudera Hive on Tez service.

cloudera-hive-client-config

Unzip the downloaded file and open the beeline-site.xml file to find the URL associated with the property beeline-hs2-jdbc-url-hive_on_tez. Copy this value and then exit the file.
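
If you prefer to pull the URL out programmatically, here is a short Python sketch. It runs against a trimmed-down, placeholder beeline-site.xml; the property name follows the tutorial text and may differ slightly between releases:

```python
import xml.etree.ElementTree as ET

# A stand-in for the downloaded beeline-site.xml (values are placeholders,
# not from a real cluster).
beeline_site = """<?xml version="1.0"?>
<configuration>
  <property>
    <name>beeline-hs2-jdbc-url-hive_on_tez</name>
    <value>jdbc:hive2://cid-vm-02.cdplab.local:2181/;serviceDiscoveryMode=zooKeeper</value>
  </property>
</configuration>"""

# Build a name -> value map of every <property> and look up the Hive-on-Tez URL.
root = ET.fromstring(beeline_site)
props = {p.findtext("name"): p.findtext("value") for p in root.iter("property")}
jdbc_url = props["beeline-hs2-jdbc-url-hive_on_tez"]
print(jdbc_url)
```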

Log into one of the nodes in your CDP cluster and start the beeline CLI by typing:

beeline

NOTE: The authentication process will depend on how you configured your CDP Private Cloud instance.

Connect to the Hive service by typing the command below, replacing the jdbc-url value with the value copied from the previous step:

!connect <jdbc-url>

Use the following SQL commands to create a new database and switch context to it:

CREATE DATABASE great_outdoors;
USE great_outdoors;

Use the following SQL command to create a new table for the products:

CREATE TABLE products (
  `Product_number` INT,
  `Product_line` STRING,
  `Product_type` STRING,
  `Product` STRING,
  `Introduction_date` STRING,
  `Product_brand` STRING,
  `Product_color` STRING,
  `Product_size` STRING,
  `Product_description` STRING,
  `Unit_cost` STRING,
  `Unit_price` STRING)
row format delimited fields terminated by ','
STORED AS textfile TBLPROPERTIES('transactional'='false');

Use the following SQL command to load the data from HDFS to the table:

LOAD DATA INPATH '/<user-dir>/Products.csv' INTO TABLE products;

If the data is loaded correctly, you can query it by running the SQL command below:

SELECT * from products;

6. Use IBM BigSQL to synchronize data from Hive table into Cloud Pak for Data

At this point we'll use BigSQL to synchronize data from Hive to Db2, so make sure you have the service installed on your CPD cluster.

cpd-big-sql-instance

Open the Big SQL service by clicking Open in the action menu.

From the Summary drop-down menu, click the Run SQL option.

Click on the Create new + button, enter the following SQL command, and click on the Run all button.

CALL SYSHADOOP.HCAT_SYNC_OBJECTS('GREAT_OUTDOORS', 'PRODUCTS', 't');

From the dropdown menu click Explore > Hadoop tables. Verify that the product data shows up when going to the GREAT_OUTDOORS schema and viewing the PRODUCTS table.

Create a JDBC connection to BigSQL

From the main menu, select Data and then Platform connections.

cpd-data-platform-connections

From the Platform connections panel, click New connection.

cpd-new-connection

From the New connection panel, in the IBM list of connection options, select Db2 BigSQL.

Below is an example of a Big SQL connection that you can model your own connection on.

cpd-big-sql-connection-details

NOTE: To determine the host name, follow the procedure outlined in the Setting up a connection to Db2 Big SQL instructions. We used the second suggested method, the "Passthrough secure route method". The result was a route that we could use for the Hostname or IP Address field.

For Username and Password use your Cloudera Data Platform credentials.

For SSL certificate, you must run an OpenSSL command against the hostname just generated.

openssl s_client -showcerts -servername <hostname without port> -connect <hostname>:<port> < /dev/null

This will return a certificate chain, so you will need to copy both certificates and paste them into the SSL certificate field. Be sure to remove any extraneous lines outside of the BEGIN and END certificate delimiters.
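
Extracting just the PEM blocks can also be scripted. Here is a Python sketch against abbreviated, fake openssl output (the certificate bodies are placeholders):

```python
import re

# Abbreviated example of openssl s_client output (fake certificates); real
# output includes handshake chatter that must not go into the SSL field.
openssl_output = """depth=1 CN = example-ca
-----BEGIN CERTIFICATE-----
MIIB...server-cert-body...
-----END CERTIFICATE-----
 1 s:CN = example-ca
-----BEGIN CERTIFICATE-----
MIIB...ca-cert-body...
-----END CERTIFICATE-----
---
SSL handshake has read 3175 bytes and written 415 bytes
"""

# Keep only the PEM blocks, dropping everything outside the BEGIN/END markers.
pem_blocks = re.findall(
    r"-----BEGIN CERTIFICATE-----.*?-----END CERTIFICATE-----",
    openssl_output,
    re.DOTALL,
)
certificate_field = "\n".join(pem_blocks)
print(certificate_field)
```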

Click the Test button to ensure the connection works.

7. Use Data Virtualization to merge data from Cloudera Data Platform and Cloud Pak for Data

From the Cloud Pak for Data home page menu, open up the Data menu and click on Data Virtualization.

cpd-dv-start

Add data sources

The default landing page for Data Virtualization is Data Sources. Click on the Add data source button.

cpd-dv-data-sources-start

Select the Select existing connection option and choose the cdp-pvc-base connection, which represents the connection we made to Big SQL (and which reads from our Hive databases on Cloudera). Click Add.

cpd-dv-add-connection

Repeat the step for the db2-on-cpd connection representing the connection we made to Db2.

Virtualize tables

From the main drop-down menu, click on Virtualize, which is listed under the Virtualization option.

cpd-dv-virtualize-option

From the Table list, select each of the tables we have created, and add them to the Cart.

  • The SALES table associated with the db2-on-cpd connection (data from Cloud Pak for Data):

    cpd-dv-add-sales-table

  • The PRODUCTS table associated with the cdp-pvc-base connection (data from Cloudera Data Platform):

    cpd-dv-add-products-table

Click the Add to cart button.

Once available, click the View cart button.

cpd-dv-full-cart

Resolve any conflicts, assign to your CPD project, and then click Virtualize.

Once complete, click the View my virtualized data button on the modal dialog.

NOTE: You can also navigate to your data by clicking on the My virtualized data option from the main Data Virtualization menu.

cpd-dv-view-my-data-option

You should now see your new virtualized data tables listed.

cpd-dv-tables-for-join

Join tables

Select the Sales and Products tables, then click the Join button in the table header.

We won't need all of the columns for our join, so start by de-selecting all of the column names.

The columns we are interested in are:

  • Products table: PRODUCT_NUMBER, PRODUCT, PRODUCT_TYPE
  • Sales table: Product number, REVENUE, Gross profit

cpd-dv-select-columns

Connect the primary keys by clicking the Product number row in the Sales table and dragging the cursor onto the PRODUCT_NUMBER row in the PRODUCTS table.

cpd-dv-connect-keys

If successful, the table in the right hand margin should reflect the joining of the keys.

Click the Open in SQL editor button at the top of the right hand margin.

cpd-dv-go-to-sql-editor

In the editor window, you will see the SQL statement generated from your actions in the previous step. We will modify it slightly to get total SUM values for revenue and gross profit. The GROUP BY clause lets us generate the sums, which we can use to create some meaningful Cognos Analytics dashboards (see the next steps). Note that we are also changing the view name.

CREATE VIEW HIGHEST_REVENUE_PRODUCTS
AS
SELECT
    "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" AS "HIGHREV_PRODUCTS_PRODUCT_NUMBER",
    "HIGHREV"."PRODUCTS"."PRODUCT" AS "HIGHREV_PRODUCTS_PRODUCT",
    "HIGHREV"."PRODUCTS"."PRODUCT_TYPE" AS "HIGHREV_PRODUCTS_PRODUCT_TYPE",
    SUM("HIGHREV"."SALES"."REVENUE") AS "HIGHREV_SALES_REVENUE",
    SUM("HIGHREV"."SALES"."Gross profit") AS "HIGHREV_SALES_Gross profit"
FROM
    "HIGHREV"."PRODUCTS",
    "HIGHREV"."SALES"
WHERE
    "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER" = "HIGHREV"."SALES"."Product number"
GROUP BY
    "HIGHREV"."PRODUCTS"."PRODUCT_NUMBER",
    "HIGHREV"."PRODUCTS"."PRODUCT",
    "HIGHREV"."PRODUCTS"."PRODUCT_TYPE"

Click the Run all button to execute the SQL.

cpd-dv-sql-run-all

New view

To see our new view, navigate back to the My virtualized data panel. You should now see the new view.

cpd-dv-new-view

Use the action menu for the view to select the Preview option.

8. Build Cognos Analytics dashboard to visualize merged data

To use Cognos Analytics, we will need to connect to Big SQL to access the Hive data on our Cloudera cluster. Grab the port number for the Big SQL instance from the Red Hat OpenShift console.

Launch Cognos Analytics instance

From the CPD Instances panel, locate the Cognos Analytics service, then click on Open from the action menu.

cpd-cognos-open

This will bring up the Cognos Analytics service home page.

cpd-cognos-home

Add Big SQL connection

To navigate to the Big SQL connection panel, start by clicking the Manage option in the main menu. From the Data server connections panel, click the + button to create a new connection.

From the Type list, select IBM Big SQL.

cpd-cognos-config-connection

Update the fields to match your Big SQL service connection (including port number), and click Test to try it out.

Note that for our setup, we have SSL enabled and need to pass in our user ID and password.

cpd-cognos-connection-params

If successful, change the name of the connection and click Save.

Build dashboard

Once you have your connections and data sources configured, create a new dashboard using the provided visualization tools available in Cognos. Here is an example dashboard (see video for details) that shows the highest revenue product types, and the top 20 products based on revenue.

cpd-cognos-final

Summary

This tutorial covered examples of how to use Platform connections in IBM Cloud Pak for Data to access data sources on Cloud Pak for Data and other platforms, such as Cloudera Data Platform. It details the steps required to create a Db2 connection to Cloud Pak for Data data, and a Big SQL connection to access Hive data on Cloudera. It then shows how the data can be combined and transformed using Data Virtualization. Lastly, it provides examples of how to import this new virtualized data into Cognos Analytics, where dashboards can be built to pull out insights using multiple visualizations.

License

This code pattern is licensed under the Apache Software License, Version 2. Separate third party code objects invoked within this code pattern are licensed by their respective providers pursuant to their own separate licenses. Contributions are subject to the Developer Certificate of Origin, Version 1.1 (DCO) and the Apache Software License, Version 2.

Apache Software License (ASL) FAQ
