#### HOL2184 - lab 4 - Data Virtualization

## 1. Introduction

Welcome to the lab for Data Virtualization. 

In this lab you analyze data from multiple data sources, without copying data.

This hands-on lab uses data from 4 data sources, where the data is “virtually” available through the IBM Cloud Pak for Data Virtualization Service. This makes it easy to analyze data from across your multi-cloud enterprise using tools like Jupyter Notebooks, Watson Studio, or your favorite reporting tool, such as Cognos.


## 2. Exploring Data Source Connections
Let's start by looking at the Data Source Connections that are available in this environment.

1. Click the three bar (hamburger) menu at the top left of the console
2. Click on the Data menu item if it is not already expanded
3. Right click **Data Virtualization** and select **Open in New Window**
    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV02.png"><br>
4. Click on the submenu (**Virtualize**) and select **Data Sources** to show the currently defined data sources.
    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV03.png"><br>
5. Click **Constellation View**. A spider diagram of the connected data sources opens.
    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV04.png"><br>

    This displays the Data Source Graph with a number of active data sources:
    * 4 Db2 Family Databases hosted on premises, on Cloud Pak for Data and on OpenShift on AWS
    * 1 Remote connector
    * 1 MongoDB Enterprise data server running as a Cloud Pak for Data service and on premises
    * 1 Enterprise DB Postgres data server running on premises and on Cloud Pak for Data
    * 1 Netezza Performance Server (using the Pure Data for Analytics connection) running on the Cloud
    * 1 MySQL data server running on premises
    * 1 Informix Database running on premises 
    * 1 BigSQL engine running as a Cloud Pak for Data service
    * 1 file system on a remote server


## 3. Virtualize tables

The data sources are already defined for this lab. Now we want to virtualize tables so that we can run queries over several data sources. IBM Cloud Pak for Data searches through the available data sources and compiles a single large inventory of all the tables and data available to virtualize in IBM Cloud Pak for Data. 

1. Click the Data Virtualization menu and select **Virtualize** under **Virtualization**
    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV05.png"><br>
    
2. Check the total number of available tables at the top of the list. There should be hundreds available. We now virtualize our tables from various data sources. You can type the name of a table in the search field or reduce the number of tables by restricting the data source type. We start with the **ORDERS** table.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV06.png"><br>

3. Find the **ORDERS** table in data source **BigSQL** with Schema **USER00**. You can preview the table by clicking the eye icon on the right. Select the entry by checking the box on the left and press **Add to cart**. You now see on the top right that the first item is in the shopping cart.


4. We will now add more tables to the shopping cart. Follow the same process for these tables:

- Data source **Db2 Warehouse on Cloud Pak for Data**
    - USER.CUSTOMER
    - USER.LINEITEM
    - USER.REGION
    - USER.NATION
- Data source **EDB Postgres on Cloud Pak for Data**
    - public.supplier
- Data source **MongoDB orders**
    - ORDERS-DATABASE.PARTSUPP-COLLECTION
    
5. We now should have 7 items in the cart. Now we add a file as a table. Select **File** from the tab above.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV22.png"><br>
    
6. Open the file server **server7**.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV23.png"><br>

7. Select the directory **additionalData** and select the file **part.csv**.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV24.png"><br>

8. Add the file to the cart as well. We now should have 8 items in the cart. Open the cart by clicking **View cart(8)**

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV25.png"><br>

9. The cart contains 8 items, and some of them have to be changed. First select **My virtualized data** as the target. Then change the names of the following items:
    - supplier ⇒ SUPPLIER
    - PARTSUPP-COLLECTION ⇒ PARTSUPP
    - part_csv ⇒ PART
<br><br>
    
10. For some tables we have to change the columns. Click on the 3 dots on the right for table **SUPPLIER** and select **Edit columns**.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV09.png"><br>

11. Change all column names to uppercase for this table and click on **APPLY**

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV10.png"><br>
    
12. Click **Edit columns** for **PARTSUPP** and deselect the columns **INDEX** and **_ID**. 

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV11.png"><br>

13. We are now ready to virtualize the tables. Click on **Virtualize** to do so.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV12.png"><br>

14. After the virtualization has finished, click on **View my virtualized data**.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV13.png"><br>

15. The table shows the virtualized tables. There might be more tables than you published here as you can also see virtualized tables others made available to you. You can filter by your username to see the tables created in the previous steps.

    <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV14.png"><br>
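
Steps 9 to 11 rename tables and columns to uppercase. The lab does not say why. One plausible reason, stated here as an assumption, is that Db2 SQL folds unquoted identifiers to uppercase, so a name stored in lowercase or containing a hyphen can only be reached with double quotes. The following minimal Python sketch illustrates this with a simplified identifier rule. The helper names `needs_quotes` and `reference` are hypothetical and are not part of any IBM tool.

```python
import re

# Simplified rule (an assumption for illustration): an "ordinary" identifier
# starts with a letter and continues with letters, digits, or underscores.
# Reserved words and special characters such as $, # and @ are ignored here.
ORDINARY = re.compile(r"[A-Z][A-Z0-9_]*")


def needs_quotes(stored_name: str) -> bool:
    """Return True if the stored name must be written in double quotes in SQL."""
    # Unquoted identifiers are folded to uppercase, so only names that are
    # already uppercase and match the ordinary pattern can be typed unquoted.
    return ORDINARY.fullmatch(stored_name) is None


def reference(stored_name: str) -> str:
    """Return the text to type in a query to reach the stored name."""
    return f'"{stored_name}"' if needs_quotes(stored_name) else stored_name


# The three renames from step 9: how each name is referenced before and after.
renames = [
    ("supplier", "SUPPLIER"),
    ("PARTSUPP-COLLECTION", "PARTSUPP"),
    ("part_csv", "PART"),
]
for before, after in renames:
    print(f"{reference(before):<24} -> {reference(after)}")
```

Under this rule, each original name needs quoting and each renamed one does not. That would explain why the queries later in the lab can use names such as `supplier` and `s_acctbal` without quotes.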


## 4. Access DV from SQL

Data Virtualization in Cloud Pak for Data behaves like a database. It offers a JDBC interface for running SQL and uses the Db2 drivers.

The next part of the lab relies on a Jupyter notebook extension, commonly referred to as a "magic" command, to connect to a Db2 database. To use the commands, you load the extension by running another notebook called db2 that contains all the required code:
<pre>
&#37;run db2.ipynb
</pre>
The cell below loads the Db2 extension directly from GitHub. Note that it will take a few seconds for the extension to load, so you should generally wait until the "Db2 Extensions Loaded" message is displayed in your notebook.
1. Click the cell below
2. Click **Run**. When the cell is finished running, In[*] will change to In[2]

<pre>
# !wget https://raw.githubusercontent.com/IBM/db2-jupyter/master/db2.ipynb
!wget -O db2.ipynb https://raw.githubusercontent.com/Db2-DTE-POC/CPDDVLAB/master/db2.ipynb

%run db2.ipynb
print('db2.ipynb loaded')
</pre>

### 4.1 Gaining Insight from Virtualized Data

To connect to the Data Virtualization engine we have to specify a user and password. Change the values below to those of your assigned lab user. After changing the values, click on **Run** above or press **Shift-Enter** to execute the code cell.

<pre>
# Connect to the IBM Cloud Pak for Data Virtualization Database from inside CPD

user = "USERxx"
password = 'HOL2184'

database = 'bigsql'
host = '10.1.1.1'
port = '32601'

%sql CONNECT TO {database} USER {user} USING {password} HOST {host} PORT {port}
</pre>
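
Because Data Virtualization exposes a JDBC interface and uses the Db2 drivers, the same settings can in principle be reused by other clients. As a minimal sketch, the Python below assembles a Db2-style JDBC URL and a keyword=value connection string from the values in the cell above. The function names are hypothetical, and whether your environment needs extra options (for example TLS settings) is not covered by this lab, so treat the result as a starting point.

```python
# Values copied from the connection cell above; replace USERxx with your lab user.
settings = {
    "database": "bigsql",
    "host": "10.1.1.1",
    "port": "32601",
    "user": "USERxx",
    "password": "HOL2184",
}


def jdbc_url(s: dict) -> str:
    """Build a Db2 JDBC URL of the form jdbc:db2://host:port/database."""
    return f"jdbc:db2://{s['host']}:{s['port']}/{s['database']}"


def cli_connection_string(s: dict) -> str:
    """Build a keyword=value connection string as used by Db2 CLI-based drivers."""
    return (
        f"DATABASE={s['database']};HOSTNAME={s['host']};PORT={s['port']};"
        f"PROTOCOL=TCPIP;UID={s['user']};PWD={s['password']};"
    )


# Print only the URL so the password is not echoed into the notebook output.
print(jdbc_url(settings))
```

A reporting tool would typically take the JDBC URL plus user and password, while a Python driver such as `ibm_db` would take the keyword=value form.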

Now that you are connected to the Data Virtualization engine you can query the virtualized tables using all the power in the Db2 SQL query engine. 

The 8 tables that we virtualized earlier are now available for querying:

- SUPPLIER
- PART
- PARTSUPP
- CUSTOMER
- NATION
- REGION
- LINEITEM
- ORDERS
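
To keep track of where each of these tables physically lives, the small Python sketch below records the table-to-source mapping chosen in section 3 and lists the distinct data sources behind a set of tables. The mapping is taken from the steps in this lab. The dictionary and function names are hypothetical and exist only for illustration.

```python
# Virtual table -> data source, as selected in section 3 of this lab.
SOURCE_OF = {
    "ORDERS": "BigSQL",
    "CUSTOMER": "Db2 Warehouse on Cloud Pak for Data",
    "LINEITEM": "Db2 Warehouse on Cloud Pak for Data",
    "REGION": "Db2 Warehouse on Cloud Pak for Data",
    "NATION": "Db2 Warehouse on Cloud Pak for Data",
    "SUPPLIER": "EDB Postgres on Cloud Pak for Data",
    "PARTSUPP": "MongoDB orders",
    "PART": "File part.csv on server7",
}


def sources_for(tables):
    """Return the distinct data sources behind the given virtual tables, sorted."""
    return sorted({SOURCE_OF[t.upper()] for t in tables})


# Tables used by the supplier-selection query later in this section.
query_tables = ["part", "supplier", "partsupp", "nation", "region"]
for source in sources_for(query_tables):
    print(source)
```

For the five tables in the supplier-selection query this lists four different data sources, which is why that query is a good test of cross-source joins.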

We start with a query for the suppliers.

<pre>
%sql select * from SUPPLIER FETCH FIRST 5 ROWS ONLY
</pre>

Next, we run a more complex query that joins tables from multiple data sources. This query finds which supplier should be selected to place an order for a given part in a given region. It takes more than 15 seconds to run, so be patient.

While you wait, look at the tables used in the query and check above which data sources they are located in.

<pre>
%%time

%%sql -a
select
    s_acctbal,
    s_name,
    n_name,
    p_partkey,
    p_mfgr,
    s_address,
    s_phone,
    s_comment
from
    part,
    supplier,
    partsupp,
    nation,
    region
where
    p_partkey = ps_partkey
    and s_suppkey = ps_suppkey
    and p_size = 15
    and p_type like '%BRASS'
    and s_nationkey = n_nationkey
    and n_regionkey = r_regionkey
    and r_name = 'EUROPE'
    and ps_supplycost = (
        select
            min(ps_supplycost)
        from
            partsupp,
            supplier,
            nation,
            region
        where
            p_partkey = ps_partkey
            and s_suppkey = ps_suppkey
            and s_nationkey = n_nationkey
            and n_regionkey = r_regionkey
            and r_name = 'EUROPE'
    )
order by
    s_acctbal desc,
    n_name,
    s_name,
    p_partkey;
</pre>
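
To see what the correlated subquery does without waiting for the federated query, the following minimal sketch runs a trimmed-down version of the same query against an in-memory SQLite database. SQLite only stands in for the Data Virtualization engine here. The sample rows are invented for illustration, and only the columns the query needs are modelled.

```python
import sqlite3

con = sqlite3.connect(":memory:")
con.executescript("""
create table region  (r_regionkey integer, r_name text);
create table nation  (n_nationkey integer, n_name text, n_regionkey integer);
create table supplier(s_suppkey integer, s_name text, s_nationkey integer, s_acctbal real);
create table part    (p_partkey integer, p_mfgr text, p_size integer, p_type text);
create table partsupp(ps_partkey integer, ps_suppkey integer, ps_supplycost real);

-- Invented sample data: two European suppliers and one Asian supplier.
insert into region   values (1, 'EUROPE'), (2, 'ASIA');
insert into nation   values (10, 'GERMANY', 1), (11, 'FRANCE', 1), (20, 'JAPAN', 2);
insert into supplier values (100, 'Supplier#A', 10, 500.0),
                            (101, 'Supplier#B', 11, 900.0),
                            (102, 'Supplier#C', 20, 700.0);
insert into part     values (1, 'Manufacturer#1', 15, 'LARGE BRASS'),
                            (2, 'Manufacturer#2', 15, 'SMALL STEEL');
insert into partsupp values (1, 100, 30.0), (1, 101, 20.0), (1, 102, 5.0), (2, 100, 1.0);
""")

query = """
select s_acctbal, s_name, n_name, p_partkey, p_mfgr
from part, supplier, partsupp, nation, region
where p_partkey = ps_partkey
  and s_suppkey = ps_suppkey
  and p_size = 15
  and p_type like '%BRASS'
  and s_nationkey = n_nationkey
  and n_regionkey = r_regionkey
  and r_name = 'EUROPE'
  and ps_supplycost = (
      -- Cheapest supply cost for this part among European suppliers only.
      select min(ps_supplycost)
      from partsupp, supplier, nation, region
      where p_partkey = ps_partkey
        and s_suppkey = ps_suppkey
        and s_nationkey = n_nationkey
        and n_regionkey = r_regionkey
        and r_name = 'EUROPE'
  )
order by s_acctbal desc, n_name, s_name, p_partkey
"""

rows = con.execute(query).fetchall()
print(rows)
```

With this sample data the only row returned is Supplier#B from France. Supplier#C is cheaper overall but is located in Asia, so the subquery ignores it when computing the minimum cost for Europe.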

Data Virtualization has an explain tool that can show you how the query is processed. The above query has the following plan:

   <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV19.png"><br>

Because the plan is very large, the following snippet gives you an impression of the operations performed while calculating the query result:

   <img src="https://raw.githubusercontent.com/HOL2184/HOL2184-resources/main/40-Data_Virtualization/images/DV21.png"><br>


### 4.2 Seeing where your Virtualized Data is coming from
You may eventually work with a complex Data Virtualization schema with dozens or hundreds of data sources. As an administrator or a Data Scientist you may need to understand where data is coming from.

The following query shows the tables available for your user schema.

<pre>
%%sql -a
SELECT TABSCHEMA, TABNAME
  FROM SYSCAT.NICKNAMES
  WHERE TABSCHEMA = :user
  ORDER BY TABSCHEMA, TABNAME
</pre>

If you want to know details about the source of a virtualized table, the following query returns this information for the table **PARTSUPP**.

<pre>
%%sql -a
select * from table(dvsys.GET_VT_SOURCES(:user, 'PARTSUPP'))
</pre>

You can also find the same information in the Cloud Pak for Data user interface. Look for the **Metadata** option in the interface.

To see the sources of all your virtual tables, we just need to join the catalog query with the table function call.

<pre>
%%sql -a
SELECT N.TABSCHEMA AS TABSCHEMA,
       N.TABNAME AS TABNAME,
       S.SRCTABNAME AS SRCTABNAME,
       S.SRCSCHEMA AS SRCSCHEMA,
       S.SRCTYPE AS TYPE,
       S.DRIVER AS DRIVER,
       S.URL AS URL,
       S.USER AS USER,
       S.HOSTNAME AS HOSTNAME,
       S.PORT AS PORT,
       S.DBNAME AS DBNAME
  FROM SYSCAT.NICKNAMES N,
       TABLE(DVSYS.GET_VT_SOURCES(N.TABSCHEMA, N.TABNAME)) S
  WHERE N.TABSCHEMA = :user
</pre>

This concludes lab 4.

**This project contains Sample Materials, provided under license. <br>
Licensed Materials - Property of IBM. <br>
© Copyright IBM Corp. 2021. All Rights Reserved. <br>
US Government Users Restricted Rights - Use, duplication or disclosure restricted by GSA ADP Schedule Contract with IBM Corp.<br>**