# **Data Source Connectivity and Exploration on HPE Ezmeral Unified Analytics**

## This tutorial provides the basic steps for using the Data Engineering space within HPE Ezmeral Unified Analytics Software.

The data and information used in this tutorial are for example purposes only. To follow along, connect Unified Analytics to your own data sources and use the data sets available in them.

### Tutorial Objectives: 
The purpose of this tutorial is to walk you through some Data Engineering basics and familiarize you with the interface, including how to:

- Connect data sources.
- Select predefined data sets in data sources.
- Join data across data sets/data sources.
- Create a view.
- Run a query against the view.

### Table of Contents

- [Introduction to Data Engineering on HPE Ezmeral Unified Analytics](#introduction-to-data-engineering-on-hpe-ezmeral-unified-analytics)
- [1. Sign in to HPE Ezmeral Unified Analytics Software](#1-sign-in-to-hpe-ezmeral-unified-analytics-software)
- [2. Connect Data Sources](#2-connect-data-sources)
- [3. Select Data Sets in the Data Catalog](#3-select-data-sets-in-the-data-catalog)
- [4. Run a JOIN Query on Data Sets and Create a View](#4-run-a-join-query-on-data-sets-and-create-a-view)
- [5. Next Steps](#5-next-steps)

## **Introduction to Data Engineering on HPE Ezmeral Unified Analytics**

Before we begin, familiarize yourself with the sidebar navigation menu in your HPE Ezmeral Unified Analytics Dashboard. Under the Data Engineering tab, you can connect to data sources and work with data in a variety of ways. The Data Engineering tab includes:

- **Data Sources:** View and access connected data sources; create new data source connections.
- **Data Catalog:** Select data sets (tables and views) from one or more data sources and query data across the data sets. You can cache data sets. Caching stores the data in a distributed caching layer within the data fabric for accelerated access to the data.
- **Query Editor:** Run queries against selected data sets; create views and new schemas.
- **Cached Assets:** Lists the cached data sets (tables and views).
- **Airflow Pipelines:** Links to the Airflow interface where you can connect to data sets created in HPE Ezmeral Unified Analytics Software and use them in your data pipelines.

## **1. Sign in to HPE Ezmeral Unified Analytics Software**

Sign in to HPE Ezmeral Unified Analytics Software with the URL provided by your administrator.

## **2. Connect Data Sources**

Let's connect HPE Ezmeral Unified Analytics Software to external data sources that contain the data sets (tables and views) you want to work with. This tutorial uses MySQL, Microsoft SQL Server, Snowflake, and Hive as the connected data sources.

In the left navigation column, select Data Engineering > Data Sources. The Data Sources screen appears.

![title](images/01a.png)

Click **Add New Data Source.**

![title](images/01b.png)

### **Connecting to MySQL**

In the Add New Data Source screen, click Create Connection in the MySQL tile.

In the drawer that opens, enter required information in the respective fields:

**Name:** mysql

**Connection URL:** jdbc:mysql://<ip-address>:<port>

**Connection User:** demouser

**Connection Password:** moi123

**Enable Local Snapshot Table:** Select the check box

When Enable Local Snapshot Table is selected, the system caches remote table data to accelerate queries on the tables. The cache is active for the duration of the configured TTL or until the remote tables in the data source are altered.


Finally, click Connect. Upon successful connection, the system returns the following message: *Successfully added data source "mysql".*
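The expiry rule described for Enable Local Snapshot Table (cached until the TTL elapses or the remote table is altered) can be pictured with a toy model. The sketch below is only an illustration of that rule, not the product's implementation: the class `SnapshotCache`, the `remote_version` counter used to signal an altered table, and the sample row are all hypothetical.

```python
import time
from typing import Any, Callable, Dict, List, Tuple


class SnapshotCache:
    """Toy TTL snapshot cache (hypothetical; illustrates the expiry rule only)."""

    def __init__(self, ttl_seconds: float,
                 clock: Callable[[], float] = time.monotonic) -> None:
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # table name -> (rows, remote version when cached, time cached)
        self._entries: Dict[str, Tuple[List[Any], int, float]] = {}

    def read(self, table: str, remote_version: int,
             fetch: Callable[[], List[Any]]) -> List[Any]:
        """Serve the snapshot while it is fresh and the remote table is unchanged."""
        now = self.clock()
        entry = self._entries.get(table)
        if entry is not None:
            rows, version, cached_at = entry
            fresh = (now - cached_at) < self.ttl_seconds
            if fresh and version == remote_version:
                return rows
        # Miss, expired TTL, or altered table: take a new snapshot.
        rows = fetch()
        self._entries[table] = (rows, remote_version, now)
        return rows


# Demo with a fake clock so the TTL can be advanced without sleeping.
fake_now = [0.0]
fetch_count = [0]


def fetch_customer() -> List[Any]:
    fetch_count[0] += 1
    return [("C001", "Ada")]  # made-up sample row


cache = SnapshotCache(ttl_seconds=60, clock=lambda: fake_now[0])
cache.read("customer", 1, fetch_customer)   # first read: fetched remotely
fake_now[0] = 30.0
cache.read("customer", 1, fetch_customer)   # within TTL: served from the snapshot
fake_now[0] = 61.0
cache.read("customer", 1, fetch_customer)   # TTL elapsed: fetched again
cache.read("customer", 2, fetch_customer)   # table altered: fetched again
```

In this model only the second read avoids the remote source, which is the trade-off the TTL setting controls: a longer TTL means fewer remote reads but potentially staler snapshots.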

### **Connecting to Microsoft SQL Server**

In the Add New Data Source screen, click Create Connection in the Microsoft SQL Server tile.

In the drawer that opens, enter required information in the respective fields:

**Name:** mssql_ret2

**Connection URL:** jdbc:sqlserver://<ip-address>:<port>;database=retailstore

**Connection User:** myaccount

**Connection Password:** moi123

**Enable Local Snapshot Table:** Select the check box

When Enable Local Snapshot Table is selected, the system caches remote table data to accelerate queries on the tables. The cache is active for the duration of the configured TTL or until the remote tables in the data source are altered.

**Enable Transparent Cache:** Select the check box

When Enable Transparent Cache is selected, the system caches data at runtime when queries access remote tables. As the query engine scans data in remote data sources, the scanned data is cached on the fly. Results for subsequent queries on the same data are quickly returned from the cache. The cache lives for the duration of the session.

Finally, click Connect. Upon successful connection, the system returns the following message: *Successfully added data source "mssql_ret2".*
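Enable Transparent Cache behaves differently from the snapshot table: data is cached as queries scan it, and the cache lasts only for the session. A minimal sketch of that read-through, session-scoped behavior follows. It is a conceptual model under stated assumptions, not the query engine's code; `QuerySession`, `scan_remote`, and the sample rows are hypothetical names and data.

```python
from typing import Callable, Dict, List, Tuple

Row = Tuple[str, ...]


class QuerySession:
    """Toy session whose scan cache is filled on the fly and dropped on close."""

    def __init__(self, scan_remote: Callable[[str], List[Row]]) -> None:
        self._scan_remote = scan_remote
        self._cache: Dict[str, List[Row]] = {}
        self.remote_scans = 0

    def scan(self, table: str) -> List[Row]:
        """Return rows for a table, touching the remote source only on a miss."""
        if table not in self._cache:
            self.remote_scans += 1
            self._cache[table] = self._scan_remote(table)
        return self._cache[table]

    def close(self) -> None:
        """End the session; cached scans do not outlive it."""
        self._cache.clear()


def scan_remote(table: str) -> List[Row]:
    # Stand-in for a remote table scan; the rows are made up.
    return [("C001", "Ada"), ("C002", "Alan")]


session = QuerySession(scan_remote)
first = session.scan("retailstore.customer")    # scanned remotely and cached
second = session.scan("retailstore.customer")   # served from the session cache
scans_in_session = session.remote_scans
session.close()

new_session = QuerySession(scan_remote)
new_session.scan("retailstore.customer")        # a new session starts cold
```

The second scan in the same session returns without a remote read, while the new session has to scan the remote table again.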

### **Connecting to Snowflake**

In the Add New Data Source screen, click Create Connection in the Snowflake tile.

In the drawer that opens, enter required information in the respective fields:

**Name:** snowflake_ret

**Connection URL:** jdbc:snowflake://mydomain.com/

**Connection User:** demouser

**Connection Password:** moi123

**Enable Local Snapshot Table:** Select the check box

When Enable Local Snapshot Table is selected, the system caches remote table data to accelerate queries on the tables. The cache is active for the duration of the configured TTL or until the remote tables in the data source are altered.

Finally, click Connect. Upon successful connection, the system returns the following message: *Successfully added data source "snowflake_ret".*
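The three JDBC connection URLs entered so far have slightly different shapes. The helper functions below are a small, hypothetical sketch for assembling them before pasting them into the Connection URL field. Two assumptions are worth flagging: the SQL Server form follows the Microsoft JDBC driver pattern `jdbc:sqlserver://<host>:<port>;<property>=<value>`, and the host `192.0.2.10` and the ports are placeholders (3306 and 1433 are the usual defaults for MySQL and SQL Server, but your servers may differ).

```python
def _check_port(port: int) -> int:
    """Reject port numbers outside the valid TCP range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return port


def mysql_url(host: str, port: int) -> str:
    """Build a URL of the form jdbc:mysql://<ip-address>:<port>."""
    return f"jdbc:mysql://{host}:{_check_port(port)}"


def sqlserver_url(host: str, port: int, database: str) -> str:
    """Build a SQL Server URL; connection properties follow a semicolon."""
    return f"jdbc:sqlserver://{host}:{_check_port(port)};database={database}"


def snowflake_url(host: str) -> str:
    """Build a URL of the form jdbc:snowflake://<host>/."""
    return f"jdbc:snowflake://{host}/"


# Placeholder values; replace them with the details of your own data sources.
urls = {
    "mysql": mysql_url("192.0.2.10", 3306),
    "mssql_ret2": sqlserver_url("192.0.2.10", 1433, "retailstore"),
    "snowflake_ret": snowflake_url("mydomain.com"),
}
for name, url in urls.items():
    print(f"{name}: {url}")
```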


### **Connecting to Hive**

In the Add New Data Source screen, click Create Connection in the Hive tile.

In the drawer that opens, enter required information in the respective fields:

**Name:** hiveview

**Hive Metastore:** file

**Hive Metastore Catalog Dir:** file:///data/shared/tmpmetastore




In Optional Fields, search for the following fields and add the specified values:

**Hive Max Partitions Per Writers:** 10000

**Hive Temporary Staging Directory Enabled:** Unselect

**Hive Allow Drop Table:** Select

**Enable Local Snapshot Table:** Select the check box




When Enable Local Snapshot Table is selected, the system caches remote table data to accelerate queries on the tables. The cache is active for the duration of the configured TTL or until the remote tables in the data source are altered.

Finally, click Connect. Upon successful connection, the system returns the following message: *Successfully added data source "hiveview".*

## **3. Select Data Sets in the Data Catalog**

In the Data Catalog, select the data sets (tables and views) in each of the data sources that you want to work with.

This tutorial uses the customer tables in the connected mysql and snowflake_ret data sources. In the mysql data source, the schema for the customer table is retailstore. In the snowflake_ret data source, the schema for the customer table is public.

To select the data sets that you want to work with:

1. In the left navigation bar, select **Data Engineering > Data Catalog**.
1. On the **Data Catalog** page, click the dropdown next to the **mysql** and **snowflake_ret** data sources to expose the available schemas in those data sources.
1. For the **snowflake_ret** data source, select the **public** schema, and for the **mysql** data source, select the **retailstore** schema.
1. In the **All Datasets** search field, enter a search term to limit the number of data sets. This tutorial searches for *customer*. All data sets in the *public* or *retailstore* schema that have **customer** in the name are displayed.
1. Click a **customer** table and preview its data in the **Columns** and **Data Preview** tabs. Do not click the browser's back button; doing so takes you to the Data Sources screen and you will have to repeat the previous steps.
1. Click **Close** to return to the data sets.
1. Click **Select** by each of the tables named **customer**. Selected Datasets should show 2 as the number of data sets selected.
1. Click **Selected Datasets**. The Selected Datasets drawer opens, giving you another opportunity to preview the datasets or discard them. From here, you can either query or cache the selected data sets. For the purpose of this tutorial, we will query the data sets.
1. Click **Query Editor**.

![title](images/01c.png)
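The schema selection and search in the steps above amount to a filter over the catalog: keep the data sets in the selected schemas whose names contain the search term. The sketch below illustrates that logic only. The `Dataset` type, the sample catalog entries, and the case-insensitive substring match are assumptions made for illustration; the actual matching rules of the Data Catalog search are not documented here.

```python
from dataclasses import dataclass
from typing import Iterable, List, Set, Tuple


@dataclass(frozen=True)
class Dataset:
    source: str
    schema: str
    name: str


def filter_datasets(datasets: Iterable[Dataset],
                    selected: Set[Tuple[str, str]],
                    term: str) -> List[Dataset]:
    """Keep data sets in a selected (source, schema) whose name contains term."""
    needle = term.lower()
    return [
        d for d in datasets
        if (d.source, d.schema) in selected and needle in d.name.lower()
    ]


# Hypothetical catalog contents; only the two customer tables come from the tutorial.
catalog = [
    Dataset("mysql", "retailstore", "customer"),
    Dataset("mysql", "retailstore", "store"),
    Dataset("snowflake_ret", "public", "customer"),
    Dataset("snowflake_ret", "other_schema", "customer"),
]

selected_schemas = {("mysql", "retailstore"), ("snowflake_ret", "public")}
matches = filter_datasets(catalog, selected_schemas, "customer")
for d in matches:
    print(f"{d.source}.{d.schema}.{d.name}")
```

With this sample catalog, the filter leaves the two **customer** tables that the tutorial selects in the next steps.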

## **4. Run a JOIN Query on Data Sets and Create a View**

The data sets you selected display under Selected Datasets in the Query Editor. 

Run a JOIN query to join data from the two customer tables and then create a view from the query. The system saves views as cached assets that you can reuse.

To view table columns and run a JOIN query:
1. Expand the customer tables in the **Selected Datasets** section to view the columns in each of the tables.
1. In the **SQL Query** workspace, click + to add a worksheet.
1. Copy and paste the following query into the **SQL Query** field. This query creates a new schema named **demoschema** in the hiveview data source:


```sql
CREATE SCHEMA IF NOT EXISTS hiveview.demoschema;
```

4. Click **Run** to run the query. As the query runs, a green light pulsates next to the Query ID in the Query Results section to indicate that the query is in progress. When the query is completed, the Status column displays Succeeded.
5. In the **SQL Query** workspace, click + to add a worksheet.
6. Copy and paste the following query into the **SQL Query** field. This query creates a view (hiveview.demoschema.customer_info_view) from a query that joins columns from the two **customer** tables (in the mysql and snowflake_ret data sources) on the customer ID column (c_customer_id).

```sql
CREATE VIEW hiveview.demoschema.customer_info_view AS
SELECT t1.c_customer_id, t1.c_first_name, t1.c_last_name, t2.c_email_address
FROM mysql.retailstore.customer t1
INNER JOIN snowflake_ret.public.customer t2
  ON t1.c_customer_id = t2.c_customer_id
```

7. Click **Run** to run the query.
8. In the **SQL Query** workspace, click + to add a worksheet.
9. Copy and paste the following query into the **SQL Query** field. This query runs against the view you created (hiveview.demoschema.customer_info_view) and returns all data in the view.

```sql
SELECT * FROM hiveview.demoschema.customer_info_view;
```

10. Click **Run** to run the query.
11. In the **Query Results** section, expand the **Actions** option for the query and select **Query Details** to view the query session and resource utilization summary.
12. Click **Close** to exit out of Query Details.

![title](images/01d.png)
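To see what the INNER JOIN view returns without touching your connected data sources, the same query shape can be tried locally. The sketch below uses Python's built-in `sqlite3` module as a stand-in for the query engine, which is an assumption made purely for illustration: the two source tables live in one in-memory database under the hypothetical names `mysql_customer` and `snowflake_customer`, and the rows are made up.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    -- Stand-ins for mysql.retailstore.customer and snowflake_ret.public.customer
    CREATE TABLE mysql_customer (
        c_customer_id TEXT PRIMARY KEY,
        c_first_name  TEXT,
        c_last_name   TEXT
    );
    CREATE TABLE snowflake_customer (
        c_customer_id   TEXT PRIMARY KEY,
        c_email_address TEXT
    );

    -- Made-up sample rows: only C001 exists in both tables.
    INSERT INTO mysql_customer VALUES
        ('C001', 'Ada', 'Lovelace'),
        ('C002', 'Alan', 'Turing');
    INSERT INTO snowflake_customer VALUES
        ('C001', 'ada@example.com'),
        ('C003', 'grace@example.com');

    -- Same join shape as the tutorial's view.
    CREATE VIEW customer_info_view AS
    SELECT t1.c_customer_id, t1.c_first_name, t1.c_last_name, t2.c_email_address
    FROM mysql_customer t1
    INNER JOIN snowflake_customer t2
      ON t1.c_customer_id = t2.c_customer_id;
    """
)

rows = conn.execute("SELECT * FROM customer_info_view").fetchall()
for row in rows:
    print(row)
conn.close()
```

Because the join is an INNER JOIN, only customers whose ID appears in both tables show up in the view. If the row count of your own view is lower than expected, unmatched customer IDs between the two sources are a likely cause.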

## **5. Next Steps**

You have completed the first part of this tutorial. This tutorial demonstrated how easy it is to connect HPE Ezmeral Unified Analytics Software to various data sources for federated access to data through a single interface using standard SQL queries.

Next, you will learn how to create a Superset dashboard using the view (customer_info_view) and schema (demoschema) you created in this tutorial.