# Data Source Connectivity and Exploration

This tutorial outlines the fundamental steps for using the Data Engineering space in HPE Ezmeral
Unified Analytics (EzUA) Software. Please note that the data and information presented here are
solely for illustrative purposes. To effectively leverage the capabilities of Unified Analytics, you
will need to connect it to your own data sources and utilize the datasets available within those
sources.

## Tutorial Objectives

This tutorial is designed to introduce you to the essentials of Data Engineering and acquaint you
with the interface of the system. It includes step-by-step guidance on how to:

- Establish connections to various data sources.
- Choose predefined data sets within these data sources.
- Merge data from different sets or sources.
- Develop a view for data analysis.
- Execute queries on the view you have created.

## Table of Contents

- [Introduction to Data Engineering on HPE Ezmeral Unified Analytics](#introduction-to-data-engineering-on-hpe-ezmeral-unified-analytics)
- [Connect Data Sources](#connect-data-sources)
- [Select Data Sets in the Data Catalog](#select-data-sets-in-the-data-catalog)
- [Run a JOIN Query on Data Sets and Create a View](#run-a-join-query-on-data-sets-and-create-a-view)
- [Next Steps](#next-steps)

# Introduction to Data Engineering on HPE Ezmeral Unified Analytics

Before starting, take a moment to familiarize yourself with the sidebar navigation menu of your HPE
EzUA Dashboard. Within the "Data Engineering" section, you'll find tools to connect to data sources
and manage data in various ways. The "Data Engineering" tab includes the following features:

- **Data Sources:** View and access your connected data sources, or establish new connections.
- **Data Catalog:** Select and query data sets (including tables and views) from one or more data
                    sources. This section also offers the option to cache data sets, which stores
                    the data in a distributed caching layer within the data fabric, ensuring quicker
                    access.
- **Query Editor:** Run queries against selected data sets; create views and new schemas.
- **Cached Assets:** List the cached data sets (tables and views).
- **Airflow Pipelines:** Open the Airflow interface, where you can use data sets created in HPE EzUA
                         in your data pipelines.

# Connect Data Sources

Let's begin by connecting to the external data sources that house the data sets (tables and views)
you wish to work with. This tutorial uses MySQL, SQL Server, Snowflake, and Hive as the connected
data sources. To start, select `Data Engineering > Data Sources` in the left navigation bar. The
`Data Sources` screen opens.

![title](images/01a.png)

Click the `Add New Data Source` button.

![title](images/01b.png)

## Connecting to MySQL

In the `Add New Data Source` screen, click `Create Connection` in the MySQL tile. In the drawer that
opens on the right, enter the required information in the respective fields:

- **Name:** mysql
- **Connection URL:** jdbc:mysql://<ip-address>:<port>
- **Connection User:** demouser
- **Connection Password:** moi123
- **Enable Local Snapshot Table:** Select the check box

When `Enable Local Snapshot Table` is selected, the system caches remote table data to accelerate
queries on the tables. The cache is active for the duration of the configured TTL or until the
remote tables in the data source are altered.
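
The expiry rule described above (serve cached data until the TTL elapses or the remote table
changes) can be illustrated with a small Python sketch. This is not how EzUA implements its cache;
the class name `SnapshotCache`, the integer `source_version` used to signal an altered table, and
the injectable `clock` are all hypothetical and exist only to make the rule concrete.

```python
import time
from typing import Any, Callable, Dict, Tuple


class SnapshotCache:
    """Toy cache: an entry is served until its TTL expires or the source version changes."""

    def __init__(self, ttl_seconds: float, clock: Callable[[], float] = time.monotonic):
        self.ttl_seconds = ttl_seconds
        self.clock = clock
        # table name -> (rows, source version at fetch time, fetch timestamp)
        self._entries: Dict[str, Tuple[Any, int, float]] = {}

    def get(self, table: str, source_version: int, fetch: Callable[[], Any]) -> Any:
        entry = self._entries.get(table)
        if entry is not None:
            rows, cached_version, fetched_at = entry
            fresh = self.clock() - fetched_at < self.ttl_seconds
            if fresh and cached_version == source_version:
                return rows  # cache hit: within TTL and table unchanged
        # Cache miss, TTL expired, or the remote table was altered: fetch again.
        rows = fetch()
        self._entries[table] = (rows, source_version, self.clock())
        return rows


# Usage with a fake clock so the behavior is deterministic.
now = [0.0]
fetch_calls = []


def fetch_customer_rows():
    fetch_calls.append(1)
    return ["row"]


cache = SnapshotCache(ttl_seconds=60, clock=lambda: now[0])
cache.get("customer", 1, fetch_customer_rows)  # first access: fetches
cache.get("customer", 1, fetch_customer_rows)  # within TTL: served from cache
now[0] = 61.0
cache.get("customer", 1, fetch_customer_rows)  # TTL expired: fetches again
cache.get("customer", 2, fetch_customer_rows)  # table altered: fetches again
```

In this sketch, the second call is the only one answered from the cache; the other three reach the
(simulated) remote source.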

Finally, click `Connect`. Upon successful connection, the system returns the following message:

```
Successfully added data source "mysql".
```

> Please be aware that the credentials used in this tutorial are solely for illustrative purposes.
> You should use your own credentials to establish connections with your personal data sources.

## Connecting to Microsoft SQL Server

In the `Add New Data Source` screen, click `Create Connection` in the Microsoft SQL Server tile. In
the drawer that opens on the right, enter the required information in the respective fields:

- **Name:** mssql
- **Connection URL:** jdbc:sqlserver://<ip-address>:<port>;database=retailstore
- **Connection User:** myaccount
- **Connection Password:** moi123
- **Enable Local Snapshot Table:** Select the check box
- **Enable Transparent Cache:** Select the check box

When `Enable Local Snapshot Table` is selected, the system caches remote table data to accelerate
queries on the tables. The cache is active for the duration of the configured TTL or until the
remote tables in the data source are altered.

~~When the `Enable Transparent Cache` option is selected, the system automatically caches data in
real-time as queries access remote tables. This means that as the query engine processes data from
remote data sources, it simultaneously caches this data. Consequently, any future queries targeting
the same data will benefit from faster response times, as the results are swiftly retrieved from
this cache. It's important to note that this cache exists only for the duration of the current
session.~~

Finally, click `Connect`. Upon successful connection, the system returns the following message:

```
Successfully added data source "mssql".
```
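
If you script the values for the `Connection URL` field, a small helper can keep the JDBC URL
formats consistent. The following Python sketch assumes the URL shapes shown in this tutorial; the
function names, host addresses, and ports are illustrative placeholders (3306 and 1433 are the
conventional default ports for MySQL and SQL Server, not values from your environment).

```python
def _check_port(port: int) -> int:
    """Reject ports outside the valid TCP range."""
    if not 1 <= port <= 65535:
        raise ValueError(f"invalid port: {port}")
    return port


def mysql_jdbc_url(host: str, port: int) -> str:
    """Build a MySQL JDBC URL of the form jdbc:mysql://<ip-address>:<port>."""
    return f"jdbc:mysql://{host}:{_check_port(port)}"


def sqlserver_jdbc_url(host: str, port: int, database: str) -> str:
    """Build a SQL Server JDBC URL; connection properties follow the port after ';'."""
    return f"jdbc:sqlserver://{host}:{_check_port(port)};database={database}"


# Placeholder hosts; replace them with the addresses of your own servers.
mysql_url = mysql_jdbc_url("10.0.0.5", 3306)
mssql_url = sqlserver_jdbc_url("10.0.0.6", 1433, "retailstore")
print(mysql_url)
print(mssql_url)
```

Note the difference between the two formats: SQL Server appends connection properties such as the
database name after a semicolon, while the MySQL URL in this tutorial contains only the host and
port.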

## Connecting to Snowflake

In the `Add New Data Source` screen, click `Create Connection` in the Snowflake tile. In the drawer
that opens on the right, enter the required information in the respective fields:

- **Name:** snowflake
- **Connection URL:** jdbc:snowflake://mydomain.com/
- **Connection User:** demouser
- **Connection Password:** moi123
- **Enable Local Snapshot Table:** Select the check box

When `Enable Local Snapshot Table` is selected, the system caches remote table data to accelerate
queries on the tables. The cache is active for the duration of the configured TTL or until the
remote tables in the data source are altered.

Finally, click `Connect`. Upon successful connection, the system returns the following message:

```
Successfully added data source "snowflake".
```

## Connecting to Hive

In the `Add New Data Source` screen, click `Create Connection` in the Hive tile. In the drawer that
opens on the right, enter the required information in the respective fields:

- **Name:** hiveview
- **Hive Metastore:** file
- **Hive Metastore Catalog Dir:** file:///data/shared/tmpmetastore

In `Optional Fields`, search for the following fields and add the specified values:

- **Hive Max Partitions Per Writers:** 10000
- **Hive Temporary Staging Directory Enabled:** Clear the check box
- **Hive Allow Drop Table:** Select the check box
- **Enable Local Snapshot Table:** Select the check box

When `Enable Local Snapshot Table` is selected, the system caches remote table data to accelerate
queries on the tables. The cache is active for the duration of the configured TTL or until the
remote tables in the data source are altered.

Finally, click `Connect`. Upon successful connection, the system returns the following message:

```
Successfully added data source "hiveview".
```

# Select Data Sets in the Data Catalog

In the `Data Catalog`, select the data sets (tables and views) in each of the data sources that you
want to work with. This tutorial uses the customer tables in the connected `mysql` and `snowflake`
data sources. In the `mysql` data source, the schema for the `customer` table is `retailstore`. In
the `snowflake` data source, the schema for the `customer` table is `public`.

To select the data sets that you want to work with:

1. In the left navigation bar, select `Data Engineering > Data Catalog`.
1. On the `Data Catalog` page, click the dropdown next to the `mysql` and `snowflake` data
   sources to expose the available schemas in those data sources.
1. For the `snowflake` data source, select the `public` schema; for the `mysql` data source,
   select the `retailstore` schema.
1. In the `All Datasets` search field, enter a search term to limit the number of data sets. This
   tutorial searches for data sets with `customer` in the name. All data sets in the `public` or
   `retailstore` schema whose names contain `customer` are displayed.
1. Click a `customer` table and preview its data in the `Columns` and `Data Preview` tabs. Do not
   click the browser's back button; doing so takes you to the `Data Sources` screen and you will
   have to repeat the previous steps.
1. Click `Close` to return to the data sets.
1. Click `Select` by each of the tables named `customer`. `Selected Datasets` should show `2` as the
   number of data sets selected.
1. Click `Selected Datasets`. The `Selected Datasets` drawer opens, giving you another opportunity
   to preview the datasets or discard them. From here, you can either query or cache the selected
   data sets. For the purpose of this tutorial, we will query the data sets.
1. Click `Query Editor`.

![title](images/01c.png)
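
The effect of combining the schema selection with the search term can be pictured as a simple
filter. The Python sketch below is only a mental model of the steps above, not the Data Catalog's
actual logic: the `Dataset` type, the sample table names other than `customer`, and the
case-insensitive substring match are all assumptions.

```python
from typing import Iterable, List, NamedTuple, Set, Tuple


class Dataset(NamedTuple):
    source: str
    schema: str
    name: str


def filter_datasets(
    datasets: Iterable[Dataset],
    selected_schemas: Set[Tuple[str, str]],
    term: str,
) -> List[Dataset]:
    """Keep data sets in a selected (source, schema) pair whose name contains the term."""
    needle = term.lower()
    return [
        d
        for d in datasets
        if (d.source, d.schema) in selected_schemas and needle in d.name.lower()
    ]


# Hypothetical catalog contents; only the two `customer` tables come from this tutorial.
catalog = [
    Dataset("mysql", "retailstore", "customer"),
    Dataset("mysql", "retailstore", "store"),
    Dataset("snowflake", "public", "customer"),
    Dataset("snowflake", "public", "customer_address"),
    Dataset("mssql", "dbo", "customer"),
]

selected = {("mysql", "retailstore"), ("snowflake", "public")}
matches = filter_datasets(catalog, selected, "customer")
for d in matches:
    print(f"{d.source}.{d.schema}.{d.name}")
```

The `mssql` entry is excluded because its schema was not selected, and `store` is excluded because
its name does not contain the search term.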

# Run a JOIN Query on Data Sets and Create a View

The datasets you selected are displayed under `Selected Datasets` in the `Query Editor`. Run a `JOIN`
query to join data from the two `customer` tables and then create a view from the query. The system
saves views as cached assets that you can reuse.

To view table columns and run a `JOIN` query:

1. Expand the `customer` tables in the `Selected Datasets` section to view the columns in each of
   the tables.
1. In the `SQL Query` workspace, click `+` to add a worksheet.
1. Copy and paste the following query into the `SQL Query` field. This query creates a new schema
   named `demoschema` in the `hiveview` data source:

   ```sql
   create schema if not exists hiveview.demoschema;
   ```
1. Click `Run` to run the query. As the query runs, a green light pulsates next to the `Query ID` in
   the `Query Results` section to indicate that the query is in progress. When the query is
   completed, the `Status` column displays `Succeeded`.
1. In the `SQL Query` workspace, click `+` to add a worksheet.
1. Copy and paste the following query into the `SQL Query` field. This query creates a view
   (`hiveview.demoschema.customer_info_view`) from a query that joins columns from the two
   `customer` tables (in the `mysql` and `snowflake` data sources) on the `c_customer_id` column:

   ```sql
   CREATE VIEW hiveview.demoschema.customer_info_view AS
   SELECT t1.c_customer_id, t1.c_first_name, t1.c_last_name, t2.c_email_address
   FROM mysql.retailstore.customer t1
   INNER JOIN snowflake.public.customer t2
     ON t1.c_customer_id = t2.c_customer_id
   ```
1. Click `Run` to run the query.
1. In the `SQL Query` workspace, click `+` to add a worksheet.
1. Copy and paste the following query into the `SQL Query` field. This query runs against the view
   you created (`hiveview.demoschema.customer_info_view`) and returns all data in the view:

   ```sql
   SELECT * FROM hiveview.demoschema.customer_info_view;
   ```
1. Click `Run` to run the query.
1. In the `Query Results` section, expand the `Actions` option for the query and select
   `Query Details` to view the query session and resource utilization summary.
1. Click `Close` to exit `Query Details`.

![title](images/01d.png)
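
To experiment with the join-and-view pattern outside EzUA, the following Python sketch reproduces it
with the standard-library `sqlite3` module. It is an approximation under stated assumptions: the two
in-memory tables stand in for `mysql.retailstore.customer` and `snowflake.public.customer`, the
sample rows are invented, and the table names are flattened (`mysql_customer`,
`snowflake_customer`) because a single SQLite database has no `source.schema.table` namespace.

```python
import sqlite3


def build_customer_view_rows():
    """Create two stand-in customer tables, join them in a view, and query the view."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        -- Stand-in for mysql.retailstore.customer (invented sample rows)
        CREATE TABLE mysql_customer (
            c_customer_id TEXT PRIMARY KEY,
            c_first_name  TEXT,
            c_last_name   TEXT
        );
        INSERT INTO mysql_customer VALUES
            ('C001', 'Ada', 'Lovelace'),
            ('C002', 'Alan', 'Turing');

        -- Stand-in for snowflake.public.customer (invented sample rows)
        CREATE TABLE snowflake_customer (
            c_customer_id   TEXT PRIMARY KEY,
            c_email_address TEXT
        );
        INSERT INTO snowflake_customer VALUES
            ('C001', 'ada@example.com'),
            ('C003', 'grace@example.com');

        -- Same shape as the tutorial's view: names from one table, email from the other
        CREATE VIEW customer_info_view AS
        SELECT t1.c_customer_id, t1.c_first_name, t1.c_last_name, t2.c_email_address
        FROM mysql_customer t1
        INNER JOIN snowflake_customer t2
            ON t1.c_customer_id = t2.c_customer_id;
        """
    )
    try:
        return conn.execute("SELECT * FROM customer_info_view").fetchall()
    finally:
        conn.close()


rows = build_customer_view_rows()
print(rows)  # [('C001', 'Ada', 'Lovelace', 'ada@example.com')]
```

Because the view uses an `INNER JOIN`, only customer IDs present in both tables appear in the
result; `C002` and `C003` are dropped in this sample data.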

# Next Steps

You have completed the first part of this tutorial. This tutorial demonstrated how easy it is to
connect HPE EzUA to various data sources for federated access to data through a single interface,
using standard SQL queries.

Next, you will learn how to create a Superset dashboard using the view (`customer_info_view`) and
schema (`demoschema`) you created in this tutorial.