# Using IBM Cloud SQL Query

<div class="pull-left"><left><img style="float: right;" src="http://developer.ibm.com/clouddataservices/wp-content/uploads/sites/85/2018/01/ibm-cloud-object-storage-logo-small.png" width="100" margin=50></left></div>
<div style="text-align:center">
IBM Cloud SQL Query is IBM's serverless SQL service on data in Cloud Object Storage. It allows to run ANSI SQL on Parquet, CSV, JSON, ORC and AVRO data sets. You can use it to run your analytic queries, and you can use it to conduct complex transformations and write the result in any desired data format, partitioning and layout. SQL Query is based on Apache Spark SQL as the query engine in the background. This means you do not have to provision any Apache Spark instance or service. A simple Python client (like the IBM Watson Studio Notebook) is sufficient.<br><br></div>
This notebook is meant to be a generic starter to use the SQL Query API in order to run SQL statements in a programmatic way. It uses the <a href="https://github.com/IBM-Cloud/sql-query-clients/tree/master/Python" target="_blank" rel="noopener noreferrer">ibmcloudsql</a> Python library for this purpose. The notebook also demonstrates how you can combine SQL Query with visualization libraries such as PixieDust. The notebook has been verified to work with Python 3.5. As mentioned above it does not require a Spark service bound to the notebook.

## Table of contents
1. [Setup libraries](#setup)<br>
2. [Configure SQL Query](#configure)<br>
    2.1 [Using the project bucket](#projectbucket)<br>
    2.2 [Setting SQL Query parameters](#parameters)<br>
3. [Your SQL](#sql)<br>
4. [Running Your SQL Statement](#run)<br>
    4.1 [Low level SQL job submission](#lowlevel)<br>
5. [Running ETL SQLs](#etl)<br>
6. [Paginated SQL Results](#pagination)<br>
7. [List recent SQL submissions](#joblist)<br>
8. [Next steps](#next)<br>

### <a id="setup"></a> 1. Setup libraries

Run the following cell at least once in your notebook environment in order to install required packages, such as the SQL Query client library:

In [None]:
!conda install pyarrow
!conda install sqlparse

In [None]:
!pip install --user ibmcloudsql

In [None]:
import ibmcloudsql
from pixiedust.display import *
import pandas as pd
targeturl=''

### <a id="configure"></a> 2. Configure SQL Query
1. You need an **API key** for an IBM cloud identity that has access to your Cloud Object Storage bucket for writing SQL results and to your SQL Query instance. To create API keys log on to the IBM Cloud console and go to <a href="https://console.bluemix.net/iam/#/apikeys" target="_blank">Manage->Security->Platform API Keys</a>, click the `Create` button, give the key a custom name and click `Create`. In the next dialog click `Show` and copy the key to your clipboard and paste it below in this notebook.
2. You need the **instance CRN** for the SQL Query instance. You can find it in the <a href="https://console.bluemix.net/dashboard/apps" target="_blank">IBM Cloud console dashboard</a>. Make sure you have `All Resources` selected as resource group. In the section `Services` you can see your instances of SQL Query and Cloud Object Storage. Select the instance of SQL Query that you want to use. In the SQL Query dashboard page that opens up you find a section titled **REST API** with a button labelled **Instance CRN**. Click the button to copy the CRN into your clipboard and paste it here into the notebook. If you don't have an SQL Query instance created yet, <a href="https://console.bluemix.net/catalog/services/sql-query" target="_blank">create one</a> first.
3. You need to specify the location on Cloud Object Storage where your **query results** should be written. This comprises three parts of information that you can find in the Cloud Object Storage UI for your instance in the IBM Cloud console. You need to provide it as a **URL** using the format `cos://<endpoint>/<bucket>/[<prefix>]`. You have the option to use the cloud object storage **bucket that is associated with your project**. In this case, execute the following section before you proceed.  
<br/>
For more background information, check out the SQL Query <a href="https://console.bluemix.net/docs/services/sql-query/getting-started.html#getting-started-tutorial" target="_blank">documentation</a>.

#### <a id="projectbucket"></a> 2.1 Using the project bucket
**Only** follow the instructions in this section when you want to write your SQL query results to the bucket that has been created for the project for which you have created this notebook. In any other case proceed directly with section **2.2**.
<br><br>
__Inserting the project token__:  
Click the `More` option in the toolbar above (the three stacked dots) and select `Insert project token`.
 * If you haven't created an access token for this project before, you will see a dialog that asks you to create one first. Follow the link to open your project settings, scroll down to `Access tokens` and click `New token`. Give the token a custom name and make sure you select `Editor` as `Access role for project`. After you created your access token you can come back to this notebook, select the empty cell below and again select `Insert project token` from the toolbar at the top.
[//]: # 
This will add a new cell at the top of your notebook with content that looks like this:
```
# @hidden_cell
# The project token is an authorization token that is used to access project resources like data sources, connections, and used by platform APIs.
from project_lib import Project
project = Project(project_id='<some id>', project_access_token='<some access token>')
pc = project.project_context
```
Leave that cell content as inserted and run the cell. Then then proceed with the following cell below:

In [None]:
cos_bucket = project.get_metadata()['entity']['storage']['properties']
targeturl="cos://" + cos_bucket['bucket_region'] + "/" + cos_bucket['bucket_name'] + "/"

#### <a id="parameters"></a> 2.2 Setting the SQL Query parameters

In [8]:
import getpass
apikey=getpass.getpass('Enter IBM Cloud API Key (leave empty to use previous one): ') or apikey
instnacecrn=input('Enter SQL Query Instance CRN (leave empty to use previous one): ') or instnacecrn
if targeturl == '':
    targeturl=input('Enter target URL for SQL results: ')
else:
    targeturl=input('Enter target URL for SQL results (leave empty to use ' + targeturl + '): ') or targeturl
sqlClient = ibmcloudsql.SQLQuery(apikey, instnacecrn, client_info='SQL Query Starter Notebook')
sqlClient.logon()
print('\nYour SQL Query web console link:\n')
sqlClient.sql_ui_link()

Enter IBM Cloud API Key (leave empty to use previous one): ········
Enter SQL Query Instance CRN (leave empty to use previous one): 
Enter target URL for SQL results (leave empty to use cos://us-south/sqltempregional/): 

Your SQL Query web console link:

https://sql.ng.bluemix.net/sqlquery/?instance_crn=crn:v1:bluemix:public:sql-query:us-south:a/d86af7367f70fba4f306d3c19c938f2f:d1b2c005-e3d8-48c0-9247-e9726a7ed510::


### <a id="sql"></a> 3. Your SQL
To author your own SQL query, use the interactive SQL Query web console (**link above**) of your SQL Query service instance.

In [6]:
import sqlparse
from pygments import highlight
from pygments.lexers import get_lexer_by_name
from pygments.formatters import HtmlFormatter, Terminal256Formatter

sql=input('Enter your SQL statement (leave empty to use a simple sample SQL)')
if sql == '':
    sql='SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
         cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
         WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > "1998-01-01" \
         ORDER BY c.CompanyName'
if " INTO " not in sql:
    sql += ' INTO {} STORED AS CSV'.format(targeturl)
formatted_sql = sqlparse.format(sql, reindent=True, indent_tabs=True, keyword_case='upper')
lexer = get_lexer_by_name("sql", stripall=True)
formatter = Terminal256Formatter(style='tango')
result = highlight(formatted_sql, lexer, formatter)
from IPython.core.display import display, HTML
print('\nYour SQL statement is:\n')
print(result)

Enter your SQL statement (leave empty to use a simple sample SQL)

Your SQL statement is:

[38;5;24;01mSELECT[39;00m [38;5;0mo[39m[38;5;0;01m.[39;00m[38;5;0mOrderID[39m[38;5;0;01m,[39;00m
	[38;5;24;01mc[39;00m[38;5;0;01m.[39;00m[38;5;0mCompanyName[39m[38;5;0;01m,[39;00m
	[38;5;0me[39m[38;5;0;01m.[39;00m[38;5;0mFirstName[39m[38;5;0;01m,[39;00m
	[38;5;0me[39m[38;5;0;01m.[39;00m[38;5;0mLastName[39m
[38;5;24;01mFROM[39;00m [38;5;0mcos[39m[38;5;0;01m:[39;00m[38;5;166;01m/[39;00m[38;5;166;01m/[39;00m[38;5;0mus[39m[38;5;166;01m-[39;00m[38;5;0mgeo[39m[38;5;166;01m/[39;00m[38;5;24;01mSQL[39;00m[38;5;166;01m/[39;00m[38;5;0morders[39m[38;5;0;01m.[39;00m[38;5;0mparquet[39m [38;5;0mSTORED[39m [38;5;24;01mAS[39;00m [38;5;0mPARQUET[39m [38;5;0mo[39m[38;5;0;01m,[39;00m
	[38;5;0mcos[39m[38;5;0;01m:[39;00m[38;5;166;01m/[39;00m[38;5;166;01m/[39;00m[38;5;0mus[39m[38;5;166;01m-[39;00m[38;5;0mgeo[39m[38;5;166;01m/[39;00m[

### <a id="run"></a> 4. Running Your SQL Statement
The following cell submits the above SQL statement and waits for it to finish before printing a sample of the result set.

In [9]:
result_df = sqlClient.run_sql(sql)
if isinstance(result_df, str):
    print(result_df)

In [10]:
result_df.head(10)

Unnamed: 0,OrderID,CompanyName,FirstName,LastName
0,10924,Berglunds snabbköp,Janet,Leverling
1,11058,Blauer See Delikatessen,Anne,Dodsworth
2,10827,Bon app',Nancy,Davolio
3,11076,Bon app',Margaret,Peacock
4,11045,Bottom-Dollar Markets,Michael,Suyama
5,10970,Bólido Comidas preparadas,Anne,Dodsworth
6,11054,Cactus Comidas para llevar,Laura,Callahan
7,11008,Ernst Handel,Robert,King
8,11072,Ernst Handel,Margaret,Peacock
9,10816,Great Lakes Food Market,Margaret,Peacock


In [11]:
from pixiedust.display import *
display(result_df)

#### <a id="lowlevel"></a> 4.1 Low level SQL job submission
Let's run the same SQL again, but this time using the asynchronous submission mechanism and the status check method.

In [12]:
sqlClient.logon()
jobId = sqlClient.submit_sql(sql)
print("SQL query submitted and running in the background. jobId = " + jobId)

SQL query submitted and running in the background. jobId = 743d385c-d94e-4da2-b23f-3305e0c12259


In [13]:
print("Job status for " + jobId + ": " + sqlClient.get_job(jobId)['status'])

Job status for 743d385c-d94e-4da2-b23f-3305e0c12259: running


Use the `wait_for_job()` method as a blocking call until your job has finished:

In [14]:
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
if job_status == 'failed':
    details = sqlClient.get_job(jobId)
    print("Error: {}\nError Message: {}".format(details['error'], details['error_message']))

Job 743d385c-d94e-4da2-b23f-3305e0c12259 terminated with status: completed


Use the `get_result()` method to retrieve a dataframe for the SQL result set:

In [15]:
result_df = sqlClient.get_result(jobId)
print("OK, we have a dataframe for the SQL result that has been stored by SQL Query in " + sqlClient.get_job(jobId)['resultset_location'])

OK, we have a dataframe for the SQL result that has been stored by SQL Query in cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=743d385c-d94e-4da2-b23f-3305e0c12259


You can delete the result set from Cloud Object Storage using the `delete_result()` method:

In [16]:
sqlClient.delete_result(jobId)

Unnamed: 0,Deleted Object
0,jobid=743d385c-d94e-4da2-b23f-3305e0c12259/_SU...
1,jobid=743d385c-d94e-4da2-b23f-3305e0c12259
2,jobid=743d385c-d94e-4da2-b23f-3305e0c12259/par...


### <a id="etl"></a> 5. Running ETL SQLs
The following ETL SQL statement joins two data sets from COS and writes the result to COS using **hive style partitioning** with two columns. 

In [17]:
etl_sql='SELECT OrderID, c.CustomerID CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax \
         EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, \
         ShipCity, ShipRegion, ShipPostalCode, ShipCountry FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
         cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
         WHERE c.CustomerID = o.CustomerID \
         INTO {}customer_orders STORED AS PARQUET PARTITIONED BY (ShipCountry, ShipCity)'.format(targeturl)
formatted_etl_sql = sqlparse.format(etl_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_etl_sql, lexer, formatter)
print('\nExample ETL Statement is:\n')
print(result)


Example ETL Statement is:

[38;5;24;01mSELECT[39;00m [38;5;0mOrderID[39m[38;5;0;01m,[39;00m
	[38;5;24;01mc[39;00m[38;5;0;01m.[39;00m[38;5;0mCustomerID[39m [38;5;0mCustomerID[39m[38;5;0;01m,[39;00m
	[38;5;0mCompanyName[39m[38;5;0;01m,[39;00m
	[38;5;0mContactName[39m[38;5;0;01m,[39;00m
	[38;5;0mContactTitle[39m[38;5;0;01m,[39;00m
	[38;5;0mAddress[39m[38;5;0;01m,[39;00m
	[38;5;0mCity[39m[38;5;0;01m,[39;00m
	[38;5;0mRegion[39m[38;5;0;01m,[39;00m
	[38;5;0mPostalCode[39m[38;5;0;01m,[39;00m
	[38;5;0mCountry[39m[38;5;0;01m,[39;00m
	[38;5;0mPhone[39m[38;5;0;01m,[39;00m
	[38;5;0mFax[39m [38;5;0mEmployeeID[39m[38;5;0;01m,[39;00m
	[38;5;0mOrderDate[39m[38;5;0;01m,[39;00m
	[38;5;0mRequiredDate[39m[38;5;0;01m,[39;00m
	[38;5;0mShippedDate[39m[38;5;0;01m,[39;00m
	[38;5;0mShipVia[39m[38;5;0;01m,[39;00m
	[38;5;0mFreight[39m[38;5;0;01m,[39;00m
	[38;5;0mShipName[39m[38;5;0;01m,[39;00m
	[38;5;0mShipAddress[39m[38;5;0;01m,

In [18]:
jobId = sqlClient.submit_sql(etl_sql)
print("SQL query submitted and running in the background. jobId = " + jobId)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))

SQL query submitted and running in the background. jobId = ebe86df7-f9b5-4cf2-90ad-9d374b1b288b
Job ebe86df7-f9b5-4cf2-90ad-9d374b1b288b terminated with status: completed


The following cell uses the `get_cos_summary()` method to print a summary of the objects that have been written by the previous ETL SQL statement. Note the **total_volume** value. We will reference it for comparison in the next steps.

In [20]:
resultset_location = job_details['resultset_location']
sqlClient.get_cos_summary(resultset_location)

{'largest_object': 'customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=UK/ShipCity=London/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet',
 'largest_object_size': '7.1 KB',
 'newest_object_timestamp': 'November 15, 2018, 14H:47M:33S',
 'oldest_object_timestamp': 'November 15, 2018, 14H:47M:28S',
 'smallest_object': 'customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b',
 'smallest_object_size': '0.0 B',
 'total_objects': 72,
 'total_volume': '401.7 KB',
 'url': 'cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b'}

The following cell uses the `list_results()` method to print a list of the objects that have been written by the above ETL SQL statement. Note the partition columns and their values being part of the object names now. This naming convention is known as **hive style partitioning**. This type of partitioning is the basis for optimizing SQL queries using predicates that match with the partitioning columns.

In [21]:
pd.set_option('display.max_colwidth', -1)
result_objects_df = sqlClient.list_results(jobId)
print("List of objects written by ETL SQL:")
result_objects_df.head(200)

List of objects written by ETL SQL:


Unnamed: 0,ObjectURL,Size
0,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b,0
1,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Argentina/ShipCity=Buenos Aires/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,6412
2,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Austria/ShipCity=Graz/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,6094
3,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Austria/ShipCity=Salzburg/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,5636
4,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Belgium/ShipCity=Bruxelles/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,5660
5,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Belgium/ShipCity=Charleroi/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,5957
6,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Brazil/ShipCity=Campinas/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,5715
7,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Brazil/ShipCity=Resende/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,5749
8,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Brazil/ShipCity=Rio de Janeiro/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,6877
9,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b/ShipCountry=Brazil/ShipCity=Sao Paulo/part-00000-0ab1a55a-0886-4c36-81d7-651c671732dc-attempt_20181115144729_0012_m_000000_0.c000.snappy.parquet,6981


Now let's take a look at the result data with the `get_result()` method. Note that the result dataframe contains the two partitioning columns. The values for these have been put together by get_result() from the object names above because in hive style partitioning the partition column values are not stored in the objects but rather in the object names.

In [22]:
sqlClient.get_result(jobId).head(100)

Unnamed: 0,OrderID,CustomerID,CompanyName,ContactName,ContactTitle,Address,City,Region,PostalCode,Country,...,RequiredDate,ShippedDate,ShipVia,Freight,ShipName,ShipAddress,ShipRegion,ShipPostalCode,ShipCountry,ShipCity
0,10409,OCEAN,Océano Atlántico Ltda.,Yvonne Moncada,Sales Agent,Ing. Gustavo Moncada 8585 Piso 20-A,Buenos Aires,,1010,Argentina,...,1997-02-06 06:00:00,1997-01-14 00:00:00.000,1,29.83,Océano Atlántico Ltda.,Ing. Gustavo Moncada 8585 Piso 20-A,,1010,Argentina,Buenos Aires
1,10448,RANCH,Rancho grande,Sergio Gutiérrez,Sales Representative,Av. del Libertador 900,Buenos Aires,,1010,Argentina,...,1997-03-17 06:00:00,1997-02-24 00:00:00.000,2,38.82,Rancho grande,Av. del Libertador 900,,1010,Argentina,Buenos Aires
2,10521,CACTU,Cactus Comidas para llevar,Patricio Simpson,Sales Agent,Cerrito 333,Buenos Aires,,1010,Argentina,...,1997-05-27 05:00:00,1997-05-02 00:00:00.000,2,17.22,Cactus Comidas para llevar,Cerrito 333,,1010,Argentina,Buenos Aires
3,10531,OCEAN,Océano Atlántico Ltda.,Yvonne Moncada,Sales Agent,Ing. Gustavo Moncada 8585 Piso 20-A,Buenos Aires,,1010,Argentina,...,1997-06-05 05:00:00,1997-05-19 00:00:00.000,1,8.12,Océano Atlántico Ltda.,Ing. Gustavo Moncada 8585 Piso 20-A,,1010,Argentina,Buenos Aires
4,10716,RANCH,Rancho grande,Sergio Gutiérrez,Sales Representative,Av. del Libertador 900,Buenos Aires,,1010,Argentina,...,1997-11-21 06:00:00,1997-10-27 00:00:00.000,2,22.57,Rancho grande,Av. del Libertador 900,,1010,Argentina,Buenos Aires
5,10782,CACTU,Cactus Comidas para llevar,Patricio Simpson,Sales Agent,Cerrito 333,Buenos Aires,,1010,Argentina,...,1998-01-14 06:00:00,1997-12-22 00:00:00.000,3,1.10,Cactus Comidas para llevar,Cerrito 333,,1010,Argentina,Buenos Aires
6,10819,CACTU,Cactus Comidas para llevar,Patricio Simpson,Sales Agent,Cerrito 333,Buenos Aires,,1010,Argentina,...,1998-02-04 06:00:00,1998-01-16 00:00:00.000,3,19.76,Cactus Comidas para llevar,Cerrito 333,,1010,Argentina,Buenos Aires
7,10828,RANCH,Rancho grande,Sergio Gutiérrez,Sales Representative,Av. del Libertador 900,Buenos Aires,,1010,Argentina,...,1998-01-27 06:00:00,1998-02-04 00:00:00.000,1,90.85,Rancho grande,Av. del Libertador 900,,1010,Argentina,Buenos Aires
8,10881,CACTU,Cactus Comidas para llevar,Patricio Simpson,Sales Agent,Cerrito 333,Buenos Aires,,1010,Argentina,...,1998-03-11 06:00:00,1998-02-18 00:00:00.000,1,2.84,Cactus Comidas para llevar,Cerrito 333,,1010,Argentina,Buenos Aires
9,10898,OCEAN,Océano Atlántico Ltda.,Yvonne Moncada,Sales Agent,Ing. Gustavo Moncada 8585 Piso 20-A,Buenos Aires,,1010,Argentina,...,1998-03-20 06:00:00,1998-03-06 00:00:00.000,2,1.27,Océano Atlántico Ltda.,Ing. Gustavo Moncada 8585 Piso 20-A,,1010,Argentina,Buenos Aires


The following cell runs a new **optimized SQL** query against the **partitioned data** that has been produced by the previous ETL SQL statement. The query uses `WHERE` predicates on the columns that have been used to partition the results in the ETL job. The query will physically only read the objects that match these predicate values.

In [23]:
optimized_sql='SELECT * FROM {} STORED AS PARQUET WHERE ShipCountry = "Austria" AND ShipCity="Graz" \
               INTO {} STORED AS PARQUET'.format(resultset_location, targeturl)
formatted_optimized_sql = sqlparse.format(optimized_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_optimized_sql, lexer, formatter)
print('\nRunning SQL against the previously produced hive style partitioned objects as input:\n')
print(result)

jobId = sqlClient.submit_sql(optimized_sql)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))


Running SQL against the previously produced hive style partitioned objects as input:

[38;5;24;01mSELECT[39;00m [38;5;166;01m*[39;00m
[38;5;24;01mFROM[39;00m [38;5;0mcos[39m[38;5;0;01m:[39;00m[38;5;166;01m/[39;00m[38;5;166;01m/[39;00m[38;5;0ms3[39m[38;5;0;01m.[39;00m[38;5;0mus[39m[38;5;166;01m-[39;00m[38;5;0msouth[39m[38;5;0;01m.[39;00m[38;5;0mobjectstorage[39m[38;5;0;01m.[39;00m[38;5;0msoftlayer[39m[38;5;0;01m.[39;00m[38;5;0mnet[39m[38;5;166;01m/[39;00m[38;5;0msqltempregional[39m[38;5;166;01m/[39;00m[38;5;0mcustomer_orders[39m[38;5;166;01m/[39;00m[38;5;0mjobid[39m[38;5;166;01m=[39;00m[38;5;0mebe86df7[39m[38;5;166;01m-[39;00m[38;5;0mf9b5[39m[38;5;166;01m-[39;00m[38;5;20;01m4[39;00m[38;5;0mcf2[39m[38;5;166;01m-[39;00m[38;5;20;01m90[39;00m[38;5;0mad[39m[38;5;166;01m-[39;00m[38;5;20;01m9[39;00m[38;5;0md374b1b288b[39m [38;5;0mSTORED[39m [38;5;24;01mAS[39;00m [38;5;0mPARQUET[39m
[38;5;24;01mWHERE[39;00m [38;

The following cell uses the `get_job()` method in order to show the job details of the just run optimized SQL that leverages hive style partitioning. Note the **bytes_read** value that is significantly lower than the **total_volume** value of the data in the queries data set. This does increase query performance and lower the query cost.

In [24]:
sqlClient.get_job(jobId)

{'bytes_read': 6090,
 'end_time': '2018-11-15T14:51:20.435Z',
 'job_id': '0b079513-cede-4048-b07d-06b3044d9520',
 'plan_id': 'e03a38d0-5ec1-41c5-b3b3-5e081dc19c8c',
 'resultset_format': 'parquet',
 'resultset_location': 'cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=0b079513-cede-4048-b07d-06b3044d9520',
 'rows_read': 30,
 'rows_returned': 30,
 'statement': 'SELECT * FROM cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b STORED AS PARQUET WHERE ShipCountry = "Austria" AND ShipCity="Graz"                INTO cos://us-south/sqltempregional/ STORED AS PARQUET',
 'status': 'completed',
 'submit_time': '2018-11-15T14:50:51.145Z',
 'user_id': 'torsten@de.ibm.com'}

### <a id="pagination"></a> 6. Paginated SQL Results
The next cell runs a simple join SQL. But this time `submit_sql()` is provided the optional **`pagesize`** parameter with a value of **`10`**. This results in multiple objects being written with each having 10 rows of the result in it. Internally this is achieved by using the SQL Query syntax clause of `PARTITIONED EVERY <num> ROWS`. This also means that your query cannot already contain another `PARTITIONED BY` clause.

In [40]:
pagination_sql='SELECT OrderID, c.CustomerID CustomerID, CompanyName, City, Region, PostalCode \
                FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, \
                     cos://us-geo/sql/customers.parquet STORED AS PARQUET c \
                WHERE c.CustomerID = o.CustomerID \
                INTO {}paginated_orders STORED AS PARQUET'.format(targeturl)
formatted_etl_sql = sqlparse.format(etl_sql, reindent=True, indent_tabs=True, keyword_case='upper')
result = highlight(formatted_etl_sql, lexer, formatter)
print('\nExample ETL Statement is:\n')
print(result)

jobId = sqlClient.submit_sql(pagination_sql, pagesize=10)
job_status = sqlClient.wait_for_job(jobId)
print("Job " + jobId + " terminated with status: " + job_status)
job_details = sqlClient.get_job(jobId)
if job_status == 'failed':
    print("Error: {}\nError Message: {}".format(job_details['error'], job_details['error_message']))


Example ETL Statement is:

[38;5;24;01mSELECT[39;00m [38;5;0mOrderID[39m[38;5;0;01m,[39;00m
	[38;5;24;01mc[39;00m[38;5;0;01m.[39;00m[38;5;0mCustomerID[39m [38;5;0mCustomerID[39m[38;5;0;01m,[39;00m
	[38;5;0mCompanyName[39m[38;5;0;01m,[39;00m
	[38;5;0mContactName[39m[38;5;0;01m,[39;00m
	[38;5;0mContactTitle[39m[38;5;0;01m,[39;00m
	[38;5;0mAddress[39m[38;5;0;01m,[39;00m
	[38;5;0mCity[39m[38;5;0;01m,[39;00m
	[38;5;0mRegion[39m[38;5;0;01m,[39;00m
	[38;5;0mPostalCode[39m[38;5;0;01m,[39;00m
	[38;5;0mCountry[39m[38;5;0;01m,[39;00m
	[38;5;0mPhone[39m[38;5;0;01m,[39;00m
	[38;5;0mFax[39m [38;5;0mEmployeeID[39m[38;5;0;01m,[39;00m
	[38;5;0mOrderDate[39m[38;5;0;01m,[39;00m
	[38;5;0mRequiredDate[39m[38;5;0;01m,[39;00m
	[38;5;0mShippedDate[39m[38;5;0;01m,[39;00m
	[38;5;0mShipVia[39m[38;5;0;01m,[39;00m
	[38;5;0mFreight[39m[38;5;0;01m,[39;00m
	[38;5;0mShipName[39m[38;5;0;01m,[39;00m
	[38;5;0mShipAddress[39m[38;5;0;01m,

Let's check how many pages with each 10 rows have been written:

In [41]:
print("Number of pages written by job {}: {}".format(jobId, len(sqlClient.list_results(jobId))))

Number of pages written by job 33b34819-9f0f-4428-b0cb-7cb2c854e7f7: 85


The following cell retrieves the first page of the result as a data frame. The desired page is specified as the optional parameter **`pagenumber`** to the `get_result()` method.

In [47]:
pagenumber=1
sqlClient.get_result(jobId, pagenumber=pagenumber).head(100)

Unnamed: 0,OrderID,CustomerID,CompanyName,City,Region,PostalCode
0,11011,ALFKI,Alfreds Futterkiste,Berlin,,12209
1,10952,ALFKI,Alfreds Futterkiste,Berlin,,12209
2,10835,ALFKI,Alfreds Futterkiste,Berlin,,12209
3,10702,ALFKI,Alfreds Futterkiste,Berlin,,12209
4,10692,ALFKI,Alfreds Futterkiste,Berlin,,12209
5,10643,ALFKI,Alfreds Futterkiste,Berlin,,12209
6,10926,ANATR,Ana Trujillo Emp...,México D.F.,,5021
7,10759,ANATR,Ana Trujillo Emp...,México D.F.,,5021
8,10625,ANATR,Ana Trujillo Emp...,México D.F.,,5021
9,10308,ANATR,Ana Trujillo Emp...,México D.F.,,5021


The following cell gets the next page. Run it multiple times in order to retrieve the subsequent pages, one page after the another.

In [48]:
pagenumber+=1
sqlClient.get_result(jobId, pagenumber).head(100)

Unnamed: 0,OrderID,CustomerID,CompanyName,City,Region,PostalCode
0,10856,ANTON,Antonio Moreno T...,México D.F.,,05023
1,10682,ANTON,Antonio Moreno T...,México D.F.,,05023
2,10677,ANTON,Antonio Moreno T...,México D.F.,,05023
3,10573,ANTON,Antonio Moreno T...,México D.F.,,05023
4,10535,ANTON,Antonio Moreno T...,México D.F.,,05023
5,10507,ANTON,Antonio Moreno T...,México D.F.,,05023
6,10365,ANTON,Antonio Moreno T...,México D.F.,,05023
7,11016,AROUT,Around the Horn,London,,WA1 1DP
8,10953,AROUT,Around the Horn,London,,WA1 1DP
9,10920,AROUT,Around the Horn,London,,WA1 1DP


### <a id="joblist"></a> 7. Working with your SQL Job Submission History
The following cell uses the `get_cos_summary()` method to get a statistical overview of the data in the **target location** in COS that has been used by the above queries in this notebook.

In [25]:
sqlClient.get_cos_summary(targeturl)

{'largest_object': 'jobid=778d78b9-1068-4fd4-86d8-793fc1b5b737/part-00000-43ce56b3-4e44-4841-81e1-3d1e00bf8f43.csv-attempt_20180403103903_0009_m_000000_0',
 'largest_object_size': '203.4 MB',
 'newest_object_timestamp': 'November 15, 2018, 14H:51M:18S',
 'oldest_object_timestamp': 'November 24, 2017, 09H:58M:25S',
 'smallest_object': 'categories.avro',
 'smallest_object_size': '0.0 B',
 'total_objects': 7803,
 'total_volume': '1.4 GB',
 'url': 'cos://us-south/sqltempregional/'}

The method `get_jobs()` provides you a dataframe with the **30 most recent SQL submissions** with all details. You can change the value `-1`for `display.max_colwidth` to a positive integer if you want to truncate the cell content to shrink the overall table display size.

In [26]:
pd.set_option('display.max_colwidth', -1)
job_history_df = sqlClient.get_jobs()
job_history_df.head(100)

Unnamed: 0,job_id,status,user_id,statement,resultset_location,submit_time,end_time,rows_read,rows_returned,bytes_read,error,error_message
0,0b079513-cede-4048-b07d-06b3044d9520,completed,torsten@de.ibm.com,"SELECT * FROM cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b STORED AS PARQUET WHERE ShipCountry = ""Austria"" AND ShipCity=""Graz"" INTO cos://us-south/sqltempregional/ STORED AS PARQUET",cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=0b079513-cede-4048-b07d-06b3044d9520,2018-11-15T14:50:51.145Z,2018-11-15T14:51:20.435Z,30.0,30.0,6090.0,,
1,ebe86df7-f9b5-4cf2-90ad-9d374b1b288b,completed,torsten@de.ibm.com,"SELECT OrderID, c.CustomerID CustomerID, CompanyName, ContactName, ContactTitle, Address, City, Region, PostalCode, Country, Phone, Fax EmployeeID, OrderDate, RequiredDate, ShippedDate, ShipVia, Freight, ShipName, ShipAddress, ShipCity, ShipRegion, ShipPostalCode, ShipCountry FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE c.CustomerID = o.CustomerID INTO cos://us-south/sqltempregional/customer_orders STORED AS PARQUET PARTITIONED BY (ShipCountry, ShipCity)",cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/customer_orders/jobid=ebe86df7-f9b5-4cf2-90ad-9d374b1b288b,2018-11-15T14:47:22.420Z,2018-11-15T14:47:34.552Z,921.0,830.0,43058.0,,
2,743d385c-d94e-4da2-b23f-3305e0c12259,completed,torsten@de.ibm.com,"SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > ""1998-01-01"" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/ STORED AS CSV",cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=743d385c-d94e-4da2-b23f-3305e0c12259,2018-11-15T14:45:40.870Z,2018-11-15T14:45:47.866Z,1760.0,29.0,41499.0,,
3,aaf77fb6-bad9-4df0-961d-cc06e302cc58,completed,torsten@de.ibm.com,"SELECT o.OrderID, c.CompanyName, e.FirstName, e.LastName FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e, cos://us-geo/sql/customers.parquet STORED AS PARQUET c WHERE e.EmployeeID = o.EmployeeID AND c.CustomerID = o.CustomerID AND o.ShippedDate > o.RequiredDate AND o.OrderDate > ""1998-01-01"" ORDER BY c.CompanyName INTO cos://us-south/sqltempregional/ STORED AS CSV",cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=aaf77fb6-bad9-4df0-961d-cc06e302cc58,2018-11-15T14:42:37.424Z,2018-11-15T14:42:54.742Z,1760.0,29.0,41499.0,,
4,d332bc4b-4c5e-4ad4-a9d2-399271784510,completed,torsten@de.ibm.com,SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=d332bc4b-4c5e-4ad4-a9d2-399271784510,2018-11-15T13:09:20.997Z,2018-11-15T13:09:41.341Z,9.0,9.0,8593.0,,
5,74253ee7-97b3-42a4-a2bc-2ba48e84c7b6,failed,torsten@de.ibm.com,SELECT xyz FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS CSV,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=74253ee7-97b3-42a4-a2bc-2ba48e84c7b6,2018-11-15T13:08:54.976Z,2018-11-15T13:08:56.342Z,,,,SQL execution failed,A non-existing column used: xyz. Use an existing column.
6,fa31011d-907c-485b-9e74-6ec04bf72d80,completed,torsten@de.ibm.com,"WITH orders_shipped AS (SELECT OrderID, EmployeeID, (CASE WHEN shippedDate < requiredDate THEN 'On Time' ELSE 'Late' END) AS Shipped FROM cos://us-geo/sql/orders.parquet STORED AS PARQUET) SELECT e.FirstName, e.LastName, COUNT(o.OrderID) As NumOrders, Shipped FROM orders_shipped o, cos://us-geo/sql/employees.parquet STORED AS PARQUET e WHERE e.EmployeeID = o.EmployeeID GROUP BY e.FirstName, e.LastName, Shipped ORDER BY e.LastName, e.FirstName, NumOrders DESC INTO cos://us-south/sqltempregional/ STORED AS CSV",cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=fa31011d-907c-485b-9e74-6ec04bf72d80,2018-11-15T13:08:34.361Z,2018-11-15T13:08:42.303Z,839.0,18.0,20626.0,,
7,cd2af8e6-dd7b-4f10-bea1-69621babbb93,completed,torsten@de.ibm.com,SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS PARQUET PARTITIONED EVERY 2 ROWS,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=cd2af8e6-dd7b-4f10-bea1-69621babbb93,2018-11-15T13:08:03.544Z,2018-11-15T13:08:12.141Z,9.0,9.0,8593.0,,
8,3a4805c8-1925-46dc-8a47-41416f4eeb5e,completed,torsten@de.ibm.com,SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET INTO cos://us-south/sqltempregional/ STORED AS CSV PARTITIONED BY (city),cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=3a4805c8-1925-46dc-8a47-41416f4eeb5e,2018-11-15T13:07:34.069Z,2018-11-15T13:07:54.578Z,9.0,9.0,8593.0,,
9,e7a26e07-bedc-429c-ab5c-30db546a910c,completed,torsten@de.ibm.com,SELECT * FROM cos://us-geo/sql/employees.parquet STORED AS PARQUET LIMIT 10 INTO cos://us-south/sqltempregional/ STORED AS CSV,cos://s3.us-south.objectstorage.softlayer.net/sqltempregional/jobid=e7a26e07-bedc-429c-ab5c-30db546a910c,2018-11-15T13:07:22.682Z,2018-11-15T13:07:26.418Z,9.0,9.0,8593.0,,


In [27]:
sqlClient.export_job_history(targeturl + "my_job_history/")

Exported 4 new jobs


In [29]:
pd.set_option('display.max_colwidth', 20)
sql = "SELECT * FROM {}my_job_history/ STORED AS PARQUET INTO {} STORED AS CSV".format(targeturl, targeturl)
sqlClient.run_sql(sql).head(100)

Unnamed: 0,index,job_id,status,user_id,statement,resultset_location,submit_time,end_time,rows_read,rows_returned,bytes_read,error,error_message
0,,b6e56055-51e9-42...,completed,torsten@de.ibm.com,SELECT * FROM co...,cos://s3.us-sout...,2018-11-15T09:31...,2018-11-15T09:31...,9.0,9.0,8593.0,,
1,,2c13a113-a2de-4f...,failed,torsten@de.ibm.com,SELECT xyz FROM ...,cos://s3.us-sout...,2018-11-15T09:30...,2018-11-15T09:30...,,,,SQL execution fa...,A non-existing c...
2,,551b7b81-984c-4b...,completed,torsten@de.ibm.com,WITH orders_ship...,cos://s3.us-sout...,2018-11-15T09:30...,2018-11-15T09:30...,839.0,18.0,20626.0,,
3,,d8d0861d-f15f-4d...,completed,torsten@de.ibm.com,SELECT * FROM co...,cos://s3.us-sout...,2018-11-15T09:29...,2018-11-15T09:30...,9.0,9.0,8593.0,,
4,,f6e15dbe-d4b3-49...,completed,torsten@de.ibm.com,SELECT * FROM co...,cos://s3.us-sout...,2018-11-15T09:29...,2018-11-15T09:29...,9.0,9.0,8593.0,,
5,,038d907b-f344-4c...,completed,torsten@de.ibm.com,SELECT * FROM co...,cos://s3.us-sout...,2018-11-15T09:29...,2018-11-15T09:29...,9.0,9.0,8593.0,,
6,,346169b0-0498-42...,completed,torsten@de.ibm.com,SELECT * FROM co...,cos://s3.us-sout...,2018-11-15T09:29...,2018-11-15T09:29...,9.0,9.0,8593.0,,
7,,4fb82070-d03f-4f...,completed,torsten@de.ibm.com,-- Data from htt...,cos://s3.us-sout...,2018-11-15T08:49...,2018-11-15T08:49...,590.0,935.0,197839.0,,
8,,fc5987c1-f90a-46...,completed,torsten@de.ibm.com,-- Data from htt...,cos://s3.us-sout...,2018-11-14T16:06...,2018-11-14T16:06...,590.0,935.0,197839.0,,
9,,0f1bbb37-7287-4c...,completed,torsten@de.ibm.com,-- Data pulled f...,cos://s3.us-sout...,2018-11-14T16:05...,2018-11-14T16:05...,854.0,854.0,202519.0,,


### <a id="next"></a> 8. Next steps
In this notebook you learned how you can use the `ibmcloudsql` library in a Python notebook to submit SQL queries on data in IBM Cloud Object Storage and how you can interact with the query results. If you want to automate such an SQL query execution as part of your cloud solution, you can use the <a href="https://console.bluemix.net/openwhisk/" target="_blank">IBM Cloud Functions</a> framework. There is a dedicated SQL function available that lets you set up a cloud function to run SQL statements with IBM Cloud SQL Query. You can find the documentation for doing this <a href="https://hub.docker.com/r/ibmfunctions/sqlquery/" target="_blank" rel="noopener noreferrer">here</a>.

### <a id="authors"></a>Authors

**Torsten Steinbach**, Torsten is the lead architect for IBM Cloud SQL Query. Previously he has worked as IBM architect for a series of data management products and services, including DB2, PureData for Analytics and Db2 on Cloud.

<hr>
Copyright &copy; IBM Corp. 2018. This notebook and its source code are released under the terms of the MIT License.

<div style="background:#F5F7FA; height:110px; padding: 2em; font-size:14px;">
<span style="font-size:18px;color:#152935;">Love this notebook? </span>
<span style="font-size:15px;color:#152935;float:right;margin-right:40px;">Don't have an account yet?</span><br>
<span style="color:#5A6872;">Share it with your colleagues and help them discover the power of Watson Studio!</span>
<span style="border: 1px solid #3d70b2;padding:8px;float:right;margin-right:40px; color:#3d70b2;"><a href="https://ibm.co/wsnotebooks" target="_blank" style="color: #3d70b2;text-decoration: none;">Sign Up</a></span><br>
</div>