# Use SQL with data in Hadoop (Python)

This notebook shows you how to query data stored in Hadoop using SQL. Apache Hadoop is an open source software framework for storing and processing big data. With [IBM Big SQL](http://www.ibm.com/software/data/infosphere/hadoop/big-sql.html), you can access data in Hadoop without having to learn new languages or skills.

The notebook uses a Big SQL sandbox environment to show you how to get started with SQL on Hadoop and how to issue queries of any size on the data stored there.

This notebook runs on Python and Spark.

## Table of contents

1. [What is Big SQL](#what_is_big_sql)
1. [Load libraries](#load_libraries)
1. [Access data](#access_data)
1. [Query data](#query_data)
1. [Summary](#summary)


<a id="what_is_big_sql"></a>
## What is Big SQL?

[IBM Big SQL](http://www.ibm.com/software/data/infosphere/hadoop/big-sql.html) is a data warehouse system for Hadoop that you can use to summarize, query, and analyze data. It provides standards-compliant SQL access to data in Hadoop.

This notebook shows you how to get your own set of credentials for the sandbox environment and use them to connect.

<a id="load_libraries"></a>
## Load libraries


To connect to a remote Hadoop cluster with Big SQL and run SQL queries on the data stored there, you need the `ibm_db` library. If loading the library fails the first time you run the notebook, install it with the following command:

In [1]:
!pip install --user ibm_db

Collecting ibm_db
  Downloading ibm_db-2.0.7.tar.gz (553kB)
    100% |████████████████████████████████| 563kB 2.0MB/s 
Building wheels for collected packages: ibm-db
  Running setup.py bdist_wheel for ibm-db ... done
  Stored in directory: /gpfs/fs01/user/s778-bfb6f75aebc10f-9bb95b1f072f/.cache/pip/wheels/d7/05/e2/d7b2f153bfbabcdf8af0fec36d78656142b5966cf6be091af3
Successfully built ibm-db
Installing collected packages: ibm-db
Successfully installed ibm-db-2.0.7


Run the following cell to load the library once it is installed:

In [2]:
import ibm_db
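
One way to decide whether the install step is needed is to check for the library before importing it. The sketch below assumes a Python 3 kernel; the helper name `is_installed` is hypothetical and is not part of `ibm_db`.

```python
import importlib.util


def is_installed(module_name):
    """Return True if a top-level module can be located without importing it."""
    return importlib.util.find_spec(module_name) is not None


if is_installed("ibm_db"):
    print("ibm_db is available")
else:
    print("ibm_db is missing; install it with: pip install --user ibm_db")
```

If the check reports that the library is missing, run the install cell and then restart the kernel before importing.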

To connect to the database for this notebook, you need to get your own set of credentials.


<a id="access_data"></a>
## Access data

To access the Big SQL technology preview sandbox environment, you need your own access credentials. To get your credentials:

1. Sign up for a free Big SQL sandbox account on [IBM Analytics Demo Cloud](https://my.imdemocloud.com/users/sign_up).

2. To set up your account, follow the instructions in the activation email that you are sent.

   __Note:__ Your user name is different from your email address. For example, the user name for `jane.doe@example.com` might be `jane doe`. You will see your user name in the top-right corner of Demo Cloud when you're logged in.

3. Log in to [IBM Analytics Demo Cloud](https://my.imdemocloud.com/users/sign_up) and click __Big SQL Technology Sandbox__. You are automatically approved to join.


Add your `username` and `password` between the quotation marks in the code cell below and run the cell:

In [3]:
username = ""
password = ""
database = "bigsql"
hostname = "iop-bi-master.imdemocloud.com"
port = "32051"
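
If you would rather not type your password into a cell that might be shared, one option is to read the credentials from environment variables. This is only a sketch: the variable names `BIGSQL_USERNAME` and `BIGSQL_PASSWORD` and the helper `load_credentials` are hypothetical, and the demonstration passes a plain dictionary instead of the real environment.

```python
import os


def load_credentials(env=None):
    """Read the user name and password from a mapping (default: os.environ)."""
    env = os.environ if env is None else env
    username = env.get("BIGSQL_USERNAME", "")  # hypothetical variable name
    password = env.get("BIGSQL_PASSWORD", "")  # hypothetical variable name
    if not username or not password:
        raise ValueError("Set BIGSQL_USERNAME and BIGSQL_PASSWORD first")
    return username, password


# Demonstration with an explicit mapping so the example runs anywhere.
demo_user, demo_password = load_credentials(
    {"BIGSQL_USERNAME": "janedoe", "BIGSQL_PASSWORD": "not-a-real-password"}
)
print(demo_user)
```

Calling `load_credentials()` with no argument reads from the process environment and raises `ValueError` if either value is empty.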

Finally, to connect to the Big SQL sandbox environment from the notebook, run the following cell:

In [4]:
conn_string = (
    "DRIVER={{IBM DB2 ODBC DRIVER}};"
    "DATABASE={0};"
    "HOSTNAME={1};"
    "PORT={2};"
    "PROTOCOL=TCPIP;"
    "UID={3};"
    "PWD={4};").format(database, hostname, port, username, password)

conn = ibm_db.connect(conn_string, "", "")

You are now connected to the Big SQL sandbox from the notebook.

If you see an error, check that you entered your user name and password correctly.
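
When troubleshooting a failed connection, it can help to inspect the connection string without exposing the password. The sketch below rebuilds the same string format used in the cell above and masks the `PWD` field. The helper names `build_conn_string` and `mask_password` are hypothetical, the credentials are placeholders, and the masking assumes that the password contains no semicolon.

```python
def build_conn_string(database, hostname, port, username, password):
    """Build a connection string in the same format as the notebook cell."""
    template = (
        "DRIVER={{IBM DB2 ODBC DRIVER}};"  # doubled braces become literal braces
        "DATABASE={0};"
        "HOSTNAME={1};"
        "PORT={2};"
        "PROTOCOL=TCPIP;"
        "UID={3};"
        "PWD={4};"
    )
    return template.format(database, hostname, port, username, password)


def mask_password(conn_string):
    """Replace the PWD value with asterisks so the string is safe to print."""
    parts = []
    for part in conn_string.split(";"):
        if part.startswith("PWD="):
            part = "PWD=****"
        parts.append(part)
    return ";".join(parts)


demo = build_conn_string(
    "bigsql", "iop-bi-master.imdemocloud.com", "32051", "janedoe", "secret"
)
print(mask_password(demo))
```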

<a id="query_data"></a>
## Query data

In this section, you will create a sample table, named `testTable`, load some data into it, and execute a query. Before you do this, you need to check if the table already exists in the sandbox environment, and if it does, remove it so that you can start from scratch.

To prepare and execute a single SQL statement, you will use the `ibm_db.exec_immediate()` function. The function takes the following arguments:
* `connection`  
  * A valid database connection resource returned from the `ibm_db.connect()` function.
* `statement`  
  * A string that contains the SQL statement. This string can include an XQuery expression that is wrapped by an XMLQUERY clause.
  
__Note:__ Big SQL has only one database, called `bigsql`, and you cannot create a new database. However, you can have your own schema, which defaults to your user name. When you connect to the database using your user name and execute `CREATE HADOOP TABLE testTable`, a table called `YOUR_USER_NAME.testTable` is created under your schema.

In this notebook, you will use your schema. Run the next cell to ensure you are using your schema:

In [5]:
query = "USE " + username + ";"
ibm_db.exec_immediate(conn, query)
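
Because the `USE` statement is built by string concatenation, a stray character in the user name ends up directly in the SQL. A cautious sketch is to validate the name first. The helper `use_schema_statement` is hypothetical, and the pattern below is an assumption about what a plain, unquoted schema name looks like, not a statement of Big SQL's identifier rules.

```python
import re

# Assumed shape of a plain identifier: a letter or underscore, then
# letters, digits, or underscores.
_IDENTIFIER = re.compile(r"[A-Za-z_][A-Za-z0-9_]*")


def use_schema_statement(schema):
    """Return a USE statement, rejecting names with unexpected characters."""
    if not _IDENTIFIER.fullmatch(schema):
        raise ValueError("Unexpected characters in schema name: {!r}".format(schema))
    return "USE " + schema + ";"


print(use_schema_statement("janedoe"))
```

The returned string can be passed to `ibm_db.exec_immediate(conn, ...)` in place of the concatenated query.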

Then run the next cell to remove the sample table `testTable` if it already exists, so that you can create a new one:

In [6]:
query = "DROP TABLE IF EXISTS testTable"
ibm_db.exec_immediate(conn, query)

Now create a new `testTable` database table with two columns named `column1` and `column2`. To create the table in your schema, run the cell below:

In [7]:
query = "CREATE HADOOP TABLE testTable (column1 INT, column2 STRING)"
ibm_db.exec_immediate(conn, query)

Then insert some sample data into your `testTable` database table:

In [8]:
query = "INSERT INTO testTable VALUES (1, 'Text1')"
ibm_db.exec_immediate(conn, query)

Now retrieve and show this data:

In [9]:
query = "SELECT * FROM testTable"
stmt = ibm_db.exec_immediate(conn, query)
# fetch_both returns a dictionary keyed by column name and by column position
dictionary = ibm_db.fetch_both(stmt)
print("The COLUMN1 value is : ", dictionary["COLUMN1"])
print("The COLUMN2 value is : ", dictionary["COLUMN2"])

The COLUMN1 value is :  1
The COLUMN2 value is :  Text1


### Query big data

In this section you will use sample data that is provided in Big SQL by default. You will learn how to run queries and create reports about a fictional company named Sample Outdoor Company. 

The GOSALESDW schema will be used. It contains fact tables for the following areas:

* Distribution
* Finance
* Geography
* Marketing
* Organization
* Personnel
* Products
* Retailers
* Sales
* Time

Run the next cell to use the GOSALESDW schema and show the names and employee IDs of 10 employees:  

In [10]:
query = "USE GOSALESDW;"
stmt = ibm_db.exec_immediate(conn, query)
query = "SELECT * FROM EMP_EMPLOYEE_DIM LIMIT 10"
stmt = ibm_db.exec_immediate(conn, query)
dictionary = ibm_db.fetch_both(stmt)
# fetch_both returns False when no rows are left
while dictionary:
    print("ID: ", dictionary["EMPLOYEE_KEY"], " -- Name: ", dictionary["EMPLOYEE_NAME"])
    dictionary = ibm_db.fetch_both(stmt)

ID:  4001  -- Name:  Élizabeth Michel
ID:  4002  -- Name:  Émile Clermont
ID:  4003  -- Name:  Étienne Jauvin
ID:  4004  -- Name:  Frank Fuhlroth
ID:  4005  -- Name:  Gunter Erler
ID:  4006  -- Name:  Björn Winkler
ID:  4007  -- Name:  Björn Winkler
ID:  4008  -- Name:  Belinda Jansen-Velasquez
ID:  4009  -- Name:  Ellen Shapiro
ID:  4010  -- Name:  Maria Laponder
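
The fetch loop above can be wrapped in a small generator so that rows can be consumed with an ordinary `for` loop. This is a sketch rather than part of `ibm_db`: `iter_rows` and `fake_fetch` are hypothetical names, and the fake fetch function stands in for `ibm_db.fetch_both` so that the example runs without a connection.

```python
def iter_rows(fetch, stmt):
    """Yield rows from a statement until the fetch function returns False."""
    row = fetch(stmt)
    while row:
        yield row
        row = fetch(stmt)


# Stand-in for ibm_db.fetch_both: returns two rows, then False.
_fake_rows = iter([{"EMPLOYEE_KEY": 4001}, {"EMPLOYEE_KEY": 4002}])


def fake_fetch(_stmt):
    return next(_fake_rows, False)


keys = [row["EMPLOYEE_KEY"] for row in iter_rows(fake_fetch, None)]
print(keys)
```

With a live connection, the equivalent call would be `for row in iter_rows(ibm_db.fetch_both, stmt): ...`.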


You can improve the `SELECT` statement by adding a *predicate* to the second statement to return fewer rows. A predicate is a condition on a query that reduces and narrows the focus of the result. A predicate on a query with a multi-way join can improve the performance of the query.

Run the next cell to narrow the search to regions whose English name starts with `Amer`, and show the region code of the first match:

In [11]:
query = "SELECT * FROM gosalesdw.go_region_dim WHERE region_en LIKE 'Amer%';"
stmt = ibm_db.exec_immediate(conn, query)
dictionary = ibm_db.fetch_both(stmt)
dictionary['REGION_CODE']


710
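
To see the effect of a predicate without a sandbox connection, the sketch below runs the same kind of `LIKE` filter against a tiny in-memory table using Python's built-in `sqlite3` module. Only region code 710 comes from this notebook; the region names and the other codes are invented for illustration, and SQLite is a stand-in, not Big SQL.

```python
import sqlite3

demo = sqlite3.connect(":memory:")
demo.execute("CREATE TABLE go_region_dim (region_code INTEGER, region_en TEXT)")
demo.executemany(
    "INSERT INTO go_region_dim VALUES (?, ?)",
    # Illustrative rows; names and the last two codes are made up.
    [(710, "Americas"), (740, "Asia Pacific"), (750, "Northern Europe")],
)

# Without a predicate, every row is counted.
all_rows = demo.execute("SELECT COUNT(*) FROM go_region_dim").fetchone()[0]

# With a predicate, only rows whose name starts with 'Amer' are returned.
filtered = demo.execute(
    "SELECT region_code, region_en FROM go_region_dim "
    "WHERE region_en LIKE 'Amer%'"
).fetchall()

print(all_rows, filtered)
demo.close()
```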

You can also run a query that returns the number of rows in a table. 

In [12]:
query = "SELECT COUNT(*) FROM gosalesdw.go_region_dim;"
stmt = ibm_db.exec_immediate(conn, query)
dictionary = ibm_db.fetch_both(stmt)
dictionary


{0: '21', '1': '21'}
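
`ibm_db.fetch_both()` returns each column twice: once under its position and once under its name. In the output above, the unnamed `COUNT(*)` column appears under position `0` and under what appears to be a generated name, `'1'`, and the value arrives as a string. A minimal sketch of turning it into a number, using the dictionary shown above as literal data:

```python
# The dictionary printed by the previous cell, copied as literal data.
row = {0: '21', '1': '21'}

# Position 0 is the first (and only) column of the result.
row_count = int(row[0])
print(row_count)
```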

To learn what products were ordered from Sample Outdoor Company, and by what method they were ordered, you must join information from multiple tables in the `gosalesdw` schema, because in a relational model the data is spread across several tables.


In [13]:
import pandas as pd

lis = []
query = "\
SELECT pnumb.product_name, sales.quantity, \
  meth.order_method_en \
FROM \
  gosalesdw.sls_sales_fact sales, \
  gosalesdw.sls_product_dim prod, \
  gosalesdw.sls_product_lookup pnumb, \
  gosalesdw.sls_order_method_dim meth \
WHERE \
  pnumb.product_language='EN' \
  AND sales.product_key=prod.product_key \
  AND prod.product_number=pnumb.product_number \
  AND meth.order_method_key=sales.order_method_key LIMIT 10;"
stmt = ibm_db.exec_immediate(conn, query)
dictionary = ibm_db.fetch_both(stmt)
# Collect every fetched row; fetch_both returns False when no rows are left
while dictionary:
    lis.append(dictionary)
    dictionary = ibm_db.fetch_both(stmt)

# Show the first five of the ten rows
pd.DataFrame(lis).head()

|   | 0 | 1 | 2 | ORDER_METHOD_EN | PRODUCT_NAME | QUANTITY |
|---|---|---|---|---|---|---|
| 0 | Compact Relief Kit | 313 | Sales visit | Sales visit | Compact Relief Kit | 313 |
| 1 | Course Pro Putter | 587 | Telephone | Telephone | Course Pro Putter | 587 |
| 2 | Blue Steel Max Putter | 214 | Telephone | Telephone | Blue Steel Max Putter | 214 |
| 3 | Course Pro Gloves | 576 | Telephone | Telephone | Course Pro Gloves | 576 |
| 4 | Glacier Deluxe | 129 | Sales visit | Sales visit | Glacier Deluxe | 129 |
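
The output above contains every value twice (columns `0`, `1`, `2` and the named columns) because `ibm_db.fetch_both()` keys each row by position and by name. One way to keep only the named columns is to filter each row before building the DataFrame. This is a sketch assuming a Python 3 kernel, where column names are `str` and positions are `int`; the helper `named_columns` is hypothetical, and the sample row copies the first row of the output above.

```python
def named_columns(row):
    """Keep only the entries keyed by column name, dropping positional keys."""
    return {key: value for key, value in row.items() if isinstance(key, str)}


# First row of the output above, in the shape fetch_both returns it.
sample = {
    0: "Compact Relief Kit",
    1: 313,
    2: "Sales visit",
    "PRODUCT_NAME": "Compact Relief Kit",
    "QUANTITY": 313,
    "ORDER_METHOD_EN": "Sales visit",
}

cleaned = named_columns(sample)
print(sorted(cleaned))
```

With the list built in the cell above, `pd.DataFrame([named_columns(r) for r in lis])` would then show three columns instead of six. Fetching with `ibm_db.fetch_assoc()` instead of `ibm_db.fetch_both()` is another option, since it returns rows keyed by column name only.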


<a id="summary"></a>
## Summary

In this sample, you learned how to query data stored in Hadoop using SQL based on sample data in a Big SQL sandbox environment.

<a id="resources"></a>
### Resources

For more information on Big SQL, see [Big SQL](https://www.ibm.com/support/knowledgecenter/SSPT3X_2.1.2/com.ibm.swg.im.infosphere.biginsights.bigsql.doc/doc/bsql_reference.html).

## Want to learn more?
### Free courses on <a href="https://bigdatauniversity.com/courses/?utm_source=tutorial-dashdb-python&utm_medium=github&utm_campaign=bdu/" rel="noopener noreferrer" target="_blank">Big Data University</a>: <a href="https://bigdatauniversity.com/courses/?utm_source=tutorial-dashdb-python&utm_medium=github&utm_campaign=bdu" rel="noopener noreferrer" target="_blank"><img src = "https://ibm.box.com/shared/static/xomeu7dacwufkoawbg3owc8wzuezltn6.png" width=600px> </a>

### Authors

**Saeed Aghabozorgi**, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large data sets.

**Polong Lin** is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.

Copyright © 2016, 2017 Big Data University. This notebook and its source code are released under the terms of the <a href="https://bigdatauniversity.com/mit-license/" rel="noopener noreferrer" target="_blank">MIT License</a>.