# Access dashDB and DB2 with Python

This notebook shows how to access a dashDB data warehouse or DB2 database when using Python. The examples use a dashDB warehouse, but the instructions apply to both dashDB and DB2.

## Table of contents

1. [Setup](#Setup) 
1. [Import the *ibmdbpy* Python library](#Import-the-ibmdbpy-Python-library)
1. [Identify and enter the database connection credentials](#Identify-and-enter-the-database-connection-credentials)
1. [Create the database connection](#Create-the-database-connection)
1. [Use dataframe to read and manipulate tables](#Use-dataframe-to-read-and-manipulate-tables)
1. [Close the database connection](#Close-the-database-connection)
1. [Summary](#Summary)


## Setup

Before you begin, you need a *dashDB* warehouse. dashDB is a fully managed cloud data warehouse, purpose-built for analytics. It offers massively parallel processing (MPP) scale and compatibility with a wide range of business intelligence (BI) tools.

[Try dashDB free of charge on IBM Bluemix.](https://console.ng.bluemix.net/catalog/services/dashdb)

<a class="ibm-tooltip" href="https://console.ng.bluemix.net/catalog/services/dashdb" target="_blank" title="" id="ibm-tooltip-0">
<img alt="IBM Bluemix.Get started now" height="193" width="153" src="https://ibm.box.com/shared/static/42yt39czuksqdi278xpy96txtlw3lfmb.png" >
</a>





## Import the *ibmdbpy* Python library

Python support for dashDB and DB2 is provided by the [ibmdbpy Python library](https://pypi.python.org/pypi/ibmdbpy). Connecting to dashDB or DB2 is also enabled by a DB2 driver, libdb2.so.

The JDBC connection runs on a Java virtual machine (JVM). From the ibmdbpy library you can use JDBC to connect to a remote dashDB/DB2 instance. To connect through JDBC, you need to import the *JayDeBeApi* package.

Run the following commands to install and load the JayDeBeApi package and the ibmdbpy library into your notebook:

In [1]:
!pip install jaydebeapi --user  
!pip install ibmdbpy --user 



In [2]:
import jaydebeapi
from ibmdbpy import IdaDataBase
from ibmdbpy import IdaDataFrame

In [3]:
import os
os.environ['CLASSPATH'] = "/usr/local/src/data-connectors-1.4.1/db2jcc4-10.5.0.6.jar"

In [4]:
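import jpype
# Pass the driver JAR to the JVM through its classpath option.
args = '-Djava.class.path=%s' % os.environ['CLASSPATH']
jvm = jpype.getDefaultJVMPath()
# Start the JVM only once, so that re-running this cell does not fail.
if not jpype.isJVMStarted():
    jpype.startJVM(jvm, args)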
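
If the driver JAR is missing from the path in `CLASSPATH`, the problem typically only shows up later, when the connection is opened. The following is a minimal sketch, not part of ibmdbpy or JPype, that checks the JAR files up front and builds the same `-Djava.class.path` argument. The helper name `build_classpath_arg` is an assumption made for illustration.

```python
import os


def build_classpath_arg(jar_paths):
    """Return a -Djava.class.path JVM argument after checking that each JAR exists.

    Hypothetical helper for illustration; it is not provided by ibmdbpy or JPype.
    """
    missing = [path for path in jar_paths if not os.path.isfile(path)]
    if missing:
        raise FileNotFoundError("JDBC driver JAR not found: " + ", ".join(missing))
    # os.pathsep is ':' on Linux and macOS and ';' on Windows.
    return "-Djava.class.path=" + os.pathsep.join(jar_paths)
```

The returned string could be passed to `jpype.startJVM` in place of `args` in the cell above.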


## Identify and enter the database connection credentials

Connecting to dashDB or a DB2 database requires the following information:
* Database name 
* Host DNS name or IP address 
* Host port
* Connection protocol
* User ID
* User password

All of this information must be captured in a connection string in a subsequent step. Provide the dashDB or DB2 connection information as shown:

In [5]:
dsn_uid = ""       # e.g. dash104434
dsn_pwd = ""       # e.g. xxxx
dsn_hostname = ""  # e.g. awh-yp-small03.services.dal.bluemix.net
dsn_port = ""      # e.g. 50001
dsn_database = ""  # e.g. BLUDB
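
Typing credentials directly into a notebook cell makes it easy to share them by accident. As one possible alternative, the sketch below reads the same five values from environment variables and reports any that are missing. The function name `load_credentials` and the variable names such as `DASHDB_UID` are hypothetical, chosen only for this example.

```python
import os


def load_credentials(environ=os.environ):
    """Read the connection settings from environment variables.

    The variable names below are hypothetical; use whatever names suit your setup.
    """
    names = {
        "uid": "DASHDB_UID",
        "pwd": "DASHDB_PWD",
        "hostname": "DASHDB_HOSTNAME",
        "port": "DASHDB_PORT",
        "database": "DASHDB_DATABASE",
    }
    creds = {key: environ.get(var, "") for key, var in names.items()}
    # Fail early with a clear message instead of building a broken connection string.
    missing = sorted(names[key] for key, value in creds.items() if not value)
    if missing:
        raise ValueError("Missing connection settings: " + ", ".join(missing))
    return creds
```

With the variables set, `creds = load_credentials()` returns a dictionary whose values can be assigned to `dsn_uid`, `dsn_pwd`, and so on.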

## Create the database connection

The following code snippet creates a connection string `connection_string`
and uses the `connection_string` to create a DB2 connection object:


In [6]:
connection_string='jdbc:db2://'+dsn_hostname+':'+dsn_port+'/'+dsn_database+':user='+dsn_uid+';password='+dsn_pwd+";" 
idadb=IdaDataBase(dsn=connection_string)
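
The string concatenation above raises a `TypeError` if `dsn_port` is an integer rather than a string. A small formatting helper avoids that and keeps the URL layout in one place. This is a sketch only: `build_jdbc_url` is a hypothetical name, and the URL layout simply mirrors the one used in this notebook.

```python
def build_jdbc_url(hostname, port, database, uid, pwd):
    """Build a DB2 JDBC URL in the same layout as the connection string above.

    Hypothetical helper for illustration; str.format accepts both int and str ports.
    """
    return "jdbc:db2://{}:{}/{}:user={};password={};".format(
        hostname, port, database, uid, pwd
    )


# Placeholder values, not real credentials.
print(build_jdbc_url("example.host", 50001, "BLUDB", "user1", "secret"))
# → jdbc:db2://example.host:50001/BLUDB:user=user1;password=secret;
```

The result can be passed to `IdaDataBase(dsn=...)` exactly like `connection_string`.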

## Use dataframe to read and manipulate tables

You can now use the connection object `idadb` to query the database:

In [7]:
df=idadb.show_tables(show_all = True)
df.head(5)

Unnamed: 0,TABSCHEMA,TABNAME,OWNER,TYPE
0,GOSALES,BRANCH,DB2INST1,T
1,GOSALES,CONVERSION_RATE,DB2INST1,T
2,GOSALES,COUNTRY,DB2INST1,T
3,GOSALES,CURRENCY_LOOKUP,DB2INST1,T
4,GOSALES,EURO_CONVERSION,DB2INST1,T


In [8]:
idadb.exists_table_or_view('GOSALESDW.EMP_EXPENSE_FACT')

True

Using our previously opened IdaDataBase instance named ‘idadb’, we can open one or several IdaDataFrame objects. They behave like pointers to remote tables.

Let us open the *EMP_EXPENSE_FACT* data set, assuming it is stored in the database under the name ‘GOSALESDW.EMP_EXPENSE_FACT’. The following cell assigns the data set to an IdaDataFrame, which is a reference to the remote table rather than an in-memory pandas DataFrame.

The [Pandas data analysis library](http://pandas.pydata.org/) provides high-performance, easy-to-use data structures and data analysis tools for the Python programming language. Pandas allows easy processing and manipulation of tabular data, so it is a perfect fit for data extracted from relational databases.


In [9]:
idadf = IdaDataFrame(idadb, 'GOSALESDW.EMP_EXPENSE_FACT')

You can explore the data in the IdaDataFrame by using its built-in functions.

Use IdaDataFrame.head to get the first n records of your data set (default 5):

In [10]:
idadf.head(5)

Unnamed: 0,DAY_KEY,ORGANIZATION_KEY,POSITION_KEY,EMPLOYEE_KEY,EXPENSE_TYPE_KEY,ACCOUNT_KEY,EXPENSE_UNIT_QUANTITY,EXPENSE_TOTAL
0,20100131,11101,43639,4001,2120,8052,0.08,513.35
1,20100131,11101,43639,4001,2131,8049,165.0,4125.0
2,20100131,11101,43639,4001,2130,8050,0.005,2291.88
3,20100131,11101,43639,4001,2124,8056,0.03,192.51
4,20100131,11101,43639,4001,2122,8054,0.11,705.86


Use IdaDataFrame.tail to get the last n records of your data set (default 5):

In [11]:
idadf.tail(5)

Unnamed: 0,DAY_KEY,ORGANIZATION_KEY,POSITION_KEY,EMPLOYEE_KEY,EXPENSE_TYPE_KEY,ACCOUNT_KEY,EXPENSE_UNIT_QUANTITY,EXPENSE_TOTAL
127979,20130731,11187,43603,4960,2120,8052,0.08,929.23
127980,20130731,11187,43603,4960,2122,8054,0.11,1277.69
127981,20130731,11187,43603,4960,2124,8056,0.03,348.46
127982,20130731,11187,43603,4960,2131,8049,157.5,11087.42
127983,20130731,11187,43603,4960,2134,8050,37.5,13500.0


__Note:__ Because dashDB operates on a distributed system, the order of rows using IdaDataFrame.head and IdaDataFrame.tail is not guaranteed unless the table is sorted (using an ‘ORDER BY’ clause) or a column is declared as index for the IdaDataFrame (parameter/attribute indexer).
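
To illustrate why an explicit sort matters, here is a minimal sketch that uses Python's built-in `sqlite3` module as a local stand-in for the warehouse (an assumption made so that the example runs without a dashDB connection). The table name `expense` and its rows are made up. With an `ORDER BY` clause, the "first n rows" are well defined regardless of the order in which the rows were stored.

```python
import sqlite3

# In-memory SQLite database used only as a stand-in for dashDB.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE expense (day_key INTEGER, expense_total REAL)")
conn.executemany(
    "INSERT INTO expense VALUES (?, ?)",
    [(20100331, 192.51), (20100131, 513.35), (20100228, 4125.0)],
)

# ORDER BY fixes the row order, so LIMIT returns a well-defined "head".
first_rows = conn.execute(
    "SELECT day_key, expense_total FROM expense ORDER BY day_key LIMIT 2"
).fetchall()
conn.close()

print(first_rows)
# → [(20100131, 513.35), (20100228, 4125.0)]
```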

IdaDataFrame also implements most attributes that are available in a pandas DataFrame:

In [12]:
idadf.shape

(127984, 8)

In [13]:
idadf.columns

Index([u'DAY_KEY', u'ORGANIZATION_KEY', u'POSITION_KEY', u'EMPLOYEE_KEY',
       u'EXPENSE_TYPE_KEY', u'ACCOUNT_KEY', u'EXPENSE_UNIT_QUANTITY',
       u'EXPENSE_TOTAL'],
      dtype='object')

Several standard statistics functions from the pandas interface are also available for IdaDataFrame. For example, let us calculate the covariance matrix for the EMP_EXPENSE_FACT data set:

In [14]:
idadf.cov()

Unnamed: 0,DAY_KEY,ORGANIZATION_KEY,POSITION_KEY,EMPLOYEE_KEY,EXPENSE_TYPE_KEY,ACCOUNT_KEY,EXPENSE_UNIT_QUANTITY,EXPENSE_TOTAL
DAY_KEY,107444500.0,-1301.774305,-2699.336397,-74463.200864,-2541.104007,-88.733494,-2747.250164,338749.301508
ORGANIZATION_KEY,-1301.774,977.978493,-60.746262,2228.417559,-27.240468,0.756326,11.18659,-2999.218552
POSITION_KEY,-2699.336,-60.746262,148.234472,-2070.93463,10.28491,-1.006254,-13.697657,1101.107528
EMPLOYEE_KEY,-74463.2,2228.417559,-2070.93463,89393.601947,-237.530049,39.144365,525.387975,47399.031411
EXPENSE_TYPE_KEY,-2541.104,-27.240468,10.28491,-237.530049,88.103306,4.663223,26.490807,5577.918013
ACCOUNT_KEY,-88.73349,0.756326,-1.006254,39.144365,4.663223,6.414971,-92.920363,-2669.484571
EXPENSE_UNIT_QUANTITY,-2747.25,11.18659,-13.697657,525.387975,26.490807,-92.920363,3331.325768,76740.540006
EXPENSE_TOTAL,338749.3,-2999.218552,1101.107528,47399.031411,5577.918013,-2669.484571,76740.540006,4321078.159027
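
To make the contents of such a matrix concrete, the following pure-Python sketch computes a covariance matrix for a tiny made-up data set. It assumes the sample covariance (division by n - 1), which is what pandas uses; whether ibmdbpy computes exactly the same estimator in the database is an assumption here. The function name `covariance_matrix` is hypothetical.

```python
def covariance_matrix(columns):
    """Return the sample covariance of every pair of columns.

    columns maps a column name to a list of numbers; all lists have equal length.
    Hypothetical helper for illustration, using the n - 1 denominator.
    """
    names = list(columns)
    n = len(columns[names[0]])
    means = {name: sum(columns[name]) / n for name in names}
    cov = {}
    for a in names:
        cov[a] = {}
        for b in names:
            # Sum of products of deviations from the column means.
            total = sum(
                (x - means[a]) * (y - means[b])
                for x, y in zip(columns[a], columns[b])
            )
            cov[a][b] = total / (n - 1)
    return cov


# Toy data, not taken from EMP_EXPENSE_FACT.
toy = {"QUANTITY": [1.0, 2.0, 3.0], "TOTAL": [2.0, 4.0, 6.0]}
result = covariance_matrix(toy)
print(result["QUANTITY"]["TOTAL"])
# → 2.0
```

Each diagonal entry is the variance of one column, and the matrix is symmetric, which matches the structure of the output above.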


You can subset the rows of an IdaDataFrame by indexing it with a slice object, or by using the IdaDataFrame.loc attribute, which contains an ibmdbpy.Loc object. However, the row selection might be inaccurate if the IdaDataFrame is not sorted or has no indexer: dashDB can distribute the data across several nodes, and because it is a column-oriented database, row numbers are undefined:

In [15]:
idadf_new = idadf[0:9] # Select the first 10 rows
idadf_new.head()

  " was given and the dataset was not sorted")


Unnamed: 0,DAY_KEY,ORGANIZATION_KEY,POSITION_KEY,EMPLOYEE_KEY,EXPENSE_TYPE_KEY,ACCOUNT_KEY,EXPENSE_UNIT_QUANTITY,EXPENSE_TOTAL
0,20111231,11136,43631,4914,2124,8056,0.03,171.63
1,20111231,11136,43631,4914,2131,8049,165.0,5721.15
2,20120131,11136,43631,4914,2120,8052,0.08,457.69
3,20120131,11136,43631,4914,2124,8056,0.03,171.63
4,20120131,11136,43631,4914,2122,8054,0.11,629.33


## Close the database connection

IdaDataBase instances must be closed when you are done with them. Closing the *IdaDataBase* is equivalent to closing the connection: once the connection is closed, it is no longer possible to use the *IdaDataBase* instance or any IdaDataFrame instances that were opened on this connection.

In [16]:
idadb.close()

Connection closed.
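
If a cell raises an exception before `close()` is reached, the connection stays open. One way to guard against that is `contextlib.closing` from the standard library, which calls `close()` on any object when the `with` block exits. The sketch below demonstrates the pattern with a hypothetical stand-in class so that it runs without a database; the assumption is that the same pattern applies to an `IdaDataBase` instance because it exposes a `close()` method.

```python
from contextlib import closing


class StandInConnection:
    """Hypothetical stand-in for an object with a close() method, such as IdaDataBase."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


# closing() calls connection.close() on exit, even if the block raises.
with closing(StandInConnection()) as connection:
    pass  # work with the connection here

print(connection.closed)
# → True
```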


## Summary

This notebook demonstrated how to establish a connection to a dashDB / DB2 database from Python using the ibmdbpy library.

## Want to learn more?
### Free courses on <a href="https://bigdatauniversity.com/courses/?utm_source=tutorial-dashdb-python&utm_medium=github&utm_campaign=bdu/" rel="noopener noreferrer" target="_blank">Big Data University</a>: <a href="https://bigdatauniversity.com/courses/?utm_source=tutorial-dashdb-python&utm_medium=github&utm_campaign=bdu" rel="noopener noreferrer" target="_blank"><img src = "https://ibm.box.com/shared/static/xomeu7dacwufkoawbg3owc8wzuezltn6.png" width=600px> </a>

### Authors

**Saeed Aghabozorgi**, PhD, is a Data Scientist at IBM with a track record of developing enterprise-level applications that substantially increase clients' ability to turn data into actionable knowledge. He is a researcher in the data mining field and an expert in developing advanced analytic methods like machine learning and statistical modelling on large data sets.

**Polong Lin** is a Data Scientist at IBM in Canada. Under the Emerging Technologies division, Polong is responsible for educating the next generation of data scientists through Big Data University. Polong is a regular speaker at conferences and meetups, and holds an M.Sc. in Cognitive Psychology.

Copyright © 2016, 2017 Big Data University. This notebook and its source code are released under the terms of the <a href="https://bigdatauniversity.com/mit-license/" rel="noopener noreferrer" target="_blank">MIT License</a>.