# Load data into a notebook from different sources

Before you can start analyzing your data, you have to load the data from a data source. You can store your data in many different data sources. This reference notebook shows you how to load and integrate data in a notebook from the following data sources:
-  Object Storage
-  dashDB
-  Cloudant

The notebook sample code shows you how to load data into a notebook by using Python and the PySpark stack. You can copy and paste these code snippets into the notebook you are developing. 

This notebook runs on Python 2 with Spark 1.6.

## Table of contents

- [Load data from Object Storage](#osv3)
  - [Load data by using Python](#osv3_python)
  - [Load data by using PySpark](#osv3_pyspark)
- [Load data from dashDB](#dashdb)
  - [Load data by using PySpark](#dashdb_pyspark)
- [Load data from a Cloudant database](#cloudant)
  - [Load data by using Python](#cloudant_python)
  - [Load data by using PySpark](#cloudant_pyspark)
- [Summary](#summary)

<a id="osv3"></a>
## Load data from Object Storage
IBM® Object Storage for Bluemix® provides you with access to a fully provisioned Swift Object Storage account to manage your data. Object Storage uses OpenStack Identity (Keystone) for authentication and can be accessed directly by using [OpenStack Object Storage (Swift) API v3](http://developer.openstack.org/api-ref-identity-v3.html#credentials-v3). 

When you load data for use in a notebook, the data file is listed on the `Data` pane in the notebook and is stored in the Object Storage instance associated with your project.

Click the next code cell to set the focus on it. To add the credentials for the data file to this cell, select the **Data** icon on the notebook action bar. Then, if the data file is a CSV file, click **Insert to code > Credentials** on the data file in the `Data` pane. If the data file has another format, **Insert to code** is not a list; clicking it adds the credentials directly.

<div class="alert alert-block alert-info">Note: The Python dictionary with the credentials that is generated for you is given a generic name. Rename the dictionary variable to `credentials_osv3` and run the code cell to proceed.</div>
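The Object Storage cells in this notebook read specific keys from `credentials_osv3`. As a minimal sketch, you could check that the dictionary contains those keys before you run the loading cells. The helper name `missing_keys` and the placeholder values are hypothetical and not part of the generated code; the key list is taken from the functions used later in this section.

```python
# Keys that the Object Storage cells in this notebook read from the dictionary.
REQUIRED_OSV3_KEYS = [
    'auth_url', 'username', 'user_id', 'domain_id', 'project_id',
    'password', 'region', 'container', 'filename',
]

def missing_keys(credentials, required=REQUIRED_OSV3_KEYS):
    """Return the required keys that are absent from the credentials dictionary."""
    return [key for key in required if key not in credentials]

# Placeholder values for illustration only; use the generated dictionary instead.
example_credentials = {
    'auth_url': 'https://identity.example.com',
    'username': 'user_name',
    'password': 'secret',
    'region': 'dallas',
    'container': 'notebooks',
    'filename': 'precipitation.csv',
}

print(missing_keys(example_credentials))
```

An empty list means that every key the later cells rely on is present.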

<a id="osv3_python"></a>
### Load data by using Python

Run the next cells to load the data from a file in Object Storage by using Python functions: 

In [2]:
import requests, StringIO, json

def get_file_content(credentials):
    """For the given credentials, this function returns a StringIO object containing the file content
    from the associated Bluemix Object Storage V3."""

    url1 = ''.join([credentials['auth_url'], '/v3/auth/tokens'])
    data = {'auth': {'identity': {'methods': ['password'],
            'password': {'user': {'name': credentials['username'],'domain': {'id': credentials['domain_id']},
            'password': credentials['password']}}}}}
    headers1 = {'Content-Type': 'application/json'}
    resp1 = requests.post(url=url1, data=json.dumps(data), headers=headers1)
    resp1_body = resp1.json()
    for e1 in resp1_body['token']['catalog']:
        if e1['type'] == 'object-store':
            for e2 in e1['endpoints']:
                if e2['interface'] == 'public' and e2['region'] == credentials['region']:
                    url2 = ''.join([e2['url'],'/', credentials['container'], '/', credentials['filename']])
    s_subject_token = resp1.headers['x-subject-token']
    headers2 = {'X-Auth-Token': s_subject_token, 'accept': 'application/json'}
    resp2 = requests.get(url=url2, headers=headers2)
    return StringIO.StringIO(resp2.content)

In [3]:
import pandas as pd

data_df = pd.read_csv(get_file_content(credentials_osv3))
data_df.head()

Unnamed: 0,Country or Area,1990,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009
0,Albania,28385.0,40311.0,0.0,0.0,0.0,38284.0,30683.0,30491.0,35883.0,27893.0,42787.0,42840.0,32380.0,30964.0,0.0,0.0
1,Algeria,76160.0,90270.0,53380.0,74460.0,66470.0,50150.0,64430.0,43840.0,37317.0,0.0,0.0,0.0,0.0,0.0,100000.0,0.0
2,Andorra,539.947998,510.673004,560.340027,434.475006,254.0,450.151001,518.666016,456.626007,565.559021,566.583008,567.044006,530.278015,353.220001,306.630005,0.0,0.0
3,Anguilla,93.099998,100.730003,0.0,0.0,0.0,0.0,68.190002,70.730003,68.190002,108.769997,84.25,124.400002,99.550003,86.290001,96.889999,71.080002
4,Antigua and Barbuda,300.299988,374.5,323.299988,279.200012,384.5,426.799988,249.600006,238.0,268.600006,253.899994,426.899994,371.0,332.799988,293.600006,392.5,276.899994


Now your data is in a `pandas.DataFrame` and you can begin analyzing it. 
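The nested loop in `get_file_content` walks the service catalog in the token response and picks the public `object-store` endpoint for your region. The following is a minimal sketch of that lookup on its own, run against a hand-written catalog. The helper name `find_object_store_url` and the sample URLs are hypothetical; a real catalog comes from the Keystone response. Unlike the loop in `get_file_content`, this sketch returns the first match and raises an error when no endpoint matches.

```python
def find_object_store_url(catalog, region):
    """Return the public object-store endpoint URL for the given region."""
    for service in catalog:
        if service['type'] != 'object-store':
            continue
        for endpoint in service['endpoints']:
            if endpoint['interface'] == 'public' and endpoint['region'] == region:
                return endpoint['url']
    raise ValueError('No public object-store endpoint found for region: ' + region)

# Hand-written sample that mimics the shape of resp1_body['token']['catalog'].
sample_catalog = [
    {'type': 'identity', 'endpoints': [
        {'interface': 'public', 'region': 'dallas',
         'url': 'https://identity.example.com'}]},
    {'type': 'object-store', 'endpoints': [
        {'interface': 'internal', 'region': 'dallas',
         'url': 'https://internal.example.com/v1/AUTH_abc'},
        {'interface': 'public', 'region': 'dallas',
         'url': 'https://public.example.com/v1/AUTH_abc'}]},
]

base_url = find_object_store_url(sample_catalog, 'dallas')
# The file URL is the endpoint URL followed by the container and the file name.
file_url = '/'.join([base_url, 'my_container', 'my_file.csv'])
print(file_url)
```

Failing early with a clear message can make a wrong `region` value easier to diagnose than an undefined URL variable later in the function.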

<a id="osv3_pyspark"></a>
### Load data by using PySpark

Before you can access data in the data file in Object Storage by using the [`SparkContext`](https://spark.apache.org/docs/1.6.0/api/python/pyspark.html#pyspark.SparkContext) object, you must set the Hadoop configuration by using the following configuration function:

In [4]:
def set_hadoop_config(credentials):
    """This function sets the Hadoop configuration with given credentials, 
    so it is possible to access data using SparkContext"""
    
    prefix = "fs.swift.service." + credentials['name']
    hconf = sc._jsc.hadoopConfiguration()
    hconf.set(prefix + ".auth.url", credentials['auth_url']+'/v3/auth/tokens')
    hconf.set(prefix + ".auth.endpoint.prefix", "endpoints")
    hconf.set(prefix + ".tenant", credentials['project_id'])
    hconf.set(prefix + ".username", credentials['user_id'])
    hconf.set(prefix + ".password", credentials['password'])
    hconf.setInt(prefix + ".http.port", 8080)
    hconf.set(prefix + ".region", credentials['region'])
    hconf.setBoolean(prefix + ".public", True)

Set the Hadoop configuration and give it a name, for example, `keystone`:

In [5]:
# you can choose any name
credentials_osv3['name'] = 'keystone'
set_hadoop_config(credentials_osv3)

data_rdd = sc.textFile("swift://" + credentials_osv3['container'] + "." + credentials_osv3['name'] + "/" + credentials_osv3['filename'])
data_rdd.take(5)

[u'Country or Area,1990,1995,1996,1997,1998,1999,2000,2001,2002,2003,2004,2005,2006,2007,2008,2009',
 u'Albania,28385.0,40311.0,0.0,0.0,0.0,38284.0,30683.0,30491.0,35883.0,27893.0,42787.0,42840.0,32380.0,30964.0,0.0,0.0',
 u'Algeria,76160.0,90270.0,53380.0,74460.0,66470.0,50150.0,64430.0,43840.0,37317.0,0.0,0.0,0.0,0.0,0.0,100000.0,0.0',
 u'Andorra,539.947998047,510.67300415,560.340026855,434.475006104,254.0,450.151000977,518.666015625,456.62600708,565.559020996,566.583007812,567.044006348,530.278015137,353.220001221,306.630004883,0.0,0.0',
 u'Anguilla,93.0999984741,100.730003357,0.0,0.0,0.0,0.0,68.1900024414,70.7300033569,68.1900024414,108.769996643,84.25,124.400001526,99.5500030518,86.2900009155,96.8899993896,71.0800018311']

Now your data is in a `pyspark.RDD` and you can begin analyzing it. 

<div class="alert alert-block alert-info">Note: To access CSV files in Object Storage and load data to use in the notebook, you can use the code generation functions on the `Insert to code` list on each data file in the `Data` pane in the notebook. </div>
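The path passed to `sc.textFile` follows the pattern `swift://<container>.<name>/<filename>`, where `<name>` is the configuration name that is also used in the Hadoop property prefix. As a small sketch, the path could be built by a helper like the following. The function name `swift_url` and the placeholder values are hypothetical.

```python
def swift_url(credentials):
    """Build the swift:// path from the container, configuration name, and file name."""
    return 'swift://{container}.{name}/{filename}'.format(**credentials)

# Placeholder values for illustration only.
example_credentials = {
    'container': 'notebooks',
    'name': 'keystone',
    'filename': 'precipitation.csv',
}

print(swift_url(example_credentials))
```

The `name` in the path must match the name used in `set_hadoop_config`; otherwise the connector has no configuration to look up.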

<a id="dashdb"></a>
## Load data from dashDB

dashDB is a data warehousing and analytics solution. Use dashDB to store relational data, including special data types such as geospatial data. You can leverage the in-memory database technology to use both columnar and row-based tables. 

In this notebook, you must use a connection to an IBM dashDB for Bluemix service instance. You can create data service connections on your project page. The dashDB instance name appears in the `Data` pane. 

Click the next code cell and use the `Insert to code` function below the dashDB instance name in the `Data` pane to add the dashDB credentials. 

<div class="alert alert-block alert-info">Note: The Python dictionary with the dashDB credentials that is generated for you is given a generic name. Rename the dictionary variable to `credentials_dashdb` and run the code cell to proceed.</div>

<a id="dashdb_pyspark"></a>
### Load data by using PySpark

Add the credentials of your dashDB instance that contains your data and run the next cell to load this data.

The code in the cell reads the credentials and loads the data from dashDB into a DataFrame data structure.

In [7]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

props = {}
props['user'] = credentials_dashdb['username']
props['password'] = credentials_dashdb['password']

# fill in table name
table = credentials_dashdb['username'] + "." + "PRECIPITATION"

data_df = sqlContext.read.jdbc(credentials_dashdb['jdbcurl'],table,properties=props)
data_df.printSchema()

root
 |-- COUNTRY_OR_AREA: string (nullable = true)
 |-- 1990: decimal(31,11) (nullable = true)
 |-- 1995: decimal(31,11) (nullable = true)
 |-- 1996: decimal(31,11) (nullable = true)
 |-- 1997: decimal(31,11) (nullable = true)
 |-- 1998: decimal(31,12) (nullable = true)
 |-- 1999: decimal(31,11) (nullable = true)
 |-- 2000: decimal(31,10) (nullable = true)
 |-- 2001: decimal(31,12) (nullable = true)
 |-- 2002: decimal(31,11) (nullable = true)
 |-- 2003: decimal(31,12) (nullable = true)
 |-- 2004: decimal(31,12) (nullable = true)
 |-- 2005: decimal(31,11) (nullable = true)
 |-- 2006: decimal(31,11) (nullable = true)
 |-- 2007: decimal(31,12) (nullable = true)
 |-- 2008: decimal(31,11) (nullable = true)
 |-- 2009: decimal(31,11) (nullable = true)



In [8]:
data_df.take(5)

[Row(COUNTRY_OR_AREA=u'Albania', 1990=Decimal('28385.00000000000'), 1995=Decimal('40311.00000000000'), 1996=Decimal('0E-11'), 1997=Decimal('0E-11'), 1998=Decimal('0E-12'), 1999=Decimal('38284.00000000000'), 2000=Decimal('30683.0000000000'), 2001=Decimal('30491.000000000000'), 2002=Decimal('35883.00000000000'), 2003=Decimal('27893.000000000000'), 2004=Decimal('42787.000000000000'), 2005=Decimal('42840.00000000000'), 2006=Decimal('32380.00000000000'), 2007=Decimal('30964.000000000000'), 2008=Decimal('0E-11'), 2009=Decimal('0E-11')),
 Row(COUNTRY_OR_AREA=u'Algeria', 1990=Decimal('76160.00000000000'), 1995=Decimal('90270.00000000000'), 1996=Decimal('53380.00000000000'), 1997=Decimal('74460.00000000000'), 1998=Decimal('66470.000000000000'), 1999=Decimal('50150.00000000000'), 2000=Decimal('64430.0000000000'), 2001=Decimal('43840.000000000000'), 2002=Decimal('37317.00000000000'), 2003=Decimal('0E-12'), 2004=Decimal('0E-12'), 2005=Decimal('0E-11'), 2006=Decimal('0E-11'), 2007=Decimal('0E-12'),

Now your data is in a `pyspark.sql.DataFrame` and you can analyze it. 
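The numeric columns arrive as `decimal` types, so values such as `Decimal('0E-11')` in the output are zeros stored with a fixed scale. If you prefer plain floats for later analysis, one option is to convert them record by record. The following is a minimal sketch on a hand-written dictionary; the helper name `decimals_to_floats` is hypothetical.

```python
from decimal import Decimal

def decimals_to_floats(record):
    """Return a copy of the record with every Decimal value converted to float."""
    return {key: float(value) if isinstance(value, Decimal) else value
            for key, value in record.items()}

# A hand-written record that mimics part of the first row in the output.
sample = {
    'COUNTRY_OR_AREA': 'Albania',
    '1990': Decimal('28385.00000000000'),
    '1996': Decimal('0E-11'),
}

converted = decimals_to_floats(sample)
print(converted)
```

With a real `Row`, `row.asDict()` would provide such a dictionary. Converting to float gives up the exact decimal representation, so this is only appropriate when that precision is not needed.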

<a id="cloudant"></a>
## Load data from a Cloudant database
Cloudant is a NoSQL database as a service (DBaaS) built to scale globally, run nonstop, and handle a wide variety of data types like JSON, full-text, and geospatial. Cloudant NoSQL DB is an operational data store optimized to handle concurrent reads and writes and to provide high availability and data durability.

In this notebook, you must have an IBM Cloudant NoSQL Database for Bluemix service instance and a connection to this data service instance. You can create data service connections on your project page. The Cloudant NoSQL DB instance name appears in the **Data Sources** pane. 

Click the next code cell and use the `Insert to code` function below the Cloudant NoSQL DB instance name in the **Data Sources** pane to add the Cloudant credentials.  

<div class="alert alert-block alert-info">Note: The Python dictionary with the Cloudant credentials that is generated for you is given a generic name. Rename the dictionary variable to `credentials_cloudant` and run the code cell to proceed.</div>

<a id="cloudant_python"></a>
### Load data by using Python

Before you load data from a Cloudant NoSQL DB instance into your notebook, ensure that you are using the latest version of the `cloudant` Python library. Do not use version 0.5.10 or earlier. For more information, see [the Python library for Cloudant and CouchDB](https://github.com/cloudant/python-cloudant).

Install the `cloudant` package:

In [10]:
!pip install --user cloudant



In [11]:
from cloudant.client import Cloudant
from cloudant.result import Result
import pandas as pd, json

client = Cloudant(credentials_cloudant['username'], credentials_cloudant['password'], url=credentials_cloudant['url'])
client.connect()

List all existing databases:

In [12]:
client.all_dbs()

[u'test_db', u'weather_db']

In [13]:
# fill in database name 
db_name = 'test_db'
my_database = client[db_name]
result_collection = Result(my_database.all_docs, include_docs=True)
data_df = pd.DataFrame([item['doc'] for item in result_collection])
data_df.head()

Unnamed: 0,_id,_rev,age,city,country,name
0,1,3-27cca37f6a255b60524bfbe865542404,32,Chicago,USA,Peter Smith
1,2,4-cb90fa2f6171a233cde1145ae345424e,26,New York City,USA,Maria Sanchez
2,3,5-e7eeaf3db3bb9d618e91bc4614883b57,52,Ontario,Canada,Don Spears
3,4,5-12789958a18573ebb767b03bfaf83f4c,25,Berlin,Germany,Martin Mueller
4,5,3-1515da6fa41d2642f10b5f9d75bd9ad2,37,Ontario,Canada,Julia Ma


Now your data is in a `pandas.DataFrame` and you can start analyzing it.
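Each row returned by `all_docs` with `include_docs=True` wraps the stored document in a `doc` field, which is why the list comprehension above selects `item['doc']`. If the database contains design documents, they are returned as well, with IDs that start with `_design/`. The following is a minimal sketch, on hand-written rows, of how you might skip them before building the DataFrame. The helper name `extract_documents` and the sample values are hypothetical.

```python
def extract_documents(rows):
    """Return the 'doc' of each row, skipping design documents."""
    return [row['doc'] for row in rows
            if not row['id'].startswith('_design/')]

# Hand-written rows that mimic the shape returned by all_docs with include_docs=True.
sample_rows = [
    {'id': '001', 'key': '001', 'value': {'rev': '1-aaa'},
     'doc': {'_id': '001', '_rev': '1-aaa', 'name': 'Peter Smith', 'age': 32}},
    {'id': '_design/views', 'key': '_design/views', 'value': {'rev': '1-bbb'},
     'doc': {'_id': '_design/views', '_rev': '1-bbb', 'views': {}}},
]

documents = extract_documents(sample_rows)
print(documents)
```

The resulting list could then be passed to `pd.DataFrame` in the same way as in the cell above.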

<a id="cloudant_pyspark"></a>
### Load data by using PySpark

Add the credentials of your Cloudant NoSQL DB instance that contains your data and run the next cell to load this data.

The code in the cell reads the credentials and loads the data from Cloudant into a DataFrame data structure.

In [14]:
from pyspark.sql import SQLContext
sqlContext = SQLContext(sc)

# fill in database name 
db_name = "test_db"
data_df = sqlContext.read.format("com.cloudant.spark")\
.option("cloudant.host", credentials_cloudant['host'])\
.option("cloudant.username", credentials_cloudant['username'])\
.option("cloudant.password", credentials_cloudant['password'])\
.load(db_name)

data_df.printSchema()

root
 |-- _id: string (nullable = true)
 |-- _rev: string (nullable = true)
 |-- age: long (nullable = true)
 |-- city: string (nullable = true)
 |-- country: string (nullable = true)
 |-- name: string (nullable = true)



In [15]:
data_df.take(5)

[Row(_id=u'001', _rev=u'3-27cca37f6a255b60524bfbe865542404', age=32, city=u'Chicago', country=u'USA', name=u'Peter Smith'),
 Row(_id=u'002', _rev=u'4-cb90fa2f6171a233cde1145ae345424e', age=26, city=u'New York City', country=u'USA', name=u'Maria Sanchez'),
 Row(_id=u'003', _rev=u'5-e7eeaf3db3bb9d618e91bc4614883b57', age=52, city=u'Ontario', country=u'Canada', name=u'Don Spears'),
 Row(_id=u'004', _rev=u'5-12789958a18573ebb767b03bfaf83f4c', age=25, city=u'Berlin', country=u'Germany', name=u'Martin Mueller'),
 Row(_id=u'005', _rev=u'3-1515da6fa41d2642f10b5f9d75bd9ad2', age=37, city=u'Ontario', country=u'Canada', name=u'Julia Ma')]

Now your data is in a `pyspark.sql.DataFrame` and you can start analyzing it.

<a id="summary"></a>
## Summary

In this notebook, you learned how to load data from Object Storage, dashDB, or Cloudant into a notebook.

### Author
Sven Hafeneger is a member of the Data Science Experience development team at IBM Analytics in Germany. He holds an M.Sc. in Bioinformatics and is passionate about data analysis, machine learning, and the Python ecosystem for data science. 

Copyright © IBM Corp. 2016, 2017. This notebook and its source code are released under the terms of the MIT License.