## Access files in Object Storage with R

This notebook shows you how to access data files stored in Object Storage by using the R programming language and SparkR, a lightweight front end for using Apache Spark from R.

This notebook runs on R with Spark 2.0.


## Table of contents

1. [Load data](#load_data)
1. [Access data](#access_data)
    1. [Access data by using R](#access_data_using_R)
    1. [Access data by using SparkR](#access_data_using_SparkR)
1. [Summary](#summary)

<a id="load_data"></a>
## Load data

Before you can analyze data files in your notebook, you must add them to the notebook. Files that you load to your notebook are stored in Object Storage.

To add files that you want to use in a notebook to Object Storage, click the **Data** icon on the notebook action bar. You can either drag the file that you want to add to the `Data` pane or click **Add Source** and browse to the file. The data files are listed on the `Data` pane. 

<a id="access_data"></a>
## Access data

To access data in a file in Object Storage, you need the Object Storage authentication credentials. 

Click the next code cell to place the focus on it. Then, to insert the credentials for the data file into that cell, select **Insert to code>Credentials** on the data file that you loaded in the `Data` pane.

This action returns an R `list` object with the credentials required to access the file in Object Storage. The code in the rest of this notebook assumes that this list is named `credentials_1`.
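
As a rough sketch of what such a list might look like, the following cell builds one by hand. The field names are the ones that the helper functions later in this notebook read. Every value, and the name `credentials_example`, is a made-up placeholder, and the list that the notebook generates for you may contain additional fields.

```r
# Hypothetical credentials list: all values are placeholders, not real credentials.
credentials_example <- list(
    auth_url   = "https://identity.example.com",
    project_id = "example-project-id",
    region     = "example-region",
    user_id    = "example-user-id",
    domain_id  = "example-domain-id",
    username   = "example-username",
    password   = "example-password",
    container  = "notebooks",
    filename   = "data.csv"
)

# Fields are read with [[ ]], in the same way as in the helper functions.
credentials_example[["container"]]

# Check that every field the helper functions rely on is present.
required_fields <- c("auth_url", "domain_id", "username", "password", "region",
                     "container", "filename", "project_id", "user_id")
missing_fields <- setdiff(required_fields, names(credentials_example))
if (length(missing_fields) > 0) {
    stop("Missing credential fields: ", paste(missing_fields, collapse = ", "))
}
```

A check like this can be useful before calling the helpers, because a missing field otherwise surfaces only as a failed request.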

<div class="alert alert-block alert-info">Note: If you decide to share this notebook with other users, consider removing the credentials from the notebook.</div>

<a id="access_data_using_R"></a>
### Access data by using R

Because the data file is located in Object Storage, you need to define a helper function to access the file that you loaded.  

Run the following cell to define the function called `getObjectStorageFile`. This function takes the list object with the credentials required to access the data file as input. It authenticates against Object Storage with your credentials, downloads the data file as text, and returns a text-mode connection that you can read from in the notebook.

In [7]:
getObjectStorageFile <- function(credentials) {
    # Install the packages on first use, then attach each one separately
    # (library() attaches only one package per call)
    if(!require(httr)) install.packages('httr')
    if(!require(RCurl)) install.packages('RCurl')
    library(httr)
    library(RCurl)
    # Request an authentication token from the v3 token endpoint
    auth_url <- paste(credentials[['auth_url']],'/v3/auth/tokens', sep= '')
    # sep='"' wraps each credential value in double quotes to form the JSON body
    auth_args <- paste('{"auth": {"identity": {"password": {"user": {"domain": {"id": ', credentials[['domain_id']],'},"password": ',
                   credentials[['password']],',"name": ', credentials[['username']],'}},"methods": ["password"]}}}', sep='"')
    auth_response <- httr::POST(url = auth_url, body = auth_args)
    # The token comes back in a response header; the service catalog is in the body
    x_subject_token <- headers(auth_response)[['x-subject-token']]
    auth_body <- content(auth_response)
    # Pick the public object-store endpoint for the region and append container and file name
    access_url <- unlist(lapply(auth_body[['token']][['catalog']], function(catalog){
        if(catalog[['type']] == 'object-store'){
            lapply(catalog[['endpoints']], function(endpoints){
                if(endpoints[['interface']] == 'public' && endpoints[['region_id']] == credentials[['region']]) {
                   paste(endpoints[['url']], credentials[['container']], credentials[['filename']], sep='/')}
            })
        }
    })) 
    # Download the file as text and return a text-mode connection to it
    data <- content(httr::GET(url = access_url, add_headers("Content-Type" = "application/json", "X-Auth-Token" = x_subject_token)), as="text")
    textConnection(data)
}
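
The `paste(..., sep='"')` call in the helper function is compact but easy to misread, so here is a minimal offline sketch of the string it produces. The list `example_credentials` and its values are placeholders invented for illustration; no request is sent.

```r
# Placeholder values, used only to show the string that paste() produces.
example_credentials <- list(domain_id = "example-domain",
                            username  = "example-user",
                            password  = "example-password")

# Same construction as in getObjectStorageFile: the separator is a double quote,
# so each credential value ends up enclosed in quotes inside the JSON text.
auth_args <- paste('{"auth": {"identity": {"password": {"user": {"domain": {"id": ', example_credentials[['domain_id']], '},"password": ',
                   example_credentials[['password']], ',"name": ', example_credentials[['username']], '}},"methods": ["password"]}}}', sep='"')

cat(auth_args, "\n")
```

Because the values are inserted as-is, a credential that itself contains a double quote or a backslash would produce invalid JSON with this approach.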

You can pass the text-mode connection that the helper function returns to any standard R data import function. 
For example, run the next cell to read a `.csv` file into an R data frame by using the `read.csv()` function:

In [8]:
R.data.frame <- read.csv(file = getObjectStorageFile(credentials_1))
head(R.data.frame)

| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave.points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave.points_worst | symmetry_worst | fractal_dimension_worst | X |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |  |
| 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |  |
| 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |  |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |  |
| 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |  |
| 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |  |
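
`read.csv()` accepts the return value of the helper function because a text connection lets R read an in-memory string as if it were a file. The following sketch shows the same pattern without any network access. The CSV text and the names `csv_text`, `con`, and `example_df` are invented for illustration.

```r
# A small CSV document held in memory, standing in for the downloaded file contents.
csv_text <- "id,diagnosis,radius_mean\n1,M,17.99\n2,B,11.42\n"

# Wrap the string in a text-mode connection, as the last line of the helper function does.
con <- textConnection(csv_text)

# Any import function that accepts a connection can now read from it.
example_df <- read.csv(file = con)

# A text connection is created already open, so close it explicitly when done.
close(con)

head(example_df)
```

The same connection could be passed to other base R readers, for example `readLines()` or `read.table()`, depending on the file format.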


<a id="access_data_using_SparkR"></a>
### Access data by using SparkR

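Before you can access the data file in Object Storage by using the [`SQLContext`](https://spark.apache.org/docs/latest/sparkr.html#starting-up-sparkcontext-sqlcontext) object, you must set the Hadoop configuration so that Spark can authenticate against Object Storage. The following helper function writes your credentials into the Hadoop configuration under property names that start with `fs.swift.service.` followed by a name that you choose. Run the following cell to create the helper function: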

In [9]:
setHadoopConfig <- function(credentials) {
    # All properties for this configuration share the prefix fs.swift.service.<name>
    prefix <- paste("fs.swift.service" , credentials[['name']], sep =".")
    # sc is assumed to be the SparkContext that the notebook environment provides
    hConf <- SparkR:::callJMethod(sc, "hadoopConfiguration")
    # Authentication endpoint
    SparkR:::callJMethod(hConf, "set", paste(prefix, "auth.url", sep='.'), paste(credentials[["auth_url"]],"/v3/auth/tokens",sep=""))    
    SparkR:::callJMethod(hConf, "set", paste(prefix, "auth.endpoint.prefix", sep='.'), "endpoints")    
    # Project, user, password, and region taken from the credentials list
    SparkR:::callJMethod(hConf, "set", paste(prefix, "tenant", sep='.'), credentials[["project_id"]])    
    SparkR:::callJMethod(hConf, "set", paste(prefix, "username", sep='.'), credentials[["user_id"]])    
    SparkR:::callJMethod(hConf, "set", paste(prefix, "password", sep='.'), credentials[["password"]])    
    SparkR:::callJMethod(hConf, "set", paste(prefix, "region", sep='.'), credentials[["region"]])    
    # Use the public endpoint; invisible() suppresses the printed return value
    invisible(SparkR:::callJMethod(hConf, "setBoolean", paste(prefix, "public", sep='.'), TRUE))
}

Give the configuration a name, for example `keystone`, and then set the Hadoop configuration:

In [10]:
credentials_1[["name"]] <- "keystone"
setHadoopConfig(credentials_1)
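
To see which Hadoop properties this call touches, the following sketch rebuilds the property names in plain R, without Spark. It reuses the prefix construction and the property suffixes from `setHadoopConfig`. The variable names `service_name`, `suffixes`, and `property_names` are introduced here for illustration only.

```r
# Same prefix construction as in setHadoopConfig, with the name "keystone".
service_name <- "keystone"
prefix <- paste("fs.swift.service", service_name, sep = ".")

# Suffixes of the properties that setHadoopConfig sets.
suffixes <- c("auth.url", "auth.endpoint.prefix", "tenant", "username",
              "password", "region", "public")

# Full property names, for example fs.swift.service.keystone.auth.url
property_names <- paste(prefix, suffixes, sep = ".")
print(property_names)
```

Because the name is part of every property key, the same name has to appear in the `swift://` path that you read from later.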

You can now use the `read.df` function from the SparkR API to load the data file as a Spark DataFrame. For example, run the next cell to read a `.csv` file into a Spark DataFrame. The variable `filePath` is the location of the data file in Object Storage.

In [12]:
filePath <- paste("swift://" , credentials_1[['container']] , "." , credentials_1[['name']] , "/" , credentials_1[['filename']], sep="")
SparkR.DataFrame <- read.df(filePath, source = "com.databricks.spark.csv", header = "true")
head(SparkR.DataFrame)

| id | diagnosis | radius_mean | texture_mean | perimeter_mean | area_mean | smoothness_mean | compactness_mean | concavity_mean | concave points_mean | ... | texture_worst | perimeter_worst | area_worst | smoothness_worst | compactness_worst | concavity_worst | concave points_worst | symmetry_worst | fractal_dimension_worst | _c32 |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| 842302 | M | 17.99 | 10.38 | 122.8 | 1001.0 | 0.1184 | 0.2776 | 0.3001 | 0.1471 | ... | 17.33 | 184.6 | 2019.0 | 0.1622 | 0.6656 | 0.7119 | 0.2654 | 0.4601 | 0.1189 |  |
| 842517 | M | 20.57 | 17.77 | 132.9 | 1326.0 | 0.08474 | 0.07864 | 0.0869 | 0.07017 | ... | 23.41 | 158.8 | 1956.0 | 0.1238 | 0.1866 | 0.2416 | 0.186 | 0.275 | 0.08902 |  |
| 84300903 | M | 19.69 | 21.25 | 130.0 | 1203.0 | 0.1096 | 0.1599 | 0.1974 | 0.1279 | ... | 25.53 | 152.5 | 1709.0 | 0.1444 | 0.4245 | 0.4504 | 0.243 | 0.3613 | 0.08758 |  |
| 84348301 | M | 11.42 | 20.38 | 77.58 | 386.1 | 0.1425 | 0.2839 | 0.2414 | 0.1052 | ... | 26.5 | 98.87 | 567.7 | 0.2098 | 0.8663 | 0.6869 | 0.2575 | 0.6638 | 0.173 |  |
| 84358402 | M | 20.29 | 14.34 | 135.1 | 1297.0 | 0.1003 | 0.1328 | 0.198 | 0.1043 | ... | 16.67 | 152.2 | 1575.0 | 0.1374 | 0.205 | 0.4 | 0.1625 | 0.2364 | 0.07678 |  |
| 843786 | M | 12.45 | 15.7 | 82.57 | 477.1 | 0.1278 | 0.17 | 0.1578 | 0.08089 | ... | 23.75 | 103.4 | 741.6 | 0.1791 | 0.5249 | 0.5355 | 0.1741 | 0.3985 | 0.1244 |  |


Now your data is in a `Spark DataFrame` and you can begin analyzing it. 
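
To make the `swift://` path format used for `filePath` concrete, the following sketch builds one from placeholder values. The list `example` and its container and file names are invented for illustration; only the name `keystone` matches the configuration name used earlier.

```r
# Placeholder values standing in for the fields of the credentials list.
example <- list(container = "notebooks", name = "keystone", filename = "data.csv")

# Same construction as for filePath: swift://<container>.<name>/<filename>
examplePath <- paste("swift://", example[['container']], ".", example[['name']],
                     "/", example[['filename']], sep = "")
print(examplePath)
```

The part after the dot is the configuration name, which is how the path is tied to the `fs.swift.service.keystone` properties set by `setHadoopConfig`.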

<div class="alert alert-block alert-info">Note: To access CSV files in Object Storage and load data to use in the notebook, you can use the code generation functions on the `Insert to code` list below each data file in the `Data` pane in the notebook.</div>

<a id="summary"></a>
## Summary

This notebook demonstrated how to access files stored in Object Storage by using both R and SparkR. You can use and adapt these code snippets in a notebook you are developing if you want to load data to and access data from Object Storage.


### Author

Sumit Goyal is a Software Developer at IBM in Germany. He is a data science enthusiast and passionate about IBM's Data Science Experience. He holds a degree in Automation and Industrial IT. Find him on Twitter at @imSumitGoyal.

Copyright © 2017 IBM. This notebook and its source code are released under the terms of the MIT License.