# Loading Data into SciDB

## Overview

The goal of this tutorial is to familiarize yourself with loading data into SciDB using the SciDB-Py library for Python. We will cover some simple examples using a relatively small dataset, and will see the benefits of using accelerated IO operators.


## Topics to cover

- Ingesting TSV and CSV files with __`aio_input`__.
    - specifying `header=` and `split_on_dimension=`
    - using the "`error`" attribute
    - converting from text to numerics: regular cast versus dcast
- operators __`store`__ and __`insert`__
- operator __`delete`__
- operators __`versions()`__ and __`remove_versions()`__
- operator __`redimension()`__ 
- an example using a secondary "axis labels" array
- an example of appending data to an existing array
- ingesting data directly from Python (not from a file)




### Setup: Download the iris dataset to a local CSV file

We will download the well-known iris dataset, which consists of measurement data for a set of 150 flowers. There are 50 members of each of three iris varieties, with four measurements for each observation. We will download this data and write it to a local comma-separated values (CSV) file.

In [160]:
import pandas as pd
iris = 'http://archive.ics.uci.edu/ml/machine-learning-databases/iris/iris.data'

# Read the iris data from the URL into a DataFrame.
attributes = ["sepal_length", "sepal_width", "petal_length", "petal_width", "class"]
df = pd.read_csv(iris, sep=',', names=attributes)

# Write the CSV file back out with the column headers.
df.to_csv("/tmp/iris.csv", index=False)

print(df.shape)

(150, 5)


The shape confirms the 150 observations. Each row of the DataFrame holds the four measurements (sepal length, sepal width, petal length, and petal width) for one flower, followed by its class.

In [161]:
df.head()

Unnamed: 0,sepal_length,sepal_width,petal_length,petal_width,class
0,5.1,3.5,1.4,0.2,Iris-setosa
1,4.9,3.0,1.4,0.2,Iris-setosa
2,4.7,3.2,1.3,0.2,Iris-setosa
3,4.6,3.1,1.5,0.2,Iris-setosa
4,5.0,3.6,1.4,0.2,Iris-setosa


Let's examine the first 10 lines of the CSV file as well, to verify that it has downloaded and written out correctly.

In [22]:
import subprocess
# check_output returns bytes, so decode before printing.
print(subprocess.check_output('head -n 10 /tmp/iris.csv', shell=True).decode())

sepal_length,sepal_width,petal_length,petal_width,class
5.1,3.5,1.4,0.2,Iris-setosa
4.9,3.0,1.4,0.2,Iris-setosa
4.7,3.2,1.3,0.2,Iris-setosa
4.6,3.1,1.5,0.2,Iris-setosa
5.0,3.6,1.4,0.2,Iris-setosa
5.4,3.9,1.7,0.4,Iris-setosa
4.6,3.4,1.4,0.3,Iris-setosa
5.0,3.4,1.5,0.2,Iris-setosa
4.4,2.9,1.4,0.2,Iris-setosa



Note that this command is also useful for inspecting the first lines of a data file that you are about to upload, to investigate the format of the data.

Since the iris dataset is already loaded in Python, we did not strictly need to write it out to a CSV file before uploading it. However, it is common to start from a local CSV file that needs to be ingested into SciDB, so we save the data to a local CSV file here and then read and upload that file below.

# 1. Connect to SciDB

Connect to the SciDB database.

In [162]:
from scidbpy import connect
import getpass
import requests
import warnings
warnings.filterwarnings("ignore")
requests.packages.urllib3.disable_warnings(requests.packages.urllib3.exceptions.InsecureRequestWarning)
db = connect(scidb_url="https://localhost:8083", 
             scidb_auth=('root', getpass.getpass('Please enter your password: ')),
             verify=False)

Please enter your password: ········


# 2. Ingest CSV File into a Temp Array Using `aio_input()`

First we'll use the Paradigm4 Accelerated IO tools plugin to ingest the file into a temporary array in the SciDB database. 



In [28]:
from scidbpy import Schema

def show_schema(db, array):
    """Return the schema of an array name or AFL expression as a string."""
    array = str(array)  # accept SciDB-Py objects as well as plain strings
    if '(' in array:
        # AFL statement: escape quotes and run show('query', 'afl')
        array = """'{}','afl'""".format(array.replace("'", "\\'"))
    sch = Schema.fromstring(db.show(array)[:]['schema'][0].encode('ascii'))
    sch.name = ''  # clear out the 'name' field for more clarity
    return str(sch)

The operator **`aio_input`** is used to ingest a CSV file into SciDB. On its own, it reads the file and simply returns the contents to the user without storing them. The operator requires the file path and the number of columns (attributes) as parameters. We will use several options to `aio_input`; more are documented on the [accelerated_io_tools page](https://github.com/paradigm4/accelerated_io_tools).

By default, aio_input assumes that the input data is tab-separated (TSV format), so we will specify that this file is comma-separated instead, using the `"'attribute_delimiter=,'"` option.  We'll also use the option `"'header=1'"` to instruct the loader to ignore the first line of text.  

Our query will look like this:

In [186]:
input_cmd = db.aio_input("'/tmp/iris.csv'", "'num_attributes=5'", "'header=1'", "'attribute_delimiter=,'")

In [187]:
print(input_cmd)

aio_input('/tmp/iris.csv', 'num_attributes=5', 'header=1', 'attribute_delimiter=,')


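Note that each option is quoted twice: the outer double quotes delimit the Python string, and the inner single quotes are passed through to SciDB as part of the query text.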
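
To make the quoting concrete, here is a minimal sketch that builds the same query text locally with plain string formatting. The helper names `aio_option` and `aio_input_query` are hypothetical and are not part of SciDB-Py; nothing is sent to the database.

```python
def aio_option(key, value):
    """Wrap one key=value pair in the single quotes that appear in the query."""
    return "'{}={}'".format(key, value)


def aio_input_query(path, **options):
    """Return the query text for an aio_input call (built locally, not run)."""
    args = ["'{}'".format(path)]
    args += [aio_option(key, value) for key, value in options.items()]
    return "aio_input({})".format(", ".join(args))


query = aio_input_query(
    "/tmp/iris.csv",
    num_attributes=5,
    header=1,
    attribute_delimiter=",",
)
print(query)
```

The printed text should match the output of `print(input_cmd)` above, which shows that the inner single quotes are the ones that end up in the query.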

Now let's run show_schema on this to see what the array will look like:

In [175]:
show_schema(db, input_cmd)

'<a0:string,a1:string,a2:string,a3:string,a4:string,error:string> [tuple_no=0:*:0:10000000; dst_instance_id=0:15:0:1; src_instance_id=0:15:0:1]'

The above indicates that `aio_input` will create one attribute for each of the five columns in the file: a0, a1, a2, a3, a4. It will also add an `error` attribute. Each line in the input file will become an array cell with (a0, a1, a2, a3, a4, error). The `error` is `null` unless that particular line in the file did not have exactly 5 comma-delimited tokens, matching `num_attributes=5`.
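
Before loading a file, it can help to check locally which lines would not have the expected number of tokens. The sketch below uses Python's `csv` module on a small made-up sample; the function name `find_bad_lines` is hypothetical, and Python's quoting rules may not match those of `aio_input` exactly, so treat it as a rough pre-flight check only.

```python
import csv
import io


def find_bad_lines(text, num_attributes, delimiter=",", header=1):
    """Return (line_number, token_count) for lines with an unexpected count."""
    bad = []
    reader = csv.reader(io.StringIO(text), delimiter=delimiter)
    for line_number, tokens in enumerate(reader, start=1):
        if line_number <= header:
            continue  # skip header lines, as 'header=1' does
        if len(tokens) != num_attributes:
            bad.append((line_number, len(tokens)))
    return bad


# Made-up sample: line 3 is one token short, line 4 has one token too many.
sample = (
    "sepal_length,sepal_width,petal_length,petal_width,class\n"
    "5.1,3.5,1.4,0.2,Iris-setosa\n"
    "4.9,3.0,1.4,0.2\n"
    "4.7,3.2,1.3,0.2,Iris-setosa,extra\n"
)
print(find_bad_lines(sample, num_attributes=5))
```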

Each value is loaded in as a `string` - this operator does not bother parsing the strings at this point.  But also notice that all of the columns from the input file are treated as attributes by default.

The dimensions are complex and reflect this operator's advanced options. If this dataset were very large, the file would first be split into blocks of several megabytes, which would then be sent around the cluster and parsed in parallel. This makes parsing faster. The dimension `dst_instance_id` denotes which instance a particular block of text ended up on, and `tuple_no` assigns numbers to the different lines of text, though not necessarily consecutively. The other dimension, `src_instance_id`, is used when the operator ingests multiple files at once, which we are not doing.

Without further ado, let's ingest the data into an array, to see what it looks like:

In [180]:
temp_iris = db.store(input_cmd)

In [181]:
temp_iris

Array(DB('https://localhost:8083', ('root', PASSWORD_PROVIDED), None, None, False), 'py_1513378737565129148_2')

The variable `temp_iris` is now a Python object that is bound to a temporary array in SciDB. The SciDB-Py package also created a temporary name for this array: `py_...`. This array will be automatically removed from the system when the `temp_iris` variable is removed from the scope. Let's look at our array:

In [200]:
db.limit(temp_iris, 3)[:]

Unnamed: 0,tuple_no,dst_instance_id,src_instance_id,a0,a1,a2,a3,a4,error
0,0,0,0,5.1,3.5,1.4,0.2,Iris-setosa,
1,1,0,0,4.9,3.0,1.4,0.2,Iris-setosa,
2,2,0,0,4.7,3.2,1.3,0.2,Iris-setosa,
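
The idea of a temporary array that disappears when its Python variable goes away can be sketched with the standard `weakref` module. This is only an illustration of the general pattern under stated assumptions: the class `TempArray` and the list `removed` are hypothetical stand-ins, and SciDB-Py's actual cleanup mechanism may differ.

```python
import weakref

removed = []  # stands in for "arrays removed from the database"


class TempArray:
    """Hypothetical stand-in for an object bound to a temporary array."""

    def __init__(self, name):
        self.name = name
        # Register a cleanup callback that runs when this object is collected.
        weakref.finalize(self, removed.append, name)


temp = TempArray("py_example_1")
print(removed)  # still empty while the variable is alive
del temp        # drop the last reference to the object
print(removed)
```

The practical consequence for this tutorial is that a temporary array should be stored under an explicit name if it needs to outlive the Python variable that refers to it.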


# Exercise I

**(A)**. How many rows of text were loaded from the `iris.csv` file? 

In [89]:
#Answer to Exercise I.A:


**(B)**. Were there any errors? Is the attribute `error` not null anywhere?

In [90]:
#Answer to Exercise I.B:


**(C)**. Calculate the average value for a1 (sepal_width). Note, you need to convert it to a `double`:


In [93]:
#Answer to Exercise I.C:


## 2.1. Import Data as a Dimension Using `split_on_dimension=1`

This would suffice to get the data into a SciDB array, but let's try to represent the data in a more useful way for computation and slicing. Instead of reading the measurement columns in as attributes, let's read them in as a second dimension, with the observation number (i.e. row number) being the first dimension.

Even though the output above looks like the type of 2-dimensional matrix that we would like, each row of that output actually represents a cell in the SciDB array, not a "row" of the array. To get the 2-dimensional representation that we actually want in SciDB, we want to convert the measurement columns into another dimension.

To do this, we can use the `"'split_on_dimension=1'"` option to `aio_input`. By default, each column in the input is treated as a separate attribute. `split_on_dimension=1` tells `aio_input` to turn the columns into another dimension instead: the values are returned in a single attribute, `a`, and each column is assigned a different, consecutive `attribute_no` coordinate in the output. For example, the first column (sepal_length) will be assigned an `attribute_no` of 0.
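
As a local illustration of this reshaping, the sketch below turns two rows of six values (five columns plus the `error` value) into one cell per value, each labelled with a row number and a column number. It is plain Python on hand-typed sample rows and only mimics the layout; it does not call SciDB.

```python
# Two lines of the file as lists of text values; the last entry is 'error'.
rows = [
    ["5.1", "3.5", "1.4", "0.2", "Iris-setosa", None],
    ["4.9", "3.0", "1.4", "0.2", "Iris-setosa", None],
]

# One cell per value: (tuple_no, attribute_no, a).
cells = [
    (tuple_no, attribute_no, value)
    for tuple_no, row in enumerate(rows)
    for attribute_no, value in enumerate(row)
]

for cell in cells:
    print(cell)
```

Each input line becomes six cells here, so the number of cells is six times the number of lines.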



In [188]:
input_cmd_2 = db.aio_input("'/tmp/iris.csv'", "'num_attributes=5'", "'header=1'", 
                         "'attribute_delimiter=,'", "'split_on_dimension=1'")

In [189]:
print(input_cmd_2)

aio_input('/tmp/iris.csv', 'num_attributes=5', 'header=1', 'attribute_delimiter=,', 'split_on_dimension=1')


In [190]:
show_schema(db, input_cmd_2)


In [197]:
temp_iris_2 = db.store(input_cmd_2)
temp_iris_2

Array(DB('https://localhost:8083', ('root', PASSWORD_PROVIDED), None, None, False), 'py_1513378737565129148_4')

In [196]:
db.limit(temp_iris_2, 18)[:]

Unnamed: 0,tuple_no,dst_instance_id,src_instance_id,attribute_no,a
0,0,0,0,0,5.1
1,0,0,0,1,3.5
2,0,0,0,2,1.4
3,0,0,0,3,0.2
4,0,0,0,4,Iris-setosa
5,0,0,0,5,
6,1,0,0,0,4.9
7,1,0,0,1,3.0
8,1,0,0,2,1.4
9,1,0,0,3,0.2


Note that the 18 cells requested above represent the same three iris observations as were output from the command `db.limit(temp_iris, 3)[:]` earlier. But the six values on each line (a0 through a4, plus `error`) are now laid out along the `attribute_no` dimension.

# Exercise II

**(A)**. How many lines (or cells) are there in temp_iris_2? 

In [199]:
#Answer to Exercise II.A:


**(B)**. Where did the values in the error column (a5) from `temp_iris` go in the `temp_iris_2` array?

In [90]:
#Answer to Exercise II.B:


**(C)** What `attribute_no` value corresponds to "`petal_width`"?

In [None]:
#Answer to Exercise II.C:


# 3. Making a Schema

Our objective is to load the iris data into SciDB in such a way that we can use the dimensions of SciDB arrays to slice and compute on the measurement data. The iris dataset that we have imported actually contains two types of data: (1) measurement data, and (2) metadata (the species of iris). While there are different ways of representing this data in SciDB, the most pragmatic is to separate the measurements from the metadata into two arrays.

The measurement array will contain the 4 measurements for each of the 150 irises that were observed. In another context, we would think of this as a two-dimensional matrix of 150 observations by 4 measurements. This is how we will represent the data in SciDB. The measurement array will use `observation_id` (for example, the row number) and `measurement_id` as its two dimensions, and each cell will contain a single attribute: the value of that measurement for that observation.

In the other array, we will store the species metadata. It will contain only the observation id and the associated iris species as two attributes.

Note that we keep the species information this way because we are only going to use the species as an outcome variable, or something to consider about each observation. If the dataset were much larger, or if we were planning to regularly group the data by species, we could instead make the species a third dimension of the main array, which could help SciDB store and slice the data more efficiently.
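
The split into a measurement array and a metadata array can be sketched locally with two dictionaries. The three sample records and the variable names are illustrative only; the real arrays are built in SciDB in the steps that follow.

```python
# Three illustrative records: four measurements followed by the species label.
records = [
    (5.1, 3.5, 1.4, 0.2, "Iris-setosa"),
    (7.0, 3.2, 4.7, 1.4, "Iris-versicolor"),
    (6.3, 3.3, 6.0, 2.5, "Iris-virginica"),
]

measurements = {}  # (observation_id, measurement_id) -> measurement value
species = {}       # observation_id -> species label

for observation_id, record in enumerate(records):
    *values, label = record
    species[observation_id] = label
    for measurement_id, value in enumerate(values):
        measurements[(observation_id, measurement_id)] = value

# Slicing by coordinates: the petal length (measurement_id 2) of observation 1.
print(measurements[(1, 2)])
print(species[1])
```

Keying the measurements by a pair of integers is what makes them behave like a matrix, while the species labels stay in a separate lookup keyed by the same `observation_id`.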






Let's first create the measurement array.  We'll use the array name `iris`. In case the array exists already, we'll remove it first:

In [262]:
try:
    db.remove('iris')
except Exception:
    # The array does not exist yet (or could not be removed); continue.
    pass

Now create the array. We will specify a chunk size of 1,000 for `observation_id`. Since there are only 150 observations, this keeps all of the data together in a single chunk, which will matter shortly when we insert the data into the `iris` array. We leave the chunk size unspecified for the `measurement_id` dimension.


In [263]:
db.iquery('''
create array 
iris
<measurement_value:double> 
[observation_id=0:*:0:1000; measurement_id=0:*:0:*]'''
)

In [264]:
show_schema(db, 'iris')

'<measurement_value:double> [observation_id=0:*:0:1000; measurement_id=0:*:0:*]'

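In the chunk-size position (the last field of `measurement_id`), `*` means "I didn't pick a chunk size yet". In the upper-bound position (the second field of both dimensions), `*` means the dimension is unbounded.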
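
To read the four fields of each dimension more easily, the following sketch splits a dimension specification into named parts. It assumes the `name=low:high:overlap:chunk` layout seen in the schema output above and handles only that form; `parse_dimension` is a hypothetical helper, not a SciDB-Py function.

```python
def parse_dimension(spec):
    """Split 'name=low:high:overlap:chunk' into a dict; '*' becomes None."""
    name, fields = spec.strip().split("=")
    low, high, overlap, chunk = fields.split(":")
    return {
        "name": name,
        "low": int(low),
        "high": None if high == "*" else int(high),
        "overlap": int(overlap),
        "chunk_length": None if chunk == "*" else int(chunk),
    }


schema_dims = "observation_id=0:*:0:1000; measurement_id=0:*:0:*"
for spec in schema_dims.split(";"):
    print(parse_dimension(spec))
```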

Before we can insert the data, we need to filter the `temp_iris_2` array down to only the measurement data, and then rename things so that the measurement data matches the newly created `iris` array.

In [265]:
filtered_values = db.filter(temp_iris_2, '(attribute_no <= 3)')
db.limit(filtered_values, 5)[:]

Unnamed: 0,tuple_no,dst_instance_id,src_instance_id,attribute_no,a
0,0,0,0,0,5.1
1,0,0,0,1,3.5
2,0,0,0,2,1.4
3,0,0,0,3,0.2
4,1,0,0,0,4.9


Now we can insert data. In SciDB, the data is first reshaped to the target schema with the `redimension` operator, which matches attributes and dimensions by name. Therefore, we need to add attributes named "measurement_value", "observation_id", and "measurement_id" to our `filtered_values` array using `apply`. Attributes that will become dimensions must be of type int64. We can do it like this:

In [266]:
applied_values = db.apply(filtered_values, 
 'measurement_value', 'double(a)',
 "observation_id", "int64(tuple_no)",
 "measurement_id", "int64(attribute_no)"
)

In [267]:
db.limit(applied_values, 5)[:]

Unnamed: 0,tuple_no,dst_instance_id,src_instance_id,attribute_no,a,measurement_value,observation_id,measurement_id
0,0,0,0,0,5.1,5.1,0,0
1,0,0,0,1,3.5,3.5,0,1
2,0,0,0,2,1.4,1.4,0,2
3,0,0,0,3,0.2,0.2,0,3
4,1,0,0,0,4.9,4.9,1,0
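
The filter-then-convert order used above can be mimicked in plain Python, which also shows why the filter comes first. The sample cells are hand-typed from the first line of the file, and the failure shown is Python's own `float` behaviour, not SciDB's.

```python
# Cells as (tuple_no, attribute_no, a), mirroring the first line of the file.
cells = [
    (0, 0, "5.1"),
    (0, 1, "3.5"),
    (0, 2, "1.4"),
    (0, 3, "0.2"),
    (0, 4, "Iris-setosa"),
    (0, 5, None),
]

# Keep only the four measurement columns, like the attribute_no <= 3 filter.
filtered = [cell for cell in cells if cell[1] <= 3]

# Add converted fields, like the apply step: text becomes a number.
applied = [
    {
        "observation_id": int(tuple_no),
        "measurement_id": int(attribute_no),
        "measurement_value": float(a),
    }
    for tuple_no, attribute_no, a in filtered
]
print(applied[0])

# Without the filter, the species label could not be converted to a number.
try:
    float("Iris-setosa")
except ValueError as exc:
    print("conversion failed:", exc)
```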


The `applied_values` array could be inserted into the `iris` array as is, but there is some chance that the `tuple_no` values were not generated as consecutive numbers. In order to get consecutive values for `observation_id`, which will become a dimension in our `iris` array, we will use a feature of the `redimension` operator. If one dimension of the target schema is missing from the input when calling `redimension`, the operator creates a sequential integer coordinate, called a synthetic dimension, to fill in the missing dimension. Note that the resulting numbering is not guaranteed to follow the order of the lines in the file. So let's recreate our `applied_values` array, but exclude `observation_id` intentionally.

(As a note, the synthetic dimension can only be generated if there are fewer elements being redimensioned than the chunk size of that dimension. This is the reason that we specified the large chunk size for the `observation_id` dimension: so that the 150 iris observations could be inserted in a single chunk, and use the synthetic dimension feature to get a dense index.)

In [269]:
applied_values = db.apply(filtered_values, 
 'measurement_value', 'double(a)',
# "observation_id", "int64(tuple_no)",
 "measurement_id", "int64(attribute_no)"
)
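
To picture what a dense index means, here is a small local sketch that maps gapped line numbers onto consecutive integers. The gapped values are made up for illustration, and this ranking is only an analogy for the goal: the numbering that SciDB assigns may follow a different order.

```python
# Made-up, non-consecutive line numbers such as a parallel load might produce.
tuple_numbers = [0, 1, 2, 10000000, 10000001, 20000000]

# Rank the distinct values to obtain consecutive ids starting at 0.
dense_id = {value: rank for rank, value in enumerate(sorted(set(tuple_numbers)))}
observation_ids = [dense_id[value] for value in tuple_numbers]

print(observation_ids)
```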

And now let's `redimension`, and `insert` the data into our `iris` array (which is empty as yet).

In [278]:
db.scan('iris')[:]

Unnamed: 0,observation_id,measurement_id,measurement_value


In [279]:
show_schema(db, db.redimension(applied_values, 'iris'))

'<measurement_value:double> [observation_id=0:*:0:1000; measurement_id=0:*:0:*]'

In [280]:
show_schema(db, "iris")

'<measurement_value:double> [observation_id=0:*:0:1000; measurement_id=0:*:0:*]'

In [281]:
db.insert(db.redimension(applied_values, 'iris'), 'iris')

In [285]:
db.limit('iris', 20)[:]

Unnamed: 0,observation_id,measurement_id,measurement_value
0,0,3,1.9
1,1,3,1.0
2,2,3,0.2
3,3,3,2.1
4,4,3,0.2
5,5,3,0.1
6,6,3,1.3
7,7,3,0.2
8,8,3,1.6
9,9,3,2.3


# ***`______Sections still to do_____`***

## The `store` and `insert` operators

## The `delete` operator

## `versions()` and `remove_versions()` operators

## `redimension()` operator

## Examples
### an example using a secondary "axis labels" array
### an example of appending data to an existing array
### ingesting data directly from Python (not from a file)


<hr>

# ANSWER KEY FOR EXERCISES

# Exercise I

**(A)**. How many rows of text were loaded from the `iris.csv` file? 

*Answer:*  ***150***

**(B)**. Were there any errors? Is the attribute `error` not null anywhere?

In [116]:
# Answer to Exercise I.B:
temp_command = db.filter(temp_iris, 'error is not null')
print(temp_command)
temp_command[:]

filter(py_1513295652383260139_2, error is not null)


Unnamed: 0,tuple_no,dst_instance_id,src_instance_id,a0,a1,a2,a3,a4,error


In [115]:
db.op_count(temp_command)[:]

Unnamed: 0,i,count
0,0,0.0


*Answer:* ***So no, there were no errors.***

**(C)**. Calculate the average value for a1 (sepal_width). Note, you need to convert it to a `double`:

In [198]:
#Answer to Exercise I.C:
db.aggregate(db.apply(temp_iris, 'double_a1', 'double(a1)'), 'avg(double_a1)')[:]

Unnamed: 0,i,double_a1_avg
0,0,3.054


# Exercise II

**(A)**. How many lines (or cells) are there in temp_iris_2? 

In [199]:
#Answer to Exercise II.A:


**(B)**. Where did the values in the error column (a5) from `temp_iris` go in the `temp_iris_2` array?

In [90]:
#Answer to Exercise II.B:


**(C)** What `attribute_no` value corresponds to "`petal_width`"?

In [None]:
#Answer to Exercise II.C:
