Rodrigo Botafogo edited this page Jan 8, 2015 · 1 revision

SciCom and MDArray

MDArray is a multidimensional array implemented for JRuby, inspired by NumPy (www.numpy.org) and Masahiro Tanaka's NArray (narray.rubyforge.org). MDArray stands on the shoulders of NetCDF-Java and Parallel Colt. At this point MDArray has libraries for linear algebra as well as mathematical, trigonometric and descriptive statistics methods.

The NetCDF-Java Library is a Java interface to NetCDF files, as well as to many other types of scientific data formats. It is developed and distributed by Unidata (http://www.unidata.ucar.edu).

Parallel Colt (https://sites.google.com/site/piotrwendykier/software/parallelcolt) is a multithreaded version of Colt (http://acs.lbl.gov/software/colt/). Colt provides a set of open source libraries for high performance scientific and technical computing in Java. Scientific and technical computing is characterized by demanding problem sizes and a need for high performance with a reasonably small memory footprint.

Converting MDArray to R Array (same backing store)

An MDArray can be converted to an R array by calling method ‘R.md’.

First, let's create an MDArray of shape [4, 3]:

arr1 = MDArray.typed_arange("double", 12)
arr1.reshape!([4, 3])
arr1.print

This is arr1 as printed from MDArray:

[[0.00 1.00 2.00]
 [3.00 4.00 5.00]
 [6.00 7.00 8.00]
 [9.00 10.00 11.00]]

Now, converting this array to an R array and printing it:

r_matrix = R.md(arr1)
r_matrix.pp

The result is:

     [,1] [,2] [,3]
[1,]    0    1    2
[2,]    3    4    5
[3,]    6    7    8
[4,]    9   10   11

One very important aspect of this conversion is that the MDArray and the R array use the same backing store, so the conversion does no copying and has very low cost. However, with great power comes great responsibility: since the MDArray and the R array share the same backing store, a change to the MDArray also changes the values of the R array. Renjin assumes that a vector never changes and delays its calculation until the latest possible time. If values change in the meantime, the result can be unexpected, so any change to a converted MDArray should be made with care.
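The hazard described above can be illustrated without MDArray or Renjin. The following is a minimal sketch in plain Ruby: `SharedView` and `lazy_sum` are hypothetical names invented for this illustration and are not part of SciCom, MDArray, or Renjin. It only shows why a deferred computation over a shared backing store can observe a later mutation.

```ruby
# Hypothetical stand-in for a zero-copy view; NOT part of MDArray or SciCom.
class SharedView
  def initialize(store)
    @store = store # keeps a reference to the same Array, no copy is made
  end

  # Returns a deferred computation: nothing is summed until it is called.
  def lazy_sum
    -> { @store.sum }
  end
end

store = [0.0, 1.0, 2.0]
view = SharedView.new(store)

pending = view.lazy_sum # the sum is only described here, not evaluated
store[0] = 100.0        # mutation through the original handle
result = pending.call   # evaluated late, so it sees the mutated value

puts result # prints 103.0, not the 3.0 one might expect
```

Copying the data before the mutation would avoid the surprise, at the cost of giving up the zero-copy conversion.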

Array indexing

MDArrays are indexed starting at 0, while R arrays are indexed starting at 1. In order to facilitate the use of converted MDArrays we introduced method ‘ri’ (r-indexing) that converts an MDArray index into an R matrix index. Comparing the content of the MDArray and R array defined above can be done with:

compare = MDArray.byte(arr1.shape)
arr1.get_index.each do |ct|
    compare[*ct] = (arr1[*ct] == (r_matrix.ri(*ct).gz))? 1 : 0
end
comp = R.md(compare)
p comp.all.gt

  • We first create a byte MDArray. Byte arrays are converted to logical vectors in R.

  • arr1.get_index retrieves all indexes of arr1, in order.

  • We then compare arr1[*ct] (the element of arr1 at index ct) with r_matrix.ri(*ct) (.ri converts the given index to an R index).

  • In R, indexing a vector returns a new vector. To get a scalar instead of a vector, SciCom provides method .gz.

  • Finally, compare is converted to a logical vector in R (comp), and we call method all on this vector. Method all returns true if all elements of the vector are true. In this case all elements are true, so comp.all.gt prints true.
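The index shift that ‘ri’ takes care of can be sketched in plain Ruby. `to_r_index` below is a hypothetical helper written for this illustration; it is not the SciCom implementation and only shows the 0-based to 1-based conversion.

```ruby
# Hypothetical helper, NOT the SciCom implementation of `ri`:
# shifts every component of a 0-based index to the 1-based R convention.
def to_r_index(*zero_based)
  zero_based.map { |i| i + 1 }
end

p to_r_index(0, 0) # prints [1, 1], the first element of a matrix in R
p to_r_index(3, 2) # prints [4, 3], the last element of the [4, 3] array above
```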

Multi-dimensional arrays

Multi-dimensional arrays can also be converted into R arrays using method ‘.md’. However, MDArray and R interpret dimension specifications differently. For instance, an MDArray with dimensions [3, 2, 2] consists of 3 matrices of size 2 x 2.

The figure below shows a [3, 2, 2] array in MDArray.

[[[0.00 1.00]
  [2.00 3.00]]

 [[4.00 5.00]
  [6.00 7.00]]

 [[8.00 9.00]
  [10.00 11.00]]]

Below we show a [3, 2, 2] array created in R. In R, this specification builds an array of 2 matrices of size [3, 2].

, , 1

     [,1] [,2]
[1,]    0    3
[2,]    1    4
[3,]    2    5

, , 2

     [,1] [,2]
[1,]    6    9
[2,]    7   10
[3,]    8   11

To make converted arrays easy to use, when a multi-dimensional MDArray is converted to an R array, the R array is dimensioned so that it is identical to the MDArray. As such, if the MDArray above is converted to an R array, the R array has dimensions [2, 2, 3].
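The relation between the two dimension specifications can be sketched in plain Ruby for the three-dimensional case. The helpers `r_dim` and `r_get` are hypothetical, and the index mapping is an assumption inferred from the text (the leading MDArray dimension becomes the last R dimension so that each printed 2 x 2 slice looks the same); it is not taken from the SciCom source.

```ruby
# Nested-array stand-in for the [3, 2, 2] MDArray shown above.
md = [[[0.0, 1.0], [2.0, 3.0]],
      [[4.0, 5.0], [6.0, 7.0]],
      [[8.0, 9.0], [10.0, 11.0]]]

# Hypothetical helper (3-D only): R dimensions for an MDArray shape.
# Assumption: the leading MDArray dimension moves to the end.
def r_dim(shape)
  d0, d1, d2 = shape
  [d1, d2, d0]
end

# Hypothetical helper: element at the 1-based R index [row, col, slice],
# read from the 0-based nested array as md[slice][row][col].
def r_get(md, row, col, slice)
  md[slice - 1][row - 1][col - 1]
end

p r_dim([3, 2, 2])   # prints [2, 2, 3]
p r_get(md, 2, 1, 3) # prints 10.0, i.e. md[2][1][0]
```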

Dicing and Slicing MDArrays

MDArrays can be sliced and diced in many ways. A sliced MDArray can be converted to an R array like any other MDArray. From the point of view of R, it is just a normal array.

When working with two-dimensional arrays, each row is viewed as a new record and there is no information encoded in the row number. Columns encode information and each column holds a different type of value, for example, “name”, “age”, “phone number”, etc.

With multi-dimensional arrays, dimensions can encode information. For example, let's suppose we are developing a system to analyze quotes from multiple stocks. Working with two-dimensional arrays we would have a file for each stock, in which each row would be a new record and the columns would represent “open”, “high”, “low”, “close”, etc. With multi-dimensional arrays we can use a single array with the following dimensions:

  • Dimension 0: The date of the quote.
  • Dimension 1: The stock.
  • Dimension 2: The quote characteristic (“open”, “high”, etc.).

Let's encode all quotes from Jul. 2014 for the following stocks: Google, Microsoft, Yahoo and Apple. We define an MDArray with shape [22, 4, 6]. The first dimension, of size 22, represents the 22 business days of Jul. 2014. The second dimension, of size 4, has one entry for each of the four stocks. The third dimension, of size 6, holds the quote attributes “open”, “high”, “low”, “close”, “volume” and “adjusted volume”.

Getting the data from Yahoo finance, we find that the opening value of Google stock on 1/Jul/2014 was 578.32, so we assign data[0, 0, 0] = 578.32. The opening value of Google stock on 2/Jul/2014 was 583.35, so again we have data[1, 0, 0] = 583.35.

Now, the Microsoft “high” stock value on 3/Jul/2014 was 44.09, so data[2, 1, 1] = 44.09.

Let's say that we want to have statistics about the opening price of Google stock. We can slice the data array to create a view with only the values of interest:
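The encoding above can be sketched as a lookup from labels to array indexes. This is only an illustration in plain Ruby: the constants `TRADING_DAYS`, `STOCKS`, `ATTRIBUTES` and the helper `quote_index` are hypothetical names, not part of MDArray. The list of trading days assumes weekdays only, minus 4/Jul/2014, which yields the 22 business days mentioned above.

```ruby
require "date"

# Hypothetical lookup tables, NOT part of MDArray.
# Weekdays of Jul. 2014, excluding the 4/Jul U.S. market holiday.
TRADING_DAYS = (Date.new(2014, 7, 1)..Date.new(2014, 7, 31)).reject do |d|
  d.saturday? || d.sunday? || d == Date.new(2014, 7, 4)
end
STOCKS = %w[Google Microsoft Yahoo Apple].freeze
ATTRIBUTES = ["open", "high", "low", "close", "volume", "adjusted volume"].freeze

# Hypothetical helper: translates labels into a [date, stock, attribute] index.
def quote_index(date, stock, attribute)
  [TRADING_DAYS.index(date), STOCKS.index(stock), ATTRIBUTES.index(attribute)]
end

p TRADING_DAYS.size                                      # prints 22
p quote_index(Date.new(2014, 7, 1), "Google", "open")    # prints [0, 0, 0]
p quote_index(Date.new(2014, 7, 3), "Microsoft", "high") # prints [2, 1, 1]
```

With such a helper, an assignment like data[2, 1, 1] = 44.09 could be written in terms of labels instead of raw index numbers.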

sec = @data.section([0, 0, 0], [22, 1, 1], true)

The ‘section’ method gets a section of the original array. It takes two or three arguments. The first two arguments are arrays and the third, when used, is ‘true’. The first array holds the starting index for each dimension and the second holds the number of elements to take along each dimension. So, in the first dimension we start at index 0 and take 22 elements (all elements in that dimension), in this example all dates in Jul. 2014. In the second dimension we start at stock 0 and take 1 element, i.e., only one stock is selected; in this example Google has index 0. Finally, in the third dimension we start at index 0 (“open”) and take 1 element, i.e., only the open attribute is selected. Printing sec gives:

[578.32 583.35 583.35 583.76 577.66 571.58 565.91 571.91 582.60 585.74 588.00 579.53 593.00 591.75 590.72  593.23 596.45 590.40 588.07 588.75 586.55 580.60]
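The origin-plus-size semantics of ‘section’ can be mimicked on nested Ruby arrays. The `section` function below is a hypothetical stand-in, not the MDArray method. In particular, treating the third argument as "drop dimensions of size 1" is an assumption, suggested by the one-dimensional result printed above. The sample values are synthetic codes (date * 100 + stock * 10 + attribute), not real quotes.

```ruby
# Hypothetical stand-in for MDArray#section, operating on nested Arrays.
# origin: starting index per dimension; sizes: element count per dimension.
# reduce: ASSUMED meaning of the third argument, drops dimensions of size 1.
def section(data, origin, sizes, reduce = false)
  o, *rest_origin = origin
  s, *rest_sizes = sizes
  picked = data[o, s] # s elements starting at index o
  unless rest_origin.empty?
    picked = picked.map { |sub| section(sub, rest_origin, rest_sizes, reduce) }
  end
  reduce && s == 1 ? picked.first : picked
end

# Synthetic [3, 2, 2] data: 3 dates, 2 stocks, 2 attributes.
data = Array.new(3) do |date|
  Array.new(2) { |stock| Array.new(2) { |attr| date * 100 + stock * 10 + attr } }
end

p section(data, [0, 1, 0], [3, 1, 1], true) # prints [10, 110, 210]
p section(data, [0, 1, 0], [3, 1, 1])       # prints [[[10]], [[110]], [[210]]]
```

Unlike the view described in the text, this sketch copies the selected elements, so it illustrates only the indexing, not the shared storage.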

Now, let's convert this to R and call the summary function:

R.md(sec).summary.pp

The result is:

Min.    1st Qu. Median  Mean    3rd Qu. Max.   
565,9   579,8   584,8   584,1     590   596,5
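As a cross-check, the same statistics can be computed in plain Ruby from the values printed for sec. This is a sketch, not how SciCom or Renjin computes them: the `quantile` helper is hypothetical and uses linear interpolation between order statistics, which is the default quantile definition in R.

```ruby
# Opening prices as printed for `sec` above.
opens = [578.32, 583.35, 583.35, 583.76, 577.66, 571.58, 565.91, 571.91,
         582.60, 585.74, 588.00, 579.53, 593.00, 591.75, 590.72, 593.23,
         596.45, 590.40, 588.07, 588.75, 586.55, 580.60]

# Hypothetical helper: quantile by linear interpolation between the two
# nearest order statistics; prob is between 0 and 1.
def quantile(sorted, prob)
  h = (sorted.size - 1) * prob
  lo = h.floor
  hi = h.ceil
  sorted[lo] + (h - lo) * (sorted[hi] - sorted[lo])
end

sorted = opens.sort
summary = {
  min: sorted.first,
  q1: quantile(sorted, 0.25),
  median: quantile(sorted, 0.5),
  mean: opens.sum / opens.size,
  q3: quantile(sorted, 0.75),
  max: sorted.last
}

summary.each { |name, value| puts format("%-7s %.2f", name, value) }
```

Rounded to four significant digits, these values can be compared with the summary printed above.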