# NumPy - multidimensional data arrays

## Introduction

The NumPy package (module) is used in almost all numerical computation using Python. It is a package that provides high-performance vector, matrix and higher-dimensional data structures for Python. It is implemented in C and Fortran so when calculations are vectorized (formulated with vectors and matrices), performance is very good. 

To use NumPy you need to import the module. Here we import all of NumPy into the global namespace. We do this for convenience in this notebook, since we will be using NumPy functions extensively. However, the more common convention is to import NumPy under the name `np` (`import numpy as np`). You will see this in most IPython notebooks and interactive sessions, and we will follow this convention in subsequent lecture notebooks. For now, be aware that most of the functions demonstrated below come from the NumPy package.

In [1]:
from numpy import *

In the NumPy package the terminology used for vectors, matrices and higher-dimensional data sets is *array*. 



## Creating NumPy arrays

There are a number of ways to initialize new NumPy arrays, for example from

* a Python list or tuple
* using functions that are dedicated to generating NumPy arrays, such as `arange`, `linspace`, etc.
* reading data from files

### From lists

One of the most basic ways to create a NumPy array is to initialize it from an existing Python list.  For example, to create 
new vector and matrix arrays from Python lists we can use the `numpy.array` function:

In [2]:
# a vector: the argument to the array function is a Python list
v = array([1,2,3,4])

v

array([1, 2, 3, 4])

In [3]:
# a matrix: the argument to the array function is a nested Python list
M = array([[1, 2], [3, 4]])

M

array([[1, 2],
       [3, 4]])

The `v` and `M` objects are both of the type `ndarray` that the `NumPy` module provides.

In [4]:
type(v), type(M)

(numpy.ndarray, numpy.ndarray)

The difference between the `v` and `M` arrays is only their shapes. We can get information about the shape of an array by using the `ndarray.shape` property.

In [5]:
v.shape

(4,)

In [6]:
M.shape

(2, 2)

`v` is a 1-dimensional vector with 4 elements. `M` is a 2-dimensional matrix with 2 rows and 2 columns (for a total of 4 elements).

The number of elements in the array is available through the `ndarray.size` property:

In [7]:
M.size

4

Equivalently, we could use the functions `numpy.shape` and `numpy.size`:

In [8]:
shape(M)

(2, 2)

In [9]:
size(M)

4

So far the `numpy.ndarray` looks a lot like a Python list (or nested list). Why not simply use Python lists for computations instead of creating a new array type? 

There are several reasons:

* Python lists are very general. They can contain any kind of object. They are dynamically typed. They do not support mathematical functions such as matrix and dot multiplications, etc. Implementing such functions for Python lists would not be very efficient because of the dynamic typing.
* NumPy arrays are **statically typed** and **homogeneous**. The type of the elements is determined when the array is created, and it cannot be changed once the array is created.
* NumPy arrays are memory efficient.
* Because of the static typing, fast implementation of mathematical functions such as multiplication and addition of NumPy arrays can be implemented in a compiled language (C and Fortran are used).

Using the `dtype` (data type) property of an `ndarray`, we can see what type the data of an array has:

In [10]:
M.dtype

dtype('int32')

In this case, the `M` array contains integer elements. The output above shows `int32`, which means each element is stored in 32 bits (a regular-sized integer). On many platforms the default is `int64` instead, which uses 64 bits per element (also known as a long integer).
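As a small sketch of what the bit width means in practice, the block below requests both integer widths explicitly and inspects the storage per element and the largest representable value. It uses the `np` prefix so that it is self-contained; the variable names `M32` and `M64` are made up for this illustration.

```python
import numpy as np

# Request the integer width explicitly instead of relying on the platform default
M32 = np.array([[1, 2], [3, 4]], dtype=np.int32)
M64 = np.array([[1, 2], [3, 4]], dtype=np.int64)

print(M32.dtype, M32.itemsize)   # int32 4  (4 bytes = 32 bits per element)
print(M64.dtype, M64.itemsize)   # int64 8  (8 bytes = 64 bits per element)

# The width limits the range of values an element can hold
largest_int32 = np.iinfo(np.int32).max
print(largest_int32)             # 2147483647
```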

We get an error if we try to assign a value of the wrong type to an element in a NumPy array:

In [11]:
M[0,0] = "hello"

ValueError: invalid literal for int() with base 10: 'hello'
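Not every mismatched value raises an error, though. The sketch below (using the `np` prefix so it runs on its own) shows that a float assigned into an integer array is silently truncated, while a string that cannot be parsed as an integer is rejected. The flag `assignment_failed` is a name introduced only for this example.

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

# A float is converted to the array's integer type: the fraction is dropped
M[0, 0] = 3.7
print(M[0, 0])   # 3

# A string that is not an integer literal cannot be converted
try:
    M[0, 0] = "hello"
    assignment_failed = False
except (ValueError, TypeError):
    assignment_failed = True

print(assignment_failed)
```

The silent truncation is worth remembering: the array keeps its type, so it is the value that gets changed.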

If we want, we can explicitly define the type of the array data when we create it, using the `dtype` keyword argument: 

In [12]:
M = array([[1, 2], [3, 4]], dtype=float)

print(M)
print(M.dtype)

[[1. 2.]
 [3. 4.]]
float64


Common types that can be used with `dtype` are: `int`, `float`, `complex`, `bool`, `object`, etc.

We can also explicitly define the bit size of the data types, for example: `int64`, `int16`, `float128`, `complex128`. Note that `float128` is not available on every platform.

### Using array-generating functions

For larger arrays it is impractical to initialize the data manually, using explicit Python lists. Instead we can use one of the many functions in NumPy that generates arrays of different forms (or reads in the data from some other source, e.g. files, see next section).
Some of the more common are:

#### arange

In [13]:
# create a range

x = arange(0, 10, 1) # arguments: start, stop, step

x

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [14]:
set_printoptions(4, suppress=True) # show only four decimals
x = arange(-1, 1, 0.1)

x

array([-1. , -0.9, -0.8, -0.7, -0.6, -0.5, -0.4, -0.3, -0.2, -0.1, -0. ,
        0.1,  0.2,  0.3,  0.4,  0.5,  0.6,  0.7,  0.8,  0.9])
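The `linspace` function mentioned at the start of this section is not demonstrated above, so here is a minimal sketch. Instead of a step size it takes the number of points, and it includes both endpoints by default. The NumPy documentation suggests it over `arange` for non-integer steps, where rounding can affect the number of elements.

```python
import numpy as np

# 21 evenly spaced points from -1 to 1, both endpoints included
x = np.linspace(-1, 1, 21)

print(x.size)        # 21
print(x[0], x[-1])   # -1.0 1.0
```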

#### random data

We will have a whole section in this class on random models and random number generation.  Here are some examples of creating arrays with randomly generated data using `NumPy` functions.

In [15]:
from numpy import random
import numpy as np

In [16]:
# uniform random numbers in the half-open range [0, 1)
# generate a 2 dimensional array of random numbers, with 5 elements along each dimension.
random.rand(5,5)

array([[0.8875, 0.8613, 0.2038, 0.4985, 0.9887],
       [0.9909, 0.6415, 0.8215, 0.8919, 0.1437],
       [0.0543, 0.75  , 0.8037, 0.4609, 0.7812],
       [0.6953, 0.602 , 0.0479, 0.3978, 0.0935],
       [0.4262, 0.975 , 0.5179, 0.1478, 0.5956]])

In [17]:
# standard normally distributed random numbers (mean or mu = 0.0, standard deviation
# or sigma = 1.0)
# create a 3 dimensional array with 3 elements in each dimension
x = random.randn(3,3,3)
print(x)
print(x.mean())
print(x.std())

[[[-0.6302 -1.3541  0.7286]
  [ 0.882   1.8133 -1.236 ]
  [-0.3139  1.5489  1.53  ]]

 [[ 1.7768 -0.2516  0.1147]
  [-0.5269  0.1287  0.4737]
  [ 1.4413  1.0876  0.0758]]

 [[ 0.6194 -1.2718  1.2522]
  [ 0.6847 -2.3992 -0.1517]
  [-0.7872 -1.4275 -1.2325]]]
0.09537572607769225
1.1214993829212967


An often seen idiom allocates a two-dimensional array, and then fills in one-dimensional arrays from some function:

In [18]:
twod = np.zeros((5, 2))
twod

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [25]:
%%timeit
for i in range(twod.shape[0]):
    twod[i, :] = np.random.random(2)
twod

9.12 µs ± 196 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Benchmark the difference in performance between the preceding code and the following code, which does the same thing.

In [27]:
%%timeit
#10x faster
twod = np.random.random(size=(5,2))
twod

928 ns ± 44.5 ns per loop (mean ± std. dev. of 7 runs, 1000000 loops each)
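The cells above use the legacy `np.random` functions. As an alternative sketch (not part of the original notebook), NumPy also provides a generator object through `np.random.default_rng`, which makes seeding explicit and results reproducible. The seed value 42 is arbitrary.

```python
import numpy as np

# Two generators created with the same seed produce the same numbers
rng = np.random.default_rng(seed=42)
twod = rng.random(size=(5, 2))    # uniform samples in [0, 1), filled in one call

rng_again = np.random.default_rng(seed=42)
twod_again = rng_again.random(size=(5, 2))

print(twod.shape)
print(np.array_equal(twod, twod_again))
```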


#### diag

In [28]:
# a diagonal matrix
diag([1,2,3])

array([[1, 0, 0],
       [0, 2, 0],
       [0, 0, 3]])

In [29]:
# diagonal with offset from the main diagonal
diag([1,2,3], k=1) 

array([[0, 1, 0, 0],
       [0, 0, 2, 0],
       [0, 0, 0, 3],
       [0, 0, 0, 0]])

#### zeros and ones

In [30]:
# a 3x3 2 dimensional array, filled with zeros
zeros((3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [31]:
# a vector (1 dimensional array) of 10 ones
ones((10,))

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

### List Performance vs. NumPy Performance

In the cells below, time the NumPy sum against the list sum using `%%timeit` to see the difference.

In [36]:
Nelements = 10000
Ntimeits = 10000

x = arange(Nelements)
y = range(Nelements)

print(y)

range(0, 10000)


In [37]:
%%timeit
#73x faster
np.sum(x)

6.7 µs ± 85.5 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


In [38]:
%%timeit
total = 0
for i in y:
    total = i + total
total

487 µs ± 16.6 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)
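The `%%timeit` magic only works inside IPython or Jupyter. For a plain Python script, a rough equivalent can be sketched with the standard `timeit` module. The helper `loop_sum` and the repeat count of 200 are choices made for this illustration, and the timings will differ from machine to machine.

```python
import timeit
import numpy as np

n_elements = 10000
x = np.arange(n_elements)
y = range(n_elements)

def loop_sum(values):
    """Sum the values with an explicit Python loop."""
    total = 0
    for i in values:
        total = i + total
    return total

# Time 200 repetitions of each approach
numpy_time = timeit.timeit(lambda: np.sum(x), number=200)
loop_time = timeit.timeit(lambda: loop_sum(y), number=200)

print(f"np.sum: {numpy_time:.4f} s, Python loop: {loop_time:.4f} s")
```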


#### Numpy Arrays vs. Python Lists?

1. Why the need for numpy arrays?  Can't we just use Python lists?
2. Iterating over numpy arrays is slow. Slicing is faster

Python lists may contain items of different types. This flexibility comes at a price: Python lists store *pointers* to memory locations.  On the other hand, numpy arrays are typed, where the default type is floating point.  Because of this, the system knows how much memory to allocate, and if you ask for an array of size 100, it will allocate one hundred contiguous spots in memory, where the size of each spot is based on the type.  This makes access extremely fast.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/array_vs_list.png" alt="Drawing" style="width: 500px;"/>

BUT, iteration slows things down again. In general you should not access numpy array elements by iteration.  This is because of type conversion.  Numpy stores integers and floating points in C-language format.  When you operate on array elements through iteration, Python needs to convert that element to a Python int or float, which is a more complex beast (a `struct` in C jargon).  This has a cost.

<img src="https://jakevdp.github.io/PythonDataScienceHandbook/figures/cint_vs_pyint.png" alt="Drawing" style="width: 500px;"/>


If you want to know more, read [Jake Vanderplas's Data Science Handbook](https://jakevdp.github.io/PythonDataScienceHandbook/), which is the source of the figures above. You will find that book an incredible resource.

Why is slicing faster? The reason is technical: slicing provides a view onto the memory occupied by a numpy array, instead of creating a new array. That is the reason the code above this cell works nicely as well. However, if you iterate over a slice, then you have gone back to the slow access.
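To make the idea of a view concrete, here is a small sketch: a slice shares memory with the array it came from, while an explicit `.copy()` does not. The variable names are made up for the example.

```python
import numpy as np

a = np.arange(10)

view = a[2:5]                  # a slice: a view onto a's memory
independent = a[2:5].copy()    # an explicit copy with its own memory

view[0] = 100                  # writes through to a
independent[1] = -1            # leaves a untouched

print(a)
print(np.shares_memory(a, view), np.shares_memory(a, independent))
```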

By contrast, functions such as `np.dot` are implemented at the C level, do not do this type conversion, and access contiguous memory. If you want this kind of access in Python, use the struct module or Cython. Indeed, many fast algorithms in numpy and pandas are either implemented at the C level or employ Cython.

## File I/O

For small examples and tests we often simply randomly or systematically generate the data we need into arrays, as we have just done. But
for real computational problems and simulations, we usually need to get or load data that was generated from some external source or
experiment in order to analyse it.  `NumPy` supports reading in data from regular files in several formats, including a space efficient 'NumPy' native format.

By the way, we won't get into it in this course, but Python and `NumPy` support more complex and currently popular formats used in very large
data analysis projects. These include relational database queries, JSON, and formats such as HDF5 (Hierarchical Data Format), among many
others.

### Comma-separated values (CSV)

A very common but basic format for data files is comma-separated values (CSV), or a related format such as TSV 
(tab-separated values) or space-separated values. To read data from such files into NumPy arrays we can use 
the `numpy.genfromtxt` function. For example, the stockholm_td_adj file contains temperature data recorded in 
Stockholm, Sweden. A recording was made each day of the year (presumably at the same time of day). The first 3 columns 
are the year, month and day. The next 3 columns hold the temperature in degrees Celsius.

Note that you may need to change the path to the data below.

In [None]:
data = genfromtxt('/data/cs2300/examples/stockholm_td_adj.dat')

In [None]:
data.shape

This indicates that the data consists of 77431 rows or records.  Each row has 7 columns, or elements.  As we mentioned, the first 3
columns are the year, month and day, and the next 3 columns are the temperature recordings.

We can look at the first 10 values from columns 0, 1 and 2 (the year, month and day).

In [None]:
data[0:10,0:3]

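Using `numpy.savetxt` we can store a NumPy array to a text file. By default the values are separated by spaces; pass `delimiter=","` if you want true comma-separated values: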

In [None]:
M = random.rand(3,3)

M

In [None]:
savetxt("random-matrix.csv", M)
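As a sketch of a full round trip, the block below writes a small array as comma-separated text and reads it back with `genfromtxt`. It writes into a temporary directory so nothing is left behind; the file name `matrix.csv` and the format string `"%.5f"` are arbitrary choices for the illustration.

```python
import os
import tempfile
import numpy as np

M = np.array([[0.5, 1.5], [2.5, 3.5]])

with tempfile.TemporaryDirectory() as tmpdir:
    path = os.path.join(tmpdir, "matrix.csv")   # hypothetical file name

    # Write with an explicit comma delimiter and five decimals per value
    np.savetxt(path, M, delimiter=",", fmt="%.5f")

    with open(path) as f:
        first_line = f.readline().strip()

    # Read it back, telling genfromtxt which delimiter was used
    loaded = np.genfromtxt(path, delimiter=",")

print(first_line)   # 0.50000,1.50000
print(loaded)
```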

### NumPy's native file format

Useful when storing and reading back NumPy array data. Use the functions `numpy.save` and `numpy.load`:

In [None]:
save("random-matrix.npy", M)

In [None]:
load("random-matrix.npy")
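One advantage of the native format over text files is that the data type and shape are stored with the data. The sketch below checks this with an in-memory buffer standing in for a file on disk (an assumption made to keep the example self-contained).

```python
import io
import numpy as np

M = np.arange(6, dtype=np.int16).reshape(2, 3)

buffer = io.BytesIO()     # stands in for a .npy file on disk
np.save(buffer, M)
buffer.seek(0)            # rewind before reading
restored = np.load(buffer)

print(restored.dtype, restored.shape)   # int16 (2, 3)
```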

## More properties of NumPy arrays

In [None]:
M.itemsize # bytes per element

In [None]:
M.nbytes # number of bytes

In [None]:
M.ndim # number of dimensions

## Manipulating arrays

### Indexing

We can index elements in an array using the square bracket and indices, as we have done before with regular Python lists:

In [None]:
# v is a vector, and has only one dimension, taking one index
print(v)
print(v[0])

If we want to access `NumPy` arrays with 2 or more dimensions, we can specify an index for each dimension, separating the
indices with a comma:

In [None]:
# M is a matrix, or a 2 dimensional array, taking two indices 
print(M)
print(M[1,1])

If we omit an index of a multidimensional array it returns the whole row (or, in general, an (N-1)-dimensional array):

In [None]:
M

In [None]:
M[1]

The same thing can be achieved by using `:` instead of an index (this is actually a slice):

In [None]:
M[1,:] # row 1

In [None]:
M[:,1] # column 1

We can assign new values to elements in an array using indexing:

In [None]:
M[0,0] = 1

In [None]:
M

In [None]:
# also works for rows and columns
M[1,:] = 0
M[:,2] = -1

In [None]:
M

### Index slicing

Index slicing is the technical name for the syntax `M[lower:upper:step]` to extract part of an array:

In [39]:
A = array([1,2,3,4,5])
A

array([1, 2, 3, 4, 5])

In [40]:
A[1:3]

array([2, 3])

WARNING: Array slices are *mutable*: if they are assigned a new value the original array from which the slice was extracted is modified (so they are really a view into the original data/memory of the array):

In [41]:
A[1:3] = [-2,-3]

A

array([ 1, -2, -3,  4,  5])

We can omit any of the three parameters in `M[lower:upper:step]`:

In [None]:
A[::] # lower, upper, step all take the default values

In [None]:
A[::2] # step is 2, lower and upper default to the beginning and end of the array

In [None]:
A[:3] # first three elements

In [None]:
A[3:] # elements from index 3

Negative indices count from the end of the array (positive indices from the beginning):

In [None]:
A = array([1,2,3,4,5])

In [None]:
A[-1] # the last element in the array

In [None]:
A[-3:] # the last three elements

Index slicing works exactly the same way for multidimensional arrays:

In [None]:
A = array([[n+m*10 for n in range(5)] for m in range(5)])

A

In [None]:
# a block from the original array
A[1:4, 1:4]

In [None]:
# strides
A[::2, ::2]

### Fancy indexing

Fancy indexing is the name for when an array or list is used in place of an index:

In [None]:
row_indices = [1, 2, 3]
A[row_indices]

In [None]:
col_indices = [1, 2, -1] # remember, index -1 means the last element
A[row_indices, col_indices]
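When both a row list and a column list are given, the two lists are paired up element by element. The sketch below works through the example above and also shows that, unlike slicing, fancy indexing returns a copy of the data. The name `picked` is introduced only for this illustration.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

row_indices = [1, 2, 3]
col_indices = [1, 2, -1]

# Picks the elements at (1, 1), (2, 2) and (3, -1)
picked = A[row_indices, col_indices]
print(picked)      # [11 22 34]

# Modifying the result does not change A: fancy indexing made a copy
picked[0] = -999
print(A[1, 1])     # 11
```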

## Functions for extracting data from arrays and creating arrays

### diag

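With the `diag` function we can also extract the diagonal and subdiagonals of an array. Note that the outputs shown below were produced while `A` was still the one-dimensional array `[1, -2, -3, 4, 5]` from the slicing examples. Given a one-dimensional array, `diag` builds a diagonal matrix; given a two-dimensional array, it returns the requested diagonal as a one-dimensional array: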

In [42]:
diag(A)

array([[ 1,  0,  0,  0,  0],
       [ 0, -2,  0,  0,  0],
       [ 0,  0, -3,  0,  0],
       [ 0,  0,  0,  4,  0],
       [ 0,  0,  0,  0,  5]])

In [43]:
diag(A, -1)

array([[ 0,  0,  0,  0,  0,  0],
       [ 1,  0,  0,  0,  0,  0],
       [ 0, -2,  0,  0,  0,  0],
       [ 0,  0, -3,  0,  0,  0],
       [ 0,  0,  0,  4,  0,  0],
       [ 0,  0,  0,  0,  5,  0]])
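To see the extraction itself, here is a minimal sketch that assumes `A` is the 5x5 array built in the index slicing section. With a two-dimensional input, `diag` returns the diagonal as a one-dimensional array.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])

main_diagonal = np.diag(A)       # elements A[i, i]
sub_diagonal = np.diag(A, -1)    # elements A[i+1, i], one below the main diagonal

print(main_diagonal)   # [ 0 11 22 33 44]
print(sub_diagonal)    # [10 21 32 43]
```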

### choose

Constructs an array by picking elements from several arrays:

In [None]:
which = [1, 0, 2, 3]
choices = [[-1, -2, -3, -4], [1,2,3,4]]

choose(which, choices, mode='wrap')
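It can help to write the selection out by hand. For position `i`, `choose` takes element `i` from the array numbered `which[i]`; with `mode='wrap'`, numbers beyond the last array wrap around. The list comprehension below is only an illustration of that rule, not how NumPy implements it.

```python
import numpy as np

which = [1, 0, 2, 3]
choices = [[-1, -2, -3, -4], [1, 2, 3, 4]]

result = np.choose(which, choices, mode="wrap")

# Same selection by hand: indices 2 and 3 wrap around to 0 and 1
by_hand = [choices[w % len(choices)][i] for i, w in enumerate(which)]

print(result)    # [ 1 -2 -3  4]
print(by_hand)   # [1, -2, -3, 4]
```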

## Linear algebra

Vectorizing code is the key to writing efficient numerical calculation with Python/NumPy. That means that as much as possible of a program should be formulated in terms of matrix and vector operations, like matrix-matrix multiplication.

### Scalar-array operations

We can use the usual arithmetic operators to multiply, add, subtract, and divide arrays with scalar numbers.

In [None]:
v1 = arange(0, 5)

In [None]:
v1 * 2

In [None]:
v1 + 2

In [None]:
A * 2, A + 2

### Element-wise array-array operations

When we add, subtract, multiply and divide arrays with each other, the default behaviour is **element-wise** operations:

In [None]:
A * A # element-wise multiplication

In [None]:
v1 * v1

If we multiply arrays with different but compatible shapes, NumPy broadcasts the smaller array across the larger one. Here each row of `A` is multiplied element-wise by `v1`:

In [None]:
A.shape, v1.shape

In [None]:
A * v1
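The sketch below spells out what happens with a 5x5 matrix and a length-5 vector. Multiplying by `v1` scales the columns of each row; turning `v1` into a column with `np.newaxis` scales whole rows instead. The names `row_wise` and `col_wise` are made up for the example.

```python
import numpy as np

A = np.array([[n + m * 10 for n in range(5)] for m in range(5)])
v1 = np.arange(0, 5)

# Shapes (5, 5) and (5,): v1 is lined up with the last axis, so
# row_wise[i, j] == A[i, j] * v1[j]
row_wise = A * v1

# Shapes (5, 5) and (5, 1): v1 is lined up with the first axis, so
# col_wise[i, j] == A[i, j] * v1[i]
col_wise = A * v1[:, np.newaxis]

print(row_wise[1])   # [ 0 11 24 39 56]
print(col_wise[2])   # [40 42 44 46 48]
```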

### Matrix algebra

What about matrix multiplication? There are two ways. We can either use the `dot` function, which applies a matrix-matrix, matrix-vector, or inner vector multiplication to its two arguments:

In [None]:
dot(A, A)

In [None]:
dot(A, v1)

In [None]:
dot(v1, v1)
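In Python 3.5 and later, the `@` operator gives the same products as `dot` for one- and two-dimensional arrays. This is an addition to the original notebook, sketched here with small made-up arrays.

```python
import numpy as np

A = np.array([[1, 2], [3, 4]])
v = np.array([1, 1])

matrix_matrix = A @ A    # same as np.dot(A, A)
matrix_vector = A @ v    # same as np.dot(A, v)
inner_product = v @ v    # same as np.dot(v, v)

print(matrix_matrix)
print(matrix_vector)
print(inner_product)
```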

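Alternatively, we can cast the array objects to the type `matrix`. This changes the behavior of the standard arithmetic operators `+, -, *` to use matrix algebra. Note that the NumPy documentation no longer recommends the `matrix` class for new code; it is shown here because you will still meet it in older code.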

In [None]:
M = matrix(A)
v = matrix(v1).T # make it a column vector

In [None]:
v

In [None]:
M*M

In [None]:
M*v

In [None]:
# inner product
v.T * v

In [None]:
# with matrix objects, standard matrix algebra applies
v + M*v

If we try to add, subtract or multiply objects with incompatible shapes we get an error:

In [None]:
v = matrix([1,2,3,4,5,6]).T

In [None]:
shape(M), shape(v)

In [None]:
M * v

### Data processing

Often it is useful to store datasets in NumPy arrays. NumPy provides a number of functions to calculate statistics of datasets in arrays. 

For example, let's calculate some properties of the data from the Stockholm temperature dataset used above.

In [None]:
# reminder, the temperature dataset is stored in the data variable:
shape(data)

#### mean

In [None]:
# the temperature data is in column 3
mean(data[:,3])

The daily mean temperature in Stockholm over the last 200 years or so has been about 6.2 C.

#### standard deviations and variance

In [None]:
std(data[:,3]), var(data[:,3])

#### min and max

In [None]:
# lowest daily average temperature
data[:,3].min()

In [None]:
# highest daily average temperature
data[:,3].max()

#### sum, prod, and trace

In [None]:
d = arange(0, 10)
d

In [None]:
# sum up all elements
sum(d)

In [None]:
# product of all elements, do you understand why we passed d+1 to the prod() function?
prod(d+1)
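The heading mentions `trace`, which the cells above do not show, so here is a small sketch. It also answers the question in the comment: `d` starts at 0, so the product of `d` itself is 0, while `d+1` gives the product of 1 through 10. The 3x3 array `T` is made up for the example.

```python
import numpy as np

d = np.arange(0, 10)

product_with_zero = np.prod(d)       # contains a 0, so the product is 0
product_shifted = np.prod(d + 1)     # 1 * 2 * ... * 10

# trace: the sum of the elements on the main diagonal
T = np.arange(9).reshape(3, 3)
diagonal_sum = np.trace(T)           # 0 + 4 + 8

print(product_with_zero, product_shifted, diagonal_sum)
```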

## Reshaping, resizing and stacking arrays

The shape of a NumPy array can be modified without copying the underlying data, which makes it a fast operation even for large arrays.

In [None]:
A

In [None]:
n, m = A.shape

In [None]:
B = A.reshape((1,n*m))
B

In [None]:
B[0,0:5] = 5 # modify the array

B

In [None]:
A # and the original variable is also changed. B is only a different view of the same data

We can also use the function `flatten` to make a higher-dimensional array into a vector. But this function creates a copy of the data.

In [None]:
B = A.flatten()

B

In [None]:
B[0:5] = 10

B

In [None]:
A # now A has not changed, because B's data is a copy of A's, not referring to the same data
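The difference between the two can also be checked directly with `np.shares_memory`. In this sketch, `reshape(-1)` asks NumPy to infer the length of the flattened array; for a freshly created contiguous array like this one it returns a view, while `flatten` always copies.

```python
import numpy as np

A = np.arange(6).reshape(2, 3)

B_view = A.reshape(-1)    # -1 lets NumPy work out the length (6)
B_copy = A.flatten()      # always a new array

print(B_view.shape)                      # (6,)
print(np.shares_memory(A, B_view))       # True
print(np.shares_memory(A, B_copy))       # False
```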

## Stacking and repeating arrays

Using the functions `repeat`, `tile`, `vstack`, `hstack`, and `concatenate` we can create larger vectors and matrices from smaller ones:

### tile and repeat

In [None]:
a = array([[1, 2], [3, 4]])

In [None]:
# repeat each element 3 times
repeat(a, 3)

In [None]:
# tile the matrix 3 times 
tile(a, 3)

### concatenate

In [None]:
b = array([[5, 6]])

In [None]:
concatenate((a, b), axis=0)

In [None]:
concatenate((a, b.T), axis=1)
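The `vstack` and `hstack` functions named at the top of this section are shorthand for the two `concatenate` calls above. A minimal sketch, reusing the same `a` and `b`:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6]])

stacked_rows = np.vstack((a, b))      # stack vertically: b becomes a new row
stacked_cols = np.hstack((a, b.T))    # stack horizontally: b.T becomes a new column

print(stacked_rows)
print(stacked_cols)
```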

## Copy and "deep copy"

To achieve high performance, assignments in Python usually do not copy the underlying objects. This is important, for example, when objects are passed between functions, to avoid an excessive amount of memory copying when it is not necessary (technical term: pass by reference).

In [None]:
A = array([[1, 2], [3, 4]])

A

In [None]:
# now B is referring to the same array data as A 
B = A 

In [None]:
# changing B affects A
B[0,0] = 10

B

In [None]:
A

If we want to avoid this behavior, so that we get a new, completely independent object `B` copied from `A`, then we need to do a so-called "deep copy" using the function `copy`:

In [None]:
B = copy(A)

In [None]:
# now, if we modify B, A is not affected
B[0,0] = -5

B

In [None]:
A
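The same rule applies when an array is passed to a function: the function receives a reference, not a copy. The two helper functions below are hypothetical and exist only to illustrate the difference.

```python
import numpy as np

def zero_first_in_place(arr):
    """Modify the caller's array directly."""
    arr[0] = 0

def zero_first_copy(arr):
    """Work on a copy and leave the caller's array alone."""
    result = arr.copy()
    result[0] = 0
    return result

values = np.array([1, 2, 3])

safe = zero_first_copy(values)
after_copy_version = values.tolist()      # values is unchanged here

zero_first_in_place(values)
after_in_place_version = values.tolist()  # values has now been modified

print(after_copy_version, after_in_place_version)
```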

## Iterating over array elements

Generally, we want to avoid iterating over the elements of arrays whenever we can (at all costs). The reason is that in an interpreted language like Python (or MATLAB), iterations are really slow compared to vectorized operations.

However, sometimes iterations are unavoidable. For such cases, the Python `for` loop is the most convenient way to iterate over an array:

In [None]:
v = array([1,2,3,4])

for element in v:
    print(element)

In [None]:
M = array([[1,2], [3,4]])

for row in M:
    print("row", row)
    
    for element in row:
        print(element)

When we need to iterate over each element of an array and modify its elements, it is convenient to use the `enumerate` function to obtain both the element and its index in the `for` loop: 

In [None]:
for row_idx, row in enumerate(M):
    print("row_idx", row_idx, "row", row)
    
    for col_idx, element in enumerate(row):
        print("col_idx", col_idx, "element", element)
       
        # update the matrix M: square each element
        M[row_idx, col_idx] = element ** 2

In [None]:
# each element in M is now squared
M
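For comparison, squaring every element does not actually need a loop. A sketch of the vectorized equivalent:

```python
import numpy as np

M = np.array([[1, 2], [3, 4]])

squared = M ** 2    # new array, M is left unchanged
print(squared)

M **= 2             # in-place version, overwrites M
print(M)
```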

## Vectorizing functions

As mentioned several times by now, to get good performance we should try to avoid looping over elements in our vectors and matrices, and instead use vectorized algorithms. The first step in converting a scalar algorithm to a vectorized algorithm is to make sure that the functions we write work with vector inputs.

In [None]:
def Theta(x):
    """
    Scalar implementation of the Heaviside step function.
    """
    if x >= 0:
        return 1
    else:
        return 0

In [None]:
Theta(array([-3,-2,-1,0,1,2,3]))

OK, that didn't work because we didn't write the `Theta` function so that it can handle vector input.

To get a vectorized version of Theta we can use the NumPy function `vectorize`. In many cases it can automatically vectorize a function:

In [None]:
Theta_vec = vectorize(Theta)

In [None]:
Theta_vec(array([-3,-2,-1,0,1,2,3]))

We can also implement the function to accept vector input from the beginning (requires more effort but might give better performance):

In [None]:
def Theta(x):
    """
    Vector-aware implementation of the Heaviside step function.
    """
    return 1 * (x >= 0)

In [None]:
Theta(array([-3,-2,-1,0,1,2,3]))

In [None]:
# still works for scalars as well
Theta(-1.2), Theta(2.6)
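Another common way to express an `if`/`else` on arrays is `np.where`, which picks between two values element by element. The function name `theta_where` is made up for this sketch; it is an alternative to the versions above, not a replacement for them.

```python
import numpy as np

def theta_where(x):
    """Heaviside step function written with np.where."""
    # np.asarray lets the function accept scalars, lists and arrays alike
    return np.where(np.asarray(x) >= 0, 1, 0)

step_values = theta_where(np.array([-3, -2, -1, 0, 1, 2, 3]))
print(step_values)                   # [0 0 0 1 1 1 1]
print(int(theta_where(-1.2)), int(theta_where(2.6)))
```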

## Type casting

Since NumPy arrays are *statically typed*, the type of an array does not change once created. But we can explicitly cast an array of some type to another using the `astype` method (see also the similar `asarray` function). This always creates a new array of the new type:

In [None]:
M.dtype

In [None]:
M2 = M.astype(float)

M2

In [None]:
M2.dtype

In [None]:
M3 = M.astype(bool)

M3
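Casting can lose information, so it is worth knowing the rules. The sketch below uses a made-up float array: casting to an integer type drops the fractional part (truncating toward zero), and casting to `bool` maps zero to `False` and everything else to `True`.

```python
import numpy as np

x = np.array([1.7, -1.7, 0.0])

as_int = x.astype(int)      # fractional part dropped, not rounded
as_bool = x.astype(bool)    # zero -> False, nonzero -> True

print(as_int)     # [ 1 -1  0]
print(as_bool)    # [ True  True False]
print(x.dtype)    # float64, the original array is unchanged
```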

## Benchmarking
You can calculate the size of the numpy array (in bytes) by using the following code.  Try adding a few more lines to this to compare the size (storage requirements) of different arrays.  

In [None]:
M.data.nbytes
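As one possible starting point for that comparison, the sketch below creates arrays with the same number of elements but different data types and collects their sizes. The element count of 1000 and the selection of types are arbitrary.

```python
import numpy as np

n = 1000
sizes = {}

for dtype in (np.int8, np.int32, np.float64, np.complex128):
    arr = np.zeros(n, dtype=dtype)
    # nbytes is the number of elements times the bytes per element
    sizes[arr.dtype.name] = arr.nbytes

print(sizes)
```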

## Further reading

* http://numpy.scipy.org
* http://scipy.org/Tentative_NumPy_Tutorial
* http://scipy.org/NumPy_for_Matlab_Users - A Numpy guide for MATLAB users.

Acknowledgements
----------------

Original versions of these notebooks created by J.R. Johansson (robert@riken.jp) http://dml.riken.jp/~rob/.  Adjustments have been made by Dr. Derek Riley
