# Intro to Numpy

When it comes to doing numerical work, Python by itself is rather slow. By slow we mean compared to languages like C and Fortran, which are **compiled** languages: a compiler translates the whole program into machine code before it runs. Python, by contrast, is an **interpreted** language, in which each line of a program is fed to the Python interpreter in sequence and then executed. The flexibility and ease of use of Python come at the cost of raw performance.

However, though Python code itself may be slow, Python can be used to run code that is written in a compiled language and already compiled. We will use a library (a.k.a., a Python *module*) that does exactly this underneath the hood to get fast performance for numerical operations on arrays.

In [1]:
import numpy

Importing a module is like taking a piece of equipment out of a storage locker and setting it up on a lab bench. Importing the name `numpy` makes all of its functions and classes (object types) available to us. The core data structure that `numpy` provides is known as the `numpy` array:

In [2]:
somenums = numpy.array([1, 2, 3, 4])
print(somenums)

[1 2 3 4]


A `numpy` array looks superficially similar to a `list`, which is built into Python. The two are fundamentally different, however, both in how they work and in how they are laid out in memory. `numpy` arrays don't store references to other objects; instead they point to a contiguous block of memory in which every element has exactly the same data type. For instance, we just made an array of 64 bit integers:

In [3]:
somenums.dtype

dtype('int64')
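
To make the memory layout a little more concrete, here is a minimal sketch that inspects a small array. The name `layout_demo` is made up for this example, and the dtype is requested explicitly so the numbers do not depend on the platform's default integer size.

```python
import numpy

# Request 64-bit integers explicitly so the result is platform independent.
layout_demo = numpy.array([1, 2, 3, 4], dtype=numpy.int64)

# Every element occupies the same number of bytes ...
print(layout_demo.itemsize)   # bytes per element
# ... so the whole buffer is itemsize * number of elements.
print(layout_demo.nbytes)     # total bytes in the data buffer
# The elements sit next to each other in one contiguous block.
print(layout_demo.flags['C_CONTIGUOUS'])
```

Because each element has a fixed size, `numpy` can find any element by simple arithmetic on its position rather than by following references.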

In [4]:
# this will give an array with a string dtype
numpy.array(['a string', 10])

array(['a string', '10'], 
      dtype='<U8')

In [5]:
# this will give an array of 64 bit floats
floats = numpy.array([63.3, -5.0, 1])
print(floats)

[ 63.3  -5.    1. ]


In [6]:
floats.dtype

dtype('float64')
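
The dtype does not have to be inferred from the data. As a small sketch (the names `as_floats` and `as_ints` are only for illustration), you can request a dtype when creating an array, or convert an existing array with `astype`, which returns a new array:

```python
import numpy

# Ask for floats up front, even though the inputs are integers.
as_floats = numpy.array([1, 2, 3], dtype=numpy.float64)
print(as_floats.dtype)

# Convert floats to integers; astype builds a new array and
# drops the fractional part (truncating toward zero).
as_ints = numpy.array([63.3, -5.0, 1]).astype(numpy.int64)
print(as_ints)
```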

Also, because arrays are not a collection of objects but are a single object of identically sized pieces of data, they cannot be resized. To add elements to an array, one must create a new array.

In [7]:
# this will create a new array with repeated elements
numpy.hstack([somenums, somenums])

array([1, 2, 3, 4, 1, 2, 3, 4])
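
As a quick check that `hstack` really does build a separate array rather than growing the original, consider this sketch (the names `original` and `combined` are illustrative only):

```python
import numpy

original = numpy.array([1, 2, 3, 4])
combined = numpy.hstack([original, original])

# The original array is untouched; hstack built a separate, longer array.
print(original.shape, combined.shape)
# The two arrays do not share any memory.
print(numpy.shares_memory(original, combined))
```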

## Array methods (or, arrays are objects)

`numpy` arrays are built for numerical operations, and for doing them quickly. Since, like everything in Python, they are *objects*, they include built-in methods such as:

In [8]:
somenums.mean()

2.5

In [9]:
somenums.std()

1.1180339887498949

and a whole plethora of others. You can get a view of what methods and attributes are part of an array's namespace with:

In [10]:
dir(somenums)

['T',
 '__abs__',
 '__add__',
 '__and__',
 '__array__',
 '__array_finalize__',
 '__array_interface__',
 '__array_prepare__',
 '__array_priority__',
 '__array_struct__',
 '__array_wrap__',
 '__bool__',
 '__class__',
 '__contains__',
 '__copy__',
 '__deepcopy__',
 '__delattr__',
 '__delitem__',
 '__dir__',
 '__divmod__',
 '__doc__',
 '__eq__',
 '__float__',
 '__floordiv__',
 '__format__',
 '__ge__',
 '__getattribute__',
 '__getitem__',
 '__gt__',
 '__hash__',
 '__iadd__',
 '__iand__',
 '__ifloordiv__',
 '__ilshift__',
 '__imod__',
 '__imul__',
 '__index__',
 '__init__',
 '__int__',
 '__invert__',
 '__ior__',
 '__ipow__',
 '__irshift__',
 '__isub__',
 '__iter__',
 '__itruediv__',
 '__ixor__',
 '__le__',
 '__len__',
 '__lshift__',
 '__lt__',
 '__mod__',
 '__mul__',
 '__ne__',
 '__neg__',
 '__new__',
 '__or__',
 '__pos__',
 '__pow__',
 '__radd__',
 '__rand__',
 '__rdivmod__',
 '__reduce__',
 '__reduce_ex__',
 '__repr__',
 '__rfloordiv__',
 '__rlshift__',
 '__rmod__',
 '__rmul__',
 '__ror__'

or in the notebook, by typing the name of the array followed by a `.` and the tab key:

In [None]:
somenums.

Recall that you can also get the documentation for any function or method with a question mark at the end of the name:

In [11]:
somenums.mean?
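
Besides `mean` and `std`, a few other commonly used array methods can be sketched as follows. The array `values` is a stand-in for `somenums` so that the example runs on its own:

```python
import numpy

values = numpy.array([1, 2, 3, 4])

print(values.sum())                 # total of all elements
print(values.min(), values.max())   # smallest and largest element
print(values.argmax())              # index of the largest element
print(values.cumsum())              # running total, returned as a new array
```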

## Multidimensionality, indexing, and slicing

`numpy` arrays can be of any dimensionality, not just 1-D. It's common to encounter 2-D arrays, and for illustration we'll look at the position of a particle in three dimensions with time:

In [1]:
import numpy as np

In [98]:
# generate x, y, z positions
nframes = 10**6
x = np.cos(np.linspace(0, 20, nframes))
y = 3 * np.sin(np.linspace(0, 10, nframes))
z = -2 * np.sin(np.pi * np.linspace(0, 5, nframes))

# put them all in a single array; this gives
# an array with 3 rows and nframes columns
position = np.array([x, y, z])

# we want the transpose array for now
position = position.transpose()

In [99]:
position.shape

(1000000, 3)
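
The build-then-transpose step can also be written as a single call. As a sketch on a much smaller trajectory (the names `n`, `t`, `xs`, `ys`, and `zs` exist only for this example), `np.column_stack` places each 1-D array in its own column:

```python
import numpy as np

n = 5
t = np.linspace(0, 1, n)
xs, ys, zs = np.cos(t), np.sin(t), t

# The approach used above: 3 rows, then transpose to 3 columns.
stacked_then_transposed = np.array([xs, ys, zs]).transpose()
# One-step alternative: each input becomes a column.
column_stacked = np.column_stack([xs, ys, zs])

print(column_stacked.shape)
print(np.array_equal(stacked_then_transposed, column_stacked))
```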

Now, if we wanted to examine the position of the particle in the very first frame (row), we could do:

In [100]:
position[0]

array([ 1.,  0., -0.])

to extract it. Notice that indexing starts at 0, as is the convention in Python.

What about the third frame?

In [101]:
position[2]

array([  9.99999999e-01,   6.00000600e-05,  -6.28319159e-05])

In zero-based terms, we would call the "first frame" the zeroth frame, the "third frame" the second frame, and so on. To avoid confusion we'll use zero-based numbering from now on.

What if we wanted a bunch of frames, but only the 5th through the 72nd? It should have 68 rows:

In [102]:
position[5:73].shape

(68, 3)

Notice the **slicing** notation. Remember, this should be read as

> "Get each row in the array starting from the row at index 5 up to and not including the row at index 73."

We could even coarse-grain by slicing out every fifth row in this range:

In [103]:
position[5:73:5].shape

(14, 3)
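
Slicing is easier to follow on an array small enough to read in full. In this sketch, `small` is a made-up array of 10 "frames" whose entries are just the numbers 0 through 29:

```python
import numpy as np

# 10 rows with 3 columns each, filled with 0..29 for easy reading.
small = np.arange(30).reshape(10, 3)

print(small[2:5])        # rows at index 2, 3, 4 (5 is excluded)
print(small[2:9:3])      # rows at index 2, 5, 8 (every third row)
print(small[:3].shape)   # omitting the start means "from index 0"
```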

Now what if we wanted a specific *element* of the array? Indexing works for this too:

In [104]:
position[42, 1]

0.0012600012229571492

This is the y-position of the 42nd frame.

The first index/slice corresponds to the first *axis* of the array, which for a 2-D array corresponds to the rows. The second index/slice would then be the columns. If we had a 3-D array, indexing the first axis would yield 2-D arrays. If we had a 4-D array, indexing the first axis would yield 3-D arrays, and so on.
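
The way each index removes one axis can be seen from the shapes alone. The arrays `cube` and `hyper` below are placeholders filled with zeros:

```python
import numpy as np

cube = np.zeros((4, 5, 6))       # a 3-D array
print(cube[0].shape)             # indexing the first axis leaves a 2-D array
print(cube[0, 1].shape)          # indexing two axes leaves a 1-D array

hyper = np.zeros((2, 4, 5, 6))   # a 4-D array
print(hyper[0].shape)            # indexing the first axis leaves a 3-D array
```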

### Breakout: obtain an array giving the mean of the x, y positions from the 10th frame to the 42nd as a 1-D array.

We can do this by slicing both the first axis (rows) and the second axis (columns), then using the `mean` method of the resulting array. To only take a mean across all rows (a mean for each column), we must specify the `axis=0` keyword.

In [105]:
position[10:43, :2].mean(axis=0)

array([  9.99999847e-01,   7.80000768e-04])

What if we wanted the smaller of the two numbers only?

In [106]:
position[10:43, :2].mean(axis=0).min()

0.00078000076767674302

Since slicing and methods of arrays often yield arrays, you can chain operations in this way. This is what qualifies as a *pythonic* way to work with these objects.
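
The effect of the `axis` keyword is easiest to verify on a tiny array whose means can be worked out by hand. The array `table` here is invented for the purpose:

```python
import numpy as np

table = np.array([[1.0, 10.0],
                  [3.0, 20.0],
                  [5.0, 60.0]])

print(table.mean(axis=0))        # one mean per column
print(table.mean(axis=1))        # one mean per row
print(table.mean())              # a single mean over every element
print(table.mean(axis=0).min())  # chained: smallest of the column means
```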

Now say we wanted to calculate the mean x-position of the particle over all time. We could slice out the first column and take its mean with `position[:, 0].mean()`.

### Fancy and boolean indexing

It's also possible to index an array with a list of indices to select. The indices can be repeated and can appear in any order.

In [107]:
position[[2, 4, 7, -1, 2]]

array([[  9.99999999e-01,   6.00000600e-05,  -6.28319159e-05],
       [  9.99999997e-01,   1.20000120e-04,  -1.25663832e-04],
       [  9.99999990e-01,   2.10000210e-04,  -2.19911705e-04],
       [  4.08082062e-01,  -1.63206333e+00,  -1.22464680e-15],
       [  9.99999999e-01,   6.00000600e-05,  -6.28319159e-05]])
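
The same idea on a 1-D array, where the result is easy to read off. The array `letters` is only an illustration:

```python
import numpy as np

letters = np.array(['a', 'b', 'c', 'd', 'e'])

# Indices may repeat, appear in any order, and count from the end with negatives.
picked = letters[[2, 4, -1, 2]]
print(picked)
```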

We can also index with an array of booleans, which returns only the items for which the boolean array is `True`. First, we build such an array:

In [122]:
(position[:, :2] > 2).any(axis=1)

array([False, False, False, ..., False, False, False], dtype=bool)

We can use this array to get only the rows for which either the x or y position is greater than 2:

In [121]:
position[(position[:,:2] > 2).any(axis=1)].shape

(336428, 3)

Boolean arrays are useful for filtering data for rows of interest.
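
Here is the same filtering pattern on a handful of made-up points, small enough to check by eye. The names `points` and `mask` are illustrative:

```python
import numpy as np

points = np.array([[0.5, 2.5],
                   [1.0, 1.0],
                   [3.0, 0.0],
                   [0.0, 0.0]])

mask = (points > 2).any(axis=1)   # True for rows where any column exceeds 2
print(mask)
print(points[mask])               # only the rows where mask is True
print(mask.sum())                 # True counts as 1, so this counts matching rows
```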

## Thinking in arrays 

Say we wanted to displace our particle a full 5 units in each of the x, y, and z directions. If you have experience with a language like C, you might be used to writing nested loops like this one to achieve this:

In [61]:
%%time
for i, row in enumerate(position):
    for j, element in enumerate(row):
        position[i, j] += 5     

CPU times: user 517 ms, sys: 3.33 ms, total: 520 ms
Wall time: 519 ms


In [62]:
position

array([[ 6.        ,  5.        ,  5.        ],
       [ 5.99999998,  5.0001    ,  5.00015708],
       [ 5.99999992,  5.0002    ,  5.00031416],
       ..., 
       [ 5.40844721,  4.45614672,  5.00031416],
       [ 5.40826464,  4.4560628 ,  5.00015708],
       [ 5.40808206,  4.45597889,  5.        ]])

But one of the main points of `numpy` is performance, so we'd do better to spend as little time as possible in the Python interpreter. The loop above does the opposite: every single element passes through the interpreter. Instead we can do:

In [63]:
%%time
position += 5

CPU times: user 0 ns, sys: 0 ns, total: 0 ns
Wall time: 714 µs


On my laptop the difference in speed is nearly three orders of magnitude (roughly a factor of 700 in the timings above). The larger the array, the more pronounced the difference will be. The general rule when using `numpy` is to express what you're trying to do in terms of operations on whole arrays (or slices of them), and to avoid Python loops unless absolutely necessary.
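
Outside the notebook, where the `%%time` magic is not available, a comparable measurement can be sketched with the standard library's `timeit` module. The helper names `add_with_loops` and `add_vectorized` and the array `data` are invented for this example, and the timings will vary from machine to machine:

```python
import timeit
import numpy as np

def add_with_loops(arr, amount):
    """Add `amount` to every element, one element at a time."""
    out = arr.copy()
    for i in range(out.shape[0]):
        for j in range(out.shape[1]):
            out[i, j] += amount
    return out

def add_vectorized(arr, amount):
    """Add `amount` to every element with a single array operation."""
    return arr + amount

data = np.zeros((10000, 3))

# Run each version a few times and report the total elapsed time.
loop_time = timeit.timeit(lambda: add_with_loops(data, 5), number=3)
vector_time = timeit.timeit(lambda: add_vectorized(data, 5), number=3)
print("loops:      %.4f s" % loop_time)
print("vectorized: %.4f s" % vector_time)
```

Both functions return the same values; only the time spent in the interpreter differs.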

### Breakout: rescale (multiply) the y-positions by 2 and displace the x and z positions by 3 and -100, respectively.

There are a lot of ways to do this, but the most succinct way is to take advantage of *broadcasting*. That is, doing:

In [64]:
position = position * np.array([1, 2, 1]) + np.array([3, 0, -100])

`numpy` will take the 3-element, 1-D arrays here and apply them to whole columns in `position`. Note that we already took advantage of broadcasting rules in a way, since multiplying an array by a scalar is the same as multiplying the array by an array of equal shape with all elements equal to the scalar.
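
To see what broadcasting does, the sketch below compares it with a version in which the 1-D arrays are repeated explicitly for every row using `np.tile`. The two-row array `pos` is a small stand-in for `position`:

```python
import numpy as np

pos = np.array([[1.0, 1.0, 1.0],
                [2.0, 2.0, 2.0]])      # shape (2, 3)
scale = np.array([1, 2, 1])            # shape (3,)
shift = np.array([3, 0, -100])         # shape (3,)

# Broadcasting applies the 3-element arrays to every row.
broadcasted = pos * scale + shift

# The same result with the 1-D arrays copied out to shape (2, 3) first.
explicit = pos * np.tile(scale, (2, 1)) + np.tile(shift, (2, 1))

print(broadcasted)
print(np.array_equal(broadcasted, explicit))
```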

## Array arithmetic 

We can do anything with arrays of the same shape that we can do with single numbers.

In [65]:
((2 * position + position) * position)**.5

array([[  24.24871131,   34.64101615,  155.88457268],
       [  24.24871127,   34.64136257,  155.88430061],
       [  24.24871117,   34.64170898,  155.88402854],
       ..., 
       [  23.22411182,   32.75705311,  155.88402854],
       [  23.22379561,   32.75676242,  155.88430061],
       [  23.22347936,   32.75647174,  155.88457268]])

is the same as $\sqrt{(2r + r) \cdot r} = \sqrt{3r^2} = |r| \sqrt{3}$, which matches $r \sqrt{3}$ except for the sign of negative elements (here, the z column):

In [66]:
position * 3**.5

array([[  24.24871131,   34.64101615, -155.88457268],
       [  24.24871127,   34.64136257, -155.88430061],
       [  24.24871117,   34.64170898, -155.88402854],
       ..., 
       [  23.22411182,   32.75705311, -155.88402854],
       [  23.22379561,   32.75676242, -155.88430061],
       [  23.22347936,   32.75647174, -155.88457268]])
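
The two outputs above agree except for the sign of the z column. A short check on illustrative values (the array `r` is made up) suggests why: the square root of a square gives the absolute value.

```python
import numpy as np

r = np.array([[14.0, 20.0, -90.0],
              [1.0, -2.0, 3.0]])

lhs = ((2 * r + r) * r) ** 0.5

# Matches |r| * sqrt(3) everywhere ...
matches_abs = np.allclose(lhs, np.abs(r) * 3 ** 0.5)
# ... but not r * sqrt(3), because of the negative entries.
matches_signed = np.allclose(lhs, r * 3 ** 0.5)

print(matches_abs, matches_signed)
```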

Note that multiplication between two arrays is **not** the same as matrix multiplication. Arithmetic operations are element-wise. But there is a method for doing matrix multiplication:

In [68]:
position.dot(np.array([3, 5, -1]))

array([ 232.        ,  232.00084287,  232.00168562, ...,  224.78649463,
        224.78526485,  224.78403508])
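
The difference between the two kinds of product is clearest on small square matrices. The arrays `a` and `b` below are arbitrary examples:

```python
import numpy as np

a = np.array([[1, 2],
              [3, 4]])
b = np.array([[0, 1],
              [1, 0]])

print(a * b)       # element-wise product
print(a.dot(b))    # matrix product
print(a @ b)       # the @ operator (Python 3.5+) is also a matrix product
```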

And more linear algebra functions can be found in the `np.linalg` namespace:

In [70]:
dir(np.linalg)

['LinAlgError',
 'Tester',
 '__builtins__',
 '__cached__',
 '__doc__',
 '__file__',
 '__loader__',
 '__name__',
 '__package__',
 '__path__',
 '__spec__',
 '_umath_linalg',
 'absolute_import',
 'bench',
 'cholesky',
 'cond',
 'det',
 'division',
 'eig',
 'eigh',
 'eigvals',
 'eigvalsh',
 'info',
 'inv',
 'lapack_lite',
 'linalg',
 'lstsq',
 'matrix_power',
 'matrix_rank',
 'multi_dot',
 'norm',
 'pinv',
 'print_function',
 'qr',
 'slogdet',
 'solve',
 'svd',
 'tensorinv',
 'tensorsolve',
 'test']
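
As a brief sketch of two functions from that listing, `norm` and `solve`, on small made-up inputs (the names `v`, `A`, `b`, and `x` are only for this example):

```python
import numpy as np

v = np.array([3.0, 4.0])
length = np.linalg.norm(v)    # Euclidean length of the vector
print(length)

# Solve the linear system A x = b for x.
A = np.array([[2.0, 0.0],
              [0.0, 4.0]])
b = np.array([2.0, 8.0])
x = np.linalg.solve(A, b)
print(x)

# Check the solution by multiplying back.
print(np.allclose(A.dot(x), b))
```
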

# Plotting with matplotlib