# Intro to NumPy

When it comes to number crunching, pure Python is considered quite slow. It's an interpreted, dynamically typed language, which means that there's quite a bit of overhead associated with looping through a list of numbers and calculating, say, a simple sum. Despite this, Python has become a popular language for numerical work, largely thanks to `numpy`.

## NumPy arrays are performant for numerical operations

Let's import `numpy`; this creates a `module` object in our Python session, including in it all that `numpy` has to offer:

In [1]:
import numpy as np

The core functionality of `numpy` comes in the form of the array data structure it includes. We can create an array of integers, for example, by using `np.array` and feeding it a list of integers:

In [2]:
arr = np.array([1, 3, 7, 9])

In [3]:
arr

array([1, 3, 7, 9])

A distinguishing characteristic of a `numpy` array is that all of its elements are of the same data type. In this case, we have an array of 64-bit integers (the default integer type depends on the platform and `numpy` version):

In [4]:
arr.dtype

dtype('int64')
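As a small illustration of the single-data-type rule, the sketch below shows what happens when the input mixes integers and floats, and how a type can be requested explicitly. The variable names are made up for this example.

```python
import numpy as np

# Mixed ints and floats are stored as one common type that can hold them all
mixed = np.array([1, 3.5, 7])
print(mixed.dtype)

# A specific type can be requested when the array is created
small_ints = np.array([1, 3, 7, 9], dtype=np.int32)
print(small_ints.dtype)

# astype returns a new array converted to the requested type
as_float = small_ints.astype(np.float64)
print(as_float)
```

Choosing a smaller type such as `int32` or `float32` halves the memory used per element, at the cost of range or precision.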

The elements of an array are stored in a contiguous block of the computer's memory, as raw values of a single type rather than as separate Python objects. This allows for fast computation: CPUs load memory into cache in contiguous chunks, so the neighbors of an element are typically already in cache when they are needed, and `numpy` can loop over the values in compiled code instead of in the Python interpreter. The result is faster computation than we would expect from, say, a Python `list`.
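The memory layout can be inspected directly from a few array attributes. This is a minimal sketch; the array name `arr64` is made up here, and the type is set explicitly so the numbers do not depend on the platform default.

```python
import numpy as np

arr64 = np.array([1, 3, 7, 9], dtype=np.int64)

# Every element occupies the same number of bytes
print(arr64.itemsize)

# Total size of the data buffer: 4 elements * 8 bytes each
print(arr64.nbytes)

# Step, in bytes, from one element to the next
print(arr64.strides)

# Whether the data sits in one C-ordered contiguous block
print(arr64.flags['C_CONTIGUOUS'])
```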

As a point of comparison, we'll sum a million integers in a `list`, and do the same with an `array`, and see how much time this takes:

In [5]:
biglist = list(range(1000000))

In [6]:
%%timeit 

sum(biglist)

8.01 ms ± 19.3 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
bigarray = np.arange(1000000)

In [8]:
bigarray

array([     0,      1,      2, ..., 999997, 999998, 999999])

In [9]:
%%timeit

np.sum(bigarray)

702 µs ± 66.5 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For this simple test, the sum on the `numpy` array was roughly 11 times faster than that on the `list` (8.01 ms versus 702 µs). Speed differences such as this add up very quickly when doing many operations on numerical data, so the benefits of `numpy` cannot be overstated here.
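The `%%timeit` magic only works inside IPython or Jupyter. Outside a notebook, a comparable measurement can be sketched with the standard-library `timeit` module. The repeat count of 10 is an arbitrary choice for this example, and the exact timings will vary from machine to machine.

```python
import timeit

import numpy as np

n = 1_000_000
biglist = list(range(n))
# Explicit 64-bit integers so the sum cannot overflow on any platform
bigarray = np.arange(n, dtype=np.int64)

# Total time for 10 repetitions of each sum
list_time = timeit.timeit(lambda: sum(biglist), number=10)
array_time = timeit.timeit(lambda: bigarray.sum(), number=10)
print(f"list: {list_time:.4f} s, array: {array_time:.4f} s")

# Both approaches should agree on the answer
list_total = sum(biglist)
array_total = int(bigarray.sum())
print(list_total == array_total)
```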

## Working with arrays

`numpy` arrays have a number of methods attached to them. For example, these aggregating methods each yield a single value:

In [10]:
arr.sum()

20

In [11]:
arr.prod()

189

In [12]:
arr.min()

1

In [13]:
arr.max()

9

There are also attributes like `shape`, which gives the shape of our array as a tuple. Since it is an attribute rather than a method, we access it without parentheses:

In [14]:
arr.shape

(4,)

In this case, we are dealing with a one-dimensional `numpy` array of length 4. This is similar to a mathematical *vector*.

We can reshape an array with the `reshape` method:

In [15]:
arr.reshape(2, 2)

array([[1, 3],
       [7, 9]])

Importantly, this *does not* change the shape of the existing array, but instead gives back a new array object with the new shape. Where possible, the new array is a *view* onto the same underlying data rather than a copy, so no data needs to be moved; `arr` itself keeps its original shape:

In [16]:
arr_n = arr.reshape(2, 2)

In [17]:
arr_n

array([[1, 3],
       [7, 9]])

In [18]:
arr

array([1, 3, 7, 9])
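One detail worth knowing: for a contiguous array like this one, `reshape` typically returns a view that shares its data with the original, so writing through one array is visible in the other. The sketch below uses made-up names (`original`, `reshaped`, `independent`) to show this, along with `.copy()` for when an independent array is wanted.

```python
import numpy as np

original = np.array([1, 3, 7, 9])
reshaped = original.reshape(2, 2)

# The two arrays have different shapes...
print(original.shape, reshaped.shape)

# ...but share the same underlying data buffer
print(np.shares_memory(original, reshaped))

# Writing through the reshaped array is visible in the original
reshaped[0, 0] = 100
print(original)

# .copy() gives an independent array
independent = original.reshape(2, 2).copy()
independent[0, 0] = -1
print(original[0])
```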

`numpy` also features standalone functions such as `arange`, which functions similarly to the built-in `range` but gives `numpy` arrays:

In [19]:
arr = np.arange(10, 46)

In [20]:
arr

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45])

We can reshape this into a new 2-dimensional array with 3 rows with `reshape`. We only need to specify the length of a single dimension for this to work; the other dimension can be figured out, and we tell the function to figure it out with a `-1`:

In [21]:
arr = arr.reshape((3, -1))

In [22]:
arr

array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
       [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]])

If needed, we can obtain a 1-D array from an array of any dimension with the `flatten` method:

In [23]:
arr.flatten()

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45])
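`flatten` always returns a copy of the data. The related `ravel` method returns a view when it can, which avoids copying large arrays. A small sketch of the difference, using a made-up array `grid`:

```python
import numpy as np

grid = np.arange(6).reshape(2, 3)

flat_copy = grid.flatten()  # always a copy
flat_view = grid.ravel()    # a view here, since grid is contiguous

# Changing the copy leaves grid untouched
flat_copy[0] = 99
print(grid[0, 0])

# Changing the view changes grid as well
flat_view[0] = 42
print(grid[0, 0])
```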

### Challenge: create an array with `np.arange` with values from 21 to 100, inclusive. Reshape it into an array with dimensions $(2 \times 10 \times 4)$.

In [24]:
np.arange(21, 101).reshape(2, 10, 4).shape

(2, 10, 4)

## Using molecular dynamics data

For our next set of examples, we'll use some data from a molecular dynamics simulation of adenylate kinase. This trajectory data is included as part of the MDAnalysis test suite, `MDAnalysisTests`:

In [25]:
from MDAnalysisTests.datafiles import GRO, XTC

In [26]:
print((GRO, XTC))

('/home/david/miniconda3/lib/python3.6/site-packages/MDAnalysisTests/data/adk_oplsaa.gro', '/home/david/miniconda3/lib/python3.6/site-packages/MDAnalysisTests/data/adk_oplsaa.xtc')


The next lesson will go into detail on `MDAnalysis` usage, so here we'll load some data and extract things like coordinates and residue names, as these are represented as `numpy` arrays:

In [27]:
import MDAnalysis as mda

In [28]:
u = mda.Universe(GRO, XTC)
u

<Universe with 47681 atoms>

In [29]:
positions = u.atoms.positions
positions

array([[ 52.02     ,  43.560005 ,  31.550003 ],
       [ 51.190002 ,  44.11     ,  31.720001 ],
       [ 51.550003 ,  42.83     ,  31.04     ],
       ...,
       [105.340004 ,  74.07001  ,  40.989998 ],
       [ 57.68     ,  35.32     ,  14.8      ],
       [ 62.960007 ,  47.240005 ,   3.7500002]], dtype=float32)

We'll also grab the center of mass of our atoms, along with the residue name and ID of each atom:

In [30]:
com = u.atoms.center_of_mass()
com

array([59.9608713 , 40.33083385, 28.2843879 ])

In [31]:
resnames = u.atoms.resnames
resnames

array(['MET', 'MET', 'MET', ..., 'NA+', 'NA+', 'NA+'], dtype=object)

In [32]:
atomids = u.atoms.ids
atomids

array([    1,     2,     3, ..., 47679, 47680, 47681])

## Array arithmetic

Arithmetic operations with arrays occur *element-wise*. Multiplying by 3 and subtracting 2 gives a new array with that operation performed on each element individually:

In [33]:
(positions * 3) - 2

array([[154.06    , 128.68002 ,  92.65001 ],
       [151.57    , 130.33    ,  93.16    ],
       [152.65001 , 126.490005,  91.12    ],
       ...,
       [314.02002 , 220.21002 , 120.96999 ],
       [171.04001 , 103.96    ,  42.4     ],
       [186.88002 , 139.72002 ,   9.250001]], dtype=float32)

This allows us to write code treating an array as if it were a single number! For example, we could write an angstrom-to-nanometer converter:

In [34]:
def angstrom_to_nm(length):
    return length / 10

And this will work on an array as simply as it will on a single value:

In [35]:
angstrom_to_nm(20)

2.0

In [36]:
angstrom_to_nm(positions)

array([[ 5.202     ,  4.3560004 ,  3.1550002 ],
       [ 5.1190004 ,  4.4110003 ,  3.1720002 ],
       [ 5.155     ,  4.283     ,  3.104     ],
       ...,
       [10.534     ,  7.4070005 ,  4.099     ],
       [ 5.768     ,  3.532     ,  1.48      ],
       [ 6.2960005 ,  4.7240005 ,  0.37500003]], dtype=float32)

Arithmetic between arrays is also element-wise. If we add two arrays of the same shape together, elements in the corresponding row-column are added to each other in the resulting array:

In [37]:
positions

array([[ 52.02     ,  43.560005 ,  31.550003 ],
       [ 51.190002 ,  44.11     ,  31.720001 ],
       [ 51.550003 ,  42.83     ,  31.04     ],
       ...,
       [105.340004 ,  74.07001  ,  40.989998 ],
       [ 57.68     ,  35.32     ,  14.8      ],
       [ 62.960007 ,  47.240005 ,   3.7500002]], dtype=float32)

In [38]:
positions + positions

array([[104.04     ,  87.12001  ,  63.100006 ],
       [102.380005 ,  88.22     ,  63.440002 ],
       [103.100006 ,  85.66     ,  62.08     ],
       ...,
       [210.68001  , 148.14001  ,  81.979996 ],
       [115.36     ,  70.64     ,  29.6      ],
       [125.92001  ,  94.48001  ,   7.5000005]], dtype=float32)

This makes calculating quantities with arrays incredibly concise. It's not necessary to loop through the elements of an array in Python to calculate something with each element. Instead, we can treat an array as if it were a single quantity, and calculations are *fast*.
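A tiny example makes the element-wise behavior easy to check by hand. The arrays `a` and `b` are made up for this sketch. Note that `*` multiplies matching elements and is not a dot product; the `@` operator is shown for comparison.

```python
import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([10.0, 20.0, 30.0])

print(a + b)   # [11. 22. 33.]
print(a * b)   # [10. 40. 90.]
print(a @ b)   # 140.0 (dot product, for comparison)

# The equivalent explicit Python loop gives the same element-wise product
looped = np.array([x * y for x, y in zip(a, b)])
print(np.array_equal(a * b, looped))
```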

## Broadcasting

Say we want to take our atom positions and center them on the center of mass of the whole system. Our array of positions has shape $n \times 3$, with each row corresponding to an atom in the system, and each column to the $x$, $y$, and $z$ axes, respectively. We could do this in a loop:

In [39]:
centered = np.empty(positions.shape)

for i, row in enumerate(positions):
    centered[i] = row - com

centered

array([[ -7.94087084,   3.22917134,   3.26561515],
       [ -8.77086886,   3.77916676,   3.43561332],
       [ -8.41086825,   2.49916798,   2.75561301],
       ...,
       [ 45.37913267,  33.73917347,  12.70560996],
       [ -2.28087099,  -5.01083416, -13.48438771],
       [  2.99913542,   6.90917164, -24.53438767]])

However, looping through arrays should be avoided, as this negates much of the performance benefit of using `numpy` arrays in the first place. Instead, we can take advantage of a behavior known as *broadcasting*:

In [40]:
centered = positions - com

In [41]:
centered

array([[ -7.94087084,   3.22917134,   3.26561515],
       [ -8.77086886,   3.77916676,   3.43561332],
       [ -8.41086825,   2.49916798,   2.75561301],
       ...,
       [ 45.37913267,  33.73917347,  12.70560996],
       [ -2.28087099,  -5.01083416, -13.48438771],
       [  2.99913542,   6.90917164, -24.53438767]])

Here we performed an operation with the `com` array, which is 1-dimensional with size 3, and the `positions` array, which is 2-dimensional with sizes $n$ and $3$. The smaller array, `com`, is 'stretched' into what is effectively an $n \times 3$ array, where its values are repeated in each row. The operation is then carried out. `numpy` internally does this efficiently in C code. 

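The result is both syntactically simpler and computationally more efficient than our earlier attempt. Generally, `numpy` compares the shapes of the two arrays dimension by dimension, starting from the last. Two dimensions are compatible if they are equal or if one of them is 1; a dimension of size 1 is stretched to match the other, and an array with fewer dimensions is treated as if it had extra leading dimensions of size 1.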

See the [`numpy` documentation](https://docs.scipy.org/doc/numpy/user/basics.broadcasting.html) for a more detailed explanation.
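The shape rules are easiest to see on a small array. The sketch below uses made-up data (`points`, `offset`, `scale`) to show one case that broadcasts, one that fails, and how adding a length-1 axis with `np.newaxis` fixes the failing case.

```python
import numpy as np

points = np.array([[1.0, 2.0, 3.0],
                   [4.0, 5.0, 6.0],
                   [7.0, 8.0, 9.0],
                   [10.0, 11.0, 12.0]])   # shape (4, 3)
offset = np.array([1.0, 2.0, 3.0])        # shape (3,)

# (4, 3) with (3,): the offset is applied to every row
shifted = points - offset
print(shifted.shape)

# One scale factor per row has shape (4,), which does not match (4, 3)
scale = np.array([1.0, 2.0, 3.0, 4.0])
try:
    points * scale
    broadcast_failed = False
except ValueError as err:
    broadcast_failed = True
    print("shapes do not broadcast:", err)

# A trailing length-1 axis gives shape (4, 1), which broadcasts across columns
scaled = points * scale[:, np.newaxis]
print(scaled.shape)
```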

## Indexing and slicing arrays

There are times when we need to select a subset of the elements in an array to operate on. Consider our `positions` array:

In [42]:
positions

array([[ 52.02     ,  43.560005 ,  31.550003 ],
       [ 51.190002 ,  44.11     ,  31.720001 ],
       [ 51.550003 ,  42.83     ,  31.04     ],
       ...,
       [105.340004 ,  74.07001  ,  40.989998 ],
       [ 57.68     ,  35.32     ,  14.8      ],
       [ 62.960007 ,  47.240005 ,   3.7500002]], dtype=float32)

As with other data structures in Python, `numpy` arrays use 0-based indexing, and we can select elements with square (`[]`) brackets. Selecting element 0 from this array gives the row at index 0 as an array, which corresponds to the coordinates of the first atom in our system:

In [43]:
positions[0]

array([52.02    , 43.560005, 31.550003], dtype=float32)

If we want, say, the coordinates of the 23rd atom in our system, we could do:

In [44]:
positions[22]

array([55.36, 47.16, 32.68], dtype=float32)

To select a single value from this two-dimensional array, we need to specify the column. We can obtain the $y$-position of the 23rd atom with:

In [45]:
positions[22, 1]

47.16

We can select multiple elements with *slices*; for example, all rows starting with row 1 to the end of the array:

In [46]:
positions[1:]

array([[ 51.190002 ,  44.11     ,  31.720001 ],
       [ 51.550003 ,  42.83     ,  31.04     ],
       [ 52.47     ,  43.170006 ,  32.370003 ],
       ...,
       [105.340004 ,  74.07001  ,  40.989998 ],
       [ 57.68     ,  35.32     ,  14.8      ],
       [ 62.960007 ,  47.240005 ,   3.7500002]], dtype=float32)

Or perhaps instead, row 1 *up to but not including* row 10 (that is, the nine rows with indices 1 through 9):

In [47]:
positions[1:10]

array([[51.190002, 44.11    , 31.720001],
       [51.550003, 42.83    , 31.04    ],
       [52.47    , 43.170006, 32.370003],
       [53.07    , 44.21    , 30.75    ],
       [53.83    , 43.47    , 30.54    ],
       [52.570004, 44.739998, 29.410002],
       [51.89    , 44.04    , 28.93    ],
       [52.02    , 45.64    , 29.66    ],
       [53.710003, 45.11    , 28.45    ]], dtype=float32)

We can slice the columns as well. The slicing below will select row 1 onward, but only elements in columns 0 up to and not including column 2 (leaving out the $z$-coordinate values):

In [48]:
positions[1:, :2]

array([[ 51.190002,  44.11    ],
       [ 51.550003,  42.83    ],
       [ 52.47    ,  43.170006],
       ...,
       [105.340004,  74.07001 ],
       [ 57.68    ,  35.32    ],
       [ 62.960007,  47.240005]], dtype=float32)

As with e.g. `list`s, we can also use negative numbers to slice from the end of an array. The following yields coordinates for all atoms except for the last 10 in our system:

In [49]:
positions[:-10].shape

(47671, 3)

Finally, we can give a third value to indicate the slice's *step* (the default is 1, which takes every element):

In [50]:
positions[::2]

array([[ 52.02     ,  43.560005 ,  31.550003 ],
       [ 51.550003 ,  42.83     ,  31.04     ],
       [ 53.07     ,  44.21     ,  30.75     ],
       ...,
       [ 40.29     ,  29.160002 ,   4.6800003],
       [105.340004 ,  74.07001  ,  40.989998 ],
       [ 62.960007 ,  47.240005 ,   3.7500002]], dtype=float32)

This gives the coordinates for every other atom in the system.
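The start, stop, and step values can be combined freely, and a negative step walks the array backwards. The sketch below uses a made-up one-dimensional array, `values`. It also shows that basic slices are views of the original data, so assigning into a slice modifies the array it came from.

```python
import numpy as np

values = np.arange(10)

print(values[::2])     # [0 2 4 6 8]
print(values[1:8:3])   # [1 4 7]
print(values[::-1])    # [9 8 7 6 5 4 3 2 1 0]

# A basic slice is a view: writing to it changes the original array
window = values[2:5]
window[:] = 0
print(values)          # [0 1 0 0 0 5 6 7 8 9]
```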

### Challenge: create an array of shape (4, 5) starting with `np.arange(20)`, then obtain the mean value of elements in rows 1 and 2, inclusive, and columns 1 through 3, inclusive. These row and column numbers are zero-based.

We could do:

In [51]:
# create array with proper shape
my_arr = np.arange(20).reshape(4, 5)

# grab subselection of rows, columns
subsel = my_arr[1:3, 1:4]

# calculate mean
subsel.mean()

9.5

But we could also accomplish this in one line, since each method call/slice returns an array, and finally `mean` gives the single number as a result:

In [52]:
np.arange(20).reshape((4, -1))[1:3, 1:4].mean()

9.5

`numpy` also features a function called `mean` which does the same thing as the array method:

In [53]:
np.mean(np.arange(20).reshape((4, -1))[1:3, 1:4])

9.5
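Aggregations such as `mean` also accept an `axis` argument, which collapses only one dimension instead of reducing the whole array to a single number. A short sketch on the same 4 by 5 array (the names `col_means` and `row_means` are made up here):

```python
import numpy as np

my_arr = np.arange(20).reshape(4, 5)

# axis=0 collapses the rows, leaving one mean per column
col_means = my_arr.mean(axis=0)
print(col_means)   # [ 7.5  8.5  9.5 10.5 11.5]

# axis=1 collapses the columns, leaving one mean per row
row_means = my_arr.mean(axis=1)
print(row_means)   # [ 2.  7. 12. 17.]
```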

## Making arrays from scratch

There are several ways we can create arrays with different contents, for different purposes. We've already seen how to make an array from a list or using `np.arange`, but here we point out some other useful ways of creating an array.

If we want a 2-D array of a particular set of numbers, we can give the `np.array` function a list of lists:

In [54]:
np.array([[1,2,3], [4, 5, 6]])

array([[1, 2, 3],
       [4, 5, 6]])

`np.zeros` will create an array of all zeros with the specified shape:

In [55]:
np.zeros(100)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Likewise, `np.ones` will give us all ones:

In [56]:
np.ones(100)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

`np.random` is a submodule that includes functions for sampling values from various probability distributions, for example, a [standard normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution):

In [57]:
np.random.randn(100)

array([ 0.58322214, -0.58065383, -1.04591905, -0.99115132, -1.3807353 ,
       -0.37314728, -0.24779561, -0.96395618, -1.79287373, -0.16919348,
        1.40188946,  0.28293001,  0.25729575, -0.41696056,  0.23783154,
       -0.05707562, -0.04991868,  1.98615564, -0.68417321, -0.67961047,
       -0.10126913,  0.47632569,  0.38977636, -1.29554893,  1.38437527,
       -0.42283048,  0.55107915, -0.05847931,  0.32497648,  0.10256992,
       -0.69365455, -0.30147118,  0.73845444,  0.25704159,  1.02749006,
        1.19232419, -0.48222507,  0.78814438,  0.32312217,  0.41408452,
        0.39966598, -0.38679747,  1.52651631,  0.62777407, -0.76612068,
       -0.15906198, -2.21024295, -1.39408724, -2.22149924,  0.18625566,
       -0.86076357, -0.81879685,  0.15270418,  1.27359006,  0.77431448,
        0.67550855,  1.49544588,  0.42051536,  0.37613034,  0.14517956,
        0.44862189,  1.88764182, -1.05925085, -0.46039835, -1.01318963,
        0.89765175,  1.52064298,  0.2522312 ,  1.09801112, -0.83

When using `MDAnalysis` you may not use these very often, but it's valuable to remember where to reach if/when you need to initialize an array of your own.
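A few variations on these constructors tend to come up in practice: passing a tuple for a multi-dimensional shape, choosing a `dtype`, and seeding a random generator so results are reproducible. The sketch below is one way to do this. `np.full`, `np.linspace`, and `np.random.default_rng` are not used elsewhere in this lesson, and the seed value 42 is arbitrary.

```python
import numpy as np

# Shape given as a tuple: 3 rows, 4 columns of zeros
zeros_2d = np.zeros((3, 4))

# Ones stored as integers rather than the default floats
int_ones = np.ones(5, dtype=np.int64)

# An array filled with an arbitrary constant
sevens = np.full((2, 2), 7.0)

# Five evenly spaced values from 0 to 1, endpoints included
grid = np.linspace(0.0, 1.0, 5)
print(grid)

# A seeded generator gives the same "random" samples on every run
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal(100)
print(samples.shape)
```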

## Using boolean indexing to select values from arrays by criteria

One of the most common operations when using `MDAnalysis` is to select items from our system according to some criteria. We can do quite a bit of this directly using `numpy` arrays, in particular with *boolean indexing*. We have our `resnames` array from earlier, giving the residue name for each atom in the system:

In [58]:
resnames

array(['MET', 'MET', 'MET', ..., 'NA+', 'NA+', 'NA+'], dtype=object)

In [59]:
resnames.shape

(47681,)

If we wanted the positions of all atoms in the system that are members of glutamate residues, we could first get a *boolean index* of all atoms:

In [60]:
in_glutamate = (resnames == 'GLU')
in_glutamate

array([False, False, False, ..., False, False, False])

This is a `numpy` array of boolean values (`True` or `False`), where `True` indicates an atom that is part of a `'GLU'` residue. We can take this boolean array and use it as an index to, say, our `positions` array:

In [61]:
pos_glu = positions[in_glutamate]
pos_glu.shape

(270, 3)

This gives us the positions of only atoms found in `'GLU'` residues. What if we wanted atoms that are either in `'GLU'` or `'ALA'` residues?

In [62]:
in_glu_or_ala = ((resnames == 'GLU') | (resnames == 'ALA'))

In [63]:
pos_glu_or_ala = positions[in_glu_or_ala]
pos_glu_or_ala.shape

(460, 3)

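Here we made use of the bitwise-or operator (`|`) to build a `numpy` array that has `True` wherever either of the two comparisons is true. We can "and" two arrays together with a bitwise-and (`&`). The parentheses around each comparison are required, because `|` and `&` bind more tightly than `==`.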
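The same ideas can be tried on a toy system small enough to check by eye. The arrays `names` and `coords` below are made up for this sketch. It also shows the `~` operator for negation and `np.isin`, which can replace a long chain of `|` comparisons.

```python
import numpy as np

names = np.array(['MET', 'GLU', 'ALA', 'GLU', 'LYS', 'ALA'])
coords = np.arange(18, dtype=np.float32).reshape(6, 3)

is_glu = names == 'GLU'

# ~ flips every boolean: True for everything that is not GLU
not_glu = ~is_glu

# & requires both conditions: GLU atoms among the first three entries
in_first_three = np.arange(len(names)) < 3
glu_in_first_three = is_glu & in_first_three

# np.isin tests membership in a list of allowed values
wanted = np.isin(names, ['GLU', 'ALA'])

# Summing a boolean array counts its True values
print(int(is_glu.sum()))
print(coords[wanted].shape)
```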

### Challenge: get an array of all coordinates within 20 Angstroms of the center of mass of our adenylate kinase system

We first calculate an array of distances from the center of mass, using the `positions` and `com` arrays. Summing the squared coordinate differences along each row gives the *squared* distance, so we take the square root to get distances in Angstroms:

In [68]:
distances = np.sqrt(((positions - com)**2).sum(axis=1))

Comparing this array against the cutoff gives a boolean index that we can apply to the positions array:

In [69]:
positions[distances < 20]
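One thing to watch in a distance cutoff is whether the quantity being compared is a distance or a squared distance. The sketch below uses three made-up points (`pts`) whose distances from the origin are easy to work out by hand. It shows two equivalent selections: `np.linalg.norm` compared against the cutoff, and the sum of squares compared against the cutoff squared.

```python
import numpy as np

pts = np.array([[0.0, 0.0, 0.0],
                [3.0, 4.0, 0.0],
                [30.0, 0.0, 0.0]])
center = np.array([0.0, 0.0, 0.0])
cutoff = 20.0

# Euclidean distance of each row from the center: 0, 5 and 30
dist = np.linalg.norm(pts - center, axis=1)
within = pts[dist < cutoff]

# Equivalent selection without a square root: compare squared quantities
squared = ((pts - center) ** 2).sum(axis=1)
within_squared = pts[squared < cutoff ** 2]

print(dist)
print(within.shape)
```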

## Advanced indexing

The last form of indexing common when using `MDAnalysis` is so-called [advanced indexing](https://docs.scipy.org/doc/numpy/reference/arrays.indexing.html#advanced-indexing), also sometimes called "fancy" indexing. Instead of a single index value, we can give a `list` of index values when indexing an array:

In [75]:
positions[[1, 3, 5]]

array([[51.190002, 44.11    , 31.720001],
       [52.47    , 43.170006, 32.370003],
       [53.83    , 43.47    , 30.54    ]], dtype=float32)

What we get back is an array of those elements corresponding to the given indices, in order. Repeats are allowed:

In [77]:
positions[[3, 3, 3, 1]]

array([[52.47    , 43.170006, 32.370003],
       [52.47    , 43.170006, 32.370003],
       [52.47    , 43.170006, 32.370003],
       [51.190002, 44.11    , 31.720001]], dtype=float32)

This is useful when we know precisely which indices we want to select out of an existing array. There is no need to select each element from a set of indices in a loop. Even if you don't find you need it much, advanced indexing such as this is used heavily inside `MDAnalysis` to make topology selection operations *fast*.
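Boolean and integer indexing are closely related: a boolean mask can be converted into the equivalent array of indices. The sketch below, on a made-up array `data`, uses `np.flatnonzero` for that conversion. It also shows that advanced indexing returns a copy, unlike a basic slice.

```python
import numpy as np

data = np.arange(12).reshape(4, 3)
mask = np.array([True, False, True, False])

# Positions of the True entries in the mask
idx = np.flatnonzero(mask)
print(idx)   # [0 2]

# Indexing with the mask or with the indices selects the same rows
by_mask = data[mask]
by_index = data[idx]
same_rows = np.array_equal(by_mask, by_index)
print(same_rows)

# The result is a copy: changing it leaves data untouched
by_index[0, 0] = -1
print(data[0, 0])   # 0

# Index arrays can also reorder rows
order = np.array([3, 1, 0])
reordered = data[order]
print(reordered)
```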

## NumPy arrays are the core of the scientific Python ecosystem

MDAnalysis is just one of many tools that leverage `numpy` to great effect. Developing a solid grasp of `numpy` arrays will carry over to tools as wide-ranging as `pandas` for statistical data analysis and *scikit-learn* for training machine learning models. `numpy` sits at the core of all of these tools.