# Numerical Python with numpy

NumPy ('Numerical Python') is the de facto standard module for doing numerical work in Python. Its main feature is its array data type, which allows very compact and efficient storage of homogeneous (of the same type) data.

A lot of the material in this section is based on [SciPy Lecture Notes](http://www.scipy-lectures.org/intro/numpy/array_object.html) ([CC-by 4.0](http://www.scipy-lectures.org/preface.html#license)).

As you go through this material, you'll likely find it useful to refer to the [NumPy documentation](https://docs.scipy.org/doc/numpy/), particularly the [array objects](https://docs.scipy.org/doc/numpy/reference/arrays.html) section.

As with `pandas` there is a standard convention for importing `numpy`, and that is as `np`:

In [1]:
import numpy as np

Now that we have access to the `numpy` package we can start using its features.

## Creating arrays

In many ways a NumPy array can be treated like a standard Python `list` and much of the way you interact with it is identical. Given a list, you can create an array as follows:

In [2]:
python_list = [1, 2, 3, 4, 5, 6, 7, 8]
numpy_array = np.array(python_list)
print(numpy_array)

[1 2 3 4 5 6 7 8]


In [3]:
# ndim gives the number of dimensions
numpy_array.ndim

1

In [4]:
# the shape of an array is a tuple of its length in each dimension. In this case it is only 1-dimensional
numpy_array.shape

(8,)

In [5]:
# as in standard Python, len() gives a sensible answer
len(numpy_array)

8

In [6]:
nested_list = [[1, 2, 3], [4, 5, 6]]
two_dim_array = np.array(nested_list)
print(two_dim_array)

[[1 2 3]
 [4 5 6]]


In [7]:
two_dim_array.ndim

2

In [8]:
two_dim_array.shape

(2, 3)

It's very common when working with data to not have it already in a Python list but rather to want to create some data from scratch. `numpy` comes with a whole suite of functions for creating arrays. We will now run through some of the most commonly used.

The first is `np.arange` (meaning "array range") which works in a very similar fashion to the standard Python `range()` function, including how it defaults to starting from zero, doesn't include the number at the top of the range and how it allows you to specify a 'step':

In [9]:
np.arange(10)  # 0 .. n-1  (!)

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [10]:
np.arange(1, 9, 2) # start, end (exclusive), step

array([1, 3, 5, 7])

Next up is `np.linspace` (meaning "linear space") which generates a given number of evenly spaced floating point numbers, starting from the first argument and going up to the second argument. The third argument defines how many numbers to create:

In [11]:
np.linspace(0, 1, 6)   # start, end, num-points

array([ 0. ,  0.2,  0.4,  0.6,  0.8,  1. ])

Note how it included the end point unlike `arange()`. You can change this feature by using the `endpoint` argument:

In [12]:
np.linspace(0, 1, 5, endpoint=False)

array([ 0. ,  0.2,  0.4,  0.6,  0.8])
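
If the spacing between the points matters, `np.linspace` can also report it. The sketch below uses the `retstep=True` argument (not used elsewhere in this section) to return the step alongside the array, with and without the end point:

```python
import numpy as np

# retstep=True makes linspace return a (points, step) tuple
points, step = np.linspace(0, 1, 6, retstep=True)
print(points, step)

# Without the end point, 5 points starting at 0 have the same spacing
open_points, open_step = np.linspace(0, 1, 5, endpoint=False, retstep=True)
print(open_points, open_step)
```

In both calls the spacing is 0.2; only the number of points and the inclusion of `1.0` differ.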

`np.ones` creates an n-dimensional array filled with the value `1.0`. The argument you give to the function defines the shape of the array:

In [13]:
np.ones((3, 3))  # reminder: (3, 3) is a tuple

array([[ 1.,  1.,  1.],
       [ 1.,  1.,  1.],
       [ 1.,  1.,  1.]])

Likewise, you can create an array of any size filled with zeros:

In [14]:
np.zeros((2, 2))

array([[ 0.,  0.],
       [ 0.,  0.]])

The `np.eye` function (referring to the mathematical identity matrix, commonly labelled as `I`) creates a square matrix of a given size with `1.0` on the diagonal and `0.0` elsewhere:

In [15]:
np.eye(3)

array([[ 1.,  0.,  0.],
       [ 0.,  1.,  0.],
       [ 0.,  0.,  1.]])

The `np.diag` function creates a square matrix with the given values on the diagonal and `0` elsewhere:

In [16]:
np.diag([1, 2, 3, 4])

array([[1, 0, 0, 0],
       [0, 2, 0, 0],
       [0, 0, 3, 0],
       [0, 0, 0, 4]])

Finally, you can fill an array with random numbers:

In [17]:
np.random.rand(4)  # uniform in [0, 1)

array([ 0.08719542,  0.14807886,  0.60154561,  0.62319152])

In [18]:
np.random.randn(4)  # Gaussian

array([-1.5385821 ,  0.18504451,  1.76736024, -0.79798725])
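
Random arrays differ on every run, which can make results hard to reproduce. A minimal sketch, assuming the legacy `np.random.seed` interface is acceptable for your purposes: fixing the seed before drawing makes the same numbers come out each time. The seed value `42` is arbitrary.

```python
import numpy as np

np.random.seed(42)          # arbitrary seed, chosen for illustration
first = np.random.rand(4)   # uniform samples in [0, 1)

np.random.seed(42)          # reset to the same seed
second = np.random.rand(4)  # the same four numbers again

print(np.array_equal(first, second))

# The dimensions are passed as separate arguments, not as a tuple
normal_2d = np.random.randn(2, 3)
print(normal_2d.shape)
```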

### Exercises

- Experiment with `arange`, `linspace`, `ones`, `zeros`, `eye` and `diag`.
- Create different kinds of arrays with random numbers.
- Look at the function `np.empty`. What does it do? When might this be useful?

## Reshaping arrays

Behind the scenes, a multi-dimensional NumPy `array` is just stored as a linear segment of memory. The fact that it is presented as having more than one dimension is simply a layer on top of that (sometimes called a *view*). This means that we can simply change that interpretive layer and change the shape of an array very quickly (i.e. without NumPy having to copy any data around).

This is mostly done with the `reshape()` method on the array object:

In [19]:
my_array = np.arange(16)
my_array

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15])

In [20]:
my_array.shape

(16,)

In [21]:
my_array.reshape((2, 8))

array([[ 0,  1,  2,  3,  4,  5,  6,  7],
       [ 8,  9, 10, 11, 12, 13, 14, 15]])

In [22]:
my_array.reshape((4, 4))

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

Note that if you check, `my_array.shape` will still return `(16,)`: `reshape()` simply returns a *view* on the original data and doesn't actually *change* the original array. If you want to edit the original object in-place then you can use the `resize()` method.
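
A short sketch of the difference, using fresh arrays (`original`, `reshaped` and `resized` are illustrative names): `reshape()` leaves the original shape alone and shares memory with it, while `resize()` changes the array it is called on.

```python
import numpy as np

original = np.arange(16)
reshaped = original.reshape((4, 4))

print(original.shape)   # the original is untouched
print(reshaped.shape)
print(np.shares_memory(original, reshaped))  # the view reuses the same data

# Writing through the view is visible in the original
reshaped[0, 0] = 100
print(original[0])

# resize() works in-place and returns None
resized = np.arange(16)
resized.resize((2, 8))
print(resized.shape)
```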

You can also transpose an array using the `transpose()` method which mirrors the array along its diagonal:

In [23]:
my_array.reshape((2, 8)).transpose()

array([[ 0,  8],
       [ 1,  9],
       [ 2, 10],
       [ 3, 11],
       [ 4, 12],
       [ 5, 13],
       [ 6, 14],
       [ 7, 15]])

In [24]:
my_array.reshape((4,4)).transpose()

array([[ 0,  4,  8, 12],
       [ 1,  5,  9, 13],
       [ 2,  6, 10, 14],
       [ 3,  7, 11, 15]])

### Exercise

Use the NumPy documentation at https://docs.scipy.org/doc/numpy/reference/arrays.ndarray.html to create, **in one line**, a NumPy array which looks like:

`[ 10,  60,  20,  70,  30,  80,  40,  90,  50, 100]`

Hint: you will need to use `transpose()`, `reshape()` and `arange()` as well as one new function from the "Shape manipulation" section of the documentation.

## Basic data types

You may have noticed that, in some instances, array elements are displayed with a trailing dot (e.g. `2.` vs `2`). This is due to a difference in the data-type used:

In [25]:
a = np.array([1, 2, 3])
a.dtype

dtype('int64')

In [26]:
b = np.array([1., 2., 3.])
b.dtype

dtype('float64')

Different data-types allow us to store data more compactly in memory, but most of the time we simply work with floating point numbers. Note that, in the examples above, NumPy auto-detects the data-type from the input. You can also specify the data-type explicitly with the `dtype` argument:

In [27]:
c = np.array([1, 2, 3], dtype=float)
c.dtype

dtype('float64')

The default data type for array-creation functions such as `np.ones` is floating point:

In [28]:
d = np.ones((3, 3))
d.dtype

dtype('float64')

There are other data types as well:

In [29]:
e = np.array([1+2j, 3+4j, 5+6*1j])
e.dtype

dtype('complex128')

In [30]:
f = np.array([True, False, False, True])
f.dtype

dtype('bool')

In [31]:
g = np.array(['Bonjour', 'Hello', 'Hallo',])
g.dtype     # <--- strings containing max. 7 letters

dtype('<U7')
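
To see how the choice of data type affects storage, the sketch below compares the memory used by the same values held as different types. It relies on the `itemsize` and `nbytes` attributes and the `astype()` method, none of which are introduced above.

```python
import numpy as np

values = np.arange(1000, dtype=np.float64)
small = values.astype(np.float32)   # astype() returns a converted copy

print(values.itemsize, values.nbytes)  # bytes per element, total bytes
print(small.itemsize, small.nbytes)

# Converting floats to integers truncates towards zero
print(np.array([1.7, -1.7]).astype(int))
```

Halving the size of each element halves the total storage, at the cost of reduced precision.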

We previously came across `dtype`s when learning about `pandas`. This is because `pandas` uses NumPy as its underlying library. A `pandas.Series` is essentially a NumPy array with some extra features wrapped around it.

## Why NumPy

To show some of the advantages of NumPy over a standard Python list, let's do some benchmarking. It's an important habit in programming that whenever you think one method may be faster than another, you check to see whether your assumption is true.

Python provides some tools to make this easier, particularly via the [`timeit`](https://docs.python.org/3/library/timeit.html) module. Using this functionality, IPython provides a `%timeit` magic function to make our life easier. To use the `%timeit` magic, simply put it at the beginning of a line and it will give you information about how long that line took to run. It doesn't always work as you would expect, so to make your life easier, put whatever code you want to benchmark inside a function and time that function call.
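
Outside IPython, a similar measurement can be made with the `timeit` module directly. The following is a minimal sketch; the function name `double_list` and the repeat counts are illustrative choices and are separate from the benchmark in this section.

```python
import timeit

def double_list(values):
    """Return a new list with every value doubled."""
    return [v * 2 for v in values]

data = list(range(1000))

# number: calls per measurement; repeat: how many measurements to take
timings = timeit.repeat(lambda: double_list(data), number=100, repeat=3)

# The minimum is commonly reported, as it is least affected by other activity
print(f"best of 3: {min(timings) / 100 * 1e6:.1f} µs per call")
```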

We start by making a list and an array of 100000 items each, with values counting from 0 to 99999:

In [32]:
python_list = list(range(100000))
numpy_array = np.arange(100000)

We are going to go through each item in the list and double its value in-place, such that the list is changed after the operation. To do this with a Python `list` we need a `for` loop:

In [33]:
def python_double(a):
    for i, val in enumerate(a):
        a[i] = val * 2

%timeit -n1 -r1 python_double(python_list)

1 loop, best of 1: 12.4 ms per loop


To simplify things we've asked `%timeit` to only run the function once (`-n1 -r1`).

To do the same operation in NumPy we can use the fact that multiplying a NumPy `array` by a value will apply that operation to each of its elements:

In [34]:
def numpy_double(a):
    a *= 2

%timeit -n1 -r1 numpy_double(numpy_array)

1 loop, best of 1: 126 µs per loop


As you can see, the NumPy version is at least 10 times faster, and sometimes up to 100 times faster.

Have a think about why this might be: what is NumPy doing to make this so much faster? There are two main parts to the answer.

## Indexing and slicing

Like a standard Python list, a NumPy `array` can be accessed using the normal indexing syntax. This includes the negative indexing in order to count from the end of the array:

In [35]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [36]:
a[0], a[2], a[-1]

(0, 2, 9)

For a multidimensional array, the index is a tuple of integers. Unlike a normal Python `list`, it is possible to put more than one number in the square brackets rather than having to chain up multiple pairs of square brackets:

In [37]:
a = np.arange(9).reshape((3,3))
a

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

In [38]:
a[1, 1]

4

In [39]:
a[2, 1] = 10
a

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6, 10,  8]])

In 2D, the first dimension corresponds to rows, the second to columns. 

For a multidimensional array, you can under-specify the indices, in which case NumPy takes all elements along the unspecified dimensions. Here `a[1]` gives all the columns of row number `1`:

In [40]:
a[1]

array([3, 4, 5])

If you want to get a specific column, you can use `...` as a placeholder like so:

In [41]:
a[..., 2]

array([2, 5, 8])
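
For a 2D array, `a[..., 2]` selects the same column as the slice form `a[:, 2]` used later in this section. The difference shows up with more dimensions, where `...` stands for as many full slices as are needed. A brief sketch (the 3D array `cube` is an illustrative example):

```python
import numpy as np

a = np.arange(9).reshape((3, 3))
print(np.array_equal(a[..., 2], a[:, 2]))   # same column either way

cube = np.arange(24).reshape((2, 3, 4))
# Here ... expands to two full slices, i.e. cube[:, :, 0]
print(cube[..., 0].shape)
print(np.array_equal(cube[..., 0], cube[:, :, 0]))
```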

Like a normal Python list, a NumPy array can also be sliced:

In [42]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [43]:
a[2:6]  # index 2 to (but not including) 6

array([2, 3, 4, 5])

In [44]:
a[3:-2]  # index 3 to (but not including) the second-last entry

array([3, 4, 5, 6, 7])

In [45]:
a[2:9:3]  # 2 to 9 in steps of 3

array([2, 5, 8])

Note that not all entries are required: the first defaults to `0`, the second to '1 past the end' and the third to `1`:

In [46]:
a[4:]  # index 4 to end of array

array([4, 5, 6, 7, 8, 9])

In [47]:
a[::2]  # even-index entries

array([0, 2, 4, 6, 8])

It is also possible to combine slicing with a multi-dimensional array:

In [48]:
a = np.arange(36).reshape((6,6))
a

array([[ 0,  1,  2,  3,  4,  5],
       [ 6,  7,  8,  9, 10, 11],
       [12, 13, 14, 15, 16, 17],
       [18, 19, 20, 21, 22, 23],
       [24, 25, 26, 27, 28, 29],
       [30, 31, 32, 33, 34, 35]])

In [49]:
a[0, 3:5]

array([3, 4])

In [50]:
a[4:, 4:]

array([[28, 29],
       [34, 35]])

In [51]:
a[:, 2]

array([ 2,  8, 14, 20, 26, 32])

In [52]:
a[2::2, ::2]

array([[12, 14, 16],
       [24, 26, 28]])

You can also combine assignment and slicing:

In [53]:
a = np.arange(10)
a[5:] = 10
a

array([ 0,  1,  2,  3,  4, 10, 10, 10, 10, 10])

In [54]:
b = np.arange(5)
a[5:] = b[::-1]
a

array([0, 1, 2, 3, 4, 4, 3, 2, 1, 0])

### Exercise: Indexing and slicing

Try the different flavours of slicing, using `start`, `end` and `step`: starting from a `linspace()`, try to obtain odd numbers counting backwards, and even numbers counting forwards.

### Exercise: Array creation

Create the following arrays with correct data types (don't just paste them verbatim into a `np.array()` call):

```
[[1, 1, 1, 1],
 [1, 1, 1, 1],
 [1, 1, 1, 2],
 [1, 6, 1, 1]]
```

```
[[0., 0., 0., 0., 0.],
 [2., 0., 0., 0., 0.],
 [0., 3., 0., 0., 0.],
 [0., 0., 4., 0., 0.],
 [0., 0., 0., 5., 0.],
 [0., 0., 0., 0., 6.]]
```

Par on course: 3 statements for each

*Hint*: Individual array elements can be accessed similarly to a `list`, e.g. `a[1]` or `a[1, 2]`.

*Hint*: Examine the docstring for `diag()`.

### Exercise: Tiling for array creation

Skim through the documentation for `np.tile()`, and use this function to construct the array:

```
[[4, 3, 4, 3, 4, 3],
 [2, 1, 2, 1, 2, 1],
 [4, 3, 4, 3, 4, 3],
 [2, 1, 2, 1, 2, 1]]
```

## Copies and views

A slicing operation (like reshaping before) creates a *view* on the original array, which is just a way of accessing array data. Thus the original array is not copied in memory. You can use `np.may_share_memory()` to check if two arrays share the same memory block. Note however, that this uses heuristics and may give you false positives.

When modifying the view, the original array is modified as well:

In [55]:
a = np.arange(10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [56]:
b = a[::2]

np.may_share_memory(a, b)

True

In [57]:
b[0] = 12
b

array([12,  2,  4,  6,  8])

In [58]:
a   # (!)

array([12,  1,  2,  3,  4,  5,  6,  7,  8,  9])

In [59]:
a = np.arange(10)
c = a[::2].copy()  # force a copy
c[0] = 12
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [60]:
np.may_share_memory(a, c)  # we made a copy so there is no shared memory

False
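
The false positives mentioned above can be seen by comparing `np.may_share_memory()` with the stricter `np.shares_memory()`, which does an exact (and potentially slower) check. In this sketch the two views interleave in memory without having any element in common; the `base` attribute is another way to see which array a view was taken from.

```python
import numpy as np

a = np.arange(10)
evens = a[::2]    # elements at index 0, 2, 4, ...
odds = a[1::2]    # elements at index 1, 3, 5, ...

# The memory ranges overlap, so the quick check reports a possible overlap
print(np.may_share_memory(evens, odds))   # True
# The exact check finds that no element is actually shared
print(np.shares_memory(evens, odds))      # False

print(evens.base is a)               # a view keeps a reference to its source
print(a[::2].copy().base is None)    # a copy owns its own data
```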

## Fancy indexing

NumPy arrays can be indexed with slices, but also with boolean or integer arrays (*masks*). This method is called *fancy indexing*. It creates *copies not views*.

Using a boolean mask:

In [61]:
np.random.seed(3)
a = np.random.randint(0, 20, 15)
a

array([10,  3,  8,  0, 19, 10, 11,  9, 10,  6,  0, 12,  7, 14, 17])

In [62]:
(a % 3 == 0)  # an array with True where the condition "a[i] % 3 == 0" is true.

array([False,  True, False,  True, False, False, False,  True, False,
        True,  True,  True, False, False, False], dtype=bool)

In [63]:
mask = (a % 3 == 0)
multiples_of_three = a[mask] # or,  a[a%3==0]
multiples_of_three           # extract a sub-array with the mask

array([ 3,  0,  9,  6,  0, 12])

Indexing with a mask can be very useful to assign a new value to a sub-array:

In [64]:
a[a % 3 == 0] = -1
a

array([10, -1,  8, -1, 19, 10, 11, -1, 10, -1, -1, -1,  7, 14, 17])

You can also do fancy indexing with an array (or list) of integers, in which the same index may be repeated several times:

In [65]:
a = np.arange(0, 100, 10)
a

array([ 0, 10, 20, 30, 40, 50, 60, 70, 80, 90])

In [66]:
a[[2, 3, 2, 4, 2]]

array([20, 30, 20, 40, 20])

New values can be assigned with this sort of indexing:

In [67]:
a[[9, 7]] = -100
a

array([   0,   10,   20,   30,   40,   50,   60, -100,   80, -100])
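
The statement that fancy indexing creates copies rather than views can be checked directly. In this sketch, modifying the result of a fancy-indexing expression leaves the source array unchanged, whereas modifying a slice of the same array changes the original:

```python
import numpy as np

a = np.arange(0, 100, 10)

picked = a[[2, 3, 4]]      # fancy indexing: a new array
picked[0] = -1
print(a[2])                # still 20, the original is untouched

sliced = a[2:5]            # slicing: a view
sliced[0] = -1
print(a[2])                # now -1, the original changed

print(np.shares_memory(a, picked), np.shares_memory(a, sliced))
```

Assignment through a fancy index, such as `a[[9, 7]] = -100`, still modifies `a` itself, because no intermediate array is created in that case.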

### Exercise

Try using fancy indexing on the left and array creation on the right to assign values into an array, for instance by setting parts of a large 2D array to zero.

Next we will cover how to use the plotting tool favoured by pandas, *matplotlib*. Continue to the [next section](matplotlib.ipynb).