# NumPy

The NumPy package contains a range of useful numerical tools, and it is also used by many other scientific and data science packages, such as SciPy and pandas. NumPy has many features that are similar to the commercial software MATLAB, and if you have used MATLAB before there is a useful page that [compares NumPy and MATLAB functions](https://numpy.org/doc/stable/user/numpy-for-matlab-users.html).

The fundamental type of variable that NumPy is built around is called an array. These are a lot like lists, but NumPy is optimised to perform mathematical operations on arrays of numbers. This means that some operations are much faster or more convenient using NumPy. Arrays are intended to represent mathematical vectors, matrices and tensors, and NumPy provides many of the mathematical operations we might perform on them.

It is standard to import NumPy as `np`:

In [None]:
import numpy as np

## Creating NumPy arrays

A NumPy array can be created using the function `array`:

In [None]:
x=np.array([1,2,3])
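To make the contrast with lists concrete, the short sketch below applies `+` and `*` to a list and to an array holding the same numbers. The variable names are illustrative and are not used elsewhere in this notebook.

```python
import numpy as np

numbers_list = [1, 2, 3]
numbers_array = np.array(numbers_list)

# For lists, + concatenates and * repeats
list_sum = numbers_list + numbers_list      # [1, 2, 3, 1, 2, 3]
list_twice = 2 * numbers_list               # [1, 2, 3, 1, 2, 3]

# For arrays, the same operators act on each element
array_sum = numbers_array + numbers_array   # elements 2, 4, 6
array_twice = 2 * numbers_array             # elements 2, 4, 6

print(list_sum, array_sum)
```

This is the sense in which arrays behave like mathematical vectors rather than general-purpose containers.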

NumPy supports the creation of multi-dimensional arrays, for example the following code creates a matrix with 3 rows and 2 columns:

In [None]:
y=np.array([[1,2],[1,2],[1,2]])
y

Here each inner pair of square brackets is one row of the matrix, the entries inside it are that row's columns, and the outer square brackets collect the rows together. It is also possible to have higher dimensional arrays, but we won't deal with these much in this notebook.

Unlike lists, arrays must be **rectangular**. For example, recent versions of NumPy raise an error if we try to have a different number of columns in each row:

In [None]:
np.array([[1,2,3],[1,2]])

Restricting to rectangular arrays is one of the reasons that some operations on arrays can be faster than on lists. If you need non-rectangular data, you could reconsider the data structure that you're using, use zero to represent empty entries in a rectangular array, or decide that NumPy is not the right tool for that data.
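As a rough sketch of the zero-padding option, the code below copies rows of different lengths into a rectangular array. It relies on `zeros` and slice assignment, both of which are introduced later in this notebook, and the names `ragged_rows` and `padded` are purely illustrative.

```python
import numpy as np

ragged_rows = [[1, 2, 3], [1, 2]]   # rows of different lengths

# Allocate a rectangular array wide enough for the longest row
width = max(len(row) for row in ragged_rows)
padded = np.zeros((len(ragged_rows), width))

# Copy each row in, leaving the unused entries as zero
for i, row in enumerate(ragged_rows):
    padded[i, :len(row)] = row

print(padded)
```

Bear in mind that a padded zero cannot be told apart from a genuine zero in the data, so this only makes sense when zero is a safe placeholder.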

We can obtain information about the shape of the array using the `shape` attribute:

In [None]:
y.shape

This tells us that the variable `y` has 3 rows and 2 columns. The shape of the `x` variable is

In [None]:
x.shape

This tells us that `x` is one-dimensional, but note that its shape is not (3,1). NumPy usually runs faster on such one-dimensional arrays, rather than forcing them to be row or column vectors.

In [None]:
# You could force your variable to be a row vector:
x=np.array([[1,2,3]])
x.shape
# But doing so may make your code slower.
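If a row or column vector really is needed, one option is the array method `reshape`. The sketch below uses illustrative names so that it does not overwrite `x`.

```python
import numpy as np

v = np.array([1, 2, 3])

# -1 asks NumPy to infer that dimension from the number of elements
row = v.reshape(1, -1)
column = v.reshape(-1, 1)

print(v.shape, row.shape, column.shape)   # (3,) (1, 3) (3, 1)
```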

The `size` attribute tells us how many elements there are in the array:

In [None]:
x.size

In [None]:
y.size

A common operation on vectors and matrices is to transpose them, i.e. reflect them along the diagonal. This can be achieved in NumPy using the `T` attribute:

In [None]:
y.T
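One detail that can surprise people coming from MATLAB is that transposing a one-dimensional array leaves it unchanged, because there is only one axis to reflect. A small check, using illustrative names:

```python
import numpy as np

flat = np.array([1, 2, 3])       # one-dimensional, shape (3,)
row = np.array([[1, 2, 3]])      # two-dimensional row vector, shape (1, 3)

print(flat.T.shape)   # (3,)   - unchanged
print(row.T.shape)    # (3, 1) - now a column vector
```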

Usually the arrays that we want to use will be much too big to define by hand, so there are built-in functions to create arrays. It is common to start with an array of zeros and then fill in the entries using a loop. The `zeros` function can be used to create an array of zeros:

In [None]:
np.zeros(3)

We can also create multi-dimensional arrays by using a tuple to specify the dimensions of the array.

In [None]:
number_of_rows=3
number_of_columns=2
np.zeros((number_of_rows,number_of_columns))

Defining an empty array like this can help with the efficiency of our programs, because doing so tells the computer how much memory it needs to allocate to this variable. However, we have to be careful that we don't overwrite entries with the wrong values.
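A minimal sketch of the "allocate, then fill" pattern described above, storing the first few square numbers (the names are illustrative):

```python
import numpy as np

number_of_terms = 5
squares = np.zeros(number_of_terms)   # memory for all entries is allocated here

# Fill in each entry by index
for i in range(number_of_terms):
    squares[i] = i**2

print(squares)
```

Note that `zeros` creates floating-point entries by default, so the squares are stored as floats rather than integers.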

There is a similar function that produces an array of ones:

In [None]:
np.ones((2,2))

It is very common, particularly when plotting, to want an array of uniformly spaced numbers. The `linspace` function provides this:

In [None]:
start=0
end=1
number_of_points=11
np.linspace(start,end,number_of_points)
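Because `linspace` includes both end points by default, the spacing between neighbouring points is `(end - start) / (number_of_points - 1)`, not `(end - start) / number_of_points`. The sketch below checks this; `points` and `spacing` are illustrative names.

```python
import numpy as np

start, end, number_of_points = 0, 1, 11
points = np.linspace(start, end, number_of_points)

# 11 points span 10 intervals, so the spacing is 0.1
spacing = (end - start) / (number_of_points - 1)

print(points[0], points[-1])                   # both end points are included
print(np.allclose(np.diff(points), spacing))   # neighbouring gaps all match
```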

## Operations on NumPy arrays

An important feature of NumPy arrays is that we can perform arithmetic on them. 

In [None]:
# Define two arrays of the same size
x=np.array([0,1,2])
y=np.array([1,2,4])

In [None]:
x+y

In [None]:
y-x

In [None]:
x/y

In [None]:
y/x

Note that a warning was produced when we divided by zero. This did not produce an error, and NumPy used the special floating-point value `inf` to represent the result of dividing by zero. However, you should generally try to avoid dividing by zero in your code, as it will often result in errors, possibly in other parts of your code where it may not be easy to determine that the error resulted from the appearance of `inf`.

NumPy also has another special value for mathematical calculations that are not defined:

In [None]:
x/x

Here `nan` stands for "not a number". Again, the appearance of a `nan` usually indicates a problem with the code that should be fixed. Calculations that involve `nan` and `inf` often result in one of those values, so `nan` and `inf` tend to propagate, which can mean that calculations return unexpected answers. Note however that `nan` can be used to represent missing data, as it is in the pandas package, although pandas includes various features to deal with missing data.
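One way to catch these values early is to test for them explicitly. The sketch below uses the NumPy functions `isfinite` and `isnan`, with `errstate` to silence the warnings for this deliberately bad division; the variable names are illustrative.

```python
import numpy as np

# Suppress the divide-by-zero and invalid-value warnings for this demonstration
with np.errstate(divide="ignore", invalid="ignore"):
    ratio = np.array([1.0, 0.0, 2.0]) / np.array([0.0, 0.0, 4.0])

print(ratio)                 # contains inf, nan and 0.5
print(np.isfinite(ratio))    # True only for ordinary numbers

# A single nan is enough to make the whole sum nan
total = ratio.sum()
print(np.isnan(total))

# nan is not equal to anything, including itself, so use isnan rather than ==
print(total == total)
```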

NumPy also has a range of standard mathematical functions and importantly these work element-wise (i.e. separately on each element) on NumPy arrays. This can be much faster than using a loop to compute the results of mathematical functions on many different values.

In [None]:
np.sin(np.linspace(0,np.pi,10))

So far we have considered operations on arrays that are the same shape. However, NumPy can also combine arrays of different shapes, provided that each pair of corresponding dimensions is either equal or one of them is 1 (shapes are compared starting from the last dimension, and missing leading dimensions are treated as 1). This is called "broadcasting" and can be used to generalise scalar multiplication. For example, we would expect to be able to multiply an array by a constant:

In [None]:
2*np.array([1,2])

This can be generalised to multi-dimensional arrays. For example, the following array

In [None]:
x=np.array([[[1,1,1],[2,2,2]],[[3,3,3],[4,4,4]]])
x

has shape

In [None]:
x.shape

and consists of two-by-two matrices

In [None]:
x[:,:,0]

stacked on top of each other in the third dimension of the array. Suppose we want to multiply each layer in the third dimension by a different amount. We could then make an array in that dimension with the corresponding multiplication factors. For example, the array

In [None]:
y=np.array([[[1,2,3]]])
y.shape

can be multiplied with the array x

In [None]:
x*y
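The shape rule for broadcasting can be checked on a smaller case. The sketch below, with illustrative names, adds a column to a row and then tries a pair of shapes that do not satisfy the rule.

```python
import numpy as np

column = np.array([[0], [10], [20]])   # shape (3, 1)
row = np.array([[1, 2, 3, 4]])         # shape (1, 4)

# Comparing (3, 1) with (1, 4): in each position one of the sizes is 1,
# so both arrays are stretched to shape (3, 4) before adding
table = column + row
print(table.shape)

# Shapes (2, 3) and (2, 4) clash in the last dimension (3 against 4)
try:
    np.ones((2, 3)) + np.ones((2, 4))
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)
```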

## Iterating

Arrays are iterable in the same way that lists are:

In [None]:
x=np.array([0,1,2])
for n in x:
    print(n)

There is also a NumPy version of `range`, called `arange`, that allows non-integer step sizes:

In [None]:
start=0
stop=1
step=0.3
for i in np.arange(start,stop,step):
    print(i)

Note how `arange` finished before the stop value. If you are iterating over non-integer values, it's good practice to check that the iterator goes through all of the intended values.
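A simple way to carry out that check is to look at the length and the last value of the array before looping over it. If the end point has to be included, `linspace` is one alternative, since it takes the number of points rather than the step. The names below are illustrative.

```python
import numpy as np

start, stop, step = 0, 1, 0.3
values = np.arange(start, stop, step)

# Inspect what will actually be iterated over
print(len(values), values[-1])   # 4 values, and the last one is below stop

# linspace takes the number of points and includes the end point
inclusive = np.linspace(start, stop, 5)
print(inclusive[-1])
```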

## Slicing

Slicing NumPy arrays offers some additional functionality compared with slicing lists. As an example, we'll use the following array.

In [None]:
x=np.linspace(0,1,10)
x

Indexing and slicing operations from lists also work on arrays:

In [None]:
x[0]

In [None]:
x[:3]

In [None]:
x[-1]

In [None]:
x[2:8]

In [None]:
x[2:8:2]

An important difference is that a single value can be assigned to every entry selected by a slice. For example:

In [None]:
x[2:8:2]=-1
x
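For comparison, the sketch below tries the same assignment on a list and on an array. The names are illustrative, and the `try`/`except` is only there so that the cell runs to completion.

```python
import numpy as np

values_list = list(range(10))
values_array = np.arange(10)

# Arrays accept a single value for a whole slice
values_array[2:8:2] = -1

# Lists need an iterable with one item per selected position
try:
    values_list[2:8:2] = -1
    list_accepts_scalar = True
except TypeError:
    list_accepts_scalar = False

values_list[2:8:2] = [-1, -1, -1]   # explicit replacement values do work

print(list_accepts_scalar)
print(values_array.tolist() == values_list)
```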

Furthermore, slicing operations can also be combined with logical operations. First let's create a matrix of random numbers

In [None]:
x=np.random.rand(3,3)
x

(Note that NumPy has built in functions for random number generation.)

We can apply logical operators to arrays. For example, the following determines which values of `x` are less than a half:

In [None]:
x<0.5

The output is an array of logical values, and this can be used to index arrays of the same shape.

(If none of the values are less than a half, then you may wish to re-run the cell above where `x` is assigned.)

For example, to get the values of the entries in `x` less than a half, we can do

In [None]:
x[x<0.5]

We can then assign values to these.

In [None]:
x[x<0.5]=0
x

We could also assign values using an array. In the example below, the values of `x` that are zero are replaced with normally distributed random numbers:

In [None]:
x[x==0]=np.random.randn(*x[x==0].shape)
# The * here unpacks a tuple into a function argument.
x

It is also possible to perform logical operations on arrays. For example, a logical "or" can be performed with the function `logical_or` or using the operator `|`:

In [None]:
np.logical_or(x<0.1,x>0.9)

In [None]:
# The use of brackets is important here
(x<0.1) | (x>0.9)

Similarly a logical "and" can be performed with the function `logical_and` or the operator `&`.
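The sketch below shows `&` alongside `logical_and`, and illustrates why the brackets matter: `|` and `&` bind more tightly than `<` and `>`, so without brackets Python tries to evaluate `0.1 | x` first. The array and names are illustrative.

```python
import numpy as np

x_demo = np.array([0.05, 0.5, 0.95])

# Two equivalent ways to select values strictly between 0.1 and 0.9
between = (x_demo > 0.1) & (x_demo < 0.9)
between_function = np.logical_and(x_demo > 0.1, x_demo < 0.9)
print(between, between_function)

# Without brackets this is read as x_demo < (0.1 | x_demo) > 0.9, which fails
try:
    x_demo < 0.1 | x_demo > 0.9
    unbracketed_works = True
except (TypeError, ValueError):
    unbracketed_works = False
print(unbracketed_works)
```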

These can then be used to index and assign:

In [None]:
x[(x<0.1) | (x>0.9)]=-10
x

## Vectorisation

NumPy arrays are intended to represent vectors and matrices, and so naturally we might want to perform vector and matrix multiplication (rather than element-wise multiplication). Consider the following vector and matrix:

In [None]:
x=np.array([1,2,3])
y=np.array([[1,2],[3,4],[5,6]])

The `@` operator can be used to compute the vector-matrix product

In [None]:
x @ y

and there is also a corresponding function

In [None]:
np.matmul(x,y)

Note that NumPy will raise an error if the shapes of the arrays you wish to multiply are not compatible:

In [None]:
y @ x
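The rule is that the last dimension of the left-hand array must match the first dimension of the right-hand array. The sketch below repeats the example with illustrative names and shows that transposing the matrix makes the reversed product valid.

```python
import numpy as np

vec = np.array([1, 2, 3])                  # shape (3,)
mat = np.array([[1, 2], [3, 4], [5, 6]])   # shape (3, 2)

# (3,) @ (3, 2): the 3s match, giving shape (2,)
left_product = vec @ mat
print(left_product)    # 1*1 + 2*3 + 3*5 = 22 and 1*2 + 2*4 + 3*6 = 28

# (3, 2) @ (3,): 2 does not match 3, so this raises an error
try:
    mat @ vec
    shapes_match = True
except ValueError:
    shapes_match = False
print(shapes_match)

# (2, 3) @ (3,): transposing the matrix makes the inner dimensions agree
right_product = mat.T @ vec
print(right_product)
```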

Vectorisation is the process of computing on arrays rather than using for loops. For example, it is sometimes possible to replace a loop with vector/matrix multiplication, and this may be faster. To demonstrate this, suppose we want to compute the sum of 10,000 uniformly distributed random variables, and we want to do this 100 times. First let's draw all of the random numbers:

In [None]:
n=10000
m=100
x=np.random.rand(m,n)

We could use a list comprehension to loop over the rows of the matrix and sum them up:

In [None]:
%%timeit
[x[i,:].sum() for i in range(m)]

In this case, summing up is equivalent to multiplying on the right by a column vector of ones, and doing this is faster:

In [None]:
%%timeit
x @ np.ones(n)

However, there would be a smaller difference in speed if the number of rows (`m`) was smaller. Vectorising code can also use more computer memory, which could slow down the computations.
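The array method `sum` also accepts an `axis` argument, which gives another vectorised way to sum along each row without building a vector of ones. The sketch below only checks that the two approaches agree on a small random matrix; it makes no claim about which is faster, and the names are illustrative.

```python
import numpy as np

small_matrix = np.random.rand(4, 6)

# Sum each row by multiplying by a vector of ones
via_product = small_matrix @ np.ones(6)

# Sum each row directly: axis=1 sums across the columns of each row
via_sum = small_matrix.sum(axis=1)

print(np.allclose(via_product, via_sum))
```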

It's also usually quicker to perform operations on arrays rather than using loops. For example, below we compare summing up a million uniformly distributed random variables.

In [None]:
n=1000000

Using the NumPy sum function:

In [None]:
%%time
np.sum(np.random.rand(n))

Using a list comprehension and the built in function `sum`:

In [None]:
%%time
sum([np.random.rand() for i in range(n)])

Using a loop:

In [None]:
%%time
tot=0
for i in range(n):
    tot+=np.random.rand()
print(tot)

It can also be much better to create vectorised functions, i.e. functions that can take arrays as arguments (and ideally scalars as well). The following function works element-wise on both arrays and numerical values:

In [None]:
def linear_combination(a,x,t):
    return x*np.exp(-a*t)

It also runs much quicker on arrays than using list comprehensions:

In [None]:
# Set up the function arguments
a=2
n=1000
t=np.linspace(0,10,n)
x=np.random.rand(n)

In [None]:
%%timeit
# Computing on arrays
y=linear_combination(a,x,t)

In [None]:
%%timeit
# Computing using a list comprehension
y=[linear_combination(a,x[i],t[i]) for i in range(n)]

Hopefully you should see that using arrays is much faster (it was about 140 times faster on the computer on which this notebook was written).
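The claim that `linear_combination` accepts both numerical values and arrays can be checked directly. The function is repeated in the sketch below so that it runs on its own, and the inputs are arbitrary illustrative values.

```python
import numpy as np

def linear_combination(a, x, t):
    return x*np.exp(-a*t)

# Scalar arguments: 3 * exp(-2 * 0) = 3
scalar_result = linear_combination(2, 3.0, 0.0)

# Array arguments: each pair (x[i], t[i]) is handled element-wise
array_result = linear_combination(2, np.array([1.0, 2.0]), np.array([0.0, 0.5]))

print(scalar_result)
print(array_result)   # 1 * exp(0) and 2 * exp(-1)
```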

It may not be possible to write a function that can work with both arrays and numerical values, but if you know you will want to apply a function to arrays, then writing it to operate on whole arrays is generally much faster than looping. The following example function computes the square root of only the positive values of an array.

In [None]:
def root_positive_numbers_vectorised(x):
    y=x.copy() # This ensures that the values in x are not overwritten
    y[x>0]=np.sqrt(y[x>0])
    return y

To illustrate what this function does:

In [None]:
x=np.linspace(-3,3,7)
x

In [None]:
y=root_positive_numbers_vectorised(x)

In [None]:
y

An equivalent function that only worked on numerical values would be:

In [None]:
def root_positive_numbers(x):
    if x>0:
        return np.sqrt(x)
    else:
        return x
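Before timing the two versions, it is worth confirming that they give the same answers. The sketch below repeats both definitions so that it runs on its own and compares them on a small array; `sample`, `vectorised` and `looped` are illustrative names.

```python
import numpy as np

def root_positive_numbers_vectorised(x):
    y = x.copy()
    y[x > 0] = np.sqrt(y[x > 0])
    return y

def root_positive_numbers(x):
    if x > 0:
        return np.sqrt(x)
    else:
        return x

sample = np.linspace(-3, 3, 7)

vectorised = root_positive_numbers_vectorised(sample)
looped = np.array([root_positive_numbers(z) for z in sample])

print(np.allclose(vectorised, looped))   # the two approaches should agree
print(sample[0], vectorised[0])          # negative values are left unchanged
```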

Now we will compare the vectorised and non-vectorised functions on some random numbers:

In [None]:
n=1000000
# Create normally distributed random variables
x=np.random.randn(n)

In [None]:
%%timeit 
y=root_positive_numbers_vectorised(x)

In [None]:
%%timeit 
y=[root_positive_numbers(z) for z in x]

In [None]:
%%timeit 
y=list(map(root_positive_numbers,x))

Hopefully you will again see that the vectorised approach is much faster (it was about 40 times faster on the computer on which this notebook was written).