
<a href="http://www.cosmostat.org/" target="_blank"><img align="left" width="300" src="http://www.cosmostat.org/wp-content/uploads/2017/07/CosmoStat-Logo_WhiteBK-e1499155861666.png" alt="CosmoStat Logo"></a>
<br>
<br>
<br>
<br>

# Numpy Intro

---

> Author: <a href="http://www.cosmostat.org/people/santiago-casas" target="_blank" style="text-decoration:none; color: #F08080">Santiago Casas</a>  
> Email: <a href="mailto:santiago.casas@cea.fr" style="text-decoration:none; color: #F08080">santiago.casas@cea.fr</a>  
> Year: 2019  
> Version: 1.0

---
<br>



So far we have seen how to use and define *lists, dictionaries, functions* and some other *pythonic* tools. However, scientific research often requires more than simple algorithms: it calls for specialized libraries for working with arrays, math functions, databases and graphics.

## Let's start by importing the necessary libraries

In [None]:
import numpy

It is more convenient to import the numpy package under a short alias, which avoids long expressions.

In [None]:
import numpy as np

In this example the **`as`** keyword binds the numpy package to the name `np`.

---

## Arrays

### The Basics

The most essential numpy object is the numpy array (<a href="https://docs.scipy.org/doc/numpy/reference/generated/numpy.array.html" target="_blank">numpy.ndarray</a>).

In [None]:
# a is a list
a = [1, 2, 3, 4]
print('a is', type(a))

# b is a numpy array
b = np.array(a)
print('b is', type(b))

Accessing and printing a single entry works exactly the same.

In [None]:
print('first element of a is', a[0])
print('first element of b is', b[0])

In [None]:
print('last element of a is', a[-1])
print('last element of b is', b[-1])

However, their printed forms are slightly different.

In [None]:
print('list: ', a)
print('np array: ',b)

Moreover, while lists can contain different object types

In [None]:
a = [1, 1.0, 'a', True]
print(a)

numpy arrays are of a single type only, which is one of the reasons why they are so efficient.

---




For example, in this case it will convert all entries to strings (upcasting)

In [None]:
np.array(a)

Or in this case, all entries to floats

In [None]:
np.array([1, 2.5, 23, 100.0, np.pi])

One can also specify the type directly with the optional argument **dtype** and the entries will be converted to the specified type.

In [None]:
np.array([1, 2.5, 23, 100.0, np.pi], dtype='int32')
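
As a small sketch of how the element type can be inspected and changed (the variable names `mixed` and `as_int` are illustrative and not part of the notebook): the `dtype` attribute reports the type that numpy chose, and `astype` returns a converted copy. Note that converting floats to integers drops the fractional part instead of rounding.

```python
import numpy as np

mixed = np.array([1, 2.5, 23, 100.0, np.pi])
print(mixed.dtype)               # float64: the integers were upcast to floats

as_int = mixed.astype('int32')   # returns a new array, `mixed` is unchanged
print(as_int)                    # fractional parts are dropped (truncation)
print(as_int.dtype)
```

Since `astype` returns a copy, the original float values stay available if they are needed later.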

Did you notice the $\pi$ constant in the list above? Numpy provides a list of available <a href="https://www.numpy.org/devdocs/reference/constants.html" target="_blank">constants</a>. Another one that is useful in science is:

In [None]:
#Euler's number e, the base of the natural logarithm
np.e

## Creating arrays from scratch

Sometimes it is useful to create a numpy array quickly from scratch. Numpy offers several neat functions for this.

### Zeros and Ones

In [None]:
# Create a length-10 integer array filled with zeros
np.zeros(10, dtype=int)

In [None]:
# Create a length-5 floating-point array filled with ones
np.ones(5, dtype=float)

### A range and linspace

One of the most common tasks in everyday scientific work is creating arrays that contain evenly-spaced numbers within an interval.

In [None]:
#A range of floats up to 3.0 with default step 1.0
print(np.arange(3.0))
# A range from start to stop, with a given step
print(np.arange(5.0, 405., 50))

> **<font color='red'>NOTE:</font>** With **`arange`** the endpoint is not included!

Remember that you can check the documentation from within the Jupyter notebook by running **`?np.arange`** in a cell.

If one needs to specify the number of samples and also include the endpoint, then linspace is the right tool. It even contains an optional argument `endpoint`, which defaults to `True`.

In [None]:
#Three floats evenly spaced in the interval 0. to 3. 
print(np.linspace(0.,3.,3))

In [None]:
#With endpoint=False, we get the same behavior as `np.arange`
print(np.linspace(0.,3.,3, endpoint=False))

In [None]:
#default number of samples is 50
np.linspace(0,100)
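
A short sketch of how `linspace` and `arange` relate, using the optional `retstep` argument of `linspace` to recover the spacing (the names `samples`, `step` and `grid` are illustrative). The same grid can be built with `arange`, but only by knowing the step and pushing the stop value past the endpoint.

```python
import numpy as np

# Ask for 5 samples between 0 and 1, endpoint included
samples, step = np.linspace(0.0, 1.0, 5, retstep=True)
print(samples)
print(step)   # 0.25

# The same grid with arange: the stop value must lie beyond the endpoint
grid = np.arange(0.0, 1.0 + step, step)
print(np.allclose(samples, grid))
```

When the number of samples matters more than the exact step, `linspace` is usually the less error-prone choice.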

Another important array for scientists is a **logarithmically-spaced** interval. The default logarithm is base 10, but that can be changed with the `base` optional argument. The initial and final values of the interval have to be given as exponents, i.e. as their logarithms in the chosen base.

In [None]:
# A log10-spaced interval from 10^-2 to 10^3 of size 5.
np.logspace(-2, 3, 5)

Applying a $\log_{10}$ on the whole array shows that it is indeed log-spaced.

In [None]:
np.log10(np.logspace(-2, 3, 5))

> **Notice** how we are using `numpy` internal functions, called uFuncs, to apply an operation to the entire array. We will explain them in more detail a bit later.
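
To make the role of the exponents explicit, here is a minimal sketch (variable names are illustrative) showing that `logspace` is the base raised to a `linspace` of exponents. It also uses `np.geomspace`, a related numpy function that takes the endpoints themselves instead of their logarithms.

```python
import numpy as np

exponents = np.linspace(-2, 3, 6)             # -2, -1, 0, 1, 2, 3
by_hand = 10.0 ** exponents                   # raise the base to each exponent
with_logspace = np.logspace(-2, 3, 6)         # endpoints given as exponents
with_geomspace = np.geomspace(1e-2, 1e3, 6)   # endpoints given directly

print(np.allclose(by_hand, with_logspace))
print(np.allclose(with_logspace, with_geomspace))
```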

In [None]:
#A ln-spaced interval from e^-1 to e^4 of size 4.
e_array = np.logspace(np.log(np.exp(-1)), np.log(np.exp(4)), 4, base=np.e)

> **Puzzle 1:** What is the output of `np.log(e_array)[-1]` ?

In [None]:
#Answer Puzzle 1:
#Uncomment to see the answer
#print('the result is: ', np.log(e_array)[-1])

### Multi-dimensional arrays

Arrays can also be multi-dimensional, and their shape can be specified at creation.

In [None]:
#2-dimensional array of size 3x5
np.ones((3,5))

Creating the ***identity*** matrix of size 5

In [None]:
np.identity(5)

`np.eye` is a generalization of the identity, with arguments `numpy.eye(N, M=None, k=0, dtype=<class 'float'>, order='C')`. `N` is the number of rows of the array, `M` defaults to `N` and is the number of columns, while `k` shifts the diagonal by a positive or negative integer with respect to the main diagonal. The other arguments can be looked up in the documentation.

In [None]:
#rectangular matrix
np.eye(3,4)

In [None]:
#shifted diagonal
np.eye(5, k=2)

Sometimes one just needs an uninitialized array, holding arbitrary values, which is to be filled later on. `np.empty` does the job:

In [None]:
# Create an uninitialized 3x6 array of floats
# The values will be whatever happens to already exist at that memory location
np.empty((3,6))

Very useful in science is the creation of arrays with **random numbers** following a given distribution. Check the extensive documentation of (<a href="https://www.numpy.org/devdocs/reference/random/index.html?highlight=random#module-numpy.random" target="_blank">numpy.random</a>) for much more information on all the available methods.

In [None]:
# Create a 3x3 array of normally distributed random values
# with mean 0 and standard deviation 1
np.random.normal(0, 1, (3, 3))

In [None]:
# Create a 6x6 array of uniformly distributed
# random integers from 0 to 9 (the upper bound 10 is excluded)
rand_mat = np.random.randint(0, 10, (6, 6))
print(rand_mat[:,0])
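
For reproducible results it helps to fix a seed. The sketch below uses `np.random.default_rng`, the generator interface that numpy provides alongside the `np.random.*` functions used above; the seed value 42 and the variable names are arbitrary choices for illustration.

```python
import numpy as np

rng = np.random.default_rng(seed=42)       # 42 is an arbitrary seed
normal_values = rng.normal(0, 1, (3, 3))   # mean 0, standard deviation 1
integers = rng.integers(0, 10, (6, 6))     # upper bound 10 is excluded

# A second generator with the same seed reproduces the same numbers
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(normal_values, rng_again.normal(0, 1, (3, 3))))
```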

One can create a random set of points following a Gaussian distribution with a given covariance.

In [None]:
mean = [0, 0]
cov = [[1, 0.5],
       [0.5, 2]]
X = np.random.multivariate_normal(mean, cov, 1000)
X.shape

> **<font color='red'>NOTE:</font>** We will see more details about matplotlib in the next session!

In [None]:
%matplotlib inline
import matplotlib.pyplot as plt
#import seaborn; seaborn.set()  # for plot styling

plt.scatter(X[:, 0], X[:, 1]);

To construct multi-dimensional arrays, one can also reshape 1-dimensional arrays, using the useful method `reshape(i,j)`. The arguments indicate the number of rows and columns of the new array.

In [None]:
# Convert 1-dim array into 2x2 matrix
e_array.reshape((2,2))

In [None]:
# Reshape a 2-dim array
rand_mat.reshape(9,4)

If the second argument is `-1`, the size of the second axis is inferred from the total number of elements of the array.

In [None]:
rand_mat.reshape(3,-1)

> **Puzzle 2:** The attribute `shape` returns the shape of a numpy array in the form of a tuple. What is the output of `rand_mat.reshape(2,-1).shape[1]` ?

In [None]:
#Answer Puzzle 2:
#Uncomment to see the answer
#print('the answer is: ', rand_mat.reshape(2,-1).shape[1])

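Another useful method is `np.ravel`, which is roughly the "inverse" of reshape in this case. It returns a flattened 1-d array from a 2-d array and is equivalent in most cases to the array method `flatten`. The difference is that `flatten` always returns a copy, while `ravel` returns a view of the original data whenever possible.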

In [None]:
e_array.reshape((2,2)).ravel()

In [None]:
e_array.reshape((2,2)).flatten()
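
The two methods differ in whether they share memory with the original array. A minimal sketch of that difference, on a small illustrative array named `matrix`:

```python
import numpy as np

matrix = np.arange(4).reshape(2, 2)
flattened = matrix.flatten()   # always an independent copy
raveled = matrix.ravel()       # a view here, because the data is contiguous

flattened[0] = -1   # does not touch `matrix`
raveled[1] = 99     # writes through to matrix[0, 1]
print(matrix)
```

Only the assignment through `raveled` shows up in `matrix`.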

And for scientific purposes, the ***transpose*** is a very important attribute

In [None]:
e_array.reshape((2,2)).T

In [None]:
(e_array.reshape((2,2)).T)[0,1]==(e_array.reshape((2,2)))[1,0]

Other available attributes are:

In [None]:
print("Number of dimensions, ndim: ", rand_mat.ndim)
print("Array shape:", rand_mat.shape)
print("Array size: ", rand_mat.size)

## Slicing and accessing elements

As you might know, slicing works for lists, using the `:` operator

In [None]:
my_list = [1,2,3,4,5,6]  # avoid the name `list`, which would shadow the built-in type

In [None]:
#Take the first two elements of the list
my_list[:2]

In [None]:
print(my_list[2:])
#Omitting the number after the colon is equivalent to indicating the list size:
my_list[2:6] ==  my_list[2:]

In [None]:
#Take every second element
my_list[::2]

In [None]:
#Reverse the list
my_list[::-1]

> Slicing in visual form:

<img src='https://scipy-lectures.org/_images/numpy_indexing.png' width='600'>

Here we extend our `e_array` by joining three copies of it with the useful function `concatenate`.

In [None]:
extended_array = np.concatenate([e_array, e_array, e_array])
print(extended_array)

To obtain three times the same number, we just need to get every fourth element

In [None]:
extended_array[::4]

> **Puzzle 3:** Obtain the array `[4., 4., 4.]` from `extended_array` by using slicing and `log`.

In [None]:
#Answers can be found in the Answers.ipynb

> **<font color='red'>NOTE:</font>** Slicing returns a view of the array and not a copy!

### Slicing: View vs. Copy

In [None]:
# Nested list comprehension
array2d = np.array([[ ii+jj for jj in range((ii-1)*4,(ii-1)*4+5)] for ii in range(1,1+4)])
print(array2d)

For multi-dimensional arrays, the different axes are accessed by separating the slices with commas.

In [None]:
#All rows, every other column
array2d[:,::2]

In [None]:
#Fourth row, all columns
array2d[3,:]

In [None]:
#Obtain shape
array2d.shape

In [None]:
#Obtain central array
array2d_center=array2d[1:array2d.shape[0]-1, 1:array2d.shape[1]-1]
print(array2d_center)

In [None]:
#Now if we modify this subarray, we'll see that the original array is changed! 
array2d_center[0,0]=1001.5

In [None]:
print(array2d_center)

In [None]:
print(array2d)

> **<font color='red'>NOTE:</font>** Notice that not only was the big array changed, but numpy also automatically converted the assigned value `1001.5` to an integer (`1001`), to match the type of the other entries!

From the `Python Data Science Handbook`: This default behavior is actually quite useful: it means that when we work with large datasets, we can access and process pieces of these datasets without the need to copy the underlying data buffer.

However, if we don't want this behavior, because it can be confusing and introduce possible bugs (believe me, it has happened to me), we can use the `copy` method.

In [None]:
sub_array = array2d[2:,:3].copy()
print(sub_array)

In [None]:
#Modify an element of the subarray
sub_array[1,1] = 999
print(sub_array)

Now the large array is not touched:

In [None]:
print(array2d)
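
When it is unclear whether two arrays share their data, `np.shares_memory` can be used to check. A small sketch on an illustrative array `big` (not the notebook's `array2d`):

```python
import numpy as np

big = np.arange(20).reshape(4, 5)
view = big[1:3, 1:4]            # plain slicing gives a view
copied = big[1:3, 1:4].copy()   # explicit, independent copy

print(np.shares_memory(big, view))     # True
print(np.shares_memory(big, copied))   # False

copied[0, 0] = -1   # leaves `big` untouched
view[0, 1] = -5     # changes big[1, 2]
print(big[1, 1], big[1, 2])
```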

Useful trick: Many elements of the large array can be modified through the small array using slicing.

In [None]:
array2d_center[:2,:2]=[[42,42],[42,42]]

In [None]:
print(array2d)

> **Puzzle 4:** Replace the last two columns with zeros using slicing, `np.shape` and `np.zeros` (assuming you don't know the size of the array beforehand).

In [None]:
#Answers.ipynb

## Numpy integrated universal functions

The power of **numpy** lies in its speed and efficiency in performing operations on large arrays.

Functions that operate element by element on entire arrays are called ***universal functions*** or, for short, uFuncs.

### Arithmetic

In [None]:
x = np.arange(5)
print("x     =", x)
print("x + 5 =", x + 5)
print("x - 5 =", x - 5)
print("x * 2 =", x * 2)
print("x / 2 =", x / 2)
print("x // 2 =", x // 2)  # floor division
print("-x     = ", -x)
print("x ** 2 = ", x ** 2)
print("x % 2  = ", x % 2)   # modulo

These operators are actually wrappers around the corresponding numpy functions:

In [None]:
np.add(x,5)

In [None]:
np.floor_divide(x,2)

More functions like this can be found in the documentation of the <a href="https://www.numpy.org/devdocs/reference/ufuncs.html#available-ufuncs" target="_blank">uFuncs</a> .
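
A brief sketch of two things the function form offers beyond the operators (variable names are illustrative): an `out` argument that writes the result into an existing array, and the `reduce` method available on binary uFuncs.

```python
import numpy as np

values = np.arange(5)
print(np.array_equal(values + 5, np.add(values, 5)))      # operator and function agree
print(np.array_equal(values ** 2, np.power(values, 2)))

# `out` stores the result in a pre-allocated array instead of creating a new one
result = np.empty(5, dtype=values.dtype)
np.multiply(values, 2, out=result)
print(result)

# `reduce` applies the uFunc repeatedly along the array; for np.add this is the sum
print(np.add.reduce(values))   # 10
```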

### Trigonometric functions

Numpy can compute mathematical functions very efficiently over a large array. They are computed to within machine precision, so tiny values can appear instead of exact zeros.

In [None]:
# An array of 3000 elements between 0 and Pi.
theta = np.linspace(0, np.pi, 3000)
print("theta      = ", theta[0])
print("sin(theta) = ", np.sin(theta)[-1])
print("cos(theta) = ", np.cos(theta)[0])
print("tan(theta) = ", np.tan(theta)[-1])

Numpy also offers mathematical functions like `exp` and `log` and versions that are more accurate for tiny numbers, like **`expm1`** and **`log1p`**. 

In [None]:
# For tiny x values, log(1+x) and exp(x)-1 are very very close to x.
x = np.array([0., 1e-10, 1e-12, 1e-14])
print("     exp(x) - 1 =", np.expm1(x))
print("std: exp(x) - 1 =", np.exp(x)-1.0)
print("     log(1 + x) =", np.log1p(x))
print("std: log(1 + x) =", np.log(1.0+x))

There are many more functionalities, and this is just a rough overview. In the documentation and in many other excellent tutorials and books, such as the 
    <a href="https://jakevdp.github.io/PythonDataScienceHandbook/"  target="_blank">Python Data Science Handbook</a>, one can find much more on these topics.

## Slowness of loops and lists vs. Numpy 

In a lower-level programming language, like C, we would write an explicit loop. Translated to Python, this gives the following non-pythonic function to compute the reciprocal of every element of a list

In [None]:
def compute_reciprocals(values):
    output = []
    for i in range(len(values)):
        output.append(1.0 / values[i])
    return output

In [None]:
a_list = np.random.randint(1,10,10000000).tolist()  #notice the tolist method here
print(a_list[0])

Computing the reciprocals will be slow, since each element of the Python list is a full Python object and the loop is executed by the interpreter, one element at a time.

In [None]:
%%time
b_list = compute_reciprocals(a_list)
#print(b_list[0])

> `%%time` is one of the magic functions that we will see explained in the next session.

In [None]:
a_array = np.random.randint(1,10,10000000)

In Numpy we don't even need to define a function, we just calculate `1.0 / array`.

In [None]:
%%time
b_array = 1.0 / a_array
#print(b_array)

> 25 times faster?
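
Outside of Jupyter, where the `%%time` magic is not available, the same comparison can be sketched with the standard-library `timeit` module. The array size, the number of repetitions and the helper name `reciprocals_loop` are arbitrary choices for illustration, and the measured ratio depends on the machine.

```python
import timeit
import numpy as np

def reciprocals_loop(values):
    # Pure-Python loop over the list elements
    return [1.0 / v for v in values]

numbers = np.random.randint(1, 10, 100_000)
as_list = numbers.tolist()

loop_time = timeit.timeit(lambda: reciprocals_loop(as_list), number=10)
numpy_time = timeit.timeit(lambda: 1.0 / numbers, number=10)
print(f"loop: {loop_time:.4f} s, numpy: {numpy_time:.4f} s")

# Both approaches give the same values
print(np.allclose(reciprocals_loop(as_list), 1.0 / numbers))
```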

## Broadcasting

One of the most powerful features of `numpy` is, as we just saw, the fact that one can operate directly on entire arrays, element by element, without the need of cumbersome loops.

***Broadcasting*** is a way of applying uFuncs to arrays of different shapes. The simplest cases are addition, multiplication and so on, but it works with any uFunc that takes two arguments.

In [None]:
# Add a scalar to an array
s=5
ones = np.ones((3,5))
sixes = ones + s

In [None]:
print(sixes)

In [None]:
# Add two arrays
ones + sixes

> **<font color='red'>NOTE:</font>** Lists do not behave like that: when added, they are concatenated.

In [None]:
#Convert numpy arrays to lists
ones.tolist() + sixes.tolist()

One can also add arrays of different sizes, thanks to broadcasting.

In [None]:
a = np.array([0,1,2])
b = np.ones((3,3))
print('a= ',a)
print('b=', b)

In [None]:
a+b

In [None]:
print('shape: ', a.shape, '| dimensions: ', a.ndim, '| size: ', a.size)

In [None]:
print('shape: ', b.shape, '| dimensions: ', b.ndim, '| size: ', b.size)

In this case what `numpy` has done is to compare the shapes of the arrays:

  * If they differ in the number of dimensions, the shape of the array with fewer dimensions is padded with 1s on the left.
  * Then, if the shapes do not match in some dimension, the array whose size is 1 in that dimension is stretched to match the size of the other array.
  * If the sizes differ in some dimension and neither of them is 1, an error is raised.
  * Otherwise, the two arrays can now be combined element by element.
<img src="./materials/02.05-broadcasting.png" width="600">
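
The rules above can be checked in a short sketch. It uses `np.broadcast` to query the resulting shape and `np.broadcast_to` to make the stretching visible; the names `vec`, `mat` and `stretched` are illustrative.

```python
import numpy as np

vec = np.array([0, 1, 2])   # shape (3,)
mat = np.ones((3, 3))       # shape (3, 3)

# (3,) is treated as (1, 3), then the length-1 axis is stretched to 3
print(np.broadcast(vec, mat).shape)   # (3, 3)
stretched = np.broadcast_to(vec, (3, 3))
print(stretched)

# Shapes (3,) and (4,) disagree and neither size is 1, so broadcasting fails
try:
    vec + np.ones(4)
except ValueError as err:
    print("broadcast error:", err)
```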

For the third graphical example we can create row and column vectors using slicing together with `np.newaxis`.

In [None]:
x = np.array([0,1,2])
# row vector
row = x[np.newaxis, :]
print(row)

What `np.newaxis` does is to add a new axis of length 1 at the position where it is used, increasing the number of dimensions of the array by 1.

In [None]:
row.shape

In [None]:
#column vector
col = x[:,np.newaxis]
print(col)

In [None]:
col.shape

In [None]:
row + col

Notice that with simple 1-d arrays, we do get the expected scalar in the dot product:

In [None]:
np.dot(x,x)

With the row and column vectors, in contrast, we get a 2-d array: a 1x1 matrix for `row` times `col`, and a 3x3 matrix for `col` times `row`:

In [None]:
np.dot(row,col)

In [None]:
np.dot(col,row)
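
A compact sketch of the shapes involved, using the `@` matrix-multiplication operator and `np.outer` as alternatives to `np.dot` (variable names are illustrative):

```python
import numpy as np

v = np.array([0, 1, 2])
row_vec = v[np.newaxis, :]   # shape (1, 3)
col_vec = v[:, np.newaxis]   # shape (3, 1)

print(np.dot(v, v))                # 5, a scalar
print((row_vec @ col_vec).shape)   # (1, 1)
print((col_vec @ row_vec).shape)   # (3, 3)

# Column times row is the outer product of the 1-d array with itself
print(np.array_equal(col_vec @ row_vec, np.outer(v, v)))
```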

Broadcasting is much more powerful and has many more subtleties that we cannot cover here. I refer again to  <a href="https://jakevdp.github.io/PythonDataScienceHandbook/"  target="_blank">Python Data Science Handbook</a> for a deeper treatment of this.


## Fancy indexing and masking

Finally, let's look at a couple of important tricks and methods available for numpy arrays, which make a scientist's life much easier.

### Obtaining several entries of an array at once

Fancy indexing is simply the ability to access arrays not only with slices, but also with lists or arrays of integers or booleans.

In [None]:
a100 = np.arange(100)

Let's say we wanted to obtain entries 1, 21, 41 and 61 and put them in an array. In other languages one would do something like:

In [None]:
[a100[1], a100[21], a100[41] , a100[61]]

**Option 1:** In python we could use slicing:

In [None]:
a100[1:62:20]

**Option 2:** We could use integers:

In [None]:
#accessing entries with integers
a100[[1,21,41,61]]

**Option 3:** Or we could define a list and extract only those that match using booleans (masking)

In [None]:
extract = [1,21,41,61]
mask = np.array([aa in extract for aa in a100])
print(mask)

In [None]:
a100[mask]
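
The mask above is built with a Python list comprehension. As a sketch of a vectorized alternative, numpy provides `np.isin`, which tests membership for every element at once (the names `numbers` and `wanted` are illustrative):

```python
import numpy as np

numbers = np.arange(100)
wanted = [1, 21, 41, 61]

mask_loop = np.array([n in wanted for n in numbers])   # element by element in Python
mask_isin = np.isin(numbers, wanted)                   # same test inside numpy

print(np.array_equal(mask_loop, mask_isin))
print(numbers[mask_isin])
```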

Fancy indexing is very powerful, because it can be used for assigning:

In [None]:
a100[[10,20,30]] = -1000

In [None]:
a100

The result takes the shape of the index array, rather than the shape of the original array, if it is indexed with an array of integers:

In [None]:
idx = np.array([[40,50],[60,70]])

In [None]:
a100[idx]

### Masking and comparisons

In [None]:
# Obtain booleans for values smaller than 0
a100 < 0

In [None]:
# Get the values
a100[a100 < 0]

In [None]:
#Get all odd values
a100[a100 %2 != 0]

> **<font color='red'>NOTE:</font>** Note how powerful this masking and comparison is compared to other languages.

Other functions like `np.sum` or `np.count_nonzero` can be applied to the boolean arrays produced by these comparisons (`True` counts as 1, `False` as 0):

> **Puzzle 5:** What is the result of np.count_nonzero(a100 > 90)? 

In [None]:
#Answer Puzzle 5:
#Uncomment to see the answer:
#np.count_nonzero(a100 > 90)

In [None]:
#Sum all entries larger than 90
np.sum(a100[a100>90])

In [None]:
#Sum all boolean entries larger than 90, equivalent to count_nonzero
np.sum(a100>90)

One can also combine comparisons with logical operators:

|Operator|uFunc|
|------|---------------|
|&|`np.bitwise_and`|
|^|`np.bitwise_xor`|
|\||`np.bitwise_or`|
|~|`np.bitwise_not`|

In [None]:
#Number of integers larger than 80, which are odd:
np.sum((a100 > 80) & ~(a100%2 == 0))
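
Two details are easy to miss when combining comparisons, sketched here on a small illustrative array `data`: the parentheses around each comparison are required because `&` binds more tightly than `>` or `==`, and the Python keyword `and` does not work element-wise on arrays.

```python
import numpy as np

data = np.arange(10)
both = (data > 3) & (data % 2 == 1)              # parentheses are required
same = np.logical_and(data > 3, data % 2 == 1)   # equivalent function form

print(data[both])                   # [5 7 9]
print(np.array_equal(both, same))   # True

# `and` asks for the truth value of a whole array, which is ambiguous
try:
    (data > 3) and (data % 2 == 1)
except ValueError as err:
    print("ValueError:", err)
```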

One can also create a boolean array right at creation by specifying `dtype=bool`; the entries `1` and `0` are then converted to `True` and `False`.

In [None]:
identity_mask=np.identity(5, dtype=bool)

Which can be neatly used as a mask

In [None]:
#Create a matrix from its diagonal using np.diag and replace the zeros with 7.

int_arr = np.diag(np.arange(1,6))
int_arr[int_arr == 0] = 7
print(int_arr)

In [None]:
#Apply the mask and return a 1-d array.
int_arr[identity_mask]

## Exercises

### Exercise 1:

> Using broadcasting and in max. 2 lines of code, construct a multiplication table of the numbers from 1 to 10, i.e., where for each column corresponding to 1,2,3,..., the rows correspond to their integer multiples. For 1 to 3, it looks like this:

---

|1|2|3|
|--|--|--|
|2|4|6|
|3|6|9|

> Then, using masking, set all multiples of 3 to zero, in one line of code, so that the result (in this case for 3x3) looks like this:

|1|2|0|
|--|--|--|
|2|4|0|
|0|0|0|

> Finally, compute the sum over each column and write it into a list:

                                                      [3,6,0]


### **Exercise 2:** 
  * Produce a random set of points, following a normal distribution with mean 0 and covariance matrix `cov=[[1, 3.0/5], [3/5.0, 2]]`.
  * Plot the set of points using `plt.scatter(X, Y)`
  * Compute the marginal distributions for the x and y coordinates and using numpy functions, check that they follow a Gaussian distribution when compared to their histograms. Gaussian distribution:
  $$ \mathcal{N}(x) = \frac{1}{\sqrt{2 \pi \sigma^2}} \exp(-(x-\mu)^2/ 2 \sigma^2)$$
  * Here $\mu$ is the mean and $\sigma$ the standard deviation.
  * (Hints: `plt.hist(points, bins, density=True)`, `np.exp`, `np.sqrt`, `np.std`, `np.mean`). Use 25 bins for the histogram.
  * Using masking and `np.where` remove all points whose y-coordinate is more than 2-sigma away from the mean.
  * Plot the remaining points in a scatter plot with blue color, with `facecolor='none'` and size `s=200`,  together with the old set of points in red color with a transparency of alpha=0.6.

> **[Answers](./Answers.ipynb) to puzzles and exercises**

> **[Next Chapter: Pandas](./Pandas-Intro.ipynb)**