### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 05 - Introduction to NumPy

*Written by:* Oliver Scott

**This notebook provides a general introduction to NumPy.**

Do not be afraid to make changes to the code cells to explore how things work!

### What is NumPy?

[**NumPy**](https://numpy.org/) is a popular Python package that provides multidimensional array and matrix data structures.

Description from the [NumPy user guide](https://numpy.org/devdocs/user/absolute_beginners.html):

> NumPy (Numerical Python) is an open source Python library that’s used in almost every field of science and engineering. It’s the universal standard for working with numerical data in Python, and it’s at the core of the scientific Python and PyData ecosystems. NumPy users include everyone from beginning coders to experienced researchers doing state-of-the-art scientific and industrial research and development. The NumPy API is used extensively in Pandas, SciPy, Matplotlib, scikit-learn, scikit-image and most other data science and scientific Python packages.

NumPy lets scientists write cutting-edge software that approaches the speed of C through a much simpler Python API, because its core routines are implemented in compiled C code!

In this notebook we will learn the very basics of NumPy; the library could fill a whole lecture series by itself!

-----

## Contents

1. [The Basics](#The-Basics)
2. [Indexing and Slicing](#Indexing-and-Slicing)
3. [Basic Operations](#Basic-Operations)
4. [Broadcasting](#Broadcasting)

-----

#### Extra Resources:

- [Learn NumPy](https://numpy.org/learn/) - Recommended learning material from the NumPy developers

-----

## The Basics

Importing NumPy is no different from importing any other package or module. NumPy users conventionally use the `np` alias to keep code concise:

In [None]:
import numpy as np

array = np.array([1,2,3,4])

print(array)

#### NumPy arrays vs Python lists 

- NumPy arrays can only hold one 'type' of data, unlike Python lists
- NumPy arrays consume less memory than Python lists
- Numerical operations on NumPy arrays are typically much faster than the equivalent loops over Python lists
- NumPy arrays have a fixed size once created
- NumPy provides numerous (fast) mathematical operations that can be applied over arrays
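
A minimal sketch of two of the differences listed above: the single shared type and the elementwise operations. The variable names are illustrative only.

```python
import numpy as np

values = [1, 2, 3, 4]           # a plain Python list
numbers = np.array(values)      # the same data as a NumPy array

# Every element of an array shares one dtype, so mixed input is converted
mixed = np.array([1, 2.5, 3])   # the integers are upcast to floats
print(mixed.dtype)              # float64

# Arithmetic applies to every element at once, with no explicit loop
doubled_array = numbers * 2
doubled_list = [v * 2 for v in values]  # the list equivalent needs a loop
print(doubled_array)            # [2 4 6 8]

# Note: `values * 2` on a list repeats the list instead of doubling its values
print(values * 2)               # [1, 2, 3, 4, 1, 2, 3, 4]
```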

##### The structure of an array

The array is the fundamental data structure in the NumPy library. It consists of a grid of values that all share the same 'type', called the 'dtype' in NumPy. This grid can be indexed in a similar way to Python lists, and also by tuples of non-negative integers, by booleans, by another array, or by single integers.
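
As a quick preview of the indexing styles mentioned above (boolean indexing is covered later in this notebook), here is a small sketch using a made-up array named `grid`:

```python
import numpy as np

grid = np.array([[10, 20, 30],
                 [40, 50, 60]])

print(grid[1][2])            # list-style indexing: 60
print(grid[1, 2])            # a tuple of integers: 60
print(grid[[0, 1], [2, 0]])  # integer lists pick elements (0, 2) and (1, 0): [30 40]
```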

----

**Array structure:**

<p align="center">
  <img src="https://i.imgur.com/mg8O3kd.png" alt="NumPy Arrays" width="100%"/>
  <br>
</p>

[Image Source](https://www.freecodecamp.org/news/exploratory-data-analysis-with-numpy-pandas-matplotlib-seaborn/)


**Multiple Dimensions:**

NumPy arrays can also be multidimensional (1D, 2D, 3D ... ND), meaning the NumPy array structure can be used to model vectors (1D) and matrices (2D). Arrays with three or more dimensions are often referred to as tensors (see above).

Dimensions in NumPy are referred to as 'axes'. A 2D array may look something like this:

```python
[[0., 0., 0.],
 [1., 1., 1.]]
```

Here there are two axes: the first has a length of two and the second a length of three. You can access the shape of an array with the `.shape` attribute, which is a tuple giving the length of each axis. In this case it would be `(2, 3)`.
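
The sketch below checks this for the 2D array shown above. `ndim` and `size` are two related attributes that give the number of axes and the total number of elements.

```python
import numpy as np

grid = np.array([[0., 0., 0.],
                 [1., 1., 1.]])

print(grid.shape)  # (2, 3): axis 0 has length 2, axis 1 has length 3
print(grid.ndim)   # 2: the number of axes
print(grid.size)   # 6: the total number of elements
```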


**Creating NumPy arrays:**

There are numerous ways to create a NumPy array. We will cover some of the ways here:

- `np.array()`
- `np.zeros()`
- `np.ones()`
- `np.empty()`
- `np.arange()`
- `np.linspace()`

`np.array()` can be used to construct an array from a Python list:

In [None]:
array_1d = np.array([1, 2, 3, 4])                  # A 1D array
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # A 2D array

print('1D NumPy array:\n', array_1d)
print('\n2D NumPy array:\n', array_2d)

`np.zeros()` fills an array with zeros, while `np.ones()` fills an array with ones:

In [None]:
zero_array_1d = np.zeros(6)
ones_array_2d = np.ones((6, 3))

print('1D NumPy array:\n', zero_array_1d)
print('\n2D NumPy array:\n', ones_array_2d)

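`np.empty()` works in much the same way but does not initialise the array: its contents are whatever values happen to be in the allocated memory, so they are arbitrary rather than truly random. This makes it slightly quicker than `np.zeros()` or `np.ones()`, but every element should be assigned before it is read: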

In [None]:
array = np.empty(10)

print(array)

`np.arange()` creates an array containing a range of numbers, similarly to the Python `range()` function. As with `range()`, the stop value is excluded, and a step size parameter can also be provided:

In [None]:
array = np.arange(0, 11, 2)

print(array)

`np.linspace()` creates an array of evenly spaced values within a specified interval:

In [None]:
array = np.linspace(0, 20, 5)  # 0 -> 20 in 5 values

print(array)
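
The two functions differ in what you specify: `np.arange()` takes a step size and excludes the stop value, while `np.linspace()` takes the number of values and includes the stop value by default. A small side-by-side sketch, with illustrative variable names:

```python
import numpy as np

stepped = np.arange(0, 1, 0.25)  # fixed step of 0.25, stop value excluded
spaced = np.linspace(0, 1, 5)    # exactly 5 values, stop value included

print(stepped)  # values 0, 0.25, 0.5, 0.75
print(spaced)   # values 0, 0.25, 0.5, 0.75, 1
```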

**Specifying a dtype**

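When creating an array with functions such as `np.zeros()`, `np.ones()` or `np.empty()`, the dtype defaults to `np.float64`, while `np.array()` infers the dtype from its input. In either case the dtype can be specified by the user.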

Learn more about datatypes [here](https://www.tutorialspoint.com/numpy/numpy_data_types.htm)

In [None]:
array = np.ones((4,4), dtype=np.int64)  # Here we specify integer

print(array)
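
To see which dtype an array ended up with, inspect its `.dtype` attribute. The sketch below also uses `astype()`, which returns a converted copy of an array. The variable names are illustrative.

```python
import numpy as np

from_ints = np.array([1, 2, 3])     # dtype inferred from integer input
from_floats = np.array([1.0, 2.0])  # dtype inferred from float input
zeros_default = np.zeros(3)         # np.zeros defaults to float64

print(from_ints.dtype)      # a platform integer type, commonly int64
print(from_floats.dtype)    # float64
print(zeros_default.dtype)  # float64

# astype returns a converted copy; floats are truncated towards zero
as_ints = np.array([1.7, 2.2, -0.5]).astype(np.int64)
print(as_ints)              # [1 2 0]
```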

## Indexing and Slicing

Indexing and slicing are fundamental to working with NumPy arrays.

Indexing one-dimensional arrays is very similar to indexing Python lists.

*Try to use your prior knowledge to work out the output of the cell below before running it*

In [None]:
array = np.array([1, 2, 3, 4])

print(array[1])
print(array[0:3])
print(array[1:])
print(array[1:-1])

When accessing arrays with more than one dimension we can use comma-separated indices, one for each dimension. It may help to think of a 2D array as a data table where axis 0 represents the rows and axis 1 represents the columns:

In [None]:
array_2d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=np.int64)

# Access the element on the first row, third column:
print('3rd element on the first row:', array_2d[0,2])

# Access the element on the second row, last column:
print('Last element on the second row:', array_2d[1,-1])

# Access the entire first row
print('The first row:', array_2d[0])  # this is equivalent to [0,:]

# Access the entire second column
print('The second column:', array_2d[:,1])

Extending this logic to further dimensions is thus relatively straightforward:

In [None]:
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])  # shape (2, 2, 3)

print('Index [0,1,2]:', array_3d[0, 1, 2])

Sometimes it can be hard to think in multiple dimensions!

To explain:
    
- Index `0` selects along the first axis, which contains two 2D arrays:

```python
# The two arrays along the first axis:
# [[1, 2, 3], [4, 5, 6]]  and  [[7, 8, 9], [10, 11, 12]]

# So we select:
[[1, 2, 3], [4, 5, 6]]
```

- Index `1` selects from the two rows of that array:

```python
# So we select:
[4, 5, 6]
```

- The final index `2` then selects from that row:

```python
# So we select:
6
```

Of course, we can also use slicing and negative indices, as in the 2D example. Take some time to experiment with indexing and slicing different arrays yourself.
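
As a starting point for experimenting, here is a short sketch that applies slices and negative indices to the same 3D array:

```python
import numpy as np

array_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                     [[7, 8, 9], [10, 11, 12]]])  # shape (2, 2, 3)

print(array_3d[1])           # the second 2D block: [[7 8 9], [10 11 12]]
print(array_3d[:, 0, :])     # the first row of every block: [[1 2 3], [7 8 9]]
print(array_3d[-1, -1, -1])  # the very last element: 12
print(array_3d[:, :, 1:])    # every block without its first column, shape (2, 2, 2)
```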

----

#### Indexing with conditionals

Recall that NumPy arrays can be indexed with booleans. Boolean arrays can be generated simply using a conditional expression.

In the first session we used expressions like:

```python
x = 5
x > 10  # False
```

which generate a boolean value.

We can use the same type of expression with NumPy arrays to generate an array of boolean values:

```python
array = np.array([1, 2, 3, 4, 5])
mask = array > 2  # creates the boolean array [False, False, True, True, True]
```

Therefore indexing based on a condition is straightforward:

In [None]:
array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Interested in getting all values > 5
print('Conditional Indexing Result:', array[array > 5])

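Of course, boolean arrays can be combined logically using `&` ('AND') and `|` ('OR') to chain multiple conditions. Each condition must be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators: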

In [None]:
# We are now interested in getting all values which are > 5 AND divisible by 2 (even)
print('Result:', array[(array > 5) & (array % 2 == 0)])

# Let's break it down slightly:
condition_1 = array > 5                          # value greater than 5
condition_2 = array % 2 == 0                     # value divisible by 2
combined_condition = condition_1 & condition_2   # combined condition 'AND'

print('Result:', array[combined_condition])  # Same result!
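
The same pattern works for 'OR', and the `~` operator inverts a boolean array ('NOT'). A short sketch on the same data, stored here under the illustrative name `values`:

```python
import numpy as np

values = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# OR: values below 3 or above 10
low_or_high = values[(values < 3) | (values > 10)]
print(low_or_high)  # [ 1  2 11 12]

# NOT: ~ inverts a boolean array, here selecting the odd values
odd = values[~(values % 2 == 0)]
print(odd)          # [ 1  3  5  7  9 11]
```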

`np.nonzero()` can also be used to select elements or indices from an array.

Instead of a boolean array, this function returns the indices of the elements that are non-zero (or `True`).

Let's try with `array > 5`:

In [None]:
indices = np.nonzero(array > 5)

print('Indices:', indices)  # first array is row index, second array is column index

# Of course we can use these to index also
print('Result:', array[indices])
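
Because `np.nonzero()` returns one index array per axis, the result can be unpacked and paired up into (row, column) coordinates. A minimal sketch on the same data, with illustrative variable names:

```python
import numpy as np

values = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

rows, cols = np.nonzero(values > 5)  # one index array per axis
print(rows)  # [1 1 1 2 2 2 2]
print(cols)  # [1 2 3 0 1 2 3]

# Pair the two arrays up to read them as (row, column) coordinates
coordinates = list(zip(rows.tolist(), cols.tolist()))
print(coordinates[:2])  # [(1, 1), (1, 2)]
```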

## Basic Operations

By limiting each array to a single datatype, NumPy can perform basic and complex mathematical operations efficiently across the axes of an array or between different arrays. These operations can be executed using Python operators (`+`, `*`, etc.) or functions built into NumPy (`np.add()`, `np.multiply()`, etc.).

Let's try some functions which are useful to apply over an array's axes:

In [None]:
array = np.array([[1, 2], [3, 4], [5, 6]])

# We can apply the sum over any axis (note we could also use np.sum(array, axis=0))
print('Sum over entire array:', array.sum())
print('Sum over axis 0 (columns):', array.sum(axis=0))
print('Sum over axis 1 (rows):', array.sum(axis=1))

# The mean is often a useful operation (this is the arithmetic mean)
print('\nMean over entire array:', array.mean())
print('Mean over axis 0 (columns):', array.mean(axis=0))
print('Mean over axis 1 (rows):', array.mean(axis=1))

# How about the min / max
print('\nMax over entire array:', array.max())
print('Max over axis 0 (columns):', array.max(axis=0))
print('Min over axis 1 (rows):', array.min(axis=1))
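
The `axis` argument names the axis that is collapsed by the operation, which can feel backwards at first. The sketch below shows the shapes that result for the same array; the variable names are illustrative.

```python
import numpy as np

array = np.array([[1, 2], [3, 4], [5, 6]])  # shape (3, 2)

column_sums = array.sum(axis=0)  # collapses axis 0 (the rows), one sum per column
row_sums = array.sum(axis=1)     # collapses axis 1 (the columns), one sum per row

print(column_sums, column_sums.shape)  # [ 9 12] (2,)
print(row_sums, row_sums.shape)        # [ 3  7 11] (3,)
```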

NumPy also supports arithmetic between arrays:

In [None]:
other = np.ones(array.shape)  # Create an array of ones with the same shape as `array`

# We can easily sum two arrays elementwise
print('Elementwise Sum:\n\n', array + other)  # equivalent to np.add(array, other)

# We can calculate the difference of two arrays elementwise
print('\nElementwise Difference:\n\n', array - other)  # equivalent to np.subtract(array, other)

# Elementwise multiplication
print('\nElementwise Multiplication:\n\n', array * other)  # equivalent to np.multiply(array, other)

Note that the `*` operator performs elementwise multiplication, not matrix multiplication. Instead, we use the `np.dot()` function to compute inner products of vectors, to multiply a vector by a matrix, and to multiply matrices.
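
A small sketch of the three uses of `np.dot()` described above, with made-up values. For 2D arrays, Python's `@` operator gives the same result.

```python
import numpy as np

matrix = np.array([[1, 2], [3, 4]])
vector = np.array([10, 20])

print(np.dot(vector, vector))  # inner product: 10*10 + 20*20 = 500
print(np.dot(matrix, vector))  # matrix-vector product: [ 50 110]
print(np.dot(matrix, matrix))  # matrix-matrix product: [[ 7 10], [15 22]]
print(matrix * matrix)         # elementwise, for comparison: [[ 1  4], [ 9 16]]

# For 2D arrays the @ operator gives the same result as np.dot
print(np.array_equal(matrix @ matrix, np.dot(matrix, matrix)))  # True
```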

## Broadcasting

#### TODO

- Broadcasting
- Vectorization
- An easy RMSD example using array broadcasting
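
Until this section is written, here is a rough sketch of how these topics might fit together. Broadcasting lets NumPy combine arrays of different shapes by stretching the smaller one across the larger, and vectorization means expressing a calculation on whole arrays rather than in explicit loops. The coordinates and the `rmsd` helper below are hypothetical illustrations, and this simple RMSD performs no superposition or rotation.

```python
import numpy as np

# Hypothetical 3D coordinates for two conformations of a 3-atom molecule
coords_a = np.array([[0.0, 0.0, 0.0],
                     [1.0, 0.0, 0.0],
                     [0.0, 1.0, 0.0]])

# Shapes (3, 3) + (3,): the shift vector is broadcast to every row
coords_b = coords_a + np.array([0.0, 0.0, 2.0])

# Centre each conformation: the (3,) centroid is broadcast across all rows
centred_a = coords_a - coords_a.mean(axis=0)
centred_b = coords_b - coords_b.mean(axis=0)

def rmsd(a, b):
    """Root of the mean squared distance between matching atoms (vectorized)."""
    return np.sqrt(((a - b) ** 2).sum(axis=1).mean())

print(rmsd(coords_a, coords_b))    # 2.0, every atom moved 2 units along z
print(rmsd(centred_a, centred_b))  # 0.0, the shift vanishes after centring
```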