### MEDC0106: Bioinformatics in Applied Biomedical Science

<p align="center">
  <img src="../../resources/static/Banner.png" alt="MEDC0106 Banner" width="90%"/>
  <br>
</p>

---------------------------------------------------------------

# 11 - NumPy (supplementary material)

*Written by:* Oliver Scott

**This notebook provides a general introduction to NumPy.**

Do not be afraid to make changes to the code cells to explore how things work!

### What is NumPy?

[**NumPy**](https://numpy.org/) is a widely-used Python package that provides multidimensional array and matrix data structures.

As described in the [NumPy User Guide](https://numpy.org/devdocs/user/absolute_beginners.html):

> NumPy (Numerical Python) is an open-source Python library that’s essential in nearly every field of science and engineering. It serves as the standard for numerical data handling in Python and forms the foundation of the scientific Python and PyData ecosystems. NumPy is used by everyone from beginner coders to advanced researchers engaged in scientific and industrial R&D. Its API is integral to other key data science libraries, including Pandas, SciPy, Matplotlib, scikit-learn, scikit-image, and more.

NumPy enables scientists and engineers to develop high-performance software with the simplicity of Python and speeds close to those of C.

In this notebook, we’ll explore the basics of NumPy. This library’s versatility and depth could easily merit an entire lecture series! 

-----

## Contents

1. [The basics](#The-basics)
2. [Indexing and slicing](#Indexing-and-slicing)
3. [Basic operations](#Basic-operations)
4. [Broadcasting](#Broadcasting)
5. [Reshaping](#Reshaping)
6. [Implementing equations](#Implementing-equations)
7. [Discussion](#Discussion)

-----

### Extra resources:

- [Learn NumPy](https://numpy.org/learn/) - Recommended learning material from the NumPy developers.
- [RealPython: NumPy Tutorial](https://realpython.com/numpy-tutorial/)
- [TutorialsPoint](https://www.tutorialspoint.com/numpy/index.htm)
- [W3Schools](https://www.w3schools.com/python/numpy/numpy_intro.asp)
- [NumPy CheatSheet](https://www.datacamp.com/community/blog/python-numpy-cheat-sheet)

-----

### Reference:

- [NumPy: the absolute basics for beginners](https://numpy.org/devdocs/user/absolute_beginners.html#)

## The basics

Importing `numpy` is no different to any other package/module. NumPy users often use the `np` alias to keep code clean.

In [None]:
import numpy as np

array = np.array([1,2,3,4])

print(array)

#### NumPy arrays vs. Python lists

- NumPy arrays can only hold one type of data, unlike Python lists.
- NumPy arrays consume less memory than Python lists.
- Numerical operations on NumPy arrays are typically much faster than equivalent Python loops over lists.
- NumPy arrays have a fixed size.
- NumPy provides numerous fast mathematical operations that can be applied over arrays.
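
A couple of the differences listed above can be seen in a minimal sketch. The variable names here are illustrative only:

```python
import numpy as np

# A Python list can mix types; a NumPy array converts everything to one dtype
mixed_list = [1, 2.5, 3]
mixed_array = np.array(mixed_list)
print(mixed_array.dtype)  # float64: the integers were converted to floats

# Arithmetic is applied element-wise to an array, whereas `*` repeats a list
numbers_list = [1, 2, 3]
numbers_array = np.array(numbers_list)
doubled_list = numbers_list * 2    # [1, 2, 3, 1, 2, 3]
doubled_array = numbers_array * 2  # [2 4 6]

print(doubled_list)
print(doubled_array)
```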

##### The structure of an array

The array is the core data structure in the NumPy library, consisting of a grid of values that all share the same data type, or `dtype`. This grid can be indexed similarly to Python lists, as well as by tuples of nonnegative integers, by Booleans, by another NumPy array, or by integers.

----

**Array structure:**

<p align="center">
  <img src="https://i.imgur.com/mg8O3kd.png" alt="NumPy Arrays" width="100%"/>
  <br>
</p>

[Image Source](https://www.freecodecamp.org/news/exploratory-data-analysis-with-numpy-pandas-matplotlib-seaborn/)

**Multiple dimensions:**

NumPy arrays can also be multi-dimensional (1D, 2D, 3D ... ND), which means they can be used to model vectors (1D) and matrices (2D). Arrays with three or more dimensions are often referred to as tensors.

In NumPy, dimensions are referred to as "axes". A 2D array might look like this:

```python
[[0., 0., 0.],
 [1., 1., 1.]]
```

In this example, there are two axes: the first axis has a length of two, and the second has a length of three. You can access the shape of an array with the `.shape` attribute, which is a tuple representing the length of each axis. In this case, the shape would be `(2, 3)`.
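
As a small sketch, the 2D array above can be built and inspected as follows (the name `grid` is just for illustration). The `.ndim` and `.size` attributes complement `.shape`:

```python
import numpy as np

grid = np.array([[0., 0., 0.],
                 [1., 1., 1.]])

print(grid.shape)  # (2, 3): two rows (axis 0) and three columns (axis 1)
print(grid.ndim)   # 2: the number of axes
print(grid.size)   # 6: the total number of elements
```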

**Creating NumPy arrays:**

There are numerous ways to create a NumPy array. Here are some common methods:

- `np.array()`
- `np.zeros()`
- `np.ones()`
- `np.empty()`
- `np.arange()`
- `np.linspace()`

`np.array()` can be used to construct an array from a Python list.

In [None]:
array_1d = np.array([1, 2, 3, 4])                  # A 1D array
array_2d = np.array([[1, 2, 3, 4], [5, 6, 7, 8]])  # A 2D array

print('1D NumPy array:\n', array_1d)
print('\n2D NumPy array:\n', array_2d)

The `np.zeros()` function creates an array filled with zeros, while `np.ones()` creates an array filled with ones.

In [None]:
zero_array_1d = np.zeros(6)
ones_array_2d = np.ones((6, 3))

print('1D NumPy array:\n', zero_array_1d)
print('\n2D NumPy array:\n', ones_array_2d)

The `np.empty()` function creates an array of the specified shape but does not initialize its entries to any specific value. Instead, it fills the array with whatever values are already present in that memory location, which can appear as random numbers. It is faster than `np.zeros()` or `np.ones()` because it skips the step of setting each element to zero or one.

In [None]:
array = np.empty(10)

print(array)

The `np.arange()` function creates an array containing a range of numbers, similar to the Python `range()` function. A step size can also be provided.

In [None]:
array = np.arange(0, 11, 2)

print(array)

The `np.linspace()` function creates an array containing a given number of evenly spaced values within a specified interval. By default, both endpoints are included.

In [None]:
array = np.linspace(0, 20, 5)  # 5 evenly spaced values from 0 to 20 (inclusive)

print(array)
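
It is easy to confuse `np.arange()` and `np.linspace()`. The following sketch (with illustrative variable names) contrasts the meaning of their third argument:

```python
import numpy as np

# linspace: the third argument is the NUMBER of values; the stop value is included
points = np.linspace(0, 1, 5)
print(points)  # [0.   0.25 0.5  0.75 1.  ]

# arange: the third argument is the STEP size; the stop value is excluded
steps = np.arange(0, 1, 0.25)
print(steps)   # [0.   0.25 0.5  0.75]
```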

**Specifying a data type**:

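When creating an array, NumPy chooses the `dtype` automatically: `np.array()` infers it from the data, while functions such as `np.zeros()`, `np.ones()` and `np.empty()` default to `np.float64`. The `dtype` can also be specified by the user.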

Learn more about data types [here](https://www.tutorialspoint.com/numpy/numpy_data_types.htm).
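
A short sketch of how the `dtype` is chosen, using illustrative variable names. The exact default integer type depends on the platform and NumPy version, so it is only described loosely in the comment:

```python
import numpy as np

from_ints = np.array([1, 2, 3])
from_floats = np.array([1.0, 2.0, 3.0])
zeros_default = np.zeros(3)
as_float32 = np.array([1, 2, 3], dtype=np.float32)

print(from_ints.dtype)      # an integer dtype (int64 on most platforms)
print(from_floats.dtype)    # float64
print(zeros_default.dtype)  # float64
print(as_float32.dtype)     # float32
```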

In [None]:
array = np.ones((4,4), dtype=np.int64)  # Here we specify data type as integer

print(array)

## Indexing and slicing

Indexing and slicing are fundamental to working with NumPy arrays.

Indexing one-dimensional arrays is very similar to indexing Python lists.

In [None]:
array = np.array([1, 2, 3, 4])

print(array[1])
print(array[0:3])
print(array[1:])
print(array[1:-1])

When accessing arrays with more than one dimension, we can use comma-separated indexes, one for each dimension. For example, it's helpful to think of a 2D array as a data table where:

- **Axis 0** represents the rows.
- **Axis 1** represents the columns.

In [None]:
array_2d = np.array([[1, 2, 3, 4, 5], [6, 7, 8, 9, 10]], dtype=np.int64)

# Access the element on the first row, third column
print('Third element on the first row:', array_2d[0,2])

# Access the element on the second row, last column
print('Last element on the second row:', array_2d[1,-1])

# Access the entire first row
print('The first row:', array_2d[0])  # This is equivalent to [0,:]

# Access the entire second column
print('The second column:', array_2d[:,1])

Extending this logic to further dimensions:

In [None]:
array_3d = np.array([[[1, 2, 3], [4, 5, 6]], 
                     [[7, 8, 9], [10, 11, 12]]])

print('Index [0, 1, 2]:', array_3d[0, 1, 2])

The `array_3d` has a shape of (2, 2, 3), meaning:
- It has 2 "blocks" along the first axis (axis 0)
- Each block contains 2 rows along the second axis (axis 1)
- Each row has 3 columns along the third axis (axis 2)

Sometimes it can be hard to think in multiple dimensions!

To explain:

1. **First index (`0`)**: Selecting index `0` in `array_3d` returns the first "block" along the first axis. This block is:
   ```python
   [[1, 2, 3], [4, 5, 6]]
   ```
   <br>
2. **Second index (`1`)**:
   Within this block, selecting index `1` returns the second row:

   ```python
   [4, 5, 6]
   ```
   <br>
3. **Third index (`2`)**:
   In this row, selecting index `2` returns the third element:

   ```python
   6
   ```
Slicing and negative indexes can be used with multi-dimensional arrays in the same way as with lower dimensional arrays.
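
For instance, here is a brief sketch that reuses the `array_3d` values from above (the result names are illustrative):

```python
import numpy as np

array_3d = np.array([[[1, 2, 3], [4, 5, 6]],
                     [[7, 8, 9], [10, 11, 12]]])

# Negative index: the last block along axis 0
last_block = array_3d[-1]          # [[7, 8, 9], [10, 11, 12]]

# Slices on the first two axes, index on the third: first column of every block
first_columns = array_3d[:, :, 0]  # [[1, 4], [7, 10]]

# Mixed: first block, second row, from the second element onwards
tail = array_3d[0, 1, 1:]          # [5, 6]

print(last_block)
print(first_columns)
print(tail)
```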

----

#### Indexing with conditionals

We mentioned that NumPy arrays can be indexed with Booleans. Boolean arrays can be generated by using a conditional expression.

```python
x = 5
x > 10  # False
```

We can use the same type of expression with NumPy arrays to generate an array of Boolean values.

```python
array = np.array([1, 2, 3, 4, 5])
mask = array > 2  # Creates a Boolean array [False, False, True, True ...]
```

Therefore indexing based on a condition is straightforward.

In [None]:
array = np.array([[1 , 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Interested in getting all values > 5
print('Conditional indexing result:', array[array > 5])

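Boolean arrays can be combined logically using `&` (AND) and `|` (OR) to chain multiple conditions. Each condition should be wrapped in parentheses, because `&` and `|` bind more tightly than comparison operators.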

In [None]:
# We are now interested in getting all values which are > 5 AND divisible by 2
print(array[(array > 5) & (array % 2 == 0)])

# Let's break it down
condition_1 = array > 5                          # Value greater than 5
condition_2 = array % 2 == 0                     # Value divisible by 2
combined_condition = condition_1 & condition_2   # Combined condition

print(array[combined_condition])  # Same result
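
The example above uses `&`. As a complementary sketch, the `|` (OR) operator and the `~` (NOT) operator work in the same way. The result names are illustrative:

```python
import numpy as np

array = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# Values that are < 3 OR > 10
low_or_high = array[(array < 3) | (array > 10)]
print(low_or_high)     # [ 1  2 11 12]

# `~` inverts a Boolean array: values that are NOT > 5
not_above_five = array[~(array > 5)]
print(not_above_five)  # [1 2 3 4 5]
```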

The `np.nonzero()` function can be used to select elements or indexes from an array based on a condition.

Unlike Boolean indexing, this method returns the indexes where the condition is met.

Let's try it with the condition `array > 5`.

In [None]:
indexes = np.nonzero(array > 5)  # Get indexes where elements in array are greater than 5

print(indexes)  # First array is row indexes, second array is column indexes

# Use indexes to access elements in the array that meet the condition
print(array[indexes])

## Basic operations

By restricting arrays to a single data type, NumPy enables efficient execution of basic and complex mathematical operations across array axes or between arrays. These operations can be performed using standard Python operators like `+` and `*`, or with built-in NumPy functions such as `np.add()` and `np.multiply()`.

Let's try some functions that are useful for applying operations over the axes of an array.

In [None]:
array = np.array([[1, 2], [3, 4], [5, 6]])

# We can apply the sum over any axis (note we could also use `np.sum(array, axis=0)`)
print('Sum over entire array:', array.sum())
print('Sum over axis 0 (columns):', array.sum(axis=0))
print('Sum over axis 1 (rows):', array.sum(axis=1))

# The (arithmetic) mean is often a useful operation
print('\nMean over entire array:', array.mean())
print('Mean over axis 0 (columns):', array.mean(axis=0))
print('Mean over axis 1 (rows):', array.mean(axis=1))

# How about the min / max
print('\nMax over entire array:', array.max())
print('Max over axis 0 (columns):', array.max(axis=0))
print('Min over axis 1 (rows):', array.min(axis=1))

NumPy also supports element-wise mathematical operations on arrays.

In [None]:
other = np.ones(array.shape)  # Create an array of ones with the same shape as `array`

# Element-wise addition
print('Element-wise Sum:\n\n', array + other)  # Equivalent to np.add(array, other)

# Element-wise subtraction
print('\nElement-wise Difference:\n\n', array - other)  # Equivalent to np.subtract(array, other)

# Element-wise multiplication
print('\nElement-wise Multiplication:\n\n', array * other)  # Equivalent to np.multiply(array, other)

Note that the `*` operator performs element-wise multiplication, not matrix multiplication. For matrix operations, use the `np.dot()` function to compute inner products of vectors, multiply a vector by a matrix, or perform matrix multiplication.
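
The difference can be seen in a small sketch with two illustrative 2x2 matrices. The `@` operator is Python's matrix multiplication operator and gives the same result as `np.dot()` for 2D arrays:

```python
import numpy as np

a = np.array([[1, 2], [3, 4]])
b = np.array([[5, 6], [7, 8]])

elementwise = a * b            # [[ 5, 12], [21, 32]]
matrix_product = np.dot(a, b)  # [[19, 22], [43, 50]]
with_operator = a @ b          # Same as np.dot(a, b) for 2D arrays

print('Element-wise:\n', elementwise)
print('\nMatrix product:\n', matrix_product)
```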

## Broadcasting

Often, you may need to perform mathematical operations between a single number and an entire array, or between two arrays of different sizes. For example, if you have an array of temperatures in Kelvin that you want to convert to Celsius, NumPy’s broadcasting feature makes this fast and easy.

In [None]:
temp_in_kelvin = np.array([0, 100, 200, 300, 400])
print('Before conversion:', temp_in_kelvin)

temp_in_celsius = temp_in_kelvin - 273.15
print('After conversion: ', temp_in_celsius)

NumPy is designed to understand that certain operations should be applied to each element in an array individually. This concept is called **broadcasting**, which allows operations to be performed between arrays of different sizes. For broadcasting to work, the shapes of the arrays must be compatible; otherwise, NumPy will raise an error.

Let's look at an example of broadcasting with two differently sized arrays.

In [None]:
# Suppose we wish to translate a set of coordinates with a vector
coordinates = np.array([
    [1.2, 2.4, 3.6],
    [4.1, 5.5, 6.8],
    [7.1, 8.2, 9.1],
    [10.2, 11.4, 12.9]
])

vector = np.array([1., 0., -1.])

"""
Of course, we could do this with a Python loop:

translated = np.empty(coordinates.shape)
for i in range(coordinates.shape[0]):
    translated[i, :] = coordinates[i, :] + vector

However, with large arrays, this would be slow!
NumPy's broadcasting system makes this easy!
"""

# Broadcasting allows element-wise addition between `coordinates` and `vector`
translated = coordinates + vector

print('Translated Coordinates:\n\n', translated)

The line `translated = coordinates + vector` works despite the arrays having different shapes, `(4, 3)` and `(3,)`, thanks to broadcasting. Broadcasting allows NumPy to treat the `vector` as if it has the shape `(4, 3)` by virtually expanding it without actually copying the data, keeping operations efficient.
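
To make the "virtual expansion" more concrete, the sketch below uses `np.broadcast_to()` to build the expanded view explicitly. This is only an illustration; NumPy does this implicitly during the addition and you rarely need to call it yourself:

```python
import numpy as np

vector = np.array([1., 0., -1.])

# A read-only (4, 3) view of `vector`; no data is copied
expanded = np.broadcast_to(vector, (4, 3))

print(expanded.shape)                      # (4, 3)
print(expanded.strides[0])                 # 0: every row points at the same memory
print(np.shares_memory(expanded, vector))  # True
```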

<p align="center">
  <img src="https://numpy.org/devdocs/_images/broadcasting_2.png" alt="NumPy Broadcasting" width="50%"/>
  <br>
</p>

[Image source](https://numpy.org/devdocs/user/basics.broadcasting.html#basics-broadcasting)

When operating on two arrays, NumPy compares their shapes element-wise, starting with the rightmost dimensions and moving left. Two dimensions are compatible when:
- they are equal, or
- one of them is 1.

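If these conditions are not met, NumPy raises a `ValueError: operands could not be broadcast together`, indicating incompatible shapes. If one array has fewer dimensions than the other, its shape is treated as if it were padded with ones on the left, which is why `(3,)` is compatible with `(4, 3)`. Along each axis, the resulting array takes the larger of the two input sizes.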

Learn more about broadcasting [here](https://numpy.org/devdocs/user/basics.broadcasting.html#basics-broadcasting).
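
The compatibility rules can be explored with a short sketch. It assumes NumPy 1.20 or newer, where `np.broadcast_shapes()` is available. The variable names are illustrative:

```python
import numpy as np

# Ask NumPy what shape two shapes would broadcast to
print(np.broadcast_shapes((4, 3), (3,)))    # (4, 3)
print(np.broadcast_shapes((4, 1), (1, 3)))  # (4, 3)

# A (4, 1) column plus a (3,) row gives a (4, 3) table
column = np.arange(4).reshape(4, 1)
row = np.arange(3)
table = column + row
print(table.shape)  # (4, 3)

# Shapes (4, 3) and (4,) are incompatible: the rightmost sizes are 3 and 4
try:
    np.ones((4, 3)) + np.ones(4)
    raised = False
except ValueError as error:
    raised = True
    print('Incompatible shapes:', error)
```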

## Reshaping

Sometimes you may need to change the dimensions of an array. This may happen when a model takes an input in a different shape to your data. Arrays can be reshaped easily using the `.reshape()` method. When reshaping an array, the new shape must contain the same total number of elements as the original.

In [None]:
# An array with shape (3, 2)
data = np.array([[1, 2], [3, 4], [5, 6]])

print('Original:\n', data)

# Change the dimensions to (2, 3)
print('\nReshape(2, 3):\n', data.reshape(2, 3))

This can also be done to convert 1D arrays into ND arrays.

In [None]:
arr = np.arange(6)

print('Original:\n', arr)

# Change the dimensions to 2D: (2, 3)
print('\nReshape(2, 3):\n',arr.reshape((2, 3)))
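
Reshaping only succeeds when the total number of elements stays the same. A minimal sketch of what happens otherwise:

```python
import numpy as np

arr = np.arange(6)  # 6 elements

# 3 * 2 = 6 elements, so this works
print(arr.reshape(3, 2).shape)  # (3, 2)

# 4 * 2 = 8 elements, so NumPy raises a ValueError
try:
    arr.reshape(4, 2)
    raised = False
except ValueError as error:
    raised = True
    print('Cannot reshape:', error)
```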

A common operation in matrix algebra is [transposition](https://en.wikipedia.org/wiki/Transpose), which swaps the rows and columns of a matrix. This can be achieved easily in NumPy using the `.T` property or the `.transpose()` method:

In [None]:
# Transpose the NumPy array 'data'
print('Transpose:\n', data.T)

# The same result
print('\nTranspose:\n', data.transpose())
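
Note that transposing is not the same as reshaping, even when the resulting shapes match. A brief sketch using the same `data` values as above:

```python
import numpy as np

data = np.array([[1, 2], [3, 4], [5, 6]])  # Shape (3, 2)

transposed = data.T            # Rows become columns
reshaped = data.reshape(2, 3)  # Elements are kept in their original order

print('Transpose:\n', transposed)       # [[1 3 5], [2 4 6]]
print('\nReshape(2, 3):\n', reshaped)   # [[1 2 3], [4 5 6]]
```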

## Implementing equations

NumPy is widely used in the scientific community because it makes implementing mathematical formulas on arrays straightforward. For example, suppose we want to calculate the [root-mean-square deviation (RMSD)](https://en.wikipedia.org/wiki/Root-mean-square_deviation_of_atomic_positions) between the coordinates of two superimposed molecules.

Given two sets of ${n}$ points ${v}$ and ${w}$:

$$
RMSD(v, w) = \sqrt{\frac{1}{n} \sum_{i=1}^n{\|v_{i}-w_{i}\|}^2} \\
= \sqrt{\frac{1}{n} \sum_{i=1}^n{((v_{ix}-w_{ix})^2 + (v_{iy}-w_{iy})^2 + (v_{iz}-w_{iz})^2)}}
$$

The RMSD value is expressed in units of length. In structural biology, the most commonly used unit is the **Ångström (Å)**, which is equal to ${10^{-10}}$ metres.

Thanks to NumPy's element-wise (vectorised) array operations, we can implement this function without using Python loops!

In [None]:
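def rmsd(v, w):
    d = v - w                                          # Per-coordinate differences (deviation)
    squared_distances = np.sum(np.square(d), axis=1)   # ||v_i - w_i||^2 for each of the n points
    return np.sqrt(np.mean(squared_distances))         # Root of the mean over the n points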

# Coordinates V (x, y, z)
v = np.array([
    [-0.98353, 1.81095, -0.03140],
    [0.12683,  1.80418, -0.03242],
    [-1.48991, 3.22740,  0.18102],
    [-1.35042, 1.15351,  0.78475],
    [-1.35043, 1.42183, -1.00450],
    [-1.12302, 3.61652,  1.15412],
    [-2.60028, 3.23418,  0.18203],
    [-1.12302, 3.88485, -0.63514],
])

# Coordinates W (x, y, z)
w = np.array([
    [-1.18012, 1.83558, -0.02389],
    [-0.07891, 1.97662, -0.00383],
    [-1.87442, 3.17118,  0.18101],
    [-1.47136, 1.13188,  0.78415],
    [-1.47377, 1.40501, -1.00437],
    [-1.58078, 3.60175,  1.16149],
    [-2.97563, 3.03014,  0.16095],
    [-1.58318, 3.87488, -0.62704],
])

# Calculate the RMSD between two sets of coordinates
print('RMSD:', rmsd(v, w))
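
As a sanity check, the formula can also be written out point by point with a Python loop and compared with a vectorised version. This is only a sketch: `rmsd_loop`, `rmsd_vectorised`, `v_small` and `w_small` are illustrative names, and the coordinates are made up so that the result is easy to verify by hand.

```python
import numpy as np

def rmsd_loop(v, w):
    """RMSD computed point by point, mirroring the formula term by term."""
    total = 0.0
    for v_i, w_i in zip(v, w):
        dx, dy, dz = v_i - w_i
        total += dx**2 + dy**2 + dz**2
    return np.sqrt(total / len(v))

def rmsd_vectorised(v, w):
    """The same calculation without a Python loop."""
    squared_distances = np.sum(np.square(v - w), axis=1)
    return np.sqrt(np.mean(squared_distances))

# Two points each; the squared distances between the pairs are 1 and 4
v_small = np.array([[0., 0., 0.], [1., 0., 0.]])
w_small = np.array([[0., 0., 1.], [1., 2., 0.]])

print('Loop:      ', rmsd_loop(v_small, w_small))
print('Vectorised:', rmsd_vectorised(v_small, w_small))
```

Both functions should return the square root of 2.5 (the mean of 1 and 4), approximately 1.58.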

## Discussion

NumPy has become an essential part of the Python scientific data stack due to its speed and ease of use (with practice!) compared to lower-level programming languages. It is the foundation upon which many data analysis and manipulation packages are built and will remain vital for the foreseeable future. If you plan to pursue a career in data science or scientific computing, learning NumPy is crucial. This notebook has only touched on a few of its capabilities!

If you find the syntax and concepts a bit challenging, don't worry; it can take time to adjust to thinking in a new way!

Feel free to add more code cells and experiment with the concepts you have learnt.

You can use this notebook as a reference if you need to refresh your knowledge of any of the concepts explored.

If you want to learn more there are some extra external resources linked at the beginning of this notebook. You can click [here](#Contents) to go back to the top.