## Numpy tutorial GRK 2450

- Numpy is a Python library for scientific computing with powerful tools for manipulating multi-dimensional arrays.
- Many (arguably most) scientific python packages build on numpy functionality.
- Numpy performs the matrix manipulations internally using highly optimized C-code, which is often orders of magnitude faster than any implementation that one could write in pure python.
  - Therefore, you should always think about how you can use numpy functionality and functions to replace your slow python for-loops etc.
- If you use python for scientific purposes, there is no way around numpy.
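
As a rough illustration of replacing a for-loop with numpy functionality, the following sketch computes the same sum of squares twice: once with a plain python loop and once with a vectorized numpy expression. The function names `sum_of_squares_loop` and `sum_of_squares_numpy` are made up for this example.

```python
import numpy as np

def sum_of_squares_loop(values):
    # Pure python: one interpreter-level iteration per element
    total = 0.0
    for v in values:
        total += v * v
    return total

def sum_of_squares_numpy(values):
    # Vectorized: the element-wise square and the sum both run inside numpy
    arr = np.asarray(values, dtype=float)
    return float(np.sum(arr ** 2))

values = [1.0, 2.0, 3.0, 4.0]
print(sum_of_squares_loop(values))   # 30.0
print(sum_of_squares_numpy(values))  # 30.0
```

Both versions return the same number; the difference only becomes noticeable in runtime once the input gets large.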

In this tutorial, we will go through some basic numpy functionality, but we cannot cover everything. For more advanced features, you can visit the [Numpy documentation](https://numpy.org/doc/stable/) or you can also ask [ChatGPT](https://chat.openai.com/) to rewrite some of your code using fast numpy features 😃.

### Basics

In [1]:
import numpy as np

## Creating Arrays
print("## Creating arrays")

# There are many ways to create NumPy arrays. Here are a few:

# From a Python list:
a = np.array([1, 2, 3])
print(a)

# From a Python list of lists:
b = np.array([[1, 2, 3], [4, 5, 6]])
print(b)

# Using built-in NumPy functions:
c = np.zeros((2, 3))  # creates a 2x3 array of zeros
print(c)

d = np.ones((2, 3))  # creates a 2x3 array of ones
print(d)

e = np.eye(3)  # creates a 3x3 identity matrix
print(e)

## Array shapes
print("## Array shapes")

# Every numpy array has a "shape", which is a tuple of integers specifying the size 
# of each dimension.
print(a.shape)
print(b.shape)
# The array "a" simply has shape (3,). This can be interpreted as a vector with three entries.
# The array "b" has shape (2,3), which can be interpreted as a matrix with 2 rows and 3 columns.
# However, a numpy array can have an arbitrary number of dimensions, each with an arbitrary size.

# Keep in mind that the size (total number of elements) of a numpy array is fixed when it is created.
# You cannot add additional items to a dimension without creating a new array.
# This is different from python lists, where you can add additional elements and make the list larger.

## Array Indexing and Slicing
print("## Array Indexing and Slicing")

# NumPy arrays can be indexed and sliced like Python lists:
f = np.array([0, 1, 2, 3, 4, 5])
print(f[0])  # prints the first element (0)
print(f[2:5])  # prints elements 2, 3, and 4

# Multi-dimensional arrays can be indexed and sliced similarly:
g = np.array([[0, 1, 2], [3, 4, 5]]) # shape (2,3)
print(g[0, 1])  # prints the element at row 0, column 1 (1)
print(g[:, 1:])  # prints all rows, columns 1 and up

# We can add an additional `dummy` dimension of size 1 to an existing array.
# For this, we can use "None" or "np.newaxis" when slicing the array:
h = np.array([1,2,3]) # Shape (3,)
h = h[:, None] # Shape (3,1)
h = h[:,:, np.newaxis] # Shape (3,1,1)

## Numpy built-in functions and operations
print("## Numpy built-in functions and operations")

# NumPy arrays support many operations, including arithmetic operations, statistical operations, and linear algebra operations:

# Let's define some more arrays:
a = np.array([1, 2, 3, 4, 5])
b = np.arange(1, 6) # Return evenly spaced values within the given interval (default step size: 1)
c = np.zeros((3, 3))
d = np.ones((3, 3))

# Flattening arrays
print("Flattening arrays")
# Calling flatten() on an array flattens all dimensions into a single dimension:
print(d.flatten()) # Shape (9,)

# Arithmetic operations
print("Arithmetic operations")
print(f"a + b: {a + b}")
print(f"a - b: {a - b}")
print(f"a * b: {a * b}") # Multiplying two matrices / arrays is done element-wise! This is not matrix multiplication as known from math.
print(f"a / b: {a / b}\n")

# Logical operations
print("Logical operations")
# Returns a bool array of the same shape as "a", indicating for each element if a > 3 is True:
print(f"a > 3: {a > 3}") 
# Another example:
print(f"b < 3: {b < 3}\n")

# Statistical operations
print("Statistical operations")
print(f"Mean of a: {np.mean(a)}")
print(f"Median of b: {np.median(b)}")
print(f"Standard deviation of a: {np.std(a)}\n")

# Reshaping operations
print("Reshaping operations")
e = np.arange(1, 10)
print(f"Original e: {e}")
e = e.reshape((3, 3)) # "Reshape" first flattens the array (in our example, it is already flattened) and then turns it into the specified new shape.
# The total number of elements of the array remains the same, they are just spread differently across the dimensions.
print(f"Reshaped e (3x3 matrix):\n{e}\n")


## Creating arrays
[1 2 3]
[[1 2 3]
 [4 5 6]]
[[0. 0. 0.]
 [0. 0. 0.]]
[[1. 1. 1.]
 [1. 1. 1.]]
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]
## Array shapes
(3,)
(2, 3)
## Array Indexing and Slicing
0
[2 3 4]
1
[[1 2]
 [4 5]]
## Numpy built-in functions and operations
Flattening arrays
[1. 1. 1. 1. 1. 1. 1. 1. 1.]
Arithmetic operations
a + b: [ 2  4  6  8 10]
a - b: [0 0 0 0 0]
a * b: [ 1  4  9 16 25]
a / b: [1. 1. 1. 1. 1.]

Logical operations
a > 3: [False False False  True  True]
b < 3: [ True  True False False False]

Statistical operations
Mean of a: 3.0
Median of b: 3.0
Standard deviation of a: 1.4142135623730951

Reshaping operations
Original e: [1 2 3 4 5 6 7 8 9]
Reshaped e (3x3 matrix):
[[1 2 3]
 [4 5 6]
 [7 8 9]]
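
Three of the points made in the comments of the cell above can be checked directly with a few lines. This is only a small sketch; the arrays are arbitrary toy values. It uses `np.append` and the `@` operator, neither of which appears in the cell above: `np.append` returns a new array, and `@` is the matrix product.

```python
import numpy as np

# 1) Appending does not grow an array in place; np.append returns a new array.
a = np.array([1, 2, 3])
a_longer = np.append(a, 4)
print(a.shape, a_longer.shape)  # (3,) (4,)

# 2) None / np.newaxis adds a dimension of size 1 without changing the data.
h = np.array([1, 2, 3])
print(h[:, None].shape)         # (3, 1)
print(h[np.newaxis, :].shape)   # (1, 3)

# 3) "*" multiplies element-wise, "@" is the matrix product known from math.
m = np.array([[1, 2], [3, 4]])
elementwise = m * m   # entries: [[1, 4], [9, 16]]
matrix_prod = m @ m   # entries: [[7, 10], [15, 22]]
print(elementwise)
print(matrix_prod)
```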



### Broadcasting

One crucial and very powerful feature of numpy is broadcasting.

In [5]:
# When adding or multiplying two arrays, they usually need to have the same shape to be 
# able to apply the operation element-wise.
# Broadcasting allows operations such as addition or multiplication between arrays of different shapes
# by conceptually duplicating ("broadcasting") the smaller array to match the larger array (NumPy does
# not actually copy the data). This often makes it possible to vectorize array operations such that
# no slow python loops are needed.

# When operating on two arrays with different shapes, NumPy compares their shapes element-wise.
# It starts with the trailing (i.e., rightmost) dimension and works its way left. 
# Two dimensions are compatible if they are equal, or one of them is 1.
# If they are equal, then the dimensions are directly compatible and the operation can be
# performed element-wise as usual. If one of the dimensions is 1, then the array 
# is duplicated along that dimension to match the size of that dimension of the other array.

# Furthermore, the arrays do not need to have the same number of dimensions 
# (length of the shapes do not have to be the same). If one of the 
# arrays has a smaller number of dimensions, the missing dimensions are filled
# with "dummy" dimensions of size 1 from the left.

# Let's look at some examples to make this clear.

# Let's say we want to add a scalar value to a 2D array:
h = np.array([[1, 2, 3], [4, 5, 6]])
i = 2
print(h + i)  # the scalar value is broadcasted to each element of the array

# Let's look at a more complicated example:
j = np.array([1, 2, 3]) # shape (3,)
k = np.array([[1], [2]]) # shape (2,1)
print(j * k)
# Let's try to understand what is happening here in more detail.
# The shapes get aligned at the right side and then filled with ones to get the 
# same number of dimensions: (1,3) * (2,1).
# Then, as discussed above, dimensions of size 1 are duplicated / broadcasted to match the other array: 
# => (2,3) * (2,3) after repeating each array along the 1-dimensions
# Now, the arrays have the same shapes and are being multiplied as usual, element-by-element.
# Try to see if you can calculate the result of j*k yourself. Maybe you have to write it down.

# Here is another example with a 2D matrix and a 1D vector:
l = np.array([[1,3],[3,5]]) # shape (2,2)
m = np.array([2,1]) # shape (2,)
n = np.array([[2],[1]]) # shape (2,1)
print(l*m)
print(n.shape)
print(l*n)
# For l*m: align the arrays at the right side and fill with 1s to match the number
# of dimensions: (2,2)*(1,2)
# => the vector is repeated ("broadcasted") along the rows of the matrix:
# (2,2)*(2,2)
# => and then elementwise multiplication is applied as usual.
# For l*n: the shapes are (2,2)*(2,1), so the column vector is repeated along the columns instead.

# Now, try playing around with multiplying or adding some arrays of your choice of different dimensions.
# Try to understand, when the dimensions of two arrays are compatible and when not.
# Try to see if you can figure out the result returned by numpy in your head / on a piece of paper (as we did
# for the examples above).

[[3 4 5]
 [6 7 8]]
[[1 2 3]
 [2 4 6]]
[[2 3]
 [6 5]]
(2, 1)
[[2 6]
 [3 5]]
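
The compatibility rule described in the comments above (compare shapes from the right; two sizes match if they are equal or one of them is 1) can be written down as a short helper. This is a sketch for illustration only: `can_broadcast` is a made-up name, not a numpy function, and numpy performs this check internally whenever it combines two arrays.

```python
import numpy as np

def can_broadcast(shape_a, shape_b):
    """Return True if two shapes are compatible under the broadcasting rules."""
    # Walk both shapes from the right. zip stops at the shorter shape,
    # which corresponds to padding the missing dimensions with size 1.
    for dim_a, dim_b in zip(reversed(shape_a), reversed(shape_b)):
        if dim_a != dim_b and dim_a != 1 and dim_b != 1:
            return False
    return True

print(can_broadcast((3,), (2, 1)))   # True  -> result shape (2, 3)
print(can_broadcast((2, 2), (2,)))   # True  -> result shape (2, 2)
print(can_broadcast((2, 3), (2,)))   # False -> 3 vs. 2 in the last dimension

# numpy agrees: the compatible case works, the incompatible one raises a ValueError.
print((np.ones((3,)) * np.ones((2, 1))).shape)  # (2, 3)
try:
    np.ones((2, 3)) + np.ones(2)
except ValueError as err:
    print("Incompatible shapes:", err)
```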


#### Example: Computation of distance matrix between pairs of atoms

To showcase how powerful using broadcasting can be, we use a simple example. Given an array of 2D points (shape: (N,2)), we want to get the distance matrix D_ij that contains the distances between all pairs of points.

We start with a very inefficient implementation to compute this matrix using for-loops and python functionality.

In [None]:
import numpy as np

def get_distances_loop(coords):
  # Create an empty array for the result:
  distances = np.zeros((len(coords), len(coords)))

  # Fill the array:
  for i in range(len(coords)):
      for j in range(len(coords)):
          distances[i, j] = np.sqrt((coords[i][0] - coords[j][0]) ** 2 + (coords[i][1] - coords[j][1]) ** 2)

  return distances

# Create a list of coordinates
coords = [(0, 0), (1, 1), (2, 2)]

print(get_distances_loop(coords))

[[0.         1.41421356 2.82842712]
 [1.41421356 0.         1.41421356]
 [2.82842712 1.41421356 0.        ]]


Now, try to reimplement the computation of the distance matrix yourself using the broadcasting rules that we saw above. Here are some hints on how to do this:
- Create a numpy array from the python list `coords`
- Currently, this array has shape (N,2). Create two new arrays, one of shape (N,1,2) and one of shape (1,N,2). Use np.newaxis or None when indexing (see above) to achieve this.
- Subtract these two arrays from each other. This will use broadcasting to get an array of shape (N,N,2) which contains the vector distance between each pair of atoms
- Use a combination of np.sqrt and np.sum to calculate the distance matrix of shape (N,N). Look up the documentation of np.sum if needed.

You can find the solution to this exercise at the very end of this notebook.
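
For the last hint, it may help to see how the `axis` argument of `np.sum` changes the shape of the result. The following sketch uses an arbitrary toy array of shape (2,3,2) and deliberately does not solve the exercise.

```python
import numpy as np

# Toy array of shape (2, 3, 2): two groups of three 2D vectors (values chosen arbitrarily)
x = np.array([[[3, 4], [0, 1], [1, 0]],
              [[6, 8], [0, 0], [2, 2]]])

print(np.sum(x).shape)           # ()     -> without axis, everything is summed into one number
print(np.sum(x, axis=0).shape)   # (3, 2) -> the first dimension is summed away
print(np.sum(x, axis=-1).shape)  # (2, 3) -> the last dimension is summed away
print(np.sum(x, axis=-1))        # entries: [[7, 1, 1], [14, 0, 4]]
```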

In [None]:
import numpy as np

def get_distances_numpy(coords):

  ### Your code goes here ###

  return None # Change this!

# Create a list of coordinates
coords = [(0, 0), (1, 1), (2, 2)]
print(get_distances_numpy(coords))

None


If you implemented it correctly, the numpy implementation should yield the same result as before.
We can compare the runtime of both approaches by passing a very large array into both functions:

In [None]:
import time

coords = np.random.randn(1000,2)

start = time.time()
get_distances_loop(coords)
elapsed_loop = time.time()-start
print(f"Python loop implementation took {elapsed_loop}s")

start = time.time()
get_distances_numpy(coords)
elapsed_numpy = time.time()-start
print(f"Numpy implementation took {elapsed_numpy}s")

print(f"Numpy was {elapsed_loop/elapsed_numpy} times faster than the python loop.")

Python loop implementation took 3.628126621246338s
Numpy implementation took 0.039417266845703125s
Numpy was 92.04409416432789 times faster than the python loop.


As we can see, the numpy-based implementation is a lot faster than the naive python implementation using for-loops.
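
A single measurement with `time.time()` can be noisy, so the exact speed-up factor will vary from run to run and from machine to machine. One possible way to get more stable numbers is the `timeit` module from the python standard library. The sketch below uses two small toy functions (`loop_version` and `numpy_version`, both made up for this example) rather than the distance functions from above; the repetition counts are arbitrary.

```python
import timeit
import numpy as np

values = list(range(10_000))
arr = np.array(values, dtype=float)

def loop_version():
    # Sum of squares with a plain python loop
    total = 0.0
    for v in values:
        total += v * v
    return total

def numpy_version():
    # Same quantity as a dot product of the array with itself
    return float(np.dot(arr, arr))

# Call each function 20 times per measurement, repeat the measurement 5 times,
# and keep the fastest (least disturbed) measurement.
t_loop = min(timeit.repeat(loop_version, number=20, repeat=5)) / 20
t_numpy = min(timeit.repeat(numpy_version, number=20, repeat=5)) / 20

print(f"loop:  {t_loop:.2e} s per call")
print(f"numpy: {t_numpy:.2e} s per call")
```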

### Outlook: PyTorch

PyTorch is a popular deep learning framework for Python. Fundamentally, a neural network is not much more than multiplying a couple of matrices and applying some simple activation functions. These matrices are represented as Tensors in PyTorch and they work very similarly to the numpy arrays. A lot (most!) of the functions that PyTorch offers to manipulate these tensors are the same as in numpy. The main difference is that PyTorch can use your GPU to accelerate the computations that you perform on the tensors.

Once you know numpy, getting into PyTorch is easy.
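
As a small sketch of how the two libraries fit together, the following snippet converts a numpy array into a tensor, runs a computation on the GPU if one is available (and on the CPU otherwise), and converts the result back. It assumes PyTorch is installed; the array values are arbitrary.

```python
import numpy as np
import torch

# numpy array -> PyTorch tensor (torch.from_numpy shares memory with the array)
x_np = np.array([[1.0, 2.0], [3.0, 4.0]])
x_t = torch.from_numpy(x_np)
print(x_t.shape)  # torch.Size([2, 2])

# Use the GPU if one is available, otherwise stay on the CPU
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Compute on the chosen device, then bring the result back as a numpy array
y = (x_t.to(device) * 2).cpu().numpy()
print(y)
```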

Below, we repeat the operations from the beginning of this tutorial, this time written in PyTorch notation. As you can see, there is hardly any difference.

In [None]:
import torch

## Creating Tensors
print("## Creating tensors")

# From a Python list:
a = torch.tensor([1, 2, 3])
print(a)

# From a Python list of lists:
b = torch.tensor([[1, 2, 3], [4, 5, 6]])
print(b)

# Using built-in PyTorch functions:
c = torch.zeros((2, 3))  # creates a 2x3 tensor of zeros
print(c)

d = torch.ones((2, 3))  # creates a 2x3 tensor of ones
print(d)

e = torch.eye(3)  # creates a 3x3 identity matrix
print(e)

## Tensor shapes
print("## Tensor shapes")

print(a.shape)
print(b.shape)

## Tensor Indexing and Slicing
print("## Tensor Indexing and Slicing")

f = torch.tensor([0, 1, 2, 3, 4, 5])
print(f[0])  # prints the first element (0)
print(f[2:5])  # prints elements 2, 3, and 4

g = torch.tensor([[0, 1, 2], [3, 4, 5]]) # shape (2,3)
print(g[0, 1])  # prints the element at row 0, column 1 (1)
print(g[:, 1:])  # prints all rows, columns 1 and up

h = torch.tensor([1,2,3]) # Shape (3,)
h = h[:, None] # Shape (3,1)

## PyTorch built-in functions and operations
print("## PyTorch built-in functions and operations")

# Let's define some more tensors:
a = torch.tensor([1, 2, 3, 4, 5])
b = torch.arange(1, 6) 
c = torch.zeros((3, 3))
d = torch.ones((3, 3))

# Flattening tensors
print("Flattening tensors")
print(d.flatten()) 

# Arithmetic operations
print("Arithmetic operations")
print(f"a + b: {a + b}")
print(f"a - b: {a - b}")
print(f"a * b: {a * b}")
print(f"a / b: {a / b}\n")

# Logical operations
print("Logical operations")
print(f"a > 3: {a > 3}") 
print(f"b < 3: {b < 3}\n")

# Statistical operations
print("Statistical operations")
print(f"Mean of a: {torch.mean(a.float())}")
print(f"Median of b: {torch.median(b.float())}")
print(f"Standard deviation of a: {torch.std(a.float(), correction=0)}\n")

# Reshaping operations
print("Reshaping operations")
e = torch.arange(1, 10)
print(f"Original e: {e}")
e = e.reshape((3, 3))
print(f"Reshaped e (3x3 matrix):\n{e}\n")

## Creating tensors
tensor([1, 2, 3])
tensor([[1, 2, 3],
        [4, 5, 6]])
tensor([[0., 0., 0.],
        [0., 0., 0.]])
tensor([[1., 1., 1.],
        [1., 1., 1.]])
tensor([[1., 0., 0.],
        [0., 1., 0.],
        [0., 0., 1.]])
## Tensor shapes
torch.Size([3])
torch.Size([2, 3])
## Tensor Indexing and Slicing
tensor(0)
tensor([2, 3, 4])
tensor(1)
tensor([[1, 2],
        [4, 5]])
## PyTorch built-in functions and operations
Flattening tensors
tensor([1., 1., 1., 1., 1., 1., 1., 1., 1.])
Arithmetic operations
a + b: tensor([ 2,  4,  6,  8, 10])
a - b: tensor([0, 0, 0, 0, 0])
a * b: tensor([ 1,  4,  9, 16, 25])
a / b: tensor([1., 1., 1., 1., 1.])

Logical operations
a > 3: tensor([False, False, False,  True,  True])
b < 3: tensor([ True,  True, False, False, False])

Statistical operations
Mean of a: 3.0
Median of b: 3.0
Standard deviation of a: 1.4142135381698608

Reshaping operations
Original e: tensor([1, 2, 3, 4, 5, 6, 7, 8, 9])
Reshaped e (3x3 matrix):
tensor([[1, 2, 3],
        [4, 5, 6],
        [7, 8, 9]])

## Solutions

In [None]:
## Solution to the distance matrix calculation:

import numpy as np

def get_distances_numpy(coords):

  # Convert the list of coordinates to a numpy array
  coords_array = np.array(coords)

  # Compute the distance matrix using broadcasting:
  # (N,1,2) - (1,N,2) => the result has shape (N,N,2) and contains the
  # difference vector between every pair of atoms
  differences = coords_array[:, np.newaxis, :] - coords_array[np.newaxis, :, :]
  distances = np.sqrt(np.sum(differences ** 2, axis=-1))

  return distances

# Create a list of coordinates
coords = [(0, 0), (1, 1), (2, 2)]
print(get_distances_numpy(coords))
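
One way to convince yourself that the broadcasting solution is correct is to compare it against the loop implementation on random input. The sketch below repeats both functions so that it runs on its own; the seed and the number of points are arbitrary. It also shows `np.linalg.norm` with `axis=-1` as an alternative to the combination of `np.sqrt` and `np.sum`.

```python
import numpy as np

def get_distances_loop(coords):
    # Reference implementation with explicit python loops
    n = len(coords)
    distances = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            dx = coords[i][0] - coords[j][0]
            dy = coords[i][1] - coords[j][1]
            distances[i, j] = np.sqrt(dx ** 2 + dy ** 2)
    return distances

def get_distances_numpy(coords):
    # Broadcasting implementation: (N,1,2) - (1,N,2) => (N,N,2)
    coords_array = np.asarray(coords, dtype=float)
    differences = coords_array[:, np.newaxis, :] - coords_array[np.newaxis, :, :]
    return np.sqrt(np.sum(differences ** 2, axis=-1))

rng = np.random.default_rng(0)          # fixed seed, chosen arbitrarily
coords = rng.standard_normal((50, 2))   # 50 random 2D points

d_loop = get_distances_loop(coords)
d_numpy = get_distances_numpy(coords)
# Alternative: let np.linalg.norm take the Euclidean norm over the last axis
d_norm = np.linalg.norm(coords[:, None, :] - coords[None, :, :], axis=-1)

print(np.allclose(d_loop, d_numpy))  # True
print(np.allclose(d_numpy, d_norm))  # True
```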