# NumPy

The programming language of choice for CS5228 is Python. Python is a very popular language that is -- in comparison to other languages -- easy to read, easy to write, and thus easy to learn. Ideally, you already have some hands-on experience with Python from previous modules or simply by working on your own scripts and programs. This module cannot introduce Python to beginners. If needed, however, the popularity of Python has resulted in a wide range of excellent tutorials that are freely available online (incl. comprehensive courses on YouTube).

Python was not originally designed for "number crunching", i.e., numerical computations over large volumes of data. However, its popularity and widespread adoption spurred the development of highly optimized packages. These packages are written in "fast" programming languages such as C/C++ and provide bindings to be used in Python programs. One of the most important packages is [NumPy](https://numpy.org/). Directly taken from the website:

* Fast and versatile, the NumPy vectorization, indexing, and broadcasting concepts are the de-facto standards of array computing today.
* NumPy offers comprehensive mathematical functions, random number generators, linear algebra routines, Fourier transforms, and more.
* NumPy supports a wide range of hardware and computing platforms, and plays well with distributed, GPU, and sparse array libraries.
* The core of NumPy is well-optimized C code. Enjoy the flexibility of Python with the speed of compiled code.
* NumPy’s high level syntax makes it accessible and productive for programmers from any background or experience level.

In short, NumPy enables number crunching with Python, and also forms the foundation for other packages supporting data analysis and data visualization. In this tutorial, we explore the basic features of NumPy to acknowledge and appreciate its power and benefits.

Let's get started...

### Setting Up the Notebook

The focus here is of course on NumPy. The `sys` package is only needed for a quick example later on.

In [1]:
import numpy as np  # Using the alias np is not required but very common and almost standard

import sys

NumPy can print multidimensional arrays, by default with high precision and in scientific notation. To make the output easier to read, we limit the output to 4 decimal places and suppress the scientific notation. Feel free to remove this line and see how the output changes.

In [2]:
np.set_printoptions(precision=4, suppress=True)

## Arrays

The core concept of NumPy is the homogeneous multidimensional array. That means an array is a grid of (typically numerical) entries, all of the same type. While Python already allows creating grids using lists of lists, NumPy arrays are a much more powerful concept for scientific computing due to their highly optimized implementation and the methods to index and manipulate arrays.

### Basic Usage

#### Creating Arrays

[`np.array()`](https://numpy.org/doc/stable/reference/generated/numpy.array.html) creates a NumPy array given an "array-like" or "grid-like" input. This is most commonly a normal list of lists. However, note that the following 2 requirements need to be fulfilled:

* All elements have the same type (NumPy may perform some automated casting if needed)

* All lists at the same nesting level must have the same length, i.e., the grid must be "complete".

The commented example below should make this clearer:

In [3]:
py_list =  [[2, 0, 1], [1, 3, 2]]     ## Correct format: consistent lengths, same datatype (here: int)
#py_list =  [[2, 0, 1], [1, 3, 2.0]]   ## Acceptable format: consistent lengths, all entries treated as floats/double
#py_list =  [[2, 0, 1], [1, 3]]        ## Inconsistent sizes: no 2d array (older NumPy: 1d array of lists; NumPy 1.24+: ValueError unless dtype=object)
#py_list =  [[2, 0, 1], [1, 3, '2']]   ## Different data types: all entries are treated as strings

a = np.array(py_list)

print(a)

[[2 0 1]
 [1 3 2]]


It is also possible to specify the data type of the array elements. We can use the same example from above containing only integer values but create an array of float values.

In [5]:
a = np.array([[2, 0, 1], [1, 3, 2]], dtype=np.float64)  # the alias np.float was removed in NumPy 1.24

print(a)

[[2. 0. 1.]
 [1. 3. 2.]]


While 2d arrays are easier to print and to interpret, NumPy arrays can have multiple dimensions (called **axes**). The following example creates a 3d array, so the input grid is a list of lists of lists.

In [6]:
a = np.array([[[2, 0, 1], [1, 3, 2]], [[4, 4, 6], [3, 5, 1]]])

print(a)

[[[2 0 1]
  [1 3 2]]

 [[4 4 6]
  [3 5 1]]]


#### Describing Arrays

In [7]:
a = np.array([[[2, 0, 1], [1, 3, 2]], [[4, 4, 6], [3, 5, 1]]], dtype=np.int64)
#a = np.array([[[2, 0, 1], [1, 3, 2]], [[4, 4, 6], [3, 5, 1]]], dtype=np.int32)

[`np.ndarray.ndim`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.ndim.html) returns the number of dimensions (axes) of a NumPy array.

In [8]:
print(a.ndim)

3


[`np.ndarray.dtype`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.dtype.html) returns the type of the elements in a NumPy array. More information about NumPy data types can be found [here](https://numpy.org/doc/stable/reference/arrays.dtypes.html).

In [9]:
print(a.dtype)

int64


The choice of the data type affects the physical size of an array in memory. For example, an `int64` element requires 8 bytes while an `int32` element requires only 4 bytes. This is particularly important for very large arrays as it can make the difference between fitting into main memory or not.

The statement below shows the number of bytes for array `a`. Try different data types when creating `a` and see how it affects its size in memory. However, note that a NumPy array is an object with more than just the elements. So an `int32` array won't be half the size of an `int64` array, particularly when the number of elements is rather small.

In [10]:
print(sys.getsizeof(a))

224
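
To separate the size of the raw elements from the size of the whole array object, one can compare the attributes `itemsize` and `nbytes` with `sys.getsizeof()`. The following is a minimal sketch; the names `data`, `a64`, and `a32` are chosen for illustration only, and the object overhead it prints may differ between NumPy versions.

```python
import sys

import numpy as np

data = [[[2, 0, 1], [1, 3, 2]], [[4, 4, 6], [3, 5, 1]]]
a64 = np.array(data, dtype=np.int64)
a32 = np.array(data, dtype=np.int32)

# itemsize: bytes per element; nbytes: bytes used by the elements only
print(a64.itemsize, a64.nbytes)  # 8 bytes per element, 12 elements
print(a32.itemsize, a32.nbytes)  # 4 bytes per element, 12 elements

# sys.getsizeof() additionally counts the overhead of the array object itself
print(sys.getsizeof(a64) - a64.nbytes)
```

The element data of the `int32` array is exactly half as large, while the overhead of the array object stays the same regardless of the data type.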


[`np.ndarray.shape`](https://numpy.org/doc/stable/reference/generated/numpy.ndarray.shape.html) returns the shape of a NumPy array, i.e., the size of the different dimensions (axes).

In [11]:
print(a.shape)

(2, 2, 3)


### Auxiliary Methods for Array Creation

NumPy provides a wide range of methods to create new arrays. The following examples give an overview of the most common ones. Note that all the examples below create 2d arrays to keep it simple. However, methods that take the shape of the newly created array as input parameter can be used to create arrays with an arbitrary number of dimensions.

[`np.zeros`](https://numpy.org/doc/stable/reference/generated/numpy.zeros.html) creates an array with all entries being 0. The main input parameter is the `shape` as tuple of the resulting array.

In [12]:
a = np.zeros((3,5))

print(a) 

[[0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0.]]


[`np.ones`](https://numpy.org/doc/stable/reference/generated/numpy.ones.html) creates an array with all entries being 1. The main input parameter is the `shape` as tuple of the resulting array.

In [13]:
a = np.ones((3,5))

print(a) 

[[1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]
 [1. 1. 1. 1. 1.]]


[`np.full`](https://numpy.org/doc/stable/reference/generated/numpy.full.html) creates an array with all entries being a specified value. The main input parameter is the `shape` as tuple of the resulting array.

In [14]:
a = np.full((3,5), 5.8)

print(a) 

[[5.8 5.8 5.8 5.8 5.8]
 [5.8 5.8 5.8 5.8 5.8]
 [5.8 5.8 5.8 5.8 5.8]]


[`np.random.rand`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.rand.html) creates an array with all entries being random values drawn uniformly from the interval [0, 1). The main input parameter is the shape of the resulting array; note that, unlike for the methods above, the sizes of the dimensions are passed as separate arguments and not as a tuple. There are several related methods to generate random values, e.g., [`np.random.randint`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randint.html), [`np.random.randn`](https://numpy.org/doc/stable/reference/random/generated/numpy.random.randn.html), and many more.

In [15]:
a = np.random.rand(3,5)

print(a)  

[[0.788  0.5033 0.464  0.9187 0.1191]
 [0.1856 0.6047 0.471  0.2894 0.9942]
 [0.4225 0.6461 0.3155 0.752  0.2196]]
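
As a side note, newer NumPy versions also offer a generator-based interface for random numbers via `np.random.default_rng()`. The sketch below shows rough counterparts of the methods mentioned above; the seed value `42` and the variable names are arbitrary choices for illustration.

```python
import numpy as np

# A Generator object with its own state; the seed makes the results reproducible
rng = np.random.default_rng(seed=42)

uniform = rng.random((3, 5))                 # floats in [0, 1), shape given as tuple
integers = rng.integers(1, 20, size=(3, 5))  # integers in [1, 20), upper bound exclusive
normal = rng.standard_normal((3, 5))         # samples from the standard normal distribution

print(uniform.shape, integers.shape, normal.shape)
```

Here the shape is always passed as a tuple, which is consistent with `np.zeros`, `np.ones`, and `np.full`.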


[`np.eye`](https://numpy.org/doc/stable/reference/generated/numpy.eye.html) creates an array representing an identity matrix. The main input parameter is the size of the matrix. Note that an identity matrix is a square 2d matrix, so a single integer value is sufficient.

In [16]:
a = np.eye(3)

print(a)  

[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]


## Array Manipulation

NumPy provides a [series of useful methods](https://numpy.org/doc/stable/reference/routines.array-manipulation.html) to manipulate arrays, where the manipulation does not involve changing individual elements but rather the shape of the array. Here we cover just some of the most commonly used methods.

Let's create a simple example array to work with:

In [17]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

print(a)

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]


[`np.transpose()`](https://numpy.org/doc/stable/reference/generated/numpy.transpose.html) reverses or permutes the axes of an array. In case of matrices (i.e., 2d arrays) `transpose()` calculates the common matrix transpose known from your Math classes. Since it's such a common operation, there's also the shorthand notation using `T`. As our example array `a` is a matrix, the following operations all perform the same matrix transpose:

In [18]:
a.T
#a.transpose()
#np.transpose(a)

array([[ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11],
       [ 4,  8, 12]])
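
For arrays with more than 2 dimensions, the difference between reversing and permuting the axes becomes visible. The following sketch uses a hypothetical 3d array `t` of shape `(2, 3, 4)` that is not part of the examples above.

```python
import numpy as np

t = np.arange(24).reshape(2, 3, 4)

# Without arguments, the order of all axes is reversed
reversed_axes = t.T
print(reversed_axes.shape)  # (4, 3, 2)

# With an explicit permutation, only the first two axes are swapped here
swapped = np.transpose(t, (1, 0, 2))
print(swapped.shape)        # (3, 2, 4)

# The element at position [i, j, k] of t is found at [j, i, k] in swapped
print(t[1, 2, 3] == swapped[2, 1, 3])
```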

[`np.reshape()`](https://numpy.org/doc/stable/reference/generated/numpy.reshape.html) changes the shape of an array without changing the data. As such, this method requires the new shape as input parameter. Of course, the new shape must be compatible with the original shape -- that is, the product of all dimensions must remain the same. For example, array `a` has a shape of `(3, 4)`. This means we can reshape it to `(1, 12)`, `(12, 1)`, `(2, 6)`, `(6, 2)`, `(3, 2, 2)`, etc. as the product of all dimensions is always 12.

In [21]:
a.reshape(6,2)
#np.reshape(a, (6,2))   # Same effect

array([[ 1,  2],
       [ 3,  4],
       [ 5,  6],
       [ 7,  8],
       [ 9, 10],
       [11, 12]])

As long as the products of the dimensions match, the number of dimensions of the new shape does not matter -- see the example below. Of course, in practice, reshaping should result in meaningful new arrays.

In [22]:
# All the commands below will work just fine.
a.reshape(1,1,6,1,1,2,1,1)
#a.reshape(2,1,3,2,1)
#a.reshape(1,12)
#a.reshape(1,1,12)
#a.reshape(1,1,1,12)
#a.reshape(1,1,1,1,12)
#a.reshape(1,1,1,1,1,12)
#...

array([[[[[[[[ 1]],

            [[ 2]]]]],




         [[[[[ 3]],

            [[ 4]]]]],




         [[[[[ 5]],

            [[ 6]]]]],




         [[[[[ 7]],

            [[ 8]]]]],




         [[[[[ 9]],

            [[10]]]]],




         [[[[[11]],

            [[12]]]]]]]])

Just to give an example, the following reshaping fails since the product of the dimensions of the new shape is not 12:

In [23]:
#a.reshape(2, 1, 5)  # <-- This will fail
#a.reshape(2, 1, 6)  # <-- That will work

In practice, the size of an array may not be known ahead of time, making it difficult to guarantee that a new shape will be valid. `np.reshape()` therefore accepts a special dimension size `-1` to tell the method to calculate the required size of that dimension automatically.

The most common use case is to flatten a multidimensional array. Here, we do not really care about the number and size of all dimensions. We only want a new array with the shape `(1, "rest")`, i.e., a 2d array with a single row; `a.reshape(-1)` would yield a true 1d array of shape `(12,)`. We can do this for our example array `a` as follows:

In [24]:
a.reshape(1, -1)

array([[ 1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12]])
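
The difference between a single-row 2d array and a true 1d array is easy to overlook. The sketch below contrasts the two and also shows `flatten()`, which always returns a copy; the array `m` simply repeats the values of the example array under a different name.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

single_row = m.reshape(1, -1)  # 2d array with one row
flat = m.reshape(-1)           # 1d array
print(single_row.shape)        # (1, 12)
print(flat.shape)              # (12,)

# flatten() always returns a copy, so changing it leaves m untouched
flat_copy = m.flatten()
flat_copy[0] = 99
print(m[0, 0])                 # still 1
```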

The `-1` parameter can also be used in different parts of the new shape:

In [31]:
b = a.reshape(2, -1, 2)
print(b)
print(b.shape)

[[[ 1  2]
  [ 3  4]
  [ 5  6]]

 [[ 7  8]
  [ 9 10]
  [11 12]]]
(2, 3, 2)


As is easy to see, `-1` can only be used once; otherwise the calculation of the missing dimensions would generally be ambiguous. Also, it must be possible to find a size for the missing dimension such that the products of the dimensions match. As such, the following two uses will fail:

In [29]:
#a.reshape(-1, 3, -1)   # ValueError: can only specify one unknown dimension
#a.reshape(-1, 3, 3)    # ValueError: cannot reshape array of size 12 into shape (3,3) -- 3*3*?=12 does not work out

## Array Indexing

In [None]:
a = np.array([[1,2,3,4], [5,6,7,8], [9,10,11,12]])

print(a)

### Indexing Individual Elements

In [32]:
#a[2][1]

a[2,1]

10

In [33]:
a[0]

array([1, 2, 3, 4])

To get, say, the first column of `a`, we first have to understand slicing.

### Slicing

Slicing refers to indexing all elements in an array from one given index to another given index, optionally including a step size to skip certain indices between the start and end index. The general format of a slice is `[start:end]` (and `[start:end:step]` if a step size larger than 1 needs to be specified). As with Python lists, the element at the `end` index is not included.

A slice indexes the elements with respect to only one dimension (axis). So an n-dimensional array may require the definition of n slices. Our example array `a` is a 2d array of shape `(3,4)`. To get the first 2 rows, we can do

In [35]:
print(a)
print()

a[0:2]

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]



array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

The following example returns only every other row. For this, we set the step size to 2. Since we do not make any restrictions on the start and end index, we can simply leave them empty (we can also set the start index to 0; no harm in that).

In [36]:
a[::2]
#a[0::2]

array([[ 1,  2,  3,  4],
       [ 9, 10, 11, 12]])

Now let's assume we do not want the first 2 rows but the first 2 columns. This requires slicing with respect to the 2nd dimension (axis). This implies that we have to specify the first 2 values of *all* rows, which we can do by using a non-restrictive slice for the 1st dimension.

In [37]:
a[:,0:2]
#a[::,0:2]

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10]])

**Note:** Instead of `a[0:2]` to get the first 2 rows, we could also write `a[0:2,:]`, i.e., specifying a non-restrictive slice on the 2nd dimension. However, here it's not mandatory since it can be unambiguously derived.

Of course, we can use restrictive slices regarding all dimensions, for example:

In [38]:
a[0:2,0:2]

array([[1, 2],
       [5, 6]])

As is already common for standard Python lists, the indexing can also be done with respect to the end of arrays by using negative indices.

In [39]:
a[-2:,-2:]

array([[ 7,  8],
       [11, 12]])

These different ways of slicing can be arbitrarily combined to extract the relevant elements from an array. The example below returns for every other column the first 2 row values.

In [40]:
a[0:2,::2]

array([[1, 3],
       [5, 7]])

Since slicing uses only `start`, `end` and `step` for indexing, the result will always be a regularly spaced subarray of the original array. That means, for example, that it is not possible to return an array with an arbitrary order of the elements for each dimension.
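
One consequence worth knowing is that a slice is a view on the original data and not a copy, so writing to the slice also changes the original array. The sketch below illustrates this; the names `m`, `sub`, and `independent` are chosen for illustration only.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# A slice shares its memory with the original array
sub = m[0:2, 0:2]
sub[0, 0] = 100
print(m[0, 0])                   # 100 -- the original array changed as well
print(np.shares_memory(m, sub))  # True

# An explicit copy is independent of the original array
independent = m[0:2, 0:2].copy()
independent[0, 0] = -1
print(m[0, 0])                   # still 100
```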

### Integer Array Indexing

Instead of individual integers or `[start:end:step]` slices, we can use lists/arrays of integers to index elements of one dimension (axis). This allows indexing an arbitrary subset of elements in an arbitrary order.

The following example extracts the second and third row but also changes their order:

In [41]:
a[[2,1]]

array([[ 9, 10, 11, 12],
       [ 5,  6,  7,  8]])

Using non-restrictive slices, we can do the same for other axes, for example to extract the first and the last column:

In [43]:
print(a)
print()

a[:,[0,3]]

[[ 1  2  3  4]
 [ 5  6  7  8]
 [ 9 10 11 12]]



array([[ 1,  4],
       [ 5,  8],
       [ 9, 12]])

Again, one can combine integer array indexing of multiple axes:

In [44]:
a[[2,1], [0,3]]

array([9, 8])

This result is arguably not that intuitive. It becomes clearer when the integer array indexing is rewritten as multiple basic integer indexes.

In [45]:
np.array([a[2,0], a[1,3]])

array([9, 8])

This implies that the 2 arrays used for indexing the 2 axes have to be of the same length. For example, the following statement will throw an error:

In [46]:
#a[[2,1], [0,3,1]]   # np.array([a[2,0], a[1,3], a[?,1]])

### Boolean Indexing

Boolean indexing allows finding elements based on a given condition. A boolean index is an array of the same shape as the input array with all elements being `True` (element fulfils condition) or `False` (element does not fulfil condition).

In [47]:
boolean_index = (a > 5)

print(boolean_index)

[[False False False False]
 [False  True  True  True]
 [ True  True  True  True]]


Now this boolean index can be used to get all the elements from the initial array where the respective element in the boolean index is `True`:

In [48]:
a[boolean_index]

array([ 6,  7,  8,  9, 10, 11, 12])

Note that the resulting array is a 1d array since the distribution of `True` and `False` values generally does not induce a valid multidimensional array structure (see the example `boolean_index` above).

In practice, the two statements above can be combined into one to make the code more concise:

In [49]:
a[a > 5]

array([ 6,  7,  8,  9, 10, 11, 12])
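
Conditions can also be combined, and a boolean index can be used on the left-hand side of an assignment. The following is a small sketch; note that the elementwise operators `&` and `|` are needed instead of Python's `and`/`or`, and that each condition must be put in parentheses. The array `m` repeats the values of the example array.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [5, 6, 7, 8], [9, 10, 11, 12]])

# All even values that are larger than 5
evens_above_5 = m[(m > 5) & (m % 2 == 0)]
print(evens_above_5)

# Boolean indexing in an assignment: cap all values at 10 (on a copy of m)
capped = m.copy()
capped[capped > 10] = 10
print(capped)
```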

The indexing supported by NumPy is quite powerful and the examples only provide a basic introduction. You can check the [documentation](https://numpy.org/doc/stable/reference/arrays.indexing.html) for more details about indexing.

## Array Math

In [50]:
a = np.array([[1, 2, 3, 4], [2, 2, 3, 3]])
b = np.array([[4, 1, 1, 2], [1, 4, 2, 2]])

print(a)
print()
print(b)

[[1 2 3 4]
 [2 2 3 3]]

[[4 1 1 2]
 [1 4 2 2]]


### Elementwise Operations

If two arrays have the same shape, arithmetic operations (+, -, \*, /, %) between the two arrays are applied to the two elements at the same respective positions in the arrays.

#### Addition

In [51]:
a + b

array([[5, 3, 4, 6],
       [3, 6, 5, 5]])

#### Subtraction

In [52]:
a - b

array([[-3,  1,  2,  2],
       [ 1, -2,  1,  1]])

#### Multiplication

In [53]:
a * b

array([[4, 2, 3, 8],
       [2, 8, 6, 6]])

#### Division

In [54]:
a / b

array([[0.25, 2.  , 3.  , 2.  ],
       [2.  , 0.5 , 1.5 , 1.5 ]])

#### Modulo

In [55]:
a % b

array([[1, 0, 0, 0],
       [0, 2, 1, 1]])

NumPy methods implementing unary mathematical operations are also applied to each of the elements of an array, for example:

#### Square Root

In [56]:
np.sqrt(a)

array([[1.    , 1.4142, 1.7321, 2.    ],
       [1.4142, 1.4142, 1.7321, 1.7321]])

#### Sine

In [57]:
np.sin(a)

array([[ 0.8415,  0.9093,  0.1411, -0.7568],
       [ 0.9093,  0.9093,  0.1411,  0.1411]])

### Broadcasting

Broadcasting allows performing arithmetic operations on arrays of different shapes (in contrast to the elementwise operations above, which require arrays of the same shape). Of course, two arrays cannot have completely arbitrary shapes to result in a meaningful operation between them. When operating on two arrays, NumPy compares their shapes dimension by dimension. It starts with the trailing (i.e. rightmost) dimensions and works its way left. Two dimensions are compatible when

* they are equal, or

* one of them is 1



For more details, check out the [documentation on broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html).
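
The two rules above can be tried out without creating any arrays by using `np.broadcast_shapes()`, which is available in NumPy 1.20 and later. The shapes in this sketch are chosen for illustration only.

```python
import numpy as np

# Compatible: each pair of trailing dimensions is equal or contains a 1
matrix_and_column = np.broadcast_shapes((2, 4), (2, 1))
matrix_and_row = np.broadcast_shapes((2, 4), (4,))
print(matrix_and_column, matrix_and_row)

# Not compatible: the trailing dimensions are 4 and 2, and neither of them is 1
try:
    np.broadcast_shapes((2, 4), (2,))
    compatible = True
except ValueError as err:
    compatible = False
    print("Cannot broadcast:", err)
```

The last case shows that a 1d array of length 2 cannot be broadcast against a `(2, 4)` matrix; it first needs to be reshaped to `(2, 1)`.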

#### Example: Addition

First let's create a simple column vector with two elements. 

In [58]:
col = np.array([[1], [3]])

print(col)
print()
print("Shape of column vector:", col.shape)
print("Shape of matrix a:", a.shape)

[[1]
 [3]]

Shape of column vector: (2, 1)
Shape of matrix a: (2, 4)


Since all dimensions of `col` and `a` are compatible according to the definition above, we can broadcast `col` when adding it to `a`. This means that we add 1 to all elements of the first row and 3 to all elements of the second row of `a`:

In [59]:
print(a)
print()
print(a + col)

[[1 2 3 4]
 [2 2 3 3]]

[[2 3 4 5]
 [5 5 6 6]]


We can do the same with a row vector where we want to add, say, 2/4/6/8 to the first/second/third/fourth column of `a`.

In [60]:
row = np.array([2, 4, 6, 8])

print(row)
print()
print("Shape of row vector:", row.shape)
print("Shape of matrix a:", a.shape)

[2 4 6 8]

Shape of row vector: (4,)
Shape of matrix a: (2, 4)


Notice that `row` and `a` do not have the same number of dimensions. However, broadcasting is still possible since we start with the trailing (i.e. rightmost) dimensions to check if they are compatible. This is true here since the rightmost dimension of both arrays is of size 4, and that's the only dimension that needs to be compatible as `row` has only 1 dimension.

In [61]:
print(a)
print()
print(a + row)

[[1 2 3 4]
 [2 2 3 3]]

[[ 3  6  9 12]
 [ 4  6  9 11]]


Of course, broadcasting is not limited to addition but is supported for all other arithmetic operations (+, -, \*, /, %).

### In-Built Math Functions

NumPy provides a [long list of mathematical functions](https://numpy.org/doc/stable/reference/routines.math.html) to perform computations on arrays. 


#### Elementwise functions

Many of them apply mathematical operations on each element of an array, e.g.:

In [62]:
print(a)
print()

print("Calculate the square root of all elements:")
print(np.sqrt(a))
print()

print("Calculate the log of all elements:")
print(np.log(a))
print()

print("Calculate the exponential of all elements:")
print(np.exp(a))
print()

[[1 2 3 4]
 [2 2 3 3]]

Calculate the square root of all elements:
[[1.     1.4142 1.7321 2.    ]
 [1.4142 1.4142 1.7321 1.7321]]

Calculate the log of all elements:
[[0.     0.6931 1.0986 1.3863]
 [0.6931 0.6931 1.0986 1.0986]]

Calculate the exponential of all elements:
[[ 2.7183  7.3891 20.0855 54.5982]
 [ 7.3891  7.3891 20.0855 20.0855]]



Note that the elementwise arithmetic operations mentioned above can also be expressed using in-built functions. For example, `a + b` yields the same result as `np.add(a, b)`.


#### Aggregation Functions

Other functions perform aggregation operations (min, max, mean, sum, etc.) over arrays. This can be done over all elements in the array or with respect to a specified dimension. Let's use [`np.sum()`](https://numpy.org/doc/stable/reference/generated/numpy.sum.html) as an example.

First, we can sum up all the elements in an array:

In [63]:
print(a)
print()

print("Calculate the sum of all elements:")
print(np.sum(a))

[[1 2 3 4]
 [2 2 3 3]]

Calculate the sum of all elements:
20


By specifying `axis=0`, we can sum elements with respect to the first dimension. This means we sum the first elements of all rows, the second elements of all rows, and so on (i.e., we sum up each column). Hence the result is not a single number but an array with all the sums:

In [64]:
print(a)
print()

print("Calculate the sum of all columns:")
print(np.sum(a, axis=0))

[[1 2 3 4]
 [2 2 3 3]]

Calculate the sum of all columns:
[3 4 6 7]


Of course, we can do the same with respect to the second dimension with `axis=1`, summing up all elements of the first row, summing up all elements of the second row, and so on... well, our example array `a` has only 2 rows.

In [65]:
print(a)
print()

print("Calculate the sum of all rows:")
print(np.sum(a, axis=1))

[[1 2 3 4]
 [2 2 3 3]]

Calculate the sum of all rows:
[10 10]
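
Aggregation and broadcasting are often combined, for example to divide each row by its row sum. The sketch below uses the `keepdims` parameter of `np.sum()` so that the result keeps the shape `(2, 1)` and can be broadcast against the matrix; `m` repeats the values of the example array under a different name.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [2, 2, 3, 3]])

# keepdims=True keeps the summed axis as a dimension of size 1
row_sums = np.sum(m, axis=1, keepdims=True)
print(row_sums.shape)  # (2, 1)

# Broadcasting divides every element by the sum of its row
fractions = m / row_sums
print(fractions)
print(np.sum(fractions, axis=1))  # each row now sums to 1
```

Without `keepdims=True`, the row sums would have shape `(2,)`, which is not compatible with the shape `(2, 4)` of the matrix.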


#### Non-Elementwise Array Operations

While `*` refers to the elementwise multiplication between two NumPy arrays, in math the multiplication of matrices and vectors typically refers to the matrix product (dot product). NumPy provides the [`np.dot()`](https://numpy.org/doc/stable/reference/generated/numpy.dot.html) method for this. Let's assume we want to calculate the following matrix-vector multiplication.

$$
\begin{bmatrix}
    1 & 2 & 3 & 4  \\
    2 & 2 & 3 & 3
\end{bmatrix}
* \begin{bmatrix}
    1 \\
    2 \\
    3 \\
    4
\end{bmatrix}
= \begin{bmatrix}
    30 \\
    27 
\end{bmatrix}
$$

The respective code in NumPy looks as follows:

In [66]:
# First we need to create the column vector
col = np.array([[1], [2], [3], [4]])

print("Dot product:")
print(np.dot(a, col))
#print(a.dot(col))      # Same effect

Dot product:
[[30]
 [27]]
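
For 2d arrays, the same product can also be written with the `@` operator or with `np.matmul()`. The following sketch repeats the example from above under the hypothetical names `m` and `v`.

```python
import numpy as np

m = np.array([[1, 2, 3, 4], [2, 2, 3, 3]])
v = np.array([[1], [2], [3], [4]])

with_operator = m @ v           # matrix product using the @ operator
with_matmul = np.matmul(m, v)   # the same product as a function call
print(with_operator)

# In contrast, * multiplies elementwise (here with broadcasting of the row vector)
elementwise = m * v.reshape(1, -1)
print(elementwise)
```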


## Arg* Methods

We already saw that NumPy comes with methods that return the minimum, maximum, mean, etc. value(s) in an array. However, there are many use cases where we are not interested in the value itself (e.g., the minimum) but in where in the array this value is located (see the "Finding the K-Nearest Neighbors" use case below). To this end, NumPy provides different arg* methods.

Let's first create a `(3, 5)` array with random integer values.

In [74]:
a = np.random.randint(1, 20, size=(3, 5))

print(a)  

[[ 8  5  8  9 15]
 [ 3 12 10  8  3]
 [ 4  8  7  5  9]]


[`np.argmax()`](https://numpy.org/doc/stable/reference/generated/numpy.argmax.html) returns the indices of the maximum values along an axis (if an axis is specified). First let's find the index of the largest element in `a` overall.

In [75]:
max_idx = np.argmax(a)

print(max_idx)

4


Note that the index is a scalar value and not a 2d index as might be expected. The index 4 reflects the position the value would have if `a` were flattened to a 1d array. Hence, we can get the maximum value by flattening `a` and reading the value at position `max_idx`:

In [76]:
a.reshape(-1)[max_idx]

15
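
If the position is needed as a regular row/column index, the flat index can be converted with `np.unravel_index()`. The sketch below hard-codes the values of the random array shown above as `m` so that the result is reproducible.

```python
import numpy as np

m = np.array([[8, 5, 8, 9, 15], [3, 12, 10, 8, 3], [4, 8, 7, 5, 9]])

flat_idx = np.argmax(m)                         # position in the flattened array
row, col = np.unravel_index(flat_idx, m.shape)  # position as (row, column)

print(flat_idx)
print(row, col)
print(m[row, col])
```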

Also note that `np.argmax()` returns just one index, even if multiple elements in an array share the largest value. In that case, the returned index is the index of the first occurrence of the max value.

As mentioned above, we can also get the indices of the maximum values with respect to an axis. In this case, we get multiple indices:

In [77]:
print(a)
print()

print(np.argmax(a, axis=0))
print()

print(np.argmax(a, axis=1))

[[ 8  5  8  9 15]
 [ 3 12 10  8  3]
 [ 4  8  7  5  9]]

[0 1 1 0 0]

[4 1 4]


For example, the result `[0 1 1 0 0]` shows the indices of the 5 maximum values for the 5 columns of `a`. Since the first column is `[8 3 4]`, the maximum value is at index 0, represented by `[0 ...]` and so on.

[`np.argmin()`](https://numpy.org/doc/stable/reference/generated/numpy.argmin.html) works exactly the same, just for finding the indices of the minimum values.

[`np.argsort()`](https://numpy.org/doc/stable/reference/generated/numpy.argsort.html) returns the indices that would sort an array. That means this method sorts the elements in the array but then returns the indices and not the actual values. In contrast to `np.argmax()` and `np.argmin()`, `np.argsort()` sorts with respect to the last dimension (`axis=-1`) by default. So to sort regardless of an axis, one has to explicitly set `axis=None`.

In [78]:
print(a)
print()

print(np.argsort(a, axis=None))

[[ 8  5  8  9 15]
 [ 3 12 10  8  3]
 [ 4  8  7  5  9]]

[ 5  9 10  1 13 12  0  2  8 11  3 14  7  6  4]


Again, the order of indices reflects a flattened version of the input array. For example, there are two 3's (the smallest value) in `a`, located at the indices 5 and 9 if `a` were flattened to a 1d array, which is why the result starts with `[ 5  9 ...]`. When setting `axis=-1` or any other valid value -- here either 0 or 1 since `a` has only 2 dimensions -- we get the indices that would sort the values with respect to the chosen dimension/axis. For example:

In [79]:
print(a)
print()

print(np.argsort(a, axis=1))
#print(np.argsort(a, axis=-1)) # Same effect since a has only two axes, 0 and 1.

[[ 8  5  8  9 15]
 [ 3 12 10  8  3]
 [ 4  8  7  5  9]]

[[1 0 2 3 4]
 [0 4 3 2 1]
 [0 3 2 1 4]]


This result tells us that, e.g., for the first row, the smallest value is at index 1 and the largest value at index 4. If multiple elements have the same value, their order in the sorted result here follows their order of occurrence. For example, there are two 8's in the first row, so the one at index 0 comes before the one at index 2. Note that this tie order is only guaranteed when a stable sorting algorithm is requested via `kind='stable'`.
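
The indices returned by `np.argsort()` can be applied back to the array with `np.take_along_axis()` to obtain the sorted values. The sketch below hard-codes the values of the random array shown above as `m` and requests a stable sort so that the order of ties is well defined.

```python
import numpy as np

m = np.array([[8, 5, 8, 9, 15], [3, 12, 10, 8, 3], [4, 8, 7, 5, 9]])

# Indices that sort each row; kind="stable" keeps ties in order of occurrence
order = np.argsort(m, axis=1, kind="stable")
print(order)

# Apply the indices row by row to get the sorted values
sorted_rows = np.take_along_axis(m, order, axis=1)
print(sorted_rows)
```

The result is the same as calling `np.sort(m, axis=1)` directly; the advantage of keeping `order` is that the same reordering can be applied to other arrays of the same shape.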

## Practical Example Use Cases

### Finding the K-Nearest Neighbors

There are many cases and algorithms that require finding the k-nearest neighbors, i.e., the k most similar data points in a dataset. For the following example, we create a random dataset `D` of 10 data points, with each data point featuring 5 attributes/features/coordinates. This means that `D` can be represented as a matrix (2d array) of shape `(10, 5)`.

We also need a data point `x` for which we want to find the k-nearest neighbors. Naturally, `x` must have the same number of 5 attributes/features/coordinates.

In [80]:
np.random.seed(10) # Just to ensure that we get the same random numbers

D = np.random.rand(10, 5)
x = np.random.rand(5)

print('Dataset D:')
print(D) 
print()
print('Data point x:')
print(x)

Dataset D:
[[0.7713 0.0208 0.6336 0.7488 0.4985]
 [0.2248 0.1981 0.7605 0.1691 0.0883]
 [0.6854 0.9534 0.0039 0.5122 0.8126]
 [0.6125 0.7218 0.2919 0.9178 0.7146]
 [0.5425 0.1422 0.3733 0.6741 0.4418]
 [0.434  0.6178 0.5131 0.6504 0.601 ]
 [0.8052 0.5216 0.9086 0.3192 0.0905]
 [0.3007 0.114  0.8287 0.0469 0.6263]
 [0.5476 0.8193 0.1989 0.8569 0.3517]
 [0.7546 0.296  0.8839 0.3255 0.165 ]]

Data point x:
[0.3925 0.0935 0.8211 0.1512 0.3841]


Given `x`, we now want to find, say, the 3 most similar items. Since we have numerical features, we can directly apply the Euclidean distance to measure the similarity between two data points. Given two multidimensional data points `p` and `q` with `n` features, the Euclidean Distance `d(p, q)` is defined as:

$$
d(p, q) = \sqrt{\sum_{i=1}^n (p_i - q_i)^2} = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 + \dots + (p_n - q_n)^2 }
$$

Since we have 10 data points, we will get 10 distances from which we then pick the 3 shortest ones. So let's first compute all the 10 distances between `x` and the data points in `D`. Using the built-in methods and the concept of broadcasting, NumPy makes this step very straightforward. In a sense, we can directly implement the formula given above:

In [81]:
# Use broadcasting to subtract x from all data points in D
R = D - x

# Square all elements
R = np.square(R)

# Sum up all squared elements for each row
distances = np.sum(R, axis=1)

# Calculate the final square roots
distances = np.sqrt(distances)

# Print the 10 distances
print(distances)

[0.7444 0.3613 1.3442 1.1917 0.7087 0.8172 0.6898 0.2801 1.1988 0.5045]
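
The same distances can also be computed in a single call with `np.linalg.norm()`. The following sketch checks that both ways agree; it uses its own random data (`points`, `query`, and the seed `0` are arbitrary choices for illustration), so the printed values differ from the ones above.

```python
import numpy as np

rng = np.random.default_rng(0)  # arbitrary seed for reproducibility
points = rng.random((10, 5))    # stand-in for a dataset of 10 points with 5 features
query = rng.random(5)           # stand-in for the query point

# Step by step: subtract (broadcasting), square, sum per row, take the square root
step_by_step = np.sqrt(np.sum(np.square(points - query), axis=1))

# In one call: the Euclidean norm of each row of the difference matrix
with_norm = np.linalg.norm(points - query, axis=1)

print(np.allclose(step_by_step, with_norm))
```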


Now that we have all 10 distances, we only need to find the 3 shortest ones. In fact, we want to find the positions/indices of the smallest values in `distances` as they correspond to the 3 most similar data points. Again, NumPy has us covered:

In [82]:
# Get the indices of the 3 most similar data points
top_i = np.argsort(distances)[:3]

print(top_i)

[7 1 9]


Now we know that the points at indices 7, 1, and 9 in `D` are the ones most similar to `x`. Of course, we can also print those 3 data points:

In [83]:
D[top_i]

array([[0.3007, 0.114 , 0.8287, 0.0469, 0.6263],
       [0.2248, 0.1981, 0.7605, 0.1691, 0.0883],
       [0.7546, 0.296 , 0.8839, 0.3255, 0.165 ]])

**Sidenote:** Calculating the Euclidean Distances between vectors is such a common task that packages such as `scikit-learn` provide ready-made methods for it. But note that under the hood, `scikit-learn` itself relies heavily on NumPy!

In [84]:
from sklearn.metrics.pairwise import euclidean_distances

euclidean_distances(D, x.reshape(1,-1))

array([[0.7444],
       [0.3613],
       [1.3442],
       [1.1917],
       [0.7087],
       [0.8172],
       [0.6898],
       [0.2801],
       [1.1988],
       [0.5045]])

### Feature Scaling: Standardization

Feature scaling is a very common data preprocessing step and will be motivated in more detail in the course of this module. But to get a first sense of its importance, have a second look at the formula for calculating the Euclidean distance -- here for just two features:

$$
d(p, q) = \sqrt{(p_1 - q_1)^2 + (p_2 - q_2)^2 }
$$

As you can see, the Euclidean Distance depends on the difference between the respective feature values. This can cause a problem if the two features are of very different magnitudes. For example, assume that the values of Feature 1 are around 0.0001 while the values of Feature 2 are around 1,000. That means that the values for Feature 1 will hardly effect the Euclidean Distance between to data points since $(p_1 - q_1)^2$ will always be negligible smalled compared to $(p_2 - q_2)^2$.

Feature scaling aims to remedy this issue by bringing all features to a common order of magnitude. One approach for feature scaling is **standardization**. If $x$ is a feature and $x_i$ is the feature value for data point $i$, each feature value gets standardized by subtracting the feature mean $\mu$ and dividing by the feature standard deviation $\sigma$:

$$
x_i' = \frac{x_i - \mu}{\sigma}
$$

Let's see how this looks in an example. First, we create another random dataset. In this case, however, we artificially scale the values of the different features up or down. For example, the first feature will be of magnitude 0.01, while the third feature will be of magnitude 100.

In [85]:
np.random.seed(10) # Just to ensure that we get the same random numbers

D = np.random.rand(10, 5) * np.array([0.05, 10, 500, 0.1, 50])

print('Dataset D:')
print(D)

Dataset D:
[[  0.0386   0.2075 316.8241   0.0749  24.9254]
 [  0.0112   1.9806 380.2654   0.0169   4.417 ]
 [  0.0343   9.5339   1.9741   0.0512  40.631 ]
 [  0.0306   7.2176 145.938    0.0918  35.7288]
 [  0.0271   1.4217 186.6704   0.0674  22.0917]
 [  0.0217   6.1777 256.5691   0.065   30.0519]
 [  0.0403   5.2165 454.3244   0.0319   4.523 ]
 [  0.015    1.1398 414.3407   0.0047  31.3144]
 [  0.0274   8.1929  99.4738   0.0857  17.5826]
 [  0.0377   2.9596 441.9682   0.0326   8.2508]]


Without feature scaling, the Euclidean Distance would most of the time be dominated by Feature 3. This would implicitly make Feature 3 more important than the others, which is rarely a justified assumption in practice.

Using the built-in methods [`np.mean()`](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) and [`np.std()`](https://numpy.org/doc/stable/reference/generated/numpy.std.html) we can easily calculate the mean and standard deviation for each feature. Note that we need to specify the `axis` parameter for this, otherwise we calculate the mean and standard deviation over all values in `D`.

In [86]:
col_mean = np.mean(D, axis=0)
col_std = np.std(D, axis=0)

print(col_mean, "<= Means for all features")
print(col_std, "<= Standard deviations for all features")

[  0.0284   4.4048 269.8348   0.0522  21.9517] <= Means for all features
[  0.0094   3.1227 149.0524   0.0281  12.3243] <= Standard deviations for all features


With the 5 means and 5 standard deviations for all 5 features, we can again use broadcasting to update each feature value with respect to its "own" mean and standard deviation.

In [87]:
D_standardized = (D - col_mean) / col_std

print(D_standardized)

[[ 1.0776 -1.3441  0.3153  0.8067  0.2413]
 [-1.8171 -0.7763  0.7409 -1.2559 -1.4228]
 [ 0.6223  1.6425 -1.7971 -0.0352  1.5157]
 [ 0.2365  0.9008 -0.8312  1.4079  1.1179]
 [-0.1341 -0.9553 -0.558   0.541   0.0114]
 [-0.709   0.5677 -0.089   0.4565  0.6573]
 [ 1.2571  0.2599  1.2378 -0.7218 -1.4142]
 [-1.4151 -1.0456  0.9695 -1.6908  0.7597]
 [-0.1074  1.2131 -1.143   1.1911 -0.3545]
 [ 0.9893 -0.4628  1.1549 -0.6994 -1.1117]]


As you can see, all features are now roughly in the same ballpark, with no feature standing out and potentially dominating the calculation of Euclidean Distances.
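A quick sanity check is to confirm that every standardized feature has a mean of (numerically) 0 and a standard deviation of 1. The sketch below recreates the dataset with the same seed so that it runs on its own; `np.allclose()` is used because floating-point results are rarely exactly 0 or 1.

```python
import numpy as np

np.random.seed(10)  # same seed as above, to recreate the dataset
D = np.random.rand(10, 5) * np.array([0.05, 10, 500, 0.1, 50])

D_standardized = (D - np.mean(D, axis=0)) / np.std(D, axis=0)

# After standardization, each feature should have mean 0 and standard deviation 1
means_are_zero = np.allclose(D_standardized.mean(axis=0), 0.0)
stds_are_one = np.allclose(D_standardized.std(axis=0), 1.0)

print(means_are_zero, stds_are_one)
```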

**Sidenote:** Again, standardization or normalization of the data is a very common task, and as such there are off-the-shelf solutions available to accomplish it. The code below uses the `StandardScaler` class provided by `scikit-learn`. The results should of course be identical.

In [88]:
from sklearn.preprocessing import StandardScaler
scaler = StandardScaler()
D_standardized_sklearn = scaler.fit_transform(D)

print(D_standardized_sklearn)

[[ 1.0776 -1.3441  0.3153  0.8067  0.2413]
 [-1.8171 -0.7763  0.7409 -1.2559 -1.4228]
 [ 0.6223  1.6425 -1.7971 -0.0352  1.5157]
 [ 0.2365  0.9008 -0.8312  1.4079  1.1179]
 [-0.1341 -0.9553 -0.558   0.541   0.0114]
 [-0.709   0.5677 -0.089   0.4565  0.6573]
 [ 1.2571  0.2599  1.2378 -0.7218 -1.4142]
 [-1.4151 -1.0456  0.9695 -1.6908  0.7597]
 [-0.1074  1.2131 -1.143   1.1911 -0.3545]
 [ 0.9893 -0.4628  1.1549 -0.6994 -1.1117]]
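One detail behind this agreement: `np.std()` by default computes the population standard deviation (dividing by $n$, i.e., `ddof=0`), which is also the estimator `StandardScaler` uses. Other tools may default to the sample standard deviation (dividing by $n-1$), in which case the standardized values would differ slightly. The sketch below illustrates the difference on a made-up feature vector.

```python
import numpy as np

# Hypothetical feature values, chosen so that the numbers work out nicely
feature = np.array([2.0, 4.0, 4.0, 4.0, 5.0, 5.0, 7.0, 9.0])

std_population = np.std(feature)       # ddof=0 (NumPy default): divide by n
std_sample = np.std(feature, ddof=1)   # ddof=1: divide by n - 1

print(std_population)
print(std_sample)
```

The sample standard deviation is always a bit larger; the gap shrinks as the number of data points grows.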


## Summary

NumPy is an indispensable package for data analysis and scientific computing with Python, particularly when working with multidimensional arrays; in the simplest case: vectors and matrices. The appropriate use of the built-in methods not only makes your code shorter and easier to read, in most cases it will also make the actual computations much faster. Having a good understanding of the syntax and inner workings is very beneficial in practice...and definitely in this module :).

There's of course much more to NumPy. This tutorial gives only a first glimpse to get beginners started. Given its popularity and importance, you can find a lot of documentation online (beyond the official documentation on the NumPy website). To conclude, here are just some more take-away comments.

* NumPy is not the perfect solution for all numerical computations. Most basically, NumPy is most suitable for problems and algorithms that can be expressed as vectorized operations. NumPy is also designed so that operations run on a single CPU. With the prevalence of GPU computations, new packages such as [CuPy](https://cupy.dev/) have been developed.

* To keep it simple, most of the examples in this tutorial used only 2d arrays, i.e., matrices where the notions of "column" and "row" are meaningful and intuitive. For arrays with more dimensions these notions break down. In general, it's a good practice to always think in terms of "first dimension/axis", "second dimension/axis", "third dimension/axis", etc. Trying to talk about rows and columns in the case of multidimensional arrays quickly becomes more confusing than helpful.
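The last point can be illustrated with a small sketch. The 3d array `T` below is made up for this purpose; summing along each axis in turn shows that the chosen axis is the one that disappears from the shape, which is much easier to reason about than "rows" and "columns".

```python
import numpy as np

# Hypothetical 3d array of shape (2, 3, 4)
T = np.arange(24).reshape(2, 3, 4)

shape_axis0 = T.sum(axis=0).shape  # first axis collapsed
shape_axis1 = T.sum(axis=1).shape  # second axis collapsed
shape_axis2 = T.sum(axis=2).shape  # third axis collapsed

print(shape_axis0, shape_axis1, shape_axis2)
```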

## Optional "Homework"

In the first example use case, we calculated the Euclidean Distances between a single data point `x` and all data points in `D`. This allowed us to use broadcasting to subtract data point `x` from all points in `D`.

Now let's assume we have, say, 5 data points `x1`, `x2`, ..., `x5`, and we want to find the k-nearest neighbors for all 5 data points. This requires calculating 50 Euclidean Distances, 10 for each `x_i`. Of course, we could simply loop over all `x_i` and perform the steps outlined above for each data point. However, using loops can significantly increase the runtime for large datasets (not for this small example).

**Question:** How can we use NumPy to calculate all Euclidean Distances without using any loop?