# Lesson 2-extra: More about NumPy, Pandas

> Instructor: [Yuki Oyama](mailto:y.oyama@lrcs.ac), [Prprnya](mailto:nya@prpr.zip)
>
> The Christian F. Weichman Department of Chemistry, Lastoria Royal College of Science

This material is licensed under <a href="https://creativecommons.org/licenses/by-nc-sa/4.0/">CC BY-NC-SA 4.0</a><img src="https://mirrors.creativecommons.org/presskit/icons/cc.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/by.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/nc.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;"><img src="https://mirrors.creativecommons.org/presskit/icons/sa.svg" alt="" style="max-width: 1em;max-height:1em;margin-left: .2em;">

Lesson 2 gave us a basic understanding of NumPy, but the library has many more useful features worth learning. For large and complicated datasets, a tool built for labelled, tabular data, Pandas, is often more convenient. In this lesson, we cover more advanced NumPy features in the first half and Pandas in the second half.

## Indexing and Masking

Our lesson begins by expanding the concept of indexing. NumPy arrays support not only integer indexing and slicing, but also several other useful indexing methods. Let's import NumPy first.

```python
import numpy as np
```

In [None]:
import numpy as np

### Arbitrary Indexing

In Lesson 2, we learned how to get a single element from an array using integer indexing, or a regularly spaced subset of elements using slicing. NumPy also supports arbitrary indexing (often called fancy indexing), in which we pass lists or arrays of indices to pick elements at arbitrary positions. Consider the following matrix:

```python
mat = np.array([
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
    [3.1, 3.2, 3.3, 3.4, 3.5],
    [4.1, 4.2, 4.3, 4.4, 4.5],
    [5.1, 5.2, 5.3, 5.4, 5.5],
])
mat
```

In [None]:
mat = np.array([
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
    [3.1, 3.2, 3.3, 3.4, 3.5],
    [4.1, 4.2, 4.3, 4.4, 4.5],
    [5.1, 5.2, 5.3, 5.4, 5.5],
])
mat

We want to take some elements from the matrix, for example, the orange elements below:

$$
\begin{bmatrix}
1.1 & 1.2 & 1.3 & {\color{orange} 1.4} & 1.5 \\
2.1 & 2.2 & 2.3 & 2.4 & 2.5 \\
{\color{orange} 3.1} & 3.2 & 3.3 & 3.4 & 3.5 \\
4.1 & 4.2 & 4.3 & 4.4 & 4.5 \\
5.1 & {\color{orange} 5.2} & 5.3 & 5.4 & 5.5
\end{bmatrix}
$$

These elements are located at the following indices: $3.1$ at $(2, 0)$, $1.4$ at $(0, 3)$, and $5.2$ at $(4, 1)$. We can use two lists to represent the row indices and column indices of these elements, and use them to index the matrix:

```python
rows = [2, 0, 4]
cols = [0, 3, 1]

mat[rows, cols]
```

In [None]:
rows = [2, 0, 4]
cols = [0, 3, 1]

mat[rows, cols]
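The following is a minimal sketch of two related behaviours, assuming the same 5×5 `mat` as above. It suggests that indexing with index lists returns a copy rather than a view, and that `np.ix_` can be used when we want every listed row crossed with every listed column instead of element-by-element pairs. The names `picked` and `block` are illustrative only.

```python
import numpy as np

mat = np.array([
    [1.1, 1.2, 1.3, 1.4, 1.5],
    [2.1, 2.2, 2.3, 2.4, 2.5],
    [3.1, 3.2, 3.3, 3.4, 3.5],
    [4.1, 4.2, 4.3, 4.4, 4.5],
    [5.1, 5.2, 5.3, 5.4, 5.5],
])
rows = [2, 0, 4]
cols = [0, 3, 1]

# Pairs rows[i] with cols[i]: elements (2, 0), (0, 3) and (4, 1).
picked = mat[rows, cols]

# Index-list selection returns a copy, so editing it leaves `mat` untouched.
picked[0] = -1.0

# np.ix_ builds an open mesh: all listed rows crossed with all listed columns.
block = mat[np.ix_(rows, cols)]
block.shape
```

Here `picked` has shape `(3,)`, while `block` has shape `(3, 3)`.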

### Masking

Sometimes we want to select the elements of an array that meet certain conditions. Using the control flow we learned in Lesson 1-extra, we could write a loop combined with conditional statements to do this. However, NumPy provides a more convenient way, called **masking**. For example, suppose we have some prime numbers such as

```python
primes = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37])
primes
```

In [None]:
primes = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37])
primes

We want to find out all the primes $p$ of the form $4n + 3$ from the array. This condition can be expressed as `p % 4 == 3` in Python. To create a mask that acts on `primes` using this condition, we just need to replace the `p` in the condition with `primes`:

```python
mask = primes % 4 == 3
mask
```

In [None]:
mask = primes % 4 == 3
mask

The result is a boolean array that is `True` for all the elements that meet the condition and `False` for the rest (why?). We can use this mask to index the array to get the desired elements:

```python
primes_filtered = primes[mask]
primes_filtered
```

In [None]:
primes_filtered = primes[mask]
primes_filtered

Now check the mask again:

```python
mask_filtered = primes_filtered % 4 == 3
mask_filtered
```

In [None]:
mask_filtered = primes_filtered % 4 == 3
mask_filtered

As you can see, all the elements in `primes_filtered` meet the condition. However, checking each element in the mask manually is not very efficient. To simplify this process, we can use the `any()` and `all()` methods of NumPy boolean arrays to check whether any or all the elements meet the condition:

```python
mask.any(), mask.all(), mask_filtered.any(), mask_filtered.all()
```

In [None]:
mask.any(), mask.all(), mask_filtered.any(), mask_filtered.all()

Each of these methods returns `np.True_` or `np.False_`, NumPy's own boolean scalars, which behave like Python's built-in `True` and `False`, so you can treat them as the same. For a boolean array, the `any()` method returns `True` if at least one of the elements in the array is `True`, and the `all()` method returns `True` if all the elements in the array are `True`.
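Masks can also be combined before indexing. The sketch below assumes the same `primes` array as above and uses the element-wise operators `&` (and) and `~` (not); the names `mask_both`, `selected` and `others` are illustrative.

```python
import numpy as np

primes = np.array([2, 3, 5, 7, 11, 13, 17, 19, 23, 29, 31, 37])

# Parentheses are required: & binds more tightly than == and >.
mask_both = (primes % 4 == 3) & (primes > 10)
selected = primes[mask_both]

# ~ inverts a mask, selecting the primes that are NOT of the form 4n + 3.
others = primes[~(primes % 4 == 3)]

selected, others
```

Python's `and`, `or` and `not` keywords do not work element-wise on arrays, which is why the operator forms are used here.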

<span style="color:green">**Exercise**:</span> Imagine that we mark the grid points (points with integer coordinates) on the 2D plane and then draw a circle centered at the origin with a radius of $r$ (for simplicity, $r$ only takes integer values). Use masking to show how many grid points are in the circle (including the boundary). We can use the `meshgrid()` function to generate a grid of points, for example:

```python
xrange = np.arange(2)
yrange = np.arange(3)
x, y = np.meshgrid(xrange, yrange)
x, y
```

In [None]:
xrange = np.arange(2)
yrange = np.arange(3)
x, y = np.meshgrid(xrange, yrange)
x, y

Each pair of elements with the same index in `x` and `y` represents a grid coordinate.

_Hint: the mask is also an array, which can be indexed using the mask itself._

In [None]:
def grids(r: int) -> int:
    points = np.arange(-r, r + 1)
    x, y = np.meshgrid(points, points)
    mask = x ** 2 + y ** 2 <= r ** 2
    return mask[mask].size

grids(5)

## Array Manipulation

NumPy provides a series of functions that allow us to perform various operations on arrays like kneading dough.

### Reshape

In [None]:
arr1d = np.arange(12)
arr1d

In [None]:
arr1d.size, arr1d.ndim, arr1d.shape

In [None]:
arr2d = arr1d.reshape(3, 4)
arr2d

In [None]:
arr2d.size, arr2d.ndim, arr2d.shape

In [None]:
# Uncomment to see a ValueError: 12 elements cannot fill a 2 x 5 array.
# arr1d.reshape(2, 5)

In [None]:
arr3d = arr1d.reshape(2, 3, 2)
arr3d
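As a rough sketch of the rule behind `reshape`: the product of the new dimensions must equal the number of elements. The example below assumes the same `arr1d` as above; `inferred`, `reshape_failed` and `shares_memory` are illustrative names. It also shows that `-1` lets NumPy infer one dimension, and that for a contiguous array like this one the result shares memory with the original.

```python
import numpy as np

arr1d = np.arange(12)

# -1 asks NumPy to infer that dimension: 12 / 3 = 4 columns.
inferred = arr1d.reshape(3, -1)

# 2 * 5 = 10 does not match the 12 elements, so this reshape fails.
try:
    arr1d.reshape(2, 5)
    reshape_failed = False
except ValueError:
    reshape_failed = True

# For a contiguous array such as arr1d, reshape returns a view on the same data.
shares_memory = np.shares_memory(arr1d, inferred)

inferred.shape, reshape_failed, shares_memory
```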

### Flatten

In [None]:
arr3d.flatten()
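A related method is `ravel()`. The sketch below, which assumes an `arr3d` built the same way as above, illustrates the difference as I understand it: `flatten()` always returns a copy, whereas `ravel()` returns a view when the memory layout allows it. The names `flat_copy` and `flat_view` are illustrative.

```python
import numpy as np

arr3d = np.arange(12).reshape(2, 3, 2)

flat_copy = arr3d.flatten()  # always a new, independent array
flat_view = arr3d.ravel()    # a view here, because arr3d is contiguous

flat_copy[0] = 100  # does not affect arr3d
flat_view[1] = 200  # writes through to arr3d[0, 0, 1]

arr3d[0, 0]
```

Choosing `flatten()` is the safer default when the flattened array will be modified.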

### Transpose

In [None]:
arr2d.transpose()

In [None]:
arr3d.transpose(1, 0, 2)
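To make the axis arguments more concrete, here is a small sketch assuming the same `arr3d` as above. In `transpose(1, 0, 2)`, the new axis 0 is the old axis 1 and the new axis 1 is the old axis 0, so the first two indices swap; with no arguments, the order of all axes is reversed. The names `swapped` and `reversed_axes` are illustrative.

```python
import numpy as np

arr3d = np.arange(12).reshape(2, 3, 2)

# New axes are old axes (1, 0, 2): shape (2, 3, 2) becomes (3, 2, 2).
swapped = arr3d.transpose(1, 0, 2)

# With no arguments the axis order is reversed: element [i, j, k] moves to [k, j, i].
reversed_axes = arr3d.transpose()

# The same element, addressed before and after swapping the first two axes.
arr3d[1, 2, 0], swapped[2, 1, 0]
```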

## Vectorization

NumPy arrays support batch (element-wise) operations out of the box. However, functions written for single Python values, such as those in the `math` module, do not accept arrays, and we would have to fall back on explicit loops. For such cases, NumPy provides a feature called [**vectorization**](https://numpy.org/doc/stable/reference/generated/numpy.vectorize.html): `np.vectorize`, often applied with decorator syntax, wraps a scalar Python function so that it can be called on arrays. Note that it is a convenience wrapper. Internally it still loops over the elements, so it is not faster than an explicit loop.

In [None]:
squares = np.arange(5) ** 2
squares

In [None]:
np.sqrt(squares)

In [None]:
import math
# math.sqrt expects a single number, so passing an array raises a TypeError.
math.sqrt(squares)

In [None]:
@np.vectorize
def sqrt(n: int | float) -> float:
    return math.sqrt(n)

sqrt(squares)
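`np.vectorize` is most useful when a function contains scalar-only logic such as an `if` branch. The sketch below uses a hypothetical helper, `safe_sqrt`, that is not part of the lesson, and compares the wrapped version with an equivalent written in native array operations.

```python
import math
import numpy as np

def safe_sqrt(n: float) -> float:
    """Hypothetical scalar-only helper: square root, or 0.0 for negative input."""
    return math.sqrt(n) if n >= 0 else 0.0

# otypes fixes the output dtype instead of inferring it from the first result.
vec_safe_sqrt = np.vectorize(safe_sqrt, otypes=[float])

values = np.array([-4, 0, 9, 16])
result = vec_safe_sqrt(values)

# The same result using native array operations, which avoid the Python-level loop.
native = np.sqrt(np.clip(values, 0, None))

result, native
```

When a native formulation like the last line exists, it is usually the better choice for large arrays.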

## Broadcasting

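When we combine arrays with different shapes in an element-wise operation, NumPy applies [**broadcasting**](https://numpy.org/doc/stable/user/basics.broadcasting.html): if the shapes are compatible, the smaller array is virtually stretched along its missing or length-1 axes so that both operands have the same shape. Shapes are compared from the last axis backwards, and two axes are compatible when they are equal or one of them is 1; otherwise NumPy raises a `ValueError`.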


In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([2, 5])
a, b

In [None]:
a + b

In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([2])
a * b

In [None]:
a = np.array([[1, 2], [3, 4]])
b = np.array([[1, 1, 1],
              [2, 2, 2],
              [3, 3, 3]])
a + b  # raises a ValueError: shapes (2, 2) and (3, 3) are not compatible

In [None]:
a = np.array([[1, 2], [3, 4], [5, 6], [7, 8]])
b = np.array([8, 7])
a + b
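The shape rule behind the cells above can be checked without doing any arithmetic. The sketch below assumes a NumPy version that provides `np.broadcast_shapes` (added in 1.20); the names `grid`, `col` and `shifted` are illustrative. It also shows how `np.newaxis` turns a 1D array into a column so that it broadcasts along rows instead.

```python
import numpy as np

# Trailing axes are compared first: (2, 2) with (2,) gives (2, 2).
shape_row = np.broadcast_shapes((2, 2), (2,))
shape_tall = np.broadcast_shapes((4, 2), (2,))

# (2, 2) with (3, 3) fails: 2 and 3 differ and neither is 1.
try:
    np.broadcast_shapes((2, 2), (3, 3))
    incompatible = False
except ValueError:
    incompatible = True

# A (2, 1) column is stretched across the columns of a (2, 2) array.
grid = np.array([[1, 2], [3, 4]])
col = np.array([10, 20])[:, np.newaxis]
shifted = grid + col

shape_row, shape_tall, incompatible, shifted
```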

In [None]:
# TODO: More precise examples

## Linear Algebra in NumPy (`numpy.linalg`)

Since matrices can also be viewed as two-dimensional arrays, it is natural for NumPy to implement various concepts in linear algebra.

In [None]:
vec1 = np.array([1, 1, 1])
vec2 = np.array([-1, -1, 1])
vec1, vec2

In [None]:
norm1 = np.linalg.norm(vec1)
norm2 = np.linalg.norm(vec2)
norm1, norm2

In [None]:
np.vdot(vec1, vec2)

In [None]:
np.cross(vec1, vec2)

In [None]:
mat1 = np.array([
    [1, 0, 0],
    [0, 0, -1],
    [0, 1, 0],
])
mat2 = np.array([
    [0, -1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
mat = mat1 @ mat2
mat

In [None]:
x = np.array([1, 0, 0])
y = np.array([0, 1, 0])
z = np.array([0, 0, 1])
mat @ x, mat @ y, mat @ z
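A few quick consistency checks can help confirm what these functions compute. The sketch below reuses the vectors and matrices above (the product is stored under the illustrative name `rot`, since `mat1` and `mat2` act as 90-degree rotations about the $x$ and $z$ axes). It uses `np.linalg.det`, which has not been introduced yet, purely as a check.

```python
import numpy as np

vec1 = np.array([1, 1, 1])
vec2 = np.array([-1, -1, 1])

# The Euclidean norm is the square root of a vector's dot product with itself.
norm_ok = np.isclose(np.linalg.norm(vec1), np.sqrt(np.vdot(vec1, vec1)))

# The cross product is perpendicular to both of its inputs.
normal = np.cross(vec1, vec2)
perpendicular = np.vdot(normal, vec1) == 0 and np.vdot(normal, vec2) == 0

mat1 = np.array([[1, 0, 0], [0, 0, -1], [0, 1, 0]])
mat2 = np.array([[0, -1, 0], [1, 0, 0], [0, 0, 1]])
rot = mat1 @ mat2

# A rotation matrix is orthogonal (its transpose is its inverse) with determinant 1.
is_orthogonal = np.allclose(rot.T @ rot, np.eye(3))
det = np.linalg.det(rot)

norm_ok, perpendicular, is_orthogonal, det
```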

In [None]:
# TODO: Trace, determinant, eigenvalues and eigenvectors (Hückel method)

## Pandas

- `Series` and `DataFrame`
- loading and saving files
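
As a preview of the two topics listed above, here is a minimal sketch, assuming Pandas is installed and imported as `pd`. It builds a `Series` and a `DataFrame` from a few primes, filters rows with a boolean mask as in NumPy, and round-trips the table through CSV text. An in-memory `io.StringIO` buffer stands in for a file on disk, and all names are illustrative.

```python
import io
import pandas as pd

# A Series is a labelled 1D array; a DataFrame is a table of named columns.
primes = pd.Series([2, 3, 5, 7, 11, 13], name="p")
df = pd.DataFrame({"p": primes, "mod4": primes % 4})

# Boolean masks select rows, much as they select elements of a NumPy array.
subset = df[df["mod4"] == 3]

# Save to CSV and load it back; a file path could be used in place of the buffer.
buffer = io.StringIO()
df.to_csv(buffer, index=False)
buffer.seek(0)
loaded = pd.read_csv(buffer)

subset, loaded.equals(df)
```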