In [None]:
%load_ext autoreload
%autoreload 2
%config Completer.use_jedi = False

In [None]:
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np

In [None]:
mpl.rcParams['mathtext.fontset'] = 'stix'
mpl.rcParams['font.family'] = 'STIXGeneral'
mpl.rcParams['text.usetex'] = False
plt.rc('xtick', labelsize=12)
plt.rc('ytick', labelsize=12)
plt.rc('axes', labelsize=12)
mpl.rcParams['figure.dpi'] = 300

# NumPy and Tabular Data

[Matthew R. Carbone](https://www.bnl.gov/staff/mcarbone) | _Assistant Computational Scientist, Computational Science Initiative, Brookhaven National Laboratory_

In this tutorial we will go over the fundamentals of dealing with tabular/array data using arguably the most popular Python libraries: `numpy` and `pandas`. By the end of this, you will hopefully be comfortable with:

- Using `numpy` to _efficiently_ manipulate numeric data
- Using `pandas`, which builds on `numpy`, to load and inspect tabular data
- `numpy` broadcasting and advanced operations

We hope that this tutorial will be helpful for both beginners and those who are already coding experts.

- If you're a beginner, we hope that you'll benefit from the overview of basic concepts, and the references to various documentation pages which can provide more detail where this tutorial is sparse.
- If you're not a beginner, we hope that this tutorial might help you use NumPy more efficiently, and possibly teach you something new!

Ultimately, **data manipulation** is a key skill for doing anything in data-driven science, including AI/ML. Most, if not all, AI/ML libraries in Python rely on NumPy, so getting comfortable with NumPy's syntax is essentially required. By the end of this notebook, you'll have a good idea of how to use NumPy properly!

## What is NumPy?

The package `numpy` (NumPy) stands for **num**erical **py**thon. You can find its excellent documentation [here](https://numpy.org/doc/stable/). It is a jack-of-all-trades, state-of-the-art, open source library for handling any and all types of numerical data you can imagine. I will outline a few of its features as provided in the [first pages of its documentation](https://numpy.org/doc/stable/user/whatisnumpy.html).

1. NumPy facilitates the creation of fixed-size arrays of the _same datatype_.
2. NumPy has advanced mathematical operations built-in. These are implemented efficiently using compiled backends.
3. NumPy supports _vectorization_, allowing for fewer lines of code and faster execution.
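To make the first and third points concrete, here is a minimal sketch contrasting a plain Python list with an array. The variable names are illustrative only and are not used elsewhere in this tutorial.

```python
import numpy as np

py_list = [1, 2, 3]
arr = np.array(py_list)

# A list can mix types; an array stores a single dtype for all elements.
mixed = np.array([1, 2.5, 3])      # the integers are upcast to floats
print(mixed.dtype)                 # float64

# Multiplying a list repeats it; multiplying an array acts on every element.
list_times_two = py_list * 2       # [1, 2, 3, 1, 2, 3]
arr_times_two = arr * 2            # array([2, 4, 6])

# Vectorization: one expression replaces an explicit loop.
squares_loop = [value**2 for value in py_list]
squares_vec = arr**2
```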

It is also worth noting that NumPy is extremely well tested, actively maintained, and serves as the foundation for almost _all_ numerical scientific code written in Python. If you are doing numerical science in Python, you're either using NumPy or using a library that uses NumPy.

## Why should you use NumPy?

[Pure Python is extremely "slow"](https://www.geeksforgeeks.org/what-makes-python-a-slow-language/). We won't go into much detail as to why, but we will mention that:
* _Some_ Python constructs are relatively fast, e.g. list comprehensions.
* Many Python operations are slow, e.g. explicit for loops over large collections.
* NumPy operations are very fast, because they call pre-compiled routines written in lower-level languages. In other words, when you do something in NumPy, e.g. `np.sum(array)`, most of the work is not actually done in Python!

Essentially, NumPy unlocks close to C-level speed while letting you write in a much more human-readable language, which helps with debugging, writing clean code, distribution, etc.
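As a rough illustration of the speed claim, the sketch below sums the same numbers with an explicit Python loop and with `np.sum`. The helper `python_sum` and the array size are arbitrary choices made for this example, and the measured times will depend on your machine.

```python
import timeit
import numpy as np

values = list(range(100_000))
array = np.arange(100_000, dtype=np.int64)

def python_sum(items):
    """Sum with an explicit Python for loop."""
    total = 0
    for item in items:
        total += item
    return total

# Both approaches give the same answer ...
assert python_sum(values) == int(np.sum(array))

# ... but the time taken differs; exact numbers depend on your machine.
loop_time = timeit.timeit(lambda: python_sum(values), number=20)
numpy_time = timeit.timeit(lambda: np.sum(array), number=20)
print(f"for loop: {loop_time:.4f} s, np.sum: {numpy_time:.4f} s")
```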

# The basics (`np.ndarray`)

The core of NumPy is the `ndarray` object (see [here](https://numpy.org/doc/stable/user/absolute_beginners.html#more-information-about-arrays)). This stands for "N-dimensional array" and is basically a container for numerical data. It's easiest to see this by example.

## Vectors

The `list` is the pure Python equivalent of the `np.ndarray` object. We're guessing you're already familiar with this, but if not, a list is simply a one-dimensional object for holding other Python objects, such as numbers, classes, etc.

In [None]:
v_list = [1, 2, 3]

In contrast, here is an `ndarray` object of one dimension (a vector). Unlike a list, all of its elements share a _single type_ (its `dtype`):

In [None]:
v = np.array(v_list)
v  # Note that Jupyter Notebooks allow for "rendering" by simply typing the object at the end of the cell

Accessing elements of a vector is straightforward: for instance, `v[0]`. In this case, we see that NumPy has cast the objects in the list to an integer type, which is `numpy.int64` on most platforms.

In [None]:
print(type(v[0]))
print(v[1])
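The type NumPy picks can also be controlled explicitly. The following is a small sketch, with illustrative variable names, of requesting a dtype at creation, converting an existing array, and what happens when a float is written into an integer array.

```python
import numpy as np

v = np.array([1, 2, 3])
print(v.dtype)                                     # an integer dtype, e.g. int64 on most platforms

v_float = np.array([1, 2, 3], dtype=np.float64)    # request floats at creation
v_as_float = v.astype(np.float64)                  # or convert an existing array (makes a copy)

# Values assigned into an integer array are truncated, so choose the dtype deliberately.
v[0] = 7.9
print(v)                                           # [7 2 3]
```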

Possibly the most useful tool for debugging NumPy code is the `.shape` attribute. Oftentimes, checking the "shape" of an `ndarray` is a quick and easy way of making sure your arrays are doing what they're supposed to do. This will be especially important when considering broadcasting. For now, we have initialized an `ndarray` vector, so we expect our shape to have only one dimension:

In [None]:
v.shape
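Alongside `.shape`, the attributes `.ndim` and `.size` are handy when debugging. The sketch below also shows that a vector is distinct from a one-row or one-column 2d array, a difference that matters once broadcasting enters the picture.

```python
import numpy as np

v = np.array([1, 2, 3])

print(v.shape)   # (3,)  -> a tuple with one entry per dimension
print(v.ndim)    # 1     -> number of dimensions
print(v.size)    # 3     -> total number of elements

# A vector of shape (3,) is not the same as a row of shape (1, 3) or a column of shape (3, 1).
row = v.reshape(1, 3)
col = v.reshape(3, 1)
print(row.shape, col.shape)   # (1, 3) (3, 1)
```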

## Two-dimensional arrays

How about a 2-dimensional array? Note that 2d `np.ndarray` and `np.matrix` objects are very similar, but permit different operations. We won't get into this here, and we'll only deal with `np.ndarray` objects (the NumPy documentation itself no longer recommends using `np.matrix`). [Here](https://stackoverflow.com/a/4151251/16602018) is a good summary of the differences.

In [None]:
X = np.array([[1, 2, 3], [4, 5, 6]])
X

As before let's check the shape:

In [None]:
X.shape

We see that the first dimension in `X.shape`, following standard matrix convention, is the number of rows and the second is the number of columns. Accessing the elements of 2d arrays is also straightforward and follows conventional matrix indexing, with the caveat that, like everything in Python, indices start at zero.

In [None]:
X[0, 2]

Note that trying to access elements beyond the dimensions of the matrix will result in an `IndexError`:

In [None]:
try:
    X[3, 0]
except IndexError:
    print("Yup, just caught an IndexError!")

## Larger-than-two-dimensional arrays

Of course, NumPy allows for arrays with more than two dimensions. We won't deal with these very much here, but they are important for AI/ML purposes. The indexing works the same way as for 2d arrays.

In [None]:
np.random.seed(1234)
T = np.random.random(size=(3, 4, 4))

In [None]:
T[2, 1, 3]

Arrays with more than two dimensions are often loosely referred to as _tensors_, although strictly speaking a tensor is a more general mathematical object than a multi-dimensional array.
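To see how indexing generalizes, here is a short sketch using a deterministic stand-in for `T` (called `T_demo` here, an illustrative name) so that the values are easy to verify by hand.

```python
import numpy as np

T_demo = np.arange(3 * 4 * 4).reshape(3, 4, 4)   # same shape as T, but with values 0..47

print(T_demo.shape)        # (3, 4, 4)
print(T_demo[2].shape)     # (4, 4): fixing the first index leaves a matrix
print(T_demo[2, 1].shape)  # (4,): fixing two indices leaves a vector
print(T_demo[2, 1, 3])     # a single number: 2*16 + 1*4 + 3 = 39
```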

# Operations

Of course, the `np.ndarray` object wouldn't be very useful without the substantial number of operations defined on it. Here, we'll go through these operations, why they're useful and when to use them. In addition, we'll present a real-world example use case!

## Simple arithmetic operators

These include addition, subtraction, multiplication, division, squaring, and many others. Note that everything presented in this subsection **applies operations element-wise**. If you're not familiar with this concept, the following examples will help demonstrate the meaning. Note as well that in this section we only consider operations between an array and a single float or integer.

In [None]:
np.random.seed(1234)
X = np.random.randint(low=1, high=5, size=(2, 3))

In [None]:
X

**Element-wise addition**

In [None]:
X + 3

Note that we took the array `X` and added 3 to every element of the matrix. This is _element-wise addition_. Subtraction works the same way. Let's look at multiplication, division and squaring:

**Element-wise multiplication**

In [None]:
X * 3

**Element-wise division**

In [None]:
X / 3

**Element-wise integer division**

In [None]:
X // 2  # <- useful operator you may not know about!

**Element-wise squaring**

In [None]:
X**2

There are many other operators, too many to cover in this tutorial, but in an attempt to be as complete as possible, let's list a few more that will act elementwise on the array:

**Element-wise application of trigonometric functions** (Angles are always in radians!)

In [None]:
np.sin(X)

In [None]:
np.arctan(X)

**Element-wise application of the logarithm**

In [None]:
np.log10(X)

**Element-wise application of a comparison operator** (the result is a boolean array)

In [None]:
X == 1
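Boolean arrays like the one above are useful for counting and selecting elements. The sketch below uses a small hand-written array rather than the random `X` so the results can be checked by eye.

```python
import numpy as np

X = np.array([[1, 4, 2], [3, 1, 1]])   # small hand-written example array

mask = X == 1                  # boolean array with the same shape as X
print(mask.sum())              # 3 -> True counts as 1
print(X[mask])                 # [1 1 1] -> the selected elements, flattened
replaced = np.where(mask, 0, X)   # replace the ones by zeros, keep everything else
print(replaced)
```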

## Slicing

While not a proper mathematical operation, [array slicing](https://numpy.org/doc/stable/user/basics.indexing.html#slicing-and-striding) is still extremely important. Here's a simple example. We start with our usual array `X`, and want to access different rows and columns.

In [None]:
np.random.seed(1234)
X = np.random.randint(low=1, high=5, size=(4, 5))
X

In [None]:
print("The second row is    ", X[1, :])
print("The first column is  ", X[:, 0])
print("The fourth column is ", X[:, 3])

There are some more complicated nuances of slicing, but what you see above is basically 95% of the battle. We definitely advise looking a bit more into the documentation linked above. Syntax such as `...` (the ellipsis) can be useful in certain situations.
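A few of those nuances are sketched below: ranges, strides, negative indices, the ellipsis, and the fact that basic slices are views into the original data rather than copies. The array here is a simple `np.arange` grid chosen so the values are predictable.

```python
import numpy as np

X = np.arange(20).reshape(4, 5)

print(X[1:3, :])     # rows 1 and 2 (the stop index is excluded)
print(X[:, ::2])     # every other column
print(X[-1, :])      # last row via a negative index
print(X[..., 0])     # `...` stands for "all remaining axes": same as X[:, 0] here

# Basic slices are views: writing to the slice changes the original array.
first_row = X[0, :]
first_row[0] = 100
print(X[0, 0])       # 100

# Use .copy() when you need an independent array.
safe_row = X[1, :].copy()
safe_row[0] = -1
print(X[1, 0])       # 5, unchanged
```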

## Broadcasting

Previously, when discussing element-wise operations, we were really discussing a special case of the general concept of [broadcasting](https://numpy.org/doc/stable/user/basics.broadcasting.html). In the most general sense, broadcasting describes how NumPy treats operations between arrays of different shapes. An operation between an array and a float/integer is the specific case of an array of some shape combined with a scalar, which behaves like an array of shape `(1,)`.
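The scalar case can be checked directly. In this sketch, `np.broadcast_to` is used only to visualize the "stretching" that broadcasting performs conceptually; NumPy does not need to build this larger array when evaluating `X + 3`.

```python
import numpy as np

X = np.array([[1, 2, 3], [4, 5, 6]])

# Adding a scalar gives the same result as adding a length-1 array ...
assert np.array_equal(X + 3, X + np.array([3]))

# ... which in turn behaves as if the value were stretched to the shape of X.
stretched = np.broadcast_to(3, X.shape)
print(stretched)
assert np.array_equal(X + 3, X + stretched)
```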

To demonstrate this, we'll use a curated version of the credit score data found [here](https://www.openml.org/d/31). The long story short is that this dataset classifies people as good or bad credit risks based on a variety of factors.

To load in the data, I'll be using the commonly used `pandas` library. To get the data, we can use `curl`:

In [None]:
import pandas as pd

In [None]:
!curl -o credit_g.csv https://pkgstore.datahub.io/machine-learning/credit-g/credit-g_csv/data/ac05ce3bfd911258bd37fde1e8a3051f/credit-g_csv.csv

In [None]:
# First, we read the csv data
df = pd.read_csv("credit_g.csv")

# Now we select only the numeric columns
df = df.select_dtypes(['number'])

Printing the dataframe below, we note one critical initial observation about our numerical data. The columns each represent a feature, and each row a data point (a person), but the features are all on different orders of magnitude. We don't want that for a variety of reasons, which we will go into in later tutorials. The long story short is that we want all of our features to be on the same scale, preferably roughly -1 to 1. So let's standardize each column so that it has zero mean and unit standard deviation.

In [None]:
df.head()

First, we can convert our dataframe to an `np.ndarray` by basically removing the `pandas` wrapper (this gets rid of the column and row labels).

In [None]:
data = df.to_numpy()
print(data.shape)

Now, we can normalize. To do so, we use the equation
$$ X_{ij}' = (X_{ij} - \mu_j) / (\sigma_j + \epsilon).$$
Here, $X_{ij}$ is the $j$th feature of the $i$th data point, and each feature $j$ has a mean $\mu_j$ and standard deviation $\sigma_j$. $\epsilon$ is a small positive number for numerical stability if $\sigma_j$ is close to 0. How can we compute those? Well, in NumPy it's easy! Just as slicing lets us pick out a particular axis, many standard NumPy functions accept an `axis` argument that tells them which axis to operate along:

In [None]:
mu = data.mean(axis=0, keepdims=True)
assert mu.shape == (1, 7)
sd = data.std(axis=0, keepdims=True)
assert sd.shape == (1, 7)

By looking at the shapes, we can see that each of the 7 features has its own mean and standard deviation, as expected. The `.mean()` and `.std()` methods took the mean and standard deviation along the specified axis! In this case, `axis=0` means to take the mean along the `0`th axis, which here means to take the mean over the rows for each column. Note additional documentation on the mean and standard deviation can be found [here](https://numpy.org/doc/stable/reference/generated/numpy.mean.html) and [here](https://numpy.org/doc/stable/reference/generated/numpy.std.html), respectively. Check the sidebar on either of those links for more useful methods!
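The meaning of `axis` and `keepdims` is easiest to see on a tiny array. The array `A` below is a made-up example with 2 data points and 3 features.

```python
import numpy as np

A = np.array([[1.0, 2.0, 3.0],
              [5.0, 6.0, 7.0]])   # 2 rows (data points), 3 columns (features)

print(A.mean(axis=0))                        # [3. 4. 5.] -> one mean per column
print(A.mean(axis=1))                        # [2. 6.]    -> one mean per row
print(A.mean(axis=0).shape)                  # (3,)
print(A.mean(axis=0, keepdims=True).shape)   # (1, 3): the reduced axis is kept with length 1
```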

Now, here's where things get interesting. We want to execute the equation above. Here's one way you could do it:

In [None]:
def slow_normalize(data=data, mu=mu, sd=sd):
    data_prime_slow = np.empty(shape=data.shape)
    for ii, row in enumerate(data):  # Iterating over a 2d array yields its rows
        for jj, value in enumerate(row):
            data_prime_slow[ii, jj] = (value - mu[0, jj]) / (sd[0, jj] + 1e-8)
    return data_prime_slow

data_prime_slow = slow_normalize()

However, this is extremely slow, sloppy and prone to errors. We want a more "NumPythonic" way of doing this. This is where broadcasting comes in. We can perform precisely the above in a single line of code:

In [None]:
data_prime = (data - mu) / (sd + 1e-8)
assert np.allclose(data_prime, data_prime_slow)  # compare floats with a tolerance

As you can see, this produces the same result. Let's time both versions to see just how slow the loop actually is.

In [None]:
%timeit (data - mu) / (sd + 1e-8)

In [None]:
%timeit slow_normalize()

Looks like the NumPy version is roughly **1000 times faster** (the exact factor will depend on your machine). And this is for only 1000 data points! What's going on here then? We can take a look at a simpler example to help drive home the point. Consider a small matrix and vector, representing perhaps the data matrix and mean vector.

In [None]:
np.random.seed(1236)
X = np.random.randint(low=1, high=5, size=(4, 3))
mu = X.mean(axis=0, keepdims=True)

In [None]:
X

In [None]:
mu

In [None]:
X - mu

From the above, note that `X - mu` subtracts `mu[0, 0]` from every row of the `0`th column in `X`, and similarly `mu[0, 1]` and `mu[0, 2]` from the other two columns. This process is much faster than using Python for loops because NumPy is calling **compiled libraries**, which are written in lower-level languages such as C. You almost always want to use broadcasting and NumPy functions where you can. They will nearly always be faster than the equivalent Python loop.
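One way to build intuition is to write out by hand what broadcasting does conceptually. The sketch below uses synthetic data (not the credit dataset) and `np.tile` to repeat the mean vector explicitly, then checks that the normalized columns have the expected statistics.

```python
import numpy as np

rng = np.random.default_rng(0)
data = rng.normal(loc=50.0, scale=10.0, size=(1000, 3))   # synthetic stand-in data

mu = data.mean(axis=0, keepdims=True)    # shape (1, 3)
sd = data.std(axis=0, keepdims=True)     # shape (1, 3)

# What broadcasting does conceptually: repeat mu along the rows, then subtract.
mu_repeated = np.tile(mu, (data.shape[0], 1))          # shape (1000, 3)
assert np.array_equal(data - mu, data - mu_repeated)

# After normalizing, each column has mean ~0 and standard deviation ~1.
data_prime = (data - mu) / (sd + 1e-8)
print(data_prime.mean(axis=0))
print(data_prime.std(axis=0))
```

Broadcasting gives the same answer without ever allocating the repeated array, which is part of why it is both faster and lighter on memory.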

Broadcasting is simple but requires some intuition to use naturally. Here are a few tips that will help you effectively use this concept.
* Read the [docs](https://numpy.org/doc/stable/user/basics.broadcasting.html) in detail.
* Always check the `shape` of your result.
* Be very careful about sanity-checking your results! Sometimes broadcasting can do unexpected things. For example, do you know what happens if you do the following?

In [None]:
np.random.seed(1234)
X1 = np.random.randint(low=1, high=5, size=(3, 3))
v1 = np.random.randint(low=1, high=5, size=(3,))
what_is_this = X1 * v1

If not, be sure you're careful with your dimensions!
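To see what happens in a case like this, it helps to use arrays whose values make the pattern obvious. The arrays below are hand-picked for illustration and are not the random ones from the cell above.

```python
import numpy as np

X1 = np.array([[1, 1, 1],
               [2, 2, 2],
               [3, 3, 3]])
v1 = np.array([10, 20, 30])

# Shapes (3, 3) and (3,): v1 is treated as a row of shape (1, 3),
# so column j of X1 is multiplied by v1[j].
print(X1 * v1)
# [[10 20 30]
#  [20 40 60]
#  [30 60 90]]

# To scale row i by v1[i] instead, make v1 a column of shape (3, 1).
print(X1 * v1.reshape(-1, 1))
# [[10 10 10]
#  [40 40 40]
#  [90 90 90]]

# Neither is a matrix-vector product; that would be X1 @ v1.
print(X1 @ v1)   # [ 60 120 180]
```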

## Constructing arrays

Here's another example of the usage of broadcasting to drive home the point. Consider that you want to create an array of the 1d particle in a box wave functions. If you don't know what these are, don't worry, just look at the equation below,

$$ \psi_n(x) = \sqrt{\frac{2}{L}} \sin \frac{n \pi x}{L}$$

Let's take $L = 1$ as constant, and construct the array (matrix) where each row of the matrix represents a different $n = 1, 2, 3, ...$ and each column represents a different value for $x$ on a fixed grid.

Using NumPy, this is straightforward to do via **broadcasting**, which we discussed in the previous section.

In [None]:
L = 1.0
x_grid = np.linspace(0.0, L, 1000)
n_grid = np.arange(1, 101)  # n = 1, 2, ..., 100

In [None]:
def psi(x, n, L=L):
    return np.sqrt(2.0 / L) * np.sin(n * np.pi * x / L)

First, we can do this the slow way:

In [None]:
psi_matrix_slow = []

# For every n
for nn in n_grid:
    
    # Use a temporary array
    tmp = []
    
    # Such that for every x
    for xx in x_grid:
        tmp.append(psi(xx, nn, L=L))
    psi_matrix_slow.append(tmp)

# Turn the resulting list of lists into an array
psi_matrix_slow = np.array(psi_matrix_slow)

Now, with broadcasting!

In [None]:
psi_matrix = psi(x_grid.reshape(1, -1), n_grid.reshape(-1, 1), L=L)
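The reshapes are what make this work, so it is worth spelling out the shapes involved. The following self-contained sketch repeats the construction with named intermediates (`x_row` and `n_col` are illustrative names) and spot-checks one entry against the formula.

```python
import numpy as np

L = 1.0
x_grid = np.linspace(0.0, L, 1000)
n_grid = np.arange(1, 101)

x_row = x_grid.reshape(1, -1)    # shape (1, 1000)
n_col = n_grid.reshape(-1, 1)    # shape (100, 1)

# (100, 1) * (1, 1000) broadcasts to (100, 1000): every n paired with every x.
psi_matrix = np.sqrt(2.0 / L) * np.sin(n_col * np.pi * x_row / L)
print(psi_matrix.shape)          # (100, 1000)

# Spot check one entry against the formula evaluated for scalars.
n, x = n_grid[4], x_grid[250]
assert np.isclose(psi_matrix[4, 250], np.sqrt(2.0 / L) * np.sin(n * np.pi * x / L))
```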

In [None]:
fig, ax = plt.subplots(1, 1, figsize=(2, 1))

ax.plot(x_grid.squeeze(), psi_matrix[0, :], 'k', label="$n=1$")
ax.plot(x_grid.squeeze(), psi_matrix[1, :], 'r', label="$n=2$")
ax.plot(x_grid.squeeze(), psi_matrix[2, :], 'b', label="$n=3$")

ax.set_xlabel("$x$")
ax.set_ylabel(r"$\psi_n(x)$")  # raw string so that \p is not treated as an escape sequence
ax.legend(frameon=False, bbox_to_anchor=(1.0, 1.0))

plt.show()

Here's another fun thing we can do. Note that these "wave functions" have the following analytic property: they are all normalized to 1:

$$ \int_0^L |\psi_n(x)|^2 dx = 1.$$

As an exercise, let's check this numerically. We already have a dense grid in `x`, so let's simply sum the squared values of `psi_matrix` over that axis, and multiply by the constant spacing in `x`.

In [None]:
dx = np.diff(x_grid)[0]

In [None]:
(psi_matrix**2).sum(axis=1) * dx

This works so well because the grid in `x` is so dense and these are very smooth functions!

## Advanced operations

Naturally, we want to do more advanced things with NumPy, such as matrix multiplication, dot products, etc. There are NumPy functions for all of these! Let's take a look.

### Matrix multiplication

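In Python 3 (version 3.5 and later), matrix multiplication can be performed using the `@` operator. It's extremely simple, see below. Note that the shapes of the matrices must be compatible: the number of columns of the first matrix must equal the number of rows of the second.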

In [None]:
np.random.seed(123)
X1 = np.random.randint(low=1, high=5, size=(10, 20))
X2 = np.random.randint(low=1, high=5, size=(20, 30))
result = X1 @ X2
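The shape rule can be checked quickly with arrays of ones. The sketch below also shows what happens when the shapes are incompatible, and that `*` is not matrix multiplication; the names `A`, `B` and `S` are used only for this example.

```python
import numpy as np

A = np.ones((10, 20))
B = np.ones((20, 30))

print((A @ B).shape)     # (10, 30): inner dimensions (20 and 20) must match

try:
    B @ A                # (20, 30) @ (10, 20): inner dimensions 30 and 10 differ
except ValueError as err:
    print("Incompatible shapes:", err)

# Note that * is element-wise and is not matrix multiplication.
S = np.array([[1, 2], [3, 4]])
elementwise = S * S      # each entry squared: 1, 4, 9, 16
matrix_product = S @ S   # rows times columns: 7, 10, 15, 22
print(elementwise)
print(matrix_product)
```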

As always, let's write a "dumb" function to test that this result produces what we want it to.

In [None]:
def slow_matmul(mat1, mat2):
    mat3 = np.empty(shape=(mat1.shape[0], mat2.shape[1]))
    for row in range(mat1.shape[0]):
        for col in range(mat2.shape[1]):
            mat3[row, col] = 0
            for xx, yy in zip(mat1[row, :], mat2[:, col]):
                mat3[row, col] += xx * yy
    return mat3

In [None]:
result_slow = slow_matmul(X1, X2)
assert np.all(result_slow == result)

In [None]:
%timeit X1 @ X2

In [None]:
%timeit slow_matmul(X1, X2)

The matrix operation is again roughly 1000 times faster (the exact factor will depend on your machine), and that's only for extremely small matrices!

### Dot product

The dot product (at least the one we'll focus on) maps two vectors of the same size to a single real number:

$$ \mathbf{v}_1 \cdot \mathbf{v}_2 = c \in \mathbb{R}. $$

The NumPy `dot` function, however, does a _lot_ of things, one of which is the aforementioned operation. Let's start with that one:

In [None]:
v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

In [None]:
assert np.dot(v1, v2) == 1 * 4 + 2 * 5 + 3 * 6
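To give a flavor of the other things `np.dot` does, here is a brief sketch of three cases described in its documentation: 1d inputs, 2d inputs, and a scalar input. The matrices `A` and `B` are made up for this example.

```python
import numpy as np

v1 = np.array([1, 2, 3])
v2 = np.array([4, 5, 6])

# For 1d arrays these are all the same inner product.
print(np.dot(v1, v2), v1 @ v2, (v1 * v2).sum())   # 32 32 32

# For 2d arrays, np.dot performs matrix multiplication.
A = np.array([[1, 2], [3, 4]])
B = np.array([[0, 1], [1, 0]])
assert np.array_equal(np.dot(A, B), A @ B)

# With a scalar, np.dot is plain multiplication.
print(np.dot(2, v1))     # [2 4 6]
```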

It's always useful to look at the documentation for various functions and methods. In NumPy, this is especially important since the inherent broadcasting potential can lead to unexpected results. Let's look at the documentation together.

In [None]:
np.dot?

# Summary

* NumPy is **the** numerical computing tool for Python. If you are a scientist doing any kind of numerical coding and you are _not_ using NumPy, please start now!
* NumPy is generally fast, and should be used instead of for loops in essentially every situation.
* There are two exceptions: 1) when it is not possible, in e.g. state-dependent simulations where $x(t + 1) = f(x(t))$, and 2) when memory management becomes an issue.
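
As an illustration of the first exception, here is a minimal sketch of a state-dependent iteration where a Python loop is unavoidable because each step needs the previous one. The logistic map and the function name `logistic_map` are chosen purely as an example and are not part of this tutorial's material.

```python
import numpy as np

def logistic_map(x0, r, n_steps):
    """Iterate x(t + 1) = r * x(t) * (1 - x(t)); each step needs the previous one."""
    trajectory = np.empty(n_steps + 1)   # preallocate instead of appending
    trajectory[0] = x0
    for t in range(n_steps):
        trajectory[t + 1] = r * trajectory[t] * (1.0 - trajectory[t])
    return trajectory

traj = logistic_map(x0=0.5, r=2.0, n_steps=10)
print(traj)   # with r = 2 and x0 = 0.5 the state stays at the fixed point 0.5
```

Even here NumPy still helps: the results are stored in a preallocated array, and anything done to the trajectory afterwards (means, plots, comparisons) can be vectorized.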