<a target="_blank" href="https://colab.research.google.com/github/JLDC/Data-Science-Fundamentals/blob/master/notebooks/04_numpy-random-numbers.ipynb">
    <img src="https://i.ibb.co/2P3SLwK/colab.png"  style="padding-bottom:5px;" />Open this notebook in Google Colab
</a>

___

# NumPy
___
`numpy` is the core numerical computation package in Python, and it underlies everything we will do in this class. More often than not, it *acts in the background*: it may seem that we are only working with `pandas`, but every column in `pandas` is essentially a `numpy` vector with some additional features.

Although `numpy` is prevalent in what we will do, we only need to use it directly from time to time. For this reason, this notebook only goes over some `numpy` basics.

In [None]:
import numpy as np # Import the package

Let's start by observing how a list from base Python compares to a `numpy` array.

In [None]:
x_numpy = np.array([1, 2, 3]) # Create a numpy array of numbers [1, 2, 3]
x_base = [1, 2, 3] # Create a list of numbers [1, 2, 3] in base Python

In [None]:
# Observe how these objects behave with mathematical operations
print("Base :", x_base * 2)
print("Numpy:", x_numpy * 2)

As you can see, multiplying a base list by 2 simply repeats the list. Doing the same with a `numpy` array gives element-wise multiplication instead, which is much more practical for numerical computing. Of course, `numpy` doesn't stop here: you can also do addition, subtraction, division, exponentials, and vector operations. Here are a few examples:

In [None]:
x = np.array([1, 2, 3]) # Create a numpy array

In [None]:
np.exp(x) # Exponential applied element-wise

In [None]:
np.sqrt(x) # Notice how numpy also has most mathematical functions

In [None]:
x - 2 # Element-wise subtraction
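The same element-wise behaviour applies when both operands are arrays of the same length. The following is a small sketch; the arrays `a` and `b` are made up for illustration and are not used elsewhere in the notebook.

```python
import numpy as np

# Two hypothetical arrays of the same length
a = np.array([1, 2, 3])
b = np.array([10, 20, 30])

print(a + b)         # Element-wise addition: [11 22 33]
print(b / a)         # Element-wise division: [10. 10. 10.]
print(a ** 2)        # Element-wise power:    [1 4 9]
print(np.dot(a, b))  # Dot product (a vector operation): 140
```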

`numpy` arrays have an attribute called `.shape` which tells us the size of the object along each dimension. For a matrix, it takes the form `(number_of_rows, number_of_columns)`.

In [None]:
x.shape

## Vectors and matrices
___
Notice the blank space after the comma in `(3,)`: there is no second dimension. This means that `x` is a **vector** (a one-dimensional array). This can be somewhat confusing at first, but `numpy` differentiates between vectors and matrices, even when they hold the same number of elements, e.g., a vector with 3 elements is not the same as a $3 \times 1$ matrix.
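To make the distinction concrete, here is a minimal sketch that builds a one-dimensional vector and reshapes it into explicit column and row matrices. The names `v`, `col`, and `row` are chosen for illustration only.

```python
import numpy as np

v = np.array([1, 2, 3])  # One-dimensional vector
col = v.reshape(3, 1)    # Explicit 3x1 matrix (column)
row = v.reshape(1, 3)    # Explicit 1x3 matrix (row)

print(v.shape, v.ndim)      # (3,) 1
print(col.shape, col.ndim)  # (3, 1) 2
print(row.shape, row.ndim)  # (1, 3) 2
```

All three objects hold the same numbers, but only `v` is one-dimensional.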

In [None]:
# Create a 2x3 matrix
mat = np.array([[1, 2, 3], [4, 5, 6]])
mat

In [None]:
mat.shape # Our matrix has 2 rows and 3 columns

So if we multiply `mat` by `x`, will that give us a $2 \times 1$ matrix? Since `mat` is $2 \times 3$ and `x` is $3 \times 1$, this is what we would expect, right?

In [None]:
mat * x

⚠️ Not exactly! In `numpy`, the `*` operator refers to **element-wise multiplication**. This is important to know! So basically, what happened is that `x` multiplied each row of `mat` element-by-element... makes sense! But how can we do vector/matrix multiplication?

This is done using the `@` operator. Be wary: although `@` is part of Python's syntax, built-in types such as lists do not implement it, so in practice it is mostly useful with `numpy` arrays (and other libraries that support it).

In [None]:
# Matrix multiplication of a 2x3 matrix and a 3-element vector
mat @ x

Here we go! Notice how this is the same as the sum of each row of the element-wise multiplication... if this seems strange to you, now might be a good time to review some of your linear algebra courses. In fact, most of machine learning is just linear algebra in disguise!
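The claim that the matrix product equals the row sums of the element-wise product can be checked directly. This sketch redefines `mat` and `x` with the same values as above so that it runs on its own.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6]])
x = np.array([1, 2, 3])

elementwise = mat * x               # [[1, 4, 9], [4, 10, 18]]
row_sums = elementwise.sum(axis=1)  # Sum each row: [14, 32]
product = mat @ x                   # Matrix-vector product: [14, 32]

print(np.array_equal(row_sums, product))  # True
```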

At the end of this notebook, we will discuss another example of matrix multiplication and compare it with an iterative approach.

## Useful functions and methods
___
`numpy` also provides many useful functions, statistical and mathematical...

In [None]:
# Compute the mean of the elements in the matrix
mat.mean()

In [None]:
# Compute the standard deviation of the elements in the matrix
mat.std()

In [None]:
# We can also use np.mean and np.std as functions instead of methods
np.std(mat)

In [None]:
# Even better, we can use NumPy functions on base Python lists...
np.mean(x_base)

In [None]:
# We can also transpose matrices, there are many ways to do this
mat.transpose()

In [None]:
# ... another way
np.transpose(mat)

In [None]:
# ... and my favorite
mat.T
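By default, methods such as `mean` and `std` reduce over all elements. As a sketch going slightly beyond the cells above, the `axis` argument restricts the computation to columns or rows. `mat` is redefined here so the example is self-contained.

```python
import numpy as np

mat = np.array([[1, 2, 3], [4, 5, 6]])

print(mat.mean())        # Mean of all six elements: 3.5
print(mat.mean(axis=0))  # Mean of each column: [2.5 3.5 4.5]
print(mat.mean(axis=1))  # Mean of each row:    [2. 5.]
```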

## Random numbers
___
`numpy` also allows us to create random numbers using the submodule `numpy.random`, e.g.,

In [None]:
np.random.seed(72) # Set seed for reproducibility
# Create a 5-element array of random numbers uniform between 0 and 1
x_rand = np.random.rand(5) 
x_rand

In [None]:
# Create a 1000-element array of random numbers normally distributed with mean 0 and std 1
x_randn = np.random.randn(1000) 
# Print the mean and standard deviation
print(f"The mean is {x_randn.mean():.2f} and the standard deviation is {x_randn.std():.2f}")
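NumPy also offers a newer interface for random numbers, based on a generator object created with `np.random.default_rng`. The sketch below mirrors the two cells above with that interface. It is an alternative to `np.random.seed`, not what the rest of this notebook uses, and it produces different numbers than the seeded functions above.

```python
import numpy as np

rng = np.random.default_rng(72)  # Generator with its own seed

u = rng.random(5)              # 5 uniform numbers in [0, 1)
z = rng.standard_normal(1000)  # 1000 standard normal numbers

# Re-creating the generator with the same seed reproduces the same draws
u_again = np.random.default_rng(72).random(5)
print(np.array_equal(u, u_again))  # True
```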

In [None]:
# numpy can not only find the minimum but also the place where this minimum happens!
print(f"The minimum is {x_randn.min():.2f} and it occurs at index {x_randn.argmin()}")
print(f"The maximum is {x_randn.max():.2f} and it occurs at index {x_randn.argmax()}")

The `argmin` and `argmax` functions used above will come in handy when doing data science! Make sure you understand their purpose.
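On a small array, the behaviour is easy to verify by eye. The array `scores` below is a made-up example.

```python
import numpy as np

scores = np.array([3.5, -1.2, 7.0, 0.4])  # Hypothetical values

i_min = scores.argmin()  # Index of the smallest value
i_max = scores.argmax()  # Index of the largest value

print(i_min, scores[i_min])  # 1 -1.2
print(i_max, scores[i_max])  # 2 7.0
```

Indexing the array with the result of `argmin` or `argmax` returns the minimum or maximum itself.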

## 🙀 🤯 Filtering
___
Let's now have a look at filtering in `numpy` and try to understand how this can be useful. `numpy` provides the `where` function: called with only a condition, it returns the indices at which that condition is true (as a tuple with one array per dimension). The best way to understand it is to have a look at how it works in practice.

In [None]:
# Indices where x_randn is at the minimum
np.where(x_randn == x_randn.min())[0]

In [None]:
# Compare the results with `argmin`
x_randn.argmin()

In a sense, `argmin` is a shorthand for the `np.where` call we used above: it returns the first index at which the minimum occurs. But `np.where` is much more powerful and flexible. Consider the following examples:

In [None]:
# Indices where x_randn is below the minimum + 0.5
np.where(x_randn < x_randn.min() + 0.5)[0]

In [None]:
# We can also use these indices to extract the values directly by indexing the vector
x_randn[np.where(x_randn < x_randn.min() + .5)]

In case we don't need the indices but only the number of elements, we can use `np.count_nonzero` on the condition instead of doing `len(np.where(...)[0])`.

In [None]:
# Count the number of values lower than the minimum + 0.5
np.count_nonzero(x_randn < x_randn.min() + .5)
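The filtering ideas in this section can be checked on a small array where the answer is visible at a glance. The array `values` is hypothetical. The sketch also shows that a boolean condition can be used directly as an index, without going through `np.where`.

```python
import numpy as np

values = np.array([0.3, -1.5, 2.0, -1.4, 0.9])  # Hypothetical data

# Boolean mask: True where the value is below the minimum + 0.5 (i.e., below -1.0)
mask = values < values.min() + 0.5
idx = np.where(mask)[0]  # Indices where the mask is True

print(idx)                     # [1 3]
print(values[mask])            # [-1.5 -1.4]
print(values[idx])             # Same values, selected by index
print(np.count_nonzero(mask))  # 2
```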

Alright, let's now have a look at a practical use-case in which we use filtering. As a matter of fact, we will do something similar in the notebooks later on, so pay close attention!

Consider that we have built a machine learning model that predicts whether a student passes or fails the assessment year. We have a vector of predictions $\mathbf{y}_\text{pred} = [0 \, 1 \, 1 \, 1 \, \dots, \, 0]^\top$ where $0$ means that the student fails the year and $1$ indicates that they pass.

At the end of the year, we have obtained the true results $\mathbf{y}_\text{true} = [1 \, 1 \, 1 \, 0 \, \dots, \, 0]^\top$, and it is time to quantify how well our machine learning model performed. We want to compute:

+ The number of **true positives** (students that passed and we predicted that they would pass)
+ The number of **true negatives** (students that failed and we predicted that they would fail)
+ The number of **false positives** (students that failed and we predicted that they would pass)
+ The number of **false negatives** (students that passed and we predicted that they would fail)

In [None]:
# Set random seed
np.random.seed(72)
N = 1500 # Number of students
# Create two random vectors where the chance of a value being zero is 25%
# and the chance of a value being 1 is 75%
y_true = np.random.choice((0, 1), N, p=(.25, .75))
y_pred = np.random.choice((0, 1), N, p=(.25, .75))

In [None]:
# True positives: we predict 1 (pass) and the true value is 1
TP = np.count_nonzero((y_true == 1) & (y_pred == 1))
# True negatives: we predict 0 (fail) and the true value is 0
TN = np.count_nonzero((y_true == 0) & (y_pred == 0))
# False positives: we predict 1 (pass) and the true value is 0
FP = np.count_nonzero((y_true == 0) & (y_pred == 1))
# False negatives: we predict 0 (fail) and the true value is 1
FN = np.count_nonzero((y_true == 1) & (y_pred == 0))

In [None]:
# Print the results
print("        __________________")
print("       |    Prediction    |")
print("       |------------------|")
print("       |    Fail |   Pass |")
print(" Truth |------------------|")
print(f"  Fail |   {TN:>5} |  {FP:>5} |")
print(f"  Pass |   {FN:>5} |  {TP:>5} |")
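With 1500 random students the counts are hard to verify by hand, so here is the same logic on six made-up students. The `_demo` names are hypothetical and chosen so that the variables above are not overwritten.

```python
import numpy as np

# Six hypothetical students: 1 = pass, 0 = fail
y_true_demo = np.array([1, 1, 1, 0, 0, 1])
y_pred_demo = np.array([0, 1, 1, 1, 0, 1])

tp_demo = np.count_nonzero((y_true_demo == 1) & (y_pred_demo == 1))  # 3
tn_demo = np.count_nonzero((y_true_demo == 0) & (y_pred_demo == 0))  # 1
fp_demo = np.count_nonzero((y_true_demo == 0) & (y_pred_demo == 1))  # 1
fn_demo = np.count_nonzero((y_true_demo == 1) & (y_pred_demo == 0))  # 1

# Sanity check: every student falls into exactly one of the four groups
print(tp_demo + tn_demo + fp_demo + fn_demo == len(y_true_demo))  # True
```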

## The power of matrix multiplication
___
Say that we have two matrices, $\mathbf{A}$ and $\mathbf{B}$ and we want to compute their product $\mathbf{AB}$.

We could, in theory, implement matrix multiplication *by hand*. Remember how to multiply two matrices:

1. Multiply the first row of $\mathbf{A}$ with the first, second, third, ..., column of $\mathbf{B}$
2. Multiply the second row of $\mathbf{A}$ with the first, second, third, ..., column of $\mathbf{B}$
3. Continue until you have done this for all the rows of $\mathbf{A}$.

Can you see why a double loop will be helpful here? We are looping over the rows of $\mathbf{A}$ and for each row, we are looping over the columns of $\mathbf{B}$ !
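The steps above can be written as a single formula. Assuming $\mathbf{A}$ has $n$ columns and $\mathbf{B}$ has $n$ rows (the symbols $\mathbf{C}$, $i$, $j$, $k$, and $n$ are introduced here for illustration), the element in row $i$ and column $j$ of the product $\mathbf{C} = \mathbf{AB}$ is

$$
C_{ij} = \sum_{k=1}^{n} A_{ik} B_{kj}
$$

The outer loop runs over the rows $i$ of $\mathbf{A}$, the inner loop runs over the columns $j$ of $\mathbf{B}$, and the sum over $k$ is the sum of the element-wise product of row $i$ of $\mathbf{A}$ with column $j$ of $\mathbf{B}$.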

Let's go ahead and implement this algorithm and verify our result using the matrix multiplication of `numpy`.

In [None]:
np.random.seed(72) # Set random seed
# Create two random matrices with Poisson (λ=5) distributed elements, 1000x1000 size
A = np.random.poisson(5, (1000, 1000))
B = np.random.poisson(5, (1000, 1000))

In [None]:
%%time
# Compute the multiplication using a double loop
C_loop = np.empty((1000, 1000)) # Create an empty matrix that we will fill

for i in range(A.shape[0]): # Loop over the rows of A
    for j in range(B.shape[1]): # Loop over the columns of B
        # Compute the sum of the element-wise multiplication
        C_loop[i, j] = np.sum(A[i, :] * B[:, j])

In [None]:
%%time
# Compute the matrix multiplication using numpy
C_numpy = A @ B

In [None]:
# Make sure both results are the same
np.all(C_numpy == C_loop)
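To check the loop logic on something small enough to verify by hand, the double loop can be wrapped in a function and applied to two $2 \times 2$ matrices. This is only a sketch: `matmul_loop`, `A_small`, and `B_small` are hypothetical names that do not appear elsewhere in the notebook.

```python
import numpy as np

def matmul_loop(A, B):
    """Multiply two matrices with an explicit double loop (for illustration)."""
    n_rows, n_cols = A.shape[0], B.shape[1]
    # Use a result type compatible with both inputs
    C = np.empty((n_rows, n_cols), dtype=np.result_type(A, B))
    for i in range(n_rows):      # Loop over the rows of A
        for j in range(n_cols):  # Loop over the columns of B
            C[i, j] = np.sum(A[i, :] * B[:, j])
    return C

A_small = np.array([[1, 2], [3, 4]])
B_small = np.array([[5, 6], [7, 8]])

C_small = matmul_loop(A_small, B_small)
print(C_small)                                      # [[19 22] [43 50]]
print(np.array_equal(C_small, A_small @ B_small))   # True
```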

So using a loop, we compute the result in approximately 4.75 seconds. When using matrix multiplication, this reduces to roughly 780 milliseconds: that is more than 80% less run time, or about a 6x speedup (exact timings will vary from machine to machine). While 4.75 seconds might not seem like much, if you have to compute many matrix multiplications, the time saved will turn out to be massive.