# NumPy tutorial

[Numerical Python](https://numpy.org) is one of the most fundamental tools in every data miner's toolbox. It is impossible to do serious data pre-processing and transformation without understanding `NumPy` and its most commonly used methods. The goal of this tutorial is to familiarize students with this awesome library.

## Introduction

99% of `NumPy` is one data structure and operations on that structure. `NumPy` stores everything in multidimensional arrays and vectorizes all operations on these arrays. As a motivating example, let's compare the speed of computing the average of 10 million random numbers stored in a list vs. an array.

In [None]:
import numpy as np
from numpy.random import randint

randoms = randint(low=0, high=1000, size=10_000_000)  # 10 million values
lst = list(randoms)

In [None]:
len(lst)

In [None]:
%%time

# old-school iteration
summ = 0
for i in range(len(lst)):
    summ += lst[i]
    
print(f'Average = {summ/len(lst)}')

In [None]:
%%time

# using built-ins sum() and len()
print(f'Average = {sum(lst)/len(lst)}')

In [None]:
%%time

# using NumPy
print(f'Average = {np.mean(randoms)}')

There are only two rules regarding `NumPy` arrays:
- every element must be of the same type and size
- if an array contains nested arrays, they must all have the same shape

Let's see how to create an array and what happens when we start messing with the types and sizes of objects

In [None]:
a = np.array([1, 2, 3, 4, 5], dtype=np.int16)

print(f'Shape (sizes of dimensions): {a.shape}')
print(f'Number of dimensions: {a.ndim}')
print(f'Length (size of the first dimension): {len(a)}')
print(f'Size (total number of elements): {a.size}')
print(f'Type : {type(a)}')
print(f'Data type (type of array elements): {a.dtype}')

Now let's see how the same descriptors can be applied to a two-dimensional array

In [None]:
a = np.array([
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25]
], dtype=np.int16)

print(f'Shape (sizes of dimensions): {a.shape}')
print(f'Number of dimensions: {a.ndim}')
print(f'Length (size of the first dimension): {len(a)}')
print(f'Size (total number of elements): {a.size}')
print(f'Type : {type(a)}')
print(f'Data type (type of array elements): {a.dtype}')

Array elements should be of the same type. Let's see what happens if we mix two types.

In [None]:
a = np.array([1, 2, 'mary', 'had', 2, 'lambs', True])

a

In [None]:
a.dtype

We can also try to change the length of an array's elements. String arrays have a fixed width, determined by the longest element at creation time.

In [None]:
a = np.array(['mary', 'had', 'a', 'little', 'lamb'])

In [None]:
a.dtype, a

In [None]:
a[4] = 'and very very very very long snake'

print(a)
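
To make the effect of the fixed-width string type explicit, here is a minimal sketch. The variable names `words` and `flexible` are made up for this illustration, and the `dtype=object` variant is shown only as one possible workaround, not as a recommendation.

```python
import numpy as np

words = np.array(['mary', 'had', 'a', 'little', 'lamb'])
# The longest word ('little') has 6 characters, so every slot holds at most 6
print(words.dtype)

words[4] = 'and very very very very long snake'
print(words[4])  # the assigned string is silently cut to 'and ve'

# An object array stores references to Python strings, so nothing is truncated
# (at the cost of losing most of NumPy's speed advantages)
flexible = np.array(['mary', 'had', 'a', 'little', 'lamb'], dtype=object)
flexible[4] = 'and very very very very long snake'
print(flexible[4])
```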

After an array has been created, it can be reshaped to any shape that holds the same number of elements. A special attribute, `.T`, is provided for transposing an array (changing rows into columns and vice versa).

In [None]:
a = np.array([0,1,2,3,4,5,6,7,8,9,10,11])

In [None]:
a.shape

In [None]:
a.shape = (1, 12)
a

In [None]:
??np.reshape

In [None]:
a = a.reshape(3, 4)

a

In [None]:
# transposition of an array

a.T
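
The constraint behind reshaping is that the total number of elements must not change. The following sketch, written for this tutorial's 12-element array, shows a valid reshape, the `-1` placeholder that lets `NumPy` infer one dimension, and the error raised for an incompatible shape. The name `m` is used only for this illustration.

```python
import numpy as np

a = np.arange(12)            # 12 elements, shape (12,)
m = a.reshape(3, 4)          # 3 * 4 == 12, so this works
print(m.shape, m.T.shape)    # (3, 4) (4, 3)

# -1 asks NumPy to infer that dimension from the total size
print(a.reshape(2, -1).shape)  # (2, 6)

# A shape whose product differs from the element count is rejected
try:
    a.reshape(5, 3)
except ValueError as err:
    print(f'Cannot reshape: {err}')
```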

## Creating arrays

The easiest way to create a 1-d array is to use a list. If you want a 2-d array, you use a list of lists. 3-d arrays are created using a list of lists of lists. You get the gist.

In [None]:
a_1d = np.array([1, 2, 3, 4])

a_2d = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
    [1, 8, 27, 64]
])

a_3d = np.array([
    [
        [0, 0],
        [0, 1],
    ],
    [
        [1, 0],
        [1, 1],
    ],
])

In [None]:
a_2d

There are utility functions in the `np` module for creating popular types of arrays:
- an array filled with zeros
- an array filled with ones
- an array filled with any value
- an array of consecutive (or stepped) values
- an array filled with random values
- a diagonal array

In [None]:
np.zeros(shape=(3,3))

In [None]:
np.zeros(shape=(3,5))

In [None]:
np.ones((3,2,3), dtype=np.int16)

In [None]:
np.full(shape=(4,2), fill_value='mary had a little lamb')

In [None]:
??np.arange

In [None]:
np.arange(-2, 2, 0.5)

In [None]:
??np.random.randn

In [None]:
np.random.randn(3, 3)

In [None]:
np.random.randint(low=1, high=7, size=(3,3))

In [None]:
np.eye(5)

In [None]:
np.eye(5, 8)

## Indexing arrays

Arrays in `NumPy` are 0-indexed. Indexing of 1-d arrays is very easy: just follow the pattern *start*:*end*:*step*, where *end* is exclusive.

In [None]:
a = np.arange(0, 10)

print(f'Full array: {a}')
print(f'First element: {a[0]}')
print(f'First three elements: {a[0:3]}')
print(f'Last element: {a[len(a)-1]} and {a[-1]}')
print(f'Even elements: {a[::2]}')

Indexing of n-dim arrays is a bit trickier. Keep in mind that, in a 2-d array, axis 0 refers to rows and axis 1 refers to columns. For higher-dimensional arrays try to build the following intuition:
- 1-d: a row of values
- 2-d: a matrix (rows and columns) of values
- 3-d: a row of arrays
- 4-d: a matrix of arrays
- and so on...

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
])

print(a)

print(f'Element at second row, third column: {a[1,2]}')
print(f'Entire first row: {a[0,]}')
print(f'Entire first row as 2-d array: {a[0, None]}')
print(f'First and second rows, last column: {a[:2,-1]}')


The same goes for all n-dim arrays. For instance, let's extract the first column (all rows) of the second matrix. You can also use indexing to assign a value to multiple array cells at once.

In [None]:
a_3d

In [None]:
a_3d[1, :, 0]

In [None]:
a_3d[1, :, 1] = -1

print(a_3d)

## Basic operations on arrays

All array operations are vectorized, so they tend to be very quick. By default, `NumPy` performs arithmetic element-wise, and this includes `*`. If you want matrix multiplication, use the `@` operator as shown below.
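
The difference between the two kinds of multiplication is easiest to see on a small example. The 2x2 matrices `p` and `q` below are made up for this illustration.

```python
import numpy as np

p = np.array([[1, 2],
              [3, 4]])
q = np.array([[5, 6],
              [7, 8]])

# Element-wise product: each cell is p[i, j] * q[i, j]
print(p * q)
# [[ 5 12]
#  [21 32]]

# Matrix product: each cell is a row of p dotted with a column of q
print(p @ q)
# [[19 22]
#  [43 50]]
```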

In [None]:
a = np.arange(0, 12)
b = np.arange(12, 24)

a.shape = b.shape = 3, 4

In [None]:
a

In [None]:
b

In [None]:
a + b

In [None]:
b - a

In [None]:
a + 10

In [None]:
a.shape, b.shape

In [None]:
a @ b.T 

## Exercises

### Calculating sliding averages

Given an array of daily measurements, create a new array with averages computed over each pair of consecutive days

In [None]:
measurements = np.arange(100)
...

### Battleships

Given a 10x10 playing field with hidden battleships and a list of shooting targets, compute the number of hits.

In [None]:
sea = np.random.randint(low=0, high=2, size=(10,10))

sea

In [None]:
targets = np.array([
    [0,3],
    [1,7],
    [2,2],
    [3,5],
    [8,2]
])
...

## Broadcasting

This is by far the most important concept in `NumPy`. Broadcasting is the automatic expansion of an array along dimensions of size 1 (or along missing dimensions) so that its shape matches the shape of the other operand.

Let's start with the simplest example.

In [None]:
a = np.arange(10)

a + 10

The same happens for 2-d arrays

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
])

b = np.array([
    [0.1, 0.2],
])

In [None]:
a.shape, b.T.shape

In [None]:
a + b.T

The simple rule for broadcasting is the following:

If we want to operate on two arrays `a` and `b`:
- moving backwards from the last dimension of each array, we check whether the two sizes are equal or one of them is 1 (a dimension missing from the shorter shape counts as 1)
- if every pair of dimensions passes this check, arrays `a` and `b` are compatible, and the result takes the larger size in each dimension.
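
The rule can be written down as a few lines of plain Python. The helper `broadcast_shape` below is a hypothetical function written for this tutorial, not part of `NumPy`; it is only a sketch of the rule, checked against what `NumPy` actually produces.

```python
import numpy as np
from itertools import zip_longest

def broadcast_shape(shape_a, shape_b):
    """Hypothetical helper: shape of a + b, or ValueError if incompatible."""
    result = []
    # Walk both shapes from the last dimension; missing dimensions count as 1
    pairs = zip_longest(reversed(shape_a), reversed(shape_b), fillvalue=1)
    for dim_a, dim_b in pairs:
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)
        elif dim_a == 1:
            result.append(dim_b)
        else:
            raise ValueError(f'incompatible dimensions {dim_a} and {dim_b}')
    return tuple(reversed(result))

print(broadcast_shape((3, 4), (3, 1)))     # (3, 4)
print(broadcast_shape((3, 1, 4), (2, 1)))  # (3, 2, 4)

# The helper agrees with what NumPy does for real arrays
actual = (np.zeros((3, 1, 4)) + np.zeros((2, 1))).shape
print(actual == broadcast_shape((3, 1, 4), (2, 1)))  # True

try:
    broadcast_shape((3, 4), (3,))  # last dimensions are 4 and 3
except ValueError as err:
    print(f'Not broadcastable: {err}')
```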

In [None]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 4))
print(a)

b = np.random.randint(low = 1, high = 10, size = (3, 1))
print(b)

In [None]:
a + b

In [None]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 1, 4))
print(a)

b = np.random.randint(low = 1, high = 10, size = (2, 1))
print(b)

In [None]:
(a + b).shape

Sometimes it is useful to modify the shape of an array manually. This can be done by indexing with `np.newaxis` (which is simply an alias for `None`).

In [None]:
a = np.array([1, 2, 3, 5, 7, 11, 13])

In [None]:
a.shape

In [None]:
a[:, np.newaxis].shape

In [None]:
a[None, :].shape

This can be very useful if one wants to build an array containing the results of a cross-join operation on two vectors. Suppose we are trying to create $c_{ij} = a_i - b_j$.

In [None]:
b = np.arange(7)

c = a[:, None] - b[None, :]

print(c)

## Boolean indexing

Anytime you have a boolean array, you can use it to mask entries in another array.

In [None]:
a = np.random.randint(0, 100, size=(5,5))

a

In [None]:
temps_greater_than_50 = a > 50

In [None]:
a[temps_greater_than_50]

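Boolean masking may be applied not only to individual values, but to whole rows and columns as well: *array*[*row_mask*] selects rows and *array*[:, *col_mask*] selects columns. Be careful when passing two masks at once:

*array*[*row_mask*,*col_mask*]

This pairs the selected row indices with the selected column indices one by one, so it returns individual cells rather than the sub-matrix of selected rows and columns.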

In [None]:
rows_2_and_4 = np.array([False, True, False, True, False])
cols_1_and_2 = np.array([True, True, False, False, False])

In [None]:
a[rows_2_and_4]

In [None]:
a[rows_2_and_4, cols_1_and_2]
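
The sketch below shows what combining two masks actually returns and two ways of getting the sub-matrix instead. It uses a deterministic 5x5 array named `grid` (made up for this illustration) and `np.ix_`, which does not appear elsewhere in this tutorial.

```python
import numpy as np

grid = np.arange(25).reshape(5, 5)
row_mask = np.array([False, True, False, True, False])   # rows 1 and 3
col_mask = np.array([True, True, False, False, False])   # columns 0 and 1

# Two masks together are paired up: cells (1, 0) and (3, 1)
print(grid[row_mask, col_mask])          # [ 5 16]

# np.ix_ builds an open mesh, so every selected row meets every selected column
print(grid[np.ix_(row_mask, col_mask)])
# [[ 5  6]
#  [15 16]]

# Equivalent two-step version: filter rows first, then columns
print(grid[row_mask][:, col_mask])
```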

In [None]:
names = np.array(["Dennis", "Dee", "Charlie", "Mac", "Frank"])
ages = np.array([43, 44, 43, 42, 74])
genders = np.array(['male', 'female', 'male', 'male', 'male'])

In [None]:
names[(genders == 'male') & (ages > 43)]

In [None]:
names[~(genders == 'male') & (ages % 2 == 0)]

## `Random` module

One of the most frequently used parts of `NumPy` is random number generation. Below you can see examples of drawing different samples:
- normal sample
- uniform sample
- choosing from a set with/without replacement

In [None]:
np.random.normal(loc=10.0, scale=1.0, size=(3,3))

In [None]:
np.random.randint(low=10, high=20, size=(3,3))

In [None]:
np.random.uniform(low=0, high=1, size=5)

In [None]:
np.random.choice(
    a=[1,2,3,4,5,6],
    replace=True,
    size=10
)

In [None]:
np.random.choice(
    a=['this','is','sampling','without','replacement'],
    replace=False,  # each word can be drawn at most once
    size=3          # must not exceed the number of available elements
)

Although most people use the `random` module as shown above, these functions are considered legacy. They all share a single hidden global generator, so seeding it or drawing from it in one place affects every other piece of code that uses the module, which makes results hard to reproduce.
The recommended solution for new code is to create an explicit `Generator` object with `np.random.default_rng` and draw all samples from it.

In [None]:
generator = np.random.default_rng(seed=123)

In [None]:
generator.integers(low=1, high=100, size=10)

In [None]:
generator.normal(loc=0, scale=1, size=10)

In [None]:
generator.choice(a=[1,2,3], replace=True, size=10)
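
The practical benefit of an explicit generator is that its state is local. As a small sketch (the names `rng_a`, `rng_b`, `deck`, and `hand` are made up for this illustration), two generators created with the same seed produce identical samples regardless of what other code does.

```python
import numpy as np

rng_a = np.random.default_rng(seed=123)
rng_b = np.random.default_rng(seed=123)

# Same seed, same sequence: the two generators do not interfere with each other
sample_a = rng_a.integers(low=1, high=100, size=5)
sample_b = rng_b.integers(low=1, high=100, size=5)
print(np.array_equal(sample_a, sample_b))  # True

# Sampling without replacement: every drawn value is distinct
deck = np.arange(10)
hand = rng_a.choice(deck, size=4, replace=False)
print(len(set(hand.tolist())) == 4)  # True
```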

## Exercises

### Two reviewers

You are given two arrays representing ratings assigned to 100 movies by two reviewers. Identify movies such that the reviewers differ in their rating by at most 1.

In [None]:
movies = np.arange(100)
reviewer_a = np.random.choice(a=[1,2,3,4,5], size=100)
reviewer_b = np.random.choice(a=[1,2,3,4,5], size=100)

## Using `where`

`np.where` is a very useful function that lets you quickly choose elements from one of two arrays based on a condition. Imagine you have two large arrays and you want to create a third array that contains, for each cell, the larger of the two values. First, let's do it the traditional way.

In [None]:
a = np.random.randint(1, 6, size=10**6)
b = np.random.randint(1, 6, size=10**6)

In [None]:
%%time
c = np.zeros(a.size)

for i in range(a.size):
    if a[i] > b[i]:
        c[i] = a[i]
    else:
        c[i] = b[i]


In [None]:
%%time
d = np.where(a > b, a, b)

In [None]:
np.array_equal(c,d)
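
`np.where` accepts any boolean condition, which makes it more general than this particular task requires. For the specific case of an element-wise maximum, `NumPy` also offers `np.maximum`; the small sketch below (with made-up arrays `x` and `y`) shows that both give the same result.

```python
import numpy as np

x = np.array([1, 5, 3])
y = np.array([4, 2, 3])

# General form: pick from x where the condition holds, otherwise from y
picked = np.where(x > y, x, y)
print(picked)              # [4 5 3]

# Dedicated function for the element-wise maximum
print(np.maximum(x, y))    # [4 5 3]

# np.where can also mix arrays with scalars, e.g. to clip small values to 0
print(np.where(x >= 3, x, 0))  # [0 5 3]
```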

## Exercises

### First to finish the assignment

Given an array of students' assignment scores ordered by increasing date of submission, you want to reward the first 3 students who submitted their work and scored at least 75 points. Increase their scores by 5 points.

In [None]:
grades = np.random.randint(low=0, high=100, size=50)
...

## Math functions

`NumPy` contains many highly optimized implementations of math functions. Whenever possible, use them instead of your own implementations. Remember that these functions generalize to n-dim arrays: the `axis` argument selects the dimension along which the computation is performed.

In [None]:
a = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
], dtype=np.float64)

In [None]:
a

In [None]:
np.sum(a)

In [None]:
np.sum(a, axis=0)

In [None]:
np.sum(a, axis=1)
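
The meaning of `axis` is a frequent source of confusion: the axis you name is the one that disappears from the result. A minimal sketch on a made-up 2x3 array `t`:

```python
import numpy as np

t = np.array([[1, 2, 3],
              [4, 5, 6]])

print(np.sum(t))          # 21, all elements
print(np.sum(t, axis=0))  # [5 7 9], rows are collapsed: one sum per column
print(np.sum(t, axis=1))  # [ 6 15], columns are collapsed: one sum per row

# keepdims=True keeps the collapsed axis with size 1, which is handy
# for broadcasting the result back against the original array
row_sums = np.sum(t, axis=1, keepdims=True)
print(row_sums.shape)     # (2, 1)
print(t / row_sums)       # each row now sums to 1
```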

But beware of `nan`s: any arithmetic that involves a `nan` produces `nan`, so a single missing value can spoil a whole computation!

In [None]:
a

In [None]:
a[2,2] = np.nan

In [None]:
a.sum()

In [None]:
a[~np.isnan(a)]

In [None]:
a[~np.isnan(a)].sum()

In [None]:
np.sum(a, where=~np.isnan(a))

In [None]:
np.nansum(a)

In [None]:
np.sum(np.nan_to_num(a))

Let's see what else we can do with `nan`s

In [None]:
a[0,1] = np.nan
a[1,3] = np.nan

In [None]:
np.isnan(a)

In [None]:
np.any(np.isnan(a), axis=1)

In [None]:
mask = np.any(np.isnan(a), axis=1)
a[mask]

## Concatenation & sorting

Concatenation means joining two or more arrays by rows or by columns. An array may be concatenated with itself or with another array. Four functions help with this: `np.concatenate`, `np.vstack`, `np.hstack`, and `np.stack`.

In [None]:
a = np.zeros(shape=(3,2))
b = np.ones(shape=(2,2))

In [None]:
a

In [None]:
b

In [None]:
np.concatenate([a, a, a, a], axis=0)

In [None]:
np.concatenate([a, b], axis=0)

In [None]:
np.vstack([a,b])

In [None]:
np.hstack([a.T,b])

In [None]:
np.stack([a[:2,:2], b], axis=0)
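
Since the outputs above consist only of zeros and ones, it helps to look at the resulting shapes. The sketch below repeats the calls on arrays named `zeros` and `ones` (made up for this illustration) and prints only the shapes. Note that `np.stack` differs from the other three: it joins arrays along a new axis, so all inputs must have exactly the same shape.

```python
import numpy as np

zeros = np.zeros(shape=(3, 2))
ones = np.ones(shape=(2, 2))

# Joining along axis 0: the column counts must match (2 and 2)
print(np.concatenate([zeros, ones], axis=0).shape)  # (5, 2)
print(np.vstack([zeros, ones]).shape)               # (5, 2)

# Joining side by side: the row counts must match, hence the transpose
print(np.hstack([zeros.T, ones]).shape)             # (2, 5)

# Stacking creates a new leading axis from two (2, 2) arrays
print(np.stack([zeros[:2, :2], ones], axis=0).shape)  # (2, 2, 2)
```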

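Unfortunately, `NumPy` does not offer a descending-order option for sorting; the usual workaround is to sort in ascending order and reverse the result with `[::-1]`. Sorting itself is done with one of two functions: `ndarray.sort`, which sorts in place, and `np.sort`, which returns a sorted copy.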

In [None]:
a = np.random.randint(1, 100, size=50)
a

In [None]:
a.sort()

In [None]:
a[::-1]

In [None]:
np.sort(a)[::-1]

If you want to reorder the rows of an array according to the values in one of its columns, you need to use `np.argsort`. It returns the indices that would sort the given values, and these indices can then be used to reorder the whole array.

In [None]:
a = np.random.randint(1, 100, size=20)
a.shape = 5,4

print(a)

In [None]:
np.sort(a, axis=0)

In [None]:
a

In [None]:
a[np.argsort(a[:,2])]
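
Because the array above is random, here is the same technique on a small fixed array so that the result can be checked by hand. The array `table` and the name `order` are made up for this illustration.

```python
import numpy as np

table = np.array([[3, 30],
                  [1, 10],
                  [2, 20]])

# Indices that would sort the first column: row 1, then row 2, then row 0
order = np.argsort(table[:, 0])
print(order)               # [1 2 0]

# Using those indices reorders whole rows, keeping each row intact
print(table[order])
# [[ 1 10]
#  [ 2 20]
#  [ 3 30]]

# Reversing the index array gives descending order
print(table[order[::-1]])
# [[ 3 30]
#  [ 2 20]
#  [ 1 10]]
```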