# NumPy tutorial

**Author: Kuba Czech, 156035**

[Numerical Python](https://numpy.org) is one of the most fundamental tools in every data miner's toolbox. It is impossible to do serious data pre-processing and transformation without understanding `NumPy` and its most commonly used methods. The goal of this tutorial is to familiarize students with this awesome library.

## Introduction

Python is among the least efficient languages out there. Data mining is a field of study that requires a lot of processing power. Why would we ever want to use Python for data mining? The answer is really simple: we do not. We use Python only as an interface to packages that contain very efficient and optimized programs, such as numpy, scipy, sklearn, tensorflow, pytorch, keras... The list is long. Python is very flexible and simple. We like its syntax and the ability to combine it with programs written in other languages, not necessarily its efficiency. In the previous tutorial we have already seen some numpy structures. When we imported the toy data sets, the data was stored in something called an `ndarray`. In fact, 99% of `NumPy` is about this very data structure and operations on it. `NumPy` stores everything in multidimensional arrays and vectorizes all operations on these arrays.

Unlike Python lists, NumPy arrays represent tensors (e.g. 1st rank tensor - a vector, 2nd rank tensor - a matrix; these are tensors in the mathematical sense, not to be confused with the TensorFlow tensor data structure). A Python list is just a list of things; in particular it could be a list of lists of different lengths, with no additional restrictions. A tensor has some limitations:
- every element must be of the same type and size
- if an array contains nested arrays, their shapes must match as well

After all this:

\begin{matrix} 1 & 2 & 3 \\ 4 & 5 & 6 & 8 \\ 7 & 8 &  \end{matrix}

is not a matrix.

While this:



In [1]:
lst = [[1,2,3],[4,5,6,8],[7,8]]

is a list.

For a motivating example, let's compare the speed of computing the average of 10 million random numbers stored in a list versus an array.

First, we are going to generate a list of 10 million random numbers. We will use a function from the numpy package named `randint`. As you might suspect, the function can be used to generate a random integer. In fact, this function is flexible enough to create a tensor of any shape filled with random integers.

In [2]:
import numpy as np
from numpy.random import randint

randoms = randint(low=0, high=1000, size=(1000,1000))
randoms = randint(low=0, high=1000, size=(100,100,100))
randoms = randint(low=0, high=1000, size=(10,10000))
randoms = randint(low=0, high=1000, size=(10000000))
lst = list(randoms)

I hope you're not surprised the size is not just a single number. An n-th rank tensor requires n numbers to describe its shape. A vector has just its length. A matrix has a number of rows and a number of columns. A cuboid has three dimensions, and so on. So, in the second to last line we generated a vector of 10 million random numbers. In the last line we converted it to a list. We can get the length of this list with the built-in `len` function.

In [3]:
len(lst)

10000000

We expect the len function to return an integer. So it does. If we try to check the length of a numpy array we are going to get only the size of the first dimension. If we want to get a full description of the tensor shape, we can use the `shape` attribute.

In [4]:
sample = np.zeros((1000,1000,1000))

print(len(sample))
print(sample.shape)

1000
(1000, 1000, 1000)


Let's start the calculation of an average - we will use the cell magic here to compare the solutions.

In [5]:
%%time

# old-school iteration
summ = 0
for i in range(len(lst)):
    summ += lst[i]
    
print(f'Average = {summ/len(lst)}')

Average = 499.5444774
CPU times: user 865 ms, sys: 2.08 ms, total: 867 ms
Wall time: 868 ms


In [6]:
%%time

# using built-ins sum() and len()
print(f'Average = {sum(lst)/len(lst)}')

Average = 499.5444774
CPU times: user 215 ms, sys: 768 µs, total: 216 ms
Wall time: 216 ms


In [7]:
%%time

# using NumPy
print(f'Average = {np.mean(randoms)}')

Average = 499.5444774
CPU times: user 4.38 ms, sys: 817 µs, total: 5.2 ms
Wall time: 3.99 ms


Let's see how to create an array and what happens when we start messing with the types and sizes of objects.

In the previous examples we already saw how to create an array of random numbers (and an array, which consists of zeros).
We can also create a numpy array with a python list, like this:

In [8]:
a = np.array([1, 2, 3, 4, 5])

We have also seen how to get the exact shape of an array. In fact, there are more descriptors for an array.

In [9]:
print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (5,)
Number of dimensions: 1
Length (number of elements): 5
Size (number of nested elements): 5
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64



Now let's see how the same descriptors can be applied to two- and three-dimensional arrays.

In [10]:
a = np.array([
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25]
])

print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (2, 5)
Number of dimensions: 2
Length (number of elements): 2
Size (number of nested elements): 10
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64



In [11]:
a = np.array([
    [
    [1, 2, 3, 4, 5],
    [1, 1, 2, 3, 5]
    ],
    [
    [1, 2, 3, 4, 5],
    [1, 4, 9, 16, 25]
    ]
])

print(f"""
Shape (sizes of dimensions): {a.shape}
Number of dimensions: {a.ndim}
Length (number of elements): {len(a)}
Size (number of nested elements): {a.size}
Type : {type(a)}
Data type (type of array elements): {a.dtype}
""")


Shape (sizes of dimensions): (2, 2, 5)
Number of dimensions: 3
Length (number of elements): 2
Size (number of nested elements): 20
Type : <class 'numpy.ndarray'>
Data type (type of array elements): int64



Array elements should be of the same type. Let's see what happens if we mix two or more types.

In [12]:
a = np.array([1, 2, 'mary', 'had', 2.5, 'lambs'])
print(a)

['1' '2' 'mary' 'had' '2.5' 'lambs']


In [13]:
a.dtype #1, 2, 2.5 were converted to strings

dtype('<U32')

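We can also try to assign a string that is longer than the array's element size. The element size is fixed when the array is created, so the new string is silently truncated.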

In [14]:
a = np.array(['mary', 'had', 'a', 'little', 'lamb'])
a.dtype  # '<U6': '<' is the byte order (little-endian), 'U6' means Unicode strings of at most 6 characters

dtype('<U6')

In [15]:
a[4] = 'and very very very very long snake'
a

array(['mary', 'had', 'a', 'little', 'and ve'], dtype='<U6')
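The truncation above happens because the dtype fixes the maximum string length when the array is created. A minimal sketch of one way around it, assuming a width of 40 characters is enough for your data (the width and the name `words` are illustrative choices, not part of the original example):

```python
import numpy as np

# Reserve room for up to 40 characters per element up front.
words = np.array(['mary', 'had', 'a', 'little', 'lamb'], dtype='<U40')
words[4] = 'and very very very very long snake'

# The whole string is stored, because it is shorter than 40 characters.
print(words[4])
print(words.dtype)
```

If the required width is not known in advance, building the array from the final list of strings lets NumPy pick the width itself.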

After an array has been created, it can be reshaped to any shape with the same total number of elements. The `T` attribute gives the transpose of an array (rows become columns and vice versa).

In [16]:
a = np.array(list(range(12)))

a.shape = (3,4)
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [17]:
a = a.reshape(6, 2)
a

array([[ 0,  1],
       [ 2,  3],
       [ 4,  5],
       [ 6,  7],
       [ 8,  9],
       [10, 11]])

In [18]:
a.T

array([[ 0,  2,  4,  6,  8, 10],
       [ 1,  3,  5,  7,  9, 11]])
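Two details of reshaping are easy to miss: one dimension can be left for NumPy to infer, and the reshaped array usually shares memory with the original. The following is a small sketch of both points; the variable names are illustrative only.

```python
import numpy as np

a = np.arange(12)

# -1 asks NumPy to infer that dimension from the total size (12 / 3 = 4).
b = a.reshape(3, -1)
print(b.shape)

# For a contiguous array like this one, reshape returns a view:
# both names refer to the same underlying data.
b[0, 0] = 100
print(a[0])

# The transpose is a view as well, only the shape is swapped.
print(b.T.shape)
```

Use `.copy()` when an independent array is needed.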

## Creating arrays

The easiest way to create a 1-d array is to use a list. If you want a 2-d array, you use a list of lists. 3-d arrays are created using a list of lists of lists. You get the gist.

In [19]:
a_1d = np.array([1, 2, 3, 4])

a_2d = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
    [1, 8, 27, 64]
])

a_3d = np.array([
    [
        [0, 0],
        [0, 1],
    ],
    [
        [1, 0],
        [1, 1],
    ],
])

There are utility functions in the `np` module for creating popular types of arrays:
- an array filled with zeros
- an array filled with ones
- an array filled with any value
- an array of consecutive (or stepped) values
- an array filled with random values
- a diagonal array

In [20]:
np.zeros(shape=(3,3))

array([[0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [21]:
np.zeros(shape=(3,5))

array([[0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0.]])

In [22]:
np.ones(10)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])

In [23]:
np.ones(10, dtype=np.int32)

array([1, 1, 1, 1, 1, 1, 1, 1, 1, 1], dtype=int32)

In [24]:
np.full(shape=(4,4), fill_value='empty')

array([['empty', 'empty', 'empty', 'empty'],
       ['empty', 'empty', 'empty', 'empty'],
       ['empty', 'empty', 'empty', 'empty'],
       ['empty', 'empty', 'empty', 'empty']], dtype='<U5')

??np.arange

In [25]:
np.arange(-2, 2, 0.5)

array([-2. , -1.5, -1. , -0.5,  0. ,  0.5,  1. ,  1.5])

In [26]:
np.random.randn(3, 3, 2) #array of real value elements

array([[[-1.78952412,  0.39015536],
        [ 0.85481288, -0.29831703],
        [-0.80741834, -0.39294963]],

       [[ 0.62134825, -1.22810024],
        [ 0.60495966,  1.34635497],
        [-1.16300635,  0.59386814]],

       [[-1.20676218, -0.64187466],
        [ 0.3165754 ,  0.028176  ],
        [-0.0763222 , -0.4889157 ]]])

In [27]:
np.random.randint(low=1, high=7, size=10)

array([5, 4, 6, 3, 1, 2, 4, 3, 3, 5])

In [28]:
np.eye(5)

array([[1., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0.],
       [0., 0., 1., 0., 0.],
       [0., 0., 0., 1., 0.],
       [0., 0., 0., 0., 1.]])

In [29]:
np.eye(5, 8)

array([[1., 0., 0., 0., 0., 0., 0., 0.],
       [0., 1., 0., 0., 0., 0., 0., 0.],
       [0., 0., 1., 0., 0., 0., 0., 0.],
       [0., 0., 0., 1., 0., 0., 0., 0.],
       [0., 0., 0., 0., 1., 0., 0., 0.]])

## Indexing arrays

Arrays in `NumPy` are 0-indexed. Indexing 1-d arrays is very easy: just follow the pattern *start*:*end*:*step*, where *end* is exclusive.

In [30]:
a = np.arange(0, 10)

print(f"""{a}

First element: {a[0]}
First three elements: {a[0:3]}
Last element: {a[len(a)-1]} and {a[-1]}
Even elements: {a[::2]}
""")

[0 1 2 3 4 5 6 7 8 9]

First element: 0
First three elements: [0 1 2]
Last element: 9 and 9
Even elements: [0 2 4 6 8]



Indexing of n-dim arrays is a bit more tricky. Keep in mind that axis 0 refers to rows and axis 1 refers to columns. For high dimensional arrays try to build the following intuition:
- 1-d: a row of values
- 2-d: a matrix (rows and columns) of values
- 3-d: a row of arrays
- 4-d: a matrix of arrays
- and so on...

In [31]:
a = np.array([
    [1, 2, 3, 4],
    [10, 20, 30, 40],
    [100, 200, 300, 400],
])

print(f"""{a}

Element at second row, third column: {a[1,2]}
Entire first row: {a[0]}
Entire first row as 2-d array: {a[0, None]}
First and second rows, last column: {a[:2,-1]}
""")

[[  1   2   3   4]
 [ 10  20  30  40]
 [100 200 300 400]]

Element at second row, third column: 30
Entire first row: [1 2 3 4]
Entire first row as 2-d array: [[1 2 3 4]]
First and second rows, last column: [ 4 40]



The same goes for all n-dim arrays. For instance, let's extract the first matrix, all rows, first column. You can also use indexing to assign a value to multiple array cells at once.

In [32]:
a_3d

array([[[0, 0],
        [0, 1]],

       [[1, 0],
        [1, 1]]])

In [33]:
a_3d[0, :,0]

array([0, 0])

In [34]:
a_3d[1, :, 1] = -1

print(a_3d)

[[[ 0  0]
  [ 0  1]]

 [[ 1 -1]
  [ 1 -1]]]
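Assignment through an index works because basic slices are views of the original data rather than copies. A short sketch of the difference, using a hypothetical array named `grid`:

```python
import numpy as np

grid = np.zeros((3, 4), dtype=int)

# A basic slice is a view: writing to it changes the original array.
first_row = grid[0]
first_row[:] = 7

# .copy() creates independent data, so the original stays untouched.
last_col = grid[:, -1].copy()
last_col[:] = -1

print(grid)
```

Only the first row of `grid` changes; the write to `last_col` affects the copy alone.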


## Basic operations on arrays

All array operations are vectorized, so they tend to be very quick. By default, `NumPy` performs element-wise array operations. If you want matrix multiplication, use the `@` operator as shown below.

In [35]:
a = np.arange(0, 12)
b = np.arange(12, 24)

a.shape = b.shape = 3, 4

In [36]:
a

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [37]:
b

array([[12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23]])

In [38]:
a + b

array([[12, 14, 16, 18],
       [20, 22, 24, 26],
       [28, 30, 32, 34]])

In [39]:
b - a

array([[12, 12, 12, 12],
       [12, 12, 12, 12],
       [12, 12, 12, 12]])

In [40]:
a * 10

array([[  0,  10,  20,  30],
       [ 40,  50,  60,  70],
       [ 80,  90, 100, 110]])

In [41]:
a @ b.T #@ - matrix multiplication

array([[ 86, 110, 134],
       [302, 390, 478],
       [518, 670, 822]])
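To make the difference between `*` and `@` explicit, here is a small sketch on two hypothetical 2x3 arrays. It also shows the error raised when the inner dimensions of a matrix product do not agree.

```python
import numpy as np

a = np.arange(6).reshape(2, 3)
b = np.arange(6, 12).reshape(2, 3)

# Element-wise product: shapes must match (or broadcast), result is (2, 3).
elementwise = a * b

# Matrix product: inner dimensions must agree, (2, 3) @ (3, 2) gives (2, 2).
matrix = a @ b.T

print(elementwise.shape, matrix.shape)

# a @ b fails because the inner dimensions (3 and 2) differ.
try:
    a @ b
except ValueError as err:
    print("shape mismatch:", err)
```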

## Homework

### Calculating sliding averages

Given an array of daily measurements, create a new array with averages computed over each pair of consecutive days. Compare the execution time of various solutions.

In the first solution you should use a list comprehension to create an array of pairs of measurements. Then, using the numpy `average` function, calculate the average of each pair.

In [42]:
measurements = np.arange(100)

In [52]:
#??np.average

49.5


In [55]:
#%%timeit
arr = [[measurements[i], measurements[i+1]] for i in range(len(measurements)-1)]
sol1 = np.zeros(len(measurements)-1)
for i in range(len(arr)):
    sol1[i] = np.average(arr[i])

In the second solution you are still supposed to create an array of pairs of measurements. This time use the numpy vstack to stack two duplicates of the measurements array. Cut the last day from the first duplicate, cut the first day from the second duplicate. Remember to check the validity of the solution, use transposition if necessary.

In [63]:
??np.vstack

Signature: np.vstack(tup)

    Stack arrays in sequence vertically (row wise).

    This is equivalent to concatenation along the first axis after 1-D arrays
    of shape `(N,)` have been reshaped to `(1,N)`. Rebuilds arrays divided by
    `vsplit`.

    This function makes most sense for arrays with up to 3 dimensions. For
    instance, for pixel-data with a height (first axis), width (second axis),
    and r/g/b channels (third axis). The functions `concatenate`, `stack` and
    `block` provide more general stacking and concatenation operations.

In [54]:
??np.transpose

In [53]:
a = np.array([[1,2,3],[2,3,4]])
print(a)
print(a.T)

[[1 2 3]
 [2 3 4]]
[[1 2]
 [2 3]
 [3 4]]


In [54]:
sol2 = np.vstack((measurements[:-1], measurements[1:])).mean(axis = 0)
correct_sol = [0.5 + i for i in range(99)] #we have 100 - 1 = 99 pairs
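The stacked solution can also be checked without converting to lists. The sketch below is one possible variant, not the required homework form: it adds two shifted slices directly and compares against the expected values with `np.allclose`, which tolerates small floating-point differences. The names `pair_means` and `expected` are illustrative.

```python
import numpy as np

measurements = np.arange(100)

# Average each day with the next one by adding two shifted views.
pair_means = (measurements[:-1] + measurements[1:]) / 2

# Reference values: 0.5, 1.5, ..., 98.5 (99 pairs for 100 days).
expected = np.arange(99) + 0.5

# allclose compares element-wise within a small tolerance.
print(np.allclose(pair_means, expected))
```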

In your third solution use the discrete convolution function. This time you calculate the averages with the convolution directly; you don't need the average function here.

An example of discrete convolution:

Given a filter $G=[g_1,g_2,g_3]$ and a vector (1st rank tensor) $V=[v_1,v_2,v_3,v_4,v_5,v_6]$, the convolution of the vector $V$ and the filter $G$ is calculated as follows:

$$
\begin{aligned}
V * G = [\,& g_1\cdot v_1+g_2\cdot v_2+g_3\cdot v_3, \\
& g_1\cdot v_2+g_2\cdot v_3+g_3\cdot v_4, \\
& g_1\cdot v_3+g_2\cdot v_4+g_3\cdot v_5, \\
& g_1\cdot v_4+g_2\cdot v_5+g_3\cdot v_6\,]
\end{aligned}
$$

Keep in mind that the filter can be of any size as long as each component of its shape is less than or equal to the corresponding component of the tensor's shape. However, the dimensionality of the filter has to match the dimensionality of the tensor (a filter of lower dimensionality is interpreted as if it had the same dimensionality as the tensor). Strictly speaking, the formula above slides the filter without reversing it, while `np.convolve` reverses the filter first; for a symmetric filter such as $[0.5, 0.5]$ the two give the same result. Notice the similarity between a sliding window and the convolution: in both cases you put a "window" or a "filter" on the data and move it through the data. In the example above the window moves like this:

[**_1,2,3_**,4,5,6]

[1,**_2,3,4_**,5,6]

[1,2,**_3,4,5_**,6]

[1,2,3,**_4,5,6_**]
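The window movement shown above can be reproduced by hand and compared with `np.convolve`. This is only a sketch: the filter weights are made up for illustration, and the filter is passed reversed because `np.convolve` flips its second argument before sliding it.

```python
import numpy as np

v = np.array([1, 2, 3, 4, 5, 6])
g = np.array([0.2, 0.3, 0.5])  # hypothetical filter weights

# Slide the window by hand: one dot product per window position.
n_windows = len(v) - len(g) + 1
manual = np.array([g @ v[i:i + len(g)] for i in range(n_windows)])

# np.convolve reverses its second argument, so pass the filter reversed
# to obtain the same sliding-window sums.
library = np.convolve(v, g[::-1], mode='valid')

print(manual)
print(np.allclose(manual, library))
```

With `mode='valid'` only the positions where the window fits entirely inside the data are returned, which is why four values come out for six inputs and a filter of length three.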


In [85]:
??np.convolve

Signature: np.convolve(a, v, mode='full')

    Returns the discrete, linear convolution of two one-dimensional sequences.

    The convolution operator is often seen in signal processing, where it
    models the effect of a linear time-invariant system on a signal [1]_.  In
    probability theory, the sum of two independent random variables is
    distributed according to the convolution of their individual
    distributions.

    If `v` is longer than `a`, the arrays are swapped before computation.

In [56]:
%%time
sol3 = np.convolve(measurements, [0.5, 0.5], mode = 'valid')

CPU times: user 31 µs, sys: 24 µs, total: 55 µs
Wall time: 43.9 µs


In [60]:
print(correct_sol == list(sol1))
print(correct_sol == list(sol2))
print(correct_sol == list(sol3))

True
True
True


In the fourth solution use the numpy `insert` function to add a 0 at the beginning of the measurements array (for padding purposes). Then calculate the cumulative sums of the padded measurements. Finally, use the cumulative sum array to calculate the averages (it is a combination of picking elements in the correct order and a simple numerical operation).

hint (here $x$ is the padded array, so $x_0 = 0$, and $c_k = x_0 + \dots + x_k$):

measurements = [1,**2,3**,4,5,6]

cumulative_sums = [0,**1**,3,**6**,10,15,21]

$$c_1 = x_0+x_1$$
$$c_3=x_0+x_1+x_2+x_3$$
$$c_3-c_1 = x_2+x_3$$

measurements = [1,2,**3,4**,5,6]

cumulative_sums = [0,1,**3**,6,**10**,15,21]

$$c_2 = x_0+x_1+x_2$$
$$c_4=x_0+x_1+x_2+x_3+x_4$$
$$c_4-c_2 = x_3+x_4$$



In [88]:
??np.insert

Signature: np.insert(arr, obj, values, axis=None)

    Insert values along the given axis before the given indices.

    Parameters
    ----------
    arr : array_like
        Input array.
    obj : int, slice or sequence of ints
        Object that defines the index or indices before which `values` is
        inserted.

In [91]:
??np.cumsum

Signature: np.cumsum(a, axis=None, dtype=None, out=None)

    Return the cumulative sum of the elements along a given axis.

    Parameters
    ----------
    a : array_like
        Input array.
    axis : int, optional
        Axis along which the cumulative sum is computed. The default
        (None) is to compute the cumsum over the flattened array.

In [61]:
# np.insert returns a new array, so the padded result has to be stored
padded = np.insert(measurements, 0, 0)
cs = np.cumsum(padded)
print(cs[:10])

[ 0  0  1  3  6 10 15 21 28 36]
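The last step of the hint, turning cumulative sums into pair averages, could look like the sketch below. The names `padded`, `window_sums` and `sol4` are assumptions made for this illustration, following the `sol1`..`sol3` naming used earlier.

```python
import numpy as np

measurements = np.arange(100)

# Pad with a leading zero, then accumulate: cs[k] is the sum of padded[0..k].
padded = np.insert(measurements, 0, 0)
cs = np.cumsum(padded)

# cs[i + 2] - cs[i] is the sum of two consecutive measurements.
window_sums = cs[2:] - cs[:-2]
sol4 = window_sums / 2

print(sol4[:5])
```

The same subtraction pattern works for longer windows: replace the offset 2 with the window length and divide by that length.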


## Broadcasting

This is by far the most important concept in `NumPy`. Broadcasting is the automatic expansion of arrays so that their shapes match those of the other operands.

Let's start with the simplest example.

In [62]:
a = np.arange(10)

a + 10

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

The same happens for 2-d arrays

In [63]:
a = np.array([
    [1, 2, 3, 4],
    [1, 4, 9, 16],
])

b = np.array([
    [0.1, 0.2, 0.3, 0.4]
])

In [64]:
a

array([[ 1,  2,  3,  4],
       [ 1,  4,  9, 16]])

In [65]:
b

array([[0.1, 0.2, 0.3, 0.4]])

In [66]:
a + b

array([[ 1.1,  2.2,  3.3,  4.4],
       [ 1.1,  4.2,  9.3, 16.4]])

In [67]:
a.shape, b.shape

((2, 4), (1, 4))

The simple rule for broadcasting is the following:

If we want to operate on two arrays `a` and `b`:
- moving backwards from the last dimension of each array, we check whether the two sizes are equal or one of them is 1
- if one array has fewer dimensions, its missing leading dimensions are treated as 1
- if every pair of dimensions passes this check, arrays `a` and `b` are compatible.
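The rule above can be checked without computing anything large. The sketch below uses `np.broadcast` to ask NumPy for the resulting shape and then shows what happens for a pair of shapes that fails the check; the example shapes are chosen only for illustration.

```python
import numpy as np

a = np.ones((3, 1, 4))
b = np.ones((2, 1))

# Align shapes from the right; b is treated as shape (1, 2, 1).
#   a: 3 x 1 x 4
#   b: 1 x 2 x 1
# Every pair of sizes is equal or contains a 1, so the result is 3 x 2 x 4.
print(np.broadcast(a, b).shape)

# Shapes (3, 4) and (3,) do not match: the trailing sizes are 4 and 3.
try:
    np.ones((3, 4)) + np.ones(3)
except ValueError as err:
    print("incompatible:", err)
```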

In [68]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 4))
a

array([[4, 7, 6, 5],
       [9, 2, 8, 7],
       [9, 1, 6, 1]])

In [69]:
b = np.random.randint(low = 1, high = 10, size = (3, 1))
b

array([[7],
       [3],
       [1]])

In [70]:
a + b

array([[11, 14, 13, 12],
       [12,  5, 11, 10],
       [10,  2,  7,  2]])

In [71]:
np.random.seed(1234)

a = np.random.randint(low = 1, high = 10, size = (3, 1, 4))
a

array([[[4, 7, 6, 5]],

       [[9, 2, 8, 7]],

       [[9, 1, 6, 1]]])

In [72]:
b = np.random.randint(low = 1, high = 10, size = (2, 1))
b

array([[7],
       [3]])

In [73]:
a + b

array([[[11, 14, 13, 12],
        [ 7, 10,  9,  8]],

       [[16,  9, 15, 14],
        [12,  5, 11, 10]],

       [[16,  8, 13,  8],
        [12,  4,  9,  4]]])

Sometimes it is useful to be able to manually modify the shape of the array. This can be done by indexing with `np.newaxis` (which is simply an alias for `None`).

In [74]:
a = np.array([1, 2, 3, 5, 7, 11, 13])

In [75]:
a

array([ 1,  2,  3,  5,  7, 11, 13])

In [76]:
a[:, np.newaxis]

array([[ 1],
       [ 2],
       [ 3],
       [ 5],
       [ 7],
       [11],
       [13]])

In [77]:
a[None, :]

array([[ 1,  2,  3,  5,  7, 11, 13]])

In [78]:
a[:, None]

array([[ 1],
       [ 2],
       [ 3],
       [ 5],
       [ 7],
       [11],
       [13]])

This can be very useful if one wants to build an array containing the results of a cross-join operation on two vectors. Suppose we are trying to create $c_{ij} = a_i - b_j$.

In [79]:
b = np.arange(7)
c = a[:, None] - b[None, :]
c

array([[ 1,  0, -1, -2, -3, -4, -5],
       [ 2,  1,  0, -1, -2, -3, -4],
       [ 3,  2,  1,  0, -1, -2, -3],
       [ 5,  4,  3,  2,  1,  0, -1],
       [ 7,  6,  5,  4,  3,  2,  1],
       [11, 10,  9,  8,  7,  6,  5],
       [13, 12, 11, 10,  9,  8,  7]])

## Homework

### Battleships

Given a 10x10 playing field with hidden battleships and a list of shooting targets, compute the number of hits. You are only allowed to use the following functions:

 - ndarray.take
 - ndarray.T
 - ndarray.shape



In [80]:
import numpy as np
sea = np.random.randint(low=0, high=2, size=(10,10))

sea
#and also +, -, *, /, @, [], ()

array([[0, 1, 0, 1, 0, 0, 0, 1, 1, 1],
       [0, 1, 1, 0, 1, 0, 1, 0, 1, 1],
       [1, 1, 0, 1, 0, 1, 1, 0, 0, 1],
       [0, 0, 1, 1, 1, 0, 0, 0, 1, 1],
       [1, 1, 1, 1, 1, 0, 1, 0, 1, 0],
       [1, 0, 0, 0, 0, 0, 0, 0, 0, 1],
       [1, 1, 0, 1, 0, 0, 1, 1, 0, 1],
       [0, 1, 1, 0, 1, 0, 0, 0, 1, 1],
       [0, 0, 0, 0, 0, 1, 1, 1, 1, 0],
       [1, 0, 1, 0, 1, 0, 1, 0, 1, 0]])

In [81]:

targets = np.array([
    [0,3],
    [1,7],
    [2,2],
    [3,5],
    [8,2]
]) #10000

shots = np.zeros(shape=(10, 10))
shots[targets[:,0],targets[:,1]] = 1
hits = int((shots * sea).sum())
print(f"Number of hits is equal to: {hits}")

Number of hits is equal to: 1


In [82]:
sea.take(targets)

array([[0, 1],
       [1, 1],
       [0, 0],
       [1, 0],
       [1, 0]])
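Calling `take` with the (row, column) pairs directly, as in the cell above, treats every number as an index into the flattened field, so it does not look up the targets themselves. One possible sketch that stays within the allowed operations converts each pair to a flat index first. The 3x3 field and the names `flat_idx` and `shot_results` are made up for this illustration.

```python
import numpy as np

# A small hypothetical field: 1 marks a ship, 0 marks water.
sea = np.array([
    [0, 1, 0],
    [1, 0, 0],
    [0, 0, 1],
])
targets = np.array([[0, 1], [1, 1], [2, 2]])  # (row, column) pairs

# Convert each (row, column) pair to an index into the flattened field.
n_cols = sea.shape[1]
flat_idx = targets.T[0] * n_cols + targets.T[1]

# take() without an axis reads from the flattened array.
shot_results = sea.take(flat_idx)

# The entries are 0 or 1, so the dot product with itself counts the ones.
hits = shot_results @ shot_results
print(hits)
```

Unlike the mask-based approach, this counts a target twice if it appears twice in the list.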

## Boolean indexing

Anytime you have a boolean array, you can use it to mask entries in another array.

In [83]:
a = np.random.randint(0, 100, size=(5,5))
a

array([[90, 33, 21, 71, 68],
       [81, 52, 64, 85, 41],
       [ 1, 14,  3, 30, 12],
       [73, 19, 26, 96, 68],
       [64, 22, 56, 84,  8]])

In [84]:
mask = a > 80
mask

array([[ True, False, False, False, False],
       [ True, False, False,  True, False],
       [False, False, False, False, False],
       [False, False, False,  True, False],
       [False, False, False,  True, False]])

In [85]:
a[mask]

array([90, 81, 85, 96, 84])

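Boolean masking may be applied not only to values, but to rows and columns as well. Pass one mask per axis:

*array*[*row_mask*,*col_mask*]

Note that two masks used together select paired positions (the first selected row with the first selected column, and so on), not the full block of selected rows and columns.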

In [86]:
rows_2_and_4 = np.array([False, True, False, True, False])
cols_1_and_2 = np.array([True, True, False, False, False])

In [87]:
a[rows_2_and_4]

array([[81, 52, 64, 85, 41],
       [73, 19, 26, 96, 68]])

In [88]:
a[rows_2_and_4, cols_1_and_2]

array([81, 19])
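When the goal is the full block of selected rows and columns rather than paired elements, `np.ix_` can be used. The sketch below contrasts the two behaviours on a hypothetical 5x5 array.

```python
import numpy as np

a = np.arange(25).reshape(5, 5)
row_mask = np.array([False, True, False, True, False])
col_mask = np.array([True, True, False, False, False])

# Two masks together pair up the selected positions: (1, 0) and (3, 1).
paired = a[row_mask, col_mask]

# np.ix_ builds an open mesh, selecting every chosen row/column combination.
block = a[np.ix_(row_mask, col_mask)]

print(paired)
print(block)
```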

In [89]:
names = np.array(["Dennis", "Dee", "Charlie", "Mac", "Frank"])
ages = np.array([43, 44, 43, 42, 74])
genders = np.array(['male', 'female', 'male', 'male', 'male'])

In [90]:
names[(genders == 'male') & (ages > 43)]

array(['Frank'], dtype='<U7')

In [91]:
names[~(genders == 'male') & (ages % 2 == 0)]

array(['Dee'], dtype='<U7')

## `Random` module

One of the most frequently used parts of `NumPy` is random number generation. Below you can see examples of different samples:
- normal sample
- uniform sample
- choosing from a set with/without replacement

In [93]:
np.random.normal(loc=10.0, scale=1.0, size=10) #loc == mean, scale == stdev

array([ 9.28736945, 10.86234561,  9.42244972,  9.48818195, 10.42555169,
        9.93624175, 10.19689478,  9.88686182,  9.72597595, 10.27021894])

In [94]:
np.random.randint(low=10, high=20, size=(3,3))

array([[17, 19, 13],
       [14, 14, 11],
       [11, 16, 12]])

In [95]:
np.random.uniform(low=0, high=1, size=5)

array([0.70652816, 0.72665846, 0.90008784, 0.7791638 , 0.59915478])

In [96]:
np.random.choice(
    a=[1,2,3,4,5,6],
    replace=True,
    size=5
)

array([2, 6, 6, 6, 4])

In [97]:
np.random.choice(
    a=['this','is','sampling','without','replacement'],
    replace=False,
    size=3
)

array(['sampling', 'replacement', 'this'], dtype='<U11')

Although most people use the `random` module as shown above, these module-level functions belong to `NumPy`'s legacy interface. They all share one hidden global generator, so a call to `np.random.seed` or any draw made elsewhere in the program changes the numbers your code receives, which makes results harder to reproduce.
A simple solution is to create your own `Generator` object with `np.random.default_rng` and draw all samples from it.

In [98]:
generator = np.random.default_rng(seed=123)

In [99]:
generator.integers(low=1, high=100, size=10)

array([ 2, 68, 59,  6, 90, 22, 26, 19, 34, 18])

In [100]:
generator.normal(loc=0, scale=1, size=10)

array([ 0.57710379, -0.63646365,  0.54195222, -0.31659545, -0.32238912,
        0.09716732, -1.52593041,  1.1921661 , -0.67108968,  1.00026942])

In [101]:
generator.choice(a=[1,2,3], replace=True, size=10)

array([1, 2, 2, 3, 3, 1, 3, 3, 1, 2])
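A common pattern is to pass the generator explicitly to the code that needs randomness, so that the seed is controlled in one place. The helper `roll_dice` below is a hypothetical function written for this sketch, not part of `NumPy`.

```python
import numpy as np


def roll_dice(rng, n):
    """Roll n six-sided dice using the generator passed in."""
    return rng.integers(low=1, high=7, size=n)


# Two generators built from the same seed produce the same stream.
first = roll_dice(np.random.default_rng(seed=42), 10)
second = roll_dice(np.random.default_rng(seed=42), 10)

print(np.array_equal(first, second))
```

Because each generator carries its own state, draws made through one generator do not affect another.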

## Homework

### Two reviewers

You are given two arrays representing ratings assigned to 100 movies by two reviewers. Identify movies such that the reviewers differ in their rating by at most 1.

In [103]:
movies = np.arange(100)
reviewer_a = np.random.choice(a=[1,2,3,4,5], size=100)
reviewer_b = np.random.choice(a=[1,2,3,4,5], size=100)
# print(reviewer_a)
# print(reviewer_b)

In [104]:
movies_with_similar_review = movies[abs(reviewer_a - reviewer_b) <= 1]
movies_with_similar_review

array([ 0,  2,  8,  9, 10, 11, 12, 15, 16, 20, 21, 22, 23, 24, 25, 27, 31,
       34, 41, 42, 44, 46, 52, 54, 55, 57, 58, 59, 60, 61, 62, 63, 67, 69,
       71, 72, 73, 76, 78, 79, 80, 82, 83, 85, 86, 88, 90, 91, 92, 93, 95,
       96, 97, 99])

## Using `where`

`np.where` is a very useful function that lets you quickly choose elements based on a condition. Imagine you have two large arrays and you want to create a third array that contains, for each cell, the larger value from the two arrays. First, let's do it the traditional way.

In [105]:
a = np.random.randint(1, 6, size=10**5)
b = np.random.randint(1, 6, size=10**5)

In [106]:
%%time
c = np.zeros(a.size)

for i in range(a.size):
    if a[i] > b[i]:
        c[i] = a[i]
    else:
        c[i] = b[i]


CPU times: user 33.4 ms, sys: 957 µs, total: 34.4 ms
Wall time: 33.5 ms


In [107]:
%%time
d = np.where(a > b, a, b)

CPU times: user 602 µs, sys: 586 µs, total: 1.19 ms
Wall time: 673 µs


In [108]:
np.array_equal(c,d)

True
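For this particular task there is also a dedicated element-wise function, `np.maximum`. The sketch below, with its own randomly generated inputs, checks that it agrees with the `np.where` formulation; `np.where` remains the more general tool when the two branches are not simply "larger of the two".

```python
import numpy as np

rng = np.random.default_rng(seed=0)
a = rng.integers(1, 6, size=10**5)
b = rng.integers(1, 6, size=10**5)

# np.where picks from a where the condition holds, otherwise from b.
via_where = np.where(a > b, a, b)

# np.maximum expresses the same element-wise choice directly.
via_maximum = np.maximum(a, b)

print(np.array_equal(via_where, via_maximum))
```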

## Homework

### First to finish the assignment

Given an array with students' scores ordered by increasing date of submission, you want to reward the first 3 students who submitted their work and got at least 75 points. Increase their scores by 5 points.

In [109]:
grades = np.random.randint(low=0, high=100, size=50)

idx = np.where(grades >= 75)[0][:3]
grades[idx] += 5
grades

array([64, 55, 68, 11, 69, 72, 28, 12, 93, 24,  8, 35,  9, 32, 33, 91, 58,
       52, 36, 33, 94, 19, 87, 13, 72, 16, 54, 12,  1, 68, 13, 66, 50, 23,
       13, 91,  5, 16,  1, 94, 91,  8, 24, 58, 49, 91, 99, 84, 51, 30])

## Math functions

`NumPy` contains several highly optimized implementations of math functions. Whenever possible, use them instead of your own implementations. Remember that math functions generalize easily to n-dim arrays, and the `axis` argument selects the dimension along which they operate.

In [110]:
a = np.array([
    [1, 2, 3, 4],
    [5, 6, 7, 8],
    [9, 10, 11, 12],
    [13, 14, 15, 16],
], dtype=np.float64)

In [111]:
np.sum(a)

136.0

In [114]:
np.sum(a, axis=0)

array([28., 32., 36., 40.])

In [115]:
np.sum(a, axis=1)

array([10., 26., 42., 58.])

But beware of `nan`s, as they tend to destroy all math!

In [116]:
a[2,2] = np.nan

In [117]:
np.sum(a)

nan

In [118]:
np.isnan(a)

array([[False, False, False, False],
       [False, False, False, False],
       [False, False,  True, False],
       [False, False, False, False]])

In [119]:
np.sum(a, where=~np.isnan(a))

125.0

In [120]:
np.nansum(a)

125.0

In [121]:
np.sum(np.nan_to_num(a))

125.0

Let's see what else we can do with `nan`s

In [122]:
a[0,1] = np.nan
a[1,3] = np.nan

In [123]:
np.isnan(a)

array([[False,  True, False, False],
       [False, False, False,  True],
       [False, False,  True, False],
       [False, False, False, False]])

In [124]:
np.any(np.isnan(a), axis=1)

array([ True,  True,  True, False])

In [125]:
mask = np.any(np.isnan(a), axis=1)
a[mask]

array([[ 1., nan,  3.,  4.],
       [ 5.,  6.,  7., nan],
       [ 9., 10., nan, 12.]])
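The same kind of row mask can be negated to keep only the complete rows, which is a common cleaning step. The following sketch uses a small made-up array and also shows `np.nanmean` as a NaN-aware alternative to dropping rows.

```python
import numpy as np

a = np.array([
    [1.0, np.nan, 3.0],
    [4.0, 5.0, 6.0],
    [7.0, 8.0, np.nan],
])

# True for every row that contains at least one NaN.
has_nan = np.isnan(a).any(axis=1)

# Negate the mask to keep only the complete rows.
complete_rows = a[~has_nan]
print(complete_rows)

# Column means that skip NaN instead of propagating it.
print(np.nanmean(a, axis=0))
```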

## Concatenation & sorting

Concatenation means joining two arrays by rows or by columns. An array may be concatenated with itself or with another array. There are four functions that help with concatenation: `np.concatenate`, `np.vstack`, `np.hstack` and `np.stack`.

In [126]:
a = np.zeros(shape=(3,2))
b = np.ones(shape=(2,2))

In [127]:
np.concatenate([a, a, a, a], axis=0)

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.],
       [0., 0.]])

In [128]:
np.concatenate([a, b], axis=0)

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [1., 1.],
       [1., 1.]])

In [129]:
np.vstack([a,b])

array([[0., 0.],
       [0., 0.],
       [0., 0.],
       [1., 1.],
       [1., 1.]])

In [130]:
np.hstack([a.T,b])

array([[0., 0., 0., 1., 1.],
       [0., 0., 0., 1., 1.]])

In [131]:
np.stack([a[:2,:2], b], axis=0)

array([[[0., 0.],
        [0., 0.]],

       [[1., 1.],
        [1., 1.]]])

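`NumPy`'s sorting functions have no option for descending order; to sort in reverse, sort first and then reverse the result with `[::-1]`. The two basic functions are `ndarray.sort`, which sorts in place, and `np.sort`, which returns a sorted copy.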

In [132]:
a = np.random.randint(1, 100, size=50)
a

array([85, 55,  4, 56, 68, 55, 81, 97, 73, 65, 96, 53, 99, 51, 76, 91, 28,
       78,  2, 76, 13, 13, 22, 13, 89, 26, 78, 94, 12, 66, 32, 23,  2,  7,
       38, 48, 77, 34, 45, 58, 73, 28, 94, 48, 41,  2, 86, 54, 53, 22])

In [133]:
a.sort()

In [134]:
a

array([ 2,  2,  2,  4,  7, 12, 13, 13, 13, 22, 22, 23, 26, 28, 28, 32, 34,
       38, 41, 45, 48, 48, 51, 53, 53, 54, 55, 55, 56, 58, 65, 66, 68, 73,
       73, 76, 76, 77, 78, 78, 81, 85, 86, 89, 91, 94, 94, 96, 97, 99])

In [135]:
np.sort(a)[::-1]

array([99, 97, 96, 94, 94, 91, 89, 86, 85, 81, 78, 78, 77, 76, 76, 73, 73,
       68, 66, 65, 58, 56, 55, 55, 54, 53, 53, 51, 48, 48, 45, 41, 38, 34,
       32, 28, 28, 26, 23, 22, 22, 13, 13, 13, 12,  7,  4,  2,  2,  2])

If you want to reorder the rows of an array according to the values in one of its columns (here the second column), use `np.argsort`.

In [136]:
a = np.random.randint(1, 100, size=20)
a.shape = 5,4

a

array([[25, 90, 39, 83],
       [68, 72, 26, 21],
       [84, 95, 90, 58],
       [61, 84, 35, 11],
       [47, 61, 35, 45]])

In [137]:
np.sort(a, axis=1)

array([[25, 39, 83, 90],
       [21, 26, 68, 72],
       [58, 84, 90, 95],
       [11, 35, 61, 84],
       [35, 45, 47, 61]])

In [138]:
a

array([[25, 90, 39, 83],
       [68, 72, 26, 21],
       [84, 95, 90, 58],
       [61, 84, 35, 11],
       [47, 61, 35, 45]])

In [139]:
a[np.argsort(a[:,1])]

array([[47, 61, 35, 45],
       [68, 72, 26, 21],
       [61, 84, 35, 11],
       [25, 90, 39, 83],
       [84, 95, 90, 58]])
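The index array returned by `np.argsort` can also be applied to a different array, which keeps parallel arrays aligned, and reversed to get a descending order. This sketch reuses the names and ages from the boolean indexing section; `kind="stable"` is chosen here so that ties keep their original order.

```python
import numpy as np

names = np.array(["Dennis", "Dee", "Charlie", "Mac", "Frank"])
ages = np.array([43, 44, 43, 42, 74])

# argsort returns the indices that would sort ages in ascending order.
order = np.argsort(ages, kind="stable")

# Apply the same order to another array to keep the two aligned.
print(names[order])

# Reverse the index array to get a descending order.
print(names[order[::-1]])
```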