Some notes and things I try with numpy following Chapter 4. NumPy Basics: Arrays and Vectorized Computation, and Appendix A in: Wes McKinney: Python for Data Analysis, 2017.

## Chapter 4. NumPy Basics: Arrays and Vectorized Computation

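### numpy is fast because of vectorization
- operations run in compiled C loops over contiguous, homogeneously typed memory instead of in the Python interpreter
- and more efficient in memory usage, too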

In [2]:
import numpy as np 

my_arr = np.arange(1000000)
my_list = list(range(1000000))

%time for _ in range(10): my_arr = my_arr * 2

CPU times: user 6.25 ms, sys: 3.29 ms, total: 9.54 ms
Wall time: 9.59 ms


In [3]:
my_arr.shape # shape is the same, the values are doubled ten times

(1000000,)

In [4]:
my_arr[:10]

array([   0, 1024, 2048, 3072, 4096, 5120, 6144, 7168, 8192, 9216])

`%time` is an IPython magic command available in Jupyter Notebook. Its output means: <br>
    - user: CPU time spent running your code in user mode (executing the Python process directly) <br>
    - sys: CPU time spent in system mode, where the operating system performs operations on behalf of 
        your program (e.g., memory management) <br>
    - total: sum of the above two <br>
    - Wall time: elapsed real time, including all of the above AND waiting for other resources (can differ from total 
        time, because other processes or threads may be running concurrently)

Here, the wall time (first run: 11.7 ms) is about the same as the total CPU time, so nothing suggests that operations were performed in 
parallel. The speed-up over the Python list comes from NumPy's underlying implementation: compiled loops over a contiguous block of memory.

In [5]:
%time for _ in range(10): my_list = [x*2 for x in my_list] # each value is doubled 10 times

# comment: this code caused vscode to break, because it doubles the size of the list so it would be 2^10 times 
# as long: %time for _ in range(10): my_list = my_list * 2

CPU times: user 214 ms, sys: 53.3 ms, total: 267 ms
Wall time: 267 ms


For the python list, the wall time is about the total time.

### 4.1 The NumPy `ndarray`
- container type: multidimensional and for homogeneous data types
- perform mathematical operations on whole blocks of data

In [6]:
data = np.random.randn(2,3)
data

array([[ 1.56360656, -1.89709327,  0.66943621],
       [ 1.13491486,  0.09507661,  0.59080532]])

In [7]:
data * 10

array([[ 15.6360656 , -18.9709327 ,   6.69436212],
       [ 11.34914856,   0.95076609,   5.90805325]])

In [8]:
data + data

array([[ 3.12721312, -3.79418654,  1.33887242],
       [ 2.26982971,  0.19015322,  1.18161065]])

In [9]:
data.dtype # type of data within the array

dtype('float64')

In [10]:
list_of_data = [1,2,3]

np.array(list_of_data) # np.array accepts any sequence type, or nested sequences of equal length
# note that the __repr__ (which the notebook uses for cell output) shows "array", not "np.array"

array([1, 2, 3])

In [11]:
type(np.array(list_of_data))

numpy.ndarray

`np.array()` copies the incoming data by default <br>
`np.asarray()` does not copy if the incoming data is already an ndarray of the requested dtype
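
The difference between the two constructors can be checked with `np.shares_memory()`. This is a minimal sketch; the variable names are made up for illustration:

```python
import numpy as np

source = np.array([1, 2, 3])

copied = np.array(source)     # copies by default
aliased = np.asarray(source)  # input is already an ndarray: no copy

print(np.shares_memory(source, copied))   # False
print(np.shares_memory(source, aliased))  # True
print(aliased is source)                  # True

source[0] = 99                # only the alias sees the change
print(copied[0], aliased[0])  # 1 99
```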

In [12]:
arr = np.array([1,2,3], dtype=np.float32)
arr.dtype

dtype('float32')

numpy infers a suitable dtype from the input data when none is given, but we can also pass it explicitly. <br>
<br>
Changing the dtype of an array is called casting; `astype()` always returns a new array:

In [13]:
int_arr = arr.astype("int8")
int_arr.dtype

dtype('int8')
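
Two properties of `astype()` are easy to overlook: casting floats to integers truncates the fractional part, and the result is a new array even when the dtype does not change. A small sketch with made-up values:

```python
import numpy as np

floats = np.array([1.7, -2.9, 3.2])

ints = floats.astype(np.int32)  # fractional part is truncated toward zero
print(ints)                     # [ 1 -2  3]

same = floats.astype(floats.dtype)     # same dtype, still a new array
print(np.shares_memory(floats, same))  # False
```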

##### Vectorization
- any operation between equal-sized arrays is applied element-wise
- operations with scalars propagate the operation to each element in the array
- comparisons between arrays of the same size yield boolean arrays
- operations between differently sized arrays can broadcast under certain conditions
    - the shapes are compared from the trailing dimension backwards; each pair of axis lengths must be equal, or one of them must be 1
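
A short sketch of the four cases listed above; the arrays are arbitrary example data:

```python
import numpy as np

a = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])   # shape (2, 3)
b = np.array([[0.0, 4.0, 1.0], [7.0, 2.0, 12.0]])  # shape (2, 3)

squared = a * a   # element-wise product of equal-sized arrays
inverse = 1 / a   # the scalar is propagated to every element
greater = b > a   # element-wise comparison yields a boolean array

row = np.array([10.0, 20.0, 30.0])  # shape (3,)
plus_row = a + row                  # (2, 3) + (3,): row is added to both rows

col = np.array([[100.0], [200.0]])  # shape (2, 1)
plus_col = a + col                  # (2, 3) + (2, 1): col is added to all columns

print(greater)
print(plus_row)
print(plus_col)
```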

##### Slicing
- 1-D arrays behave almost like python lists
- but unlike list slices, np array slices are `views` on the original array
    - this means the data is NOT copied and any modification on a view will be reflected on the source array
- for multidimensional arrays, the elements at each index are not scalars, but arrays themselves:

In [14]:
arr3d = np.array([[[1,2,3], [4,5,6]], [[7,8,9], [10,11,12]]])
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [15]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

In [16]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [17]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [18]:
arr3d[1,0]

array([7, 8, 9])

In [19]:
# is the same as:
index = arr3d[1]
index[0]

array([7, 8, 9])

In [20]:
arr2d = np.array([[1,2,3], [4,5,6], [7,8,9]])
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [21]:
# this slices along axis 0
arr2d[:2] # read: select the first two rows of arr2d

array([[1, 2, 3],
       [4, 5, 6]])

In [22]:
# passing multiple slices (one for each dimension):
arr2d[:2, 1:] # read: select the first two rows and then from this, select every column after the first column

array([[2, 3],
       [5, 6]])

In [23]:
# assigning to a slice selection assigns to the whole selection (because slices are views):
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])
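
The view behaviour of slices can be contrasted directly with Python lists. A minimal sketch with illustrative variable names:

```python
import numpy as np

arr = np.arange(10)
py_list = list(range(10))

arr_slice = arr[5:8]       # view: shares memory with arr
list_slice = py_list[5:8]  # new list: independent of py_list

arr_slice[:] = 12          # writes through to arr
list_slice[0] = 12         # leaves py_list untouched

print(arr)      # [ 0  1  2  3  4 12 12 12  8  9]
print(py_list)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]

independent = arr[5:8].copy()  # explicit copy when a view is not wanted
independent[:] = 0
print(arr[5:8])  # [12 12 12]
```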

##### Boolean indexing
- caution: the book warns that boolean selection doesn't fail if the lengths don't match (really not? I can hardly believe this)

In [24]:
names = np.array(["bob", "sandy", "bob"])

arr2d[names == "bob"]

array([[1, 0, 0],
       [7, 8, 9]])

In [25]:
arr2d[names == "bob", 1:] # mix of boolean indexing and slicing

array([[0, 0],
       [8, 9]])

- testing whether the statement from above about not raising on mismatched lengths is true:

In [26]:
arr2d_expanded = np.vstack([arr2d, np.array([98, 99, 100])])   
arr2d_expanded

array([[  1,   0,   0],
       [  4,   0,   0],
       [  7,   8,   9],
       [ 98,  99, 100]])

In [27]:
arr2d_expanded[names == "bob"] # so it does raise ...

IndexError: boolean index did not match indexed array along axis 0; size of axis is 4 but size of corresponding boolean axis is 3
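
With a mask of matching length, boolean indexing can also be negated, combined and used for assignment. A sketch with made-up data (one name per row):

```python
import numpy as np

names = np.array(["bob", "sandy", "bob", "joe"])
data = np.arange(12).reshape(4, 3)  # one row per name

# combine conditions with | and &, not with the keywords `or` / `and`
mask = (names == "bob") | (names == "joe")
print(data[mask])   # rows 0, 2 and 3
print(data[~mask])  # ~ negates the mask: row 1

data[names == "sandy"] = -1  # assignment through a boolean mask
print(data)
```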

##### Fancy indexing
- indexing using integer arrays

In [28]:
arr = np.empty((8,4)) # np.empty() allocates an array without initializing it; it contains whatever data happens to be in memory (not reliable)
arr

array([[5.41617385e-310, 0.00000000e+000, 6.36306148e-310,
        6.36305145e-310],
       [6.36305695e-310, 6.36305145e-310, 6.36306342e-310,
        6.36305145e-310],
       [6.36306268e-310, 6.36305695e-310, 6.36305145e-310,
        6.36305903e-310],
       [6.36305695e-310, 6.36305695e-310, 6.36306263e-310,
        6.36305695e-310],
       [6.36305145e-310, 5.41656433e-310, 6.36306343e-310,
        6.36306271e-310],
       [6.36306342e-310, 6.36306343e-310, 6.36306343e-310,
        6.36306277e-310],
       [6.36306343e-310, 6.36305695e-310, 6.36306343e-310,
        6.36306341e-310],
       [6.36306339e-310, 6.36306340e-310, 6.36305695e-310,
        6.36305145e-310]])

In [None]:
for i in range(8): # this is part of initializing the array; it's not fancy indexing yet
    arr[i] = i

In [30]:
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])

In [None]:
arr[[4,3,0,6]] # passing a list or ndarray of integers to specify the desired order: this is fancy indexing

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

In [32]:
# passing multiple index arrays does something different: it selects a one-dimensional array of elements, one for each tuple of indices:
arr = np.arange(32).reshape((8,4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [33]:
arr[[1,5,7,2],[0,3,1,2]] # this is the fancy indexing

array([ 4, 23, 29, 10])

Fancy indexing is a way to select multiple elements at the same time (unlike plain indexing) that need not be contiguous (unlike slicing).

- it can be used to select or set specific elements (multi dimensional)
- it can be used to re-order the data in an array (the selection is always a copy, not a view; for in-place sorting we would use `arr.sort()`, while `np.argsort()` returns the indices that would sort the array)
- boolean indexing is closely related: the NumPy documentation groups both under "advanced indexing"
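
A sketch of the points above: a fancy-indexed selection is a copy, a rectangular block can be selected in two steps or with `np.ix_()`, and assignment through fancy indexing does change the source array.

```python
import numpy as np

arr = np.arange(32).reshape(8, 4)

rows = arr[[4, 3, 0, 6]]  # fancy indexing returns a copy
rows[:] = -1
print(arr[4])             # [16 17 18 19], the source is unchanged

# rectangular block: pick rows first, then reorder the columns
block = arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]
print(block)

# np.ix_ builds the same selection in one step
block_ix = arr[np.ix_([1, 5, 7, 2], [0, 3, 1, 2])]

# assignment with fancy indexing modifies the source
arr[[0, 1], [0, 1]] = 100  # sets arr[0, 0] and arr[1, 1]
```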

##### Transposing arrays and swapping axes
- transposing is a special way of reshaping and returns a view without creating a copy
- ndarrays have a `transpose()` method and a `.T` attribute
    - we use this for calculating inner products ($x^Tx$), for instance
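
A small sketch of transposing and axis swapping, using an arbitrary 3x2 matrix for the inner product $x^Tx$:

```python
import numpy as np

x = np.arange(6.0).reshape(3, 2)

xt = x.T
print(xt.shape)                 # (2, 3)
print(np.shares_memory(x, xt))  # True: the transpose is a view

gram = x.T @ x                  # inner product, shape (2, 2)
print(gram)

arr = np.arange(24).reshape(2, 3, 4)
print(arr.swapaxes(1, 2).shape)      # (2, 4, 3)
print(arr.transpose(1, 0, 2).shape)  # (3, 2, 4)
```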

### 4.2 Universal Functions: Fast element-wise array functions
- a `ufunc` is a vectorizing wrapper around a normal function, returning element-wise transformations
- examples: `np.sqrt()`, `np.exp()`, `np.add()`, `np.maximum()`
- they can take one or two arrays as input (unary and binary ufuncs) and mostly return one array (sometimes also multiple arrays, although not common; example: `np.modf()`)
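
A few of the ufuncs named above in action. The input values are arbitrary but chosen so that the results are exact:

```python
import numpy as np

x = np.array([3.5, -1.25, 0.0, 7.75])
y = np.array([1.0, 2.0, -3.0, 8.0])

roots = np.sqrt(np.abs(x))  # unary ufuncs can be chained
larger = np.maximum(x, y)   # binary ufunc: element-wise maximum
frac, whole = np.modf(x)    # returns two arrays: fractional and integral parts

out = np.zeros_like(x)
np.add(x, y, out=out)       # the optional `out` argument writes into an existing array

print(larger)  # [3.5 2.  0.  8. ]
print(frac)
print(whole)
```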

### 4.3 Array-oriented programming
##### Conditional logic
- `np.where()` is a vectorized version of the ternary expression `x if condition else y`

In [37]:
# trying a ternary expression
x = 10
y = 20
result = x if x > y else y  # Picks the larger value
print(result)
print(type(result))


20
<class 'int'>


In [38]:
# trying np.where()
xarr = np.array([1.1, 1.2, 1.3, 1.4])
yarr = np.array([2.1, 2.2, 2.3, 2.4])
cond = np.array([True, False, True, True])

# we want to take a value from `xarr`, where the condition from `cond` is True, and from `yarr` otherwise

result = np.where(cond, xarr, yarr)
result

array([1.1, 2.2, 1.3, 1.4])

In [39]:
# other example
arr = np.random.randn(4, 4)

arr = np.where(arr<0, -1, arr)
arr

array([[ 0.12110708, -1.        ,  1.96610356, -1.        ],
       [ 0.1118205 ,  1.02967095, -1.        , -1.        ],
       [ 1.40108331, -1.        , -1.        ,  0.61962777],
       [ 1.33009324, -1.        ,  0.64076656,  2.05993252]])

##### Mathematical and statistical methods

In [53]:
arr = np.random.randn(5, 4) # generate normally distributed data
arr

array([[ 1.2980223 , -0.31134484,  1.13939312,  1.230545  ],
       [-1.07852185,  0.20719577,  0.66337511,  0.34010541],
       [ 0.02902769, -1.52797291,  0.15086521, -1.26519605],
       [ 0.57630202,  1.52964349, -0.88973635, -0.23148914],
       [ 0.99065693, -0.12356336, -0.69366481, -0.14388266]])

In [None]:
print(arr.mean()) # or np.mean(arr)
print(arr.sum())

# these take an optional axis param to compute the statistic over a certain axis:
print(arr.sum(axis=1)) # or arr.mean(1)

0.0944880032241776
1.889760064483552
[ 3.35661557  0.13215443 -2.61327605  0.98472002  0.02954609]


In [56]:
# these do not aggregate, but produce an array of the intermediate results:
arr = np.array([1,2,3])

print(arr.cumsum())
print(arr.cumprod())

[1 3 6]
[1 2 6]


In [None]:
# for multidimensional arrays:
arr = np.array([[1,2,3], [4,5,6]])
print(arr)

print('')

print(arr.cumsum()) # cumulative sum over the flattened array
print(arr.cumsum(axis=0)) # accumulate down the rows (along axis 0)

print('')

print(arr.cumprod())
print(arr.cumprod(axis=0))

[[1 2 3]
 [4 5 6]]

[ 1  3  6 10 15 21]
[[1 2 3]
 [5 7 9]]

[  1   2   6  24 120 720]
[[ 1  2  3]
 [ 4 10 18]]


##### Methods for boolean arrays
- `.any()` and `.all()` are specifically useful for boolean arrays; `.sum()` is also often used
- on non-boolean arrays the first two evaluate to True for non-zero values

##### Sorting

In [None]:
arr = np.random.randn(2,2).sort() # sorts in place
arr # which means: sort() returns nothing, only None

In [None]:
type(arr) # this is why this is a Nonetype

NoneType

In [None]:
# this is how it works
arr = np.random.randn(2,2)
arr.sort()
arr

array([[-1.26586607,  1.3622619 ],
       [-0.88691094, -0.85749665]])

In [86]:
# quick and dirty way to find percentiles:
large_arr = np.random.randn(1000)
large_arr.sort()
large_arr[int(0.05 * len(large_arr))] # value at the 5% quantile (5th percentile)

np.float64(-1.6057149256592729)

##### Unique and other set logic

In [87]:
# test membership of the values in one array in another:
values = np.array([0,3,2,1,2,4])
np.in1d(values, [2,3]) # deprecated since NumPy 2.0 in favor of np.isin(values, [2,3])

array([False,  True,  True, False,  True, False])

### 4.4 File Input and Output with Arrays
- numpy can save and load data to disk in binary format using `np.save()` and `np.load()` (for text files there are `np.savetxt()` and `np.loadtxt()`)
- arrays saved with `np.save()` are stored as npy files:

In [None]:
arr = np.arange(10)
np.save('some_array', arr) # saves arrays as `some_array.npy`

In [90]:
arr_loaded = np.load('some_array.npy')
arr_loaded

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [91]:
# for multiple arrays
np.savez('array_archive.npz', a=arr, b=arr) # a and b being the names of (usually) different arrays

In [92]:
arch = np.load('array_archive.npz') # returns a dict-like object (NpzFile) that loads the arrays lazily, with the array names as keys
arch['b']

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

- all the above methods save the data uncompressed
- if the data compresses well, we can also use `np.savez_compressed()`
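
To see the effect of compression, the sketch below writes the same array with both functions into a temporary directory and compares the file sizes. The file names and the all-zeros test array are assumptions chosen for illustration; real data will usually compress less well.

```python
import os
import tempfile

import numpy as np

arr = np.zeros(100_000)  # highly repetitive data compresses very well

with tempfile.TemporaryDirectory() as tmp:
    plain = os.path.join(tmp, "plain.npz")
    packed = os.path.join(tmp, "packed.npz")

    np.savez(plain, a=arr)
    np.savez_compressed(packed, a=arr)

    size_plain = os.path.getsize(plain)
    size_packed = os.path.getsize(packed)
    print(size_plain, size_packed)

    with np.load(packed) as archive:  # closes the file when done
        restored = archive["a"]

    # plain-text alternative for small arrays
    txt = os.path.join(tmp, "arr.txt")
    np.savetxt(txt, arr[:5])
    from_text = np.loadtxt(txt)
```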

### 4.5 Linear Algebra
- `np.dot()` for matrix multiplication
- the `@` symbol (introduced in Python 3.5) is an infix operator also doing matrix multiplication
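
The three spellings of matrix multiplication give the same result. A small sketch with two arbitrary matrices:

```python
import numpy as np

x = np.array([[1.0, 2.0, 3.0], [4.0, 5.0, 6.0]])      # shape (2, 3)
y = np.array([[6.0, 23.0], [-1.0, 7.0], [8.0, 9.0]])  # shape (3, 2)

via_method = x.dot(y)
via_function = np.dot(x, y)
via_operator = x @ y
print(via_operator)

# a 2-D array times a suitably sized 1-D array gives a 1-D array
row_sums = x @ np.ones(3)
print(row_sums)  # [ 6. 15.]
```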

In [101]:
from numpy.linalg import inv, qr

X = np.random.randn(5,5)

mat = X.T.dot(X)
inv(mat)


array([[  0.68863277,   2.32533218,  -1.20754068,   1.97367978,
         -1.89007534],
       [  2.32533218,  19.71960478, -10.63496604,  16.14356698,
        -16.49136878],
       [ -1.20754068, -10.63496604,   5.92782184,  -8.74737409,
          9.06393264],
       [  1.97367978,  16.14356698,  -8.74737409,  13.40734895,
        -13.56452877],
       [ -1.89007534, -16.49136878,   9.06393264, -13.56452877,
         14.18291675]])

In [None]:
mat.dot(inv(mat)) # almost the identity matrix (with visible floating-point rounding errors, unlike the output in the book)

array([[ 1.00000000e+00,  5.89976977e-16, -7.00569346e-16,
        -2.23960879e-15, -1.23199659e-15],
       [ 7.11534201e-17,  1.00000000e+00, -1.37409970e-15,
         1.35910277e-15,  1.11921110e-15],
       [ 1.19080155e-15, -7.35171767e-15,  1.00000000e+00,
        -6.71488870e-15,  9.86315869e-15],
       [ 1.38359770e-15, -1.79708104e-15, -5.73206221e-16,
         1.00000000e+00, -1.05919482e-14],
       [ 5.98515432e-16,  5.98899018e-15, -4.34049322e-15,
         6.21085003e-15,  1.00000000e+00]])

In [None]:
q, r = qr(mat) # qr factorization of a matrix
q

array([[-0.49470092,  0.0276895 ,  0.04151266,  0.07548343,  0.86433977],
       [-0.25767361, -0.66784315,  0.02841727,  0.672401  , -0.18616978],
       [ 0.17588013,  0.15794074, -0.9188557 ,  0.29466703,  0.11400197],
       [-0.53229441, -0.36975612, -0.39023576, -0.6157343 , -0.22029614],
       [ 0.61204914, -0.62574272, -0.02982297, -0.27608288,  0.39589233]])

In [100]:
r

array([[ -8.67547834,  -4.7668058 ,   2.9413361 , -10.13989014,
         10.56799639],
       [  0.        ,  -4.43133203,   1.06571303,  -4.70637323,
         -5.37570551],
       [  0.        ,   0.        ,  -2.41544778,  -2.36445853,
         -0.6376045 ],
       [  0.        ,   0.        ,   0.        ,  -2.24992763,
         -1.41350362],
       [  0.        ,   0.        ,   0.        ,   0.        ,
          0.23161474]])

### 4.6 Pseudorandom Number Generation
- numpy's random module supplements the python random module with efficient functions to create random number arrays from different probability distributions
- pseudorandom, because these are deterministic algorithms based on a random seed

In [None]:
# example: standard normal distribution
np.random.normal(size=[4,4])

array([[-0.59839524, -0.05451426, -0.39527943, -2.10952311],
       [-1.27982573, -1.85372548,  1.42104452,  0.48263228],
       [-1.63395085,  0.11694552, -1.62034731,  0.40186316],
       [ 0.26425778,  0.0514621 ,  1.13957537,  0.0518309 ]])

In [None]:
# change random seed
np.random.seed(1234)

- the functions in `np.random` draw from a single global generator, whose state is set by the 'global random seed'
- we can avoid this global state by creating an isolated random number generator:

In [104]:
# random number generator
rng = np.random.RandomState(42)
rng.randn(10)

array([ 0.49671415, -0.1382643 ,  0.64768854,  1.52302986, -0.23415337,
       -0.23413696,  1.57921282,  0.76743473, -0.46947439,  0.54256004])
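
The book uses `RandomState`, which NumPy's documentation now describes as a legacy interface; for new code it recommends `np.random.default_rng()`. The sketch below is not from the book and only illustrates that such a generator is reproducible and independent of the global seed:

```python
import numpy as np

rng_a = np.random.default_rng(seed=42)
rng_b = np.random.default_rng(seed=42)

sample_a = rng_a.standard_normal(10)
sample_b = rng_b.standard_normal(10)
print(np.array_equal(sample_a, sample_b))  # True: same seed, same stream

np.random.seed(0)  # changing the global state ...
sample_c = np.random.default_rng(seed=42).standard_normal(10)
print(np.array_equal(sample_a, sample_c))  # True: ... does not affect the generator
```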

## Appendix A: Advanced NumPy
##### Reshaping

In [None]:
arr = np.arange(12)
arr.reshape((4,3), order="C") # NOT an inplace operation: returns a reshaped array (a view where possible); order="C" (row major) is the default

array([[ 0,  1,  2],
       [ 3,  4,  5],
       [ 6,  7,  8],
       [ 9, 10, 11]])

In [109]:
arr.reshape((4,3), order="F") #  F is for Fortran, it's columns first, then rows

array([[ 0,  4,  8],
       [ 1,  5,  9],
       [ 2,  6, 10],
       [ 3,  7, 11]])

- if one of the shape dimensions is `-1`, the dimension will be inferred from the data (and the other defined shapes):

In [110]:
arr = np.arange(15)
arr.reshape(3, -1)

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [None]:
# shaping back into one dimension
arr.ravel() # returns a view where possible, a copy only if needed

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [112]:
flattened_arr = arr.flatten() # always returns a copy
flattened_arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [117]:
# the order argument can also be passed to ravel() and flatten()
arr = arr.reshape(3,5)
arr.ravel()

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

In [118]:
arr.ravel('F')

array([ 0,  5, 10,  1,  6, 11,  2,  7, 12,  3,  8, 13,  4,  9, 14])
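
Whether these operations copy can be checked with `np.shares_memory()`. A minimal sketch:

```python
import numpy as np

arr = np.arange(12)
reshaped = arr.reshape((4, 3))

print(arr.shape)                        # (12,): the original is unchanged
print(np.shares_memory(arr, reshaped))  # True: reshape returned a view

raveled = reshaped.ravel()      # view, because the data is contiguous
flattened = reshaped.flatten()  # always a copy
print(np.shares_memory(reshaped, raveled))    # True
print(np.shares_memory(reshaped, flattened))  # False

# the transpose is not C-contiguous, so ravel() has to copy here
raveled_t = reshaped.T.ravel()
print(np.shares_memory(reshaped, raveled_t))  # False
```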

##### Repeating elements

In [None]:
arr = np.arange(3)
arr.repeat(3) # passing an int: repeats each element the same number of times

array([0, 0, 0, 1, 1, 1, 2, 2, 2])

In [123]:
arr.repeat([2,0,3]) # passing a sequence: repeats each element a different number of times

array([0, 0, 2, 2, 2])

In [127]:
# multidimensional arrays can be repeated along a certain axis
arr = np.random.randn(2,2)
arr.repeat(2, axis=0)

array([[-0.79305184, -2.48300109],
       [-0.79305184, -2.48300109],
       [-0.65743278, -1.49791964],
       [-0.65743278, -1.49791964]])

In [129]:
# or repeat several times along a certain axis
arr = np.random.randn(2,2)
arr.repeat([2,3], axis=0)

array([[ 0.45429061, -0.5994875 ],
       [ 0.45429061, -0.5994875 ],
       [-0.30168219, -1.06731923],
       [-0.30168219, -1.06731923],
       [-0.30168219, -1.06731923]])

In [130]:
arr.repeat([2,3], axis=1)

array([[ 0.45429061,  0.45429061, -0.5994875 , -0.5994875 , -0.5994875 ],
       [-0.30168219, -0.30168219, -1.06731923, -1.06731923, -1.06731923]])

In [133]:
# tile stacks copies of a whole array along an axis (with a scalar: along the last axis)
arr = np.arange(4).reshape(2,2)
np.tile(arr, 2)

array([[0, 1, 0, 1],
       [2, 3, 2, 3]])

In [135]:
# the second argument can be a sequence giving the number of repetitions per axis
np.tile(arr, [2,3])

array([[0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3],
       [0, 1, 0, 1, 0, 1],
       [2, 3, 2, 3, 2, 3]])

##### Fancy indexing equivalents: take and put
- useful for making a selection along a certain axis

In [None]:
arr = np.arange(10) * 100
inds = [1, 4, 7]

arr[inds] # fancy indexing

array([100, 400, 700])

In [139]:
arr.take(inds)

array([100, 400, 700])

In [140]:
arr.put(inds, 42)
arr

array([  0,  42, 200, 300,  42, 500, 600,  42, 800, 900])

##### Broadcasting
- arithmetic between arrays of different shapes
- two shapes are suitable for broadcasting if, for each trailing dimension (comparing from the last axis backwards), the axis lengths are the same or one of them is 1
    - often, in order to make it work, we need to add a new axis of length 1

In [146]:
arr = np.random.randn(4,3)
arr

array([[ 0.10156386,  0.55768196, -1.00808752],
       [-0.06576862, -2.89332594, -0.14339833],
       [ 2.25114946, -0.58481767,  0.52004145],
       [ 0.87469057, -1.02561222,  0.55715798]])

In [147]:
arr.mean(0) # means over axis 0 (one mean per column)

array([ 0.79040882, -0.98651847, -0.0185716 ])

In [148]:
demeaned = arr - arr.mean(0)
demeaned

array([[-0.68884496,  1.54420042, -0.98951591],
       [-0.85617744, -1.90680747, -0.12482673],
       [ 1.46074064,  0.4017008 ,  0.53861306],
       [ 0.08428175, -0.03909376,  0.57572959]])

In [None]:
demeaned.mean(0) # almost zero

array([ 0.00000000e+00, -5.55111512e-17, -2.77555756e-17])

In [150]:
# adding a new axis easily:
arr = np.zeros((4,4))
arr

array([[0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.],
       [0., 0., 0., 0.]])

In [152]:
arr_3d = arr[:, np.newaxis, :]
arr_3d

array([[[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]],

       [[0., 0., 0., 0.]]])

- we can also set array values via broadcasting
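
A sketch of two cases that need a new axis: demeaning along axis 1, and setting values via broadcasting. The numbers are arbitrary example values.

```python
import numpy as np

arr = np.arange(12.0).reshape(4, 3)
row_means = arr.mean(1)  # shape (4,)

# (4, 3) - (4,) is not compatible: the trailing dimensions are 3 and 4
error_message = None
try:
    arr - row_means
except ValueError as err:
    error_message = str(err)
print(error_message)

# (4, 3) - (4, 1) broadcasts across the columns
demeaned = arr - row_means[:, np.newaxis]
print(demeaned.mean(1))  # [0. 0. 0. 0.]

# keepdims=True keeps the length-1 axis, so no newaxis is needed
demeaned_kd = arr - arr.mean(1, keepdims=True)

# setting values via broadcasting
target = np.zeros((4, 3))
col = np.array([1.28, -0.42, 0.44, 1.6])
target[:] = col[:, np.newaxis]   # every row is filled with one value
target[:2] = [[-1.37], [0.509]]  # (2, 1) values broadcast across the columns
print(target)
```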

##### ufunc instance methods
- ufuncs have special instance methods for special vectorized operations

In [None]:
# reduce takes a single array and aggregates it, optionally along an axis:
arr = np.arange(10)
np.add.reduce(arr) # same as arr.sum()

np.int64(45)

In [None]:
# logical_and to check if elements are sorted
arr = np.random.randn(5,5)
arr[::2].sort(1) # sort every other row
arr

array([[-1.54678256, -0.95017152, -0.65289464,  0.04829081,  1.32566546],
       [-0.76039907, -1.23175176, -1.44158413,  0.74995621,  0.3737859 ],
       [ 0.35400768,  0.58035137,  0.81355741,  1.92065877,  2.66433731],
       [-2.60520846, -1.42462652, -0.3184543 , -0.95227597, -0.42400786],
       [-1.90565143, -0.55122187, -0.28483035, -0.02222463,  0.58650511]])

In [None]:
arr[:,:-1] < arr[:,1:] # check whether each row is sorted

array([[ True,  True,  True,  True],
       [False, False,  True, False],
       [ True,  True,  True,  True],
       [ True,  True, False,  True],
       [ True,  True,  True,  True]])

In [161]:
np.logical_and.reduce(arr[:,:-1] < arr[:,1:], axis=1)

array([ True, False,  True, False,  True])

- `np.logical_and.reduce()` is equivalent to the `.all()` method
- `accumulate()` is related to `reduce()` like `cumsum()` is related to `sum()`:

In [166]:
arr = np.arange(15).reshape(3,5)
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [167]:
np.add.accumulate(arr, axis=1)

array([[ 0,  1,  3,  6, 10],
       [ 5, 11, 18, 26, 35],
       [10, 21, 33, 46, 60]])

- `outer()` performs a pairwise cross product between two arrays:

In [168]:
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [None]:
np.multiply.outer(arr, np.arange(5)) # the number of dimensions of the output is the sum of the numbers of dimensions of the inputs

array([[[ 0,  0,  0,  0,  0],
        [ 0,  1,  2,  3,  4],
        [ 0,  2,  4,  6,  8],
        [ 0,  3,  6,  9, 12],
        [ 0,  4,  8, 12, 16]],

       [[ 0,  5, 10, 15, 20],
        [ 0,  6, 12, 18, 24],
        [ 0,  7, 14, 21, 28],
        [ 0,  8, 16, 24, 32],
        [ 0,  9, 18, 27, 36]],

       [[ 0, 10, 20, 30, 40],
        [ 0, 11, 22, 33, 44],
        [ 0, 12, 24, 36, 48],
        [ 0, 13, 26, 39, 52],
        [ 0, 14, 28, 42, 56]]])

In [None]:
# reduceat performs a local aggregation; a bit like groupby:
arr = np.arange(10)
indices = [0,5,8] # starting indices for the segments / bins
np.add.reduceat(arr, indices)

array([10, 18, 17])

- a structured array is an ndarray with a structured dtype (each element consists of several named fields, which are accessible by name, like keys in a dict)
    - a typical way to specify the dtype is a list of `(field_name, field_dtype)` tuples
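
A minimal structured-array sketch; the field names `x` and `y` are arbitrary:

```python
import numpy as np

dtype = [("x", np.float64), ("y", np.int32)]  # list of (field_name, field_dtype)
sarr = np.array([(1.5, 6), (np.pi, -2)], dtype=dtype)

print(sarr.dtype.names)  # ('x', 'y')
print(sarr["x"])         # all values of field x
print(sarr[0]["y"])      # 6: field y of the first element

sarr["y"] *= 10          # field access returns a view, so this changes sarr
print(sarr["y"])         # [ 60 -20]
```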

##### Advanced sorting
- `argsort` is a way of sorting indirectly: it returns an array of indices that would sort the array

In [179]:
values = np.array([5,0,1,3,2])
indexer = values.argsort()
indexer


array([1, 2, 4, 3, 0])

In [180]:
values[indexer]

array([0, 1, 2, 3, 5])

- `lexsort` performs an indirect lexicographical sort on multiple key arrays
    - the last array passed is the primary sort key; the former array(s) are used to break ties
    - it's similar to grouping first and then sorting
    - pandas `sort_values` is related
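
A small `lexsort` sketch with made-up names, sorting by last name first and by first name within equal last names:

```python
import numpy as np

first_name = np.array(["Bob", "Jane", "Steve", "Bill", "Barbara"])
last_name = np.array(["Jones", "Arnold", "Arnold", "Jones", "Walters"])

# the LAST key (last_name) is the primary sort key
sorter = np.lexsort((first_name, last_name))
print(sorter)  # [1 2 3 0 4]

for first, last in zip(first_name[sorter], last_name[sorter]):
    print(last, first)
```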

- `partition` and `argpartition` split an array around the k-th smallest element without fully sorting it

In [181]:
np.random.seed(12345)
arr = np.random.randn(20)
arr

array([-0.20470766,  0.47894334, -0.51943872, -0.5557303 ,  1.96578057,
        1.39340583,  0.09290788,  0.28174615,  0.76902257,  1.24643474,
        1.00718936, -1.29622111,  0.27499163,  0.22891288,  1.35291684,
        0.88642934, -2.00163731, -0.37184254,  1.66902531, -0.43856974])

In [None]:
np.partition(arr, 3) #  after this, the first three elements in the result are the smallest three values in no particular order

array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.43856974,
       -0.37184254, -0.20470766,  0.09290788,  0.22891288,  0.27499163,
        0.28174615,  0.47894334,  0.76902257,  0.88642934,  1.00718936,
        1.24643474,  1.35291684,  1.39340583,  1.66902531,  1.96578057])

In [183]:
indices = np.argpartition(arr, 3) # returns the indices that would partition the array: the first three point to the three smallest values (similar to argsort)
indices

array([16, 11,  3,  2, 19, 17,  0,  6, 13, 12,  7,  1,  8, 15, 10,  9, 14,
        5, 18,  4])

In [184]:
arr.take(indices) # like take_along_axis

array([-2.00163731, -1.29622111, -0.5557303 , -0.51943872, -0.43856974,
       -0.37184254, -0.20470766,  0.09290788,  0.22891288,  0.27499163,
        0.28174615,  0.47894334,  0.76902257,  0.88642934,  1.00718936,
        1.24643474,  1.35291684,  1.39340583,  1.66902531,  1.96578057])

- `searchsorted` performs a binary search on a sorted array
    - returns the location in the array where the value would need to be inserted to maintain sortedness
    - we can also pass a sequence of values to have a sequence of indices returned

In [185]:
arr = np.array([1,2,5,6,9])
arr.searchsorted(4)

np.int64(2)

In [None]:
arr.searchsorted(5) # by default returns the index on the left side, but can be changed with the `side` param

np.int64(2)
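
A sketch of the two remaining points: passing a sequence of values together with the `side` parameter, and using `searchsorted` to assign values to bins. The bin edges and data values are arbitrary examples.

```python
import numpy as np

arr = np.array([0, 0, 0, 1, 1, 1, 1])
left = arr.searchsorted([0, 1])                 # insertion points left of equal values
right = arr.searchsorted([0, 1], side="right")  # insertion points right of equal values
print(left)   # [0 3]
print(right)  # [3 7]

# binning: for each value, find the interval of `bins` it falls into
data = np.array([9940, 6768, 7908, 1709, 268])
bins = np.array([0, 100, 1000, 5000, 10000])
labels = bins.searchsorted(data)
print(labels)  # [4 4 4 3 2]
```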