#### Contents

* Universal Functions
* Aggregation
* Broadcasting
* Masking
* Fancy Indexing
* Array Sorting

### What are Universal Functions in NumPy?

We often loop over an array to perform simple computations such as addition, subtraction, or division on each element. Because the same operation is repeated for every element, the computation time grows with the size of the data. Thankfully, NumPy makes this faster through vectorized operations, generally implemented as NumPy’s universal functions (ufuncs). Let’s understand with an example.


Suppose we have an array of random integers from 1 to 9 (the upper bound of np.random.randint is exclusive) and would like to square each element. With plain Python knowledge, we would write:

In [9]:
import numpy as np

np.random.seed(0)

In [10]:
values = np.random.randint(1, 10, size = 10)

In [11]:
output = np.empty(len(values)) # empty array to store output

In [16]:
for i in range(len(values)):
    output[i] = values[i] ** 2

output

array([81., 25., 16.,  1., 16., 36.,  1.,  9., 16., 81.])

This takes time to write and is slow to compute, especially for larger arrays in a real dataset. Let’s see how ufuncs make it simpler in both respects. (The values below differ from the ones above because the array is drawn again.)

In [17]:
values = np.random.randint(1, 10, size = 10)

values ** 2

array([ 4, 16, 16, 16, 64,  1,  4,  1, 25, 64])

Performing an operation on the array applies it to each element. Notice that the result also keeps the integer dtype, whereas the loop above stored floats because np.empty creates a float array by default. Ufunc operations are extremely flexible: we can also perform operations between two arrays.
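To check that the vectorized expression really matches the explicit loop, the two results can be compared directly. This is a minimal sketch: the array contents are fixed by hand and the helper name `square_with_loop` is a hypothetical one introduced only for this illustration.

```python
import numpy as np

def square_with_loop(values):
    """Square each element with an explicit Python loop (hypothetical helper)."""
    output = np.empty(len(values), dtype=values.dtype)  # keep the input dtype
    for i in range(len(values)):
        output[i] = values[i] ** 2
    return output

# Hand-written sample values (an assumption, not drawn at random)
values = np.array([9, 5, 4, 1, 4, 6, 1, 3, 4, 9])

looped = square_with_loop(values)
vectorized = values ** 2  # the ufunc applies the operation to every element

print(np.array_equal(looped, vectorized))  # True
print(vectorized.dtype == values.dtype)    # True: integers stay integers
```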

In [18]:
array1 = np.arange(1, 5)
array2 = np.arange(6, 10)

print(array1, array2)

# adding corresponding elements of array1 and array2

addition = array1 + array2

print(addition)

[1 2 3 4] [6 7 8 9]
[ 7  9 11 13]


All these arithmetic operators are wrappers around NumPy’s built-in functions. For example, the + operator is a wrapper for the np.add function.


In [19]:
addition = np.add(array1, array2)

addition

array([ 7,  9, 11, 13])

NumPy arithmetic operations:

1)    +    np.add             Addition

2)    -    np.subtract        Subtraction

3)    -    np.negative        Unary negation

4)    *    np.multiply        Multiplication

5)    /    np.divide          Division

6)    //   np.floor_divide    Floor Division

7)    **   np.power           Exponentiation

8)    %    np.mod             Modulus or remainder
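The correspondence in this list can be verified by comparing each operator with its ufunc. The sketch below uses two small hand-picked arrays (an assumption for illustration) and collects the comparisons in a dictionary named `checks`, which is not part of the original notebook.

```python
import numpy as np

a = np.array([7, 8, 9, 10])
b = np.array([2, 3, 4, 5])

# Each entry compares an operator with the ufunc it wraps
checks = {
    "+":       np.array_equal(a + b,  np.add(a, b)),
    "-":       np.array_equal(a - b,  np.subtract(a, b)),
    "unary -": np.array_equal(-a,     np.negative(a)),
    "*":       np.array_equal(a * b,  np.multiply(a, b)),
    "/":       np.array_equal(a / b,  np.divide(a, b)),
    "//":      np.array_equal(a // b, np.floor_divide(a, b)),
    "**":      np.array_equal(a ** b, np.power(a, b)),
    "%":       np.array_equal(a % b,  np.mod(a, b)),
}

for operator, same in checks.items():
    print(f"{operator:8} matches its ufunc: {same}")
```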

Some of the most useful functions provided by NumPy are the trigonometric, logarithmic, and exponential functions. As data scientists, we should be aware of them. They will come in handy while working on real datasets.


### Trigonometric Functions

In [20]:
theta = np.linspace(0, np.pi, 5)

In [25]:
print(f"theta = {theta} \n")
print(f"sin(theta) = {np.sin(theta)} \n")
print(f"cos(theta) = {np.cos(theta)} \n")
print(f"tan(theta) = {np.tan(theta)} \n")

theta = [0.         0.78539816 1.57079633 2.35619449 3.14159265] 

sin(theta) = [0.00000000e+00 7.07106781e-01 1.00000000e+00 7.07106781e-01
 1.22464680e-16] 

cos(theta) = [ 1.00000000e+00  7.07106781e-01  6.12323400e-17 -7.07106781e-01
 -1.00000000e+00] 

tan(theta) = [ 0.00000000e+00  1.00000000e+00  1.63312394e+16 -1.00000000e+00
 -1.22464680e-16] 



### Inverse Trigonometric Functions

In [26]:
x = np.array([-1, 0, 1])

In [31]:
print(f"x = {x} \n")
print(f"arcsin(x) = {np.arcsin(x)} \n")
print(f"arccos(x) = {np.arccos(x)} \n")
print(f"arctan(x) = {np.arctan(x)} \n")

x = [-1  0  1] 

arcsin(x) = [-1.57079633  0.          1.57079633] 

arccos(x) = [3.14159265 1.57079633 0.        ] 

arctan(x) = [-0.78539816  0.          0.78539816] 



### Exponents

In [32]:
x = np.arange(1,4)

In [37]:
print(f"x = {x} \n")
print(f"e^x = {np.exp(x)} \n")
print(f"2^x = {np.exp2(x)} \n")
print(f"3^x = {np.power(3, x)} \n")

x = [1 2 3] 

e^x = [ 2.71828183  7.3890561  20.08553692] 

2^x = [2. 4. 8.] 

3^x = [ 3  9 27] 



### Logarithms

In [38]:
x = np.random.randint(1, 4, size = 4)

In [43]:
print(f"x = {x} \n")
print(f"ln(x) = {np.log(x)} \n")
print(f"log2(x) = {np.log2(x)} \n")
print(f"log10(x) = {np.log10(x)} \n")

x = [3 3 1 3] 

ln(x) = [1.09861229 1.09861229 0.         1.09861229] 

log2(x) = [1.5849625 1.5849625 0.        1.5849625] 

log10(x) = [0.47712125 0.47712125 0.         0.47712125] 



### Aggregation

As a data analyst or data scientist, the very first step is to explore and understand the data. One way to do this is to compute summary statistics. Although the most common summary statistics are the mean and standard deviation, other aggregates such as the sum, product, median, maximum, and minimum are also useful.

Let us understand with an example by computing the sum, min, and max.

In [44]:
array = np.random.random(10)

In [48]:
print(array, '\n')
print(f"Summation: {np.sum(array)} \n")
print(f"Min: {np.min(array)} \n")
print(f"Max: {np.max(array)} \n")

[0.13521817 0.32414101 0.14967487 0.22232139 0.38648898 0.90259848
 0.44994999 0.61306346 0.90234858 0.09928035] 

Summation: 4.1850852746175216 

Min: 0.09928035035897387 

Max: 0.9025984755294046 



For most NumPy aggregates, a shorthand is to call the corresponding method on the array object instead of the function. The operations above can also be written as shown below; the two forms are computationally equivalent.

In [49]:
array = np.random.random(10)

In [51]:
print(array, '\n')
print(f"Summation: {array.sum()} \n")
print(f"Min: {array.min()} \n")
print(f"Max: {array.max()} \n")

[0.96980907 0.65314004 0.17090959 0.35815217 0.75068614 0.60783067
 0.32504723 0.03842543 0.63427406 0.95894927] 

Summation: 5.467223647647123 

Min: 0.038425426472734725 

Max: 0.9698090677467488 



### IMPORTANT
#### Difference between Python aggregate functions and NumPy aggregate functions

A natural question is why we should use NumPy aggregate functions when Python already has built-in equivalents (sum(), min(), max(), etc.). The NumPy functions are much faster on arrays, but more importantly they are aware of dimensions. The Python built-ins behave differently on multidimensional arrays.

Suppose we would like to get the sum of all the elements in an array of size 2x5. For better understanding, we will take a simple array of numbers from 0 to 9.

In [52]:
array = np.arange(10).reshape(2, 5)

In [54]:
print(array, "\n")
print(f"Summation: {sum(array)}")

[[0 1 2 3 4]
 [5 6 7 8 9]] 

Summation: [ 5  7  9 11 13]


We expected 45 (0+1+2+3+4+5+6+7+8+9), but Python’s sum() iterates over the first axis and adds the two rows together, so it returns the column sums instead. Results like this are costly when summarizing data. Hence, always make sure you are using the NumPy version of an aggregate function when working on arrays.
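The difference is easy to see by putting the built-in and the NumPy versions side by side on the same 2x5 array. This is a small sketch that reuses the array from the example; nothing else is assumed.

```python
import numpy as np

array = np.arange(10).reshape(2, 5)

print(sum(array))         # built-in: adds the two rows -> [ 5  7  9 11 13]
print(np.sum(array))      # NumPy: sums every element   -> 45
print(array.sum(axis=0))  # explicit column sums        -> [ 5  7  9 11 13]
print(array.sum(axis=1))  # explicit row sums           -> [10 35]
```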


### Multidimensional aggregates

One common type of operation is aggregation along rows or columns, for example finding the minimum value in each row or in each column. Since NumPy functions are aware of dimensions, this is easy: they take an additional argument that specifies the axis along which to aggregate.

Suppose we have a table of marks obtained by students, where each row represents a student and each column a different subject. We wish to find the minimum and maximum marks in each subject and the total marks scored by each student. Use ‘axis=0’ to aggregate down each column (one result per column) and ‘axis=1’ to aggregate across each row (one result per row). The result is a 1-d array.

![image.png](attachment:image.png)

In [56]:
np.random.seed(0)

marks = np.random.randint(20, 100, size = (4, 6))

In [60]:
print(marks, "\n")
print(f'Min marks in each subject: {marks.min(axis = 0)} \n')
print(f'Max marks in each subject: {marks.max(axis = 0)} \n')
print(f'Total marks of each student: {marks.sum(axis = 1)} \n')

[[64 67 84 87 87 29]
 [41 56 90 32 78 85]
 [59 66 57 45 97 92]
 [29 40 89 99 67 84]] 

Min marks in each subject: [29 40 57 32 67 29] 

Max marks in each subject: [64 67 90 99 97 92] 

Total marks of each student: [418 382 416 408] 



### Other aggregation functions by NumPy

np.prod, np.mean, np.std, np.var, np.argmin (find index of minimum value), np.argmax (find index of maximum value), np.median, np.percentile (compute rank-based statistics of elements).
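A few of these functions applied to the marks table from the previous section are sketched below. The table is typed in by hand from the output shown above; outputs are only annotated where they follow directly from the totals already printed.

```python
import numpy as np

# Marks table copied from the example above (rows: students, columns: subjects)
marks = np.array([[64, 67, 84, 87, 87, 29],
                  [41, 56, 90, 32, 78, 85],
                  [59, 66, 57, 45, 97, 92],
                  [29, 40, 89, 99, 67, 84]])

totals = marks.sum(axis=1)
print(totals.argmax())  # 0: the first student has the highest total
print(totals.argmin())  # 1: the second student has the lowest total

print(marks.mean(axis=0))  # average mark per subject
print(marks.std(axis=0))   # standard deviation per subject
print(marks.var(axis=0))   # variance per subject (std squared)

subject_median = np.median(marks, axis=0)
subject_p50 = np.percentile(marks, 50, axis=0)  # 50th percentile is the median
print(subject_median)
print(subject_p50)
```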

### Broadcasting

We have already seen NumPy universal functions at the very beginning. Broadcasting is another means of applying ufuncs, this time on arrays of different sizes. Broadcasting is simply a set of rules that NumPy applies to perform ufuncs on arrays of different sizes.

Consider adding two arrays of shape (3, 3) and (3,). We can think of this operation as stretching, or broadcasting, the smaller array to match the shape of the larger one. The stretching does not actually take place in memory; it is only a mental model.

In [61]:
m = np.ones((3, 3))

In [71]:
print(f'm: {m} \n')
print(f'Shape of m: {m.shape}')

m: [[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]] 

Shape of m: (3, 3)


In [72]:
n = np.array([0, 1, 2])

In [73]:
print(f'n: {n} \n')
print(f'Shape of n: {n.shape}')

n: [0 1 2] 

Shape of n: (3,)


In [74]:
add = m + n

In [75]:
print(f'm + n: {add} \n')
print(f'Shape of m + n: {add.shape}')

m + n: [[1. 2. 3.]
 [1. 2. 3.]
 [1. 2. 3.]] 

Shape of m + n: (3, 3)
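That the stretching is only conceptual can be checked with `np.broadcast_to`, which returns a read-only view with the broadcast shape. This is a minimal sketch; the variable name `stretched` is introduced here for illustration only.

```python
import numpy as np

m = np.ones((3, 3))
n = np.array([0, 1, 2])

# A (3, 3) view of n: no data is copied
stretched = np.broadcast_to(n, (3, 3))

print(stretched)
print(stretched.strides[0])            # 0: every row reuses the same memory
print(np.shares_memory(n, stretched))  # True
print(np.array_equal(m + n, m + stretched))  # True: same result either way
```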


Things get more confusing when both arrays need to be broadcast.

In [77]:
m = np.arange(3)
n = np.arange(3).reshape(3, 1)

In [79]:
print(f'm: {m} \n')
print(f'Shape of m: {m.shape} \n')
print(f'n: {n} \n')
print(f'Shape of n: {n.shape}')

m: [0 1 2] 

Shape of m: (3,) 

n: [[0]
 [1]
 [2]] 

Shape of n: (3, 1)


In [80]:
add = m + n

In [81]:
print(f'm + n: {add} \n')
print(f'Shape of m + n: {add.shape}')

m + n: [[0 1 2]
 [1 2 3]
 [2 3 4]] 

Shape of m + n: (3, 3)


Jake VanderPlas, author of the Python Data Science Handbook, has provided an excellent visualization to explain this process. The light-colored boxes represent the stretched values.

![image.png](attachment:image.png)

### 3 Rules for Broadcasting

The picture above is a mental model. We will now explore the formal rules with examples.

In [82]:
m = np.arange(3).reshape((3,1))
n = np.arange(3)
m.shape = (3, 1)
n.shape = (3,)

By rule 1, if the two arrays differ in their number of dimensions, the shape of the one with fewer dimensions is padded with ones on its left side. Padding is done only on the left side.

* m.shape => (3, 1)
* n.shape => (1, 3)

By rule 2, if the shapes still do not match in some dimension, the array whose size in that dimension is 1 is stretched to match the other array.

* m.shape => (3, 3)
* n.shape => (3, 3)

Rule 2 deserves emphasis: we can stretch an array along a dimension only if its size in that dimension is 1. We cannot do this for any other size. Let’s see an example where a mismatched dimension is different from 1 when rule 2 is applied.

### Example 2:

* m = np.arange(6).reshape((3,2))
* n = np.arange(3)
* m.shape = (3, 2)
* n.shape = (3,)

#### By rule 1,

* m.shape => (3, 2)
* n.shape => (1, 3)

#### By rule 2,

* m.shape => (3, 2)
* n.shape => (3, 3)
#### Note: we can stretch only when value is 1.

#### By rule 3, if the sizes disagree in any dimension and neither of them is 1, an error is raised. Here the last dimension is 2 for m and 3 for n, so m + n fails.
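The three rules can be written out as a short function that computes the broadcast shape of two shapes. This is only a sketch of the rules as stated above, not NumPy's actual implementation, and `broadcast_shape` is a hypothetical helper name.

```python
import numpy as np

def broadcast_shape(shape_a, shape_b):
    """Apply the three broadcasting rules to two shapes (hypothetical helper)."""
    ndim = max(len(shape_a), len(shape_b))
    # Rule 1: pad the shorter shape with ones on its left side
    a = (1,) * (ndim - len(shape_a)) + tuple(shape_a)
    b = (1,) * (ndim - len(shape_b)) + tuple(shape_b)
    result = []
    for dim_a, dim_b in zip(a, b):
        if dim_a == dim_b or dim_b == 1:
            result.append(dim_a)  # Rule 2: sizes agree, or b is stretched
        elif dim_a == 1:
            result.append(dim_b)  # Rule 2: a is stretched
        else:
            # Rule 3: sizes disagree and neither is 1
            raise ValueError(f"cannot broadcast {shape_a} with {shape_b}")
    return tuple(result)

print(broadcast_shape((3, 1), (3,)))  # (3, 3), as in the first example

try:
    broadcast_shape((3, 2), (3,))     # Example 2
except ValueError as err:
    print("ValueError:", err)

# NumPy itself raises the same kind of error for Example 2
try:
    np.arange(6).reshape((3, 2)) + np.arange(3)
except ValueError as err:
    print("NumPy ValueError:", err)
```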

### Masking

Masking is a method used extensively in data processing. It allows us to extract, count, modify, or otherwise manipulate values in an array based on certain criteria. These criteria are specified using comparison operators and boolean operators.

Suppose we have a two-dimensional array of size (3, 4) and would like to get the values that are less than 5.

In [86]:
np.random.seed(0)

x = np.random.randint(10, size = (3, 4))

In [87]:
print(f'Original: {x} \n')
print(f'Values of x less than 5: {x[x < 5]} \n')

Original: [[5 0 3 3]
 [7 9 3 5]
 [2 4 7 6]] 

Values of x less than 5: [0 3 3 3 2 4] 



#### Let’s break it down

We used the comparison operator ‘<’ on array x. As we already know, this applies an element-wise ufunc (np.less()) to the array. As a result, we get an array of boolean values: True if the element at the corresponding position is less than 5, else False.

In [88]:
[x < 5]

[array([[False,  True,  True,  True],
        [False, False,  True, False],
        [ True,  True, False, False]])]

When we write x[x<5], the boolean array above is applied to the original array x, returning the elements at the positions that are True, that is, the values less than 5. In the same way we can use any of the comparison operators available in Python. We can even combine two conditions, say x[(x>3) & (x<6)] to get values strictly between 3 and 6; the only requirement is that the result of the expression is a boolean array. Notice that here we use the bitwise operator ‘&’ rather than the keyword ‘and’.

### REMEMBER
The keywords ‘and’ and ‘or’ evaluate the truth of an entire object as a single boolean, which raises a ValueError for an array with more than one element, while the bitwise operators ‘&’ and ‘|’ operate element by element. Always use the bitwise operators when masking.
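The sketch below shows extracting, counting, and modifying with a mask, and what happens when ‘and’ is used instead of ‘&’. The array is typed in from the output above; the names `mask`, `and_failed`, and `y` are introduced here for illustration.

```python
import numpy as np

x = np.array([[5, 0, 3, 3],
              [7, 9, 3, 5],
              [2, 4, 7, 6]])

# Extract: combine two conditions element by element with &
mask = (x > 3) & (x < 6)
print(x[mask])                   # [5 5 4]

# Count: True counts as 1
print(np.count_nonzero(x < 5))   # 6

# The keyword 'and' asks for the truth value of a whole array and fails
and_failed = False
try:
    (x > 3) and (x < 6)
except ValueError as err:
    and_failed = True
    print("ValueError:", err)

# Modify: assign through the mask (on a copy, to keep x intact)
y = x.copy()
y[y < 5] = 0
print(y)
```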

### Fancy indexing

Fancy indexing is similar to the normal indexing we already know. The only difference is that we pass an array of indices instead of a single index. This advanced version of indexing allows quick access to and modification of complicated subsets of an array.

Suppose we want to access the elements at index 2, 5, and 9 of an array. The old-school method would be [x[2], x[5], x[9]]. This can be simplified using fancy indexing.

In [91]:
x = np.random.randint(100, size = 10)

In [92]:
print(x)
print(x[[2, 5, 9]])

[34 48 93  3 98 42 77 21 73  0]
[93 42  0]


Likewise, we can fancy index a two-dimensional array. Let’s see the equivalent of x[0, 2], x[1, 3] and x[2, 1] in fancy indexing.

In [93]:
x = np.random.randint(100, size = (3, 5))

In [96]:
print(x, '\n')

row = [0, 1, 2]
col = [2, 3, 1]

print(x[row, col])

[[10 43 58 23 59]
 [ 2 98 62 35 94]
 [67 82 46 99 20]] 

[58 35 82]


This can be further simplified if either the row or the column index is constant. Let’s say we would like to get the values at x[2, 1], x[2, 3] and x[2, 4]: the row index 2 is passed once and the column indices are passed as a list. Similarly, we can also modify values using fancy indexing with the assignment operator ‘=’.

In [97]:
x = np.random.randint(100, size = (3, 5))

In [99]:
print(x, '\n')
print(x[2, [1, 3, 4]])

[[81 50 27 14 41]
 [58 65 36 10 86]
 [43 11  2 51 80]] 

[11 51 80]
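Modifying values through fancy indexing follows the same pattern as reading them. The sketch below reuses the array printed above (typed in by hand) and works on a copy named `y`, a name introduced here for illustration.

```python
import numpy as np

x = np.array([[81, 50, 27, 14, 41],
              [58, 65, 36, 10, 86],
              [43, 11,  2, 51, 80]])

y = x.copy()

# Constant row, several columns: overwrite x[2, 1], x[2, 3] and x[2, 4]
y[2, [1, 3, 4]] = 0
print(y[2])          # [43  0  2  0  0]

# Paired row and column indices: update y[0, 2] and y[1, 3]
y[[0, 1], [2, 3]] = [-1, -2]
print(y[0, 2], y[1, 3])   # -1 -2
```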


### Array sorting

np.sort is generally a more efficient way to sort NumPy arrays than Python’s built-in sorting functions. Additionally, np.sort is aware of dimensions. Let’s see a few flavors of the NumPy sorting function.

In [100]:
x = np.random.randint(10, size = 10)

In [101]:
print(x, '\n')
print(f'Sorted array: {np.sort(x)}')

[0 6 0 6 3 3 8 8 8 2] 

Sorted array: [0 0 2 3 3 6 6 8 8 8]


In [104]:
# Indices of the sorted array

print(f'Indices of the sorted array: {np.argsort(x)}')

Indices of the sorted array: [0 2 9 4 5 1 3 6 7 8]


In [105]:
x.sort()

print(f'In-place sorting: {x}')

In-place sorting: [0 0 2 3 3 6 6 8 8 8]


Notice that when we use the method sort(), it alters array x itself, meaning the original order of array x is lost. This is called in-place sorting. By contrast, np.sort(x) returns a sorted copy and leaves x unchanged.
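The sketch below contrasts the copy returned by np.sort with in-place sorting, uses the argsort indices to rebuild the sorted array, and sorts a two-dimensional array along each axis. The 1-d array is typed in from the output above; the 2-d array `grid` is a made-up example.

```python
import numpy as np

x = np.array([0, 6, 0, 6, 3, 3, 8, 8, 8, 2])

sorted_copy = np.sort(x)     # returns a new array
print(x)                     # x is unchanged: [0 6 0 6 3 3 8 8 8 2]

order = np.argsort(x)        # indices that would sort x
print(np.array_equal(x[order], sorted_copy))  # True

# Sorting is aware of dimensions through the axis argument
grid = np.array([[9, 1, 8],
                 [3, 7, 2]])
print(np.sort(grid, axis=0))  # sort each column: [[3 1 2] [9 7 8]]
print(np.sort(grid, axis=1))  # sort each row:    [[1 8 9] [2 3 7]]
```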
