Python Data Analytics Book

# Chapter 3 NumPy

In [1]:
import numpy as np

## Basic Operations

### Arithmetic Operations

In [2]:
a = np.arange(4)
print(a)

[0 1 2 3]


In [3]:
a + 4 # operations are element-wise

array([4, 5, 6, 7])

In [4]:
a*2

array([0, 2, 4, 6])

In [5]:
b = np.arange(4,8)
print(b)

[4 5 6 7]


In [7]:
print(a+b)

[ 4  6  8 10]


In [8]:
print(a-b)

[-4 -4 -4 -4]


In [9]:
print(a*b)

[ 0  5 12 21]


These operators also work with functions, provided that the value returned is a NumPy array. For example, we can multiply the array a by the sine or the square root of the elements of the array b.

In [10]:
a * np.sin(b)

array([-0.        , -0.95892427, -0.558831  ,  1.9709598 ])

In [13]:
print(a*np.sqrt(b))

[0.         2.23606798 4.89897949 7.93725393]


Even in the multidimensional case, the arithmetic operations are element-wise. 

In [18]:
A = np.arange(0, 9).reshape(3,3)
print(A)

[[0 1 2]
 [3 4 5]
 [6 7 8]]


In [19]:
B = np.ones((3,3))
print(B)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]


In [20]:
print(A*B)

[[0. 1. 2.]
 [3. 4. 5.]
 [6. 7. 8.]]


### The Matrix Product

We use the dot() function to compute the matrix product.

In [21]:
np.dot(A,B)

array([[ 3.,  3.,  3.],
       [12., 12., 12.],
       [21., 21., 21.]])

In [22]:
# We can also write as follows
A.dot(B)

array([[ 3.,  3.,  3.],
       [12., 12., 12.],
       [21., 21., 21.]])

Since the matrix product is not a commutative operation, the order of the operands matters: np.dot(A, B) is generally not equal to np.dot(B, A). (The element-wise product A*B, by contrast, does equal B*A.)

In [23]:
B.dot(A)

array([[ 9., 12., 15.],
       [ 9., 12., 15.],
       [ 9., 12., 15.]])
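
As a minimal sketch (assuming Python 3.5+ and a NumPy version that supports the `@` operator), the same products can be written with `@`, and `np.array_equal` can confirm that swapping the operands changes the matrix product but not the element-wise product. The variable names `same_as_dot`, `commutes`, and `elementwise_commutes` are illustrative only.

```python
import numpy as np

A = np.arange(0, 9).reshape(3, 3)
B = np.ones((3, 3))

# For 2-D arrays the @ operator computes the same matrix product as np.dot()
AB = A @ B
BA = B @ A

same_as_dot = np.array_equal(AB, np.dot(A, B))        # @ agrees with np.dot
commutes = np.array_equal(AB, BA)                     # matrix product depends on operand order
elementwise_commutes = np.array_equal(A * B, B * A)   # element-wise product does not

print(same_as_dot, commutes, elementwise_commutes)
```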

### Increment and Decrement Operators

In [29]:
a = np.arange(4)
a

array([0, 1, 2, 3])

In [30]:
a+= 1
print(a)

[1 2 3 4]


In [31]:
a -= 2
print(a)

[-1  0  1  2]


In [32]:
a *= 2
print(a)

[-2  0  2  4]


### Universal Functions (ufunc)

A universal function (ufunc) operates on an array element by element and outputs an array of the same size as its input.

E.g., mathematical and trigonometric operations such as sqrt(), log(), and sin().

In [33]:
a = np.arange(1, 5)
a

array([1, 2, 3, 4])

In [34]:
np.sqrt(a)

array([1.        , 1.41421356, 1.73205081, 2.        ])

In [35]:
np.log(a)

array([0.        , 0.69314718, 1.09861229, 1.38629436])

In [36]:
np.sin(a)

array([ 0.84147098,  0.90929743,  0.14112001, -0.7568025 ])

### Aggregate Functions

In [44]:
a = np.array([3.3, 4.5, 1.2, 5.7, 0.3])
print(a.sum())
print(a.min())
print(a.max())
print(a.mean())
print(a.std())

15.0
0.3
5.7
3.0
2.0079840636817816
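
One detail worth knowing about std(): by default it divides by N (the population standard deviation). A minimal sketch, reusing the same values, of how the `ddof` parameter switches to the sample standard deviation (dividing by N - 1); the `manual_*` names are illustrative only.

```python
import numpy as np

a = np.array([3.3, 4.5, 1.2, 5.7, 0.3])

pop_std = a.std()             # default ddof=0: divide by N
sample_std = a.std(ddof=1)    # ddof=1: divide by N - 1

# Recompute both by hand to make the difference explicit
squared_dev = (a - a.mean()) ** 2
manual_pop = np.sqrt(squared_dev.sum() / a.size)
manual_sample = np.sqrt(squared_dev.sum() / (a.size - 1))

print(pop_std, sample_std)
```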


## Indexing, Slicing, and Iterating

### Indexing

In [46]:
a = np.arange(10, 16)
print(a)

[10 11 12 13 14 15]


In [47]:
a[4]

14

In [48]:
a[-1]

15

To select multiple items at once, we can pass an array of indexes within the square brackets.

In [49]:
a[[2,3,5]]

array([12, 13, 15])

In [50]:
# Two-dimensional array
A = np.arange(10, 19).reshape((3,3))
print(A)

[[10 11 12]
 [13 14 15]
 [16 17 18]]


In [51]:
# third element in second row:
A[1, 2]

15

### Slicing

Slicing is the operation that lets you extract portions of an array to generate new ones. Whereas
slicing a Python list produces a copy, slicing a NumPy array produces a view onto the same
underlying buffer.
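
The difference between a list slice and an array slice can be sketched as follows. This is an illustrative example with made-up variable names; `np.shares_memory` is used only to check that the slice and the original array overlap in memory.

```python
import numpy as np

py_list = [10, 11, 12, 13, 14, 15]
np_arr = np.array(py_list)

list_slice = py_list[1:4]   # a new list (a copy of the selected items)
arr_slice = np_arr[1:4]     # a view onto np_arr's buffer

list_slice[0] = -1          # changes only the copy
arr_slice[0] = -1           # changes np_arr as well

print(py_list)                              # the original list is unchanged
print(np_arr)                               # the original array sees the change
print(np.shares_memory(np_arr, arr_slice))  # the view overlaps the original
```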

In [52]:
a = np.arange(10, 16)
a

array([10, 11, 12, 13, 14, 15])

In [54]:
a[1:5] # slice from the 2nd to the 5th element
# a[1:5:1] is equivalent: the step defaults to 1

array([11, 12, 13, 14])

In [55]:
a[1:5:2] # step of 2: takes every second element

array([11, 13])

In [57]:
a[::2] 
# if the first number (start) is omitted, it defaults to 0
# if the second number (stop) is omitted, it defaults to the length of the array
# if the last number (step) is omitted, it defaults to 1

array([10, 12, 14])

In [59]:
a[:5:2]

array([10, 12, 14])

In [60]:
a[:5:]

array([10, 11, 12, 13, 14])

For a 2-d array, slicing is defined separately for rows and for columns.

In [61]:
A = np.arange(10, 19).reshape(3,3)
A

array([[10, 11, 12],
       [13, 14, 15],
       [16, 17, 18]])

In [62]:
A[0, :] # first row, all columns

array([10, 11, 12])

In [64]:
A[:, 0] # all rows, first column

array([10, 13, 16])

In [65]:
# To extract a smaller matrix, define the row and column intervals explicitly
A[0:2, 0:2]

array([[10, 11],
       [13, 14]])

If the indexes of the rows or columns to be extracted are not contiguous, we can specify an array of indexes.

In [67]:
A[[0,2], 0:2] # first and third rows, first and second columns

array([[10, 11],
       [16, 17]])
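
When both the rows and the columns are non-contiguous, passing two index lists does not select a sub-matrix: the lists are paired element by element. A minimal sketch of the difference, using `np.ix_` to build the row/column cross product (the names `sub` and `paired` are illustrative only):

```python
import numpy as np

A = np.arange(10, 19).reshape(3, 3)

# Rows 0 and 2 crossed with columns 0 and 2 -> a 2x2 sub-matrix
sub = A[np.ix_([0, 2], [0, 2])]

# Two plain index lists are paired up instead:
# this picks only A[0, 0] and A[2, 2]
paired = A[[0, 2], [0, 2]]

print(sub)
print(paired)
```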

### Iterating an Array

In [71]:
for i in a:
    print(i)

10
11
12
13
14
15


In [72]:
for row in A:
    print(row)

[10 11 12]
[13 14 15]
[16 17 18]


In [77]:
# To iterate element by element, use the flat attribute
for i in A.flat:
    print(i)

10
11
12
13
14
15
16
17
18


apply_along_axis() takes three arguments: the function to apply (typically an aggregate function), the axis along which to apply it, and finally the array.

axis = 0: the function is applied column by column

axis = 1: the function is applied row by row

In [79]:
np.apply_along_axis(np.mean, axis = 0, arr=A)

array([13., 14., 15.])

In [80]:
np.apply_along_axis(np.median, axis = 1, arr=A)

array([11., 14., 17.])

In [87]:
def half(x):
    return x/2

In [88]:
np.apply_along_axis(half, axis=1, arr=A) # along rows

array([[5. , 5.5, 6. ],
       [6.5, 7. , 7.5],
       [8. , 8.5, 9. ]])

In [90]:
np.apply_along_axis(half, axis=0, arr=A) # along columns: an element-wise function gives the same result on either axis

array([[5. , 5.5, 6. ],
       [6.5, 7. , 7.5],
       [8. , 8.5, 9. ]])
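
For the built-in aggregates, the same results can usually be obtained without apply_along_axis(), because these functions accept an `axis` argument directly. A minimal sketch, using the same 3x3 array as the cells above:

```python
import numpy as np

A = np.arange(10, 19).reshape(3, 3)

col_means = A.mean(axis=0)           # one value per column
row_medians = np.median(A, axis=1)   # one value per row
halved = A / 2                       # element-wise: no iteration helper needed

print(col_means)
print(row_medians)
print(halved)
```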

## Conditions and Boolean Arrays

In [91]:
# Problem: need to select all values less than 0.5 in a 4x4 matrix
A = np.random.random([4,4])
A

array([[0.7657256 , 0.65707169, 0.27178928, 0.63491188],
       [0.66021638, 0.21012729, 0.05527403, 0.0596874 ],
       [0.35620105, 0.05431466, 0.98845667, 0.46762767],
       [0.49922656, 0.10063685, 0.51946963, 0.76574972]])

In [92]:
A<0.5 # returns a Boolean array

array([[False, False,  True, False],
       [False,  True,  True,  True],
       [ True,  True, False,  True],
       [ True,  True, False, False]])

In [93]:
# To extract the values that fit the condition
A[A<0.5]

array([0.27178928, 0.21012729, 0.05527403, 0.0596874 , 0.35620105,
       0.05431466, 0.46762767, 0.49922656, 0.10063685])
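
Conditions can also be combined. The sketch below uses a small fixed array (made up for reproducibility, not the random matrix above) to show a compound mask, counting matches, and np.where() for replacing values while keeping the shape.

```python
import numpy as np

A = np.array([[0.77, 0.66, 0.27, 0.63],
              [0.66, 0.21, 0.06, 0.06]])

# Combine conditions with & (and) or | (or); each comparison needs parentheses
mask = (A > 0.2) & (A < 0.5)

selected = A[mask]                     # 1-D array of the matching values
count = np.count_nonzero(mask)         # number of matching elements
clipped = np.where(A < 0.5, 0.0, A)    # zero out small values, keep the 2-D shape

print(selected, count)
print(clipped)
```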

## Shape Manipulation

In [116]:
# Convert 1-d array to matrix(2-d) using reshape
a = np.random.random(12)
print(a)

[0.82291654 0.55572977 0.50230859 0.05000322 0.45308712 0.3580098
 0.70570414 0.38938792 0.76515474 0.78281333 0.33054409 0.13381857]


In [118]:
A = a.reshape(3,4) # reshape returns a new array object (a view of the same data when possible)
A

array([[0.82291654, 0.55572977, 0.50230859, 0.05000322],
       [0.45308712, 0.3580098 , 0.70570414, 0.38938792],
       [0.76515474, 0.78281333, 0.33054409, 0.13381857]])
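
A short sketch of what reshape() returns, using an integer array for readability. Here the reshaped array shares its data with the original, so a write through one is visible through the other; copy() is needed for independent data. The names `C` and `D` are illustrative only.

```python
import numpy as np

a = np.arange(12)
A = a.reshape(3, 4)          # new array object, same underlying data in this case

A[0, 0] = 99                 # writing through A ...
print(a[0])                  # ... is visible through a
print(np.shares_memory(a, A))

C = a.reshape(3, -1)         # -1 lets NumPy infer the missing dimension
D = a.reshape(3, 4).copy()   # independent data
print(C.shape, np.shares_memory(a, D))
```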

In [119]:
# To modify the current array, assign a tuple containing new dimensions to shape
a.shape = (3,4)
a

array([[0.82291654, 0.55572977, 0.50230859, 0.05000322],
       [0.45308712, 0.3580098 , 0.70570414, 0.38938792],
       [0.76515474, 0.78281333, 0.33054409, 0.13381857]])

In [108]:
# the inverse operation, converting a 2-d array to 1-d, is possible with the ravel() function
print(a) # 2-d
a = a.ravel() # converts back to 1-d
print(a)

[[0.00442907 0.74061396 0.34680425 0.4684155 ]
 [0.70527895 0.46929213 0.17600258 0.23291177]
 [0.73137102 0.95725357 0.40976111 0.24918919]]
[0.00442907 0.74061396 0.34680425 0.4684155  0.70527895 0.46929213
 0.17600258 0.23291177 0.73137102 0.95725357 0.40976111 0.24918919]


In [120]:
# assigning directly to the shape attribute of the array itself also works
print(a)
a.shape = (12,)
print(a)

[[0.82291654 0.55572977 0.50230859 0.05000322]
 [0.45308712 0.3580098  0.70570414 0.38938792]
 [0.76515474 0.78281333 0.33054409 0.13381857]]
[0.82291654 0.55572977 0.50230859 0.05000322 0.45308712 0.3580098
 0.70570414 0.38938792 0.76515474 0.78281333 0.33054409 0.13381857]


In [123]:
a.shape = (3,4)
print(a)

[[0.82291654 0.55572977 0.50230859 0.05000322]
 [0.45308712 0.3580098  0.70570414 0.38938792]
 [0.76515474 0.78281333 0.33054409 0.13381857]]


In [127]:
A.transpose() # transpose

array([[0.82291654, 0.45308712, 0.76515474],
       [0.55572977, 0.3580098 , 0.78281333],
       [0.50230859, 0.70570414, 0.33054409],
       [0.05000322, 0.38938792, 0.13381857]])

## Array Manipulation

### Joining Arrays

Vertical stacking with vstack(): the rows of the second array are appended below the rows of the first array.

Horizontal stacking with hstack(): the columns of the second array are appended to the right of the columns of the first array.

In [130]:
A = np.ones((3,3))
B = np.zeros((3,3))

In [131]:
np.vstack((A, B))

array([[1., 1., 1.],
       [1., 1., 1.],
       [1., 1., 1.],
       [0., 0., 0.],
       [0., 0., 0.],
       [0., 0., 0.]])

In [132]:
np.hstack((A, B))

array([[1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0.],
       [1., 1., 1., 0., 0., 0.]])
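
For 2-d arrays such as these, both operations can also be expressed with np.concatenate() and an explicit axis. A minimal sketch (the names `v` and `h` are illustrative only):

```python
import numpy as np

A = np.ones((3, 3))
B = np.zeros((3, 3))

v = np.concatenate((A, B), axis=0)   # join along rows: same result as vstack here
h = np.concatenate((A, B), axis=1)   # join along columns: same result as hstack here

print(v.shape, h.shape)
```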

column_stack()    
row_stack()        
These functions are normally used with 1-d arrays, which are stacked as columns or rows to form a new 2-d array. Note that row_stack() is an alias of vstack() and is deprecated as of NumPy 2.0.

In [133]:
a = np.array([0,1,2])
b = np.array((3,4,5))
c = np.array([6,7,8])

In [134]:
np.column_stack((a, b, c))

array([[0, 3, 6],
       [1, 4, 7],
       [2, 5, 8]])

In [135]:
np.row_stack((a, b, c))

array([[0, 1, 2],
       [3, 4, 5],
       [6, 7, 8]])

### Splitting Arrays

hsplit()           
vsplit()

In [136]:
A = np.arange(16).reshape((4,4))
A

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [137]:
# Horizontal split: the width of the array is divided into 2 parts
# 4x4 matrix -> split into two 4x2 matrices

In [139]:
[B, C] = np.hsplit(A, 2)
print("Matrix B: ")
B

Matrix B: 


array([[ 0,  1],
       [ 4,  5],
       [ 8,  9],
       [12, 13]])

In [140]:
print("Matrix C: ")
C

Matrix C: 


array([[ 2,  3],
       [ 6,  7],
       [10, 11],
       [14, 15]])

In [150]:
# Split the array vertically: height of array divided into 2 parts
[B, C] = np.vsplit(A, 2)

In [151]:
B

array([[0, 1, 2, 3],
       [4, 5, 6, 7]])

In [152]:
C

array([[ 8,  9, 10, 11],
       [12, 13, 14, 15]])

A more general command is the split() function, which allows you to split the array into unequal parts. In addition to passing the array as an argument, you also have to specify the indexes at which to split.    
axis = 1: the indexes are column indexes;       
axis = 0: the indexes are row indexes.

In [157]:
[A1, A2, A3] = np.split(A, [1,3], axis=1) # axis = 1 -> colIndex
A1

array([[ 0],
       [ 4],
       [ 8],
       [12]])

In [155]:
A2

array([[ 1,  2],
       [ 5,  6],
       [ 9, 10],
       [13, 14]])

In [156]:
A3

array([[ 3],
       [ 7],
       [11],
       [15]])

In [158]:
[A1, A2, A3] = np.split(A, [1,3], axis = 0) # axis = 0 -> rowIndex
A1

array([[0, 1, 2, 3]])

In [159]:
A2

array([[ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [160]:
A3

array([[12, 13, 14, 15]])
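
The index list passed to split() marks where each cut falls, so the result can be reproduced with ordinary slices. The sketch below shows this equivalence and, as a related option, np.array_split(), which accepts a section count that does not divide the axis evenly. The names `parts`, `manual`, and `uneven` are illustrative only.

```python
import numpy as np

A = np.arange(16).reshape(4, 4)

# Indexes [1, 3] cut before column 1 and before column 3
parts = np.split(A, [1, 3], axis=1)
manual = [A[:, :1], A[:, 1:3], A[:, 3:]]

# array_split tolerates a count that does not divide the 4 rows evenly
uneven = np.array_split(A, 3, axis=0)

print([p.shape for p in parts])
print([p.shape for p in uneven])
```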

## General Concepts

### Copies or Views of Objects

In [162]:
a = np.array([1,2,3,4])
b = a
b

array([1, 2, 3, 4])

In [166]:
a[2] = 0
print(b) # assigning one array (a) to another name (b) does not copy it;
# b is just another way to refer to the same array (a).

[1 2 0 4]


In [167]:
# When slicing, the object returned is only a view of the original array.
c = a[0:2]
c

array([1, 2])

In [169]:
a[0] = 0
c # even with slicing, we are pointing to the same data.
# To generate a complete and distinct copy of the array, we can use copy()

array([0, 2])

In [170]:
a = np.array([1,2,3,4])
c = a.copy()
c

array([1, 2, 3, 4])

In [172]:
a[0] = 0
c # after copying c remains unchanged

array([1, 2, 3, 4])
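
The three cases (plain assignment, slicing, copy()) can be told apart programmatically. A minimal sketch using the `base` attribute and np.shares_memory(); the names `alias`, `view`, and `independent` are illustrative only.

```python
import numpy as np

a = np.array([1, 2, 3, 4])

alias = a               # plain assignment: a second name for the same object
view = a[0:2]           # slicing: a new array object sharing a's data
independent = a.copy()  # copy(): a new array with its own data

print(alias is a)                        # same object
print(view.base is a)                    # the view is built on a
print(independent.base is None)          # the copy owns its data
print(np.shares_memory(a, independent))  # no shared buffer
```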

### Vectorization

Vectorization, together with broadcasting, is the basis of NumPy's internal implementation. Vectorization is the absence of explicit loops in the code you write: the loops still run, but internally, in compiled code. It leads to more concise and readable code that looks more like a mathematical expression.

In [175]:
# the element-wise product of two arrays can be written simply as a*b
# and the element-wise product of two matrices as A*B

Java-style example of the explicit loop needed for the element-wise product of two arrays:        
for (i = 0; i < rows; i++){    
c[i] = a[i]*b[i];    
}

For the element-wise product of two matrices:    
for (i = 0; i < rows; i++){   
for (j = 0; j < columns; j++){    
c[i][j] = a[i][j]*b[i][j];   
}    
}
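
The same contrast can be sketched in Python itself: an explicit loop over the indexes versus the vectorized expression. The names `c_loop` and `c_vec` are illustrative only.

```python
import numpy as np

a = np.arange(4)
b = np.arange(4, 8)

# Explicit loop, in the style of the Java example
c_loop = np.empty_like(a)
for i in range(a.size):
    c_loop[i] = a[i] * b[i]

# Vectorized form: the loop runs inside NumPy
c_vec = a * b

print(c_loop)
print(c_vec)
```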

### Broadcasting

In [176]:
A = np.arange(16).reshape(4,4)
b = np.arange(4)
A

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15]])

In [177]:
b

array([0, 1, 2, 3])

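Broadcasting follows two rules:        
1. If the two arrays have a different number of dimensions, add a dimension of size 1 for each missing dimension of the smaller array. If the shapes are now compatible (in each dimension the sizes are equal or one of them is 1), move on to the second rule.

2. Extend each dimension of size 1 so that it takes on the size of the corresponding dimension of the other array, so that element-wise operations become applicable.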

In [180]:
A+b # broadcasting rules applied

array([[ 0,  2,  4,  6],
       [ 4,  6,  8, 10],
       [ 8, 10, 12, 14],
       [12, 14, 16, 18]])
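
What broadcasting does implicitly in A+b can be spelled out by hand. This is only a sketch for illustration (NumPy does not need these steps written out): np.newaxis adds the missing dimension, and np.broadcast_to() stretches it to the shape of A.

```python
import numpy as np

A = np.arange(16).reshape(4, 4)
b = np.arange(4)

b_2d = b[np.newaxis, :]                    # rule 1: shape (4,) becomes (1, 4)
b_full = np.broadcast_to(b_2d, A.shape)    # rule 2: stretched to (4, 4), read-only view

result = A + b_full

print(b_2d.shape, b_full.shape)
print(np.array_equal(result, A + b))
```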

There may be more complex cases in which the two arrays have different shapes and each of them is smaller than the other in some dimension.

In [185]:
m = np.arange(6).reshape(3,1,2)
n = np.arange(6).reshape(3,2,1)
m

array([[[0, 1]],

       [[2, 3]],

       [[4, 5]]])

In [186]:
n

array([[[0],
        [1]],

       [[2],
        [3]],

       [[4],
        [5]]])

In this case both arrays undergo the extension of dimensions (broadcasting).

In [187]:
m + n

array([[[ 0,  1],
        [ 1,  2]],

       [[ 4,  5],
        [ 5,  6]],

       [[ 8,  9],
        [ 9, 10]]])
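
The shape of the result can be checked before doing the arithmetic. A minimal sketch, using np.broadcast() to obtain the common shape and np.broadcast_to() to make the extension of both arrays explicit (the `*_full` names are illustrative only):

```python
import numpy as np

m = np.arange(6).reshape(3, 1, 2)
n = np.arange(6).reshape(3, 2, 1)

out_shape = np.broadcast(m, n).shape     # common shape both arrays are extended to

m_full = np.broadcast_to(m, out_shape)   # size-1 axis 1 is stretched to 2
n_full = np.broadcast_to(n, out_shape)   # size-1 axis 2 is stretched to 2
explicit = m_full + n_full

print(out_shape)
print(np.array_equal(explicit, m + n))
```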

## Structured Array

A structured array is declared with a dtype that lists the type of each field, in order:    
boolean b1   
int i1, i2, i4, i8   
unsigned ints u1, u2, u4, u8    
floats f2, f4, f8    
complex c8, c16   
fixed-length strings a<n> (reported by NumPy as S<n>)

In [191]:
structured = np.array([(1, 'First', 0.5, 1+2j),
                       (2, 'Second', 1.3, 2-2j), 
                       (3, 'Third', 0.8, 1+3j)], dtype=('i2, a6, f4, c8'))

structured

array([(1, b'First', 0.5, 1.+2.j), (2, b'Second', 1.3, 2.-2.j),
       (3, b'Third', 0.8, 1.+3.j)],
      dtype=[('f0', '<i2'), ('f1', 'S6'), ('f2', '<f4'), ('f3', '<c8')])

In [193]:
structured = np.array([(1, 'First', 0.5, 1+2j),
                       (2, 'Second', 1.3,2-2j),
                       (3, 'Third', 0.8, 1+3j)],
                       dtype=('int16, a6, float32, complex64'))
structured # regular datatypes can also be explicitly used

array([(1, b'First', 0.5, 1.+2.j), (2, b'Second', 1.3, 2.-2.j),
       (3, b'Third', 0.8, 1.+3.j)],
      dtype=[('f0', '<i2'), ('f1', 'S6'), ('f2', '<f4'), ('f3', '<c8')])

In [194]:
structured[1]

(2, b'Second', 1.3, 2.-2.j)

In [195]:
structured['f1'] # names are assigned automatically: f (for field)
# followed by a progressive integer that indicates the position in the sequence.

array([b'First', b'Second', b'Third'], dtype='|S6')

In [196]:
# We can assign more meaningful names at the time of declaring an array
structured = np.array([(1,'First',0.5,1+2j),
                       (2,'Second',1.3,2-2j),
                       (3,'Third',0.8,1+3j)],
                       dtype=[('id','i2'),('position','a6'),('value','f4'),('complex','c8')])

structured

array([(1, b'First', 0.5, 1.+2.j), (2, b'Second', 1.3, 2.-2.j),
       (3, b'Third', 0.8, 1.+3.j)],
      dtype=[('id', '<i2'), ('position', 'S6'), ('value', '<f4'), ('complex', '<c8')])

In [197]:
# or rename the fields later by redefining the tuple of names assigned to the
# dtype attribute of the structured array
structured.dtype.names = ('ID', 'Order', 'Value', "Complex_value")
structured

array([(1, b'First', 0.5, 1.+2.j), (2, b'Second', 1.3, 2.-2.j),
       (3, b'Third', 0.8, 1.+3.j)],
      dtype=[('ID', '<i2'), ('Order', 'S6'), ('Value', '<f4'), ('Complex_value', '<c8')])

In [198]:
structured['Order']

array([b'First', b'Second', b'Third'], dtype='|S6')
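
Named fields can also be used for filtering and sorting. The sketch below builds a reduced version of the array (three fields only, with the `S6` spelling of the string type) and is meant as an illustration rather than a continuation of the cells above; the names `records`, `high`, `by_value`, and `decoded` are made up for this example.

```python
import numpy as np

records = np.array([(1, b'First', 0.5),
                    (2, b'Second', 1.3),
                    (3, b'Third', 0.8)],
                   dtype=[('id', 'i2'), ('position', 'S6'), ('value', 'f4')])

high = records[records['value'] > 0.6]        # Boolean mask built from one field
by_value = np.sort(records, order='value')    # sort the records by a named field
decoded = [p.decode('ascii') for p in records['position']]  # bytes -> str

print(high['id'])
print(by_value['id'])
print(decoded)
```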

## Reading and Writing Array Data on Files

### Loading and Saving Data in Binary Files

In [201]:
# save() and load() are used
data = np.random.random(16).reshape(4,4)
data

array([[0.38946112, 0.28364488, 0.18181654, 0.90461758],
       [0.69690805, 0.10486431, 0.23786802, 0.64607234],
       [0.46775743, 0.34660811, 0.02862247, 0.40247567],
       [0.14678848, 0.29250875, 0.04247112, 0.27343011]])

In [202]:
# save data
np.save('saved_data', data) # saved into current working directory

In [203]:
# when we need to recover the data stored in .npy file use load()
loaded_data = np.load('saved_data.npy')
loaded_data

array([[0.38946112, 0.28364488, 0.18181654, 0.90461758],
       [0.69690805, 0.10486431, 0.23786802, 0.64607234],
       [0.46775743, 0.34660811, 0.02862247, 0.40247567],
       [0.14678848, 0.29250875, 0.04247112, 0.27343011]])
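
When several arrays belong together, np.savez() stores them in a single .npz archive, keyed by name. A minimal sketch, writing to a temporary directory so that nothing is left behind; the file name `bundle.npz` and the array names are hypothetical.

```python
import os
import tempfile

import numpy as np

data = np.arange(16, dtype=float).reshape(4, 4)
labels = np.array([0, 1, 0, 1])

with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, 'bundle.npz')        # hypothetical file name
    np.savez(path, data=data, labels=labels)      # keyword names become the keys

    with np.load(path) as archive:                # closes the file on exit
        names = sorted(archive.files)
        restored = archive['data']
        restored_labels = archive['labels']

print(names)
print(np.array_equal(restored, data))
```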

### Reading File with Tabular Data

In [205]:
# TXT or CSV format
dat = np.genfromtxt('data.csv', delimiter = ',', names = True)
dat # we get a structured array in which the column headings become the field names

array([(1., 123., 1.4, 23.), (2., 110., 0.5, 18.), (3., 164., 2.1, 19.)],
      dtype=[('id', '<f8'), ('value1', '<f8'), ('value2', '<f8'), ('value3', '<f8')])
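
The contents of data.csv are not shown here. The sketch below assumes a file whose text matches the output above (this reconstruction is an assumption) and reads it from an in-memory buffer instead of from disk, then accesses the columns by field name.

```python
import io

import numpy as np

# Assumed contents of data.csv, reconstructed from the output shown above
csv_text = (
    "id,value1,value2,value3\n"
    "1,123,1.4,23\n"
    "2,110,0.5,18\n"
    "3,164,2.1,19\n"
)

dat = np.genfromtxt(io.StringIO(csv_text), delimiter=',', names=True)

value1 = dat['value1']                    # one column, selected by field name
big_ids = dat[dat['value2'] > 1.0]['id']  # rows filtered on one field

print(dat.dtype.names)
print(value1)
print(big_ids)
```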