I'm trying my hand at numpy for the first time, so I'm going to go through some basics in preparation for the programming assignments. I will try to keep this updated with more hands-on numpy and related packages as I go further.

Import the package first!

In [16]:
import numpy as np

Arrays are the basic construct in numpy. Here is how to define one.

In [19]:
arr = np.array([1, 2, 3, 4])
print(arr)
print("Shape of the array is: ", arr.shape)

[1 2 3 4]
Shape of the array is:  (4,)


The shape of an array is its size along each dimension, available through the `shape` attribute. The array built from `[1, 2, 3, 4]` might look like a row vector of dimension 1x4, but it is not: its shape is `(4,)`, which makes it a rank-1 array that is neither a row vector `(1, 4)` nor a column vector `(4, 1)`. We rarely deal with this type of array while working on the problems.
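
To make the difference concrete, here is a small sketch comparing a rank-1 array with explicit row and column vectors. The variable names (`rank1`, `row`, `col`) are only for illustration.

```python
import numpy as np

rank1 = np.array([1, 2, 3, 4])   # shape (4,): a rank-1 array
row = rank1.reshape(1, 4)        # shape (1, 4): a row vector
col = rank1.reshape(4, 1)        # shape (4, 1): a column vector

print(rank1.shape, row.shape, col.shape)   # (4,) (1, 4) (4, 1)

# Transposing a rank-1 array changes nothing, unlike a true row vector.
print(rank1.T.shape)   # (4,)
print(row.T.shape)     # (4, 1)

# A row vector times a column vector gives a 1x1 matrix;
# the dot product of two rank-1 arrays gives a plain scalar.
print(np.dot(row, col))       # [[30]]
print(np.dot(rank1, rank1))   # 30
```

Reshaping to `(1, 4)` or `(4, 1)` up front tends to make later matrix operations behave the way the math on paper suggests.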

In [25]:
# This is a 2-D array
arr2d = np.array([[1, 2, 3], [4, 5, 6]])
print("Shape of 2D array:", arr2d.shape)

# A 3-D array. This kind of array is often used in image processing
# Each pixel contains three 8-bit values for Red, Green and Blue
arr3d = np.array([[[0x1a, 0x1b, 0x1c], [0x1d, 0x1e, 0x1f], [0x11, 0x12, 0x13]],
                 [[0x2a, 0x2b, 0x2c], [0x2d, 0x2e, 0x2f], [0x21, 0x22, 0x23]],
                 [[0x3a, 0x3b, 0x3c], [0x2d, 0x2e, 0x2f], [0x31, 0x32, 0x33]]])

print("Shape of 3-D array:", arr3d.shape)

Shape of 2D array: (2, 3)
Shape of 3-D array: (3, 3, 3)


Let's try some built-in functions to create numpy arrays, keeping the [numpy docs](https://docs.scipy.org/doc/numpy-1.10.1/genindex.html) handy for quick reference.

In [54]:
# np.zeros takes three parameters: shape, dtype, order.
# Datatype defaults to float and order defaults to C-style. 

# this creates a 4x3 matrix filled with 0.0
zeros = np.zeros((4, 3))

print("\n====Zeros====")
print(zeros)
print("Type:{}, Shape:{}, value type:{}".format(type(zeros), zeros.shape, zeros.dtype))

print("\n====Ones====")
ones = np.ones(3, dtype=int)  # np.int was removed in NumPy 1.24; use the built-in int
anotherOnes1D = np.ones((3,), dtype=int) # Same as above, with the shape given as a tuple

print(ones, ones.shape, ones.dtype)

print("\n====Identity matrix====")
# np.identity returns an n x n identity matrix.
# Like most np calls, we can specify the dtype here as well.
iden = np.identity(3)
print(iden)

print("\n====Identity matrix using eye====")
# np.eye is a more general version of np.identity:
# it lets us set the number of rows and columns, and (with k)
# which diagonal is filled with 1s.
eye = np.eye(3, 2, dtype=int)
print(eye)

print("\n====Random matrices====")
# takes a variable length argument list for the dimensions
# fills the output array with values from the uniform
# distribution between 0..1
randArray = np.random.rand(4, 3)
print("Uniform distribution")
print(randArray)

# randint takes the lower (inclusive) and upper (exclusive) limits for the
# random ints and the size of the array; with a single limit, as here,
# the values are drawn from 0..9
randIntArray = np.random.randint(10, size=(3, 2))
print("\nRandom Ints")
print(randIntArray)


====Zeros====
[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]
Type:<class 'numpy.ndarray'>, Shape:(4, 3), value type:float64

====Ones====
[1 1 1] (3,) int64

====Identity matrix====
[[1. 0. 0.]
 [0. 1. 0.]
 [0. 0. 1.]]

====Identity matrix using eye====
[[1 0]
 [0 1]
 [0 0]]

====Random matrices====
Uniform distribution
[[0.21216697 0.10413436 0.458502  ]
 [0.42438921 0.59061434 0.24831119]
 [0.15178391 0.50220867 0.83642531]
 [0.62141571 0.77486848 0.66162555]]

Random Ints
[[0 6]
 [4 5]
 [6 1]]
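
The cell above uses `np.eye` and `np.random.randint` with only some of their arguments. As a small sketch of the others, here is `np.eye` with the diagonal offset `k` and `randint` with both limits given. The names `upper` and `bounded` are made up for this example.

```python
import numpy as np

# k=1 puts the ones on the first diagonal above the main one.
upper = np.eye(3, 4, k=1, dtype=int)
print(upper)
# [[0 1 0 0]
#  [0 0 1 0]
#  [0 0 0 1]]

# With both limits given, values fall in [low, high): here 5..9.
bounded = np.random.randint(5, 10, size=(3, 2))
print(bounded.min() >= 5, bounded.max() < 10)   # True True
```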


Now let's try simple operations on the numpy objects.

In [63]:
# transpose a matrix
data = np.random.randint(10, size=(3, 4))

print("Data:\n", data)
print("Data transpose:\n", data.T)

Data:
 [[6 1 4 9]
 [8 9 2 1]
 [8 4 9 7]]
Data transpose:
 [[6 8 8]
 [1 9 4]
 [4 2 9]
 [9 1 7]]


Reshaping is another operation performed frequently in machine learning problems. An example use case is image processing. Consider an image with a resolution of 32x32 px. It is represented digitally as a 32x32x3 array (32x32 pixels, each with three RGB values). It is often reshaped into a single column of shape (32\*32\*3, 1) for easier handling.

In [64]:
print("Data:")
print(data)
print("\nReshaped data:")
print(data.reshape((4, 3)))
print("\nReshaping again:")
print(data.reshape((1,12)))
print("\nReshaping again:")
print(data.reshape(1, 12).reshape(2, 6))

Data:
[[6 1 4 9]
 [8 9 2 1]
 [8 4 9 7]]

Reshaped data:
[[6 1 4]
 [9 8 9]
 [2 1 8]
 [4 9 7]]

Reshaping again:
[[6 1 4 9 8 9 2 1 8 4 9 7]]

Reshaping again:
[[6 1 4 9 8 9]
 [2 1 8 4 9 7]]
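
The image case mentioned earlier can be sketched the same way. This assumes a made-up 32x32 RGB image filled with random 8-bit values, not real image data.

```python
import numpy as np

# Hypothetical 32x32 RGB image with random 8-bit channel values.
image = np.random.randint(0, 256, size=(32, 32, 3), dtype=np.uint8)

# -1 asks NumPy to infer that dimension from the total element count.
flat = image.reshape(-1, 1)
print(flat.shape)   # (3072, 1)

# Reshaping back recovers the original layout exactly.
restored = flat.reshape(32, 32, 3)
print(np.array_equal(image, restored))   # True
```

Reshape never changes the number of elements, so asking for a shape whose total size is not 3072 would raise a ValueError here.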


Let's try out some math operations on numpy arrays, as we will often need to solve equations when building machine learning models. Numpy has equivalent functions for almost all of the math operations that Python natively supports.

Here are some examples:

| Operation | Numpy equivalent |
|------------|------------------|
|c = a + b  | c = np.add(a, b) |
|c = a - b  | c = np.subtract(a, b)|
|c = a * b  | c = np.multiply(a, b) |
|c = sum(nums) | c = np.sum(nums) |
|c = max(nums) | c = np.max(nums) |
|matrix multiply | c = np.dot(a, b)|

The numpy equivalents are highly optimized, vectorized operations that run much faster than the same operation implemented with a for loop.

In [107]:
a = np.random.randint(10, size=(4, 3))
b = np.random.randint(10, size=(4, 3))

print('a:', a)
print('b:', b)

c = a + b # equivalent to calling np.add(a, b)
print("\nAdding a, b")
print(c)

print("\nSubtracting a, b")
c = a - b # equivalent to calling np.subtract(a, b)
print(c)

print("\nMultiply a, b, element wise")
c = a * b # equivalent to calling np.multiply(a, b)
print(c)

print("\nMatrix multiplication of a,b after reshaping")
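# a is of shape (4, 3) and b is also of shape (4, 3).
# We cannot take the dot product if the inner dimensions don't match.
# Reshape to the rescue. Note that reshape(3, 4) re-reads the elements
# in row order, so the result is not the transpose of b.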
c = np.dot(a, b.reshape(3, 4))
print(c)

print("\nExponent")
c = np.exp(a) # raises e to the power of each element in the input
print(c)

a: [[5 5 7]
 [4 4 6]
 [0 6 8]
 [0 4 5]]
b: [[1 6 2]
 [9 6 2]
 [1 1 2]
 [6 3 1]]

Adding a, b
[[ 6 11  9]
 [13 10  8]
 [ 1  7 10]
 [ 6  7  6]]

Subtracting a, b
[[ 4 -1  5]
 [-5 -2  4]
 [-1  5  6]
 [-6  1  4]]

Multiply a, b, element wise
[[ 5 30 14]
 [36 24 12]
 [ 0  6 16]
 [ 0 12  5]]

Matrix multiplication of a,b after reshaping
[[49 82 36 57]
 [40 68 30 46]
 [52 60 30 14]
 [34 38 19  9]]

Exponent
[[1.48413159e+02 1.48413159e+02 1.09663316e+03]
 [5.45981500e+01 5.45981500e+01 4.03428793e+02]
 [1.00000000e+00 4.03428793e+02 2.98095799e+03]
 [1.00000000e+00 5.45981500e+01 1.48413159e+02]]
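
The matrix multiplication above reshapes `b` so that the inner dimensions line up. A reshape is not a transpose, though, and the sketch below shows the difference using the same values of `a` and `b` printed in the output. The `@` operator is used here as an alternative spelling of `np.dot` for 2-D arrays.

```python
import numpy as np

a = np.array([[5, 5, 7],
              [4, 4, 6],
              [0, 6, 8],
              [0, 4, 5]])
b = np.array([[1, 6, 2],
              [9, 6, 2],
              [1, 1, 2],
              [6, 3, 1]])

# reshape keeps the elements in row order and only changes where rows break.
print(b.reshape(3, 4))
# [[1 6 2 9]
#  [6 2 1 1]
#  [2 6 3 1]]

# The transpose turns columns into rows.
print(b.T)
# [[1 9 1 6]
#  [6 6 1 3]
#  [2 2 2 1]]

# For 2-D arrays, a @ b.T gives the same result as np.dot(a, b.T).
product = a @ b.T
print(product.shape)   # (4, 4)
```

Which of the two is right depends on what the rows and columns mean in the problem; when the math calls for a transpose, `b.T` is the one to use.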


When the dimensions (shape) of the input arrays don't match, NumPy automatically broadcasts the data where needed and applicable. Broadcasting and reshaping are widely used. Let's see some examples of broadcasting:

In [108]:
a = np.random.randint(10, size = (4, 3))
b = np.random.randint(10, size = (4, 1))

print("a:\n", a)
print("b:\n", b)
# a is of shape (4, 3) and b is of shape (4, 1). Adding them
# together makes NumPy automatically broadcast the data
# in b to match the dimensions of a and then perform the addition.
print("a+b:\n", a + b)

a:
 [[6 7 1]
 [7 2 7]
 [5 1 0]
 [4 1 9]]
b:
 [[0]
 [6]
 [1]
 [8]]
a+b:
 [[ 6  7  1]
 [13  8 13]
 [ 6  2  1]
 [12  9 17]]
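
Broadcasting also works along the other axis. The sketch below uses small made-up arrays (`row` and `col` are illustrative names) to show a row being stretched across rows, and `np.broadcast_to` to display what the stretched column looks like.

```python
import numpy as np

a = np.arange(12).reshape(4, 3)         # shape (4, 3)
row = np.array([10, 20, 30])            # shape (3,)
col = np.array([[1], [2], [3], [4]])    # shape (4, 1)

# (4, 3) with (3,): the row is treated as (1, 3) and reused for all 4 rows.
print((a + row).shape)     # (4, 3)

# (4, 1) with (3,): both operands are stretched to (4, 3).
print((col + row).shape)   # (4, 3)

# np.broadcast_to shows the stretched view of col explicitly.
print(np.broadcast_to(col, (4, 3)))
# [[1 1 1]
#  [2 2 2]
#  [3 3 3]
#  [4 4 4]]
```

Shapes are compared from the last dimension backwards, and two dimensions are compatible when they are equal or one of them is 1.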


If the operands cannot be broadcast, NumPy raises a ValueError.

In [109]:
c = b.reshape(2, 2)
print("a+c:\n", a + c)

ValueError: operands could not be broadcast together with shapes (4,3) (2,2) 
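
If the shapes come from data that is not known in advance, the error can be caught like any other Python exception. This is only a sketch with placeholder arrays of ones; the `failed` flag is there just to show that the exception path ran.

```python
import numpy as np

a = np.ones((4, 3))
c = np.ones((2, 2))

failed = False
try:
    a + c   # shapes (4, 3) and (2, 2) are not compatible
except ValueError as err:
    failed = True
    print("Broadcast failed:", err)

print(failed)   # True
```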

When dealing with large data sets, vectorized processing of the data makes a huge difference in the computation time. NumPy's vectorized functions run their loops in compiled code, which can take advantage of the CPU's SIMD instructions, instead of executing Python code once per element (NumPy itself runs on the CPU; GPU execution needs a separate library). Here is a small snippet to show the difference between a for-loop based computation and a vectorized implementation of the same computation:

In [102]:
import time

a = np.random.rand(1000000) # np.array of 1M entries
b = np.random.rand(1000000) 
c = np.empty(1000000) # to store the sum

print("Adding a and b using for-loop")
start = time.time()

for i in range(len(a)):
    c[i] = a[i] + b[i]
    
end = time.time()

print("Time taken:{}ms".format((end - start)*1000))

print("\nAdding a and b using vectorized function")
start = time.time()
c = np.add(a, b)
end = time.time()
print("Time taken:{}ms".format((end - start)*1000))


print("\nSumming 'a' using for loop")
start = time.time()
sumA = sum(a)  # the built-in sum() walks the array element by element in Python
end = time.time()
print("Calculated sum = {}, time taken = {}ms".format(sumA, (end - start)*1000))

print("\nSumming 'a' using vectorized function")
start = time.time()
sumA = np.sum(a)
end = time.time()
print("Calculated sum = {}, time taken = {}ms".format(sumA, (end - start)*1000))

Adding a and b using for-loop
Time taken:457.3659896850586ms

Adding a and b using vectorized function
Time taken:4.012107849121094ms

Summing 'a' using for loop
Calculated sum = 499596.61660891515, time taken = 91.26925468444824ms

Summing 'a' using vectorized function
Calculated sum = 499596.6166089125, time taken = 0.827789306640625ms


As the output above shows, the vectorized versions are far faster than the for-loop iterations: roughly 100 times faster in both cases on this run.
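
A single `time.time()` measurement can be noisy. As a hedged alternative, the sketch below repeats each measurement with the standard `timeit` module and keeps the best run. The array size is reduced to 100,000 to keep it quick, and the helper names `loop_add` and `vector_add` are made up for this example.

```python
import timeit
import numpy as np

a = np.random.rand(100_000)
b = np.random.rand(100_000)

def loop_add():
    # Element-by-element addition in a Python loop.
    c = np.empty(len(a))
    for i in range(len(a)):
        c[i] = a[i] + b[i]
    return c

def vector_add():
    # The same addition done by NumPy in one call.
    return np.add(a, b)

# Run each version 3 times and keep the fastest run.
loop_time = min(timeit.repeat(loop_add, number=1, repeat=3))
vector_time = min(timeit.repeat(vector_add, number=1, repeat=3))

print("loop: {:.3f} ms".format(loop_time * 1000))
print("vectorized: {:.3f} ms".format(vector_time * 1000))
```

The exact numbers depend on the machine and the NumPy build, so they will not match the timings printed above.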

References:
* [Numpy Docs](https://docs.scipy.org/doc/numpy-1.10.1/genindex.html)