#### Create a NumPy Array

The learning objectives of this section are:

* Understand the advantages of vectorised NumPy code over standard Python approaches
* Create NumPy arrays
* Convert lists and tuples to NumPy arrays
* Create (initialise) arrays
* Compare computation times in NumPy and standard Python lists

#### NumPy Basics

NumPy is a library written for scientific computing and data analysis. Its name stands for Numerical Python.

The most basic object in NumPy is the ndarray, or simply an array, which is an n-dimensional, homogeneous array. By homogeneous, we mean that all the elements in a NumPy array have to be of the same data type, which is commonly numeric (float or integer).
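To make the homogeneity rule concrete, here is a small sketch of what happens when the input mixes types. The variable names are only illustrative; the point is that NumPy picks one common data type for all elements.

```python
import numpy as np

# Mixed ints and floats are upcast to a single float type
mixed_numbers = np.array([1, 2.5, 3])
print(mixed_numbers.dtype)    # float64

# Mixing numbers and strings stores every element as a string
mixed_text = np.array([1, "two", 3])
print(mixed_text.dtype.kind)  # U (unicode string)
```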

#### Create an array From an Iterable

Such as

* list
* tuple
* range iterator

##### Notice that not every iterable is converted element by element: passing a set or dict to np.array() produces a zero-dimensional object array that wraps the whole container
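The following sketch shows what happens with a set and one possible workaround, which is to convert the container to a list first. The names `values` and `scores` are made up for this illustration.

```python
import numpy as np

values = {3, 1, 2}

# The set is not unpacked: the result is a 0-d array holding the set object
wrapped = np.array(values)
print(wrapped.shape, wrapped.dtype)   # () object

# Converting to a (sorted) list first gives a regular 1-D numeric array
converted = np.array(sorted(values))
print(converted)                      # [1 2 3]

# For a dict, convert the keys or values explicitly
scores = {"a": 10, "b": 20}
score_values = np.array(list(scores.values()))
print(score_values)                   # [10 20]
```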

In [1]:
#np is simply an alias, you may use any other alias, though np is quite standard
import numpy as np

#### Create a 1D Array

In [2]:
# Creating a 1-D array using a list
arr = np.array([1,2,3,4,5])
print(arr)

[1 2 3 4 5]


In [3]:
print(type(arr))

<class 'numpy.ndarray'>


In [5]:
# Creating a 1-D array using a tuple
arr = np.array((1, 2, 3, 4, 5, 6))
print(arr)

[1 2 3 4 5 6]


In [7]:
# Creating a 1-D array using a range
arr = np.array(range(10))
print(arr)

[0 1 2 3 4 5 6 7 8 9]


#### Create a 2D Array with Specified Data Type

In [8]:
arr = np.array([[1,2,3], [4,5,6]], dtype='int')
print(arr)
print('Data Type:',arr.dtype)

[[1 2 3]
 [4 5 6]]
Data Type: int32
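The width that dtype='int' resolves to can depend on the platform and NumPy version, so the int32 shown above may appear as int64 on another machine. If the exact width matters, one option is to name it explicitly, as in this sketch:

```python
import numpy as np

# Explicit dtypes give the same result on every platform
arr_int64 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.int64)
arr_float32 = np.array([[1, 2, 3], [4, 5, 6]], dtype=np.float32)

print(arr_int64.dtype)    # int64
print(arr_float32.dtype)  # float32
```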


#### Create a 3D Array

In [12]:
arr = np.array([[[10, 20, 30], [40, 50, 60], [70, 80, 90]]])
print(arr)

[[[10 20 30]
  [40 50 60]
  [70 80 90]]]
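The number of nested bracket levels determines the number of dimensions. A quick way to check this is to inspect the ndim, shape and size attributes of the same array:

```python
import numpy as np

arr = np.array([[[10, 20, 30], [40, 50, 60], [70, 80, 90]]])

print(arr.ndim)   # 3
print(arr.shape)  # (1, 3, 3)
print(arr.size)   # 9
```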


#### Create an array within specified range 

np.arange() can be used in place of np.array(range()). As with range(), the stop value is exclusive.

In [14]:
# np.arange(start, stop, step)
arr = np.arange(0, 20, 2)
print(arr)

[ 0  2  4  6  8 10 12 14 16 18]
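As a short check of the equivalence with range(), the sketch below compares both approaches and also shows a negative step. The variable names are only for illustration.

```python
import numpy as np

from_range = np.array(range(10))
from_arange = np.arange(10)
print(np.array_equal(from_range, from_arange))  # True

# The stop value is exclusive, so 20 itself is not included
evens = np.arange(0, 20, 2)
print(evens[-1])                                # 18

# A negative step counts downwards
descending = np.arange(10, 0, -2)
print(descending)                               # [10  8  6  4  2]
```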


The other common way is to initialise arrays with a built-in function. You do this when you know the size of the array beforehand.

* np.linspace(): Create an array of fixed length with evenly spaced values
* np.random.rand(): Create an array of random values in the range [0,1)
* np.ones(): Create an array of 1s
* np.zeros(): Create an array of 0s
* np.random.random(): Create an array of random numbers in the range [0,1)
* np.arange(): Create an array with increments of a fixed step size

#### Create an array of evenly spaced numbers within specified range

np.linspace(start, stop, num, endpoint=True, retstep=False) has 5 commonly used parameters:

* start: start number (inclusive)
* stop: end number (inclusive unless endpoint is set to False)
* num: number of elements contained in the array (defaults to 50)
* endpoint: boolean value representing whether the stop number is inclusive or not
* retstep: boolean value representing whether to also return the step size

In [19]:
arr, step_size = np.linspace(0, 6, 8, endpoint=False, retstep=True)
print(arr)
print('The step size is ' + str(step_size))

[0.   0.75 1.5  2.25 3.   3.75 4.5  5.25]
The step size is 0.75
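For comparison, here is the same kind of call with the default endpoint=True, so the stop value is included. With 7 points between 0 and 6, the step works out to exactly 1.

```python
import numpy as np

# endpoint defaults to True, so 6 is the last element
arr, step_size = np.linspace(0, 6, 7, retstep=True)
print(arr)        # [0. 1. 2. 3. 4. 5. 6.]
print(step_size)  # 1.0
```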


#### Create an array of random values of given shape

np.random.rand() method returns values in the range [0,1)

In [20]:
np.random.rand()

0.7980063209768992

In [21]:
np.random.rand(5)

array([0.97379714, 0.4650916 , 0.10670162, 0.53197029, 0.84667025])

In [22]:
arr = np.random.rand(3, 3)
print(arr)

[[0.59668903 0.2637436  0.69507712]
 [0.80391812 0.21015934 0.8240712 ]
 [0.09613389 0.84961287 0.55233953]]


In [23]:
np.random.rand(15,2)

array([[0.47432819, 0.03493833],
       [0.51708064, 0.78153418],
       [0.53081357, 0.18517047],
       [0.70510517, 0.56792691],
       [0.26743641, 0.59048675],
       [0.57872347, 0.71969992],
       [0.89673748, 0.14291323],
       [0.30273786, 0.44937332],
       [0.84412172, 0.09735793],
       [0.9832366 , 0.07934179],
       [0.23149624, 0.12158291],
       [0.57001669, 0.29335752],
       [0.68882759, 0.64391716],
       [0.07024948, 0.95609832],
       [0.05026582, 0.49225277]])

In [28]:
# Create a 6 x 6 random array of integers ranging from 0 to 99
np.random.randint(0, 100, (6,6))

array([[ 6, 37, 14, 58, 29,  6],
       [95,  8, 45, 80, 24, 19],
       [27, 61, 20, 77, 63, 26],
       [24, 21, 95, 10, 82, 23],
       [ 2, 37, 76, 95, 42, 35],
       [99, 61, 90, 29,  5, 47]])
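Because the values are random, the outputs above will differ on every run. If you need repeatable results, one option is to fix the seed first, as in this sketch. It also illustrates that the upper bound of np.random.randint(low, high, size) is exclusive; the seed value 42 is arbitrary.

```python
import numpy as np

# Same seed, same sequence of random numbers
np.random.seed(42)
first = np.random.rand(3)
np.random.seed(42)
second = np.random.rand(3)
print(np.array_equal(first, second))        # True

# A 4 x 4 array of integers from 0 to 9 (10 is excluded)
ints = np.random.randint(0, 10, (4, 4))
print(ints.shape)                           # (4, 4)
print(ints.min() >= 0 and ints.max() < 10)  # True
```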

#### Create an array of zeros of given shape

* np.zeros(): create array of all zeros in given shape
* np.zeros_like(): create array of all zeros with the same shape and data type as the given input array

In [31]:
zeros = np.zeros((3,6))
print(zeros)

[[0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0.]]


#### np.zeros_like()

In [32]:
arr = np.array([[1,2], [3,4],[5,6]])
arr

array([[1, 2],
       [3, 4],
       [5, 6]])

In [33]:
zeros = np.zeros_like(arr)
print(zeros)
print('Data Type:',zeros.dtype)

[[0 0]
 [0 0]
 [0 0]]
Data Type: int32
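Both functions also accept a dtype argument if the default is not what you want. A brief sketch, with illustrative variable names:

```python
import numpy as np

# np.zeros defaults to float; dtype=int requests integers instead
int_zeros = np.zeros((2, 3), dtype=int)
print(int_zeros.dtype.kind)  # i (signed integer)

# zeros_like keeps the shape of arr but the dtype can be overridden
arr = np.array([[1, 2], [3, 4], [5, 6]])
float_zeros = np.zeros_like(arr, dtype=float)
print(float_zeros.shape)     # (3, 2)
print(float_zeros.dtype)     # float64
```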


#### Create an array of ones of given shape

* np.ones(): create array of all ones in given shape
* np.ones_like(): create array of all ones with the same shape and data type as the given input array

In [34]:
ones = np.ones((3,2))
print(ones)

[[1. 1.]
 [1. 1.]
 [1. 1.]]


In [35]:
arr = [[1,2,3], [4,5,6]]
ones = np.ones_like(arr)
print(ones)
print('Data Type: ' + str(ones.dtype))

[[1 1 1]
 [1 1 1]]
Data Type: int32


#### Create an empty array of given shape

* np.empty(): create array of uninitialised values in given shape
* np.empty_like(): create array of uninitialised values with the same shape and data type as the given input array

###### Notice that the initial values are not necessarily set to zeroes

They are whatever values happen to be in the allocated memory, so assign every element before reading from the array.

In [37]:
empty = np.empty((5,5))
print(empty)

[[6.23042070e-307 4.67296746e-307 1.69121096e-306 7.56602523e-307
  1.89146896e-307]
 [7.56571288e-307 3.11525958e-307 1.24610723e-306 1.37962320e-306
  1.29060871e-306]
 [2.22518251e-306 1.33511969e-306 1.78022342e-306 1.05700345e-307
  3.11525958e-307]
 [1.69118108e-306 8.06632139e-308 1.20160711e-306 1.69119330e-306
  1.29062229e-306]
 [6.89804133e-307 1.11261162e-306 8.34443015e-308 1.21455192e+224
  2.60096946e-306]]


In [38]:
arr = np.array([[1,2,3], [4,5,6]], dtype=np.int64)
empty = np.empty_like(arr)
print(empty)
print('Data Type: ' + str(empty.dtype))

[[4607182418800017408 4607182418800017408 4607182418800017408]
 [4607182418800017408 4607182418800017408 4607182418800017408]]
Data Type: int64
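A typical use of np.empty() is to allocate an array that you are about to fill completely. The sketch below uses an explicit loop only to make the filling step visible; the name `squares` is illustrative.

```python
import numpy as np

# Allocate without initialising, then overwrite every element
squares = np.empty(5, dtype=np.int64)
for i in range(squares.size):
    squares[i] = i ** 2

print(squares)  # [ 0  1  4  9 16]
```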


#### Create an array of constant values of given shape

* np.full(): create array of constant values in given shape
* np.full_like(): create array of constant values with the same shape and data type as the given input array

In [39]:
full = np.full((4,4), 5)
print(full)

[[5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]
 [5 5 5 5]]


In [40]:
arr = np.array([[1,2], [3,4]], dtype=np.float64)
full = np.full_like(arr, 5)
print(full)
print('Data Type: ' + str(full.dtype))

[[5. 5.]
 [5. 5.]]
Data Type: float64
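One detail worth knowing is where the data type comes from. np.full() infers it from the fill value, while np.full_like() takes it from the input array, so a fractional fill value is truncated when the input array holds integers. A small sketch:

```python
import numpy as np

# dtype is inferred from the fill value
int_full = np.full((2, 2), 5)
float_full = np.full((2, 2), 5.0)
print(int_full.dtype.kind)  # i (signed integer)
print(float_full.dtype)     # float64

# full_like keeps the integer dtype of the template, so 2.7 becomes 2
int_template = np.array([[1, 2], [3, 4]])
truncated = np.full_like(int_template, 2.7)
print(truncated)            # every element is 2
```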


#### Create an array in a repetitive manner

* np.repeat(a, repeats, axis=None): repeat each element n times
    * a: input array
    * repeats: number of repetitions
    * axis: which axis to repeat along; the default is None, which flattens the input array and then repeats
* np.tile(A, reps): repeat the whole array n times
    * A: input array
    * reps: number of repetitions; it can be a tuple to give the repetitions along each axis

In [41]:
# No axis specified, then flatten the input array first and repeat
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3)) 

[0 0 0 1 1 1 2 2 2 3 3 3 4 4 4 5 5 5]


In [42]:
# An example of repeating along axis 0 (each row is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=0)) 

[[0 1 2]
 [0 1 2]
 [0 1 2]
 [3 4 5]
 [3 4 5]
 [3 4 5]]


In [43]:
# An example of repeating along axis 1 (each element within a row is repeated)
arr = [[0, 1, 2], [3, 4, 5]]
print(np.repeat(arr, 3, axis=1)) 

[[0 0 0 1 1 1 2 2 2]
 [3 3 3 4 4 4 5 5 5]]


In [44]:
# Repeat the whole array by a specified number of times
arr = [0, 1, 2]
print(np.tile(arr, 3))

[0 1 2 0 1 2 0 1 2]
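When reps is a tuple, np.tile() repeats the array along more than one axis. In this sketch, (2, 2) means two copies stacked vertically and two copies side by side:

```python
import numpy as np

arr = [0, 1, 2]
tiled = np.tile(arr, (2, 2))

print(tiled)
print(tiled.shape)  # (2, 6)
```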


#### Create an identity matrix of given size

* np.eye(size, k=0): create an identity matrix of given size
    * size: the size of the identity matrix
    * k: the diagonal offset (positive for above the main diagonal, negative for below)
* np.identity(size): same as np.eye(size), but without the diagonal offset parameter

In [45]:
identity_matrix = np.eye(5)
print(identity_matrix)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]


In [46]:
# An example of diagonal offset
identity_matrix = np.eye(5, k=-1)
print(identity_matrix)

[[0. 0. 0. 0. 0.]
 [1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]]


In [47]:
identity_matrix = np.identity(5)
print(identity_matrix)

[[1. 0. 0. 0. 0.]
 [0. 1. 0. 0. 0.]
 [0. 0. 1. 0. 0.]
 [0. 0. 0. 1. 0.]
 [0. 0. 0. 0. 1.]]
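The sketch below confirms that the two functions agree for a plain identity matrix, and shows that np.eye() can additionally take a column count and a positive offset:

```python
import numpy as np

# Both calls produce the same 5 x 5 identity matrix
print(np.array_equal(np.eye(5), np.identity(5)))  # True

# 3 rows, 4 columns, ones placed one step above the main diagonal
upper = np.eye(3, 4, k=1)
print(upper)
```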


#### Create an array with given values on the diagonal

In [48]:
arr = np.random.rand(5,5)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))

[[5.16822629e-01 8.95974314e-01 7.21202651e-01 2.75529747e-01
  7.82153719e-02]
 [4.81838168e-02 2.17003054e-01 7.01686260e-01 4.76870005e-01
  3.18563656e-01]
 [7.10796062e-01 3.77714762e-01 4.05257463e-01 4.19794811e-01
  2.66053776e-02]
 [7.23013204e-03 6.14656050e-01 4.60208701e-01 7.03166672e-01
  9.96073960e-01]
 [4.99330543e-01 6.51195336e-02 3.62483456e-04 3.37277644e-01
  7.63109066e-02]]
Values on the diagonal: [0.51682263 0.21700305 0.40525746 0.70316667 0.07631091]


In [49]:
# The input does not have to be a square matrix
arr = np.random.rand(10,3)
print(arr)
# Extract values on the diagonal
print('Values on the diagonal: ' + str(np.diag(arr)))

[[0.99465956 0.04359471 0.00763805]
 [0.11128546 0.74634723 0.44942451]
 [0.37875401 0.86063987 0.17669601]
 [0.81718445 0.30752509 0.63574323]
 [0.86987206 0.29897514 0.28731267]
 [0.93122497 0.53259548 0.02886934]
 [0.75645431 0.79775767 0.05501871]
 [0.31005841 0.00733608 0.70875873]
 [0.95168744 0.60619972 0.86505849]
 [0.67113836 0.32897051 0.91949991]]
Values on the diagonal: [0.99465956 0.74634723 0.17669601]


In [50]:
# Create a matrix from the given diagonal values
# All off-diagonal values are set to zero
arr = np.diag([1,2,3,4,5])
print(arr)

[[1 0 0 0 0]
 [0 2 0 0 0]
 [0 0 3 0 0]
 [0 0 0 4 0]
 [0 0 0 0 5]]
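np.diag() also accepts an offset k, both when building a matrix and when extracting values. A short sketch with illustrative names:

```python
import numpy as np

# Place the values one step above the main diagonal; the result is 4 x 4
shifted = np.diag([1, 2, 3], k=1)
print(shifted.shape)         # (4, 4)

# Extract the diagonal one step above the main diagonal
matrix = np.arange(9).reshape(3, 3)
print(np.diag(matrix, k=1))  # [1 5]
```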


#### Advantages of NumPy

What is the advantage of arrays over lists, specifically for data analysis? Put crudely, it is convenience and speed:

1. You can write vectorised code on NumPy arrays, which is not possible with lists. Vectorised code is concise and convenient to read and write.
2. NumPy is much faster than standard Python approaches to numerical computation.

Vectorised code typically does not contain explicit looping, indexing, etc. (all of this happens behind the scenes, in precompiled C code), and thus it is much more concise.

Let's see an example of convenience first; we'll see one for speed later.

Say you have two lists of numbers and want to calculate the element-wise product. The standard Python list way would need you to map a lambda function (or worse, write a for loop), whereas with NumPy, you simply multiply the arrays.

In [51]:
list_1 = [3, 6, 7, 5]
list_2 = [4, 5, 1, 7]

# the list way to do it: map a function to the two lists
product_list = list(map(lambda x, y: x*y, list_1, list_2))
print(product_list)

[12, 30, 7, 35]


#### Using NumPy arrays

In [52]:
# The numpy array way to do it: simply multiply the two arrays
array_1 = np.array(list_1)
array_2 = np.array(list_2)

array_3 = array_1*array_2
print(array_3)
print(type(array_3))

[12 30  7 35]
<class 'numpy.ndarray'>


As you can see, the NumPy way is clearly more concise.

Even simple mathematical operations on lists require explicit loops or comprehensions, unlike with arrays. For example, to calculate the square of every number in a list:

In [53]:
# Square a list
list_squared = [i**2 for i in list_1]

# Square a numpy array
array_squared = array_1**2

print(list_squared)
print(array_squared)

[9, 36, 49, 25]
[ 9 36 49 25]
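The same idea extends to other element-wise operations. The sketch below reuses the values of list_1 and list_2 from above and shows scalar addition, a comparison, and a sum over the element-wise product, none of which need an explicit loop:

```python
import numpy as np

array_1 = np.array([3, 6, 7, 5])
array_2 = np.array([4, 5, 1, 7])

# Add a scalar to every element
print(array_1 + 10)               # [13 16 17 15]

# Compare every element against a threshold
print(array_1 > 5)                # [False  True  True False]

# Sum of the element-wise product (the dot product)
print((array_1 * array_2).sum())  # 84
```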


#### Compare Computation Times in NumPy and Standard Python Lists

We mentioned that the key advantages of NumPy are convenience and speed of computation.

You'll often work with extremely large datasets, so it is important to understand how much computation time (and memory) you can save using NumPy, compared to standard Python lists.

Let's compare the computation times of arrays and lists for the simple task of calculating the element-wise product of numbers.

In [56]:
list_1 = [i for i in range(10000000)]
list_2 = [j**2 for j in range(10000000)]

import time
# store start time, time after computation, and take the difference
t0 = time.time()
product_list = list(map(lambda x, y: x*y, list_1, list_2))
t1 = time.time()
list_time = t1 - t0 
print("Time Taken:",t1-t0)

Time Taken: 1.2387065887451172


#### Using NumPy arrays

In [57]:
array_1 = np.array(list_1)
array_2 = np.array(list_2)

t0 = time.time()
array_3 = array_1*array_2
t1 = time.time()
numpy_time = t1 - t0

print("Time Taken:",t1-t0)

Time Taken: 0.05288529396057129


In [58]:
print("The ratio of time taken is {}".format(list_time/numpy_time))

The ratio of time taken is 23.422514956022308


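In this run, NumPy is roughly 23 times faster than lists, i.e. more than an order of magnitude. The exact ratio depends on your machine and the operation. This is with arrays of sizes in the millions, but you may work on much larger arrays, with sizes in the order of billions, where the absolute time saved grows accordingly.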
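A single time.time() measurement can be noisy. As an alternative sketch, the standard library's timeit module repeats the measurement several times. The size n is deliberately smaller here so that the example runs quickly and the products stay within the int64 range; the function names are made up for this illustration, and the timings you see will differ from machine to machine.

```python
import timeit
import numpy as np

n = 100_000
list_1 = list(range(n))
list_2 = [j ** 2 for j in range(n)]
array_1 = np.array(list_1, dtype=np.int64)
array_2 = np.array(list_2, dtype=np.int64)

def multiply_lists():
    # Element-wise product using plain Python
    return [x * y for x, y in zip(list_1, list_2)]

def multiply_arrays():
    # Element-wise product using NumPy
    return array_1 * array_2

# Run each function 10 times per measurement and keep the best of 3
list_time = min(timeit.repeat(multiply_lists, number=10, repeat=3))
numpy_time = min(timeit.repeat(multiply_arrays, number=10, repeat=3))

print("List time: ", list_time)
print("NumPy time:", numpy_time)
```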

Some reasons for such difference in speed are:

* NumPy's core is written in C, so the element-wise loops run as compiled code behind the scenes
* NumPy arrays are more compact than lists, i.e. they take much less storage space than lists
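To get a rough feel for the memory difference, you can compare the bytes used by a list and by an array holding the same integers. This is only an approximate sketch: sys.getsizeof() on a list counts the list object itself, so the size of each integer object is added separately.

```python
import sys
import numpy as np

n = 1000
py_list = list(range(n))
arr = np.arange(n, dtype=np.int64)

# The list stores references; each int is a separate Python object
list_bytes = sys.getsizeof(py_list) + sum(sys.getsizeof(x) for x in py_list)

print("List bytes: ", list_bytes)
print("Array bytes:", arr.nbytes)  # 8000 (1000 elements x 8 bytes)
```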