# 4  NumPy Basics: Arrays and Vectorized Computation

In [1]:
import numpy as np

In [2]:
my_arr = np.arange(1_000_000)

In [4]:
my_list = list(range(1_000_000))

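NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory. In the timings below, doubling one million integers takes about 0.75 ms with NumPy versus about 25 ms with a list comprehension, roughly a 30-fold difference.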

In [5]:
%timeit my_arr2 = my_arr * 2

751 μs ± 12 μs per loop (mean ± std. dev. of 7 runs, 1,000 loops each)


In [6]:
%timeit my_list2 = [x*2 for x in my_list]

25.3 ms ± 1.29 ms per loop (mean ± std. dev. of 7 runs, 10 loops each)
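
Outside IPython, where the %timeit magic is not available, a similar comparison can be sketched with the standard-library timeit module. The array size and repetition count below are arbitrary choices for illustration, and absolute timings will differ from machine to machine.

```python
import timeit

import numpy as np

n = 100_000  # smaller than the example above so the sketch runs quickly
my_arr = np.arange(n)
my_list = list(range(n))

# timeit.timeit returns the total number of seconds for `number` calls
numpy_time = timeit.timeit(lambda: my_arr * 2, number=20)
list_time = timeit.timeit(lambda: [x * 2 for x in my_list], number=20)

print(f"NumPy: {numpy_time:.4f} s, list comprehension: {list_time:.4f} s")

# Both approaches compute the same values
same_values = (my_arr * 2).tolist() == [x * 2 for x in my_list]
```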


# 4.1 The NumPy ndarray: A Multidimensional Array Object

In [7]:
data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])

In [8]:
data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

In [9]:
data * 10

array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

In [10]:
data + data

array([[ 3. , -0.2,  6. ],
       [ 0. , -6. , 13. ]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each "cell" in the array have been added to each other.

In [11]:
data.shape

(2, 3)

In [12]:
data.dtype

dtype('float64')

In [13]:
data1 = [6, 7.5, 8, 0, 1]

In [14]:
arr1 = np.array(data1)

In [16]:
arr1

array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [20]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]

In [21]:
arr2 = np.array(data2)

In [22]:
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since data2 was a list of lists, the NumPy array arr2 has two dimensions, with shape inferred from the data. We can confirm this by inspecting the ndim and shape attributes:

In [23]:
arr2.ndim

2

In [24]:
arr2.shape

(2, 4)

Unless explicitly specified (discussed in Data Types for ndarrays), numpy.array tries to infer a good data type for the array that it creates. The data type is stored in a special dtype metadata object; for example, in the previous two examples we have:

In [26]:
arr1.dtype

dtype('float64')

In [27]:
arr2.dtype

dtype('int64')
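
As a small sketch of how this inference behaves, the following compares an all-integer input with inputs that contain at least one float. The variable names are made up for illustration. The default integer type depends on the platform and NumPy version, so the check only tests that it is some integer dtype.

```python
import numpy as np

ints = np.array([1, 2, 3])             # all integers -> an integer dtype
mixed = np.array([1, 2.5, 3])          # one float promotes every element to float64
nested = np.array([[1, 2], [3.0, 4]])  # the same rule applies to nested lists

print(ints.dtype, mixed.dtype, nested.dtype)

# The default integer dtype is platform dependent, so test the kind, not the width
ints_are_integer = np.issubdtype(ints.dtype, np.integer)
```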

In addition to numpy.array, there are a number of other functions for creating new arrays. As examples, numpy.zeros and numpy.ones create arrays of 0s or 1s, respectively, with a given length or shape. numpy.empty creates an array without initializing its values to any particular value. To create a higher dimensional array with these functions, pass a tuple for the shape:

In [28]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [30]:
np.zeros((3,6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

Caution
It’s not safe to assume that numpy.empty will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero "garbage" values. You should use this function only if you intend to populate the new array with data.

In [33]:
np.empty((2,3,2))

array([[[0.00000000e+000, 2.47032823e-322],
        [0.00000000e+000, 0.00000000e+000],
        [1.29441743e-312, 4.47032019e-038]],

       [[4.46553417e-090, 4.91127600e-062],
        [1.52411040e-052, 1.47743128e-075],
        [3.99910963e+252, 1.46030983e-319]]])
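
A minimal sketch of the intended usage pattern: allocate with numpy.empty, then overwrite every element before reading any of them. The names buf, zeros, and sevens are illustrative only. When known starting values are needed, numpy.zeros or numpy.full states that intent directly.

```python
import numpy as np

buf = np.empty((2, 3))  # contents are arbitrary until assigned
buf.fill(7.0)           # populate every element before reading

# If defined starting values are required, request them explicitly
zeros = np.zeros((2, 3))
sevens = np.full((2, 3), 7.0)

print(buf)
```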

In [34]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

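Since NumPy is focused on numerical computing, the data type, if not specified, will in many cases be float64 (floating point). For example, numpy.zeros, numpy.ones, and numpy.empty default to float64, whereas numpy.arange called with integer arguments returns an integer array, as shown above.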

## Data Types for ndarrays
The data type or dtype is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data.
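
To make this concrete, here is a short sketch that builds two arrays from the same values with different explicit dtypes and inspects how much memory each element occupies. The names arr_f and arr_i are illustrative.

```python
import numpy as np

arr_f = np.array([1, 2, 3], dtype=np.float64)
arr_i = np.array([1, 2, 3], dtype=np.int32)

# Same logical values, different in-memory representations
print(arr_f.dtype, arr_f.itemsize, arr_f.nbytes)  # float64: 8 bytes per element, 24 in total
print(arr_i.dtype, arr_i.itemsize, arr_i.nbytes)  # int32: 4 bytes per element, 12 in total
```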

You can explicitly convert or cast an array from one data type to another using ndarray’s astype method:

In [36]:
arr = np.array([1, 2, 3, 4, 5])

In [37]:
arr.dtype

dtype('int64')

In [39]:
float_arr = arr.astype(np.float64)

In [40]:
float_arr

array([1., 2., 3., 4., 5.])

In [41]:
float_arr.dtype

dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:

In [43]:
arr = np.array([3.7, -1.2, -2.6, .05, 12.9, 10.1])

In [44]:
arr

array([ 3.7 , -1.2 , -2.6 ,  0.05, 12.9 , 10.1 ])

In [45]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)
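
Truncation drops the fractional part, which moves negative values toward zero rather than down. If rounding is wanted instead, one option is to round explicitly before casting. The sketch below contrasts the three behaviors on the same input; the result names are illustrative.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.05, 12.9, 10.1])

truncated = arr.astype(np.int32)          # drop the fractional part (toward zero)
rounded = np.rint(arr).astype(np.int32)   # round to the nearest integer first
floored = np.floor(arr).astype(np.int32)  # round toward negative infinity first

print(truncated.tolist())  # [3, -1, -2, 0, 12, 10]
print(rounded.tolist())    # [4, -1, -3, 0, 13, 10]
print(floored.tolist())    # [3, -2, -3, 0, 12, 10]
```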

If you have an array of strings representing numbers, you can use astype to convert them to numeric form:

In [47]:
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.bytes_)

In [48]:
numeric_strings.astype(float)

array([ 1.25, -9.6 , 42.  ])

If casting fails for some reason (like a string that cannot be converted to float64), a ValueError is raised. In the example above, I was a bit lazy and wrote float instead of np.float64; NumPy aliases the Python types to its own equivalent data types.
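
A small sketch of the failure case, using a made-up array that contains one string that is not a number. It also checks that the built-in float and np.float64 resolve to the same dtype.

```python
import numpy as np

# Hypothetical input: the middle entry cannot be parsed as a number
bad_strings = np.array(["1.25", "not-a-number", "42"])

try:
    bad_strings.astype(np.float64)
except ValueError as exc:
    error_message = str(exc)
    print("Conversion failed:", error_message)
else:
    error_message = None

# The Python type float is treated as an alias for float64
same_dtype = np.dtype(float) == np.dtype(np.float64)
```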

You can also use another array’s dtype attribute:

In [49]:
int_array = np.arange(10)

In [50]:
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)

In [51]:
int_array.astype(calibers.dtype)

array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

You can also refer to a dtype with a shorthand type code string:

In [53]:
zeros_uint32 = np.zeros(8, dtype="u4")

In [54]:
zeros_uint32

array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

Calling astype always creates a new array (a copy of the data), even if the new data type is the same as the old data type.
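
The copying behavior can be checked with numpy.shares_memory, as in the sketch below. It also shows astype's copy=False argument, which asks NumPy to skip the copy when the requested dtype already matches. The variable names are illustrative.

```python
import numpy as np

arr = np.arange(5)

same_type_copy = arr.astype(arr.dtype)  # same dtype, but still a new array
same_type_copy[0] = 99                  # does not affect arr

print(arr[0], same_type_copy[0])        # 0 99
copy_shares = np.shares_memory(arr, same_type_copy)

# With copy=False, no new array is allocated when the dtype already matches
no_copy = arr.astype(arr.dtype, copy=False)
no_copy_shares = np.shares_memory(arr, no_copy)
```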

## Arithmetic with NumPy Arrays
Arrays are important because they enable you to express batch operations on data without writing any for loops. NumPy users call this vectorization. Any arithmetic operation between equal-size arrays is applied element-wise:

In [55]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])

In [56]:
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [57]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [58]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])
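
To see what vectorization replaces, the following sketch computes the same element-wise product twice: once with explicit Python loops over nested lists, and once with a single array expression. The names looped and vectorized are illustrative.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])

# Pure Python: visit every row and every value explicitly
looped = [[x * x for x in row] for row in arr.tolist()]

# NumPy: one expression, with the looping done inside the library
vectorized = arr * arr

print(looped)
print(vectorized.tolist())
```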

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [61]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [62]:
arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

Comparisons between arrays of the same size yield Boolean arrays:

In [63]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])

In [64]:
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [65]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

Evaluating operations between differently sized arrays is called broadcasting and will be discussed in more detail in Appendix A: Advanced NumPy. Having a deep understanding of broadcasting is not necessary for most of this book.
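
As a brief preview only, here is a minimal sketch of one common broadcasting case: adding a one-dimensional array of length 3 to a 2 × 3 array. The shorter array is applied to each row. The name row is illustrative, and this does not cover the general rules.

```python
import numpy as np

arr = np.array([[1., 2., 3.], [4., 5., 6.]])  # shape (2, 3)
row = np.array([10., 20., 30.])               # shape (3,)

# row is combined with each of the two rows of arr
result = arr + row

print(result.tolist())  # [[11.0, 22.0, 33.0], [14.0, 25.0, 36.0]]
```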

## Basic Indexing and Slicing

In [66]:
arr = np.arange(10)

In [67]:
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [68]:
arr[5]

np.int64(5)

In [69]:
arr[5:8]

array([5, 6, 7])

In [71]:
arr[5:8] = 12

In [72]:
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or broadcast, as I will call it from here on) to the entire selection.

An important first distinction from Python's built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array.

To give an example of this, I first create a slice of arr:

In [73]:
arr_slice = arr[5:8]

In [74]:
arr_slice

array([12, 12, 12])

In [76]:
arr_slice[1] = 12345

In [77]:
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The "bare" slice [:] will assign to all values in an array:

In [79]:
arr_slice[:] = 64

In [80]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

If you are new to NumPy, you might be surprised by this, especially if you have used other array programming languages that copy data more eagerly. As NumPy has been designed to be able to work with very large arrays, you could imagine performance and memory problems if NumPy insisted on always copying data.

If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, arr[5:8].copy(). As you will see, pandas works this way, too.
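
The difference between a view and a copy, and the contrast with list slicing, can be sketched as follows. The names view, copied, lst, and lst_slice are illustrative, and numpy.shares_memory is used only to confirm which objects share data.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]           # a view: shares memory with arr
copied = arr[5:8].copy()  # an independent copy

view[0] = -1    # visible through arr
copied[1] = -2  # not visible through arr

print(arr.tolist())  # [0, 1, 2, 3, 4, -1, 6, 7, 8, 9]

view_shares = np.shares_memory(arr, view)
copy_shares = np.shares_memory(arr, copied)

# By contrast, slicing a built-in list always produces a new list
lst = list(range(10))
lst_slice = lst[5:8]
lst_slice[0] = -1
print(lst)  # [0, 1, 2, 3, 4, 5, 6, 7, 8, 9]
```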

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [81]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

In [82]:
arr2d[2]

array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

In [83]:
arr2d[0][2]

np.int64(3)

In [84]:
arr2d[0,2]

np.int64(3)

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array arr3d:

In [85]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

In [86]:
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [87]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

In [89]:
old_values = arr3d[0].copy()