In [1]:
import numpy as np

# 4 NumPy Basics: Arrays and Vectorized Computation

In [2]:
my_arr = np.arange(1_000_000)
my_list = list(range(1_000_000))

Both hold one million integers. Now multiply each sequence by two and time it:

In [3]:
%timeit my_arr2 = my_arr * 2

368 µs ± 27.1 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


In [4]:
%timeit my_list2 = [x * 2 for x in my_list]


50.4 ms ± 375 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


The array operation is far faster than the list comprehension: about 368 µs versus 50.4 ms here, roughly a 137-fold difference. NumPy operations are often 10 to 1000 times faster than the equivalent pure Python code, and arrays use significantly less memory.
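
The memory claim can be checked with a rough sketch. It assumes CPython, where `sys.getsizeof` reports the size of the list object and of each integer object separately; the names `small_arr` and `small_list` are illustrative, and exact byte counts vary by Python version and platform.

```python
import sys

import numpy as np

n = 100_000
small_arr = np.arange(n, dtype=np.int64)   # one contiguous block of 8-byte integers
small_list = list(range(n))                # a list of references to separate int objects

# Bytes used by the array's data buffer
arr_bytes = small_arr.nbytes

# Rough estimate for the list: the list object plus every int object it references
list_bytes = sys.getsizeof(small_list) + sum(sys.getsizeof(x) for x in small_list)

print(arr_bytes, list_bytes, round(list_bytes / arr_bytes, 1))
```

The ratio printed at the end is only an estimate, but it gives a feel for the per-element overhead of Python objects.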


# 4.1 The NumPy ndarray: A Multidimensional Array Object
One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements.

To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, I first import NumPy and create a small array:



In [6]:
data = np.array([[1.5, -0.1, 3], [0, -3, 6.5]])
data

array([[ 1.5, -0.1,  3. ],
       [ 0. , -3. ,  6.5]])

You can also perform mathematical operations with `data`:

In [7]:
data * 10

array([[ 15.,  -1.,  30.],
       [  0., -30.,  65.]])

In [8]:
data + data

array([[ 3. , -0.2,  6. ],
       [ 0. , -6. , 13. ]])

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a `shape`, a tuple indicating the size of each dimension, and a `dtype`, an object describing the data type of the array:

In [9]:
data.shape

(2, 3)

In [10]:
data.dtype

dtype('float64')

## Creating ndarrays
The easiest way to create an array is to use the `array` function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:



In [11]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1


array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [12]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

Since `data2` was a list of lists, the NumPy array `arr2` has two dimensions, with shape inferred from the data. We can confirm this by inspecting the `ndim` and `shape` attributes:



In [13]:
arr2.ndim

2

In [14]:
arr2.shape

(2, 4)

`numpy.array` tries to infer a good data type for the array that it creates. The data type is stored in a special `dtype` metadata object; for example, in the previous two examples we have:

In [15]:
arr1.dtype

dtype('float64')

In [16]:
arr2.dtype

dtype('int64')
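
As a small sketch of how this inference behaves with mixed input (the variable names are illustrative, and the exact integer width depends on the platform and NumPy version):

```python
import numpy as np

# A single float among integers makes the whole array floating point
mixed = np.array([1, 2, 3.5])

# Booleans mixed with integers are promoted to an integer type
bool_int = np.array([True, 2, 3])

# All-integer input gets the platform's default integer type
ints = np.array([1, 2, 3])

print(mixed.dtype, bool_int.dtype, ints.dtype)
```

Because an ndarray is homogeneous, NumPy picks one type that can represent every element rather than storing each element with its own type.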

In addition to `numpy.array`, there are a number of other functions for creating new arrays. As examples, `numpy.zeros` and `numpy.ones` create arrays of 0s or 1s, respectively, with a given length or shape. `numpy.empty` creates an array without initializing its values to any particular value. To create a higher dimensional array with these methods, pass a tuple for the shape:


In [17]:
np.zeros(10)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

In [18]:
np.zeros((3, 6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [19]:
np.empty((2, 3, 2))

array([[[0.00000000e+000, 2.47032823e-322],
        [0.00000000e+000, 0.00000000e+000],
        [0.00000000e+000, 8.75983079e+164]],

       [[5.20269907e-090, 9.76835939e+165],
        [8.26442131e-072, 6.52172024e-038],
        [3.99910963e+252, 1.46030983e-319]]])

- It’s not safe to assume that `numpy.empty` will return an array of all zeros. This function returns uninitialized memory and thus may contain nonzero "garbage" values. You should use this function only if you intend to populate the new array with data.



`numpy.arange` is an array-valued version of the built-in Python `range` function:



In [20]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])
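
A few related creation functions are sketched below. `numpy.zeros_like`, `numpy.ones_like`, `numpy.full`, and `numpy.eye` are not used elsewhere in these notes; the `template` array is just an illustrative input.

```python
import numpy as np

template = np.array([[1.5, -0.1, 3.0], [0.0, -3.0, 6.5]])

zeros_like = np.zeros_like(template)   # same shape and dtype as template, filled with 0
ones_like = np.ones_like(template)     # same shape and dtype, filled with 1
filled = np.full((2, 3), 7.0)          # given shape, every element set to 7.0
identity = np.eye(3)                   # 3 x 3 identity matrix

print(zeros_like.shape, zeros_like.dtype)
print(filled)
print(identity)
```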

## Data Types for ndarrays
The data type or `dtype` is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:

In [21]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

In [22]:
arr1.dtype

dtype('float64')

In [23]:
arr2.dtype

dtype('int32')

You can explicitly convert or cast an array from one data type to another using ndarray’s `astype` method:



In [25]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

dtype('int64')

In [26]:
float_arr = arr.astype(np.float64)
float_arr

array([1., 2., 3., 4., 5.])

In [27]:
float_arr.dtype

dtype('float64')

In this example, integers were cast to floating point. If I cast some floating-point numbers to be of integer data type, the decimal part will be truncated:



In [29]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr

array([ 3.7, -1.2, -2.6,  0.5, 12.9, 10.1])

In [30]:
arr.astype(np.int32)

array([ 3, -1, -2,  0, 12, 10], dtype=int32)
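
Truncation here means dropping the fractional part, which rounds toward zero. The sketch below contrasts that with flooring and rounding to nearest; `numpy.floor` and `numpy.rint` are brought in only for comparison, and the variable names are illustrative.

```python
import numpy as np

values = np.array([3.7, -1.2, -2.6, 0.5])

truncated = values.astype(np.int32)            # drops the fractional part (toward zero)
floored = np.floor(values).astype(np.int32)    # rounds down, toward negative infinity
rounded = np.rint(values).astype(np.int32)     # rounds to nearest, ties go to the even value

print(truncated, floored, rounded)
```

The three results differ for the negative values, so it is worth choosing the rounding step explicitly before casting if the direction matters.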

If you have an array of strings representing numbers, you can use `astype` to convert them to numeric form:

In [32]:
numeric_strings = np.array(["1.25", "-9.6", "42"], dtype=np.string_)
numeric_strings.astype(float)


array([ 1.25, -9.6 , 42.  ])

- Be cautious when using the numpy.string_ type, as string data in NumPy is fixed size and may truncate input without warning. pandas has more intuitive out-of-the-box behavior on non-numeric data.
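
A minimal sketch of the silent truncation. It uses a fixed-width Unicode dtype (`"U3"`) rather than `numpy.string_`, on the assumption that the same fixed-size behavior is what the note refers to; the `codes` array is illustrative.

```python
import numpy as np

# "U3" means Unicode strings of at most 3 characters
codes = np.array(["abc", "de"], dtype="U3")

# Assigning a longer string keeps only the first 3 characters, with no warning
codes[1] = "longer"

print(codes)
```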

If casting were to fail for some reason (like a string that cannot be converted to `float64`), a `ValueError` will be raised. Before, I was a bit lazy and wrote `float` instead of `np.float64`; NumPy aliases the Python types to its own equivalent data types.
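
A short sketch of a failing cast and of the type aliasing; the `bad_strings` array is a made-up input for illustration.

```python
import numpy as np

bad_strings = np.array(["1.25", "not a number", "42"])

try:
    bad_strings.astype(np.float64)
    cast_failed = False
except ValueError as exc:
    cast_failed = True
    print(f"cast failed: {exc}")

# The built-in float maps to the same dtype as np.float64
same_dtype = np.dtype(float) == np.dtype(np.float64)
print(same_dtype)
```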

You can also use another array’s `dtype` attribute:

In [34]:
int_array = np.arange(10)
calibers = np.array([.22, .270, .357, .380, .44, .50], dtype=np.float64)
int_array.astype(calibers.dtype)


array([0., 1., 2., 3., 4., 5., 6., 7., 8., 9.])

There are shorthand type code strings you can also use to refer to a `dtype`:

In [36]:
zeros_uint32 = np.zeros(8, dtype="u4")
zeros_uint32


array([0, 0, 0, 0, 0, 0, 0, 0], dtype=uint32)

- Calling `astype` always creates a new array (a copy of the data), even if the new data type is the same as the old data type.
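
One way to confirm the copy is `numpy.shares_memory`, sketched here with illustrative names:

```python
import numpy as np

original = np.arange(5, dtype=np.int64)
same_type = original.astype(np.int64)    # same dtype, but still a new array

same_type[0] = 99                        # modify the result only
shares = np.shares_memory(original, same_type)

print(original, same_type, shares)
```

This describes the default behavior. `astype` also accepts a `copy=False` argument, which allows it to return the input array when no conversion is needed.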



## Arithmetic with NumPy Arrays
Arrays are important because they enable you to express batch operations on data without writing any `for` loops. NumPy users call this vectorization. Any arithmetic operations between equal-size arrays apply the operation element-wise:

In [37]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

array([[1., 2., 3.],
       [4., 5., 6.]])

In [38]:
arr * arr

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

In [39]:
arr - arr

array([[0., 0., 0.],
       [0., 0., 0.]])

In [40]:
1 / arr

array([[1.        , 0.5       , 0.33333333],
       [0.25      , 0.2       , 0.16666667]])

In [41]:
arr ** 2

array([[ 1.,  4.,  9.],
       [16., 25., 36.]])

Comparisons between arrays of the same size yield Boolean arrays:

In [44]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2

array([[ 0.,  4.,  1.],
       [ 7.,  2., 12.]])

In [45]:
arr2 > arr

array([[False,  True, False],
       [ True, False,  True]])

## Basic Indexing and Slicing
NumPy array indexing is a deep topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:


In [46]:
arr = np.arange(10)
arr

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])

In [47]:
arr[5]

5

In [48]:
arr[5:8]

array([5, 6, 7])

In [49]:
arr[5:8] = 12
arr

array([ 0,  1,  2,  3,  4, 12, 12, 12,  8,  9])

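As you can see, assigning a scalar value to a slice, as in `arr[5:8] = 12`, propagates the value to the entire selection. An important distinction from Python's built-in lists is that array slices are views on the original array: the data is not copied, and any modification to the view is reflected in the source array. To give an example of this, I first create a slice of `arr`: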

In [50]:
arr_slice = arr[5:8]
arr_slice

array([12, 12, 12])

Now, when I change values in `arr_slice`, the mutations are reflected in the original array `arr`:

In [51]:
arr_slice[1] = 12345
arr

array([    0,     1,     2,     3,     4,    12, 12345,    12,     8,
           9])

The "bare" slice `[:]` will assign to all values in an array:

In [54]:
arr_slice[:] = 64
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

- If you want a copy of a slice of an ndarray instead of a view, you will need to explicitly copy the array—for example, `arr[5:8].copy()`. As you will see, pandas works this way, too.
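
The difference between a view and an explicit copy can be sketched as follows (variable names are illustrative):

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]            # a view: shares memory with arr
copied = arr[5:8].copy()   # an independent copy

copied[:] = -1             # does not affect arr
view[0] = 100              # writes through to arr[5]

print(arr)
print(np.shares_memory(arr, view), np.shares_memory(arr, copied))
```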



With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [56]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
arr2d[2]

array([7, 8, 9])

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:



In [58]:
arr2d[0][2]

3

In [59]:
arr2d[0, 2]

3

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array `arr3d`:



In [60]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])

arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

`arr3d[0]` is a 2 × 3 array:

In [61]:
arr3d[0]

array([[1, 2, 3],
       [4, 5, 6]])

Both scalar values and arrays can be assigned to `arr3d[0]`:



In [62]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d

array([[[42, 42, 42],
        [42, 42, 42]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

In [63]:
arr3d[0] = old_values
arr3d

array([[[ 1,  2,  3],
        [ 4,  5,  6]],

       [[ 7,  8,  9],
        [10, 11, 12]]])

Similarly, `arr3d[1, 0]` gives you all of the values whose indices start with `(1, 0)`, forming a one-dimensional array:



In [64]:
arr3d[1, 0]

array([7, 8, 9])

This expression is the same as though we had indexed in two steps:

In [65]:
x = arr3d[1]
x

array([[ 7,  8,  9],
       [10, 11, 12]])

In [66]:
x[0]

array([7, 8, 9])

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.

- This multidimensional indexing syntax for NumPy arrays will not work with regular Python objects, such as lists of lists.
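
A quick sketch of that limitation, using a plain nested list called `nested` for illustration:

```python
import numpy as np

nested = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]

# Recursive indexing works for both lists and arrays
first_way = nested[0][2]

# Comma-separated indexing passes a tuple, which a list rejects
try:
    nested[0, 2]
    list_error = None
except TypeError as exc:
    list_error = exc
    print(f"TypeError: {exc}")

# Converting to an ndarray makes the tuple form valid
second_way = np.array(nested)[0, 2]
print(first_way, second_way)
```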


## Indexing with Slices
Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:


In [67]:
arr

array([ 0,  1,  2,  3,  4, 64, 64, 64,  8,  9])

In [68]:
arr[1:6]

array([ 1,  2,  3,  4, 64])

In [69]:
arr2d

array([[1, 2, 3],
       [4, 5, 6],
       [7, 8, 9]])

In [70]:
arr2d[:2]

array([[1, 2, 3],
       [4, 5, 6]])

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression `arr2d[:2]` as "select the first two rows of `arr2d`."


In [71]:
arr2d[:2, 1:]

array([[2, 3],
       [5, 6]])

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices.

For example, I can select the second row but only the first two columns. While `arr2d` is two-dimensional, the resulting `lower_dim_slice` is one-dimensional, and its shape is a tuple with one axis size:

In [74]:
lower_dim_slice = arr2d[1, :2]
lower_dim_slice.shape

(2,)
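
To see why mixing matters, this sketch compares an integer index with a length-one slice over the same row. The names `row_as_1d` and `row_as_2d` are illustrative.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

row_as_1d = arr2d[1, :2]     # integer index drops axis 0
row_as_2d = arr2d[1:2, :2]   # length-one slice keeps axis 0

print(row_as_1d.shape, row_as_2d.shape)
```

Both selections contain the same two values; only the number of dimensions differs.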

Similarly, I can select the third column but only the first two rows, like so:

In [75]:
arr2d[:2, 2]

array([3, 6])

Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:

In [76]:
arr2d[:, :1]

array([[1],
       [4],
       [7]])

Of course, assigning to a slice expression assigns to the whole selection:


In [78]:
arr2d[:2, 1:] = 0
arr2d

array([[1, 0, 0],
       [4, 0, 0],
       [7, 8, 9]])

## Boolean Indexing
Let’s consider an example where we have some data in an array and an array of names with duplicates:

In [79]:
names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])
data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2],
                 [-12, -4], [3, 4]])

In [80]:
names

array(['Bob', 'Joe', 'Will', 'Bob', 'Will', 'Joe', 'Joe'], dtype='<U4')

In [81]:
data

array([[  4,   7],
       [  0,   2],
       [ -5,   6],
       [  0,   0],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

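Suppose each name corresponds to a row in the `data` array and we want to select all the rows with the name `"Bob"`. Like arithmetic operations, comparisons (such as `==`) with arrays are also vectorized, so comparing `names` with the string `"Bob"` yields a Boolean array: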

In [82]:
names == "Bob"

array([ True, False, False,  True, False, False, False])

This Boolean array can be passed when indexing the array; only the rows where the mask is `True` are selected:

In [83]:
data[names == "Bob"]

array([[4, 7],
       [0, 0]])

The Boolean array must be the same length as the array axis it is indexing. You can even mix and match Boolean arrays with slices or integers (or sequences of integers):

In [89]:
data[names == "Bob", 1:]

array([[7],
       [0]])

In [87]:
data[names == "Bob", 1]

array([7, 0])
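
A sketch of what happens when the lengths do not match, assuming a recent NumPy version (the `short_mask` array is made up for illustration):

```python
import numpy as np

data = np.array([[4, 7], [0, 2], [-5, 6], [0, 0], [1, 2], [-12, -4], [3, 4]])
short_mask = np.array([True, False, True])   # 3 entries, but data has 7 rows

try:
    data[short_mask]
    mask_error = None
except IndexError as exc:
    mask_error = exc
    print(f"IndexError: {exc}")
```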

To select everything but `"Bob"`, you can either use `!=` or negate the condition with `~`:

In [90]:
names != "Bob"

array([False,  True,  True, False,  True,  True,  True])

In [91]:
~(names == "Bob")

array([False,  True,  True, False,  True,  True,  True])

In [92]:
data[~(names == "Bob")]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

The `~` operator is especially useful when you want to invert a Boolean array referenced by a variable:

In [94]:
cond = names == "Bob"
data[~cond]

array([[  0,   2],
       [ -5,   6],
       [  1,   2],
       [-12,  -4],
       [  3,   4]])

You can also combine multiple Boolean conditions, using the Boolean arithmetic operators `&` (and) and `|` (or). To select two of the three names:

In [95]:
mask = (names == "Bob") | (names == "Will")
mask

array([ True, False,  True,  True,  True, False, False])

Selecting data from an array by Boolean indexing and assigning the result to a new variable __always__ creates a copy of the data, even if the returned array is unchanged.

- The Python keywords `and` and `or` do not work with Boolean arrays. Use `&` (and) and `|` (or) instead.
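
The failure mode looks roughly like this. The sketch rebuilds the `names` array so it runs on its own.

```python
import numpy as np

names = np.array(["Bob", "Joe", "Will", "Bob", "Will", "Joe", "Joe"])

# Element-wise OR: one Boolean per name
either = (names == "Bob") | (names == "Will")

# The `or` keyword asks each whole array for a single truth value, which is ambiguous
try:
    (names == "Bob") or (names == "Will")
    keyword_error = None
except ValueError as exc:
    keyword_error = exc
    print(f"ValueError: {exc}")

print(either)
```

The parentheses around each comparison matter as well, because `|` and `&` bind more tightly than `==`.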

Setting values with Boolean arrays works by substituting the value or values on the right-hand side into the locations where the Boolean array's values are `True`. To set all negative values in `data` to 0, we need only do this:

In [97]:
data[data < 0] = 0
data

array([[4, 7],
       [0, 2],
       [0, 6],
       [0, 0],
       [1, 2],
       [0, 0],
       [3, 4]])

You can also set whole rows or columns using a one-dimensional Boolean array:



In [98]:
data[names != "Joe"] = 7
data

array([[7, 7],
       [0, 2],
       [7, 7],
       [7, 7],
       [7, 7],
       [0, 0],
       [3, 4]])

## Fancy Indexing
Fancy indexing is the term NumPy uses for indexing with integer arrays. Let's make an 8 × 4 array to experiment with:

In [100]:
arr = np.zeros((8, 4))
for i in range(8):
    arr[i] = i
    
arr

array([[0., 0., 0., 0.],
       [1., 1., 1., 1.],
       [2., 2., 2., 2.],
       [3., 3., 3., 3.],
       [4., 4., 4., 4.],
       [5., 5., 5., 5.],
       [6., 6., 6., 6.],
       [7., 7., 7., 7.]])


To select a subset of the rows in a particular order, you can simply pass a list or ndarray of integers specifying the desired order:



In [101]:
arr[[4, 3, 0, 6]]

array([[4., 4., 4., 4.],
       [3., 3., 3., 3.],
       [0., 0., 0., 0.],
       [6., 6., 6., 6.]])

Hopefully this code did what you expected! Using negative indices selects rows from the end:

- Counting back from the last row (index 7): index -1 is row 7, -2 is row 6, -3 is row 5, and so forth.

In [102]:
arr[[-3, -5, -7]]

array([[5., 5., 5., 5.],
       [3., 3., 3., 3.],
       [1., 1., 1., 1.]])

Passing multiple index arrays does something slightly different; it selects a one-dimensional array of elements corresponding to each tuple of indices. See how it populates:

In [103]:
arr = np.arange(32).reshape((8, 4))
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22, 23],
       [24, 25, 26, 27],
       [28, 29, 30, 31]])

In [105]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]

array([ 4, 23, 29, 10])

Here the elements `(1, 0), (5, 3), (7, 1)`, and `(2, 2)` were selected. The result of fancy indexing with as many integer arrays as there are axes is always one-dimensional.

The behavior of fancy indexing in this case is a bit different from what some users might have expected, which is the rectangular region formed by selecting a subset of the matrix’s rows and columns. Here is one way to get that, selecting the rows first and then reordering the columns of the result:



In [106]:
arr[[1, 5, 7, 2]][:, [0, 3, 1, 2]]

array([[ 4,  7,  5,  6],
       [20, 23, 21, 22],
       [28, 31, 29, 30],
       [ 8, 11,  9, 10]])
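
Another option, not used in these notes, is `numpy.ix_`, which builds index arrays that broadcast into the rectangular selection in one step. A minimal sketch, with `rows` and `cols` as illustrative names:

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))
rows = [1, 5, 7, 2]
cols = [0, 3, 1, 2]

two_step = arr[rows][:, cols]          # rows first, then columns
with_ix = arr[np.ix_(rows, cols)]      # open mesh of row and column indices

print(np.array_equal(two_step, with_ix))
print(with_ix)
```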

Keep in mind that fancy indexing, unlike slicing, __always__ copies the data into a new array when assigning the result to a new variable. If you assign values with fancy indexing, the indexed values will be modified:



In [107]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]]


array([ 4, 23, 29, 10])

In [108]:
arr[[1, 5, 7, 2], [0, 3, 1, 2]] = 0
arr

array([[ 0,  1,  2,  3],
       [ 0,  5,  6,  7],
       [ 8,  9,  0, 11],
       [12, 13, 14, 15],
       [16, 17, 18, 19],
       [20, 21, 22,  0],
       [24, 25, 26, 27],
       [28,  0, 30, 31]])

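Together, these two behaviors support a read-modify-write pattern: pull values out with fancy indexing (which gives a copy), update them, and then assign them back through the same index arrays to update the original array.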
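
A small sketch of the copy-on-read, write-on-assign behavior of fancy indexing; `picked` and `before_write` are illustrative names.

```python
import numpy as np

arr = np.arange(32).reshape((8, 4))

picked = arr[[1, 5, 7, 2], [0, 3, 1, 2]]   # a copy of four elements
picked[:] = -1                              # changes only the copy
before_write = int(arr[1, 0])               # the original value is still in arr

arr[[1, 5, 7, 2], [0, 3, 1, 2]] = picked    # assignment writes into arr

print(before_write)
print(arr)
```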

## Transposing Arrays and Swapping Axes

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the `transpose` method and the special `T` attribute:



In [109]:
arr = np.arange(15).reshape((3, 5))
arr

array([[ 0,  1,  2,  3,  4],
       [ 5,  6,  7,  8,  9],
       [10, 11, 12, 13, 14]])

In [110]:
arr.T

array([[ 0,  5, 10],
       [ 1,  6, 11],
       [ 2,  7, 12],
       [ 3,  8, 13],
       [ 4,  9, 14]])

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using `numpy.dot`:



In [111]:
arr = np.array([[0, 1, 0], [1, 2, -2], [6, 3, 2], [-1, 0, -1], [1, 0, 1]])
arr

array([[ 0,  1,  0],
       [ 1,  2, -2],
       [ 6,  3,  2],
       [-1,  0, -1],
       [ 1,  0,  1]])

In [112]:
np.dot(arr.T, arr)

array([[39, 20, 12],
       [20, 14,  2],
       [12,  2, 10]])

The `@` infix operator is another way to do matrix multiplication:



In [114]:
arr.T @ arr

array([[39, 20, 12],
       [20, 14,  2],
       [12,  2, 10]])

Simple transposing with `.T` is a special case of swapping axes. ndarray has the method `swapaxes`, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:



In [119]:
arr

array([[ 0,  1,  0],
       [ 1,  2, -2],
       [ 6,  3,  2],
       [-1,  0, -1],
       [ 1,  0,  1]])

In [121]:
arr.swapaxes(0,1)

array([[ 0,  1,  6, -1,  1],
       [ 1,  2,  3,  0,  0],
       [ 0, -2,  2, -1,  1]])

`swapaxes` similarly returns a view on the data without making a copy.
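
Because both operations return views, writing through the transposed array changes the original. A sketch with illustrative names:

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))

transposed = arr.T
swapped = arr.swapaxes(0, 1)

transposed[0, 1] = -5     # same memory location as arr[1, 0]

print(arr)
print(np.shares_memory(arr, transposed), np.shares_memory(arr, swapped))
```

Call `.copy()` on the result if you need a transposed array that is independent of the original.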

# Pseudorandom Number Generation

The `numpy.random` module supplements the built-in Python `random` module with functions for efficiently generating whole arrays of sample values from many kinds of probability distributions. For example, you can get a 4 × 4 array of samples from the standard normal distribution using `numpy.random.standard_normal`:



In [122]:
samples = np.random.standard_normal(size=(4, 4))
samples

array([[ 1.06374599,  0.2381585 , -0.38021508,  0.91328387],
       [-1.21123844,  0.1023357 ,  2.28596312, -0.76381204],
       [-0.53160301, -2.88681026, -0.08716062,  0.4639119 ],
       [ 1.09244672,  1.44324643, -1.60902945,  0.08891758]])

Python’s built-in `random` module, by contrast, samples only one value at a time. As you can see from this benchmark, `numpy.random` is well over an order of magnitude faster for generating very large samples:



In [123]:
from random import normalvariate
N = 1_000_000

%timeit samples = [normalvariate(0, 1) for _ in range(N)]
%timeit np.random.standard_normal(N)


546 ms ± 12 ms per loop (mean ± std. dev. of 7 runs, 1 loop each)
28.6 ms ± 768 µs per loop (mean ± std. dev. of 7 runs, 10 loops each)


That is roughly a 19-fold difference in this run.

These random numbers are not truly random (rather, pseudorandom) but instead are generated by a configurable random number generator that determines deterministically what values are created. Functions like `numpy.random.standard_normal` use the `numpy.random` module's default random number generator, but your code can be configured to use an explicit generator:



In [125]:
rng = np.random.default_rng(seed=12345)

data = rng.standard_normal((2, 3))

The `seed` argument is what determines the initial state of the generator, and the state changes each time the `rng` object is used to generate data. The generator object `rng` is also isolated from other code which might use the `numpy.random` module:



In [126]:
type(rng)

numpy.random._generator.Generator
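
The effect of the seed and of the changing state can be sketched with two generators; `rng_a` and `rng_b` are illustrative names.

```python
import numpy as np

rng_a = np.random.default_rng(seed=12345)
rng_b = np.random.default_rng(seed=12345)

first_a = rng_a.standard_normal((2, 3))
first_b = rng_b.standard_normal((2, 3))    # same seed, same initial state, same values

second_a = rng_a.standard_normal((2, 3))   # the state has advanced, so the values differ

print(np.array_equal(first_a, first_b))
print(np.array_equal(first_a, second_a))
```

Since each generator carries its own state, drawing from `rng_a` has no effect on what `rng_b` produces.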

# Universal Functions: Fast Element-Wise Array Functions

A universal function, or ufunc, is a function that performs element-wise operations on data in ndarrays. You can think of them as fast vectorized wrappers for simple functions that take one or more scalar values and produce one or more scalar results.

Many ufuncs are simple element-wise transformations, like `numpy.sqrt` or `numpy.exp`:

