# CHAPTER 3
# NumPy Basics

NumPy, short for Numerical Python, is one of the most important foundational packages for numerical computing in Python. Most computational packages providing scientific functionality use NumPy’s array objects as the _lingua franca_ for data exchange.

Here are some of the things you’ll find in NumPy: 
<br>
-  ndarray, an efficient multidimensional array providing fast array-oriented arithmetic operations and flexible broadcasting capabilities. 
-  Mathematical functions for fast operations on entire arrays of data without having to write loops. 
-  Tools for reading/writing array data to disk and working with memory-mapped files. 
-  Linear algebra, random number generation, and Fourier transform capabilities. 
-  A C API (an application programming interface exposed in the C language) for connecting NumPy with libraries written in C, C++, or FORTRAN.

While NumPy by itself does not provide modeling or scientific functionality, having an understanding of NumPy arrays and array-oriented computing will help you use tools with array-oriented semantics, like pandas, much more effectively. One of the reasons NumPy is so important for numerical computations in Python is that it is designed for efficiency on large arrays of data. There are a number of reasons for this: 
-  NumPy internally stores data in a contiguous block of memory, independent of other built-in Python objects. NumPy’s library of algorithms written in the C language can operate on this memory without any type checking or other overhead. NumPy arrays also use much less memory than built-in Python sequences. 
- NumPy operations perform complex computations on entire arrays without the need for Python for loops.
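
The memory claim in the list above can be checked with a rough sketch. The variable names and the choice of `np.int64` are illustrative assumptions, and the list figure is only an estimate: it adds the size of the list's pointer array to the size of each integer object, which slightly overcounts small integers that CPython shares.

```python
import sys
import numpy as np

n = 1_000_000
arr = np.arange(n, dtype=np.int64)   # fixed 8-byte integers in one block
lst = list(range(n))

# Bytes used by the array's data buffer: n elements * 8 bytes each
array_bytes = arr.nbytes

# The list holds one pointer per element, and each element is a full int object
list_bytes = sys.getsizeof(lst) + sum(sys.getsizeof(x) for x in lst)

print(arr.flags["C_CONTIGUOUS"])  # True: the data sits in one contiguous block
print(array_bytes, list_bytes)
```

On a typical 64-bit CPython build the list estimate comes out several times larger than the array's 8,000,000 bytes.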

To give you an idea of the performance difference, consider a NumPy array of one million integers, and the equivalent Python list:

In [3]:
import numpy as np

my_arr = np.arange(1000000)
my_list = list(range(1000000)) 

Now let’s multiply each sequence by 2:

In [4]:
%time for _ in range(10): my_arr2 = my_arr * 2 

CPU times: user 12.5 ms, sys: 1.18 ms, total: 13.7 ms
Wall time: 13 ms


In [5]:
%time for _ in range(10): my_list2 = [x * 2 for x in my_list] 

CPU times: user 321 ms, sys: 87.7 ms, total: 409 ms
Wall time: 408 ms


NumPy-based algorithms are generally 10 to 100 times faster (or more) than their pure Python counterparts and use significantly less memory. 

## 3.1  The NumPy ndarray: A Multidimensional Array Object 

One of the key features of NumPy is its N-dimensional array object, or ndarray, which is a fast, flexible container for large datasets in Python. Arrays enable you to perform mathematical operations on whole blocks of data using similar syntax to the equivalent operations between scalar elements. To give you a flavor of how NumPy enables batch computations with similar syntax to scalar values on built-in Python objects, first import NumPy and generate a small array of random data:


In [6]:
import numpy as np

# Generate some random data
data = np.random.randn(2, 3)
data

array([[-0.06297979, -0.54360514, -1.88513677],
       [-2.065427  , -0.30454256, -0.27965759]])

Then write mathematical operations with data:


In [7]:
data * 10

array([[ -0.6297979 ,  -5.43605138, -18.85136766],
       [-20.65427005,  -3.04542559,  -2.79657589]])

In [8]:
data + data

array([[-0.12595958, -1.08721028, -3.77027353],
       [-4.13085401, -0.60908512, -0.55931518]])

In the first example, all of the elements have been multiplied by 10. In the second, the corresponding values in each “cell” in the array have been added to each other.

An ndarray is a generic multidimensional container for homogeneous data; that is, all of the elements must be the same type. Every array has a shape, a tuple indicating the size of each dimension, and a dtype, an object describing the data type of the array:

In [9]:
data.shape

(2, 3)

In [10]:
data.dtype

dtype('float64')

### 3.1.1 Creating ndarrays 

The easiest way to create an array is to use the array function. This accepts any sequence-like object (including other arrays) and produces a new NumPy array containing the passed data. For example, a list is a good candidate for conversion:


In [11]:
data1 = [6, 7.5, 8, 0, 1]
arr1 = np.array(data1)
arr1

array([6. , 7.5, 8. , 0. , 1. ])

Nested sequences, like a list of equal-length lists, will be converted into a multidimensional array:

In [12]:
data2 = [[1, 2, 3, 4], [5, 6, 7, 8]]
arr2 = np.array(data2)
arr2 

array([[1, 2, 3, 4],
       [5, 6, 7, 8]])

A simple representation of the example of multiple dimensions:

<br>
<img src="Fig1.jpg", style="width: 600px";>


Since _data2_ was a list of lists, the NumPy array _arr2_ has two dimensions with shape inferred from the data. We can confirm this by inspecting the **ndim** and **shape** attributes:

In [13]:
arr2.ndim 

2

In [14]:
arr2.shape

(2, 4)

Unless explicitly specified (more on this later), np.array tries to infer a good data type for the array that it creates. The data type is stored in a special **dtype** metadata object; for example, in the previous two examples we have:

In [15]:
print(arr1.dtype)
print(arr2.dtype)

float64
int64


In addition to np.array, there are a number of other functions for creating new arrays. As examples, **zeros** and **ones** create arrays of 0s or 1s, respectively, with a given length or shape. **empty** creates an array without initializing its values to any particular value. To create a higher dimensional array with these functions, pass a tuple for the shape:

In [16]:
np.zeros(5)

array([0., 0., 0., 0., 0.])

In [17]:
np.zeros((3,6))

array([[0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.],
       [0., 0., 0., 0., 0., 0.]])

In [20]:
np.empty((3,2,3))

array([[[0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.]],

       [[0., 0., 0.],
        [0., 0., 0.]]])

<br>
<img src="Fig2.jpg", style="width: 600px";>


It’s not safe to assume that np.empty will return an array of all zeros. In some cases, it may return uninitialized “garbage” values.


**arange** is an array-valued version of the built-in Python range function:


In [19]:
np.arange(15)

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14])

See Table 3-1 for a short list of standard array creation functions. Since NumPy is focused on numerical computing, the data type, if not specified, will in many cases be float64 (floating point).
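
As a quick check of the default data type, the sketch below inspects the dtype returned by a few creation functions. The variable names are illustrative, and the exact integer width chosen by `np.arange` depends on the platform, so only its kind is inspected.

```python
import numpy as np

# Creation functions that default to float64 when no dtype is given
print(np.zeros(3).dtype)   # float64
print(np.ones(3).dtype)    # float64
print(np.empty(3).dtype)   # float64

# arange infers an integer dtype from integer arguments
int_range = np.arange(3)
print(int_range.dtype.kind)  # 'i' (signed integer; the width is platform dependent)

# Passing dtype explicitly overrides the default
int_zeros = np.zeros(3, dtype=np.int32)
print(int_zeros.dtype)     # int32
```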

<br>
<center>Table 3.1: Array creation functions </center>
<img src="Table3.1.jpg", style="width: 800px";>

#### 3.1.1.1 Data Types for ndarrays 

The data type or **dtype** is a special object containing the information (or metadata, data about data) the ndarray needs to interpret a chunk of memory as a particular type of data:


In [None]:
arr1 = np.array([1, 2, 3], dtype=np.float64)
arr2 = np.array([1, 2, 3], dtype=np.int32)

print(arr1.dtype)
print(arr2.dtype)

**dtypes** are a source of NumPy’s flexibility for interacting with data coming from other systems. In most cases they provide a mapping directly onto an underlying disk or memory representation, which makes it easy to read and write binary streams of data to disk and also to connect to code written in a low-level language like C or Fortran. The numerical dtypes are named the same way: a type name, like float or int, followed by a number indicating the number of bits per element. A standard double-precision floating-point value (what’s used under the hood in Python’s float object) takes up 8 bytes or 64 bits. Thus, this type is known in NumPy as float64. 
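
The naming rule can be verified with a small sketch: the `itemsize` attribute of a dtype gives the element size in bytes, so multiplying by 8 should reproduce the number in the type name. The particular dtypes listed here are just a sample.

```python
import numpy as np

for name in ("int8", "int32", "float32", "float64"):
    dt = np.dtype(name)
    # itemsize is in bytes; multiply by 8 for the bit count in the name
    print(name, dt.itemsize, dt.itemsize * 8)

bits_float64 = np.dtype("float64").itemsize * 8
print(bits_float64)  # 64
```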

You can explicitly convert or *cast* an array from one dtype to another using ndarray’s **astype** method.

In [None]:
arr = np.array([1, 2, 3, 4, 5])
arr.dtype

In [None]:
float_arr = arr.astype(np.float64)
float_arr.dtype

In the above example, integers were cast to floating point. If we cast some floating-point numbers to be of integer dtype, the decimal part will be truncated:

In [None]:
arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])
arr 

In [None]:
arr.astype(np.int32) 
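
Truncation here means rounding toward zero, which differs from rounding down for negative values. The sketch below contrasts the two on the same data; `np.floor` is brought in only for comparison.

```python
import numpy as np

arr = np.array([3.7, -1.2, -2.6, 0.5, 12.9, 10.1])

truncated = arr.astype(np.int32)           # drops the fractional part
floored = np.floor(arr).astype(np.int32)   # rounds down before casting

print(truncated.tolist())  # [3, -1, -2, 0, 12, 10]
print(floored.tolist())    # [3, -2, -3, 0, 12, 10]
```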

If you have an array of strings representing numbers, you can use **astype** to convert them to numeric form:

In [None]:
numeric_strings = np.array(['1.25', '-9.6', '42'], dtype=np.bytes_)
numeric_strings.astype(float)

### 3.1.2 Arithmetic with NumPy Arrays 

Arrays are important because they enable you to express batch operations on data without writing any **for** loops. NumPy users call this *vectorization*. Any arithmetic operation between equal-size arrays is applied element-wise:


In [None]:
arr = np.array([[1., 2., 3.], [4., 5., 6.]])
arr

In [None]:
arr * arr

In [None]:
arr - arr

Arithmetic operations with scalars propagate the scalar argument to each element in the array:

In [None]:
1 / arr

In [None]:
arr ** 0.5

Comparisons between arrays of the same size yield boolean arrays:

In [None]:
arr2 = np.array([[0., 4., 1.], [7., 2., 12.]])
arr2 

In [None]:
 arr2 > arr

### 3.1.3 Basic Indexing and Slicing 

NumPy array indexing is a rich topic, as there are many ways you may want to select a subset of your data or individual elements. One-dimensional arrays are simple; on the surface they act similarly to Python lists:

In [None]:
arr = np.arange(10)
arr

In [None]:
arr[5]

In [None]:
arr[5:8]

In [None]:
arr[5:8] = 12
arr

As you can see, if you assign a scalar value to a slice, as in arr[5:8] = 12, the value is propagated (or *broadcast* henceforth) to the entire selection. An important first distinction from Python’s built-in lists is that array slices are views on the original array. This means that the data is not copied, and any modifications to the view will be reflected in the source array. 

To give an example of this, we first create a slice of arr:


In [None]:
arr_slice = arr[5:8]
arr_slice 

Now, when we change values in arr_slice, the mutations are reflected in the original array arr:

In [None]:
arr_slice[1] = 12345
arr 

The “bare” slice [:] will assign to all values in an array:

In [None]:
arr_slice[:] = 64
arr 
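
If you want an independent copy of a slice rather than a view, you can copy it explicitly with the **copy** method. The following sketch contrasts the two; the names `view` and `independent` are illustrative, and `np.shares_memory` is used only to confirm which object shares data with the original.

```python
import numpy as np

arr = np.arange(10)

view = arr[5:8]                # shares memory with arr
independent = arr[5:8].copy()  # owns its own data

independent[:] = -1            # does not touch arr
view[0] = 99                   # writes through to arr[5]

print(np.shares_memory(arr, view))         # True
print(np.shares_memory(arr, independent))  # False
print(arr.tolist())  # [0, 1, 2, 3, 4, 99, 6, 7, 8, 9]
```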

With higher dimensional arrays, you have many more options. In a two-dimensional array, the elements at each index are no longer scalars but rather one-dimensional arrays:

In [None]:
arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])
print(arr2d)
arr2d[2] 

Thus, individual elements can be accessed recursively. But that is a bit too much work, so you can pass a comma-separated list of indices to select individual elements. So these are equivalent:

In [None]:
arr2d[0][2] 

In [None]:
arr2d[0, 2]

See Figure 3-1 for an illustration of indexing on a two-dimensional array. We find it helpful to think of axis 0 as the “rows” of the array and axis 1 as the “columns.”

<img src="Figure3.1.jpg", style="width: 500px";>
<center>Figure 3.1: Indexing elements in a NumPy array </center>
<br>

In multidimensional arrays, if you omit later indices, the returned object will be a lower dimensional ndarray consisting of all the data along the higher dimensions. So in the 2 × 2 × 3 array arr3d:

In [None]:
arr3d = np.array([[[1, 2, 3], [4, 5, 6]], [[7, 8, 9], [10, 11, 12]]])
arr3d 

arr3d[0] is a 2 × 3 array:

In [None]:
arr3d[0]

Both scalar values and arrays can be assigned to arr3d[0]:

In [None]:
old_values = arr3d[0].copy()
arr3d[0] = 42
arr3d 

In [None]:
arr3d[0] = old_values
arr3d

Similarly, arr3d[1, 0] gives you all of the values whose indices start with (1, 0), forming a 1-dimensional array:

In [None]:
 arr3d[1, 0]

This expression is the same as though we had indexed in two steps:

In [None]:
x = arr3d[1]
x 

In [None]:
x[0]

Note that in all of these cases where subsections of the array have been selected, the returned arrays are views.

#### 3.1.3.1 Indexing with slices

Like one-dimensional objects such as Python lists, ndarrays can be sliced with the familiar syntax:

In [None]:
print(arr)

arr[1:6]

Consider the two-dimensional array from before, arr2d. Slicing this array is a bit different:

In [None]:
print(arr2d)

arr2d[:2]

As you can see, it has sliced along axis 0, the first axis. A slice, therefore, selects a range of elements along an axis. It can be helpful to read the expression arr2d[:2] as “select the first two rows of arr2d.” 

You can pass multiple slices just like you can pass multiple indexes:

In [None]:
arr2d[:2, 1:] 

When slicing like this, you always obtain array views of the same number of dimensions. By mixing integer indexes and slices, you get lower dimensional slices. 

For example, you can select the second row but only the first two columns like so:

In [None]:
 arr2d[1, :2]

Similarly, you can select the third column but only the first two rows like so:

In [None]:
 arr2d[:2, 2] 
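
The effect of mixing integers and slices is easiest to see by comparing shapes. In this sketch (variable names are illustrative), an integer index removes an axis, while a slice of length one keeps it.

```python
import numpy as np

arr2d = np.array([[1, 2, 3], [4, 5, 6], [7, 8, 9]])

both_slices = arr2d[:2, 1:]        # two slices: still two-dimensional
int_and_slice = arr2d[1, :2]       # the integer index drops axis 0
slice_keeps_axis = arr2d[1:2, :2]  # a length-1 slice keeps axis 0

print(both_slices.shape)       # (2, 2)
print(int_and_slice.shape)     # (2,)
print(slice_keeps_axis.shape)  # (1, 2)
```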

See Figure 3-2 for an illustration. Note that a colon by itself means to take the entire axis, so you can slice only higher dimensional axes by doing:

In [None]:
 arr2d[:, :1] 

Of course, assigning to a slice expression assigns to the whole selection:

In [None]:
arr2d[:2, 1:] = 0
arr2d 

<img src="Figure3.2.jpg", style="width: 500px";>
<center>Figure 3.2: Two-dimensional array slicing  </center>
<br>


### 3.1.4 Transposing Arrays and Swapping Axes 

<br>
<img src="Fig3.jpg", style="width: 700px";>

Transposing is a special form of reshaping that similarly returns a view on the underlying data without copying anything. Arrays have the transpose method and also the special **T** attribute:


In [None]:
arr = np.arange(15).reshape((3, 5))
arr

In [None]:
arr.swapaxes(0,1)

In [None]:
arr.T
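
To see that the transpose is a view rather than a copy, the sketch below writes through the transposed array and reads the change back from the original. `np.shares_memory` is used here only as a check.

```python
import numpy as np

arr = np.arange(15).reshape((3, 5))
transposed = arr.T

print(transposed.shape)                   # (5, 3)
print(np.shares_memory(arr, transposed))  # True

# Element [4, 0] of the transpose is element [0, 4] of the original
transposed[4, 0] = -1
print(arr[0, 4])  # -1
```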

When doing matrix computations, you may do this very often—for example, when computing the inner matrix product using **np.dot**:

In [None]:
arr = np.random.randn(6, 3)
arr 

In [None]:
np.dot(arr.T, arr)
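
The same product can be written with the `@` operator. The sketch below checks that the two forms agree; the seeded generator from `np.random.default_rng` is an assumption made here only so that the run is reproducible.

```python
import numpy as np

rng = np.random.default_rng(0)   # seeded generator (assumption, for reproducibility)
arr = rng.standard_normal((6, 3))

gram = np.dot(arr.T, arr)   # (3, 6) times (6, 3) gives (3, 3)
gram_op = arr.T @ arr       # the @ operator performs the same matrix product

print(gram.shape)                  # (3, 3)
print(np.allclose(gram, gram_op))  # True
print(np.allclose(gram, gram.T))   # True: the result is symmetric
```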

For higher dimensional arrays, **transpose** will accept a tuple of axis numbers to permute the axes (for extra mind bending):

In [None]:
arr = np.arange(16).reshape((2, 2, 4))
arr 

In [None]:
arr.transpose((1, 0, 2)) 

Here, the axes have been reordered with the second axis first, the first axis second, and the last axis unchanged. 
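
Because the first two axes in the example above both have length 2, the shape does not change and the permutation is easy to miss. The sketch below uses a 2 × 3 × 4 array (an illustrative choice) so that the reordering shows up in the shape.

```python
import numpy as np

arr = np.arange(24).reshape((2, 3, 4))
swapped = arr.transpose((1, 0, 2))

print(arr.shape)      # (2, 3, 4)
print(swapped.shape)  # (3, 2, 4)

# Element [i, j, k] of the result is element [j, i, k] of the original
print(swapped[2, 1, 3] == arr[1, 2, 3])  # True
```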

Simple transposing with **.T** is a special case of swapping axes. ndarray has the method **swapaxes**, which takes a pair of axis numbers and switches the indicated axes to rearrange the data:

In [None]:
arr

In [None]:
arr.swapaxes(0, 1) 

swapaxes similarly returns a view on the data without making a copy. 

Example representation of swapping:

<br>
<img src="Fig5.jpg", style="width: 900px";>

## 3.2 Universal Functions 

A universal function, or **ufunc**, is a function that performs element-wise operations on data in ndarrays. Many ufuncs are simple element-wise transformations, like **sqrt** or **exp**:


In [None]:
arr = np.arange(10)
arr 

In [None]:
 np.sqrt(arr) 

In [None]:
 np.exp(arr)

These are referred to as *unary ufuncs*. Others, such as add or maximum, take two arrays (thus, binary ufuncs) and return a single array as the result:

In [None]:
x = np.random.randn(8)
y = np.random.randn(8)

print(x)
print()
print(y)

In [None]:
 np.maximum(x, y) 

Here, **numpy.maximum** computed the element-wise maximum of the elements in *x* and *y*. 

While not common, a ufunc can return multiple arrays. **modf** is one example, a vectorized version of the built-in Python **divmod**; it returns the fractional and integral parts of a floating-point array:


In [None]:
arr = np.random.randn(7) * 5
arr 

In [None]:
remainder, whole_part = np.modf(arr)
remainder 

In [None]:
whole_part
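
With fixed inputs the behavior of **modf** is easier to follow. In this sketch (the values are chosen for illustration), both returned arrays carry the sign of the input, and adding them recovers the original values.

```python
import numpy as np

values = np.array([3.5, -1.25, 0.0, 7.0])
frac, whole = np.modf(values)

print(frac.tolist())   # [0.5, -0.25, 0.0, 0.0]
print(whole.tolist())  # [3.0, -1.0, 0.0, 7.0]
print(np.allclose(frac + whole, values))  # True
```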

Ufuncs accept an optional out argument that allows them to operate in-place on arrays:

In [None]:
arr

In [None]:
np.sqrt(arr)

In [None]:
np.sqrt(arr, arr)

In [None]:
arr
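
Since the random array above can contain negative values, whose square roots are NaN, the in-place behavior is clearer on non-negative data. This sketch passes the output array by keyword; the input values are illustrative.

```python
import numpy as np

arr = np.array([1.0, 4.0, 9.0, 16.0])

roots = np.sqrt(arr)    # returns a new array; arr is unchanged at this point
np.sqrt(arr, out=arr)   # writes the result back into arr

print(roots.tolist())  # [1.0, 2.0, 3.0, 4.0]
print(arr.tolist())    # [1.0, 2.0, 3.0, 4.0]
```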

See Tables 3-2 and 3-3 for a listing of available ufuncs.

<br>
<center>Table 3.2: Unary *ufuncs*  </center>
<img src="Table3.2.jpg", style="width: 800px";>

<br>
<center>Table 3.3: Binary *ufuncs*  </center>
<img src="Table3.3.jpg", style="width: 800px";>


## 3.3 Array-Oriented Programming with Arrays 

Using NumPy arrays enables you to express many kinds of data processing tasks as concise array expressions that might otherwise require writing loops. This practice of replacing explicit loops with array expressions is commonly referred to as *vectorization*. In general, vectorized array operations will often be one or two (or more) orders of magnitude faster than their pure Python equivalents, with the biggest impact in any kind of numerical computations. 

As a simple example, suppose we wished to evaluate the function *sqrt(x^2 + y^2)* across a regular grid of values. The **np.meshgrid** function takes two 1D arrays and produces two 2D matrices corresponding to all pairs of *(x, y)* in the two arrays:


In [None]:
points = np.arange(-5, 5, 0.01) # 1000 equally spaced points
xs, ys = np.meshgrid(points, points)
ys 
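
On a grid this large the structure of the output is hard to see, so here is a sketch with two tiny input arrays (illustrative values). Each row of `xs` repeats the x values, and each row of `ys` holds a single y value.

```python
import numpy as np

x = np.array([0, 1, 2])
y = np.array([10, 20])
xs, ys = np.meshgrid(x, y)

print(xs.tolist())  # [[0, 1, 2], [0, 1, 2]]
print(ys.tolist())  # [[10, 10, 10], [20, 20, 20]]
print(xs.shape)     # (2, 3): one row per y value, one column per x value
```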

Now, evaluating the function is a matter of writing the same expression you would write with two points:

In [None]:
z = np.sqrt(xs ** 2 + ys ** 2)
z 

As a preview of Chapter 7 (Plotting and Visualization), we use matplotlib to create visualizations of this two-dimensional array. Here we use the matplotlib function imshow to create an image plot from a two-dimensional array of function values.

In [None]:
import matplotlib.pyplot as plt

plt.title(r"Image plot of $\sqrt{x^2 + y^2}$ for a grid of values")  # raw string keeps the backslash literal
plt.imshow(z, cmap=plt.cm.gray); plt.colorbar() 
plt.show()

### 3.3.1 Expressing Conditional Logic as Array Operations 

The **numpy.where** function is a vectorized version of the ternary expression *x if condition else y*. Suppose we had a boolean array and two arrays of values:


In [None]:
xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])

cond = np.array([True, False, True, True, False]) 

Suppose we wanted to take a value from *xarr* whenever the corresponding value in *cond* is True, and otherwise take the value from *yarr*. A list comprehension doing this might look like:

In [None]:
result = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
result

This has multiple problems. First, it will not be very fast for large arrays (because all the work is being done in interpreted Python code). Second, it will not work with multidimensional arrays. With **np.where** you can write this very concisely:

In [None]:
result = np.where(cond, xarr, yarr)
result 
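
The two approaches can be compared side by side. The sketch below confirms that they produce the same values and, as a small illustrative addition, that **np.where** works unchanged on a two-dimensional condition.

```python
import numpy as np

xarr = np.array([1.1, 1.2, 1.3, 1.4, 1.5])
yarr = np.array([2.1, 2.2, 2.3, 2.4, 2.5])
cond = np.array([True, False, True, True, False])

looped = [(x if c else y) for x, y, c in zip(xarr, yarr, cond)]
vectorized = np.where(cond, xarr, yarr)

print(vectorized.tolist())                 # [1.1, 2.2, 1.3, 1.4, 2.5]
print(np.array_equal(vectorized, looped))  # True

# A two-dimensional condition needs no extra code
cond2d = np.array([[True, False], [False, True]])
grid = np.where(cond2d, 1, 0)
print(grid.tolist())  # [[1, 0], [0, 1]]
```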

The second and third arguments to **np.where** don’t need to be arrays; one or both of them can be scalars. A typical use of where in data analysis is to produce a new array of values based on another array. Suppose you had a matrix of randomly generated data and you wanted to replace all positive values with 2 and all negative values with –2. This is very easy to do with np.where:

In [None]:
arr = np.random.randn(4, 4)
arr 

In [None]:
arr > 0

In [None]:
np.where(arr > 0, 2, -2)

You can combine scalars and arrays when using np.where. For example, you can replace all positive values in *arr* with the constant 2 like so:

In [None]:
 np.where(arr > 0, 2, arr) # set only positive values to 2 

### 3.3.2 Mathematical and Statistical Methods 

A set of mathematical functions that compute statistics about an entire array or about the data along an axis are accessible as methods of the array class. You can use aggregations (often called reductions) like *sum, mean,* and *std* (standard deviation) either by calling the array instance method or using the top-level NumPy function. Here we generate some normally distributed random data and compute some aggregate statistics:


In [None]:
arr = np.random.randn(5, 4)
arr

In [None]:
arr.mean()

In [None]:
np.mean(arr)

In [None]:
arr.sum()

Functions like **mean** and **sum** take an optional axis argument that computes the statistic over the given axis, resulting in an array with one fewer dimension:

In [None]:
arr.mean(axis=1) 

In [None]:
 arr.sum(axis=0) 

Here, *arr.mean(axis=1)* means “compute mean across the columns,” whereas *arr.sum(axis=0)* means “compute sum down the rows.”

Axes are defined for arrays with more than one dimension. A 2-dimensional array has two corresponding axes: the first running vertically downwards across rows (axis 0), and the second running horizontally across columns (axis 1).

Many operations can take place along one of these axes. For example, we can sum each row of an array, in which case we operate along columns, or axis 1:

In [None]:
x = np.arange(12).reshape((3,4))
x

In [None]:
x.sum(axis=0)

In [None]:
x.sum(axis=1)
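
A worked version of the two sums makes the axis convention concrete: the axis you pass is the one that gets collapsed. The names `col_sums` and `row_sums` are illustrative.

```python
import numpy as np

x = np.arange(12).reshape((3, 4))

col_sums = x.sum(axis=0)  # collapse axis 0 (the rows): one total per column
row_sums = x.sum(axis=1)  # collapse axis 1 (the columns): one total per row

print(col_sums.tolist())  # [12, 15, 18, 21]
print(row_sums.tolist())  # [6, 22, 38]
print(col_sums.shape, row_sums.shape)  # (4,) (3,)
```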

Other methods like **cumsum** and **cumprod** do not aggregate, instead producing an array of the intermediate results:

In [None]:
arr = np.array([0, 1, 2, 3, 4, 5, 6, 7])
arr.cumsum() 

In multidimensional arrays, accumulation functions like **cumsum** return an array of the same size, but with the partial aggregates computed along the indicated axis according to each lower dimensional slice:

In [None]:
arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])
arr 

In [None]:
arr.cumsum(axis=0)

In [None]:
arr.cumprod(axis=1)
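
Written out with their results, the two accumulations look as follows. The sketch also checks one useful relationship: the last row of the cumulative sum along axis 0 equals the ordinary sum along that axis.

```python
import numpy as np

arr = np.array([[0, 1, 2], [3, 4, 5], [6, 7, 8]])

down_rows = arr.cumsum(axis=0)     # running totals down each column
across_cols = arr.cumprod(axis=1)  # running products along each row

print(down_rows.tolist())    # [[0, 1, 2], [3, 5, 7], [9, 12, 15]]
print(across_cols.tolist())  # [[0, 0, 0], [3, 12, 60], [6, 42, 336]]
print(np.array_equal(down_rows[-1], arr.sum(axis=0)))  # True
```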

See Table 3-4 for a full listing of basic array statistical methods.

<br>
<center>Table 3.4: Basic array statistical methods </center>
<img src="Table3.4.jpg", style="width: 800px";>

### 3.3.3 Sorting 

Like Python’s built-in list type, NumPy arrays can be sorted in-place with the sort method:


In [None]:
arr = np.random.randn(6)
arr 

In [None]:
arr.sort()
arr

You can sort each one-dimensional section of values in a multidimensional array in-place along an axis by passing the axis number to **sort**:

In [None]:
arr = np.random.randn(5, 3)
arr 

In [None]:
arr.sort(axis=1)
arr 

The top-level function **np.sort** returns a sorted copy of an array instead of modifying the array in-place. A quick-and-dirty way to compute the quantiles of an array is to sort it and select the value at a particular rank:

In [None]:
large_arr = np.random.randn(1000)
large_arr.sort()

large_arr[int(0.05 * len(large_arr))] # 5% quantile 
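
The difference between the two sorting styles, and the rank-based quantile trick, can be checked on deterministic data. In this sketch the input values are illustrative; the reversed range is used so that the value at the 5% rank is known in advance.

```python
import numpy as np

arr = np.array([3.0, 1.0, 2.0])

sorted_copy = np.sort(arr)      # returns a new array
print(arr.tolist())             # [3.0, 1.0, 2.0]: the original is unchanged
print(sorted_copy.tolist())     # [1.0, 2.0, 3.0]

arr.sort()                      # sorts in-place and returns None
print(arr.tolist())             # [1.0, 2.0, 3.0]

# Rank-based 5% quantile on the values 0..999, supplied in reverse order
large_arr = np.arange(1000, dtype=np.float64)[::-1].copy()
large_arr.sort()
q05 = large_arr[int(0.05 * len(large_arr))]
print(q05)  # 50.0
```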