# Intro to NumPy

When it comes to number crunching, Python is actually considered quite slow. It's an interpreted language, which means that there's quite a bit of overhead associated with looping through a list of numbers and calculating, say, a simple sum. Despite this, Python has become a popular language for numerical work, and this is due to the existence of `numpy`.

## NumPy arrays are performant for numerical operations

Let's import `numpy`; this creates a `module` object in our Python session, including in it all that `numpy` has to offer:

In [1]:
import numpy as np

The core functionality of `numpy` comes in the form of the array data structure it includes. We can create an array of integers, for example, by using `np.array` and feeding it a list of integers:

In [2]:
arr = np.array([1, 3, 7, 9])

In [3]:
arr

array([1, 3, 7, 9])

A distinguishing characteristic of a `numpy` array is that all of its elements are of the same data type. In this case, we have an array of 64-bit integers, which is the default integer type on most platforms:

In [4]:
arr.dtype

dtype('int64')
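As a small sketch of how the single data type is chosen, the example below requests a dtype explicitly, lets `numpy` pick one for mixed input, and converts with `astype`. The variable names (`ints`, `mixed`, `as_float`) are ours, chosen only for illustration:

```python
import numpy as np

# Request a dtype explicitly instead of relying on the platform default.
ints = np.array([1, 3, 7, 9], dtype=np.int64)

# Mixing integers and floats: numpy picks one dtype that can hold every value.
mixed = np.array([1, 3, 7.5, 9])

# astype returns a new array converted to the requested dtype.
as_float = ints.astype(np.float64)

print(ints.dtype)      # int64
print(mixed.dtype)     # float64
print(as_float.dtype)  # float64
```

Passing `dtype` explicitly is one way to keep results consistent across machines where the default integer size may differ.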

The elements of an array are arranged in a computer's physical memory in a contiguous block. This allows for fast computation since modern CPUs will read as much contiguous data from memory as they can fit in their cache, which means that neighboring elements of a `numpy` array will already be in cache when a computation is requested on a given element. The result is faster computation than we would expect from, say, a Python `list`.
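The memory layout described above can be inspected directly on an array. This is a minimal sketch; `arr64` is an illustrative name, and the dtype is set explicitly so the byte counts are the same on every platform:

```python
import numpy as np

arr64 = np.array([1, 3, 7, 9], dtype=np.int64)

print(arr64.itemsize)               # bytes per element: 8
print(arr64.nbytes)                 # total bytes of data: 32
print(arr64.strides)                # bytes to step to the next element: (8,)
print(arr64.flags["C_CONTIGUOUS"])  # True: stored as one contiguous block
```

A stride equal to the item size means that each element sits immediately after the previous one in memory.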

As a point of comparison, we'll sum a million integers in a `list`, and do the same with an `array`, and see how much time this takes:

In [5]:
biglist = list(range(1000000))

In [6]:
%%timeit 

sum(biglist)

6.17 ms ± 258 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [7]:
bigarray = np.arange(1000000)

In [8]:
bigarray

array([     0,      1,      2, ..., 999997, 999998, 999999])

In [9]:
%%timeit

np.sum(bigarray)

659 µs ± 2.24 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


For this simple test, the sum on the `numpy` array was roughly 10 times faster than that on the `list` (659 µs versus 6.17 ms). Speed differences such as this add up very quickly when doing many operations on numerical data, so the benefits of `numpy` are hard to overstate.
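Outside of IPython, where the `%%timeit` magic is not available, a similar comparison can be sketched with the standard-library `timeit` module. The repetition count of 20 is an arbitrary choice, and the timings will differ from machine to machine, so no expected numbers are shown:

```python
import timeit

import numpy as np

n = 1_000_000
biglist = list(range(n))
bigarray = np.arange(n, dtype=np.int64)

# Both approaches should agree on the result before we compare their speed.
assert sum(biglist) == int(np.sum(bigarray))

repeats = 20  # arbitrary; more repeats give steadier averages
list_time = timeit.timeit(lambda: sum(biglist), number=repeats)
array_time = timeit.timeit(lambda: np.sum(bigarray), number=repeats)

print(f"list:  {list_time / repeats * 1e3:.3f} ms per call")
print(f"array: {array_time / repeats * 1e3:.3f} ms per call")
```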

## Working with arrays

`numpy` arrays have a number of methods attached to them. For example, these aggregating methods each reduce the array to a single value:

In [10]:
arr.sum()

20

In [11]:
arr.prod()

189

In [12]:
arr.min()

1

In [13]:
arr.max()

9

There are also attributes like `shape`, which tells us the shape of our array as a tuple. Because it is an attribute rather than a method, we access it without parentheses:

In [14]:
arr.shape

(4,)

In this case, we are dealing with a one-dimensional `numpy` array of length 4. This is similar to a mathematical *vector*.

We can reshape an array with the `reshape` method:

In [15]:
arr.reshape(2, 2)

array([[1, 3],
       [7, 9]])

Importantly, this *does not* change the existing array; `reshape` gives back a new array object with the new shape. Where possible, that new array is a *view* onto the same underlying data rather than a copy, which keeps reshaping cheap even for large arrays. To keep the reshaped version, we assign the result to a name:

In [16]:
arr_n = arr.reshape(2, 2)

In [17]:
arr_n

array([[1, 3],
       [7, 9]])

In [18]:
arr

array([1, 3, 7, 9])
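One way to check what `reshape` does with the underlying data is `np.shares_memory`. The sketch below uses illustrative names (`base`, `reshaped`, `independent`); for a simple contiguous array like this one, `reshape` returns a view, so writes through the reshaped array are visible in the original:

```python
import numpy as np

base = np.array([1, 3, 7, 9])
reshaped = base.reshape(2, 2)

print(base.shape)                        # (4,)
print(reshaped.shape)                    # (2, 2)
print(np.shares_memory(base, reshaped))  # True: same data, different shape

# Writing through the view changes the original as well.
reshaped[0, 0] = 100
print(base.tolist())                     # [100, 3, 7, 9]

# An explicit copy is independent of the original.
independent = base.reshape(2, 2).copy()
independent[0, 0] = -1
print(base[0])                           # 100
```

Calling `.copy()` is the usual way to get an array that can be modified without affecting the one it came from.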

`numpy` also features standalone functions such as `arange`, which functions similarly to the built-in `range` but gives `numpy` arrays:

In [19]:
arr = np.arange(10, 46)

In [20]:
arr

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45])

We can use `reshape` to turn this into a new 2-dimensional array with 3 rows. We only need to specify the length of one dimension; passing `-1` for the other tells `reshape` to work out its length from the total number of elements:

In [21]:
arr = arr.reshape((3, -1))

In [22]:
arr

array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
       [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]])
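The `-1` placeholder works for either dimension, as long as the total number of elements divides evenly. A brief sketch, with `values` and `error_message` as illustrative names:

```python
import numpy as np

values = np.arange(10, 46)           # 36 elements

three_rows = values.reshape(3, -1)   # 36 / 3 = 12 columns
four_cols = values.reshape(-1, 4)    # 36 / 4 = 9 rows
print(three_rows.shape)              # (3, 12)
print(four_cols.shape)               # (9, 4)

# 36 is not divisible by 5, so no matching shape exists and numpy raises.
error_message = None
try:
    values.reshape(5, -1)
except ValueError as err:
    error_message = str(err)
print("reshape failed:", error_message)
```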

If needed, we can obtain a 1-D array from an array of any dimension with the `flatten` method:

In [23]:
arr.flatten()

array([10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21, 22, 23, 24, 25, 26,
       27, 28, 29, 30, 31, 32, 33, 34, 35, 36, 37, 38, 39, 40, 41, 42, 43,
       44, 45])
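Note that `flatten` always returns a copy of the data. A short sketch to confirm this, using the illustrative names `grid` and `flat`:

```python
import numpy as np

grid = np.arange(10, 46).reshape(3, -1)
flat = grid.flatten()

print(flat.shape)                    # (36,)
print(np.shares_memory(grid, flat))  # False: flatten made a copy

# Changing the flattened copy leaves the 2-D array untouched.
flat[0] = -1
print(grid[0, 0])                    # 10
```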

## Array arithmetic

Arithmetic operations with arrays occur *element-wise*. Multiplying by 3 and subtracting 2 gives a new array with that operation performed on each element individually:

In [24]:
(arr * 3) - 2

array([[ 28,  31,  34,  37,  40,  43,  46,  49,  52,  55,  58,  61],
       [ 64,  67,  70,  73,  76,  79,  82,  85,  88,  91,  94,  97],
       [100, 103, 106, 109, 112, 115, 118, 121, 124, 127, 130, 133]])

This allows us to write code treating an array as if it were a single number! For example, the Fahrenheit to Celsius converter that we wrote last lesson works just fine, as written, on a `numpy` array of temperatures:

In [25]:
def fahr_to_cels(temp):
    return (100 / 180) * (temp - 32)

In [26]:
fahr_to_cels(arr)

array([[-12.22222222, -11.66666667, -11.11111111, -10.55555556,
        -10.        ,  -9.44444444,  -8.88888889,  -8.33333333,
         -7.77777778,  -7.22222222,  -6.66666667,  -6.11111111],
       [ -5.55555556,  -5.        ,  -4.44444444,  -3.88888889,
         -3.33333333,  -2.77777778,  -2.22222222,  -1.66666667,
         -1.11111111,  -0.55555556,   0.        ,   0.55555556],
       [  1.11111111,   1.66666667,   2.22222222,   2.77777778,
          3.33333333,   3.88888889,   4.44444444,   5.        ,
          5.55555556,   6.11111111,   6.66666667,   7.22222222]])

Arithmetic between arrays is also element-wise. If we add two arrays of the same shape together, elements in the corresponding row-column are added to each other in the resulting array:

In [27]:
arr

array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
       [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]])

In [28]:
arr + arr

array([[20, 22, 24, 26, 28, 30, 32, 34, 36, 38, 40, 42],
       [44, 46, 48, 50, 52, 54, 56, 58, 60, 62, 64, 66],
       [68, 70, 72, 74, 76, 78, 80, 82, 84, 86, 88, 90]])

This makes calculating quantities with arrays incredibly concise. It's not necessary to loop through the elements of an array in Python to calculate something with each element. Instead, we can treat an array as if it were a single quantity, and calculations are *fast*.
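To make the comparison concrete, the sketch below computes the same conversion twice: once with a single array expression and once with explicit Python loops. The names `temps`, `vectorized`, and `looped` are ours; the function is the one defined above:

```python
import numpy as np


def fahr_to_cels(temp):
    return (100 / 180) * (temp - 32)


temps = np.arange(10, 46).reshape(3, -1)

# Array version: one expression, applied to every element.
vectorized = fahr_to_cels(temps)

# Loop version: visit each row and column index by hand.
looped = np.empty(temps.shape)
for i in range(temps.shape[0]):
    for j in range(temps.shape[1]):
        looped[i, j] = fahr_to_cels(temps[i, j])

print(np.allclose(vectorized, looped))  # True
```

Both versions give the same values; the array version is shorter and avoids the per-element overhead of the Python loop.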

## Indexing and slicing arrays

There are times when we need to select a subset of the elements in an array to operate on. As with other data structures in Python, `numpy` arrays use 0-based indexing, and we can select elements with square (`[]`) brackets:

In [29]:
arr

array([[10, 11, 12, 13, 14, 15, 16, 17, 18, 19, 20, 21],
       [22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]])

Selecting element 1 from this array gives the row at index 1 as an array:

In [30]:
arr[1]

array([22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33])

To select a single element from the array, we also need to specify the column:

In [31]:
arr[1, 5]

27

We can select multiple rows with slices; for example, all rows starting with row 1 to the end of the array:

In [32]:
arr[1:]

array([[22, 23, 24, 25, 26, 27, 28, 29, 30, 31, 32, 33],
       [34, 35, 36, 37, 38, 39, 40, 41, 42, 43, 44, 45]])

We can slice the columns as well. The slicing below will select row 1 onward, but only elements in columns 3 up to and not including column 5:

In [33]:
arr[1:, 3:5]

array([[25, 26],
       [37, 38]])
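A few more slicing patterns follow the same rules. This is a small sketch on an array with the same contents as above; the names (`grid`, `first_col`, and so on) are illustrative:

```python
import numpy as np

grid = np.arange(10, 46).reshape(3, 12)

first_col = grid[:, 0]       # every row, column 0
last_row = grid[-1]          # negative indices count from the end
every_other = grid[0, ::2]   # row 0, every second column
corner = grid[:2, :2]        # rows 0-1, columns 0-1

print(first_col.tolist())    # [10, 22, 34]
print(last_row[:3].tolist()) # [34, 35, 36]
print(every_other.tolist())  # [10, 12, 14, 16, 18, 20]
print(corner.tolist())       # [[10, 11], [22, 23]]
```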

### Challenge: create an array of shape (4, 5) starting with `np.arange(20)`, then obtain the mean value of the elements in rows 1 through 2 and columns 1 through 3, both inclusive (zero-based).

We could do:

In [34]:
# create array with proper shape
my_arr = np.arange(20).reshape(4, 5)

# grab subselection of rows, columns
subsel = my_arr[1:3, 1:4]

# calculate mean
subsel.mean()

9.5

But we could also accomplish this in one line, since each method call/slice returns an array, and finally `mean` gives the single number as a result:

In [35]:
np.arange(20).reshape((4, -1))[1:3, 1:4].mean()

9.5

`numpy` also features a function called `mean` which does the same thing as the array method:

In [36]:
np.mean(np.arange(20).reshape((4, -1))[1:3, 1:4])

9.5
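To see where 9.5 comes from, it can help to look at the selected elements before averaging them. The sketch below repeats the selection with illustrative names and checks the arithmetic by hand:

```python
import numpy as np

my_arr = np.arange(20).reshape(4, 5)
subsel = my_arr[1:3, 1:4]

print(subsel.tolist())           # [[6, 7, 8], [11, 12, 13]]
print(subsel.sum())              # 57
print(subsel.size)               # 6
print(subsel.sum() / subsel.size)  # 9.5

# The method and the function give the same answer.
print(subsel.mean() == np.mean(subsel))  # True
```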

## Making arrays from scratch

There are several ways we can create arrays with different contents, for different purposes. We've already seen how to make an array from a list or using `np.arange`, but here we point out some other useful ways of creating an array.

If we want a 2-D array of a particular set of numbers, we can give the `np.array` function a list of lists:

In [37]:
np.array([[1,2,3], [4, 5, 6]])

array([[1, 2, 3],
       [4, 5, 6]])

`np.zeros` will create an array of all zeros with the specified shape:

In [38]:
np.zeros(100)

array([0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.,
       0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0., 0.])

Likewise, `np.ones` will give us all ones:

In [39]:
np.ones(100)

array([1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.,
       1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1., 1.])
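Both functions also accept a tuple for multi-dimensional shapes and an optional `dtype`. By default they produce floating-point values, which is why the outputs above show `0.` and `1.`. A minimal sketch with illustrative names:

```python
import numpy as np

zeros_2d = np.zeros((2, 3))                 # shape given as a tuple
ones_int = np.ones((2, 3), dtype=np.int64)  # integers instead of floats

print(zeros_2d.shape)     # (2, 3)
print(zeros_2d.dtype)     # float64
print(ones_int.dtype)     # int64
print(ones_int.tolist())  # [[1, 1, 1], [1, 1, 1]]
```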

`np.random` is a submodule that includes functions for sampling values from various probability distributions, for example, a [standard normal distribution](https://en.wikipedia.org/wiki/Normal_distribution#Standard_normal_distribution):

In [40]:
np.random.randn(100)

array([ 1.19041461,  1.29098655,  1.28217756,  0.45223381, -0.24914508,
       -0.21115722, -1.32784828,  0.33474804, -0.60466281, -1.09885189,
       -1.14628019, -0.20876934,  0.73866908, -1.57987507,  0.59418876,
        0.43004531, -0.96119315,  0.80126323,  0.45154057, -0.31806986,
       -1.22452201, -0.11402925, -1.57626672,  1.19156359, -0.99368864,
       -0.82099168,  1.1614733 , -0.07329234,  1.42278427, -1.29304768,
        1.2357342 , -0.28743795,  1.72575906,  0.75259497, -0.02571497,
        2.12379036,  0.61442886,  0.80487883, -0.98747478,  0.03396688,
        2.01824579,  0.54437396, -0.74633589,  1.32337983, -1.22117364,
       -0.5960966 ,  0.10543284, -0.70862352, -2.45010706, -1.49213963,
        0.3470624 ,  1.64095276, -0.4187089 , -1.02567218, -1.70699153,
        0.68849978, -2.19110112, -0.71500508, -1.01256546,  0.20110122,
        0.75461731,  0.93530883,  0.51712423, -0.61531593, -0.50129879,
        0.09739117,  0.87829891, -0.54483914,  0.25858596, -0.25
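Random draws differ on every run, which can make results hard to reproduce. One option, shown here as a sketch rather than as the approach used above, is a seeded generator created with `np.random.default_rng`; the seed value 42 is arbitrary:

```python
import numpy as np

# A generator with a fixed seed produces the same sequence on every run.
rng = np.random.default_rng(seed=42)
samples = rng.standard_normal(100)

print(samples.shape)                  # (100,)
print(samples.mean(), samples.std())  # near 0 and 1, but not exactly

# A second generator with the same seed reproduces the same draws.
rng_again = np.random.default_rng(seed=42)
print(np.array_equal(samples, rng_again.standard_normal(100)))  # True
```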

## Boolean indexing

Selecting values from a `numpy` array can also be done using *boolean indexing*.

In [41]:
arr = np.arange(20)
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

Comparison operations, such as "greater-than" (`>`), give us back a `numpy` array composed of bools:

In [42]:
arr > 7

array([False, False, False, False, False, False, False, False,  True,
        True,  True,  True,  True,  True,  True,  True,  True,  True,
        True,  True])

As with arithmetic operations, this operation is performed element-wise, so we have a `True` for each element that was greater than 7, and `False` for each element that was not. We can use this boolean array as if it were an index to our array:

In [43]:
arr[arr > 7]

array([ 8,  9, 10, 11, 12, 13, 14, 15, 16, 17, 18, 19])

The result is an array featuring only elements for which there was a `True` in the boolean array. This kind of behavior allows us to filter values from an array to get only those matching some criteria.
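Boolean arrays can also be combined to express more than one criterion. The sketch below uses the element-wise operators `&` (and), `|` (or), and `~` (not); each comparison needs its own parentheses because these operators bind more tightly than `>` or `<`. The variable names are illustrative:

```python
import numpy as np

values = np.arange(20)
mask = values > 7

# True counts as 1, so summing a mask counts the matching elements.
print(mask.sum())               # 12

even_and_large = values[(values > 7) & (values % 2 == 0)]
small_or_large = values[(values < 3) | (values > 16)]
not_large = values[~mask]

print(even_and_large.tolist())  # [8, 10, 12, 14, 16, 18]
print(small_or_large.tolist())  # [0, 1, 2, 17, 18, 19]
print(not_large.tolist())       # [0, 1, 2, 3, 4, 5, 6, 7]
```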

## Fancy indexing

If we have a one-dimensional array, we can perform so-called *fancy indexing* with a list of the index values we want to select:

In [44]:
arr

array([ 0,  1,  2,  3,  4,  5,  6,  7,  8,  9, 10, 11, 12, 13, 14, 15, 16,
       17, 18, 19])

In [45]:
arr[[-1, -1, 3, 5, -2]]

array([19, 19,  3,  5, 18])

This is useful when we know precisely which indices we want to select out of an existing array. There is no need to select each element from a set of indices in a loop.
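For comparison, the sketch below selects the same elements with a loop and with fancy indexing, then shows one way the idea extends to a 2-D array by pairing row and column indices. The names (`values`, `indices`, `grid`, `pairs`) are illustrative:

```python
import numpy as np

values = np.arange(20)
indices = [-1, -1, 3, 5, -2]

# Fancy indexing and an explicit loop select the same elements.
picked = values[indices]
looped = np.array([values[i] for i in indices])
print(picked.tolist())                 # [19, 19, 3, 5, 18]
print(np.array_equal(picked, looped))  # True

# In 2-D, two index lists are paired up: (0, 1), (2, 1), (3, 4).
grid = np.arange(20).reshape(4, 5)
pairs = grid[[0, 2, 3], [1, 1, 4]]
print(pairs.tolist())                  # [1, 11, 19]
```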

## NumPy arrays are the core of the scientific Python ecosystem

In the lessons that follow, we'll get exposure to more tools in the scientific Python ecosystem. All of these tools utilize `numpy` arrays, making `numpy` and its array data structure important to become familiar with. Whether we are using `pandas` for analyzing tabular data or scikit-learn for applying a machine-learning algorithm, `numpy` arrays are at the core of how these tools work.