## NumPy arrays vs. Python lists

NumPy (Numerical Python, pronounced as "num-pie") is an open-source Python library for efficient and convenient manipulation of large multi-dimensional arrays of data. It comes as part of the Anaconda installation of Python.

Python's built-in lists offer similar functionality. However, lists are slow with large datasets: to apply a calculation to the data in a list, one has to write code that processes the elements one by one, and that loop runs in the Python interpreter. NumPy applies the same operation to a whole array at once in compiled code, which is both much faster and requires less code.
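As a small illustration of the difference, the sketch below doubles three numbers, first with a list and then with an array. The variable names are made up for this example.

```python
import numpy as np

values = [1, 2, 3]

# with a list, * means repetition, not arithmetic
repeated = values * 2
print(repeated)  # [1, 2, 3, 1, 2, 3]

# element-wise arithmetic on a list needs an explicit loop or comprehension
doubled_list = [v * 2 for v in values]
print(doubled_list)  # [2, 4, 6]

# with a NumPy array, * is applied to every element at once
doubled_array = np.array(values) * 2
print(doubled_array)  # [2 4 6]
```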

Consider the following example. We'd like to calculate the [Body Mass Index](https://en.wikipedia.org/wiki/Body_mass_index) (BMI) for children in a certain class. The formula for BMI, with weight in kilograms and height in metres, is:

$$\mathrm{BMI} = \frac{\mathrm{weight}}{\mathrm{height}^2}$$
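For a single child, the formula amounts to one line of arithmetic. The weight and height below are hypothetical values, chosen only to illustrate the calculation.

```python
# hypothetical values, for illustration only
weight_kg = 70.0
height_m = 1.75

# BMI = weight / height squared
bmi = weight_kg / height_m ** 2
print(round(bmi, 2))  # 22.86
```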

Suppose we have collected data on the weight and height of every child in the class:

In [12]:
import random


# generate 300 random weight values, in kg
weights = [70+random.random()*10 for x in range(300)]

# print the first 10 values
weights[:10]

[70.99135134613645,
 73.51383874092978,
 79.9922328176173,
 71.07823649266399,
 78.18269246905174,
 72.16296886822036,
 77.70079513353403,
 75.14316403992761,
 74.15173246583876,
 74.67616711147245]

In [30]:
# height values, in m
heights = [1.5+random.random()/3 for x in range(300)]
heights[:10]

[1.6901929076940627,
 1.7972063640844302,
 1.6946736367654232,
 1.5631308030109219,
 1.7720819087392763,
 1.6694078461660231,
 1.6464523437298846,
 1.6613386911464072,
 1.6359037178294615,
 1.7391928868774658]

Given the data, we can calculate the BMI of every child, using the following code with Python lists:

In [8]:
def get_bmis(weights, heights):
    """The function takes two lists for weights and heights and returns a list with BMIs.
    """
    bmis = []
    for w, h in zip(weights, heights):
        bmi = w/h**2
        bmis.append(bmi)
    return bmis

Let's measure how long the function takes to do the calculations, using the **IPython "magic"** command `%timeit`. The command executes the given line of code in several runs (seven by default), each run consisting of a large number of loops, and reports the mean and standard deviation of the time per loop.

In [9]:
%timeit get_bmis(weights, heights)

121 µs ± 16.2 µs per loop (mean ± std. dev. of 7 runs, 10000 loops each)


Each time you run the cell, the result will be somewhat different, but on the machine used here it is around 120 microseconds per loop.
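Outside IPython or Jupyter, a similar measurement can be made with the `timeit` module from the standard library. The following is a rough sketch rather than an exact equivalent of the magic command: it repeats the data and function from above so that it runs on its own, and the loop count of 1000 is an arbitrary choice to keep it quick.

```python
import random
import timeit

# same kind of data as above: 300 weights (kg) and 300 heights (m)
weights = [70 + random.random() * 10 for _ in range(300)]
heights = [1.5 + random.random() / 3 for _ in range(300)]

def get_bmis(weights, heights):
    """Return a list of BMIs computed from two lists."""
    bmis = []
    for w, h in zip(weights, heights):
        bmis.append(w / h ** 2)
    return bmis

n_loops = 1000  # arbitrary; %timeit picks this number automatically

# 7 runs, each calling the function n_loops times; returns total seconds per run
run_totals = timeit.repeat(lambda: get_bmis(weights, heights), repeat=7, number=n_loops)

# convert each run's total into seconds per loop
per_loop = [total / n_loops for total in run_totals]
print(f"fastest run: {min(per_loop) * 1e6:.1f} µs per loop")
```

The numbers will depend on the machine, just as with `%timeit`.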

Now, let's implement the same using NumPy. We'll first need to convert Python lists to NumPy arrays.

In [33]:
# first import numpy
# it is customary to use the alias "np", as it needs to be typed often
import numpy as np

# convert Python lists to NumPy arrays
np_heights = np.array(heights)
np_weights = np.array(weights)

The BMIs can now be calculated using just one line of code:

In [34]:
%timeit bmis = np_weights/np_heights**2

1.86 µs ± 19.8 ns per loop (mean ± std. dev. of 7 runs, 100000 loops each)


Again, the duration will be slightly different each time you run this cell, but here it is just under 2 microseconds. In other words, the NumPy implementation is roughly 65 times faster.
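A speed comparison only makes sense if both versions produce the same result. One way to check this, sketched below with freshly generated data so that the example runs on its own, is `np.allclose`, which compares two sequences of floats within a small tolerance.

```python
import random
import numpy as np

# same kind of data as above
weights = [70 + random.random() * 10 for _ in range(300)]
heights = [1.5 + random.random() / 3 for _ in range(300)]

# BMIs computed element by element with plain Python
list_bmis = [w / h ** 2 for w, h in zip(weights, heights)]

# BMIs computed on whole arrays with NumPy
array_bmis = np.array(weights) / np.array(heights) ** 2

# check that the two results agree (within floating-point tolerance)
print(np.allclose(list_bmis, array_bmis))  # True
```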

Incidentally, the `%timeit` magic command is a handy way to choose between alternative implementations of the same functionality when execution time is a major factor.

## Creating a NumPy array

A NumPy array can be created from a Python list using the `array` function, as you saw above:

In [10]:
a = np.array([1, 2, 3])
a

array([1, 2, 3])

In [11]:
# check the type of the variable
type(a)

numpy.ndarray

An array can be multi-dimensional:

In [12]:
a = np.array([[1, 2, 3], [4, 5, 6]])
a

array([[1, 2, 3],
       [4, 5, 6]])

An array can also be generated as a range of values, using the `arange` function:

In [13]:
a = np.arange(0, 10)
a

array([0, 1, 2, 3, 4, 5, 6, 7, 8, 9])
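Like Python's `range`, `arange` accepts an optional step as a third argument. When the number of values matters more than the step, the related `linspace` function may be a better fit, because it includes the end point. A short sketch (variable names are arbitrary):

```python
import numpy as np

# every second value from 0 up to, but not including, 10
evens = np.arange(0, 10, 2)
print(evens)  # [0 2 4 6 8]

# 5 evenly spaced values from 0 to 1, end point included
points = np.linspace(0, 1, 5)
print(points)  # [0.   0.25 0.5  0.75 1.  ]
```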

An array can also be filled with random numbers:

In [14]:
# generate an array with random values, in two rows and two columns
a = np.random.random((2, 2))
a

array([[0.33467815, 0.04857383],
       [0.65524032, 0.34739434]])
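The values above change on every run. If reproducible values are needed, one option is a seeded generator created with `np.random.default_rng`. This is an alternative to the `np.random.random` call used above, and the seed 42 is an arbitrary choice.

```python
import numpy as np

# two generators created with the same seed
rng = np.random.default_rng(seed=42)
rng_again = np.random.default_rng(seed=42)

# both produce the same 2x2 array of values in [0, 1)
a = rng.random((2, 2))
b = rng_again.random((2, 2))
print(np.array_equal(a, b))  # True
```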

One can create an array where all elements are zeros, using the `zeros` function, which takes a tuple that specifies the required shape of the array:

In [15]:
a = np.zeros((2, 3))
a

array([[0., 0., 0.],
       [0., 0., 0.]])

Similarly, the `ones` function creates an array filled with ones:

In [16]:
a = np.ones((2, 3))
a

array([[1., 1., 1.],
       [1., 1., 1.]])

One can set the data type of the array's elements by supplying the `dtype` argument:

In [17]:
a = np.array([0, 1, 2, 3, 4], dtype=np.int64)
a

array([0, 1, 2, 3, 4], dtype=int64)

Here is a list of numerical data types that a NumPy array can hold:
    
<img src="http://vpekar.github.io/images/session9_1.png" width="450">

The choice of data type is a trade-off between the precision and range of the values that can be stored and the memory the array occupies.
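The memory side of this trade-off can be inspected with the `itemsize` and `nbytes` attributes of an array. The sketch below compares two arrays of 1000 zeros; the names `a64` and `a32` are made up for this example.

```python
import numpy as np

a64 = np.zeros(1000, dtype=np.float64)
a32 = np.zeros(1000, dtype=np.float32)

# bytes per element
print(a64.itemsize)  # 8
print(a32.itemsize)  # 4

# total bytes used by the array's data
print(a64.nbytes)  # 8000
print(a32.nbytes)  # 4000

# the smaller type stores 0.1 less precisely, so the two values differ
print(np.float32(0.1) == np.float64(0.1))  # False
```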

## Attributes of a NumPy array

There are two commonly used attributes of a NumPy array: shape and size.
    
The **shape** of an array is the number of elements along each of its dimensions. The `shape` attribute holds this as a tuple, with one entry per dimension.

For example:

In [18]:
# a "one-dimensional" array, i.e. a flat list of elements, with 3 elements
a = np.array([1, 2, 3])
a.shape

(3,)

In [19]:
# a two-dimensional array, with 3 rows and 2 columns
a = np.array([[1, 2], [3, 4], [5, 6]])
a.shape

(3, 2)

Note that for a two-dimensional array, the first element of the shape tuple is the number of rows, and the second is the number of columns.

In [20]:
# a two-dimensional array, with 2 rows and 3 columns
a = np.array([[1, 2, 3], [4, 5, 6]])
a.shape

(2, 3)

In [21]:
# a three-dimensional array
a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])
a.shape

(3, 2, 2)

The **size** attribute refers to the total number of elements in the array:

In [22]:
# there are 12 elements in the last array
a.size

12
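The two attributes are related: the size is the product of the entries in the shape tuple. The sketch below checks this for the three-dimensional array from above and also shows `ndim`, a further attribute that holds the number of dimensions.

```python
import math
import numpy as np

a = np.array([[[1, 2], [3, 4]], [[5, 6], [7, 8]], [[9, 10], [11, 12]]])

print(a.shape)             # (3, 2, 2)
print(a.size)              # 12
print(math.prod(a.shape))  # 12, i.e. 3 * 2 * 2

# number of dimensions, equal to the length of the shape tuple
print(a.ndim)  # 3

# len() returns only the number of elements along the first dimension
print(len(a))  # 3
```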