# How to write efficient code

In this notebook, you will learn about
- Slicing
- Views/copies
- Advanced indexing
- How to reshape, flatten and increase the dimensions of an array

---

## 1. What is efficient code?

NumPy was created with the goal of making scientific computing in Python possible (and with good performance). While its high-level Python syntax makes it accessible and easy to learn, the core of NumPy is well-optimized C code. When we say efficient NumPy code, we mean taking advantage of the structure and C-level implementation of arrays, operations and functions as much as possible and avoiding extra computational cost. Let's see how to take full advantage of this efficient implementation.

We will continue using the example of [Notebook 1](01_Intro.ipynb):

In [None]:
import numpy as np
import pandas as pd

quality_of_life = pd.read_csv('../data/quality_of_life_index.csv')

This time, let's select one 1D array containing only the Quality of Life index, and let's build another 2D array containing Quality of Life, Cost of Living and Pollution indices.

In [None]:
quality_index = np.array(quality_of_life['Quality of Life Index'])
quality_cost_pollution = np.array(quality_of_life[['Quality of Life Index', 'Cost of Living Index', 'Pollution Index']]) 

## 2. Operations and built-in utilities

There are several built-in utilities that can be applied to a NumPy array. For example, we can compute the maximum and minimum values of an array using

In [None]:
np.amax(quality_index), np.amin(quality_index)

Note that when applying these functions to an array with more than one axis, we can pick which axis to compute the maximum or minimum for. Take, for example, our `quality_cost_pollution` array. Let's say we want to compute the *maximum along all rows*.

**Note** There is a source of confusion that may arise with the expression *along all rows*. One way to think about this is to reason that the axis selected in the function call is the axis to be collapsed at the end of the operation. For example:

In [None]:
quality_cost_pollution.shape

If we compute the maximum over `axis=0`, the axis corresponding to the rows is collapsed, leaving one value per column:

In [None]:
np.amax(quality_cost_pollution, axis=0)

If we select `axis=1`, the axis corresponding to the columns will be collapsed:

In [None]:
np.amax(quality_cost_pollution, axis=1)

As expected,

In [None]:
np.amax(quality_cost_pollution, axis=1).shape

If no axis is selected, these functions compute the result over the *flattened* array - meaning they compare every element, disregarding the dimensions of the array.

In [None]:
np.amax(quality_cost_pollution)
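
To make the "collapsed axis" rule concrete, here is a minimal sketch on a small made-up array. The name `toy` and its values are hypothetical and are not taken from the dataset.

```python
import numpy as np

# Hypothetical 2x3 array: 2 rows, 3 columns
toy = np.array([[1, 5, 3],
                [4, 2, 6]])

max_per_column = np.amax(toy, axis=0)  # collapses the row axis -> shape (3,)
max_per_row = np.amax(toy, axis=1)     # collapses the column axis -> shape (2,)
max_overall = np.amax(toy)             # no axis: maximum over all elements

print(max_per_column.tolist())  # [4, 5, 6]
print(max_per_row.tolist())     # [5, 6]
print(max_overall)              # 6
```

The shape of each result is the original shape `(2, 3)` with the selected axis removed.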

Other useful functions include:

In [None]:
np.mean(quality_index)  # Arithmetic mean of all elements (no axis given)

In [None]:
mean_indices = np.mean(quality_cost_pollution, axis=0)
print(f'The mean for the Quality of Life index is {mean_indices[0]}')
print(f'The mean for the Cost of Living index is {mean_indices[1]}')
print(f'The mean for the Pollution index is {mean_indices[2]}')

Now, let's say we want to compute the sum of all the elements in the array. We can use the `np.sum` function to do this:

In [None]:
np.sum(quality_index)

Note also that, if our array has more than one axis, we can tell `np.sum` which one to sum over using the `axis` keyword:

In [None]:
quality_cost_pollution.shape

Here, `axis=0` corresponds to the sum over all rows: 

In [None]:
np.sum(quality_cost_pollution, axis=0)

While `axis=1` corresponds to the sum over all columns:

In [None]:
np.sum(quality_cost_pollution, axis=1)
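
As a quick sanity check of how `axis` behaves for sums and means, the sketch below uses a small hypothetical array (`toy` is an illustrative name, not part of the dataset). It also shows that the mean over an axis is the sum over that axis divided by the number of collapsed elements.

```python
import numpy as np

# Hypothetical 2x3 array
toy = np.array([[1, 2, 3],
                [4, 5, 6]])

column_sums = np.sum(toy, axis=0)  # one value per column
row_sums = np.sum(toy, axis=1)     # one value per row
total = np.sum(toy)                # sum of every element

print(column_sums.tolist())  # [5, 7, 9]
print(row_sums.tolist())     # [6, 15]
print(total)                 # 21

# The mean over axis 0 equals the column sums divided by the number of rows
n_rows = toy.shape[0]
means_match = np.allclose(np.mean(toy, axis=0), column_sums / n_rows)
print(means_match)  # True
```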

---

#### Self-assessment 1

---

## 3. Slicing

NumPy allows you to select items in an array not only individually, but also as a subset of the initial array. You can take a *slice* of a NumPy array using the same slicing syntax as you would use with Python lists, extended to N dimensions. For example, to select the top 5 quality of life indices from our array (its first 5 entries), we can do

In [None]:
top_quality = quality_index[0:5]
print(top_quality)

Note that

In [None]:
top_quality.shape

Consider now our 2D array

In [None]:
quality_cost_pollution

If we want to select the first 5 rows of this 2D array, we can use the following syntax:

In [None]:
quality_cost_pollution[0:5, :]

(Note that the colon `:` denotes we didn't make any explicit choice of indices for the second axis, which in this case means we take all columns for the result)

If instead we wanted to choose the first two columns, with all rows, we would do

In [None]:
quality_cost_pollution[:, 0:2]

To select elements from a sub-array located in rows 5 through 9, and columns 0 and 1, we would do

In [None]:
quality_cost_pollution[5:10, 0:2]

**Note** You may use slicing to set values in the array, but (unlike lists) you can never grow the array using slicing. For that, you need to create a new array with the appropriate size and copy the data to this new object.
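
A minimal sketch of the note above, using a hypothetical array named `values`: assigning through a slice changes the array in place, assigning more elements than the slice holds raises an error, and growing requires a new array (here built with `np.concatenate`, one of several possible ways).

```python
import numpy as np

values = np.array([10, 20, 30, 40, 50])  # hypothetical data

# Assigning through a slice modifies the array in place
values[1:3] = 0
print(values.tolist())  # [10, 0, 0, 40, 50]

# Unlike a list, the slice cannot absorb extra elements
try:
    values[1:3] = [1, 2, 3]
except ValueError as err:
    print("Cannot grow via slicing:", err)

# Growing means building a new array and copying the data into it
grown = np.concatenate([values, np.array([60, 70])])
print(grown.shape)   # (7,)
print(values.shape)  # (5,)
```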

---

#### Self-assessment 2

---

## 4. Views and copies

Behind the scenes, a NumPy array consists of two parts: the data buffer, a block of memory holding the actual data elements, and the metadata, which describes how to interpret that buffer. The metadata includes the data type, shape, strides and other information that makes it easy to manipulate the ndarray.
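
The metadata can be inspected directly. The sketch below uses a hypothetical array named `grid` with an explicit `int64` dtype so that the byte counts are the same on every platform.

```python
import numpy as np

grid = np.arange(12, dtype=np.int64).reshape(3, 4)  # hypothetical example array

print(grid.dtype)     # int64
print(grid.itemsize)  # 8 (bytes per element)
print(grid.strides)   # (32, 8): 32 bytes to the next row, 8 to the next column

# Taking every other column only changes the metadata, not the buffer
every_other = grid[:, ::2]
print(every_other.shape)    # (3, 2)
print(every_other.strides)  # (32, 16)
```

The strides tell NumPy how many bytes to skip in the buffer to move one step along each axis, which is why a slice can describe the same memory with different metadata.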

Because of the way NumPy is built, many operations can reuse the existing data buffer instead of duplicating it: the result is called a `view`. When this is not possible, for example when we need to increase the number of elements of an array, a `copy` is made. Copies take extra memory and time, which can hurt performance on large datasets, so avoid them when a view would do.

You don't need to understand all the details about copies and views, but you should be aware that some NumPy operations create views, while others create copies. Unnecessary copies can become serious bottlenecks for your algorithm's performance, so handle them carefully if you want to write efficient code.

Let's look at a concrete example:

In [None]:
quality_index[0:5]

In [None]:
top_quality = quality_index[0:5]  # The slicing operation creates a view of the original array

In [None]:
top_quality[0] = 300  # By changing this element of the view, we are also changing the element in the original array!

In [None]:
top_quality

In [None]:
quality_index

The base attribute of the ndarray makes it easy to tell if an array is a view or a copy. The base attribute of a view returns the original array while for a copy it returns `None`.

In [None]:
top_quality.base is quality_index  # top_quality is a view of quality_index

If we want to make sure `top_quality` is an entirely different array from `quality_index`, we can use the `.copy()` method:

In [None]:
top_quality = quality_index[0:5].copy()

In [None]:
top_quality.base is quality_index
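
As a complement to the `base` check, `np.shares_memory` reports whether two arrays overlap in memory. The sketch below runs on a hypothetical array (`original`, `view` and `copied` are illustrative names) rather than on the dataset.

```python
import numpy as np

original = np.arange(10)        # hypothetical example array
view = original[2:6]            # basic slicing returns a view
copied = original[2:6].copy()   # explicit copy

print(np.shares_memory(original, view))    # True
print(np.shares_memory(original, copied))  # False

view[0] = -1    # writes to original[2]
copied[1] = -2  # leaves original[3] untouched
print(original[2], original[3])  # -1 3
```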

## 5. Advanced indexing

In addition to selecting elements with integer or tuple indices, NumPy implements *advanced indexing* techniques, allowing us to use ndarrays or boolean objects as indices. For example, suppose we want to select all elements in our `quality_index` array above a certain value - say 200. First, to detect which elements satisfy this condition, we can test the array directly:

In [None]:
quality_index > 200

Note that the output from this is an array with boolean values:

In [None]:
boolean_array = quality_index > 200
boolean_array.dtype

This boolean array can then be used to directly select the elements from the original array for which the condition is met:

In [None]:
quality_index[quality_index > 200]

This syntax can be very powerful and compact. Let's say we want to select only the values greater than or equal to the array average. We can do this by using

In [None]:
quality_index[quality_index >= np.mean(quality_index)]

Note that it is also possible to select elements from an array using another array (or a list, or tuple). For example:

In [None]:
top_quality = quality_index[0:5]
print(top_quality)

In [None]:
top_quality[[1, 1, 2, 3]]

**Note** Advanced indexing always returns a copy of the data (in contrast with basic slicing, which returns a view). This can have a serious impact on the performance and memory cost of these indexing operations.

In [None]:
top_quality[[1, 1, 2, 3]].base is quality_index
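
The copy behaviour can be checked on a small hypothetical array (`scores` and its values are made up for illustration): modifying the result of a boolean or integer-array selection leaves the source array untouched.

```python
import numpy as np

scores = np.array([120.0, 210.0, 95.0, 250.0])  # hypothetical values

mask = scores > 200          # boolean array
selected = scores[mask]      # boolean indexing returns a copy
picked = scores[[0, 0, 3]]   # integer-array indexing returns a copy; repeats allowed

selected[0] = -1.0           # does not touch `scores`

print(mask.tolist())      # [False, True, False, True]
print(picked.tolist())    # [120.0, 120.0, 250.0]
print(scores.tolist())    # [120.0, 210.0, 95.0, 250.0]
print(np.shares_memory(scores, selected))  # False
print(np.shares_memory(scores, picked))    # False
```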

---

#### Self-assessment 3

---

## 6. How to reshape, flatten and increase the dimensions of an array

Consider the array

In [None]:
a = np.arange(12)

In [None]:
a.shape

Let's say we wanted to make sure this array has shape `(12, 1)` (we'll see why this can be important in a minute!) Using `np.newaxis` will increase the dimensions of your array by one dimension when used once. This means that a 1D array will become a 2D array, a 2D array will become a 3D array, and so on.

In [None]:
b = a[:, np.newaxis]
b.shape
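
The position of `np.newaxis` in the index decides where the new axis goes. A minimal sketch on a hypothetical array named `row`:

```python
import numpy as np

row = np.arange(4)  # hypothetical example array, shape (4,)

as_column = row[:, np.newaxis]  # new axis at the end -> shape (4, 1)
as_row = row[np.newaxis, :]     # new axis at the front -> shape (1, 4)

print(as_column.shape)  # (4, 1)
print(as_row.shape)     # (1, 4)

# Adding an axis does not copy the data: the result is a view of `row`
print(as_column.base is row)  # True
```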

Now, let's say we wanted to re-organize `b` into a different shape. We can use `np.reshape` to do this:

In [None]:
c = b.reshape(2, 3, 2)
c.shape

**Note** The product of the new shape selected in the reshape operation must be equal to the product of the original array shape. In this case, the original array had a shape product of 12 and the reshaped array had a product of `2*3*2` which is also 12. If we had tried another shape with a different product, we would get an error:

```python
d = b.reshape(2, 5)

---------------------------------------------------------------------------
ValueError                                Traceback (most recent call last)
/var/folders/3w/490kdvj917n7kpxjpx14zztw0000gq/T/ipykernel_46779/3997546504.py in <module>
----> 1 d = b.reshape(2, 5)

ValueError: cannot reshape array of size 12 into shape (2,5)
```
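
One way to avoid working out every dimension by hand is to pass `-1` for a single dimension and let NumPy infer it from the total size. The sketch below uses a hypothetical array named `flat` and also shows a simple size check before reshaping.

```python
import numpy as np

flat = np.arange(12)  # hypothetical example array with 12 elements

# -1 asks NumPy to infer that dimension so the total size stays 12
inferred = flat.reshape(3, -1)
print(inferred.shape)  # (3, 4)

# Comparing sizes up front tells us whether a reshape would raise ValueError
new_shape = (2, 5)
fits = flat.size == np.prod(new_shape)
print(fits)  # False: 2 * 5 = 10, not 12
```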

On the other hand, there are two popular ways to flatten an array: the `ndarray.flatten()` method and `np.ravel()` (also available as the `ndarray.ravel()` method). The primary difference between the two is that `flatten()` always returns a copy, while `ravel()` returns a reference to the parent array (i.e., a "view") whenever the memory layout allows it. This means that any changes to the array returned by `ravel()` may affect the parent array as well. Because `ravel()` avoids a copy when it can, it is more memory efficient.

In [None]:
c

In [None]:
c.flatten()

In [None]:
c.ravel()

You can see that `c.ravel()` creates a view and not a copy, because if we change `e`, `c` also gets changed:

In [None]:
e = c.ravel()
e[0] = -1
c
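
Note that `ravel` can only return a view when the elements are already laid out in the requested order. The sketch below, on a hypothetical array `m`, shows a case where it has to fall back to a copy: a transposed array is not stored in row-major order, so flattening it requires rearranging the data.

```python
import numpy as np

m = np.arange(6).reshape(2, 3)  # hypothetical example array, row-major layout
t = m.T                         # transposed view, no longer row-major

print(np.shares_memory(m, m.ravel()))    # True: ravel returned a view
print(np.shares_memory(t, t.ravel()))    # False: ravel had to copy
print(np.shares_memory(m, m.flatten()))  # False: flatten always copies

print(t.ravel().tolist())  # [0, 3, 1, 4, 2, 5]
```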

---

#### Self-assessment 4

---


## Read more

- [Indexing on ndarrays](https://numpy.org/devdocs/user/basics.indexing.html)
- [Copies and Views](to be added)
- [Routines documentation](https://numpy.org/devdocs/reference/routines.html)

## Next

Go to [Notebook 3: Vectorization](03_Vectorization.ipynb).