# DS-GA-3001 Advanced Python for Data Science

Before you turn this problem in, make sure you **restart the kernel** (in the menubar, select Kernel$\rightarrow$Restart). You can then either run all cells (in the menubar, select Cell$\rightarrow$Run All), or run each cell individually, **in order**, during the class.

Any textual answers that need to be provided will be marked with "YOUR ANSWER HERE". Replace this text with your answer to the question.

Any code answers that need to be provided will be marked with:

```
# YOUR CODE HERE
raise NotImplementedError()
```

Replace all this code with your answer to the question. If you do not answer the question, the `NotImplementedError` exception will be raised, which will indicate to the grader that no answer has been supplied.

In many cases, code answers will also have some associated test code. You should execute the tests after you have entered your code in order to ensure that your answer is correct. You should not proceed to the next question until your answer is correct.

Finally, insert your Net ID and the Net IDs of any collaborators in the cell below.

In [1]:
NET_ID = "jl6583"
COLLABORATORS = ""

---

# NumPy Performance Tips

## Why are NumPy arrays efficient?

A NumPy array is described by metadata (number of dimensions, shape, data type, and so on) and the actual data. The data is stored in a homogeneous and contiguous block of memory, at a particular address in system memory. This block of memory is called the data buffer, and it is the main difference from a pure Python structure like a list, whose items are scattered across system memory. This aspect is the critical feature that makes NumPy arrays so efficient.

Why is this so important?

1. **Array computations can be written very efficiently in a low-level language like C** (and a large part of NumPy is actually written in C). Knowing the address of the memory block and the data type, it is just simple arithmetic to loop over all items, for example. There would be a significant overhead to do that in Python with a list.

2. **Spatial locality in memory access patterns** results in significant performance gains, notably thanks to the CPU cache. The cache loads data from RAM in fixed-size chunks (cache lines), so once one item has been fetched, the adjacent items are already in the cache and can be accessed very efficiently (sequential locality, or locality of reference).

3. **Data elements are stored contiguously in memory**, so that NumPy can take advantage of vectorized instructions on modern CPUs, like Intel's SSE and AVX, AMD's XOP, and so on. For example, multiple consecutive floating point numbers can be loaded into 128-, 256-, or 512-bit registers for vectorized arithmetic computations implemented as CPU instructions.

Additionally, NumPy can also be linked to highly optimized linear algebra libraries like BLAS and LAPACK, for example through the Intel Math Kernel Library (MKL). A few specific matrix computations may also be multithreaded, taking advantage of the power of modern multicore processors.

In conclusion, storing data in a contiguous block of memory ensures that the architecture of modern CPUs is used optimally, in terms of memory access patterns, CPU cache, and vectorized instructions.
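As a small illustration of the metadata and data buffer described above, the sketch below inspects a few standard `ndarray` attributes. The array shape is an arbitrary choice made for this example.

```python
import numpy as np

# A small array of 64-bit floats; the shape (3, 4) is arbitrary.
a = np.zeros((3, 4), dtype=np.float64)

# Metadata that tells NumPy how to interpret the data buffer.
print(a.ndim)      # number of dimensions
print(a.shape)     # size along each dimension
print(a.dtype)     # the single data type shared by every element
print(a.itemsize)  # bytes per element
print(a.nbytes)    # total bytes in the data buffer (size * itemsize)

# The flags report whether the buffer is one contiguous row-major block.
is_contiguous = bool(a.flags['C_CONTIGUOUS'])
print(is_contiguous)
```

Because every element has the same size, the total buffer size is simply the number of elements multiplied by `itemsize`.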

## Array copy semantics

Although NumPy arrays enable fast computations, it is still easy to run into performance problems. One of the most common performance issues you will encounter with NumPy arrays is caused by inadvertent array copies. This occurs when NumPy makes a copy of an existing array in order to perform some operation. To do this, NumPy must copy all the elements from the old array to the new array, which, for large arrays, can impose a significant performance and memory penalty. In this section we will explore how to identify when an array is copied and show some techniques for avoiding it.

Try running the following code:

In [2]:
import numpy as np

def func1():
    a = np.zeros(1000000)
    a *= 2
    
def func2():
    a = np.zeros(1000000)
    a = a * 2

%timeit func1()
%timeit func2()

100 loops, best of 3: 3.21 ms per loop
100 loops, best of 3: 5.36 ms per loop


Why is there a difference?

An expression like "`a *= 2`" corresponds to an in-place operation, where all values of the array are multiplied by two. By contrast, "`a = a * 2`" means that a new array containing the values of "`a * 2`" is created, and the variable `a` now points to this new array. The old array becomes unreferenced and its memory is reclaimed. The multiplication allocates no new array in the first case, whereas in the second case it allocates a second million-element array.
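One way to observe the difference is to keep a second name bound to the original array. The sketch below uses `+= 1` instead of `*= 2` so that the change is visible on an array of zeros; the variable names are chosen only for this illustration.

```python
import numpy as np

a = np.zeros(5)
alias = a   # a second name for the same array object

# In-place operation: the existing buffer is modified,
# so the change is visible through the alias as well.
a += 1
in_place_visible = bool(alias[0] == 1.0)

# Out-of-place operation: a new array is allocated and the name `a`
# is rebound to it; the alias still refers to the old, unchanged data.
a = a + 1
rebinding_visible = bool(alias[0] == 2.0)

print(in_place_visible, rebinding_visible)  # True False
```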

How can we tell if an array has been copied or not? The `is` keyword will tell us if the arrays are the same, but not if they are only sharing data. In the example below, `is` says the arrays are different, but is this really the case?

In [3]:
a = np.zeros((10, 10))
b = a.reshape((1, -1))
b is a

False

One way to check if the arrays are sharing data is to compare the location of the data using the following function:

In [4]:
def id(x):
    # This function returns the memory block address of an array.
    # Note: defining it shadows Python's built-in id() for the rest of the notebook.
    return x.__array_interface__['data'][0]  # address of the first element

In [5]:
id(a) == id(b)

True

*Note that this only works if the arrays have the same offset (first element). Two shared arrays with different offsets will have different memory locations:*

In [6]:
id(a) == id(a[1:])

False
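When the offsets differ, comparing start addresses is not enough. As an alternative sketch, NumPy's `np.shares_memory` function and the `base` attribute of an array can be used to detect sharing regardless of offset. The variable names below are chosen only for this example.

```python
import numpy as np

a = np.zeros((10, 10))
tail = a[1:]        # a view that starts one row into the buffer of `a`
copied = a.copy()   # an independent copy with its own buffer

# np.shares_memory checks for overlapping memory, whatever the offset.
print(np.shares_memory(a, tail))    # True
print(np.shares_memory(a, copied))  # False

# A view keeps a reference to the array that owns the data in `.base`;
# an array that owns its own data has `.base` set to None.
print(tail.base is a)       # True
print(copied.base is None)  # True
```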

Reshaping an array may or may not involve a copy. In the example above, we saw that reshape does not copy the array if the order is preserved.

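Transposing an array only returns a view with the axes swapped, but the elements of the transposed array are no longer laid out in row-major order, so reshaping it afterwards forces a copy.

In [7]:
c = a.T.reshape((1, -1))  # reshaping the transposed array generates a copy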
id(a) == id(c)

False

The `flatten` and `ravel` methods of an array reshape it into a 1D vector (flattened array). The former method always returns a copy, whereas the latter returns a copy only if necessary (so it's significantly faster too, especially with large arrays).

In [8]:
d = a.flatten()
id(a) == id(d)

False

In [9]:
e = a.ravel()  # only makes a copy when necessary
id(e) == id(a)

True

In [10]:
%timeit a.flatten()
%timeit a.ravel()

The slowest run took 200.55 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 663 ns per loop
The slowest run took 10.97 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 283 ns per loop
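The "only if necessary" behaviour of `ravel` can be checked with a transposed array, whose elements are not in row-major order. This is a minimal sketch; `np.shares_memory` is used here in place of the address comparison above.

```python
import numpy as np

a = np.zeros((10, 10))

flat_view = a.ravel()      # `a` is row-major contiguous, so no copy is needed
flat_copy = a.T.ravel()    # the transposed layout cannot be flattened in place
always_copy = a.flatten()  # flatten copies regardless of the layout

print(np.shares_memory(a, flat_view))    # True
print(np.shares_memory(a, flat_copy))    # False
print(np.shares_memory(a, always_copy))  # False
```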


## Why can some arrays be reshaped without a copy?

A 2-dimensional array contains items indexed by two numbers (row and column), but it is stored internally as a 1-dimensional contiguous block of memory (sometimes known as a vector), accessible with a single number. 

There is more than one way of storing items in a 1-dimensional block of memory. One way is to store the elements of the first row first, then the second row, and so on. Another way is to store the elements of the first column first, then the second column, and so on. The first method is called row-major order, while the second is called column-major order. Choosing between the two methods is only a matter of internal convention. NumPy and C use row-major order by default. Other languages, like Fortran, use column-major order.

For example, the NumPy array:

<p>
<table align="left">
<tr><td>row</td><td align="center" colspan="3">column</td></tr>
<tr><td></td><td>0</td><td>1</td><td>2</td></tr>
<tr><td>0</td><td>[0,0]</td><td>[0,1]</td><td>[0,2]</td></tr>
<tr><td>1</td><td>[1,0]</td><td>[1,1]</td><td>[1,2]</td></tr>
<tr><td>2</td><td>[2,0]</td><td>[2,1]</td><td>[2,2]</td></tr>
</table>
</p>

This array is stored in memory as:

<table align="left">
<tr><td>offset</td><td>0</td><td>1</td><td>2</td>
<td>3</td><td>4</td><td>5</td>
<td>6</td><td>7</td><td>8</td></tr>
<tr><td>cell</td><td>[0,0]</td><td>[0,1]</td><td>[0,2]</td>
<td>[1,0]</td><td>[1,1]</td><td>[1,2]</td>
<td>[2,0]</td><td>[2,1]</td><td>[2,2]</td></tr>
</table>

NumPy uses the notion of **strides** to convert between a multidimensional index and the memory location of the underlying (1-dimensional) sequence of elements. The specific mapping between `array[i1, i2]` and the relevant byte address of the internal data is given by:

`offset = array.strides[0] * i1 + array.strides[1] * i2`

When reshaping an array, NumPy avoids copies when possible by modifying the strides attribute. For example, when transposing an array, the order of strides is reversed, but the underlying data remains identical. However, flattening a transposed array cannot be accomplished simply by modifying strides, so a copy is needed.
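The stride formula can be checked directly on a small array. The sketch below assumes a row-major array of 64-bit integers; the shape and the indices are arbitrary choices for illustration.

```python
import numpy as np

a = np.arange(12, dtype=np.int64).reshape(3, 4)

# Moving one row forward skips 4 elements of 8 bytes; one column skips 8 bytes.
print(a.strides)  # (32, 8)

# Byte offset of a[i1, i2] according to the formula above.
i1, i2 = 2, 1
offset = a.strides[0] * i1 + a.strides[1] * i2

# Dividing by the element size gives the position in the flat buffer.
flat = a.ravel()
print(flat[offset // a.itemsize] == a[i1, i2])  # True

# Transposing reverses the strides and leaves the data untouched.
print(a.T.strides)               # (8, 32)
print(np.shares_memory(a, a.T))  # True
```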

Suppose we have the following array:

In [11]:
a = np.random.rand(5000, 5000)

<div class="alert alert-success">
Write a Python statement to find the sum of the elements in the first row and time how long it takes to run:
</div>

In [12]:
%timeit np.sum(a[0,:])

The slowest run took 17.95 times longer than the fastest. This could mean that an intermediate result is being cached 
100000 loops, best of 3: 5.74 µs per loop


<div class="alert alert-success">
Now write a Python statement to find the sum of the elements in the first column (that is, the first element of each row) and time how long it takes to run:
</div>

In [13]:
%timeit np.sum(a[:,0])

The slowest run took 20.83 times longer than the fastest. This could mean that an intermediate result is being cached 
10000 loops, best of 3: 83.4 µs per loop


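You should see that the second statement takes much longer to run than the first. The elements of a row are adjacent in memory, whereas consecutive elements of a column are separated by a full row (5000 × 8 bytes for this array of 64-bit floats), so the column sum makes much poorer use of the CPU cache.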
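The strides of the two selections show the difference in memory layout. The sketch below uses a smaller array than the one timed above so that it runs quickly; the shape is an assumption made for this example.

```python
import numpy as np

a = np.zeros((500, 500))  # 64-bit floats, row-major

row = a[0, :]  # first row: neighbouring elements are 8 bytes apart
col = a[:, 0]  # first column: neighbouring elements are one full row apart

print(row.strides)  # (8,)
print(col.strides)  # (4000,)

# Only the row selection is a contiguous block of memory.
print(row.flags['C_CONTIGUOUS'])  # True
print(col.flags['C_CONTIGUOUS'])  # False
```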

## Broadcasting rules

NumPy broadcasting allows you to perform computations on arrays with different but compatible shapes. This means that it's not always necessary to reshape or tile arrays to make their shapes match. Broadcasting does not involve any memory copy.

> Two dimensions are compatible when they are equal, or when one of them is one.

The following example illustrates two ways of computing an outer product of two vectors: the first method involves array tiling, the second one involves broadcasting. The second method is significantly faster.

In [14]:
n = 1000
a = np.arange(n)
ac = a[:, np.newaxis]
ar = a[np.newaxis, :]  # np.newaxis only adds an axis to a view; no data is copied

In [15]:
%timeit np.tile(ac, (1, n)) * np.tile(ar, (n, 1))

100 loops, best of 3: 12.7 ms per loop


In [16]:
%timeit ar * ac  # element-wise product with broadcasting

100 loops, best of 3: 3.13 ms per loop
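To make the compatibility rule concrete, the sketch below repeats the outer product on a tiny vector and uses `np.broadcast_to` to show that the stretched dimension is represented with a stride of zero rather than with copied data. The vector length and the explicit `int64` dtype are choices made for this illustration.

```python
import numpy as np

a = np.arange(4, dtype=np.int64)
ac = a[:, np.newaxis]  # shape (4, 1)
ar = a[np.newaxis, :]  # shape (1, 4)

# (4, 1) and (1, 4) are compatible: in each dimension one of the sizes is 1.
outer = ac * ar
print(outer.shape)                            # (4, 4)
print(np.array_equal(outer, np.outer(a, a)))  # True

# The broadcast operand is a read-only view: stride 0 repeats the same row.
expanded = np.broadcast_to(ar, (4, 4))
print(expanded.strides)               # (0, 8)
print(np.shares_memory(expanded, a))  # True

# Sizes 3 and 4 are neither equal nor 1, so broadcasting fails.
try:
    np.ones(3) + np.ones(4)
    shapes_compatible = True
except ValueError:
    shapes_compatible = False
print(shapes_compatible)  # False
```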


## Selecting elements efficiently

NumPy offers multiple ways of selecting slices of arrays:

1. Array views refer to the original data buffer of an array, but with different offsets, shapes and strides. They only permit strided selections (i.e. with linearly spaced indices).

2. Functions such as `take` are available to make arbitrary selections along one axis.

3. Fancy indexing is the most general selection method, but it is also the slowest.

Suppose we have an array with a large number of rows and we want to select slices from the first dimension.

In [17]:
n, d = 100000, 100
a = np.random.random_sample((n, d))

First, time how long it takes to select every 10th row using a view:

In [18]:
%timeit b1 = a[::10]

The slowest run took 18.09 times longer than the fastest. This could mean that an intermediate result is being cached 
1000000 loops, best of 3: 224 ns per loop


<div class="alert alert-success">
Using the technique we used previously, check if `b1` is a copy of `a`:
</div>

In [19]:
b1 = a[::10]
id(b1) == id(a)

True

Now time how long it takes to select every 10th row using fancy indexing:

In [20]:
i = np.arange(0,n,10)
%timeit b2 = a[i]

100 loops, best of 3: 3.71 ms per loop


<div class="alert alert-success">
Again, check if `b2` is a copy of `a`:
</div>

In [21]:
b2 = a[i]
id(b2) == id(a)  # fancy indexing copies the selected rows into a new array

False

It is clear that fancy indexing is significantly slower than using a view, because it copies the selected rows. In some cases it might be faster to use NumPy's `take` function; however, this depends on the type of fancy indexing being done.

In [22]:
%timeit b3 = np.take(a, i, axis=0)

100 loops, best of 3: 4.04 ms per loop


When the indices to select along one axis are specified by a boolean mask vector, the function `compress` is an alternative to fancy indexing.

In [23]:
i = np.random.random_sample(n) < .5

In [24]:
%timeit a[i]

100 loops, best of 3: 17.5 ms per loop


In [25]:
%timeit np.compress(i, a, axis=0)

The slowest run took 4.35 times longer than the fastest. This could mean that an intermediate result is being cached 
10 loops, best of 3: 18 ms per loop


This method can be faster than fancy indexing, although in the run shown above the two timings are essentially the same (17.5 ms versus 18 ms).

Fancy indexing is the most general way of making completely arbitrary selections of an array. However, more specific and faster methods often exist and should be preferred when possible.
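The selection methods discussed in this section return the same values and differ only in whether they share memory with the original array. The following sketch checks this on a small array; the array contents and names are chosen only for this example.

```python
import numpy as np

a = np.arange(10000.0).reshape(1000, 10)
idx = np.arange(0, 1000, 10)  # every 10th row, as integer indices
mask = np.zeros(1000, dtype=bool)
mask[idx] = True              # the same rows, as a boolean mask

view = a[::10]                             # strided view
fancy = a[idx]                             # fancy indexing with integers
taken = np.take(a, idx, axis=0)            # take along the first axis
masked = a[mask]                           # fancy indexing with a mask
compressed = np.compress(mask, a, axis=0)  # compress along the first axis

# All five selections contain the same rows.
same_values = all(np.array_equal(view, other)
                  for other in (fancy, taken, masked, compressed))
print(same_values)  # True

# Only the strided view shares memory with `a`; the others are copies.
print(np.shares_memory(a, view))   # True
print(np.shares_memory(a, fancy))  # False
```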

## Using what you have learned

Suppose we want to write a function to remove the i-th row and j-th column from a 2d array. The simplistic way to do this would be as follows:

In [26]:
import numpy as np

def remove_ij(x, i, j):

    # Remove the ith row (a range object has no remove(), so build a list first)
    idx = list(range(x.shape[0]))
    idx.remove(i)
    x = x[idx,:]

    # Remove the jth column
    idx = list(range(x.shape[1]))
    idx.remove(j)
    x = x[:,idx]

    return x

Let's try removing a row and column from a large array:

In [27]:
big_array = np.ones((500,500))
%timeit remove_ij(big_array, 345, 276)

The slowest run took 5.09 times longer than the fastest. This could mean that an intermediate result is being cached 
1000 loops, best of 3: 904 µs per loop


<div class="alert alert-success">
Now try modifying the `remove_ij` function to use *array slicing* rather than the list `remove` method.
</div>

In [28]:
def faster_remove_ij(x, i, j):
    x = np.delete(np.delete(x, j, 1), i, 0)  # each np.delete call returns a new array (a copy), not a view
    return x
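For comparison, here is one possible slicing-based variant. It is only a sketch: the name `remove_ij_sliced` is hypothetical and not part of the assignment, and no timing claim is made for it. It allocates the output once and fills it from the four blocks surrounding row `i` and column `j`, each of which is a plain slice (view) of the input.

```python
import numpy as np

def remove_ij_sliced(x, i, j):
    # Hypothetical helper for illustration; assumes 0 <= i < rows, 0 <= j < cols.
    rows, cols = x.shape
    out = np.empty((rows - 1, cols - 1), dtype=x.dtype)

    # Copy the four blocks around row i and column j into the output.
    out[:i, :j] = x[:i, :j]          # above and to the left
    out[:i, j:] = x[:i, j + 1:]      # above and to the right
    out[i:, :j] = x[i + 1:, :j]      # below and to the left
    out[i:, j:] = x[i + 1:, j + 1:]  # below and to the right
    return out

x = np.arange(25).reshape(5, 5)
result = remove_ij_sliced(x, 1, 2)
print(result.shape)  # (4, 4)
```

The result can be compared against the nested `np.delete` call to confirm that both approaches remove the same row and column.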

In [29]:
from nose.tools import assert_equal, assert_less
import timeit

big_array1 = np.ones((500,500))
big_array2 = np.ones((500,500))
row = 450
col = 222

def slow_remove():
    remove_ij(big_array1, row, col)

def fast_remove():
    faster_remove_ij(big_array2, row, col)

slow_time = timeit.timeit(slow_remove, number=10000)
fast_time = timeit.timeit(fast_remove, number=10000)
print("%f should be less than %f" % (fast_time, slow_time))
assert_less(fast_time, slow_time)

5.910545 should be less than 10.404120
