# Slicing arrays gives you a *view*, not a copy

A common pitfall when working with slices of an array is not realizing that those slices do not hold their own data. A slice is a view: it points to data that is still stored in the parent array, so changing the slice changes the parent. Let's see how this can cause problems.

In [2]:
import numpy as np

In [16]:
# Start with making an array of zeros
foo = np.zeros((7,7))
print(foo)

[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]


In [17]:
# Next, we make a new variable that is a view of the central portion of the parent
bar = foo[2:5,2:5]
print(bar)

[[0. 0. 0.]
 [0. 0. 0.]
 [0. 0. 0.]]


In [18]:
# We add something to that new array, thinking that we aren't touching the original data
bar += 1
# And we are surprised to see that the original has changed, too!
print(bar)
print(foo)

[[1. 1. 1.]
 [1. 1. 1.]
 [1. 1. 1.]]
[[0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 1. 1. 1. 0. 0.]
 [0. 0. 1. 1. 1. 0. 0.]
 [0. 0. 1. 1. 1. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]
 [0. 0. 0. 0. 0. 0. 0.]]


In [19]:
# This happens even if the variable is altered inside a function!
def background_correction(array):
    """ This function's purpose is to return a background-subtracted version of the input array """
    background = 5
    array -= background
    return array

In [20]:
nobackbar = background_correction(bar)
print(nobackbar)
print(bar)
print(foo)

[[-4. -4. -4.]
 [-4. -4. -4.]
 [-4. -4. -4.]]
[[-4. -4. -4.]
 [-4. -4. -4.]
 [-4. -4. -4.]]
[[ 0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0. -4. -4. -4.  0.  0.]
 [ 0.  0. -4. -4. -4.  0.  0.]
 [ 0.  0. -4. -4. -4.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.]
 [ 0.  0.  0.  0.  0.  0.  0.]]


Not only has the input variable been edited by the sloppy function call, but the original `foo` array has been edited as well!

Why does NumPy behave this way? The answer is to save memory and time. If NumPy copied the data for every slice of every array, carrying those copies around would take a lot of memory, and making them would take a lot of time.
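If you are ever unsure whether a variable is a view or an independent copy, you can ask NumPy directly. The following is a minimal sketch using `np.shares_memory` and the `.base` attribute; the names `view` and `copy` are just illustrative.

```python
import numpy as np

foo = np.zeros((7, 7))

# Basic slicing returns a view onto foo's data
view = foo[2:5, 2:5]
# An explicit copy owns its own data
copy = foo[2:5, 2:5].copy()

print(np.shares_memory(foo, view))  # True
print(np.shares_memory(foo, copy))  # False
print(view.base is foo)             # True
print(copy.base is None)            # True
```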

Before you edit a dataset or a slice of a dataset that you want to keep intact, make a copy yourself. Here are two ways to do so when defining the array slice, and two ways to do so inside a function.

In [21]:
# Method 1: using an array-slice's copy method (fast and simple)
bar = foo[2:5,2:5].copy()
# Method 2: using NumPy's np.copy() function (versatile and can be used for many object types!)
bar = np.copy(foo[2:5,2:5])
# Method 3: at the top of a function to make sure it is "memory safe"
def background_correction(array):
    """ This function's purpose is to return a background-subtracted version of the input array """
    arraycopy = array.copy()
    background = 5
    arraycopy -= background
    return arraycopy
# Method 4: just don't store anything back into the input variables in a function definition
def background_correction(array):
    """ This function's purpose is to return a background-subtracted version of the input array """
    background = 5
    correctedarray = array - background
    return correctedarray
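As a quick sanity check, the sketch below reruns the earlier experiment with the Method 4 version of the function. It assumes a fresh `foo` of zeros, so the expected values differ from the outputs shown earlier.

```python
import numpy as np

def background_correction(array):
    """Return a background-subtracted version of the input array."""
    background = 5
    # array - background allocates a new array; array -= background would not
    correctedarray = array - background
    return correctedarray

foo = np.zeros((7, 7))
bar = foo[2:5, 2:5]  # still a view of foo
nobackbar = background_correction(bar)

print(nobackbar[0, 0])  # -5.0
print(bar[0, 0])        # 0.0
print(foo[2, 2])        # 0.0
```

Neither `bar` nor `foo` changes, because the function never writes back into its input.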

Making too many copies can be bad for performance, though, and it is not always easy to notice when an array is being modified. Making copies when defining new variables is a useful debugging tool, but clean up after yourself once you've shown that a copy doesn't help in a given situation!

# For loops and performance
Python is slow, especially when looping through large datasets, and even more so when doing nested loops through big arrays. C is faster than Python, [about two orders of magnitude faster by some metrics](https://github.com/niklas-heer/speed-comparison). Most NumPy functions are implemented in C. Any time you can offload a for loop into a NumPy function (usually a whole-array operation or a broadcast), you can greatly speed up your code.

In [13]:
# not really all that big, but big enough to see the effect
bigarray = np.ones((1000,1000))
import time

In [14]:
start = time.time()
for i in range(len(bigarray)):
    for j in range(len(bigarray[0])):
        bigarray[i][j] + bigarray[j][i]

print("for loops time: " + str(time.time()-start))
# On my computer this took 0.643669843673706 seconds

for loops time: 0.6813843250274658


In [15]:
start = time.time()
bigarray + bigarray.transpose()
print("with array transformation in C: " + str(time.time()-start))
# On my computer this took 0.005024433135986328 seconds

with array transformation in C: 0.005844831466674805


You can easily see how this can really start to add up if you are doing more complex operations within the for loop, and/or operating on datasets that are actually really big.
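A single `time.time()` measurement can be noisy. One possible alternative, sketched below, is the standard library's `timeit` module, which runs each version several times. The helper names `loop_sum` and `vectorized_sum` and the smaller 200 x 200 array are choices made for this sketch, not part of the cells above, and the timings will vary by machine.

```python
import timeit
import numpy as np

small = np.ones((200, 200))

def loop_sum(a):
    """Add a square array to its transpose, one element at a time."""
    out = np.empty_like(a)
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i, j] = a[i, j] + a[j, i]
    return out

def vectorized_sum(a):
    """Same result, with the loops handled inside NumPy."""
    return a + a.T

# Total time for 5 runs of each version
loop_time = timeit.timeit(lambda: loop_sum(small), number=5)
vec_time = timeit.timeit(lambda: vectorized_sum(small), number=5)
print(f"loops: {loop_time:.4f} s, vectorized: {vec_time:.4f} s")
```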

If your code is having performance problems, or if you just want to avoid them, a good rule of thumb is to offload as many for loops into C as possible. Sometimes this involves getting creative with array transformations and mathematical relationships.
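As one small illustration of getting creative with array shapes, the sketch below builds a table of pairwise absolute differences twice: once with nested loops and once with broadcasting. The array `x` and the variable names are made up for this example.

```python
import numpy as np

x = np.array([1.0, 4.0, 9.0])

# Loop version: fill the table of |x[i] - x[j]| one element at a time
looped = np.empty((len(x), len(x)))
for i in range(len(x)):
    for j in range(len(x)):
        looped[i, j] = abs(x[i] - x[j])

# Broadcast version: a (3, 1) column against a (1, 3) row gives a (3, 3) result
broadcast = np.abs(x[:, np.newaxis] - x[np.newaxis, :])

print(np.array_equal(looped, broadcast))  # True
```

Adding a length-1 axis with `np.newaxis` is what lets NumPy stretch the column and the row against each other, so both loops disappear into a single subtraction.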

If you are having performance issues and aren't sure how to remove some of your for loops, this manual is your best friend: https://numpy.org/doc/stable/user/basics.broadcasting.html