# Operating on Numpy arrays

## Objectives

Introduce numpy's **pass-by-reference** approach to handling numpy arrays, and methods for avoiding pitfalls when operating on numpy arrays.

## Introduction

From: http://span.ece.utah.edu/common-mistakes-in-moving-from-matlab-to-python:

"Whenever you reference an array in Python, the computer will provide the memory address for the thing you are accessing, not the actual value. This is called **pass-by-reference**. This saves memory and makes your programs faster, but it is also harder to keep straight."  

From: https://docs.python.org/2/library/copy.html

"Assignment statements in Python do not copy objects, they create bindings [pointers] between a target and an object." "... a copy is sometimes needed so one can change one copy without changing the other. The 'copy' module provides generic ... copy operations."

If you are not familiar with the **pass-by-reference** aspect of Python then I strongly suggest you read this short, informative essay on "Python Names and Values": https://nedbatchelder.com/text/names.html

We've briefly touched on this important subject in earlier tutorials.  Now we'll go into a bit more detail.

> Note: For this tutorial you will need the 2010 monthly SSH fields (ShortName **ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4**) and the [grid file](https://ecco-v4-python-tutorial.readthedocs.io/ECCO_v4_Loading_the_ECCOv4_native_model_grid_parameters.html). The `ecco_access` library used in the notebook will handle download or retrieval of the necessary data, if you have set up the library [in your Python path](https://ecco-v4-python-tutorial.readthedocs.io/ECCO_access_intro.html#Setting-up-ecco_access).

## Variable assignments

Unlike some other languages, creating a new variable with an assignment statement in Python such as `x = some_numpy_array` does not make a copy of ``some_numpy_array``.  Instead, the assignment statement makes ``x`` and ``some_numpy_array`` both point to the same `numpy` array in memory.  Because ``x`` and ``some_numpy_array`` both refer (or point) to the same `numpy` array in memory, the `numpy` array can be changed by operations on either ``x`` or ``some_numpy_array``.  If you aren't aware of this behavior then you may run into bugs in your calculations that are very difficult to identify!

### A simple demonstration

Let's demonstrate this issue with a very simple `numpy` array

In [1]:
import numpy as np
import xarray as xr
import sys
import matplotlib.pyplot as plt
%matplotlib inline
import json
import glob
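from copy import deepcopy
# expanduser and join are used below to build the download paths
from os.path import expanduser, join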
import warnings
warnings.filterwarnings('ignore')

import ecco_v4_py as ecco
import ecco_v4_py.ecco_access as ea


# are you working in the AWS Cloud?
incloud_access = False

# indicate mode of access from PO.DAAC
# options are:
# 'download': direct download from internet to your local machine
# 'download_ifspace': like download, but only proceeds 
#                     if your machine has sufficient storage
# 's3_open': access datasets in-cloud from an AWS instance
# 's3_open_fsspec': use jsons generated with fsspec and 
#                   kerchunk libraries to speed up in-cloud access
# 's3_get': direct download from S3 in-cloud to an AWS instance
# 's3_get_ifspace': like s3_get, but only proceeds if your instance 
#                   has sufficient storage
user_home_dir = expanduser('~')
download_dir = join(user_home_dir,'Downloads','ECCO_V4r4_PODAAC')
if incloud_access:
    access_mode = 's3_open_fsspec'
    download_root_dir = None
    jsons_root_dir = join(user_home_dir,'MZZ')
else:
    access_mode = 'download_ifspace'
    download_root_dir = download_dir
    jsons_root_dir = None

Create a simple numpy array

In [4]:
a=np.array([1, 2, 3, 4, 5])

# Assign 'b' to point to the same numpy array
b=a

# Test to see if b and a point to the same thing
b is a

True

Now change the fourth element of ``b`` and print both ``a`` and ``b``

In [5]:
b[3] = 10
print (a)
print (b)

[ 1  2  3 10  5]
[ 1  2  3 10  5]
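
The same sharing applies when an array is passed to a function: the parameter name is bound to the caller's array, so in-place changes made inside the function are visible outside it. Below is a minimal sketch of this. The helper names `zero_negatives_inplace` and `zero_negatives_copy` are hypothetical and used only for illustration.

```python
import numpy as np

def zero_negatives_inplace(arr):
    # Hypothetical helper: writes directly into the caller's array
    arr[arr < 0] = 0
    return arr

def zero_negatives_copy(arr):
    # Hypothetical helper: copies first, so the caller's array is untouched
    out = arr.copy()
    out[out < 0] = 0
    return out

a = np.array([1, -2, 3, -4, 5])

safe = zero_negatives_copy(a)
print(a)      # [ 1 -2  3 -4  5]  (unchanged)
print(safe)   # [1 0 3 0 5]

result = zero_negatives_inplace(a)
print(a)            # [1 0 3 0 5]  (changed by the function)
print(result is a)  # True
```

If a function should not alter its input, copying at the top of the function (as in the second helper) is one simple safeguard.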


### A fancier demonstration

Let's now demonstrate with a `numpy` array that stores ``SSH`` output.

In [6]:
## load ECCO datasets needed for the tutorial

ShortNames_list = ["ECCO_L4_GEOMETRY_LLC0090GRID_V4R4",\
                   "ECCO_L4_SSH_LLC0090GRID_MONTHLY_V4R4"]

ds_dict = ea.ecco_podaac_to_xrdataset(ShortNames_list,\
                                              StartDate='2010-01',EndDate='2010-12',\
                                              mode=access_mode,\
                                              download_root_dir=download_root_dir,\
                                              jsons_root_dir=jsons_root_dir,\
                                              max_avail_frac=0.5)

ecco_grid = ds_dict[ShortNames_list[0]].compute()
ssh_dataset = ds_dict[ShortNames_list[1]].compute()

## Merge SSH and grid
output_all = xr.merge((ssh_dataset, ecco_grid))

Recall the dimensions of our ``SSH`` `DataArray`:

In [7]:
output_all.SSH.dims

('time', 'tile', 'j', 'i')

Show the first four SSH values in **j** and **i** for the fifth month (May 2010) and second tile:

In [8]:
output_all.SSH[4,1,0:4,0:4].values

array([[-1.3647507, -1.3654912, -1.3614875, -1.3682245],
       [-1.3252122, -1.3242471, -1.3242686, -1.3410169],
       [-1.2957976, -1.2947818, -1.3008444, -1.3268987],
       [-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)

Assign the variable `ssh_tmp` to this *subset* of the `numpy` array that ``SSH`` points to:

In [9]:
ssh_tmp = output_all.SSH[4,1,0:2,0:2].values
ssh_tmp

array([[-1.3647507, -1.3654912],
       [-1.3252122, -1.3242471]], dtype=float32)

Now change the values of all elements of ``ssh_tmp`` to 10

In [10]:
ssh_tmp[:] = 10
ssh_tmp

array([[10., 10.],
       [10., 10.]], dtype=float32)

And see that yes, in fact, this change is reflected in our ``SSH`` `DataArray`:

In [11]:
output_all.SSH[4,1,0:4,0:4].values

array([[10.       , 10.       , -1.3614875, -1.3682245],
       [10.       , 10.       , -1.3242686, -1.3410169],
       [-1.2957976, -1.2947818, -1.3008444, -1.3268987],
       [-1.285025 , -1.2845778, -1.2949846, -1.325418 ]], dtype=float32)
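
This behavior is consistent with `numpy`'s rule that basic slicing (`start:stop`) returns a *view* of the original data rather than a copy. The sketch below uses a small made-up array (not ECCO data) to show how `np.shares_memory` can be used to check whether two arrays overlap in memory, and that indexing with a list of integers returns a copy instead.

```python
import numpy as np

# Toy 4x4 array standing in for a model field (values are made up)
field = np.arange(16, dtype=np.float32).reshape(4, 4)

view = field[0:2, 0:2]      # basic slicing: a view into field
picked = field[[0, 1], :]   # integer-list indexing: an independent copy

print(np.shares_memory(field, view))    # True
print(np.shares_memory(field, picked))  # False

view[:] = 10     # writes through to field
picked[:] = -1   # does not touch field

print(field[0, :].tolist())  # [10.0, 10.0, 2.0, 3.0]
```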

## Dealing with *pass-by-reference*: right hand side operations

One way to have a new variable assignment not point to the original variable is to *perform an operation on the right hand side of the assignment statement*.  

"Python evaluates expressions from left to right. Notice that while evaluating an assignment, the right-hand side is evaluated before the left-hand side."
https://docs.python.org/2/reference/expressions.html#evaluation-order

Performing an operation on the right hand side creates new values in memory.  The new variable assignment will then point to these new values, leaving the original untouched.

### Simple demonstration 1
Operate on ``a`` by adding 1 before the assignment statement

In [12]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])

b = a + 1

print (a)
print (b)

[1 2 3 4 5]
[2 3 4 5 6]


Now change the fourth element of ``b`` and print both ``a`` and ``b``

In [13]:
b[3] = 10
print (a)
print (b)

[1 2 3 4 5]
[ 2  3  4 10  6]


``a`` and ``b`` do indeed point to different values in memory.

### Simple demonstration 2

Operate on ``a`` by adding 0 before the assignment statement.  This is a kind of dummy operation.

In [14]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])

# Add 0 to `a`:
b = a + 0

print (a)
print (b)

[1 2 3 4 5]
[1 2 3 4 5]


In [15]:
# Test to see if b and a point to the same thing
b is a

False

Now change the fourth element of ``b`` and print both ``a`` and ``b``

In [16]:
b[3] = 10
print (a)
print (b)

[1 2 3 4 5]
[ 1  2  3 10  5]


Once again we see that ``a`` and ``b`` do indeed point to different values in memory.

### A fancier demonstration

Let's now demonstrate with a `numpy` array that stores ``SSH`` output.

In [17]:
output_all.SSH[4,1,5:9,5:9].values

array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)

In [18]:
ssh_tmp = output_all.SSH[4,1,5:9,5:9].values * output_all.rA[1,5:9,5:9].values
ssh_tmp[:] = 10
ssh_tmp

array([[10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.]], dtype=float32)

In [19]:
output_all.SSH[4,1,5:9,5:9].values

array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)

Operating on the right hand side of the assignment does indeed create new arrays in memory, leaving the original SSH `numpy` array untouched.
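
One caveat worth keeping in mind, sketched below with a toy array: augmented assignments such as `+=` are applied in place by `numpy` and do not create a new array. They therefore do not break the link between two names the way `b = a + 1` does.

```python
import numpy as np

a = np.array([1, 2, 3, 4, 5])

b = a
b += 1          # in place: modifies the array that a also refers to
print(a)        # [2 3 4 5 6]
print(b is a)   # True

c = a
c = c + 1       # right-hand side builds a new array, then c is rebound to it
print(a)        # [2 3 4 5 6]
print(c)        # [3 4 5 6 7]
print(c is a)   # False
```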

## Dealing with *pass-by-reference*: copy and deepcopy

A second way to have a new variable assignment not point to the original variable is to *use the copy or deepcopy command*.

### Simple demonstration
Use the `numpy` `copy` command.

In [20]:
# Create a simple numpy array
a=np.array([1, 2, 3, 4, 5])
b=np.copy(a)

print (a)
print (b)

[1 2 3 4 5]
[1 2 3 4 5]


Now change the fourth element of ``b`` and print both ``a`` and ``b``

In [21]:
b[3] = 10
print (a)
print (b)

[1 2 3 4 5]
[ 1  2  3 10  5]


In [22]:
output_all.SSH

### Fancier demonstration

`Dataset` and `DataArray` objects are too complicated for `numpy`'s `copy` command.  For complex objects such as these use the `deepcopy` command.

In [23]:
ssh_tmp = deepcopy(output_all.SSH)
ssh_tmp[:] = 10
ssh_tmp[4,1,5:9,5:9].values

array([[10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.],
       [10., 10., 10., 10.]], dtype=float32)

In [24]:
output_all.SSH[4,1,5:9,5:9].values

array([[-1.3839613, -1.3924128, -1.3913262, -1.3853507],
       [-1.3569907, -1.3593892, -1.3547356, -1.3458692],
       [-1.3178127, -1.316035 , -1.3086342, -1.2973909],
       [-1.267521 , -1.2628312, -1.2530723, -1.239709 ]], dtype=float32)

Using `deepcopy` gives us an entirely new array in memory.  Operations on ``ssh_tmp`` do not affect the original fields that we found in the `output_all.SSH` `DataArray`.

#### Alternative to `deepcopy`
`xarray` gives us another way to deep copy `DataArrays` and `Datasets`, the `copy` method with `deep=True`:

``
ssh_tmp = output_all.SSH.copy(deep=True)
``
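
To illustrate the difference the `deep` argument makes, here is a minimal sketch using a small synthetic `DataArray` (made-up values, not ECCO output). It assumes `numpy`-backed data, as we have after calling `.compute()`. With `deep=False` the copy is a new object that still shares the underlying data; with `deep=True` the data are duplicated in memory.

```python
import numpy as np
import xarray as xr

# Toy DataArray standing in for SSH (values are made up)
da = xr.DataArray(np.zeros((2, 3), dtype=np.float32),
                  dims=("j", "i"), name="SSH")

shallow = da.copy(deep=False)   # new DataArray object, same underlying data
deep = da.copy(deep=True)       # new DataArray object, new data in memory

print(np.shares_memory(da.values, shallow.values))
print(np.shares_memory(da.values, deep.values))

# Changing the deep copy leaves the original untouched
deep[:] = 10
print(float(da.max()), float(deep.min()))
```

Passing `deep=True` explicitly is a reasonable habit, since it makes the intent clear regardless of the method's default.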

## Conclusion

You now know about the possible pitfalls of Python's **pass-by-reference** way of handling assignment statements, and about different methods for making copies of `numpy` arrays, `Datasets`, and `DataArrays`.