# Memory management in Python
Andrew Delman, updated 2024-07-10

## Objectives
Demonstrate how Python objects (variables) are stored in and cleared from memory, and establish good practices for conserving memory in a Python workspace.

## Introduction
Python is a very useful computing language for doing numerical analysis of large datasets, largely because of packages that have been developed specifically for this purpose. [Xarray](https://docs.xarray.dev/en/stable/) allows us to open data from a number of files using a single line of code (with `open_mfdataset`), while [Dask](https://www.dask.org/) makes it easy to parallelize computations and stagger memory usage by "chunking" the computations. However, we still need to be mindful of the memory limitations of our workspace, and there are some quirks to memory management in Python.


## View vs. copy
In some numerical computing languages such as Matlab, assigning a new variable gives us what behaves like an independent copy of the data in our workspace. In Python, when you create a variable by assigning it from another variable, the assignment does not copy the data. Instead, the new name becomes a second reference to the same object in memory; in this tutorial we call such a reference a *view*. This is the default behavior for native Python objects such as lists, tuples, and dictionaries, as well as [NumPy arrays](https://numpy.org/doc/stable/user/basics.copies.html).

### Native Python object (list)
Consider a simple list of 3 numbers:

In [1]:
test_list = [1,2,3]
view_list = test_list
view_list

[1, 2, 3]

We can check whether `view_list` refers to the same object in memory as `test_list`, using the `is` operator.

In [2]:
view_list is test_list

True

Not only that, because `view_list` is identical in memory space to `test_list`, when `view_list` is modified `test_list` is modified as well.

In [3]:
view_list[1] = 4
test_list

[1, 4, 3]

Compare this with creating a *copy* of `test_list`. Python has a `copy` module for this, but you can also create a copy of a list by calling its `.copy()` method, or by using an operator on the right-hand side of the assignment. For example, we can concatenate an empty list to create a copy.

In [4]:
test_list = [1,2,3]
copy_list = test_list + []
copy_list

[1, 2, 3]

Note that if we use the `==` operator, which compares the *values* of the two objects, the result is True.

In [5]:
copy_list == test_list

True

But if we compare using the `is` operator, we find that the two objects are independent.

In [6]:
copy_list is test_list

False

And so if `copy_list` is modified, `test_list` is not changed.

In [7]:
copy_list[1] = 4
test_list

[1, 2, 3]
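
The copies made above are *shallow*: they duplicate the outer list but not any objects nested inside it. The short sketch below illustrates the difference using the standard-library `copy` module; the variable names (`nested`, `alias`, `shallow`, `deep`) are made up for this illustration.

```python
import copy

nested = [[1, 2], [3, 4]]

alias = nested                  # second name for the same list (no copy)
shallow = nested.copy()         # new outer list, inner lists still shared
deep = copy.deepcopy(nested)    # new outer list and new inner lists

shallow[0][0] = 99              # changes an inner list shared with `nested`
deep[1][1] = -1                 # changes only the deep copy

print(alias is nested)          # True
print(shallow is nested)        # False
print(nested)                   # [[99, 2], [3, 4]]
print(deep)                     # [[1, 2], [3, -1]]
```

For flat lists of numbers like `test_list`, a shallow copy is all that is needed; `copy.deepcopy` only matters when the list contains other mutable objects.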

### NumPy arrays

Views or copies of NumPy arrays can be created using the same general syntax used above.

In [8]:
import numpy as np

test_array = np.array([1,2,3])
view_array = test_array
view_array

array([1, 2, 3])

In [9]:
view_array is test_array

True

In [10]:
view_array[1] = 4
test_array

array([1, 4, 3])

In [11]:
test_array = np.array([1,2,3])
copy_array = test_array + 0
copy_array

array([1, 2, 3])

In [12]:
copy_array == test_array

array([ True,  True,  True])

In [13]:
copy_array is test_array

False

In [14]:
copy_array[1] = 4
test_array

array([1, 2, 3])

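For NumPy arrays, views or copies can also be created by appending `.view()` or `.copy()` to the variable name. Note that `.view()` returns a new array object that shares the original array's data, so the `is` test returns `False` even though modifying one array changes the other.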
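
As a minimal sketch of how to tell the two apart, `np.shares_memory` and the `.base` attribute report whether two arrays use the same underlying data. The names `base_array`, `v`, `c`, and `s` are illustrative only.

```python
import numpy as np

base_array = np.array([1, 2, 3])
v = base_array.view()   # new array object, same underlying data
c = base_array.copy()   # new array object, new data
s = base_array[:2]      # basic slicing also returns a view

print(v is base_array)                    # False
print(np.shares_memory(v, base_array))    # True
print(np.shares_memory(c, base_array))    # False
print(s.base is base_array)               # True

v[1] = 4                # modifying the view modifies the original
print(base_array)       # [1 4 3]
print(c)                # [1 2 3]
```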

## Delayed computations and memory footprint

As the examples above showed, understanding the difference between a view and a copy is important to ensure your code is working the way you think it should be. And Python's capacity to reference an object without creating a separate copy of it is a helpful memory-saver. But there are other ways to create a Python object without writing its data to memory. Let's consider the memory footprint of two large arrays: one a regular NumPy array, the other a [Dask array](https://docs.dask.org/en/stable/array.html).
> **Note**: There are a number of tools that can be used to attempt to track the memory footprint of an object in Python, such as the object's `__sizeof__` method, `sys.getsizeof`, and `pympler.asizeof`. To get the most accurate estimate of actual memory usage in the workspace, in this tutorial we use `psutil.virtual_memory()` from the [psutil](https://psutil.readthedocs.io/en/latest/#memory) package. This function tells us the memory available, and then we can track how it changes.

In [15]:
import numpy as np
import psutil

# memory stats (in bytes)
psutil.virtual_memory()

svmem(total=7952175104, available=6225760256, percent=21.7, used=1463742464, free=5393702912, active=409407488, inactive=1563713536, buffers=1105920, cached=1093623808, shared=8818688, slab=107347968)
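
The before-and-after bookkeeping used in the rest of this tutorial can also be wrapped in a small helper. The sketch below is one possible way to do that, not part of `psutil` itself: `track_available_memory` and `record` are hypothetical names, and the reported number is system-wide, so other processes can influence it.

```python
from contextlib import contextmanager

import numpy as np
import psutil


@contextmanager
def track_available_memory(label):
    """Hypothetical helper: report the change in available memory (MB)."""
    record = {}
    before = psutil.virtual_memory().available
    yield record
    after = psutil.virtual_memory().available
    # negative values mean that less memory is available than before
    record["delta_mb"] = (after - before) / 2**20
    print(f"{label}: change in available memory {record['delta_mb']:.2f} MB")


with track_available_memory("500x500 array of ones") as record:
    block = np.ones((500, 500))
```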

In [16]:
# log available memory
memory_log = np.array([psutil.virtual_memory().available])

# NumPy array of random numbers
rng = np.random.default_rng()
random_array = rng.standard_normal((1000,1000))
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

Change in available memory:  -7.72265625  MB


So the impact on available memory of creating a NumPy array of random numbers with a standard normal distribution is a little less than 8 MB. Compare this to the estimate of the array's size using the `__sizeof__` method.

In [17]:
print('Array size estimate: ',(random_array.__sizeof__())/(2**20),' MB')

Array size estimate:  7.6295166015625  MB


Aside from the sign difference, the memory footprint estimates are very close.
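
For NumPy arrays specifically, the `nbytes` attribute gives the size of the data buffer directly. A brief sketch, assuming the default `float64` dtype (8 bytes per element); `arr` and `arr_view` are illustrative names:

```python
import sys

import numpy as np

arr = np.zeros((1000, 1000))    # float64 by default
print(arr.nbytes)               # 8000000
print(arr.nbytes / 2**20)       # 7.62939453125

# sys.getsizeof adds a small object header to the data size, but only
# for arrays that own their data; a view reports just the header.
arr_view = arr[:]
print(sys.getsizeof(arr) > arr.nbytes)
print(sys.getsizeof(arr_view) < arr.nbytes)
```

This is why size estimates based on `__sizeof__` or `sys.getsizeof` can be misleading for views: the data they point to is counted only once, on the array that owns it.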

Now we create a Dask array that is the same size.

In [18]:
import dask.array as da

# Dask array of random numbers, with chunks of size 100 in each dimension
memory_log = np.append(memory_log,psutil.virtual_memory().available)
dask_random_array = da.from_array(random_array,chunks=(100,100))
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

Change in available memory:  0.0  MB


In [19]:
print('Array size estimate: ',(dask_random_array.__sizeof__())/(2**20),' MB')

Array size estimate:  5.340576171875e-05  MB


In [20]:
# are the two names bound to the same object (True) or to distinct objects (False)?
dask_random_array is random_array

False

The `is` test returns `False` because the Dask array is a distinct Python object from `random_array`. Yet unlike a true copy, it has a negligible memory footprint (compared to the NumPy array's size of ~7.6 MB): it wraps the data that already exists in `random_array` rather than duplicating it.

Now we make another Dask array that squares the numbers in the first array. 

In [21]:
memory_log = np.append(memory_log,psutil.virtual_memory().available)
dask_sq_array = dask_random_array**2
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

Change in available memory:  0.0  MB


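And the new array also has no measurable memory footprint! Dask has only recorded the squaring operation to be carried out later; none of the squared values have been computed yet.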
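
A quick way to confirm that nothing has been evaluated is to inspect the Dask array's metadata, which is available without running any computation. The sketch below uses a stand-in array of ones (`source`, `lazy`, and `lazy_sq` are illustrative names) rather than the random array above.

```python
import numpy as np
import dask.array as da

source = np.ones((1000, 1000))
lazy = da.from_array(source, chunks=(100, 100))
lazy_sq = lazy ** 2             # builds a task graph, computes nothing

print(type(lazy_sq).__name__)   # Array: still a Dask array
print(lazy_sq.numblocks)        # (10, 10): number of chunks per dimension
print(lazy_sq.chunksize)        # (100, 100)
print(lazy_sq.nbytes / 2**20)   # size in MB the result would have if loaded
```

Here `nbytes` describes the array that *would* result from the computation, not memory currently in use.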

If we want to load a Dask array (or part of it) into workspace memory, then we can use `.compute()`.

In [22]:
random_array[0:3,0:3]

array([[-2.28651615, -0.47829676, -0.9985106 ],
       [ 0.02445815,  0.69858675,  1.12554956],
       [ 1.11910231, -0.86319442, -0.25582477]])

In [23]:
dask_random_array[0:3,0:3].compute()

array([[-2.28651615, -0.47829676, -0.9985106 ],
       [ 0.02445815,  0.69858675,  1.12554956],
       [ 1.11910231, -0.86319442, -0.25582477]])

In [24]:
dask_sq_array[0:3,0:3].compute()

array([[5.22815609e+00, 2.28767794e-01, 9.97023426e-01],
       [5.98201074e-04, 4.88023443e-01, 1.26686182e+00],
       [1.25238999e+00, 7.45104611e-01, 6.54463138e-02]])

Generally, when using Dask arrays (or Xarray datasets/data arrays that consist of Dask arrays), it is good practice to load an array into memory before you reference it multiple times. Otherwise, each time the code finally does execute, it will repeat the same computations!
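
As a small sketch of this practice, the example below shows two ways to avoid repeating work when several results depend on the same Dask array: loading it once with `.compute()`, or passing several lazy results to `dask.compute` in a single call. The tiny array and the names `source`, `lazy_sq`, and `loaded_sq` are chosen for illustration only.

```python
import numpy as np
import dask
import dask.array as da

source = np.arange(12.0).reshape(3, 4)
lazy_sq = da.from_array(source, chunks=(3, 2)) ** 2

# Option 1: load once, then reuse the in-memory NumPy array
loaded_sq = lazy_sq.compute()
total = loaded_sq.sum()
largest = loaded_sq.max()

# Option 2: request both results in one call, so they are
# evaluated together from the same task graph
total_2, largest_2 = dask.compute(lazy_sq.sum(), lazy_sq.max())

print(total, largest)       # 506.0 121.0
```

Option 1 is only appropriate when the loaded array fits comfortably in memory; Option 2 keeps just the reduced results.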

## Deleting variables to free up memory

Python object views and Dask arrays are two examples of how Python uses references to save memory in your workspace. This does mean, however, that deleting data in your workspace can get a bit complicated. In contrast to Matlab, where `clear variable` will delete the array `variable` from your workspace, the corresponding Python statement `del variable` only removes the name `variable`. The underlying data is freed only if no other references to it exist in the workspace.
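
The following sketch makes this behavior visible with a weak reference, which lets us check whether an object still exists without keeping it alive ourselves. It assumes the standard CPython interpreter, where an object is freed as soon as its last reference disappears; `data`, `alias`, and `watcher` are illustrative names.

```python
import weakref

import numpy as np

data = np.zeros(1000)
alias = data                    # second reference to the same array
watcher = weakref.ref(data)     # weak reference: does not keep the array alive

del data                        # removes one name; `alias` still refers to the array
alive_after_first_del = watcher() is not None
print(alive_after_first_del)    # True

del alias                       # removes the last reference
alive_after_second_del = watcher() is not None
print(alive_after_second_del)   # False
```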

Consider the memory impact if we load a portion of `dask_sq_array` into memory, and then delete it using the `del` statement.

In [25]:
memory_log = np.append(memory_log,psutil.virtual_memory().available)
sq_block = dask_sq_array[0:100,:].compute()
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

print('sq_block is type: ',type(sq_block))

memory_log = np.append(memory_log,psutil.virtual_memory().available)
del sq_block
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

Change in available memory:  -1.62890625  MB
sq_block is type:  <class 'numpy.ndarray'>
Change in available memory:  0.0  MB


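The NumPy array `sq_block` used up a block of memory when it was created, but that memory was not reported as available again after `sq_block` was deleted. Why? `del` only removes the name `sq_block`; the memory is reclaimed only if nothing else still refers to the data, and when `sq_block` was computed and loaded into memory, its data was also associated with the parent array `dask_sq_array`. Keep in mind as well that even when an object is truly freed, the Python process does not always return that memory to the operating system right away, so a change of 0.0 MB reported by `psutil` is not conclusive on its own.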

In [26]:
memory_log = np.append(memory_log,psutil.virtual_memory().available)
# del dask_sq_array
# del dask_random_array
del random_array
# del rng
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')
# rng = np.random.default_rng()
random_array = rng.standard_normal((1000,1000))
dask_random_array = da.from_array(random_array,chunks=(100,100))
dask_sq_array = dask_random_array**2
memory_log = np.append(memory_log,psutil.virtual_memory().available)
print('Change in available memory: ',np.diff(memory_log[-2:])[0]/(2**20),' MB')

Change in available memory:  0.0  MB
Change in available memory:  -7.6640625  MB
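
One way to read the first result above: `del random_array` removed only the name, while `dask_random_array` (built with `da.from_array`) still referred to the data. The sketch below checks this idea with a weak reference. It assumes CPython and a Dask version in which `from_array` keeps a reference to the NumPy data rather than copying it; `source`, `watcher`, and `lazy` are illustrative names.

```python
import weakref

import numpy as np
import dask.array as da

source = np.ones((1000, 1000))
watcher = weakref.ref(source)   # lets us check whether the array still exists
lazy = da.from_array(source, chunks=(100, 100))

del source                      # removes the name only

# If the Dask array still references the data, the array is still alive
source_still_alive = watcher() is not None
print(source_still_alive)

# The Dask array can still be computed after the name is gone
value = lazy[0, 0].compute()
print(value)                    # 1.0
```

Under these assumptions, the memory held by the original array can only be released once every Dask array built from it has also been deleted.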


## Xarray and memory management