# Data Management for Individual Scientists

Authors: Erik Tollerud & Brigitta Sipőcz

In astronomy, "data management" typically describes large-scale efforts such as the Large Synoptic Survey Telescope, with its gigabytes-per-second data rates, or the James Webb Space Telescope, with its more than a hundred different observing modes.  But for an individual scientist, the general concept of data management still applies, just in a very different sense: managing data from or for your own scientific projects.  This tutorial aims to suggest some guidelines and point out some pitfalls for personal data management.

While this tutorial covers several levels of complexity, there is one golden rule, which you should remember even if you remember nothing else: Do Not Make Your Own Format. You need only examine the examples that are shown in [the astropy table reader docs](http://docs.astropy.org/en/stable/io/ascii/#supported-formats) or the related [fixed width gallery](http://docs.astropy.org/en/stable/io/ascii/fixed_width_gallery.html#fixed-width-gallery) to see the needless complexity that has been introduced by well-meaning astronomers who chose to roll their own.  The tools available to us often make it easy to invent a new format, but do your best to resist. Future collaborators, future co-workers, future students, and future you will thank you.

Note that while this tutorial is primarily based on Python and some parts are Python-specific to provide concrete examples, most of the guidelines discussed here apply to a range of approaches and languages.

# DM the Easy Way: "automatic" tools

## Integrated Caching

Some software packages provide caching - i.e., after a file is downloaded, it's automatically saved somewhere and used automatically the next time you ask for it. This can be a convenient way to ensure data you need for your work is available and limit your impact on remote servers that might provide the data.

As an example, consider the following function for downloading a Hubble Space Telescope image of one of the greatest galaxies - the Local Group dwarf GR8:

In [None]:
from astropy.utils import data
from astropy.io import fits

gr8_url = 'https://archive.stsci.edu/pub/hlsp/angst/acs/hlsp_angst_hst_acs-wfc_10915-gr8_f814w_v1_ref.fits'
gr8_fn = data.download_file(gr8_url, cache=True)
fits.open(gr8_fn).info()

The first time you run this, the download should take a little while (it's a 70MB file), but if you run it again, you'll see it's almost instantaneous.  This is because the file has been saved in a location that is relatively hidden from you, and is re-used when you ask for it again.

This may seem like an easy way to manage your data, but consider these cases:
* What happens if the remote file gets updated?
* What happens if you start running out of space and want to delete some of your old data files?

You can address these manually by running the cell below, but consider what happens if you lose this notebook sometime between now and when the problems above arise. For this reason, this general problem has been enshrined in a computer science/software engineering adage: https://martinfowler.com/bliki/TwoHardThings.html .

In [None]:
data.clear_download_cache(gr8_url)

To see this problem a little more clearly, consider the code below:

In [None]:
from astropy.coordinates import SkyCoord, EarthLocation

sc = SkyCoord(ra=1, dec=2, unit='deg',
              frame='fk5', obstime='2019-8-1',
              location=EarthLocation.of_site('kitt peak'))
sc.transform_to('altaz')

Depending on when/if you've last used code like this, it might take a little while to run the first time, because behind the scenes the code has to look up the exact orientation of the Earth on the day in question (something that is not fully predictable due to things like earthquakes and therefore requires data downloads).  These data change regularly, so the software has to manage them carefully to ensure the file stays up to date and you aren't constantly served an out-of-date file that gives inaccurate position information, ruining your precision science.

To sum up - while caching is a viable solution if the software you are using is careful about managing it for you, in general you should not rely on it unless you are sure the data are never going to change, are publicly available, and are small enough that you don't have to worry about deleting them.
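One alternative to relying on a hidden cache is to keep an explicit copy of the file somewhere you control and to record a checksum, so that you can tell later whether the remote file has changed. The following is only a minimal sketch using the standard library: the function name `fetch_once`, its arguments, and the directory layout are hypothetical choices made for illustration, not part of `astropy`.

```python
import hashlib
import urllib.request
from pathlib import Path


def fetch_once(url, dest, expected_sha256=None):
    """Download `url` to `dest` unless `dest` already exists.

    Returns the destination path and the SHA-256 hex digest of the file.
    (Hypothetical helper for illustration only.)
    """
    dest = Path(dest)
    if not dest.exists():
        # Only hit the remote server when there is no local copy yet
        dest.parent.mkdir(parents=True, exist_ok=True)
        urllib.request.urlretrieve(url, str(dest))
    # Hash the local copy so a changed file can be detected later
    digest = hashlib.sha256(dest.read_bytes()).hexdigest()
    if expected_sha256 is not None and digest != expected_sha256:
        raise ValueError(f"checksum mismatch for {dest}: got {digest}")
    return dest, digest
```

Calling something like `fetch_once(gr8_url, 'data/gr8_f814w.fits')` (the destination path is an assumption) would put the file in a place you can see, back up, and delete deliberately, and noting the returned digest alongside your analysis gives you a way to check later that you are still working with the same bytes.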

## Pickling

Now let's consider the topic of *saving* data (as opposed to getting it). Python and its wider ecosystem provide a few ways of doing this that are built-in and relatively easy.  But as with caching, the easiest ways come with certain pitfalls.

Consider the following generated image - how would we save it?

In [None]:
import numpy as np
from matplotlib import pyplot as plt
%matplotlib inline

image_scaling_factor = 1e4
xpix, ypix = np.mgrid[:512, :512]
img = xpix * ypix + np.random.randn(512, 512) * image_scaling_factor
plt.imshow(img);

Python provides a built-in way to handle basic data like this, called "pickling":

In [None]:
import pickle

data_to_save = {'image': img, 'xy': (xpix, ypix), 'image_scaling_factor': image_scaling_factor}
with open('mydata.pickle', 'wb') as f:
    pickle.dump(data_to_save, f)

In [None]:
import os

os.listdir()

In [None]:
with open('mydata.pickle', 'rb') as f:
    loaded_data = pickle.load(f)
loaded_data

This shows that you can save out the image and some extra information with a minimum of fuss, and load it again almost as simply. However, there are some serious drawbacks here - to start with, take a look at the size of the file:

In [None]:
os.path.getsize('mydata.pickle')/1024/1024 # MB

How large did you expect it to be? Why might this not be ideal, particularly if this were a significantly larger dataset?
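One way to reason about the expected size is to add up the raw in-memory size of each array that went into the pickle. The sketch below rebuilds the same arrays and reports their `nbytes`; the integer dtype produced by `np.mgrid` is platform-dependent, so the totals on your machine may differ.

```python
import numpy as np

# Rebuild the same arrays used in the pickle example
xpix, ypix = np.mgrid[:512, :512]
img = xpix * ypix + np.random.randn(512, 512) * 1e4

arrays = {'image': img, 'xpix': xpix, 'ypix': ypix}
for name, arr in arrays.items():
    # nbytes is the number of elements times the size of one element
    print(f"{name}: dtype={arr.dtype}, {arr.nbytes / 1024**2:.1f} MB")

total_mb = sum(arr.nbytes for arr in arrays.values()) / 1024**2
print(f"total raw array size: {total_mb:.1f} MB")
```

Note that the pixel-coordinate grids take up space comparable to the image itself, even though they could be regenerated from the image shape alone.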

Moreover, `pickle` has some more subtle drawbacks. The following cells illustrate some of these.

#### Pickle Issue 1

In [None]:
with open('data_management.ipynb', 'r') as nb_file:
    data_to_save = {'image': img, 'file_to_open': nb_file}
    with open('mydata.pickle', 'wb') as f:
        pickle.dump(data_to_save, f)

This problem is straightforward: some types, such as the open file object here, are not picklable at all.
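A common workaround is to store something that *describes* the unpicklable object rather than the object itself, for example the path of a file instead of the open file handle. The sketch below uses a throwaway temporary file (an assumption made to keep it self-contained) and catches the error that the cell above raises.

```python
import os
import pickle
import tempfile

# Create a small throwaway file to stand in for a real data file
with tempfile.NamedTemporaryFile('w', suffix='.txt', delete=False) as tmp:
    tmp.write('some text')
    path = tmp.name

with open(path) as fh:
    try:
        pickle.dumps({'file_to_open': fh})
        pickled_ok = True
    except TypeError as err:
        # Open file objects cannot be pickled
        pickled_ok = False
        print(err)

# Storing the *path* (a plain string) round-trips without trouble
restored = pickle.loads(pickle.dumps({'file_to_open': path}))
os.remove(path)
```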

#### Pickle Issue 2

Let's say you've decided to use Python for its object-oriented power.  You decide to make an image generator *class* instead of pickling the data itself:

In [None]:
class ImageGenerator:
    def __init__(self, imgsize, imgscl):
        self.imgsize = imgsize
        self.imgscl = imgscl
        
    def make_image(self):
        xpix, ypix = np.mgrid[:self.imgsize[0], :self.imgsize[1]]
        return xpix * ypix + np.random.randn(*xpix.shape) * self.imgscl

In [None]:
imagegen = ImageGenerator((512, 512), 1e4)

with open('mydata.pickle', 'wb') as f:
    pickle.dump(imagegen, f)
with open('mydata.pickle', 'rb') as f:
    imagegen_loaded = pickle.load(f)
    
plt.imshow(imagegen_loaded.make_image());

So far so good.  But now let's say you realize you want to re-work the class to use more descriptive attribute names:

In [None]:
class ImageGenerator:
    def __init__(self, image_size, image_scaling_factor):
        self.image_size = image_size
        self.image_scaling_factor = image_scaling_factor
        
    def make_image(self):
        xpix, ypix = np.mgrid[:self.image_size[0], :self.image_size[1]]
        return xpix * ypix + np.random.randn(*xpix.shape) * self.image_scaling_factor

with open('mydata.pickle', 'rb') as f:
    imagegen_loaded = pickle.load(f)
    
plt.imshow(imagegen_loaded.make_image())

As you can see, the old object unpickled just fine... but doesn't work!  This is a simple example of a more general problem: if you ever need to re-name, re-design or otherwise change something you've pickled, doing so often renders all your existing pickles somewhere between mildly broken and completely unpicklable. And that's bad.
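One way to reduce this fragility is to save only the plain parameters needed to rebuild the object, under explicit key names, in a widely supported format such as JSON. This is a sketch of that idea rather than a recommendation for every case; the `to_params` and `from_params` method names are hypothetical.

```python
import json


class ImageGenerator:
    def __init__(self, image_size, image_scaling_factor):
        self.image_size = tuple(image_size)
        self.image_scaling_factor = image_scaling_factor

    def to_params(self):
        # The keys written to disk are chosen explicitly, so they do not
        # have to follow later renames of the attributes
        return {'image_size': list(self.image_size),
                'image_scaling_factor': self.image_scaling_factor}

    @classmethod
    def from_params(cls, params):
        return cls(params['image_size'], params['image_scaling_factor'])


gen = ImageGenerator((512, 512), 1e4)
serialized = json.dumps(gen.to_params())  # plain text, readable by any language
restored = ImageGenerator.from_params(json.loads(serialized))
```

Because the stored keys are decoupled from the attribute names, a later rename inside the class only requires updating `to_params` and `from_params`, and files written earlier stay readable.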

#### Pickle Issue 3

Now let's say you get an email "from" a trusted collaborator who includes a file for you to use.  You good-naturedly load it up and see this:

In [None]:
this_is_a_very_safe_pickle_trust_me = b"\x80\x03cbuiltins\nexec\nq\x00X\x8b\x01\x00\x00\nimport base64\nexec(base64.b64decode(b'CnByaW50KCJJIENBTiBIQVogQ0hFRVNFQlVSR0VSLiBBbHNvIEkgaGFja2VkIHlvdXIgZGF0YS4iKQppZiAnaW1hZ2VnZW5fbG9hZGVkJyBpbiBnbG9iYWxzKCk6CiAgICBpbWFnZV90b19oYWNrID0gZ2xvYmFscygpWydpbWFnZWdlbl9sb2FkZWQnXQogICAgZm9yIGksIG5hbWUgaW4gZW51bWVyYXRlKGltYWdlX3RvX2hhY2suX19kaWN0X18pOgogICAgICAgIHNldGF0dHIoaW1hZ2VfdG9faGFjaywgbmFtZSwgJ/CfkIgnIGlmIGklMj09MCBlbHNlICfwn42UJykK'))\nq\x01\x85q\x02Rq\x03."

pickle.loads(this_is_a_very_safe_pickle_trust_me)
    
imagegen_loaded.imgsize, imagegen_loaded.imgscl

As this demonstrates, pickle is an inherently unsafe format because it has the potential to execute any arbitrary code while being unpickled (to see how this was done, you can look at `safe.py` in this repo). That means you should never trust any pickle someone sends you... So it's effectively useless for sharing data with others.
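To make the mechanism concrete, here is a deliberately harmless sketch of one well-known way a pickle can trigger code on load: an object's `__reduce__` method tells pickle which callable to invoke when rebuilding it. This is an illustration of the general idea and not necessarily how the payload above was built; the class name is made up.

```python
import pickle


class NotAsInnocentAsItLooks:
    def __reduce__(self):
        # Tells pickle: "to rebuild this object, call eval('6 * 7')".
        # A malicious author could put any expression or callable here.
        return (eval, ("6 * 7",))


payload = pickle.dumps(NotAsInnocentAsItLooks())

# Loading the bytes runs eval(); we get its result, not an instance
result = pickle.loads(payload)
print(result)  # 42
```

Note that the class definition does not need to exist on the machine that loads the payload: the bytes only refer to the built-in `eval` and its argument.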

#### Pickle Issue 4

On top of all that, the pickle format *itself* changes over time, such that pickles produced by a newer Python may not work on older versions of Python, and essentially none of them work with any language other than Python.
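The protocol version is recorded in the pickle byte stream itself, and you can choose it explicitly when writing. The sketch below writes the same object with an older and a newer protocol and inspects the header; pinning an older protocol is one way to keep a pickle readable by older Python versions, although it does nothing for the other issues above.

```python
import pickle

obj = {'image_scaling_factor': 1e4}

# The default and the newest protocol depend on the running Python version
print(pickle.DEFAULT_PROTOCOL, pickle.HIGHEST_PROTOCOL)

old_style = pickle.dumps(obj, protocol=2)
new_style = pickle.dumps(obj, protocol=pickle.HIGHEST_PROTOCOL)

# For protocol 2 and later, the stream starts with the PROTO opcode (0x80)
# followed by one byte giving the protocol number
header_protocols = (old_style[1], new_style[1])
print(header_protocols)
```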

Taken together, that means pickle, while very convenient, is not useful for anything beyond saving your *own* data if it's either very simple data, or something that you're sure won't ever change (and trust me... it will).

### Exercise: the numpy formats

Similar to the pickle format, but somewhat different, are the `npy` and `npz` formats.  These are files that the `numpy` package can produce from `numpy` arrays.  While they provide a similar quick-and-easy way to save out data from Python, they also have their own drawbacks (some similar to pickle, others less so).  Try replicating the pickle example with the `numpy` formats, and compare and contrast the advantages/disadvantages.  Discuss with your neighbor if you are both willing and interested.

In [None]:
np.save?

In [None]:
np.savez?

In [None]:
np.savez_compressed?
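As a possible starting point for the exercise, the sketch below writes the image and its scaling factor to a compressed `npz` file in a temporary directory (an assumption, to avoid cluttering your working directory) and reads them back. It is only one of several reasonable approaches, and the comparison with pickle is left to you.

```python
import os
import tempfile

import numpy as np

image_scaling_factor = 1e4
xpix, ypix = np.mgrid[:512, :512]
img = xpix * ypix + np.random.randn(512, 512) * image_scaling_factor

npz_path = os.path.join(tempfile.mkdtemp(), 'mydata.npz')
# Each keyword argument becomes a named array inside the file
np.savez_compressed(npz_path, image=img,
                    image_scaling_factor=image_scaling_factor)

# allow_pickle=False refuses to load any pickled Python objects
with np.load(npz_path, allow_pickle=False) as loaded:
    loaded_img = loaded['image']
    loaded_scl = float(loaded['image_scaling_factor'])

print(os.path.getsize(npz_path) / 1024 / 1024)  # MB
```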

# DM the slightly harder way: managing files 

While the above 

## Tables/catalogs

1. demo csv-writers
2. show the dangers of non-roundtripping (demoing ECSV as an example of keeping metadata)
3. compare csv to fits - binary better, but row-based
4. Do a row/column-major comparison - asdf? or hdf5?
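The round-tripping danger in item 2 can be sketched with nothing more than the standard-library `csv` module: plain CSV stores no type or unit information, so what you read back is not what you wrote. The column names and values below are made up for illustration.

```python
import csv
import io

# Hypothetical one-row catalog with a string, a float, and an integer column
rows = [{'name': 'target_a', 'ra_deg': 10.5, 'n_obs': 3}]

buf = io.StringIO()  # in-memory stand-in for a file on disk
writer = csv.DictWriter(buf, fieldnames=['name', 'ra_deg', 'n_obs'])
writer.writeheader()
writer.writerows(rows)

buf.seek(0)
loaded = list(csv.DictReader(buf))

# Every value comes back as a string: the types (and any units) are gone
print(loaded[0])
print(loaded == rows)  # False
```

Formats that carry a description of each column alongside the data, such as the ECSV format mentioned in the list above, exist to avoid exactly this loss.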

## Images/data files

1. show how fits and hdf5 are both reasonable as raw images
2. demo data structures like nddata
3. demo file *management* using https://ccdproc.readthedocs.io/en/latest/image_management.html as a case study

# "Real DM": databases

1. show how shelve can do in a pinch, but 
2. describe sqlite and show an example of using it locally to store data
3. link to aq tutorial (Gaia or sdss - whichever is query)
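As a sketch of what item 2 might look like, the standard-library `sqlite3` module is enough to store and query a small observation log. The table layout, column names, and values below are hypothetical; the in-memory database keeps the example self-contained, and passing a file name instead would persist the data on disk.

```python
import sqlite3

# ':memory:' creates a temporary database; use a file name to keep the data
conn = sqlite3.connect(':memory:')
conn.execute(
    "CREATE TABLE observations (target TEXT, band TEXT, exptime_s REAL)")

rows = [('target_a', 'F814W', 1200.0),
        ('target_a', 'F475W', 900.0),
        ('target_b', 'F814W', 600.0)]
# Parameterized queries (the ? placeholders) avoid hand-built SQL strings
conn.executemany("INSERT INTO observations VALUES (?, ?, ?)", rows)
conn.commit()

f814w_targets = [row[0] for row in conn.execute(
    "SELECT target FROM observations WHERE band = ? ORDER BY target",
    ('F814W',))]
total_exptime = conn.execute(
    "SELECT SUM(exptime_s) FROM observations").fetchone()[0]
conn.close()

print(f814w_targets, total_exptime)
```

Unlike a pickle, the resulting database file can be opened from many languages and tools, and the query does the filtering so that you do not have to load everything into memory first.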


Mention spark/AWS