# Data, files, IO

author: Eric Charles, based on a similar NB by Sam Schmidt<br>
Last successfully run: October 6, 2022<br>

This notebook demonstrates a variety of ways that users can interact with data using the `descformats` package.<br>

In addition to the `descformats` code, we have developed a companion package, `tables_io`, [available here on GitHub](https://github.com/LSSTDESC/tables_io/). <br>

`tables_io` aims to simplify IO for reading and writing some of the most common file formats used within DESC: HDF5 files, parquet files, Astropy tables, and `qp` ensembles.  There are several examples of `tables_io` usage in the [nb directory](https://github.com/LSSTDESC/tables_io/tree/main/nb) of the `tables_io` repository; consult those for further examples.  We will also demonstrate usage in several places in this notebook.  

In short, `tables_io` aims to simplify file IO, and much of the IO is sorted out for you automatically if your files have the appropriate extensions: you can simply call `tables_io.read("file.fits")` to read in a FITS file or `tables_io.read("newfile.pq")` to read in a dataframe in parquet format.  
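
As a rough illustration of extension-based dispatch, here is a minimal sketch in plain Python. It is not how `tables_io` is implemented: the `SUFFIX_TO_FORMAT` table and the `guess_format` helper are hypothetical names made up for this example, and the real set of recognised extensions may differ.

```python
from pathlib import Path

# Hypothetical suffix-to-format table; the real mapping lives inside
# tables_io and may differ from this one.
SUFFIX_TO_FORMAT = {
    ".fits": "fits",
    ".hdf5": "hdf5",
    ".h5": "hdf5",
    ".pq": "parquet",
    ".parquet": "parquet",
}


def guess_format(path):
    """Return a format label for `path` based only on its file extension."""
    suffix = Path(path).suffix.lower()
    try:
        return SUFFIX_TO_FORMAT[suffix]
    except KeyError:
        raise ValueError(f"No known format for extension {suffix!r}") from None


print(guess_format("file.fits"))
print(guess_format("newfile.pq"))
```

A reader function built on this idea would look up the format first and then hand the path to the matching backend, which is why a file with an unexpected extension cannot be dispatched automatically.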

Two other `descformats` concepts that are useful in a Jupyter notebook are the DataStore and the DataHandle.   These provide you with ways to access particular types of data.  We will demonstrate some useful features of the DataStore and the DataHandle below.

Let's start out with some imports:

In [None]:
import os
import numpy as np
import descformats
import tables_io

First, let's use tables_io to read in some example data.  Two example files ship with descformats, each containing a small amount of cosmoDC2 data from healpix pixel `9816`.  They are located in the `src/descformats/data/testdata` directory in the DESCFormats repository.  Let's read in one of those data files:

(NOTE: for historical reasons, our example files hold data in HDF5 format where all of the data arrays are in a single HDF5 group named "photometry".  We will grab the data specifically from that group by reading in the file and specifying ["photometry"] as the group in the cells below.  We'll call our dataset "traindata_io" to indicate that we've read it in via tables_io, and to distinguish it from the data that we'll place in the DataStore in later steps.)

In [None]:
DATADIR = os.path.abspath(os.path.join(os.path.dirname(descformats.__file__), 'data'))
trainFile = os.path.join(DATADIR, 'testdata/test_dc2_training_9816.hdf5')
testFile = os.path.join(DATADIR, 'testdata/test_dc2_validation_9816.hdf5')


tables_io reads this data in as an ordered dictionary of numpy arrays by default, though it can be converted to other data formats, such as a pandas dataframe, as well. 

descformats wraps this functionality into a table handle object that you will be able to use as a reference to the data when building data analysis pipelines.

Let's print out the keys in the ordered dict showing the available columns, then convert the data to a pandas dataframe and look at a few of the columns as a demo:

In [None]:
handle = descformats.TableHandle('data', path=trainFile)

In [None]:
traindata_io = handle.read()

In [None]:
traindata_io

In [None]:
traindata_pq = tables_io.convert(traindata_io, tables_io.types.PD_DATAFRAME)

In [None]:
traindata_pq['photometry'].head()

Next, let's set up the DataStore, so that our RAIL module will know where to fetch data.  We will set `allow_overwrite` so that we can overwrite data files without throwing errors while in our Jupyter notebook:

In [None]:
from descformats.data import DATA_STORE

In [None]:
DS = DATA_STORE()
DS.__class__.allow_overwrite = True
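
To see why a class-level overwrite flag is convenient in a notebook, here is a minimal sketch of the idea. `ToyStore` is a hypothetical stand-in written for this example; it is not the descformats DataStore, and its error type and method signature are assumptions.

```python
class ToyStore(dict):
    """Hypothetical name-to-data store; not the descformats DataStore."""

    # Class-level flag, so flipping it affects every ToyStore instance.
    allow_overwrite = False

    def add_data(self, key, data):
        """Store `data` under `key`, refusing to replace an entry by default."""
        if key in self and not self.allow_overwrite:
            raise KeyError(f"{key!r} is already in the store")
        self[key] = data
        return data


store = ToyStore()
store.add_data("train_data", {"id": [1, 2, 3]})

# A second add under the same key is rejected while the flag is off.
try:
    store.add_data("train_data", {"id": [4, 5, 6]})
    blocked = False
except KeyError:
    blocked = True

# Once overwriting is allowed, re-running the same cell no longer fails.
ToyStore.allow_overwrite = True
store.add_data("train_data", {"id": [4, 5, 6]})
```

In a pipeline the default guard protects against silently clobbering data, while in a notebook, where cells are re-run often, turning it off avoids spurious errors.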

We need to add our data to the DataStore.  We can either add data that is already in memory, like our `traindata_io`, with `DS.add_data`, or read data into the DataStore directly from a file with the `DS.read_file` method, which we will do with our "test data".  For the data already in memory, we want our data in a numpy ordered dict, so we will specify the handle type as a `TableHandle`.  If, instead, we were storing a qp ensemble, we would set the handle type to `QPHandle`. 

In [None]:
#add data that is already read in
train_data = DS.add_data("train_data", traindata_io, descformats.TableHandle )

To read in data from a file, we can use `DS.read_file`.  Once again we want a TableHandle, and we can feed it the `testFile` path defined in Cell #2 above:

In [None]:
#add test data directly to datastore from file:
test_data = DS.read_file("test_data", descformats.TableHandle, testFile)

Let's list the data available to us in the DataStore:

In [None]:
DS

Note that the DataStore is just a dictionary of the files.  Each Handle object contains the actual data, which is accessible via the `.data` property for that file.  While not particularly designed for it, you can manipulate the data via these dictionaries, which is handy for on-the-fly exploration in notebooks.<br>
For example, say we want to add an additional column to the train_data, say "FakeID", with a simpler identifier than the long ObjID that is contained in the `id` column:

In [None]:
train_data().keys()
numgals = len(train_data()['photometry']['id'])
train_data()['photometry']['FakeID'] = np.arange(numgals)

Let's convert our train_data to a pandas dataframe with tables_io, and our new "FakeID" column should now be present:

In [None]:
train_table = tables_io.convertObj(train_data()['photometry'], tables_io.types.PD_DATAFRAME)
train_table.head()

And there it is: a new "FakeID" column has been added to the end of the dataset.  Success!
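
The reason the new column sticks is that calling the handle returns the stored object itself rather than a copy. A minimal sketch of that behaviour follows; `ToyCallableHandle` is a hypothetical class written for this example (not the descformats handle), and plain lists stand in for numpy arrays.

```python
class ToyCallableHandle:
    """Hypothetical handle whose call returns the data it holds."""

    def __init__(self, data):
        self.data = data

    def __call__(self):
        # Return the stored object itself, not a copy.
        return self.data


toy = ToyCallableHandle({"photometry": {"id": [9001, 9002, 9003]}})

# Add a column through the call syntax, mirroring the FakeID example above.
n_rows = len(toy()["photometry"]["id"])
toy()["photometry"]["FakeID"] = list(range(n_rows))

# The change is visible through the .data attribute as well.
print(list(toy.data["photometry"]))
```

If the call returned a copy instead, the assignment would modify a temporary object and the stored table would be left unchanged.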

### Using DataHandle objects to read data

There are a variety of ways that you can use a DataHandle:

1. reading all the data at once with the read() function
2. opening the file, but not reading the data
3. iterating over the data

#### Let's start by reading all the data

In [None]:
new_handle = descformats.TableHandle('data', path=trainFile)

In [None]:
new_handle.read()

Note that the data are stored in the DataHandle object, so you can do:

In [None]:
new_handle.data

Note also that the file has been closed:

In [None]:
new_handle.fileObj is None

#### Opening the DataHandle

First, let's reset the data in the data handle, then we will use the open() method to open the file object:

In [None]:
new_handle.data = None

In [None]:
new_handle.open()

In [None]:
new_handle.fileObj

In [None]:
new_handle.data is None

In [None]:
new_handle.close()

In [None]:
new_handle.fileObj
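
The read, open, and close behaviour shown so far can be summarised with a toy class. This is a minimal sketch under stated assumptions: `ToyFileHandle` is hypothetical, it reads a small JSON file rather than HDF5, and it only mirrors the `data` and `fileObj` attribute names used above, not the actual descformats implementation.

```python
import json
import os
import tempfile


class ToyFileHandle:
    """Hypothetical stand-in for a data handle; not the descformats class."""

    def __init__(self, tag, path):
        self.tag = tag
        self.path = path
        self.data = None      # filled by read()
        self.fileObj = None   # set by open(), cleared by close()

    def open(self):
        """Open the file without loading its contents."""
        self.fileObj = open(self.path, "r", encoding="utf-8")
        return self.fileObj

    def close(self):
        """Close the file if it is open."""
        if self.fileObj is not None:
            self.fileObj.close()
        self.fileObj = None

    def read(self):
        """Load everything into memory, then release the file."""
        with open(self.path, "r", encoding="utf-8") as stream:
            self.data = json.load(stream)
        return self.data


# Write a tiny table to a temporary file so the sketch is self-contained.
tmp_dir = tempfile.mkdtemp()
toy_path = os.path.join(tmp_dir, "toy_table.json")
with open(toy_path, "w", encoding="utf-8") as stream:
    json.dump({"id": [10, 11, 12], "mag": [21.5, 22.0, 23.1]}, stream)

toy_handle = ToyFileHandle("data", toy_path)

# Mode 1: read everything; data is cached and no file is left open.
toy_handle.read()
read_state = (toy_handle.data is not None, toy_handle.fileObj is None)

# Mode 2: open only; a file object exists but no data has been loaded.
toy_handle.data = None
toy_handle.open()
open_state = (toy_handle.data is None, toy_handle.fileObj is not None)

toy_handle.close()
closed_state = toy_handle.fileObj is None
```

Keeping `data` and `fileObj` as separate attributes is what lets a caller choose between holding the whole table in memory and holding only an open file.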

#### Iterating over the data

Here we use the DataHandle to create an iterator over the data:

In [None]:
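# Build an iterator that walks the "photometry" group 1000 rows at a time
x = new_handle.iterator(groupname='photometry', chunk_size=1000)
print(x)
# Each item is a (start, end, data) tuple for one chunk of rows
for xx in x:
    print(xx[0], xx[1], xx[2]['id'][0])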
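
To make the chunking pattern concrete, here is a minimal sketch of a generator that yields the same kind of three-element items over an in-memory table. The `iter_chunks` function is hypothetical and written for this example; plain lists stand in for numpy arrays, and it does not reflect how descformats or tables_io read chunks from disk.

```python
def iter_chunks(table, chunk_size):
    """Yield (start, end, chunk) tuples over a dict of equal-length columns."""
    if chunk_size <= 0:
        raise ValueError("chunk_size must be positive")
    n_rows = len(next(iter(table.values()))) if table else 0
    for start in range(0, n_rows, chunk_size):
        # The last chunk may be shorter than chunk_size.
        end = min(start + chunk_size, n_rows)
        chunk = {name: column[start:end] for name, column in table.items()}
        yield start, end, chunk


toy_table = {"id": list(range(100, 107)), "mag": [20.0 + i for i in range(7)]}

for start, end, chunk in iter_chunks(toy_table, chunk_size=3):
    print(start, end, chunk["id"][0])
```

Processing one chunk at a time keeps memory use bounded by `chunk_size` rather than by the size of the full table, which is the main reason to iterate instead of calling read().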