# Creating and interacting with a DAXA Archive

This tutorial will explain the basic concepts behind the second type of important class in DAXA, the Archive class (with mission classes being the first type, see [the missions tutorial](missions.html)). DAXA Archives are what manage the datasets that we download from various missions, enabling easy access and greatly simplifying processing/reduction - they allow you to stop thinking about all the files and settings that any large dataset entails.

We will cover the following:

* Setting up an Archive from scratch, using filtered DAXA missions.
* Loading an existing Archive from disk.
* The properties of an Archive.
* Accessing processing logs and success information (though we do not cover processing in this part of the documentation).

## Import Statements

In [1]:
from daxa.mission import XMMPointed, Chandra, eRASS1DE, ROSATPointed
from daxa.archive import Archive

## What is a DAXA archive?

DAXA Archives take a set of filtered missions, make sure that their data are downloaded, and enable easy access and organisation of all data files and processing functions. Key functionality includes:

* Storing the logs and errors of all processing steps (if run).
* Allowing for their easy retrieval. 
* Managing the myriad files produced during the processing.
* Keeping track of which processes failed for which data, ensuring that any further processing only runs on data that have successfully passed through the earlier processes.

Archives can also be loaded back into DAXA at a later date, so that the processing logs of data that have since been found to be problematic can be easily inspected, or so that processing steps can be re-run with different settings. This also allows archives to be updated if more data become available.

## Creating a new archive

Here we will demonstrate how to set up a new DAXA Archive from scratch - this information can be combined with [the missions tutorial](missions.html) and the <font color='red'>case studies</font> to create an archive from any dataset you might be using.

### Step 1 - Set up and filter missions 

The first thing we have to do is to select the observations that we wish to include in the archive (and indeed the missions that we wish to include). The missions all have different characteristics, so your choice of which to include will be heavily dependent on your science case.

Here we will create an archive of XMM, Chandra, eROSITA All-Sky DR1, and ROSAT pointed observations of a famous galaxy cluster (though the archive would behave the same if it held data for a large sample of objects).

First of all, we define instances of the mission classes that we wish to include:

In [None]:
xm = XMMPointed()
ch = Chandra()
er = eRASS1DE()
rp = ROSATPointed()

Then we filter them to only include observations of our cluster:

In [None]:
xm.filter_on_name("A3667")
ch.filter_on_name("A3667")
er.filter_on_name("A3667")
rp.filter_on_name("A3667")

We then download the available data (though the declaration of an Archive would also trigger this, we do it this way because we wish to download pre-generated products for Chandra and ROSAT pointed observations):

In [None]:
xm.download()
ch.download(download_products=True)
er.download()
rp.download(download_products=True)

### Step 2 - Setting up an Archive object

Now we create the actual DAXA Archive instance - all this requires is for us to choose an archive name (which is what will be used to load it back in at a later date, if necessary) and to pass in the filtered missions that we have already created:

In [None]:
arch = Archive("A3667", [xm, ch, er, rp], clobber=True)

Now we've declared it, we can use the `info()` method to get a summary of its current status, including the amount of data available:

In [None]:
arch.info()

### Step 3 - Processing the Archive

We're not actually going to cover _how_ to process things here, as each telescope tends to have its own backend software with a unique way of doing things; they each have their own processing tutorials, which will demonstrate both a one-line processing method, and how to control the reduction in more detail. Any processing method will take the archive object as an argument, and act on the data stored within it.

So instead we include this step here to highlight that the next logical step after the creation of a new archive is to run processing and reduction routines, if raw data have been downloaded. The successful completion of this step will leave you with an archive of data that you can easily manage, access, and use for your scientific analyses.

If you elected to download existing products (most missions support this), then only one processing step is necessary - this reorganises the downloaded data so that they are compatible with DAXA storage and file naming conventions. **It will have run automatically on declaration.**

### Note on saving Archives

Archive instances can be saved and loaded back in (as you'll see in the next section). A save can be triggered manually by running the `save()` method, but __this shouldn't generally be necessary__, because the archive is automatically saved upon first setup and after every processing step.

## Loading an existing archive

As we have intimated, previously created archives can be loaded back into memory in exactly the same state as when they were saved. We will demonstrate this here with an archive we prepared earlier - it has had XMM processing applied, which will allow us to demonstrate the logging and management functionality.

Reloading an archive has a number of possible applications:

* Access to archive data management functions - e.g. locating specific data files, identifying what observations are available.
* Checking processing logs - e.g. finding errors or warnings in the processing of data that has since been identified as problematic.
* Updating the archive - either adding another mission, or using the archive to check for new data matching your original mission filtering operations (these are stored in the mission saves, so can be re-run automatically).

All you need to do is set up an Archive instance and pass the name of an existing archive - this assumes your code is running in the same directory as it was originally, as Archives are stored in 'daxa_output' (if the DAXA configuration file hasn't been altered). The configuration can also be altered so that all DAXA outputs are stored in an absolute path, in which case defining an Archive object with the name of an existing dataset would work from any directory.
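To make the relative-path behaviour concrete, here is a minimal sketch that works out where an archive directory would be looked for. It is not part of DAXA: `expected_archive_dir` is a hypothetical helper, and the `daxa_output/archives/<name>` layout is an assumption taken from the `archive_path` value of the example archive used in this tutorial.

```python
from pathlib import Path


def expected_archive_dir(name, output_root="daxa_output"):
    """Hypothetical helper (not a DAXA function): where an archive directory would sit.

    A relative output_root is resolved against the current working directory,
    which is why a relative configuration ties loading to the original directory.
    """
    root = Path(output_root)
    if not root.is_absolute():
        root = Path.cwd() / root
    # Assumed layout: <output root>/archives/<archive name>
    return root / "archives" / name


arch_dir = expected_archive_dir("PHL1811_made_earlier")
status = "found" if arch_dir.is_dir() else "not found from this directory"
print(f"{arch_dir} - {status}")
```

Running a check like this from a different working directory shows why loading by name alone fails there unless the output location has been configured as an absolute path.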

Loading in an archive (note that you don't need to pass any missions; loading the archive back in will also reinstate the missions as they were when the Archive was last saved):

In [9]:
prev_arch = Archive("PHL1811_made_earlier")


Once again, we will run the `info()` method; the 'Fully Processed' entry listed for each mission reports whether its processing has been completed:

In [10]:
prev_arch.info()


-----------------------------------------------------
Number of missions - 4
Total number of observations - 9
Beginning of earliest observation - 1990-10-31 00:00:00
End of latest observation - 2015-11-30 04:11:08.184002

-- XMM-Newton Pointed --
   Internal DAXA name - xmm_pointed
   Chosen instruments - M1, M2, PN
   Number of observations - 4
   Fully Processed - False

-- Chandra --
   Internal DAXA name - chandra
   Chosen instruments - ACIS-I, ACIS-S, HRC-I, HRC-S
   Number of observations - 3
   Fully Processed - False

-- NuSTAR Pointed --
   Internal DAXA name - nustar_pointed
   Chosen instruments - FPMA, FPMB
   Number of observations - 1
   Fully Processed - False

-- RASS --
   Internal DAXA name - rosat_all_sky
   Chosen instruments - PSPC
   Number of observations - 1
   Fully Processed - False
-----------------------------------------------------



We note that it _is_ possible to declare an Archive with a previously used name and overwrite it - you just have to pass `clobber=True` when you declare the Archive instance. We print the docstring of the Archive class here for reference:

In [11]:
print(Archive.__doc__)


    The Archive class, which is to be used to consolidate and provide some interface with a set
    of mission's data. Archives can be passed to processing and cleaning functions in DAXA, and also
    contain convenience functions for accessing summaries of the available data.

    :param str archive_name: The name to be given to this archive - it will be used for storage
        and identification. If an existing archive with this name exists it will be read in, unless clobber=True.
    :param List[BaseMission]/BaseMission missions: The mission, or missions, which are to be included
        in this archive - any setup processes (i.e. the filtering of data to be acquired) should be
        performed prior to creating an archive. The default value is None, but this should be set for any new
        archives, it can only be left as None if an existing archive is being read back in.
    :param bool clobber: If an archive named 'archive_name' already exists, then setting clobber to True
 

## Accessing component missions

The missions that were used to create an archive can be retrieved, giving access to their information tables - note that you cannot just use the filtering methods of a mission to change the data in the archive; altering the observations in an archive requires using <font color='red'>the archive `update()` method.</font>

To retrieve a mission you can either address the archive with the DAXA internal name of the mission, or get the whole list using the `missions` property:

In [12]:
prev_arch['xmm_pointed']

<daxa.mission.xmm.XMMPointed at 0x7fdaf9af12b0>

In [13]:
prev_arch.missions

[<daxa.mission.xmm.XMMPointed at 0x7fdaf9af12b0>,
 <daxa.mission.chandra.Chandra at 0x7fdb18ffc0a0>,
 <daxa.mission.nustar.NuSTARPointed at 0x7fdb1977a280>,
 <daxa.mission.rosat.ROSATAllSky at 0x7fdb19de2790>]

In [14]:
prev_arch['xmm_pointed'].filtered_obs_info

| | ra | dec | ObsID | start | science_usable | duration | proprietary_end_date | revolution | proprietary_usable | end |
|---|---|---|---|---|---|---|---|---|---|---|
| 922 | 157.7463 | 31.04889 | 102041001 | 2000-12-07 04:57:14 | True | 0 days 01:30:06 | 2002-04-06 00:00:00 | 182 | True | 2000-12-07 06:27:20 |
| 3802 | 328.7562 | -9.373528 | 204310101 | 2004-11-01 09:06:42 | True | 0 days 09:08:39 | 2005-12-01 00:00:00 | 897 | True | 2004-11-01 18:15:21 |
| 6014 | 234.89625 | -83.59306 | 502671101 | 2008-04-01 17:24:48 | True | 0 days 05:25:20 | 2009-05-29 00:00:00 | 1522 | True | 2008-04-01 22:50:08 |
| 12006 | 328.75625 | -9.373333 | 761910201 | 2015-11-29 09:38:07 | True | 0 days 16:30:00 | 2016-12-11 23:00:00 | 2925 | True | 2015-11-30 02:08:07 |


## Archive properties

Here we run through the general properties of the archive class, summarising their meaning and content.

### Name

The `archive_name` property returns the name that was given to the archive on creation - this cannot be changed.

In [15]:
prev_arch.archive_name

'PHL1811_made_earlier'

### Archive Path

This property (`archive_path`) returns the absolute path to the top level of this archive's storage directory:

In [16]:
prev_arch.archive_path

'/Users/dt237/code/DAXA/docs/source/notebooks/tutorials/daxa_output/archives/PHL1811_made_earlier/'

### Mission Names

In addition to the `missions` property discussed earlier, we include a `mission_names` property which lists the internal names of the mission classes associated with the archive:

In [17]:
prev_arch.mission_names

['xmm_pointed', 'chandra', 'nustar_pointed', 'rosat_all_sky']

## Processing-related Archive properties

This section deals with Archive properties that are related to processing of the available data into something scientifically useful - this is why we loaded an existing archive with processing applied, to demonstrate the contents of these properties.

### Process Success

The `process_success` property is very important - it is a nested dictionary which records which processing steps were 'successful' (usually defined as no errors being detected, and expected files being found) for which data. Those that were successful have a boolean value of True, those that weren't have a boolean value of False.

Top level keys will always be mission name, the next level down will be the process name, and the layer below that will be the unique IDs of the data the process acted on. This is often an ObsID, but can also be ObsID + instrument name, or ObsID + instrument name + sub-exposure ID.
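As an illustration of how these identifiers are composed, the sketch below splits one into its parts. It is not a DAXA function: `split_xmm_ident` and the pattern are assumptions inferred from the XMM identifiers that appear in this tutorial (a 10-digit ObsID, optionally followed by an instrument name and a sub-exposure ID), and other missions may use different forms.

```python
import re

# Assumed pattern, inferred from identifiers such as '0502671101PNS003':
# a 10-digit ObsID, then optionally an instrument (M1, M2, PN), then
# optionally a sub-exposure ID (a letter followed by three digits).
IDENT_PATTERN = re.compile(
    r"^(?P<obs_id>\d{10})(?P<inst>M1|M2|PN)?(?P<sub_exp>[A-Z]\d{3})?$"
)


def split_xmm_ident(ident):
    """Hypothetical helper: split an identifier into (ObsID, instrument, sub-exposure)."""
    match = IDENT_PATTERN.match(ident)
    if match is None:
        raise ValueError(f"Unrecognised identifier: {ident}")
    return match.group("obs_id"), match.group("inst"), match.group("sub_exp")


print(split_xmm_ident("0502671101"))        # ('0502671101', None, None)
print(split_xmm_ident("0502671101PNS003"))  # ('0502671101', 'PN', 'S003')
```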

This allows you (but more importantly the archive itself) to know which stages have failed for which data - that in turn means any processing that is dependent on a previous stage can know which data to skip. All this ensures no interruptions when reducing large sets of data.
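The idea of skipping data that failed an earlier stage can be sketched with a plain dictionary of the same shape. This is only an illustration, not how DAXA implements it internally: the dictionary is a hand-copied subset of the `process_success` property of the example archive, and `passed` and `failed` are hypothetical helper names.

```python
# Hand-copied subset of the example archive's `process_success` property.
process_success = {
    "xmm_pointed": {
        "emchain": {
            "0502671101M1S001": True,
            "0502671101M2S002": True,
            "0204310101M1S001": True,
        },
        "espfilt": {
            "0502671101M1S001": True,
            "0502671101M2S002": False,
            "0204310101M1S001": True,
        },
    }
}


def passed(success, mission, process):
    """Identifiers for which the given process succeeded."""
    return sorted(i for i, ok in success[mission][process].items() if ok)


def failed(success, mission, process):
    """Identifiers for which the given process failed."""
    return sorted(i for i, ok in success[mission][process].items() if not ok)


# Only data that passed 'espfilt' would be handed to a dependent stage.
print(passed(process_success, "xmm_pointed", "espfilt"))
# ['0204310101M1S001', '0502671101M1S001']
print(failed(process_success, "xmm_pointed", "espfilt"))
# ['0502671101M2S002']
```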

In this case we've run all processing steps on the XMM data in the archive. Note the following entries:

* `espfilt` 0502671101PNS003
* `espfilt` 0502671101M2S002

Both failed safely, and were not considered for the next processing stages. Also note that the ObsID 0102041001 does not appear after the `odf_ingest` step, as all of its data were taken in CalClosed mode and can't be used for the study of the target objects:

In [18]:
prev_arch.process_success

{'xmm_pointed': {'cif_build': {'0204310101': True,
   '0102041001': True,
   '0761910201': True,
   '0502671101': True},
  'odf_ingest': {'0102041001': True,
   '0502671101': True,
   '0204310101': True,
   '0761910201': True},
  'epchain': {'0502671101PNS003': True,
   '0204310101PNS003': True,
   '0761910201PNS003': True},
  'emchain': {'0502671101M1S001': True,
   '0204310101M1S001': True,
   '0502671101M2S002': True,
   '0204310101M2S002': True,
   '0761910201M2S002': True,
   '0761910201M1S001': True},
  'espfilt': {'0502671101M2S002': False,
   '0204310101M1S001': True,
   '0204310101M2S002': True,
   '0502671101M1S001': True,
   '0761910201M1S001': True,
   '0761910201M2S002': True,
   '0502671101PNS003': False,
   '0204310101PNS003': True,
   '0761910201PNS003': True},
  'cleaned_evt_lists': {'0204310101M2S002': True,
   '0761910201M1S001': True,
   '0204310101M1S001': True,
   '0502671101M1S001': True,
   '0761910201M2S002': True,
   '0204310101PNS003': True,
   '0761910201PNS

### Process Logs (stdout)

All of the logs for all processes run by DAXA are stored, and can be accessed through the `process_logs` property - this is structured in the exact same way as `process_success`, as a nested dictionary. The only difference here is that the final values are strings rather than booleans.

We show the log for a single process applied to a single piece of data, otherwise this tutorial document would be very long indeed - this particular process worked perfectly:

In [29]:
print(prev_arch.process_logs['xmm_pointed']['espfilt']['0204310101M1S001'])

espfilt:- Executing (routine): espfilt eventfile=/Users/dt237/code/DAXA/docs/source/notebooks/tutorials/daxa_output/archives/PHL1811_made_earlier/processed_data/xmm_pointed/0204310101/P0204310101M1S001MIEVLI0000.FIT withoot=no ootfile=dataset method=histogram withsmoothing=yes smooth=51 withbinning=yes binsize=60 ratio=1.2 withlongnames=yes elow=2500 ehigh=8500 rangescale=6 allowsigma=3 limits='0.1 6.5' keepinterfiles=no  -w 1 -V 4
espfilt:- espfilt (espfilt-4.3)  [xmmsas_20211130_0941-20.0.0] started:  2024-04-10T19:55:37.000
espfilt:-  ESPFILT: Processing eventlist: /Users/dt237/code/DAXA/docs/source/notebooks/tutorials/daxa_output/archives/PHL1811_made_earlier/processed_data/xmm_pointed/0204310101/P0204310101M1S001MIEVLI0000.FIT
espfilt:-  *FOV IMAGE* = mos1S001-fovim-2500-8500.fits
evselect:- Executing (routine): evselect table=/Users/dt237/code/DAXA/docs/source/notebooks/tutorials/daxa_output/archives/PHL1811_made_earlier/processed_data/xmm_pointed/0204310101/P0204310101M1S001MIEV
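Because `process_logs` shares the nested layout of `process_success`, individual log lines can be located with a simple traversal. The sketch below is not part of DAXA: `find_in_logs` is a hypothetical helper, and the small dictionary is a hand-copied excerpt of the log printed above, standing in for the real property.

```python
# Stand-in for `prev_arch.process_logs`, holding two lines copied from the log above.
process_logs = {
    "xmm_pointed": {
        "espfilt": {
            "0204310101M1S001": (
                "espfilt:- espfilt (espfilt-4.3)  [xmmsas_20211130_0941-20.0.0] "
                "started:  2024-04-10T19:55:37.000\n"
                "espfilt:-  *FOV IMAGE* = mos1S001-fovim-2500-8500.fits"
            )
        }
    },
    "chandra": {},
}


def find_in_logs(logs, text):
    """Return (mission, process, ident, line) for every log line containing `text`."""
    hits = []
    for mission, processes in logs.items():
        for process, idents in processes.items():
            for ident, log in idents.items():
                for line in log.splitlines():
                    if text in line:
                        hits.append((mission, process, ident, line.strip()))
    return hits


for hit in find_in_logs(process_logs, "FOV IMAGE"):
    print(hit)
```

With the real property, the same function could be called as `find_in_logs(prev_arch.process_logs, "FOV IMAGE")`.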

In [19]:
prev_arch.process_errors

{'xmm_pointed': {'cif_build': {},
  'odf_ingest': {},
  'epchain': {},
  'emchain': {},
  'espfilt': {'0502671101M2S002': [' raised by  - '],
   '0502671101PNS003': ['noCounts raised by espfilt - All histo counts are zero! Check your FOV Lightcurve!']},
  'cleaned_evt_lists': {},
  'merge_subexposures': {}},
 'chandra': {},
 'nustar_pointed': {},
 'rosat_all_sky': {}}
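Judging from the output above, `process_errors` uses the same mission, process, and identifier nesting, with a list of error strings as the final value. A minimal sketch (not a DAXA function; `flatten_errors` is a hypothetical helper acting on a hand-copied version of that output) turns it into one row per error, which is easier to scan for a large archive:

```python
# Hand-copied from the `process_errors` output of the example archive.
process_errors = {
    "xmm_pointed": {
        "cif_build": {},
        "odf_ingest": {},
        "epchain": {},
        "emchain": {},
        "espfilt": {
            "0502671101M2S002": [" raised by  - "],
            "0502671101PNS003": [
                "noCounts raised by espfilt - All histo counts are zero! "
                "Check your FOV Lightcurve!"
            ],
        },
        "cleaned_evt_lists": {},
        "merge_subexposures": {},
    },
    "chandra": {},
    "nustar_pointed": {},
    "rosat_all_sky": {},
}


def flatten_errors(errors):
    """Return one (mission, process, ident, message) tuple per recorded error."""
    rows = []
    for mission, processes in errors.items():
        for process, idents in processes.items():
            for ident, messages in idents.items():
                for message in messages:
                    rows.append((mission, process, ident, message.strip()))
    return rows


for row in flatten_errors(process_errors):
    print(row)
```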

In [22]:
print(prev_arch.get_process_raw_error_logs('espfilt', full_ident="0502671101M2S002"))



In [23]:
print(prev_arch.get_process_raw_error_logs('espfilt', full_ident="0502671101PNS003"))

