# Data Formating Overview

**Contributors**: Lili Alderson, Munazza Alam, Natasha Batalha, Hannah Wakeford, add your name too if you'd like to contribute!

This notebook was created to enable common formatting for the Early Release Science data analysis using `xarray` files. We will review: 

1. `xarray` 101
2. Variable terminology
3. File naming schemes 
4. Data formatting 
5. Physical unit archiving 

This applies mainly to the bookkeeping of the following data products: 

1. Stellar spectra 
2. Raw light curves 
3. Fitted light curves 
4. Transit spectra

### Packages to install before getting started 


<h1>Table of Contents<span class="tocSkip"></span></h1>
<div class="toc"><ul class="toc-item"><li><span><a href="#Data-Formating-Overview" data-toc-modified-id="Data-Formating-Overview-1"><span class="toc-item-num">1&nbsp;&nbsp;</span>Data Formating Overview</a></span></li><li><span><a href="#Data-Types:-Using-xarray" data-toc-modified-id="Data-Types:-Using-xarray-2"><span class="toc-item-num">2&nbsp;&nbsp;</span>Data Types: Using <code>xarray</code></a></span><ul class="toc-item"><li><span><a href="#The-three-components-of-an-xarray-file" data-toc-modified-id="The-three-components-of-an-xarray-file-2.1"><span class="toc-item-num">2.1&nbsp;&nbsp;</span>The three components of an <code>xarray</code> file</a></span></li></ul></li><li><span><a href="#What-data-go-in-what-files?" data-toc-modified-id="What-data-go-in-what-files?-3"><span class="toc-item-num">3&nbsp;&nbsp;</span>What data go in what files?</a></span></li><li><span><a href="#Variable-Terminology-and-Required-data_vars,-coords-&amp;-attrs-for-each-file-type" data-toc-modified-id="Variable-Terminology-and-Required-data_vars,-coords-&amp;-attrs-for-each-file-type-4"><span class="toc-item-num">4&nbsp;&nbsp;</span>Variable Terminology and Required <code>data_vars</code>, <code>coords</code> &amp; <code>attrs</code> for each file type</a></span><ul class="toc-item"><li><span><a href="#Stellar-Spectra" data-toc-modified-id="Stellar-Spectra-4.1"><span class="toc-item-num">4.1&nbsp;&nbsp;</span>Stellar Spectra</a></span></li><li><span><a href="#Raw-Light-Curves" data-toc-modified-id="Raw-Light-Curves-4.2"><span class="toc-item-num">4.2&nbsp;&nbsp;</span>Raw Light Curves</a></span></li><li><span><a href="#Fitted-Light-Curves" data-toc-modified-id="Fitted-Light-Curves-4.3"><span class="toc-item-num">4.3&nbsp;&nbsp;</span>Fitted Light Curves</a></span></li><li><span><a href="#Transit-Spectra" data-toc-modified-id="Transit-Spectra-4.4"><span class="toc-item-num">4.4&nbsp;&nbsp;</span>Transit Spectra</a></span></li></ul></li><li><span><a href="#Specifying-units" 
data-toc-modified-id="Specifying-units-5"><span class="toc-item-num">5&nbsp;&nbsp;</span>Specifying units</a></span></li><li><span><a href="#xarray-Basics" data-toc-modified-id="xarray-Basics-6"><span class="toc-item-num">6&nbsp;&nbsp;</span><code>xarray</code> Basics</a></span><ul class="toc-item"><li><span><a href="#Easy-Example:-Fake-Transit-Spectra" data-toc-modified-id="Easy-Example:-Fake-Transit-Spectra-6.1"><span class="toc-item-num">6.1&nbsp;&nbsp;</span>Easy Example: Fake Transit Spectra</a></span></li><li><span><a href="#2D-data:-e.g.-Raw-Light-Curves" data-toc-modified-id="2D-data:-e.g.-Raw-Light-Curves-6.2"><span class="toc-item-num">6.2&nbsp;&nbsp;</span>2D data: e.g. Raw Light Curves</a></span></li></ul></li><li><span><a href="#Storing-xarray-data" data-toc-modified-id="Storing-xarray-data-7"><span class="toc-item-num">7&nbsp;&nbsp;</span>Storing <code>xarray</code> data</a></span><ul class="toc-item"><li><span><a href="#Filenaming" data-toc-modified-id="Filenaming-7.1"><span class="toc-item-num">7.1&nbsp;&nbsp;</span>Filenaming</a></span></li><li><span><a href="#Using-netcdf" data-toc-modified-id="Using-netcdf-7.2"><span class="toc-item-num">7.2&nbsp;&nbsp;</span>Using <code>netcdf</code></a></span></li><li><span><a href="#Using-pickle" data-toc-modified-id="Using-pickle-7.3"><span class="toc-item-num">7.3&nbsp;&nbsp;</span>Using <code>pickle</code></a></span></li></ul></li><li><span><a href="#Reading/interpreting-an-xarray-file" data-toc-modified-id="Reading/interpreting-an-xarray-file-8"><span class="toc-item-num">8&nbsp;&nbsp;</span>Reading/interpreting an <code>xarray</code> file</a></span></li><li><span><a href="#Checking-your-data-is-in-compliance" data-toc-modified-id="Checking-your-data-is-in-compliance-9"><span class="toc-item-num">9&nbsp;&nbsp;</span>Checking your data is in compliance</a></span></li></ul></div>

In [None]:
!pip install netCDF4
!pip install h5netcdf
!pip install xarray


# Data Types: Using `xarray`

[xarray: N-D labeled arrays and datasets in Python](https://docs.xarray.dev/en/stable/): From their website: "Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures."

Xarray is your friend and will make it very easy for other folks to use your data. 

## The three components of an `xarray` file

### 1. `attrs` : attributes 

These are the "attributes" that contain any high-level metadata. This could be things like author lists, parameters used in the data reduction, a description of the code used, etc. 

### 2. `data_vars` : data variables 

These are the main "data variables". These are usually final results of your modeling or data reduction analysis. It will ultimately be what the user of your data will be looking for. 

### 3. `coords` : coordinate systems 

These are the coordinate systems of your `data_vars`. Common coordinate systems would be a wavelength grid for a flux array, or a time array for a time series. 
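To make the three components concrete, here is a minimal sketch of a toy dataset. The variable names, values, and contact details are placeholders invented for illustration, not part of the agreed format.

```python
import numpy as np
import xarray as xr

# Hypothetical toy data: one flux value at each of five wavelengths
wavelength = np.linspace(1.0, 5.0, 5)
flux = np.ones(5)

toy = xr.Dataset(
    # data_vars: the results a user of the file is looking for
    data_vars={"flux": (["wavelength"], flux, {"units": "DN"})},
    # coords: the grid the data_vars are defined on
    coords={"wavelength": (["wavelength"], wavelength, {"units": "micron"})},
    # attrs: high-level metadata (placeholder values)
    attrs={"author": "A. Person", "contact": "a.person@example.com"},
)

print(list(toy.data_vars))   # names of the data variables
print(list(toy.coords))      # names of the coordinates
print(toy.attrs["author"])   # one piece of metadata
```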

# What data go in what files? 


⭐️Prefer to see a clear breakdown of these in a [Google Sheet](https://docs.google.com/spreadsheets/d/1JZbKQJsu5YNpJLsgY-w21Sed9-D6uVPltMogB-ju0aY/edit#gid=0)?⭐️

### File 1: Stellar Spectra 

Contains.. stellar spectra INSERT (could contain broader descriptions and/or link to codes that do these things or produce these products)

**Suggested naming scheme** : `f'stellar-spec-planet{W39}-mode{G395H}-code{Eureka}-author{Alderson}.xc'` 

### File 2: Raw light curves

Contains.. Raw light curves INSERT (could contain broader descriptions and/or link to codes that do these things or produce these products)

**Suggested naming scheme** : `f'raw-light-curve-planet{W39}-mode{G395H}-code{Eureka}-author{Alderson}.xc'`  

### File 3: Fitted light curves 

Contains... Fitted light curves INSERT (could contain broader descriptions and/or link to codes that do these things or produce these products)

**Suggested naming scheme** : `f'fitted-light-curve-planet{W39}-mode{G395H}-code{Eureka}-author{Alderson}.xc'`

### File 4: Transit Spectra 

Contains... Transit spectra (could contain broader descriptions and/or link to codes that do these things or produce these products)

**Suggested naming scheme** : `f'transit-spectrum-planet{W39}-mode{G395H}-code{Eureka}-author{Alderson}.xc'` INSERT something you like 
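The suggested naming schemes above are written as f-string templates. One way to build them consistently is a small helper like the sketch below. The function name `make_filename` and its arguments are hypothetical, and the `.xc` extension is simply copied from the suggestions above.

```python
def make_filename(product, planet, mode, code, author, extension=".xc"):
    """Assemble a filename following the suggested naming scheme.

    `product` is the file-type prefix, e.g. 'stellar-spec' or 'transit-spectrum'.
    All arguments are plain strings; nothing is validated here.
    """
    return f"{product}-planet{planet}-mode{mode}-code{code}-author{author}{extension}"


name = make_filename("stellar-spec", "W39", "G395H", "Eureka", "Alderson")
print(name)
```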



# Variable Terminology and Required `data_vars`, `coords` & `attrs` for each file type

⭐️Prefer to see a clear breakdown of these in a [Google Sheet](https://docs.google.com/spreadsheets/d/1JZbKQJsu5YNpJLsgY-w21Sed9-D6uVPltMogB-ju0aY/edit#gid=0)?⭐️



## Stellar Spectra 

###  `attrs` 

#### Required `attrs`

1. `author` (`str`): Author or author list 
2. `contact` (`str`): point of contact 
3. `code` (`str` or `dict`): code used. Sometimes people only use one code, in which case a string will suffice, e.g., "eureka". Other times people might use a combination of many, in which case a dictionary can list the code used at each stage, e.g., `{"Stage1": "eureka", "Aperture": "my_pipeline_name", "Stage2": "eureka"}` 

#### Optional  `attrs`
1. `doi` (str): make sure to include if you want your work referenced!
2. `extraction_params` (`dict`) : key parameters used to extract the 1D stellar spectra from the 2D integration images, e.g., `{'aperture_width_in_pixels': 6, 'aperture_poly_order': 2}`
3. `cleaning_params` (`dict`) : any relevant parameters or necessary information relating to how you cleaned the data, such as the [ramp  rejection threshold](https://jwst-pipeline.readthedocs.io/en/latest/jwst/jump/arguments.html)  e.g., `{'jump_threshold': 7, 'f_noise_method':'column by column'}`
4. `notes` (str) : any additional reduction notes that you want the user to be aware of (e.g., you may wish to briefly explain the combination of codes used)

### `data_vars`

#### Required `data_vars`

1. `flux` (2D array) : array of 1D stellar spectra with shape (total number of stellar spectra, length of stellar spectrum) e.g., if your dataset contains 8000 integrations and the spectrum covers 4000 pixels this array would have shape (8000,4000) 
2. `flux_error` (2D array) : array of errors associated with the flux data. Must be same size as `flux` 
3. `quality_flag` (2D array) : a requirement of [chromatic](https://github.com/zkbt/chromatic), "2D array indicating whether a particular flux data point is good (True) or bad (False)". Must be same size as `flux`

#### Optional  `data_vars`

Optional `data_vars` here will include a variety of detrending parameters with potentially 1D (e.g. time) or 2D (e.g. time vs wavelength) dimensions. These products, which may be requested by light curve fitters, might include the pixel shifts of the spectral trace in the x and y directions, which would have lengths equal to the total number of stellar spectra. Below we provide an example set of `data_vars`. 

4. `shift_x` (1D array) : pixel shift in x 
5. `shift_y` (1D array) : pixel shift in y 
6. others?? Zach?? 

### `coords`

#### Required `coords`

1. `wavelength` (1D array) : wavelength array, should be same length as a stellar spectrum 
2. `time` (1D array) : time array, should be same length as the total number of stellar spectra
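As a rough sketch of how the required stellar spectra entries fit together, the example below builds a tiny fake dataset. The array sizes, values, unit strings, and author details are all placeholders chosen for illustration.

```python
import json
import numpy as np
import xarray as xr

# Tiny stand-ins for e.g. 8000 integrations x 4000 pixels
n_time, n_wave = 8, 4
time = np.linspace(59000.0, 59000.3, n_time)     # fake time stamps
wavelength = np.linspace(3.0, 5.0, n_wave)       # fake wavelength grid
flux = np.full((n_time, n_wave), 1.0e4)          # fake 1D spectra, one per integration
flux_error = np.sqrt(flux)                       # fake errors, same shape as flux
quality_flag = np.ones((n_time, n_wave), dtype=bool)  # True = good data point

stellar = xr.Dataset(
    data_vars=dict(
        flux=(["time", "wavelength"], flux, {"units": "DN"}),
        flux_error=(["time", "wavelength"], flux_error, {"units": "DN"}),
        quality_flag=(["time", "wavelength"], quality_flag, {"units": ""}),
    ),
    coords=dict(
        wavelength=(["wavelength"], wavelength, {"units": "micron"}),
        time=(["time"], time, {"units": "BJD"}),
    ),
    attrs=dict(
        author="A. Person",                       # placeholder
        contact="a.person@example.com",           # placeholder
        # a dict of codes is stored as a JSON string
        code=json.dumps({"Stage1": "eureka", "Stage2": "my_pipeline_name"}),
    ),
)
```

The shape of `flux` is (number of spectra, length of spectrum), matching the lengths of `time` and `wavelength` respectively.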


## Raw Light Curves 

###  `attrs` 

#### Required `attrs`

1. `author` (`str`): Author or author list 
2. `contact` (`str`): point of contact 
3. `code` (`str` or `dict`): code used. For example, sometimes people only use one code, "eureka"; other times people might use many, e.g., `{"Stage1": "jwst", "Stage2": "my_pipeline"}`
4. `data_origin` (`str`) : Whose stellar spectra did you use to make these? In order of preference, provide: 1) link to data doi, 2) link to data (e.g. personal drive), 3) link to paper doi, 4) contact email of author

#### Optional  `attrs`
1. `doi` (str): make sure to include if you want your work referenced! 
2. `notes` (str) : any additional reduction notes that you want the user to be aware of (e.g., you may wish to clarify if the contained light curves are binned spectroscopic light curves or broadband white light curves)

### `data_vars` 

#### Required `data_vars` 
1. `raw_flux` (2D array) : array of 1D light curves with shape (total number of light curves, length of light curve) e.g., if your dataset contains 1000 binned light curves and each light curve contains 8000 time increments this array would have shape (1000,8000) 
2. `raw_flux_error` (2D array) : array of errors associated with the raw_flux data. Must be same size as raw_flux
3. `quality_flag` (2D array) : a requirement of [chromatic](https://github.com/zkbt/chromatic), "2D array indicating whether a particular flux data point is good (True) or bad (False)" Must be same size as raw_flux

#### Optional `data_vars` 

4. `detrending parameters` (`dict`) : any detrending parameters produced by your code that may be requested by light curve fitters. This might include the pixel shifts of the spectral trace in the x and y directions, e.g., `{'x_shifts': array_of_x, 'y_shifts': array_of_y}`

### `coords` 

#### Required `coords` 

1. `time_flux` (1D array) : time against which the light curves are plotted; should be the same length as the number of `raw_flux` data points in each light curve (**Lili: note that in the stellar spec we called this `time`. Here we use `time_flux` so that the next file type can carry two different time arrays. I am wondering if you want to change to `time_flux` in stellar spec to reduce the number of time array names**)
2. `central_wavelength` (1D array) : central wavelength associated with each `raw_flux` light curve, should be same length as total number of light curves

#### Optional `coords` 

3. `bin_half_width` (1D array) : half widths 
4. `start_wavelength` (1D array) : starting wavelength of each bin
5. `end_wavelength` (1D array) : ending wavelength of each bin 
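The optional wavelength coordinates are related to each other. The sketch below assumes bins that are symmetric about `central_wavelength`, which may not hold for every binning scheme, and uses made-up numbers.

```python
import numpy as np

# Hypothetical bin centres and half widths, in micron
central_wavelength = np.array([1.0, 1.2, 1.4])
bin_half_width = np.array([0.1, 0.1, 0.1])

# Assuming symmetric bins, the edges follow directly from centre and half width
start_wavelength = central_wavelength - bin_half_width
end_wavelength = central_wavelength + bin_half_width

print(start_wavelength, end_wavelength)
```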

## Fitted Light Curves

###  `attrs` 

#### Required `attrs` 

1. `author` (`str`): Author or author list 
2. `contact` (`str`): point of contact 
3. `code` (`str` or `dict`): code used. For example, sometimes people only use one code, "eureka"; other times people might use many, e.g., `{"Stage1": "jwst", "Stage2": "my_pipeline"}`
4. `data_origin` (`dict`) : Whose stellar spectra and raw light curve did you use to make these? In order of preference, provide: 1) link to data doi, 2) link to data (e.g. personal drive), 3) link to paper doi, 4) contact email of author. For example: `{'stellar_spec':"zenodo.org/xxxx", 'raw_light_curve':"zenodo.org/yyyy"}`
5. `system_params` (`dict`) : dictionary of relevant planet and system parameters. `{'rp': jupiter_radius, 'rs': stellar_radius , 'a_rs' : (a/Rs), 'tc' :  mid_transit_time}`
6. `limb_darkening_params` (`dict`) : dictionary of relevant limb darkening parameters.  `{'u1':x, 'u2':y, 'u3':z, 'u4':w}`

#### Optional  `attrs`
1. `doi` (str): make sure to include if you want your work referenced! 
2. `notes` (str) : any additional reduction notes that you want the user to be aware of (e.g., you may wish to clarify if the contained light curves are binned spectroscopic light curves or broadband white light curves) 

### `data_vars` 

#### Required `data_vars` 

1. `raw_flux` (2D array) : array of 1D light curves before systematic detrending with shape (total number of light curves, length of light curve) e.g., if your dataset contains 1000 binned light curves and each light curve contains 8000 time increments this array would have shape (1000,8000) 
2. `raw_flux_error` (2D array) : array of errors associated with the raw_flux data. Must be same size as `raw_flux`
3. `light_curve_model` (2D array) : array of 1D light curve models with shape (total number of light curve models, length of model). While the total number of light curves must be the same as for `raw_flux`, the length of the model may be different if the model is at a higher time resolution.
4. `corrected_flux` (2D array) : array of 1D light curves after systematic detrending. Must be same size as `raw_flux`
5. `corrected_flux_error` (2D array) : array of errors associated with the `corrected_flux` data. Must be same size as `raw_flux`
6. `quality_flag` (2D array) : a requirement of chromatic, "2D array indicating whether a particular flux data point is good (True) or bad (False)" Must be same shape as `raw_flux`

#### Optional `data_vars`

7. `systematic_model` (2D array) : model used to detrend the light curves with shape (total number of light curve models, length of model). While the total number of light curves must be the same as for `raw_flux`, the length of the model may be different if the model is at a higher time resolution.
8. `residuals` (2D array) : RMS residuals of the light curve fit. Must be same size as `raw_flux`

Optional `data_vars` here will include a variety of detrending parameters with potentially 1D (e.g. time) or 2D (e.g. time vs wavelength) dimensions. These products, which may be requested by light curve fitters, might include the pixel shifts of the spectral trace in the x and y directions, which would have lengths equal to the total number of stellar spectra. Below we provide an example set of detrending parameters you might consider adding:

9. `shift_x` (1D array) : pixel shift in x 
10. `shift_y` (1D array) : pixel shift in y 

### `coords` 

#### Required `coords` 

1. `time_flux` (1D array) : time array for flux 
2. `time_model` (1D array) : time array of systematic models 
3. `central_wavelength` (1D array): central wavelengths

#### Optional `coords`

4. `bin_half_width` (1D array) : half widths 
5. `start_wavelength` (1D array) : starting wavelength of each bin
6. `end_wavelength` (1D array) : ending wavelength of each bin 
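Because models may be sampled at a higher time resolution than the data, the fitted light curve file carries two time coordinates. The sketch below shows only a subset of the required variables, with made-up sizes, values, and unit strings, to illustrate how `time_flux` and `time_model` can coexist in one dataset.

```python
import numpy as np
import xarray as xr

# Hypothetical sizes: 3 light curves, 50 data points, 200 model points
n_lc, n_flux, n_model = 3, 50, 200
central_wavelength = np.linspace(3.0, 5.0, n_lc)
time_flux = np.linspace(0.0, 0.3, n_flux)     # time grid of the data
time_model = np.linspace(0.0, 0.3, n_model)   # finer time grid of the model

fitted = xr.Dataset(
    data_vars=dict(
        # data live on the time_flux grid
        raw_flux=(["central_wavelength", "time_flux"],
                  np.ones((n_lc, n_flux)), {"units": "DN"}),
        # models live on the (possibly finer) time_model grid
        light_curve_model=(["central_wavelength", "time_model"],
                           np.ones((n_lc, n_model)), {"units": "DN"}),
    ),
    coords=dict(
        central_wavelength=(["central_wavelength"], central_wavelength, {"units": "micron"}),
        time_flux=(["time_flux"], time_flux, {"units": "BJD"}),
        time_model=(["time_model"], time_model, {"units": "BJD"}),
    ),
)

print(fitted.sizes)
```

Both variables share the `central_wavelength` dimension, so the number of light curves and the number of models stay tied together.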


## Transit Spectra

###  `attrs` 

#### Required `attrs`


1. `author` (`str`): Author or author list 
2. `contact` (`str`): point of contact 
3. `code` (`str` or `dict`): code used. For example, sometimes people only use one code, "eureka"; other times people might use many, e.g., `{"Stage1": "jwst", "Stage2": "my_pipeline"}`
4. `data_origin` (`dict`) : Whose stellar spectra, raw light curve, and fitted light curve did you use to make these? In order of preference, provide: 1) link to data doi, 2) link to data (e.g. personal drive), 3) link to paper doi, 4) contact email of author. For example: `{'stellar_spec':"zenodo.org/xxxx", 'raw_light_curve':"zenodo.org/yyyy", 'fitted_light_curve':"zenodo.org/zzzz"}`
5. `system_params` (`dict`) : dictionary of relevant planet and system parameters. `{'rp': jupiter_radius, 'rs': stellar_radius , 'a_rs' : (a/Rs), 'tc' :  mid_transit_time}`
6. `limb_darkening_params` (`dict`) : dictionary of relevant limb darkening parameters.  `{'u1':x, 'u2':y, 'u3':z, 'u4':w}`

#### Optional  `attrs`
1. `doi` (str): make sure to include if you want your work referenced!  
2. `notes` (str) : any additional information you want the user to be aware of

###  `data_vars` 

#### Required `data_vars`

1. `transit_depth` (required) : transit depth as (rp/rs)^2 (**Lili**: Note that this might become a larger part of a community framework. You might think about making this "optional" along with "eclipse_depth" and/or phase curve.)
2. `transit_depth_error` (required) : error on transit depth as 1-sigma (rp/rs)^2 

### `coords` 

#### Required `coords` 

1. `central_wavelength` (1D array): central wavelengths

#### Optional `coords`

2. `bin_half_width` (1D array) : half widths 
3. `start_wavelength` (1D array) : starting wavelength of each bin
4. `end_wavelength` (1D array) : ending wavelength of each bin 



# Specifying Physical Units

Units are provided as strings. However, we should be able to convert every unit string to `astropy.units`. 

For **physical unitless parameters** (e.g. `transit_depth`), the unit should be provided in a form that `astropy` can still convert, e.g. area/area.

For **non-physical unitless parameters** (e.g. a bool) the unit field should be specified with a blank string. 

## Allowable non-physical units

Allowable flux units that are not `astropy` friendly:

- `e-`
- `DN`


## Specifying Time Units

All time units should also use `astropy.time`. There is a [full tutorial available](https://docs.astropy.org/en/stable/time/index.html) through the astropy team. 

Example: 


In [None]:
import astropy.units as u

In [None]:
#examples of valid units

u.Unit('cm') #Valid 
#u.Unit('CM') #NOT valid
u.Unit("R_jup")#Valid
u.Unit("R_jupiter")#Valid
#u.Unit("R_Jupiter")#NOT Valid

In [None]:
unit = 'cm'
#doing it this way enables easy conversions. for example: 
(1*u.Unit('R_jup')).to(unit)

In [None]:
u.Unit('micron')#valid
u.Unit('um')#valid

In [None]:
#transit depth 
u.Unit('(R_jup*R_jup)/(R_jup*R_jup)')

Unit(dimensionless)

In [None]:
from astropy.time import Time
import numpy as np

In [None]:
tm = Time(np.linspace(51544, 61544, 300), format='mjd')

# `xarray` Basics

[xarray: N-D labeled arrays and datasets in Python](https://docs.xarray.dev/en/stable/): From their website: "Xarray introduces labels in the form of dimensions, coordinates and attributes on top of raw NumPy-like arrays, which allows for a more intuitive, more concise, and less error-prone developer experience. The package includes a large and growing library of domain-agnostic functions for advanced analytics and visualization with these data structures."

Xarray is your friend and will make it very easy for other folks to use your data. Let's build some simple examples.


## Easy Example: Fake Transit Spectra 

Here we will show an example where wavelength is the coordinate system, and transit depth is the final output product you are trying to share with the community.

In [None]:
import xarray as xr
import json #we will use this to dump model parameters into an attribute

In [None]:
#fake flat spectrum 
central_wavelength = np.linspace(1,10,500)
bin_half_width = np.concatenate(([0.1],np.diff(np.linspace(1,10,500))))
transit_depth_error = np.abs(np.random.randn(500))*100e-6 #fake positive 1-sigma errors of order 100 ppm
transit_depth = np.random.randn(500)*transit_depth_error+0.01

Practice converting to `xarray`. In this case we are storing `transit_depth` data, labeled with `unitless` "(rp/rs)^2", that is on a grid of `central_wavelength` with units of "micron".

In [None]:
# put data into a dataset
ds = xr.Dataset(
    data_vars=dict(
        transit_depth=(["central_wavelength"],
                       transit_depth,{'units': 'unitless'}),#required
        transit_depth_error=(["central_wavelength"],
                       transit_depth_error,{'units': 'unitless'}),#required
    ),
    coords=dict(
        central_wavelength=(["central_wavelength"], 
                             central_wavelength,{'units': 'micron'}),#required
        #one half width per bin, so it lives on the central_wavelength dimension
        bin_half_width=(["central_wavelength"], 
                             bin_half_width,{'units': 'micron'})#optional
    ),
    attrs=dict(author="L Alderson",#required
               contact="lili.alderson13@gmail.com",#required
               code="mycode",#required, could also insert github link
               #dictionaries are stored as json strings
               data_origin=json.dumps({'stellar_spec':'zenodo.org/xxxx', 
                                   'raw_light_curve':'www.drive.google.com/someonegavemethis', 
                                   'fitted_light_curve':'malam@carnegiescience.edu'}), #required
               doi="add your paper here",#optional if there is a citation to reference
               system_params=json.dumps({'rp':1, 'rs':1, 'a_rs':0.1, 'tc':0.9}), #required
               limb_darkening_params=json.dumps({'u1':1, 'u2':1,'u3':1, 'u4':1}), #required
              )
)

In [None]:
#printing is easy
ds

In [None]:
#plotting is easy
ds['transit_depth'].plot()

## 2D data: e.g. Raw Light Curves



In [None]:
time_flux = np.linspace(10,10000,400)#fake time array in BJD
raw_flux = np.zeros((len(central_wavelength), len(time_flux))) + 1e8
raw_flux_error = np.zeros((len(central_wavelength), len(time_flux))) + 1e8 
#quality flags are booleans: True = good data point, False = bad
quality_flag = np.ones((len(central_wavelength), len(time_flux)), dtype=bool)

# put data into a dataset
ds = xr.Dataset(
    #now data is a function of two dimensions
    data_vars=dict(raw_flux=(["central_wavelength","time_flux"], raw_flux,{'units': 'ergs/(cm*cm*Angstrom)'}),
                   raw_flux_error=(["central_wavelength","time_flux"], raw_flux_error,{'units': 'ergs/(cm*cm*Angstrom)'}),
                   quality_flag=(["central_wavelength","time_flux"], quality_flag,{'units': 'unitless'}),
                  ),
    coords=dict(
        central_wavelength=(["central_wavelength"], 
                             central_wavelength,{'units': 'micron'}),#required
        time_flux=(["time_flux"], 
                             time_flux,{'units': 'BJD'}),#required
        #one half width per bin, so it lives on the central_wavelength dimension
        bin_half_width=(["central_wavelength"], 
                             bin_half_width,{'units': 'micron'})#optional
    ),
    attrs=dict(author="L Alderson",#required
               contact="lili.alderson13@gmail.com",#required
               code="mycode",#required, could also insert github link
               data_origin=json.dumps({'stellar_spec':'zenodo.org/xxxx'}), #required
               doi="add your paper here",#optional if there is a citation to reference
              )
)

In [None]:
#easy plotting 
ds['raw_flux'].plot()

# Storing `xarray` data 

## Filenaming

We usually rely on a long filename to give us information about the data. If we use `attrs` properly, the filename does not matter. However, friendly filenames are always appreciated by people using your data. We suggest the following naming convention. 

Given independent variables (x,y,z): `tag_x_{x}_y_{y}_z_{z}.nc`

For example: `stellar_spectra_target_w39_mode_G395H_code_eureka.nc`

## Using `netcdf`

"The recommended way to store xarray data structures is netCDF, which is a binary file format for self-described datasets that originated in the geosciences. Xarray is based on the netCDF data model, so netCDF files on disk directly correspond to Dataset objects (more accurately, a group in a netCDF file directly corresponds to a Dataset object. See Groups for more.)" - [Quoted from xarray website](https://docs.xarray.dev/en/stable/user-guide/io.html)

In [None]:
ds.to_netcdf("stellar_spectra_target_FAKE_mode_G395H_code_eureka.nc")

## Using `pickle`

Pickle is also an option, though it is not recommended because of differences in pickling between versions of Python. Additionally, netCDF files are more versatile across coding languages. 

In [None]:
import pickle as pk
#the context manager makes sure the file is closed after writing
with open("stellar_spectra_target_FAKE_mode_G395H_code_eureka.pk",'wb') as f:
    pk.dump(ds, f)

# Reading/interpreting an `xarray` file

First, make sure you have installed [netCDF4](https://github.com/Unidata/netcdf4-python) and [h5netcdf](https://github.com/h5netcdf/h5netcdf) : 

```
pip install netCDF4
pip install h5netcdf
```
or if you prefer conda
```
conda install -c conda-forge netCDF4
```

In [None]:
ds_ex = xr.open_dataset("stellar_spectra_target_FAKE_mode_G395H_code_eureka.nc")


Look at all the information we can glean from this

In [None]:
ds_ex #3 data variables

In [None]:
ds_ex['central_wavelength']#data operates very similarly to pandas, note we can see the unit of the coordinate system


In [None]:
ds_ex['central_wavelength'].values #same as pandas!


In [None]:
ds_ex['raw_flux']


How to get attributes from string dictionary

In [None]:
json.loads(ds_ex.attrs['data_origin'])

In [None]:
json.loads(ds_ex.attrs['data_origin'])['stellar_spec'] #origin of stellar spectra
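Since several attributes are stored as JSON strings, it can be convenient to decode them all at once. The helper below is a sketch with a hypothetical name, `decode_json_attrs`. It only replaces string attributes that decode to a dictionary and leaves everything else untouched.

```python
import json
import xarray as xr


def decode_json_attrs(dataset):
    """Return a copy of dataset.attrs with JSON-encoded dictionaries decoded."""
    decoded = {}
    for key, value in dataset.attrs.items():
        decoded[key] = value
        if isinstance(value, str):
            try:
                parsed = json.loads(value)
            except json.JSONDecodeError:
                continue  # ordinary string such as an author name
            if isinstance(parsed, dict):
                decoded[key] = parsed
    return decoded


# Placeholder dataset with one plain attribute and one JSON-encoded dictionary
example = xr.Dataset(
    attrs={
        "author": "A. Person",
        "data_origin": json.dumps({"stellar_spec": "zenodo.org/xxxx"}),
    }
)
attrs = decode_json_attrs(example)
print(attrs["data_origin"]["stellar_spec"])
```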

# Checking your data is in compliance

Coming soon upon finalization of data structures

In [None]:
def data_check(usr_xa):
    """This function will check that all the requirements have been met"""
    
    #step 1: check that required attributes are present 
    assert 'author' in usr_xa.attrs ,'No author information in attrs'
    assert 'contact' in usr_xa.attrs ,'No contact information in attrs'
    assert 'code' in usr_xa.attrs , 'Code used was not specified in attrs'
    
    #step 2: check that all coordinates have units
    for i in usr_xa.coords.keys():
        assert 'units' in usr_xa[i].attrs, f'Missing unit for {i} coords'

    #step 2 (continued): check that all data variables have units
    for i in usr_xa.data_vars.keys():
        assert 'units' in usr_xa[i].attrs, f'Missing unit for {i} data_var'
    """    #step 3: check that some attrs is a proper dictionary
    try : 
        for i in usr_xa.attrs:
            #these need to be dictionaries to be interpretable
            if i in ['planet_params','stellar_params','cld_params','orbit_params']: 
                json.loads(usr_xa.attrs[i])
    except ValueError: 
        print(f"Was not able to read attr for {i}. This means that you did not properly define a dictionary with json and a dict."," For example: json.dumps({'mp':1,'rp':1})")
    
    #step 4: hurray if you have made it to here this is great
    #last thing is the least important -- to make sure that we agree on terminology
    for i in usr_xa.attrs: 
        if i == 'planet_params': 
            for model_key in json.loads(usr_xa.attrs[i]).keys():
                assert model_key in ['rp', 'mp', 'tint', 'heat_redis', 'p_reference',
                'pteff', 'mh' , 'cto' , 'logkzz'], f'Could not find {model_key} in listed planet_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
        
        elif  i == 'stellar_params': 
            for model_key in json.loads(usr_xa.attrs[i]).keys():
                assert model_key in ['logg', 'feh', 'steff', 'rs', 'ms',
                ], f'Could not find {model_key} in listed stellar_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
        
        elif  i == 'orbit_params': 
            for model_key in json.loads(usr_xa.attrs[i]).keys():
                assert model_key in ['sma',
                ], f'Could not find {model_key} in listed orbit_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
        
        elif  i == 'cld_params': 
            for model_key in json.loads(usr_xa.attrs[i]).keys():
                assert model_key  in ['opd','ssa','asy','fsed'
                ], f'Could not find {model_key} in listed cld_params attr. This might be because we havent added it yet! Check your terms and contact us if this is the case'
    """
    
    print('SUCCESS!')


In [None]:
#ds_sm = xr.open_dataset("example_file.nc")
#data_check(ds_sm)
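Until the checker is finalized, the required names listed earlier in this notebook can be checked with a short lookup. This is a sketch only: the `REQUIRED_NAMES` table and `missing_required` function are hypothetical names, and the table is transcribed from the required `data_vars` and `coords` lists above, so it would need updating if those lists change.

```python
import numpy as np
import xarray as xr

# Required names per file type, transcribed from the lists in this notebook
REQUIRED_NAMES = {
    "stellar_spectra": {
        "data_vars": ["flux", "flux_error", "quality_flag"],
        "coords": ["wavelength", "time"],
    },
    "raw_light_curves": {
        "data_vars": ["raw_flux", "raw_flux_error", "quality_flag"],
        "coords": ["time_flux", "central_wavelength"],
    },
    "fitted_light_curves": {
        "data_vars": ["raw_flux", "raw_flux_error", "light_curve_model",
                      "corrected_flux", "corrected_flux_error", "quality_flag"],
        "coords": ["time_flux", "time_model", "central_wavelength"],
    },
    "transit_spectra": {
        "data_vars": ["transit_depth", "transit_depth_error"],
        "coords": ["central_wavelength"],
    },
}


def missing_required(dataset, file_type):
    """List required data_vars and coords that are absent from the dataset."""
    required = REQUIRED_NAMES[file_type]
    missing = [n for n in required["data_vars"] if n not in dataset.data_vars]
    missing += [n for n in required["coords"] if n not in dataset.coords]
    return missing


# Fake transit spectrum that forgot its error array
incomplete = xr.Dataset(
    data_vars={"transit_depth": (["central_wavelength"], np.full(3, 0.01))},
    coords={"central_wavelength": (["central_wavelength"], np.array([1.0, 2.0, 3.0]))},
)
print(missing_required(incomplete, "transit_spectra"))
```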