# Intro to Xarray
Here I'll demonstrate how Xarray can simplify the processing of our orbital data.

In [None]:
import numpy as np
import pandas as pd
import xarray as xr
import os
import glob

First we need to get the files we want to load. Rather than hard-coding the names, I will use `glob` to collect the names of all of the input files in my data directory. You can think of this construct as an example of the *declarative programming* paradigm: we start from our goal (a sorted list of the files in the data directory whose names end with `-XV.csv`) rather than explicitly writing out the steps needed to achieve it.

In [None]:
datadir = os.path.join(os.pardir,"data")
datafiles = glob.glob(os.path.join(datadir, "*-XV.csv"))
datafiles.sort()
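
To see what the pattern matches without touching the real data directory, here is a small sketch that runs the same `glob` call against a temporary directory. The file names are hypothetical placeholders, chosen only to show which ones the `*-XV.csv` pattern keeps.

```python
import glob
import os
import tempfile

# Hypothetical file names, used only to illustrate the pattern
example_names = ["Venus-XV.csv", "Earth-XV.csv", "Earth-EL.csv", "notes.txt"]

with tempfile.TemporaryDirectory() as tmpdir:
    for fname in example_names:
        # Create empty placeholder files
        open(os.path.join(tmpdir, fname), "w").close()

    # glob makes no promise about ordering, so sort for reproducibility
    matches = sorted(glob.glob(os.path.join(tmpdir, "*-XV.csv")))
    matched_names = [os.path.basename(m) for m in matches]

print(matched_names)
```

Only the two names ending in `-XV.csv` survive, and sorting puts them in a predictable order regardless of how the file system lists them.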

Next I'll read in each of the files in the list. Xarray does not have a native way to directly read CSV files. [The Xarray documentation recommends using Pandas as an intermediate for processing CSV files.](https://docs.xarray.dev/en/stable/user-guide/io.html#csv-and-other-formats-supported-by-pandas) So we'll use the Pandas `read_csv` function to read each file into a Pandas DataFrame and then convert the DataFrame to an Xarray Dataset.

Before doing this for all of the files, let's start with a single one so we can follow each of the steps.

### The data format
First, let's take a look at what the input data looks like so we know how best to go about reading it in. Let's start with the first file in our input file list. I'll print the first three lines so you can see how it is structured.

In [None]:
print(f"File name: {datafiles[0]}")

In [None]:
with open(datafiles[0], 'r') as f:
    for line_num in range(3):
        # Strip the trailing newline so print() does not add a blank line
        print(f"Line {line_num+1}: {next(f).rstrip()}")

### Reading into a Pandas DataFrame
The data is comma-delimited, with a header as the first line. This format is basically what Pandas expects by default, making our call to `read_csv` relatively straightforward. Let's see what it looks like when we read this in using the default arguments.

In [None]:
df = pd.read_csv(datafiles[0])
df

### Converting a Pandas DataFrame into an Xarray Dataset
As you can see, Pandas has read this data in and labeled each column by its header value. Let's see what it looks like when we convert this to an Xarray Dataset. We'll use the built-in Pandas method `to_xarray` to do this.

In [None]:
ds = df.to_xarray()
ds

### Using a column of the input data as an index
This is pretty good, but notice how it has generated both a *dimension* and a *coordinate* called `index`. When a dimension and coordinate are linked in an Xarray object, it is called a *dimension coordinate* and is printed in bold in the Notebook view. This was automatically generated from the Pandas DataFrame's index column. However, we know that each row of data represents a point in time, so the data really ought to be indexed by the time variable, `t`. 

We could attempt to do this now by converting the Dataset variable `t` into a dimension coordinate, but I think it would be cleaner to get this sorted while reading in the data to begin with, as it only requires a single additional argument to our `read_csv` call: `index_col`. This tells Pandas which column of data should be treated as the index. This is the 0th column in our input data.
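
For comparison, the after-the-fact route might look something like the sketch below. It uses a small synthetic DataFrame (the values are made up) in place of the real file, and relies on `swap_dims` and `drop_vars` to promote `t` and discard the leftover `index` coordinate.

```python
import pandas as pd
import xarray as xr  # to_xarray() needs xarray to be installed

# Synthetic stand-in for one input file (values are made up)
df_demo = pd.DataFrame({"t": [0.0, 1.0, 2.0], "rxh": [1.0, 0.5, -0.2]})
ds_default = df_demo.to_xarray()  # dimension and coordinate are both "index"

# Make "t" the dimension coordinate, then drop the leftover "index" coordinate
ds_t = ds_default.swap_dims({"index": "t"}).drop_vars("index")
print(ds_t)
```

This gets to the same place, but it takes two extra calls that the `index_col` argument makes unnecessary.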

In [None]:
df = pd.read_csv(datafiles[0],index_col=0)
df

In [None]:
ds = df.to_xarray()
ds

With one argument early in our process, we now have our time values as the *dimension coordinate* of the Dataset. 

### Including metadata to give context to your data
One of the advantages of Xarray is that it provides far more ways to attach useful metadata to our data. By giving our dimension coordinates a name, we can process our data using that name and "natural" values of that coordinate, rather than arbitrary index values. Another way to supply useful metadata is via attributes. These can be used to give our data more context by, for instance, supplying units for the data variables.
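
As a rough illustration of what selecting by "natural" values means, the sketch below builds a tiny synthetic Dataset indexed by `t` (the numbers are invented) and compares label-based selection with `sel` against position-based selection with `isel`.

```python
import numpy as np
import pandas as pd
import xarray as xr

# Synthetic time series: rxh is simply 10 * t, so results are easy to check
t_index = pd.Index([0.0, 1.0, 2.0, 3.0, 4.0], name="t")
df_demo = pd.DataFrame({"rxh": 10.0 * np.arange(5)}, index=t_index)
ds_demo = df_demo.to_xarray()

# Select by time value: a label-based slice includes both endpoints
window = ds_demo.sel(t=slice(1.0, 3.0))

# Select the sample closest to a time that is not exactly in the data
nearest = ds_demo["rxh"].sel(t=2.2, method="nearest")

# The positional equivalent requires knowing where t=2 sits in the array
by_position = ds_demo["rxh"].isel(t=2)

print(window["t"].values, float(nearest), float(by_position))
```

With `sel` the code states the time of interest directly, so it keeps working if the sampling of the input files changes.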

You can create any kind of attribute you want, but there are two that you should start with, because the Xarray plotting methods use them to label your plots automatically once we get to that stage. These are `units` and `long_name`. Let's set those for each of our data variables and our time coordinate:

In [None]:
ds['t'] = ds['t'].assign_attrs(long_name='Time', units='day')

In [None]:
cart = ['x','y','z']
for c in cart:
    r = f"r{c}h"
    v = f"v{c}h"
    ds[r] = ds[r].assign_attrs(long_name=f"Heliocentric $r_{c}$", units='AU')
    ds[v] = ds[v].assign_attrs(long_name=f"Heliocentric $v_{c}$", units='AU/day')

To see how useful this is, let's plot one of the variables. Notice that it labeled our axes for us.

In [None]:
ds['rxh'].plot()

Including metadata like this very early in the pipeline makes it much easier to understand the data you are working with, and helps communicate results with less ambiguity.

### Merging multiple input data files into one Dataset
Now that we've established the basics of our data pipeline on one file, we'd like to apply it to all of our files. Before we proceed, let's write a simple function that does all of the intermediate steps that we just did above, so we can repeat them for each file. Putting the data analysis steps into their own function improves the readability of our script, because the individual steps are separated from the loop that executes them. I'll also include a step that extracts the planet name from the file name.
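
The name-extraction step can be tried on its own before it goes into the function. The sketch below assumes file names of the form `<Planet>-XV.csv`; the helper name `planet_name_from_path` and the example path are hypothetical, used only for illustration.

```python
import os

def planet_name_from_path(filename):
    """Return the text before the first '-' in the file's base name."""
    base = os.path.basename(filename)  # drop the directory part
    return base.split("-")[0]          # keep what precedes "-XV.csv"

# Hypothetical path, built the same way as the data directory above
example_path = os.path.join(os.pardir, "data", "Mars-XV.csv")
print(planet_name_from_path(example_path))
```

Note that this simple split would truncate a body whose own name contains a hyphen, so it is only a reasonable choice while the file names follow the pattern above.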

In [None]:
def process_inputs(filename):
    # Read in data file and convert to Xarray Dataset
    df = pd.read_csv(filename,index_col=0)
    ds = df.to_xarray()
    
    # Set units and long_name attributes
    ds['t'] = ds['t'].assign_attrs(long_name='Time', units='day')
    cart = ['x','y','z']
    for c in cart:
        r = f"r{c}h"
        v = f"v{c}h"
        ds[r] = ds[r].assign_attrs(long_name=f"Heliocentric $r_{c}$", units='AU')
        ds[v] = ds[v].assign_attrs(long_name=f"Heliocentric $v_{c}$", units='AU/day')
        
    # Extract planet name and store it as a new variable value
    name = os.path.basename(filename).split("-")[0]
    ds['name'] = [name]
    ds['name'] = ds['name'].assign_attrs(long_name="Planet name")
    return ds

In [None]:
planet_data = []
for f in datafiles:
    planet_data.append(process_inputs(f))

Now we have a list of Xarray Datasets, with each element of the list representing a planet. Let's see what one of these looks like.

In [None]:
planet_data[0]

We could stop here, but we'd be left with a list of separate Datasets that are disconnected from each other. However, we know that they all share the same time coordinates and variables. Xarray is designed to deal with multidimensional data, so we can combine the data into a single Dataset and treat planet names as a dimension of the data.

Notice that Xarray automatically converted our `name` variable into a dimension coordinate for us, given that each input file had a single value for this variable. This makes the Datasets very easy to combine. [There are a number of ways of combining data, depending on what you start with and what your goal is.](https://docs.xarray.dev/en/stable/user-guide/combining.html) Because of the choices we made throughout our processing pipeline, we can combine the data with a single call to the `xr.concat` function.

In [None]:
ds = xr.concat(planet_data,dim='name')

In [None]:
ds
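
Once the planets share one Dataset, an individual body can be pulled back out by name. The sketch below uses a small synthetic Dataset with the same `name` and `t` dimensions (the values are invented, and it is built directly rather than with `concat` to keep the example short).

```python
import numpy as np
import xarray as xr

# Synthetic stand-in for the combined Dataset: 2 planets x 3 times
rxh = np.array([[1.0, 0.0, -1.0],
                [1.5, 0.2, -1.4]])
combined = xr.Dataset(
    {"rxh": (("name", "t"), rxh)},
    coords={"name": ["Earth", "Mars"], "t": [0.0, 1.0, 2.0]},
)

# Pull out one planet; the "name" dimension collapses to a scalar coordinate
mars = combined.sel(name="Mars")

# Select along both dimensions at once
earth_at_t1 = combined["rxh"].sel(name="Earth", t=1.0)

# Reduce over time to get one value per planet
max_per_planet = combined["rxh"].max(dim="t")

print(float(earth_at_t1), max_per_planet.values)
```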

### Processing the Dataset
Now that we've got our data into a useful format, we can start to do some processing on it. Because our Dataset is rich with context, we can do most of our processing in the declarative programming mode. For instance, suppose we want to compute the position and velocity magnitudes of our planet orbits. These can be done with single lines that are easy to understand.

In [None]:
ds['rhmag'] = np.sqrt(ds['rxh']**2 + ds['ryh']**2 + ds['rzh']**2)
ds['vhmag'] = np.sqrt(ds['vxh']**2 + ds['vyh']**2 + ds['vzh']**2)

Remember to add metadata!

In [None]:
# Raw strings keep the LaTeX backslashes from being read as escape sequences
ds['rhmag'] = ds['rhmag'].assign_attrs(long_name=r"Heliocentric $|\mathbf{r}|$", units="AU")
ds['vhmag'] = ds['vhmag'].assign_attrs(long_name=r"Heliocentric $|\mathbf{v}|$", units="AU/day")

In [None]:
ds['rhmag'].plot(hue="name")

We can also add new variables to the Dataset. Keep in mind that each variable is a DataArray, so you can create one out of things like NumPy arrays and lists. Let's do this for planet mass. I'll use a list comprehension to convert our MSun/Mpl dictionary into Mpl values, using the dictionary keys as our dimension coordinate.

In [None]:
MSun_over_Mpl = {
    'Mercury': 6023600.0,
    'Venus': 408523.71,
    'Earth': 328900.56,
    'Mars': 3098708.,
    'Jupiter': 1047.3486,
    'Saturn': 3497.898,
    'Uranus': 22902.98,
    'Neptune': 19412.24,
    'Pluto': 1.35e8
}

In [None]:
ds['mass'] = xr.DataArray(
    data=[1.0 / v for v in MSun_over_Mpl.values()],
    dims="name",  # name the dimension explicitly rather than inferring it
    coords={"name": list(MSun_over_Mpl)},
    attrs={"long_name": "Mass of planet", "units": "$M_{sun}$"},
)

In [None]:
ds
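
One detail worth knowing here is that the assignment matches values by their `name` labels, not by their position. The sketch below illustrates this with a two-planet synthetic Dataset in which the new DataArray deliberately lists the planets in the opposite order.

```python
import xarray as xr

# Dataset whose "name" coordinate is ordered Earth, Mars
ds_demo = xr.Dataset(coords={"name": ["Earth", "Mars"]})

# Masses supplied in the opposite order: Mars first, then Earth
mass = xr.DataArray(
    data=[1.0 / 3098708.0, 1.0 / 328900.56],
    dims="name",
    coords={"name": ["Mars", "Earth"]},
)

# Assignment aligns on the "name" labels of the existing Dataset
ds_demo["mass"] = mass

earth_mass = float(ds_demo["mass"].sel(name="Earth"))
first_entry = float(ds_demo["mass"].isel(name=0))
print(earth_mass, first_entry)
```

Because alignment is by label, the dictionary does not need to be in the same order as the input files.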

We can also compute the gravitational parameter value for each body that we can use for computing orbital elements.

In [None]:
ds['mu'] = 1.0 + ds['mass']
# Raw string keeps "\m" from being read as an escape sequence
ds['mu'] = ds['mu'].assign_attrs(long_name=r"Gravitational parameter $\mu$", units="$M_{sun}$")
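
As one possible next step, here is a hedged sketch of how a value like `mu` could feed an orbital element calculation. It is not part of the pipeline above: it uses plain NumPy, a hypothetical helper named `semimajor_axis`, and the assumption that `mu` (in solar masses) is converted to AU^3/day^2 by multiplying by the square of the Gaussian gravitational constant. The semimajor axis then follows from the vis-viva equation, $v^2 = GM\,(2/r - 1/a)$.

```python
import numpy as np

GAUSS_K = 0.01720209895  # Gaussian gravitational constant (AU, day, solar mass units)

def semimajor_axis(r, v, mu_msun):
    """Semimajor axis in AU from the vis-viva equation.

    r       : heliocentric distance in AU
    v       : heliocentric speed in AU/day
    mu_msun : total mass (Sun + planet) in solar masses, as in `mu` above
    """
    gm = GAUSS_K**2 * mu_msun  # assumed conversion to AU^3 / day^2
    return 1.0 / (2.0 / r - v**2 / gm)

# Check with a circular orbit at 1 AU for an Earth-mass planet
mu_earth = 1.0 + 1.0 / 328900.56
v_circular = GAUSS_K * np.sqrt(mu_earth)  # circular speed at r = 1 AU
a_circular = semimajor_axis(1.0, v_circular, mu_earth)
print(a_circular)
```

The same function works unchanged on DataArrays such as `ds['rhmag']`, `ds['vhmag']`, and `ds['mu']`, since it only uses arithmetic that broadcasts over named dimensions.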