## 1b. Exporting ERA5-Land Hourly met point data to ELM-ready format
This notebook uses the results of notebook `1a`, so make sure you understand that one before running this one. Essentially, we're going to convert the "raw" ERA5-Land hourly met data into the netCDF files that ELM expects.

I have made this code pretty simple for the user, but there's quite a bit happening under the hood. I'll describe some of this along the way.

In [1]:
import pandas as pd
from pathlib import Path
import ngeegee.e5l_utils as eu
from ngeegee.utils import _ROOT_DIR

# Build our points dictionary (same as in 1a notebook)
points = {'abisko' : (68.35, 18.78333),
        'tvc' : (68.742, -133.499),
        'toolik' : (68.62758, -149.59429),
        'chars' :  (69.1300, -105.0415),
        'qhi' : (69.5795, -139.0762),
        'sam' : (72.22, 126.3),
        'sjb' : (78.92163, 11.83109)}

# Load the raw ERA5-Land hourly data we exported in notebook 1a.
path_e5lh = _ROOT_DIR / 'notebooks' / 'notebook_data' / 'ngee_test_era5_timeseries.csv'
df = pd.read_csv(path_e5lh)
print(df)

           pid              date  temperature_2m  u_component_of_wind_10m  \
0       abisko  2022-01-01 00:00      253.960236                 1.675461   
1          tvc  2022-01-01 00:00      238.081329                 1.542160   
2       toolik  2022-01-01 00:00      235.602814                 0.689621   
3        chars  2022-01-01 00:00      247.622345                -0.047684   
4          qhi  2022-01-01 00:00      242.927032                 5.590012   
...        ...               ...             ...                      ...   
192523  toolik  2025-02-19 23:00      247.975540                -0.436813   
192524   chars  2025-02-19 23:00      252.844681                 3.637405   
192525     qhi  2025-02-19 23:00      250.440384                 3.153030   
192526     sam  2025-02-19 23:00      243.457962                 0.308304   
192527     sjb  2025-02-19 23:00      265.143509                -0.293259   

        v_component_of_wind_10m  surface_solar_radiation_downwards_hourly  

Ok, we've got our `df` loaded. Next, we're going to run a preprocessing script. It has a few main jobs:
- converts units to ELM-expected ones
- computes "indirect" variables (humidities) that are not directly available in ERA5-Land hourly but are needed by ELM
- ensures no negative values for variables where that is physically impossible
- does some date formatting and trimming: only full calendar years are kept (i.e. the dataframe is trimmed to the earliest and latest years for which every day of data is available)
- removes leap days (Feb 29), so that every year has 365 days and matches the `noleap` calendar
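
To give a feel for the "indirect" variables, here is a minimal sketch of how relative humidity and wind speed could be derived from quantities that ERA5-Land does provide (2 m temperature, 2 m dewpoint, and the u/v wind components). This is not necessarily what `e5lh_to_elm_preprocess` does internally: the Magnus approximation over liquid water, its coefficients, and the function names below are assumptions for illustration only.

```python
import math


def saturation_vapor_pressure_hpa(temp_k):
    """Magnus approximation over liquid water (assumed formula); kelvin in, hPa out."""
    t_c = temp_k - 273.15
    return 6.112 * math.exp(17.62 * t_c / (243.12 + t_c))


def relative_humidity_pct(temp_k, dewpoint_k):
    """Relative humidity in percent from air temperature and dewpoint (both in kelvin)."""
    rh = 100.0 * saturation_vapor_pressure_hpa(dewpoint_k) / saturation_vapor_pressure_hpa(temp_k)
    # Keep the result inside the physically meaningful 0-100 % range.
    return min(max(rh, 0.0), 100.0)


def wind_speed(u, v):
    """Wind speed (m/s) from the eastward (u) and northward (v) components."""
    return math.hypot(u, v)


# Made-up example values, not taken from the ERA5-Land export.
print(relative_humidity_pct(temp_k=283.15, dewpoint_k=273.15))
print(wind_speed(u=3.0, v=4.0))
```

At very cold Arctic sites a formulation over ice may be more appropriate; that choice is left open here.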

In [2]:
df = eu.e5lh_to_elm_preprocess(df, remove_leap=True, verbose=True) # remove_leap is True by default, just showing it here FYI. verbose is False by default.

0.91% of the values in surface_solar_radiation_downwards_hourly were negative and reset to 0.
0.62% of the values in total_precipitation_hourly were negative and reset to 0.


We see that there were some negative values in the radiation and precipitation bands. However, these percentages are fairly small and not worrisome.
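
For reference, a reset like the one reported above can be sketched in a couple of lines of pandas. This is an illustration on made-up values, not the code inside `e5lh_to_elm_preprocess`; the helper name `reset_negatives` is hypothetical.

```python
import pandas as pd


def reset_negatives(series):
    """Return (series with negatives set to 0, percent of values that were negative)."""
    pct_negative = float((series < 0).mean() * 100)
    return series.clip(lower=0), pct_negative


# Toy precipitation values (made up); one of four is negative.
toy = pd.Series([-0.5, 0.0, 1.2, 3.4], name='total_precipitation_hourly')
clipped, pct = reset_negatives(toy)
print(f"{pct:.2f}% of the values in {toy.name} were negative and reset to 0.")
```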

Let's continue. 

Next, we want to validate our ERA5-Land hourly data. What do I mean by "validate"?
- ensure that the mean of each ERA5 variable is within roughly an order of magnitude of a reference value (I've precomputed these for TVC only, so they aren't representative of every site)
- ensure that the ranges of the ERA5 data do not exceed realistic ones, nor those expected by ELM. (I pulled this info from the OLMT repo.)

Honestly, you could probably skip this step. I developed it to make sure all the unit conversions etc. done in `e5lh_to_elm_preprocess` were correct. It's probably a good habit to leave this in, though, and it's pretty fast anyway.
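
To make the two checks concrete, here is a minimal sketch of what they amount to. It is not the `validate_met_vars` implementation: the function names and the reference numbers are made up for illustration.

```python
import math


def fraction_outside(values, lo, hi):
    """Fraction of values that fall outside the reference range [lo, hi]."""
    return sum(1 for v in values if v < lo or v > hi) / len(values)


def same_order_of_magnitude(mean_value, reference_mean):
    """True if two positive means differ by less than a factor of 10."""
    return abs(math.log10(mean_value / reference_mean)) < 1.0


# Made-up surface pressures in Pa and a made-up reference range and mean.
pressures = [98000.0, 99500.0, 101000.0, 102500.0]
ref_lo, ref_hi, ref_mean = 99000.0, 102000.0, 100500.0

frac = fraction_outside(pressures, ref_lo, ref_hi)
mean = sum(pressures) / len(pressures)
print(f"{frac:.0%} of values are outside the reference range")
print("mean looks plausible:", same_order_of_magnitude(mean, ref_mean))
```

The order-of-magnitude check is the one that catches unit mistakes: a pressure accidentally left in hPa instead of Pa would be off by a factor of 100 and fail it.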

In [3]:
eu.validate_met_vars(df)


LOW CONCERN: 35% of the values in surface_pressure are beyond the range of the reference variable PSRF.
LOW CONCERN: 11% of the values in wind_speed are beyond the range of the reference variable WIND.
No reference statistics were available for the following variables, so their ranges were not validated: ['u_component_of_wind_10m', 'v_component_of_wind_10m', 'dewpoint_temperature_2m', 'wind_direction', 'relative_humidity']


Ok, let's unpack this.

> LOW CONCERN: 35% of the values in surface_pressure are beyond the range of the reference variable PSRF.

This means that 35% of the values in `surface_pressure` fall outside the range that I computed from TVC. Since that reference range comes from a single site and we didn't get any `HIGH CONCERN` printouts, I wouldn't worry about it. I really need a better dataset to compute these ranges from...
The other `LOW CONCERN` message (wind speed) is the same story, so we'll skip it.

The last thing is 
> No reference statistics were available for the following variables, so their ranges were not validated: ['u_component_of_wind_10m', 'v_component_of_wind_10m', 'dewpoint_temperature_2m', 'wind_direction', 'relative_humidity']

This simply means that I was unable to find reference statistics for these variables, so we can't actually check their ranges or orders of magnitude to ensure they're reasonable. Future work, I guess. You might want to look at them and make sure they're what you expect.
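
One quick way to eyeball those unvalidated variables is to print their min, mean, and max. A minimal sketch on a toy DataFrame follows; the values are made up and the helper name `summarize` is hypothetical. With the real data you would pass the preprocessed `df` instead of `toy`.

```python
import pandas as pd

# Column names copied from the validation message above.
unvalidated = ['u_component_of_wind_10m', 'v_component_of_wind_10m',
               'dewpoint_temperature_2m', 'wind_direction', 'relative_humidity']


def summarize(df, columns):
    """Min/mean/max for whichever of the requested columns exist in df."""
    present = [c for c in columns if c in df.columns]
    return df[present].agg(['min', 'mean', 'max'])


# Toy data (made up) containing only two of the columns.
toy = pd.DataFrame({'wind_direction': [0.0, 180.0, 359.0],
                    'relative_humidity': [45.0, 80.0, 100.0]})
summary = summarize(toy, unvalidated)
print(summary)
```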

Now we can move on to the last step--actually exporting the ELM-ready netCDFs. We'll need to specify a few things before running our export command.

In [4]:
# Define a DataFrame from our points dictionary
df_loc = pd.DataFrame({'pid' : points.keys(),
                       'lat' : [points[p][0] for p in points],
                       'lon' : [points[p][1] for p in points]}) # sorry, this is awkward to do but it will make things scalable as this repo develops
dir_out = _ROOT_DIR / 'notebooks' / 'notebook_data' / 'elm_ready' # a folder will be created for each site; this directory specifies where the site folders go
# The files need to specify a "z value" in meters, which is (I think) the height above ground
# at which the observations were made. I am still unsure why this is relevant, so I'm using 1
# as a placeholder for now.
zval = 1

# Now we export!
eu.export_for_elm(df, df_loc, dir_out, zval)

...and that's it! You can see that a folder for each site has been created:

![Site folders!](notebook_data/images/1b-site-folders.png)



We can also look in a folder--here we'll choose abisko:

![Abisko](notebook_data/images/1b-abisko-elm-variables.png)

We see that each site folder contains one netCDF per variable needed by ELM*.

*Note that `ngeegee` currently supports only the variables needed when running in `COUPLER_BYPASS` mode. There isn't a ton of documentation on this mode, but Dan Ricciuto created it as a faster way to get your met data to ELM. The alternative is called `DATM_MODE`.

If you want, you can open a netCDF and poke around:

In [8]:
import os
import xarray as xr

path_file = _ROOT_DIR / 'notebooks' / 'notebook_data' / 'elm_ready' / 'abisko'
files = os.listdir(path_file)
f = path_file / files[0]

ds = xr.open_dataset(f)
print(ds)


<xarray.Dataset> Size: 420kB
Dimensions:                                     (n: 1, DTIME: 26280)
Coordinates:
  * DTIME                                       (DTIME) datetime64[ns] 210kB ...
Dimensions without coordinates: n
Data variables:
    LONGXY                                      (n) float32 4B ...
    LATIXY                                      (n) float32 4B ...
    surface_thermal_radiation_downwards_hourly  (n, DTIME) float64 210kB ...
Attributes:
    history:      Created using xarray
    units:        W/m2
    description:  incident longwave (FLDS)
    calendar:     noleap
    created_on:   2025-02-26


I haven't yet tested if these netCDFs are indeed 100% ELM compatible :) 
But please let me know if you find problems! (Open an issue in this repo.)