# ERA5 Analysis Process

This Jupyter notebook provides a brief overview of how to use the **geodata** package to download ERA5 data from the [Copernicus Data Store](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview), create geographic-temporal subsets called cutouts, and use those cutouts to generate standalone datasets for separate analysis.

*The following guide assumes you have installed and configured **geodata** and all required dependencies.*

## Step 1 - Setup

Import the package first.

In [1]:
import geodata


Notifications in **geodata** are implemented using `loggers` from the `logging` library.
It is recommended to always launch a logger to get information on what is going on. For debugging, you can use the more verbose `level=logging.DEBUG`:

In [2]:
import logging

logging.basicConfig(level=logging.INFO)
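
For a debugging session, a setup along the following lines may be useful. It is a minimal sketch using only the standard `logging` module. The format string and the choice to raise only the `geodata` logger to `DEBUG` are assumptions for illustration, not settings the package requires. The logger name `geodata` is inferred from names such as `geodata.dataset` that appear in the log output.

```python
import logging

# Root logger stays at INFO so third-party libraries remain quiet.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s - %(name)s - %(levelname)s - %(message)s",
    force=True,  # replace any handlers configured earlier in the session
)

# Raise verbosity only for loggers under the "geodata" namespace.
geodata_logger = logging.getLogger("geodata")
geodata_logger.setLevel(logging.DEBUG)

# Child loggers such as "geodata.dataset" inherit the DEBUG level.
child = logging.getLogger("geodata.dataset")
print(child.getEffectiveLevel() == logging.DEBUG)
```

Keeping the root logger at `INFO` limits the extra output to **geodata** itself.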

## Step 2 - Download Data
Assuming you have previously created a CDS account and set up the CDS API credentials, you can download ERA5 data from the CDS API as follows.

First, define a dataset object for the data you wish to download:

In [3]:
## For ERA5, pass geographic bounds in array as follows:
## bounds = [West, South, East, North]
## Omitting bounds will default to global file of 20+ GB per month
DS = geodata.Dataset(
    module="era5",
    weather_data_config="wind_solar_hourly",
    years=slice(2005, 2005),
    months=slice(1, 2),
    bounds=[50, -3, 45, 3],
)

2024-11-06 15:32:36,835 - geodata.dataset - INFO - Directory /Users/xiqiangliu/.local/geodata/era5 found, checking for completeness.
2024-11-06 15:32:36,835 - geodata.dataset - INFO - Directory complete.


* Use `module` to specify the data source. In this example, it is "era5".
* Use `weather_data_config` to specify the dataset.  In this example, hourly data is used, as specified by the `"wind_solar_hourly"` value.
* Use `years=slice()` and `months=slice()` to specify the years and months for download.  In each parameter, the first value indicates the start period, and the second value the end period.
* Use `bounds` to specify the geographic bounds to which you wish to limit your download data.  `bounds` should be set as follows: `bounds = [West, South, East, North]`. Omitting bounds will default to downloading a global file of 20+ GB per month.
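
To make the `years` and `months` slices concrete, the sketch below lists the (year, month) pairs that a pair of slices covers. The helper `expand_periods` is hypothetical and not part of **geodata**. It assumes both ends of each slice are inclusive, which is what the cutout log further down suggests (there, `months=slice(1, 1)` yields the single period `(2005, 1)`).

```python
from itertools import product


def expand_periods(years: slice, months: slice) -> list:
    """List the (year, month) pairs covered by two slices.

    Assumption: both ends of each slice are inclusive, unlike Python's
    usual half-open slicing.
    """
    year_range = range(years.start, years.stop + 1)
    month_range = range(months.start, months.stop + 1)
    return list(product(year_range, month_range))


# The Dataset example above: January and February 2005.
print(expand_periods(slice(2005, 2005), slice(1, 2)))  # [(2005, 1), (2005, 2)]
```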

Use the code block below to begin the download.

When a `Dataset` object is created, **geodata** checks whether the specified data has already been downloaded by looking for ERA5 data files in the `era5` directory configured in `src/geodata/config.py`.  Downloaded data is placed into subdirectories by year and then, for daily files, by month (i.e. `2011/01`, `2011/02`, `2012/01`, etc.).  Monthly files are simply placed in the month's folder.  If downloaded data is found, the `prepared` attribute is set to `True` when the `Dataset` object is created.

Accordingly, the snippet below saves you the trouble of accidentally redownloading data if it is already present in the correct subdirectories.

In [4]:
if not DS.prepared:
    DS.get_data()
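
The completeness check described above can be pictured with the following sketch. It is not **geodata**'s implementation. The helper `missing_monthly_files` is hypothetical, and the `<year>/<month>/wind_solar_hourly.nc` layout is inferred from the file path that appears in the cutout preparation log.

```python
from pathlib import Path
import tempfile


def missing_monthly_files(root, periods, filename="wind_solar_hourly.nc"):
    """Return expected <root>/<YYYY>/<MM>/<filename> paths that do not exist."""
    expected = [
        Path(root) / f"{year:04d}" / f"{month:02d}" / filename
        for year, month in periods
    ]
    return [path for path in expected if not path.is_file()]


# Demonstration in a throwaway directory: only January 2005 is present.
with tempfile.TemporaryDirectory() as tmp:
    january = Path(tmp) / "2005" / "01"
    january.mkdir(parents=True)
    (january / "wind_solar_hourly.nc").touch()

    missing = missing_monthly_files(tmp, [(2005, 1), (2005, 2)])
    missing_months = [path.parent.name for path in missing]

print(missing_months)  # ['02']
```

An empty result would correspond to the situation in which `prepared` is `True` and no download is needed.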

Finally, in order to use the downloaded ERA5 data with **geodata**, run:

In [5]:
DS.trim_variables()

`trim_variables()` subsets and resaves the downloaded files so that only those variables needed to generate **geodata** outputs are kept.

## Step 3 - Create Cutout

A cutout is a subset of downloaded data based on specified time periods and geographic coordinates.  Cutouts are saved to the cutout directory specified in `src/geodata/config.py` and can be used to generate multiple outputs.


In [11]:
cutout = geodata.Cutout(
    name="era5-europe-test-2005-01",
    module="era5",
    weather_data_config="wind_solar_hourly",
    xs=slice(46, 48),
    ys=slice(1, 2),
    years=slice(2005, 2005),
    months=slice(1, 1),
)

2024-11-06 15:33:07,427 - geodata.cutout - INFO - Cutout (era5-europe-test-2005-01, /Users/xiqiangliu/.local/geodata/cutouts) not found or incomplete.


The above code creates a cutout for January 2005 for a geographic area corresponding to a portion of Europe. Walking through the parameters:

* `name` will be the name of the directory created in the cutouts folder where **geodata** will place the data files corresponding to the cutout.
* `module` indicates the source for the data from which the cutout is created.
* Use `xs=slice()` and `ys=slice()` to define a geographical range for the cutout.
* Use `years=slice()` and `months=slice()` to define a temporal range for the cutout. 

`geodata.Cutout()` only defines the cutout object in memory.  To actually create the cutout files, run `prepare()`:

In [12]:
cutout.prepare()

2024-11-06 15:33:08,612 - geodata.preparation - INFO - Starting preparation of cutout 'era5-europe-test-2005-01'
2024-11-06 15:33:08,615 - geodata.datasets.era5 - INFO - MultiIndex([(2005, 1)],
           names=['year', 'month'])
2024-11-06 15:33:08,616 - geodata.datasets.era5 - INFO - [(2005, 1)]
2024-11-06 15:33:09,772 - geodata.datasets.era5 - INFO - Opening /Users/xiqiangliu/.local/geodata/era5/2005/01/wind_solar_hourly.nc
2024-11-06 15:33:09,799 - geodata.preparation - INFO - Merging variables into monthly compound files
2024-11-06 15:33:09,801 - geodata.preparation - INFO - Cutout 'era5-europe-test-2005-01' has been successfully prepared


Running `cutout.prepare()` as above will create the cutout by downloading and then subsetting the ERA5 data.  Accordingly, the above code block could take a while to finish processing.

`prepare()` first checks whether a cutout has already been created at the specified directory, and exits the download and creation process if one already exists.  To override this behavior and force a redownload and recalculation of the cutout, run `prepare(overwrite=True)`.
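
The skip-unless-overwritten behavior is a common pattern and can be sketched in a few lines. The function `prepare_once` and the placeholder build step are hypothetical stand-ins, not **geodata** code. They only illustrate how an `overwrite` flag typically interacts with an existence check.

```python
from pathlib import Path
import tempfile


def prepare_once(target: Path, build, overwrite: bool = False) -> bool:
    """Run build(target) unless target already exists.

    Returns True when the build ran, False when it was skipped.
    """
    if target.exists() and not overwrite:
        return False
    build(target)
    return True


def write_placeholder(path: Path) -> None:
    # Stand-in for the expensive download-and-subset step.
    path.write_text("cutout data")


with tempfile.TemporaryDirectory() as tmp:
    target = Path(tmp) / "cutout.nc"
    first = prepare_once(target, write_placeholder)
    second = prepare_once(target, write_placeholder)
    forced = prepare_once(target, write_placeholder, overwrite=True)

print(first, second, forced)  # True False True
```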

To verify the results of the cutout, you can print some attributes to the console as follows.

Basic information:

In [13]:
cutout

<Cutout era5-europe-test-2005-01 x=48.00-48.00 y=2.00-1.00 time=2005/1-2005/1 prepared>

Name:

In [14]:
cutout.name

'era5-europe-test-2005-01'

Coordinates:

In [15]:
cutout.coords

Coordinates:
    number      int64 8B 0
  * valid_time  (valid_time) datetime64[ns] 6kB 2005-01-01 ... 2005-01-31T23:...
  * y           (y) float64 40B 2.0 1.75 1.5 1.25 1.0
  * x           (x) float64 8B 48.0
    expver      (valid_time) <U4 12kB '0001' '0001' '0001' ... '0001' '0001'
    lon         (x) float64 8B 48.0
    lat         (y) float64 40B 2.0 1.75 1.5 1.25 1.0
  * time        (time) datetime64[ns] 8B 2005-01-01
  * year-month  (year-month) object 8B MultiIndex
  * year        (year-month) int64 8B 2005
  * month       (year-month) int64 8B 1

All metadata:

In [16]:
cutout.meta

## Step 4 - Generate Outputs

**geodata** currently supports the following outputs using ERA5 data from the [Copernicus Data Store](https://cds.climate.copernicus.eu/cdsapp#!/dataset/reanalysis-era5-single-levels?tab=overview).

### Wind
* Wind generation time-series (`wind`)
* Wind speed time-series (`windspd`)

### Solar
* Solar photovoltaic generation time-series (`pv`)

### Wind Generation Time-series
Convert wind speeds to wind energy generation for a given turbine using the following code:

In [18]:
ds_wind = cutout.wind(turbine="Suzlon_S82_1.5_MW", smooth=True)

Going over the parameters:

* `cutout` - **Cutout** -  A cutout created by `geodata.Cutout()`
* `turbine` - **string or dict** - Name of a turbine known by the reatlas client, or a turbine configuration dictionary with the key `'hub_height'` for the hub height and the keys `'V'` and `'POW'` defining the power curve.  For a full list of currently supported turbines, see [the list of turbines here.](https://github.com/east-winds/geodata/tree/master/geodata/resources/windturbine)
* `smooth` - **bool or dict** - If `True`, smooth the power curve with a Gaussian kernel as determined for the Danish wind fleet, with Delta_v = 1.27 and sigma = 2.29. A dict allows these values to be tuned.

*Note* - 
You can also specify all of the general conversion arguments documented in the `convert_and_aggregate` function (e.g. `var_height='lml'`).
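
As an illustration of the dictionary form of `turbine`, the sketch below defines a made-up turbine and evaluates its power curve by linear interpolation. The numbers are illustrative and do not describe a real machine. The units (m/s for `V`, MW for `POW`) and the zero output outside the curve's range are assumptions, and `power_at` is a hypothetical helper rather than a **geodata** function.

```python
from bisect import bisect_right

# Hypothetical turbine: the numbers are illustrative, not a real power curve.
custom_turbine = {
    "hub_height": 80.0,            # metres
    "V": [0.0, 3.0, 12.0, 25.0],   # wind speed, assumed m/s
    "POW": [0.0, 0.0, 1.5, 1.5],   # power output, assumed MW
}


def power_at(speed: float, turbine: dict) -> float:
    """Linearly interpolate the power curve; zero outside its range."""
    v, p = turbine["V"], turbine["POW"]
    if speed < v[0] or speed > v[-1]:
        return 0.0
    if speed == v[-1]:
        return p[-1]
    i = bisect_right(v, speed) - 1
    fraction = (speed - v[i]) / (v[i + 1] - v[i])
    return p[i] + fraction * (p[i + 1] - p[i])


print(power_at(7.5, custom_turbine))  # 0.75
```

Based on the parameter description above, a dictionary of this shape would be passed as `turbine=custom_turbine` in place of a turbine name.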

The convert function returns an xarray dataset, which is an in-memory representation of a NetCDF file.

In [19]:
ds_wind

To convert this array to a more conventional dataframe, run:

In [20]:
df_wind = ds_wind.to_dataframe(name="wind")

which converts the xarray dataset into a pandas dataframe:

In [21]:
df_wind

| time | y | x | number | expver | lon | lat | wind |
|---|---|---|---|---|---|---|---|
| 2005-01-01 00:00:00 | 2.00 | 48.0 | 0 | 0001 | 48.0 | 2.00 | 0.813830 |
| 2005-01-01 00:00:00 | 1.75 | 48.0 | 0 | 0001 | 48.0 | 1.75 | 0.810174 |
| 2005-01-01 00:00:00 | 1.50 | 48.0 | 0 | 0001 | 48.0 | 1.50 | 0.806619 |
| 2005-01-01 00:00:00 | 1.25 | 48.0 | 0 | 0001 | 48.0 | 1.25 | 0.802788 |
| 2005-01-01 00:00:00 | 1.00 | 48.0 | 0 | 0001 | 48.0 | 1.00 | 0.773333 |
| ... | ... | ... | ... | ... | ... | ... | ... |
| 2005-01-31 23:00:00 | 2.00 | 48.0 | 0 | 0001 | 48.0 | 2.00 | 0.956431 |
| 2005-01-31 23:00:00 | 1.75 | 48.0 | 0 | 0001 | 48.0 | 1.75 | 0.952222 |
| 2005-01-31 23:00:00 | 1.50 | 48.0 | 0 | 0001 | 48.0 | 1.50 | 0.948045 |
| 2005-01-31 23:00:00 | 1.25 | 48.0 | 0 | 0001 | 48.0 | 1.25 | 0.943549 |


To output the data to a csv for separate analysis:

In [None]:
df_wind.to_csv("era5_wind_data.csv")
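
As one example of separate analysis, the exported file can be read back with the standard library alone. The sketch below averages the `wind` column across grid cells for each timestamp. It uses a few sample rows copied from the dataframe shown above in place of the real file, and it assumes the CSV header is a single row containing the index names followed by the column names.

```python
import csv
import io
from collections import defaultdict
from statistics import mean

# Sample rows copied from the dataframe shown above; in practice, open
# "era5_wind_data.csv" instead of this in-memory string.
sample_csv = """time,y,x,number,expver,lon,lat,wind
2005-01-01 00:00:00,2.0,48.0,0,0001,48.0,2.0,0.813830
2005-01-01 00:00:00,1.75,48.0,0,0001,48.0,1.75,0.810174
2005-01-01 00:00:00,1.5,48.0,0,0001,48.0,1.5,0.806619
"""

# Group the wind values by timestamp.
values_by_time = defaultdict(list)
for row in csv.DictReader(io.StringIO(sample_csv)):
    values_by_time[row["time"]].append(float(row["wind"]))

# Average across grid cells for each hour.
mean_by_time = {time: mean(values) for time, values in values_by_time.items()}
print(mean_by_time)
```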

### Wind Speed Time-series
Extract wind speeds (in m/s) at a given height.

In [22]:
ds_windspd = cutout.windspd(turbine="Vestas_V66_1750kW")

Going over the parameters:

* `cutout` - **Cutout** -  A cutout created by `geodata.Cutout()`
* `**params` - Must include one of the following:
    - `turbine` - **string or dict** - Name of a turbine known by the reatlas client, or a turbine configuration dictionary with the key `'hub_height'` for the hub height and the keys `'V'` and `'POW'` defining the power curve.  For a full list of currently supported turbines, see [the list of turbines here.](https://github.com/east-winds/geodata/tree/master/geodata/resources/windturbine)
    - `hub_height` - **num** - Extrapolation height (m)
    
*Note* - 
You can also specify all of the general conversion arguments documented in the `convert_and_aggregate` function (e.g. `var_height='lml'`).
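
To give a feel for what extrapolating to a hub height involves, the sketch below applies the logarithmic wind profile, one common way to scale a wind speed between heights. Whether **geodata** uses exactly this formula is an assumption, and both the function `extrapolate_wind_speed` and the input values are hypothetical.

```python
import math


def extrapolate_wind_speed(speed_ref, height_ref, hub_height, roughness):
    """Scale a wind speed from height_ref to hub_height with the log law.

    All heights and the roughness length are in metres.
    """
    if min(height_ref, hub_height) <= roughness:
        raise ValueError("heights must exceed the roughness length")
    factor = math.log(hub_height / roughness) / math.log(height_ref / roughness)
    return speed_ref * factor


# Hypothetical inputs: 6 m/s at 10 m, roughness 0.1 m, hub at 100 m.
print(extrapolate_wind_speed(6.0, 10.0, 100.0, 0.1))
```

With these inputs the scaling factor is ln(1000) / ln(100) = 1.5, so the speed at hub height is roughly 9 m/s.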

The convert function returns an xarray dataset, which is an in-memory representation of a NetCDF file.

In [23]:
ds_windspd

To convert this array to a more conventional dataframe, run:

In [None]:
df_windspd = ds_windspd.to_dataframe(name="windspd")

which converts the xarray dataset into a pandas dataframe:

In [None]:
df_windspd

To output the data to a csv for separate analysis:

In [None]:
df_windspd.to_csv("era_windspd_data.csv")

### Solar Photovoltaic Generation Time-series

Convert downward-shortwave radiation flux, upward-shortwave radiation flux, and ambient temperature into a PV generation time-series.


In [24]:
ds_pv = cutout.pv(panel="KANEKA", orientation="latitude_optimal")

Going over the parameters:

* `cutout` - **Cutout** -  A cutout created by `geodata.Cutout()`
* `panel` - **string** - Specify a solar panel type on which to base the calculation.  **geodata** contains an internal solar panel dictionary with keys defining several solar panel characteristics used for the time-series calculation.  For a complete list of included panel types, see [the list of panel types here.](https://github.com/east-winds/geodata/tree/master/geodata/resources/solarpanel)
* `orientation` - **string, dict or callback** - Panel orientation can be `latitude_optimal`, a constant orientation such as `{'slope': 0.0,'azimuth': 0.0}`, or a callback function with the same signature as the callbacks generated by the `geodata.pv.orientation.make_*` functions.
* (optional) `clearsky_model` - **string or None** - Either the `simple` or the `enhanced` Reindl clearsky model. The default of `None` chooses between them depending on data availability, since the `enhanced` model also incorporates ambient air temperature and relative humidity.
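
The default behavior of `clearsky_model` can be pictured as a simple availability check. The sketch below is not taken from **geodata**. The function `choose_clearsky_model` and the variable names `temperature` and `humidity` are hypothetical placeholders for the inputs the `enhanced` model needs.

```python
def choose_clearsky_model(available_variables, clearsky_model=None):
    """Pick 'simple' or 'enhanced', honouring an explicit choice if given."""
    if clearsky_model is not None:
        if clearsky_model not in ("simple", "enhanced"):
            raise ValueError(f"unknown clearsky model: {clearsky_model!r}")
        return clearsky_model
    # Hypothetical variable names standing in for the inputs the
    # enhanced model needs.
    needed = {"temperature", "humidity"}
    return "enhanced" if needed <= set(available_variables) else "simple"


print(choose_clearsky_model(["influx", "temperature"]))              # simple
print(choose_clearsky_model(["influx", "temperature", "humidity"]))  # enhanced
```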


The convert function returns an xarray dataset, which is an in-memory representation of a NetCDF file.

In [None]:
ds_pv

To convert this array to a more conventional dataframe, run:

In [None]:
df_pv = ds_pv.to_dataframe(name="pv")

which converts the xarray dataset into a pandas dataframe:

In [None]:
df_pv

To output the data to a csv for separate analysis:

In [None]:
df_pv.to_csv("era_pv_data.csv")