# Basic data access 
This notebook showcases helper functions from `climakitae` that let you access and export data from the AE catalog, perform spatial subsetting, and view the available data options. These functions can also be used in a standalone Python script. 

In [9]:
import climakitae as ck 
from climakitae.core.data_interface import (
    get_data_options, 
    get_subsetting_options, 
    get_data
)

## High-level details 
The AE data catalog has many different types of data. Our helper library `climakitae` attempts to make accessing and retrieving this data intuitive, and to simplify later climate and statistical analysis by performing some data transformations as the data is retrieved.<br><br> To retrieve the data, you'll need to choose a climate variable, data resolution, location settings, and several other options. There are also several high-level options you'll need to set when selecting your data, detailed below: 

### Data type: Gridded or Stations
**Gridded**: Gridded (i.e. raster) climate data at various spatial resolutions.<br><br>
**Stations**: Gridded (i.e. raster) climate data at unique grid cell(s) corresponding to the central coordinates of the selected weather station(s). 
- This data is bias-corrected (i.e., localized) to the exact location of the weather station using the historical in-situ data from the weather station(s). 
- This data is currently only available for dynamically downscaled air temperature data. 

### Scientific approach: Time or Warming Level
**Time**: Retrieve the data using a traditional time-based approach that allows you to select historical data, future projections, or both, along with a time-slice of interest. 
- “Historical Climate” includes data from 1980-2014 simulated from the same GCMs used to produce the Shared Socioeconomic Pathways (SSPs). It will be automatically appended to an SSP time series when both are selected. Because this historical data is obtained through simulations, it represents average weather during the historical period and is not meant to capture historical timeseries as they occurred.
- “Historical Reconstruction” provides a reference downscaled [reanalysis](https://www.ecmwf.int/en/about/media-centre/focus/2020/fact-sheet-reanalysis) dataset based on atmospheric models fit to satellite and station observations, and as a result will reflect observed historical time-evolution of the weather.
- Future projections are available for [greenhouse gas emission scenario (Shared Socioeconomic Pathway, or SSP)](https://climatescenarios.org/primer/socioeconomic-development) SSP 3-7.0 through 2100 with the dynamically-downscaled General Circulation Models (GCMs).
     - One GCM was additionally downscaled for two additional SSPs (SSP 5-8.5 and SSP 2-4.5)<br>

**Warming Level**: Retrieve the data by future global warming levels, which will automatically retrieve all available model data for the historical+future period and then calculate the time window around which each simulation reaches the selected warming level.  
- Because warming levels are defined based on amount of global mean temperature change, they can be used to compare possible outcomes across multiple scenarios or model simulations.
- This approach includes all simulations that reach a specified amount of warming, regardless of when they reach it. The time-based approach, by contrast, first subsets the simulations to those that follow a given SSP trajectory.
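
To make the idea of a "time window around a warming level" concrete, here is a minimal plain-Python sketch. It assumes the warming level is reached in the first year that a simulation's smoothed global-mean temperature anomaly meets the threshold, and that the window is centered on that year. The function name, the anomaly series, and this crossing rule are illustrative assumptions; the calculation inside `climakitae` may differ.

```python
def warming_level_window(years, smoothed_anomaly, level, half_width=15):
    """Return (start_year, end_year) of the window centered on the first year
    the smoothed anomaly reaches `level`, or None if it is never reached."""
    for year, anomaly in zip(years, smoothed_anomaly):
        if anomaly >= level:
            return (year - half_width, year + half_width - 1)
    return None


# Hypothetical, made-up anomaly series (degrees C) rising by 0.03 per year.
years = list(range(2000, 2101))
anomaly = [1.0 + 3 * (y - 2000) / 100 for y in years]

window_2c = warming_level_window(years, anomaly, level=2.0)
window_5c = warming_level_window(years, anomaly, level=5.0)

print(window_2c)  # 30-year window around the crossing year
print(window_5c)  # None, because this series never reaches 5.0
```

A simulation that never reaches the requested level has no window, which is why some simulation and warming level combinations can be empty.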
    
### Downscaling method: Dynamical, Statistical, or both
**Dynamical**: [Dynamically downscaled](https://dept.atmos.ucla.edu/alexhall/downscaling-cmip6) WRF data, produced at hourly intervals. If you select 'daily' or 'monthly' for 'Timescale', you will receive an average of the hourly data. The spatial resolution options, on the other hand, are each the output of a different simulation, nesting to higher resolution over smaller areas.<br><br>
**Statistical**: [Hybrid-statistically downscaled](https://loca.ucsd.edu) LOCA2-Hybrid data, available at daily and monthly timescales. Multiple LOCA2-Hybrid simulations are available (100+) at a fine spatial resolution of 3km.
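
To illustrate what "an average of the hourly data" means for the daily timescale, here is a minimal plain-Python sketch that groups made-up hourly values by calendar day and averages them. It is an illustration only: `climakitae` handles this step for you, and its implementation may differ.

```python
from collections import defaultdict
from datetime import datetime, timedelta
from statistics import mean


def daily_means(timestamps, values):
    """Average hourly values into one value per calendar day."""
    by_day = defaultdict(list)
    for ts, value in zip(timestamps, values):
        by_day[ts.date()].append(value)
    return {day: mean(vals) for day, vals in sorted(by_day.items())}


# Two days of made-up hourly temperatures (K): day 1 is constant,
# day 2 ramps from 280 K to 303 K in 1 K steps.
start = datetime(2000, 1, 1)
timestamps = [start + timedelta(hours=h) for h in range(48)]
values = [285.0] * 24 + [280.0 + h for h in range(24)]

result = daily_means(timestamps, values)
for day, value in result.items():
    print(day, value)
```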

## See the options in our data catalog in a table
This function returns a pandas DataFrame (i.e. a table) of our data options. You can also use the library `climakitaegui` to visualize these options in an interactive panel. See the notebook `interactive_data_access_and_viz.ipynb` to explore that approach. 

In [10]:
options_df = get_data_options()
options_df.head(10)


downscaling_method,scenario,timescale,variable,resolution
Statistical,Historical Climate,daily,Maximum relative humidity,3 km
Statistical,Historical Climate,daily,Minimum relative humidity,3 km
Statistical,Historical Climate,daily,Specific humidity at 2m,3 km
Statistical,Historical Climate,daily,Precipitation (total),3 km
Statistical,Historical Climate,daily,Shortwave flux at the surface,3 km
Statistical,Historical Climate,daily,Maximum air temperature at 2m,3 km
Statistical,Historical Climate,daily,Minimum air temperature at 2m,3 km
Statistical,Historical Climate,daily,West-East component of Wind at 10m,3 km
Statistical,Historical Climate,daily,North-South component of Wind at 10m,3 km
Statistical,Historical Climate,daily,Wind speed at 10m,3 km


In [11]:
options_df.to_csv('data_options.csv')

## See the data options for a particular subset of inputs
The `get_data_options` function accepts a number of arguments, corresponding to the columns in the table above, that subset the table. Inputting no arguments, like we did above, will return the entire range of options.<br><br>First, let's print the function documentation to see the inputs and outputs of the function. If an argument (or "parameter", as listed in the documentation) is listed as "optional", that means you don't have to input anything for that argument. In the case of this function, none of the function arguments are required, so you can simply call the function. 

In [12]:
print(get_data_options.__doc__)

Get data options, in the same format as the Select GUI, given a set of possible inputs.
    Allows the user to access the data using the same language as the GUI, bypassing the sometimes unintuitive naming in the catalog.
    If no function inputs are provided, the function returns the entire AE catalog that is available via the Select GUI

    Parameters
    ----------
    variable : str, optional
        Default to None
    downscaling_method : str, optional
        Default to None
    resolution : str, optional
        Default to None
    timescale : str, optional
        Default to None
    scenario : str or list, optional
        Default to None
    tidy : boolean, optional
        Format the pandas dataframe? This creates a DataFrame with a MultiIndex that makes it easier to parse the options.
        Default to True
    enable_hidden_vars : boolean, optional
        Return all variables, including the ones in which "show" is set to False?
        Default to False

    Returns
  

If you call the function with **no inputs**, it will simply return the entire catalog! But, let's say you want to see all the data options for statistically downscaled data at 3 km resolution. You'll want to provide inputs for the `downscaling_method` and `resolution` arguments. 

In [13]:
get_data_options(
    downscaling_method = "Statistical", 
    resolution = "3 km"
)

downscaling_method,scenario,timescale,variable,resolution
Statistical,Historical Climate,daily,Maximum relative humidity,3 km
Statistical,Historical Climate,daily,Minimum relative humidity,3 km
Statistical,Historical Climate,daily,Specific humidity at 2m,3 km
Statistical,Historical Climate,daily,Precipitation (total),3 km
Statistical,Historical Climate,daily,Shortwave flux at the surface,3 km
Statistical,...,...,...,...
Statistical,SSP 5-8.5,monthly,Maximum air temperature at 2m,3 km
Statistical,SSP 5-8.5,monthly,Minimum air temperature at 2m,3 km
Statistical,SSP 5-8.5,monthly,West-East component of Wind at 10m,3 km
Statistical,SSP 5-8.5,monthly,North-South component of Wind at 10m,3 km


Perhaps you want to see all the data options for daily precipitation. We have several precipitation options in the catalog. You don't need to know the exact names of these variables; simply use "precipitation" as your input to the function for the `variable` argument.<br><br>The function prefers that your inputs match an actual option in the catalog, with exact capitalization and no misspellings, and will print a warning if your input is not a direct match ("precipitation" is not an option, but "Precipitation (total)" is). The function will then try to guess what you actually meant. 

In [14]:
# get_data_options(
#     variable = "precipitation", 
#     timescale = "daily"
# )

The function can also return a simple pandas DataFrame without the complex MultiIndex. Just set `tidy = False`.

In [15]:
get_data_options(
    variable = "precipitation", 
    timescale = "daily", 
    tidy = False
) 

Input variable='precipitation' is not a valid option.
Closest options: 
- Maximum precipitation
- Precipitation (cumulus portion only)
- Precipitation (grid-scale portion only)
- Precipitation (total)
Outputting data for variable='Maximum precipitation'



Unnamed: 0,variable,downscaling_method,resolution,timescale,scenario
0,Maximum precipitation,Dynamical,45 km,daily,Historical Climate
1,Maximum precipitation,Dynamical,9 km,daily,Historical Climate
2,Maximum precipitation,Dynamical,3 km,daily,Historical Climate
3,Maximum precipitation,Dynamical,45 km,daily,SSP 2-4.5
4,Maximum precipitation,Dynamical,9 km,daily,SSP 2-4.5
5,Maximum precipitation,Dynamical,45 km,daily,SSP 3-7.0
6,Maximum precipitation,Dynamical,9 km,daily,SSP 3-7.0
7,Maximum precipitation,Dynamical,3 km,daily,SSP 3-7.0
8,Maximum precipitation,Dynamical,45 km,daily,SSP 5-8.5
9,Maximum precipitation,Dynamical,9 km,daily,SSP 5-8.5
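
The "closest options" behavior in the output above can be approximated with a simple rule. The sketch below assumes a case-insensitive substring match with alphabetical ordering, which happens to reproduce the order printed above; it is a guess at the behavior, not the matching logic `climakitae` actually uses. The function name and the short variable list are hypothetical.

```python
def closest_options(user_input, valid_options):
    """Return the valid options that contain `user_input` (case-insensitive),
    sorted alphabetically."""
    needle = user_input.lower()
    return sorted(opt for opt in valid_options if needle in opt.lower())


# A small, illustrative subset of variable names from the tables above.
valid_variables = [
    "Precipitation (total)",
    "Maximum precipitation",
    "Precipitation (cumulus portion only)",
    "Precipitation (grid-scale portion only)",
    "Wind speed at 10m",
]

matches = closest_options("precipitation", valid_variables)
chosen = matches[0] if matches else None

print(matches)
print(chosen)
```

Because the first match is used automatically, it is worth reading the printed message and passing the exact variable name if the guess is not the one you wanted.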


## See all the geometry options for spatially subsetting the data during retrieval
These options will match those in our AE selections GUI. This will enable you to retrieve a subset for a specific region.

In [16]:
subset_options_df = get_subsetting_options()
subset_options_df.head(10)

area_subset,cached_area,geometry
states,ID,"POLYGON ((-117.24269 44.39655, -117.23485 44.3..."
states,WA,"MULTIPOLYGON (((-122.57041 48.53786, -122.5686..."
states,NM,"POLYGON ((-109.05018 31.48001, -109.04985 31.4..."
states,CA,"MULTIPOLYGON (((-118.60443 33.47856, -118.5988..."
states,CO,"POLYGON ((-109.06026 38.59933, -109.05955 38.7..."
states,UT,"POLYGON ((-114.05297 37.59279, -114.05248 37.6..."
states,WY,"POLYGON ((-111.05457 45.00096, -111.04508 45.0..."
states,NV,"POLYGON ((-120.00576 39.22867, -120.00561 39.2..."
states,MT,"POLYGON ((-116.04915 48.50205, -116.04914 48.5..."
states,AZ,"POLYGON ((-114.81631 32.50804, -114.81433 32.5..."


This shows a lot of options! Say you're only interested in California counties. Simply set the argument `area_subset` to "CA counties" to see all the options for counties. The function documentation shows the other options, which also match the values in the column "area_subset" in the table above. 

In [17]:
print(get_subsetting_options.__doc__)

Get all geometry options for spatial subsetting.
    Options match those in selections GUI

    Parameters
    ----------
    area_subset : str
        One of "all", "states", "CA counties", "CA Electricity Demand Forecast Zones", "CA watersheds", "CA Electric Balancing Authority Areas", "CA Electric Load Serving Entities (IOU & POU)", "Stations"
        Defaults to "all", which shows all the geometry options with area_subset as a multiindex

    Returns
    -------
    geom_df : pd.DataFrame
        Geometry options
        Shows only options for one area_subset if input is provided that is not "all"
        i.e. if area_subset = "states", only the options for states will be returned

    


In [18]:
get_subsetting_options(area_subset = "CA counties")

cached_area,geometry
Alameda County,"POLYGON ((-122.37312 37.88388, -122.37378 37.8..."
Alpine County,"POLYGON ((-120.07333 38.70109, -120.07332 38.7..."
Amador County,"POLYGON ((-121.02771 38.50011, -121.02771 38.5..."
Butte County,"POLYGON ((-122.06943 39.84053, -122.06886 39.8..."
Calaveras County,"POLYGON ((-120.6318 38.34603, -120.6318 38.345..."
Colusa County,"POLYGON ((-121.91512 38.92535, -121.91491 38.9..."
Contra Costa County,"POLYGON ((-121.69732 37.78244, -121.69084 37.7..."
Del Norte County,"POLYGON ((-124.31611 41.72839, -124.3137 41.72..."
El Dorado County,"POLYGON ((-120.18443 39.03101, -120.18838 39.0..."
Fresno County,"POLYGON ((-119.57319 36.48884, -119.57305 36.4..."


You can see all the options for subsetting, and their corresponding geometries, but you don't actually need to use the geometries for subsetting if you use climakitae's data retrieval function, `get_data`, explained in the next section. 

## Retrieve data using the get_data() function
You can easily retrieve data from the Analytics Engine data catalog using climakitae's ```get_data``` function, described below. Additional details for each of the function arguments can be viewed in function docstrings in the next code cell. 

### Required inputs 
This function requires you to input values for the following arguments: 
- variable (required)
- resolution (required)
- timescale (required)

### Location subsetting 
The options for location subsetting can be found using the `get_subsetting_options()` function, as described earlier in this notebook. You can also opt to compute an area average by setting `area_average = "Yes"`. By default, `get_data()` returns the entire spatial domain, with no area averaging performed. 
- area_subset (optional) 
- cached_area (optional) 
- area_average (optional)
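
As a rough illustration of what an area average involves, the sketch below computes a mean over grid cells, skipping cells that fall outside the region (marked as NaN) and weighting each cell by the cosine of its latitude. Whether `area_average = "Yes"` applies this particular weighting is not stated in this notebook, so treat the weighting and the function name as assumptions.

```python
import math


def area_average(values, lats):
    """cos(latitude)-weighted mean over grid cells, skipping NaN cells."""
    weighted_sum = 0.0
    weight_total = 0.0
    for value, lat in zip(values, lats):
        if math.isnan(value):
            continue  # cell is outside the selected area
        weight = math.cos(math.radians(lat))
        weighted_sum += weight * value
        weight_total += weight
    return weighted_sum / weight_total if weight_total else math.nan


# Made-up values at three latitudes; the last cell is masked out.
avg = area_average([1.0, 3.0, math.nan], [34.0, 34.0, 35.0])
print(avg)
```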

### Additional options
Further modify the data returned using the following arguments.
- downscaling_method (optional)
- approach (optional) 
- scenario (optional)
- units (optional)
- time_slice (optional)
- warming_level (optional)
- warming_level_window (optional)
- warming_level_months (optional)

In [19]:
# See additional details about the function arguments by printing the docstring
print(get_data.__doc__)

Retrieve formatted data from the Analytics Engine data catalog using a simple function.
    Contrasts with DataParameters().retrieve(), which retrieves data from the user inputs in climakitaegui's selections GUI.

    Parameters
    ----------
    variable : str
        String name of climate variable
    resolution : str, one of ["3 km", "9 km", "45 km"]
        Resolution of data in kilometers
    timescale : str, one of ["hourly", "daily", "monthly"]
        Temporal frequency of dataset
    downscaling_method : str, one of ["Dynamical", "Statistical", "Dynamical+Statistical"], optional
        Downscaling method of the data:
        WRF ("Dynamical"), LOCA2 ("Statistical"), or both "Dynamical+Statistical"
        Default to "Dynamical"
    data_type : str, one of ["Gridded", "Stations"], optional
        Whether to choose gridded data or weather station data
        Default to "Gridded"
    approach : one of ["Time", "Warming Level"], optional
        Default to "Time"
    scenario

### Example 1: Time-based approach
Retrieve gridded data using a time-based approach. ```approach``` is an optional function argument, but the default is to use a time-based approach, so you don't actually need to set this argument. 

#### Example 1a
First, let's retrieve 3 kilometer resolution statistically downscaled historical data at a monthly timestep. 

In [20]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate"
    # approach = "Time" # Optional because "Time" is the function default 
)

!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!
!!! Returned data array is huge. Operations could take 10x to infinity longer than 1GB of data !!!
!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!!



Unnamed: 0,Array,Chunk
Bytes,112.56 GiB,255.69 MiB
Shape,"(1, 70, 780, 495, 559)","(1, 1, 308, 310, 351)"
Dask graph,840 chunks in 224 graph layers,840 chunks in 224 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
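
The size reported in the summary above follows directly from the array shape and data type. Here is a minimal sketch of that arithmetic, which can be handy for judging whether a retrieval will fit in memory before you load it. The helper name `array_size_bytes` is ours and is not part of `climakitae`.

```python
import math


def array_size_bytes(shape, itemsize):
    """Uncompressed size of a dense array: number of elements times bytes per element."""
    return math.prod(shape) * itemsize


# Shape from the summary above; float64 uses 8 bytes per value.
shape = (1, 70, 780, 495, 559)
size_gib = array_size_bytes(shape, itemsize=8) / 2**30

print(f"{size_gib:.2f} GiB")  # 112.56 GiB
```

Subsetting to a smaller area or computing an area average, as in the next example, shrinks the two spatial dimensions and therefore the total size.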


#### Example 1b
Now say you're only interested in this data for San Bernardino County, and you want to compute an area average over the entire county. 

In [21]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    scenario = "Historical Climate",
    
    # Modify location settings
    cached_area = "San Bernardino County", 
    area_average = "Yes"
)

Unnamed: 0,Array,Chunk
Bytes,426.56 kiB,2.41 kiB
Shape,"(1, 70, 780)","(1, 1, 308)"
Dask graph,210 chunks in 437 graph layers,210 chunks in 437 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


#### Example 1c 
Perhaps next you want to get dynamically downscaled (i.e. WRF) precipitation data instead. First, you might want to check what options you have for scenario, timescale, and resolution using the ```get_data_options()``` function. 

In [22]:
get_data_options(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical"
) 

downscaling_method,scenario,timescale,variable,resolution
Dynamical,Historical Climate,hourly,Precipitation (total),45 km
Dynamical,Historical Climate,hourly,Precipitation (total),9 km
Dynamical,Historical Climate,hourly,Precipitation (total),3 km
Dynamical,SSP 3-7.0,hourly,Precipitation (total),45 km
Dynamical,SSP 3-7.0,hourly,Precipitation (total),9 km
Dynamical,SSP 3-7.0,hourly,Precipitation (total),3 km
Dynamical,Historical Climate,daily,Precipitation (total),45 km
Dynamical,Historical Climate,daily,Precipitation (total),9 km
Dynamical,Historical Climate,daily,Precipitation (total),3 km
Dynamical,Historical Climate,monthly,Precipitation (total),45 km


Next, let's retrieve both the future and historical dynamically downscaled data. "Historical Climate" is the correct historical data option here; "Historical Reconstruction" data is from ERA5 (a climate reanalysis product, rather than a climate model), and cannot be retrieved with future data in the same function call. <br><br>You can set the ```scenario``` argument to retrieve the shared socioeconomic pathway data (future projections) appended to the historical data. You can also set your desired time period using the ```time_slice``` argument. 

In [23]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify time-based settings 
    time_slice = (2000,2050),
    scenario = [
        "Historical Climate", 
        "SSP 3-7.0", 
        "SSP 2-4.5",
        "SSP 5-8.5"
    ]
) 

-------
You have retrieved data for more than one SSP, but not all ensemble members for each GCM are available for all SSPs.

As a result, some scenario and simulation combinations may contain NaN values.

If you want to remove these empty simulations, it is recommended to first subset the data object by each individual scenario and then dropping NaN values.


Unnamed: 0,Array,Chunk
Bytes,2.75 MiB,83.45 kiB
Shape,"(3, 8, 612, 7, 7)","(1, 1, 436, 7, 7)"
Dask graph,48 chunks in 155 graph layers,48 chunks in 155 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray
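
The message above recommends subsetting by scenario and then dropping the empty simulations. The sketch below illustrates that logic on a small made-up dictionary in plain Python; the keys, simulation names, and values are hypothetical. On the actual xarray object, the equivalent would typically be a `.sel(...)` on the scenario coordinate followed by `.dropna(..., how="all")` along the simulation dimension, assuming those are the coordinate names in your data.

```python
import math

nan = math.nan

# Made-up values keyed by (scenario, simulation); each list is a short time series.
data = {
    ("SSP 3-7.0", "sim_a"): [1.0, 2.0],
    ("SSP 3-7.0", "sim_b"): [3.0, 4.0],
    ("SSP 2-4.5", "sim_a"): [1.5, 2.5],
    ("SSP 2-4.5", "sim_b"): [nan, nan],  # simulation not available for this SSP
}


def simulations_with_data(data, scenario):
    """Keep only the simulations of one scenario that are not entirely NaN."""
    return {
        sim: series
        for (scen, sim), series in data.items()
        if scen == scenario and not all(math.isnan(v) for v in series)
    }


print(list(simulations_with_data(data, "SSP 2-4.5")))
print(list(simulations_with_data(data, "SSP 3-7.0")))
```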


### Example 2: Warming levels approach 
By default, the function uses a time-based approach. To use a warming levels approach, set the argument ```approach = "Warming Level"```. 

#### Example 2a
Retrieve the same data as example 1c, using a warming levels approach instead of a time-based approach. Note that the ```scenario``` and ```time_slice``` arguments are invalid for a warming levels approach; if provided, they will be ignored by the function. 

In [24]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    
    # Modify your approach 
    approach = "Warming Level",
)

Unnamed: 0,Array,Chunk
Bytes,689.06 kiB,58.19 kiB
Shape,"(1, 360, 7, 7, 10)","(1, 304, 7, 7, 1)"
Dask graph,20 chunks in 193 graph layers,20 chunks in 193 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


#### Example 2b
The ```get_data()``` function uses a default warming levels window of +/- 15 years, resulting in a 30 year period. Let's modify that by setting ```warming_level_window = 10``` to retrieve a 20 year window.<br><br>We can also modify the warming levels computed to include additional warming levels beyond the default. Let's select a few more by setting ```warming_level = [2.5, 3.0, 4.0]```. 

In [25]:
get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    cached_area = "San Bernardino County", 
    approach = "Warming Level",
    
    # Modify warming level settings 
    warming_level_window = 10, 
    warming_level = [2.5, 3.0, 4.0]
)

-----------------------------------
There may be NaNs in your data for certain simulation/warming level combinations if the warming level is not reached for that particular simulation before the year 2100. 

This does not mean you have missing data, but rather a feature of how the data is combined in retrieval to return a single data object. 

If you want to remove these empty simulations, it is recommended to first subset the data object by each individual warming level and then dropping NaN values.


Unnamed: 0,Array,Chunk
Bytes,1.35 MiB,45.94 kiB
Shape,"(3, 240, 7, 7, 10)","(1, 240, 7, 7, 1)"
Dask graph,30 chunks in 245 graph layers,30 chunks in 245 graph layers
Data type,float32 numpy.ndarray,float32 numpy.ndarray


### Example 3: Weather station data 
By default, the function retrieves non-bias-corrected gridded data, but you can also retrieve dynamically downscaled climate data that has been bias-corrected using historical weather station data. This data is retrieved at the unique grid cell(s) corresponding to the selected weather station(s), using the `data_type` and `stations` arguments. If you don't set the `stations` argument, the function will return all available weather stations, a hefty data retrieval that takes a while to run and is therefore not recommended. 

```
data_type = "Stations" # Return bias-corrected gridded data at a station(s) of interest 
data_type = "Gridded" # Return gridded data (function default) 
```

As of now, you can only retrieve hourly data for the variable "Air Temperature at 2m". You can also choose the resolution of the gridded data used in bias correction by setting the `resolution` argument to either "3 km" or "9 km".

#### Example 3a

In [26]:
get_data(
    variable = "Air Temperature at 2m", # Required argument
    resolution = "9 km", # Required argument. Options: "9 km" or "3 km" 
    timescale = "hourly", # Required argument
    data_type = "Stations", # Required argument
    stations = "San Diego Lindbergh Field (KSAN)" # Optional argument. If no input, all weather stations are retrieved 
)



Unnamed: 0,Array,Chunk
Bytes,18.18 MiB,2.27 MiB
Shape,"(8, 1, 297840)","(1, 1, 297840)"
Dask graph,8 chunks in 84 graph layers,8 chunks in 84 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


To see all the available weather station options, you can use the `get_subsetting_options()` function detailed at the top of this notebook. Simply set the `area_subset` function argument to `"Stations"`. 

In [27]:
get_subsetting_options(area_subset="Stations") 

cached_area,geometry
Arcata Eureka Airport (KACV),POINT (-124.10479 40.97844)
Bakersfield Meadows Field (KBFL),POINT (-119.05524 35.43424)
Blythe Asos (KBLH),POINT (-114.71451 33.61876)
Burbank-Glendale-Pasadena Airport (KBUR),POINT (-118.36543 34.19966)
Desert Resorts Regional Airport (KTRM),POINT (-116.16412 33.63166)
Downtown Los Angeles USC Campus (KCQT),POINT (-118.291 34.024)
Fresno Yosemite International Airport (KFAT),POINT (-119.72016 36.77999)
Gillespie Field Airport (KSEE),POINT (-116.9725 32.82611)
Imperial County Airport (KIPL),POINT (-115.57656 32.83464)
Lancaster William J Fox Field (KWJF),POINT (-118.21255 34.74121)


#### Example 3b
To demonstrate the flexibility of this function, let's make a few changes to the function arguments in the code below: 
1) Retrieve more than one weather station. 
2) Change the resolution of the data used for bias-correction
3) Change the variable units (the function defaults to Kelvin, the native unit of the raw data)
4) Change the time slice to retrieve only 2000-2005 (the function defaults to the entire historical record: 1980-2014)

Note that this function will take more time to run since we're retrieving more than one station. 
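
For reference, the unit change in item 3 corresponds to the standard Kelvin-to-Fahrenheit conversion. The helper below only illustrates the formula; it is not the code `climakitae` runs when you set `units = "degF"`.

```python
def kelvin_to_fahrenheit(kelvin):
    """Convert a temperature from Kelvin to degrees Fahrenheit."""
    return (kelvin - 273.15) * 9 / 5 + 32


freezing_f = kelvin_to_fahrenheit(273.15)
boiling_f = kelvin_to_fahrenheit(373.15)
print(freezing_f, boiling_f)
```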

In [28]:
get_data(
    variable = "Air Temperature at 2m", 
    resolution = "3 km",
    timescale = "hourly",
    data_type = "Stations",
    stations = [
        "San Francisco International Airport (KSFO)", 
        "Oakland Metro International Airport (KOAK)", 
    ],
    units = "degF", 
    time_slice = (2000,2005) 
)



Unnamed: 0,Array,Chunk
Bytes,3.21 MiB,410.62 kiB
Shape,"(8, 1, 52560)","(1, 1, 52560)"
Dask graph,8 chunks in 91 graph layers,8 chunks in 91 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray

Unnamed: 0,Array,Chunk
Bytes,3.21 MiB,410.62 kiB
Shape,"(8, 1, 52560)","(1, 1, 52560)"
Dask graph,8 chunks in 91 graph layers,8 chunks in 91 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray
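
The length of the time dimension in the output above (52560) is consistent with `time_slice` including both end years and with the hourly data using a 365-day calendar. That reading is inferred from the output shape rather than taken from the `climakitae` documentation, so treat the sketch below as a sanity check under those assumptions.

```python
def expected_hourly_steps(time_slice, days_per_year=365):
    """Number of hourly time steps if both end years are included
    and every year has `days_per_year` days (assumption)."""
    start_year, end_year = time_slice
    n_years = end_year - start_year + 1
    return n_years * days_per_year * 24


steps = expected_hourly_steps((2000, 2005))
print(steps)  # 52560
```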


#### Example 3c 
You can also retrieve data for the future period by setting the `scenario` argument to one or more SSPs. The code below will retrieve data for 2000-2035 for the Sacramento Executive Airport (KSAC) station under Shared Socioeconomic Pathway 3-7.0. Since we are retrieving data in both the historical and future period, we need to set the scenario to `['Historical Climate', 'SSP 3-7.0']`.

In [29]:
get_data(
    variable = "Air Temperature at 2m", 
    resolution = "9 km",
    timescale = "hourly",
    data_type = "Stations",
    stations = "Sacramento Executive Airport (KSAC)",
    units = "degF", 
    time_slice = (2000,2035),
    scenario = ["Historical Climate", "SSP 3-7.0"]
)



Unnamed: 0,Array,Chunk
Bytes,19.25 MiB,2.41 MiB
Shape,"(8, 1, 315360)","(1, 1, 315360)"
Dask graph,8 chunks in 163 graph layers,8 chunks in 163 graph layers
Data type,float64 numpy.ndarray,float64 numpy.ndarray


## Exporting data

To save data as a file, call `export` and input your desired
1) data to export – an [xarray DataArray or Dataset](https://docs.xarray.dev/en/stable/user-guide/data-structures.html), as output by e.g. `get_data()` or `selections.retrieve()`
2) output file name (without file extension)
3) file format ("NetCDF", "Zarr", or "CSV")

We recommend NetCDF or Zarr, which suit data and outputs from the Analytics Engine well: they efficiently store large data containing multiple variables and dimensions. Metadata will be retained in these files.

NetCDF or Zarr can be exported locally (such as onto the JupyterHub user partition). Optionally, Zarr can be exported to an AWS S3 scratch bucket for storing very large exports.

CSV can also store Analytics Engine data with any number of variables and dimensions. It works the best for smaller data with fewer dimensions. The output file will be compressed to ensure efficient storage. Metadata will be preserved in a separate file.

CSV stores data in tabular format. Rows will be indexed by the index coordinate(s) of the DataArray or Dataset (e.g. scenario, simulation, time). Columns will be formed by the data variable(s) and non-index coordinate(s).
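
To picture the tabular layout described above, here is a minimal sketch that writes a few made-up values to CSV text using only the standard library. The index coordinates (scenario, simulation, time) become the leading columns and the data variable becomes the last column. The values and simulation names are hypothetical, and the file written by `ck.export` may differ in details such as column order and compression.

```python
import csv
import io

# Made-up values keyed by the index coordinates (scenario, simulation, time).
records = {
    ("Historical Climate", "sim_a", "2000-01"): 281.2,
    ("Historical Climate", "sim_a", "2000-02"): 282.9,
    ("Historical Climate", "sim_b", "2000-01"): 280.4,
}

buffer = io.StringIO()
writer = csv.writer(buffer)
# Header: index coordinates first, then the data variable.
writer.writerow(["scenario", "simulation", "time", "Air Temperature at 2m"])
for (scenario, simulation, time), value in records.items():
    writer.writerow([scenario, simulation, time, value])

csv_text = buffer.getvalue()
print(csv_text)
```

Every extra dimension multiplies the number of rows, which is why CSV works best for smaller data with fewer dimensions.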

In [30]:
# First, load some data into an xarray object using get_data 
data_to_download = get_data(
    variable = "Air Temperature at 2m", 
    downscaling_method = "Dynamical", 
    resolution = "45 km", 
    timescale = "monthly", 
    scenario = "Historical Climate",
    time_slice = (2000,2001)
)

# Next, export the data to a NetCDF file 
ck.export(data_to_download, filename="my_filename1", format="NetCDF") 

Exporting specified data to NetCDF...
Saving file locally as NetCDF4...
Saved! You can find your file in the panel to the left and download to your local machine from there.


Some additional file export formats are demonstrated below. The code has been commented out; simply remove the comment hash (#) at the beginning of a line to make that line executable. 

In [31]:
# ck.export(data_to_download, filename="my_filename2", format="Zarr") # Zarr export locally

# ck.export(data_to_download, filename="my_filename3", format="Zarr", mode="s3") # Zarr export to S3

# ck.export(data_to_download, filename="my_filename4", format="CSV") # CSV export locally

A Zarr store is technically a directory, not a single file, because it writes and reads data in chunks. This cloud-optimized format offers some unique benefits for computation in a cloud environment like the AE Jupyter Hub, but it can also make a store tricky to delete. For that reason, we have built a simple helper function for deleting Zarr stores. 

In [32]:
# ck.remove_zarr("my_filename2") # Helper function to delete Zarr directory tree
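
To illustrate why a directory-based store needs a recursive delete, here is a minimal standard-library sketch. It is not the implementation of `ck.remove_zarr`; the function name `remove_directory_store`, the `.zarr` suffix, and the mock directory layout are assumptions made for this example only.

```python
import shutil
import tempfile
from pathlib import Path


def remove_directory_store(path):
    """Delete a directory-based store and everything beneath it.

    Returns True if a directory was removed, False if none existed.
    """
    store_path = Path(path)
    if not store_path.is_dir():
        return False
    shutil.rmtree(store_path)  # removes nested chunk files and subdirectories
    return True


# Build a throwaway directory that loosely mimics a chunked store
# (hypothetical layout, for illustration only)
workdir = Path(tempfile.mkdtemp())
store = workdir / "my_filename2.zarr"
(store / "tas").mkdir(parents=True)
(store / "tas" / "0.0.0").write_bytes(b"\x00" * 16)

removed = remove_directory_store(store)
still_there = store.exists()
print(removed, still_there)

# Clean up the temporary working directory
shutil.rmtree(workdir)
```

A plain file delete would not work here because the store contains many nested chunk files, so the whole tree has to be removed.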

_______________

Attempt to download monthly Riverside County data to NetCDF. 

In [33]:
precip_riverside = get_data(
    variable = "Precipitation (total)", 
    downscaling_method = "Statistical", 
    resolution = "3 km", 
    timescale = "monthly", 
    cached_area = "Riverside County", 
    approach = "Warming Level"
)

In [None]:
precip_riverside_annual = precip_riverside.sel(warming_level=2).sum(dim='time_delta').astype(float)
precip_riverside_annual

| | Array | Chunk |
|---|---|---|
| Bytes | 2.15 MiB | 15.91 kiB |
| Shape | (21, 104, 129) | (21, 97, 1) |
| Dask graph | 258 chunks in 1474 graph layers | 258 chunks in 1474 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |



In [None]:
ck.export(precip_riverside_annual, filename="precip_riverside_annual", format="NetCDF") 

Exporting specified data to NetCDF...
Saving file locally as NetCDF4...


________________


In [None]:
precip_riverside[precip_riverside['warming_level'] == 2]

| | Array | Chunk |
|---|---|---|
| Bytes | 773.81 MiB | 190.97 kiB |
| Shape | (1, 360, 21, 104, 129) | (1, 12, 21, 97, 1) |
| Dask graph | 11868 chunks in 1467 graph layers | 11868 chunks in 1467 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [None]:
annual_precip = precip_riverside.groupby('centered_year').sum()

In [None]:
annual_sum = annual_precip.sel(warming_level=2).sum(dim='time_delta').astype(float)

In [None]:
for name, coord in annual_sum.coords.items():
    print(f"Coordinate: {name}, dtype: {coord.dtype}")

Coordinate: warming_level, dtype: float64
Coordinate: lat, dtype: float32
Coordinate: lon, dtype: float32
Coordinate: spatial_ref, dtype: int64
Coordinate: centered_year, dtype: int64


In [None]:
annual_sum

| | Array | Chunk |
|---|---|---|
| Bytes | 767.81 kiB | 15.91 kiB |
| Shape | (21, 104, 45) | (21, 97, 1) |
| Dask graph | 90 chunks in 1718 graph layers | 90 chunks in 1718 graph layers |
| Data type | float64 numpy.ndarray | float64 numpy.ndarray |


In [None]:
ck.export(annual_sum, filename="riverside_annual_sum", format="NetCDF") 

Exporting specified data to NetCDF...
Saving file locally with compression...


TypeError: No conversion path for dtype: dtype('<U4')
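
The dtype `<U4` in this error is NumPy's notation for a fixed-width unicode string of four characters. None of the coordinates printed above has that dtype, so the string may live elsewhere, for example in an attribute; the notebook does not establish the cause. The sketch below is one possible way to look for such entries. It uses only NumPy, the helper name `find_unicode_entries` is hypothetical, and the example values are made-up stand-ins rather than the actual contents of `annual_sum`.

```python
import numpy as np


def find_unicode_entries(named_values):
    """Return {name: dtype string} for entries that NumPy sees as unicode."""
    flagged = {}
    for name, value in named_values.items():
        dtype = np.asarray(value).dtype
        if dtype.kind == "U":  # "U" marks fixed-width unicode strings
            flagged[name] = str(dtype)
    return flagged


# Hypothetical stand-ins for coordinate values and attributes
example = {
    "centered_year": np.array([2030, 2031], dtype="int64"),
    "lat": np.array([33.5, 33.6], dtype="float32"),
    "units": "mm/d",  # a made-up 4-character string
    "simulation": np.array(["sim_a", "sim_b"]),
}

flagged = find_unicode_entries(example)
print(flagged)
```

For the array above, one could pass `{name: coord.values for name, coord in annual_sum.coords.items()}` and then `annual_sum.attrs` to the same helper to see which entries, if any, are stored as unicode strings.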