# GeoDN Course 1: Practical Guide to Geospatial Data #
> Copyright (c) 2024 International Business Machines Corporation

> This software is released under the MIT License.
> https://opensource.org/licenses/MIT

# Session 1 - Data exploration using GeoDN

In this notebook, you will learn how to:

- (1) Connect to GeoDN Discovery.
- (2) Explore the catalogue of data available on GeoDN.
- (3) Identify a collection to query.
- (4) Perform a query.
- (5) Download and plot the data.

The examples below use the [ECMWF Reanalysis v5 (ERA5)](https://www.ecmwf.int/en/forecasts/dataset/ecmwf-reanalysis-v5) dataset to demonstrate how to look at hourly precipitation data for a given location and time.

### Prepare
Load the `geodn.discovery` module.

In [1]:
from geodn.discovery import discoveryv2 as discovery

### Connect to GeoDN Discovery

Now that we have authenticated, we can connect to the GeoDN Discovery service. This will allow you to explore and query data in the GeoDN catalogue.

In [2]:
geodn_discovery = discovery.DiscoveryV2()

Now that we have connected to the GeoDN Discovery service, we can put it to use in the sections below.

***
# 1. Explore the catalogue of data available on GeoDN
In this section, you will learn how to explore the catalogue of data available on GeoDN. 

Now that you have successfully connected to GeoDN, use the `get_collections` function to list all the data collections available. The `get_collections` function returns a list of Collection IDs. These Collection IDs identify which data collection you would like to explore further.

In [3]:
geodn_discovery.get_collections()

***
# 2. Identify a collection to query
Choose a Collection ID from the list returned by `get_collections` as the dataset to explore further. In this notebook, we will select "Global weather (ERA5)". You will need to copy the string exactly and assign it to a variable such as `collection_id`.
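
Because the Collection ID has to match exactly, a small lookup helper can reduce copy-and-paste mistakes. The sketch below assumes that `get_collections` returns a plain list of strings, as described above. The helper name `find_collection_id` and the sample list are hypothetical and are not part of the GeoDN API.

```python
def find_collection_id(collection_ids, search_term):
    """Return the single collection ID that contains search_term (case-insensitive)."""
    term = search_term.lower()
    matches = [cid for cid in collection_ids if term in cid.lower()]
    if len(matches) != 1:
        raise ValueError(
            f"Expected exactly one match for {search_term!r}, found {len(matches)}: {matches}"
        )
    return matches[0]


# Made-up stand-in for the list returned by geodn_discovery.get_collections()
sample_ids = ["Global weather (ERA5)", "Example collection A", "Example collection B"]

collection_id = find_collection_id(sample_ids, "era5")
print(collection_id)
```

Raising an error on zero or multiple matches is a deliberate choice here: an ambiguous search term fails early instead of silently selecting the wrong collection.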

Now that a collection has been selected, we can start to explore the data in it. Let's start by using the `describe_collection` method to return a description of the collection. This description includes information such as the license, the bands available in the dataset and the temporal and spatial extent of the dataset.

We will use the information returned by `describe_collection` later on to identify a search region within the dataset to query. The `describe_collection` function takes a `collection_id` as a parameter.

In [22]:
collection_id = "Global weather (ERA5)"
geodn_discovery.describe_collection(collection_id)

Use the `describe_collection_dimensions` method to return a dictionary containing information on the bands, temporal and spatial extent. We will assign this result to the `dimensions` variable.

We can see that for ERA5, the band available is "Total precipitation", the temporal extent runs from January 1st 1993 (1993-01-01T00:00:00Z) to September 8th 2023 (2023-09-08T08:00:00Z), and the spatial extent is west -180$^\circ$, east 180$^\circ$, south -90$^\circ$ and north 90$^\circ$. Note that date and time expressions adhere to [RFC 3339](https://datatracker.ietf.org/doc/html/rfc3339#section-5.6).

In [5]:
dimensions = geodn_discovery.describe_collection_dimensions(collection_id)
print(dimensions)
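
To work with the temporal extent programmatically, the timestamps can be parsed into `datetime` objects. The sketch below uses only the Python standard library and the extent values quoted above. It handles only the `YYYY-MM-DDTHH:MM:SSZ` form used in this notebook, not every variant allowed by RFC 3339 (such as numeric offsets or fractional seconds). The helper names `parse_utc` and `within_temporal_extent` are illustrative and are not part of GeoDN.

```python
from datetime import datetime, timezone

RFC3339_UTC = "%Y-%m-%dT%H:%M:%SZ"


def parse_utc(timestamp):
    """Parse a 'YYYY-MM-DDTHH:MM:SSZ' string into a timezone-aware datetime."""
    return datetime.strptime(timestamp, RFC3339_UTC).replace(tzinfo=timezone.utc)


# Temporal extent quoted above for the ERA5 collection
extent_start = parse_utc("1993-01-01T00:00:00Z")
extent_end = parse_utc("2023-09-08T08:00:00Z")


def within_temporal_extent(start, end):
    """True if start <= end and both fall inside the collection's temporal extent."""
    query_start, query_end = parse_utc(start), parse_utc(end)
    return extent_start <= query_start <= query_end <= extent_end


print(within_temporal_extent("2007-01-01T11:00:00Z", "2007-12-31T11:00:00Z"))  # True
print(within_temporal_extent("1990-01-01T00:00:00Z", "1990-12-31T00:00:00Z"))  # False
```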

To better understand the spatial extent of the dataset, we can use the `plot_collection_extent` function to plot the collection extent. This helps you check whether the area you are interested in querying is covered by this particular dataset.

In the case of the ERA5 collection, the spatial extent is global, but this won't be the case for all collections. Some collections will have a smaller spatial extent, so you will need to bear this in mind when considering which geographical area to query.

In [10]:
geodn_discovery.plot_collection_extent(collection_id)

***
# 3. Perform a query
Now that you have identified a collection to query, let's select a query region, a time range and the bands we would like to query from the dataset.

From the `describe_collection` and `describe_collection_dimensions` functions, we know the extent of the data and the bands available. Ensure the query bounds lie within the dataset extent.

Define a start and end time as strings in the following format:


```python
start = "YYYY-MM-DDTHH:MM:SSZ"
end = "YYYY-MM-DDTHH:MM:SSZ"
```
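
If you already hold the start and end as `datetime` objects, they can be formatted into this string form instead of typed by hand. The following is a minimal sketch using only the standard library. The helper name `to_rfc3339_utc` is an illustrative assumption, and the dates are the ones used later in this notebook.

```python
from datetime import datetime, timezone


def to_rfc3339_utc(moment):
    """Format a timezone-aware datetime as 'YYYY-MM-DDTHH:MM:SSZ' in UTC."""
    if moment.tzinfo is None:
        raise ValueError("Expected a timezone-aware datetime")
    return moment.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


start_dt = datetime(2007, 1, 1, 11, 0, 0, tzinfo=timezone.utc)
end_dt = datetime(2007, 12, 31, 11, 0, 0, tzinfo=timezone.utc)

start = to_rfc3339_utc(start_dt)
end = to_rfc3339_utc(end_dt)
print(start, end)

# Length of the query window in whole hours (useful when the data is hourly)
hours = int((end_dt - start_dt).total_seconds() // 3600)
print(hours)
```

Rejecting naive datetimes avoids silently treating local time as UTC, which would shift the query window.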

Define a bounding box using latitude and longitude as floats in the following format:

```python
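west = -0.48    # longitude of the western edge, in decimal degrees (negative = west of Greenwich)
south = 53.709  # latitude of the southern edge, in decimal degrees (negative = south of the equator)
east = -0.22    # longitude of the eastern edge, in decimal degrees
north = 53.812  # latitude of the northern edge, in decimal degrees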
```

Define a list of bands in the following format:

```python
bands = ["Total precipitation"]
```

In [14]:
# Define the start and end time for the data query
start = "2007-01-01T11:00:00Z"
end = "2007-12-31T11:00:00Z"

# Define the bounding box for the data query
west = -0.48
south = 53.709
east = -0.22
north = 53.812

# Define the bands for the data query
bands = ["Total precipitation"]
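
Before running a query, it can be worth checking that the bounding box is well formed and lies within the collection extent. The sketch below is plain Python and does not call GeoDN. The helper names are hypothetical, the extent is the global one reported for ERA5, and boxes that cross the antimeridian (where `west > east`) are deliberately not handled.

```python
def bbox_is_valid(west, south, east, north):
    """Check that the bounds are ordered and are valid longitudes and latitudes."""
    return -180 <= west < east <= 180 and -90 <= south < north <= 90


def bbox_within_extent(bbox, extent):
    """Check that bbox lies entirely inside extent (both use west/south/east/north keys)."""
    return (
        extent["west"] <= bbox["west"]
        and bbox["east"] <= extent["east"]
        and extent["south"] <= bbox["south"]
        and bbox["north"] <= extent["north"]
    )


# Global extent reported for the ERA5 collection
era5_extent = {"west": -180.0, "south": -90.0, "east": 180.0, "north": 90.0}

# Bounding box used in this notebook
query_bbox = {"west": -0.48, "south": 53.709, "east": -0.22, "north": 53.812}

print(bbox_is_valid(**query_bbox))                  # True
print(bbox_within_extent(query_bbox, era5_extent))  # True
```

For a collection with a smaller spatial extent, replace `era5_extent` with the bounds reported by `describe_collection_dimensions`.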

Plot the area to be queried using the `plot_with_bbox` function.

In [15]:
geodn_discovery.plot_with_bbox(west, south, east, north, zoom_start=5)

Use the `query` function to perform the data query given the chosen parameters. Note that the `temporal_extent` argument expects a dictionary in the form:

```python
{"start": start, "end": end}
```

and that the `spatial_extent` argument expects a dictionary in the form:

```python
{"west": west, "south": south, "east": east, "north": north}
```

The `query` function will return a data cube if successful.

In [16]:
data_cube = geodn_discovery.query(
    collection_id = collection_id, 
    bands = bands, 
    temporal_extent = {"start": start, "end": end},
    spatial_extent = {"west": west, "south": south, "east": east, "north": north},
)

Now that a query has been performed, there are 94 different processes that can be applied to manipulate the data, such as `mask`, `aggregate_spatial`, `filter_bbox`, `min_time` and `max_time`.

Use the `list_processes` function to see all the different processes available.

In [17]:
geodn_discovery.list_processes()

Let's try the `max_time` process, which computes the largest value of an array of numbers (equivalently, the last element of the array when sorted in ascending order). As the name suggests, it is applied along the time dimension of the data cube.

In [18]:
data_cube_max = data_cube.max_time()
data_cube_max
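
To build intuition for what a maximum over time does, here is a toy illustration in plain Python. It assumes that `max_time` collapses the time axis by keeping the largest value seen at each grid cell. The numbers are made up, and the function `max_over_time` is a stand-in for illustration, not GeoDN's implementation.

```python
# Toy data cube: 3 time steps, each a 2 x 2 grid of values (made-up numbers)
cube = [
    [[0.0, 1.5], [0.2, 0.0]],  # time step 0
    [[2.1, 0.3], [0.0, 0.4]],  # time step 1
    [[0.7, 0.9], [1.1, 0.1]],  # time step 2
]


def max_over_time(time_slices):
    """Collapse the leading (time) axis by taking the per-cell maximum."""
    n_rows = len(time_slices[0])
    n_cols = len(time_slices[0][0])
    return [
        [max(grid[r][c] for grid in time_slices) for c in range(n_cols)]
        for r in range(n_rows)
    ]


result = max_over_time(cube)
print(result)  # [[2.1, 1.5], [1.1, 0.4]]
```

The result has the same spatial shape as a single time step, which is why the reduced data cube can be plotted as a single map.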

***
# 4. Download and plot the data
Save the data query results using the `save` function. The `save` function takes two required arguments, the data cube to be saved and the name of the file to save to, and one optional argument, a boolean `force` that indicates whether the file should be saved even if it already exists. By default, the results are saved to a directory called `data` in the current working directory. If this directory doesn't exist, it is created. If the `force` parameter is not set, or is set to `False`, the file will only be saved if it does not already exist. If it is set to `True`, the file will be created and saved each time.

Alternatively, it is possible to specify a particular directory to save the data cube to using the format:

```python
filename = "total_precip.nc"
data_dir = "my_dir/"
geodn_discovery.save(data_cube_max, filename, data_dir)
```

In [19]:
filename = "total_precip.nc"
geodn_discovery.save(data_cube_max, filename, force=True)
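
The `force` behaviour described above can be sketched in a few lines of standard-library Python. This is only an illustration of the logic (create the directory if missing, skip existing files unless `force` is set). The function `save_bytes` is hypothetical and says nothing about how GeoDN implements `save` internally.

```python
import tempfile
from pathlib import Path


def save_bytes(payload, filename, data_dir="data", force=False):
    """Write payload to data_dir/filename; return True if a file was written."""
    target_dir = Path(data_dir)
    target_dir.mkdir(parents=True, exist_ok=True)  # create the directory if missing
    target = target_dir / filename
    if target.exists() and not force:
        return False  # leave the existing file untouched
    target.write_bytes(payload)
    return True


# Demonstrate in a temporary directory so nothing is left behind
with tempfile.TemporaryDirectory() as tmp:
    print(save_bytes(b"first", "demo.nc", data_dir=tmp))               # True: new file
    print(save_bytes(b"second", "demo.nc", data_dir=tmp))              # False: already exists
    print(save_bytes(b"third", "demo.nc", data_dir=tmp, force=True))   # True: overwritten
```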

Use the `open_datacube` function to load the data cube from file into an xarray object.

In [20]:
path = "data/" + filename
x_data = geodn_discovery.open_datacube(path)
x_data

Finally, let's use the `plot` function to plot the data from file.

In [21]:
geodn_discovery.plot(path)

***
# 5. Explore more
To build on what you have learned here, try querying different datasets, time periods, bands and locations.


***
# 6. Up next
Next we will show you how to run a workflow.
