# GeoDN Course 2: Fundamentals of Geospatial Data and Modeling - Part 1 Geospatial Data Discovery for Climate Risk
> Copyright (c) 2024 International Business Machines Corporation

> This software is released under the MIT License.
> https://opensource.org/licenses/MIT

# Session 1 - Query and Analyse Rainfall
In this notebook you will learn how to:  

- (1) Query GeoDN data for an area of interest to retrieve rainfall data  
- (2) Plot and analyse the data to identify extreme rainfall events which may have caused surface water flooding  

This example uses the [CEH-GEAR](https://catalogue.ceh.ac.uk/documents/dbf13dd5-90cd-457a-a986-f2f9dd97e93c) precipitation data to demonstrate analysis of rainfall extremes.

### Prepare
Load imports, including the `geodn.discovery` module.

In [None]:
import os
from dotenv import load_dotenv
from geodn.discovery import discoveryv2
from itertools import product
import numpy as np
import pandas as pd
from datetime import datetime, timedelta
import json
import geopandas as gpd
import plotly.express as px
import plotly.graph_objects as go
import pickle
load_dotenv()

### Connect to GeoDN discovery

In [None]:
geodn_discovery = discoveryv2.DiscoveryV2()

***
# 1: Explore and query rainfall data from GeoDN
In this section, you will connect to GeoDN discovery and locate the dataset.

Use the `get_collections` method to return a list of Collection IDs.

Select "CEH gridded hourly rainfall for Great Britain". You will need to copy the string exactly and assign it to a variable such as `collection_id`.

Use the `describe_collection` method to return a description of the collection. This description includes information such as the license, the bands available in the dataset, and the temporal and spatial extent of the dataset.

In [None]:
geodn_discovery.get_collections()

In [None]:
collection_id = 'CEH gridded hourly rainfall for Great Britain'
geodn_discovery.describe_collection(collection_id)

Use the `describe_collection_dimensions` method to return a dictionary containing information on the bands and the temporal and spatial extent. We will assign this result to the `dimensions` variable, then display it.

In [None]:
dimensions = geodn_discovery.describe_collection_dimensions(collection_id)

In [None]:
dimensions

Define the time period, area of interest and variable for the query.

In [None]:
# Define the start and end time for the data query
start = "2007-01-01T11:00:00Z"
end = "2007-12-31T11:00:00Z"

# Define the bounding box for the data query
west = -0.48
south = 53.709
east = -0.22
north = 53.812

# Define the bands for the data query
bands = ["CEH rainfall for Great Britain"]

Plot the area to be queried with the `plot_with_bbox` function.

In [None]:
geodn_discovery.plot_with_bbox(west, south, east, north, zoom_start=11) 

Call `geodn_discovery.query` to run the query and save the results in a `data_cube` object. Here, we need to provide the name of the dataset in `collection_id`, the name of the variable in `bands`, and the time range and bounding box we want to query in `temporal_extent` and `spatial_extent` respectively.

If this function returns an error, refer back to the output of the `describe_collection` function above to confirm that the data you requested are available.
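
Before looking at the collection description, it can help to rule out simple mistakes in the query values themselves. The sketch below is not part of the GeoDN API: `check_query_extents` is a hypothetical helper, and it assumes the bounding box is given in degrees of longitude and latitude, as the values in this notebook suggest.

```python
from datetime import datetime


def check_query_extents(start, end, west, south, east, north):
    """Return a list of problems found in the query extents (empty if none).

    Hypothetical helper for illustration; assumes lon/lat degrees.
    """
    problems = []
    # Parse the ISO 8601 timestamps; "Z" is swapped for an explicit UTC offset
    # so that older Python versions can parse it too.
    t0 = datetime.fromisoformat(start.replace("Z", "+00:00"))
    t1 = datetime.fromisoformat(end.replace("Z", "+00:00"))
    if t0 >= t1:
        problems.append("start must be earlier than end")
    if not (-180 <= west < east <= 180):
        problems.append("west must be less than east, both within [-180, 180]")
    if not (-90 <= south < north <= 90):
        problems.append("south must be less than north, both within [-90, 90]")
    return problems


problems = check_query_extents(
    "2007-01-01T11:00:00Z", "2007-12-31T11:00:00Z",
    west=-0.48, south=53.709, east=-0.22, north=53.812,
)
print(problems or "extents look consistent")
```

This only checks internal consistency. Whether the requested period and area fall inside the dataset's coverage still has to be confirmed against the collection description.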

In [None]:
data_cube = geodn_discovery.query(
    collection_id = collection_id, 
    bands = bands, 
    temporal_extent = {"start": start, "end": end},
    spatial_extent = {"west": west, "south": south, "east": east, "north": north},
)

In [None]:
data_cube

***
# 2. Download, save and plot the data
In this section, you will download the data to a NetCDF file, then load it into an array. You'll then quickly plot the data to check it, before aggregating to create a time series across the whole area of interest.

In [None]:
filename = "CEH_hourly_rainfall.nc"

In [None]:
geodn_discovery.save(data_cube, filename, force = True)

Use the `open_datacube` function to load data from the file into an xarray object.

In [None]:
path = "data/" + filename
x_data = geodn_discovery.open_datacube(path)
x_data

First, aggregate in time and plot the data to check it. Here we take the maximum rainfall in mm per hour over the time period of interest, so we specify `time` as the dimension to aggregate over.

Note that, since this dataset is derived from rain gauge observations of rainfall, we see the polygonal structure of the rain gauge network in the dataset.

In [None]:
x_data.dims

In [None]:
max_value = x_data.max(dim = 'time')
max_value.plot()

Now aggregate the data spatially to create a time series, where the aggregated value corresponds to the maximum rainfall occurring in the area of interest.

Plot using Matplotlib.
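
The two aggregations in this section differ only in which dimensions are reduced. As a rough illustration, the sketch below uses a synthetic NumPy array as a stand-in for the data cube. The `(time, y, x)` axis order and the random values are assumptions for the example, not properties of the CEH-GEAR data.

```python
import numpy as np

rng = np.random.default_rng(seed=0)
# Synthetic hourly rainfall cube with shape (time, y, x); values in mm per hour.
cube = rng.gamma(shape=0.3, scale=2.0, size=(24, 4, 5))

# Reduce over time: one maximum per grid cell, which gives a map.
max_map = cube.max(axis=0)

# Reduce over space: one maximum per time step, which gives a time series.
max_series = cube.max(axis=(1, 2))

print(max_map.shape)     # (4, 5)
print(max_series.shape)  # (24,)
```

With xarray, naming the dimensions (`dim='time'` or `dim=["x", "y"]`) plays the role of the axis numbers here, so the code does not depend on the axis order.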

In [None]:
time_series = x_data.max(dim = ["x", "y"])
time_series.plot()

***
# 3. Find high rainfall events
In this section, you'll use the time series you created in the previous section to create a data frame, then analyse this to find the highest intensity rainfall events. You'll then plot these on a time series and as spatial maps.

First, convert the data to a `pandas` `DataFrame` for further analysis. Extract the time information, convert it to `datetime` values and add these as a column in the data frame.
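
The renaming step is easy to get wrong when a column label is a number rather than a string. The sketch below shows the pattern on a small synthetic `pandas` series. The values and the name `rainfall_mm_per_hour` are made up for illustration, and the integer label 51593 simply mirrors the one that appears in this notebook.

```python
import pandas as pd

# Synthetic hourly series; the integer name 51593 mimics a numeric column label.
times = pd.date_range("2007-06-15 00:00", periods=4, freq=pd.Timedelta(hours=1))
series = pd.Series([0.0, 1.2, 19.4, 3.1], index=times, name=51593)
series.index.name = "time"

# Move the time index into a regular column.
df = series.to_frame().reset_index()

# The column label is the integer 51593, so the rename key must be an integer too.
df = df.rename(columns={51593: "rainfall_mm_per_hour"})

# Keep a separate datetime column for sorting and plotting.
df["datetime"] = pd.to_datetime(df["time"])

print(df.columns.tolist())  # ['time', 'rainfall_mm_per_hour', 'datetime']
```

By default `rename` silently ignores keys that do not match any column, so passing the string `"51593"` instead of the integer would leave the column unchanged without raising an error.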

In [None]:
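# Convert the time series to a data frame, moving the time index into a column
time_series_df = time_series.to_dataframe().reset_index()
# Add a column of standard datetime values for sorting and plotting
time_series_df['datetime'] = time_series.indexes['time'].to_datetimeindex()
# The data column is labelled with the integer 51593 in this output; give it a readable name
time_series_df.rename({51593 : 'CEH rainfall for Great Britain'}, axis = 'columns', inplace=True) 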

In [None]:
time_series_df.head()

Locate the highest rainfall events by sorting the time series in descending order.

Find the top 10 events. You should find that these occur during June and September 2007. The highest rainfall recorded was 19.4 mm in one hour on 16 June 2007.
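
To check quickly which months the largest values fall in, the top rows can be counted by month. The sketch below runs on synthetic hourly data, so the months it prints say nothing about the real 2007 record. The column name `rainfall` and the random values are assumptions for the example. It uses `DataFrame.nlargest`, which gives the same rows as sorting in descending order and taking the head when there are no ties.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(seed=1)
# One synthetic year of hourly rainfall values in mm per hour.
times = pd.date_range("2007-01-01", periods=24 * 365, freq=pd.Timedelta(hours=1))
df = pd.DataFrame({"datetime": times, "rainfall": rng.gamma(0.2, 1.5, size=len(times))})

# Ten largest hourly values, ordered from highest to lowest.
top_n = df.nlargest(10, "rainfall")

# Count how many of those hours fall in each calendar month.
events_per_month = top_n["datetime"].dt.to_period("M").value_counts().sort_index()
print(events_per_month)
```

Several of the top hours may belong to the same storm, so a count per month (or per day) gives a better sense of how many distinct events there are than the raw list of ten rows.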

In [None]:
time_series_sorted_by_value = time_series_df.sort_values(by=['CEH rainfall for Great Britain'], ascending=False)
top_n_events = time_series_sorted_by_value.head(10)
display(top_n_events)

Now plot the time series and highlight these heavy rainfall events, using a `plotly` interactive line plot.

We need to sort the time series by date to plot correctly.

In [None]:
time_series_sorted = time_series_df.sort_values(by = ['datetime'])

fig_timeseries = px.line(time_series_sorted,
              x='datetime',
              y='CEH rainfall for Great Britain',
              )

fig_top_n = px.scatter(top_n_events,x='datetime',y='CEH rainfall for Great Britain',color_discrete_sequence=['red'])

fig_timeseries_and_top_n = go.Figure(data=fig_timeseries.data+fig_top_n.data) 
fig_timeseries_and_top_n.show()

Now plot thumbnail maps to show the top 10 rainfall events.
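
The plotting cells below crop each map with the four values in `figure_crop` before drawing it. The sketch shows the same slicing on a synthetic 2D array. Treating the first axis as rows (y) and the second as columns (x) is an assumption about the grid layout, and the 40 by 80 grid size is made up.

```python
import numpy as np

# Synthetic (y, x) map where each cell holds its own flat index.
grid = np.arange(40 * 80).reshape(40, 80)

# Same four values as figure_crop: row start/stop, then column start/stop.
row_start, row_stop, col_start, col_stop = [10, 30, 20, 65]
cropped = grid[row_start:row_stop, col_start:col_stop]

print(cropped.shape)  # (20, 45)
```

The stop indices are exclusive, so this window keeps 20 rows and 45 columns.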

In [None]:
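# Crop window applied to each map: [row_start, row_stop, col_start, col_stop] in grid cells
figure_crop = [10, 30, 20, 65]
figure_width = 500  # in pixels
figure_height = 300  # in pixels
# Find the times of the top events
dates_top_n = top_n_events['time'].values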

# find events in the datacube
x_top = x_data.sel(time = dates_top_n)

In [None]:
# Plot one thumbnail map per event, cropped to the figure_crop window
for i, eventtime in enumerate(x_top.time.to_numpy()):
    date_info = str(eventtime)
    map_data = x_top[i]
    fig_map = px.imshow(
        map_data[figure_crop[0]:figure_crop[1], figure_crop[2]:figure_crop[3]],
        width=figure_width,
        height=figure_height,
        title=date_info,
    )
    fig_map.show()

***
# Explore more
To extend your analysis, try comparing rainfall events in a different dataset, for example the 'Global weather (ERA5)' dataset you explored in the previous course. What do you notice about the rainfall extremes captured in this dataset? You could also look at a different location, or consider extremes of a different weather variable, such as temperature.

***
# What's next?

Continue to Notebooks 2 and 3 to explore flood maps and flood impact.