
# <b>Tutorial 2: Data Preparation and Visualisation</b>



## Learning Objectives:

In this session we will learn: 
1. How to perform further cube operations
2. How to prepare data for analysis
3. How to visualise data

<table>
 
  <tr>
    <td><img src="images/global_airtemp_cp.png" width=400 height=250></td>
    <td><img src="images/global_airtemp_ts.png" width=400 height=250></td>
    
  </tr>
</table>

## Contents

1. [Constraint and cube extraction](#extract)
2. [Basic cube calculations](#calc)
3. [Time series and spatial plots](#plots)
4. [Saving the cube](#save)
5. [Exercises](#exercise)

<div class="alert alert-block alert-warning">
<b>Prerequisites</b> <br> 
- Basic programming skills in Python<br>
- Familiarity with the Python libraries Iris, NumPy and Matplotlib<br>
- Basic understanding of climate data<br>
</div>

___

## 1. Constraint and cube extraction<a id='extract'></a>

### 1.1 Import libraries.
Import the necessary libraries. The current datasets are in Zarr format, so we need the zarr and xarray libraries to access the data.

In [None]:
import numpy as np
import xarray as xr
import zarr
import iris
import os
from scripts.xarray_iris_coord_system import XarrayIrisCoordSystem as xics
xi = xics()
xr.set_options(display_style='text') # Work around for AML bug that won't display HTML output.

### 1.2 Set up authentication for the Azure blob store

The data for this course is held online in an Azure Blob Storage Service. To access this we use a SAS (shared access signature).  You should have been given the credentials for this service before the course, but if not please ask your instructor. We use the getpass module here to avoid putting the token into the public domain. Run the cell below and in the box enter your SAS and press return. This will store the password in the variable SAS.

In [None]:
import getpass
# SAS WITHOUT leading '?'
SAS = getpass.getpass()

In [None]:
store = zarr.ABSStore(container='metoffice-20cr-ds', prefix='monthly/', account_name="metdatasa", blob_service_kwargs={"sas_token":SAS})
type(store)

### 1.3 Read monthly data
A Dataset consists of coordinates and data variables. Let's use xarray's **open_zarr()** function to read all our zarr data into a dataset object and display its metadata.

In [None]:
# use the open_zarr() method to read in the whole dataset metadata
dataset = xr.open_zarr(store)
# print out the metadata
dataset

Convert the dataset into an Iris cubelist.

In [None]:
# create an empty list to hold the iris cubes
cubelist = iris.cube.CubeList([])
# use Dataset.apply() to convert each variable in the dataset to a cube and add it to the Iris CubeList
dataset.apply(lambda da: cubelist.append(xi.to_iris(da)))
# print out the cubelist
cubelist

The cubelist printed above holds all of the data from the Zarr file in a list. To see more detail on each of the cubes in the list click on it. That shows a table with information about the name and units of the cube, its shape and coordinates.

We will see in the next section how to obtain a single cube for use in our analysis and visualisation.

---

### 1.4 Indexing the cube
**AIM:** Extract the ***cloud_area_fraction*** data and index it by a subset of latitude and longitude values

<div class="alert alert-block alert-info">
<b>Note:</b> Cubes can be indexed in a similar manner to that of NumPy arrays. The result of indexing a cube is always a copy of the cube.<br>
    
For more information on array indexing in NumPy, see <a href="https://numpy.org/doc/stable/reference/arrays.indexing.html">Indexing</a> in the NumPy documentation.
    
</div>

In [None]:
# extract the variable from the cubelist
caf = cubelist.extract_strict('cloud_area_fraction')
caf

In [None]:
# subsetting the lat/lon values by indexing the first 10 values
subset_caf = caf[..., :10, :10]
subset_caf

In [None]:
# subsetting the cube to lat/lon indices 50 to 99 at time index 10
subset_caf = caf[10, 50:100, 50:100]
subset_caf

<div class="alert alert-block alert-info">
<b>Note:</b> The extract above returns a 2 dimensional cube with latitude/longitude at a single time. Note that time is now a scalar (a single time: 1851-11-16 00:00:00)
    
</div>
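
Because cube indexing follows NumPy conventions, the slicing patterns used in this section can be tried out on a plain array first. The sketch below uses a made-up array shape as a stand-in for a (time, latitude, longitude) cube; it only illustrates how each pattern changes the shape and does not touch the course data.

```python
import numpy as np

# Hypothetical stand-in for a (time, latitude, longitude) cube; the sizes are made up.
data = np.zeros((24, 120, 150))

first_ten_latlon = data[..., :10, :10]      # all times, first 10 lat/lon points
single_time_box = data[10, 50:100, 50:100]  # one time step, lat/lon indices 50..99
first_ten_times = data[:10]                 # first 10 time steps, full grid

print(first_ten_latlon.shape)  # (24, 10, 10)
print(single_time_box.shape)   # (50, 50)
print(first_ten_times.shape)   # (10, 120, 150)
```

Indexing with a single integer removes that dimension, which is why a cube indexed this way reports time as a scalar coordinate.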

In [None]:
# Extracting first 10 elements from time dimension
subset_caf = caf[:10]
subset_caf

### 1.5 Time constraint
**AIM:** Use constraint and extract methods to subset a cube or cubelist.

The monthly data ranges from 1850 to 2000. Sometimes we do not need the whole time series and are only interested in part of it, for example the 50 years from 1950 to 2000.
In such cases, we can extract a sub-cube by creating a time constraint.
Let's extract the "air_pressure_at_sea_level" cube and then use a time constraint to extract the data from 1950 to 2000.

<div class="alert alert-block alert-info">
<b>Note:</b> We have already used the <b>extract_strict</b> method to extract a specific cube from a cubelist. We can also apply constraints to a single cube (or a CubeList) by building an <b>iris.Constraint</b> and passing it to the <b>extract</b> method.
 
Iris's <b>constraint</b> mechanism provides a powerful way to filter a subset of data from a larger collection. The Constraint constructor takes arbitrary keywords to constrain coordinate values.
    
<b>extract_strict</b> returns a single cube, while <b>extract</b> on a cubelist returns a cubelist. If you use extract_strict and the number of matching cubes is anything other than one, an error is raised.
    
</div>
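
The difference between the two behaviours described in the note can be sketched in plain Python. The functions `extract_all` and `extract_one` and the dictionary "cubes" below are hypothetical stand-ins written for this illustration; they are not the Iris implementation.

```python
def extract_all(cubes, name):
    """Return every item whose name matches (possibly an empty list)."""
    return [cube for cube in cubes if cube["name"] == name]


def extract_one(cubes, name):
    """Return the single matching item, or raise if there is not exactly one."""
    matches = extract_all(cubes, name)
    if len(matches) != 1:
        raise ValueError(f"expected exactly 1 match for {name!r}, got {len(matches)}")
    return matches[0]


# Hypothetical stand-ins for cubes: plain dicts with a name and a cell method.
cubes = [
    {"name": "air_temperature", "method": "mean"},
    {"name": "air_temperature", "method": "maximum"},
    {"name": "surface_temperature", "method": "mean"},
]

print(len(extract_all(cubes, "air_temperature")))   # 2
print(extract_one(cubes, "surface_temperature"))    # the single matching item

try:
    extract_one(cubes, "air_temperature")           # two matches, so this raises
except ValueError as err:
    print("error:", err)
```

The strict form is useful when later code assumes it has exactly one cube, because a missing or ambiguous match fails immediately rather than further down the analysis.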

In [None]:
# Extracting the air pressure at sea level cube from the cubelist
air_pres = cubelist.extract_strict('air_pressure_at_sea_level')

In [None]:
# Extracting from year 1950 to 2000
start_time = 1950
end_time = 2000
time_constraint = iris.Constraint(time=lambda cell: start_time <= cell.point.year <= end_time)
subcube = air_pres.extract(time_constraint)

To check that we have the right cube, we can print the start date and end date of the subcube.

In [None]:
tcoord = subcube.coord('time')
units = tcoord.units
tdata = [units.num2date(point) for point in tcoord.points]
print('Start time: ',tdata[0])
print('End time:   ',tdata[-1])

<div class="alert alert-block alert-info">
<b>Note:</b> It is common to want to build a constraint on time.
This can be achieved by comparing cells containing datetimes.

There are a few different approaches for producing time constraints in Iris. We focus here on one of them.

This approach allows us to access individual components of cell datetime objects (such as the year or month) and run comparisons on those.
    
</div>
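
The comparison inside a time constraint is an ordinary Python predicate, so its logic can be checked without any cube. In the sketch below, `Cell` is a hypothetical stand-in that models only the `point` attribute of a coordinate cell, and the list of mid-month dates is made up for the example.

```python
from collections import namedtuple
from datetime import datetime

# Hypothetical stand-in for a coordinate cell: only the `point` attribute is modelled.
Cell = namedtuple("Cell", ["point"])

# Made-up mid-month time points for 1948 through 2001.
cells = [Cell(datetime(year, month, 16))
         for year in range(1948, 2002)
         for month in range(1, 13)]

start_time, end_time = 1950, 2000


def in_period(cell):
    """True when the cell's year lies in the inclusive range start_time..end_time."""
    return start_time <= cell.point.year <= end_time


selected = [cell for cell in cells if in_period(cell)]

print(selected[0].point)   # 1950-01-16 00:00:00
print(selected[-1].point)  # 2000-12-16 00:00:00
print(len(selected))       # 612 (51 years x 12 months)
```

Both ends of the chained comparison are inclusive, so the selection covers 51 calendar years. The same pattern works for any other datetime component, such as `cell.point.month`.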

Similar to constraining years, we can also constrain months and days.

Consider a case where we want only a few months, such as March, April and May, from our subcube.

In [None]:
# extracting the months March, April and May from every year in the subcube
month_constraint = iris.Constraint(time=lambda cell: cell.point.month in (3,4,5))
subcube.extract(month_constraint)

### 1.6 Extract region.

To make your analysis faster and easier, you can extract a smaller part of the model domain. In these examples we will work with Shanghai, but you can choose any region you want. 

<b>Note:</b> The original model data is on a rotated pole grid system, as shown in the diagrams below. The x and y coordinates are not true latitude and longitude, so to extract a latitude-longitude box we use the *extract_rot_cube()* function. It works by first calculating the true latitude and longitude of each grid cell and then using these to select the cells that lie in the area of interest.

<table>
 
  <tr>
    <td><img src="images/rotated_pole_1.png" width=400 height=250></td>
    <td><img src="images/rotated_pole_2.png" width=400 height=250></td>
    
  </tr>
</table>
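
The selection step described in the note can be sketched with NumPy. This is only an illustration of the idea, not the catnip implementation: the `true_lat` and `true_lon` arrays are hypothetical and use a regular grid so that the result is easy to check, whereas on a real rotated pole grid they would come from a coordinate transformation.

```python
import numpy as np

# Hypothetical true latitude/longitude of every grid cell (2-D arrays, in degrees).
true_lat, true_lon = np.meshgrid(np.arange(25.0, 36.0),
                                 np.arange(115.0, 127.0),
                                 indexing="ij")

min_lat, max_lat = 29.0, 32.0
min_lon, max_lon = 118.0, 123.0

# Boolean mask of the grid cells that fall inside the requested box.
inside = ((true_lat >= min_lat) & (true_lat <= max_lat) &
          (true_lon >= min_lon) & (true_lon <= max_lon))

# Smallest index range along each grid axis that contains every selected cell.
rows = np.where(inside.any(axis=1))[0]
cols = np.where(inside.any(axis=0))[0]
row_slice = slice(int(rows.min()), int(rows.max()) + 1)
col_slice = slice(int(cols.min()), int(cols.max()) + 1)

sub_lat = true_lat[row_slice, col_slice]
sub_lon = true_lon[row_slice, col_slice]
print(sub_lat.min(), sub_lat.max())  # 29.0 32.0
print(sub_lon.min(), sub_lon.max())  # 118.0 123.0
```

On a genuinely rotated grid the mask is not rectangular in index space, so a rectangular slice like this one would also include some cells just outside the requested box.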

In [None]:
# let's first print the values of lat/lon before extracting
print('latitude: [', air_pres.coord('grid_latitude').points.min(), ', ', air_pres.coord('grid_latitude').points.max(), ']')
print('longitude: [', air_pres.coord('grid_longitude').points.min(), ', ', air_pres.coord('grid_longitude').points.max(), ']')

Let's extract the Shanghai region using **extract_rot_cube**.

**extract_rot_cube** takes the latitude and longitude bounds of the region of interest and returns a smaller cube, still on rotated pole coordinates, covering that region.

First define the latitude and longitude bounds of the Shanghai region:

In [None]:
min_lat=29.0
max_lat=32.0
min_lon=118.0
max_lon=123.0

In [None]:
# load extract_rot_cube from catnip
from catnip.preparation import extract_rot_cube
ext_cube = extract_rot_cube(air_pres, min_lat, min_lon, max_lat, max_lon)
ext_cube

In [None]:
# we can see that the min/max boundaries have now changed
print('latitude: [', ext_cube.coord('grid_latitude').points.min(), ', ', ext_cube.coord('grid_latitude').points.max(), ']')
print('longitude: [', ext_cube.coord('grid_longitude').points.min(), ', ', ext_cube.coord('grid_longitude').points.max(), ']')

### 1.7 Constraint on cell methods and attributes

In our cubelist, we can see that we have four cubes named air_temperature: Minimum, Maximum and two Means (one with pressure level).
Let's try to extract air temperature and see what we get.


In [None]:
air_temp = cubelist.extract('air_temperature')
air_temp

To get only one cube, i.e. the time mean at the surface rather than on pressure levels, we need to constrain using the cell method. A [cell_method](https://cfconventions.org/Data/cf-conventions/cf-conventions-1.7/build/ch07s03.html) is a piece of metadata which describes additional characteristics of a field. Let's create a constraint and use it to extract the desired cube.

In [None]:
# constrain for the cube that does not have 'pressure' in its coordinate list
cube_cons_surf = iris.Constraint(cube_func=lambda c: 'pressure' not in [coord.name() for coord in c.coords()])
# also constrain to be only mean temperature 
cube_cons_mean = iris.Constraint(cube_func=lambda c: (len(c.cell_methods) > 0) and (c.cell_methods[0].method == 'mean'))
# now apply the above constraints
air_temp_mean = air_temp.extract_strict(cube_cons_surf & cube_cons_mean)
air_temp_mean                              

Now we have the desired cube. If we look at the minimum and maximum cubes, we see that they do not contain a cell method; instead, the information lies in their attributes.

We can extract the minimum cube, for example, by constraining on the attributes.

In [None]:
min_cons = iris.Constraint(cube_func=lambda c: ('ukmo__process_flags' in c.attributes) and (c.attributes['ukmo__process_flags'][0].split(' ')[0] == 'Minimum'))
air_temp_min = air_temp.extract_strict(min_cons)
air_temp_min                              

<div class="alert alert-block alert-success">
    <b>Task:</b><br><ul>
        <li>Extract the <b>relative humidity</b> cube from the cubelist with <b>years</b>: 1900-2000, <b>months</b>: May-September, <b>cell method</b>: mean (4 hours)</li>
    </ul>
</div>


In [None]:
# Extract relative humidity cube
# write your code here ..

___

## 2. Basic cube calculations<a id='calc'></a>

### 2.1 Calculating mean, max, min
In this section we will use the **iris.analysis** module to calculate basic mean, min and max values. But before we do this we need to understand two important concepts/techniques that are used in the analysis code to follow:
1. When we calculate area averages, we need to be able to calculate the area of each grid box, and for this we need to know the boundaries of each grid box. If the longitude and latitude bounds are not defined in the cube, we can guess the bounds from the coordinate point values; that is what the *guess_bounds()* method does in the code below.
2. Once we have our longitude and latitude bounds, we can use *iris.analysis.cartography.area_weights()* to compute a weighted mean over all grid boxes. The *area_weights()* function returns an array of area weights with the same dimensions as the cube, so that a larger cell has more weight in the average than a smaller one.
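
The two steps above can be sketched with NumPy alone. This is a minimal illustration under stated assumptions (a spherical Earth with an assumed radius, regularly spaced points, and hypothetical helper names `guess_bounds` and `cell_areas`); Iris's own implementation may differ in detail.

```python
import numpy as np

EARTH_RADIUS_M = 6371229.0  # assumed spherical Earth radius, used only in this sketch


def guess_bounds(points):
    """Guess cell edges halfway between neighbouring points (regular spacing assumed)."""
    points = np.asarray(points, dtype=float)
    mid = 0.5 * (points[:-1] + points[1:])
    first = points[0] - (mid[0] - points[0])
    last = points[-1] + (points[-1] - mid[-1])
    return np.concatenate([[first], mid, [last]])


def cell_areas(lat_points, lon_points):
    """Area of each latitude-longitude grid box on a sphere, in square metres."""
    lat_edges = np.radians(guess_bounds(lat_points))
    lon_edges = np.radians(guess_bounds(lon_points))
    band = np.sin(lat_edges[1:]) - np.sin(lat_edges[:-1])  # shrinks towards the poles
    width = lon_edges[1:] - lon_edges[:-1]
    return EARTH_RADIUS_M**2 * np.outer(band, width)


lats = np.array([0.0, 30.0, 60.0])
lons = np.array([0.0, 30.0, 60.0, 90.0])
weights = cell_areas(lats, lons)

# Hypothetical field: 10 along the equator row, 20 at 30N and 30 at 60N.
field = np.repeat(np.array([[10.0], [20.0], [30.0]]), lons.size, axis=1)

print(field.mean())                        # unweighted mean: 20.0
print(np.average(field, weights=weights))  # lower, because low-latitude boxes are larger
```

The weighted mean comes out below the unweighted one here because the grid boxes near the equator cover more area and therefore pull the average towards their value.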

Now let's extract the *surface_temperature* and calculate mean over the whole region. 

In [None]:
# extract surface_temperature
sft = cubelist.extract_strict('surface_temperature')

Using the **collapsed** method with **iris.analysis.MEAN** over grid_latitude and grid_longitude, we can get the time series of the mean over the whole domain.

In [None]:
import iris.analysis.cartography

#Since grid_latitude and grid_longitude were both point coordinates we must guess bound positions for them in order to calculate the area of the grid boxes
sft.coord('grid_latitude').guess_bounds()
sft.coord('grid_longitude').guess_bounds()

grid_areas = iris.analysis.cartography.area_weights(sft)

# calculating mean using area_weights method
sft_mean = sft.collapsed(['grid_longitude', 'grid_latitude'], iris.analysis.MEAN, weights=grid_areas)
sft_mean

<div class="alert alert-block alert-info">
<b>Note:</b> The cube above has been reduced to a single dimension, "time".

    
<br><b>iris.analysis</b> provides a range of statistical methods; see the <a href="https://scitools.org.uk/iris/docs/v1.9.0/html/iris/iris/analysis.html">iris.analysis documentation</a>.
    
<br>The <b>collapsed</b> method can be applied to one, several or all of a cube's dimensions.
</div>
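
Collapsing a dimension is analogous to reducing a NumPy array along an axis. The sketch below uses a small made-up (time, latitude, longitude) array to show how the result's shape depends on which dimensions are collapsed; it uses unweighted means for simplicity.

```python
import numpy as np

# Hypothetical (time, latitude, longitude) data: 4 time steps on a 2 x 3 grid.
data = np.arange(24, dtype=float).reshape(4, 2, 3)

time_series = data.mean(axis=(1, 2))  # collapse both spatial dimensions
time_mean = data.mean(axis=0)         # collapse time only
overall = data.mean()                 # collapse every dimension

print(time_series.shape)  # (4,)
print(time_mean.shape)    # (2, 3)
print(overall)            # 11.5
```

Collapsing the spatial dimensions leaves a time series, collapsing time leaves a map, and collapsing everything leaves a single number.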

### 2.2 Basic arithmetic operations

Basic arithmetic operations like addition, subtraction, multiplication, square root and power can be performed on Iris cubes.

Let's calculate the 10m wind speed using the **x_wind** and **y_wind** cubes.

In our cubelist, more than one cube shares the same name and cell method, so we constrain using coordinate information instead.

To calculate the 10m wind speed we need the data that is not on pressure levels.

In [None]:
# extract x_wind and y_wind
xcons = iris.Constraint(cube_func=lambda c: c.standard_name == 'x_wind' and ('pressure' not in [coord.name() for coord in c.coords()]))
ycons = iris.Constraint(cube_func=lambda c: c.standard_name == 'y_wind' and ('pressure' not in [coord.name() for coord in c.coords()]))

u = cubelist.extract_strict(xcons)
v = cubelist.extract_strict(ycons)

Let's create a windspeed cube by copying the u cube first

In [None]:
windspeed = u.copy()

Calculate windspeed:

In [None]:
import numpy as np
windspeed.data = np.sqrt(u.data**2 + v.data**2)
windspeed

We see that the cube name is still "x_wind", because we copied the u cube. We can rename it to "wind speed".

In [None]:
windspeed.rename("wind speed")
windspeed

<div class="alert alert-block alert-info">
<b>Note:</b> When performing arithmetic on cubes, check that the units, name and other metadata of the result are still correct.
</div>
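
As a quick check of the wind speed formula, the same calculation can be run on a few made-up values where the answer is known in advance. The component arrays below are hypothetical; squaring values in m s-1 gives m2 s-2, and the square root returns the result to m s-1, so the units of the original components still apply.

```python
import numpy as np

# Hypothetical wind components in m s-1 on a tiny 2 x 2 grid.
u = np.array([[3.0, 0.0], [-6.0, 5.0]])
v = np.array([[4.0, 2.0], [8.0, 12.0]])

# Wind speed is the magnitude of the (u, v) vector at each grid point.
speed = np.sqrt(u**2 + v**2)

print(speed)
# [[ 5.  2.]
#  [10. 13.]]
```

The name is a different matter: the result is no longer an eastward wind component, which is why a cube copied from the u cube needs renaming.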

___

## 3. Time series and spatial plots<a id='plots'></a>

### 3.1 Time series plots
We can use Iris quickplot to create time series plots. Let's load the necessary libraries first.


In [None]:
# we first need to load libraries for plotting 
import iris.plot as iplt
import iris.quickplot as qplt
import matplotlib.pyplot as plt

Let's plot the time series of mean surface temperature over the Shanghai region from 1950 to 2000.

In [None]:
# loading surface temperature
sft = cubelist.extract_strict('surface_temperature')
sft.coord_system()
sft

In [None]:
# Shanghai region coordinates 
min_lat=29.0
max_lat=32.0
min_lon=118.0
max_lon=123.0
# load extract_rot_cube from catnip
from catnip.preparation import extract_rot_cube
sft_shangai = extract_rot_cube(sft, min_lat, min_lon, max_lat, max_lon)
sft_shangai

In [None]:
# Now constrain over time
start_time = 1950
end_time = 2000
time_constraint = iris.Constraint(time=lambda cell: start_time <= cell.point.year <= end_time)
sft_tim = sft_shangai.extract(time_constraint)
sft_tim

In [None]:
# collapse latitude and longitude to get the mean over the region at each time step
timeseries = sft_tim.collapsed(['grid_latitude','grid_longitude'], iris.analysis.MEAN)
timeseries

Let's plot the time series using standard Matplotlib.


In [None]:
# plotting with matplotlib
plt.plot(timeseries.data)
plt.show()

We have the time series values. Now we can plot them using [**iris quickplot**](https://scitools.org.uk/iris/docs/latest/iris/iris/quickplot.html?highlight=quickplot).

In [None]:
# plotting with the quickplot 
qplt.plot(timeseries)
plt.show()

<div class="alert alert-block alert-info">
<b>Note:</b> <b>iris.quickplot</b> adds extra automatic labelling: axes are labelled with a coordinate name and units, and the plot title is taken from the cube name. With plain Matplotlib, by contrast, we need to add labels and a title manually.
</div>

### 3.2 Contour plots
We can also use Iris quickplot to create contour plots.

Let's plot the average surface temperature from 1950 to 2000 over the Shanghai region.

We can collapse the 'time' dimension of the sft_tim cube to get the time-mean temperature at each grid point.

In [None]:
spatial_mean = sft_tim.collapsed(['time'], iris.analysis.MEAN)

Now that we have the mean values we can make a spatial contour plot using the iris quickplot contourf method

In [None]:
# plot filled contours of the time-mean surface temperature
qplt.contourf(spatial_mean)
# add some coastlines for context
plt.gca().coastlines()
# set the figure size
plt.gcf().set_size_inches(8,12)
plt.show()

<div class="alert alert-block alert-info">
<b>Note:</b> <b>iris.quickplot</b> also adds the colorbar
</div>

<div class="alert alert-block alert-success">
    <b>Task:</b><br><ul>
        <li>Plot a time series of maximum air temperature from 1900 to 2000 for the summer season only (June, July and August)</li>
        <li>Plot a contour plot of maximum air temperature from 1900 to 2000 for the summer season only (June, July and August)</li>
    </ul>
</div>


In [None]:
# time series plot
# write your code here ..

In [None]:
# contour plot
# write your code here ..

___

## 4. Saving the cube<a id='save'></a>

### 4.1 Save the cube in zarr store
We can save our cube in a zarr store for later use.

To do this, we first convert the cube to an xarray DataArray and then save it to a zarr store.

Let's save the 'spatial_mean' cube from the section above.

In [None]:
# converting cube back to xarray
sft_mean = xr.DataArray.from_iris(spatial_mean)

# rename the DataArray (rename returns a new object, so assign the result)
sft_mean = sft_mean.rename('surface_temperature_mean')

# checking the chunk size of the xarray
sft_mean.chunks

In [None]:
# convert the xarray into dataset
sft_mean_ds = sft_mean.to_dataset()

# path where the zarr data will be stored
zarr_store = f"{os.environ['HOME']}/zstore"

# store the dataset at the specified path as a zarr data store
sft_mean_ds.to_zarr(zarr_store, consolidated=True, mode='w')

___

## 5. Exercises<a id='exercise'></a>

In this exercise we will analyse the mean precipitation rate from 1950 to 2000 over the Shanghai region.

### Exercise 1: Load monthly data

In [None]:
# write your code here ... 

### Exercise 2: Extract precipitation_flux

In [None]:
# write your code here ...

### Exercise 3: Calculate mean

In [None]:
# write your code here ...

### Exercise 4: Plot timeseries

In [None]:
# write your code here ...

### Exercise 5: Spatial plot over Shanghai

In [None]:
# write your code here ...

___

<div class="alert alert-block alert-success">
<b>Summary</b><br> 
    In this session we learned how:<br>
    <ul>
        <li>to prepare cubes for analysis</li>
        <li>to perform basic arithmetic operations</li>
        <li>to plot time series and contours</li>
        <li>to save data in zarr format</li>
    </ul>

</div>
