# Worksheet 1: File locations and pre-processing

The exercises in this worksheet demonstrate some of the tools available for data analysis, and how to prepare CORDEX output for analysis (pre-processing). This can be time consuming for large amounts of data, so in this worksheet a small subset of data is used to easily demonstrate the steps involved.

[CORDEX data](https://cordex.org/data-access/how-to-access-the-data/) is in [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) format (a standard format in climate science), so it can be read by analysis tools such as Python and the Python library [Iris](http://scitools.org.uk/iris/docs/latest/index.html).


<div class="alert alert-block alert-warning">
<b>By the end of this worksheet you should be able to:</b><br> 
- Identify and list the names of CORDEX output data in netCDF format using standard Linux commands.<br>
- Use basic Iris commands to load data files, and view Iris cubes. <br>
- Use Iris commands to merge netCDF files.<br>
- Take a subset of the data based on a date range.<br>
- Save the output as NetCDF files.
</div>


<div class="alert alert-block alert-info">
<b>Note:</b> In the boxes where there is code or where you are asked to type code, click in the box, then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the code. <br>
<b>Note:</b> A percent sign <code>%</code> is needed to run some commands on the shell. It is noted where this is needed.<br>
<b>Note:</b> A hash <code>#</code> denotes a comment; anything written after this character does not affect the command being run. <br>
</div>


## Contents

### [1.1: Data locations and file names](#1.1)

### [1.2: Getting started with Python and Iris](#1.2)

### [1.3: Merge Problems](#1.3)

### [1.4: Extracting data within a specific time range](#1.4)

### [1.5: Saving data to a new file](#1.5)


<a id='1.1'></a>

## 1.1 Data locations and file names

The datasets used within these worksheets are already linked to the notebook, to provide quick and easy access for this training. However, the commands learned in this worksheet provide useful context for future work in a Linux or Unix scripting environment.


**a)** Firstly, find out what location you are currently in by using the **`pwd`** command; **`pwd`** stands for **print working directory**.

In the cell below type **`%pwd`** on a new line and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
# Type %pwd below and press 'ctrl' + 'enter'


**b)** List the contents of this directory; **`ls`** stands for **list** and using the **`-l`** option gives a longer listing with more information, such as file size and modification date.

In the cell below type **`%ls`** on a separate line then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
# Type %ls and press 'ctrl' + 'enter'.


# Type %ls -l and press 'ctrl' + 'enter'.



---

<div class="alert alert-block alert-success">
<b>Question:</b> What is the difference between <code>ls</code> and <code>ls -l</code>?  What extra information do you see? Which file was edited most recently?
</div>


<b>Answer</b>: <br>
_...Double click here to type your answer..._

---


**c)** To avoid data conflicts when running locally, we will take a copy of the source files used in the training. (This step is not needed when running on the cloud, where the data is instead downloaded from an S3 bucket.) Run the command in the following cell. It might take a few minutes to complete.


In [None]:
import subprocess
subprocess.run(['rsync', '-r', '/project/ciid/projects/PRECIS/worksheets/data_v2', '.'])

If the command works correctly you should see the message:

`CompletedProcess(args=['rsync', '-r', '/project/ciid/projects/PRECIS/worksheets/data_v2', '.'], returncode=0)`
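
When a copy like this is run from a script, it can help to test the return code explicitly instead of reading the printed `CompletedProcess` object. The sketch below is only an illustration: it runs a harmless stand-in command (the Python interpreter itself) rather than `rsync`, so that it works on any machine, and `run_and_check` is a hypothetical helper name, not part of the worksheet.

```python
import subprocess
import sys


def run_and_check(command):
    """Run a command and return (returncode, stdout) without raising on failure."""
    result = subprocess.run(command, capture_output=True, text=True)
    return result.returncode, result.stdout.strip()


# Stand-in command; for the real copy you would pass the rsync argument list instead
code, output = run_and_check([sys.executable, "-c", "print('copy finished')"])
if code == 0:
    print("Command succeeded:", output)
else:
    print("Command failed with return code", code)
```

A return code of 0 conventionally means success; any other value indicates that the command reported a problem.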


**d)** Move to the directory (i.e. folder) called `data_v2/EAS-22`. This directory contains CORDEX data for the East Asia Domain.

**Hint:** The `cd` command stands for _change directory_

Type your command(s) below and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
# Type your %cd [directory-path] command below and press 'ctrl' + 'enter'.


# List the contents of this directory, using a previous command.



**d)** There are a lot of files in this directory! The file names contain information on the simulated date of the data they contain - you'll learn more about the naming convention for CORDEX data in another presentation.

For now, list only the files containing monthly temperature data using the following command:

Type **`%ls tas*mon*`** in the code block below.


In [None]:
# Type %ls tas*mon* and press 'ctrl' + 'enter'.



<div class="alert alert-block alert-info">
<b>Note:</b> The asterisk character <code>*</code> is a wildcard (matching file names this way is known as <i>globbing</i>); it matches any sequence of characters within the filename, including none.
</div>


**e)** This still returns too many files to comfortably count manually. **`wc`** stands for **word count**; combining this command with **`ls`** allows us to count the number of items in that directory.

In the cell below type **`%ls tas*mon* | wc -l`** then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
# Type the %ls tas*mon* | wc -l command below and press 'ctrl' + 'enter'



---

<div class="alert alert-block alert-success">
<b>Question:</b> How many nc files are in this directory, in total?
<br>How many of these nc files contain the string 'historical'; relating to the historical climate simulation? What command do you need to use to find this out?
</div>


<b>Answer</b>:
<br>_Total number of nc files:
<br>Number of historical nc files:
<br>Command used to find number of historical nc files:_

---


**f)** To list all the files containing monthly data from a period starting in 202101 (January 2021), use the pattern **`*mon_202101-??????.nc`**

Type below **`%ls *mon_202101-??????.nc`**


In [None]:
# Type %ls *mon_202101-??????.nc and press 'ctrl' + 'enter'.



<div class="alert alert-block alert-info">
<b>Note:</b> The question mark character <code>?</code> matches any single character
</div>
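
The same wildcard rules can be tried out in Python with the standard-library `fnmatch` module. The sketch below uses made-up file names that only loosely follow the CORDEX pattern, so the names and counts are illustrative and do not describe the real contents of the `EAS-22` directory.

```python
import fnmatch

# Hypothetical file names, loosely following the CORDEX naming pattern
file_names = [
    "tas_EAS-22_example_mon_202101-203012.nc",
    "tas_EAS-22_example_day_202101-202512.nc",
    "pr_EAS-22_example_mon_202101-203012.nc",
    "tas_EAS-22_example_mon_200601-201012.nc",
]

# '*' matches any run of characters (including none)
monthly_tas = fnmatch.filter(file_names, "tas*mon*")

# '?' matches exactly one character, so '??????' needs a six-character end date
from_2021 = fnmatch.filter(file_names, "*mon_202101-??????.nc")

print(len(monthly_tas), "monthly tas files")
print(from_2021)
```

Taking `len()` of a filtered list plays the same role as piping `ls` into `wc -l`.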


**g)** Now move up one level in the directory tree and list the directories.

Type `%cd ..` to move up one level in the directory tree and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to execute the command.


In [None]:
# Type %cd .. and press 'ctrl' + 'enter'.




---

<div class="alert alert-block alert-success">
<b>Question:</b> Which directory are you in now?  What else can you see in this directory?
</div>


<b>Answer</b>:<br>
_...Double click here to type your answer..._


---

<a id='1.2'></a>

## 1.2 Getting started with Python and Iris

<p><img src="img/python_and_iris.png" alt="python + iris logo" style="float: center; height: 100px;"/></p>

Python is a general purpose programming language. Python supports modules and packages, which encourages program modularity and code reuse.

We also use the Python library [Iris](http://scitools.org.uk/iris/docs/v2.4.0/index.html), which is written in Python and is maintained by the Met Office. Iris seeks to provide a powerful, easy to use, and community-driven Python library for analysing and visualising meteorological and oceanographic data sets.

The top level object in Iris is called a <b>cube</b>. A cube contains data and metadata about a phenomenon (e.g. air_temperature). Iris handles several different file formats, loading them into Iris cubes.

For a brief introduction to Iris and the cube formatting please read this Introduction page here:

http://scitools.org.uk/iris/docs/v2.4.0/userguide/iris_cubes.html

For future reference please refer to the Iris website:

http://scitools.org.uk/iris/docs/v2.4.0/index.html


**a)** First, run the code-blocks below to **load** a file into Iris and **print** its metadata structure. <br>

To run the code, click in the box below and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
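import os

# root_dir is the location of the copied training data; edit it to match your own setup
root_dir = "/home/h03/fris/code/PyPRECIS/notebooks/data_v2"
domain = "EAS-22"
# os.path.join inserts the separators; the empty final argument adds a trailing "/"
path_dir = os.path.join(root_dir, domain, "")

# the curly braces pass the value of the Python variable to the %cd magic
%cd {path_dir}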

In [None]:
# import the necessary modules
import iris
import glob
import os

# this is needed so that the plots are generated inline with the code instead of a separate window
%matplotlib inline 

# provide the path of your sample data
sample_data = path_dir + 'sample_data.nc'

# Load the file into a single Iris cube (load_cube raises an error if it finds more than one)
cube = iris.load_cube(sample_data)

# Print the Iris cube
print(cube)

---

<div class="alert alert-block alert-success">
<b>Question:</b> Can you explain how the sample data printed above relates to this picture?
<img src="img/multi_array_to_cube.png" alt="diagram of an Iris cube" style="height: 300px"/> <br>

- Is our data above a 3D or a 2D cube? <br>
- What are the cube dimensions? <br>
- How many grid boxes is the latitudinal range divided into? <br>
- What meteorological variable does this cube represent? <br>
- What unit is used for this variable? <br>
</div>


<b>Double click here to type your answer:</b>
<br>_Is this cube 3D or 2D?
<br>What are the cube dimensions?
<br>How many grid boxes is the latitudinal range divided into?
<br>What meteorological variable does this cube represent?
<br>What unit is used for this variable?_

---


**b)** Now **plot** the data for the selected variable: <br>

To run the code, click in the box below and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
import matplotlib.pyplot as plt
import iris.quickplot as qplt

plt.figure(figsize=(12,12))  # Set the figure size

# find the edges of the grid cells (also known as the bounds)
cube.coord('grid_latitude').guess_bounds()
cube.coord('grid_longitude').guess_bounds()

qplt.pcolormesh(cube)  # Plot the cube
plt.title(cube.name())  # Add plot title
plt.clim(0, 5e-4)  # Set colour bar range
plt.show()  # Show the plot

<div class="alert alert-block alert-success">
Sense check the data plotted:

- Can you make out sections of the East Asian coastline?
- How about the scale?

As we progress through these workbooks, we'll learn how to process the data into more intuitive units and mask / add country boundaries, so it's easier to understand the information.

</div>


<a id='1.3'></a>

## 1.3 Merge problems

**a)** When using data, we want a single cube for all fields with the same standard name and sequential timesteps. `iris.load` will return as few cubes as possible, by collecting cubes from multiple files together. However, on some occasions this merge process does not give a single cube when we would expect it to.

This section demonstrates how to deal with cases like this and make a single cube for these data.


<div class="alert alert-block alert-info">
You can read more about iris loading behaviour <a href="https://scitools.org.uk/iris/docs/v2.4.0/userguide/loading_iris_cubes.html">here</a>.
</div>


As an example, we will load some temperature data from the [East Asia Domain](https://cordex.org/domains/region-7-east-asia/) at 25 km resolution.


In [None]:
# these variables form part of the standard CORDEX filename convention
rcm = "GERICS-REMO2015"
experiment = "historical"
gcm = "NCC-NorESM1-M"
variable = "tas"

Let's find out which of the files match the pattern above.


In [None]:
file_name = variable+"*"+gcm+"*"+experiment+"*"+rcm+"*"
file_list = glob.glob(path_dir+file_name)

# Complete the print statement to see which East Asian files match the specified criteria
print()

<div class="alert alert-block alert-success">
How many files were returned?    
</div>


**Double click here to type your answer:**

How many files were returned?


**b)** Run the line of code below to try and force Iris to load this data into a _single_ cube. 
<div class="alert alert-block alert-success">
<b>This command will return an error</b>; read the output to find out why.
</div>

In [None]:
cube = iris.load_cube(file_list)

The important part of this error is the following message:

    ConstraintMismatchError: failed to merge into a single cube.
      cube.attributes values differ for keys: 'history', 'creation_date', 'tracking_id'


Instead we will load this data with `iris.load()` and then look more closely at the data, before we fix the issues.


In [None]:
cubes = iris.load(file_list)
print(cubes)
print()
for cube in cubes:
    print(cube.attributes['creation_date'], cube.attributes['tracking_id'])

<div class="alert alert-block alert-success">
<b>Question:</b>

- How many cubes are in the cube list you loaded?<br>
- Are they all the same size in space?<br>
- Do they have the same number of timesteps? Why do you think this is? (Hint: look again at the filenames we are loading)
- What are the differences in the attributes and do you think this is important when analysing your data?
</div>


<b>Answer</b>:
<br>_How many cubes are in the cube list you loaded?
<br>Are they all the same size in space?
<br>Do they have the same number of timesteps? Why do you think this is? (Hint: look again at the filenames we are loading)
<br>What are the differences in the attributes?
<br>Do you think these differences are important when analysing your data?_

---


**c)** Now let's solve this problem so we can get a single cube. We will do this using the [equalise_attributes](https://scitools.org.uk/iris/docs/v2.4.0/iris/iris/experimental/equalise_cubes.html) function from Iris.


In [None]:
from iris.experimental.equalise_cubes import equalise_attributes

equalise_attributes(cubes)

# now print the attributes of each cube
for cube in cubes:
    print(list(cube.attributes.keys()))

<div class="alert alert-block alert-success">

The equalise_attributes function has removed the metadata which is inconsistent between the cubes.

<b>Question:

- Why might it be a bad idea to apply this function without looking at the data first?
</b>
</div>


<b>Type your answer:</b>
<br>_Why might it be a bad idea to apply this function without looking at the data first?_


The following loop is an alternative way to remove the mismatching attributes by hand, by deleting them from every cube in the cube list:

    for cube in cubes:
        del cube.attributes['creation_date']
        del cube.attributes['tracking_id']
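
To make the idea of equalising attributes concrete, here is a plain-Python sketch that works on ordinary dictionaries rather than Iris cubes. It is not the Iris implementation: `drop_differing_keys` is a hypothetical helper and the attribute values are invented. It only shows the principle of keeping the attributes that every dictionary shares with identical values.

```python
def drop_differing_keys(attribute_dicts):
    """Remove, in place, every key that is missing from any dict or whose value differs."""
    first = attribute_dicts[0]
    common = {
        key
        for key in first
        if all(key in attrs and attrs[key] == first[key] for attrs in attribute_dicts)
    }
    for attrs in attribute_dicts:
        for key in list(attrs):
            if key not in common:
                del attrs[key]


# Hypothetical attribute dictionaries standing in for cube.attributes
attrs_a = {"Conventions": "CF-1.4", "tracking_id": "aaa", "creation_date": "2019-01-01"}
attrs_b = {"Conventions": "CF-1.4", "tracking_id": "bbb", "creation_date": "2019-02-01"}

drop_differing_keys([attrs_a, attrs_b])
print(attrs_a)
print(attrs_b)
```

Unlike the hand-written loop, this approach does not need the list of mismatching keys in advance, which is also why it can silently discard metadata you might have wanted to keep.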


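**d)** We can now combine the data into a single cube. Because the files cover consecutive time periods, we do this by concatenating the cubes along their existing time dimension.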


In [None]:
cube = cubes.concatenate_cube()
print(cube)

<div class="alert alert-block alert-success">
<li> Now that we have combined multiple files into a single cube, what is the cube's shape? 
<li> How does this compare with the cube list loaded in 1.3 b)? 
<li> Based on all the information you've gained about the data so far, what time period do you expect the data in this cube to span?
</div>


**Type your answers here**

- Now that we have combined multiple files into a single cube, what is the cube's shape?
- How does this compare with the cube list loaded in 1.3 b)?
- Based on all the information you've gained about the data so far, what time period do you expect the data in this cube to span?


<a id='1.4'></a>

## 1.4 Extracting data within a specific time range

**a)** This is a lot of data, and so for now, we will cut this down to include December 1989 to November 1991 inclusive using a time constraint. Edit the code below to specify the missing end date.
(**Hint:** specify the adjacent months BEFORE and AFTER the time period you wish to keep.)


In [None]:
from iris.time import PartialDateTime

print('original cube first and last dates')
print(cube.coord("time")[0])
print(cube.coord("time")[-1])

time_constraint = iris.Constraint(time=lambda cell: PartialDateTime(year=1989,month=11) 
                            < cell.point < PartialDateTime(year=YYYY,month=MM))
sub_cube = cube.extract(time_constraint)
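
The hint about adjacent months follows from the strict `<` comparisons in the constraint: both end points are excluded. The sketch below mimics that logic with plain `(year, month)` tuples and standard-library dates instead of `PartialDateTime`, and uses a different, made-up period, so it is an analogy rather than the Iris behaviour itself.

```python
from datetime import date

# Hypothetical monthly time points, one per month across 2000 and 2001
time_points = [date(year, month, 16) for year in (2000, 2001) for month in range(1, 13)]


def year_month(point):
    """Reduce a date to a (year, month) tuple so the day is ignored in comparisons."""
    return (point.year, point.month)


# Strict inequalities exclude both end months, so name the months either side
# of the wanted period: this keeps March 2000 to May 2000 inclusive
start, end = (2000, 2), (2000, 6)
selected = [p for p in time_points if start < year_month(p) < end]

print(selected[0], selected[-1], len(selected))
```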

**b)** Check the first and last timesteps in your constrained cube are correct:


In [None]:
print()
print('new cube first and last dates')
print(sub_cube.coord("time")[0])
print(sub_cube.coord("time")[-1])

<a id='1.5'></a>

## 1.5 Save data to a new file

**a)** We will now save this data to a new file.

Take note of the file names. Well chosen filenames can help you keep track of the contents of your files. We suggest developing a consistent syntax based on the filename patterns of the CORDEX data.


In [None]:
out_file_name = variable+'_'+domain+'_'+gcm+'_'+experiment+'_r1ip1_'+rcm+'_v2_mon_198912-199111.nc'

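# os.path.join inserts the separators; the empty final argument adds a trailing "/"
save_location = os.path.join(root_dir, 'cordex_training', '')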

%mkdir {save_location}

print('saving file to: ' + save_location + out_file_name)
iris.save(sub_cube, save_location + out_file_name)

<div class="alert alert-block alert-info">
<b>Note:</b>
As we progress through these worksheets, keep a note of how we update the file names when making further changes to the data.
</div>
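
One way to keep file names consistent is to build them from named components in a single place. The sketch below is a suggestion only: `build_file_name` and `parse_file_name` are hypothetical helpers, and the ensemble, version and period values are placeholders rather than values taken from the training data.

```python
def build_file_name(variable, domain, gcm, experiment, ensemble, rcm, version, frequency, period):
    """Join the naming components with underscores and add the .nc extension."""
    parts = [variable, domain, gcm, experiment, ensemble, rcm, version, frequency, period]
    return "_".join(parts) + ".nc"


def parse_file_name(file_name):
    """Split a file name built by build_file_name back into its components."""
    stem = file_name[:-3] if file_name.endswith(".nc") else file_name
    return stem.split("_")


# Illustrative values only; the ensemble and version strings are placeholders
name = build_file_name("tas", "EAS-22", "NCC-NorESM1-M", "historical",
                       "r1i1p1", "GERICS-REMO2015", "v1", "mon", "200001-200012")
print(name)
print(parse_file_name(name)[-1])
```

Splitting on underscores only works if no component contains an underscore itself, which is one reason to use hyphens inside components such as the domain name.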


---

<div class="alert alert-block alert-success">
<b>b) Question:</b> Use the <b>cd</b> and <b>ls</b> commands to check the NetCDF directory that you have been creating new files in. <br>

- Confirm the names of the new files you have been creating. <br>
- What is the size of the file you saved (containing December 1989 to November 1991 data)?
</div>


In [None]:
# use %cd and %ls to list the contents of your new directory containing NetCDF files:


# use %ls -lh to compare the size of the original files and final netcdf file you saved



<b>Answer:</b><br>
_Size of the file written out at the end:_


<center>
<div class="alert alert-block alert-warning">
<b>This completes worksheet 1.</b> <br>You have created pre-processed files (metadata fixed, concatenated over time, extracted for a specific time range and saved in NetCDF format). <br>
In worksheet 2, you will begin to analyse these files.
</div>
</center>


<p><img src="img/MO_MASTER_black_mono_for_light_backg_RBG.png" alt="Met Office logo" style="float: center; height: 100px;"/></p>
<center>© Crown Copyright 2022, Met Office</center>
