# Worksheet 1: File locations and pre-processing
<div class="alert alert-block alert-warning">
**By the end of this worksheet you should be able to:** <br> 
- Identify and list the names of PRECIS output data in PP format using standard Linux commands.<br>
- Use basic Iris commands to load data files, and view Iris cubes. <br>
- Use Iris commands to remove the model rim, select data variables and save the output as NetCDF files.

</div>

The following exercises demonstrate some of the tools available for data analysis, and how to prepare PRECIS output for analysis. This can be time consuming for large amounts of data, so in this worksheet a small subset is used to demonstrate the steps involved. In the worksheets that follow, data that has already been processed will be used.

PRECIS output data are stored in PP format, a Met Office binary data format. This worksheet converts the data to [NetCDF](https://www.unidata.ucar.edu/software/netcdf/) format (a standard format in climate science) so that it can be used with post-processing tools such as Python and the Python library [Iris](http://scitools.org.uk/iris/docs/latest/index.html).

<div class="alert alert-block alert-info">
**Note:** In the boxes where there is code or where you are asked to type code, click in the box and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the code. <br>
**Note:** Anything after the character `#` is just a comment and does not affect the command being run. <br>
**Note:** An exclamation mark **`!`** is needed to run commands on the shell, and is noted where needed.<br>
**Note:** In Jupyter notebooks, a percent sign **`%`** runs a *magic* command, which is a command built into the notebook itself.<br>
**Note:** There is a difference between **`!`** and **`%`**. Shell commands run with **`!`** are executed in a temporary subshell, so commands such as **`cd`** have no lasting effect there and need to be preceded by **`%`** instead. </div>
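
The difference between **`!cd`** and **`%cd`** can be illustrated in plain Python. The sketch below is only an analogy: it assumes that **`!`** behaves like a child shell started with `subprocess.run`, and that **`%cd`** behaves roughly like `os.chdir` in the notebook's own process. The temporary directory is a stand-in for a real data directory.

```python
import os
import subprocess
import tempfile

start_dir = os.getcwd()

with tempfile.TemporaryDirectory() as target:
    # Like `!cd target`: the change happens in a child shell that exits straight away
    subprocess.run('cd "{}"'.format(target), shell=True, check=True)
    after_subshell = os.getcwd()

    # Like `%cd target`: the change is made in the current process and persists
    os.chdir(target)
    after_chdir = os.getcwd()

    # go back before the temporary directory is deleted
    os.chdir(start_dir)

print('cd in a subshell moved us:', after_subshell != start_dir)
print('os.chdir moved us:        ', after_chdir != start_dir)
```

Only the second approach changes the directory that later cells see, which is why the worksheet uses **`%cd`** whenever it needs to move between directories.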


## 1.1 Data locations and file names

The datasets used in these worksheets are made available through the notebook to provide quick and easy access for this training. However, the commands learnt in this worksheet provide useful context for future work in a Linux or Unix scripting environment.

The dataset used here is a three-year subset of monthly PRECIS data over South East Asia driven by the HadCM3Q0 GCM.


* Firstly, find out what location you are currently in by using the **`pwd`** command; **`pwd`** stands for **print working directory**.

In the cell below type **`!pwd`** on a separate line and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [None]:
#Type the command below this line

* List the contents of this directory; **`ls`** stands for **list** and using the **`-l`** option gives a longer listing with more information, such as file size and modification date.

In the cell below type **`!ls`** on a separate line and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.

In [None]:
# Type the command below this line

In the cell below type **`!ls -l`** on a separate line and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.

In [None]:
# Type the command below this line

* Move to the directory (i.e. folder) called **data_directory/cahpa**. This directory contains the data from the PRECIS experiment with the RUNID *cahpa*. The **`cd`** command stands for **change directory**.

In the cell below type **`%cd data_directory/cahpa`** on a separate line and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.


In [1]:
# Type the command below this line

* List the contents of this directory; remember that the **`ls`** command stands for **list** and using the **`-l`** option gives a longer listing with more information.

In [None]:
# Type the command below this line

* List all the files containing data for September.

    Type **`!ls *sep*`** in the code block below.

    How many files contain data for September?

<div class="alert alert-block alert-info">
**Note:** The asterisk character `*` (a _glob_ wildcard) matches any string of characters, including an empty one, within the filename
</div>

In [None]:
# Type the command below this line

* List all the files containing data from 1982 (i.e. all files which begin **`cahpaa.pmi2`**)

    Type below **`!ls cahpaa.pmi2???.pp`**

<div class="alert alert-block alert-info">
**Note:** The question mark character `?` matches any single character
</div>
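
The two wildcards can also be tried out in Python, whose standard `fnmatch` module follows the same shell-style matching rules. This is a minimal sketch: the file names in the list are hypothetical examples that only follow the naming pattern described above, and they are not a listing of the real data directory.

```python
import fnmatch

# Hypothetical file names following the cahpaa.pm<year code><month>.pp pattern
file_names = [
    'cahpaa.pmi1sep.pp',
    'cahpaa.pmi2jan.pp',
    'cahpaa.pmi2sep.pp',
    'cahpaa.pmi3sep.pp',
    'cahpaa.pmi3dec.pp',
]

# `*` matches any run of characters, so this finds every name containing "sep"
september = fnmatch.filter(file_names, '*sep*')

# each `?` matches exactly one character, so `???` stands for a three-letter month
year_1982 = fnmatch.filter(file_names, 'cahpaa.pmi2???.pp')

print('September files:', september)
print('1982 files:     ', year_1982)
```

The `glob` module used later in this worksheet applies the same patterns to the files that actually exist on disk.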



In [None]:
# Type the command below this line

* Move up two levels in the directory tree and list the directories.

In the cells below type the following three commands, one in each cell, and then press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to execute each command.

**`%cd ../..`**

**`!pwd`**

**`!ls`**

In [None]:
# Type the command below this line

In [None]:
# Type the command below this line

In [None]:
# Type the command below this line

The directories **`/daily`** and **`/monthly`** contain data used in the worksheets which follow this one.

***
## 1.2 A brief introduction to Python and Iris

Python is an interpreted, object-oriented, high-level programming language. Python supports modules and packages, which encourages program modularity and code reuse. 


We also use the Python library [Iris](http://scitools.org.uk/iris/docs/latest/index.html), which is maintained by the Met Office. Iris seeks to provide a powerful, easy to use, and community-driven Python library for analysing and visualising meteorological and oceanographic data sets.

The top level object in Iris is called a cube. A cube contains data and metadata about a phenomenon (e.g. air_temperature). Iris implements several major format importers which can take files of specific formats and turn them into Iris cubes.


For a brief introduction to Iris and the cube formatting please read this Introduction page here: 

http://scitools.org.uk/iris/docs/latest/userguide/iris_cubes.html

For further future reference please refer to the Iris website:

http://scitools.org.uk/iris/docs/latest/index.html

Next we see some simple examples of how to load a file into an Iris cube, print its metadata structure and then plot the data.

***
### 1.2.1 Load the data into an Iris cube and plot the data

* First we import the required modules and then change to our data directory, 'practise_ppfiles/cahpa'.

Click in each box and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.

In [None]:
# import the necessary modules
import iris
import matplotlib.pyplot as plt
import iris.plot as iplt
import glob
import os

# this is needed so that the plots are generated inline with the code instead of a separate window
%matplotlib inline 

# the following is needed to be compliant with Iris's latest NetCDF default saving behaviour
# (these flags belong to older Iris releases; newer releases may no longer need or accept them)
iris.FUTURE.netcdf_promote = True
iris.FUTURE.netcdf_no_unlimited = True

print ('Modules imported')

In [None]:
# set the path for practise data files
data_files = '/net/data/users/ssadri/data-local/PRECIS_WORK/practise_ppfiles/cahpa/'
# change to the directory where the data is
%cd $data_files
print ('Data directory was set to: {Location}'.format(Location=data_files))

Next:
* Read the PP data file into an Iris cube, constrain the load to a single variable and then print the Iris cube

In [None]:
# specify the data file name to load
sample_data = 'cahpaa.pmi2jan.pp'

# Constrain the load to a single variable and read it into an Iris cube
total_precipitation_cube = iris.load_cube(sample_data,iris.AttributeConstraint(STASH='m01s05i216'))

# Print the Iris cube
print (total_precipitation_cube)

Finally:
* Plot the data for the selected variable

In [None]:
# set plot area big enough
plt.figure(figsize=(20,10))

# A contour plot of this variable
iplt.contour(total_precipitation_cube)
plt.title(total_precipitation_cube.name())
plt.show()

***
## 1.3 Rim removal (single file example)

The edges (or rim) of RCM outputs are biased due to the linear relaxation used on certain variables to apply the GCM lateral boundary conditions. This rim from each edge needs to be excluded from any analysis.
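
The idea of trimming a rim can be sketched without any model data. The example below is only an illustration: it uses a small nested list in place of an Iris cube, and the function name `remove_rim` and the 6 x 7 grid are made up for the example. Iris cubes accept the same kind of slice, written in a single pair of brackets.

```python
def remove_rim(grid, rim):
    """Return a copy of a 2D grid without `rim` points from each of its four edges."""
    n_rows = len(grid)
    n_cols = len(grid[0])
    # keep only the rows and columns that lie at least `rim` points from every edge
    return [row[rim:n_cols - rim] for row in grid[rim:n_rows - rim]]


# a made-up 6 x 7 grid whose values encode their own position (row * 10 + column)
grid = [[row * 10 + col for col in range(7)] for row in range(6)]

trimmed = remove_rim(grid, rim=2)

print('original size:', len(grid), 'x', len(grid[0]))
print('trimmed size: ', len(trimmed), 'x', len(trimmed[0]))
print(trimmed)
```

A rim of two points removed from every edge shrinks each dimension by four points in total, so the wider the rim, the smaller the remaining analysis domain.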

The practice PP files, like the one we used above, have a rim eight points wide. We now demonstrate how to:
* Remove this 8-point rim from our practice file and save it as a NetCDF file
* Plot the new file so that you can see the difference

Click in the box and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd>.

In [None]:
# Use basic indexing to remove the 8-point rim and save the result as a new cube
total_precipitation_cube_norim = total_precipitation_cube[8:-8, 8:-8]

# Set the path to the output directory
output_dir = '/net/data/users/ssadri/data-local/PRECIS_WORK/practise_ppfiles/cahpa/'

# Build the output NetCDF filename from the variable name
output_filename = os.path.join(output_dir, 'Sample_Data_{Name}_Norim.nc'.format(Name=total_precipitation_cube_norim.name()))

# Save the new data with rim removed as a NetCDF file
iris.save(total_precipitation_cube_norim, output_filename)

print('Saved {File} to: {Location}'.format(File=os.path.basename(output_filename), Location=output_dir))

In [None]:
# set plot area big enough
plt.figure(figsize=(20,10))

# Plot the no-rim data next to the original for comparison
plt.subplot(121)
iplt.contour(total_precipitation_cube)
plt.title(total_precipitation_cube.name())

plt.subplot(122)
iplt.contour(total_precipitation_cube_norim)
plt.title(total_precipitation_cube.name() + ' with rim removed')
plt.show()

***
### 1.3.1 Select variables and remove rim (multiple files)
Next we repeat the rim removal operation, but for multiple files.

In this example you will see a reference to the cube attribute 'STASH'. STASH codes are used as a storage handling system for all the variables that the PRECIS model and the Met Office UM model provide.

Each STASH code refers to a variable. In this example, the following STASH codes are used: 03236 = air temperature, 05216 = precipitation and 16222 = air pressure. The output files are named using both the relevant STASH code and the actual variable name so that they are more readable.
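
The code cells in this worksheet write the same STASH code in two ways: a long form such as `m01s05i216` and the five-digit short form `05216`. The sketch below shows how one maps to the other, assuming the long form is always `m` + two digits, `s` + two digits and `i` + three digits, as in the codes used here. The helper names `parse_stash` and `short_code` are made up for this illustration and are not part of Iris.

```python
import re


def parse_stash(msi):
    """Split a code such as 'm01s05i216' into (model, section, item) integers."""
    match = re.fullmatch(r'm(\d{2})s(\d{2})i(\d{3})', msi)
    if match is None:
        raise ValueError('Not a recognised STASH string: {!r}'.format(msi))
    model, section, item = (int(group) for group in match.groups())
    return model, section, item


def short_code(msi):
    """Return the five-digit form: two-digit section followed by three-digit item."""
    _, section, item = parse_stash(msi)
    return '{:02d}{:03d}'.format(section, item)


for code in ['m01s03i236', 'm01s05i216', 'm01s16i222']:
    print(code, '->', short_code(code))
```

The five-digit form is the one used for the directory names and output file names later in this worksheet.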

In [None]:
# change to the directory where the data is
%cd $data_files
print ('Data directory was set to: {Location}'.format(Location=data_files))

# Path to the output directory
output_dir = data_files + 'rr8_removed'


# List of stash code for the variables of interest: (Temperature, Precipitation, Surface Pressure)
stash_codes = ['m01s03i236','m01s05i216','m01s16i222']

# checks if output directory exists, if not creates a directory
if not os.path.exists('rr8_removed'):
    os.mkdir('rr8_removed')
# specify a filename format and read all the files matching it into a list
file_list = glob.glob('cahpaa.pmi????.pp')

for data_file in file_list:    
    
    print ('Loading data file: {File}'.format(File=data_file))
    
    # Loop through the stash codes and only load the variables of interest
    for stash_code in stash_codes:
    
        # Constrain the load to a single variable and read it into an Iris cube
        cube = iris.load_cube(data_file,iris.AttributeConstraint(STASH=stash_code))
        # remove the rim and save it as a new cube
        no_rim_cube = cube[8:-8, 8:-8]
    
        # define a format for the output filename using the stash code and actual variable names
        output_filename = "{Outdir:s}/rr8_{Variable}_{Stash}_{Name}".format(Outdir=output_dir,Name=data_file,Stash=stash_code, Variable=no_rim_cube.name())
        # save the file (the .pp extension is kept, so the output is still in PP format)
        iris.save(no_rim_cube, output_filename)

        # report each file as it is written
        print('Saved {File} to: {Location}'.format(File=data_file, Location=output_filename))

***
## 1.4 Convert PP files to NetCDF and save them separately by STASH code

We can use Iris to separate the variables and save them as NetCDF files.

* Separate the variables in all of the monthly files into separate directories and save as NetCDF files.

    Click in the box below and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the code.

In [None]:
# change to the directory where the rim-removed files are
%cd rr8_removed
print ('Data directory was set to: {Location}'.format(Location=os.getcwd()))


# find all the rim-removed PP files to convert
no_rr8_ppfiles = glob.glob('rr8*.pp')

for no_rr8_file in no_rr8_ppfiles:
    
    # This will load all the variables in the file into a CubeList
    data_cubes = iris.load(no_rr8_file)
    
    for cube in data_cubes:
        
        # get the STASH code
        cubeSTASH = cube.attributes['STASH'] 

        # create a directory based on the STASH code
        dir_name = str(cubeSTASH.section).zfill(2)+str(cubeSTASH.item).zfill(3)
        
        # checks if directory exists, if not creates a directory
        if not os.path.exists(dir_name):
            os.mkdir(dir_name)
            
        # for saving replace the *.pp file extension with *.nc
        out_file = no_rr8_file.replace('.pp','.' + dir_name + '.nc')
        
        # save the cube as a NetCDF file in its STASH directory
        iris.save(cube, dir_name + '/' + out_file)
        print('Saving {File} to : {Location}'.format(File=out_file, Location=dir_name))

* For each variable (Temperature, Precipitation, Surface Pressure) put the monthly files into a single cube and save as NetCDF file.

    The monthly files cover the years 1981, 1982 and 1983, which is why the output file name includes 1981_1983.

    Click in the box below and press <kbd>Ctrl</kbd> + <kbd>Enter</kbd> to run the code.


In [None]:
import iris
import glob

stash_codes = ['03236','05216','16222']

# loop over each STASH code directory
for stash in stash_codes:

    # list the monthly rim-removed NetCDF files for this variable
    files_list = sorted(glob.glob(stash + '/rr8_*.nc'))

    # load all the monthly files and merge them into a single cube along time
    # (assumes the monthly cubes differ only in their time-related coordinates)
    data_cube = iris.load(files_list).merge_cube()

    # save the combined cube once per variable
    outfile = stash + '/cahpaa.pm.1981_1983.rr8.' + stash + '.nc'
    iris.save(data_cube, outfile)
    print('Monthly files for years 1981-1983 have been saved into a single NetCDF file {File}'.format(File=outfile))

© Crown Copyright 2018, Met Office