# Initial data processing for TCRM

Before we simulate synthetic TC events, we need to perform some basic processing of the input TC observation database. The best-track archives contain many inconsistencies and missing values, so we need to separate out the required data and check that no invalid data remains.

This notebook will execute the initial processing steps that TCRM uses, so that we can then use the resulting data in subsequent notebooks that explore in more detail the inner workings of TCRM.

In [None]:
import os
import io
import sys

from tcrm import doOutputDirectoryCreation
from DataProcess.DataProcess import DataProcess
from Utilities.config import ConfigParser
from Utilities.parallel import attemptParallel, disableOnWorkers

Because TCRM is often executed in a parallel processing environment, we need a way to ensure that some parts of the code, e.g. directory creation, execute on a single processor only. This avoids race conditions and multiple processes attempting to read from or write to the same file. This next bit of code helps to handle that problem. You don't need to know the details of what's going on here.

In [9]:
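import atexit

# Start the parallel environment; when running serially a dummy object is returned
pp = attemptParallel()
# Shut the parallel environment down cleanly when Python exits
atexit.register(pp.finalize)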

<bound method DummyPypar.finalize of <Utilities.parallel.DummyPypar object at 0x113DEC50>>
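
To give a rough idea of the pattern, the sketch below guards a function so that it runs on one process only. It is not TCRM's implementation: `run_on_root_only`, `make_dir` and the `rank` argument are hypothetical names, and in a real parallel job the rank would come from the parallel library rather than being passed in by hand. It is written for Python 3.

```python
import functools

def run_on_root_only(rank):
    """Decorator factory: run the wrapped function only when rank == 0.

    `rank` stands in for the process index a parallel library would
    normally report (hypothetical; passed in by hand here).
    """
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            if rank == 0:
                return func(*args, **kwargs)
            return None  # every other process skips the call
        return wrapper
    return decorator

created = []

def make_dir(rank, name):
    @run_on_root_only(rank)
    def _create():
        created.append(name)  # stands in for a real os.makedirs call
        return name
    return _create()

# Simulate four processes all asking for the same directory
results = [make_dir(rank, "output") for rank in range(4)]
```

Only the call made with `rank == 0` does any work, so the directory would be created once no matter how many processes reach this point.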

In [14]:
configstr = """
[DataProcess]
InputFile=C:/WorkSpace/data/TC/Allstorms.ibtracs_wmo.v03r09.csv
Source=IBTRACS
StartSeason=1981
FilterSeasons=False

[Region]
; Domain for windfield and hazard calculation
gridLimit={'xMin':90.,'xMax':180.,'yMin':-30.0,'yMax':-5.0}
gridSpace={'x':1.0,'y':1.0}
gridInc={'x':1.0,'y':0.5}

[TrackGenerator]
NumSimulations=5000
YearsPerSimulation=10
SeasonSeed=68876543
TrackSeed=334825
TimeStep=1.0

[Input]
landmask = C:/WorkSpace/tcrm/input/landmask.nc
mslpfile = C:/WorkSpace/tcrm/MSLP/slp.day.ltm.nc
datasets = IBTRACS,LTMSLP

[Output]
Path=C:/WorkSpace/data/TC/aus

[Hazard]
Years=2,5,10,20,25,50,100,200,250,500,1000,2000,2500,5000
MinimumRecords=10
CalculateCI=False

[Logging]
LogFile=C:/WorkSpace/data/TC/aus/log/aus.log
LogLevel=INFO
Verbose=False

[IBTRACS]
; Input data file settings
url = ftp://eclipse.ncdc.noaa.gov/pub/ibtracs/v03r06/wmo/csv/Allstorms.ibtracs_wmo.v03r09.csv.gz
path = C:/WorkSpace/data/TC/
filename = Allstorms.ibtracs_wmo.v03r09.csv
columns = tcserialno,season,num,skip,skip,skip,date,skip,lat,lon,skip,pressure
fielddelimiter = ,
numberofheadinglines = 3
pressureunits = hPa
lengthunits = km
dateformat = %Y-%m-%d %H:%M:%S
speedunits = kph

[LTMSLP]
; MSLP climatology file settings
URL = ftp://ftp.cdc.noaa.gov/Datasets/ncep.reanalysis.derived/surface/slp.day.1981-2010.ltm.nc
path = C:/WorkSpace/data/MSLP
filename = slp.day.ltm.nc
"""

The following piece of code sets up an instance of Python's `ConfigParser` configuration module, with a couple of minor modifications for TCRM. We then read in the string version of the configuration detailed above. `config.readfp()` expects a file-like object, which is why the string is wrapped in an `io` buffer. We could equally pass the name of a complete TCRM configuration file to `config.read()`, but defining the configuration inline makes it easier to show how configuration changes affect the way the model operates.

In [15]:
config = ConfigParser()
config.readfp(io.BytesIO(configstr))
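
As a rough illustration of what parsing this configuration involves, the sketch below uses the Python 3 standard-library `configparser` rather than TCRM's `Utilities.config` wrapper. The `sample` string is a cut-down copy of two sections from above. The conversion with `ast.literal_eval` is an assumption about how one might turn the dictionary-valued options into Python objects; TCRM's own parser may handle this differently.

```python
import ast
import configparser

sample = """
[Region]
; Domain for windfield and hazard calculation
gridLimit={'xMin':90.,'xMax':180.,'yMin':-30.0,'yMax':-5.0}
gridSpace={'x':1.0,'y':1.0}

[TrackGenerator]
NumSimulations=5000
"""

parser = configparser.RawConfigParser()
parser.read_string(sample)

# Option values are stored as text, so convert them explicitly
grid_limit = ast.literal_eval(parser.get('Region', 'gridLimit'))
num_simulations = parser.getint('TrackGenerator', 'NumSimulations')

print(grid_limit['xMin'], grid_limit['yMax'], num_simulations)
```

Editing a value in `sample` and re-running the cell is a quick way to see how a configuration change reaches the code that consumes it.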

We need to create a directory for the output to be stored in. If the specified directory already exists, the data in that folder will be overwritten. If the output directory cannot be created, either because of a permission error or because the specified path is unreachable, the code will raise an exception indicating the reason for the failure.

In [16]:
doOutputDirectoryCreation(configstr)
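
The behaviour described above can be sketched with the standard library. This is not the code inside `doOutputDirectoryCreation`: `make_output_dirs` is a hypothetical helper, and the subdirectory names are just the ones that appear in paths elsewhere in this notebook (`process`, `plots`, `log`), not a complete list of what TCRM creates. It is written for Python 3 and uses a temporary directory so that nothing in the real output path is touched.

```python
import os
import tempfile

def make_output_dirs(base_path, subdirs=("process", "plots", "log")):
    """Create base_path and the given subdirectories (hypothetical helper)."""
    created = []
    for sub in subdirs:
        path = os.path.join(base_path, sub)
        try:
            os.makedirs(path, exist_ok=True)  # no error if it already exists
        except OSError as err:
            # Permission problems and unreachable paths end up here
            raise OSError("Cannot create %s: %s" % (path, err))
        created.append(path)
    return created

# Try it in a throwaway location rather than the real output path
base = os.path.join(tempfile.mkdtemp(), "aus")
paths = make_output_dirs(base)
```

Calling `make_output_dirs(base)` a second time succeeds silently, which mirrors the point that an existing output directory is reused rather than rejected.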

These two lines do the majority of the work of processing the input file. The first sets up a `DataProcess` object, which has a single public method (but many private methods). The second calls that method, `processData`, to perform the processing.

In [17]:
dp = DataProcess(configstr)
dp.processData()



You might see a couple of "WARNING" messages appear, indicating that certain data fields are not available. These fields are optional, not essential. If essential fields are missing, then the `processData` method will raise an exception and stop execution.

So now, what has this produced? We can check the files that now exist in the output path:

In [32]:
outputPath = config.get('Output', 'Path')
processPath = os.path.join(outputPath, 'process')
dirlist = os.listdir(processPath)
dirlist

['all_bearing',
 'all_lon_lat',
 'all_pressure',
 'all_speed',
 'bearing_no_init',
 'bearing_rate',
 'cyclone_tracks',
 'dat',
 'frequency',
 'init_bearing',
 'init_lon_lat',
 'init_pressure',
 'init_speed',
 'jdays',
 'jday_genesis',
 'jday_obs',
 'origin_lon_lat',
 'origin_year',
 'pressure_no_init',
 'pressure_rate',
 'speed_no_init',
 'speed_rate',
 'timeseries',
 'wind_speed']

The listing shows a number of files with no extension. These are text files, some in CSV format, that are read by different processes at later stages in the execution of TCRM.
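
If you would rather inspect these files from the notebook than from a text editor, a small helper like the one below is enough. `head` is a hypothetical convenience function, not part of TCRM, and the demonstration file is made-up stand-in content so that the cell runs on its own.

```python
import os
import tempfile

def head(path, n=5):
    """Return the first n lines of a text file, without trailing newlines."""
    lines = []
    with open(path) as fh:
        for i, line in enumerate(fh):
            if i >= n:
                break
            lines.append(line.rstrip("\n"))
    return lines

# Demonstrate on a small stand-in file; the contents are made up
demo_path = os.path.join(tempfile.mkdtemp(), "demo")
with open(demo_path, "w") as fh:
    fh.write("a,b\n1,2\n3,4\n")

first_two = head(demo_path, n=2)
```

With the variables defined earlier, `head(os.path.join(processPath, 'cyclone_tracks'))` would show the first five lines of the track file.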

#### Questions:
1. Open the 'cyclone_tracks' file in a text editor and check out the contents. What might the first column represent? What about the other columns?
2. Open the 'jday_genesis' file and see if you can deduce what it contains.



### Plotting the output from `DataProcess`

To let you visually inspect the data produced by `DataProcess.processData`, there's a simple function that plots the data for you. We'll execute the function here, then display the contents in the notebook, using some simple IPython functions.

In [None]:
from tcrm import doDataPlotting
from ipywidgets import interact, Dropdown
from IPython.display import Image

doDataPlotting(configstr)

In [36]:
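statsPlotPath = os.path.join(outputPath, 'plots', 'stats')
imglist = os.listdir(statsPlotPath)

def showImg(imgfile):
    # Return the image so that interact() displays it. os.listdir() gives
    # bare file names, so join each one to the plot directory first.
    return Image(filename=os.path.join(statsPlotPath, imgfile))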
    
imgDropdown = Dropdown(options=imglist, value=imglist[0], description="File name")
interact(showImg, imgfile=imgDropdown)

<function __main__.showImg>