# Lesson 5 - Errors and Troubleshooting

## Introduction

In this lesson we list errors that you might run into and explain how to address them as they arise. Some messages look like errors but are actually warnings or system glitches, and we show a few samples of those as well. We group the errors into four sections, according to the step in which they occur:
* Initialization 
* Spinup
* Calibration 
* Validation 

## Initialization 

A few things can cause errors in the initialization steps (covered in Lesson 2). Some of them were mentioned in Lesson 2, and we review them here as well. Many are easy to figure out, and the error message usually provides good guidance. One caveat is that meaningful error messages are sent by email and are not printed on screen, so they are hard to see in this training, where the email option is not available. We will try to show samples of what the email error messages look like.

### Issue 1: Database Already Exists

If you are attempting to set up a new calibration run directory, or re-start after detecting an error, you may encounter a situation in which the database file has already been populated. 

In this case, you will not be able to initialize the calibration setup. Below, we run `initDB.py` once more, as we learned in *Lesson 2: Initialization*, and see what happens.

In [None]:
%%bash
# Create the empty database
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/DATABASE.db

As you can see, we received the following error message: **ERROR: /home/docker/example_case/Calibration/output/DATABASE.db Already Exists.**

Manually remove the database file before beginning the next run, or use a different filename. If you run into any issue during initialization, we recommend removing the database file and restarting from the beginning. For the rest of this lesson, we create a different example case and practice there in order to avoid corrupting our work from previous lessons.
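
The "use a different filename" option can be sketched in Python as follows. This is only an illustration: `unused_db_path` is a hypothetical helper, not part of `PyWrfHydroCalib`, and the numbered-suffix naming scheme is an assumption.

```python
from pathlib import Path


def unused_db_path(requested):
    """Return `requested` if nothing exists there, else the first free numbered variant.

    Hypothetical helper: DATABASE.db -> DATABASE_1.db -> DATABASE_2.db ...
    """
    path = Path(requested)
    if not path.exists():
        return path
    counter = 1
    while True:
        candidate = path.with_name(f"{path.stem}_{counter}{path.suffix}")
        if not candidate.exists():
            return candidate
        counter += 1
```

The returned path could then be passed to `initDB.py` through `--optDbPath`, so an existing database is never overwritten or reused by accident.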

In [None]:
%%bash
# Create an empty directory 
mkdir /home/docker/example_case/Calibration/output/issue_1

# Create the empty database
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_1/DATABASE_issue_1.db

### Issue 2: Errors in domainMeta.csv File Structure:

`domainMeta.csv` is an important file for the calibration procedure; it contains all the metadata about the domains the user calibrates. For the calibration procedure to ingest this file properly, the column names and overall structure must match exactly, because the expected names are hardcoded. The `inputDomainMeta.py` script reads `domainMeta.csv`, checks its content, and tries to catch some of the possible errors. It:

* Checks the number of columns in the CSV file.
* Checks the column names; they must match what the workflow expects, otherwise it throws an error.
* Checks the existence of the domain directory.
* Checks the existence of the necessary files, such as the domain files, forcing directory, and observation directory.

For example, see what happens if you provide a column name different from the one the workflow expects.

In [None]:
%%bash
# Create an empty directory 
mkdir /home/docker/example_case/Calibration/output/issue_2

# Make a copy of the domainMeta.csv file 
cp /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Change the name of a field, "rfc", to something else, for example "new_field"
sed -i -e 's#rfc#new_field#g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Create a new database for the issue 2
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

The workflow will throw the error message **ERROR: Unexpected column name in: /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv**. Like Issue 1, this message is easy to understand and fix.

As mentioned before, the column names in this file are hardcoded in the calibration workflow, so we recommend keeping the format exactly the same to avoid running into issues.
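
One way to catch a renamed column before running `inputDomainMeta.py` is to compare the header of your file against the header of the template shipped with the workflow. The sketch below is a standalone illustration, not how `inputDomainMeta.py` itself performs the check; `compare_headers` is a hypothetical helper.

```python
import csv


def compare_headers(template_csv, candidate_csv):
    """Return (missing, unexpected) column names of the candidate relative to the template."""

    def read_header(path):
        # The header is the first row of the CSV file.
        with open(path, newline="") as handle:
            return next(csv.reader(handle), [])

    expected = read_header(template_csv)
    actual = read_header(candidate_csv)
    missing = [name for name in expected if name not in actual]
    unexpected = [name for name in actual if name not in expected]
    return missing, unexpected
```

For the example above, comparing the modified copy against `/home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv` would report `rfc` as missing and `new_field` as unexpected.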

If the content of the file is as expected, the workflow next checks that the domain directory exists and throws an error if it does not. For example, let's change the domain directory to a non-existing directory and check the result.

In [None]:
%%bash

# Make a copy of the domainMeta.csv file 
cp /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Change the "domain_path" to a non existing directory
sed -i -e 's#/home/docker/example_case/Calibration/Input_Files/01447720#/home/docker/example_case/Calibration/Input_Files/NON_EXISTING#g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

The workflow gives a clear message that the domain directory does not exist: **ERROR: Directory: /home/docker/example_case/Calibration/Input_Files/NON_EXISTING not found.** The same goes for the files that the workflow requires: if they do not exist under the domain directory, the workflow gives a clear message. We will not showcase those situations.

As explained in Lesson 2, the three necessary fields in the `domainMeta.csv` file are `gage_id`, `link_id` and `domain_path`. All three need to be specified. Unfortunately, the error messages produced when any of these fields is missing are not that informative. Let us check what happens when each of them is missing, beginning with a missing `link_id`.

In [None]:
%%bash

# Make a copy of the domainMeta.csv file 
cp /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# remove link_id
sed -i -e 's#4185779##g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

It will give an error message like **ERROR: Unable to open CSV file: /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv**, which is not that informative. The same goes for a missing `domain_path`.

In [None]:
%%bash
# Make a copy of the domainMeta.csv file 
cp /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# remove domain_path
sed -i -e 's#/home/docker/example_case/Calibration/Input_Files/01447720##g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

Again, this error message is not that informative, and the user needs to double-check the content of the file to make sure everything is specified properly. Lastly, we check the case of a missing `gage_id`.

In [None]:
%%bash

# Make a copy of the domainMeta.csv file 
cp /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Remove the gage_id (this also strips 01447720 from the domain_path value)
sed -i -e 's#01447720##g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv
# Restore the domain_path value that the previous command altered
sed -i -e 's#/home/docker/example_case/Calibration/Input_Files/#/home/docker/example_case/Calibration/Input_Files/01447720#g' /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/example_case/Calibration/output/issue_2/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_2/DATABASE_issue_2.db

As seen above, the workflow did not give any error and entered the domain info with the missing `gage_id` into the database. However, this field is required, since the pairing between the model simulations and the observations is based on it. In summary, the user is encouraged to prepare the file carefully and make sure these three fields are defined properly.
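
Because the workflow's own messages are not informative for these three fields (and a missing `gage_id` is not reported at all), a small pre-check can help. The sketch below is an illustration only; `find_empty_required` is a hypothetical helper, and the only names taken from the lesson are the three field names.

```python
import csv

# The three fields the lesson identifies as required.
REQUIRED_FIELDS = ("gage_id", "link_id", "domain_path")


def find_empty_required(csv_path, required=REQUIRED_FIELDS):
    """Return (row_number, field) pairs for required fields that are absent or blank."""
    problems = []
    with open(csv_path, newline="") as handle:
        # Row numbers count data rows, starting at 1 after the header.
        for row_number, row in enumerate(csv.DictReader(handle), start=1):
            for field in required:
                value = row.get(field)
                if value is None or not value.strip():
                    problems.append((row_number, field))
    return problems
```

An empty list means every data row has a value for all three fields; it does not verify that the values themselves are correct.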

### Issue 3: Errors Related to Setup.parm 
A number of error messages can help you work out why the initialization step has failed. The workflow performs the following checks:

* Checks for the existence of the database.
* Checks the existence of the `outDir`, the experiment directory, and the necessary files.
* Checks that the namelist options have viable values.
* Checks the validity of the dates for spinup, calibration and validation.

Below are a few samples.

#### Non Existing OutDir in setup.parm 
Another issue that is easy to detect is a missing directory. The user defines where the experiment will be run in the `setup.parm` file using the variable `outDir`. This directory must exist.

In [None]:
%%bash
# Create the directory
mkdir /home/docker/example_case/Calibration/output/issue_3

# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Point outDir to a non-existing directory
sed -i -e 's#outDir = /home/docker/example_case/Calibration/output/#outDir = /home/docker/example_case/Calibration/NON_EXISTING/#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Create a new database for the issue 3
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

In this case, the error message is easy to find and fix. 
**ERROR: Directory: /home/docker/example_case/Calibration/NON_EXISTING/ not found.
ERROR: Improper Entries Into Config File.
ERROR: Failure to initialize calibration workflow job**

#### Existing Experiment Directory 

The user specifies a name for the experiment in the `setup.parm` file. The Python workflow creates a directory with the name of the experiment under the `outDir` directory and places all the calibration-related files there. If the user has already tried to initialize the model, or is by mistake trying to overwrite an existing experiment, the Python workflow issues an error message.


In [None]:
%%bash
# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_3/setup.parm

# initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

#### Improper Entry in Setup.parm 

All the entries in the `setup.parm` file are hardcoded, so they must all be present. If any required option is missing, the initialization fails with an error message saying so. We encourage users to copy the `setup.parm` file provided here and modify it. Note that `setup.parm` files are not compatible across different versions of `PyWrfHydroCalib`; the user should use the `setup.parm` file that matches the version of the code being used.
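
To spot a renamed or deleted entry before initializing, you could compare your file against the provided template. The sketch below assumes that `setup.parm` is an INI-style file (sections with `key = value` lines) that Python's standard `configparser` can read; that format is an assumption here, and `compare_setup` is a hypothetical helper rather than part of the workflow.

```python
import configparser


def read_options(parm_path):
    """Return the set of (section, option) pairs found in an INI-style file."""
    parser = configparser.ConfigParser(interpolation=None)
    parser.optionxform = str  # keep option names case-sensitive (e.g. outDir)
    with open(parm_path) as handle:
        parser.read_file(handle)
    return {
        (section, option)
        for section in parser.sections()
        for option in parser.options(section)
    }


def compare_setup(template_path, candidate_path):
    """Return (missing, extra) entries of the candidate relative to the template."""
    template = read_options(template_path)
    candidate = read_options(candidate_path)
    return sorted(template - candidate), sorted(candidate - template)
```

Both returned lists being empty means the two files define the same set of entries; the values themselves are not compared.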


In [None]:
%%bash
#clean the directory
rm -rf /home/docker/example_case/Calibration/output/issue_3/*

# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_3/setup.parm

# change the outDir:
sed -i -e 's#outDir = /home/docker/example_case/Calibration/output/#outDir = /home/docker/example_case/Calibration/output/issue_3/#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Change the entry basinType to NonExistingEntry
sed -i -e 's#basinType#NonExistingEntry#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Create a new database for the issue 3
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

#### Non Existing File Specified in the Setup File
All the input files specified in the `setup.parm` file must exist. The workflow checks for them and throws an error if one is missing. If an entry is not essential, the user needs to either modify the code to allow for that flexibility or create an empty file, which will be ignored. For example, the WRF-Hydro executable is a necessary file. If it does not exist, the workflow reports that the file is missing.

In [None]:
%%bash
#clean the directory
rm -rf /home/docker/example_case/Calibration/output/issue_3/*

# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_3/setup.parm

# change the outDir:
sed -i -e 's#outDir = /home/docker/example_case/Calibration/output/#outDir = /home/docker/example_case/Calibration/output/issue_3/#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# change the path to the WRF-Hydro executable to a non-existing file
sed -i -e 's#/home/docker/wrf_hydro_nwm_public/trunk/NDHMS/Run/wrf_hydro.exe#/home/docker/wrf_hydro_nwm_public/trunk/NDHMS/Run/wrf_hydro.exe_NON_EXISTING#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Create a new database for the issue 3
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

The user receives the same error message if any other file is missing.

#### Using Same JobID for Two Experiments
Each experiment needs its own unique job ID, and the workflow throws an error if the ID is already in use. Below, we create a job with ID 1 and then try to reuse it.

In [None]:
%%bash
#clean the directory
rm -rf /home/docker/example_case/Calibration/output/issue_3/*

# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_3/setup.parm

# change the outDir:
sed -i -e 's#outDir = /home/docker/example_case/Calibration/output/#outDir = /home/docker/example_case/Calibration/output/issue_3/#g' /home/docker/example_case/Calibration/output/issue_3/setup.parm

# Create a new database for the issue 3
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

# retrying after the initialization was successful
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_3/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_3/DATABASE_issue_3.db

Many of the database components were originally written for a central database (formerly PostgreSQL) that all users and experiments accessed. With the move from PostgreSQL to a file-based SQLite database, we now use a separate database for each job (with as many basins as the user specifies).

Errors related to initialization are quick to identify and usually easy to address. We have the following recommendations: 
* Copy the `domainMeta.csv` file and keep the format as is while modifying it.
* Copy the `setup.parm` file and keep the format as is. Do not add any new entry, and do not delete any existing entry even if it is not used.
* If any of the initialization steps fails, remove the database and the created experiment directory, then start over.
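
The last recommendation can be sketched as a small cleanup step. This is an illustration under stated assumptions: `reset_experiment` is a hypothetical helper, and you must pass it the database path and experiment directory you actually used. Note that it deletes data permanently.

```python
import shutil
from pathlib import Path


def reset_experiment(db_path, experiment_dir):
    """Delete the database file and the experiment directory if they exist.

    Returns the list of paths that were actually removed.
    """
    removed = []
    db_file = Path(db_path)
    if db_file.is_file():
        db_file.unlink()
        removed.append(str(db_file))
    exp_dir = Path(experiment_dir)
    if exp_dir.is_dir():
        shutil.rmtree(exp_dir)
        removed.append(str(exp_dir))
    return removed
```

After this cleanup, the initialization sequence (`initDB.py`, `inputDomainMeta.py`, `jobInit.py`) can be rerun from the start.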

## Spinup
Spinup usually fails when the model run crashes. In that case, a LOCK file called `RUN.LOCK` is created under the `RUN.SPINUP` directory. If an email address is provided in the `setup.parm` file, a message is sent to the user stating that there is a problem that needs to be addressed. The workflow keeps running in the background, but because of the LOCK file it does nothing and waits until you fix the issue and remove the LOCK file. After the LOCK file is removed, the procedure restarts and picks up from where it crashed.

### What does the email look like?
* If WRF-Hydro model fails, it would be a message like: `ERROR: SIMULATION FOR GAGE: 01447720 HAS FAILED A SECOND TIME. PLEASE FIX ISSUE AND MANUALLY REMOVE LOCK FILE: /home/docker/example_case/Calibration/output/example1/01447720/RUN.SPINUP/RUN.LOCK`

### Where to search for error messages?
* If the WRF-Hydro model fails, look for the standard error and standard output files in the `RUN.SPINUP/OUTPUT` directory. They are named something like `WH_1_1.err` and `WH_1_1.out`, for example `/home/docker/example_case/Calibration/output/example1/01447720/RUN.SPINUP/OUTPUT/WH_1_1.err`. These two files are the first place to look for hints about why the model failed. You can also look at the diag files in the same folder; sometimes those files have better information in them.
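
Searching those files by hand can be tedious, so the search can be sketched in Python. The file patterns follow the names mentioned above (`*.err`, `*.out`, and diag files such as `diag_hydro.00000`); the keywords `error` and `fatal` are assumptions about what a failure message contains, and `scan_output_dir` is a hypothetical helper.

```python
from pathlib import Path

# File name patterns based on the files named in this lesson.
LOG_PATTERNS = ("*.err", "*.out", "diag_hydro.*")
# Assumed keywords; adjust them to the messages your model build prints.
KEYWORDS = ("error", "fatal")


def scan_output_dir(output_dir, keywords=KEYWORDS):
    """Return (file name, line number, line) for every log line containing a keyword."""
    hits = []
    for pattern in LOG_PATTERNS:
        for log_file in sorted(Path(output_dir).glob(pattern)):
            with open(log_file, errors="replace") as handle:
                for line_number, line in enumerate(handle, start=1):
                    if any(word in line.lower() for word in keywords):
                        hits.append((log_file.name, line_number, line.rstrip()))
    return hits
```

An empty result does not prove the run succeeded; it only means none of the assumed keywords appeared in those files.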

### What could be the possible cause of a WRF-Hydro failure?
The model can fail for many reasons, but a few of the common ones are the following:
* A forcing file is missing, corrupted, or contains an invalid value.
* The model executable does not match the table files or the namelists used here. Note that the namelists are generated by the Python workflow; if you are using a different namelist that has extra options, you need to modify `PyWrfHydroCalib`.
* One of the domain files is missing or corrupted.

In the example below, we rename one of the forcing files to cause an error and show you how to address the issue.

In [None]:
%%bash
# Create the directory if it does not exist (-p avoids an error when it already exists)
mkdir -p /home/docker/example_case/Calibration/output/issue_4/

# Clean the directory for cases of running this cell more than once 
rm -rf /home/docker/example_case/Calibration/output/issue_4/*

# Make a copy of the setup.parm file 
cp /home/docker/PyWrfHydroCalib/setup_files/setup.parm /home/docker/example_case/Calibration/output/issue_4/setup.parm

# change the outDir:
sed -i -e 's#outDir = /home/docker/example_case/Calibration/output/#outDir = /home/docker/example_case/Calibration/output/issue_4/#g' /home/docker/example_case/Calibration/output/issue_4/setup.parm

# Create a new database for the issue 4
python /home/docker/PyWrfHydroCalib/initDB.py --optDbPath /home/docker/example_case/Calibration/output/issue_4/DATABASE_issue_4.db

# Add the domainMeta info to the database
python /home/docker/PyWrfHydroCalib/inputDomainMeta.py /home/docker/PyWrfHydroCalib/setup_files/domainMeta_01447720.csv --optDbPath /home/docker/example_case/Calibration/output/issue_4/DATABASE_issue_4.db

# Initialize
python /home/docker/PyWrfHydroCalib/jobInit.py /home/docker/example_case/Calibration/output/issue_4/setup.parm --optExpID 1 --optDbPath /home/docker/example_case/Calibration/output/issue_4/DATABASE_issue_4.db

# Rename one forcing file so that the model run fails
mv /home/docker/example_case/Calibration/Input_Files/01447720/FORCING/2010100107.LDASIN_DOMAIN1 /home/docker/example_case/Calibration/Input_Files/01447720/FORCING/2010100107.LDASIN_DOMAIN1_RENAMED

# Run Spin up
python /home/docker/PyWrfHydroCalib/spinOrchestrator.py 1 --optDbPath /home/docker/example_case/Calibration/output/issue_4/DATABASE_issue_4.db

Spinup will start, run for 6 time steps (6 hours), and then fail because the forcing file (`2010100107.LDASIN_DOMAIN1`) does not exist. As a result, the following LOCK file will be created: **/home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.SPINUP/RUN.LOCK**. The above job will hang until the user removes `RUN.LOCK`, so let's take a look at the `OUTPUT` directory and check the `diag_hydro.00000` file for the error message (in this case `WH_1_1.out` and `WH_1_1.err` did not provide a proper failure message). We can now restore the forcing file's original name and remove the `RUN.LOCK` file; after that, the workflow will continue. Copy and paste the following commands into the shell.

In [None]:
%%bash 
mv /home/docker/example_case/Calibration/Input_Files/01447720/FORCING/2010100107.LDASIN_DOMAIN1_RENAMED /home/docker/example_case/Calibration/Input_Files/01447720/FORCING/2010100107.LDASIN_DOMAIN1
rm /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.SPINUP/RUN.LOCK

After the LOCK file is removed, it takes a few minutes for the model to restart; the simulation then continues and finishes.
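
When several basins are being processed, it may not be obvious which LOCK files are present. A minimal sketch for listing them, assuming all LOCK files end in `.LOCK` as `RUN.LOCK` does (`find_lock_files` is a hypothetical helper):

```python
from pathlib import Path


def find_lock_files(experiment_dir):
    """Return the sorted paths of all *.LOCK files below the experiment directory."""
    return sorted(str(path) for path in Path(experiment_dir).rglob("*.LOCK"))
```

For this example, the experiment directory would be `/home/docker/example_case/Calibration/output/issue_4/example1`. The helper only lists the files; remove a LOCK file only after fixing the underlying issue.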

### Errors not caught by workflow
Some errors are not caught by the workflow, and those are the most difficult ones to figure out. Below are a few of the possible failures.
#### Not finding restart files at the end of the spinup period
The user specifies the start and end dates of the spinup period in the `setup.parm` file, along with the frequency of restart files (for both LSM and hydro). The workflow looks for the restart file at the requested spinup end date as an indication that the model run is complete. So if the spinup period and the restart frequency are defined such that no restart file is written at the end of the simulation, the Python workflow never sees the restart file and falls into an infinite loop. Our common practice is to define the periods (both start and end) for spinup, calibration and validation at the start of a month and set the frequency of restart files to -9999, which outputs restart files once every month.
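
A quick sanity check of the dates against the restart frequency can be sketched as follows. The sketch rests on assumptions that should be verified against your model version: that -9999 produces a restart file at 00:00 on the first day of each month, and that a positive frequency is expressed in hours and counted from the start date. `restart_expected_at_end` is a hypothetical helper.

```python
from datetime import datetime

MONTHLY = -9999  # value used in this lesson for monthly restart files


def restart_expected_at_end(start, end, freq_hours):
    """Return True if a restart file is expected exactly at `end` (see assumptions above)."""
    if freq_hours == MONTHLY:
        # Assumed: monthly restarts are written at 00:00 on the first of the month.
        return (end.day, end.hour, end.minute, end.second) == (1, 0, 0, 0)
    if freq_hours <= 0:
        raise ValueError("unsupported restart frequency")
    # Assumed: positive frequencies are in hours, counted from the start date.
    elapsed_seconds = (end - start).total_seconds()
    return elapsed_seconds > 0 and elapsed_seconds % (freq_hours * 3600) == 0
```

For example, with monthly restarts a period from `datetime(2010, 10, 1)` to `datetime(2010, 11, 1)` passes the check, while an end date of `datetime(2010, 10, 15)` does not.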

#### Fixed Grid ID
The grid ID is hard-coded to 1 in the calibration workflow, meaning it expects the restart files to be named like `???DOMAIN1`. If they are not, the workflow does not see the restart files, concludes that the job has not finished, and restarts it. In this case, the workflow is stuck in an infinite loop and never finishes. If your domain is an NWM cutout, this problem will not arise. However, if you are using a non-NWM domain, make sure the `grid_id` global attribute in the `wrfinput.nc` file is 1. This is where the LSM gets the index of the domain.

#### Workflow needs both restart files
The user cannot turn off all the routing options, because the workflow looks for both restart files (LSM and hydro) as an indication of job completion. If the user turns off all the routing options, no hydro restart file is written, and the workflow falls into an infinite loop.

## Calibration 

There are two types of LOCK files for the calibration: a `RUN.LOCK` file resulting from a WRF-Hydro model run failure, or a `CALIB.LOCK` file resulting from a calibration workflow failure. The `calibOrchestrator.py` workflow keeps running in the background, but because of the LOCK file it does nothing and waits until you fix the issue and remove the LOCK file. After the LOCK file is removed, the procedure restarts and picks up from where it crashed.

### What does the email look like?
* If WRF-Hydro model fails, it would be a message like `ERROR: SIMULATION FOR GAGE: 01447720 HAS FAILED A SECOND TIME. PLEASE FIX ISSUE AND MANUALLY REMOVE LOCK FILE: /home/docker/example_case/Calibration/output/example1/01447720/RUN.CALIB/RUN.LOCK`
* If calibration workflow fails, user would receive a message like the following: ` Calibration Scripts failed a second time for gage: 01447720Iteration: 297 Failed.  Please remove LOCKFILE: /home/docker/example_case/Calibration/output/example1/01447720/RUN.CALIB/CALIB.LOCK`

### Where to search for error messages?
* If the WRF-Hydro model fails, look for the standard error and standard output files in the `RUN.CALIB/OUTPUT` directory. They are named something like `WH_1_1.err` and `WH_1_1.out`, for example `/home/docker/example_case/Calibration/output/example1/01447720/RUN.CALIB/OUTPUT/WH_1_1.err`. These two files are the first place to look for hints about why the model failed. You can also look at the diag files in the same folder; sometimes those files have better information in them.
* If the calibration workflow fails, you will see standard error and standard output files in the `RUN.CALIB/OUTPUT` directory named something like `WH_CALIB_1_1.err` and `WH_CALIB_1_1.out`, for example `/home/docker/example_case/Calibration/output/example1/01447720/RUN.CALIB/OUTPUT/WH_CALIB_1_1.err`. The error information is in these files.


We already saw how the model fails during spinup and how to look for the errors, fix them, and restart the procedure. The same applies to the calibration, so we do not repeat it. Instead, we create a scenario in which the calibration workflow itself fails, and then fix it: we rename the observation file so that the calibration workflow fails. To save time, we reuse the files generated in the previous example and proceed directly with the calibration.

In [None]:
%%bash
# Rename the obsStrData.Rdata file
mv /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/OBS/obsStrData.Rdata /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/OBS/obsStrData.Rdata_RENAMED

# Run calibration
python /home/docker/PyWrfHydroCalib/calibOrchestrator.py 1 --optDbPath /home/docker/example_case/Calibration/output/issue_4/DATABASE_issue_4.db

Because `obsStrData.Rdata` does not exist, the calibration workflow fails and creates a LOCK file (`/home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/CALIB.LOCK`). The above cell will hang until you kill the job or remove the LOCK file. Let's check the `OUTPUT` directory and the content of the calibration standard error and standard output files.

In [None]:
%%bash 
cat /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/OUTPUT/WH_CALIB*

If you run the above command in the shell, you will notice this error message: **cannot open compressed file '/home/docker/example_case/Calibration/output/issue_4//example1/01447720/RUN.CALIB/OBS/obsStrData.Rdata', probable reason 'No such file or directory'**. Let's now restore the original name of `obsStrData.Rdata` and remove the LOCK file. Run the following commands in the shell.

In [None]:
%%bash 
mv /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/OBS/obsStrData.Rdata_RENAMED /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/OBS/obsStrData.Rdata
rm /home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/CALIB.LOCK

The calibration workflow restarts from where it left off and continues. The messages are not always this easy to understand, and finding out why a run failed might need more digging. In that case, the user needs to run `/home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/calib_workflow.R` and debug it. This script takes one argument, which is `/home/docker/example_case/Calibration/output/issue_4/example1/01447720/RUN.CALIB/calibScript.R`.

## Validation 

There are two types of LOCK files for the validation: a `RUN.LOCK` file resulting from a WRF-Hydro model run failure, or a `VALID.LOCK` file resulting from a validation workflow failure. The `runValidOrchestrator.py` workflow keeps running in the background, but because of the LOCK file it does nothing and waits until you fix the issue and remove the LOCK file. After the LOCK file is removed, the procedure restarts and picks up from where it crashed. The procedure for finding errors and addressing them is very similar to that for spinup and calibration, so we do not repeat it.

## How to stop the processes if required?
Sometimes we need to stop the process completely and restart the whole thing. For example, if an R library is missing, just removing the LOCK file is not going to help; one needs to stop the procedure, install the library, and restart. In that case, the user needs to first kill the Python process (one of `spinOrchestrator.py`, `calibOrchestrator.py`, or `runValidOrchestrator.py`) and then kill all the jobs that were submitted to the queue. In our training we did not use a scheduler, so the second step is not required, but where a scheduler is used, the user needs to kill the queued jobs as well. After you make the necessary changes, resubmit the Python process (one of `spinOrchestrator.py`, `calibOrchestrator.py`, or `runValidOrchestrator.py`) and everything should restart.

**NOTE**: if the user kills the jobs in the queue but not the Python process, the Python workflow will see that there is no job in the queue and resubmit it. That is how the workflow has been designed to deal with wall clock limitations: it keeps checking the queue, and when it does not find the job, it resubmits it.

## Random Warnings

Sometimes there are system glitches and the workflow cannot update a field in the database. You will then get email messages like the ones below:
* ERROR: Unable to query domain meta table for gages metadata.
* ERROR: Unable to update calibration status for job ID: 4 domainID: 28 Iteration: 430
* ERROR: Failure to enter value for parameter: bexp jobID: 4 domainID: 40 iteration: 451

These are not really errors; they are resolved by the Python workflow and can be ignored.
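
If you receive many emails, a simple filter can separate these harmless messages from the ones that need action. The sketch below matches only the opening text of the three sample messages listed above; treating any message with such a prefix as transient is an assumption, and `is_transient_db_message` is a hypothetical helper.

```python
# Opening text of the three sample messages shown in this section.
TRANSIENT_PREFIXES = (
    "ERROR: Unable to query domain meta table",
    "ERROR: Unable to update calibration status",
    "ERROR: Failure to enter value for parameter",
)


def is_transient_db_message(message):
    """Return True if the message starts like one of the known database glitch messages."""
    return message.strip().startswith(TRANSIENT_PREFIXES)
```

Messages that ask you to remove a LOCK file do not match these prefixes and still require manual action.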


## Bug that will be addressed
As explained, a number of directories are created under the `RUN.CALIB` directory; one of them is called `FINAL_PARAMETERS`. Unfortunately, this directory does not contain the files associated with the best iteration. Instead, it contains the files from the iteration after the best one. We are aware of the bug and will address it soon.

Meanwhile, if the user needs the domain files with the best parameters, they can be found under the `RUN.VALID/OUTPUT/BEST` directory.

## Conclusion

We just went through a few of the errors that a user could encounter and how to address them. This list is not complete; please contact us if you run into any issue that is not explained here.