# SOMOSPIE

This notebook migrates the workflow from bash shell scripts to Jupyter Notebook cells that call the subscript files.
https://docs.python.org/2/library/subprocess.html

Text from the paper.
https://github.com/TauferLab/Src_SoilMoisture/tree/master/2018_BigData/docs/2018paper

#### Abstract

The current availability of soil moisture data over large areas comes from satellite remote sensing technologies (i.e., radar-based systems), but these data have coarse resolution and often exhibit large spatial information gaps. Where data are too coarse or sparse for a given need (e.g., precision agriculture), one can leverage machine-learning techniques coupled with other sources of environmental information (e.g., topography) to generate gap-free information at a finer spatial resolution (i.e., increased granularity). To this end, we develop a spatial inference engine consisting of modular stages for processing spatial environmental data, generating predictions with machine-learning techniques, and analyzing these predictions. We demonstrate the functionality of this approach and the effects of data processing choices via multiple prediction maps over a United States ecological region with a highly diverse soil moisture profile (i.e., the Middle Atlantic Coastal Plains). The relevance of our work derives from a pressing need to improve the spatial representation of soil moisture for applications in environmental sciences (e.g., ecological niche modeling, carbon monitoring systems, and other Earth system models) and precision agriculture (e.g., optimizing irrigation practices and other land management decisions).

## Overview

We build a modular SOil MOisture SPatial Inference Engine (SOMOSPIE) for prediction of missing soil moisture information. SOMOSPIE includes three main stages, illustrated below: (1) data processing to select a region of interest, incorporate predictive factors such as topographic parameters, and reduce data redundancy for these new factors; (2) soil moisture prediction with three different machine learning methods (i.e., kNN, HYPPO, and RF); and (3) analysis and visualization of the prediction outputs.

![inference-engine](../figs/inference-engine.png)
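
To make stage (2) concrete, here is a minimal sketch of k-nearest-neighbour (kNN) prediction on 3-column (lat, lon, sm) data. It is an illustration only, not the SOMOSPIE implementation: the function name `knn_predict`, the unweighted mean, the plain Euclidean distance in degrees, and the sample rows are all assumptions made for this example.

```python
import math

def knn_predict(train, query, k=3):
    """Estimate soil moisture at `query` (lat, lon) as the unweighted
    mean of the k nearest training rows (hypothetical helper)."""
    # Sort training rows by Euclidean distance in (lat, lon) space
    # and keep the k closest ones.
    nearest = sorted(train, key=lambda row: math.dist(row[:2], query))[:k]
    return sum(row[2] for row in nearest) / len(nearest)

# Hypothetical (lat, lon, sm) rows; the values are made up for illustration.
train = [
    (35.0, -77.0, 0.20),
    (35.0, -76.0, 0.30),
    (36.0, -77.0, 0.40),
    (40.0, -70.0, 0.90),
]

# Predict at a location with no observation; the far-away fourth row
# is excluded because only the three nearest rows are averaged.
estimate = knn_predict(train, (35.2, -76.8), k=3)
print(round(estimate, 3))
```

A real run would also use the topographic covariates attached during stage (1) as additional distance dimensions; this sketch uses coordinates only.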

### User Input

Make changes to the cell below, then in the "Cell" menu at the top, select "Run All".

In [1]:
# Here the user specifies the working directory...
START = "../"
# ... the subfolder with the modular scripts...
CODE = "code/"
# ... the subfolder with the data...
DATA = "data/"
# ... the subfolder for output. 
OUTPUT = "out/"

YEAR = 2016
# Assuming SM_FILE below has multiple months of SM data, 
# specify the month here (1=January, ..., 12=December)
# The generated predictions will go in a subfolder of the data folder named by this number.
# Set to 0 if train file is already just 3-columns (lat, lon, sm).
MONTH = 4

#############################
# Within the data folder...

# ... there should be a subfolder with/for training data...
TRAIN_DIR = f"{YEAR}/t/"#-100000"
# ... and a subfolder with/for evaluation data.
EVAL_DIR = f"{YEAR}/e/"

# THE FOLLOWING 3 THINGS WILL ONLY BE USED IF MAKE_T_E = 1.
# Specify the location of the file with sm data.
# Use an empty string or False if the train folder is already populated.
SM_FILE = f"{YEAR}/{YEAR}_ESA_monthly.rds"
# Specify location of eval coordinates needing covariates attached.
# An empty string or False will indicate that the eval folder is already populated.
EVAL_FILE = f""#{YEAR}/{MONTH}/ground_sm_means_CONUS.csv"
# Specify location of the file with covariate data.
# An empty string or False will indicate that covariates are already attached to train and eval files.
COV_FILE = "USA_topo.tif"#8.5_topo.tif"#6.2_topo.tif"#

##########################
# If the Train and Eval files need to be generated, set MAKE_T_E = 1.
MAKE_T_E = 0
# If you wish to perform PCA, set USE_PCA = 1; otherwise USE_PCA = 0.
USE_PCA = 0
# Choose the validation mode:
#   1    = compute residuals from the original test data;
#   1.25 = split off (e.g.) 25% of the original data as a test set for validation;
#   2    = use the EVAL_FILE as truth for validation.
VALIDATE = 1.25
RAND_SEED = 0  # 0 generates a new random seed, which is recorded in the log file
# Create images?
USE_VIS = 1

# Specify the ecoregions to cut out of the sm data.
#REG_LIST = ["6.2.10", "6.2.12", "6.2.13", "6.2.14"]
#REG_LIST = [f"6.2"]#.{l3}" for l3 in range(3, 16) if l3!=6]
REG_LIST = ["8.5.1"]#, "8.5.2"]#, "8.5.3"]#"8.5", 
# Specify the size (in km) of the buffer you want on the training data.
BUFFER = 0#100000

# Dictionary with models as keys and model-specific parameter:arglist dictionaries as values.
MODICT = {
#          "1NN":{"-p":[1]}, 
#          "KKNN":{"-k":[10]}, 
          "RF":{}, 
#          "HYPPO":{"-p":[1], "-k":[10], "-D":[3], "-v":[2]},
#          "UNMODEL":{}
         }
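
The `MODICT` structure maps each model name to a dictionary of command-line flags, each with a list of values to try. The sketch below shows one plausible way such a dictionary could be expanded into one argument list per parameter combination. How `model()` actually consumes `MODICT` is not shown in this notebook, so the function `expand_args` and the extra value `20` for `-k` are assumptions for illustration.

```python
from itertools import product

def expand_args(modict):
    """Yield (model_name, [flag, value, ...]) for every combination of
    the listed parameter values (hypothetical helper)."""
    for name, params in modict.items():
        flags = list(params)
        # product() over zero iterables yields a single empty tuple, so a
        # model with no parameters (e.g. "RF": {}) still produces one run.
        for values in product(*(params[flag] for flag in flags)):
            args = []
            for flag, value in zip(flags, values):
                args += [flag, str(value)]
            yield name, args

# Same shape as MODICT above; the second value for "-k" is made up.
modict = {
    "RF": {},
    "HYPPO": {"-p": [1], "-k": [10, 20], "-D": [3], "-v": [2]},
}

runs = list(expand_args(modict))
for name, args in runs:
    print(name, args)
```

Listing several values for a flag multiplies the number of runs, so a grid over two flags with three values each would produce nine runs for that model.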

### Libraries and utility functions
Miscellaneous Python functions to assist all the processes below.

In [2]:
# Required packages
# R: raster, caret, quantregForest, rgdalless, kknn, rasterVis
# Python2: pandas, sklearn, argparse, sys, numpy, itertools, random, 
#          scipy, matplotlib, re, ipykernel
# Python3: argparse, re, itertools, random, scipy, ipykernel

import pathlib, proc
from subprocess import Popen

# https://docs.python.org/2/library/os.html#files-and-directories
from os import listdir, chdir 

from __utils import *
# The following are in __utils
#def bash(*argv):
#    call([str(arg) for arg in argv])
#    
#def append_to_folder(folder_path, suffix):
#    if type(folder_path)==str:
#        return folder_path.rstrip("/") + str(suffix)
#    else:
#        folder = folder_path.name + str(suffix)
#        return folder_path.parent.joinpath(folder)
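
For reference, the two helpers quoted in the comments above can be written as a standalone, runnable snippet. This is a sketch based only on those comments; the real `__utils` module may differ. Returning the exit code from `bash` and using `isinstance` are small additions not present in the quoted code, and the example paths are illustrative.

```python
import pathlib
from subprocess import call

def bash(*argv):
    """Run a command, converting every argument to a string first.
    Returning the exit code is an addition to the quoted original."""
    return call([str(arg) for arg in argv])

def append_to_folder(folder_path, suffix):
    """Append `suffix` to the last component of a folder path,
    accepting either a string or a pathlib path."""
    if isinstance(folder_path, str):
        # Drop any trailing slash so the suffix attaches to the folder name.
        return folder_path.rstrip("/") + str(suffix)
    folder = folder_path.name + str(suffix)
    return folder_path.parent.joinpath(folder)

# Illustrative paths; both forms give a sibling folder named "t-postproc".
print(append_to_folder("data/2016/t/", "-postproc"))
print(append_to_folder(pathlib.PurePosixPath("data/2016/t"), "-postproc"))
```

Accepting both strings and path objects lets callers pass values before or after they have been converted with `pathlib`.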


## Stage 1: Curating Data

In [3]:
from __A_curate import curate

## Stage 2: Generating a model; making predictions

In [4]:
from __B_model import model

## Stage 3: Analysis and Visualization

In [5]:
from __C_analyze import analysis

In [6]:
from __D_visualize import visualize

### Wrapper Script

In [7]:
########################################
# Wrapper script for most of the workflow
#
#     Arguments:
#         START   directory of folder containing both train and predi folder
#                 the train folder contains regional files
#                 the predi folder must contain regional files 
#                 with the same names as in the train folder

START = pathlib.Path(START).resolve()
print(f"Starting folder: {START}\n")
    
# Set the working directory to the code subfolder, for running the scripts
chdir(pathlib.Path(START, CODE))

# Change data files and folders to full paths
DATA = START.joinpath(DATA)
if MAKE_T_E:
    if SM_FILE:
        SM_FILE = DATA.joinpath(SM_FILE)
        if not SM_FILE.exists():
            print(f"ERROR! Specified SM_FILE does not exist: {SM_FILE}")
    if COV_FILE:
        COV_FILE = DATA.joinpath(COV_FILE)
        if not COV_FILE.exists():
            print(f"ERROR! Specified COV_FILE does not exist: {COV_FILE}")
    if EVAL_FILE:
        EVAL_FILE = DATA.joinpath(EVAL_FILE)
        if not EVAL_FILE.exists():
            print(f"ERROR! Specified EVAL_FILE does not exist: {EVAL_FILE}")
else:
    SM_FILE = ""
    COV_FILE = ""
    EVAL_FILE = ""
TRAIN_DIR = DATA.joinpath(TRAIN_DIR)
EVAL_DIR = DATA.joinpath(EVAL_DIR)
OUTPUT = START.joinpath(OUTPUT).joinpath(str(YEAR))

print(f"Original training data in: {TRAIN_DIR}")
print(f"Original evaluation data in: {EVAL_DIR}")
    
# ... so we can suffix them at will
MNTH_SUFX = f"-{MONTH}"

##########################################
# 1 Data Processing

# ORIG is the sm data before any filtering, for use with analysis()
# TRAIN is the training set after filtering and pca, if specified
# EVAL is the evaluation set after filtering and pca, if specified
curate_input = [OUTPUT, SM_FILE, COV_FILE, EVAL_FILE, REG_LIST, BUFFER, 
                TRAIN_DIR, MONTH, EVAL_DIR, USE_PCA, VALIDATE, RAND_SEED]
print(f"curate(*{curate_input})")
ORIG, TRAIN, EVAL = curate(*curate_input)

print(f"Curated training data in: {TRAIN}")
print(f"Curated evaluation data in: {EVAL}")

if len(listdir(TRAIN)) != len(listdir(EVAL)):
    print(listdir(TRAIN))
    print(listdir(EVAL))
    raise Exception("We've got a problem! TRAIN and EVAL should have the same contents.")

##########################################
# 2 Modeling

PRED = OUTPUT.joinpath(str(MONTH))
NOTE = ""
if BUFFER:
    NOTE += f"-{BUFFER}"
if USE_PCA:
    NOTE += "-PCA"
model_input = [0, TRAIN, EVAL, PRED, MODICT, NOTE]
print(f"model(*{model_input})")
model(*model_input)

##########################################
# 3 Analysis & Visualization

for region in REG_LIST:
    #LOGS = os.path.join(PRED, region, SUB_LOGS,"")
    
    if VALIDATE:
        analysis_input = [region, PRED, ORIG, VALIDATE]
        print(f"analysis(*{analysis_input})")
        analysis(*analysis_input)
    
    if USE_VIS:
        # Specify the input data folder and the output figures folder
        DATS = PRED.joinpath(region)
        OUTS = DATS.joinpath(SUB_FIGS)
        
        visualize_input = [DATS, OUTS, 1, VALIDATE, 1, 0]
        print(f"visualize(*{visualize_input})")
        visualize(*visualize_input)

Starting folder: /home/dror/Src_SoilMoisture/SOMOSPIE

Original training data in: /home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/t
Original evaluation data in: /home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/e
curate(*[PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/out/2016'), '', '', '', ['8.5.1'], 0, PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/t'), 4, PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/e'), 0, 1.25, 0])
Curation log file: /home/dror/Src_SoilMoisture/SOMOSPIE/out/2016/proc-log4.txt
Curated training data in: /home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/t-postproc
Curated evaluation data in: /home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/e-postproc
model(*[0, PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/t-postproc'), PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/data/2016/e-postproc'), PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/out/2016/4'), {'RF': {}}, ''])
analysis(*['8.5.1', PosixPath('/home/dror/Src_SoilMoisture/SOMOSPIE/out

<Figure size 432x288 with 0 Axes>

<Figure size 432x288 with 0 Axes>
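
In the run logged above, `curate` receives `1.25` for `VALIDATE` and `0` for `RAND_SEED`. According to the comments in the user-input cell, `1.25` asks for 25% of the original data to be split off for validation. The sketch below shows one way such a value could be decoded and applied; it assumes the integer part is the mode and the fractional part is the test fraction. The function `split_for_validation` is hypothetical and is not taken from `__A_curate`, and unlike the real workflow it does not write the chosen seed to a log file.

```python
import random

def split_for_validation(rows, validate, seed=0):
    """Split `rows` into (train, test) according to a VALIDATE-style value
    (hypothetical helper; the decoding rule is an assumption)."""
    mode = int(validate)                  # e.g. 1.25 -> mode 1
    fraction = round(validate - mode, 6)  # e.g. 1.25 -> 0.25
    if mode != 1 or fraction == 0:
        # No split requested: everything stays in the training set.
        return list(rows), []
    # A seed of 0 means "pick a fresh seed", mirroring RAND_SEED above.
    rng = random.Random(seed if seed else None)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * fraction)
    return shuffled[n_test:], shuffled[:n_test]

rows = list(range(100))  # stand-in for 100 soil moisture records
train_rows, test_rows = split_for_validation(rows, 1.25, seed=42)
print(len(train_rows), len(test_rows))
```

Passing a nonzero seed makes the split reproducible, which is useful when comparing models on the same held-out records.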