# Feature Extraction From Model Output

In this exercise, we use NMFk to extract temporal features from hydrologic model output for the Colorado River Basin. We focus on several hydrological and climate variables related to drought.
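
To make the idea of feature extraction concrete, here is a minimal sketch of plain non-negative matrix factorization with multiplicative updates. It is not NMFk's implementation: as we understand it, NMFk also repeats the factorization from many random starts and checks how robustly the solutions cluster. The function `nmf_sketch` and the synthetic data are hypothetical and only illustrate the factorization X ≈ W * H, where the columns of W are temporal features and H holds their weights per watershed.

```julia
using LinearAlgebra
using Random

# Plain NMF via multiplicative updates: X ≈ W * H with W, H ≥ 0.
# X is (time steps × watersheds); W holds k temporal features, H their weights.
function nmf_sketch(X::AbstractMatrix{<:Real}, k::Integer; iterations::Integer=500, seed::Integer=1)
    rng = Random.MersenneTwister(seed)
    m, n = size(X)
    W = rand(rng, m, k)
    H = rand(rng, k, n)
    tiny = 1e-12 # keeps the denominators strictly positive
    for _ in 1:iterations
        H .*= (W' * X) ./ (W' * W * H .+ tiny)
        W .*= (X * H') ./ (W * H * H' .+ tiny)
    end
    return W, H
end

# Synthetic example: 73 time steps, 6 "watersheds", 2 hidden temporal features
t = range(0, 1; length=73)
Wtrue = hcat(sin.(pi .* t) .^ 2, collect(t))
Htrue = [1.0 0.0 0.5 0.2 0.9 0.0; 0.0 1.0 0.5 0.8 0.1 1.0]
Xdemo = Wtrue * Htrue
Wdemo, Hdemo = nmf_sketch(Xdemo, 2)
relative_error = norm(Xdemo - Wdemo * Hdemo) / norm(Xdemo)
```

A small `relative_error` means two non-negative features are enough to reproduce the synthetic matrix.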


In [46]:
import DelimitedFiles
import HDF5
import FileIO
import JLD2
import Mads
import NMFk
import PlotlyJS

We define our eleven drought-related features below. We also look at two global climate models (GCMs): GFDL-ESM2G and IPSL-CM5A-LR. Temperature and precipitation output from these models drives our hydrologic model. The future simulations of both models show similar increases in temperature. However, IPSL-CM5A-LR shows a decrease in future precipitation, while GFDL-ESM2G shows an increase.

In [35]:
varnames_long = ["Min stream flow", "Max soil moisture", "Dry dates", "Min temperature",  "Max wind speed", "Max stream flow", "Max snow water equivalent", "Max temperature", "Min soil moisture", "Max evaporation", "Max precipitation"]
varnames = [:qn, :soilmx, :dryd, :tn, :windx, :qx, :swex, :tx, :soilmn, :evapx, :precx]
gcm_models = ["GFDL-ESM2G", "IPSL-CM5A-LR"]

2-element Array{String,1}:
 "GFDL-ESM2G"
 "IPSL-CM5A-LR"

Here we show the names of the subwatersheds within the Colorado River Basin for which we have data. Some watersheds share a name, so we give each one a unique name.

In [36]:
indir = "data/"
dr, hr = DelimitedFiles.readdlm("data/CRB.csv", ','; header=true);
dn, hn = DelimitedFiles.readdlm("data/huc8id_name_unique.csv", ','; header=true);
dx = DelimitedFiles.readdlm("data/huc8_list.txt", ' ');
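# Match the HUC8 ids listed for the data files against the lookup table
ii = indexin(dx, dn[:,1]);
# Column 3 holds the unique names; ids without a match are skipped
watersheds = dn[ii[ii .!= nothing], 3];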
dn[end-10:end, :]

11×3 Array{Any,2}:
 15050301  "Upper Santa Cruz"  "Upper Santa Cruz"
 15060202  "Upper Verde"       "Upper Verde"
 15010008  "Upper Virgin"      "Upper Virgin"
 14050005  "Upper White"       "Upper White"
 14050001  "Upper Yampa"       "Upper Yampa"
 14040109  "Vermilion"         "Vermilion"
 14030001  "Westwater Canyon"  "Westwater Canyon"
 15060102  "White"             "White One"
 15010011  "White"             "White Two"
 14060006  "Willow"            "Willow"
 15020004  "Zuni"              "Zuni"
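
The lookup above relies on `indexin`, which returns `nothing` for ids that have no match. The sketch below shows that behaviour on a three-row table taken from the output above. The id `99999999` is made up to demonstrate a missing match, and the variable names are illustrative.

```julia
# Small lookup table: HUC8 id, original name, unique name
lookup = [15060102 "White" "White One";
          15010011 "White" "White Two";
          14060006 "Willow" "Willow"]
requested_ids = [14060006, 99999999, 15060102] # 99999999 is a made-up id with no match

# indexin gives, for each requested id, its row in the lookup table or `nothing`
rows = indexin(requested_ids, lookup[:, 1])

# Keep only the matched rows, then read the unique names from column 3
matched = Int[r for r in rows if r !== nothing]
unique_names = lookup[matched, 3]
```

Filtering out `nothing` before indexing keeps the order of the requested ids while silently dropping unknown ones, so it is worth checking `length(unique_names)` against the number of rows in the data.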

Each file contains output for 6 global climate models and 11 climate/hydrology parameters. For each model and parameter there is a 134 × 73 matrix: 134 sub-watersheds by 73 five-day periods, which together form a 1-year time series of the 30-year average for that parameter.

In [37]:
file = "data/extremes_future_06222020_5day.h5"
case = "extremes_future_06222020_5day"

hf = HDF5.h5open(file)
@show(names(hf))
a = HDF5.read(hf, names(hf)[1]);
@show(collect(keys(a)))
xaxis = 1:1:73;
size(a[string(varnames[1])])

names(hf) = ["GFDL-ESM2G", "GFDL-ESM2M", "HadGEM2-ES365", "IPSL-CM5A-LR", "MIROC-ESM", "MPI-ESM-LR"]
collect(keys(a)) = ["qn", "soilmx", "dryd", "tn", "windx", "qx", "swex", "tx", "soilmn", "evapx", "precx"]


(134, 73)
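
The 73 columns follow from 365 = 73 × 5. The sketch below shows one way a daily series could be collapsed into 73 five-day values. The synthetic `daily` series is hypothetical, and using the mean or the maximum within each window is an assumption, since the preprocessing behind the HDF5 file is not shown here.

```julia
using Statistics

# Hypothetical daily series for one watershed: 365 values (no leap day)
daily = [10 + 5 * sinpi(2 * d / 365) for d in 1:365]

# Reshaping to 5 × 73 puts each consecutive 5-day window in one column
windows = reshape(daily, 5, 73)

# One value per window: the mean, or an extreme such as the maximum
five_day_mean = vec(mean(windows; dims=1))
five_day_max = vec(maximum(windows; dims=1))
```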

In [47]:
# Run NMFk on the climate/hydrology index data for each selected GCM
hf = HDF5.h5open(file)
nkrange = 2:4
nNMF = 10
# Loop through GCM models
for n in gcm_models
        a = HDF5.read(hf, n)
        # Pre-fill ka and kaa so that variables skipped by `continue` below do not leave undefined entries
        ka = zeros(Int64, length(a))
        kaa = [Int64[] for _ in 1:length(a)]
        wg = Vector{Vector{Int64}}(undef, length(a))
        # Process each climate/hydrology index in turn
        for i in 1:1:size(varnames)[1]
            indice = a[string(varnames[i])]
            if sum(indice) == 0
                    @warn("$case $n is empty!")
                    continue
            end

            if size(indice, 1) != length(watersheds) || size(indice, 2) !=  length(xaxis) || length(a) != length(varnames_long)
                    @warn("$case $n has wrong dimensions: $(length(a))!")
                    continue
            end
            wrange = vec(sum(indice; dims=2) .!= 0)
            if sum(.!wrange) > 0
                    @warn("Watersheds $(sum(.!wrange)) are empty: $(watersheds[.!wrange])")
            end
            @info("Processing $(varnames_long[i]) ...")
            Xo = permutedims(indice);
            X = (Xo .- minimum(Xo)) / (maximum(Xo) - minimum(Xo)); # normalize matrix values
            # Here is where we execute NMFk on climate indices
            W, H, fitquality, robustness, aic, kopt = NMFk.execute(X, nkrange, nNMF; load=true, resultdir="results-$(case)/$(n)/$(varnames_long[i])");
            ka[i] = kopt;
            @info("$(varnames_long[i]): optimal number of features: $kopt")
            kaa[i] = NMFk.getks(nkrange, robustness[nkrange]);
            @info("$(varnames_long[i]): plausible number of features: $(kaa[i])")
            # Loop through differing number of features k
            for k in kaa[i]
                    @info(k)
                    # mkpath is a no-op when the directory already exists
                    mkpath("figures-$(case)/$(n)/$(varnames_long[i])/")
                    if isfile("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2")
                            watershed_groups = FileIO.load("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2", "watershed_groups")
                    else
                            # Cluster sub-watersheds into k groups with k-means, based on their feature weights in H
                            watershed_groups = NMFk.robustkmeans(H[k], k).assignments
                            FileIO.save("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2", "watershed_groups", watershed_groups)
                    end
                    if k == kopt
                            #@info(watershed_groups)
                            wg[i] = watershed_groups
                    end
                    #@info("Attribute vs number of features")
                    #@info("$case")
                    #@info("Model: $n")
                    #@info("Indice: $(varnames[i])")
                    #@info("groups-$(k)-assignments")
                    #@info([watersheds[wrange] watershed_groups[wrange]])
                    DelimitedFiles.writedlm("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.dat", [watersheds[wrange] watershed_groups[wrange]], ',')
            end
        end
        DelimitedFiles.writedlm("figures-$(case)/$(n)/number_of_features_optimal.dat", [varnames_long ka], ',') 
        DelimitedFiles.writedlm("figures-$(case)/$(n)/number_of_features_plausible.dat", [varnames_long kaa], ',')
        s = NMFk.sankey(wg, varnames_long)
        PlotlyJS.savehtml(s, "figures-$(case)/$(n)/sankey-watersheds.html", :remote)
end
HDF5.close(hf)

┌ Info: Processing Min stream flow ...
└ @ Main In[47]:29
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Results
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:15
┌ Info: Optimal solution: 4 features
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:20
┌ 

Signals:  2 Fit:     2.084074 Silhouette:    0.9980779 AIC:    -81868.78
Signals:  3 Fit:    0.5895055 Silhouette:            1 AIC:    -93807.45
Signals:  4 Fit:    0.2769513 Silhouette:    0.9671435 AIC:    -100783.2
Signals:  2 Fit:     2.084074 Silhouette:    0.9980779 AIC:    -81868.78
Signals:  3 Fit:    0.5895055 Silhouette:            1 AIC:    -93807.45
Signals:  4 Fit:    0.2769513 Silhouette:    0.9671435 AIC:    -100783.2
Signals:  2 Fit:    0.5744071 Silhouette:     0.979927 AIC:    -94475.25
Signals:  3 Fit:   0.08049662 Silhouette:    -0.179252 AIC:    -113284.1
Signals:  4 Fit:    0.0419623 Silhouette:  -0.00161882 AIC:    -119242.5
Signals:  2 Fit:    0.5744071 Silhouette:     0.979927 AIC:    -94475.25
Signals:  3 Fit:   0.08049662 Silhouette:    -0.179252 AIC:    -113284.1
Signals:  4 Fit:    0.0419623 Silhouette:  -0.00161882 AIC:    -119242.5
Signals:  2 Fit:      33.6017 Silhouette:    0.9999979 AIC:    -54672.35
Signals:  3 Fit:     15.84675 Silhouette:    0.2323

┌ Info: Optimal solution: 4 features
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:20
┌ Info: Max snow water equivalent: optimal number of features: 4
└ @ Main In[47]:35
┌ Info: Max snow water equivalent: plausible number of features: [2, 3, 4]
└ @ Main In[47]:37
┌ Info: 2
└ @ Main In[47]:40
┌ Info: 3
└ @ Main In[47]:40
┌ Info: 4
└ @ Main In[47]:40
┌ Info: Processing Max temperature ...
└ @ Main In[47]:29
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilena

Signals:  2 Fit:     2.030861 Silhouette:    0.9887886 AIC:    -82121.79
Signals:  3 Fit:    0.6829456 Silhouette:    0.9888929 AIC:    -92368.21
Signals:  4 Fit:    0.3549449 Silhouette:     0.860446 AIC:    -98356.07
Signals:  2 Fit:     2.030861 Silhouette:    0.9887886 AIC:    -82121.79
Signals:  3 Fit:    0.6829456 Silhouette:    0.9888929 AIC:    -92368.21
Signals:  4 Fit:    0.3549449 Silhouette:     0.860446 AIC:    -98356.07
Signals:  2 Fit:    0.2691629 Silhouette:    0.9510745 AIC:    -101890.2
Signals:  3 Fit:   0.05102706 Silhouette:    0.2413497 AIC:    -117743.3
Signals:  4 Fit:   0.02344124 Silhouette:  -0.08126535 AIC:    -124938.3
Signals:  2 Fit:    0.2691629 Silhouette:    0.9510745 AIC:    -101890.2
Signals:  3 Fit:   0.05102706 Silhouette:    0.2413497 AIC:    -117743.3
Signals:  4 Fit:   0.02344124 Silhouette:  -0.08126535 AIC:    -124938.3
Signals:  2 Fit:     23.30288 Silhouette:     0.989792 AIC:    -58252.56
Signals:  3 Fit:       13.544 Silhouette:    0.1100

┌ Info: Optimal solution: 3 features
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:20
┌ Info: Max snow water equivalent: optimal number of features: 3
└ @ Main In[47]:35
┌ Info: Max snow water equivalent: plausible number of features: [2, 3]
└ @ Main In[47]:37
┌ Info: 2
└ @ Main In[47]:40
┌ Info: 3
└ @ Main In[47]:40
┌ Info: Processing Max temperature ...
└ @ Main In[47]:29
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NMFk /Users/ctalsma/.julia/packages/NMFk/t59Nq/src/NMFkExecute.jl:36
└ @ FileIO /Users/ctalsma/.julia/packages/FileIO/5JdlO/src/loadsave.jl:215
┌ Info: Loading requested but `casefilename` is not specified; casefilename = "nmfk" will be used!
└ @ NM
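
Two steps of the loop above can be sketched without NMFk: the min-max scaling applied to each matrix, and the grouping of watersheds by their feature weights. In the sketch, `minmax_scale` adds a guard for constant input, which the loop does not have. `dominant_feature` assigns each watershed to the feature with the largest weight; this is a simple stand-in and not what `NMFk.robustkmeans` does. Both helper names and the demo matrices are hypothetical.

```julia
# Min-max scaling to [0, 1], with a guard against division by zero for constant input
function minmax_scale(X::AbstractMatrix{<:Real})
    lo, hi = extrema(X)
    hi == lo && return zeros(Float64, size(X))
    return (X .- lo) ./ (hi - lo)
end

# Assign each column (watershed) of H to the row (feature) with the largest weight.
# This is a simple stand-in for clustering, not NMFk.robustkmeans.
dominant_feature(H::AbstractMatrix{<:Real}) = [argmax(view(H, :, j)) for j in axes(H, 2)]

Hweights = [0.9 0.1 0.4; 0.1 0.8 0.6] # hypothetical weights: 2 features × 3 watersheds
groups = dominant_feature(Hweights)
Xscaled = minmax_scale([2.0 4.0; 6.0 10.0])
```

Scaling over the whole matrix, rather than per watershed, preserves the relative magnitudes between watersheds, which matches the normalization used in the loop.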

In [48]:
Mads.display("figures/groups-2.pdf")
Mads.display("figures/groups-3.pdf")

Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/groups-2.pdf").

Fatal error:
Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/groups-3.pdf").

Fatal error:


Process(`open figures/groups-3.pdf`, ProcessExited(0))

In [35]:
Mads.display("figures/5day_GFDL-ESM2G_chrono_groups-3.pdf")
Mads.display("figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf")

Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/5day_GFDL-ESM2G_chrono_groups-3.pdf").

Fatal error:
Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf").

Fatal error:


Process(`open figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf`, ProcessExited(0))