# Feature Extraction From Model Output

In this exercise we'll use NMFk to extract temporal features from the output of a hydrologic model of the Colorado River Basin. We'll focus on several hydrological and climate variables related to drought.


In [33]:
import DelimitedFiles
import HDF5
import FileIO
import JLD2
import Mads
import NMFk
import PlotlyJS

We define our six drought features below. We'll also look at two global climate models (GCMs): GFDL-ESM2G and IPSL-CM5A-LR. Temperature and precipitation output from these models drives our hydrologic model. The future simulations of both models show similar increases in temperature. However, IPSL-CM5A-LR shows a decrease in future precipitation, while GFDL-ESM2G shows an increase.

In [38]:
varnames_long = ["Dry dates", "Max evaporation", "Min stream flow", "Min soil moisture", "Max snow water equivalent", "Max temperature"]
varnames = [:dryd, :evapx, :qn, :soilmn, :swex, :tx]
gcm_models = ["GFDL-ESM2G", "IPSL-CM5A-LR"]

2-element Array{String,1}:
 "GFDL-ESM2G"
 "IPSL-CM5A-LR"

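The short keys and the long labels are used side by side later on, so the two lists have to stay the same length and in the same order. A minimal sketch of one way to enforce that is to keep a single lookup table and derive both lists from it. The name `labels` and the pairing of each key with its label are assumptions inferred from the two lists above, not something the notebook states.

```julia
# Hypothetical lookup table pairing each short key with a readable label.
# The pairing is inferred from the order of the two lists above.
labels = Dict(
    :dryd   => "Dry dates",
    :evapx  => "Max evaporation",
    :qn     => "Min stream flow",
    :soilmn => "Min soil moisture",
    :swex   => "Max snow water equivalent",
    :tx     => "Max temperature",
)

# A fixed key order, since Dict iteration order is not guaranteed.
keys_short = [:dryd, :evapx, :qn, :soilmn, :swex, :tx]
keys_long = [labels[k] for k in keys_short]

# The HDF5 groups are keyed by strings, so convert the symbols when reading.
hdf5_keys = string.(keys_short)
```

Deriving `keys_long` from `keys_short` means a label can never be added without a matching key.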
Here we show the names of the subwatersheds within the Colorado River Basin where we have data. Some watersheds share a name, so we give each one a unique name.

In [9]:
indir = "data/"
dr, hr = DelimitedFiles.readdlm("data/CRB.csv", ','; header=true);
dn, hn = DelimitedFiles.readdlm("data/huc8id_name_unique.csv", ','; header=true);
dx = DelimitedFiles.readdlm("data/huc8_list.txt", ' ');
ii = indexin(dr[:,11], dn[:,1]);
ii = indexin(dx, dn[:,1]);
watersheds = dn[ii[ii .!= nothing], 3];
dn[end-10:end, :]

11×3 Array{Any,2}:
 15050301  "Upper Santa Cruz"  "Upper Santa Cruz"
 15060202  "Upper Verde"       "Upper Verde"
 15010008  "Upper Virgin"      "Upper Virgin"
 14050005  "Upper White"       "Upper White"
 14050001  "Upper Yampa"       "Upper Yampa"
 14040109  "Vermilion"         "Vermilion"
 14030001  "Westwater Canyon"  "Westwater Canyon"
 15060102  "White"             "White One"
 15010011  "White"             "White Two"
 14060006  "Willow"            "Willow"
 15020004  "Zuni"              "Zuni"
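The lookup in the cell above relies on `indexin`, which returns `nothing` for every ID it cannot find. The sketch below shows that behaviour on toy data; the IDs and names are copied from the table above, except `99999999`, which is a made-up ID that is deliberately missing. All variable names here are illustrative.

```julia
# Toy stand-ins for the ID column and the unique-name column of the name table.
table_ids = [15060102, 15010011, 14060006, 15020004]
table_names = ["White One", "White Two", "Willow", "Zuni"]

# IDs we want names for; 99999999 is a made-up ID that is not in the table.
wanted_ids = [15010011, 99999999, 15020004]

# For each wanted ID, indexin gives its position in table_ids, or `nothing`.
idx = indexin(wanted_ids, table_ids)

# Keep only the hits, as a plain Vector{Int}, before indexing into the names.
found = Int[i for i in idx if i !== nothing]
selected = table_names[found]
```

Building `found` as a `Vector{Int}` keeps the `nothing` entries out of the index vector, which is the same effect the `ii .!= nothing` mask has in the cell above.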

Each file contains data for 6 global climate models and 11 climate/hydrology parameters. For each combination there is a 134 x 73 matrix: 134 sub-watersheds by 73 five-day periods, which together form a 1-year time series of the 30-year average for that parameter.

In [39]:
file = "data/extremes_historical_06222020_5day.h5"

hf = HDF5.h5open(file)
@show(names(hf))
a = HDF5.read(hf, names(hf)[1]);
@show(collect(keys(a)))
size(a[string(varnames[1])])

names(hf) = ["GFDL-ESM2G", "GFDL-ESM2M", "HadGEM2-ES365", "IPSL-CM5A-LR", "MIROC-ESM", "MPI-ESM-LR"]
collect(keys(a)) = ["qn", "soilmx", "dryd", "tn", "windx", "qx", "swex", "tx", "soilmn", "evapx", "precx"]


(134, 73)
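The second dimension of 73 is consistent with splitting a 365-day year into 5-day windows. The sketch below works through that arithmetic and builds a toy matrix with the same layout. It assumes a 365-day year with windows starting on day 1; `period_starts` is a hypothetical name, and a vector like it could serve as the time axis that the later loop checks against, although the notebook does not show how that axis was defined.

```julia
# Assumption: each column covers one 5-day window of a 365-day year.
ndays = 365
window = 5
period_starts = 1:window:ndays        # first day-of-year of each window
nperiods = length(period_starts)

# Toy matrix with the same layout as one variable: watersheds × periods.
nwatersheds = 134
indice = rand(nwatersheds, nperiods)

# Transposing puts time along the rows, as the NMFk loop does with permutedims.
X = permutedims(indice)
```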

In [16]:
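# Run NMFk on the climate/hydrology indices
# NOTE: `case` and `xaxis` are used below but are not defined in the cells shown here;
# they must be set before this cell runs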
hf = HDF5.h5open(file)
nkrange = 2:6
nNMF = 10
# Loop through GCM models
for n in gcm_models
        a = HDF5.read(hf, n)
        # for i = 1:length(a)
        # One slot per selected variable, so the results line up with varnames_long
        ka = Vector{Int64}(undef, length(varnames))
        kaa = Vector{Vector{Int64}}(undef, length(varnames))
        wg = Vector{Vector{Int64}}(undef, length(varnames))
        # Run climate/hydrology indices in parallel
        Threads.@threads for i in 1:length(varnames)
                indice = a[string(varnames[i])]
                if sum(indice) == 0
                        @warn("$case $n is empty!")
                        continue
                end
                if size(indice, 1) != length(watersheds) || size(indice, 2) != length(xaxis)
                        @warn("$case $n has wrong dimensions: $(size(indice))!")
                        continue
                end
                wrange = vec(sum(indice; dims=2) .!= 0)
                if sum(.!wrange) > 0
                        @warn("Watersheds $(sum(.!wrange)) are empty: $(watersheds[.!wrange])")
                end
                @info("Processing $(varnames_long[i]) ...")
                Xo = permutedims(indice) 
                X = (Xo .- minimum(Xo)) / (maximum(Xo) - minimum(Xo)) # normalize matrix values
                # Here is where we execute NMFk on climate indices
                W, H, fitquality, robustness, aic, kopt = NMFk.execute(X, nkrange, nNMF; load=true, resultdir="results-$(case)/$(n)/$(varnames_long[i])")
                ka[i] = kopt
                @info("$(varnames_long[i]): optimal number of features: $kopt")
                kaa[i] = NMFk.getks(nkrange, robustness[nkrange])
                @info("$(varnames_long[i]): plausible number of features: $(kaa[i])")
                # Loop through differing number of features k
                for k in kaa[i]
                        @info(k)
                        if ispath("figures-$(case)/$(n)/$(varnames_long[i])/") == false
                                mkpath("figures-$(case)/$(n)/$(varnames_long[i])/")
                        end
                        if isfile("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2")
                                watershed_groups = FileIO.load("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2", "watershed_groups")
                        else
                                # We run K means to cluster sub-watersheds into groups based on climate indices
                                watershed_groups = NMFk.robustkmeans(H[k], k).assignments
                                FileIO.save("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.jld2", "watershed_groups", watershed_groups)
                        end
                        if k == kopt
                                @info(watershed_groups)
                                wg[i] = watershed_groups
                        end
                        @info("Attribute vs number of features")
                        display([varnames_long ka])
                        @info("$case")
                        @info("Model: $n")
                        @info("Indice: $(varnames[i])")
                        @info("groups-$(k)-assignments")
                        @info([watersheds[wrange] watershed_groups[wrange]])
                        DelimitedFiles.writedlm("figures-$(case)/$(n)/$(varnames_long[i])/groups-$(k)-assignments.dat", [watersheds[wrange] watershed_groups[wrange]], ',')
                        for g in sort(unique(watershed_groups))
                                @info("$(varnames_long[i]): group #$(g) out of $(k):")
                                display(watersheds[wrange][watershed_groups[wrange] .== g])
                        end
                end
        end
        DelimitedFiles.writedlm("figures-$(case)/$(n)/number_of_features_optimal.dat", [varnames_long ka], ',')
        DelimitedFiles.writedlm("figures-$(case)/$(n)/number_of_features_plausible.dat", [varnames_long kaa], ',')
        s = NMFk.sankey(wg, varnames_long)
        PlotlyJS.savehtml(s, "figures-$(case)/$(n)/sankey-watersheds.html", :remote)
end
HDF5.close(hf)

LoadError: UndefVarError: HF5 not defined
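The core of the loop above is three steps: scale the matrix to [0, 1], factor it into temporal features and per-watershed weights, and group watersheds by those weights. The sketch below illustrates the same steps on toy data using only the Julia standard library. It is not the NMFk algorithm: it uses plain Lee and Seung multiplicative updates with a fixed number of features, and it groups watersheds by their largest weight instead of calling `NMFk.robustkmeans`. The names `rescale01` and `nmf_sketch`, the toy signals, and the iteration count are all assumptions made for illustration.

```julia
import Random
import LinearAlgebra

# Min-max scaling to [0, 1], the same formula the loop above applies to Xo.
rescale01(A) = (A .- minimum(A)) ./ (maximum(A) - minimum(A))

# Plain multiplicative-update NMF (Lee and Seung), NOT the NMFk algorithm:
# X (m×n) ≈ W (m×k) * H (k×n), with every entry of W and H nonnegative.
function nmf_sketch(X, k; iters=500, rng=Random.MersenneTwister(1))
    m, n = size(X)
    W = rand(rng, m, k)
    H = rand(rng, k, n)
    for _ in 1:iters
        # eps() guards against division by zero
        H .*= (W' * X) ./ (W' * W * H .+ eps())
        W .*= (X * H') ./ (W * H * H' .+ eps())
    end
    return W, H
end

# Toy data: 73 time steps × 20 watersheds mixed from 2 hidden temporal signals.
rng = Random.MersenneTwister(42)
t = range(0, 1; length=73)
signals = hcat(sin.(pi .* t) .^ 2, t)   # 73×2, nonnegative
mixing = rand(rng, 2, 20)               # 2×20, nonnegative
X = rescale01(signals * mixing)

# W holds the temporal features (columns); H holds each watershed's weights.
W, H = nmf_sketch(X, 2)
relerr = LinearAlgebra.norm(X - W * H) / LinearAlgebra.norm(X)

# Assign each watershed (column of H) to the feature with the largest weight.
groups = [argmax(H[:, j]) for j in 1:size(H, 2)]
```

Judging from its arguments and return values, `NMFk.execute` goes further than this sketch: it repeats the factorization (`nNMF` runs here) for every candidate number of features in `nkrange` and reports a robustness score used to pick `kopt`.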

In [32]:
Mads.display("figures/groups-2.pdf")
Mads.display("figures/groups-3.pdf")

Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/groups-2.pdf").

Fatal error:
Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/groups-3.pdf").

Fatal error:


Process(`open figures/groups-3.pdf`, ProcessExited(0))

In [35]:
Mads.display("figures/5day_GFDL-ESM2G_chrono_groups-3.pdf")
Mads.display("figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf")

Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/5day_GFDL-ESM2G_chrono_groups-3.pdf").

Fatal error:
Error encountered while load FileIO.File{FileIO.DataFormat{:PDF},String}("figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf").

Fatal error:


Process(`open figures/5day_IPSL-CM5A-LR_chrono_groups-3.pdf`, ProcessExited(0))