
# Simple application of Generalized Linear Models (GLM)

# If the "devtools" package is not installed yet, install and load it first:
# install.packages("devtools")
# library(devtools)
# Installing the downscaleR R-package
# devtools::install_github(c("SantanderMetGroup/downscaleR.java@stable","SantanderMetGroup/downscaleR@stable"))
library(downscaleR)
value <- "/home/maialen/WORK/LOCAL/downscaler_wiki/data/datasets/VALUE_ECA_86_v2"

## Loading Data

#### Observation data

The VALUE_ECA&D dataset contains weather data of 86 stations spread over Europe, and is available for download:

download.file("http://meteo.unican.es/work/downscaler/data/VALUE_ECA_86_v2.tar.gz", 
              destfile = "mydirectory/VALUE_ECA_86_v2.tar.gz")
# Extract files from the tar.gz file
untar("mydirectory/VALUE_ECA_86_v2.tar.gz", exdir = "mydirectory")
# Data inventory
value <- "mydirectory/VALUE_ECA_86_v2"

stationInfo(value)

In particular, we see that the SAN SEBASTIAN-IGUELDO station is station number 41 in the dataset (code = "000234").

There are different options for loading arbitrary stations from a dataset (see section 3.2. Accessing and loading station data). In this case we are interested in loading one single station, located in San Sebastian and called "Igueldo". The easiest approach is to provide approximate coordinates for San Sebastian (-2.03E, 43.30N), and the function will return the closest station to the given point. Alternatively, it is also possible to retrieve the station by its code (safer in case there are other nearby stations that could be confused with it):

obs <- loadStationData(dataset = value, 
                                 var = "precip", 
                                 stationID = "000234", 
                                 season=c(12,1,2), 
                                 years = 1991:2000)
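
For reference, the coordinate-based alternative mentioned above would look like the sketch below; it assumes that loadStationData accepts point coordinates through the lonLim and latLim arguments (check help("loadStationData") in your downscaleR version):

# Not run: hedged sketch of loading the station closest to a point
# obs.coord <- loadStationData(dataset = value,
#                              var = "precip",
#                              lonLim = -2.03,
#                              latLim = 43.30,
#                              season = c(12,1,2),
#                              years = 1991:2000)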

There is an R object equivalent to the loading example shown (type help(VALUE_Igueldo_tp)); thus, the previous steps can be skipped:

data(VALUE_Igueldo_tp)
obs <- VALUE_Igueldo_tp
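
A quick look at the structure of the loaded object can be obtained with base R (the exact component names may vary across downscaleR versions):

str(obs, max.level = 1)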

#### Model data

In this example, the NCEP reanalysis data will be used. These data can be accessed and loaded remotely through the ECOMS User Data Gateway (ECOMS-UDG) with the loadECOMS function from the ecomsUDG.Raccess package or, alternatively, by accessing the netcdf files remotely with downscaleR through the corresponding ncml file (see section 1.1. Accessing and loading grid data). However, in this case we are going to use a subdomain of the NCEP dataset encompassing the period 1961-2010 for the Iberian Peninsula, which is available as a tar.gz file for download:

download.file("http://meteo.unican.es/work/downscaler/data/Iberia_NCEP.tar.gz", 
              destfile = "mydirectory/Iberia_NCEP.tar.gz")
# Extract files
untar("mydirectory/Iberia_NCEP.tar.gz", exdir = "mydirectory")
# Define the path to the ncml file
ncep <- "mydirectory/Iberia_NCEP.ncml"
dir <- "/home/maialen/WORK/LOCAL/downscaler_wiki/"
ncep <- (paste(dir, "data/datasets/Iberia_NCEP/Iberia_NCEP.ncml", sep=""))
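
For reference only, the remote access through the ECOMS-UDG mentioned above would follow a pattern like the hedged sketch below; the dataset label and the available arguments depend on the ecomsUDG.Raccess version, and registration in the ECOMS-UDG is required, so this is a sketch rather than a working call:

# Not run: sketch of remote access via the ECOMS User Data Gateway
# library(ecomsUDG.Raccess)
# # authentication with the ECOMS-UDG is required beforehand (see its documentation)
# tas.ncep <- loadECOMS(dataset = "NCEP_reanalysis1", var = "tas",
#                       lonLim = c(-10,5), latLim = c(34,44),
#                       season = c(12,1,2), years = 1991:2010)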

The set of predictors (a.k.a. atmospheric pattern) P5 is next loaded for the training period. A single variable for a specific spatio-temporal domain is what we term a field in downscaleR terminology. In this case, for the same spatio-temporal domain we load two different variables, which can be merged together to form a multifield. A multifield can be loaded from a given dataset with the loadMultiField function, which takes care of regridding to ensure the spatial matching of the different input fields. (NOTE: it is also possible to load the different fields separately and then join them with the multifield constructor makeMultiField; see the sketch below.)
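
The following is a minimal sketch of that alternative, assuming loadGridData and makeMultiField accept the arguments shown (see help("loadGridData") and help("makeMultiField") before running):

# Not run: load the fields separately and join them into a multifield
# tas.field <- loadGridData(ncep, var = "tas", lonLim = c(-10,5), latLim = c(34,44),
#                           season = c(12,1,2), years = 1991:2010)
# psl.field <- loadGridData(ncep, var = "psl", lonLim = c(-10,5), latLim = c(34,44),
#                           season = c(12,1,2), years = 1991:2010)
# modeldata.alt <- makeMultiField(tas.field, psl.field)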

Next, we load the variables of the atmospheric pattern P5 (mean surface temperature and sea-level pressure), whose standard names are "tas" and "psl" respectively, as indicated in the vocabulary (see help("vocabulary") for details).

modeldata <- loadMultiField(ncep, vars = c("tas","psl"), 
                       lonLim = c(-10,5), 
                       latLim = c(34,44), 
                       season = c(12,1,2), 
                       years = 1991:2010)
# This is what the multifield looks like:
plotMeanField(modeldata)
### Subsetting the model data into train and test data

In this example, NCEP is used both for training and testing, considering different time periods for each. Next we apply the subsetField function to the multifield to extract the data corresponding to the train and test periods (1991-2000 and 2001-2010, respectively):

train <- subsetField(field = modeldata, years = 1991:2000)
test <- subsetField(field = modeldata, years = 2001:2010)

### Principal Components and EOFs

It is standard practice in atmospheric science to apply principal component analysis (PCA, a.k.a. EOF analysis) to the predictors: it removes a large amount of redundant information while minimizing the loss of variability, and is therefore an efficient technique for reducing data dimensionality. Thus, instead of using the raw multifield loaded above as predictor (which is also possible), we first perform a PCA on it. In this example, we retain the PCs explaining 99% of the variance of each field (note that it is also possible to retain all PCs, or a user-defined number of them; type help("prinComp") for details).

pca.pred <- prinComp(train, v.exp = .99)
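
As mentioned above, a fixed number of EOFs can be retained instead of an explained-variance threshold; a sketch, assuming prinComp exposes an n.eofs argument as described in its help page:

# Not run: retain the first 10 EOFs of each field instead of a variance threshold
# pca.pred.10 <- prinComp(train, n.eofs = 10)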

The main spatial modes after the PCA can be visually inspected using the plotEOF function. For instance, for sea-level pressure these are all the EOFs obtained:

plotEOF(pca.pred, "psl")

## Application of the glm method

The generalized linear regression method is applied to the NCEP test data. The arguments follow the order observations > predictors > test data. Further optional arguments control the different options of the glm fitting, for example the number of PCs to be used (n.eofs = 15); type help("downscale") for more details. For reference, the internal implementation of the method (the glimpr function) is reproduced below; in practice it is invoked through the downscale wrapper shown after the listing.

library(abind)
glimpr<- function(obs, pred, sim, pr.threshold = .1, n.eofs = 15) {
      modelPars <- ppModelSetup(obs, pred, sim)
      if(is.null(n.eofs)){
            n.eofs <-  dim(modelPars$pred.mat)[2]
      }
      pred <- NULL
      sim <- NULL
      # Prepare predictand matrix
      ymat <- if (isTRUE(modelPars$stations)) {
            obs$Data
      } else {
            if (is.null(dim(obs$Data))) {
                  as.matrix(obs$Data)
            } else {
                  array3Dto2Dmat(obs$Data)
            }
      }
      # binarize      
      ymat.bin <- ymat
      ymat.bin[which(ymat < pr.threshold)] <- 0
      ymat.bin[which(ymat > 0)] <- 1L
      wet.prob <- apply(ymat.bin, 2, function(x) {sum(x, na.rm = TRUE) / length(na.exclude(x))})
      # Filter empty gridcell series
      rm.ind <- which(is.na(wet.prob))
      # Valid columns (cases)
      cases <- if (length(rm.ind) > 0) {
            (1:ncol(ymat))[-rm.ind]      
      } else {
            1:ncol(ymat)
      }
      # Models
      message("[", Sys.time(), "] Fitting models ...")
      pred.list <- suppressWarnings(lapply(1:length(cases), function(x) {
            # Occurrence model: binomial-logit GLM fitted on the binary (wet/dry) series
            mod.bin <- glm(ymat.bin[ ,cases[x]] ~ ., data = as.data.frame(modelPars$pred.mat)[,1:n.eofs], family = binomial(link = "logit"))
            # Predicted occurrence: probabilities above the quantile given by the observed
            # wet-day frequency are set to 1, the rest are truncated to 0
            sims.bin <- sapply(1:length(modelPars$sim.mat), function(i) {
                
                  pred <- predict(mod.bin, newdata = as.data.frame(modelPars$sim.mat[[i]])[,1:n.eofs], type = "response")
                  pred[which(pred >= quantile(pred, 1 - wet.prob[cases[x]]))] <- 1L
                  pred <- as.integer(pred)
            })
            mod.bin <- NULL
            wet <- which(ymat[ ,cases[x]] != 0)
            # TODO: handle exception when model is over-specified (https://stat.ethz.ch/pipermail/r-help/2000-January/009833.html)
            # Amount model: Gamma GLM with log link, fitted on wet days only
            mod.gamma <- glm(ymat[ ,cases[x]] ~., data = as.data.frame(modelPars$pred.mat)[,1:n.eofs], family = Gamma(link = "log"), subset = wet)
            # Downscaled series: predicted amounts masked by the simulated occurrence
            sims <- sapply(1:length(modelPars$sim.mat), function(i) {
                  unname(predict(mod.gamma, newdata = as.data.frame(modelPars$sim.mat[[i]])[,1:n.eofs], type = "response") * sims.bin[ ,i])
            })
            return(unname(sims))
      }))
      # Refill with NaN series
      if (length(rm.ind) > 0) {
            aux.list <- rep(list(matrix(nrow = nrow(ymat), ncol = length(modelPars$sim.mat))), ncol(ymat))
            aux.list[cases] <- pred.list
            pred.list <- aux.list
            aux.list <- NULL
      }
      x.obs <- getCoordinates(obs)$x
      y.obs <- getCoordinates(obs)$y
      aux.list <- lapply(1:length(modelPars$sim.mat), function (i) {
            aux <- lapply(1:length(pred.list), function(j) {
                  pred.list[[j]][ ,i]
            })
            aux.mat <- do.call("cbind", aux)
            out <- if (!modelPars$stations) {
                  mat2Dto3Darray(aux.mat, x.obs, y.obs)
            } else {
                  aux.mat
            } 
            aux.mat <- NULL
            return(out)
      })
      # Data array - rename dims
      dimNames <- renameDims(obs, modelPars$multi.member)
      obs$Data <- drop(unname(do.call("abind", c(aux.list, along = -1))))
      aux.list <- NULL
      attr(obs$Data, "dimensions") <- dimNames
      attr(obs$Data, "downscaling:method") <- "glm"
      attr(obs$Data, "downscaling:simulation_data") <- modelPars$sim.dataset
      attr(obs$Data, "downscaling:n_eofs") <- n.eofs
      
      # Date replacement
      obs$Dates <- modelPars$sim.dates 
      message("[", Sys.time(), "] Done.")
      return(obs)
}
# The downscale wrapper is the user-facing interface; this call is equivalent to
# calling glimpr(obs, pca.pred, test, n.eofs = 15) directly
output <- downscale(obs, pca.pred, test, method = "glm", n.eofs = 15)

NOTE: Instead of the principal component results, the pred argument can be filled with the original multifield. The use of the function would be similar:

# Do not run, just as an example
noPCA.output <- downscale(obs, train, test, method = "glm")

However, given the advantages of using EOFs rather than the raw fields, only the results obtained with the principal components are shown in the following.


The structure of the output is similar to that of the input observations, but the attributes in the Data element indicate that this is the output of a glm downscaling, that 15 PCs have been used, and that the test data correspond to NCEP (note that the value of the last attribute depends on the user's machine, since the dataset is stored locally):

attributes(output$Data)

## Validation of results

In order to show a brief comparison of the downscaled series versus the actual series, we load the validation data from the VALUE dataset for the test period:

val<- loadStationData(dataset = value, 
                                 var = "precip", 
                                 stationID = "000234", 
                                 season=c(12,1,2), 
                                 years = 2001:2010)

The following lines make a basic comparison of the observed and downscaled time series for the validation period, generating the next figure:

# The test period is coerced to a POSIXlt class object to define the temporal dimension
test.period <- as.POSIXlt(val$Dates$start, tz = "GMT")

# Plot 1: time series of observed and downscaled precipitation
plot(test.period, output$Data, ty = 'l', ylab = "precip")
lines(test.period, val$Data, col = "blue", lty = 2)
legend("topleft", c("Downscaled", "Observed"), lty = c(1,2), col = c(1,4))
# Plot 2: scatterplot of downscaled vs. observed values
plot(output$Data, val$Data, asp = 1, ylab = "Observed", xlab = "Downscaled", col = "blue")
lm1 <- lm(val$Data ~ output$Data)
abline(reg = lm1, col = "red")
r2 <- round(summary(lm1)$adj.r.squared, 2)
bias <- round(mean(output$Data - val$Data, na.rm = TRUE), 2)
mtext(paste("Rsq =", r2, " / bias = ", bias))

The result obtained can be checked against Gutiérrez et al. 2013, by comparing the correlation obtained in this exercise (r^2 = 0.44, equivalent to a correlation of ~0.66) with the values depicted in Fig. 6 of that paper (upper-left panel of the right block, corresponding to method M1a). From these results we can expect a notable improvement of model performance.
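
The daily correlation underlying that r^2 value can also be computed directly from the series with base R (using pairwise-complete observations):

cor(val$Data, output$Data, use = "pairwise.complete.obs")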

Although we have downscaled daily data, we can analyze the inter-annual variability by using the getYearsAsINDEX function to extract the year as an index for computing the desired statistics. In the next example the annual mean and standard deviation are computed and plotted.

# Annual time series plot
# Year index of the test period
years <- getYearsAsINDEX(val)

##Compute statistic
x <- tapply(val$Data, INDEX = years, FUN = mean)
xs <- tapply(val$Data, INDEX = years, FUN = sd)
# Raw model series at one grid box of the test multifield (indexing depends on dimension ordering)
y <- tapply(test$Data[1,,3,3], INDEX = years, FUN = mean)
ys <- tapply(test$Data[1,,3,3], INDEX = years, FUN = sd)
w <- tapply(output$Data, INDEX = years, FUN = mean, na.rm = T)
ws <- tapply(output$Data, INDEX = years, FUN = sd, na.rm = T)

plot(1:10, x, lwd = 2, ty = "l", ylim = c(-5,15), xlab = "Years",
     ylab = "Annual mean of the Daily tp", cex = .6)
polygon(x = c(1:10, 10:1), y = c(xs + x, rev(x - xs)), col = "grey", border = NA)
polygon(x = c(1:10, 10:1), y = c(ys + y, rev(y - ys)), col = rgb(1,0,0,0.2), border = NA)
polygon(x = c(1:10, 10:1), y = c(ws + w, rev(w - ws)), col = rgb(0,0,1,0.2), border = NA)
lines(1:10, x, lwd = 2, ty = "l")
lines(1:10, y, lwd = 2, ty = "l", col = "red")
lines(1:10, w, col = "blue", lwd = 2)
legend(6, 10, legend = c("observed", "simulated", "downscaled"),
       fill = c("black", "red", "blue"), box.lwd = 0, cex = .6)
