# CPO

In [None]:
library("mlrCPO")

In [None]:
df = data.frame(a = 1:3, b = -(1:3) * 10)

**CPO**s are first-class objects in R that represent data manipulation. They can be combined to form networks of operations, they can be attached to `mlr` `Learner`s, and they have tunable hyperparameters that influence their behaviour.

# Lifecycle of a CPO

## CPO Constructor

In [None]:
print(cpoPca)  # example CPOConstructor

In [None]:
class(cpoPca)

CPO constructors have parameters that
* set the CPO Hyperparameters
* set the CPO ID (default NULL)
* restrict the data columns a CPO operates on (`affect.*` parameters)

In [None]:
names(formals(cpoPca))

## CPO

In [None]:
(cpo = cpoScale()) # construct CPO with default Hyperparameter values

In [None]:
class(cpo)  # CPOs that are not compound are "CPOPrimitive"

In [None]:
summary(cpo)  # detailed printing

In [None]:
# Functions that work on CPOs:
getParamSet(cpo)

In [None]:
getHyperPars(cpo)

In [None]:
setHyperPars(cpo, scale.center = FALSE)

In [None]:
getCPOId(cpo)

In [None]:
setCPOId(cpo, "MYID")

In [None]:
getCPOName(cpo)
getCPOName(setCPOId(cpo, "MYID"))  # the name includes the ID

In [None]:
getCPOAffect(cpo)  # empty, since no affect set
getCPOAffect(cpoPca(affect.pattern = "Width$"))

In [None]:
getCPOProperties(cpo)  # see properties explanation below

In [None]:
# these are internals
getCPOKind(cpo)  # trafo, retrafo, inverter
getCPOBound(cpo)  # databound, targetbound, both

### Exporting Parameters
When many CPOs are used together, their hyperparameters can become hard to keep track of. CPO lets the user control which hyperparameters get exported. The `export` parameter can be one of "export.default", "export.set", "export.unset", "export.default.set", "export.default.unset", "export.all", "export.none". "all" and "none" do what one expects; "default" exports the "recommended" parameters; "set" exports only the parameters that were set explicitly (and are not left at their default), while "unset" exports only the parameters that were not set. "default.set" and "default.unset" work like "set" and "unset", but restricted to the default exported parameters.

In [None]:
(sc = cpoScale())
getParamSet(sc)
cat("---\n")
(sc = cpoScale(export = "export.none"))
getParamSet(sc)
cat("---\n")
(sc = cpoScale(scale = FALSE, export = "export.unset"))
getParamSet(sc)

### CPO Application using `%>>%` or `applyCPO`
`CPO`s can be applied to `data.frame` and `Task` objects.

In [None]:
head(iris) %>>% cpoPca()
# head(getTaskData(applyCPO(cpoPca(), iris.task)))

### CPO Composition using `%>>%` or `composeCPO`
`CPO` composition results in a new CPO which mostly behaves like a primitive CPO. Exceptions are:
* Compound CPOs have no `id`
* Affect of compound CPOs cannot be retrieved

In [None]:
scale1 = cpoScale()
scale2 = cpoScale()
# scale1 %>>% scale2  # error! parameters 'center' and 'scale' occur in both
compound = setCPOId(scale1, "scale1") %>>% setCPOId(scale2, "scale2")
composeCPO(setCPOId(scale1, "scale1"), setCPOId(scale2, "scale2"))  # same

In [None]:
class(compound)

In [None]:
summary(compound)

In [None]:
getCPOName(compound)

In [None]:
as.character(try(getCPOId(compound)))  # error: no ID for compound CPOs
as.character(try(getCPOAffect(compound)))  # error: no affect for compound CPOs

In [None]:
getParamSet(compound)

In [None]:
getHyperPars(compound)

In [None]:
setHyperPars(compound, scale1.center = TRUE, scale2.center = FALSE)

### Compound CPO decomposition, CPO chaining

In [None]:
as.list(compound)

In [None]:
chainCPO(as.list(compound))  # chainCPO: list of CPOs -> CPO

### CPO - Learner attachment using `%>>%` or `attachCPO`

In [None]:
lrn = makeLearner("classif.logreg")

In [None]:
(cpolrn = cpo %>>% lrn)  # the new learner has the CPO hyperparameters

In [None]:
attachCPO(compound, lrn)  # attaching compound CPO

In [None]:
# CPO learner decomposition
getLearnerCPO(cpolrn)  # the CPO
getLearnerBare(cpolrn)  # the Learner

## Retrafo
CPOs perform data-dependent operations. However, when such an operation becomes part of a machine-learning process, the operation on predict-data must depend only on the training data.

The `Retrafo` object represents the re-application of a trained CPO.
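
To illustrate why the predict-time operation must reuse training statistics, here is a minimal base-R sketch. It deliberately uses no `mlrCPO` functions; the split of `iris` into `train.data` and `new.data` is an arbitrary choice made for illustration only.

```r
# Base-R sketch of the trafo / retrafo idea (no mlrCPO involved).
train.data <- iris[1:100, 1:4]    # assumed "training" rows
new.data   <- iris[101:150, 1:4]  # assumed "prediction" rows

# "trafo": scale the training data and remember the statistics used
scaled.train <- scale(train.data)
center <- attr(scaled.train, "scaled:center")
sds    <- attr(scaled.train, "scaled:scale")

# "retrafo": re-apply the stored training statistics to new data
scaled.new <- scale(new.data, center = center, scale = sds)

# naive alternative: rescale the new data with its own statistics
naive.new <- scale(new.data)

colMeans(scaled.new)  # not zero: new data is expressed relative to training data
colMeans(naive.new)   # (numerically) zero: the relation to training data is lost
```

Only the first variant keeps new observations on the same scale the model was trained on, which is what a `Retrafo` is meant to guarantee.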

In [None]:
transformed = iris %>>% cpo
head(transformed)

In [None]:
retrafo(transformed)

In [None]:
# retrafos are stored as attributes
attributes(transformed)

### Retrafo Inspection
`Retrafo` objects can be inspected using `getRetrafoState`. The state contains the hyperparameters, the `control` object (CPO dependent data representing the data information needed to re-apply the operation), and information about the `Task` / `data.frame` layout used for training (column names, column types) in `data$shapeinfo.input` and `data$shapeinfo.output`.

The state can be manipulated and used to create new `Retrafo`s, using `makeRetrafoFromState`.

In [None]:
(state = getRetrafoState(retrafo(iris %>>% cpoScale())))

In [None]:
state$control$center[1] = 1000  # will now subtract 1000 from the first column
new.retrafo = makeRetrafoFromState(cpoScale, state)
head(iris %>>% new.retrafo)

### Application of Retrafo using `%>>%`, `applyCPO`, or `predict`

In [None]:
head(iris) %>>% retrafo(transformed)
# should give the same as head(transformed), since the same data was used.
# same:
invisible(applyCPO(retrafo(transformed), head(iris)))
invisible(predict(retrafo(transformed), head(iris)))

### Retrafos from CPO Learners

In [None]:
cpomodel = train(cpolrn, pid.task)

In [None]:
retrafo(cpomodel)

In [None]:
head(getTaskData(pid.task %>>% retrafo(cpomodel)))
# this is the data the underlying learner sees when predict() is called with the model

### Retrafos are automatically chained when applying CPOs (!!!)
When executing `data %>>% CPO`, the result has an associated `Retrafo` object. When another `CPO` is applied to that result, the new `Retrafo` will be the chained operation. This makes `data %>>% CPO1 %>>% CPO2` work the way one expects.
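
The chaining idea can be sketched in base R with plain closures. The helpers `trainPca` and `trainScale` below are hypothetical and are not part of `mlrCPO`; each returns the transformed data together with a function that re-applies the trained step.

```r
# hypothetical helper: PCA step that remembers its rotation matrix
trainPca <- function(data) {
  pcr <- prcomp(data, center = FALSE)
  retrafo <- function(newdata) as.matrix(newdata) %*% pcr$rotation
  list(data = pcr$x, retrafo = retrafo)
}

# hypothetical helper: scaling step that remembers column means and sds
trainScale <- function(data) {
  ctr <- colMeans(data)
  s <- apply(data, 2, sd)
  retrafo <- function(newdata) {
    sweep(sweep(as.matrix(newdata), 2, ctr, "-"), 2, s, "/")
  }
  list(data = retrafo(data), retrafo = retrafo)
}

x <- as.matrix(iris[1:100, 1:4])
step1 <- trainPca(x)             # first operation, trained on x
step2 <- trainScale(step1$data)  # second operation, trained on the output of the first

# the chained retrafo applies both trained steps in order
chained <- function(newdata) step2$retrafo(step1$retrafo(newdata))

all.equal(chained(x), step2$data, check.attributes = FALSE)  # TRUE
head(chained(as.matrix(iris[101:150, 1:4])))
```

The second step is trained on the output of the first, so re-applying only the last step to raw data would give wrong results; this is why the chain has to be carried along.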

In [None]:
data = head(iris) %>>% cpoPca()
retrafo(data)

In [None]:
data2 = data %>>% cpoScale()
# retrafo(data2) is the same as retrafo(head(iris) %>>% cpoPca() %>>% cpoScale())
retrafo(data2)

In [None]:
# to interrupt this chain, set retrafo to NULL
retrafo(data) = NULL
data2 = data %>>% cpoScale()
retrafo(data2)

### Retrafo Composition, Decomposition, Chaining

In [None]:
compound.retrafo = retrafo(head(iris) %>>% compound)
compound.retrafo

In [None]:
(retrafolist = as.list(compound.retrafo))

In [None]:
retrafolist[[1]] %>>% retrafolist[[2]]

In [None]:
chainCPO(retrafolist)

## Inverter
Inverters represent the operation of inverting transformations done to prediction columns. They are not usually exposed outside of `Learner` objects, but can be retrieved when retransformed data is tagged using `tagInvert`.

Inverters are currently not fully functional.

In [None]:
# there is currently no example targetbound cpo
logtransform = makeCPOTargetOp("logtransform", .data.dependent = FALSE,
  .stateless = TRUE, .type = "regr",
  cpo.trafo = {
    target[[1]] = log(target[[1]])
    target
  }, cpo.retrafo = { print(match.call()) })

In [None]:
log.retrafo = retrafo(bh.task %>>% logtransform())  # get a target-bound retrafo
getCPOKind(log.retrafo)  # logtransform is *stateless*, so it is a retrafo *and* an inverter
getCPOBound(log.retrafo)

In [None]:
inverter(bh.task %>>% log.retrafo)

In [None]:
#inverter(tagInvert(bh.task) %>>% log.retrafo)
# currently not implemented :-/

Inverting is done with the `invert` function.

In [None]:
log.bh = bh.task %>>% logtransform()
log.prediction = predict(train("regr.lm", log.bh), log.bh)

In [None]:
# invert(retrafo(log.bh), log.prediction)  # not implemented :-/
# invert(retrafo(log.bh), log.prediction$data["response"])  # not implemented :-/


# CPO Properties
CPOs contain information about the kind of data they can work with, and what kind of data they produce. `getCPOProperties` returns a list with the slots `properties`, `properties.data`, `properties.needed`, and `properties.adding`. `properties` indicates the kind of data a CPO can handle, `properties.needed` the kind of data the data receiver (e.g. an attached learner) must be able to handle, and `properties.adding` the properties the CPO adds to a given learner. An example is a CPO that converts factors to numerics: the receiving learner needs to handle numerics, so `properties.needed = "numerics"`, but the CPO *adds* the ability to handle factors (since they are converted), so `properties.adding = c("factors", "ordered")`. `properties.data` differs from `properties` only if `affect.*` parameters are given. In that case, `properties.data` determines what properties the selected subset of columns must have.

In [None]:
getCPOProperties(cpoDummyEncode())

In [None]:
train("classif.geoDA", bc.task)  # gives an error

In [None]:
train(cpoDummyEncode(reference.cat = TRUE) %>>% makeLearner("classif.geoDA"), bc.task)

In [None]:
getLearnerProperties("classif.geoDA")

In [None]:
getLearnerProperties(cpoDummyEncode(TRUE) %>>% makeLearner("classif.geoDA"))

# Special CPOs

## NULLCPO
`NULLCPO` is the neutral element of `%>>%`. It is returned by some functions when no other CPO or Retrafo is present.

In [None]:
NULLCPO

In [None]:
is.nullcpo(NULLCPO)

In [None]:
NULLCPO %>>% cpoScale()

In [None]:
NULLCPO %>>% NULLCPO

In [None]:
print(as.list(NULLCPO))

In [None]:
chainCPO(list())

## CPO Applicator
A simple CPO with one parameter which gets applied to the data as CPO. This is different from a multiplexer in that its parameter is free and can take any value that behaves like a CPO. On the downside, this does not expose the argument's parameters to the outside.

In [None]:
cpa = cpoApply()
summary(cpa)

In [None]:
head(iris %>>% setHyperPars(cpa, apply.cpo = cpoScale()))

In [None]:
head(iris %>>% setHyperPars(cpa, apply.cpo = cpoPca()))

In [None]:
# attaching the cpo applicator to a learner gives this learner an "apply.cpo"
# hyperparameter that can be set to any CPO.
getParamSet(cpoApply() %>>% makeLearner("classif.logreg"))

## CPO Multiplexer
Combine many CPOs into one, with an extra `selected.cpo` parameter that chooses between them.

In [None]:
cpm = cpoMultiplex(list(cpoScale, cpoPca))
summary(cpm)

In [None]:
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "scale"))

In [None]:
# every CPO's Hyperparameters are exported
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "scale", multiplex.scale.center = FALSE))

In [None]:
head(iris %>>% setHyperPars(cpm, multiplex.selected.cpo = "pca"))

## Meta-CPO
A CPO that builds data-dependent CPO networks. This is a generalization of `cpoMultiplex`: it takes a function that decides (from the data and from user-specified hyperparameters) which CPO operation to perform. Besides optional arguments, the hyperparameters of the CPOs used are exported as well. Unlike with `cpoMultiplex`, however, the `requires` of the involved parameters are not adjusted, since this is impossible in principle.

In [None]:
s.and.p = cpoMeta(logical.param: logical,
  .export = list(cpoScale(id = "scale"),
    cpoPca(id = "pca")),
  cpo.build = function(data, target, logical.param, scale, pca) {
    if (logical.param || mean(data[[1]]) > 10) {
      scale %>>% pca
    } else {
      pca %>>% scale
    }
  })

In [None]:
summary(s.and.p())

The resulting CPO `s.and.p` performs scaling and PCA, with the order depending on the parameter `logical.param` and on whether the mean of the data's first column exceeds 10. If either of those is true, the data is first scaled and then PCA'd; otherwise the order is reversed.
All CPOs listed in `.export` are passed to `cpo.build`.

## CBind CPO
Applies several CPOs to the same data and `cbind`s their results. `cpoCbind` makes it possible to build DAGs of CPOs that perform different operations on data and paste the results next to each other.

In [None]:
scale = cpoScale(id = "scale")
scale.pca = scale %>>% cpoPca()
cbinder = cpoCbind(scaled = scale, pcad = scale.pca, original = NULLCPO)

In [None]:
# cpoCbind recognises that "scale.scale" happens before "pca.pca" but is also fed to the
# result directly. The summary draws a (crude) ascii-art graph.
summary(cbinder)

In [None]:
head(iris %>>% cbinder)

In [None]:
# the unnecessary copies of "Species" are unfortunate. Remove them with cpoSelect:
selector = cpoSelect(type = "numeric")
cbinder.select = cpoCbind(scaled = selector %>>% scale, pcad = selector %>>% scale.pca, original = NULLCPO)
cbinder.select
head(iris %>>% cbinder.select)

In [None]:
# alternatively, we apply the cbinder only to numerical data
head(iris %>>% cpoApply(cbinder, affect.type = "numeric"))

# Builtin CPOs

## Listing CPOs
Builtin CPOs can be listed with `listCPO()`.

In [None]:
listCPO()

## cpoScale
Implements the `base::scale` function.

In [None]:
df %>>% cpoScale()

In [None]:
df %>>% cpoScale(scale = FALSE)  # center = TRUE

## cpoPca
Implements `stats::prcomp`. No scaling or centering is performed.

In [None]:
df %>>% cpoPca()

## cpoDummyEncode
Dummy encoding of factor variables. Optionally uses the first factor level as reference category.

In [None]:
head(iris %>>% cpoDummyEncode())

In [None]:
head(iris %>>% cpoDummyEncode(reference.cat = TRUE))
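
The difference between the two encodings can be sketched in base R with `model.matrix`. This is only an illustration of the concept; the column names and the exact output of `cpoDummyEncode` may differ.

```r
# one indicator column per level (no reference category)
full <- model.matrix(~ 0 + Species, data = iris)

# treatment contrasts (R's default): the first level ("setosa") is the
# reference; dropping the intercept column leaves only the indicators
ref <- model.matrix(~ Species, data = iris)[, -1]

colnames(full)  # "Speciessetosa" "Speciesversicolor" "Speciesvirginica"
colnames(ref)   # "Speciesversicolor" "Speciesvirginica"
```

With a reference category, a row of all zeros stands for the reference level, which avoids perfectly collinear columns in linear models.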

## cpoSelect
Select to use only certain columns of a dataset. Select by column index, name, or regex pattern.

In [None]:
head(iris %>>% cpoSelect(pattern = "Width"))

In [None]:
# selection is additive
head(iris %>>% cpoSelect(pattern = "Width", type = "factor"))

## cpoDropConstants
Drops constant features. For numeric features, a tolerance can be set.

In [None]:
head(iris) %>>% cpoDropConstants()  # drops 'Species'
head(iris) %>>% cpoDropConstants(abs.tol = 0.2)  # also drops 'Petal.Width'
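
A rough base-R sketch of the idea follows. The helper `isConstant` is hypothetical, and the range-based tolerance rule it uses is an assumption; the criterion actually used by `cpoDropConstants` may differ.

```r
x <- head(iris)

# hypothetical helper: TRUE if a column is (nearly) constant
isConstant <- function(col, abs.tol = 0) {
  if (is.numeric(col)) {
    diff(range(col)) <= abs.tol   # assumed rule: value range within tolerance
  } else {
    length(unique(col)) <= 1
  }
}

# columns that survive with no tolerance: only 'Species' is dropped
names(x)[!vapply(x, isConstant, logical(1))]

# columns that survive with a tolerance of 0.25: 'Petal.Width' is dropped as well
names(x)[!vapply(x, isConstant, logical(1), abs.tol = 0.25)]
```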

## cpoFixFactors
Drops unused factor levels and makes sure prediction data has the same factor levels as training data.

In [None]:
levels(iris$Species)

In [None]:
irisfix = head(iris) %>>% cpoFixFactors()  # Species only has level 'setosa' in train
levels(irisfix$Species)

In [None]:
rf = retrafo(irisfix)
iris[c(1, 100, 140), ]
iris[c(1, 100, 140), ] %>>% rf
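
The underlying mechanism can be sketched in base R: levels are fixed on the training data and then imposed on new data. That unseen levels become `NA` is how base R's `factor()` behaves here; whether the CPO handles them identically is an assumption.

```r
# "training": only the levels that actually occur are kept
train.species <- droplevels(head(iris)$Species)
levels(train.species)  # "setosa"

# "prediction": impose the training levels on new data
new.species <- iris[c(1, 100, 140), "Species"]
fixed <- factor(new.species, levels = levels(train.species))
fixed          # setosa, NA, NA: levels unseen during training become NA
levels(fixed)  # "setosa"
```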

## cpoMissingIndicators
Creates columns indicating missing data. Most useful in combination with `cpoCbind`.

In [None]:
impdata = df
impdata[[1]][1] = NA
impdata

In [None]:
impdata %>>% cpoMissingIndicators()
impdata %>>% cpoCbind(NULLCPO, dummy = cpoMissingIndicators())
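
What such indicator columns look like can be sketched in base R. The `missing.` prefix is a hypothetical naming choice, and this sketch creates an indicator for every column, which need not match the CPO's defaults.

```r
impdata <- data.frame(a = c(NA, 2, 3), b = c(-10, -20, -30))

# one logical column per feature, TRUE where the value is missing
indicators <- as.data.frame(lapply(impdata, is.na))
names(indicators) <- paste0("missing.", names(impdata))  # hypothetical naming

# paste the indicators next to the original data, as cpoCbind would
cbind(impdata, indicators)
```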

## cpoApplyFun
Apply a univariate function to data columns.

In [None]:
head(iris %>>% cpoApplyFun(function(x) sqrt(x) - 10, affect.type = "numeric"))

## cpoAsNumeric
Convert (non-numeric) features to numeric.

In [None]:
head(iris[sample(nrow(iris), 10), ] %>>% cpoAsNumeric())

## cpoCollapseFact
Combine low-prevalence factor levels. Set `max.collapsed.class.prevalence` to control how big the combined factor level may be.

In [None]:
iris2 = iris
iris2$Species = factor(c("a", "b", "c", "b", "b", "c", "b", "c",
                        as.character(iris2$Species[-(1:8)])))
head(iris2, 10)
head(iris2 %>>% cpoCollapseFact(max.collapsed.class.prevalence = 0.2), 10)
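
One way to picture the collapsing rule is the following base-R sketch. The helper `collapseRare` and the level name `"collapsed"` are hypothetical, and the rule (merge the rarest levels as long as their combined prevalence stays below the threshold) is an assumption about how `cpoCollapseFact` behaves.

```r
# hypothetical helper: merge the rarest levels into one
collapseRare <- function(f, max.prevalence = 0.1, new.level = "collapsed") {
  prev <- sort(table(f) / length(f))                   # ascending prevalence
  rare <- names(prev)[cumsum(prev) <= max.prevalence]  # rarest levels that fit
  if (length(rare) < 2) return(f)                      # nothing worth merging
  levels(f)[levels(f) %in% rare] <- new.level          # equal names merge levels
  f
}

f <- factor(c("a", "b", "b", "c", "c", "c", rep("d", 44)))
table(f)                     # a: 1, b: 2, c: 3, d: 44
table(collapseRare(f, 0.2))  # collapsed: 6, d: 44
```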

## cpoModelMatrix
Specify which columns get used, and how they are transformed, using a `formula`.

In [None]:
head(iris %>>% cpoModelMatrix(~0 + Species:Petal.Width))
# use . + ... to retain originals
head(iris %>>% cpoModelMatrix(~0 + . + Species:Petal.Width))

## cpoScaleRange
Scale values to a given range.

In [None]:
head(iris %>>% cpoScaleRange(-1, 1))
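
The arithmetic behind range scaling can be sketched in base R. The helper `makeRangeScaler` is hypothetical; it assumes that the minimum and maximum are learned on the training data and then reused for new data.

```r
# hypothetical helper: learn the range on training data, return a retrafo closure
makeRangeScaler <- function(train, lower = 0, upper = 1) {
  lo <- min(train)
  hi <- max(train)
  function(x) (x - lo) / (hi - lo) * (upper - lower) + lower
}

scaler <- makeRangeScaler(iris$Sepal.Length, lower = -1, upper = 1)
range(scaler(iris$Sepal.Length))  # -1 1
scaler(10)  # larger than 1: new values outside the training range are not clipped
```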

## cpoScaleMaxAbs
Multiply features to set the maximum absolute value.

In [None]:
head(iris %>>% cpoScaleMaxAbs(0.1))

## cpoSpatialSign
Normalize values row-wise.

In [None]:
head(iris %>>% cpoSpatialSign())
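
Row-wise normalization can be sketched in base R as follows, assuming that each row is divided by its Euclidean norm. Only the numeric columns of `iris` are used here.

```r
x <- as.matrix(iris[1:4])

# divide every row by its Euclidean length; the vector of norms has one
# entry per row and is recycled down the columns
signed <- x / sqrt(rowSums(x^2))

head(signed)
range(sqrt(rowSums(signed^2)))  # every row now has length 1
```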

## Imputation
There are two *general* and many *specialised* imputation CPOs. The general imputation CPOs have parameters that let them use different imputation methods on different columns. They are a thin wrapper around `mlr`'s `impute()` and `reimpute()` functions. The specialised imputation CPOs each implement exactly one imputation method and are closer to the behaviour of typical CPOs.

### General Imputation Wrappers
`cpoImpute` and `cpoImputeAll` both have parameters very much like `impute()`. The latter assumes that *all* columns of its input are being imputed and can be prepended to a learner to give it the ability to work with missing data. It will, however, throw an error if data is still missing after imputation.

In [None]:
impdata %>>% cpoImpute(cols = list(a = imputeMedian()))

In [None]:
impdata %>>% cpoImpute(cols = list(b = imputeMedian()))  # NAs remain
#impdata %>>% cpoImputeAll(cols = list(b = imputeMedian()))  # error, since NAs remain

In [None]:
missing.task = makeRegrTask("missing.task", impdata, target = "b")
# the following gives an error, since 'cpoImpute' does not make sure all missings are removed
# and hence does not add the 'missings' property.
#train(cpoImpute(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
# instead, the following works:
train(cpoImputeAll(cols = list(a = imputeMedian())) %>>% makeLearner("regr.lm"), missing.task)
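
The train/predict split of median imputation can be sketched in base R. The helper `imputeWithMedian` and the toy vector are hypothetical and only illustrate that the median is learned once and then reused.

```r
train.a <- c(NA, 2, 3, 7)

# "training": learn the imputation value from the non-missing training values
med <- median(train.a, na.rm = TRUE)  # 3

# "re-imputation": the stored value is used for any later data
imputeWithMedian <- function(x) {
  x[is.na(x)] <- med
  x
}

imputeWithMedian(train.a)     # 3 2 3 7
imputeWithMedian(c(NA, 10))   # 3 10: uses the training median, not 10
```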

### Specialised Imputation Wrappers
There is one for each imputation method.

In [None]:
impdata %>>% cpoImputeConstant(10)

In [None]:
getTaskData(missing.task %>>% cpoImputeMedian())

In [None]:
# The specialised impute CPOs are:
listCPO()[listCPO()$category == "imputation" & listCPO()$subcategory == "specialised",
          c("name", "description")]

## Feature Filtering
There is one *general* and many *specialised* feature filtering CPOs. The general filtering CPO, `cpoFilterFeatures`, is a thin wrapper around `filterFeatures` and takes the filtering method as its argument. The specialised CPOs each call a specific filtering method.

Most arguments of `filterFeatures` are reflected in the CPOs. The exceptions are:
1. For `cpoFilterFeatures`, the filter method arguments are given in a list `filter.args`, instead of in `...`
2. The argument `fval` was dropped for the specialised filter CPOs.
3. The argument `mandatory.feat` was dropped. Use `affect.*` parameters to prevent features from being filtered.

In [None]:
head(getTaskData(iris.task %>>% cpoFilterFeatures(method = "variance", perc = 0.5)))

In [None]:
head(getTaskData(iris.task %>>% cpoFilterVariance(perc = 0.5)))
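
What a variance filter with `perc = 0.5` amounts to can be sketched in base R. Rounding the number of kept features up is an assumption of this sketch, not a statement about how `filterFeatures` rounds.

```r
features <- iris[1:4]

# score every feature by its variance
variances <- vapply(features, var, numeric(1))

# keep the top 50% of features (assumption: round up)
n.keep <- ceiling(0.5 * length(variances))
keep <- names(sort(variances, decreasing = TRUE))[seq_len(n.keep)]

keep  # "Petal.Length" "Sepal.Length"
head(features[keep])
```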

In [None]:
# The specialised filter CPOs are:
listCPO()[listCPO()$category == "featurefilter" & listCPO()$subcategory == "specialised",
          c("name", "description")]

# Creating Custom CPOs

In [None]:
names(formals(makeCPO))  # see help(makeCPO) for explanation of arguments

In [None]:
# an example 'pca' CPO
# demonstrates the (object based) "separate" CPO API
pca = makeCPO("pca",  # name
  center = TRUE: logical,  # one logical parameter 'center'
  .datasplit = "numeric",  # only handle numeric columns
  .retrafo.format = "separate",  # default, can be omitted
  # cpo.trafo is given as a function body. The function head is added
  # automatically, containing 'data', 'target', and 'center'
  # (since a 'center' parameter was defined)
  cpo.trafo = {
    pcr = prcomp(as.matrix(data), center = center)
    # The following line creates a 'control' object, which will be given
    # to retrafo.
    control = list(rotation = pcr$rotation, center = pcr$center)
    pcr$x  # returning a matrix is ok
  # Just like cpo.trafo, cpo.retrafo is a function body, with implicit
  # arguments 'data', 'control', and 'center'.
  }, cpo.retrafo = {
    scale(as.matrix(data), center = control$center, scale = FALSE) %*%
      control$rotation
  })
head(iris %>>% pca())

In [None]:
# an example 'scale' CPO
# demonstrates the (functional) "separate" CPO API
scaleC = makeCPO("scale",
  .datasplit = "numeric",
  # .retrafo.format = "separate" is implicit
  cpo.trafo = function(data, target) {
    result = scale(as.matrix(data))
    cpo.retrafo = function(data) {
      # here we can use the 'result' object generated in cpo.trafo
      scale(as.matrix(data), attr(result, "scaled:center"),
        attr(result, "scaled:scale"))
    }
    result
  }, cpo.retrafo = NULL)
head(iris) %>>% scaleC()

In [None]:
# an example constant feature remover CPO
# demonstrates the "combined" CPO API
constFeatRem = makeCPO("constFeatRem",
  .datasplit = "target",
  .retrafo.format = "combined",
  cpo.trafo = function(data, target) {
    cols.keep = names(Filter(function(x) {
      length(unique(x)) > 1
    }, data))
    # the following function will do both the trafo and retrafo
    result = function(data) {
      data[cols.keep]
    }
    result
  }, cpo.retrafo = NULL)
head(iris) %>>% constFeatRem()

In [None]:

# an example 'square' CPO
# demonstrates the "stateless" CPO API
square = makeCPO("square",
  .datasplit = "numeric",
  .retrafo.format = "stateless",
  cpo.trafo = NULL, # optional, we don't need it since trafo & retrafo same
  cpo.retrafo = function(data) {
    as.matrix(data) ^ 2
  })
head(iris) %>>% square()

# Tuning CPO
CPOs export their parameters when attached to a learner. Tuning of CPO-Learners works exactly as tuning for ordinary Learners.

In [None]:
(clrn = cpoModelMatrix() %>>% makeLearner("classif.logreg"))
getParamSet(clrn)

In [None]:
ps = makeParamSet(
    makeDiscreteParam(
        "model.matrix.formula",
        values = list(first = ~0 + ., second = ~0 + .^2, third = ~0 + .^3)))

In [None]:
tuneParams(clrn, pid.task, cv5, par.set = ps,
           control = makeTuneControlGrid(),
           show.info=TRUE)

Tuning of CPOs and tuning of Learners can happen at the same time.

In [None]:
tlrn = cpoModelMatrix() %>>%
       cpoApply() %>>%
       cpoFilterGainRatio() %>>%
       makeLearner("classif.ctree")
sprintf("Parameters: %s", paste(names(getParamSet(tlrn)$pars), collapse=", "))
ps2 = makeParamSet(
    makeDiscreteParam(
        "model.matrix.formula",
        values = list(first = ~0 + ., second = ~0 + .^2)),
    makeDiscreteParam(
        "apply.cpo",
        values = list(nopca = NULLCPO,
                      onlypca = cpoPca(),
                      addpca = cpoCbind(NULLCPO, cpoPca()))),
    makeDiscreteParam(
        "gain.ratio.perc",
        values = list(0.333, 0.667, 1.0)),
    makeDiscreteParam("teststat", values = c("quad", "max")))
    

In [None]:
# commit a70edb1f makes output unbearably ugly, so suppress it here
suppressMessages(suppressWarnings(tuneParams(tlrn, pid.task, cv5, par.set = ps2,
           control = makeTuneControlGrid(),
           show.info=FALSE)))
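
To get a feeling for the cost of such a joint grid search, the size of the grid defined by `ps2` can be computed in base R. The sketch only mirrors the names of the discrete values above and does not call `mlr`.

```r
grid <- expand.grid(
  model.matrix.formula = c("first", "second"),
  apply.cpo = c("nopca", "onlypca", "addpca"),
  gain.ratio.perc = c(0.333, 0.667, 1.0),
  teststat = c("quad", "max"),
  stringsAsFactors = FALSE)

nrow(grid)      # 2 * 3 * 3 * 2 = 36 configurations
nrow(grid) * 5  # 180 model fits with 5-fold cross-validation
head(grid, 3)
```

The grid grows multiplicatively with every added parameter, so tuning CPO and Learner parameters together gets expensive quickly.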