# Functions in R
* Author: Johannes Maucher, some modifications OK in 2019
* Last Update: 2017-03-13

When to write a function? A rule of thumb from Hadley Wickham, author of [R for Data Science](https://r4ds.had.co.nz) [WH19]:

> "Never copy and paste more than twice"

In [1]:
library(tidyverse)
library(modelr)

-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.2.0     v purrr   0.3.2
v tibble  2.1.3     v dplyr   0.8.3
v tidyr   0.8.3     v stringr 1.4.0
v readr   1.3.1     v forcats 0.4.0
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


R is, at its core, a functional language. Functions are first-class objects and are treated like any other data type. For example, functions can be assigned to variables and passed as arguments to other functions. Even simple operators such as *+* are functions. The conventional notation *x+y* is just a shortcut for "+"(x, y):

In [2]:
9+6
"+"(9, 6)

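The two properties mentioned above, assigning a function to a variable and passing a function as an argument, can be sketched as follows. The names `add` and `applyTwice` are hypothetical and serve for illustration only.

```r
#Assign the operator function + to a new variable name
add <- `+`
add(9, 6)

#Pass a function as argument to another function
applyTwice <- function(f, x){
    return (f(f(x)))
}
applyTwice(sqrt, 16)   #same as sqrt(sqrt(16))
```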
An important property of many R functions is that they are *vectorized*. This means that they can be applied to a single element as well as elementwise to a collection of elements, e.g. vectors or matrices:

In [3]:
a <- 1:10
a
b <- 11:20
b
"+"(a, b)
a+b

In many other programming languages such operations are defined only for single elements, and an elementwise calculation on vectors is typically implemented by calling the operation repeatedly within a for-loop.
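For comparison, here is a minimal sketch of the loop-based formulation next to the vectorized one. The variable name `loopSum` is chosen for this illustration only.

```r
a <- 1:10
b <- 11:20

#Loop-based: add the elements one by one
loopSum <- integer(length(a))
for (i in seq_along(a)){
    loopSum[i] <- a[i] + b[i]
}

#Vectorized: one single call
vecSum <- a + b

identical(loopSum, vecSum)   #both variants yield the same vector
```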

## Built-in functions
R provides a large number of built-in functions. These are functions that are available in the base R installation. They can be used whenever needed and need not be loaded explicitly.


### Examples for mathematical built-in functions
Some basic statistics, such as maximum, minimum, mean, standard deviation and variance, can be calculated with the following built-in functions:

In [4]:
max(a)
min(a)
mean(a)
sd(a)
var(a)

Some other mathematical built-in functions are:

In [5]:
sqrt(25)
cos(pi)
seq(1, 100, by=20)

### Examples for built-in character functions

Character functions are executed on textual data. A small but important subset of character built-in functions is:

* [`nchar(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/nchar): Returns the number of characters in 'x'.

* [`substr(x, start, stop)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/substr): Returns the substring of 'x' that starts at index 'start' and ends at index 'stop' (both inclusive, counting from 1).

* [`strsplit(x, split, fixed=FALSE)`](https://stat.ethz.ch/R-manual/R-devel/library/base/html/strsplit.html): Splits the character variable 'x' at every match of 'split'. If 'fixed=TRUE', 'split' is interpreted as a literal text string. If 'fixed=FALSE', 'split' is interpreted as a regular expression.

For more information on regular expressions, see [`regular expressions`](https://stringr.tidyverse.org/articles/regular-expressions.html).

Many more string functions are provided by the ['stringr' package](https://r4ds.had.co.nz/strings.html).


* [`grep(pattern, x, ignore.case=FALSE, fixed=FALSE)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/grep): Searches for *pattern* in each element of the character vector 'x'. If 'fixed=FALSE' (the default), 'pattern' is a regular expression. If 'fixed=TRUE', 'pattern' is a literal text string. Returns the indices of the matching elements of the vector. This is not the position of the pattern within a string.

* [`gregexpr(pattern, text, ignore.case=FALSE, fixed=FALSE)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/grep): Searches for *pattern* in each element of 'text'. If 'fixed=FALSE' (the default), 'pattern' is a regular expression. If 'fixed=TRUE', 'pattern' is a literal text string. Returns a list with the start positions of all matches within each string (-1 if there is no match).

* `sub(pattern, replacement, x, ignore.case=FALSE, fixed=FALSE)`: Finds 'pattern' in 'x' and replaces it with the 'replacement' text. If 'fixed=FALSE', 'pattern' is a regular expression. If 'fixed=TRUE', 'pattern' is a literal text string. Note that `sub()` replaces only the first occurrence of 'pattern' in each string. To replace all occurrences, use `gsub()`.

* `paste(..., sep=" ")`: Converts its arguments to character strings and concatenates them elementwise, with the string 'sep' inserted between them.

* `toupper(x)`: Turns all characters in *x* to uppercase.

* `tolower(x)`: Turns all characters in *x* to lowercase.

These functions are demonstrated in the following lines of code:

`Keep in mind`: Arguments can be passed by name or by position (*named* and *unnamed* arguments), e.g. `strsplit(myCharVar, split='.', fixed=TRUE)` - see below.

In [6]:
myCharVar <- "Das ist ein einfacher Satz. Und hier kommt nochmal ein Satz."
nchar(myCharVar)
substr(myCharVar, 5, 8)     #the result includes the trailing space character
strsplit(myCharVar, split='.', fixed=TRUE)

#\\s  : matches any whitespace
strsplit(myCharVar, '\\s')  #split at all whitespaces
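The effect of the `fixed` argument can be illustrated with a short sketch (the variable `shortText` is a made-up example value). With `fixed=FALSE` an unescaped dot is a regular expression which matches any character, so every character is treated as a separator:

```r
shortText <- "a.b"

strsplit(shortText, split=".", fixed=TRUE)      #dot as literal text: "a" "b"
strsplit(shortText, split=".", fixed=FALSE)     #dot as regular expression: only empty strings remain
strsplit(shortText, split="\\.", fixed=FALSE)   #escaped dot in a regular expression: "a" "b"
```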

In [7]:
#vector of character-strings
seqChars <- c("Das ist Satz 1.", "Hier ist der zweite Satz.", 
              "Und hier der dritte.") 

cat("Output indices in vector:\n")
grep('der', seqChars, fixed=TRUE)

cat("Output positions in string:\n")
gregexpr('der', seqChars[2], fixed=TRUE)
gregexpr('xyz', seqChars[2], fixed=TRUE)


Output indices in vector:


Output positions in string:


In [8]:
#\\si  : matches a whitespace followed by an "i" character
grep('\\si', seqChars, fixed=FALSE)  
#\\d  : matches any digit
grep('\\d', seqChars, fixed=FALSE)

sub('zweite', '2.', seqChars, fixed=TRUE)
paste("Feature", 1:5, sep="-")
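The difference between `sub()` and `gsub()` is not shown in the cell above. A minimal sketch, using a made-up example string `txt`:

```r
txt <- "Satz eins. Satz zwei. Satz drei."

firstOnly <- sub("Satz", "Sentence", txt, fixed=TRUE)    #replaces only the first occurrence
allMatches <- gsub("Satz", "Sentence", txt, fixed=TRUE)  #replaces all occurrences

firstOnly
allMatches
```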


In [9]:
curval <- 10
paste("The value is", curval, sep=": ")
paste("Today is", date(), sep=": ")

toupper('abCD')
tolower('EFgh')

### Examples for built-in Date functions
Dates and times are often represented as character variables. However, in this representation date-time calculations, such as determining the number of days between two given dates, are not possible. Date calculations can easily be performed if the character representation of a date is converted to an R `Date` object with the built-in function `as.Date`:

In [10]:
(day1 <- as.Date("2015-09-08"))
(day2 <- as.Date("25.07.2016", format="%d.%m.%Y"))
(day3 <- as.Date('November 7, 2017', format='%B %d, %Y'))

In [11]:
day2 - day1 #number of days between the two dates

Time difference of 321 days
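A few further operations on `Date` objects, as a brief sketch. It reuses the two dates from above; the selection of functions is only an illustration and not a complete overview.

```r
day1 <- as.Date("2015-09-08")
day2 <- as.Date("25.07.2016", format="%d.%m.%Y")

as.numeric(day2 - day1)                 #difference as a plain number of days
difftime(day2, day1, units="weeks")     #the same difference expressed in weeks
format(day2, "%Y/%m/%d")                #convert back to a character string
seq(day1, by="month", length.out=3)     #three dates in monthly steps
```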

In [12]:
mydays <- c(day1, day2, day3)
weekdays(mydays)

Many more built-in functions for date and time processing are described e.g. in [https://www.stat.berkeley.edu/~s133/dates.html](https://www.stat.berkeley.edu/~s133/dates.html).

### Examples for other useful built-in functions
Besides a vast variety of mathematical functions, there are many other useful built-in functions in R. Here is just a small subset of such helpers:

* [`cat(A)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/cat): Concatenates the objects in A and prints them.
* [`seq_range(x, n)`](https://www.rdocumentation.org/packages/tidyr/versions/0.3.1/topics/seq_range): Generates a numeric vector with n equally spaced values that cover the range of x. Note that this function is not built in, but provided by the package [modelr](https://cran.r-project.org/web/packages/modelr/modelr.pdf).
* [`seq_len(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/seq): Generates the integer vector 1, 2, ..., x.
* [`seq_along(x)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/seq): Generates the integer vector of the indices of the elements of x, i.e. 1, 2, ..., length(x).


* `length(x)`: Returns the length of an object *x*. E.g. *length(c(4, 2, 19))* returns 3.
* [`cut(x, n)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/cut): Divides the range of the continuous variable *x* into *n* intervals of equal width and returns a factor with *n* levels, which assigns each value of *x* to its interval.
* [`pretty(x, n)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/pretty): Computes about *n+1* equally spaced, rounded values which cover the range of *x* and thus divide it into approximately *n* intervals.


**Examples:**

In [13]:
x <- seq(13, 40, 5)
cat("x <- seq(13, 40, 5): ", x, "\n\n")

cat("seq_range(x, 3): \n")
library(modelr)
seq_range(x, 3)

cat("seq_len(5): \n")
seq_len(5)

#seq_along() is useful in iterations 
cat("seq_along(x): \n")
seq_along(x)


x <- seq(13, 40, 5):  13 18 23 28 33 38 

seq_range(x, 3): 


seq_len(5): 


seq_along(x): 
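Why `seq_along()` is useful in iterations can be seen with an empty vector. The following sketch (the variable `emptyVec` is a made-up example) compares it with the frequently used `1:length(x)`:

```r
emptyVec <- numeric(0)

1:length(emptyVec)     #yields 1 0, so a loop over it would run twice
seq_along(emptyVec)    #yields an empty vector, so a loop over it would not run at all

nLoopsColon <- length(1:length(emptyVec))
nLoopsSeqAlong <- length(seq_along(emptyVec))
```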


In [14]:
x

cat("length(x):\n")
length(x)


length(x):


In [15]:
#Example
x1 <- c(45, 2, 82, 22, 4)
cat("x1:", x1, "\n")      

var1 <- cut(x1, 4)
var1    #Each value of x1 is assigned to one of the 4 equal-width intervals between the minimum and maximum of x1

cat("Show levels of intervals/ranges:\n")
levels(var1)

x1: 45 2 82 22 4 


Show levels of intervals/ranges:
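Instead of a number of intervals, `cut()` also accepts explicit break points and labels. A minimal sketch with the same data, where the break points and the labels "low", "mid" and "high" are arbitrary choices for this illustration:

```r
x1 <- c(45, 2, 82, 22, 4)

#Intervals are (0,25], (25,50] and (50,100]
groups <- cut(x1, breaks=c(0, 25, 50, 100), labels=c("low", "mid", "high"))
groups

#Count the values per interval
table(groups)
```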


In [16]:
x

pretty(x, 5)

In [17]:
m <- 3
j <- 5
cat(" Value of m is:\t", m, "\n","Value of j is\t", j)

 Value of m is:	 3 
 Value of j is	 5

## Functions from external R packages
There exist more than 10,000 R packages, which provide solutions for all kinds of problems. External R packages can be downloaded e.g. from [https://cran.r-project.org/](https://cran.r-project.org/). The list of all installed packages that are available in your current environment can be obtained with the following statement:

In [18]:
library()
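The typical steps for working with a package can be sketched as follows. The package `stats` is used here only because it is part of every R installation; the installation call is left commented out.

```r
#install.packages("modelr")    #one-time download and installation from CRAN

#Check whether a package is installed
isInstalled <- "stats" %in% rownames(installed.packages())
isInstalled

#Call a function with an explicit package prefix, without attaching the package
stats::sd(1:10)
```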

## User-defined functions

Users can define their own functions. Encapsulating code in functions provides more structure, readability and maintainability. The most important advantage, however, is that routines which are required more than once need not be implemented repeatedly. A function is defined only once and can then be used wherever it is required.

The general syntax for functions in R is:

In [19]:
functionName <- function(listOfParameters){
 statements
    
 return (result)
}

The parameters listed within the parentheses that follow the keyword *function* receive the *arguments*, which are passed as input to the function. Within the function body (inside the curly braces) arbitrarily complex statements can be executed. The defined function is assigned to the variable *functionName*. The function can be called via this variable name as shown below.

The result of the computation is returned by the function. A *return* statement returns exactly one object. This can be a single value, a matrix, a data frame etc., or a list that bundles several return values, e.g. 

    return(list(theta=mat, cov=cov(t(mat))))
 
`Keep in mind:` If you want to return more than one value, you can wrap the values in a list, e.g. `return(list(value1, value2, value3))`.

`Keep in mind:` Sometimes a function contains no *return()* statement. In this case the value of the last evaluated expression is returned. My personal opinion is that this is bad code.
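Returning several values in a named list can be sketched as follows. The function `summarizeRange` and its element names are hypothetical examples.

```r
summarizeRange <- function(values){
    lowest <- min(values)
    highest <- max(values)
    
    #Bundle several results in one named list
    return (list(minimum=lowest, maximum=highest, span=highest-lowest))
}

res <- summarizeRange(c(33, 34, 20, 52, 60, 71))
res$minimum    #access a single return value by name
res$span
```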

For example, the following code snippet defines a function which normalizes the values of the vector passed as argument to the range from 0 to 1 (min-max normalization). The normalized values are returned by the function.

In [20]:
myNormalizer <- function(rawdata){
    maximum <- max(rawdata)
    minimum <- min(rawdata)
    normeddata <- (rawdata-minimum)/(maximum-minimum)
    
    return (normeddata)
}

Now this function can be called by its name wherever it is required:

In [21]:
(a <- 10:20)

A <- myNormalizer(a)

print(A)

 [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0


In [22]:
(b <- c(33, 34, 20, 52, 60, 71))
B <- myNormalizer(b)

print(B)

[1] 0.2549020 0.2745098 0.0000000 0.6274510 0.7843137 1.0000000
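Note that `myNormalizer()` divides by `maximum-minimum`, which is zero if all values of the input are equal. One possible guard is sketched below. The name `mySafeNormalizer` and the decision to map constant input to 0 are assumptions made for this illustration, not part of the function above.

```r
mySafeNormalizer <- function(rawdata){
    maximum <- max(rawdata)
    minimum <- min(rawdata)
    
    #Assumption: constant input is mapped to 0 instead of dividing by zero
    if (maximum == minimum){
        return (rep(0, length(rawdata)))
    }
    
    return ((rawdata-minimum)/(maximum-minimum))
}

mySafeNormalizer(c(5, 5, 5))
mySafeNormalizer(10:20)
```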


### Passing optional arguments to an inner function

The function `myNormalizer()` shall now be extended such that it can also provide normalized values which are rounded to a configurable number of digits. The standard R function `round()` already has the parameter `digits`, which sets the number of digits after the decimal point. The `round()` function shall now be applied in the new function `myRoundNormalizer()`. Hence, a value for the `digits` parameter of `round()` must be passed to `myRoundNormalizer()`. Passing an arbitrary set of arguments through to an inner function can be realized with the dots argument (ellipsis): `...`

This is demonstrated in the following code cells. 



In [23]:
myRoundNormalizer <- function(rawdata, round=FALSE, ...){   #We can also write "round=F", but FALSE is safer since F can be reassigned
    maximum <- max(rawdata)
    minimum <- min(rawdata)
    normeddata <- (rawdata-minimum)/(maximum-minimum)
    if (!round){
        return (normeddata)
    } else {
        return (round(normeddata, ...))
    }
}

In [24]:
B <- myRoundNormalizer(b)
print(B)

B <- myRoundNormalizer(b, round=TRUE)
print(B)

B <- myRoundNormalizer(b, round=TRUE, digits=2)
print(B) 

[1] 0.2549020 0.2745098 0.0000000 0.6274510 0.7843137 1.0000000
[1] 0 0 0 1 1 1
[1] 0.25 0.27 0.00 0.63 0.78 1.00
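What arrives in `...` can be inspected with `list(...)`. A small sketch, where `showDots` is a hypothetical helper for illustration only:

```r
showDots <- function(...){
    extraArgs <- list(...)     #collect all additional arguments in a list
    
    return (names(extraArgs))
}

showDots(digits=2, na.rm=TRUE)
```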


### 'Value by Copy' - Arguments and global Variables

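Arguments are passed *'by value'*: a function works on its own local copy, so modifying an argument or a global variable inside a function does not change the original object. So, let us take a look at how a global variable can be set from within a function.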
    

In [25]:
#Example data
dat <- c(36, 45, 22, 45, 67, 43, 34, 70, 67, 78, 34, 32, 
          34, 55, 58, 45, 37, 38, 44, 55, 66, 62, 59, 55)

grp <- factor(rep(LETTERS[1:4], c(4,6,6,8)))
df <- data.frame(group=grp, dat=dat)

glimpse(df)

Observations: 24
Variables: 2
$ group <fct> A, A, A, A, B, B, B, B, B, B, C, C, C, C, C, C, D, D, D, D, D...
$ dat   <dbl> 36, 45, 22, 45, 67, 43, 34, 70, 67, 78, 34, 32, 34, 55, 58, 4...


In [26]:

setVAL.NOTWORK <- function(pval) {    
   #This assignment creates a local copy of df. The global variable df is not changed
   df$dat[df$dat == pval] <- 999    
   
   #df[df$dat == 999, ]
}

setVAL.WORK <- function(pval) {
   #This assignment only changes a local copy of df
   df$dat[df$dat == pval] <- 999    
   
   #df[df$dat == 999, ]
    
   #We can use <<- to assign the modified local copy to the global variable df
   df <<- df
}


cat("\nCompare before and after values (not work):\n")
setVAL.NOTWORK(55)
df[df$dat == 999, ]

cat("\nCompare before and after values (work):\n")
setVAL.WORK(55)
df[df$dat == 999, ]



Compare before and after values (not work):


| group `<fct>` | dat `<dbl>` |
|---|---|



Compare before and after values (work):


| | group `<fct>` | dat `<dbl>` |
|---|---|---|
| 14 | C | 999 |
| 20 | D | 999 |
| 24 | D | 999 |
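Assigning to a global variable with `<<-` works, but it makes the data flow hard to follow. A common alternative, sketched here under the assumption that the data frame is passed in as an argument, is to return the modified copy and let the caller reassign it. The function name `setVal` and the example data `scores` are hypothetical.

```r
setVal <- function(data, pval, newval=999){
    #Only the local copy is modified here
    data$dat[data$dat == pval] <- newval
    
    return (data)
}

scores <- data.frame(group=c("A", "B", "C"), dat=c(55, 10, 55))

scoresNew <- setVal(scores, 55)   #the caller decides where the result is stored
scores$dat                        #the original is unchanged
scoresNew$dat
```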


## Efficient evaluation of functions: apply, lapply, sapply

If there are many sequences of numeric values (such as `a` and `b` above) and each sequence shall be normalized by the `myNormalizer()` function, one can implement a loop which invokes `myNormalizer()` in each iteration for an individual input argument (sequence of numeric values). Such an implementation would work, but it is verbose and, depending on how the results are collected, not very efficient. 

It is more compact, and often more efficient, to use the R built-in function [`lapply(list of variables, functionName)`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/lapply). Note that the first parameter is a **list** or **data frame**, which contains the objects on which the function (`myNormalizer()` in the example below) shall be executed. Pay attention to the data type of the result: `lapply()` always returns a list.

`Keep in mind`: A data frame is treated as a list of its columns, so the function is applied to each column separately. If the columns have different data types, the function must be able to handle each of them. Otherwise the results may be unwanted. 

As shown in the following code snippet, no explicit loop is required in this way:

In [27]:
(columnlist <- list(l1=a, 
                    l2=b))

In [42]:
#Example with list
columnlist
columnlistNormed <- lapply(columnlist, myNormalizer) 
class(columnlistNormed) 

columnlistNormed
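For comparison, the loop-based variant mentioned above could look like the following sketch. It repeats the definitions of `myNormalizer()` and `columnlist` to be self-contained; the name `loopResult` is chosen for this illustration only.

```r
myNormalizer <- function(rawdata){
    maximum <- max(rawdata)
    minimum <- min(rawdata)
    return ((rawdata-minimum)/(maximum-minimum))
}

columnlist <- list(l1=10:20, l2=c(33, 34, 20, 52, 60, 71))

#Loop-based: preallocate the result list and fill it element by element
loopResult <- vector("list", length(columnlist))
names(loopResult) <- names(columnlist)
for (i in seq_along(columnlist)){
    loopResult[[i]] <- myNormalizer(columnlist[[i]])
}

#lapply() yields the same list in a single call
identical(loopResult, lapply(columnlist, myNormalizer))
```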

In [44]:
#Example with data frame
dfIn <- data.frame(x1=seq(1, 10, by=1), 
                   x2=seq(10, 100, by=10))

class(dfIn)
dfIn

dfOut <- lapply(dfIn, mean)

#dfOut <- as.data.frame(dfOut)
class(dfOut)
dfOut


| x1 `<dbl>` | x2 `<dbl>` |
|---|---|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
| 6 | 60 |
| 7 | 70 |
| 8 | 80 |
| 9 | 90 |
| 10 | 100 |


If a function shall be executed not on a list of objects, but on an **array**, a **matrix** or a **data frame**, the [`apply()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/apply) function can be used. This function has an additional parameter `MARGIN`, which determines along which dimension of the multidimensional object the function shall be applied. Pay attention to the data type of the result, which is usually a 'matrix' or a 'vector'.

`Keep in mind again`: `apply()` converts a data frame to a matrix first. If the columns have different data types, all values are converted to a common type (e.g. character), possibly with unwanted results. 

This is demonstrated in the example below. Here the `myNormalizer()` function is first applied row-wise (`MARGIN=1` in `apply()`) and then column-wise (`MARGIN=2` in `apply()`).

In [30]:
(mymat <- matrix(floor(runif(28)*20), nrow=4, ncol=7))


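| [,1] | [,2] | [,3] | [,4] | [,5] | [,6] | [,7] |
|---|---|---|---|---|---|---|
| 2 | 7 | 8 | 12 | 14 | 5 | 4 |
| 16 | 11 | 15 | 18 | 9 | 12 | 3 |
| 15 | 10 | 7 | 11 | 11 | 1 | 14 |
| 16 | 15 | 19 | 16 | 8 | 4 | 11 |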


#### Rowwise normalization:

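*Look at the parameters of `apply()` and their descriptions in the documentation. Note that the result of the row-wise application is transposed: each normalized row of `mymat` appears as a column of the result.*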


In [46]:
(mymatNormed <- apply(mymat, 1, myNormalizer))
class(mymatNormed)

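| [,1] | [,2] | [,3] | [,4] |
|---|---|---|---|
| 0.0 | 0.8666667 | 1.0 | 0.8 |
| 0.4166667 | 0.5333333 | 0.6428571 | 0.7333333 |
| 0.5 | 0.8 | 0.4285714 | 1.0 |
| 0.8333333 | 1.0 | 0.7142857 | 0.8 |
| 1.0 | 0.4 | 0.7142857 | 0.2666667 |
| 0.25 | 0.6 | 0.0 | 0.0 |
| 0.1666667 | 0.0 | 0.9285714 | 0.4666667 |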


#### Columnwise normalization:

In [49]:
mymat
(mymatNormed <- apply(mymat, 2, myNormalizer))


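| [,1] | [,2] | [,3] | [,4] | [,5] | [,6] | [,7] |
|---|---|---|---|---|---|---|
| 2 | 7 | 8 | 12 | 14 | 5 | 4 |
| 16 | 11 | 15 | 18 | 9 | 12 | 3 |
| 15 | 10 | 7 | 11 | 11 | 1 | 14 |
| 16 | 15 | 19 | 16 | 8 | 4 | 11 |


| [,1] | [,2] | [,3] | [,4] | [,5] | [,6] | [,7] |
|---|---|---|---|---|---|---|
| 0.0 | 0.0 | 0.08333333 | 0.1428571 | 1.0 | 0.3636364 | 0.09090909 |
| 1.0 | 0.5 | 0.66666667 | 1.0 | 0.1666667 | 1.0 | 0.0 |
| 0.9285714 | 0.375 | 0.0 | 0.0 | 0.5 | 0.0 | 1.0 |
| 1.0 | 1.0 | 1.0 | 0.7142857 | 0.0 | 0.2727273 | 0.72727273 |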


#### Pay attention to how the arguments are accessed inside the functions passed to `apply()` and its relatives:

In [53]:
myTestfunc <- function(rawdata){
    #  apply() passes each row as a plain vector. Hence, you can not address
    #  an element by column name with $, e.g. rawdata$col1, but you can use rawdata[1]
    
    #Try what happens with
    #cat(rawdata$col1,"\n")
    
    #But you can address the elements by index
    cat(rawdata[1:6], "\n")
    print(class(rawdata[1]))
    
    res <- 1  #meaningless - for illustration only
    
    return (res)
}

df <- as.data.frame(mymat)
colnames(df) <- c("col1", "col2", "col3", "col4", "col5", "col6", "col7")
df

dummy <- apply(df, 1, myTestfunc)
class(dummy)

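| col1 `<dbl>` | col2 `<dbl>` | col3 `<dbl>` | col4 `<dbl>` | col5 `<dbl>` | col6 `<dbl>` | col7 `<dbl>` |
|---|---|---|---|---|---|---|
| 2 | 7 | 8 | 12 | 14 | 5 | 4 |
| 16 | 11 | 15 | 18 | 9 | 12 | 3 |
| 15 | 10 | 7 | 11 | 11 | 1 | 14 |
| 16 | 15 | 19 | 16 | 8 | 4 | 11 |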


2 7 8 12 14 5 
[1] "numeric"
16 11 15 18 9 12 
[1] "numeric"
15 10 7 11 11 1 
[1] "numeric"
16 15 19 16 8 4 
[1] "numeric"
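The type conversion performed by `apply()` on a data frame with mixed column types can be made visible with a small sketch. The data frame `mixed` is a made-up example.

```r
mixed <- data.frame(id=1:2, name=c("a", "b"), stringsAsFactors=FALSE)

#Column-wise with sapply(): the original column types are kept
sapply(mixed, class)

#Row-wise with apply(): the data frame is converted to a character matrix first,
#so every row arrives as a character vector
apply(mixed, 1, class)

#Inside the function an element can be addressed by name with [ ], but not with $
apply(mixed, 1, function(row) row["name"])
```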


[`sapply()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/lapply) is similar to *lapply()*. However, it tries to simplify the result and returns a vector or a matrix instead of a list whenever this is possible: 

In [34]:
#Example with list
(columnmean <- sapply(columnlist, mean))
class(columnmean)

In [35]:
#Example with data frame
dfIn <- data.frame(x1=seq(1, 10, by=1), 
                   x2=seq(10, 100, by=10))

class(dfIn)
dfIn

dfOut <- sapply(dfIn, mean)

class(dfOut)
dfOut

| x1 `<dbl>` | x2 `<dbl>` |
|---|---|
| 1 | 10 |
| 2 | 20 |
| 3 | 30 |
| 4 | 40 |
| 5 | 50 |
| 6 | 60 |
| 7 | 70 |
| 8 | 80 |
| 9 | 90 |
| 10 | 100 |
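Because the result type of `sapply()` depends on what the function returns, base R also offers `vapply()`, where the expected type and length of each single result is declared explicitly. A minimal sketch with the same data frame (the result names `meansS` and `meansV` are chosen for this illustration):

```r
dfIn <- data.frame(x1=seq(1, 10, by=1), 
                   x2=seq(10, 100, by=10))

meansS <- sapply(dfIn, mean)
meansV <- vapply(dfIn, mean, FUN.VALUE=numeric(1))  #each result must be one numeric value

meansV
identical(meansS, meansV)
```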


### Comparison of performance

Remember: [`system.time()`](https://www.rdocumentation.org/packages/base/versions/3.6.1/topics/system.time) reports three values. "user" is the CPU time spent by the current process (e.g. the current R session) and "system" is the CPU time spent by the kernel (the operating system) on behalf of this process. "elapsed" is the real (wall-clock) time that has passed while the statements were executed - that is our point of interest here.


In [36]:
testFun1 <- function(x) {
    
    #print(x)
    #print(class(x))
    
    cVec <- strsplit(x, "%", fixed=FALSE)
    #print(class(cVec))   #Pay attention: list
    
    cVec <- as.data.frame(cVec, stringsAsFactors=FALSE)  #takes time, so better sapply()
    #print(cVec)
    nVec <- as.numeric(cVec[1, ])
    
    #There is no simple way to access e.g. the first element of all list entries at once
    #nVec <- as.numeric(cVec[[1]][1])  
    
    return(nVec)
}

x <- data.frame(income=rep("15000%Test", 1000), stringsAsFactors=FALSE)

system.time({
    y <- testFun1(x$income) 
})

head(y)


   user  system elapsed 
   0.15    0.03    0.17 

In [37]:
testFun2 <- function(x) {
    
    #print(x)
    #print(class(x))
    
    cVec <- strsplit(x, "%", fixed=FALSE)
    #print(class(cVec))   #Pay attention: list
    
    #print(cVec[[1]][1])
    
    nVec <- as.numeric(cVec[[1]][1]) #Access to first element of first list
    
    return(nVec)
}

x <- data.frame(income=rep("15000%Test", 1000), stringsAsFactors=FALSE)

system.time({
    y <- sapply(x$income, testFun2)
})

head(y)

   user  system elapsed 
   0.02    0.00    0.01 
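Two further variants that avoid both the conversion to a data frame and the call of `strsplit()` per element are sketched below. They are meant as possible alternatives, not as the fastest solution; the variable names `pieces`, `y1` and `y2` are chosen for this illustration.

```r
income <- rep("15000%Test", 1000)

#Variant 1: split all strings at once, then pick the first piece of every list entry
pieces <- strsplit(income, "%", fixed=TRUE)
y1 <- as.numeric(vapply(pieces, `[`, character(1), 1))

#Variant 2: remove everything from the first % onwards with a regular expression
y2 <- as.numeric(sub("%.*$", "", income))

head(y1)
identical(y1, y2)
```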

### What can we do if the function passed as FUN argument to `sapply()` and its relatives needs more than one argument?

Look at the following example:

In [38]:
myfun <- function(var1, var2, var3){
      return(var1*var2*var3)
}

sapply(1:4, myfun, var2=2, var3=100)  #The first argument (var1) is supplied by sapply() itself,
                                      #all further arguments are passed on via the ... mechanism
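Two related formulations, as a brief sketch: an anonymous wrapper function, and `mapply()` for the case that more than one argument shall vary per call. The values used for `var2` are arbitrary.

```r
myfun <- function(var1, var2, var3){
      return(var1*var2*var3)
}

#Anonymous function: the fixed arguments are set inside the wrapper
resWrapper <- sapply(1:4, function(v) myfun(v, var2=2, var3=100))

#mapply(): var1 and var2 vary in parallel, var3 is constant
resMapply <- mapply(myfun, var1=1:4, var2=4:1, MoreArgs=list(var3=100))

resWrapper
resMapply
```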


See [Strategies to Speedup R Code](https://www.r-bloggers.com/strategies-to-speedup-r-code/) for ways to speed up your R code.

## Exercises

[Exercise on functions in R](../exercises/Ass05FunctionsR.ipynb)