# **Functions in R**

---

## Functions used
* `table()`
* `which.max()`
* `which()`
* `na.omit()`
* `paste()`
* `unique()`
* `ls()`
* `rm()`

---

## What are functions in R?

<img src="https://upload.wikimedia.org/wikipedia/commons/thumb/3/3b/Function_machine2.svg/1200px-Function_machine2.svg.png" style="background-color:red;" width="300" >

Functions in R 
1. take one or more arguments as input,
2. perform a given task using these inputs, and
3. return one or more objects as output

Example: The minimum function `min()` 
1. takes a numeric vector as input,
2. finds the minimum of the vector, and 
3. returns the minimum

<br>
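
As a quick illustration of the three steps above, the sketch below calls `min()` on a small vector. The names `scores` and `smallest` are only illustrative placeholders.

```r
# input: a numeric vector (the name `scores` is just for illustration)
scores <- c(7, 3, 9, 5)

# task + output: min() finds the smallest value and returns it
smallest <- min(scores)
smallest
```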

### Why use functions?

1. Avoid rewriting and rerunning the same lines of code over and over (i.e. more efficient coding)
2. Use functions that others have built without having to know what they do internally!
  * This becomes useful when we start using functions to implement complex statistical algorithms

<br>

An R function is created with the `function` keyword, using the syntax below


```
function_name <- function(arg_1, arg_2, ...) {
   # function body: R commands that compute the output
   return(output)
}
```

* Function Name
  * This is the actual name of the function; the function is stored as an R object of class 'function'
* Arguments
  * Arguments (`arg_1, arg_2, ...`) are placeholders. When you call a function, you pass a value (e.g., number, vector, dataframe etc.) to the argument
* Function Body
  * This is a collection of R commands that define what the function does
* Return Value
  * This is the value the function hands back to the caller, usually via `return()`; without an explicit `return()`, R returns the last expression evaluated in the function body



<br>
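
To make the four parts concrete, here is a minimal sketch. The function name `square` and its argument `x` are hypothetical and are not used elsewhere in this lesson.

```r
# function name: `square`; argument: `x` (both chosen only for illustration)
square <- function(x) {
  # function body: compute the square of the input
  result <- x^2

  # return value: handed back to the caller
  return(result)
}

class(square)   # the function itself is an object of class "function"
square(4)       # calls the function with the value 4 passed to x
```

Here `square(4)` evaluates to 16. Note that `result` is only a name used inside the function body.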

## Creating our own mode function `myMode()`


* Three measures of central tendency in data
  * mean
  * median
  * mode

<br> 
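* We are already familiar with the mean and median functions, but R does not have a built-in function to find the statistical mode (the built-in `mode()` function reports how an object is stored, not its most frequent value)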
* We can build our own

<br>

The mode of a vector is the value that occurs most often.

<br>


What is the mode of the following vector (without using R)?
```
c(4, 8, 4, 8, 4, 5)
```




<br>

How do we arrive at the answer above?
1. Identified the unique values of the vector
2. Counted the occurrence of the unique values
3. Returned the value that occurred most often

We can replicate these steps in R

In [None]:
my_vector <- c(4, 8, 4, 8, 4, 5)

# the table() function finds the frequency/counts of 
# all unique values in the vector
vector_freq <- table(my_vector)
vector_freq

In [None]:
# Find the index of the maximum
max_index <- which.max(vector_freq)
max_index

In [None]:
names(vector_freq)

In [None]:
# Find the number corresponding to the max index
mode_out <- names(vector_freq[max_index])
mode_out

In [None]:
# need to convert the value to a number
as.numeric(mode_out)

In [None]:
# all lines together
my_vector <- c(4, 8, 4, 8, 4, 5)

vector_freq <- table(my_vector)
max_index <- which.max(vector_freq)
mode_out <- names(vector_freq[max_index])
as.numeric(mode_out)

<br>

### Creating the myMode function

In [None]:
myMode <- function(my_vector) {

  # find frequencies
  vector_freq <- table(my_vector)

  # find max index
  max_index <- which.max(vector_freq)

  # find value with max index
  mode_out <- names(vector_freq[max_index])
  
  # our return statement
  return(as.numeric(mode_out))
}

In [None]:
# running the myMode() function
this_is_mode <- myMode(c(4, 8, 4, 8, 4, 5))
this_is_mode

<br>

What happens if we have multiple modes (i.e. two or more values tie for occurring most often)?

What is the mode of the following vector?

```
c(4, 8, 4, 8, 4, 5, 8)
```

In [None]:
# running the myMode() function
myMode(c(4, 8, 4, 8, 4, 5, 8))

* The `which.max()` function only returns the index of the first maximum value
* We want to return all of the values that tie for the maximum frequency

In [None]:
my_vector <- c(4, 8, 4, 8, 4, 5, 8)

# find frequencies
vector_freq <- table(my_vector)
vector_freq

In [None]:
# find max index
max_index <- which.max(vector_freq)
max_index

* Need to find all frequencies that equal the max frequency

In [None]:
# first find maximum
max_freq <- max(vector_freq)
max_freq

In [None]:
which(vector_freq == max_freq)

In [None]:
# find max index
max_index <- which(vector_freq == max_freq)
max_index

In [None]:
mode_out <- names(vector_freq)[max_index]
mode_out

In [None]:
as.numeric(mode_out)

<br>

Rewriting the function for multiple modes

In [None]:
myMode <- function(my_vector) {

  # find frequencies
  vector_freq <- table(my_vector)

  # find max index(es)
  max_freq  <- max(vector_freq)
  max_index <- which(vector_freq == max_freq)

  # find value with max index
  mode_out <- names(vector_freq)[max_index]
  
  # our return statement
  return(as.numeric(mode_out))
}

In [None]:
myMode(c(4, 8, 4, 8, 4, 5, 8))

<br>

What if we want a warning if there is more than one mode?
* We can include print and if statements

In [None]:
modes <- myMode(c(4, 8, 4, 8, 4, 5, 8))
modes

length(modes)  # number of modes

In [None]:
paste("Number of modes:", length(modes), sep=" ")

In [None]:
myMode <- function(my_vector) {

  # find frequencies
  vector_freq <- table(my_vector)

  # find max index(es)
  max_freq  <- max(vector_freq)
  max_index <- which(vector_freq == max_freq)

  # find value with max index
  mode_out <- names(vector_freq)[max_index]

  # statement for number of modes
  nb_modes <- length(mode_out)
  if (nb_modes > 1) {
    print( paste("Number of modes:", nb_modes, sep=" ") )
  }
  
  # our return statement
  return(as.numeric(mode_out))
}

In [None]:
# single mode
modes <- myMode(c(4, 8, 4, 8, 4, 5))
modes

In [None]:
# multiple modes
modes <- myMode(c(4, 8, 4, 8, 4, 5, 8))
modes

<br>
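
As an alternative to `print()`, base R's `warning()` function signals a proper warning message. The sketch below is one possible variant, not the version used in the rest of this lesson; the name `myModeWarn` is hypothetical.

```r
# hypothetical variant of myMode() that uses warning() instead of print()
myModeWarn <- function(my_vector) {

  # find frequencies and the value(s) with the highest frequency
  vector_freq <- table(my_vector)
  max_freq    <- max(vector_freq)
  max_index   <- which(vector_freq == max_freq)
  mode_out    <- names(vector_freq)[max_index]

  # signal a warning when two or more values tie
  nb_modes <- length(mode_out)
  if (nb_modes > 1) {
    warning(paste("Number of modes:", nb_modes))
  }

  return(as.numeric(mode_out))
}

# two modes, so this call also produces a warning
myModeWarn(c(4, 8, 4, 8, 4, 5, 8))
```

Unlike `print()`, a warning is reported separately from the function's output, and the function still returns its value.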

What if we want to be able to turn verbosity on/off (i.e. turn the printed messages on/off)?
* Can add a function argument
* Function arguments can have default values
* We will make a verbosity argument default to `FALSE`
* When the argument is `TRUE`, the function will print the statement


In [None]:
myMode <- function(my_vector, verbose = FALSE) {

  # find frequencies
  vector_freq <- table(my_vector)

  # find max index(es)
  max_freq  <- max(vector_freq)
  max_index <- which(vector_freq == max_freq)

  # find value with max index
  mode_out <- names(vector_freq)[max_index]

  
  # only print if verbose is TRUE
  if (verbose) {

    # statement for number of modes
    nb_modes <- length(mode_out)
    if (nb_modes > 1) {
      print( paste("Number of modes:", nb_modes, sep=" ") )
    }

  }
  
  # our return statement
  return(as.numeric(mode_out))
}

In [None]:
modes <- myMode(c(4, 8, 4, 8, 4, 5, 8))
modes

In [None]:
modes <- myMode(c(4, 8, 4, 8, 4, 5, 8), verbose = TRUE)
modes

<br>

### Debugging

What if we want to find the mode of a character vector? What do we change?


In [None]:
modes <- myMode(c("four", "eight", "four", "eight", 
                  "four", "five", "eight"))
modes

* We are attempting to convert a character vector to numeric; `as.numeric()` cannot convert words such as "four", so it returns `NA` values with a warning
* We can avoid this by checking for this in our function

In [None]:
myMode <- function(my_vector, verbose = FALSE) {

  # find frequencies
  vector_freq <- table(my_vector)

  # find max index(es)
  max_freq  <- max(vector_freq)
  max_index <- which(vector_freq == max_freq)

  # find value with max index
  mode_out <- names(vector_freq)[max_index]

  
  # only print if verbose is TRUE
  if (verbose) {

    # statement for number of modes
    nb_modes <- length(mode_out)
    if (nb_modes > 1) {
      print( paste("Number of modes:", nb_modes, sep=" ") )
    }

  }
  
  # our return statement
  if (is.numeric(my_vector)) {
    return(as.numeric(mode_out))
  } else {
    return(mode_out)
  }
  
}

In [None]:
modes <- myMode(c("four", "eight", "four", "eight", 
                  "four", "five", "eight"))
modes

<br>

## Built-in Functions

The `unique()` function finds all unique values in a vector or table

In [None]:
my_vector <- c(3, 2, 4, 4, 2, 4, 4, 4, 2)
unique(my_vector)

<br>

`sort()` sorts a vector

In [None]:
my_vector <- c(3, 2, 4, 4, 2, 4, 4, 4, 2)
sort(my_vector)

<br>

We are already familiar with the `mean()` function, but `mean()`, like many other functions, has additional arguments you can use

In [None]:
my_vector <- c(3, 2, 4, 4, 2, 4, 4, 4, 2, NA)
mean(my_vector)

In [None]:
# na.rm is an argument in the mean function
mean(my_vector, na.rm=TRUE)

In [None]:
# or we can remove the missing value
na.omit(my_vector)

In [None]:
# then average
mean(na.omit(my_vector))

<br>

Identify which variables you have already stored in your workspace

In [None]:
ls()

<br>

Remove objects from your workspace

In [None]:
rm(list = ls())

In [None]:
ls()

<br>

Using other libraries to do the work for you!

In [None]:
# installs a package someone else wrote
# usually only have to do this once, but Colab requires
# you to do this each time
install.packages("modeest")

In [None]:
# loads the package
library(modeest)

In [None]:
mfv(c(4, 8, 4, 8, 4, 5, 8))

<br>



## Scope of functions

* Scope is essentially knowing where variables can and cannot be accessed
* This concept is best understood through example

In [None]:
# function adds 5 to x
add_5 <- function(x) {
  y <- x + 5
  return(y)
}

In [None]:
add_5(2)

* We stored the variable `y` in this function
* Let's see if we can access it

In [None]:
y

* The variable `y` only exists within the function `add_5` (i.e. the scope of `y`)
* In this case, `y` is a **local variable** since it is defined locally within a function

<br>

What happens in the following?

In [None]:
# function adds x and y
add_y <- function(x) {
  return(x + y)
}

In [None]:
y <- 3
add_y(2)

* Careful when using variables that are not defined in a function!
* R will pull the variable globally if possible
* This can lead to annoying bugs down the road

<br>
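
One way to avoid this kind of bug is to pass every value the function needs as an argument. A minimal sketch, where the name `add_xy` is hypothetical:

```r
# hypothetical safer version: y is an explicit argument,
# so the result no longer depends on what is in the workspace
add_xy <- function(x, y) {
  return(x + y)
}

y <- 3
add_xy(2, y = 10)   # uses the y passed in (10), not the global y (3)
```

The call returns 12 because the argument `y` inside the function takes precedence over the global `y`.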

In [None]:
# remove all variables
rm(list = ls())

# function adds 5 to x
add_5 <- function(x) {
  y <- x + 5
  return(y)
}

In [None]:
add_5(2)
y

<br>

* The `<<-` assignment operator assigns to a variable outside the function (typically in the global environment) instead of creating a local variable
* This has its uses, but NEVER use this in this course

In [None]:
# remove all variables
rm(list = ls())

# function adds 5 to x
add_5 <- function(x) {
  y <<- x + 5
  return(y)
}

In [None]:
add_5(2)
y

<br>
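
A sketch of the usual alternative to `<<-`: return the value from the function and assign it with `<-` where the function is called. The name `result` is just an illustrative choice.

```r
# function adds 5 to x using only a local variable
add_5 <- function(x) {
  y <- x + 5
  return(y)
}

# the caller decides where the output is stored
result <- add_5(2)
result
```

This keeps the function free of side effects: nothing in the workspace changes unless the caller assigns the output.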



## Functions Practice

Create a function `quadratic_formula` that has the following attributes
* Input: Three single numbers `a`, `b`, `c` as function arguments
* Output: The solution to the quadratic equation $ax^2 + bx + c = 0$
  * $x = \frac{-b \pm \sqrt{b^2 - 4ac}}{2a}$
* Use `if` statements to potentially return three types of function outputs
  1. If $b^2 - 4ac$ is zero, simply return the one solution
  2. If $b^2 - 4ac$ is less than zero, print a statement saying there are no zeros that are real numbers
  3. If $b^2 - 4ac$ is greater than zero, return the two real zeros as a single vector

In [None]:
quadratic_formula(1, 0, 0)

In [None]:
quadratic_formula(1, 0, 1)

In [None]:
quadratic_formula(1, 0, -1)

<br>

Create a function `avg_diff` that has the following attributes
* Inputs:
  * vector1 - a numeric vector
  * vector2 - a numeric vector
* Output: A list of three objects providing
  1. the difference between the two vectors
  2. the average difference between the two vectors
  3. the length of the first vector
* Use `if` statements to
  * print a warning statement if the vectors are not the same length

In [None]:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(0, 1, 2, 3, 4, 5)
avg_diff(vec1, vec2)

In [None]:
vec1 <- c(1, 2, 3, 4, 5, 6)
vec2 <- c(0, 1, 2, 3, 4)
avg_diff(vec1, vec2)

<br>