## Additional Regular Expression Functions and Stringi

In [3]:
library(tidyverse)

Registered S3 methods overwritten by 'ggplot2':
  method         from 
  [.quosures     rlang
  c.quosures     rlang
  print.quosures rlang
Registered S3 method overwritten by 'rvest':
  method            from
  read_xml.response xml2
-- Attaching packages --------------------------------------- tidyverse 1.2.1 --
v ggplot2 3.1.1       v purrr   0.3.2  
v tibble  2.1.1       v dplyr   0.8.0.1
v tidyr   0.8.3       v stringr 1.4.0  
v readr   1.3.1       v forcats 0.4.0  
-- Conflicts ------------------------------------------ tidyverse_conflicts() --
x dplyr::filter() masks stats::filter()
x dplyr::lag()    masks stats::lag()


In [5]:
test = c('text', 'text200withnum', '200')

The pattern-matching `str_` functions take the string as the first argument and the pattern as the second; any remaining arguments depend on the function.

### str_detect: Returns `TRUE` if the string contains the pattern, `FALSE` otherwise

In [11]:
(str_detect(test, '\\d+')) # wrap in sum() to count the matches, or which() to get their positions
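
Because `str_detect` returns a logical vector, it combines naturally with `sum()` and `which()`. A minimal sketch using the same `test` vector (the names `has_digits`, `n_matches`, `positions`, and `matching` are only illustrative):

```r
library(stringr)

test <- c('text', 'text200withnum', '200')

has_digits <- str_detect(test, '\\d+')  # FALSE TRUE TRUE

n_matches <- sum(has_digits)    # TRUE counts as 1, so this is the number of matching strings
positions <- which(has_digits)  # indices of the matching strings
matching  <- test[has_digits]   # the matching strings themselves
```

Here `n_matches` is 2 and `positions` is `2 3`, since only the last two strings contain digits.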

### str_replace: Replaces the first match in each string with a replacement (`str_replace_all` replaces every match)

In [13]:
?str_replace

In [14]:
str_replace(test, '\\d+', 'XX')

In [15]:
str_replace(test, '\\d+', '') 

In [17]:
str_replace_all(test, '\\d', 'X') 
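
The strings in `test` contain at most one run of digits, so the difference between `str_replace` and `str_replace_all` is easy to miss. A small sketch with a hypothetical input `x` that has several runs:

```r
library(stringr)

x <- "a1b22c333"  # hypothetical input with three separate runs of digits

first_run  <- str_replace(x, "\\d+", "X")      # "aXb22c333": only the first run is replaced
every_run  <- str_replace_all(x, "\\d+", "X")  # "aXbXcX": every run is replaced
each_digit <- str_replace_all(x, "\\d", "X")   # "aXbXXcXXX": every single digit is replaced
```

Note that the pattern matters as much as the function: `\\d+` treats a run of digits as one match, while `\\d` matches digit by digit.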

### str_extract: Returns the text of the first match in each string

In [19]:
str_extract(test, '\\d+')

### Stringi

- stringr is built on top of the stringi package
- stringr covers most of our basic string necessities
- If no stringr function does what you want, a stringi function almost certainly will: stringr has 49 functions, stringi about 250
- The functions work similarly for the most part: the names start with `stri_` instead of `str_`
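
As a rough sketch of what the stringi counterparts look like, the calls below mirror the stringr examples from earlier in this section. Note that stringi names usually also carry a suffix such as `_regex` that says how the pattern is interpreted, so the translation is not always a pure prefix swap. The result names (`detected`, `replaced`, and so on) are only illustrative.

```r
library(stringi)

test <- c('text', 'text200withnum', '200')

detected  <- stri_detect_regex(test, '\\d+')               # same idea as str_detect()
replaced  <- stri_replace_first_regex(test, '\\d+', 'XX')  # same idea as str_replace()
all_repl  <- stri_replace_all_regex(test, '\\d', 'X')      # same idea as str_replace_all()
extracted <- stri_extract_first_regex(test, '\\d+')        # same idea as str_extract()
```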

# Writing Functions

We already have some experience calling functions that have been written for us. Sometimes you will want to do something that has no public implementation, so it makes sense to write your own function. A rule of thumb is that if you are planning to repeat the same operation several times, you should encapsulate that behaviour in a function. 

## Basic Syntax

myfunction <- function(arg1, arg2, ... ){
    statements
    return(object)
}

In [25]:
say_hello <- function() {
    
    print("hello")
}

In [28]:
say_hello()

[1] "hello"


In [35]:
say_hello <- function(guest) {
    
    paste("hello", guest)
}

In [36]:
say_hello("Ariel")

In [37]:
say_something <- function(word1 = "hello", word2 = "world") {
    
    paste(word1, word2)
}

In [42]:
say_something()
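
Default values are used only for arguments the caller leaves out. A short sketch of the three common ways to call a function with defaults (the result names are only illustrative):

```r
say_something <- function(word1 = "hello", word2 = "world") {
    paste(word1, word2)
}

both_defaults <- say_something()                 # "hello world": both defaults are used
by_position   <- say_something("goodbye")        # "goodbye world": word1 matched by position
by_name       <- say_something(word2 = "there")  # "hello there": word2 matched by name
```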

In [49]:
add_numbers <- function(x, y) {
    
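    x + y # evaluated, then discarded; wrap it in return() to make it the result
    x - y # last expression, so this is the value the function returns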
}

In [52]:
add_numbers(3, 4)
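
The call above returns `-1`, not `7`: an R function returns the value of the last expression it evaluates unless `return()` is called earlier. A minimal sketch contrasting the two behaviours (the function names `last_wins` and `early_exit` are hypothetical):

```r
# Without return(), the value of the last expression is the result
last_wins <- function(x, y) {
    x + y   # computed, then discarded
    x - y   # last expression: this is what the caller gets
}

# With return(), the function exits immediately with that value
early_exit <- function(x, y) {
    return(x + y)
    x - y   # never reached
}

last_wins(3, 4)    # -1
early_exit(3, 4)   # 7
```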

## Example: Rescaling Data

In [53]:
df = tibble(
  a = rnorm(10),
  b = rnorm(10),
  c = rnorm(10),
  d = rnorm(10)
)

print(df)

# A tibble: 10 x 4
        a       b       c      d
    <dbl>   <dbl>   <dbl>  <dbl>
 1  0.702  0.0956  1.46    0.356
 2 -0.228  1.81   -0.195  -0.567
 3 -2.84  -0.553  -0.252   0.477
 4 -0.568  1.02    0.450   0.345
 5  0.377  0.660   0.488   2.44 
 6  1.14  -0.392  -1.04    0.250
 7 -0.545  1.42   -0.0298  0.287
 8 -0.166 -0.234   0.815  -0.401
 9 -0.472 -0.957   1.50    0.546
10 -0.660  0.593  -1.08   -0.854


For every data point, we want to subtract the smallest number in the column and then divide by the range.

In [60]:
df2 = df %>% 
    mutate(a = (a-min(a))/(max(a)-min(a)),
           b = (b-min(b))/(max(b)-min(b)),
           c = (c-min(c))/(max(c)-min(c)),
           d = (d-min(d))/(max(d)-min(d)))
 
print(df2)

# A tibble: 10 x 4
       a     b      c      d
   <dbl> <dbl>  <dbl>  <dbl>
 1 0.891 0.380 0.982  0.367 
 2 0.657 1     0.342  0.0870
 3 0     0.146 0.320  0.404 
 4 0.571 0.712 0.592  0.364 
 5 0.809 0.584 0.607  1     
 6 1     0.204 0.0154 0.335 
 7 0.577 0.859 0.406  0.346 
 8 0.672 0.261 0.734  0.137 
 9 0.596 0     1      0.425 
10 0.548 0.560 0      0     


In [64]:
rescale = function(x) {
    (x - min(x)) / (max(x) - min(x))
}
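
One limitation of `rescale` as written is that a single `NA` in `x` makes `min(x)` and `max(x)` return `NA`, so the whole result becomes `NA`. A possible variant, sketched here under the assumption that missing values should simply be ignored when computing the range (the name `rescale01` is hypothetical):

```r
# Hypothetical variant of rescale(); the na.rm handling is an illustrative addition
rescale01 <- function(x) {
    rng <- range(x, na.rm = TRUE)   # c(min, max), ignoring missing values
    (x - rng[1]) / (rng[2] - rng[1])
}

rescale01(c(0, 5, 10))      # 0.0 0.5 1.0
rescale01(c(1, NA, 3, 5))   # 0.0  NA 0.5 1.0
```

Calling `range()` once also avoids computing the minimum twice.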

In [62]:
df2 = df %>%
    mutate(a=rescale(a), b=rescale(b), c=rescale(c), d=rescale(d))

In [63]:
print(df2)

# A tibble: 10 x 4
       a     b      c      d
   <dbl> <dbl>  <dbl>  <dbl>
 1 0.891 0.380 0.982  0.367 
 2 0.657 1     0.342  0.0870
 3 0     0.146 0.320  0.404 
 4 0.571 0.712 0.592  0.364 
 5 0.809 0.584 0.607  1     
 6 1     0.204 0.0154 0.335 
 7 0.577 0.859 0.406  0.346 
 8 0.672 0.261 0.734  0.137 
 9 0.596 0     1      0.425 
10 0.548 0.560 0      0     


## Conditional Execution

Often when writing functions we need to do different things depending on what data is passed in. This is known as *conditional execution*, and is accomplished using the `if/else` construct:
```{r}
if (condition) {
  # code executed when condition is TRUE
} else {
  # code executed when condition is FALSE
}
```

We already saw an example of this type of behavior with the `if_else` function.

### Example
The *Heaviside step function* is defined as
$$H(x)=\begin{cases}0,&x\le 0\\
1,&\text{otherwise}
\end{cases}.$$
How can we code this as an R function?

In [67]:
H = function(x) {
    
    if (x <= 0) {
        return(0)
    } else {
        return(1)
    }
}

In [68]:
H(0)

In [69]:
H(1)

In [70]:
H(2)

In [71]:
H(-1)
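
`H` works on one number at a time because `if` expects a single `TRUE` or `FALSE`. In recent versions of R a condition of length greater than one is an error; older versions used only the first element and gave a warning. To apply the step function to a whole vector, one option is the element-wise `ifelse()`. A sketch, with the hypothetical name `H_vec`:

```r
# Hypothetical vectorized variant of H(); ifelse() evaluates the test element-wise
H_vec <- function(x) {
    ifelse(x <= 0, 0, 1)
}

H_vec(c(-1, 0, 1, 2))   # 0 0 1 1
```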

### Multiple conditions
Sometimes you will want to check multiple conditions using an `if` statement. For example, let's define the function $$\operatorname{sgn}(x) = \begin{cases}-1,&x<0\\0,&x=0\\1,&x>0.\end{cases}$$

In [72]:
sgn = function(x) {
    if (x < 0) {
        return(-1)
    } else if (x == 0) {
        return(0)
    } else {
        return(1)
    }
}

In [73]:
sgn(2)

In [74]:
sgn(0)

In [75]:
sgn(-2)

The general form is
```{r}
if (condition1) {
  # do something
} else if (condition2) {
  # do something else
} else {
  # do another thing
}
```

### Validation
When writing functions it's a good idea to *validate* the input -- that is, make sure it matches your assumptions about what is being passed to the function. Consider the following function which returns the weighted average of a vector:

In [79]:
w_mean = function(x, w) {
    sum(x * w) / sum(w)
}

In [81]:
w_mean(x = c(1,2,3,4), w = c(.2, .4, .3, .1))

In [82]:
w_mean(c(1,2,3), w=c(1, 2))

"longer object length is not a multiple of shorter object length"

In [83]:
w_mean = function(x, w) {
    stopifnot(length(w) == length(x))
    sum(x * w) / sum(w)
}

In [84]:
w_mean(c(1,2,3), w=c(1, 2))

ERROR: Error in w_mean(c(1, 2, 3), w = c(1, 2)): length(w) == length(x) is not TRUE


In [85]:
w_mean = function(x, w) {
    if (length(w) != length(x)) {
        stop('length of data and weight vector do not match')
    }
    sum(x * w) / sum(w)
}

w_mean(c(1,2,3), c(1, 2))

ERROR: Error in w_mean(c(1, 2, 3), c(1, 2)): length of data and weight vector do not match
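
The same pattern extends to several checks, each with its own message. The sketch below adds two further checks (numeric input and non-negative weights) purely as illustrative assumptions about what the function should accept, and uses `tryCatch()` to capture the error message instead of halting:

```r
w_mean <- function(x, w) {
    # The numeric and non-negativity checks are illustrative additions
    if (!is.numeric(x) || !is.numeric(w)) {
        stop("x and w must both be numeric")
    }
    if (length(w) != length(x)) {
        stop("length of data and weight vector do not match")
    }
    if (any(w < 0)) {
        stop("weights must be non-negative")
    }
    sum(x * w) / sum(w)
}

ok <- w_mean(c(1, 2, 3, 4), c(.2, .4, .3, .1))   # 2.3

# tryCatch() returns the error message as a string rather than stopping
msg <- tryCatch(w_mean(c(1, 2, 3), c(1, 2)),
                error = function(e) conditionMessage(e))
msg   # "length of data and weight vector do not match"
```

Compared with `stopifnot()`, explicit `stop()` calls take more lines but let you write messages aimed at the person calling the function.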


### Dot-dot-dot (`...`)
Some functions are designed to take a variable number of inputs. We saw this for example with the `str_c` function.

To construct a function that takes a variable number of arguments we use the `...` notation:
```{r}
f = function(...) {
    # do something with the variable arguments, e.g. list(...)
}
```

In [88]:
commas <- function(...) {
    str_c(..., collapse = ", ")
}
commas(letters[1:10])

Can you get away with just using a fixed number of vector arguments instead?
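
Inside a function, `...` can either be captured for inspection or passed straight on to another function, which is what `commas` does with `str_c`. A minimal sketch of both uses (the helper names `count_args` and `total` are hypothetical):

```r
# 1. Capture the arguments in a list and inspect them
count_args <- function(...) {
    args <- list(...)
    length(args)
}

# 2. Forward the arguments unchanged to another function
total <- function(...) {
    sum(...)
}

count_args(1, "a", TRUE)   # 3
total(1, 2, 3, 4)          # 10
```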

### Pipeable functions
We've seen a lot of uses of the pipe operator `%>%`. As you become more advanced, you may find it useful to create your own functions which can be used in data pipelines. 

#### Transformations
For pipeable functions that transform a data frame, simply return the altered version of the data frame. For example:

In [91]:
get_row <- function(df, n=1) {
    df %>% slice(n)
}

mydf = tibble(x=c(1,2,3), y=c("a","b","c"))
print(mydf)

mydf %>% get_row()
mydf %>% get_row(3)

# A tibble: 3 x 2
      x y    
  <dbl> <chr>
1     1 a    
2     2 b    
3     3 c    


x,y
1,a


x,y
3,c


#### Side effects
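Some pipeable functions are called for their *side effects*, such as printing a message, rather than to transform the data frame. They should leave the data unchanged and return it so the pipeline can continue. For example, consider the following function which counts how many missing values are present in a data frame: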

In [92]:
show_missings = function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  df  # note return value
}

In [93]:
show_missings(mpg)

Missing values: 0


manufacturer,model,displ,year,cyl,trans,drv,cty,hwy,fl,class
audi,a4,1.8,1999,4,auto(l5),f,18,29,p,compact
audi,a4,1.8,1999,4,manual(m5),f,21,29,p,compact
audi,a4,2.0,2008,4,manual(m6),f,20,31,p,compact
audi,a4,2.0,2008,4,auto(av),f,21,30,p,compact
audi,a4,2.8,1999,6,auto(l5),f,16,26,p,compact
audi,a4,2.8,1999,6,manual(m5),f,18,26,p,compact
audi,a4,3.1,2008,6,auto(av),f,18,27,p,compact
audi,a4 quattro,1.8,1999,4,manual(m5),4,18,26,p,compact
audi,a4 quattro,1.8,1999,4,auto(l5),4,16,25,p,compact
audi,a4 quattro,2.0,2008,4,manual(m6),4,20,28,p,compact


In [94]:
show_missings = function(df) {
  n <- sum(is.na(df))
  cat("Missing values: ", n, "\n", sep = "")
  invisible(df)  # return df without printing it at the console
}

In [99]:
library(nycflights13)

"package 'nycflights13' was built under R version 3.6.3"

In [100]:
show_missings(flights)

Missing values: 46595


In [101]:
flights %>% filter(month < 5) %>% show_missings

Missing values: 18387
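
To check what `invisible()` actually does, base R's `withVisible()` reports whether a result would be auto-printed. The sketch below uses the built-in `airquality` data set rather than `flights`, only so that it runs without extra packages:

```r
show_missings <- function(df) {
    n <- sum(is.na(df))
    cat("Missing values: ", n, "\n", sep = "")
    invisible(df)
}

# withVisible() returns the value together with a visibility flag
res <- withVisible(show_missings(airquality))

res$visible                        # FALSE: the data frame is returned but not printed
identical(res$value, airquality)   # TRUE: the data passes through unchanged
```

Because the data frame is still returned, the function can sit in the middle of a pipeline and the next step receives the same data.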


## Local vs Global Variables

https://www.w3schools.com/r/r_variables_global.asp#:~:text=If%20you%20create%20a%20variable,and%20with%20the%20original%20value.
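
A minimal sketch of the distinction this heading refers to, using hypothetical function names: a variable assigned inside a function is local to that call, while `<<-` assigns to a variable that already exists outside the function.

```r
x <- "global"

use_local <- function() {
    x <- "local"   # creates a new x that exists only inside the function
    x
}

use_global_assign <- function() {
    x <<- "changed"   # <<- assigns to the x found outside the function
    invisible(NULL)
}

local_result <- use_local()   # "local"
x_after_local <- x            # "global": the outer variable is untouched

use_global_assign()
x                             # "changed"
```

Relying on `<<-` makes functions harder to reason about, so returning a value and assigning it at the call site is usually the clearer choice.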