
# 25.1 Introduction

Writing **functions** is one of the most effective ways to level up as a data scientist. Functions let you replace copy-and-paste workflows with reusable, well-named building blocks.

## Why write functions?
Functions beat copy-and-paste because they:
- Give your code **clear meaning** through good names  
- Centralize logic, so changes happen in **one place**
- Reduce **accidental bugs** from inconsistent edits
- Make work **reusable across projects**, saving time long-term

A solid rule of thumb: if you’ve copied the same code **three times**, it’s time to write a function.

## Types of functions covered in this chapter
- **Vector functions**: take vectors in, return vectors
- **Data frame functions**: take a data frame in, return a data frame
- **Plot functions**: take a data frame in, return a plot

Examples throughout the chapter use tidyverse tools and familiar datasets from `nycflights13`.


In [None]:
# Prerequisites
library(tidyverse)
library(nycflights13)

# Example: a simple vector function
delay_hours <- function(delay_minutes) {
  delay_minutes / 60
}

delay_hours(c(30, 90, 120))

# Example: a simple data frame function
mean_delay_by_dest <- function(df) {
  df |>
    group_by(dest) |>
    summarize(
      mean_arr_delay = mean(arr_delay, na.rm = TRUE),
      .groups = "drop"
    )
}

mean_delay_by_dest(flights)

# Example: a simple plot function
plot_delay_distribution <- function(df) {
  ggplot(df, aes(arr_delay)) +
    geom_histogram(binwidth = 10) +
    labs(x = "Arrival delay (minutes)", y = "Count")
}

plot_delay_distribution(flights)


# 25.2 Vector functions

**Vector functions** take one or more vectors as input and return a vector (or a single summary value). They are especially useful inside `mutate()`, `filter()`, and `summarize()` because they reduce repetition, prevent copy-paste bugs, and make your code easier to read and maintain.

A classic example is rescaling a numeric vector to the range [0, 1]. Writing this logic once as a function avoids subtle errors and lets you improve the implementation in a single place.
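To make the risk concrete, here is a minimal sketch of the copy-and-paste version next to a function-based version. The tibble `df` and the helper name `rescale_simple()` are made up for illustration, and the helper deliberately skips the handling of infinite values.

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- tibble(
  a = c(1, 5, 9),
  b = c(10, 20, 40)
)

# Copy-and-paste version: the line for `b` still subtracts min(a),
# an edit that is easy to miss when adapting pasted code
pasted <- df |>
  mutate(
    a = (a - min(a, na.rm = TRUE)) / (max(a, na.rm = TRUE) - min(a, na.rm = TRUE)),
    b = (b - min(a, na.rm = TRUE)) / (max(b, na.rm = TRUE) - min(b, na.rm = TRUE))
  )

# Function version: the logic lives in one place (simplified, no Inf handling)
rescale_simple <- function(x) {
  rng <- range(x, na.rm = TRUE)
  (x - rng[1]) / (rng[2] - rng[1])
}

fixed <- df |>
  mutate(
    a = rescale_simple(a),
    b = rescale_simple(b)
  )

range(pasted$b)  # falls outside [0, 1] because of the leftover min(a)
range(fixed$b)   # stays within [0, 1]
```

With the function, there is only one expression to get right, and the call sites cannot drift apart.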

Vector functions generally fall into two big groups:
- **Mutate functions**: return a vector of the same length as the input
- **Summary functions**: return a single value (used in `summarize()`)
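The difference between the two groups shows up in where each can be used. The sketch below assumes two hypothetical helpers, `center()` and `spread()`, and a made-up tibble `scores`; none of these names come from the chapter.

```r
library(dplyr)

# Hypothetical mutate function: output has the same length as the input
center <- function(x) {
  x - mean(x, na.rm = TRUE)
}

# Hypothetical summary function: output is a single value
spread <- function(x) {
  max(x, na.rm = TRUE) - min(x, na.rm = TRUE)
}

scores <- tibble(
  group = c("a", "a", "b", "b"),
  value = c(1, 3, 10, 20)
)

# A mutate function keeps one output per input row
centered <- scores |>
  mutate(value_centered = center(value))

# A summary function collapses each group to one row
spreads <- scores |>
  group_by(group) |>
  summarize(value_spread = spread(value))
```

Here `centered` keeps all four rows, while `spreads` has one row per group.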

Below are clean implementations of the main examples from this section, plus solutions to the exercises.


In [None]:
library(tidyverse)

# ---- Core example: rescale to [0, 1] ----
rescale01 <- function(x) {
  rng <- range(x, na.rm = TRUE, finite = TRUE)
  scaled <- (x - rng[1]) / (rng[2] - rng[1])
  if_else(is.infinite(x) & x < 0, 0,
          if_else(is.infinite(x) & x > 0, 1, scaled))
}

rescale01(c(-10, 0, 10))
rescale01(c(1:10, Inf, -Inf))


# ---- Mutate-style vector functions ----
z_score <- function(x) {
  (x - mean(x, na.rm = TRUE)) / sd(x, na.rm = TRUE)
}

clamp <- function(x, min, max) {
  case_when(
    x < min ~ min,
    x > max ~ max,
    .default = x
  )
}

first_upper <- function(x) {
  str_sub(x, 1, 1) <- str_to_upper(str_sub(x, 1, 1))
  x
}

clean_number <- function(x) {
  is_pct <- str_detect(x, "%")
  num <- x |>
    str_remove_all("%|,|\\$") |>
    as.numeric()
  if_else(is_pct, num / 100, num)
}

fix_na <- function(x) {
  if_else(x %in% c(997, 998, 999), NA, x)
}


# ---- More mutate-style helpers (output has the same length as x) ----
prop_total <- function(x) {
  x / sum(x, na.rm = TRUE)
}

percent_total <- function(x, digits = 1) {
  round(x / sum(x, na.rm = TRUE) * 100, digits)
}


# ---- Summary-style vector functions ----
prop_missing <- function(x) {
  mean(is.na(x))
}

commas <- function(x) {
  str_flatten(x, collapse = ", ", last = " and ")
}

cv <- function(x, na.rm = FALSE) {
  sd(x, na.rm = na.rm) / mean(x, na.rm = na.rm)
}

n_missing <- function(x) {
  sum(is.na(x))
}

both_na <- function(x, y) {
  sum(is.na(x) & is.na(y))
}


# ---- Dates & statistics exercises ----
# Approximate age: elapsed days divided by the average year length (365.25)
age_years <- function(birthdate, ref_date = Sys.Date()) {
  as.numeric(difftime(ref_date, birthdate, units = "days")) / 365.25
}

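# Sample variance (n - 1 denominator), matching var() and the sd() used in skewness()
variance <- function(x, na.rm = FALSE) {
  if (na.rm) x <- x[!is.na(x)]
  sum((x - mean(x))^2) / (length(x) - 1)
}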

skewness <- function(x, na.rm = FALSE) {
  m <- mean(x, na.rm = na.rm)
  s <- sd(x, na.rm = na.rm)
  mean(((x - m) / s)^3, na.rm = na.rm)
}


# ---- Tiny but useful helpers ----
is_directory <- function(x) {
  file.info(x)$isdir
}

is_readable <- function(x) {
  file.access(x, 4) == 0
}


# 25.3 Data frame functions

**Data frame functions** wrap up repeated *dplyr pipelines*, not just single expressions. They behave like verbs:  
they take a data frame as the **first argument**, optionally take column arguments, and return a data frame (or summary).

The main challenge is **indirection**: dplyr uses *tidy evaluation*, so when you pass column names into a function, you must tell dplyr to look *inside* those arguments. This is done with **embracing**: `{{ }}`.
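A small sketch of the indirection problem and its fix, using a made-up tibble `df`. The function names `grouped_mean_broken()` and `grouped_mean()` are chosen for illustration.

```r
library(dplyr)

# Hypothetical data, for illustration only
df <- tibble(
  g = c("a", "a", "b"),
  x = c(1, 3, 10)
)

# Without embracing, dplyr does not substitute the columns the caller supplied
grouped_mean_broken <- function(df, group_var, mean_var) {
  df |>
    group_by(group_var) |>
    summarize(mean = mean(mean_var))
}

# try() keeps the script running so the failure can be inspected
result_broken <- try(grouped_mean_broken(df, g, x), silent = TRUE)
inherits(result_broken, "try-error")

# With embracing, dplyr looks inside the arguments and uses `g` and `x`
grouped_mean <- function(df, group_var, mean_var) {
  df |>
    group_by({{ group_var }}) |>
    summarize(mean = mean({{ mean_var }}))
}

result <- grouped_mean(df, g, x)
```

The only difference between the two functions is the `{{ }}` around each column argument.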

Two tidy-eval modes matter:
- **Data-masking** → used by `filter()`, `mutate()`, `summarize()`, `group_by()`
- **Tidy-selection** → used by `select()`, `rename()`, `across()`, `pivot_*()`

If you need tidy-selection *inside* a data-masking verb, use `pick()`.
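As a sketch of that pattern, the function below lets the caller choose grouping columns with tidy-selection syntax, even though `group_by()` is a data-masking verb. The tibble `visits` and its column names are invented for this example, and `pick()` requires dplyr 1.1.0 or later.

```r
library(dplyr)  # pick() needs dplyr >= 1.1.0

# Hypothetical data, for illustration only
visits <- tibble(
  year = c(2023, 2023, 2024, 2024),
  site = c("north", "north", "north", "south"),
  n_seen = c(4, NA, NA, 7)
)

count_missing <- function(df, group_vars, x_var) {
  df |>
    # pick() turns a tidy-selection into columns that group_by() can use
    group_by(pick({{ group_vars }})) |>
    summarize(
      n_miss = sum(is.na({{ x_var }})),
      .groups = "drop"
    )
}

missing_by_group <- visits |>
  count_missing(c(year, site), n_seen)
```

Because `group_vars` goes through `pick()`, the caller can pass `c(year, site)` or any other tidy-selection.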

Below are solutions to the exercises. The functions that take a column argument (`summarize_weather()` and `standardize_time()`) rely on embracing; `count_prop()` forwards its columns through `...` instead.


In [None]:
library(tidyverse)
library(nycflights13)

# ---- 1. Find cancelled or severely delayed flights ----
filter_severe <- function(df = flights, hours = 1) {
  df |>
    filter(is.na(arr_time) | arr_delay > hours * 60)
}

# Example:
# flights |> filter_severe()
# flights |> filter_severe(hours = 2)


# ---- 2. Count cancelled vs delayed flights by group ----
summarize_severe <- function(df) {
  df |>
    summarize(
      cancelled = sum(is.na(arr_time)),
      delayed = sum(arr_delay > 60, na.rm = TRUE),
      .groups = "drop"
    )
}

# Example:
# flights |> group_by(dest) |> summarize_severe()


# ---- 3. Summarize weather variable (min / mean / max) ----
summarize_weather <- function(df = weather, var) {
  df |>
    summarize(
      min  = min({{ var }}, na.rm = TRUE),
      mean = mean({{ var }}, na.rm = TRUE),
      max  = max({{ var }}, na.rm = TRUE),
      .groups = "drop"
    )
}

# Example:
# weather |> summarize_weather(temp)


# ---- 4. Convert clock time (HHMM) to decimal hours ----
standardize_time <- function(df, var) {
  df |>
    mutate(
      {{ var }} :=
        floor({{ var }} / 100) +
        ({{ var }} %% 100) / 60
    )
}

# Example:
# flights |> standardize_time(sched_dep_time)


# ---- 5. Generalized count + proportion (any number of vars) ----
count_prop <- function(df, ..., sort = FALSE) {
  df |>
    count(..., sort = sort) |>
    mutate(prop = n / sum(n))
}

# Example:
# diamonds |> count_prop(cut)
# diamonds |> count_prop(cut, color)


# 25.4 Plot functions

**Plot functions** return ggplot objects instead of data frames. The same technique carries over because `aes()` is a **data-masking** function, so column arguments must be **embraced** with `{{ }}`.

Why plot functions are powerful:
- They remove repeated ggplot boilerplate
- They enforce consistent visual style
- They still allow extension with `+` after the function returns the plot
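A minimal sketch of these points, assuming a helper named `histogram()` and the `diamonds` dataset that ships with ggplot2:

```r
library(ggplot2)

# The boilerplate lives in one place; the column is embraced inside aes()
histogram <- function(df, var, binwidth = NULL) {
  df |>
    ggplot(aes(x = {{ var }})) +
    geom_histogram(binwidth = binwidth)
}

base_plot <- diamonds |>
  histogram(carat, binwidth = 0.1)

# The function returns an ordinary ggplot, so more components can be added with +
labelled_plot <- base_plot +
  labs(x = "Size (in carats)", y = "Number of diamonds")
```

Since `histogram()` returns a plain ggplot object, labels, themes, and facets can all be added afterwards.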

In the exercises below, we build a plotting function step-by-step:
1. Draw a scatterplot from a dataset and x/y variables  
2. Add a linear best-fit line (no standard error band)  
3. Add a dynamic title using tidy-evaluation-aware labeling

The final result is a reusable, composable plotting helper.
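The dynamic title in step 3 can be built with `rlang::englue()`, which understands `{{ }}` inside a string. The sketch below isolates that step; `make_title()` is a hypothetical helper name, and `carat` is only used as a label, so no data is needed.

```r
library(rlang)

# englue() must be called inside a function so that {{ var }} can refer
# to the expression the caller supplied
make_title <- function(var) {
  englue("A histogram of {{ var }}")
}

make_title(carat)
#> [1] "A histogram of carat"
```

`englue()` also interpolates ordinary values with single braces, following glue syntax.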


In [None]:
library(tidyverse)

# ---- Scatterplot + linear fit + title ----
scatter_lm <- function(df, x, y) {
  title <- rlang::englue("Scatterplot of {{ y }} vs {{ x }}")

  df |>
    ggplot(aes(x = {{ x }}, y = {{ y }})) +
    geom_point(alpha = 0.6) +
    geom_smooth(
      method = "lm",
      formula = y ~ x,
      se = FALSE
    ) +
    labs(title = title)
}

# Example:
# diamonds |> scatter_lm(carat, price)
# starwars |> filter(mass < 1000) |> scatter_lm(mass, height)


# 25.5 Style

Good function style is about **readability for humans**, not pleasing R.

Key takeaways:
- Prefer **clear over short** names (autocomplete exists for a reason)
- Functions → **verbs**, arguments → **nouns**
- Indent function bodies by **two spaces**
- Always include `{}` after `function()`
- Add spaces inside `{{ }}` to highlight tidy evaluation
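To illustrate these guidelines, here is the same function written twice. Both versions are invented for this comparison; `f()` ignores the guidelines and `summarize_range()` follows them.

```r
library(dplyr)

# Harder to read: cryptic names, no indentation, cramped {{}}
f <- function(d, v) {
d |> summarize(lo = min({{v}}, na.rm = TRUE), hi = max({{v}}, na.rm = TRUE))
}

# Easier to read: verb name, noun arguments, two-space indent, spaced {{ }}
summarize_range <- function(df, var) {
  df |>
    summarize(
      lowest = min({{ var }}, na.rm = TRUE),
      highest = max({{ var }}, na.rm = TRUE)
    )
}

# Hypothetical data, for illustration only
heights <- tibble(height = c(150, 172, NA, 188))

messy <- f(heights, height)
tidy <- summarize_range(heights, height)
```

R treats both definitions the same way; only the second tells a reader what it does without opening the body.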

## Exercise answers

### 1. Better names for the mystery functions

**Original `f1()`**
- What it does: checks whether a string starts with a given prefix  
- Better names:
  - `starts_with_prefix()`
  - `has_prefix()`

**Original `f3()`**
- What it does: repeats `y` to match the length of `x`
- Better names:
  - `repeat_to_length()`
  - `recycle_like()`

### 2. Naming distributions: `rnorm()` vs `norm_r()`

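**Why `norm_r()` could be better**
- Puts the distribution name first, so the whole family (`norm_r`, `norm_d`, `norm_p`, `norm_q`) sorts together
- Easier to autocomplete by distribution name: typing `norm_` lists every related function
- Easier for new users to discover what is available for a given distribution

**Why `rnorm()` might be better**
- Longstanding R convention
- Very compact
- Groups functions by operation: every random generator starts with `r` (`rnorm()`, `runif()`, `rbinom()`)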

**Even clearer option**
- `sample_normal()`
- `density_normal()`
- `cdf_normal()`
- `quantile_normal()`

Clarity wins when teaching or writing reusable code; brevity wins for expert workflows.
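If the descriptive names appeal to you, they can be sketched as thin wrappers around the base R functions. These wrappers are not part of R or any package; they only show how the names listed above would map onto `rnorm()`, `dnorm()`, `pnorm()`, and `qnorm()`.

```r
# Hypothetical wrappers: the descriptive names are not defined by R itself
sample_normal <- function(n, mean = 0, sd = 1) {
  rnorm(n, mean = mean, sd = sd)
}

density_normal <- function(x, mean = 0, sd = 1) {
  dnorm(x, mean = mean, sd = sd)
}

cdf_normal <- function(q, mean = 0, sd = 1) {
  pnorm(q, mean = mean, sd = sd)
}

quantile_normal <- function(p, mean = 0, sd = 1) {
  qnorm(p, mean = mean, sd = sd)
}

set.seed(1)  # make the random draw reproducible
draws <- sample_normal(5)

cdf_normal(0)         # probability of a value below the mean
quantile_normal(0.5)  # the median of the standard normal
```

The trade-off is the one described above: the wrappers read well in teaching material, but experienced R users will reach for the short names.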


In [None]:
library(tidyverse)

# Renamed versions with clearer intent

starts_with_prefix <- function(string, prefix) {
  str_sub(string, 1, str_length(prefix)) == prefix
}

recycle_like <- function(x, y) {
  rep(y, length.out = length(x))
}

# Example usage:
# starts_with_prefix("tidyverse", "tidy")
# recycle_like(1:10, c("A", "B"))
