This repository has been archived by the owner on Sep 18, 2019. It is now read-only.

thoughts about working on vectors outside and inside tibbles
jennybc committed Oct 5, 2016
1 parent e10927b commit b153c21
Showing 3 changed files with 686 additions and 0 deletions.
123 changes: 123 additions & 0 deletions block031_vector-tibble-relations.Rmd
@@ -0,0 +1,123 @@
---
title: "Vectors versus tibbles"
output:
  html_document:
    toc: true
    toc_depth: 4
---

```{r setup, include = FALSE, cache = FALSE}
knitr::opts_chunk$set(error = TRUE, collapse = TRUE, comment = "#>")
```

In STAT 545 and the Master of Data Science program, we teach data analysis starting with a clean data frame (or tibble), an excerpt of the [Gapminder](http://www.gapminder.org) data, from the [gapminder package](https://github.com/jennybc/gapminder). We study the tibble's extent and variable types. We practice filtering, selecting, arranging, summarizing, mutating, and visualizing.

Then we gradually start to reveal the more complicated operations needed to produce such clean data, which requires working with various types of atomic vectors in R, such as character or factor, and even with other data structures altogether, such as matrices or lists.

Two questions keep coming up:

* Why are you showing me how to do things to vectors two different ways: as naked vectors and as vectors inside a data frame?
* What's the general workflow for working on naked vectors vs vectors inside a data frame?

### Load packages

In our examples, we'll use various core tidyverse packages and stringr.

```{r}
library(tidyverse)
library(stringr)
```

### Vector operation example

Consider the example dataset `table3` from the `tidyr` package.

```{r}
table3
```

It gives the rate of new tuberculosis cases for three countries in two years. But the `rate` variable is a character string of the form `cases/population`. It needs to be split into numerator and denominator, and each piece converted to numeric, if we really want to work with this data rather than just gaze upon it.

Here's one way to do that if you pull the `rate` variable out of the table via `$`.

```{r}
table3$rate %>%
  str_split_fixed(pattern = "/", n = 2)
```

But now what? You've got a character matrix that is disassociated from `table3`. You've got more work to do before you can move on.

When a variable needs lots of remedial work, this workflow is justified and unavoidable. But in many cases, you can fix variables "in place" and work inside the original tibble.
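To make that "more work" concrete, here is a minimal sketch of how the naked-vector route might continue. The object name `rate_parts` and the column labels are our own choices for illustration, and `storage.mode<-` is just one of several ways to do the conversion.

```r
library(tidyr)    # provides the example dataset table3
library(stringr)

## split the rate strings into a character matrix, as above
rate_parts <- str_split_fixed(table3$rate, pattern = "/", n = 2)
## label the columns (names chosen here for illustration)
colnames(rate_parts) <- c("cases", "population")
## convert from character to integer while keeping the matrix shape
storage.mode(rate_parts) <- "integer"
rate_parts
```

Even after this, the result is still a matrix living outside `table3`, so it would need to be joined back to `country` and `year` before any real analysis.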

### Tibble operation example

If we want to stay in the world of data frames or tibbles, we can get a much nicer result with `tidyr::separate()`.

```{r}
table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE)
```

This gives us a modified version of `table3` but with `rate` removed and replaced by proper integer variables with the numerator (`cases`) and the denominator (`population`).

We are immediately ready to move on to further analysis or visualization. When feasible, it is generally advisable to manipulate vectors inside the data frame where they live.
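As a small illustration of "moving on", here is a sketch that computes a numeric rate right after separating. The name `tb_rates`, the new column `rate_per_10k`, and the per-10,000 scaling are assumptions made for this example, not part of the original dataset.

```r
library(tidyr)
library(dplyr)

## separate rate into integer columns, then compute cases per 10,000 people
## (rate_per_10k is an illustrative column name)
tb_rates <- table3 %>%
  separate(rate, into = c("cases", "population"), convert = TRUE) %>%
  mutate(rate_per_10k = cases / population * 10000)
tb_rates
```

Because everything stays inside one tibble, the new rate sits next to the `country` and `year` it belongs to.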

### Workarounds and failure

The above was a carefully chosen example, where the splitting functions existed in both settings. In general, you have more flexibility with operations for naked vectors than with vectors inside a tibble. Luckily there are many situations in which you can still manipulate a vector inside a tibble. Use `mutate()`!

Let's convert `country` from character to factor in `table1`, the tidy version of this example dataset:

```{r}
table1 %>%
  mutate(country = factor(country))
```

The `mutate()` strategy even works when multiple vectors form the necessary input to create a single new vector. Going backwards, we could re-create the vexing `rate` variable "by hand" from `cases` and `population`:

```{r}
table1 %>%
  mutate(rate = paste(as.character(cases), as.character(population), sep = "/")) %>%
  select(-cases, -population)
```

Finally, if it's easier for your development process, you can always pull a variable out, work on it, then put it back in. What if we were crazy and preferred to have the year given in Roman numerals?

```{r}
## make a copy so I don't mess with table1
tmp_df <- table1
## create working copy of the variable of interest
(tmp_var <- tmp_df$year)
## do my thing ...please just pretend it's way more complicated ;)
(tmp_var <- as.roman(tmp_var))
## put it back into the tibble
tmp_df$year <- tmp_var
## admire our work
tmp_df
```

Let's confront one genuinely fiddly scenario: what if you want to convert one (or more) existing variables into two (or more) new variables? And you aren't lucky enough to have the perfect function, like `tidyr::separate()`, available?

Unfortunately, `mutate()` doesn't do exactly what you'd want.

```{r}
table3 %>%
  mutate(rate = str_split(rate, pattern = "/"))
```

Here the character strings for `cases` and `population` are stored as character vectors of length two inside a list-column, which is awkward to unpack.
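To see what "awkward to unpack" means in practice, here is one possible sketch that pulls the pieces out of the list-column by position with `purrr::map_chr()`. The intermediate column `parts` and the result name `unpacked` are hypothetical names used only for this illustration.

```r
library(tidyr)    # provides the example dataset table3
library(dplyr)
library(stringr)
library(purrr)

## split into a list-column, then extract each piece by position
## (parts is a throwaway helper column, dropped at the end)
unpacked <- table3 %>%
  mutate(parts = str_split(rate, pattern = "/"),
         cases = map_chr(parts, 1),
         population = map_chr(parts, 2)) %>%
  select(-rate, -parts)
unpacked
```

It works, but it takes one extraction call per new variable, and both new columns are still character.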

You would be better off applying `str_split_fixed()` outside the tibble, converting the result into a tibble, then column-binding it back in.

```{r}
new_vars <- table3$rate %>%
  str_split_fixed(pattern = "/", n = 2)
## name the matrix columns before converting, so the tibble gets proper names
colnames(new_vars) <- c("cases", "population")
new_vars <- as_tibble(new_vars)
new_vars
table3 %>%
  select(-rate) %>%
  bind_cols(new_vars)
```

One day, it's likely this will be easier (multi-mutate?). But we're not there yet.
