quick pass through factor stuff

1 parent a700732 commit ffb47b3c607a3b7c158520835c47fd9659253531 @jennybc jennybc committed Oct 12, 2016
@@ -108,6 +108,7 @@
</style>
+
<div class="container-fluid main-container">
<!-- tabsets -->
@@ -162,6 +163,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
</ul>
</div>
+<p><strong>As of 2016-10-11 this is deprecated. STAT 545 is now using the <a href="https://github.com/tidyverse/forcats">forcats</a> package for factor management, which is covered in the <a href="block029_factors.html">current factor lesson</a>.</strong></p>
<div id="load-the-gapminder-data" class="section level3">
<h3>Load the Gapminder data</h3>
<p>As usual, load the Gapminder excerpt and the <code>ggplot2</code> package. Load <code>plyr</code> and/or <code>dplyr</code>. If you load both, load <code>plyr</code> first.</p>
@@ -189,7 +191,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
do(le_lin_fit(.)) %&gt;%
ungroup()
gcoefs
-## &lt;tibble [142 x 4]&gt;
+## # A tibble: 142 × 4
## country continent intercept slope
## &lt;fctr&gt; &lt;fctr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 Afghanistan Asia 29.90729 0.2753287
@@ -202,7 +204,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
## 8 Bahrain Asia 52.74921 0.4675077
## 9 Bangladesh Asia 36.13549 0.4981308
## 10 Belgium Europe 67.89192 0.2090846
-## ... with 132 more rows</code></pre>
+## # ... with 132 more rows</code></pre>
<p>Or, if you wish, the <code>plyr</code> way:</p>
<pre class="r"><code>gcoefs2 &lt;- ddply(gapminder, ~ country + continent, le_lin_fit)
gcoefs2
@@ -500,7 +502,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
group_by(country) %&gt;%
summarize(max_le = max(lifeExp))
i_le_max
-## &lt;tibble [5 x 2]&gt;
+## # A tibble: 5 × 2
## country max_le
## &lt;fctr&gt; &lt;dbl&gt;
## 1 Egypt 71.338
@@ -547,7 +549,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
<p>Reorder the <code>continent</code> factor, according to the estimated intercepts.</p>
<p>To review, remember we have computed the estimated intercept and slope for each country:</p>
<pre class="r"><code>head(gcoefs)
-## &lt;tibble [6 x 4]&gt;
+## # A tibble: 6 × 4
## country continent intercept slope
## &lt;fctr&gt; &lt;fctr&gt; &lt;dbl&gt; &lt;dbl&gt;
## 1 Afghanistan Asia 29.90729 0.2753287
@@ -568,7 +570,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
filter(country %in% k_countries, year &gt; 2000) %&gt;%
droplevels()
kDat
-## &lt;tibble [6 x 6]&gt;
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## &lt;fctr&gt; &lt;fctr&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1 Australia Oceania 2002 80.370 19546792 30687.755
@@ -585,14 +587,14 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
&quot;Korea, Dem. Rep.&quot; = &quot;North Korea&quot;,
&quot;Korea, Rep.&quot; = &quot;South Korea&quot;)))
data_frame(levels(kDat$country), levels(kDat$new_country))
-## &lt;tibble [3 x 2]&gt;
-## levels(kDat$country) levels(kDat$new_country)
-## &lt;chr&gt; &lt;chr&gt;
-## 1 Australia Oz
-## 2 Korea, Dem. Rep. North Korea
-## 3 Korea, Rep. South Korea
+## # A tibble: 3 × 2
+## `levels(kDat$country)` `levels(kDat$new_country)`
+## &lt;chr&gt; &lt;chr&gt;
+## 1 Australia Oz
+## 2 Korea, Dem. Rep. North Korea
+## 3 Korea, Rep. South Korea
kDat
-## &lt;tibble [6 x 7]&gt;
+## # A tibble: 6 × 7
## country continent year lifeExp pop gdpPercap new_country
## &lt;fctr&gt; &lt;fctr&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt; &lt;fctr&gt;
## 1 Australia Oceania 2002 80.370 19546792 30687.755 Oz
@@ -664,7 +666,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gapminder)
-## &lt;tibble [6 x 6]&gt;
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## &lt;fctr&gt; &lt;chr&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
@@ -686,7 +688,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gapminder)
-## &lt;tibble [6 x 6]&gt;
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## &lt;fctr&gt; &lt;fctr&gt; &lt;int&gt; &lt;dbl&gt; &lt;int&gt; &lt;dbl&gt;
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
@@ -712,6 +714,7 @@ <h1 class="title toc-ignore">Be the boss of your factors</h1>
$('tr.header').parent('thead').parent('table').addClass('table table-condensed');
});
+
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
@@ -2,6 +2,8 @@
+**As of 2016-10-11 this is deprecated. STAT 545 is now using the [forcats](https://github.com/tidyverse/forcats) package for factor management, which is covered in the [current factor lesson](block029_factors.html).**
+
### Load the Gapminder data
As usual, load the Gapminder excerpt and the `ggplot2` package. Load `plyr` and/or `dplyr`. If you load both, load `plyr` first.
@@ -42,7 +44,7 @@ gcoefs <- gapminder %>%
do(le_lin_fit(.)) %>%
ungroup()
gcoefs
-## <tibble [142 x 4]>
+## # A tibble: 142 × 4
## country continent intercept slope
## <fctr> <fctr> <dbl> <dbl>
## 1 Afghanistan Asia 29.90729 0.2753287
@@ -55,7 +57,7 @@ gcoefs
## 8 Bahrain Asia 52.74921 0.4675077
## 9 Bangladesh Asia 36.13549 0.4981308
## 10 Belgium Europe 67.89192 0.2090846
-## ... with 132 more rows
+## # ... with 132 more rows
```
Or, if you wish, the `plyr` way:
@@ -395,7 +397,7 @@ i_le_max <- iDat %>%
group_by(country) %>%
summarize(max_le = max(lifeExp))
i_le_max
-## <tibble [5 x 2]>
+## # A tibble: 5 × 2
## country max_le
## <fctr> <dbl>
## 1 Egypt 71.338
@@ -476,7 +478,7 @@ To review, remember we have computed the estimated intercept and slope for each
```r
head(gcoefs)
-## <tibble [6 x 4]>
+## # A tibble: 6 × 4
## country continent intercept slope
## <fctr> <fctr> <dbl> <dbl>
## 1 Afghanistan Asia 29.90729 0.2753287
@@ -504,7 +506,7 @@ kDat <- gapminder %>%
filter(country %in% k_countries, year > 2000) %>%
droplevels()
kDat
-## <tibble [6 x 6]>
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Australia Oceania 2002 80.370 19546792 30687.755
@@ -521,14 +523,14 @@ kDat <- kDat %>%
"Korea, Dem. Rep." = "North Korea",
"Korea, Rep." = "South Korea")))
data_frame(levels(kDat$country), levels(kDat$new_country))
-## <tibble [3 x 2]>
-## levels(kDat$country) levels(kDat$new_country)
-## <chr> <chr>
-## 1 Australia Oz
-## 2 Korea, Dem. Rep. North Korea
-## 3 Korea, Rep. South Korea
+## # A tibble: 3 × 2
+## `levels(kDat$country)` `levels(kDat$new_country)`
+## <chr> <chr>
+## 1 Australia Oz
+## 2 Korea, Dem. Rep. North Korea
+## 3 Korea, Rep. South Korea
kDat
-## <tibble [6 x 7]>
+## # A tibble: 6 × 7
## country continent year lifeExp pop gdpPercap new_country
## <fctr> <fctr> <int> <dbl> <int> <dbl> <fctr>
## 1 Australia Oceania 2002 80.370 19546792 30687.755 Oz
@@ -609,7 +611,7 @@ str(gapminder)
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gapminder)
-## <tibble [6 x 6]>
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <chr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
@@ -636,7 +638,7 @@ str(gapminder)
## $ pop : int 8425333 9240934 10267083 11537966 13079460 14880372 12881816 13867957 16317921 22227415 ...
## $ gdpPercap: num 779 821 853 836 740 ...
head(gapminder)
-## <tibble [6 x 6]>
+## # A tibble: 6 × 6
## country continent year lifeExp pop gdpPercap
## <fctr> <fctr> <int> <dbl> <int> <dbl>
## 1 Afghanistan Asia 1952 28.801 8425333 779.4453
@@ -10,6 +10,8 @@ output:
knitr::opts_chunk$set(error = TRUE, collapse = TRUE)
```
+**As of 2016-10-11 this is deprecated. STAT 545 is now using the [forcats](https://github.com/tidyverse/forcats) package for factor management, which is covered in the [current factor lesson](block029_factors.html).**
+
### Load the Gapminder data
As usual, load the Gapminder excerpt and the `ggplot2` package. Load `plyr` and/or `dplyr`. If you load both, load `plyr` first.
@@ -14,13 +14,15 @@ knitr::opts_chunk$set(error = TRUE, collapse = TRUE, comment = "#>")
### Factors: where they fit in
-We've spent alot of time working with big, beautiful data frame, like the Gapminder data. But we also need to manage the individual variables housed within.
+We've spent a lot of time working with big, beautiful data frames, like the Gapminder data. But we also need to manage the individual variables housed within.
Factors are the variable type that useRs love to hate. It is how we store truly categorical information in R. The values a factor can take on are called the **levels**. For example, the levels of the factor `continent` in Gapminder are "Africa", "Americas", etc. and this is what's usually presented to your eyeballs by R. In general, the levels are friendly human-readable character strings, like "male/female" and "control/treated". But *never ever ever* forget that, under the hood, R is really storing integer codes 1, 2, 3, etc.
This [Janus](http://en.wikipedia.org/wiki/Janus)-like nature of factors means they are rich with booby traps for the unsuspecting but they are a necessary evil. I recommend you learn how to be the boss of your factors. The pros far outweigh the cons. Specifically in modelling and figure-making, factors are anticipated and accommodated by the functions and packages you will want to exploit.
-The worst kind of factor is the stealth factor. The variable that you think of as character, but that is actually a factor (numeric!!). This is one of the classic gotchas in R. Check your variable types explicitly when things seem weird. Where do stealth factors come from? Base R has a burning desire to turn character information into factor. The most common place this happens is at data import via `read.table()` and friends. To shut this down, use `stringsAsFactors = FALSE` or -- even better -- use the tidyverse functions `read_csv()`, `read_tsv()`, etc.
+**The worst kind of factor is the stealth factor.** The variable that you think of as character, but that is actually a factor (numeric!!). This is a classic R gotcha. Check your variable types explicitly when things seem weird. It happens to the best of us.
+
+Where do stealth factors come from? Base R has a burning desire to turn character information into factor. This happens most commonly at data import via `read.table()` and friends. But `data.frame()` and other functions are also eager to convert character to factor. To shut this down, use `stringsAsFactors = FALSE` in `read.table()` and `data.frame()` or -- even better -- **use the tidyverse**! For data import, use `readr::read_csv()`, `readr::read_tsv()`, etc. For data frame creation, use `tibble::tibble()`. And so on.
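A minimal sketch of the stealth-factor trap in base R, using a hypothetical toy column `dose` that is not part of Gapminder. `stringsAsFactors = TRUE` is passed explicitly so the result does not depend on the default of the installed R version.

```r
# Hypothetical toy data: numbers that arrived as text and became a factor.
df <- data.frame(dose = c("10", "20", "5"), stringsAsFactors = TRUE)
class(df$dose)                                # a factor, not character

# Converting the factor directly returns the integer codes, not the values.
codes <- as.numeric(df$dose)

# Going through character first recovers the values you meant.
values <- as.numeric(as.character(df$dose))

# With stringsAsFactors = FALSE the column stays character.
df_chr <- data.frame(dose = c("10", "20", "5"), stringsAsFactors = FALSE)
class(df_chr$dose)
```

Comparing `codes` with `values` shows why checking variable types explicitly is worth the trouble: the two vectors differ, and nothing warns you.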
Good articles about how the factor fiasco came to be:
@@ -57,68 +59,79 @@ class(gapminder$continent)
summary(gapminder$continent)
```
-Get a result similar to `dplyr::count()` but on a naked factor.
+Get a result similar to `dplyr::count()`, but on a free-range factor, with `forcats::fct_count()`.
```{r}
gapminder %>%
count(continent)
fct_count(gapminder$continent)
```
-
### Dropping unused levels
-`droplevels()` for operating on factors living in a data frame (or on a single factor).
-`fct_drop()` for operating on a factor directly.
+Dropping all the rows corresponding to a specific factor level does not change the levels of the factor itself. This will come back to haunt you when you make a figure and all levels are included in the automatic legend. Sometimes it's all legend, no figure!
+
+Watch what happens to the levels of `country` (= nothing) when we filter Gapminder to a handful of countries.
```{r}
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
h_gap <- gapminder %>%
filter(country %in% h_countries)
-h_gap %>% str()
nlevels(h_gap$country)
+```
+
+Even though `h_gap` only has data for a handful of countries, we are still schlepping around all `r nlevels(gapminder$country)` levels from the original `gapminder` tibble.
-## in data frame context
+How can you get rid of them? The base function `droplevels()` operates on all the factors in a data frame or on a single factor. The function `forcats::fct_drop()` operates on a factor.
+
+```{r}
h_gap_dropped <- h_gap %>%
droplevels()
nlevels(h_gap_dropped$country)
-## in a factor vector context
-h_gap$country %>% levels()
-h_gap$country %>% fct_drop() %>% levels()
+## use forcats::fct_drop() on a free-range factor
+h_gap$country %>%
+ fct_drop() %>%
+ levels()
```
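The same behaviour can be seen without Gapminder. This is a small sketch on a hypothetical toy factor, using only base R's `droplevels()`; `forcats::fct_drop()` is expected to behave the same way on a single factor.

```r
# Hypothetical toy factor with three levels.
country <- factor(c("Egypt", "Haiti", "Romania", "Egypt"))

# Subsetting removes the Romania rows but keeps the Romania level.
kept <- country[country != "Romania"]
nlevels(kept)
table(kept)    # Romania is still listed, with a count of zero

# droplevels() discards levels that no longer occur in the data.
dropped <- droplevels(kept)
levels(dropped)
```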
### Change order of the levels, principled
-Default order is alphabetical. Which is practically random, when you think about it! It is preferable to order the levels according to some principle:
+By default, factor levels are ordered alphabetically. Which might as well be random, when you think about it! It is preferable to order the levels according to some principle:
* Frequency. Make the most common level the first and so on.
* Another variable. Order factor levels according to a summary statistic for another variable. Example: order Gapminder countries by life expectancy.
-Order by frequency, forwards and backwards. Motivated by the downstream need to make tables and figures, esp. frequency barplots.
+First, we order `continent` by frequency, forwards and backwards. This is motivated by the downstream need to make tables and figures, especially frequency barplots.
```{r}
-## order by frequency
+## default order is alphabetical
gapminder$continent %>%
levels()
+
+## order by frequency
gapminder$continent %>%
fct_infreq() %>%
levels() %>% head()
+
## backwards!
gapminder$continent %>%
fct_infreq() %>%
fct_rev() %>%
levels() %>% head()
```
-Order by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor accoding to a grouped summary. The factor is the grouping variable and the default summarizing function is `median()`.
+Now we order `country` by another variable, forwards and backwards. This other variable is usually quantitative and you will order the factor according to a grouped summary. The factor is the grouping variable and the default summarizing function is `median()`, but you can specify something else.
```{r}
+## order countries by median life expectancy
fct_reorder(gapminder$country, gapminder$lifeExp) %>%
levels() %>% head()
+
## order according to minimum life exp instead of median
fct_reorder(gapminder$country, gapminder$lifeExp, min) %>%
levels() %>% head()
+
## backwards!
fct_reorder(gapminder$country, gapminder$lifeExp, .desc = TRUE) %>%
levels() %>% head()
@@ -139,8 +152,7 @@ ggplot(gap_asia_2007, aes(x = lifeExp, y = fct_reorder(country, lifeExp))) +
geom_point()
```
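To make the grouped-summary idea concrete, here is a sketch that swaps in base R's `stats::reorder()` as a rough analogue of `fct_reorder()`. The toy vectors `country` and `life_exp` are hypothetical, and the median is passed explicitly rather than relying on a default.

```r
# Hypothetical toy data: two observations per country.
country  <- factor(c("A", "A", "B", "B", "C", "C"))
life_exp <- c(70, 72, 50, 52, 60, 62)

levels(country)                        # alphabetical by default

# Order the levels by the per-country median of life_exp, ascending.
by_median <- reorder(country, life_exp, FUN = median)
levels(by_median)

# Negating the variable gives the descending order.
by_median_desc <- reorder(country, -life_exp, FUN = median)
levels(by_median_desc)
```

Only the level order changes; the values stored in the factor are the same before and after.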
-
-Use `fct_reorder2()` when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in some order as the data!
+Use `fct_reorder2()` when you have a line chart of a quantitative x against another quantitative y and your factor provides the color. This way the legend appears in the same order as the data! Contrast the legend on the left with the one on the right.
```{r legends-made-for-humans, fig.show = 'hold', out.width = '49%'}
h_countries <- c("Egypt", "Haiti", "Romania", "Thailand", "Venezuela")
@@ -157,7 +169,7 @@ ggplot(h_gap, aes(x = year, y = lifeExp,
### Change order of the levels, "because I said so"
-Sometimes you just want to hoist one or more level to the front. Because I said so. This resembles what we do when we move variables to the front with `dplyr::select(var1, var, everything())`.
+Sometimes you just want to hoist one or more levels to the front. Why? Because I said so. This resembles what we do when we move variables to the front with `dplyr::select(var1, var, everything())`.
```{r}
h_gap$country %>% levels()
@@ -166,15 +178,14 @@ h_gap$country %>% fct_relevel("Romania", "Haiti") %>% levels()
### Recode the levels
+Sometimes you have better ideas about what certain levels should be. Recode them.
+
```{r}
i_gap <- gapminder %>%
filter(country %in% c("United States", "Sweden", "Australia")) %>%
droplevels()
i_gap$country %>% levels()
-## oops United States is giving me trouble
i_gap$country %>%
- fct_recode("USA" = "United States", "Oz" = "Australia")%>% levels()
-
-fct_count(gapminder$continent)
+ fct_recode("USA" = "United States", "Oz" = "Australia") %>% levels()
```
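For comparison, a sketch of the same recoding done by assigning to `levels()` in base R, on a hypothetical toy factor. It needs one statement per level, which is part of why `fct_recode()` reads more cleanly.

```r
# Hypothetical toy factor holding the three countries used above.
country <- factor(c("United States", "Sweden", "Australia", "Sweden"))
levels(country)

# Rename levels in place by matching on the old level name.
recoded <- country
levels(recoded)[levels(recoded) == "United States"] <- "USA"
levels(recoded)[levels(recoded) == "Australia"] <- "Oz"

levels(recoded)
as.character(recoded)                  # the values follow the renamed levels
```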