/
03d-data-structures.Rmd
417 lines (270 loc) · 18.7 KB
/
03d-data-structures.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
# Data Structures
```{r, include=FALSE}
source("common.R")
```
```{r include=FALSE}
options("knitr.graphics.auto_pdf" = TRUE)
knitr::opts_chunk$set(out.width = "100%")
```
There is a topic I have been skirting around for some time now and I think it is time that we have to have a rather important conversation. It's one that is almost never fun but is quite necessary because without it, there may be many painful lessons learned in the future. We're going to spend this next chapter talking about data structures—but not all of them! We'll only cover the three most common and, by the end of this, it is my hope you will have a much stronger idea of _what_ you are working with and _why_ it behaves the way it does.
We will cover vectors, data frames rather briefly, and lists. We'll talk about some of their defining characteristics and how we can interact with them. Often the theory behind these object types are omitted, but I am of the mind that learning this early on will pay off in dividends. Take a deep breath before we dive in and remind yourself that it ain't nothin' but a thing.
This section is undoubtedly the most theoretically dense from a software perspective of this entire book. These concepts may be a little bit difficult to grasp at the first go around particularly if you do not have a programming background. But do not be discouraged! This is tough and there is no way to around it, so might as well go through it. If you can grasp this chapter programming in R will become so much easier. You will develop an intuition of why certain things happen to your data and how to interact with other data structures.
## Atomic Vectors
I like to think of the atomic vector much like the atom—that is as the building block of any R object. You've actually been working with atomic vectors this entire time. But we haven't been very explicit about this yet. Up until this point we have been working mainly with tibbles. And here is the secret: each column of a tibble is _actually_ an atomic vector.
What makes a vector a atomic is that it can only be **a single data type** and that they are _one-dimensional_—opposed to tibbles which are two-dimensional[^sw]. You may have noticed that every value of a column is of the same data type. This means that they are rather strict to work with and for good reason. Imagine you wanted to multiple a column by 10, what would happen if a few of the values in the column were actually written out as text? Let's try exploring this idea.
The most common way to create a vector in R is to use the `c()` function. This stands for _combine_. We can `c`ombine as many elements as we want into a single vector using `c()`. Each element of the vector is it's own argument (separated by a comma).
For example if we wanted to create a vector of [Boston's unemployment rate](https://data.bls.gov/timeseries/LAUMT257165000000003?amp%253bdata_tool=XGtable&output_view=data&include_graphs=true) rate for each month in 2019 that we have data for (until October as of this writing on Dec. 18th, 2019) we could write the below. We will save it in a a vector called `unemp`[^unemp].
```{r}
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3)
unemp
```
What is really great about vectors is that we can perform any number of operations on them—i.e. find the sum of all the values, the average, add a value to each element, etc.
If we wanted to find the average unemployment rate for Boston for Jan - Oct. 2019, we can supply the vector to the function `mean()`.
```{r}
mean(unemp)
```
However, you may be thinking "there are 12 months in a year not 10 and that should be represented" and if you are, I totally agree with you. Since the data for November and December are missing, we should denote that and update `unemp` accordingly. R uses `NA` to represent missing data. To represent this we can append two `NA`s to the vector we have. There are two ways we can do this. We can either combine `unemp` with two `NA`s, or rewrite the above vector.
```{r}
# combining existing with 2 NAs
c(unemp, NA, NA)
```
This works, but since we will be saving this to `unemp` again it is not best practices to use the variable you are changing in that objects assignment.
```{r}
# for example
unemp <- c(unemp, NA, NA)
```
The above is rather unclear and might confuse someone that will have to read your code at a later time—that person may even be you. For this reason we will redefine it.
```{r}
unemp <- c(3.2, 2.8, 2.8, 2.4, 2.8, 2.9, 2.7, 2.6, 2.7, 2.3, NA, NA)
unemp
```
We know that there are 12 elements in this vector, but sometimes it is quite nice to sanity check oneself. We can always find out how long (or how many elements are in) a vector is by supplying the vector to the `length()` function.
```{r}
# how many observations are in `unemp`?
length(unemp)
```
There are a total of six types of vectors. Fortunately, only four of these really matter to us. These are `integer`, `double`, `character`, `logical`.
Integers represent whole numbers. To specify an integer we append an `L` after the number such as `20L`. Doubles are any number that requires any precision aka decimal places. You can specify doubles in a number of formats such as scientific notation. Generally the easiest way to do this, though, is using a decimal. Together integers and doubles are lumped into the category of numeric. This is because, well, they are numbers.
As you learned previously, character vectors are created with the use of quotation marks; either `"` or `'`.
We've already created a vector of type double, `unemp`. You can check what type of vector `unemp` is with `typeof()`
> Note: `typeof()` is used only for internal R object such as lists and vectors. In most cases you will want to use `class()` to return the class of an object.
```{r}
typeof(unemp)
```
Say we create another vector called `month` with the numbers 1 through 12.
```{r}
month <- c(1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11, 12)
typeof(month)
```
Notice that since we didn't specify the `L` after the numbers R defaulted to treating `month` as a double. When possible it is good to make the distinction between integer and numeric.
```{r}
month <- c(1L, 2L, 3L, 4L, 5L, 6L, 7L, 8L, 9L, 10L, 11L, 12L)
typeof(month)
```
R has a number of vectors that are built in these being the letters of the alphabet (`letters` and `LETTERS` respectively), as well as `month.abb`, `month.name`, and `pi`. `month.name` is already available to us so let's not recreate it.
```{r}
month.name
typeof(month.name)
```
Notice the quotes around each vector element. This is how we identify character vectors.
Logical vectors are the last kind of vector we need to go over. Logical vectors are represented as the values `TRUE` and `FALSE`. Simple enough. Onward!
Recall that vectors are atomic meaning that there can only be one type per vector and we cannot mix and match. When a character is in the presence of another element of a different type, that value is _**coerced**_ into a character. Coersion is the process of implicitly or contextually changing an object from one type to another. For example:
```{r}
x <- c("a", 1)
typeof(x)
```
Something similar happens when a logical value is in the presence of a numeric value
```{r}
c(TRUE, 1, FALSE)
```
In the presence of a numeric value `TRUE` becomes equal to `1L` and `FALSE` equal to `0L`.This behavior exists whenever a logical value is presented where a numeric is expected such as the function call below.
```{r}
sum(TRUE, FALSE, FALSE)
```
While coersion occurs from other processes like combining values in a vector, _**casting**_ is the process of intentionally changing an object's class. There are a number of casting functions whice generaly take the shape of `as.class()` or `as_class()`. Each of the vector types covered have their own casting functions.
```{r}
as.integer(TRUE)
as.character(123)
as.double("2.331")
as.logical(0)
```
As you progress in your R journey you will find scenarios in which you need to cast objects from one class to another and these functions are the trick.
You now have a strong understanding of the underbellies of R vectors. One thing that is missing is an understanding of how we can select subsets from vectors. To extract a value from vectors we append square brackets at the end of the vector `vec[]`. We supply an index value to the square brackets to receive the value at that position
To select the month of January from the `unemp` vector, the first element, we provide the value of `1` to the brackets.
```{r}
unemp[1]
```
To extract more than one value, we provide a vector of the row indexes we desire.
```{r}
unemp[c(1, 3)]
```
There is yet another way to extract values from these vectors. We can provide a logical vector to our square brackets. For example, we can identify every value of `unemp` that is above the average rate.
```{r}
# find average removing missing values
avg_unemp <- mean(unemp, na.rm = TRUE)
# identify which values are above average
index <- unemp > avg_unemp
index
```
Notice that the `NA`s stayed `NA`? They can be pesky. Hadley writes in Advanced R "missing values tend to be infectious: most computations involving a missing value will return another missing value."[^advr]
```{r}
unemp[index]
```
How annoying those NAs can be! To prevent these NAs from showing upwe can add another condition to our `index` line to remove NAs. Like there are `as.*()` functions for casting, there are also `is.*()` functions for testing. `is.*()` returns a logical vector of the same length as the provided vector.
> Note: the `*` is called a wildcard. The wildcard character comes from SQL and when present means that any string can follow. `is.*()` is intended to indicate any possible testing function such as `is.numeric()`, `is_tibble()`, etc.
```{r}
is.na(unemp)
```
As we learned, we can negate logical vectors with an `!`. We can negate the test results and include an an `&` condition to only identify unemployment values above average _and_ aren't missing.
```{r}
index <- unemp > avg_unemp & !is.na(unemp)
unemp[index]
```
There is one last thing to keep in mind and with subsetting vectors using a logical vector that is of a different length. When you use a logical vector to subset and they are of differing length, the logical vector will be recycled for the remaining values of the vector being subset. As always, an example will be the best.
Say we have an object called `x` which are the values from 0 to 10 and an `index` to subset with. If we subset it with `index` and `index` is a logical vector of length two with the values of `TRUE` and `FALSE`, every other observation will be returned. This is because come the third value in `x`, R has ran out of values in `index` to use so it goes back to the beginning
```{r}
x <- 0:10
x
```
```{r}
index <- c(TRUE, FALSE)
x[index]
```
And what happens when the only value is a single logical value?
```{r}
x[TRUE]
```
```{r}
x[FALSE]
```
In this latter case see how the output says `integer(0)`. This is informing you that the vector contains 0 elements.
## Data Frames
The entirety of the work in this book so far has been with `tibbles`. Tibbles are actually a special type of data frame. Data frames are R's native way for storing rectangular data. Rectangles are two-dimensional, so are data frames.
Data frames are secretly just a bunch of vectors squished together. The important thing is that all vectors are of the same length. This ensures that each observation (row) has one value from each vector. Because of the nature of a data frame, each column must adhere to the rules of vectors.
Let's create a tibble using the `unemp` vector and the `tibble()` function. `tibble()` works in a somewhat similar manner as `mutate()` where the arguments we provide are name value pairs. In the case of tibble, the argument take the form of `col_name = vector`.
We create a tibble with the unemployment rate below.
```{r, message=FALSE}
library(dplyr)
tibble(
unemp_rate = unemp
)
```
We can add the month name and create a new column to indicate if that month has a higher than average unemployment rate.
```{r}
unemp_tbl <- tibble(
unemp_rate = unemp,
month = month.name
) %>%
mutate(above_avg = unemp_rate > avg_unemp)
unemp_tbl
```
To interact with the underlying vector of a data frame we can use the dollar sign `$` operator. This takes the form of `tbl$col_name`.
For example, extracting the `unemp_rate` column looks like:
```{r}
unemp_tbl$unemp_rate
```
Note the difference between `select(tbl, col)` and `tbl$col`.
```{r}
select(unemp_tbl, unemp_rate)
```
The difference is that `$` returns the underlying vector whereas `select()` will always return another data frame. You now have the ability to both filter data and grab a subset of a vector. But we have yet to visit how to grab a single value from a data frame.
You could try something like
```{r}
unemp_tbl %>%
select(1) %>%
slice(10)
```
To grab the 10th value of the first column. But again, you still have a tibble and you are not able to use that directly like a standalone number.
We can again use brackets to subset the our R object. But data frames are two dimensional, so we need to specify the indexes in two dimensions. If you have made a hand drawn graph used a cartesian plane, which I assume you all have, this will is the same idea. With a cartesian plane we can identify any point with a combination of two values: x and y. x refers to the horizontal axis and y the vertical axis. When we put the cartesian plane in the same frame of reference as the rectangular data frame we envision our rows as the x and our columns as the y.
In specifying our index, we are able to select all rows or all columns by leaving the x or y spot empty respectively.
```{r}
unemp_tbl[,1]
unemp_tbl[10,]
```
To replicate the above tidyverse example we would provide the indexes 10 and 1 respectively.
```{r}
unemp_tbl[10,1]
```
This is great, we've rewritten our tidyverse code in base R. But, just like the tidyverse code, we maintain the tibble data structure. This is because when we use a single bracket, it maintains the data structure of the object we are selecting from. If we wrap our brackets in another set of bracket, we are returned the an object of the same class as the underlying object.
```{r}
unemp_tbl[[10,1]]
```
What that code is doing is narrowing the tibble down to a single column with a single row index and then extracting the underlying vector (the second bracket). To extract the underlying vector using the tidyverse, we can use the function `dplyr::pull()`.
```{r}
unemp_tbl %>%
select(1) %>%
slice(10) %>%
pull()
```
Now this brings us to the second-most fundamental structure in R: the list. Yes, second-most fundamental. I've been keeping a secret from you. Data frames are actually just lists in disguise. To prove it, I will remove the class from `unemp_tbl` and return the class of that unclassed object.
```{r}
unclass(unemp_tbl) %>%
class()
```
That is right, data frames are actually just lists disguised as rectangles.
## Lists
There is a good chance that you will not have to interact with them too often That doesn't mean you shouldn't know how to when that time comes.
Lists are generally the most flexible object type in R. Unlike vectors and data frames lists do not impose any structure on the storage of our data.
The most simple lists may resemble something like a vector.
```{r}
list("Jan", "Feb", "Mar")
```
Notice how this prints differently than
```{r}
c("Jan", "Feb", "Mar")
```
Each element of a list is self-contained. I think of lists somewhat like shipping containers where each element is its own container and all components of each element are together. We can include any type of R object in a list. For example, we can include the `unemp_tbl` and associated vectors.
```{r}
l <- list(unemp_tbl, unemp, month.name)
```
We can view the structure of the list to get an idea of what is actually contained by that list.
```{r}
str(l)
```
The structure of `l` shows us that the first element is a tibble (has class `tbl_df`), and the other elements are numeric and character vectors respectively.
Because of this flexibility there are not predetermined dimensions that we can specify to our brackets. Like extracting the underlying vector value from a data frame we have to use `[[` for indexing. I like to think of `[` as walking up to the storage container and `[[` as actually opening it up and going inside. To get a sense of the difference lets look at the `unemp` vector.
```{r}
l[2]
class(l[2])
```
When using the single bracket we are just selecting the first element of the list which is why we are returned another list.
```{r}
l[[2]]
class(l[[2]])
```
When we use the double bracket we are going inside of the container and actually plucking that element out of the list. Once you have plucked out that element, we can again use another set of brackets to subset that item. To grab the tenth row and first column of the `unemp_tbl` inside of `l` we can write.
```{r}
l[[1]][[10,1]]
```
Now that we know that data frames are lists we can actually extract the underlying vectors using `[[` as well as `$`. We can get the tenth row and first column a number of ways.
```{r}
# subsetting the data frame
l[[1]][[10,1]]
# grabbing the first vector then position
l[[1]][[1]][10]
# grabbing the vector by name then position
l[[1]]$unemp_rate[10]
```
Frankly all of these brackets can get a little messy. The tidyverse package `purrr` has a super handy function called `pluck()` which handles all of these brackets for us. `purrr::pluck()` is meant for flexible indexing into data structures (documentation).
`pluck()` works by first providing the object that you'd like to index—again, notice the data first emphasis—and then providing the position of the element you would like to pluck out of the object. Generally, I will use `pluck()` when possible. By doing so the code becomes more readable and adheres to a single style more thoroughly.
```{r}
purrr::pluck(l, 1, 1, 10)
```
Congratulations! You made it to the end of this exceptionally dense chapter. You may feel a little overwhlemed and that is to be expected. Nonetheless you should be proud! I have a few more asks of you before you move on.
### Exercises
1. Drink some water
2. Move around a bit and shake it out
3. Create a list with the vectors `unemp`, `month.name`, and `avg_unemp`.
4. Recreate the `unemp_tbl` but referencing the list elements
```{r}
library(purrr)
unemp_l <- list(unemp, month.name, avg_unemp)
tibble(
unemp_rate = pluck(unemp_l, 1),
month = pluck(unemp_l, 2)
) %>%
mutate(above_avg = unemp_rate > pluck(unemp_l, 3))
```
[^sw]: Data Types and Structures. Software Carpentries. https://swcarpentry.github.io/r-novice-inflammation/13-supp-data-structures/
[^unemp]: Boston Unemployment https://data.bls.gov/timeseries/LAUMT257165000000003?amp%253bdata_tool=XGtable&output_view=data&include_graphs=true.
[^advr]: Vectors. Advanced R. https://adv-r.hadley.nz/vectors-chap.html.