**Introduction**<br>
Subsetting is a natural complement to `str()`. str() shows you the structure of any object, and subsetting allows you to pull out the pieces that you are interested in.

## Data Types 

### Atomic Vectors

1. Subset vector: <br>
------ Use methods (1) - (2):<br>
(1) **positive integers**: returns elements at specified positions. <br>
Example: x[c(1, 2)]  <br><br>
(2) **negative integers**: omits elements at specified positions. <br>
Example: x[-c(1, 2)] <br><br><br>

^^^^^^ More methods: (3) - (6) <br>
(3) **logical vectors**: select elements where corresponding logical value is TRUE.<br> 
Example: x[c(T, F)] <br><br>

(4) **nothing**: returns original vector. <br>
Example: x[] <br>
Note: This is not useful for vectors, but it is useful for matrices, data frames, and arrays, or in conjunction with assignment. <br><br>
(5) **zero**: returns zero-length vector. It can be useful for generating test data.<br> 
Example: x[0] <br><br>
(6) **character vectors**: If a vector is named, subsetting with character vectors returns elements with matching names. <br>
Example: x[c("a", "b")]

<br><br>

Note: <br>
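In the examples below, methods (1) - (5) are applied to an unnamed vector, so they return just the values; method (6) needs a named vector and returns the values together with their names. More generally, `[` keeps the names of a named vector whichever method you use. 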
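
As a quick check of the methods above on a named vector, here is a minimal sketch (the vector `v` and its names are made up for illustration). It suggests that `[` keeps names whenever the input has them, whichever index type is used.

```r
# hypothetical named vector, used only for illustration
v <- c(a = 2.1, b = 4.2, c = 3.3, d = 5.4)

pos  <- v[c(1, 2)]                      # positive integers
neg  <- v[-c(1, 2)]                     # negative integers
lgl  <- v[c(TRUE, FALSE, TRUE, FALSE)]  # logical vector
chr  <- v[c("a", "b")]                  # character vector
zero <- v[0]                            # zero-length result

names(pos)
names(neg)
names(lgl)
names(chr)
length(zero)
```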

In [5]:
# Example 1: subset vector with positive integers

# create vector 
x <- c(2.1, 4.2, 3.3, 5.4)
x

# -------- (1) subset with positive integers
# get elements of vector x at 3rd and 1st position
x[c(3, 1)]

# order elements in vector x
x[order(x)]

# duplicated indices yield duplicated values 
x[c(1, 1)]

# real numbers are silently truncated to integers 
x[c(2.1, 2.9)]

In [7]:
# Example 2: subset vector with negative integers

# create vector 
x <- c(2.1, 4.2, 3.3, 5.4)
x

# -------- (2) subset with negative numbers 
# omit elements of vector x on 3rd and 1st positions 
x[-c(3, 1)]

# you cannot mix positive and negative integers
x[c(-1, 2)]

ERROR: Error in x[c(-1, 2)]: only 0's may be mixed with negative subscripts


In [206]:
# Example 3: subset vector with logical vectors

# create vector 
x <- c(2.1, 4.2, 3.3, 5.4)
x

# ^^^^^^ (3) subset with logical vectors 
# keep 1st and 2nd element
x[c(TRUE, TRUE, FALSE, FALSE)]

# If logical vector is shorter than vector, it will be recycled to same length 
# In effect, it does this: 
# x[c(TRUE, FALSE, TRUE, FALSE)], which keeps 1st and 3rd elements
x[c(TRUE, FALSE)]

# A missing value in the index yields a missing value in the output 
x[c(TRUE, FALSE, NA, FALSE)]

In [137]:
# Example 4: 
# ^^^^^^ (4) subset vector with nothing 
x[]

# Example 5: 
# ^^^^^^ (5) subset with zero 
str(x[0])

 num(0) 


In [141]:
# Example 6: 
# ^^^^^^ (6) subset with character vectors 

# create names for vector  
(y <- setNames(x, letters[1:4]))

# subset with unique names 
y[c("d", "c", "a")]

# subset with duplicated names 
y[c("a", "a", "a")]

# when subsetting with [], names are always matched exactly. 
# create vector 
z <- c(abc = 1, def = 2)
z[c("a", "d")]

### Lists

1. Subset list (preserve structure) <br><br>

------ Use method (1) - (2): <br>

(1) subset with **[name]** (preserve as list): <br>
Example: model["coefficients"] <br><br>

(2) subset with **\$** (preserve as vector): <br>
Example: model$coefficients <br><br>

^^^^^^ More methods (3) - (4): <br>

(3) subset with **[number]** (preserve as list): <br>
Example: model[1] <br><br>

(4) subset with **[[number]]** (preserve as vector): <br>
Example: model[[1]] <br><br>

2. Subset list & elements (preserve structure) <br><br>

------ Use method (1): <br>
(1) subset with **\$** (preserve as vector): <br>
Example: model$coefficients[2:3] <br><br>

^^^^^^ More methods (2) - (3): <br>
(2) subset with **[[name]]** (preserve as vector): <br>
Example: model[["coefficients"]][2:3] <br><br>

(3) subset with **[[number]]** (preserve as vector): <br>
Example: model[[1]][2:3] <br><br>

3. Subset list & a single element (NOT preserve structure) <br><br>

------ Use method (1): <br>
(1) subset with **\$**: <br>
Example: model$coefficients[[2]] <br><br>

^^^^^^ More methods (2) - (3): <br>
(2) subset with **[[name]]**: <br>
Example: model[["coefficients"]][[2]] <br><br>

(3) subset with **[[number]]**: <br>
Example: model[[1]][[2]]
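
The difference between these methods is easiest to see by checking what each one returns. The sketch below uses a small hand-made list (`model` here is a hypothetical stand-in, not a fitted `lm` object) so that it runs on its own.

```r
# hypothetical list standing in for a fitted model object
model <- list(coefficients = c(intercept = 1.5, carb = -0.2, cyl = -2.6),
              rank = 3L)

as_list   <- model["coefficients"]         # `[` keeps the list wrapper
as_vector <- model$coefficients            # `$` returns the element itself
two_coefs <- model[["coefficients"]][2:3]  # `[[` then `[`: named numeric vector
one_value <- model$coefficients[[2]]       # `[[` on a vector drops the name

class(as_list)
class(as_vector)
names(two_coefs)
one_value
```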

In [209]:
# Example 1: subset a list (regression model)

# run regression
# the model is a list itself
mod <- lm(mpg ~ carb + cyl, data = mtcars)

# -------- (1) subset list: coefficients (preserve output structures)

# ~~~~~~~  Use method 1 & 2
# method 1 (preserve structure: list)
mod["coefficients"]

# method 2 (preserve structure: vector)
mod$coefficients

# Other methods: not easy to use 
# method 3 (preserve structure: list)
mod[1]
# method 4 (preserve structure: vector)
mod[[1]]

# -------- (2) subset list: "coefficients" & extract 2nd to 3rd element (preserve output structures)

# method 1 (easiest)
# ~~~~~~~  USE THIS ONE !!! 
mod$coefficients[2:3]


# other methods (not as easy)
mod[["coefficients"]][2:3]
mod[[1]][2:3] 


# -------- (3) subset list: "coefficients" & then extract 2nd element (NOT preserve output structures)

# method 1 (easiest)
# ~~~~~~~   USE THIS ONE !!! 
mod$coefficients[[2]]


# other methods (not as easy)
mod[["coefficients"]][[2]]
mod[[1]][[2]]

### Matrices and arrays

1. You can subset higher-dimensional structures in 3 ways:<br>
(1) multiple vectors<br>
(2) single vector<br>
(3) matrix
2. Most common way of subsetting matrices and arrays is a simple generalization of 1d subsetting: apply 1d index for each dimension, separated by a comma. 
3. Blank subsetting is useful because it lets you keep all rows or columns. 
4. By default, `[` will simplify results to lowest possible dimensionality.
5. Because matrices and arrays are implemented as vectors with special attributes, you can subset them with a single vector, then they behave like a vector. (Note: arrays in R are stored in column-major order.)
6. You can also subset higher-dimensional data with an integer matrix (or a character matrix if named). Each row in the matrix specifies the location of 1 value, and each column corresponds to a dimension in the array being subsetted. This means: use a 2-column matrix to subset a matrix, a 3-column matrix to subset a 3d array, etc. <br><br>
7. Subset matrix: <br>
(1) subset with positive integers <br>
Example: A[1:2, ], or A[, 1:2], or A[1, 2] <br><br>
(2) subset with negative integers <br>
Example: A[-1, ], or A[, -1], or A[-1, -2] <br><br>
(3) subset with names <br>
Example: A[, c("B","A")] <br><br>
(4) subset with logical <br>
Example: A[c(T, F), ], A[, c(F, T)], A[c(T, F), c(F, T)]<br><br>
(5) subset with mix methods <br>
Example: A[c(T, F), c("B", "A")], or A[0, -2]
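
The three ways of indexing a matrix (multiple vectors, a single vector, and a matrix) can be compared side by side. This is a minimal sketch on a made-up 3 x 3 matrix; the object names are illustrative only.

```r
# hypothetical 3 x 3 matrix with named columns
A <- matrix(1:9, nrow = 3, dimnames = list(NULL, c("A", "B", "C")))

by_vector <- A[c(4, 8)]            # single vector: column-major positions
idx       <- cbind(c(1, 3), c(2, 3))  # each row is one (row, column) pair
by_matrix <- A[idx]                # picks A[1, 2] and A[3, 3]
one_row   <- A[1, ]                # simplified to a named vector
kept      <- A[1, , drop = FALSE]  # stays a 1 x 3 matrix

by_vector
by_matrix
dim(one_row)
dim(kept)
```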

In [142]:
# Example 1: subset matrix 

# Create matrix 
a <- matrix(1:9, nrow = 3)
colnames(a) <- c("A", "B", "C")

# ------ (1) subset first 2 rows
a[1:2, ]

# ------ (2) subset rows with logical, subset columns with names 
a[c(T, F, T), c("B", "A")]

# ------ (3) Subset zero rows, omit 2nd column 
a[0, -2]

A,B,C
1,4,7
2,5,8


B,A
4,1
6,3


A,C


In [48]:
# Example 2: subset array

# create array
(vals <- outer(1:5, 1:5, FUN = "paste", sep = ","))

# subset 4th and 15th elements
# (Note: count column-wise)
vals[c(4, 15)]

0,1,2,3,4
11,12,13,14,15
21,22,23,24,25
31,32,33,34,35
41,42,43,44,45
51,52,53,54,55


Note: `outer()` helps create an array. Below are some examples using the function `outer()`.

In [49]:
# Example 1: outer()
outer(month.abb, 1999:2003, FUN = "paste")

0,1,2,3,4
Jan 1999,Jan 2000,Jan 2001,Jan 2002,Jan 2003
Feb 1999,Feb 2000,Feb 2001,Feb 2002,Feb 2003
Mar 1999,Mar 2000,Mar 2001,Mar 2002,Mar 2003
Apr 1999,Apr 2000,Apr 2001,Apr 2002,Apr 2003
May 1999,May 2000,May 2001,May 2002,May 2003
Jun 1999,Jun 2000,Jun 2001,Jun 2002,Jun 2003
Jul 1999,Jul 2000,Jul 2001,Jul 2002,Jul 2003
Aug 1999,Aug 2000,Aug 2001,Aug 2002,Aug 2003
Sep 1999,Sep 2000,Sep 2001,Sep 2002,Sep 2003
Oct 1999,Oct 2000,Oct 2001,Oct 2002,Oct 2003


In [52]:
# Example 2: outer()

# create matrix
x <- 1:9; names(x) <- x
y <- 2:8; names(y) <- paste(y,":", sep = "")

outer(y, x, FUN = "^")

Unnamed: 0,1,2,3,4,5,6,7,8,9
2:,2,4,8,16,32,64,128,256,512
3:,3,9,27,81,243,729,2187,6561,19683
4:,4,16,64,256,1024,4096,16384,65536,262144
5:,5,25,125,625,3125,15625,78125,390625,1953125
6:,6,36,216,1296,7776,46656,279936,1679616,10077696
7:,7,49,343,2401,16807,117649,823543,5764801,40353607
8:,8,64,512,4096,32768,262144,2097152,16777216,134217728


In [59]:
# Example 3: subset array

# create array 
vals <- outer(1:5, 1:5, FUN = "paste", sep = ",")
vals

# create a matrix to subset array 
select <- matrix(ncol = 2, 
                 byrow = TRUE, 
                 c(1, 1,
                   3, 1, 
                   2, 4))

# subset 
vals[select]

0,1,2,3,4
11,12,13,14,15
21,22,23,24,25
31,32,33,34,35
41,42,43,44,45
51,52,53,54,55


### Data frames 

1. Data frames have properties of both lists and matrices: <br>
(1) If you subset with a single vector, they behave like lists. <br>
(2) If you subset with 2 vectors, they behave like matrices.<br><br>

2. Subset data frame:<br><br>

------ For (1) - (4), use Method 1 <br>
(1) subset rows with integers or (in)equalities (preserve structure): <br>
Method 1: df[c(1, 2), ] or df[df$x == 2, ] <br><br>

(2) subset columns with names (preserve structure): <br>
Method 1: df[c("a", "b")] <br> 
Method 2: df[, c("a", "b")] <br><br>

(3) subset 1 column (preserve structure): <br>
Method 1: df["a"] <br><br>

(4) subset 1 column  (NOT preserve structure): <br>
Method 1: df$a <br>
Method 2: df[, "a"] <br>
Method 3: df[["a"]]  <br><br>

Note: <br>
I omit subsetting with negative integers and logicals since they're not very practical here. 
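
For completeness, here is a small sketch of the list-like versus matrix-like behavior, with the negative-integer and logical forms included for comparison. The data frame mirrors the one used in the examples below; the result names are made up.

```r
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

list_style   <- df["x"]         # one index: list-like, stays a data frame
matrix_style <- df[, "x"]       # two indices: matrix-like, simplifies to a vector
no_first_row <- df[-1, ]        # negative integer drops row 1
x_above_one  <- df[df$x > 1, ]  # logical vector keeps rows where TRUE

class(list_style)
class(matrix_style)
no_first_row
x_above_one
```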

In [160]:
# Example 1: subset rows 

# create data frame 
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
df

# ------ (1) subset with equality: when x = 2
df[df$x == 2, ]

# ------ (2) subset with vector: 1st and 3rd rows  
df[c(1, 3), ]

x,y,z
1,3,a
2,2,b
3,1,c


Unnamed: 0,x,y,z
2,2,2,b


Unnamed: 0,x,y,z
1,1,3,a
3,3,1,c


In [124]:
# Example 2_1: subset multiple columns 

# create data frame 
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])
cat("Original data frame:\n")
df

# ------ (1) subset with 1 vector: like a list 
cat("\nSubset with 1 vector (like a list):\n")
df[c("x", "z")]

# ------ (2) subset with 2 vectors: like a matrix 
cat("\nSubset with 2 vectors (like a matrix):\n")
df[, c("x", "z")]

Original data frame:


x,y,z
1,3,a
2,2,b
3,1,c



Subset with 1 vector (like a list):


x,z
1,a
2,b
3,c



Subset with 2 vectors (like a matrix):


x,z
1,a
2,b
3,c


In [134]:
# Example 2_2: subset only 1 column

# create data frame 
df <- data.frame(x = 1:3, y = 3:1, z = letters[1:3])

# ------- (1) subset: preserve output structure 

# list subsetting
cat("list subsetting result:\n")
df["x"]
# integer subsetting (NOT RECOMMENDED)
cat("integer subsetting result:\n")
df[1]

# ------- (2) subset: NOT preserve output structure 

# matrix subsetting
cat("\nmatrix subsetting result:")
df[, "x"]
# $ subsetting
cat("\n$ subsetting result:")
df$x
# [[ subsetting
cat("\n[[ subsetting result:\n")
df[["x"]]

list subsetting result:


x
1
2
3


integer subsetting result:


x
1
2
3



matrix subsetting result:


$ subsetting result:


[[ subsetting result:


### S3 objects

**S3 objects** are made up of atomic vectors, arrays, and lists. You can always pull apart an S3 object using the techniques above and the knowledge you gain from `str()`.

### S4 objects

Two additional subsetting operators for S4 objects: <br>
(1) `@` (equivalent to `$`) <br>
(2) `slot()` (equivalent to `[[`) <br>
Note: `@` is more restrictive than `$` in that it will return an error if the slot doesn't exist.
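
A minimal sketch of both operators, assuming a toy class `Person` that is defined here purely for illustration:

```r
library(methods)  # S4 support; usually attached already

# hypothetical S4 class, defined only for this illustration
setClass("Person", representation(name = "character", age = "numeric"))
p <- new("Person", name = "Ada", age = 36)

p@name          # slot access with @
slot(p, "age")  # same idea via slot()

# asking for a slot that does not exist is an error, not NULL
missing_slot <- tryCatch(slot(p, "height"), error = function(e) "error")
missing_slot
```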

## Subsetting operators

1. Two other subsetting operators: <br>
(1) `[[`: similar to `[`, except that it can only return a single value and it allows you to pull pieces out of a list. <br>
(2) `$`: a useful shorthand for `[[` combined with character subsetting. <br>
2. You need `[[` when working with lists. This is because `[` applied to a list always returns a list, while `[[` returns the contents. 
3. Because it can return only a single value, you must use `[[` with either a single positive integer or a string.

In [213]:
# Example 1

# create list 
a <- list(a = 1, b = 2)

# ------ (1) subset: preserve structure
a["b"]

# ------ (2) subset: NOT preserve structure
a$b

In [77]:
# Example 2: if you supply a vector, it indexes recursively 

# create list 
b <- list(a = list(b = list(c = list(d = 1))))

# subset
b[[c("a", "b", "c", "d")]]
b[["a"]][["b"]][["c"]][["d"]]

In [191]:
# Example 3

# subset rows 
mtcars[1:2, ]

# subset columns 
head(mtcars[, c("wt", "am")],3)

# subset 1 column (preserve structure)
head(mtcars["mpg"],3)

# subset 1 column (NOT preserve structure) 
mtcars$mpg

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,21,6,160,110,3.9,2.62,16.46,0,1,4,4
Mazda RX4 Wag,21,6,160,110,3.9,2.875,17.02,0,1,4,4


Unnamed: 0,wt,am
Mazda RX4,2.62,1
Mazda RX4 Wag,2.875,1
Datsun 710,2.32,1


Unnamed: 0,mpg
Mazda RX4,21.0
Mazda RX4 Wag,21.0
Datsun 710,22.8


### Simplifying vs preserving subsetting

1. Simplifying subsetting returns the simplest possible data structure that can represent the output, and is useful interactively because it usually gives you what you want. 
2. Preserving subsetting keeps the structure of the output the same as the input, and is generally better for programming because the result will always be the same type. 
3. Omitting `drop = FALSE` when subsetting matrices and data frames is one of the most common sources of programming errors. <br>
Note: for factors, `drop = TRUE` means that levels not present in the selection are dropped. 
4. Preserving is the same for all data types: you get the same type of output as input.
5. Simplifying behavior varies a bit between different data types: <br>
(1) **atomic vector**: removes names<br>
(2) **list**: returns the object inside the list, not a single element list<br>
(3) **factor**: drops any unused levels <br>
(4) **matrix or array**: if any of the dimensions has length 1, drops that dimension<br>
(5) **data frame**: if output is a single column, returns a vector instead of a data frame

|            | Simplifying      | Preserving                           | 
|    :-:     |         :-:      |                     :-:              | 
| vector     | x[[1]]           | x[1]                               |
| list       | x[[1]]           | x[1]                                 |
| factor     | x[1:4, drop = T] | x[1:4]                               |   
| array      | x[1, ] or x[, 1] | x[1, , drop = F] or x[, 1, drop = F] | 
| data frame | x[, 1] or x[[1]] | x[, 1, drop = F] or x[1]             |
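
To see why a missing `drop = FALSE` causes trouble inside functions, consider this sketch. The helper `col_means()` is hypothetical, written only to illustrate the point.

```r
# hypothetical helper: column means for a chosen set of columns
col_means <- function(m, cols) {
  # drop = FALSE keeps a matrix even when `cols` selects a single column
  colMeans(m[, cols, drop = FALSE])
}

m <- matrix(1:6, nrow = 3, dimnames = list(NULL, c("a", "b")))
two_cols <- col_means(m, c("a", "b"))
one_col  <- col_means(m, "a")

# without drop = FALSE the single-column case simplifies to a vector,
# and colMeans() fails because its input no longer has two dimensions
broken <- tryCatch(colMeans(m[, "a"]), error = function(e) "error")

two_cols
one_col
broken
```

The function works for one column or many only because the subset is always a matrix.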

In [216]:
# Example 1: atomic vector  

# create vector 
x <- c(a = 1, b = 2)

# subset: preserve structure 
x[1]
# subset: NOT preserve structure 
x[[1]] 

In [238]:
# Example 2: list

# create list 
y <- list(a = c(1, 2), b = c(3, 4))

# subset: preserve structure 
y[1]
# subset: NOT preserve structure 
y[[1]] 

In [237]:
# Example 3: factor 

# create factor 
z <- factor(c("a", "b"))

# subset first element
cat("Right way to subset factor:")
z[1, drop = TRUE]
str(z[1, drop = TRUE])

# without `drop = TRUE` the result is usually not what you want,
# since it keeps the original 2 levels
cat("\n\nWrong way to subset factor:")
z[1]
str(z[1])

Right way to subset factor:

 Factor w/ 1 level "a": 1


Wrong way to subset factor:

 Factor w/ 2 levels "a","b": 1


In [242]:
# Example 4: matrix or array 

# create matrix 
a <- matrix(1:4, nrow = 2)

# subset row / column: 
# when the subset matrix has only 1 row / 1 column 
# `drop = FALSE` helps preserve output as matrix 
cat("Right way to subset matrix (1 row/1column)")
a[1, , drop = FALSE]
a[, 1, drop = FALSE]

# wrong way to do it 
cat("Wrong way to subset matrix (1 row/1column)")
a[1, ]

Right way to subset matrix (1 row/1column)

0,1
1,3


0
1
2


Wrong way to subset matrix (1 row/1column)

In [247]:
# Example 5

# create data frame 
df <- data.frame(a = 1:2, b = 1:2)

# subset 1 column: 
# when the subset data frame has only 1 column 
# `drop = FALSE` helps preserve structure
cat("Right way to subset data frame (1 column)")
df[, 1, drop = FALSE]

# wrong way to do it
cat("Wrong way to subset data frame (1 column)")
df[, 1]

Right way to subset data frame (1 column)

a
1
2


Wrong way to subset data frame (1 column)

### $

1. \$ is a shorthand for `[[`, so x$y is the same as x\[["y", exact = FALSE]]. It's often used to access variables in a data frame. 
2. One common mistake is to try to use \$ when you have the name of a column stored in a variable; use `[[` instead. 
3. Difference between \$ and `[[`: <br>
(1) \$ does partial matching.<br>
If you want to be warned when this happens, you can set the global option `warnPartialMatchDollar = TRUE` (NOT RECOMMENDED). This setting may affect behavior in other code you have loaded, so use it with caution. <br>
(2) `[[` does exact matching by default. 
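
The points above can be checked on a one-element list. This is a minimal sketch; the list `x` and the variable `var` are illustrative names.

```r
x <- list(abc = 1)
var <- "abc"  # element name held in a variable

by_prefix  <- x$a                      # partial match on the prefix "a"
exact      <- x[["a"]]                 # exact matching: no element named "a"
not_exact  <- x[["a", exact = FALSE]]  # what `$` does behind the scenes
by_var     <- x[[var]]                 # use [[ when the name is in a variable
dollar_var <- x$var                    # looks for an element literally named "var"

by_prefix
exact
not_exact
by_var
dollar_var
```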

In [258]:
# Example 1

# store the column name "cyl" in the variable "var"
# "var" now holds the string "cyl", not the column values
var <- "cyl"

# subset variable "cyl": right way 
cat("Right way to subset 'cyl' here:")
mtcars[[var]]

# subset variable "cyl": wrong way (looks for a column literally named "var")
cat("Wrong way to subset 'cyl' here:")
mtcars$var

Right way to subset 'cyl' here:

Wrong way to subset 'cyl' here:

NULL

In [261]:
# Example 2
x <- list(abc = 1)

# use $: 
# this works since "a" is a unique prefix of "abc" (partial matching)
x$a

# this returns NULL since [[ matches names exactly
x[["a"]]

NULL

### Missing/out of bounds indices

1. `[` and `[[` differ a bit when the index is out of bounds (OOB). 
2. If input vector is named, then names of OOB, missing, or NULL components will be \<NA>.

Table: summary of the results of subsetting atomic vectors and lists with `[` and `[[` and different OOB value <br><br>


| Operator | Index    | Atomic | List       |
| :-:      |    :-:   | :-:    |     :-:    | 
| \[       | OOB      | NA     | list(NULL) |
| [        | NA_real_ | NA     | list(NULL) | 
| [        | NULL     | x[0]   | list(NULL) |   
| [[       | OOB      | Error  | Error      |   
| [[       | NA_real_ | Error  | NULL       |      
| [[       | NULL     | Error  | Error      |      
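
A few rows of the table can be reproduced directly. In this sketch `try_index()` is a hypothetical helper that reports `"Error"` instead of stopping, so that all cases can run in one cell.

```r
x <- 1:4                 # atomic vector
l <- list(a = 1, b = 2)  # list

# hypothetical helper: report "Error" instead of stopping
try_index <- function(expr) tryCatch(expr, error = function(e) "Error")

atomic_oob_single <- x[5]               # NA
list_oob_single   <- l[5]               # list holding one NULL
atomic_oob_double <- try_index(x[[5]])  # "Error"
list_oob_double   <- try_index(l[[5]])  # "Error"
atomic_null       <- x[NULL]            # same as x[0]

str(atomic_oob_single)
str(list_oob_single)
atomic_oob_double
list_oob_double
str(atomic_null)
```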

In [270]:
# Example 1
x <- c(1:4)

# structure of 5th element (which doesn't exist)
str(x[5])

# structure of the result for a missing (NA) index: also NA
str(x[NA_real_])

# structure of the result for a NULL index: zero-length vector
str(x[NULL]) 

 int NA
 int NA
 int(0) 


## Subsetting and assignment

All subsetting operators can be combined with assignment to modify selected values of input vector. 

In [290]:
# Example 1: modify vector values by subsetting

# create vector 
x <- 1:5 

# ------ (1) use positive integer subsetting to modify vector values 
# select 1st & 3rd elements
# modify the 2 values 
x[c(1, 3)] <- 2:3
x  

# ------ (2) use negative integer subsetting to modify vector values 
# de-select 1st & 2nd elements
# modify the rest of 3 values 
x[-c(1, 2)] <- 1:3
x 

# ------ (3) Problems 
# ~~~ (a) there is no checking for duplicate indices 
# select 1st element, change value to 3 instead of 2
x[c(1, 1)] <- 2:3
x

# ~~~ (b) you can combine logical indices with NA 
# since NA itself is logical argument
# this is basically x[c(T, F, NA, T, F)] <- 1
# select 1st, 4th and replace with value 1
# skip 2nd, 3rd, 5th elements
x[c(T, F, NA)] <- 1
x 

# ~~~ (c) but you can't combine integer indices with NA
# this gives error 
x[c(1, NA)] <- c(1, 2)

ERROR: Error in x[c(1, NA)] <- c(1, 2): NAs are not allowed in subscripted assignments


In [311]:
# Example 2: conditionally modify vectors in a data frame 

# create data frame 
df <- data.frame(a = c(1, 10, NA), 
                 b = c(2, 3, 4))

# ------ (1) modify value by inequality: 
# if variable "a < 5", then assign value 0
# NA value won't be changed 
df$a[df$a < 5] <- 0
df$a 

# ------ (2) modify value by equality: 
df$a[df$a == 10] <- 1
df$a

# ------ (3) modify missing value: 
df$a[is.na(df$a)] <- 99
df$a

In [23]:
# Example 3: subset with nothing 

# ------ (1) subset with []: 
# as.integer applies to all variables 
# this preserve as: data frame
mtcars[] <- lapply(mtcars, as.integer)
head(mtcars, 3)

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Mazda RX4,1,0,0,0,1,0,0,0,2,1,1
Mazda RX4 Wag,1,0,0,0,1,0,0,0,2,1,1
Datsun 710,1,0,0,0,1,0,1,2,2,1,0


In [28]:
# without []: 
# the result is a plain list, not a data frame 
mtcars <- lapply(mtcars, as.integer)
head(mtcars, 3)

In [41]:
# Example 4: modify list components with NULL

# ------ (1) remove components: assign to NULL

# create list 
x <- list(a = 1, b = 2)
# remove sublist "b" 
x[["b"]] <- NULL 
cat("Modified list x becomes:")
str(x) 

# ------ (2) add component: NULL 

# create list 
y <- list(a = 1)
# add sublist "b": component is NULL
y["b"] <- list(NULL)
cat("\nModified list y becomes:")
str(y) 

Modified list x becomes:List of 1
 $ a: num 1

Modified list y becomes:List of 2
 $ a: num 1
 $ b: NULL


## Applications 

### Lookup tables (character subsetting)

1. Character matching provides a powerful way to make lookup tables. 
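
A minimal sketch of the idea, assuming a made-up vector of abbreviations `x` to be expanded through a named lookup vector:

```r
# hypothetical abbreviations to expand
x <- c("m", "f", "u", "f", "f", "m", "m")
lookup <- c(m = "Male", f = "Female", u = NA)

# each code in x is matched against the names of lookup
expanded <- unname(lookup[x])
expanded
```

Because the index is a character vector, each element of `x` picks the lookup entry with the matching name, and `unname()` drops the names from the result.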

In [76]:
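# Example 1: create race lookup table 

# create race vector (defined first so that both methods can use it)
# note: its values are numeric codes, so lookup[race] below indexes by position;
# lookup[names(race)] would index by name instead
race <- c("a" = 1, "b" = 2, "c" = 3, "d" = 4,  "u" = NA)

# METHOD 1: inline lookup table (more compact)
c(a = "White", b = "Black", c = "Asian", d = "Other", u = NA)[race]

# METHOD 2 
# create lookup table 
lookup <- c(a = "White", b = "Black", c = "Asian", d = "Other", u = NA)  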
  
# apply lookup table 
library(data.table)
race_key <- lookup[race]

# put results in a table 
data.table(race_key, race) 

# to remove names for lookup table 
unname(lookup[race])

race_key,race
White,1.0
Black,2.0
Asian,3.0
Other,4.0
,


### Matching and merging by hand (integer subsetting)

1. If you have a vector of integer grades and a table that describes their properties, you can create a table to look up info. 
2. If you have multiple columns to match on, first collapse them to a single column with `interaction()`, `paste()`, or `plyr::id()`. You can also use `merge()` or `plyr::join()`. 
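
Matching on several columns can be sketched with `paste()` as the collapsing step. The tables `prices` and `obs` below are invented for illustration, and the `"|"` separator is an arbitrary choice that is assumed not to occur in the data.

```r
# hypothetical tables: `obs` needs the price from `prices`,
# matched on both `item` and `size`
prices <- data.frame(item  = c("tea", "tea", "coffee"),
                     size  = c("S", "L", "S"),
                     price = c(2, 3, 2.5))
obs <- data.frame(item = c("coffee", "tea"), size = c("S", "L"))

# collapse the two key columns into one key, then match as before
key_prices <- paste(prices$item, prices$size, sep = "|")
key_obs    <- paste(obs$item, obs$size, sep = "|")
obs$price  <- prices$price[match(key_obs, key_prices)]
obs

# merge() reaches the same prices without building keys by hand
merged <- merge(obs[c("item", "size")], prices, by = c("item", "size"))
merged
```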

In [80]:
# Example 1

# create grades vector 
grades <- c(1, 2, 2, 3, 1)

# create grades performance data frame 
info <- data.frame(
    Grade = 3:1, 
    Description = c("Excellent", "Good", "Poor"),
    Fail = c(F, F, T)
)

# ------ Use this
# Method 1: use match()
# match grades in grades vector & grades performance data 
id <- match(grades, info$Grade)
# get performance for each id 
info[id, ]

# Method 2: use rownames()
rownames(info) <- info$Grade
info[as.character(grades), ]

Unnamed: 0,Grade,Description,Fail
3.0,1,Poor,True
2.0,2,Good,False
2.1,2,Good,False
1.0,3,Excellent,False
3.1,1,Poor,True


Unnamed: 0,Grade,Description,Fail
1.0,1,Poor,True
2.0,2,Good,False
2.1,2,Good,False
3.0,3,Excellent,False
1.1,1,Poor,True


### Random samples/bootstrap (integer subsetting)

1. You can use integer indices to perform random sampling or bootstrapping of a vector or data frame. 
2. `sample()` generates a vector of indices, which you then use for subsetting to access the values. Its arguments control the number of samples to extract and whether sampling is performed with or without replacement.
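
As a sketch of how this extends to bootstrapping a statistic, the code below resamples positions with replacement and recomputes a mean each time. The data, the seed, and the number of replicates are arbitrary choices for illustration.

```r
set.seed(42)  # arbitrary seed, only for reproducibility
values <- c(4.1, 5.3, 6.0, 5.7, 4.9)

# each replicate resamples the positions with replacement,
# then subsetting turns the positions back into values
boot_means <- replicate(200, {
  idx <- sample(length(values), replace = TRUE)
  mean(values[idx])
})

length(boot_means)
range(boot_means)
```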

In [87]:
# Example 1

# create data 
df <- data.frame(x = rep(1:3, each = 2),
                 y = 6:1,
                 z = letters[1:6])

# ------ (1) randomly reorder data 
set.seed(10)
cat("Randomly reorder data: ")
df[sample(nrow(df)), ]

# ------ (2) select 3 random rows of data 
cat("Randomly select 3 rows of data: ")
df[sample(nrow(df), 3), ]

# ------ (3) select 6 rows of bootstrap replicates 
cat("Select 6 rows of bootstrap replicates : ")
df[sample(nrow(df), 6, replace = TRUE), ]

Randomly reorder data: 

Unnamed: 0,x,y,z
3,2,4,c
1,1,6,a
2,1,5,b
6,3,1,f
4,2,3,d
5,3,2,e


Randomly select 3 rows of data: 

Unnamed: 0,x,y,z
3,2,4,c
2,1,5,b
6,3,1,f


Select 6 rows of bootstrap replicates : 

Unnamed: 0,x,y,z
2.0,1,5,b
2.1,1,5,b
5.0,3,2,e
6.0,3,1,f
6.1,3,1,f
3.0,2,4,c


### Ordering (integer subsetting)

1. `order()` takes a vector as input and returns an integer vector describing how the subsetted vector should be ordered. 
2. `order()` can also re-order data frame. 
3. You can also use `sort()` for vectors and `plyr::arrange()` for data frames. These are more concise but less flexible. 
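
One place where `order()` is more flexible is breaking ties with a second key. A minimal sketch on a made-up data frame:

```r
df <- data.frame(x = c(2, 1, 2, 1), y = c(10, 20, 30, 40))

# ties in x are broken by y in decreasing order
by_x_then_y <- df[order(df$x, -df$y), ]
by_x_then_y

# sort() returns the values directly but drops NA by default
sorted <- sort(c(3, NA, 1))
sorted
```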

In [13]:
# Example 1: order vector 

# create vector 
x <- c("b", "c", "a", NA)

# ------ (1) order vector x in increasing order
# order(x) tell how vector x should be ordered 
cat("Order default order: ")
# method 1: use order()
x[order(x)]
# method 2: use sort(), it removes NA
cat("Sort default order: ")
sort(x)

# ------ (2) order vector x in decreasing order
cat("Decreasing order: ")
x[order(x, decreasing = TRUE)]

# ------ (3) order vector x in decreasing order & remove NA 
cat("Decreasing order & remove NA: ")
x[order(x, decreasing = TRUE, na.last = NA)]

# ------ (4) order vector x in decreasing order & put NA to front
cat("Decreasing order & put NA to front: ")
x[order(x, decreasing = TRUE, na.last = FALSE)]

Order default order: 

Sort default order: 

Decreasing order: 

Decreasing order & remove NA: 

Decreasing order & put NA to front: 

In [100]:
# Example 2: order data frame 

# create data 
df <- data.frame(x = rep(1:3, each = 2),
                 y = 6:1,
                 z = letters[1:6])

# ------ (1) randomly order rows of data 
# randomly order rows 
# re-order columns in order 3,2,1
df2 <- df[sample(nrow(df)), 3:1]
df2

# ------ (2) order rows by variable x values 
df2[order(df2$x), ]

# ------ (3) order columns by column names 
df2[, order(names(df2))]

Unnamed: 0,z,y,x
5,e,2,3
2,b,5,1
4,d,3,2
3,c,4,2
6,f,1,3
1,a,6,1


Unnamed: 0,z,y,x
2,b,5,1
1,a,6,1
4,d,3,2
3,c,4,2
5,e,2,3
6,f,1,3


Unnamed: 0,x,y,z
5,3,2,e
2,1,5,b
4,2,3,d
3,2,4,c
6,3,1,f
1,1,6,a


### Expanding aggregated counts (integer subsetting)

1. Sometimes, you get data frame where identical rows have been collapsed into 1 and a count column has been added. `rep()` and integer subsetting make it easy to uncollapse the data by subsetting with a repeated row index. 

In [101]:
# Example 1

# create data
# "n" is the count of repeated rows 
df <- data.frame(x = c(2, 3, 1), 
                 y = c(9, 11, 6),
                 n = c(3, 5, 1))

# uncollapse repeated rows 
df[rep(1:nrow(df), df$n), ]

Unnamed: 0,x,y,n
1.0,2,9,3
1.1,2,9,3
1.2,2,9,3
2.0,3,11,5
2.1,3,11,5
2.2,3,11,5
2.3,3,11,5
2.4,3,11,5
3.0,1,6,1


### Removing columns from data frames (character subsetting)

1. Two ways to remove columns from a data frame: <br>
(1) set individual columns to NULL <br>
(2) subset to return only columns you want, or if you know the columns you don't want, use `setdiff()` to work out which columns to keep.

In [104]:
# Example 1: select/remove columns 

# Method 1: set not wanted column to NULL 
# create data
df <- data.frame(x = 1:3, 
                 y = 3:1, 
                 z = letters[1:3])
# set not wanted column to NULL
df$z <- NULL 
df

# Method 2: select columns to keep 
# create data
df <- data.frame(x = 1:3, 
                 y = 3:1, 
                 z = letters[1:3])
# select wanted columns 
df[c("x", "y")]

# Method 3: use setdiff()
df[setdiff(names(df), "z")]

x,y
1,3
2,2
3,1


x,y
1,3
2,2
3,1


x,y
1,3
2,2
3,1


### Selecting rows based on condition (logical subsetting)

1. Because it allows you to easily combine conditions from multiple columns, logical subsetting is the most commonly used for extracting rows out of data frame. 
2. De Morgan's laws: <br>
!(x&y) is same as !x | !y <br>
!(x|y) is same as !x & !y
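
Both laws can be verified over every combination of TRUE and FALSE, and then applied to a row filter. This is a small sketch that assumes the unmodified built-in `mtcars` data set.

```r
x <- c(TRUE, TRUE, FALSE, FALSE)
y <- c(TRUE, FALSE, TRUE, FALSE)

identical(!(x & y), !x | !y)
identical(!(x | y), !x & !y)

# practical use: "not (gear is 5 and cyl is 4)" written two equivalent ways
a <- mtcars[!(mtcars$gear == 5 & mtcars$cyl == 4), ]
b <- mtcars[mtcars$gear != 5 | mtcars$cyl != 4, ]
identical(a, b)
```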

In [4]:
# Example 1: extract data 

# Method 1: subset() - this is more concise
# ------ (1) extract rows by 1 equality 
subset(mtcars, gear == 5)

# ------ (2) extract rows by 2 equalities
subset(mtcars, gear == 5 & cyl == 4)

# Method 2 (more verbose)
# extract rows by 1 equality 
mtcars[mtcars$gear == 5, ]

# extract rows by 2 equalities
mtcars[mtcars$gear == 5 & mtcars$cyl == 4, ]

Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2
Ford Pantera L,15.8,8,351.0,264,4.22,3.17,14.5,0,1,5,4
Ferrari Dino,19.7,6,145.0,175,3.62,2.77,15.5,0,1,5,6
Maserati Bora,15.0,8,301.0,335,3.54,3.57,14.6,0,1,5,8


Unnamed: 0,mpg,cyl,disp,hp,drat,wt,qsec,vs,am,gear,carb
Porsche 914-2,26.0,4,120.3,91,4.43,2.14,16.7,0,1,5,2
Lotus Europa,30.4,4,95.1,113,3.77,1.513,16.9,1,1,5,2


### Boolean algebra vs sets (logical & integer subsetting)

1. It's useful to know the natural equivalence between set operations (integer subsetting) and boolean algebra (logical subsetting). 
2. Using set operations is more effective when you: <br>
(1) want to find the first/last TRUE <br>
(2) have few TRUEs and very many FALSEs. A set representation may be faster and require less storage.<br>
Note: `which()` converts a boolean representation to an integer representation. There is no reverse operation in base R, but you can create one. 
3. Use `x[y]`, not `x[which(y)]`, to subset. Here `which()` switches from logical to integer subsetting, but as long as y contains no NA the result is exactly the same, so the call achieves nothing. 
4. Also `x[-which(y)]` is not the same as `x[!y]`: if y is all FALSE, `which(y)` returns `integer(0)`, and `-integer(0)` is still `integer(0)`, so you get no values instead of all values. 
5. In general, avoid switching from logical to integer subsetting unless you want the first or last TRUE value.
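
The `x[-which(y)]` pitfall is easy to reproduce. A minimal sketch with made-up vectors:

```r
x <- 1:5
y <- rep(FALSE, 5)  # no TRUE values at all

with_which <- x[-which(y)]  # -integer(0) is integer(0): selects nothing
with_not   <- x[!y]         # !y is all TRUE: selects everything

with_which
with_not

# which() is still handy for the first or last TRUE
flags <- c(FALSE, TRUE, FALSE, TRUE)
first_true <- which(flags)[1]
last_true  <- max(which(flags))
first_true
last_true
```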

In [18]:
# Example 1: which()

# create vector 
# `sample(10)` randomly permutes the numbers 1:10 
# if value < 4, TRUE; if value >= 4, FALSE
# all values are boolean (logical)
set.seed(1)
(x <- sample(10) < 4)

# use which()
# find positions of values in x that are TRUEs
which(x)

In [19]:
# Example 2: create unwhich()

# create unwhich() function
unwhich <- function(x, n) {
    out <- rep_len(FALSE, n)
    out[x] <- TRUE
    out
}

# use unwhich()
# recover vector x 
unwhich(which(x), 10)

In [42]:
# Example 3

# create 2 logical vectors and their equivalents
# `%%` is the modulo operator: x %% y is the remainder of x / y
# 1 %% 2 is 1 (not 0) 
# 2 %% 2 is 0
# so all even numbers are TRUEs
(x1 <- 1:10 %% 2 == 0)

# check positions of x1 that are TRUEs
(x2 <- which(x1))

# create 2 logical vectors and their equivalents
(y1 <- 1:10 %% 5 == 0)

# check positions of y1 that are TRUEs
(y2 <- which(y1))

# ------ (1) get intersection 

# method 1: better 
# intersect() 
cat("Intersection by position is: ")
intersect(x2, y2)

# method 2
# & 
cat("Intersection by logical is: ")
x1 & y1 

# ------ (2) get union
# method 1: better 
# union()
cat("Union by position is: ")
union(x2, y2)

# method 2
# |
cat("Union by logical is: ")
x1 | y1

# ------ (3) in 1st vector, not in 2nd vector

# method 1: better 
# setdiff()
# setdiff() finds all elements that are in x2, if elements show in y2, then omit
cat("Set difference by position is: ")
setdiff(x2, y2)

# method 2
# & !
cat("Set difference by logical is: ")
x1 & !y1

# ------ (4) in exactly one of the two vectors (symmetric difference)
# method 1: better 
# xor()
xor(x1, y1)

# method 2
# setdiff(union, intersect)
# union(x2, y2) = 2, 4, 6, 8, 10, 5
# intersect(x2, y2) = 10
# keep the elements that are in the union but not in the intersection
setdiff(union(x2, y2), intersect(x2, y2))

Intersection by position is: 

Intersection by logical is: 

Union by position is: 

Union by logical is: 

Set difference by position is: 

Set difference by logical is: 

In [59]:
# Example 4: subset vector with another vector 

# create vector 
y <- 1:10

# ------ (1) subset even number positioned values 
# create logical vector
# mod 2 
cat("Logical vector: ")
(x1 <- 1:10 %% 2 == 0)

# subset vector at even positions 
cat("Subset vector at even positions: ")
y[x1]

# subset vector at odd positions 
# use negate !
cat("Subset vector at odd positions: ")
y[!x1] 

Logical vector: 

Subset vector at even positions: 

Subset vector at odd positions: 