**Introduction**<br>
1. R's base data structures can be organized by their dimensionality (1d, 2d, or nd) and whether they're homogeneous (all contents must be of the same type) or heterogeneous (contents can be of different types).
2. Almost all other objects are built upon these foundations. More complicated objects in Chapter 7 are built of these simple pieces.<br>
3. R has no 0-dimensional or scalar types. Individual numbers/strings are actually vectors of length one. 
4. `str()` (short for structure) gives a compact description of any R data structure.
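
Points 3 and 4 can be checked directly. The following is a minimal sketch in base R; the variable name `single` is only illustrative.

```r
# A single number is a vector of length one, not a scalar.
single <- 42
length(single)       # 1
is.vector(single)    # TRUE

# str() prints a compact description of whatever it is given.
str(single)
str(c("a", "b"))
str(list(1, "a"))
```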

Table: Five data types most often used in data analysis
<br><br>

|     | Homogeneous   | Heterogeneous | 
| :-: |      :-:      |       :-:     | 
| 1d  | Atomic vector |       list    |
| 2d  | Matrix        |    Data frame |
| nd  | Array         |               |   

Note: Vectors (both atomic vectors and lists) are one-dimensional. Matrices and data frames have both rows and columns. An array can have n dimensions; a 3d array can be pictured as a stack of blocks, each containing rows and columns.

## Vectors

1. Vector is the basic data structure in R.<br>
Vectors come in 2 flavors: <br>
(1) atomic vectors<br>
(2) lists 
2. Atomic vectors and lists have 3 common properties<br>
(1) type: `typeof()` what it is<br>
(2) length: `length()` how many elements<br>
(3) attributes: `attributes()` additional arbitrary metadata.
3. Difference between atomic vectors and lists: <br>
(1) all elements of atomic vector must be of same type (homogeneous)<br>
(2) elements of a list can have different type (heterogeneous).
4. `is.vector()` doesn't test if an object is a vector. It returns TRUE only if the object is a vector with no attributes (except names). <br>
Use `is.atomic(x) || is.list(x)` to test if an object is actually a vector.  
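
A short sketch of why `is.vector()` can mislead. The attribute name `"unit"` is made up for illustration.

```r
x <- 1:3
is.vector(x)                   # TRUE: x has no attributes

# An attribute other than names makes is.vector() return FALSE,
# even though x is still an atomic vector.
attr(x, "unit") <- "cm"
is.vector(x)                   # FALSE
is.atomic(x) || is.list(x)     # TRUE

# Names alone do not change the result.
y <- c(a = 1, b = 2)
is.vector(y)                   # TRUE
```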

### Atomic vectors

1. Four main types of atomic vectors:<br>
(1) logical<br>
(2) integer<br>
(3) double (numeric) <br>
(4) character<br>

{2 rare types: (5) complex (6) raw}

2. Atomic vectors are always flat, even if you nest `c()`
3. Missing values are specified with NA, a logical vector of length 1.<br>
NA will be coerced to the correct type if used inside `c()`, or you can create NAs of a specific type with `NA_real_` (a double vector), `NA_integer_`, or `NA_character_` (logical is omitted since NA itself is logical).
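
The coercion of `NA` described above can be observed with `typeof()`. This is a small illustrative sketch using only base R.

```r
typeof(NA)                 # "logical"

# Inside c(), NA takes on the type of the other elements.
typeof(c(NA, 1L))          # "integer"
typeof(c(NA, 2.5))         # "double"
typeof(c(NA, "a"))         # "character"

# Typed missing values keep their type even on their own.
typeof(NA_integer_)        # "integer"
typeof(NA_real_)           # "double"
typeof(NA_character_)      # "character"
```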

In [4]:
# Example 1

# double: whole numbers written without the L suffix and decimals are both doubles
double_var <- c(1, 2.5, 4.5)
integer_var <- c(2L, 3L, 5L)
logical_var <- c(TRUE, FALSE, T, F, NA)
character_var <- c("these are", "strings")

In [5]:
# Example 2

# method 1
c(1, c(2, c(3, 4)))

# method 2
c(1, 2, 3, 4)

#### Types and tests

1. Use `typeof()` to determine a vector's type. <br>
Or check if it's a specific type with `is.character()`, `is.double()`, `is.integer()`, `is.logical()`, or more general `is.atomic()`. 
2. Note `is.numeric()` is TRUE for integer or double vectors.  

In [6]:
# Example 1

# integer
typeof(integer_var)
is.integer(integer_var)
is.atomic(integer_var)

# double
typeof(double_var)
is.double(double_var)
is.atomic(double_var)

# numeric
is.numeric(integer_var)
is.numeric(double_var)

#### Coercion

1. If you combine different types in a vector, they will be coerced to the most flexible type.
2. Types from least flexible to most flexible are: logical $\rightarrow$ integer $\rightarrow$ double $\rightarrow$ character.
3. Coercion often happens automatically. <br>
(1) Most math functions (`+`, `log`, `abs`, etc.) will coerce to double or integer<br>
(2) Most logical operations (`&`, `|`, `any`, etc.) will coerce to logical. <br>
4. To avoid confusion, coerce explicitly with `as.character()`, `as.double()`, `as.vector()`, `as.integer()`, `as.numeric()`, `as.logical()`.
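
The coercion order and the explicit `as.*()` functions can be tried side by side. This is a sketch with made-up input values; `suppressWarnings()` is used only to keep the output quiet.

```r
# Implicit coercion follows logical -> integer -> double -> character.
typeof(c(TRUE, 1L))        # "integer"
typeof(c(1L, 2.5))         # "double"
typeof(c(2.5, "a"))        # "character"

# Explicit coercion makes the intent visible.
as.integer(c("1", "2"))    # 1 2
as.integer(2.9)            # 2 (truncates toward zero)
as.character(c(TRUE, FALSE))           # "TRUE" "FALSE"
as.logical(c("TRUE", "false", "yes"))  # TRUE FALSE NA

# Values that cannot be converted become NA (with a warning).
bad <- suppressWarnings(as.integer("abc"))
is.na(bad)                 # TRUE
```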

In [8]:
# Example 1

# double coerced to character 
str(c("a", 1))

# logical coerced to numeric 
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
# number of TRUEs
sum(x)
# proportion that are TRUE
mean(x)

 chr [1:2] "a" "1"


### Lists

1. Elements of lists can be of any type, including lists. 
2. Lists are called **recursive vectors** sometimes, because a list can contain other lists.
3. `c()` combines several lists into one.
4. If given a combination of atomic vectors and lists, `c()` coerces vectors to lists before combining them.
5. `typeof()` of a list is `"list"`. 
6. Use `is.list()` to test for a list. Use `as.list()` to coerce to a list.
7. Turn a list into atomic vector with `unlist()`
8. If elements of a list have different types, `unlist()` uses same coercion rules as `c()`.
9. Lists are used to build up many complicated data structures in R. For example, both data frame and linear model objects are lists.
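
Points 6 to 8 can be sketched as follows. The list contents are arbitrary examples.

```r
# unlist() flattens a list into an atomic vector.
same_type <- list(1:2, 3:4)
unlist(same_type)              # 1 2 3 4
typeof(unlist(same_type))      # "integer"

# With mixed types, unlist() coerces like c(): here everything becomes character.
mixed <- list(1L, 2.5, "a", TRUE)
unlist(mixed)                  # "1" "2.5" "a" "TRUE"
typeof(unlist(mixed))          # "character"

# as.list() goes the other way: each element becomes its own list element.
as_list <- as.list(1:3)
is.list(as_list)               # TRUE
length(as_list)                # 3
```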

In [10]:
# Example 1: create list 

x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)

List of 4
 $ : int [1:3] 1 2 3
 $ : chr "a"
 $ : logi [1:3] TRUE FALSE TRUE
 $ : num [1:2] 2.3 5.9


In [12]:
# Example 2: create list inside list 

x <- list(list(list(list())))
str(x)
is.recursive(x)

List of 1
 $ :List of 1
  ..$ :List of 1
  .. ..$ : list()


In [16]:
# Example 3: 

# list():  list of 2
x <- list(list(1, 2), c(3, 4))
str(x)

# c():  list of 4
y <- c(list(1, 2), c(3, 4))
str(y)

List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ : num [1:2] 3 4
List of 4
 $ : num 1
 $ : num 2
 $ : num 3
 $ : num 4


In [21]:
# Example 4

# data frame is a list
is.list(mtcars)

# linear model is a list
model <- lm(mpg ~ wt, data = mtcars)
is.list(model)

## Attributes

1. All objects can have arbitrary additional attributes, used to store metadata about the object.
2. Attributes can be thought of as a named list (with unique names)
3. You can access attributes individually with `attr()` or all at once (as a list) with `attributes()`.
4. `structure()` function returns a new object with modified attributes.
5. By default, most attributes are lost when modifying a vector. 
6. The only attributes not lost are: <br>
(1) names: a character vector giving each element a name. Use `names()` to set names or access names.<br>
(2) dimensions: used to turn vectors into matrices and arrays. Use `dim()` to access.<br>
(3) class: used to implement S3 object system. Use `class()` to access. 

In [24]:
# Example 1

y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

str(attributes(y))

List of 1
 $ my_attribute: chr "This is a vector"


In [25]:
# Example 2: this returns a new object with modified attributes

structure(1:10, my_attribute = "This is a vector")

In [26]:
# Example 3: modified vector loses attributes

attributes(y[1])
attributes(sum(y))

NULL

NULL

### Names

1. You can name a vector in 3 ways:<br>
(1) when creating it<br>
(2) modifying an existing vector in place<br>
(3) create a modified copy of vector using `setNames()`
2. You can create a new vector without names using `unname()`, or remove names in place with `names(x) <- NULL`.<br>
Note: `setNames()` can only be used to set names; `names()` can be used to set names and check/access names. 

In [31]:
# Example 1

# name vector when creating it 
x <- c(a = 1, b = 2, c = 3)
x

# modify an existing vector in place 
y <- 1:3
names(y) <- c("a", "b", "c")
y 

# create a modified copy of vector
z <- setNames(1:3, c("a", "b", "c"))
z

In [30]:
# Example 2

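# only some elements named: the missing names are empty strings
k <- c(a = 1, 2, 3)
names(k)

# no names at all: names() returns NULL
l <- c(1, 2, 3)
names(l)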

NULL

In [36]:
# Example 3: remove names (from example 1)

# method 1
unname(y)

# method 2
names(z) <- NULL
names(z)

NULL

### Factors

1. A factor is a vector that can contain only predefined values, and is used to store categorical data.
2. Factors are built on top of integer vectors using 2 attributes:<br>
(1) `class()`, "factor", which makes them behave differently from regular integer vectors<br>
(2) `levels()`, which defines set of allowed values.
3. You can't use values not in the levels.
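4. You can't combine factors with `c()` in R versions before 4.1.0: the result is an integer vector of the underlying codes. (From R 4.1.0, `c()` combines factors and takes the union of their levels.) 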
5. Factors are useful when you know possible values a variable may take. Using factor (not character) vector makes it obvious when some groups contain no observations.
6. Sometimes a column of a data frame read from a file comes out as a factor when you expected numeric. This is caused by a non-numeric value in the column (often a missing value encoded as `.` or `-`). Here are some solutions:<br> 
(1) Coerce the vector from factor $\rightarrow$ character vector $\rightarrow$ double vector.<br>
(2) Use the `na.strings` argument of `read.csv()`.<br>
(3) Use `stringsAsFactors = FALSE` to suppress this behavior (before R 4.0.0, most data loading functions in R automatically converted character vectors to factors), then manually convert character vectors to factors where needed. <br>
(NOT RECOMMENDED) You can also use the global option `options(stringsAsFactors = FALSE)`. 
7. While factors look like character vectors, they are integers. It's best to convert factors to character vectors if you need string-like behavior. 
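
The difference between a factor's integer codes and its labels is easiest to see on a factor whose labels look numeric. This is a minimal sketch with made-up values.

```r
f <- factor(c("b", "a", "b"))
typeof(f)                      # "integer"
as.integer(f)                  # 2 1 2: positions in levels(f)
levels(f)                      # "a" "b"
as.character(f)                # "b" "a" "b"

# Converting a numeric-looking factor directly returns the codes,
# not the numbers; go through character first.
g <- factor(c("10", "20", "10"))
as.numeric(g)                  # 1 2 1
as.numeric(as.character(g))    # 10 20 10
```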

In [43]:
# Example 1: create factor
x <- factor(c("a", "b", "a", "b"))
x

# 2 attributes of factor vector
class(x)
levels(x)

# can't use values not in levels 
x[2] <- "c"
x

# can't combine factors
c(factor("a"), factor("b"))

“invalid factor level, NA generated”


In [44]:
# Example 2: 

sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

table(sex_char)
table(sex_factor)

sex_char
m 
3 

sex_factor
m f 
3 0 

In [52]:
# Example 3: 
# use na.strings to STOP loading data as factor by default 

# read.csv: without na.strings 

# here, "\n" is a newline that separates the rows of the text input
# here, "." stands for a missing value
z <- read.csv(text = "value\n12\n1\n.\n9")
z

# without na.strings, the "." makes the column load as a factor (as character from R 4.0.0)
typeof(z$value)

# cannot directly coerce from factor to numeric
# if we do so, it'll be converted not by original number, but by levels 
as.double(z$value)

# class is factor
class(z$value)

# solutions: 

# Method 1: convert factor to character, then convert to numeric
as.double(as.character(z$value))

# Method 2: use na.strings
z <- read.csv(text = "value\n12\n1\n.\n9", na.strings = ".")

# check type
typeof(z$value)
# check class 
class(z$value)
# type z value 
z$value

value
12
1
.
9


“NAs introduced by coercion”


## Matrices & arrays

1. You can create matrices and arrays with `matrix()` and `array()`, or by assigning to `dim()`.
2. `length()` and `names()` have high-dimensional generalizations: <br>
(1) `length()` generalizes to `nrow()` and `ncol()` for matrices, and `dim()` for arrays. <br>
(2) `names()` generalizes to `rownames()` and `colnames()` for matrices, and `dimnames()`, a list of character vectors, for arrays. 
3. `c()` generalizes to `cbind()` and `rbind()` for matrices, and to `abind()` (from the abind package) for arrays. 
4. You can transpose a matrix with `t()`; the generalized version for arrays is `aperm()`. 
5. Test if an object is a matrix or array using `is.matrix()` or `is.array()`, or by looking at the length of `dim()`. `as.matrix()` and `as.array()` turn an existing vector into a matrix or array. 
6. Vectors are not the only 1d data structure. You can have matrices with a single row or a single column, or arrays with a single dimension. <br>
They may print similarly, but will behave differently. Use `str()` to reveal the differences. 
7. While atomic vectors are most commonly turned into matrices, the dimension attribute can also be set on lists to make list-matrices or list-arrays. 
8. List-matrices or list-arrays are esoteric data structures, but can be useful if you want to arrange objects into grid-like structure. 
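
Points 3 to 5 can be illustrated with base R alone (`abind()` is left out because it needs an extra package). The object names `m` and `arr` are illustrative.

```r
m <- matrix(1:6, nrow = 2)         # 2 rows, 3 columns

# rbind() adds rows, cbind() adds columns.
dim(rbind(m, 7:9))                 # 3 3
dim(cbind(m, 10:11))               # 2 4

# t() swaps rows and columns.
dim(t(m))                          # 3 2

# aperm() permutes array dimensions; by default it reverses them.
arr <- array(1:24, c(2, 3, 4))
dim(aperm(arr))                    # 4 3 2
dim(aperm(arr, c(2, 1, 3)))        # 3 2 4

# A matrix is a 2d array; a 3d array is not a matrix.
is.matrix(m)                       # TRUE
is.array(m)                        # TRUE
is.matrix(arr)                     # FALSE
length(dim(arr))                   # 3
```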

In [12]:
# Example 1: matrix 
a <- matrix(1:6, ncol = 3, nrow = 2)
a

# Example 2 
b <- array(1:12, c(2, 3, 2))
b

# Example 3
# modify an object in place 
c <- 1:6
dim(c) <- c(3, 2)
c

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


In [18]:
# Example 4: matrix  

# create matrix 
a <- matrix(1:6, ncol = 3, nrow = 2)

length(a)
nrow(a)
ncol(a)

rownames(a) <- c("A", "B")
colnames(a)  <- c("a", "b", "c")

# Example 5: array 

# create array 
b  <- array(1:12, c(2, 3, 2))

length(b)
dim(b)

dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b

In [25]:
# Example 6

# 1d vector
str(1:3)

# column vector (1 column matrix)
str(matrix(1:3, ncol = 1))

# row vector (1 row matrix)
str(matrix(1:3, nrow = 1))

# array vector (1d array)
str(array(1:3, 3))

 int [1:3] 1 2 3
 int [1:3, 1] 1 2 3
 int [1, 1:3] 1 2 3
 int [1:3(1d)] 1 2 3


In [29]:
# Example 7

# create a list
l <- list(1:3, "a", TRUE, 1.0)

# turn list into matrix
dim(l) <- c(2, 2)
l

0,1
"1, 2, 3",TRUE
a,1


## Data frame

1. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2d structure, so it shares properties of both matrix (2d) and list (1d). 
2. A data frame has `names()` or `colnames()` (they're the same), and `rownames()`.
3. `length()` of a data frame is the length of the underlying list, so it is the same as `ncol()`; `nrow()` gives the number of rows.
4. You can subset a data frame like 1d structure (it behaves like list) or 2d structure (behaves like matrix). 
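
A small sketch of these properties, using a hypothetical data frame named `df_demo`:

```r
df_demo <- data.frame(x = 1:3, y = c("a", "b", "c"))

# length() counts columns because a data frame is a list of columns.
length(df_demo)                               # 2
ncol(df_demo)                                 # 2
nrow(df_demo)                                 # 3
identical(names(df_demo), colnames(df_demo))  # TRUE

# List-style subsetting selects a column.
df_demo[["x"]]                                # 1 2 3

# Matrix-style subsetting selects by row and column.
df_demo[2, "x"]                               # 2
df_demo[1:2, ]                                # first two rows, all columns
```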

### Creation

1. Use `data.frame()` to create a data frame.
2. `data.frame()` turns strings into factors by default in R versions before 4.0.0 (the default became `stringsAsFactors = FALSE` in R 4.0.0). Use `stringsAsFactors = FALSE` to suppress it.

In [31]:
# Example 1
df <- data.frame(
    x = 1:3,
    y = c("a", "b", "c")
)
str(df)

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: Factor w/ 3 levels "a","b","c": 1 2 3


In [33]:
# Example 2 
df <- data.frame(
    x = 1:3,
    y= c("a", "b", "c"),
    stringsAsFactors = FALSE)
str(df)

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: chr  "a" "b" "c"


### Testing & coercion

1. Because a data frame is an S3 class, its type reflects the underlying vector used to build it: the list. 
2. Use `class()` or `is.data.frame()` to check whether an object is a data frame.
3. You can coerce an object to data frame with `as.data.frame()`<br>
(1) a vector will create a 1-column data frame <br>
(2) a list will create one column for each element; it's an error if they're not all same length. <br>
(3) a matrix will create a data frame with same number of columns and rows as matrix. 
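
The three coercion cases can be sketched as follows. The inputs are made up, and `tryCatch()` is used only to capture the error from the unequal-length list.

```r
# (1) A vector becomes a one-column data frame.
dim(as.data.frame(1:3))                       # 3 1

# (2) A list gives one column per element.
from_list <- as.data.frame(list(a = 1:2, b = c("x", "y")))
dim(from_list)                                # 2 2
names(from_list)                              # "a" "b"

# Elements of lengths 2 and 3 cannot be lined up, so this is an error.
outcome <- tryCatch(
  as.data.frame(list(a = 1:2, b = 1:3)),
  error = function(e) "error"
)
outcome                                       # "error"

# (3) A matrix keeps its number of rows and columns.
dim(as.data.frame(matrix(1:6, nrow = 2)))     # 2 3
```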

In [34]:
# Example 1
typeof(df)
class(df)
is.data.frame(df)

### Combining data frames

1. Use `cbind()`, `rbind()` to combine data frames. 
2. Use `plyr::rbind.fill()` to combine data frames that don't have same columns. 
3. Calling `cbind()` on 2 vectors creates a matrix, not a data frame; use `data.frame()` to create a data frame directly. 

In [38]:
# Example 1
cbind(df, data.frame(z = 3:1))
rbind(df, data.frame(x = 10, y = "z"))

x,y,z
1,a,3
2,b,2
3,c,1


x,y
1,a
2,b
3,c
10,z


In [47]:
# Example 2: cbind 2 vectors actually creates a matrix 
# create vectors 
a <- c(1, 2)
b <- c("a", "b")
cbind(a,b)
class(cbind(a,b))

# Example 3: use data.frame() to create a data frame directly
data.frame(a = 1:2, b = c("a", "b"))
class(data.frame(a = 1:2, b = c("a", "b")))

a,b
1,a
2,b


a,b
1,a
2,b


### Special columns

1. Since a data frame is a list of vectors, it's possible for a data frame to have a column that is a list. 
2. However, when a list is given to `data.frame()`, it tries to put each item of list into its own column, so it fails if you put different lengths of vector/list in `data.frame()`.
3. We can use `I()` which causes `data.frame()` to treat the list as one unit. 
4. We can also have a column of a data frame that's a matrix or array, as long as the number of rows matches the data frame. 
5. Use list and array columns with caution: many functions that work with data frames assume that all columns are atomic vectors. 

In [48]:
# Example 1
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df 

x,y
1,"1, 2"
2,"1, 2, 3"
3,"1, 2, 3, 4"


In [49]:
# Example 2
data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))

ERROR: Error in (function (..., row.names = NULL, check.rows = FALSE, check.names = TRUE, : arguments imply differing number of rows: 2, 3, 4


In [65]:
# Example 3
df1 <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(df1)

# access second element in variable "y"
df1[2, "y"]
# access first element in variable "y"
df1[1, "y"]
# access first element in variable "x"
df1[1, "x"]

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y:List of 3
  ..$ : int  1 2
  ..$ : int  1 2 3
  ..$ : int  1 2 3 4
  ..- attr(*, "class")= chr "AsIs"


[[1]]
[1] 1 2 3


[[1]]
[1] 1 2


In [66]:
# Example 4
dfm <- data.frame(x = 1:3, y = I(matrix(1:9, nrow = 3)))
str(dfm)
dfm[2, "y"]

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: 'AsIs' int [1:3, 1:3] 1 2 3 4 5 6 7 8 9


     [,1] [,2] [,3]
[1,]    2    5    8