# 캐글뽀개기 Analysis Tools (Weekend) R, 2015-10-24 (Sat): Advanced R by Hadley Wickham
## Introduction
- It’s free, open source, and available on every major platform. As a result, if you do your analysis in R, anyone can easily replicate it.
- A massive set of packages for statistical modelling, machine learning, visualisation, and importing and manipulating data. Whatever model or graphic you’re trying to do, chances are that someone has already tried to do it. At a minimum, you can learn from their efforts.
- Cutting edge tools. Researchers in statistics and machine learning will often publish an R package to accompany their articles. This means immediate access to the very latest statistical techniques and implementations.
- Deep-seated language support for data analysis. This includes features like missing values, data frames, and subsetting.
- A fantastic community. It is easy to get help from experts on the R-help mailing list, stackoverflow, or subject-specific mailing lists like R-SIG-mixed-models or ggplot2. You can also connect with other R learners via twitter, linkedin, and through many local user groups.
- Powerful tools for communicating your results. R packages make it easy to produce html or pdf reports, or create interactive websites.
- A strong foundation in functional programming. The ideas of functional programming are well suited to solving many of the challenges of data analysis. R provides a powerful and flexible toolkit which allows you to write concise yet descriptive code.
- An IDE tailored to the needs of interactive data analysis and statistical programming.
- Powerful metaprogramming facilities. R is not just a programming language, it is also an environment for interactive data analysis. Its metaprogramming capabilities allow you to write magically succinct and concise functions and provide an excellent environment for designing domain-specific languages.
- Designed to connect to high-performance programming languages like C, Fortran, and C++.
- R also has its challenges. Much of the R code you’ll see in the wild is written in haste to solve a pressing problem. As a result, the code is often not very elegant, fast, or easy to understand. Most users do not revise their code to address these shortcomings.
- Compared to other programming languages, the R community tends to be more focussed on results instead of processes. Knowledge of software engineering best practices is patchy: for instance, not enough R programmers use source code control or automated testing.
- Metaprogramming is a double-edged sword. Too many R functions use tricks to reduce the amount of typing at the cost of making code that is hard to understand and that can fail in unexpected ways.
- Inconsistency is rife across contributed packages, even within base R. You are confronted with over 20 years of evolution every time you use R. Learning R can be tough because there are many special cases to remember.
- R is not a particularly fast programming language, and poorly written R code can be terribly slow. R is also a profligate user of memory.

## Who should read this book
- Intermediate R programmers who want to dive deeper into R and learn new strategies for solving diverse problems.
- Programmers from other languages who are learning R and want to understand why R works the way it does.


## Data structures
- This chapter summarises the most important data structures in base R. You’ve probably used many (if not all) of them before, but you may not have thought deeply about how they are interrelated. In this brief overview, I won’t discuss individual types in depth. Instead, I’ll show you how they fit together as a whole. If you need more details, you can find them in R’s documentation.

- R’s base data structures can be organised by their dimensionality (1d, 2d, or nd) and whether they’re homogeneous (all contents must be of the same type) or heterogeneous (the contents can be of different types). This gives rise to the five data types most often used in data analysis:

| dim | Homogeneous   | Heterogeneous |
|:----|:--------------|:--------------|
| 1d  | Atomic vector | List          |
| 2d  | Matrix        | Data frame    |
| nd  | Array         |               |

- Almost all other objects are built upon these foundations. In the OO field guide you’ll see how more complicated objects are built of these simple pieces. Note that R has no 0-dimensional, or scalar types. Individual numbers or strings, which you might think would be scalars, are actually vectors of length one.
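
The claim that R has no scalar type can be checked directly. This is a minimal sketch; the variable names `one_number` and `one_string` are only illustrative.

```r
# A single number or string is a vector of length one
one_number <- 42
one_string <- "hello"

length(one_number)                                  # 1
length(one_string)                                  # 1
is.atomic(one_number) && length(one_number) == 1L   # TRUE

# Indexing a length-one vector gives back the same vector
identical(one_number[1], one_number)                # TRUE
```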

## Vectors
- The basic data structure in R is the vector. Vectors come in two flavours: atomic vectors and lists. They have three common properties:

- Type, typeof(), what it is.
- Length, length(), how many elements it contains.
- Attributes, attributes(), additional arbitrary metadata.
- They differ in the types of their elements: all elements of an atomic vector must be the same type, whereas the elements of a list can have different types.
- NB: is.vector() does not test if an object is a vector. Instead it returns TRUE only if the object is a vector with no attributes apart from names. Use is.atomic(x) || is.list(x) to test if an object is actually a vector.
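
The is.vector() caveat is easiest to see on a vector that carries an extra attribute. A small sketch, with `v` and `w` as made-up example objects:

```r
# A vector that carries an attribute other than names
v <- 1:3
attr(v, "note") <- "metadata"

is.vector(v)                 # FALSE, because of the extra attribute
is.atomic(v) || is.list(v)   # TRUE, v is still an atomic vector

# Names alone do not change the answer
w <- c(a = 1, b = 2)
is.vector(w)                 # TRUE
```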

In [1]:
typeof(2)

In [2]:
mode(2)

In [4]:
is.atomic(2)

In [5]:
is.vector(2)

## Atomic vectors
- There are four common types of atomic vectors that I’ll discuss in detail: logical, integer, double (often called numeric), and character. There are two rare types that I will not discuss further: complex and raw.
- Atomic vectors are usually created with c(), short for combine:

In [6]:
dbl_var <- c(1, 2.5, 4.5)
# With the L suffix, you get an integer rather than a double
int_var <- c(1L, 6L, 10L)
# Use TRUE and FALSE (or T and F) to create logical vectors
log_var <- c(TRUE, FALSE, T, F)
chr_var <- c("these are", "some strings")

In [7]:
typeof(dbl_var)

In [8]:
typeof(int_var)

In [9]:
typeof(log_var)

In [10]:
typeof(chr_var)

In [11]:
c(1, c(2, c(3, 4)))
#> [1] 1 2 3 4
# the same as
c(1, 2, 3, 4)
#> [1] 1 2 3 4

- Missing values are specified with NA, which is a logical vector of length 1. NA will always be coerced to the correct type if used inside c(), or you can create NAs of a specific type with NA_real_ (a double vector), NA_integer_ and NA_character_.
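
The way NA adapts to its surroundings can be observed with typeof(). This is only an illustrative sketch of the behaviour described above.

```r
# A bare NA is logical
typeof(NA)              # "logical"

# Inside c(), NA is coerced to the type of the other elements
typeof(c(NA, 1L))       # "integer"
typeof(c(NA, 2.5))      # "double"
typeof(c(NA, "a"))      # "character"

# Typed missing values keep their type on their own
typeof(NA_integer_)     # "integer"
typeof(NA_real_)        # "double"
typeof(NA_character_)   # "character"
```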

## Types and tests
- Given a vector, you can determine its type with typeof(), or check if it’s a specific type with an “is” function: is.character(), is.double(), is.integer(), is.logical(), or, more generally, is.atomic().

In [12]:
int_var <- c(1L, 6L, 10L)
typeof(int_var)
#> [1] "integer"
is.integer(int_var)
#> [1] TRUE
is.atomic(int_var)
#> [1] TRUE

dbl_var <- c(1, 2.5, 4.5)
typeof(dbl_var)
#> [1] "double"
is.double(dbl_var)
#> [1] TRUE
is.atomic(dbl_var)
#> [1] TRUE

In [13]:
is.numeric(int_var) # is.numeric() returns TRUE for both integer and double vectors
#> [1] TRUE
is.numeric(dbl_var)
#> [1] TRUE

## Coercion
- All elements of an atomic vector must be the same type, so when you attempt to combine different types they will be coerced to the most flexible type. Types from least to most flexible are: logical, integer, double, and character.
- For example, combining a character and an integer yields a character:

In [14]:
str(c("a", 1))
#>  chr [1:2] "a" "1"

In [16]:
x <- c(FALSE, FALSE, TRUE)
as.numeric(x)
#> [1] 0 0 1

In [17]:
# Total number of TRUEs
sum(x)
#> [1] 1


In [18]:
# Proportion that are TRUE
mean(x)
#> [1] 0.3333333

- Coercion often happens automatically. Most mathematical functions (+, log, abs, etc.) will coerce to a double or integer, and most logical operations (&, |, any, etc.) will coerce to a logical. You will usually get a warning message if the coercion might lose information. If confusion is likely, explicitly coerce with as.character(), as.double(), as.integer(), or as.logical().
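
The coercion order (logical, integer, double, character) and the effect of a lossy explicit coercion can be sketched as follows. The name `parsed` is just an example, and suppressWarnings() is used here only to keep the output quiet.

```r
# Combining types coerces to the most flexible one
typeof(c(TRUE, 1L))    # "integer"
typeof(c(1L, 2.5))     # "double"
typeof(c(2.5, "a"))    # "character"

# Logical values count as 0 and 1 in arithmetic
TRUE + TRUE            # 2

# A coercion that loses information yields NA (normally with a warning)
parsed <- suppressWarnings(as.integer(c("1", "2", "three")))
parsed                 # 1 2 NA
```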

## Lists
- Lists are different from atomic vectors because their elements can be of any type, including lists. You construct lists by using list() instead of c():

In [19]:
x <- list(1:3, "a", c(TRUE, FALSE, TRUE), c(2.3, 5.9))
str(x)
#> List of 4
#>  $ : int [1:3] 1 2 3
#>  $ : chr "a"
#>  $ : logi [1:3] TRUE FALSE TRUE
#>  $ : num [1:2] 2.3 5.9


- Lists are sometimes called recursive vectors, because a list can contain other lists. This makes them fundamentally different from atomic vectors.

In [20]:
x <- list(list(list(list())))
str(x)

List of 1
 $ :List of 1
  ..$ :List of 1
  .. ..$ : list()


In [21]:
is.recursive(x)

- c() will combine several lists into one. If given a combination of atomic vectors and lists, c() will coerce the vectors to lists before combining them. Compare the results of list() and c():

In [22]:
x <- list(list(1, 2), c(3, 4))
y <- c(list(1, 2), c(3, 4))
str(x)

List of 2
 $ :List of 2
  ..$ : num 1
  ..$ : num 2
 $ : num [1:2] 3 4


In [23]:
str(y)

List of 4
 $ : num 1
 $ : num 2
 $ : num 3
 $ : num 4


- The typeof() of a list is "list". You can test for a list with is.list() and coerce to a list with as.list(). You can turn a list into an atomic vector with unlist(). If the elements of a list have different types, unlist() uses the same coercion rules as c().

- Lists are used to build up many of the more complicated data structures in R. For example, both data frames (described in data frames) and linear model objects (as produced by lm()) are lists:

In [24]:
is.list(mtcars)
#> [1] TRUE

mod <- lm(mpg ~ wt, data = mtcars)
is.list(mod)
#> [1] TRUE
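
Returning to unlist() and as.list() mentioned above: the sketch below flattens a list with mixed element types, so the result follows the same coercion rules as c(). The object names are illustrative.

```r
mixed <- list(1L, 2.5, "a", TRUE)

# Flattening coerces everything to the most flexible type present
flat <- unlist(mixed)
typeof(flat)    # "character"
flat            # "1" "2.5" "a" "TRUE"

# With only logical and numeric elements the result is double
typeof(unlist(list(TRUE, 2L, 3.5)))   # "double"

# as.list() goes the other way: one list element per vector element
str(as.list(1:3))
```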

## Attributes
- All objects can have arbitrary additional attributes, used to store metadata about the object. Attributes can be thought of as a named list (with unique names). Attributes can be accessed individually with attr() or all at once (as a list) with attributes().

In [25]:
y <- 1:10
attr(y, "my_attribute") <- "This is a vector"
attr(y, "my_attribute")

In [26]:
str(attributes(y))

List of 1
 $ my_attribute: chr "This is a vector"


- structure() returns a new object with modified attributes:

In [33]:
structure(1:10, my_attribute = "This is a vector")

In [30]:
structure(1:6, dim = 2:3)

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


- By default, most attributes are lost when modifying a vector:

In [34]:
attributes(y[1])

NULL

In [35]:
attributes(sum(y))

NULL

- The only attributes not lost are the three most important:
- Names, a character vector giving each element a name, described in names.
- Dimensions, used to turn vectors into matrices and arrays, described in matrices and arrays.
- Class, used to implement the S3 object system, described in S3.
- Each of these attributes has a specific accessor function to get and set values. When working with these attributes, use names(x), dim(x), and class(x), not attr(x, "names"), attr(x, "dim"), and attr(x, "class").
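
A short sketch of both points: a custom attribute disappears on subsetting while names survive, and the special attributes are read and set through their accessor functions. `v` and `m` are example objects.

```r
v <- structure(c(a = 1, b = 2, c = 3), my_attribute = "extra")

# Subsetting keeps the names but drops the custom attribute
names(attributes(v[1:2]))      # "names"
attr(v[1:2], "my_attribute")   # NULL

# Use the accessor functions for the special attributes
names(v)                       # "a" "b" "c"
m <- 1:6
dim(m) <- c(2, 3)              # m is now a 2 x 3 matrix
dim(m)                         # 2 3
```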

## Names
- You can name a vector in three ways:
- When creating it: x <- c(a = 1, b = 2, c = 3).
- By modifying an existing vector in place: x <- 1:3; names(x) <- c("a", "b", "c").
- By creating a modified copy of a vector: x <- setNames(1:3, c("a", "b", "c")).

- Names don’t have to be unique. However, character subsetting, described in subsetting, is the most important reason to use names and it is most useful when the names are unique.

- Not all elements of a vector need to have a name. If some names are missing, names() will return an empty string for those elements. If all names are missing, names() will return NULL.

In [36]:
y <- c(a = 1, 2, 3)
names(y)

In [37]:
z <- c(1, 2, 3)
names(z)

NULL
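
Since character subsetting is given above as the main reason to use names, here is a minimal sketch of it, including what happens when names are duplicated. `scores` and `dup` are made-up vectors.

```r
scores <- c(alice = 90, bob = 72, carol = 85)

# Character subsetting looks elements up by name, in the order requested
scores[c("carol", "alice")]

# With duplicated names only the first match is returned
dup <- c(a = 1, a = 2, b = 3)
unname(dup["a"])         # 1

# In a partially named vector, unnamed elements get ""
names(c(a = 1, 2, 3))    # "a" "" ""
```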

## Factors
- One important use of attributes is to define factors. A factor is a vector that can contain only predefined values, and is used to store categorical data. Factors are built on top of integer vectors using two attributes: the class(), “factor”, which makes them behave differently from regular integer vectors, and the levels(), which defines the set of allowed values.

In [38]:
x <- factor(c("a", "b", "b", "a"))
x

In [39]:
class(x)

In [40]:
levels(x)
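
To see that a factor really is an integer vector with two attributes, the sketch below inspects one from the outside. It uses a separate example object `f` so the `x` from the cells above is left untouched.

```r
f <- factor(c("a", "b", "b", "a"))

typeof(f)        # "integer"
attributes(f)    # the levels ("a", "b") and the class ("factor")
unclass(f)       # the integer codes, still carrying the levels attribute
as.integer(f)    # 1 2 2 1
```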

In [41]:
# You can't use values that are not in the levels
x[2] <- "c"

Warning message: In `[<-.factor`(`*tmp*`, 2, value = "c"): invalid factor level, NA generated

In [42]:
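# Before R 4.1.0 this returned the underlying integer codes; later versions return a combined factor
c(factor("a"), factor("b"))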

In [43]:
x

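- Factors are useful when you know the possible values a variable may take, even if you don’t see all of them in a given dataset. Using a factor instead of a character vector makes it obvious when some groups contain no observations:

In [44]: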
sex_char <- c("m", "m", "m")
sex_factor <- factor(sex_char, levels = c("m", "f"))

In [45]:
table(sex_char)

sex_char
m 
3 

In [46]:
table(sex_factor)

sex_factor
m f 
3 0 

- Sometimes when a data frame is read directly from a file, a column you’d thought would produce a numeric vector instead produces a factor. 

- This is caused by a non-numeric value in the column, often a missing value encoded in a special way like . or -. To remedy the situation, coerce the vector from a factor to a character vector, and then from a character to a double vector. (Be sure to check for missing values after this process.) 

- Of course, a much better plan is to discover what caused the problem in the first place and fix that; using the na.strings argument to read.csv() is often a good place to start.

In [47]:
z <- read.csv(text = "value\n12\n1\n.\n9")

In [51]:
z

  value
1    12
2     1
3     .
4     9


In [50]:
typeof(z$value)

In [52]:
as.double(z$value)

In [53]:
class(z$value)

In [55]:
as.double(as.character(z$value))

Warning message: In eval(expr, envir, enclos): NAs introduced by coercion

In [56]:
z <- read.csv(text = "value\n12\n1\n.\n9", na.strings=".")
typeof(z$value)

In [57]:
class(z$value)

In [58]:
z$value

- Unfortunately, most data loading functions in R automatically convert character vectors to factors. (This was the default before R 4.0.0, which changed the default of stringsAsFactors to FALSE.)

- This is suboptimal, because there’s no way for those functions to know the set of all possible levels or their optimal order. 

- Instead, use the argument stringsAsFactors = FALSE to suppress this behaviour, and then manually convert character vectors to factors using your knowledge of the data. 

- A global option, options(stringsAsFactors = FALSE), is available to control this behaviour, but I don’t recommend using it. Changing a global option may have unexpected consequences when combined with other code (either from packages, or code that you’re source()ing), and global options make code harder to understand because they increase the number of lines you need to read to understand how a single line of code will behave.

- While factors look (and often behave) like character vectors, they are actually integers. Be careful when treating them like strings. Some string methods (like gsub() and grepl()) will coerce factors to strings, while others (like nchar()) will throw an error, and still others (like c() in R versions before 4.1.0) will use the underlying integer values. For this reason, it’s usually best to explicitly convert factors to character vectors if you need string-like behaviour. In early versions of R, there was a memory advantage to using factors instead of character vectors, but this is no longer the case.
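
The advice above, to read strings as characters and convert them to factors by hand, might look like the sketch below. The inline data, the `sizes_df` name and the level order are assumptions made up for illustration.

```r
# Read the column as plain character data
sizes_df <- read.csv(
  text = "size\nsmall\nlarge\nmedium\nsmall",
  stringsAsFactors = FALSE
)
typeof(sizes_df$size)               # "character"

# Convert by hand, giving the full set of levels in a meaningful order
size_levels <- c("small", "medium", "large")
sizes_df$size <- factor(sizes_df$size, levels = size_levels)
levels(sizes_df$size)               # "small" "medium" "large"
table(sizes_df$size)

# Convert back to character before doing string operations
nchar(as.character(sizes_df$size))  # 5 5 6 5
```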

## Matrices and arrays
- Adding a dim() attribute to an atomic vector allows it to behave like a multi-dimensional array. A special case of the array is the matrix, which has two dimensions. Matrices are used commonly as part of the mathematical machinery of statistics. Arrays are much rarer, but worth being aware of.

- Matrices and arrays are created with matrix() and array(), or by using the assignment form of dim():

In [61]:
# Two scalar arguments to specify rows and columns
a <- matrix(1:6, ncol = 3, nrow = 2)
# One vector argument to describe all dimensions
b <- array(1:12, c(2, 3, 2))

In [62]:
a
b

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


In [63]:
# You can also modify an object in place by setting dim()
c <- 1:6
dim(c) <- c(3, 2)

In [64]:
c

     [,1] [,2]
[1,]    1    4
[2,]    2    5
[3,]    3    6


In [65]:
dim(c) <- c(2, 3)

In [66]:
c

     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6


- length() and names() have high-dimensional generalisations:

- length() generalises to nrow() and ncol() for matrices, and dim() for arrays.

- names() generalises to rownames() and colnames() for matrices, and dimnames(), a list of character vectors, for arrays.

In [67]:
length(a)
#> [1] 6

In [68]:
nrow(a)
#> [1] 2

In [69]:
ncol(a)
#> [1] 3

In [70]:
rownames(a) <- c("A", "B")
colnames(a) <- c("a", "b", "c")

In [71]:
a

  a b c
A 1 3 5
B 2 4 6


In [72]:
length(b)
#> [1] 12
dim(b)
#> [1] 2 3 2

In [74]:
dimnames(b) <- list(c("one", "two"), c("a", "b", "c"), c("A", "B"))
b

- c() generalises to cbind() and rbind() for matrices, and to abind() (provided by the abind package) for arrays. You can transpose a matrix with t(); the generalised equivalent for arrays is aperm().

- You can test if an object is a matrix or array using is.matrix() and is.array(), or by looking at the length of the dim(). as.matrix() and as.array() make it easy to turn an existing vector into a matrix or array.

- Vectors are not the only 1-dimensional data structure. You can have matrices with a single row or single column, or arrays with a single dimension. They may print similarly, but will behave differently. The differences aren’t too important, but it’s useful to know they exist in case you get strange output from a function (tapply() is a frequent offender). As always, use str() to reveal the differences.
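
The combining, transposing and testing functions mentioned above can be tried on small objects. This sketch sticks to base R, so abind() is not shown, and `m` and `arr` are example objects.

```r
m <- matrix(1:6, nrow = 2)       # a 2 x 3 matrix

dim(rbind(m, 7:9))               # 3 3
dim(cbind(m, 7:8))               # 2 4
dim(t(m))                        # 3 2

arr <- array(1:24, c(2, 3, 4))
dim(aperm(arr))                  # 4 3 2 (the default reverses the dimensions)

# Testing and coercing
is.matrix(m)                     # TRUE
is.array(arr)                    # TRUE
is.matrix(1:3)                   # FALSE
dim(as.matrix(1:3))              # 3 1 (a one-column matrix)
length(dim(arr))                 # 3
```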

In [75]:
?tapply

tapply {base}, R Documentation

X: an atomic object, typically a vector.
INDEX: list of one or more factors, each of same length as X. The elements are coerced to factors by as.factor.
FUN: the function to be applied, or NULL. In the case of functions like +, %*%, etc., the function name must be backquoted or quoted. If FUN is NULL, tapply returns a vector which can be used to subscript the multi-way array tapply normally produces.
...: optional arguments to FUN: the Note section.
simplify: If FALSE, tapply always returns an array of mode "list". If TRUE (the default), then if FUN always returns a scalar, tapply returns an array with the mode of the scalar.

In [76]:
str(1:3)                   # 1d vector
#>  int [1:3] 1 2 3
str(matrix(1:3, ncol = 1)) # column vector
#>  int [1:3, 1] 1 2 3
str(matrix(1:3, nrow = 1)) # row vector
#>  int [1, 1:3] 1 2 3
str(array(1:3, 3))         # "array" vector
#>  int [1:3(1d)] 1 2 3


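- While atomic vectors are most commonly turned into matrices, the dimension attribute can also be set on lists to make list-matrices or list-arrays:

In [77]: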
l <- list(1:3, "a", TRUE, 1.0)
dim(l) <- c(2, 2)
l

     [,1]      [,2]
[1,] 1, 2, 3   TRUE
[2,] a         1


In [78]:
str(l)

List of 4
 $ : int [1:3] 1 2 3
 $ : chr "a"
 $ : logi TRUE
 $ : num 1
 - attr(*, "dim")= int [1:2] 2 2


## Data frames
- A data frame is the most common way of storing data in R, and if used systematically makes data analysis easier. Under the hood, a data frame is a list of equal-length vectors. This makes it a 2-dimensional structure, so it shares properties of both the matrix and the list. This means that a data frame has names(), colnames(), and rownames(), although names() and colnames() are the same thing. The length() of a data frame is the length of the underlying list and so is the same as ncol(); nrow() gives the number of rows.

- As described in subsetting, you can subset a data frame like a 1d structure (where it behaves like a list), or a 2d structure (where it behaves like a matrix).
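
The dual list/matrix nature of a data frame can be checked with a few calls. A minimal sketch on a made-up data frame named `people`:

```r
people <- data.frame(id = 1:3, name = c("a", "b", "c"),
                     stringsAsFactors = FALSE)

# List-like properties
length(people) == ncol(people)               # TRUE
identical(names(people), colnames(people))   # TRUE

# Matrix-like properties
nrow(people)                                 # 3

# List-style subsetting returns a data frame with the chosen columns
str(people["name"])

# Matrix-style subsetting of a single column returns a vector
people[, "name"]                             # "a" "b" "c"
```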



### Creation
- You create a data frame using data.frame(), which takes named vectors as input:

In [79]:
df <- data.frame(x = 1:3, y = c("a", "b", "c"))
str(df)

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: Factor w/ 3 levels "a","b","c": 1 2 3


In [80]:
df <- data.frame(
  x = 1:3,
  y = c("a", "b", "c"),
  stringsAsFactors = FALSE)
str(df)

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y: chr  "a" "b" "c"


### Testing and coercion
- Because a data.frame is an S3 class, its type reflects the underlying vector used to build it: the list. To check if an object is a data frame, use class() or test explicitly with is.data.frame():

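- R's S3 classes take their name from version 3 of the S language, which inspired R. S3 is built on generic functions; S4 classes were developed later to add safety features.
- Generic functions (such as plot, print, and summary) are polymorphic: the same function performs different operations for different classes.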

In [81]:
typeof(df)
#> [1] "list"
class(df)
#> [1] "data.frame"
is.data.frame(df)
#> [1] TRUE

- You can coerce an object to a data frame with as.data.frame():
- A vector will create a one-column data frame.
- A list will create one column for each element; it’s an error if they’re not all the same length.
- A matrix will create a data frame with the same number of columns and rows as the matrix.
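
The three coercion cases can be compared by looking at the dimensions of each result. This is a sketch with made-up inputs; the failing case is wrapped in tryCatch() so the script keeps running.

```r
# A vector gives a one-column data frame
from_vector <- as.data.frame(1:3)
dim(from_vector)     # 3 1

# A list gives one column per element
from_list <- as.data.frame(list(a = 1:2, b = c("x", "y")))
dim(from_list)       # 2 2

# A matrix keeps its number of rows and columns
from_matrix <- as.data.frame(matrix(1:6, nrow = 2))
dim(from_matrix)     # 2 3

# List elements of lengths 2 and 3 cannot be combined
res <- tryCatch(
  as.data.frame(list(a = 1:2, b = 1:3)),
  error = function(e) conditionMessage(e)
)
res                  # the error message, as a character string
```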

In [83]:
# Combining data frames with cbind() and rbind()
cbind(df, data.frame(z = 3:1))
#>   x y z
#> 1 1 a 3
#> 2 2 b 2
#> 3 3 c 1


In [84]:
rbind(df, data.frame(x = 10, y = "z"))

   x y
1  1 a
2  2 b
3  3 c
4 10 z


- When combining column-wise, the number of rows must match, but row names are ignored. When combining row-wise, both the number and names of columns must match. Use plyr::rbind.fill() to combine data frames that don’t have the same columns.

- It’s a common mistake to try and create a data frame by cbind()ing vectors together. This doesn’t work because cbind() will create a matrix unless one of the arguments is already a data frame. Instead use data.frame() directly:

In [85]:
bad <- data.frame(cbind(a = 1:2, b = c("a", "b")))
str(bad)

'data.frame':	2 obs. of  2 variables:
 $ a: Factor w/ 2 levels "1","2": 1 2
 $ b: Factor w/ 2 levels "a","b": 1 2


In [86]:
good <- data.frame(a = 1:2, b = c("a", "b"),
  stringsAsFactors = FALSE)
str(good)

'data.frame':	2 obs. of  2 variables:
 $ a: int  1 2
 $ b: chr  "a" "b"
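
Returning to the rule that column names must match when combining row-wise: the sketch below shows that rbind() matches columns by name, and that a mismatched name raises an error. It uses base R only, so plyr::rbind.fill() is not shown, and `first` and `second` are example data frames.

```r
first  <- data.frame(x = 1, y = "a", stringsAsFactors = FALSE)
second <- data.frame(y = "b", x = 2, stringsAsFactors = FALSE)

# Columns are matched by name, so their order may differ
combined <- rbind(first, second)
combined$x    # 1 2
combined$y    # "a" "b"

# A column name that is absent from the first data frame is an error
msg <- tryCatch(
  rbind(first, data.frame(x = 3, z = "c")),
  error = function(e) conditionMessage(e)
)
msg           # the error message, as a character string
```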


In [87]:
# Special columns: since a data frame is a list of vectors, a column can itself be a list
df <- data.frame(x = 1:3)
df$y <- list(1:2, 1:3, 1:4)
df

  x          y
1 1       1, 2
2 2    1, 2, 3
3 3 1, 2, 3, 4


In [88]:
data.frame(x = 1:3, y = list(1:2, 1:3, 1:4))

ERROR: Error in data.frame(1:2, 1:3, 1:4, check.names = FALSE, stringsAsFactors = TRUE): arguments imply differing number of rows: 2, 3, 4


In [89]:
str(df)

'data.frame':	3 obs. of  2 variables:
 $ x: int  1 2 3
 $ y:List of 3
  ..$ : int  1 2
  ..$ : int  1 2 3
  ..$ : int  1 2 3 4
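
The data.frame() call that failed above did so because each element of the list was treated as its own column. One common workaround is to wrap the list in I(), so that data.frame() treats it as a single unit. A minimal sketch, using a separate example object `dfl`:

```r
# I() marks the list as "as is", so it becomes one list column
dfl <- data.frame(x = 1:3, y = I(list(1:2, 1:3, 1:4)))
str(dfl)

nrow(dfl)      # 3
dfl$y[[3]]     # 1 2 3 4
```

I() adds the AsIs class to its input, which can usually be ignored.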
