# Notes on R Programming for Bioinformatics

Three functions that are useful for beginners:

help()

example()

get()

Syntactically invalid variable names must be quoted (with backticks or quotation marks).

In [12]:
`_foo` = 10
"10:10" = 20
ls()

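Every R object except NULL can have attributes. Attributes can be stored and modified: `attributes` returns all attributes of an object, while `attr` gets or sets a single attribute.

R uses attributes for many purposes; in particular, the S3 class system is built largely on attributes.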

In [15]:
x = 1:10
attr(x, "foo") = 11
x

In [21]:
attr(x, "foo")

In [22]:
attributes(x)

In [24]:
attributes(x)$foo

### OOP in R

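Object-oriented programming has two components:
- classes that define the different kinds of objects
- generic functions with methods

Two OOP systems in R are covered here: S3 and S4
- S3 mainly supports generic functions
- S4 supports both classes and generic functions

The S3 system is very loose: a class attribute can easily be attached to any R object, and nothing checks it.

A generic function is essentially a dispatch mechanism. In S3, dispatch works by concatenating the name of the generic function with the name of the class.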

In [25]:
# an example of a generic function is mean
mean

We can see that mean simply calls UseMethod to find the appropriate method.

In [28]:
methods("mean")

[1] mean.Date     mean.default  mean.difftime mean.POSIXct  mean.POSIXlt 
see '?methods' for accessing help and source code

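All of these methods start with mean. When `mean` is called, R checks whether the first argument has a class attribute. If it does, R looks for a function named `mean.` followed by the class name and calls it if it exists. If no such method exists, `mean.default` is called.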
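The dispatch rule described above can be sketched with a small generic. The generic `describe` and the class `gene` are hypothetical names chosen only for illustration; they are not part of base R.

```r
# Hypothetical generic: UseMethod() dispatches on the class of the first argument
describe <- function(x, ...) UseMethod("describe")

# Fallback used when no class-specific method exists
describe.default <- function(x, ...) "generic object"

# Method for the hypothetical class "gene"
describe.gene <- function(x, ...) paste("gene", x$symbol)

g <- structure(list(symbol = "TP53"), class = "gene")
res_gene  <- describe(g)     # dispatches to describe.gene
res_other <- describe(1:3)   # no matching method, so describe.default is used
```

Because S3 performs no checking, nothing prevents attaching the class `gene` to an object that has no `symbol` element; the method would then simply misbehave.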

### Some special values


In [2]:
length(NULL)

c(1, NULL)

list("a", NULL)

In [3]:
# NA is the missing value
# is.na() detects it

typeof(NA)

In [4]:
as.character(NA)

In [5]:
as.integer(NA)

[1] NA

In [6]:
typeof(as.integer(NA))

R has representations for both infinity (`Inf`) and not-a-number (`NaN`).

In [8]:
y = 1/0
y

In [9]:
-y

In [10]:
y - y
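A minimal sketch of how these special values can be told apart. The vector `v` is a made-up example, not data from the book.

```r
# One ordinary number followed by the special values
v <- c(1, NA, NaN, Inf, -Inf)

na_flags     <- is.na(v)      # TRUE for both NA and NaN
nan_flags    <- is.nan(v)     # TRUE only for NaN
finite_flags <- is.finite(v)  # TRUE only for ordinary numbers
```

Note that `is.na` also reports `NaN`, so `is.nan` is needed to distinguish the two.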

### Types of objects

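The vector is one of the most important data structures in R: an ordered collection of elements that all have the same type.

R has six basic vector types: logical, integer, double (real), complex, character, and raw.

Three functions report the type of an object: mode, storage.mode, and typeof.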

In [11]:
typeof(y)

In [12]:
typeof(is.na)

In [13]:
typeof(mean)

In [14]:
mode(NA)

In [15]:
storage.mode(letters)

In [16]:
is.integer(y)

In [17]:
is.character(y)

In [18]:
is.double(y)

In [19]:
is.numeric(y)

### Generating sequences and subsetting

In [20]:
1:3

In [21]:
1.3:3.2

In [22]:
6:3

In [23]:
x = 11 : 20

In [24]:
x[4:5]

In [25]:
storage.mode(1:3)
storage.mode(1.3:4.2)

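### Types of functions

There are three kinds: builtins, specials, and closures.

Users can only create closures. The other two kinds are interfaces that pass the computation down to internal code (typically written in C); they differ in whether their arguments are evaluated before the call.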
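The three kinds can be checked with `typeof`. This is a small sketch; the anonymous function stands in for any user-defined function.

```r
# typeof() reports which of the three kinds a function belongs to
kinds <- c(
  sum   = typeof(sum),               # builtin: arguments are evaluated first
  quote = typeof(quote),             # special: argument is passed unevaluated
  user  = typeof(function(x) x + 1)  # closure: anything written in R code
)
kinds
```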



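### Data structures

Atomic vectors are the basis of most data structures. An atomic vector holds zero or more elements of a single type: integer, double, logical, or character. Complex and raw (plain bytes) atomic types also exist.

In the S language, a character vector is a vector of character strings, not a vector of single characters. For example, "super" is a character vector of length 1, not of length 5. Adding a `dim` attribute to an atomic vector creates a matrix or a multidimensional array.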



In [26]:
x = c(1, 2, 3, 4)

In [27]:
x

In [28]:
dim(x) = c(2, 2)

In [29]:
x

     [,1] [,2]
[1,]    1    3
[2,]    2    4


In [30]:
typeof(x)

In [31]:
y = letters[1:10]

In [32]:
y

In [33]:
dim(y) = c(2, 5)

In [34]:
y

     [,1] [,2] [,3] [,4] [,5]
[1,] "a"  "c"  "e"  "g"  "i"
[2,] "b"  "d"  "f"  "h"  "j"


In [35]:
typeof(y)

In [36]:
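# a logical value is one of TRUE, FALSE, or NA

Vector elements can have names, and matrices and arrays can have names for each dimension. If a `dim` attribute is added to a named vector, the names are discarded while other attributes are kept.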
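A short sketch of this behaviour. The attribute name `foo` and its value are arbitrary choices for the example.

```r
# A named vector with one extra, user-defined attribute
v <- c(a = 1, b = 2, c = 3, d = 4)
attr(v, "foo") <- "kept"

# Attaching dim turns the vector into a 2 x 2 matrix
dim(v) <- c(2, 2)

names_after <- names(v)        # the names are gone
foo_after   <- attr(v, "foo")  # the other attribute survives
```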

Functions that create vectors include:
```
c()
numeric()
double()
character()
integer()
logical()
```

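`seq()` creates sequences that follow a regular pattern.

`seq_len` generates the sequence from 1 up to its argument.

`seq_along` returns an integer sequence with the same length as its argument.

Some functions generate random vectors,

for example `rnorm` and `runif`.

`sample` draws random samples.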

In [39]:
c(1, 3:5)

In [40]:
c(1, "c")

In [41]:
numeric(2)

In [42]:
character(2)

In [43]:
seq(1,10, by = 2)

In [44]:
seq_len(2.2)

In [46]:
seq_along(numeric(3))

In [47]:
sample(1:100, 5)

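The S language treats an array as a vector of elements plus a `dim` attribute. An array can be created with `matrix`, with `array`, or by attaching the attribute directly with `dim`.

Names can be added to an array with the `dimnames` function or through the `dimnames` argument of `matrix` and `array`. The names are stored as a list.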
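A minimal sketch of dimension names; the row and column labels are invented for the example.

```r
# dimnames is a list with one character vector per dimension
m <- matrix(1:6, nrow = 2,
            dimnames = list(c("r1", "r2"), c("a", "b", "c")))

dn   <- dimnames(m)   # list(c("r1", "r2"), c("a", "b", "c"))
cell <- m["r2", "b"]  # names can then be used as subscripts
```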

Zero-length vectors



In [3]:
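# Example (I have not fully understood this concept yet):
# on empty input, sum() and prod() return their identity elements, 0 and 1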
sum(numeric())
prod(numeric())

Numerical computation

In [4]:
a = sqrt(2)
a * a == 2
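The comparison above is FALSE because `sqrt(2)` cannot be represented exactly in floating point. A common way to compare with a tolerance is sketched here; the choice of `sqrt(.Machine$double.eps)` as the tolerance is an assumption of this example, not a rule from the book.

```r
a <- sqrt(2)

exact_equal <- a * a == 2                   # exact comparison fails
near_equal  <- isTRUE(all.equal(a * a, 2))  # comparison with a small tolerance

# The same idea written by hand, with an explicit (assumed) tolerance
manual_near <- abs(a * a - 2) < sqrt(.Machine$double.eps)
```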

In [5]:
# The .Machine variable stores the numerical characteristics of the machine R is running on
.Machine

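Factors

Factors reflect the statistical roots of the S language. They are very useful when a data set is large but contains relatively few distinct levels, and they can be treated as categorical variables.

Factors are represented by the `factor` class: an integer vector of codes together with an attribute named `levels`.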

In [7]:
set.seed(123)
x = sample(letters[1:5], 10, replace = TRUE)
y = factor(x)
y

In [9]:
attributes(y)
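The integer codes behind a factor can be inspected directly. This is a small sketch with made-up values.

```r
f <- factor(c("b", "a", "b", "c"))

codes <- as.integer(f)  # position of each value within the levels
lev   <- levels(f)      # levels are sorted by default
back  <- lev[codes]     # indexing the levels by the codes recovers the strings
```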

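Functions such as `read.table()` automatically converted imported character data to factors, which needs attention during processing (this was the default before R 4.0.0; since then `stringsAsFactors` defaults to FALSE). The behaviour can be changed by setting `stringsAsFactors` explicitly. Many R functions follow a similar convention.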

In [11]:
y = sample(letters[1:5], 20, replace = TRUE)
v = as.factor(y)
xx = list(I = c("a", "e"), II = c("b", "c","d"))
levels(v) = xx
v

Factors are instances of an S3 class. The *ordered* class represents ordered factors.

In [13]:
z = ordered(y)
class(z)

In [16]:
class(as.factor(y))  # an ordered factor has one more class than an ordinary factor

### Lists, environments, and data frames

R actually has two kinds of lists: `pairlists` and `lists`. The former mainly serve R's internal code, so we usually mean the latter.


### Lists

In [17]:
# A list stores elements of different types
y = list(a = 1, 17, b = 4:5, c="a")
y

$a
[1] 1

[[2]]
[1] 17

$b
[1] 4 5

$c
[1] "a"


### Data frames

A data frame (`data.frame`) is a special kind of list: essentially a list of equal-length columns laid out like a matrix.
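A minimal sketch showing the list nature of a data frame. The column names and values are invented for the example.

```r
df <- data.frame(gene = c("A", "B"), count = c(10L, 20L),
                 stringsAsFactors = FALSE)

is_list <- is.list(df)    # a data frame is a list ...
n_cols  <- length(df)     # ... whose elements are the columns
counts  <- df[["count"]]  # so list-style extraction works
dims    <- dim(df)        # while dim() gives the matrix-like view
```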

### Environments

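This concept is harder to grasp: an environment resembles a list but is not one. Here is the original explanation from the book: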

An environment is a set of symbol-value pairs, where the value can be any R
object, and hence they are much like lists. Originally environments were used
for R’s internal evaluation model. They have slowly been exposed as an R
version of a hash table, or an associative array. The internal implementation
is in fact that of a hash table. The symbol is used to compute the hash index,
and the hash index is used to retrieve the value. In the code below, we create
an environment, create the symbol value pair that relates the symbol a to the
value 10 and then list the contents of the hash table.

In [18]:
e1 = new.env(hash = TRUE)
e1$a = 10
ls(e1)

In [19]:
e1[["a"]]

In [20]:
e1$a

Environments differ from lists in two important ways. First, for environments, the values
can only be accessed by name; there is no notion of linear order in the hash
table. Second, environments, and their contents, are not copied when passed
as arguments to a function.
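The second difference can be sketched as follows. The helper `add_one` and the element name `count` are hypothetical, used only for illustration.

```r
# Hypothetical helper that increments the `count` element of its argument
add_one <- function(container) {
  container$count <- container$count + 1
  invisible(NULL)
}

env <- new.env()
env$count <- 0
lst <- list(count = 0)

add_one(env)  # the environment is not copied, so the caller sees the change
add_one(lst)  # the list is copied, so only the local copy changes

env_count <- env$count
lst_count <- lst$count
```

After both calls the environment holds 1 while the list still holds 0.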

In [21]:
e1 = new.env()
attr(e1, "foo")  =  10
e1

<environment: 0x278d080>
attr(,"foo")
[1] 10

In [23]:
e2 = e1
attr(e2, "foo") = 20
e1    # modifying e2 also changed e1, which is uncommon in R

<environment: 0x278d080>
attr(,"foo")
[1] 20

In [24]:
e1 = new.env()
e1$z = 10
f = function(x){
    x + z
}

In [26]:
environment(f) = e1

In [27]:
f(10)

In [28]:
e1$z = 20
f(10)

## Managing your R session

The variable R.version$platform is the canonical name of the platform that R was
compiled on. The function Sys.info provides similar information. The variable .Platform has information such as the file separator. The function
capabilities indicates whether specific optional features have been compiled
in, such as whether jpeg graphics can be produced, or whether memory profiling (see Chapter 9) has been enabled.

In [29]:
## Let's try it
R.version$platform

In [30]:
Sys.info()

In [31]:
.Platform

In [32]:
capabilities()

You can find out what packages are on the search path using the search
function and much more detailed information can be found using sessionInfo

In [33]:
search()

In [34]:
ls(2) # list the objects in the second entry on the search path

In [36]:
objects(jupyter:irkernel) # similar; the unquoted name is converted to a string with a warning

“‘jupyter:irkernel’ converted to character string”

### Finding out more about an object

Obvious functions to try are class and typeof. But many find that both str
and object.size are more useful.

In [37]:
class(cars)

In [38]:
typeof(cars)

In [39]:
str(cars)

'data.frame':	50 obs. of  2 variables:
 $ speed: num  4 4 7 7 8 9 10 10 10 11 ...
 $ dist : num  2 10 4 22 16 10 18 26 34 17 ...


In [40]:
object.size(cars)

1576 bytes

In [41]:
# head and tail show part of the data
head(cars,5)

  speed dist
1     4    2
2     4   10
3     7    4
4     7   22
5     8   16


In [42]:
tail(cars,5)

   speed dist
46    24   70
47    24   92
48    24   93
49    24  120
50    25   85


## Language basics

In [44]:
# Any R function can be inspected and modified
colSums

In [45]:
# Some functions, such as operators, cannot be inspected by typing their name; use get() for those
get("+")

These two functions, .Primitive and .Internal, provide fairly direct links
between R level code and the internal C code that R is written in.

### Operators

In [51]:
x = 1:4
x + 5

In [52]:
myP = get("+")
myP

In [53]:
myP(x,5)

In [54]:
"%p%" = function(x,y) paste(x,y,sep="")
"hi" %p% "there"

### Subscripting and subsetting

In [1]:
myl = list(a1 = 10, b = 20, c = 30)

In [2]:
myl[c(2,3)]

In [3]:
myl$a

In [4]:
myl["a"]

$<NA>
NULL


In [5]:
f = "b"

In [6]:
myl[[f]]

In [7]:
myl$f

NULL
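The cells above rely on the different matching rules of `$`, `[`, and `[[`. A small sketch that restates them side by side, reusing the same list (the variable name `key` is an arbitrary choice for the example):

```r
myl <- list(a1 = 10, b = 20, c = 30)

by_dollar  <- myl$a                      # `$` matches names partially: finds a1
by_exact   <- myl[["a"]]                 # `[[` matches exactly by default: NULL
by_partial <- myl[["a", exact = FALSE]]  # partial matching must be requested

key <- "b"
by_variable <- myl[[key]]  # `[[` evaluates its argument, so the value of key is used
by_literal  <- myl$key     # `$` takes the name literally and looks for "key"
```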

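Subscripts come in four types: positive integers, negative integers, logical values, and character vectors. The four types cannot be mixed, and not every vector or object supports all four.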

In [8]:
# subsetting with positive indices
x = 11 : 20

In [9]:
x[c(1, 3, 5)]

#### Subsetting with positive indices

In [10]:
# some rules
x = 1:10

In [11]:
x[1:3]

In [12]:
x[9:11]

In [13]:
x[0:1]

In [14]:
x[c(1,2,NA)]

In [15]:
x[c(1,2,NULL)]

#### Subsetting with negative indices

In [16]:
# Negative indices drop the elements you do not want
# Positive and negative indices cannot be mixed
# NA is not allowed
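These three rules can be checked with a short sketch. The vector is an arbitrary example, and `try()` is used here only so that the errors do not stop the script.

```r
x <- 11:20

kept <- x[-(1:3)]  # drops the first three elements

# Mixing positive and negative indices is an error
mixed <- try(x[c(-1, 2)], silent = TRUE)

# NA together with negative indices is also an error
with_na <- try(x[c(-1, NA)], silent = TRUE)
```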

#### Subsetting with character indices

In [17]:
# Character indices can be used to extract elements of named vectors
x = 1:5
names(x) = letters[1:5]
x[c("a","d")]
# If the vector has duplicated names that match a subscript, only the value with the lowest index is returned.

In [18]:
names(x)[3] = "a"

In [19]:
x["a"]

In [20]:
x[c("a","a")]

In [21]:
names(x) %in% "a"

#### Subsetting with logical indices

In [22]:
(letters[1:10])[c(TRUE,FALSE,NA)]

In [23]:
(1:5)[rep(NA, 6)]
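In the first cell above the logical index is recycled to the length of the vector, and every NA in the index produces an NA in the result. The same effect appears when filtering on a condition, as in this sketch with made-up data:

```r
x <- c(5, 12, NA, 7)

# The comparison yields FALSE, TRUE, NA, TRUE, so an NA shows up in the result
with_na <- x[x > 6]

# which() keeps only the positions that are TRUE, dropping the NA
without_na <- x[which(x > 6)]
```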

#### Matrix and array subscripts

In [24]:
x = matrix(1:9, nc = 3)

In [25]:
x[,1]

In [26]:
x[1,]

In [27]:
## A dimension-dropping pitfall that I have run into many times; it can be controlled with drop
x[,1, drop = FALSE]

     [,1]
[1,]    1
[2,]    2
[3,]    3


In [28]:
x[1, , drop = FALSE]

     [,1] [,2] [,3]
[1,]    1    4    7


In [30]:
# Arrays and matrices are stored in column-major order; byrow = TRUE makes matrix() fill its values row by row
x = array(1:27, dim=c(3,3,3))
y = matrix(c(1,2,3,2,2,2,3,2,1), byrow=TRUE, ncol=3)
x[y]
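In the cell above, each row of `y` is read as one set of coordinates into `x`. The same idea in two dimensions, as a sketch with a made-up index matrix `idx`:

```r
m <- matrix(1:9, ncol = 3)

# Each row of idx is a (row, column) pair
idx <- cbind(c(1, 3), c(2, 3))

picked <- m[idx]  # the elements m[1, 2] and m[3, 3]
```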

Character subscripting of matrices is carried out on the row and column
names, if present. It is an error to use character subscripts if the row and
column names are not present. Attaching a dim attribute to a vector removes
the names attribute if there was one. If a dimnames attribute is present, but
one or more of the supplied character subscripts is not present, a subscript
out of bounds error is signaled, which is quite different from the way vectors
are treated. Arrays are treated similarly, but with respect to the names on
each of the dimensions.

For data.frames the effects are different. Any character subscript for a row
that is not a row name returns a vector of NAs. Any subscript of a column
with a name that is not a column name raises an error.
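The differences described above can be sketched as follows. All row and column labels are invented for the example, and `try()` is used only to capture the errors.

```r
m <- matrix(1:4, nrow = 2, dimnames = list(c("a", "b"), c("x", "y")))
found   <- m["a", "y"]                      # valid names work as subscripts
bad_mat <- try(m["zz", "x"], silent = TRUE) # unknown name: subscript out of bounds

df <- data.frame(x = 1:2, y = 3:4, row.names = c("a", "b"))
bad_row <- df["zz", ]                       # unknown row name: a row of NAs
bad_col <- try(df[, "zz"], silent = TRUE)   # unknown column name: an error
```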

#### Subset assignments

In [31]:
x[1:3] = 10
x

In [34]:
x = 1:10
x[-(2:4)] = 10  # every element except positions 2, 3, and 4 is set to 10
x

In [35]:
x = matrix(1:10, nc=2)
x[] = sort(x)

In [36]:
x

     [,1] [,2]
[1,]    1    6
[2,]    2    7
[3,]    3    8
[4,]    4    9
[5,]    5   10
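The empty subscript `x[]` matters because it replaces the values while keeping the attributes. A sketch contrasting it with plain assignment, using made-up values:

```r
m <- matrix(c(3, 1, 4, 2), ncol = 2)

# Assigning into m[] replaces the contents but keeps the dim attribute
m[] <- sort(m)
dims_kept <- dim(m)

# Plain assignment replaces the whole object with the sorted vector
m <- sort(m)
dims_lost <- dim(m)
```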


#### Subsetting factors


There is a special method for the single bracket subscript operator on
factors. For this method the drop argument indicates whether or not any
unused levels should be dropped from the return value. The [[ operator can
be applied to factors and returns a factor of length one containing the
selected element.
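A short sketch of both behaviours, with a made-up factor:

```r
f <- factor(c("a", "b", "c"))

lev_default <- levels(f[1:2])               # unused level "c" is kept
lev_dropped <- levels(f[1:2, drop = TRUE])  # unused level is dropped

single <- f[[2]]  # a factor of length one
```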

