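## Data frames (DataFrame)
A data frame is a two-dimensional, heterogeneous data structure. Like an Excel sheet, it stores and processes structured tables, and it is the most widely used data structure in data analysis. `pandas`, the workhorse of data analysis in Python, has the same kind of structure, and some of its features were borrowed from R.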

**A data frame can be seen as a special kind of list** that organizes data by column.
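
A minimal sketch of what "a data frame is a special list" means in practice. The `people` data frame and its columns are made up for illustration.

```r
# Hypothetical example data: one column per variable
people <- data.frame(name = c("Ann", "Bob", "Cy"),
                     age = c(31, 25, 47))

is.list(people)          # TRUE: a data frame is stored as a list of columns
is.data.frame(people)    # TRUE: with the extra class "data.frame" on top
length(people)           # 2: the list has one element per column
people[["age"]]          # list-style access returns the column vector
sapply(people, class)    # list tools such as sapply() iterate over the columns
```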

<img src="./images/data-structure-compare.png" />
from: http://venus.ifca.unican.es/Rintro/dataStruct.html

<img src='./images/data-frame.svg'/>

A data frame allows different columns to hold different types of data. Source: https://datacarpentry.org/R-ecology-lesson/02-starting-with-data.html

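### 1. Creating a data frame
When creating a list, elements (name plus value) are the unit used to group several vectors (or other data structures). A data frame groups several vectors (or other data structures) in a different way:
- a data frame organizes data by **column**;
- each column can be a vector or another data structure;
- different columns can hold different data types (for example numeric, character, or logical);
- column names must not be empty, and row names must be unique (duplicate column names are also best avoided; `data.frame()` makes them unique by default);
- all columns must have the same length (a shorter vector is recycled only if its length divides the number of rows evenly).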
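
The rules above can be checked with a small sketch. The data frame `ok` and its column names are hypothetical; the second call is expected to fail, so it is wrapped in `tryCatch()` to capture the error message instead of stopping.

```r
# Columns of different types are fine as long as their lengths match
ok <- data.frame(id = 1:3,
                 label = c("a", "b", "c"),
                 flag = c(TRUE, FALSE, TRUE))
sapply(ok, class)   # one class per column

# Columns with incompatible lengths (3 vs. 2) trigger an error
result <- tryCatch(
  data.frame(id = 1:3, label = c("a", "b")),
  error = function(e) conditionMessage(e)
)
result   # the captured error message about differing numbers of rows
```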

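**Warning**: in R versions before 4.0.0, functions such as `data.frame()` and `read.csv()` convert character vectors to factors by default, which can sometimes cause problems. The statement below changes that default; from R 4.0.0 onward the default is already `FALSE`.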
`options(stringsAsFactors = FALSE)`
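
As an alternative to the global option, the same behaviour can be controlled per call through the `stringsAsFactors` argument of `data.frame()`. A minimal sketch with made-up data:

```r
# Per-call control instead of a global option
chars <- data.frame(day = c("Mon", "Tue"), stringsAsFactors = FALSE)
facts <- data.frame(day = c("Mon", "Tue"), stringsAsFactors = TRUE)

class(chars$day)   # "character"
class(facts$day)   # "factor"
```

Setting the argument explicitly keeps a script's behaviour the same across R versions.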

#### 1.1 Creating a data frame with `data.frame()`

In [1]:
days.name.number <- 1:7
days.name.full <- c('Monday', 'Tuesday', 'Wednesday', 'Thursday', 'Friday', 'Saturday', 'Sunday')
days.name.abbr <- c('Mon.', 'Tue.', 'Wed.', 'Thur.', 'Fri.', 'Sat.', 'Sun.')

In [2]:
options(stringsAsFactors = FALSE)  # change the default (only needed before R 4.0.0)
days.name.df <- data.frame(number = days.name.number, full = days.name.full, abbr = days.name.abbr)
days.name.df

number,full,abbr
1,Monday,Mon.
2,Tuesday,Tue.
3,Wednesday,Wed.
4,Thursday,Thur.
5,Friday,Fri.
6,Saturday,Sat.
7,Sunday,Sun.


#### 1.2 Reading data from a file into a data frame

In [3]:
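data('iris')  # load the iris data set from the built-in data sets

In [4]:
write.table(iris, './data/iris.csv', col.names = TRUE, sep = ',')  # save the data set (the ./data directory must already exist)

In [5]:
options(stringsAsFactors = FALSE)  # change the default (only needed before R 4.0.0)
iris.new <- read.csv('./data/iris.csv', header = TRUE)  # tabular files are read into a data frame by default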

In [6]:
class(iris.new)

In [7]:
head(iris.new)

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa
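
The round trip above can also be sketched without relying on a `./data` directory. This version is an illustration rather than the notebook's own code: it writes to a temporary file and uses `write.csv()` with `row.names = FALSE`, so the file contains only the five data columns. The names `csv.path` and `iris.copy` are made up.

```r
# Write to a temporary file so the sketch does not depend on ./data
csv.path <- tempfile(fileext = ".csv")
write.csv(iris, csv.path, row.names = FALSE)   # omit the row-name column

# Read the file back into a data frame
iris.copy <- read.csv(csv.path, header = TRUE, stringsAsFactors = FALSE)
class(iris.copy)   # "data.frame"
dim(iris.copy)     # 150 rows, 5 columns

unlink(csv.path)   # remove the temporary file
```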


#### 1.3 Creating a data frame from a matrix

In [8]:
a <- 1:9
dim(a) <- c(3,3)
a

1,4,7
2,5,8
3,6,9


In [9]:
a.df <- as.data.frame(a)  # convert the matrix to a data frame; column names `V1`-`V3` are added automatically
a.df

V1,V2,V3
1,4,7
2,5,8
3,6,9
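
If the default names `V1`-`V3` are not wanted, they can be replaced after the conversion, or the matrix columns can be named beforehand. A short sketch; the names `x`, `y`, `z` are arbitrary.

```r
m <- matrix(1:9, nrow = 3)        # filled column by column, like dim(a) <- c(3, 3)
m.df <- as.data.frame(m)
names(m.df) <- c("x", "y", "z")   # replace the default V1..V3
m.df

# Alternatively, name the matrix columns first; as.data.frame() keeps them
colnames(m) <- c("x", "y", "z")
names(as.data.frame(m))
```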


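### 2. Properties of a data frame
Because a data frame is a special kind of list, it has every property a list has, plus some properties of its own.

In [10]:
mode(days.name.df)  # "list": a data frame is a special kind of list

In [11]:
class(days.name.df)  # gives the more specific type, "data.frame"

In [12]:
dim(days.name.df)  # number of rows and number of columns

In [13]:
nrow(iris.new)  # number of rows

In [14]:
ncol(iris.new)  # number of columns

In [15]:
length(days.name.df)  # number of columns

In [16]:
names(days.name.df)  # column names

In [17]:
rownames(days.name.df)  # row names

In [18]:
str(days.name.df)  # internal structure of the data frame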

'data.frame':	7 obs. of  3 variables:
 $ number: int  1 2 3 4 5 6 7
 $ full  : chr  "Monday" "Tuesday" "Wednesday" "Thursday" ...
 $ abbr  : chr  "Mon." "Tue." "Wed." "Thur." ...
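
Several of these properties are stored as attributes of the object. The sketch below, using a small made-up data frame `df`, lists them and shows two further ways to inspect the columns.

```r
df <- data.frame(number = 1:3, full = c("Monday", "Tuesday", "Wednesday"))

names(attributes(df))   # includes "names", "class" and "row.names"
sapply(df, class)       # class of each column
summary(df)             # per-column summary
```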


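### 3. Extracting values from a data frame
Because a data frame is a special kind of list, the ways of extracting values from a list also work for a data frame.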

In [19]:
names(iris.new)

In [20]:
cat(iris.new$Petal.Length)  # use a column name to get the values of that column

1.4 1.4 1.3 1.5 1.4 1.7 1.4 1.5 1.4 1.5 1.5 1.6 1.4 1.1 1.2 1.5 1.3 1.4 1.7 1.5 1.7 1.5 1 1.7 1.9 1.6 1.6 1.5 1.4 1.6 1.6 1.5 1.5 1.4 1.5 1.2 1.3 1.4 1.3 1.5 1.3 1.3 1.3 1.6 1.9 1.4 1.6 1.4 1.5 1.4 4.7 4.5 4.9 4 4.6 4.5 4.7 3.3 4.6 3.9 3.5 4.2 4 4.7 3.6 4.4 4.5 4.1 4.5 3.9 4.8 4 4.9 4.7 4.3 4.4 4.8 5 4.5 3.5 3.8 3.7 3.9 5.1 4.5 4.5 4.7 4.4 4.1 4 4.4 4.6 4 3.3 4.2 4.2 4.2 4.3 3 4.1 6 5.1 5.9 5.6 5.8 6.6 4.5 6.3 5.8 6.1 5.1 5.3 5.5 5 5.1 5.3 5.5 6.7 6.9 5 5.7 4.9 6.7 4.9 5.7 6 4.8 4.9 5.6 5.8 6.1 6.4 5.6 5.1 5.6 6.1 5.6 5.5 4.8 5.4 5.6 5.1 5.1 5.9 5.7 5.2 5 5.2 5.4 5.1

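A data frame stores tabular data, so rows can also be selected by row number, as with a matrix.

In [21]:
iris.new[1,]  # select a row by number; the "," is required: the part before it gives row numbers (or names), the part after it column numbers (or names)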

Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
5.1,3.5,1.4,0.2,setosa


In [22]:
head(iris.new[1])  # the first column (returned as a one-column data frame)

Sepal.Length
5.1
4.9
4.7
4.6
5.0
5.4


In [23]:
iris.new[3:6,]  # slice rows 3 to 6

,Sepal.Length,Sepal.Width,Petal.Length,Petal.Width,Species
3,4.7,3.2,1.3,0.2,setosa
4,4.6,3.1,1.5,0.2,setosa
5,5.0,3.6,1.4,0.2,setosa
6,5.4,3.9,1.7,0.4,setosa


In [24]:
iris.new[3:6, c(1,5)]  # columns can also be selected by column number

,Sepal.Length,Species
3,4.7,setosa
4,4.6,setosa
5,5.0,setosa
6,5.4,setosa
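
The different extraction forms return different kinds of objects, which is a common source of confusion. A minimal sketch with a made-up data frame `df`:

```r
df <- data.frame(number = 1:3, abbr = c("Mon.", "Tue.", "Wed."))

class(df["number"])                  # "data.frame": single brackets keep the data frame
class(df[["number"]])                # "integer": double brackets extract the column vector
class(df$number)                     # "integer": same as [["number"]]
class(df[, "number"])                # "integer": a single column is dropped to a vector
class(df[, "number", drop = FALSE])  # "data.frame": drop = FALSE keeps the data frame
```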


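### 4. Other operations on data frames


#### 4.1 Renaming columns and rows
When a data frame is created, column names are usually given explicitly, while row names default to sequential numbers.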

In [25]:
names(iris.new)

In [26]:
names(iris.new) <- c('sepal_length', 'sepal_width', 'petal_length', 'petal_width', 'SPEcies')

In [27]:
head(iris.new)

sepal_length,sepal_width,petal_length,petal_width,SPEcies
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


In [28]:
names(iris.new)[5] <- 'species'  # rename a single column

In [29]:
head(iris.new)

sepal_length,sepal_width,petal_length,petal_width,species
5.1,3.5,1.4,0.2,setosa
4.9,3.0,1.4,0.2,setosa
4.7,3.2,1.3,0.2,setosa
4.6,3.1,1.5,0.2,setosa
5.0,3.6,1.4,0.2,setosa
5.4,3.9,1.7,0.4,setosa


In [30]:
rownames(days.name.df)

In [31]:
rownames(days.name.df) <- c('1st', '2nd', '3rd', '4th', '5th', '6th', '7th')  # add explicit row names
days.name.df

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.


In [32]:
days.name.df[c('2nd', '6th'), c('number', 'abbr')]

,number,abbr
2nd,2,Tue.
6th,6,Sat.


#### 4.2 Adding rows or columns

In [33]:
days.name.df['new.column'] <- 'x'  # add a new column; the single value is recycled for every row
days.name.df

,number,full,abbr,new.column
1st,1,Monday,Mon.,x
2nd,2,Tuesday,Tue.,x
3rd,3,Wednesday,Wed.,x
4th,4,Thursday,Thur.,x
5th,5,Friday,Fri.,x
6th,6,Saturday,Sat.,x
7th,7,Sunday,Sun.,x


In [34]:
days.name.df['8th',] <- list(8L, 'Rainday', 'Rain.', 'x')  # add a row; list() keeps each column's type, whereas c() would coerce every value to character
days.name.df

,number,full,abbr,new.column
1st,1,Monday,Mon.,x
2nd,2,Tuesday,Tue.,x
3rd,3,Wednesday,Wed.,x
4th,4,Thursday,Thur.,x
5th,5,Friday,Fri.,x
6th,6,Saturday,Sat.,x
7th,7,Sunday,Sun.,x
8th,8,Rainday,Rain.,x
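
Another way to append a row without disturbing column types is to build a one-row data frame and use `rbind()`. This is a sketch with made-up data (`days`, `new.row`), not the notebook's own objects.

```r
days <- data.frame(number = 1:2, full = c("Monday", "Tuesday"),
                   stringsAsFactors = FALSE)

# A one-row data frame keeps each column's type (integer, character)
new.row <- data.frame(number = 3L, full = "Wednesday", stringsAsFactors = FALSE)
days <- rbind(days, new.row)
str(days)                    # number is still int, full is still chr

days$is.weekend <- FALSE     # a new column; the single value is recycled
days
```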


#### 4.3 Modifying values

In [35]:
days.name.df['new.column'] <- 'y'  # overwrite the whole column
days.name.df

,number,full,abbr,new.column
1st,1,Monday,Mon.,y
2nd,2,Tuesday,Tue.,y
3rd,3,Wednesday,Wed.,y
4th,4,Thursday,Thur.,y
5th,5,Friday,Fri.,y
6th,6,Saturday,Sat.,y
7th,7,Sunday,Sun.,y
8th,8,Rainday,Rain.,y


In [36]:
cat(1:8>4)

FALSE FALSE FALSE FALSE TRUE TRUE TRUE TRUE

In [37]:
days.name.df[1:8>4, 'new.column'] <- 'z'  # a logical vector selects the rows to modify; the column is given by name
days.name.df

,number,full,abbr,new.column
1st,1,Monday,Mon.,y
2nd,2,Tuesday,Tue.,y
3rd,3,Wednesday,Wed.,y
4th,4,Thursday,Thur.,y
5th,5,Friday,Fri.,z
6th,6,Saturday,Sat.,z
7th,7,Sunday,Sun.,z
8th,8,Rainday,Rain.,z
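
In practice the logical vector usually comes from a condition on the data itself rather than from `1:8 > 4`. A small sketch under that assumption; the `days` data frame and its `kind` column are hypothetical.

```r
days <- data.frame(number = 1:7, kind = "weekday", stringsAsFactors = FALSE)

# Rows where the condition on `number` is TRUE get the new value
days[days$number > 5, "kind"] <- "weekend"
days
table(days$kind)   # counts per value of kind
```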


#### 4.4 Deleting rows or columns
A column is deleted the same way as a list element: assign `NULL` to it. This does not work for deleting rows, however.

In [38]:
dim(days.name.df)

In [39]:
days.name.df$new.column <- NULL
days.name.df

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.
8th,8,Rainday,Rain.


A **negative integer** row index excludes that row.

In [40]:
days.name.df[-c(8),]  # with `-`, only numeric row indices work, not row names; this returns a new data frame without the 8th row

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.


Excluding a row directly by its row name
- Reference: https://stackoverflow.com/a/7576278/2803344

In [41]:
days.name.df

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.
8th,8,Rainday,Rain.


In [42]:
cat(!(rownames(days.name.df) %in% c('8th')))  # `!` is the logical NOT operator

TRUE TRUE TRUE TRUE TRUE TRUE TRUE FALSE

In [43]:
days.name.df[!(rownames(days.name.df) %in% c('8th')),]

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.


In [44]:
cat(-which(rownames(days.name.df) %in% c('8th')))  # use `which()` to find the position of the row name

-8

In [45]:
days.name.df[-which(rownames(days.name.df) %in% c('8th')),]

,number,full,abbr
1st,1,Monday,Mon.
2nd,2,Tuesday,Tue.
3rd,3,Wednesday,Wed.
4th,4,Thursday,Thur.
5th,5,Friday,Fri.
6th,6,Saturday,Sat.
7th,7,Sunday,Sun.


#### 4.5 Combining two data frames by row or by column
This works the same way as combining two matrices: use `cbind()` and `rbind()`.
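
A short sketch of both functions on made-up data frames (`top`, `bottom`, `extra` are hypothetical names):

```r
top <- data.frame(id = 1:2, name = c("a", "b"))
bottom <- data.frame(id = 3:4, name = c("c", "d"))

# rbind() stacks rows; both data frames need the same column names
stacked <- rbind(top, bottom)

# cbind() puts columns side by side; both need the same number of rows
extra <- data.frame(score = c(10, 20, 30, 40))
wide <- cbind(stacked, extra)
dim(wide)   # 4 rows, 3 columns
```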

#### 4.6 Joining two data frames on a column
Use the `merge()` function.

In [46]:
authors <- data.frame(
      surname = I(c("Tukey", "Venables", "Tierney", "Ripley", "McNeil")),
      nationality = c("US", "Australia", "US", "UK", "Australia"),
      deceased = c("yes", rep("no", 4)))
authors

surname,nationality,deceased
Tukey,US,yes
Venables,Australia,no
Tierney,US,no
Ripley,UK,no
McNeil,Australia,no


In [47]:
books <- data.frame(
      name = I(c("Tukey", "Venables", "Tierney",
               "Ripley", "Ripley", "McNeil", "R Core")),
      title = c("Exploratory Data Analysis",
                "Modern Applied Statistics ...",
                "LISP-STAT",
                "Spatial Statistics", "Stochastic Simulation",
                "Interactive Data Analysis",
                "An Introduction to R"),
      other.author = c(NA, "Ripley", NA, NA, NA, NA,
                       "Venables & Smith"))
books

name,title,other.author
Tukey,Exploratory Data Analysis,
Venables,Modern Applied Statistics ...,Ripley
Tierney,LISP-STAT,
Ripley,Spatial Statistics,
Ripley,Stochastic Simulation,
McNeil,Interactive Data Analysis,
R Core,An Introduction to R,Venables & Smith


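There are now two tables: one records information about authors, the other about books, and both contain a column of names. The two tables can be joined on that column to obtain more information, for example which country the author of each book comes from.

In [48]:
final <- merge(books, authors, by.x = "name", by.y = "surname", sort = FALSE, all.x = TRUE, all.y = FALSE)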
final

name,title,other.author,nationality,deceased
Tukey,Exploratory Data Analysis,,US,yes
Venables,Modern Applied Statistics ...,Ripley,Australia,no
Tierney,LISP-STAT,,US,no
Ripley,Spatial Statistics,,UK,no
Ripley,Stochastic Simulation,,UK,no
McNeil,Interactive Data Analysis,,Australia,no
R Core,An Introduction to R,Venables & Smith,,
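
The `all`, `all.x` and `all.y` arguments decide which unmatched rows are kept. The sketch below compares three settings on two small made-up tables sharing a `key` column.

```r
left <- data.frame(key = c("a", "b", "c"), x = 1:3)
right <- data.frame(key = c("b", "c", "d"), y = c(20, 30, 40))

inner <- merge(left, right, by = "key")                     # keys present in both: b, c
left.join <- merge(left, right, by = "key", all.x = TRUE)   # every row of left: a, b, c
full <- merge(left, right, by = "key", all = TRUE)          # every key: a, b, c, d

nrow(inner)       # 2
nrow(left.join)   # 3
nrow(full)        # 4
```

Rows without a match get `NA` in the columns coming from the other table, as with the "R Core" row above.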


### reference

https://en.wikibooks.org/wiki/R_Programming/Working_with_data_frames