**Factors are data objects for categorical data**

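Categorical data sorts observations into the distinct levels of a factor. For example, a student's gender is a factor with two levels, male and female. A factor records every possible level of a variable, even levels that do not occur in the data. Factors are very useful in statistical modeling and analysis; for example, they make it easy to relabel the codes 0 and 1 as 'yes' and 'no'. In R, the factor function creates a factor. Its signature is: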
        
             factor(x = character(), levels, labels = levels, exclude = NA, ordered = is.ordered(x))

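 Here, levels specifies the levels of the factor; labels gives the names of those levels; exclude lists the values of x to leave out when the levels are formed; ordered determines whether the levels have an order.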
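The arguments described above can be tried out on a small vector. The following is a minimal sketch; the data and the mapping of 0 to "no" and 1 to "yes" are assumptions made for illustration only.

```r
# Hypothetical answers coded as 0/1 (illustrative data).
answer <- c(0, 1, 1, 0, 1)

# levels names the codes to recognise; labels renames them in the same order.
answer_f <- factor(answer, levels = c(0, 1), labels = c("no", "yes"))
print(answer_f)

# A level may be declared even if it never occurs in the data.
size <- factor(c("S", "M", "S"), levels = c("S", "M", "L"))
print(table(size))   # "L" is listed with a count of 0

# exclude removes a value from the levels; matching elements become NA.
size_no_m <- factor(c("S", "M", "S"), exclude = "M")
print(size_no_m)
```

Declaring the full set of levels up front keeps tables and model matrices consistent across subsets of the data, even when a subset happens to miss a category.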


# Creating factors

## Unordered factors
--the factor function

In [1]:
colour <- c('G', 'G', 'R', 'Y', 'G', 'Y', 'Y', 'R', 'Y')

In [2]:
col <- factor(colour, levels = c('G', 'R', 'Y'), labels = c('Green', 'Red', 'Yellow')) # create a factor
print(col)

[1] Green  Green  Red    Yellow Green  Yellow Yellow Red    Yellow
Levels: Green Red Yellow


In [3]:
class(col)

## Ordered factors
--the ordered function

In [4]:
score <- c('A', 'B', 'A', 'C', 'B')
score1 <- ordered(score, levels = c('C', 'B', 'A'))
print(score1)

[1] A B A C B
Levels: C < B < A


In [5]:
class(score1)
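What an ordered factor adds over an unordered one is that its levels can be compared. A short sketch, reusing the same grades as above (the variable names `at_least_b` and `best` are made up for this example):

```r
score <- c("A", "B", "A", "C", "B")
score1 <- ordered(score, levels = c("C", "B", "A"))

# Comparisons follow the declared level order C < B < A.
at_least_b <- score1 >= "B"
print(at_least_b)

# max() and min() are defined for ordered factors.
best <- max(score1)
print(best)

# ordered(x, ...) is shorthand for factor(x, ..., ordered = TRUE).
same <- identical(score1, factor(score, levels = c("C", "B", "A"), ordered = TRUE))
print(same)
```

The same comparisons on an unordered factor return NA with a warning, because R has no order to compare against.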

## Converting a numeric vector to a factor
--the cut function

In [6]:
exam <- c(60,70,80,90,99,65,72,86)
(exam1 <- cut(exam, breaks = 3)) # split into 3 equal-width bins; width = (max - min) / 3 = (99 - 60) / 3 = 13

In [7]:
class(exam1)

In [8]:
(exam1 <- cut(exam, right = FALSE, breaks = 3)) # right = FALSE: intervals are closed on the left and open on the right; the default is TRUE
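The effect of right is easiest to see on a value that falls exactly on an inner break. With three equal-width bins over 60 to 99 the inner breaks are 73 and 86, so the score 86 changes bin depending on which side is closed. A small sketch (the names `right_closed` and `left_closed` are illustrative):

```r
exam <- c(60, 70, 80, 90, 99, 65, 72, 86)

right_closed <- cut(exam, breaks = 3)                 # default: right = TRUE
left_closed  <- cut(exam, breaks = 3, right = FALSE)

# Bin counts differ because 86 lies exactly on a break.
print(table(right_closed))
print(table(left_closed))

# Which bin holds the score 86 in each case?
print(right_closed[exam == 86])
print(left_closed[exam == 86])
```

When breaks is a single number, cut also widens the outermost breaks slightly so that the minimum and maximum are not dropped.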

In [9]:
(exam2 <- cut(exam, breaks = c(0, 60, 70, 80, 90, 100), right = FALSE)) # split at user-defined breaks

In [10]:
(exam3 <- cut(exam, breaks = c(0, 60, 70, 80, 90, 100), right = FALSE, ordered_result = TRUE)) # split at user-defined breaks and return an ordered factor

In [11]:
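# split at user-defined breaks and label each bin (the labels mean fail, pass, fair, good, excellent)
exam4 <- cut(exam, breaks = c(0, 60, 70, 80, 90, 100), right = FALSE, labels = c('不及格','及格', '中', '良','优'))  # the levels follow the break order, but the result is still an unordered factor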
print(exam4)   

[1] 及格 中   良   优   优   及格 中   良  
Levels: 不及格 及格 中 良 优
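Two details of the calls above are worth checking directly: whether the result is an ordered factor, and what happens to a score that equals the top break when right = FALSE. The sketch below uses the same breaks; the score of 100 is a hypothetical value that does not occur in exam.

```r
exam <- c(60, 70, 80, 90, 99, 65, 72, 86)
brk <- c(0, 60, 70, 80, 90, 100)

# Without ordered_result the factor is unordered, even though its
# levels are stored in break order.
grade <- cut(exam, breaks = brk, right = FALSE)
print(is.ordered(grade))

grade_ord <- cut(exam, breaks = brk, right = FALSE, ordered_result = TRUE)
print(is.ordered(grade_ord))

# With right = FALSE the last bin is [90,100), so 100 falls outside it.
top <- cut(100, breaks = brk, right = FALSE)
print(top)

# include.lowest = TRUE closes the outermost bin so that 100 is kept.
top_kept <- cut(100, breaks = brk, right = FALSE, include.lowest = TRUE)
print(top_kept)
```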


In [12]:
y <- c("女","男","男","女","女","女","男") 
class(y)

In [13]:
(f <- factor(y)) # create a factor

In [14]:
class(f)

In [15]:
levels(f)      # factor levels: the set of category values the factor can take

# Sorting by a factor

In [16]:
(T1 <- data.frame(
  score = exam4,
  gender = factor(c("female", "male", "female", "male", "male", "female", "female", "male"))
)) # gender is created explicitly as an unordered factor

score,gender
<fct>,<fct>
及格,female
中,male
良,female
优,male
优,male
及格,female
中,female
良,male


In [17]:
T1[with(T1, order(gender)), ]   # gender is an unordered factor in T1, so rows sort by its levels, which default to alphabetical order

,score,gender
,<fct>,<fct>
1,及格,female
3,良,female
6,及格,female
7,中,female
2,中,male
4,优,male
5,优,male
8,良,male
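The ordering above follows from how a factor is stored: each element is an integer code that indexes into the levels, and order() sorts by those codes. A minimal sketch on a shortened, illustrative gender vector:

```r
gender <- factor(c("female", "male", "female", "male"))

# Default levels are sorted alphabetically: "female" = 1, "male" = 2.
print(levels(gender))
print(as.integer(gender))

# order() on a factor gives the same permutation as ordering its codes.
by_factor <- order(gender)
by_code   <- order(as.integer(gender))
print(by_factor)

# Re-declaring the levels changes the codes and therefore the sort order.
gender2 <- factor(gender, levels = c("male", "female"))
by_new_levels <- order(gender2)
print(by_new_levels)
```

Because only the codes matter, changing the level order is enough to change how order() sorts, whether or not the factor is declared as ordered.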


In [18]:
(T2 <- data.frame(
  score = exam4,
  gender = ordered(c("female", "male", "female", "male", "male", "female", "female", "male"), levels = c("male", "female"))
))

score,gender
<fct>,<ord>
及格,female
中,male
良,female
优,male
优,male
及格,female
中,female
良,male


In [19]:
levels(T2$gender)   # compared with levels(T1$gender), the order of the levels has changed

In [20]:
T2[with(T2, order(gender)), ]    # gender is an ordered factor in T2, so rows sort in ascending order of the specified levels (default: decreasing = FALSE)

,score,gender
,<fct>,<ord>
2,中,male
4,优,male
5,优,male
8,良,male
1,及格,female
3,良,female
6,及格,female
7,中,female


In [21]:
T2[with(T2, order(gender, decreasing = TRUE)), ] # gender is an ordered factor in T2; sort in descending order of the specified levels

,score,gender
,<fct>,<ord>
1,及格,female
3,良,female
6,及格,female
7,中,female
2,中,male
4,优,male
5,优,male
8,良,male


# Factors constrain values

The values of a factor are restricted to its levels or to missing values; no other value can be assigned.

In [22]:
(T2 <- data.frame(
  score = exam4,
  gender = ordered(c("female", "male", "female", "male", "male", "female", "female", "male"), levels = c("male", "female"))
))

score,gender
<fct>,<ord>
及格,female
中,male
良,female
优,male
优,male
及格,female
中,female
良,male


The values of the factor gender are restricted to "female", "male", or missing. The constraint becomes obvious when a different string is assigned to gender:

In [23]:
T2$gender[1]

In [24]:
T2$gender[1] <- "Male"    # note the capital "M"

"invalid factor level, NA generated"


In [25]:
(T2$gender[1] <- "male")

In [26]:
T2

score,gender
<fct>,<ord>
及格,male
中,male
良,female
优,male
优,male
及格,female
中,female
良,male
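If a new category is actually needed, the usual approach is to extend the levels before assigning the value. The sketch below shows both the failed assignment and that workaround on a small standalone factor; the level name "other" is hypothetical.

```r
g <- factor(c("female", "male", "female"))

# Assigning a value outside the levels yields NA (the warning is silenced here).
g_bad <- g
suppressWarnings(g_bad[1] <- "Male")
print(is.na(g_bad[1]))

# Extend the levels first, then the assignment succeeds.
g_ok <- g
levels(g_ok) <- c(levels(g_ok), "other")
g_ok[1] <- "other"
print(g_ok)
```

This keeps the constraint useful: a typo such as "Male" is flagged instead of silently creating a new category.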


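# Dropping factor levels
While cleaning a dataset, you may end up removing every row that corresponds to one of a factor's levels. Consider the following dataset, which records the mode of transport used to get to work and the travel time in minutes: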

In [27]:
(getting_to_work <- data.frame(mode = factor(c("bike", "car", "bus", "car", "walk","bike", "car", "bike", "car", "car" )),
                              time_mins = c(25, 13, NA, 22, 65, 28, 15, 24, NA, 14)))

mode,time_mins
<fct>,<dbl>
bike,25.0
car,13.0
bus,
car,22.0
walk,65.0
bike,28.0
car,15.0
bike,24.0
car,
car,14.0


In [28]:
# remove the rows where time_mins is NA:
(getting_to_work <- subset(getting_to_work, !is.na(time_mins)))

,mode,time_mins
,<fct>,<dbl>
1,bike,25
2,car,13
4,car,22
5,walk,65
6,bike,28
7,car,15
8,bike,24
10,car,14


In [29]:
levels(getting_to_work$mode)

In [30]:
unique(getting_to_work$mode)   # 'bus' no longer appears in the data, but it is still a level

To remove unused factor levels, use the droplevels function:

In [31]:
getting_to_work <- droplevels(getting_to_work) 
levels(getting_to_work$mode)
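droplevels also works on a single factor, not only on a data frame. The following sketch, on a shortened illustrative vector, shows why an unused level matters (it still appears in tables) and two ways to remove it:

```r
mode <- factor(c("bike", "car", "bus", "car", "walk"))

# Subsetting keeps all of the original levels.
kept <- mode[mode != "bus"]
print(levels(kept))
print(table(kept))        # "bus" is still listed, with a count of 0

# droplevels() removes levels that no longer occur.
dropped <- droplevels(kept)
print(levels(dropped))

# Calling factor() again on the factor has the same effect.
refactored <- factor(kept)
print(levels(refactored))
```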