# Categorical Data

In [1]:
import numpy as np
import pandas as pd

## The cat Object

### Attributes of the cat Object

A Series can be converted to the categorical `category` dtype with the `astype` method:

In [2]:
df = pd.read_csv('data/learn_pandas.csv',usecols = ['Grade', 'Name', 'Gender', 'Height', 'Weight'])

In [3]:
s = df.Grade.astype('category')

In [4]:
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

A Series of `category` dtype defines a `cat` accessor, whose attributes and methods operate on the categories.

In [5]:
s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x0000025A841F6B48>

In [6]:
# View the categories themselves, stored as an Index
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [7]:
# Check whether the categories are ordered
s.cat.ordered

False

In [8]:
# View the integer code of each element's category
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
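The relationship between `codes` and `categories` can be sketched on a small hypothetical Series (the values below are made up for illustration and are not taken from `learn_pandas.csv`). Each code is the position of the element's label inside `cat.categories`, so looking the codes up in the categories recovers the original values. Missing values, which this sketch does not contain, are encoded as `-1`.

```python
import pandas as pd

# Hypothetical toy data, not the notebook's dataset
s = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Freshman']).astype('category')

# Position of each element's label within s.cat.categories
codes = s.cat.codes

# Look the labels up again from their integer codes
decoded = s.cat.categories.take(codes.to_numpy())

print(codes.tolist())
print(list(decoded))
```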

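### Adding, Removing, and Modifying Categories

The table below summarizes the operations for adding, removing, and modifying categories:

|Method|Effect|
|:---|:---|
|`cat.add_categories`|Adds the specified categories|
|`cat.remove_categories`|Removes the specified categories|
|`cat.set_categories`|Sets the new categories of the Series directly; elements whose original category is not among the new ones become missing|
|`cat.remove_unused_categories`|Removes categories that do not appear in the Series|
|`cat.rename_categories`|Renames categories|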
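A minimal sketch of the five methods in the table, run on a hypothetical three-element Series (the data and the `'Graduate'` and `'First year'` labels are assumptions made for illustration). Each method returns a new Series and leaves the original unchanged.

```python
import pandas as pd

# Hypothetical toy data
s = pd.Series(['Freshman', 'Senior', 'Junior']).astype('category')

# Add a category that no element uses yet
added = s.cat.add_categories('Graduate')

# Remove a category; elements that had it become missing
removed = s.cat.remove_categories('Freshman')

# Replace the whole category set; values outside it become missing
reset = s.cat.set_categories(['Sophomore', 'Senior'])

# Drop categories that no element uses
unused = added.cat.remove_unused_categories()

# Rename a category through a mapping of old name to new name
renamed = s.cat.rename_categories({'Freshman': 'First year'})

print(list(added.cat.categories))
print(removed.isna().tolist())
print(reset.tolist())
print(list(unused.cat.categories))
print(renamed.tolist())
```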

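## Ordered Categories

### Establishing an Order

`reorder_categories` can turn the current unordered categories into ordered ones. The list passed in must contain exactly the existing categories, and `ordered=True` must be set: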

In [9]:
s = df.Grade.astype('category')

In [10]:
s = s.cat.reorder_categories(['Freshman','Junior','Senior','Sophomore'],ordered=True)

In [11]:
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Junior' < 'Senior' < 'Sophomore']

`as_unordered` restores ordered categories to the unordered state:

In [12]:
s.cat.as_unordered()

0       Freshman
1       Freshman
2         Senior
3      Sophomore
4      Sophomore
         ...    
195       Junior
196       Senior
197       Senior
198       Senior
199    Sophomore
Name: Grade, Length: 200, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

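### Sorting and Comparison

Ordered categorical variables can be sorted; the sort follows the category order rather than alphabetical order.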

In [13]:
df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered=True)

In [14]:
df.sort_values('Grade').head()

| |Grade|Name|Gender|Height|Weight|
|:---|:---|:---|:---|:---|:---|
|0|Freshman|Gaopeng Yang|Female|158.9|46.0|
|105|Freshman|Qiang Shi|Female|164.5|52.0|
|96|Freshman|Changmei Feng|Female|163.8|56.0|
|88|Freshman|Xiaopeng Han|Female|164.1|53.0|
|81|Freshman|Yanli Zhang|Female|165.1|52.0|


In addition, Series of ordered categorical variables can be compared.
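The comparison of ordered categoricals can be sketched as follows, using a hypothetical four-element Series rather than the notebook's `df.Grade`. Two forms are shown: comparison against a scalar that belongs to the categories, and element-wise comparison against another Series that shares the same categories and order.

```python
import pandas as pd

# Hypothetical toy data with the academic order of the grades
grades = pd.Series(['Freshman', 'Senior', 'Sophomore', 'Junior']).astype('category')
grades = grades.cat.reorder_categories(
    ['Freshman', 'Sophomore', 'Junior', 'Senior'], ordered=True)

# Comparison with a scalar that is one of the categories
at_least_junior = grades >= 'Junior'

# Element-wise comparison with a Series of identical categories and order
reversed_grades = grades.iloc[::-1].reset_index(drop=True)
lower_than_reversed = grades < reversed_grades

print(at_least_junior.tolist())
print(lower_than_reversed.tolist())
```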

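## Interval Categories

### Constructing Intervals with cut and qcut

Interval categories classify a numeric variable by the interval each value falls into. They can be constructed with the `cut` and `qcut` functions.

`cut` divides by the actual values of the numeric variable. Its key parameter is `bins`: if an integer `n` is passed, the value range is split into n equal-width intervals; if a list of numbers is passed, those values are used as the split points. `np.inf` denotes infinity (older code may use the alias `np.infty`, which was removed in NumPy 2.0).

The resulting intervals are **left-open and right-closed** by default; passing `right=False` makes them left-closed and right-open.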

In [15]:
s = pd.Series([1, 2, 4, 7])

In [16]:
pd.cut(s,bins=3, right=False)

0      [1.0, 3.0)
1      [1.0, 3.0)
2      [3.0, 5.0)
3    [5.0, 7.006)
dtype: category
Categories (3, interval[float64]): [[1.0, 3.0) < [3.0, 5.0) < [5.0, 7.006)]
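To make the difference between the two conventions concrete, the sketch below bins a single value that sits exactly on a split point. The data and the split points `[0, 2, 4]` are hypothetical choices for illustration. With the default `right=True` the value 2 lands in the interval that ends at 2; with `right=False` it lands in the interval that starts at 2.

```python
import pandas as pd

# Hypothetical single value lying exactly on a split point
s = pd.Series([2])

default = pd.cut(s, bins=[0, 2, 4])                  # right=True: (a, b]
left_closed = pd.cut(s, bins=[0, 2, 4], right=False)  # right=False: [a, b)

# Which side of each interval is closed
print(default.cat.categories.closed, left_closed.cat.categories.closed)

# The interval that the value 2 was assigned to in each case
print(default.iloc[0], left_closed.iloc[0])
```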

In [17]:
pd.cut(s, bins=[-np.inf, 1, 4, np.inf])

0    (-inf, 1.0]
1     (1.0, 4.0]
2     (1.0, 4.0]
3     (4.0, inf]
dtype: category
Categories (3, interval[float64]): [(-inf, 1.0] < (1.0, 4.0] < (4.0, inf]]

The `labels` parameter assigns names to the resulting intervals:

In [18]:
s = pd.Series([1,2])
res = pd.cut(s, bins=2, labels=['small', 'big'])
res

0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

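`qcut` divides by the quantiles of the numeric variable. Its key parameter is `q`: when an integer `n` is passed, the split points are the n-quantiles; when a list of floats is passed, the split points are the corresponding quantiles.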

In [19]:
s = pd.Series(np.arange(10)+1)

In [20]:
pd.qcut(s, q = 4)

0    (0.999, 3.25]
1    (0.999, 3.25]
2    (0.999, 3.25]
3      (3.25, 5.5]
4      (3.25, 5.5]
5      (5.5, 7.75]
6      (5.5, 7.75]
7     (7.75, 10.0]
8     (7.75, 10.0]
9     (7.75, 10.0]
dtype: category
Categories (4, interval[float64]): [(0.999, 3.25] < (3.25, 5.5] < (5.5, 7.75] < (7.75, 10.0]]

In [21]:
pd.qcut(s, q = [0,0.25,0.75,1])

0    (0.999, 3.25]
1    (0.999, 3.25]
2    (0.999, 3.25]
3     (3.25, 7.75]
4     (3.25, 7.75]
5     (3.25, 7.75]
6     (3.25, 7.75]
7     (7.75, 10.0]
8     (7.75, 10.0]
9     (7.75, 10.0]
dtype: category
Categories (3, interval[float64]): [(0.999, 3.25] < (3.25, 7.75] < (7.75, 10.0]]
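The practical difference between `cut` and `qcut` shows up on skewed data. The sketch below uses a hypothetical Series with one large outlier: `cut` produces equal-width bins, so almost all values fall into the first one, while `qcut` produces bins with roughly equal counts.

```python
import pandas as pd

# Hypothetical skewed data with a single outlier
s = pd.Series([1, 2, 3, 4, 100])

by_width = pd.cut(s, bins=2)    # two equal-width intervals over [1, 100]
by_quantile = pd.qcut(s, q=2)   # split at the median

# Count the elements per bin, in bin order, via the integer codes
width_counts = by_width.cat.codes.value_counts().sort_index().tolist()
quantile_counts = by_quantile.cat.codes.value_counts().sort_index().tolist()

print(width_counts)
print(quantile_counts)
```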

### Constructing General Intervals

#### Practice 1

$$
end - start = periods \times freq
$$
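The notes give this relation without surrounding text. Assuming it refers to the `start`, `end`, `periods` and `freq` parameters of `pd.interval_range`, any three of the four determine the remaining one. The sketch below checks that reading on small hypothetical values.

```python
import pandas as pd

# Assumption: the equation describes the parameters of pd.interval_range

# periods is implied: (5 - 1) / 2 = 2 intervals
by_freq = pd.interval_range(start=1, end=5, freq=2)

# freq is implied: (5 - 1) / 4 = 1 per interval
by_periods = pd.interval_range(start=1, end=5, periods=4)

# end is implied: 1 + 3 * 2 = 7
by_end = pd.interval_range(start=1, periods=3, freq=2)

print(len(by_freq))
print(by_periods.length.tolist())
print(by_end.right[-1])
```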

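### Interval Attributes and Methods

This part does not seem very practical, so it is only skimmed.

## Exercises

### Ex1: Counting Categories That Do Not Appear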

In [22]:
df = pd.DataFrame({'A':['a','b','c','a'],'B':['cat','cat','dog','cat']})

In [23]:
pd.crosstab(df.A, df.B)

|A \ B|cat|dog|
|:---|:---|:---|
|a|2|0|
|b|1|0|
|c|0|1|


In [24]:
df.B = df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)

|A \ B|cat|dog|sheep|
|:---|:---|:---|:---|
|a|2|0|0|
|b|1|0|0|
|c|0|1|0|


This exercise is fairly difficult, so the approach here is to work through the reference answer.

In [25]:
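# The function takes three parameters; dropna defaults to True
def my_crosstab(s1, s2, dropna=True):
    # The conditional expressions below implement the dropna option.
    # If dropna is True, only the values that actually occur are kept as labels;
    # otherwise cat.categories supplies the full set of categories
    idx1 = (s1.cat.categories if s1.dtype.name == 'category' and not dropna else s1.unique())
    idx2 = (s2.cat.categories if s2.dtype.name == 'category' and not dropna else s2.unique())
    # Build a frame of zeros first, then fill it in a loop
    res = pd.DataFrame(np.zeros((idx1.shape[0], idx2.shape[0])),index=idx1, columns=idx2)
    # The loop is the core of the function: walk over the pair of values in each row
    # and increment the cell that the pair indexes
    for i, j in zip(s1, s2):
        # at works like loc, but is specialized for accessing a single value by label
        res.at[i, j] += 1
    # Tidy the output: name the index and column axes and cast the counts to int
    res = res.rename_axis(index=s1.name, columns=s2.name).astype('int')
    return res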

### Ex2: The Diamonds Dataset

In [26]:
df = pd.read_csv('data/diamonds.csv')

In [27]:
df.head(3)

| |carat|cut|clarity|price|
|:---|:---|:---|:---|:---|
|0|0.23|Ideal|SI2|326|
|1|0.21|Premium|SI1|326|
|2|0.23|Good|VS1|327|


**Question 1**

In [28]:
%timeit df.cut.nunique()

2.91 ms ± 22 µs per loop (mean ± std. dev. of 7 runs, 100 loops each)


In [29]:
z = df.cut.astype('category')

In [30]:
%timeit z.nunique()

723 µs ± 3.41 µs per loop (mean ± std. dev. of 7 runs, 1000 loops each)


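As the timings show, `nunique` is faster on the `category` dtype: about 723 µs versus 2.91 ms on the original column, roughly four times faster.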
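Outside IPython, where `%timeit` is unavailable, a similar comparison can be sketched with the standard `timeit` module. The synthetic column below is an assumption standing in for `df.cut` from `diamonds.csv`, and the measured times depend on the machine and the pandas version, so no particular ratio should be expected.

```python
import timeit

import numpy as np
import pandas as pd

# Synthetic stand-in for the `cut` column (assumption, not the real dataset)
rng = np.random.default_rng(0)
labels = np.array(['Fair', 'Good', 'Very Good', 'Premium', 'Ideal'])
obj = pd.Series(rng.choice(labels, size=50_000)).astype(object)
cat = obj.astype('category')

# Total time of 20 calls for each dtype
t_obj = timeit.timeit(obj.nunique, number=20)
t_cat = timeit.timeit(cat.nunique, number=20)

print(f'object: {t_obj:.4f}s, category: {t_cat:.4f}s')
```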

**Question 2**

In [31]:
df.cut = df.cut.astype('category').cat.reorder_categories(['Fair','Good','Very Good','Premium','Ideal'],ordered=True)

In [32]:
df.clarity = df.clarity.astype('category').cat.reorder_categories(['I1', 'SI2', 'SI1', 'VS2', 'VS1', 'VVS2', 'VVS1', 'IF'],ordered=True)

In [33]:
df.sort_values(['cut','clarity'],ascending=[False,True])

| |carat|cut|clarity|price|
|:---|:---|:---|:---|:---|
|315|0.96|Ideal|I1|2801|
|535|0.96|Ideal|I1|2826|
|551|0.97|Ideal|I1|2830|
|653|1.01|Ideal|I1|2844|
|718|0.97|Ideal|I1|2856|
|...|...|...|...|...|
|41242|0.30|Fair|IF|1208|
|43778|0.37|Fair|IF|1440|
|47407|0.52|Fair|IF|1849|
|49683|0.52|Fair|IF|2144|


**Question 3**

Method 1: transform the `codes` of the current worst-to-best ordering so that they run **from best to worst**.

In [34]:
df.cut.cat.codes.map(lambda x:-(x-4)).head()

0    0
1    1
2    3
3    1
4    3
dtype: int64

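Method 2: reorder the categories **from best to worst**; the resulting `codes` then satisfy the requirement directly.

**Question 4**

Note that the question asks for the price **per carat**!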
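The notes do not include code for the second method. A minimal sketch, assuming a small hypothetical Series in place of `df.cut`, is to pass the categories to `reorder_categories` in best-to-worst order and read off `cat.codes`:

```python
import pandas as pd

# Hypothetical stand-in for df.cut; all five categories must be present
cut = pd.Series(['Ideal', 'Premium', 'Good', 'Premium', 'Good',
                 'Very Good', 'Fair']).astype('category')

# Order the categories from best to worst, so the best category gets code 0
best_first = cut.cat.reorder_categories(
    ['Ideal', 'Premium', 'Very Good', 'Good', 'Fair'], ordered=True)

print(best_first.cat.codes.tolist())
```

`reorder_categories` requires the new list to contain exactly the existing categories, which is why the toy data includes every cut grade.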

In [35]:
avg = df.price/df.carat

In [36]:
# Bin by quantiles
df['avg_cat_1'] = pd.qcut(avg, q=[0,0.2,0.4,0.6,0.8,1], labels=['Very Low','Low','Mid','High','Very High'])

In [37]:
# Bin by the specified split points (np.inf replaces np.infty, which was removed in NumPy 2.0)
df['avg_cat_2'] = pd.cut(avg, bins=[-np.inf,1000,3500,5500,18000,np.inf], labels=['Very Low','Low','Mid','High','Very High'])

In [38]:
df.head()

| |carat|cut|clarity|price|avg_cat_1|avg_cat_2|
|:---|:---|:---|:---|:---|:---|:---|
|0|0.23|Ideal|SI2|326|Very Low|Low|
|1|0.21|Premium|SI1|326|Very Low|Low|
|2|0.23|Good|VS1|327|Very Low|Low|
|3|0.29|Premium|VS2|334|Very Low|Low|
|4|0.31|Good|SI2|335|Very Low|Low|


**Question 5**

In [39]:
df.avg_cat_2.value_counts()

Low          26998
Mid          16398
High         10544
Very High        0
Very Low         0
Name: avg_cat_2, dtype: int64

As the counts show, the `Very High` and `Very Low` categories never occur.

In [40]:
df.avg_cat_2 = df.avg_cat_2.cat.remove_unused_categories()
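The effect of `remove_unused_categories` can be checked by comparing `value_counts` before and after. The sketch below uses four hypothetical per-carat prices in place of the real `avg` Series, with the same split points and labels as above.

```python
import numpy as np
import pandas as pd

# Hypothetical per-carat prices, none below 1000 or above 18000
avg = pd.Series([1500.0, 3000.0, 4000.0, 6000.0])
labels = ['Very Low', 'Low', 'Mid', 'High', 'Very High']
cat = pd.cut(avg, bins=[-np.inf, 1000, 3500, 5500, 18000, np.inf], labels=labels)

# Unused categories are still reported, with a count of 0
before = cat.value_counts()

# After removal, only the categories that occur remain
after = cat.cat.remove_unused_categories().value_counts()

print(before)
print(after)
```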

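**Question 6**

The difficulty: once labels have been assigned to a categorical variable, how can the original interval values be displayed?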

In [41]:
pd.qcut(avg, q=[0,0.2,0.4,0.6,0.8,1])

0          (1051.162, 2295.0]
1          (1051.162, 2295.0]
2          (1051.162, 2295.0]
3          (1051.162, 2295.0]
4          (1051.162, 2295.0]
                 ...         
53935    (3073.293, 4031.683]
53936    (3073.293, 4031.683]
53937    (3073.293, 4031.683]
53938    (3073.293, 4031.683]
53939    (3073.293, 4031.683]
Length: 53940, dtype: category
Categories (5, interval[float64]): [(1051.162, 2295.0] < (2295.0, 3073.293] < (3073.293, 4031.683] < (4031.683, 5456.343] < (5456.343, 17828.846]]

In [42]:
avg_interval = pd.IntervalIndex(pd.qcut(avg, q=[0,0.2,0.4,0.6,0.8,1]))

In [43]:
# Left endpoints
avg_interval.left.to_series().reset_index(drop=True)

0        1051.162
1        1051.162
2        1051.162
3        1051.162
4        1051.162
           ...   
53935    3073.293
53936    3073.293
53937    3073.293
53938    3073.293
53939    3073.293
Length: 53940, dtype: float64

In [44]:
# Right endpoints
avg_interval.right.to_series().reset_index(drop=True)

0        2295.000
1        2295.000
2        2295.000
3        2295.000
4        2295.000
           ...   
53935    4031.683
53936    4031.683
53937    4031.683
53938    4031.683
53939    4031.683
Length: 53940, dtype: float64

In [45]:
# Interval lengths
avg_interval.length.to_series().reset_index(drop=True)

0        1243.838
1        1243.838
2        1243.838
3        1243.838
4        1243.838
           ...   
53935     958.390
53936     958.390
53937     958.390
53938     958.390
53939     958.390
Length: 53940, dtype: float64
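As an alternative to rebuilding the intervals without labels, `qcut` (like `cut`) accepts `retbins=True` and then also returns the array of bin edges, which allows the endpoints to be recovered while the labels are kept. This is a sketch on hypothetical data, not the approach used in the notes; the label names `Q1` to `Q4` are made up. The returned edges are the exact quantiles, so the lowest edge is the minimum itself rather than the slightly lowered value shown in the interval display.

```python
import pandas as pd

# Hypothetical data standing in for the per-carat price
s = pd.Series(range(1, 11))
labels = ['Q1', 'Q2', 'Q3', 'Q4']

# cats carries the labels; edges holds the 5 bin boundaries
cats, edges = pd.qcut(s, q=4, labels=labels, retbins=True)

# The code of each element selects its left and right boundary
codes = cats.cat.codes.to_numpy()
left = pd.Series(edges[codes], index=s.index)
right = pd.Series(edges[codes + 1], index=s.index)
length = right - left

print(pd.DataFrame({'label': cats, 'left': left, 'right': right, 'length': length}))
```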