# Chapter 9 Categorical Data

In [1]:
import numpy as np
import pandas as pd

## 1. The cat accessor
### 1.1 Attributes of the cat accessor
astype('category'): converts an ordinary Series into a categorical variable.

In [2]:
df=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/learn_pandas.csv',
              usecols=['Grade', 'Name', 'Gender', 'Height', 'Weight'])
s=df.Grade.astype('category')
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

In [3]:
s.cat

<pandas.core.arrays.categorical.CategoricalAccessor object at 0x11415d670>

In [4]:
# The categories
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [5]:
# Whether the categories are ordered
s.cat.ordered

False

In [6]:
# Each category is assigned a unique integer code according to its position in cat.categories
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
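
The link between cat.codes and cat.categories can be sketched on a small, made-up series (the data below is hypothetical, not taken from learn_pandas.csv). A missing value belongs to no category, so its code is -1:

```python
import pandas as pd

# Hypothetical toy data; the None entry becomes a missing value.
grades = pd.Series(["Freshman", "Senior", None, "Freshman"], dtype="category")

codes = grades.cat.codes            # integer position of each value in cat.categories
categories = grades.cat.categories  # the labels, sorted alphabetically here

# Missing values have no category, so their code is -1.
print(codes.tolist())  # [0, 1, -1, 0]

# Looking up the non-missing codes in categories recovers the labels.
valid = codes >= 0
recovered = categories.take(codes[valid].to_numpy())
print(list(recovered))  # ['Freshman', 'Senior', 'Freshman']
```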

### 1.2 Adding, removing, and modifying categories
(1) add_categories: adds categories.

In [7]:
s=s.cat.add_categories('Graduate')
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

(2) remove_categories: removes categories; values in the original series that belong to a removed category become missing.

In [8]:
s=s.cat.remove_categories('Freshman')
s.cat.categories

Index(['Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

In [9]:
s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

(3) set_categories: sets the categories of the series directly; values that do not belong to the new categories become missing.

In [10]:
s=s.cat.set_categories(['Sophomore','PhD'])
s.cat.categories

Index(['Sophomore', 'PhD'], dtype='object')

In [11]:
s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'PhD']

(4) remove_unused_categories: removes categories that do not appear in the series.

In [12]:
s=s.cat.remove_unused_categories()
s.cat.categories

Index(['Sophomore'], dtype='object')

(5) rename_categories: renames categories.

In [13]:
s=s.cat.rename_categories({'Sophomore':'本科二年级学生'})
s.head()

0        NaN
1        NaN
2        NaN
3    本科二年级学生
4    本科二年级学生
Name: Grade, dtype: category
Categories (1, object): ['本科二年级学生']
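
A few edge cases of these methods, sketched on hypothetical toy data: add_categories and remove_categories are strict about what they accept, whereas set_categories takes any list of labels and turns values outside it into missing values.

```python
import pandas as pd

# Hypothetical toy data, not the Grade column used above.
s = pd.Series(["a", "b", "a"], dtype="category")

# Adding a category that already exists raises ValueError.
try:
    s.cat.add_categories("a")
    add_failed = False
except ValueError:
    add_failed = True

# Removing a category that is not present also raises ValueError.
try:
    s.cat.remove_categories("z")
    remove_failed = False
except ValueError:
    remove_failed = True

# set_categories accepts arbitrary labels; values outside the new list become NaN.
t = s.cat.set_categories(["a", "z"])

print(add_failed, remove_failed)  # True True
print(t.isna().tolist())          # [False, True, False]
```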

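## 2. Ordered categories
### 2.1 Establishing an order
as_unordered and reorder_categories convert between ordered and unordered categories. The list passed to reorder_categories must consist of exactly the current categories of the series, with none added or dropped, and ordered=True must be specified for the result to become ordered.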

In [14]:
s=df.Grade.astype('category')
s=s.cat.reorder_categories(['Freshman','Sophomore','Junior','Senior'], ordered=True)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

In [15]:
s.cat.as_unordered().head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']

If you prefer not to pass ordered=True, first convert to an ordered category with s.cat.as_ordered(), then use reorder_categories to set the relative order.
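
A minimal sketch of this two-step route, using a hypothetical four-element series. When reorder_categories is called without the ordered argument, it leaves the existing ordered flag unchanged:

```python
import pandas as pd

# Hypothetical toy data with the same four labels as the Grade column.
s = pd.Series(["Junior", "Freshman", "Senior", "Sophomore"], dtype="category")

# Step 1: mark the existing (alphabetical) categories as ordered.
s = s.cat.as_ordered()
# Step 2: rearrange the categories; the ordered flag set above is kept.
s = s.cat.reorder_categories(["Freshman", "Sophomore", "Junior", "Senior"])

print(s.cat.ordered)           # True
print(list(s.cat.categories))  # ['Freshman', 'Sophomore', 'Junior', 'Senior']
```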

### 2.2 Sorting and comparison
(1) Sorting categorical variables: convert the column to category dtype, give its categories an order, then use sort_index or sort_values.

In [16]:
df.Grade=df.Grade.astype('category')
df.Grade=df.Grade.cat.reorder_categories(['Freshman','Sophomore','Junior','Senior'], ordered=True)
df.sort_values('Grade').head()

        Grade           Name  Gender  Height  Weight
0    Freshman   Gaopeng Yang  Female   158.9    46.0
105  Freshman      Qiang Shi  Female   164.5    52.0
96   Freshman  Changmei Feng  Female   163.8    56.0
88   Freshman   Xiaopeng Han  Female   164.1    53.0
81   Freshman    Yanli Zhang  Female   165.1    52.0


In [17]:
df.set_index('Grade').sort_index().head()

                   Name  Gender  Height  Weight
Grade
Freshman   Gaopeng Yang  Female   158.9    46.0
Freshman      Qiang Shi  Female   164.5    52.0
Freshman  Changmei Feng  Female   163.8    56.0
Freshman   Xiaopeng Han  Female   164.1    53.0
Freshman    Yanli Zhang  Female   165.1    52.0


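(2) Comparing categorical variables: == and != accept a scalar, or a Series or list of the same length; >, >=, < and <= accept the same kinds of operand, but every element being compared must belong to the categories of the original series, and a Series operand must have the same index as the original series.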

In [18]:
res1=df.Grade=='Sophomore'
res1.head()

0    False
1    False
2    False
3     True
4     True
Name: Grade, dtype: bool

In [19]:
res2=df.Grade==['PhD']*df.shape[0]
res2.head()

0    False
1    False
2    False
3    False
4    False
Name: Grade, dtype: bool

In [20]:
res3=df.Grade<='Sophomore'
res3.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool

In [21]:
res4=df.Grade<=df.Grade.sample(frac=1).reset_index(drop=True)
res4.head()

0     True
1     True
2    False
3     True
4     True
Name: Grade, dtype: bool
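
The difference between the two kinds of comparison can be sketched on hypothetical toy data: an equality test against a label outside the categories simply returns False, while an order comparison against such a label raises TypeError.

```python
import pandas as pd

levels = ["Freshman", "Sophomore", "Junior", "Senior"]
# Hypothetical toy data with an explicit ordered dtype.
s = pd.Series(["Senior", "Freshman", "Junior"]).astype(
    pd.CategoricalDtype(levels, ordered=True)
)

# Equality against a label outside the categories is allowed and is False everywhere.
eq = s == "PhD"

# An order comparison against such a label raises TypeError.
try:
    s < "PhD"
    raised = False
except TypeError:
    raised = True

print(eq.tolist())                   # [False, False, False]
print(raised)                        # True
print((s <= "Sophomore").tolist())   # [False, True, False]
```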

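## 3. Interval categories
### 3.1 Building intervals with cut and qcut
cut and qcut bin the numeric values of the original series, replacing each value with the interval it falls into.
(1) Parameters of cut: bins (an integer n splits the range into n equal-width segments; a list gives the split points explicitly), right (intervals are left-open and right-closed by default; right=False makes them left-closed and right-open; when bins is an integer, the outermost endpoint on the open side is shifted slightly so that the extreme value is included), labels (names for the intervals), retbins (whether to also return the split points; False by default).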

In [22]:
s=pd.Series([1,2])
pd.cut(s, bins=2)

0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

In [23]:
pd.cut(s, bins=2, right=False)

0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]

In [24]:
pd.cut(s, bins=[-np.inf, 1.2, 1.8, 2.2, np.inf])  # np.infty was removed in NumPy 2.0

0    (-inf, 1.2]
1     (1.8, 2.2]
dtype: category
Categories (4, interval[float64]): [(-inf, 1.2] < (1.2, 1.8] < (1.8, 2.2] < (2.2, inf]]

In [25]:
res=pd.cut(s, bins=2, labels=['small','big'], retbins=True)
res[0]

0    small
1      big
dtype: category
Categories (2, object): ['small' < 'big']

In [26]:
res[1]

array([0.999, 1.5  , 2.   ])
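
To make the effect of right concrete, here is a small sketch with hypothetical heights and hand-picked split points. The value 160 sits exactly on a split point, so it changes bins when the closed side is switched:

```python
import pandas as pd

# Hypothetical heights in cm and hand-picked split points.
heights = pd.Series([150.0, 160.0, 170.0, 185.0])
edges = [140, 160, 180, 200]
names = ["short", "medium", "tall"]

right_closed = pd.cut(heights, bins=edges, labels=names)              # (140, 160], ...
left_closed = pd.cut(heights, bins=edges, labels=names, right=False)  # [140, 160), ...

print(right_closed.tolist())  # ['short', 'short', 'medium', 'tall']
print(left_closed.tolist())   # ['short', 'medium', 'medium', 'tall']
```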

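(2) Parameters of qcut: q (an integer n bins the data at the n-quantiles; a list of floats gives the quantile split points explicitly), labels (names for the intervals), retbins (whether to also return the split points; False by default). Unlike cut, qcut has no right parameter; its intervals are always left-open and right-closed.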

In [27]:
s=df.Weight
pd.qcut(s, q=3).head()

0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]

In [28]:
pd.qcut(s, q=[0,0.2,0.8,1]).head()

0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]
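
The contrast between the two functions shows up most clearly on skewed data. In the sketch below (hypothetical values), cut produces equal-width bins with very uneven counts, while qcut produces bins with equal counts:

```python
import pandas as pd

# Hypothetical skewed data: nine small values and three large ones.
values = pd.Series([1, 2, 3, 4, 5, 6, 7, 8, 9, 100, 200, 300])

by_width = pd.cut(values, bins=3)  # three equal-width intervals
by_count = pd.qcut(values, q=3)    # three intervals split at the terciles

# sort_index puts the counts in interval order rather than frequency order.
width_counts = by_width.value_counts().sort_index().tolist()
count_counts = by_count.value_counts().sort_index().tolist()

print(width_counts)  # [10, 1, 1]
print(count_counts)  # [4, 4, 4]
```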

### 3.2 Building general intervals
A single interval is built with pd.Interval.
(1) Closed states: right, left, both, neither

In [29]:
my_interval=pd.Interval(0, 1, 'right')
my_interval

Interval(0, 1, closed='right')

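(2) Interval attributes: mid (midpoint), length, right (right endpoint), left (left endpoint), closed (closed state). Membership is tested with in, and the overlaps method checks whether two intervals intersect.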

In [30]:
0.5 in my_interval

True

In [31]:
my_interval_2=pd.Interval(0.5, 1.5, 'left')
my_interval.overlaps(my_interval_2)

True
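
The attributes listed in (2) can be inspected directly. The sketch below also shows that overlaps depends on the closed state: two intervals that meet only at an endpoint overlap only if both include that endpoint.

```python
import pandas as pd

iv = pd.Interval(0, 1, closed="right")  # the interval (0, 1]

print(iv.left, iv.right)  # 0 1
print(iv.mid, iv.length)  # 0.5 1
print(iv.closed)          # right
print(0 in iv, 1 in iv)   # False True

# (0, 1] and (1, 2) share no point; (0, 1] and [1, 2) share the point 1.
touch_open = iv.overlaps(pd.Interval(1, 2, closed="neither"))
touch_closed = iv.overlaps(pd.Interval(1, 2, closed="left"))
print(touch_open, touch_closed)  # False True
```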

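(3) A pd.IntervalIndex can be built in four ways: from_breaks (similar to cut or qcut, but with user-supplied split points), from_arrays (separate lists of left and right endpoints; suitable when intervals overlap and their start and end points are known), from_tuples (a list of (start, end) tuples), and pd.interval_range (any three of start, end, periods and freq).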

In [32]:
pd.IntervalIndex.from_breaks([1,3,6,10], closed='both')

IntervalIndex([[1, 3], [3, 6], [6, 10]],
              closed='both',
              dtype='interval[int64]')

In [33]:
pd.IntervalIndex.from_arrays(left=[1,3,6,10], right=[5,4,9,11], closed='neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

In [34]:
pd.IntervalIndex.from_tuples([(1,5),(3,4),(6,9),(10,11)], closed='neither')

IntervalIndex([(1, 5), (3, 4), (6, 9), (10, 11)],
              closed='neither',
              dtype='interval[int64]')

In [35]:
pd.interval_range(start=1, end=5, periods=8)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')

In [36]:
pd.interval_range(end=5, periods=8, freq=0.5)

IntervalIndex([(1.0, 1.5], (1.5, 2.0], (2.0, 2.5], (2.5, 3.0], (3.0, 3.5], (3.5, 4.0], (4.0, 4.5], (4.5, 5.0]],
              closed='right',
              dtype='interval[float64]')
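
As a cross-check, the same eight intervals can be reached through a third combination of interval_range arguments and through from_breaks. This is a sketch that compares endpoints only; np.linspace is used here just to generate the nine split points.

```python
import numpy as np
import pandas as pd

a = pd.interval_range(start=1, end=5, periods=8)        # start, end, periods
b = pd.interval_range(start=1, periods=8, freq=0.5)     # start, periods, freq
c = pd.IntervalIndex.from_breaks(np.linspace(1, 5, 9))  # explicit split points

same_ab = np.array_equal(a.left, b.left) and np.array_equal(a.right, b.right)
same_ac = np.array_equal(a.left, c.left) and np.array_equal(a.right, c.right)
print(same_ab, same_ac)  # True True
```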

In [37]:
# Practice

### 3.3 Attributes and methods of intervals

In [38]:
id_interval=pd.IntervalIndex(pd.cut(s,3))

In [39]:
# Select the first five
id_demo=id_interval[:5]
id_demo

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

In [40]:
# Left endpoints
id_demo.left

Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')

In [41]:
# Right endpoints
id_demo.right

Float64Index([52.333, 70.667, 89.0, 52.333, 89.0], dtype='float64')

In [42]:
# Midpoints (mean of the two endpoints)
id_demo.mid

Float64Index([43.138999999999996, 61.5, 79.8335, 43.138999999999996, 79.8335], dtype='float64')

In [43]:
# Interval lengths
id_demo.length

Float64Index([18.387999999999998, 18.334000000000003, 18.333,
              18.387999999999998, 18.333],
             dtype='float64')

In [44]:
# Whether each interval contains a given element
id_demo.contains(4)

array([False, False, False, False, False])

In [45]:
# Whether each interval overlaps a given interval
id_demo.overlaps(pd.Interval(40,60))

array([ True,  True, False,  True, False])
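
Since contains(4) above is False for every interval, a smaller hypothetical index may make the element-wise behaviour easier to see. Note how the value 10 falls in (0, 10] but not in (10, 20]:

```python
import pandas as pd

# Hypothetical index with intervals (0, 10], (10, 20], (20, 30].
idx = pd.IntervalIndex.from_breaks([0, 10, 20, 30])

has_ten = idx.contains(10)                 # element-wise membership test
hits = idx.overlaps(pd.Interval(5, 15))    # element-wise overlap with (5, 15]

print(has_ten.tolist())  # [True, False, False]
print(hits.tolist())     # [True, True, False]
```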

## 4. Exercises
### Ex1: Counting categories that do not appear

In [46]:
df=pd.DataFrame({'A':['a','b','c','a'], 'B':['cat','cat','dog','cat']})
pd.crosstab(df.A, df.B)
# Tabulate how often each combination of the two columns occurs

B  cat  dog
A
a    2    0
b    1    0
c    0    1


In [47]:
df.B=df.B.astype('category').cat.add_categories('sheep')
pd.crosstab(df.A, df.B, dropna=False)

B  cat  dog  sheep
A
a    2    0      0
b    1    0      0
c    0    1      0
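
The same idea applies to a single column: value_counts on a categorical series reports every category, including those with no occurrences. A minimal sketch on the B column's values, rebuilt here as a standalone series:

```python
import pandas as pd

# Standalone copy of the B column from the exercise, plus an unused category.
animals = pd.Series(["cat", "cat", "dog", "cat"], dtype="category")
animals = animals.cat.add_categories("sheep")

counts = animals.value_counts()
print(counts.to_dict())  # {'cat': 3, 'dog': 1, 'sheep': 0}
```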


### Ex2: The diamonds dataset

In [48]:
df2=pd.read_csv('/Users/jie/Documents/Python/joyful-pandas-master/data/diamonds.csv')
df2.head(3)

   carat      cut clarity  price
0   0.23    Ideal     SI2    326
1   0.21  Premium     SI1    326
2   0.23     Good     VS1    327


In [49]:
#1.
%timeit -n 30 df2.cut.astype('object').nunique()

2.77 ms ± 76.9 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)


In [50]:
%timeit -n 30 df2.cut.astype('category').nunique()

4.01 ms ± 87.1 µs per loop (mean ± std. dev. of 7 runs, 30 loops each)


In [51]:
#2. 
df2.cut=df2.cut.astype('category').cat.reorder_categories(['Fair','Good','Very Good','Premium','Ideal'], ordered=True)
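
Once cut is ordered, sorting follows the category order rather than alphabetical order. The sketch below uses a hypothetical four-row frame instead of diamonds.csv. Because the toy data lacks 'Very Good', it uses set_categories; reorder_categories would raise here, since it requires exactly the existing categories.

```python
import pandas as pd

order = ["Fair", "Good", "Very Good", "Premium", "Ideal"]
# Hypothetical toy frame standing in for the diamonds data.
toy = pd.DataFrame({"cut": ["Ideal", "Fair", "Premium", "Good"],
                    "price": [326, 337, 326, 327]})

# set_categories also works when some of the listed categories are absent from the data.
toy["cut"] = toy["cut"].astype("category").cat.set_categories(order, ordered=True)

best_first = toy.sort_values("cut", ascending=False)
print(best_first["cut"].tolist())  # ['Ideal', 'Premium', 'Good', 'Fair']
```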