In [1]:
import pandas as pd
import numpy as np

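## Contents


**1. The cat accessor**

* 1.1. Attributes of the cat accessor
* 1.2. Adding, removing, and modifying categories


**2. Ordered categories**
* 2.1. Establishing an order
* 2.2. Sorting and comparison


**3. Interval categories**
* 3.1. Building intervals with cut and qcut
* 3.2. Constructing general intervals
* 3.3. Interval attributes and methods


## Study notes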

In [2]:
# Read the CSV file

df = pd.read_csv('../data/learn_pandas.csv')
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22


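### 1. The cat accessor

#### 1.1. Attributes of the cat accessor

* pandas provides the **category dtype** for working with categorical variables. A Series can be converted to it with astype

* A Series of category dtype exposes the cat accessor. Like the str accessor, **cat has many attributes and methods**.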



In [6]:
# category dtype

s = df['Grade'].astype("category")

s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

* A specific category dtype has two components: the **categories themselves**, stored as an Index, and **whether they are ordered**. Both can be read from attributes of cat

In [7]:
# category Index

s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore'], dtype='object')

In [8]:
# category order

s.cat.ordered

False

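* After a Series is converted to category dtype, each of its categories is assigned an integer code. The code is the position of the category in cat.categories, and the codes can be accessed through cat.codes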


In [9]:
# integer codes of the first five rows
s.cat.codes.head()

0    0
1    0
2    2
3    3
4    3
dtype: int8
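
To make the link between codes and categories concrete, here is a minimal sketch on a small made-up Series (`grades` and its values are illustrative and not taken from learn_pandas.csv). It relies on the default behaviour of astype("category"), which sorts string categories lexically.

```python
import pandas as pd

# Hypothetical data, used only for illustration
grades = pd.Series(['Senior', 'Freshman', 'Junior', 'Freshman']).astype('category')

# Categories are stored in lexical order by default
print(grades.cat.categories.tolist())

# Each code is the position of the value's category in cat.categories
print(grades.cat.codes.tolist())

# Indexing the categories with the codes recovers the original labels
restored = grades.cat.categories[grades.cat.codes.to_numpy()]
print(restored.tolist())
```

A missing value would get the code -1, so this lookup is only safe when the Series has no NaN.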

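#### 1.2. Adding, removing, and modifying categories

The methods of the cat accessor can **add, remove, and modify** categories. The signatures below include inplace as it existed in the pandas version used for these notes; that argument was deprecated in pandas 1.3 and removed in pandas 2.0.


* Add: **add_categories**(new_categories, inplace=False)  
    adds new categories
       
   
  
* Remove: **remove_categories**(removals, inplace=False)    
    removes categories; values that belonged to a removed category become NaN  
    **set_categories**(new_categories, ordered=None, rename=False, inplace=False)  
    sets a new list of categories; values whose category is not in the new list become NaN  
    
     
  
* Modify: **rename_categories**(new_categories: list-like, dict-like or callable)
    renames categories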
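
The methods listed above return a new Series, so a version-independent way to use them is to assign the result back instead of passing inplace. The sketch below does that on a small made-up Series (`s_demo` is a hypothetical name and not part of the notebook).

```python
import pandas as pd

# Hypothetical data, used only for illustration
s_demo = pd.Series(['a', 'b', 'a']).astype('category')

s_demo = s_demo.cat.add_categories('c')            # categories: a, b, c
s_demo = s_demo.cat.remove_categories('b')         # values equal to 'b' become NaN
s_demo = s_demo.cat.rename_categories({'a': 'A'})  # relabel 'a' as 'A'
s_demo = s_demo.cat.remove_unused_categories()     # drop 'c', which no value uses

print(s_demo.tolist())
print(s_demo.cat.categories.tolist())
```

Assigning back works on both old and new pandas versions, whereas inplace=True raises a TypeError from pandas 2.0 onward.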

In [10]:
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

In [13]:
# add (assign the result back; the inplace argument was removed in pandas 2.0)

s = s.cat.add_categories('Graduate')
s.cat.categories

Index(['Freshman', 'Junior', 'Senior', 'Sophomore', 'Graduate'], dtype='object')

In [14]:
# remove (assign the result back; the inplace argument was removed in pandas 2.0)

s = s.cat.remove_categories('Freshman')
s.head()

0          NaN
1          NaN
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Junior', 'Senior', 'Sophomore', 'Graduate']

In [15]:
# set_categories (assign the result back; the inplace argument was removed in pandas 2.0)

s = s.cat.set_categories(['Sophomore','student'])
s.head()

0          NaN
1          NaN
2          NaN
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (2, object): ['Sophomore', 'student']

In [17]:
# remove_unused

s.cat.remove_unused_categories()

0            NaN
1            NaN
2            NaN
3      Sophomore
4      Sophomore
         ...    
195          NaN
196          NaN
197          NaN
198          NaN
199    Sophomore
Name: Grade, Length: 200, dtype: category
Categories (1, object): ['Sophomore']

In [18]:
# rename ('大学二年级' is Chinese for "second-year undergraduate")

s.cat.rename_categories({'Sophomore':'大学二年级'})

0        NaN
1        NaN
2        NaN
3      大学二年级
4      大学二年级
       ...  
195      NaN
196      NaN
197      NaN
198      NaN
199    大学二年级
Name: Grade, Length: 200, dtype: category
Categories (2, object): ['大学二年级', 'student']

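### 2. Ordered categories
#### 2.1. Establishing an order


Ordered and unordered categories can be converted into each other with **as_unordered and reorder_categories**

* as_unordered()
    takes no category arguments and returns the data marked as unordered




* reorder_categories(new_categories, ordered, inplace)
      
    Parameters
    
> * new_categories: the categories in their new order; they must be the same items as the current categories
> * ordered: whether the categories are treated as ordered; if it is not given, the existing ordering information is left unchanged
> * inplace: whether to modify the original data in place (removed in pandas 2.0)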

In [29]:
s = df.Grade.astype('category')
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Junior', 'Senior', 'Sophomore']

In [31]:
# reorder_categories ordered=True

s = s.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered = True)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

In [32]:
# reorder_categories ordered=False

s = s.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered = False)
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']

In [33]:
# as_unordered
s = s.cat.as_unordered()
s.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman', 'Sophomore', 'Junior', 'Senior']

#### 2.2. Sorting and comparison

* Once the categories of a Series have been given an order, **sort_index** and **sort_values** sort by that category order
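
As a quick illustration of why the declared order matters, the sketch below sorts the same labels twice, once as plain strings and once as an ordered categorical. The data and the names `raw`, `levels`, and `ordered` are made up for this example.

```python
import pandas as pd

levels = ['Freshman', 'Sophomore', 'Junior', 'Senior']
raw = pd.Series(['Senior', 'Freshman', 'Junior', 'Sophomore'])  # hypothetical data

# Plain strings sort alphabetically
alphabetical = raw.sort_values().tolist()
print(alphabetical)

# An ordered categorical sorts by the declared category order
ordered = raw.astype(pd.CategoricalDtype(categories=levels, ordered=True))
by_level = ordered.sort_values().tolist()
print(by_level)
```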


In [37]:
# convert Grade to an ordered categorical
df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered = True)

df.Grade.head()

0     Freshman
1     Freshman
2       Senior
3    Sophomore
4    Sophomore
Name: Grade, dtype: category
Categories (4, object): ['Freshman' < 'Sophomore' < 'Junior' < 'Senior']

In [41]:
# sort_values

df.sort_values(by='Grade',inplace = True)
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
105,Fudan University,Freshman,Qiang Shi,Female,164.5,52.0,N,1,2019/12/11,0:04:23
96,Peking University,Freshman,Changmei Feng,Female,163.8,56.0,N,3,2019/11/8,0:04:41
88,Peking University,Freshman,Xiaopeng Han,Female,164.1,53.0,N,1,2019/12/18,0:05:20
81,Tsinghua University,Freshman,Yanli Zhang,Female,165.1,52.0,N,1,2019/9/13,0:05:05


In [44]:
# sort_index

df = df.set_index('Grade')
df.sort_index().head()

Grade,School,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
Freshman,Shanghai Jiao Tong University,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
Freshman,Fudan University,Qiang Shi,Female,164.5,52.0,N,1,2019/12/11,0:04:23
Freshman,Peking University,Changmei Feng,Female,163.8,56.0,N,3,2019/11/8,0:04:41
Freshman,Peking University,Xiaopeng Han,Female,164.1,53.0,N,1,2019/12/18,0:05:20
Freshman,Tsinghua University,Yanli Zhang,Female,165.1,52.0,N,1,2019/9/13,0:05:05


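* Besides sorting, an ordered categorical Series also supports **comparison operations**:  
    * == and != can compare against a scalar or against a Series (or list) of the same length
    * \>, >=, <, <= work on similar operands, but every element taking part in the comparison must belong to the categories of the original Series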
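
The comparison rules above can be sketched on a small ordered categorical. The data and the names `levels` and `g` are made up, and 'Graduate' is deliberately a label that is not among the categories.

```python
import pandas as pd

levels = ['Freshman', 'Sophomore', 'Junior', 'Senior']
g = pd.Series(['Senior', 'Freshman', 'Junior'],
              dtype=pd.CategoricalDtype(levels, ordered=True))  # hypothetical data

eq_scalar = (g == 'Freshman').tolist()                  # equality against a scalar
eq_list = (g == ['Senior', 'Junior', 'Junior']).tolist()  # equality against an equal-length list
ge_scalar = (g >= 'Junior').tolist()                    # order comparison uses the category order
print(eq_scalar, eq_list, ge_scalar)

# An order comparison against a label outside the categories is rejected
try:
    g > 'Graduate'
    raised = False
except TypeError as err:
    raised = True
    print('TypeError:', err)
```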

In [60]:
df = pd.read_csv('../data/learn_pandas.csv')
df.Grade = df.Grade.astype('category')
df.Grade = df.Grade.cat.reorder_categories(['Freshman', 'Sophomore', 'Junior', 'Senior'],ordered = True)
df.head()

Unnamed: 0,School,Grade,Name,Gender,Height,Weight,Transfer,Test_Number,Test_Date,Time_Record
0,Shanghai Jiao Tong University,Freshman,Gaopeng Yang,Female,158.9,46.0,N,1,2019/10/5,0:04:34
1,Peking University,Freshman,Changqiang You,Male,166.5,70.0,N,1,2019/9/4,0:04:20
2,Shanghai Jiao Tong University,Senior,Mei Sun,Male,188.9,89.0,N,2,2019/9/12,0:05:22
3,Fudan University,Sophomore,Xiaojuan Sun,Female,,41.0,N,2,2020/1/3,0:04:08
4,Fudan University,Sophomore,Gaojuan You,Male,174.0,74.0,N,2,2019/11/6,0:05:22


In [61]:
# ==

df['Grade']== 'Freshman'

0       True
1       True
2      False
3      False
4      False
       ...  
195    False
196    False
197    False
198    False
199    False
Name: Grade, Length: 200, dtype: bool

In [62]:
# == against a list of the same length
df['Grade'] == ['Freshman']*df.shape[0]

0       True
1       True
2      False
3      False
4      False
       ...  
195    False
196    False
197    False
198    False
199    False
Name: Grade, Length: 200, dtype: bool

In [63]:
# <=

df['Grade']<='Sophomore'

0       True
1       True
2      False
3       True
4       True
       ...  
195    False
196    False
197    False
198    False
199     True
Name: Grade, Length: 200, dtype: bool

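### 3. Interval categories

#### 3.1. Building intervals with cut and qcut

As I understand it, interval categories sort values into classes according to a defined set of interval ranges

* Interval categories are a special kind of category. In practical data analysis, interval Series are usually built with **cut and qcut**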


* **pd.cut**(
    x,
    bins,
    right:bool=True,
    labels=None,
    retbins:bool=False,
    precision:int=3,
    include_lowest:bool=False,
    duplicates:str='raise',
    ordered:bool=True,
)

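    Parameters
    
    
 > * x: the 1-D array-like to be binned
 > * bins: if an integer n is passed, the range between the minimum and maximum of the input is divided into n segments of equal width, left-open and right-closed by default; a list of bin edges can also be passed
 > * right: bool=True; set to False for left-closed, right-open intervals
 > * labels=None: names for the intervals
 > * retbins: bool=False; whether to also return the bin edges
 > * precision: int=3; the precision used to store and display the bin labels
 > * include_lowest: bool=False; whether the first interval includes its left edge
 > * duplicates: str='raise'; when bin edges are not unique, either raise an error ('raise') or drop the duplicates ('drop')
 > * ordered: bool=True; whether the returned labels are ordered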
 
 
 
 * **pd.qcut**(    x,
    q,
    labels=None,
    retbins:bool=False,
    precision:int=3,
    duplicates:str='raise',
 )    
 
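     Parameters:
     
 > *     The parameters of qcut are essentially the same as those of cut, except that bins is replaced by q: q=n splits the data into bins at the n-quantiles, and a list of floats gives the quantile cut points directly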
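
The difference between the two functions is easiest to see on skewed data. The sketch below uses a small made-up Series (`values` is hypothetical): cut splits the value range into equal widths, while qcut splits the observations at quantiles.

```python
import pandas as pd

values = pd.Series([1, 2, 3, 4, 100])  # hypothetical, skewed data

by_width = pd.cut(values, bins=2)  # two bins of equal width between 1 and 100
by_count = pd.qcut(values, q=2)    # two bins separated at the median

# Bin membership of each value, expressed as category codes
print(by_width.cat.codes.tolist())
print(by_count.cat.codes.tolist())

# Number of observations per bin
print(by_width.value_counts())
print(by_count.value_counts())
```

With equal-width bins the outlier sits alone in the upper bin, whereas the quantile-based bins hold roughly the same number of observations.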

In [3]:
# bins

s = pd.Series([1,2])
pd.cut(s,bins = 2)

0    (0.999, 1.5]
1      (1.5, 2.0]
dtype: category
Categories (2, interval[float64]): [(0.999, 1.5] < (1.5, 2.0]]

In [4]:
# right

s = pd.Series([1,2])
pd.cut(s,bins = 2,right = False)

0      [1.0, 1.5)
1    [1.5, 2.001)
dtype: category
Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)]
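
The include_lowest parameter from the pd.cut signature also affects which edge values are captured. A minimal sketch with explicit bin edges follows; the data and the names `scores` and `edges` are made up for illustration.

```python
import pandas as pd

scores = pd.Series([0, 5, 10])  # hypothetical data
edges = [0, 5, 10]

# Default: intervals are (0, 5] and (5, 10], so the value 0 falls outside
default = pd.cut(scores, bins=edges)

# include_lowest=True makes the first interval include its left edge
with_lowest = pd.cut(scores, bins=edges, include_lowest=True)

print(default.isna().tolist())
print(with_lowest.isna().tolist())
```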

In [6]:
# labels

s = pd.Series([1,2])
pd.cut(s,
       bins = 2,
       right = False,
       labels = ['bad','good'])

0     bad
1    good
dtype: category
Categories (2, object): ['bad' < 'good']

In [8]:
# retbins

s = pd.Series([1,2])
pd.cut(s,
       bins = 2,
       right = False,
       retbins = True)

(0      [1.0, 1.5)
 1    [1.5, 2.001)
 dtype: category
 Categories (2, interval[float64]): [[1.0, 1.5) < [1.5, 2.001)],
 array([1.   , 1.5  , 2.001]))

In [11]:
# precision

s = pd.Series([1,2,5])
pd.cut(s,
       bins = 2,
       precision = 5)

0    (0.996, 3.0]
1    (0.996, 3.0]
2      (3.0, 5.0]
dtype: category
Categories (2, interval[float64]): [(0.996, 3.0] < (3.0, 5.0]]

In [12]:
# duplicates

s = pd.Series(np.array([2, 4, 6, 8, 10]),
             index=['a', 'b', 'c', 'd', 'e'])
s

a     2
b     4
c     6
d     8
e    10
dtype: int32

In [17]:
# duplicates='drop' (the labels are the Chinese numerals for one to four)
pd.cut(s, [0, 2, 4, 6, 10, 10],
       labels=['一','二','三','四'], 
       retbins=True,
       right=False, 
       duplicates='drop')

(a      二
 b      三
 c      四
 d      四
 e    NaN
 dtype: category
 Categories (4, object): ['一' < '二' < '三' < '四'], array([ 0,  2,  4,  6, 10]))

In [18]:
pd.cut(s, [0, 2, 4, 6, 10, 10],
       labels=['一','二','三','四'], 
       retbins=True,
       right=False, 
       duplicates='raise')

ValueError: Bin edges must be unique: array([ 0,  2,  4,  6, 10, 10]).
You can drop duplicate edges by setting the 'duplicates' kwarg

In [20]:
# qcut

s = df.Weight
pd.qcut(s,q=3).head()

0    (33.999, 48.0]
1      (55.0, 89.0]
2      (55.0, 89.0]
3    (33.999, 48.0]
4      (55.0, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 48.0] < (48.0, 55.0] < (55.0, 89.0]]

In [21]:
s = df.Weight
pd.qcut(s,q=[0,0.2,0.8,1]).head()

0      (44.0, 69.4]
1      (69.4, 89.0]
2      (69.4, 89.0]
3    (33.999, 44.0]
4      (69.4, 89.0]
Name: Weight, dtype: category
Categories (3, interval[float64]): [(33.999, 44.0] < (44.0, 69.4] < (69.4, 89.0]]

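#### 3.2. Constructing general intervals

* An interval can be constructed with **pd.Interval**. A specific interval has three elements: the left endpoint, the right endpoint, and the open or closed state of the endpoints, which is specified as one of 'right', 'left', 'both', 'neither'

In [22]:
# pd.Interval

pd.Interval(0,1,'right')

Interval(0, 1, closed='right')

* **in** tests whether an element belongs to the interval

* **overlaps** tests whether two intervals intersect

* **pd.IntervalIndex** is an Index whose elements are all intervals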

In [24]:
# in 
my_interval = pd.Interval(0,1,'right')
0.5 in my_interval

True

In [25]:
# overlaps
my_interval_2 = pd.Interval(0.5, 1.5, 'left')
my_interval.overlaps(my_interval_2)

True
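
The pd.IntervalIndex object mentioned in the list above has no example in these notes. One way to think of it is as an Index in which every element is an interval. The sketch below builds one in two ways with the class methods from_breaks and from_arrays; the breakpoints are made up for illustration.

```python
import pandas as pd

# Consecutive breakpoints 0, 1, 2 give the intervals (0, 1] and (1, 2]
from_breaks = pd.IntervalIndex.from_breaks([0, 1, 2], closed='right')

# The same intervals from separate arrays of left and right endpoints
from_arrays = pd.IntervalIndex.from_arrays([0, 1], [1, 2], closed='right')

print(from_breaks)
same = from_breaks.equals(from_arrays)
print(same)

# Element-wise test of which intervals contain the value 1
hits = from_breaks.contains(1).tolist()
print(hits)
```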

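#### 3.3. Interval attributes and methods

* IntervalIndex also defines some useful attributes and methods. To analyze the result of cut or qcut in detail, first convert it to this index type

* Common attributes: left, right, mid, and length give the left endpoint, the right endpoint, the midpoint of the two endpoints, and the interval length.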

In [27]:
# IntervalIndex
id_interval = pd.IntervalIndex(pd.cut(s, 3))
id_interval

IntervalIndex([(33.945, 52.333], (52.333, 70.667], (70.667, 89.0], (33.945, 52.333], (70.667, 89.0] ... (33.945, 52.333], (33.945, 52.333], (33.945, 52.333], (70.667, 89.0], (33.945, 52.333]],
              closed='right',
              name='Weight',
              dtype='interval[float64]')

In [29]:
# left

id_demo = id_interval[:5]
id_demo.left

Float64Index([33.945, 52.333, 70.667, 33.945, 70.667], dtype='float64')
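
The remaining attributes work the same way as left. Here is a small sketch with explicit bin edges, so the numbers are easy to check by hand; `weights` is a made-up stand-in for df.Weight.

```python
import pandas as pd

weights = pd.Series([40.0, 55.0, 70.0, 85.0])  # hypothetical stand-in for df.Weight

# Bin into (30, 60] and (60, 90], then convert the result to an IntervalIndex
idx = pd.IntervalIndex(pd.cut(weights, bins=[30, 60, 90]))

print(idx.left.tolist())    # left endpoint of each row's interval
print(idx.right.tolist())   # right endpoint
print(idx.mid.tolist())     # midpoint of the two endpoints
print(idx.length.tolist())  # right minus left
```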

### 4. Exercises

#### Ex2: Diamonds dataset