# Common Pandas Operations

## Common Methods
> Reference: [Zhihu, 易执, common functions](https://zhuanlan.zhihu.com/p/106722583)  
> Reference: [pandas official API](https://pandas.pydata.org/pandas-docs/stable/reference/api/)

### Display Methods

In [1]:
import pandas as pd
import numpy as np

- head() returns the first 5 rows by default
- tail() returns the last 5 rows by default

In [2]:
data = pd.DataFrame({'company':[np.nan]+list("AACACABB"),
                     'salary':[43,8,28,42,33,20,48,25,39],
                     'age':[21,41,26,28,26,18,43,23,18]})

In [3]:
data.head()

Unnamed: 0,company,salary,age
0,,43,21
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26


In [4]:
data.tail()

Unnamed: 0,company,salary,age
4,A,33,26
5,C,20,18
6,A,48,43
7,B,25,23
8,B,39,18


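### Statistical Methods

- describe(): returns basic summary statistics for the numeric columns
- value_counts(): counts how often each category occurs in a categorical variable
  - normalize (boolean, default False): return the proportion of each category instead of the count
  - sort (boolean, default True): whether to sort the result by count
  - ascending (boolean, default False): whether to sort in ascending order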

In [5]:
data.describe()

Unnamed: 0,salary,age
count,9.0,9.0
mean,31.777778,27.111111
std,12.804079,9.143911
min,8.0,18.0
25%,25.0,21.0
50%,33.0,26.0
75%,42.0,28.0
max,48.0,43.0


In [6]:
data['company'].value_counts()

A    4
B    2
C    2
Name: company, dtype: int64

In [7]:
data['company'].value_counts(normalize=True)

A    0.50
B    0.25
C    0.25
Name: company, dtype: float64
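
The counts above leave out the row whose company is NaN. The following is a minimal sketch, assuming the same `company` values as the example data, of how `dropna=False` and `ascending=True` change the result:

```python
import numpy as np
import pandas as pd

company = pd.Series([np.nan] + list("AACACABB"), name="company")

# By default, NaN is excluded from the counts.
counts = company.value_counts()

# dropna=False keeps NaN as its own category.
counts_with_nan = company.value_counts(dropna=False)

# ascending=True lists the least frequent categories first.
rarest_first = company.value_counts(ascending=True)

print(counts_with_nan)
print(rarest_first)
```

Comparing `counts.sum()` with `counts_with_nan.sum()` is a quick way to see how many values were missing.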

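### Handling Missing Values

- isna(): tests whether each value is NaN
- any(), all(): test whether any or all values in each row/column are True; `empty` is an attribute (not a method) that reports whether the object has no elements
- dropna(): drops the rows that contain NaN
- fillna(): replaces NaN with the given values
  - value (scalar, dict, Series, or DataFrame): the value used to fill missing entries
  - method ({'backfill', 'bfill', 'pad', 'ffill', None}, default None): the fill strategy; bfill fills with the next valid value and ffill fills with the previous valid value (deprecated since pandas 2.1 in favor of the bfill() and ffill() methods)
  - inplace (boolean, default False): whether to modify the original object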

In [8]:
data.isna()

Unnamed: 0,company,salary,age
0,True,False,False
1,False,False,False
2,False,False,False
3,False,False,False
4,False,False,False
5,False,False,False
6,False,False,False
7,False,False,False
8,False,False,False


In [9]:
data.isna().any()

company     True
salary     False
age        False
dtype: bool

In [10]:
data.dropna()

Unnamed: 0,company,salary,age
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26
5,C,20,18
6,A,48,43
7,B,25,23
8,B,39,18


In [11]:
data.fillna('B')

Unnamed: 0,company,salary,age
0,B,43,21
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26
5,C,20,18
6,A,48,43
7,B,25,23
8,B,39,18


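### Sorting Methods

- sort_index(): sorts the data by its index
  - ascending (boolean, default True): whether to sort in ascending order
  - inplace (boolean, default False): whether to modify the original object
- sort_values(): sorts by the values of one or more columns
  - by (str or list of str): the column(s) to sort by; required when called on a DataFrame
  - ascending (boolean, default True): whether to sort in ascending order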

In [12]:
data.sort_index(ascending=False)

Unnamed: 0,company,salary,age
8,B,39,18
7,B,25,23
6,A,48,43
5,C,20,18
4,A,33,26
3,C,42,28
2,A,28,26
1,A,8,41
0,,43,21


In [13]:
data.sort_values(by='salary')

Unnamed: 0,company,salary,age
1,A,8,41
5,C,20,18
7,B,25,23
2,A,28,26
4,A,33,26
8,B,39,18
3,C,42,28
0,,43,21
6,A,48,43
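
`by` and `ascending` also accept lists. The sketch below sorts the same example data by two columns with different directions; the choice of columns and of `na_position="first"` is only for illustration:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "company": [np.nan] + list("AACACABB"),
    "salary": [43, 8, 28, 42, 33, 20, 48, 25, 39],
    "age": [21, 41, 26, 28, 26, 18, 43, 23, 18],
})

# Sort by company (A to Z), then by salary from high to low within each company.
# na_position="first" moves the row with a missing company to the top.
ordered = data.sort_values(
    by=["company", "salary"],
    ascending=[True, False],
    na_position="first",
)
print(ordered)
```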


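### Data Modification and Filtering Methods

- astype(): changes the data type of a column
- rename(): renames columns
  - columns (dict-like or function): maps old column names to new ones, usually passed as a dict
  - inplace (boolean, default False): whether to modify the original object
- set_index(): sets one or more columns of the DataFrame as the index
- reset_index(): resets the index; the new index defaults to 0 to len(df)-1
  - drop (boolean, default False): whether to discard the old index instead of inserting it as a column
  - inplace (boolean, default False): whether to modify the original object
- drop_duplicates(): removes duplicate values
- drop(): commonly used to remove columns from a DataFrame
- isin(): commonly used to build a boolean mask
- where(): keeps the values that satisfy the condition and replaces the rest with the given value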

In [14]:
data["age"] = data["age"].astype(int)

In [15]:
data.rename(columns={'age':'number'})

Unnamed: 0,company,salary,number
0,,43,21
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26
5,C,20,18
6,A,48,43
7,B,25,23
8,B,39,18


In [16]:
data.set_index('company')

Unnamed: 0_level_0,salary,age
company,Unnamed: 1_level_1,Unnamed: 2_level_1
,43,21
A,8,41
A,28,26
C,42,28
A,33,26
C,20,18
A,48,43
B,25,23
B,39,18


In [17]:
data.reset_index(drop=True)

Unnamed: 0,company,salary,age
0,,43,21
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26
5,C,20,18
6,A,48,43
7,B,25,23
8,B,39,18


In [18]:
data['company'].drop_duplicates()

0    NaN
1      A
3      C
7      B
Name: company, dtype: object

In [19]:
data.drop(columns = ['salary'])

Unnamed: 0,company,age
0,,21
1,A,41
2,A,26
3,C,28
4,A,26
5,C,18
6,A,43
7,B,23
8,B,18


In [20]:
data.loc[data['company'].isin(['A','C'])]

Unnamed: 0,company,salary,age
1,A,8,41
2,A,28,26
3,C,42,28
4,A,33,26
5,C,20,18
6,A,48,43


In [21]:
data['salary'].where(data.salary<=40,40)

0    40
1     8
2    28
3    40
4    33
5    20
6    40
7    25
8    39
Name: salary, dtype: int64
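
`where()` replaces the values that fail the condition, which is easy to read backwards. As a hedged comparison, the sketch below caps the same salaries at 40 in three ways; `mask()` and `clip()` are standard Series methods, while the variable names are only illustrative:

```python
import pandas as pd

salary = pd.Series([43, 8, 28, 42, 33, 20, 48, 25, 39], name="salary")

# where(): keep values where the condition holds, replace the others.
capped_where = salary.where(salary <= 40, 40)

# mask(): the inverse, replace values where the condition holds.
capped_mask = salary.mask(salary > 40, 40)

# clip(): the most direct way to express an upper bound.
capped_clip = salary.clip(upper=40)

print(capped_clip.tolist())
```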

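### Discretization

- cut(): discretizes a continuous variable into intervals
  - x (array-like): the one-dimensional data to discretize
  - bins (int, sequence of scalars, or IntervalIndex): the intervals to use, given either as a number of equal-width bins or as explicit bin edges
  - labels (array or bool, optional): labels for the resulting bins
- qcut(): discretizes a variable into quantile-based bins that hold roughly equal numbers of values
  - x (array-like): the one-dimensional data to discretize
  - q (integer or array of quantiles): the number of quantiles, or an explicit array of quantile edges
  - labels (array or boolean, default None): labels for the resulting bins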

In [22]:
pd.cut(data.salary,bins = 5)

0    (40.0, 48.0]
1    (7.96, 16.0]
2    (24.0, 32.0]
3    (40.0, 48.0]
4    (32.0, 40.0]
5    (16.0, 24.0]
6    (40.0, 48.0]
7    (24.0, 32.0]
8    (32.0, 40.0]
Name: salary, dtype: category
Categories (5, interval[float64]): [(7.96, 16.0] < (16.0, 24.0] < (24.0, 32.0] < (32.0, 40.0] < (40.0, 48.0]]

In [23]:
pd.cut(data.salary,bins = [0,10,20,30,40,50])

0    (40, 50]
1     (0, 10]
2    (20, 30]
3    (40, 50]
4    (30, 40]
5    (10, 20]
6    (40, 50]
7    (20, 30]
8    (30, 40]
Name: salary, dtype: category
Categories (5, interval[int64]): [(0, 10] < (10, 20] < (20, 30] < (30, 40] < (40, 50]]

In [24]:
pd.cut(data.salary,bins = [0,10,20,30,40,50],labels = ['低','中下','中','中上','高'])

0     高
1     低
2     中
3     高
4    中上
5    中下
6     高
7     中
8    中上
Name: salary, dtype: category
Categories (5, object): ['低' < '中下' < '中' < '中上' < '高']

In [25]:
pd.qcut(data.salary,q = 3)

0     (40.0, 48.0]
1    (7.999, 27.0]
2     (27.0, 40.0]
3     (40.0, 48.0]
4     (27.0, 40.0]
5    (7.999, 27.0]
6     (40.0, 48.0]
7    (7.999, 27.0]
8     (27.0, 40.0]
Name: salary, dtype: category
Categories (3, interval[float64]): [(7.999, 27.0] < (27.0, 40.0] < (40.0, 48.0]]

### Pivot Tables

- pivot_table(): builds a pivot table from a DataFrame
  - values (column to aggregate, optional): the column to aggregate (the target variable of the pivot)
  - index (column, Grouper, array, or list of the previous): analogous to the row labels of a spreadsheet pivot table
  - columns (column, Grouper, array, or list of the previous): analogous to the column labels of a spreadsheet pivot table
  - aggfunc (function, list of functions, dict, default mean): the aggregation applied to values

In [26]:
data.pivot_table(values = 'salary',index = 'company',columns = 'age',aggfunc='mean')

age,18,23,26,28,41,43
company,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
A,,,30.5,,8.0,48.0
B,39.0,25.0,,,,
C,20.0,,,42.0,,
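
A pivot table is essentially a group-by followed by a reshape. The sketch below, assuming the same example data, builds the table both ways and also shows `fill_value`, which replaces the empty cells:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame({
    "company": [np.nan] + list("AACACABB"),
    "salary": [43, 8, 28, 42, 33, 20, 48, 25, 39],
    "age": [21, 41, 26, 28, 26, 18, 43, 23, 18],
})

pivot = data.pivot_table(values="salary", index="company",
                         columns="age", aggfunc="mean")

# The same table via groupby: aggregate per (company, age), then move
# the age level of the index into the columns.
via_groupby = data.groupby(["company", "age"])["salary"].mean().unstack()

# fill_value replaces the cells that have no data.
filled = data.pivot_table(values="salary", index="company",
                          columns="age", aggfunc="mean", fill_value=0)
print(filled)
```

In both versions the row with a missing company is dropped, because NaN group keys are excluded by default.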


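## Row, Column, and Element Operations
> Reference: [Zhihu, 易执](https://zhuanlan.zhihu.com/p/100064394)  
> Reference: [pandas official API](https://pandas.pydata.org/pandas-docs/stable/reference/api/)

### The map() Method
- Signature
  ```py
  Series.map(arg, na_action=None)
  ```
- Note: this section uses the Series method; Index also has map(), and DataFrame gained an element-wise map() in pandas 2.1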

In [27]:
boolean=[True,False]
gender=["男","女"]
color=["white","black","yellow"]
data=pd.DataFrame({
    "height":np.random.randint(150,190,100),
    "weight":np.random.randint(40,90,100),
    "smoker":[boolean[x] for x in np.random.randint(0,2,100)],
    "gender":[gender[x] for x in np.random.randint(0,2,100)],
    "age":np.random.randint(15,90,100),
    "color":[color[x] for x in np.random.randint(0,len(color),100) ]
})
data

Unnamed: 0,height,weight,smoker,gender,age,color
0,183,83,True,男,22,yellow
1,185,58,False,男,89,black
2,157,40,True,女,47,black
3,150,63,False,男,79,white
4,187,89,True,男,37,black
...,...,...,...,...,...,...
95,162,56,True,男,87,black
96,163,79,True,男,43,white
97,172,72,True,女,22,black
98,182,59,True,男,72,yellow


In [28]:
data["gender"] = data["gender"].map({"男":1, "女":0})
data

Unnamed: 0,height,weight,smoker,gender,age,color
0,183,83,True,1,22,yellow
1,185,58,False,1,89,black
2,157,40,True,0,47,black
3,150,63,False,1,79,white
4,187,89,True,1,37,black
...,...,...,...,...,...,...
95,162,56,True,1,87,black
96,163,79,True,1,43,white
97,172,72,True,0,22,black
98,182,59,True,1,72,yellow


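### The apply() Method
- Signature
  ```py
  DataFrame.apply(func, axis=0, raw=False, result_type=None, args=(), **kwds)
  ```
- Note: both Series and DataFrame have this method; on a DataFrame, axis=0 passes each column to func and axis=1 passes each row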

In [29]:
data[["height","weight","age"]].apply(np.sum, axis=0)

height    16840
weight     6479
age        5137
dtype: int64

In [30]:
def BMI(series):
    weight = series["weight"]
    height = series["height"]/100
    BMI = weight/height**2
    return BMI

data["BMI"] = data.apply(BMI,axis=1)
data

Unnamed: 0,height,weight,smoker,gender,age,color,BMI
0,183,83,True,1,22,yellow,24.784258
1,185,58,False,1,89,black,16.946676
2,157,40,True,0,47,black,16.227839
3,150,63,False,1,79,white,28.000000
4,187,89,True,1,37,black,25.451114
...,...,...,...,...,...,...,...
95,162,56,True,1,87,black,21.338211
96,163,79,True,1,43,white,29.733900
97,172,72,True,0,22,black,24.337480
98,182,59,True,1,72,yellow,17.811858
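
Row-wise `apply(axis=1)` calls the Python function once per row, so for plain arithmetic a column expression is usually simpler and faster. A minimal sketch, assuming randomly generated `height` and `weight` columns like those above (`people` and `bmi_row` are illustrative names):

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
people = pd.DataFrame({
    "height": rng.integers(150, 190, 100),  # centimetres
    "weight": rng.integers(40, 90, 100),    # kilograms
})

def bmi_row(row):
    # Called once per row when used with apply(axis=1).
    return row["weight"] / (row["height"] / 100) ** 2

row_wise = people.apply(bmi_row, axis=1)

# The same formula written on whole columns at once.
vectorized = people["weight"] / (people["height"] / 100) ** 2

print(np.allclose(row_wise, vectorized))
```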


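### The applymap() Method
- Signature
  ```py
  DataFrame.applymap(func, na_action=None)
  ```
- Note: applies func to every element of the DataFrame, whereas apply() works on whole rows or columns; deprecated since pandas 2.1 in favor of DataFrame.map()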

In [31]:
df = pd.DataFrame({
        "A":np.random.randn(5),
        "B":np.random.randn(5),
        "C":np.random.randn(5),
        "D":np.random.randn(5),
        "E":np.random.randn(5),
    })
df.applymap(lambda x:"%.2f" % x)

Unnamed: 0,A,B,C,D,E
0,-1.54,0.41,-0.27,1.68,1.03
1,-0.68,-0.66,0.19,-0.38,-0.69
2,0.9,0.54,1.13,-0.49,1.26
3,-0.76,0.26,0.23,0.41,-0.39
4,-0.89,1.56,-0.06,1.53,0.48
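
Since pandas 2.1 the element-wise method is called `DataFrame.map()`. The sketch below picks whichever name the installed version provides; the `hasattr` check is just one possible way to stay compatible:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.random.default_rng(0).standard_normal((5, 3)),
                  columns=list("ABC"))

def two_decimals(x):
    return f"{x:.2f}"

# DataFrame.map exists from pandas 2.1; older versions only have applymap.
if hasattr(pd.DataFrame, "map"):
    formatted = df.map(two_decimals)
else:
    formatted = df.applymap(two_decimals)

# If numbers (not strings) are needed, round() keeps the float dtype.
rounded = df.round(2)

print(formatted)
```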


## Splitting Tables into Groups
> Reference: [Zhihu, 易执](https://zhuanlan.zhihu.com/p/101284491)  
> Reference: [pandas official API](https://pandas.pydata.org/pandas-docs/stable/reference/api/)

### The groupby() Method
- Signature
  ```py
  DataFrame.groupby(by=None, axis=0, level=None, as_index=True, sort=True, group_keys=True, observed=False, dropna=True)
  ```
- Note: both Series and DataFrame have this method; the `squeeze` parameter listed in older signatures was removed in pandas 2.0

In [32]:
company=["A","B","C"]
data=pd.DataFrame({
    "company":[company[x] for x in np.random.randint(0,len(company),10)],
    "salary":np.random.randint(5,50,10),
    "age":np.random.randint(15,50,10)
})
data

Unnamed: 0,company,salary,age
0,A,13,33
1,C,24,42
2,C,16,15
3,B,46,38
4,B,16,39
5,B,23,37
6,B,8,34
7,A,16,18
8,A,24,18
9,B,11,48


In [33]:
data.groupby("company").mean()

Unnamed: 0_level_0,salary,age
company,Unnamed: 1_level_1,Unnamed: 2_level_1
A,17.666667,23.0
B,20.8,39.2
C,20.0,28.5


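### The agg() Method
- Signature
  ```py
  DataFrame.agg(func=None, axis=0, *args, **kwargs)
  ```
- Note: both Series and DataFrame have this method
- Common functions
  |Function|min|max|sum|mean|median|std|var|count|
  |--|--|--|--|--|--|--|--|--|
  |Purpose|minimum|maximum|sum|mean|median|standard deviation|variance|number of non-NaN values|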

In [34]:
data.groupby("company").agg('mean')

Unnamed: 0_level_0,salary,age
company,Unnamed: 1_level_1,Unnamed: 2_level_1
A,17.666667,23.0
B,20.8,39.2
C,20.0,28.5


In [35]:
data.groupby('company').agg({'salary':'median','age':'mean'})

Unnamed: 0_level_0,salary,age
company,Unnamed: 1_level_1,Unnamed: 2_level_1
A,16,23.0
B,16,39.2
C,20,28.5
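
With a dict, the result columns keep the names of the source columns. Named aggregation lets each output column get its own name. A minimal sketch using the values from the table above (the output names `median_salary`, `mean_age`, and `headcount` are made up for this example):

```python
import pandas as pd

data = pd.DataFrame({
    "company": list("ACCBBBBAAB"),
    "salary": [13, 24, 16, 46, 16, 23, 8, 16, 24, 11],
    "age": [33, 42, 15, 38, 39, 37, 34, 18, 18, 48],
})

# Each keyword is an output column: new_name=(source_column, function).
summary = data.groupby("company").agg(
    median_salary=("salary", "median"),
    mean_age=("age", "mean"),
    headcount=("salary", "count"),
)
print(summary)
```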


In [36]:
avg_salary_dict = data.groupby('company')['salary'].mean().to_dict()
data['avg_salary'] = data['company'].map(avg_salary_dict)
data

Unnamed: 0,company,salary,age,avg_salary
0,A,13,33,17.666667
1,C,24,42,20.0
2,C,16,15,20.0
3,B,46,38,20.8
4,B,16,39,20.8
5,B,23,37,20.8
6,B,8,34,20.8
7,A,16,18,17.666667
8,A,24,18,17.666667
9,B,11,48,20.8


### The transform() Method
- Signature
  ```py
  DataFrame.transform(func, axis=0, *args, **kwargs)
  ```
- Note: both Series and DataFrame have this method

In [37]:
data['avg_salary'] = data.groupby('company')['salary'].transform('mean')
data

Unnamed: 0,company,salary,age,avg_salary
0,A,13,33,17.666667
1,C,24,42,20.0
2,C,16,15,20.0
3,B,46,38,20.8
4,B,16,39,20.8
5,B,23,37,20.8
6,B,8,34,20.8
7,A,16,18,17.666667
8,A,24,18,17.666667
9,B,11,48,20.8
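
The difference between aggregating and transforming is the shape of the result: an aggregation returns one row per group, while `transform()` returns one value per original row, aligned with the original index. A short sketch with the same values (the `salary_vs_avg` column is an illustrative addition):

```python
import pandas as pd

data = pd.DataFrame({
    "company": list("ACCBBBBAAB"),
    "salary": [13, 24, 16, 46, 16, 23, 8, 16, 24, 11],
})

# Aggregation: one value per group (3 rows, indexed by company).
per_group = data.groupby("company")["salary"].mean()

# Transform: the group mean broadcast back to every row (10 rows).
per_row = data.groupby("company")["salary"].transform("mean")

# Because per_row shares data's index, it can be used in row-wise arithmetic.
data["salary_vs_avg"] = data["salary"] - per_row
print(data)
```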


## Merging Tables
> Reference: [Zhihu, 晓伟, three methods explained](https://zhuanlan.zhihu.com/p/70438557)  
> Reference: [pandas official API](https://pandas.pydata.org/pandas-docs/stable/reference/api/)

### The concat() Function
- Signature
  ```py
  pandas.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)
  ```
- Note: this is a module-level function, not a method of Series or DataFrame

In [38]:
s1 = pd.Series(['a', 'b'])
s2 = pd.Series(['c', 'd'])
pd.concat([s1, s2])

0    a
1    b
0    c
1    d
dtype: object

The ignore_index parameter: discards the original index values

In [39]:
pd.concat([s1, s2], ignore_index=True)  # discard the original index and renumber from 0

0    a
1    b
2    c
3    d
dtype: object

The keys parameter: labels each input object in an outer index level

In [40]:
pd.concat([s1, s2], keys=['s1', 's2'])  # add an outer level of keys

s1  0    a
    1    b
s2  0    c
    1    d
dtype: object

The names parameter: names the levels of the resulting hierarchical index

In [41]:
pd.concat([s1, s2], keys=['s1', 's2'],
          names=['name', 'Row ID'])  # name the index levels

name  Row ID
s1    0         a
      1         b
s2    0         c
      1         d
dtype: object

The index and columns parameters of the DataFrame constructor: set the row and column labels of the tables used below

In [42]:
df1 = pd.DataFrame([['a', 1], ['b', 2]],
                   columns=['letter', 'number'])
df1

Unnamed: 0,letter,number
0,a,1
1,b,2


In [43]:
df3 = pd.DataFrame([['c', 3, 'cat'], ['d', 4, 'dog']],
                   columns=['letter', 'number', 'animal'],index=[0,1])
df3    # built with explicit row and column labels

Unnamed: 0,letter,number,animal
0,c,3,cat
1,d,4,dog


The join parameter: how to handle columns that are not shared (join types are explained in detail under merge)

In [44]:
pd.concat([df1, df3], join="inner")    

Unnamed: 0,letter,number
0,a,1
1,b,2
0,c,3
1,d,4


The axis parameter: axis=1 concatenates side by side

In [45]:
df4 = pd.DataFrame([['bird', 'polly'], ['monkey', 'george']],
                   columns=['animal', 'name'])
pd.concat([df1, df4], axis=1) 

Unnamed: 0,letter,number,animal,name
0,a,1,bird,polly
1,b,2,monkey,george


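### The append() Method
- Signature
  ```py
  DataFrame.append(other, ignore_index=False, verify_integrity=False, sort=False)
  ```
- Note: both Series and DataFrame had an append() method; it was deprecated in pandas 1.4 and removed in pandas 2.0
- It was a simplified concat() that could only append along the rows; use pd.concat() in current pandas

In [46]:
df = pd.DataFrame([[1, 2], [3, 4]], columns=list('AB'))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list('AB'))
pd.concat([df, df2])  # equivalent to df.append(df2) in pandas < 2.0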

Unnamed: 0,A,B
0,1,2
1,3,4
0,5,6
1,7,8
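
A common use of `append()` was to grow a table inside a loop. A hedged sketch of the usual replacement: collect the pieces in a Python list and build or concatenate once at the end (the `rows` data is invented for illustration):

```python
import pandas as pd

df = pd.DataFrame([[1, 2], [3, 4]], columns=list("AB"))
df2 = pd.DataFrame([[5, 6], [7, 8]], columns=list("AB"))

# Stack the rows and renumber the index from 0.
stacked = pd.concat([df, df2], ignore_index=True)

# Instead of appending row by row, gather plain dicts first ...
rows = [{"A": i, "B": i * 2} for i in range(3)]
# ... then create the DataFrame in one step.
built = pd.DataFrame(rows)

print(stacked)
print(built)
```

Building once avoids copying the whole table on every iteration.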


### The merge() Method
- Signature
  ```py
  pandas.merge(left, right, how='inner', on=None, left_on=None, right_on=None, left_index=False, right_index=False, sort=False, suffixes=('_x', '_y'), copy=True, indicator=False, validate=None)
  ```
- Note: merge() is available both as the module-level function pandas.merge() and as the DataFrame.merge() method

Merging with the DataFrame method

In [47]:
df1 = pd.DataFrame({'lkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [1, 2, 3, 5]})
df2 = pd.DataFrame({'rkey': ['foo', 'bar', 'baz', 'foo'],
                    'value': [5, 6, 7, 8]})
df1.merge(df2, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,value_x,rkey,value_y
0,foo,1,foo,5
1,foo,1,foo,8
2,foo,5,foo,5
3,foo,5,foo,8
4,bar,2,bar,6
5,baz,3,baz,7


Merging with the module-level function

In [48]:
left = pd.DataFrame({'key': ['K0', 'K1', 'K2'],
                     'A': ['A0', 'A1', 'A2'],
                     'B': ['B0', 'B1', 'B2']})
right = pd.DataFrame({'key': ['K1', 'K2', 'K3'],
                       'C': ['C1', 'C2', 'C3'],
                       'D': ['D1', 'D2', 'D3']}) 
pd.merge(left, right, on='key')

Unnamed: 0,key,A,B,C,D
0,K1,A1,B1,C1,D1
1,K2,A2,B2,C2,D2


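The how parameter: sets the join type
- The four common join types
  |how|Behavior|
  |--|--|
  |inner (default)|keep only keys present in both tables (intersection)|
  |outer|keep keys present in either table (union)|
  |left|keep every key of the left table|
  |right|keep every key of the right table|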

In [49]:
pd.merge(left, right, on='key', how='outer')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,,
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
3,K3,,,C3,D3


In [50]:
pd.merge(left, right, on='key', how='left')

Unnamed: 0,key,A,B,C,D
0,K0,A0,B0,,
1,K1,A1,B1,C1,D1
2,K2,A2,B2,C2,D2
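
Two further parameters from the signature help when checking a join. The sketch below, reusing the `left` and `right` tables, shows `indicator=True`, which records where each row came from, and `validate`, which raises an error if the keys are not as unique as expected (`dup_right` is a made-up table with repeated keys):

```python
import pandas as pd

left = pd.DataFrame({"key": ["K0", "K1", "K2"],
                     "A": ["A0", "A1", "A2"]})
right = pd.DataFrame({"key": ["K1", "K2", "K3"],
                      "C": ["C1", "C2", "C3"]})

# indicator=True adds a _merge column: left_only, right_only, or both.
audit = pd.merge(left, right, on="key", how="outer", indicator=True)
print(audit)

# validate checks the key relationship before merging.
dup_right = pd.concat([right, right], ignore_index=True)
try:
    pd.merge(left, dup_right, on="key", validate="one_to_one")
    validation_failed = False
except pd.errors.MergeError:
    validation_failed = True
print(validation_failed)
```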


## Data Cleaning and Preparation

### Handling Missing Values

In [52]:
string_data = pd.Series(['aardvark','art',np.nan,'avoc'])
string_data

0    aardvark
1         art
2         NaN
3        avoc
dtype: object

#### .isnull(): detecting missing values
- Signature
  ```py
  DataFrame.isnull()
  ```
- Note: Series also has this method; isnull() is an alias of isna()

In [53]:
string_data.isnull()

0    False
1    False
2     True
3    False
dtype: bool

#### dropna(): filtering out missing values
- Signature
  ```py
  DataFrame.dropna(axis=0, how='any', thresh=None, subset=None, inplace=False)
  ```
- Note: Series also has this method

In [55]:
data = pd.Series([1,np.nan,3.4,np.nan,7])

In [56]:
data.dropna()

0    1.0
2    3.4
4    7.0
dtype: float64

In [58]:
data[data.notnull()]  # equivalent to data.dropna()

0    1.0
2    3.4
4    7.0
dtype: float64

In [59]:
data = pd.DataFrame([[1,6.4,3],[1,np.nan,np.nan],[np.nan,np.nan,np.nan],[np.nan,6.5,3]])
data

Unnamed: 0,0,1,2
0,1.0,6.4,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [60]:
data.dropna(how='all')  # drop only the rows in which every value is NaN

Unnamed: 0,0,1,2
0,1.0,6.4,3.0
1,1.0,,
3,,6.5,3.0


In [61]:
data.dropna(how='all',axis=1)  # same rule applied to columns; no column is entirely NaN, so nothing is dropped

Unnamed: 0,0,1,2
0,1.0,6.4,3.0
1,1.0,,
2,,,
3,,6.5,3.0


In [62]:
data.dropna(thresh=2)  # keep only the rows with at least 2 non-NaN values

Unnamed: 0,0,1,2
0,1.0,6.4,3.0
3,,6.5,3.0
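
The `subset` parameter from the signature restricts the check to certain columns. A minimal sketch on the same 4x3 table, alongside `thresh` for comparison:

```python
import numpy as np
import pandas as pd

data = pd.DataFrame([[1, 6.4, 3],
                     [1, np.nan, np.nan],
                     [np.nan, np.nan, np.nan],
                     [np.nan, 6.5, 3]])

# Only column 0 is inspected; NaN in other columns is tolerated.
by_subset = data.dropna(subset=[0])

# thresh counts non-NaN values: a row survives with at least this many.
at_least_one = data.dropna(thresh=1)
at_least_two = data.dropna(thresh=2)

print(by_subset.index.tolist())
print(at_least_one.index.tolist())
print(at_least_two.index.tolist())
```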


### Filling In Missing Values

#### .fillna(): replacing NaN values
- Signature
  ```py
  DataFrame.fillna(value=None, method=None, axis=None, inplace=False, limit=None, downcast=None)
  ```
- Note: Series also has this method

In [66]:
df = pd.DataFrame(np.random.randn(7,3))
df.iloc[:4,1] = np.nan
df.iloc[:2,2] = np.nan
df

Unnamed: 0,0,1,2
0,-0.581135,,
1,-0.630551,,
2,-0.446898,,-0.98633
3,-1.842301,,0.479967
4,-1.183843,1.562471,0.170157
5,-0.001579,-0.003075,-0.25122
6,1.719569,1.709399,-0.22312


In [67]:
df.fillna(0)

Unnamed: 0,0,1,2
0,-0.581135,0.0,0.0
1,-0.630551,0.0,0.0
2,-0.446898,0.0,-0.98633
3,-1.842301,0.0,0.479967
4,-1.183843,1.562471,0.170157
5,-0.001579,-0.003075,-0.25122
6,1.719569,1.709399,-0.22312


In [69]:
df.fillna({1:0.5,2:1})

Unnamed: 0,0,1,2
0,-0.581135,0.5,1.0
1,-0.630551,0.5,1.0
2,-0.446898,0.5,-0.98633
3,-1.842301,0.5,0.479967
4,-1.183843,1.562471,0.170157
5,-0.001579,-0.003075,-0.25122
6,1.719569,1.709399,-0.22312


In [72]:
df.bfill()  # same result as df.fillna(method='backfill'), which is deprecated since pandas 2.1

Unnamed: 0,0,1,2
0,-0.581135,1.562471,-0.98633
1,-0.630551,1.562471,-0.98633
2,-0.446898,1.562471,-0.98633
3,-1.842301,1.562471,0.479967
4,-1.183843,1.562471,0.170157
5,-0.001579,-0.003075,-0.25122
6,1.719569,1.709399,-0.22312


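## Data Transformation

### Removing Duplicates with .duplicated()
- Signature
  ```py
  DataFrame.duplicated(subset=None, keep='first')
  ```
- Purpose: returns, for each row, whether it duplicates an earlier row
- Note: Series, DataFrame, and Index all have this method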

In [74]:
data = pd.DataFrame({'k1':['one','two']*3 + ['two'],
                     'k2':[1,1,2,3,4,3,4]})
data

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,4
5,two,3
6,two,4


In [75]:
data.duplicated()

0    False
1    False
2    False
3    False
4    False
5     True
6    False
dtype: bool

In [76]:
data.drop_duplicates()  # drop the rows flagged True above

Unnamed: 0,k1,k2
0,one,1
1,two,1
2,one,2
3,two,3
4,one,4
6,two,4


In [77]:
data.drop_duplicates(['k1'])

Unnamed: 0,k1,k2
0,one,1
1,two,1


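### Transforming Data with map()
- Signature
  ```py
  Series.map(arg, na_action=None)
  ```
- Note: this is the Series method; Index also has map(), and DataFrame gained an element-wise map() in pandas 2.1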

In [80]:
data = pd.DataFrame({'food':['bacon','pulled pork','bacon','Pastrami','corned beef','Bacon','pastrami','honey ham','nova lox'],
                     'ounces':[4,3,12,6,7.5,8,3,5,6]})
meat_to_animal = {'bacon':'pig','pulled pork':'pig','pastrami':'cow','corned beef':'cow','honey ham':'pig','nova lox':'salmon'}

In [85]:
lowercased = data['food'].str.lower()
data['animal'] = lowercased.map(meat_to_animal)
data

Unnamed: 0,food,ounces,animal
0,bacon,4.0,pig
1,pulled pork,3.0,pig
2,bacon,12.0,pig
3,Pastrami,6.0,cow
4,corned beef,7.5,cow
5,Bacon,8.0,pig
6,pastrami,3.0,cow
7,honey ham,5.0,pig
8,nova lox,6.0,salmon


In [86]:
data['food'].map(lambda x:meat_to_animal[x.lower()])

0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
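
The two forms above behave differently when a value has no entry in the mapping: passing the dict yields NaN, while the lambda with `[]` raises `KeyError`. A small sketch with an extra, unmapped item (`"tofu"` and the `"unknown"` default are invented for illustration):

```python
import pandas as pd

meat_to_animal = {"bacon": "pig", "pastrami": "cow"}
food = pd.Series(["bacon", "Pastrami", "tofu"])

# Dict lookup: unmapped values become NaN.
mapped = food.str.lower().map(meat_to_animal)

# Function with dict.get: unmapped values get an explicit default.
safe = food.map(lambda x: meat_to_animal.get(x.lower(), "unknown"))

# Function with []: an unmapped value raises KeyError.
try:
    food.map(lambda x: meat_to_animal[x.lower()])
    raised = False
except KeyError:
    raised = True

print(mapped)
print(safe)
```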

### Replacing Values with replace()
- Signature
  ```py
  DataFrame.replace(to_replace=None, value=None, inplace=False, limit=None, regex=False, method='pad')
  ```

In [87]:
data = pd.Series([1,-999,2,-999,-1000,3],dtype=float)

In [89]:
data.replace([-999,-1000],np.nan)

0    1.0
1    NaN
2    2.0
3    NaN
4    NaN
5    3.0
dtype: float64

In [90]:
data.replace({-999:np.nan,-1000:0})

0    1.0
1    NaN
2    2.0
3    NaN
4    0.0
5    3.0
dtype: float64

### Renaming Axis Indexes with .rename()
- Signature
  ```py
  DataFrame.rename(mapper=None, index=None, columns=None, axis=None, copy=True, inplace=False, level=None, errors='ignore')
  ```

In [92]:
data = pd.DataFrame(np.arange(12).reshape((3,4)),
                   index = ['Ohio','Colorado','New York'],
                   columns = ['one','two','three','four'])

In [95]:
transform = lambda x:x[:4].upper()
data.index = data.index.map(transform)  # assign back; the outputs below rely on the modified index
data.index

Index(['OHIO', 'COLO', 'NEW '], dtype='object')

In [96]:
data.rename(index=str.title,columns=str.upper)

Unnamed: 0,ONE,TWO,THREE,FOUR
Ohio,0,1,2,3
Colo,4,5,6,7
New,8,9,10,11


In [97]:
data.rename(index={'OHIO':'INDIANA'},columns={'three':'peekaboo'})

Unnamed: 0,one,two,peekaboo,four
INDIANA,0,1,2,3
COLO,4,5,6,7
NEW,8,9,10,11


### Discretization and Binning
pandas.cut(): binning by explicit edges
- Signature
  ```py
  pandas.cut(x, bins, right=True, labels=None, retbins=False, precision=3, include_lowest=False, duplicates='raise', ordered=True)
  ```

In [100]:
ages = [20,22,25,27,21,23,37,31,61,45,41,32]
bins = [18,25,35,60,100]
cats = pd.cut(ages,bins)

In [101]:
cats.codes

array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)

In [102]:
cats.categories

IntervalIndex([(18, 25], (25, 35], (35, 60], (60, 100]],
              closed='right',
              dtype='interval[int64]')

In [103]:
pd.Series(cats).value_counts()  # pd.value_counts() is deprecated since pandas 2.1

(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
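
The `right` and `labels` parameters from the signature change which side of each interval is closed and how the bins are named. A minimal sketch with the same `ages` and `bins` (the label names are invented for this example):

```python
import pandas as pd

ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]
bins = [18, 25, 35, 60, 100]

# Default right=True: intervals look like (18, 25], so 25 lands in the first bin.
right_closed = pd.cut(ages, bins)

# right=False: intervals look like [18, 25), so 25 moves to the second bin.
left_closed = pd.cut(ages, bins, right=False)

# labels replaces the interval notation with custom names.
group_names = ["youth", "young_adult", "middle_aged", "senior"]
labeled = pd.cut(ages, bins, labels=group_names)

print(right_closed.codes[2], left_closed.codes[2])
print(pd.Series(labeled).value_counts())
```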

qcut(): binning by sample quantiles
- Signature
  ```py
  pandas.qcut(x, q, labels=None, retbins=False, precision=3, duplicates='raise')
  ```

In [113]:
data = np.random.randn(1000)
pd.Series(pd.qcut(data,4)).value_counts()  # pd.value_counts() is deprecated since pandas 2.1

(0.725, 3.549]      250
(0.0755, 0.725]     250
(-0.611, 0.0755]    250
(-3.334, -0.611]    250
dtype: int64

### Detecting and Filtering Outliers

In [115]:
data = pd.DataFrame(np.random.rand(1000,4))
data.describe()

Unnamed: 0,0,1,2,3
count,1000.0,1000.0,1000.0,1000.0
mean,0.503165,0.50466,0.493694,0.499174
std,0.288792,0.287278,0.288524,0.290374
min,4.4e-05,0.004584,0.001117,0.001196
25%,0.25683,0.249286,0.248993,0.243958
50%,0.491925,0.515709,0.492989,0.486874
75%,0.758129,0.74297,0.740939,0.75439
max,0.997363,0.999846,0.999894,0.999856


In [130]:
col = data[2]
col[np.abs(col)>0.99]

2      0.994547
60     0.990128
326    0.995075
560    0.993024
589    0.999894
773    0.994718
Name: 2, dtype: float64

In [133]:
data[(np.abs(data)>0.999).any(axis=1)]  # axis must be passed by keyword in pandas 2.0+

Unnamed: 0,0,1,2,3
589,0.400718,0.562161,0.999894,0.763866
727,0.201976,0.150954,0.643154,0.999856
892,0.445703,0.999846,0.653783,0.684225
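
Once outliers are located, they are often capped rather than removed. The following is a sketch, not part of the original notebook, that caps values with `clip()`; the 0.99 threshold mirrors the filter above, and the limit of 3 for the normally distributed table is an arbitrary assumption:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)

# Uniform data in [0, 1), like the table above: cap everything at 0.99.
data = pd.DataFrame(rng.random((1000, 4)))
capped = data.clip(upper=0.99)

# Normally distributed data: cap symmetrically on both sides.
normal = pd.DataFrame(rng.standard_normal((1000, 4)))
limit = 3
capped_normal = normal.clip(lower=-limit, upper=limit)

# Number of rows that contained at least one value above the threshold.
affected_rows = (data > 0.99).any(axis=1).sum()
print(affected_rows)
print(capped.describe())
```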


### Permutation and Random Sampling

In [137]:
df = pd.DataFrame(np.arange(20).reshape(5,4))

In [139]:
sampler = np.random.permutation(5)

In [140]:
df.take(sampler)

Unnamed: 0,0,1,2,3
3,12,13,14,15
4,16,17,18,19
1,4,5,6,7
2,8,9,10,11
0,0,1,2,3
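
For sampling, `DataFrame.sample()` covers the common cases without building a permutation by hand. A short sketch on the same 5x4 table; `random_state=0` is only there to make the run repeatable:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(np.arange(20).reshape(5, 4))

# frac=1 returns every row in random order (a full shuffle).
shuffled = df.sample(frac=1, random_state=0)

# n=3 draws three distinct rows without replacement.
subset = df.sample(n=3, random_state=0)

# replace=True allows repeats, so n may exceed the number of rows.
with_repeats = df.sample(n=10, replace=True, random_state=0)

print(shuffled)
print(with_repeats.index.tolist())
```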


## String Operations

### String Object Methods

In [144]:
val = "a,b,  guido"
val.split(',')

['a', 'b', '  guido']

In [145]:
pieces = [x.strip() for x in val.split(',')]
pieces

['a', 'b', 'guido']

In [146]:
':'.join(pieces)

'a:b:guido'

In [149]:
val.index(',')

1
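
`index()` raises `ValueError` when the substring is absent, whereas `find()` returns -1. The sketch below shows this with the same `val`, and then the pandas counterpart of these string methods through the `.str` accessor (the two-element Series is an illustrative assumption):

```python
import pandas as pd

val = "a,b,  guido"

# find() returns -1 when the substring is missing.
missing_position = val.find(":")

# index() raises ValueError in the same situation.
try:
    val.index(":")
    index_failed = False
except ValueError:
    index_failed = True

comma_count = val.count(",")
replaced = val.replace(",", "::")

# On a Series, the same operations are vectorized and skip missing values.
s = pd.Series([val, None])
last_piece = s.str.split(",").str[-1].str.strip()

print(missing_position, index_failed, comma_count)
print(replaced)
print(last_piece)
```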