# Pandas - The GroupBy Object in Detail

## 1 Syntax

### 1.1 Creating a GroupBy object
- A GroupBy object can be created with `pandas.DataFrame.groupby()` or `pandas.Series.groupby()`.
```
DataFrame.groupby(by=None, axis=0, level=None, as_index=True,
                  sort=True, group_keys=True, squeeze=False,
                  **kwargs)
```

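- Parameters:
    * by : mapping, function, str, or iterable
    * axis : int, default 0
    * level : int, level name, or sequence of such, default None (with a MultiIndex, selects the index level or levels to group by)
    * as_index : boolean, default True (the `by` keys become the index of the aggregated result); with False, the `by` columns stay as regular columns.
    * sort : boolean, default True (sort the group keys)
    * group_keys : boolean, default True (when calling `apply`, add the group keys to the index to identify the pieces)
    * squeeze : boolean, default False (reduce the dimensionality of the return type if possible)
- Returns:
    * GroupBy object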
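
The parameters above can be tried on a toy frame. The following is a minimal sketch; the column names, index labels, and the lambda are made up for illustration and are not part of the original notebook.

```python
import pandas as pd

# Hypothetical toy data, chosen so that sorted and unsorted key order differ.
df = pd.DataFrame(
    {"letter": ["b", "a", "b", "a"], "value": [1, 2, 3, 4]},
    index=["x1", "y1", "x2", "y2"],
)

# by=str: group on a column name; group keys are sorted by default.
sorted_sum = df.groupby("letter")["value"].sum()

# sort=False keeps the groups in order of first appearance.
unsorted_sum = df.groupby("letter", sort=False)["value"].sum()

# by=function: the function is called on each index label.
by_prefix = df.groupby(lambda label: label[0])["value"].sum()

# as_index=False returns the group key as a regular column.
flat = df.groupby("letter", as_index=False)["value"].sum()

print(sorted_sum, unsorted_sum, by_prefix, flat, sep="\n\n")
```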

### 1.2 Attributes of the returned object
* `GroupBy.__iter__()`: Groupby iterator
* `GroupBy.groups`: dict {group name -> group labels}
* `GroupBy.indices`: dict {group name -> group indices}
* `GroupBy.get_group(name[, obj])`: Constructs NDFrame from group with provided name
* `Grouper([key, level, freq, axis, sort])`: A Grouper allows the user to specify a groupby instruction for a target
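
As an illustration of these attributes, here is a small sketch on made-up data showing iteration, `groups`, `indices`, and `get_group`.

```python
import pandas as pd

# Hypothetical toy data with a non-default index, so labels and positions differ.
df = pd.DataFrame(
    {"letter": ["a", "a", "b"], "one": [1, 3, 5]},
    index=[10, 11, 12],
)
grouped = df.groupby("letter")

# Iterating yields (group name, sub-DataFrame) pairs.
for name, frame in grouped:
    print(name, len(frame))

# .groups maps each group name to the index labels of its rows.
group_labels = {name: list(labels) for name, labels in grouped.groups.items()}

# .indices maps each group name to the integer positions of its rows.
group_positions = {name: list(pos) for name, pos in grouped.indices.items()}

# get_group returns the rows that belong to one group.
group_a = grouped.get_group("a")
print(group_labels, group_positions, group_a, sep="\n")
```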

### 1.3 Methods of the returned object

#### 1.3.1 Methods common to DataFrame and Series
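1. Statistical functions
    * `GroupBy.sum()`: sum of each group
    * `GroupBy.ohlc()`: open, high, low, and close values of each group
    * `GroupBy.prod()`: product of each group
    * `GroupBy.var([ddof])`: variance, excluding missing values
    * `GroupBy.std([ddof])`: standard deviation, excluding missing values
    * `GroupBy.sem([ddof])`: standard error of the mean, excluding missing values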
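2. Descriptive functions
    * `GroupBy.size()`: group size (number of rows, missing values included)
    * `GroupBy.count()`: number of elements in each group, excluding missing values
    * `GroupBy.max()`: maximum of each group
    * `GroupBy.min()`: minimum of each group
    * `GroupBy.median()`: median of each group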
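3. Indexing functions
    * `GroupBy.first()`: the first non-null value of each column within each group;
    * `GroupBy.last()`: the last non-null value of each column within each group;
    * `GroupBy.head([n])`: the first n rows of each group;
    * `GroupBy.tail([n])`: the last n rows of each group;
    * `GroupBy.nth(n[, dropna])`: the n-th row of each group.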
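
Several of these methods differ only in how they treat missing values. The sketch below uses made-up data to contrast `size` with `count` and `first` with `head`, and to show the effect of `ddof` on `var`.

```python
import pandas as pd

# Hypothetical toy data with missing values in both groups.
df = pd.DataFrame({
    "letter": ["a", "a", "c", "c", "c"],
    "one": [None, 3.0, 5.0, 3.0, None],
})
g = df.groupby("letter")["one"]

sizes = g.size()          # counts rows, NaN included
counts = g.count()        # counts non-null values only
firsts = g.first()        # first non-null value of each group
heads = g.head(1)         # first row of each group, NaN kept, original index
var_sample = g.var()      # sample variance (ddof=1 by default)
var_pop = g.var(ddof=0)   # population variance

print(sizes, counts, firsts, heads, var_sample, var_pop, sep="\n\n")
```

Group `a` has a single non-null value, so its sample variance is NaN while its population variance is 0.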

#### 1.3.2 Methods specific to DataFrame
* `DataFrameGroupBy.agg(arg, *args, **kwargs)`: Aggregate using input function or dict of {column -> function}
* `DataFrameGroupBy.all([axis, bool_only, …])`: Return whether all elements are True over requested axis
* `DataFrameGroupBy.any([axis, bool_only, …])`: Return whether any element is True over requested axis
* `DataFrameGroupBy.bfill([limit])`: Backward fill the values
* `DataFrameGroupBy.corr([method, min_periods])`: Compute pairwise correlation of columns, excluding NA/null values
* `DataFrameGroupBy.cov([min_periods])`: Compute pairwise covariance of columns, excluding NA/null values
* `DataFrameGroupBy.cummax([axis, skipna])`: Return cumulative max over requested axis.
* `DataFrameGroupBy.cummin([axis, skipna])`: Return cumulative minimum over requested axis.
* `DataFrameGroupBy.cumprod([axis])`: Cumulative product for each group
* `DataFrameGroupBy.cumsum([axis])`: Cumulative sum for each group
* `DataFrameGroupBy.describe([percentiles, …])`: Generate various summary statistics, excluding NaN values.
* `DataFrameGroupBy.diff([periods, axis])`: 1st discrete difference of object
* `DataFrameGroupBy.ffill([limit])`: Forward fill the values
* `DataFrameGroupBy.rank([axis, method, …])`: Compute numerical data ranks (1 through n) along axis.
* `DataFrameGroupBy.resample(rule, *args, **kwargs)`: Provide resampling when using a TimeGrouper
* `DataFrameGroupBy.shift([periods, freq, axis])`: Shift each group by periods observations
* `DataFrameGroupBy.tshift([periods, freq, axis])`: Shift the time index, using the index’s frequency if available.
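
A few of the methods listed above, sketched on made-up data. `agg` reduces each group to one row, while `cumsum`, `shift`, and `ffill` return results aligned with the original rows. The optional `axis` arguments are deliberately not used here, since their availability is assumed to vary between pandas versions.

```python
import pandas as pd

# Hypothetical toy data; "two" has one missing value in group "a".
df = pd.DataFrame({
    "letter": ["a", "a", "b", "b"],
    "one": [1.0, 3.0, 5.0, 7.0],
    "two": [2.0, None, 6.0, 8.0],
})
g = df.groupby("letter")

# agg with a dict applies a different function to each column.
totals = g.agg({"one": "sum", "two": "max"})

# Cumulative sum restarts at the beginning of every group.
running = g["one"].cumsum()

# shift moves values down by one row inside each group.
previous = g["one"].shift(1)

# ffill propagates the last valid value forward inside each group.
filled = g["two"].ffill()

print(totals, running, previous, filled, sep="\n\n")
```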

#### 1.3.3 Methods specific to Series
* `SeriesGroupBy.nlargest(*args, **kwargs)`: Return the largest n elements.
* `SeriesGroupBy.nsmallest(*args, **kwargs)`: Return the smallest n elements.
* `SeriesGroupBy.nunique([dropna])`: Returns number of unique elements in the group
* `SeriesGroupBy.unique()`: Return np.ndarray of unique values in the object.
* `SeriesGroupBy.value_counts([normalize, …])`: Count the occurrences of each unique value within each group
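
The Series-only methods can be sketched as follows, again on made-up data. Selecting a single column from a grouped DataFrame gives a `SeriesGroupBy`.

```python
import pandas as pd

# Hypothetical toy data; the value 5 appears twice in group "b".
df = pd.DataFrame({
    "letter": ["a", "a", "b", "b", "b"],
    "one": [1, 3, 5, 5, 7],
})
s = df.groupby("letter")["one"]

top = s.nlargest(1)        # largest value of each group
n_unique = s.nunique()     # number of distinct values per group
uniques = s.unique()       # array of distinct values per group
counts = s.value_counts()  # occurrences of each value, indexed by (letter, one)

print(top, n_unique, uniques, counts, sep="\n\n")
```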

## 2 Examples

In [1]:
import pandas as pd
import sys

In [2]:
print('Python version ' + sys.version)
print('Pandas version ' + pd.__version__)

Python version 3.6.5 (v3.6.5:f59c0932b4, Mar 28 2018, 17:00:18) [MSC v.1900 64 bit (AMD64)]
Pandas version 0.23.0


In [3]:
#  Our small data set
d = {'one':[1,3,5,7,5, 3, None, None],
     'two':[2,4,6,8,2, 4, None, None],
     'letter':['a','a','b','b','c', 'c', 'c', 'd']}

# Create dataframe
df = pd.DataFrame(d)
df

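|   | one | two | letter |
|---|-----|-----|--------|
| 0 | 1.0 | 2.0 | a |
| 1 | 3.0 | 4.0 | a |
| 2 | 5.0 | 6.0 | b |
| 3 | 7.0 | 8.0 | b |
| 4 | 5.0 | 2.0 | c |
| 5 | 3.0 | 4.0 | c |
| 6 | NaN | NaN | c |
| 7 | NaN | NaN | d |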


In [4]:
# Create group object
one = df.groupby('letter', as_index=True)

### 2.1 Descriptive functions

In [5]:
display(
    'size', one.size(),          # group size
    'count()', one.count(),      # number of elements per group, excluding missing values
    'max()', one.max(),          # maximum of each group
    'median()', one.median(),    # median of each group
)

'size'

letter
a    2
b    2
c    3
d    1
dtype: int64

'count()'

| letter | one | two |
|--------|-----|-----|
| a | 2 | 2 |
| b | 2 | 2 |
| c | 2 | 2 |
| d | 0 | 0 |


'max()'

| letter | one | two |
|--------|-----|-----|
| a | 3.0 | 4.0 |
| b | 7.0 | 8.0 |
| c | 5.0 | 4.0 |
| d | NaN | NaN |


'median()'

| letter | one | two |
|--------|-----|-----|
| a | 2.0 | 3.0 |
| b | 6.0 | 7.0 |
| c | 4.0 | 3.0 |
| d | NaN | NaN |


### 2.2 Indexing functions

In [6]:
one.first()
display(
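    'first', one.first(),   # first non-null value of each column in each group
    'last', one.last(),     # last non-null value of each column in each group
    'head', one.head(1),    # first n rows of each group
    'tail', one.tail(1),    # last n rows of each group
    # 'nth', one.nth(0),    # n-th row of each group (the argument n is required)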
)

'first'

| letter | one | two |
|--------|-----|-----|
| a | 1.0 | 2.0 |
| b | 5.0 | 6.0 |
| c | 5.0 | 2.0 |
| d | NaN | NaN |


'last'

| letter | one | two |
|--------|-----|-----|
| a | 3.0 | 4.0 |
| b | 7.0 | 8.0 |
| c | 3.0 | 4.0 |
| d | NaN | NaN |


'head'

|   | one | two | letter |
|---|-----|-----|--------|
| 0 | 1.0 | 2.0 | a |
| 2 | 5.0 | 6.0 | b |
| 4 | 5.0 | 2.0 | c |
| 7 | NaN | NaN | d |


'tail'

|   | one | two | letter |
|---|-----|-----|--------|
| 1 | 3.0 | 4.0 | a |
| 3 | 7.0 | 8.0 | b |
| 6 | NaN | NaN | c |
| 7 | NaN | NaN | d |
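
The `nth` call is commented out in the cell above because it needs an explicit `n`. A minimal sketch of how it might be used on the same data follows. Note that `nth(-1)` takes the last row as-is, NaN included, whereas `last()` skips missing values. The index of the `nth` result is assumed to differ between pandas versions, so only the values are inspected here.

```python
import pandas as pd

# Same data as the notebook's small data set.
df = pd.DataFrame({
    "one": [1, 3, 5, 7, 5, 3, None, None],
    "two": [2, 4, 6, 8, 2, 4, None, None],
    "letter": ["a", "a", "b", "b", "c", "c", "c", "d"],
})
one = df.groupby("letter")

# Second row of each group; group "d" has only one row and is omitted.
second_rows = one.nth(1)["one"].tolist()

# Last row of each group, taken as-is (NaN rows are kept).
last_rows = one.nth(-1)["one"]

# last() instead returns the last non-null value of each column.
last_valid = one.last()["one"]

print(second_rows)
print(last_rows)
print(last_valid)
```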


### 2.3 Aggregation functions: `sum()`, `count()`

In [7]:
letter_one = df.groupby(['letter','one'])
display(
    'count', letter_one.count(), 
    'sum', letter_one.sum(),
    'prod', letter_one.prod(),
    'ohlc', letter_one.ohlc(),
    'cumcount', letter_one.cumcount(),
)

'count'

| letter | one | two |
|--------|-----|-----|
| a | 1.0 | 1 |
| a | 3.0 | 1 |
| b | 5.0 | 1 |
| b | 7.0 | 1 |
| c | 3.0 | 1 |
| c | 5.0 | 1 |


'sum'

| letter | one | two |
|--------|-----|-----|
| a | 1.0 | 2.0 |
| a | 3.0 | 4.0 |
| b | 5.0 | 6.0 |
| b | 7.0 | 8.0 |
| c | 3.0 | 4.0 |
| c | 5.0 | 2.0 |


'prod'

| letter | one | two |
|--------|-----|-----|
| a | 1.0 | 2.0 |
| a | 3.0 | 4.0 |
| b | 5.0 | 6.0 |
| b | 7.0 | 8.0 |
| c | 3.0 | 4.0 |
| c | 5.0 | 2.0 |


'ohlc'

| letter | one | two (open) | two (high) | two (low) | two (close) |
|--------|-----|------------|------------|-----------|-------------|
| a | 1.0 | 2.0 | 2.0 | 2.0 | 2.0 |
| a | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| b | 5.0 | 6.0 | 6.0 | 6.0 | 6.0 |
| b | 7.0 | 8.0 | 8.0 | 8.0 | 8.0 |
| c | 3.0 | 4.0 | 4.0 | 4.0 | 4.0 |
| c | 5.0 | 2.0 | 2.0 | 2.0 | 2.0 |


'cumcount'

0    0
1    0
2    0
3    0
4    0
5    0
6    0
7    1
dtype: int64
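
In the output above every `(letter, one)` group holds a single row, so `ohlc` repeats the same value four times and `cumcount` is almost always 0. The sketch below uses made-up data with larger groups, where the two methods are easier to read: `ohlc` reports the first, maximum, minimum, and last value of each group, and `cumcount` numbers the rows inside each group starting from 0.

```python
import pandas as pd

# Hypothetical toy data with several rows per group and no missing keys.
df = pd.DataFrame({
    "letter": ["a", "a", "a", "b", "b"],
    "two": [2, 8, 4, 6, 1],
})

# open = first value, high = max, low = min, close = last value.
ohlc = df.groupby("letter")["two"].ohlc()

# Position of each row within its own group.
position = df.groupby("letter").cumcount()

print(ohlc)
print(position)
```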

In [8]:
letter_one.count()

| letter | one | two |
|--------|-----|-----|
| a | 1.0 | 1 |
| a | 3.0 | 1 |
| b | 5.0 | 1 |
| b | 7.0 | 1 |
| c | 3.0 | 1 |
| c | 5.0 | 1 |


### 2.4 Set `as_index=False` to keep the grouping columns out of the index

In [9]:
letterone = df.groupby(['letter','one'], as_index=False).sum()
letterone

|   | letter | one | two |
|---|--------|-----|-----|
| 0 | a | 1.0 | 2.0 |
| 1 | a | 3.0 | 4.0 |
| 2 | b | 5.0 | 6.0 |
| 3 | b | 7.0 | 8.0 |
| 4 | c | 3.0 | 4.0 |
| 5 | c | 5.0 | 2.0 |


In [10]:
letterone.index

Int64Index([0, 1, 2, 3, 4, 5], dtype='int64')
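
For comparison, a similar flat result can be obtained by aggregating with the default `as_index=True` and then calling `reset_index()`. The following sketch on made-up data checks that both routes give the same columns and values.

```python
import pandas as pd

# Hypothetical toy data without missing values.
df = pd.DataFrame({
    "letter": ["a", "a", "b"],
    "one": [1, 1, 5],
    "two": [2, 4, 6],
})

# Route 1: keep the group keys as ordinary columns from the start.
flat = df.groupby(["letter", "one"], as_index=False).sum()

# Route 2: aggregate with the keys as a MultiIndex, then flatten it.
via_reset = df.groupby(["letter", "one"]).sum().reset_index()

print(flat)
print(via_reset)
```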

### 2.5 Tips for converting `DataFrame => dict`