### The groupby function

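#### 1. Basic usage
The call takes the form DataFrame.groupby([variables]) or Series.groupby([variables]) and returns a groupby object (a DataFrameGroupBy or a SeriesGroupBy, respectively). Aggregations such as .mean(), .sum() or .count() can then be applied to that object, much like a GROUP BY operation on a dataset in SAS.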
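
As a minimal sketch of the aggregation step described above, the snippet below groups a small, made-up DataFrame and applies .mean(), .sum() and .count(). The data and the variable names (`group_mean`, `b_sum`, and so on) are illustrative assumptions.

```python
import pandas as pd

# Hypothetical toy data, used only for illustration
df = pd.DataFrame({'A': ['a', 'a', 'b'], 'B': [1, 2, 3], 'C': [4, 6, 5]})

g = df.groupby('A')          # DataFrameGroupBy object
group_mean = g.mean()        # per-group mean of B and C
group_sum = g.sum()          # per-group sum of B and C
group_count = g.count()      # per-group count of non-null values

s = df['B'].groupby(df['A'])  # SeriesGroupBy object
b_sum = s.sum()               # per-group sum of B only

print(group_mean)
print(b_sum)
```

Each aggregated result is indexed by the group key A rather than by the original row labels.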

In [4]:
# Working through the example from the official documentation
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1,2,3], 'C': [4,6, 5]})
print(df)
g = df.groupby(['A'])
print("g = df.groupby(['A']):")
print(g)
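df_1 = g.apply(lambda x: x / x.sum())  # combine groupby with apply
print('Result after combining with apply:')
print(df_1)
# Judging from the result, g (the DataFrameGroupBy object) still behaves logically like the DataFrame,
# only with grouping information attached (think of it as the DataFrame split into several sub-DataFrames by the group key)

   A  B  C
0  a  1  4
1  a  2  6
2  b  3  5
g = df.groupby(['A']):
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x111ae2198>
Result after combining with apply: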
          B    C
0  0.333333  0.4
1  0.666667  0.6
2  1.000000  1.0
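
The within-group share computed above can also be sketched with transform, which returns a result aligned with the original row index. This is an alternative, not the method used in the cell above. The numeric columns are selected explicitly, on the assumption that some pandas versions pass the string key column A to the function, where the division would fail.

```python
import pandas as pd

df = pd.DataFrame({'A': 'a a b'.split(), 'B': [1, 2, 3], 'C': [4, 6, 5]})

# Select the numeric columns explicitly so the key column A is never divided
g = df.groupby('A')[['B', 'C']]

# transform applies the function within each group and keeps the original index
share = g.transform(lambda x: x / x.sum())
print(share)
```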


In [13]:
# The grouping keys do not have to come from the DataFrame itself
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
                            'key2':['one', 'two', 'one', 'two', 'one'],
                            'data1':np.random.randn(5),
                            'data2':np.random.randn(5)})

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])
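g1 = df['data1'].groupby([states, years])  # grouping keys come from other data (the number of rows must match)
g2 = df.groupby([states, years])['data1']  # note: these two forms are equivalent, which is what "logically like the DataFrame" means
g3 = df.groupby([states, years])[['data1']]  # note: this also makes it clear why g3 has a different type from g1/g2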
print(g1)
print(g2)
print(g3)
df_1 = g1.mean()
print(type(df_1))  # note: the result is actually a Series, just with a MultiIndex
print(df_1.index)
print(df_1)

<pandas.core.groupby.groupby.SeriesGroupBy object at 0x119894898>
<pandas.core.groupby.groupby.SeriesGroupBy object at 0x119894358>
<pandas.core.groupby.groupby.DataFrameGroupBy object at 0x1198944e0>
<class 'pandas.core.series.Series'>
MultiIndex(levels=[['California', 'Ohio'], [2005, 2006]],
           labels=[[0, 0, 1, 1], [0, 1, 0, 1]])
California  2005   -0.655099
            2006   -0.249379
Ohio        2005    1.942186
            2006    0.426500
Name: data1, dtype: float64
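
A small sketch to check the points made in the comments above: the two Series forms give the same group means, the double-bracket form gives a DataFrame, and the MultiIndex Series can be pivoted with unstack(). The fixed seed and the names `m1`, `m2`, `m3` and `wide` are assumptions made for illustration.

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)  # fixed seed, assumed here only for reproducibility
df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': rng.standard_normal(5),
                   'data2': rng.standard_normal(5)})

states = np.array(['Ohio', 'California', 'California', 'Ohio', 'Ohio'])
years = np.array([2005, 2005, 2006, 2005, 2006])

m1 = df['data1'].groupby([states, years]).mean()    # Series with a MultiIndex
m2 = df.groupby([states, years])['data1'].mean()    # same values as m1
m3 = df.groupby([states, years])[['data1']].mean()  # DataFrame with one column

print(np.allclose(m1.values, m2.values))
print(type(m3).__name__)

# Move the inner index level (year) into the columns
wide = m1.unstack()
print(wide)
```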


In [19]:
# The contents of a groupby object can be extracted (consistent with viewing it as several split DataFrames)
import pandas as pd
import numpy as np

df = pd.DataFrame({'key1':['a', 'a', 'b', 'b', 'a'],
                            'key2':['one', 'two', 'one', 'two', 'one'],
                            'data1':np.random.randn(5),
                            'data2':np.random.randn(5)})

g1 = df.groupby(['key1', 'key2'])

g1_list = list(g1)  # easier to inspect in Spyder: a list whose elements are tuples, each made of a composite key and a DataFrame
print('Result of list(g1):')
print(g1_list)
g1_dict = dict(g1_list)
print('Result of dict(g1_list):')
print(g1_dict)
print('Using g1 directly in a for loop:')
for (k1, k2), group in g1:
    print(k1, k2)
    print(group)

Result of list(g1):
[(('a', 'one'),   key1 key2     data1     data2
0    a  one -0.718774  1.192843
4    a  one -1.643126 -0.157505), (('a', 'two'),   key1 key2     data1     data2
1    a  two  0.963136  1.117195), (('b', 'one'),   key1 key2     data1     data2
2    b  one  0.059704  0.864151), (('b', 'two'),   key1 key2     data1     data2
3    b  two -0.399036 -0.575472)]
Result of dict(g1_list):
{('a', 'one'):   key1 key2     data1     data2
0    a  one -0.718774  1.192843
4    a  one -1.643126 -0.157505, ('a', 'two'):   key1 key2     data1     data2
1    a  two  0.963136  1.117195, ('b', 'one'):   key1 key2     data1     data2
2    b  one  0.059704  0.864151, ('b', 'two'):   key1 key2     data1     data2
3    b  two -0.399036 -0.575472}
Using g1 directly in a for loop:
a one
  key1 key2     data1     data2
0    a  one -0.718774  1.192843
4    a  one -1.643126 -0.157505
a two
  key1 key2     data1     data2
1    a  two  0.963136  1.117195
b one
  key1 key2     data1     data2
2    b  one  0.059704  0.864151
b two
  key1 key2     data1     data2
3    b  two -0.399036 -0.575472
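
Besides iterating, a single group can be pulled out directly. The sketch below uses ngroups, size() and get_group() on the same key layout. The deterministic values in data1/data2 are an assumption that replaces the random numbers, so the result is easy to read.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({'key1': ['a', 'a', 'b', 'b', 'a'],
                   'key2': ['one', 'two', 'one', 'two', 'one'],
                   'data1': np.arange(5.0),        # illustrative values
                   'data2': np.arange(5.0) * 10})  # illustrative values

g1 = df.groupby(['key1', 'key2'])

n_groups = g1.ngroups              # number of distinct (key1, key2) pairs
sizes = g1.size()                  # rows per group, indexed by a MultiIndex
sub = g1.get_group(('a', 'one'))   # the sub-DataFrame for one composite key

print(n_groups)
print(sizes)
print(sub)
```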


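#### 2. Fama-MacBeth regression with groupby()
The idea of Fama-MacBeth regression is to run a cross-sectional regression at each date first and then examine the time series of the resulting coefficients. In SAS this is easy to do with "proc reg; by date;". Here we try to do the same in Python with groupby().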
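
The two steps can be written out as follows. This is a sketch in assumed notation (index i for observations, t for dates, k for regressors, s(·) for the sample standard deviation over dates), without an intercept to match the two-regressor example used in this section.

```latex
% Step 1: cross-sectional regression at each date t
y_{i,t} = \beta_{1,t}\, x_{1,i,t} + \beta_{2,t}\, x_{2,i,t} + \varepsilon_{i,t},
\qquad t = 1, \dots, T

% Step 2: time-series average of the estimated coefficients and its t-statistic
\hat{\beta}_k = \frac{1}{T} \sum_{t=1}^{T} \hat{\beta}_{k,t},
\qquad
t_k = \frac{\hat{\beta}_k}{s(\hat{\beta}_{k,t}) / \sqrt{T}}
```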

In [1]:
import pandas as pd
import numpy as np
import time
from sklearn.linear_model import LinearRegression

# First build some fake data: y = 0.5*x1 + 1.2*x2 plus noise, 20 dates with 20 observations per date (400 rows in total)

day = 60 * 60 * 24

start = time.mktime(time.strptime('2018-01-01', '%Y-%m-%d'))
date = [i*day+start for i in range(20)]*20
date = [time.strftime('%Y-%m-%d', time.localtime(i)) for i in date]
date = np.array(date)
x1 = 4 * np.random.randn(400)
x2 = 2 * np.random.randn(400)
y = 0.5*x1 + 1.2*x2 + np.random.randn(400)

df = pd.DataFrame({'date': date, 'y': y, 'x1': x1, 'x2': x2})

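# Then run a linear regression on each date's cross-section and store the results in beta, forming a time series

result = {}
model = LinearRegression(fit_intercept=False)  # no intercept, matching the simulated model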

for date, group in df.groupby('date'):
    x_gp = group[['x1', 'x2']].values.reshape(-1, 2)
    y_gp = group['y'].values.reshape(-1, 1)
    model.fit(x_gp, y_gp)
    b = model.coef_
    result[date] = {'x1': b[0][0], 'x2': b[0][1]}

beta = pd.DataFrame(result).T

print(beta)

                  x1        x2
2018-01-01  0.483233  1.255034
2018-01-02  0.523127  1.087927
2018-01-03  0.596617  1.145879
2018-01-04  0.487350  1.145565
2018-01-05  0.497071  1.057849
2018-01-06  0.349764  1.190401
2018-01-07  0.430165  1.216643
2018-01-08  0.551273  1.219016
2018-01-09  0.457809  1.062471
2018-01-10  0.439193  1.096075
2018-01-11  0.603069  1.283148
2018-01-12  0.510601  1.322092
2018-01-13  0.484822  1.239840
2018-01-14  0.596872  1.115434
2018-01-15  0.417653  1.149470
2018-01-16  0.511738  1.170869
2018-01-17  0.466286  1.146867
2018-01-18  0.445645  1.192179
2018-01-19  0.477549  1.233146
2018-01-20  0.552809  1.452900
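
The second step, looking at the time series of the coefficients, could be sketched as below: take the time-series mean of each coefficient and a t-statistic based on the standard error of that mean. The `beta` built here is a hypothetical stand-in with made-up values so the snippet runs on its own. In practice the `beta` DataFrame computed above would be used instead.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for the per-date coefficients (made-up values)
rng = np.random.default_rng(0)
dates = pd.date_range('2018-01-01', periods=20).strftime('%Y-%m-%d')
beta = pd.DataFrame({'x1': 0.5 + 0.05 * rng.standard_normal(20),
                     'x2': 1.2 + 0.05 * rng.standard_normal(20)},
                    index=dates)

T = len(beta)                            # number of cross-sections (dates)
fm_mean = beta.mean()                    # time-series average of each coefficient
fm_se = beta.std(ddof=1) / np.sqrt(T)    # standard error of that average
fm_t = fm_mean / fm_se                   # t-statistic against zero

summary = pd.DataFrame({'mean': fm_mean, 'se': fm_se, 't': fm_t})
print(summary)
```

This plain standard error treats the per-date coefficients as uncorrelated over time. Whether that is appropriate depends on the data.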
