# Selecting a Column or Subset of Columns

Indexing a GroupBy object created from a DataFrame with a column name or array of column names has the effect of selecting those columns for aggregation. This means that:

In [1]:
import pandas as pd
import numpy as np
from pandas import DataFrame, Series

In [2]:
df = DataFrame({'key1' : ['a', 'a', 'b', 'b', 'a'],
                'key2' : ['one', 'two', 'one', 'two', 'one'],
                'data1' : np.arange(5),
                'data2' : np.random.randn(5),
                'data3' : ['aa', 'bb', 'bb', 'aa', 'bb']})

In [3]:
df.groupby('key1')['data1']

df.groupby('key1')[['data2']]

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001BC6B34B730>

are syntactic sugar for:

In [4]:
df['data1'].groupby(df['key1'])

df[['data2']].groupby(df['key1'])

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x000001BC6B34BDC0>

In [5]:
for name, group in df.groupby('key1')['data1']:
    print(name)
    print(group)

a
0    0
1    1
4    4
Name: data1, dtype: int32
b
2    2
3    3
Name: data1, dtype: int32
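
The equivalence between indexing the GroupBy object and selecting the column before grouping can be checked directly. The following is a minimal sketch, not part of the original notebook: it rebuilds a similar frame with a fixed random seed (an assumption, so the run is repeatable) and compares the aggregated results of both spellings.

```python
import numpy as np
import pandas as pd

# Frame mirroring the one above; the seeded generator is an assumption
# made here so that the example is repeatable.
rng = np.random.default_rng(0)
df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": np.arange(5),
    "data2": rng.standard_normal(5),
})

# Scalar column name: index the GroupBy object ...
sugar_series = df.groupby("key1")["data1"].sum()
# ... versus select the column first, then group by the key Series.
explicit_series = df["data1"].groupby(df["key1"]).sum()

# List of column names: the same comparison for the DataFrame case.
sugar_frame = df.groupby("key1")[["data2"]].sum()
explicit_frame = df[["data2"]].groupby(df["key1"]).sum()

print(sugar_series.equals(explicit_series))
# Raises an AssertionError if the two frames differ.
pd.testing.assert_frame_equal(sugar_frame, explicit_frame)
```

Both spellings should give the same aggregated values; the indexed form is simply shorter to write.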


Especially for large data sets, it may be desirable to aggregate only a few columns. For example, in the above data set, to compute sums for just the data2 column and get the result as a DataFrame, we could write:

In [6]:
df.groupby(['key1', 'key2'])[['data2']].sum()

              data2
key1 key2
a    one   1.409536
     two  -0.803993
b    one   0.285942
     two   0.232361
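
The same indexing works with more than one column. As a rough sketch (the deterministic `data2` values below are a stand-in chosen for illustration, not the random values from the notebook), passing a list of names restricts the aggregation to exactly those columns:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    "data1": np.arange(5),
    # Deterministic stand-in for the random column (an assumption).
    "data2": np.arange(5) * 10.0,
    "data3": ["aa", "bb", "bb", "aa", "bb"],
})

# Only data1 and data2 are selected, so data3 does not appear in the result.
subset_sums = df.groupby(["key1", "key2"])[["data1", "data2"]].sum()
print(subset_sums)
```

The result is a DataFrame indexed by the (key1, key2) pairs with one column per selected name.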


The object returned by this indexing operation is a grouped DataFrame if a list or array is passed, and a grouped Series if just a single column name is passed as a scalar:

In [7]:
df_grouped = df.groupby(['key1', 'key2'])[['data2']]

In [8]:
df_grouped.sum()

              data2
key1 key2
a    one   1.409536
     two  -0.803993
b    one   0.285942
     two   0.232361
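
To make the scalar-versus-list distinction concrete, the sketch below inspects the types involved. It is an illustration added here under the assumption of a small, deterministic frame; the variable names `as_series` and `as_frame` are hypothetical.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key1": ["a", "a", "b", "b", "a"],
    "key2": ["one", "two", "one", "two", "one"],
    # Deterministic stand-in for the random column (an assumption).
    "data2": np.arange(5, dtype=float),
})

grouped = df.groupby(["key1", "key2"])

as_series = grouped["data2"]    # scalar column name
as_frame = grouped[["data2"]]   # one-element list of column names

# The grouped objects differ in type ...
print(type(as_series).__name__)
print(type(as_frame).__name__)

# ... and so do the aggregated results.
print(type(as_series.sum()).__name__)
print(type(as_frame.sum()).__name__)
```

With a scalar name the aggregation returns a Series; with a list, even a one-element list, it returns a DataFrame holding the same values in a single column.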
