# Pandas

Pandas is used when you want to analyze a given dataset in Python.

Importing the modules

In [1]:
import numpy as np
import pandas as pd

Pandas objects: 1) Series 2) DataFrame

An object is "a container that holds data."

### Series

* Series parameters: data and index  
  A Series is very similar to a one-dimensional ndarray.  
  The difference: the index is shown in the output and can be set arbitrarily.

In [2]:
n = np.random.randn(5)
s = pd.Series(n, index = ['a','b','c','d','e'])
print(n,"\n")
print(s)

[ 0.10096968 -0.6858538  -0.02999197  1.12112752  0.42917744] 

a    0.100970
b   -0.685854
c   -0.029992
d    1.121128
e    0.429177
dtype: float64


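* There are two kinds of index: implicit and explicit  
  Implicit index: not shown in the output. It is the positional index (0, 1, 2, ...) that always exists.  
  Explicit index: shown in the output and can be set by the user. (If it is not set, it is the same as the implicit index.)  
  Ex) In the Series below, 'a' has implicit index 0 and explicit index '10'.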

In [3]:
np.array(['a','b','c','d'])
pd.Series(np.array(['a','b','c','d']), index = ['10','20','30','40'])

10    a
20    b
30    c
40    d
dtype: object
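
The difference between the two kinds of index can be seen by looking up the same element both ways. The following is a minimal sketch, assuming the Series from the cell above is bound to the hypothetical name `s_idx`: `.loc` looks up by explicit label, while `.iloc` looks up by implicit position.

```python
import numpy as np
import pandas as pd

# Hypothetical name s_idx: the same Series as in the cell above.
s_idx = pd.Series(np.array(['a', 'b', 'c', 'd']), index=['10', '20', '30', '40'])

by_label = s_idx.loc['10']    # explicit index: the label '10'
by_position = s_idx.iloc[0]   # implicit index: position 0

print(by_label, by_position)  # both refer to the same element
```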

In [4]:
# If no index is given, explicit index = implicit index
n = np.random.randn(5)
s = pd.Series(n)
s

0    1.269620
1    1.171949
2    1.333959
3    1.534646
4    0.573244
dtype: float64

* Attributes that describe a Series

.values : returns the elements of the Series

In [5]:
s.values

array([1.26961963, 1.17194902, 1.33395918, 1.53464567, 0.57324366])

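.index : returns the index of the Series (its explicit index; here none was set, so it equals the implicit one)

In [6]:
s.index # no index was set, so pandas uses a RangeIndex, which stores only start/stop/step rather than every label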

RangeIndex(start=0, stop=5, step=1)

.shape : returns the shape of the Series.

In [7]:
s.shape

(5,)

* The data parameter of a Series

ndarray, list, dictionary, scalar

In [8]:
# ndarray
series_ndarray = pd.Series(np.random.randn(5))
print(type(series_ndarray))
print(series_ndarray)

<class 'pandas.core.series.Series'>
0   -0.409701
1   -0.242804
2    0.777897
3   -0.066191
4    0.204592
dtype: float64


In [9]:
# list
series_list = pd.Series([1,2,4,8,16,32,64])
print(type(series_list))
print(series_list)

<class 'pandas.core.series.Series'>
0     1
1     2
2     4
3     8
4    16
5    32
6    64
dtype: int64


In [10]:
# Dictionary
series_dict = pd.Series({2:'a',1:'b',3:'c'})
print(type(series_dict))
print(series_dict)
# When no explicit index is given, the dictionary keys become the index.

<class 'pandas.core.series.Series'>
2    a
1    b
3    c
dtype: object


In [11]:
# When an index is given, any index label that is not a key of the dictionary gets NaN as its value.
series_dict_2 = pd.Series({2:'a',1:'b',3:'c'}, index = [3,2,5])
print(series_dict_2)

3      c
2      a
5    NaN
dtype: object


* Data Selection in Series

Series: data can be accessed through the corresponding index label

In [12]:
data = pd.Series([0.25,0.5,0.75,1.0], index = ['a','b','c','d'])

print(data['b'],'\n') # access data by index label

print('a' in data,'\n') # check whether a label exists in the index of data

print(data.keys(),'\n') # print the index labels of data

data['e'] = 1.25 # add a new element by assigning to a new label
print(data)

0.5 

True 

Index(['a', 'b', 'c', 'd'], dtype='object') 

a    0.25
b    0.50
c    0.75
d    1.00
e    1.25
dtype: float64


A Series has an order among its elements, so slicing and positional (numerical) indexing are possible.
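
Because the elements are ordered, a Series can be sliced either by label or by position. The sketch below rebuilds a four-element Series like the one above (an assumption made for self-containment) and uses `.loc` and `.iloc` to make the two modes explicit. Note that a label slice includes its end label, while a positional slice excludes its end position.

```python
import pandas as pd

data = pd.Series([0.25, 0.5, 0.75, 1.0], index=['a', 'b', 'c', 'd'])

label_slice = data.loc['a':'c']    # label slice: the end label 'c' is included
position_slice = data.iloc[0:2]    # positional slice: the end position 2 is excluded
third = data.iloc[2]               # positional (numerical) indexing

print(label_slice, position_slice, third, sep='\n\n')
```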

### DataFrame

* DataFrame parameters: data, index, columns  
  Similar to a two-dimensional NumPy array.  
  The difference: the index (row labels) and columns (column labels) are shown in the output.

In [13]:
# Create a Series
population_dict = {'California':28332521,
                  'Texas':26448193,
                  'New York':19651127,
                  'Florida':19552860,
                  'Illinois':12882135}
popdata = pd.Series(population_dict)

# Create a DataFrame
pd.DataFrame(popdata, columns = ['population'])

Unnamed: 0,population
California,28332521
Texas,26448193
New York,19651127
Florida,19552860
Illinois,12882135


When there are two or more data sources, and therefore two or more columns, values with the same index label are placed in the same row.

In [14]:
# Create a second Series
area_dict = {'California':423967,
              'Texas':695662,
              'New York':141297,
              'Florida':170312,
              'Illinois':149995}
areadata = pd.Series(area_dict)

# Create a DataFrame
states = pd.DataFrame({'population':popdata, 'area':areadata})
states

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


* The data parameter of a DataFrame

A dictionary whose values are Series
* The dictionary keys become the columns  
* The index of each Series becomes the row index  
* Entries missing for an index label are shown as NaN

In [15]:
d = {'one':pd.Series([1,2,3], index = ['a','b','c']),
    'two':pd.Series([1,2,3,4], index = ['a','b','c','d'])}
df = pd.DataFrame(d)
df

Unnamed: 0,one,two
a,1.0,1
b,2.0,2
c,3.0,3
d,,4


ndarray

Specify both index and columns. (If omitted, default integer labels 0, 1, 2, ... are used.)

In [16]:
pd.DataFrame(np.random.rand(3,2), columns=['foo','bar'], index = ['a','b','c'])

Unnamed: 0,foo,bar
a,0.972693,0.34872
b,0.355352,0.697218
c,0.028174,0.281607


* Attributes that describe a DataFrame

.index : returns the index of the DataFrame

In [17]:
df.index

Index(['a', 'b', 'c', 'd'], dtype='object')

.columns : returns the columns of the DataFrame

In [18]:
df.columns

Index(['one', 'two'], dtype='object')

.shape : returns the shape of the DataFrame

In [19]:
df.shape

(4, 2)

* Data Selection in DataFrame

[], loc, iloc.  
"Prefer loc and iloc over []."

[]

In [20]:
states

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


Indexing with []: only column labels are allowed

In [21]:
states['area'] # the result drops one dimension (DataFrame -> Series)

California    423967
Texas         695662
New York      141297
Florida       170312
Illinois      149995
Name: area, dtype: int64

Slicing with []: selects rows

In [22]:
states['California':'Florida'] # the dimension is preserved (still a DataFrame)

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312


Boolean Masking

In [23]:
states[states['area']>200000]

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662


In [24]:
states[states>np.mean(states,axis=0)]

Unnamed: 0,population,area
California,28332521.0,423967.0
Texas,26448193.0,695662.0
New York,,
Florida,,
Illinois,,


loc

In [25]:
states

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662
New York,19651127,141297
Florida,19552860,170312
Illinois,12882135,149995


In [26]:
states.loc['California','area'] #indexing

423967

In [27]:
states.loc['California',:'area'] # indexing and slicing combined

population    28332521
area            423967
Name: California, dtype: int64

In [28]:
states.loc[states.area > 150000, 'population'] #boolean masking

California    28332521
Texas         26448193
Florida       19552860
Name: population, dtype: int64

In [29]:
states.loc[['California','Texas'],'population'] # fancy indexing (list of labels) and indexing

California    28332521
Texas         26448193
Name: population, dtype: int64

In [30]:
states.loc[['California','Texas'],'population':'area'] # fancy indexing (list of labels) and slicing

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662


In [31]:
states.loc[['California','Texas'],['population','area']] # fancy indexing (lists of labels) on both axes

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662


iloc

In [32]:
states.iloc[1,:1]

population    26448193
Name: Texas, dtype: int64

In [33]:
states.iloc[:1,:1]

Unnamed: 0,population
California,28332521


In [34]:
states.iloc[:3,[0,1]]

Unnamed: 0,population,area
California,28332521,423967
Texas,26448193,695662
New York,19651127,141297


In [35]:
states.iloc[[0,3,2], [0,1]]

Unnamed: 0,population,area
California,28332521,423967
Florida,19552860,170312
New York,19651127,141297
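
One reason to prefer `loc` and `iloc` is that they say explicitly whether a number is a label or a position. The following sketch uses a hypothetical frame `df_int` whose row labels are integers that do not start at 0, so the two accessors return different rows for the same number.

```python
import pandas as pd

# Hypothetical frame: integer row labels 1, 2, 3 (not 0, 1, 2).
df_int = pd.DataFrame({'x': [10, 20, 30]}, index=[1, 2, 3])

by_label = df_int.loc[1, 'x']     # row labelled 1 -> the first row
by_position = df_int.iloc[1, 0]   # row at position 1 -> the second row

print(by_label, by_position)
```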


### read_csv()

csv: comma-separated values (data separated by commas)

Usage: pd.read_csv("file path", arguments)

In [36]:
df = pd.read_csv('temp1.csv')
df

Unnamed: 0,S.No,Name,Age,City,Salary
0,1,Tom,28,Toronto,20000
1,2,Lee,32,HongKong,3000
2,3,Steven,43,Bay Area,8300
3,4,Ram,38,Hyderabad,3900


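* Parameters of read_csv

1. path

argument: the file path (including the file name); in pandas this first parameter is named filepath_or_buffer

2. sep

Tells read_csv which character separates the fields when the file uses a delimiter other than a comma.  
Ex) sep = '/' : treats / as the field delimiter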
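
The `sep` behaviour can be tried without a file on disk. This is a small sketch in which `io.StringIO` stands in for the file path and the slash-delimited text is made up for illustration.

```python
import io
import pandas as pd

# Hypothetical in-memory "file"; io.StringIO replaces a real path here.
raw = "S.No/Name/Age\n1/Tom/28\n2/Lee/32\n"

df_sep = pd.read_csv(io.StringIO(raw), sep='/')
print(df_sep)
```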

3. header

Specifies which row to use as the column labels (the default is the first row, 0)

In [37]:
df = pd.read_csv('temp1.csv', header = 0)
df

Unnamed: 0,S.No,Name,Age,City,Salary
0,1,Tom,28,Toronto,20000
1,2,Lee,32,HongKong,3000
2,3,Steven,43,Bay Area,8300
3,4,Ram,38,Hyderabad,3900


4. index_col

Specifies which column to use as the index of the DataFrame

In [38]:
df = pd.read_csv('temp1.csv', index_col = 'S.No')
df

        Name  Age       City  Salary
S.No
1        Tom   28    Toronto   20000
2        Lee   32   HongKong    3000
3     Steven   43   Bay Area    8300
4        Ram   38  Hyderabad    3900


5. names

Used to supply the column labels

In [39]:
df = pd.read_csv('temp1.csv', names = ['a','b','c','d','e'])
df

# When column labels are supplied via names, pandas treats the CSV file as data from its first line onward (the original header row becomes a data row).

Unnamed: 0,a,b,c,d,e
0,S.No,Name,Age,City,Salary
1,1,Tom,28,Toronto,20000
2,2,Lee,32,HongKong,3000
3,3,Steven,43,Bay Area,8300
4,4,Ram,38,Hyderabad,3900


6. skiprows

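Skips the specified rows when reading. An integer n skips the first n lines of the file, so below the third line of the file becomes the header.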

In [40]:
df = pd.read_csv('temp1.csv', skiprows = 2)
df

Unnamed: 0,2,Lee,32,HongKong,3000
0,3,Steven,43,Bay Area,8300
1,4,Ram,38,Hyderabad,3900
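
`skiprows` also accepts a list of 0-based line numbers, which keeps the header while dropping specific data lines. A minimal sketch, using an in-memory copy of the first lines of `temp1.csv` (the text is reconstructed from the outputs above, and `io.StringIO` stands in for the file):

```python
import io
import pandas as pd

# In-memory stand-in for the first lines of temp1.csv.
raw = ("S.No,Name,Age,City,Salary\n"
       "1,Tom,28,Toronto,20000\n"
       "2,Lee,32,HongKong,3000\n"
       "3,Steven,43,Bay Area,8300\n")

# A list skips exactly those line numbers; line 0 (the header) is kept.
df_skip = pd.read_csv(io.StringIO(raw), skiprows=[1])
print(df_skip)
```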


7. dtype

Used to set the data type of specific columns

In [41]:
df = pd.read_csv('temp1.csv', dtype = {'Salary':np.float64})
df

Unnamed: 0,S.No,Name,Age,City,Salary
0,1,Tom,28,Toronto,20000.0
1,2,Lee,32,HongKong,3000.0
2,3,Steven,43,Bay Area,8300.0
3,4,Ram,38,Hyderabad,3900.0


### Ufuncs: Operations on Series or DataFrames
- Unary operations on Series or DataFrames

In [42]:
ser = pd.Series(np.random.randint(0,10,4))
ser

0    8
1    4
2    8
3    4
dtype: int32

In [43]:
np.sqrt(ser)

0    2.828427
1    2.000000
2    2.828427
3    2.000000
dtype: float64

In [44]:
df = pd.DataFrame(np.random.randint(0, 10, (3,4)),
                 columns = ['A','B','C','D'])
df

Unnamed: 0,A,B,C,D
0,4,7,5,3
1,5,2,8,7
2,9,0,8,1


In [45]:
np.sin(df * np.pi/4)

Unnamed: 0,A,B,C,D
0,1.224647e-16,-0.707107,-0.7071068,0.707107
1,-0.7071068,1.0,-2.449294e-16,-0.707107
2,0.7071068,0.0,-2.449294e-16,0.707107


### Binary Operations on Series

In [46]:
A = pd.Series([2,4,6], index=[0,1,2])
B = pd.Series([1,3,5], index=[1,2,3])
print(A,B, sep="\n\n")

0    2
1    4
2    6
dtype: int64

1    1
2    3
3    5
dtype: int64


In [47]:
A + B

0    NaN
1    5.0
2    9.0
3    NaN
dtype: float64
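
The NaN values appear because the operation aligns on index labels, and labels 0 and 3 exist in only one operand. If a missing label should instead count as a default value, the method form `Series.add` takes a `fill_value`. A short sketch with the same `A` and `B`:

```python
import pandas as pd

A = pd.Series([2, 4, 6], index=[0, 1, 2])
B = pd.Series([1, 3, 5], index=[1, 2, 3])

# Labels present in only one operand are treated as 0 instead of producing NaN.
total = A.add(B, fill_value=0)
print(total)
```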

In [48]:
area = pd.Series({'Alaska': 1723337,
                  'Texas': 695662,
                  'California':423967})
population = pd.Series({'California':38332521,
                  'Texas': 26448193,
                  'New York': 19651127})
print(area, population, sep='\n\n')

Alaska        1723337
Texas          695662
California     423967
dtype: int64

California    38332521
Texas         26448193
New York      19651127
dtype: int64


In [49]:
population / area

Alaska              NaN
California    90.413926
New York            NaN
Texas         38.018740
dtype: float64

### Binary Operations on DataFrames

In [50]:
A = pd.DataFrame(np.random.randint(0, 20, (2,2)), columns=list('AB'))
B = pd.DataFrame(np.random.randint(0, 10, (3,3)), columns=list('BAC'))
print(A,B, sep='\n\n')

    A   B
0  18   0
1   1  19

   B  A  C
0  0  2  6
1  3  1  1
2  5  9  2


In [51]:
A + B

Unnamed: 0,A,B,C
0,20.0,0.0,
1,2.0,22.0,
2,,,


In [52]:
A * B

Unnamed: 0,A,B,C
0,36.0,0.0,
1,1.0,57.0,
2,,,


### Working with Columns
Adding New Columns to a DataFrame

- df['new column name'] = pd.Series() / ndarray
- df[['new column 1', 'new column 2']] = pd.DataFrame()

In [53]:
A = np.random.randint(1,10, size=(3,4))
df = pd.DataFrame(A, columns=list('ABCD'))
df

Unnamed: 0,A,B,C,D
0,2,8,4,5
1,8,5,7,9
2,7,9,8,9


In [54]:
len(df) # number of rows

3

In [55]:
df['new_col'] = range(len(df))
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,0
1,8,5,7,9,1
2,7,9,8,9,2


In [56]:
df['new_col'] = np.random.rand(len(df))
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,0.075331
1,8,5,7,9,0.974538
2,7,9,8,9,0.128113


In [57]:
df['new_col'] = df['A']/df['B']
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,0.25
1,8,5,7,9,1.6
2,7,9,8,9,0.777778


In [58]:
df['new_col'] = df['A']*100
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,200
1,8,5,7,9,800
2,7,9,8,9,700


In [59]:
df['new_col'] = np.log(df['A'])
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,0.693147
1,8,5,7,9,2.079442
2,7,9,8,9,1.94591


In [60]:
df['new_col'] = pd.Series([0,2,3,4], index=[0,2,3,4])
df

Unnamed: 0,A,B,C,D,new_col
0,2,8,4,5,0.0
1,8,5,7,9,
2,7,9,8,9,2.0


In [61]:
df[['A','B']] = df[['B','A']]
df

Unnamed: 0,A,B,C,D,new_col
0,8,2,4,5,0.0
1,5,8,7,9,
2,9,7,8,9,2.0


In [62]:
df.drop('new_col') # the default axis=0 looks for a row label, so this raises KeyError

KeyError: "['new_col'] not found in axis"

In [None]:
df.drop(1)

In [None]:
df # unchanged: drop returns a new DataFrame unless inplace=True is passed or the result is reassigned

In [None]:
df.drop('new_col', axis=1)
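
`drop` returns a new DataFrame and leaves the original untouched, which is why the column is still there when `df` is displayed again. The sketch below uses a small hypothetical frame `df_demo` to show the reassignment pattern; `columns='new_col'` is equivalent to `'new_col', axis=1`.

```python
import pandas as pd

df_demo = pd.DataFrame({'A': [1, 2], 'new_col': [3, 4]})  # hypothetical frame

dropped = df_demo.drop(columns='new_col')   # same as drop('new_col', axis=1)
print(list(df_demo.columns))                # the original still has new_col
print(list(dropped.columns))

df_demo = df_demo.drop(columns='new_col')   # reassign to keep the change
```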

In [63]:
df.drop(['C','new_col'], axis=1)

Unnamed: 0,A,B,D
0,8,2,5
1,5,8,9
2,9,7,9


In [64]:
df.drop(df.columns[4], axis=1)

Unnamed: 0,A,B,C,D
0,8,2,4,5
1,5,8,7,9
2,9,7,8,9


In [65]:
df.drop(df.columns[-1], axis=1)

Unnamed: 0,A,B,C,D
0,8,2,4,5
1,5,8,7,9
2,9,7,8,9


In [66]:
del df['new_col']

In [67]:
df # del modifies df in place, so no inplace argument is needed

Unnamed: 0,A,B,C,D
0,8,2,4,5
1,5,8,7,9
2,9,7,8,9


In [68]:
key = ['dtype', 'size', 'count', 'sum', 'prod', 'min', 'max', 'mean', 'median', 'cov', 'describe', 'value_counts']
values = []
values.append(df['A'].dtype)
values.append(df['A'].size)
values.append(df['A'].count())
values.append(df['A'].sum())
values.append(df['A'].prod())
values.append(df['A'].min())
values.append(df['A'].max())
values.append(df['A'].mean())
values.append(df['A'].median())
values.append(df['A'].cov(df['B']))
values.append(df['A'].describe())
values.append(df['A'].value_counts())

for i in range(len(key)):
    print('attribute: ', key[i])
    print(values[i])
    print('\n')

attribute:  dtype
int32


attribute:  size
3


attribute:  count
3


attribute:  sum
22


attribute:  prod
360


attribute:  min
5


attribute:  max
9


attribute:  mean
7.333333333333333


attribute:  median
8.0


attribute:  cov
-2.833333333333333


attribute:  describe
count    3.000000
mean     7.333333
std      2.081666
min      5.000000
25%      6.500000
50%      8.000000
75%      8.500000
max      9.000000
Name: A, dtype: float64


attribute:  value_counts
9    1
5    1
8    1
Name: A, dtype: int64




In [69]:
df['A'].isnull()

0    False
1    False
2    False
Name: A, dtype: bool

In [70]:
df['A'].astype(np.float64)

0    8.0
1    5.0
2    9.0
Name: A, dtype: float64

In [71]:
df['A'].abs()

0    8
1    5
2    9
Name: A, dtype: int32

In [72]:
df['A'].round(decimals=2)

0    8
1    5
2    9
Name: A, dtype: int32

In [73]:
df['A'].fillna(value=0) # replace NaN with 0

0    8
1    5
2    9
Name: A, dtype: int32

In [74]:
df['A'] = df['A'].replace([999, 0], np.nan) # np.NaN was removed in NumPy 2.0; assign back instead of using inplace on a column selection
df

Unnamed: 0,A,B,C,D
0,8,2,4,5
1,5,8,7,9
2,9,7,8,9


In [75]:
df['A'].cumsum()

0     8
1    13
2    22
Name: A, dtype: int32

In [76]:
df['A'].cumprod()

0      8
1     40
2    360
Name: A, dtype: int32

In [77]:
df['A'].diff(periods=1)

0    NaN
1   -3.0
2    4.0
Name: A, dtype: float64

In [78]:
df['A'].shift(periods=1)

0    NaN
1    8.0
2    5.0
Name: A, dtype: float64

In [79]:
df['A'].pct_change(periods=1)

0      NaN
1   -0.375
2    0.800
Name: A, dtype: float64
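
The three methods above are related: `pct_change` is the difference from the previous element divided by the previous element. A small sketch that checks this on the same values as column `A` (the Series `a` is rebuilt here as an assumption, for self-containment):

```python
import numpy as np
import pandas as pd

a = pd.Series([8, 5, 9])        # same values as column A above

diff = a.diff(periods=1)        # a[i] - a[i-1]
prev = a.shift(periods=1)       # a[i-1]
pct = a.pct_change(periods=1)   # (a[i] - a[i-1]) / a[i-1]

# Compare with a tolerance and treat the leading NaN values as equal.
same = np.allclose(diff / prev, pct, equal_nan=True)
print(same)
```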

In [80]:
df['A'].value_counts()

9    1
5    1
8    1
Name: A, dtype: int64

In [81]:
df['new_col'] = pd.Series([np.nan]*3)
df

Unnamed: 0,A,B,C,D,new_col
0,8,2,4,5,
1,5,8,7,9,
2,9,7,8,9,


In [82]:
df.dropna(axis=1, inplace=True)
df

Unnamed: 0,A,B,C,D
0,8,2,4,5
1,5,8,7,9
2,9,7,8,9


### Concatenation

In [83]:
df1 = pd.DataFrame({'A':['A0','A3','A1','A2'],
                   'C':['C0','C3','C1','C2'],
                   'D':['D0','D3','D1','D2'],
                   'B':['B0','B3','B1','B2']}, index=[0,3,1,2])
df2 = pd.DataFrame({'B':['B3','B2','B6','B7'],
                   'D':['D3','D2','D6','D7'],
                    'F':['F3','F2','F6','F7']}, index=[3,2,6,7])
print(df1, df2, sep='\n\n')

    A   C   D   B
0  A0  C0  D0  B0
3  A3  C3  D3  B3
1  A1  C1  D1  B1
2  A2  C2  D2  B2

    B   D   F
3  B3  D3  F3
2  B2  D2  F2
6  B6  D6  F6
7  B7  D7  F7


In [84]:
pd.concat([df1, df2]) # concatenate downward (stack rows vertically, axis=0)
#pd.concat([df1, df2], axis=1) # concatenate to the right (place columns side by side)

Unnamed: 0,A,C,D,B,F
0,A0,C0,D0,B0,
3,A3,C3,D3,B3,
1,A1,C1,D1,B1,
2,A2,C2,D2,B2,
3,,,D3,B3,F3
2,,,D2,B2,F2
6,,,D6,B6,F6
7,,,D7,B7,F7


In [85]:
pd.concat([df2, pd.Series(['E0','E1','E2','E3'])], axis=1) 
# the Series gets the default index 0,1,2,3, so rows are aligned on those labels

Unnamed: 0,B,D,F,0
0,,,,E0
1,,,,E1
2,B2,D2,F2,E2
3,B3,D3,F3,E3
6,B6,D6,F6,
7,B7,D7,F7,


In [86]:
pd.concat([df1, df2], axis=1, join='inner')
#pd.concat([df1, df2], axis=1, join='outer')

Unnamed: 0,A,C,D,B,B.1,D.1,F
3,A3,C3,D3,B3,B3,D3,F3
2,A2,C2,D2,B2,B2,D2,F2


In [87]:
pd.concat([df1, df2], ignore_index = True)

Unnamed: 0,A,C,D,B,F
0,A0,C0,D0,B0,
1,A3,C3,D3,B3,
2,A1,C1,D1,B1,
3,A2,C2,D2,B2,
4,,,D3,B3,F3
5,,,D2,B2,F2
6,,,D6,B6,F6
7,,,D7,B7,F7
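
`ignore_index=True` throws the original row labels away. When the labels should be kept but the source of each row still needs to be distinguishable, `pd.concat` also accepts `keys`, which adds an outer index level. A sketch with two small hypothetical frames, `left` and `right`, that share the row label 3:

```python
import pandas as pd

# Hypothetical frames with an overlapping row label (3).
left = pd.DataFrame({'B': ['B0', 'B3']}, index=[0, 3])
right = pd.DataFrame({'B': ['B3x', 'B6'], 'F': ['F3', 'F6']}, index=[3, 6])

# keys adds an outer index level naming the source of each row.
combined = pd.concat([left, right], keys=['left', 'right'])
print(combined)

from_right = combined.loc[('right', 3), 'F']   # (source, original label)
print(from_right)
```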


### Aggregation & Grouping

In [88]:
df = pd.DataFrame({'key':['A','B','C','A','B','C'],
                  'data':range(6),
                  'data2':[1.9,3.5,5.3,2.7,4.2,6.1]})
df

Unnamed: 0,key,data,data2
0,A,0,1.9
1,B,1,3.5
2,C,2,5.3
3,A,3,2.7
4,B,4,4.2
5,C,5,6.1


In [89]:
def sum_of_square(x, c=2):
    return ((x - x.mean()) ** c).sum()

In [90]:
df[['data','data2']].apply(sum_of_square)

data     17.500
data2    12.475
dtype: float64

In [91]:
df[['data','data2']].apply(sum_of_square, c=2, axis=1)

0    1.805
1    3.125
2    5.445
3    0.045
4    0.020
5    0.605
dtype: float64

In [92]:
df.groupby('key').sum()

     data  data2
key
A       3    4.6
B       5    7.7
C       7   11.4


In [93]:
df.groupby('key')['data2'].var()

key
A    0.320
B    0.245
C    0.320
Name: data2, dtype: float64

In [94]:
df.groupby('key').aggregate(['min', 'median', 'max']) # string names avoid the deprecation warning for passing builtins/NumPy functions

    data            data2
     min median max   min median  max
key
A      0    1.5   3   1.9   2.30  2.7
B      1    2.5   4   3.5   3.85  4.2
C      2    3.5   5   5.3   5.70  6.1


In [95]:
df.groupby('key').aggregate({'data': 'min', 'data2': 'max'})

     data  data2
key
A       0    2.7
B       1    4.2
C       2    6.1


In [96]:
df.groupby('key').aggregate({'data': ['min', 'max'], 'data2': ['mean', 'sum']})

    data     data2
     min max  mean   sum
key
A      0   3  2.30   4.6
B      1   4  3.85   7.7
C      2   5  5.70  11.4


In [97]:
df.groupby('key').aggregate(lambda x: x.max()-x.min())

     data  data2
key
A       3    0.8
B       3    0.7
C       3    0.8
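
When different functions are applied to different columns, the result is easier to read if each output column gets its own name. One way to do that is pandas' named aggregation, sketched below on the same data; the output names `data_min` and `data2_range` are made up for illustration.

```python
import pandas as pd

df = pd.DataFrame({'key': ['A', 'B', 'C', 'A', 'B', 'C'],
                   'data': range(6),
                   'data2': [1.9, 3.5, 5.3, 2.7, 4.2, 6.1]})

# Named aggregation: output_column=(input_column, function)
summary = df.groupby('key').agg(
    data_min=('data', 'min'),
    data2_range=('data2', lambda x: x.max() - x.min()),
)
print(summary)
```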
