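### Pivot tables
- Pivot table : a table that takes two columns of the data, uses one as the row index and the other as the column index, and spreads the looked-up values across the resulting grid.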

In [2]:
import pandas as pd
import numpy as np

In [3]:
data = {
    "도시": ["서울", "서울", "서울", "부산", "부산", "부산", "인천", "인천"],
    "연도": ["2015", "2010", "2005", "2015", "2010", "2005", "2015", "2010"],
    "인구": [9904312, 9631482, 9762546, 3448737, 3393191, 3512547, 2890451, 263203],
    "지역": ["수도권", "수도권", "수도권", "경상권", "경상권", "경상권", "수도권", "수도권"]
}
columns = ["도시", "연도", "인구", "지역"]
df1 = pd.DataFrame(data, columns=columns)
df1

Unnamed: 0,도시,연도,인구,지역
0,서울,2015,9904312,수도권
1,서울,2010,9631482,수도권
2,서울,2005,9762546,수도권
3,부산,2015,3448737,경상권
4,부산,2010,3393191,경상권
5,부산,2005,3512547,경상권
6,인천,2015,2890451,수도권
7,인천,2010,263203,수도권


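- `pivot` takes three arguments: `index` (the column to use as the row index), `columns` (the column to use as the column index), and `values` (the column that supplies the data). Recent pandas versions accept them as keyword arguments only.
- Cells with no matching data are filled with NaN.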

In [4]:
df1.pivot(index='도시', columns='연도', values='인구')

연도,2005,2010,2015
도시,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
부산,3512547.0,3393191.0,3448737.0
서울,9762546.0,9631482.0,9904312.0
인천,,263203.0,2890451.0


In [5]:
df1.set_index(['도시','연도'])[['인구']].unstack()

Unnamed: 0_level_0,인구,인구,인구
연도,2005,2010,2015
도시,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
부산,3512547.0,3393191.0,3448737.0
서울,9762546.0,9631482.0,9904312.0
인천,,263203.0,2890451.0


- Passing a list as the row index or the column index produces a pivot table with a multi-level index.

In [6]:
df1.pivot(index=['지역','도시'], columns='연도', values='인구')

Unnamed: 0_level_0,연도,2005,2010,2015
지역,도시,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
경상권,부산,3512547.0,3393191.0,3448737.0
수도권,서울,9762546.0,9631482.0,9904312.0
수도권,인천,,263203.0,2890451.0


- The row index and column index act as keys for looking up data, so each key pair must match exactly one value. If two or more rows satisfy the same row-index and column-index combination, an error is raised.

In [7]:
try:
    df1.pivot(index='지역', columns='연도', values='인구')
except ValueError as e:
    print('ValueError:',e)

ValueError: Index contains duplicate entries, cannot reshape
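
- One way to avoid this error is to check for repeated key pairs before calling pivot. The sketch below uses a small hypothetical frame (the `region`, `year`, and `value` columns are made up for illustration) and `DataFrame.duplicated` to flag the offending rows.

```python
import pandas as pd

# Hypothetical toy frame: two rows share the same (region, year) key pair.
df = pd.DataFrame({
    "region": ["east", "east", "west"],
    "year": ["2010", "2010", "2010"],
    "value": [1, 2, 3],
})

# keep=False marks every row that belongs to a repeated key pair.
dup_mask = df.duplicated(subset=["region", "year"], keep=False)
print(df[dup_mask])

# pivot is only safe when no key pair repeats.
can_pivot = not dup_mask.any()
print(can_pivot)
```

- If any key pair repeats, the data calls for an aggregating approach rather than pivot.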


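### Group analysis
- When more than one row matches the condition specified by the keys, the rows form a group, and you need group analysis, which summarizes the characteristics of each group.
- Unlike a pivot table, group analysis handles multiple rows per key by applying a predefined operation that computes a representative value for each group.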

#### The groupby method
- Splits the data into groups by key.

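### Group operation methods
- size, count : number of items in the group
- mean, median, min, max : mean, median, minimum, and maximum of the group
- sum, prod, std, var, quantile : sum, product, standard deviation, variance, and quantile of the group
- first, last : the first and the last item in the group
- agg, aggregate : if no built-in group operation does what you need, write a function and pass it to agg. To run several group operations at once, pass a list of function names as strings.
- describe : returns several summary values per group as a DataFrame rather than a single representative value.
- apply : like describe, returns a DataFrame rather than a single representative value; use it when no built-in group operation fits.
- transform : does not produce a representative value per group; instead it transforms the data itself through a per-group calculation.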
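
- The difference between agg, transform, and apply is easiest to see in the shape of what each returns. The following is a minimal sketch on a hypothetical five-row frame (the `key` and `x` columns are illustrative and not part of the notebook data).

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "A", "B", "B", "A"],
    "x": [1, 2, 3, 4, 5],
})
g = df.groupby("key")["x"]

# agg: one representative value per group (index = group keys).
agg_result = g.agg("mean")

# transform: same length as the input; each row receives its group's value.
transform_result = g.transform("mean")

# apply: arbitrary function per group; here it keeps the two largest values.
apply_result = g.apply(lambda s: s.nlargest(2))

print(agg_result)
print(transform_result)
print(apply_result)
```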

In [8]:
np.random.seed(0)
df2 = pd.DataFrame({
    'key1' : ['A','A','B','B','A'],
    'key2' : ['one','two','one','two','one'],
    'data1' : [1,2,3,4,5],
    'data2' : [10,20,30,40,50]
})
df2

Unnamed: 0,key1,key2,data1,data2
0,A,one,1,10
1,A,two,2,20
2,B,one,3,30
3,B,two,4,40
4,A,one,5,50


- Compute the sum of data1 for each value of key1 (A or B)

In [9]:
groups = df2.groupby(df2.key1)
groups

<pandas.core.groupby.generic.DataFrameGroupBy object at 0x0000028F46F49FD0>

In [10]:
groups.groups  # a GroupBy object has a groups attribute that stores the row labels of each group

{'A': [0, 1, 4], 'B': [2, 3]}
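
- Besides the groups attribute, a GroupBy object can be iterated over, and a single group can be pulled out with `get_group`. The sketch below rebuilds a reduced copy of the frame above (only `key1` and `data1`) so that it runs on its own.

```python
import pandas as pd

df = pd.DataFrame({
    "key1": ["A", "A", "B", "B", "A"],
    "data1": [1, 2, 3, 4, 5],
})
groups = df.groupby("key1")

# Iterating yields (group key, sub-DataFrame) pairs.
sizes = {key: len(sub) for key, sub in groups}
print(sizes)

# get_group returns the rows that belong to one group.
group_a = groups.get_group("A")
print(group_a)
```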

In [11]:
groups.sum(numeric_only=True)  # restrict to numeric columns so the string column key2 is not concatenated

Unnamed: 0_level_0,data1,data2
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,8,80
B,7,70


In [12]:
df2.data1.groupby(df2.key1).sum()  # when you do not need the GroupBy object explicitly

key1
A    8
B    7
Name: data1, dtype: int64

In [13]:
df2.groupby(df2.key1)['data1'].sum()  # select only data1 from the GroupBy object, then aggregate

key1
A    8
B    7
Name: data1, dtype: int64

In [14]:
df2.groupby(df2.key1).sum()['data1']  # aggregate all columns first, then select data1

key1
A    8
B    7
Name: data1, dtype: int64

In [15]:
df2.data1.groupby([df2.key1, df2.key2]).sum()  # use a list when there are multiple keys

key1  key2
A     one     6
      two     2
B     one     3
      two     4
Name: data1, dtype: int64

In [16]:
df2.data1.groupby([df2['key1'],df2['key2']]).sum().unstack('key2')

key2,one,two
key1,Unnamed: 1_level_1,Unnamed: 2_level_1
A,6,2
B,3,4


In [17]:
df1['인구'].groupby([df1['지역'],df1['연도']]).sum().unstack('연도')

연도,2005,2010,2015
지역,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
경상권,3512547,3393191,3448737
수도권,9762546,9894685,12794763


In [18]:
import seaborn as sns
iris = sns.load_dataset('iris')

- For each iris species, compute the ratio of the largest value to the smallest value

In [19]:
def peak_to_peak_ratio(x):
    return x.max() / x.min()

iris.groupby(iris.species).agg(peak_to_peak_ratio)

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,1.348837,1.913043,1.9,6.0
versicolor,1.428571,1.7,1.7,1.8
virginica,1.612245,1.727273,1.533333,1.785714


In [20]:
iris.groupby(iris.species).describe().T

Unnamed: 0,species,setosa,versicolor,virginica
sepal_length,count,50.0,50.0,50.0
sepal_length,mean,5.006,5.936,6.588
sepal_length,std,0.35249,0.516171,0.63588
sepal_length,min,4.3,4.9,4.9
sepal_length,25%,4.8,5.6,6.225
sepal_length,50%,5.0,5.9,6.5
sepal_length,75%,5.2,6.3,6.9
sepal_length,max,5.8,7.0,7.9
sepal_width,count,50.0,50.0,50.0
sepal_width,mean,3.428,2.77,2.974


In [21]:
def top3_petal_length(df):
    return df.sort_values(by='petal_length', ascending=False)[:3]

iris.groupby(iris.species).apply(top3_petal_length)  # returns a DataFrame

Unnamed: 0_level_0,Unnamed: 1_level_0,sepal_length,sepal_width,petal_length,petal_width,species
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
setosa,24,4.8,3.4,1.9,0.2,setosa
setosa,44,5.1,3.8,1.9,0.4,setosa
setosa,23,5.1,3.3,1.7,0.5,setosa
versicolor,83,6.0,2.7,5.1,1.6,versicolor
versicolor,77,6.7,3.0,5.0,1.7,versicolor
versicolor,72,6.3,2.5,4.9,1.5,versicolor
virginica,118,7.7,2.6,6.9,2.3,virginica
virginica,117,7.7,3.8,6.7,2.2,virginica
virginica,122,7.7,2.8,6.7,2.0,virginica


In [22]:
def q3cut(s):
    return pd.qcut(s,3,labels=['소','중','대']).astype(str)

iris['petal_length_class'] = iris.groupby(iris.species).petal_length.transform(q3cut)
iris[['petal_length','petal_length_class']].tail(10)

Unnamed: 0,petal_length,petal_length_class
140,5.6,중
141,5.1,소
142,5.1,소
143,5.9,대
144,5.7,중
145,5.2,소
146,5.0,소
147,5.2,소
148,5.4,중
149,5.1,소


#### Exercise 4.7.2

In [36]:
iris.groupby(iris.species).mean(numeric_only=True)  # skip the string column petal_length_class

Unnamed: 0_level_0,sepal_length,sepal_width,petal_length,petal_width
species,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
setosa,5.006,3.428,1.462,0.246
versicolor,5.936,2.77,4.26,1.326
virginica,6.588,2.974,5.552,2.026


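### pivot_table
- Like groupby, the pivot_table command performs group analysis, but it returns the result as a pivot table, as pivot does. In effect it applies unstack to the groupby result automatically, reshaping it into a two-dimensional table.
- pivot_table(data, values=None, index=None, columns=None, aggfunc='mean', fill_value=None, margins=False, margins_name='All')
- data : the DataFrame to analyze
- values : the column(s) of the DataFrame to analyze
- index : key column, or list of key columns, for the row index
- columns : key column, or list of key columns, for the column index
- aggfunc : the aggregation method
- fill_value : replacement value for NaN
- margins : whether to append the result over all the data to the right and bottom
- margins_name : name of the margin column (row)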
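
- The relationship to groupby and unstack can be checked on a small case. The sketch below uses a hypothetical frame (`region`, `year`, and `value` are illustrative names) in which one key pair repeats and one combination is missing, then builds the table both ways and shows the effect of `fill_value`.

```python
import pandas as pd

df = pd.DataFrame({
    "region": ["east", "east", "east", "west"],
    "year": ["2010", "2010", "2015", "2010"],
    "value": [1.0, 3.0, 5.0, 7.0],
})

# pivot_table aggregates the repeated (east, 2010) pair with aggfunc.
pt = df.pivot_table(values="value", index="region", columns="year", aggfunc="mean")

# The same table built by hand: group, aggregate, then unstack the column key.
gu = df.groupby(["region", "year"])["value"].mean().unstack("year")

print(pt)
print(gu)

# fill_value replaces the NaN in the missing (west, 2015) cell.
filled = df.pivot_table(values="value", index="region", columns="year",
                        aggfunc="mean", fill_value=0)
print(filled)
```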

In [23]:
df1.pivot_table('인구', '도시', '연도')  # note the argument order: values, index, columns

연도,2005,2010,2015
도시,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
부산,3512547.0,3393191.0,3448737.0
서울,9762546.0,9631482.0,9904312.0
인천,,263203.0,2890451.0


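- With margins=True, the method given by aggfunc is also applied to all the data in each column, all the data in each row, and the entire dataset, and those results are shown alongside the table. If aggfunc is not given, the mean is computed.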

In [24]:
df1.pivot_table('인구', '도시', '연도' ,margins=True, margins_name='합계')

연도,2005,2010,2015,합계
도시,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
부산,3512547.0,3393191.0,3448737.0,3451492.0
서울,9762546.0,9631482.0,9904312.0,9766113.0
인천,,263203.0,2890451.0,1576827.0
합계,6637546.5,4429292.0,5414500.0,5350809.0


In [25]:
df1['인구'].mean()

5350808.625
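
- As the comparison above suggests, the corner margin cell is aggfunc applied to every raw row, not an average of the cell averages. A minimal sketch on a hypothetical frame (`city`, `year`, and `value` are illustrative names) with unequal group sizes makes the difference visible.

```python
import pandas as pd

df = pd.DataFrame({
    "city": ["a", "a", "a", "b"],
    "year": ["2010", "2010", "2015", "2010"],
    "value": [1.0, 2.0, 6.0, 11.0],
})

pt = df.pivot_table(values="value", index="city", columns="year",
                    aggfunc="mean", margins=True)
print(pt)

# Corner cell: the mean over all four raw rows.
grand = pt.loc["All", "All"]
print(grand, df["value"].mean())

# For contrast: the mean of the three non-empty cell means differs.
cell_means = df.groupby(["city", "year"])["value"].mean()
print(cell_means.mean())
```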

In [26]:
df1.pivot_table('인구', index=['연도','도시'])

Unnamed: 0_level_0,Unnamed: 1_level_0,인구
연도,도시,Unnamed: 2_level_1
2005,부산,3512547
2005,서울,9762546
2010,부산,3393191
2010,서울,9631482
2010,인천,263203
2015,부산,3448737
2015,서울,9904312
2015,인천,2890451


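- Data on tips paid after restaurant meals
    - total_bill : cost of the meal
    - tip : tip amount
    - sex : sex of the payer
    - smoker : smoker or non-smoker
    - day : day of the week
    - time : time of day (Lunch or Dinner)
    - size : party size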

In [27]:
tips = sns.load_dataset('tips')
tips.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size
239,29.03,5.92,Male,No,Sat,Dinner,3
240,27.18,2.0,Female,Yes,Sat,Dinner,2
241,22.67,2.0,Male,Yes,Sat,Dinner,2
242,17.82,1.75,Male,No,Sat,Dinner,2
243,18.78,3.0,Female,No,Thur,Dinner,2


- Goal of the analysis : find the conditions under which the ratio of tip to total bill is highest

In [28]:
tips['tip_pct'] = tips['tip'] / tips['total_bill']  # ratio of tip to total bill
tips.tail()

Unnamed: 0,total_bill,tip,sex,smoker,day,time,size,tip_pct
239,29.03,5.92,Male,No,Sat,Dinner,3,0.203927
240,27.18,2.0,Female,Yes,Sat,Dinner,2,0.073584
241,22.67,2.0,Male,Yes,Sat,Dinner,2,0.088222
242,17.82,1.75,Male,No,Sat,Dinner,2,0.098204
243,18.78,3.0,Female,No,Thur,Dinner,2,0.159744


In [29]:
tips.describe()

Unnamed: 0,total_bill,tip,size,tip_pct
count,244.0,244.0,244.0,244.0
mean,19.785943,2.998279,2.569672,0.160803
std,8.902412,1.383638,0.9511,0.061072
min,3.07,1.0,1.0,0.035638
25%,13.3475,2.0,2.0,0.129127
50%,17.795,2.9,2.0,0.15477
75%,24.1275,3.5625,3.0,0.191475
max,50.81,10.0,6.0,0.710345


In [30]:
tips.groupby('sex').count()

Unnamed: 0_level_0,total_bill,tip,smoker,day,time,size,tip_pct
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1,Unnamed: 7_level_1
Male,157,157,157,157,157,157,157
Female,87,87,87,87,87,87,87


In [31]:
tips.groupby('sex').size()

sex
Male      157
Female     87
dtype: int64

- When there is no NaN data, count returns the same number for every column. In that case size gives a more compact result. size counts rows regardless of whether they contain NaN.
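
- The tips data has no missing values, so the difference does not show up there. The sketch below uses a hypothetical three-row frame with one NaN to illustrate how size and count diverge.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "A", "B"],
    "x": [1.0, np.nan, 3.0],
})
g = df.groupby("key")

# size counts rows per group, NaN included.
sizes = g.size()
print(sizes)

# count counts only the non-NaN values in each column.
counts = g.count()
print(counts)
```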

In [32]:
tips.groupby(['sex','smoker']).size()

sex     smoker
Male    Yes       60
        No        97
Female  Yes       33
        No        54
dtype: int64

In [33]:
tips.pivot_table('tip_pct','sex','smoker',aggfunc='count',margins=True)

smoker,Yes,No,All
sex,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,60,97,157
Female,33,54,87
All,93,151,244


In [34]:
tips.groupby('sex')[['tip_pct']].mean()

Unnamed: 0_level_0,tip_pct
sex,Unnamed: 1_level_1
Male,0.157651
Female,0.166491


In [35]:
tips.groupby('smoker')[['tip_pct']].mean()

Unnamed: 0_level_0,tip_pct
smoker,Unnamed: 1_level_1
Yes,0.163196
No,0.159328


In [37]:
tips.pivot_table('tip_pct','sex')

Unnamed: 0_level_0,tip_pct
sex,Unnamed: 1_level_1
Male,0.157651
Female,0.166491


In [38]:
tips.pivot_table('tip_pct',['sex','smoker'])

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct
sex,smoker,Unnamed: 2_level_1
Male,Yes,0.152771
Male,No,0.160669
Female,Yes,0.18215
Female,No,0.156921


In [39]:
tips.pivot_table('tip_pct','sex','smoker')

smoker,Yes,No
sex,Unnamed: 1_level_1,Unnamed: 2_level_1
Male,0.152771,0.160669
Female,0.18215,0.156921


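- On average, women tip a higher percentage than men, and smokers a slightly higher percentage than non-smokers. Broken down by both keys, female smokers have the highest ratio (about 0.182), whereas among men smokers tip less than non-smokers.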

In [40]:
tips.groupby('sex')[['tip_pct']].describe()

Unnamed: 0_level_0,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
sex,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Male,157.0,0.157651,0.064778,0.035638,0.121389,0.153492,0.18624,0.710345
Female,87.0,0.166491,0.053632,0.056433,0.140416,0.155581,0.194266,0.416667


In [42]:
tips.groupby('smoker')[['tip_pct']].describe()

Unnamed: 0_level_0,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct
Unnamed: 0_level_1,count,mean,std,min,25%,50%,75%,max
smoker,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2
Yes,93.0,0.163196,0.085119,0.035638,0.106771,0.153846,0.195059,0.710345
No,151.0,0.159328,0.03991,0.056797,0.136906,0.155625,0.185014,0.29199


In [44]:
tips.groupby(['sex','smoker'])[['tip_pct']].describe()

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct,tip_pct
Unnamed: 0_level_1,Unnamed: 1_level_1,count,mean,std,min,25%,50%,75%,max
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
Male,Yes,60.0,0.152771,0.090588,0.035638,0.101845,0.141015,0.191697,0.710345
Male,No,97.0,0.160669,0.041849,0.071804,0.13181,0.157604,0.18622,0.29199
Female,Yes,33.0,0.18215,0.071595,0.056433,0.152439,0.173913,0.198216,0.416667
Female,No,54.0,0.156921,0.036421,0.056797,0.139708,0.149691,0.18163,0.252672


#### Exercise 4.7.3

In [47]:
tips.pivot_table('day','time',aggfunc='count',margins=True)

Unnamed: 0_level_0,day
time,Unnamed: 1_level_1
Lunch,68
Dinner,176
All,244


In [50]:
tips.groupby('day')[['time']].count()

Unnamed: 0_level_0,time
day,Unnamed: 1_level_1
Thur,62
Fri,19
Sat,87
Sun,76


In [52]:
tips.pivot_table('tip_pct','day','size')

size,1,2,3,4,5,6
day,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Thur,0.181728,0.163935,0.144599,0.145515,0.121389,0.173706
Fri,0.223776,0.168693,0.187735,0.11775,,
Sat,0.231832,0.155289,0.151439,0.138289,0.106572,
Sun,,0.18087,0.152662,0.153168,0.159839,0.103799


In [54]:
tips.pivot_table('tip_pct','time','size')

size,1,2,3,4,5,6
time,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Lunch,0.202752,0.16575,0.153226,0.145515,0.121389,0.173706
Dinner,0.231832,0.165704,0.151995,0.146017,0.146522,0.103799


In [57]:
tips.pivot_table('tip_pct',['day','time'])

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct
day,time,Unnamed: 2_level_1
Thur,Lunch,0.161301
Thur,Dinner,0.159744
Fri,Lunch,0.188765
Fri,Dinner,0.158916
Sat,Dinner,0.153152
Sun,Dinner,0.166897


- Find the difference between the largest and the smallest tip in each group

In [58]:
def peak_to_peak(x):
    return x.max()-x.min()

tips.groupby(['sex','smoker'])[['tip']].agg(peak_to_peak)

Unnamed: 0_level_0,Unnamed: 1_level_0,tip
sex,smoker,Unnamed: 2_level_1
Male,Yes,9.0
Male,No,7.75
Female,Yes,5.5
Female,No,4.2


- To run several group operations at once, pass a list

In [61]:
tips.groupby(['sex','smoker'])[['total_bill']].agg(['mean', peak_to_peak])  # select the column first so non-numeric columns are not aggregated

Unnamed: 0_level_0,Unnamed: 1_level_0,total_bill,total_bill
Unnamed: 0_level_1,Unnamed: 1_level_1,mean,peak_to_peak
sex,smoker,Unnamed: 2_level_2,Unnamed: 3_level_2
Male,Yes,22.2845,43.56
Male,No,19.791237,40.82
Female,Yes,17.977879,41.23
Female,No,18.105185,28.58


- To apply a different operation to each data column, pass a dictionary that maps column labels to operation names (or functions).

In [62]:
tips.groupby(['sex','smoker']).agg(
    {'tip_pct': 'mean', 'total_bill' : peak_to_peak}) 

Unnamed: 0_level_0,Unnamed: 1_level_0,tip_pct,total_bill
sex,smoker,Unnamed: 2_level_1,Unnamed: 3_level_1
Male,Yes,0.152771,43.56
Male,No,0.160669,40.82
Female,Yes,0.18215,41.23
Female,No,0.156921,28.58
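
- pandas also offers named aggregation, where each keyword argument to agg names an output column and gives a (column, function) pair. This is an alternative to the dictionary form, not something the notebook itself uses. The sketch below runs on a hypothetical frame (`key`, `bill`, and `tip` are illustrative names).

```python
import pandas as pd

df = pd.DataFrame({
    "key": ["A", "A", "B", "B"],
    "bill": [10.0, 30.0, 20.0, 25.0],
    "tip": [1.0, 6.0, 2.0, 5.0],
})

def peak_to_peak(x):
    # Range of the group: largest value minus smallest value.
    return x.max() - x.min()

# Each keyword becomes an output column: name=(input column, operation).
result = df.groupby("key").agg(
    mean_tip=("tip", "mean"),
    bill_range=("bill", peak_to_peak),
)
print(result)
```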


In [63]:
tips.pivot_table(['tip_pct','size'],['sex','day'],'smoker')

Unnamed: 0_level_0,Unnamed: 1_level_0,size,size,tip_pct,tip_pct
Unnamed: 0_level_1,smoker,Yes,No,Yes,No
sex,day,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2
Male,Thur,2.3,2.5,0.164417,0.165706
Male,Fri,2.125,2.0,0.14473,0.138005
Male,Sat,2.62963,2.65625,0.139067,0.162132
Male,Sun,2.6,2.883721,0.173964,0.158291
Female,Thur,2.428571,2.48,0.163073,0.155971
Female,Fri,2.0,2.5,0.209129,0.165296
Female,Sat,2.2,2.307692,0.163817,0.147993
Female,Sun,2.5,3.071429,0.237075,0.16571


In [65]:
tips.pivot_table('size',['time','sex','smoker'],'day',
                aggfunc='sum',fill_value=0)

Unnamed: 0_level_0,Unnamed: 1_level_0,day,Thur,Fri,Sat,Sun
time,sex,smoker,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1,Unnamed: 6_level_1
Lunch,Male,Yes,23,5,0,0
Lunch,Male,No,50,0,0,0
Lunch,Female,Yes,17,6,0,0
Lunch,Female,No,60,3,0,0
Dinner,Male,Yes,0,12,71,39
Dinner,Male,No,0,4,85,124
Dinner,Female,Yes,0,8,33,10
Dinner,Female,No,2,2,30,43


In [66]:
titanic = sns.load_dataset('titanic')

In [67]:
titanic

Unnamed: 0,survived,pclass,sex,age,sibsp,parch,fare,embarked,class,who,adult_male,deck,embark_town,alive,alone
0,0,3,male,22.0,1,0,7.2500,S,Third,man,True,,Southampton,no,False
1,1,1,female,38.0,1,0,71.2833,C,First,woman,False,C,Cherbourg,yes,False
2,1,3,female,26.0,0,0,7.9250,S,Third,woman,False,,Southampton,yes,True
3,1,1,female,35.0,1,0,53.1000,S,First,woman,False,C,Southampton,yes,False
4,0,3,male,35.0,0,0,8.0500,S,Third,man,True,,Southampton,no,True
...,...,...,...,...,...,...,...,...,...,...,...,...,...,...,...
886,0,2,male,27.0,0,0,13.0000,S,Second,man,True,,Southampton,no,True
887,1,1,female,19.0,0,0,30.0000,S,First,woman,False,B,Southampton,yes,True
888,0,3,female,,1,2,23.4500,S,Third,woman,False,,Southampton,no,False
889,1,1,male,26.0,0,0,30.0000,C,First,man,True,C,Cherbourg,yes,True


In [71]:
titanic['age_level'] = pd.qcut(titanic['age'],3,labels=['age1','age2','age3'])
titanic['age_level']

0      age1
1      age3
2      age2
3      age3
4      age3
       ... 
886    age2
887    age1
888     NaN
889    age2
890    age2
Name: age_level, Length: 891, dtype: category
Categories (3, object): ['age1' < 'age2' < 'age3']
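
- Row 888 above has a missing age and therefore a NaN level. The sketch below reproduces that behaviour of `pd.qcut` on a short hypothetical series of ages, so the bin assignment and the missing entry can be inspected directly.

```python
import numpy as np
import pandas as pd

ages = pd.Series([5, 20, 35, np.nan, 50, 65, 80])

# qcut splits the non-missing values into three equal-frequency bins.
levels = pd.qcut(ages, 3, labels=["age1", "age2", "age3"])
print(levels)

# dropna=False also reports how many entries have no level.
print(levels.value_counts(dropna=False))
```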

In [75]:
def surv_prob(t):
    return t.sum() / t.count()  # survival rate

titanic.groupby(['sex','age_level','class'])['survived'].agg(surv_prob)

sex     age_level  class 
female  age1       First     0.954545
                   Second    1.000000
                   Third     0.508475
        age2       First     0.947368
                   Second    0.909091
                   Third     0.481481
        age3       First     0.977273
                   Second    0.857143
                   Third     0.250000
male    age1       First     0.500000
                   Second    0.357143
                   Third     0.158879
        age2       First     0.500000
                   Second    0.076923
                   Third     0.195652
        age3       First     0.347826
                   Second    0.062500
                   Third     0.055556
Name: survived, dtype: float64

In [80]:
titanic.pivot_table('survived',['sex','class'],'age_level',aggfunc=surv_prob)

Unnamed: 0_level_0,age_level,age1,age2,age3
sex,class,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1
female,First,0.954545,0.947368,0.977273
female,Second,1.0,0.909091,0.857143
female,Third,0.508475,0.481481,0.25
male,First,0.5,0.5,0.347826
male,Second,0.357143,0.076923,0.0625
male,Third,0.158879,0.195652,0.055556
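
- For a 0/1 column such as survived, dividing the sum by the count is the same as taking the mean, so the built-in mean could stand in for the custom function. The sketch below checks this on a hypothetical five-row frame rather than on the titanic data.

```python
import pandas as pd

df = pd.DataFrame({
    "sex": ["f", "f", "f", "m", "m"],
    "survived": [1, 1, 0, 0, 1],
})

def surv_prob(t):
    # Share of ones among the non-missing values.
    return t.sum() / t.count()

by_func = df.groupby("sex")["survived"].agg(surv_prob)
by_mean = df.groupby("sex")["survived"].mean()
print(by_func)
print(by_mean)
```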
