# Hierarchical Indexing (MultiIndexing)


#### [ Multi-Indexed Series ]
#### [ MultiIndex as an Extra Dimension ]
#### [ MultiIndex Creation Methods ]


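- For 3D and 4D data, pandas provided the Panel and Panel4D objects (both have since been removed from pandas)
- Hierarchical indexing, also called multi-indexing, is a way to include multiple index levels within a single index
- With **hierarchical indexing (multi-indexing)**, 3D and 4D data can be represented in an ordinary Series or DataFrame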

In [1]:
import pandas as pd
import numpy as np

print("pandas ver : ",pd.__version__)
print("numpy ver : ",np.__version__)

pandas ver :  0.24.2
numpy ver :  1.16.4


### [ Multi-Indexed Series ]

- **Representing two-dimensional data in a one-dimensional Series**

- *The bad way* : index the Series with tuples as keys

In [2]:
index = [('California', 2000), ('California', 2010),
         ('New York', 2000), ('New York', 2010),
         ('Texas', 2000), ('Texas', 2010)]

populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

pop = pd.Series(populations, index=index)
print(pop)
print(type(pop))

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64
<class 'pandas.core.series.Series'>


- Accessing data through the index

In [3]:
# Slice the Series using tuple keys
pop[('California', 2000):('Texas', 2000)]

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
dtype: int64

- To access only the 2010 data, however, some **data munging** is needed

In [4]:
# Data munging: extract the index entries for 2010
[i for i in pop.index if i[1] == 2010]

[('California', 2010), ('New York', 2010), ('Texas', 2010)]

In [5]:
# Data munging: extract the index entries for 2010 (list comprehension vs. explicit loop)
print([i for i in pop.index if i[1] == 2010])
print("\n")
for i in pop.index:
    if i[1] == 2010:
        print(i)

[('California', 2010), ('New York', 2010), ('Texas', 2010)]


('California', 2010)
('New York', 2010)
('Texas', 2010)


In [6]:
# Munge the 2010 keys out of the index and use them to select the data
pop[[i for i in pop.index if i[1] == 2010]]

(California, 2010)    37253956
(New York, 2010)      19378102
(Texas, 2010)         25145561
dtype: int64

- *The better way* : use the **pandas MultiIndex**

- `classmethod MultiIndex.from_tuples(tuples, sortorder=None, names=None)`
  - tuples : list / sequence of tuple-likes
  - sortorder : int or None
  - names : list / sequence of str, optional
  - return - index : MultiIndex
- `classmethod MultiIndex.from_arrays(arrays, sortorder=None, names=None)`
  - arrays : list / sequence of array-likes
  - sortorder : int or None
  - names : list / sequence of str, optional
  - return - index : MultiIndex
- `classmethod MultiIndex.from_product(iterables, sortorder=None, names=None)`
  - iterables : list / sequence of iterables. Each iterable has unique labels for each level of the index.
  - sortorder : int or None
  - names : list / sequence of str, optional
  - return - index : MultiIndex
- `classmethod MultiIndex.from_frame(df, sortorder=None, names=None)`
  - df : DataFrame. DataFrame to be converted to MultiIndex.
  - sortorder : int, optional
  - names : list-like, optional
  - return : MultiIndex. The MultiIndex representation of the given DataFrame.
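The `names` parameter listed above is not exercised until much later in these notes, so here is a minimal sketch of it with `from_tuples`. The level names `state` and `year` and the variable names `named_index` and `pop_named` are illustrative assumptions, not part of the original notebook.

```python
import pandas as pd

tuples = [('California', 2000), ('California', 2010),
          ('New York', 2000), ('New York', 2010),
          ('Texas', 2000), ('Texas', 2010)]
populations = [33871648, 37253956,
               18976457, 19378102,
               20851820, 25145561]

# 'state' and 'year' are illustrative level names (an assumption)
named_index = pd.MultiIndex.from_tuples(tuples, names=['state', 'year'])
pop_named = pd.Series(populations, index=named_index)

print(list(named_index.names))
# A full key (one label per level) returns a single value
print(pop_named.loc[('Texas', 2010)])
```

Naming the levels is optional, but it tends to make later calls such as `unstack` or `xs` easier to read because levels can be referred to by name instead of position.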

In [7]:
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [11]:
index = pd.MultiIndex.from_tuples(index)
index

MultiIndex(levels=[['California', 'New York', 'Texas'], [2000, 2010]],
           codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]])

###### The codes field

<pre><code>
codes=[[0, 0, 1, 1, 2, 2], [0, 1, 0, 1, 0, 1]]

Region           Year
0:California     0:2000
0:California     1:2010           
1:New York       0:2000
1:New York       1:2010
2:Texas          0:2000
2:Texas          1:2010
</code></pre>
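The mapping above can be checked in code. The following is a small sketch, assuming pandas 0.24 or later (where the attribute is called `codes`), that rebuilds each index entry by looking up every code in its level. The names `idx` and `decoded` are illustrative.

```python
import pandas as pd

idx = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                 ('New York', 2000), ('New York', 2010),
                                 ('Texas', 2000), ('Texas', 2010)])

# idx.levels holds the unique labels of each level;
# idx.codes holds, per level, the position of the label used in each row.
decoded = [
    tuple(level[code] for level, code in zip(idx.levels, row_codes))
    for row_codes in zip(*idx.codes)
]

for entry in decoded:
    print(entry)
```

Each decoded tuple should match the corresponding entry of the original index, which is all the `levels`/`codes` pair encodes.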

In [12]:
pop2 = pop.reindex(index)
print(pop2)
print(type(pop2))

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
<class 'pandas.core.series.Series'>


In [13]:
pop

(California, 2000)    33871648
(California, 2010)    37253956
(New York, 2000)      18976457
(New York, 2010)      19378102
(Texas, 2000)         20851820
(Texas, 2010)         25145561
dtype: int64

In [26]:
# Series with the MultiIndex applied
print(pop2)
print(type(pop2))

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
<class 'pandas.core.series.Series'>


In [27]:
# Access the 2010 data
pop2[:,2010]

California    37253956
New York      19378102
Texas         25145561
dtype: int64
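As a hedged aside, the same selection can be written more explicitly with `.loc` or with `Series.xs`. The sketch below rebuilds the population Series directly from a MultiIndex so that it is self-contained; the variable names `pop_mi`, `by_loc`, and `by_xs` are illustrative.

```python
import pandas as pd

mi = pd.MultiIndex.from_tuples([('California', 2000), ('California', 2010),
                                ('New York', 2000), ('New York', 2010),
                                ('Texas', 2000), ('Texas', 2010)])
pop_mi = pd.Series([33871648, 37253956,
                    18976457, 19378102,
                    20851820, 25145561], index=mi)

# Label-based selection: every first-level label, second level equal to 2010
by_loc = pop_mi.loc[:, 2010]

# Cross-section: pick 2010 from level 1 (the year level)
by_xs = pop_mi.xs(2010, level=1)

print(by_loc)
print(by_xs)
```

Both forms drop the selected level from the result, leaving a Series indexed by the first level only.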

### [ MultiIndex as an Extra Dimension ]

- Use **`unstack()`** to convert a multi-indexed Series into a DataFrame
- **`Series.unstack(level=-1, fill_value=None)`**

In [32]:
print("pop2 : \n",pop2)
print(type(pop2))
print("\n")

pop_df = pop2.unstack()
print("pop_df : \n",pop_df)
print(type(pop_df))

pop2 : 
 California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
<class 'pandas.core.series.Series'>


pop_df : 
                 2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561
<class 'pandas.core.frame.DataFrame'>


###### Example) unstack() - the level parameter
- level=-1 : the default; the last index level (here, the second) becomes the columns
- level=0  : the first index level becomes the columns

In [35]:
s = pd.Series([1,2,3,4], index=pd.MultiIndex.from_product([['one', 'two'],
                                                           ['a','b']]))
print(s)
print(type(s))
print("\n")

# default => level = -1
s_df1 = s.unstack(level=-1)
print("s_df1 (level=-1) :\n",s_df1)
print(type(s_df1))

# level = 0
print("\n")
s_df0 = s.unstack(level=0)
print("s_df0 (level=0) :\n",s_df0)
print(type(s_df0))

one  a    1
     b    2
two  a    3
     b    4
dtype: int64
<class 'pandas.core.series.Series'>


s_df1 (level=-1) :
      a  b
one  1  2
two  3  4
<class 'pandas.core.frame.DataFrame'>


s_df0 (level=0) :
    one  two
a    1    3
b    2    4
<class 'pandas.core.frame.DataFrame'>
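The `fill_value` parameter in the `unstack` signature matters when some index combinations are missing. A minimal sketch, using a hypothetical Series `s_missing` that lacks the `('two', 'b')` entry:

```python
import pandas as pd

# ('two', 'b') is deliberately absent from the index
s_missing = pd.Series([1, 2, 3],
                      index=pd.MultiIndex.from_tuples([('one', 'a'),
                                                       ('one', 'b'),
                                                       ('two', 'a')]))

# Without fill_value the missing cell becomes NaN
wide_nan = s_missing.unstack()

# With fill_value the missing cell is replaced by the given value
wide_zero = s_missing.unstack(fill_value=0)

print(wide_nan)
print(wide_zero)
```

Whether 0 is a sensible replacement depends on the data; for counts it usually is, for measurements it may hide the fact that a value was never observed.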


- Use **`stack()`** to convert a DataFrame into a multi-indexed Series
- **`DataFrame.stack(level=-1, dropna=True)`**

In [36]:
pop_df

                2000      2010
California  33871648  37253956
New York    18976457  19378102
Texas       20851820  25145561


In [38]:
pop_ser = pop_df.stack()
print(pop_ser)
print(type(pop_ser))
print("\n")
print(pop2)
print(type(pop2))

California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
<class 'pandas.core.series.Series'>


California  2000    33871648
            2010    37253956
New York    2000    18976457
            2010    19378102
Texas       2000    20851820
            2010    25145561
dtype: int64
<class 'pandas.core.series.Series'>


##### A MultiIndex is used to represent higher-dimensional data, such as 3D or 4D data, in two or one dimensions

In [39]:
pop_df2 = pd.DataFrame({'total':pop2,
                       'under18':[9267089, 9284094,
                                   4687374, 4318033,
                                   5906301, 6879014]})
print(pop_df2)
print(type(pop_df2))

                    total  under18
California 2000  33871648  9267089
           2010  37253956  9284094
New York   2000  18976457  4687374
           2010  19378102  4318033
Texas      2000  20851820  5906301
           2010  25145561  6879014
<class 'pandas.core.frame.DataFrame'>


In [41]:
# Add the fraction of the total population that is under 18
pop_df2['rate'] = pop_df2['under18'] / pop_df2['total']
pop_df2

                    total  under18      rate
California 2000  33871648  9267089  0.273594
           2010  37253956  9284094  0.249211
New York   2000  18976457  4687374  0.247010
           2010  19378102  4318033  0.222831
Texas      2000  20851820  5906301  0.283251
           2010  25145561  6879014  0.273568


In [44]:
(pop_df2['under18'] / pop_df2['total']).unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


In [45]:
pop_df2['rate'].unstack()

                2000      2010
California  0.273594  0.249211
New York    0.247010  0.222831
Texas       0.283251  0.273568


### [ MultiIndex Creation Methods ]

- A simple way to create a Series or DataFrame with a MultiIndex is to pass a list of two or more index arrays (**index=[ [ ] , [ ] ,...]**) as the `index` parameter of the constructor

In [47]:
# Create a MultiIndex through the index parameter
df = pd.DataFrame(np.random.rand(4, 2),
                  index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2],['T','T','T','T']],
                  columns=['data1', 'data2'])
print(df)
print(type(df))

          data1     data2
a 1 T  0.489276  0.278325
  2 T  0.537170  0.215015
b 1 T  0.481349  0.108768
  2 T  0.524437  0.258011
<class 'pandas.core.frame.DataFrame'>


In [51]:
aa = df.stack()
print(aa)
print(type(aa))

a  1  T  data1    0.489276
         data2    0.278325
   2  T  data1    0.537170
         data2    0.215015
b  1  T  data1    0.481349
         data2    0.108768
   2  T  data1    0.524437
         data2    0.258011
dtype: float64
<class 'pandas.core.series.Series'>
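Besides passing nested lists to `index=`, pandas also builds a MultiIndex automatically when a Series is constructed from a dictionary keyed by tuples. This is a short sketch of that behaviour using the population figures from earlier in these notes; the variable name `pop_from_dict` is illustrative.

```python
import pandas as pd

# Tuple keys are interpreted as (level 0, level 1) labels
pop_from_dict = pd.Series({('California', 2000): 33871648,
                           ('California', 2010): 37253956,
                           ('New York', 2000): 18976457,
                           ('New York', 2010): 19378102,
                           ('Texas', 2000): 20851820,
                           ('Texas', 2010): 25145561})

print(pop_from_dict)
print(type(pop_from_dict.index))
```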


- **Explicit MultiIndex constructors**
    - `classmethod MultiIndex.from_tuples(tuples, sortorder=None, names=None)`
      - tuples : list / sequence of tuple-likes
      - sortorder : int or None
      - names : list / sequence of str, optional
      - return - index : MultiIndex
    - `classmethod MultiIndex.from_arrays(arrays, sortorder=None, names=None)`
      - arrays : list / sequence of array-likes
      - sortorder : int or None
      - names : list / sequence of str, optional
      - return - index : MultiIndex
    - `classmethod MultiIndex.from_product(iterables, sortorder=None, names=None)`
      - iterables : list / sequence of iterables. Each iterable has unique labels for each level of the index.
      - sortorder : int or None
      - names : list / sequence of str, optional
      - return - index : MultiIndex
    - `classmethod MultiIndex.from_frame(df, sortorder=None, names=None)`
      - df : DataFrame. DataFrame to be converted to MultiIndex.
      - sortorder : int, optional
      - names : list-like, optional
      - return : MultiIndex. The MultiIndex representation of the given DataFrame.

In [52]:
pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [53]:
pd.MultiIndex.from_tuples([('a', 1), ('a', 2), ('b', 1), ('b', 2)])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [54]:
pd.MultiIndex.from_product([['a', 'b'], [1, 2]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

In [57]:
pd.MultiIndex(levels=[['a', 'b'], [1, 2]],
              codes=[[0, 0, 1, 1], [0, 1, 0, 1]])

MultiIndex(levels=[['a', 'b'], [1, 2]],
           codes=[[0, 0, 1, 1], [0, 1, 0, 1]])
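Of the four class methods listed above, `from_frame` is the only one not run in these cells. A minimal sketch, assuming pandas 0.24 or later (the release that introduced `from_frame`); the DataFrame `frame` and its column names `letter` and `number` are hypothetical:

```python
import pandas as pd

# Each column of the DataFrame becomes one level of the MultiIndex
frame = pd.DataFrame({'letter': ['a', 'a', 'b', 'b'],
                      'number': [1, 2, 1, 2]})

idx_from_frame = pd.MultiIndex.from_frame(frame)

print(idx_from_frame)
print(list(idx_from_frame.names))
```

Unlike the other constructors, `from_frame` picks up level names from the column names when `names` is not given.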

- **Naming MultiIndex levels**

In [72]:
# hierarchical indices and columns
index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# mock some data
data = np.round(np.random.randn(4, 6), 1)
data[:, ::2] *= 10
data += 37

# create the DataFrame
health_data = pd.DataFrame(data, index=index, columns=columns)
health_data

subject      Bob        Guido         Sue
type          HR  Temp     HR  Temp    HR  Temp
year visit
2013 1      37.0  36.6   34.0  37.4  30.0  37.7
     2      46.0  35.9   41.0  37.0  31.0  35.7
2014 1      52.0  37.1   27.0  36.5  46.0  37.2
     2      43.0  36.9   21.0  38.5  42.0  39.0


###### name - index mapping
<pre><code>
<H3>columns MultiIndex name</H3>
<b>[names]</b>   <b>[col]</b>
subject - Bob, Guido, Sue
type    - HR, Temp
<br>
<H3>index MultiIndex name</H3>
<b>[names]</b>   <b>[index]</b>
year    - 2013, 2014
visit   - 1, 2 
</code></pre>

In [73]:
# Selecting by a top-level column label still shows the names of the remaining levels
health_data['Guido']

type          HR  Temp
year visit
2013 1      34.0  37.4
     2      41.0  37.0
2014 1      27.0  36.5
     2      21.0  38.5
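Level names can also be used for selection, not only for display. The sketch below builds a frame with the same index and column structure as `health_data`, but with deterministic placeholder values instead of random ones (an assumption made so the result is reproducible), and takes a cross-section on the `type` level by name. The names `health` and `hr_only` are illustrative.

```python
import numpy as np
import pandas as pd

index = pd.MultiIndex.from_product([[2013, 2014], [1, 2]],
                                   names=['year', 'visit'])
columns = pd.MultiIndex.from_product([['Bob', 'Guido', 'Sue'], ['HR', 'Temp']],
                                     names=['subject', 'type'])

# Placeholder values 0..23, not real measurements
data = np.arange(24, dtype=float).reshape(4, 6)
health = pd.DataFrame(data, index=index, columns=columns)

# Level names are stored on each axis
print(list(health.index.names))
print(list(health.columns.names))

# Keep only the 'HR' columns by naming the column level to cut across
hr_only = health.xs('HR', axis=1, level='type')
print(hr_only)
```

After the cross-section, the `type` level is dropped and the remaining column level keeps its name `subject`.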
