
# Chapter 8: Data Wrangling: Join, Combine, and Reshape



In [1]:
import numpy as np
from pandas import Series, DataFrame

## Hierarchical Indexing


Hierarchical indexing is used extensively in the manipulations in this chapter. It lets an axis carry two or more index levels, which gives you a way to work with higher-dimensional data in a lower-dimensional form.

In [3]:
data = Series(np.random.randn(9), index = [['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])


data

a  1    1.284930
   2   -0.524601
   3    1.520868
b  1    0.298517
   3    0.591261
c  1   -1.047802
   2    0.354460
d  2    1.079982
   3   -1.187737
dtype: float64

In [4]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

In [7]:
# With a hierarchically indexed object, so-called partial indexing is possible,
# enabling you to concisely select subsets of the data:

data['b']


1    0.298517
3    0.591261
dtype: float64

In [8]:
data['b':'c']

b  1    0.298517
   3    0.591261
c  1   -1.047802
   2    0.354460
dtype: float64

In [9]:
data.loc[['b','c']]

b  1    0.298517
   3    0.591261
c  1   -1.047802
   2    0.354460
dtype: float64

In [11]:
data.loc[:,2]

a   -0.524601
c    0.354460
d    1.079982
dtype: float64
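
The selections above can be reproduced on fixed data. The following is a minimal sketch, assuming a Series `s` (a hypothetical stand-in for `data`) that holds the values 0 through 8 instead of random numbers, so the results are predictable. It also shows `xs`, which is another way to spell the inner-level selection.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: same two-level index, fixed values 0..8.
s = pd.Series(
    np.arange(9.0),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

outer_b = s.loc["b"]             # every row under the outer label 'b'
inner_2 = s.loc[:, 2]            # inner label 2 across all outer labels
inner_2_xs = s.xs(2, level=1)    # the same selection written with xs
single = s.loc[("a", 3)]         # one value, addressed by a full tuple key

print(outer_b)
print(inner_2)
print(single)
```

Selecting on the outer level drops that level from the result, and selecting on the inner level leaves only the outer labels that actually contain the requested inner label.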

In [12]:
# The unstack method of a multi-level index: the inner level becomes the columns of a DataFrame
data.unstack()

          1         2         3
a  1.284930 -0.524601  1.520868
b  0.298517       NaN  0.591261
c -1.047802  0.354460       NaN
d       NaN  1.079982 -1.187737


In [13]:
# The inverse operation of unstack is stack.
data.unstack().stack()

a  1    1.284930
   2   -0.524601
   3    1.520868
b  1    0.298517
   3    0.591261
c  1   -1.047802
   2    0.354460
d  2    1.079982
   3   -1.187737
dtype: float64
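
The round trip between the long and wide forms can be checked on fixed data. This is a sketch under stated assumptions: `s` is a hypothetical Series with the same index as `data` but with the values 0 through 8. Whether `stack` drops the missing entries by itself depends on the pandas version, so the sketch calls `dropna()` explicitly rather than relying on either behavior.

```python
import numpy as np
import pandas as pd

# Hypothetical stand-in for `data`: same two-level index, fixed values 0..8.
s = pd.Series(
    np.arange(9.0),
    index=[["a", "a", "a", "b", "b", "c", "c", "d", "d"],
           [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Wide form: the inner level becomes the columns; absent pairs become NaN.
wide = s.unstack()
n_missing = int(wide.isna().sum().sum())

# Long form again: stack the columns back into the index, discard the
# NaN placeholders explicitly, and sort so the order matches `s`.
long_again = wide.stack().dropna().sort_index()

print(wide)
print(n_missing)
print(long_again)
```

The three index pairs that do not exist in `s`, namely `('b', 2)`, `('c', 3)`, and `('d', 1)`, are the cells that show up as NaN in the wide form.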

With a DataFrame, either axis can have a hierarchical index:

In [21]:
frame = DataFrame(np.arange(12).reshape((4, 3)),
          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
          columns=[['Ohio', 'Ohio', 'Colorado'],
          ['Green', 'Red', 'Green']])
frame

     Ohio     Colorado
    Green Red    Green
a 1     0   1        2
  2     3   4        5
b 1     6   7        8
  2     9  10       11


In [22]:
frame.index.names

FrozenList([None, None])

In [23]:
frame.columns.names

FrozenList([None, None])

In [24]:
# hierarchical levels can have names


frame.index.names = ['key1', 'key2']
frame.columns.names = ['state', 'color']

In [25]:
print(frame.index.names)
print(frame.columns.names)

['key1', 'key2']
['state', 'color']


In [26]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [27]:
frame['Ohio']

color      Green  Red
key1 key2
a    1         0    1
     2         3    4
b    1         6    7
     2         9   10


In [29]:
frame['Ohio'].loc['a'] # column and row selections can be chained

color  Green  Red
key2
1          0    1
2          3    4


In [35]:
# A MultiIndex can be created by itself and then reused; columns with the same
# level names as the preceding DataFrame (here with lowercase color labels)
# could be created like this:
import pandas as pd
mul_col = pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                     ['green', 'red', 'green']],
                                    names=['state', 'color'])

In [36]:
x = DataFrame(np.arange(12).reshape((4, 3)),
          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
          columns=mul_col)
x

state  Ohio     Colorado
color green red    green
a 1       0   1        2
  2       3   4        5
b 1       6   7        8
  2       9  10       11
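
A standalone `MultiIndex` can be built in several equivalent ways and shared between objects. The sketch below is illustrative only; the names `cols`, `same_cols`, `rows`, `df1`, and `df2` are hypothetical and do not appear elsewhere in this notebook. It uses three constructors that pandas provides: `from_arrays`, `from_tuples`, and `from_product`.

```python
import numpy as np
import pandas as pd

# One array per level.
cols = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
    names=["state", "color"],
)

# The same index, written as one tuple per column.
same_cols = pd.MultiIndex.from_tuples(
    [("Ohio", "Green"), ("Ohio", "Red"), ("Colorado", "Green")],
    names=["state", "color"],
)

# Every combination of the given labels, in order.
rows = pd.MultiIndex.from_product([["a", "b"], [1, 2]], names=["key1", "key2"])

# The index objects can be reused for any number of DataFrames.
df1 = pd.DataFrame(np.arange(12).reshape(4, 3), index=rows, columns=cols)
df2 = pd.DataFrame(np.zeros((4, 3)), index=rows, columns=cols)

print(df1)
print(df1.loc[("b", 1), ("Ohio", "Red")])
```

`from_product` is convenient when every combination of labels is present, while `from_arrays` and `from_tuples` suit indexes with gaps or repeated labels.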


### Reordering and Sorting Levels


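At times you need to rearrange the order of the levels on an axis. The ***swaplevel*** method takes two level numbers or names and returns a new object with those levels interchanged; the data itself is otherwise unaltered.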




In [37]:
x

state  Ohio     Colorado
color green red    green
a 1       0   1        2
  2       3   4        5
b 1       6   7        8
  2       9  10       11


In [39]:
x.swaplevel('state','color',axis=1)

color green  red    green
state  Ohio Ohio Colorado
a 1       0    1        2
  2       3    4        5
b 1       6    7        8
  2       9   10       11


In [40]:
frame

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [41]:
frame.swaplevel('key1','key2',axis=0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
2    a        3   4        5
1    b        6   7        8
2    b        9  10       11


In [42]:
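# Sort the rows by the labels of a single index level (here level 1, key2).
# Apart from level, every argument below is left at its default value; they
# are spelled out only to show what can be controlled.
frame.sort_index(level=1, axis=0, ascending=True, inplace=False, kind='quicksort')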

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
b    1        6   7        8
a    2        3   4        5
b    2        9  10       11


In [43]:
frame.sort_index(level=0, axis=0, ascending=True, inplace=False, kind='quicksort')

state      Ohio     Colorado
color     Green Red    Green
key1 key2
a    1        0   1        2
     2        3   4        5
b    1        6   7        8
     2        9  10       11


In [44]:
frame.swaplevel(0,1).sort_index(level = 0)

state      Ohio     Colorado
color     Green Red    Green
key2 key1
1    a        0   1        2
     b        6   7        8
2    a        3   4        5
     b        9  10       11
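
The difference between swapping levels and sorting by them can be checked directly. The following sketch rebuilds the named `frame` from this section so that it runs by itself. The variables `swapped` and `sorted_swapped` are hypothetical names introduced for illustration.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# swaplevel only relabels: the rows stay in their original order.
swapped = frame.swaplevel("key1", "key2")

# sort_index then reorders the rows by the new outermost level.
sorted_swapped = swapped.sort_index(level=0)

print(swapped.index.is_monotonic_increasing)         # rows not yet in sorted order
print(sorted_swapped.index.is_monotonic_increasing)  # rows in sorted order
print(sorted_swapped)
```

Selection on a hierarchical index generally works best when the index is sorted starting from the outermost level, which is why `swaplevel` is often followed by `sort_index`.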


### Summary Statistics by Level

Summary statistics can be aggregated by index level on a particular axis.



In [46]:
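# Older pandas versions let statistical methods such as sum take a level
# argument. That argument was removed in pandas 2.0; grouping by the level
# gives the same result.
frame.groupby(level='key2').sum()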

state  Ohio     Colorado
color Green Red    Green
key2
1         6   8       10
2        12  14       16


In [47]:
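# Sum across columns by the 'color' level: transpose, group the rows, and
# transpose back (sum(level=..., axis=1) was removed in pandas 2.0).
frame.T.groupby(level='color').sum().T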

color      Green  Red
key1 key2
a    1         2    1
     2         8    4
b    1        14    7
     2        20   10
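
Other statistics can be aggregated by level in the same way as the sums. The sketch below rebuilds `frame` so that it runs by itself and computes a mean and a maximum per level. The names `mean_by_key1`, `manual_a`, and `max_by_key2` are hypothetical, and grouping by level is one possible way to compute these statistics, not the only one.

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=pd.MultiIndex.from_arrays(
        [["a", "a", "b", "b"], [1, 2, 1, 2]], names=["key1", "key2"]
    ),
    columns=pd.MultiIndex.from_arrays(
        [["Ohio", "Ohio", "Colorado"], ["Green", "Red", "Green"]],
        names=["state", "color"],
    ),
)

# Mean of each column within each key1 group.
mean_by_key1 = frame.groupby(level="key1").mean()

# Cross-check for one group: select the 'a' rows by hand, then average.
manual_a = frame.xs("a", level="key1").mean()

# Maximum of each column within each key2 group.
max_by_key2 = frame.groupby(level="key2").max()

print(mean_by_key1)
print(max_by_key2)
```

The manual cross-check with `xs` shows what the grouped call does for each label: it selects the rows that share that label and reduces them column by column.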


### Indexing with a DataFrame's columns






## Combining and Merging Datasets 

### Database-Style DataFrame Joins




### Merging on Index




### Concatenating Along an Axis





### Combining Data with Overlap





## Reshaping and Pivoting

### Reshaping with Hierarchical Indexing




### Pivoting "Long" to "Wide" Format




### Pivoting "Wide" to "Long" Format




