### Hierarchical Indexing

Hierarchical *indexing* enables you to have multiple index levels on an axis.

In a nutshell, it provides a way to work with higher-dimensional data in a lower-dimensional form.

In [1]:
import pandas as pd
import numpy as np

In [2]:
data = pd.Series(np.random.randn(9), 
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

In [3]:
data

a  1    1.111987
   2   -0.115392
   3   -2.121899
b  1    0.416634
   3   -2.147181
c  1   -0.257905
   2   -0.041241
d  2   -0.831751
   3   -0.186393
dtype: float64

The index of `data` is a *MultiIndex*.

In [4]:
data.index

MultiIndex(levels=[['a', 'b', 'c', 'd'], [1, 2, 3]],
           labels=[[0, 0, 0, 1, 1, 2, 2, 3, 3], [0, 1, 2, 0, 2, 0, 1, 1, 2]])

In [5]:
data['b']

1    0.416634
3   -2.147181
dtype: float64

In [6]:
data.loc['b':'c']

b  1    0.416634
   3   -2.147181
c  1   -0.257905
   2   -0.041241
dtype: float64

In [7]:
data.loc[['b', 'd']]

b  1    0.416634
   3   -2.147181
d  2   -0.831751
   3   -0.186393
dtype: float64

Selection is possible from an inner level as well.

In [8]:
data.loc[:, 2]

a   -0.115392
c   -0.041241
d   -0.831751
dtype: float64
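
Inner-level selection can also be written with `xs`, which some readers find easier to follow than a slice tuple. The sketch below is only an illustration: it uses fixed values (`np.arange`) instead of the random data above so the result is reproducible, and the names `demo`, `inner_loc` and `inner_xs` are made up for this example.

```python
import numpy as np
import pandas as pd

# Same index layout as `data` above, but with fixed values 0.0 .. 8.0
demo = pd.Series(np.arange(9.0),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Two ways to pick every row whose inner label is 2
inner_loc = demo.loc[:, 2]          # slice the outer level, fix the inner one
inner_xs = demo.xs(2, level=1)      # cross-section on level 1

print(inner_loc)
print(inner_xs)
```

Both results are indexed by the remaining outer level only, since the selected inner level is dropped.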

We can rearrange the data into a DataFrame using the `unstack` method.

In [9]:
data.unstack()

Unnamed: 0,1,2,3
a,1.111987,-0.115392,-2.121899
b,0.416634,,-2.147181
c,-0.257905,-0.041241,
d,,-0.831751,-0.186393


The inverse operation of `unstack` is `stack`:

In [10]:
data.unstack().stack()

a  1    1.111987
   2   -0.115392
   3   -2.121899
b  1    0.416634
   3   -2.147181
c  1   -0.257905
   2   -0.041241
d  2   -0.831751
   3   -0.186393
dtype: float64

With a DataFrame, either axis can have a hierarchical index.

In [11]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                    index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                    columns=[['Ohio', 'Ohio','Colorado'], 
                             ['Green', 'Red','Green']])

In [12]:
frame

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Green,Red,Green
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


Hierarchical levels can have names. If so, these will show up in the console output:

In [13]:
frame.index.names = ['key1', 'key2']

In [14]:
frame.columns.names = ['state', 'color']

In [15]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [16]:
frame['Ohio']

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


### Reordering and Sorting Levels

The `swaplevel` method takes two level numbers or names and returns a new object with the levels interchanged (the data itself is otherwise unaltered).

In [17]:
frame.swaplevel('key1', 'key2')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [18]:
frame.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [18]:
frame.unstack()

state,Ohio,Ohio,Ohio,Ohio,Colorado,Colorado
color,Green,Green,Red,Red,Green,Green
key2,1,2,1,2,1,2
key1,Unnamed: 1_level_3,Unnamed: 2_level_3,Unnamed: 3_level_3,Unnamed: 4_level_3,Unnamed: 5_level_3,Unnamed: 6_level_3
a,0,3,1,4,2,5
b,6,9,7,10,8,11


In [19]:
frame.swaplevel(0, 1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11


Data selection performance is much better on hierarchically
indexed objects if the index is lexicographically sorted starting with
the outermost level, that is, the result of calling
`sort_index(level=0)` or `sort_index()`.
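
Whether an index is already lexicographically sorted can be checked before selecting from it. The following is a minimal sketch using the `is_monotonic_increasing` property of the index; the frame and the column names `x`, `y`, `z` are invented for the illustration.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(np.arange(12).reshape((4, 3)),
                          index=[['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                          columns=['x', 'y', 'z'])

# Swapping the levels leaves the rows in place, so the index
# (1, a), (2, a), (1, b), (2, b) is no longer sorted.
swapped = frame_demo.swaplevel(0, 1)
unsorted_flag = bool(swapped.index.is_monotonic_increasing)

# Sorting on the new outermost level restores lexicographic order.
sorted_swapped = swapped.sort_index(level=0)
sorted_flag = bool(sorted_swapped.index.is_monotonic_increasing)

print(unsorted_flag, sorted_flag)
```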

**Summary Statistics by Level**

In [20]:
frame.sum(level='key2')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [21]:
frame.sum(level='key1')

state,Ohio,Ohio,Colorado
color,Green,Red,Green
key1,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
a,3,5,7
b,15,17,19


In [22]:
frame

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Green,Red,Green
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [23]:
frame.sum(level='color', axis=1)

Unnamed: 0_level_0,color,Green,Red
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In the code above we sum the values that share the same `color` label: Green columns are added to Green columns and Red columns to Red columns.
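
The `level` argument of `sum` was deprecated in pandas 1.3 and removed in pandas 2.0, so the cells above only run on older versions. A sketch of the equivalent aggregation with `groupby(level=...)` follows; it rebuilds the same frame under the hypothetical name `frame_demo`, and it transposes for the column-wise case because grouping with `axis=1` is itself deprecated in recent releases.

```python
import numpy as np
import pandas as pd

frame_demo = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['Ohio', 'Ohio', 'Colorado'],
                                       ['Green', 'Red', 'Green']],
                                      names=['state', 'color']))

# Row-wise: sum all rows that share the same key2 label
by_key2 = frame_demo.groupby(level='key2').sum()

# Column-wise: transpose, group the former columns by color, transpose back
by_color = frame_demo.T.groupby(level='color').sum().T

print(by_key2)
print(by_color)
```

The numbers agree with the outputs shown above for `level='key2'` and `level='color', axis=1`.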

#### Indexing with DataFrame's columns

In [24]:
frame = pd.DataFrame({'a':range(7), 'b':range(7, 0, -1),
                     'c':['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                     'd':[0, 1, 2, 0, 1, 2, 3]})

In [25]:
frame

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


DataFrame's `set_index` function will create a new DataFrame using one or more of its columns as the index.

In [26]:
frame2 = frame.set_index(['c', 'd'])

In [27]:
frame2

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [28]:
frame.set_index(['c', 'd'], drop=False)

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [29]:
frame.reset_index()

Unnamed: 0,index,a,b,c,d
0,0,0,7,one,0
1,1,1,6,one,1
2,2,2,5,one,2
3,3,3,4,two,0
4,4,4,3,two,1
5,5,5,2,two,2
6,6,6,1,two,3
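
Note that the cell above calls `reset_index` on the original `frame`, which only moves the default integer index into a column named `index`. Applied to a hierarchically indexed frame, `reset_index` does the opposite of `set_index`: the index levels move back into the columns. A short sketch, with `frame_demo`, `indexed` and `restored` as illustrative names:

```python
import pandas as pd

frame_demo = pd.DataFrame({'a': range(7), 'b': range(7, 0, -1),
                           'c': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                           'd': [0, 1, 2, 0, 1, 2, 3]})

# Columns c and d become the two index levels ...
indexed = frame_demo.set_index(['c', 'd'])

# ... and reset_index moves them back, placing them before the data columns
restored = indexed.reset_index()

print(list(indexed.index.names))
print(list(restored.columns))
```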


## Combining and Merging Datasets

**Database-Style DataFrame Joins**

*Merge* or *join* operations combine datasets by linking rows using one or more keys.

In [30]:
df1 = pd.DataFrame({'key':['b', 'b','a', 'c', 'a', 'a', 'b'],
                   'data1':range(7)})

In [31]:
df2 = pd.DataFrame({'key':['a', 'b', 'd'],
                   'data2':range(3)})

In [35]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [36]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [37]:
pd.merge(df2, df1)

Unnamed: 0,key,data2,data1
0,a,0,2
1,a,0,4
2,a,0,5
3,b,1,0
4,b,1,1
5,b,1,6


In [38]:
pd.merge(df1, df2)

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


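In the code above I did not specify which column to join on. When that information is omitted, `merge` uses the overlapping column names as the keys. This is a *many-to-one* join: df1 has multiple rows labeled *a* and *b*, whereas df2 has only one row for each value in the `key` column. It is good practice to specify the key explicitly: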

In [39]:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0


If the column names are different in each object, you can specify them separately:

In [40]:
df3 = pd.DataFrame({'lkey':['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                   'data1':range(7)})

In [41]:
df4 = pd.DataFrame({'rkey':['a', 'b', 'd'],
                   'data2':range(3)})

In [42]:
df4

Unnamed: 0,rkey,data2
0,a,0
1,b,1
2,d,2


In [43]:
pd.merge(df3, df4, left_on='lkey', right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0


In [44]:
pd.merge(df4, df3, left_on='rkey', right_on='lkey')

Unnamed: 0,rkey,data2,lkey,data1
0,a,0,a,2
1,a,0,a,4
2,a,0,a,5
3,b,1,b,0
4,b,1,b,1
5,b,1,b,6


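The result of a merge depends heavily on which DataFrame is passed first and on the `left_on` and `right_on` parameters of the `merge` function: the left DataFrame's columns come first, and for an inner join the rows follow the order of the left keys.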

You may notice that 'c' and 'd' values and associated data are missing from the result. By default `merge` does an `inner` join; the keys in the result are the intersection, or the common set found in both tables. 

Other possible options are `left`, `right` and `outer`.

`outer` join takes the union of the keys, combining the effect of applying both left and right joins.

In [45]:
pd.merge(df1, df2, how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
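
To see where each row of a merged result comes from, `merge` accepts an `indicator` argument that adds a `_merge` column with the values `left_only`, `right_only` or `both`. The sketch below rebuilds the two frames under the hypothetical names `left_df` and `right_df` and counts the rows of each kind in an outer join.

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                        'data1': range(7)})
right_df = pd.DataFrame({'key': ['a', 'b', 'd'],
                         'data2': range(3)})

# indicator=True records the origin of every row in a `_merge` column
merged = pd.merge(left_df, right_df, on='key', how='outer', indicator=True)

# Convert the categorical column to plain strings before counting
origin_counts = merged['_merge'].astype(str).value_counts().to_dict()
print(origin_counts)
```

Rows marked `both` are exactly those an inner join would keep; `left_only` and `right_only` rows are what the left and right joins add.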


*Many-to-Many* merges have well-defined, though not necessarily intuitive, behavior.

In [46]:
df1 = pd.DataFrame({'key':['b', 'b', 'a', 'c', 'a', 'b'],
                   'data1':range(6)})

In [47]:
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,b,5


In [48]:
df2 = pd.DataFrame({'key':['a', 'b', 'a', 'b', 'd'],
                   'data2':range(5)})

In [49]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,a,2
3,b,3
4,d,4


In [50]:
pd.merge(df1, df2, on='key', how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,b,0,3.0
2,b,1,1.0
3,b,1,3.0
4,a,2,0.0
5,a,2,2.0
6,c,3,
7,a,4,0.0
8,a,4,2.0
9,b,5,1.0
10,b,5,3.0


*Many-to-Many* joins form the Cartesian product of the matching rows. Since there are three 'b' rows in the left DataFrame and two in the right one, there are six 'b' rows in the result.

In [51]:
pd.merge(df1, df2, how='inner')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,0,3
2,b,1,1
3,b,1,3
4,b,5,1
5,b,5,3
6,a,2,0
7,a,2,2
8,a,4,0
9,a,4,2
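
The Cartesian-product rule can be checked by counting: for every key present on both sides, the number of rows in the inner join is the product of the key's counts in the two frames. A small sketch of that check, with `left_df`, `right_df` and the other names chosen only for this illustration:

```python
import pandas as pd

left_df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
                        'data1': range(6)})
right_df = pd.DataFrame({'key': ['a', 'b', 'a', 'b', 'd'],
                         'data2': range(5)})

# Occurrences of each key on each side
left_counts = left_df['key'].value_counts()
right_counts = right_df['key'].value_counts()

# Multiplication aligns on the key; keys missing on one side become NaN
expected = (left_counts * right_counts).dropna().astype(int).to_dict()

merged = pd.merge(left_df, right_df, on='key', how='inner')
actual = merged['key'].value_counts().to_dict()

print(expected)
print(actual)
```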


To merge with multiple keys, pass a list of column names:

In [54]:
left = pd.DataFrame({'key1':['foo', 'foo', 'bar'],
                   'key2':['one', 'two', 'one'],
                   'lval':[1, 2, 3]})

In [55]:
right = pd.DataFrame({'key1':['foo', 'foo', 'bar', 'bar'],
                     'key2':['one', 'one', 'one', 'two'],
                     'rval':[4, 5, 6, 7]})

In [56]:
pd.merge(left, right, on=['key1', 'key2'], how='outer')

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1.0,4.0
1,foo,one,1.0,5.0
2,foo,two,2.0,
3,bar,one,3.0,6.0
4,bar,two,,7.0


When you are joining columns-to-columns, the indexes on the passed DataFrame objects are discarded.

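A last issue to consider in merge operations is the treatment of overlapping column names. By default `merge` appends `_x` and `_y` to them; the `suffixes` option lets you choose the strings to append instead.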

In [57]:
pd.merge(left, right, on='key1')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [58]:
pd.merge(left, right, on='key1', suffixes=('_left', '_right'))

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


#### Merging on Index

In [59]:
left1 = pd.DataFrame({'key':['a', 'b', 'a', 'a', 'b', 'c'],
                     'value':range(6)})

In [60]:
right1 = pd.DataFrame({'group_val':[3.5, 7]}, index=['a', 'b'])

In [61]:
pd.merge(left1, right1, left_on='key', right_index=True)

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0


Since the default merge method is to intersect the join keys, we can form their union instead with the `how='outer'` parameter.

In [62]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


In [63]:
lefth = pd.DataFrame({'key1':['Ohio','Ohio', 'Ohio',
                             'Nevada', 'Nevada'],
                     'key2':[2000, 2001, 2002, 2001, 2002],
                     'data':np.arange(5.)})

In [64]:
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                     index=[['Nevada', 'Nevada', 'Ohio', 'Ohio', 'Ohio', 'Ohio'],
                           [2001,2000, 2000, 2000, 2001, 2002]],
                     columns=['event1', 'event2'])

In [65]:
lefth

Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Nevada,2002,4.0


In [66]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


With hierarchically indexed data, joining on the index is implicitly a multiple-key merge. In this case you have to indicate the multiple columns to merge on as a list:

In [67]:
pd.merge(lefth, righth, left_on=['key1', 'key2'], right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


In [68]:
pd.merge(lefth, righth, left_on=['key1', 'key2'],
        right_index=True, how='outer')

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4.0,5.0
0,Ohio,2000,0.0,6.0,7.0
1,Ohio,2001,1.0,8.0,9.0
2,Ohio,2002,2.0,10.0,11.0
3,Nevada,2001,3.0,0.0,1.0
4,Nevada,2002,4.0,,
4,Nevada,2000,,2.0,3.0


Using the indexes of both sides of the merge is also possible.

In [72]:
left2 = pd.DataFrame([[1, 2], [3, 4], [5, 6]],
                    index=['a', 'c', 'e'],
                    columns=['Ohio', 'Nevada'],
                    dtype=np.float64)

In [73]:
right2 = pd.DataFrame([[7, 8], [9, 10], [11, 12], [13, 14]],
                     index=['b', 'c', 'd', 'e'],
                     columns=['Missouri', 'Albama'],
                     dtype=np.float64)

In [74]:
left2

Unnamed: 0,Ohio,Nevada
a,1.0,2.0
c,3.0,4.0
e,5.0,6.0


In [75]:
right2

Unnamed: 0,Missouri,Albama
b,7.0,8.0
c,9.0,10.0
d,11.0,12.0
e,13.0,14.0


In [76]:
pd.merge(left2, right2, how='outer', left_index=True, right_index=True)

Unnamed: 0,Ohio,Nevada,Missouri,Albama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


DataFrame also has a `join` instance method for merging by index. It can also be used to combine many DataFrame objects that have the same or similar indexes but non-overlapping columns.

To get the previous DataFrame, we could also have written:
```Python
left2.join(right2, how='outer')
```

In [77]:
left2.join(right2, how='outer')

Unnamed: 0,Ohio,Nevada,Missouri,Albama
a,1.0,2.0,,
b,,,7.0,8.0
c,3.0,4.0,9.0,10.0
d,,,11.0,12.0
e,5.0,6.0,13.0,14.0


In [78]:
left1

Unnamed: 0,key,value
0,a,0
1,b,1
2,a,2
3,a,3
4,b,4
5,c,5


In [79]:
right1

Unnamed: 0,group_val
a,3.5
b,7.0


In [80]:
left1.join(right1, on='key')

Unnamed: 0,key,value,group_val
0,a,0,3.5
1,b,1,7.0
2,a,2,3.5
3,a,3,3.5
4,b,4,7.0
5,c,5,


To get the same rows with `merge` (in a different row order), we have to write:

In [84]:
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

Unnamed: 0,key,value,group_val
0,a,0,3.5
2,a,2,3.5
3,a,3,3.5
1,b,1,7.0
4,b,4,7.0
5,c,5,


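So, to merge two DataFrames there must be common columns; otherwise we need to pass `right_index=True` together with `left_on` naming the key column of the left DataFrame, so that the left key column is matched against the right index. The DataFrame holding the key column (here the larger one) goes first, and the one indexed by the key goes second.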
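
The relationship between the two spellings can be made explicit: `join` with `on=` performs a left join by default, so it corresponds to `merge` with `left_on`, `right_index=True` and `how='left'`. The sketch below rebuilds `left1` and `right1` under hypothetical names and compares the two results.

```python
import pandas as pd

left_demo = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                          'value': range(6)})
right_demo = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# join: match the 'key' column of the caller against the index of right_demo
joined = left_demo.join(right_demo, on='key')

# merge: the same left join, spelled out
merged = pd.merge(left_demo, right_demo, left_on='key',
                  right_index=True, how='left')

print(joined)
print(merged)
```

Key `c` has no match in the right index, so its `group_val` is NaN in both results.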

##### Index-on-Index Merges

In [91]:
another = pd.DataFrame([[7, 8], [9, 10], [11, 12], [16, 17]],
                      index=[i for i in 'acef'],
                      columns=['New York', 'Oregon'],
                      dtype=np.float64)

In [92]:
another

Unnamed: 0,New York,Oregon
a,7.0,8.0
c,9.0,10.0
e,11.0,12.0
f,16.0,17.0


In [93]:
left2.join([right2,another])

Unnamed: 0,Ohio,Nevada,Missouri,Albama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
c,3.0,4.0,9.0,10.0,9.0,10.0
e,5.0,6.0,13.0,14.0,11.0,12.0


In [95]:
left2.join([right2, another], how='outer', sort=True)

Unnamed: 0,Ohio,Nevada,Missouri,Albama,New York,Oregon
a,1.0,2.0,,,7.0,8.0
b,,,7.0,8.0,,
c,3.0,4.0,9.0,10.0,9.0,10.0
d,,,11.0,12.0,,
e,5.0,6.0,13.0,14.0,11.0,12.0
f,,,,,16.0,17.0


## Concatenating Along an Axis

NumPy's `concatenate` function can do this with NumPy arrays.

In [96]:
arr = np.arange(12).reshape((3, 4))

In [97]:
arr

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

In [98]:
np.concatenate([arr, arr], axis=1)

array([[ 0,  1,  2,  3,  0,  1,  2,  3],
       [ 4,  5,  6,  7,  4,  5,  6,  7],
       [ 8,  9, 10, 11,  8,  9, 10, 11]])

In [99]:
np.concatenate([arr, arr], axis=0)

array([[ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11],
       [ 0,  1,  2,  3],
       [ 4,  5,  6,  7],
       [ 8,  9, 10, 11]])

Suppose we have three Series with no index overlap, that is, with no index values in common. How do we concatenate them?

Take this example:

In [100]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

Calling `concat` with these objects in a list glues the values and indexes together end to end, in the order given.

In [101]:
pd.concat([s1, s3, s2])

a    0
b    1
f    5
g    6
c    2
d    3
e    4
dtype: int64

In [102]:
np.concatenate([s2, s1, s3])

array([2, 3, 4, 0, 1, 5, 6])

By default `concat` works along `axis=0` and produces another Series. Passing `axis=1` lines the Series up as the columns of a DataFrame instead:

In [104]:
pd.concat([s1, s2, s3], axis=1, sort=False)

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [106]:
s4 = pd.concat([s1, s2, s3], axis=1, sort=False)

In [107]:
s4

Unnamed: 0,0,1,2
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


In [108]:
s4 = pd.concat([s1, s3], axis=1, sort=False)

In [109]:
s4

Unnamed: 0,0,1
a,0.0,
b,1.0,
f,,5.0
g,,6.0


In [110]:
s4 = pd.concat([s1, s3], sort=False)

In [111]:
s4

a    0
b    1
f    5
g    6
dtype: int64

In [112]:
pd.concat([s1, s4], axis=1, sort=False)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
g,,6


In [113]:
pd.concat([s1, s4], axis=1, join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


We can even specify the labels to be used on the other axis with `join_axes` (this argument was removed in pandas 1.0; later versions call `reindex` on the result instead):

In [114]:
pd.concat([s1, s4], axis=1, join_axes=[[i for i in 'acbe']])

Unnamed: 0,0,1
a,0.0,0.0
c,,
b,1.0,1.0
e,,
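
On pandas versions without `join_axes`, the same table can be produced by concatenating first and then calling `reindex` with the desired labels. A minimal sketch, rebuilding `s1` and `s4` under the hypothetical names `s1_demo` and `s4_demo`:

```python
import pandas as pd

s1_demo = pd.Series([0, 1], index=['a', 'b'])
s4_demo = pd.concat([s1_demo, pd.Series([5, 6], index=['f', 'g'])])

# Concatenate column-wise, then keep exactly the labels a, c, b, e.
# Labels missing from the concatenated frame (c, e) are filled with NaN.
result = pd.concat([s1_demo, s4_demo], axis=1).reindex(['a', 'c', 'b', 'e'])

print(result)
```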


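Suppose we want to create a hierarchical index on the concatenation axis, so that the concatenated pieces remain identifiable in the result. We can do this with the `keys` argument: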

In [115]:
result = pd.concat([s1, s2, s3], keys=['one', 'two', 'three'])

In [116]:
result

one    a    0
       b    1
two    c    2
       d    3
       e    4
three  f    5
       g    6
dtype: int64

In [118]:
result.unstack()

Unnamed: 0,a,b,c,d,e,f,g
one,0.0,1.0,,,,,
two,,,2.0,3.0,4.0,,
three,,,,,,5.0,6.0


In the case of combining Series along axis=1, the keys become the DataFrame column headers.

In [119]:
pd.concat([s1, s2, s3],axis=1, keys=['one', 'two', 'three'])


Unnamed: 0,one,two,three
a,0.0,,
b,1.0,,
c,,2.0,
d,,3.0,
e,,4.0,
f,,,5.0
g,,,6.0


This method can be extended to DataFrame objects.

In [120]:
df1 = pd.DataFrame(np.arange(6).reshape((3, 2)), 
                  index=['a', 'b', 'c'],
                  columns=['one', 'two'])

In [121]:
df2 = pd.DataFrame(5+np.arange(4).reshape((2, 2)),
                  index=['a', 'c'],
                  columns=['three', 'four'])

In [122]:
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [123]:
df2

Unnamed: 0,three,four
a,5,6
c,7,8


In [124]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'])


Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


If we pass a dict of objects instead of a list, the dict's keys will be used for the `keys` option:

In [125]:
pd.concat({'level1':df1, 'level2':df2},axis=1)


Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


We can name the created axis levels with the `names` argument:

In [126]:
pd.concat([df1, df2], axis=1, keys=['level1', 'level2'], 
          names=['upper', 'lower'])


upper,level1,level1,level2,level2
lower,one,two,three,four
a,0,1,5.0,6.0
b,2,3,,
c,4,5,7.0,8.0


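A last consideration concerns DataFrames in which the row index does not contain any relevant data. In this case, pass `ignore_index=True` to discard the original row labels and number the result from 0: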

In [127]:
df1 = pd.DataFrame(np.random.randn(3, 4),columns=[i for i in 'abcd'])

In [128]:
df2 = pd.DataFrame(np.random.randn(2, 3), columns=[i for i in 'bda'])

In [129]:
df1

Unnamed: 0,a,b,c,d
0,0.076308,-1.502635,0.17238,0.70934
1,1.083938,-0.887503,-0.871296,2.269384
2,-0.98391,1.022088,1.157365,0.962637


In [130]:
df2

Unnamed: 0,b,d,a
0,-0.197287,0.363217,1.215855
1,0.210288,-0.62188,0.148376


In [131]:
pd.concat([df1, df2], ignore_index=True)


Unnamed: 0,a,b,c,d
0,0.076308,-1.502635,0.17238,0.70934
1,1.083938,-0.887503,-0.871296,2.269384
2,-0.98391,1.022088,1.157365,0.962637
3,1.215855,-0.197287,,0.363217
4,0.148376,0.210288,,-0.62188


### Combining Data with Overlap

Suppose you have two datasets whose indexes overlap in full or part. 
We can consider NumPy's `where` function, which performs the array-oriented equivalent of an if-else expression.

In [132]:
from numpy import nan as NA

In [133]:
a = pd.Series([NA, 2.5, NA, 3.5, 4.5, NA],
             index=[i for i in 'fedcba'])

In [134]:
b = pd.Series(np.arange(len(a), dtype=np.float64),
             index=[i for i in 'fedcba'])

In [135]:
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [136]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64

In [137]:
b.iloc[-1]

5.0

In [138]:
b.iloc[-1] = NA

In [139]:
np.where(pd.isnull(a), b, a)

array([0. , 2.5, 2. , 3.5, 4.5, nan])

Series has a `combine_first` method, which performs the equivalent operation along with pandas's usual data alignment logic:

In [140]:
pd.isnull(a)

f     True
e    False
d     True
c    False
b    False
a     True
dtype: bool

In [142]:
b[:-2]

f    0.0
e    1.0
d    2.0
c    3.0
dtype: float64

In [143]:
b[:-2].combine_first(a[2:])

a    NaN
b    4.5
c    3.0
d    2.0
e    1.0
f    0.0
dtype: float64
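
The difference between the two approaches is alignment: `np.where` works position by position on raw arrays, whereas `combine_first` matches values by index label. When both Series carry the same labels the two agree, as this sketch suggests (the names `a_demo`, `b_demo`, `by_where` and `by_combine` are illustrative only):

```python
import numpy as np
import pandas as pd

a_demo = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],
                   index=list('fedcba'))
b_demo = pd.Series(np.arange(len(a_demo), dtype=np.float64),
                   index=list('fedcba'))
b_demo.iloc[-1] = np.nan

# Positional patching: take b where a is missing, otherwise a
by_where = np.where(pd.isnull(a_demo), b_demo, a_demo)

# Label-based patching: fill the holes in a with the values of b
by_combine = a_demo.combine_first(b_demo)

# Put the result in a's label order before comparing with the raw array
aligned = by_combine.reindex(a_demo.index)

print(by_where)
print(aligned)
```

Label `a` stays missing in both results because it is NaN in both inputs.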

With DataFrames, `combine_first` does the same thing column by column, so you can think of it as patching missing data in the calling object with data from the object you pass.

In [145]:
df1 = pd.DataFrame({'a':[1., NA, 5., NA],
                   'b':[NA, 2., NA,6.],
                   'c':range(2, 18, 4)})

In [146]:
df2 = pd.DataFrame({'a': [5., 4., NA, 3., 7.],
                   'b':[NA, 3., 4., 6., 8.]})

In [147]:
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [148]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [150]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping and Pivoting

**Reshaping with Hierarchical Indexing**

In [151]:
data = pd.DataFrame(np.arange(6).reshape((2, 3)),
                   index=pd.Index(['Ohio', 'Colorado'], name='state'),
                   columns=pd.Index(['one', 'two', 'three'],
                                   name='number'))

In [152]:
data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


Using the `stack` method on this data pivots the columns into the rows, producing a Series:

In [153]:
result = data.stack()

In [154]:
result

state     number
Ohio      one       0
          two       1
          three     2
Colorado  one       3
          two       4
          three     5
dtype: int64

To reverse it, we can use `unstack()`

In [155]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,1,2
Colorado,3,4,5


By default the innermost level is unstacked. We can unstack a different level by passing a level number or name:

In [156]:
result.unstack(0)

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5


In [157]:
result.unstack('state')

state,Ohio,Colorado
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,3
two,1,4
three,2,5
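
One way to read `unstack('state')` on the stacked data is as a transpose of the original table: the `state` labels move to the columns and the `number` labels stay on the rows. The sketch below rebuilds the frame under the hypothetical name `data_demo` and looks up the same cell in each layout.

```python
import numpy as np
import pandas as pd

data_demo = pd.DataFrame(np.arange(6).reshape((2, 3)),
                         index=pd.Index(['Ohio', 'Colorado'], name='state'),
                         columns=pd.Index(['one', 'two', 'three'],
                                          name='number'))

# Long form: one value per (state, number) pair
long_form = data_demo.stack()

# Wide form again, but with the outer level (state) moved to the columns
wide_by_state = long_form.unstack('state')

# The same cell addressed in all three layouts
print(data_demo.loc['Colorado', 'two'])
print(long_form.loc[('Colorado', 'two')])
print(wide_by_state.loc['two', 'Colorado'])
```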


## The End :) :)