#### 8.1 Hierarchical Indexing

Hierarchical indexing is an important feature of pandas that enables you to have
multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for
you to work with higher dimensional data in a lower dimensional form. Let’s start
with a simple example; create a Series with a list of lists (or arrays) as the index:

In [14]:
import numpy as np
import pandas as pd

In [15]:
data=pd.Series(np.random.randn(9),index=[['a','a','a','b','b','c','c','d','d'],[1,2,3,1, 3, 1, 2, 2, 3]])
data

a  1   -1.186938
   2   -1.358542
   3   -1.371070
b  1    0.806074
   3   -0.557864
c  1   -0.234863
   2   -2.536878
d  2    0.116082
   3    0.052379
dtype: float64

In [16]:
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )

###### Partial indexing

With a hierarchically indexed object, so-called partial indexing is possible, enabling
you to concisely select subsets of the data:


In [17]:
data['a']

1   -1.186938
2   -1.358542
3   -1.371070
dtype: float64

In [18]:
data['b':'c']

b  1    0.806074
   3   -0.557864
c  1   -0.234863
   2   -2.536878
dtype: float64

In [19]:
data[['b','d']]

b  1    0.806074
   3   -0.557864
d  2    0.116082
   3    0.052379
dtype: float64

You can also select on the inner index level, here every row whose second-level label is 2:

In [20]:
data[:,2]

a   -1.358542
c   -2.536878
d    0.116082
dtype: float64
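The inner-level selection above can also be written with `.loc` or with `xs`. The sketch below is only an illustration: it swaps the random values for `np.arange(9.0)` so the result is reproducible, which is an assumption and not the data shown in these notes.

```python
import numpy as np
import pandas as pd

# Deterministic stand-in for the random Series above (values 0.0 .. 8.0).
data = pd.Series(
    np.arange(9.0),
    index=[list("aaabbccdd"), [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Two equivalent ways to pick every row whose inner label is 2.
by_loc = data.loc[:, 2]      # slice the outer level, fix the inner level
by_xs = data.xs(2, level=1)  # cross-section on level 1

print(by_loc)
```

Both forms drop the selected level, so the result is indexed by the outer labels only.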

In [21]:
data

a  1   -1.186938
   2   -1.358542
   3   -1.371070
b  1    0.806074
   3   -0.557864
c  1   -0.234863
   2   -2.536878
d  2    0.116082
   3    0.052379
dtype: float64

###### stack and unstack

Hierarchical indexing plays an important role in reshaping data and group-based
operations like forming a pivot table. For example, you could rearrange the data into
a DataFrame using its unstack method.

In [22]:
data.unstack()

Unnamed: 0,1,2,3
a,-1.186938,-1.358542,-1.37107
b,0.806074,,-0.557864
c,-0.234863,-2.536878,
d,,0.116082,0.052379
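The empty cells in the unstacked table are the (outer, inner) label pairs that never occur in the original index. A small sketch that lists them, again using `np.arange(9.0)` as stand-in values (an assumption, chosen for reproducibility):

```python
import numpy as np
import pandas as pd

data = pd.Series(
    np.arange(9.0),
    index=[list("aaabbccdd"), [1, 2, 3, 1, 3, 1, 2, 2, 3]],
)

# Outer labels become rows, inner labels become columns.
wide = data.unstack()

# Collect the label pairs that unstack had to fill with NaN.
missing_pairs = [
    (row, col)
    for row in wide.index
    for col in wide.columns
    if pd.isna(wide.loc[row, col])
]
print(wide.shape, missing_pairs)
```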


In [23]:
data.unstack().stack()

a  1   -1.186938
   2   -1.358542
   3   -1.371070
b  1    0.806074
   3   -0.557864
c  1   -0.234863
   2   -2.536878
d  2    0.116082
   3    0.052379
dtype: float64

In [24]:
data.unstack().stack(dropna=False)

a  1   -1.186938
   2   -1.358542
   3   -1.371070
b  1    0.806074
   2         NaN
   3   -0.557864
c  1   -0.234863
   2   -2.536878
   3         NaN
d  1         NaN
   2    0.116082
   3    0.052379
dtype: float64

###### Hierarchical indexing in a DataFrame

In [25]:
df=pd.DataFrame(np.arange(12).reshape(4,3),index=[['a','a','b','b'],[1,2,1,2]],columns=[['Ohio','Ohio','Colorado'],['Red','Green','Red']])
df

Unnamed: 0_level_0,Unnamed: 1_level_0,Ohio,Ohio,Colorado
Unnamed: 0_level_1,Unnamed: 1_level_1,Red,Green,Red
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [26]:
df.index

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 1),
            ('b', 2)],
           )

In [27]:
df.columns

MultiIndex([(    'Ohio',   'Red'),
            (    'Ohio', 'Green'),
            ('Colorado',   'Red')],
           )

###### Hierarchical index levels can have names

In [28]:
df.index.names=['key1','key2']
df.columns.names=['state','color']
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [29]:
df['Ohio']

Unnamed: 0_level_0,color,Red,Green
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,1
a,2,3,4
b,1,6,7
b,2,9,10


In [30]:
df['Ohio','Red']

key1  key2
a     1       0
      2       3
b     1       6
      2       9
Name: (Ohio, Red), dtype: int32

In [31]:
df.loc['a','Ohio']['Red']

key2
1    0
2    3
Name: Red, dtype: int32

In [32]:
(df.loc['a','Ohio']).loc[1,'Red']

0
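The chained lookups above can usually be collapsed into one `.loc` call by passing tuples for the row and column keys. A minimal sketch, rebuilding the same frame so that it runs on its own:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list("aabb"), [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Red", "Green", "Red"]],
)
df.index.names = ["key1", "key2"]
df.columns.names = ["state", "color"]

# Partial row key, full column key: a Series indexed by key2.
ohio_red_for_a = df.loc["a", ("Ohio", "Red")]

# Full row key and full column key: a single scalar.
single_value = df.loc[("a", 1), ("Ohio", "Red")]

print(ohio_red_for_a)
print(single_value)
```

A single `.loc` call avoids the intermediate object that chained indexing creates, which matters mostly when assigning values.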

###### pd.MultiIndex.from_arrays

A MultiIndex can be created by itself and then reused; the columns in the preceding
DataFrame with level names could be created like this:

In [33]:
dt=pd.DataFrame(np.arange(12).reshape(-1,3))
dt

Unnamed: 0,0,1,2
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11


In [34]:
dt.columns=pd.MultiIndex.from_arrays([['Ohio','Ohio','Colorado'],['Red','Green','Red']],names=('state','color'))
dt

state,Ohio,Ohio,Colorado
color,Red,Green,Red
0,0,1,2
1,3,4,5
2,6,7,8
3,9,10,11
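`from_arrays` takes one list per level. When the labels are easier to write as one tuple per column, `pd.MultiIndex.from_tuples` builds the same index. A short sketch comparing the two:

```python
import pandas as pd

# One list per level.
from_arrays = pd.MultiIndex.from_arrays(
    [["Ohio", "Ohio", "Colorado"], ["Red", "Green", "Red"]],
    names=["state", "color"],
)

# One tuple per column.
from_tuples = pd.MultiIndex.from_tuples(
    [("Ohio", "Red"), ("Ohio", "Green"), ("Colorado", "Red")],
    names=["state", "color"],
)

print(from_arrays.equals(from_tuples))
```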


#### Reordering and Sorting Levels

In [35]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [36]:
df.swaplevel('key2','key1')

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
2,a,3,4,5
1,b,6,7,8
2,b,9,10,11


In [37]:
df.swaplevel("state",'color',axis='columns')

Unnamed: 0_level_0,color,Red,Green,Red
Unnamed: 0_level_1,state,Ohio,Ohio,Colorado
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [38]:
df.swaplevel('state','color',axis='columns')['Red']

Unnamed: 0_level_0,state,Ohio,Colorado
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,0,2
a,2,3,5
b,1,6,8
b,2,9,11


In [39]:
df.sort_index(level=1)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
b,1,6,7,8
a,2,3,4,5
b,2,9,10,11


In [40]:
df.sort_index(axis='columns')

Unnamed: 0_level_0,state,Colorado,Ohio,Ohio
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,2,1,0
a,2,5,4,3
b,1,8,7,6
b,2,11,10,9


In [41]:
df.sort_index(axis='columns',level=1)

Unnamed: 0_level_0,state,Ohio,Colorado,Ohio
Unnamed: 0_level_1,color,Green,Red,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,1,2,0
a,2,4,5,3
b,1,7,8,6
b,2,10,11,9


##### Using both swaplevel and sort_index

In [42]:
df.swaplevel(0,1).sort_index(level=0)

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key2,key1,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
1,a,0,1,2
1,b,6,7,8
2,a,3,4,5
2,b,9,10,11
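With more than two levels, `reorder_levels` can be easier to read than repeated `swaplevel` calls because it takes the full target order. A sketch on the same two-level frame, where both spellings give the same result:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list("aabb"), [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Red", "Green", "Red"]],
)
df.index.names = ["key1", "key2"]
df.columns.names = ["state", "color"]

# Swap the two row levels, then sort starting from the new outer level.
swapped = df.swaplevel(0, 1).sort_index(level=0)

# Same result, naming the target level order explicitly.
reordered = df.reorder_levels(["key2", "key1"]).sort_index()

print(reordered.index.tolist())
```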


##### Summary statistics by levels

In [43]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [31]:
df.sum(level='key2')


state,Ohio,Ohio,Colorado
color,Red,Green,Red
key2,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2
1,6,8,10
2,12,14,16


In [32]:
df.sum(level='color',axis=1)


Unnamed: 0_level_0,color,Red,Green
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [33]:
df.sum(level='key2').sum(axis=1)


key2
1    24
2    42
dtype: int64

In [34]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [35]:
df.sum(level='state',axis=1)


Unnamed: 0_level_0,state,Ohio,Colorado
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,2
a,2,7,5
b,1,13,8
b,2,19,11


In [36]:
df.sum(level='state',axis=1).sum()


state
Ohio        40
Colorado    26
dtype: int64

In [44]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [48]:
df.sum(level=1,axis='columns')


Unnamed: 0_level_0,color,Red,Green
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,2,1
a,2,8,4
b,1,14,7
b,2,20,10


In [None]:
df.sum(level=1,axis='columns').sum()

In [49]:
df.sum(level='state',axis='columns')


Unnamed: 0_level_0,state,Ohio,Colorado
key1,key2,Unnamed: 2_level_1,Unnamed: 3_level_1
a,1,1,2
a,2,7,5
b,1,13,8
b,2,19,11


In [54]:
df.sum(level=0).sum(axis=1)


key1
a    15
b    51
dtype: int64
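The `level` argument to `sum` was deprecated in pandas 1.3 and removed in pandas 2.0, so the calls in this section raise an error on current releases. A sketch of the `groupby` equivalents follows. Grouping columns by transposing first is one possible approach, chosen here as an assumption because it does not rely on the `axis` argument of `groupby`.

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(
    np.arange(12).reshape(4, 3),
    index=[list("aabb"), [1, 2, 1, 2]],
    columns=[["Ohio", "Ohio", "Colorado"], ["Red", "Green", "Red"]],
)
df.index.names = ["key1", "key2"]
df.columns.names = ["state", "color"]

# Replacement for df.sum(level='key2'): aggregate rows by an index level.
by_key2 = df.groupby(level="key2").sum()

# Replacement for df.sum(level='color', axis=1): transpose, group, transpose back.
by_color = df.T.groupby(level="color").sum().T
by_state = df.T.groupby(level="state").sum().T

print(by_key2)
print(by_color)
```

Note that `groupby` sorts the group labels by default, so the columns come back as Green, Red rather than in their original order.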

#### Indexing with a DataFrame’s columns


In [37]:
df

Unnamed: 0_level_0,state,Ohio,Ohio,Colorado
Unnamed: 0_level_1,color,Red,Green,Red
key1,key2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
a,1,0,1,2
a,2,3,4,5
b,1,6,7,8
b,2,9,10,11


In [38]:
df=pd.DataFrame({'a':range(7),'b':range(7,0,-1),'c':['one','one','one', 'two', 'two','two','two'],'d': [0, 1, 2, 0, 1, 2, 3]})
df

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3


In [39]:
df.set_index(['c','d'])

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [40]:
df.set_index(['c','d'],drop=False)


Unnamed: 0_level_0,Unnamed: 1_level_0,a,b,c,d
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1,Unnamed: 4_level_1,Unnamed: 5_level_1
one,0,0,7,one,0
one,1,1,6,one,1
one,2,2,5,one,2
two,0,3,4,two,0
two,1,4,3,two,1
two,2,5,2,two,2
two,3,6,1,two,3


In [41]:
a=df.set_index(['c','d'])
a

Unnamed: 0_level_0,Unnamed: 1_level_0,a,b
c,d,Unnamed: 2_level_1,Unnamed: 3_level_1
one,0,0,7
one,1,1,6
one,2,2,5
two,0,3,4
two,1,4,3
two,2,5,2
two,3,6,1


In [42]:
a.reset_index(inplace=True)

In [43]:
a

Unnamed: 0,c,d,a,b
0,one,0,0,7
1,one,1,1,6
2,one,2,2,5
3,two,0,3,4
4,two,1,4,3
5,two,2,5,2
6,two,3,6,1


In [44]:
a.sort_index(axis='columns')

Unnamed: 0,a,b,c,d
0,0,7,one,0
1,1,6,one,1
2,2,5,one,2
3,3,4,two,0
4,4,3,two,1
5,5,2,two,2
6,6,1,two,3
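`set_index` and `reset_index` are inverses apart from column order: `reset_index` puts the former index levels first. A small round-trip sketch on the same data (the variable names are illustrative):

```python
import pandas as pd

df = pd.DataFrame({
    "a": range(7),
    "b": range(7, 0, -1),
    "c": ["one", "one", "one", "two", "two", "two", "two"],
    "d": [0, 1, 2, 0, 1, 2, 3],
})

# Move columns c and d into a two-level row index.
indexed = df.set_index(["c", "d"])

# Look up one cell through the hierarchical index.
value = indexed.loc[("two", 1), "a"]

# Move the levels back to columns and restore the original column order.
restored = indexed.reset_index()[list(df.columns)]

print(value)
print(restored.equals(df))
```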


#### Combining and merging datasets

#### Database-Style DataFrame Joins
 

Merge or join operations combine datasets by linking rows using one or more keys.
 These operations are central to relational databases (e.g., SQL-based). The merge
 function in pandas is the main entry point for using these algorithms on your data.

In [45]:
import pandas as pd
df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2=pd.DataFrame({'key':['a','c','b'],'data2':range(3)})
df1

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [46]:
df2

Unnamed: 0,key,data2
0,a,0
1,c,1
2,b,2


In [47]:
pd.merge(df1,df2)

Unnamed: 0,key,data1,data2
0,b,0,2
1,b,1,2
2,b,6,2
3,a,2,0
4,a,4,0
5,a,5,0
6,c,3,1


Note that I didn’t specify which column to join on. If that information is not
specified, merge uses the overlapping column names as the keys. It’s a good practice to
specify the key explicitly, though:

In [48]:
pd.merge(df1,df2,on='key')

Unnamed: 0,key,data1,data2
0,b,0,2
1,b,1,2
2,b,6,2
3,a,2,0
4,a,4,0
5,a,5,0
6,c,3,1


If the column names are different in each object, you can specify them separately:

In [49]:
df1=df1.rename(columns={'key':'lkey'})

In [50]:
df1

Unnamed: 0,lkey,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [51]:
df2=df2.rename(columns={'key':'rkey'})
df2

Unnamed: 0,rkey,data2
0,a,0
1,c,1
2,b,2


In [52]:
pd.merge(df1,df2,left_on='lkey',right_on='rkey')

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,2
1,b,1,b,2
2,b,6,b,2
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0
6,c,3,c,1


By default, merge performs an inner join: the keys in the result are the intersection of the keys found in both tables.

In [53]:
df1=pd.DataFrame({'key': ['b', 'c', 'a', 'c', 'a', 'a', 'b'],'data1': range(7)})
df2=pd.DataFrame({'key':['a','b','d'],'data2':range(3)})

In [54]:
df1.merge(df2,on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,6,1
2,a,2,0
3,a,4,0
4,a,5,0


The keys 'c' (present only in df1) and 'd' (present only in df2) do not appear in the inner-join result.

The other options for how are 'left', 'right', and 'outer':

In [55]:
df1.merge(df2,on='key',how='outer')

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,6.0,1.0
2,c,1.0,
3,c,3.0,
4,a,2.0,0.0
5,a,4.0,0.0
6,a,5.0,0.0
7,d,,2.0
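To see which side each row of an outer join came from, `merge` accepts `indicator=True`, which adds a `_merge` column with the values `left_only`, `right_only`, or `both`. A brief sketch on the same two frames:

```python
import pandas as pd

df1 = pd.DataFrame({"key": ["b", "c", "a", "c", "a", "a", "b"], "data1": range(7)})
df2 = pd.DataFrame({"key": ["a", "b", "d"], "data2": range(3)})

# indicator=True records the origin of every row in a '_merge' column.
merged = df1.merge(df2, on="key", how="outer", indicator=True)

# Count rows per origin; the cast to str turns the categorical into plain labels.
origin_counts = merged["_merge"].astype(str).value_counts().to_dict()
print(origin_counts)
```

The two `left_only` rows are the 'c' keys and the single `right_only` row is 'd'.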


In [56]:
df1.merge(df2,how='left')

Unnamed: 0,key,data1,data2
0,b,0,1.0
1,c,1,
2,a,2,0.0
3,c,3,
4,a,4,0.0
5,a,5,0.0
6,b,6,1.0


In [57]:
df1.merge(df2,how='right')

Unnamed: 0,key,data1,data2
0,a,2.0,0
1,a,4.0,0
2,a,5.0,0
3,b,0.0,1
4,b,6.0,1
5,d,,2


To merge on multiple keys, pass a list of column names:

In [58]:
left = pd.DataFrame({'key1': ['foo', 'foo', 'bar'],'key2': ['one', 'two', 'one'],'lval': [1, 2, 3]})
right = pd.DataFrame({'key1': ['foo', 'foo', 'bar', 'bar'],'key2': ['one', 'one', 'one', 'two'],'rval': [4, 5, 6, 7]})
left

Unnamed: 0,key1,key2,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [59]:
right

Unnamed: 0,key1,key2,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [60]:
pd.merge(left,right,on=['key1','key2'])

Unnamed: 0,key1,key2,lval,rval
0,foo,one,1,4
1,foo,one,1,5
2,bar,one,3,6


In [61]:
pd.merge(left,right,on='key1',how='outer')

Unnamed: 0,key1,key2_x,lval,key2_y,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7
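Merging on `key1` alone is a many-to-many join: each key contributes (rows on the left) × (rows on the right) rows to the result. The sketch below checks that arithmetic for these two frames. It is an illustration of the row count only and assumes every key occurs on both sides, which holds here.

```python
import pandas as pd

left = pd.DataFrame({"key1": ["foo", "foo", "bar"],
                     "key2": ["one", "two", "one"],
                     "lval": [1, 2, 3]})
right = pd.DataFrame({"key1": ["foo", "foo", "bar", "bar"],
                      "key2": ["one", "one", "one", "two"],
                      "rval": [4, 5, 6, 7]})

# Rows per key on each side.
left_counts = left["key1"].value_counts()
right_counts = right["key1"].value_counts()

# foo: 2 * 2 = 4 rows, bar: 1 * 2 = 2 rows.
expected_rows = int((left_counts * right_counts).sum())

merged = pd.merge(left, right, on="key1", how="outer")
print(expected_rows, len(merged))
```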


A last issue to consider in merge operations is the treatment of overlapping column
names. merge has a suffixes option for specifying strings to append
to overlapping names in the left and right DataFrame objects:

In [62]:
pd.merge(left,right,on='key1',how='outer',suffixes=['_left','_right'])

Unnamed: 0,key1,key2_left,lval,key2_right,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,foo,two,2,one,4
3,foo,two,2,one,5
4,bar,one,3,one,6
5,bar,one,3,two,7


In [63]:
df1=left.rename(columns={'key2':'lkey'})
df2=right.rename(columns={'key2':'rkey'})
df1

Unnamed: 0,key1,lkey,lval
0,foo,one,1
1,foo,two,2
2,bar,one,3


In [64]:
df2

Unnamed: 0,key1,rkey,rval
0,foo,one,4
1,foo,one,5
2,bar,one,6
3,bar,two,7


In [65]:
pd.merge(df1,df2,left_on=['key1','lkey'],right_on=['key1','rkey'],suffixes=['_left','_right'])

Unnamed: 0,key1,lkey,lval,rkey,rval
0,foo,one,1,one,4
1,foo,one,1,one,5
2,bar,one,3,one,6


###  Merging on Index

In [66]:

df1=pd.DataFrame({'key':['a','a','b','b','c','c'],'val':range(6)})
df2=pd.DataFrame({'val':range(3)},index=list("abd"))
df1

Unnamed: 0,key,val
0,a,0
1,a,1
2,b,2
3,b,3
4,c,4
5,c,5


In [67]:
df2

Unnamed: 0,val
a,0
b,1
d,2


In [68]:
pd.merge(df1,df2,left_on='key',right_index=True)

Unnamed: 0,key,val_x,val_y
0,a,0,0
1,a,1,0
2,b,2,1
3,b,3,1


In [69]:
pd.merge(df1,df2,left_on='key',right_index=True,how='right',suffixes=['_df1','_df2'])

Unnamed: 0,key,val_df1,val_df2
0.0,a,0.0,0
1.0,a,1.0,0
2.0,b,2.0,1
3.0,b,3.0,1
,d,,2


In [70]:
import numpy as np
lefth = pd.DataFrame({'key1': ['Ohio', 'Ohio', 'Ohio','Nevada', 'Nevada'],
                      'key2': [2000, 2001, 2002, 2001, 2002],'data': np.arange(5.)})
righth = pd.DataFrame(np.arange(12).reshape((6, 2)),index=[['Nevada', 'Nevada', 'Ohio', 'Ohio','Ohio', 'Ohio'],
                                                           [2001, 2000, 2000, 2000, 2001, 2002]], 
                      columns=['event1', 'event2'])
lefth


Unnamed: 0,key1,key2,data
0,Ohio,2000,0.0
1,Ohio,2001,1.0
2,Ohio,2002,2.0
3,Nevada,2001,3.0
4,Nevada,2002,4.0


In [71]:
righth

Unnamed: 0,Unnamed: 1,event1,event2
Nevada,2001,0,1
Nevada,2000,2,3
Ohio,2000,4,5
Ohio,2000,6,7
Ohio,2001,8,9
Ohio,2002,10,11


In [72]:
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True)

Unnamed: 0,key1,key2,data,event1,event2
0,Ohio,2000,0.0,4,5
0,Ohio,2000,0.0,6,7
1,Ohio,2001,1.0,8,9
2,Ohio,2002,2.0,10,11
3,Nevada,2001,3.0,0,1


In [73]:
pd.merge(lefth,righth,left_on='key1',right_index=True)

ValueError: len(left_on) must equal the number of levels in the index of "right"

In [None]:
pd.merge(lefth,righth,left_on=['key1','key2'],right_index=True,how='left')

You can pass both left_index=True and right_index=True to merge using only the indexes.
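A minimal sketch of an index-on-index merge, next to the `DataFrame.join` method, which joins on the index by default. The frames `left2` and `right2` are made up for this illustration.

```python
import pandas as pd

# Hypothetical frames that share only part of their index.
left2 = pd.DataFrame({"x": [1, 2, 3]}, index=["a", "c", "e"])
right2 = pd.DataFrame({"y": [7, 8, 9]}, index=["b", "c", "d"])

# Merge using the index of both sides as the key.
merged = pd.merge(left2, right2, left_index=True, right_index=True, how="outer")

# join performs the same index-based merge.
joined = left2.join(right2, how="outer")

print(merged)
```

Only the label 'c' is present on both sides, so it is the only row with values in both columns.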

### Concatenating Along an Axis

###### NumPy arrays: using np.concatenate

In [None]:
ar=np.arange(12).reshape(4,3)
ar

In [None]:
np.concatenate([ar,ar],axis=1)

###### Using pd.concat

By default concat works along axis=0, producing another Series. If you pass axis=1,
the result will instead be a DataFrame (axis=1 is the columns):

In [None]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'd', 'e'])
s3 = pd.Series([5, 6], index=['f', 'g'])

In [None]:
pd.concat([s1,s2,s3])

In [None]:
pd.concat([s1,s2,s3],axis=1)

In [76]:
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3, 4], index=['c', 'a', 'b'])
s3 = pd.Series([5, 6], index=['f', 'c'])
pd.concat([s1,s2,s3])

a    0
b    1
c    2
a    3
b    4
f    5
c    6
dtype: int64

When you concatenate along axis=1, the other axis (the row index) is the union
(the 'outer' join) of the indexes of the inputs. You can instead intersect them by passing
join='inner':

In [77]:
s4=pd.concat([s1,s3])
s4

a    0
b    1
f    5
c    6
dtype: int64

In [78]:
pd.concat([s1,s4],axis=1)

Unnamed: 0,0,1
a,0.0,0
b,1.0,1
f,,5
c,,6


In [80]:
pd.concat([s1,s4],axis=1,join='inner')

Unnamed: 0,0,1
a,0,0
b,1,1


Here 'f' and 'c' disappear because join='inner' keeps only the labels present in both objects.

Older pandas releases also accepted a join_axes argument for specifying the labels to use on the other axis; it was deprecated in pandas 0.25 and removed in pandas 1.0.
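Where `join_axes` is unavailable (pandas 1.0 and later), reindexing the concatenated result gives the same effect. A minimal sketch, in which the list `wanted_rows` is a hypothetical choice of labels:

```python
import pandas as pd

s1 = pd.Series([0, 1], index=["a", "b"])
s3 = pd.Series([5, 6], index=["f", "c"])
s4 = pd.concat([s1, s3])

# Hypothetical set of row labels to keep, in this order.
wanted_rows = ["a", "c", "b", "e"]

# Concatenate as columns, then select and order the rows explicitly.
result = pd.concat([s1, s4], axis=1).reindex(wanted_rows)
print(result)
```

Labels that appear in neither input, such as 'e' here, come back as rows of NaN.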

A potential issue is that the concatenated pieces are not identifiable in the result.
Suppose instead you wanted to create a hierarchical index on the concatenation axis. To
do this, use the keys argument:

In [88]:
a=pd.concat([s1,s2,s3],keys=['one','two','three'])
a

one    a    0
       b    1
two    c    2
       a    3
       b    4
three  f    5
       c    6
dtype: int64

In [89]:
a.unstack()

Unnamed: 0,a,b,c,f
one,0.0,1.0,,
two,3.0,4.0,2.0,
three,,,6.0,5.0


In [90]:
pd.concat([s1,s2,s3],axis=1,keys=['s1','s2','s3'])

Unnamed: 0,s1,s2,s3
a,0.0,3.0,
b,1.0,4.0,
c,,2.0,6.0
f,,,5.0


#### Concatenating DataFrames

In [92]:
df1 = pd.DataFrame(np.arange(6).reshape(3, 2), index=['a', 'b', 'c'],
                   columns=['one', 'two'])
df2 = pd.DataFrame(5 + np.arange(6).reshape(2, 3), index=['a', 'c'],
                   columns=['three', 'two', 'four'])
df1

Unnamed: 0,one,two
a,0,1
b,2,3
c,4,5


In [93]:
df2

Unnamed: 0,three,two,four
a,5,6,7
c,8,9,10


In [94]:
pd.concat([df1,df2])

Unnamed: 0,one,two,three,four
a,0.0,1,,
b,2.0,3,,
c,4.0,5,,
a,,6,5.0,7.0
c,,9,8.0,10.0


In [96]:
pd.concat([df1,df2],axis=1,keys=['df1','df2'])

Unnamed: 0_level_0,df1,df1,df2,df2,df2
Unnamed: 0_level_1,one,two,three,two,four
a,0,1,5.0,6.0,7.0
b,2,3,,,
c,4,5,8.0,9.0,10.0


In [97]:
df1 = pd.DataFrame(np.random.randn(3, 4), columns=['a', 'b', 'c', 'd'])
df2 = pd.DataFrame(np.random.randn(2, 3), columns=['b', 'd', 'a'])

In [102]:
pd.concat([df1,df2],ignore_index=True)

Unnamed: 0,a,b,c,d
0,-0.329879,0.572181,0.858536,0.469628
1,0.039489,-1.006989,-0.134986,0.412453
2,-0.717406,-0.902983,0.586851,-1.425558
3,-0.526287,-0.80144,,-0.140048
4,0.457285,-1.132036,,-0.028638


In [103]:
df1

Unnamed: 0,a,b,c,d
0,-0.329879,0.572181,0.858536,0.469628
1,0.039489,-1.006989,-0.134986,0.412453
2,-0.717406,-0.902983,0.586851,-1.425558


In [104]:
df2

Unnamed: 0,b,d,a
0,-0.80144,-0.140048,-0.526287
1,-1.132036,-0.028638,0.457285


ignore_index=True: do not preserve the indexes along the concatenation axis; instead produce a new default integer index, as in the concatenated result above (rows 0 through 4).
 

#### Combining data with overlap

In [106]:
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan],index=['f', 'e', 'd', 'c', 'b', 'a'])
b = pd.Series(np.arange(len(a), dtype=np.float64),index=['f', 'e', 'd', 'c', 'b', 'a'])
a

f    NaN
e    2.5
d    NaN
c    3.5
b    4.5
a    NaN
dtype: float64

In [107]:
b

f    0.0
e    1.0
d    2.0
c    3.0
b    4.0
a    5.0
dtype: float64

In [116]:
np.where(a.isnull(),b,a)

array([0. , 2.5, 2. , 3.5, 4.5, 5. ])
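`np.where` returns a bare NumPy array and ignores the index labels. The Series method `combine_first` does the same patching but aligns on labels and keeps the index. A short sketch with the same two Series, with `fillna` shown for comparison:

```python
import numpy as np
import pandas as pd

labels = ["f", "e", "d", "c", "b", "a"]
a = pd.Series([np.nan, 2.5, np.nan, 3.5, 4.5, np.nan], index=labels)
b = pd.Series(np.arange(len(a), dtype=np.float64), index=labels)

# Take values from a, falling back to b wherever a is missing.
combined = a.combine_first(b)

# fillna with another Series patches the holes the same way.
filled = a.fillna(b)

print(combined["d"], combined["e"])
```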

In [117]:
df1 = pd.DataFrame({'a': [1., np.nan, 5., np.nan],'b': [np.nan, 2., np.nan, 6.],'c': range(2, 18, 4)})
df2 = pd.DataFrame({'a': [5., 4., np.nan, 3., 7.],
                    'b': [np.nan, 3., 4., 6., 8.]})
df1

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,6.0,14


In [118]:
df2

Unnamed: 0,a,b
0,5.0,
1,4.0,3.0
2,,4.0
3,3.0,6.0
4,7.0,8.0


In [119]:
df1.combine_first(df2)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,4.0,2.0,6.0
2,5.0,4.0,10.0
3,3.0,6.0,14.0
4,7.0,8.0,


## Reshaping and pivot

stack pivots the column labels into the row index, and unstack pivots a level of the row index into the columns:

In [132]:
df=pd.DataFrame(np.arange(6).reshape(3,2),index=pd.Index(['one','two','three'],name='number'),
                columns=pd.Index(['Ohio',"Maladine"],name='state'))


In [133]:
df

state,Ohio,Maladine
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,1
two,2,3
three,4,5


In [134]:
df.stack()

number  state   
one     Ohio        0
        Maladine    1
two     Ohio        2
        Maladine    3
three   Ohio        4
        Maladine    5
dtype: int32

In [135]:
df.unstack()

state     number
Ohio      one       0
          two       2
          three     4
Maladine  one       1
          two       3
          three     5
dtype: int32

In [136]:
temp=df.stack()
temp.unstack()

state,Ohio,Maladine
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,1
two,2,3
three,4,5


###### unstacking with levels

In [137]:
temp

number  state   
one     Ohio        0
        Maladine    1
two     Ohio        2
        Maladine    3
three   Ohio        4
        Maladine    5
dtype: int32

In [138]:
temp.unstack(0)

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,2,4
Maladine,1,3,5


In [139]:
temp.unstack('number')

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
Ohio,0,2,4
Maladine,1,3,5


In [140]:
temp.unstack('state')

state,Ohio,Maladine
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,0,1
two,2,3
three,4,5


 Unstacking might introduce missing data if all of the values in the level aren’t found
 in each of the subgroups:

In [141]:
s1 = pd.Series([0, 1, 2, 3], index=['a', 'b', 'c', 'd'])
s1

a    0
b    1
c    2
d    3
dtype: int64

In [144]:
s2=pd.Series([2,3,4],index=list("cde"))
s2

c    2
d    3
e    4
dtype: int64

In [147]:
df=pd.concat([s1,s2],keys=['one','two'])
df

one  a    0
     b    1
     c    2
     d    3
two  c    2
     d    3
     e    4
dtype: int64

In [150]:
df.unstack()

Unnamed: 0,a,b,c,d,e
one,0.0,1.0,2.0,3.0,
two,,,2.0,3.0,4.0


In [151]:
df.unstack().stack()

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
two  c    2.0
     d    3.0
     e    4.0
dtype: float64

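By default, stack drops the missing values, so this round trip returns the original entries (now as floats). Pass dropna=False to keep them: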

In [155]:
df.unstack().stack(dropna=False)

one  a    0.0
     b    1.0
     c    2.0
     d    3.0
     e    NaN
two  a    NaN
     b    NaN
     c    2.0
     d    3.0
     e    4.0
dtype: float64

When you unstack in a DataFrame, the level unstacked becomes the lowest level in
 the result:

In [156]:
data=pd.DataFrame({'left':df,'right':df*5})
data

Unnamed: 0,Unnamed: 1,left,right
one,a,0,0
one,b,1,5
one,c,2,10
one,d,3,15
two,c,2,10
two,d,3,15
two,e,4,20


In [157]:
data.unstack(0)

Unnamed: 0_level_0,left,left,right,right
Unnamed: 0_level_1,one,two,one,two
a,0.0,,0.0,
b,1.0,,5.0,
c,2.0,2.0,10.0,10.0
d,3.0,3.0,15.0,15.0
e,,4.0,,20.0
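The column labels of the unstacked frame show where the unstacked level ends up: the original column names stay on top and the former row level sits underneath. A sketch that rebuilds the frame and inspects the result (the name `wide` is illustrative):

```python
import pandas as pd

s1 = pd.Series([0, 1, 2, 3], index=["a", "b", "c", "d"])
s2 = pd.Series([2, 3, 4], index=list("cde"))
stacked = pd.concat([s1, s2], keys=["one", "two"])

data = pd.DataFrame({"left": stacked, "right": stacked * 5})

# Move the outer row level ('one'/'two') into the columns.
wide = data.unstack(0)

# The unstacked level is the innermost (last) column level.
print(wide.columns.tolist())
```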


In [160]:
pd.concat([s1,s2],ignore_index=True)

0    0
1    1
2    2
3    3
4    2
5    3
6    4
dtype: int64

In [167]:
data.unstack(0).stack(0)

Unnamed: 0,Unnamed: 1,one,two
a,left,0.0,
a,right,0.0,
b,left,1.0,
b,right,5.0,
c,left,2.0,2.0
c,right,10.0,10.0
d,left,3.0,3.0
d,right,15.0,15.0
e,left,,4.0
e,right,,20.0
