# Data Wrangling: Join, Combine, and Reshape
## Hierarchical Indexing
- Reordering and Sorting Levels
- Summary Statistics by Level
- Indexing with a DataFrame's Columns


## Combining and Merging Datasets
- Database-Style DataFrame joins
- Merging on Index
- Concatenating Along an Axis
- Combining Data with Overlap


## Reshaping and Pivoting
- Reshaping with Hierarchical Indexing
- Pivoting 'long' to 'wide' format
- Pivoting 'wide' to 'long' format


## Hierarchical Indexing (Series)

In [1]:
import pandas as pd
import numpy as np

In [5]:
data = pd.Series(np.random.uniform(size = 9),
                index =  [['a', 'a', 'b', 'c', 'c', 'b', 'c', 'b','a'],
                          [1, 2, 3, 1, 3, 4, 3, 2, 1]])
data

a  1    0.684862
   2    0.701188
b  3    0.870829
c  1    0.958994
   3    0.042434
b  4    0.539591
c  3    0.668997
b  2    0.501304
a  1    0.260682
dtype: float64

In [7]:
# the index is a MultiIndex; a "gap" in the display above means "same outer label as the row above"
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('b', 3),
            ('c', 1),
            ('c', 3),
            ('b', 4),
            ('c', 3),
            ('b', 2),
            ('a', 1)],
           )

In [24]:
mean = [0, 0]
cov =  [[1,0], [0, 100]]

In [16]:
data

a  1    0.684862
   2    0.701188
b  3    0.870829
c  1    0.958994
   3    0.042434
b  4    0.539591
c  3    0.668997
b  2    0.501304
a  1    0.260682
dtype: float64

In [8]:
# selecting subset
data['b']

3    0.870829
4    0.539591
2    0.501304
dtype: float64

In [13]:
# selecting the data values with loc operator
data.loc[['a','b']]

a  1    0.684862
   2    0.701188
   1    0.260682
b  3    0.870829
   4    0.539591
   2    0.501304
dtype: float64

In [14]:
data.loc[:, 2]

a    0.701188
b    0.501304
dtype: float64
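
The selections above can be reproduced with fixed values, which makes the results easier to check than with random data. This is a minimal sketch; the names `s`, `outer_b`, `inner_2`, and `single` are illustrative and do not appear in the notebook.

```python
import pandas as pd

# Fixed values so the results are reproducible (the notebook uses random data)
s = pd.Series([10, 20, 30, 40, 50],
              index=[['a', 'a', 'b', 'b', 'c'],
                     [1, 2, 1, 2, 2]])

# Outer-level selection: the outer level is dropped from the result
outer_b = s['b']

# Inner-level selection: every outer key that has the inner label 2
inner_2 = s.loc[:, 2]

# A full (outer, inner) pair returns a single value
single = s.loc[('a', 2)]
```

Selecting on the outer level returns a Series indexed by the inner level, while selecting on the inner level returns a Series indexed by the outer level.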

## Hierarchical Indexing (DataFrame)

In [18]:
frame = pd.DataFrame(np.arange(12).reshape((4,3)),
                    index = [["a", "a", "b", "b"], [1, 2, 1, 2]],
                    columns = [['fdk', 'fzp', 'chd'],
                               ['PB', 'PB', 'CHD']])

In [19]:
frame

     fdk fzp chd
      PB  PB CHD
a 1    0   1   2
  2    3   4   5
b 1    6   7   8
  2    9  10  11


In [20]:
frame.index.names = ['key1', 'key2']

In [22]:
frame.columns.names = ['city', 'province']

In [23]:
frame

city      fdk fzp chd
province   PB  PB CHD
key1 key2
a    1      0   1   2
     2      3   4   5
b    1      6   7   8
     2      9  10  11


In [29]:
# to check how many levels an index has
frame.index.nlevels

2

In [33]:
# partial column indexing
frame['fdk']

province   PB
key1 key2
a    1      0
     2      3
b    1      6
     2      9


In [34]:
frame['fzp']

province   PB
key1 key2
a    1      1
     2      4
b    1      7
     2     10


In [35]:
frame['chd']

province   CHD
key1 key2
a    1       2
     2       5
b    1       8
     2      11


In [None]:
# a MultiIndex can be created on its own and then reused
pd.MultiIndex.from_arrays([['fdk', 'fzp', 'chd'],
                           ['PB', 'PB', 'CHD']],
                          names=['city', 'capital'])
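
As a sketch of how a standalone MultiIndex can be reused, the block below builds the column index once and passes it to a DataFrame constructor. The level names are chosen here to match `frame` above, and `cols` and `demo` are illustrative names.

```python
import numpy as np
import pandas as pd

# Build the column MultiIndex on its own (level names are illustrative)
cols = pd.MultiIndex.from_arrays([['fdk', 'fzp', 'chd'],
                                  ['PB', 'PB', 'CHD']],
                                 names=['city', 'province'])

# Reuse it as the columns of a new DataFrame
demo = pd.DataFrame(np.arange(6).reshape((2, 3)), columns=cols)

# A full (city, province) tuple selects a single column
fzp_pb = demo[('fzp', 'PB')]
```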
                           
                           

### Reordering and Sorting Levels

In [38]:
frame.swaplevel('key1', 'key2')

city      fdk fzp chd
province   PB  PB CHD
key2 key1
1    a      0   1   2
2    a      3   4   5
1    b      6   7   8
2    b      9  10  11


In [39]:
frame.sort_index(level=1)

city      fdk fzp chd
province   PB  PB CHD
key1 key2
a    1      0   1   2
b    1      6   7   8
a    2      3   4   5
b    2      9  10  11


In [40]:
frame.swaplevel(0,1).sort_index(level=0)

city      fdk fzp chd
province   PB  PB CHD
key2 key1
1    a      0   1   2
     b      6   7   8
2    a      3   4   5
     b      9  10  11
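
The difference between swapping and sorting can be checked directly: `swaplevel` only reorders the index levels and leaves the rows in place, so the resulting index is usually unsorted until `sort_index` is applied. The following is a small sketch with an illustrative frame named `demo`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(np.arange(8).reshape((4, 2)),
                    index=pd.MultiIndex.from_arrays(
                        [['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                        names=['key1', 'key2']),
                    columns=['x', 'y'])

# swaplevel reorders the levels; the row order is unchanged
swapped = demo.swaplevel('key1', 'key2')
sorted_before = swapped.index.is_monotonic_increasing

# sort_index on the new outer level puts the rows in sorted order
resorted = swapped.sort_index(level=0)
sorted_after = resorted.index.is_monotonic_increasing
```

Selection on a hierarchically indexed object generally performs better when the index is sorted starting from the outermost level, which is what the final `sort_index(level=0)` call provides.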


### Summary Statistics by Level

In [41]:
frame.groupby(level='key2').sum()

city     fdk fzp chd
province  PB  PB CHD
key2
1          6   8  10
2         12  14  16


In [42]:
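# sum across the columns that share a 'province' label
# note: the axis argument of groupby is deprecated in pandas 2.1+;
# frame.T.groupby(level='province').sum().T gives the same result
frame.groupby(level= 'province', axis = 'columns').sum()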

province   CHD  PB
key1 key2
a    1       2   1
     2       5   7
b    1       8  13
     2      11  19
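
Both aggregations can be written without the `axis` argument by transposing before and after the column-wise groupby. The sketch below rebuilds a frame with the same shape and labels as `frame` under the illustrative name `demo`.

```python
import numpy as np
import pandas as pd

demo = pd.DataFrame(
    np.arange(12).reshape((4, 3)),
    index=pd.MultiIndex.from_arrays([['a', 'a', 'b', 'b'], [1, 2, 1, 2]],
                                    names=['key1', 'key2']),
    columns=pd.MultiIndex.from_arrays([['fdk', 'fzp', 'chd'],
                                       ['PB', 'PB', 'CHD']],
                                      names=['city', 'province']))

# Row-wise: sum the rows that share a 'key2' label
by_key2 = demo.groupby(level='key2').sum()

# Column-wise: transpose, group on the 'province' level, transpose back
by_province = demo.T.groupby(level='province').sum().T
```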


### Indexing with a DataFrame's Columns

In [43]:
frame2 = pd.DataFrame({'a': range(7), 'b': range(7,0,-1),
                      'c': ['one', 'one', 'one', 'two', 'two',
                           'two', 'two'],
                      'd': [0, 1,2,0,1,3,2]})

In [44]:
frame2

   a  b    c  d
0  0  7  one  0
1  1  6  one  1
2  2  5  one  2
3  3  4  two  0
4  4  3  two  1
5  5  2  two  3
6  6  1  two  2


In [46]:
# set_index returns a new DataFrame that uses one or more columns as its index

frame3 = frame2.set_index(['c', 'd'])

frame3

       a  b
c   d
one 0  0  7
    1  1  6
    2  2  5
two 0  3  4
    1  4  3
    3  5  2
    2  6  1


In [47]:
# by default the columns are removed from the DataFrame; pass drop=False to keep them as columns as well

frame2.set_index(["c",'d'], drop= False)

       a  b    c  d
c   d
one 0  0  7  one  0
    1  1  6  one  1
    2  2  5  one  2
two 0  3  4  two  0
    1  4  3  two  1
    3  5  2  two  3
    2  6  1  two  2


In [50]:
# reset_index does the opposite of set_index: the index levels are moved back into columns
# (it is applied to frame3 here, since frame2 still has its default integer index)

frame3.reset_index()

     c  d  a  b
0  one  0  0  7
1  one  1  1  6
2  one  2  2  5
3  two  0  3  4
4  two  1  4  3
5  two  3  5  2
6  two  2  6  1
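
A quick round trip shows that `set_index` followed by `reset_index` preserves the data and only changes the column order. This is a minimal sketch on a small illustrative frame named `df`, not the notebook's `frame2`.

```python
import pandas as pd

df = pd.DataFrame({'a': range(4),
                   'c': ['one', 'one', 'two', 'two'],
                   'd': [0, 1, 0, 1]})

# 'c' and 'd' become the two index levels
indexed = df.set_index(['c', 'd'])

# the index levels become ordinary columns again, placed first
restored = indexed.reset_index()

# after putting the columns back in their original order, nothing has changed
same_values = restored[['a', 'c', 'd']].equals(df)
```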


## Combining and Merging Datasets

- pandas.merge (connects rows in DataFrames based on one or more keys; see the [join types for the `how` argument](https://learning.oreilly.com/library/view/python-for-data/9781098104023/ch08.html#table_merge_how_behavior))
- pandas.concat (concatenates or "stacks" objects together along an axis)
- combine_first (splices together overlapping data, filling missing values in one object with values from another)
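
The three tools listed above can be compared side by side on tiny inputs. The sketch below uses illustrative names (`left`, `right`, `s1`, `s2`, `p`, `q`) that are not part of the notebook.

```python
import numpy as np
import pandas as pd

left = pd.DataFrame({'key': ['a', 'b'], 'x': [1, 2]})
right = pd.DataFrame({'key': ['a', 'c'], 'y': [10, 30]})

# merge: match rows on the shared 'key' column (inner join by default)
merged = pd.merge(left, right, on='key')

# concat: stack objects along an axis (rows here)
s1 = pd.Series([0, 1], index=['a', 'b'])
s2 = pd.Series([2, 3], index=['c', 'd'])
stacked = pd.concat([s1, s2])

# combine_first: fill the missing values of p with values from q
p = pd.Series([np.nan, 2.0, np.nan], index=['a', 'b', 'c'])
q = pd.Series([1.0, 20.0, 3.0], index=['a', 'b', 'c'])
patched = p.combine_first(q)
```

Only the key `'a'` appears in both frames, so the inner merge keeps a single row, whereas `combine_first` keeps every label and prefers the values of `p` wherever they are present.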

In [52]:
# DataFrame joins
df1 = pd.DataFrame({"key": ['a', 'c', 'd', 'b', 'a', 'c'],
                   'data1': pd.Series(range(6), dtype= 'Int64')})

df2 = pd.DataFrame({'key': ['a', 'b', 'c'],
                   'data2': pd.Series(range(3), dtype='Int64')})

In [53]:
df1

  key  data1
0   a      0
1   c      1
2   d      2
3   b      3
4   a      4
5   c      5


In [54]:
df2

  key  data2
0   a      0
1   b      1
2   c      2


In [55]:
pd.merge(df1, df2)

  key  data1  data2
0   a      0      0
1   a      4      0
2   c      1      2
3   c      5      2
4   b      3      1


In [57]:
# it is good practice to specify the join column explicitly
pd.merge(df1, df2, on= 'key')

  key  data1  data2
0   a      0      0
1   a      4      0
2   c      1      2
3   c      5      2
4   b      3      1


In [59]:
pd.merge(df1, df2, how= 'outer')

  key  data1  data2
0   a      0      0
1   a      4      0
2   c      1      2
3   c      5      2
4   d      2   <NA>
5   b      3      1
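
To see how the `how` argument changes the result, the row counts of each join type can be compared on the same keys as `df1` and `df2`. This sketch uses plain integer columns and the illustrative names `left` and `right`; `indicator=True` adds a `_merge` column that records which side each row came from.

```python
import pandas as pd

left = pd.DataFrame({'key': ['a', 'c', 'd', 'b', 'a', 'c'],
                     'data1': range(6)})
right = pd.DataFrame({'key': ['a', 'b', 'c'],
                      'data2': range(3)})

# Number of rows produced by each join type
counts = {how: len(pd.merge(left, right, on='key', how=how))
          for how in ['inner', 'left', 'right', 'outer']}

# indicator=True marks each row as 'left_only', 'right_only', or 'both'
outer = pd.merge(left, right, on='key', how='outer', indicator=True)
only_left = outer.loc[outer['_merge'] == 'left_only', 'key'].tolist()
```

The key `'d'` exists only in `left`, so it is dropped by the inner and right joins and kept, with a missing `data2`, by the left and outer joins.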
