<a href="https://colab.research.google.com/github/recervictory/LearingPython/blob/master/09_Pandas_Data_Wrangling_Join_Combine_and_Reshape.ipynb" target="_parent"><img src="https://colab.research.google.com/assets/colab-badge.svg" alt="Open In Colab"/></a>

# Data Wrangling: Join, Combine, and Reshape

In [1]:
# Importing
import pandas as pd
import numpy as np

## 0. Indexing with a DataFrame’s columns

In [2]:
frame = pd.DataFrame({'roll': range(7), 
                      'marks': range(7, 0, -1), 
                      'group': ['one', 'one', 'one', 'two', 'two', 'two', 'two'], 
                      'id': [0, 1, 2, 0, 1, 2, 3]})

frame

   roll  marks group  id
0     0      7   one   0
1     1      6   one   1
2     2      5   one   2
3     3      4   two   0
4     4      3   two   1
5     5      2   two   2
6     6      1   two   3


In [3]:
frame.set_index(['roll'])

      marks group  id
roll
0         7   one   0
1         6   one   1
2         5   one   2
3         4   two   0
4         3   two   1
5         2   two   2
6         1   two   3


In [4]:
# Setting a new (two-level) index
frameNew = frame.set_index(['group', 'id'])
frameNew

          roll  marks
group id
one   0      0      7
      1      1      6
      2      2      5
two   0      3      4
      1      4      3
      2      5      2
      3      6      1


In [7]:
# Keep the original columns as well (drop=False)
frame.set_index(['group', 'id'], drop=False)

          roll  marks group  id
group id
one   0      0      7   one   0
      1      1      6   one   1
      2      2      5   one   2
two   0      3      4   two   0
      1      4      3   two   1
      2      5      2   two   2
      3      6      1   two   3
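`set_index` returns a new object and leaves `frame` untouched, and `reset_index` undoes it. A minimal sketch of that round trip, reusing the same toy data; the names `indexed`, `restored`, and `same` are only for illustration:

```python
import pandas as pd

# Same toy data as in the cells above
frame = pd.DataFrame({'roll': range(7),
                      'marks': range(7, 0, -1),
                      'group': ['one', 'one', 'one', 'two', 'two', 'two', 'two'],
                      'id': [0, 1, 2, 0, 1, 2, 3]})

indexed = frame.set_index(['group', 'id'])   # columns -> index levels
restored = indexed.reset_index()             # index levels -> columns

# reset_index places the former index levels first,
# so reorder the columns before comparing with the original
same = restored[frame.columns].equals(frame)
print(list(indexed.index.names))
print(same)
```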


## 1. Hierarchical Indexing
Hierarchical indexing is an important feature of pandas that enables you to have multiple (two or more) index levels on an axis. Somewhat abstractly, it provides a way for you to work with higher dimensional data in a lower dimensional form.

Hierarchical indexing plays an important role in reshaping data and **group-based** operations like forming a **pivot table**. The cell below creates a Series whose index is given as a list of lists, which pandas turns into a `MultiIndex`:

In [8]:
# Hierarchical Indexing
data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])
data

a  1   -0.886779
   2    0.465580
   3    0.585076
b  1   -0.764119
   3   -0.453926
c  1   -1.249739
   2   -1.332372
d  2   -0.739264
   3   -0.671560
dtype: float64

In [9]:
# Inspect the index
data.index

MultiIndex([('a', 1),
            ('a', 2),
            ('a', 3),
            ('b', 1),
            ('b', 3),
            ('c', 1),
            ('c', 2),
            ('d', 2),
            ('d', 3)],
           )
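The `MultiIndex` can be inspected level by level. A small sketch, assuming the same index layout as `data` above (the values are random, so only the index is examined); the variable names are illustrative:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.random.randn(9),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

n_levels = data.index.nlevels                  # number of index levels
outer = list(data.index.get_level_values(0))   # outer label of every row
inner_unique = sorted(data.index.levels[1])    # distinct inner labels

print(n_levels)
print(outer)
print(inner_unique)
```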

In [10]:
data['b']

1   -0.764119
3   -0.453926
dtype: float64

In [11]:
data['b':'c']

b  1   -0.764119
   3   -0.453926
c  1   -1.249739
   2   -1.332372
dtype: float64

In [12]:
data.loc[['b', 'd']]

b  1   -0.764119
   3   -0.453926
d  2   -0.739264
   3   -0.671560
dtype: float64
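The selections above all use the outer level (partial indexing). Selecting on the inner level is also possible. A sketch, assuming deterministic values (`np.arange`) in place of the random ones so the result is easy to follow:

```python
import numpy as np
import pandas as pd

# Same index as `data` above, but with fixed values 0.0 .. 8.0 (an assumption)
data = pd.Series(np.arange(9.),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

inner = data.loc[:, 2]        # every row whose inner label is 2
same = data.xs(2, level=1)    # equivalent cross-section
single = data.loc[('a', 2)]   # one value, addressed by the full key

print(inner)
print(single)
```

Only `a`, `c`, and `d` appear in `inner`, because `b` has no entry with inner label 2.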

With a hierarchical index in place, you could rearrange the data into a DataFrame using its `unstack` method:

In [13]:
data.unstack()

          1         2         3
a -0.886779  0.465580  0.585076
b -0.764119       NaN -0.453926
c -1.249739 -1.332372       NaN
d       NaN -0.739264 -0.671560


In [14]:
# The inverse operation of unstack is stack
data.unstack().stack()

a  1   -0.886779
   2    0.465580
   3    0.585076
b  1   -0.764119
   3   -0.453926
c  1   -1.249739
   2   -1.332372
d  2   -0.739264
   3   -0.671560
dtype: float64
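By default `unstack` pivots the innermost level into columns. Passing a level number (or name) pivots a different one. A sketch, again assuming fixed values instead of random ones:

```python
import numpy as np
import pandas as pd

data = pd.Series(np.arange(9.),
                 index=[['a', 'a', 'a', 'b', 'b', 'c', 'c', 'd', 'd'],
                        [1, 2, 3, 1, 3, 1, 2, 2, 3]])

# Pivot the OUTER level (level 0) into columns instead of the inner one
wide = data.unstack(level=0)

# Key combinations that never occurred (e.g. ('b', 2)) show up as NaN
n_missing = int(wide.isna().sum().sum())

print(wide)
print(n_missing)
```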

In [16]:
frame = pd.DataFrame(np.arange(12).reshape((4, 3)), 
                     index=[['jan', 'jan', 'feb', 'feb'], [2011, 2012, 2011, 2012]],
                     columns=[['Kolkata', 'Kolkata', 'Delhi'],
                              ['Green', 'Red', 'Green']])

frame

         Kolkata     Delhi
           Green Red Green
jan 2011       0   1     2
    2012       3   4     5
feb 2011       6   7     8
    2012       9  10    11


In [18]:
# Name the index levels
frame.index.names = ['month', 'year']
frame

           Kolkata     Delhi
             Green Red Green
month year
jan   2011       0   1     2
      2012       3   4     5
feb   2011       6   7     8
      2012       9  10    11


In [19]:
# Name the column levels
frame.columns.names = ['city', 'color']

In [20]:
frame

city       Kolkata     Delhi
color        Green Red Green
month year
jan   2011       0   1     2
      2012       3   4     5
feb   2011       6   7     8
      2012       9  10    11
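With hierarchical columns and rows, partial indexing selects whole groups. A minimal sketch, assuming the same `frame` as above; `kolkata` and `jan` are illustrative names:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['jan', 'jan', 'feb', 'feb'], [2011, 2012, 2011, 2012]],
                     columns=[['Kolkata', 'Kolkata', 'Delhi'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['month', 'year']
frame.columns.names = ['city', 'color']

kolkata = frame['Kolkata']   # all columns under the outer label 'Kolkata'
jan = frame.loc['jan']       # all rows under the outer label 'jan'

print(kolkata)
print(jan)
```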


In [21]:
# Resetting the index: the index levels move back into columns
frameNew.reset_index()

  group  id  roll  marks
0   one   0     0      7
1   one   1     1      6
2   one   2     2      5
3   two   0     3      4
4   two   1     4      3
5   two   2     5      2
6   two   3     6      1


## 2. Reordering and Sorting Levels

In [30]:
frame.swaplevel('year', 'month') 

city       Kolkata     Delhi
color        Green Red Green
year month
2011 jan         0   1     2
2012 jan         3   4     5
2011 feb         6   7     8
2012 feb         9  10    11


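`swaplevel` takes two level numbers or names and returns a new object with the levels interchanged; the data itself is unaltered. `sort_index`, on the other hand, sorts the data by the values in the indicated level. When swapping levels, it’s not uncommon to also use `sort_index` so that the result is lexicographically sorted by that level.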

In [38]:
frame.sort_index(level=0)

city       Kolkata     Delhi
color        Green Red Green
month year
feb   2011       6   7     8
      2012       9  10    11
jan   2011       0   1     2
      2012       3   4     5


In [35]:
frame

city       Kolkata     Delhi
color        Green Red Green
month year
jan   2011       0   1     2
      2012       3   4     5
feb   2011       6   7     8
      2012       9  10    11
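Both calls return new objects, so they can be chained. A sketch that swaps the levels and then sorts on the new outer level, assuming the same `frame` as above:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['jan', 'jan', 'feb', 'feb'], [2011, 2012, 2011, 2012]],
                     columns=[['Kolkata', 'Kolkata', 'Delhi'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['month', 'year']
frame.columns.names = ['city', 'color']

# After the swap, level 0 is 'year'; sort the rows on it
swapped = frame.swaplevel('month', 'year').sort_index(level=0)

print(swapped)
```

With the default `sort_remaining=True`, ties in the chosen level are ordered by the remaining level, so `feb` comes before `jan` within each year.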


## 3. Summary Statistics by Level

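Many descriptive and summary statistics on DataFrame and `Series` accept a `level` option that specifies the index level to aggregate by on a particular axis. This option was deprecated in pandas 1.3 and removed in pandas 2.0, so the two cells below only run on older versions; on current versions the same aggregation is written with `groupby(level=...)`.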

In [26]:
frame.mean(level='month')

city  Kolkata      Delhi
color   Green  Red Green
month
jan       1.5  2.5   3.5
feb       7.5  8.5   9.5


In [29]:
frame.sum(level='city', axis=1)  # sum across columns, grouped by the 'city' column level

city        Kolkata  Delhi
month year
jan   2011        1      2
      2012        7      5
feb   2011       13      8
      2012       19     11
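On pandas versions where reductions no longer accept `level`, the same aggregations can be written with `groupby`. A sketch, assuming the same `frame` as above; `sort=False` is used here only to keep the labels in their original order, and the transpose trick for the column-wise case is one option among several:

```python
import numpy as np
import pandas as pd

frame = pd.DataFrame(np.arange(12).reshape((4, 3)),
                     index=[['jan', 'jan', 'feb', 'feb'], [2011, 2012, 2011, 2012]],
                     columns=[['Kolkata', 'Kolkata', 'Delhi'],
                              ['Green', 'Red', 'Green']])
frame.index.names = ['month', 'year']
frame.columns.names = ['city', 'color']

# Row-wise: mean per month
by_month = frame.groupby(level='month', sort=False).mean()

# Column-wise: transpose, group the (former) column level, transpose back
by_city = frame.T.groupby(level='city', sort=False).sum().T

print(by_month)
print(by_city)
```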


## 4. Combining and Merging Datasets

Data contained in pandas objects can be combined together in a number of ways:
- `pandas.merge` connects rows in DataFrames based on one or more keys. This will be familiar to users of SQL or other relational databases, as it implements database join operations.
- `pandas.concat` concatenates or “stacks” together objects along an axis.
- The `combine_first` instance method enables splicing together overlapping data to fill in missing values in one object with values from another.



### Database-Style DataFrame Joins
Merge or join operations combine datasets by linking rows using one or more keys. These operations are central to relational databases (e.g., SQL-based). The merge function in pandas is the main entry point for using these algorithms on your data.

In [39]:
# 1st Dataframe
df1 = pd.DataFrame({'name': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                    'math': range(7)})
df1

  name  math
0    b     0
1    b     1
2    a     2
3    c     3
4    a     4
5    a     5
6    b     6


In [45]:
# 2nd Dataframe
df2 = pd.DataFrame({'name': [ 'b', 'd','c'], 'bio': range(3)})
df2

  name  bio
0    b    0
1    d    1
2    c    2


In [46]:
pd.merge(df1,df2)


  name  math  bio
0    b     0    0
1    b     1    0
2    b     6    0
3    c     3    2


Note that I didn’t specify which column to join on. If that information is not specified, `merge` uses the **overlapping column names** as the keys. It’s good practice to specify the key explicitly, though:

In [47]:
pd.merge(df1, df2, on='name') # merge on the basis of name column

  name  math  bio
0    b     0    0
1    b     1    0
2    b     6    0
3    c     3    2


In [48]:
# If the column names are different in each object, you can specify them separately:
df3 = pd.DataFrame({'lname': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 
                    'math': range(7)})
df4 = pd.DataFrame({'rname': ['a', 'b', 'd'], 'bio': range(3)})


In [50]:
pd.merge(df3, df4, left_on='lname', right_on='rname')

  lname  math rname  bio
0     b     0     b    1
1     b     1     b    1
2     b     6     b    1
3     a     2     a    0
4     a     4     a    0
5     a     5     a    0
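When the key columns have different names, the result keeps both of them. A small sketch, assuming only one key column is wanted afterwards; the `drop`/`rename` step is an addition here, not part of the original cells:

```python
import pandas as pd

df3 = pd.DataFrame({'lname': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'math': range(7)})
df4 = pd.DataFrame({'rname': ['a', 'b', 'd'], 'bio': range(3)})

merged = (pd.merge(df3, df4, left_on='lname', right_on='rname')
            .drop(columns='rname')              # redundant copy of the key
            .rename(columns={'lname': 'name'}))

print(merged)
```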


### Types of JOIN in DataFrame
![join](https://cdn.mindmajix.com/blog/images/db-01_2119.png "Data Frame Join")

By default, `merge` does an 'inner' join; the keys in the result are the intersection, or the common set found in both tables. Other possible options are 'left', 'right', and 'outer'. The outer join takes the union of the keys, combining the effect of applying both left and right joins.

### Different join types with the `how` argument
- 'inner': use only the key combinations observed in both tables
- 'left': use all key combinations found in the left table
- 'right': use all key combinations found in the right table
- 'outer': use all key combinations observed in both tables together
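One quick way to see the difference between the four options is to compare how many rows each produces. A sketch using the `df1` and `df2` defined earlier; the `counts` dictionary is just for illustration:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'math': range(7)})
df2 = pd.DataFrame({'name': ['b', 'd', 'c'], 'bio': range(3)})

# Number of result rows for each join type
counts = {how: len(pd.merge(df1, df2, on='name', how=how))
          for how in ['inner', 'left', 'right', 'outer']}

print(counts)
```

The left join keeps all 7 rows of `df1`; the right join keeps every `df2` key, with `b` repeated once per matching `df1` row.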



In [56]:
pd.merge(df1, df2, how='outer')

  name  math  bio
0    b   0.0  0.0
1    b   1.0  0.0
2    b   6.0  0.0
3    a   2.0  NaN
4    a   4.0  NaN
5    a   5.0  NaN
6    c   3.0  2.0
7    d   NaN  1.0
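To see which side each row of an outer join came from, `merge` accepts `indicator=True`, which adds a `_merge` column. A sketch with the same `df1` and `df2`; the `origin` dictionary is illustrative:

```python
import pandas as pd

df1 = pd.DataFrame({'name': ['b', 'b', 'a', 'c', 'a', 'a', 'b'],
                    'math': range(7)})
df2 = pd.DataFrame({'name': ['b', 'd', 'c'], 'bio': range(3)})

merged = pd.merge(df1, df2, on='name', how='outer', indicator=True)

# Count rows by origin: present in both tables, only left, or only right
origin = {label: int((merged['_merge'] == label).sum())
          for label in ['both', 'left_only', 'right_only']}

print(merged)
print(origin)
```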


### Merging on Index
In some cases, the merge key(s) in a DataFrame will be found in its index. In this case, you can pass `left_index=True` or `right_index=True` (or both) to indicate that the index should be used as the merge key:

In [57]:
left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

pd.merge(left1, right1, left_on='key', right_index=True)

  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0


In [58]:
# Outer join: keep keys that appear on either side.
# Use left_index/right_index=True when the key is in the index,
# and left_on/right_on when the key is a column.
pd.merge(left1, right1, left_on='key', right_index=True, how='outer')

  key  value  group_val
0   a      0        3.5
2   a      2        3.5
3   a      3        3.5
1   b      1        7.0
4   b      4        7.0
5   c      5        NaN
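For a column-to-index merge like this, DataFrame also has a `join` method. A sketch with the same `left1` and `right1`; note that `join` performs a left join by default, so the `c` row is kept with a missing `group_val`:

```python
import pandas as pd

left1 = pd.DataFrame({'key': ['a', 'b', 'a', 'a', 'b', 'c'],
                      'value': range(6)})
right1 = pd.DataFrame({'group_val': [3.5, 7]}, index=['a', 'b'])

# Match left1['key'] against the index of right1
joined = left1.join(right1, on='key')

print(joined)
```

Unlike the `merge` outputs above, the rows stay in the original order of `left1`.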


In [59]:
lefth = pd.DataFrame({'state': ['Ohio', 'Ohio', 'Ohio', 'Nevada', 'Nevada'],
                      'year': [2000, 2001, 2002, 2001, 2002], 
                      'data': np.arange(5.)})

righth = pd.DataFrame(np.arange(12).reshape((6, 2)),
                      index=[['Nevada', 'Nevada', 'Ohio', 'Ohio','Ohio', 'Ohio'],
                             [2001, 2000, 2000, 2000, 2001, 2002]],
                      columns=['event1', 'event2'])

In [63]:
lefth

    state  year  data
0    Ohio  2000   0.0
1    Ohio  2001   1.0
2    Ohio  2002   2.0
3  Nevada  2001   3.0
4  Nevada  2002   4.0


In [61]:
righth

             event1  event2
Nevada 2001       0       1
       2000       2       3
Ohio   2000       4       5
       2000       6       7
       2001       8       9
       2002      10      11


In [64]:
pd.merge(lefth, righth, left_on=['state', 'year'], right_index=True)

    state  year  data  event1  event2
0    Ohio  2000   0.0       4       5
0    Ohio  2000   0.0       6       7
1    Ohio  2001   1.0       8       9
2    Ohio  2002   2.0      10      11
3  Nevada  2001   3.0       0       1


In [65]:
pd.merge(lefth, righth, left_on=['state', 'year'], right_index=True,how="outer")

    state  year  data  event1  event2
0    Ohio  2000   0.0     4.0     5.0
0    Ohio  2000   0.0     6.0     7.0
1    Ohio  2001   1.0     8.0     9.0
2    Ohio  2002   2.0    10.0    11.0
3  Nevada  2001   3.0     0.0     1.0
4  Nevada  2002   4.0     NaN     NaN
4  Nevada  2000   NaN     2.0     3.0
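Using the indexes of both sides is also possible by passing `left_index=True` and `right_index=True` together. A sketch with two hypothetical frames, `left2` and `right2`, which are not part of the cells above:

```python
import pandas as pd

# Hypothetical data: both frames carry the merge key in their index
left2 = pd.DataFrame([[1., 2.], [3., 4.], [5., 6.]],
                     index=['a', 'c', 'e'],
                     columns=['Ohio', 'Nevada'])
right2 = pd.DataFrame([[7., 8.], [9., 10.], [11., 12.], [13., 14.]],
                      index=['b', 'c', 'd', 'e'],
                      columns=['Missouri', 'Alabama'])

inner = pd.merge(left2, right2, left_index=True, right_index=True)
outer = pd.merge(left2, right2, left_index=True, right_index=True, how='outer')

print(inner)
print(outer)
```

The inner join keeps only the labels present in both indexes (`c` and `e`); the outer join keeps the union, with missing values where one side has no row.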
