<div class="alert alert-block" style = "background-color: black">
    <p><b><font size="+4" color="orange">Data Wrangling in Pandas</font></b></p>
    <p><b><font size="+1" color="white">by Jubril Davies</font></b></p>
    </div>

In [73]:
import warnings
warnings.filterwarnings('ignore')
import numpy as np
import pandas as pd
import seaborn as sb
import matplotlib
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg') # display figures in vector format
plt.rcParams.update({'font.size':12}) # set global font size

---
<div class="alert alert-block" style="background-color: black">
    <p><b><font size="+3" color="white">Combining & Merging Datasets</font></b></p>
    </div>
    
---

Most interesting studies of data come from combining different data sources. These operations include, but are not limited to:

  - Complicated database-style joins and merges that correctly handle overlaps
  - Concatenation of 2 different datasets
  - `combine_first`, which splices together overlapping data to fill missing values in one object with values from another

Pandas comes with a variety of functions and methods that make this sort of data wrangling fast and straightforward. 
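
As a quick illustration of the last item in the list above, here is a minimal sketch of `combine_first`. The two Series (`primary` and `backup`) are hypothetical and exist only for this example.

```python
import numpy as np
import pandas as pd

# Hypothetical data: `primary` has a gap at 'y'; `backup` covers it and adds label 'w'.
primary = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
backup = pd.Series([10.0, 20.0, 30.0, 40.0], index=['x', 'y', 'z', 'w'])

# Values from `primary` win; NaN or absent labels are filled from `backup`.
patched = primary.combine_first(backup)
print(patched)
```

The result covers the union of both indexes: `x` and `z` keep the values from `primary`, while `y` and `w` are taken from `backup`.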

<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Database-Like Joins & Merges</font></b></p>
    </div>
    
#### Merge or Join operations combine datasets by linking rows using one or more primary keys as found in relational databases. 

The pandas `merge` function offers a convenient way to do these operations within Python.

> #### **Example on Merging two Dataframes in a MANY-TO-ONE merge  situation**

In [3]:
data1 = pd.DataFrame({'values1':range(6),'key':['b','b','a','d','a','a']})
data1

Unnamed: 0,values1,key
0,0,b
1,1,b
2,2,a
3,3,d
4,4,a
5,5,a


In [4]:
data2 = pd.DataFrame({'values2':range(3),'key':['a','b','e']})
data2

Unnamed: 0,values2,key
0,0,a
1,1,b
2,2,e


In [5]:
pd.merge(data1,data2)

Unnamed: 0,values1,key,values2
0,0,b,1
1,1,b,1
2,2,a,0
3,4,a,0
4,5,a,0


>**Specifying explicitly the column on which the join is based**

In [6]:
pd.merge(data1,data2,on='key')

Unnamed: 0,values1,key,values2
0,0,b,1
1,1,b,1
2,2,a,0
3,4,a,0
4,5,a,0


By default, pandas performs an inner join, which keeps the **intersection** of the keys. When `on` is omitted, the merge uses the column names common to both DataFrames (here, `key`). If the key columns have different names in the two DataFrames, specify them separately with `left_on` and `right_on`.
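
The intersection behaviour can be checked directly. The sketch below rebuilds `data1` and `data2` so that it runs on its own, and also passes `validate='many_to_one'`, an optional `merge` argument that raises an error if the right-hand keys turn out not to be unique. Using it here is a suggestion, not part of the original example.

```python
import pandas as pd

data1 = pd.DataFrame({'values1': range(6), 'key': ['b', 'b', 'a', 'd', 'a', 'a']})
data2 = pd.DataFrame({'values2': range(3), 'key': ['a', 'b', 'e']})

# validate='many_to_one' raises MergeError if keys in the right frame repeat.
merged = pd.merge(data1, data2, on='key', how='inner', validate='many_to_one')

# The surviving keys are exactly the keys present in both frames.
common_keys = set(data1['key']) & set(data2['key'])
print(sorted(common_keys))
print(set(merged['key']) == common_keys)
```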

In [7]:
data3 = pd.DataFrame({'xkey':['b','b','a','d','a','a'],'values1':range(6)})
data3

Unnamed: 0,xkey,values1
0,b,0
1,b,1
2,a,2
3,d,3
4,a,4
5,a,5


In [8]:
data4 = pd.DataFrame({'ykey':['a','b','e'],'values2':range(3)})
data4

Unnamed: 0,ykey,values2
0,a,0
1,b,1
2,e,2


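* **`left_on` and `right_on` name the key column in each DataFrame. The join type is still controlled by `how`: a left join keeps all rows from the left DataFrame and only the matching rows from the right DataFrame, and a right join does the reverse. The example below leaves `how` at its default, so it is an inner join.**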

In [9]:
pd.merge(data3,data4,left_on='xkey',right_on='ykey') # inner join (the default) on differently named key columns

Unnamed: 0,xkey,values1,ykey,values2
0,b,0,b,1
1,b,1,b,1
2,a,2,a,0
3,a,4,a,0
4,a,5,a,0


As you can see, there is no `d` or `e` in the merged result. That is because pandas does an inner join by default, so the result contains only the keys present in both DataFrames.

> #### **Example on Merging two Dataframes using an Outer Join**

In [10]:
pd.merge(data1,data2,how='outer')

Unnamed: 0,values1,key,values2
0,0.0,b,1.0
1,1.0,b,1.0
2,2.0,a,0.0
3,4.0,a,0.0
4,5.0,a,0.0
5,3.0,d,
6,,e,2.0


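This gives a union of the two DataFrames: the result contains every key found in either one, which combines the effects of a left and a right join. Values missing on one side are filled with NaN, which is why the integer columns are displayed as floats.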
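
To see which side each row of an outer join came from, `merge` accepts `indicator=True`. The following is a minimal sketch that rebuilds `data1` and `data2` so it is self-contained; the variable names `outer` and `unmatched` are introduced only for this illustration.

```python
import pandas as pd

data1 = pd.DataFrame({'values1': range(6), 'key': ['b', 'b', 'a', 'd', 'a', 'a']})
data2 = pd.DataFrame({'values2': range(3), 'key': ['a', 'b', 'e']})

# indicator=True adds a `_merge` column: 'left_only', 'right_only' or 'both'.
outer = pd.merge(data1, data2, on='key', how='outer', indicator=True)
print(outer[['key', '_merge']])

# Rows that an inner join would have dropped.
unmatched = outer[outer['_merge'] != 'both']
print(unmatched)
```

Here the `d` row is flagged `left_only` and the `e` row `right_only`, which matches the NaN pattern in the outer join output.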

> #### **Example on Merging two Dataframes in a ONE-TO-ONE merge situation using a LEFT JOIN**

In [11]:
dt1 = pd.DataFrame({'id': [1,2,3,4], 'Name': ['Alice','Bob','Charlie','David'],'Age':[25,30,35,40]})
dt1

Unnamed: 0,id,Name,Age
0,1,Alice,25
1,2,Bob,30
2,3,Charlie,35
3,4,David,40


In [12]:
dt2 = pd.DataFrame({'id': [1,2,4,5], 'Dept': ['HR','IT','Finance','Marketing']})
dt2

Unnamed: 0,id,Dept
0,1,HR
1,2,IT
2,4,Finance
3,5,Marketing


* **A left join means that we keep all rows from the left dataFrame and only those rows from the right dataframe where the keys match**
* **This is useful when you want to preserve all the information on the left and supplement with information on the right**
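
`dt1` and `dt2` each list every `id` once, so a left join returns exactly one row per left row. As a contrast, here is a minimal sketch of what happens when a key repeats on both sides. The frames `left` and `right` are hypothetical and used only for this illustration.

```python
import pandas as pd

# Hypothetical frames: key 'a' appears twice on each side.
left = pd.DataFrame({'key': ['a', 'a', 'b'], 'lval': [1, 2, 3]})
right = pd.DataFrame({'key': ['a', 'a', 'c'], 'rval': [10, 20, 30]})

# Every left 'a' row pairs with every right 'a' row (2 x 2 = 4 rows);
# 'b' has no match and is kept with NaN in `rval`.
result = pd.merge(left, right, on='key', how='left')
print(result)
print(len(left), len(result))
```

A left join keeps every left row, but with repeated keys it can return more rows than the left DataFrame contains.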

In [13]:
pd.merge(dt1,dt2,on='id',how='left')

Unnamed: 0,id,Name,Age,Dept
0,1,Alice,25,HR
1,2,Bob,30,IT
2,3,Charlie,35,
3,4,David,40,Finance


> #### **Example on Merging two Dataframes with MULTIPLE KEYS**

In [14]:
df1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],'DeptID':[10,20,30,40,50]})
df1

Unnamed: 0,EmployeeID,Name,DeptID
0,1,Alice,10
1,2,Bob,20
2,3,Charlie,30
3,4,David,40
4,5,Emma,50


In [15]:
df2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'EmployeeID':[1,2,3,4,5],'DeptName':['HR','IT','Sales','Finance','Marketing'],})
df2

Unnamed: 0,DeptID,EmployeeID,DeptName
0,10,1,HR
1,20,2,IT
2,30,3,Sales
3,40,4,Finance
4,60,5,Marketing


> #### **Left Join**

In [16]:
pd.merge(df1,df2,how='left',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,Emma,50,


> #### **Right Join**

In [17]:
pd.merge(df1,df2,how='right',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,,60,Marketing


> #### **Outer Join**

In [18]:
pd.merge(df1,df2,how='outer',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,Emma,50,
5,5,,60,Marketing


> #### **Example on Merging two Dataframes with MULTIPLE KEYS & OVERLAPPING COLUMN NAMES**

In [19]:
dframe1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50],'DeptName':['Sales','IT','HR','Finance','Marketing']})
dframe1

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,Sales
1,2,Bob,20,IT
2,3,Charlie,30,HR
3,4,David,40,Finance
4,5,Emma,50,Marketing


In [20]:
dframe2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'EmployeeID':[1,2,3,4,5],'DeptName':['HR','IT','Sales','Finance','Marketing'],})
dframe2

Unnamed: 0,DeptID,EmployeeID,DeptName
0,10,1,HR
1,20,2,IT
2,30,3,Sales
3,40,4,Finance
4,60,5,Marketing


> #### **Left Join with Suffixes**

In [21]:
pd.merge(dframe1,dframe2,how='left',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,Emma,50,Marketing,


> #### **Right Join with Suffixes**

In [22]:
pd.merge(dframe1,dframe2,how='right',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,,60,,Marketing


> #### **Outer Join with Suffixes**

This returns all rows from both DataFrames. Where a row has no match, the columns coming from the other DataFrame are filled with NaN.

In [23]:
pd.merge(dframe1,dframe2,how='outer',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,Emma,50,Marketing,
5,5,,60,,Marketing


> #### **Example on Merging two Dataframes ON INDEX**

In some cases the merge key or keys of a DataFrame are found in its index. You can then pass `left_index=True`, `right_index=True`, or both to indicate that the index should be used as the merge key.

In [47]:
dx1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50]})
dx1

Unnamed: 0,EmployeeID,Name,DeptID
0,1,Alice,10
1,2,Bob,20
2,3,Charlie,30
3,4,David,40
4,5,Emma,50


In [48]:
dx2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'Salary':[70000,80000,75000,90000,65000]},index =[1,2,3,4,6])
dx2

Unnamed: 0,DeptID,Salary
1,10,70000
2,20,80000
3,30,75000
4,40,90000
6,60,65000


In [50]:
pd.merge(dx1,dx2,left_on='EmployeeID',right_index=True,how='left')

Unnamed: 0,EmployeeID,Name,DeptID_x,DeptID_y,Salary
0,1,Alice,10,10.0,70000.0
1,2,Bob,20,20.0,80000.0
2,3,Charlie,30,30.0,75000.0
3,4,David,40,40.0,90000.0
4,5,Emma,50,,


* `left_on='EmployeeID'`: this tells pandas to use the `EmployeeID` column in `dx1` as the merge key
* `right_index=True`: this tells pandas to use the index of `dx2` as the merge key
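
The same column-to-index lookup can also be written with `DataFrame.join`, which matches a column of the calling DataFrame against the index of the other and defaults to a left join. This is a minimal sketch, rebuilding `dx1` and `dx2` so it runs on its own. Selecting only `Salary` from `dx2` is a choice made here to avoid the duplicated `DeptID_x`/`DeptID_y` columns.

```python
import pandas as pd

dx1 = pd.DataFrame({'EmployeeID': [1, 2, 3, 4, 5],
                    'Name': ['Alice', 'Bob', 'Charlie', 'David', 'Emma'],
                    'DeptID': [10, 20, 30, 40, 50]})
dx2 = pd.DataFrame({'DeptID': [10, 20, 30, 40, 60],
                    'Salary': [70000, 80000, 75000, 90000, 65000]},
                   index=[1, 2, 3, 4, 6])

# Match dx1['EmployeeID'] against dx2's index; join defaults to how='left'.
joined = dx1.join(dx2[['Salary']], on='EmployeeID')
print(joined)
```

Emma (EmployeeID 5) has no matching index label in `dx2`, so her `Salary` is NaN.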

> #### **Example on Merging two Dataframes WITH MULTIPLE KEYS WHERE ONE DATAFRAME HOLDS THE KEYS IN ITS INDEX**

In [70]:
dd1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50]})
dd1.set_index(['EmployeeID','DeptID'],inplace=True)
dd1

Unnamed: 0_level_0,Unnamed: 1_level_0,Name
EmployeeID,DeptID,Unnamed: 2_level_1
1,10,Alice
2,20,Bob
3,30,Charlie
4,40,David
5,50,Emma


In [64]:
dd2 = pd.DataFrame({'DeptID':[10,20,30,40,50],'EmployeeID':[1,2,3,4,5],'Dept':['Sales','IT','HR','Finance','Marketing']})
dd2

Unnamed: 0,DeptID,EmployeeID,Dept
0,10,1,Sales
1,20,2,IT
2,30,3,HR
3,40,4,Finance
4,50,5,Marketing


In [72]:
pd.merge(dd1,dd2,on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,DeptID,Name,Dept
0,1,10,Alice,Sales
1,2,20,Bob,IT
2,3,30,Charlie,HR
3,4,40,David,Finance
4,5,50,Emma,Marketing


> #### **Merging on Multiple Key Index**

In [74]:
dd3 = pd.DataFrame({'DeptID':[10,20,30],'EmployeeID':[1,2,3],'Dept':['Sales','IT','HR']})
dd3

Unnamed: 0,DeptID,EmployeeID,Dept
0,10,1,Sales
1,20,2,IT
2,30,3,HR


In [80]:
pd.merge(dd1,dd3,left_index=True,right_on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,Name,DeptID,EmployeeID,Dept
0,Alice,10,1,Sales
1,Bob,20,2,IT
2,Charlie,30,3,HR


## COMBINING DATASETS

### **1. CONCAT**

In [25]:
# Example - let's make a helper that builds small DataFrames for the examples below
def make_df(cols, ind):
    """Return a DataFrame whose cells are '<column><index label>', e.g. 'A0'."""
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)

In [26]:
#Example DataFrame
make_df('ABC',range(3))

Unnamed: 0,A,B,C
0,A0,B0,C0
1,A1,B1,C1
2,A2,B2,C2


### Using Pandas pd.concat()
`pd.concat()` can be used for simple concatenation of Series or DataFrame objects.


# *Syntax*
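`pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)`

This is the signature as of pandas 1.x; defaults may differ slightly in other versions. The `join_axes` argument shown in older tutorials was removed in pandas 1.0.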

* **Example - Using Series**

In [27]:
series1 = pd.Series(['A','B','C'],index=[1,2,3])
series2 = pd.Series(['D','E','F'],index=[4,5,6])
pd.concat([series1,series2]) 

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

**By default, concatenation takes place row-wise (`axis=0`), for both Series and DataFrames**

* **Example - For a DataFrame**

In [28]:
df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
print(df1,"\n"); print(df2,"\n"); print(pd.concat([df1,df2]))

    A   B
1  A1  B1
2  A2  B2 

    A   B
3  A3  B3
4  A4  B4 

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


### Specifying the axis along which concatenation will take place

- **Example**
   - `axis=0` stacks the objects row-wise (one below the other), while `axis=1` places them column-wise (side by side)

In [29]:
df3 = make_df('AB',[0,1])
df4 = make_df('CD',[0,1])
print(df3); print(df4); print(pd.concat([df3,df4], axis=1))

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1


### Dealing with duplicate Indices
* Pandas preserves indices even if the result will have duplicate indices
* **Example**

In [30]:
x = make_df('AB',[0,1])
y = make_df('AB',[2,3])

#make duplicate indices
y.index = x.index
print(x); print(y); print(pd.concat([x,y]))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3


**Observe the duplicate indices in the concatenated DataFrame**

* **Catching the repeat indices as an error**

To verify that the indices in the result of `pd.concat()` do not overlap, set the **verify_integrity** flag to `True`. This raises a `ValueError` if there are duplicate indices.

In [31]:
try:
    pd.concat([x,y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


* **Ignoring the overlapping Indices**

This is useful when the index does not matter and can simply be discarded. Set the **ignore_index** flag to `True` and the operation creates a new integer index for the result.

* **Example**

In [32]:
print(x); print(y); print(pd.concat([x,y], ignore_index=True))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3


* **Adding MultiIndex Keys** 

Another approach is to use the `keys` option to specify a label for each data source. The result is a hierarchically indexed object (here, a DataFrame) containing the data:

In [33]:
print(x); print(y); print(pd.concat([x,y], keys=['x', 'y']))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3


As shown above, this gives a DataFrame with a MultiIndex, which can be explored using hierarchical indexing.
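
A minimal sketch of that kind of exploration follows. It builds `x` and `y` by hand with the same values as above so that it is self-contained; the name `combined` is introduced only for this illustration.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])
combined = pd.concat([x, y], keys=['x', 'y'])

print(combined.loc['y'])        # all rows that came from y
print(combined.loc[('x', 1)])   # a single row: source x, original index 1
print(combined.xs(0, level=1))  # original index 0 from every source
```

The outer level selects a data source, the inner level selects the original index label, and `xs` takes a cross-section on the inner level across all sources.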

* **Concatenation with Joins**

The previous examples showed concatenation between dataframes with shared column names. 

   - But real world data from different sources might have different sets of column names.

`pd.concat()` has several options to handle this.

**Example - Outer Join**

In [34]:
df5 = make_df('ABC',[1,2])
df6 = make_df('BCD',[3,4])
print(df5); print(df6); print(pd.concat([df5,df6]))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4


The default join is a union, **an outer join**: the output DataFrame contains every column from either input, and entries with no data are filled with NaN.

* **To change this default output to an intersection (an inner join), pass `join='inner'` to `pd.concat()`. The `join_axes` parameter found in older tutorials was removed in pandas 1.0.**
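
When the result should follow one frame's columns rather than the union or the intersection, which is what `join_axes` used to offer, one option is to reindex after concatenating. The sketch below builds `df5` and `df6` by hand with the same values as above so that it is self-contained. The name `aligned` is introduced only for this illustration.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Outer-join concat first, then keep only the columns of df5.
aligned = pd.concat([df5, df6]).reindex(columns=df5.columns)
print(aligned)
```

Column `D` is dropped because it is not in `df5`, and column `A` is NaN for the rows that came from `df6`.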

**Example - Inner join**

In [35]:
print(df5); print(df6); print(pd.concat([df5,df6],join='inner'))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4


**Example - Joining along axes**