<div class="alert alert-block" style = "background-color: black">
    <p><b><font size="+4" color="orange">Data Wrangling in Pandas</font></b></p>
    <p><b><font size="+1" color="white">by Jubril Davies</font></b></p>
    </div>

In [1]:
import warnings
warnings.filterwarnings('ignore')

import numpy as np
import pandas as pd

$$\begin{align} \text{This work focuses on rearranging data} \end{align}$$
---
<div class="alert alert-block" style="background-color: black">
    <p><b><font size="+3" color="white">Combining & Merging Datasets</font></b></p>
    </div>
    
---

Most interesting studies of data come from combining different data sources. These operations include, but are not limited to:

  - Complicated database-style joins and merges that correctly handle overlaps
  - Concatenation of 2 different datasets
  - `combine_first`, which enables splicing together overlapping data to fill missing values in one object with values from another

Pandas comes with a variety of functions and methods that make this sort of data wrangling fast and straightforward. 

<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Database-Like Joins & Merges</font></b></p>
    </div>

---
    
#### Merge or Join operations combine datasets by linking rows using one or more primary keys as found in relational databases. 

<div style="background-color: black; border-color: black; padding: 5px">
    <p><b><font size="+2" color="white">1. Using Pandas Merge Method</font></b></p>
    </div>
    
The pandas `merge` function offers a convenient way to do these operations within Python.

> #### **Example on Merging two Dataframes in a MANY-TO-ONE merge  situation**

In [2]:
data1 = pd.DataFrame({'values1':range(6),'key':['b','b','a','d','a','a']})
data1

Unnamed: 0,values1,key
0,0,b
1,1,b
2,2,a
3,3,d
4,4,a
5,5,a


In [3]:
data2 = pd.DataFrame({'values2':range(3),'key':['a','b','e']})
data2

Unnamed: 0,values2,key
0,0,a
1,1,b
2,2,e


In [4]:
pd.merge(data1,data2)

Unnamed: 0,values1,key,values2
0,0,b,1
1,1,b,1
2,2,a,0
3,4,a,0
4,5,a,0


>**Specifying explicitly the column on which the join is based**

In [5]:
pd.merge(data1,data2,on='key')

Unnamed: 0,values1,key,values2
0,0,b,1
1,1,b,1
2,2,a,0
3,4,a,0
4,5,a,0
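The many-to-one relationship in this example can also be asserted at merge time. The sketch below uses the `validate` argument of `pd.merge`, which is not used elsewhere in this notebook; the names `checked` and `raised` are illustrative only.

```python
import pandas as pd

data1 = pd.DataFrame({'values1': range(6), 'key': ['b', 'b', 'a', 'd', 'a', 'a']})
data2 = pd.DataFrame({'values2': range(3), 'key': ['a', 'b', 'e']})

# 'many_to_one' asserts that the merge keys are unique in the right DataFrame
checked = pd.merge(data1, data2, on='key', validate='many_to_one')

# Swapping the inputs breaks that assumption: 'key' repeats in data1,
# so pandas raises MergeError instead of silently duplicating rows
try:
    pd.merge(data2, data1, on='key', validate='many_to_one')
    raised = False
except pd.errors.MergeError:
    raised = True
```

Adding `validate` costs one extra uniqueness check and can catch unexpected duplicate keys before they inflate the row count.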


By default, Pandas does an inner join which is an **intersection** and the merge is done using the `key` as the column to join on. If the column names of the primary key are different, the merge key can be explicitly specified separately.

In [6]:
data3 = pd.DataFrame({'xkey':['b','b','a','d','a','a'],'values1':range(6)})
data3

Unnamed: 0,xkey,values1
0,b,0
1,b,1
2,a,2
3,d,3
4,a,4
5,a,5


In [7]:
data4 = pd.DataFrame({'ykey':['a','b','e'],'values2':range(3)})
data4

Unnamed: 0,ykey,values2
0,a,0
1,b,1
2,e,2


* **When the key columns have different names, pass `left_on` and `right_on`. A left join keeps all rows from the left DataFrame and only the matching rows from the right DataFrame; a right join does the reverse**

In [8]:
pd.merge(data3,data4,left_on='xkey',right_on='ykey') # inner join by default; each side names its own key column

Unnamed: 0,xkey,values1,ykey,values2
0,b,0,b,1
1,b,1,b,1
2,a,2,a,0
3,a,4,a,0
4,a,5,a,0


As the result shows, there is no `d` or `e` in the merged output. That is because pandas performs an inner join by default, so the keys in the result are the intersection of the keys in both DataFrames.
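The merge in this cell is an inner join. As a minimal sketch of the left and right variants on the same differently named keys (the result names `left_joined` and `right_joined` are illustrative):

```python
import pandas as pd

data3 = pd.DataFrame({'xkey': ['b', 'b', 'a', 'd', 'a', 'a'], 'values1': range(6)})
data4 = pd.DataFrame({'ykey': ['a', 'b', 'e'], 'values2': range(3)})

# Left join: every row of data3 is kept, so 'd' survives with NaN in the data4 columns
left_joined = pd.merge(data3, data4, left_on='xkey', right_on='ykey', how='left')

# Right join: every row of data4 is kept, so 'e' survives with NaN in the data3 columns
right_joined = pd.merge(data3, data4, left_on='xkey', right_on='ykey', how='right')
```

Both key columns (`xkey` and `ykey`) remain in the output because their names differ.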

> #### **Example on Merging two Dataframes using an Outer Join**

In [9]:
pd.merge(data1,data2,how='outer')

Unnamed: 0,values1,key,values2
0,0.0,b,1.0
1,1.0,b,1.0
2,2.0,a,0.0
3,4.0,a,0.0
4,5.0,a,0.0
5,3.0,d,
6,,e,2.0


This takes the union of the keys from the two DataFrames, which has the combined effect of applying both a left and a right join.
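To see which side each row of an outer join came from, `pd.merge` accepts `indicator=True`, which adds a `_merge` column. This is a small sketch on the same two DataFrames; the variable names are illustrative.

```python
import pandas as pd

data1 = pd.DataFrame({'values1': range(6), 'key': ['b', 'b', 'a', 'd', 'a', 'a']})
data2 = pd.DataFrame({'values2': range(3), 'key': ['a', 'b', 'e']})

# indicator=True adds a '_merge' column with 'left_only', 'right_only' or 'both'
outer = pd.merge(data1, data2, on='key', how='outer', indicator=True)

# Keys that appear on only one side of the join
left_only_keys = outer.loc[outer['_merge'] == 'left_only', 'key'].tolist()
right_only_keys = outer.loc[outer['_merge'] == 'right_only', 'key'].tolist()
```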

> #### **Example on Merging two Dataframes in a ONE-TO-ONE merge situation**

In [10]:
dt1 = pd.DataFrame({'id': [1,2,3,4], 'Name': ['Alice','Bob','Charlie','David'],'Age':[25,30,35,40]})
dt1

Unnamed: 0,id,Name,Age
0,1,Alice,25
1,2,Bob,30
2,3,Charlie,35
3,4,David,40


In [11]:
dt2 = pd.DataFrame({'id': [1,2,4,5], 'Dept': ['HR','IT','Finance','Marketing']})
dt2

Unnamed: 0,id,Dept
0,1,HR
1,2,IT
2,4,Finance
3,5,Marketing


* **A left join means that we keep all rows from the left dataFrame and only those rows from the right dataframe where the keys match**
* **This is useful when you want to preserve all the information on the left and supplement with information on the right**

In [12]:
pd.merge(dt1,dt2,on='id',how='left')

Unnamed: 0,id,Name,Age,Dept
0,1,Alice,25,HR
1,2,Bob,30,IT
2,3,Charlie,35,
3,4,David,40,Finance


> #### **Example on Merging two Dataframes in a ONE-TO-ONE merge situation with MULTIPLE KEYS**

In [13]:
df1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],'DeptID':[10,20,30,40,50]})
df1

Unnamed: 0,EmployeeID,Name,DeptID
0,1,Alice,10
1,2,Bob,20
2,3,Charlie,30
3,4,David,40
4,5,Emma,50


In [14]:
df2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'EmployeeID':[1,2,3,4,5],'DeptName':['HR','IT','Sales','Finance','Marketing'],})
df2

Unnamed: 0,DeptID,EmployeeID,DeptName
0,10,1,HR
1,20,2,IT
2,30,3,Sales
3,40,4,Finance
4,60,5,Marketing


> #### **Left Join**

In [15]:
pd.merge(df1,df2,how='left',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,Emma,50,


> #### **Right Join**

In [16]:
pd.merge(df1,df2,how='right',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,,60,Marketing


> #### **Outer Join**

In [17]:
pd.merge(df1,df2,how='outer',on=['EmployeeID','DeptID'])

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,HR
1,2,Bob,20,IT
2,3,Charlie,30,Sales
3,4,David,40,Finance
4,5,Emma,50,
5,5,,60,Marketing


> #### **Example on Merging two Dataframes in a ONE-TO-ONE merge situation with MULTIPLE KEYS & OVERLAPPING COLUMN NAMES**

In [18]:
dframe1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50],'DeptName':['Sales','IT','HR','Finance','Marketing']})
dframe1

Unnamed: 0,EmployeeID,Name,DeptID,DeptName
0,1,Alice,10,Sales
1,2,Bob,20,IT
2,3,Charlie,30,HR
3,4,David,40,Finance
4,5,Emma,50,Marketing


In [19]:
dframe2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'EmployeeID':[1,2,3,4,5],'DeptName':['HR','IT','Sales','Finance','Marketing'],})
dframe2

Unnamed: 0,DeptID,EmployeeID,DeptName
0,10,1,HR
1,20,2,IT
2,30,3,Sales
3,40,4,Finance
4,60,5,Marketing


> #### **Left Join with Suffixes**

In [20]:
pd.merge(dframe1,dframe2,how='left',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,Emma,50,Marketing,


> #### **Right Join with Suffixes**

In [21]:
pd.merge(dframe1,dframe2,how='right',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,,60,,Marketing


> #### **Outer Join with Suffixes**

This returns all rows from both dataframes. Where there are no matches, it fills the missing matches with NaN

In [22]:
pd.merge(dframe1,dframe2,how='outer',on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,Name,DeptID,DeptName_left,DeptName_right
0,1,Alice,10,Sales,HR
1,2,Bob,20,IT,IT
2,3,Charlie,30,HR,Sales
3,4,David,40,Finance,Finance
4,5,Emma,50,Marketing,
5,5,,60,,Marketing


> #### **Example on Merging two Dataframes ON INDEX**

In some cases the merge key or keys of a DataFrame are found in its index. You can then pass `left_index=True`, `right_index=True`, or both to indicate that the index should be used as the merge key.

In [23]:
dx1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50]})
dx1

Unnamed: 0,EmployeeID,Name,DeptID
0,1,Alice,10
1,2,Bob,20
2,3,Charlie,30
3,4,David,40
4,5,Emma,50


In [24]:
dx2 = pd.DataFrame({'DeptID':[10,20,30,40,60],'Salary':[70000,80000,75000,90000,65000]},index =[1,2,3,4,6])
dx2

Unnamed: 0,DeptID,Salary
1,10,70000
2,20,80000
3,30,75000
4,40,90000
6,60,65000


In [25]:
pd.merge(dx1,dx2,left_on='EmployeeID',right_index=True,how='left')

Unnamed: 0,EmployeeID,Name,DeptID_x,DeptID_y,Salary
0,1,Alice,10,10.0,70000.0
1,2,Bob,20,20.0,80000.0
2,3,Charlie,30,30.0,75000.0
3,4,David,40,40.0,90000.0
4,5,Emma,50,,


* `left_on='EmployeeID'`: this tells pandas to use the `EmployeeID` column in `dx1` as the merge key
* `right_index=True`: this tells pandas to use the index of `dx2` as the merge key

> #### **Example on Merging two Dataframes WITH MULTIPLE KEYS WITH ONE KEY AS INDEX**

In [26]:
dd1 = pd.DataFrame({'EmployeeID':[1,2,3,4,5],'Name':['Alice','Bob','Charlie','David','Emma'],
                        'DeptID':[10,20,30,40,50]})
dd1.set_index(['EmployeeID','DeptID'],inplace=True)
dd1

Unnamed: 0_level_0,Unnamed: 1_level_0,Name
EmployeeID,DeptID,Unnamed: 2_level_1
1,10,Alice
2,20,Bob
3,30,Charlie
4,40,David
5,50,Emma


In [27]:
dd2 = pd.DataFrame({'DeptID':[10,20,30,40,50],'EmployeeID':[1,2,3,4,5],'Dept':['Sales','IT','HR','Finance','Marketing']})
dd2

Unnamed: 0,DeptID,EmployeeID,Dept
0,10,1,Sales
1,20,2,IT
2,30,3,HR
3,40,4,Finance
4,50,5,Marketing


In [28]:
pd.merge(dd1,dd2,on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,EmployeeID,DeptID,Name,Dept
0,1,10,Alice,Sales
1,2,20,Bob,IT
2,3,30,Charlie,HR
3,4,40,David,Finance
4,5,50,Emma,Marketing


> #### **Example on Merging on Multiple Key Index**

In [29]:
dd3 = pd.DataFrame({'DeptID':[10,20,30],'EmployeeID':[1,2,3],'Dept':['Sales','IT','HR']})
dd3

Unnamed: 0,DeptID,EmployeeID,Dept
0,10,1,Sales
1,20,2,IT
2,30,3,HR


In [30]:
pd.merge(dd1,dd3,left_index=True,right_on=['EmployeeID','DeptID'],suffixes=('_left','_right'))

Unnamed: 0,Name,DeptID,EmployeeID,Dept
0,Alice,10,1,Sales
1,Bob,20,2,IT
2,Charlie,30,3,HR


> #### **Another example on Merging on Hierarchical Indexed Data**

In [31]:
dl = pd.DataFrame({'states': ['Oregon','Oregon','Oregon','Virginia','Missouri','Virginia'],'year':[1999,2000,2001,2000,2001,2002]
                   ,'population':[120000,80000,60000,210000,310000,180000]})
dl

Unnamed: 0,states,year,population
0,Oregon,1999,120000
1,Oregon,2000,80000
2,Oregon,2001,60000
3,Virginia,2000,210000
4,Missouri,2001,310000
5,Virginia,2002,180000


In [32]:
dr = pd.DataFrame(np.linspace(60000,320000,14).reshape(7,2),index=[['Oregon','Virginia','Virginia','Oregon','Missouri','Oregon','Oregon'],
                                                                                [1999,2002,2000,2000,2001,1999,2001]],columns=['population1','population2'])
dr

Unnamed: 0,Unnamed: 1,population1,population2
Oregon,1999,60000.0,80000.0
Virginia,2002,100000.0,120000.0
Virginia,2000,140000.0,160000.0
Oregon,2000,180000.0,200000.0
Missouri,2001,220000.0,240000.0
Oregon,1999,260000.0,280000.0
Oregon,2001,300000.0,320000.0


> **Now merging on multiple columns while taking note of the duplicate index values**

In [33]:
pd.merge(dl,dr,left_on=['states','year'],right_index=True)

Unnamed: 0,states,year,population,population1,population2
0,Oregon,1999,120000,60000.0,80000.0
0,Oregon,1999,120000,260000.0,280000.0
1,Oregon,2000,80000,180000.0,200000.0
2,Oregon,2001,60000,300000.0,320000.0
3,Virginia,2000,210000,140000.0,160000.0
4,Missouri,2001,310000,220000.0,240000.0
5,Virginia,2002,180000,100000.0,120000.0


> #### **It is possible to perform an outer join (i.e a union) while merging on Multiple columns**

In [34]:
pd.merge(dl,dr,left_on=['states','year'],right_index=True,how='outer')

Unnamed: 0,states,year,population,population1,population2
0,Oregon,1999,120000,60000.0,80000.0
0,Oregon,1999,120000,260000.0,280000.0
1,Oregon,2000,80000,180000.0,200000.0
2,Oregon,2001,60000,300000.0,320000.0
3,Virginia,2000,210000,140000.0,160000.0
4,Missouri,2001,310000,220000.0,240000.0
5,Virginia,2002,180000,100000.0,120000.0


> #### **It is possible to also merge on the indices of both tables**

In [35]:
dl1 = pd.DataFrame(np.linspace(180000,280000,6).reshape(3,2),columns=['Oregon','Virginia'],
                    index= [1999,2001,2002])
dl1

Unnamed: 0,Oregon,Virginia
1999,180000.0,200000.0
2001,220000.0,240000.0
2002,260000.0,280000.0


In [36]:
dl2 = pd.DataFrame(np.linspace(180000,320000,8).reshape(4,2),columns=['Missouri','Nevada'],
                    index= [1999,2000,2001,2002,])
dl2

Unnamed: 0,Missouri,Nevada
1999,180000.0,200000.0
2000,220000.0,240000.0
2001,260000.0,280000.0
2002,300000.0,320000.0


> **Now merging on multiple columns using indices from both tables**

In [37]:
pd.merge(dl1,dl2,how='outer',left_index=True,right_index=True)

Unnamed: 0,Oregon,Virginia,Missouri,Nevada
1999,180000.0,200000.0,180000.0,200000.0
2000,,,220000.0,240000.0
2001,220000.0,240000.0,260000.0,280000.0
2002,260000.0,280000.0,300000.0,320000.0


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">2. Using Pandas Join Method</font></b></p>
    </div>
    
#### The pandas `join` instance method is more convenient for merging on the index. It can also be used to combine many DataFrame objects with similar indices but non-overlapping columns

> #### **Applying join to the previous example gives the same result**

In [38]:
dl1.join(dl2,how='outer')

Unnamed: 0,Oregon,Virginia,Missouri,Nevada
1999,180000.0,200000.0,180000.0,200000.0
2000,,,220000.0,240000.0
2001,220000.0,240000.0,260000.0,280000.0
2002,260000.0,280000.0,300000.0,320000.0


> **Join also supports joining the index of the passed DataFrame on one or more columns of the calling DataFrame**


In [39]:
dl.join(dr,on=['states','year'])

Unnamed: 0,states,year,population,population1,population2
0,Oregon,1999,120000,60000.0,80000.0
0,Oregon,1999,120000,260000.0,280000.0
1,Oregon,2000,80000,180000.0,200000.0
2,Oregon,2001,60000,300000.0,320000.0
3,Virginia,2000,210000,140000.0,160000.0
4,Missouri,2001,310000,220000.0,240000.0
5,Virginia,2002,180000,100000.0,120000.0


> #### **It is possible to join more than one dataframe using join by passing a list of them**

In [40]:
dl3 = pd.DataFrame(np.linspace(100000,280000,10).reshape(5,2),columns=['NewYork','Illinois'],
                    index= [1998,1999,2001,2002,2003,])
dl3

Unnamed: 0,NewYork,Illinois
1998,100000.0,120000.0
1999,140000.0,160000.0
2001,180000.0,200000.0
2002,220000.0,240000.0
2003,260000.0,280000.0


> **By default, `join` performs a left join on the calling DataFrame's index, which explains the absence of 1998, 2000 and 2003**

In [41]:
dl1.join([dl2,dl3])

Unnamed: 0,Oregon,Virginia,Missouri,Nevada,NewYork,Illinois
1999,180000.0,200000.0,180000.0,200000.0,140000.0,160000.0
2001,220000.0,240000.0,260000.0,280000.0,180000.0,200000.0
2002,260000.0,280000.0,300000.0,320000.0,220000.0,240000.0


> **An outer join does a union of the 3 tables**

In [42]:
dl1.join([dl2,dl3],how='outer')

Unnamed: 0,Oregon,Virginia,Missouri,Nevada,NewYork,Illinois
1998,,,,,100000.0,120000.0
1999,180000.0,200000.0,180000.0,200000.0,140000.0,160000.0
2000,,,220000.0,240000.0,,
2001,220000.0,240000.0,260000.0,280000.0,180000.0,200000.0
2002,260000.0,280000.0,300000.0,320000.0,220000.0,240000.0
2003,,,,,260000.0,280000.0


<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Concatenation, Binding or Stacking</font></b></p>
    </div>
    
---
Numpy has a concatenation function for combining arrays along rows or columns

> #### **Numpy Concat**

In [43]:
n_array = np.arange(10).reshape(5,2)
n_array

array([[0, 1],
       [2, 3],
       [4, 5],
       [6, 7],
       [8, 9]])

In [44]:
np.concatenate([n_array,n_array],axis=1) #axis = 1 - along columns, 0 - along rows

array([[0, 1, 0, 1],
       [2, 3, 2, 3],
       [4, 5, 4, 5],
       [6, 7, 6, 7],
       [8, 9, 8, 9]])

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">1. Series Concatenation in Pandas</font></b></p>
    </div>
    
When concatenating pandas objects such as Series and DataFrames, the labeled axes let you generalize array concatenation further. Things to consider include:

* **If the Series or DataFrames are indexed differently on the other axes, should the result use the union or the intersection of those axes?**
* **Do the groups need to be identifiable in the resulting object?**
* **Does the concatenation axis matter at all?**

> #### **Examples**

In [45]:
series1 = pd.Series(['Adam','Eve'], index=['a','b'])
series2 = pd.Series(['Isiah','Isaac','John'], index=['c','d','e'])
series3 = pd.Series(['Luke','Mark'],index=['f','g'])

> #### **Concatenation in a list with all 3 glues all together**

In [46]:
pd.concat([series1,series2,series3]) #default combination is along rows giving another Series

a     Adam
b      Eve
c    Isiah
d    Isaac
e     John
f     Luke
g     Mark
dtype: object

> #### **Concatenation along columns gives a dataframe**

In [47]:
pd.concat([series1,series2,series3],axis=1)

Unnamed: 0,0,1,2
a,Adam,,
b,Eve,,
c,,Isiah,
d,,Isaac,
e,,John,
f,,,Luke
g,,,Mark


> #### **Combining series1 and series3 along columns**
There is no overlap on the other axis here, so the result is the sorted union (outer join) of the indexes.

In [48]:
pd.concat([series1,series3],axis=1)

Unnamed: 0,0,1
a,Adam,
b,Eve,
f,,Luke
g,,Mark


> #### **Intersecting Series1 and series4 with an inner join**

In [49]:
series4 = pd.Series(['Adam','Eve','Luke'],index=['a','h','i'])
series4

a    Adam
h     Eve
i    Luke
dtype: object

In [50]:
pd.concat([series1,series4],axis=1,join='inner')

Unnamed: 0,0,1
a,Adam,Adam


> #### **Concatenated pieces are not identified in the result. To distinguish them, use the keys argument to create a hierarchical index**

In [51]:
series1 = pd.Series(['Adam','Eve'], index=['a','b'])
pd.concat([series1,series1,series3],keys=['one','two','three']) #along rows; one key per concatenated Series

one    a    Adam
       b     Eve
two    a    Adam
       b     Eve
three  f    Luke
       g    Mark
dtype: object

> #### **In combining the series along the columns i.e axis=1, the keys become the dataframe columns**

In [52]:
series1 = pd.Series(['Adam','Eve'], index=['a','b'])
pd.concat([series1,series4],axis=1,keys=['one','two']) #along columns; one key per concatenated Series

Unnamed: 0,one,two
a,Adam,Adam
b,Eve,
h,,Eve
i,,Luke


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">2. DataFrame Concatenation in Pandas</font></b></p>
    </div>

In [53]:
dtf1 = pd.DataFrame(np.array([['apple','orange','kiwi'],['cherry','pineapple','banana']]).reshape(3,2),
                    index=['a','b','c'],columns=['one','two'])
dtf1                              

Unnamed: 0,one,two
a,apple,orange
b,kiwi,cherry
c,pineapple,banana


In [54]:
dtf2 = pd.DataFrame(np.array([['30cal','17cal'],['10cal','15cal']]),index=['a','c'],columns=['three','four'])
dtf2

Unnamed: 0,three,four
a,30cal,17cal
c,10cal,15cal


In [55]:
pd.concat([dtf1,dtf2],axis=1,keys=['level1','level2'])

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,apple,orange,30cal,17cal
b,kiwi,cherry,,
c,pineapple,banana,10cal,15cal


> #### **If a dictionary of DataFrames is passed instead of a list, the dictionary's keys are used for the keys option**

In [56]:
pd.concat({'level1':dtf1,'level2':dtf2},axis=1)

Unnamed: 0_level_0,level1,level1,level2,level2
Unnamed: 0_level_1,one,two,three,four
a,apple,orange,30cal,17cal
b,kiwi,cherry,,
c,pineapple,banana,10cal,15cal


> #### **Where the row index is not useful in the analysis, the index can be ignored**

In [57]:
dq = pd.DataFrame(np.random.randn(4,3),columns=['a','b','c'])
dq

Unnamed: 0,a,b,c
0,1.07348,-1.311343,-0.565271
1,-1.039575,0.017408,-0.03374
2,1.692465,0.195359,-0.274575
3,-0.918661,-0.149951,-0.759879


In [58]:
dp = pd.DataFrame(np.random.randn(2,4),columns=['b','d','a','c'])
dp

Unnamed: 0,b,d,a,c
0,0.643142,-2.088268,-0.893665,0.65077
1,0.517249,-0.87136,-1.012703,-0.75977


In [59]:
pd.concat([dq,dp],ignore_index=True)

Unnamed: 0,a,b,c,d
0,1.07348,-1.311343,-0.565271,
1,-1.039575,0.017408,-0.03374,
2,1.692465,0.195359,-0.274575,
3,-0.918661,-0.149951,-0.759879,
4,-0.893665,0.643142,0.65077,-2.088268
5,-1.012703,0.517249,-0.75977,-0.87136


<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Combining Data with Overlap</font></b></p>
    </div>

---
This applies when a combination cannot be expressed as a merge or concatenation operation. A typical case is two datasets whose indices overlap in full or in part.

> #### **Combination using Numpy's where**

In [60]:
a = pd.Series([np.nan,2,np.nan,3,4,np.nan],index=['a','b','c','d','e','f'])
a

a    NaN
b    2.0
c    NaN
d    3.0
e    4.0
f    NaN
dtype: float64

In [61]:
b = pd.Series([0,2,4,6,8,np.nan],index=['a','b','c','d','e','f'])
b

a    0.0
b    2.0
c    4.0
d    6.0
e    8.0
f    NaN
dtype: float64

In [62]:
np.where(pd.isnull(a),b,a)

array([ 0.,  2.,  4.,  3.,  4., nan])
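Note that `np.where` returns a plain NumPy array, so the index labels are lost. A minimal sketch of two ways to keep them, assuming the same `a` and `b` (the names `combined` and `filled` are illustrative):

```python
import numpy as np
import pandas as pd

a = pd.Series([np.nan, 2, np.nan, 3, 4, np.nan], index=['a', 'b', 'c', 'd', 'e', 'f'])
b = pd.Series([0, 2, 4, 6, 8, np.nan], index=['a', 'b', 'c', 'd', 'e', 'f'])

# Wrap the ndarray back into a Series to restore the labels
combined = pd.Series(np.where(pd.isnull(a), b, a), index=a.index)

# fillna with another Series aligns on the index and gives the same values here
filled = a.fillna(b)
```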

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">Using combine_first Method in Pandas</font></b></p>
    </div>

There is a **combine_first** method for series and dataframes used to combine two series or dataframes giving preference to the values from the calling object over the values from the other object.

#### **How it works:**
* Indices and columns do not need to match beforehand. The method aligns the values on their indices and columns, and the result covers the union of both
* For each index, if the value in the first object is NaN, it will be replaced by the value from the second object. If the value in the first object is not NaN, it remains unchanged

#### **Purpose:**

To fill missing values in one object(series or dataframe) with the corresponding values from another object. It effectively merges the data while prioritizing the original data. 
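As a small illustration of the alignment behaviour, the sketch below uses two hypothetical Series, `first` and `second`, whose indexes only partly overlap.

```python
import numpy as np
import pandas as pd

# Hypothetical data: 'x' and 'y' are shared, 'z' is only in first, 'w' only in second
first = pd.Series([1.0, np.nan, 3.0], index=['x', 'y', 'z'])
second = pd.Series([10.0, 20.0, 40.0], index=['x', 'y', 'w'])

# Values from first win where present; NaN and missing labels are filled from second
patched = first.combine_first(second)
```

Here `x` keeps the value from `first`, `y` is filled from `second`, and `w` is added because the result covers the union of both indexes.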

> #### **Combine_First Method using Series method**

For Series, combine_first fills the NaN values in the first series with values from the second series where they align.

In [63]:
b[:-2].combine_first(a[2:])

a    0.0
b    2.0
c    4.0
d    6.0
e    4.0
f    NaN
dtype: float64

> #### **Combine First Method using DataFrame method**

For DataFrames, combine_first fills NaN values in the first dataframe with values from the second dataframe aligning on both rows and columns.

In [64]:
da = pd.DataFrame({'a':[1,np.nan,5,np.nan],'b':[np.nan,2,np.nan,7],'c':[2,6,10,14]})
da

Unnamed: 0,a,b,c
0,1.0,,2
1,,2.0,6
2,5.0,,10
3,,7.0,14


In [65]:
db = pd.DataFrame({'a':[4,3,np.nan,5,8],'b':[np.nan,2,4,6,7]})
db

Unnamed: 0,a,b
0,4.0,
1,3.0,2.0
2,,4.0
3,5.0,6.0
4,8.0,7.0


In [66]:
da.combine_first(db)

Unnamed: 0,a,b,c
0,1.0,,2.0
1,3.0,2.0,6.0
2,5.0,4.0,10.0
3,5.0,7.0,14.0
4,8.0,7.0,


<div class= "alert alert-block" style="background-color: orange; border-color: black">
    <p><b><font size="+2" color="black">Reshaping & Pivoting</font></b></p>
    </div>
    
    
There are a couple of fundamental operations for rearranging tabular data. These are also known as **reshape or pivot operations**

<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">Reshaping with Hierarchical Indexing</font></b></p>
    </div>
    
Hierarchical Indexing offers a consistent way to rearrange data in a DataFrame. There are 2 primary approaches:

#### **Stacking**

This rotates the data by pivoting the columns into the rows

> #### **Example on Reshaping a DataFrame**

In [67]:
pop_data = pd.DataFrame(np.linspace(400000,700000,6).reshape((2,3)), index=pd.Index(['kentucky','Alabama'],name='state'),
                       columns=pd.Index(['one','two','three'],name='number'))
pop_data

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
kentucky,400000.0,460000.0,520000.0
Alabama,580000.0,640000.0,700000.0


* **Using the stack method on this data pivots the columns into rows yielding a Series**

In [68]:
result = pop_data.stack()
result

state     number
kentucky  one       400000.0
          two       460000.0
          three     520000.0
Alabama   one       580000.0
          two       640000.0
          three     700000.0
dtype: float64

#### **Unstacking**

This rotates the data by pivoting from the rows back into columns
* **Use the unstack method to rearrange the data back into a DataFrame**

In [69]:
result.unstack()

number,one,two,three
state,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
kentucky,400000.0,460000.0,520000.0
Alabama,580000.0,640000.0,700000.0


By default the innermost level is unstacked, i.e. number is pivoted back to columns. A different level can be unstacked by passing its name or number; in this case, state.

In [70]:
result.unstack('state')

state,kentucky,Alabama
number,Unnamed: 1_level_1,Unnamed: 2_level_1
one,400000.0,580000.0
two,460000.0,640000.0
three,520000.0,700000.0


> #### **Unstacking introduces Missing data**

This is true if the values in the level aren't found in each subgroup

In [71]:
s1 = pd.Series(['ace','bay','case','days'], index=['a','b','c','d'])
s1

a     ace
b     bay
c    case
d    days
dtype: object

In [72]:
s2 = pd.Series(['eggs','feds','gets'],index=['c','d','e'])
s2

c    eggs
d    feds
e    gets
dtype: object

In [73]:
s1s2 = pd.concat([s1,s2],keys=['one','two'])
s1s2

one  a     ace
     b     bay
     c    case
     d    days
two  c    eggs
     d    feds
     e    gets
dtype: object

> #### **Unstack s1s2**

In [74]:
s1s2.unstack()

Unnamed: 0,a,b,c,d,e
one,ace,bay,case,days,
two,,,eggs,feds,gets


> #### **Stacking Filters out missing data**

In [75]:
s1s2.unstack().stack()

one  a     ace
     b     bay
     c    case
     d    days
two  c    eggs
     d    feds
     e    gets
dtype: object

In [76]:
s1s2.unstack().stack(dropna=False)

one  a     ace
     b     bay
     c    case
     d    days
     e     NaN
two  a     NaN
     b     NaN
     c    eggs
     d    feds
     e    gets
dtype: object

> #### **Unstacking a DataFrame makes the level unstacked the lowest level in the result**

In [77]:
df = pd.DataFrame({'top': result,'bottom': (-1*result)},columns=pd.Index(['top','bottom'],name='level'))
df

Unnamed: 0_level_0,level,top,bottom
state,number,Unnamed: 2_level_1,Unnamed: 3_level_1
kentucky,one,400000.0,-400000.0
kentucky,two,460000.0,-460000.0
kentucky,three,520000.0,-520000.0
Alabama,one,580000.0,-580000.0
Alabama,two,640000.0,-640000.0
Alabama,three,700000.0,-700000.0


In [78]:
df.unstack('state')

level,top,top,bottom,bottom
state,kentucky,Alabama,kentucky,Alabama
number,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2
one,400000.0,580000.0,-400000.0,-580000.0
two,460000.0,640000.0,-460000.0,-640000.0
three,520000.0,700000.0,-520000.0,-700000.0


In [79]:
df.unstack('state').stack('level')

Unnamed: 0_level_0,state,Alabama,kentucky
number,level,Unnamed: 2_level_1,Unnamed: 3_level_1
one,bottom,-580000.0,-400000.0
one,top,580000.0,400000.0
two,bottom,-640000.0,-460000.0
two,top,640000.0,460000.0
three,bottom,-700000.0,-520000.0
three,top,700000.0,520000.0


<div style="background-color: black; padding: 5px">
    <p><b><font size="+2" color="white">Creating Hierarchical Indexing with Pivot</font></b></p>
    </div>
    
    
> #### **Transforming Time Series Data**

In [80]:
tdata = pd.read_excel('./data/time_data.xls')
tdata

Unnamed: 0,id,date,item,value
0,0,1959-03-31 00:00:01,realgdp,2710.349
1,1,1959-03-31 00:00:01,infl,0.0
2,2,1959-03-31 00:00:01,unemp,5.8
3,3,1959-06-30 00:01:23,realgdp,2778.801
4,4,1959-06-30 00:01:23,infl,2.34
5,5,1959-06-30 00:01:23,unemp,5.1
6,6,1959-09-30 00:04:42,realgdp,2775.488
7,7,1959-09-30 00:04:42,infl,2.74
8,8,1959-09-30 00:04:42,unemp,5.3
9,9,1959-12-31 00:34:05,realgdp,2785.204


The data above illustrates how such data is stored in a relational database, where date and item would together form the primary key. Data in this long format may not be easy to work with.

A better way to work with such data is to have a DataFrame containing one column per distinct item value, indexed by the timestamps in the date column. DataFrame's pivot method performs exactly this transformation.
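Since the Excel file may not be at hand, here is a minimal sketch of the same transformation on a small in-memory frame. The frame `long_df` is hypothetical; its values mirror a few rows of the table above with the time component dropped.

```python
import pandas as pd

# Hypothetical long-format data: one row per (date, item) pair
long_df = pd.DataFrame({
    'date': pd.to_datetime(['1959-03-31', '1959-03-31', '1959-06-30', '1959-06-30']),
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.0, 2778.801, 2.34],
})

# One row per date, one column per distinct item, cells filled from 'value'
wide_df = long_df.pivot(index='date', columns='item', values='value')
```

Keyword arguments are used here because recent pandas versions no longer accept these arguments positionally.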

In [81]:
pivoted_tdata = tdata.pivot(index='date',columns='item',values='value') # keyword arguments; required in pandas 2.0+
pivoted_tdata.head(10)

item,infl,realgdp,unemp
date,Unnamed: 1_level_1,Unnamed: 2_level_1,Unnamed: 3_level_1
1959-03-31 00:00:01,0.0,2710.349,5.8
1959-06-30 00:01:23,2.34,2778.801,5.1
1959-09-30 00:04:42,2.74,2775.488,5.3
1959-12-31 00:34:05,,2785.204,


The first two values passed (**date and item**) are the columns to be used as the row and column index respectively, followed by an optional value column (**value**) to fill the DataFrame.

>Where there is more than one column of values that needs to be reshaped simultaneously, the last argument 'value' can be omitted to give hierarchical columns

In [82]:
tdata['value2'] = np.random.randn(len(tdata))

In [83]:
tdata

Unnamed: 0,id,date,item,value,value2
0,0,1959-03-31 00:00:01,realgdp,2710.349,-1.19138
1,1,1959-03-31 00:00:01,infl,0.0,-0.723625
2,2,1959-03-31 00:00:01,unemp,5.8,-0.098321
3,3,1959-06-30 00:01:23,realgdp,2778.801,0.855044
4,4,1959-06-30 00:01:23,infl,2.34,-0.672058
5,5,1959-06-30 00:01:23,unemp,5.1,-0.304352
6,6,1959-09-30 00:04:42,realgdp,2775.488,1.315042
7,7,1959-09-30 00:04:42,infl,2.74,0.644506
8,8,1959-09-30 00:04:42,unemp,5.3,0.487363
9,9,1959-12-31 00:34:05,realgdp,2785.204,-0.499954


In [84]:
pivoted_tdata2 = tdata.pivot(index='date',columns='item') # keyword arguments; required in pandas 2.0+
pivoted_tdata2

Unnamed: 0_level_0,id,id,id,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
1959-03-31 00:00:01,1.0,0.0,2.0,0.0,2710.349,5.8,-0.723625,-1.19138,-0.098321
1959-06-30 00:01:23,4.0,3.0,5.0,2.34,2778.801,5.1,-0.672058,0.855044,-0.304352
1959-09-30 00:04:42,7.0,6.0,8.0,2.74,2775.488,5.3,0.644506,1.315042,0.487363
1959-12-31 00:34:05,,9.0,,,2785.204,,,-0.499954,


This is exactly what you get when you set_index with **date and item** and unstack with **item**

In [85]:
unstacked = tdata.set_index(['date','item']).unstack('item')
unstacked

Unnamed: 0_level_0,id,id,id,value,value,value,value2,value2,value2
item,infl,realgdp,unemp,infl,realgdp,unemp,infl,realgdp,unemp
date,Unnamed: 1_level_2,Unnamed: 2_level_2,Unnamed: 3_level_2,Unnamed: 4_level_2,Unnamed: 5_level_2,Unnamed: 6_level_2,Unnamed: 7_level_2,Unnamed: 8_level_2,Unnamed: 9_level_2
1959-03-31 00:00:01,1.0,0.0,2.0,0.0,2710.349,5.8,-0.723625,-1.19138,-0.098321
1959-06-30 00:01:23,4.0,3.0,5.0,2.34,2778.801,5.1,-0.672058,0.855044,-0.304352
1959-09-30 00:04:42,7.0,6.0,8.0,2.74,2775.488,5.3,0.644506,1.315042,0.487363
1959-12-31 00:34:05,,9.0,,,2785.204,,,-0.499954,
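The equivalence can be checked programmatically. This sketch uses a small hypothetical frame, `long_df`, with two value columns rather than the Excel data.

```python
import pandas as pd

# Hypothetical long-format data with two value columns
long_df = pd.DataFrame({
    'date': ['1959-03-31', '1959-03-31', '1959-06-30', '1959-06-30'],
    'item': ['realgdp', 'infl', 'realgdp', 'infl'],
    'value': [2710.349, 0.0, 2778.801, 2.34],
    'value2': [1.0, 2.0, 3.0, 4.0],
})

# Omitting 'values' pivots every remaining column, giving hierarchical columns
pivoted = long_df.pivot(index='date', columns='item')

# The same reshape spelled out as set_index followed by unstack
unstacked = long_df.set_index(['date', 'item']).unstack('item')

same = pivoted.equals(unstacked)
```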
