In [12]:
import numpy as np
import pandas as pd
import matplotlib
import matplotlib.pyplot as plt
import matplotlib_inline.backend_inline
matplotlib_inline.backend_inline.set_matplotlib_formats('svg') # display figures in vector format
plt.rcParams.update({'font.size':12}) # set global font size

# DATA MANIPULATION IN PANDAS PART 3

Most interesting studies of data come from combining different data sources. These operations include, but are not limited to:
  - Concatenation of two different datasets
  - Complicated database-style joins and merges that correctly handle overlaps

Pandas comes with a variety of functions and methods that make this sort of data wrangling fast and straightforward. 
  

## COMBINING DATASETS

### **1. CONCAT**

In [2]:
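# Example - let's define a helper that builds the small DataFrames used in these examples
def make_df(cols, ind):
    """Return a DataFrame whose cells are the column name followed by the row label."""
    # each column c holds the strings c+i for every row label i
    data = {c: [str(c) + str(i) for i in ind]
            for c in cols}
    return pd.DataFrame(data, ind)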

In [3]:
#Example DataFrame
make_df('ABC',range(3))

    A   B   C
0  A0  B0  C0
1  A1  B1  C1
2  A2  B2  C2


### Using Pandas pd.concat()
pd.concat() can be used for simple concatenation of Series or DataFrame objects.


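**Syntax**

`pd.concat(objs, axis=0, join='outer', ignore_index=False, keys=None, levels=None, names=None, verify_integrity=False, sort=False, copy=True)`

Note: the `join_axes` argument that appears in older versions of this signature was removed in pandas 1.0.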

* **Example - Using Series**

In [4]:
series1 = pd.Series(['A','B','C'],index=[1,2,3])
series2 = pd.Series(['D','E','F'],index=[4,5,6])
pd.concat([series1,series2]) 

1    A
2    B
3    C
4    D
5    E
6    F
dtype: object

**By default, concatenation of DataFrames also takes place row-wise (`axis=0`)**

* **Example - For a DataFrame**

In [5]:
df1 = make_df('AB',[1,2])
df2 = make_df('AB',[3,4])
print(df1,"\n"); print(df2,"\n"); print(pd.concat([df1,df2]))

    A   B
1  A1  B1
2  A2  B2 

    A   B
3  A3  B3
4  A4  B4 

    A   B
1  A1  B1
2  A2  B2
3  A3  B3
4  A4  B4


### Specifying the axis along which concatenation will take place

- **Example**
   - `axis=0` concatenates row-wise (rows are stacked), while `axis=1` concatenates column-wise (columns are placed side by side)

In [6]:
df3 = make_df('AB',[0,1])
df4 = make_df('CD',[0,1])
print(df3); print(df4); print(pd.concat([df3,df4], axis=1))

    A   B
0  A0  B0
1  A1  B1
    C   D
0  C0  D0
1  C1  D1
    A   B   C   D
0  A0  B0  C0  D0
1  A1  B1  C1  D1
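
The example above uses two frames with identical row labels. As a minimal sketch of what happens when the labels only partly overlap, the snippet below builds two hypothetical frames (`left` and `right` are illustrative names, not part of the original notebook) and concatenates them with `axis=1`. Rows are aligned on the index, and with the default `join='outer'` any label missing from one input is filled with NaN.

```python
import pandas as pd

# Two small frames whose row labels only partly overlap (only label 1 is shared).
# The names `left` and `right` are illustrative assumptions.
left = pd.DataFrame({'A': ['A0', 'A1']}, index=[0, 1])
right = pd.DataFrame({'C': ['C1', 'C2']}, index=[1, 2])

# axis=1 places the columns side by side and aligns rows on the index;
# labels that are missing from one input produce NaN in its columns.
side_by_side = pd.concat([left, right], axis=1)
print(side_by_side)
```

Only the row labelled 1 has values in both columns; rows 0 and 2 each contain one NaN.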


### Dealing with duplicate Indices
* Pandas preserves indices even if the result will have duplicate indices
* **Example**

In [7]:
x = make_df('AB',[0,1])
y = make_df('AB',[2,3])

#make duplicate indices
y.index = x.index
print(x); print(y); print(pd.concat([x,y]))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
0  A2  B2
1  A3  B3


**Observe the duplicate indices in the concatenated DataFrame**
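
Duplicate labels matter because label-based selection is no longer one-to-one. The sketch below rebuilds `x` and `y` directly (without the `make_df` helper, so that it runs on its own) and shows that a single label now selects two rows. The variable names `combined` and `rows_for_label_0` are illustrative.

```python
import pandas as pd

# Rebuild the two frames from the example above, both indexed 0 and 1.
x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

combined = pd.concat([x, y])

# The index now contains each label twice.
print(combined.index.is_unique)  # False

# Selecting label 0 returns every row carrying that label (here, two rows).
rows_for_label_0 = combined.loc[0]
print(rows_for_label_0)
```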

* **Catching the repeat indices as an error**

To verify that the indices in the result of pd.concat() do not overlap, you can set the **verify_integrity** flag to `True`. The concatenation will then raise an exception if there are duplicate indices.

In [8]:
try:
    pd.concat([x,y], verify_integrity=True)
except ValueError as e:
    print("ValueError:", e)

ValueError: Indexes have overlapping values: Int64Index([0, 1], dtype='int64')


* **Ignoring the overlapping indices**

When the index itself does not matter, it can simply be ignored. Setting the **ignore_index** flag to `True` makes the operation create a new integer index for the result.

* **Example**

In [9]:
print(x); print(y); print(pd.concat([x,y], ignore_index=True))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
    A   B
0  A0  B0
1  A1  B1
2  A2  B2
3  A3  B3


* **Adding MultiIndex Keys** 

Another approach is to use the keys option to specify a label for each data source. The result is a hierarchically indexed DataFrame containing the data:

In [10]:
print(x); print(y); print(pd.concat([x,y], keys=['x', 'y']))

    A   B
0  A0  B0
1  A1  B1
    A   B
0  A2  B2
1  A3  B3
      A   B
x 0  A0  B0
  1  A1  B1
y 0  A2  B2
  1  A3  B3


As shown above, this gives a DataFrame with a MultiIndex, which can be explored using hierarchical indexing.
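
As a minimal sketch of that kind of exploration, the snippet below rebuilds `x` and `y` so that it runs on its own, then selects data by source label. The level names `'source'` and `'row'`, passed through the `names` parameter, are illustrative choices rather than anything the original example sets.

```python
import pandas as pd

x = pd.DataFrame({'A': ['A0', 'A1'], 'B': ['B0', 'B1']}, index=[0, 1])
y = pd.DataFrame({'A': ['A2', 'A3'], 'B': ['B2', 'B3']}, index=[0, 1])

# keys labels each input; names (an illustrative choice) labels the index levels.
labelled = pd.concat([x, y], keys=['x', 'y'], names=['source', 'row'])

# Select every row that came from source 'y'; the inner index is kept.
from_y = labelled.loc['y']
print(from_y)

# Select a single cell by (source, row) pair and column.
single = labelled.loc[('y', 1), 'A']
print(single)  # A3

# Move the source label out of the index into an ordinary column.
flat = labelled.reset_index(level='source')
print(flat)
```

Keeping the source as an index level avoids the duplicate-label problem while still recording where each row came from.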

* **Concatenation with Joins**

The previous examples showed concatenation between DataFrames with shared column names.

   - But real-world data from different sources might have different sets of column names.

pd.concat() has several options to handle this.

**Example - Outer Join**

In [13]:
df5 = make_df('ABC',[1,2])
df6 = make_df('BCD',[3,4])
print(df5); print(df6); print(pd.concat([df5,df6]))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
     A   B   C    D
1   A1  B1  C1  NaN
2   A2  B2  C2  NaN
3  NaN  B3  C3   D3
4  NaN  B4  C4   D4


By default, the join is a union of the input columns, **an outer join**, and entries for which no data is available are automatically filled with NaN.
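
After an outer join it is often useful to check where the NaN values ended up. The sketch below rebuilds `df5` and `df6` without the helper so that it is self-contained, counts the missing entries per column, and then fills them. The `'missing'` placeholder string is a hypothetical choice for illustration only.

```python
import pandas as pd

df5 = pd.DataFrame({'A': ['A1', 'A2'], 'B': ['B1', 'B2'], 'C': ['C1', 'C2']},
                   index=[1, 2])
df6 = pd.DataFrame({'B': ['B3', 'B4'], 'C': ['C3', 'C4'], 'D': ['D3', 'D4']},
                   index=[3, 4])

# Default join='outer': the result has the union of the columns.
outer = pd.concat([df5, df6])

# Count the NaN entries that the outer join introduced in each column.
missing_per_column = outer.isna().sum()
print(missing_per_column)

# Optionally replace the NaN values; 'missing' is just a placeholder label.
filled = outer.fillna('missing')
print(filled)
```

Columns `A` and `D` each contain two missing entries because each exists in only one of the inputs, while the shared columns `B` and `C` are complete.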

* To change this default to an intersection of the columns, **an inner join**, set the `join` parameter of pd.concat() to `'inner'`. Older pandas releases also offered a `join_axes` parameter, which was removed in pandas 1.0.

**Example - Inner join**

In [14]:
print(df5); print(df6); print(pd.concat([df5,df6],join='inner'))

    A   B   C
1  A1  B1  C1
2  A2  B2  C2
    B   C   D
3  B3  C3  D3
4  B4  C4  D4
    B   C
1  B1  C1
2  B2  C2
3  B3  C3
4  B4  C4


**Example - Joining along axes**