# Data Wrangling: Join, Combine, and Reshape Data 

In [1]:
# Much of the programming work in data analysis and modeling is spent on data preparation:
# loading, cleaning, transforming, and rearranging.
# Sometimes the way data is stored in files or databases is not the way you need it for a data processing application.
# Many people choose to do ad hoc processing of data from one form to another using a general-purpose programming language
# like Python, Perl, R, or Java, or Unix text-processing tools like sed or awk.


# Combining and Merging Data Sets

In [2]:
# Data contained in pandas objects can be combined in a number of ways.
# The three main ways of combining data are merges, joins, and concatenation.
# This section focuses on the mechanics of joining data sets with pandas.

# Database-style DataFrame merges

In [3]:
# Merge or join operations combine data sets by linking rows using one or more keys. 
# These operations are central to relational databases. 
# The merge function in pandas is the main entry point for using these algorithms on your data. 
# Let's start with a simple example: 
import pandas as pd 
import numpy as np
from pandas import Series, DataFrame

df1 = DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

In [4]:
df1 

Unnamed: 0,key,data1
0,b,0
1,b,1
2,a,2
3,c,3
4,a,4
5,a,5
6,b,6


In [5]:
df2

Unnamed: 0,key,data2
0,a,0
1,b,1
2,d,2


In [6]:
# This is an example of a many-to-one merge:
# df1 has multiple rows labeled 'a' and 'b', whereas df2 has only one row for each value in the key column.
# Calling merge with these objects, we obtain:
pd.merge(df1, df2) 

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0
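
# As a minimal sketch (not part of the original example), the many-to-one assumption can be checked explicitly with merge's validate argument.
# The frame df2_dup is a hypothetical variant of df2 with a repeated key, introduced only for illustration.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# validate='many_to_one' asks merge to confirm that keys are unique in the right frame
checked = pd.merge(df1, df2, on='key', validate='many_to_one')

# Hypothetical right-hand frame in which 'a' appears twice
df2_dup = pd.DataFrame({'key': ['a', 'a', 'b'], 'data2': range(3)})

try:
    pd.merge(df1, df2_dup, on='key', validate='many_to_one')
    many_to_one_ok = True
except pd.errors.MergeError:
    # The repeated key on the right side violates the many-to-one assumption
    many_to_one_ok = False

print(len(checked), many_to_one_ok)
```

# Without validate, the merge with df2_dup would succeed silently and duplicate every matching 'a' row, so the check is a cheap guard when uniqueness matters.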


In [7]:
# Note that I didn't specify which column to join on.
# If that information is not specified, merge uses the overlapping column names as the keys. 
# It's good practice to specify explicitly, though:
pd.merge(df1, df2, on='key')

Unnamed: 0,key,data1,data2
0,b,0,1
1,b,1,1
2,b,6,1
3,a,2,0
4,a,4,0
5,a,5,0
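
# The sketch below illustrates why naming the key explicitly matters when more than one column name overlaps.
# The frames left and right and their 'year' column are hypothetical and not part of the original example.

```python
import pandas as pd

# Hypothetical frames that share two column names: 'key' and 'year'
left = pd.DataFrame({'key': ['a', 'b', 'b'],
                     'year': [2000, 2000, 2001],
                     'data1': range(3)})
right = pd.DataFrame({'key': ['a', 'b', 'b'],
                      'year': [2000, 2001, 2002],
                      'data2': range(3)})

# No `on`: every shared column is a key, so rows must match on key AND year
implicit = pd.merge(left, right)

# Explicit on='key': 'year' is treated as ordinary data and gets the
# default suffixes, producing year_x and year_y
explicit = pd.merge(left, right, on='key')

print(implicit.columns.tolist(), len(implicit))
print(explicit.columns.tolist(), len(explicit))
```

# The two calls return different row counts and different columns, which is easy to miss if the join keys are left implicit.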


In [8]:
# If the column names are different in each object, you can specify them separately: 
df3 = DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})
pd.merge(df3, df4, left_on='lkey', right_on='rkey') 

Unnamed: 0,lkey,data1,rkey,data2
0,b,0,b,1
1,b,1,b,1
2,b,6,b,1
3,a,2,a,0
4,a,4,a,0
5,a,5,a,0
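
# With left_on and right_on, both key columns are kept in the result. One possible cleanup, shown here as a sketch, is to drop the redundant column.
# The names merged and tidy are illustrative only.

```python
import pandas as pd

df3 = pd.DataFrame({'lkey': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df4 = pd.DataFrame({'rkey': ['a', 'b', 'd'], 'data2': range(3)})

merged = pd.merge(df3, df4, left_on='lkey', right_on='rkey')

# In an inner join the two key columns hold identical values on every row
keys_match = (merged['lkey'] == merged['rkey']).all()

# Keep a single key column under a neutral name
tidy = merged.drop(columns='rkey').rename(columns={'lkey': 'key'})

print(keys_match, tidy.columns.tolist())
```

# This shortcut is only safe for inner joins. In a left, right, or outer join one of the two key columns can contain missing values, so dropping it may lose information.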


In [9]:
# You may notice that the 'c' and 'd' values and their associated data are missing from the result.
# By default, merge performs an 'inner' join: the keys in the result are the intersection, that is, the keys found in both tables.
# Other possible options are 'left', 'right', and 'outer'.
# The outer join takes the union of the keys, combining the effect of applying both left and right joins.
# Rows with no match on the other side are filled with NaN, which is why data1 and data2 appear as floats below:
pd.merge(df1, df2, how='outer') 

Unnamed: 0,key,data1,data2
0,b,0.0,1.0
1,b,1.0,1.0
2,b,6.0,1.0
3,a,2.0,0.0
4,a,4.0,0.0
5,a,5.0,0.0
6,c,3.0,
7,d,,2.0
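
# For comparison, here is a minimal sketch of the 'left' and 'right' options mentioned above, using the same df1 and df2.
# The indicator=True argument is an addition not used in the original example. It adds a _merge column recording where each row came from.

```python
import pandas as pd

df1 = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'a', 'b'], 'data1': range(7)})
df2 = pd.DataFrame({'key': ['a', 'b', 'd'], 'data2': range(3)})

# Left join: every df1 row is kept, so 'c' survives with a missing data2
left = pd.merge(df1, df2, on='key', how='left')

# Right join: every df2 key is kept, so 'd' survives with a missing data1
right = pd.merge(df1, df2, on='key', how='right')

# Outer join with an indicator column marking the origin of each row
outer = pd.merge(df1, df2, on='key', how='outer', indicator=True)
counts = outer['_merge'].value_counts()

print(left)
print(right)
print(counts)
```

# The _merge column takes the values 'left_only', 'right_only', and 'both', which makes it straightforward to find the keys that failed to match.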
