# Chapter 32: Joining Dataframes

In [1]:
import pandas as pd
import numpy as np

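# Note the leading spaces in ' George', ' Purple', and ' Ringo'. These strings
# are not equal to their stripped counterparts, which affects the joins below.
df1 = pd.DataFrame({'name': ['John', ' George', 'Ringo'],
                    'color': ['Blue', 'Blue', ' Purple']})

df2 = pd.DataFrame({'name': ['Paul', 'George', ' Ringo'],
                    'carcolor': ['Red', 'Blue', np.nan]},
                   index=[3, 1, 2])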

In [2]:
df1

Unnamed: 0,name,color
0,John,Blue
1,George,Blue
2,Ringo,Purple


In [3]:
df2

Unnamed: 0,name,carcolor
3,Paul,Red
1,George,Blue
2,Ringo,


## 32.1 Adding Rows to Dataframes

- ``pd.concat`` preserves the index values of its inputs, so the resulting dataframe can contain duplicate index values

In [4]:
pd.concat([df2, df2])

Unnamed: 0,name,carcolor
3,Paul,Red
1,George,Blue
2,Ringo,
3,Paul,Red
1,George,Blue
2,Ringo,


In [5]:
# ignore index
pd.concat([df2, df2], ignore_index=True)

Unnamed: 0,name,carcolor
0,Paul,Red
1,George,Blue
2,Ringo,
3,Paul,Red
4,George,Blue
5,Ringo,
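
Duplicate index labels matter mostly when selecting by label. The sketch below rebuilds `df2` so that it runs on its own and shows two ways to spot the duplicates. The variable names (`stacked`, `dupes`, `raised`) are illustrative only.

```python
import numpy as np
import pandas as pd

df2 = pd.DataFrame({'name': ['Paul', 'George', ' Ringo'],
                    'carcolor': ['Red', 'Blue', np.nan]},
                   index=[3, 1, 2])

stacked = pd.concat([df2, df2])

# Label 1 now appears twice, so .loc returns two rows instead of one.
dupes = stacked.loc[1]
has_duplicates = stacked.index.has_duplicates

# verify_integrity=True makes concat raise ValueError on overlapping labels.
try:
    pd.concat([df2, df2], verify_integrity=True)
    raised = False
except ValueError:
    raised = True
```

If unique labels are required and the original labels carry no meaning, `ignore_index=True` as shown above is usually the simpler choice.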


## 32.2 Adding Columns to Dataframes

In [6]:
pd.concat([df1, df2], axis=1)

Unnamed: 0,name,color,name.1,carcolor
0,John,Blue,,
1,George,Blue,George,Blue
2,Ringo,Purple,Ringo,
3,,,Paul,Red


## 32.3 Joins

- The ``.join`` method joins on the index rather than on columns, and performs a left join by default

In [7]:
df1.set_index('name').join(df2.set_index('name'))

name,color,carcolor
John,Blue,
George,Blue,
Ringo,Purple,


- It is often easier to use the ``.merge`` method, which by default joins on the columns the two dataframes have in common (here, ``name``)

In [9]:
df1

Unnamed: 0,name,color
0,John,Blue
1,George,Blue
2,Ringo,Purple


In [10]:
df2

Unnamed: 0,name,carcolor
3,Paul,Red
1,George,Blue
2,Ringo,


In [8]:
# inner join
df1.merge(df2)

Unnamed: 0,name,color,carcolor
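
The inner join is empty because the name strings differ by leading whitespace (`' George'` versus `'George'`, `'Ringo'` versus `' Ringo'`), so no key appears in both dataframes. A minimal sketch of one way to handle this, assuming the whitespace is unwanted, is to strip the key column before merging. The names `left`, `right`, `raw`, and `cleaned` are illustrative only.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['John', ' George', 'Ringo'],
                    'color': ['Blue', 'Blue', ' Purple']})
df2 = pd.DataFrame({'name': ['Paul', 'George', ' Ringo'],
                    'carcolor': ['Red', 'Blue', np.nan]},
                   index=[3, 1, 2])

# ' George' != 'George', so the raw inner join finds no matching keys.
raw = df1.merge(df2)

# Strip surrounding whitespace from the key column on copies of both frames.
left = df1.assign(name=df1['name'].str.strip())
right = df2.assign(name=df2['name'].str.strip())

# Now George and Ringo appear in both frames and survive the inner join.
cleaned = left.merge(right)
```

Using `.assign` leaves `df1` and `df2` untouched, so the remaining cells in this chapter still behave as shown.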


In [11]:
# outer join
df1.merge(df2, how='outer')

Unnamed: 0,name,color,carcolor
0,John,Blue,
1,George,Blue,
2,Ringo,Purple,
3,Paul,,Red
4,George,,Blue
5,Ringo,,


In [12]:
# left join
df1.merge(df2, how='left')

Unnamed: 0,name,color,carcolor
0,John,Blue,
1,George,Blue,
2,Ringo,Purple,


In [13]:
# right join
df1.merge(df2, how='right')

Unnamed: 0,name,color,carcolor
0,Paul,,Red
1,George,,Blue
2,Ringo,,


In [14]:
df1.merge(df2, how='right', left_on='color', right_on='carcolor')

Unnamed: 0,name_x,color,name_y,carcolor
0,,,Paul,Red
1,John,Blue,George,Blue
2,George,Blue,George,Blue
3,,,Ringo,


## 32.4 Join Indicators

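- The ``.merge`` method has an ``indicator`` option that adds a column showing where each row comes from: ``left_only``, ``right_only``, or ``both``. Passing ``indicator=True`` names the column ``_merge``; passing a string, as below, uses that string as the column name.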

In [15]:
df1.merge(df2, how='outer', indicator='True')

Unnamed: 0,name,color,carcolor,True
0,John,Blue,,left_only
1,George,Blue,,left_only
2,Ringo,Purple,,left_only
3,Paul,,Red,right_only
4,George,,Blue,right_only
5,Ringo,,,right_only


## 32.5 Merge Validation

- The ``.merge`` method has a ``validate`` parameter. It raises a ``MergeError`` if the join violates the given constraint.
- The constraint can be '1:1', '1:m', or 'm:1' to ensure that the join keys are indeed one to one, one to many, or many to one ('m:m' is also accepted but performs no check)
- In the earlier merge with ``left_on='color'`` and ``right_on='carcolor'``, the left key ``color`` contains non-unique values and the right key ``carcolor`` contains unique values, so the constraint should be 'm:1'
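
The bullets above can be sketched as follows, rebuilding the two dataframes so the block runs on its own. The first merge uses the constraint that matches the data; the second deliberately uses the wrong one to show the error. The names `ok` and `error` are illustrative.

```python
import numpy as np
import pandas as pd

df1 = pd.DataFrame({'name': ['John', ' George', 'Ringo'],
                    'color': ['Blue', 'Blue', ' Purple']})
df2 = pd.DataFrame({'name': ['Paul', 'George', ' Ringo'],
                    'carcolor': ['Red', 'Blue', np.nan]},
                   index=[3, 1, 2])

# 'Blue' repeats on the left, carcolor values are unique on the right,
# so the many-to-one constraint holds and the merge succeeds.
ok = df1.merge(df2, how='right', left_on='color', right_on='carcolor',
               validate='m:1')

# '1:m' requires unique left keys, but 'Blue' appears twice in df1['color'].
try:
    df1.merge(df2, how='right', left_on='color', right_on='carcolor',
              validate='1:m')
    error = None
except pd.errors.MergeError as exc:
    error = exc
```

After this runs, `error` holds the raised `MergeError`, whose message describes which side of the merge had non-unique keys.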