## Merging 'on'

This works fine as long as we are merging two DataFrames that share a column label, *and* have shared values in that column (e.g., participant numbers 1–5 in the above example). But our data aren't always structured that way. For example, let's re-load the RT data used in the previous chapter, which came from the same participant in two different testing sessions:

In [7]:
sess_1 = pd.read_csv('session_1.csv', index_col='trial')
sess_2 = pd.read_csv('session_2.csv', index_col='trial')
pd.merge(sess_1, sess_2)

Empty DataFrame
Columns: [rt]
Index: []


Merging these produces an empty DataFrame: no rows, only the `rt` column label. Why? Let's look at the inputs:

In [8]:
print(sess_1)
print(sess_2)

          rt
trial       
0      0.988
1      0.753
2      0.949
3      0.824
4      0.262
5      0.803
6      0.376
7      0.496
8      0.235
9      0.336
10     0.645
          rt
trial       
0      0.718
1      0.851
2      0.747
3      0.520
4      0.991
5      0.004
6      0.547
7      0.883
8      0.841
9      0.195
10     0.828


Both inputs are indexed by `trial`, with the same values (0–10). However, they both also have an `rt` column, and the RT values are different for every trial. By default, `pd.merge` joins on the column labels that the two DataFrames share, which here is `rt`; the `trial` index is not used as a key unless we ask for it. Since no RT value appears in both sessions, the inner join finds no matching rows.
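To check which labels `pd.merge` would pick up by default, we can compare the column labels of the two inputs ourselves. The following is a minimal sketch using small made-up DataFrames (`sess_a` and `sess_b` are hypothetical stand-ins for the session files, not the real data), indexed by `trial` in the same way:

```python
import pandas as pd

# Hypothetical toy data standing in for the two session files
sess_a = pd.DataFrame({'trial': [0, 1, 2], 'rt': [0.988, 0.753, 0.949]}).set_index('trial')
sess_b = pd.DataFrame({'trial': [0, 1, 2], 'rt': [0.718, 0.851, 0.747]}).set_index('trial')

# Column labels the two inputs share; with no `on=`, these become the merge keys
shared = sess_a.columns.intersection(sess_b.columns)
print(list(shared))

# Inner join on 'rt' alone: no RT value occurs in both inputs, so no rows survive
result = pd.merge(sess_a, sess_b)
print(len(result))
```

Because `trial` is the index here rather than a column, it does not show up in `shared`.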

We can override this default behaviour by explicitly telling pandas what to merge on, using the `on=` argument; in this case, `trial`. (`on` accepts index level names as well as column labels.)

In [9]:
sess_12 = pd.merge(sess_1, sess_2, on='trial')
sess_12

        rt_x   rt_y
trial
0      0.988  0.718
1      0.753  0.851
2      0.949  0.747
3      0.824  0.520
4      0.262  0.991
5      0.803  0.004
6      0.376  0.547
7      0.496  0.883
8      0.235  0.841
9      0.336  0.195


Note that in this case, the identically-named `rt` columns are given distinct names so that we know where they came from (the `_x` suffix marks the first input, and `_y` the second). We can replace these with more meaningful labels if we like, using the `suffixes=` argument and a list of two suffixes:

In [10]:
sess_12 = pd.merge(sess_1, sess_2, on='trial', suffixes=['_sess_1', '_sess_2'])
sess_12

       rt_sess_1  rt_sess_2
trial
0          0.988      0.718
1          0.753      0.851
2          0.949      0.747
3          0.824      0.520
4          0.262      0.991
5          0.803      0.004
6          0.376      0.547
7          0.496      0.883
8          0.235      0.841
9          0.336      0.195


`pd.merge` can also come to the rescue if you have matching data columns in two inputs, but the column names aren't the same. It's not uncommon for a researcher to introduce small inconsistencies, such as capitalizing a column name in one file but not in another. This happened with the third session from our RT experiment:

In [11]:
sess_3 = pd.read_csv('session_3.csv')
sess_3

   Trial        RT
0      0  0.844168
1      1  0.913048
2      2  0.843295
3      3  0.530306
4      4  0.266715
5      5  0.707006
6      6  0.973193
7      7  0.432562
8      8  0.522106
9      9  0.876626


So when we try to merge this third session with the already-merged other two, we get an error:

In [12]:
pd.merge(sess_12, sess_3)

MergeError: No common columns to perform merge on. Merge options: left_on=None, right_on=None, left_index=False, right_index=False
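When this error appears, it can help to check whether the two inputs have labels that differ only in capitalization. The sketch below is one way to do that, using made-up DataFrames (`left`, `right`, and the `near_matches` helper dictionary are hypothetical names for illustration):

```python
import pandas as pd

# Hypothetical toy data reproducing the label mismatch
left = pd.DataFrame({'trial': [0, 1, 2], 'rt': [0.988, 0.753, 0.949]})
right = pd.DataFrame({'Trial': [0, 1, 2], 'RT': [0.844, 0.913, 0.843]})

# With no shared column labels, the default merge raises MergeError
try:
    pd.merge(left, right)
    merge_failed = False
except pd.errors.MergeError:
    merge_failed = True

# Compare labels case-insensitively to spot near-matches
left_lower = {col.lower(): col for col in left.columns}
near_matches = {
    left_lower[col.lower()]: col
    for col in right.columns
    if col.lower() in left_lower
}
print(near_matches)
```

Each key in `near_matches` is a label from the left input, and its value is the differently-capitalized label from the right input.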

So we need to tell pandas which columns in each input to merge on, with `left_on` referring to the left (first) input, and `right_on` referring to the right (second) input:

In [None]:
pd.merge(sess_12, sess_3, left_on='trial', right_on='Trial')

Note that this matches data between the two inputs, but keeps both columns.
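As a minimal sketch of what this looks like, and of two ways to tidy up afterwards, consider the made-up DataFrames below. The names `left`, `right`, and `rt_sess_3` are hypothetical, and `trial` is kept as an ordinary column here (unlike `sess_12`, where it is the index) to keep the example simple:

```python
import pandas as pd

# Hypothetical toy data; the labels mirror the mismatch described above
left = pd.DataFrame({'trial': [0, 1, 2], 'rt_sess_1': [0.988, 0.753, 0.949]})
right = pd.DataFrame({'Trial': [0, 1, 2], 'RT': [0.844, 0.913, 0.843]})

# Differently-named key columns: both are kept in the result
merged = pd.merge(left, right, left_on='trial', right_on='Trial')
print(merged.columns.tolist())

# Option 1: drop the redundant key column after merging
tidy = merged.drop(columns='Trial')

# Option 2: rename the right input's columns first, then merge with `on=`
right_renamed = right.rename(columns={'Trial': 'trial', 'RT': 'rt_sess_3'})
alt = pd.merge(left, right_renamed, on='trial')
print(alt.columns.tolist())
```

Renaming before merging has the side benefit of fixing the inconsistent labels at the source, so later steps don't need to remember which capitalization each session used.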